27 November 2012, Security Seminar BIG DATA, Police Academy of the Czech Republic in Prague

RNDr. Jakub Lokoč, Ph.D.
Siret Research Group (www.siret.cz)
Department of Software Engineering
Faculty of Mathematics and Physics
Charles University in Prague

Statistics for 2011 (source: http://royal.pingdom.com)
- 2.1 billion: Internet users worldwide
- 3.146 billion: email accounts worldwide
- 800+ million: users on Facebook
- 555 million: websites (+300 million in 2011)
- 1 trillion: video playbacks on YouTube
- 48 hours: amount of video uploaded to YouTube every minute
MM data
- 100 billion: estimated number of photos on Facebook
- 4.5 million: photos uploaded to Flickr each day

- Storage
- Scalability
- Searching
- Security
- Accessibility

Text-based techniques
- Advantage: scalable retrieval by inverted files
- Problem: missing or misleading annotations
Content-based techniques
- Advantage: no annotation needed, visual similarity
- Problem: slow retrieval for complex similarity models
Hybrid techniques
- Text-based query + content-based reranking/exploration
- Content-based query + text-based filtering
- Adapting content-based data for inverted files
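The first hybrid scheme (text-based query + content-based reranking) can be sketched as follows. This is only an illustrative sketch: the function names, the toy annotations, the two-dimensional descriptors and the choice of Euclidean distance are assumptions, not part of the talk.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hybrid_search(keyword, example, annotations, descriptors, k):
    """Text-based candidate selection followed by content-based reranking."""
    # Step 1 (text-based): cheap candidate selection by keyword.
    candidates = [obj_id for obj_id, kws in annotations.items() if keyword in kws]
    # Step 2 (content-based): only the candidates are compared to the example.
    candidates.sort(key=lambda obj_id: euclidean(example, descriptors[obj_id]))
    return candidates[:k]

# Hypothetical toy data: keywords and 2-D "visual" descriptors per image.
annotations = {
    "img1": {"jaguar", "car"},
    "img2": {"jaguar", "animal"},
    "img3": {"car"},
    "img4": {"jaguar"},
}
descriptors = {
    "img1": [0.9, 0.1],
    "img2": [0.1, 0.9],
    "img3": [0.8, 0.2],
    "img4": [0.2, 0.8],
}

# The keyword "jaguar" is ambiguous; the example descriptor resolves it visually.
top = hybrid_search("jaguar", [0.0, 1.0], annotations, descriptors, k=2)
print(top)
```

The expensive distance is evaluated only for the keyword matches, not for the whole database, which is the point of the hybrid approach.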

- Document vector model
- User issues a keyword query (Google, Bing, ...)
- Efficient query evaluation using inverted files
Problems
- Manual annotation is feasible only for small data
- Subjectivity of the annotation
- Homonyms, etc.
Automatic annotation
- Surrounding text + linguistic methods + ontologies
- Content-based keyword assignment
- Still a lot of problems to solve
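Keyword query evaluation with an inverted file can be illustrated by a minimal sketch. The function names and the toy annotations are hypothetical, and a real engine would additionally weight and rank the matches (for example in the vector model).

```python
from collections import defaultdict

def build_inverted_index(annotations):
    """Map each keyword to the set of object ids annotated with it."""
    index = defaultdict(set)
    for obj_id, keywords in annotations.items():
        for kw in keywords:
            index[kw.lower()].add(obj_id)
    return index

def keyword_query(index, keywords):
    """Return ids of objects annotated with ALL query keywords."""
    postings = [index.get(kw.lower(), set()) for kw in keywords]
    if not postings:
        return set()
    # Intersect starting from the shortest posting list to keep the work small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result

# Hypothetical toy annotations (image id -> keywords).
annotations = {
    "img1": ["jaguar", "car"],
    "img2": ["jaguar", "animal", "jungle"],
    "img3": ["car", "street"],
}
index = build_inverted_index(annotations)

# "jaguar" is a homonym: the query matches both the car and the animal.
print(sorted(keyword_query(index, ["jaguar"])))
print(sorted(keyword_query(index, ["jaguar", "car"])))
```

Only the posting lists of the query keywords are touched, which is why this style of retrieval scales well; the homonym in the first query shows the annotation problem listed above.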

Text-based retrieval

- All objects are transformed into a similarity model
  - Objects are represented by descriptors (histograms, signatures)
  - Descriptors are compared by a distance measure d (Lp, SQFD, EMD)
- User issues an example object as a query q
- Objects x are sorted according to the visual similarity d(q, x)
[Figure: feature extraction from the query object and from the database objects, followed by similarity evaluation]
How to solve the efficiency problem?
- Hybrid techniques: not the whole DB is searched in the content-based way
- Distance-based indexes or filter-and-refine methods
- Distributed architectures needed (storage, throughput, ...)
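The query-by-example model above amounts to ranking all descriptors by d(q, x). A minimal sketch with an Lp distance and a sequential scan follows; the function names and the toy histograms are assumptions made for illustration, and SQFD or EMD could be plugged in as `dist` instead.

```python
def lp_distance(x, y, p=2.0):
    """Minkowski (Lp) distance between two equal-length descriptors."""
    if len(x) != len(y):
        raise ValueError("descriptors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn_sequential(query, database, k, dist=lp_distance):
    """Rank every object by d(q, x) and return the k most similar ones."""
    scored = [(dist(query, desc), obj_id) for obj_id, desc in database.items()]
    scored.sort()
    return [(obj_id, d) for d, obj_id in scored[:k]]

# Hypothetical 3-bin color histograms.
database = {
    "a": [0.5, 0.5, 0.0],
    "b": [0.0, 0.5, 0.5],
    "c": [0.4, 0.6, 0.0],
}
query = [0.5, 0.5, 0.0]

result = knn_sequential(query, database, k=2)
print(result)
```

The scan computes one distance per database object, which is exactly the efficiency problem the slide asks about.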

Hybrid techniques reranking page 1

Hybrid techniques reranking page 2

Hybrid techniques exploration
- J. Lokoč, T. Grošup, T. Skopal: Image Exploration using Online Feature Extraction and Reranking. ICMR 2012, Hong Kong, China, ACM
- J. Lokoč, T. Grošup, T. Skopal: SIR: The Smart Image Retrieval Engine. SISAP 2012, Toronto, Canada, Springer

When a distance measure is a metric, we can employ metric indexes for fast query processing
- Ball partitioning: M-Tree, PM-Tree, LoC
- Hyperplane partitioning: GNAT, M-Index
- Mapping methods: LAESA, Omni family
References
- P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006
- J. Lokoč, P. Čech, J. Novák, T. Skopal: Cut-region: A Compact Building Block For Hierarchical Metric Indexing. SISAP 2012, Toronto, Canada, Springer
- D. Novak, M. Batko, P. Zezula: Metric Index: An efficient and scalable solution for precise and approximate similarity search. Information Systems, 2011, Elsevier
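Why the metric property helps can be shown with a small pivot-table sketch in the spirit of the mapping methods (LAESA-like filtering). It is not the code of any of the listed indexes; the function names, the pivots and the toy points are assumptions. The triangle inequality gives the lower bound |d(q, p) - d(o, p)| <= d(q, o), so an object whose lower bound exceeds the query radius can be discarded without computing d(q, o).

```python
import math

def build_pivot_table(objects, pivots, dist):
    """Precompute d(o, p) for every object o and every pivot p."""
    return [[dist(o, p) for p in pivots] for o in objects]

def range_query(query, radius, objects, pivots, table, dist):
    """Return (results, number of distance computations to database objects).

    Assumes at least one pivot.
    """
    q_to_pivots = [dist(query, p) for p in pivots]
    results, computed = [], 0
    for obj, row in zip(objects, table):
        # Lower bound on d(q, obj) implied by the triangle inequality.
        lower_bound = max(abs(qp - op) for qp, op in zip(q_to_pivots, row))
        if lower_bound > radius:
            continue  # safely filtered out, no distance computation needed
        computed += 1
        if dist(query, obj) <= radius:
            results.append(obj)
    return results, computed

# Hypothetical 2-D points under the Euclidean metric.
objects = [(0, 0), (1, 0), (5, 5), (9, 9), (10, 10)]
pivots = [(0, 0), (10, 10)]
table = build_pivot_table(objects, pivots, math.dist)

results, computed = range_query((1, 1), 1.5, objects, pivots, table, math.dist)
print(results, "distance computations:", computed, "of", len(objects))
```

In this toy run only two of the five objects need a real distance computation. The filtering is exact (no result is lost) precisely because the distance is a metric.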

Efficiency depends mainly on the distance distribution in the distance space
- Indicator of data indexability: intrinsic dimensionality
  - $\mathrm{idim} = \mu^2 / (2\sigma^2)$, where $\mu$ is the mean and $\sigma^2$ the variance of the distance distribution
- High idim = bad indexability (curse of dimensionality)
[Figure: points o_1, o_2, p_1, p_2 and query q]
Reference
- E. Chavez, G. Navarro, R. Baeza-Yates, J. L. Marroquin: Searching in Metric Spaces. ACM Computing Surveys, 2001
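The intrinsic dimensionality can be estimated directly from pairwise distances. The sketch below is illustrative only: the function names, the uniform random data and the sample sizes are assumptions, and in practice the distances are usually taken from a sample of the database rather than from all pairs.

```python
import itertools
import math
import random
import statistics

def intrinsic_dimensionality(objects, dist=math.dist):
    """Estimate idim = mean^2 / (2 * variance) of the pairwise distances.

    Assumes the distances are not all identical (variance > 0).
    """
    distances = [dist(a, b) for a, b in itertools.combinations(objects, 2)]
    mu = statistics.fmean(distances)
    var = statistics.pvariance(distances, mu)
    return mu * mu / (2.0 * var)

def random_vectors(n, dim, rng):
    """Hypothetical test data: n uniform random vectors in [0, 1]^dim."""
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n)]

rng = random.Random(0)
idim_low = intrinsic_dimensionality(random_vectors(100, 2, rng))
idim_high = intrinsic_dimensionality(random_vectors(100, 32, rng))
print(f"idim of 2-D uniform data:  {idim_low:.1f}")
print(f"idim of 32-D uniform data: {idim_high:.1f}")
```

For the higher-dimensional data the distances concentrate around their mean, so the variance shrinks relative to the mean and idim grows, which is the behavior the slide associates with bad indexability.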

- Relaxing precision
  - Approximate search
  - Distance space transformation
  - Synergistic modeling
- Distributed computing (brute force)
  - Peer-to-peer architecture
  - Parallel processing on local nodes

Based on various ideas
- Early termination for good results
- Reducing query radius
- When time elapses
- Accessing % of DB
- Also distance modifications
Reference
- P. Zezula, G. Amato, V. Dohnal, M. Batko: Similarity Search: The Metric Space Approach. Springer, 2006
However, for fast retrieval, the quality deteriorates rapidly
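One of the ideas above, accessing only a percentage of the DB, can be sketched as follows. This is a simplified illustration and not a specific published method: the function name, the single pivot, the ordering heuristic (visit first the objects whose pivot distance is closest to the query's) and the random data are all assumptions.

```python
import math
import random

def approximate_knn(query, objects, k, fraction, pivot):
    """kNN that computes true query distances for only a fraction of the DB."""
    dq = math.dist(query, pivot)
    # Cheap visiting order; a real index would have d(o, pivot) precomputed.
    ordered = sorted(objects, key=lambda o: abs(math.dist(o, pivot) - dq))
    budget = min(len(ordered), max(k, math.ceil(fraction * len(ordered))))
    visited = ordered[:budget]  # stop once the budget is spent
    return sorted(visited, key=lambda o: math.dist(query, o))[:k]

rng = random.Random(1)
objects = [(rng.random(), rng.random()) for _ in range(1000)]
query, pivot = (0.5, 0.5), (0.0, 0.0)

exact = approximate_knn(query, objects, 10, 1.0, pivot)    # visits everything
approx = approximate_knn(query, objects, 10, 0.1, pivot)   # visits 10 % of DB
recall = len(set(exact) & set(approx)) / len(exact)
print(f"recall with 10 % of the DB accessed: {recall:.2f}")
```

Lowering `fraction` makes the query cheaper but lets true neighbors slip outside the visited part, which is the quality trade-off noted on the slide.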

Nonlinear transformations of the distance space
- Monotonic transformation = same similarity ordering
- Problems with metric properties
  - If $t(x) = x^2$, then $2 + 2 \geq 4$ but $2^2 + 2^2 < 4^2$
- Approximate search with MAMs (metric access methods)
References
- T. Skopal: Unified framework for fast exact and approximate search in dissimilarity spaces. ACM Transactions on Database Systems, 2007
- T. Skopal, J. Lokoč: NM-tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces. DEXA 2008, Turin, Italy, LNCS 5181, Springer
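The squaring example can be checked mechanically. The short sketch below (helper names are illustrative assumptions) confirms both claims: a monotonic transformation keeps the similarity ordering, but it can break the triangle inequality that metric indexes rely on.

```python
def t(x):
    """The transformation from the slide: t(x) = x^2 (monotonic for x >= 0)."""
    return x ** 2

def violates_triangle(a, b, c):
    """True if the three pairwise distances break the triangle inequality."""
    x, y, z = sorted((a, b, c))
    return x + y < z

def ranking(distances):
    """Indexes of the objects sorted from most to least similar."""
    return sorted(range(len(distances)), key=lambda i: distances[i])

# Distances of four hypothetical objects to a query.
distances = [4, 1, 3, 2]
same_order = ranking(distances) == ranking([t(d) for d in distances])

# The triangle from the slide: sides 2, 2, 4 before and after the transformation.
before = violates_triangle(2, 2, 4)
after = violates_triangle(t(2), t(2), t(4))

print("ordering preserved:", same_order)
print("triangle inequality violated before/after:", before, after)
```

Because the ordering is unchanged, query results stay the same under a sequential scan; the violated triangle inequality means that metric filtering on the transformed distances may discard true results, hence approximate search.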

Design an indexable space (not only precision)
- Join the world of the domain experts and focus also on idim
- Many factors influence idim
  - Extracted features
  - Sampled points
  - Quantization
  - Clustering
  - Similarity measure
  - Linear combinations
  - Inner parameters
[Figure label: Indexable space]
- Let us also remember the MAP graphs
References
- Ch. Beecks, J. Lokoč, T. Seidl, T. Skopal: Indexing the Signature Quadratic Form Distance for Efficient Content-Based Multimedia Retrieval. ACM ICMR 2011, Trento, Italy, ACM
- J. Lokoč, Ch. Beecks, T. Seidl, T. Skopal: Parameterized Earth Mover's Distance for Efficient Metric Space Indexing. SISAP 2011, Lipari, Italy, ACM

Peer-to-peer architecture
- Chord protocol (efficient routing)
- M-Chord, M-Index
  - Map objects from U to the real domain R
  - Use the Chord protocol for object distribution
  - A query causes interval queries; the results are merged
References
- D. Novak, P. Zezula: M-Chord: a scalable distributed similarity search structure. InfoScale 2006, ACM
- D. Novak, M. Batko, P. Zezula: Large-scale similarity data management with distributed Metric Index. Information Processing & Management
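The three steps (map to a real domain, distribute by key, answer a query by interval queries) can be sketched with an iDistance-style mapping, key = i * c + d(o, p_i) for the closest pivot p_i. This is a simplified, single-process illustration and not the M-Chord or M-Index implementation: Chord routing is replaced by a plain list of nodes, and all names, the pivots, the constant c and the grid data are assumptions.

```python
import math

def build_keys(objects, pivots, c):
    """Map each object to a real key; c must exceed every possible distance."""
    keyed = []
    for o in objects:
        i, d = min(((i, math.dist(o, p)) for i, p in enumerate(pivots)),
                   key=lambda pair: pair[1])
        keyed.append((i * c + d, o))
    keyed.sort(key=lambda entry: entry[0])
    return keyed

def split_among_nodes(keyed, n_nodes):
    """Give each node a contiguous slice of the key space (non-empty input)."""
    size = math.ceil(len(keyed) / n_nodes)
    return [keyed[i:i + size] for i in range(0, len(keyed), size)]

def range_query(query, radius, nodes, pivots, c):
    """One interval query per pivot; partial results are merged."""
    results, contacted = [], set()
    for i, p in enumerate(pivots):
        dq = math.dist(query, p)
        lo = i * c + max(0.0, dq - radius)
        hi = i * c + dq + radius
        upper = (i + 1) * c  # keys of cluster i stay below this bound
        for node_id, node in enumerate(nodes):
            if node[-1][0] < lo or node[0][0] > hi:
                continue  # this node's key range misses the interval
            contacted.add(node_id)
            for key, o in node:
                if lo <= key <= hi and key < upper and math.dist(query, o) <= radius:
                    results.append(o)
    return results, contacted

# Hypothetical data: a 10 x 10 grid, three pivots, four nodes.
objects = [(x, y) for x in range(10) for y in range(10)]
pivots = [(0, 0), (9, 9), (0, 9)]
c = 20.0  # larger than the maximum distance in the grid (about 12.7)
nodes = split_among_nodes(build_keys(objects, pivots, c), 4)

results, contacted = range_query((2.0, 2.0), 1.5, nodes, pivots, c)
print(sorted(results))
print("nodes contacted:", sorted(contacted), "of", len(nodes))
```

The interval for each pivot follows from the triangle inequality, so no qualifying object is missed, and only the nodes whose key range overlaps an interval have to be contacted.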

- Synergistic modeling
- Distance modifications
- Distributed index
  - Approximate search: limit routing
- Local node index
  - Approximate search in local nodes
  - Parallel processing

Any questions?