Sergii Shelpuk - "Efficient search for similar objects on..."

Upload: tanya-denisyuk

Post on 13-Jan-2017

TRANSCRIPT

Efficient Similarity Search on Big Data
with office laptop
Sergii Shelpuk
Head of Data Science, V.I.Tech

The Problem
You have a database of 30M patients with all their medical records. Each patient is described by 250K binary features.

You need a system for finding the N most similar patients to a given one.
Jesus Christ, it's Big Data, get Hadoop!

Can we do better?
Two main ideas:
- we don't need the meaning of each feature, we only care about the similarity of the patients;
- we don't want to compare very different patients, we want to compare only the most similar ones.

Step 1: Reduce dimensionality
- Decrease the dimensionality of the data while preserving similarities
- Locality-sensitive hashing and minhashing
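
A minimal sketch of minhashing, not the speaker's implementation. It assumes each patient is stored as the set of indices of their active (1-valued) features. The helper names, the 2^31 - 1 modulus and the signature length are illustrative choices. The fraction of positions where two signatures agree estimates the Jaccard similarity of the original feature sets, so a 250K-dimensional record can be compared through a few hundred integers.

```python
import random

# Mersenne prime 2**31 - 1; comfortably larger than the 250K feature indices.
PRIME = 2_147_483_647


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions of the form h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(active_features, hash_params):
    """One minimum per hash function, taken over the patient's active feature ids.

    Assumes the patient has at least one active feature.
    """
    return [min((a * f + b) % PRIME for f in active_features)
            for a, b in hash_params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Reference value: |A intersect B| / |A union B|."""
    return len(set_a & set_b) / len(set_a | set_b)
```

With a few hundred hash functions the estimate is usually within a few hundredths of the exact value. More hash functions lower the error at the cost of a longer signature.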

K-Means clustering
K-Means clustering groups similar patients into one group
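
A small pure-Python sketch of K-Means (Lloyd's algorithm), for illustration only. The talk does not say how the reduced patient vectors are represented or which distance is used. This version assumes plain numeric vectors and squared Euclidean distance, and the function names are hypothetical.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, k, iterations=20, seed=0):
    """Return (centroids, assignments) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        assignments = [nearest_centroid(v, centroids) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    # Final assignment, consistent with the final centroids.
    assignments = [nearest_centroid(v, centroids) for v in vectors]
    return centroids, assignments
```

At the scale in the talk (30M patients, tens of thousands of clusters) a plain loop like this would be far too slow. It only shows the grouping step.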

Step 2: Group similar
- Group similar patients and store the groups as separate files
- Store the centroids of each cluster in a separate file, too
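
One possible on-disk layout for this step, as a sketch. The file names (`centroids.json`, `cluster_00000.json`, ...) and the JSON format are assumptions made for illustration. The talk does not describe the actual file format.

```python
import json
from pathlib import Path


def write_index(index_dir, patient_ids, vectors, assignments, centroids):
    """Write one file per cluster plus a single centroid file.

    assignments[i] is the cluster id of patient_ids[i]; centroids[c] is the
    centroid of cluster c. Returns the sorted ids of the clusters written.
    """
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)

    # All centroids go into one small file that is read on every query.
    (index_dir / "centroids.json").write_text(json.dumps(centroids))

    # Group patients by cluster id.
    clusters = {}
    for pid, vec, cluster_id in zip(patient_ids, vectors, assignments):
        clusters.setdefault(cluster_id, []).append({"id": pid, "vector": vec})

    # One file per non-empty cluster, so a query only reads the file it needs.
    for cluster_id, members in clusters.items():
        path = index_dir / f"cluster_{cluster_id:05d}.json"
        path.write_text(json.dumps(members))
    return sorted(clusters)
```

Keeping each cluster in its own file means a query touches the centroid file and exactly one cluster file, and the rest of the data stays on disk.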

Approach
To find N similar patients:
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load the centroid file
4. Compare the patient to every centroid
5. Load the cluster file of the closest centroid
6. Compare the patient with the patients in the cluster
7. Show the top N similar
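
The lookup steps above can be sketched as follows. This is a hedged illustration, not the speaker's code. It assumes the query patient has already been reduced to a numeric vector and that the index is stored as a hypothetical `centroids.json` plus one `cluster_NNNNN.json` per cluster, each holding a list of `{"id", "vector"}` records. Squared Euclidean distance stands in for whatever similarity measure the real system uses.

```python
import heapq
import json
from pathlib import Path


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def find_similar(query_vector, index_dir, n):
    """Return the n closest patients as (distance, patient_id), nearest first."""
    index_dir = Path(index_dir)

    # Load the centroid file and compare the patient to every centroid.
    centroids = json.loads((index_dir / "centroids.json").read_text())
    closest = min(range(len(centroids)),
                  key=lambda i: squared_distance(query_vector, centroids[i]))

    # Load only the cluster file of the closest centroid
    # (assumes that cluster is non-empty, so its file exists).
    cluster_path = index_dir / f"cluster_{closest:05d}.json"
    members = json.loads(cluster_path.read_text())

    # Compare the patient with the patients in that cluster and keep the top n.
    scored = [(squared_distance(query_vector, m["vector"]), m["id"])
              for m in members]
    return heapq.nsmallest(n, scored)
```

Only the centroid file and one cluster file are read per query. The trade-off is that a true neighbour sitting just across a cluster boundary will be missed.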

Results
- 50,000 clusters
- up to ~1000 patients per cluster
- ~500 KB to 1 MB per cluster file
- ~18 MB centroid file

To do similarity search you need:
- ~20 GB HDD
- ~20 MB RAM
Search works in ~100 milliseconds on a regular office laptop
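
A back-of-the-envelope check of these figures, using only the numbers quoted in the slides. It assumes the sizes are decimal bytes (KB, MB, GB) and that the raw data would otherwise be stored as a dense bitmap with one bit per feature. The variable names are illustrative.

```python
# Figures quoted in the talk; units are assumed to be decimal bytes.
patients = 30_000_000
features_per_patient = 250_000
clusters = 50_000
centroid_file_bytes = 18e6   # ~18 MB centroid file
index_bytes = 20e9           # ~20 GB on disk in total

# Dense bitmap of the raw data: one bit per binary feature.
raw_bytes = patients * features_per_patient / 8

avg_patients_per_cluster = patients / clusters
bytes_per_centroid = centroid_file_bytes / clusters
bytes_per_patient = index_bytes / patients

print(f"raw dense bitmap:         {raw_bytes / 1e9:.1f} GB")
print(f"avg patients per cluster: {avg_patients_per_cluster:.0f}")
print(f"bytes per centroid:       {bytes_per_centroid:.0f}")
print(f"bytes per stored patient: {bytes_per_patient:.0f}")
```

Under these assumptions the raw bitmap would be roughly 937.5 GB, against about 20 GB for the index. That works out to around 667 bytes per stored patient, 360 bytes per centroid, and an average of 600 patients per cluster.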

Thank you?