dmitry bugaychenko - smart.data@ОК.ru. how to make the world a bit better using a petabyte of data...
Smart.Data@ОК.ru How to make the world a bit better using a petabyte of data
and a couple of good books
Dmitry Bugaychenko
What is it about?
• About us and the size
• Smart data vs. big data
• Smart data tasks
• Smart data deck
• Open data
OK.ru is about
• Family, friends, classmates
• A sea of positivity and humor
• An enormous collection of multimedia content
• The largest platform for online games
OK.ru in numbers
• 200 000 000 registered users
• 10 000 000 communities
• 5 000 000 000 connections
• Daily:
  – 40 000 000 unique users
  – 250 000 000 messages
  – 25 000 000 posts
  – 44 000 000 photos
  – 7 000 000 friendships
  – 9 000 000 000 entries added to news feeds
  – …
OK.ru in technical numbers
• 7000+ servers
• 10 TB of posts
• 20 TB of likes
• 80 TB of messages in discussions
• 400 GB of social connections
• …
• And 6 TB of new data in our Hadoop each day
Big Data vs. Smart Data
• 100 gigabytes, how much is it?
  – 50 000 electronic books
  – 10 000 mp3 music tracks
  – 5000 photos from a modern smartphone
  – 10 HD movies
  – 1 season of your favorite TV series in good quality
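To get a feel for these figures, the implied average size per item can be worked out with a few lines of Python. This is only a back-of-the-envelope sketch: the counts come from the slide, while the decimal conversion (1 GB = 1000 MB) and the variable names are assumptions made for illustration.

```python
# Implied average item size if 100 GB holds the quoted number of items.
# Assumption: decimal units, 1 GB = 1000 MB. Counts are taken from the slide.
TOTAL_MB = 100 * 1000

counts = {
    "electronic books": 50_000,
    "mp3 music tracks": 10_000,
    "smartphone photos": 5_000,
    "HD movies": 10,
}

# Average size in megabytes for each kind of item.
implied_mb = {name: TOTAL_MB / n for name, n in counts.items()}

for name, mb in implied_mb.items():
    print(f"{name}: ~{mb:g} MB each")
```

The slide's figures are round numbers, so the results are only order-of-magnitude estimates.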
Big Data vs. Smart Data
• 100 gigabytes is enough to:
  – Auto-generate the music catalog
  – Triple activity in the music layer
  – Double activity in the video “related” area
• 400 gigabytes:
  – Plus 50% to clicks on the communities page
  – An extra 1M community visits from the “recommended communities” portlet
• 10 TB: plus 20% to clicks on “Like” in the news feed
Smart data!
The size of data is important, but your ability to employ the data to improve your product is way more important!
Be smart!
1. Set the Goal
2. Find the Model
3. Select the Toolset
4. Collect the Data needed
5. Mine the Data, train the Model, apply results
6. Profit!
Be smart!
1. Set the Goal
2. Find the Model
3. Select the Toolset
4. Collect the Data needed
5. Mine the Data, train the Model, apply results
6. Profit!
7. Repeat from step 1
Smart data for music
• The data
  – Metadata of UGC uploads and copyrighted content
  – User playbacks and playlists
• Stage 0: construct the music catalog using a bunch of statistical algorithms
  – Improved search and navigation
  – Images in the music layer!
  – Plus 20-30% to playbacks
Smarter data for music
• Stage 1: mining collaborative correlations
  – Similar artists and “Artist radio”
• Stage 2: mining temporal correlations and combining them with metadata and the collaborative part
  – Personalization of the main page
  – The most impactful feature (+100% to playbacks)
• Stage 3: segmenting users’ tastes
  – “My Radio” feature
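As an illustration of what “mining collaborative correlations” for similar artists can look like, here is a minimal sketch that scores artist pairs by cosine similarity of playlist co-occurrence. The talk does not say which algorithm OK.ru uses, so the choice of measure, the function name `similar_artists` and the toy playlists (artists "A" to "D") are all assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def similar_artists(playlists):
    """Cosine similarity between artists, based on playlist co-occurrence."""
    count = defaultdict(int)  # artist -> number of playlists containing it
    co = defaultdict(int)     # (a, b) with a < b -> playlists containing both
    for playlist in playlists:
        artists = sorted(set(playlist))  # ignore repeats within one playlist
        for a in artists:
            count[a] += 1
        for a, b in combinations(artists, 2):
            co[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), n in co.items():
        score = n / sqrt(count[a] * count[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


# Hypothetical toy data: each inner list is one user's playlist.
playlists = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C"],
    ["A", "D"],
]
sims = similar_artists(playlists)

# Artists most similar to "A", best first.
ranked = sorted(sims["A"].items(), key=lambda kv: -kv[1])
print(ranked)
```

The same table of pairwise scores could feed an “Artist radio” style feature by repeatedly picking tracks from the top-ranked neighbours of a seed artist.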
Smartest data for music
• Stage 4: music content analysis
  – “ContentId” system
  – Deduplication for the search index
  – Artist genre tagging
  – Outlier removal
  – Large investments, but very limited user effect
Smart data for communities
• The data:
  – Log of visits and actions in communities
  – Community metadata
  – Post content
• Stage 1: mine collaborative correlations, applying the toolbox used for music
  – “Recommended communities” portlet
  – Highest CTR among all “add” content on the main page
Smarter data for communities
• Stage 2: extend the recommender with metadata (tags) and regional/demographic data
  – Improved CTR for recommended communities
  – Personalization of the communities page
  – +50% to clicks on the communities page
• Stage 3: mine community content for its semantics
  – Implemented a fancy distributed Robust LDA model for community posts
  – Not in production yet: waiting for one of you guys to complete it ;)
Smart data for video
• The data
  – Likes for videos
  – “View events”
• Stage 1: mine collaborative correlations
  – Doubled clicks in the “related video” area
  – Likes perform better than “view events”
• Stage 2: advanced collaborative models for top personalization
  – Different variations of SVD
  – Limited effect. To be continued (maybe with one of you ;) )
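To make the “variations of SVD” idea concrete, the sketch below computes the simplest possible one: a rank-1 truncated SVD of a tiny user-by-video likes matrix, obtained by power iteration, and uses it to score videos a user has not liked yet. This is an assumption-laden toy, not the models from the talk: the matrix, the function name `top_singular_triplet` and the rank-1 choice are all made up for illustration, and the matrix is assumed to be non-zero.

```python
from math import sqrt


def top_singular_triplet(A, iters=100):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value.

    Assumes A is a non-empty, non-zero matrix given as a list of rows.
    """
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / sqrt(n_cols)] * n_cols  # start from a uniform unit vector
    for _ in range(iters):
        # w = A^T (A v), then renormalise to unit length
        Av = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(A[i][j] * Av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v


# Hypothetical likes matrix: rows are users, columns are videos, 1 = liked.
likes = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
sigma, u, v = top_singular_triplet(likes)

# Rank-1 scores for user 0: sigma * u[0] * v[j] for every video j.
scores_user0 = [sigma * u[0] * vj for vj in v]
print(scores_user0)
```

In this toy example user 0 has not liked videos 2 and 3, and video 2 gets the higher score because it is also liked by user 1, who shares user 0's other likes. A production model would keep more than one factor and handle the implicit-feedback nature of likes more carefully.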
Smart data for news feed
• The data:
  – News feed impressions (our largest data set)
  – Likes, comments and clicks
• Stage 1: improve ranking using CTR
  – We managed to build infrastructure capable of calculating CTR for all our content in real time (at up to 4 000 000 events per second)
  – +10% to CTR when considering CTR :)
• Stage 2: improved models for news feed ranking
  – An SVD-based collaborative approach is running in experimental settings and looks promising
  – More models are waiting: content-based (LDA or whatever), social-based, ensembles
  – Join the movement!
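For Stage 1, the core bookkeeping behind CTR-based ranking can be sketched in a few lines: count impressions and clicks per item as events arrive and rank by the ratio. The single-process class below is only an illustration and says nothing about how the real infrastructure reaches millions of events per second. The class name `CtrTracker`, the event names and the additive smoothing priors are assumptions, added so that items with very few impressions do not jump to the top.

```python
from collections import defaultdict


class CtrTracker:
    """Keeps per-item impression and click counts and a smoothed CTR estimate."""

    def __init__(self, prior_clicks=1.0, prior_impressions=20.0):
        # Hypothetical smoothing priors: every item starts as if it had
        # already been shown 20 times and clicked once.
        self.prior_clicks = prior_clicks
        self.prior_impressions = prior_impressions
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def record(self, item_id, event):
        """Register one 'impression' or 'click' event for an item."""
        if event == "impression":
            self.impressions[item_id] += 1
        elif event == "click":
            self.clicks[item_id] += 1
        else:
            raise ValueError(f"unknown event type: {event!r}")

    def ctr(self, item_id):
        """Smoothed click-through rate: (clicks + prior) / (impressions + prior)."""
        clicks = self.clicks.get(item_id, 0) + self.prior_clicks
        shown = self.impressions.get(item_id, 0) + self.prior_impressions
        return clicks / shown

    def rank(self, item_ids):
        """Order candidate items by smoothed CTR, best first."""
        return sorted(item_ids, key=self.ctr, reverse=True)


# Hypothetical event stream: (item, impressions, clicks).
tracker = CtrTracker()
for item, shown, clicked in [("post_a", 100, 20), ("post_b", 2, 1), ("post_c", 100, 5)]:
    for _ in range(shown):
        tracker.record(item, "impression")
    for _ in range(clicked):
        tracker.record(item, "click")

ranking = tracker.rank(["post_c", "post_b", "post_a"])
print(ranking)
```

Here "post_b" has the highest raw CTR (1 click out of 2 impressions), but the smoothing keeps it below "post_a", which has far more evidence behind its rate.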
Smart data for antispam
• Stolen account detection
• Pornography detection
• Textual spam classification
• Automatic registration detection
• …
Smart data for BI
• We would like to understand why we see certain effects and how we can influence them
• Using data analysis and visualization to find the answers and insights
More areas willing to become smart!
• Help users find friends (people you may know and more)
• Presents
• Games
• Photos
• Operational data
• …
Our technologies
• Hadoop 2.x
• Apache Pig
• Apache Spark
• Apache Kafka
• Apache Samza
• Apache Cassandra
• Python, R, Tableau
Likes data set
• The first dataset we opened to the public
• Contains posts made in communities
• Contains users’ likes for the posts
• Get it at http://likesdataset.sh2014.org/index/
• Try it yourself in the mini-contest for predicting likes
SNA hackathon dataset
• Large dataset (100+ GB)
• Contains
  – Fragment of the social graph
  – Users’ posts, likes and logins
  – Community posts
  – Demographic data
  – Complaints
• To get it mailto:[email protected]
New users dataset
• The freshest one (data up to April 8, 2015)
• Contains
  – New user registrations (40K)
  – All activities related to those users (both their own and others’)
  – Social graph for all involved users (4M)
  – Demographic data for the users
• Potential tasks: friend recommendations, bot detection, profile data correction…
• To get it mailto:[email protected]
What is it about?
• About us and the size
• Smart data vs. big data
• Smart data tasks
• Smart data deck
• Open data
• A bit of humor before lunch (author’s personal opinion)