Big Data in the Web
DESCRIPTION
Keynote talk presented at the 25th International Conference on Advanced Information Systems Engineering
TRANSCRIPT
6/28/13
Big Data in
The Web
Ricardo Baeza-Yates Yahoo! Labs
Barcelona & Santiago de Chile
- 3 -
Agenda
• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks
- 4 -
Big Data
• Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time
• Large volume and growth
  – Petabytes to exabytes
  – Growth is estimated at 3 exabytes per day
  – Structured vs. non-structured data
• Diversity
  – Types, formats, complexity, topics, etc.
• Best Public Data Example: The Web
  – Content: text, multimedia
  – Structure: graphs
  – Usage: real time streams
- 5 -
Big Data
• Focus on analytics
• Many storage technologies:
  – DBs, DWs, distributed file systems, …
• Many processing technologies:
  – Cloud computing, map-reduce (Hadoop), …
  – Data mining, clustering, classification, …
  – Machine learning, A/B testing, NLP, …
  – Simulation
• Several technology providers
• Initial best practices (see TDWI report, 2011)
• Main challenges: scalability, online processing
- 6 -
Big Data: The Five V’s
Characteristic | Data Issue | Computing Issue
Volume | Scale, Redundancy | Scalability
Variety | Heterogeneity, Complexity | Adaptability, Extensibility
Veracity | Completeness, Bias, Sparsity, Noise, Spam | Reliability, Trust
Velocity | Real time | Online
Value | Usefulness, Privacy | Business dependent
- 7 -
Asking the Right Questions
• Problem Driven
  – What data do we need? How much?
  – How do we collect it? How do we store and transfer it?
• Understanding the Data
  – How sparse is the data? How much noise?
  – Is there redundancy? Are there biases?
  – Is there spam? Any outliers?
• Analyzing the Data
  – Any privacy issues? Do we need to anonymize?
  – How well do our algorithms scale?
  – Can we visualize the results?
- 8 -
Too Much Data Available
• The Web is a database!
• Data does not imply information
• Many analyses for the sake of it (data driven)
• Analyzing data is not CS per se
  – Publish in the right forum!
• Big Data or Right Data?
- 9 -
The Different Facets of the Web
- 11 -
The Structure of the Web
- 12 -
Big Data in the Web
[Figure: diagram of Web data sources. Labels: Metadata, RDF, Wikipedia, ODP, Flickr, Text, Anchors + links, Y! Answers, Logs (Clicks + Queries), Wordnet, UGC, Blogs, Groups, Private; Explicit vs. Implicit; Scale; Quality?]
- 13 -
What is in the Web? How Good is it?
[Figure: Quantity vs. Quality, comparing user-generated content with traditional publishing]
- 14 -
What else is in the Web?
- 15 -
Noise and Spam
• Noise may come from many places:
  – Instruments that measure
  – How we interpret the data (example later)
• Spam is everywhere
- 16 -
Web Spam
Deceiving text, links, clicks… due to an economic incentive
Depending on the goal and the data, spam is easier to generate
Depending on the type & target data, spam is easier to fight
Disincentives for spammers?
• Social
• Economic
Web Spam is NOT Mail Spam
- 17 -
- 18 -
Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]
- 19 -
Web Data Trends
• User Generated Content
  – Massive (quality vs. quantity)
  – Social Networks
  – Real time (people + physical sensors)
• Impact
  – Fragmentation of ownership
  – Fragmentation of access (longer heavy tail)
  – Fragmentation of right to access
• Viability
  – Business model based on advertising
- 20 -
The Wisdom of Crowds
• James Surowiecki, a New Yorker columnist, published this book in 2004
– “Under the right circumstances, groups are remarkably intelligent”
• Importance of diversity, independence and decentralization
“large groups of people are smarter than an elite few, no matter how brilliant—they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”.
Aggregating data
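The aggregation idea can be illustrated with a small simulation. This is only a sketch under stated assumptions, not something from the talk: every member of the crowd makes an independent, unbiased guess with Gaussian noise, and the function name and parameters are hypothetical.

```python
import random
import statistics


def crowd_vs_individuals(true_value, crowd_size, noise_sd, seed=0):
    """Compare the error of the averaged guess with the typical individual error.

    Assumption (illustrative only): guesses are independent and unbiased,
    which mirrors the "diversity and independence" conditions quoted above.
    """
    rng = random.Random(seed)
    guesses = [rng.gauss(true_value, noise_sd) for _ in range(crowd_size)]
    # Error of the aggregated (mean) guess.
    crowd_error = abs(statistics.fmean(guesses) - true_value)
    # Mean absolute error of the individual guesses.
    individual_errors = [abs(g - true_value) for g in guesses]
    typical_individual_error = statistics.fmean(individual_errors)
    return crowd_error, typical_individual_error


crowd_error, individual_error = crowd_vs_individuals(
    true_value=1000.0, crowd_size=500, noise_sd=200.0
)
print(f"crowd error: {crowd_error:.1f}, typical individual error: {individual_error:.1f}")
```

The error of the average can never exceed the average individual error, and with independent guesses it is usually far smaller. If the guesses share a bias, averaging does not remove it.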
- 21 -
Web Data Mining
• Content: text & multimedia mining
• Structure: link analysis, graph mining
• Usage: log analysis, query mining
• Relate all of the above
– Web characterization
– Particular applications
- 22 -
Flickr: Clustering Pictures
- 23 -
Popularity
- 24 -
Flickr: Geo-tagged pictures
- 27 -
“Crowd Sourcing”
Web-based “peer production” has produced a number of successful products and communities
• Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
Can this form of production be harnessed for other ends?
• Existing successes are hard to replicate at will
Amazon Mechanical Turk (AMT)
• Like outsourcing, but in a micro-distributed fashion
• Thousands of “turkers” working on hundreds of “HITs” (tasks)
• Rates are typically a few cents per task
• The quality of their work has been positively evaluated (e.g. in IR)
- 28 -
The Wisdom of (Large) Crowds
– Crucial for Search Ranking
– Text: Web Writers & Editors
  • not only for the Web!
– Links: Web Publishers
– Tags: Web Taggers
– Queries: All Web Users!
  • Queries and actions (or no action!)
The crowd implicitly knows the experts!
- 30 -
Scalability
• How to scale?
  – Doubling the data will, in the best case, double the time
• Time complexity vs. result quality trade-off
  – Example: entity detection in linear time at almost state-of-the-art quality
  – That implies that there exists a text size n* above which the linear algorithm will produce more correct entities than a slower, more accurate one given the same time
• Distributed parallel processing
  – Map-reduce does not always work
  – Parallelism is problem dependent
  – Online processing needs a different approach
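The crossover size n* in the entity detection example can be worked out under a simple cost model. The model below is an assumption for illustration, not taken from the talk: the fast algorithm costs c_fast * n, the slower one costs c_slow * n**2, and each finds a fixed fraction (q_fast, q_slow) of the entities, which occur at constant density. All names and constants are hypothetical.

```python
def crossover_size(c_fast, c_slow, q_fast, q_slow):
    """Text size n* above which the linear algorithm wins on correct entities.

    In the time the quadratic algorithm needs for n units of text
    (c_slow * n**2), the linear one reads c_slow * n**2 / c_fast units.
    Setting q_fast * c_slow * n**2 / c_fast == q_slow * n gives n*.
    """
    return (q_slow * c_fast) / (q_fast * c_slow)


def correct_entities_in_budget(budget, c_fast, c_slow, q_fast, q_slow, density=1.0):
    """Correct entities found by (fast, slow) within the same time budget."""
    n_fast = budget / c_fast            # text the linear algorithm can process
    n_slow = (budget / c_slow) ** 0.5   # text the quadratic algorithm can process
    return q_fast * density * n_fast, q_slow * density * n_slow


# Hypothetical constants: the fast algorithm is slightly less accurate.
n_star = crossover_size(c_fast=1.0, c_slow=0.01, q_fast=0.85, q_slow=0.90)
budget = 0.01 * (2 * n_star) ** 2       # time the slow algorithm needs for 2 * n*
fast, slow = correct_entities_in_budget(budget, 1.0, 0.01, 0.85, 0.90)
print(f"n* = {n_star:.1f}, fast finds {fast:.1f}, slow finds {slow:.1f}")
```

Under a different cost model (for example n log n versus n) the formula changes, but the argument is the same: a small loss in quality is outweighed once the faster algorithm can cover enough extra text.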
- 31 -
Redundancy and Bias
• Is there any dependency in the data?
• Is there any duplication?
  – Lexical duplication in the Web is around 25%
  – Semantic duplication is larger
• Are there any biases?
  – Example 1: clicks in search engines
    • Bias to the ranking and the interface
    • There is a ranking bias in the Web content
  – Example 2: tag recommendation
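Lexical duplication, as mentioned above, is often measured by comparing overlapping word sequences (shingles). The sketch below uses word shingles and Jaccard similarity, a common textbook approach. It is an assumption that something similar underlies the 25% figure; the talk does not say how it was measured.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle, or none if the text is empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(shingles(page_a), shingles(page_b))
print(f"lexical similarity: {similarity:.2f}")
```

Pairs above a chosen threshold would be treated as lexical near-duplicates. Semantic duplication (same content, different wording) is not captured by this measure.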
- 32 -
We can suggest tags: nice but ....
- 33 -
Privacy Example: AOL Query Logs Release Incident
No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men”.
Other queries: “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”
Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.
A Face Is Exposed for AOL Searcher No. 4417749, By MICHAEL BARBARO and TOM ZELLER Jr, The New York Times, Aug 9 2006
- 34 -
Risks of Privacy
(ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public databases (Sweeney, 2001)
K-anonymity: suppress or generalize attributes until each entry is identical to at least k-1 other entries
Federal Trade Commission in US: Privacy policies should “address the collection of data itself and not just how the data is used”, Dec 2010.
Data Protection Directive in EU
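The k-anonymity definition above can be checked mechanically. The following is a minimal sketch: the records, field names and the generalization rule (truncate the ZIP code, round the birth year to a decade) are hypothetical and only illustrate the definition.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize(record):
    """Generalize one record: keep 3 ZIP digits and the decade of birth."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    out["birth_year"] = record["birth_year"] // 10 * 10
    return out


# Made-up records for illustration only.
records = [
    {"zip": "10001", "birth_year": 1944, "gender": "F"},
    {"zip": "10002", "birth_year": 1948, "gender": "F"},
    {"zip": "20005", "birth_year": 1971, "gender": "M"},
    {"zip": "20009", "birth_year": 1975, "gender": "M"},
]
quasi = ["zip", "birth_year", "gender"]
print(is_k_anonymous(records, quasi, k=2))
print(is_k_anonymous([generalize(r) for r in records], quasi, k=2))
```

Generalizing more aggressively raises k but lowers the usefulness of the data, which is the trade-off behind the anonymization problem.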
- 35 -
Risks of Privacy: Query Logs
Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
• Gender: 84%
• Age (±10): 79%
• Location (ZIP3): 35%
Vanity Queries: [Jones et al, CIKM 2008]
• Partial name: 8.9%
• Complete name: 1.2%
More information:
• A Survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008]
A good anonymization is still an open problem
- 36 -
Sparsity
• The Long Tail is always Sparse
• Why is there a long tail?
• When the crowd dominates
• Empowering the tail
• Example: Relations from Query Logs
- 38 -
The Wisdom of Crowds
– Popularity
– Diversity
– Quality
– Coverage
[Figure labels: Long tail, Heavy tail]
- 39 -
The Long Tail
Most measures in the Web follow a power law
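To see why the tail matters under a power law, one can compute how much of the total mass lies outside the most popular items. This sketch assumes a pure Zipf distribution with exponent 1 over a hypothetical catalogue of 100,000 items. Real Web measures only approximate this shape.

```python
def zipf_weights(n_items, exponent=1.0):
    """Normalized Zipf weights: the item at rank r gets weight proportional to 1 / r**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def tail_share(weights, head_size):
    """Fraction of the total mass held by items ranked below the top head_size."""
    return sum(weights[head_size:])


weights = zipf_weights(100_000)
share = tail_share(weights, head_size=1_000)
print(f"share of mass outside the top 1% of items: {share:.1%}")
```

With these assumed parameters, roughly 38% of all the mass sits outside the top 1% of items, so ignoring the tail means ignoring a large share of the activity.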
- 42 -
Heavy tail of user interests
Many queries, each asked very few times, make up a large fraction of all queries
Movies watched, blogs read, words used, …
One explanation
[Figure: People vs. Interests, split into “Normal people” and “Weirdos”]
- 43 -
Heavy tail of user interests
Many queries, each asked very few times, make up a large fraction of all queries
Applies to word usage, web page access, …
The reality
We are all partially eclectic
[Figure: People vs. Interests]
Broder, Gabrilovich, Goel, Pang; WSDM 2009
- 44 -
Example: Click Distribution
User interaction is a power law!
(Zipf’s principle of least effort)
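A quick way to check whether click counts look like a power law is to fit a straight line to the rank-frequency data on log-log axes. The sketch below uses ordinary least squares, which is a rough estimator chosen for brevity, not a method attributed to the talk. The click counts are synthetic.

```python
import math


def fit_power_law_exponent(counts):
    """Estimate alpha in count ~ C / rank**alpha by least squares on log-log data."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a decaying power law; alpha is its negation.
    return -cov / var


# Synthetic clicks per result position, generated with exponent 1.2.
clicks = [1000.0 / position ** 1.2 for position in range(1, 101)]
print(f"estimated exponent: {fit_power_law_exponent(clicks):.2f}")
```

On noisy real logs this fit is sensitive to the sparse tail, so it is better read as a diagnostic than as a precise estimate.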
- 45 -
When the crowd dominates
Kills the long tail
See the (now obsolete) “shwarzneger” example
- 46 -
Empowering the Tail
The Filter “Bubble”, Eli Pariser
• Avoid the “Poor get Poorer” Syndrome
Solutions:
• Diversity
• Novelty
• Serendipity
Explore & Exploit
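One simple reading of “Explore & Exploit” is an epsilon-greedy policy: mostly show the item with the best observed click rate, but occasionally show a random one so that tail items get a chance to collect feedback. This is a sketch of that generic technique. The talk does not specify which policy is meant, and the item names and rates are hypothetical.

```python
import random


def epsilon_greedy_pick(click_rates, epsilon, rng):
    """Pick an item: explore uniformly with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        # Explore: any item, including rarely shown tail items.
        return rng.choice(list(click_rates))
    # Exploit: the item with the highest observed click rate.
    return max(click_rates, key=click_rates.get)


observed_rates = {"head_item": 0.30, "mid_item": 0.05, "tail_item": 0.01}
rng = random.Random(0)
shown = [epsilon_greedy_pick(observed_rates, epsilon=0.1, rng=rng) for _ in range(1000)]
print({item: shown.count(item) for item in observed_rates})
```

With epsilon set to zero the popular item is always shown and the tail never gets feedback, which is the “poor get poorer” effect named above.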
- 47 -
How to Circumvent Sparsity?
Wisdom of “ad-hoc” crowds?
Aggregate data in the “right way”
When data is sparse:
Aggregate users around the same intent, task, facet, …
Change granularity “ad hoc”
• Middle-aged men
• Fans of Messi
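Changing granularity can be sketched as a back-off: use a user's own history when there is enough of it, and fall back to the aggregate of an ad hoc group otherwise. The group labels, threshold and data below are hypothetical. The talk does not prescribe this exact scheme.

```python
from collections import defaultdict


def aggregate_by_group(user_clicks, user_group):
    """Sum per-item click counts over all users in the same group."""
    totals = defaultdict(lambda: defaultdict(int))
    for user, clicks in user_clicks.items():
        group = user_group[user]
        for item, count in clicks.items():
            totals[group][item] += count
    return {group: dict(items) for group, items in totals.items()}


def top_item(user, user_clicks, user_group, group_totals, min_events=5):
    """Most clicked item for a user, backing off to the group when data is sparse."""
    own = user_clicks.get(user, {})
    if sum(own.values()) >= min_events:
        source = own                              # enough personal data
    else:
        source = group_totals[user_group[user]]   # sparse: use the ad hoc crowd
    return max(source, key=source.get)


user_clicks = {
    "ana": {"news": 1},
    "ben": {"football": 4, "news": 1},
    "carl": {"football": 3},
    "dana": {"cooking": 6, "news": 1},
}
user_group = {"ana": "messi_fans", "ben": "messi_fans", "carl": "messi_fans", "dana": "other"}
group_totals = aggregate_by_group(user_clicks, user_group)
print(top_item("ana", user_clicks, user_group, group_totals))
print(top_item("dana", user_clicks, user_group, group_totals))
```

The choice of group is the hard part: it should reflect a shared intent, task or facet rather than a fixed demographic bucket.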
- 48 -
Example: Mining Geo/time Data
• Optimal Touristic Paths from Flickr
• Good for tourists and locals
De Choudhury et al, HT 2010
- 49 -
Aggregating in the Long Tail
• The long tail is important not only for e-commerce, but because we are all there
• Personalization vs. Contextualization
• User interaction is another long tail
[Figure: People vs. Interests]
- 69 -
Epilogue
• The Web is scientifically young
• The Web is intellectually diverse
• The technology mirrors the economic, legal and sociological reality
• Data must be interesting! (Gerhard Weikum)
• Problem driven
• Plenty of challenges
- 70 -
Mirror of Society
- 71 -
Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006
Contact: [email protected]
Thanks to many people at Yahoo! Labs
ASIST 2012 Book of the Year Award
Questions?