big data in the web

6/28/13


Big Data in the Web

Ricardo Baeza-Yates

Yahoo! Labs, Barcelona & Santiago de Chile

- 3 -

Agenda

• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks


- 4 -


Big Data

§  Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time

§  Large volume and growth
    §  Petabytes to exabytes
    §  Growth is estimated at 3 exabytes per day
    §  Structured vs. unstructured data

§  Diversity
    §  Types, formats, complexity, topics, etc.

§  Best public data example: the Web
    §  Content: text, multimedia
    §  Structure: graphs
    §  Usage: real-time streams

- 5 -


Big Data

§  Focus on analytics
§  Many storage technologies:
    §  DBs, DWs, distributed file systems, …
§  Many processing technologies:
    §  Cloud computing, map-reduce (Hadoop), …
    §  Data mining, clustering, classification, …
    §  Machine learning, A/B testing, NLP, …
    §  Simulation

§  Several technology providers
§  Initial best practices (see TDWI report, 2011)
§  Main challenges: scalability, online processing


- 6 -


Big Data: The Five V’s

Characteristic   Data Issue                                   Computing Issue
Volume           Scale, Redundancy                            Scalability
Variety          Heterogeneity, Complexity                    Adaptability, Extensibility
Veracity         Completeness, Bias, Sparsity, Noise, Spam    Reliability, Trust
Velocity         Real time                                    Online
Value            Usefulness, Privacy                          Business dependent

- 7 -


Asking the Right Questions

§  Problem Driven
    §  What data do we need? How much?
    §  How do we collect it? How do we store and transfer it?

§  Understanding the Data
    §  How sparse is the data? How much noise?
    §  Is there redundancy? Are there biases?
    §  Is there spam? Any outliers?

§  Analyzing the Data
    §  Any privacy issues? Do we need to anonymize?
    §  How well do our algorithms scale?
    §  Can we visualize the results?


- 8 -


Too Much Data Available

§  The Web is a database!
§  Data does not imply information
§  Many analyses for the sake of it (data driven)
§  Analyzing data is not CS per se
    §  Publish in the right forum!

§  Big Data or Right Data?

- 9 -

The Different Facets of the Web


- 11 -

The Structure of the Web

- 12 -

Big Data in the Web

[Diagram of Web data sources. Labels: Metadata, RDF; Wikipedia, ODP; Flickr; Text; Anchors + links; Y! Answers; Logs (Clicks + Queries); Explicit; Implicit; Wordnet; UGC; Private; Blogs, Groups. Dimensions: Scale, Quality?]


- 13 -

What Is in the Web? How Good Is It?

[Figure: quantity vs. quality, contrasting user-generated content with traditional publishing]

- 14 -

What else is in the Web?


- 15 -


Noise and Spam

§  Noise may come from many places:
    §  Instruments that measure
    §  How we interpret the data (example later)

§  Spam is everywhere

- 16 -

Web Spam

Deceiving text, links, clicks… due to an economic incentive

Depending on the goal and the data, spam is easier to generate

Depending on the type & target data, spam is easier to fight

Disincentives for spammers?

•  Social
•  Economic

Web Spam is NOT Mail Spam


- 17 -

- 18 -

Content and Metadata Trends

[Ramakrishnan and Tomkins 2007]


- 19 -

Web Data Trends

•  User Generated Content
    –  Massive (quality vs. quantity)
    –  Social Networks
    –  Real time (people + physical sensors)

•  Impact
    –  Fragmentation of ownership
    –  Fragmentation of access (longer heavy tail)
    –  Fragmentation of right to access

•  Viability
    –  Business model based on advertising

- 20 -

The Wisdom of Crowds

•  James Surowiecki, a New Yorker columnist, published this book in 2004

– “Under the right circumstances, groups are remarkably intelligent”

•  Importance of diversity, independence and decentralization

“large groups of people are smarter than an elite few, no matter how brilliant—they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”.

Aggregating data


- 21 -

Web Data Mining

•  Content: text & multimedia mining

•  Structure: link analysis, graph mining

•  Usage: log analysis, query mining

•  Relate all of the above

– Web characterization

– Particular applications

- 22 -

Flickr: Clustering Pictures


- 23 -

Popularity

- 24 -

Flickr: Geo-tagged pictures


- 27 -

“Crowd Sourcing”

Web-based “peer production” has produced a number of successful products and communities
•  Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...

Can this form of production be harnessed for other ends?
•  Existing successes are hard to replicate at will

Amazon Mechanical Turk (AMT)
•  Like outsourcing, but in a micro-distributed fashion

•  Thousands of “turkers” working on hundreds of “HITs” (tasks)

•  Rates are typically a few cents per task

•  The quality of their work has been positively evaluated (e.g., in IR)

- 28 -

The Wisdom of (Large) Crowds

–  Crucial for Search Ranking
–  Text: Web Writers & Editors
    •  not only for the Web!
–  Links: Web Publishers
–  Tags: Web Taggers
–  Queries: All Web Users!
    •  Queries and actions (or no action!)

The crowd implicitly knows the experts!


- 30 -


Scalability

§  How to scale?
    §  Doubling the data will, in the best case, double the time

§  Time complexity vs. result quality trade-off
    §  Example: entity detection in linear time at almost state-of-the-art quality
    §  That implies that there exists a text size n* beyond which the linear algorithm will produce more correct entities than a slower, more accurate algorithm given the same time

§  Distributed parallel processing
    §  Map-reduce does not always work
    §  Parallelism is problem dependent
    §  Online processing needs a different approach
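
The break-even text size n* from the trade-off above can be made concrete with a small worked sketch. The cost model is an assumption made purely for illustration and is not taken from the slides: a slower algorithm is assumed to cost `slow_cost * n**2` time units at accuracy `slow_acc`, and the linear one `lin_cost * n` at a lower accuracy `lin_acc`. All function and parameter names are hypothetical.

```python
def break_even_size(lin_cost, lin_acc, slow_cost, slow_acc):
    """Threshold n* under the assumed (hypothetical) cost model.

    Slow algorithm:   time = slow_cost * n**2, accuracy = slow_acc
    Linear algorithm: time = lin_cost * n,     accuracy = lin_acc
    In the time the slow algorithm spends on n tokens, the linear one
    reads slow_cost * n**2 / lin_cost tokens. With a constant entity
    density, the linear algorithm finds more correct entities when
        lin_acc * slow_cost * n**2 / lin_cost > slow_acc * n
    that is, when n > n* = slow_acc * lin_cost / (lin_acc * slow_cost).
    """
    return (slow_acc * lin_cost) / (lin_acc * slow_cost)


def correct_entities(n, lin_cost, lin_acc, slow_cost, slow_acc, density=0.01):
    """Correct entities found by (linear, slow) within the same time budget."""
    budget = slow_cost * n ** 2        # time the slow algorithm needs for n tokens
    lin_tokens = budget / lin_cost     # tokens the linear algorithm reads in that time
    return lin_acc * density * lin_tokens, slow_acc * density * n
```

Under this model the faster but slightly less accurate algorithm wins for any text larger than n*, simply because it gets to read more text in the same time.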

- 31 -


Redundancy and Bias

§  Is there any dependency in the data?
§  Is there any duplication?
    §  Lexical duplication in the Web is around 25%
    §  Semantic duplication is larger

§  Are there any biases?
    §  Example 1: clicks in search engines
        §  Bias to the ranking and the interface
        §  There is a ranking bias in the Web content
    §  Example 2: tag recommendation
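
Returning to the duplication question on this slide: lexical duplication is often measured by comparing sets of word shingles. The sketch below is one common approach, shown only as an illustration. It is not the method behind the 25% figure, and the shingle length, the similarity threshold, and all function names are assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def duplicate_fraction(docs, threshold=0.9, k=3):
    """Fraction of documents that are near-duplicates of an earlier document.

    Uses an all-pairs comparison, which is fine for a toy example but
    would not scale to the Web.
    """
    seen = []
    duplicates = 0
    for doc in docs:
        sh = shingles(doc, k)
        if any(jaccard(sh, other) >= threshold for other in seen):
            duplicates += 1
        else:
            seen.append(sh)
    return duplicates / len(docs) if docs else 0.0
```

Semantic duplication (the same content in different words) would not be caught by this kind of lexical comparison.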


- 32 -

We can suggest tags: nice but ....

- 33 -

Privacy Example: AOL Query Logs Release Incident

No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men”.

Other queries: “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”

Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.

A Face Is Exposed for AOL Searcher No. 4417749, By MICHAEL BARBARO and TOM ZELLER Jr, The New York Times, Aug 9 2006


- 34 -

Risks of Privacy

(ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public databases (Sweeney, 2001)

K-anonymity: suppress or generalize attributes until each entry is identical to at least k-1 other entries

Federal Trade Commission in the US: privacy policies should “address the collection of data itself and not just how the data is used” (Dec 2010)

Data Protection Directive in the EU
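
The k-anonymity definition on this slide can be checked mechanically. The following is a minimal sketch under assumed conventions: records are dictionaries, the field names `zip`, `birth`, and `gender` are hypothetical, and `generalize` shows just one possible generalization step (truncating the ZIP code and keeping only the birth year).

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize(record):
    """One hypothetical generalization step: ZIP -> 3-digit prefix, birth date -> year."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    out["birth"] = record["birth"][:4]
    return out
```

Generalizing trades precision for privacy: the coarser the attributes, the larger the groups of indistinguishable entries, and the less useful the data may become for analysis.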


- 35 -

Risks of Privacy: Query Logs

Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
•  Gender: 84%
•  Age (±10): 79%
•  Location (ZIP3): 35%

Vanity Queries: [Jones et al, CIKM 2008]
•  Partial name: 8.9%
•  Complete: 1.2%

More information:
•  A survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008]

Good anonymization is still an open problem


- 36 -


Sparsity

§  The Long Tail is always sparse
§  Why is there a long tail?
§  When the crowd dominates
§  Empowering the tail

§  Example: Relations from Query Logs

- 38 -

The Wisdom of Crowds

–  Popularity

–  Diversity
–  Quality
–  Coverage

Long tail Heavy tail


- 39 -

The Long Tail

Most measures in the Web follow a power law
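
To get a feel for how much weight the tail of a power law carries, the sketch below computes the share of total volume that falls outside the most popular items. It assumes an idealized Zipf-like law in which the item at popularity rank r has frequency proportional to 1 / r**exponent; the function name and parameters are illustrative, not from the talk.

```python
def zipf_tail_share(num_items, head_size, exponent=1.0):
    """Share of total volume outside the head_size most popular items.

    Assumes the item at rank r has frequency proportional to 1 / r**exponent.
    """
    weights = [1.0 / r ** exponent for r in range(1, num_items + 1)]
    total = sum(weights)
    head = sum(weights[:head_size])
    return (total - head) / total
```

For example, `zipf_tail_share(1_000_000, 1000)` is about 0.48: under this idealized model with exponent 1, everything outside the top 1,000 of a million items still accounts for nearly half of the total volume. A steeper exponent shrinks the tail quickly.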

- 42 -

Heavy tail of user interests: one explanation

Many queries, each asked very few times, make up a large fraction of all queries

The same holds for movies watched, blogs read, words used, …

[Figure: People vs. Interests; labels: Normal people, Weirdos]


- 43 -

Heavy tail of user interests: the reality

Many queries, each asked very few times, make up a large fraction of all queries

Applies to word usage, web page access, …

We are all partially eclectic

[Figure: People vs. Interests]

Broder, Gabrilovich, Goel, Pang; WSDM 2009

- 44 -

Example: Click Distribution

User interaction follows a power law!

(Zipf’s principle of least effort)


- 45 -

When the crowd dominates

Kills the long tail

See the (now obsolete) “shwarzneger” example

- 46 -

Empowering the Tail

The Filter “Bubble”, Eli Pariser
•  Avoid the “poor get poorer” syndrome

Solutions:
•  Diversity
•  Novelty
•  Serendipity

Explore & Exploit
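
One simple way to realize “explore & exploit” is an epsilon-greedy policy, sketched below. This is a generic textbook technique chosen for illustration, not necessarily the approach the talk has in mind; the function name, its parameters, and the use of click-through rate as the reward are assumptions.

```python
import random


def epsilon_greedy(clicks, views, epsilon=0.1, rng=random):
    """Pick an item index.

    With probability epsilon, explore: choose uniformly among all items,
    which gives tail items a chance to be shown. Otherwise exploit: choose
    the item with the best observed click-through rate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(views))
    rates = [c / v if v else 0.0 for c, v in zip(clicks, views)]
    return max(range(len(rates)), key=rates.__getitem__)
```

Without the exploration step, items that were never shown can never collect clicks, which is exactly the “poor get poorer” effect.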


- 47 -

How to Circumvent Sparsity?

Wisdom of “ad hoc” crowds?

Aggregate data in the “right way”

When data is sparse:
•  Aggregate users around the same intent, task, facet, …
•  Change granularity “ad hoc”
    –  Middle-aged men
    –  Fans of Messi
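
The idea of changing granularity when data is sparse can be sketched as a back-off estimator: use a user's own statistics when there are enough of them, and otherwise fall back to an ad hoc group the user belongs to. Everything here is a hypothetical illustration, including the event format, the `min_views` threshold, and the group labels.

```python
from collections import defaultdict


def backoff_ctr(events, user_group, min_views=20):
    """Build a click-through-rate estimator that backs off from user to group level.

    events: iterable of (user, item, clicked) tuples.
    user_group: dict mapping each user to an ad hoc group label.
    Returns a function estimate(user, item) -> (rate, level).
    """
    user_stats = defaultdict(lambda: [0, 0])    # (user, item) -> [clicks, views]
    group_stats = defaultdict(lambda: [0, 0])   # (group, item) -> [clicks, views]
    for user, item, clicked in events:
        for stats, key in ((user_stats, (user, item)),
                           (group_stats, (user_group[user], item))):
            stats[key][0] += int(clicked)
            stats[key][1] += 1

    def estimate(user, item):
        clicks, views = user_stats.get((user, item), (0, 0))
        if views >= min_views:
            return clicks / views, "user"       # enough personal data
        clicks, views = group_stats.get((user_group[user], item), (0, 0))
        return (clicks / views if views else None), "group"

    return estimate
```

The choice of group is the interesting part: the same user can be aggregated with different crowds depending on the intent or task at hand.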


- 48 -

Example: Mining Geo/time Data

• Optimal Touristic Paths from Flickr

• Good for tourists and locals

De Choudhury et al, HT 2010


- 49 -

Aggregating in the Long Tail

•  The long tail is important not only for e-commerce, but because we are all there

•  Personalization vs. Contextualization

User interaction is another long tail

[Figure: People vs. Interests]

- 69 -

Epilogue

•  The Web is scientifically young

•  The Web is intellectually diverse

•  The technology mirrors the economic, legal and sociological reality

•  Data must be interesting! (Gerhard Weikum)

•  Problem driven

•  Plenty of challenges


- 70 -

Mirror of Society

- 71 -

Exports/Imports vs. Domain Links

Baeza-Yates & Castillo, WWW2006


Contact: [email protected]

Thanks to many people at Yahoo! Labs

ASIST 2012 Book of the Year Award

Questions?
