
Web Mining for Unknown Term Translation

Wen-Hsiang Lu (盧文祥 )

Department of Computer Science and Information Engineering

[email protected]
http://myweb.ncku.edu.tw/~whlu

Web Mining

Research Problems

• Difficulties in automatic construction of multilingual translation lexicons
– Techniques: parallel/comparable corpora
– Bottlenecks: lack of diverse/multilingual resources

• Difficulties in query translation for cross-language information retrieval (CLIR)

– Techniques: Bilingual dictionary/machine translation/parallel corpora

– Bottlenecks: Multiple-senses/short/diverse/unknown query

• Challenges
– Web queries are often
• Short: 2-3 words (Silverstein et al. 1998)
• Diverse: wide-scoped topics
• Unknown (out of vocabulary): 74% are unavailable in CEDICT, a Chinese-English electronic dictionary containing 23,948 entries
– E.g.
• Proper names: 愛因斯坦 (Einstein), 海珊 (Hussein)
• New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (nosocomial infections)

Cross-Language Information Retrieval

• Query in the source language and retrieve relevant documents in target languages

(Figure: source query → query translation → target translation → information retrieval → target documents. Example queries: SARS, 愛因斯坦 (Einstein), 老年癡呆症 (senile dementia), National Palace Museum)

Difficulties in Web Query Translation Using Machine Translation

English source query : National Palace Museum

Chinese translation (literal, incorrect): 全國宮殿博物館 (the museum's actual Chinese name is 國立故宮博物院)

Research Paradigm

(Figure: on the Internet, Web mining — anchor-text mining and search-result mining — drives term-translation extraction into a live translation lexicon, supporting applications such as cross-language information retrieval and cross-language Web search. Search-result mining is the new approach.)

Multilingual Anchor-Texts

Language-Mixed Texts in Search Result Pages

Anchor-Text Mining with Probabilistic Inference Model

• Basic model: estimate the translation probability of source term s and target translation t over the anchor-text sets of pages u_1, …, u_n:

  P(s, t) = Σ_{i=1}^{n} P(s, t | u_i) P(u_i)

• Asymmetric translation models (conventional translation model; co-occurrence):

  P(s, t) = Σ_{i=1}^{n} P(s | u_i) P(t | s, u_i) P(u_i) ≈ Σ_{i=1}^{n} P(s | u_i) P(t | u_i) P(u_i)

• Symmetric model with link information:

  P(s, t) = Σ_{i=1}^{n} (1/2) [P(s | u_i) P(t | s, u_i) + P(t | u_i) P(s | t, u_i)] P(u_i)
          ≈ Σ_{i=1}^{n} (1/2) [P(s | u_i) P(t | u_i) + P(t | u_i) P(s | u_i)] P(u_i)

  where P(u_i) = L(u_i) / Σ_{j=1}^{n} L(u_j), and L(u_j) is the number of u_j's in-links (page authority)
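As a concrete illustration, the symmetric link-based model above can be sketched in a few lines. The input layout (each page carrying its anchor-text term lists and in-link count) is a hypothetical format chosen for this sketch, not the paper's data structure:

```python
def translation_prob(s, t, pages):
    """Estimate P(s, t) with the symmetric anchor-text model.

    `pages` maps a page id u_i to a dict with:
      'anchors'  - list of anchor-text term lists pointing at u_i
      'in_links' - number of in-links L(u_i) (page authority)
    (hypothetical input format for illustration)
    """
    total_links = sum(p['in_links'] for p in pages.values())
    if total_links == 0:
        return 0.0

    def term_prob(term, anchors):
        # P(term | u_i): fraction of u_i's anchor texts containing the term
        if not anchors:
            return 0.0
        return sum(term in a for a in anchors) / len(anchors)

    prob = 0.0
    for p in pages.values():
        p_u = p['in_links'] / total_links            # P(u_i), link-based
        prob += term_prob(s, p['anchors']) * term_prob(t, p['anchors']) * p_u
    return prob
```

Terms that co-occur often in the anchor texts of authoritative pages score high, which is exactly the intuition the slide's formula encodes.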

Transitive Translation Model for Multilingual Translation

s: source term; t: target translation; m: intermediate translation

• Direct Translation Model:

  P_direct(s, t) = P(s, t), where P(s, t) is estimated by the probabilistic inference model

• Indirect Translation Model (through an intermediate translation m):

  P_indirect(s, t) = Σ_m P(s, m) P(m, t) / P(m), where P(m) is the occurrence probability of m in the corpus

• Transitive Translation Model:

  P_trans(s, t) = P_direct(s, t) if P_direct(s, t) > θ, and P_indirect(s, t) otherwise, where θ is a predefined threshold value

• Example: Sony (English) ↔ 新力 (Traditional Chinese) ↔ ソニー (Japanese); the English-Japanese pair is bridged through the Traditional Chinese translation (indirect translation)
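The back-off logic of the transitive model can be sketched directly from the formulas above. The dictionary layouts for the precomputed direct probabilities and occurrence probabilities are assumptions of this sketch:

```python
def transitive_prob(s, t, direct, occurrence, theta=0.1):
    """P_trans(s, t): back off to the indirect model when the direct
    estimate is weak.

    `direct[(x, y)]` holds P_direct(x, y); `occurrence[m]` holds P(m).
    Both are assumed precomputed (hypothetical data layout); theta is
    the predefined threshold from the slide.
    """
    p_direct = direct.get((s, t), 0.0)
    if p_direct > theta:
        return p_direct
    # P_indirect(s, t) = sum_m P_direct(s, m) * P_direct(m, t) / P(m)
    p_indirect = 0.0
    for m, p_m in occurrence.items():
        if p_m > 0:
            p_indirect += direct.get((s, m), 0.0) * direct.get((m, t), 0.0) / p_m
    return p_indirect
```

For the Sony example, an English-Japanese pair with no direct evidence is scored through every intermediate Chinese translation m for which both P_direct(s, m) and P_direct(m, t) are known.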

Promising Results for Automatic Construction of Multilingual Translation Lexicons

Source terms (Traditional Chinese) | English     | Simplified Chinese | Japanese
新力                               | Sony        | 索尼               | ソニー
耐吉                               | Nike        | 耐克               | ナイキ
史丹佛                             | Stanford    | 斯坦福             | スタンフォード
雪梨                               | Sydney      | 悉尼               | シドニー
網際網路                           | internet    | 互联网             | インターネット
網路                               | network     | 网络               | ネットワーク
首頁                               | homepage    | 主页               | ホームページ
電腦                               | computer    | 计算机             | コンピューター
資料庫                             | database    | 数据库             | データベース
資訊                               | information | 信息               | インフォメーション

Search-Result Mining

• Goal: Improve translation coverage for diverse queries

• Ideas
– Chi-square test: co-occurrence relation
– Context-vector analysis: context information

• Context-vector similarity measure

• Weighting scheme: TF*IDF

  w_i = ( f(t_i, d) / max_j f(t_j, d) ) × log(N / n_i)

  where f(t_i, d) is the frequency of t_i in search-result page d, N is the total number of Web pages, and n_i is the number of pages including t_i

• Chi-square similarity measure, from a 2-way contingency table

       t   ~t
  s    a    b
  ~s   c    d

  S_χ2(s, t) = N (ad - bc)² / [(a + b)(a + c)(b + d)(c + d)]

• Context-vector similarity measure (cosine):

  S_CV(s, t) = Σ_{i=1}^{m} w_{s_i} w_{t_i} / ( √(Σ_{i=1}^{m} w_{s_i}²) √(Σ_{i=1}^{m} w_{t_i}²) )
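Both similarity measures above are simple enough to sketch directly; the context vectors are represented here as term→weight dicts, an assumption of this sketch:

```python
import math

def chi_square_sim(a, b, c, d):
    """S_x2(s, t) from the 2-way contingency table:
    a = pages containing both s and t, b = s only, c = t only, d = neither."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def context_vector_sim(ws, wt):
    """Cosine similarity of two TF*IDF context vectors (dicts term -> weight)."""
    dot = sum(w * wt.get(term, 0.0) for term, w in ws.items())
    norm = math.sqrt(sum(w * w for w in ws.values())) * \
           math.sqrt(sum(w * w for w in wt.values()))
    return dot / norm if norm else 0.0
```

The chi-square score rewards pairs whose co-occurrence counts deviate strongly from independence; the cosine score rewards pairs whose surrounding contexts in search-result pages look alike.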

Workshop on Web Mining Technology and Applications (Dec. 13, 2006)

Panel

Web Mining: Recent Development and Trends

Prof. Vincent S. Tseng (曾新穆)
Department of Computer Science and Information Engineering, National Cheng Kung University

Main Categories of Web Mining

• Web content mining

• Web usage mining

• Web structure mining

Web Content Mining

• Trends
– Deep web mining
– Semantic web mining
– Vertical search
– Web multimedia content mining
• Web image/video search
• Web image/video annotation/classification/clustering
• Web multimedia content filtering

– Example: YouTube

• Integration with web log mining

Web Usage Mining

• Developed techniques
– Mining of frequent usage patterns
• Association rules, sequential patterns, traversal patterns, etc.

• Trends
– Personalization
– Recommendation
• Web ads
– Incorporation of content semantics/ontology
– Consideration of temporality
– Extension to mobile web applications
– Multidiscipline integration

Problems: Under-utilization of Clickstream Data

• Shop.org: U.S.-based visits to retail Web sites exceeded 10% of total Internet traffic for the first time ever on Thanksgiving, 2004

• Top sites: eBay, Amazon.com, Dell.com, Walmart.com, BestBuy.com, and Target.com

• Aberdeen Group:

– 70% of site companies use Clickstream data only for basic website management!

Challenges for Clickstream Data Mining (Arun Sen et al., Communications of the ACM, Nov. 2006)

• Problems with data
– Data incompleteness
– Very large data size
– Messiness in the data
– Integration problems with enterprise data

• Too many analytical methodologies
– Web metric-based methodologies
– Basic marketing metric-based methodologies
– Navigation-based methodologies
– Traffic-based methodologies

• Data analysis problems
– Across-dimension analysis problems
– Timeliness of data mining under very large data size
– Determination of useful/actionable analysis under thousands of metrics

Web Information Extraction: The Issues for Unsupervised Approaches

Dr. Chia-Hui Chang (張嘉惠)
Department of Computer Science and Information Engineering,

National Central University, Taiwan

(Talk given at the 2006 Workshop on Web Mining Technology and Trends (網路探勘技術與趨勢研討會))

Outline

• Web Information Extraction
– The key to web information integration

• Three Dimensions
– Task definition
– Automation degree
– Technology

• Focus on the template-page IE task
– Issues for record-level IE
– Techniques for solving these issues

Introduction

• The coverage of Web information is very wide and diverse
– The Web has changed the way we obtain information.
– Information search on the Web is not enough anymore.
– The need for Web information integration is stronger than ever (both for businesses and individuals).
– Understanding Web pages and discovering valuable information from them is called Web content mining.
– Information extraction is one of the keys to web content mining.

Web Information Integration

• From information search to information extraction, to information mapping

1. Focused crawling / Web page gathering
• Information search

2. Information (data) extraction
• Discovering structured information from input

3. Schema matching
• With a unified interface / single ontology

Three Dimensions to See IE

• Task Definition
– Input (unstructured free texts, semi-structured Web pages)
– Output targets (record-level, page-level, site-level)

• Automation Degree
– Programmer-involved, annotation-based, or annotation-free approaches

• Techniques
– Learning algorithm: specific-to-general or general-to-specific
– Rule type: regular expression rules vs. logic rules
– Deterministic finite-state transducers vs. probabilistic hidden Markov models

IE from Nearly-structured Documents: Multiple-record Web Pages
(Examples: a Google search result page, Amazon.com book pages)

IE from Nearly-structured Documents: Single-record Pages

IE from Semi-structured Documents

Ungrammatical snippets

A publication list Selected articles

Information Extraction From Free Texts

Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

IE

NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..

[Excerpted from Cohen & McCallum's talk.]

Information Extraction From Free Texts

Information Extraction = segmentation + classification + association + clustering

As a family of techniques:


NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..

[Excerpted from Cohen & McCallum's talk.]
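The association step (linking NAME, TITLE, and ORGANIZATION slots found in the same sentence) can be illustrated with a toy regex extractor. The pattern, input format, and output schema here are purely illustrative; real IE systems use learned rules or sequence models rather than a single hand-written regex:

```python
import re

# Toy single-slot extractor for "NAME, TITLE of ORG" appositives.
# The pattern is an assumption of this sketch, not a published IE rule.
PATTERN = re.compile(
    r"([A-Z][a-z]+ [A-Z][a-z]+), (founder|CEO|VP) of (?:the )?([A-Z][\w ]+)")

def extract(text):
    """Return one filled slot row per matched appositive phrase."""
    return [{'NAME': m[0], 'TITLE': m[1], 'ORGANIZATION': m[2]}
            for m in PATTERN.findall(text)]
```

Running it over the sample sentence about Richard Stallman yields one row of the NAME/TITLE/ORGANIZATION table; phrases with other shapes ("Microsoft VP Bill Veghte") need additional patterns, which is why hand-crafted rules do not scale.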

Dimension 1: Task Definition - Input

Dimension 1: Task Definition - Output

• Attribute level (single-slot)
– Named entity extraction, concept annotation

• Record level
– Relations between slots

• Page level
– All data embedded in a dynamic page

• Site level
– All information about a web site

Template Page Generation & Extraction

• Generation/Encoding: a CGI program fills the template T with data x from the database to produce the output pages, CGI(T, x)

• Extraction/Decoding: a reverse engineering of the output pages to recover the data

(Figure: Database → CGI(T, x), using Template (T) → Output Pages)

Dimension 2: Automation Degree

• Programming-based
– For programmers

• Supervised learning
– A bunch of labeled examples

• Semi-supervised learning / active learning
– Interactive wrapper induction

• Unsupervised learning
– Mostly for template pages only

Tasks vs. Automation Degree

• High automation degree (unsupervised)
– Template page IE

• Semi-automatic / interactive
– Semi-structured document IE

• Low automation degree (supervised)
– Free text IE

Dimension 3: Technologies

• Learning technology
– Supervised: rule generalization, hypothesis testing, statistical modeling
– Unsupervised: pattern mining, clustering

• Features used
– Plain text information: tokens, token class, etc.
– HTML information: DOM tree path, sibling, etc.
– Visual information: font, style, position, etc.

• Rule types (expressiveness of the rules)
– Regular expressions, first-order logic rules, HMM models

Issues for Unsupervised Approaches

• For record-level extraction
1. Data-rich section discovery
2. Record boundary (separator) mining
3. Schema detection & data annotation

• For page-level extraction
– Schema detection: differentiate template tokens from data tokens

(Figure: a multi-record result page annotated with its data-rich section, record boundaries, and attributes)

Some Related Works on Unsupervised Approaches

• Record-level
– IEPAD [Chang and Liu, WWW2001]
– DeLa [Wang and Lochovsky, WWW2003]
– DEPTA [Zhai and Liu, WWW2005]
– ViPER [Simon and Lausen, CIKM2005]
– ViNT [Zhao et al., WWW2005]

• Page-level
– RoadRunner [Crescenzi, VLDB2001]
– EXALG [Arasu and Garcia-Molina, SIGMOD2003]
– MSE [Zhao et al., VLDB2006]

Issue 1: Data-Rich Section Discovery

• Comparing a normal page with a no-result page (MSE [Zhao et al., VLDB2006])

• Comparing two normal pages (ViNT [Zhao et al., WWW2005])
– Locate static text lines, e.g. "Books", "Related Searches", "Narrow or Expand Results", "Showing", "Results", …

Issue 1: Data-Rich Section Discovery (Cont.)

• Similarity between two adjacent leaf nodes

• 1-dimension clustering

• Pitch Estimation

[Papadakis et al., SAINT2005]

Issue 2: Record Boundary Mining

• String Pattern Mining • Tree Pattern Mining

<html><body><b>T</b><ol><li><b>T</b>T<b>T</b>T</li><li><b>T</b>T<b>T</b></li></ol></body></html>

<P><A>T</A><A>T</A> T</P><P><A>T</A>T</P>

<P><A>T</A>T</P> <P><A>T</A>T</P>

DeLa [Wang and Lochovsky, WWW2003] DEPTA [Zhai and Liu, WWW2005]

IEPAD [Chang and Liu, WWW2001]
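The string-pattern idea above can be sketched minimally: encode a page as its tag sequence (text collapsed to a single token "T") and count repeated contiguous tag n-grams as candidate record patterns. This is a simplification in the spirit of pattern-mining approaches such as IEPAD, not the published algorithm (which uses PAT trees and pattern alignment):

```python
import re
from collections import Counter

def tag_sequence(html):
    """Encode a page as its HTML tag sequence; any text becomes token 'T'."""
    tokens = []
    for tag, text in re.findall(r"(<[^>]+>)|([^<]+)", html):
        tokens.append(tag.lower() if tag else 'T')
    return tokens

def repeated_patterns(tokens, min_len=2, max_len=6):
    """Count contiguous tag n-grams; patterns occurring >= 2 times are
    candidate record boundaries (min_len/max_len are sketch parameters)."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return [(p, c) for p, c in counts.most_common() if c >= 2]
```

On the two-record example from the slide, the full record pattern `<p><a>T</a>T</p>` shows up twice, exactly the repetition these systems exploit.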

Issue 2: Record Boundary Mining (Cont.)

• Finding repeated separators from visually encoded content lines

• Heuristics
– A line following an HR-LINE
– A unique line in a block that starts with a number
– The line in a block with the smallest position code (only one)
– The line following a BLANK line is the first line

• Visual cues

ViPER [Simon and Lausen, CIKM05]; ViNT [Zhao et al., WWW2005]

Issue 3: Data Schema Detection

• Alignment of the multiple records found – Handling missing attributes, multiple-value attributes– String alignment or tree alignment– Examining two records at a time

• Differentiate template from data tokens with some assumptions– Tag tokens are considered part of templates– Text lines are usually part of data except for static text

lines

• Similar to the problem of page-level IE tasks

Page-level IE: EXALG

• Identifying static markers (tag & word tokens) from multiple pages
– Occurrence vector for each token

• Differentiating token roles
– By DOM tree path
– By position in the EC class

• Equivalence class (EC)
– Group tokens with the same occurrence vector

• LFECs form the template
– e.g. <1,1,1,1>: {<html>, <body>, <table>, </table>, </body>, </html>}

[Arasu and Garcia-Molina, SIGMOD 2003]
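The equivalence-class grouping can be sketched as follows; this simplifies EXALG's first step to plain per-page occurrence counts, without the role differentiation described above:

```python
from collections import defaultdict

def equivalence_classes(pages):
    """Group tokens by their occurrence vector across pages.
    `pages` is a list of token lists, one per sample page
    (a simplified sketch of EXALG's ECGM step, not the full algorithm)."""
    vectors = defaultdict(list)
    vocab = set(t for page in pages for t in page)
    for tok in sorted(vocab):
        # occurrence vector: how often the token appears in each page
        vec = tuple(page.count(tok) for page in pages)
        vectors[vec].append(tok)
    return dict(vectors)
```

Tokens that occur the same number of times on every page (e.g. the vector <1,1,1,1> in the slide) cluster together; large such classes are strong template evidence, while tokens with varying counts are likely data.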

Critical point: tags are not as easy to differentiate as the text lines used in [Zhao et al., VLDB2006].

On the use of techniques

• From supervised to unsupervised approaches

• From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)

• From two-page summarization (MSE) to multiple-page summarization (EXALG)

Summary

• Content of this talk
– Web information extraction
– Three dimensions
– Focus on the IE task for template pages
• Issues for unsupervised approaches
• Techniques for solving these issues

• Content not in this talk– Probabilistic model for free text IE tasks

Personal Vision

• From information search to information integration

• Better UI for information integration– Information collection: focused crawling– Information extraction– Schema matching and integration

• Not only for business but also for individuals

References – Record Level

• C.-H. Chang, S.-C. Lui, IEPAD: Information Extraction based on Pattern Discovery, WWW01

• B. Liu, R. Grossman and Y. Zhai, Mining Data Records in Web Pages, SIGKDD03

• Y. Zhai, B. Liu. Web Data Extraction Based on Partial Tree Alignment, WWW05

• K. Simon and G. Lausen, ViPER: Augmenting Automatic Information Extraction with Visual Perceptions, CIKM05

• H. Zhao, W. Meng, V. Raghavan, and C. Yu, Fully Automatic Wrapper Generation for Search Engines, WWW05

References – Page Level & Survey

• A. Arasu, H. Garcia-Molina, Extracting Structured Data from Web Pages, SIGMOD03

• V. Crescenzi, G. Mecca, P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites, VLDB01

• H. Zhao, W. Meng, and C. Yu, Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages, VLDB06

• A. Laender, B. Ribeiro-Neto, A. da Silva, J. Teixeira. A Brief Survey of Web Data Extraction Tools. ACM SIGMOD Record02.

• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, A Survey of Web

Information Extraction Systems, IEEE TKDE06.

Taxonomic Information Integration: Challenges and Applications

Cheng-Zen Yang (楊正仁)
Department of Computer Science and Engineering

Yuan Ze University

[email protected]

Outline

• Introduction

• Problem statement

• Integration approaches– Flattened catalog integration– Hierarchical catalog integration

• Applications

• Conclusions and future work

Introduction

• As the Internet develops rapidly, the number of on-line Web pages has become very large.
– Many Web portals offer taxonomic information (catalogs) to facilitate information search [AS2001].

• These catalogs may need to be integrated if Web portals are merged.
– B2B electronic marketplaces bring together many online suppliers and buyers.

• An integrated Web catalog service can help users
– gain more relevant and organized information in one catalog, and
– save much time surfing among different Web catalogs.

B2C e-commerce: Amazon

The taxonomic information integration problem

• Taxonomic information integration is more than a simple classification task.

• When implicit source information is exploited, the integration accuracy can be greatly improved.

• Past studies have shown that the naïve Bayes classifier, SVMs, and the maximum entropy model enhance the accuracy of Web catalog integration in a flattened catalog integration structure.

The problem statement (1/2)

• Flattened catalog integration
– The source catalog S, containing categories S1, S2, …, Sm, is to be integrated into the destination catalog D, consisting of categories D1, D2, …, Dn.

(Figure: documents S11, S12, …, S1k in source categories S1, …, Sm are integrated into destination categories D1, …, Dn alongside existing documents D11, D12, …, D1k)

The problem statement (2/2)

• Hierarchical catalog integration

(Figure: example hierarchical catalogs. Catalog S: category S1 with URLs f, g; category S2 with URLs h, i. Catalog D: category D1 with URLs a, b; category D2 with URLs b, c; category D3 with URLs d, e.)

Integration Approachesfor Flattened Catalogs

The enhanced naïve Bayes approach

• The pioneering work [AS2001]
– They exploit the implicit source information and improve the integration accuracy.
– Naïve Bayes approach
– Enhanced naïve Bayes approach

  Pr(C_i | d) = Pr(C_i) Pr(d | C_i) / Pr(d)                (naïve Bayes)

  Pr(C_i | d, S) = Pr(C_i | S) Pr(d | C_i) / Pr(d | S)     (enhanced naïve Bayes)

where d is a test document in the source catalog, C_i a category in the destination catalog, and S a category in the source catalog
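A minimal sketch of the classification rule: plain naïve Bayes picks argmax_i Pr(C_i) Π Pr(w | C_i), and the enhanced model swaps the flat prior Pr(C_i) for the source-aware prior Pr(C_i | S). The denominators can be dropped since they are constant across destination categories. The `cats` data layout is hypothetical:

```python
def classify(d_terms, cats, source_prior=None):
    """Pick argmax_i prior(C_i) * prod_w Pr(w | C_i) over destination
    categories.  `cats` maps category -> (Pr(C_i), {word: Pr(w | C_i)});
    passing `source_prior` (a dict category -> Pr(C_i | S)) turns this
    into the enhanced naive Bayes rule of [AS2001].  Layout hypothetical."""
    best, best_score = None, -1.0
    for ci, (prior, word_probs) in cats.items():
        p = (source_prior or {}).get(ci, prior)
        for w in d_terms:
            p *= word_probs.get(w, 1e-6)  # tiny floor for unseen words
        if p > best_score:
            best, best_score = ci, p
    return best
```

The same document can flip category once the source category's history is taken into account, which is exactly how the enhanced model uses implicit source information.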

Probabilistic enhancement and topic restriction

• NB and SVM [TCCL2003]

• Probabilistic enhancement:

  PE(x) = argmax_{v_t} Pr(v_t | x) Pr(v_t | s) / Pr(v_t)

• Topic restriction

where x is a test document in the source catalog, v_t the label of a class in the destination catalog, and s the class label of x in the source catalog

(Figure: topic restriction example. Source catalog S: category S1 with URLs f, g; category S2 with URLs h, i. Destination catalog D after integration: D1 with URLs a, f; D2 with URLs b, f; D3 with URLs d, e.)

The pseudo relevancefeedback approach

• Iterative-Adapting SVM [CHY2005]

An Application Example

Searching for multi-lingual news articles

• Many Web portals provide monolingual news integration services.

• Unfortunately, users cannot effectively find the related news in other languages.

The basic idea

• Web portals have grouped related news articles.

• These articles should be about the same main story.

• Can we discover these mappings?

Techniques in our current work

• Machine translation

• Taxonomy integration

Mapping Finding

Taxonomy integration

• The cross-training process [SCG2003]
– Makes better inferences about label assignments in another taxonomy

(Figure: a 1st SVM and a 2nd SVM trained on English news features and Chinese news features, linked by semantically overlapped features, produce English-Chinese news category mappings)

Mapping decision

• The SVM-BCT classifiers then calculate the positively mapped ratios as the mapping score (MSi) to predict the semantic overlapping. [YCC2006]

• The mapping score MSi of the mapping Si → Dj

• Then we can rank the mappings according to their scores.
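The slide defines MSi only as a positively-mapped ratio, so a minimal sketch looks like this; the boolean-list representation of per-document predictions is an assumption of the sketch:

```python
def mapping_score(predictions):
    """MS_i for a candidate mapping S_i -> D_j: the ratio of S_i's
    documents the classifier maps into D_j ('positively mapped').
    `predictions` holds one boolean per document in S_i
    (hypothetical representation)."""
    return sum(predictions) / len(predictions) if predictions else 0.0

def rank_mappings(scores):
    """Rank candidate category mappings by mapping score, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Ranking all candidate (Si, Dj) pairs by this score then yields the most semantically overlapped category mappings first.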

Performance evaluation

• NLP resources– Standard Segmentation Corpus from ACLCLP

• 42023 segmented words

– Bilingual wordlists (version 2.0) from Linguistic Data Consortium (LDC)

• Chinese-to-English version 2 (ldc2ce) with about 120K records

• English-to-Chinese (ldc2ec) with 110K records

Experimental datasets

• Properties
– News reports in the international news category of Google News, Taiwan and U.S. versions
– May 10, 2005 - May 23, 2005
– 20 news event categories per day
– Chinese-to-English: 46.9 MB
– English-to-Chinese: 80.2 MB
– 29,182 news stories

Conclusions and Future Work

Conclusions

• Taxonomic information integration is an emerging issue for Web information mining.

• New approaches for flattened catalog integration and hierarchical catalog integration are still needed.

• Our approaches are a first step toward taxonomic information integration.

Future work

• Taxonomy alignment– Heterogeneous catalog integration (Jung 2006)

• Incorporation of more conceptual information
– WordNet, Sinica BOW, etc.

• Evaluation on other classifiers– EM, ME, etc.

References

• [AS2001] Agrawal, R., Srikant, R.: On Integrating Catalogs. Proc. the 10th WWW Conf. (WWW10), (May 2001) 603–612
• [BOYAPATI2002] Boyapati, V.: Improving Hierarchical Text Classification Using Unlabeled Data. Proc. the 25th Annual ACM Conf. on Research and Development in Information Retrieval (SIGIR'02), (Aug. 2002) 363–364
• [CHY2005] Chen, I.-X., Ho, J.-C., Yang, C.-Z.: An Iterative Approach for Web Catalog Integration with Support Vector Machines. Proc. of Asia Information Retrieval Symposium 2005 (AIRS2005), (Oct. 2005) 703–708
• [DC2000] Dumais, S., Chen, H.: Hierarchical Classification of Web Content. Proc. the 23rd Annual ACM Conf. on Research and Development in Information Retrieval (SIGIR'00), (Jul. 2000) 256–263
• [HCY2006] Ho, J.-C., Chen, I.-X., Yang, C.-Z.: Learning to Integrate Web Catalogs with Conceptual Relationships in Hierarchical Thesaurus. Proc. the 3rd Asia Information Retrieval Symposium (AIRS 2006), (Oct. 2006) 217–229
• [JOACHIMS1998] Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proc. the 10th European Conf. on Machine Learning (ECML'98), (1998) 137–142
• [JUNG2006] Jung, J. J.: Taxonomy Alignment for Interoperability Between Heterogeneous Digital Libraries. Proc. the 9th Int'l Conf. on Asian Digital Libraries (ICADL 2006), (Nov. 2006) 274–282
• [KELLER1997] Keller, A. M.: Smart Catalogs and Virtual Catalogs. In Ravi Kalakota and Andrew Whinston, editors, Readings in Electronic Commerce. Addison-Wesley (1997)
• [KKL2002] Kim, D., Kim, J., Lee, S.: Catalog Integration for Electronic Commerce through Category-Hierarchy Merging Technique. Proc. the 12th Int'l Workshop on Research Issues in Data Engineering: Engineering e-Commerce/e-Business Systems (RIDE'02), (Feb. 2002) 28–33
• [MLW2003] Marron, P. J., Lausen, G., Weber, M.: Catalog Integration Made Easy. Proc. the 19th Int'l Conf. on Data Engineering (ICDE'03), (Mar. 2003) 677–679
• [RR2001] Rennie, J. D. M., Rifkin, R.: Improving Multiclass Text Classification with the Support Vector Machine. Tech. Report AI Memo AIM-2001-026 and CCL Memo 210, MIT (Oct. 2001)
• [SCG2003] Sarawagi, S., Chakrabarti, S., Godbole, S.: Cross-Training: Learning Probabilistic Mappings between Topics. Proc. the 9th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, (Aug. 2003) 177–186
• [SH2001] Stonebraker, M., Hellerstein, J. M.: Content Integration for e-Commerce. Proc. of the 2001 ACM SIGMOD Int'l Conf. on Management of Data, (May 2001) 552–560
• [SLN2003] Sun, A., Lim, E.-P., Ng, W.-K.: Performance Measurement Framework for Hierarchical Text Classification. Journal of the American Society for Information Science and Technology (JASIST), Vol. 54, No. 11, (June 2003) 1014–1028
• [TCCL2003] Tsay, J.-J., Chen, H.-Y., Chang, C.-F., Lin, C.-H.: Enhancing Techniques for Efficient Topic Hierarchy Integration. Proc. the 3rd Int'l Conf. on Data Mining (ICDM'03), (Nov. 2003) 657–660
• [WTH2005] Wu, C.-W., Tsai, T.-H., Hsu, W.-L.: Learning to Integrate Web Taxonomies with Fine-Grained Relations: A Case Study Using Maximum Entropy Model. Proc. of Asia Information Retrieval Symposium 2005 (AIRS2005), (Oct. 2005) 190–205
• [YCC2006] Yang, C.-Z., Chen, C.-M., Chen, I.-X.: A Cross-Lingual Framework for Web News Taxonomy Integration. Proc. the 3rd Asia Information Retrieval Symposium (AIRS 2006), (Oct. 2006) 270–283
• [YL1999] Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proc. the 22nd Annual ACM Conf. on Research and Development in Information Retrieval, (Aug. 1999) 42–49
• [ZADROZNY2002] Zadrozny, B.: Reducing Multiclass to Binary by Coupling Probability Estimates. In: Dietterich, T. G., Becker, S., Ghahramani, Z. (eds): Advances in Neural Information Processing Systems 14 (NIPS 2001). MIT Press (2002)
• [ZL2004WWW] Zhang, D., Lee, W. S.: Web Taxonomy Integration using Support Vector Machines. Proc. WWW2004, (May 2004) 472–481
• [ZL2004SIGIR] Zhang, D., Lee, W. S.: Web Taxonomy Integration through Co-Bootstrapping. Proc. SIGIR'04, (July 2004) 410–417

Mining in the Middle: From Search to Integration on the Web

Kevin C. Chang

Joint with the UIUC and Cazoodle teams

(Figure: Mining sits between Search and Integration)

To Begin With:

What is “the Web”? Or: How do search engines view the Web?

Version 0.1: "Web is a SET of PAGES."

What have you been searching lately?

But,…

Structured Data: prevalent but ignored!

Version 2.1 (our view): "Web is 'Distributed Bases' of 'Data Entities'."

Challenges on the Web come in a "dual": getting access to the structured information!

Kevin's 4 quadrants: (Figure: axes Access vs. Structure, spanning the Deep Web and the Surface Web)

We are inspired: from search to integration, mining in the middle!

(Figure: the same Access/Structure quadrants over the Deep Web and Surface Web, with Mining bridging Search and Integration)

Challenge of the Deep Web:

MetaQuerier: Holistic Integration over the Deep Web

Access: How to Get There?

The previous Web: Search used to be “crawl and index”

The current Web: Search must eventually resort to integration

MetaQuerier: Exploring and integrating the deep Web

Explorer
• source discovery
• source modeling
• source indexing

Integrator
• source selection
• schema integration
• query mediation

FIND sources, then QUERY sources, backed by a "db of dbs" and a unified query interface
(Example sources: Amazon.com, Cars.com, 411localte.com, Apartments.com)

The challenge: how to deal with "deep" semantics across a large scale?

"Semantics" is the key in integration!
• How to understand a query interface?
– Where is the first condition? What's its attribute?
• How to match query interfaces?
– What does "author" on this source match on that one?
• How to translate queries?
– How to ask this query on that source?

Survey the frontier before going to the battle.

• Challenge reassured:
– 450,000 online databases
– 1,258,000 query interfaces
– 307,000 deep web sites
– 3-7 times increase in 4 years

• Insight revealed:
– Web sources are not arbitrarily complex
– "Amazon effect": convergence and regularity naturally emerge

“Amazon effect” in action…

Attributes converge in a domain!

Condition patterns converge even across domains!

Search moves on to integration. Don't believe me? See what Google has to say…

DB People: Buckle Up!

To embrace the burgeoning of structured data on the Web.

Challenge of the Surface Web:

WISDM: Holistic Search over the Surface Web

Structure: What to look for?

Are we searching for what we want?

Challenge of the surface Web: despite all the glorious search engines…

What have you been searching lately?

• What is the email of Marc Snir?
• What is Marc Snir's research area?
• Who are Marc Snir's coauthors?
• What are the phones of CS database faculty?
• How much is "Canon PowerShot A400"?
• Where is SIGMOD 2006 to be held?
• When is the due date of SIGMOD 2006?
• Find PDF files of "SIGMOD 2006"?

Regardless of what you want, you are searching for pages… NO!

Your creativity is amazing: A few examples

• WSQ/DSQ at Stanford
– Uses page counts to rank term associations

• QXtract at Columbia
– Generates keywords to retrieve documents useful for extraction

• KnowItAll at Washington
– Both ideas in one framework

• And there must be many I don't know yet…

Time to distill to build a better “mining” engine?

What is an "entity"? Your target of information, or anything.

• Phone number

• Email address

• PDF

• Image

• Person name

• Book title, author, …

• Price (of something)

We take an entity view of the Web:

How different is “entity search”?How to define such searches?

Let's motivate by contrasting page retrieval with entity search…

Consider the entire process: Page Retrieval

1. Input: pages.
2. Criteria: content keywords.
3. Scope: each page itself.
4. Output: one page per result.

(Example query: "Marc Snir")

Entity search is thus different: Entity Search

1. Input: probabilistic entities.
2. Criteria: contextual patterns.
3. Scope: holistic aggregates.
4. Output: associative results.
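The "holistic aggregates" point can be sketched concretely: instead of ranking individual pages, aggregate the extraction evidence for each candidate entity across the whole corpus. The triple-based input format and the tie-breaking rule are assumptions of this sketch:

```python
from collections import defaultdict

def entity_scores(matches):
    """Rank candidate entities by total evidence across all pages:
    score(e) = sum of per-page match confidences.  `matches` is an
    iterable of (page_id, entity, confidence) triples produced by
    some extractor (hypothetical format)."""
    scores = defaultdict(float)
    seen_pages = defaultdict(set)
    for page, entity, conf in matches:
        scores[entity] += conf
        seen_pages[entity].add(page)
    # rank by total evidence, breaking ties by breadth of support
    return sorted(scores,
                  key=lambda e: (scores[e], len(seen_pages[e])),
                  reverse=True)
```

An entity supported weakly on many pages can outrank one supported strongly on a single page, which is the sense in which the scope is holistic rather than per-page.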

What are technical

challenges?

Or, how to write (reviewer-friendly) papers?

More issues…

• Tagging/merging of basic entities
– Application-driven tagging
– The Web's redundancy will alleviate accuracy demands.

• Powerful pattern language
– Linguistic; visual

• Advanced statistical analysis
– Correlation; sampling

• Scalable query processing
– Do the new components scale?

Promises of the Concepts

• From page-at-a-time to entity-tuple-at-a-time
– Getting directly to target info and evidence

• From IR to a mining engine
– Not only page retrieval but also construction

• From offline to online Web mining and integration
– Enables large-scale ad-hoc mining over the Web

• From the Web to controlled corpora
– Enhances not only efficiency but also effectiveness

• From passive to active application-driven indexing
– Enables mining applications

Conclusion: Mining in just the middle!

Dual challenges:
– Getting access to the deep Web.
– Getting structure from the surface Web.

Central techniques:
– Holistic mining for both search and integration.

What will such a mining engine be?

You tell me! Students' imagination knows no bounds.