Web Mining for Unknown Term Translation
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering
[email protected]
http://myweb.ncku.edu.tw/~whlu
Research Problems
• Difficulties in automatic construction of multilingual translation lexicons
– Techniques: parallel/comparable corpora
– Bottlenecks: lack of diverse/multilingual resources
• Difficulties in query translation for cross-language information retrieval (CLIR)
– Techniques: bilingual dictionary / machine translation / parallel corpora
– Bottlenecks: multiple-sense / short / diverse / unknown queries
• Challenges: Web queries are often
• Short: 2-3 words (Silverstein et al. 1998)
• Diverse: wide-scoped topics
• Unknown (out of vocabulary): 74% are unavailable in CEDICT, a Chinese-English electronic dictionary containing 23,948 entries
– E.g.
• Proper names: 愛因斯坦 (Einstein), 海珊 (Hussein)
• New terminology: 嚴重急性呼吸道症候群 (SARS), 院內感染 (nosocomial infection)
Cross-Language Information Retrieval
• Query in a source language retrieves relevant documents in target languages
• Pipeline: source query → query translation → target translation → information retrieval → target documents
• Example queries: SARS, 愛因斯坦 (Einstein), 老年癡呆症 (senile dementia), National Palace Museum
Difficulties in Web Query Translation Using Machine Translation
English source query: National Palace Museum
Chinese machine translation: 全國宮殿博物館 (a literal word-for-word rendering)
Research Paradigm
• Web mining over the Internet: anchor-text mining and search-result mining feed term-translation extraction
• The extracted pairs form a live translation lexicon
• Applications of the new approach: cross-language information retrieval and cross-language Web search
Anchor-Text Mining with Probabilistic Inference Model
• Conventional translation model: P(t|s) = P(s,t) / P(s)
• Probabilistic inference over the Web pages u_1, …, u_n pointed to by anchor texts:
  P(s,t) = Σ_{i=1..n} P(s,t|u_i) P(u_i)
• Asymmetric translation models:
  P(s,t|u_i) ≈ P(s|u_i) P(t|s,u_i)  or  P(s,t|u_i) ≈ P(t|u_i) P(s|t,u_i)
• Symmetric model with link information:
  P(s,t) ≈ Σ_{i=1..n} (1/2) [P(s|u_i) P(t|s,u_i) + P(t|u_i) P(s|t,u_i)] P(u_i)
  ≈ Σ_{i=1..n} P(s|u_i) P(t|u_i) P(u_i) under the independence assumption
• Page authority: P(u_i) = L(u_i) / Σ_{j=1..n} L(u_j), where L(u) = |in-link(u)|, the number of u's in-links
• The factors P(s|u_i) P(t|u_i) capture co-occurrence in anchor texts; P(u_i) weights pages by authority; P(t|s) is the conventional translation model.
Transitive Translation Model for Multilingual Translation
• Notation: s: source term; t: target translation; m: intermediate translation
• Direct Translation Model: P_direct(s,t) = P(s,t), where P(s,t) is given by the probabilistic inference model
• Indirect Translation Model (through an intermediate translation m):
  P_indirect(s,t) = Σ_m P(s,m) P(m,t) / P(m), where P(m) is m's occurrence probability in the corpus
• Transitive Translation Model:
  P_trans(s,t) = P_direct(s,t) if P_direct(s,t) > θ, and P_indirect(s,t) otherwise, where θ is a predefined threshold value
• Example: Sony (English), 新力 (Traditional Chinese), ソニー (Japanese); the English-Japanese pair can be reached by indirect translation through the Chinese intermediate
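A minimal sketch of the threshold-fallback logic. The dictionary layouts for `direct` scores and intermediate-term probabilities are illustrative assumptions, not from the slide:

```python
def transitive_prob(s, t, direct, intermediates, theta=0.1):
    """P_trans(s,t): keep the direct estimate when it is confident,
    otherwise back off to indirect translation via a third language.

    `direct` maps (term, term) pairs to P_direct values; `intermediates`
    maps each candidate intermediate term m to its corpus occurrence
    probability P(m). Both layouts are hypothetical.
    """
    p_direct = direct.get((s, t), 0.0)
    if p_direct > theta:                 # confident direct translation
        return p_direct
    # P_indirect(s,t) = sum_m P(s,m) * P(m,t) / P(m)
    return sum(direct.get((s, m), 0.0) * direct.get((m, t), 0.0) / p_m
               for m, p_m in intermediates.items() if p_m > 0)
```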
Promising Results for Automatic Construction of Multilingual Translation Lexicons

Source term (Traditional Chinese) | English | Simplified Chinese | Japanese
新力 | Sony | 索尼 | ソニー
耐吉 | Nike | 耐克 | ナイキ
史丹佛 | Stanford | 斯坦福 | スタンフォード
雪梨 | Sydney | 悉尼 | シドニー
網際網路 | internet | 互联网 | インターネット
網路 | network | 网络 | ネットワーク
首頁 | homepage | 主页 | ホームページ
電腦 | computer | 计算机 | コンピューター
資料庫 | database | 数据库 | データベース
資訊 | information | 信息 | インフォメーション
Search-Result Mining
• Goal: Improve translation coverage for diverse queries
• Ideas
– Chi-square test: co-occurrence relation
– Context-vector analysis: context information
• Context-vector similarity measure
• Weighting scheme: TF*IDF
  w(t_i, d) = ( f(t_i, d) / max_j f(t_j, d) ) × log(N / n_i)
  where f(t_i, d) is the frequency of term t_i in search-result page d, N is the total number of Web pages, and n_i is the number of pages including t_i.
• Chi-square similarity measure
  S_χ²(s,t) = N (ad − bc)² / [(a+b)(a+c)(b+d)(c+d)]
• 2-way contingency table (page counts)
       t    ~t
  s    a    b
  ~s   c    d
• Context-vector (cosine) similarity measure
  S_CV(s,t) = Σ_i w_{s,i} w_{t,i} / ( √(Σ_i w_{s,i}²) · √(Σ_i w_{t,i}²) )
  where w_{s,i} and w_{t,i} are the TF*IDF weights of the i-th context term for s and t.
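Both similarity measures are straightforward to compute. A sketch, with plain term-weight dictionaries standing in for context vectors:

```python
import math

def chi_square_sim(a, b, c, d):
    """Chi-square similarity from the 2-way contingency table:
    a = pages with both s and t, b = s only, c = t only, d = neither."""
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def context_vector_sim(vs, vt):
    """Cosine similarity of two {context term: tf*idf weight} dicts."""
    dot = sum(w * vt.get(term, 0.0) for term, w in vs.items())
    norm_s = math.sqrt(sum(w * w for w in vs.values()))
    norm_t = math.sqrt(sum(w * w for w in vt.values()))
    return dot / (norm_s * norm_t) if norm_s and norm_t else 0.0
```

A source term and a candidate translation that occur on the same result pages (high chi-square) and share similar context terms (high cosine) are good translation candidates.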
Workshop on Web Mining Technology and Applications (Dec. 13, 2006)
Panel
Web Mining: Recent Development and Trends
Prof. Vincent S. Tseng (曾新穆), Department of Computer Science and Information Engineering, National Cheng Kung University
Web Content Mining
• Trends
– Deep web mining
– Semantic web mining
– Vertical search
– Web multimedia content mining
• Web image/video search
• Web image/video annotation/classification/clustering
• Web multimedia content filtering
– Example: YouTube
• Integration with web log mining
Web Usage Mining
• Developed techniques
– Mining of frequent usage patterns
• Association rules, sequential patterns, traversal patterns, etc.
• Trends
– Personalization
– Recommendation
• Web ads
– Incorporation of content semantics/ontology
– Consideration of temporality
– Extension to mobile web applications
– Multidisciplinary integration
Problems: Under-utilization of Clickstream Data
• Shop.org: U.S.-based visits to retail Web sites exceeded 10% of total Internet traffic for the first time ever on Thanksgiving, 2004
• Leading sites: eBay, Amazon.com, Dell.com, Walmart.com, BestBuy.com, and Target.com
• Aberdeen Group:
– 70% of site companies use clickstream data only for basic website management!
Challenges for Clickstream Data Mining (Arun Sen et al., Communications of the ACM, Nov. 2006)
• Problems with data
– Data incompleteness
– Very large data size
– Messiness in the data
– Integration problems with enterprise data
• Too many analytical methodologies
– Web metric-based methodologies
– Basic marketing metric-based methodologies
– Navigation-based methodologies
– Traffic-based methodologies
• Data analysis problems
– Across-dimension analysis problems
– Timeliness of data mining under very large data size
– Determination of useful/actionable analysis under thousands of metrics
Web Information Extraction: The Issues for Unsupervised Approaches
Dr. Chia-Hui Chang (張嘉惠)
Department of Computer Science and Information Engineering, National Central University, Taiwan
(Talk given at the 2006 Workshop on Web Mining Technology and Trends)
Outline
• Web Information Extraction
– The key to web information integration
• Three Dimensions
– Task definition
– Automation degree
– Technology
• Focus on the template-page IE task
– Issues for record-level IE
– Techniques for solving these issues
Introduction
• The coverage of Web information is very wide and diverse
– The Web has changed the way we obtain information.
– Information search on the Web is not enough anymore.
– The need for Web information integration is stronger than ever (both for businesses and individuals).
– Understanding Web pages and discovering valuable information from them is called Web content mining.
– Information extraction is one of the keys to web content mining.
Web Information Integration
• From information search to information extraction, to information mapping
1. Focused crawling / Web page gathering
• Information search
2. Information (data) extraction
• Discovering structured information from input
3. Schema matching
• With a unified interface / single ontology
Three Dimensions to See IE
• Task definition
– Input: unstructured free texts, semi-structured Web pages
– Output targets: record-level, page-level, site-level
• Automation degree
– Programmer-involved, annotation-based, or annotation-free approaches
• Techniques
– Learning algorithm: specific-to-general or general-to-specific
– Rule type: regular expression rules vs. logic rules
– Deterministic finite-state transducers vs. probabilistic hidden Markov models
Information Extraction From Free Texts
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
IE:
NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..
[Excerpted from Cohen & McCallum's talk]
Information Extraction From Free Texts
Information Extraction = segmentation + classification + association + clustering
As a family of techniques:
NAME             | TITLE   | ORGANIZATION
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | founder | Free Soft..
[Excerpted from Cohen & McCallum's talk]
Dimension 1: Task Definition - Output
• Attribute level (single-slot)
– Named entity extraction, concept annotation
• Record level
– Relation between slots
• Page level
– All data embedded in a dynamic page
• Site level
– All information about a web site
Template Page Generation & Extraction
• Generation/encoding: a CGI program instantiates a template T with data x drawn from a database to produce the output pages
• Extraction/decoding: a reverse engineering task, recovering the template and data from the output pages
Dimension 2: Automation Degree
• Programming-based
– For programmers
• Supervised learning
– A bunch of labeled examples
• Semi-supervised learning / active learning
– Interactive wrapper induction
• Unsupervised learning
– Mostly for template pages only
Tasks vs. Automation Degree
• High automation degree (unsupervised)
– Template page IE
• Semi-automatic / interactive
– Semi-structured document IE
• Low automation degree (supervised)
– Free text IE
Dimension 3: Technologies
• Learning technology
– Supervised: rule generalization, hypothesis testing, statistical modeling
– Unsupervised: pattern mining, clustering
• Features used
– Plain text information: tokens, token class, etc.
– HTML information: DOM tree path, sibling, etc.
– Visual information: font, style, position, etc.
• Rule types (expressiveness of the rules)
– Regular expressions, first-order logic rules, HMM models
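For instance, a regular-expression rule, the simplest of these rule types, applied to a hypothetical templated snippet (both the HTML and the pattern are invented for illustration):

```python
import re

# Hypothetical templated HTML; the pattern plays the role of an
# extraction rule mapping each record to (name, price) slots.
html = "<li><b>Canon A400</b> $199</li><li><b>Nikon S3</b> $249</li>"
rule = re.compile(r"<li><b>(?P<name>[^<]+)</b> \$(?P<price>\d+)</li>")

records = [m.groupdict() for m in rule.finditer(html)]
```

Hand-written rules like this sit at the programmer-involved end of the automation spectrum; the learning approaches above aim to induce such rules instead.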
Issues for Unsupervised Approaches
• For record-level extraction
1. Data-rich section discovery
2. Record boundary (separator) mining
3. Schema detection & data annotation
• For page-level extraction
– Schema detection: differentiate template from data tokens
Some Related Works on Unsupervised Approaches
• Record-level
– IEPAD [Chang and Liu, WWW2001]
– DeLa [Wang and Lochovsky, WWW2003]
– DEPTA [Zhai and Liu, WWW2005]
– ViPER [Simon and Lausen, CIKM2005]
– ViNT [Zhao et al., WWW2005]
• Page-level
– RoadRunner [Crescenzi, VLDB2001]
– EXALG [Arasu and Garcia-Molina, SIGMOD2003]
– MSE [Zhao et al., VLDB2006]
Issue 1: Data-Rich Section Discovery
• Comparing a normal page with a no-result page
• Comparing two normal pages
– Locate static text lines, e.g. "Books", "Related Searches", "Narrow or Expand Results", "Showing", "Results", …
MSE [Zhao et al., VLDB2006]; ViNT [Zhao et al., WWW2005]
Issue 1: Data-Rich Section Discovery (Cont.)
• Similarity between two adjacent leaf nodes
• 1-dimension clustering
• Pitch estimation
[Papadakis et al., SAINT2005]
Issue 2: Record Boundary Mining
• String Pattern Mining • Tree Pattern Mining
<html><body><b>T</b><ol><li><b>T</b>T<b>T</b>T</li><li><b>T</b>T<b>T</b></li></ol></body></html>
<P><A>T</A><A>T</A> T</P><P><A>T</A>T</P>
<P><A>T</A>T</P> <P><A>T</A>T</P>
DeLa [Wang and Lochovsky, WWW2003] DEPTA [Zhai and Liu, WWW2005]
IEPAD [Chang and Liu, WWW2001]
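A brute-force stand-in for the pattern-mining step (IEPAD itself uses a PAT tree; this n-gram count merely illustrates the idea that repeated tag patterns mark record boundaries):

```python
from collections import Counter

def repeated_patterns(tokens, min_len=2, min_count=2):
    """Count every contiguous token n-gram; n-grams repeating at least
    `min_count` times are candidate record patterns."""
    counts = Counter()
    for n in range(min_len, len(tokens) // 2 + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {pat: c for pat, c in counts.items() if c >= min_count}

# Token encoding of a two-record <li> list, text abstracted to "T".
tags = ["<li>", "<b>", "T", "</b>", "T", "</li>",
        "<li>", "<b>", "T", "</b>", "T", "</li>"]
```

The longest repeating pattern, here the full six-token `<li>…</li>` sequence, is the natural record separator.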
Issue 2: Record Boundary Mining (Cont.)
• Finding repeated separators from visually encoded content lines
• Heuristics
– A line following an HR-LINE
– A unique line in a block that starts with a number
– The line in a block with the smallest position code (only one)
– The line following a BLANK line is the first line
• Visual cues
ViPER [Simon and Lausen, CIKM05]; ViNT [Zhao et al., WWW2005]
Issue 3: Data Schema Detection
• Alignment of the multiple records found
– Handling missing attributes and multiple-value attributes
– String alignment or tree alignment
– Examining two records at a time
• Differentiating template from data tokens, with some assumptions
– Tag tokens are considered part of the template
– Text lines are usually part of the data, except for static text lines
• Similar to the problem of page-level IE tasks
Page-level IE: EXALG
• Identifying static markers (tag & word tokens) from multiple pages
– Occurrence vector for each token
• Differentiating token roles
– By DOM tree path
– By position in the EC class
• Equivalence class (EC)
– Groups tokens with the same occurrence vector
• LFECs form the template
– e.g. <1,1,1,1>: {<html>, <body>, <table>, </table>, </body>, </html>}
[Arasu and Garcia-Molina, SIGMOD 2003]
Critical point: tags are not as easy to differentiate as the text lines used in [Zhao et al., VLDB2006]
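The occurrence-vector idea can be sketched in a few lines (simplified: real EXALG also differentiates token roles by DOM path and position, and handles nesting):

```python
from collections import defaultdict

def equivalence_classes(pages):
    """Group tokens by their occurrence vector across pages; tokens
    occurring the same number of times on every page fall into one
    equivalence class, and large frequent classes suggest template text."""
    classes = defaultdict(list)
    vocab = set().union(*pages)
    for tok in vocab:
        vec = tuple(page.count(tok) for page in pages)  # occurrence vector
        classes[vec].append(tok)
    return dict(classes)

# Two toy token streams generated from the same hypothetical template.
pages = [["<html>", "<b>", "Price", "</b>", "10", "</html>"],
         ["<html>", "<b>", "Price", "</b>", "25", "</html>"]]
```

Tokens with vector (1, 1) form the template; the data tokens "10" and "25" fall into singleton classes.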
On the use of techniques
• From supervised to unsupervised approaches
• From string alignment (IEPAD, RoadRunner) to tree alignment (DEPTA, Thresher)
• From two-page summarization (MSE) to multiple-page summarization (EXALG)
Summary
• Content of this talk
– Web information extraction
– Three dimensions
– Focus on the IE task for template pages
• Issues for unsupervised approaches
• Techniques for solving these issues
• Content not in this talk
– Probabilistic models for free-text IE tasks
Personal Vision
• From information search to information integration
• Better UI for information integration
– Information collection: focused crawling
– Information extraction
– Schema matching and integration
• Not only for business but also for individuals
References – Record Level
• C.-H. Chang, S.-C. Lui, IEPAD: Information Extraction based on Pattern Discovery, WWW01
• B. Liu, R. Grossman and Y. Zhai, Mining Data Records in Web Pages, SIGKDD03
• Y. Zhai, B. Liu. Web Data Extraction Based on Partial Tree Alignment, WWW05
• K. Simon and G. Lausen, ViPER: Augmenting Automatic Information Extraction with Visual Perceptions, CIKM05
• H. Zhao, W. Meng, V. Raghavan, and C. Yu, Fully Automatic Wrapper Generation for Search Engines, WWW05
References – Page Level & Survey
• A. Arasu, H. Garcia-Molina, Extracting Structured Data from Web Pages, SIGMOD03
• V. Crescenzi, G. Mecca, P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites, VLDB01
• H. Zhao, W. Meng, and C. Yu, Automatic Extraction of Dynamic Record Sections From Search Engine Result Pages, VLDB06
• A. Laender, B. Ribeiro-Neto, A. da Silva, J. Teixeira. A Brief Survey of Web Data Extraction Tools. ACM SIGMOD Record02.
• C.-H. Chang, M. Kayed, M. R. Girgis, K. Shaalan, A Survey of Web Information Extraction Systems, IEEE TKDE06.
Taxonomic Information Integration: Challenges and Applications
Cheng-Zen Yang (楊正仁)
Department of Computer Science and Engineering, Yuan Ze University
Outline
• Introduction
• Problem statement
• Integration approaches
– Flattened catalog integration
– Hierarchical catalog integration
• Applications
• Conclusions and future work
Introduction
• As the Internet develops rapidly, the number of online Web pages has become very large.
– Many Web portals offer taxonomic information (catalogs) to facilitate information search [AS2001].
• These catalogs may need to be integrated if Web portals are merged.
– B2B electronic marketplaces bring together many online suppliers and buyers.
• An integrated Web catalog service can help users
– gain more relevant and organized information in one catalog, and
– save much time surfing among different Web catalogs.
The taxonomic information integration problem
• Taxonomic information integration is more than a simple classification task.
• When implicit source information is exploited, integration accuracy can be greatly improved.
• Past studies have shown that the Naïve Bayes classifier, SVMs, and the Maximum Entropy model enhance the accuracy of Web catalog integration in a flattened catalog integration structure.
The problem statement (1/2)
• Flattened catalog integration
– The source catalog S, containing categories S1, S2, …, Sm, is to be integrated into the destination catalog D, consisting of categories D1, D2, …, Dn.
(Figure: documents S11, S12, …, S1k of source category S1 are integrated into destination categories D1, D2, …, Dn, which already hold documents D11, D12, …, D1k.)
The problem statement (2/2)
• Hierarchical catalog integration
(Figure: source catalog S with categories S1 = {URLs f, g} and S2 = {URLs h, i}; destination catalog D with categories D1 = {URLs a, b}, D2 = {URLs b, c}, and D3 = {URLs d, e}.)
The enhanced naïve Bayes approach
• The pioneering work [AS2001]
– exploits the implicit source information and improves the integration accuracy
– Naïve Bayes approach:
  Pr(C_i|d) = Pr(C_i) Pr(d|C_i) / Pr(d)
– Enhanced Naïve Bayes approach:
  Pr(C_i|d,S) = Pr(C_i|S) Pr(d|C_i) / Pr(d|S)
– d: test document in the source catalog; C_i: category in the destination catalog; S: the document's category in the source catalog
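A minimal scoring sketch of the enhanced model in log space. The smoothed probability tables (`prior_given_s`, `word_probs`) are assumed precomputed and are purely illustrative:

```python
import math

def enb_score(doc_tokens, cat, prior_given_s, word_probs, floor=1e-6):
    """log Pr(C_i|d,S) up to an additive constant:
    log Pr(C_i|S) + sum over words w in d of log Pr(w|C_i).
    `prior_given_s[cat]` estimates Pr(C_i|S) for the document's source
    category; `word_probs[cat]` holds smoothed Pr(w|C_i) estimates."""
    score = math.log(prior_given_s[cat])
    for w in doc_tokens:
        score += math.log(word_probs[cat].get(w, floor))  # floor unseen words
    return score
```

The destination category maximizing this score is chosen; the source-conditioned prior is what distinguishes the enhanced model from plain Naïve Bayes.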
Probabilistic enhancement and topic restriction
• NB and SVM [TCCL2003]
• Probabilistic enhancement:
  PE(x) = argmax_{v_t} Pr(v_t|x) Pr(v_t|s) / Pr(v_t)
  x: test document in the source catalog; v_t: class label in the destination catalog; s: the class label of x in the source catalog
• Topic restriction
(Figure: topic restriction example. Source catalog S with categories S1 = {URLs f, g} and S2 = {URLs h, i}; destination catalog D with categories D1 = {URLs a, f}, D2 = {URLs b, f}, and D3 = {URLs d, e}.)
Searching for multi-lingual news articles
• Many Web portals provide monolingual news integration services.
• Unfortunately, users cannot effectively find the related news in other languages.
The basic idea
• Web portals have grouped related news articles.
• These articles should be about the same main story.
• Can we discover these mappings?
Taxonomy integration
• The cross-training process [SCG2003]
– To make better inferences about label assignments in another taxonomy
(Figure: two SVMs are cross-trained on English news features and Chinese news features; semantically overlapped features yield English-Chinese news category mappings.)
Mapping decision
• The SVM-BCT classifiers then calculate the positively mapped ratios as the mapping score (MS_i) to predict the semantic overlap [YCC2006].
• The mapping score MS_i of S_i → D_j
• The mappings can then be ranked by their scores.
Performance evaluation
• NLP resources– Standard Segmentation Corpus from ACLCLP
• 42023 segmented words
– Bilingual wordlists (version 2.0) from Linguistic Data Consortium (LDC)
• Chinese-to-English version 2 (ldc2ce) with about 120K records
• English-to-Chinese (ldc2ec) with 110K records
Experimental datasets
• Properties
– News reports in the international news category of the Google News Taiwan and U.S. versions
– May 10, 2005 - May 23, 2005
– 20 news event categories per day
– Chinese-to-English: 46.9 MB
– English-to-Chinese: 80.2 MB
– 29182 news stories
Conclusions
• Taxonomic information integration is an emerging issue for Web information mining.
• New approaches for flattened catalog integration and hierarchical catalog integration are still needed.
• Our approaches are in the first stage for taxonomic information integration.
Future work
• Taxonomy alignment– Heterogeneous catalog integration (Jung 2006)
• Incorporate more conceptual information
– WordNet, Sinica BOW, etc.
• Evaluation on other classifiers– EM, ME, etc.
References
• [AS2001] Agrawal, R., Srikant, R.: On Integrating Catalogs. Proc. the 10th WWW Conf. (WWW10), (May 2001) 603–612
• [BOYAPATI2002] Boyapati, V.: Improving Hierarchical Text Classification Using Unlabeled Data. Proc. the 25th Annual ACM Conf. on Research and Development in Information Retrieval (SIGIR'02), (Aug. 2002) 363–364
• [CHY2005] I.-X. Chen, J.-C. Ho, and C.-Z. Yang.: An iterative approach for web catalog integration with support vector machines. Proc. of Asia Information Retrieval Symposium 2005 (AIRS2005), (Oct. 2005) 703–708
• [DC 2000] Dumais, S., Chen, H.: Hierarchical Classification of Web Content. Proc. the 23rd Annual ACM Conf. on Research and Development in Information Retrieval (SIGIR’00), (Jul. 2000) 256–263
• [HCY2006] J.-C. Ho, I.-X. Chen, and C.-Z. Yang.: Learning to Integrate Web Catalogs with Conceptual Relationships in Hierarchical Thesaurus. Proc. The 3rd Asia Information Retrieval Symposium (AIRS 2006), (Oct. 2006) 217-229
• [JOACHIMS1998] Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proc. the 10th European Conf. on Machine Learning (ECML’98), (1998) 137–142
• [JUNG2006] Jung, J. J.: Taxonomy Alignment for Interoperability Between Heterogeneous Digital Libraries. Proc. The 9th Int’l Conf. on Asian Digital Library (ICADL 2006), (Nov. 2006), 274-282
• [KELLER1997] Keller,A. M.: Smart Catalogs and Virtual Catalogs. In Ravi Kalakota and Andrew Whinston, editors, Readings in Electronic Commerce. Addison-Wesley. (1997)
• [KKL2002] Kim, D., Kim, J., and Lee, S.: Catalog Integration for Electronic Commerce through Category-Hierarchy Merging Technique. Proc. the 12th Int’l Workshop on Research Issues in Data Engineering: Engineering e-Commerce/e-Business Systems (RIDE’02), (Feb. 2002) 28–33
• [MLW 2003] Marron, P. J., Lausen, G., Weber, M.: Catalog Integration Made Easy. Proc. the 19th Int’l Conf. on Data Engineering (ICDE’03), (Mar. 2003) 677–679
• [RR2001] Rennie, J. D. M., Rifkin, R.: Improving Multiclass Text Classification with the Support Vector Machine. Tech. Report AI Memo AIM-2001-026 and CCL Memo 210, MIT (Oct. 2001)
• [SCG2003] Sarawagi, S., Chakrabarti S., Godbole., S.: Cross-Training: Learning Probabilistic Mappings between Topics. Proc. the 9th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, (Aug. 2003) 177–186
• [SH2001] Stonebraker, M. and Hellerstein, J. M.: Content Integration for e-Commerce. Proc. of the 2001 ACM SIGMOD Int’l Conf. on Management of Data, (May 2001) 552–560
• [SLN2003] Sun, A., Lim, E.-P., and Ng, W.-K.: Performance Measurement Framework for Hierarchical Text Classification. Journal of the American Society for Information Science and Technology (JASIST), Vol. 54, No. 11, (June 2003) 1014–1028
• [TCCL2003] Tsay, J.-J., Chen, H.-Y., Chang, C.-F., Lin, C.-H.: Enhancing Techniques for Efficient Topic Hierarchy Integration. Proc. the 3rd Int'l Conf. on Data Mining (ICDM'03), (Nov. 2003) 657–660
• [WTH2005] Wu, C.-W., Tsai, T.-H., and Hsu, W.-L.: Learning to Integrate Web Taxonomies with Fine-Grained Relations: A Case Study Using Maximum Entropy Model. Proc. of Asia Information Retrieval Symposium 2005 (AIRS2005), (Oct. 2005) 190–205
• [YCC2006] C.-Z. Yang, C.-M. Chen, and I.-X. Chen.: A Cross-Lingual Framework for Web News Taxonomy Integration. Proc. The 3rd Asia Information Retrieval Symposium (AIRS 2006), (Oct. 2006), 270-283
• [YL1999] Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proc. the 22nd Annual ACM Conference on Research and Development in Information Retrieval, (Aug. 1999) 42–49
• [ZADROZNY2002] Zadrozny., B.: Reducing Multiclass to Binary by Coupling Probability Estimates. In: Dietterich, T. G., Becker, S., Ghahramani, Z. (eds): Advances in Neural Information Processing Systems 14 (NIPS 2001). MIT Press. (2002)
• [ZL2004WWW] Zhang, D., Lee W. S.: Web Taxonomy Integration using Support Vector Machines. Proc. WWW2004, (May 2004) 472–481
• [ZL2004SIGIR] Zhang, D., Lee W. S.: Web Taxonomy Integration through Co-Bootstrapping. Proc. SIGIR’04, (July 2004) 410–417
Mining in the Middle: From Search to Integration on the Web
Kevin C. Chang
Joint with: the UIUC and Cazoodle teams
(Figure: triangle of Search, Integration, and Mining)
Version 1.1: "Web is a GRAPH of PAGES."
Challenges on the Web come in a "dual": getting access to the structured information!
(Quadrants: Access / Structure; Deep Web / Surface Web)
Kevin's 4-quadrants: Access vs. Structure, Deep Web vs. Surface Web.
We are inspired: from search to integration, with mining in the middle!
Challenge of the Deep Web: MetaQuerier, holistic integration over the Deep Web.
Access: How to get there?
MetaQuerier: Exploring and integrating the deep Web
• Explorer: source discovery, source modeling, source indexing → FIND sources (builds a "db of dbs")
• Integrator: source selection, schema integration, query mediation → QUERY sources (through a unified query interface)
• Example sources: Amazon.com, Cars.com, 411localte.com, Apartments.com
The challenge: how to deal with "deep" semantics across a large scale?
"Semantics" is the key in integration!
• How to understand a query interface?
– Where is the first condition? What's its attribute?
• How to match query interfaces?
– What does "author" on this source match on that one?
• How to translate queries?
– How to ask this query on that source?
Survey the frontier before going to the battle.
• Challenge reassured:
– 450,000 online databases
– 1,258,000 query interfaces
– 307,000 deep web sites
– a 3-7x increase in 4 years
• Insight revealed:
– Web sources are not arbitrarily complex
– "Amazon effect": convergence and regularity naturally emerge
“Amazon effect” in action…
Attributes converge in a domain!
Condition patterns converge even across domains!
Search moves on to integration. Don't believe me? See what Google has to say…
DB People: Buckle Up!
To embrace the burgeoning of structured data on the Web.
Challenge of the Surface Web: WISDM, holistic search over the Surface Web.
Structure: What to look for?
Are we searching for what we want?
Challenge of the surface Web: despite all the glorious search engines…
What have you been searching lately?
• What is the email of Marc Snir?
• What is Marc Snir's research area?
• Who are Marc Snir's coauthors?
• What are the phones of CS database faculty?
• How much is "Canon PowerShot A400"?
• Where is SIGMOD 2006 to be held?
• When is the due date of SIGMOD 2006?
• Find PDF files of "SIGMOD 2006"?
Regardless of what you want, you are searching for pages…NO!
Your creativity is amazing: a few examples
• WSQ/DSQ at Stanford
– uses page counts to rank term associations
• QXtract at Columbia
– generates keywords to retrieve documents useful for extraction
• KnowItAll at Washington
– both ideas in one framework
• And there must be many I don't know yet…
Time to distill to build a better “mining” engine?
What is an "entity"? Your target of information, or anything.
• Phone number
• Email address
• Image
• Person name
• Book title, author, …
• Price (of something)
Consider the entire process: Page Retrieval
1. Input: pages.
2. Criteria: content keywords.
3. Scope: each page itself.
4. Output: one page per result.
Entity search is thus different: Entity Search
1. Input: probabilistic entities.
2. Criteria: contextual patterns.
3. Scope: holistic aggregates.
4. Output: associative results.
More issues…
• Tagging/merging of basic entities?
– Application-driven tagging
– The Web's redundancy will alleviate the accuracy demand.
• Powerful pattern language
– linguistic; visual
• Advanced statistical analysis
– correlation; sampling
• Scalable query processing
– do the new components scale?
Promises of the Concepts
• From page at a time to entity-tuple at a time
– getting directly to target info and evidence
• From IR to a mining engine
– not only page retrieval but also construction
• From offline to online Web mining and integration
– enabling large-scale ad-hoc mining over the Web
• From the Web to a controlled corpus
– enhancing not only efficiency but also effectiveness
• From passive to active application-driven indexing
– enabling mining applications
Conclusion: Mining in just the middle!
Dual challenges:
– Getting access to the deep Web.
– Getting structure from the surface Web.
Central techniques:
– Holistic mining for both search and integration.