Chapter 7: Web Content Mining
Presented by: Qi Jia
University of North Texas, DSCI 5240, Fall 2012, Graduate Presentation, Option A
Date: 12-4-12
Slides modified from the 2008 Jones and Bartlett Publishers, Inc. version of Building an Intelligent Web: Theory and Practice, by Rajendra Akerkar and Pawan Lingras



OUTLINES
- Introduction
- Crawlers
- Queries
- Search Engine

INTRODUCTION
1. Web content mining
2. Uses of Web-content mining techniques
3. Problems with the web data
4. Two approaches of web-content mining

INTRODUCTION: First Two Topics
- Web-content mining techniques are used to discover useful information from content on the web.
- Some of the web content is generated dynamically using queries to database management systems.
- Other web content may be hidden from general users.

INTRODUCTION: Uses of Web-content mining techniques
Web content:
- Textual
- Audio
- Video
- Still images
- Metadata
- Hyperlinks

INTRODUCTION: Problems with the web data
- Distributed data
- Large volume
- Unstructured data
- Redundant data
- Quality of data
- Extreme percentage of volatile data
- Varied data

INTRODUCTION: Two approaches of web-content mining
1. Agent-based: software agents perform the content mining.
2. Database-oriented: view the Web data as belonging to a database.

CRAWLERS
- WebCrawler
- Context Graph
- Focused Crawler

CRAWLERS: Crawling process
WebCrawler
- A computer program that navigates the hypertext structure of the web.

- Builds an index by visiting a number of pages and then replaces the current index.
- Begins with a group of URLs.
- Proceeds breadth-first or depth-first.
- Extracts more URLs.

Numerous crawlers
- Problem of redundancy
- Web partition: one robot per partition

Context Graph
- Focused crawling has proposed the use of context graphs, which in turn led to the context focused crawler (CFC).
- The CFC performs crawling in two steps.
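The crawling steps above (begin with a group of URLs, proceed breadth-first or depth-first, extract more URLs) can be sketched as follows. This is a minimal illustration, not the book's implementation: `TOY_WEB`, `toy_fetch_links`, and `crawl_breadth_first` are hypothetical names, and the "web" is an in-memory dictionary so that no network access is needed.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
# A real crawler would fetch each page over HTTP and parse its links.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def toy_fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return TOY_WEB.get(url, [])


def crawl_breadth_first(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from a group of seed URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # guards against redundant visits
    visited = []                  # pages in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # frontier.pop() would give depth-first order
        visited.append(url)
        for link in fetch_links(url):  # extract more URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl_breadth_first(["a.example/"], toy_fetch_links)
print(order)
```

The `seen` set is one simple way to address the redundancy problem within a single crawler; coordinating several crawlers over web partitions is outside this sketch.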

CRAWLERS: Focused Crawler
1. Focused crawler: generally recommended for use due to the large size of the Web; visits pages related to topics of interest.
2. Two major parts: the focused crawler structure consists of two major parts, the distiller and the hypertext classifier.
3. Priority-based structure: the pages that the crawler visits are selected using a priority-based structure, managed by the priority that the classifier and the distiller associate with pages.
4. Documents: sample documents are identified and classified based on a hierarchical classification tree; these documents are used as the seed documents to begin the focused crawling.

SEARCH ENGINE
1. Examples of search engine
2. Components to a search engine
3. Search engine mechanism
4. Responsibilities of Search Engines

SEARCH ENGINE: Components
Basic components to a search engine:
- The spider: gathers new or updated information on Internet websites.
- The index: used to store information about several websites.
- The search software: performs searching through the huge index in an effort to generate an ordered list of useful search results.

SEARCH ENGINE: Examples
Search engine   URL
AltaVista       www.altavista.com
Excite          www.excite.com
Google          www.google.com
Infoseek        www.infoseek.com
Lycos           www.lycos.com

Each search engine uses a spider or crawler that crawls the Web, hunting for new or updated Web pages to store in an index.

SEARCH ENGINE: Mechanism
- The generic structure of all search engines is basically the same.
- However, the search results differ from search engine to search engine for the same search terms.

SEARCH ENGINE: Responsibilities
- Document collection: choose the documents to be indexed.
- Document indexing: represent the content of the selected documents; frequently two indices are maintained.
- Searching: translate the user's information need into a query; retrieval.
- Document and query management: present the outcome; virtual collection.
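The indexing and searching responsibilities listed above, along with the index and search software components, can be illustrated with a small inverted index. This is only a sketch under simple assumptions: `DOCUMENTS`, `build_index`, and `search` are hypothetical names, tokens are whitespace-separated words, and ranking by summed term counts is a stand-in for whatever scoring a real search engine uses.

```python
from collections import defaultdict

# Hypothetical document collection chosen for indexing.
DOCUMENTS = {
    "doc1": "web content mining discovers useful information from web content",
    "doc2": "a crawler navigates the hypertext structure of the web",
    "doc3": "search engines store information about pages in an index",
}


def build_index(documents):
    """Document indexing: map each term to {doc_id: term count}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Searching and retrieval: rank documents by summed term counts."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


index = build_index(DOCUMENTS)
results = search(index, "web content")
print(results)
```

Because the index is built once and then only queried, the search step never rescans the documents themselves, which is the point of keeping an index between the spider and the search software.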

QUERIES: Phases of Queries
1. The first level involves the user formulating the information need into a question or a list of terms, using experiences and vocabulary, and entering it into the search engine.
2. On the next level, the search engine must translate the words, with possible spelling errors, into processing tokens.
3. On the third level, the search engine must use the processing tokens to search the document database and retrieve the appropriate documents.
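The three levels above can be sketched end to end as follows. The names `DOCUMENTS`, `VOCABULARY`, `to_processing_tokens`, and `retrieve` are hypothetical, and using `difflib.get_close_matches` from the Python standard library to absorb spelling errors is an assumption made for illustration; the slides do not say how a search engine produces its processing tokens.

```python
import difflib

# Hypothetical document database and the vocabulary derived from it.
DOCUMENTS = {
    "doc1": "focused crawler visits pages related to topics of interest",
    "doc2": "the index stores information about several websites",
}
VOCABULARY = sorted({word for text in DOCUMENTS.values() for word in text.split()})


def to_processing_tokens(user_query):
    """Level 2: turn raw words, possibly misspelled, into known tokens."""
    tokens = []
    for word in user_query.lower().split():
        match = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
        if match:  # words with no close vocabulary entry are dropped
            tokens.append(match[0])
    return tokens


def retrieve(tokens):
    """Level 3: return ids of documents that contain every token."""
    if not tokens:
        return []
    return [
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if all(token in text.split() for token in tokens)
    ]


# Level 1: the user formulates the need as a list of terms (with typos).
user_query = "focussed crawlr"
tokens = to_processing_tokens(user_query)
print(tokens, retrieve(tokens))
```

The `cutoff` value of 0.8 is an arbitrary choice for this example; a lower value tolerates more spelling errors at the cost of more false matches.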

Together, the three levels above form a three-tier process of translating the user's need into a search engine query.

QUERIES: Types of Queries
- Boolean queries: Boolean logic queries connect words in the search using operators such as AND or OR.
- Natural language: in natural language queries, the user frames the query as a question or a statement.
- Thesaurus queries: in a thesaurus query, the user selects the term from a set of terms predetermined by the retrieval system.
- Fuzzy queries: fuzzy queries reflect no specificity.
- Term searches: the most common type of query on the Web is when a user provides a few words or phrases for the search.
- Probabilistic queries: probabilistic queries refer to the way in which the IR system retrieves documents according to relevancy.

Thank you for your attention!
