Reviews Crawler (Detection, Extraction & Analysis)
FOSS Practicum By: Syed Ahmed & Rakhi Gupta
April 28, 2010
Overview
• Pipeline: Web → Extracted data → Data we need
• Analysis of extracted data using UIMA
INTRODUCTION
• Analysis of user reviews & extraction of meaningful information
http://www.amazon.com/review/RAL8ABGFOK5J4/ref=cm_cr_rdp_perm
• Apache Tika with Maven integration
• Toolkit for extracting content and metadata from different kinds of file formats
• Detection and extraction of metadata & structured text content using existing parser libraries
• Semantic Analysis using UIMA
Integrating our project with Maven
• pom.xml: the fundamental unit of work in Maven
• Handles project dependencies and installs plugins automatically
• Contains configuration details
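As a sketch, a minimal pom.xml for a project like this might declare the Tika and NekoHTML dependencies as below; the groupId/artifactId coordinates are the real ones published to Maven Central, but the project name and version numbers are illustrative, not taken from the actual project.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>reviews-crawler</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <!-- Tika for detection & extraction; version is illustrative -->
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>0.7</version>
    </dependency>
    <!-- CyberNeko for cleaning raw HTML into well-formed XML -->
    <dependency>
      <groupId>net.sourceforge.nekohtml</groupId>
      <artifactId>nekohtml</artifactId>
      <version>1.9.14</version>
    </dependency>
  </dependencies>
</project>
```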
Main Phases
The three main phases are:
• Preparing the input for extraction
• Detection & extraction
• Semantic analysis
Phase I - Preparing Input (Components)
• Using CyberNeko HTML parser
It is an HTML parser built on the Xerces Native Interface (XNI) of the Xerces XML parser. It fixes common HTML "mistakes", doing such things as adding missing parent elements, automatically closing elements, and handling mismatched end tags.
Pipeline: Plain HTML → CyberNeko HTML parser → XML
Phase II – Detection & Extraction (Components)
• Autodetect parser
  o Takes a Tika configuration file as input
• AmazonDetector
  o Takes the output of CyberNeko from the previous phase, along with configuration files that define the XPath expressions.
  o Executes the XPaths iteratively on a node list of elements and separates metadata and content for their respective evaluation.
  o Content and metadata can be specified in the config file.
• Behind the scenes: Apache Tika, XStream, SLF4J, JGraphT, TestNG
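The XPath-execution step can be sketched with the JDK's built-in javax.xml.xpath API. The markup, class names, and XPath expressions below are illustrative stand-ins, not the project's actual configuration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathSketch {
    // Well-formed markup, as produced by the CyberNeko cleanup phase (sample content).
    static final String XML = "<html><body>"
            + "<span class=\"title\">Great book</span>"
            + "<div class=\"review\">Loved it.</div>"
            + "</body></html>";

    static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
    }

    // "Metadata" XPath: pulls the review title out of the node tree.
    public static String extractTitle() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate("//span[@class='title']/text()",
                parse(), XPathConstants.STRING);
    }

    // "Content" XPath: collects the review bodies and counts them.
    public static int countReviews() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//div[@class='review']",
                parse(), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractTitle()); // Great book
        System.out.println(countReviews()); // 1
    }
}
```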
Phase II continued...
• XStream – a simple library to serialize objects to XML and back again.
• We read the configuration file ParserConfig.xml from disk; it corresponds to a class in our project called ParserConfig.
• Convert an object to XML using xstream.toXML(Object obj);
• Convert XML back to an object using xstream.fromXML(String xml);
• An instance of ParserConfig is a direct representation of the XML config file that can be used programmatically.
SLF4J
• SLF4J (Simple Logging Facade for Java) serves as a simple facade or abstraction for various logging frameworks, e.g. java.util.logging, log4j and logback, allowing the end user to plug in the desired logging framework at deployment time.
• Used for logging the metadata and content-handler output to the console.
Phase II continued…
• JGraphT – a free Java graph library that provides mathematical graph-theory objects and algorithms.
• The XPaths from ParserConfig are serialized into a nested tree that corresponds to their exact order in the XML file.
• Graph algorithms are used to depth-first search the tree, yielding the XPaths in the correct order.
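The project does this traversal with JGraphT; as a stdlib-only sketch of the same idea, a depth-first walk over a nested tree of XPaths (hypothetical paths, not the real config) visits each path before its descendants:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class XPathTreeOrder {
    // Parent -> children, mirroring the nesting in the config file (illustrative paths).
    static final Map<String, List<String>> TREE = new LinkedHashMap<>();
    static {
        TREE.put("/html", List.of("/html/head", "/html/body"));
        TREE.put("/html/head", List.of("/html/head/title"));
        TREE.put("/html/body", List.of("/html/body/div"));
    }

    // Iterative depth-first traversal: each XPath is emitted before its descendants.
    public static List<String> depthFirst(String root) {
        List<String> order = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            order.add(node);
            List<String> children = TREE.getOrDefault(node, List.of());
            // Push children in reverse so the first child is visited first.
            for (int i = children.size() - 1; i >= 0; i--) stack.push(children.get(i));
        }
        return order;
    }

    public static void main(String[] args) {
        // Prints the XPaths in document order:
        // [/html, /html/head, /html/head/title, /html/body, /html/body/div]
        System.out.println(depthFirst("/html"));
    }
}
```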
Phase III
• Semantic Analysis using UIMA (Unstructured Information Management Architecture)
Building blocks of UIMA
• Analysis Engine
  o A program that analyzes artifacts and infers information from them.
  o Constructed from building blocks called annotators.
• Annotators
  o Components that contain the analysis logic.
  o Analyze an artifact and create additional metadata about it.
  o Produce results in the form of typed feature structures.
• CAS
  o Represents all these feature structures, including annotations.
  o Provides shared access to the artifact and the current analysis (metadata).
• JCas
  o Java interface to the CAS.
  o Represents each feature structure as a Java object (setter/getter methods).
• Type System
  o Schema/class model for the CAS.
  o Defines the types of objects and their properties (features) that may be instantiated in the CAS.
UIMA Walkthrough
• To extract meaningful information, we plug analysis components called annotators into UIMA.
• An annotator needs an analysis engine descriptor, which provides its configuration parameters, data structures, input and output data types, and the resources the annotator uses.
• All the data produced by annotators, or exchanged between annotator components, is defined in the UIMA type system. The type system is part of the analysis engine descriptor file.
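A type system declaration inside such a descriptor might look like the fragment below. The element names follow UIMA's resourceSpecifier schema, but the type name and description are hypothetical examples, not the project's actual types:

```xml
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>ReviewTypeSystem</name>
  <types>
    <typeDescription>
      <!-- A hypothetical annotation type for dates found in review text -->
      <name>org.example.DateAnnotation</name>
      <description>A date mentioned in a review.</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>
```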
UIMA Walkthrough contd…
• JCasGen is used to create a direct Java representation of your type system. Each type corresponds to a separate Java class.
• You then create annotators that analyze the information and, on a match, record an annotation in the JCas with additional metadata.
• Components to analyze can be defined using regular expressions. For example, a regex for dates:
  ^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$
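The date regex above can be exercised on its own with the JDK's java.util.regex, independent of UIMA (the class and method names below are just for illustration):

```java
import java.util.regex.Pattern;

public class DateRegex {
    // The date pattern from the slide, with backslashes escaped for a Java string literal.
    static final Pattern DATE = Pattern.compile(
            "^(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$");

    public static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDate("2010-04-28")); // true: valid YYYY-MM-DD date
        System.out.println(isDate("2010-13-01")); // false: month 13 is rejected
    }
}
```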
Applications
• The UIMA Architecture and Framework
• The Avatar project provides an easy-to-use web framework for constructing and configuring UIMA annotators to solve particular annotation tasks.
• TALES – multimedia mining and translation of broadcast news (TV) and news web sites.
• Automating customer satisfaction analysis
• Text-mining projects at IBM's Tokyo Research Lab
• IBM Research is participating as a partner in the SAPIR project (Search in Audio-Visual Content Using Peer-to-Peer Information Retrieval). This European Union project is using UIMA as an integrating platform.
Future Aspects
• Can be used for crawling any website
• Perform deeper semantic analysis of the review content
• Extensive testing
• Explore UIMA with different annotations
References
• UIMA SDK User's Guide and Reference: http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference.pdf
• An Extension of the Vector Space Model for Querying XML Documents via XML Fragments: http://xml.coverpages.org/CarmelFragments.pdf
• Effective website crawling through website analysis: http://doi.acm.org/10.1145/1135777.1136005
• XPath Leashed: http://doi.acm.org/10.1145/1456650.1456653
• http://www.slf4j.org/docs.html
• http://xstream.codehaus.org/tutorial.html
• http://jgrapht.sourceforge.net/
• http://nekohtml.sourceforge.net/
• Efficient algorithms for evaluating XPath over streams: http://doi.acm.org/10.1145/1247480.1247512
Questions??
Thank You!!