Reviews Crawler (Detection, Extraction & Analysis)
FOSS Practicum By: Syed Ahmed & Rakhi Gupta
April 28, 2010
Overview
• Pipeline: Web → Extracted data → Data we need
• Analysis of extracted data using UIMA
INTRODUCTION
• Analysis of user reviews & extraction of meaningful information
http://www.amazon.com/review/RAL8ABGFOK5J4/ref=cm_cr_rdp_perm
• Apache Tika with Maven integration
• Toolkit for extracting content and metadata from different kinds of file formats
• Detection and extraction of metadata & structured text content using existing parser libraries
• Semantic Analysis using UIMA
Integrating our project with Maven
• pom.xml: the fundamental unit of work in Maven
• Handles project dependencies and installs plugins automatically
• Contains configuration details
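As a sketch, a minimal pom.xml for a project like this might declare the Tika and NekoHTML dependencies as below; the groupId/artifactId coordinates are the real ones published to Maven Central, but the project name and version numbers are illustrative, not taken from the actual project.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>reviews-crawler</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <!-- Tika for detection & extraction; version is illustrative -->
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>0.7</version>
    </dependency>
    <!-- CyberNeko for cleaning raw HTML into well-formed XML -->
    <dependency>
      <groupId>net.sourceforge.nekohtml</groupId>
      <artifactId>nekohtml</artifactId>
      <version>1.9.14</version>
    </dependency>
  </dependencies>
</project>
```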
Main Phases
The three main phases are:
• Preparing the input for extraction
• Detection & extraction
• Semantic analysis
Phase I - Preparing Input (Components)
• Using CyberNeko HTML parser
It is an HTML parser built on the Xerces Native Interface (XNI) of the Xerces XML parser. It fixes common HTML "mistakes", doing such things as adding missing parent elements, automatically closing elements, and handling mismatched end tags.
Pipeline: Plain HTML → CyberNeko HTML parser → XML
Phase II – Detection & Extraction (Components)
• Autodetect parser
  o Takes a Tika configuration file as input
• AmazonDetector
  o Takes the output of CyberNeko from the previous phase, along with configuration files that define the XPath expressions.
  o Executes the XPaths iteratively on a node list of elements and separates metadata and content for their respective evaluation.
  o Content and metadata can be specified in the config file.
• Behind the scenes: Apache Tika, XStream, SLF4J, JGraphT, TestNG
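The XPath-execution step can be sketched with the JDK's built-in javax.xml.xpath API. The markup, class names, and XPath expressions below are illustrative stand-ins, not the project's actual configuration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathSketch {
    // Well-formed markup, as produced by the CyberNeko cleanup phase (sample content).
    static final String XML = "<html><body>"
            + "<span class=\"title\">Great book</span>"
            + "<div class=\"review\">Loved it.</div>"
            + "</body></html>";

    static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
    }

    // "Metadata" XPath: pulls the review title out of the node tree.
    public static String extractTitle() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (String) xpath.evaluate("//span[@class='title']/text()",
                parse(), XPathConstants.STRING);
    }

    // "Content" XPath: collects the review bodies and counts them.
    public static int countReviews() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//div[@class='review']",
                parse(), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractTitle()); // Great book
        System.out.println(countReviews()); // 1
    }
}
```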
Phase II continued...
• XStream – a simple library to serialize objects to XML and back again.
• We read the configuration file ParserConfig.xml from disk; it corresponds to a class in our project called ParserConfig.
• Convert an object to XML using xstream.toXML(Object obj);
• Convert XML back to an object using xstream.fromXML(String xml);
• An instance of ParserConfig is a direct representation of the XML config file that can be used programmatically.
SLF4J
• SLF4J (Simple Logging Facade for Java) serves as a simple facade or abstraction for various logging frameworks, e.g. java.util.logging, log4j and logback, allowing the end user to plug in the desired logging framework at deployment time.
• Used for logging the metadata and content-handler output to the console.
Phase II continued…
• JGraphT – a free Java graph library that provides mathematical graph-theory objects and algorithms.
• The XPaths from ParserConfig are serialized into a nested tree that corresponds to their exact order in the XML file.
• Graph algorithms are used to depth-first search the tree, yielding the XPaths in the correct order.
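The project does this traversal with JGraphT; as a stdlib-only sketch of the same idea, a depth-first walk over a nested tree of XPaths (hypothetical paths, not the real config) visits each path before its descendants:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class XPathTreeOrder {
    // Parent -> children, mirroring the nesting in the config file (illustrative paths).
    static final Map<String, List<String>> TREE = new LinkedHashMap<>();
    static {
        TREE.put("/html", List.of("/html/head", "/html/body"));
        TREE.put("/html/head", List.of("/html/head/title"));
        TREE.put("/html/body", List.of("/html/body/div"));
    }

    // Iterative depth-first traversal: each XPath is emitted before its descendants.
    public static List<String> depthFirst(String root) {
        List<String> order = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            order.add(node);
            List<String> children = TREE.getOrDefault(node, List.of());
            // Push children in reverse so the first child is visited first.
            for (int i = children.size() - 1; i >= 0; i--) stack.push(children.get(i));
        }
        return order;
    }

    public static void main(String[] args) {
        // Prints the XPaths in document order:
        // [/html, /html/head, /html/head/title, /html/body, /html/body/div]
        System.out.println(depthFirst("/html"));
    }
}
```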
Phase III
• Semantic Analysis using UIMA (Unstructured Information Management Architecture)
Building blocks of UIMA
• Analysis Engine
  o A program that analyzes artifacts and infers information from them.
  o Constructed from building blocks called annotators.
• Annotators
  o Components that contain the analysis logic.
  o Analyze an artifact and create additional metadata about it.
  o Produce results in the form of typed feature structures.
• CAS
  o Represents all these feature structures, including annotations.
  o Provides shared access to the artifact and the current analysis (metadata).
• JCas
  o Java interface to the CAS.
  o Represents each feature structure as a Java object (setter/getter methods).
• Type System
  o Schema/class model for the CAS.
  o Defines the types of objects and their properties (features) that may be instantiated in the CAS.
UIMA Walkthrough
• To extract meaningful information, we plug analysis components called annotators into UIMA.
• An annotator needs an analysis engine descriptor, which provides its configuration parameters, data structures, input and output data types, and the resources the annotator uses.
• All the data produced by annotators, or exchanged between annotator components, is defined in the UIMA type system. The type system is part of the analysis engine descriptor file.
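A type system declaration inside such a descriptor might look like the fragment below. The element names follow UIMA's resourceSpecifier schema, but the type name and description are hypothetical examples, not the project's actual types:

```xml
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>ReviewTypeSystem</name>
  <types>
    <typeDescription>
      <!-- A hypothetical annotation type for dates found in review text -->
      <name>org.example.DateAnnotation</name>
      <description>A date mentioned in a review.</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>
```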
UIMA Walkthrough contd…
• JCasGen is used to create a direct Java representation of your type system. Each type corresponds to a separate Java class.
• You then create annotators that analyze the information and, on a match, record an annotation in the JCas with additional metadata.
• Components to analyze can be defined using regular expressions. For example, a regex for dates:
  ^(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$
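The date regex above can be exercised on its own with the JDK's java.util.regex, independent of UIMA (the class and method names below are just for illustration):

```java
import java.util.regex.Pattern;

public class DateRegex {
    // The date pattern from the slide, with backslashes escaped for a Java string literal.
    static final Pattern DATE = Pattern.compile(
            "^(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])$");

    public static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDate("2010-04-28")); // true: valid YYYY-MM-DD date
        System.out.println(isDate("2010-13-01")); // false: month 13 is rejected
    }
}
```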
Applications
• The UIMA Architecture and Framework
• The Avatar project provides an easy-to-use web framework for constructing and configuring UIMA annotators to solve particular annotation tasks.
• TALES – multimedia mining and translation of broadcast news (TV) and news web sites.
• Automating customer satisfaction analysis
• Text-mining projects at IBM's Tokyo Research Lab
• IBM Research is participating as a partner in the SAPIR project (Search in Audio-Visual Content Using Peer-to-Peer Information Retrieval). This European Union project is using UIMA as an integrating platform.
Future Aspects
• Can be used for crawling any website
• Perform deeper semantic analysis of the review content
• Extensive testing
• Explore UIMA with different annotations
References
• UIMA SDK User's Guide and Reference: http://dl.alphaworks.ibm.com/technologies/uima/UIMA_SDK_Users_Guide_Reference.pdf
• An Extension of the Vector Space Model for Querying XML Documents via XML Fragments: http://xml.coverpages.org/CarmelFragments.pdf
• Effective website crawling through website analysis: http://doi.acm.org/10.1145/1135777.1136005
• XPath Leashed: http://doi.acm.org/10.1145/1456650.1456653
• http://www.slf4j.org/docs.html
• http://xstream.codehaus.org/tutorial.html
• http://jgrapht.sourceforge.net/
• http://nekohtml.sourceforge.net/
• Efficient algorithms for evaluating XPath over streams: http://doi.acm.org/10.1145/1247480.1247512
Questions??
Thank You!!