
A Question-Answering System Using Apache UIMA

Anindo Mazumdar, [email protected]

Divya Garg, [email protected]

Lakshmi Narayan S, [email protected]

M S Shashanka, [email protected]

Satyam Roy, [email protected]

    International Institute of Information Technology, Bangalore

    URL: https://sourceforge.net/projects/questnanswering/

    Technical Report IIITB-TR-2012-001

    April 2012


    Abstract

In this project we have developed a Question-Answering system which enables a user to enter a query in natural language and provides an answer by looking up relevant documents in its knowledgebase. The process begins with the classification of the question into various types, based on which certain decisions regarding the type of entities to search are taken. The question is then parsed and broken down into tokens. Some of these tokens are then searched for in the knowledgebase using the Apache Solr tool, which returns a specific set of documents containing the keywords. The documents with scores above a particular threshold are then selected for further processing. From this set of documents we retrieve paragraphs which contain the keywords and thus represent the most likely set of answers to the given query. Finally, we rank this set of paragraphs, select the most relevant paragraph, and return it as the answer.

© 2012 Anindo Mazumdar, Divya Garg, Lakshmi Narayan S, M S Shashanka, Satyam Roy. This material is available under the Creative Commons Attribution-Noncommercial-Share Alike License. See http://creativecommons.org/licenses/by-nc-sa/3.0/ for details.

    Acknowledgement

We are extremely grateful to the people who have helped and supported us during the project. Our deepest thanks to our advisors, Prof. Shrisha Rao and Prof. Jaya Sreevalsan Nair, for their continuous encouragement and suggestions throughout the course of this work. It was our pleasure to work under their guidance.


    Contents

1 Introduction

2 Overview
  2.1 Apache UIMA
  2.2 Cell Broadband Engine
  2.3 Solr
  2.4 Project Description
  2.5 Similar Systems

3 Architecture and Design
  3.1 Backend
  3.2 System Architecture
  3.3 Implementation
    3.3.1 User Interface and Question Processing
    3.3.2 Wikipedia File Processing and Posting to Solr
    3.3.3 Querying from Solr and Finding the Relevant Documents
    3.3.4 Passage Retrieval and Displaying the Answer to the User
  3.4 Integration

4 Testing and Performance

5 Conclusion and Future Work

References


    List of Figures

1 UIMA Architecture
2 CPE GUI Interface
3 CBE Architecture
4 Backend KnowledgeBase Creation
5 CPE Architecture
6 Complete Architecture of Question Answering System
7 Front End Of QA System
8 Documents Queried By Solr
9 Output Generated By QA System
10 Performance Comparison Of QA System On X86 and CBE
11 Performance On X86 and CBE Under Similar Constraints
12 Time Taken By Solr To Return Documents
13 Accuracy In Percentage


    List of Tables

1 Performance Comparison Of Both Machines
2 Performance Comparison Under Similar Conditions
3 Time Taken By Solr To Return Documents In Sec
4 Accuracy Of The System


    1 Introduction

Question-Answering (QA) is a field of study concerned with the development of systems to automatically generate answers to questions. The question is normally given in a natural language like English. Accomplishing the task involves using information retrieval (IR) to search natural language documents or a database for an answer, or logical inferencing to reason out an answer based on an already defined set of rules. The objective of IR is to retrieve a relevant set of documents in response to the user query, and it can be adapted to return the relevant passages within those documents as well. A database can be either structured, in which case the task is considerably simplified, or unstructured, in which case some preprocessing is required to derive useful knowledge from it before it can be searched.

QA can be classified into closed domain, which deals with questions in a specific domain, and open domain, which might involve answering questions on common sense or world knowledge. QA as a task relies heavily on processes such as question classification, searching, parsing, and named entity recognition, and thus readily lends itself to parallelization. This can be effectively exploited by an architecture suited for parallel processing, such as the Unstructured Information Management Architecture (UIMA) for content analytics and the Cell Broadband Engine (CBE) for computation.

    For the project, we work with the following assumptions:

The knowledge base for the Question-Answering System will not be restricted, i.e., an open domain will be considered, so that the system is useful to a wide variety of people.

The questions will be restricted to formats like "Who is . . . ?", "Where is . . . ?", and "When did . . . happen?". This restriction is easier to work with; other formats such as true-false questions are computationally hard problems which the system would be unable to solve efficiently [1].


    2 Overview

The project utilizes the Apache Unstructured Information Management Architecture (UIMA) to categorize and analyze data fed to it in the form of documents. Closely integrated with it is Apache Solr, which is used for its powerful indexing and searching capabilities. The project is implemented on the CBE of the PlayStation 3 (PS3), because the PS3's multiple co-processing elements greatly enhance its computing power compared to a single processor [2].

    2.1 Apache UIMA

UIMA applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIMA application might ingest plain text and identify entities, such as persons, places, and organizations, or relations, such as works-for or located-at [3].

    Figure 1: UIMA Architecture

Apache UIMA is an Apache-licensed open source implementation of the UIMA specification. UIMA enables applications to be decomposed into components, for example language identification => language-specific segmentation => sentence boundary detection => entity detection (person/place names, etc.). Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages [4].

    Figure 2: CPE GUI Interface

The data in UIMA passes through an Analysis Engine (AE) whose output is in a structured form, as shown in Figure 1. This structured output is then saved in Common Analysis Structure (CAS) objects. The UIMA Analysis Engine interface provides support for developing and integrating algorithms that analyze unstructured data. Analysis Engines are designed to operate on a per-document basis; their interface handles one CAS at a time.
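To make the component model concrete, the following is a minimal annotator sketch (not the project's actual code): it extends the standard JCasAnnotator_ImplBase base class and marks capitalized words as annotations, roughly the way the Simple Name Recognizer described later tags proper nouns with regular expressions. The regular expression rule here is an illustrative assumption.

    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleNameAnnotator extends JCasAnnotator_ImplBase {
        private static final Pattern CAPITALIZED = Pattern.compile("\\b[A-Z][a-z]+\\b");

        // Called once per CAS, i.e., once per document.
        @Override
        public void process(JCas jcas) throws AnalysisEngineProcessException {
            Matcher m = CAPITALIZED.matcher(jcas.getDocumentText());
            while (m.find()) {
                // A toy rule: treat each capitalized word as a candidate name
                // and add it to the CAS index as a plain annotation.
                new Annotation(jcas, m.start(), m.end()).addToIndexes();
            }
        }
    }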

UIMA provides additional support for applying analysis engines to collections of unstructured data with its Collection Processing Architecture. The Collection Processing Architecture defines additional components for reading raw data formats from data collections, preparing the data for processing by Analysis Engines, executing the analysis, extracting analysis results, and deploying the overall flow in a variety of local and distributed configurations.

The functionality defined in the Collection Processing Architecture is implemented by a Collection Processing Engine (CPE). A CPE includes an Analysis Engine and adds a Collection Reader, a CAS Initializer, and CAS Consumers. The part of the UIMA Framework that supports the execution of CPEs is called the Collection Processing Manager, or CPM. This module can be used to convert unstructured data to structured form and then store it in the CAS structure of UIMA. The GUI interface makes this conversion much easier.

A CPE is executed by a UIMA infrastructure component called the Collection Processing Manager (CPM). The CPM provides a number of services and deployment options that cover instantiation and execution of CPEs, error recovery, and local and distributed deployment of the CPE components.

    2.2 Cell Broadband Engine

The CBE processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), developed jointly by Sony, Toshiba, and IBM. The Cell BE includes one POWER Processing Element (PPE) and eight Synergistic Processing Elements (SPEs). The Cell BE architecture is designed to be well-suited to a wide variety of programming models, and allows for partitioning of work between the PPE and the eight SPEs, arranged as in Figure 3 [5].

The CBE processor is capable of massive floating point processing, optimized for compute-intensive workloads and broadband rich media applications. A high-speed memory controller and a high-bandwidth bus interface are also integrated on-chip. The breakthrough multi-core architecture and ultra high-speed communications capabilities deliver vastly improved real-time response, in many cases 10 times the performance of the latest PC processors [6].


    Figure 3: CBE Architecture

The Cell BE architecture supports multiple operating systems. Its applications vary from gaming systems with almost real-looking graphics, to systems which accelerate visualization and supercomputing applications, to systems for digital media and streaming content at home.

    2.3 Solr

Solr is a popular, open source search platform from the Apache Lucene project. Its major features include powerful full-text search, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geo-spatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites [7].

The tool is quite efficient at searching data based on the given queries. The raw data is first provided to the tool in the form of text or XML files, which get indexed and stored inside it. The indexing is entirely a property of Solr and can be controlled by modifying Solr's internal schema. Once indexing is complete, querying is done by providing keywords as the query. The query can be done on any of the input tags specified while indexing (id, text, title, etc.). The query is of the form *:*, where the first * represents the field on which the query has to be done and the second * represents the keywords for which the documents have to be searched. The tool also supports Boolean queries with AND and OR. When a collection of keywords is provided as a search query to Solr with Boolean AND between the keywords, we get the intersection of the documents which contain all the keywords. This brings down the number of matched documents, which also speeds up further processing.
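For illustration (the field name and the keywords here are assumptions, not taken from the report), such a Boolean AND query could be sent to Solr's standard select handler as:

    http://localhost:8983/solr/select?q=text:(newton+AND+gravity+AND+apple)

which asks Solr to return only those documents whose text field contains all three keywords.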

    2.4 Project Description

Generating an answer to a natural language question requires the creation of a knowledgebase of both open-domain information and world knowledge, which can be queried in the later stages. The process begins with the classification of the question into various types, based on which certain decisions can be taken. The question is then parsed, broken down into tokens, and tagged with certain keywords that essentially describe the question. These keywords are then searched for in the knowledgebase to return relevant sentences or passages. This set of passages is then filtered by applying automated reasoning systems that derive information from both the question and the passages to determine relevance. The resulting set of passages is then ranked on accuracy and the topmost answer is returned to the user.

The information sources that are available are mostly unstructured; thus, in analysing the unstructured content, UIM applications make use of a variety of analysis technologies, including:

    Statistical and rule-based Natural Language Processing (NLP)

    Information Retrieval (IR)

    Machine learning

    Ontologies

    Automated reasoning

    Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)

The bridge from the unstructured world to the structured world is built through the Apache UIMA framework. It supports a varied range of analysis tools and links them to structured information. It provides a run-time environment in which UIMA component implementations can be used and deployed.

    2.5 Similar Systems

IBM Watson has the IBM DeepQA architecture and runs on servers based on IBM POWER7 processors. Initially, Watson was designed to compete with human champions in real time on the American TV show Jeopardy! The show pits three human competitors against each other to answer questions on varied topics, and even has penalties for wrong answers [8]. To compete with human champions, the system should be capable of answering 70 percent of the questions with an accuracy of more than 80 percent in a time frame of less than three seconds [9].

DeepQA is a parallel, probabilistic, evidence-based architecture. The principles involved in DeepQA are massive parallelism, many experts, pervasive confidence estimation, and the integration of shallow and deep knowledge. It is developed using Apache UIMA. All the inter-process communication is managed by UIMA-AS using the open JMS standard. Deployment on POWER7 enables Watson to deliver answers in one to six seconds [10].

Watson can be enhanced to solve societal and business problems. The application areas include diagnosing diseases, handling online technical support questions, parsing vast tracts of legal documents, and driving progress across industries. Watson's ability to process text in rich natural language and respond in the same way holds enormous potential for the future, and will revolutionize the way people accomplish their business and day-to-day tasks [11].

    3 Architecture and Design

The architecture of the system comes in small modules which can be explained as follows:

    3.1 Backend

The backend includes the creation of a knowledgebase which contains the documents that are searched for the answer, in a format that can be used by Solr. To form a reliable and sufficiently large knowledgebase, and to make the system open-domain, we have used the Wikipedia dump file in XML format. The SolrUIMA structure only has the ability to directly index text files into Solr, and thus the XML dump files were further processed and converted into text files. This was done with the help of a tool, WP2TXT, and a Java parser. Finally, the text files were parsed to create the corpus required as the knowledgebase. This forms the first stage of the backend, as shown in Figure 4.

    Figure 4: Backend KnowledgeBase Creation

The text corpus we had was quite unstructured, so we used the Apache UIMA infrastructure to give it a particular format and make it structured. UIMA's various built-in annotators made this work quite easy by directly converting the text into the required structured format with its built-in models. Various Analysis Engines were used to give the documents the maximum structure possible. The analysis engines used include the Simple Token and Sentence Annotator and the Simple Name Recognizer using Regular Expressions. The former tokenized the whole document and annotated it on the basis of sentences; the documents were divided into sentences to form a well-defined structure. The Name Recognizer tags proper nouns, which include names, organizations, cities, etc. This structuring of the data made the documents well defined, which helped us query relevant documents and retrieve the answer quickly. This phase was done completely with the Apache UIMA framework, using the Collection Processing Engine (CPE). The CPE forms an interface for UIMA to convert unstructured data into structured form. It has a GUI which helps in the conversion process; a .sh file in the Apache UIMA SDK launches this interface. It lets us add all the analysis engines for processing and outputs the processed files to the provided destination. It offers many CAS consumer options, of which we used SolrCas to index the input data into Solr directly. The output of the CPE is the CAS structure, which forms the structured form of the data.

    Figure 5: CPE Architecture

Internally, the CPE works with various modules to finally provide the desired output in the form of structured text, as shown in Figure 5. The modules are initialized by the CPE factory interface, which creates an instance of a Collection Processing Engine (CPE). This consists of a Collection Reader that reads in each document, and the Analysis Engine (AE) that performs analysis on it, such as searching for a specific keyword. These structured documents are then posted to Solr directly with the help of the CAS consumer SolrCas. This CAS consumer directly indexes files into Solr and creates the knowledgebase.
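A minimal sketch of driving a CPE programmatically with the standard UIMA API is shown below; the descriptor file name is a hypothetical placeholder for a CPE descriptor that wires together the collection reader, the analysis engines, and the SolrCas consumer.

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.collection.CollectionProcessingEngine;
    import org.apache.uima.collection.metadata.CpeDescription;
    import org.apache.uima.util.XMLInputSource;

    public class RunCpe {
        public static void main(String[] args) throws Exception {
            // Parse the CPE descriptor that declares the reader, AEs, and CAS consumers.
            CpeDescription desc = UIMAFramework.getXMLParser()
                    .parseCpeDescription(new XMLInputSource("cpe.xml")); // hypothetical path
            // The CPE factory instantiates the engine from the descriptor.
            CollectionProcessingEngine cpe =
                    UIMAFramework.produceCollectionProcessingEngine(desc);
            // Start processing the whole collection (runs asynchronously).
            cpe.process();
        }
    }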


    3.2 System Architecture

The whole system architecture and the flow of the project are shown in Figure 6. The complete flow of the project is deployed on the CBE architecture. The working of the system starts with the input, which comes in the form of a user query. This question is then tokenized, and important words called focus words are extracted, which are then passed to Solr as a query. Solr returns the matched documents, which are then collected for further processing. These documents are the response to the Boolean query made to Solr. Solr was chosen to speed up the process as much as possible, as it is an efficient tool for retrieving matched documents out of a huge corpus.

    The architecture can be described in a nutshell as follows:

    Figure 6: Complete Architecture of Question Answering System

Graphical User Interface - It allows the user to input and submit a question to the system, and also displays the relevant answer to the user. It uses Java's Swing framework to display windows and dialog boxes. The entered question is sent through the network to the server for processing.

Question Analyser/Classifier - The question is broken down into tokens, which are then marked up with their particular part-of-speech tags and lemmatized; finally, named entities and noun phrases are extracted. The keywords are analysed to determine the question type (who/what/why/when/which/how), and then a corresponding answer type is assigned; a sketch of this mapping appears after this list. This answer type will be used to narrow down the set of passages in the later stages.

Querying to and Receiving Response from Solr - The keywords are combined to form a Boolean query, which is then fed to Solr. Solr has the capability to perform Boolean queries, and thus returns the intersection of documents that contain all the keywords.

Passage Retrieval - The documents returned from Solr are in the form of sentence-annotated documents, so the relevant passages are then extracted from each of the documents to create a set of candidate answer passages.

Answer Extraction - The list of passages that might match the search criteria is then narrowed down based on the question type, the focus words, and some predefined thresholds, which finally gives us the top 3 relevant passages.

Output - The set of 3 passages is ranked using probabilistic methods, and the most relevant one is returned to the user interface to be displayed.

Knowledgebase - The knowledge base can be considered a collection of information that may be interpreted as a set of facts and rules considered true. It contains natural language documents as well as indexing information on them, which can be queried.
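The answer-type mapping described in the Question Analyser/Classifier item can be pictured with the following minimal sketch; the enum values and the prefix rules are illustrative assumptions, not the project's actual classification tables.

    public class QuestionClassifier {
        enum AnswerType { PERSON, PLACE, DATE, UNKNOWN }

        // Map the leading question word to the entity type expected in the answer.
        static AnswerType classify(String question) {
            String q = question.trim().toLowerCase();
            if (q.startsWith("who"))   return AnswerType.PERSON; // Who is ... ?
            if (q.startsWith("where")) return AnswerType.PLACE;  // Where is ... ?
            if (q.startsWith("when"))  return AnswerType.DATE;   // When did ... happen?
            return AnswerType.UNKNOWN;
        }

        public static void main(String[] args) {
            System.out.println(classify("Who is Isaac Newton?")); // PERSON
        }
    }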

    3.3 Implementation

The implementation of the project followed the architecture step by step and can be described as follows:


    3.3.1 User Interface and Question Processing

This is the main front-end part of the project. It takes the question as input from the user and provides the output in the form of a relevant paragraph, as shown in Figure 7.

    Figure 7: Front End Of QA System

A Java module was written to form this structure, which has a text box to receive the question as input from the user. The question is taken as input and tokenized, and all the words are compared against a list of stopwords. The stopwords list was formed manually, based on some observation and some help from the web. This comparison discards all the commonly used words from the query. The end result of the comparison phase leaves us with a collection of keywords on which the search has to be carried out. After this processing, only the focus words, or relevant words, are left, which are then used for searching for the answer. A button is also provided to initiate the search process, and within a few seconds the answer is reported back to the user, printed in a label.
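A minimal sketch of this tokenize-and-filter step follows; the stopword list here is a tiny illustrative sample, not the manually curated list the report describes.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class FocusWordExtractor {
        private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
                "who", "what", "where", "when", "is", "was", "did", "the", "a", "an", "of"));

        // Tokenize the question and drop stopwords, leaving the focus words.
        static List<String> focusWords(String question) {
            List<String> result = new ArrayList<>();
            for (String w : question.toLowerCase().split("\\W+")) {
                if (!w.isEmpty() && !STOPWORDS.contains(w)) {
                    result.add(w);
                }
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(focusWords("Who is the president of India?"));
            // -> [president, india]
        }
    }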

    3.3.2 Wikipedia File Processing and Posting to Solr

This mainly comprised parsing the XML files from the Wikipedia dump. This was done with the help of a tool named WP2TXT, which converted the XML files into text files, wherein each text file comprised multiple documents. These files were 10 MB each, so they needed further processing. Thus we wrote a Java parser to parse the text files and create a file for each document. This left us with multiple files, each composed of a single document.
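The Java parser itself is not reproduced in the report; the following sketch shows the general idea, under the assumption that articles in each converted text file are separated by some delimiter line (the delimiter and file paths here are hypothetical).

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class DumpSplitter {
        public static void main(String[] args) throws IOException {
            String all = new String(Files.readAllBytes(Paths.get("dump_part1.txt")),
                    StandardCharsets.UTF_8);
            // Hypothetical delimiter between articles in the converted dump.
            String[] docs = all.split("\\n====\\n");
            int i = 0;
            for (String doc : docs) {
                // Write each article out as its own single-document file.
                Path out = Paths.get("corpus", "doc" + (i++) + ".txt");
                Files.createDirectories(out.getParent());
                Files.write(out, doc.getBytes(StandardCharsets.UTF_8));
            }
        }
    }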

Then, these files were used to form the corpus using the UIMA infrastructure. To do this, we implemented the UIMA structure to post the text files into Solr. We used the inbuilt SolrUIMA infrastructure to form the SolrCas, which is the UIMA-compatible CAS structure for Solr. The files were sentence-annotated with the help of the Simple Token and Sentence Annotator, which annotated the document into newline-delimited sentences, with each word delimited by whitespace.

    3.3.3 Querying from Solr and Finding the Relevant Documents

    Figure 8: Documents Queried By Solr

In this section, we used SolrJ class objects, which help in querying Solr based on keywords. We take the keywords extracted from the question provided by the user as input and form a Boolean query out of them, with AND between the keywords. This combined query is passed on to Solr, which returns the intersection of documents matching the keywords, as shown in Figure 8. This feature of Solr, returning results based on a Boolean query, made the work much easier and more efficient. Moreover, Solr has a feature to assign a score to the retrieved documents based on the keyword occurrences. This score is an attribute through which the documents are filtered.

From among the returned documents, we filter out the top documents based on a threshold. This is to reduce the processing time, since the set of documents is large and processing each of them to get relevant paragraphs would make the system very slow.
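A minimal SolrJ sketch of this query-and-filter step is given below. The server URL, the text field, and the score threshold of 1.0 are illustrative assumptions; the report does not state the actual values used.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SolrSearcher {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            // Combine the focus words into a Boolean AND query.
            SolrQuery query = new SolrQuery("text:(newton AND gravity)");
            query.setFields("id", "title", "score"); // ask Solr to return the score too
            query.setRows(50);
            QueryResponse rsp = server.query(query);
            // Keep only documents scoring above the threshold.
            for (SolrDocument doc : rsp.getResults()) {
                float score = (Float) doc.getFieldValue("score");
                if (score > 1.0f) {
                    System.out.println(doc.getFieldValue("id") + " : " + score);
                }
            }
        }
    }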

    3.3.4 Passage Retrieval and Displaying the Answer to the User

During indexing, the documents were passed through a Named Entity Resolver, a POS Tagger, and a Co-Reference Resolver, which tagged the various entities differently. This enables finding the names of people and organizations, and deriving the correct interpretation of the text (the referent). While retrieving the passages from the documents returned by Solr, we used these entities to find the relevant answers. Based on the question type, we searched for the tags and extracted the relevant passages. For this task we took the help of the SPEs of the CBE architecture. The SPEs were assigned the documents which had the relevant paragraphs in them. Each SPE parsed these documents and found the relevant paragraphs based on the keywords; whether a paragraph in a document was considered relevant depended entirely on matching certain criteria. In the end, the SPEs returned a set of paragraphs which were then stored for further processing. As the number of passages returned can vary depending on the query, it had to be limited so that working on them remained feasible. Thus, a score was assigned to the passages, they were ranked, and the top three passages were extracted and returned.

These top three passages had the highest likelihood of containing the answer to the query, so they were scrutinized further to find the best answer among them. Once the relevant passage was obtained, it was displayed to the user in the same interface in which the query was made, as shown in Figure 9.
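A simplified sketch of the scoring-and-ranking step: each candidate paragraph is scored by how many focus words it contains, and the top three are kept. The actual system's criteria were richer, so this counting rule is an assumption for illustration.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class PassageRanker {
        // Rank the candidate passages by score and keep the top three.
        static List<String> topThree(List<String> passages, List<String> focusWords) {
            return passages.stream()
                    .sorted(Comparator.comparingLong(
                            (String p) -> score(p, focusWords)).reversed())
                    .limit(3)
                    .collect(Collectors.toList());
        }

        // Count how many of the focus words occur in the passage.
        static long score(String passage, List<String> focusWords) {
            String lower = passage.toLowerCase();
            return focusWords.stream().filter(lower::contains).count();
        }
    }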

    3.4 Integration

This section comprised binding all the sections together to work as a unit. The whole project was carried out on the CBE, from posting the documents to Solr to retrieving the documents from Solr. The final answer was extracted using the multi-core architecture of the SPEs, wherein the SPEs search for the answer in different documents in parallel.


    Figure 9: Output Generated By QA System

    4 Testing and Performance

The system was tested with different test cases to measure the performance of the system on x86 and CBE machines and to compare them. The systems were run with different inputs and under varying loads. The results obtained were plotted for accuracy and for time taken under various loads.

Figure 10 shows the performance comparison of the QA system on x86 and CBE; Table 1 shows the tabular readings. The red line represents the performance on CBE, whereas the blue line represents the performance on x86. The difference between the times is large because the CBE is limited to only 256 MB of physical memory, while the x86 runs on 3 GB of RAM. The application requires at least 400 MB of RAM (of which a significant amount is needed to load a portion of the knowledgebase), and hence, after crossing the available memory on the CBE, it starts swapping to hard disk, which reduces performance drastically. Thus we got an unfair comparison between the two systems, with a huge margin of difference between the two lines.

To make the comparison a fair one, we balanced the hardware configuration of the x86 system with that of the CBE. We disabled all the processors except one and created a certain load on the system such that the available memory was comparable with the CBE configuration.


    Table 1: Performance Comparison Of Both Machines

No. of Questions    Time on x86 (s)    Time on CBE (s)
1                   8.234              24.242
2                   7.234              23.456
3                   9.133              25.343
4                   11.123             30.353
5                   12.3242            36.353
6                   11.121             35.456
7                   10.122             29.456
8                   11.242             36.467
9                   8.232              28.345
10                  8.242              27.456
11                  15.233             28.456
12                  14.233             32.354
13                  16.312             33.345
14                  13.213             32.675
15                  8.133              31.345
16                  7.786              28.345
17                  11.231             34.456
18                  10.234             32.354
19                  11.897             33.536
20                  13.242             35.467


    Figure 10: Performance Comparison Of QA System On X86 and CBE

We then tested the QA application on both systems with an increasing load of documents to search, and obtained the results shown in Figure 11. The tabular form of the data is presented in Table 2. It is clearly visible that the CBE worked much faster in this case with increasing load, in comparison to the x86, due to parallel processing using the SPEs. Thus, the CBE clearly emerged as the faster machine under equivalent hardware configurations.

    Table 2: Performance Comparison Under Similar Conditions

No. of Documents    Time on x86 (s)    Time on CBE (s)
1000                30.544             25.231
3000                30.954             25.455
5000                34.345             26.453
7000                35.464             30.456
8000                40.342             31.456
10000               43.234             32.354
12000               44.245             32.456


    Figure 11: Performance On X86 and CBE Under Similar Constraints

The Solr tool was used to index and search the knowledgebase. The application queries the instance of the Solr server running on the same system and retrieves results from it. This introduces a certain delay due to inter-process communication mechanisms.

    Table 3: Time Taken By Solr To Return Documents In Sec

No. of Documents    Time taken by Solr to return documents (s)
1000                0.4355
3000                0.6732
5000                0.8954
7000                1.3764
8000                1.8675
10000               2.4689
12000               2.9547

Figure 12 shows the time taken in seconds by Solr to return the documents as a response to the query from the user. The same data is represented in Table 3. From the figure it is clear that as the number of documents indexed increased, the time taken by Solr also increased, but in a linear manner.

    Figure 12: Time Taken By Solr To Return Documents

This is because of the indexing mechanisms of Solr, which make it really fast when it comes to searching.

The project's main goal was to measure the efficiency of the CBE system for the QA application, but we also measured the accuracy of the searches. We tested the system with increasing loads in each test, and Figure 13 shows the results we obtained.

    Table 4: Accuracy Of The System

No. of Documents    Accuracy (%)
1000                20
3000                25
5000                40
7000                54
8000                60
10000               70
12000               72


    Figure 13: Accuracy In Percentage

The accuracy of the answers provided by the QA system increases with an increasing number of documents in the knowledgebase. We were able to reach an accuracy of 80%, i.e., in 80% of the cases the system returned right answers, and in the rest of the cases it either returned a wrong answer or was unable to find one. If the knowledge base is made richer, the accuracy can be increased even further, but it would never reach 100%, as natural language processing has inherent limitations in accuracy.

    5 Conclusion and Future Work

The Question-Answering system was successfully built using UIMA running on a CBE. The multicore architecture of the PS3 gave a good boost to the performance of the application when compared to an x86 architecture. The system was found to return answers with high accuracy when used with a rich and large knowledgebase.

It was observed that for a large knowledgebase the CBE gave better performance, while for a small knowledgebase the x86 architecture yielded better results under similar hardware configurations.

This system deals with who, when, and where types of questions. Future work consists of extending the system to handle questions of type How and What, which require more complex natural language processing techniques. In the current implementation, the models used are restricted to identifying a person, a place, and a date. More models can be trained and incorporated into the application, which can work for restricted domains as well as identify entities other than people, organizations, and dates. This will boost the accuracy further and also increase the efficiency.

    References

[1] T. Pearson, "Parsing The Question," February 2011. [Online]. Available: https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/entry/ibm_watson_how_to_build_your_own_watson_jr_in_your_basement7?lang=en

[2] "UIMA Overview and SDK Setup," version 2.4.0, pp. 1–6, February 2011. [Online]. Available: http://uima.apache.org/d/uimaj-2.4.0/overview_and_setup.pdf

[3] "Apache UIMA (Unstructured Information Management Architecture)," v2.4.0 Release Notes. [Online]. Available: http://uima.apache.org/d/uimaj-2.4.0/RELEASE_NOTES.html

[4] "Apache UIMA (Unstructured Information Management Architecture)," C++ v2.3.0 Release Notes. [Online]. Available: http://archive.apache.org/dist/incubator/uima/RELEASE_NOTES-uimacpp-2.3.0-incubating.html

[5] "Cell Broadband Engine Architecture," version 1.02, pp. 27–30, October 2007. [Online]. Available: http://cell.scei.co.jp/e_download.html

[6] T. Chen, R. Raghavan, J. Dale, and E. Iwata, "Cell Broadband Engine Architecture and its First Implementation," vol. 1, 29 Nov 2005. [Online]. Available: http://www.ibm.com/developerworks/power/library/pa-cellperf/

[7] "Solr Tutorial." [Online]. Available: http://lucene.apache.org/solr/tutorial.html

[8] "Watson - A System Designed For Answers," pp. 1–6, February 2011. [Online]. Available: http://www-03.ibm.com/innovation/us/watson/what-is-watson/index.html


[9] S. J. Vaughan-Nichols, "What Makes IBM's Watson Run?" February 2011. [Online]. Available: http://www.zdnet.com/blog/open-source/what-makes-ibms-watson-run/8208

[10] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty, "The AI Behind Watson - The Technical Article," AI Magazine, 2010. [Online]. Available: http://www.aaai.org/Magazine/Watson/watson.php

[11] "IBM's Watson Computing System to Challenge All Time Greatest Jeopardy! Champions," February 2011. [Online]. Available: http://www-03.ibm.com/press/us/en/pressrelease/33233.wss
