text mining and seasr

Post on 11-May-2015


Introduction to SEASR and Text Mining

UIUC/NCSA, Feb 4, 2009

Loretta Auvil

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

The SEASR Picture

SEASR: Reach + Relevance + Reuse + Repeatability

SEASR emphasizes flexibility, scalability, modularity, provides community hub and access to heterogeneous data and computational systems
–  Semantic driven environment for SOA interoperability
–  Encourages sharing and participation for building communities
–  Modular construction allows flows to be modified and configured to encourage reusability within and across domains
–  Enables a mashup and integration of tools
–  Data-intensive flows can be executed on a simple desktop or a large cluster(s) without modification
–  Computation can be created for distributed execution on servers where the content lives
–  User accessibility to control trust and compliance with required copyright license of content
–  Relies on standardized Resource Description Framework (RDF) to define components and flow

Knowledge Discovery in Data

Workbench
•  Web-based UI
•  Components and flows are retrieved from server
•  Additional locations of components and flows can be added to server
•  Create flow using a graphical drag and drop interface
•  Change property values
•  Execute the flow

Community Hub

SEASR @ Work – Zotero
•  Plug-in to Firefox
•  Zotero manages the collection
•  Launch SEASR Analytics
–  Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
–  Zotero Export to Fedora through SEASR
–  Saves results from SEASR Analytics to a Collection
•  Launch MONK Processing
–  MONK DB Ingestion Workflow

Web Service

Interactive Web Application

SEASR @ Work – Fedora

SEASR @ Work – Entity Mash-up
•  Entity Extraction with OpenNLP
•  Locations viewed on Google Map
•  Dates viewed on Simile Timeline

SEASR @ Work – Audio Analysis
•  NEMA: Executes a SEASR flow for each run
–  Loads audio data
–  Extracts features for every 10 sec moving window of audio
–  Loads and applies the models
–  Sends results back to the WebUI
•  NESTER: Annotation of Audio via Spectral Analysis

SEASR @ Work – MONK

Executes flows for each analysis requested
–  Predictive modeling using Naïve Bayes
–  Predictive modeling using Support Vector Machines (SVM)

SEASR @ Work – DISCUS
•  On-demand usage of analytics while surfing
–  While navigating, request analytics to be performed on page
–  Text extraction and cleaning
•  Summarization and keyword extraction
–  List the important terms on the page being analyzed
–  Provide relevant short summaries
•  Visual maps
–  Provide a visual representation of the key concepts
–  Show the graph of relations between concepts

SEASR and UIMA: Emotion Tracking

Goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)

SEASR Text Analytics Goals

Address the scholarly text analytics needs by:
•  Efficiently managing distributed literary and historical textual assets
•  Structuring extracted information to facilitate knowledge discovery
•  Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich to support question-answering
•  Devising a representation for the extracted information that can be efficiently reasoned over to recover data in the question-answer process
•  Devising algorithms for question answering and inference
•  Developing UI for effective visual knowledge discovery, with query logic separate from application logic
•  Leveraging existing approaches and devising algorithms for clustering, inference, and Q&A
•  Developing an interaction UI for effective visual data exploration
•  Enabling the text analytics through SEASR components

The Zotero Picture

The WEB

Zotero Store

The Zotero + SEASR Picture

The WEB

Zotero Store

The WEB

Your Zotero Collection

The SEASR Analytics

The Value Added

Some Examples
•  Authorship Analysis (JUNG network importance algorithms to rank the authors in the citation network)
•  Author Centrality Analysis
–  Uses Betweenness Centrality, which ranks each author in the coauthor graph by the number of shortest paths that pass through them
•  Author Degree Analysis
–  Uses AuthorDegreeDistributionAnalysis, which ranks each author by the number of coauthors
•  Author HITS Analysis
–  The *hubness* of a node is the degree to which a node links to other important authorities. The *authoritativeness* of a node is the degree to which a node is pointed to by important hubs.
•  Readability
–  Flesch-Kincaid readability test (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)
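As a rough illustration of the Author Degree Analysis listed above, the sketch below counts distinct coauthors per author and ranks authors by that count. The paper list and the function names `coauthor_degrees` and `rank_by_degree` are hypothetical; the SEASR components rely on JUNG, which is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_degrees(papers):
    """Return {author: number of distinct coauthors} for a list of author lists."""
    neighbors = defaultdict(set)
    for authors in papers:
        for name in authors:
            neighbors[name]  # touch the key so single-author papers still register
        for a, b in combinations(set(authors), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {author: len(others) for author, others in neighbors.items()}


def rank_by_degree(papers):
    """Authors sorted from most to fewest coauthors (ties broken alphabetically)."""
    degrees = coauthor_degrees(papers)
    return sorted(degrees, key=lambda a: (-degrees[a], a))


# Hypothetical citation export: each entry is the author list of one paper.
papers = [["Ann", "Bob", "Cy"], ["Ann", "Dee"], ["Eve"]]
print(rank_by_degree(papers))  # ['Ann', 'Bob', 'Cy', 'Dee', 'Eve']
```

Degree only looks at direct coauthors; betweenness and HITS need the full graph, which is why a graph library is the more natural fit for those analyses.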

SEASR Flow

Text Mining Definition

Many definitions in the literature
•  “The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data”
•  An exploration and analysis of textual (natural-language) data by automatic and semiautomatic means to discover new knowledge
•  What is “previously unknown” information?
–  Strict definition
•  Information that not even the writer knows
–  Lenient definition
•  Rediscover the information that the author encoded in the text

Text Mining Process
•  Text Preprocessing
–  Syntactic Text Analysis
–  Semantic Text Analysis
•  Features Generation
–  Bag of Words
–  Ngrams
•  Feature Selection
–  Simple Counting
–  Statistics
–  Selection based on POS
•  Text/Data Mining
–  Classification - Supervised Learning
–  Clustering - Unsupervised Learning
–  Information Extraction
•  Analyzing Results
–  Visual Exploration, Discovery and Knowledge Extraction
–  Query-based question answering
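The Features Generation step above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization; the function names are made up for the example and are not SEASR components.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def bag_of_words(text):
    """Map each word to its number of occurrences, ignoring word order."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Return the tuples of n consecutive words, in reading order."""
    tokens = tokenize(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the cat sat on the mat"
print(bag_of_words(sentence))  # 'the' is counted twice, every other word once
print(ngrams(sentence, 2))     # first bigram is ('the', 'cat')
```

Bag of words discards order entirely, while n-grams keep a little local order at the cost of many more features.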

Text Characteristics (1)
•  Large textual database
–  Enormous wealth of textual information on the Web
–  Publications are electronic
•  High dimensionality
–  Consider each word/phrase as a dimension
•  Noisy data
–  Spelling mistakes
–  Abbreviations
–  Acronyms
•  Text messages are very dynamic
–  Web pages are constantly being generated (removed)
–  Web pages are generated from database queries
•  Not well structured text
–  Email/Chat rooms
•  “r u available?”
•  “Hey whazzzzzz up”
–  Speech

Text Characteristics (2)
•  Dependency
–  Relevant information is a complex conjunction of words/phrases
–  Order of words in the query
•  hot dog stand in the amusement park
•  hot amusement stand in the dog park
•  Ambiguity
–  Word ambiguity
•  Pronouns (he, she …)
•  Synonyms (buy, purchase)
•  Words with multiple meanings (bat: related to baseball or to the mammal)
–  Semantic ambiguity
•  The king saw the rabbit with his glasses. (multiple meanings)
•  Authority of the source
–  IBM is more likely to be an authoritative source than my distant second cousin

Text Preprocessing
•  Syntactic analysis
–  Tokenization
–  Lemmatization
–  POS tagging
–  Shallow parsing
–  Custom literary tagging
•  Semantic analysis
–  Information Extraction
•  Named Entity tagging
–  Semantic Category (unnamed entity) tagging
–  Co-reference resolution
–  Ontological association (WordNet, VerbNet)
–  Semantic Role analysis
–  Concept-Relation extraction

Syntactic Analysis
•  Tokenization
–  Text document is represented by the words it contains (and their occurrences)
–  e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
–  Highly efficient
–  Makes learning far simpler and easier
–  Order of words is not that important for certain applications
•  Lemmatization/Stemming
–  Involves the reduction of corpus words to their respective headwords (i.e., lemmas)
–  Reduces dimensionality
–  Identifies a word by its root
–  e.g., flying, flew → fly
•  Stop words
–  Identifies the most common words that are unlikely to help with text mining
–  e.g., “the”, “a”, “an”, “you”
•  Parsing / Part of Speech (POS) tagging
–  Generates a parse tree (graph) for each sentence
–  Each sentence is a standalone graph
–  Find the corresponding POS for each word
–  e.g., John (noun) gave (verb) the (det) ball (noun)
–  Shallow Parsing
•  Analysis of a sentence which identifies the constituents (noun groups, verbs, ...), but does not specify their internal structure, nor their role in the main sentence
–  Deep Parsing
•  More sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
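The tokenization, stop word, and lemmatization steps above might be chained as in the following sketch. The stop word list extends the slide's examples with "of", and the `LEMMAS` table is a hypothetical two-entry lookup standing in for a real lemmatizer or stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "you", "of"}  # tiny illustrative list ("of" is an addition)
LEMMAS = {"flying": "fly", "flew": "fly"}      # hypothetical lookup table, not a real lemmatizer


def preprocess(text):
    """Tokenize, drop stop words, and map each remaining word to its lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]


print(preprocess("Lord of the rings"))              # ['lord', 'rings']
print(preprocess("You flew; the bird is flying."))  # ['fly', 'bird', 'is', 'fly']
```

Both inflected forms collapse to the same feature, which is the dimensionality reduction the slide refers to.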

Semantic Analysis: Information Extraction
•  Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
•  Extract the relevant information and ignore non-relevant information (important!)
•  Link related information and output in a predetermined format

Information Extraction

Information Type | Description | State of the art (Accuracy)
Entities | An object of interest such as a person or organization. | 90-98%
Attributes | A property of an entity such as its name, alias, descriptor, or type. | 80%
Facts | A relationship held between two or more entities such as Position of a Person in a Company. | 60-70%
Events | An activity involving several entities such as a terrorist act, airline crash, management change, new product introduction. | 50-60%

Source: “Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL

Information Extraction Approaches
•  Terminology (name) lists
–  This works very well if the list of names and name expressions is stable and available
•  Tokenization and morphology
–  This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
•  Use of characteristic patterns
–  This works fairly well for novel entities
–  Rules can be created by hand or learned via machine learning or statistical algorithms
–  Rules capture local patterns that characterize entities from instances of annotated training data
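The first two approaches above, name lists and format-based recognition, can be sketched with regular expressions. The organization list and the sample sentence are invented for illustration; a real terminology list would come from a curated source.

```python
import re

# Hypothetical terminology list; a real one would come from a curated source.
KNOWN_ORGANIZATIONS = ["IBM", "NCSA"]

# Dates in the DD/MM/YY format mentioned above (two digits each).
DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")


def find_organizations(text):
    """Return known organization names that appear as whole words in the text."""
    return [name for name in KNOWN_ORGANIZATIONS
            if re.search(r"\b" + re.escape(name) + r"\b", text)]


def find_dates(text):
    """Return every DD/MM/YY string whose day and month are in a plausible range."""
    dates = []
    for match in DATE_PATTERN.finditer(text):
        day, month = int(match.group(1)), int(match.group(2))
        if 1 <= day <= 31 and 1 <= month <= 12:
            dates.append(match.group(0))
    return dates


text = "NCSA hosted the talk on 04/02/09; the code 99/99/99 is not a date."
print(find_organizations(text))  # ['NCSA']
print(find_dates(text))          # ['04/02/09']
```

The range check shows why format alone is not enough: the pattern matches `99/99/99`, and a small amount of validation is needed to reject it.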

Information Extraction

Relation (Event) Extraction
•  Identify (and tag) the relation between two entities:
–  A person is_located_at a location (news)
–  A gene codes_for a protein (biology)
•  Relations require more information
–  Identification of two entities & their relationship
–  Predicted relation accuracy
•  Pr(E1) * Pr(E2) * Pr(R) ≈ (0.93) * (0.93) * (0.93) ≈ 0.80
•  Information in relations is less local
–  Contextual information is a problem: right word may not be explicitly present in the sentence
–  Events involve more relations and are even harder
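The predicted relation accuracy above is a product of three probabilities, which treats the two entity extractions and the relation extraction as independent. A small check of the arithmetic, with a hypothetical function name:

```python
def relation_accuracy(entity1_acc, entity2_acc, relation_acc):
    """Chance that both entities and the relation are all correct, assuming independence."""
    return entity1_acc * entity2_acc * relation_acc


print(round(relation_accuracy(0.93, 0.93, 0.93), 2))  # 0.8
```

Every extra component multiplies in another factor below 1, which is one way to see why events, with more entities and relations, score lower than single relations.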

Semantic Analytics

Named Entity (NE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

NE: Person, NE: Time, NE: Location, NE: Organization

Semantic Analysis

Semantic Category (unnamed entity, UNE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

UNE: Organization

Semantic Analysis

Co-reference Resolution for entities and unnamed entities

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

UNE: Organization

Semantic Analysis

Semantic Role Analysis

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Role labels: ACTION, ACTOR, WHEN, OBJECT, WHERE, ACTION, OBJECT, COMPL

Semantic Analysis

Concept-Relation Extraction

Nodes: Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), today (time), Alderwood (location)

Edge labels: location (where), object (what), time (when), object (what), actor (who)

IE – Template Extraction - Steps

</VerbGroup> …

(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson

…….

The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.

The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited. …

Template Extraction

<Facility>Finsbury Park Mosque</Facility>

<PersonPositionOrganization> <OFFLEN OFFSET="3576" LENGTH="33" /> <Person>Abu Hamza al-Masri</Person> <Position>chief cleric</Position> <Organization>Finsbury Park Mosque</Organization> </PersonPositionOrganization>

<Country>England</Country>

<PersonArrest>  <OFFLEN OFFSET="3814" LENGTH="61" />   <Person>Abu Hamza al-Masri</Person>   <Location>London</Location> <Date>1999</Date> <Reason>his alleged involvement in a Yemen bomb plot</Reason>   </PersonArrest>

<Country>England</Country>

<Country>France </Country>

<Country>United States</Country>

<Country>Belgium</Country>

<Person>Abu Hamza al-Masri</Person>

<City>London</City>

Streaming Text: Knowledge Extraction
•  Leveraging some earlier work on information extraction from text streams

Information extraction
•  process of using advanced automated machine learning approaches
•  to identify entities in text documents
•  extract this information along with the relationships these entities may have in the text documents

The slide's visualization demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
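The co-occurrence rule described above can be sketched as follows. This assumes entities have already been identified and are single tokens; the window size of 5 and the sample sentence are arbitrary choices for illustration, not values taken from the slides.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(tokens, entities, window=5):
    """Count entity pairs whose token positions are at most `window` apart."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    pairs = Counter()
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b and j - i <= window:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


tokens = "Luthor announced a facility in Alderwood named Boynton".split()
entities = {"Luthor", "Alderwood", "Boynton"}
print(cooccurrences(tokens, entities, window=5))
```

Here "Luthor" and "Boynton" are seven tokens apart, so no relation is recorded between them, while both are linked to "Alderwood".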

Semantic Analysis
•  Word Sense Disambiguation
–  Context based or proximity based
–  Very accurate

Ontological Association (WordNet)
•  WordNet: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
•  Search for dog
–  n: dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
–  n: frump, dog (a dull unattractive unpleasant girl or woman)
–  n: dog (informal term for a man)
–  n: cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
–  n: frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
–  n: pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
–  n: andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
–  v: chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)

Feature Selection
•  Reduce Dimensionality
–  Learners have difficulty addressing tasks with high dimensionality
•  Irrelevant Features
–  Not all features help!
–  Remove features that occur in only a few documents
–  Reduce features that occur in too many documents
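One simple way to apply the two rules above is to filter on document frequency. The thresholds `min_docs` and `max_fraction` below are hypothetical parameters chosen for the example, and the documents are invented.

```python
from collections import Counter


def select_features(documents, min_docs=2, max_fraction=0.9):
    """Keep words that appear in at least `min_docs` documents and in at most
    `max_fraction` of all documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # count each word once per document
    limit = max_fraction * len(documents)
    return sorted(w for w, df in doc_freq.items() if min_docs <= df <= limit)


docs = [
    "the whale hunts the ship",
    "the ship sails north",
    "the whale dives",
    "the captain sleeps",
]
print(select_features(docs))  # ['ship', 'whale']
```

"the" is dropped for appearing in every document, and words seen in a single document are dropped as too rare to generalize.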

Text Mining: General Application Areas
•  Information Retrieval
–  Indexing and retrieval of textual documents
–  Finding a set of (ranked) documents that are relevant to the query
•  Information Extraction
–  Extraction of partial knowledge in the text
•  Web Mining
–  Indexing and retrieval of textual documents and extraction of partial knowledge using the web
•  Classification
–  Predict a class for each text document
•  Clustering
–  Generating collections of similar text documents

Text Mining: Supervised vs. Unsupervised
•  Supervised learning (Classification)
–  Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
–  Split into training data and test data for model building process
–  New data is classified based on the model built with the training data
–  Techniques
•  Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines
•  Unsupervised learning (Clustering)
–  Class labels of training data are unknown
–  Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data

Results: Social Network (Tom in Red)

Results: Timeline

Results: Maps

Text Mining: T2K and ThemeWeaver

Images from Pacific Northwest Laboratory

Text Mining: Themescape and ThemeRiver
•  Visualizing Relationships Between Documents

Gather – Analyze – Present

Text Mining: Applications
•  Email: Spam filtering
•  News Feeds: Discover what is interesting
•  Medical: Identify relationships and link information from different medical fields
•  Homeland Security
•  Marketing: Discover distinct groups of potential buyers and make suggestions for other products
•  Industry: Identifying groups of competitors' web pages
•  Job Seeking: Identify parameters in searching for jobs

Text Mining: Classification Definition
•  Given: Collection of labeled records
–  Each record contains a set of features (attributes), and the true class (label)
–  Create a training set to build the model
–  Create a testing set to test the model
•  Find: Model for the class as a function of the values of the features
•  Goal: Assign a class (as accurately as possible) to previously unseen records
•  Evaluation: What Is Good Classification?
–  Correct classification
•  Known label of test example is identical to the predicted class from the model
–  Accuracy ratio
•  Percent of test set examples that are correctly classified by the model
–  Distance measure between classes can be used
•  e.g., classifying “football” document as a “basketball” document is not as bad as classifying it as “crime”
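The two evaluation ideas above, accuracy ratio and a distance measure between classes, can be sketched as follows. The distances in `CLASS_DISTANCE` are invented for illustration; the slides do not give numeric values.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test examples whose predicted class equals the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical distances between classes: the two sports topics are close.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 0.2,
    frozenset(["football", "crime"]): 1.0,
    frozenset(["basketball", "crime"]): 1.0,
}


def mean_error_cost(true_labels, predicted_labels):
    """Average class distance between truth and prediction (0 when correct)."""
    costs = [0.0 if t == p else CLASS_DISTANCE[frozenset([t, p])]
             for t, p in zip(true_labels, predicted_labels)]
    return sum(costs) / len(costs)


truth = ["football", "football", "crime", "basketball"]
predicted = ["football", "basketball", "crime", "crime"]
print(accuracy(truth, predicted))         # 0.5
print(mean_error_cost(truth, predicted))
```

Both mistakes count equally against accuracy, but the distance-weighted cost penalizes the basketball-as-crime error five times more than the football-as-basketball one.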

Text Mining: Clustering Definition
•  Given: Set of documents and a similarity measure among documents
•  Find: Clusters such that
–  Documents in one cluster are more similar to one another
–  Documents in separate clusters are less similar to one another
•  Goal:
–  Finding a correct set of documents
•  Similarity Measures:
–  Euclidean distance if attributes are continuous
–  Other problem-specific measures
•  e.g., how many words are common in these documents
•  Evaluation: What Is Good Clustering?
–  Produce high quality clusters with
•  high intra-class similarity
•  low inter-class similarity
–  Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
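The two similarity measures named above might look like this in code. It is a minimal sketch with hypothetical function names; a clustering algorithm would call one of these measures on every pair of documents it compares.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def shared_word_count(doc_a, doc_b):
    """Number of distinct words the two documents have in common."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))


print(euclidean_distance([0.0, 3.0], [4.0, 0.0]))                # 5.0
print(shared_word_count("the white whale", "the whale hunt"))   # 2
```

Note the opposite directions: a smaller Euclidean distance means more similar, while a larger shared word count means more similar.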

SEASR

Meandre Workbench

Future Work
•  Enhancements to Semantic Analysis
–  Use of Ontological Association (WordNet, VerbNet)
–  Improve co-referencing
–  Improve fact extraction
•  Visual exploration tools
