Text Mining and SEASR
Introduction to SEASR and Text Mining
UIUC/NCSA, Feb 4, 2009
Loretta Auvil
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
The SEASR Picture
SEASR: Reach + Relevance + Reuse + Repeatability
SEASR emphasizes flexibility, scalability, and modularity, and provides a community hub and access to heterogeneous data and computational systems
  – Semantic driven environment for SOA interoperability
  – Encourages sharing and participation for building communities
  – Modular construction allows flows to be modified and configured to encourage reusability within and across domains
  – Enables a mashup and integration of tools
  – Data-intensive flows can be executed on a simple desktop or a large cluster(s) without modification
  – Computation can be created for distributed execution on servers where the content lives
  – User accessibility to control trust and compliance with required copyright license of content
  – Relies on standardized Resource Description Framework (RDF) to define components and flow
Knowledge Discovery in Data
Workbench
• Web-based UI
• Components and flows are retrieved from server
• Additional locations of components and flows can be added to server
• Create flow using a graphical drag and drop interface
• Change property values
• Execute the flow
Community Hub
SEASR @ Work – Zotero
• Plugin to Firefox
• Zotero manages the collection
• Launch SEASR Analytics
  – Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
  – Zotero Export to Fedora through SEASR
  – Saves results from SEASR Analytics to a Collection
• Launch MONK Processing
  – MONK DB Ingestion Workflow
Web Service
Interactive Web Application
SEASR @ Work – Fedora
SEASR @ Work – Entity Mash-up
• Entity Extraction with OpenNLP
• Locations viewed on Google Map
• Dates viewed on Simile Timeline
SEASR @ Work – Audio Analysis
• NEMA: Executes a SEASR flow for each run
  – Loads audio data
  – Extracts features for every 10 sec moving window of audio
  – Loads and applies the models
  – Sends results back to the Web UI
• NESTER: Annotation of Audio via Spectral Analysis
SEASR @ Work – MONK
Executes flows for each analysis requested
  – Predictive modeling using Naïve Bayes
  – Predictive modeling using Support Vector Machines (SVM)
SEASR @ Work – DISCUS
• On-demand usage of analytics while surfing
  – While navigating, request analytics to be performed on page
  – Text extraction and cleaning
• Summarization and keyword extraction
  – List the important terms on the page being analyzed
  – Provide relevant short summaries
• Visual maps
  – Provide a visual representation of the key concepts
  – Show the graph of relations between concepts
SEASR and UIMA: Emotion Tracking
Goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)
SEASR Text Analytics Goals
Address the scholarly text analytics needs by:
• Efficiently managing distributed literary and historical textual assets
• Structuring extracted information to facilitate knowledge discovery
• Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich to support question-answering
• Devising a representation for the extracted information that can be efficiently reasoned over to recover data in the question-answer process
• Devising algorithms for question answering and inference
• Developing a UI for effective visual knowledge discovery with query logic separate from application logic
• Leveraging existing approaches and devising algorithms for clustering, inference, and Q&A
• Developing an interaction UI for effective visual data exploration
• Enabling the text analytics through SEASR components
The Zotero Picture
[Diagram: The Web, Zotero Store]
The Zotero + SEASR Picture
[Diagram: The Web, Zotero Store, Your Zotero Collection, The SEASR Analytics, The Value Added]
Some Examples
• Authorship Analysis (JUNG network importance algorithms to rank the authors in the citation network)
• Author Centrality Analysis – Uses Betweenness Centrality, which ranks each author in the coauthor graph by the number of shortest paths that pass through them
• Author Degree Analysis – Uses AuthorDegreeDistributionAnalysis, which ranks each author on the number of coauthors
• Author HITS Analysis – The *hubness* of a node is the degree to which it links to important authorities. The *authoritativeness* of a node is the degree to which it is pointed to by important hubs.
• Readability
  – Flesch-Kincaid readability test (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)
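The hub and authority scores described under Author HITS Analysis can be illustrated with a small power iteration. This is a minimal Python sketch, not the JUNG implementation the slides refer to; the `hits` and `normalize` functions, the author names, and the citation pairs are all hypothetical.

```python
import math
from collections import defaultdict


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    length = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / length for node, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) scores for a directed graph of (source, target) pairs."""
    nodes = {node for edge in edges for node in edge}
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for source, target in edges:
        out_links[source].add(target)
        in_links[target].add(source)

    hub = {node: 1.0 for node in nodes}
    authority = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # A node is a strong authority if strong hubs point to it.
        authority = normalize({n: sum(hub[m] for m in in_links[n]) for n in nodes})
        # A node is a strong hub if it points to strong authorities.
        hub = normalize({n: sum(authority[m] for m in out_links[n]) for n in nodes})
    return hub, authority


# Hypothetical citation network: (citing author, cited author).
citations = [
    ("Adams", "Chen"), ("Baker", "Chen"), ("Evans", "Chen"),
    ("Adams", "Diaz"), ("Baker", "Diaz"),
]
hub, authority = hits(citations)
for name in sorted(authority, key=authority.get, reverse=True):
    print(f"{name}: authority={authority[name]:.3f} hub={hub[name]:.3f}")
```

In this toy network Chen is cited by more hubs than Diaz and so ranks higher as an authority, while Adams and Baker, who cite both, rank higher as hubs than Evans.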
SEASR Flow
Text Mining Definition
Many definitions in the literature
• “The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data”
• An exploration and analysis of textual (natural-language) data by automatic and semiautomatic means to discover new knowledge
• What is “previously unknown” information?
  – Strict definition
    • Information that not even the writer knows
  – Lenient definition
    • Rediscover the information that the author encoded in the text
Text Mining Process
• Text Preprocessing
  – Syntactic Text Analysis
  – Semantic Text Analysis
• Features Generation
  – Bag of Words
  – Ngrams
• Feature Selection
  – Simple Counting
  – Statistics
  – Selection based on POS
• Text/Data Mining
  – Classification - Supervised Learning
  – Clustering - Unsupervised Learning
  – Information Extraction
• Analyzing Results
  – Visual Exploration, Discovery and Knowledge Extraction
  – Query-based – question answering
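The Features Generation step above names bag of words and n-grams. A minimal Python sketch of both follows, assuming whitespace tokenization; the function names are illustrative and are not SEASR components.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Map each word to its number of occurrences; word order is discarded."""
    return Counter(tokenize(text))


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, preserving local word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("the lord of the rings")
bag = bag_of_words("the lord of the rings")
bigrams = ngrams(tokens, 2)
print(bag)
print(bigrams)
```

The bag keeps only counts, so "the" appears once with a count of 2, while the bigrams retain adjacent word pairs such as ("the", "lord").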
Text Characteristics (1)
• Large textual database
  – Enormous wealth of textual information on the Web
  – Publications are electronic
• High dimensionality
  – Consider each word/phrase as a dimension
• Noisy data
  – Spelling mistakes
  – Abbreviations
  – Acronyms
• Text messages are very dynamic
  – Web pages are constantly being generated (removed)
  – Web pages are generated from database queries
• Not well structured text
  – Email/Chat rooms
    • “r u available?”
    • “Hey whazzzzzz up”
  – Speech
Text Characteristics (2)
• Dependency
  – Relevant information is a complex conjunction of words/phrases
  – Order of words in the query
    • hot dog stand in the amusement park
    • hot amusement stand in the dog park
• Ambiguity
  – Word ambiguity
    • Pronouns (he, she …)
    • Synonyms (buy, purchase)
    • Words with multiple meanings (bat – it is related to baseball or mammal)
  – Semantic ambiguity
    • The king saw the rabbit with his glasses. (multiple meanings)
• Authority of the source
  – IBM is more likely to be an authorized source than my distant second cousin
Text Preprocessing
• Syntactic analysis
  – Tokenization
  – Lemmatization
  – POS tagging
  – Shallow parsing
  – Custom literary tagging
• Semantic analysis
  – Information Extraction
    • Named Entity tagging
  – Semantic Category (unnamed entity) tagging
  – Co-reference resolution
  – Ontological association (WordNet, VerbNet)
  – Semantic Role analysis
  – Concept-Relation extraction
Syntactic Analysis
• Tokenization
  – Text document is represented by the words it contains (and their occurrences)
  – e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
  – Highly efficient
  – Makes learning far simpler and easier
  – Order of words is not that important for certain applications
• Lemmatization / Stemming
  – Involves the reduction of corpus words to their respective headwords (i.e. lemmas)
  – Reduce dimensionality
  – Identifies a word by its root
  – e.g., flying, flew → fly
• Stop words
  – Identifies the most common words that are unlikely to help with text mining
  – e.g., “the”, “a”, “an”, “you”
• Parsing / Part of Speech (POS) tagging
  – Generates a parse tree (graph) for each sentence
  – Each sentence is a stand alone graph
  – Find the corresponding POS for each word
  – e.g., John (noun) gave (verb) the (det) ball (noun)
  – Shallow Parsing
    • analysis of a sentence which identifies the constituents (noun groups, verbs, ...), but does not specify their internal structure, nor their role in the main sentence
  – Deep Parsing
    • more sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
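The tokenization, stop word, and lemmatization steps above can be chained in a few lines. The sketch below is only an illustration: the stop word set reuses the examples from the slide, and the `LEMMAS` lookup table is a hypothetical stand-in for a real lemmatizer.

```python
import re

# Stop words taken from the slide's examples; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "you"}

# Hypothetical lemma lookup standing in for a dictionary-based lemmatizer.
LEMMAS = {"flying": "fly", "flew": "fly"}


def preprocess(text):
    """Tokenize, drop stop words, and map each remaining token to its lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]


result = preprocess("The bird flew; you saw a bird flying.")
print(result)
```

Both "flew" and "flying" collapse to "fly", which is the dimensionality reduction the slide describes.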
Semantic Analysis: Information Extraction
• Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output in a predetermined format
Information Extraction
Information Type | State of the art (Accuracy)
Entities: an object of interest such as a person or organization. | 90-98%
Attributes: a property of an entity such as its name, alias, descriptor, or type. | 80%
Facts: a relationship held between two or more entities such as Position of a Person in a Company. | 60-70%
Events: an activity involving several entities such as a terrorist act, airline crash, management change, new product introduction. | 50-60%
“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL
Information Extraction Approaches
• Terminology (name) lists
  – This works very well if the list of names and name expressions is stable and available
• Tokenization and morphology
  – This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
• Use of characteristic patterns
  – This works fairly well for novel entities
  – Rules can be created by hand or learned via machine learning or statistical algorithms
  – Rules capture local patterns that characterize entities from instances of annotated training data
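The DD/MM/YY example above, where an item is recognized by its internal format, can be sketched with a regular expression. This is an illustrative Python sketch only; the pattern, the range checks, and the sample sentence are assumptions, not part of SEASR.

```python
import re

# Two digits, slash, two digits, slash, two digits, bounded by word boundaries.
DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")


def find_dates(text):
    """Return (day, month, year) tuples for DD/MM/YY strings with plausible values."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        day, month, year = (int(group) for group in match.groups())
        # Format alone is not enough: reject values that cannot be a calendar date.
        if 1 <= day <= 31 and 1 <= month <= 12:
            found.append((day, month, year))
    return found


dates = find_dates("Opened on 04/02/09, reviewed 31/12/09, ref 99/99/99.")
print(dates)
```

The third string matches the format but fails the range check, which shows why format-based recognition usually needs a validation step.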
Information Extraction
Relation (Event) Extraction
• Identify (and tag) the relation between two entities:
  – A person is_located_at a location (news)
  – A gene codes_for a protein (biology)
• Relations require more information
  – Identification of two entities & their relationship
  – Predicted relation accuracy
    • Pr(E1) * Pr(E2) * Pr(R) ~= (.93) * (.93) * (.93) = .80
• Information in relations is less local
  – Contextual information is a problem: right word may not be explicitly present in the sentence
  – Events involve more relations and are even harder
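The predicted relation accuracy above multiplies the per-step accuracies, which implicitly treats the two entity recognitions and the relation step as independent. A short Python sketch of that arithmetic follows; the function and its parameter names are illustrative.

```python
def relation_accuracy(entity_accuracy, relation_step_accuracy, n_entities=2):
    """Expected accuracy of a relation whose entities and link must all be correct.

    Assumes the individual steps succeed or fail independently.
    """
    return entity_accuracy ** n_entities * relation_step_accuracy


two_entity = relation_accuracy(0.93, 0.93)
four_entity = relation_accuracy(0.93, 0.93, n_entities=4)
print(round(two_entity, 2))
print(round(four_entity, 2))
```

With 0.93 at every step the two-entity relation comes out near 0.80, as on the slide, and the same product falls further when more entities must be right, which is consistent with events being harder than relations.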
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
NE:Person, NE:Time, NE:Location, NE:Organization
Semantic Analytics
Named Entity (NE) Tagging
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
UNE:Organization
Semantic Analysis
Semantic Category (unnamed entity, UNE) Tagging
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
UNE:Organization
Semantic Analysis
Co-reference Resolution for entities and unnamed entities
Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.
Role labels: ACTOR, ACTION, WHEN, OBJECT, WHERE, ACTION, OBJECT, COMPL
Semantic Analysis
Semantic Role Analysis
Nodes: Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), today (time), Alderwood (location)
Edge labels: location (where), object (what), time (when), object (what), actor (who)
Semantic Analysis
Concept-Relation Extraction
IE – Template Extraction – Steps
</VerbGroup> …
(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson
…….
The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.
The mosque's chief cleric, Abu Hamza al-Masri, lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited. …
Template Extraction
<Facility>Finsbury Park Mosque</Facility>
<PersonPositionOrganization> <OFFLEN OFFSET="3576" LENGTH="33" /> <Person>Abu Hamza al-Masri</Person> <Position>chief cleric</Position> <Organization>Finsbury Park Mosque</Organization> </PersonPositionOrganization>
<Country>England</Country>
<PersonArrest> <OFFLEN OFFSET="3814" LENGTH="61" /> <Person>Abu Hamza al-Masri</Person> <Location>London</Location> <Date>1999</Date> <Reason>his alleged involvement in a Yemen bomb plot</Reason> </PersonArrest>
<Country>England</Country>
<Country>France </Country>
<Country>United States</Country>
<Country>Belgium</Country>
<Person>Abu Hamza al-Masri</Person>
<City>London</City>
Streaming Text: Knowledge Extraction
• Leveraging some earlier work on information extraction from text streams
Information extraction
• process of using advanced automated machine learning approaches
• to identify entities in text documents
• extract this information along with the relationships these entities may have in the text documents
The visualization above demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
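The co-occurrence rule in the last sentence can be sketched as follows. This is a simplified Python illustration, assuming entities have already been identified and each occupies a single token; the `cooccurrences` function, the window size, and the sample sentence are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(tokens, entities, window=5):
    """Count pairs of distinct entities that appear within `window` tokens of each other."""
    positions = [(i, token) for i, token in enumerate(tokens) if token in entities]
    pairs = Counter()
    for (i, first), (j, second) in combinations(positions, 2):
        if first != second and j - i <= window:
            # Sort so that (A, B) and (B, A) count as the same relationship.
            pairs[tuple(sorted((first, second)))] += 1
    return pairs


tokens = "Luthor announced a new facility in Alderwood called Boynton".split()
entities = {"Luthor", "Alderwood", "Boynton"}
pairs = cooccurrences(tokens, entities, window=3)
print(pairs)
```

With a window of three words only Alderwood and Boynton are linked; widening the window would also link Luthor, which shows how strongly the window size shapes the resulting relationship graph.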
Semantic Analysis
• Word Sense Disambiguation
  – Context based or proximity based
  – Very accurate
Ontological Association (WordNet)
• WordNet: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
• Search for dog
  – n dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
  – n frump, dog (a dull unattractive unpleasant girl or woman)
  – n dog (informal term for a man)
  – n cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
  – n frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
  – n pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
  – n andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
  – v chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)
Feature Selection
• Reduce Dimensionality
  – Learners have difficulty addressing tasks with high dimensionality
• Irrelevant Features
  – Not all features help!
  – Remove features that occur in only a few documents
  – Reduce features that occur in too many documents
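The two rules above, dropping features that occur in too few or too many documents, amount to a document-frequency filter. A minimal Python sketch follows; the thresholds `min_df` and `max_df_ratio` and the toy documents are assumptions chosen for illustration.

```python
from collections import Counter


def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms found in at least `min_df` documents and at most `max_df_ratio` of them."""
    document_frequency = Counter()
    for doc in docs:
        # Use a set so a term counts once per document, however often it repeats.
        document_frequency.update(set(doc))
    total = len(docs)
    return sorted(
        term
        for term, count in document_frequency.items()
        if count >= min_df and count / total <= max_df_ratio
    )


docs = [
    ["the", "king", "saw", "rabbit"],
    ["the", "king", "wore", "glasses"],
    ["the", "rabbit", "ran"],
    ["the", "bat", "flew"],
]
selected = select_features(docs)
print(selected)
```

Here "the" is dropped because it appears in every document, and the words that appear only once are dropped as too rare, leaving "king" and "rabbit".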
Text Mining: General Application Areas
• Information Retrieval
  – Indexing and retrieval of textual documents
  – Finding a set of (ranked) documents that are relevant to the query
• Information Extraction
  – Extraction of partial knowledge in the text
• Web Mining
  – Indexing and retrieval of textual documents and extraction of partial knowledge using the web
• Classification
  – Predict a class for each text document
• Clustering
  – Generating collections of similar text documents
Text Mining: Supervised vs. Unsupervised
• Supervised learning (Classification)
  – Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – Split into training data and test data for model building process
  – New data is classified based on the model built with the training data
  – Techniques
    • Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines
• Unsupervised learning (Clustering)
  – Class labels of training data are unknown
  – Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data
Results: Social Network (Tom in Red)
Results: Timeline
Results: Maps
Text Mining: T2K and ThemeWeaver
Images from Pacific Northwest Laboratory
Text Mining: Themescape and ThemeRiver
• Visualizing Relationships Between Documents
Gather – Analyze – Present
Text Mining: Applications
• Email: Spam filtering
• News Feeds: Discover what is interesting
• Medical: Identify relationships and link information from different medical fields
• Homeland Security
• Marketing: Discover distinct groups of potential buyers and make suggestions for other products
• Industry: Identifying groups of competitors' web pages
• Job Seeking: Identify parameters in searching for jobs
Text Mining: Classification Definition
• Given: Collection of labeled records
  – Each record contains a set of features (attributes), and the true class (label)
  – Create a training set to build the model
  – Create a testing set to test the model
• Find: Model for the class as a function of the values of the features
• Goal: Assign a class (as accurately as possible) to previously unseen records
• Evaluation: What Is Good Classification?
  – Correct classification
    • Known label of test example is identical to the predicted class from the model
  – Accuracy ratio
    • Percent of test set examples that are correctly classified by the model
  – Distance measure between classes can be used
    • e.g., classifying “football” document as a “basketball” document is not as bad as classifying it as “crime”
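The train, test, and accuracy-ratio steps above can be made concrete with a small multinomial Naïve Bayes classifier. This is a from-scratch Python sketch with add-one smoothing, not the MONK or SEASR implementation; the function names and the toy training and test records are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Build class and per-class word counts from (tokens, label) training records."""
    class_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log posterior, using add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_records = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_records)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the known label."""
    correct = sum(predict(model, tokens) == label for tokens, label in test_set)
    return correct / len(test_set)


training_set = [
    ("goal kick penalty referee".split(), "football"),
    ("goal striker offside kick".split(), "football"),
    ("robbery arrest police suspect".split(), "crime"),
    ("police arrest fraud trial".split(), "crime"),
]
testing_set = [
    ("referee penalty goal".split(), "football"),
    ("suspect arrest trial".split(), "crime"),
]
model = train_naive_bayes(training_set)
print(accuracy(model, testing_set))
```

The test records are kept apart from the training records so that the accuracy ratio measures performance on previously unseen data, as the Goal bullet requires.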
Text Mining: Clustering Definition
• Given: Set of documents and a similarity measure among documents
• Find: Clusters such that
  – Documents in one cluster are more similar to one another
  – Documents in separate clusters are less similar to one another
• Goal:
  – Finding a correct set of documents
• Similarity Measures:
  – Euclidean distance if attributes are continuous
  – Other problem-specific measures
    • e.g., how many words are common in these documents
• Evaluation: What Is Good Clustering?
  – Produce high quality clusters with
    • high intra-class similarity
    • low inter-class similarity
  – Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
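The similarity measures and the intra-class versus inter-class criterion above can be sketched in Python. The example below assumes the clusters are already given and uses the count of shared words as the problem-specific measure; all function names and the toy documents are illustrative.

```python
import math
from itertools import combinations, product


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors of continuous attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shared_words(doc_a, doc_b):
    """Problem-specific similarity: how many distinct words two documents share."""
    return len(set(doc_a) & set(doc_b))


def mean_intra_similarity(cluster):
    """Average similarity over all pairs of documents inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(shared_words(a, b) for a, b in pairs) / len(pairs)


def mean_inter_similarity(cluster_a, cluster_b):
    """Average similarity over all pairs drawn from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(shared_words(a, b) for a, b in pairs) / len(pairs)


sports = [["goal", "kick", "referee"], ["goal", "kick", "striker"]]
crime = [["police", "arrest", "trial"], ["police", "arrest", "fraud"]]

print(euclidean((0, 0), (3, 4)))
print(mean_intra_similarity(sports))
print(mean_inter_similarity(sports, crime))
```

A good clustering by the slide's criterion shows a high value from `mean_intra_similarity` and a low value from `mean_inter_similarity`, as this toy grouping does.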
SEASR
Meandre Workbench
Future Work
• Enhancements to Semantic Analysis
  – Use of Ontological Association (WordNet, VerbNet)
  – Improve co-referencing
  – Improve fact extraction
• Visual exploration tools