Text Mining and SEASR

Introduction to SEASR and Text Mining. UIUC/NCSA, Feb 4, 2009. Loretta Auvil, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Upload: loretta-auvil

Post on 11-May-2015


TRANSCRIPT

Page 1: Text Mining and SEASR

Introduction to SEASR and Text Mining

UIUC/NCSA, Feb 4, 2009

Loretta Auvil

National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign

Page 2: Text Mining and SEASR

The SEASR Picture

Page 3: Text Mining and SEASR

SEASR: Reach + Relevance + Reuse + Repeatability

SEASR emphasizes flexibility, scalability, modularity, provides community hub and access to heterogeneous data and computational systems
  –  Semantic driven environment for SOA interoperability
  –  Encourages sharing and participation for building communities
  –  Modular construction allows flows to be modified and configured to encourage reusability within and across domains
  –  Enables a mashup and integration of tools
  –  Data-intensive flows can be executed on a simple desktop or a large cluster(s) without modification
  –  Computation can be created for distributed execution on servers where the content lives
  –  User accessibility to control trust and compliance with required copyright license of content
  –  Relies on standardized Resource Description Framework (RDF) to define components and flow

Page 4: Text Mining and SEASR

Knowledge Discovery in Data

Page 5: Text Mining and SEASR

Workbench

•  Web-based UI
•  Components and flows are retrieved from server
•  Additional locations of components and flows can be added to server
•  Create flow using a graphical drag and drop interface
•  Change property values
•  Execute the flow

Page 6: Text Mining and SEASR

Community Hub

Page 7: Text Mining and SEASR

SEASR @ Work – Zotero

•  Plugin to Firefox
•  Zotero manages the collection
•  Launch SEASR Analytics
  –  Citation Analysis uses the JUNG network importance algorithms to rank the authors in the citation network that is exported as RDF data from Zotero to SEASR
  –  Zotero Export to Fedora through SEASR
  –  Saves results from SEASR Analytics to a Collection
•  Launch MONK Processing
  –  MONK DB Ingestion Workflow

Page 8: Text Mining and SEASR

SEASR @ Work – Fedora

Diagram labels: Web Service, Interactive Web Application

Page 9: Text Mining and SEASR

SEASR @ Work – Entity Mash-up

•  Entity Extraction with OpenNLP
•  Locations viewed on Google Map
•  Dates viewed on Simile Timeline

Page 10: Text Mining and SEASR

SEASR @ Work – Audio Analysis

•  NEMA: Executes a SEASR flow for each run
  –  Loads audio data
  –  Extracts features for every 10 sec moving window of audio
  –  Loads and applies the models
  –  Sends results back to the Web UI
•  NESTER: Annotation of Audio via Spectral Analysis

Page 11: Text Mining and SEASR

SEASR @ Work – MONK

Executes flows for each analysis requested
  –  Predictive modeling using Naïve Bayes
  –  Predictive modeling using Support Vector Machines (SVM)

Page 12: Text Mining and SEASR

SEASR @ Work – DISCUS

•  On-demand usage of analytics while surfing
  –  While navigating, request analytics to be performed on page
  –  Text extraction and cleaning
•  Summarization and keyword extraction
  –  List the important terms on the page being analyzed
  –  Provide relevant short summaries
•  Visual maps
  –  Provide a visual representation of the key concepts
  –  Show the graph of relations between concepts

Page 13: Text Mining and SEASR

SEASR and UIMA: Emotion Tracking

Goal is to have this type of visualization to track emotions across a text document (leveraging flare.prefuse.org)

Page 14: Text Mining and SEASR

SEASR Text Analytics Goals

Address the scholarly text analytics needs by:

•  Efficiently managing distributed literary and historical textual assets
•  Structuring extracted information to facilitate knowledge discovery
•  Extracting information from text at a level of semantic/functional abstraction that is sufficiently rich to support question-answering
•  Devising a representation for the extracted information that can be efficiently reasoned over to recover data in the question-answer process
•  Devising algorithms for question answering and inference
•  Developing a UI for effective visual knowledge discovery, with query logic separate from application logic
•  Leveraging existing approaches and devising algorithms for clustering, inference, and Q&A
•  Developing an interaction UI for effective visual data exploration
•  Enabling the text analytics through SEASR components

Page 15: Text Mining and SEASR

The Zotero Picture

Diagram labels: The WEB, Zotero Store

Page 16: Text Mining and SEASR

The Zotero + SEASR Picture

Diagram labels: The WEB, Zotero Store, The WEB

Page 17: Text Mining and SEASR

Your Zotero Collection

Page 18: Text Mining and SEASR

The SEASR Analytics

Page 19: Text Mining and SEASR

The Value Added

Page 20: Text Mining and SEASR

Some Examples

•  Authorship Analysis (JUNG network importance algorithms to rank the authors in the citation network)
•  Author Centrality Analysis
  –  Uses Betweenness Centrality, which ranks each author in the coauthor graph by the number of shortest paths that pass through them
•  Author Degree Analysis
  –  Uses AuthorDegreeDistributionAnalysis, which ranks each author on the number of coauthors
•  Author HITS Analysis
  –  The *hubness* of a node is the degree to which a node links to other important authorities. The *authoritativeness* of a node is the degree to which a node is pointed to by important hubs.
•  Readability
  –  Flesch-Kincaid readability test (http://en.wikipedia.org/wiki/Flesch-Kincaid_Readability_Test)
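As a rough illustration of the Author Degree Analysis described above, the sketch below ranks authors by their number of distinct coauthors. It is not the SEASR or JUNG implementation; the `papers` data and the `rank_by_coauthor_count` helper are hypothetical names made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def rank_by_coauthor_count(papers):
    """Rank authors by number of distinct coauthors (degree in the coauthor graph)."""
    coauthors = defaultdict(set)
    for authors in papers:
        for name in authors:
            coauthors[name]  # register the author even on single-author papers
        for a, b in combinations(set(authors), 2):
            coauthors[a].add(b)
            coauthors[b].add(a)
    # Sort by descending degree, then alphabetically for a stable order
    return sorted(((name, len(peers)) for name, peers in coauthors.items()),
                  key=lambda item: (-item[1], item[0]))


# Hypothetical citation data: each entry is the author list of one paper
papers = [
    ["Ada", "Ben", "Cy"],
    ["Ada", "Dee"],
    ["Ben", "Cy"],
    ["Eve"],
]
ranking = rank_by_coauthor_count(papers)
```

Here `ranking` puts "Ada" first with three coauthors and "Eve" last with none. Betweenness centrality and HITS need path or iterative computations and are left out of this sketch.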

Page 21: Text Mining and SEASR

SEASR Flow

Page 22: Text Mining and SEASR

Text Mining Definition

Many definitions in the literature
•  “The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amount of) textual data”
•  An exploration and analysis of textual (natural-language) data by automatic and semiautomatic means to discover new knowledge
•  What is “previously unknown” information?
  –  Strict definition
    •  Information that not even the writer knows
  –  Lenient definition
    •  Rediscover the information that the author encoded in the text

Page 23: Text Mining and SEASR

Text Mining Process

•  Text Preprocessing
  –  Syntactic Text Analysis
  –  Semantic Text Analysis
•  Features Generation
  –  Bag of Words
  –  Ngrams
•  Feature Selection
  –  Simple Counting
  –  Statistics
  –  Selection based on POS
•  Text/Data Mining
  –  Classification - Supervised Learning
  –  Clustering - Unsupervised Learning
  –  Information Extraction
•  Analyzing Results
  –  Visual Exploration, Discovery and Knowledge Extraction
  –  Query-based – question answering
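The feature generation step can be sketched in a few lines. The example below builds a bag of words and a list of n-grams from one sentence; the whitespace tokenizer and the function names are assumptions chosen for brevity, not part of SEASR.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("the dog chased the cat")
bow = bag_of_words(tokens)
bigrams = ngrams(tokens, 2)
```

The bag of words discards order ("the" simply counts 2), while the bigrams keep local order, e.g. ("dog", "chased").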

Page 24: Text Mining and SEASR

Text Characteristics (1)

•  Large textual database
  –  Enormous wealth of textual information on the Web
  –  Publications are electronic
•  High dimensionality
  –  Consider each word/phrase as a dimension
•  Noisy data
  –  Spelling mistakes
  –  Abbreviations
  –  Acronyms
•  Text messages are very dynamic
  –  Web pages are constantly being generated (removed)
  –  Web pages are generated from database queries
•  Not well structured text
  –  Email/Chat rooms
    •  “r u available?”
    •  “Hey whazzzzzzup”
  –  Speech

Page 25: Text Mining and SEASR

Text Characteristics (2)

•  Dependency
  –  Relevant information is a complex conjunction of words/phrases
  –  Order of words in the query
    •  hot dog stand in the amusement park
    •  hot amusement stand in the dog park
•  Ambiguity
  –  Word ambiguity
    •  Pronouns (he, she …)
    •  Synonyms (buy, purchase)
    •  Words with multiple meanings (bat – it is related to baseball or mammal)
  –  Semantic ambiguity
    •  The king saw the rabbit with his glasses. (multiple meanings)
•  Authority of the source
  –  IBM is more likely to be an authorized source than my second far cousin

Page 26: Text Mining and SEASR

Text Preprocessing

•  Syntactic analysis
  –  Tokenization
  –  Lemmatization
  –  POS tagging
  –  Shallow parsing
  –  Custom literary tagging
•  Semantic analysis
  –  Information Extraction
    •  Named Entity tagging
  –  Semantic Category (unnamed entity) tagging
  –  Co-reference resolution
  –  Ontological association (WordNet, VerbNet)
  –  Semantic Role analysis
  –  Concept-Relation extraction

Page 27: Text Mining and SEASR

Syntactic Analysis

•  Tokenization
  –  Text document is represented by the words it contains (and their occurrences)
  –  e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
  –  Highly efficient
  –  Makes learning far simpler and easier
  –  Order of words is not that important for certain applications
•  Lemmatization / Stemming
  –  Involves the reduction of corpus words to their respective headwords (i.e. lemmas)
  –  Reduce dimensionality
  –  Identifies a word by its root
  –  e.g., flying, flew → fly
•  Stop words
  –  Identifies the most common words that are unlikely to help with text mining
  –  e.g., “the”, “a”, “an”, “you”
•  Parsing / Part of Speech (POS) tagging
  –  Generates a parse tree (graph) for each sentence
  –  Each sentence is a stand alone graph
  –  Find the corresponding POS for each word
  –  e.g., John (noun) gave (verb) the (det) ball (noun)
  –  Shallow Parsing
    •  analysis of a sentence which identifies the constituents (noun groups, verbs, ...), but does not specify their internal structure, nor their role in the main sentence
  –  Deep Parsing
    •  more sophisticated syntactic, semantic and contextual processing must be performed to extract or construct the answer
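The first three steps above (tokenization, stop word removal, lemmatization) can be sketched as follows. The stop word set reuses the slide's examples; the tiny `LEMMAS` table is a hypothetical stand-in for a real lemmatizer or stemmer, and parsing and POS tagging are not covered.

```python
import re

STOP_WORDS = {"the", "a", "an", "you"}      # stop words listed on the slide
LEMMAS = {"flying": "fly", "flew": "fly"}   # toy lemma table (hypothetical)


def preprocess(text):
    """Tokenize, drop stop words, and map each token to its lemma if known."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]


result = preprocess("You flew the kite; a bird is flying.")
```

Both "flew" and "flying" collapse to "fly", which is the dimensionality reduction the slide refers to.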

Page 28: Text Mining and SEASR

Semantic Analysis: Information Extraction

•  Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
•  Extract the relevant information and ignore non-relevant information (important!)
•  Link related information and output in a predetermined format

Page 29: Text Mining and SEASR

Information Extraction

Information Type | State of the art (Accuracy)
Entities: an object of interest such as a person or organization. | 90-98%
Attributes: a property of an entity such as its name, alias, descriptor, or type. | 80%
Facts: a relationship held between two or more entities such as Position of a Person in a Company. | 60-70%
Events: an activity involving several entities such as a terrorist act, airline crash, management change, new product introduction. | 50-60%

“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL

Page 30: Text Mining and SEASR

Information Extraction Approaches

•  Terminology (name) lists
  –  This works very well if the list of names and name expressions is stable and available
•  Tokenization and morphology
  –  This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
•  Use of characteristic patterns
  –  This works fairly well for novel entities
  –  Rules can be created by hand or learned via machine learning or statistical algorithms
  –  Rules capture local patterns that characterize entities from instances of annotated training data
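The first two approaches above can be illustrated with a short sketch: a lookup against a fixed name list, and a regular expression that recognizes dates by their DD/MM/YY shape. The name list, the labels, and the sample sentence are made up for the example, and the pattern only checks the format, not whether the date is valid.

```python
import re

KNOWN_NAMES = ["Rex Luthor", "Boynton Laboratory"]           # hypothetical name list
DATE_PATTERN = re.compile(r"\b(\d{2})/(\d{2})/(\d{2})\b")    # DD/MM/YY shape only


def extract_entities(text):
    """Return (label, surface text) pairs found by list lookup and by pattern."""
    found = [("NAME", name) for name in KNOWN_NAMES if name in text]
    found += [("DATE", m.group(0)) for m in DATE_PATTERN.finditer(text)]
    return found


entities = extract_entities("Rex Luthor opened Boynton Laboratory on 04/02/09.")
```

Learned pattern rules, the third approach, would replace the hand-written list and regular expression with rules induced from annotated training data.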

Page 31: Text Mining and SEASR

Information Extraction

Relation (Event) Extraction
•  Identify (and tag) the relation among two entities:
  –  A person is_located_at a location (news)
  –  A gene codes_for a protein (biology)
•  Relations require more information
  –  Identification of two entities & their relationship
  –  Predicted relation accuracy
    •  Pr(E1) * Pr(E2) * Pr(R) ~= (.93) * (.93) * (.93) = .80
•  Information in relations is less local
  –  Contextual information is a problem: right word may not be explicitly present in the sentence
  –  Events involve more relations and are even harder
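The predicted relation accuracy above is a product of three step accuracies, which can be checked directly. The sketch assumes, as the slide's formula implicitly does, that the two entity steps and the relation step succeed independently; the function name is hypothetical.

```python
def relation_accuracy(p_entity1, p_entity2, p_relation):
    """Estimate relation accuracy as the product of the three step accuracies.

    Assumes the three steps succeed or fail independently of each other.
    """
    return p_entity1 * p_entity2 * p_relation


estimate = relation_accuracy(0.93, 0.93, 0.93)
print(round(estimate, 2))  # 0.8
```

Even with each step at 93%, the combined figure drops to about 80%, which is why relations and events score lower than entities.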

Page 32: Text Mining and SEASR

Semantic Analytics

Named Entity (NE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Tags: NE:Person (Rex Luthor), NE:Time (today), NE:Location (Alderwood), NE:Organization (Boynton Laboratory)

Page 33: Text Mining and SEASR

Semantic Analysis

Semantic Category (unnamed entity, UNE) Tagging

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Tags: UNE:Organization

Page 34: Text Mining and SEASR

Semantic Analysis

Co-reference Resolution for entities and unnamed entities

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Tags: UNE:Organization

Page 35: Text Mining and SEASR

Semantic Analysis

Semantic Role Analysis

Mayor Rex Luthor announced today the establishment of a new research facility in Alderwood. It will be known as Boynton Laboratory.

Role labels: ACTOR, ACTION, WHEN, OBJECT, WHERE; ACTION, OBJECT, COMPL

Page 36: Text Mining and SEASR

Semantic Analysis

Concept-Relation Extraction

Nodes: Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), today (time), Alderwood (location)

Edge labels: actor (who), time (when), object (what), object (what), location (where)

Page 37: Text Mining and SEASR

IE – Template Extraction - Steps

</VerbGroup> …

Page 38: Text Mining and SEASR

(c) 2001, Chicago Tribune. Visit the Chicago Tribune on the Internet at http://www.chicago.tribune.com/ Distributed by Knight Ridder/Tribune Information Services. By Stephen J. Hedges and Cam Simpson

…….

The Finsbury Park Mosque is the center of radical Muslim activism in England. Through its doors have passed at least three of the men now held on suspicion of terrorist activity in France, England and Belgium, as well as one Algerian man in prison in the United States.

The mosque's chief cleric, Abu Hamza al-Masri lost two hands fighting the Soviet Union in Afghanistan and he advocates the elimination of Western influence from Muslim countries. He was arrested in London in 1999 for his alleged involvement in a Yemen bomb plot, but was set free after Yemen failed to produce enough evidence to have him extradited. …

Template Extraction

<Facility>Finsbury Park Mosque</Facility>

<PersonPositionOrganization>
  <OFFLEN OFFSET="3576" LENGTH="33" />
  <Person>Abu Hamza al-Masri</Person>
  <Position>chief cleric</Position>
  <Organization>Finsbury Park Mosque</Organization>
</PersonPositionOrganization>

<Country>England</Country>

<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>

<Country>England</Country>

<Country>France </Country>

<Country>United States</Country>

<Country>Belgium</Country>

<Person>Abu Hamza al-Masri</Person>

<City>London</City>
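One way to consume templates like the ones above is to read each into a plain record of slot names and fillers. The sketch below does this with Python's standard XML parser for the PersonArrest template; treating OFFLEN as the character offset and length of the source span is an assumption, and `template_to_record` is a hypothetical helper, not part of the extraction system shown on the slide.

```python
import xml.etree.ElementTree as ET

# One template copied from the slide
TEMPLATE = """
<PersonArrest>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Person>Abu Hamza al-Masri</Person>
  <Location>London</Location>
  <Date>1999</Date>
  <Reason>his alleged involvement in a Yemen bomb plot</Reason>
</PersonArrest>
"""


def template_to_record(xml_text):
    """Turn one extracted template into a dict of slot name -> filler text."""
    root = ET.fromstring(xml_text.strip())
    record = {"template": root.tag}
    for child in root:
        if child.tag == "OFFLEN":
            # Assumed meaning: offset and length of the matched text span
            record["offset"] = int(child.attrib["OFFSET"])
            record["length"] = int(child.attrib["LENGTH"])
        else:
            record[child.tag] = (child.text or "").strip()
    return record


record = template_to_record(TEMPLATE)
```

The resulting record can then be stored or queried like any other structured data, which is the point of outputting extractions in a predetermined format.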

Page 39: Text Mining and SEASR

Streaming Text: Knowledge Extraction

•  Leveraging some earlier work on information extraction from text streams

Information extraction
•  process of using advanced automated machine learning approaches
•  to identify entities in text documents
•  extract this information along with the relationships these entities may have in the text documents

The visualization above demonstrates information extraction of names, places and organizations from real-time news feeds. As news articles arrive, the information is extracted and displayed. Relationships are defined when entities co-occur within a specific window of words.
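The co-occurrence rule in the last sentence can be sketched as follows: two entities are related when their mentions lie within a fixed number of tokens of each other. This is an illustrative reading of the rule, not the original system's code; the token stream, entity set, and window size are hypothetical, and entities are assumed to be already reduced to single tokens.

```python
from collections import Counter


def cooccurring_pairs(tokens, entities, window):
    """Count entity pairs whose mentions lie at most `window` tokens apart."""
    positions = [(i, tok) for i, tok in enumerate(tokens) if tok in entities]
    pairs = Counter()
    for idx, (i, first) in enumerate(positions):
        for j, second in positions[idx + 1:]:
            if j - i > window:
                break  # positions are sorted, so later mentions are farther away
            if first != second:
                pairs[tuple(sorted((first, second)))] += 1
    return pairs


# Hypothetical token stream in which entities are already single tokens
tokens = "Luthor said Alderwood will host Boynton while Luthor travels".split()
entities = {"Luthor", "Alderwood", "Boynton"}
pairs = cooccurring_pairs(tokens, entities, window=3)
```

With a window of 3, each of the three entity pairs co-occurs once in this stream; a smaller window would drop the Alderwood and Boynton link.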

Page 40: Text Mining and SEASR

Semantic Analysis

•  Word Sense Disambiguation
  –  Context based or proximity based
  –  Very accurate

Page 41: Text Mining and SEASR

Ontological Association (WordNet)

•  WordNet: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs
•  Search for dog
  –  n dog, domestic dog, Canis familiaris (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds)
  –  n frump, dog (a dull unattractive unpleasant girl or woman)
  –  n dog (informal term for a man)
  –  n cad, bounder, blackguard, dog, hound, heel (someone who is morally reprehensible)
  –  n frank, frankfurter, hotdog, hot dog, dog, wiener, wienerwurst, weenie (a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll)
  –  n pawl, detent, click, dog (a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward)
  –  n andiron, firedog, dog, dog-iron (metal supports for logs in a fireplace)
  –  v chase, chase after, trail, tail, tag, give chase, dog, go after, track (go after with the intent to catch)

Page 42: Text Mining and SEASR

Feature Selection

•  Reduce Dimensionality
  –  Learners have difficulty addressing tasks with high dimensionality
•  Irrelevant Features
  –  Not all features help!
  –  Remove features that occur in only a few documents
  –  Reduce features that occur in too many documents
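The two pruning rules above amount to filtering terms by document frequency. A minimal sketch, assuming documents are already tokenized; the `min_df` and `max_df_ratio` thresholds are hypothetical parameter names and values chosen for the example.

```python
from collections import Counter


def select_features(documents, min_df, max_df_ratio):
    """Keep terms that appear in at least `min_df` documents and in at most
    `max_df_ratio` (a fraction) of all documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_df_ratio * len(documents)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= limit)


documents = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
    "the cat flew".split(),
]
features = select_features(documents, min_df=2, max_df_ratio=0.75)
```

Here "the" is dropped for appearing in every document, and "dog" and "bird" for appearing in only one.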

Page 43: Text Mining and SEASR

Text Mining: General Application Areas

•  Information Retrieval
  –  Indexing and retrieval of textual documents
  –  Finding a set of (ranked) documents that are relevant to the query
•  Information Extraction
  –  Extraction of partial knowledge in the text
•  Web Mining
  –  Indexing and retrieval of textual documents and extraction of partial knowledge using the web
•  Classification
  –  Predict a class for each text document
•  Clustering
  –  Generating collections of similar text documents

Page 44: Text Mining and SEASR

Text Mining: Supervised vs. Unsupervised

•  Supervised learning (Classification)
  –  Data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  –  Split into training data and test data for model building process
  –  New data is classified based on the model built with the training data
  –  Techniques
    •  Bayesian classification, Decision trees, Neural networks, Instance-Based Methods, Support Vector Machines
•  Unsupervised learning (Clustering)
  –  Class labels of training data are unknown
  –  Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data

Page 45: Text Mining and SEASR

Results: Social Network (Tom in Red)

Page 46: Text Mining and SEASR

Results: Timeline

Page 47: Text Mining and SEASR

Results: Maps

Page 48: Text Mining and SEASR

Text Mining: T2K and ThemeWeaver

Page 49: Text Mining and SEASR

Text Mining: Themescape and ThemeRiver

•  Visualizing Relationships Between Documents

Images from Pacific Northwest Laboratory

Page 50: Text Mining and SEASR

Gather – Analyze – Present

Page 51: Text Mining and SEASR

Text Mining: Applications

•  Email: Spam filtering
•  News Feeds: Discover what is interesting
•  Medical: Identify relationships and link information from different medical fields
•  Homeland Security
•  Marketing: Discover distinct groups of potential buyers and make suggestions for other products
•  Industry: Identifying groups of competitors' web pages
•  Job Seeking: Identify parameters in searching for jobs

Page 52: Text Mining and SEASR

Text Mining: Classification Definition

•  Given: Collection of labeled records
  –  Each record contains a set of features (attributes), and the true class (label)
  –  Create a training set to build the model
  –  Create a testing set to test the model
•  Find: Model for the class as a function of the values of the features
•  Goal: Assign a class (as accurately as possible) to previously unseen records
•  Evaluation: What Is Good Classification?
  –  Correct classification
    •  Known label of test example is identical to the predicted class from the model
  –  Accuracy ratio
    •  Percent of test set examples that are correctly classified by the model
  –  Distance measure between classes can be used
    •  e.g., classifying “football” document as a “basketball” document is not as bad as classifying it as “crime”
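The definition above (labeled records, a training set, a testing set, and an accuracy ratio) can be made concrete with a small multinomial Naive Bayes classifier, one of the Bayesian techniques the slides mention. This is a minimal sketch on made-up spam/ham records, not the MONK or SEASR implementation; all function names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero the score
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test_set):
    """Fraction of test examples whose predicted label equals the known label."""
    correct = sum(predict(model, tokens) == label for tokens, label in test_set)
    return correct / len(test_set)


# Hypothetical labeled records, split by hand into training and testing sets
training = [
    ("win cash prize now".split(), "spam"),
    ("cheap prize offer".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
testing = [
    ("cash offer now".split(), "spam"),
    ("monday project agenda".split(), "ham"),
]
model = train_naive_bayes(training)
test_accuracy = accuracy(model, testing)
```

The accuracy ratio treats every error equally; a distance measure between classes, as in the football/basketball example, would need a per-pair cost instead of a simple match count.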

Page 53: Text Mining and SEASR

Text Mining: Clustering Definition

•  Given: Set of documents and a similarity measure among documents
•  Find: Clusters such that
  –  Documents in one cluster are more similar to one another
  –  Documents in separate clusters are less similar to one another
•  Goal:
  –  Finding a correct set of documents
•  Similarity Measures:
  –  Euclidean distance if attributes are continuous
  –  Other problem-specific measures
    •  e.g., how many words are common in these documents
•  Evaluation: What Is Good Clustering?
  –  Produce high quality clusters with
    •  high intra-class similarity
    •  low inter-class similarity
  –  Quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
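The two similarity measures named above can be sketched directly: the number of words two documents share, and the Euclidean distance between their word-count vectors. The sample documents and function names are made up for the example, and no clustering algorithm is run here.

```python
import math
from collections import Counter


def shared_word_count(doc_a, doc_b):
    """Number of distinct words the two documents have in common."""
    return len(set(doc_a) & set(doc_b))


def euclidean_distance(doc_a, doc_b):
    """Euclidean distance between the word-count vectors of two documents."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    vocabulary = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocabulary))


doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()
doc3 = "stocks fell sharply today".split()

overlap_12 = shared_word_count(doc1, doc2)
overlap_13 = shared_word_count(doc1, doc3)
distance_12 = euclidean_distance(doc1, doc2)
distance_13 = euclidean_distance(doc1, doc3)
```

Both measures agree that the two cat documents are closer to each other than either is to the stock-market document, which is the property a clustering method relies on.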

Page 54: Text Mining and SEASR

SEASR

Meandre Workbench

Page 55: Text Mining and SEASR

Future Work

•  Enhancements to Semantic Analysis
  –  Use of Ontological Association (WordNet, VerbNet)
  –  Improve co-referencing
  –  Improve fact extraction
•  Visual exploration tools