event extraction - yyy•semi-supervised learning 1. a few high-precision seed patterns or seed...

复旦大学大数据学院 School of Data Science, Fudan University Chinese Event Extraction 杨依莹 2017.11.22


Page 1: Chinese Event Extraction

School of Data Science, Fudan University (复旦大学大数据学院)

杨依莹

2017.11.22

Page 2: Outline

1. ACE program

2. Assignment 3: Chinese event extraction

3. CRF++: Yet Another CRF toolkit

Section 1: ACE program

Page 3: ACE program

Automatic Content Extraction (ACE) program:

• The objective of the Automatic Content Extraction (ACE) program was to develop extraction technology to support automatic processing of source language data (in the form of natural text and as text derived from ASR and OCR).

• The program covers English, Arabic and Chinese texts.

• The ACE corpus is one of the standard benchmarks for testing new information extraction algorithms.

Page 4: ACE program

Automatic Content Extraction (ACE) program:

Given a text in natural language, the ACE challenge is to detect:

1. entities mentioned in the text, such as persons, organizations, locations, facilities and weapons.

2. relations between entities, such as: person A is the manager of company B. Relation types include role, part, located, near and social.

3. events mentioned in the text, such as interaction, movement, transfer, creation and destruction.

Page 5: ACE program

Automatic Content Extraction (ACE) program:

An example of text

Page 6: ACE program : entity

• Entity Detection and Tracking (EDT)

• ACE tasks identified seven types of entities: Person, Organization, Location, Facility, Weapon, Vehicle and Geo-Political Entity (GPE). Each type was further divided into subtypes.

• For every mention, the annotator identified the maximal extent of the string that represents the entity and labeled the head of each mention. Nested mentions were also captured.

Page 7: ACE program : relation

• Relation Detection and Characterization (RDC):

  • involved the identification of relations between entities.

  • For every relation, annotators identified two primary arguments (namely, the two ACE entities that are linked) as well as the relation's temporal attributes.

Page 8: ACE program : relation

• Create new structured knowledge bases, useful for any app

• Augment current knowledge bases

  • Adding words to the WordNet thesaurus, facts to Freebase or DBpedia

  • DBpedia: an ontology derived from Wikipedia containing over 2 billion RDF triples.

  • Freebase: a dataset from Wikipedia infoboxes.

  • On 16 December 2015, Google officially announced the Knowledge Graph API, which is meant to be a replacement for the Freebase API.

• Support question answering

  • The granddaughter of which actor starred in the movie “E.T.”?

  • (acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
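The question-answering query above can be read as a conjunction of triple patterns. A minimal sketch, assuming a toy in-memory knowledge base (the triples and the `answer` helper are illustrative and not part of the slides):

```python
# Toy knowledge base of (subject, predicate, object) triples.
# Illustrative data chosen for this sketch; it is not part of the slides.
TRIPLES = {
    ("Drew Barrymore", "acted-in", "E.T."),
    ("Henry Thomas", "acted-in", "E.T."),
    ("John Barrymore", "is-a", "actor"),
    ("Drew Barrymore", "granddaughter-of", "John Barrymore"),
}

def answer(triples):
    """Return all (?x, ?y) bindings that satisfy the three query clauses."""
    # (acted-in ?x "E.T.")
    cast = {s for s, p, o in triples if p == "acted-in" and o == "E.T."}
    # (is-a ?y actor)
    actors = {s for s, p, o in triples if p == "is-a" and o == "actor"}
    # (granddaughter-of ?x ?y)
    return sorted(
        (x, y) for x in cast for y in actors
        if (x, "granddaughter-of", y) in triples
    )

print(answer(TRIPLES))
```

Each extracted relation becomes one triple, so the more relations are extracted, the more such questions the knowledge base can answer.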

Page 9: ACE program : relation

Automatic Content Extraction (ACE) program:

• 7 types and 17 subtypes of relations from the “Relation Extraction Task”

ARTIFACT: User-Owner-Inventor-Manufacturer

GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin

ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation

PART-WHOLE: Geographical, Subsidiary

PERSON-SOCIAL: Business, Family, Lasting Personal

PHYSICAL: Located, Near

Page 10: ACE program : relation

• Physical-Located (PER-GPE)

  • He was in Tennessee

• Part-Whole-Subsidiary (ORG-ORG)

  • XYZ, the parent company of ABC

• Person-Social-Family (PER-PER)

  • John’s wife Yoko

• Org-AFF-Founder (PER-ORG)

  • Steve Jobs, co-founder of Apple…

Page 11: ACE program : relation

• Using patterns to extract relations

  • Lexico-syntactic patterns
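As a minimal sketch of a lexico-syntactic pattern, the regular expression below looks for the surface form "PERSON, (co-)founder of ORG". The pattern, the `extract_founders` helper and the sample sentence are assumptions made for illustration; a real system would match over entity and part-of-speech annotations rather than capitalization.

```python
import re

# Hypothetical pattern: "<PER>, (co-)founder of <ORG>" signals Org-AFF-Founder.
# Capitalized word sequences stand in for real named-entity annotations.
FOUNDER = re.compile(
    r"(?P<per>[A-Z][a-z]+(?: [A-Z][a-z]+)*), (?:co-)?founder of (?P<org>[A-Z][A-Za-z]+)"
)

def extract_founders(text):
    """Return (person, relation, organization) tuples matched by the pattern."""
    return [
        (m.group("per"), "Org-AFF-Founder", m.group("org"))
        for m in FOUNDER.finditer(text)
    ]

print(extract_founders("Steve Jobs, co-founder of Apple, unveiled the phone."))
```

Hand-written patterns like this tend to be precise but miss most paraphrases, which is one motivation for the learning-based approaches on the following pages.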

Page 12: ACE program : relation

• Supervised learning

1. Find all pairs of named entities

2. Decide if the 2 entities are related

3. If yes, classify the relation

Page 13: ACE program : relation

• Supervised learning

  • The most important step: classification

  • e.g. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
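A sketch of what the classifier might see for the example sentence: step 1 enumerates entity pairs, and each pair is turned into a small feature dictionary. The tokenization, the entity spans and the feature names are assumptions for illustration; the classifier itself (steps 2 and 3) is left out.

```python
from itertools import combinations

TOKENS = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said", "."]
# Assumed entity mentions as (start, end, type), with end exclusive.
ENTITIES = [(0, 2, "ORG"), (6, 7, "ORG"), (14, 16, "PER")]

def candidate_pairs(entities):
    """Step 1: enumerate all unordered pairs of entity mentions."""
    return list(combinations(entities, 2))

def pair_features(tokens, e1, e2):
    """Simple features for one pair: entity types, head words, words in between."""
    first, second = sorted([e1, e2])
    between = tokens[first[1]:second[0]]
    return {
        "types": first[2] + "-" + second[2],
        "head1": tokens[first[1] - 1],   # last token used as a rough head word
        "head2": tokens[second[1] - 1],
        "between": " ".join(between),
        "distance": len(between),
    }

for e1, e2 in candidate_pairs(ENTITIES):
    print(pair_features(TOKENS, e1, e2))
```

For the pair (American Airlines, AMR) the words in between are ", a unit of", which is the kind of cue a Part-Whole classifier can learn from.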

Page 14: ACE program : relation

• Semi-supervised learning

1. Start from a few high-precision seed patterns or seed tuples.

2. Find sentences that contain the entities in a seed pair.

3. Extract and generalize the context to learn new patterns.

This may cause semantic drift.
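The three steps can be sketched as a tiny bootstrapping loop. This is a simplified illustration under stated assumptions: a "pattern" is just the literal text between the two entities, entities are approximated by capitalized words, and the sentences and the `bootstrap` helper are invented for the example.

```python
import re

def bootstrap(sentences, seed_pairs, rounds=1):
    """Learn 'X <context> Y' patterns from seed pairs, then harvest new pairs."""
    pairs = set(seed_pairs)
    patterns = set()
    for _ in range(rounds):
        # Steps 2 and 3: find sentences with a known pair, keep the text between.
        for x, y in list(pairs):
            for s in sentences:
                m = re.search(re.escape(x) + r"(.+?)" + re.escape(y), s)
                if m:
                    patterns.add(m.group(1))
        # Apply every learned context to collect new candidate pairs.
        for ctx in patterns:
            rx = re.compile(r"([A-Z]\w+)" + re.escape(ctx) + r"([A-Z]\w+)")
            for s in sentences:
                pairs.update(rx.findall(s))
    return patterns, pairs

SENTENCES = [
    "Jobs founded Apple in 1976.",
    "Gates founded Microsoft in 1975.",
    "Page visited Apple last week.",
]
print(bootstrap(SENTENCES, {("Jobs", "Apple")}))
```

Nothing in this loop checks whether a newly learned pattern still expresses the seed relation, which is how semantic drift creeps in over several rounds.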

Page 15: ACE program : relation

• Semi-supervised learning

  • To avoid semantic drift, we introduce a confidence value.

  • Set conservative confidence thresholds for the acceptance of new patterns and tuples.
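One simple way to realize a confidence value, shown here only as an assumption (the slides do not give a formula), is the share of a pattern's extractions that are already trusted tuples. The 0.8 threshold and both helper names are hypothetical.

```python
def pattern_confidence(extracted, known):
    """Share of a pattern's extractions that are already trusted (0.0 if none)."""
    extracted = set(extracted)
    if not extracted:
        return 0.0
    return len(extracted & set(known)) / len(extracted)

def accept_patterns(candidates, known, threshold=0.8):
    """Keep only patterns whose confidence reaches the conservative threshold."""
    return {
        pattern for pattern, extracted in candidates.items()
        if pattern_confidence(extracted, known) >= threshold
    }

KNOWN = {("Jobs", "Apple"), ("Gates", "Microsoft")}
CANDIDATES = {
    " founded ": [("Jobs", "Apple"), ("Gates", "Microsoft")],
    " visited ": [("Page", "Apple"), ("Jobs", "Apple")],
}
print(accept_patterns(CANDIDATES, KNOWN))
```

A higher threshold admits fewer new patterns per round, trading coverage for protection against drift.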

Page 16: ACE program : event

Automatic Content Extraction (ACE) program:

• Event Detection and Characterization (EDC)

Page 17: Outline

1. ACE program

2. Assignment 3: Chinese event extraction

3. CRF++: Yet Another CRF toolkit

Section 2: Assignment 3: Chinese event extraction

Page 18: Description

• In this assignment, you will need to use sequence labeling models for Chinese event extraction.

• Event information is defined as two parts:

  • Trigger: the main word that most clearly expresses the occurrence of an event.

  • Argument: an entity, temporal expression or value that plays a certain role in the event.

• For example: “英特尔在中国成立了研究中心” (“Intel established a research center in China”)

  • “成立” (“established”) is the trigger, of type Business.

  • “英特尔” (“Intel”), “中国” (“China”) and “研究中心” (“research center”) are the arguments, of types Agent, Place and Org.
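As a sketch, the example sentence can be written as one label sequence per subtask. The word segmentation and the exact label spellings (`T_Business`, `A_Agent`, and so on) are assumptions here; the training files define the real inventory.

```python
# Assumed segmentation and label spellings; check the training files for the real ones.
SENTENCE = ["英特尔", "在", "中国", "成立", "了", "研究中心"]
TRIGGER_LABELS = ["O", "O", "O", "T_Business", "O", "O"]
ARGUMENT_LABELS = ["A_Agent", "O", "A_Place", "O", "O", "A_Org"]

def labeled_words(words, labels):
    """Return the (word, label) pairs that are not tagged 'O'."""
    return [(w, l) for w, l in zip(words, labels) if l != "O"]

print(labeled_words(SENTENCE, TRIGGER_LABELS))
print(labeled_words(SENTENCE, ARGUMENT_LABELS))
```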

Page 19: Description

• This task is separated into two subtasks:

  • Trigger labeling: identify the trigger word in the sentence, and classify it into the following 8 types:

  • Argument labeling: identify all the arguments in the sentence, and classify them into 35 types (some are listed below; all types can be found in the training file):

• You are required to use both HMM and CRF models for this task. You can use any toolkit for their implementation.

• Note that the performance of the HMM can be very poor.
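For the HMM part, a minimal count-based tagger with add-one smoothing and Viterbi decoding might look like the sketch below. It is one possible baseline under simple assumptions (first-order transitions, no special handling of rare words), not the required implementation, and the one-sentence training set only keeps the example short.

```python
import math
from collections import Counter, defaultdict

def train_hmm(sentences):
    """Collect transition and emission counts from lists of (word, tag) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    tags, vocab = set(), set()
    for sent in sentences:
        prev = "<s>"  # sentence-start state
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(tags), len(vocab)

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    trans, emit, tags, v = model
    n = len(tags)
    # Each column maps tag -> (best log score ending in tag, previous tag).
    best = [{t: (log_prob(trans["<s>"], t, n) + log_prob(emit[t], words[0], v + 1), None)
             for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log_prob(trans[p], t, n), p) for p in tags)
            col[t] = (score + log_prob(emit[t], w, v + 1), prev)
        best.append(col)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

TRAIN = [[("英特尔", "O"), ("在", "O"), ("中国", "O"),
          ("成立", "T_Business"), ("了", "O"), ("研究中心", "O")]]
model = train_hmm(TRAIN)
print(viterbi([w for w, _ in TRAIN[0]], model))
```

The emission table is indexed by the word alone, so the model has no access to context features; this is one reason an HMM usually trails a CRF on this task.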

Page 20: Formal Definition

Input: A sequence of segmented Chinese words.

Output: Label each word with ‘T_type’ (trigger), ‘A_type’ (argument) or ‘O’ (neither trigger nor argument). Save your predicted label after the real label, separated by a tab.

Fig. 1: input. Fig. 2: training instance. Fig. 3: testing result.
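A small sketch of the input and output handling described above: one word and its label per line, tab-separated, with a blank line between instances, and the prediction appended as an extra tab-separated column. The helper names are hypothetical.

```python
def read_instances(lines):
    """Split 'word<TAB>label' lines into instances at blank lines."""
    instances, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                instances.append(current)
                current = []
            continue
        current.append(tuple(line.split("\t")))
    if current:
        instances.append(current)
    return instances

def format_result(instance, predictions):
    """Append each predicted label after the real label, separated by a tab."""
    return "\n".join(
        "\t".join(row + (pred,)) for row, pred in zip(instance, predictions)
    )

sample = ["成立\tT_Business\n", "了\tO\n", "\n", "中国\tO\n"]
instances = read_instances(sample)
print(format_result(instances[0], ["T_Business", "O"]))
```

Passing an open file object as `lines` works as well, since iterating a file yields its lines.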

Page 21: Provided Files

• trigger_train.txt & trigger_test.txt:

  • These two files contain 1,918 and 669 instances for training and testing, respectively.

  • Each line contains one word and its label, separated by a tab.

  • Instances are separated by a blank line.

• argument_train.txt & argument_test.txt:

  • These two files contain 2,131 and 997 instances for training and testing, respectively.

• Your job is to predict the sequence labels for the instances in the test files, and write your predictions to the result files. The labels in the test files are only for evaluation.

• eval.py

  • This file can help you evaluate your model’s recall, accuracy, precision and F1-score.
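To make the four metrics concrete, here is a token-level sketch. It is not the provided eval.py, whose exact definitions may differ; the assumption here is that a hit is a non-'O' prediction that equals the real label, and `T_Life` is a hypothetical label used only for the example.

```python
def evaluate(gold, pred):
    """Token-level accuracy, precision, recall and F1 over non-'O' labels."""
    assert len(gold) == len(pred)
    correct = sum(g == p for g, p in zip(gold, pred))
    hits = sum(g == p and g != "O" for g, p in zip(gold, pred))
    n_pred = sum(p != "O" for p in pred)   # labels the model proposed
    n_gold = sum(g != "O" for g in gold)   # labels it should have found
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    accuracy = correct / len(gold) if gold else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

gold = ["O", "T_Business", "O", "T_Life"]
pred = ["O", "T_Business", "T_Life", "O"]
print(evaluate(gold, pred))
```

Because most words are labeled 'O', accuracy alone can look high even when few triggers or arguments are found, so precision, recall and F1 are the more informative numbers.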

Page 22: Submission

• Generate a zip file and name it “sid_homework-3.zip”.

• It should include a Python file named “extraction.py”, two output files named “trigger_result.txt” and “argument_result.txt”, and a written report named “chinese event extraction.pdf”.

• Program: the code should be written in Python.

• Report: the report needs to be written in English and be no more than 4 pages long.

Page 23: Evaluation

• We will mark your homework based on four criteria:

  • Final accuracy (20%)

  • Program (30%)

  • Report (40%)

  • HMM implementation (10%)

Page 24: Due

• Submit your homework via the E-learning system.

  • Deadline: midnight on December 8th, 2017

• If you have any questions about this homework, send an email to the TA or to our course mailbox.

• TA in charge

  • 杨依莹 ([email protected])

Page 25: Outline

1. ACE program

2. Assignment 3: Chinese event extraction

3. CRF++: Yet Another CRF toolkit

Section 3: CRF++: Yet Another CRF toolkit

Page 26: CRF++: Yet Another CRF toolkit

• CRF++ (http://taku910.github.io/crfpp/) is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.

• CRF++ is designed for generic purpose and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.

Page 27: CRF++: Yet Another CRF toolkit

• Template basic

• Each line in the template file denotes one template. In each template, the special macro %x[row,col] is used to specify a token in the input data. row is the position relative to the current token, and col is the absolute position of the column.

• Here are some examples of the replacements:

Input: Data

He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP  << CURRENT
current   JJ   I-NP
account   NN   I-NP

template          expanded feature
%x[0,0]           the
%x[0,1]           DT
%x[-1,0]          reckons
%x[-2,1]          PRP
%x[0,0]/%x[0,1]   the/DT
ABC%x[0,1]123     ABCDT123
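The macro expansion in the table can be reproduced with a few lines of Python. This is a sketch of the substitution rule only, not CRF++'s own code; in particular the `_OUT_OF_RANGE` placeholder for positions before the first or after the last token is this sketch's choice.

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand(template, rows, current):
    """Replace each %x[row,col] with the token at (current + row, col)."""
    def repl(match):
        r = current + int(match.group(1))
        c = int(match.group(2))
        # Placeholder for positions outside the sentence (sketch-only choice).
        return rows[r][c] if 0 <= r < len(rows) else "_OUT_OF_RANGE"
    return MACRO.sub(repl, template)

ROWS = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],      # current token, index 2
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]
for template in ["%x[0,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]", "ABC%x[0,1]123"]:
    print(template, "->", expand(template, ROWS, 2))
```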

Page 28: CRF++: Yet Another CRF toolkit

• Training (encoding)

  • Use the crf_learn command:

% crf_learn template_file train_file model_file

• There are 4 major parameters to control the training condition:

  • -a CRF-L2 or CRF-L1: Changes the regularization algorithm. The default setting is L2. Generally speaking, L2 performs slightly better than L1.

  • -c float: With this option, you can change the hyper-parameter for the CRFs. This parameter trades the balance between overfitting and underfitting.

  • -f NUM: This parameter sets the cut-off threshold for the features. CRF++ uses the features that occur no less than NUM times in the given training data. The default value is 1.

  • -p NUM: If the PC has multiple CPUs, you can make the training faster by using multi-threading. NUM is the number of threads.
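Since the submission has to be a Python program, one option is to call crf_learn through `subprocess`. The sketch below only assembles the command line from the four options listed above; the wrapper names, the default values chosen here and the file name `template` are assumptions, and `crf_learn` must be installed and on the PATH for `train` to work.

```python
import subprocess

def crf_learn_command(template, train, model,
                      algorithm="CRF-L2", c=1.0, min_freq=1, threads=1):
    """Build the crf_learn argument list from the four major options."""
    return [
        "crf_learn",
        "-a", algorithm,        # CRF-L2 or CRF-L1
        "-c", str(c),           # overfitting/underfitting trade-off
        "-f", str(min_freq),    # feature cut-off threshold
        "-p", str(threads),     # number of threads
        template, train, model,
    ]

def train(*args, **kwargs):
    """Run crf_learn and raise CalledProcessError if it fails."""
    subprocess.run(crf_learn_command(*args, **kwargs), check=True)

print(" ".join(crf_learn_command("template", "trigger_train.txt", "trigger.model",
                                 c=1.5, threads=2)))
```

Trying a few values of `-c` and `-f` and comparing results on the test files is a simple way to explore the overfitting trade-off mentioned above.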

Page 29: CRF++: Yet Another CRF toolkit

• Testing (decoding)

  • Use the crf_test command:

% crf_test -m model_file test_files

• Here model_file is the file crf_learn creates, and test_file is the test data you want to assign sequential tags to. This file has to be written in the same format as the training file.

• The -v option sets the verbose level. The default value is 0. With a higher level you can also get marginal probabilities for each tag and a conditional probability for the output.

% crf_test -v1 -m model test.data | head

Rockwell       NNP  B  B/0.992465
International  NNP  I  I/0.979089
Corp.          NNP  I  I/0.954883
's             POS  B  B/0.986396
Tulsa          NNP  I  I/0.991966
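The verbose output is easy to post-process in Python. A minimal sketch, assuming the layout shown above (the last column is `tag/probability`, the column before it is the real label) and skipping blank lines and lines whose first field is `#`; the helper names and the 0.98 threshold are hypothetical.

```python
def parse_verbose(lines):
    """Parse 'crf_test -v1' rows into (word, real label, predicted tag, probability)."""
    rows = []
    for line in lines:
        cols = line.split()
        if not cols or cols[0] == "#":
            continue  # blank separator or per-sentence comment line
        tag, _, prob = cols[-1].rpartition("/")
        rows.append((cols[0], cols[-2], tag, float(prob)))
    return rows

def low_confidence(rows, threshold=0.98):
    """Return the rows whose marginal probability is below the threshold."""
    return [row for row in rows if row[3] < threshold]

OUTPUT = [
    "Rockwell NNP B B/0.992465",
    "International NNP I I/0.979089",
    "Corp. NNP I I/0.954883",
]
print(low_confidence(parse_verbose(OUTPUT)))
```

Listing the low-probability tokens is a quick way to find examples worth discussing in the error analysis of the report.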

Page 30

Thanks for your attention! (感谢各位聆听!)