named entity recognition - acl 2011 presentation
DESCRIPTION
Given for the Multiword class.
TRANSCRIPT
The Web is not a PERSON, Berners-Lee is not an ORGANIZATION, and African-Americans are not LOCATIONS: An Analysis of the Performance of Named-Entity Recognition
Robert Krovetz (Lexicalresearch.com), Paul Deane, Nitin Madnani (ETS)
A Review by Richard Littauer (UdS)
The Background
Named-Entity Recognition (NER) is normally judged in the context of Information Extraction (IE).
Various competitions. Recently:
◦ non-English languages
◦ improving unsupervised learning methods
The Background
“There are no well-established standards for evaluation of NER.”
◦ Criteria for NER systems change between competitions
◦ Proprietary software
The Background
KDM wanted to identify MWEs… but false positives and tagging inconsistencies stopped this.
IE derives Recall and Precision from Information Retrieval.
NER is just a small part of this, so it is rarely evaluated independently.
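The IR-derived measures mentioned above can be sketched as set comparisons over (entity, tag) pairs. A minimal illustration; the gold and predicted entities below are invented for the example:

```python
# Precision/recall/F1 over predicted vs. gold entities, as IE
# evaluations borrow them from Information Retrieval.
# The example entities are hypothetical.

def prf(gold, predicted):
    """Compute precision, recall, and F1 over sets of (entity, tag) pairs."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: exact matches of span and tag
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Berners-Lee", "PERSON"), ("Web", "MISC"), ("MIT", "ORGANIZATION")}
pred = {("Berners-Lee", "ORGANIZATION"), ("Web", "MISC"), ("MIT", "ORGANIZATION")}
p, r, f = prf(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

Note that a mistagged entity (Berners-Lee as ORGANIZATION) costs both precision and recall, which is why NER errors buried inside an IE pipeline are easy to overlook.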
The Background
So, they want to test NER systems, and provide a unit test based on the problems encountered.
Evaluation
Compared three NER taggers:
Stanford:
◦ CRF, 100m training corpus
University of Illinois (LBJ):
◦ Regularized average perceptron, Reuters 1996 News Corpus
BBN IdentiFinder:
◦ HMMs, commercial
Evaluation
Agreement on Classification
Ambiguity in Discourse
◦ Stanford vs. LBJ on internal ETS 425m corpus
◦ All three on the American National Corpus
Stanford vs. LBJ
NER is reported as 85-95% accurate.
Roughly the same number of entities for both: 1.95m for Stanford, 1.8m for LBJ (a 7.6% difference).
However, errors:
Stanford vs. LBJ
Agreement:
Stanford vs. LBJ
Ambiguity:
Stanford vs. LBJ vs. IdentiFinder
Agreement:
Stanford vs. LBJ vs. IdentiFinder
Differences:
◦ How they are tokenized
◦ Number of entities recognized overall
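Pairwise agreement between taggers can be sketched as a token-level comparison. The two output dictionaries below are invented for illustration; a real comparison must first reconcile the tokenization differences noted above:

```python
# Token-level label agreement between two hypothetical tagger outputs.
# Real comparisons must first align tokenizations, which is one of
# the discrepancies between the systems compared here.

stanford = {"Berners-Lee": "PERSON", "Web": "MISC", "Amherst": "LOCATION"}
lbj = {"Berners-Lee": "PERSON", "Web": "ORGANIZATION", "Amherst": "ORGANIZATION"}

shared = stanford.keys() & lbj.keys()  # tokens both systems recognized
agreed = sum(1 for tok in shared if stanford[tok] == lbj[tok])
print(f"agreement: {agreed}/{len(shared)}")  # agreement: 1/3
```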
Stanford vs. LBJ vs. IdentiFinder
Ambiguity:
Unit Test
Created two documents that can be used as test texts:
◦ Different cases for true positives of PERSON, LOCATION, ORGANIZATION
◦ Entirely upper-case terms that are not NEs (e.g. AAARGH)
◦ Punctuated terms that are not NEs
◦ Terms with initials
◦ Acronyms (some expanded, some not)
◦ Last names in close proximity to first names
◦ Terms with prepositions (Mass. Inst. of Tech.)
◦ Terms with both a location and an organization (Amherst College)
Provided freely online.
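A unit test over such phenomena might be sketched as below. The tagger here is a trivial placeholder so the example runs, and the cases mirror two of the categories above; the paper's actual test documents are the ones provided online:

```python
# Sketch of a unit test for an NER tagger over the phenomena listed above.
# `tag` is a stand-in, not any of the systems compared; a real test
# would call Stanford, LBJ, or IdentiFinder instead.

def tag(text):
    """Placeholder tagger: looks up a toy lexicon; None means 'not an entity'."""
    lexicon = {"Amherst College": "ORGANIZATION", "AAARGH": None}
    return lexicon.get(text)

# (input, expected tag); None means the text should not be tagged at all
cases = [
    ("AAARGH", None),                      # all upper case, not a named entity
    ("Amherst College", "ORGANIZATION"),   # organization, despite the location inside
]

for text, expected in cases:
    assert tag(text) == expected, f"{text!r}: got {tag(text)!r}"
print("all cases passed")
```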
One NE Tag per Discourse
It is unusual for multiple occurrences of a token in a document to refer to different entities.
This is true even for homonyms. An exception: a location and its sports team.
Stanford and LBJ have features for non-local dependencies (NLD) to help with this.
KDM: two other uses for NLD:
◦ A source of error in evaluation
◦ A way to identify semantically related entities
These should be treated as exceptions.
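The one-tag-per-discourse heuristic can be sketched as a document-level majority vote over a token's occurrences. The tags below are invented for the example:

```python
# Collapse a token's per-occurrence tags to the document-level majority tag,
# a simple reading of the "one NE tag per discourse" heuristic.
from collections import Counter

def one_tag_per_discourse(occurrences):
    """Return the occurrence list with every tag replaced by the majority tag."""
    majority, _ = Counter(occurrences).most_common(1)[0]
    return [majority] * len(occurrences)

# Hypothetical tags for repeated mentions of "Chicago" in one document
tags = ["LOCATION", "LOCATION", "ORGANIZATION", "LOCATION"]
print(one_tag_per_discourse(tags))  # ['LOCATION', 'LOCATION', 'LOCATION', 'LOCATION']
```

The location/sports-team exception above is exactly the case where this vote goes wrong, so a real system should not apply it blindly.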
Discussion
There are guidelines for NER, but we need standards.
The community should focus on PERSON, ORGANISATION, LOCATION, and MISC:
◦ Harder to deal with than Dates and Times
◦ Disagreement between taggers
◦ MISC is necessary
◦ These have important value elsewhere
Discussion
To improve intrinsic evaluation for NER:
1. Create test sets for diverse domains.
2. Use standardized sets for different phenomena.
3. Report accuracy for POL separately.
4. Establish uncertainty in the tagging system.
Conclusion
The reported 90% accuracy is not real. We need to use only entities that are agreed on by multiple taggers.
Even cases where the taggers disagree are informative (hint: future work).
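Restricting evaluation to entities that all taggers agree on can be sketched as a set intersection. The three output sets below are hypothetical:

```python
# Keep only (entity, tag) pairs on which all three taggers agree exactly.
# The per-tagger output sets are invented for illustration.

stanford = {("Berners-Lee", "PERSON"), ("Web", "MISC")}
lbj = {("Berners-Lee", "PERSON"), ("Web", "ORGANIZATION")}
identifinder = {("Berners-Lee", "PERSON")}

agreed = stanford & lbj & identifinder
print(agreed)  # {('Berners-Lee', 'PERSON')}
```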
Unit test downloadable.
Cheers/PERSON
Richard/ORGANISATION thanks the Mword Class/LOCATION for listening to his talk about Berners-Lee/MISC