NAIST Big Data Symposium - Information Science, Prof. Matsumoto
Scientific Paper Analysis
Yuji Matsumoto
Computational Linguistics Lab
Graduate School of Information Science
March 6, 2015
Big Data Symposium at NAIST
Large Scale Text Data
  Data on the Web
    SNS: Twitter, blogs
    Wikipedia
    News, …
  Scientific/Technical documents
    Scientific papers
    Legal documents: law reports, casebooks
    Patent documents
Knowledge Bases
  Constructed manually
    WordNet, domain ontologies
  Constructed by community (Wikipedia)
    Freebase
  Constructed automatically
    NELL: Never-Ending Language Learning
    MindNet
Applications
  Knowledge Graph (Google)
    Knowledge extracted from Freebase, Wikipedia, …
  Watson (IBM)
    Extracted from Wikipedia
    Deep QA
Structures of KB
  Linked structure: entities and relations
  PDF
  Entity: person, country, products, etc.
  Relation:
    born_in(Barack Obama, Honolulu)
    locates_in(Honolulu, Hawaii)
    state_of(Hawaii, USA)
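
The linked structure of entities and relations can be illustrated with a small sketch. It stores the three relations from the slide as triples and follows the links from one entity to the next. The function names `build_index` and `follow_links` are hypothetical helpers introduced only for this illustration; they are not part of any knowledge base mentioned in the talk.

```python
from collections import defaultdict

# Facts from the slide, stored as (relation, subject, object) triples.
TRIPLES = [
    ("born_in", "Barack Obama", "Honolulu"),
    ("locates_in", "Honolulu", "Hawaii"),
    ("state_of", "Hawaii", "USA"),
]


def build_index(triples):
    """Index triples by subject so links can be followed entity to entity."""
    index = defaultdict(list)
    for relation, subject, obj in triples:
        index[subject].append((relation, obj))
    return index


def follow_links(index, start):
    """Return the chain of (relation, entity) pairs reachable from `start`.

    Hypothetical helper: follows the first outgoing link of each entity and
    stops at an entity with no outgoing links or one already visited.
    """
    chain, seen, current = [], {start}, start
    while index.get(current):
        relation, nxt = index[current][0]
        if nxt in seen:
            break
        chain.append((relation, nxt))
        seen.add(nxt)
        current = nxt
    return chain


index = build_index(TRIPLES)
chain = follow_links(index, "Barack Obama")
for relation, entity in chain:
    print(relation, "->", entity)
```

Because each object of one relation is the subject of the next, the three separate facts form a single path from a person to a country, which is the point of a linked representation.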
Natural Language Analysis
  How text is analyzed
    Word segmentation, part-of-speech tagging
    Named entity recognition
    Syntactic parsing
    Semantic disambiguation
    Semantic parsing
    Discourse analysis
Linked Knowledge Extraction
  Named entity recognition
    Extraction of entities, concepts
  Syntactic dependency parsing
    Direct dependency between entities
  Semantic parsing
    Predicate-argument structure analysis
    Subject-predicate-object, relation between entities
  Discourse analysis
    Co-reference: the same entity referred to by different mentions
    Relation between facts: temporal, causal
Semantic Parsing: Example
  We analyzed the effect on the binding and the activity of transcription factors at a regulatory element.
  TPA induction inhibits the binding of the transcription factor NF-E2 to this transcriptional control element.
  TPA induction increases the binding of AP-1 factors to this element.
  [Figure: sentences labeled S1, S2, S3, with event-argument arcs labeled Cause and Theme]
Katsumasa Yoshikawa, Sebastian Riedel, Tsutomu Hirao, Masayuki Asahara, Yuji Matsumoto, "Coreference Based Event-Argument Relation Extraction on Biomedical Text," Journal of Biomedical Semantics, Volume 2, Supplement 5, S6, October 2011
Co-reference analysis
  "this element" in S2 is coreferent to… "a regulatory element" in S1
  We analyzed the effect on the binding and the activity of transcription factors at a regulatory element.
  TPA induction inhibits the binding of the transcription factor NF-E2 to this transcriptional control element.
  TPA induction increases the binding of AP-1 factors to this element.
  [Figure: sentences labeled S1, S2, S3, with arcs labeled Cause, Theme, and Corefer]
Information conflation
  The true argument (Theme) of binding is "a regulatory element", and "this element" is just an anaphor of it
  Transitivity enables us to conflate the information
  We analyzed the effect on the binding and the activity of transcription factors at a regulatory element.
  TPA induction inhibits the binding of the transcription factor NF-E2 to this transcriptional control element.
  TPA induction increases the binding of AP-1 factors to this element.
  [Figure: sentences labeled S1, S2, S3, with arcs labeled Cause, Theme, (A) Theme, (B) Corefer, and (C) Theme]
  (A) Theme & (B) Corefer => (C) Theme
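
The rule (A) Theme & (B) Corefer => (C) Theme can be read as a simple propagation step. The sketch below is only a rule-based illustration of that transitivity, assuming mentions are identified by their surface strings; it is not the extraction method of the cited paper, and the function name `conflate` is hypothetical.

```python
# (A) Theme arc found within a sentence: the event and its anaphoric argument.
theme_arcs = [("binding", "this element")]

# (B) Corefer arc across sentences: the anaphor and its antecedent.
corefer_arcs = [("this element", "a regulatory element")]


def conflate(theme_arcs, corefer_arcs):
    """Infer (C): Theme arcs from each event to the antecedent of its argument."""
    antecedent = dict(corefer_arcs)

    def resolve(mention):
        # Follow coreference links transitively, guarding against cycles.
        seen = set()
        while mention in antecedent and mention not in seen:
            seen.add(mention)
            mention = antecedent[mention]
        return mention

    inferred = []
    for event, argument in theme_arcs:
        target = resolve(argument)
        if target != argument:
            inferred.append((event, target))
    return inferred


inferred_themes = conflate(theme_arcs, corefer_arcs)
print(inferred_themes)
```

The inferred arc links the event to the full mention rather than the anaphor, so facts stated in different sentences attach to the same entity.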
Discourse analysis
  We analyzed the effect on the binding and the activity of transcription factors at a regulatory element.
  TPA induction inhibits the binding of the transcription factor NF-E2 to this transcriptional control element.
  TPA induction increases the binding of AP-1 factors to this element.
  [Figure: sentences labeled S1, S2, S3, with arcs labeled Cause, Theme, and Corefer]
NLP Technologies for Document Analysis
  Part-of-Speech (POS) tagging
  NE chunking
  Syntactic parsing
  Predicate-argument structure analysis
  Coreference resolution
  Relation extraction
  Semantic/context processing
  Machine Learning / Knowledge Acquisition
  Document Structure Analysis
  Knowledge Bases (Domain Ontologies)
What we can do with Scientific Papers
  Knowledge extraction (domain knowledge)
  New fact discovery
  Content-aware paper search
  Summarization
    Automatic generation of abstracts
    Keyword generation
    Survey generation
  Recommendation of related papers
  Similar article/case search
    Structural similarity: papers, law reports, patents
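
As a baseline for similar-article search, documents can be ranked by word overlap with a query. The sketch below uses bag-of-words cosine similarity, which captures only surface similarity and not the structural similarity named on the slide. The document identifiers and texts are invented toy data, and `rank_similar` is a hypothetical helper.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(query, documents):
    """Return document ids ordered from most to least similar to the query."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(text)), doc_id)
              for doc_id, text in documents.items()]
    # Sort by descending score, breaking ties by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy data, made up for illustration only.
documents = {
    "paper_a": "coreference resolution for biomedical event extraction",
    "paper_b": "patent claim retrieval with structural similarity",
    "paper_c": "event extraction from biomedical text",
}
ranking = rank_similar("biomedical event extraction", documents)
print(ranking)
```

A content-aware search would replace the token counts with features taken from the analysis layers, such as extracted entities and relations, while keeping the same ranking step.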
Example: Structured Abstract Generation
Related Project
  Big Mechanism (2014.07-, by DARPA)
  http://www.darpa.mil/Our_Work/I2O/Programs/Big_Mechanism.aspx
  The Big Mechanism program aims to develop technology to read research abstracts and papers to extract pieces of causal mechanisms, assemble these pieces into more complete causal models, and reason over these models to produce explanations. The domain of the program is cancer biology with an emphasis on signaling pathways.
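
The extract, assemble, and reason steps in the program description can be sketched on a toy scale. The example below assumes causal fragments have already been extracted as (cause, relation, effect) triples, assembles them into a graph, and produces an explanation as a causal path. The entities A to D, the relation names, and the functions `assemble` and `explain` are all hypothetical; this is not the Big Mechanism architecture.

```python
from collections import defaultdict, deque

# Hypothetical fragments, as if each had been extracted from a different paper.
fragments = [
    ("A", "activates", "B"),
    ("B", "inhibits", "C"),
    ("D", "activates", "C"),
]


def assemble(fragments):
    """Assemble causal fragments into one model: cause -> [(relation, effect)]."""
    model = defaultdict(list)
    for cause, relation, effect in fragments:
        model[cause].append((relation, effect))
    return model


def explain(model, source, target):
    """Return the shortest causal path from source to target, or None."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, nxt in model.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None


model = assemble(fragments)
explanation = explain(model, "A", "C")
print(explanation)
```

No single fragment connects A to C; the explanation exists only after the pieces are assembled into one model.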
Architecture of Big Mechanism
from Paul Cohen, “DARPA’s Big Mechanism Program”
Deep Language Analysis
  Complex sentence structure analysis
  Robust semantic parsing
  Discourse analysis
    Co-reference
    Causal / temporal relation
Representation and Reasoning
  Explanation / Anticipation
Confidence/credibility (of extracted facts / what is written in documents)
Language Processing and Document Analysis Layers
  Large-scale Text Data
  Syntactic dependency structure, argument structure, coreference
  Rhetorical / document structure
  POS tags, phrase/NE chunking
  Relations (temporal, causal, entailment)
  Knowledge Base / Ontology
  Document Analysis (Document Understanding, Similarity-based Search, Knowledge Discovery/Assembling)
We may be able to do more
  Research trend survey
  Research (paper) evaluation
    Content-aware citation analysis
  Innovation foresight
    E.g.: Foresight and Understanding from Scientific Exposition (FUSE) Project
    http://www.iarpa.gov/index.php/research-programs/fuse
  Collaboration with people in application areas who need to read/understand documents