


Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science

Kyle Hundman 1, Chris A. Mattmann 1,2

{kyle.a.hundman,chris.a.mattmann}@jpl.nasa.gov

1 Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109 USA
2 Computer Science Department, University of Southern California, Los Angeles, CA 90089 USA

ABSTRACT

We propose Marve, a system for extracting measurement values, units, and related words from natural language text. Marve uses conditional random fields (CRF) to identify measurement values and units, followed by a rule-based system to find related entities, descriptors, and modifiers within a sentence. Sentence tokens are represented by an undirected graphical model, and rules are based on part-of-speech and word dependency patterns connecting values and units to contextual words. Marve is unique in its focus on measurement context, and early experimentation demonstrates Marve's ability to generate high-precision extractions with strong recall. We also discuss Marve's role in refining measurement requirements for NASA's proposed HyspIRI mission, a hyperspectral infrared imaging satellite that will study the world's ecosystems. In general, our work with HyspIRI demonstrates the value of semantic measurement extractions in characterizing quantitative discussion contained in large corpora of natural language text. These extractions accelerate broad, cross-cutting research and expose scientists to new algorithmic approaches and experimental nuances. They also facilitate identification of scientific opportunities enabled by HyspIRI, leading to more efficient scientific investment and research.

CCS CONCEPTS

• Computing methodologies → Information extraction; Maximum likelihood modeling; • Applied computing → Environmental sciences; • Information systems → Content analysis and feature selection;

KEYWORDS

Natural Language Processing, Knowledge Discovery, Information Retrieval, Measurement Extraction, Remote Sensing, Earth Science

ACM Reference format:
Kyle Hundman and Chris A. Mattmann. 2017. Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Data-Driven Discovery Workshop, Halifax, Nova Scotia, Canada, August 2017 (KDD 2017), 8 pages.


1 INTRODUCTION

Much of the world's scientific information is easily accessible at our fingertips. For example, a search for "remote sensing" on Thomson Reuters' Web of Science yields nearly 14,000 journal articles from 2014–2016¹. However, the careful analysis and understanding of that scientific information is not easy. An individual scientist may spend many hours (re-)reading a single article to comprehend its message and significance. The aforementioned corpus of remote sensing articles is simply too much information for a human, or even a small set of them, to read and synthesize into knowledge. Fortunately, continual advances in search and natural language processing (NLP) have greatly enhanced our ability to automatically characterize and sift through large-scale unstructured data. Neural network approaches have vastly improved essential NLP tasks such as part-of-speech (POS) tagging and dependency parsing. For example, Stanford CoreNLP's dependency parser achieved 91.7% accuracy [8] on the Penn Treebank dataset and Google's Parsey McParseface attained an impressive 94.4% on sentences from various news sources [3].

While these foundational NLP tasks provide a framework for processing core textual components, deriving scientific meaning and nuance from those components is more complex. One of the core elements of science is measurement, which involves quantification (in units such as nanometers or kilograms), situational context (e.g. location, timeframe, sentiment), and often statistical modifiers (e.g. average). Consider the sentence, "The unexpected drop in stratospheric water vapor slowed the rate of increase in surface temperature in the subsequent decade by 25%." Identifying "25%" as modifying the "rate of increase in surface temperature" is difficult without a system that considers the underlying structure of the sentence. Proper measurement extraction and labeling enables the creation of unique knowledge bases and opens exciting possibilities for modeling and visualization techniques that rely on organized and uniform numerical data. Automatically discerning scientific measurements and specific contextual aspects can allow for quick summarization and scientific understanding of a large corpus of literature and/or news articles. In addition, it can allow for validation and comparison of automatically extracted and categorized measurements with raw measurement data; this can provide additional context and even scientific corroboration of phenomena.
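To make this concrete, the following minimal sketch (ours, not part of the paper) shows how a dependency parse exposes that underlying structure. Stanza, a Python library from the Stanford NLP group, is used purely as a stand-in for the CoreNLP pipeline discussed below; the relation names printed may vary with parser version.

# Sketch: inspect the dependency structure of the example sentence.
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")
doc = nlp("The unexpected drop in stratospheric water vapor slowed the rate "
          "of increase in surface temperature in the subsequent decade by 25%.")

sent = doc.sentences[0]
for word in sent.words:
    head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
    print(f"{word.text:13s} --{word.deprel}--> {head}")

# Tracing edges from "25%" back through the verb "slowed" reaches "rate",
# i.e. the phrase the percentage actually modifies.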

We describe a framework, Marve, that fuses and extends existingtechniques in NLP and text processing to extract context aroundmeasurements in natural language. Using rules primarily based

¹ https://apps.webofknowledge.com



on word dependencies and POS tags, Marve exploits a limited set of approximately 10 parts of speech, 15–20 word dependency types, and the English language patterns used to describe measurements and the objects or concepts they quantify. Traditionally, the cost of manual curation and the ambiguity of unaccompanied measurements have limited the collection and application of semantic measurement data. Marve circumvents these problems by understanding contexts and consequently improving identification of measurement types. From a scientific perspective, Marve accelerates exploration of literature and promotes cross-pollination of ideas and approaches across domains.

2 MOTIVATION

Marve originated from a NASA Advanced Concepts project that has provided data-driven support for NASA's proposed Hyperspectral Infrared Imager (HyspIRI) mission, which will monitor a variety of ecological and geological features at a wide range of wavelengths. The planned HyspIRI instrumentation has unique technical capabilities such as high spatial resolution and hyperspectral coverage that will benefit several scientific areas [13]. However, the extent and nature of these benefits are not easily understood because much of this information is embedded within scientific publications spread across numerous journals. We set out to automatically identify and profile these new scientific opportunities using a corpus of approximately 2,500 recent publications and abstracts from various journals in the remote sensing domain.

Our first approach involved the use of regular expressions to extract common measurement types in the remote sensing domain. Extracted measurements like spatial resolution, spectral coverage, and revisit rate provided a useful bookmarking of our corpus – discussion around hyperspectral wavelengths (>2.4 µm) and high spatial resolutions were pointers toward potential science enabled by HyspIRI. Visualizing these extractions also revealed the scale of the discussion around various wavelengths (as shown in Figure 1).
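The following sketch illustrates this first regex-based approach; the expressions are illustrative reconstructions, not the ones actually used.

# Sketch: naive regex bookmarking of wavelength and resolution mentions.
import re

WAVELENGTH = re.compile(r"(\d+(?:\.\d+)?)\s*(nm|µm|um|microns?|nanometers?)\b")
RESOLUTION = re.compile(r"(\d+(?:\.\d+)?)\s*(m|km)\s+(?:spatial\s+)?resolution")

text = "Imagery with 50 m spatial resolution covered wavelengths from 0.4 to 2.5 µm."
print(WAVELENGTH.findall(text))  # [('2.5', 'µm')] -- misses "0.4", one known weakness
print(RESOLUTION.findall(text))  # [('50', 'm')]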

Unfortunately, regular expressions didn't generalize across measurement types, the precision and recall of the extractions were unknown, and the expressions themselves quickly grew complex. Most importantly, measurements were extracted in isolation: an extraction of "50 m" could be a measurement of height, swath, length, or resolution in the context of remote sensing. Ultimately, Marve resulted from our pursuit of a general, accurate, and precise tool that provided semantically rich extractions.

3 RELATED WORK

3.1 Measurement Extraction

3.1.1 Grobid Quantities. The Marve stack includes Grobid Quantities [15], a library that utilizes linear conditional random fields (CRF) to identify measurement units and values. Training the Grobid model requires labeled training data, ideally from the target domain. Although labeled data is costly, supervised machine learning is appropriate for measurement extraction; measurement conventions and unit formats vary widely across scientific domains, and the resulting proliferation of patterns is too large and varied for unsupervised models or rule-based systems to be effective. We initially tried extracting measurements using a rule-based approach built around part-of-speech (PoS) tags, named-entity recognition

Figure 1: A histogram of extracted spectral wavelengths from the corpus of remote sensing publications and abstracts used for HyspIRI. Most of the discussion takes place around visible and near-infrared wavelengths (0.4 to 2.5 µm), but opportunities for new or improved science enabled by HyspIRI's hyperspectral coverage may most likely be found in papers discussing longer-wave infrared wavelengths (e.g. 4 µm, on the right side of the chart). (credit: Jason Hyon)

(NER), word dependencies, and regular expressions. While this method was fast and training-free, attempts at generalizing the system for different domains led to numerous false positive extractions.

Through this experimentation we determined that quantified substances and related entities (i.e. context) don't present the same challenges as measurement values and units. Instead, they follow common language patterns that generalize well. This allows Marve to identify words and entities related to a measurement without labeled examples and model training. Marve can also capture related entities from a broad assortment of language patterns. Consider the following sentence: "Landsat-8 achieved 82% classification accuracy for cutleaf teasel." Grobid Quantities isn't designed to identify "Landsat-8" or "cutleaf teasel" as related entities. Marve is able to capture this additional information without domain-specific labels or training – these types of phrasal patterns and clauses are common across the English language. Grobid Quantities could be extended to capture more context, but tuning its extraction process requires adding or adjusting labels and re-training the full model for a specific domain. Marve mitigates the additional overhead required for context extraction and uses Grobid Quantities as an off-the-shelf dependency.
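For reference, a hedged sketch of using Grobid Quantities as that off-the-shelf dependency is shown below; the endpoint path and default port reflect our reading of the grobid-quantities documentation and should be verified against the installed version.

# Sketch: query a locally running Grobid Quantities service for value/unit pairs.
import requests

GROBID_URL = "http://localhost:8060/service/processQuantityText"  # assumed default

resp = requests.post(GROBID_URL, data={
    "text": "Landsat-8 achieved 82% classification accuracy for cutleaf teasel."})
resp.raise_for_status()
for m in resp.json().get("measurements", []):
    q = m.get("quantity", {})
    print(q.get("rawValue"), q.get("rawUnit", {}).get("name"))

# Grobid returns the value/unit pair ("82", "%"); linking it to "Landsat-8"
# and "cutleaf teasel" is the context-extraction step Marve adds on top.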

3.1.2 Quantalyze and GATE. Quantalyze² is a commercial product that also performs measurement, unit, and context extraction. Evaluation of its performance can only be achieved through its online demo, but after comparing extractions from several paragraphs of text, the tool appears to achieve poor recall on both the measurements and the quantified substances captured.

² https://www.quantalyze.com/


Additionally, Agatonovic et al. [1] employ the General Architecture for Text Engineering (GATE) [9] to extract measurement values and units from patent documents. Their approach involves building patent-specific gazetteers and hand-written rules to generate measurement annotations using the GATE framework. Grobid Quantities requires labeled data, but its embedded CRF model obviates rule-writing; Agatonovic et al.'s approach, meanwhile, only captures accompanying words such as "less than" or "between" while ignoring related entities and context.

3.2 Open Information Extraction (OIE)

Various OIE systems approach relation extraction in ways similar to Marve. KrakeN [2], CSD-IE [5], and ClausIE [10] utilize dependency patterns and POS tags to detect clauses and find their propositions. This information is then used to construct triples representing facts in a corpus, such as: ("Kelly", "finished", "nursing school"). Similar to Agatonovic et al.'s GATE-based approach, certain measurements could be extracted via OIE approaches. But OIE is centered on verb-mediated propositions, and measurement context occurs in a variety of other forms such as adverbials: "The satellite captured imagery with 50 m spatial resolution." This leads to poor OIE recall for measurements. Also, when measurements are extracted by these systems, the output requires significant post-processing to filter extraneous OIE extractions and properly separate measurements, units, and related entities. Marve is a more directed form of these approaches, and it is most similar to those that sacrifice efficiency for improved precision and recall (e.g. OLLIE [23], KrakeN, and ClausIE).

3.3 Relationship Extraction and Knowledge Base Construction

Knowledge base construction (KBC) and relationship extraction have migrated from pattern matching and rule-based systems to machine-learning-based systems over the last decade. One of the driving factors behind this trend is that KBC systems that rely on a multitude of rules require some assurance of the precision and recall of such rules [22]. In practice this is difficult and tedious. As discussed in Section 3.1, this is also the reason for the use of machine learning approaches to measurement value and unit extraction – too many patterns and rules result in uncertainties about precision and recall. However, Marve differs from traditional KBC systems in two primary ways. First, Marve does not explicitly classify types of relationships between extracted measurements and their related words and entities. Second, Marve is directed at a very specific type of extraction (measurements) that benefits many scientific information extraction scenarios. In this sense, it is complementary to broader KBC systems. One of the most prominent KBC systems of late is DeepDive, which is designed to identify relationships between extraction types using labeling functions written by domain experts. However, generating the extractions of interest is left up to the user, and writing discerning labeling functions is a gradual, iterative process. Marve automates a large part of the development of a DeepDive system by automatically extracting measurements and providing precise measurement context to labeling functions. Section 6.2 includes further discussion around Marve and DeepDive integrations.

4 METHODOLOGY

4.1 Overview

As discussed in Section 3, we decided against a custom measurement extractor and instead used Grobid Quantities to extract measurement values and units. Like Marve, Grobid Quantities represents sentences with undirected graphs. However, instead of parsing language patterns, Grobid uses a probabilistic graphical CRF method that learns parameter values through maximum likelihood estimation. This approach to extracting numerical values and units was more consistent in our experimentation, although it adds processing overhead and requires labeled training data.

Once measurement values and units are identified using Grobid, Stanford's CoreNLP library [16] is used to perform more traditional NLP tasks such as tokenization, PoS tagging, and word dependency parsing. Marve uses combinations of the output from these tasks to identify measurement types (e.g. 10 m spatial resolution) and related entities and descriptors (e.g. Hannibal had around 40 elephants). When represented in a graph (see Figure 2), these patterns originate at the measurement unit token(s) and expand outward to connected nodes (words). If Marve finds a pattern defined as valid in its pre-defined rules (represented using JSON³), the resulting word(s) will be returned as related to the measurement.

4.2 Model Structure

Consider a connected, undirected graph G = (V, E), where V and E denote the sets of nodes and edges respectively, such that:

• S = {s_1, s_2, ..., s_n} is a set of sentences that comprise a corpus of text from which measurements are extracted.
• T = {t_1, t_2, ..., t_n} is the set of all tokens in S.
• s_i = {t | t ∈ T}, i.e. each sentence s ∈ S is a set of tokens.
• One graph G is constructed for each sentence s_i.
• L = {l_1, l_2, ..., l_n}, where l_i is a label identifying the part of speech of each token t_i in a sentence s_i.

Given these notations, the set of nodes V in each graph can be defined as V = {v_1, v_2, ..., v_n}, where v_i stores a token t_i from the sentence s and each token t_i is labeled with a label l_j ∈ L. We then define e_ij as an edge connecting (v_i, v_j) with a label d_ij representing the dependency between tokens (t_i, t_j), where D is the set of all dependencies, equal in size to E.
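A minimal construction sketch of this graph follows (using networkx; the tokens and dependencies are hand-written for the Figure 2 example rather than produced by a parser):

# Sketch: build the per-sentence graph G = (V, E) for
# "HyspIRI has a spatial resolution of 10 m."
import networkx as nx

tokens = {1: ("HyspIRI", "NNP"), 2: ("has", "VBZ"), 3: ("a", "DT"),
          4: ("spatial", "JJ"), 5: ("resolution", "NN"), 6: ("of", "IN"),
          7: ("10", "CD"), 8: ("m", "NN")}
dependencies = [(2, 1, "nsubj"), (2, 5, "dobj"), (5, 3, "det"), (5, 4, "amod"),
                (5, 8, "nmod:of"), (8, 6, "case"), (8, 7, "nummod")]

G = nx.Graph()
for i, (tok, pos) in tokens.items():
    G.add_node(i, token=tok, label=pos)    # v_i stores t_i with POS label l_i
for head, dep, rel in dependencies:
    G.add_edge(head, dep, dependency=rel)  # e_ij carries dependency label d_ij

# Pattern evaluation originates at the unit token ("m", node 8):
for nbr in G.neighbors(8):
    print(G.nodes[nbr]["token"], G.edges[8, nbr]["dependency"])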

4.3 Pattern Matching

Once a measurement is identified by Grobid Quantities, the token t_i that corresponds to the measurement unit becomes the origin for subsequent pattern evaluation (as shown by the thicker circle around "m" in Figure 2), and a graph g_i is constructed for the sentence s_i containing the measurement. Evaluation continues if the dependency label d_ij for any edge e_ij originating at t_i matches the word dependency types defined in the valid dependency patterns. One subtlety stems from CoreNLP's enhanced dependencies, which provide the connecting word for certain dependency types. For example, if the conjunction "and" connects two words, the enhanced dependency type returned by CoreNLP is "conj:and" rather than

³ https://github.com/khundman/marve/blob/master/marve/dependency_patterns.json


Figure 2: An example detailing the measurement and context extraction process. Word dependencies and PoS tags are loaded into a graph, where patterns defined in Marve are evaluated to identify additional context (e.g. measurement types).

"conj." When Marve encounters a dependency type that has been enhanced, it evaluates the enhanced portion separately, allowing for more nuanced pattern definitions. This is represented in the JSON structure with a boolean value for the "enhanced" key (shown in Figure 2). If "enhanced" is true, the dependency label d_ij is split into two parts: the connecting word and the dependency type. If both parts match the JSON structure, evaluation continues along that nested path.
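The sketch below illustrates this splitting step; the rule fragment is an illustrative Python analogue of the JSON definitions, not Marve's actual pattern file.

# Sketch: split an enhanced dependency label and walk the nested rule path.
RULES = {
    "nmod": {"enhanced": True, "of": {"pos_in": {"NN": None}}},
    "conj": {"enhanced": True, "and": {"pos_in": {"NN": None}}},
}

def match_dependency(label):
    dep_type, _, connector = label.partition(":")
    rule = RULES.get(dep_type)
    if rule is None:
        return None                  # no rule defined: evaluation stops
    if rule.get("enhanced") and connector:
        return rule.get(connector)   # evaluate the connecting word separately
    return rule

print(match_dependency("nmod:of"))   # {'pos_in': {'NN': None}}
print(match_dependency("conj:and"))  # {'pos_in': {'NN': None}}
print(match_dependency("advmod"))    # None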

Marve first considers the format of the measurement after word dependencies are evaluated. Three primary formats are defined:

• space between (e.g. "10 m")
• attached (e.g. "10m")
• hyphenated (e.g. "10-m")

Measurement formats are identified using the character indices of the measurement value token t_k and measurement unit token t_i. If t_i and t_k are adjacent (without a space), they are "attached." If not, a simple check for a space or hyphen is performed. Word dependency and PoS patterns vary based on these formats, and explicitly defining rules for them improves Marve's precision.
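A minimal sketch of this classification from character offsets:

# Sketch: classify the measurement format from value/unit character offsets.
def measurement_format(sentence, value_end, unit_start):
    between = sentence[value_end:unit_start]
    if between == "":
        return "attached"        # e.g. "10m"
    if between == "-":
        return "hyphenated"      # e.g. "10-m"
    if between == " ":
        return "space_between"   # e.g. "10 m"
    return "other"

print(measurement_format("a 10m window", 4, 4))    # attached
print(measurement_format("a 10-m window", 4, 5))   # hyphenated
print(measurement_format("a 10 m window", 4, 5))   # space_between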

PoS tags are the next evaluation step in the Marve system. If the measurement format is valid according to the dependency pattern, and token t_j is also connected to the unit token t_i via a valid dependency pattern, label l_j is evaluated in one of two ways:

• pos_in: as long as one of the keys of the pos_in value matches part of label l_j, the label is valid (e.g. one of the keys of the pos_in nested object is "NN" and label l_j is "NNS")
• pos_equals: the specified PoS labels must match label l_j exactly

If a matching PoS key has its own keys and values in the JSON, it is considered a special case. Most of these cases involve verbs, which are often part of a clause containing a subject related to the

measurement. If so, all nodes connected to token t_j are evaluated by a separate function. Valid word dependencies are passed as parameters to this function, and it executes recursively if it encounters additional connected verb tokens. If the value of a matching PoS key in the JSON is null, no more evaluation is needed and the token t_j is returned as a related word.
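A small sketch of the two evaluation modes (the rule fragments are illustrative, not Marve's actual definitions):

# Sketch: the two PoS matching modes applied to a candidate label l_j.
def pos_matches(rule, label):
    if any(key in label for key in rule.get("pos_in", {})):
        return True              # partial match, e.g. "NN" matches "NNS"
    if any(key == label for key in rule.get("pos_equals", {})):
        return True              # exact match required
    return False

print(pos_matches({"pos_in": {"NN": None}}, "NNS"))      # True
print(pos_matches({"pos_equals": {"VB": None}}, "VBZ"))  # False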

The last step in constructing the output is finding adjectives, modifiers, or compounds connected to related nouns. This includes words like "spatial" in the example in Figure 2, where "resolution" is the related noun extracted in earlier steps. Other connected words could be subjective words like "high" in "a high spatial resolution of 10 m," or statistical words such as "average." These descriptive words provide important details about the types, sentiment, and statistical nature of measurements. This creates opportunities for higher-fidelity grouping of like measurements (e.g. "spatial resolution" versus "resolution") and better profiling of trends and opportunities. For example, if a satellite's 100 m spatial resolution was described as "insufficient" for classifying a certain type of vegetation, this could represent an opportunity for the higher-resolution HyspIRI mission to enable new science. Extraction of these types of words is also performed using PoS labels and word dependencies, but the origin for pattern matching is the token corresponding to a related entity rather than the measurement unit.
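A minimal sketch of this final step, starting from an already-extracted related noun (the tiny graph stands in for the parse of "a high spatial resolution of 10 m"):

# Sketch: collect descriptors attached to a related noun.
import networkx as nx

G = nx.Graph()
for i, tok in [(3, "high"), (4, "spatial"), (5, "resolution")]:
    G.add_node(i, token=tok)
G.add_edge(5, 3, dependency="amod")
G.add_edge(5, 4, dependency="amod")

DESCRIPTOR_DEPS = {"amod", "compound", "advmod"}

def descriptors(graph, noun):
    return [graph.nodes[n]["token"] for n in graph.neighbors(noun)
            if graph.edges[noun, n]["dependency"] in DESCRIPTOR_DEPS]

print(descriptors(G, 5))   # ['high', 'spatial']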

Marve's dependency patterns were developed heuristically by analyzing scientific literature and were gradually expanded to optimize precision and recall throughout the HyspIRI work and in the experiment presented here. Early experimentation suggests they generalize well to domains that don't include highly unique conventions for discussing measurements. For such edge cases, Marve's rule-based architecture is transparent, flexible, and general enough to be easily modified to identify other relationship types or patterns.


Figure 3: Sample Marve output for the example sentence in Figure 2. The "quantity" field is populated by Grobid Quantities and the "related" field is added by Marve.

{
  "type": "value",
  "quantity": {
    "parsedValue": 10,
    "normalizedQuantity": 10,
    "rawValue": "10",
    "rawUnit": {
      "offsetStart": 39,
      "offsetEnd": 40,
      "tokenIndices": ["8"],
      "name": "m"
    },
    "offsetEnd": 38,
    "offsetStart": 36,
    "tokenIndex": 7,
    "normalizedUnit": {
      "type": "length",
      "name": "m",
      "system": "SI base"
    },
    "type": "length"
  },
  "related": [
    {
      "rawName": "resolution",
      "connector": "",
      "offsetEnd": 32,
      "relationForm": "nmod:of",
      "offsetStart": 22,
      "tokenIndex": 5,
      "descriptors": [
        {
          "rawName": "spatial",
          "tokenIndex": "4"
        }
      ]
    }
  ]
}

Table 1: Experiment Data

Source      Sentences  Measurements  Sent. w/ Measurements
News        117        58            47
Scientific  372        131           93
Total       489        189           140

5 EXPERIMENTS AND RESULTS

5.1 Data

Four documents were used for experimentation: two news articles from the New York Times and two scientific publications from the medical and remote sensing domains. The first New York Times article, "Dell Gets Bigger and Hewlett-Packard Gets Smaller in Separate Deals," was selected from the Technology section and the other, "A Cleaning Start-Up Wielding Mops, Buckets and 700 Data

Points," was from the Business section [11][25]. The medical publication, "Zika Virus Associated with Microcephaly," is from the New England Journal of Medicine, and the remote sensing publication, "Satellite soil moisture for agricultural drought monitoring: Assessment of the SMOS derived Soil Water Deficit Index," is from Remote Sensing of Environment [20][18]. Although our work is centered around scientific literature, news articles provide a sense of Marve's general performance extending into non-scientific domains. The remote sensing article represents literature used in the HyspIRI work, and the medical article was selected due to the prevalence of both measurements and NLP work happening in that domain.

For each document, individual sentences were manually extracted along with any measurement values, units, and related words contained within. The total number of labeled sentences and measurements for each type of source is presented in Table 1. We avoided data sources with more informal language (e.g. social media) for two reasons: Marve will be most useful in domains with abundant quantitative discussion, and Marve's reliance on sentence structure suggests it would perform poorly on such data.

5.2 Setup

The labeling of related words in the evaluation data was limited to those directly related to the measurement, most commonly connected by a verb or by nominal modifiers indicating a prepositional phrase. As an example, consider the sentence in Figure 2, "HyspIRI has a spatial resolution of 10 m." In this case, "10 m" directly modifies "spatial resolution," which is in turn possessed by "HyspIRI." Because there is a degree of separation between "10 m" and "HyspIRI," Marve would only include "spatial resolution" as related to "10 m" in our experiment. Extracting second-order related words is easily achieved in Marve, but we focused on first-order relations to reduce manual labeling effort and simplify the evaluation.

The experiment demonstrates the precision and recall of Marve's related word extraction independent of Grobid Quantities' ability to accurately extract measurement values and units. Since Marve's pattern parsing (used to identify related words) originates at measurement units, units from the ground truth data were fed to the system rather than relying on Grobid Quantities to provide these values. We assume Grobid Quantities can be incrementally improved with additional labeled values and model training. Although generating this additional training data was outside the scope of our experiment, our experience with Grobid Quantities suggests that domain-specific labels and training are necessary to achieve viable levels of precision and recall for measurement value and unit extraction.

5.3 Scoring

Marve parses each of the 489 sentences individually. If one or more measurements are in a sentence, Marve's extraction of related words for each measurement is compared to the corresponding measurement's related words in the labeled data. Because modifiers and descriptors are relatively straightforward to extract for an already-identified related word, they were not considered in the evaluation (e.g. "buffered" in Figure 4). For instances where Marve's related word extractions match the related entities in the labeled data (e.g. "Samples" and "formalin" in Figure 4), a true positive


Figure 4: An example of labeled evaluation data for a sentence containing a measurement.

{
  "measurements": [
    {
      "number": "10",
      "unit": "%",
      "related": [
        {
          "Samples": []
        },
        {
          "formalin": ["buffered"]
        }
      ]
    }
  ],
  "sentence_num": 41,
  "sentence": "Samples were fixed in 10% buffered formalin and embedded in paraffin."
}

Table 2: Experiment Results - Confusion Matrices

             Predicted Negatives  Predicted Positives
Combined
  Negatives  n/a                  55
  Positives  115                  225
Scientific
  Negatives  n/a                  36
  Positives  84                   143
News
  Negatives  n/a                  19
  Positives  31                   82

Table 3: Experiment Results - Evaluation Metrics

Metric     News   Journal  Combined
Precision  81.2%  79.9%    80.4%
Recall     72.6%  63.0%    66.2%
F-Score    76.6%  70.4%    72.6%

is recorded for each matched entity. A false positive is recordedfor each extracted related word without a match in the labeledevaluation data. A false negative occurs when an entity from thelabeled data can’t be matched to the related words extracted byMarve.
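The metrics in Table 3 follow directly from these counts; below is a quick check (ours) against the combined confusion matrix in Table 2. True negatives are undefined for an extraction task, hence the "n/a" cells.

# Sketch: recompute the "Combined" column of Table 3 from Table 2.
tp, fp, fn = 225, 55, 115

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.1%} recall={recall:.1%} f={f_score:.1%}")
# precision=80.4% recall=66.2% f=72.6%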

5.4 Results

Marve's precision, recall, and F-score were evaluated for the two datasets and are shown in Table 3. Because Marve is a rule-based system rather than a generalized statistical model, the recall metric indicates the extent to which measurement language follows concrete rules rather than how well a given model represents the data. As long as the rules generalize and are relatively concise, this type of system is attractive for its speed and transparency. Our

recall results imply that a rule-based system such as Marve can identify words and entities related to measurements with high fidelity. While some recall error is expected because language is varied and often misused, these results understate Marve's recall. Similar to the findings in ClausIE's experiments, our preliminary analysis suggests that a significant portion of recall error resulted from incorrect dependency parsing rather than the occurrence of undefined patterns.

It's no surprise that precision and recall were better for the New York Times articles. Compared to scientific publications, sentence patterns in news articles are simpler. Special characters, references, diverse punctuation, and domain-specific lexicons that can fool a dependency parser are less common. Also, Stanford CoreNLP's English parser is trained on the Penn Treebank, which contains a large share of Dow Jones Newswire stories and a much smaller portion of scientific abstracts [17].

These results indicate that Marve is a sound approach to extracting the words that compose the context around a measurement. Performance will improve as Marve's language pattern rules are further scrutinized, extended, and refined, and advances in underlying NLP approaches will also lift performance.

6 APPLICATIONS

6.1 Opportunities for HyspIRI

Marve extractions enabled deeper analysis into the scientific and application-specific significance of the HyspIRI mission mentioned in Section 2. Within our corpus of remote sensing-related journal articles, we were able to extract measurement values and units with improved precision and recall. Extractions of related words and entities then allowed us to group measurement types with more confidence. They also created opportunities to link semantic publication data to other structured scientific data – an area of research largely unexplored in Earth Science.

For instance, extracted contextual words often contained geophysical features related to certain measurements. These included types of vegetation, soils, minerals, rocks, water, and man-made features, and they are often targeted for measurement by Earth Science missions, where the ability to classify certain features is essential to meeting scientific objectives. One common method of classification involves analyzing a feature's reflectance, which varies across electromagnetic wavelengths to form a signature [12]. These signatures contain unique combinations of inflection points that enable models to distinguish between different Earth-based features. Given their importance to mission science objectives, we were eager to explore the discussion around inflection points for certain geophysical variables.

To achieve this, we first needed to find measurement extractions referring to portions of the electromagnetic spectrum. This was straightforward – measurement units of microns or nanometers were strong indicators of a spectral reference.
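A minimal sketch of that filter, normalizing to nanometers for the joins described next (field names mirror the Marve output in Figure 3):

# Sketch: keep extractions whose unit indicates a spectral wavelength.
TO_NM = {"nm": 1.0, "µm": 1000.0, "um": 1000.0}

def spectral_mentions(extractions):
    for ex in extractions:
        unit = ex["quantity"]["rawUnit"]["name"]
        if unit in TO_NM:
            yield ex["quantity"]["parsedValue"] * TO_NM[unit], ex["related"]

sample = [{"quantity": {"parsedValue": 1.9, "rawUnit": {"name": "µm"}},
           "related": [{"rawName": "clay"}]}]
print(list(spectral_mentions(sample)))   # [(1900.0, [{'rawName': 'clay'}])]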

The next step was joining these measurements with associated reflectance signatures. The NASA Advanced Spaceborne Thermal Emission and Reflection (ASTER) mission spectral library contains over 2,300 spectra of a variety of materials, several of which were present in the related words extracted with measurements [4]. One such example can be seen in Figure 5, which shows the reflectance


Figure 5: The reflectance of a sample of brown to dark brown clay is indicated by the blue line. The green circles represent the number of extractions that indicated "clay" was related to a specific wavelength or range of wavelengths. The bands represent the spectral coverage of the Landsat-8 satellite. This chart supports the intuitive idea that discussion about the reflectance of geophysical features would be centered around inflection points. (credit: Jason Hyon)

signature of a sample of brown to dark brown clay. There is a large inflection point at 1,900 nm, where there are also several mentions of "clay" in association with this wavelength (e.g. "1900 nm" or "1.9 µm" or "1800-1900 nm"). This figure also shows mentions around wavelengths that aren't obvious inflection points. They may warrant discussion for other reasons, such as their importance in a specific scientific application.

In addition to purely scientific objectives, accurate remote sensing of clay types has tangible downstream benefits for various commercial industries. For example, specific compositions of clay mineral deposits are widely used in the production of ceramics [19], and clay composition information helps in understanding absorption properties, which can be used to manage water irrigation more efficiently [6]. As previously discussed, determining the extent to which remote sensing can support these applications first involves understanding different measurement requirements. Once these are defined, assessment of existing satellites and airborne instruments is necessary to identify measurement gaps and opportunities (for HyspIRI in this case). Again, this can be accomplished through integration of structured data with Marve extractions. Figure 5 shows bands indicating the spectral coverage of Landsat-8, a satellite developed by NASA and the U.S. Geological Survey (USGS) and launched in 2013 [21][14]. The pronounced inflection point at 1,900 nm isn't captured by Landsat-8, which could represent an opportunity for HyspIRI. If extended analysis reveals that publicly-sponsored satellites are missing key measurements, these gaps can inform decisions about HyspIRI's future resolution and spectral coverage capabilities. In this sense, HyspIRI would be uniquely positioned to support key priorities as defined by the broader scientific community.

6.2 Knowledge Base Construction

The construction of a measurement knowledge base for remote sensing publications would be of great value to researchers and NASA administrators. For example, in the Remote Sensing of Environment paper used in our experiment, the authors write, "Although there is more and more information about these soil water parameters, they are not usually included in standard soil databases. For that reason, researchers sometimes have simply used soil parameter data published in the literature." A structured repository of measurement information would allow researchers to better explore scientific results, experimental designs, instrument specifications, and general discussion around specific measurement types and their relationships to times, locations, organizations, and other domain-specific entities. The value and feasibility of constructing similar knowledge bases has been demonstrated by several projects, including our own prior work constructing a Polar Cyberinfrastructure that integrates measurements crawled from the U.S. National Science Foundation Advanced Cooperative Arctic Data and Information System (ACADIS), the NASA Antarctic Master Directory (AMD), and the National Snow and Ice Data Center (NSIDC) Search tool as part of constructing the National Institute of Standards and Technology Text Retrieval Conference (TREC) Dynamic Domain (TREC-DD) polar dataset [7]. In addition, Stanford's DeepDive platform was used to construct PaleoDeepDive, a system that has processed 300,000 scientific documents in an effort to replicate the manually curated Paleobiology Database (PBDB). This database contains hundreds of thousands of taxonomic fossil names and attributes manually entered by researchers over the last two decades. PaleoDeepDive recreated PBDB with greater than double human recall and roughly equal precision.

Marve significantly reduces the manual effort needed to create such a knowledge base and can be easily integrated into DeepDive, which can add non-measurement-based extractions and also help categorize measurements and their relationships with other entities. DeepDive has previously been used to derive relationships between measurements and related words or entities (e.g. PaleoDeepDive), but users are left to their own devices to extract measurements and possible related words and entities. They also need to write the features used by the inference engine to identify and classify relationships between extractions. Marve automates these extractions and can provide DeepDive developers with valuable features in the form of measurement context.

7 FUTURE WORK

7.1 Expanding Experimentation

Research and applications involving measurement extraction are limited. Marve creates second-order extractions (SOE) constructed from other first-order extractions (FOE) that are either new (Grobid Quantities) or have progressed significantly in recent years (word dependencies, PoS tagging). As a result, publicly-available labeled datasets do not exist for evaluating Marve. Our evaluation dataset is relatively small, and we hope to extend this dataset and employ a linguist for review.

Marve is also tightly coupled with CoreNLP, and CoreNLP errors could not be isolated in the experiment without hand labeling of POS tags and word dependencies. This process is tedious and


requires linguistic expertise to perform accurately and consistently. While we have focused on measurement-rich scientific publications in our development and applications of Marve, we plan to explore its performance on syntactically labeled data such as the Penn Treebank [24]. Although measurements are sparser, experiments with such data sources allow Marve to be evaluated independent of FOE.

Lastly, we are also interested in understanding the performance of Grobid Quantities at different levels of training and customization. Generating additional training data was outside the scope of our experiment, but we are curious how well a Grobid Quantities model trained on a pre-existing set of diverse labeled data would generalize. This will allow practitioners to weigh the costs and benefits of domain-specific labels and training by understanding Grobid Quantities' off-the-shelf performance on unseen data.

7.2 Extending Marve

Expanding the semantic information embedded in Marve extractions increases the potential for automatic classification of measurements. While Marve represents a large step forward in the collection of this information, the burden is on the user to make use of it, which will involve grouping or classifying measurements in almost all cases. DeepDive addresses this problem by allowing users to write "labeling functions," which provide the system with features used to classify different types of relationships. We plan to explore further integration of Marve into DeepDive and its successor, Snorkel, while also exploring unsupervised approaches to measurement grouping. We view providing a means for automatically or semi-automatically classifying measurements as an important step in Marve's development.

8 CONCLUSION

We propose a baseline method, Marve, for contextual measurement extraction, a sub-area of information extraction that has been largely unaddressed by the research community. Semantic measurement information is inherently richer than raw data, and our initial findings with Marve are positive. As the world becomes increasingly inundated with textual data, Marve and other related approaches will help us find relevant scientific information and develop a broader understanding of our domains. We view Marve as an opportunity to expedite scientific research and inform scientific investment, two areas essential to encouraging innovation and demonstrating the importance of science to society.

ACKNOWLEDGMENTS

This effort was supported in part by JPL, managed by the California Institute of Technology on behalf of NASA, and additionally in part by the DARPA Memex/XDATA/D3M programs; NSF award numbers ICER-1639753, PLR-1348450, and PLR-144562 funded a portion of the work. The authors thank Jason Hyon, Paul Ramirez, and Drs. David Thompson, Diane Evans, Randy Friedl, and Duane Waliser for their support and feedback.

REFERENCES

[1] Milan Agatonovic, Niraj Aswani, Kalina Bontcheva, Hamish Cunningham, Thomas Heitz, Yaoyong Li, Ian Roberts, and Valentin Tablan. 2008. Large-scale, parallel automatic patent annotation. In Proceedings of the 1st ACM Workshop on Patent Information Retrieval. ACM, 1–8.

[2] Alan Akbik and Alexander Löser. 2012. KrakeN: N-ary facts in open information extraction. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. Association for Computational Linguistics, 52–56.

[3] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042 (2016).

[4] A. M. Baldridge, S. J. Hook, C. I. Grove, and G. Rivera. 2009. The ASTER spectral library version 2.0. Remote Sensing of Environment 113, 4 (2009), 711–715.

[5] Hannah Bast and Elmar Haussmann. 2013. Open information extraction via contextual sentence decomposition. In Semantic Computing (ICSC), 2013 IEEE Seventh International Conference on. IEEE, 154–159.

[6] Mohamed R. Berber, Inas H. Hafez, Keiji Minagawa, Masami Tanaka, and Takeshi Mori. 2012. An efficient strategy of managing irrigation water based on formulating highly absorbent polymer–inorganic clay composites. Journal of Hydrology 470 (2012), 193–200.

[7] Annie Bryant Burgess, Chris Mattmann, Giuseppe Totaro, Lewis John McGibbney, and Paul M. Ramirez. 2015. TREC Dynamic Domain: Polar Science. In TREC.

[8] Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP. 740–750.

[9] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. 2002. A framework and graphical development environment for robust NLP tools and applications. In ACL. 168–175.

[10] Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: clause-based open information extraction. In Proceedings of the 22nd International Conference on World Wide Web. ACM, 355–366.

[11] Quentin Hardy. 2016. Dell Gets Bigger and Hewlett-Packard Gets Smaller in Separate Deals. (Sep 2016). https://www.nytimes.com/2016/09/08/technology/dell-gets-bigger-and-hewlett-packard-gets-smaller-in-separate-deals.html

[12] Graham R. Hunt. 1977. Spectral signatures of particulate minerals in the visible and near infrared. Geophysics 42, 3 (1977), 501–513.

[13] Christine M. Lee, Morgan L. Cable, Simon J. Hook, Robert O. Green, Susan L. Ustin, Daniel J. Mandl, and Elizabeth M. Middleton. 2015. An introduction to the NASA Hyperspectral InfraRed Imager (HyspIRI) mission and preparatory activities. Remote Sensing of Environment 167 (2015), 6–19.

[14] Charlie Lloyd. 2013. Landsat 8 Bands. (2013). http://landsat.gsfc.nasa.gov/landsat-8/landsat-8-bands/

[15] Patrice Lopez. 2010. Automatic extraction and resolution of bibliographical references in patent documents. In Information Retrieval Facility Conference. Springer, 120–135.

[16] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations. 55–60. http://www.aclweb.org/anthology/P/P14/P14-5010

[17] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, 2 (1993), 313–330.

[18] J. Martínez-Fernández, A. González-Zamora, N. Sánchez, A. Gumuzzio, and C. M. Herrero-Jiménez. 2016. Satellite soil moisture for agricultural drought monitoring: Assessment of the SMOS derived Soil Water Deficit Index. Remote Sensing of Environment 177 (2016), 277–286.

[19] Abdou Mbaye, Cheikh Abdoul Khadir Diop, Benaïssa Rhouta, J. M. Brendle, François Senocq, Francis Maury, and Dinna Pathé Diallo. 2012. Mineralogical and physico-chemical characterizations of clay from Keur Saer (Senegal). Clay Minerals 47, 4 (2012), 499–511.

[20] Jernej Mlakar, Misa Korva, Natasa Tul, Mara Popovic, Mateja Poljsak-Prijatelj, Jerica Mraz, Marko Kolenc, Katarina Resman Rus, Tina Vesnaver Vipotnik, Vesna Fabjan Vodusek, and others. 2016. Zika virus associated with microcephaly. N Engl J Med 374 (2016), 951–958.

[21] NASA. 2013. Landsat 8 Overview. (2013). http://landsat.gsfc.nasa.gov/landsat-8/landsat-8-overview/

[22] Feng Niu, Ce Zhang, Christopher Ré, and Jude W. Shavlik. 2012. DeepDive: Web-scale knowledge-base construction using statistical learning and inference. VLDS 12 (2012), 25–28.

[23] Michael Schmitz, Robert Bart, Stephen Soderland, Oren Etzioni, and others. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 523–534.

[24] Ann Taylor, Mitchell Marcus, and Beatrice Santorini. 2003. The Penn Treebank: an overview. In Treebanks. Springer, 5–22.

[25] Eilene Zimmerman. 2016. A Cleaning Start-Up Wielding Mops, Buckets and 700 Data Points. (Sep 2016). https://www.nytimes.com/2016/09/08/business/smallbusiness/a-cleaning-start-up-wielding-mops-buckets-and-700-data-points.html