Knowledge Extraction from the Encyclopedia of Life Using Python NLTK

Post on 09-May-2015

DESCRIPTION

This presentation demonstrates the potential for NLTK to extract information about ecological species interactions from text in EOL. It was presented Nov 12, 2013 at the Startup Institute in Cambridge, MA for the Boston PyLadies monthly meeting.

TRANSCRIPT

Knowledge extraction from the Encyclopedia of Life

Using Python NLTK

Anne Thessen

annethessen@gmail.com

Finding Taxonomic Names

Challenges

Koko

Горилла

Guerilla

Eastern Lowland Gorilla

Gorilla graueri

Gorilla berengei

Gorilla beringei

Matschie

Gorilla beringei mikenensis

King kong

Gorilla gorilla

Virunga

Gorila

Gorille

Mountain gorilla

大猩猩

ゴリラ

Challenges

Aotus trivirgatus

Aotus Illiger 1811

Aotus

Aotus Smith 1805

Aotus ericoides

Contextual data

Primate, Monkey, Eyes, Food, Panama

Aotus nancymaae

Disambiguate by authority, species, contextual data

Contextual data

Legume, Plant, Flower, Mirbelieae, Australia

Aotus mollis
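One simple way to use contextual data like this is to compare the words around an ambiguous genus name with a short list of context terms for each candidate. The sketch below is only an illustration, not the method used in the talk: the term sets are adapted from the two "Contextual data" slides, and `guess_taxon` and the candidate labels are hypothetical names.

```python
# Context terms adapted from the two "Contextual data" slides (labels are made up).
CONTEXT_TERMS = {
    "Aotus (monkey)": {"primate", "monkey", "eyes", "food", "panama"},
    "Aotus (legume)": {"legume", "plant", "flower", "mirbelieae", "australia"},
}

def guess_taxon(text, context_terms=CONTEXT_TERMS):
    """Return the candidate whose terms overlap most with the text.

    Returns None when nothing overlaps or when the best score is tied.
    """
    words = {w.strip(".,;:()").lower() for w in text.split()}
    scores = {taxon: len(words & terms) for taxon, terms in context_terms.items()}
    best = max(scores.values())
    winners = [taxon for taxon, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None

print(guess_taxon("Aotus is a nocturnal monkey with large eyes found in Panama."))
print(guess_taxon("Aotus is a flowering plant, a legume native to Australia."))
```

Authority strings (Illiger 1811 vs. Smith 1805) and species epithets could be checked the same way before falling back on context words.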

GNRD

Beautiful Soup

Resolver

• Common names
• Interaction type

Python NLTK

• http://nltk.org/book/
• http://nltk.org/
• Install NLTK and NLTK Data

• Natural Language Processing (NLP)

• Semantic Statistics

Robin is the name of several fictional characters appearing in comic books published by DC Comics, originally created by Bob Kane, Bill Finger and Jerry Robinson, as a junior counterpart to DC Comics superhero Batman. The team of Batman and Robin is commonly referred to as the Dynamic Duo or the Caped Crusaders.

The American Robin is active mostly during the day and assembles in large flocks at night. It is one of the earliest bird species to lay eggs, beginning to breed shortly after returning to its summer range from its winter range. Its nest consists of long coarse grass, twigs, paper, and feathers, and is smeared with mud and often cushioned with grass or other soft materials. It is among the first birds to sing at dawn.

• fictional
• comic books
• Bob Kane
• superhero
• Batman
• Dynamic Duo
• Caped Crusaders

• flocks
• bird
• eggs
• nest
• sing
• species
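Keyword lists like these can be approximated with simple word statistics: count the words in each passage, drop stopwords, and keep the ones that appear in only one passage. The sketch below assumes a tiny hand-made stopword list and two sentences shortened from the passages above; `content_words` and `distinctive_terms` are hypothetical helper names.

```python
import re
from collections import Counter

# A tiny hand-made stopword list (an assumption for this sketch).
STOPWORDS = {"the", "is", "of", "a", "and", "to", "in", "as", "by",
             "or", "it", "its", "at", "with", "one", "from"}

def content_words(text):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def distinctive_terms(text, other):
    """Words that occur in `text` but never in `other`, sorted alphabetically."""
    return sorted(set(content_words(text)) - set(content_words(other)))

# Shortened versions of the two "Robin" passages.
comic = "Robin is a junior counterpart to the superhero Batman in comic books."
bird = "The American Robin is a bird that lays eggs in a nest and sings at dawn."

print(distinctive_terms(comic, bird))
print(distinctive_terms(bird, comic))
```

The shared word "robin" drops out of both lists, which is the point: it is the surrounding vocabulary that tells the two senses apart.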

GNRD

Beautiful Soup

Resolver

From GNRD

name_list = ["Pandarus sinuatus", "Pandarus smithii"]

genera = []
for name in name_list:
    row = name.split(' ')
    genera.append(row[0])

genera = ["Pandarus", "Pandarus"]

i = -1
genus_index_list = []
for genus in genera:
    genus_text = tokens[i+1:]  # search only the tokens after the previous hit
    genus_index = genus_text.index(genus)
    if i == -1:
        genus_index_list.append(genus_index)
    else:
        genus_index = genus_index + i + 1  # convert back to an index into tokens
        genus_index_list.append(genus_index)
    i = genus_index

genus_index_list = [36, 39]

# Work from the end of the list so earlier indices stay valid after each merge.
for index, name in reversed(list(zip(genus_index_list, name_list))):
    species = ' '.join(tokens[index:index+2])  # Join the genus to the word immediately following.
    if species == name:  # Does this match the name_list?
        tokens[index:index+2] = [species]  # If yes, combine the two into one element

tokens = ['Great', 'white', 'sharks', 'are', 'apex', 'predators', ',', 'meaning', 'they', 'have', 'a', 'large', 'effect', 'on', 'the', 'populations', 'of', 'their', 'prey', 'including', 'elephant', 'seals', 'and', 'sea', 'lions.', 'Great', 'white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']

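name_index_list = [36, 38]

term_lists = []
for name_index in name_index_list:
    start = max(0, name_index - 10)  # avoid a negative start index wrapping around
    term_lists.append(tokens[start:name_index + 10])  # one window per name

term_list = term_lists[0]  # window around the first name

term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']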
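The merge and window steps can also be written as two small functions. This is an alternative sketch rather than the code from the slides: it scans the tokens for each multi-word name directly instead of locating genus indices first, and `merge_names` and `context_window` are hypothetical names. A shortened token list is used to keep the example small.

```python
def merge_names(tokens, names):
    """Return a new token list in which each multi-word name is a single token."""
    name_tokens = [name.split(" ") for name in names]
    merged = []
    i = 0
    while i < len(tokens):
        for parts in name_tokens:
            if tokens[i:i + len(parts)] == parts:
                merged.append(" ".join(parts))
                i += len(parts)
                break
        else:
            # No name starts at position i, so keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

def context_window(tokens, name, size=10):
    """Tokens from `size` before to `size - 1` after the first occurrence of `name`."""
    index = tokens.index(name)
    return tokens[max(0, index - size):index + size]

tokens = ["sharks", "are", "hosts", "to", "parasites", "such", "as", "copepods",
          "(", "Pandarus", "sinuatus", "and", "Pandarus", "smithii", ")", "."]
names = ["Pandarus sinuatus", "Pandarus smithii"]

merged = merge_names(tokens, names)
print(merged)
print(context_window(merged, "Pandarus sinuatus", size=3))
```

Building a new list avoids the index shift that happens when tokens are merged in place.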

Looking at the first relationship:

Carcharodon carcharias, Pandarus sinuatus

Parasite/host

Training Data

• Show the algorithm what “parasite/host” words look like

• Compare to an unknown
• We want “Document Classification”
• NLTK's categorized corpora: Brown, Reuters and Movie Reviews
• We need to make our own corpus
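The idea of showing an algorithm what category words look like, then comparing an unknown text to them, can be sketched without any library. The example below is a deliberately crude stand-in for a real classifier: the training snippets are invented, and `train` and `classify` are hypothetical helpers that score by raw word overlap.

```python
from collections import Counter

# Invented example snippets; a real corpus would be built from EOL text.
TRAINING = [
    ("parasitism", "copepods are parasites that live on the skin of their hosts"),
    ("parasitism", "the tapeworm is a parasite and the host is a fish"),
    ("predation", "lions hunt and kill zebra as prey"),
    ("predation", "the shark is a predator that eats seals"),
]

def train(examples):
    """Count how often each word appears under each category."""
    counts = {}
    for label, text in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def classify(counts, text):
    """Pick the category whose training words overlap most with the unknown text."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

model = train(TRAINING)
print(classify(model, "great white sharks are hosts to parasites"))
```

Raw overlap is easily skewed by common words such as "the", which is one reason to move to a proper feature extractor and a trained classifier.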

Creating a Categorized Text Corpus

• http://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora

• Inside “corpus” folder create new folder for your corpus. Mine is “eco”.

• Build your corpus (start with EOL text)
• Make a category specification
• Let's start with parasitism and predation

Creating a Categorized Text Corpus

• eco
    – lion1.txt
    – lion2.txt
    – lion3.txt
    – shark1.txt
    – shark2.txt
    – shark3.txt
    – …
    – cats.txt

• in cats.txt

lion1.txt predation
lion2.txt parasitism
…
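The folder layout and category file above can be generated with a short script. This is a minimal sketch under stated assumptions: the texts are placeholders rather than real EOL content, the corpus is written to a temporary directory instead of the NLTK data folder, and `build_corpus` is a hypothetical helper.

```python
import os
import tempfile

def build_corpus(root, documents):
    """Write one .txt file per document plus a cats.txt mapping file names to categories."""
    os.makedirs(root, exist_ok=True)
    lines = []
    for fileid, (category, text) in sorted(documents.items()):
        with open(os.path.join(root, fileid), "w", encoding="utf-8") as handle:
            handle.write(text)
        lines.append(f"{fileid} {category}")  # one "file category" pair per line
    with open(os.path.join(root, "cats.txt"), "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")

# Placeholder texts; real files would hold EOL text.
documents = {
    "lion1.txt": ("predation", "Lions prey on zebra."),
    "shark1.txt": ("parasitism", "Sharks are hosts to copepods."),
}

corpus_root = os.path.join(tempfile.mkdtemp(), "eco")
build_corpus(corpus_root, documents)

with open(os.path.join(corpus_root, "cats.txt"), encoding="utf-8") as handle:
    print(handle.read())
```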

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

corpus_root = '/Users/athessen/nltk_data/corpora/eco'
# Group the alternation so the pattern means "lion or shark, optional digits, .txt"
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d*\.txt', cat_file='cats.txt')

Choose a Corpus Reader

You have to tell this Corpus Reader:

• Corpus root directory
• File names (aka fileids)
• Category specification
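The file-name argument is an ordinary regular expression, so it is worth checking which names a pattern actually selects before building the reader. The sketch below assumes the pattern is applied from the start of each file name with the end anchored; the file list and the helper `matching_fileids` are hypothetical.

```python
import re

filenames = ["lion1.txt", "shark2.txt", "cats.txt", "lion_notes.csv"]

def matching_fileids(pattern, names):
    """Names matched from their start by `pattern`, with the end anchored by '$'."""
    return [n for n in names if re.match(pattern + "$", n)]

# Ungrouped: read as "lion" OR "shark, digits, .txt", so anything starting with "lion" passes.
print(matching_fileids(r"lion|shark\d*\.txt", filenames))

# Grouped: "lion" or "shark", then optional digits, then ".txt".
print(matching_fileids(r"(lion|shark)\d*\.txt", filenames))
```

Neither pattern picks up cats.txt, which is what you want: the category file should not be read as a document.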

Next Steps

• Build corpus
• Build Feature Extractor
• Train Classifier
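The last two steps might look roughly like this. The feature extractor follows the "contains(word)" convention used in the NLTK book, and the training function wraps `nltk.NaiveBayesClassifier.train`. The vocabulary, the function names and the choice of Naive Bayes are assumptions made for illustration, since the slides leave these steps open.

```python
def word_features(text, vocabulary):
    """Map a text to {'contains(word)': True/False} for each vocabulary word."""
    words = set(text.lower().split())
    return {f"contains({w})": (w in words) for w in vocabulary}

def train_classifier(labeled_texts, vocabulary):
    """Train an NLTK Naive Bayes classifier on (text, category) pairs."""
    import nltk  # imported here so the feature extractor works without NLTK installed
    train_set = [(word_features(text, vocabulary), label)
                 for text, label in labeled_texts]
    return nltk.NaiveBayesClassifier.train(train_set)

# Hypothetical vocabulary of cue words.
VOCABULARY = ["hosts", "parasites", "prey", "predators"]

print(word_features("Great white sharks are hosts to parasites", VOCABULARY))
```

With NLTK installed, `train_classifier([(text, category), ...], VOCABULARY)` returns a classifier, and `classifier.classify(word_features(unknown_text, VOCABULARY))` labels a new passage. Holding back some labeled files for testing gives a basic error check.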

Build Feature Extractor

Train Classifier

Error Checking
