Knowledge Extraction from the Encyclopedia of Life Using Python NLTK

• Natural Language Processing (NLP)
• Semantic Statistics

Robin is the name of several fictional characters appearing in comic books published by DC Comics, originally created by Bob Kane, Bill Finger and Jerry Robinson, as a junior counterpart to DC Comics superhero Batman. The team of Batman and Robin is commonly referred to as the Dynamic Duo or the Caped Crusaders.

The American Robin is active mostly during the day and assembles in large flocks at night. It is one of the earliest bird species to lay eggs, beginning to breed shortly after returning to its summer range from its winter range. Its nest consists of long coarse grass, twigs, paper, and feathers, and is smeared with mud and often cushioned with grass or other soft materials. It is among the first birds to sing at dawn.

• fictional
• comic books
• Bob Kane
• superhero
• Batman
• Dynamic Duo
• Caped Crusaders

• flocks
• bird
• eggs
• nest
• sing
• species
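The two keyword lists above suggest a simple way to tell the two senses of "Robin" apart: count which list has more terms in the surrounding text. The following is a minimal sketch of that idea, not the method used in the talk; the function names `count_hits` and `guess_sense` are hypothetical, and the two sample texts are shortened from the paragraphs above.

```python
import re

# Keyword lists taken from the two bullet lists above.
COMIC_TERMS = {"fictional", "comic books", "bob kane", "superhero",
               "batman", "dynamic duo", "caped crusaders"}
BIRD_TERMS = {"flocks", "bird", "eggs", "nest", "sing", "species"}


def count_hits(text, terms):
    """Count how many terms occur in the text as whole words or phrases."""
    lowered = text.lower()
    return sum(
        1 for term in terms
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )


def guess_sense(text):
    """Return 'comic', 'bird', or 'unknown' (on a tie) for a text mentioning Robin."""
    comic = count_hits(text, COMIC_TERMS)
    bird = count_hits(text, BIRD_TERMS)
    if comic == bird:
        return "unknown"
    return "comic" if comic > bird else "bird"


comic_text = ("Robin is the name of several fictional characters appearing in "
              "comic books published by DC Comics, as a junior counterpart to "
              "DC Comics superhero Batman.")
bird_text = ("The American Robin assembles in large flocks at night. It is one "
             "of the earliest bird species to lay eggs.")

print(guess_sense(comic_text))
print(guess_sense(bird_text))
```

Whole-word matching matters here: a plain substring test would find "sing" inside other words and inflate the bird count.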

GNRD

Beautiful Soup

Resolver

From GNRD

name_list = ["Pandarus sinuatus", "Pandarus smithii"]

genera = []
for name in name_list:
    row = name.split(' ')
    genera.append(row[0])

genera = ["Pandarus", "Pandarus"]

i = -1
genus_index_list = []
for genus in genera:
    genus_text = tokens[i+1:]               # search only after the previous hit
    genus_index = genus_text.index(genus)   # position within the slice
    if i != -1:
        genus_index = genus_index + i + 1   # convert back to a position in tokens
    genus_index_list.append(genus_index)
    i = genus_index

genus_index_list = [36, 39]

# Walk backwards so that merging two tokens does not shift the indices still to be processed.
for counter, index in reversed(list(enumerate(genus_index_list))):
    species = ' '.join(tokens[index:index+2])  # Join the genus to the word immediately following.
    if species == name_list[counter]:           # Does this match the name_list?
        tokens[index:index+2] = [species]       # If yes, combine the two into one element.

tokens = ['Great', 'white', 'sharks', 'are', 'apex', 'predators', ',', 'meaning', 'they', 'have', 'a', 'large', 'effect', 'on', 'the', 'populations', 'of', 'their', 'prey', 'including', 'elephant', 'seals', 'and', 'sea', 'lions.', 'Great', 'white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
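The genus lookup and the merge step can be combined into one function. The sketch below is one possible arrangement, not the code from the talk: `merge_scientific_names` is a hypothetical name, it assumes the names occur in the text in the same order as in `name_list`, and it raises `ValueError` if a genus is missing. It uses the optional start argument of `list.index` instead of slicing, and merges from the end so earlier positions stay valid.

```python
def merge_scientific_names(tokens, name_list):
    """Return a copy of tokens in which each two-word name from name_list is one element."""
    merged = list(tokens)
    positions = []
    search_from = 0
    for name in name_list:
        genus = name.split(" ")[0]
        # list.index accepts a start offset, so the list does not need slicing.
        position = merged.index(genus, search_from)
        positions.append(position)
        search_from = position + 1
    # Merge from the last name to the first so earlier positions are not shifted.
    for name, position in reversed(list(zip(name_list, positions))):
        candidate = " ".join(merged[position:position + 2])
        if candidate == name:
            merged[position:position + 2] = [candidate]
    return merged


tokens = ["such", "as", "copepods", "(", "Pandarus", "sinuatus",
          "and", "Pandarus", "smithii", ")", "."]
name_list = ["Pandarus sinuatus", "Pandarus smithii"]

merged = merge_scientific_names(tokens, name_list)
print(merged)
```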

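term_lists = []
for name_index in name_index_list:
    start = max(name_index - 10, 0)  # avoid a negative slice start near the beginning of the text
    term_lists.append(tokens[start:name_index + 10])
term_list = term_lists[0]  # window around the first name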

name_index_list = [36,38]

term_list = ['white', 'sharks', 'are', 'hosts', 'to', 'parasites', 'such', 'as', 'copepods', '(', 'Pandarus sinuatus', 'and', 'Pandarus smithii', ')', '.']
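The window of terms around each name can be collected with a small helper. This is a sketch under the assumption that one window per name is wanted; `context_window` is a hypothetical name and the width of 10 tokens follows the slice shown above. The start is clamped at zero because a negative slice start would count from the end of the list.

```python
def context_window(tokens, index, width=10):
    """Return up to `width` tokens before `index` plus the tokens from `index` up to `index + width`."""
    start = max(index - width, 0)  # a negative start would wrap around to the end
    return tokens[start:index + width]


tokens = [
    "Great", "white", "sharks", "are", "apex", "predators", ",", "meaning",
    "they", "have", "a", "large", "effect", "on", "the", "populations",
    "of", "their", "prey", "including", "elephant", "seals", "and", "sea",
    "lions.", "Great", "white", "sharks", "are", "hosts", "to", "parasites",
    "such", "as", "copepods", "(", "Pandarus sinuatus", "and",
    "Pandarus smithii", ")", ".",
]
name_index_list = [36, 38]

# One window per name, keyed by the name itself.
windows = {tokens[i]: context_window(tokens, i) for i in name_index_list}
print(windows["Pandarus sinuatus"])
```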

Looking at the first relationship:

Carcharodon carcharias | Pandarus sinuatus | Parasite/host

Training Data

• Show the algorithm what “parasite/host” words look like
• Compare to an unknown
• We want “Document Classification”
• NLTK's categorized corpora: Brown, Reuters and Movie Reviews
• We need to make our own corpus

Creating a Categorized Text Corpus

• http://www.packtpub.com/article/python-text-processing-nltk-20-creating-custom-corpora

• Inside the “corpus” folder, create a new folder for your corpus. Mine is “eco”.
• Build your corpus (start with EOL text)
• Make a category specification
• Let's start with parasitism and predation

• eco
  – lion1.txt
  – lion2.txt
  – lion3.txt
  – shark1.txt
  – shark2.txt
  – shark3.txt
  – …
  – cats.txt

• in cats.txt
  lion1.txt predation
  lion2.txt parasitism
  …
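The folder layout and the category file can be generated with a short script. The sketch below writes a throwaway corpus into a temporary directory rather than into `nltk_data`; the helper names `write_corpus` and `read_categories` are hypothetical, and the document texts are placeholder sentences, not EOL text. It assumes one category per file, written as `fileid category` on each line of `cats.txt`.

```python
import os
import tempfile

# Placeholder documents (not EOL text): fileid -> (category, text).
documents = {
    "lion1.txt": ("predation", "Lions hunt zebra and wildebeest."),
    "lion2.txt": ("parasitism", "Lions are hosts to ticks and tapeworms."),
    "shark1.txt": ("predation", "Great white sharks prey on seals."),
}


def write_corpus(root, documents):
    """Write one text file per document plus a cats.txt category specification."""
    os.makedirs(root, exist_ok=True)
    for fileid, (_category, text) in documents.items():
        with open(os.path.join(root, fileid), "w", encoding="utf-8") as handle:
            handle.write(text)
    with open(os.path.join(root, "cats.txt"), "w", encoding="utf-8") as handle:
        for fileid, (category, _text) in sorted(documents.items()):
            handle.write(f"{fileid} {category}\n")


def read_categories(root):
    """Parse cats.txt back into a fileid -> category mapping."""
    mapping = {}
    with open(os.path.join(root, "cats.txt"), encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                fileid, category = line.split()
                mapping[fileid] = category
    return mapping


corpus_root = os.path.join(tempfile.mkdtemp(), "eco")
write_corpus(corpus_root, documents)
categories = read_categories(corpus_root)
print(categories)
```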

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

corpus_root = '/Users/athessen/nltk_data/corpora/eco'
# Group the alternation so that both names require the number and the .txt suffix.
reader = CategorizedPlaintextCorpusReader(corpus_root, r'(lion|shark)\d+\.txt', cat_file='cats.txt')

Choose a Corpus Reader

You have to tell this Corpus Reader:

• Corpus root directory
• File names (aka fileids)
• Category specification
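The file-name pattern deserves a quick check before it is handed to a corpus reader, because `|` binds more loosely than the rest of a regular expression. The sketch below uses Python's `re` module directly and does not claim to reproduce how NLTK applies the pattern; the candidate file names, including `lionfish.csv`, are made up for illustration.

```python
import re

# Hypothetical directory listing.
candidates = ["lion1.txt", "lion2.txt", "shark3.txt", "cats.txt", "lionfish.csv"]

# Grouped: "lion" or "shark", then digits, then ".txt".
grouped = r"(lion|shark)\d+\.txt"
# Ungrouped: either "lion" alone, or "shark" with optional digits and ".txt".
ungrouped = r"lion|shark\d*\.txt"

selected = [name for name in candidates if re.fullmatch(grouped, name)]
loose = [name for name in candidates if re.match(ungrouped, name)]

print(selected)
print(loose)
```

With the ungrouped pattern and prefix matching, any name that starts with "lion" is accepted, which is why `lionfish.csv` slips through. Neither pattern selects `cats.txt`, so the category file stays out of the document list.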

Next Steps

• Build corpus
• Build Feature Extractor
• Train Classifier

Build Feature Extractor
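A feature extractor turns a text into the kind of input a classifier can learn from. NLTK classifiers take a featureset, which is a dictionary mapping feature names to values. The sketch below is a simple bag-of-words extractor and only one of many possible designs; the function name `bag_of_words` and the short stopword list are assumptions, not taken from the talk.

```python
import re

# A deliberately tiny, hypothetical stopword list.
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "and", "are", "is", "on", "as"})


def bag_of_words(text, stopwords=STOPWORDS):
    """Map each lowercase word in the text (minus stopwords) to True."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word: True for word in words if word not in stopwords}


features = bag_of_words("Great white sharks are hosts to parasites such as copepods.")
print(sorted(features))
```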

Train Classifier
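Training pairs each featureset with its category and hands the list to a classifier. The sketch below uses NLTK's `NaiveBayesClassifier`; the choice of Naive Bayes is an assumption, since the slides do not say which classifier was used, and the four training examples are hand-made for illustration rather than drawn from the eco corpus.

```python
from nltk.classify import NaiveBayesClassifier


def bag_of_words(words):
    """Map each lowercase word to True, the featureset format NLTK expects."""
    return {word.lower(): True for word in words}


# Tiny hand-made training set (hypothetical, for illustration only).
training_data = [
    (bag_of_words(["sharks", "are", "hosts", "to", "parasites"]), "parasitism"),
    (bag_of_words(["copepods", "are", "parasites", "of", "sharks"]), "parasitism"),
    (bag_of_words(["sharks", "prey", "on", "seals"]), "predation"),
    (bag_of_words(["lions", "hunt", "and", "eat", "zebra"]), "predation"),
]

classifier = NaiveBayesClassifier.train(training_data)

# Classify a term list that was not part of the training data.
unknown = bag_of_words(["ticks", "are", "parasites", "of", "lions"])
label = classifier.classify(unknown)
print(label)
```

A training set this small only demonstrates the mechanics. A usable classifier needs many documents per category, with some held back to check the results.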

Error Checking
