Deep Learning Study Group @ Komachi Lab: "Learning Character-level Representations for Part-of-Speech Tagging"

Learning Character-level Representations for Part-of-Speech Tagging
Cícero Nogueira dos Santos, Bianca Zadrozny
IBM Research - Brazil, Av. Pasteur, 138/146 – Rio de Janeiro, 22296-903, Brazil
Tokyo Metropolitan University, Information and Communication Systems, Komachi Lab, M1 Yuki Tomo


Page 1: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Deep Learning Study Group @ Komachi Lab

Learning Character-level Representations for Part-of-Speech Tagging

Cícero Nogueira dos Santos, Bianca Zadrozny
IBM Research - Brazil, Av. Pasteur, 138/146 – Rio de Janeiro, 22296-903, Brazil

Tokyo Metropolitan University, Information and Communication Systems, Komachi Lab, M1 Yuki Tomo

Page 2: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Introduction

• Distributed word representations (word embeddings) are an invaluable resource for NLP
• Word embeddings capture syntactic and semantic information about words, with a few exceptions (Luong et al., 2013)
• Character-level information is very important in POS tagging
• Usually, when a task needs morphological or word shape information, the knowledge is included as handcrafted features (Collobert, 2011)
• The paper proposes a new deep neural network (DNN) architecture that joins word-level and character-level embeddings to perform POS tagging
• The proposed DNN is called CharWNN

Page 3: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Neural Network Architecture

• The deep neural network we propose extends Collobert et al.’s (2011) neural network architecture

• Given a sentence, the network gives, for each word, a score for each tag τ ∈ T.

• In order to score a word, the network takes as input a fixed-sized window of words centered on the target word.

• The novelty in our network architecture is the inclusion of a convolutional layer to extract character-level representations.

Page 5: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

1st layer of the network

• transforms words into real-valued feature vectors (embeddings)
• given a sentence consisting of N words, every word w_n is converted into a vector u_n = [r^wrd; r^wch], which is composed of two sub-vectors: the word-level embedding r^wrd and the character-level embedding r^wch of w_n

Page 6: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Word-Level Embeddings

• V^wrd: a fixed-sized word vocabulary
• v^w: a one-hot vector of size |V^wrd| indicating the word w
• word-level embedding: r^wrd = W^wrd v^w (a sketch follows below)
• d^wrd, the word embedding dimension: hyper-parameter to be chosen by the user
• W^wrd ∈ R^(d^wrd × |V^wrd|): parameter to be learned
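A minimal NumPy sketch of the lookup r^wrd = W^wrd v^w with toy sizes (all names illustrative). Multiplying the embedding matrix by a one-hot vector simply selects one column, which is why real implementations use plain indexing instead.

    import numpy as np

    # Toy sizes: |V^wrd| = 5 words, d^wrd = 4 dimensions (hyper-parameter).
    vocab_size, d_wrd = 5, 4
    rng = np.random.default_rng(0)

    # W^wrd: parameter to be learned, one column per vocabulary word.
    W_wrd = rng.normal(size=(d_wrd, vocab_size))

    # v^w: one-hot vector for the word with vocabulary index 2.
    v_w = np.zeros(vocab_size)
    v_w[2] = 1.0

    # r^wrd = W^wrd v^w; the product just selects column 2 of W^wrd.
    r_wrd = W_wrd @ v_w
    assert np.allclose(r_wrd, W_wrd[:, 2])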

Page 7: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Character-Level Embeddings

• convolutional approach (introduced by Waibel et al. (1989))

• produces local features around each character of the word

• combines them to create a fixed-sized character-level embedding of the word.

Page 8: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

• given a word w composed of M characters {c_1, ..., c_M}
• transform each character c_m into a character embedding r^chr_m
• V^chr: a fixed-sized character vocabulary
• v^c: a one-hot vector of size |V^chr| indicating the character c
• character embedding: r^chr = W^chr v^c
• W^chr ∈ R^(d^chr × |V^chr|): parameter to be learned
• the input for the convolutional layer is the sequence of character embeddings {r^chr_1, ..., r^chr_M}

Page 9: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Convolutional Layer

• input: the sequence of character embeddings {r^chr_1, ..., r^chr_M}
• for each character position m, define the vector z_m as the concatenation of the character embeddings in its window: z_m = (r^chr_(m-(k^chr-1)/2), ..., r^chr_(m+(k^chr-1)/2))
• k^chr: window size

Page 10: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

• the convolutional layer computes the j-th element of the vector r^wch, the character-level embedding of w: [r^wch]_j = max_(1 ≤ m ≤ M) [W^0 z_m + b^0]_j (a sketch follows below)
• W^0 ∈ R^(cl_u × d^chr k^chr) and b^0, the weight matrix and bias of the convolutional layer: parameters to be learned
• cl_u: number of convolutional units (which corresponds to the size of the character-level embedding of a word); hyper-parameter
• k^chr: the size of the character context window; hyper-parameter
• d^chr: the size of the character vector; hyper-parameter
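A minimal NumPy sketch of this computation with toy sizes. Zero padding at the word edges is an assumption here, since the slide does not spell out edge handling.

    import numpy as np

    rng = np.random.default_rng(0)
    d_chr, k_chr, cl_u = 5, 3, 8   # d^chr, k^chr, cl_u: hyper-parameters
    M = 6                          # number of characters in the word

    # Character embeddings r^chr_1..r^chr_M as columns (already looked up).
    R = rng.normal(size=(d_chr, M))

    # Zero-pad (k^chr - 1)/2 columns on each side so every position has a
    # full window (edge handling assumed).
    half = (k_chr - 1) // 2
    Rp = np.pad(R, ((0, 0), (half, half)))

    W0 = rng.normal(size=(cl_u, d_chr * k_chr))  # W^0: to be learned
    b0 = rng.normal(size=cl_u)                   # b^0: to be learned

    # z_m concatenates the window of embeddings around position m; max
    # pooling over positions gives [r^wch]_j = max_m [W^0 z_m + b^0]_j.
    conv = np.stack([W0 @ Rp[:, m:m + k_chr].ravel(order="F") + b0
                     for m in range(M)])
    r_wch = conv.max(axis=0)   # fixed-size character-level embedding of w
    print(r_wch.shape)         # (8,)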

Page 11: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Scoring

• follow Collobert et al.'s (2011) window approach to score all tags T for each word in a sentence
• assumption: the tag of a word depends mainly on its neighboring words
• given a sentence with N words, convert each word into its joint word-level and character-level embedding u_n
• create a vector x_n resulting from the concatenation of a sequence of k^wrd embeddings, centered on the n-th word; positions that fall outside the sentence use the embedding of a padding token (a sketch follows below)
• k^wrd, the size of the word context window: hyper-parameter
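A short sketch of the window construction, assuming out-of-sentence positions are filled with the padding-token embedding (a zero vector here, which is an assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    N, dim, k_wrd = 4, 6, 3   # sentence length, size of u_n, window k^wrd

    # u_1..u_N: joint word-level + character-level embeddings.
    U = [rng.normal(size=dim) for _ in range(N)]
    pad = np.zeros(dim)       # padding-token embedding (assumed zero)

    half = (k_wrd - 1) // 2

    def window_input(n):
        # x_n: concatenation of k^wrd embeddings centered on word n (0-based).
        cols = [U[m] if 0 <= m < N else pad
                for m in range(n - half, n + half + 1)]
        return np.concatenate(cols)

    print(window_input(0).shape)   # (k_wrd * dim,) = (18,)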

Page 12: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

• the vector x_n is processed by two usual NN layers: s(x_n) = W^2 h(W^1 x_n + b^1) + b^2 (a sketch follows below)
• transfer function h: the hyperbolic tangent, tanh
• W^1, b^1, W^2, b^2: parameters to be learned
• hl_u, the number of hidden units: hyper-parameter
• s(x_n): scores of all tags for the n-th word
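The two layers are a plain affine-tanh-affine stack; a minimal sketch with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    in_dim, hl_u, n_tags = 18, 10, 5   # size of x_n, hidden units, |T|

    W1 = rng.normal(size=(hl_u, in_dim)); b1 = rng.normal(size=hl_u)
    W2 = rng.normal(size=(n_tags, hl_u)); b2 = rng.normal(size=n_tags)

    def score(x_n):
        # s(x_n) = W^2 tanh(W^1 x_n + b^1) + b^2: one score per tag.
        return W2 @ np.tanh(W1 @ x_n + b1) + b2

    print(score(rng.normal(size=in_dim)))   # vector of |T| tag scores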

Page 13: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Structured Inference

• the tags of neighboring words are strongly dependent
• prediction scheme that takes into account the sentence structure (Collobert et al., 2011)
• A_(t,u): transition score from tag t to tag u
• given the sentence [w]_1^N, the score for a tag path [t]_1^N is computed as: S([w]_1^N, [t]_1^N, θ) = Σ_(n=1..N) (A_(t_(n-1), t_n) + s(x_n)_(t_n))  (Eq. 5)
• infer the predicted sentence tags using the Viterbi (1967) algorithm (a sketch follows below)
• θ: the set of all trainable network parameters
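A compact Viterbi sketch over the per-word tag scores s(x_n) and the transition matrix A. Start-of-sentence handling is simplified (no separate initial-tag scores), which may differ from the paper.

    import numpy as np

    def viterbi(emit, A):
        # emit: (N, T) per-word tag scores s(x_n); A: (T, T) transition
        # scores A[t, u] for moving from tag t to tag u.
        N, T = emit.shape
        delta = emit[0].copy()              # best score ending in each tag
        back = np.zeros((N, T), dtype=int)  # backpointers
        for n in range(1, N):
            cand = delta[:, None] + A       # (previous tag, next tag)
            back[n] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + emit[n]
        path = [int(delta.argmax())]
        for n in range(N - 1, 0, -1):       # follow backpointers
            path.append(int(back[n][path[-1]]))
        return path[::-1]

    rng = np.random.default_rng(0)
    print(viterbi(rng.normal(size=(6, 5)), rng.normal(size=(5, 5))))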

Page 14: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Network Training

• the network is trained by minimizing the negative log-likelihood over the training set D (Collobert, 2011)
• interpret the sentence score (5) as a conditional probability over a tag path: log p([t]_1^N | [w]_1^N, θ) = S([w]_1^N, [t]_1^N, θ) - log Σ_(∀[u]_1^N ∈ T^N) e^(S([w]_1^N, [u]_1^N, θ))
• Equation 7 can be computed efficiently using dynamic programming (Collobert, 2011) (a sketch follows below)
• use stochastic gradient descent (SGD) to minimize the negative log-likelihood with respect to θ
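A sketch of that dynamic program: the negative log-likelihood is the log-sum-exp over all tag-path scores (a forward recursion) minus the gold-path score, with the same simplified start handling as the Viterbi sketch above.

    import numpy as np

    def log_sum_exp(v, axis=0):
        # Numerically stable log of a sum of exponentials.
        m = v.max(axis=axis, keepdims=True)
        return (m + np.log(np.exp(v - m).sum(axis=axis,
                                             keepdims=True))).squeeze(axis)

    def neg_log_likelihood(emit, A, tags):
        # -log p(tags | sentence) = log-sum over all paths - gold score.
        N, T = emit.shape
        gold = emit[0, tags[0]] + sum(A[tags[n - 1], tags[n]] + emit[n, tags[n]]
                                      for n in range(1, N))
        alpha = emit[0].copy()          # log-sum of path scores per end tag
        for n in range(1, N):
            alpha = log_sum_exp(alpha[:, None] + A, axis=0) + emit[n]
        return log_sum_exp(alpha) - gold

    rng = np.random.default_rng(0)
    emit, A = rng.normal(size=(6, 5)), rng.normal(size=(5, 5))
    print(neg_log_likelihood(emit, A, [0, 1, 2, 3, 4, 0]))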

Page 15: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Experimental Setup

Page 16: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Preprocessing and Tokenizing Corpora

• English
  o December 2013 snapshot of the English Wikipedia corpus
  o the clean corpus contains about 1.75 billion tokens
  o used as the source of unlabeled text
• Portuguese
  o the Portuguese Wikipedia
  o the CETENFolha corpus
  o the CETEMPublico corpus
  o concatenating the three corpora, the resulting corpus contains around 401 million words
• both corpora are preprocessed and tokenized

Page 17: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Unsupervised Learning of Word-Level Embeddings

• using the word2vec tool (a rough gensim equivalent follows below)
  o skip-gram method
  o context window: size 5
• use the same parameters for English and Portuguese, except for the minimum word frequency
• minimum word frequency
  o English: 10
  o Portuguese: 5
• resulting vocabulary size
  o English: 870,214 entries
  o Portuguese: 453,990 entries
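For reference, a rough gensim equivalent of this setup; this is a sketch, not the authors' actual pipeline (they used the word2vec tool itself), and the embedding dimension below is illustrative:

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat"], ["a", "dog", "barked"]]  # toy corpus

    model = Word2Vec(
        sentences,
        sg=1,             # skip-gram method
        window=5,         # context window of size 5
        min_count=1,      # the slide uses 10 (English) / 5 (Portuguese);
                          # 1 only so this toy corpus keeps a vocabulary
        vector_size=100,  # illustrative dimension, not from the slide
    )
    print(model.wv["cat"].shape)   # (100,)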

Page 18: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Supervised Learning of Character-Level Embeddings

• supervised learning (not unsupervised)
• the character vocabulary is quite small compared to the word vocabulary
• the amount of data in the labeled POS tagging training corpora is enough to effectively train character-level embeddings
• trained using the raw (not lowercased) words

Page 19: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

POS Tagging Datasets

• English
  o Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993)
  o 45 different POS tags
  o same partitions used by other researchers (Manning, 2011; Søgaard, 2011; Collobert et al., 2011)
  o to fairly compare the results with previous work
• Portuguese
  o Mac-Morpho corpus (Aluísio et al., 2003)
  o 22 POS tags
  o the same training/test partitions used by dos Santos et al. (2008)
  o created a development set by randomly selecting 5% of the training set sentences

Page 20: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

• out-of-the-supervised-vocabulary words (OOSV)
  o words that do not appear in the training set
• out-of-the-unsupervised-vocabulary words (OOUV)
  o words that do not appear in the vocabulary created using the unlabeled data, i.e., words for which we have no word embeddings

Page 21: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Model Setup

• use the development sets to tune the neural network hyper-parameters (the ones to be chosen by the user)
• in SGD training, the learning rate is the hyper-parameter that has the largest impact on prediction performance
• use a learning rate schedule that decreases the learning rate λ according to the training epoch t (the formula follows below)
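The slide omits the schedule itself; as best I recall from the paper (treat the exact form as an assumption), it simply divides the initial learning rate by the epoch number:

    λ_t = λ / t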

Page 22: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Hyper-Parameters

• use the same set of hyper-parameters for both English and Portuguese
• this provides some indication of the robustness of the approach across multiple languages
• the number of training epochs is the only difference in training setup between the two languages
  o the WSJ corpus: 5 training epochs
  o the Mac-Morpho corpus: 8 training epochs

Page 23: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

WNN model (compared with the proposed CharWNN)

• used in order to assess the effectiveness of the proposed character-level representation of words
• WNN
  o essentially Collobert et al.'s (2011) NN architecture
  o uses the same NN hyper-parameter values (when applicable) shown in Table 3
  o represents a network which is fed with word representations only
• it also includes two additional handcrafted features (illustrated below)
  o capitalization feature
  o suffix feature
• capitalization and suffix embeddings have dimension 5
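For illustration, such handcrafted features typically look like the sketch below; the exact category set and suffix length are assumptions, since the slide only names the features. Each feature value is then mapped to a learned embedding of dimension 5, per the slide.

    def capitalization_feature(word):
        # Coarse capitalization class (category set assumed; the slide
        # only says a "capitalization feature" is used).
        if word.isupper():
            return "ALL_CAPS"
        if word[:1].isupper():
            return "INIT_CAP"
        if any(c.isupper() for c in word):
            return "HAS_CAP"
        return "LOWER"

    def suffix_feature(word, k=3):
        # Last k characters as the suffix feature (k assumed).
        return word[-k:].lower()

    print(capitalization_feature("Tagging"), suffix_feature("Tagging"))
    # INIT_CAP ging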

Page 24: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Experimental Results

Page 25: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Results for POS Tagging of Portuguese

• the results demonstrate the effectiveness of the convolutional approach to learning valuable character-level information for POS tagging

Page 26: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Compare CharWNN with the state of the art (Portuguese)

• compare the result of CharWNN with reported state-of-the-art results for the Mac-Morpho corpus
• in both pieces of work, the authors use many handcrafted features, most of them to deal with out-of-vocabulary words
• this is a remarkable result, since the model is trained from scratch, i.e., without the use of any handcrafted features

Page 27: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Portuguese Similar Words: Character-Level Embeddings

• check the effectiveness of the morphological information encoded in character-level embeddings of words
• the similarity between two words is computed as the cosine between the two vectors (sketched below)
• the character-level embeddings are very effective for learning suffixes and prefixes

[Table omitted: most similar words in the training set for OOSV and OOUV example words]
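Tables of most similar words like this can be produced with a few lines of NumPy; a toy sketch with random vectors and made-up words, purely illustrative:

    import numpy as np

    def most_similar(query_vec, vocab_vecs, vocab_words, k=3):
        # Rank vocabulary words by cosine similarity to a query embedding.
        V, q = np.asarray(vocab_vecs, float), np.asarray(query_vec, float)
        sims = (V @ q) / (np.linalg.norm(V, axis=1) * np.linalg.norm(q))
        top = np.argsort(-sims)[:k]
        return [(vocab_words[i], float(sims[i])) for i in top]

    rng = np.random.default_rng(0)
    words = ["casa", "casas", "carro", "carros", "andar"]  # made-up words
    vecs = rng.normal(size=(5, 8))                         # random stand-ins
    print(most_similar(vecs[0], vecs, words))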

Page 28: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Portuguese Similar Words: Word-Level Embeddings

• the same five out-of-vocabulary words and their respective most similar words in the vocabulary extracted from the unlabeled data
• use word-level embeddings to compute the similarity
• the words are more semantically related, some of them being synonyms
• morphologically less similar than the ones in Table 6

[Table omitted: most similar words for the OOSV and OOUV example words]

Page 29: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Results for POS Tagging of English

• using the character-level embeddings of words (CharWNN), we again get better results than the ones obtained with the WNN that uses handcrafted features
• the convolutional approach to learning character-level embeddings can be successfully applied to the English language

Page 30: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Compare CharWNN with the state of the art (English)

• a comparison of the proposed CharWNN results with reported state-of-the-art results for the WSJ corpus
• this is a remarkable result since, differently from the other systems, this approach does not use any handcrafted features

Page 31: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

English Similar Words: Character-Level Embeddings

• the similarity between the character-level embeddings of rare words and words in the training set
• six OOSV words are presented with their respective most similar words in the training set
• the character-level embedding approach is effective at learning suffixes, prefixes and even word shapes

Page 32: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

English Similar Words: Word-Level Embeddings

• the same six OOSV words and their respective most similar words in the vocabulary extracted from the unlabeled data
• the similarity is computed using the word-level embeddings
• the retrieved words are morphologically less related than the ones in Table 10

Page 33: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Conclusion

• The main contributions:
  1. the idea of using convolutional neural networks to extract character-level features and jointly use them with word-level features
  2. the demonstration that it is feasible to train state-of-the-art POS taggers for different languages using the same model, with the same hyper-parameters, and without any handcrafted features
  3. the definition of a new state-of-the-art result for Portuguese POS tagging on the Mac-Morpho corpus
• A weak point:
  o the introduction of additional hyper-parameters to be tuned
  o however, the authors argue that it is generally preferable to tune hyper-parameters than to handcraft features, since the former can be automated

Page 34: Deep Learning Study Group @ Komachi Lab "Learning Character-level Representations for Part-of-Speech Tagging"

Future Work

• analyzing in more detail the interrelation between the two representations: word-level and character-level
• applying the proposed strategy to other NLP tasks
  o text chunking, named entity recognition, etc.