text mining names in ‘big data’ to recognize turkish migration trends · 2014-05-30 · text...


TEXT MINING NAMES IN ‘BIG DATA’ TO RECOGNIZE TURKISH MIGRATION TRENDS

NamSor Applied Onomastics


2014-05-30

Names Data Mining is just a Tool

Zeynep Değirmencioğlu

Şükrü Kaya

Şükrü Saracoğlu

Elian Carsenat

Hüseyin Yıldız

Mahmut Yıldırım

Fatih Öztürk

Mehmet Bölükbaşı

Mehmet Yılmaz

Elif Yıldırım

Ahmet Yıldırım

Mustafa Yücedağ

Mustafa Uzunyılmaz

Fatih Kılıç

Fatih Yılmaz

Murat Yıldırım

Hüseyin Kılıç

Oğuzhan Yıldız

Mevlüt Çavuşoğlu

… (Source: Freebase)

What’s in a name? What’s a name?

Elian Carsenat

@ElianCarsenat (Twitter)

elian.carsenat@namsor.com

elian.carsenat@sfr.fr

tioulpanov (Skype)

NamSor.com

Onomastics = the science of proper names

Onoma != Residence != Nationality

Source: OECD

NamSor sorts names: functions, use cases

1. Name Linguistic Classification

2. Name Transliteration & Matching

3. Named Entity Extraction, Parsing

Multilingual Text Mining

Control Watch Lists

Social Networks Analytics

Geodemographics

NamSor supervised learning

FN LN

Mette Andersen

Lene Andersson

Eva Arndt-Riise

Heidi Astrup

Mie Augustesen

Margot Bærentzen

Louise Bager Nørgaard

Marie Bagger Rasmussen

Yutta Barding

Ulla Barding-Poulsen

FN LN

Xian Dongmei

Zheng Dongmei

Jin Dongxiang

Xu Dongxiang

Li Dongxiao

Qin Dongya

Li Dongying

Han Duan

Li Duihong

Jiang Fan

Training set : Athletes

Step 1 – Learn stereotypes

bitao gong

biwang jiang

birgitta agerberth

birgitte l. eriksen

bitao gong

bitten thorengaard

biwang Jiang

birgitta agerberth

birgitte l. eriksen

bitten thorengaard

Data set : Inventors

Step 2 – Classify
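The two steps above can be sketched in a few lines of Python. This is only an illustration, not NamSor's actual method, which the slides do not describe: it swaps in a plain character-trigram naive Bayes scorer. The training names are the ones listed on the slide; the labels "Denmark" and "China" and the function names `trigrams`, `learn` and `classify` are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Training names copied from the slide. The labels "Denmark" and "China"
# are assumptions: the slide only says "Training set : Athletes".
TRAINING = {
    "Denmark": [
        "Mette Andersen", "Lene Andersson", "Eva Arndt-Riise", "Heidi Astrup",
        "Mie Augustesen", "Margot Bærentzen", "Louise Bager Nørgaard",
        "Marie Bagger Rasmussen", "Yutta Barding", "Ulla Barding-Poulsen",
    ],
    "China": [
        "Xian Dongmei", "Zheng Dongmei", "Jin Dongxiang", "Xu Dongxiang",
        "Li Dongxiao", "Qin Dongya", "Li Dongying", "Han Duan",
        "Li Duihong", "Jiang Fan",
    ],
}


def trigrams(name):
    """Character trigrams of each word, padded with ^ (start) and $ (end)."""
    grams = []
    for token in re.findall(r"[^\W\d_]+", name.lower()):
        padded = "^" + token + "$"
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def learn(training):
    """Step 1: count the trigrams seen in each class (the 'stereotypes')."""
    return {
        label: Counter(g for name in names for g in trigrams(name))
        for label, names in training.items()
    }


def classify(name, counts):
    """Step 2: return the class with the highest smoothed log-likelihood.

    Trigrams never seen in training are ignored. Class priors are left out
    because both training lists have the same size.
    """
    vocab = set().union(*counts.values())
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(
            math.log((c[g] + 1) / (total + len(vocab)))
            for g in trigrams(name)
            if g in vocab
        )
    return max(scores, key=scores.get)


MODEL = learn(TRAINING)
for inventor in ["bitao gong", "biwang jiang", "birgitta agerberth",
                 "birgitte l. eriksen", "bitten thorengaard"]:
    print(inventor, "->", classify(inventor, MODEL))
```

With twenty training names the model is a toy, but it shows why sub-word patterns such as "ong" or "sen" are enough to separate the inventor names on the slide.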

Accuracy is measurable: ~80%

The very first backtesting, on the onomastics of 150,000 Olympic Games athletes

Country       Total   Perf
Japan          3794    97%
Mongolia        260    93%
Greece         1576    92%
Lithuania       262    89%
Italy          4150    89%
Poland         2818    88%
South Korea    2180    87%

Breakdown by recognized onoma (top five per country):

Japan:        Japan 3686, Indonesia 4, Sri Lanka 3, Nigeria 3, Congo (B) 3
Mongolia:     Mongolia 243, Iraq 2, Japan 1, Mali 1, Kazakhstan 1
Greece:       Greece 1444, Italy 14, Georgia 6, Romania 5, Great Britain 5
Lithuania:    Lithuania 234, Namibia 3, Greece 3, Latvia 3, Russia 2
Italy:        Italy 3675, Spain 81, Portugal 80, France 29, Austria 26
Poland:       Poland 2486, Czechoslovakia 46, Czech Republic 38, Slovakia 34, Austria 22
South Korea:  South Korea 1901, North Korea 209, Chinese Taipei 10, Equatorial Guinea 6, China 5
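As a quick check of how the PERF column relates to the counts, the sketch below divides, for each country, the number of athletes whose recognized onoma matches their country by the country total. Reading PERF this way is an interpretation, not something the slide states; the numbers are copied from the slide and the function name `perf` is hypothetical.

```python
# (country, total athletes, athletes whose onoma matches the country),
# all copied from the slide.
ROWS = [
    ("Japan", 3794, 3686),
    ("Mongolia", 260, 243),
    ("Greece", 1576, 1444),
    ("Lithuania", 262, 234),
    ("Italy", 4150, 3675),
    ("Poland", 2818, 2486),
    ("South Korea", 2180, 1901),
]


def perf(rows):
    """Share of correctly recognized names per country, as a rounded percent."""
    return {country: f"{correct / total:.0%}" for country, total, correct in rows}


for country, share in perf(ROWS).items():
    print(f"{country:<12} {share}")
```

Under this reading the computed shares agree with the PERF column for all seven countries.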

European athletes (excl. Anglo & Latin): breakdown accuracy 84%

Ex-Yugoslavia athletes: breakdown accuracy 75%

Decrypting identity across space/time: India Geodemographics (1914)

Source: Commonwealth WWI Casualties

Unsupervised learning is fine-grained: Country/Region,…

Ex. Russian Federation

In progress: Syrian names (backtesting)

Onoma                   Count
Syria                     201
Saudi Arabia               20
Iraq                        8
Kuwait                      4
United Arab Emirates        3
Egypt                       3
Qatar                       2
Bahrain                     2
Sudan                       2
Lebanon                     2
Algeria                     1
Oman                        1
Grand Total               249


الحريري طاهر

سليمان العيدة عبدالغفار

شحادة عبدالغفار

الأسعد قاسم

حموده مؤمن

الجراد محمد مفلح

الحروب نزار

سليمان العيدة نزار

الحراكي أسامة

الصغير أنس

الهبول خالد

عبد الواحد وفيق

يونس إسراء

نزهة رشا

وهبة محمد زكريا

بركات كمال

اللو محمد عيد

[…]

Syrian names are recognized at ~80%.

The other names may in fact be non-Syrian, or generic to the Arab world.

What can you dig with this tool?

Mining 5M names to recognize gender, with a breakdown by nationality/likely origin

Mining 1M names to map Diasporas

Source: Twitter

Mining 3M Geo-Tweets: population flows on Twitter

Source          Target  Type       Id  Onoma          Weight
United Kingdom  France  Directed   16  Great Britain      37
Spain           France  Directed   55  Spain              14
United States   France  Directed   75  Great Britain      12
Turkey          France  Directed   79  Turkey             11
Brazil          France  Directed   87  Portugal           10
United Kingdom  France  Directed  112  Ireland             9
Italy           France  Directed  152  Italy               7
Switzerland     France  Directed  226  France              5
Belgium         France  Directed  247  France              5
United Kingdom  France  Directed  258  France              5
Mexico          France  Directed  287  Spain               4
Ireland         France  Directed  317  Great Britain       4
United Kingdom  France  Directed  333  Italy               4
United States   France  Directed  375  France              4

Source: Twitter
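The table above is a directed edge list, so the flows into France can be summed either by source country or by onoma. The sketch below does both with the rows shown on the slide. It assumes that Weight is an additive count (for example of Twitter users), which the slide does not state, and it drops the Type and Id columns; the function name `total_weight` is hypothetical.

```python
from collections import defaultdict

# (source, target, onoma, weight) rows copied from the slide.
EDGES = [
    ("United Kingdom", "France", "Great Britain", 37),
    ("Spain", "France", "Spain", 14),
    ("United States", "France", "Great Britain", 12),
    ("Turkey", "France", "Turkey", 11),
    ("Brazil", "France", "Portugal", 10),
    ("United Kingdom", "France", "Ireland", 9),
    ("Italy", "France", "Italy", 7),
    ("Switzerland", "France", "France", 5),
    ("Belgium", "France", "France", 5),
    ("United Kingdom", "France", "France", 5),
    ("Mexico", "France", "Spain", 4),
    ("Ireland", "France", "Great Britain", 4),
    ("United Kingdom", "France", "Italy", 4),
    ("United States", "France", "France", 4),
]


def total_weight(edges, key_index):
    """Sum edge weights grouped by one column (0 = source, 2 = onoma)."""
    totals = defaultdict(int)
    for edge in edges:
        totals[edge[key_index]] += edge[3]
    return dict(totals)


by_source = total_weight(EDGES, 0)
by_onoma = total_weight(EDGES, 2)

# Largest flows first.
for onoma, weight in sorted(by_onoma.items(), key=lambda kv: -kv[1]):
    print(f"{onoma:<14} {weight}")
```

Grouping by onoma rather than by source separates, for instance, the British-sounding names arriving from the United States from the French-sounding ones arriving from the same country.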

Mining 150k names in patents to see where the Turkish ‘brain juice’ flows

Mining names: a word of caution

Can ‘Big Data’ answer any question?

Trash in, gold out? Yes, to some extent

Beware of biases induced by the data source itself

Data access limitations / privacy issues

Open Data vs. Free APIs vs. Commercial Databases

Still, tools make the impossible possible

Originating FDI leads

NamSor™ announces FDI Magnet, a new offering for Investment Promotion Agencies.

What is the idea behind it? “As recently as 1986 Ireland was one of the poorest countries in the European Union (EU), but today it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have attracted huge amounts of money from America – due largely to a century of personal and familial ties – and they have used this money to build factories”.

It is a successful approach, which Milda Darguzaite, the Managing Director of Invest Lithuania, considers relevant for her own country. With three million people living in Lithuania and nearly one million people of Lithuanian origin living abroad, there are a good many personal and familial ties to be leveraged to attract new investment projects to the country. NamSor name recognition software helped discover those ties.

Recognizing names and their origin in global professional databases allows Investment Promotion Agencies to identify potentially interesting high-profile contacts in different countries and industrial sectors, and to reach out to them. Another method to accelerate the origination of new leads is to better understand and leverage the existing network of foreign businessmen in the country itself.

NamSor™ filters data down from millions of meaningless elements to a few dozen actionable names.

Domas Girtavicius, a Senior Consultant at Invest Lithuania, said "we were impressed by the accuracy of the name recognition software: it reliably predicts the country of origin and the number of false positives is fully manageable". Elian Carsenat, the founder of NamSor™, said "searching for names in the Big Data is like seeking a gold needle in a haystack: doable once the right tool exists".

Conclusions

We recognize names in any language, any place, any database; we can classify and we can sort

An onomastic class is no ‘hard fact’ like a place of birth or a nationality, but it is accurate and fine-grained

As a statistics tool, it might be debatable. But as a data mining tool, it is sharp, simple and efficient: it can help find research directions and discover trends

We see use cases in Migration Research; Education & Skills; Labour & Social Affairs; Territorial Development/FDI; Science & Innovation

Thank you!

http://fdimagnet.com/

http://namsor.com/


July 2013, Embassy of Lithuania in Paris

elian.carsenat@namsor.com

+33 6 52 77 99 07

Twitter @NamsSor_com
