Page 1: TEXT MINING NAMES IN ‘BIG DATA’ TO RECOGNIZE TURKISH MIGRATION TRENDS · 2014-05-30 · TEXT MINING NAMES IN ‘BIG DATA’ TO RECOGNIZE TURKISH MIGRATION TRENDS NamSor Applied

TEXT MINING NAMES IN ‘BIG DATA’ TO RECOGNIZE TURKISH MIGRATION TRENDS

NamSor Applied Onomastics

2014-05-30

Page 2:

Names Data Mining is just a Tool

Zeynep Değirmencioğlu

Şükrü Kaya

Şükrü Saracoğlu

Elian Carsenat

Hüseyin Yıldız

Mahmut Yıldırım

Fatih Öztürk

Mehmet Bölükbaşı

Mehmet Yılmaz

Elif Yıldırım

Ahmet Yıldırım

Mustafa Yücedağ

Mustafa Uzunyılmaz

Fatih Kılıç

Fatih Yılmaz

Murat Yıldırım

Hüseyin Kılıç

Oğuzhan Yıldız

Mevlüt Çavuşoğlu

… (Source: Freebase)

Page 3:

What’s in a name? What’s a name?

Elian Carsenat

@ElianCarsenat (Twitter)

[email protected]

[email protected]

tioulpanov (Skype)

NamSor.com

Onomastics = the science of proper names

Page 4:

Onoma != Residence != Nationality

Source: OECD

Page 5:

NamSor sorts names: functions, use cases

Functions:

1. Name Ling. Classification

2. Name Transliteration & Matching

3. Named Entity Extraction, Parsing

Use cases:

Multilingual Text Mining

Control Watch Lists

Social Networks Analytics

Geodemographics

Page 6:

NamSor supervised learning

FN LN

Mette Andersen

Lene Andersson

Eva Arndt-Riise

Heidi Astrup

Mie Augustesen

Margot Bærentzen

Louise Bager Nørgaard

Marie Bagger Rasmussen

Yutta Barding

Ulla Barding-Poulsen

FN LN

Xian Dongmei

Zheng Dongmei

Jin Dongxiang

Xu Dongxiang

Li Dongxiao

Qin Dongya

Li Dongying

Han Duan

Li Duihong

Jiang Fan

Training set: Athletes

Step 1 – Learn stereotypes

Data set: Inventors

Step 2 – Classify

Input names (alphabetical):

birgitta agerberth

birgitte l. eriksen

bitao gong

bitten thorengaard

biwang Jiang

Sorted output, first group:

bitao gong

biwang jiang

Sorted output, second group:

birgitta agerberth

birgitte l. eriksen

bitten thorengaard
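The two steps on this slide (learn from labelled athlete names, then classify inventor names) can be illustrated with a toy classifier. This is only a minimal sketch, not NamSor's actual method, which the slides do not describe. It assumes a naive Bayes model over character trigrams with add-one smoothing and equal class priors. The labels `danish` and `chinese` are assumptions added for illustration, since the slide shows the two training tables without naming them.

```python
import math
from collections import Counter

# Training names copied from the slide; the two labels are assumptions.
TRAINING = {
    "danish": [
        "Mette Andersen", "Lene Andersson", "Eva Arndt-Riise",
        "Heidi Astrup", "Mie Augustesen", "Margot Bærentzen",
        "Louise Bager Nørgaard", "Marie Bagger Rasmussen",
        "Yutta Barding", "Ulla Barding-Poulsen",
    ],
    "chinese": [
        "Xian Dongmei", "Zheng Dongmei", "Jin Dongxiang", "Xu Dongxiang",
        "Li Dongxiao", "Qin Dongya", "Li Dongying", "Han Duan",
        "Li Duihong", "Jiang Fan",
    ],
}


def trigrams(name):
    """Return the character trigrams of the lowercased, space-padded name."""
    padded = f" {name.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(labelled_names):
    """Step 1: count trigrams per label (the learned 'stereotypes')."""
    counts = {
        label: Counter(g for name in names for g in trigrams(name))
        for label, names in labelled_names.items()
    }
    vocabulary = set().union(*counts.values())
    return counts, len(vocabulary)


def classify(name, model):
    """Step 2: return the label with the highest smoothed log-likelihood."""
    counts, vocab_size = model
    scores = {}
    for label, grams in counts.items():
        total = sum(grams.values())
        # Add-one smoothing keeps unseen trigrams from zeroing the score.
        scores[label] = sum(
            math.log((grams[g] + 1) / (total + vocab_size))
            for g in trigrams(name)
        )
    return max(scores, key=scores.get)


model = train(TRAINING)
for inventor in ["bitao gong", "biwang jiang", "bitten thorengaard"]:
    print(inventor, "->", classify(inventor, model))
```

With only ten names per class the model is far too small for real use; the point is the shape of the pipeline, in which labelled names go in and a sorting function for unlabelled names comes out.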

Page 7:

Accuracy is measurable: ~80%

The very first backtesting on the onomastics of 150,000 Olympic Games athletes

Country       Total   Perf
Japan          3794    97%
Mongolia        260    93%
Greece         1576    92%
Lithuania       262    89%
Italy          4150    89%
Poland         2818    88%
South Korea    2180    87%

Top five onomastic classes per country (number of athletes):

Japan: Japan 3686, Indonesia 4, Sri Lanka 3, Nigeria 3, Congo (B) 3

Mongolia: Mongolia 243, Iraq 2, Japan 1, Mali 1, Kazakhstan 1

Greece: Greece 1444, Italy 14, Georgia 6, Romania 5, Great Britain 5

Lithuania: Lithuania 234, Namibia 3, Greece 3, Latvia 3, Russia 2

Italy: Italy 3675, Spain 81, Portugal 80, France 29, Austria 26

Poland: Poland 2486, Czechoslovakia 46, Czech Republic 38, Slovakia 34, Austria 22

South Korea: South Korea 1901, North Korea 209, Chinese Taipei 10, Equatorial Guinea 6, China 5

Euro athletes (excl. Anglo & Latin): breakdown accuracy 84%

Ex-Yugoslavia athletes: breakdown accuracy 75%
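The PERF figures on this slide can be reproduced from the counts shown. The sketch below assumes that PERF is the number of athletes whose name was classified under their own country, divided by the country total, rounded to a whole percent. That reading is an assumption, but it matches every row on the slide.

```python
# (total athletes, athletes classified under their own country),
# copied from the slide.
BACKTEST = {
    "Japan": (3794, 3686),
    "Mongolia": (260, 243),
    "Greece": (1576, 1444),
    "Lithuania": (262, 234),
    "Italy": (4150, 3675),
    "Poland": (2818, 2486),
    "South Korea": (2180, 1901),
}


def accuracy_percent(total, correct):
    """Share of correctly classified names, as a rounded percentage."""
    return round(100 * correct / total)


results = {
    country: accuracy_percent(total, correct)
    for country, (total, correct) in BACKTEST.items()
}
for country, percent in results.items():
    print(f"{country}: {percent}%")
```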

Page 8:

Decrypting identity across space/time:

India Geodemographics (1914)

Source: Commonwealth WWI Casualties

Page 9:

Unsupervised learning is fine-grained: Country/Region,…

Ex. Russian Federation

Page 10:

In progress: Syrian names (backtesting)

Onoma                  Count
Syria                    201
Saudi Arabia              20
Iraq                       8
Kuwait                     4
United Arab Emirates       3
Egypt                      3
Qatar                      2
Bahrain                    2
Sudan                      2
Lebanon                    2
Algeria                    1
Oman                       1
Grand Total              249

الحريري طاهر

سليمان العيدة عبدالغفار

شحادة عبدالغفار

الأسعد قاسم

حموده مؤمن

الجراد محمد مفلح

الحروب نزار

سليمان العيدة نزار

الحراكي أسامة

الصغير أنس

الهبول خالد

عبد الواحد وفيق

يونس إسراء

نزهة رشا

وهبة محمد زكريا

بركات كمال

اللو محمد عيد

[…]

Syrian names are recognized at ~80% (201 of 249).

The other names may in fact be non-Syrian or generic to the Arab world.

Page 11:

What can you dig with this tool?

Page 12:

Mining 5M names to recognize Gender, breakdown by nationality/likely origin

Page 13:

Mining 1M names to map Diasporas

Source: Twitter

Page 14:

Mining 3M Geo-Tweets

Population flows on Twitter

Source          Target  Type       Id  Onoma          Weight
United Kingdom  France  Directed   16  Great Britain      37
Spain           France  Directed   55  Spain              14
United States   France  Directed   75  Great Britain      12
Turkey          France  Directed   79  Turkey             11
Brazil          France  Directed   87  Portugal           10
United Kingdom  France  Directed  112  Ireland             9
Italy           France  Directed  152  Italy               7
Switzerland     France  Directed  226  France              5
Belgium         France  Directed  247  France              5
United Kingdom  France  Directed  258  France              5
Mexico          France  Directed  287  Spain               4
Ireland         France  Directed  317  Great Britain       4
United Kingdom  France  Directed  333  Italy               4
United States   France  Directed  375  France              4

Source: Twitter
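The table above reads as a directed, weighted edge list: each row is a flow from a source country to France, tagged with the onomastic class of the names involved. As a minimal sketch, the rows can be summed by source country or by onoma. The reading of Weight as a count of observed flows is an assumption, and the helper name `total_by` is hypothetical.

```python
from collections import defaultdict

# (source, target, onoma, weight) rows copied from the slide.
FLOWS = [
    ("United Kingdom", "France", "Great Britain", 37),
    ("Spain", "France", "Spain", 14),
    ("United States", "France", "Great Britain", 12),
    ("Turkey", "France", "Turkey", 11),
    ("Brazil", "France", "Portugal", 10),
    ("United Kingdom", "France", "Ireland", 9),
    ("Italy", "France", "Italy", 7),
    ("Switzerland", "France", "France", 5),
    ("Belgium", "France", "France", 5),
    ("United Kingdom", "France", "France", 5),
    ("Mexico", "France", "Spain", 4),
    ("Ireland", "France", "Great Britain", 4),
    ("United Kingdom", "France", "Italy", 4),
    ("United States", "France", "France", 4),
]


def total_by(rows, column):
    """Sum the weight (last field) of each row, grouped by one column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[3]
    return dict(totals)


by_source = total_by(FLOWS, 0)  # flows grouped by country of departure
by_onoma = total_by(FLOWS, 2)   # flows grouped by onomastic class

for onoma, weight in sorted(by_onoma.items(), key=lambda kv: -kv[1]):
    print(f"{onoma}: {weight}")
```

Grouping by onoma rather than by source separates, for example, French-named users arriving from Switzerland or Belgium from Turkish-named users arriving from Turkey.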

Page 15:

Mining 150k names in Patents to see where the Turkish ‘brain juice’ flows

Page 16:

Mining names: a word of caution

Page 17:

Can ‘Big Data’ answer any question?

Trash in, Gold out? Yes, to some extent

Beware of biases induced by the data source itself

Data access limitations / privacy issues

Open Data vs. Free APIs vs. Commercial Databases

Page 18:

Still, tools make possible the impossible

Page 19:

Originating FDI leads

NamSor™ announces FDI Magnet, a new offering for Investment Promotion Agencies.

What is the idea behind it? “As recently as 1986 Ireland was one of the poorest countries in the European Union (EU), but today it is one of the richest. The engine of this new Irish prosperity has been Foreign Direct Investment (FDI). [Between 1986 and 2002], the Irish have done almost everything right. They have attracted huge amounts of money from America – due largely to a century of personal and familial ties – and they have used this money to build factories”.

This is a successful approach, which Milda Darguzaite, the Managing Director of Invest Lithuania, considers relevant for her own country. With three million people living in Lithuania and nearly one million people of Lithuanian origin living abroad, there are a good many personal and familial ties to be leveraged to attract new investment projects to the country. NamSor name recognition software helped discover those ties.

Recognizing names and their origin in global professional databases allows Investment Promotion Agencies to identify potentially interesting high-profile contacts in different countries and industrial sectors and reach out to them. Another method to accelerate the origination of new leads is to better understand and leverage the existing network of foreign businessmen in the country itself.

NamSor™ filters data from millions of meaningless elements to a few dozen actionable names.

Domas Girtavicius, a senior consultant at Invest Lithuania, said "we were impressed by the accuracy of the name recognition software: it reliably predicts the country of origin and the number of false positives is fully manageable". Elian Carsenat, the founder of NamSor™, said "searching for names in the Big Data is like seeking a gold needle in a haystack: doable once the right tool exists".

Page 20:

Conclusions

We recognize names in any language, any place, any database; we can classify and we can sort

An onomastic class is no ‘hard fact’ like a place of birth or a nationality, but it is accurate and fine-grained

As a statistical tool, it might be debatable. But as a data mining tool, it’s sharp, simple and efficient: it can help find research directions and discover trends

We see use cases in Migration research; Education & Skills; Labour & Social Affairs; Territorial Development/FDI; Science & Innovation

Page 21:

Thank you!

http://fdimagnet.com/

http://namsor.com/

July 2013, Embassy of Lithuania in Paris

[email protected]

+33 6 52 77 99 07

Twitter @NamsSor_com