Discovering Insights through Text Mining: How and Why to Analyze Large Amounts of Text
Andrea Villanes - @andreagrr
Big Data Analytics Summit 2016
Lima - Perú
Big Data Analytics Summit Perú
About me
Education
Experience
Other
Every minute…
300,000 tweets
2.5 million posts
200 million messages
#ProcastinandoAndo
*The Data Explosion in 2014 Minute by Minute – Infographic
Text Mining
“…find interesting regularities in large textual datasets” (Fayyad)
…where "interesting" means non-trivial, hidden, unknown, and potentially useful.
Text Mining process
10   0   8   9   0   3   3   0
15   5   6   0   9  11   0   1
 0   2  25  12   0   9  10   0
 1  11   0   5   5   0   5  21
 0   6  12   2   0   2   5   3
19   8   2  13   0   0  10  14
15  12   5   3   8   9   5   0
 5   0  11   0  10   0   5   8
Term axis labels: Term 1, Term 2, Term 3, Term 4, ..., Term n-2, Term n-1, Term n
Transforming text into a matrix
Data collection
Text parsing
Term vector weighting
Data collection
Web crawling: collecting data from the web
APIs: Twitter, Trip Advisor, Facebook
CSV files: surveys, emails, open-ended responses, etc.
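As a rough illustration of the CSV route, the sketch below loads free-text answers from a survey export using only the Python standard library. The sample data, the column names (`respondent_id`, `response`), and the helper `load_responses` are all hypothetical and stand in for whatever a real export contains.

```python
import csv
import io

# Hypothetical CSV export of a survey; the column names are assumptions.
SAMPLE_CSV = """respondent_id,response
1,"The lid is hard to open, and it cuts."
2,The screen is unreadable in the sun.
3,
"""


def load_responses(handle, text_column="response"):
    """Return the non-empty free-text answers from an open CSV file handle."""
    reader = csv.DictReader(handle)
    return [row[text_column].strip() for row in reader if row[text_column].strip()]


# io.StringIO stands in for open("survey.csv", newline="", encoding="utf-8").
responses = load_responses(io.StringIO(SAMPLE_CSV))
print(len(responses), "responses loaded")
```

Skipping empty answers at load time keeps blank rows out of the later parsing and weighting steps.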
Text parsing
Text cleaning: remove unnecessary words and eliminate redundancy
Stop words: remove common words that contribute nothing to discovering the context (in Spanish: el, la, de, los, y, etc.)
Stemming: reduce words to their root, e.g. Abrir, Abrirlo, Abririas, Abriras → Abrir
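The two parsing steps can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the stop-word set holds only the slide's examples, and `toy_stem` with its suffix list is a hypothetical stand-in for a real stemmer (such as the Snowball stemmers shipped with nltk), not an implementation of one.

```python
import re

# Tiny illustrative stop-word list (the slide's examples); real lists are much longer.
STOP_WORDS = {"el", "la", "de", "los", "y"}

# Toy suffix list, an assumption made for this sketch only.
SUFFIXES = ("ias", "lo", "as")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-záéíóúñü]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token, root_min_len=5):
    """Strip the first matching suffix, keeping at least root_min_len characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= root_min_len:
            return token[: -len(suffix)]
    return token


tokens = tokenize("Abrir la puerta y abrirlo de nuevo")
content = remove_stop_words(tokens)
stems = [toy_stem(t) for t in content]
print(stems)
```

After stemming, "abrir" and "abrirlo" collapse into the same term, so they are counted together when the term matrix is built.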
Term vector weighting
Term Frequency–Inverse Document Frequency (TF-IDF)
Each word is weighted by its frequency within a document (term frequency) and by its frequency across the whole document collection (document frequency). A word scores high when it is frequent in one document but rare across the collection.
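A minimal sketch of the weighting, assuming one common textbook variant: term frequency is the count divided by the document length, and inverse document frequency is log(N / df). Libraries such as scikit-learn use smoothed variants by default, so their numbers will differ. The function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # tf = share of the document taken by the term; idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-parsed documents.
docs = [
    ["lid", "hard", "open"],
    ["screen", "hard", "read"],
    ["lid", "cuts", "lid"],
]
weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this variant a term that occurs in every document gets weight 0, which is why TF-IDF tends to push down words that survive stop-word removal but still carry little information.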
Final product
[Figure: the term matrix shown earlier, labeled Term 1, Term 2, Term 3, Term 4, ..., Term n-2, Term n-1, Term n]
Which algorithms can we apply to this matrix?
• Clustering (segmentation)
• Classification
• Word association
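Most of these algorithms start by comparing rows of the matrix. The sketch below computes cosine similarity between rows and finds each row's nearest neighbour, which is the basic operation behind many clustering methods. It reuses the numbers from the slide's example matrix; treating each row as a document and each column as a term is an assumption of this sketch, and `most_similar` is a hypothetical helper.

```python
import math

# Counts from the slide's example matrix. Treating each row as one document
# and each column as one term is an assumption of this sketch.
MATRIX = [
    [10, 0, 8, 9, 0, 3, 3, 0],
    [15, 5, 6, 0, 9, 11, 0, 1],
    [0, 2, 25, 12, 0, 9, 10, 0],
    [1, 11, 0, 5, 5, 0, 5, 21],
    [0, 6, 12, 2, 0, 2, 5, 3],
    [19, 8, 2, 13, 0, 0, 10, 14],
    [15, 12, 5, 3, 8, 9, 5, 0],
    [5, 0, 11, 0, 10, 0, 5, 8],
]


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, rows):
    """Index of the row most similar to rows[index], excluding the row itself."""
    candidates = [i for i in range(len(rows)) if i != index]
    return max(candidates, key=lambda i: cosine_similarity(rows[index], rows[i]))


for i in range(len(MATRIX)):
    print("row", i, "is closest to row", most_similar(i, MATRIX))
```

Cosine similarity ignores vector length, so a long document and a short one that use the same terms in the same proportions still come out as similar.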
Tools
• SAS Enterprise Miner (Text Miner)
  • Text parsing, term weighting, LSA, models
• R: Needed <- c("tm", "SnowballC", "RColorBrewer", "ggplot2", "wordcloud", "biclust", "cluster", "igraph", "fpc")
  • Text parsing, term weighting, LSA, LDA, NMF, models
• Python: scikit-learn, nltk, numpy, pandas, beautiful soup
  • Web crawling, text parsing, term weighting, LSA, LDA, NMF, models
Examples of insights using Text Mining
Text Mining applications #1
1. Analyzing social media data: Facebook & Twitter
Analyzing Facebook data - Clustering
[Figure: four clusters labeled Family lovers, Night lovers, Vacation lovers, Beach lovers; values shown: 549, 536, 210, 160; date label: May 28th]
Analyzing Twitter data - Prediction
Tell me what you tweet, and I'll tell you who you are
Classes: Mom, Teenager, Geek

Class      Correctly predicted   Incorrectly predicted
Teenager   86%                   14%
Geek       28%                   72%
Mom        78%                   22%
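The slides do not say which model produced these predictions. As one possible baseline, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on a handful of invented tweets. The training data, the function names `train` and `predict`, and the choice of Naive Bayes itself are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the label with the highest log-probability for the given tokens."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy tweets, invented for illustration only.
training = [
    ("homework exam school party".split(), "Teenager"),
    ("school prom friends".split(), "Teenager"),
    ("kids diapers school lunch".split(), "Mom"),
    ("baby kids bedtime".split(), "Mom"),
    ("python linux code".split(), "Geek"),
    ("code gadgets gaming".split(), "Geek"),
]
model = train(training)
print(predict("new linux code".split(), model))
```

Per-class figures like those on the slide would come from running such a classifier on held-out tweets and counting, for each true class, the share of tweets labeled correctly.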
Text Mining applications #2
2. Analyzing open-ended survey responses about complaints with everyday products
Dataset description
• Open-ended responses to a survey question:
“Describe everyday usability problems in any product”
• Number of observations = 384
• Average number of words per response = 182
Sample responses
“A poor design that I have experienced in my everyday life is
the safety lids on fruit cups. The lids on the fruit cups have
this clip on the top that you are supposed to be able to open
with ease. Well when I attempt to open the product either my
fruit spills out of the can from my hard tugging at the pin or I
get cut from the aluminum lid. I really hope parents don't let
their kids open these products by themselves because they
could possible get cut. I believe if the product is to be easy to
open let it be easily accessible to everyone not just grown
ups.”
“A bad design I have encountered is rooms
with light switches on the wall as you walk in
but they do not have a light fixture on the
ceiling. That is like having a door handle on
a wall. no point.”
“ATM machines have snazzy little computer screen printouts
but the problem is that if the sun is shining at your back while
using one the glare makes the screen unreadable. They should
position ATM machines North to South or give you shade.”
“An example of something that drives me crazy are
washers and dryers. I really think they should be
standardized. I get used to the way mine operate, then
when I have to use someone else's washer and dryer, I
have to stand there forever trying to figure out which
button starts the dryer. A good way to solve this would be
to standardize the layout of the controls, so that
manufacturers could still add fancy options, but
consumers would still know which control did what.”
Examples
Analyzing data using Enterprise Miner
Text Mining applications #3
3. Dengue detection through newspapers
Analyzing text over time
Known to transmit:
• Dengue
• Yellow fever
• Chikungunya
• Zika
Probability of dengue occurrence in 2010
Source: Bhatt, Samir et al. “The Global Distribution and Burden of Dengue.” Nature 496.7446 (2013): 504–507. PMC.
[Figure: Number of Articles by Month. x-axis: Month; y-axis: Number of Articles (0 to 700). Series: Total, Prevention (36%), Reported Cases (33%), Politics (11%)]
How to get started with Text Mining?
“Text Mining: Predictive Methods for Analyzing Unstructured Information”, Sholom M. Weiss and Nitin Indurkhya
“Web Scraping with Python”, Ryan Mitchell
“Natural Language Processing with Python”, Bird and Klein
Skills needed to be a data scientist
Theory: Algorithms, Statistics, Linear Algebra
Tools
Visualization
Communication: Written, Oral
Thank you!
www.andreavillanes.com
@andreagrr
www.MentorMeInfo.com
https://www.facebook.com/MentorMeInfo