big data — your new best friend
Post on 12-Apr-2017
Big Data: Your New Best Friend
Reuven M. Lerner, PhD
MegaComm 2016 • February 18th, 2016
1 Big Data.key - February 18, 2016
Who am I?
• Long-time programmer, consultant, trainer
• Python, Git, PostgreSQL, Ruby
• Linux Journal columnist
My stuff
• Newsletter: http://lerner.co.il/newsletter
• Blog: http://blog.lerner.co.il/
• Daily Tech Video: http://dailytechvideo.com/
• Or @DailyTechVideo on Twitter
• Mandarin Weekly: http://MandarinWeekly.com
• Or @MandarinWeekly on Twitter
Elections!
• Israel had elections last year
• The United States has elections this year
• Rumor has it, the world contains some other countries, many of which also hold elections
Polls
• Before an election, politicians, reporters, and political junkies (like me) look at the polls.
• We want to know who is ahead, and who is behind
• The politicians want to know which groups like (and dislike) them, so that they can focus their rhetoric and campaigning
Polls are statistical models
• Polls use math to predict the likelihood of a particular outcome, based on a number of inputs
• Models are toy versions of reality
• They allow us to explore and understand reality, and should bear some connection to it
• But there will always be a distinction between a model and the real world
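To make the "toy version of reality" idea concrete, here is a minimal sketch of a poll as a model. It is not from the talk: the 52% "true" support figure, the sample size, and the function name poll_estimate are all made up for illustration, and the margin of error uses the usual normal approximation.

```python
import math
import random

def poll_estimate(responses):
    """Return (share, margin) for a list of 0/1 answers, at roughly 95% confidence."""
    n = len(responses)
    share = sum(responses) / n
    # Normal-approximation margin of error for a proportion.
    margin = 1.96 * math.sqrt(share * (1 - share) / n)
    return share, margin

# Hypothetical "reality": 52% of all voters support candidate A.
random.seed(0)
TRUE_SUPPORT = 0.52

# The poll only sees 1,000 randomly chosen voters, not the whole population.
sample = [1 if random.random() < TRUE_SUPPORT else 0 for _ in range(1000)]

share, margin = poll_estimate(sample)
print(f"Poll says {share:.1%} +/- {margin:.1%} (reality: {TRUE_SUPPORT:.0%})")
```

The gap between the printed estimate and the true figure is the distinction between the model and the real world.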
Models are important!
• They allow us to explore, understand the world
• They enable us to make predictions
• They reduce costs, and allow us to do things that are otherwise impossible or unethical
• 2013 Nobel Prize in Chemistry: awarded to Martin Karplus, Michael Levitt, and Arieh Warshel for developing multiscale computer models of complex chemical systems
Testing models
• Elections are unusual: You only have one shot at testing your model to see if it’s accurate
• But in your business, you can create and test models every day, modifying the number, type, and weights of the inputs
Big data
• 90% of the data ever created was generated in the last two years (according to IBM):
• Writing, video, audio
• Travel, e-commerce, electricity use, phone calls
• Metadata, as well
• Maybe people aren’t just numbers… but given how often we’re quantified, we’re not that far away
But numbers are good (if you’re a computer)
• Modern computers can hold billions of them
• Store not only information about people, but their characteristics and traits, as well as dates and times
Your business
• When you make business decisions, what factors are you considering?
• Are you trying to check all of the possible correlations, across all of the data?
• Or are you sampling, and hoping that your sample is an accurate and representative one?
• Your business is now collecting lots and lots of data
• Who is buying your products and services?
• How often do they visit your Web site?
• Which of your e-mail messages do they open?
• What do they buy?
• How old are they, and where do they live?
Enter big data!
Why “big” data?
• It sounds sophisticated and high-tech.
• There really is a lot of it.
• Often, there’s more than we can fit (or process) on a single computer
• But often, it’s not really that big
Enter data science
• Data scientists come up with ways to turn raw data into useful information
• They create and use models to find correlations among the many pieces of data you’re collecting
• They can help you use these correlations to improve your marketing, sales, and production
What is a data scientist?
A person employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making.
— Oxford Dictionary
More realistically…
Data scientist (noun): Person who is better at statistics than any software engineer and better at software engineering than any statistician.
— Josh Wills
Look for correlations
• Data scientists look for correlations
• Using those correlations, we know where we have been successful (and not)
• These can be interesting, useful, or crucial
• Being able to analyze lots of factors, and thus find correlations in them, allows our models to be more sophisticated — and also predictive
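As a rough illustration of what "finding a correlation" means in practice, the sketch below computes the Pearson correlation coefficient by hand. The two lists (e-mails opened and purchases per customer) are invented numbers, not data from the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one entry per customer.
emails_opened = [1, 3, 4, 6, 8, 9]
purchases = [0, 1, 1, 2, 3, 3]

r = pearson(emails_opened, purchases)
print(f"correlation between opens and purchases: {r:.2f}")
```

A value near 1 or -1 signals a strong linear relationship, and a value near 0 signals little or none. It says nothing about which factor causes the other.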
Spurious correlations
• http://tylervigen.com/spurious-correlations
Data scientists’ tools
• Programming languages + libraries
• Data sets
• Machine learning
• Distributed processing systems
What do data sets look like?
• Excel spreadsheets
• CSV files
• Multiple CSV files (e.g., separated by date)
• Databases you can clone — but this is rare
Cleaning the data
• Remove bad, incomplete data
• Remove data that isn’t relevant for the investigation you’re doing
• But don’t remove too much, ruining your data!
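A cleaning pass might look something like the sketch below. The CSV contents, the column names, and the clean_rows helper are all hypothetical; the point is only to show dropping incomplete or unparseable rows and ignoring a column (city) that this particular investigation does not need.

```python
import csv
import io

# Hypothetical raw export; in practice this would be a file on disk.
RAW_CSV = """name,age,city
Alice,34,Haifa
Bob,,Tel Aviv
Carol,not available,Jerusalem
Dan,51,Eilat
"""

def clean_rows(rows):
    """Keep only rows with a name and a numeric age; drop columns we don't need."""
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip()
        age_text = (row.get("age") or "").strip()
        if not name or not age_text:
            continue  # incomplete row
        try:
            age = int(age_text)
        except ValueError:
            continue  # bad data, such as "not available"
        cleaned.append({"name": name, "age": age})
    return cleaned

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
cleaned = clean_rows(rows)
print(f"kept {len(cleaned)} of {len(rows)} rows")
```

Here half of the rows are thrown away, which is exactly the trade-off to watch: stricter rules give cleaner data, but less of it.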
Machine learning
• The computer can learn to categorize things as well as humans do
• Then, when given new data, it can decide into which category to put the new item
Spam filters
• Spam filters use a simple form of machine learning
• Is a particular e-mail message spam?
• Check the contents, using a variety of factors
• If the factors make this document similar to other spam documents, then mark it as spam
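One simple way to implement "similar to other spam documents" is a naive Bayes classifier over words. The talk does not say which technique real filters use, so treat this as an assumed, minimal sketch; the training messages and the function names train and classify are made up.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs, where label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label whose training messages look most like this text."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from how common this label is, then add evidence word by word.
        score = math.log(label_counts[label] / total_messages)
        for word in text.lower().split():
            # Add-one smoothing so unseen words don't zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: messages a user has already marked.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
word_counts, label_counts = train(training)

print(classify("claim your free money", word_counts, label_counts))
print(classify("agenda for lunch", word_counts, label_counts))
```

Every message the user marks as junk or not junk becomes one more training pair, which is how such a filter improves over time.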
Aha!
• Wondering why e-mail from certain people always gets put into the “junk” e-mail box?
• Because those people send mail that looks (to the machine-learning system) too much like junk
• Mark the messages as not being junk, so your spam-control system can learn over time
Experience is important
• In people, learning is a matter of experience
• Machine learning is all about computers also gaining that experience
Models
• Machine learning employs many models
• Each model uses a different technique to train the computer to decide which category data should be put into
• Supervised vs. unsupervised learning
• The computer can then be given new data
Example: K nearest neighbors
• One common machine-learning algorithm finds the closest k (a number) items to a new piece of data
• We then hold an election: to which category do most of those k neighbors belong?
• Our new data point joins the majority category
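The neighbor "election" can be sketched in a few lines. The customer data below (visits per month, purchases per month) and the labels "casual" and "loyal" are invented for illustration, and distance is plain Euclidean distance, which is an assumption rather than a requirement of the algorithm.

```python
import math
from collections import Counter

def knn_classify(point, labeled_points, k=3):
    """Classify point by majority vote among its k nearest labeled neighbors."""
    # Sort the known points by straight-line distance to the new point.
    by_distance = sorted(labeled_points, key=lambda item: math.dist(point, item[0]))
    # The k closest neighbors each cast one vote for their own category.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical customers: (visits per month, purchases per month) -> category
customers = [
    ((1, 1), "casual"),
    ((1, 2), "casual"),
    ((2, 1), "casual"),
    ((8, 8), "loyal"),
    ((9, 8), "loyal"),
    ((8, 9), "loyal"),
]

print(knn_classify((2, 2), customers))
print(knn_classify((7, 9), customers))
```

Choosing an odd k avoids ties when there are only two categories.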
Lots of other models
• Linear regression
• Logistic regression
• Neural networks
• Deep learning
• K-means clustering
• And many, many others — with lots in active development!
Data science use cases
• So, where is data science being used?
• And how can we apply it to our businesses?
A/B testing
• Find out what your users respond to
• Try two (or more) different versions of your Web site
• Compare to see which one has greater conversions (i.e., e-commerce success)
• Use the better one… and then do another experiment, ad infinitum
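Deciding which version "has greater conversions" usually also means asking whether the difference could be luck. One common approach, assumed here and not named in the talk, is a two-proportion z-test. The visitor and conversion counts are hypothetical.

```python
import math

def ab_test(conv_a, visitors_a, conv_b, visitors_b):
    """Compare two conversion rates; return (rate_a, rate_b, z, p_value)."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Pooled rate: what we'd expect if both versions performed the same.
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value

# Hypothetical experiment: 5,000 visitors saw each version of the page.
rate_a, rate_b, z, p_value = ab_test(200, 5000, 260, 5000)
print(f"A converts {rate_a:.1%}, B converts {rate_b:.1%}, p = {p_value:.4f}")

if p_value < 0.05:
    print("The difference is unlikely to be chance; keep the better version.")
else:
    print("Not enough evidence yet; keep collecting data.")
```

The 0.05 cutoff is a convention, not a law; a business might reasonably pick a stricter or looser threshold.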
Correlations!
• Amazon is one of the most successful data-science shops
• They’re always collecting information on what people look at and buy — and they suggest other products based on that behavior
• How often are they right? (Very often, actually)
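The simplest form of "suggest other products based on that behavior" is counting which items show up in the same orders. This is only a toy sketch of that idea, not a description of Amazon's actual system; the orders and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_cooccurrence(orders):
    """For each item, count how often every other item appears in the same order."""
    also_bought = defaultdict(Counter)
    for order in orders:
        for item, other in permutations(set(order), 2):
            also_bought[item][other] += 1
    return also_bought

def recommend(item, also_bought, n=1):
    """Return up to n items most often bought together with item."""
    return [other for other, _ in also_bought[item].most_common(n)]

# Hypothetical order history.
orders = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["stove", "fuel"],
]

also_bought = build_cooccurrence(orders)
print(recommend("tent", also_bought))
```

With real data the counts come from millions of orders, which is where the correlations start to become reliable.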
Fraud detection
• What behavior is correlated with a stolen credit card?
• What language is correlated with a research paper that was already written and submitted?
Interact with data
• Visualizations provide us (humans) with insights
• Many data scientists spend their time helping others create powerful, useful visualizations
• GIS (geographic information systems) allow us to take data and put it on maps. Some maps are even interactive, letting us explore data in new ways
Add GIS, and create maps
• https://openaccess9000.cartodb.com/viz/3459b348-8212-11e5-b022-0e8c56e2ffdb/public_map
“Half the money I spend on advertising is wasted; the trouble is I don't know which half.”
— John Wanamaker
Advertising
• We can show ads online, and know who has clicked on them.
• But we can do better: Show ads to the people for whom they’re most relevant, and most likely to be appropriate
• How can we do that?
Some ideas
• Show people ads based on text searches
• Show people ads based on what they have explicitly told us
• Show people ads based on what content they have indicated they like
• Show people ads based on their friends’ preferences and demographics
Aha!
• No wonder Google and Facebook are pioneers in the area of big data
• They’re using enormous amounts of data to display ads that people like
• And they get lots of additional data points every day, thanks to searches and “likes”
Data sets
• UCI’s Machine Learning Repository (for example, its Housing data set)
• https://archive.ics.uci.edu/ml/datasets/Housing
• Newsletter with new data sets:
• http://tinyletter.com/data-is-plural/
Really big data
• What do we do when the data is too big?
• What if it will take too long to process, or the data is too big to store on a single machine?
• Then we call in the truly big guns — distributed processing systems
Map-reduce
• map-reduce has been around for decades on individual computers
• But only now (thanks to Google’s implementation for distributed systems) does everyone want to use it
• map: apply a function to every element of a sequence
• reduce: turn a sequence of values into a single (or small) value
• Not all data can be broken apart easily!
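On a single computer, map and reduce are available directly in Python. The sketch below uses a made-up list of words; the second half splits the list into chunks to hint at why the pattern distributes well, with each chunk standing in for a separate machine.

```python
from functools import reduce
from operator import add

words = ["big", "data", "is", "your", "new", "best", "friend"]

# map: apply a function (len) to every element of the sequence.
lengths = list(map(len, words))

# reduce: combine the sequence of values into a single value.
total = reduce(add, lengths, 0)
print(lengths, total)

# The distributed flavor: each "machine" maps and reduces its own chunk,
# and the partial results are then reduced once more.
chunks = [words[:4], words[4:]]
partial_totals = [reduce(add, map(len, chunk), 0) for chunk in chunks]
print(partial_totals, reduce(add, partial_totals, 0))
```

This works because addition does not care how the data is grouped. Data whose pieces depend on each other cannot be split this easily.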
Hadoop
• Create a Hadoop cluster, and store the data you want to understand there
• Run a map-reduce query on your data: apply a function to it (e.g., “do you contain the phrase ‘machine learning’?”) and then reduce the results into an HTML page
• Use virtual machines in the cloud to make your cluster bigger or smaller, as necessary
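The phrase-search query described above can be imitated on one machine with plain Python. This is not Hadoop's API, and nothing here talks to a cluster; the document names, their text, and the mapper and reducer functions are hypothetical stand-ins for what a real job would distribute across many nodes.

```python
from functools import reduce
from html import escape

# Hypothetical documents; on a cluster these would live in distributed storage.
documents = {
    "intro.txt": "Machine learning lets computers categorize data.",
    "recipes.txt": "Add two eggs and stir.",
    "talk.txt": "Big data and machine learning go together.",
}
PHRASE = "machine learning"

def mapper(item):
    """Map step: does this document contain the phrase?"""
    name, text = item
    return name, PHRASE in text.lower()

def reducer(html_items, result):
    """Reduce step: fold each matching document into a list of HTML items."""
    name, matched = result
    if matched:
        return html_items + [f"<li>{escape(name)}</li>"]
    return html_items

matches = reduce(reducer, map(mapper, documents.items()), [])
page = "<ul>\n" + "\n".join(matches) + "\n</ul>"
print(page)
```

The mapper looks at one document at a time, so the work can be handed to as many machines as there are documents.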
Beyond Hadoop
• More modern, real-time, in-memory analysis system
• Open-source system that’s increasingly popular
• Built on the same filesystem as Hadoop
• Connections from Java, Python, R
• Has a suite of highly parallel machine-learning models
• Because your data is in memory (and split across multiple virtual machines), it runs much faster