2016 USA Election Analysis Modeling

‘R’adin, Jung Su Jin, Park Ji Yeon

Upload: su-jin-jung

Post on 16-Apr-2017



Contents

1. Analysis Plan
2. Analysis of Election Results
3. Twitter Text Analysis
4. Challenges and Suggestions

Topics: Outline and Purposes, Tools/Packages, Motivation, Exploratory data analysis, Data preconditioning, Modeling and Test, Dataset, Twitter Text Analysis, Analysis & Conclusion

1. Analysis Plan: Outline and Purpose

Purposes of the analysis
- Identify how Trump won and who supported him.
- Analyze what Trump and Hillary mentioned on Twitter.

Methods of analysis
1. Linear regression and decision tree analysis
- Model the dependent variable (Trump vote rate) with the US county facts as independent variables.
- Classify the characteristics of the groups who supported Trump using decision tree analysis.
2. Text mining and sentiment analysis
- Analyze frequent words in the Twitter data and identify how the words are associated with each other.
- Classify sentiment automatically using the Naive Bayes classification method.

Data
- 2016 and 2012 vote results data
- US county facts data
- Twitter data from July 26th to Aug 21st

1. Modeling for Analysis of 2016 Election Results

How did Donald Trump beat Hillary Clinton? Who supported Donald Trump?

Methods: Linear Regression, Decision Tree Analysis

Data Preconditioning

01. Download the datasets
- US 2012 election county-level results
- US 2016 election county-level results
- County Facts data

Data Preconditioning: County_facts.csv

Remove the variables that do not help describe the people who supported Trump, and rename the remaining variables so that their meaning is easy to understand.

Slide panels: R code; county2 data set (first rows):

fips  area_name       state_abbreviation  population  under.5.y
0     United States   NA                  318857056   6.2
1000  Alabama         NA                  4849377     6.1
1001  Autauga County  AL                  55395       6
1003  Baldwin County  AL                  200111      5.6
1005  Barbour County  AL                  26887       5.7
1007  Bibb County     AL                  22506       5.3
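The select-and-rename step can be sketched in base R as follows. This is a minimal illustration, not the authors' code: the raw column names (raw_pop, raw_under5, unused_code) are hypothetical stand-ins for the original county facts codes, and only the values are taken from the rows shown on the slide.

```r
# Toy stand-in for the county facts file. The raw column names below are
# hypothetical; the values come from the first rows shown on the slide.
county <- data.frame(
  fips = c(1001, 1003, 1005),
  area_name = c("Autauga County", "Baldwin County", "Barbour County"),
  state_abbreviation = c("AL", "AL", "AL"),
  raw_pop = c(55395, 200111, 26887),
  raw_under5 = c(6.0, 5.6, 5.7),
  unused_code = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Keep only the variables that are useful for the analysis.
keep <- c("fips", "area_name", "state_abbreviation", "raw_pop", "raw_under5")
county2 <- county[, keep]

# Rename the remaining variables so that their meaning is easy to read.
names(county2)[names(county2) == "raw_pop"] <- "population"
names(county2)[names(county2) == "raw_under5"] <- "under.5.y"

print(county2)
```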

Data Preconditioning: Merge the votes.csv

We also select the meaningful variables in the votes data set and rename them so that their meaning is easy to recognize.
- Merge the county2 data and the vote2 data by FIPS code.
- Add a column named winner: its value is 1 if Trump's vote rate is higher than Clinton's, otherwise 0.
- Delete all rows containing NA values using na.omit.

Merged data (first rows):

fips  area_name       state_abbreviation.x  population  under.5.y
1001  Autauga County  AL                    55395       6
1003  Baldwin County  AL                    200111      5.6
1005  Barbour County  AL                    26887       5.7
1007  Bibb County     AL                    22506       5.3
1009  Blount County   AL                    57719       6.1
1011  Bullock County  AL                    10764       6.3
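A minimal sketch of the merge, winner column, and na.omit steps, assuming both tables carry a fips key. The vote rates here are invented for illustration and are not the real election results; only the county names and populations come from the slide. Because both toy tables also contain a state_abbreviation column, merge() adds the .x and .y suffixes, which is where a name like state_abbreviation.x comes from.

```r
# County side: names and populations as shown on the slide.
county2 <- data.frame(
  fips = c(1001, 1003, 1005, 1007),
  area_name = c("Autauga County", "Baldwin County", "Barbour County", "Bibb County"),
  state_abbreviation = c("AL", "AL", "AL", "AL"),
  population = c(55395, 200111, 26887, 22506),
  stringsAsFactors = FALSE
)

# Vote side: HYPOTHETICAL vote rates, invented for this example only.
vote2 <- data.frame(
  fips = c(1001, 1003, 1005, 1007),
  state_abbreviation = c("AL", "AL", "AL", "AL"),
  Trump = c(0.70, 0.75, NA, 0.30),
  Clinton = c(0.25, 0.20, 0.45, 0.65),
  stringsAsFactors = FALSE
)

# Merge by FIPS code; duplicated non-key columns get .x / .y suffixes.
merged <- merge(county2, vote2, by = "fips")

# winner = 1 if Trump's vote rate is higher than Clinton's, otherwise 0.
merged$winner <- ifelse(merged$Trump > merged$Clinton, 1, 0)

# Drop every row that still contains an NA value.
election <- na.omit(merged)

print(election)
```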

Exploratory Data Analysis

Show the basic statistics of all the variables using the stat_fn function.


- Trump's vote rate is negatively related to the rate of people with a bachelor's degree or higher in the county.
- Trump's vote rate is positively related to the percentage of White people in the county.

Plots: Y = Trump, X = Bachelor; Y = Trump, X = White

- Clinton's vote rate is positively related to the rate of people with a bachelor's degree or higher in the county.
- Clinton's vote rate is negatively related to the percentage of White people in the county.

Plots: Y = Clinton, X = Bachelor; Y = Clinton, X = White

- Trump's and Romney's vote rates are strongly correlated, and so are Clinton's and Obama's.
- Trump's vote rate is negatively correlated with the bachelor's degree rate and with the percentage of Black people, but positively correlated with the percentage of White people.
- Clinton's vote rate is positively correlated with the bachelor's degree rate and with the percentage of Black people, but negatively correlated with the percentage of White people.
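Pairwise correlations like these can be read off a correlation matrix computed with cor(). The sketch below uses a small made-up data frame whose values are chosen only to reproduce the direction of the relationships described above; they are not the real county data.

```r
# HYPOTHETICAL county-level values, invented to illustrate cor();
# only the signs of the relationships mirror the findings above.
eda <- data.frame(
  Trump    = c(0.80, 0.72, 0.65, 0.55, 0.45, 0.30),
  Clinton  = c(0.17, 0.25, 0.32, 0.42, 0.52, 0.67),
  Bachelor = c(10, 15, 20, 25, 30, 35),
  White    = c(95, 90, 80, 70, 60, 50)
)

# Pearson correlation matrix of all variable pairs.
corr <- cor(eda)
print(round(corr, 2))

# Individual relationships can be picked out by row and column name.
trump_bachelor <- corr["Trump", "Bachelor"]
trump_white <- corr["Trump", "White"]
```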


Correlation Visualization chart of some representative variables

Linear Regression Modeling

Sampling and modeling the linear regression with the training data:
- Sample 20% of the data as test data and 80% as training data.
- Select the variables using the forward AIC method.
- Train the linear regression model on the variables selected with the smallest AIC value.
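The split and the forward selection can be sketched with base R's sample(), lm(), and step(). This is an illustration under stated assumptions rather than the authors' script: the data are simulated, the variable Noise is a deliberately irrelevant predictor, and the seed and coefficients are arbitrary.

```r
set.seed(1)

# SIMULATED data: Trump depends on Romney and Bachelor, not on Noise.
n <- 200
d <- data.frame(
  Romney = runif(n, 0.2, 0.8),
  Bachelor = runif(n, 5, 50),
  Noise = rnorm(n)
)
d$Trump <- 0.05 + 0.9 * d$Romney - 0.003 * d$Bachelor + rnorm(n, sd = 0.02)

# 80% training data, 20% test data.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- d[train_idx, ]
test <- d[-train_idx, ]

# Forward selection by AIC: start from the intercept-only model and add
# one variable at a time while the AIC keeps decreasing.
null_model <- lm(Trump ~ 1, data = train)
train.lm <- step(
  null_model,
  scope = list(lower = ~ 1, upper = ~ Romney + Bachelor + Noise),
  direction = "forward",
  trace = 0
)

print(formula(train.lm))
```

Setting trace = 0 only silences the step-by-step log; leaving it at the default shows the AIC of every candidate model.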

Linear regression results:

1. Correlation coefficient between the predicted and actual values on the test data: 0.98.
2. F-statistic p-value: < 2.2e-16.
3. Multiple R-squared = 0.9624 (very strong); adjusted R-squared = 0.9622.
4. X variables with Pr marked ***:
- Positive coefficients: Romney, Asian, White, Income.capita
- Negative coefficients: Bachelor, household.income, under.18.y, Housing, Black, Foreign, Hawaiian, High.school, Language, Female
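Quantities of this kind (test-set correlation, F-statistic p-value, multiple and adjusted R-squared) can be extracted from a fitted lm object as sketched below. The data are simulated with an arbitrary seed and a single predictor x, so the numbers it prints will not match the figures reported on the slide.

```r
set.seed(42)

# SIMULATED data with one predictor; not the election data.
n <- 100
d <- data.frame(x = runif(n))
d$y <- 2 * d$x + rnorm(n, sd = 0.05)

idx <- sample(seq_len(n), size = 0.8 * n)
train <- d[idx, ]
test <- d[-idx, ]

fit <- lm(y ~ x, data = train)

# Correlation between predicted and actual values on the test data.
pred <- predict(fit, newdata = test)
test_cor <- cor(pred, test$y)

# Multiple and adjusted R-squared from the model summary.
s <- summary(fit)
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# p-value of the overall F-test, computed from the stored F-statistic.
f <- s$fstatistic
f_pvalue <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(test_cor = test_cor, r2 = r2, adj_r2 = adj_r2, f_pvalue = f_pvalue))
```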

> plot(train.lm)

Diagnostic plot panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage

The plot in the upper left shows the residual errors plotted versus their fitted values. The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points. The plot in the lower left is a standard Q-Q plot, which should suggest that the residual errors are normally distributed. The scale-location plot in the upper right shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. Again, there should be no obvious trend in this plot. Finally, the plot in the lower right shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for the Cook's distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation has little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.
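The Cook's distance rule of thumb can also be checked numerically with cooks.distance(). The sketch below is a toy example, not the election model: ten points lie exactly on a line and an eleventh, far-away point is constructed to pull the fit, so it should be the one flagged. The four diagnostic panels are drawn to a null PDF device so that the example runs without a display.

```r
# TOY data: ten points on the line y = 2x plus one far-away point
# (x = 30, y = 0) that is constructed to be highly influential.
x <- c(1:10, 30)
y <- c(2 * (1:10), 0)
toy.lm <- lm(y ~ x)

# Cook's distance for every observation; values above 1 are suspicious.
cd <- cooks.distance(toy.lm)
influential <- which(cd > 1)
print(round(cd, 3))
print(influential)

# Draw the four standard diagnostic panels in a 2 x 2 grid.
pdf(NULL)                 # null device: nothing is written to disk
par(mfrow = c(2, 2))
plot(toy.lm)
dev.off()
```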

Decision Tree Analysis

White>47.3, Bachelor degree