Radin, Jung Su Jin, Park Ji Yeon
2016 USA Election Analysis Modeling
Contents
1. Analysis Plan
2. Analysis of Election Results
3. Twitter Text Analysis
4. Challenges and Suggestions
Topics covered: Outline and Purposes, Tools/Packages, Dataset, Data preconditioning, Exploratory data analysis, Modeling and Test, Motivation, Twitter Text Analysis, Analysis & Conclusion
1. Analysis Plan Outline and Purpose
The purposes of the analysis
- Identify how Trump won and who supported him.
- Analyze what Trump and Hillary mentioned on Twitter.
Method of analysis
1. Linear Regression and Decision Tree analysis
- Model the dependent variable (Trump vote rate) with independent variables (US county facts).
- Classify the characteristics of the groups who supported Trump using Decision Tree analysis.
2. Text Mining and Sentiment Analysis
- Analyze frequent words in the Twitter data and find the associations between words.
- Automatic sentiment classification using the Naive Bayes classification method.
Data
- 2016 and 2012 vote results data
- US county facts data
- Twitter data from July 26th to Aug 21st
1. Modeling for Analysis of 2016 Election Results
How did Donald Trump beat Hillary Clinton? Who supported Donald Trump?
Methods: Linear Regression, Decision Tree Analysis
Data Preconditioning
Download the datasets:
- US 2012 election county-level results
- US 2016 election county-level results
- County Facts data
Remove the variables that do not help describe the people who supported Trump, and rename the remaining ones so that their meaning is easy to recognize.
county2 data set (first rows):
fips  area_name       state_abbreviation  population  under.5.y
0     United States   NA                  318857056   6.2
1000  Alabama         NA                  4849377     6.1
1001  Autauga County  AL                  55395       6
1003  Baldwin County  AL                  200111      5.6
1005  Barbour County  AL                  26887       5.7
1007  Bibb County     AL                  22506       5.3
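The column selection and renaming step might look like the following minimal sketch in base R. The coded column names (PST045214, AGE135214, LND110210) are assumptions about the raw County_facts.csv layout, and the toy data frame only stands in for the real file.

```r
# Toy stand-in for County_facts.csv; the coded column names are assumptions.
county <- data.frame(
  fips = c(1001, 1003, 1005),
  area_name = c("Autauga County", "Baldwin County", "Barbour County"),
  state_abbreviation = c("AL", "AL", "AL"),
  PST045214 = c(55395, 200111, 26887),  # assumed code for population
  AGE135214 = c(6.0, 5.6, 5.7),         # assumed code for percent under 5
  LND110210 = c(1, 2, 3),               # placeholder values for a dropped column
  stringsAsFactors = FALSE
)

# Keep only the variables of interest.
keep <- c("fips", "area_name", "state_abbreviation", "PST045214", "AGE135214")
county2 <- county[, keep]

# Rename coded columns to readable names.
rename_map <- c(PST045214 = "population", AGE135214 = "under.5.y")
idx <- match(names(rename_map), names(county2))
names(county2)[idx] <- unname(rename_map)

print(county2)
```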
Data Preconditioning: County_facts.csv
We can also select the meaningful variables in the votes data set and rename them so that their meaning is easy to recognize.
- Merge the county2 data and the vote2 data by fips code.
- Add a column named winner: the value is 1 if Trump's vote rate is higher than Clinton's, otherwise 0.
- Delete all rows containing NA values using na.omit.
Data Preconditioning: Merge the votes.csv
Merged data set (first rows):
fips  area_name       state_abbreviation.x  population  under.5.y
1001  Autauga County  AL                    55395       6
1003  Baldwin County  AL                    200111      5.6
1005  Barbour County  AL                    26887       5.7
1007  Bibb County     AL                    22506       5.3
1009  Blount County   AL                    57719       6.1
1011  Bullock County  AL                    10764       6.3
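The merge, winner flag and NA removal described above can be sketched in base R as follows. The fips codes and vote shares here are invented toy values, and the column names Trump and Clinton are assumptions about how the vote2 data was renamed.

```r
# Toy data; all values are made up for illustration only.
county2 <- data.frame(fips = c(1, 2, 3, 4),
                      population = c(55000, 200000, 27000, 22000))
vote2 <- data.frame(fips = c(1, 2, 3, 4),
                    Trump = c(0.70, 0.60, NA, 0.40),
                    Clinton = c(0.30, 0.40, 0.50, 0.60))

# Inner join on the shared fips key.
merged <- merge(county2, vote2, by = "fips")

# winner = 1 when Trump's vote rate exceeds Clinton's, otherwise 0
# (rows with a missing vote rate get NA here).
merged$winner <- ifelse(merged$Trump > merged$Clinton, 1, 0)

# Drop every row that contains at least one NA.
merged <- na.omit(merged)

print(merged)
```

In this toy case the row with fips 3 is dropped because its Trump value is missing.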
Exploratory Data Analysis
Show the basic summary statistics of all the variables using the stat_fn function.
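The definition of stat_fn is not shown in the slides. As a rough idea of what such a helper could do, here is a hypothetical stand-in named stat_sketch that computes a few summary statistics for every numeric column; the toy values are invented.

```r
# Hypothetical stand-in for stat_fn: one row of statistics per numeric column.
stat_sketch <- function(df) {
  num <- df[sapply(df, is.numeric)]
  t(sapply(num, function(x) {
    c(mean = mean(x), sd = sd(x), min = min(x),
      median = median(x), max = max(x))
  }))
}

# Toy data with invented values; the non-numeric column is skipped.
toy <- data.frame(Trump = c(40, 55, 70),
                  Bachelor = c(30, 20, 10),
                  name = c("a", "b", "c"))
stats <- stat_sketch(toy)
print(stats)
```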
Exploratory Data Analysis: Trump vote rate
- The relationship between the Trump vote rate and the rate of bachelor's degree or higher in a county is negative.
- The relationship between the Trump vote rate and the percentage of White people in a county is positive.
[Figures: scatter plots with Y = Trump, X = Bachelor and Y = Trump, X = White]
Exploratory Data Analysis: Clinton vote rate
- The relationship between the Clinton vote rate and the rate of bachelor's degree or higher in a county is positive.
- The relationship between the Clinton vote rate and the percentage of White people in a county is negative.
[Figures: scatter plots with Y = Clinton, X = Bachelor and Y = Clinton, X = White]
Exploratory Data Analysis: Correlations
- Trump and Romney vote rates are strongly correlated, and so are Clinton and Obama vote rates.
- The Trump vote rate correlates negatively with the bachelor's education level and with the percentage of Black people, and positively with the percentage of White people.
- The Clinton vote rate correlates positively with the bachelor's education level and with the percentage of Black people, and negatively with the percentage of White people.
[Figure: correlation visualization chart of some representative variables]
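A correlation table like the one behind such a chart can be computed with cor(). The sketch below uses simulated data whose structure (Trump rising with Romney, falling with Bachelor) is built in by assumption, so it only illustrates the mechanics, not the real results.

```r
# Simulated toy data; the relationships are built in by construction.
set.seed(42)
n <- 200
Romney <- runif(n, 20, 80)
Bachelor <- runif(n, 5, 50)
Trump <- 0.9 * Romney - 0.6 * Bachelor + rnorm(n, sd = 3)
toy <- data.frame(Trump, Romney, Bachelor)

# Pairwise Pearson correlations, rounded for readability.
cm <- round(cor(toy), 2)
print(cm)
```

A plotting package could then draw this matrix as a chart; which package the slides used is not stated.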
Linear Regression Modeling
Sampling and fitting the linear regression with the training data:
- Split the data into 20% test data and 80% training data.
- Select the variables using the forward AIC method.
- Train the linear regression model on the variables selected with the smallest AIC value.
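The split and forward AIC selection above could be sketched with base R's step() function. The data is simulated, the variable noise is a deliberately irrelevant column, and the object name train.lm simply follows the name used later in the slides; the actual code and variable set are not shown in the source.

```r
# Simulated toy data; Trump depends on Romney, Bachelor and White by construction.
set.seed(1)
n <- 200
dat <- data.frame(Romney = runif(n, 20, 80),
                  Bachelor = runif(n, 5, 50),
                  White = runif(n, 30, 99),
                  noise = rnorm(n))
dat$Trump <- 5 + 0.9 * dat$Romney - 0.2 * dat$Bachelor +
  0.1 * dat$White + rnorm(n, sd = 2)

# 20% test data, 80% training data.
test_idx <- sample(seq_len(n), size = round(0.2 * n))
test <- dat[test_idx, ]
train <- dat[-test_idx, ]

# Forward selection by AIC, starting from the intercept-only model.
null.lm <- lm(Trump ~ 1, data = train)
train.lm <- step(null.lm,
                 scope = list(lower = ~ 1,
                              upper = ~ Romney + Bachelor + White + noise),
                 direction = "forward", trace = 0)

print(formula(train.lm))
```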
Linear Regression: results
1. Correlation coefficient between the predicted and actual values on the test data: 0.98.
2. F-statistic p-value: < 2.2e-16.
3. Multiple R-squared = 0.9624 (very strong). Adjusted R-squared = 0.9622.
4. X variables marked *** in the Pr column:
- Positive coefficients: Romney, Asian, White, Income.capita
- Negative coefficients: Bachelor, household.income, under.18.y, Housing, Black, Foreign, Hawaiian, High.school, Language, Female
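The quantities reported above (test-set correlation, R-squared values and the F-statistic p-value) can all be read off a fitted lm object. The following sketch uses simulated data with hypothetical predictors x1 and x2, so its numbers are unrelated to the results in the slides.

```r
# Simulated toy data with hypothetical predictors x1 and x2.
set.seed(7)
n <- 300
dat <- data.frame(x1 = runif(n, 20, 80), x2 = runif(n, 5, 50))
dat$y <- 3 + 0.9 * dat$x1 - 0.4 * dat$x2 + rnorm(n, sd = 3)

test_idx <- sample(seq_len(n), size = round(0.2 * n))
train <- dat[-test_idx, ]
test <- dat[test_idx, ]

fit <- lm(y ~ x1 + x2, data = train)

# Correlation between predicted and actual values on the held-out test data.
pred <- predict(fit, newdata = test)
test_cor <- cor(pred, test$y)

# R-squared values and the overall F-test p-value from the model summary.
s <- summary(fit)
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared
f <- s$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(c(test_cor = test_cor, r2 = r2, adj_r2 = adj_r2))
```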
Linear Regression Modeling: diagnostic plots
> plot(train.lm)
[Figure panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
Linear Regression Modeling
The plot in the upper left shows the residual errors plotted versus their fitted values. The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points. The plot in the lower left is a standard Q-Q plot, which should suggest that the residual errors are normally distributed. The scale-location plot in the upper right shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. Again, there should be no obvious trend in this plot. Finally, the plot in the lower right shows each point's leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for the Cook's distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation has little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.
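The leverage and Cook's distance values behind the last panel can also be inspected numerically. This is a minimal sketch on a toy data set with one deliberately planted influential point, using the rule of thumb from the text (distance larger than 1).

```r
# Toy data on a near-perfect line, plus one planted influential point (row 21).
x <- c(1:20, 60)
y <- c(2 * (1:20) + rep(c(-0.5, 0.5), 10), 0)
fit <- lm(y ~ x)

# Leverage (hat values) and Cook's distance for every observation.
lev <- hatvalues(fit)
cd <- cooks.distance(fit)

# Flag observations whose Cook's distance exceeds 1.
suspicious <- unname(which(cd > 1))
print(suspicious)

# The four diagnostic panels themselves could be drawn with:
# par(mfrow = c(2, 2)); plot(fit)
```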
Decision Tree Analysis
White>47.3, Bachelor degree