The last post in this series splits the data into train and test sets, then attempts a prediction with the test set. It is possible to split early in the process, but there is no harm in waiting as long as you do it eventually.
First off, once again, let's load the data. Notice I have added a new package, splitstackshape, which provides a method to stratify the data.
If you are more interested in Python, I have created a notebook using the same dataset but with the Python solution here.
install.packages("splitstackshape")
library(splitstackshape)

data.Main <- read.csv(url("https://raw.githubusercontent.com/sqlshep/SQLShepBlog/master/data/USA.dataAll.csv"), stringsAsFactors=FALSE, header=TRUE)

#remove all NAs and infinites
is.na(data.Main) <- sapply(data.Main, is.infinite)
data.Main[is.na(data.Main)] <- 0

#Remove the extra NY County
data.Main <- data.Main[!(data.Main$X == "1864"),]
In this step we will create two new data frames: one with 75% of the data for training and one with the remaining 25% for testing. The data will be stratified on RuralUrbanCode, a factor from 1 to 9 based on the population of the county, so each code is sampled at the same 75% rate.
data.Main.Train <- stratified(data.Main, "RuralUrbanCode", .75)
data.Main.Test <- data.Main[!(data.Main$combined_fips %in% data.Main.Train$combined_fips),]

#Rows in test
nrow(data.Main.Test)
#Rows per RuralUrbanCode
table(data.Main.Test$RuralUrbanCode)

#Rows in Train
nrow(data.Main.Train)
#Rows per RuralUrbanCode
table(data.Main.Train$RuralUrbanCode)
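For readers curious what a stratified split does under the hood, here is a minimal base R sketch on made-up data. The toy data frame, its column names, and the seed are assumptions for illustration only. This is not how splitstackshape implements stratified(), just the same idea: sample 75% of the rows within each stratum and keep the rest for testing.

```r
# Toy data: three strata of different sizes (hypothetical, not the election data)
set.seed(42)
toy <- data.frame(
  id = 1:120,
  stratum = rep(c("A", "B", "C"), times = c(60, 40, 20))
)

# Group the row indices by stratum, then sample 75% of the indices in each group
idx.by.stratum <- split(seq_len(nrow(toy)), toy$stratum)
train.idx <- unlist(
  lapply(idx.by.stratum, function(idx) {
    idx[sample.int(length(idx), round(length(idx) * 0.75))]
  }),
  use.names = FALSE
)

toy.Train <- toy[train.idx, ]
toy.Test  <- toy[-train.idx, ]

# Every stratum is split in the same 75/25 proportion
table(toy.Train$stratum)
table(toy.Test$stratum)

# No row appears in both sets
length(intersect(toy.Train$id, toy.Test$id))
```

Because sampling is random, calling set.seed() before the split is what makes the train and test sets reproducible from run to run.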
Now that we have test and train datasets, let's create the model again, this time fitting it on the training data only. The variables selected are the ones that have been narrowed down over the last few blog posts.
elect_lg.glm <- glm(Winner ~ lg_Population + lg_PovertyPercent + lg_EDU_HSDiploma +
                      lg_EDU_SomeCollegeorAS + lg_EDU_BSorHigher + lg_UnemploymentRate +
                      lg_Married + lg_HHMeanIncome + lg_Diabetes + lg_Inactivity + lg_OpioidRx,
                    family = binomial,
                    data = data.Main.Train)

summary(elect_lg.glm)
Take the model and perform a prediction using the test dataset that was set aside. The results of the model will be loaded into elect.probs, a vector containing just the predicted probabilities, in the same order as the rows of the test data frame.
elect.probs = predict(elect_lg.glm, newdata = data.Main.Test, type="response")

# review the first ten predicted probabilities
elect.probs[1:10]

# this is not necessary, but create a new df that we can modify
data.Hold <- data.Main.Test

#Create a new vector filled with "Trump", one entry per test row
elect.pred=rep("Trump",nrow(data.Main.Test))

#Update any entry whose probability is less than .5 to "Hillary"
elect.pred[elect.probs < .5] = "Hillary"
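To make the role of type="response" concrete, here is a small self-contained sketch on simulated data. The toy data frame, the coefficients used to simulate it, and the seed are all assumptions for illustration and have nothing to do with the election data. It shows that the default prediction is on the log-odds scale, that type="response" gives probabilities, and that ifelse() is a one-step alternative to the rep() and update approach above.

```r
# Toy data (hypothetical): one predictor and a binary outcome
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(0.5 + 1.5 * toy$x))

# Logistic regression, same family as the election model
toy.glm <- glm(y ~ x, family = binomial, data = toy)

# type = "link" (the default) returns log-odds;
# type = "response" returns probabilities between 0 and 1
toy.link  <- predict(toy.glm, newdata = toy, type = "link")
toy.probs <- predict(toy.glm, newdata = toy, type = "response")

# The two are related through the logistic function
all.equal(plogis(toy.link), toy.probs)

# Label every row in one step; probabilities below .5 become "Hillary"
toy.pred <- ifelse(toy.probs < 0.5, "Hillary", "Trump")
table(toy.pred)
```

Forgetting type="response" is an easy mistake: the .5 cutoff only makes sense on the probability scale, while on the log-odds scale the equivalent cutoff is 0.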
Create a confusion matrix to compare the predictions against the actual winners.
#update the 0 and 1 to the name of the winner
data.Hold$Winner[data.Hold$Winner == '1'] = "Trump"
data.Hold$Winner[data.Hold$Winner == '0'] = "Hillary"

# create the confusion matrix from test and prediction
table(elect.pred,data.Hold$Winner)

# accuracy: correct predictions (the diagonal of the matrix) over all test rows
(79+636)/787
# 0.9085133
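Rather than typing the counts by hand, the accuracy can be computed from the table itself. The sketch below uses a handful of made-up predictions and outcomes (hypothetical values, not the results above) to show the arithmetic: the diagonal of the confusion matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total.

```r
# Hypothetical predictions and actual winners (not the post's results)
toy.pred   <- c("Trump", "Trump", "Hillary", "Trump", "Hillary", "Trump")
toy.actual <- c("Trump", "Hillary", "Hillary", "Trump", "Trump", "Trump")

# Fix the factor levels so rows and columns of the table line up
lv <- c("Hillary", "Trump")
toy.cm <- table(predicted = factor(toy.pred, levels = lv),
                actual    = factor(toy.actual, levels = lv))
toy.cm

# Accuracy = correct predictions (the diagonal) / all predictions
toy.acc <- sum(diag(toy.cm)) / sum(toy.cm)
toy.acc

# Equivalent shortcut without building the table
mean(toy.pred == toy.actual)
```

Setting the factor levels explicitly matters when one class is never predicted; without it the table would not be square and diag() would no longer pick out the correct predictions.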
So, with test and train data, and possibly the help of some collinearity, we have an accuracy of about 91% on the test set.