{"id":1154,"date":"2018-08-24T08:06:49","date_gmt":"2018-08-24T08:06:49","guid":{"rendered":"https:\/\/sqlshep.com\/?p=1154"},"modified":"2018-08-24T21:26:33","modified_gmt":"2018-08-24T21:26:33","slug":"logistic-regression-5-train-test-predict","status":"publish","type":"post","link":"https:\/\/sqlshep.com\/?p=1154","title":{"rendered":"Logistic Regression 5, train, test, predict"},"content":{"rendered":"<p>The last step in this series is splitting the data into train and test sets, then attempting a prediction with the test data set. It is possible to do this earlier in the process, but there is no harm in waiting, as long as you do it eventually. <\/p>\n<p>First off, once again, let's load the data. Notice I have added a new package, splitstackshape, which provides a method to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Stratified_sampling\">stratify<\/a> the data. <\/p>\n<p>If you are more interested in Python, I have created a notebook using the same dataset; the <a href=\"https:\/\/github.com\/sqlshep\/SQLShepBlog\/blob\/master\/Notebooks\/Logistic%20Regression\/Python%20-%20Logistic%20Regression.ipynb\">Python solution is here<\/a>. <\/p>\n<p><!--more--><\/p>\n<pre><code>\r\ninstall.packages(\"splitstackshape\")\r\nlibrary(splitstackshape)\r\n\r\ndata.Main <- read.csv(url(\"https:\/\/raw.githubusercontent.com\/sqlshep\/SQLShepBlog\/master\/data\/USA.dataAll.csv\"),stringsAsFactors=FALSE,header=TRUE)\r\n\r\n# Convert all infinite values to NA, then replace every NA with 0\r\nis.na(data.Main) <- sapply(data.Main, is.infinite)\r\ndata.Main[is.na(data.Main)] <- 0\r\n\r\n# Remove the extra NY County\r\ndata.Main <- data.Main[!(data.Main$X == \"1864\"),]\r\n<\/code><\/pre>\n<p>In this step we will create two new data frames: one with 75% of the data for training and one with the remaining 25% for testing. The data will be stratified on RuralUrbanCode, which is a factor from 1 to 9 based on the population of the county.
<\/p>\n<pre><code>\r\n# Stratified 75% sample, preserving the RuralUrbanCode proportions\r\ndata.Main.Train <- stratified(data.Main, \"RuralUrbanCode\", .75)\r\n\r\n# The test set is every county not selected for training\r\ndata.Main.Test <- data.Main[!(data.Main$combined_fips %in% data.Main.Train$combined_fips),]\r\n\r\n# Rows in test\r\nnrow(data.Main.Test)\r\n# Rows per RuralUrbanCode\r\ntable(data.Main.Test$RuralUrbanCode)\r\n\r\n# Rows in train\r\nnrow(data.Main.Train)\r\n# Rows per RuralUrbanCode\r\ntable(data.Main.Train$RuralUrbanCode)\r\n\r\n<\/code><\/pre>\n<p>Now that we have train and test datasets, let's create the model again. The variables selected are the ones that have been narrowed down over the last few blog posts. <\/p>\n<pre><code>\r\nelect_lg.glm <- glm(Winner ~ lg_Population + lg_PovertyPercent + lg_EDU_HSDiploma + \r\n                      lg_EDU_SomeCollegeorAS + lg_EDU_BSorHigher + lg_UnemploymentRate + \r\n                      lg_Married + lg_HHMeanIncome + lg_Diabetes + lg_Inactivity + \r\n                      lg_OpioidRx, family = binomial, data = data.Main.Train)\r\n\r\nsummary(elect_lg.glm)\r\n<\/code><\/pre>\n<p>Take the model and perform a prediction using the test dataset that was set aside. 
The results of the model will be loaded into elect.probs; it will contain just the predicted probabilities, in the same order as the rows of the test data frame.<\/p>\n<pre><code>\r\nelect.probs = predict(elect_lg.glm, newdata = data.Main.Test, type=\"response\")\r\n\r\n# Review the first ten predicted probabilities\r\nelect.probs[1:10]\r\n\r\n# This is not necessary, but create a new data frame that we can modify\r\ndata.Hold <- data.Main.Test\r\n\r\n# Create a new vector filled with \"Trump\"\r\nelect.pred = rep(\"Trump\", nrow(data.Main.Test))\r\n\r\n# Update any prediction with a probability of less than .5 to \"Hillary\"\r\nelect.pred[elect.probs < .5] = \"Hillary\"\r\n<\/code><\/pre>\n<p>Create a confusion matrix and compute the accuracy.<\/p>\n<pre><code>\r\n\r\n# Update the 0 and 1 labels to the name of the winner\r\ndata.Hold$Winner[data.Hold$Winner == '1'] = \"Trump\"\r\ndata.Hold$Winner[data.Hold$Winner == '0'] = \"Hillary\"\r\n\r\n# Create the confusion matrix from the predictions and the test labels\r\ntable(elect.pred, data.Hold$Winner)\r\n\r\n# Accuracy: correct predictions divided by total test rows\r\n(79+636)\/787  # 0.9085133\r\n<\/code><\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-23-at-9.04.24-PM-216x300.png\" alt=\"\" width=\"216\" height=\"300\" class=\"alignright size-medium wp-image-1165\" srcset=\"https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-23-at-9.04.24-PM-216x300.png 216w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-23-at-9.04.24-PM-768x1066.png 768w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-23-at-9.04.24-PM-738x1024.png 738w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-23-at-9.04.24-PM-624x866.png 624w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-23-at-9.04.24-PM.png 970w\" sizes=\"auto, (max-width: 216px) 100vw, 216px\" \/><br \/>\nSo, with test and train data, and possibly the help of some collinearity, we have an accuracy of 
91%.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The last step in this series is splitting the data into train and test sets, then attempting a prediction with the test data set. It is possible to do this earlier in the process, but there is no harm in waiting, as long as you do it eventually. First off, once again, let's load the data. Notice I [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[72,63],"tags":[73],"class_list":["post-1154","post","type-post","status-publish","format-standard","hentry","category-logistic","category-regression","tag-logistic-regression"],"_links":{"self":[{"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/posts\/1154","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sqlshep.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1154"}],"version-history":[{"count":14,"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/posts\/1154\/revisions"}],"predecessor-version":[{"id":1171,"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/posts\/1154\/revisions\/1171"}],"wp:attachment":[{"href":"https:\/\/sqlshep.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1154"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sqlshep.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1154"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sqlshep.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1154"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}