{"id":1077,"date":"2018-08-21T23:18:12","date_gmt":"2018-08-21T23:18:12","guid":{"rendered":"https:\/\/sqlshep.com\/?p=1077"},"modified":"2018-08-22T22:57:16","modified_gmt":"2018-08-22T22:57:16","slug":"logistic-regression-2-the-pictures","status":"publish","type":"post","link":"https:\/\/sqlshep.com\/?p=1077","title":{"rendered":"Logistic Regression 2, the graphs"},"content":{"rendered":"<p>I should probably continue the blog where <a target=\"_blank\" href=\"https:\/\/sqlshep.com\/?p=940\">I mentioned<\/a> I would write about Logistic Regression.  I have been putting this off for a while, as I needed some time to pass between a paper my college stats team and I submitted and this blog post.  Well, here we are.  This is meant to demonstrate Logistic Regression, nothing more; I am going to use election data because it is interesting, not because I care about proving anything.  Census data, demographics, maps, and election data are all interesting to me, so that makes this fun to play with. <\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/1024px-MSM_spotlights_Donald_Trump_vs._Hillary_Clinton_and_Bernie_Sanders_24311159914-1024x576.jpg\" alt=\"\" width=\"625\" height=\"352\" class=\"aligncenter size-large wp-image-1087\" srcset=\"https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/1024px-MSM_spotlights_Donald_Trump_vs._Hillary_Clinton_and_Bernie_Sanders_24311159914.jpg 1024w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/1024px-MSM_spotlights_Donald_Trump_vs._Hillary_Clinton_and_Bernie_Sanders_24311159914-300x169.jpg 300w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/1024px-MSM_spotlights_Donald_Trump_vs._Hillary_Clinton_and_Bernie_Sanders_24311159914-768x432.jpg 768w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/1024px-MSM_spotlights_Donald_Trump_vs._Hillary_Clinton_and_Bernie_Sanders_24311159914-624x351.jpg 624w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><a 
target=\"_blank\" href=\"https:\/\/commons.wikimedia.org\/wiki\/File:MSM_spotlights_Donald_Trump_vs._Hillary_Clinton_and_Bernie_Sanders_(24311159914).jpg\">[1]<\/a><br \/>\n<!--more--><\/p>\n<p>I am going to be using a file that is <a target=\"_blank\" href=\"https:\/\/github.com\/sqlshep\/SQLShepBlog\/blob\/master\/data\/USA.dataAll.csv\">on my github<\/a> site; it actually came from about a dozen different sources and has been whittled down to 54 variables, and as the data science process unfolds you will see it become even smaller.  As this is a toy dataset for demonstration purposes only, I am not going to reference too many of the sources, as they span many years, so <strong>making a decision based on this data would be a very bad idea. <\/strong> So, demo, toy data only! <\/p>\n<p>First things first, let's load the data, and then we will jump ahead and look at some pictures. <\/p>\n<pre><code>\r\nsetwd(\"\/Users\/Data\")\r\ngetwd()\r\n\r\ndata.Main <- read.csv(\"USA.dataAll.csv\",stringsAsFactors=FALSE,header=TRUE)\r\n\r\n#install.packages(\"car\")\r\nlibrary(car)\r\n\r\n#install.packages(\"ggplot2\")\r\nlibrary(ggplot2)\r\n\r\n<\/code><\/pre>\n<p>Data cleanup: get rid of the NAs.  Missing data is a complicated thing; I am choosing zeros over imputing the mean, median, or mode, or deleting the row. When you do this, choose wisely.<\/p>\n<pre><code>\r\n# Convert infinite values to NA, then replace all NAs with zero\r\nis.na(data.Main) <- sapply(data.Main, is.infinite)\r\ndata.Main[is.na(data.Main)] <- 0\r\n\r\n# New York County has two entries, only need one\r\ndata.Main <- data.Main[!(data.Main$X == \"1864\"),]\r\n<\/code><\/pre>\n<p>Check out your variable names and go play around; I have covered data discovery and investigation in prior blog posts, so I will not do it here. <\/p>\n<pre><code>\r\nnames(data.Main[6:26])\r\noptions(scipen=999)\r\n<\/code><\/pre>\n<p>I am going to jump way ahead to demonstrate what logistic regression is and how it is visualized. 
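<\/p>\n<p>For reference, the curve that stat_smooth draws below can also be fit explicitly with glm(). This is only a minimal sketch, assuming the Winner (0 or 1) and lg_Population columns described below; summary() exposes the coefficients and p-values we will care about later:<\/p>\n<pre><code>\r\n# Fit the binomial GLM directly rather than through ggplot\r\nfit <- glm(Winner ~ lg_Population, data = data.Main, family = binomial)\r\n\r\n# Coefficients, standard errors, and p-values\r\nsummary(fit)\r\n\r\n# Predicted probability for a hypothetical county with a log population of 10\r\npredict(fit, newdata = data.frame(lg_Population = 10), type = \"response\")\r\n<\/code><\/pre>\n<p>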
The following is a cumulative-distribution-function-style curve generated by glm, the generalized linear model, with binomial passed as the family. Notice that we are passing Winner, a 0 or 1 value, and lg_Population, the log of population, just so the data is a little friendlier in the graph; in fancier math terms, the log transform lets the distribution resemble something more normal, also known as a normal distribution.  <\/p>\n<p>What we are ultimately looking for here is the nice curve seen below and not a straight line. When we start looking at p-values you will notice that the lack of a curve and high p-values tend to go together; such a variable would add no value to the model.  <\/p>\n<pre><code>\r\nggplot(data.Main, aes(lg_Population, Winner, colour=Winner)) +\r\n  stat_smooth(method=\"glm\", method.args=list(family=\"binomial\"), se=FALSE) +\r\n  geom_point() +\r\n  ggtitle(\"Likelihood of Voting for Trump\")\r\n\r\n# If you want to see the difference: no log, with text labels for the outliers\r\nggplot(data.Main, aes(Population, Winner, colour=Winner)) +\r\n  stat_smooth(method=\"glm\", method.args=list(family=\"binomial\"), se=FALSE) +\r\n  geom_point() +\r\n  geom_text(data=subset(data.Main, Population > 3500000), aes(Population, Winner, label=county_name), angle=90, hjust=-.1) +\r\n  ggtitle(\"Likelihood of Voting for Trump\")\r\n<\/code><\/pre>\n<p>In short, the binomial glm shows the proportion of a county that voted for Trump plotted against population. So we can see that anything on the far left of the graph has a proportion (y-axis) of nearly 1, meaning nearly 100% of the county voted for Trump; being on the far left also means that these counties had a very small population on the x-axis.  
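<\/p>\n<p>That reading of the graph can be sanity-checked directly by bucketing counties into population deciles and taking the average of Winner in each bucket; a rough sketch, assuming the columns above (note that cut() will complain if many counties share the exact same population):<\/p>\n<pre><code>\r\n# Share of counties won by Trump within each population decile\r\ndata.Main$pop_decile <- cut(data.Main$Population,\r\n    breaks = quantile(data.Main$Population, probs = seq(0, 1, 0.1)),\r\n    include.lowest = TRUE, labels = 1:10)\r\naggregate(Winner ~ pop_decile, data = data.Main, FUN = mean)\r\n<\/code><\/pre>\n<p>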
This indicates that the lower the population of a county, the more likely it is to vote for Trump; as we move to the right you will see the line dropping, which indicates that as the population of a county increases, a smaller proportion is likely to vote for Trump.  <\/p>\n<p>Log Population<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.40.58-AM.png\" alt=\"\" width=\"826\" height=\"304\" class=\"aligncenter size-full wp-image-1100\" srcset=\"https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.40.58-AM.png 826w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.40.58-AM-300x110.png 300w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.40.58-AM-768x283.png 768w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.40.58-AM-624x230.png 624w\" sizes=\"auto, (max-width: 826px) 100vw, 826px\" \/><\/p>\n<p>If the log is not used, you can see the data appears to skew, but that is because we have a population outlier, Los Angeles County.  The curve is still evident.  This would indicate that for Logistic Regression we may have something here we can use for a prediction. 
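<\/p>\n<p>If you are recreating the log column yourself, it is a one-line transform; this is a sketch assuming lg_Population is simply the natural log of Population (the base actually used in the file is an assumption), with log1p guarding against zero populations:<\/p>\n<pre><code>\r\n# log1p(x) = log(1 + x), which avoids -Inf for any zero populations\r\ndata.Main$lg_Population <- log1p(data.Main$Population)\r\n<\/code><\/pre>\n<p>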
<\/p>\n<p>Population<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.42.52-AM.png\" alt=\"\" width=\"820\" height=\"297\" class=\"aligncenter size-full wp-image-1102\" srcset=\"https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.42.52-AM.png 820w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.42.52-AM-300x109.png 300w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.42.52-AM-768x278.png 768w, https:\/\/sqlshep.com\/wp-content\/uploads\/2018\/08\/Screen-Shot-2018-08-22-at-9.42.52-AM-624x226.png 624w\" sizes=\"auto, (max-width: 820px) 100vw, 820px\" \/><\/p>\n<p>The ggplot R code for all of the interesting columns is located on my github site in a <a href=\"https:\/\/github.com\/sqlshep\/SQLShepBlog\/tree\/master\/Notebooks\/Logistic%20Regression\">Jupyter notebook for R<\/a>.  Please download and play with it. And if you do not have R set up for Jupyter notebooks, there are <a href=\"https:\/\/www.datacamp.com\/community\/blog\/jupyter-notebook-r\">great instructions here<\/a>. <\/p>\n<p>In the next post we will build a model. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>I should probably continue the blog where I mentioned I would write about Logistic Regression. I have been putting this off for a while as I needed some time to pass between a paper my college stats team and I submitted and this blog post. Well, here we are. 
This is meant to demonstrate Logistic [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[72,9,63,11],"tags":[81,73],"class_list":["post-1077","post","type-post","status-publish","format-standard","hentry","category-logistic","category-r","category-regression","category-visualization","tag-cumulative-distribution-function","tag-logistic-regression"],"_links":{"self":[{"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/posts\/1077","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sqlshep.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1077"}],"version-history":[{"count":22,"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/posts\/1077\/revisions"}],"predecessor-version":[{"id":1086,"href":"https:\/\/sqlshep.com\/index.php?rest_route=\/wp\/v2\/posts\/1077\/revisions\/1086"}],"wp:attachment":[{"href":"https:\/\/sqlshep.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1077"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sqlshep.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1077"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sqlshep.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1077"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}