I should probably continue the blog where I mentioned I would write about logistic regression. I have been putting this off for a while, as I needed some time to pass between a paper my college stats team and I submitted and this blog post. Well, here we are. This is meant to demonstrate logistic regression, nothing more. I am going to use election data because it is interesting, not because I care about proving anything. Census data, demographics, maps, and election data are all interesting to me, so that makes it fun to play with.
I learned logistic regression using SPSS, which is sort of a waste of time if you have no plans to use SPSS in life. My professor said “I don’t know how to do that in SPSS” a lot, and also said “look it up on Google” a lot. I expected better from Harvard Extension, but after I found out how easy it is to teach a class there, I suddenly wasn’t surprised. Point being, do as much research on instructors and professors beforehand as you possibly can; if there is no public data, or this is their first or second year teaching, maybe pass for something else. The class assembled and demanded a refund from Harvard, by the way. Harvard is a private school, and they did not amass a 37 billion dollar endowment by issuing refunds, so, buyer beware: we did not get our money back.
Yeah, levels were cute at Christmas, but I’m over it. Let’s do some prediction.
This whole series of posts started because the regression and data from the mtcars test dataset were useless at predicting the mpg of my truck. Looking at the dataset, it is not hard to figure out that it is useless for any modern vehicle; to be fair, it was not meant to be useful for that.
It’s fun when you find something new, and in R that can happen a LOT. MS SQL was frustrating in that, back in the day, you would wait years for a single feature to come out, then another decade for MS to get their shit together and finish the feature. Columnstore was a great example of a feature that took years to be released incrementally over several versions. Keep in mind CCI (clustered columnstore index) is my absolute favorite thing ever done in SQL, and it happens to be in Hadoop as well, as ORC and vectorization; the same guy PM’d and wrote both, in case you are interested. Open source does not suffer from this, so you frequently get new features or entirely new packages that solve problems and give you the ability to explore data and models in ways you never considered. I stumbled upon one such package yesterday.
olsrr appears to have been released late last year, so I am pleased it did not live out in the wild for too long before I found it. The last post having been on forward and backward stepwise regression, olsrr is the perfect continuation of that post. Not surprisingly, the author used the exact same dataset I did for his samples, mtcars. With that, there is nothing I can really add. Here is the link to the Variable Selection Methods.
I like the detail that the package goes into when explaining stepwise. Stepwise is awesome, but one thing I skipped over, which is covered at the very top of the post, is the dimensionality problem that can occur with stepwise. I used a total of 10 variables in the trained model; that equates to 2^10, or 1,024, possible models, which is a lot, and it’s super slow to calculate that many. What if your model had 100 variables and 100,000 rows in the training set? You are not going to do that on your laptop, but this will happen and you will solve it; scale-out is your friend in this case.
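To put a number on that dimensionality problem: with k candidate variables, each one is either in a model or out of it, so there are 2^k possible subsets. Here is a minimal sketch in Python (not R, and not how olsrr does it internally) that just counts and enumerates the subsets. The function names are made up for illustration; the column names are the ten mtcars columns other than mpg.

```python
from itertools import combinations

def count_candidate_models(k):
    """Number of distinct predictor subsets, including the intercept-only model."""
    return 2 ** k

def enumerate_subsets(predictors):
    """Yield every subset of the predictor names, smallest subsets first."""
    for size in range(len(predictors) + 1):
        yield from combinations(predictors, size)

# The ten mtcars columns other than mpg (the response).
predictors = ["cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb"]

print(count_candidate_models(len(predictors)))         # 1024
print(sum(1 for _ in enumerate_subsets(predictors)))   # 1024, counted the slow way
print(count_candidate_models(100))                     # far too many to fit one by one
```

Enumerating is only practical for small k; at 100 variables the count alone shows why you need a search strategy or a cluster rather than brute force.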
The more you read, the more you experiment and follow along, and the more data you try these features out on, the better you will get and the better you will understand.
More variables! For this one we are going to add all of the variables in their correct form in the data frame as qualitative or quantitative.
If you are starting with this post, let’s get the data loaded up, fix the column names, convert the categorical columns to factors, and create a column for our non-imperial friends so they can understand the mpg thing.
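For the metric column, the conversion itself is simple arithmetic. This is a small Python sketch under two assumptions: that mpg means miles per US gallon, and that the metric column is either km per liter or liters per 100 km (the excerpt does not say which one the post uses). The function names are hypothetical.

```python
KM_PER_MILE = 1.609344               # exact, by definition of the international mile
LITERS_PER_US_GALLON = 3.785411784   # exact, by definition of the US gallon

def mpg_to_km_per_liter(mpg):
    """Convert miles per US gallon to kilometers per liter."""
    return mpg * KM_PER_MILE / LITERS_PER_US_GALLON

def mpg_to_liters_per_100km(mpg):
    """Convert miles per US gallon to liters per 100 km (lower is better)."""
    return 100.0 / mpg_to_km_per_liter(mpg)

for mpg in (15.0, 21.0, 30.0):
    print(mpg, round(mpg_to_km_per_liter(mpg), 2), round(mpg_to_liters_per_100km(mpg), 2))
```

Note that liters per 100 km runs in the opposite direction from mpg, so a regression slope on that column would flip sign.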
In the last regression post we added more variables, but not all of them; I was holding back and not telling you why. So far we have been dealing with quantitative variables, which ask how many or how much. The next kind is qualitative, or categorical. A categorical variable usually asks which, and while it may be stored as a number, it would not make sense to perform math against it.
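One common way a regression handles a "which" variable is to turn it into 0/1 indicator columns, one per level, with one level held back as the baseline. The sketch below shows that idea in plain Python; it is an illustration of the general technique, not the exact encoding R builds for a factor, and both the helper name and the sample values are made up.

```python
def dummy_encode(values, baseline=None):
    """Return (level_names, rows) of 0/1 indicators, omitting the baseline level."""
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]          # assumption: first level in sort order is the baseline
    kept = [level for level in levels if level != baseline]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows

# Hypothetical cylinder counts: numbers, but really labels for "which engine".
cyl = [6, 4, 8, 4, 6]
names, rows = dummy_encode(cyl)
print(names)   # [6, 8]
print(rows)    # [[1, 0], [0, 0], [0, 1], [0, 0], [1, 0]]
```

A 4-cylinder row is all zeros, so its effect lives in the intercept, and the 6 and 8 columns measure the shift relative to it.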
One item that might be a tiny bit helpful is to realize that, as many moving parts as there are in regression, it all boils down to a pretty simple formula for calculating a prediction. More of this will be covered piece by piece in the coming posts, but I wanted one post that goes through the formulas up to now.
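For reference, the standard textbook form of that prediction formula is below. The symbols are the usual generic ones, not notation taken from the post: b_0 is the intercept, b_1 through b_k are the fitted coefficients, and x_1 through x_k are the explanatory variables. The second line is the one-variable case of predicting mpg from weight.

```latex
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k

\widehat{\mathrm{mpg}} = b_0 + b_1 \cdot \mathrm{wt}
```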
The next few posts are just adding some more explanatory variables to see if we can get a better model for predicting mpg. We are going to keep it simple today and focus on just quantitative variables, not categorical (qualitative) ones. If that does not make any sense to you, it will soon.
In the REAL world you would never predict a vehicle’s mpg by weight alone; there are dozens if not hundreds of other variables to consider. Lucky for us, the mtcars dataset only has 11 variables to consider. The grand finale of this linear regression series will be a real dataset we can play with from the EPA, with thousands of rows and dozens of columns. 😀
Let’s get to it!
Let’s try and bring simple linear regression together before I move on to multiple. We started with a question: can we predict miles per gallon using the weight of a vehicle? We looked at a scatter plot and saw a bit of linearity. We created a model, looked at the residuals, and determined that they are for the most part demonstrating constant variance. We also looked at a histogram of the residuals, and it is demonstrating enough of a normal distribution to move forward. I know, I’m not sounding very convincing, am I? It’s a small dataset and it’s for learning. It has some values that are out in left and right field, but those are actually useful, since I can use them to demonstrate some other points later in this post.
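The fit-then-check-residuals loop described above can be sketched in a few lines. This is plain Python using the textbook least-squares formulas, not the R code from the earlier posts, and the weight and mpg numbers are invented toy values, NOT rows from mtcars.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def residuals(x, y, intercept, slope):
    """Observed minus predicted value for each point."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Invented toy data: weight in 1000 lbs, fuel economy in mpg.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
mpg = [30.0, 26.0, 23.0, 19.0, 17.0]

intercept, slope = fit_simple_ols(weight, mpg)
res = residuals(weight, mpg, intercept, slope)
print(round(intercept, 2), round(slope, 2))   # 42.8 -6.6
print([round(r, 2) for r in res])
```

The residuals are what you would plot against the fitted values to eyeball constant variance, and histogram to eyeball normality.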
We need to talk about the p-value. The calculation for the p-value is a hot nightmare, and I am not going to bother with it right now. If you need to know more about it, you can find online calculators, but rarely an actual formula. Even the sites that will spill the beans on all the other formulas will resort to a t-distribution calculator for the p-value. I may fall back in a future post and spend some time on the t-distribution, though; we shall see.
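The reason calculators take over here is that the p-value is an area under the Student's t density, and that integral has no tidy closed form for general degrees of freedom. As a rough sketch of what such a calculator might do, the Python below integrates the density numerically with Simpson's rule. The function names and the step count are my own choices, and a real stats library would use a more careful routine.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def two_sided_p_value(t_stat, df, steps=10_000):
    """P(|T| >= |t_stat|), via Simpson's rule on [0, |t_stat|]; steps must be even."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3        # area between 0 and |t_stat|
    return max(0.0, 1.0 - 2.0 * central_area)

# t statistic = coefficient / standard error; df = rows minus estimated coefficients.
print(round(two_sided_p_value(2.228, 10), 3))   # about 0.05
```

For a simple regression on a 32-row dataset, df would be 30 (32 rows minus the intercept and the slope).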