Before I can move to the next post, I need to cover some tough problems for statistics and, more specifically, regression.
All of the work we do in statistical learning is based on the idea that we can predict y based on x. If I eat 10,000 calories (x) a day I will be fat (y), unless I am an Olympic swimmer apparently. So it does not always hold true, but with just one dependent and one independent variable it would appear to be an easy answer. Now if I added physical activity to the mix, the fat or not fat prediction might be more accurate.
Every now and then you will hear about a “study” that some new claim is made from, and the world falls apart for a few days talking about nothing else. My most recent favorite is the claim that diet soda makes you fat and gives you cardiovascular disease, hypertension, metabolic syndrome and type II diabetes. Whether you believe that or not, note for the sake of argument that the article does not mention level of activity per day, calories of food consumed per day, or, you know, lots of other stuff that could contribute. The study appears to make the claim that diet soda all by itself will cause all of these health problems. Peter Attia has started to write about these studies and the problems with them.
In short, an observational study means you are just watching and asking questions: no influencing, no treatment, no control. An observational study cannot be used to determine cause. You have surely heard that correlation does not imply causation; this is never more true than in observational studies, and even more so for a retrospective observational study, where you basically look back at data that was already collected. You can find stuff, but you cannot conclude cause. Don’t get me wrong, these studies are needed and necessary for research to move forward, but causation cannot be determined from them.
Which brings me to these fun little anomalies.
Spurious correlations are a hoot and a half. A spurious correlation is a mathematical relationship in which two or more events or variables are not causally related to each other, yet it may be wrongly inferred that they are. I can assure you that Nicolas Cage movies and pool drownings actually have nothing to do with each other, but they do correlate.
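To see how easily this can happen, here is a minimal sketch in Python with NumPy (not the R used elsewhere in these posts). The two series are made up for illustration: both simply drift upward over time with independent noise, which is only one of several ways a spurious correlation can show up. The names `series_a` and `series_b` and all of the numbers are assumptions, not real data.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

n = 200
t = np.arange(n)

# Two made-up series that share nothing except an upward drift over time.
# Their noise terms are drawn independently of each other.
series_a = 1.0 * t + rng.normal(0, 5, n)
series_b = 2.0 * t + rng.normal(0, 5, n)

# Correlation of the raw levels: dominated by the shared time trend.
r_levels = np.corrcoef(series_a, series_b)[0, 1]

# Correlation of the period-to-period changes: differencing turns the trend
# into a constant, so only the independent noise is left to correlate.
r_changes = np.corrcoef(np.diff(series_a), np.diff(series_b))[0, 1]

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The levels should correlate very strongly while the changes should barely correlate at all, even though neither series has any influence on the other.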
I will not copy the website below, but take a look at some of the stuff that correlates.
Confounding and/or Lurking variable; I see these used interchangeably, so it is a bit difficult to separate them. That being said, it is a variable that correlates with both the dependent and the independent variable, thus influencing the significance. If one is detected, its influence should be removed, typically by accounting for it in the model. You will typically see this as two variables that always trend together even though they have no direct impact on each other; they just happen to trend together. There is a great alternative example here, just the top part of the page.
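A small simulation makes the idea concrete. This is a sketch under stated assumptions, written in Python with NumPy rather than R: `z` is a hypothetical confounder that drives both `x` and `y`, and `x` has no direct effect on `y` at all. The helper `ols_coefficients` is my own illustrative function, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical setup: z is the confounder. Both x and y depend on z,
# and x has NO direct effect on y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 2.0 * z + rng.normal(size=n)


def ols_coefficients(predictors, response):
    """Least-squares coefficients; the intercept is at position 0."""
    design = np.column_stack([np.ones(len(response))] + list(predictors))
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


# Model 1: y ~ x, with the confounder left out.
slope_naive = ols_coefficients([x], y)[1]

# Model 2: y ~ x + z, with the confounder included.
slope_adjusted = ols_coefficients([x, z], y)[1]

print(f"slope on x without z: {slope_naive:.2f}")
print(f"slope on x with z:    {slope_adjusted:.2f}")
```

With the confounder left out, x looks like a strong predictor of y. Once z is in the model, the slope on x should drop to roughly zero, which is what "accounting for" the confounder means here.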
Collinearity or Multicollinearity; I’m totally copying the definition from PSU, they have great stuff btw: “when it exists, it can wreak havoc on our analysis and thereby limit the research conclusions we can draw”, and “multicollinearity exists whenever two or more of the predictors in a regression model are moderately or highly correlated”. In R you have to go slightly out of your way to detect collinearity using the VIF, the variance inflation factor. I may snag a highly collinear dataset in the next few posts to demonstrate it using the vif function against the model in R.
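Until then, here is a rough sketch of what a VIF calculation does, written in Python with NumPy rather than R. It uses the textbook definition: regress each predictor on all the other predictors, take that regression's R², and compute 1 / (1 - R²). The `vif` function below is my own illustrative helper, not the R function, and the data is simulated, with `x2` built on purpose to be nearly a copy of `x1`.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on the remaining predictors plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)


rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                  # unrelated to the other two

vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

A VIF near 1 means a predictor is not explained by the others. A commonly quoted rule of thumb treats values above 5 or 10 as a sign of trouble, so `x1` and `x2` should stand out here while `x3` should not.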
In the next few data sets I will be using, all of these factors need to be taken into consideration. We are going to use election data from the last election and a variety of statistics from the counties to see if it is possible to determine how a county voted based on some demographics of that county. For instance, if a county has a high unemployment rate, will that county swing Democrat or Republican? Or is this a confounding relationship? What about the number of high school dropouts in a county, will this influence presidential selection?