Linear Regression 103: p-value

We need to talk about the p-value. Calculating it by hand is a hot nightmare, so I am not going to bother with it right now. The short version: the coefficient is divided by its standard error to get a t-statistic, and the p-value is the tail area of a t-distribution beyond that value, which has no simple closed form. If you need to know more, you can find online calculators, but rarely an actual formula. Even the sites that spill the beans on all the other formulas resort to a t-distribution calculator for the p-value. I may circle back in a future post and spend some time on the t-distribution; we shall see.
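For the curious, here is a rough sketch of what those calculators are doing under the hood, using only the Python standard library. The function names (t_pdf, two_sided_p, slope_p_value) and the sample data are made up for illustration. It assumes a one-predictor regression, so the t-distribution has n - 2 degrees of freedom, and it gets the tail area by numerical integration (Simpson's rule), which loses precision for extremely small p-values. Treat it as a way to see the moving parts, not as a replacement for a stats library.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=10_000):
    """Two-sided p-value: area under the t density outside [-|t|, |t|]."""
    a = abs(t_stat)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The density is symmetric, so double the area on [0, |t|].
    inside = 2 * total * h / 3
    return max(0.0, 1.0 - inside)

def slope_p_value(x, y):
    """Slope, t-statistic, and p-value for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the standard error of the slope.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / std_err
    return slope, t_stat, two_sided_p(t_stat, n - 2)

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
slope, t_stat, p = slope_p_value(x, y)
print(f"slope={slope:.3f}  t={t_stat:.1f}  p={p:.2g}")
```

The only "hard" part is two_sided_p: everything before it is the same arithmetic used for the coefficient and its standard error. A quick sanity check is that a t-statistic of about 2.228 with 10 degrees of freedom should land very close to the familiar .05 cutoff.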

The p-value has been the subject of controversy because all we have is general guidance on evaluating it. Most stats classes will state that it should be < .05 for the coefficient to be considered significant. More specifically, "The P-value is the probability of observing a sample statistic as extreme as the test statistic." Or, from ISLR page 67: "a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response. Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response."

It all sounds very simple, so what's the controversy? You can hack it! If the p-value depends on the distribution of the data and the correlation between x and y, can I not remove rows of data to improve my odds of getting a low p-value? Or what if I had dozens or hundreds of columns and just wanted to find the ones that support my hypothesis? It's called data dredging.
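To see why hundreds of columns are a problem, here is a small simulation sketch in plain Python. Everything in it is hypothetical: the response and every column are pure random noise, so no column has any real relationship to y. The function names are made up, and the cutoff of 1.984 is an assumed t-table lookup for a two-sided .05 test with 98 degrees of freedom (100 rows minus 2), so |t| above it is the same as p < .05.

```python
import math
import random

def t_statistic(x, y):
    """t-statistic for the slope of a one-predictor regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)  # sample correlation
    # For a single predictor this equals slope / standard error.
    return r * math.sqrt((n - 2) / (1 - r * r))

def count_false_positives(n_rows=100, n_columns=200, t_critical=1.984, seed=0):
    """Count noise columns that pass the .05 test against a noise response."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_columns):
        # A fresh column with no real relationship to y.
        x = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(t_statistic(x, y)) > t_critical:
            hits += 1
    return hits

hits = count_false_positives()
print(f"{hits} of 200 unrelated columns look 'significant' at p < .05")
```

On average about 5% of the columns, roughly 10 of the 200, should clear the bar by chance alone. Report only those and you have a "finding" built entirely out of noise.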

Imagine for a moment that you need to test a new drug. If it is your drug, it would not be surprising that you want the trial to succeed: otherwise you may lose funding or lose your job, and maybe you have been working on this drug for ten years. Would you select patients who represent the average consumer, or patients with the highest likelihood of success? What if your only trial subjects were professional guinea pigs? What if your results cannot be reproduced? FiveThirtyEight did a good write-up on p-hacking, and it is about the very p-value we are using in linear regression.

Why does it matter to you? What if you work for one of these companies and you see something that is biased, not reproducible, or the product of data dredging? What will you do?

That being said, there is a school of thought that says the p-value should be thrown out as a measure of success, and I tend to agree. What's worse, what if the data dredging or bias becomes part of the production system that you sell to a consumer?

We will continue to use the p-value going forward, but we will open up to other measures as well. Imagine creating 30 models and just comparing one entire model to another. With one explanatory variable that hardly seems useful, but that changes a few posts from now.