Category Archives: Statistics

Stats Stuff 3, Range, IQR

Published / by shep2010

The next topics (range, IQR, variance and standard deviation) took up a combined 120 PowerPoint slides in my stats class, which means describing them all in a single post will not happen; it will take at least two posts, but I will try to keep it under 120 slides or pages.

So, range, IQR (Interquartile Range), variance and standard deviation fall under summary measures as ways to describe numerical data.

Range – a measure of dispersion or spread. Using central tendency methods we can see where most of the data is piled up, but what do we know about the variability of the data? The range is simply the maximum value minus the minimum value.

What to know about range? It is sensitive to outliers. It is unconcerned about the distribution of the data in the set.

For instance, if I had a hybrid car in my mtcars dataset that achieved 120 mpg by the petrol standards set forth by the EPA, my range for mpg would be 10.40 mpg to 120 mpg. If I told you the cars in my sample had an mpg range of 10.40 mpg to 120 mpg, what would you think of the cars? What range fails to disclose is that the next highest mpg car is 33.9; that is pretty far away and not at all representative of the true dataset.

Run the following, try it out on your own data sets.


data(mtcars)
View(mtcars)

range(mtcars$mpg)
range(mtcars$wt)
range(mtcars$hp)

# if you are old school hard core, 
# "c" is to concatenate the results. 

c(min(mtcars$hp),max(mtcars$hp))
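To see how a single extreme value stretches the range, here is a small sketch that tacks a hypothetical 120 mpg hybrid onto the mpg column; the 120 value is made up purely for illustration.

# append a made-up 120 mpg car to the mpg values
mpg_with_hybrid <- c(mtcars$mpg, 120)

range(mtcars$mpg)        # 10.40 to 33.90
range(mpg_with_hybrid)   # now 10.40 to 120

# the two highest real values show how far away 120 sits from the rest
sort(mtcars$mpg, decreasing = TRUE)[1:2]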

Interquartile Range – since we have already discussed quartiles this one is easy; the interquartile range is simply the middle 50%, the values that reside between the 1st quartile (25%) and the 3rd quartile (75%). summary() and favstats() will give us the min (0%), Q1, Q2, Q3, and max (100%), as will quantile().


quantile(mtcars$mpg)
summary(mtcars$mpg)

library(mosaic)    # favstats() comes from the mosaic package
favstats(mtcars$mpg)
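Base R also has an IQR() function that returns Q3 minus Q1 directly, if the spread of the middle 50% is all you are after; a quick sketch:

IQR(mtcars$mpg)

# same thing computed by hand from the quartiles
quantile(mtcars$mpg, 0.75) - quantile(mtcars$mpg, 0.25)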

IQRs help us find outliers. An outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the dataset.

One of the techniques for removing outliers is to use the IQR to isolate the center 50% of the data. Let's use the Florida dataset from the scatterplot blog and see how the plot changes.

I am going to demonstrate the way I know to do this. Understand this is one method to perform this task; two years from now I will probably think this is amateurish, but until then, here we go.

We will need the first quartile and the third quartile; the IQR itself is the difference between the two. To get those values we are going to use summary().



florida <- read.csv("/Users/Shep/git/SQLShepBlog/FloridaData/FL-Median-Population.csv")

# Checkout everything summary tells us  
summary(florida)

# Now isolate the column we are interested in
summary(florida$population)

# Now a little R indexing,
# the values we are interested in are the 2nd and 5th position
# of the output so we just reference those
summary(florida$population)[2]
summary(florida$population)[5]

# load the values into a variable
q1 <- summary(florida$population)[2]
q3 <- summary(florida$population)[5]

#Now that we have the variables run subset to grab the middle 50% 
x<-subset(florida,population >=q1 & population <= q3)

#And let's run the scatterplot again (xyplot is from the lattice package)
library(lattice)

xyplot((population) ~ (MedianIncome), 
       data=x, 
       main="Population vs Income",
       xlab="Median Income",
       ylab = "Population",
       type = c("p", "smooth"), col.line = "red", lwd = 2,
       pch=19)

Notice what happened? By removing everything outside of the IQR our observations (rows) went from 67 counties to 33 counties; that is quite literally half the data identified as outliers by the IQR methodology. On the bright side our scatter plot looks a little more readable and realistic, and the regression looks similar but a bit more wiggly than before.
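As an aside, the more common textbook IQR rule is not quite this aggressive; it keeps anything within 1.5 times the IQR of the quartiles rather than only the middle 50%. A minimal sketch of that fence approach, reusing the q1 and q3 we already loaded:

# classic 1.5 * IQR fences, a looser cut than keeping only the middle 50%
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

fenced <- subset(florida, population >= lower & population <= upper)
nrow(fenced)   # how many counties survive this rule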

So what to do? When you wipe out half your data as outliers, this is when you need to consult the powers that be. In real life you will be solving a problem and there will be some guidance and boundaries provided. Since this is just visualization, the stakes are pretty low. If you are in the exploration and discovery phase, guess what, you just discovered something. If you are looking at this getting ready to build a predictive model, is throwing out 50% of the data as outliers the right decision? It's time to make a decision. The decision I am going to make is to try out a different outlier formula. How about we chop 5% off both ends and see what happens? If the dataset were every single county in the US, this might be different.

To do this we are going to need to use quantile().


# Using quantile will give us some control of the proportions 
# Run quantile first to see the results. 

quantile(florida$population,probs = seq(0, 1, 0.05))

#Load q05 with the value of quantile() at the 5th percentile
#Load q95 with the value of quantile() at the 95th percentile

q05 <- quantile(florida$population,probs = seq(0, 1, 0.05))[2]
q95 <- quantile(florida$population,probs = seq(0, 1, 0.05))[20]

#Create the dataframe with the subset 
x<-subset(florida,population >=q05 & population <= q95)

#try the xyplot again 
xyplot((population) ~ (MedianIncome), 
       data=x, 
       main="Population vs Income",
       xlab="Median Income",
       ylab = "Population",
       type = c("p", "smooth"), col.line = "red", lwd = 2,
       pch=19)
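For what it is worth, quantile() will also accept just the probabilities you care about, so the two cutoffs can be pulled in a single call without indexing into the full 5% sequence; a small sketch:

cuts <- quantile(florida$population, probs = c(0.05, 0.95))
q05 <- cuts[1]
q95 <- cuts[2]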

Did we make it better? We made it different. We also only dropped 8 counties from the dataset, so it was less impactful. You can see that some of these are not going to be as perfect or as easy as mtcars, and that's the point. Using the entire population of the US with the interquartile range may be a reasonable method for detecting outliers, but it's never just that easy. More often than not my real world data is not in a perfect mound with all the data within 2 standard deviations of the mean, also called the normal distribution. If this had been county election data, 5 of those 8 counties voted for Clinton in the last presidential election; if you consider that we tossed out 5 of the 9 counties she won, what is the impact of dropping the outliers? Keep in mind that 67 observations (rows) is a very small dataset too. The point is, always ask questions!

Take these techniques and go exploring with your own data sets.

Shep

Visualization, Scatterplot

Published / by shep2010

In the ongoing visualization show and tell, scatterplots have come up next on my list. As I write this blog I try very hard to check and double check my knowledge and methods, and I usually have a dataset or two in mind long before I get to the point of writing about it. This time, I wanted to use the mtcars dataset to play around and run a line through the scatter plot to show a trend; lo and behold, it has already been done to the exact spec I was thinking of. Truth be told I am not the first to do any of this; Google is your friend when learning R.

So with that, I will do one set with mtcars and send you to Quick-R Scatterplots for the rest. Be careful of some of the visualizations; while nothing will stop you from creating a 3D spinning scatter plot, it is considered chart junk, and there is a special website where the people who create those are honored. I bet you didn’t know there was a “wtf” domain, did you?

I did run into an interesting issue, though, that I will discuss today; it is a leap ahead, but it is important.

But let's get some scatterplots going first.

Below we have loaded the mtcars dataset and run attach(). attach() gives the ability to access the variables/columns of the dataset without having to reference the dataset name. So, instead of mtcars$cyl we can just reference cyl in functions after the attach. It has downsides, so be careful; it is sort of like a global anything in programming.


data(mtcars)
mtcars
attach(mtcars)

plot(mpg,wt)
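If you would rather not leave the dataset attached, with() gives you the same shorthand for a single call and cleans up after itself; a minimal sketch:

# same plot without attach(); the column names only resolve inside this call
with(mtcars, plot(mpg, wt))

# and detach() undoes an earlier attach() when you are done
# detach(mtcars)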


With the scatterplot we have two dimensions of data, on the left is the y axis, the weight of the vehicle in thousands of pounds, and on the bottom is the x axis, mpg or miles per gallon.

That was cool, no? Let's add “pch=19” to the next plot to make the dots a bit more visible, and add a line through the data. abline() draws a straight line through the plot using an intercept and slope; we can get the intercept and slope by passing wt and mpg into the linear model function lm(). Run lm(wt ~ mpg) by itself and see what you get. Make sure mpg and wt are in the order specified below, “mpg, wt” for the plot and wt ~ mpg for the lm function. If you dig into the lm() function you will see that we are passing in just a formula of “wt ~ mpg”; from the R documentation for lm(), y, the response variable, must come first. Much, much more on this later; just know for now, y must be first when using lm(), not x.


data(mtcars)
mtcars
attach(mtcars)

plot(mpg,wt,pch=19)
abline(lm(wt ~ mpg))
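If you want to see the intercept and slope that abline() is drawing, you can save the model and pull them out; a quick sketch, nothing beyond base R here:

fit <- lm(wt ~ mpg, data = mtcars)

coef(fit)      # the intercept and slope abline() uses
summary(fit)   # more detail than we need yet, but worth a look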

So, using our scatterplot and the lm function it would appear that as weight increases mpg decreases.

Well, that's all pretty cool, but a straight line through my data gives me an idea of the trend, and it can be misleading if the data is wiggly in the scatter plot or appears not to be trending.



plot(hp, mpg, main="Scatterplot Example", 
     xlab="Horsepower", ylab="MPG", pch=19)
lines(lowess(hp,mpg), col="blue") 

Using lines() with lowess() you can see that the line is a little more in tune with the trend of the data. LOWESS is locally weighted scatterplot smoothing; this is much more than just an intercept and slope. Depending on the package you are using, there is more than one way to get a line to fit the data.
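For instance, loess() from the stats package that ships with R fits the same sort of locally weighted curve as a model object, which you can then draw yourself; a sketch of that route:

fit <- loess(mpg ~ hp, data = mtcars)

ord <- order(mtcars$hp)   # sort so the line draws left to right
plot(mtcars$hp, mtcars$mpg, pch = 19, xlab = "Horsepower", ylab = "MPG")
lines(mtcars$hp[ord], predict(fit)[ord], col = "red", lwd = 2)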

Let's have a little fun. Hopefully you have played with the mtcars dataset a little bit and maybe even tried out some of the other base R datasets, or loaded your own. The best way to engage in topics like this is to use a dataset you have some passion or curiosity about.

I have a dataset for you on my github site, FL-Median-Population.

This dataset contains the following;

region – county name
CountyFipsCode – the Federal Information Processing Standard code that uniquely identifies the county.
population – American Community Survey estimated population.
MedianIncome – median income for the county.
CollegeDegree – percentage of residents that have completed at least an undergraduate degree.
College – the sum of the CollegeDegree percentage and the completed-some-college percentage.

We will be using xyplot from the lattice package and the dataset listed above; be sure to change the file location to wherever you put it, or use setwd() to set your working directory. For this we will start with just the population and median income.


install.packages("lattice")
library(lattice)
  
florida <- read.csv("/Blog/FloridaData/FL-Median-Population.csv")

xyplot(population ~ MedianIncome, 
       data=florida, 
       pch=19,
       main="Population vs Income",
       xlab="Median Income",
       ylab = "Populaton")

Hopefully your plot looks a little bit, or exactly, like this one. I think this is a good example because it is an imperfect sample. Hopefully on a good day you will get something more like this, and not complete randomness. One point to notice right away is that there is one dot way out of range of the rest at the top of population. Not hard to figure out that is the county Miami resides in, Miami-Dade, and then four other counties coming in at a population of over 1 million. You will also notice there is one county way off to the right in median income; that is St. Johns County, part of the Jacksonville metropolitan area. Now the median income of Jacksonville is lower than the median income of the county, so what could be going on there?

Hmmm, this just raises more questions. It just so happens, with a population of about 27,000, Ponte Vedra Beach has a median income of $116,000 according to Wikipedia, so this one city is dragging the average for the entire county of 220,257 up pretty significantly compared to other counties. So, with the population of Miami-Dade and the median income of St. Johns being far away from the rest of the data, these are what we call outliers. For now we are going to leave them in; in the next couple of blogs I will demonstrate a method for dealing with outliers. Clearly one like St. Johns County will need to be handled eventually.

So from looking at the scatter plot, we can kind of make out a general direction of the relationship of income to population, but it is sort of vague. In cases like this there are a few things to do.

1. When your data looks like it has gone to crazy town, try applying a transformation function for a better graphical representation. This will change the x/y scale, but will still represent the trend of the data. More on log transforms here.



xyplot(log(population) ~ log(MedianIncome), 
       data=florida, 
       main="Population vs Income",
       xlab="Median Income",
       ylab = "Population",
       pch=19)

2. Draw a linear regression line through it to see if there is a trend. We will get into lm() later, but it takes (y ~ x) as model input and returns an intercept and slope. Remember algebra? 🙂

There is more than one way to do this; xyplot uses the panel function, which is the much lengthier syntax.



xyplot(log(population) ~ log(MedianIncome), 
       data=florida, 
       main="Population vs Income",
       xlab="Median Income",
       ylab = "Population",
       panel = function(x, y, ...) {
       panel.xyplot(x, y, ...)
       panel.abline(lm(y~x), col='red',lwd=2)},
       pch=19)
##
## OR
##

plot(log(florida$MedianIncome),log(florida$population),pch=19)
abline(lm(log(florida$population) ~ log(florida$MedianIncome)))

3. In the previous sample we just used lm, a straight line with slope and intercept, to run a line through the data; that alone does show a trend. Even if you remove the log function you still see an upward trend: greater population seems to indicate greater income.

So let's try the LOWESS again, this time with xyplot. The type parameter below is set to c("p", "smooth"); "p" draws the points and "smooth" adds the LOWESS fit (locally weighted scatterplot smoothing), which takes our bumpy data and fits a smoothed line to it. You can read up on it; we will be hitting it thoroughly later on.



xyplot(log(population) ~ log(MedianIncome), 
       data=florida, 
       main="Population vs Income",
       xlab="Median Income",
       ylab = "Population",
       type = c("p", "smooth"), col.line = "red", lwd = 2,
       pch=19)

One thing appears to be somewhat clear from this: as the population of the county increases, the income does increase to a point, then it seems to stabilize. We will revisit this once we learn how to deal with outliers and see if it changes the trend.

There are a couple more columns in the Florida data provided that you can try on your own; see if you can visually show a relationship between college and income, or even college and population. Do more rural counties have a more or less college-educated population?

Damn Lies and Statistics

Published / by shep2010

Here are a few recent podcasts that are worth listening to. They all have a foundation in statistics, and if you are not questioning every statistic you hear in the media, you should be, and you will.

Knowing statistics and statistical learning breaks down a lot of barriers in day to day life. Correlation does not mean causation is the phrase that rings through my head every day. Just because two variables are moving together does not mean they affect each other.

On the other hand, after listening to the four podcasts below, it is not just a correlation problem, it is a results-that-cannot-be-reproduced problem. It is drug companies performing trials on the healthiest possible pool of subjects; imagine testing an asthma medication on an Olympic athlete that has asthma, do you think that would be representative of you and me?

Hidden Brain – Encore of Episode 32: The Scientific Process

Freakonomics- Bad Medicine, Part 1: The Story of 98.6

Freakonomics- Bad Medicine, Part 2: (Drug) Trials and Tribulations

Freakonomics- Bad Medicine, Part 3: Death by Diagnosis

Shep

Stats Stuff 2, Central Tendency

Published / by shep2010

We have determined that quantitative data is essentially numeric data, or “measuring data”. Quantitative data asks how much. Knowing that, we can start to look at statistical techniques to analyze this data. To do this we will need to get some definitions out of the way, and some demonstrations to help figure out what these things are. None of them are derived from magic, though sometimes they certainly appear that way.

There is a thing called central tendency, or the measure of central tendency. It is not very hard to derive the definition from the words, it is the tendency of the things to be centered, or the center or location of the distribution of data. For the most part, as the dataset we are using increases in size, it tends to have much of the data centered in a specific location. We measure this by using mean, median, and mode.

Central tendency will be one of the very important foundations of prediction; it is a principle that assumes the data falls within a certain distribution rather than being scattered all over the place. If you were to plot a histogram with many bins, the data would be mound shaped. That mound means we can probably predict new data based on certain factors; those factors will come later.
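A quick way to see that mound for yourself is to histogram a decent-sized sample; a minimal sketch using randomly generated data, so your picture will differ slightly from mine:

set.seed(42)   # only so the draw is repeatable
x <- rnorm(1000, mean = 50, sd = 10)

hist(x, breaks = 30)   # most of the data piles up near the center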

Mean – the mean is the average, sum the data and divide by the number of values.

(1+2+3+4+5+6+6+7) / 8 

mean(c(1,2,3,4,5,6,6,7))

Median – Basically the middle number in the data set. Using (1,2,3,4,5,6,6,7) we have eight numbers in the dataset, an even number, so we take the two middle numbers, add them, and divide by two; in this case the median is 4.5. In R you can run median(c(1,2,3,4,5,6,6,7)). If we had an odd number of values in the dataset, as in (1,1,1,2,9,9,9), “2” is the median. It is just the middle number.


median(c(1,2,3,4,5,6,6,7))
median(c(1,2,3,4,5,6,7))
median(c(1,1,1,2,9,9,9))

Mode – Mode is the number that shows up most often in the dataset. Mode is a little tricky; there is no built-in function for mode in base R. Stack Overflow has a nice mode reference here, which demonstrates the following:



Mode <- function(x) {
  ux <- unique(x)                         # the distinct values
  ux[which.max(tabulate(match(x, ux)))]   # the distinct value that occurs most often
}

Mode(c(1,1,1,2,9,9))  

If you are used to T-SQL that last thing may have tripped you up a bit, a function being stored in a variable. In R, functions are first-class objects, so they can be assigned to a name and passed around like any other value. We will get more into this at a later date, as this gets into some of the power of R.
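The heart of it is that a function in R is just another value you can assign to a name; a tiny sketch:

f <- mean          # no parentheses, we are handing over the function itself
f(c(1, 2, 3, 4))   # 2.5, same as calling mean() directly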

It's really not very hard, and we will be using central tendency a lot. Try it out on a few datasets that ship with base R, or your favorite dataset; graph it using hist() and see if you can spot some of the tendencies, and do they match what you would expect using just mean, median, and mode?



?cars
data(cars)
View(cars)

mean(cars$dist)
median(cars$dist)
Mode(cars$dist) #using the above Mode function

data(mtcars)
View(mtcars)

mean(mtcars$mpg)
median(mtcars$mpg)
Mode(mtcars$mpg)

# Use zoom in the plot pane to review the histogram, 
# check out the distributions
library(mosaic)
data(ChickWeight)
View(ChickWeight)

histogram(~weight | as.factor(Time), data=ChickWeight,type="percent")

Somewhere between this topic and the next topic there lives a thing called percentiles and quartiles.

Percentile - In statistics, a percentile is a measure indicating the value below which a given percentage of observations in a group fall. Yeah right, let's use an example. When Occupy Wall Street was all the rage they frequently held signs reading 99%; that meant that you and I are part of the 99 percent of income. Hence they were attempting to annoy anyone in the US who fell into the top 1 percent. According to this article from Investopedia, you would need to make more than $456,626 AGI to be in the one percent. From that we can determine that anyone who makes less than $456,626 AGI falls below the 99th percentile. To be at the 95th percentile you need to make $214,462 AGI per year. So, percentile is a measure of location. Where am I? Where are you?
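In R, quantile() hands you the cutoff value for any percentile you ask for; a small sketch using mtcars mpg as a stand-in, since we do not have the income data loaded:

quantile(mtcars$mpg, 0.99)                # the mpg value below which 99% of the cars fall
quantile(mtcars$mpg, c(0.25, 0.50, 0.75)) # the quartiles, same idea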

Quartiles - In descriptive statistics, the quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data. Clear as mud? Well, four equal groups makes sense. The three points are the 25%, 50%, and 75% marks, and they divide the data into four groups.

Let's look at some data. R has a base function summary(); summary is your friend. mtcars should be loaded already, and if it is not, you are a wizard at getting it into memory by now.



summary(mtcars$mpg)

#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  10.40   15.42   19.20   20.09   22.80   33.90 

Min is the minimum value of the dataset, 10.40
1st quartile, or 25th percentile, is 15.42
2nd quartile, or 50th percentile, or the median, is 19.20
3rd quartile, or 75th percentile, is 22.80
Mean, or average, is 20.09
Max, the maximum value of the dataset, is 33.90

Let's use a few more words to describe what's going on above. If the mpg of your car is 12, like my Jeep, you fall below the 25th percentile, in the bottom quartile. If you get 22 mpg, you are right around the 75th percentile and above both the median (19.20) and the mean (20.09). Why this matters will come up later, but be cognizant of where a datum falls.
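If you want to go the other direction and ask what percentile a given mpg sits at, ecdf() in base R does that; a quick sketch checking the 12 mpg and 22 mpg examples above:

pct <- ecdf(mtcars$mpg)   # the empirical cumulative distribution function
pct(c(12, 22))            # fraction of the cars at or below 12 and 22 mpg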

You can run summary against an entire dataset as well, not just a single column/variable; summary(mtcars) for instance. This will provide statistics for all columns/variables in the dataset.

From the mosaic package there is a function, favstats(), similar to summary() that you should check out.


 favstats(mtcars$mpg)
  
 # min     Q1 median   Q3  max     mean       sd  n missing
 #10.4 15.425   19.2 22.8 33.9 20.09062 6.026948 32       0

 

Notice that sd (standard deviation), n (the number of values in the dataset), and missing are also listed.

More on all of this from a different writer

Stats stuff 1 In the beginning

Published / by shep2010

There is no escaping it, you are going to have to know some stats stuff. Every now and then I will throw in a blog post to cover some of this; it's good for me to attempt to brain dump it, and it's good for you to try and assimilate it. Nothing is going to make sense without it. It only hurts a little bit, and I am going to try to give as many samples as possible and break down each concept as much as I possibly can. In the end you will be grateful for software, but stats came from longhand math calculations and has formulas that look like they could single-handedly put a rocket in space.

If you are an RDBMS gal or guy, some of the definitions will seem wonky, and if you are talking to an academic this will be the first communication breakdown; hopefully this will help.

Let's get some definitions out of the way. These are critical; it does not mean you need to question the lingo you use on a daily basis, but if you hear it from your statisticians or DS folks, make sure you are on the same page.

Population – A population is ALL THE DATA. As a data guy I lived in a world where, if we were trying to make a decision, we used all the data; this is why we had SQL performance problems, our customers sucked at statistics and did not know there was a way to sample a data set. That being said, I also would not like to check my bank account balance based on a sample. So, when you see a population referenced, that means the entire dataset, all of it. If I am using a population of the United States, that means I am using all 330 million or so people in the US for my research. In statistics, it is denoted as an uppercase N.

Census – A census is a study of EVERYTHING in a given population. Most countries have a census. One of the more popular results of the census is the American Community Survey; it frequently provides great statistical training and research material.

Sample – A sample is as it sounds, it's a sample of a population, hence not all the data. There are sub-classes of samples we will get into later, but for now know that a sample is a portion of a population. In statistics, it is denoted as a lowercase n.
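For what it is worth, drawing a sample in R is one line; a minimal sketch pulling 10 rows at random from mtcars, treating the 32 cars as the population:

set.seed(1)                          # only so the draw is repeatable
mtcars[sample(nrow(mtcars), 10), ]   # a sample of n = 10 from a population of N = 32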

Parameter – A parameter is a numerical quantity that tells us something about the population, such as the quantity of a specific ethnicity, the number of high school graduates, or the proportion of singles. Do not confuse this with a numerical quantity of a sample; that is called a statistic. Ah ha!

Variable – A variable in statistics is what most SQL folks, me included, call a column. There is a very long definition, but variables contain anything that describes a characteristic, qualitative or quantitative.

Case – A case in statistics is a single row of data. You can imagine that if you have a patient, all of the columns (variables) will make up the data for that patient, hence it is called a case. So, if you hear this outside of academia, check whether they are discussing a row or something else entirely. Usually this terminology references something sciencey, less so outside of medical research or scientific fields. I have never heard a row of banking data referred to as a case.

Data – Plural form of datum. Some get bent out of shape over the usage; I find them more annoying than the usage, I mean it is not like I am using there, their, and they're incorrectly.

Datum – Singular form of data.

Qualitative Data – In many respects this is an easy one: if it's not quantitative, it's probably qualitative. Another, more familiar name for it is categorical, or a category. Categorical data is defined by which? Which color, which model of, which dog breed, which grade are you in.

Looking back at the dataset we used for the Florida education choropleth, you will recall that there was a variable called ruralurbancontinuum; though it was a number, it was used as a categorical value. The values of this field are 1-9 and relate to a category of population density used US wide. In this case no math could be done against the value even though it is a number; the Census could just as easily have used A-I instead of 1-9.
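Here is a hedged sketch of the same idea in R: storing a numeric code as a factor makes it behave like a category, so you cannot accidentally average it. The codes below are made up for illustration, not the real ruralurbancontinuum values.

codes <- c(1, 9, 3, 1, 5)            # numeric codes standing in for categories
rural <- factor(codes, levels = 1:9)

table(rural)   # counting which categories occur makes sense
mean(rural)    # averaging does not; R warns and returns NA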

Quantitative Data – This one is pretty easy if you remember that you can do math on a quantitative variable. It is always a number. It can be my height, my weight, my pulse rate, the money in my checking account, my shoe size. If I have a population or sample of these items I can average them, get a standard deviation, etc. The word quantitative has the root quantity; that should help you remember it.

To go a little bit deeper into the rabbit hole, there are two types of quantitative data. I know, I'm sorry… Hopefully looking at the root word of each will help.

Quantitative Discrete – Counting data. It asks, how many?

How many people on a bus?
-There are 20 people on the bus, not 19.5 or 20.5.

How many cars in my driveway?
-There are 2 cars in my driveway and one in the yard; though most of that one is missing, it is still one car.

How many books do you own?
-I own 100 books, not 99.5, though an argument could be made for owning half a book, it is still one registered ISBN even if you do not have all of the book.

How many emails did you get today?
-I received 50 emails today.

Quantitative Continuous – Measuring data. It asks, how much?

What is your height?
What is your weight?
What is the weight of a vehicle?
What is the MPG of a vehicle?

The following sources help with statistics, FOR FREE

Khan Academy of course, as well as:

Introductory Statistics, OpenStax

OpenIntro Statistics

Australian Bureau of Statistics, go figure, their definitions are to the point