Tag Archives: R

Linear Regression Level 100

Published / by shep2010

I think it’s difficult for a professor or teacher to know at exactly what point linear regression should be taught in a curriculum; it seems to turn up everywhere: calculus, algebra, stats, modeling. It should be in all of them, but then the next question is, do you need to know algebra, matrix algebra, and linear algebra before knowing how to do a linear regression? I don’t know, to be honest. Having worked with SQL for most of my adult life I have had to know and use all three, and did not pay much attention to it or realize it until I started formally beefing up my academics.

Regardless, the one thing I have heard from a few stats instructors is “don’t worry about how it’s done or how it works, the software will take care of it for you.” To be fair, these were not stats professors at the local beauty college; these were Ivy League educated (I checked) professors and teachers saying this. My problem is that if I don’t know how it works, I will probably never truly understand it. Depending on what you are doing, a trivial knowledge may be sufficient, but what if it’s not? If I am in an interview, can I use the words “the software will do it for me” as the answer to a hard question?

What?
In the next few posts, I will do my best to define Linear Regression in R using lm. Continue reading

Learn R Now, again

Published / by shep2010

One of the hardest things about learning anything new is finding resources that are worth your time, don’t cost thousands of dollars, and don’t suck. One of the things I have not done and will not do is teach base R. I will do demos, and I will explain some packages and functions along the way, but the basics of R are covered for free by resources that range from pretty good to excellent. When you are in the early stages of learning anything, anyone who knows more than you is a resource for you. Just make sure they know wtf they are talking about; that part is harder. The next hardest part is to use it every day!

Continue reading

Getting Started with RevoScaleR Connectivity and SQL

Published / by shep2010

In my head there is always a competition for which post comes next, and sometimes whether there will be a post at all. ODBC and RevoScaleR have been arguing, and it’s super annoying. ODBC was the last post: how you can connect to any version of SQL using just ODBC. If you did not go to the link I published, you can connect to Oracle, MySQL, PostgreSQL, and SQLite too. The point of that will become much clearer when you start querying MSDB job history so you can write your own R ggplot reports on job length and overlapping jobs (spoilers…). I will give you the code to get you started later, maybe tomorrow; I don’t know yet, it depends on who wins the next argument. For now it is connecting to SQL Server using the RevoScaleR package…
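Until that post lands, here is a minimal sketch of the kind of call it will build on, assuming you have the RevoScaleR package (it ships with Microsoft R Client and R Server) and a SQL Server instance to point at; the connection string and table name below are placeholders, not anything from a real environment.


library(RevoScaleR)

# placeholder connection string - swap in your own server, database, and authentication
sqlConnString <- "Driver=SQL Server;Server=MYSERVER;Database=AdventureWorks;Trusted_Connection=True"

# define a data source pointing at a table (sqlQuery= works for an ad hoc query instead)
sqlData <- RxSqlServerData(connectionString = sqlConnString, table = "Person.Person")

# peek at the structure and a few rows without pulling the whole table across
rxGetInfo(sqlData, getVarInfo = TRUE, numRows = 5)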

Continue reading

Getting Started with R and SQL (Regardless of SQL Version) Using ODBC

Published / by shep2010

So here you are: you know SQL, or you at least do something with it every day, and you are wondering what all the hoopla is about R and data science. Let’s break down the first barrier: R and data science actually have little to do with each other. R is a language; data science is an abstract field of work, sort of like saying “I am a computer scientist,” which narrows it down but not by much. What is your industry, what languages do you use, what is your education: hacker, bachelors, masters, PhD…? You can be a data scientist and never use R.

But we are going to use R, today, right now, get ready.
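If you cannot wait, here is a minimal sketch of what the ODBC route looks like with the RODBC package; the driver name, server, and query are placeholders for whatever your environment uses, and the full walkthrough is in the post.


library(RODBC)

# placeholder DSN-less connection - adjust the driver, server, and database for your setup
ch <- odbcDriverConnect("Driver={SQL Server};Server=MYSERVER;Database=master;Trusted_Connection=Yes")

# any T-SQL you can run in SSMS comes back as a data.frame
df <- sqlQuery(ch, "SELECT name, database_id, create_date FROM sys.databases")
head(df)

odbcClose(ch)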

Continue reading

Stats Stuff 6, Chebyshev's Rule

Published / by shep2010

Picking up from the last post, we will now look at Chebyshev's rule. For this one we will be using a ggplot2 histogram with annotations, just to shake things up a bit.

Chebyshev's rule, theorem, inequality, whatever you want to call it, states that for any dataset, regardless of shape, at least 75% of the data will fall within 2 standard deviations of the mean, and at least 88.89% within 3 standard deviations. This applies to mound shaped datasets as well as bimodal (two mounds) and multimodal ones.
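Those two percentages are not arbitrary; the general form of Chebyshev's inequality says that for any k greater than 1, at least 1 - 1/k^2 of the data falls within k standard deviations of the mean. A quick check in R shows where 75% and 88.89% come from:


# proportion guaranteed within k standard deviations: 1 - 1/k^2
k <- c(2, 3, 4)
1 - 1/k^2
# [1] 0.7500000 0.8888889 0.9375000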

First, below is the empirical rule R code from the last blog post, redone using ggplot2 if you are interested; otherwise skip it and move down. This is the script for the empirical rule calculations, still using the US-Education.csv data.


require(ggplot2)

usa <- read.csv("/data/US-Education.csv",stringsAsFactors=FALSE)
str(usa)

highSchool <- subset(usa[c("FIPS.Code","Percent.of.adults.with.a.high.school.diploma.only..2010.2014")],FIPS.Code >0) 

#rename the second column to something less annoying 
colnames(highSchool)[which(colnames(highSchool) == 'Percent.of.adults.with.a.high.school.diploma.only..2010.2014')] <- 'percent'


#create variables with the mean and the standard deviation 
hsMean <- mean(highSchool$percent,na.rm=TRUE)
hsSD <- sd(highSchool$percent,na.rm=TRUE)

#one standard deviation from the mean will "mean" one SD 
#to the left (-) of the mean and one SD to the right(+) of the mean. 
oneSDleftRange <- (hsMean - hsSD)
oneSDrightRange <- (hsMean + hsSD)
oneSDleftRange;oneSDrightRange
oneSDrows <- nrow(subset(highSchool,percent > oneSDleftRange & percent < oneSDrightRange))
oneSDrows / nrow(highSchool)


#two standard deviations from the mean will "mean" two SDs 
#to the left (-) of the mean and two SDs to the right(+) of the mean. 
twoSDleftRange <- (hsMean - hsSD*2)
twoSDrightRange <- (hsMean + hsSD*2)
twoSDleftRange;twoSDrightRange
twoSDrows <- nrow(subset(highSchool,percent > twoSDleftRange & percent < twoSDrightRange))
twoSDrows / nrow(highSchool)

#three standard deviations from the mean will "mean" three SDs 
#to the left (-) of the mean and three SDs to the right(+) of the mean. 
threeSDleftRange <- (hsMean - hsSD*3)
threeSDrightRange <- (hsMean + hsSD*3)
threeSDleftRange;threeSDrightRange
threeSDrows <- nrow(subset(highSchool,percent > threeSDleftRange & percent < threeSDrightRange))
threeSDrows / nrow(highSchool)

ggplot(data=highSchool, aes(highSchool$percent)) + 
  geom_histogram(breaks=seq(10, 60, by =2), 
                 col="blue", 
                 aes(fill=..count..))+
  labs(title="Completed High School") +
  labs(x="Percentage", y="Number of Counties") 

ggplot(data=highSchool, aes(highSchool$percent)) + 
  geom_histogram(breaks=seq(10, 60, by =2), 
                 col="blue", 
                 aes(fill=..count..))+
  labs(title="Completed High School") +
  labs(x="Percentage", y="Number of Counties") +
  geom_vline(xintercept=hsMean,colour="green",size=2)+
  geom_vline(xintercept=oneSDleftRange,colour="red",size=1)+
  geom_vline(xintercept=oneSDrightRange,colour="red",size=1)+
  geom_vline(xintercept=twoSDleftRange,colour="blue",size=1)+
  geom_vline(xintercept=twoSDrightRange,colour="blue",size=1)+
  geom_vline(xintercept=threeSDleftRange,colour="black",size=1)+
  geom_vline(xintercept=threeSDrightRange,colour="black",size=1)+
  annotate("text", x = hsMean+2, y = 401, label = "Mean")+
  annotate("text", x = oneSDleftRange+4, y = 351, label = "68%")+
  annotate("text", x = twoSDleftRange+4, y = 301, label = "95%")+
  annotate("text", x = threeSDleftRange+4, y = 251, label = "99.7%")

It would do no good to use the last dataset to try out Chebyshev's rule, as we know it is mound shaped and fit oddly well to the empirical rule. Now let's try a different column in the US-Education dataset.


usa <- read.csv("/data/US-Education.csv",stringsAsFactors=FALSE)


ggplot(data=usa, aes(usa$X2013.Rural.urban.Continuum.Code)) + 
  geom_histogram(breaks=seq(1, 10, by =1), 
                 col="blue", 
                 aes(fill=..count..))

Comparatively speaking, this one looks a little funky; it is certainly bimodal, if not nearly trimodal. This should be a good test for Chebyshev.

So, let's reuse some of the code above, drop the first standard deviation since Chebyshev has no use for it, and see if we can get this to work with "X2013.Rural.urban.Continuum.Code".




usa <- read.csv("/data/US-Education.csv",stringsAsFactors=FALSE)
str(usa)

urbanMean <- mean(usa$X2013.Rural.urban.Continuum.Code,na.rm=TRUE)
urbanSD <- sd(usa$X2013.Rural.urban.Continuum.Code,na.rm=TRUE)


#two standard deviations from the mean will "mean" two SDs
#to the left (-) of the mean and two SDs to the right(+) of the mean.
twoSDleftRange <- (urbanMean - urbanSD*2)
twoSDrightRange <- (urbanMean + urbanSD*2)
twoSDleftRange;twoSDrightRange
twoSDrows <- nrow(subset(usa,X2013.Rural.urban.Continuum.Code > twoSDleftRange & X2013.Rural.urban.Continuum.Code < twoSDrightRange))
twoSDrows / nrow(usa)

#three standard deviations from the mean will "mean" three SDs
#to the left (-) of the mean and three SDs to the right(+) of the mean.
threeSDleftRange <- (urbanMean - urbanSD*3)
threeSDrightRange <- (urbanMean + urbanSD*3)
threeSDleftRange;threeSDrightRange
threeSDrows <- nrow(subset(usa,X2013.Rural.urban.Continuum.Code > threeSDleftRange & X2013.Rural.urban.Continuum.Code < threeSDrightRange))
threeSDrows / nrow(usa)



ggplot(data=usa, aes(usa$X2013.Rural.urban.Continuum.Code)) + 
  geom_histogram(breaks=seq(1, 10, by =1), 
                 col="blue", 
                 aes(fill=..count..))+
                geom_vline(xintercept=urbanMean,colour="green",size=2)+
                geom_vline(xintercept=twoSDleftRange,colour="blue",size=1)+
                geom_vline(xintercept=twoSDrightRange,colour="blue",size=1)+
                geom_vline(xintercept=threeSDleftRange,colour="black",size=1)+
                geom_vline(xintercept=threeSDrightRange,colour="black",size=1)+
                annotate("text", x = urbanMean, y = 800, label = "Mean")+
                annotate("text", x = twoSDleftRange+1, y = 625, label = "68%")+
                annotate("text", x = threeSDleftRange+1.1, y = 425, label = "88.89%")


If you looked at the data and at the two standard deviation range above, you should see we have a problem: 98% of the data fell within 2 standard deviations. While that technically satisfies the at-least-75% that Chebyshev promises, it turns out this is a terrible example. The reason I include it is that it is just as important to see a test result that fails your expectation as it is to see one that is perfect! You will also notice that the third standard deviation is far outside the data range.

So, what do we do? Fake data to the rescue!

I try really hard to avoid using made up data because to me it makes no sense, whereas car data, education data, and population data all make sense. But there is no getting around it! Here is what you need to know: rnorm() generates random data from a normal distribution using a mean and standard deviation that you supply. But wait, we are trying to get a multimodal distribution. Then concatenate more than one normal distribution, eh? Let's try three.

We are going to test for one standard deviation just to see what it is, even though Chebyshev's rule has no interest in it; remember, the rule states that at least 75% of the data will fall within 2 standard deviations.


#set.seed() makes the random number generation repeatable, so you get the same data every time
set.seed(500)
x <- as.data.frame(c(rnorm(100,100,10)
                     ,(rnorm(100,400,20))
                     ,(rnorm(100,600,30))))

colnames(x) <- c("value")

#hist(x$value,nclass=100)
ggplot(data=x, aes(x$value)) + 
  geom_histogram( col="blue", 
                 aes(fill=..count..))

sd(x$value)    

mean(x$value)

#if you are interested in looking at just the first few values
head(x)



xMean <- mean(x$value)
xSD <- sd(x$value)


#one standard deviation from the mean will "mean" 1 * SD
#to the left (-) of the mean and one SD to the right(+) of the mean.
oneSDleftRange <- (xMean - xSD)
oneSDrightRange <- (xMean + xSD)
oneSDleftRange;oneSDrightRange
oneSDrows <- nrow(subset(x,value > oneSDleftRange & x$value < oneSDrightRange))
print("Data within One standard deviations");oneSDrows / nrow(x)


#two standard deviations from the mean will "mean" 2 * SD
#to the left (-) of the mean and two SDs to the right(+) of the mean.
twoSDleftRange <- (xMean - xSD*2)
twoSDrightRange <- (xMean + xSD*2)
twoSDleftRange;twoSDrightRange
twoSDrows <- nrow(subset(x,value > twoSDleftRange & x$value < twoSDrightRange))
print("Data within Two standard deviations");twoSDrows / nrow(x)


#three standard deviations from the mean will "mean" 3 * SD
#to the left (-) of the mean and three SDs to the right(+) of the mean.
threeSDleftRange <- (xMean - xSD*3)
threeSDrightRange <- (xMean + xSD*3)
threeSDleftRange;threeSDrightRange
threeSDrows <- nrow(subset(x,value > threeSDleftRange & x$value < threeSDrightRange))
print("Data within Three standard deviations");threeSDrows / nrow(x)

WOOHOO, multimodal! Chebyshev said it works on anything; let's find out. The histogram below is a hot mess based on how the data was created, but it is clear that the empirical rule will not apply here, as the data is not mound shaped and is multimodal, or at least trimodal.

Though Chebyshev's rule has no interest in 1 standard deviation, I wanted to show it just so you could see what 1 SD looks like. I challenge you to take the rnorm() calls and see if you can modify the mean and SD parameters passed in to make the data fall outside of the 75% at two standard deviations.

[1] "Data within One standard deviations" = 0.3966667 # or 39.66667%
[1] "Data within Two standard deviations" = 1 # or 100%
[1] "Data within Three standard deviations" = 1 or 100%

Let's add some lines;



ggplot(data=x, aes(x$value)) + 
  geom_histogram( col="blue", 
                    aes(fill=..count..))+
                    geom_vline(xintercept=xMean,colour="green",size=2)+
                    geom_vline(xintercept=twoSDleftRange,colour="blue",size=1)+
                    geom_vline(xintercept=twoSDrightRange,colour="blue",size=1)+
                    geom_vline(xintercept=threeSDleftRange,colour="black",size=1)+
                    geom_vline(xintercept=threeSDrightRange,colour="black",size=1)+
                    annotate("text", x = xMean, y = 65, label = "Mean")+
                    annotate("text", x = twoSDleftRange+75, y = 50, label = "68%")+
                    annotate("text", x = threeSDleftRange+85, y = 40, label = "88.89%")

There you have it! It is becoming somewhat clear that, based on the shape of the data and whether you are using the empirical rule or Chebyshev's rule, data falls into some very predictable patterns. Maybe from that we can make some predictions about new data coming in...?

Shep

Stats Stuff 5, Normal Distribution, Empirical Rule

Published / by shep2010

So, we have covered standard deviation and mean, discussed central tendency, and demonstrated some histograms. You are familiar with what a histogram looks like and know that, depending on the data, it can take many shapes. Today we are going to discuss a distribution rule that specifically applies to mound shaped data. We happen to have been working with a couple of datasets that meet this criterion perfectly, or at least they do in shape.

In the last blog, we had two datasets from US educational attainment that appeared to be mound shaped, that being the key phrase: mound shaped. If the data is mound shaped, we should be able to make some predictions about it using the empirical rule, and if it is not mound shaped, Chebyshev's rule.

The point of this, as stated in my stats class, is to link visualization of distributions to numerical measures of center and location. The empirical rule will only apply to mound shaped data, like the following;

When someone says mound shaped data, this is the textbook example of mound shaped. This is from the US-Education.csv data that we have been playing with; below are the commands to get you started and produce a histogram.

Just so you fully understand what this data is: every person in the US reports their level of educational attainment to the Census every ten years, and every few years this data is updated and projected to estimate reasonably current values. The data we will be using is for the 2010-2014 period, the five year average compiled by the American Community Survey. I highly encourage use of this website for test data; all of it has to be manipulated a little bit, but it typically takes minutes to get it into a format R can use.


usa <- read.csv("/data/US-Education.csv",stringsAsFactors=FALSE)
str(usa)

#While not required, I want to isolate the data we will be working with 
highSchool <- subset(usa[c("FIPS.Code","Percent.of.adults.with.a.high.school.diploma.only..2010.2014")],FIPS.Code >0) 

#rename the second column to something less annoying 
colnames(highSchool)[which(colnames(highSchool) == 'Percent.of.adults.with.a.high.school.diploma.only..2010.2014')] <- 'percent'

#Display a histogram 
hist(highSchool$percent
    ,xlim=c(5,60)
    ,breaks=20
    ,xlab = "Percent Completed High School "
    ,ylab = "Number of Counties"
    ,main = ""
    ,col = "lightblue")

The Empirical rule states that

68% of the data will fall within 1 standard deviation of the mean,
95% of the data will fall within 2 standard deviations of the mean, and
99.7% of the data will fall within 3 standard deviations of the mean.
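Those three percentages come straight from the normal curve; pnorm() gives the area under the standard normal to the left of a point, so the area within k standard deviations of the mean is pnorm(k) - pnorm(-k):


# area under the normal curve within 1, 2, and 3 standard deviations of the mean
pnorm(1) - pnorm(-1)   # 0.6826895
pnorm(2) - pnorm(-2)   # 0.9544997
pnorm(3) - pnorm(-3)   # 0.9973002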

Let's find out!


#create variables with the mean and the standard deviation 
hsMean <- mean(highSchool$percent,na.rm=TRUE)
hsSD <- sd(highSchool$percent,na.rm=TRUE)

#one standard deviation from the mean will "mean" one SD 
#to the left (-) of the mean and one SD to the right(+) of the mean. 
#lets calculate and store them
 
oneSDleftRange <- (hsMean - hsSD)
oneSDrightRange <- (hsMean + hsSD)

oneSDleftRange;oneSDrightRange


##[1] 27.51472 is one sd to the left of the mean
##[1] 41.60826 is one sd to the right of the mean

#lets calculate the number of rows that fall 
#between 27.51472(oneSDleftRange) and 41.60826(oneSDrightRange)
oneSDrows <- nrow(subset(highSchool,percent > oneSDleftRange & percent < oneSDrightRange))

# what's the percentage?
oneSDrows / nrow(highSchool)


If everything worked properly, you should have seen that the proportion of counties within one standard deviation of the mean is "0.6803778", or 68.04%. Well, that was kinda creepy, wasn't it? The empirical rule states that 68% of the data will be within one standard deviation.

Let's keep going.


#two standard deviations from the mean will "mean" two SDs 
#to the left (-) of the mean and two SDs to the right(+) of the mean. 
twoSDleftRange <- (hsMean - hsSD*2)
twoSDrightRange <- (hsMean + hsSD*2)


twoSDleftRange;twoSDrightRange

##[1] 20.46795 is two sds to the left of the mean
##[1] 48.65503 is two sds to the right of the mean

twoSDrows <- nrow(subset(highSchool,percent > twoSDleftRange & percent < twoSDrightRange))

twoSDrows / nrow(highSchool)

If your math is the same as my math, you should have gotten 95.09%; so far the empirical rule is holding...

What about three standard deviations?


threeSDleftRange <- (hsMean - hsSD*3)
threeSDrightRange <- (hsMean + hsSD*3)

threeSDleftRange;threeSDrightRange

threeSDrows <- nrow(subset(highSchool,percent > threeSDleftRange & percent < threeSDrightRange))

threeSDrows / nrow(highSchool)

99.32% at three standard deviations; it's like the empirical rule knows our data! Before we move on, let's add some lines...


hist(highSchool$percent
     ,xlim=c(5,60)
     ,breaks=20
     ,xlab = "Percent Completed High School "
     ,ylab = "Number of Counties"
     ,main = ""
     ,col = "lightblue")

abline(v = threeSDleftRange,col = "black",lwd = 3)
abline(v = threeSDrightRange,col = "black",lwd = 3)

abline(v = twoSDleftRange,col = "royalblue",lwd = 3)
abline(v = twoSDrightRange,col = "royalblue",lwd = 3)

abline(v = oneSDleftRange,col = "red",lwd = 3)
abline(v = oneSDrightRange,col = "red",lwd = 3)

abline(v = hsMean,col = "green",lwd = 4)


legend(x="topright",
       c("Mean","3 SDs 99.7%", "2 SDs 95%", "1 SD 68%"),
       col = c("Green","black", "royalblue", "red"),
       lwd = c(2, 2, 2, 2),
       cex=0.75
       )

You can see the distribution of the data below; it really does seem to fall into pretty predictable standard deviation bands.

It has frequently been my opinion, and others', that R was written by an angry teenager to get even with his boomer parents; while not entirely true, R has many frustrations. The nice thing is you can write your own package to handle many of these more complex visualizations. I stuck to base R for this histogram, and it does get the point across, but ggplot provides much better graphics and legends.

More Chebyshev in the next post!

Shep

R Markdown

Published / by shep2010

This is a slight diversion into a tool built into R called R Markdown; Shiny will be coming up in a few days. Why is this important? It gives you a living document you can add text and R scripts to, producing a report straight from R. I wrote my stats grad project using just R Markdown and saved it to a PDF, no Word or open office tools.

It's a mix of HTML and R, so if you know a tiny bit about HTML programming you will be fine; otherwise, use the R Markdown Cheat Sheet and Reference Guide, which I just annoyingly found out existed…

I am going to give you a full R Markdown document to get you started.

Create a new R Markdown file;

Then Run it by selecting the “Knit” drop down in the middle left of the toolbar and selecting Knit to HTML.

This will create an HTML document that you can open in a browser; it comes with some default mtcars data just so you can see some output. Try out some R commands and doodle around a bit before starting the code below. This is the data file we will be using, US-Education.csv. It contains just the 2010-2014 educational attainment estimates per county in the US.

In the code blocks below I will put in each section of the R Markdown and discuss it; each R code block can be moved to the R console to be run.

The first section is the title that will show up at the top of the doc; copy this into the markdown file and run it by itself. I am using an HTML style tag as I want some of the plots to be two columns across.

You will also see the first R command in an "R" chunk, identified by ```{r} and terminated with ```. Feel free to remove options and change options to see what happens.

Notice below that the style tag is wrong; when you copy it out you will need to put the "<" back in front of the style tag. If I format it correctly, WordPress takes it as an internal style tag for this post.


---
title: "Educational Attainment by County"

output: html_document
---

style>
  .col2 {
    columns: 2 200px;         /* number of columns and width in pixels*/
    -webkit-columns: 2 200px; /* chrome, safari */
    -moz-columns: 2 200px;    /* firefox */
     line-height: 2em;
     font-size: 10pt;

  }

/style>

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,warning=FALSE)

#require() loads the package like library(), but returns FALSE instead of an error if it is not installed
require(choroplethr)

```

This will be the next section in the markup: load a dataframe for each of the four educational attainment categories.


```{r one}

#Load data
 setwd("/data/")
 usa <- read.csv("US-Education.csv",stringsAsFactors=FALSE)

#Separate data for choropleth 
 lessHighSchool <- subset(usa[c("FIPS.Code","Percent.of.adults.with.less.than.a.high.school.diploma..2010.2014")],FIPS.Code >0)
 
highSchool <- subset(usa[c("FIPS.Code","Percent.of.adults.with.a.high.school.diploma.only..2010.2014")],FIPS.Code >0) 
 
someCollege <- subset(usa[c("FIPS.Code","Percent.of.adults.completing.some.college.or.associate.s.degree..2010.2014")],FIPS.Code >0)
 
college <- subset(usa[c("FIPS.Code","Percent.of.adults.with.a.bachelor.s.degree.or.higher..2010.2014")],FIPS.Code >0)

#rename columns for Choropleth
 
 colnames(lessHighSchool)[which(colnames(lessHighSchool) == 'FIPS.Code')] <- 'region'
 
 colnames(lessHighSchool)[which(colnames(lessHighSchool) == 'Percent.of.adults.with.less.than.a.high.school.diploma..2010.2014')] <- 'value'

# 
# or
#
 names(highSchool) <-c("region","value")
 names(someCollege) <-c("region","value")
 names(college) <-c("region","value")
 
 
```

The next section will create four histograms of educational attainment by category. Notice the distribution of the data: normal distribution, right skew, left skew, bimodal? We will discuss them in the next blog.

Notice that for the next section I have the "div" without the left "<"; be sure to put those back.



div class="col2">

```{r Histogram 1}
 hist(lessHighSchool$value,xlim=c(0,60),breaks=30, xlab = "Percent of High School Dropouts", ylab="Number of Counties",main="",col="lightblue")


 hist(highSchool$value,xlim=c(0,60),breaks=30, xlab = "Percent Completed High School ", ylab="Number of Counties",main="",col="lightblue")
 
```
 
 
```{r Histogram 2}

 hist(someCollege$value,xlim=c(0,50),breaks=30, xlab = "Percent Completed Associates or Some College ", ylab="Number of Counties",main="",col="lightblue")
 
 hist(college$value,xlim=c(0,90),breaks=30, xlab = "Percent Completed Bachelors Degree or Higher ", ylab="Number of Counties",main="",col="lightblue")


```

/div>


The next section is the choropleth for the high school dropouts; notice the R chunk parameters to size the plot area.



```{r two, fig.width=9, fig.height=5, fig.align='right'}


 county_choropleth(lessHighSchool,
                  
                   title = "Proportion of High School Dropouts",
                   legend="Proportion",
                   num_colors=9)
 
```

There are three more choropleths that you will have to do on your own! You have the data and the syntax. If you have trouble with this, the Rmd file I used is here: Education.rmd

In the end, you should have a histogram looking like this;

And if you make it to the first choropleth, the percentage that did not complete high school;

Stats Stuff 3, Range, IQR

Published / by shep2010

The next topics (range, IQR, variance, and standard deviation) took up a combined 120 PowerPoint slides in my stats class, which means that describing them all in a single post will not happen; it will take two posts minimum, but I will try to keep it under 120 slides or pages.

So, range, IQR (Interquartile Range), variance and standard deviation fall under summary measures as ways to describe numerical data.

Range – a measure of dispersion or spread. Using central tendency methods we can see where most of the data is piled up, but what do we know about the variability of the data? The range of the data is simply the maximum value minus the minimum value.

What to know about range? It is sensitive to outliers. It is unconcerned about the distribution of the data in the set.

For instance, if I had a hybrid car in my mtcars dataset that achieved 120 mpg by the petrol standards set forth by the EPA, my range for mpg would be 10.40 mpg to 120 mpg. If I told you the cars in my sample had an mpg range of 10.40 to 120, what would you think of the cars? What range fails to disclose is that the next highest mpg car is 33.9; that's pretty far away and not at all representative of the true dataset.

Run the following, try it out on your own data sets.


data(mtcars)
View(mtcars)

range(mtcars$mpg)
range(mtcars$wt)
range(mtcars$hp)

# if you are old school hard core, 
# "c" is to concatenate the results. 

c(min(mtcars$hp),max(mtcars$hp))
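If you want the spread as a single number rather than the two endpoints, wrapping range() in diff() subtracts the minimum from the maximum for you:


# max minus min as one number
diff(range(mtcars$mpg))
# [1] 23.5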

Interquartile Range – since we have already discussed quartiles this one is easy: the interquartile range is simply the middle 50%, the values that reside between the 1st quartile (25%) and the 3rd quartile (75%). summary() and favstats() will give us the min (0%), Q1, Q2, Q3, and max (100%), as will quantile().


quantile(mtcars$mpg)
summary(mtcars$mpg)

# favstats() comes from the mosaic package; install.packages("mosaic") if you do not have it
require(mosaic)
favstats(mtcars$mpg)
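Base R also has an IQR() function that returns the third quartile minus the first quartile directly, if all you want is that single number:


# interquartile range as a single value (Q3 - Q1)
IQR(mtcars$mpg)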

IQRs help us find outliers, an outlier being an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.

One of the techniques for removing outliers is to use the IQR to isolate the center 50% of the data. Let's use the Florida dataset from the scatterplot blog and see how the plot changes.

I am going to demonstrate the way I know to do this. Understand this is one method of performing the task; two years from now I will probably think it is amateurish, but until then, here we go.

We will need the first quartile and the third quartile so we can keep only the values between them. To do this we are going to use summary().



# xyplot() below comes from the lattice package
library(lattice)

florida <- read.csv("/Users/Shep/git/SQLShepBlog/FloridaData/FL-Median-Population.csv")

# Checkout everything summary tells us  
summary(florida)

# Now isolate the column we are interested in
summary(florida$population)

# Now a little R indexing,
# the values we are interested in are the 2nd and 5th position
# of the output so we just reference those
summary(florida$population)[2]
summary(florida$population)[5]

# load the values into a variable
q1 <- summary(florida$population)[2]
q3 <- summary(florida$population)[5]

#Now that we have the variables run subset to grab the middle 50% 
x<-subset(florida,population >=q1 & population <= q3)

#And lets run the scatterplot again 
xyplot((population) ~ (MedianIncome), 
       data=x, 
       main="Population vs Income",
       xlab="Median Income",
       ylab = "Population",
       type = c("p", "smooth"), col.line = "red", lwd = 2,
       pch=19)

Notice what happened? By removing everything outside of the IQR, our observations (rows) went from 67 counties to 33 counties; that is quite literally half the data getting flagged as outliers by this middle-50% methodology. On the bright side, our scatter plot looks a little more readable and realistic, and the regression looks similar but a bit more wiggly than before.
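As an aside, what most people mean by IQR-based outlier detection is a bit gentler than keeping only the middle 50%: it keeps anything within 1.5 times the IQR of the quartiles, the same fences a default boxplot draws. A quick sketch of that, reusing the q1 and q3 variables from above:


# Tukey-style fences: flag only points more than 1.5 * IQR outside the quartiles
iqr <- q3 - q1
lowerFence <- q1 - 1.5 * iqr
upperFence <- q3 + 1.5 * iqr

withinFences <- subset(florida, population >= lowerFence & population <= upperFence)
nrow(withinFences)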

So what do we do? When you wipe out half your data as outliers, that is when you need to consult the powers that be. In real life you will be solving a problem and there will be some guidance and boundaries provided. Since this is just visualization, the stakes are pretty low. If you are in the exploration and discovery phase, guess what, you just discovered something. If you are getting ready to build a predictive model, is throwing out 50% of the data as outliers the right decision? It's time to make a decision. The decision I am going to make is to try out a different outlier formula. How about we chop 5% off both ends and see what happens? If the dataset were every single county in the US, this might be different.

To do this we are going to need to use quantile().


# Using quantile will give us some control of the proportions 
# Run quantile first to see the results. 

quantile(florida$population,probs = seq(0, 1, 0.05))

#Load q05 with the results of quantile at the 5% percentile
#Load q95 with the results of quantile at the 95% percentile

q05 <- quantile(florida$population,probs = seq(0, 1, 0.05))[2]
q95 <- quantile(florida$population,probs = seq(0, 1, 0.05))[20]

#Create the dataframe with the subset 
x<-subset(florida,population >=q05 & population <= q95)

#try the xyplot again 
xyplot((population) ~ (MedianIncome), 
       data=x, 
       main="Population vs Income",
       xlab="Median Income",
       ylab = "Population",
       type = c("p", "smooth"), col.line = "red", lwd = 2,
       pch=19)

Did we make it better? We made it different. We also only dropped 8 counties from the dataset, so it was less impactful. You can see that some of these are not going to be as perfect or as easy as mtcars, and that's the point. Using the entire population of the US, the interquartile range may be a reasonable method for detecting outliers, but it's never just that easy. More often than not, my real world data is not in a perfect mound with all the data within 2 standard deviations of the mean, the so-called normal distribution. If this had been county election data: 5 of those 8 counties voted for Clinton in the last presidential election, and if you consider that we tossed out 5 of the 9 counties she won, what is the impact of dropping the outliers? Keep in mind that 67 observations (rows) is a very small dataset too. The point is, always ask questions!

Take these techniques and go exploring with your own data sets.

Shep

Visualization, The gateway drug II

Published / by shep2010

In the last blog you were able to get a dataset with county and population data to display on a US map and zoom in on a state, and maybe even a county if you went exploring.  In this demo we will be using the same choroplethr package but this time we will be using external data.  Specifically, we will focus on one state, and check out the education level per county for one state.

The data is hosted by the USDA Economic Research Division, under Data Products / County-level Data Sets. What will be demonstrated is the proportion of the population who have completed college; the datasets “completed some college”, “completed high school”, and “did not complete high school” are also available on the USDA site.

For this effort, you can grab the data off my GitHub site, or copy the data at the bottom of this blog post into a plain text file. Make sure you change the name of the file in the script below, or make sure the file you create is “Edu_CollegeDegree-FL.csv”.

Generally speaking, when you start working with GIS data of any sort you enter a whole new world of acronyms and, in many cases, mathematics to deal with the craziness. The choroplethr package eliminates almost all of this for quick and dirty graphics. county_choropleth takes two values: the first is the region, which must be the FIPS code for that county. If you happen to be working with states, then the FIPS state code must be used for region. To make it somewhat easier, the first two digits of the county FIPS code are the state code and the remainder is the county code for the data we will be working with.
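To make that FIPS split concrete, here is a tiny illustration using Alachua County, FL (FIPS 12001): pad the code to five digits and slice it.


# 12001: "12" is the Florida state code, "001" is the county portion
fips <- 12001
fipsPadded <- sprintf("%05d", fips)
substr(fipsPadded, 1, 2)   # "12"  - state
substr(fipsPadded, 3, 5)   # "001" - county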

So let’s get to it;

Install and load the choroplethr package


install.packages("choroplethr")
library(choroplethr)

Use setwd() to set the local working directory; getwd() will display the current R working directory.



setwd("/Users/Data")
getwd()


read.csv will read in a comma delimited file. “<-“ is the assignment operator, much like using “=”; the “=” can be used as well. Which assignment operator to use is a bit of a religious argument in the R community, and I will stay out of it.


# read a csv file from my working directory 
edu.CollegeDegree <- read.csv("Edu_CollegeDegree-FL.csv")


View() will open a new tab and display the contents of the data frame.


View(edu.CollegeDegree) 

str() will display the structure of the data frame, essentially the data types of its columns.


str(edu.CollegeDegree)

Looking at the structure of the dataframe we can see that the counties imported as factors; for this task it will not matter, as I will not need the county names, but in the future it may become a problem. To nip this, we will reimport using the stringsAsFactors option of read.csv. We will get into factors later, but for now we don't need them.


edu.CollegeDegree <- read.csv("Edu_CollegeDegree-FL.csv",stringsAsFactors=FALSE)

#Recheck our structure 
str(edu.CollegeDegree)

 

Now the region/county name is a character; however, there is actually more data in the file than we need. While we only have 68 rows, we have more columns/variables than we need. The only year I am interested in is CollegeDegree2010.2014, so there are several ways to remove the unwanted columns.

The following actually uses an index to include only columns 1, 2, 3, and 8, much like using column numbers in SQL instead of the actual column names; this can bite you if the order or number of columns changes. Though not required for this import, header=TRUE never hurts. You only need to run one of the commands below, but you can see two ways to reference columns.


edu.CollegeDegree <- read.csv("Edu_CollegeDegree-FL.csv", header=TRUE,stringsAsFactors=FALSE)[c(1,2,3,8)]

# or use the column names

edu.CollegeDegree <- read.csv("Edu_CollegeDegree-FL.csv", header=TRUE,stringsAsFactors=FALSE)[c("FIPS","region","X2013RuralUrbanCode","CollegeDegree2010.2014")]

#Lets check str again
str(edu.CollegeDegree)

Using summary() we can start reviewing the data from a statistical perspective. For the CollegeDegree2010.2014 variable, we can see the county with the lowest proportion of college graduates is 0.075, or 7.5% of the population of that county, while the max value is 44.3%. The average across all counties is 20.32% having completed college.



summary(edu.CollegeDegree)

Looking at the data we can see that we have a FIPS code, and the only other column we are interested in for mapping is CollegeDegree2010.2014, so let's create a dataframe with just what we need.


View(edu.CollegeDegree)

# the following will create a dataframe with just the FIPS and percentage of college grads
flCollege <- edu.CollegeDegree[c(1,4)]

# Alternatively, you can use the column names vs. the positions. Probably smarter ;-) 
flCollege <- edu.CollegeDegree[c("FIPS","CollegeDegree2010.2014")]

# display the new dataframe

flCollege 

But, from reading the help file on county_choropleth, it requires that only two variables (columns) be passed in: region and value. Region must be a FIPS code, so we need to rename the columns using colnames().



colnames(flCollege)[which(colnames(flCollege) == 'FIPS')] <- 'region'
colnames(flCollege)[which(colnames(flCollege) == 'CollegeDegree2010.2014')] <- 'value'

So, let's map it!

Since we are only using Florida, set the state_zoom; it will work without the zoom, but you will get many warnings. You will also notice a warning that 12000 is not mappable. Looking at the data, you will see that 12000 is the entire state of Florida.



county_choropleth(flCollege,
                  title = "Proportion of College Graduates ",
                  legend="Proportion",
                  num_colors=9,
                  state_zoom="florida")

For your next task, go find a different state and a different data set from the USDA, or anywhere else for that matter, and create your own map. Beware of the "value" column: it must be numeric, and sometimes these get imported as character if there is a comma in the number. This may be a good opportunity for you to learn about gsub and as.numeric; it would look something like the following command, where florida is the dataframe and MedianIncome is the column.



florida$MedianIncome <- as.numeric(gsub(",", "",florida$MedianIncome))


USDA Economic Research Division Sample Data



FIPS,region,2013RuralUrbanCode,CollegeDegree1970,CollegeDegree1980,CollegeDegree1990,CollegeDegree2000,CollegeDegree2010-2014
12001,"Alachua, FL",2,0.231,0.294,0.346,0.387,0.408
12003,"Baker, FL",1,0.036,0.057,0.057,0.082,0.109
12005,"Bay, FL",3,0.092,0.132,0.157,0.177,0.216
12007,"Bradford, FL",6,0.045,0.076,0.081,0.084,0.104
12009,"Brevard, FL",2,0.151,0.171,0.204,0.236,0.267
12011,"Broward, FL",1,0.097,0.151,0.188,0.245,0.302
12013,"Calhoun, FL",6,0.06,0.069,0.082,0.077,0.092
12015,"Charlotte, FL",3,0.088,0.128,0.134,0.176,0.209
12017,"Citrus, FL",3,0.06,0.071,0.104,0.132,0.168
12019,"Clay, FL",1,0.098,0.168,0.179,0.201,0.236
12021,"Collier, FL",2,0.155,0.185,0.223,0.279,0.323
12023,"Columbia, FL",4,0.083,0.093,0.11,0.109,0.141
12027,"DeSoto, FL",6,0.048,0.082,0.076,0.084,0.099
12029,"Dixie, FL",6,0.056,0.049,0.062,0.068,0.075
12031,"Duval, FL",1,0.089,0.14,0.184,0.219,0.265
12033,"Escambia, FL",2,0.092,0.141,0.182,0.21,0.239
12035,"Flagler, FL",2,0.047,0.137,0.173,0.212,0.234
12000,Florida,0,0.103,0.149,0.183,0.223,0.268
12037,"Franklin, FL",6,0.046,0.09,0.124,0.124,0.16
12039,"Gadsden, FL",2,0.046,0.086,0.112,0.129,0.163
12041,"Gilchrist, FL",2,0.027,0.071,0.074,0.094,0.11
12043,"Glades, FL",6,0.031,0.078,0.071,0.098,0.103
12045,"Gulf, FL",3,0.057,0.068,0.092,0.101,0.147
12047,"Hamilton, FL",6,0.055,0.059,0.07,0.073,0.108
12049,"Hardee, FL",6,0.045,0.074,0.086,0.084,0.1
12051,"Hendry, FL",4,0.076,0.076,0.1,0.082,0.106
12053,"Hernando, FL",1,0.061,0.086,0.097,0.127,0.157
12055,"Highlands, FL",3,0.081,0.097,0.109,0.136,0.159
12057,"Hillsborough, FL",1,0.086,0.145,0.202,0.251,0.298
12059,"Holmes, FL",6,0.034,0.06,0.074,0.088,0.109
12061,"Indian River, FL",3,0.107,0.155,0.191,0.231,0.267
12063,"Jackson, FL",6,0.064,0.081,0.109,0.128,0.142
12065,"Jefferson, FL",2,0.061,0.113,0.147,0.169,0.178
12067,"Lafayette, FL",9,0.048,0.085,0.052,0.072,0.116
12069,"Lake, FL",1,0.091,0.126,0.127,0.166,0.21
12071,"Lee, FL",2,0.099,0.133,0.164,0.211,0.253
12073,"Leon, FL",2,0.241,0.32,0.371,0.417,0.443
12075,"Levy, FL",6,0.051,0.078,0.083,0.106,0.105
12077,"Liberty, FL",8,0.058,0.08,0.073,0.074,0.131
12079,"Madison, FL",6,0.07,0.083,0.097,0.102,0.104
12081,"Manatee, FL",2,0.096,0.124,0.155,0.208,0.275
12083,"Marion, FL",2,0.074,0.096,0.115,0.137,0.172
12085,"Martin, FL",2,0.079,0.16,0.203,0.263,0.312
12086,"Miami-Dade, FL",1,0.108,0.168,0.188,0.217,0.264
12087,"Monroe, FL",4,0.091,0.159,0.203,0.255,0.297
12089,"Nassau, FL",1,0.049,0.091,0.125,0.189,0.23
12091,"Okaloosa, FL",3,0.132,0.166,0.21,0.242,0.281
12093,"Okeechobee, FL",4,0.047,0.057,0.098,0.089,0.107
12095,"Orange, FL",1,0.116,0.157,0.212,0.261,0.306
12097,"Osceola, FL",1,0.067,0.092,0.112,0.157,0.178
12099,"Palm Beach, FL",1,0.119,0.171,0.221,0.277,0.328
12101,"Pasco, FL",1,0.049,0.068,0.091,0.131,0.211
12103,"Pinellas, FL",1,0.1,0.146,0.185,0.229,0.283
12105,"Polk, FL",2,0.088,0.114,0.129,0.149,0.186
12107,"Putnam, FL",4,0.062,0.081,0.083,0.094,0.116
12113,"Santa Rosa, FL",2,0.098,0.144,0.186,0.229,0.265
12115,"Sarasota, FL",2,0.142,0.177,0.219,0.274,0.311
12117,"Seminole, FL",1,0.094,0.195,0.263,0.31,0.35
12109,"St. Johns, FL",1,0.085,0.144,0.236,0.331,0.414
12111,"St. Lucie, FL",2,0.081,0.109,0.131,0.151,0.19
12119,"Sumter, FL",3,0.047,0.07,0.078,0.122,0.264
12121,"Suwannee, FL",6,0.056,0.065,0.082,0.105,0.119
12123,"Taylor, FL",6,0.064,0.086,0.098,0.089,0.1
12125,"Union, FL",6,0.033,0.059,0.079,0.075,0.086
12127,"Volusia, FL",2,0.107,0.13,0.148,0.176,0.213
12129,"Wakulla, FL",2,0.018,0.084,0.101,0.157,0.172
12131,"Walton, FL",3,0.067,0.096,0.119,0.162,0.251
12133,"Washington, FL",6,0.04,0.063,0.074,0.092,0.114