We have determined that quantitative data is essentially numeric data, or “measuring data”; quantitative data asks “how much?”. Knowing that, we can start to look at statistical techniques to analyze it. To do that we will need to get some definitions out of the way, along with some demonstrations to help figure out what these things are. None of them are derived from magic, though sometimes they certainly appear that way.
There is a thing called central tendency, or the measure of central tendency. It is not very hard to derive the definition from the words: it is the tendency of the data to cluster around a center, or the center or location of the distribution of the data. For the most part, as the dataset we are using increases in size, much of the data tends to center around a specific location. We measure this using the mean, median, and mode.
Central tendency will be one of the very important foundations of prediction; it is a principle that assumes the data falls within a certain distribution rather than being scattered all over the place. If you were to plot a histogram with many bins, the data would be mound shaped. That mound means we can probably predict new data based on certain factors; those factors will come later.
Mean – the mean is the average: sum the data and divide by the number of values.
(1+2+3+4+5+6+6+7) / 8        # 4.25
mean(c(1,2,3,4,5,6,6,7))     # 4.25
Median – basically the middle number in the dataset. Using (1,2,3,4,5,6,6,7) we have eight numbers in the dataset, an even count, so we take the two middle numbers, add them, and divide by two; in this case the median is 4.5. In R you can run median(c(1,2,3,4,5,6,6,7)). If we had an odd number of values in the dataset, as in (1,1,1,2,9,9,9), the median is just the middle number, 2.
median(c(1,2,3,4,5,6,6,7))    # 4.5
median(c(1,2,3,4,5,6,7))      # 4
median(c(1,1,1,2,9,9,9))      # 2
Mode – the mode is the number that shows up most often in the dataset. Mode is a little tricky: there is no built-in function for it in base R. Stack Overflow has a nice mode reference here, which demonstrates the following:
Mode <- function(x) {
  ux <- unique(x)                          # the distinct values in x
  ux[which.max(tabulate(match(x, ux)))]    # return the value with the highest count
}
Mode(c(1,1,1,2,9,9))    # 1
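One detail worth knowing about this particular implementation (a behavior of which.max, not anything the post relies on): when two values are tied for most frequent, it returns whichever one appears first.

Mode(c(1,1,2,2,3))    # returns 1 - with a tie, which.max keeps the first maximum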
If you are used to T-SQL, that last thing may have tripped you up a bit: a function being stored in a variable. In R, functions are first-class objects, so they can be assigned to names and passed around like any other value. We will get more into this at a later date, as it gets into some of the power of R.
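As a quick illustrative sketch (my addition, not from the original post; the names my_stat and apply_stat are made up for the example), here is what that looks like in practice: a built-in function assigned to a new name, and a function passed as an argument to another function.

my_stat <- mean                      # assign the built-in mean function to a new name
my_stat(c(1, 2, 3, 4))               # 2.5

apply_stat <- function(x, f) f(x)    # a function that accepts another function
apply_stat(c(1, 2, 3, 4), median)    # 2.5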
It's really not very hard, and we will be using central tendency a lot. Try it out on a few datasets that ship with base R, or on your favorite dataset: graph it using hist(), see if you can spot some of the tendencies, and check whether they match what you would expect using just the mean, median, and mode.
?cars
data(cars)
View(cars)
mean(cars$dist)
median(cars$dist)
Mode(cars$dist) #using the above Mode function
data(mtcars)
View(mtcars)
mean(mtcars$mpg)
median(mtcars$mpg)
Mode(mtcars$mpg)
# Use zoom in the plot pane to review the histogram,
# check out the distributions
library(mosaic)
data(ChickWeight)
View(ChickWeight)
histogram(~ weight | as.factor(Time), data = ChickWeight, type = "percent")
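Since the paragraph above suggests base R's hist(), here is a minimal sketch (my addition, not part of the original code; the bin count and colors are arbitrary choices) that draws a histogram of mtcars$mpg and marks the mean and median so you can see where the center falls.

hist(mtcars$mpg, breaks = 10, main = "mtcars mpg", xlab = "mpg")
abline(v = mean(mtcars$mpg), col = "blue", lwd = 2)     # mean
abline(v = median(mtcars$mpg), col = "red", lwd = 2)    # median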
Somewhere between this topic and the next there live two things called percentiles and quartiles.
Percentile – in statistics, a percentile is the value below which a given percentage of the observations in a group fall. Yeah right, let's use an example. When Occupy Wall Street was all the rage, they frequently held signs reading “We are the 99%”, meaning that you and I are in the bottom 99 percent of income, below the 99th percentile. Hence they were attempting to annoy anyone in the US who fell into the top 1 percent. According to this article from Investopedia, you would need to make more than $456,626 AGI to be in the one percent; from that we can determine that anyone who makes less than $456,626 AGI is below the 99th percentile. Similarly, anyone who makes less than $214,462 AGI per year falls below the 95th percentile. So, a percentile is a measure of location. Where am I? Where are you?
Quartiles – in descriptive statistics, the quartiles of a ranked set of data values are the three points that divide the dataset into four equal groups, each comprising a quarter of the data. Clear as mud? Well, four equal groups makes sense. The three points are the 25%, 50%, and 75% marks, and they divide the data into those four groups.
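Base R's quantile() returns these cut points directly. A quick sketch (my addition) using mtcars$mpg, asking for the three quartiles and a couple of arbitrary percentiles:

quantile(mtcars$mpg, probs = c(0.25, 0.50, 0.75))    # the three quartiles: 15.425 19.200 22.800
quantile(mtcars$mpg, probs = c(0.90, 0.95))          # the 90th and 95th percentiles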
Let's look at some data. R has a base function, summary(); summary() is your friend. mtcars should be loaded already; if it is not, you are a wizard at getting data into memory by now.
summary(mtcars$mpg)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 10.40 15.42 19.20 20.09 22.80 33.90
Min. is the minimum value of the dataset, 10.40
1st Qu. is the 1st quartile (25th percentile), 15.42
Median is the 2nd quartile (50th percentile), 19.20
Mean is the average, 20.09
3rd Qu. is the 3rd quartile (75th percentile), 22.80
Max. is the maximum value of the dataset, 33.90
Let's use a few more words to describe what's going on above. If the mpg of your car is 12, like my Jeep, you fall below the 25th percentile (15.42). If you get 22 mpg, you are just under the 75th percentile (22.80) and above both the median (19.20) and the mean (20.09). Why this matters will come up later, but be cognizant of where a datum falls.
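If you want to check exactly where a value falls, base R's ecdf() gives the empirical cumulative distribution, i.e., the proportion of observations at or below a value. A quick sketch (my addition, not from the original post; the name pct is just for illustration):

pct <- ecdf(mtcars$mpg)    # empirical cumulative distribution function for mpg
pct(12)                    # proportion of cars at or below 12 mpg - roughly 0.06
pct(22)                    # proportion of cars at or below 22 mpg - roughly 0.72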
You can also run summary() against an entire dataset rather than just a single column/variable, summary(mtcars) for instance. This will provide statistics for every column/variable in the dataset.
The mosaic package has a function, favstats(), that is similar to summary() and worth checking out.
favstats(mtcars$mpg)
#  min     Q1 median   Q3  max     mean       sd  n missing
# 10.4 15.425   19.2 22.8 33.9 20.09062 6.026948 32       0
Notice that sd (standard deviation), n (the number of values in the dataset), and missing (the number of missing values) are also listed.
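Those extra columns map to base R functions you already have; a quick sketch (my addition) to confirm them by hand:

sd(mtcars$mpg)             # standard deviation - 6.026948, matching favstats
length(mtcars$mpg)         # n, the number of values - 32
sum(is.na(mtcars$mpg))     # missing values - 0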