Author Archives: shep2010

Visualization, The Gateway Drug

Published / by shep2010

Visualization is said to be the gateway drug to statistics. In an effort to get you all hooked, I am going to spend some time on visualization. It’s fun (I promise), and I expect that after you see how easy some visuals are in R you will be off and running with your own data explorations. Data visualization is one of the Data Science pillars, so it is critical that you have a working knowledge of as many visualizations as you can, and are able to produce as many as you can. Even more important is the ability to identify a bad visualization, if for no other reason than to make certain you do not create one and release it into the wild. There is a site for those people; don’t be those people!

We are going to start easy. You have installed R Studio; if you have not, back up one blog and do it. Your first visualization is what is typically considered advanced, but I will let you be the judge of that after we are done.

Some lingo to learn:
Packages – Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data.

Choropleth – A thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per-capita income.

Below is the code for a choropleth, using the package choroplethr and the data set df_pop_county, which is the population of every county in the US.

This is what today’s primary objective is:

To learn more about any R command, use “?”, “??”, or “help(“object”)”. Keep in mind, R is case sensitive. If you can only remember part of a command name, use apropos().


?str
?df_pop_county
??summary
help(county_choropleth)
apropos("county_")


#Install package called choroplethr, 
#quotes are required, 
#you will get a meaningless error without them
#Only needs to be installed once per machine
install.packages("choroplethr")
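Since the install only needs to happen once per machine, a script you plan to rerun can check first and install only what is missing. This is a minimal sketch; the helper name missing_packages is my own invention, not something that ships with R or choroplethr.

```r
# Hypothetical helper (not part of R): returns the packages from `pkgs`
# that are not yet installed on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("choroplethr"))

# Only go out to CRAN when something is missing and a person is at the keyboard.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```

Run it as many times as you like; once choroplethr is installed, to_install comes back empty and nothing is downloaded.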

The library function will load the installed package to make any functions available for use.


library("choroplethr")

To find out what functions are in a package use help(package=””).

 
help(package="choroplethr")

Many packages come with test or playground datasets; you will use many in classes and many for practice. data(package=””) will list the datasets that ship with a package.


data(package="choroplethr")

For this example we will be using the df_pop_county dataset. This command will load it from the package, and you can verify it is available by checking the Environment Pane in R Studio.


data("df_pop_county")

View(“”) will open a view pane so you can explore the dataset. Similar to clicking on the dataset name in the Environment Pane.


View(df_pop_county)

Part of learning R is learning the features and commands for data exploration. str will provide you with details on the structure of the object it is passed.


str(df_pop_county)

summary (lowercase, R is case sensitive) will provide basic statistics about each column/variable in the object that it is passed.


summary(df_pop_county)

If your heart is true, you should get something very similar to the image above after running the following code. county_choropleth is a function that resides in the choroplethr package; it is used to generate a county level US map. The data passed in must be in the format of county number (the county FIPS code) and value; the value will populate the map. When the map renders it will be in the plot pane of the RStudio IDE. Be sure to select zoom and check out your work.



#?county_choropleth 
county_choropleth(df_pop_county)
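To make the "county number and value" format concrete, here is a small sketch of what such a data frame looks like. As far as I can tell, choroplethr looks for two columns named region and value; treat those names as an assumption and confirm them with ?county_choropleth. The values are made up, and the checker has_choropleth_shape is a hypothetical helper of mine, not part of the package.

```r
# Three real county FIPS codes with made-up values, for illustration only.
my_counties <- data.frame(
  region = c(48201, 48113, 6037),  # Harris TX, Dallas TX, Los Angeles CA
  value  = c(10, 20, 30)           # placeholder numbers, not real data
)

# Hypothetical helper: TRUE when a data frame has the two expected columns,
# a numeric region column, and no county listed twice.
has_choropleth_shape <- function(df) {
  all(c("region", "value") %in% names(df)) &&
    is.numeric(df$region) &&
    !anyDuplicated(df$region)
}

has_choropleth_shape(my_counties)
```

Compare the result of str(my_counties) with str(df_pop_county); if the shapes line up, your own data has a fighting chance of mapping.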

There are some additional parameters we can pass to the function; use help to find more.



county_choropleth(df_pop_county,
                  title = "Population Density",
                  legend="Population")

Try changing the number of colors and change the state zoom. If your state is not working read the help to see if you can find out why.



county_choropleth(df_pop_county,
                  title = "Population Density of Texas",
                  legend="Population",
                  num_colors=9,
                  state_zoom="texas")
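One likely reason a state refuses to zoom: as far as I can tell from the help, state_zoom wants the full state name in lowercase ("texas"), not "TX" or "Texas". Here is a small sketch that cleans up the input using the state.name and state.abb vectors that ship with R. The helper name to_state_zoom is my own, and the lowercase rule is my reading of the help page, so verify it there.

```r
# Hypothetical helper: turn "TX", "Texas", or " texas " into "texas".
# state.abb and state.name come with R's datasets package (50 states, no DC).
to_state_zoom <- function(x) {
  x <- trimws(x)
  idx <- match(toupper(x), state.abb)             # try two-letter abbreviations
  full <- ifelse(is.na(idx), x, state.name[idx])  # otherwise keep what was typed
  full <- tolower(full)
  unknown <- !(full %in% tolower(state.name))
  if (any(unknown)) {
    stop("Not a US state name: ", paste(x[unknown], collapse = ", "))
  }
  full
}

to_state_zoom(c("TX", "New York"))
```

With that in hand, something like county_choropleth(df_pop_county, state_zoom = to_state_zoom("TX")) should behave the same as typing "texas" by hand.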

There is an additional option for county_choropleth, reference_map. If it does not work for you do not fret; as of this blog post it is not working for me either. The last R upgrade whacked it. Be ready for this to happen and make sure you have backups and versions, especially before you get up on stage in front of 200 people to present.

There you have it! Explore the commands used, and look at the other datasets and functions that ship with choroplethr. It can be tricky to figure out which ones work, so be sure to check the help for each function you want to run; no help may mean it is no longer supported. Remember that these packages are community driven and written, which is good, but sometimes they can be slightly imperfect.

In the next post I will cover how to upload and create your own dataset and use the choroplethr function with your own data. On a side note, the choropleth falls under a branch of statistics called descriptive statistics, which covers visuals used to describe data.

Getting started with R

Published / by shep2010

First things first, you need to install R, more specifically, R Studio. For the near future everything I will be demoing and working on will be with R Studio; it runs on Mac, it runs on PC, it runs on Linux. I personally use a Mac. Anecdotal evidence seems to support that about 80-90% of academia shows up to class with a Mac, while adult learners show up with a PC more often than not, so make of that what you will. Don’t worry about SQL and R, or Revolution R yet; that will come up at a later date. We are going to walk before we run.

What I am not going to do is waste a blog with a thousand screen shots teaching you to install R, along with the long drawn out history; I gain little from knowing it and nothing from discussing it. It came from S and New Zealand, and has about 5000 add-in packages. Dozens of others have already done this in blogs and videos about the history, so go find one and hop to it.

Here is one to get you started, it has just the facts!

What I will say is that R is distributed through the Comprehensive R Archive Network (CRAN); this is where you go to get R. What you really want is R Studio. It’s a friendly IDE that, if you have ever used SSMS, will look a tiny bit familiar, and once you have been using it you will never want to use the base R or Microsoft R client again. R Studio is the de facto default R client. You can install both R and R Studio; I have them both, but I only ever use R Studio.

After it is installed, find a nice little video to give you a tour of the IDE or go exploring on your own. I normally do better when I do some exploring and then jump into a how-to video. IDEs over the ages have changed little. What we will be interested in is the R script tab (it will show up after you open or create a new R file), the Console, and then all the stuff on the right hand side: environment, history, files, plot, packages, help, and viewer. We will dive into all the tabs soon!

Get to it, come back when you have R Studio installed!

Comprehensive R Archive Network
R Studio

Shep

How do I Data Science?

Published / by shep2010

As we are bombarded with DS (data science) this and DS that, it is difficult to figure out what that means. There really is no Data Science degree, or at least there wasn’t until a couple of years ago; now Berkeley will certainly sell you a masters for $60k. I do not know anyone who has gone through it, but I know people who have inquired and now cannot get the aggressive phone calls to stop. The program does look sound, but I am afraid they slung the degree together to meet an immediate need, not to produce someone who could go toe to toe with an actual data scientist. DS and machine learning are still approaching the peak of the Gartner hype cycle. What that means to me is: be very wary of those trying to separate you from your money by offering the promise of DS nirvana.

So, what’s the point? Beware and question everything. I left Microsoft for the purpose of taking a two to three year sabbatical to fill in the academic gaps I have. I am doing it the hard way: I am taking stats, advanced stats, many graduate level classes, and doing lots of research. I am using Harvard Extension School for now; they have more than enough to keep me entertained. They do have a DS Graduate certificate that can be done in four classes; what they don’t mention is that a significant background in stats and web programming is encouraged as a prerequisite. For instance, one of the requirements was CS-171, which I am not sure they will be offering publicly in the future, but this was mixed with the regular Harvard folks. I was required to learn CSS, HTML, JavaScript, and jQuery in the first weeks of the class so I could do the D3 exercises. I have a critique of the class I will publish one day, but know the class is known to take 30+ hours a week. This is a true DS visualization class; graphics in Excel is not; Tableau, maybe.

Which gets me to the point of today’s blog: what is a data scientist, and what skills are required? The short answer is, it depends. The irony is that most of the data scientists I have met with a PhD in Visualization, Stats, math, computer science, Physics, bioinformatics (you name it) can serve in the role of Data Science if they have a strong data and statistics background, and yet many of them have trouble calling themselves data scientists. You heard it: many of the titled data scientists I have worked with don’t like to be called data scientists because they do not feel they meet the requirements of the perceived role. Which raises the question, what are the perceived skills of the role?

There are a few infographics floating around that discuss the skills of the Data folk. My favorite to date is the one DataCamp has published. Look at it on your own; I am not going to plagiarize someone else’s work. But look at the eight job titles and descriptions of skills: what is common among all of them? SQL is expected for all eight, and R for five of the roles. Keep in mind that this is the generic title of SQL, not just MS SQL Server. So it would seem that with SQL and R you can be the bridge to many functional roles.

For the 30,000 foot view of skills required to be an actual DS, and as you will notice a requirement for half the other roles as well, see the Modern Data Scientist infographic. I think this is the best version of the requirements for the role; it is not technology specific, but is knowledge specific. I like this infographic because you can apply it regardless of whether you are an MS shop or more open source. I personally think most SQL Server experts have already mastered Domain Knowledge and Soft Skills, though occasionally we may have issues with the collaborative part, especially on Mondays. But our goal, mine and yours, is not to master all four pillars of data science; that is what is called the Data Science Unicorn. Few can do it, and you really need to be in school until you are forty to master it. The goal is to be a master at one of these, awesome at a second pillar, and somewhat functional with the other two pillars. I am constantly surprised at how few data scientists can program, even in T-SQL; they just get by, if at all. I will cover some of that in a later blog. There is a reason it happens that way, but that is all the more opportunity for the SQL experts: we have one entire pillar already mastered, and are at least a quarter of the way through another.

I will say this though, the one thing you will not escape on this endeavor is Statistics and Probability. You are going to have to suck it up and take a refresher, or take it for the first time. The boundaries only exist in your mind; edX.org and MIT OpenCourseWare have all you need to get you started. If you want some pressure, take one through a local university. The reason I like Harvard Extension is that there are no entrance boundaries; if you want to take Stat-104, you pay for it and then log in and take it. Some classes are offered on campus, and some are with the really smart kids that got in through the front door, unlike me. Harvard Extension is considered the back door to Harvard. I don’t really care; I’ve been doing this too long to care about the credentials, and quite frankly I was glad to find a door at all.

Shep

In the beginning there was a Data guy

Published / by shep2010

Well, here we go, it’s finally time to take this horse out of the barn! I’m Shep; some of you know me, some of you do not. You can check my LinkedIn profile to learn more about me; I’m just a bloke. I have been screwing around with SQL Server since 1994 and SQL 4.21a. I am pretty sure we had the OS/2 version laying around somewhere, but we all ignored the purple ALR server and the dozen or so disks needed to install everything, until one night I got bored and decided to turn it on. Jump ahead a dozen or so years and I am working for Microsoft: five years as a SQL PFE, one year as a frontline Windows Engineer (lapse in judgement), and eventually the SQL Server Customer Advisory Team (SQLCAT) in Redmond, running the Lab. “PC (server) Recycle it” became my mantra. I took great joy in PC recycling an HP SuperDome. I know you’re clutching your pearls at the thought, especially considering the server cost $700,000 – $1,400,000. But the damn thing wouldn’t boot anymore, so what did you expect? For what it’s worth, HP brought it back, and the lab got another one.

While running the lab we had fewer customers than I would have liked to have seen since the entire world turned upside down and suddenly cloud computing was all the rage. SQL CAT became low priority, and everything cloud became everything. Most lab managers run for 1-2 years and move on to something else, many lab managers take the job for the sole purpose of getting on the CAT team, certainly a noble endeavor. I made it pretty clear up front that I never wanted to be on CAT, I just wanted to run the lab, so at the end of my two years I was at a precipice, leave CAT, leave Microsoft, do something else?

Like many SQL Server experts, once you do it long enough you are only really good at being a SQL Server expert, you know data, you know many data domains, you can think pretty fast on your feet, data and structures are living logic problems. Those all certainly seem like noble skills, so what’s next?

About ten years ago every C* suite executive started demanding big data after getting off an airplane having read CIO magazine, or some airline rag that regurgitates smart looking articles. The problem was they didn’t know what big data was; quite frankly, neither did we. What caused it was that we stopped deleting data, compression became mainstream in databases, devices started creating data, and people essentially became devices to be tracked. The web began to dominate everything, so your mouse became a device to be tracked: where was it, and what was it doing? Your phone, your car, your path through a mall: all of it was now data on how to sell more stuff. It is a toxic waste dump of unlimited data, all of it available to be mined and, sadly, hacked.

Well, Data Scientists and statisticians to the rescue, or so the line goes. Statistics and data science is claimed to be the sexiest job; I really don’t know what that means. How does one define a sexy job?

At the end of my CAT Lab tenure, R was in beta with SQL Server shipping in the CTPs, Azure ML (Machine Learning) was new and shiny and it certainly seemed like a good thing to invest in, so I put together a training plan for me and ended up working with the MS data scientists working on customer problems. The customer engagements were very much like a CAT engagement, phone call to triage, verify they had the data needed, make sure they have an actual question to be answered from the data, then a few of us would fly to the customer site and do a 5 day hackathon to attempt to give them a machine learning solution in Azure ML. Sometimes we could, sometimes we could not, but this gave me the opportunity to get my feet wet in the data science space and work with the Microsoft Data Science PhDs, which led me to quit Microsoft to focus full time on filling in the academic gaps needed to be good in that role.

But, why…? What I learned after working with customer SQL experts, their statisticians, their data scientists, and our data scientists is that 60%-80% of the time consuming portion of the job can be done by a SQL gal or guy. No Shit! The problem, I learned, was first that they are speaking very different languages: one is coming from academia and the other, quite frankly, from computer science or hard knock learning, much like me. Second, on average 80% of the job of the data scientist is data wrangling and feature engineering, and unless they have been doing it for a very long time and learned otherwise, they are trying to do all of it on their laptop in R or Python using a sample. That work is something SQL Server and most relational database engines are exceptionally good at and, if well written, very fast at. Most data scientists would love help in this area, but there is still the failure of communication; they are speaking different languages. SQL folks have the innate ability to break walls down between groups (if they want to), and there typically is no better data or domain expert than the SQL folks, so they would seem to be the perfect fit.

So the next question is how does one get there? Unfortunately it’s not just a 40 hour immersion class, but depending on how deep and how far you want to go, that can be the beginning!

This blog, and eventually the training and the public talks I am developing, will whittle away at the boundaries. Today the path looks intimidating, and since data science is the sexy new job, there are a lot of companies telling you that in 12 weeks you can be a data scientist, though if you dig deeper you will find many of their students have graduated with upper level math degrees. Microsoft is claiming tools like Azure ML will commoditize data science and bring it to the masses, but if you call and ask for statistical model help you will likely end up on the phone with a PhD, so there is a tiny discrepancy between what they are selling and what they are doing. But magic black box machine learning is industry wide. I do believe that ten years from now machine learning will be built into everything and the knowledge required to take advantage of it will be widespread.

I will be sharing my academic experiences, as much knowledge as I can articulate, where I think academia is blowing it, and where I think we are blowing it. I will also cover what classes are worthwhile, with lots and lots of samples in R and SQL along the way.

Shep

Public Data for Practice

Published / by shep2010

It is no secret that I will frequently use my blogs as a resource to collect and store my own thoughts and to remember where I put things, and I kinda figure if I need to learn something and write it down, others can probably benefit from it. This particular post will be a living post, as I am always finding new public data sources I need to remember. Some of my links will be duplicated in other links. Some of these will be required for some future blogs of R training scripts.

The Equality of Opportunity Project

Science and Engineering Doctorates

United States Education Data (maintained on the USDA site)
https://data.ers.usda.gov/reports.aspx?ID=18243

Social Security Data Files by Title
https://www.ssa.gov/policy/data_title.html

Florida Data
http://www.floridacharts.com/FLQuery/Population/PopulationRpt.aspx

Florida Election Watch
http://enight.elections.myflorida.com/

US Bureau of Labor Statistics
https://www.bls.gov/data/

Google Public Data
http://www.google.com/publicdata/

The New York Independent System Operator (power grid)
http://mis.nyiso.com/public/

Generically Awesome Public Datasets
https://github.com/caesar0301/awesome-public-datasets

Amazon Public Data Sets
http://aws.amazon.com/datasets/

Check out the Data Section
https://trello.com/b/rbpEfMld/data-science

Kaggle Datasets
https://www.kaggle.com/datasets

UCI Machine Learning Repository
http://archive.ics.uci.edu/ml/

Yahoo Datasets
http://webscope.sandbox.yahoo.com/#datasets

New York Public Library
http://www.nypl.org/research/collections/digital-collections/public-domain