First things first, you need to install R, more specifically, R Studio. For the near future everything I will be demoing and working on will be with R Studio, it runs on Mac, it Runs on PC, it runs on Linux. I personally use a Mac, anecdotal evidence seems to support that abut 80-90% of academia shows up to class with a Mac, adult learners show up with a PC more so than not, so make of that what you will. Don’t worry about SQL and R, or Revolution R yet, that will come up at a later date, we are going to walk before we run.
What I am not going to do is waste a blog with a thousand screen shots teaching you to install R and the long drawn out history, I gain little from knowing it and nothing from discussing it. It came from S and New Zealand, and has about 5000 add in packages. Dozens of others have already down this in blogs and videos about the history, go find one and hop to it.
What I will say, is that R is maintained by the Comprehensive R Archive Network, this is where you go to get R. What you really want is R Studio, it’s a friendly IDE that if you have ever used SSMS, will look a tiny bit familiar, and once you have been using you will never want to use the base R or Microsoft R client again. Rstudio is the Defacto default R client. You can install both, R and R Studio, I have them both, I only ever use R Studio.
After it is installed find a nice little video to give you a tour of the IDE or go exploring on your own, i normally do better when I do some exploring and then jump into a how to video. IDE’s over the ages have changed little. What we will be interested in is the R script tab(will show up after you open or create a new R file), Console, then all the stuff on the right hand side, environment, history, files, plot, packages, help, and viewer. We will dive into all the tabs soon!
Get to it, come back when you have R Studio installed!
As we are bombarded with DS (data science) this and DS that, it is difficult to figure out what that means. There really is no Data Science degree, or at least there wasn’t until a couple of years ago, now Berkeley will certainly sell you a masters for $60k, I do not know anyone who has gone through it, but I know people who have inquired and now cannot get the aggressive phone calls to stop. The program does look sound, but I am afraid they slung the degree together to meet an immediate need, not one that could go toe to toe with an actual data scientist. DS and machine learning is still approaching the peak of the Gartner hype cycle, what that means to me be very wary of those trying to separate you from your money offering the promise of DS nirvana.
So, what’s the point? Beware and question everything. I left Microsoft for the purpose of taking a two to three year sabbatical to fill in the academic gaps I have. I am doing it the Hard way, I am taking stats, advanced stats, many graduate level classes and lots of research. I am using Harvard Extension school for now, they have more than enough to keep me entertained, they do have a DS Graduate certificate that can be done in four classes, what they don’t mention is that a significant background in stats and programming web is encouraged as a prerequisite. For instance, one of the requirements was CS-171 which I am not sure they will be offering publically in the future, but this was mixed with the regular Harvard folks. I was required to learn CSS, HTML, JavaScript, and JQuery in the first weeks of the class so I could do the D3 exercises. I have a critique of the class I will publish one day, but know the class is known to take 30+ hours a week. This is a true DS visualization class, graphics in Excel is not, Tableau, maybe.
Which gets me to the point of todays blog, what is a data scientist, what skills do they need, what skills are required. The short answer is, it depends. The irony is most data scientist that I have met with a PhD in either Visualization, Stats, math, computer science, Physics, bioinformatics, (you name it) can serve in the role of Data Science if they have a strong data and statistics background, and many of them have trouble calling themselves data scientists. You heard it, many of the titled data scientist I have worked with don’t like to be called data scientist because they do not feel they meet the requirements of the perceived role. Which begs the question, what are the perceived skills of the role?
There are a few infographics floating around that discuss the skills of the Data folk. My favorite to date is the one DataCamp has published, look at it on your own, I am not going to plagiarize someone else’s work, but look at the eight job titles and descriptions of skills, what is common among all of them? SQL is expected for all eight, and R for five of roles. Keeping in mind that this is the generic title of SQL, not just MS SQL Server. So it would seem that having SQL and R you can be the bridge to many functional roles.
For the 30,000 foot view of skills required to be an actual DS, and as you will notice a requirement for half the other skills as well, see the Modern Data Scientist infographic. I think this is the best version of the requirements for the role, it is not technology specific, but is knowledge specific. I like this infographic because you can apply this regardless if you are an MS shop or more open source. I personally think the most SQL Server experts already have mastered Domain Knowledge and Soft Skill, though occasionally we may have issues with the collaborative, especially on Mondays. But our goal, me and you is not to master all four pillars of data science, that is what is called the Data Science Unicorn, few can do it, you really need to be in school until you are forty to master it. The goal is to be a master at one of these, awesome at a second pillar and somewhat functional with the other two pillars. I am constantly surprised at how few data scientist can program, even in T-SQL, they just get by if at all. I will cover some of that in a later blog, there is a reason it happens that way, but all the more opportunity for the SQL experts, we have one entire pillar already mastered, and are at least one quarter way through another.
I will say this though, the one thing you will not escape on this endeavor is Statistics, and Probability. You are going to have to suck it up, take a refresher, or take it for the first time. The boundaries only exist in your mind, Edx.org, MIT open courseware have all you need to get you started. If you want some pressure take one through a local university. The reason I like Harvard Extension is that there are no entrance boundaries, if you want to take Stat-104, you pay for it and then log in and take it. Some classes are offered on campus, some are with the really smart kids that got it through the front door, unlike me. Harvard Extension is considered the Back door to Harvard, I don’t really care, I’ve been doing this too long to care about the credentials, and quite frankly I was glad to find door at all.
Well, here we go, its finally time to take this horse out of the barn! I’m Shep, some of you know me some of you do not, you can check my linked in profile to learn more about me, I’m just a bloke. I have been screwing around with SQL Server since 1994 SQL 4.21a, I am pretty sure we had the OS2 version laying around somewhere but we all ignored the purple ALR server and the dozen or so disks needed to install everything, until one night I got bored and decided to turn it on. Jump a head a dozen or so years and I am working for Microsoft five years as a SQL PFE, one year as frontline Windows Engineer (lapse in judgement) and eventually the SQL Server Customer Advisory Team (SQLCAT) in Redmond running the Lab. “PC (server) Recycle it” became my mantra, I took great joy in PC recycling an HP SuperDome, I know you’re clutching your pearls at the thought, especially considering the server cost $700,000 – $1,400,000. But, damn thing wouldn’t boot anymore what did you expect. For what its worth, HP brought it back, and the lab got another one.
While running the lab we had fewer customers than I would have liked to have seen since the entire world turned upside down and suddenly cloud computing was all the rage. SQL CAT became low priority, and everything cloud became everything. Most lab managers run for 1-2 years and move on to something else, many lab managers take the job for the sole purpose of getting on the CAT team, certainly a noble endeavor. I made it pretty clear up front that I never wanted to be on CAT, I just wanted to run the lab, so at the end of my two years I was at a precipice, leave CAT, leave Microsoft, do something else?
Like many SQL Server experts, once you do it long enough you are only really good at being a SQL Server expert, you know data, you know many data domains, you can think pretty fast on your feet, data and structures are living logic problems. Those all certainly seem like noble skills, so what’s next?
About ten years ago every C* suite executive started demanding big data after getting off an airplane having read CIO magazine, or some airline rag that regurgitates smart looking articles. The problem was they didn’t know what big data was, quite frankly neither did we. What happened that caused it was we stopped deleting data, compression became main stream in databases, devices started creating data, people essentially became devices to be tracked, the web began to dominate everything, so your mouse became a device to be tracked, where was it, and what was it doing, your phone, your path through a mall was now data on how to sell more stuff, your car. It is a toxic waste dump of unlimited data, all of it available to be mined and sadly be hacked.
Well, Data Scientists and statisticians to the rescue, or so the line goes. Statistics and data science is claimed to be the sexist job, I really don’t know what that means, how does one define a sexy job?
At the end of my CAT Lab tenure, R was in beta with SQL Server shipping in the CTPs, Azure ML (Machine Learning) was new and shiny and it certainly seemed like a good thing to invest in, so I put together a training plan for me and ended up working with the MS data scientists working on customer problems. The customer engagements were very much like a CAT engagement, phone call to triage, verify they had the data needed, make sure they have an actual question to be answered from the data, then a few of us would fly to the customer site and do a 5 day hackathon to attempt to give them a machine learning solution in Azure ML. Sometimes we could, sometimes we could not, but this gave me the opportunity to get my feet wet in the data science space and work with the Microsoft Data Science PhDs, which led me to quit Microsoft to focus full time on filling in the academic gaps needed to be good in that role.
But, why…? What I learned after working with customer SQL experts, their statisticians, their data scientists, and our data scientist is that 60%-80% of the time consuming portion of the job can be done by a SQL gal or guy. No Shit! The problem I learned was that first off they are speaking very different languages, one is coming from academia and the other quite frankly is computer science, or hard knock learning much like me. Second, on average 80% of the job of the data scientist is data wrangling and feature engineering, and unless they have been doing for a very long time and learned, they are trying to do all of it on their laptop in R or Python using a sample, something that SQL Server and most relational database engines are exceptionally good at and if well written, very fast at. Most data scientist would love help in this area, but there is still the failure of communication, they are speaking different languages. SQL folks have the innate ability to break walls down between groups (if they want to), there typically is no better data or domain expert than the SQL folks, so they would seem to be the perfect fit.
So the next question is how does one get there? Unfortunately its not just a 40 hour immersion class, but depending on deep and how far you want to go, that can be the beginning!
This blog and eventually the training the public talks I am developing will whittle away at the boundaries. Today the path looks intimidating and since data science is the sexy new job, there are a lot of companies telling you that in 12 weeks you can be a data scientist, though if you dig deeper you will find many of their students have graduated with math upper level math degrees. Microsoft is claiming tools like Azure ML will commoditize data science and bring it to the masses, but if you call and ask for statistical model help you will likely end up on the phone with a PhD, so there is a tiny discrepancy in what they are selling and what they are doing, but, magic black box machine learning is industry wide. I do believe ten years from now that machine learning will be built into everything and the knowledge required to take advantage of it will be widespread.
I will be sharing my academic experiences, as much knowledge as I can articulate, where I think academia is blowing it, where I think we are blowing it. What classes are worthwhile, lots and lots of samples in R and SQL along the way.