Data Science is dead. Well, I'm not the first person to say that, and I am certainly not the first person to say that there is a lot of hype around Data Science being the sexiest job of the 21st century. I think we can all call BS now too. But why? Automated Machine Learning…
First off, I love this story, I think because I dug into the human elements. It started with Nicolo Fusi trying to create predictive models for CRISPR/Cas9. I will let you do your own research on that; it's a rabbit hole you may never come back from. Nonetheless, there is a podcast you can listen to on the MS Research site, "All About Automated Machine Learning with Dr. Nicolo Fusi."
After you have done all your reading and listening, the punch line is that AutoML made it into Azure ML Services in September, which is meant to be the replacement for Azure ML Desktop, which was unceremoniously killed in preview (it happens). But from where I sit, not much came from AutoML; very little fanfare aside from the announcement at Ignite.
This morning a post showed up in my feed about a new Azure ML Services Studio, which is different from the Azure ML Studio we have become accustomed to over the last 5 or so years. In Azure Machine Learning Services you create and select your compute power and the number of nodes to scale out to. I have spent only a cursory amount of time on compute and don't know how fancy you can get, but it looks super cool, though my primary interest today is AutoML.
So what do you start with after creating the workspace? Right off the bat you will notice something new: in the blade under Assets you have some new toys to play with if you are used to AML Studio.
So, let's try it out. I have spent a ton of time writing about regression with the EPA dataset; you basically use any regression or regression forest to predict MPG based on a dozen other variables. Not complicated: the data is easy to understand, and the decisions are easy to understand. Weight, horsepower, engine size, and number of cylinders have the greatest impact on MPG, keeping in mind we are looking at petrol engines only; no hybrid or electric, as they would require their own model.
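For context, this is roughly what the manual version of that looks like: a regression forest predicting MPG from those variables. A minimal sketch with scikit-learn; the rows below are hypothetical stand-ins, not actual EPA data.

```python
# Hand-rolled version of the task AutoML will automate: predict MPG
# from weight, horsepower, engine size, and cylinder count.
# NOTE: these rows are made-up illustrative values, not the EPA dataset.
from sklearn.ensemble import RandomForestRegressor

# columns: weight (lbs), horsepower, engine size (L), cylinders
X = [
    [2800, 120, 1.8, 4],
    [3500, 180, 2.5, 4],
    [4300, 280, 3.6, 6],
    [5200, 400, 5.0, 8],
]
y = [34, 28, 22, 16]  # MPG

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)
print(model.predict([[3000, 150, 2.0, 4]]))
```

Picking the model, tuning it, and comparing it to alternatives is exactly the part AutoML takes off your plate.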
The dataset I will be using can be found here if you want to play with it.
So, let's just start… Right above Assets you will see Automated machine learning; that is where we will start. From there you can create your experiment, name it, and create or select your compute.
Be careful when creating compute: these are real VMs backing your service, so if you choose 12 G5s and leave them up and running you will be getting a $100k Azure bill that you may not be looking for. On the bright side, the default compute quota should keep you from going bonkers, but even ten cores running for a month can be expensive.
I leave mine at 0-3 D1_V2 because I am cheap and doing simple models. If you need them, the N-series are your GPU machines, but that is out of scope for this post. Once you hit Create you will have a few minutes' wait while it provisions your machines. Leaving the minimum at zero makes sure one is not running 24×7 waiting for me to do something. The downside is I will have to wait for the compute to start once I run my experiment, which in my case will be the bulk of the execution time.
You get the option to upload your dataset and select it.
If you do nothing, this is what you will see. We will need to select Regression as the prediction task, so we have to help it a little, and indicate which value you want to predict in the Target Column drop-down. For this dataset you will want to disable a few columns: basically anything with a description or readable text, like the model, vehicle make, etc. Otherwise you will overfit like crazy; sure, you will get 100% accuracy on training data :-D, which is bad.
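Disabling those columns in the UI amounts to dropping them before training. A quick pandas sketch of the same idea; the column names here are hypothetical, not the EPA file's actual headers.

```python
# Drop free-text identifier columns before training; high-cardinality
# text like make/model lets a model memorize rows instead of learning.
# Column names are illustrative, not the real EPA headers.
import pandas as pd

df = pd.DataFrame({
    "make": ["Ford", "Honda", "BMW"],
    "model": ["Focus", "Civic", "330i"],
    "weight": [2900, 2750, 3500],
    "horsepower": [160, 158, 255],
    "mpg": [30, 33, 26],
})

text_cols = ["make", "model"]              # overfit bait: disable these
X = df.drop(columns=text_cols + ["mpg"])   # features only
y = df["mpg"]                              # the Target Column
print(list(X.columns))
```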
Now, you could just hit Run and see what happens. Or you can open Pandora's box by selecting Advanced…
So, here it is: Pandora's Azure ML Service Data Science box. These are all the settings you may want to adjust to make the training go faster, or to choose the method by which the model is selected.
Primary Metric is how the models are compared to each other as they are judged: r2 score (R-squared), Spearman correlation, normalized RMSE, or normalized MSE.
I will use R-squared, because everyone is familiar with it. The nice thing is you can test all of them, since you are not actually doing the work…
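If the dropdown options feel mysterious, they can all be computed by hand in a few lines. A sketch with sklearn and scipy, on illustrative numbers; I'm normalizing RMSE by the target range here, which is one common convention.

```python
# The candidate primary metrics, computed directly. Values are made up
# to show the mechanics, and the range-based normalization is an assumption.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([30.0, 22.0, 18.0, 27.0])
y_pred = np.array([29.0, 23.5, 17.0, 26.0])

r2 = r2_score(y_true, y_pred)                       # R-squared
rho = spearmanr(y_true, y_pred).correlation         # rank correlation
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
nrmse = rmse / (y_true.max() - y_true.min())        # normalized RMSE

print(r2, rho, nrmse)
```

Spearman only cares about rank order, which is why it comes out perfect here even though the predictions are off by a point or two.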
Once started you may spend a decent amount of time waiting…
So, it runs, now what?
You get the following after completion, probably ten minutes in my case: all the scores, and the best score was 0.923 using a Voting Ensemble.
If you drill into Voting Ensemble you can review all of the data: enough residuals to make you comfortable, the error rate, RMSE, and all the other model errors you are used to seeing or calculating. Notice on the right mid-screen you have the ability to download the model, which happens to be a pickle file that can be imported into anything pickle-friendly (Python, for instance).
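"Pickle-friendly" just means the download is a serialized Python object you can load and call anywhere. A sketch of the round trip, with a small sklearn model standing in for the AutoML download.

```python
# What using the downloaded .pkl looks like. A tiny LinearRegression
# stands in for the AutoML model here; the mechanics are identical.
import pickle
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[1], [2], [3]], [2, 4, 6])
blob = pickle.dumps(model)        # equivalent to the downloaded file's bytes

restored = pickle.loads(blob)     # what you'd do after downloading
print(restored.predict([[4]]))    # score new data anywhere Python runs
```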
If you are like me and fall into the category of Data/AI Engineer / Data Science hacker, the first thing I had to do was look up Voting Ensemble… Ensemble I knew (in short, multiple models); voting, however, I will need to spend some time on…
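For anyone else looking it up: scikit-learn has a VotingRegressor that captures the idea. Fit several different models, then combine their predictions (for regression the "vote" is an average). A minimal sketch on made-up numbers:

```python
# Voting ensemble in miniature: average the predictions of a linear
# model and a forest. Data below is illustrative, not the EPA set.
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

X = [[2800], [3200], [3600], [4000], [4400]]  # e.g. vehicle weight
y = [33, 29, 25, 21, 18]                      # MPG

ensemble = VotingRegressor([
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
ensemble.fit(X, y)
print(ensemble.predict([[3400]]))             # mean of the two models' guesses
```

The appeal is that different model families make different mistakes, so the average tends to beat any single member.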
So there you have it: no data scientist required. Total compute time was about 5 minutes on the slowest, cheapest machines I could get my hands on. To be fair, there were still days of data engineering for that dataset; it started with about 5,000 rows and was whittled down to roughly 1,000 useful rows, so there is still work to do in the space. But maybe with AutoML it will all be data engineering from now on.
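That whittling is the part AutoML doesn't do for you. A hedged sketch of the kind of filtering involved; the actual cleaning steps (and column names) for the EPA file were different and messier.

```python
# Illustrative data-engineering pass: keep only petrol rows and drop
# records missing the target or features. Hypothetical columns/values.
import pandas as pd

df = pd.DataFrame({
    "mpg": [30, None, 26, 31],
    "fuel": ["Gasoline", "Gasoline", "Electricity", "Gasoline"],
    "weight": [2900, 3100, 4000, None],
})

clean = (
    df[df["fuel"] == "Gasoline"]   # petrol only; EVs need their own model
      .dropna()                    # drop rows missing target or features
      .reset_index(drop=True)
)
print(len(clean))  # far fewer rows than we started with
```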
The first thing that comes to mind as a defense against using AutoML is that it can't do a complicated model. But remember, AutoML was designed initially to solve CRISPR/Cas9 problems; I'm pretty sure it can tackle most of the DS problems we have today…