Build, train, tune, and deploy machine learning and deep learning models in these end-to-end machine learning clouds
One of the last computing chores to be sucked into the cloud is data analysis. Perhaps it’s because scientists are naturally good at programming and so they enjoy having a machine on their desks. Or maybe it’s because the lab equipment is hooked up directly to the computer to record the data. Or perhaps it’s because the data sets can be so large that it’s time-consuming to move them.
Whatever the reasons, scientists and data analysts have embraced remote computing slowly, but they are coming around. Cloud-based tools for machine learning, artificial intelligence, and data analysis are growing. Some of the reasons are the same ones that drove interest in cloud-based document editing and email.
Teams can log into a central repository from any machine and do the work in remote locations, on the road, or maybe even at the beach. The cloud handles backups and synchronization, simplifying everything for the group.
But there are also practical reasons why the cloud is even better for data analysis. When the data sets are large, cloud users can spool up large jobs on rented hardware that accomplish the work much, much faster. There is no need to start your PC working and then go out to lunch only to come back to find out that the job failed after a few hours.
Now you can push the button, spin up dozens of cloud instances loaded with tons of memory, and watch your code fail in a few minutes. Since the clouds now bill by the second, you can save time and money.
There are dangers too. The biggest is the amorphous worry about privacy. Some data analysis involves personal information from subjects who trusted you to protect them. We’ve grown accustomed to the security issues involved in locking data on a hard drive in your lab. It’s hard to know just what’s going on in the cloud.
It will be some time before we’re comfortable with the best practices used by the cloud providers but already people are recognizing that maybe the cloud providers can hire more security consultants than the grad student in the corner of a lab. It’s not like personal computers are immune from viruses or other backdoors. If the personal computer is connected to the Internet, well, you might say it’s already part of the cloud.
There are, thankfully, workarounds. The simplest is to anonymize data with techniques like replacing personal information with random IDs. This is not perfect, but it can go a long way to limiting the trouble that any hacker could cause after slipping through the cloud’s defenses.
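Replacing personal identifiers with random IDs can be sketched in a few lines. This is a minimal, hypothetical example — the field name `email` and the record shapes are assumptions, and a real pipeline would also need to handle quasi-identifiers like zip codes and birth dates — but it shows the basic move: swap the identifier for a random token before upload, and keep the lookup table off the cloud.

```python
import secrets

def pseudonymize(records, key_field="email"):
    """Replace a personal identifier with a random ID, keeping a
    local lookup table so results can be re-linked later."""
    lookup = {}    # keep this mapping on-premises, never in the cloud
    cleaned = []
    for rec in records:
        original = rec[key_field]
        if original not in lookup:
            lookup[original] = secrets.token_hex(8)
        anon = dict(rec)               # copy so the source data is untouched
        anon[key_field] = lookup[original]
        cleaned.append(anon)
    return cleaned, lookup

# Hypothetical subject records
subjects = [
    {"email": "ada@example.com",  "score": 0.91},
    {"email": "alan@example.com", "score": 0.87},
    {"email": "ada@example.com",  "score": 0.66},
]
cleaned, lookup = pseudonymize(subjects)
```

Note that the same person always gets the same token, so analyses that group by subject still work on the anonymized copy.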
There are other interesting advantages. Groups can share or open source data sets to the general public, something that can generate wild combinations that we can only begin to imagine.
Some of the cloud providers are curating their own data sets and donating storage costs to attract users. If you like, you might try to correlate your product sales with the weather or sun spots or any of the other information in these public data sets. Who knows? There are plenty of weird correlations out there.
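Checking whether your sales track the weather is, at its simplest, a correlation calculation. Here is a small sketch of the Pearson coefficient over two invented series — the numbers are made up for illustration, and any real analysis would pull the second series from one of those public data sets.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical weekly product sales vs. average temperature
sales = [120, 135, 160, 155, 180, 210]
temps = [61, 64, 70, 69, 75, 82]
r = pearson(sales, temps)   # close to +1.0 for these invented numbers
```

A coefficient near +1 or -1 suggests a strong linear relationship — though, as the weird-correlations caveat above implies, correlation alone says nothing about causation.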
Here are seven different cloud-based machine learning services to help you find the correlations and signals in your data set.
Amazon SageMaker
Amazon created SageMaker to simplify the work of using its machine learning tools. Amazon SageMaker knits together the different AWS storage options (S3, Dynamo, Redshift, etc.) and pipes the data into Docker containers running the popular machine learning libraries (TensorFlow, MXNet, Chainer, etc.).
All of the work can be tracked with Jupyter notebooks before the final models are deployed as APIs of their own. SageMaker moves your data into Amazon’s machines so you can concentrate on thinking about the algorithms and not the process. If you want to run the algorithms locally, you can always download the Docker images for simplicity.
Azure Machine Learning
Microsoft has seen the future of machine learning and gone all-in on the Machine Learning Studio, a sophisticated graphical tool for finding signals in your data. It’s like a spreadsheet for AI. There is a drag-and-drop interface for building up flowcharts for making sense of your numbers. The documentation says that “no coding is necessary,” and while this is technically true, you’ll still need to think like a programmer to use it effectively.
You just won’t get as bogged down in structuring your code. But if you miss the syntax errors, the data typing, and the other joys of programming, you can import modules written in Python, R, or several other options.
The most interesting option is that Microsoft has added the infrastructure to take what you learn from the AI and turn the predictive model into a web service running in the Azure cloud. So you build your training set, create your model, and then in just a few clicks you’re delivering answers in JSON packets from your Azure service.
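Those JSON packets follow whatever schema your deployed service defines. The sketch below is hypothetical — the input column names and the exact response shape will depend on the model you publish — but it shows the general pattern of building a scoring request and pulling the prediction out of the response, using only the standard library.

```python
import json

# Hypothetical request payload; the real schema is generated from the
# model you deploy out of Machine Learning Studio.
request_body = json.dumps({
    "Inputs": {
        "input1": [
            {"age": 42, "income": 58000, "visits": 7},
        ]
    }
})

# A hypothetical JSON response carrying a predicted label and score.
response_body = (
    '{"Results": {"output1": '
    '[{"Scored Labels": "churn", "Scored Probabilities": 0.83}]}}'
)
result = json.loads(response_body)["Results"]["output1"][0]
prediction = result["Scored Labels"]          # "churn"
confidence = result["Scored Probabilities"]   # 0.83
```

In practice you would POST `request_body` to the service’s scoring URL with your API key in the headers; the parsing step stays the same.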
BigML
BigML is a hybrid dashboard for data analysis that can either be used in the BigML cloud or installed locally. The main interface is a dashboard that lists all of your files waiting for analysis by dozens of machine learning classifiers, clusterers, regressors, and anomaly detectors. You click and the results appear.
Lately the company has concentrated on new algorithms that enhance the ability of the stack to deliver useful answers. The new Fusion code can integrate the results from multiple algorithms to increase accuracy.
BigML is priced by subscription, with a generous free tier, when it runs on BigML’s own machines. You can also build out a private deployment on AWS, Azure, or GCP. If that’s still too public, BigML will deploy the platform on your own servers.
Databricks
The Databricks toolset is built by some of the developers of Apache Spark, who took the open source analytics platform and added some dramatic speed enhancements, increasing throughput with some clever compression and indexing.
The hybrid data store called Delta is a place where large amounts of data can be stored and then analyzed quickly. When new data arrives, it can be folded into the old storage for rapid re-analysis.
All of the standardized analytical routines from Apache Spark are ready to run on this data but with some well-needed improvements to the Spark infrastructure like integrated notebooks for your analysis code.
Databricks is integrated with both AWS and Azure and priced according to consumption and performance. Each computational engine is measured in Databricks Units. You’ll pay more for a faster model.
DataRobot
Many of the approaches here let you build a machine learning model in one click. DataRobot touts the ability to build hundreds of models simultaneously, also with just one click. When the models are done, you can pick through them, figure out which does the best job of predicting, and go with that. The secret is a “massively parallel processing engine,” in other words, a cloud of machines doing the analysis.
DataRobot is expanding by implementing new algorithms and extending current ones. The company recently acquired Nutonian, whose Eureqa engine should enhance the automated machine learning platform’s ability to create time series and classification models. The system also offers a Python API for more advanced users.
DataRobot is available through the DataRobot Cloud or through an enterprise software edition that comes with an embedded engineer.
Google Cloud Machine Learning Engine
Google has invested heavily in TensorFlow, one of the standard open-source libraries for finding signals in data, and now you can experiment with TensorFlow in Google’s cloud. Some of the tools in the Google Cloud Machine Learning Engine are open source and essentially free for anyone who cares to download them and some are part of the commercial options in the Google Cloud Platform.
This gives you the freedom to explore and avoid some lock-in because much of the code is open source and more or less ready to run on any Mac, Windows, or Linux box.
There are several different parts. The easiest place to begin may be the Colaboratory, which connects Jupyter notebooks with Google’s TensorFlow back end so you can sketch out your code and see it run. Google also offers the TensorFlow Research Cloud for scientists who want to experiment. When it’s appropriate, you can run your machine learning models on Google’s accelerated hardware with either GPUs or TPUs.
IBM Watson Studio
The brand name may have been born when a huge, hidden AI played Jeopardy, but now Watson encompasses much of IBM’s push into artificial intelligence. The IBM Watson Studio is a tool for exploring your data and training models in the cloud or on-prem. In goes data, and out come beautiful charts and graphs on a dashboard ready for the boardroom.
The biggest difference may be the desktop version of the Watson Studio. You can use the cloud-based version to study your data and enjoy all of the power that comes with the elastic resources and centralized repository. Or you can do much the same thing from the firewalled privacy and convenience of your desktop.
A machine learning model in every cloud
While many people are looking to choose one dashboard for all of their AI research, there’s no reason why you can’t use more of the choices here. Once you’ve completed all of the pre-processing and data cleansing, you can feed the same CSV-formatted data into all of these services and compare the results to find the best choice. Some of these services already offer automated comparisons between algorithms. Why not take it a step further and use more than one?
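The comparison loop itself is simple. Here is a minimal sketch, assuming you have already uploaded the same cleaned CSV to each service and collected each one’s reported accuracy — the service names and scores below are invented placeholders, not real benchmark results.

```python
import csv
import io

# The same pre-processed CSV goes to every service unchanged.
# (Inline here for illustration; normally you'd open a file.)
data = io.StringIO(
    "feature1,feature2,label\n"
    "1.2,0.4,yes\n"
    "0.7,1.1,no\n"
)
rows = list(csv.DictReader(data))

# Hypothetical accuracies reported back by each service's evaluation step.
results = {
    "SageMaker": 0.87,
    "Azure ML": 0.89,
    "BigML": 0.85,
}
best = max(results, key=results.get)   # the service with the top score
```

Because the input is identical everywhere, any difference in the scores reflects the services’ algorithms and tuning rather than your data preparation.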
You can also take advantage of some of the open standards that are evolving. Jupyter notebooks, for instance, will generally run without too much modification. You can develop on one platform and then move much of this code with the data to test out any new or different algorithms on different platforms.
We’re a long way from standardization and there are spooky and unexplained differences between algorithms. Don’t settle for just one algorithm or one training method. Experiment with as many different modeling tools as you can manage.