Processing data with other AWS services
Over the years, AWS has built many analytics services (https://aws.amazon.com/big-data/). Depending on your technical environment, you can pick whichever one best fits your needs to process data for your machine learning workflows.
In this section, you'll learn about three services that are popular choices for analytics workloads, why they make sense in a machine learning context, and how to get started with them:
Amazon Elastic MapReduce (EMR)
AWS Glue
Amazon Athena
Amazon Elastic MapReduce
Launched in 2009, Amazon Elastic MapReduce, aka Amazon EMR, started as a managed environment for Apache Hadoop applications (https://aws.amazon.com/emr/). Over the years, the service has added support for many other projects, such as Spark, Hive, HBase, and Flink. With additional features like EMRFS, an HDFS-compatible file system backed by Amazon S3, EMR is a prime contender for data processing at scale. You can learn more about EMR at https://docs.aws.amazon.com/emr/.
When it comes to processing machine learning data, Spark is a very popular choice thanks to its speed and its extensive feature engineering capabilities (https://spark.apache.org/docs/latest/ml-features). As SageMaker also supports Spark, this creates interesting opportunities to combine the two services.
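To give you an idea of what this looks like, here is a minimal, self-contained sketch of typical Spark feature engineering, assuming Spark 3.x and a made-up dataset: we index a categorical column, one-hot encode it, and assemble all columns into a single feature vector.

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.appName("feature-engineering").getOrCreate()

# Hypothetical dataset: a categorical column and two numeric columns
df = spark.createDataFrame(
    [("US", 35.0, 1200.0), ("FR", 42.0, 900.0), ("US", 28.0, 1500.0)],
    ["country", "age", "spend"])

# Encode the categorical column as an index, then as a one-hot vector
indexer = StringIndexer(inputCol="country", outputCol="country_idx")
encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_vec"])

# Assemble everything into a single feature vector
assembler = VectorAssembler(
    inputCols=["country_vec", "age", "spend"], outputCol="features")

indexed = indexer.fit(df).transform(df)
encoded = encoder.fit(indexed).transform(indexed)
features = assembler.transform(encoded)
features.select("features").show(truncate=False)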
Running a local Spark notebook
Notebook instances can run PySpark code locally for quick experimentation. This is as easy as selecting the Python 3 kernel and writing PySpark code. The following snippet loads samples from the MNIST dataset in libsvm format. You can find the full example at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-spark:
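This is a sketch rather than a verbatim copy of the example: it assumes the sagemaker_pyspark package is installed (it supplies the JARs needed to read from S3), and the S3 path is illustrative.

import sagemaker_pyspark
from pyspark.sql import SparkSession

# Add the SageMaker Spark JARs (which include the S3 connector) to the classpath
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", classpath)
         .getOrCreate())

# Load a libsvm-formatted copy of MNIST (784 features per sample);
# replace the path with the sample data location for your region
train_df = (spark.read.format("libsvm")
            .option("numFeatures", "784")
            .load("s3a://sagemaker-sample-data-us-east-1/spark/mnist/train/"))
train_df.show(5)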
A local notebook is fine for experimenting at a small scale. For larger workloads, you'll certainly want to run your Spark jobs on an EMR cluster.
Running a notebook backed by an Amazon EMR cluster
Notebook instances support SparkMagic kernels for Spark, PySpark, and SparkR (https://github.com/jupyter-incubator/sparkmagic). This makes it possible to connect a Jupyter notebook running on a Notebook instance to an Amazon EMR cluster, an interesting combination if you need to perform interactive exploration and processing at scale.
The setup procedure is documented in detail in this AWS blog post: https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/.
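Under the hood, the SparkMagic kernels talk to the cluster through Apache Livy, a REST endpoint running on the EMR master node; the blog post above walks you through pointing the kernel configuration at it. Once the notebook is connected, code cells execute remotely, and a SparkSession named spark is already available. Here is a minimal sketch of a first PySpark cell, with a hypothetical dataset path:

# This cell runs on the EMR cluster, not on the notebook instance.
# The PySpark (SparkMagic) kernel provides a ready-made 'spark' session.
df = spark.read.parquet("s3://my-bucket/my-dataset/")  # hypothetical path
df.printSchema()
print("Row count: {}".format(df.count()))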
Processing data with Spark, training the model with Amazon SageMaker
We haven't talked about training models with SageMaker yet (we'll start doing that in the following chapters). When we get there, we'll discuss why this is a powerful combination compared to running everything on EMR.
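Still, to give you a taste of the combination, here is a hedged preview based on the sagemaker-spark SDK (sagemaker_pyspark). Assuming train_df is a DataFrame with a vector column named features, such as the MNIST one loaded earlier, the built-in K-Means algorithm could be invoked like this; the role ARN and instance settings are placeholders:

from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Placeholder IAM role; use a role with SageMaker permissions
role = "arn:aws:iam::123456789012:role/MySageMakerRole"

# Train on SageMaker-managed instances, straight from a Spark DataFrame
estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole(role),
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1)
estimator.setK(10)
estimator.setFeatureDim(784)

# fit() launches a SageMaker training job; transform() sends the
# DataFrame to the resulting endpoint for prediction
model = estimator.fit(train_df)
predictions = model.transform(train_df)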
AWS Glue
AWS Glue is a managed ETL service (https://aws.amazon.com/glue/). Thanks to Glue, you can easily clean your data, enrich it, convert it to a different format, and so on. Glue is more than a processing service: it also includes a metadata repository where you can define data sources (aka the Glue Data Catalog), a crawler service to fetch data from these sources, a scheduler that handles jobs and retries, and workflows to run everything smoothly. To top it off, Glue can also work with on-premises data sources.
AWS Glue works best with structured and semi-structured data. The service is built on top of Spark, giving you the option to use both Glue's built-in transforms and standard Spark transforms in Python or Scala. Based on the transforms that you define on your data, Glue can automatically generate Spark scripts. Of course, you can customize them if needed, and you can also deploy your own scripts.
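Here is a sketch of the general shape of a Glue job script; the database, table, column, and bucket names are hypothetical:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table previously defined in the Glue Data Catalog
raw = glue_context.create_dynamic_frame.from_catalog(
    database="sales", table_name="orders")

# Rename and retype columns with a built-in transform
cleaned = ApplyMapping.apply(
    frame=raw,
    mappings=[("orderid", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])

# Write the result back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/orders-clean/"},
    format="parquet")
job.commit()

Everything here is standard PySpark plus the awsglue library: DynamicFrames are Glue's schema-flexible wrapper around Spark DataFrames, and you can convert between the two with toDF() and fromDF() when you need native Spark transforms.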
If you like the expressivity of Spark but don't want to manage EMR clusters, AWS Glue is a very interesting option. As it's based on popular languages and frameworks, you should quickly feel comfortable with it.
You can learn more about Glue at https://docs.aws.amazon.com/glue/, and you'll also find code samples at https://github.com/aws-samples/aws-glue-samples.
Amazon Athena
Amazon Athena is a serverless analytics service that lets you easily query Amazon S3 at scale, using only standard SQL (https://aws.amazon.com/athena/). There is zero infrastructure to manage, and you pay only for the queries that you run. Athena is based on Presto (https://prestodb.io).
Athena is extremely easy to use: just point it at the S3 location where your data lives, define a schema for it, and run your SQL queries! In most cases, you will get results in seconds. Once again, this doesn't require any infrastructure provisioning. All you need is some data in S3 and some SQL queries. Athena can also run federated queries, allowing you to query and join data located in different backends (Amazon DynamoDB, Apache HBase, JDBC-compliant sources, and more).
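If you prefer to run queries programmatically rather than in the Athena console, the boto3 client works well. Here is a minimal sketch, with hypothetical database, table, and bucket names:

import time
import boto3

athena = boto3.client("athena")

# Submit a query; Athena writes results to the S3 location you specify
response = athena.start_query_execution(
    QueryString="SELECT country, COUNT(*) AS orders "
                "FROM orders GROUP BY country ORDER BY orders DESC",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"})
query_id = response["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# Print the result set (the first row contains the column headers)
if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])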
If you're working with structured and semi-structured datasets stored in S3, and if SQL queries are sufficient to process these datasets, Athena should be the first service that you try. You'll be amazed at how productive Athena makes you, and how inexpensive it is. Mark my words.
You can learn more about Athena at https://docs.aws.amazon.com/athena/.