Danny Luo Data Enthusiast, UofT Math & Physics Student

Accessing S3 Data in Python with boto3

When I worked with the University of Toronto Data Science Team on Kaggle competitions, there was only so much we could do on our local computers. So when we had to analyze 100 GB of satellite images for the Kaggle DSTL challenge, we moved to cloud computing.

We chose AWS for its ubiquity and familiarity. To prepare the data pipeline, I downloaded the data from Kaggle onto an EC2 virtual instance, unzipped it, and stored it on S3. Storing the unzipped data means you don't have to unzip it every time you want to use it, which takes a considerable amount of time. However, this increases the size of the data substantially and, as a result, incurs higher storage costs.

Now that the data was stored on AWS, the question was: how do we programmatically access the S3 data to incorporate it into our workflow? The following details how to do so in Python.


Setup

The following uses Python 3.5.1, boto3 1.4.0, pandas 0.18.1, and numpy 1.12.0.

First, install the AWS Software Development Kit (SDK) package for Python: boto3 (for example, with pip install boto3). boto3 contains a wide variety of AWS tools, including an S3 API, which we will be using.

To use the AWS API, you must have an AWS Access Key ID and an AWS Secret Access Key (doc). It is also worth installing the AWS Command Line Interface (CLI), which exposes the AWS API in the terminal.

Now you must set up your security credentials. If you have the AWS CLI installed, simply run aws configure and follow the prompts. Otherwise, create a file ~/.aws/credentials with the following:

[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY

See boto3 Quickstart for more detail.
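
The credentials file is INI-formatted, and boto3 looks the keys up under a named profile, which is default unless you select another. As a quick sanity check, here is a minimal sketch that parses the file with the standard library; has_aws_profile is a hypothetical helper name, not part of boto3, and it only checks that the keys are present, not that AWS accepts them.

```python
import configparser
import os


def has_aws_profile(path, profile="default"):
    """Return True if the INI file at `path` defines both keys for `profile`."""
    parser = configparser.ConfigParser()
    try:
        # read() silently skips missing files, so an absent file yields no sections.
        parser.read(path)
    except configparser.Error:
        # Raised, for example, when the keys are not under a [section] header.
        return False
    if not parser.has_section(profile):
        return False
    keys = ("aws_access_key_id", "aws_secret_access_key")
    return all(parser.has_option(profile, key) for key in keys)


# Typical call (hypothetical usage):
#   has_aws_profile(os.path.expanduser("~/.aws/credentials"))
```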

There are two main tools you can use to access S3: clients and resources. Clients are low-level functional interfaces, while resources are high-level object-oriented interfaces. I typically use clients to load single files and bucket resources to iterate over all items in a bucket.

To instantiate them in Python:

import boto3

client = boto3.client('s3')  # low-level functional API

resource = boto3.resource('s3')  # high-level object-oriented API
my_bucket = resource.Bucket('my-bucket')  # substitute your S3 bucket name

Reading and Writing Files

To read a CSV file with pandas:

import pandas as pd

obj = client.get_object(Bucket='my-bucket', Key='path/to/my/table.csv')
grid_sizes = pd.read_csv(obj['Body'])

That didn’t look too hard. So what was going on? If you take a look at obj, the response returned by get_object, you will find a slew of metadata (doc). The 'Body' entry contains the actual data as a StreamingBody. You can access the byte stream by calling obj['Body'].read(), which reads all of the data from the S3 server. Note that the stream can only be consumed once: calling read() again after that yields nothing.
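
Because the body can only be read once, it can help to pull the bytes out a single time and wrap them in a fresh in-memory buffer whenever a file-like object is needed. The following is a minimal sketch of that idea; fetch_bytes and fresh_buffer are hypothetical helper names, and client is assumed to be a boto3 S3 client as created above.

```python
from io import BytesIO


def fetch_bytes(client, bucket, key):
    """Download one S3 object and return its entire contents as bytes."""
    response = client.get_object(Bucket=bucket, Key=key)
    # response["Body"] is a one-shot stream; read() drains it.
    return response["Body"].read()


def fresh_buffer(data):
    """Wrap already-downloaded bytes in a new seekable in-memory file."""
    return BytesIO(data)


# Hypothetical usage, with `client = boto3.client("s3")`:
#   data = fetch_bytes(client, "my-bucket", "path/to/my/table.csv")
#   first = pd.read_csv(fresh_buffer(data))
#   again = pd.read_csv(fresh_buffer(data))  # no second download needed
```

The trade-off is memory: the whole object is held as bytes for as long as you keep data around.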

In this case, pandas’ read_csv reads it without much fuss. However, other files, such as .npy and image files, are a bit more difficult to work with.

For example, to read a saved .npy array using numpy.load, you must first turn the bytestream from the server into an in-memory byte-stream using io.BytesIO. Make sure you have sufficient memory to do this.

from io import BytesIO

import numpy as np

obj = client.get_object(Bucket='my-bucket', Key='path/to/my/array.npy')
array = np.load(BytesIO(obj['Body'].read()))

The same BytesIO(obj['Body'].read()) pattern works for most file types whose loader accepts a file-like object.

To upload files, it is best to save the file to disk and upload it using a bucket resource, deleting it afterwards with os.remove if necessary.

my_bucket.upload_file('file', Key='path/to/my/file')
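
The save, upload, and delete steps can be bundled so the temporary file is cleaned up even if the upload fails. This is only a sketch under stated assumptions: upload_bytes_via_disk is a hypothetical helper name, and bucket is assumed to be a bucket resource like my_bucket.

```python
import os
import tempfile


def upload_bytes_via_disk(bucket, key, data):
    """Write `data` to a temporary file, upload it, then delete the file."""
    # delete=False so the file can be closed before it is reopened for upload.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        bucket.upload_file(path, Key=key)
    finally:
        # Runs whether or not the upload raised.
        os.remove(path)
    return path


# Hypothetical usage:
#   upload_bytes_via_disk(my_bucket, "path/to/my/file", b"some bytes")
```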

It may also be possible to upload directly from a Python object to an S3 object, but I have had lots of difficulty with this.
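
For what it is worth, one route that may avoid the disk entirely is to serialize into an in-memory buffer and pass the bytes to the client's put_object call. This is a sketch of that alternative, not the method used in this post; upload_from_memory and write_fn are hypothetical names, and the whole payload has to fit in memory.

```python
from io import BytesIO


def upload_from_memory(client, bucket, key, write_fn):
    """Serialize into an in-memory buffer and send it with put_object."""
    buffer = BytesIO()
    # write_fn writes the serialized object into the buffer,
    # e.g. lambda buf: np.save(buf, array)
    write_fn(buffer)
    payload = buffer.getvalue()
    client.put_object(Bucket=bucket, Key=key, Body=payload)
    return len(payload)  # number of bytes sent


# Hypothetical usage, with `client = boto3.client("s3")`:
#   upload_from_memory(client, "my-bucket", "path/to/my/array.npy",
#                      lambda buf: np.save(buf, array))
```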


Iterating Over a Bucket

You will often have to iterate over specific items in a bucket. To list all the files in the folder path/to/my/folder in my-bucket:

files = list(my_bucket.objects.filter(Prefix='path/to/my/folder'))

Notice that I use the bucket resource here instead of the client. You could run client.list_objects() with the same arguments, but a single call returns at most 1000 objects (doc).
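
If you would rather stay with the client, a paginator can follow the 1000-object pages for you. Here is a minimal sketch, assuming the list_objects_v2 operation (the newer variant of the call mentioned above); iter_keys is a hypothetical helper name.

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every key under `prefix`, following pagination across pages."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matches has no "Contents" entry at all.
        for item in page.get("Contents", []):
            yield item["Key"]


# Hypothetical usage, with `client = boto3.client("s3")`:
#   keys = list(iter_keys(client, "my-bucket", "path/to/my/folder"))
```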

files will now contain a list of s3.ObjectSummary objects, each exposing a bucket_name and a key. To get the first object, simply run:

obj = files[0].get()

You can then read the data as in the section above.
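
To read every listed object rather than just the first, the same get() call can be applied in a loop. The sketch below assumes files is the list built above and that everything fits in memory; read_objects is a hypothetical helper name, and skipping keys that end in "/" is an assumption about zero-byte folder placeholders that may not apply to your bucket.

```python
def read_objects(summaries):
    """Return {key: bytes} for each object summary, skipping folder markers."""
    contents = {}
    for summary in summaries:
        # Keys ending in "/" are usually zero-byte folder placeholders.
        if summary.key.endswith("/"):
            continue
        contents[summary.key] = summary.get()["Body"].read()
    return contents


# Hypothetical usage:
#   data_by_key = read_objects(files)
```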


Conclusion

You learned the basics of S3 data import and export, and how to programmatically access files in a bucket using boto3, the Python API for AWS.


An example notebook on how to work with S3 data with boto3, for the Kaggle DSTL Satellite Image Challenge.