Accessing S3 Data in Python with boto3
19 Apr 2017
When I worked with the University of Toronto Data Science Team on Kaggle competitions, there was only so much we could do on a local computer. So, when we had to analyze 100 GB of satellite images for the Kaggle DSTL challenge, we moved to cloud computing.
We chose AWS for its ubiquity and familiarity. To prepare the data pipeline, I downloaded the data from Kaggle onto an EC2 virtual instance, unzipped it, and stored it on S3. Storing the data unzipped saves you from unzipping it every time you want to use it, which takes a considerable amount of time. However, the unzipped data is substantially larger and, as a result, incurs higher storage costs.
Now that the data was stored on AWS, the question was: How do we programmatically access the S3 data to incorporate it into our workflow? The following details how to do so in Python.
The following uses Python 3.5.1, boto3 1.4.0, pandas 0.18.1, and numpy 1.12.0.
First, install the AWS Software Development Kit (SDK) package for Python, boto3, for example with pip install boto3. boto3 contains a wide variety of AWS tools, including an S3 API, which we will be using.
Now you must set up your security credentials. If you have the AWS CLI installed, simply run aws configure and follow the instructions. Otherwise, create a file ~/.aws/credentials with the following:
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
See boto3 Quickstart for more detail.
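As a quick sanity check, here is a minimal sketch that verifies a credentials file is laid out as an INI file with a [default] profile holding both keys, which is the layout the boto3 Quickstart describes. The helper name has_profile and the throwaway demo file are my own illustrations, not part of boto3, and the check only looks at the file's structure, not at whether the keys are valid.

```python
import configparser
import os
import tempfile


def has_profile(credentials_path, profile="default"):
    """Return True if the credentials file defines both keys for `profile`.

    Hypothetical helper for illustration; it only inspects the file layout.
    """
    parser = configparser.ConfigParser()
    try:
        parser.read(credentials_path)  # a missing file is silently skipped
    except configparser.Error:
        # e.g. the keys were written without a [profile] header line
        return False
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


# Demo on a throwaway file rather than the real ~/.aws/credentials
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "credentials")
    with open(demo_path, "w") as fh:
        fh.write("[default]\n"
                 "aws_access_key_id = YOUR_ACCESS_KEY\n"
                 "aws_secret_access_key = YOUR_SECRET_KEY\n")
    demo_ok = has_profile(demo_path)
```

To check your real file, you could call has_profile(os.path.expanduser('~/.aws/credentials')).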
There are two main tools you can use to access S3: clients and resources. Clients are low-level functional interfaces, while resources are high-level object-oriented interfaces. I typically use clients to load single files and bucket resources to iterate over all items in a bucket.
To instantiate them in Python:
import boto3
client = boto3.client('s3')  # low-level functional API
resource = boto3.resource('s3')  # high-level object-oriented API
my_bucket = resource.Bucket('my-bucket')  # substitute your S3 bucket name
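To make the difference between the two interfaces concrete, here is a minimal sketch that fetches the same object's bytes both ways. The function names are hypothetical; the client and resource are passed in as arguments (created with boto3.client('s3') and boto3.resource('s3')), so the functions themselves do not depend on your credentials being set up.

```python
def read_with_client(client, bucket_name, key):
    """Fetch an object's bytes through the low-level client interface."""
    response = client.get_object(Bucket=bucket_name, Key=key)
    return response['Body'].read()


def read_with_resource(resource, bucket_name, key):
    """Fetch the same bytes through the object-oriented resource interface."""
    s3_object = resource.Object(bucket_name, key)
    return s3_object.get()['Body'].read()
```

With real AWS objects, both read_with_client(client, 'my-bucket', 'path/to/my/table.csv') and read_with_resource(resource, 'my-bucket', 'path/to/my/table.csv') should return the same bytes; which one to use is mostly a matter of taste.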
Reading and Writing Files
To read a csv file with pandas:
import pandas as pd
obj = client.get_object(Bucket='my-bucket', Key='path/to/my/table.csv')
grid_sizes = pd.read_csv(obj['Body'])
That didn’t look too hard. So what was going on? If you take a look at obj, the response dictionary returned for the S3 object, you will find a slew of metadata. The 'Body' of the object contains the actual data, in a StreamingBody format. You can access the bytestream by calling obj['Body'].read(), which will read all of the data from the S3 server (note that calling it again after you read will yield nothing).
In this case, pandas’ read_csv reads it without much fuss. However, other files, such as .npy and image files, are a bit more difficult to work with.
For example, to read a saved .npy array using numpy.load, you must first turn the bytestream from the server into an in-memory byte stream using io.BytesIO, because numpy.load needs a file-like object it can seek in. Make sure you have sufficient memory to hold the whole file.
from io import BytesIO
import numpy as np
obj = client.get_object(Bucket='my-bucket', Key='path/to/my/array.npy')
array = np.load(BytesIO(obj['Body'].read()))
BytesIO(obj['Body'].read()) works for most files.
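Since the streaming body can only be read once, it can be handy to wrap this pattern in a small helper. The following is a minimal sketch; fetch_to_buffer is a hypothetical name, and the client is passed in as an argument (created with boto3.client('s3')).

```python
from io import BytesIO


def fetch_to_buffer(client, bucket_name, key):
    """Read a whole S3 object into memory and return a seekable buffer."""
    response = client.get_object(Bucket=bucket_name, Key=key)
    # read() drains the stream; a second read() on the same body returns b''
    data = response['Body'].read()
    return BytesIO(data)
```

The returned buffer can be handed to anything that expects a file object, for example np.load(fetch_to_buffer(client, 'my-bucket', 'path/to/my/array.npy')), and unlike the streaming body it can be rewound with seek(0) and read again.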
To upload files, it is best to save the file to disk and upload it using a bucket resource (and delete it afterwards using os.remove if necessary).
It may also be possible to upload directly from a Python object to an S3 object, but I have had lots of difficulty with this.
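Here is a minimal sketch of both upload routes. The function names are hypothetical, and the bucket resource and client are passed in as arguments. The first function follows the save, upload, delete approach using the bucket resource's upload_file method. The second passes bytes straight to the client's put_object call; treat it as an untested assumption on my part for anything beyond small objects.

```python
import os
import tempfile


def upload_via_temp_file(bucket, data, key):
    """Write bytes to a temporary file, upload it, then delete the local copy."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as fh:
            fh.write(data)
        bucket.upload_file(path, key)  # arguments are Filename, Key
    finally:
        os.remove(path)  # clean up even if the upload failed
    return path


def upload_from_memory(client, bucket_name, key, data):
    """Send bytes directly as the object body, skipping the disk entirely."""
    client.put_object(Bucket=bucket_name, Key=key, Body=data)
```

For example, upload_via_temp_file(my_bucket, b'some bytes', 'path/to/my/file.bin') uploads to the bucket resource created earlier.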
Iterating Over a Bucket
You will often have to iterate over specific items in a bucket. To list all the files in the folder path/to/my/folder in my-bucket:
files = list(my_bucket.objects.filter(Prefix='path/to/my/folder'))
Notice I use the bucket resource here instead of the client. You could run client.list_objects() with the bucket name and the same prefix, but a single call returns at most 1000 objects.
files will now contain a list of s3.ObjectSummary objects, each of which carries a bucket_name and a key. To get the first object, simply run:
obj = files[0].get()
And you can read the data as in the above section.
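If you would rather stay with the client, the 1000-object limit of a single list call can be worked around with a paginator. The following is a minimal sketch; iter_keys is a hypothetical name, and the client is passed in as an argument (created with boto3.client('s3')).

```python
def iter_keys(client, bucket_name, prefix):
    """Yield every key under a prefix, following pagination across pages."""
    paginator = client.get_paginator('list_objects')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is missing from the page when nothing matches the prefix
        for item in page.get('Contents', []):
            yield item['Key']
```

For example, keys = list(iter_keys(client, 'my-bucket', 'path/to/my/folder')) collects all matching keys, each of which can then be passed to client.get_object.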
You learned the basics of S3 data import and export, and how to programmatically access files in a bucket using the Python API for AWS, boto3.
An example notebook on how to work with S3 data with boto3, for the Kaggle DSTL Satellite Image Challenge.