Accessing S3 Data in Python with boto3
19 Apr 2017

Working with the University of Toronto Data Science Team on Kaggle competitions, there was only so much we could do on our local machines. So, when we had to analyze 100GB of satellite images for the Kaggle DSTL challenge, we moved to cloud computing.
We chose AWS for its ubiquity and familiarity. To prepare the data pipeline, I downloaded the data from Kaggle onto an EC2 instance, unzipped it, and stored it on S3. Storing the data unzipped means you don't have to decompress it every time you want to use it, which takes a considerable amount of time. However, it also increases the size of the data substantially and, as a result, incurs higher storage costs.
Now that the data was stored on AWS, the question was: how do we programmatically access the S3 data to incorporate it into our workflow? The following details how to do so in Python.
Setup
The following uses Python 3.5.1, boto3 1.4.0, pandas 0.18.1, and numpy 1.12.0.
First, install the AWS Software Development Kit (SDK) package for Python: boto3. boto3 contains a wide variety of AWS tools, including an S3 API, which we will be using.
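To install it, assuming you use pip:
pip install boto3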
To use the AWS API, you must have an AWS Access Key ID and an AWS Secret Access Key (doc). It is also worth installing the AWS Command Line Interface (CLI), which exposes the same API from the terminal.
Now you must set up your security credentials. If you have the AWS CLI installed, simply run aws configure and follow the instructions. Otherwise, create a file ~/.aws/credentials with the following:
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
See boto3 Quickstart for more detail.
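Alternatively, you can pass credentials directly when creating a client, though keeping them out of source code is generally safer. A minimal sketch, with the key values as placeholders:
import boto3

client = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY',        # placeholder
    aws_secret_access_key='YOUR_SECRET_KEY',    # placeholder
)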
There are two main tools you can use to access S3: clients and resources. Clients are low-level functional interfaces, while resources are high-level object-oriented interfaces. I typically use clients to load single files and bucket resources to iterate over all items in a bucket.
To initialize them in Python:
import boto3

client = boto3.client('s3')               # low-level functional API
resource = boto3.resource('s3')           # high-level object-oriented API
my_bucket = resource.Bucket('my-bucket')  # substitute your own S3 bucket name
Reading and Writing Files
To read a CSV file with pandas:
import pandas as pd

obj = client.get_object(Bucket='my-bucket', Key='path/to/my/table.csv')
df = pd.read_csv(obj['Body'])
That didn’t look too hard. So what was going on? If you take a look at obj, the returned S3 object, you will find that it carries a slew of metadata (doc). The 'Body' of the object contains the actual data as a StreamingBody. You can access the bytestream by calling obj['Body'].read(), which reads all of the data from the S3 server (note that calling it again after you have read it will yield nothing).
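To make the single-read behaviour concrete, here is a small sketch (bucket and key are placeholders) that also peeks at one of the metadata fields:
obj = client.get_object(Bucket='my-bucket', Key='path/to/my/table.csv')
print(obj['ContentLength'])    # object size in bytes, one of many metadata fields
data = obj['Body'].read()      # the full payload as bytes
print(obj['Body'].read())      # b'' -- the stream is already exhausted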
With a CSV, pandas’ read_csv handles the stream without much fuss. However, other files, such as .npy arrays and images, are a bit more difficult to work with.
For example, to read a saved .npy array using numpy.load, you must first turn the bytestream from the server into an in-memory byte stream using io.BytesIO. Make sure you have sufficient memory to do this.
from io import BytesIO
import numpy as np

obj = client.get_object(Bucket='my-bucket', Key='path/to/my/array.npy')
array = np.load(BytesIO(obj['Body'].read()))
The BytesIO(obj['Body'].read()) pattern works for most file types.
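For instance, ordinary image formats can be read the same way. A minimal sketch, assuming Pillow is installed (it is not among the package versions listed above) and using a hypothetical key:
from io import BytesIO
from PIL import Image

obj = client.get_object(Bucket='my-bucket', Key='path/to/my/image.png')  # hypothetical key
img = Image.open(BytesIO(obj['Body'].read()))  # Pillow accepts the in-memory stream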
To upload files, it is best to save the file to disk and upload it using a bucket resource (deleting it afterwards using os.remove if necessary):
my_bucket.upload_file('file', Key='path/to/my/file')
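A sketch of that round trip for a numpy array (file names and keys are placeholders):
import os
import numpy as np

np.save('array.npy', array)                                     # write the array to local disk
my_bucket.upload_file('array.npy', Key='path/to/my/array.npy')  # upload it to the bucket
os.remove('array.npy')                                          # clean up the local copy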
It may also be possible to upload directly from a Python object to an S3 object, but I have had lots of difficulty with this.
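One approach that may work, depending on the data, is to serialize the object into an in-memory buffer and call put_object; a rough sketch, not something I have battle-tested:
from io import BytesIO
import numpy as np

buf = BytesIO()
np.save(buf, array)   # serialize the array into the in-memory buffer
buf.seek(0)           # rewind so the whole buffer gets uploaded
client.put_object(Bucket='my-bucket', Key='path/to/my/array.npy', Body=buf)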
Iterating Over a Bucket
You will often have to iterate over specific items in a bucket. To list all the files under path/to/my/folder in my-bucket:
files = list(my_bucket.objects.filter(Prefix='path/to/my/folder'))
Notice I use the bucket resource here instead of the client. You could call client.list_objects() with the same bucket and prefix, but that query returns a maximum of 1000 objects (doc). files will now contain a list of s3.ObjectSummary objects, each exposing a bucket_name and a key. To get the first object, simply run:
obj = files[0].get()
And you can read the data as in the above section.
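Putting it together, a sketch that loops over everything under the prefix and loads each object, assuming they are all .npy arrays:
from io import BytesIO
import numpy as np

for summary in my_bucket.objects.filter(Prefix='path/to/my/folder'):
    obj = summary.get()                               # fetch the full object
    array = np.load(BytesIO(obj['Body'].read()))      # load the payload as a numpy array
    print(summary.key, array.shape)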
Conclusion
You learned the basics of S3 data import and export, and how to programmatically access files in a bucket using the Python API for AWS, boto3.
For a fuller walkthrough, see the example notebook on working with S3 data with boto3 for the Kaggle DSTL Satellite Image Challenge.