Danny Luo: Data Scientist, UToronto Math & Physics Student

Toronto Deep Learning Series — Google BERT

I had the pleasure of presenting the state-of-the-art NLP model, Google BERT, at the Toronto Deep Learning Series meetup.

Read more

From Academia to Industry

In 2014, I entered the University of Toronto as an undergrad with a burning passion for physics. In 2018, I left the academic world to start a machine learning career in industry.

This is how I transitioned from academia to industry.

Read more

An Honest Look at UofT

What is UofT like? Is it hard? Is it as depressing as people say it is? And what is POSt?

As a recent Bachelor of Science graduate, I answer these questions and more in this guide for new students.

Read more

Spark Performance Tuning

At ZeroGravityLabs, Taraneh Khazaei and I co-authored a blog post that details how to diagnose and resolve common Spark performance issues. It was featured on the Roaring Elephant - Bite-Sized Big Data podcast.

Spark Performance Tuning: A Checklist

Read more

Data Science Resources

I have made a list of useful data science resources and recommended study routes. This will be updated from time to time.

Check it out here.

Read more

Toronto Apache Spark 19

I had the pleasure of presenting how to set up Spark with Jupyter on AWS at Toronto Apache Spark #19.

Read more

HackOn(Data) 2017

HackOn(Data), Toronto’s very own data hackathon in the heart of downtown, is back for 2017!

At HackOn(Data) last year, I learned a lot, had lots of fun, and made industry connections that landed my teammate and me great summer internships (see my blog post). This year, I plan on volunteering for HackOn(Data) 2017.

I highly recommend HackOn(Data). Register at hackondata.com/2017!

Read more

University of Toronto Data Science Team

The UDST Kaggle team discussing strategies for the Kaggle Data Science Bowl 2017

Over the past school year (2016-2017), I have been participating in Kaggle competitions with the University of Toronto Data Science Team (UDST).

We have participated in competitions such as Outbrain Click Prediction, DSTL Satellite Imagery Feature Detection, and the Data Science Bowl 2017. I have learned a lot from my participation in UDST. In fact, it was these competitions that led me to write my Spark-Jupyter-AWS guide, as well as the posts on multi-CPU data processing and S3 data access with boto3.

If you are a UofT student or simply a data enthusiast in the Toronto area, come check us out! We will be continuing our activities in the summer of 2017.

Read more

Multi-CPU Data Processing

Using all 8 CPUs of an AWS EC2 c4.2xlarge instance. Keep an eye on your memory!

When the University of Toronto Data Science Team participated in the Data Science Bowl 2017, we had to preprocess a large dataset (~150 GB compressed) of lung CT images. I was tasked with the following:

  • Read the data from an S3 bucket
  • Preprocess the lung CT images, following this tutorial
  • Write the preprocessed image arrays back to S3

For S3 I/O in Python, see my other post. In order to analyze the data efficiently, I used Python's multiprocessing package to maximize CPU usage on an AWS compute instance. The result: multi-CPU preprocessing on a c4.2xlarge was 6 times faster than the ordinary, single-process run on my local computer.
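To give a flavour of the approach, here is a minimal sketch of that pattern: a multiprocessing.Pool with one worker per core, mapped over a list of patients. The preprocess_scan function, the placeholder patient IDs, and the dummy output array are illustrative stand-ins, not the actual competition code.

```python
# Minimal sketch: parallel preprocessing with multiprocessing.Pool.
# preprocess_scan and the patient IDs below are placeholders, not the
# real Data Science Bowl pipeline.
from multiprocessing import Pool, cpu_count

import numpy as np


def preprocess_scan(patient_id):
    """Placeholder: load one patient's CT scan and return a processed array."""
    # The real pipeline read DICOM slices from S3, resampled them, and
    # segmented the lungs, following the tutorial linked above.
    return patient_id, np.zeros((128, 128, 64), dtype=np.float32)


if __name__ == "__main__":
    patient_ids = ["patient_{:04d}".format(i) for i in range(100)]  # placeholder IDs

    # One worker process per core; a c4.2xlarge exposes 8 vCPUs.
    with Pool(processes=cpu_count()) as pool:
        for pid, arr in pool.imap_unordered(preprocess_scan, patient_ids):
            # In the real job, each result would be written back to S3 here (step 3 above).
            print(pid, arr.shape)
```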

Read more

Accessing S3 Data in Python with boto3

Working with the University of Toronto Data Science Team on Kaggle competitions, I found there was only so much you could do on a local computer. So, when we had to analyze 100 GB of satellite images for the Kaggle DSTL challenge, we moved to cloud computing.

We chose AWS for its ubiquity and familiarity. To prepare the data pipeline, I downloaded the data from Kaggle onto an EC2 instance, unzipped it, and stored it on S3. Storing the data unzipped saves you from having to decompress it every time you want to use it, which takes a considerable amount of time. However, it also increases the size of the data substantially and, as a result, incurs higher storage costs.

Now that the data was stored on AWS, the question was: how do we programmatically access the S3 data and incorporate it into our workflow? The following details how to do so in Python.
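As a taste of what the post walks through, here is a minimal boto3 sketch for listing, downloading, and reading objects in a bucket. The bucket and key names are made up for illustration, and AWS credentials are assumed to be configured on the machine (for example via an IAM role on the EC2 instance).

```python
# Minimal sketch of S3 access with boto3. Bucket and key names are
# hypothetical; credentials are assumed to be configured already.
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-dstl-data")  # hypothetical bucket name

# List objects under a prefix.
for obj in bucket.objects.filter(Prefix="three_band/"):
    print(obj.key, obj.size)

# Download a single object to a local file...
bucket.download_file("three_band/6010_0_0.tif", "6010_0_0.tif")

# ...or read it straight into memory without touching disk.
body = s3.Object("my-dstl-data", "train_labels.csv").get()["Body"].read()
print(len(body), "bytes")
```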

Read more