Danny Luo Data Scientist, UToronto Math & Physics Student

Data Science Resources


Useful data science resources and recommended study routes. Updated occasionally.


Online Courses

Title Author Thoughts Level
Coursera: Machine Learning Andrew Ng
  • General overview, not much detail.
  • Labs are in MATLAB, which is not desirable.
Introductory
Stanford Statistical Learning Trevor Hastie, Robert Tibshirani
  • Online lectures following the text Introduction to Statistical Learning
  • Example R modules, minimal self-evaluations
Introductory
Stanford CS229 Machine Learning Andrew Ng, John Duchi
  • A broad, technical overview of Machine Learning
  • Written problem sets on ML theory
Average
Coursera: Neural Networks for Machine Learning Geoffrey Hinton
  • Wide overview of several neural network models, including non-standard ones, such as Hopfield nets and Restricted Boltzmann Machines
  • Labs are in MATLAB, which is not desirable.
Average
Stanford CS231n Convolutional Neural Networks for Visual Recognition Fei-Fei Li, Andrej Karpathy, Justin Johnson
  • Well-written online modules, video lectures on YouTube
  • Completed the assignments, in which you write neural network architectures in Python
  • Strongly recommend modules + assignments for understanding NNs, CNNs, RNNs
Average - Advanced

All course content should be available for free. The paid Coursera certification is not really important.


Texts

Title Author Thoughts Level
An Introduction to Statistical Learning Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
  • Good introductory machine learning book for those with a statistical background
  • Includes R modules
Introductory
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
  • Practical knowledge about data mining, machine learning with real-life applications
Average
The Elements of Statistical Learning Trevor Hastie, Robert Tibshirani, Jerome Friedman
  • Advanced version of An Introduction to Statistical Learning
  • Includes R modules
Advanced
Pattern Recognition and Machine Learning Christopher M. Bishop
  • Have not read in detail
-

Other

Title Author Thoughts Level
HackOn(Data) Workshop Material Armando Benitez
  • Great notebooks to learn Apache Spark on Databricks, Machine Learning with Spark
  • Adapted from edX Spark labs
Introductory - Average
TensorFlow Tutorial and Examples for Beginners Aymeric Damien
  • Well-constructed Jupyter notebooks for learning TensorFlow
Introductory

Prerequisites

Make sure you have sufficient theoretical background in statistics, linear algebra, and multivariable calculus. Most university students should be adequately prepared after second-year classes in these subjects.

Acquire a basic background in Python, including the following libraries: NumPy, Matplotlib, SciPy, pandas. There are many resources available online. I particularly like this one for NumPy, Matplotlib, SciPy.

It is also useful to know R and Scala (for Apache Spark).


Machine Learning

Start off with the canonical Coursera Machine Learning course by Andrew Ng. It will give you a high-level overview of machine learning that is not too technical. You can stop this course once you feel you have developed sufficient intuition for machine learning.

If you have a statistical background, opt for the Stanford Statistical Learning course and study An Introduction to Statistical Learning. Otherwise, read the lecture notes for Stanford CS229 Machine Learning for a more technical introduction to machine learning.
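
To give a flavour of what these introductory courses cover, here is a minimal sketch of linear regression fitted by batch gradient descent, one of the first algorithms such a course typically introduces. The function name `fit_line`, the toy data, and the learning rate and step count are illustrative assumptions, not material taken from any of the courses. It is written in plain Python so it runs without extra libraries.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current predictions.
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of (1 / 2n) * sum(err ** 2) with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = sum(errs) / n
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On real data you would normally vectorise this with NumPy and monitor the loss instead of running a fixed number of steps.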


Neural Networks & Deep Learning

For a general theoretical overview of neural networks, complete the Coursera Neural Networks for Machine Learning course by Geoffrey Hinton.

For a deeper and more technical understanding of neural networks, read the modules for Stanford CS231n Convolutional Neural Networks for Visual Recognition and complete the assignments. It is important that you complete the assignments, in which you will actually write neural network layers.
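
As a rough idea of what "writing a neural network layer" involves, the sketch below implements the forward and backward pass of a fully connected (affine) layer and checks the weight gradient against a numerical estimate. It is not taken from the course assignments: the function names, shapes, and the plain-Python list representation are assumptions chosen to keep the example dependency-free. In practice you would use NumPy arrays.

```python
import random


def affine_forward(x, w, b):
    """Return out = x @ w + b for x of shape (N, D), w of shape (D, M), b of length M."""
    return [[sum(xi[k] * w[k][j] for k in range(len(w))) + b[j]
             for j in range(len(b))]
            for xi in x]


def affine_backward(dout, x, w):
    """Return (dx, dw, db) given the upstream gradient dout of shape (N, M)."""
    n, d, m = len(x), len(w), len(w[0])
    # dx = dout @ w.T
    dx = [[sum(dout[i][j] * w[k][j] for j in range(m)) for k in range(d)]
          for i in range(n)]
    # dw = x.T @ dout
    dw = [[sum(x[i][k] * dout[i][j] for i in range(n)) for j in range(m)]
          for k in range(d)]
    # db = column sums of dout
    db = [sum(dout[i][j] for i in range(n)) for j in range(m)]
    return dx, dw, db


def weighted_sum(out, dout):
    """Scalar sum(out * dout); its gradient with respect to out is exactly dout."""
    return sum(o * g for row_o, row_g in zip(out, dout) for o, g in zip(row_o, row_g))


def numeric_dw(x, w, b, dout, eps=1e-5):
    """Central-difference estimate of the gradient of weighted_sum with respect to w."""
    grad = [[0.0] * len(w[0]) for _ in w]
    for k in range(len(w)):
        for j in range(len(w[0])):
            old = w[k][j]
            w[k][j] = old + eps
            plus = weighted_sum(affine_forward(x, w, b), dout)
            w[k][j] = old - eps
            minus = weighted_sum(affine_forward(x, w, b), dout)
            w[k][j] = old  # restore the original weight
            grad[k][j] = (plus - minus) / (2 * eps)
    return grad


def rand_matrix(rows, cols):
    """Matrix of uniform random values in [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, D, M = 4, 3, 2  # batch size, input dimension, output dimension
x, w, dout = rand_matrix(N, D), rand_matrix(D, M), rand_matrix(N, M)
b = [random.uniform(-1, 1) for _ in range(M)]

dx, dw, db = affine_backward(dout, x, w)
dw_num = numeric_dw(x, w, b, dout)
max_diff = max(abs(a - e) for ra, re in zip(dw, dw_num) for a, e in zip(ra, re))
print("max |analytic - numeric| for dw:", max_diff)
```

Comparing an analytic backward pass against a finite-difference estimate like this is a simple way to catch mistakes before stacking layers into a full network.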

Afterwards, begin learning computational frameworks for deep learning, such as TensorFlow or Theano (I recommend TensorFlow), as well as deep learning libraries, such as Keras and Caffe. Then start building your own neural networks, and figure out how to train them with GPUs.


Big Data

Familiarize yourself with cloud computing services. I recommend beginning with AWS, which offers a free tier. I don’t think there is a need to take an entire course on cloud computing, as you will learn a lot by doing. Try to launch your own virtual machines and use them to run your models. Try integrating them with AWS storage services.

Learn the basics of Apache Spark, a distributed computing engine designed for big data. I did this through the HackOn(Data) Workshops, but there are plenty of other resources available. Then, try launching a Spark cluster on the cloud, either through a service like AWS EMR or Azure HDInsight, or by bootstrapping your own cluster (My Guide).

As for the rest, learn as you need.


Note: Keep in mind that you can only learn so much through reading. Data science is about doing! Try Kaggle competitions, or play around with fun datasets.