How can I become a data scientist?

Answer by Rahul Agarwal:

I can only tell you what I have done so far and what I intend to work on further to become a better data scientist.
What follows is my own data science curriculum. It is aimed at computer science with a specialization in machine learning.
My main aim here is to learn about mathematics, statistics, computer science and machine learning, though not necessarily in that order.
I have categorized the courses here into two types:
  1. F – Foundational Class
  2. A – Advanced Specialization
MATHEMATICS:
A great class by a great teacher. I would definitely recommend this class to anyone who wants to learn linear algebra (LA).
COMPUTER SCIENCE:
This is an Introduction to Computer Science class taught by David Malan. It cleared up many of my misunderstandings and helped build intuition around the whole CS playground. It starts with a basic introduction to C and some programming exercises, and ends up teaching the basics of PHP, JavaScript and HTML/CSS as well. The projects in this class are really awesome. The GitHub code repository for this class is HERE
The course is an introduction to many of the important concepts in computer science.
It covers simple algorithms, asymptotic running times, classes, OOP, trees, exceptions, assertions, hashing and a whole lot of other stuff.
This is a series of 6 short but good courses. I worked on these courses because data science requires a lot of programming, and the best way to learn programming is by doing it. The lectures are good but the problems and assignments are awesome. It consists of three main courses:
1> Interactive Programming in Python: The course starts by teaching Python but quickly moves into creating graphical user interfaces and games using Python in CodeSkulptor. I created some very basic games in this course as part of the coursework. Some of them are:
2> Principles of Computing: This course builds on the previous course, but here the focus is more on thinking programmatically than on GUIs. The projects are really great, as the course progresses by creating games.
3> Algorithmic Thinking: This course starts with a focus on graph algorithms and data structures. The code is hosted on GitHub
STATISTICS:
Conditioning is the Soul of Statistics.
I took this course to enhance my understanding of probability distributions and statistics, but it taught me a lot more than that. Apart from learning to think conditionally, it also taught me how to explain difficult concepts with a story.
This was a hard class but definitely fun. The focus was not only on getting mathematical proofs but also on understanding the intuition behind them and how intuition can help in deriving them more easily. Sometimes the same proof was done in different ways to facilitate learning of a concept.
One of the things I liked most about this course is the focus on concrete examples while explaining abstract concepts. The inclusion of the Gambler’s Ruin Problem, Matching Problem, Birthday Problem, Monty Hall, Simpson’s Paradox, St. Petersburg Paradox etc. made this course much more exciting than a normal statistics course.
I will definitely be on the lookout for more courses by Joe after this, and I have already done one more course by him – CS109. More on that later.
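To give a flavour of these examples, here is a minimal sketch of the Birthday Problem in Python. It assumes 365 equally likely birthdays and independent people, and the function names are my own for illustration, not anything from the course materials. It computes the exact probability of a shared birthday and checks it against a quick simulation.

```python
import random


def birthday_exact(n, days=365):
    """Exact P(at least two of n people share a birthday)."""
    p_no_match = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_no_match *= (days - k) / days
    return 1.0 - p_no_match


def birthday_simulated(n, trials=20_000, days=365, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(days) for _ in range(n)]
        if len(set(birthdays)) < n:  # a duplicate means a shared birthday
            hits += 1
    return hits / trials


exact_23 = birthday_exact(23)
simulated_23 = birthday_simulated(23)
print(f"exact: {exact_23:.4f}, simulated: {simulated_23:.4f}")
```

With 23 people the exact probability already crosses one half, which is the usual surprise of this problem.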
The Top 10 Ideas covered in this class are:
  1. Probability, Conditioning is the soul of Statistics, Story Proofs
  2. Bayes' Theorem, Law of Total Probability, First Step Analysis
  3. Expectation and Variance for discrete RVs and continuous RVs. LOTUS
  4. Discrete (Bernoulli, Binomial, Hypergeometric, Geometric, Negative Binomial, First Success, Poisson) and Continuous (Uniform, Normal, Exponential, Beta, Gamma) Distributions and the stories behind them
  5. Moment Generating Functions (MGFs) and their Properties
  6. Joint and Marginal distributions, Covariance and Correlation
  7. Convolutions and Transformations
  8. Conditional Expectation – Adam's and Eve's Laws
  9. Law of Large Numbers and CLT
  10. Markov Chains
Solving the problem sets and the midterm reviews helped me a lot in grasping the abstract concepts.
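As a small illustration of idea 2 in the list above, the sketch below combines Bayes' Theorem with the Law of Total Probability. The numbers (1% prevalence, 95% sensitivity, 5% false positive rate) are hypothetical values chosen for the example, and the function name is my own.

```python
from fractions import Fraction


def posterior(prior, p_pos_given_d, p_pos_given_not_d):
    """P(D | +) via Bayes' Theorem.

    The denominator is the Law of Total Probability:
    P(+) = P(+ | D) P(D) + P(+ | not D) P(not D).
    """
    p_pos = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)
    return p_pos_given_d * prior / p_pos


# Hypothetical test: 1% prevalence, 95% sensitivity, 5% false positive rate.
result = posterior(Fraction(1, 100), Fraction(95, 100), Fraction(5, 100))
print(result, float(result))
```

Using `Fraction` keeps the arithmetic exact, so the answer comes out as 19/118 (about 16%): even after a positive test, the condition is still unlikely because the prior is so small.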
(F2) Stat 111: TODO
Uses DeGroot and Schervish for instruction. No lecture videos are available, so I plan to read the book and complete the problem sets online from the Stat111 website. I so wish the lectures were there.
A lecture series on Bayesian statistics by Jarad Niemi at ISU.
I got highly interested in probability after STAT 110, so I added this here. It is an alternative to one of the follow-up courses, apart from STAT 111, that Professor Joe Blitzstein talks about in STAT 110.
MACHINE LEARNING:
This is a fantastic course for learning about R as well as the implementations of various machine learning algorithms in R. Very basic, very crisp and very informative. The scenarios and examples range from Moneyball to Watson. The only problem with this course is that its problem sets feel a little repetitive.
Here is the location of my R code repository for this course
My first ML class. It took me a little long to grasp the concepts, but in hindsight that might be because of my lack of exposure to the material. It was my first grapple with tools like R and Python. It covers a whole lot of ground, from R to Python to MapReduce. I would put it here as it gives a thorough perspective of the whole data science space.
(F3) Data Science CS109: – Again by Professor Blitzstein. Again an awesome course. Watch it after Stat110, as you will be able to understand everything much better with a thorough grinding in Stat110 concepts. You will learn about Python libraries for data science, along with a thorough intuitive grinding in various machine learning algorithms. Course description from the website:
Learning from data in order to gain useful predictions and insights. This course introduces methods for five key facets of an investigation: data wrangling, cleaning, and sampling to get a suitable data set; data management to be able to access big data quickly and reliably; exploratory data analysis to generate hypotheses and intuition; prediction based on statistical methods such as regression and classification; and communication of results through visualization, stories, and interpretable summaries.
Contains the maths behind many of the machine learning algorithms. The game-changer machine learning course. I will put this course as numero uno, as it motivated me to get into this field and Andrew Ng is a great instructor.
DISTRIBUTED AND PARALLEL COMPUTING:
Very easy course. It teaches the fundamentals of Hadoop Streaming with Python and is taught by Cloudera on Udacity. I am doing much more advanced stuff with Python and MapReduce now, but this is one of the courses that laid the foundation there.
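For readers who have not seen Hadoop Streaming, here is a rough sketch of the idea in plain Python. In a real streaming job the mapper and reducer are separate scripts that read from stdin and print tab-separated key/value lines to stdout; here they are ordinary generator functions (names are my own) so the whole thing runs locally, and `sorted()` stands in for Hadoop's shuffle-and-sort step.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word; assumes records arrive sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


sample = ["the quick brown fox", "the lazy dog", "The fox"]
shuffled = sorted(mapper(sample))  # local stand-in for shuffle-and-sort
counts = list(reducer(shuffled))
print("\n".join(counts))
```

The reducer only works because equal keys are adjacent after sorting, which is the guarantee the framework gives to every reducer.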
A mighty flame followeth a tiny spark.
This is a series of courses in Spark taught by Anthony D. Joseph, a Professor in Electrical Engineering and Computer Science at UC Berkeley, and Ameet Talwalkar, a well-known name in the Spark community.
This course delivers on what it says. It teaches Spark. Total beginners will have difficulty following the course, as it progresses very fast. That said, anyone with a decent understanding of how big data works will be OK.
The top ideas covered in this course are:
  1. RDD Transformations (map, flatMap, filter, distinct, groupByKey, sortByKey, reduceByKey)
  2. RDD Actions (reduce, takeOrdered, take, collect)
  3. Accumulator and Broadcast Variables
  4. DataFrames in PySpark
  5. SQL-style joins on pair RDDs – leftOuterJoin, rightOuterJoin, fullOuterJoin
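To show what a few of the operations in the list above do, here is a plain-Python sketch that needs no Spark installation. The helpers `flat_map` and `reduce_by_key` are my own stand-ins that mimic the semantics of the RDD methods `flatMap` and `reduceByKey` on ordinary lists; they are not Spark code, and they ignore everything that makes Spark interesting (laziness, partitioning, distribution).

```python
from collections import defaultdict
from functools import reduce
from operator import add


def flat_map(func, items):
    """Apply func to each item and flatten the results into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine all values that share a key using a binary function."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


lines = ["to be or not to be", "to do is to be"]
words = flat_map(str.split, lines)         # like flatMap
pairs = [(word, 1) for word in words]      # like map
counts = reduce_by_key(add, pairs)         # like reduceByKey
# Like takeOrdered(2, ...): highest count first, ties broken alphabetically.
top_two = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
print(top_two)
```

In PySpark the same pipeline would be written as a chain of `flatMap`, `map` and `reduceByKey` calls on an RDD, followed by an action such as `takeOrdered` to bring results back to the driver.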
I certainly liked the Mini Projects in the class:
  1. Wordcount in Spark – A word counting program to count the words in all of Shakespeare’s plays
  2. Apache Log File analysis in Spark – Use Spark to explore a NASA Apache web server log
  3. Entity Resolution – Entity Resolution using TF-IDF approaches in Spark.
  4. Movie Recommendation using ALS – Predicting Movie ratings using Spark.
  5. Linear Regression – Predicting Song Year using Linear regression in Spark.
  6. Logistic Regression – Predicting Click Through Rates using Spark. One Hot Encoding, Hashing Explained.
  7. PCA – Running PCA on neuroscience data
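The click-through-rate project in the list above mentions one-hot encoding and hashing, so here is a minimal plain-Python sketch of both. The feature names and the bucket count are hypothetical, the helper names are my own, and `zlib.crc32` is just one convenient deterministic hash for the illustration, not necessarily what the course used.

```python
import zlib


def one_hot(features, index):
    """One-hot vector with one slot per known (field, value) pair."""
    vec = [0] * len(index)
    for feature in features:
        if feature in index:
            vec[index[feature]] = 1
    return vec


def hashed(features, num_buckets):
    """Hashing trick: pick a slot by hashing, with no dictionary to build."""
    vec = [0] * num_buckets
    for field, value in features:
        key = f"{field}={value}".encode("utf-8")
        vec[zlib.crc32(key) % num_buckets] += 1
    return vec


# Hypothetical categorical features for two ad impressions.
rows = [
    [("device", "mobile"), ("site", "news")],
    [("device", "desktop"), ("site", "news")],
]

# One-hot encoding needs a dictionary of every distinct feature seen.
distinct = sorted({feature for row in rows for feature in row})
index = {feature: i for i, feature in enumerate(distinct)}
encoded = [one_hot(row, index) for row in rows]

# The hashing trick fixes the vector length up front instead.
hashed_rows = [hashed(row, num_buckets=8) for row in rows]
print(encoded)
print(hashed_rows)
```

The trade-off is that hashing bounds the vector size no matter how many distinct values appear, at the cost of occasional collisions where two features share a bucket.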
Some of the courses here may seem repetitive, but they have all provided some additional skills, so I have put them here.
I will update this answer with more details as I complete the TODO courses on the list.
Hope that Helps 🙂
