  • Data Science
  • Experfy Editor
  • MAR 22, 2014

Ingredients in the making of a Data Scientist

How does one prepare for a career in data science? What credentials enable you to become a data scientist? These are frequently asked questions. Swami Chandrasekaran, Executive Architect at IBM Watson, offers a roadmap. His suggested curriculum is compelling, and his metro-map analogy is a useful one. He presents ten metro lines:

  1. Fundamentals
  2. Statistics
  3. Programming
  4. Machine Learning
  5. Text Mining / Natural Language Processing
  6. Data Visualization
  7. Big Data
  8. Data Ingestion
  9. Data Munging
  10. Toolbox

Road to Data Scientist

If you have trouble reading the map, here is a full list in text.

Fundamentals

  1. Matrices & Linear Algebra Fundamentals
  2. Hash Functions, Binary Tree, O(n)
  3. Relational Algebra, DB Basics
  4. Inner, Outer, Cross, Theta Join
  5. CAP Theorem
  6. Tabular Data
  7. Entropy
  8. Data Frames & Series
  9. Sharding
  10. OLAP
  11. Multidimensional Data Model
  12. Extract/Transform/Load (ETL)
  13. Reporting vs BI vs Analytics
  14. JSON & XML
  15. NoSQL
  16. Regex
  17. Vendor Landscape
  18. Env Setup

Statistics

  1. Pick a Dataset (UCI Repo)
  2. Descriptive Statistics (mean, median, range, SD, Var)
  3. Exploratory Data Analysis
  4. Histograms
  5. Percentiles & Outliers
  6. Probability Theory
  7. Bayes' Theorem
  8. Random Variables
  9. Cumulative Distribution Function (CDF)
  10. Continuous Distributions (Normal, Poisson, Gaussian)
  11. Skewness
  12. Analysis of Variance (ANOVA)
  13. Probability Density Function (PDF)
  14. Central Limit Theorem
  15. Monte Carlo Method
  16. Hypothesis Testing
  17. p-Value
  18. Chi-square Test
  19. Estimation
  20. Confidence Interval (CI)
  21. Maximum Likelihood Estimation (MLE)
  22. Kernel Density Estimate
  23. Regression
  24. Covariance
  25. Correlation
  26. Pearson Coeff
  27. Causation
  28. Least Squares Fit
  29. Euclidean Distance

Programming

  1. Python Basics
  2. Working in Excel
  3. R Setup, R Studio
  4. R Basics
  5. Expressions
  6. Variables
  7. IBM SPSS, Rapid Miner
  8. Vectors
  9. Matrices
  10. Arrays
  11. Factors
  12. Lists
  13. Data Frames
  14. Reading CSV Data
  15. Reading RAW Data
  16. Subsetting Data
  17. Manipulate Data Frames
  18. Functions
  19. Factor Analysis
  20. Install Pkgs

Machine Learning

  1. What is ML?
  2. Numerical Var
  3. Categorical Variable
  4. Supervised Learning
  5. Unsupervised Learning
  6. Concepts, Inputs & Attributes
  7. Training & Test Data
  8. Classifier
  9. Prediction
  10. Lift
  11. Overfitting
  12. Bias & Variance
  13. Trees & Classification
  14. Classification, Classification Rate
  15. Decision Trees
  16. Boosting
  17. Naïve Bayes Classifiers
  18. K-Nearest Neighbor
  19. Logistic Regression
  20. Regression, Ranking
  21. Linear Regression
  22. Perceptron
  23. Clustering, Hierarchical Clustering
  24. K-means Clustering
  25. Neural Networks
  26. Sentiment Analysis
  27. Collaborative Filtering
  28. Tagging

Text Mining/Natural Language Processing

  1. Corpus
  2. Named Entity Recognition
  3. Text Analysis
  4. UIMA
  5. Term Document Matrix
  6. Term Frequency & Weight
  7. Support Vector Machines
  8. Association Rules
  9. Market Based Analysis (Market Basket Analysis?)
  10. Feature Extraction
  11. Using Mahout
  12. Using Weka
  13. Using Natural Language Toolkit (NLTK)
  14. Classify Text (Document Classification?)
  15. Vocabulary Mapping

Data Visualization

  1. Data Exploration in R (Hist, Boxplot, etc.)
  2. Uni, Bi & Multivariate Viz
  3. ggplot2
  4. Histogram & Pie (Uni)
  5. Tree & Tree Map
  6. Scatter Plot (Bi)
  7. Line Charts (Bi)
  8. Spatial Charts
  9. Survey Plot
  10. Timeline
  11. Decision Tree
  12. D3.js
  13. InfoVis
  14. IBM ManyEyes
  15. Tableau

Big Data

  1. Map Reduce Framework
  2. Hadoop Components
  3. HDFS
  4. Data Replication Principles
  5. Setup Hadoop (IBM / Cloudera / HortonWorks)
  6. Name & Data Nodes
  7. Job & Task Tracker
  8. M/R Programming
  9. Sqoop: Loading Data in HDFS
  10. Flume, Scribe: For Unstructured Data
  11. SQL with Pig
  12. DWH with Hive
  13. Scribe, Chukwa for Weblog
  14. Using Mahout
  15. Zookeeper, Avro
  16. Storm: Hadoop Realtime
  17. Rhadoop, RHIPE
  18. rmr
  19. Cassandra
  20. MongoDB, Neo4j

Data Ingestion

  1. Summary of Data Formats
  2. Data Discovery
  3. Data Sources & Acquisition
  4. Data Integration
  5. Data Fusion
  6. Transformation, Enrichment
  7. Data Survey
  8. Google OpenRefine
  9. How much Data?
  10. Using ETL

Data Munging

  1. Dimensionality & Numerosity Reduction
  2. Normalization
  3. Data Scrubbing
  4. Handling Missing Values
  5. Unbiased Estimators
  6. Binning Sparse Values
  7. Feature Extraction
  8. Denoising
  9. Sampling
  10. Stratified Sampling
  11. Principal Component Analysis

Toolbox

  1. MS Excel w/ Analysis ToolPak
  2. Java, Python
  3. R, R-Studio, Rattle
  4. Weka, Knime, RapidMiner
  5. Hadoop Dist of Choice
  6. Spark, Storm
  7. Flume, Scribe, Chukwa
  8. Nutch, Talend, Scraperwiki
  9. Webscraper, Flume, Sqoop (Flume Dup?)
  10. tm, RWeka, NLTK
  11. RHIPE
  12. D3.js, ggplot2, Shiny
  13. IBM Languageware
  14. Cassandra, MongoDB

The only thing we would add to this extensive framework is, of course, domain expertise within a specific industry, without which one may not be able to ask the right questions.

Header image credit: Biocomicals.com
