Ingredients in the making of a Data Scientist

Cameron Turner
August 28, 2015 Big Data, Cloud & DevOps
How does one prepare for a career in data science? What credentials enable you to become a data scientist? These are frequently asked questions. Swami Chandrasekaran, the Executive Architect at IBM Watson, offers a roadmap. Chandrasekaran’s suggested curriculum is compelling, and his metro-map analogy is a useful one. He presents ten metro lines:
  1. Fundamentals
  2. Statistics
  3. Programming
  4. Machine Learning
  5. Text Mining / Natural Language Processing
  6. Data Visualization
  7. Big Data
  8. Data Ingestion
  9. Data Munging
  10. Toolbox

Fundamentals

  1. Matrices & Linear Algebra Fundamentals
  2. Hash Functions, Binary Tree, O(n)
  3. Relational Algebra, DB Basics
  4. Inner, Outer, Cross, Theta Join
  5. CAP Theorem
  6. Tabular Data
  7. Entropy
  8. Data Frames & Series
  9. Sharding
  10. OLAP
  11. Multidimensional Data Model
  12. Extract/Transform/Load (ETL)
  13. Reporting vs BI vs Analytics
  14. JSON & XML
  15. NoSQL
  16. Regex
  17. Vendor Landscape
  18. Env Setup

Statistics

  1. Pick a Dataset (UCI Repo)
  2. Descriptive Statistics (mean, median, range, SD, Var)
  3. Exploratory Data Analysis
  4. Histograms
  5. Percentiles & Outliers
  6. Probability Theory
  7. Bayes Theorem
  8. Random Variables
  9. Cumulative Distribution Function (CDF)
  10. Continuous Distributions (Normal, Poisson, Gaussian)
  11. Skewness
  12. Analysis of Variance (ANOVA)
  13. Probability Density Function (PDF)
  14. Central Limit Theorem
  15. Monte Carlo Method
  16. Hypothesis Testing
  17. p-Value
  18. Chi-square Test
  19. Estimation
  20. Confidence Interval (CI)
  21. Maximum Likelihood Estimation (MLE)
  22. Kernel Density Estimate
  23. Regression
  24. Covariance
  25. Correlation
  26. Pearson Coeff
  27. Causation
  28. Least Squares Fit
  29. Euclidean Distance
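
As a small illustration of the descriptive statistics stop on this line (mean, median, range, SD, Var), here is a minimal sketch using only Python's standard library. The `describe` helper and the `sample` values are hypothetical and chosen purely for illustration; they are not part of Chandrasekaran's map.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),      # sample standard deviation
        "var": statistics.variance(values),  # sample variance (n - 1 denominator)
    }


# Hypothetical sample data, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(sample)
print(summary)
```

In practice one would usually start from a real dataset, such as one from the UCI repository mentioned at the top of this line, rather than a hand-typed list.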

Programming

  1. Python Basics
  2. Working in Excel
  3. R Setup, R Studio
  4. R Basics
  5. Expressions
  6. Variables
  7. IBM SPSS, Rapid Miner
  8. Vectors
  9. Matrices
  10. Arrays
  11. Factors
  12. Lists
  13. Data Frames
  14. Reading CSV Data
  15. Reading RAW Data
  16. Subsetting Data
  17. Manipulate Data Frames
  18. Functions
  19. Factor Analysis
  20. Install Pkgs

Machine Learning

  1. What is ML?
  2. Numerical Var
  3. Categorical Variable
  4. Supervised Learning
  5. Unsupervised Learning
  6. Concepts, Inputs & Attributes
  7. Training & Test Data
  8. Classifier
  9. Prediction
  10. Lift
  11. Overfitting
  12. Bias & Variance
  13. Trees & Classification
  14. Classification, Classification Rate
  15. Decision Trees
  16. Boosting
  17. Naïve Bayes Classifiers
  18. K-Nearest Neighbor
  19. Logistic Regression
  20. Regression, Ranking
  21. Linear Regression
  22. Perceptron
  23. Clustering, Hierarchical Clustering
  24. K-means Clustering
  25. Neural Networks
  26. Sentiment Analysis
  27. Collaborative Filtering
  28. Tagging

Text Mining/Natural Language Processing

  1. Corpus
  2. Named Entity Recognition
  3. Text Analysis
  4. UIMA
  5. Term Document Matrix
  6. Term Frequency & Weight
  7. Support Vector Machines
  8. Association Rules
  9. Market Basket Analysis
  10. Feature Extraction
  11. Using Mahout
  12. Using Weka
  13. Using Natural Language Toolkit (NLTK)
  14. Classify Text ( Document Classification? )
  15. Vocabulary Mapping

Data Visualization

  1. Data Exploration in R (Hist, Boxplot etc)
  2. Uni, Bi & Multivariate Viz
  3. ggplot2
  4. Histogram & Pie (Uni)
  5. Tree & Tree Map
  6. Scatter Plot (Bi)
  7. Line Charts (Bi)
  8. Spatial Charts
  9. Survey Plot
  10. Timeline
  11. Decision Tree
  12. D3.js
  13. InfoVis
  14. IBM ManyEyes
  15. Tableau

Big Data

  1. Map Reduce Framework
  2. Hadoop Components
  3. HDFS
  4. Data Replication Principles
  5. Setup Hadoop ( IBM / Cloudera / HortonWorks )
  6. Name & Data Nodes
  7. Job & Task Tracker
  8. M/R Programming
  9. Sqoop : Loading Data in HDFS
  10. Flume, Scribe : For Unstructured Data
  11. SQL with Pig
  12. DWH with Hive
  13. Scribe, Chukwa For Weblog
  14. Using Mahout
  15. Zookeeper, Avro
  16. Storm : Hadoop Realtime
  17. RHadoop, RHIPE
  18. rmr
  19. Cassandra
  20. MongoDB, Neo4j

Data Ingestion

  1. Summary of Data Formats
  2. Data Discovery
  3. Data Sources & Acquisition
  4. Data Integration
  5. Data Fusion
  6. Transformation, Enrichment
  7. Data Survey
  8. Google OpenRefine
  9. How much Data?
  10. Using ETL

Data Munging

  1. Dimensionality & Numerosity Reduction
  2. Normalization
  3. Data Scrubbing
  4. Handling Missing Values
  5. Unbiased Estimators
  6. Binning Sparse Values
  7. Feature Extraction
  8. Denoising
  9. Sampling
  10. Stratified Sampling
  11. Principal Component Analysis
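
Two of the stops on this line, handling missing values and normalization, can be sketched in a few lines of plain Python. This is only one possible approach, assuming mean imputation and min-max scaling; the function names and the `raw` values are hypothetical, and other strategies (dropping rows, z-score scaling) may suit a given dataset better.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw column with one missing value.
raw = [10, None, 30, 20]
filled = impute_mean(raw)
scaled = min_max_normalize(filled)
print(filled, scaled)
```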

Toolbox

  1. MS Excel w/ Analysis ToolPak
  2. Java, Python
  3. R, R-Studio, Rattle
  4. Weka, Knime, RapidMiner
  5. Hadoop Dist of Choice
  6. Spark, Storm
  7. Flume, Scribe, Chukwa
  8. Nutch, Talend, Scraperwiki
  9. Webscraper, Flume, Sqoop (Flume Dup?)
  10. tm, RWeka, NLTK
  11. RHIPE
  12. D3.js, ggplot2, Shiny
  13. IBM Languageware
  14. Cassandra, MongoDB

The only thing that we would add to this extensive framework is, of course, domain expertise within a specific industry, without which one may not be able to ask the right questions.