Ingredients in the making of a Data Scientist

How does one prepare for a career in data science? What credentials enable you to become a data scientist? These are frequently asked questions. Swami Chandrasekaran, the Executive Architect at IBM Watson, offers a roadmap. Chandrasekaran’s suggested curriculum is compelling, and his metro-map analogy is a useful one. He presents ten metro lines:
  1. Fundamentals
  2. Statistics
  3. Programming
  4. Machine Learning
  5. Text Mining / Natural Language Processing
  6. Data Visualization
  7. Big Data
  8. Data Ingestion
  9. Data Munging
  10. Toolbox

Fundamentals

Statistics

Programming

Machine Learning

Text Mining/Natural Language Processing

  1. Corpus
  2. Named Entity Recognition
  3. Text Analysis
  4. UIMA
  5. Term Document Matrix
  6. Term Frequency & Weight
  7. Support Vector Machines
  8. Association Rules
  9. Market Based Analysis (Market Basket Analysis?)
  10. Feature Extraction
  11. Using Mahout
  12. Using Weka
  13. Using Natural Language Toolkit (NLTK)
  14. Classify Text (Document Classification?)
  15. Vocabulary Mapping
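
To make items 5 and 6 of this list (term document matrix, term frequency) a little more concrete, here is a minimal sketch in plain Python. It is not taken from Chandrasekaran's roadmap. The function name `term_document_matrix`, the whitespace tokenizer, and the two sample sentences are all assumptions made for illustration; a real project would more likely use a library such as NLTK for tokenization.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] is the raw count
    of term vocab[i] in document j (hypothetical helper)."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every distinct term, sorted for a stable row order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # One term-frequency counter per document.
    counts = [Counter(toks) for toks in tokenized]
    # Rows are terms, columns are documents.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

# Illustrative sample documents, not real data.
docs = ["data science is fun", "big data is big"]
vocab, matrix = term_document_matrix(docs)
for term, row in zip(vocab, matrix):
    print(term, row)
```

Each row holds one term's raw frequency across the documents; weighting schemes such as TF-IDF start from these counts.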

Data Visualization

  1. Data Exploration in R (Hist, Boxplot etc.)
  2. Uni, Bi & Multivariate Viz
  3. ggplot2
  4. Histogram & Pie (Uni)
  5. Tree & Tree Map
  6. Scatter Plot (Bi)
  7. Line Charts (Bi)
  8. Spatial Charts
  9. Survey Plot
  10. Timeline
  11. Decision Tree
  12. D3.js
  13. InfoVis
  14. IBM ManyEyes
  15. Tableau

Big Data

  1. Map Reduce Framework
  2. Hadoop Components
  3. HDFS
  4. Data Replication Principles
  5. Set up Hadoop (IBM / Cloudera / Hortonworks)
  6. Name & Data Nodes
  7. Job & Task Tracker
  8. M/R Programming
  9. Sqoop: Loading Data in HDFS
  10. Flume, Scribe: For Unstructured Data
  11. SQL with Pig
  12. DWH with Hive
  13. Scribe, Chukwa for Weblog
  14. Using Mahout
  15. Zookeeper, Avro
  16. Storm: Hadoop Realtime
  17. Rhadoop, RHIPE
  18. rmr
  19. Cassandra
  20. MongoDB, Neo4j
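
For readers new to the first item on this line, the Map Reduce idea can be sketched without Hadoop at all. The following is a single-process Python illustration of the map, shuffle, and reduce steps applied to word counting. It is an assumption-laden toy, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: collapse each key's values into a single total.
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Illustrative input, not real data.
result = word_count(["big data", "big ideas"])
print(result)
```

In a real cluster the map and reduce calls run in parallel on different nodes and the framework performs the shuffle; the data flow, however, follows the same three steps.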

Data Ingestion

  1. Summary of Data Formats
  2. Data Discovery
  3. Data Sources & Acquisition
  4. Data Integration
  5. Data Fusion
  6. Transformation, Enrichment
  7. Data Survey
  8. Google OpenRefine
  9. How much Data?
  10. Using ETL

Toolbox

The only thing we would add to this extensive framework is, of course, domain expertise within a specific industry, without which one may not be able to ask the right questions.