• Data Science
• Experfy Editor
• MAR 22, 2014

# Ingredients in the making of a Data Scientist

How does one prepare for a career in data science? What credentials enable you to become a data scientist? These are frequently asked questions. Swami Chandrasekaran, the Executive Architect at IBM Watson, offers a roadmap. Chandrasekaran's suggested curriculum is compelling, and his metro map analogy is a useful one. He presents ten metro lines:

1. Fundamentals
2. Statistics
3. Programming
4. Machine Learning
5. Text Mining / Natural Language Processing
6. Data Visualization
7. Big Data
8. Data Ingestion
9. Data Munging
10. Toolbox

If you have trouble reading the map, here is a full list in text.

#### Fundamentals

1. Matrices & Linear Algebra Fundamentals
2. Hash Functions, Binary Tree, O(n)
4. Inner, Outer, Cross, Theta Join
5. CAP Theorem
6. Tabular Data
7. Entropy
8. Data Frames & Series
9. Sharding
10. OLAP
11. Multidimensional Data Model
13. Reporting vs BI vs Analytics
14. JSON & XML
15. NoSQL
16. Regex
17. Vendor Landscape
18. Env Setup

#### Programming

1. Python Basics
2. Working in Excel
3. R Setup, R Studio
4. R Basics
5. Expressions
6. Variables
7. IBM SPSS, RapidMiner
8. Vectors
9. Matrices
10. Arrays
11. Factors
12. Lists
13. Data Frames
16. Subsetting Data
17. Manipulate Data Frames
18. Functions
19. Factor Analysis
20. Install Pkgs

#### Machine Learning

1. What is ML?
2. Numerical Variable
3. Categorical Variable
4. Supervised Learning
5. Unsupervised Learning
6. Concepts, Inputs & Attributes
7. Training & Test Data
8. Classifier
9. Prediction
10. Lift
11. Overfitting
12. Bias & Variance
13. Trees & Classification
14. Classification, Classification Rate
15. Decision Trees
16. Boosting
17. Naïve Bayes Classifiers
18. K-Nearest Neighbor
19. Logistic Regression
20. Regression, Ranking
21. Linear Regression
22. Perceptron
23. Clustering, Hierarchical Clustering
24. K-means Clustering
25. Neural Networks
26. Sentiment Analysis
27. Collaborative Filtering
28. Tagging

#### Text Mining/Natural Language Processing

1. Corpus
2. Named Entity Recognition
3. Text Analysis
4. UIMA
5. Term Document Matrix
6. Term Frequency & Weight
7. Support Vector Machines
8. Association Rules
9. Market Basket Analysis
10. Feature Extraction
11. Using Mahout
12. Using Weka
13. Using Natural Language Toolkit (NLTK)
14. Classify Text (Document Classification)
15. Vocabulary Mapping

#### Data Visualization

1. Data Exploration in R (Hist, Boxplot, etc.)
2. Uni, Bi & Multivariate Viz
3. ggplot2
4. Histogram & Pie (Uni)
5. Tree & Tree Map
6. Scatter Plot (Bi)
7. Line Charts (Bi)
8. Spatial Charts
9. Survey Plot
10. Timeline
11. Decision Tree
12. D3.js
13. InfoVis
14. IBM ManyEyes
15. Tableau

#### Big Data

1. Map Reduce Framework
3. HDFS
4. Data Replication Principles
5. Setup Hadoop (IBM / Cloudera / Hortonworks)
6. Name & Data Nodes
8. M/R Programming
10. Flume, Scribe: For Unstructured Data
11. SQL with Pig
12. DWH with Hive
13. Scribe, Chukwa: For Weblog
14. Using Mahout
15. Zookeeper, Avro
18. rmr
19. Cassandra
20. MongoDB, Neo4j

#### Data Ingestion

1. Summary of Data Formats
2. Data Discovery
3. Data Sources & Acquisition
4. Data Integration
5. Data Fusion
6. Transformation, Enrichment
7. Data Survey
9. How much Data?
10. Using ETL

#### Data Munging

1. Dimensionality & Numerosity Reduction
2. Normalization
3. Data Scrubbing
4. Handling Missing Values
5. Unbiased Estimators
6. Binning Sparse Values
7. Feature Extraction
8. Denoising
9. Sampling
10. Stratified Sampling
11. Principal Component Analysis

#### Toolbox

1. MS Excel w/ Analysis ToolPak
2. Java, Python
3. R, R-Studio, Rattle
4. Weka, Knime, RapidMiner
6. Spark, Storm
7. Flume, Scribe, Chukwa
8. Nutch, Talend, Scraperwiki
9. Webscraper, Flume, Sqoop (Flume Dup?)
10. tm, RWeka, NLTK
11. RHIPE
12. D3.js, ggplot2, Shiny
13. IBM Languageware
14. CassandraMongoDB

The only thing we would add to this extensive framework is, of course, domain expertise within a specific industry, without which one may not be able to ask the right questions.