• Data Science
  • Melody Ucros
  • MAR 28, 2018

Data Preprocessing for Non-Techies: Feature Exploration and Engineering, Part Two — Checklist of Most Common Practices

Ready to learn Data Science? Browse courses like Data Science Training and Certification developed by industry thought leaders and Experfy in Harvard Innovation Lab.

Now that we have covered the basic terms and definitions for data types and structure in my previous post, let’s dive into the creative and most time-consuming side of data science — cleaning and feature engineering.

What are some of the basic strategies that data scientists use to clean their data AND improve the amount of information they get from it?

The cleaning and engineering strategies you use usually depend on the business problem and the type of target variable, since these influence the algorithm and the data-preparation requirements.

Therefore, I will provide a basic checklist that can help any beginner (including me) brainstorm what to do with the data at this stage.

The most important part of data cleaning is experimentation: checking how applying one or more of these strategies affects your ability to actually predict or classify with the model.

Also, although there is some logic to the order, keep in mind that these steps always happen in iteration, and you will always go back and forth between:

→ Exploration, Cleaning, Creation, and Selection

Data Exploration

A. Variable Identification:

  1. Context of Target Variable (logical connection)
  2. Data Type per Feature (character, numeric, etc)
  3. Variable Category (Continuous, Categorical, etc.)

B. Uni-variate Analysis:

  1. Central Tendency & Spread for Continuous
  2. Distribution(levels) for Categorical
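As a minimal sketch of uni-variate analysis in Python with pandas (the toy dataset and column names are made up for illustration), you can summarize central tendency and spread for a continuous variable and the level counts for a categorical one:

```python
import pandas as pd

# Toy dataset, purely illustrative
df = pd.DataFrame({
    "price": [100, 120, 95, 300, 110, 105],
    "neighborhood": ["A", "B", "A", "C", "A", "B"],
})

# Central tendency & spread for a continuous variable
mean_price = df["price"].mean()
median_price = df["price"].median()
std_price = df["price"].std()

# Distribution (levels) for a categorical variable
level_counts = df["neighborhood"].value_counts()
```

A big gap between the mean and median (as here, thanks to the 300 value) is already a hint that outliers or skew will need attention later.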

C. Bi-variate Analysis:

  1. Correlation of Continuous Variables
  2. Two-Way Table or Stacked Columns for Categorical
  3. Chi-Square Test for Categorical
  4. Z-Test for Categorical vs Continuous
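A quick sketch of the bi-variate checks above, assuming pandas and scipy are available (the housing-style columns are invented for the example): correlation for two continuous variables, a two-way table for two categoricals, and a chi-square test of independence on that table.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented example data
df = pd.DataFrame({
    "sqft":   [800, 950, 1100, 1400, 1700, 2000],
    "price":  [100, 115, 130, 160, 200, 230],
    "garage": ["no", "no", "yes", "yes", "yes", "yes"],
    "sold":   ["no", "yes", "no", "yes", "yes", "yes"],
})

# Correlation between two continuous variables
corr = df["sqft"].corr(df["price"])

# Two-way table for two categorical variables
table = pd.crosstab(df["garage"], df["sold"])

# Chi-square test of independence on the two-way table
chi2, p_value, dof, expected = chi2_contingency(table)
```

On real data you would read the p-value against a significance level (commonly 0.05) to decide whether the two categoricals look related.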

Data Cleaning

A. Remove Noise:

  1. Duplicates
  2. Paragraph Columns
  3. Erroneous Values
  4. Contradictions
  5. Mislabels
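Two of the most common noise-removal steps, sketched with pandas on an invented table (the negative age stands in for any clearly erroneous value):

```python
import pandas as pd

df = pd.DataFrame({
    "id":  [1, 2, 2, 3, 4],
    "age": [34, 28, 28, -5, 41],  # -5 is clearly erroneous
})

# Drop exact duplicate rows
df = df.drop_duplicates()

# Remove rows with erroneous values (e.g. negative ages)
df = df[df["age"] >= 0]
```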

B. Missing Values:

  1. Delete
  2. Mean/Mode/Median Imputation
  3. Prediction Model
  4. KNN Imputation
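The two simplest options above, deletion and mean imputation, look like this in pandas (toy column, and median/mode imputation work the same way by swapping the statistic):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [40.0, 50.0, np.nan, 60.0, np.nan]})

# Option 1: delete rows with missing values
dropped = df.dropna()

# Option 2: impute missing values with the column mean
mean_filled = df["income"].fillna(df["income"].mean())
```

For the prediction-model or KNN options, scikit-learn's imputers are the usual starting point.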

C. Outliers:

  1. Cut-Off or Delete
  2. Natural Log
  3. Binning
  4. Assign Weights
  5. Mean/Mode/Median Imputation
  6. Build Predictive Model
  7. Treat them separately
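The cut-off and natural-log options can be sketched in a few lines of pandas (the price series is invented, and the 95th percentile is just one common cap choice):

```python
import numpy as np
import pandas as pd

prices = pd.Series([100, 110, 105, 120, 115, 2000])  # 2000 is an outlier

# Cut-off: cap values at the 95th percentile
cap = prices.quantile(0.95)
capped = prices.clip(upper=cap)

# Natural log compresses the long right tail
logged = np.log(prices)
```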

D. Variable Transformation:

  1. Logarithm
  2. Square / Cube root
  3. Binning / Discretization
  4. Dummies
  5. Factorization
  6. Other Data Type
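Dummies and binning, two of the transformations above, sketched with pandas (the bin edges and labels are arbitrary choices for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "age":   [5, 17, 35, 62],
})

# Dummies: one-hot encode a categorical variable
dummies = pd.get_dummies(df["color"], prefix="color")

# Binning / discretization: turn a continuous variable into categories
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 40, 100],
                         labels=["child", "adult", "senior"])
```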

Feature Creation

A. Indicator Features

  1. Threshold (ex. below certain price = poor)
  2. Combination of features (ex. premium house if 2 bed, 2 bath)
  3. Special Events (ex. Christmas Day or Black Friday)
  4. Event Type (ex. paid vs unpaid based on traffic source)
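Threshold and combination indicators from the list above, as a short pandas sketch (the 150 price cut-off and the 2-bed/2-bath rule are arbitrary examples):

```python
import pandas as pd

df = pd.DataFrame({
    "price":     [80, 250, 400, 120],
    "bedrooms":  [1, 2, 3, 2],
    "bathrooms": [1, 2, 2, 1],
})

# Threshold indicator: below a chosen price counts as "budget"
df["is_budget"] = (df["price"] < 150).astype(int)

# Combination indicator: "premium" if at least 2 bed and 2 bath
df["is_premium"] = ((df["bedrooms"] >= 2) & (df["bathrooms"] >= 2)).astype(int)
```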

B. Representation Features

  1. Domain and Time Extractions (ex.purchase_day_of_week)
  2. Numeric to Categorical (ex. years in school to “elementary”)
  3. Grouping sparse classes (ex. sold, all other are “other”)
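Two of these in pandas, using invented data: extracting the day of week from a date (the `purchase_day_of_week` example above) and lumping sparse classes into "other":

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2018-03-26", "2018-03-31"]),
    "status":        ["sold", "pending"],
})

# Date extraction: purchase day of week
df["purchase_day_of_week"] = df["purchase_date"].dt.day_name()

# Grouping sparse classes: keep "sold", lump everything else together
df["status_grouped"] = df["status"].where(df["status"] == "sold", "other")
```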

C. Interaction Features

  1. Sum of Features
  2. Difference of Features
  3. Product of Features
  4. Quotient of Features
  5. Unique Formula
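Interaction features are just arithmetic between columns. A toy pandas sketch (length/width are invented; area, aspect ratio, and perimeter stand in for product, quotient, and sum):

```python
import pandas as pd

df = pd.DataFrame({"length": [10.0, 20.0], "width": [4.0, 5.0]})

# Product of features
df["area"] = df["length"] * df["width"]

# Quotient of features
df["aspect_ratio"] = df["length"] / df["width"]

# Sum (and difference) of features work the same way
df["perimeter"] = 2 * (df["length"] + df["width"])
```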

D. Conjunctive Features

  1. Markov Blanket
  2. Linear Predictor

E. Disjunctive Features

  1. Centroid
  2. PCA
  3. LDA
  4. SVD
  5. PLS
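Of these, PCA is the most common starting point. A minimal sketch with scikit-learn, assuming two strongly correlated synthetic features so one principal component captures almost all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Two correlated synthetic features
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([x, x * 2 + rng.normal(scale=0.1, size=100)])

# Compress them down to a single component
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_[0]
```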

F. Programming

  1. Logic (FRINGE)
  2. Genetic

Feature Selection

A. Filter Methods

  1. Correlation
  2. Statistical Score
  3. Ranking (Relief Algorithm)
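A correlation filter can be sketched in a few lines of pandas on synthetic data (the 0.3 threshold and the column names are arbitrary choices for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
target = pd.Series(rng.normal(size=200))
df = pd.DataFrame({
    "useful": target * 3 + rng.normal(scale=0.5, size=200),  # related to target
    "noise":  rng.normal(size=200),                          # unrelated
})

# Filter: keep features whose absolute correlation with the
# target exceeds a chosen threshold
corrs = df.apply(lambda col: col.corr(target)).abs()
selected = corrs[corrs > 0.3].index.tolist()
```

Filters like this are fast because they never train a model, but they can miss features that only matter in combination with others.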

B. Wrapper Methods

  1. Forward Step Wise
  2. Backward Step Wise
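Forward stepwise selection can be sketched with scikit-learn's `SequentialFeatureSelector` on synthetic data, where the target depends only on two of four features:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
# Target depends only on columns 0 and 2
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Forward stepwise: add one feature at a time, keeping the best
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward")
selector.fit(X, y)
chosen = selector.get_support()
```

Backward stepwise is the same call with `direction="backward"`. Wrappers are more expensive than filters because each step refits the model.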

C. Embedded Methods

  1. Ridge Regression
  2. Lasso Regression
  3. Decision-Trees
  4. Elastic Net
  5. XGBoost
  6. SVM
  7. LightGBM
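Lasso is the simplest embedded example: the L1 penalty shrinks irrelevant coefficients all the way to zero, so selection happens during fitting. A sketch on synthetic data (the `alpha=0.1` penalty strength is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Lasso zeroes out irrelevant coefficients while fitting
model = Lasso(alpha=0.1)
model.fit(X, y)
selected = [i for i, c in enumerate(model.coef_) if abs(c) > 1e-6]
```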