Data Preprocessing for Non-Techies: Feature Exploration and Engineering, Part Two — Checklist of Most Common Practices

Melody Ucros
March 28, 2018 Big Data, Cloud & DevOps

Now that we have covered the basic terms and definitions for data types and structure in my previous post, let’s dive into the most creative and time-consuming side of data science: cleaning and feature engineering.

What are some of the basic strategies that data scientists use to clean their data AND improve the amount of information they get from it?

The type of cleaning and engineering strategies used usually depends on the business problem and the type of target variable, since these influence the choice of algorithm and the data preparation requirements.

Therefore, I will provide a basic checklist that can help any beginner (including me) brainstorm what to do with the data at this stage.

The most important part of data cleaning is experimentation: checking how applying one or more of these strategies affects your model's ability to actually predict or classify.

Also, although there is some logic to the order, keep in mind that these steps always happen iteratively, and you will go back and forth between:

→ Exploration, Cleaning, Creation, and Selection

Data Exploration

A. Variable Identification:

  1. Context of Target Variable (logical connection)
  2. Data Type per Feature (character, numeric, etc.)
  3. Variable Category (Continuous, Categorical, etc.)

B. Uni-variate Analysis:

  1. Central Tendency & Spread for Continuous
  2. Distribution (levels) for Categorical
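To make the uni-variate step concrete, here is a minimal sketch using only Python's standard library. The `prices` and `colors` lists are made-up sample data, not from any real dataset, and in practice you would probably reach for a data-frame library instead.

```python
import statistics
from collections import Counter

# Hypothetical sample data, invented for illustration only.
prices = [120, 135, 150, 150, 410]
colors = ["red", "blue", "red", "green", "red"]

# Central tendency and spread for a continuous feature.
summary = {
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "stdev": statistics.stdev(prices),  # sample standard deviation
    "min": min(prices),
    "max": max(prices),
}

# Distribution of levels for a categorical feature.
level_counts = Counter(colors)

print(summary)
print(level_counts.most_common())
```

Notice how the mean sits well above the median here: a quick hint that one large value is pulling the average up.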

C. Bi-variate Analysis:

  1. Correlation of Continuous Variables
  2. Two-Way Table or Stacked Columns for Categorical
  3. Chi-Square Test for Categorical
  4. Z-Test for Categorical vs Continuous
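As a rough illustration of the bi-variate checks, the sketch below computes a Pearson correlation for two continuous features and a chi-square statistic from the two-way table of two categorical features. The function names and the sample data are my own assumptions for illustration. The code only produces the statistic itself; turning it into a p-value needs the chi-square distribution, which a statistics library would normally handle.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def chi_square_statistic(a, b):
    """Chi-square statistic for the two-way table of two categorical lists."""
    n = len(a)
    observed = Counter(zip(a, b))            # the two-way table
    row_totals, col_totals = Counter(a), Counter(b)
    stat = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat

# Hypothetical sample data, invented for illustration only.
size = [50, 60, 80, 100]
price = [100, 120, 160, 200]
plan = ["free", "free", "paid", "paid"]
churn = ["yes", "yes", "no", "no"]

r = pearson(size, price)
chi2 = chi_square_statistic(plan, churn)
print(r, chi2)
```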

Data Cleaning

A. Remove Noise:

  • Duplicates
  • Paragraph Columns
  • Erroneous Values
  • Contradictions
  • Mislabels
B. Missing Values:

  1. Delete
  2. Mean/Mode/Median Imputation
  3. Prediction Model
  4. KNN Imputation
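The first two options for missing values can be sketched in a few lines. This assumes missing entries are marked as `None`, and the `ages` column is made-up data. Model-based and KNN imputation need a modelling library, so they are left out here.

```python
import statistics

# Hypothetical column with missing entries marked as None.
ages = [34, None, 29, 41, None, 38]

observed = [a for a in ages if a is not None]

# Option 1: delete the rows with a missing value.
deleted = list(observed)

# Option 2: fill the gaps with the median of the observed values.
fill_value = statistics.median(observed)
imputed = [fill_value if a is None else a for a in ages]

print(deleted)
print(imputed)
```

Deleting keeps only real observations but shrinks the data, while imputing keeps every row at the cost of inventing values, so it is worth trying both and comparing model results.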

C. Outliers:

  1. Cut-Off or Delete
  2. Natural Log
  3. Binning
  4. Assign Weights
  5. Mean/Mode/Median Imputation
  6. Build Predictive Model
  7. Treat them separately
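Here is a small sketch of the first two outlier strategies. The cut-off uses the common "1.5 times the interquartile range" rule, which is my assumption (the checklist does not prescribe a specific rule), and `incomes` is made-up data.

```python
import math
import statistics

# Hypothetical incomes in thousands; the last value is an obvious outlier.
incomes = [32, 35, 38, 40, 41, 43, 45, 400]

# Cut-off: keep values within 1.5 * IQR of the quartiles (assumed rule).
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = [x for x in incomes if lower <= x <= upper]

# Natural log: keep every row but compress the large values.
log_incomes = [math.log(x) for x in incomes]

print(kept)
print([round(v, 2) for v in log_incomes])
```

The log transform only works for strictly positive values, so check for zeros and negatives before using it.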

D. Variable Transformation:

  1. Logarithm
  2. Square / Cube root
  3. Binning / Discretization
  4. Dummies
  5. Factorization
  6. Other Data Type
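Binning, dummies and factorization are easier to picture with a tiny example. The helper names (`bin_value`, `one_hot`), the bin edges, and the sample values below are all assumptions made for illustration.

```python
def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    `labels` needs one more entry than `edges`; the last label catches
    everything above the final edge.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]

def one_hot(values):
    """Turn a categorical list into one 0/1 dummy column per level."""
    levels = sorted(set(values))
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

# Hypothetical sample data, invented for illustration only.
ages = [10, 30, 70]
colors = ["red", "blue", "red"]

# Binning / discretization: numeric age to an age group.
age_groups = [bin_value(a, [18, 65], ["minor", "adult", "senior"]) for a in ages]

# Dummies: one indicator column per color.
dummies = one_hot(colors)

# Factorization: replace each level with an integer code.
codes = {lvl: i for i, lvl in enumerate(sorted(set(colors)))}
factorized = [codes[c] for c in colors]

print(age_groups, dummies, factorized)
```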

Feature Creation

A. Indicator Features

  1. Threshold (ex. below certain price = poor)
  2. Combination of features (ex. premium house if 2 bedrooms and 2 bathrooms)
  3. Special Events (ex. Christmas Day or Black Friday)
  4. Event Type (ex. paid vs unpaid based on traffic source)
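Indicator features are just 0/1 flags, so they are quick to sketch. The `listings` records, the column names, and the price threshold below are hypothetical and only there to show the pattern.

```python
from datetime import date

# Hypothetical listings, invented for illustration only.
listings = [
    {"price": 90_000, "beds": 2, "baths": 2, "sold_on": date(2017, 12, 25)},
    {"price": 250_000, "beds": 3, "baths": 1, "sold_on": date(2017, 6, 3)},
]

PRICE_THRESHOLD = 100_000  # assumed cut-off, pick one that fits your problem

for row in listings:
    # Threshold indicator.
    row["is_low_price"] = int(row["price"] < PRICE_THRESHOLD)
    # Combination of features.
    row["is_2bed_2bath"] = int(row["beds"] == 2 and row["baths"] == 2)
    # Special event.
    sold = row["sold_on"]
    row["is_christmas"] = int((sold.month, sold.day) == (12, 25))

print(listings)
```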

B. Representation Features

  1. Domain and Time Extractions (ex. purchase_day_of_week)
  2. Numeric to Categorical (ex. years in school to “elementary”)
  3. Grouping sparse classes (ex. keep “sold”, label all others “other”)
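The three representation ideas could look something like this in plain Python. The school-level cut-offs, the `min_count` value, and the sample data are assumptions for illustration, not rules from the checklist.

```python
from collections import Counter
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def school_level(years):
    """Map years in school to a coarse category (assumed cut-offs)."""
    if years <= 6:
        return "elementary"
    if years <= 12:
        return "secondary"
    return "higher"

def group_sparse(values, min_count):
    """Relabel any class seen fewer than min_count times as 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]

# Hypothetical sample data, invented for illustration only.
purchases = [date(2018, 3, 26), date(2018, 3, 28), date(2018, 3, 31)]
years_in_school = [5, 11, 16]
statuses = ["sold", "sold", "sold", "pending", "withdrawn"]

# Time extraction: date to day of week.
purchase_day_of_week = [DAY_NAMES[d.weekday()] for d in purchases]

# Numeric to categorical.
levels = [school_level(y) for y in years_in_school]

# Grouping sparse classes.
grouped = group_sparse(statuses, min_count=2)

print(purchase_day_of_week, levels, grouped)
```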

C. Interaction Features

  1. Sum of Features
  2. Difference of Features
  3. Product of Features
  4. Quotient of Features
  5. Unique Formula
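Interaction features combine two or more existing columns with simple arithmetic. A minimal sketch, with made-up rows and column names:

```python
# Hypothetical rows, invented for illustration only.
rows = [
    {"beds": 3, "baths": 2, "price": 300_000, "sqft": 1_500},
    {"beds": 2, "baths": 1, "price": 180_000, "sqft": 900},
]

for row in rows:
    row["total_rooms"] = row["beds"] + row["baths"]        # sum
    row["bed_bath_diff"] = row["beds"] - row["baths"]      # difference
    row["bed_bath_product"] = row["beds"] * row["baths"]   # product
    # Quotient: guard against dividing by zero.
    row["price_per_sqft"] = row["price"] / row["sqft"] if row["sqft"] else None

print(rows)
```

Whether any of these new columns actually helps is something only the model can tell you, which is why this step loops back into selection.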

D. Conjunctive Features

  1. Markov Blanket
  2. Linear Predictor

E. Disjunctive Features

  1. Centroid
  2. PCA
  3. LDA
  4. SVD
  5. PLS
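To give a feel for what PCA does, the sketch below finds only the first principal component of a tiny dataset using power iteration on the covariance matrix. This is a teaching shortcut I am assuming for illustration; a real project would normally use a tested library implementation, and the function name and data here are made up. It also assumes the data is not constant and that the starting vector is not exactly perpendicular to the answer.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and re-normalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered row onto the component: one new feature.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical data where the second column is exactly twice the first.
data = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, scores = first_principal_component(data)
print(direction, scores)
```

The two original columns collapse into a single `scores` column without losing information here, because the made-up data lies on a straight line.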

F. Programming

  1. Logic (FRINGE)
  2. Genetic

Feature Selection

A. Filter Methods

  1. Correlation
  2. Statistical Score
  3. Ranking (Relief Algorithm)
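A correlation filter can be sketched as: score every feature by the absolute value of its correlation with the target, then keep the top few. The feature names, the data, and the choice to keep one feature are assumptions for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical features and target, invented for illustration only.
features = {
    "sqft": [800, 1000, 1200, 1500],
    "age": [30, 5, 20, 10],
}
target = [160, 200, 240, 300]

# Score each feature by the strength (not the sign) of its correlation.
scores = {name: abs(pearson(col, target)) for name, col in features.items()}

# Rank from strongest to weakest and keep the top k.
ranked = sorted(scores, key=scores.get, reverse=True)
selected = ranked[:1]

print(scores)
print(selected)
```

Keep in mind that this only catches straight-line relationships with the target, one feature at a time.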

B. Wrapper Methods

  1. Forward Stepwise
  2. Backward Stepwise

C. Embedded Methods

  1. Ridge Regression
  2. Lasso Regression
  3. Decision Trees
  4. Elastic Net
  5. XGBoost
  6. SVM
  7. LightGBM