Data is messy and comes in a variety of forms. As part of the overall data mining and machine learning process, we must take the time to preprocess our data. This means ensuring that it is structured and cleansed, and addressing any problems it may have. Preprocessing includes gaining a better understanding of the data through descriptive statistics and data visualization techniques. It also includes handling missing data and outliers appropriately.
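As a rough illustration of the steps described above, here is a minimal sketch that uses only the Python standard library. The sample values, the choice of median imputation, and the 1.5 * IQR outlier rule are assumptions made for this example, not necessarily the methods taught in the course. Note that `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

# Hypothetical raw readings; None marks a missing value.
raw = [12.0, 15.5, None, 14.0, 13.5, 98.0, None, 16.0]

# Step 1: handle missing data.
# One simple option (an assumption here) is to fill gaps with the
# median of the observed values.
observed = [x for x in raw if x is not None]
fill_value = statistics.median(observed)
filled = [fill_value if x is None else x for x in raw]

# Step 2: summarize the data with descriptive statistics.
summary = {
    "count": len(filled),
    "mean": statistics.mean(filled),
    "median": statistics.median(filled),
    "stdev": statistics.stdev(filled),
}

# Step 3: flag possible outliers with the 1.5 * IQR rule.
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in filled if x < low or x > high]

print(summary)
print("possible outliers:", outliers)
```

In practice the same steps are often done with a data analysis library, but the order is the same: deal with missing values, summarize, then look for values that fall far outside the typical range.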
What am I going to get from this course?
- Understand what data preprocessing is and why it is needed as part of an overall data science and machine learning methodology
- Review and understand data quality issues and how to address them
- Apply specific Python functions to assist in cleansing and transforming your data
- Summarize your data using descriptive statistics and data visualization
Prerequisites and Target Audience
What will students need to know or do before starting this course?
Programming Knowledge in Python
- Lists, variables, loops, etc.
Basic Statistics Knowledge
- Inferential and Descriptive Statistics
Python installed on your computer
- I use Spyder IDE and the Anaconda distribution.
- I have Python 3.6.1 on my machine, so version 3.6 or later will work.
Who should take this course? Who should not?
Individuals with basic Python & statistics knowledge can take this course.
Module 1: Introduction to Data Preprocessing
What is data preprocessing?
What is dirty data?
Overview of Data Cleansing
Data Quality Challenges
Raw Files and File Formats
Finding Data Sets
Loading Data into Python
Loading Data into Python Part 2
Module 3: Summarizing Data with Statistics
Review of Basic Statistics
Summarizing Data with Python
Module 4: Data Visualization
Introduction to Data Visualization
Creating a Histogram
Missing Data Part 1
Missing Data Part 2
Outlier Detection Part 1
Outlier Detection Part 2
Module 6: Feature Scaling
Introduction to Feature Scaling