Data Science is about explaining the past and predicting the future by means of data analysis. It is a multi-disciplinary field that combines statistics, machine learning, artificial intelligence and database technology. This course provides the essential concepts and principles in data science. Students learn commonly used classification algorithms and how to use those algorithms to solve real-world problems.
What am I going to get from this course?
Students will learn to clearly define a classification problem, extract and prepare data, explore data using univariate and bivariate visualization, build classification models using eight basic and advanced algorithms, evaluate models extensively, and finally deploy their models properly.
Prerequisites and Target Audience
What will students need to know or do before starting this course?
It would be very helpful if students have a basic knowledge of R programming.
Who should take this course? Who should not?
Everyone with basic knowledge of statistics and math can take this course.
Module 1: Classification Algorithms
Classification - Basic Methods
Students will learn about the ZeroR, OneR and Naive Bayesian classification algorithms.
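As a rough illustration of the two simplest methods named above, the sketch below implements ZeroR (always predict the majority class) and OneR (pick the single attribute whose one-level rule makes the fewest training errors). It is written in Python purely for illustration; the function names `zero_r` and `one_r` and the toy weather-style data are assumptions, not course material.

```python
from collections import Counter, defaultdict


def zero_r(labels):
    """ZeroR: ignore all predictors and return the most frequent class."""
    return Counter(labels).most_common(1)[0][0]


def one_r(rows, labels):
    """OneR: return (attribute, rule, errors) for the best single-attribute rule."""
    best = None
    for attr in rows[0]:
        # Count class labels separately for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # The rule maps each attribute value to its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical toy data, invented for this example.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
]
labels = ["no", "no", "yes", "yes", "yes"]

baseline = zero_r(labels)
best_attr, best_rule, best_errors = one_r(rows, labels)
```

ZeroR is mainly useful as a baseline: any model worth deploying should beat it.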
Classification - Decision Tree
Decision trees are among the most widely used classification algorithms. A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
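To make the structure of decision nodes and leaf nodes concrete, here is a minimal Python sketch of classifying a case with an already-built tree. The tree itself is hand-written and hypothetical (attribute names and class labels are invented); learning the tree from data is not shown.

```python
# Dicts are decision nodes (they test one attribute); strings are leaf nodes.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {"yes": "stay in", "no": "play"},
        },
        "overcast": "play",
        "rainy": "stay in",
    },
}


def classify(node, case):
    """Follow the branch matching the case's attribute value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][case[node["attribute"]]]
    return node


prediction = classify(tree, {"outlook": "sunny", "windy": "no"})
```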
Classification - Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is based upon the concept of searching for a linear combination of the variables that best separates two classes. LDA was originally developed in 1936 by R. A. Fisher. It is simple, mathematically robust and often produces models whose accuracy is as good as that of more complex methods.
Classification - Logistic Regression
Logistic regression is a classification model that predicts the probability of an outcome that can only have two values (i.e., a binary outcome). Logistic regression produces a logistic curve, which is limited to values between 0 and 1. It is similar to linear regression, but the curve is constructed using the natural logarithm of the “odds” of the target variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal variance in each group.
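The relationship between the logistic curve and the log-odds can be sketched in a few lines of Python. The coefficients `b0` and `b1` below are hypothetical values chosen for illustration, not fitted to any data; fitting them is not shown.

```python
import math


def logistic(z):
    """Logistic curve: maps any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients for a single predictor x.
b0, b1 = -4.0, 0.8


def predict_probability(x):
    """The linear part b0 + b1 * x is the log-odds; the logistic curve turns it into a probability."""
    return logistic(b0 + b1 * x)


p = predict_probability(7.0)
```

Taking `log_odds(p)` recovers the linear expression `b0 + b1 * x`, which is why the model is linear in the log-odds rather than in the probability.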
Classification - K Nearest Neighbors
K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the early 1970s.
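A minimal Python sketch of the idea follows, assuming Euclidean distance as the similarity measure and a simple majority vote among the k closest stored cases. The function name `knn_classify` and the toy points are invented for this example.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Classify `query` by majority vote among the k nearest stored cases."""
    # Sort stored cases by Euclidean distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical stored cases: (feature vector, class label).
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
    ((4.8, 5.1), "B"),
]

near_a = knn_classify(training, (1.1, 1.0))
near_b = knn_classify(training, (5.0, 5.1))
```

Because distances depend on the scale of each variable, predictors are usually standardized before KNN is applied.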
Classification - Artificial Neural Network
An artificial neural network (ANN) is a system that is based on the biological neural network, such as the brain. The ANN attempts to recreate the computational mirror of the biological neural network, although the two are not comparable, since the number and complexity of neurons in a biological neural network are many times greater than those in an artificial neural network.
Classification - Artificial Neural Network Demo
Classification - Model Evaluation
Model evaluation is an integral part of the model development process. It helps to find the best model that represents our data and to estimate how well the chosen model will work in the future. Evaluating model performance with the data used for training is not acceptable in data mining because it can easily produce overoptimistic, overfitted models. There are two methods of evaluating models in data mining: hold-out and cross-validation. To avoid overfitting, both methods use a test set (not seen by the model) to evaluate model performance.
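The two evaluation schemes can be sketched as index-splitting helpers in Python. This is an illustrative sketch only: the function names, the 30% test fraction and the choice of five folds are assumptions, not values taken from the course.

```python
import random


def hold_out_split(n, test_fraction=0.3, seed=0):
    """Hold-out: shuffle row indices once and set aside a fraction as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]  # (train, test)


def k_fold_splits(n, k=5, seed=0):
    """Cross-validation: each of the k folds serves as the test set exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train_idx, test_idx = hold_out_split(10)
cv_splits = list(k_fold_splits(10, k=5))
```

In both cases the model is fitted on the training indices only and scored on the test indices, so the reported performance reflects data the model has not seen.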
Module 2: Classification - Sample Project
Classification - Sample Project Part 1/4
Students will learn how to build a classification model to predict the probability of default for small businesses.
Classification - Sample Project Part 2/4
The second part of the sample project is about Bivariate Data Exploration. Students learn how to use different visualization methods to demonstrate the relationship between categorical and numerical variables.
Classification - Sample Project Part 3/4
The third part of the sample project is about building classification models such as ZeroR, OneR, Bayesian, Decision Tree and more.
Classification - Sample Project Part 4/4
The fourth part is about model evaluation using a confusion matrix, lift or gain charts, and ROC charts. Students also learn how to deploy a classification model.
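As a small illustration of the first evaluation tool mentioned, the Python sketch below tallies a two-class confusion matrix and derives accuracy from it. The class labels ("default", "ok") and the example predictions are invented for this sketch and do not come from the course project.

```python
def confusion_matrix(actual, predicted, positive="default"):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def accuracy(counts):
    """Share of all cases that were classified correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical actual outcomes and model predictions.
actual = ["default", "default", "ok", "ok", "ok", "default"]
predicted = ["default", "ok", "ok", "default", "ok", "default"]

matrix = confusion_matrix(actual, predicted)
acc = accuracy(matrix)
```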