Big data remains relatively simple because processing power can be scaled across several computers, but big analytics is more challenging because each dimension must be analyzed differently. In this course, you will learn a framework for generating easy-to-understand algorithms. This will enable you to scale advanced analytics work to high-dimension datasets.
What am I going to get from this course?
- Large-scale data preparation
- Automated data processing
- Large-scale reporting tasks
- Easily automated mass reporting
- Large-scale statistical modelling
- Predictive modelling with high-dimension datasets
- Mastery of big analytics
Module 1: Key concepts
Scaling Advanced Analytics
This introductory section explains the purpose of the course, defines the target audience, and presents the key concepts that will be used throughout.
Module 2: Data Driven programming
How Data Driven Programming Works
This section explains data-driven programming, the methodology that makes it possible to tackle complex datasets and scale analytics tasks across high volumes of variables. Upon completion of this section, students should be able to use efficient inception methods to prepare large-scale analytics.
Design a SAS script that can process an Excel list.
Non-technical staff should be able to use the Excel file to select the variables they need for the analysis.
The list will make it possible to select numerical and categorical variables separately.
The SAS script will produce descriptive statistics for each type of variable.
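A minimal sketch of this workflow is shown below. The file name varlist.xlsx, its columns varname and vartype (with values NUM or CAT), and the sashelp.cars demo data are illustrative assumptions, not the course's actual files.

```sas
/* Import the spreadsheet maintained by non-technical staff
   (assumed name varlist.xlsx, columns varname / vartype) */
proc import datafile="varlist.xlsx" out=varlist dbms=xlsx replace;
run;

/* Load the selected names into macro variables, one list per type */
proc sql noprint;
  select varname into :numvars separated by ' '
    from varlist where upcase(vartype)='NUM';
  select varname into :catvars separated by ' '
    from varlist where upcase(vartype)='CAT';
quit;

/* Descriptive statistics driven entirely by the spreadsheet content */
proc means data=sashelp.cars n mean std min max;
  var &numvars;
run;

proc freq data=sashelp.cars;
  tables &catvars;
run;
```

Because the variable lists live in the spreadsheet rather than in the code, the script never changes when the analysis scope does.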
Macro inception type
We will explain how a single macro variable, a macro vector, and a macro array help during the initialisation stage.
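The three inception types can be sketched as follows; the dictionary.columns query against sashelp.cars is an illustrative assumption.

```sas
/* Single macro variable: one value */
%let target = msrp;

/* Macro "vector": one variable holding a space-separated list */
proc sql noprint;
  select name into :varlist separated by ' '
    from dictionary.columns
    where libname='SASHELP' and memname='CARS' and type='num';
quit;

/* Macro "array": numbered variables var1, var2, ... plus a count */
proc sql noprint;
  select name into :var1-:var999
    from dictionary.columns
    where libname='SASHELP' and memname='CARS' and type='num';
  %let nvars = &sqlobs;
quit;

/* Iterate over the macro array with indirect resolution &&var&i */
%macro show;
  %do i = 1 %to &nvars;
    %put Variable &i is &&var&i;
  %end;
%mend;
%show
```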
Tables can also be used for inception; this video explains the main methods for table-based inception.
Module 3: Data preparation algorithms
This module combines the inception methods introduced earlier with loops to apply the methodology to data preparation. Upon completing this part, students should be able to automate data preparation steps for statistical modeling with massive datasets. This lecture demonstrates data-driven programming used to tailor an outlier-removal algorithm.
The Excel file varlabels.xlsx contains variable labels.
Process this file to automate the allocation of a label to each variable.
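One way to automate this is sketched below; the column names varname / varlabel and the target dataset work.mydata are assumptions for illustration.

```sas
/* Import the label list (assumed columns: varname, varlabel) */
proc import datafile="varlabels.xlsx" out=varlabels dbms=xlsx replace;
run;

/* Generate one LABEL assignment per row and execute the result */
data _null_;
  set varlabels end=last;
  if _n_ = 1 then
    call execute('proc datasets lib=work nolist; modify mydata; label');
  call execute(catx(' ', varname, '=', quote(strip(varlabel))));
  if last then call execute('; quit;');
run;
```

CALL EXECUTE turns each spreadsheet row into generated SAS code, so adding a label is a spreadsheet edit, not a code change.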
Outliers for Left Skewed Variables
The script studied in section 3 covered outlier removal for left-skewed variables.
Use the market dataset instead of the airline dataset.
Adapt the algorithm to deal with right-skewed variables as well.
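A percentile-capping sketch of the idea is shown below for a single variable; the sashelp.cars data, the msrp variable, and the 1st/99th percentile thresholds are assumptions, and the course generalises this with loops over many variables.

```sas
/* Compute the 1st and 99th percentiles of the variable */
proc univariate data=sashelp.cars noprint;
  var msrp;
  output out=pct pctlpts=1 99 pctlpre=p;
run;

/* Push the thresholds into macro variables */
data _null_;
  set pct;
  call symputx('p1', p1);
  call symputx('p99', p99);
run;

/* Cap the tails: the right tail handles right-skewed variables,
   the left tail handles left-skewed ones */
data cars_capped;
  set sashelp.cars;
  if msrp > &p99 then msrp = &p99;
  if msrp < &p1  then msrp = &p1;
run;
```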
Binning transforms a numerical variable into a categorical one and is often required to run learning algorithms. The following video shows an algorithm that does this sequentially for any number of variables. This is one of the most difficult parts; you may leave this video until the end.
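As a simple baseline, quantile binning can be done with PROC RANK; this sketch (sashelp.cars and five bins are assumptions) is not the course's sequential algorithm, but shows the transformation it automates.

```sas
/* GROUPS=5 assigns each observation a quintile bin numbered 0-4 */
proc rank data=sashelp.cars out=cars_binned groups=5;
  var msrp horsepower;
  ranks msrp_bin hp_bin;  /* new categorical versions of the variables */
run;
```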
Distinct and Missing values
Variables with too many levels or too many missing values will cause stability issues. A simple approach is used here to tackle both problems.
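A check of this kind can be sketched as follows; the thresholds (50 levels, 30% missing) and the sashelp.cars type variable are assumptions for illustration.

```sas
/* Count distinct levels and the missing-value share of one variable */
proc sql noprint;
  select count(distinct type), nmiss(type)/count(*)
    into :nlevels trimmed, :missrate trimmed
  from sashelp.cars;
quit;

/* Flag the variable if either measure exceeds its threshold */
%macro check;
  %if &nlevels > 50 or %sysevalf(&missrate > 0.3) %then
    %put WARNING: variable type looks unstable (levels=&nlevels, missing rate=&missrate);
  %else
    %put NOTE: variable type passes (levels=&nlevels, missing rate=&missrate);
%mend;
%check
```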
Categorical predictors with a balanced distribution lead to more stable statistical models. This lecture explains the approach taken to detect these distributions automatically.
Module 4: Dimension reduction
Bivariate Dimension Reduction
Redundant information is sometimes caused by similar variables. This module uses data-driven methods to apply dimension reduction techniques to massive datasets. The following lecture explains the algorithms used to detect bivariate relationships.
Multivariate Dimension Reduction
The method used to detect multivariate relationships is explained, and a simple script shows how to use PROC VARCLUS.
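A minimal PROC VARCLUS call looks like the following; the sashelp.cars variables and the maxeigen threshold are assumptions for illustration.

```sas
/* Cluster correlated numeric variables; splitting stops when every
   cluster's second eigenvalue is below MAXEIGEN */
proc varclus data=sashelp.cars maxeigen=0.7 short;
  var msrp invoice horsepower weight length mpg_city mpg_highway;
run;
```

A common follow-up is to keep, from each cluster, the variable with the lowest 1-R**2 ratio, reducing the dimension while preserving most of the variance.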
Module 5: Regression adjustment algorithms
Exceptional Data Points
With a vast number of variables, adapting the data modeling process can be time consuming. The examples shown will enable students to adapt and tailor regression algorithms to enhance modeling performance and adjust modeling policies. This lecture explains how to remove exceptional data points during the regression process.
Clustering for Regression
The purpose of this exercise is to select a set of variables and cluster them. The best variable within each cluster will be selected using a sequence of logistic regressions, one per cluster.
Ods Output as Inception
This lecture shows how 'ods output' can be combined with data-driven programming to automatically remove variables that contribute to multicollinearity. The purpose is to enable data scientists to use these programming concepts to easily develop and tailor their own modeling algorithms.
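One iteration of such an algorithm can be sketched as follows; the sashelp.cars model and the VIF threshold of 10 are assumptions, and a full version would loop, refitting without &worst until no VIF exceeds the threshold.

```sas
/* Capture PROC REG's parameter table (including VIFs) as a dataset */
ods output ParameterEstimates=pe;
proc reg data=sashelp.cars;
  model msrp = horsepower weight length mpg_city / vif;
run;
quit;

/* Data-driven step: find the variable with the worst VIF, if any */
%let worst=;
proc sql noprint;
  select variable into :worst trimmed
  from pe
  where varianceinflation = (select max(varianceinflation) from pe)
    and varianceinflation > 10;
quit;

%put Variable to drop this round: &worst;
```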