A Big Data revolution is currently under way in Health Care and Clinical Medicine. The availability of rich biomedical data, including Electronic Health Records, -Omics Data, and Medical Imaging Data, is driving Precision Medicine, an emerging approach to disease treatment and prevention tailored to the specifics of the individual patient. Precision Medicine is expected to result in significant benefits for both patients and practitioners. Therefore, new approaches that can make sense of complex Biomedical Data and provide data-driven insights are in high demand.
This course provides an introduction to current data-related problems in the field of Clinical Medicine and an overview of how Machine Learning and Data Mining can leverage complex biomedical data.
The course includes a series of lectures covering the basic concepts, as well as a set of hands-on demonstrations that will help you get started with R and some of the commonly used R libraries for preparing and analyzing the data.
You should enroll if you are a Computer Scientist, a Data Scientist, or a professional in the Biomedical field with some hands-on experience in Computer Programming (a basic knowledge of R is preferred, even if not required) and you want to learn how to tackle Data Mining problems in the field of Clinical Medicine using R and Machine Learning.
What am I going to get from this course?
Learn to use R to leverage complex biomedical data through Machine Learning approaches. Students will learn the basic concepts of Data Mining and Machine Learning, as well as the specifics of the different data types generated in modern Clinical Medicine. A series of hands-on demonstrations will also help you get started with Machine Learning-assisted Clinical Medicine using widely used R libraries.
Topics covered in the course include:
- Questions and data types in the medical field, including Electronic Health Records, Medical Imaging Data, and -Omics Data
- CRISP-DM method applied to Clinical Medicine problems
- Data Preparation using R
- Non-negative Matrix Factorization (NMF) as an effective tool for feature extraction
- Feature selection, including regularization methods (LASSO, Elastic Net)
- Building classification and numeric prediction models using R (regression, trees, ensemble methods, neural networks, and others)
- An example of deep learning using R and H2O
- Estimating model performance and accounting for error
- Iterative optimization to find a local error minimum: an example of gradient descent
- How to Deploy Predictive Models in the Medical Field
Prerequisites and Target Audience
What will students need to know or do before starting this course?
The course relies on the use of R, a free software environment for statistical computing, data analysis, and visualization (https://www.r-project.org/). Therefore, knowledge of R is preferred, even if not required (see Experfy's Intro to R programming course). Nevertheless, this course assumes a basic understanding of Computer Programming, and it is recommended to have some hands-on experience with R or another major programming language before starting (for example, Python, Java, C++, C#, PHP, and so on). To fully understand lectures and demos, some background in mathematics, statistics, and algebra may be required. Also, it is assumed that you have access to a computer with R and RStudio installed (Unix, Linux, and Windows operating systems are all suitable; Linux and Unix systems are preferred).
Who should take this course? Who should not?
This course is best suited for professionals with some basics in computer programming, mathematics and/or algebra who want to learn more about Data Mining and Machine Learning using R in the context of Clinical Medicine.
You should take this course if:
- You are a Computer Scientist or a Data Scientist and you want to learn how to tackle Data Mining problems in the field of Clinical Medicine using R
- You are a professional in the biomedical field (biotechnologist, bioinformatician, clinical data analyst) and you want to take your data analytic pipelines to the next level by implementing Machine Learning and Data Mining using R
- You are a physician or clinician with some background in computer programming and you want to better understand the Big Data Revolution and learn how Machine Learning can support your Research and/or Precision Medicine
You should not take this course if:
- You are seeking to only learn about EHR mining: while this course covers the use of regular expressions to match and retrieve substrings of interest from medical narratives, advanced text mining and NLP are not thoroughly discussed in this course.
- You want to specifically learn next-generation sequencing data alignment and upstream processing: while an overview of -Omics data analysis is provided, other courses and resources address standard genomic data analysis; this course focuses on Machine Learning and Data Mining
- You are not interested in using R for your Data Mining projects, since you prefer to use other programming languages. This course only covers R and R libraries
- You are mainly interested in Wearable Technology and Time Series in the field of Clinical Medicine: while these are growing areas of interest in the context of Healthcare analytics, these topics are not thoroughly discussed in this course
- You are looking for a course only covering the theory of Machine Learning Assisted Clinical Medicine. Many demos are included in the course and the instructor believes that reproducing these demos is a critical activity required to reach the course's goals
Module 1: Clinical Medicine, Big Data and Machine Learning
Health IT and Big Data in Clinical Medicine
Big Data are progressively becoming available in the biomedical field and can support Precision Medicine. Precision Medicine is the initiative of tailoring medical decisions to the specifics of each patient in a customized fashion. Health IT and clinical Big Data refer to a series of data-related tools and technologies for collecting, storing, accessing, analyzing, and making use of clinical and patient-related data. Two key examples of Health IT are Electronic Health Records (EHR) and Clinical Decision Support Systems. Overall, these systems can help clinicians make better medical decisions. However, these systems are not meant to replace clinicians!
1) EHR store life-long data about patients and their medical history, including diagnoses, medications, procedures, and so on. Mining EHR is one of the hottest challenges in Health IT nowadays.
2) Clinical Decision Support Systems are meant to help clinicians make decisions and can rely on a set of pre-defined rules (knowledge-based) or on Machine Learning (not based on a knowledge base). In this course, we will cover the use of machine learning to address clinical and biomedical problems, with the goal of improving patient outcomes.
Machine Learning and Precision Medicine
Data by themselves are useless. Machine Learning is becoming more and more popular because it makes it possible to analyze data and extract insights from complex and rich datasets, such as medical datasets. Machine Learning aims at discovering previously unknown patterns in the data without relying on a pre-defined set of laws or rules: relationships are derived from the data themselves.
Machine Learning can be used to:
1) Segment patients (how many types of patients do I have in my dataset?)
2) Classify patients (who is more likely to respond?)
3) Forecast (how will the disease affecting this patient evolve in the future?)
4) Discover patterns (reveal conditions occurring together)
Machine Learning comes with limitations as well.
1) Overfitting: ML tends to overfit, i.e., memorize the data rather than 'learn' the underlying relationships
2) Data-hungry: we need large datasets to extract insights
3) Quality of the input data is crucial: reliable data are required to obtain reliable results
Making use of Big Data and Machine Learning will enable the transition to a patient-centered type of Healthcare (Precision Medicine) and hopefully result in improved clinical outcomes.
The R Project For Statistical Computing
R and RStudio will be used throughout this course. R is a free software environment for statistics and data analysis that can be installed and executed on Unix, Linux, and Windows. R comes with many different extensions (packages) for genomics and bioinformatics analysis, data visualization, and of course machine learning. RStudio is the IDE of choice for R.
Clinical Medicine, Big Data and Machine Learning
Quiz highlighting takeaways and activities from Module #1
Module 2: CRISP-DM Model applied to the Medical Domain
Data Mining (DM) in Clinical Medicine
Data Mining aims at knowledge discovery via the application of machine learning and statistical methods. Data Mining means placing a framework over a data-related problem. Such a framework can be applied to a wide range of data-related problems coming from different fields or industries, and machine learning is usually a component of Data Mining.
There are several general models describing how to apply Data Mining to data-related problems. We will discuss one of the most popular models for data mining, called CRISP-DM: the Cross-Industry Standard Process for Data Mining. It is standard, effective, flexible, and cyclical, and is based on business understanding, data understanding, data preparation, modeling, evaluation, and finally deployment.
Biomedical data come with a series of barriers to the application of CRISP-DM. Some of these are technology-related, others are ethical and linked to patient communication and the use of patient-specific data.
Ethical Considerations in Machine Learning Assisted Clinical Medicine
Analyzing patient-related data comes with a panel of ethical aspects and problems. Research that uses identifiable information about living human subjects (aka human subjects research) usually requires IRB approval. The IRB monitors biomedical research involving humans and ensures the welfare of those participating in research protocols.
Ethical questions or problems to consider in the context of machine learning-assisted clinical medicine include:
1) Patients have the right to be aware of how data will be collected and used, and have to provide consent
2) Managing data ownership and data sharing
3) Ethical usage of previously collected data available in data banks
4) How to deploy a model: constructive interaction between the machine learning model and the clinician making decisions
CRISP-DM Model for Clinical Medicine (I)
This lecture covers the theoretical background to get started with data mining according to the CRISP-DM model in the clinical domain (part I).
CRISP-DM starts with Business Understanding. In this phase, the investigator defines the questions to address. This should be a broad and open-ended question defining the context and the goal of the project, such as: "which drug, between drug A and drug B, is most effective at preventing recurrence in patients aged 30-50 affected by a given clinical condition with overlapping diseases of any kind?". As you can see, defining the question means defining the target attribute/variable we want to study, model, and predict (in this case, recurrence).
Data Understanding. After the question has been cast, data have to be acquired and, before proceeding, understood. Data are the raw material we need to process to address the question. Understanding what type of information we can obtain from the available data requires domain knowledge and is critical for the success of a data mining project. To understand the data, we may need a good idea of the technology used to generate them and the information they contain, and we should be aware of the strengths and limitations of the data. Finding common ground for exchanging ideas between clinical experts and data experts is crucial, and may require combining different areas of expertise. Selecting the right panel of consultants to understand the data and bring the needed domain knowledge is very important.
Data Preparation. This phase is very time-consuming. Data have to be transformed and prepared before performing machine learning. The most common operations include querying databases, scaling data, data mapping, dealing with sparse data, feature extraction, feature engineering, removing missing values, handling outliers, and feature selection.
CRISP-DM Model for Clinical Medicine (II)
The modeling phase is the core phase of Data mining and is the one where Machine Learning comes into play. To understand modeling, we first need to understand what a model is. A model is a set of rules or relationships among data, and is used for the specific purpose of the data mining project, for example for classification or numeric prediction. In other words, a predictive model is a formula for estimating the value or the class of the target variable.
Supervised modeling differs from unsupervised approaches (such as clustering), which aim at exploring the data (how many data clusters do I find in my dataset?). Data used for training a model have to be formatted as a flat data table (usually, a data.frame) including measurements for each case (patient) and each attribute (feature). Also, models are built on a subset of the dataset, namely the training subset; the test set, on the contrary, is used for model validation.
CRISP-DM Model for Clinical Medicine (III)
The evaluation phase assesses the results of the data mining project and, more specifically, the quality of the Model. Model evaluation includes different aspects: 1) making sure that the Model holds on new, previously unused data; 2) deciding whether the model is deployable; 3) verifying that the Model aligns with the business goals and answers the initial question.
To estimate whether a Model is a good Model, we need to define a reasonable metric and assess and measure the errors it produces. Evaluating the model, and eventually comparing model performance, leads to the identification of the best model to use. Usually, this means finding the optimal compromise that minimizes both model complexity and the amount of error produced on a hold-out (test) subset. Indeed, the model has to perform well on unknown data in order to be generalizable. When splitting the data into three independent subsets is not an option, cross-validation can be used to build and then compare different models using only the training set.
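The cross-validation idea described above can be sketched in a few lines of base R. The dataset below is synthetic (an illustrative linear problem, not course data), and the 5-fold setup is an arbitrary choice for demonstration:

```r
set.seed(42)
# Synthetic dataset: outcome depends linearly on two predictors (illustrative only)
n <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 2 * df$x1 - df$x2 + rnorm(n, sd = 0.5)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
rmse <- numeric(k)
for (i in 1:k) {
  train <- df[folds != i, ]                # build the model on k-1 folds...
  test  <- df[folds == i, ]                # ...and measure error on the held-out fold
  fit  <- lm(y ~ x1 + x2, data = train)
  pred <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$y - pred)^2))
}
mean(rmse)  # average hold-out error across folds
```

Packages such as caret automate the same procedure (e.g., via trainControl(method = "cv")), but the loop above is the essence of what happens under the hood.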
Deployment. The predictive model has to be put into use, aka deployed. How to deploy a model depends on many parameters, in particular the specific problem and of course the final model. After deployment, the CRISP-DM cycle is not over; it can start over to improve the model itself or fix issues that emerged during the CRISP-DM phases.
CRISP-DM Model Applied To The Medical Domain
Quiz highlighting takeaways and activities from Module #2
Module 3: Narratives in the Medical Domain: Mining Electronic Medical Records
Mining Electronic Medical Records
Electronic Health Records are longitudinal medical records that store information about diagnoses, medications, procedures, lab results, medical reports, admission and discharge summaries, and so on. Data are usually available as unstructured (free text) or semi-structured (key-value pairs) data. Data from EHR can be used for many different purposes, for example to follow a disease over time.
EHR contain critical information allowing patient phenotyping. A phenotype is a set of traits or conditions that an individual may or may not express; often, phenotypes are linked to a specific disease. Phenotypes result from the interplay between genes and environment. Across individuals, the DNA sequences of all genes are almost identical; however, some non-pathologic differences (or variants) exist. Such non-pathologic variants found in the normal population are called polymorphisms. Some polymorphisms can predispose to certain diseases, yet they do not dictate the diseases. Once again, phenotype, environment, and genetics are tightly linked.
EHR are the data source of choice for assigning phenotypes. Phenotyping can be important for multiple reasons: often, phenotypes are used as target variables; alternatively, phenotypes can guide the definition of the population of interest for the current study. Often, we may have to limit the study to those patients affected by a given condition of interest.
Billing codes facilitate phenotyping: they are alphanumeric codes used to uniquely identify specific diseases or procedures. Billing codes include ICD-9 and ICD-10 codes, which are used for diagnoses (they define diseases), and CPT codes (services and procedures provided to patients).
Regex for Mining Electronic Medical Records
Regular Expressions are implemented in most programming languages and are a very powerful tool in text mining. R's default syntax is the POSIX syntax. Patterns are used to match substrings in a text and perform count, extract, or replace operations. The lecture is a demo lecture using R.
A vignette is provided to follow along.
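To give a flavor of the count/extract/replace operations mentioned above, here is a small base-R sketch; the clinical note text is made up for illustration and is not part of the vignette:

```r
notes <- c("Patient denies chest pain. BP 120/80.",
           "Chest pain reported; started aspirin 81 mg.",
           "No complaints today.")

# Count: how many notes mention chest pain (case-insensitive match)?
sum(grepl("chest pain", notes, ignore.case = TRUE))

# Extract: pull blood-pressure readings like "120/80"
bp <- regmatches(notes, regexpr("[0-9]{2,3}/[0-9]{2,3}", notes))
bp

# Replace: mask dosages with a placeholder
gsub("[0-9]+ mg", "<DOSE>", notes[2])
```

grepl(), regexpr()/regmatches(), and gsub() are the base-R workhorses for these three operations; stringr offers equivalent wrappers if preferred.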
Extract Billing Codes using Regex and R
Exploring medical records using R and regular expressions: extract billing code information. The lecture is a demo lecture using R. A vignette (continued from lecture 10) is provided to follow along.
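A hedged sketch of what billing-code extraction can look like; the records and the simplified ICD-10 pattern below are illustrative assumptions, not the vignette's actual code (the real ICD-10 grammar is richer than this regex):

```r
records <- c("dx: E11.9; follow-up in 3 months",
             "dx: I10, I25.10; CPT 99213",
             "no billing codes recorded")

# Simplified ICD-10-CM shape: a capital letter, two digits,
# then an optional dot plus up to 4 alphanumeric characters
pat <- "[A-Z][0-9]{2}(\\.[0-9A-Z]{1,4})?"
codes <- regmatches(records, gregexpr(pat, records))
unlist(codes)  # "E11.9" "I10" "I25.10"
```

Note that gregexpr() (unlike regexpr()) returns all matches per record, which is what we want when a note lists several diagnoses.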
Extract Lab Test Results from txt files using Regex and R
Exploring medical records using R and regular expressions: extract lab test results. The lecture is a demo lecture using R. A vignette (continued from lectures 10 and 11) is provided to follow along.
NLP and Advanced EMR Analysis
A wide range of information is stored in EMR, and billing codes alone may not be sufficient for accurate patient classification or phenotyping. Natural language processing and text mining (which go beyond the scope of this course) are powerful approaches to extract more complete information from medical records, enabling far more precise phenotyping. Interestingly, applying text mining or NLP for phenotyping purposes can be carried out as a nested data mining project (a well-focused project within a broader project). This lecture also closes the current module and summarizes the takeaways from the lectures about EHR so far.
Narrative in the Medical Domain: Mining Electronic Medical Records
Quiz highlighting takeaways and activities from Module #3
Module 4: Genomic and Imaging Data in Clinical Medicine
Genomic Data in Clinical Medicine
DNA and RNA sequencing techniques can provide very rich information about a biological sample. To understand such data and the information we can expect to get from them, we need to review what DNA is and what genes are. Briefly, genes are sequences of DNA that carry hereditary information. 1) In a given organism, almost all cells share the same DNA sequences. 2) Human gene sequences come in a panel of closely related variants in the normal population (polymorphisms). Polymorphisms are not good or bad per se, but may expose an individual to developing certain conditions. As seen before, conditions (especially complex diseases) arise from the interplay between the environment and the genotype of the patient. It is expected that, by knowing the DNA sequence of an individual (genotyping), we will be able to predict patient responses (drug metabolism, disease risk, and so on).
A special type of DNA variants are somatic mutations. These are genetic variants that arise in the adult organism and are linked to cancer. Knowing which genes are mutated in a biological sample can provide insights into which therapies may be effective, and so on.
Other than sequencing DNA, it is possible to sequence RNA. RNA sequencing provides information about which genes are active or inactive at a given time, and hence can provide insights into what is going wrong (and hopefully, how to treat it).
A common feature of all sequencing-based approaches is that they generate a huge number of features, requiring effective dimensionality reduction approaches.
Working with VCF Files
DNA variant calls are usually stored in files called VCF files. In this demo, we will import and explore a VCF file using R. VCF files include tab-delimited data with the following fields:
- CHROM, the chromosome name, for example chromosome 11 or chromosome X;
- POS, the starting position (in nucleotides) on the selected chromosome;
- ID, an identifier for the variant, if any;
- REF, the expected base according to the reference assembly;
- ALT, the alternative variant nucleotide found in the biological sample.
Variants of interest, for example those located in specific genes like Sfi1, can be retrieved based on genomic coordinates or gene name (assuming variants were annotated with SnpEff or similar software). Also, variants can have a different impact on cell functionality depending on the gene that is hit, the specific type of mutation (out-of-CDS, missense, nonsense), and gene position (driver vs passenger mutations). We will import a VCF file into the R environment, explore it, and use boolean tests and regular expressions to select and report specific variants of interest.
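A minimal base-R sketch of reading a VCF-like table is shown below. The content is a made-up two-variant example; real VCF files carry additional columns (QUAL, FILTER, INFO) and many "##" meta lines, and are often handled with dedicated Bioconductor packages such as VariantAnnotation:

```r
# Minimal VCF-like content (illustrative; not a complete, valid VCF)
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT",
  "chr11\t5227002\trs334\tA\tT",
  "chrX\t154536002\t.\tG\tA"
)

# Skip the "##" meta lines but keep the "#CHROM" header row
body <- vcf_lines[!grepl("^##", vcf_lines)]
vcf <- read.table(text = body, sep = "\t", header = TRUE, comment.char = "")
names(vcf)[1] <- "CHROM"   # read.table mangles the leading "#" in "#CHROM"

# Boolean test: retrieve variants on chromosome 11
vcf[vcf$CHROM == "chr11", c("CHROM", "POS", "REF", "ALT")]
```

The same boolean-subsetting idiom extends naturally to filtering by position ranges or by gene annotations stored in the INFO field.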
Working with RNAseq and WES Data
RNA-seq data include information about gene expression. WES data include information about gene mutations. Both of these data types may be very important to characterize patient outcomes. In this demo, we will analyze RNA-seq and WES data retrieved from a cancer genomics repository, The Cancer Genome Atlas (TCGA). The demo focuses on the following points:
1) Download data using the R library TCGAretriever
2) Map data to patients: we only want to include patients who come with both WES and RNA-seq data
3) Prepare data. In this demo we will see how to prepare RNA-seq data (log-transform; Z-scores) before analysis. Also, we will use domain knowledge to replace missing values in the WES dataset: these missing values mean that no mutations were detected, and hence the corresponding gene is wild type (WT, which means 'not mutated').
4) We will study the correlation between MDM2 and TP53 expression levels in cancer samples having WT TP53 and mutated TP53.
5) Finally, we will interpret our results. Briefly, a strong correlation is detected in WT TP53 samples, whereas mutated TP53 abolishes such correlation.
6) This provides an example of inter-dependency between expression levels and mutation status.
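The preparation steps in point 3 can be sketched in base R. The expression matrix and mutation table below are synthetic stand-ins, not TCGA data:

```r
set.seed(1)
# Toy expression matrix: genes x samples (counts-like values; illustrative)
expr <- matrix(rpois(20, lambda = 50), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("s", 1:5)))

# Log-transform (the pseudo-count avoids log(0))
log_expr <- log2(expr + 1)

# Per-gene z-scores: center and scale each row
z <- t(scale(t(log_expr)))
rowMeans(z)       # ~0 for every gene
apply(z, 1, sd)   # 1 for every gene

# Domain knowledge: in a mutation table, NA means "no mutation detected"
mut <- data.frame(gene = c("TP53", "KRAS"), aa_change = c("R175H", NA))
mut$aa_change[is.na(mut$aa_change)] <- "WT"
mut
```

Row-wise scaling (rather than column-wise) is the point here: each gene is standardized across samples so that expression levels become comparable between genes.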
Imaging Data in Clinical Medicine
Imaging data play a crucial role in medical diagnosis. Medical images are produced by non-invasive approaches; they highlight internal structures of the human body and help localize lesions and guide surgical intervention. Medical images can be produced by different procedures, such as: medical ultrasound, radiography and computed tomography (CT, used for visualizing almost any structure with the exception of the brain), Magnetic Resonance Imaging (MRI, the technique of choice for neurological diseases), and Positron Emission Tomography (PET, which highlights metabolically active areas in the body and is often coupled with CT).
Briefly, medical images are numeric images, i.e., numeric representations of real-world objects obtained by applying two operations: discretization and quantization. One of the most basic and important operations used to analyze numeric images is convolution. Convolution relies on scanning the whole (numeric) image matrix with a small window and performing a simple mathematical operation: a weighted sum, where the weights are provided by the kernel, a numeric matrix with the same dimensions as the scanning window. Many other operations conducted on images are somewhat related to convolution.
Often, image analysis includes the following operations: image correction, segmentation (discriminating between object and background), and then measurement and feature extraction (extracting information from the image, for example texture analysis).
Imaging data analysis can be conveniently performed as a nested data mining process.
Imaging Data Processing with R
In the last demo of module 4, we will see how to import and analyze imaging data using R. We will import a JPEG file using the jpeg R library; next, we'll analyze the data, which will be stored in a numeric matrix, and perform several operations, including median filtering, convolution, edge detection, segmentation, and texture analysis.
Median filtering is a noise-removal technique performed by moving a sliding window across the image. Specifically, the window is centered at every single pixel in the image and selects a small number of pixels around that point. The median of these pixels is computed and returned as the new pixel value. This operation is executed for each pixel in the input image, i.e., for each element of the matrix, and returns a new matrix with the same dimensions as the original one.
Convolution is performed in a similar fashion, but a weighted sum is computed instead of the median. The weights for this operation are provided by the kernel. Convolution is at the basis of many different image-processing techniques, such as Gaussian filtering and the Sobel operator.
In the demo we will also cover an example of edge detection and edge-based segmentation, followed by analysis of segments by texture analysis.
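The convolution-plus-edge-detection idea described above can be sketched in plain base R (no image libraries). The 6x6 "image" below is synthetic, and the kernel is one of the two standard Sobel operators:

```r
# 2D convolution of an image matrix with a kernel (valid region only)
conv2d <- function(img, kernel) {
  kr <- nrow(kernel); kc <- ncol(kernel)
  out <- matrix(0, nrow(img) - kr + 1, ncol(img) - kc + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- img[i:(i + kr - 1), j:(j + kc - 1)]
      out[i, j] <- sum(patch * kernel)  # weighted sum; the kernel gives the weights
    }
  }
  out
}

img <- matrix(0, 6, 6)
img[, 4:6] <- 1   # left half dark, right half bright: one vertical edge

# Sobel operator responding to horizontal intensity changes (vertical edges)
sobel_x <- matrix(c(-1, 0, 1,
                    -2, 0, 2,
                    -1, 0, 1), 3, 3, byrow = TRUE)
edges <- conv2d(img, sobel_x)
edges  # non-zero responses only along the dark/bright boundary
```

For a median filter, the inner weighted sum would simply be replaced with median(patch), which is exactly the relationship between the two operations described above.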
Genomic and Imaging Data in Clinical Medicine
Quiz highlighting takeaways and activities from Module #4
Module 5: Data Preparation
Usually, data are obtained in formats that are unsuitable for being analyzed and used directly to train models. Data preparation involves a series of operations to transform and condition the data and make them ready for the modeling phase of machine learning. Data preparation proceeds through:
1) Data cleansing: removing bad data and records that are not supposed to be used (e.g., patients affected by the wrong disease)
2) Data transformation: converting data into a more convenient format, for example log-transformation or z-score transformation of continuous numerical variables, or encoding categorical variables as numeric (or dummy) variables
3) Imputing missing values (NAs). Often, missing values are due to missing measurements. NAs can be removed (filtering) or imputed (mean substitution, or multiple imputation). Selecting the right approach for handling NAs requires making the right assumptions about the pattern of missing data in the dataset.
4) Handling outliers
5) Data abstraction
6) Data derivation and feature engineering
7) Dimensionality reduction
8) Data weighting and balancing
Imputation using the MICE package
This is a demo dealing with missing data imputation in R using the MICE and VIM packages.
In this demo, a simple dataset including only 4 features is used. All features have some missing values. The demo shows how to use the mice::md.pattern() function to explore the pattern of missingness in the data, as well as the VIM::marginplot() function to visualize how missing values are distributed with respect to other variables.
In the demo we will run an example of mean imputation, as well as multiple imputation. The important takeaway of this demo is that we should always be careful with our results, check them, and make sure that the imputed data make sense.
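A compact sketch of this workflow is given below, using a made-up toy dataset (the variable names and values are illustrative assumptions, not the demo's actual 4-feature dataset):

```r
library(mice)

set.seed(7)
# Toy clinical-style dataset with NAs scattered across features (illustrative)
df <- data.frame(age     = c(54, 61, NA, 47, 70, NA, 58, 63),
                 bmi     = c(27, NA, 31, 24, NA, 29, 26, 30),
                 glucose = c(98, 110, 125, NA, 140, 118, NA, 105))

md.pattern(df, plot = FALSE)   # tabulate the pattern of missingness

# Multiple imputation with predictive mean matching (pmm)
imp <- mice(df, m = 5, method = "pmm", printFlag = FALSE)
completed <- complete(imp, 1)  # extract the first completed dataset
head(completed)
```

The m = 5 completed datasets can then be analyzed separately and the results pooled, which is the whole point of multiple imputation compared with single mean substitution.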
Data reduction includes 3 different processes:
1) reducing the number of values of a variable (data discretization), i.e., binning the values of a variable and/or converting continuous variables into discrete ones
2) sampling (removing a fraction of the cases)
3) dimensionality reduction (reducing the complexity of a dataset by reducing the total number of features)
Data balancing is linked to data reduction techniques. Often, datasets are unbalanced, which means that the class of interest is under-represented compared to the other(s). For example, the number of patients who did not respond may be much bigger than the number of patients who responded to the therapy. Working with unbalanced datasets is problematic, since it results in predictive models with sub-optimal performance. Resampling (under-sampling the majority class or over-sampling the minority class) can be beneficial. Alternatively, weights can be applied to the different instances (cases) depending on the class they belong to.
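Under-sampling the majority class can be sketched in base R as follows; the 90/10 class split is a synthetic example chosen for illustration:

```r
set.seed(3)
# Unbalanced outcome: 90 non-responders vs 10 responders (illustrative)
df <- data.frame(x = rnorm(100),
                 response = factor(rep(c("no", "yes"), times = c(90, 10))))

# Down-sample every class to the size of the minority class
n_min <- min(table(df$response))
idx <- unlist(lapply(split(seq_len(nrow(df)), df$response),
                     function(i) sample(i, n_min)))
balanced <- df[idx, ]
table(balanced$response)  # both classes now have 10 cases
```

The caret package wraps this same idea in downSample() (and the oversampling counterpart in upSample()), which may be more convenient in a larger pipeline.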
The curse of dimensionality refers to the problem of working with datasets including too many features/attributes. When too many attributes are available, data become incredibly sparse, and hence very dissimilar from one another. In sparse spaces, relationships are very difficult to model unless the number of cases is huge. Feature extraction is a set of techniques to cope with this type of problem. These include: principal component analysis (PCA), ontology-based dimensionality reduction (a standard approach in many NGS-based analyses), and factorization.
Factorization is the decomposition of an object into a product of other objects. The aim of factorization is to reduce something to its building blocks. When data are non-negative matrices, non-negative matrix factorization (NMF) can be applied. Here, a non-negative matrix V [n x m] is factorized into W [n x r] and H [r x m]; all these matrices are non-negative. NMF is currently employed in imaging data analysis, text mining, and genomic data analysis.
Non-negative Matrix Factorization with R
In R, NMF is performed using the NMF package. In this demo, we will go through an example of NMF using JPEG images representing cancer cells of two different types. Three libraries are required for this demo: 'jpeg', 'corpcor', and 'NMF'. Files are downloaded, images are opened, and then vectorized and stored in a non-negative matrix. As expected, such a matrix includes a lot of zeros, and is indeed a sparse matrix. After removing all rows with a row sum below a given threshold, NMF is performed using the nmf() function. This only takes two arguments: the input data and the number of ranks (or profiles) we want to extract, which is an arbitrary number. In this demo, we will set r = 3, which means 3 building blocks.
NMF factorizes our matrix into two matrices: W, storing the building elements/profiles, and H, storing the contribution of each signature/element to each of the objects/vectorized images. The beauty of NMF is that W can be used to process new data and help cluster/classify new images/objects: given a new V and an old W, we can easily compute a new H. NMF is very handy, but can only recognize elements that were previously extracted and are defined in the W matrix. Given a W matrix, we can summarize a whole image with the very limited number of values included in the H matrix. Therefore, NMF is great at dimensionality reduction.
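To make the factorization concrete without the NMF package's machinery, here is a base-R sketch of the classic Lee-Seung multiplicative updates on a synthetic non-negative matrix (the package's nmf() function implements refinements of this same idea; the dimensions and iteration count below are arbitrary choices):

```r
set.seed(10)
# Toy non-negative data: 30 "pixels" x 12 vectorized images built from 3 patterns
W_true <- matrix(runif(30 * 3), 30, 3)
H_true <- matrix(runif(3 * 12), 3, 12)
V <- W_true %*% H_true          # exactly rank-3, so NMF can fit it well

# Lee-Seung multiplicative updates: W and H stay non-negative by construction
r <- 3
W <- matrix(runif(30 * r), 30, r)
H <- matrix(runif(r * 12), r, 12)
for (it in 1:500) {
  H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + 1e-9)
  W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + 1e-9)
}
err <- sqrt(mean((V - W %*% H)^2))
err  # small: V is well approximated by W %*% H
```

Once W is fixed, a new image v can be summarized by solving for its r-value column of H, which is exactly the dimensionality-reduction use described above.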
Feature Selection Techniques
Feature selection means choosing a subset of the features (the most informative ones) from the dataset, and is the step that comes before model training. The way feature selection is performed can really affect the performance of the final predictive model. Entropy-based approaches are probably the simplest among feature selection methods. Entropy is a way of measuring class disorder in a group of observations. Information gain (IG) computes how well or how poorly a target variable is split (delta entropy) according to the information brought by a given predictor. It is possible to evaluate each feature independently and include the attributes providing the largest IGs (forward selection). Alternatively, it is possible to start with all attributes selected and progressively remove the least informative ones (backward feature elimination).
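Entropy and information gain are simple enough to compute by hand in base R. The toy outcome and predictors below are invented for illustration: one predictor splits the target perfectly, the other carries no information:

```r
# Shannon entropy of a class vector (in bits)
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# Information gain: entropy of the target minus the
# weighted entropy of the groups obtained by splitting on x
info_gain <- function(y, x) {
  splits <- split(y, x)
  w <- sapply(splits, length) / length(y)
  entropy(y) - sum(w * sapply(splits, entropy))
}

outcome <- c("relapse", "stable", "relapse", "stable", "relapse", "stable")
marker  <- c("high", "low", "high", "low", "high", "low")  # perfectly informative
age_grp <- c("old", "old", "young", "young", "old", "old") # uninformative

info_gain(outcome, marker)   # 1 bit: the split removes all uncertainty
info_gain(outcome, age_grp)  # 0 bits: class proportions unchanged in each group
```

Forward selection repeatedly adds the feature with the largest IG; backward elimination repeatedly drops the feature with the smallest.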
Embedded methods select features while the model is being created. These include regularization methods. Regularization methods introduce additional constraints in the model. They limit the coefficients of a regression model and impose that their sum (of absolutes or squares) is no bigger than a threshold. LASSO, Elastic Net and Ridge regression are common examples of regularization methods.
Feature Selection with caret and glmnet
This demo covers feature selection using two R packages: caret (used for forward feature selection and backward feature elimination) and glmnet (for embedded regularization methods, namely LASSO, Ridge regression, and Elastic Net). Feature selection is executed on a simple dataset including 80 observations and 11 features. This dataset is very similar to the one we used before in the demo about missing value imputation. Before starting, it is good practice to remove highly correlated features: in regression, correlated predictors are problematic, and it is convenient to remove them. Correlated variables can be detected via multiple approaches, including the use of the findCorrelation() function.
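The idea behind findCorrelation() can be sketched in base R by scanning the correlation matrix for highly correlated pairs (toy data; caret's function additionally decides which member of each pair is better to drop):

```r
# Detect and drop highly correlated predictors (|r| > 0.9)
set.seed(7)
df   <- data.frame(a = rnorm(80))
df$b <- df$a + rnorm(80, sd = 0.05)    # b is almost a duplicate of a
df$c <- rnorm(80)
cm <- abs(cor(df))
cm[upper.tri(cm, diag = TRUE)] <- 0    # consider each pair only once
hits <- which(cm > 0.9, arr.ind = TRUE)
drop <- unique(rownames(hits))         # here: "b"
df_reduced <- df[, setdiff(names(df), drop)]
names(df_reduced)                      # "a" and "c" survive
```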
The demo proceeds by illustrating the use of caret’s functionalities for feature selection and showing how good predictions can be obtained only using the best predictors rather than the whole panel of attributes.
Regularization approaches were also illustrated in this demo, using the same dataset. We saw how LASSO is effective at feature selection (unlike Ridge regression). We also explored the link between the threshold and the absolute size of the betas: the milder the applied constraint, the larger the betas (in absolute value), and the more betas remain different from zero. Finally, Elastic Net was performed; interestingly, Elastic Net selects betas via a procedure that is a compromise between LASSO and Ridge.
Data subsetting means creating partitions of the dataset. Often a dataset is split into 2 or 3 partitions, namely training, validation, and test subsets. The training set is used for training the model. The validation set is used to compare different models and select the one(s) to evaluate further. The test set is used to make sure that the model holds up on new, independent data, and hence is generalizable and deployable.
Data partitions are slices or portions of the dataset that include all the features of the input dataset but only a fraction of the cases/observations. Data subsetting is very easy in R. The key point is to make sure that the probability distribution of the target variable of interest is equivalent in each data partition. caret comes with a function that facilitates this: the createDataPartition() function.
In the Demo part of this lecture, we are using caret to subset a sample data.frame including 11 features and 80 observations.
1) The createDataPartition() function returns a vector of indexes that we can use to subset the input data.frame
2) We make sure that the target variable has the same probability distribution in the different data partitions.
3) caret can be used to partition data based on more than one attribute via the maxDissim() function. Briefly, we initialize two or more groups with random data points from the dataset and then proceed iteratively, one data point at a time, adding a new data point to each group. Specifically, the point most dissimilar to the points already in a given group is added to that group.
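The core requirement in step 2, a split that preserves the target's class proportions, can be sketched in base R with stratified sampling (toy target vector; createDataPartition() does this automatically):

```r
# Stratified 75/25 split that preserves the target's class proportions
set.seed(123)
target <- factor(rep(c("yes", "no"), times = c(60, 20)))  # 75% yes, 25% no
idx <- unlist(lapply(split(seq_along(target), target),
                     function(i) sample(i, size = round(0.75 * length(i)))))
train_y <- target[idx]     # 60 observations, still 75% yes
test_y  <- target[-idx]    # 20 observations, still 75% yes
prop.table(table(train_y))
```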
Quiz highlighting takeaways and activities from Module #5
Module 6: Predictive modeling
Before delving into the Modeling phase of CRISP-DM, it is important to consider a key question: which modeling algorithm is suitable for the data mining problem at hand? The answer mainly depends on two points:
1) The data and the type of target: is the target numeric, 2-class categorical, or multi-class?
2) The anticipated deployment: should the model be human-interpretable or not?
One of the workhorses of Machine Learning is regression. Regression is very simple, but also very handy and useful. Regression is typically used to model numeric variables; however, it can be adapted to classification problems (logistic regression). Regression models can be summarized by an equation similar to: y = b0 + b1x1 + b2x2 + …, where y is the target variable, the x are the predictors, and the b are coefficients that inform how to mix in and weight the different predictors to estimate the target values. Training a regression model means finding the betas that minimize the total error produced by this equation. In regression, the sum of squared residuals is often used to account for error; consequently, least squares is used to compute the betas. In the last part of this lecture, a simple example of least squares is discussed, to help you understand the math behind the process and the theory of numeric regression.
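The least squares solution mentioned above has a compact matrix form, beta = (X'X)^-1 X'y, which can be verified in a few lines of base R (toy simulated data):

```r
# Least squares via the normal equations, checked against lm()
set.seed(1)
x1 <- rnorm(30); x2 <- rnorm(30)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(30, sd = 0.2)
X  <- cbind(1, x1, x2)                    # design matrix (intercept column of 1s)
beta <- solve(t(X) %*% X) %*% t(X) %*% y  # estimated b0, b1, b2
sum((y - X %*% beta)^2)                   # minimized sum of squared residuals
coef(lm(y ~ x1 + x2))                     # lm() returns the same coefficients
```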
Regression using R
In this demo, we perform linear numeric regression using R. In the first example, we use only one dependent (y, numeric) and one independent (x, numeric) variable. We can use matrix algebra to easily solve the least squares problem and estimate the best betas. If we arbitrarily change the betas, the total error (sum of squared residuals) increases. Therefore, the estimated beta(s) correspond to a point of minimum error, where the derivative of the error with respect to beta is equal to 0.
In the second example, we include two predictors and an intercept. This improves the predictions (predicted values are much closer to the real values), which means that the sum of errors is getting smaller. Linear models can be trained using the lm() function instead of applying matrix algebra. With lm(), it is easy to train a linear model: we only need to provide arguments corresponding to a formula (target variable ~ predictor1 + predictor2 + …) and a variable name pointing to the input data. In regression, the magnitude of the betas (coefficients) informs about the importance of the predictors (assuming that the predictors were all normalized/scaled to compatible ranges). Once we have an lm-generated model, we can use the predict() function to predict new data (assuming that the new data were processed exactly as the data used for training the model). This returns the estimates of the target variable. Interestingly, training models using only a few good predictors is usually as good as (or even better than) including all predictors in the model (simple models are usually more generalizable).
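The lm()-then-predict() workflow looks like this (a minimal sketch; the dataset and column names are illustrative):

```r
# Train a linear model with lm() and score new observations with predict()
set.seed(2)
dat <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
dat$y <- 3 + 1.5 * dat$x1 - 2 * dat$x2 + rnorm(40, sd = 0.1)
model <- lm(y ~ x1 + x2, data = dat)     # formula + input data
coef(model)                              # betas: intercept, x1, x2
newdata <- data.frame(x1 = c(0, 1), x2 = c(0, 0))  # same columns as training
predict(model, newdata)                  # estimated target values
```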
Logistic regression is regression applied to classification problems. Specifically, logistic regression is used to predict categorical (binary) outcomes. For running logistic regression, the dataset has to be prepared with all variables encoded as numeric attributes (if necessary, use dummy variables). The math behind logistic regression is very similar to that of numeric regression. The target is still estimated based on a formula like: y = b0 + b1x1 + b2x2 + …, which still outputs continuous numeric values. The ‘logistic’ trick is to transform such values (numeric range: minus infinity to plus infinity) into values in the range 0 to 1. This transformation requires applying a link function (the logistic function). Finally, values >= 0.5 are discretized to class 1, and values < 0.5 to class 0 (for example, using the ifelse() function).
Here, the sum of squares cannot be applied to assess total error. Error can, however, be computed by looking at the confusion matrix. Accuracy is the number of correct predictions divided by the total number of predictions.
In R, glm() can be used to run logistic regression. glm() takes the following arguments: a formula, the input dataset, and the type of link function to be used (binomial – logit). Again, the predict() function is used to predict new data.
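Putting the pieces together, a minimal logistic regression run in base R (simulated binary target; the variable names are illustrative):

```r
# Logistic regression with glm(), thresholding at 0.5, and accuracy
set.seed(3)
x <- rnorm(100)
y <- rbinom(100, 1, 1 / (1 + exp(-2 * x)))     # binary target linked to x
model <- glm(y ~ x, family = binomial)         # binomial family = logit link
prob  <- predict(model, type = "response")     # probabilities in (0, 1)
pred  <- ifelse(prob >= 0.5, 1, 0)             # discretize to classes
table(predicted = pred, observed = y)          # confusion matrix
acc <- mean(pred == y)                         # accuracy
```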
Regression-based Machine Learning
In order to use regression as a modeling technique, we need to pay attention to the following points:
1) Linear regression assumes that the relationships in the data are linear. If they are not, transform the data so that the relationships are linear or close to linear
2) Remove collinearity among predictors (correlated predictors will be overfitted)
3) Rescale and transform data, for example scaling/standardization (z-scores). Regression works well when input and output variables have normal or normal-like distributions.
4) Remove noise and outliers
5) Use regularization methods whenever possible
6) Gradient descent can be used to estimate the betas when you want to add specific constraints or you need a custom loss function to account for the total error
A decision tree is a modeling algorithm producing a flowchart-like series of tests that support classification. Each leaf node in the decision tree flowchart identifies a target class. Unlike classification trees, regression trees can be used when the target is numeric. Decision trees are easy to train and use for prediction, and they come with important advantages:
1) Trees are easy to understand and interpret. Top nodes highlight the most important attributes
2) Data preparation and feature selection are not really required. Also, trees can be trained even if data include missing values
3) Computationally efficient
4) Decision trees can model non-linear effects and support multi-class classification
5) One caveat: trees are very prone to overfitting, but tree pruning may help prevent it
The demo part of the lecture covers how to train decision tree models in R using the rpart and caret libraries. The use of the confusion matrix to evaluate the performance of a classifier is also discussed.
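As a sketch of that workflow, here is a small classification tree trained with rpart on the built-in iris dataset (rpart ships with standard R installations; the split and seed are illustrative):

```r
# Train a classification tree and evaluate it with a confusion matrix
library(rpart)
set.seed(4)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]
tree  <- rpart(Species ~ ., data = train, method = "class")
pred  <- predict(tree, test, type = "class")
cm    <- table(predicted = pred, observed = test$Species)  # confusion matrix
sum(diag(cm)) / sum(cm)                                    # accuracy
```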
Boosted Trees and Random Forests
Boosted Trees and Random Forests are predictive models based on decision trees that make use of multiple trees at once. Briefly, many models (weak learners) are trained and then used together. Therefore, these methods are examples of Ensemble learning.
Random Forests can be used for both classification and regression. Random forests are based on two operations: 1) randomized feature selection and 2) bagging. In random forests, the dataset is used to prepare data slices, each including: 1) a random subset of the features; 2) a random sample of observations, drawn with replacement. While the predictions of each tree are affected by local noise, the performance of the ensemble model is less affected by noise in the data, as long as the models/data slices are independent.
Gradient boosting starts with a single decision tree and then adds other trees aimed at predicting the difficult-to-classify cases. Gradient boosting is similar to adaptive boosting (AdaBoost), but it is cast in a mathematical/statistical framework that tries to minimize the total error of the predictions as new trees are added, by applying a gradient descent-like procedure. Trees are added iteratively, without changing any previously added tree, and always with the aim of minimizing the prediction error of the model.
In the demo part of the lecture, we will be using the xgboost and randomForest R libraries.
You can use the caret package as a wrapper and use the train() function to train a random forest or xgboost model (method="rf" or "xgbTree", respectively), and then predict new data using the predict() function. If you want more control over the model parameters, you can use the functions from the xgboost library. First, you need to convert the dataset into a numeric matrix and prepare it using the xgb.DMatrix() function. The xgb.train() function is then used; custom parameters are passed via the params argument. Type ?xgb.train to learn more about this.
Black Box Methods
Black box methods produce models that are not human-understandable. Indeed, these models are built on sets of complex rules and complex data transformation operations that make the model itself almost impossible to read.
A Support Vector Machine (SVM) is a technique that generates models based on an optimal hyperplane separating the data observations into different classes. The separating hyperplane that allows the greatest separation between classes is selected. If the data are not linearly separable, SVM allows the use of soft margins and selects the separating hyperplane that minimizes the cost of cross-border cases. Also, data can be transformed via the kernel trick. This is effective for non-linear relationships in the data: the data are converted using mathematical transformations that make them linearly separable. In other words, new features that express mathematical relationships in the data are extracted and used for modeling (using linear, polynomial, or sigmoidal kernels).
SVM is widely used in many fields and can be used for both classification and numeric prediction. SVM has high accuracy and is not too prone to overfitting, but it is important to find the right set of parameters, and training can be slow. In R, the libraries “kernlab” and “e1071” can be used for SVM.
Neural networks are based on a paradigm that simulates how the human brain works: an interconnected network of neurons. Neurons collect information from the outside or from other neurons, weight each input differently, integrate the weighted inputs by applying a given activation function, and return processed information. Neuron outputs are sent to the outside or passed along to the next neurons in the network. Different activation functions can be implemented. Likewise, different network architectures are available (number of neuron layers in the network and total number of nodes in each layer).
Advantages: neural networks can be adapted to binary or numeric prediction problems, support multi-class classification, and are among the most accurate modeling algorithms (probably because they make very few assumptions about relationships in the data). However, they can be computationally intensive, they may overfit the data, and of course they result in complete black box models. R packages such as neuralnet, nnet, and RSNNS can be used for training neural networks.
Deep learning with R and H2O
Deep learning is a neural network-based modeling approach that relies on networks with complex architectures. Briefly, more than one hidden layer qualifies a network as a deep network. Having many hidden layers means that the data are transformed and aggregated multiple times, and hence higher-level representations of the data are obtained and used. Deep learning models are usually associated with very high accuracy.
R supports deep learning via different libraries. For example, you can use H2O, a Java-based framework that can be installed and accessed from R. H2O can be used for training many different types of models, including deep learning models. In the demo, we cover an example of deep learning aimed at classifying tumors as malignant or benign, based on a set of cancer cell size and morphology measurements. The demo shows how to: 1) prepare the data, subset them (training, validation, and test sets), and convert them into an H2O-compatible format using the as.h2o() function; 2) test combinations of deep learning hyperparameters to train and compare the performance of multiple networks, using the h2o.grid() function; 3) access and retrieve models from the H2O framework using unique model identifiers and the h2o.getModel() function; 4) obtain information about the relative importance of the different predictors, and predict new data via the h2o.predict() function.
Bayesian modeling is based on the Bayes theorem, which allows us to make efficient use of what is already known about a problem. Specifically, we can update the probability distribution of a given event/model parameter (for example, a mortality rate) by combining observed experimental data with a prior distribution, i.e., an established, already-known, or expected probability distribution for the same event/parameter. The resulting probability distribution is called the posterior distribution. Applying Bayes requires the availability of reliable priors and allows more inference to be drawn from small datasets. Also, Bayesian methods result in more conservative predictions (closer to the priors).
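A classic worked example of this kind of update is the beta-binomial model, sketched here in base R (the prior and the observed counts are illustrative, not taken from the course demo):

```r
# Updating a Beta prior on a mortality rate with observed binomial data
a0 <- 2; b0 <- 18                 # prior Beta(2, 18): rate expected near 10%
deaths <- 6; patients <- 20       # small observed dataset: raw rate 30%
a1 <- a0 + deaths                 # posterior is Beta(a0 + deaths,
b1 <- b0 + (patients - deaths)    #                   b0 + survivors)
prior_mean <- a0 / (a0 + b0)      # 0.10
mle        <- deaths / patients   # 0.30
post_mean  <- a1 / (a1 + b1)      # 0.20: pulled from the data toward the prior
```

With a much larger sample showing the same raw rate (say, 600 deaths in 2,000 patients), the posterior mean would sit very close to 0.30: the data overwhelm the prior, illustrating the interplay between prior and sample size.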
An example of how the Bayes theorem is applied to data is provided in the demo, showing the importance of the prior and of the sample size for the final results. In the demo, we will also discuss how to analyze patient survival data by plotting survival curves using the functions from the survival library.
Quiz highlighting takeaways and activities from Module #6
Module 7: Evaluation and Deployment
Accounting for Error
Modeling means finding the set of betas that minimizes the error produced by the model, where error summarizes the differences between predicted and observed values. In many cases, accuracy and the sum of squared residuals are chosen as metrics to account for prediction errors (in classification and numeric regression problems, respectively). However, these are not the only ways of measuring error. For classifiers, we may want to include information about the type of errors (false positives vs false negatives). This can be achieved by using metrics such as sensitivity and specificity. In numeric prediction, we can choose between the sum of squared residuals and the sum of absolute residuals. Also, we can add constraints to the loss function, such as regularization elements, to prevent the sum of absolute betas from growing beyond a given threshold (as we saw when we discussed LASSO and Ridge regression). The choice of which errors to favor depends on the specifics of the problem and the business question. When the sum of squares is not an option, gradient descent is usually applied.
In the demo part of the lecture, we cover how to perform gradient descent for a very simple regression problem. We show that if the metric of gradient descent is the sum of squares, then the results are identical to the standard least squares approach. However, you can update the loss function as you prefer and run gradient descent again, obtaining a model that aligns better with the business goals.
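A minimal base-R sketch of that demo's idea, for a one-predictor model with a sum-of-squares loss (toy data; swapping in a custom loss only requires changing the gradient line):

```r
# Gradient descent for least squares; converges to the lm() solution
set.seed(5)
x <- rnorm(50)
y <- 2.5 * x + rnorm(50, sd = 0.2)
b  <- 0        # starting guess for the slope
lr <- 0.01     # learning rate
for (i in 1:2000) {
  grad <- -2 * sum(x * (y - b * x))   # derivative of sum((y - b * x)^2) w.r.t. b
  b <- b - lr * grad / length(x)      # step against the gradient
}
c(gradient_descent = b,
  least_squares = unname(coef(lm(y ~ x - 1))))   # the two estimates match
```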
Gain, Lift, and ROC Charts
Gain and lift charts are visualizations that help assess the performance of a model (classifier) by analyzing increasingly large portions of the data population. An example is presented to clarify how lift/gain charts are generated (and what they mean):
1) Order the observations according to the probability of belonging to class 1 (model output); 2) count the number of class-1 cases in the top 10% of the ordered table; 3) next, count the number of class-1 cases in the top 20% of the ordered table; 4) and so on, until 100% of the data are covered.
Gain charts display the counts; lift charts display the fold enrichment of the model with respect to a perfectly random classifier (based on the proportion of class-0 and class-1 observations in the dataset).
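The steps above translate directly into base R (simulated scores and labels; the decile boundaries are illustrative):

```r
# Cumulative gain and lift from model scores
set.seed(6)
labels <- rbinom(200, 1, 0.3)               # true classes
scores <- labels * 0.3 + runif(200)         # informative but noisy model output
y_ord  <- labels[order(scores, decreasing = TRUE)]   # step 1: rank by score
decile <- ceiling(seq_along(y_ord) / (length(y_ord) / 10))
gain <- cumsum(tapply(y_ord, decile, sum)) / sum(labels)  # steps 2-4
lift <- gain / seq(0.1, 1, by = 0.1)        # enrichment vs the random baseline
round(lift, 2)  # lift > 1 in the top deciles; exactly 1 at 100% of the data
```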
ROC curves are also used to evaluate classifier performance. They show the trend of the true positive rate vs the false positive rate. As before, these are computed on increasingly large slices of the full data population. The area under the curve (AUC) is often used to compare ROC curves and the corresponding models. In the demo part of the lecture, we will use the pROC and ROCR libraries to plot ROC curves.
Evaluation and Deployment
Evaluation. Evaluating a model is very important in CRISP-DM. Evaluation is not limited to checking model performance and plotting some ROC curves. Evaluation means addressing the following important question: are the data mining results any good for answering the original business question? Evaluation usually includes three steps: 1) evaluation of results (does the model meet the business objectives? Test the model in a few real-world data applications; compare data-derived insights against what is already known about the problem: do the results make sense, and are they new and useful?); 2) review of the data mining project (re-evaluate the way the data mining process was conducted; what can be improved or refined?); 3) determine what to do next (what can be deployed, and how?)
Deployment. When the model is ready for deployment, we should put it to work. How to deploy a model depends on the problem, the business goals, and the specifics of the modeling results as well. The data mining team is expected to provide a deployment plan, with a summary of the strategy and instructions for deployment. Deployment does not always conclude the data mining process, since the CRISP-DM process is a cycle: a new DM process can start to further refine the model and keep it up to date (monitoring).
Three papers are discussed in this lecture to highlight how machine learning can be successfully applied in the field of clinical medicine.
Predicting diabetes mellitus using SMOTE and ensemble machine learning (PLOS ONE).
The authors investigated the risk of developing diabetes within a 5-year window by analyzing clinical parameters. They used a dataset including information on 32,000 patients with no cardiovascular disease who underwent a treadmill stress test. The dataset included 30 features, covering many medical/clinical parameters and measurements. Data preparation included data discretization, dummy variable encoding, and feature selection. The authors dealt with a class imbalance problem by applying random undersampling + SMOTE. No training/test splitting was performed; cross-validation (leave-one-out) was used instead. The authors trained different models and reported accuracies in the range 0.63–0.68, and hence decided to combine multiple models to improve performance (an ensemble approach). Every model provided a vote, and the class receiving the majority of votes was the one predicted by the ensemble. With this system, the authors reached high accuracy (89%) with high sensitivity, at the likely cost of more false positives.
A predictive metabolic signature for the transition from GDM to T2D.
The authors applied a metabolomics approach together with machine learning to identify biomarkers that could predict the development of T2DM in GDM patients. They used decision trees, obtaining models that were very easy to read and understand. The authors identified a shortlist of metabolites that, if dysregulated during pregnancy, could predict the development of T2DM.
A machine learning approach for the detection of novel pathogens from NGS data.
The authors examined DNA sequences from bacterial species and tried to predict whether a bacterial species is pathogenic to humans or not. Feature extraction was the most difficult and critical step in this project. The authors managed to extract informative features from raw NGS sequencing data, such as di- and tri-nucleotide frequencies, together with codon usage metrics (via in-silico translation). The data were then modeled by Random Forest. The model generalized well on the test dataset. The authors concluded that it is possible to predict pathogenicity just by looking at the DNA sequences of bacterial genomes.
Machine Learning and Data Mining are becoming very important in the context of Precision and Clinical Medicine. Here are a few recommendations for machine learning-assisted clinical medicine:
1) Avoid communication gaps: make sure that your Data Mining team includes professional figures that can facilitate information sharing among clinicians, biologists, technicians, computer scientists, and project sponsors (for example, PhD in Biomedical Sciences)
2) Make sure you understand the business goals, and the data you are going to work with. Specifically, domain knowledge is essential, but could be difficult to acquire. If you don’t have it, make sure to involve someone who does have it in your team.
3) Mind your goal. Make sure to align with the business objective at all times: this means selecting the most appropriate loss functions (should I weight false negatives and false positives differently?), as well as modeling algorithms (black-box vs non-black-box models).
4) Keep your model under tight monitoring even after deployment.
Quiz highlighting takeaways and activities from Module #7