{"id":1553,"date":"2019-03-05T02:06:00","date_gmt":"2019-03-05T02:06:00","guid":{"rendered":"http:\/\/kusuaks7\/?p=1158"},"modified":"2023-07-28T04:42:41","modified_gmt":"2023-07-28T04:42:41","slug":"machine-learning-classifier-basics-and-evaluation","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/machine-learning-classifier-basics-and-evaluation\/","title":{"rendered":"Machine Learning Classifier: Basics and Evaluation"},"content":{"rendered":"<h4 id=\"fe28\">This post is going to cover some very basic concepts in machine learning, from linear algebra to evaluation metrics. It serves as a nice guide to newbies looking to enter the field.<\/h4>\n<h3 id=\"354b\"><strong>Linera Algebra<\/strong><\/h3>\n<p id=\"0776\">Linear algebra is the core pillar in the field of machine learning. Why? Machine learning algorithms are described in books, papers and on website using vector and matrix notation. Linear algebra is the math of data and its notation allows you to describe operations on data precisely with specific operators. The 2 most important concepts in linear algebra you should be familiar with are vectors and matrices.<\/p>\n<h4 id=\"2fa9\"><strong>1 \u2014 Vectors<\/strong><\/h4>\n<p id=\"ad90\">A vector is a tuple of one or more values known as scalars. It is common to introduce vectors using geometric definition, in which a vector represents a point or coordinate in a high-dimensional space. 
The main vector arithmetic operations are addition, subtraction, multiplication, division, dot product, and scalar multiplication.<\/p>\n<ul>\n<li id=\"1816\">Two vectors of equal length can be added together to create a new third vector.<\/li>\n<li id=\"b174\">One vector can be subtracted from another vector of equal length to create a new third vector.<\/li>\n<li id=\"2248\">Two vectors of equal length can be multiplied together element-wise.<\/li>\n<li id=\"938a\">Two vectors of equal length can be divided element-wise.<\/li>\n<li id=\"00cb\">The dot product multiplies the elements of two vectors of the same length pairwise and sums the results, giving a scalar.<\/li>\n<li id=\"4511\">A vector can be multiplied by a scalar to scale the magnitude of the vector.<\/li>\n<\/ul>\n<p id=\"6ab8\">There are many uses of vectors. For example, vectors can represent an offset in 2D or 3D space. Points are just vectors from the origin. Data (pixels, gradients at an image key point, etc.) can also be treated as a vector.<\/p>\n<h4 id=\"afdc\"><strong>2 \u2014 Matrices<\/strong><\/h4>\n<p id=\"a081\">A matrix is a 2-dimensional array of scalars with one or more columns and one or more rows. If a matrix has the same number of rows and columns, it is a square matrix. An identity matrix is a square matrix with 1s on the diagonal and 0s everywhere else. A diagonal matrix is a square matrix with numbers on the diagonal and 0s elsewhere. 
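As a quick sketch, the element-wise operations and the dot product described above can be written in a few lines of plain Python, using lists as vectors (this is just an illustration; in practice you would typically reach for a library such as NumPy):

```python
# Element-wise vector arithmetic and the dot product on plain Python lists.
def vec_add(a, b):
    # add two vectors of equal length to create a new third vector
    return [x + y for x, y in zip(a, b)]

def vec_mul(a, b):
    # element-wise (Hadamard) product of two vectors of equal length
    return [x * y for x, y in zip(a, b)]

def dot(a, b):
    # multiply elements pairwise, then sum: yields a single scalar
    return sum(x * y for x, y in zip(a, b))

def scale(a, s):
    # multiply a vector by a scalar to scale its magnitude
    return [s * x for x in a]

a, b = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(vec_add(a, b))   # [5.0, 7.0, 9.0]
print(dot(a, b))       # 32.0  (1*4 + 2*5 + 3*6)
print(scale(a, 2.0))   # [2.0, 4.0, 6.0]
```

Subtraction and division follow the same pattern with `-` and `/` in place of `+` and `*`.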
A column vector is just a matrix with a single column, while a row vector is just a matrix with a single row.<\/p>\n<p id=\"337b\">The basic matrix operations are addition, subtraction, element-wise multiplication, division, the matrix product, matrix-vector multiplication, scalar multiplication, transpose, inverse, and determinant \/ trace.<\/p>\n<ul>\n<li id=\"7468\">Two matrices with the same dimensions can be added together to create a new third matrix.<\/li>\n<li id=\"d9f9\">Similarly, one matrix can be subtracted from another matrix with the same dimensions.<\/li>\n<li id=\"6dc7\">Two matrices of the same size can be multiplied together element by element, and this is often called element-wise matrix multiplication.<\/li>\n<li id=\"8c7e\">One matrix can be divided by another matrix with the same dimensions.<\/li>\n<li id=\"f2ac\">The matrix product computes each entry of the result as the dot product of a row of the first matrix with a column of the second.<\/li>\n<li id=\"e836\">A matrix and a vector can be multiplied together as long as the rule of matrix multiplication is observed.<\/li>\n<li id=\"e8e0\">A matrix can be multiplied by a scalar. The result is a matrix of the same size as the parent matrix in which each element is multiplied by the scalar value.<\/li>\n<li id=\"ad41\">Matrix transpose is when we flip a matrix\u2019s columns and rows, so row 1 becomes column 1, and so on.<\/li>\n<li id=\"884b\">Given a matrix A, its inverse A<sup>-1<\/sup> is a matrix such that A x A<sup>-1<\/sup> = I. If A<sup>-1<\/sup> exists, then A is invertible or non-singular. Otherwise, it is singular.<\/li>\n<\/ul>\n<h3 id=\"215d\"><strong>Machine Learning<\/strong><\/h3>\n<h4 id=\"5aad\"><strong>1 \u2014 Main Approaches<\/strong><\/h4>\n<p id=\"27d0\">The 3 major approaches to machine learning are:<\/p>\n<ul>\n<li id=\"7045\">Unsupervised Learning, which is used a lot in computer vision. 
Examples are k-means, ICA, PCA, Gaussian Mixture Models, and deep auto-encoders.<\/li>\n<li id=\"f169\">Supervised Learning, which is also used a lot in computer vision. Examples are deep supervised neural networks.<\/li>\n<li id=\"769d\">Reinforcement Learning, which is mostly used for robotics and control problems. Examples are deep Q-learning and policy gradient methods.<\/li>\n<\/ul>\n<figure id=\"9bdb\"><canvas width=\"75\" height=\"42\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 417px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*oyYedmmlGqAOTWBGabFFVQ.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*oyYedmmlGqAOTWBGabFFVQ.png\" \/><\/figure>\n<p id=\"4e52\">In\u00a0<strong>Reinforcement Learning<\/strong>, the rules of the game are unknown. There is no supervisor, only a reward signal with delayed feedback. The agent\u2019s actions affect the subsequent data it receives.<\/p>\n<p id=\"c189\">In\u00a0<strong>Unsupervised Learning<\/strong>, we want to find structure and meaning in data without any labels. As mentioned above, example algorithms are Principal Component Analysis, k-means, Non-Negative Matrix Factorization, Independent Component Analysis, Gaussian Mixture Models, Latent Dirichlet Allocation, and Auto-Encoders.<\/p>\n<p id=\"057e\">In\u00a0<strong>Supervised Learning<\/strong>, the basic setup is that we have an input space x and an output space y. After seeing a bunch of examples (x, y), we want to pick a mapping F: x -&gt; y that accurately replicates the input-output pattern of the examples. There are lots of algorithms for finding different models of F from labeled observations (ground truth). Learning problems are often categorized according to their output space, which can be either discrete or continuous.<\/p>\n<ul>\n<li id=\"f7a5\"><strong>Continuous<\/strong>\u00a0output space is used in regression problems. 
For example, given the input as an image of a plate of food, we want the output as the number of total calories. Or given the input as a person\u2019s face, we want the output as that person\u2019s age.<\/li>\n<li id=\"6953\">With\u00a0<strong>discrete<\/strong>\u00a0output space, we are typically building classifiers. For example, we want the output to be class 1, class 2, class 3\u2026, where the numbers represent some sort of meaningful category. We can also call these labels or categories.<\/li>\n<\/ul>\n<h4 id=\"c95a\"><strong>2 \u2014 Training Data and Test Data<\/strong><\/h4>\n<p id=\"04b4\"><strong>Training data<\/strong>\u00a0is what we give to the classifier during the training process. This is the paired data for learning f(x) = y, where x is the input (a vector of features in R<sup>d<\/sup>) and y is the output (a class label). During training, we try to find a good model of f(x) that replicates the input-output relationship of the data.<\/p>\n<p id=\"e635\"><strong>Test data<\/strong>\u00a0is used with a trained classifier to assess it. We have x, but we don\u2019t know y. In research, we usually only pretend not to have y, but in the real world you truly won\u2019t have it. The trained system predicts y for us, and we assess how well the system works.<\/p>\n<h4 id=\"d158\"><strong>3 \u2014 Probabilities<\/strong><\/h4>\n<p id=\"41b1\">Instead of only categorizing an input into one of K different classes, we may want to know the probability of each of those classes. For each possible category, P(C = k | x) returns a value between 0 and 1, with values closer to 1 meaning the classifier \u201cthinks\u201d that class is likely the category of the input. To classify, we pick the class k with the largest P(C = k | x).<\/p>\n<h4 id=\"0ede\"><strong>4 \u2014 Parametric and Non-Parametric<\/strong><\/h4>\n<p id=\"abf6\"><strong>Parametric<\/strong>\u00a0models involve fitting parameters. 
For example, we want to find good values for the w vectors of our linear classifier. Parametric models have a fixed, finite number of parameters, independent of the size of the training set.<\/p>\n<p id=\"0a63\"><strong>Non-parametric<\/strong>\u00a0models involve storing the training data and using it directly at prediction time. Here, the number of parameters depends on the training set (how much data you store). More specifically, the number of parameters grows with the size of the training set.<\/p>\n<h3 id=\"f0ea\"><strong>Evaluating Machine Learning Classifiers<\/strong><\/h3>\n<p id=\"c9f5\">After seeing a bunch of examples (x, y), the model picks a mapping F: x -&gt; y that replicates the input-output pattern of the examples. The question is: how well does the model F work after training? We can evaluate F on the training data, but performance on the training data does not really tell us how well the model generalizes to other data. In other words, F could just memorize the training data. Thus, we must evaluate on the test data to get a better idea of how well the model works.<\/p>\n<p id=\"de47\">The simplest evaluation measure for classification is\u00a0<strong>accuracy<\/strong>, which is the fraction of points correctly classified. Accuracy can be calculated as the sum of true positives and true negatives divided by the total number of data points. The\u00a0<strong>error rate<\/strong>\u00a0is simply 1 \u2212 accuracy, or the sum of false positives and false negatives divided by the total number of data points.<\/p>\n<p id=\"fe67\">To compute accuracy, the simplest approach is to store your prediction for each test vector x_t. So, if you have a total of T vectors to test, you end up with a T-dimensional vector of all your predictions, which you then just compare to the ground truth. 
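As a minimal sketch in plain Python (the label vectors below are made up for illustration), comparing the prediction vector against the ground truth gives the accuracy directly:

```python
# Accuracy: the fraction of matching entries between a vector of T
# predictions and the vector of T ground-truth labels.
def accuracy(y_pred, y_true):
    correct = sum(1 for p, t in zip(y_pred, y_true) if p == t)
    return correct / len(y_true)

# Hypothetical labels for T = 5 test vectors.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_pred, y_true))  # 0.8, so the error rate is 1 - 0.8 = 0.2
```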
You could also save the probabilities to compute other metrics, but this may require a lot of memory if you have many classes.<\/p>\n<figure id=\"163c\"><canvas width=\"75\" height=\"42\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 411px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*tSkwwxqTO1X_2ODQSx61Ow.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*tSkwwxqTO1X_2ODQSx61Ow.png\" \/><\/figure>\n<h4 id=\"a53f\"><strong>1 \u2014 Confusion Matrix<\/strong><\/h4>\n<p id=\"d764\">A\u00a0<strong>confusion matrix<\/strong>\u00a0is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class. It lets you see which classes are most confused with one another. A \u201cstrong\u201d diagonal indicates good performance. In a normalized confusion matrix, the rows are normalized to sum to 1.<\/p>\n<h4 id=\"3d38\"><strong>2 \u2014 Precision and Recall<\/strong><\/h4>\n<p id=\"668f\"><strong>Precision<\/strong>\u00a0is the number of correctly classified positive examples divided by the total number of examples classified as positive.\u00a0<strong>Recall<\/strong>\u00a0is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set. Precision-Recall is a dominant metric in object detection and is especially useful when categories are unbalanced. High precision means that an algorithm returned substantially more relevant results than irrelevant ones, while high recall means that an algorithm returned most of the relevant results.<\/p>\n<h4 id=\"91b0\"><strong>3 \u2014 Classifier Thresholds<\/strong><\/h4>\n<p id=\"a99a\">Often a classifier will have some confidence value in each category. These are most often generated by probabilistic classifiers. Sometimes, we threshold the probability values. 
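For a binary problem, this thresholding can be sketched in plain Python: picking a threshold turns the confidence scores into hard labels, from which precision and recall follow (the scores and labels below are made up for illustration).

```python
# Threshold probabilistic scores into hard labels, then compute
# precision and recall against the ground truth.
def precision_recall(scores, y_true, threshold):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.9, 0.8, 0.6, 0.4, 0.2]   # made-up classifier confidences
y_true = [1, 1, 0, 1, 0]
p, r = precision_recall(scores, y_true, 0.5)  # both come out to 2/3 here
```

Re-running this for many threshold values between 0 and 1 yields the points of a precision-recall curve.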
In computer vision, this happens a lot in detection. The optimal threshold depends on the task. Some performance metrics are sensitive to the threshold. By sweeping the threshold from its minimum to its maximum value (from 0 to 1), we can trace out a curve (an ROC or Precision-Recall curve).<\/p>\n<figure id=\"9a7c\"><canvas width=\"75\" height=\"25\"><\/canvas><img decoding=\"async\" style=\"width: 700px; height: 243px;\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*_7OPgojau8hkiPUiHoGK_w.png\" data-src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*_7OPgojau8hkiPUiHoGK_w.png\" \/><\/figure>\n<h3 id=\"dc74\"><strong>Model Power<\/strong><\/h3>\n<h4 id=\"59f4\"><strong>1 \u2014 Overfitting &amp; Underfitting<\/strong><\/h4>\n<p id=\"50d7\">Your classifier F is said to\u00a0<strong>overfit<\/strong>\u00a0the data if the training performance is much greater than the test performance. On the other hand, F\u00a0<strong>underfits<\/strong>\u00a0the data when it is not powerful enough to model the input-output relationship.<\/p>\n<h4 id=\"fa6d\"><strong>2 \u2014 Model Capacity<\/strong><\/h4>\n<p id=\"2a6a\"><strong>Model capacity<\/strong>\u00a0refers to the number of parameters. Bigger capacity generally means you need more training data to train the model; otherwise, it has a high risk of overfitting. On the other hand, low capacity probably means that you underfit the training data significantly.<\/p>\n<p id=\"f923\">Specifically, with neural networks, more neurons mean more capacity. A good analogy is that a linear classifier has very low capacity, while a deep neural network has a lot of capacity. With more training data, the more complex model wins. 
Classifiers with greater model capacity can create complex decision boundaries, but an overly complex boundary may model the test data poorly, especially if data is limited.<\/p>\n<h4 id=\"0bc4\"><strong>3 \u2014 Sample Complexity<\/strong><\/h4>\n<p id=\"5926\"><strong>Sample complexity<\/strong>\u00a0is the number of training samples we need to supply to the learning algorithm so that the function returned by the algorithm is within an arbitrarily small error of the best possible function (train and test). Algorithms with larger model capacity tend to have worse sample complexity because they have more parameters to fit and are likely to overfit with insufficient training data. In those cases, regularization techniques can help.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Often a classifier will have some confidence value in each category. These are most often generated by probabilistic classifiers. Sometimes, we threshold the probability values. In computer vision, this happens a lot in detection. The optimal threshold depends on the task. Some performance metrics are sensitive to the threshold.&nbsp; This post is going to cover some very basic concepts in machine learning, from linear algebra to evaluation metrics. 
It serves as a nice guide to newbies looking to enter the field.<\/p>\n","protected":false},"author":86,"featured_media":4039,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[1842],"class_list":["post-1553","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":1842,"user_id":86,"is_guest":0,"slug":"james-le","display_name":"James Le","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Le","first_name":"James","job_title":"","description":"James Le is a Software Developer with experiences in Product Management and Data Analytics. He played a pivotal role in the operation of a start-up organization at Denison University."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1553","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/86"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1553"}],"version-history":[{"count":3,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1553\/revisions"}],"predecessor-version":[{"id":29668,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1553\/revisions\/29668"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/4039"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1553"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1553"},{"taxonomy":"post_t
ag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1553"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1553"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}