{"id":2208,"date":"2020-01-22T03:08:13","date_gmt":"2020-01-22T00:08:13","guid":{"rendered":"http:\/\/kusuaks7\/?p=1813"},"modified":"2024-01-25T12:33:13","modified_gmt":"2024-01-25T12:33:13","slug":"the-24-essential-evaluation-metrics-for-binary-classification-explained","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/the-24-essential-evaluation-metrics-for-binary-classification-explained\/","title":{"rendered":"The 24 Essential Evaluation Metrics for Binary Classification Explained"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2208\" class=\"elementor elementor-2208\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-5607556f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"49251\" data-id=\"5607556f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-45ca5b0c\" data-eae-slider=\"46196\" data-id=\"45ca5b0c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-1e2acc1 elementor-widget elementor-widget-text-editor\" data-id=\"1e2acc1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<p class=\"wp-block-paragraph\">For simplicity, let\u2019s assume there are three customers (c1, c2, c3) in this batch, and one vehicle (v1) information is provided as a sale.<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>P(C=c1) represents the likelihood of c1 to buy any car. 
Assuming no prior knowledge about each customer, their likelihood of buying any car should be the same: P(C=c1) = P(C=c2) = P(C=c3), which equals a constant (e.g. 1\/3 in this situation)<\/li><li>P(V=v1) is the likelihood for v1 to be sold; given that it is shown in this batch as a sale, this should be 1 (100% likelihood to be sold)<\/li><\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Since there is only one customer making the purchase, this probability can be expanded into:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">P(V=v1) = P(C=c1, V=v1) + P(C=c2, V=v1) + P(C=c3, V=v1) = 1.0<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For each term in this sum, the following identity holds:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">P(C=c1, V=v1) = P(C=c1|V=v1) * P(V=v1) = P(V=v1|C=c1) * P(C=c1)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since P(V=v1) = 1 and P(C) is the same constant for every customer, P(C=c1|V=v1) is proportional to P(V=v1|C=c1). Normalizing over the three customers gives the formula for the probability calculation:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">P(C=c1|V=v1) = P(V=v1|C=c1) \/ (P(V=v1|C=c1) + P(V=v1|C=c2) + P(V=v1|C=c3))<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">and the key is to get the probability for each P(V|C). In words: the likelihood for a vehicle to be purchased by a specific customer is proportional to the likelihood for that customer to buy this specific vehicle.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The above formula may look too \u201cmathematical\u201d, so let me put it into an intuitive context: assume three people are in a room; one is a musician, one is an athlete, and one is a data scientist. You are told there is a violin in this room belonging to one of them. Now guess, who do you think is the owner of the violin? This is pretty straightforward, right? Given that the likelihood of a musician owning a violin is high, and the likelihood of an athlete or a data scientist owning one is lower, it is much more likely for the violin to belong to the musician. 
The \u201cmathematical\u201d thinking process is illustrated below.<\/p>\n\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-322f70d0 elementor-widget elementor-widget-text-editor\" data-id=\"322f70d0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tNot sure which evaluation metric you should choose for your binary classification problem? After reading this blog post you should have a good idea.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-274389c elementor-widget elementor-widget-text-editor\" data-id=\"274389c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tYou will learn about a bunch of common and lesser-known evaluation metrics and charts to\u00a0<strong>understand how to choose<\/strong>\u00a0the model performance\u00a0<strong>metric for your problem<\/strong>. 
Specifically, for each metric, I will talk about:\n<ul>\n \t<li>What is the\u00a0<strong>definition<\/strong>\u00a0and\u00a0<strong>intuition<\/strong>\u00a0behind it,<\/li>\n \t<li>The\u00a0<strong>non-technical explanation<\/strong>\u00a0that you can communicate to business stakeholders,<\/li>\n \t<li><strong>How to calculate or plot it<\/strong>,<\/li>\n \t<li><strong>When<\/strong>\u00a0should you\u00a0<strong>use it<\/strong>.<\/li>\n<\/ul>\nWith that, you will understand the trade-offs so that making metric related decisions will be easier.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6b43a1b elementor-widget elementor-widget-text-editor\" data-id=\"6b43a1b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tI will present all the good stuff in a moment, but first, let\u2019s define our classification problem.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c3f9e03 elementor-widget elementor-widget-heading\" data-id=\"c3f9e03\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Before we start: problem definition<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4fa37bc elementor-widget elementor-widget-text-editor\" data-id=\"4fa37bc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tYou will be using those evaluation metrics in a context of a project, so I prepared an example fraud-detection problem based on a recent\u00a0<a href=\"https:\/\/www.kaggle.com\/c\/ieee-fraud-detection\/overview\" 
rel=\"noopener\">Kaggle competition<\/a>.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a5d0925 elementor-widget elementor-widget-text-editor\" data-id=\"a5d0925\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tI selected\u00a0<strong>43 features<\/strong>\u00a0and sampled\u00a0<strong>66000 observations<\/strong>\u00a0from the original dataset, adjusting the\u00a0<strong>fraction of the positive class to 0.09<\/strong>.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8d8fbac elementor-widget elementor-widget-text-editor\" data-id=\"8d8fbac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThen I trained a bunch of LightGBM classifiers with different hyperparameters. I only varied the\u00a0<strong>learning_rate<\/strong>\u00a0and\u00a0<strong>n_estimators<\/strong>\u00a0parameters because I wanted to have an intuition as to which models are \u201ctruly\u201d better. Specifically, I suspect that a model with only 10 trees is worse than a model with 100 trees. 
Of course, as we use more trees and smaller learning rates it gets tricky, but I think it is a decent proxy.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2239861 elementor-widget elementor-widget-text-editor\" data-id=\"2239861\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSo for combinations of\u00a0<strong>learning_rate<\/strong>\u00a0and\u00a0<strong>n_estimators<\/strong>, I did the following:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3b69052 elementor-widget elementor-widget-text-editor\" data-id=\"3b69052\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li>defined hyperparameter values:<\/li>\n<\/ul>\n<pre><code>MODEL_PARAMS = {'random_state': 1234,\n'learning_rate': 0.1,\n'n_estimators': 10}<\/code><\/pre>\n<ul>\n \t<li>trained the model:<\/li>\n<\/ul>\n<pre><code>import lightgbm\n\nmodel = lightgbm.LGBMClassifier(**MODEL_PARAMS)\nmodel.fit(X_train, y_train)<\/code><\/pre>\n<ul>\n \t<li>predicted on test data:<\/li>\n<\/ul>\n<pre><code>y_test_pred = model.predict_proba(X_test)\n<\/code><\/pre>\n<ul>\n \t<li>logged all the metrics for each run:<\/li>\n<\/ul>\n<pre><code>log_binary_classification_metrics(y_test, y_test_pred)\n<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a90edd6 elementor-widget elementor-widget-text-editor\" data-id=\"a90edd6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFor the full code base\u00a0<a href=\"https:\/\/github.com\/neptune-ml\/blog-binary-classification-metrics\" rel=\"noopener\">go to this 
repository<\/a>\u00a0or\u00a0scroll down to the example script.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9035620 elementor-widget elementor-widget-text-editor\" data-id=\"9035620\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tYou can also\u00a0<a href=\"https:\/\/ui.neptune.ml\/neptune-ml\/binary-classification-metrics\/experiments?filterId=20f71748-85ad-499d-a72e-68962bcd36a0&amp;viewId=817b46ba-103e-11ea-9a39-42010a840083&amp;utm_source=hackernoon&amp;utm_medium=crosspost&amp;utm_campaign=blog-evaluation-metrics-binary-classification&amp;utm_content=explore-dashboard\" rel=\"noopener\">explore experiment runs<\/a>\u00a0with:\n<ul>\n \t<li>evaluation metrics<\/li>\n \t<li>performance charts<\/li>\n \t<li>metric by threshold plots<\/li>\n<\/ul>\nOk, now we are ready to talk about those classification metrics!\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9892994 elementor-widget elementor-widget-heading\" data-id=\"9892994\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>1. 
Confusion Matrix<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a54ca95 elementor-widget elementor-widget-text-editor\" data-id=\"a54ca95\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-64a660b elementor-widget elementor-widget-text-editor\" data-id=\"64a660b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nIt is a common way of presenting true positive (tp), true negative (tn), false positive (fp) and false negative (fn) predictions. Those values are presented in the form of a matrix where the Y-axis shows the true classes while the X-axis shows the predicted classes.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e1e87c4 elementor-widget elementor-widget-text-editor\" data-id=\"e1e87c4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nIt is calculated on class predictions, which means the outputs from your model need to be thresholded first.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-572a91a elementor-widget elementor-widget-text-editor\" data-id=\"572a91a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code>from sklearn.metrics import confusion_matrix\n\ny_pred_class = y_pred_pos &gt; threshold\ncm = confusion_matrix(y_true, y_pred_class)\ntn, fp, fn, tp = cm.ravel()<\/code><\/pre>\n<strong>How does it 
look:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a9005a3 elementor-widget elementor-widget-text-editor\" data-id=\"a9005a3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSo in this example, we can see that:\n<ul>\n \t<li><strong>11918<\/strong>\u00a0predictions were<strong>\u00a0true negatives<\/strong>,<\/li>\n \t<li><strong>872<\/strong>\u00a0were\u00a0<strong>true positives<\/strong>,<\/li>\n \t<li><strong>82<\/strong>\u00a0were\u00a0<strong>false positives<\/strong>,<\/li>\n \t<li><strong>333<\/strong>\u00a0predictions were\u00a0<strong>false negatives<\/strong>.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-588816c elementor-widget elementor-widget-text-editor\" data-id=\"588816c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAlso, as we already know, this is an imbalanced problem. 
By the way, if you want to read more about imbalanced problems I recommend taking a look at this\u00a0<a href=\"https:\/\/www.svds.com\/learning-imbalanced-classes\/\" rel=\"noopener\">article by Tom Fawcett<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6c0dcb4 elementor-widget elementor-widget-text-editor\" data-id=\"6c0dcb4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-093fded elementor-widget elementor-widget-text-editor\" data-id=\"093fded\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nPretty much always. I like to see the nominal values rather than normalized to get a feeling on how the model is doing on different, often imbalanced, classes.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b89d29b elementor-widget elementor-widget-heading\" data-id=\"b89d29b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>2. False Positive Rate | Type I error<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b2aca80 elementor-widget elementor-widget-text-editor\" data-id=\"b2aca80\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen we predict something when it isn\u2019t we are contributing to the false positive rate. 
You can think of it as the\u00a0<strong>fraction of non-fraudulent cases that raise a false alert<\/strong>\u00a0based on your model predictions.\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/hackernoon.com\/photos\/L4OCiGu7n6cUBUyJKe9Rs0MX1d73-yk1ir32tu\" alt=\"\" \/><\/p>\n<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import confusion_matrix\n\ny_pred_class = y_pred_pos &gt; threshold\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()\nfalse_positive_rate = fp \/ (fp + tn)<\/code><\/pre>\n<strong>How models score in this metric (threshold=0.5):<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7ef0fae elementor-widget elementor-widget-text-editor\" data-id=\"7ef0fae\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nFor all the models, type-1 error rates are pretty low, but by adjusting the threshold we can get an even lower ratio. 
Since we have the many true negatives in the denominator, our error will tend to be low just because the dataset is imbalanced.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-327d65f elementor-widget elementor-widget-text-editor\" data-id=\"327d65f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed94ad4 elementor-widget elementor-widget-text-editor\" data-id=\"ed94ad4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tObviously, if we increase the threshold, only higher-scored observations will be classified as positive. In our example, we can see that to reach a perfect FPR of 0 we need to increase the threshold to 0.83. However, that will likely mean that only very few observations are classified as positive.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-140f6b9 elementor-widget elementor-widget-text-editor\" data-id=\"140f6b9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>You rarely would use this metric alone. 
Usually as an auxiliary one with some other metric,<\/li>\n \t<li>If the\u00a0<strong>cost of dealing with an alert is high<\/strong>\u00a0you should consider increasing the threshold to get fewer alerts.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-978648d elementor-widget elementor-widget-heading\" data-id=\"978648d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>3. False Negative Rate | Type II error<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1c467f3 elementor-widget elementor-widget-text-editor\" data-id=\"1c467f3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen we don\u2019t predict something when it is, we are contributing to the false negative rate. 
You can think of it as the\u00a0<strong>fraction of fraudulent transactions<\/strong>\u00a0that your model misses and lets through.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-85facad elementor-widget elementor-widget-text-editor\" data-id=\"85facad\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fd0cd5b elementor-widget elementor-widget-text-editor\" data-id=\"fd0cd5b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code>from sklearn.metrics import confusion_matrix\n\ny_pred_class = y_pred_pos &gt; threshold\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()\nfalse_negative_rate = fn \/ (tp + fn)<\/code><\/pre>\n<strong>How models score in this metric (threshold=0.5):<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a3851e6 elementor-widget elementor-widget-text-editor\" data-id=\"a3851e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can see that in our example, type-2 errors are quite a bit higher than type-1 errors. Interestingly, our\u00a0<a href=\"https:\/\/ui.neptune.ml\/o\/neptune-ml\/org\/binary-classification-metrics\/e\/BIN-98?utm_source=hackernoon&amp;utm_medium=crosspost&amp;utm_campaign=blog-evaluation-metrics-binary-classification&amp;utm_content=explore-experiment\" rel=\"noopener\">BIN-98 experiment<\/a>, which had the lowest type-1 error, has the highest type-2 error. 
There is a simple explanation based on the fact that our dataset is imbalanced and, with type-2 error, we don\u2019t have true negatives in the denominator.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e81c3e7 elementor-widget elementor-widget-text-editor\" data-id=\"e81c3e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1c4e07c elementor-widget elementor-widget-text-editor\" data-id=\"1c4e07c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIf we decrease the threshold, more observations will\u00a0be classified\u00a0as positive. At a certain threshold, we will mark everything as positive (fraudulent, for example). 
We can actually get down to an FNR of 0.083 by decreasing the threshold to 0.01.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-cd67387 elementor-widget elementor-widget-text-editor\" data-id=\"cd67387\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>When to use it:<\/strong>\n<ul>\n \t<li>Usually, it is not used alone but rather with some other metric,<\/li>\n \t<li>If the cost of letting fraudulent transactions through is high and the value you get from the users isn\u2019t, you can consider focusing on this number.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-840fdf5 elementor-widget elementor-widget-heading\" data-id=\"840fdf5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>4. True Negative Rate | Specificity<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c0d5be5 elementor-widget elementor-widget-text-editor\" data-id=\"c0d5be5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt measures how many observations, out of all negative observations, we have classified as negative. 
In our fraud detection example, it tells us how many transactions, out of all non-fraudulent transactions, we marked as clean.\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/hackernoon.com\/photos\/L4OCiGu7n6cUBUyJKe9Rs0MX1d73-k726u32vq\" alt=\"\" \/><\/p>\n<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import confusion_matrix\n\ny_pred_class = y_pred_pos &gt; threshold\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()\ntrue_negative_rate = tn \/ (tn + fp)<\/code><\/pre>\n<strong>How models score in this metric (threshold=0.5):<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e76ba5b elementor-widget elementor-widget-text-editor\" data-id=\"e76ba5b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tVery high specificity for all the models. If you think about it, in our imbalanced problem you would expect that.\u00a0Classifying negative cases as negative is a lot easier than classifying positive cases and hence the score is high.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-a40f81f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"1444\" data-id=\"a40f81f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-1c8a5b6\" data-eae-slider=\"51413\" data-id=\"1c8a5b6\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-01fae51 elementor-widget elementor-widget-text-editor\" data-id=\"01fae51\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a3a6e6e elementor-widget elementor-widget-text-editor\" data-id=\"a3a6e6e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe higher the threshold, the more truly negative observations we can recall. 
We can see that starting from say threshold=0.4 our model is doing really well in classifying negative cases as negative.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-916c2a6 elementor-widget elementor-widget-text-editor\" data-id=\"916c2a6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>When to use it:<\/strong>\n<ul>\n \t<li>Usually, you don\u2019t use it alone but rather as an auxiliary metric,<\/li>\n \t<li>When you really want to be sure that you are right when you say something is safe. A typical example would be a doctor telling a patient \u201cyou are healthy\u201d. Making a mistake here and telling a sick person they are safe and can go home is something you may want to avoid.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bb76551 elementor-widget elementor-widget-heading\" data-id=\"bb76551\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>5. Negative Predictive Value<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1c572e1 elementor-widget elementor-widget-text-editor\" data-id=\"1c572e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt measures how many predictions out of all negative predictions were correct. You can think of it as precision for negative class. 
With our example, it tells us what fraction of the transactions predicted as clean are actually clean.\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/hackernoon.com\/photos\/L4OCiGu7n6cUBUyJKe9Rs0MX1d73-6g2wj32ra\" alt=\"\" \/><\/p>\n<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import confusion_matrix\n\ny_pred_class = y_pred_pos &gt; threshold\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()\nnegative_predictive_value = tn \/ (tn + fn)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-41e37d1 elementor-widget elementor-widget-text-editor\" data-id=\"41e37d1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How models score in this metric (threshold=0.5):<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-14be77b elementor-widget elementor-widget-text-editor\" data-id=\"14be77b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nAll models score really high, and no wonder, since with an imbalanced problem it is easy to predict the negative class.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9f91d59 elementor-widget elementor-widget-text-editor\" data-id=\"9f91d59\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7d744fe elementor-widget elementor-widget-text-editor\" data-id=\"7d744fe\" 
data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe higher the threshold the more cases are classified as negative and the score goes down. However, in our imbalanced example even at a very high threshold, the negative predictive value is still good.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f223ba2 elementor-widget elementor-widget-text-editor\" data-id=\"f223ba2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>When we care about high precision on negative predictions. For example, imagine we really don\u2019t want to have any additional process for screening the transactions predicted as clean. In that case, we may want to make sure that our negative predictive value is high.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ecf7eac elementor-widget elementor-widget-heading\" data-id=\"ecf7eac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>6. False Discovery Rate<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bf1d9a6 elementor-widget elementor-widget-text-editor\" data-id=\"bf1d9a6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt measures how many predictions out of all positive predictions were incorrect. You can think of it as simply 1-precision. 
With our example, it tells us what is the fraction of incorrectly predicted fraudulent transactions in all fraudulent predictions.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fe35f62 elementor-widget elementor-widget-text-editor\" data-id=\"fe35f62\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-70f4a53 elementor-widget elementor-widget-text-editor\" data-id=\"70f4a53\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code>from sklearn.metrics import confusion_matrix\n\ny_pred_class = y_pred_pos &gt; threshold\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()\nfalse_discovery_rate = fp\/ (tp + fp)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-10737ab elementor-widget elementor-widget-text-editor\" data-id=\"10737ab\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How models score in this metric (threshold=0.5):<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-efbe8d1 elementor-widget elementor-widget-text-editor\" data-id=\"efbe8d1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe \u201cbest model\u201d is incredibly shallow lightGBM which we expect to be incorrect (deeper model should work better).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div 
class=\"elementor-element elementor-element-c137207 elementor-widget elementor-widget-text-editor\" data-id=\"c137207\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThat is an important takeaway: looking at precision (or recall) alone can lead you to select a suboptimal model.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a98b110 elementor-widget elementor-widget-text-editor\" data-id=\"a98b110\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-85d4a5d elementor-widget elementor-widget-text-editor\" data-id=\"85d4a5d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe higher the threshold, the fewer positive predictions there are. With fewer positive predictions, the ones that are still classified as positive have higher certainty scores. 
Hence, the false discovery rate goes down.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b0e94b3 elementor-widget elementor-widget-text-editor\" data-id=\"b0e94b3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>Again, it usually doesn\u2019t make sense to use it alone but rather coupled with other metrics like recall.<\/li>\n \t<li>When raising false alerts is costly and you want all the positive predictions to be worth looking at, you should optimize for precision, which is the same as minimizing the false discovery rate.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1ea6005 elementor-widget elementor-widget-heading\" data-id=\"1ea6005\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>7. True Positive Rate | Recall | Sensitivity<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b11a523 elementor-widget elementor-widget-text-editor\" data-id=\"b11a523\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt measures how many observations out of all positive observations we have classified as positive. 
It tells us how many fraudulent transactions we recalled from all fraudulent transactions.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-01d1935 elementor-widget elementor-widget-text-editor\" data-id=\"01d1935\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen you are optimizing recall you want to\u00a0<strong>put all guilty in prison.<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a422f22 elementor-widget elementor-widget-text-editor\" data-id=\"a422f22\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import confusion_matrix, recall_score\n\ny_pred_class = y_pred_pos &gt; threshold\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()\ntrue_positive_rate = tp \/ (tp + fn)\n\n# or simply\n\nrecall_score(y_true, y_pred_class)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-123c396 elementor-widget elementor-widget-text-editor\" data-id=\"123c396\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOur best model recalls 72% of fraudulent transactions (a recall of 0.72) at a threshold of 0.5. The difference in recall between our models is quite significant, and we can clearly see better and worse models. 
Of course, for every model, we can adjust the threshold to recall all fraudulent transactions.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-455165e elementor-widget elementor-widget-text-editor\" data-id=\"455165e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f33cef4 elementor-widget elementor-widget-text-editor\" data-id=\"f33cef4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFor the threshold of 0.1, we classify the vast majority of transactions as fraudulent and hence get really high recall of 0.917. As the threshold increases the recall falls.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f520888 elementor-widget elementor-widget-text-editor\" data-id=\"f520888\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n<li>Usually, you will not use it alone but rather coupled with other metrics like precision.<\/li>\n<li>That being said, recall is a go-to metric, when you really care about catching all fraudulent transactions even at a cost of false alerts. 
Potentially it is cheap for you to process those alerts and very expensive when the transaction goes unseen.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-59ed6ea elementor-widget elementor-widget-heading\" data-id=\"59ed6ea\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>8. Positive Predictive Value | Precision<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fac98bf elementor-widget elementor-widget-text-editor\" data-id=\"fac98bf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt measures how many observations predicted as positive are in fact positive. Taking our fraud detection example, it tells us what is the ratio of transactions correctly classified as fraudulent.\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/hackernoon.com\/photos\/L4OCiGu7n6cUBUyJKe9Rs0MX1d73-5e3o9324d\" alt=\"\" \/><\/p>\nWhen you are optimizing precision you want to make sure that\u00a0<strong>people that you put in prison are guilty<\/strong>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-6a27ee3 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"20104\" data-id=\"6a27ee3\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column 
elementor-element elementor-element-11ae56d\" data-eae-slider=\"82393\" data-id=\"11ae56d\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-9f276d2 elementor-widget elementor-widget-text-editor\" data-id=\"9f276d2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How to compute:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4de9382 elementor-widget elementor-widget-text-editor\" data-id=\"4de9382\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<pre><code>from sklearn.metrics import confusion_matrix, precision_score\n\ny_pred_class = y_pred_pos &gt; threshold\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()\npositive_predictive_value = tp\/ (tp + fp)\n\n# or simply\n\nprecision_score(y_true, y_pred_class)<\/code><\/pre>\n<strong>How models score in this metric (threshold=0.5):<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0be567b elementor-widget elementor-widget-text-editor\" data-id=\"0be567b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nIt seems like all the models have pretty high precision at this threshold. 
The \u201cbest model\u201d is\u00a0incredibly\u00a0shallow lightGBM which\u00a0obviously\u00a0smells fishy.\u00a0That is an important takeaway, looking at precision (or recall) alone can lead to you selecting a suboptimal model.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ccc4b71 elementor-widget elementor-widget-text-editor\" data-id=\"ccc4b71\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOf course, for every model, we can adjust the threshold to increase precision. That is because if we take a small fraction of high scoring predictions the precision on those will likely be high.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-92eb599 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"22279\" data-id=\"92eb599\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-9f3237e\" data-eae-slider=\"55820\" data-id=\"9f3237e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-b1c2cee elementor-widget elementor-widget-text-editor\" data-id=\"b1c2cee\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div 
class=\"elementor-element elementor-element-c0eb94c elementor-widget elementor-widget-text-editor\" data-id=\"c0eb94c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe higher the threshold the better the precision and with a threshold of 0.68 we can actually get a perfectly precise model. Over this threshold, the model doesn\u2019t classify anything as positive and so we don\u2019t plot it.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d2d84cd elementor-widget elementor-widget-text-editor\" data-id=\"d2d84cd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>Again, it usually doesn\u2019t make sense to use it alone but rather coupled with other metrics like recall.<\/li>\n \t<li>When raising false alerts is costly, when you want all the positive predictions to be worth looking at you should optimize for precision.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-81a5d3d elementor-widget elementor-widget-heading\" data-id=\"81a5d3d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>9. 
Accuracy<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-e17b49f elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"26115\" data-id=\"e17b49f\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-fb982ca\" data-eae-slider=\"94073\" data-id=\"fb982ca\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-7f32abc elementor-widget elementor-widget-text-editor\" data-id=\"7f32abc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt measures how many observations, both positive and negative, were correctly classified.\n<p style=\"text-align: center;\">You <strong>shouldn\u2019t use accuracy on imbalanced problems<\/strong>.\u00a0Then, it is easy to get a high accuracy score by\u00a0simply\u00a0classifying all observations as the majority class. 
For example in our case, by classifying all transactions as non-fraudulent we can get an accuracy of over 0.9.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1254e88 elementor-widget elementor-widget-text-editor\" data-id=\"1254e88\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import confusion_matrix, accuracy_score\n\ny_pred_class = y_pred_pos &gt; threshold\ntn, fp, fn, tp = confusion_matrix(y_true, y_pred_class).ravel()\naccuracy = (tp + tn) \/ (tp + fp + fn + tn)\n\n# or simply\n\naccuracy_score(y_true, y_pred_class)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bf8366e elementor-widget elementor-widget-text-editor\" data-id=\"bf8366e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How models score in this metric (threshold=0.5):<\/strong>\n\nWe can see that for all the models we beat the dummy model (all clean transactions) by a large margin. 
Also, the models that we\u2019d expect to be better are in fact at the top.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-66b36b9 elementor-widget elementor-widget-text-editor\" data-id=\"66b36b9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3e34d9c elementor-widget elementor-widget-text-editor\" data-id=\"3e34d9c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWith accuracy, you can really use charts like the one above to determine the optimal threshold. In this case, choosing something a bit over the standard 0.5 could bump the score by a tiny bit (0.9686 -&gt; 0.9688).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d9dfcb0 elementor-widget elementor-widget-text-editor\" data-id=\"d9dfcb0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>When your problem is balanced, using accuracy is usually a good start. 
An additional benefit is that it is really easy to explain it to non-technical stakeholders in your project,<\/li>\n \t<li>When every class is equally important to you.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-af81769 elementor-widget elementor-widget-heading\" data-id=\"af81769\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>10. F beta score<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c80e4f7 elementor-widget elementor-widget-text-editor\" data-id=\"c80e4f7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSimply put, it combines precision and recall into one metric. The higher the score the better our model is. You can calculate it in the following way:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-983e9bf elementor-widget elementor-widget-text-editor\" data-id=\"983e9bf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhen choosing beta in your F-beta score\u00a0<strong>the more you care about recall<\/strong>\u00a0over precision\u00a0<strong>the higher beta<\/strong>\u00a0you should choose. 
For example, with F1 score we care equally about recall and precision, while with F2 score recall is twice as important to us.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-010d58a elementor-widget elementor-widget-text-editor\" data-id=\"010d58a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWith 0&lt;beta&lt;1 we care more about precision, so higher thresholds tend to give a higher F beta score. When beta&gt;1 our optimal threshold moves toward lower thresholds and with beta=1 it is somewhere in the middle.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9e6da00 elementor-widget elementor-widget-text-editor\" data-id=\"9e6da00\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import fbeta_score\n\ny_pred_class = y_pred_pos &gt; threshold\n# beta must be passed by keyword in recent scikit-learn releases\nfbeta_score(y_true, y_pred_class, beta=beta)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8800862 elementor-widget elementor-widget-heading\" data-id=\"8800862\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>11. 
F1 score (beta=1)<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-90736a7 elementor-widget elementor-widget-text-editor\" data-id=\"90736a7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt\u2019s the harmonic mean between precision and recall.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8d51afe elementor-widget elementor-widget-text-editor\" data-id=\"8d51afe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How models score in this metric (threshold=0.5):<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0c88327 elementor-widget elementor-widget-text-editor\" data-id=\"0c88327\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAs we can see combining precision and recall gave us a more realistic view of our models. 
We get 0.808 for the best one, which leaves a lot of room for improvement.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bc79427 elementor-widget elementor-widget-text-editor\" data-id=\"bc79427\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nWhat is good is that it seems to rank our models correctly, with the larger lightGBMs at the top.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b6ab7a0 elementor-widget elementor-widget-text-editor\" data-id=\"b6ab7a0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nWe can\u00a0<strong>adjust the threshold to optimize F1 score<\/strong>. Notice that for both precision and recall you could get perfect scores by increasing or decreasing the threshold. The good thing is that\u00a0<strong>you can find a sweet spot<\/strong>\u00a0for the F1 metric. As you can see, getting the threshold just right can actually improve your score a bit (0.8077 -&gt; 0.8121).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f3d453f elementor-widget elementor-widget-text-editor\" data-id=\"f3d453f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>Pretty much in every binary classification problem. It is my go-to metric when working on those problems. 
It can be easily explained to business stakeholders.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3536a2e elementor-widget elementor-widget-heading\" data-id=\"3536a2e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>12. F2 score (beta=2)<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a2cf5ac elementor-widget elementor-widget-text-editor\" data-id=\"a2cf5ac\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt\u2019s a metric that combines precision and recall, putting\u00a0<strong>2x emphasis on recall<\/strong>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ccbd86b elementor-widget elementor-widget-text-editor\" data-id=\"ccbd86b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How models score in this metric (threshold=0.5):<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d80ffb6 elementor-widget elementor-widget-text-editor\" data-id=\"d80ffb6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThis score is even lower for all the models than F1 but can be increased by adjusting the threshold considerably.\nAgain, it seems to be ranking our models correctly, at least in this simple example.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-93f3337 elementor-widget 
elementor-widget-text-editor\" data-id=\"93f3337\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8cd9840 elementor-widget elementor-widget-text-editor\" data-id=\"8cd9840\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can see that with a lower threshold, and therefore more true positives recalled, we get a higher score. You can usually<strong>\u00a0find a sweet spot<\/strong>\u00a0for the threshold. The possible gain from 0.755 to 0.803 shows how\u00a0<strong>important<\/strong>\u00a0threshold adjustments can be here.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b220b35 elementor-widget elementor-widget-text-editor\" data-id=\"b220b35\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>I\u2019d consider using it when recalling positive observations (fraudulent transactions) is more important than being precise about it.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e0678bb elementor-widget elementor-widget-heading\" data-id=\"e0678bb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>13. 
Cohen Kappa Metric<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b05d57 elementor-widget elementor-widget-text-editor\" data-id=\"1b05d57\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn simple words, Cohen Kappa tells you how much better your model is than a random classifier that predicts based on class frequencies.\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/hackernoon.com\/photos\/L4OCiGu7n6cUBUyJKe9Rs0MX1d73-xi4l632ca\" alt=\"\" \/><\/p>\nTo calculate it one needs to calculate two things:\u00a0<strong>\u201cobserved agreement\u201d (po)<\/strong>\u00a0and\u00a0<strong>\u201cexpected agreement\u201d (pe)<\/strong>. Observed agreement (po) is simply how our classifier predictions agree with the ground truth, which means it is just accuracy. The expected agreement (pe) is how the predictions of the\u00a0<strong>random classifier that samples according to class frequencies<\/strong>\u00a0agree with the ground truth, or accuracy of the random classifier. 
Kappa combines the two as (po - pe) \/ (1 - pe), so 0 means no better than that random classifier and 1 means perfect agreement.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-28ac6e6 elementor-widget elementor-widget-text-editor\" data-id=\"28ac6e6\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nFrom an interpretation standpoint, I like that it extends something very easy to explain (accuracy) to situations where your dataset is imbalanced by incorporating a baseline (dummy) classifier.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ace6d93 elementor-widget elementor-widget-text-editor\" data-id=\"ace6d93\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import cohen_kappa_score\n\n# y_pred_pos: predicted positive-class scores; threshold: decision cutoff (e.g. 0.5)\ny_pred_class = y_pred_pos &gt; threshold\ncohen_kappa_score(y_true, y_pred_class)<\/code><\/pre>\n<strong>How models score in this metric (threshold=0.5):<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-75f555e elementor-widget elementor-widget-text-editor\" data-id=\"75f555e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nWe can easily distinguish the worst\/best models based on this metric. Also, we can see that there is still a lot of room to improve our best model.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2a2883e elementor-widget elementor-widget-text-editor\" data-id=\"2a2883e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How does it depend on the threshold:<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-322852d elementor-widget elementor-widget-text-editor\" data-id=\"322852d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWith a chart just like the one above, we can find a threshold that optimizes Cohen Kappa. 
In this case, it is at 0.31, giving us a small improvement (0.7909 -&gt; 0.7947) over the standard 0.5.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c8e653a elementor-widget elementor-widget-text-editor\" data-id=\"c8e653a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>This metric is not used heavily in the context of classification. Yet it can work really well for imbalanced problems and seems like a great companion\/alternative to accuracy.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e66b583 elementor-widget elementor-widget-heading\" data-id=\"e66b583\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>14. Matthews Correlation Coefficient (MCC)<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-74d2cc1 elementor-widget elementor-widget-text-editor\" data-id=\"74d2cc1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt\u2019s a correlation between predicted classes and ground truth. 
It can be calculated based on values from the confusion matrix:\n<p style=\"text-align: center;\">MCC = (TP * TN - FP * FN) \/ sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f5d1f66 elementor-widget elementor-widget-text-editor\" data-id=\"f5d1f66\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAlternatively, you could also calculate the Pearson correlation between y_true and the thresholded class predictions (y_pred_class).\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e3a36ec elementor-widget elementor-widget-text-editor\" data-id=\"e3a36ec\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import matthews_corrcoef\n\ny_pred_class = y_pred_pos &gt; threshold\nmatthews_corrcoef(y_true, y_pred_class)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e47dcf8 elementor-widget elementor-widget-text-editor\" data-id=\"e47dcf8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can clearly see improvements in our model quality and a lot of room to grow, which I really like. Also, it ranks our models reasonably and puts models that you\u2019d expect to be better on top. 
Of course, MCC depends on the threshold that we choose.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bab38b8 elementor-widget elementor-widget-text-editor\" data-id=\"bab38b8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How does it depend on the threshold:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a0c5ddf elementor-widget elementor-widget-text-editor\" data-id=\"a0c5ddf\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can adjust the threshold to optimize MCC. In our case, the best score is at 0.53 but what I really like is that it is not super sensitive to threshold changes.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bc158d4 elementor-widget elementor-widget-text-editor\" data-id=\"bc158d4\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>When working on imbalanced problems.<\/li>\n \t<li>When you want to have something easily interpretable.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0454b23 elementor-widget elementor-widget-heading\" data-id=\"0454b23\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>15. 
ROC Curve<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec0b08d elementor-widget elementor-widget-text-editor\" data-id=\"ec0b08d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt is a chart that visualizes the tradeoff between true positive rate (TPR) and false positive rate (FPR). Basically, for every threshold, we calculate TPR and FPR and plot them on one chart.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1ee9f01 elementor-widget elementor-widget-text-editor\" data-id=\"1ee9f01\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tOf course, the higher the TPR and the lower the FPR at each threshold, the better, so classifiers whose curves sit closer to the top-left corner are better.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f26de80 elementor-widget elementor-widget-text-editor\" data-id=\"f26de80\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tExtensive discussion of ROC Curve and ROC AUC score can be found in this\u00a0<a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.10.9777&amp;rep=rep1&amp;type=pdf\" rel=\"noopener\">article by Tom Fawcett<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ff860e2 elementor-widget elementor-widget-text-editor\" data-id=\"ff860e2\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>from 
scikitplot.metrics import plot_roc\nimport matplotlib.pyplot as plt\n\nfig, ax = plt.subplots()\nplot_roc(y_true, y_pred, ax=ax)<\/code><\/pre>\n<strong>How does it look:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c0be939 elementor-widget elementor-widget-text-editor\" data-id=\"c0be939\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can see a healthy ROC curve, pushed towards the top-left side for both the positive and the negative class. It is not clear which one performs better across the board: for FPR &lt; ~0.15 the positive class curve is higher, and from FPR ~0.15 onwards the negative class curve is above it.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8fdc076 elementor-widget elementor-widget-heading\" data-id=\"8fdc076\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>16. ROC AUC score<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b767dff elementor-widget elementor-widget-text-editor\" data-id=\"b767dff\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn order to get one number that tells us how good our curve is, we can calculate the Area Under the ROC Curve, or ROC AUC score. 
The closer your curve is to the top-left corner, the larger the area and hence the higher the ROC AUC score.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7c2aee8 elementor-widget elementor-widget-text-editor\" data-id=\"7c2aee8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nAlternatively,<a href=\"https:\/\/en.wikipedia.org\/wiki\/Mann%E2%80%93Whitney_U_test#Area-under-curve_(AUC)_statistic_for_ROC_curves\" rel=\"noopener\">\u00a0it can be shown<\/a>\u00a0that the ROC AUC score is directly related to the Mann-Whitney U statistic, a rank-based measure of how well the predictions separate the targets. From an interpretation standpoint, it is more useful because it tells us that this metric shows\u00a0<strong>how good at ranking predictions your model is<\/strong>. It tells you the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2a1f7fc elementor-widget elementor-widget-text-editor\" data-id=\"2a1f7fc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import roc_auc_score\n\nroc_auc = roc_auc_score(y_true, y_pred_pos)<\/code><\/pre>\n<strong>How models score in this metric:<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4914823 elementor-widget elementor-widget-text-editor\" data-id=\"4914823\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can see improvements and the models that one would guess to be better are indeed scoring higher. 
Also, the score is independent of the threshold which comes in handy.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ae32559 elementor-widget elementor-widget-text-editor\" data-id=\"ae32559\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>You\u00a0<strong>should use it<\/strong>\u00a0when you ultimately\u00a0<strong>care about ranking predictions<\/strong>\u00a0and not necessarily about outputting well-calibrated probabilities (read this\u00a0<a href=\"https:\/\/machinelearningmastery.com\/calibrated-classification-model-in-scikit-learn\/\" rel=\"noopener\">article by Jason Brownlee<\/a>\u00a0if you want to learn about probability calibration).<\/li>\n \t<li>You\u00a0<strong>should not use it<\/strong>\u00a0when your\u00a0<strong>data is heavily imbalanced<\/strong>. It was discussed extensively in this\u00a0<a href=\"https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC4349800\/\" rel=\"noopener\">article by Takaya Saito and Marc Rehmsmeier<\/a>. The intuition is the following: false positive rate for highly imbalanced datasets is pulled down due to a large number of true negatives.<\/li>\n \t<li>You\u00a0<strong>should use it when you care equally about positive and negative classes<\/strong>. It naturally extends the imbalanced data discussion from the last section. 
If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e759080 elementor-widget elementor-widget-heading\" data-id=\"e759080\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>17. Precision-Recall Curve<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6850e84 elementor-widget elementor-widget-text-editor\" data-id=\"6850e84\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt is a curve that combines precision (PPV) and recall (TPR) in a single visualization. For every threshold, you calculate PPV and TPR and plot them. The higher your curve sits on the y-axis, the better your model performance.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8566160 elementor-widget elementor-widget-text-editor\" data-id=\"8566160\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tYou can use this plot to make an educated decision when it comes to the classic precision\/recall dilemma. Typically, the higher the recall, the lower the precision. 
Knowing\u00a0<strong>at which recall your precision starts to fall fast<\/strong>\u00a0can help you choose the threshold and deliver a better model.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-dc38cdf elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"44489\" data-id=\"dc38cdf\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-10c7c9e\" data-eae-slider=\"4734\" data-id=\"10c7c9e\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-a6e0ec1 elementor-widget elementor-widget-text-editor\" data-id=\"a6e0ec1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>from scikitplot.metrics import plot_precision_recall\nimport matplotlib.pyplot as plt\n\nfig, ax = plt.subplots()\nplot_precision_recall(y_true, y_pred, ax=ax)<\/code><\/pre>\n<strong>How does it look:<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-906f1c5 elementor-widget elementor-widget-text-editor\" data-id=\"906f1c5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nWe can see that for the negative class we maintain high precision and high recall almost throughout the entire range of thresholds. 
For the positive class precision is starting to fall as soon as we are recalling 0.2 of true positives and by the time we hit 0.8, it decreases to around 0.7.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ef9fe2b elementor-widget elementor-widget-heading\" data-id=\"ef9fe2b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>18. PR AUC score | Average precision<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-22c4b62 elementor-widget elementor-widget-text-editor\" data-id=\"22c4b62\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSimilarly to ROC AUC score you can calculate the Area Under the Precision-Recall Curve to get one number that describes model performance.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d3b1869 elementor-widget elementor-widget-text-editor\" data-id=\"d3b1869\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tYou can also think about PR AUC as the average of precision scores calculated for each recall threshold [0.0, 1.0]. 
You can also adjust this definition to suit your business needs by choosing\/clipping recall thresholds if needed.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fb19bc3 elementor-widget elementor-widget-text-editor\" data-id=\"fb19bc3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import average_precision_score\n\naverage_precision_score(y_true, y_pred_pos)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-514d02d elementor-widget elementor-widget-heading\" data-id=\"514d02d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>How models score in this metric:<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4f860fd elementor-widget elementor-widget-text-editor\" data-id=\"4f860fd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe models that we suspect to be \u201ctruly\u201d better are in fact better in this metric, which is definitely a good thing. Overall, we can see high scores, but they are far less optimistic than the ROC AUC scores (0.96+).\n\n<strong>When to use it:<\/strong>\n<ul>\n \t<li>when you want to\u00a0<strong>communicate precision\/recall decision<\/strong>\u00a0to other stakeholders<\/li>\n \t<li>when you want to<strong>\u00a0choose the threshold that fits the business problem<\/strong>.<\/li>\n \t<li>when your data is\u00a0<strong>heavily imbalanced<\/strong>. 
As mentioned before, it was discussed extensively in this\u00a0<a href=\"https:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC4349800\/\" rel=\"noopener\">article by Takaya Saito and Marc Rehmsmeier<\/a>. The intuition is the following: since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.<\/li>\n \t<li>when\u00a0<strong>you care more about positive than negative class<\/strong>. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision).<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d0668f9 elementor-widget elementor-widget-heading\" data-id=\"d0668f9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>19. Log loss<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9d47740 elementor-widget elementor-widget-text-editor\" data-id=\"9d47740\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tLog loss is often used as the objective function that is optimized under the hood of machine learning models. Yet, it can also be used as a performance metric.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1ce26bd elementor-widget elementor-widget-text-editor\" data-id=\"1ce26bd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThe more certain our model is that an observation is positive when it is, in fact, positive the lower the error. But this is not a linear relationship. 
It is good to take a look at how the error changes as that difference increases:\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2a13196 elementor-widget elementor-widget-text-editor\" data-id=\"2a13196\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSo our model gets punished very heavily when we are certain about something that is untrue. For example, when we give a score of 0.9999 to an observation that is negative our loss jumps through the roof. That is why sometimes it makes sense to clip your predictions to decrease the risk of that happening.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8460d16 elementor-widget elementor-widget-text-editor\" data-id=\"8460d16\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIf you want to learn more about log-loss read this\u00a0<a href=\"https:\/\/towardsdatascience.com\/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a\" rel=\"noopener\">article by Daniel Godoy<\/a>.\n\n<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import log_loss\n\nlog_loss(y_true, y_pred)<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-435c71d elementor-widget elementor-widget-heading\" data-id=\"435c71d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><strong>How models score in this metric:<\/strong><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-48b5571 elementor-widget 
elementor-widget-text-editor\" data-id=\"48b5571\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt is difficult to really see strong improvement and get an intuitive feeling for how strong the model is. Also, the model that was chosen as the best one before (BIN-101) is in the middle of the pack. That can suggest that using log-loss as a performance metric can be a risky proposition.\n\n<strong>When to use it:<\/strong>\n<ul>\n \t<li>Pretty much\u00a0<strong>always there is a<\/strong>\u00a0performance\u00a0<strong>metric that better matches your<\/strong>\u00a0business\u00a0<strong>problem.\u00a0<\/strong>\u00a0Because of that, I would use log-loss as an objective for your model with some other metric to evaluate performance.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e7407bd elementor-widget elementor-widget-heading\" data-id=\"e7407bd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>20. Brier score<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5f4b39e elementor-widget elementor-widget-text-editor\" data-id=\"5f4b39e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt is a measure of how far your predictions lie from the true values. 
For one observation it simply reads:\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/hackernoon.com\/photos\/L4OCiGu7n6cUBUyJKe9Rs0MX1d73-lx11t3236\" alt=\"\" \/><\/p>\nBasically, it is a mean square error in the probability space and because of that, it is usually used to calibrate probabilities of the machine learning models. If you want to read more about probability calibration I recommend that you read this\u00a0<a href=\"https:\/\/machinelearningmastery.com\/calibrated-classification-model-in-scikit-learn\/\" rel=\"noopener\">article by Jason Brownlee<\/a>.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7221e99 elementor-widget elementor-widget-text-editor\" data-id=\"7221e99\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt can be a great supplement to your ROC AUC score and other metrics that focus on other things.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fbb038c elementor-widget elementor-widget-text-editor\" data-id=\"fbb038c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>from sklearn.metrics import brier_score_loss\n\nbrier_score_loss(y_true, y_pred_pos)<\/code><\/pre>\n<strong>How models score in this metric:<\/strong>\n\nModel from the\u00a0<a href=\"https:\/\/ui.neptune.ml\/o\/neptune-ml\/org\/binary-classification-metrics\/e\/BIN-101?utm_source=hackernoon&amp;utm_medium=crosspost&amp;utm_campaign=blog-evaluation-metrics-binary-classification&amp;utm_content=explore-experiment\" rel=\"noopener\">experiment BIN-101<\/a>\u00a0has the best calibration and for that model, on average our predictions were off by 0.16 
(\u221a0.0263309).\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3f1ab75 elementor-widget elementor-widget-text-editor\" data-id=\"3f1ab75\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>When you\u00a0<strong>care about calibrated probabilities.<\/strong><\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4fcde4f elementor-widget elementor-widget-heading\" data-id=\"4fcde4f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>21. Cumulative gains chart<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2879ceb elementor-widget elementor-widget-text-editor\" data-id=\"2879ceb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn simple words, it helps you gauge how much you gain by using your model over a random model for a given fraction of top scored predictions.\n\nSimply put:\n<ul>\n \t<li>you order your predictions from highest to lowest,<\/li>\n \t<li>for every percentile, you calculate the fraction of true positive observations up to that percentile.<\/li>\n<\/ul>\nIt makes it easy to see the benefits of using your model to target given groups of users\/accounts\/transactions especially if you really care about sorting them.\n\n<strong>How to compute:<\/strong>\n<pre><code>from scikitplot.metrics import plot_cumulative_gain\nimport matplotlib.pyplot as plt\n\nfig, ax = plt.subplots()\nplot_cumulative_gain(y_true, y_pred, ax=ax)<\/code><\/pre>\n<strong>How does it 
look:<\/strong>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8131c6a elementor-widget elementor-widget-text-editor\" data-id=\"8131c6a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe can see that our cumulative gains chart shoots up very quickly as we increase the sample of highest-scored predictions. By the time we get to the 20th percentile over 90% of positive cases are covered. You could use this chart to prioritize and filter out possible fraudulent transactions for processing.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2acc4d7 elementor-widget elementor-widget-text-editor\" data-id=\"2acc4d7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSay we were to use our model to assign possible fraudulent transactions for processing and we needed to\u00a0prioritize. 
We could use this chart to tell us where it makes the most sense to choose a cutoff.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ed2d628 elementor-widget elementor-widget-text-editor\" data-id=\"ed2d628\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.<\/li>\n \t<li>It can be a good addition to ROC AUC score which measures ranking\/sorting performance of your model.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-40d3ad0 elementor-widget elementor-widget-heading\" data-id=\"40d3ad0\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>22. 
Lift curve | lift chart<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-906104c elementor-widget elementor-widget-text-editor\" data-id=\"906104c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIt is pretty much just a different representation of the cumulative gains chart:\n<ul>\n \t<li>we order the predictions from highest to lowest,<\/li>\n \t<li>for every percentile, we calculate the fraction of true positive observations up to that percentile for our model and for the random model,<\/li>\n \t<li>we calculate the ratio of those fractions and plot it.<\/li>\n<\/ul>\nIt tells you how much better your model is than a random model for the given percentile of top scored predictions.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-d752841 elementor-widget elementor-widget-text-editor\" data-id=\"d752841\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>import matplotlib.pyplot as plt\nfrom scikitplot.metrics import plot_lift_curve\n\nfig, ax = plt.subplots()\nplot_lift_curve(y_true, y_pred, ax=ax)<\/code><\/pre>\n<strong>How does it look:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-163c80a elementor-widget elementor-widget-text-editor\" data-id=\"163c80a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSo for the top 10% of predictions, our model is over 10x better than random; for the top 20% it is over 4x better, and so on.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-dd2c3a3 elementor-widget 
elementor-widget-text-editor\" data-id=\"dd2c3a3\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>When to use it:<\/strong>\n<ul>\n \t<li>Whenever you want to select the most promising customers or transactions to target and you want to use your model for sorting.<\/li>\n \t<li>It can be a good addition to ROC AUC score which measures ranking\/sorting performance of your model.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b9486ce elementor-widget elementor-widget-heading\" data-id=\"b9486ce\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>23. Kolmogorov-Smirnov plot<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-39e3f4b elementor-widget elementor-widget-text-editor\" data-id=\"39e3f4b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tKS plot helps to assess the separation between prediction distributions for positive and negative classes.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9173842 elementor-widget elementor-widget-text-editor\" data-id=\"9173842\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn order to create it you:\n<ul>\n \t<li>sort your observations by the prediction score,<\/li>\n \t<li>for every cutoff point [0.0, 1.0] of the sorted dataset (depth) calculate the proportion of true positives and true negatives in this depth,<\/li>\n \t<li>plot those 
fractions, positive(depth)\/positive(all), negative(depth)\/negative(all), on Y-axis and dataset depth on X-axis.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2093a51 elementor-widget elementor-widget-text-editor\" data-id=\"2093a51\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSo it works similarly to the cumulative gains chart, but instead of looking only at the positive class it looks at the separation between the positive and negative classes.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9177224 elementor-widget elementor-widget-text-editor\" data-id=\"9177224\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nA good explanation of the KS plot and the KS statistic can be found in this\u00a0<a href=\"http:\/\/rstudio-pubs-static.s3.amazonaws.com\/303414_fb0a43efb0d7433983fdc9adcf87317f.html\" rel=\"noopener\">article by Riaz Khan<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8237806 elementor-widget elementor-widget-text-editor\" data-id=\"8237806\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>import matplotlib.pyplot as plt\nfrom scikitplot.metrics import plot_ks_statistic\n\nfig, ax = plt.subplots()\nplot_ks_statistic(y_true, y_pred, ax=ax)<\/code><\/pre>\n<strong>How does it look:<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-390bb65 elementor-widget elementor-widget-text-editor\" data-id=\"390bb65\" data-element_type=\"widget\" data-e-type=\"widget\" 
data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nSo we can see that the largest difference is at a cutoff point of 0.034 of top predictions. After that threshold, it decreases at a moderate rate as we increase the percentage of top predictions. Around 0.8 it starts getting worse very quickly. So even though the best separation is at 0.034, we could potentially push the cutoff a bit higher to get more positively classified observations.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6e1c77a elementor-widget elementor-widget-heading\" data-id=\"6e1c77a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>24. Kolmogorov-Smirnov statistic<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-33843ec elementor-widget elementor-widget-text-editor\" data-id=\"33843ec\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIf we want to take the KS plot and get one number that we can use as a metric, we can look at all thresholds (dataset cutoffs) from the KS plot and find the one for which the distance (separation) between the distributions of true positive and true negative observations is the highest.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-413b403 elementor-widget elementor-widget-text-editor\" data-id=\"413b403\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIf there is a threshold for which all observations above are truly positive and all observations below are truly 
negative, we get a perfect KS statistic of 1.0.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c1c8336 elementor-widget elementor-widget-text-editor\" data-id=\"c1c8336\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>How to compute:<\/strong>\n<pre><code>from scikitplot.helpers import binary_ks_curve\n\nres = binary_ks_curve(y_true, y_pred_pos)\nks_stat = res[3]<\/code><\/pre>\n<strong>How models score in this metric:<\/strong>\n\nBy using the KS statistic as the metric, we were able to rank BIN-101 as the best model, which is the one we expect to be the \u201ctruly\u201d best model.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5410e3a elementor-widget elementor-widget-text-editor\" data-id=\"5410e3a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\n<strong>When to use it:<\/strong>\n<ul>\n \t<li>When your problem is about sorting\/prioritizing the most relevant observations and you care equally about positive and negative classes.<\/li>\n \t<li>It can be a good addition to ROC AUC score which measures ranking\/sorting performance of your model.<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4b0aefb elementor-widget elementor-widget-heading\" data-id=\"4b0aefb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2>Final Thoughts<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-40bd622 elementor-widget elementor-widget-text-editor\" data-id=\"40bd622\" 
data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn this blog post, you\u2019ve learned about various classification metrics and performance charts.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-55735d1 elementor-widget elementor-widget-text-editor\" data-id=\"55735d1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWe went over metric definitions and interpretations, learned how to calculate them, and talked about when to use them.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5d87ca9 elementor-widget elementor-widget-text-editor\" data-id=\"5d87ca9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nHopefully, with all that knowledge you will be fully equipped to deal with metric-related problems in your future projects.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-05b01e1 elementor-widget elementor-widget-heading\" data-id=\"05b01e1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2><strong>Bonus:<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b04cadd elementor-widget elementor-widget-text-editor\" data-id=\"b04cadd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tTo help you use the information from this blog post to 
the fullest, I have prepared:\n<ul>\n \t<li><em><strong>logging helper function<\/strong><\/em>\u00a0that calculates and logs all the metrics, performance charts, and metric by threshold charts for <a href=\"https:\/\/neptune.ml\/blog\/evaluation-metrics-binary-classification\/#26\" rel=\"noopener\">binary classification<\/a><\/li>\n \t<li><em><strong>metrics cheatsheet\u00a0<\/strong><\/em>with everything I talked about digested into a few pages.<\/li>\n<\/ul>\nCheck those out below!\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b9eea21 elementor-widget elementor-widget-text-editor\" data-id=\"b9eea21\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Logging helper function<\/strong>\n\nYou can<strong>\u00a0log all<\/strong>\u00a0of those\u00a0<strong>metrics<\/strong>\u00a0<strong>and<\/strong>\u00a0performance\u00a0<strong>charts<\/strong>\u00a0that we covered for your machine learning project\u00a0<strong>with just one function call<\/strong>\u00a0and explore them in Neptune.\n<ul>\n \t<li>install the package:<\/li>\n<\/ul>\n<pre><code>pip install neptune-contrib[all]\n<\/code><\/pre>\n<ul>\n \t<li>import and run:<\/li>\n<\/ul>\n<pre><code>import neptunecontrib.monitoring.metrics as npt_metrics\n\nnpt_metrics.log_binary_classification_metrics(y_true, y_pred)<\/code><\/pre>\n<ul>\n \t<li>explore everything in the app:<\/li>\n<\/ul>\nYou do know that as an individual you can\u00a0<strong>track experiment runs with Neptune for free,<\/strong>\u00a0right?\n\n<a href=\"https:\/\/neptune.ml\/register?utm_source=hackernoon&amp;utm_medium=crosspost&amp;utm_campaign=blog-evaluation-metrics-binary-classification&amp;utm_content=register\" rel=\"noopener\">Sign up for a free account<\/a>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6657ccd elementor-widget 
elementor-widget-text-editor\" data-id=\"6657ccd\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<strong>Binary classification metrics cheatsheet<\/strong>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bc5a74c elementor-widget elementor-widget-text-editor\" data-id=\"bc5a74c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\nWe\u2019ve created a nice cheatsheet for you which takes all the content I went over in this blog post and puts it on a few-page, digestible document which you can print and use whenever you need anything binary classification metrics related.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2218a5f elementor-widget elementor-widget-text-editor\" data-id=\"2218a5f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<a href=\"https:\/\/github.com\/neptune-ml\/blog-binary-classification-metrics\/blob\/master\/binary_classification_metrics_cheathsheet.pdf\" rel=\"noopener\">Go here to download the .pdf<\/a>\n\n<strong>Example script<\/strong>\n<pre><code>import lightgbm\nimport matplotlib.pyplot as plt\nimport neptune\nfrom neptunecontrib.monitoring.utils import pickle_and_send_artifact\nfrom neptunecontrib.monitoring.metrics import log_binary_classification_metrics\nfrom neptunecontrib.versioning.data import log_data_version\nimport pandas as pd\n\nplt.rcParams.update({'font.size': 18})\nplt.rcParams.update({'figure.figsize': [16, 12]})\nplt.style.use('seaborn-whitegrid')\n\n# Define parameters\nPROJECT_NAME = 'neptune-ml\/binary-classification-metrics'\n\nTRAIN_PATH = 'data\/train.csv'\nTEST_PATH = 
'data\/test.csv'\nNROWS = None\n\nMODEL_PARAMS = {'random_state': 1234,\n'learning_rate': 0.1,\n'n_estimators': 1500}\n\n# Load data\ntrain = pd.read_csv(TRAIN_PATH, nrows=NROWS)\ntest = pd.read_csv(TEST_PATH, nrows=NROWS)\n\nfeature_names = [col for col in train.columns if col not in ['isFraud']]\n\nX_train, y_train = train[feature_names], train['isFraud']\nX_test, y_test = test[feature_names], test['isFraud']\n\n# Start experiment\nneptune.init(PROJECT_NAME)\nneptune.create_experiment(name='lightGBM training',\nparams=MODEL_PARAMS,\nupload_source_files=['train.py', 'environment.yaml'])\nlog_data_version(TRAIN_PATH, prefix='train_')\nlog_data_version(TEST_PATH, prefix='test_')\n# Train model\nmodel = lightgbm.LGBMClassifier(**MODEL_PARAMS)\nmodel.fit(X_train, y_train)\n\n# Evaluate model\ny_test_pred = model.predict_proba(X_test)\n\nlog_binary_classification_metrics(y_test, y_test_pred)\npickle_and_send_artifact((y_test, y_test_pred), 'test_predictions.pkl')\n\nneptune.stop()<\/code><\/pre>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Not sure which evaluation metric you should choose for your binary classification problem? You want to know for each metric, the&nbsp;definition&nbsp;and&nbsp;intuition&nbsp;behind it, the&nbsp;non-technical explanation&nbsp;that you can communicate to business stakeholders, how to calculate or plot it, and when&nbsp;you should&nbsp;use it. You will learn about a bunch of common and lesser-known evaluation metrics and charts to&nbsp;understand how to choose&nbsp;the model performance&nbsp;metric for your problem. 
After reading this blog post you should have a good idea.<\/p>\n","protected":false},"author":712,"featured_media":3418,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[187],"tags":[94],"ppma_author":[3528],"class_list":["post-2208","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-cloud","tag-data-science"],"authors":[{"term_id":3528,"user_id":712,"is_guest":0,"slug":"jakub-czakon","display_name":"Jakub Czakon","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Czakon","first_name":"Jakub","job_title":"","description":"Jakub Czakon is Senior Data Scientist at neptune.ai."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2208","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/712"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2208"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2208\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3418"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2208"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2208"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2208"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2208"}],"curies":[{"name":"wp",
"href":"https:\/\/api.w.org\/{rel}","templated":true}]}}