{"id":2290,"date":"2020-02-28T04:30:17","date_gmt":"2020-02-28T01:30:17","guid":{"rendered":"http:\/\/kusuaks7\/?p=1895"},"modified":"2024-01-02T15:38:07","modified_gmt":"2024-01-02T15:38:07","slug":"the-machine-learning-crisis-in-scientific-research","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/the-machine-learning-crisis-in-scientific-research\/","title":{"rendered":"The Machine Learning Crisis in Scientific Research"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2290\" class=\"elementor elementor-2290\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-2b2c902d elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-id=\"2b2c902d\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-66c73447\" data-id=\"66c73447\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-5557da1 elementor-widget elementor-widget-heading\" data-id=\"5557da1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">Is an experiment still scientific if it is not reproducible?<\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5a82a62 elementor-widget elementor-widget-text-editor\" data-id=\"5a82a62\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div 
class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote>\n<p id=\"6b40\" data-selectable-paragraph=\"\">\u201cThere is general recognition of a reproducibility crisis in science right now. I would venture to argue that a huge part of that does come from the use of machine learning techniques in science.\u201d\u00a0<strong><em>\u2014 Genevera Allen, Professor of Statistics and Electrical Engineering at Rice University<\/em><\/strong><\/p>\n<\/blockquote>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4e16f88 elementor-widget elementor-widget-text-editor\" data-id=\"4e16f88\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"27af\" data-selectable-paragraph=\"\">The use of machine learning is becoming increasingly prevalent in the scientific process, replacing traditional statistical methods. What are the ramifications of this on the scientific community and the pursuit of knowledge? Some have argued that the black-box approach of machine learning techniques is responsible for a crisis of reproducibility in scientific research. After all, is something really scientific if it is not reproducible?<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-12a36a5 elementor-widget elementor-widget-text-editor\" data-id=\"12a36a5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"bb0e\" data-selectable-paragraph=\"\"><strong>Disclaimer:<\/strong>\u00a0This article is my own opinion based on the material referred to in the references. 
This is a contentious area in academia, and constructive debate is welcome.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9fde3e7 elementor-widget elementor-widget-image\" data-id=\"9fde3e7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/468\/0*mlK43nZvsaYuMMq_.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1952055 elementor-widget elementor-widget-text-editor\" data-id=\"1952055\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p style=\"text-align: center;\" data-selectable-paragraph=\"\"><span style=\"background-color: rgba(0, 0, 0, 0.05);\">The cycle of the scientific process.<\/span><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-95a2771 elementor-widget elementor-widget-text-editor\" data-id=\"95a2771\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8bd5\" data-selectable-paragraph=\"\">Machine learning (ML) has become ubiquitous in scientific research, and in many places has replaced the use of traditional statistical techniques. 
Whilst ML techniques often make analysis simpler to perform, the inherent black-box approach causes severe problems in the pursuit of truth.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a4a48c5 elementor-widget elementor-widget-text-editor\" data-id=\"a4a48c5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"6bd9\" data-selectable-paragraph=\"\">The\u00a0<strong>\u201creproducibility crisis\u201d<\/strong>\u00a0in science refers to the alarming number of research results that cannot be reproduced when another group of scientists repeats the same experiment. This can mean that the initial results were wrong. One analysis suggested that up to 85% of all biomedical research carried out in the world is wasted effort.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-44fe22b elementor-widget elementor-widget-text-editor\" data-id=\"44fe22b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"9a14\" data-selectable-paragraph=\"\">The debate over the reproducibility crisis is probably the closest you can come in academia to a war between machine learning and statistics departments.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-41bd787 elementor-widget elementor-widget-text-editor\" data-id=\"41bd787\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"47a9\" data-selectable-paragraph=\"\"><strong>One AI researcher, quoted in a Science article, alleged that machine learning has become a form of \u2018alchemy\u2019.<\/strong>\u00a0<a 
href=\"https:\/\/www.sciencemag.org\/news\/2018\/05\/ai-researchers-allege-machine-learning-alchemy\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">AI researchers allege that machine learning is alchemy (www.sciencemag.org)<\/a><\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f1739fe elementor-widget elementor-widget-text-editor\" data-id=\"f1739fe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"eeb5\" data-selectable-paragraph=\"\">His paper and blog article about this are both worth reading:\u00a0<a href=\"https:\/\/people.eecs.berkeley.edu\/~brecht\/papers\/07.rah.rec.nips.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/people.eecs.berkeley.edu\/~brecht\/papers\/07.rah.rec.nips.pdf<\/a>\u00a0<a href=\"http:\/\/www.argmin.net\/2017\/12\/05\/kitchen-sinks\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">Reflections on Random Kitchen Sinks<\/a><\/p>\n<p id=\"7b9e\" data-selectable-paragraph=\"\">ML nicely supplements the scientific process, making its use in research ultimately inevitable. ML can be considered an engineering task \u2014 like an assembly line with its modeling, parameter tuning, data preparation, and optimization components. 
The intent of ML is to find optimal answers or predictions \u2014 which is a subset of scientific inquiry.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b725020 elementor-widget elementor-widget-text-editor\" data-id=\"b725020\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"55d0\" data-selectable-paragraph=\"\">Machine learning algorithms can themselves be a subject of scientific study. Many research papers are being written about various types and sub-types of ML algorithms, just as they were about statistical methods in the past.<\/p>\n<p id=\"1a8b\" data-selectable-paragraph=\"\">In February 2019, Genevera Allen gave a\u00a0<a href=\"https:\/\/eurekalert.org\/pub_releases\/2019-02\/ru-cwt021119.php\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">grave warning<\/a>\u00a0at the American Association for the Advancement of Science that scientists are leaning on machine learning algorithms to find patterns in data even when the algorithms are just fixating on noise that cannot be reproduced by another experiment.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-1b885cb elementor-widget elementor-widget-text-editor\" data-id=\"1b885cb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"7c46\" data-selectable-paragraph=\"\">This challenge has implications across multiple disciplines, as machine learning is used to make discoveries in many fields such as astronomy, genomics, environmental science, and healthcare.<\/p>\n<p id=\"b134\" data-selectable-paragraph=\"\">The prime example she uses is genomic data, which are typically incredibly large datasets of hundreds of gigabytes or 
several terabytes. Allen states that when a scientist uses poorly understood ML algorithms to cluster genomic profiles, specious and unreproducible results can often arise.<\/p>\n<p id=\"2cd0\" data-selectable-paragraph=\"\"><strong>It is not until another team runs a similar analysis and finds very different results that the results are contested and discredited. This can be for multiple reasons:<\/strong><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5b7865e elementor-widget elementor-widget-text-editor\" data-id=\"5b7865e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<ul>\n \t<li id=\"ba66\" data-selectable-paragraph=\"\">Lack of knowledge about the algorithm<\/li>\n \t<li id=\"5c49\" data-selectable-paragraph=\"\">Lack of knowledge about the data<\/li>\n \t<li id=\"b752\" data-selectable-paragraph=\"\">Misinterpretation of the results<\/li>\n<\/ul>\n<\/section>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7346946 elementor-widget elementor-widget-heading\" data-id=\"7346946\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"bce3\" data-selectable-paragraph=\"\">Lack of algorithmic knowledge<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ec69358 elementor-widget elementor-widget-text-editor\" data-id=\"ec69358\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"ff71\" data-selectable-paragraph=\"\"><strong>Lack of algorithmic knowledge\u00a0<\/strong>is extremely common in machine 
learning. If you do not understand how an algorithm is producing results, then how can you be sure that it is not cheating or finding spurious correlations between variables?<\/p>\n<p id=\"f900\" data-selectable-paragraph=\"\">This is a huge problem in neural networks due to the plethora of parameters (typically millions for deep neural networks). Not only do the parameters count, but also the hyperparameters \u2014 including items such as the learning rate, the initialization strategy, the number of epochs, and the network architecture.<\/p>\n<p id=\"406e\" data-selectable-paragraph=\"\">Realizing that you lack algorithmic knowledge is not enough to solve the problem. How do you compare results if different networks are used across different research papers? Even adding a single extra variable or changing one hyperparameter can significantly influence the results due to the highly complex and dynamic structure of the high-dimensional neural network loss landscape.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2cf9c56 elementor-widget elementor-widget-heading\" data-id=\"2cf9c56\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"a2de\" data-selectable-paragraph=\"\">Lack of data knowledge<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-ddcef83 elementor-widget elementor-widget-text-editor\" data-id=\"ddcef83\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"87c8\" data-selectable-paragraph=\"\"><strong>Lack of data knowledge\u00a0<\/strong>is also a huge issue, but one that extends to traditional statistical techniques. 
Errors in the acquisition of the data \u2014 such as\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Quantization_(signal_processing)\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">quantization<\/a>\u00a0errors,\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Measurement_uncertainty\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">sensor uncertainties<\/a>, and the use of\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Proxy_(statistics)\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">proxy<\/a>\u00a0variables \u2014 are one of the major issues.<\/p>\n<p id=\"dc27\" data-selectable-paragraph=\"\">Suboptimal data will always be a problem, but understanding what algorithm to use with what kind of data is also incredibly important and can have significant implications on results. This can be easily illustrated by examining a simple regression.<\/p>\n<p id=\"2a4e\" data-selectable-paragraph=\"\">If we use linear regression with more parameters than data points (a very normal situation in genomics, where we have many genes and few data points) then our selection of regularization severely impacts what parameters are determined to be \u2018important\u2019.<\/p>\n<p id=\"d382\" data-selectable-paragraph=\"\">If we use a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Lasso_(statistics)\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><em>LASSO<\/em><\/a><em>\u00a0regression<\/em>, this tends to push apparently unimportant variables to be zero, thus eliminating them from the regression and providing some variable selection.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6b61253 elementor-widget elementor-widget-text-editor\" data-id=\"6b61253\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"186b\" data-selectable-paragraph=\"\">If we use a\u00a0<a 
href=\"https:\/\/en.wikipedia.org\/wiki\/Tikhonov_regularization\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><em>ridge<\/em><\/a><em>\u00a0regression<\/em>, the regression tends to shrink these parameters to be small enough that they are negligible but does not necessarily remove them from the model.<\/p>\n<p id=\"71f9\" data-selectable-paragraph=\"\">If we use\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Elastic_net_regularization\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><em>Elastic Net<\/em><\/a><em>\u00a0regression<\/em>\u00a0(a combination of LASSO and ridge regression), we will again get very different answers.<\/p>\n<p id=\"1a4f\" data-selectable-paragraph=\"\">If we do not use any regularization, then the algorithm will obviously overfit the data: as we have more variables than data points, it can trivially fit every data point.<\/p>\n<p id=\"06c4\" data-selectable-paragraph=\"\">Clearly, with linear regression, there are statistical tools that can be used to assess the accuracy, such as confidence intervals,\u00a0<em>p<\/em>-values, etc. However, the same luxuries do not exist for a neural network, so how can we be sure of our conclusions? 
The best we can currently do is state the exact architecture and hyperparameters of the model, and provide the code as open-source for other scientists to analyze and reuse the model.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8cddfa7 elementor-widget elementor-widget-heading\" data-id=\"8cddfa7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"9e84\" data-selectable-paragraph=\"\">Misinterpretation of Results<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c204204 elementor-widget elementor-widget-text-editor\" data-id=\"c204204\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"5ee7\" data-selectable-paragraph=\"\"><strong>Misinterpretation of results\u00a0<\/strong>can be very common in the scientific world. 
One reason for this is that correlation does not imply causation \u2014 there are several reasons why two variables,\u00a0<em>A<\/em>\u00a0and\u00a0<em>B<\/em>, might be correlated:<\/p>\n\n<ul>\n \t<li id=\"4052\" data-selectable-paragraph=\"\"><em>A<\/em>\u00a0might be caused by the occurrence of\u00a0<em>B<\/em><\/li>\n \t<li id=\"1ce2\" data-selectable-paragraph=\"\"><em>B<\/em>\u00a0might be caused by the occurrence of\u00a0<em>A<\/em><\/li>\n \t<li id=\"bb94\" data-selectable-paragraph=\"\"><em>A<\/em>\u00a0and\u00a0<em>B<\/em>\u00a0might be caused by a further\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Confounding\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><em>confounding variable<\/em><\/a>,\u00a0<em>C<\/em><\/li>\n \t<li id=\"7a4c\" data-selectable-paragraph=\"\"><em>A<\/em>\u00a0and\u00a0<em>B<\/em>\u00a0may be spuriously correlated<\/li>\n<\/ul>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e60b115 elementor-widget elementor-widget-text-editor\" data-id=\"e60b115\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"8307\" data-selectable-paragraph=\"\">It is easy to show a correlation between two variables, but it is extremely difficult to determine whether one actually causes the other. 
By typing in spurious correlations on Google, you can come up with some pretty interesting and clearly ridiculous correlations that have statistical significance:<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-598fa8f elementor-widget elementor-widget-image\" data-id=\"598fa8f\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2080\/0*sI-utXkUM4G4Ndme.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-73ef39e elementor-widget elementor-widget-image\" data-id=\"73ef39e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2080\/0*y_dNzVMkUforleEc.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-53b87f5 elementor-widget elementor-widget-image\" data-id=\"53b87f5\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/miro.medium.com\/max\/2080\/0*YzSndnsoO_qGKIqr.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-62ba59e elementor-widget elementor-widget-text-editor\" data-id=\"62ba59e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"89ee\" data-selectable-paragraph=\"\">These may seem like ridiculous correlations, but the point is that if 
these variables were put together in a dataset that was fed to a machine learning algorithm, the algorithm would accept the correlation as a real relationship without asking any questions about its validity. In this sense, the resulting model is likely to be inaccurate or wrong, because the software is identifying patterns that exist only in that dataset and not in the real world.<\/p>\n<p id=\"f742\" data-selectable-paragraph=\"\">The occurrence of spurious correlations is not new, but it has become alarmingly more prevalent in recent years due to the use of large datasets with thousands of variables.<\/p>\n<p id=\"6aed\" data-selectable-paragraph=\"\">If I have a thousand variables and millions of data points, it is inevitable that some of the variables will be correlated. Algorithms can latch onto these correlations and assume causation, effectively performing unconscious\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Data_dredging\" target=\"_blank\" rel=\"noopener nofollow noreferrer\"><strong>p-hacking<\/strong><\/a>, a technique frowned upon in academia.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0311555 elementor-widget elementor-widget-heading\" data-id=\"0311555\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"7155\" data-selectable-paragraph=\"\">What is p-hacking?<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6d224a1 elementor-widget elementor-widget-text-editor\" data-id=\"6d224a1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"285d\" data-selectable-paragraph=\"\">The practice of p-hacking involves taking a dataset and exhaustively searching for correlations 
that are statistically significant and taking them as scientifically valid.<\/p>\n<p id=\"95a7\" data-selectable-paragraph=\"\">The more variables you have, the more likely you are to find a spurious correlation between two of them.<\/p>\n<p id=\"02cb\" data-selectable-paragraph=\"\">Usually, science involves the formation of a hypothesis, the collection of data, and the analysis of the data to determine whether the hypothesis was valid. With p-hacking, the experiment is performed first, and post-hoc hypotheses are then formed to explain the data that was obtained. Sometimes this is done without ill intent, but at other times scientists do this so that they are able to publish more papers.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-7cb8e42 elementor-widget elementor-widget-heading\" data-id=\"7cb8e42\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\">\n<h2 id=\"042f\" data-selectable-paragraph=\"\">Enforcing Correlations<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-e7c43be elementor-widget elementor-widget-text-editor\" data-id=\"e7c43be\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"53a9\" data-selectable-paragraph=\"\">One of the other problems of machine learning algorithms is that the algorithm must make a prediction. The algorithm cannot say \u2018I didn\u2019t find anything\u2019. 
This brittle framework means that the algorithm will find some way of explaining the data no matter how unsuitable the features it has been given are (provided the algorithm and data have been set up correctly; otherwise it may fail to converge).<\/p>\n<p id=\"9008\" data-selectable-paragraph=\"\">Currently, I know of no machine learning algorithms that are able to come back to the user and tell them that the data is unsuitable; this is implicitly presupposed to be the job of the scientist \u2014 which is not an unfair assumption.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-712ed7d elementor-widget elementor-widget-heading\" data-id=\"712ed7d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"6edd\" data-selectable-paragraph=\"\">Why Use Machine Learning Then?<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-2f2c317 elementor-widget elementor-widget-text-editor\" data-id=\"2f2c317\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"f550\" data-selectable-paragraph=\"\">This is a good question. Machine learning makes analyzing datasets much easier, and the ML algorithms perform the bulk of the work for the user. In areas where datasets are too large to analyze effectively using standard statistical techniques, this becomes invaluable. 
However, although it accelerates the job of scientists, the increase in productivity afforded by machine learning is arguably offset by the lower quality of these predictions.<\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-f32bf75 elementor-widget elementor-widget-heading\" data-id=\"f32bf75\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"6e4c\" data-selectable-paragraph=\"\">What can be done?<\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-105afb9 elementor-widget elementor-widget-text-editor\" data-id=\"105afb9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"03c1\" data-selectable-paragraph=\"\">It is not all doom and gloom. The same problems have always been present with traditional statistical methods and datasets; they have just been amplified by the use of large datasets and algorithms which can automatically find correlations and are less interpretable than traditional techniques. This amplification has exposed weaknesses in the scientific process that must be ironed out.<\/p>\n<p id=\"2a37\" data-selectable-paragraph=\"\">However, there is work underway on the next generation of machine-learning systems to make sure they\u2019re able to assess the uncertainty and reproducibility of their predictions.<\/p>\n<p id=\"35b7\" data-selectable-paragraph=\"\">That being said, it is a poor worker who blames their tools for failure, and scientists need to take more care in their use of ML algorithms to ensure that their research is corroborated and validated. 
The peer review process is designed to ensure this, but it is also the responsibility of the individual researcher. Researchers need to understand the techniques they are using and their limitations. If they do not have this expertise, then perhaps a quick trip to the statistics department to talk it over with a professor would be fruitful (as I have done myself).<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-911650e elementor-widget elementor-widget-text-editor\" data-id=\"911650e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"b40c\" data-selectable-paragraph=\"\">Rahimi (who believes ML is a form of alchemy) offers several suggestions for learning which algorithms work best, and when. He states that researchers should conduct\u00a0<strong>ablation studies<\/strong>\u00a0\u2014 successively removing parts of an algorithm to assess the influence of each. Rahimi also calls for\u00a0<strong>sliced analysis<\/strong>\u00a0\u2014 analyzing an algorithm\u2019s performance to see how improvements in certain areas might have a cost elsewhere. Lastly, he suggests that researchers run algorithms with a variety of different hyperparameter settings and report performance for all of them. These techniques would provide a more robust analysis of the data using ML algorithms.<\/p>\n<p id=\"4239\" data-selectable-paragraph=\"\">Due to the nature of the scientific process, once these issues are resolved, relationships previously believed to be accurate that are, in fact, spurious will eventually be identified and corrected. 
Relationships that are accurate will, of course, stand the test of time.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b2ca606 elementor-widget elementor-widget-heading\" data-id=\"b2ca606\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h2 id=\"b75e\" data-selectable-paragraph=\"\"><strong>Final Comments<\/strong><\/h2><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a8d15b1 elementor-widget elementor-widget-text-editor\" data-id=\"a8d15b1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"18c6\" data-selectable-paragraph=\"\">Machine learning in science does present problems in academia due to the lack of reproducibility of results. However, scientists are aware of these problems and a push toward more reproducible and interpretable machine learning models is underway. The real breakthrough will be once this has been completed for neural networks.<\/p>\n<p id=\"1dc5\" data-selectable-paragraph=\"\">Genevera Allen underscores a fundamental problem facing machine intelligence: <a href=\"https:\/\/www.experfy.com\/blog\/bigdata-cloud\/the-ten-statistical-techniques-data-scientists-need-to-master\/\">data scientists<\/a> still do not understand the mechanisms by which machines learn. The scientific community must make a concerted effort in order to understand how these algorithms work and how best to use them to ensure reliable, reproducible, and scientifically valid conclusions are made using data-driven methods.<\/p>\n<p id=\"6ffa\" data-selectable-paragraph=\"\">Even Rahimi, who alleged that machine learning is alchemy, is still hopeful of its potential. 
He states that \u2018alchemy invented metallurgy, ways to make medication, dyeing techniques for textiles, and our modern glass-making processes. Then again, alchemists also believed they could transmute base metals into gold and that leeches were a fine way to cure diseases.\u2019<\/p>\n<p id=\"fe4d\" data-selectable-paragraph=\"\">As physicist Richard Feynman said in his 1974\u00a0<a href=\"http:\/\/calteches.library.caltech.edu\/51\/2\/CargoCult.htm\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">commencement address<\/a>\u00a0at the California Institute of Technology,<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b02c395 elementor-widget elementor-widget-text-editor\" data-id=\"b02c395\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<blockquote>\n<p id=\"d7c9\" data-selectable-paragraph=\"\">\u201cThe first principle [of science] is that you must not fool yourself, and you are the easiest person to fool.\u201d<\/p>\n<\/blockquote>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-3614d85 elementor-widget elementor-widget-heading\" data-id=\"3614d85\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h4 class=\"elementor-heading-title elementor-size-default\"><h4 id=\"22b8\"><strong>References<\/strong><\/h4><\/h4>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a51e864 elementor-widget elementor-widget-text-editor\" data-id=\"a51e864\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t<p id=\"a8fc\" data-selectable-paragraph=\"\">[1]\u00a0<a 
href=\"https:\/\/science-sciencemag-org.ezp-prod1.hul.harvard.edu\/content\/sci\/365\/6452\/416.full.pdf\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/science-sciencemag-org.ezp-prod1.hul.harvard.edu\/content\/sci\/365\/6452\/416.full.pdf<\/a><\/p>\n<p id=\"580a\" data-selectable-paragraph=\"\">[2]\u00a0<a href=\"https:\/\/research.fb.com\/wp-content\/uploads\/2019\/05\/The-Scientific-Method-in-the-Science-of-Machine-Learning.pdf?\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/research.fb.com\/wp-content\/uploads\/2019\/05\/The-Scientific-Method-in-the-Science-of-Machine-Learning.pdf?<\/a><\/p>\n<p id=\"6c0e\" data-selectable-paragraph=\"\">[3]\u00a0<a href=\"https:\/\/bigdata-madesimple.com\/machine-learning-disrupting-science-research-heres\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/bigdata-madesimple.com\/machine-learning-disrupting-science-research-heres\/<\/a><\/p>\n<p id=\"bd2d\" data-selectable-paragraph=\"\">[4]\u00a0<a href=\"https:\/\/biodatamining.biomedcentral.com\/track\/pdf\/10.1186\/s13040-018-0167-7\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/biodatamining.biomedcentral.com\/track\/pdf\/10.1186\/s13040-018-0167-7<\/a><\/p>\n<p id=\"5870\" data-selectable-paragraph=\"\">[5]\u00a0<a href=\"https:\/\/www.sciencemag.org\/news\/2018\/05\/ai-researchers-allege-machine-learning-alchemy\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">https:\/\/www.sciencemag.org\/news\/2018\/05\/ai-researchers-allege-machine-learning-alchemy<\/a><\/p>\n<p id=\"e61a\" data-selectable-paragraph=\"\">[6]\u00a0<a href=\"https:\/\/www.sciencedaily.com\/releases\/2019\/02\/190215110303.htm\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/www.sciencedaily.com\/releases\/2019\/02\/190215110303.htm<\/a><\/p>\n<p id=\"a87a\" data-selectable-paragraph=\"\">[7]\u00a0<a 
href=\"https:\/\/phys.org\/news\/2018-09-machine-scientific-discoveries-faster.html\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">https:\/\/phys.org\/news\/2018-09-machine-scientific-discoveries-faster.html<\/a><\/p>\n<p id=\"b3ef\" data-selectable-paragraph=\"\">[8]\u00a0<a href=\"https:\/\/www.americanscientist.org\/blog\/macroscope\/people-cause-replication-problems-not-machine-learning\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/www.americanscientist.org\/blog\/macroscope\/people-cause-replication-problems-not-machine-learning<\/a><\/p>\n<p id=\"cfad\" data-selectable-paragraph=\"\">[9]\u00a0<a href=\"https:\/\/www.datanami.com\/2019\/02\/19\/machine-learning-for-science-proving-problematic\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">https:\/\/www.datanami.com\/2019\/02\/19\/machine-learning-for-science-proving-problematic\/<\/a><\/p>\n<p id=\"4ecf\" data-selectable-paragraph=\"\">[10]\u00a0<a href=\"https:\/\/www.quantamagazine.org\/how-artificial-intelligence-is-changing-science-20190311\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/www.quantamagazine.org\/how-artificial-intelligence-is-changing-science-20190311\/<\/a><\/p>\n<p id=\"3974\" data-selectable-paragraph=\"\">[11]\u00a0<a href=\"https:\/\/ml4sci.lbl.gov\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/ml4sci.lbl.gov\/<\/a><\/p>\n<p id=\"096b\" data-selectable-paragraph=\"\">[12]\u00a0<a href=\"https:\/\/blogs.nvidia.com\/blog\/2019\/03\/27\/how-ai-machine-learning-are-advancing-academic-research\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">https:\/\/blogs.nvidia.com\/blog\/2019\/03\/27\/how-ai-machine-learning-are-advancing-academic-research\/<\/a><\/p>\n<p id=\"b7fa\" data-selectable-paragraph=\"\">[13]\u00a0<a href=\"https:\/\/towardsdatascience.com\/a-quick-response-to-genevera-allen-about-machine-learning-causing-science-crisis-8465bbf9da82\" 
target=\"_blank\" rel=\"noopener noreferrer\" class=\"broken_link\">https:\/\/towardsdatascience.com\/a-quick-response-to-genevera-allen-about-machine-learning-causing-science-crisis-8465bbf9da82<\/a><\/p>\n<p id=\"f714\" data-selectable-paragraph=\"\">[14]\u00a0<a href=\"https:\/\/www.hpcwire.com\/2019\/02\/19\/machine-learning-reproducability-crisis-science\/\" target=\"_blank\" rel=\"noopener nofollow noreferrer\" class=\"broken_link\">https:\/\/www.hpcwire.com\/2019\/02\/19\/machine-learning-reproducability-crisis-science\/<\/a><\/p>\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Machine learning in science does present problems in academia due to the lack of reproducibility of results. However, scientists are aware of these problems and a push toward more reproducible and interpretable machine learning models is underway. The real breakthrough will be once this has been completed for neural networks. 
The scientific community must make a concerted effort in order to understand how these algorithms work and how best to use them to ensure reliable, reproducible, and scientifically valid conclusions are made using data-driven methods.<\/p>\n","protected":false},"author":682,"featured_media":3836,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3471],"class_list":["post-2290","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3471,"user_id":682,"is_guest":0,"slug":"matthew-stewart","display_name":"Matthew Stewart","avatar_url":"https:\/\/www.experfy.com\/blog\/wp-content\/uploads\/2020\/04\/medium_c57055f3-5301-4262-af65-4cc7d40cbf3d-150x150.jpg","user_url":"https:\/\/criticalfutureglobal.com\/","last_name":"Stewart","first_name":"Matthew","job_title":"","description":"Matthew Stewart is a Machine Learning consultant on AI for\u00a0<a href=\"https:\/\/www.criticalfutureglobal.com\/\" target=\"_blank\" rel=\"noopener\">Critical Future<\/a>, and machine learning engineer at Scalable Magic, an AI-based digital media startup. He is also a Graduate Teaching Assistant and a Ph.D. 
Candidate at Harvard University."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2290","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/682"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2290"}],"version-history":[{"count":5,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2290\/revisions"}],"predecessor-version":[{"id":35343,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2290\/revisions\/35343"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3836"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2290"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2290"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2290"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2290"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}