{"id":2093,"date":"2019-11-26T02:28:36","date_gmt":"2019-11-26T02:28:36","guid":{"rendered":"http:\/\/kusuaks7\/?p=1698"},"modified":"2024-02-21T13:04:07","modified_gmt":"2024-02-21T13:04:07","slug":"what-is-machine-learning-on-code","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/what-is-machine-learning-on-code\/","title":{"rendered":"What is Machine Learning on Code?"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"2093\" class=\"elementor elementor-2093\" data-elementor-post-type=\"post\">\n\t\t\t\t\t\t<section class=\"has_eae_slider elementor-section elementor-top-section elementor-element elementor-element-78303439 elementor-section-boxed elementor-section-height-default elementor-section-height-default\" data-eae-slider=\"77951\" data-id=\"78303439\" data-element_type=\"section\" data-e-type=\"section\">\n\t\t\t\t\t\t<div class=\"elementor-container elementor-column-gap-default\">\n\t\t\t\t\t<div class=\"has_eae_slider elementor-column elementor-col-100 elementor-top-column elementor-element elementor-element-576cc12c\" data-eae-slider=\"6462\" data-id=\"576cc12c\" data-element_type=\"column\" data-e-type=\"column\">\n\t\t\t<div class=\"elementor-widget-wrap elementor-element-populated\">\n\t\t\t\t\t\t<div class=\"elementor-element elementor-element-2df4b49a elementor-widget elementor-widget-text-editor\" data-id=\"2df4b49a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAs IT organizations grow, so does the size of their codebases and the complexity of their ever-changing developer toolchain. Engineering leaders have very limited visibility into the state of their codebases, software development processes, and teams. By applying modern data science and machine learning techniques to software development, large enterprises have the opportunity to significantly improve their software delivery performance and engineering effectiveness.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4dccf1a elementor-widget elementor-widget-text-editor\" data-id=\"4dccf1a\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn the last few years, a number of large companies such as Google, Microsoft, Facebook and smaller companies such as Jetbrains and source{d} have been collaborating with academic researchers to lay the foundation for Machine Learning on Code.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-6028e6e elementor-widget elementor-widget-heading\" data-id=\"6028e6e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>What is Machine Learning on Code?<\/strong><\/h3>\n<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-47135ca elementor-widget elementor-widget-text-editor\" data-id=\"47135ca\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tMachine Learning on Code (MLonCode) is a new interdisciplinary field of research related to Natural Language Processing, Programming Language Structure, and Social and History analysis such contributions graphs and commit time series. MLonCode aims to learn from large scale source code datasets in order to automatically perform software engineering tasks such as assisted code reviews, code deduplication, software expertise assessment, etc.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a04ea8c elementor-widget elementor-widget-image\" data-id=\"a04ea8c\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/ml-on-code-1.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a6fc2eb elementor-widget elementor-widget-heading\" data-id=\"a6fc2eb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h3 class=\"elementor-heading-title elementor-size-default\"><h3><strong>Why is MLonCode hard?<\/strong><\/h3>\n<\/h3>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b47c867 elementor-widget elementor-widget-text-editor\" data-id=\"b47c867\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tSome MLonCode problems require zero error rate, such as those related to code generation; automatic program repair is one particular example. A tiny, single misprediction may lead to the whole program&#8217;s compilation failure.\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-9445530 elementor-widget elementor-widget-text-editor\" data-id=\"9445530\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tIn some other cases, the error rate must be low enough. An ideal model should make as few mistakes as that the signal-to-noise ratio for the users &#8211; software developers &#8211; stays bearable and trustworthy. Thus the model can be used the same way as traditional static code analysis tools. A great example of this is best practices mining.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c5068bc elementor-widget elementor-widget-text-editor\" data-id=\"c5068bc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tFinally, the vast majority of MLonCode problems are unsupervised or at most weakly supervised. It can be very costly to manually label datasets, so researchers typically have to develop correlated heuristics. For example, there are numerous similarity grouping tasks, such as showing similar developers or helping to compile teams based on areas of expertise. Our own experience in this topic lies in\u00a0<a href=\"https:\/\/github.com\/src-d\/style-analyzer\" target=\"_blank\" rel=\"noopener noreferrer\">mining code formatting rules and applying them to fix faults<\/a>, similarly to what linters do but completely unsupervised. There is a related academic competition to predict formatting problems called\u00a0<a href=\"https:\/\/github.com\/KTH\/codrep-2019\" target=\"_blank\" rel=\"noopener noreferrer\">CodRep<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-391c08d elementor-widget elementor-widget-text-editor\" data-id=\"391c08d\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tMLonCode problems include a variety of data mining tasks that may be trivial from the theoretical point of view but still challenging technically due to the scale or required attention to the details. Examples are code clone detection and similar developer clustering. Solutions of such problems are presented at the annual academic conference\u00a0<a href=\"http:\/\/www.msrconf.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Mining Software Repositories<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c409cc7 elementor-widget elementor-widget-image\" data-id=\"c409cc7\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/ml-on-code-2.jpg\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-57825fe elementor-widget elementor-widget-text-editor\" data-id=\"57825fe\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tWhile solving an MLonCode problem, one typically represents source code in one of the following ways:\n\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-4dd2759 elementor-widget elementor-widget-text-editor\" data-id=\"4dd2759\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tA frequency dictionary (weighted bag-of-words, BOW). Examples: identifiers inside a function; graphlets in a file; dependencies of a repository. The frequencies can be weighted by TF-IDF. This representation is the simplest and the most scalable.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-c33cbbb elementor-widget elementor-widget-image\" data-id=\"c33cbbb\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/ml-on-code-3.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-fdf5ed1 elementor-widget elementor-widget-text-editor\" data-id=\"fdf5ed1\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tA sequential token stream (TS), which corresponds to the source code parsing sequence. That stream is often augmented with the links to the corresponding Abstract Syntax Tree nodes. This representation is friendly to conventional Natural Language Processing algorithms, including sequence-to-sequence deep learning models.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-bc4aae8 elementor-widget elementor-widget-image\" data-id=\"bc4aae8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/ml-on-code-4.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-318d826 elementor-widget elementor-widget-text-editor\" data-id=\"318d826\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tA tree, which naturally comes out from an Abstract Syntax Tree. We perform various transformations after, e.g. irreversible simplification or identifier posterization. This is the most powerful representation, and also the most difficult to work with. The relevant ML models include various graph embeddings and Gated Graph Neural Networks.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-a46ae25 elementor-widget elementor-widget-image\" data-id=\"a46ae25\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/ml-on-code-5.png\" alt=\"\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-b80f6d9 elementor-widget elementor-widget-text-editor\" data-id=\"b80f6d9\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tMany of the approaches to MLonCode problems ground on the so-called Naturalness Hypothesis (<a href=\"https:\/\/people.inf.ethz.ch\/suz\/publications\/natural.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Hindle et.al.<\/a>):\n<blockquote>\u201cProgramming languages, in theory, are complex, flexible and powerful, but the programs that real people actually write are mostly simple and rather repetitive, and thus they have usefully predictable statistical properties that can be captured in statistical language models and leveraged for software engineering tasks.\u201d<\/blockquote>\nThis statement justifies the usefulness of Big Code: the more source code is analyzed, the stronger the statistical properties emphasized, and the better the achieved metrics of a trained machine learning model. The underlying relations are the same as in e.g. the current state-of-the-art Natural Language Processing models: XLNet, ULMFiT, etc. Likewise, universal MLonCode models can be trained and leveraged in downstream tasks.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-0352a32 elementor-widget elementor-widget-text-editor\" data-id=\"0352a32\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tThere are such big code datasets. The current ultimate source is open source repositories on GitHub. There can be technical problems with cloning hundreds of thousands of Git repositories, so there are downstream datasets such as\u00a0<a href=\"https:\/\/github.com\/src-d\/datasets\/tree\/master\/PublicGitArchive\" target=\"_blank\" rel=\"noopener noreferrer\">Public Git Archive<\/a>,\u00a0<a href=\"http:\/\/ghtorrent.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">GHTorrent<\/a>, and\u00a0<a href=\"https:\/\/zenodo.org\/record\/2583978#.Xac1fuczb5Y\" target=\"_blank\" rel=\"noopener noreferrer\">Software Heritage Graph<\/a>.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-8793324 elementor-widget elementor-widget-heading\" data-id=\"8793324\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"heading.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t<h2 class=\"elementor-heading-title elementor-size-default\"><h3><b>Conclusion<\/b><\/h3><\/h2>\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<div class=\"elementor-element elementor-element-5e05c1b elementor-widget elementor-widget-text-editor\" data-id=\"5e05c1b\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t<div class=\"elementor-widget-container\">\n\t\t\t\t\t\t\t\t\tAs software continues to eat the world, we\u2019re accumulating billions of lines of code, millions of applications built from great variety of programming languages, frameworks, and infrastructure. Not only can MLonCode help companies streamline their codebase and software delivery processes, but it also helps organizations better understand and manage their engineering talents. By treating software artifacts as data and applying modern data science and machine learning techniques to software engineering, organizations have a unique opportunity to gain a competitive edge.\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t<\/section>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Machine Learning on Code (MLonCode) is a new interdisciplinary field of research related to Natural Language Processing, Programming Language Structure, and Social and History analysis such contributions graphs and commit time series. MLonCode aims to learn from large scale source code datasets in order to automatically perform software engineering tasks such as assisted code reviews, code deduplication, software expertise assessment, etc. Some MLonCode problems require zero error rate, such as those related to code generation. A tiny, single misprediction may lead to the whole program&#8217;s compilation failure.<\/p>\n","protected":false},"author":677,"featured_media":2861,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[183],"tags":[92],"ppma_author":[3458],"class_list":["post-2093","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-machine-learning"],"authors":[{"term_id":3458,"user_id":677,"is_guest":0,"slug":"vadim-markovtsev","display_name":"Vadim Markovtsev","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","author_category":"","user_url":"","last_name":"Markovtsev","first_name":"Vadim","job_title":"","description":"Vadim Markovtsev is Distinguished Machine Learning Engineer at source{d}, and &nbsp;Google Developer Expert in Machine Learning. He has authored 5 scientific papers and has spoken at many conferences."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2093","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/677"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=2093"}],"version-history":[{"count":0,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/2093\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/2861"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=2093"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=2093"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=2093"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=2093"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}