Using Machine Learning to Detect Tax Fraud

The terms “artificial intelligence” and “machine learning” immediately bring to mind movies like “The Matrix,” where machines become self-aware and want to end the world. While this makes for an exciting Hollywood plot, it is not reality outside of the theater.

In real life, machine learning serves as a practical tool for data analysts. It gives computers the ability to see hidden patterns in existing data and progressively improve performance (“learn”) without being explicitly programmed. The job is not to turn robots into people but to efficiently find recurring themes that would otherwise remain obscured in large amounts of data, and so provide end users with actionable information.

These technologies have played a pivotal role in reducing fraud, waste, and abuse in organizations of all types and sizes, including departments of revenue that collect taxes. As tax season ends, tax agencies can especially benefit from the use of several different machine learning techniques to improve upon their current levels of fraud detection. Yet the question remains, “Which technique is best?”

Supervised Learning

At SAS, we say the best technique is multiple techniques. And this is even more true in the world of fraud detection, where traditional business rules and various types of machine learning are layered upon one another so that the whole is greater than the sum of its parts. One of the most popular types of machine learning is called supervised learning. Also referred to as predictive modeling, this enables tax agencies to use all the fraud and audit cases they’ve worked in the past to figure out which attributes of these cases are most highly correlated with a successful case. They may then use this “FBI’s Most Wanted” sketch to automatically search for similar cases in the future.

Supervised learning techniques like predictive modeling are used by tax agencies that want to find more fish in holes they have fished previously. For this reason, they are used heavily for identity theft detection and audit selection in previously audited taxpayer segments. The machine must therefore have prior cases from which to learn, and that case data must also include failures (e.g., no-change audits) so the machine can learn which characteristics were in play when a lead turned out to be bad.
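To make the idea concrete, here is a minimal sketch of learning from past cases, written in plain Python. It is an illustration under stated assumptions, not a description of any agency's or vendor's actual model: the two attributes (deduction_ratio and cash_income_share), the six past cases, and the choice of a small logistic regression are all invented for the example. Note that the training data contains both successful audits and no-change audits.

```python
import math

# Hypothetical past audit cases (invented for illustration).
# Features: [deduction_ratio, cash_income_share]
# Outcome: 1 = audit produced an adjustment, 0 = "no change" audit.
PAST_CASES = [
    ([0.9, 0.8], 1),
    ([0.8, 0.9], 1),
    ([0.7, 0.7], 1),
    ([0.2, 0.1], 0),
    ([0.1, 0.3], 0),
    ([0.3, 0.2], 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(cases, epochs=2000, learning_rate=0.5):
    """Fit a small logistic regression with stochastic gradient descent."""
    weights = [0.0] * len(cases[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, outcome in cases:
            predicted = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = predicted - outcome
            # Move each weight against the gradient of the log loss.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def score(model, features):
    """Estimated probability that auditing this return would be productive."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


model = train(PAST_CASES)
high_risk = score(model, [0.85, 0.85])  # resembles past successful cases
low_risk = score(model, [0.15, 0.15])   # resembles past no-change audits
```

Without the no-change audits in PAST_CASES, the model would have nothing to contrast the successful cases against, which is why the failures matter as much as the wins.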

Machine learning not only saves time on building a fraud detection routine but can also remove bias against certain taxpayers if done properly. Auditors traditionally rely on gut instincts about fraud to build a set of “if/then” rules to find taxpayers to audit. While this method effectively harnesses the critical eye of an experienced auditor, it typically misses subtle clues hidden in the data and can lead to overfishing in traditionally lucrative parts of the pond.

The Value of Unsupervised Learning

If supervised learning is fishing where people have fished before, then another type of machine learning, unsupervised learning, is fishing where no one has fished before. Supervised learning increases the number of fish caught in known areas, but it leaves taxpayer segments with high rates of non-compliance undiscovered if no one has audited them before.

Unsupervised machine learning is used when prior case data is not available and the tax agency doesn’t necessarily set out knowing what they’re looking for (hence the term “unsupervised”). The question they’re trying to answer is: What don’t I know about yet? This technique allows the machine to go on an unsupervised walkabout without having been previously exposed to your data and bring your attention to anything that seems out of the ordinary.

One such technique is called “clustering.” The machine automatically puts all tax returns into groups of similar returns, or clusters, and then identifies returns falling outside these clusters as outliers that require additional investigation.
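The clustering idea can be sketched in a few lines of plain Python. This is a simplified illustration, not a production method: the nine two-attribute “returns,” the starting centroids, and the rule that flags anything more than three times the median distance from its cluster center are all assumptions made for the example. The grouping step is a basic k-means.

```python
import math
import statistics

# Hypothetical returns reduced to two numeric attributes (invented data).
RETURNS = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.0, 5.1),
    (3.0, 9.0),  # does not resemble either group
]


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, centroids, iterations=10):
    """Basic k-means: assign points to centroids, then recompute the centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for point in points:
            groups[nearest(point, centroids)].append(point)
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else old
            for group, old in zip(groups, centroids)
        ]
    return centroids


def find_outliers(points, centroids, factor=3.0):
    """Flag points unusually far from their own cluster center.

    The cutoff (factor times the median distance) is an assumption
    chosen for this example, not a standard rule.
    """
    distances = [math.dist(p, centroids[nearest(p, centroids)]) for p in points]
    cutoff = factor * statistics.median(distances)
    return [p for p, d in zip(points, distances) if d > cutoff]


centroids = k_means(RETURNS, centroids=[RETURNS[0], RETURNS[4]])
outliers = find_outliers(RETURNS, centroids)
```

Nothing in this sketch tells the machine what fraud looks like. It only reports which returns do not look like their neighbors, and an analyst still has to decide whether a flagged return is worth investigating.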

Both supervised and unsupervised approaches provide tremendous value for government tax authorities, especially when applied to complex data sets like tax returns, financial transactions, taxpayer contacts, accounts receivable, network traffic, and even employee activities.

Changing with the Times

Machine learning can provide departments of revenue with immediate benefits in reducing fraud and abuse, but there is still room to improve the models. Governments want to improve tax models by creating a “feedback loop,” in which machine learning environments use new incoming data to constantly adjust the attributes and weights of the “FBI’s Most Wanted” sketch in real time.
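One simple way to picture such a feedback loop is an incremental update: each time an investigation closes, its outcome nudges the model's weights. The sketch below assumes a logistic scoring model and two hypothetical attributes (large_refund and new_bank_account); the names, the learning rate, and the update rule are illustrative choices, not a description of any particular product.

```python
import math


def predict(weights, features):
    """Fraud score between 0 and 1 from a weighted sum of attributes."""
    z = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


def update(weights, features, confirmed_fraud, learning_rate=0.1):
    """Return new weights after one closed case (a single gradient step)."""
    error = predict(weights, features) - (1.0 if confirmed_fraud else 0.0)
    return {
        name: w - learning_rate * error * features.get(name, 0.0)
        for name, w in weights.items()
    }


# Hypothetical attribute names and starting weights.
weights = {"large_refund": 0.0, "new_bank_account": 0.0}

# A newly closed case: fraud was confirmed on a return with a large refund.
closed_case = {"large_refund": 1.0}
updated = update(weights, closed_case, confirmed_fraud=True)
```

After this one case, the weight on large_refund rises slightly while new_bank_account is unchanged, because that attribute was not present on the return. Repeating the step for every closed case is what keeps the sketch current.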

Machines can also be configured to automatically alert users that their current predictive models have degraded in accuracy, meaning that different parts may need to be reconfigured to get the most accurate answers.
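A degradation alert of this kind can be as simple as comparing recent accuracy with the accuracy measured when the model was deployed. The following sketch assumes a hypothetical AccuracyMonitor class, a baseline of 80 percent, a tolerance of 10 points, and a five-case window; all of these values are invented for illustration.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks accuracy over the most recent cases and flags a drop."""

    def __init__(self, baseline, tolerance=0.10, window=5):
        self.baseline = baseline    # accuracy measured at deployment
        self.tolerance = tolerance  # how far accuracy may fall before alerting
        self.recent = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store one closed case and report whether the model looks degraded."""
        self.recent.append(predicted == actual)
        return self.degraded()

    def degraded(self):
        # Wait until the window is full before judging the model.
        if len(self.recent) < self.recent.maxlen:
            return False
        accuracy = sum(self.recent) / len(self.recent)
        return accuracy < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.80, tolerance=0.10, window=5)

# Five correct predictions, then two misses in a row.
alerts = [monitor.record(predicted=True, actual=True) for _ in range(5)]
alerts.append(monitor.record(predicted=True, actual=False))
alerts.append(monitor.record(predicted=True, actual=False))
```

In this run the alert stays off through the first miss, when recent accuracy is still 80 percent, and turns on after the second, when it falls to 60 percent.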

Machine learning can be a difficult concept for some to understand. Yet the tax system has remained rife with fraud and abuse despite the best efforts of tax auditors. With each change of the tax code, there will be new fraudsters looking for loopholes and blind spots to exploit. While machine learning will not catch every criminal, these capabilities provide auditors with a valuable and powerful tool to reduce the amount of money stolen.

That money belongs to the people and should be spent on improving government programs. It does not belong to thieves who found a hole in the system. Machine learning can help close these gaps.
