Carnegie Mellon University
October 31, 2024

Creating Algorithms That Will Make It Faster and Easier For Auditors to Detect Bookkeeping Anomalies

By Dr. Emily Barrow DeJeu and Dr. Pierre Jinghong Liang

Jen Cadman
  • Director, Strategic Partnerships

CMU researchers combine machine learning, computer science, and accounting expertise to create an algorithm that helps auditors sift through vast amounts of bookkeeping data to find important patterns.

The Problem: Bookkeeping data are vast and laborious to analyze

Auditors must test how well a company’s financial statements match its actual activities in order to detect anomalies, including those that suggest possible fraud – a problem that cost companies an estimated $42 billion in 2022, according to a PwC report on Global Economic Crime and Fraud. But detecting anomalies is challenging and labor-intensive, because transaction-level bookkeeping data such as journal entries are too voluminous to review by hand. Further, these accounting data follow double-entry bookkeeping, which requires that every transaction be recorded in at least two accounts. As a result, accounting data are full of complex linkages.

Past efforts to streamline bookkeeping analysis used heuristics and other experience-based methods, but these detection methods are not well-suited to accounting data and so have had limited success. 


The Solution: The Minimum Description Length (MDL) Principle highlights anomalies that deserve further scrutiny

The Minimum Description Length (MDL) principle, developed in the 1970s in the field of computer science and now a guide for modern data-mining tools, has the potential to make anomaly detection faster and simpler. MDL is based on a simple idea: patterns in a data set can be used to compress the data into smaller units. For example, writing down the first million numbers in the Fibonacci sequence would take ages, but two lines of basic computer code can generate those million numbers almost instantly because the sequence follows a simple pattern: each number is the sum of the two preceding numbers.
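The Fibonacci comparison can be made concrete with a short Python sketch. It is only an illustration of the compression idea, not part of the CMU team's algorithm; the function name `fibonacci` and the choice of 1,000 numbers (rather than a million) are assumptions made to keep the example quick to run.

```python
def fibonacci(count):
    """Return the first `count` Fibonacci numbers, starting from 0 and 1."""
    numbers = []
    a, b = 0, 1
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b  # each number is the sum of the two preceding numbers
    return numbers


# Writing the data out in full versus stating the rule that generates it.
first_thousand = fibonacci(1000)
written_out = ",".join(str(n) for n in first_thousand)
rule = "a, b = 0, 1; repeat: emit a; a, b = b, a + b"

print(len(written_out), "characters written out")
print(len(rule), "characters for the rule")
```

The written-out list is far longer than the rule that produces it. That gap is what MDL exploits: data that follow a pattern can be described briefly, while data that do not follow any pattern cannot.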

To apply MDL to bookkeeping data, an algorithm sifts through the data looking for patterns, finds the set of best patterns that describes most of the data with the shortest codes, and then spotlights anomalies that deserve further scrutiny: those transactions that still require the longest codes even under the best set of patterns. This greatly reduces the number of transactions auditors need to study carefully.
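As a rough illustration of how a "set of best patterns" can be chosen, the sketch below scores a candidate pattern table by its total description length: the bits needed to write down the table plus the bits needed to encode the data with it. Everything here is a simplifying assumption rather than the researchers' actual method: the toy entries, the five-account cost constant `ACCOUNT_BITS`, the `<other>` fallback symbol, and the greedy search are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: each entry is summarized as "debit account->credit account".
entries = (
    ["Cash->Revenue"] * 60
    + ["Inventory->Payables"] * 35
    + ["Cash->Goodwill", "Revenue->Cash", "Payables->Goodwill",
       "Inventory->Cash", "Goodwill->Revenue"]
)

# Assumed cost, in bits, of spelling out one debit/credit pair drawn from 5 accounts.
ACCOUNT_BITS = 2 * math.log2(5)


def total_description_length(entries, table):
    """Two-part cost in bits: the pattern table itself plus the data encoded with it."""
    model_bits = len(table) * ACCOUNT_BITS  # each table row spells out one pattern
    symbols = [e if e in table else "<other>" for e in entries]
    counts = Counter(symbols)
    data_bits = 0.0
    for symbol in symbols:
        data_bits += -math.log2(counts[symbol] / len(symbols))  # frequent = short code
        if symbol == "<other>":
            data_bits += ACCOUNT_BITS  # entries outside the table are spelled out in full
    return model_bits + data_bits


def greedy_table(entries):
    """Add patterns, most frequent first, only while the total length keeps shrinking."""
    table = set()
    best = total_description_length(entries, table)
    for pattern, _ in Counter(entries).most_common():
        trial = total_description_length(entries, table | {pattern})
        if trial < best:
            table.add(pattern)
            best = trial
    return table, best


table, best_bits = greedy_table(entries)
print(sorted(table), round(best_bits, 1))
```

In this toy data the two frequent patterns earn a place in the table because they shorten the overall description. The five one-off entries do not, since listing each of them in the table costs more than it saves.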


How It Works: Computer Science and Accounting experts use graph mining technology to analyze double-entry bookkeeping data

Applying MDL to accounting data requires an interdisciplinary approach. First, accounting experts need to transform bookkeeping journal entries into graphs or networks, and then graph mining methods drawn from computer science can be used to explore the data and find patterns.
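One simple way to picture the first step is to link each journal entry to the accounts it debits and credits. The Python sketch below does this for three made-up entries; the `Line` class, the entry IDs, the account names, and the edge-list representation are all assumptions chosen for illustration, and the graph construction used in the actual research may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    account: str
    side: str      # "debit" or "credit"
    amount: float


# Hypothetical journal entries; each one touches at least two accounts.
journal = {
    "JE-001": [Line("Cash", "debit", 500.0), Line("Sales Revenue", "credit", 500.0)],
    "JE-002": [Line("Inventory", "debit", 300.0), Line("Accounts Payable", "credit", 300.0)],
    "JE-003": [Line("Cash", "debit", 200.0), Line("Accounts Receivable", "debit", 100.0),
               Line("Sales Revenue", "credit", 300.0)],
}


def build_graph(journal):
    """Return (entry_id, account, side, amount) edges, checking that each entry balances."""
    edges = []
    for entry_id, lines in journal.items():
        debits = sum(line.amount for line in lines if line.side == "debit")
        credits = sum(line.amount for line in lines if line.side == "credit")
        if abs(debits - credits) > 1e-9:
            raise ValueError(f"{entry_id} does not balance")
        for line in lines:
            edges.append((entry_id, line.account, line.side, line.amount))
    return edges


def entries_touching(edges, account):
    """List the journal entries linked to a given account node."""
    return sorted({entry_id for entry_id, acct, _, _ in edges if acct == account})


edges = build_graph(journal)
print(entries_touching(edges, "Cash"))
```

Once entries and accounts are linked this way, the double-entry structure becomes explicit: accounts that are repeatedly debited and credited together show up as recurring shapes in the graph, which is what pattern-mining methods look for.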


The algorithm finds a set of best patterns, in the form of motifs or transaction components, that capture the majority of the data, and puts those patterns into motif tables. Intuitively, frequently occurring transaction components are selected into the motif table and assigned short description codes. Auditors can then quickly find anomalous transactions that do not align with those patterns, because anomalous transactions require a long description even with the best motif table. These anomalies might stem from human error or from atypical activity, such as recording a corporate merger transaction. But they might also signal intentionally misleading bookkeeping entries that require much more serious investigation.
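The link between motif frequency, code length, and anomaly ranking can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the team's algorithm: it assumes each motif is a single debit or credit line, gives every motif a code of -log2(frequency) bits, and uses made-up entries and account names.

```python
import math
from collections import Counter

# Hypothetical entries, each described as a list of motifs (here, single debit/credit lines).
entries = {}
for i in range(40):
    entries[f"JE-{i:03d}"] = ["debit:Cash", "credit:Revenue"]
for i in range(40, 70):
    entries[f"JE-{i:03d}"] = ["debit:Inventory", "credit:Payables"]
entries["JE-070"] = ["debit:Goodwill", "credit:Revenue"]  # a rare combination


def build_motif_table(entries):
    """Assign each motif a code length in bits; frequent motifs get shorter codes."""
    counts = Counter(motif for motifs in entries.values() for motif in motifs)
    total = sum(counts.values())
    return {motif: -math.log2(count / total) for motif, count in counts.items()}


def rank_by_description_length(entries, table):
    """Return (entry_id, total_bits) pairs, longest description first."""
    scored = [(entry_id, sum(table[m] for m in motifs)) for entry_id, motifs in entries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


table = build_motif_table(entries)
ranked = rank_by_description_length(entries, table)

top_id, top_bits = ranked[0]
costliest_motif = max(entries[top_id], key=lambda m: table[m])
print(top_id, round(top_bits, 2), "bits; costliest motif:", costliest_motif)
```

The entry at the top of the ranking is the one built from a motif that appears only once in the data. Because the score is a sum over motifs, an auditor can also see which component made the entry expensive to describe.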


Why It Matters: MDL is general, unsupervised, and scalable, making it a robust solution across industries

Industry partners who work with the Center for Intelligent Business want general, unsupervised, explainable, and scalable solutions, and the CMU team’s algorithm meets those criteria. The MDL method is unsupervised and does not require pre-training – it finds patterns as they occur in the data and does not need labeled examples. As a result, this approach is flexible and robust enough to apply across a range of industries.

Because the MDL method is compression-based and uses explicit motif tables, the patterns it mines are sub-components (motifs) of bookkeeping records. Each anomaly can therefore be traced to its particular rare combination of motifs, which makes the AI solution explainable, a very important property in the audit context. In experiments with real datasets, the method’s runtime also grew linearly rather than exponentially, so it can scale as needed. In fact, if a business’s transactions are fairly stable, motifs uncovered in some data sets can be applied to others, making analysis even faster.

Early testing of the algorithm has been successful: it detected anomalies that an industry partner had deliberately planted in data sets to test its accuracy. Thanks to this early success, Liang and his team are continuing to work with industry experts to develop the next generation of anomaly detection methods.



Learn more about Dr. Liang’s MDL research in his book, Bookkeeping Graphs: Computational Theory and Applications.