Working with large data sets inevitably takes a lot of time and effort, especially when it comes to preparing raw data for training neural networks: the quality of that data directly determines the quality of the result. Later, when operating systems built around neural networks, the input data also has to be monitored, if only to make sure the network's responses can be trusted. In this article, I will demonstrate a deep autoencoder built to search for anomalies in the financial statements of companies in the INTRINIO data set assembled for training the COVANN neural network from the previous article.
A brief recap
Let me remind you that the data set contains standardized annual financial statements of 385 major companies from the S&P 500 list over three years (1,155 reports), covering the Balance Sheet, Income Statement and Cash Flow (100 values per report). The entire data set thus contains 115,500 values, which is far too many to verify manually.
The data set is purely numerical, and the task is a multidimensional regression, which makes the problem considerably harder. It must also be taken into account that the set includes companies of different sizes operating in different sectors, each with its own accounting specifics, as well as the unavoidable errors introduced during data standardization. All this effectively rules out classical algorithmic or statistical methods for finding anomalies in the data, so the solution chosen for this problem is a deep autoencoder.
There is plenty of information on the Internet about autoencoders and the range of tasks they can solve, so I will not go into the theory. I will only note that an autoencoder compresses the original representation and then restores it from the compressed form to a state close to the original. At its output, the autoencoder therefore produces an image that reflects the essence of the data passed to it. This property is nicely demonstrated by the popular picture with a mushroom.
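To make the idea concrete, below is a minimal sketch of a deep autoencoder in Keras. The layer sizes, the 100-value input dimension and the commented-out training call are illustrative assumptions, not the actual configuration used in this project.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 100  # one standardized report: Balance Sheet + Income Statement + Cash Flow

# Encoder: compress the report into a small latent representation
inputs = keras.Input(shape=(N_FEATURES,))
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(32, activation="relu")(x)
latent = layers.Dense(16, activation="relu")(x)

# Decoder: restore the report from the compressed form
x = layers.Dense(32, activation="relu")(latent)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(N_FEATURES, activation="linear")(x)

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# X_train: (n_reports, 100) array of standardized statements (hypothetical name)
# autoencoder.fit(X_train, X_train, epochs=100, batch_size=32, validation_split=0.1)
```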

This property is also what allows an autoencoder to search for anomalies in data. It is critical, however, that the autoencoder be trained on a data set in which the overwhelming majority of values are correct relative to the total data volume. Only then can the autoencoder learn to restore the required image from the input data.
Defining the required image is an important step in designing an autoencoder. The image produced at the autoencoder's output determines what will be considered an anomaly and what will not, since the reconstruction error of the original data is calculated against this image. Therefore, we first need to decide what counts as an anomaly.
Within this project, anomalies were defined as any data with significant distortions in its presentation (inconsistent with reality or containing significant errors). To make the reconstruction error comparable across companies of different sizes, it was normalized by the average size of the company's assets/liabilities.
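One possible way to compute such a normalized error is sketched below; the exact averaging and the choice of the size measure are my assumptions for illustration, not the project's exact formula.

```python
import numpy as np

def normalized_reconstruction_error(x, x_hat, company_scale):
    """Mean absolute reconstruction error of a single report,
    normalized by a company-size measure (e.g. average assets/liabilities)."""
    return np.mean(np.abs(x - x_hat)) / company_scale
```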
Creating such an autoencoder is quite time-consuming. The required image for detecting the specified anomalies is obtained iteratively: the autoencoder's design and hyperparameters are adjusted, and then the extremes of the resulting reconstruction error are checked against the stated requirements. During the development of this project, just over 30% of the original data set was verified. The audit was carried out against two independent sources: Yahoo Finance and the annual report on each company's website.
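In practice, that check can be driven by ranking the reports by reconstruction error and reviewing the extremes first. The sketch below assumes hypothetical `errors` and `report_ids` arrays produced elsewhere.

```python
import numpy as np

def worst_reports(errors, report_ids, top_n=20):
    """Rank reports by normalized reconstruction error and return the worst ones,
    i.e. the first candidates for manual verification against Yahoo Finance
    and the annual report on the company's website."""
    order = np.argsort(errors)[::-1][:top_n]
    return [(report_ids[i], float(errors[i])) for i in order]
```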
As a result, an autoencoder was obtained that meets the stated requirements and made us look at the available data from a different point of view. The individual errors for the entire data set are shown below, along with the distribution of the error across the data set.


As you can see, there are not many outliers with a large reconstruction error in the data set, but things are not that simple, especially considering that these are the largest companies. So let us look at specific examples from the data set. The input data indexes map to the reporting forms as follows: 0-42 Balance Sheet, 43-68 Income Statement, >=69 Cash Flow.
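Purely for illustration, this mapping can be written as a small helper:

```python
def statement_for_index(i):
    """Map an input-vector index to its reporting form, following the layout above."""
    if 0 <= i <= 42:
        return "Balance Sheet"
    if 43 <= i <= 68:
        return "Income Statement"
    if i >= 69:
        return "Cash Flow"
    raise ValueError("index must be a non-negative integer")
```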

This is the Zoetis Inc. (ZTS) report for fiscal year 2018. It has the smallest error in the entire data set: 0.16%. During the review, no significant deviations from the official report or from Yahoo Finance data were found. This representation is not as vivid as the mushroom (in terms of image perception), but I hope the further examples will add clarity.

Charter Communications, Inc. (CHTR), fiscal year 2018. The minor deviations can, in principle, be attributed to standardization errors.

Fortive Corporation (FTV), fiscal year 2016. A good example, so to speak, of "suspicious data", a category that covers all data in the error range from 2% to 3%. When checking this report, it turned out that the "Total Revenue" and "Cost of Revenue" figures were overstated by 10 billion US dollars, while the "Net Income" value was correct.

Keysight Technologies, Inc. (KEYS), fiscal year 2018. A good example of data that does not correspond to reality, with an error of 6.18%. Verification showed it to be just a set of numbers that do not correlate with the real data.

And finally, data from Ford Motor Company (F) for fiscal year 2018, with a reconstruction error of 2.27%. This is data from the "suspicious" category. The main error is an overstated amount of assets, and the autoencoder shows this clearly, since it generates the correct image for the transmitted data.
As you can see, a deep autoencoder is an effective tool for finding anomalies in large volumes of company financial reporting data, and it allows the process of evaluating and controlling data quality to be automated. Such a solution can therefore be used widely, from individual projects to financial monitoring in banking systems.
Instead of a conclusion
For the COVANN project this autoencoder was developed for, I can report the following results. A neural network with the same design and hyperparameters (that is, all else being equal, to ensure comparability) was trained on the data with a reconstruction error below 3% (96.4% of the original data set); the error on all three sets (Train, CV, Test) dropped by an average of 7%. This is a good result, but not an entirely objective one, and here is why.
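As a sketch of what that filtering step might look like (the 3% threshold comes from the text; the array names and the function itself are illustrative assumptions):

```python
import numpy as np

def filter_by_error(X, errors, threshold=0.03):
    """Keep only the reports whose normalized reconstruction error is below the threshold."""
    mask = errors < threshold
    return X[mask], mask

# X_clean, kept = filter_by_error(X, errors)  # at the 3% threshold this keeps ~96.4% of the reports
```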
First, when checking the market capitalization values of the already filtered data set, errors were found there as well. The autoencoder was built to search for anomalies in the companies' financial statements, so market capitalization values were not included in its data set because of their volatility.
Second, it would be desirable to work with even better data, for example data with an error below 2%, but this would shrink the already small data set by another 22%, which is undesirable.
For these reasons, I will look into alternative data providers in the near future. For a production project, several options need to be considered in any case.