Company valuation is a cornerstone of corporate finance and investment decision-making. Since both of these areas interest me professionally, one of the first projects I decided to build with the Keras library is a neural network for estimating a company's value. With this article, I would like to begin describing its development.
When I started working on this project, I formulated its goal as follows: to build a neural network that estimates a company's value with results that are stable over time and within an error acceptable for use as a working tool.
While working on the project, I identified an initial set of issues that need to be resolved and that will determine the next development steps. Here are some of them:
- it is difficult to determine the size of an acceptable error. The market can overvalue or undervalue companies to an equal degree, while we are interested in the average, or "fair," price;
- public companies' financial statements should be available with minimal delay and in a standardized format, preferably through an open Python API, which means a data provider has to be chosen;
- a point of reference is needed for evaluating the effectiveness and stability of the network's results, so that further improvements can be measured and controlled.
Taking all of these issues into account, I decided to go from simple to complex and to determine further development steps based on the results obtained along the way.
Therefore, a classic MLP (multilayer perceptron) was chosen as the starting model for this project. It is simple in design and does not require serious computing power, which makes it easy to run quick experiments with the data and to optimize the model's hyperparameters. Despite its simplicity and limitations, it establishes a useful starting point for the project.
Also, to keep the model simple at this initial stage, I decided to use only companies' financial statements as input data, without any macroeconomic data.
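To make the starting point concrete, here is a minimal sketch of such an MLP baseline in Keras. The layer sizes, dropout rate, and number of input features are illustrative assumptions, not the exact architecture used in the project:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 40  # assumed number of financial-statement line items per company

model = keras.Sequential([
    layers.Input(shape=(N_FEATURES,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),   # dropout between layers helps fight overfitting
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1),       # regression output: predicted market capitalization
])
model.compile(optimizer="adam",
              loss="mae",
              metrics=[keras.metrics.MeanAbsolutePercentageError()])
```

A network of this depth trains in seconds even on a CPU, which is exactly what makes it suitable for rapid experimentation.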
I purchased historical, standardized financial statements of US public companies from INTRINIO. They offer a ready-made data set covering the last 10 years, which is more than enough for this project.
After studying the acquired data, it was time to move on to the first stage of development: defining and preparing the input data for training. The final result depends directly on the quality of this stage, so I thought about it for a long time and eventually arrived at a solution that is perhaps not the most obvious one.
The solution boils down to using only the annual financial statements of S&P 500 companies as input data. This significantly reduces the amount of available data and complicates the design and training of the network, but it was done deliberately, based on the following two premises:
- companies in the S&P 500 receive the most attention from analysts, investors, and professional managers, so we can assume that their market valuations are the fairest, which in turn should reduce the error when training the network;
- annual financial statements are subject to a mandatory independent audit, which should improve the quality of the source data.
I prepared the source data in KNIME. INTRINIO supplies the P&L, balance sheet, and cash flow statements in three separate CSV files, so the data must first be consolidated and then pre-processed (removing duplicates, rarely used columns, and so on). For me, it is easier and faster to do this in KNIME. The data-preparation workflow in KNIME is shown below.
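For readers who prefer Python, the same consolidation step can be sketched in pandas. The column names and the toy in-memory tables below are assumptions for illustration; in practice each table would be loaded with pd.read_csv from the corresponding INTRINIO file:

```python
import pandas as pd

# Toy stand-ins for the three INTRINIO CSV files (P&L, balance sheet, cash flow);
# in practice each would come from pd.read_csv(...).
pl = pd.DataFrame({"ticker": ["AAA", "AAA"], "fiscal_year": [2016, 2017],
                   "revenue": [10.0, 12.0]})
bs = pd.DataFrame({"ticker": ["AAA", "AAA"], "fiscal_year": [2016, 2017],
                   "total_assets": [50.0, 55.0]})
cf = pd.DataFrame({"ticker": ["AAA", "AAA"], "fiscal_year": [2016, 2017],
                   "operating_cash_flow": [5.0, 6.0]})

# Consolidate the three statement forms on ticker + fiscal year,
# then drop exact duplicate rows.
data = (pl.merge(bs, on=["ticker", "fiscal_year"])
          .merge(cf, on=["ticker", "fiscal_year"])
          .drop_duplicates())
```

Joining on ticker plus fiscal year keeps each company-year as one row, which is the shape the network expects.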
After consolidation and pre-processing, I grouped the resulting data by company ticker and kept only those companies that have reports for 2016, 2017, and 2018. This left 385 companies from the INTRINIO source data, a fairly modest amount of data for training a neural network, you must agree.
The data was then split by year: the 2016 data for training, 2017 for validation, and 2018 for testing. This way, we can evaluate the network's results over time.
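The filtering and year-based split described above can be sketched as follows; `data` here is a toy stand-in for the consolidated table, with one row per ticker and year:

```python
import pandas as pd

data = pd.DataFrame({
    "ticker": ["AAA"] * 3 + ["BBB"] * 2,
    "fiscal_year": [2016, 2017, 2018, 2016, 2017],
    "market_cap": [1.0, 1.1, 1.2, 2.0, 2.1],
})

# Keep only companies that report in all three years.
years = {2016, 2017, 2018}
has_all = data.groupby("ticker")["fiscal_year"].apply(
    lambda y: years.issubset(set(y)))
data = data[data["ticker"].isin(has_all[has_all].index)]

# Split by year: 2016 for training, 2017 for validation, 2018 for testing.
train = data[data["fiscal_year"] == 2016]
val = data[data["fiscal_year"] == 2017]
test = data[data["fiscal_year"] == 2018]
```

Splitting by calendar year rather than at random is what makes it possible to measure how the error drifts over time.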
The initial network design was developed by gradually increasing its capacity while experimenting with activation functions. Having reached what seemed to me a reasonable balance between overfitting and the size of the validation error, I stopped the process, and the first baseline model appeared.
The training and validation errors, as well as the best model's results at the training, validation, and testing stages, are presented below. The scatter plots show the true and predicted capitalization of each company in billions of US dollars (logarithmic scales).
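A plot of this kind can be sketched with matplotlib; the arrays below are random stand-ins for the real true and predicted values:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
true_cap = rng.uniform(0.5, 500.0, size=100)             # billions of USD
pred_cap = true_cap * rng.lognormal(0.0, 0.3, size=100)  # noisy "predictions"

fig, ax = plt.subplots()
ax.scatter(true_cap, pred_cap, s=10)
lims = [true_cap.min(), true_cap.max()]
ax.plot(lims, lims, color="red")  # the ideal line: prediction == truth
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("True capitalization, $B")
ax.set_ylabel("Predicted capitalization, $B")
fig.savefig("scatter.png")
```

The logarithmic scales matter here: capitalizations span several orders of magnitude, and on linear axes the small-cap companies would collapse into a single corner of the plot.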
Statistics for the best model are as follows:
Analysis of the results led to several conclusions:
- using dropout between layers makes it possible to work with a limited data set and successfully combat overfitting;
- there is a problem with the quality of the source data: analyzing the large deviations from the ideal line by each company's true capitalization revealed missing values and incorrect capitalization figures at the reporting date in all three data sets;
- a slight but visible shift of the point distribution away from the ideal line may indicate changes in the macroeconomic environment, which is not used in training the model.
Summing up the interim results of this stage of the project development, I would like to note the following.
A linear increase in the mean absolute percentage error (MAPE) of 10% per year is, in my opinion, a good indicator. What matters is not just the size of the increase but also its stability: the shift is visible both in the MAPE values and in the scatter plots for all three periods. However, at this stage I still do not plan to add macroeconomic data, because I suspect that the size and dynamics of this shift can later help identify the set of macroeconomic inputs needed for a stable result, and the error itself will probably change once the source-data issue is resolved.
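For reference, the MAPE metric discussed above can be computed as follows (this is the standard definition, not code from the project):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Example: each prediction is off by 10% of the true value.
print(mape([100.0, 200.0], [110.0, 180.0]))  # -> 10.0
```

Note that MAPE is scale-independent, which is convenient when true capitalizations range from under a billion to hundreds of billions of dollars.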
The question of source-data quality puzzled me more than I could have expected. Ideally, the project should be created and developed on an identical data set, at least to be able to measure progress. The detected data gaps greatly distort the final result, and something needs to be done about it. Further development of the project is therefore suspended until the data issue is resolved.
Unfortunately, at the moment I do not have a data set from another provider, so I simply have nothing to compare against, but I suspect that everyone has this kind of problem to some extent. It seems to me extremely difficult to avoid errors (technical and/or human) across the whole chain: preparing a report, filing it with EDGAR, the provider's data ingestion, and subsequent standardization. The only question is the size and frequency of these errors.
So in the near future I will try to resolve this data-quality dilemma: either find another data provider, or learn to work with the data set I already have. Ideally, both options should be pursued, but I expect that to be, as usual, expensive and slow. On the other hand, we are not looking for easy ways.
The project will obviously take a long time, and I’m used to giving code names to long and large projects. For this project, I chose a convenient abbreviation COVANN (Company Valuation Neural Network). You can easily find further updates on this project by clicking on the corresponding tag in my blog.