AI vs Humans: Forecasting Earnings
Has AI reached human-level capabilities when it comes to predicting the future?
(Of course, no AI post can begin without a generated image)
AI > Human Analysts: A signpost on the path to AGI?
About a month ago, the paper “Financial Statement Analysis with Large Language Models” created some waves when the authors found that GPT “outperforms financial analysts in its ability to predict earnings changes.”
The implications of such a claim, if valid, are hard to overstate. As in many other industries, there's the immediate question of displacement: millions of professionals globally are employed to predict earnings in one form or another. But the implications of this claim go beyond the potential impact of AI on financial services professionals.
As any analyst knows, accurately predicting a company's earnings is an incredibly challenging probabilistic forecasting exercise. It requires synthesizing countless qualitative and quantitative data points at the company, industry, and macroeconomic level, and then mapping these onto models (mental, Excel, or otherwise) to capture their interplay over time. Put another way, it's predicting the future of a dynamic ecosystem in disequilibrium, and the cone of uncertainty only grows as you forecast further out (which partly explains why most quants focus on developing trading signals much shorter than a year in duration).
So, long way of saying: if a general model like GPT, which was not specifically trained for this task, is able to outperform a human at predicting next year's earnings by using its general reasoning capabilities while staring at financial statements, we may be much closer to AGI than consensus currently believes.
Diving into the paper
The truth, as is often the case when diving into academic papers, is far more nuanced.
First, some key caveats to the abstract’s claim and the paper’s methodology:
The actual test the authors used was whether GPT could predict the direction of t+1 earnings (i.e. would earnings increase or decrease next year). Clearly, this is a gross oversimplification of what human analysts are predicting.
The data used to evaluate human analyst performance is consensus sell-side estimates. When sell-side analysts build forecasts, multiple factors influence what goes into the model (e.g. relationships with the management teams they cover). For instance, I've had many conversations with sell-side analysts about assumptions in their models, and it's not uncommon to hear "we don't want to stray too far from management guidance" when discussing a key assumption.
From what I can tell, the authors used GAAP financials from Compustat. In most cases, the more important earnings figure that analysts focus on forecasting is adjusted earnings, which strips out one-time (1x) impacts.
The trading strategy and Sharpe ratio analysis is flawed in too many ways to cover here, so I won't dive into it. Suffice to say, I would be more than stunned if a 2+ Sharpe ratio strategy were sitting in the open.
Okay, caveats aside, let me get the cold water out of the way: the results aren't quite as impressive as the abstract would suggest. GPT predicts the direction of earnings correctly ~60% of the time (vs. humans at ~53%), and its accuracy has been gradually declining over time. For comparison, a naive model that simply said "earnings will grow" every single year would score 55% over the sample.
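To make those numbers concrete, here's a minimal sketch of how directional accuracy and the "always up" baseline are computed. The data is randomly generated for illustration only; it is not the paper's sample, and the accuracy levels are just stand-ins for the figures above.

```python
import numpy as np

# Illustrative data only: realized year-over-year earnings direction (1 = increase, 0 = decrease)
# and a model's predicted direction for 1,000 hypothetical firm-years.
rng = np.random.default_rng(0)
actual = rng.binomial(1, 0.55, size=1000)        # ~55% of firm-years see earnings grow
predicted = np.where(rng.random(1000) < 0.60,    # a model that matches the truth ~60% of the time
                     actual, 1 - actual)

def directional_accuracy(pred, truth):
    """Share of firm-years where the predicted direction matches the realized one."""
    return (pred == truth).mean()

print(f"Model accuracy:     {directional_accuracy(predicted, actual):.1%}")

# The naive baseline simply predicts "earnings will increase" for every observation.
naive = np.ones_like(actual)
print(f"Always-up baseline: {directional_accuracy(naive, actual):.1%}")
```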
Additionally, GPT essentially performs on par with a VERY simple artificial neural network (ANN), a purely statistical model based on an architecture developed in 1989.
In other words, the fact that GPT outperforms humans says more about the humans than about GPT! And I think this is the first key takeaway of the paper: GPT, as a non-emotional model, is able to avoid the many heuristics and biases that plague human judgement and decision-making. Indeed, the authors run a statistical analysis which shows that human and GPT estimates are complementary to each other (i.e. they make different mistakes). This broadly speaks to the value of having AI in the investment research process as another "voice" in the room for financial analysis.
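To see what "complementary" can mean in practice, here's a toy sketch (purely illustrative, not the paper's analysis): when two predictors make independent mistakes, they are more often right when they agree than either is on its own, which is exactly why a second "voice" adds value.

```python
import numpy as np

rng = np.random.default_rng(1)
actual = rng.binomial(1, 0.55, size=1000)   # realized earnings direction, illustrative only

# Two predictors that are each right a fixed share of the time, with independently drawn
# errors -- i.e. they make different mistakes.
def noisy_copy(truth, accuracy):
    return np.where(rng.random(truth.size) < accuracy, truth, 1 - truth)

gpt = noisy_copy(actual, 0.60)
analyst = noisy_copy(actual, 0.53)

agree = gpt == analyst
print(f"Accuracy when the two agree:       {(gpt[agree] == actual[agree]).mean():.1%}")
print(f"GPT accuracy when they disagree:   {(gpt[~agree] == actual[~agree]).mean():.1%}")
```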
What's more interesting to me is the comparison of GPT vs the ANN. The ANN is a specialized and small time-series predictive model that learns statistical relationships in training data. GPT is many orders of magnitude larger than the ANN, but isn't directly trained on this task.
When GPT is asked directly to predict forward earnings, it performs far worse than the ANN. This suggests it has not "internalized" during pretraining the statistical relationships that the ANN learned by studying earnings data directly.
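For reference, "a very simple ANN" really is simple. Below is a minimal sketch of a small feedforward network predicting earnings direction from a handful of ratio-style features. The features and data are synthetic stand-ins, and this is not the paper's exact architecture or feature set.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for financial ratios per firm-year (e.g. margins, turnover, leverage).
# The real model is trained on actual Compustat-derived features.
rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=5000) > 0).astype(int)  # earnings up/down

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A small multilayer perceptron, in the spirit of the late-1980s architectures benchmarked in the paper.
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
ann.fit(X_train, y_train)
print(f"Directional accuracy on held-out data: {ann.score(X_test, y_test):.1%}")
```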
However, what allows GPT to shine is prompting it to reason aloud in the following ways (a rough prompt sketch follows the list):
Think about which financial ratios it should calculate as part of a trend and ratio analysis
Calculate those ratios and perform the analysis
Make observations that then inform the output
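Here is a sketch of what a guided prompt along those lines could look like using the OpenAI Python client. The wording, model name, and structure are my own illustration of the approach, not the authors' actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def analyze_statements(financial_statements: str, model: str = "gpt-4o") -> str:
    """Ask the model to reason through a ratio analysis before committing to a direction."""
    prompt = (
        "You are a financial analyst. You are given a company's standardized, anonymized "
        "balance sheet and income statement.\n\n"
        f"{financial_statements}\n\n"
        "Step 1: Identify the financial ratios most useful for a trend and ratio analysis.\n"
        "Step 2: Calculate those ratios from the statements and describe the trends you see.\n"
        "Step 3: Based on your observations, state whether earnings are more likely to "
        "increase or decrease next year, and give a confidence level.\n"
        "Show your reasoning for each step before giving the final answer."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```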
Once GPT goes through this reasoning process, its performance matches the ANN! There are two upshots to this:
1: The power of GPT is its ability to reason much like a human. As the authors show, GPT is essentially constructing a narrative about what's happened to the business and, from that, what is going to happen going forward. The example output the authors include is not dissimilar to what you'd expect from a capable intern or finance student.
2: The value you get out of GPT very much depends on the quality of the prompting. This is because LLMs today do not have a "System 2" reasoning capacity built in. LLMs are autoregressive models that predict one token (roughly, one word) at a time. They do not natively have the capacity to "think harder" about any one token versus another, which is why GPT performed so poorly when asked directly to predict earnings.
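A conceptual sketch of that autoregressive loop is below (the `predict_next` method is hypothetical, purely to show the mechanism): every token, including the final "increase/decrease" call, is conditioned only on the tokens already in the context, which is why putting the analysis into the context first helps.

```python
# Conceptual sketch of autoregressive decoding (not a real model API; predict_next is hypothetical).
# Each new token is chosen conditioned only on the tokens generated so far, so "reasoning aloud"
# means the intermediate analysis is already in the context before the final answer is produced.
def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model.predict_next(tokens)  # hypothetical one-step next-token predictor
        tokens.append(next_token)
        if next_token == "<eos>":                # stop at an end-of-sequence marker
            break
    return tokens
```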
The key to obtaining high-quality reasoning out of these models is guiding them to mimic a “System 2” type reasoning process aloud. And that’s exactly what the authors did.
As an aside: this is where we spend a lot of our time at Portrait Analytics: figuring out ways to guide language models to perform high-quality financial reasoning over high-quality data we obtain and create.
So, in conclusion:
GPT is not taking away anyone's day job yet
GPT can be a helpful voice in the room as a non-emotional thought partner
GPT is most powerful when it has space to reason like a human, which requires being thoughtful about prompt structure
If you’re interested in learning more about how we’re building an AI to be a “voice in the room” for idea discovery and thesis development, sign up here or reach out to me at david@portrait-analytics.com!
(Paper credit: Alex G. Kim, Maximilian Muhn, and Valeri V. Nikolaev, "Financial Statement Analysis with Large Language Models," Working Paper No. 2024-65, May 2024)