
# Master Thesis: Predicting Company Earnings with AI Techniques

Predicting company earnings (net income / ROE / dividends per share) with AI from company fundamentals (B/S, I/S, CF/S, ShrE/S, i.e. balance sheet, income statement, cash-flow statement, and statement of shareholders' equity).

## Abstract

Corporate earnings prediction has always been an important subject for practitioners and for accounting and investment research. A large body of prior research has examined the EPS forecasting accuracy of traditional methods, namely regressions and time-series moving-average models. With increasing data availability and computing power, Artificial Intelligence (AI) techniques have gained popularity in recent years. These techniques promise to independently learn subtle, non-linear relationships and interaction effects in high-dimensional data, even without explicit theoretical guidance, helping uncover new information from the financial statements beyond the extant models. However, the literature applying AI techniques to predict corporate earnings from fundamental data remains very limited, and results often contradict each other. This thesis leverages a large fundamental dataset from Reuters Refinitiv spanning a time series of over 20 years to compare traditional against AI earnings prediction techniques. On this common dataset, the different techniques are re-evaluated and compared against one another. The results reveal that the AI techniques outperform the traditional techniques, with Gradient Boosting Trees being the best model. Additionally, I show that feature engineering is crucial for Neural Networks' performance. However, the mean absolute error of the best model, the Gradient Boosting Tree, is relatively high, with an average error of 26.8 to 29.4 percentage points, making it unsuitable for practical application.

van Almsick, S. (2021). *Predicting Company Earnings with AI Techniques.* Master thesis, Frankfurt School of Finance and Management [not published].

## Data-Flow/Process-Graphic

The approach is to apply best practices with the most widely used and best-regarded software and tools. *(See the data-flow chart in the repository.)*

### 1. Getting the data (folder: a)

Two methods:

- **Big-Data method:** download all data fields that are >= 66% filled from Reuters Eikon & WRDS Compustat.
- **Handpicked method:** download all key metrics/data fields specified in a csv-file and intelligently fill NaNs with formulas and methods.

Please see the README_get_data.md for detailed instructions and an overview.

### 2. Data cleaning and missing-data handling (folder: b)

Intelligent NaN filling (see code here) with the formulas and methods specified in the last columns (those beginning with "fillnan_") of this csv-file.

Example of intelligent NaN-filling formulas and methods (see csv-file):

| Category | Name | Reuters Code | Clear Name | fillnan_1 | fillnan_2 |
|---|---|---|---|---|---|
| General | Data date | TR.TotalReturn.date | data_date | | |
| Income Statement | Depreciation and amortization | TR.DepreciationDepletion+TR.Amortization | DeprArmo | formula:EBITDA-EBIT | formula:Depreciation+Amortization |
| Income Statement | EBIT | TR.EBIT | EBIT | formula:EBITDA-DeprArmo | |
| Profitability | Tax Rate | TR.IncomeTaxRatePct | TaxRate | formula:Tax/PreTaxIncome*100 | method:linear(or=mean) |
| Ratios | Asset Turnover | TR.F.AssetTurnover | AssetTurnover | formula:Sales/TotalAssets | |

Available fillnan methods are (see the sketch after this list):

- `value` (e.g. `value:0` fills NaNs with 0)
- `formula` (e.g. `formula:Sales/TotalAssets`, using the column names from the "Clear Name" column)
- `method`
  - `linear(or=mean)`, where `or` can be any number or `mean`
  - `approx(other=COLUMN_NAME)`, where `other` is one or more other columns used to linearly approximate that column, e.g. `approx(other=Sales)` or `approx(other=[Sales, EBIT])`
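As an illustration, here is a minimal sketch (my own, with pandas; not the repository's actual implementation) of how such rule strings could be applied. The toy DataFrame and the helper `apply_fillnan_rule` are hypothetical; the rule strings come from the table above.

```python
import pandas as pd

# Toy data: AssetTurnover and TaxRate contain NaNs to be filled.
df = pd.DataFrame({
    'Sales': [100.0, 120.0, 130.0],
    'TotalAssets': [500.0, 600.0, 650.0],
    'AssetTurnover': [None, 0.20, None],
    'TaxRate': [30.0, None, None],
})

def apply_fillnan_rule(df: pd.DataFrame, col: str, rule: str) -> None:
    """Fill NaNs in `col` according to a value:/formula:/method: rule string."""
    kind, _, spec = rule.partition(':')
    if kind == 'value':                  # e.g. value:0
        df[col] = df[col].fillna(float(spec))
    elif kind == 'formula':              # e.g. formula:Sales/TotalAssets
        df[col] = df[col].fillna(df.eval(spec))
    elif kind == 'method' and spec.startswith('linear'):
        # linear(or=mean): interpolate linearly, fall back to the column mean
        df[col] = df[col].interpolate(method='linear').fillna(df[col].mean())

apply_fillnan_rule(df, 'AssetTurnover', 'formula:Sales/TotalAssets')
apply_fillnan_rule(df, 'TaxRate', 'method:linear(or=mean)')
print(df)
```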

To run this code (see code here):

```python
## Get cleaned and filled data
# General params
recache_raw_data = False
redo_data_cleaning = False

# Dataset params, option 1: set props manually / overwrite props
data_version = 'handpicked_dataset'  # get name from app/b_data_cleaning/_dataset_registry.py
comp_col = 'ric'
time_cols = ['data_year', 'data_qrt']
industry_col = 'industry'

# Dataset params, option 2: get props from the registry (overwrites option 1)
data_version = 'handpicked_dataset'  # get name from app/b_data_cleaning/_dataset_registry.py
from app.b_data_cleaning import get_dataset_registry
dataset_props = get_dataset_registry()[data_version]
comp_col = dataset_props['company_col']
time_cols = dataset_props['iter_cols']
industry_col = dataset_props['industry_col']

# Data-dropping params
required_filled_cols_before_filling = ['sales']  # drop rows before NaN filling if these columns are NaN/not filled
required_filled_cols_after_filling = ['ebit', 'roe']  # drop rows after NaN filling if these columns are NaN/not filled
drop_threshold_row_pct = 0.25  # drop rows before NaN filling if less than this share of the row's columns is filled
drop_threshold_row_quantile = 0.2  # drop rows before NaN filling if the row's fill rate is below this quantile
drop_threshold_col_pct = 0  # drop columns before NaN filling if the column is filled less than this share
append_data_quality_col = False  # append a 'data_quality' column with the share of filled columns per row before NaN filling

# Run code
from app.b_data_cleaning.data_cleaning import get_clean_data
df_cleaned = get_clean_data(
    data_version,
    recache_raw_data=recache_raw_data,
    redo_data_cleaning=redo_data_cleaning,
    comp_col=comp_col,
    time_cols=time_cols,
    industry_col=industry_col,
    required_filled_cols_before_filling=required_filled_cols_before_filling,
    required_filled_cols_after_filling=required_filled_cols_after_filling,
    drop_threshold_row_pct=drop_threshold_row_pct,
    drop_threshold_row_quantile=drop_threshold_row_quantile,
    drop_threshold_col_pct=drop_threshold_col_pct,
    append_data_quality_col=append_data_quality_col,
)
print(df_cleaned)
```

### 3. Data prep / feature engineering (folder: c)

- Feature engineering (ratios / pct-change / feature selection / etc. - see code here and the snippet below this list)
- Data normalization (mean-std-dev normalization on block / time-step / company-time-step level - see code here; a minimal sketch follows below)
- Data separation into training sets (caching the final datasets for training to increase speed during training and feature tuning - see code here)

```python
from app.c_data_prep.i_feature_engineering import feature_engerneeing

# Pass the cleaned data from step 2
comp_col = 'ric'
time_cols = ['data_year', 'data_qrt']
industry_col = 'industry'
df_feature_engineered = feature_engerneeing(dataset=df_cleaned, comp_col=comp_col, time_cols=time_cols, industry_col=industry_col)
```
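For intuition, here is a minimal sketch of mean-std (z-score) normalization at different grouping levels. This is my own illustration, not the repository's code; the helper `normalize_zscore` and the grouping by `(data_year, data_qrt)` for the 'time' level are assumptions based on the dataset's time columns.

```python
import pandas as pd

def normalize_zscore(df: pd.DataFrame, cols: list, group_cols: list = None) -> pd.DataFrame:
    """Mean-std normalize `cols`; if `group_cols` is given, within each group."""
    if group_cols is None:  # 'set' level: one mean/std over the whole dataset
        return (df[cols] - df[cols].mean()) / df[cols].std()
    grouped = df.groupby(group_cols)[cols]  # e.g. 'time' level: per time step
    return (df[cols] - grouped.transform('mean')) / grouped.transform('std')

# Example: normalize a feature within each (data_year, data_qrt) time step
df = pd.DataFrame({
    'data_year': [2019, 2019, 2020, 2020],
    'data_qrt': [1, 1, 1, 1],
    'roe': [0.10, 0.14, 0.08, 0.12],
})
print(normalize_zscore(df, ['roe'], group_cols=['data_year', 'data_qrt']))
```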

Warning: On non-UNIX systems (e.g. Windows) this part of the code might crash, because it uses multi-core processing and by default spawns as many simultaneous processes as there are CPU cores, which can bring the system down. Instead of the automatic setting num_processes=-1, set the number of parallel processes to something your machine can handle.
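As a hedged illustration (the parameter name num_processes comes from the warning above; the snippet itself is my own sketch, not the project's API), one way to pick a safer worker count:

```python
import multiprocessing as mp

# Use all cores minus two instead of the automatic num_processes=-1,
# leaving headroom for the OS and other applications.
num_processes = max(1, mp.cpu_count() - 2)
print(f"Using {num_processes} parallel processes")
```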

```python
from app.c_data_prep.ii_data_prep import data_prep

# Pass the cleaned and feature-engineered data from the step above (3.1)
y_cols = ['roe']
time_cols = ['data_year', 'data_qrt']
comp_col = 'ric'
keep_raw_cols = []  # columns not to normalize
drop_cols = []  # columns to drop

# Initialize the data class
data = data_prep(dataset=df_feature_engineered, y_cols=y_cols, iter_cols=time_cols, comp_col=comp_col, keep_raw_cols=keep_raw_cols, drop_cols=drop_cols)

# Define the rolling-window size
data.window(input_width=5*4, pred_width=4, shift=1)

# Select one of the three split methods
#data.single_time_rolling(val_time_steps=1, test_time_steps=1, shuffle=True)
data.block_rolling_split(val_comp_size=0, test_comp_size=0, train_time_steps=5*4*2, val_time_steps=1, test_time_steps=1, shuffle=True)
#data.block_static_split(val_comp_size=0, test_comp_size=0, val_time_size=0.2, test_time_size=0.1, shuffle=True)

# Select one of the three normalization levels
#data.normalize(method='block')
data.normalize(method='time')
#data.normalize(method='set')

# Compute/cache the datasets
data.compute()

# Get the dataset of one time step
out = data['200201_201903']
ds_train, ds_val, ds_test = data.tsds_dataset(out='all', out_dict=None)

# Export the last time-step dataset to Excel for inspection
data.export_to_excel()
```

### 4. Analysis/Prediction (folder: e)

All models (all results):

1. Baseline models (code / validation results / test results / all runs; a minimal sketch of these baselines follows the list)
   - last value: prediction = last period's value
   - average value: prediction = average of the last n periods (best n determined on the validation dataset)
   - static value: prediction = value x (best x determined on the validation dataset)
2. Regression (code / results / p-values / coefficients)
   - replication of Lev & Thiagarajan (1993)
3. Time-series models: ARIMA (code / results & best models / runs)
4. Machine Learning: XGBoost (code / results / best models / all runs)
5. Neural Networks
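As referenced in item 1 above, here is a minimal sketch of the three baseline predictors. It is my own illustration of the definitions given in the list, not the thesis code; the function names and the toy ROE series are hypothetical.

```python
import numpy as np

def last_value_baseline(history: list) -> float:
    """Prediction = the last period's value."""
    return history[-1]

def average_value_baseline(history: list, n: int) -> float:
    """Prediction = mean of the last n periods (n tuned on the validation set)."""
    return float(np.mean(history[-n:]))

def static_value_baseline(x: float) -> float:
    """Prediction = a constant value x (x tuned on the validation set)."""
    return x

roe_history = [0.10, 0.12, 0.11, 0.13]  # toy quarterly ROE series
print(last_value_baseline(roe_history))        # 0.13
print(average_value_baseline(roe_history, 2))  # 0.12
print(static_value_baseline(0.11))             # 0.11
```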