FiNER: Financial Numeric Entity Recognition for XBRL Tagging

Publicly traded companies are required to submit periodic reports with eXtensible Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We therefore introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context rather than the token itself. We show that subword fragmentation of numeric expressions harms BERT's performance, allowing word-level BiLSTMs to perform better. To improve BERT's performance, we propose two simple and effective solutions that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes. We also experiment with FIN-BERT, an existing BERT model for the financial domain, and release our own BERT (SEC-BERT), pre-trained on financial filings, which performs best. Through data and error analysis, we finally identify possible limitations to inspire future work on XBRL tagging.


Citation Information

Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos and George Paliouras
FiNER: Financial Numeric Entity Recognition for XBRL Tagging
In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022) (Long Papers), Dublin, Republic of Ireland, May 22-27, 2022

@inproceedings{loukas-etal-2022-finer,
    title = {FiNER: Financial Numeric Entity Recognition for XBRL Tagging},
    author = {Loukas, Lefteris and
      Fergadiotis, Manos and
      Chalkidis, Ilias and
      Spyropoulou, Eirini and
      Malakasiotis, Prodromos and
      Androutsopoulos, Ion and
      Paliouras, George},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022)},
    publisher = {Association for Computational Linguistics},
    location = {Dublin, Republic of Ireland},
    year = {2022},
    url = {https://arxiv.org/abs/2203.06482}
}

Table of Contents

  • Dataset and Supported Task
  • Dataset Repository
  • Models Repository
  • Install Python and Project Requirements
  • Running an Experiment
  • Setting up the Experiment's Parameters

Dataset and Supported Task

FiNER-139 comprises 1.1M sentences annotated with eXtensible Business Reporting Language (XBRL) tags, extracted from annual and quarterly reports of publicly traded companies in the US. Unlike other entity extraction tasks, such as named entity recognition (NER) or contract element extraction, which typically require identifying entities from a small set of common types (e.g., persons, organizations), FiNER-139 uses a much larger label set of 139 entity types. Another important difference from typical entity extraction is that FiNER-139 focuses on numeric tokens, where the correct tag depends mostly on context, not the token itself.

To promote transparency among shareholders and potential investors, publicly traded companies are required to file periodic financial reports annotated with tags from the eXtensible Business Reporting Language (XBRL), an XML-based language designed to facilitate the processing of financial information. However, manually tagging reports with XBRL tags is tedious and resource-intensive. We therefore introduce XBRL tagging as a new entity extraction task for the financial domain and study how financial reports can be automatically enriched with XBRL tags. To facilitate research towards automated XBRL tagging, we release FiNER-139.

Dataset Repository

FiNER-139 is available at HuggingFace Datasets and can be loaded as follows:

import datasets

finer = datasets.load_dataset("nlpaueb/finer-139")

Note: You don't need to download the dataset manually; load_dataset fetches and caches it automatically.
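
As a quick sanity check, you can inspect the splits and the label set. The sketch below assumes the usual Hugging Face token-classification layout, with tokens and ner_tags columns whose label names are exposed through the dataset features:

import datasets

finer = datasets.load_dataset("nlpaueb/finer-139")

# Print the available splits and their sizes (train/validation/test)
print(finer)

# Map the integer tags of the first training sentence back to IOB2 labels
label_names = finer["train"].features["ner_tags"].feature.names
sample = finer["train"][0]
for token, tag_id in zip(sample["tokens"], sample["ner_tags"]):
    print(token, label_names[tag_id])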


Models Repository

The SEC-BERT models are available at HuggingFace and can be loaded as follows:

SEC-BERT-BASE

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-base")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-base")

SEC-BERT-NUM

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-num")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-num")

SEC-BERT-SHAPE

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/sec-bert-shape")
model = AutoModel.from_pretrained("nlpaueb/sec-bert-shape")
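
SEC-BERT-NUM and SEC-BERT-SHAPE expect numeric expressions to be replaced with pseudo-tokens before tokenization: a single [NUM] token for SEC-BERT-NUM, and shape tokens such as [XX.X] for SEC-BERT-SHAPE (see also the replace_numeric_values parameter below). The snippet below is a minimal illustrative sketch of this preprocessing; to_pseudo_token is a hypothetical helper, and the exact replacement rules used in the paper may differ:

import re

def to_pseudo_token(token, mode="shape"):
    # Treat a token as numeric if it contains only digits, commas and periods
    if re.fullmatch(r"[\d,.]*\d[\d,.]*", token):
        if mode == "num":
            return "[NUM]"  # SEC-BERT-NUM replaces every number with [NUM]
        return "[" + re.sub(r"\d", "X", token) + "]"  # e.g. 23.5 -> [XX.X]
    return token

words = ["Total", "revenue", "increased", "by", "23.5", "million"]
print([to_pseudo_token(w) for w in words])
# ['Total', 'revenue', 'increased', 'by', '[XX.X]', 'million']

These pseudo-tokens are part of the corresponding model's vocabulary, so common numeric expressions are no longer fragmented into multiple subwords.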

Note: You don't need to download any model manually; from_pretrained fetches and caches them automatically.


Install Python and Project Requirements

It is recommended to create a virtual environment first via Python's venv module or Anaconda's conda.
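
For example, using venv (the environment name below is arbitrary):

python3 -m venv finer-env
source finer-env/bin/activate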

pip install -r requirements.txt

The requirements.txt file pins the following dependencies:

click
datasets==2.1.0
gensim==4.2.0
regex
scikit-learn>=1.0.2
seqeval==1.2.2
tensorflow==2.8.0
tensorflow-addons==0.16.1
tf2crf==0.1.24
tokenizers==0.12.1
tqdm
transformers==4.18.0
wandb==0.12.16
wget

Running an Experiment

To run an experiment we call the main script run_experiment.py, located at the root of the project, providing the following arguments:

  • method: neural model to run (possible values: transformer, bilstm)
  • mode: mode of the experiment. The following modes can be selected:
    • train: train a single model
    • evaluate: evaluate a pre-trained model

To run a training experiment with a transformer model we execute:

python run_experiment.py --method transformer --mode train
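
Similarly, to evaluate a pre-trained model (selected via the pretrained_model parameter described below) we execute:

python run_experiment.py --method transformer --mode evaluate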

Setting up the Experiment's Parameters

We set the parameters of an experiment by editing the corresponding configuration file, located in the configurations folder of the project. This folder contains three JSON configuration files (bilstm.json, transformer.json, transformer_bilstm.json) where we can select the parameters of the experiment we would like to run.

If we want to run a transformer experiment we need to edit the parameters of transformer.json. These parameters are grouped into four groups:

  1. train_parameters: contains the major parameters of the experiment

    • model_name: transformer model we would like to train (e.g. bert-base-uncased, sec-bert-base, sec-bert-num, sec-bert-shape)

    • max_length: max length in tokens of the input sample.

    • replace_numeric_values: boolean flag indicating whether to replace numeric values with the special shape tokens, e.g.:

      23.5 -> [XX.X]
      
    • subword_pooling: what subword pooling to perform (possible values are: all, first, last)

    • use_fast_tokenizer: boolean flag indicating whether to use fast tokenizers or not

  2. general_parameters: general parameters of the experiment

    • debug: boolean flag indicating whether to enable debug mode.
      In debug mode we select only a small portion of the dataset (100 samples from each of the train, validation and test splits) and also enable TensorFlow's eager execution.

    • loss_monitor: the metric monitored by the early-stopping and reduce-learning-rate-on-plateau TensorFlow callbacks.
      Possible values are: val_loss, val_micro_f1 and val_macro_f1.

    • early_stopping_patience: used by the early-stopping TensorFlow callback; the number of epochs to wait without improvement of loss_monitor before training stops.

    • reduce_lr_patience: used by the reduce-learning-rate-on-plateau TensorFlow callback; the number of epochs to wait without improvement of loss_monitor before the learning rate is halved.

    • reduce_lr_cooldown: used by the reduce-learning-rate-on-plateau TensorFlow callback; the number of epochs to wait before resuming normal operation after the learning rate has been reduced.

    • epochs: maximum number of iterations (epochs) over the corpus. Usually set to a large value, letting early stopping end training once patience is reached.

    • batch_size: number of samples per gradient update.

    • workers: workers that create samples during model fit. Choose enough workers to saturate the GPU utilization.

    • max_queue_size: max samples in queue. Choose a large number to saturate the GPU utilization.

    • use_multiprocessing: boolean flag indicating the use of multi-processing for generating samples

    • wandb_entity: insert your Weights & Biases username or team to log the run

    • wandb_project: insert the project's name where the run will be saved.



  3. hyper_parameters: model hyper-parameters to use when training a single model

    • learning_rate: the learning rate of the Adam optimizer

    • n_layers: number of stacked BiLSTM layers

    • n_units: number of units in each BiLSTM layer

    • dropout_rate: randomly sets input units to 0 with a frequency of dropout_rate

    • crf: boolean flag indicating the use of a CRF layer

  4. evaluation: evaluation parameters of the experiment

    • pretrained_model: the name of the pretrained model to use when evaluate mode is selected.
      The name is the folder name of the experiment we want to re-evaluate (located at /data/experiments/runs), e.g. FINER139_2022_01_01_00_00_00.

    • splits: list of dataset splits to evaluate (e.g. validation, test)
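
Putting it all together, a transformer.json could look roughly like the sketch below. This is illustrative only: the values are placeholders, and the exact key names and nesting should be checked against the file shipped in the configurations folder:

{
  "train_parameters": {
    "model_name": "sec-bert-base",
    "max_length": 128,
    "replace_numeric_values": true,
    "subword_pooling": "first",
    "use_fast_tokenizer": true
  },
  "general_parameters": {
    "debug": false,
    "loss_monitor": "val_micro_f1",
    "early_stopping_patience": 3,
    "reduce_lr_patience": 1,
    "reduce_lr_cooldown": 0,
    "epochs": 100,
    "batch_size": 8,
    "workers": 4,
    "max_queue_size": 100,
    "use_multiprocessing": true,
    "wandb_entity": "your-entity",
    "wandb_project": "finer"
  },
  "hyper_parameters": {
    "learning_rate": 5e-5,
    "n_layers": 2,
    "n_units": 128,
    "dropout_rate": 0.1,
    "crf": false
  },
  "evaluation": {
    "pretrained_model": "FINER139_2022_01_01_00_00_00",
    "splits": ["validation", "test"]
  }
}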

