Automatic Complexity Assessment of German Sentences

Team Members

Leo Nguyen
Raoul Berger
Konrad Straube
Till Nocher

Mail Addresses

[email protected]
[email protected]
[email protected]

Pretrained Models

pretrained BERT from Deepset AI
pretrained word2vec from NLPL repository (model ID: 45)

Additional Corpora Used

  • TextComplexityDE19: 7 levels of difficulty

    1100 Wikipedia articles, 100 of them in Simple German

  • Deutsche Welle - Deutsch Lernen: 2 levels of difficulty

  • WEEBIT English news corpus

    5 levels of difficulty, 625 documents each

Utilized libraries

Install dependencies

Install all necessary dependencies with:

pipenv install

Download datasets:

pipenv run main --download all

To download a single dataset, replace 'all' with one of 'TextComplexityDE19', 'Weebit' or 'dw'.

Preprocessing and Augmentation

Run preprocessing and augmentation on the datasets and save the results in an h5 file:

pipenv run main --create_h5 --filename example.h5

Additional tags:

  • --dset: digit string selecting the datasets to include, where 0 = 'TextComplexityDE19', 1 = 'Weebit', 2 = 'dw'. Example: --dset 012 selects all three datasets.
  • --lemmatization
  • --stemming
  • --random_swap
  • --random_deletion
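
The --dset argument packs several dataset choices into one digit string. A minimal sketch of how such a string could be decoded is shown here; the index-to-name mapping is taken from the list above, while the helper name parse_dset and its error handling are assumptions made for illustration, not the project's actual code.

```python
# Hypothetical helper: the index-to-name mapping comes from this README,
# the function name and the error handling are illustrative assumptions.
DATASETS = {"0": "TextComplexityDE19", "1": "Weebit", "2": "dw"}


def parse_dset(arg):
    """Turn a digit string such as '012' into a list of dataset names."""
    names = []
    for digit in arg:
        if digit not in DATASETS:
            raise ValueError(f"unknown dataset index: {digit!r}")
        name = DATASETS[digit]
        if name not in names:  # ignore repeated digits
            names.append(name)
    return names


print(parse_dset("012"))  # ['TextComplexityDE19', 'Weebit', 'dw']
print(parse_dset("2"))    # ['dw']
```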

Example: apply lemmatization

pipenv run main --create_h5 --filename example.h5 --lemmatization

Note: basic preprocessing is always applied, regardless of the tags given.
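
To illustrate what the --random_swap and --random_deletion augmentations typically do, here is a small token-level sketch. It assumes the common formulation in which random swap exchanges two randomly chosen tokens and random deletion drops each token with a fixed probability; the function names, the default probability and the rng parameter are assumptions, and the project's own implementation may differ.

```python
import random


def random_swap(tokens, n_swaps=1, rng=random):
    """Swap two randomly chosen token positions, n_swaps times."""
    tokens = list(tokens)  # work on a copy, leave the input untouched
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p=0.1, rng=random):
    """Drop each token independently with probability p, keeping at least one."""
    tokens = list(tokens)
    if len(tokens) <= 1:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    # Never return an empty sentence: fall back to one random token.
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # seeded generator for reproducible augmentation
sentence = "Das ist ein einfacher Satz".split()
print(random_swap(sentence, n_swaps=1, rng=rng))
print(random_deletion(sentence, p=0.2, rng=rng))
```

Both functions return a new list, so the original sentence can be kept alongside its augmented variants.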


Run an experiment for a specific vectorizer and regression method:

pipenv run main --experiment evaluate --filename example.h5 --vectorizer option --method option

Additional tag: --engineered_features (concatenates engineered features to the sentence vector)


  • vectorizer: 'tfidf', 'count', 'hash', 'word2vec', 'pretrained_word2vec'
  • method: 'linear', 'lasso', 'ridge', 'elastic-net', 'random-forest'

Run all combinations of vectorizers and regression methods, with and without engineered features:

pipenv run main --experiment compare_all --filename example.h5
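
As a rough picture of the search space that compare_all covers, the following sketch enumerates every vectorizer, method and engineered-features combination listed in this README. The option names are taken from the lists in this document; the function experiment_grid and the dictionary layout are hypothetical and only serve to show the size of the grid.

```python
from itertools import product

# Option names as listed in this README.
VECTORIZERS = ["tfidf", "count", "hash", "word2vec", "pretrained_word2vec"]
METHODS = ["linear", "lasso", "ridge", "elastic-net", "random-forest"]


def experiment_grid():
    """Yield one configuration per (vectorizer, method, engineered) combination."""
    for vectorizer, method, engineered in product(VECTORIZERS, METHODS, [False, True]):
        yield {
            "vectorizer": vectorizer,
            "method": method,
            "engineered_features": engineered,
        }


grid = list(experiment_grid())
print(len(grid))  # 50 = 5 vectorizers * 5 methods * 2 feature settings
print(grid[0])
```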

Run pretrained BERT + 3-layer regression network:

pipenv run main --experiment train_net --filename example.h5

Additional tags:

  • --save_name name (name to save the trained model under; use it to train multiple models without overwriting the previous one. Default: the name specified with --filename)
  • --engineered_features (concatenate engineered features to sentence vector)

If the h5 file was created from multiple datasets, enable conditional training by providing the tag --multiple_datasets.

The tag --pretask [pretask_epoch, pretask_file] overrides the --multiple_datasets tag. In that case, instead of conditional training, the model is first trained on a pretask (on the provided pretask_file for pretask_epoch epochs) and then fine-tuned on the dataset provided by --filename. Note that the first layer of the model is frozen after the pretask. To allow fine-tuning of the first layer, use the tag --no_freeze.
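
The precedence between these tags can be sketched with argparse. This is only an illustration: it assumes that --pretask takes two space-separated values, and the mode labels "pretask", "conditional" and "single" as well as the function names are hypothetical, not taken from the project.

```python
import argparse


def build_parser():
    """Parser covering only the three training tags discussed above."""
    parser = argparse.ArgumentParser(prog="main")
    parser.add_argument("--multiple_datasets", action="store_true")
    # Assumption: epoch count and file name are passed as two separate values.
    parser.add_argument("--pretask", nargs=2,
                        metavar=("PRETASK_EPOCH", "PRETASK_FILE"))
    parser.add_argument("--no_freeze", action="store_true")
    return parser


def training_plan(args):
    """Resolve which training mode applies; --pretask wins over --multiple_datasets."""
    if args.pretask is not None:
        epochs, pretask_file = args.pretask
        return {
            "mode": "pretask",  # hypothetical label
            "pretask_epochs": int(epochs),
            "pretask_file": pretask_file,
            # The first layer stays frozen unless --no_freeze is given.
            "freeze_first_layer": not args.no_freeze,
        }
    if args.multiple_datasets:
        return {"mode": "conditional"}  # hypothetical label
    return {"mode": "single"}  # hypothetical label


args = build_parser().parse_args(
    ["--multiple_datasets", "--pretask", "5", "pretask.h5"]
)
print(training_plan(args))
```

Here the pretask branch is selected even though --multiple_datasets is also present, matching the rule that --pretask takes priority.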

Hyperparameter tuning for word2vec: linear search along a single hyperparameter (generates plots and saves the results to a txt file):

pipenv run main --search [hyperparameter, start, end, step, model, filename]

  • hyperparameter: 'feature', 'window', 'count', 'epochs', 'lr' or 'min_lr'
  • start: start value of linear search
  • end: end value of linear search
  • step: step size of linear search
  • model: model to tune; 'word2vec' is the only option so far
  • filename: h5 filename to load data from
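
The start, end and step arguments define the grid of values that the linear search visits. The sketch below shows one way to generate such a grid; it assumes that the end value is included, which this README does not state, and the function name linear_search_values is hypothetical.

```python
import math


def linear_search_values(start, end, step):
    """Return start, start + step, ... up to and including end (assumed inclusive)."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Count the steps first so float rounding cannot drop or add a value.
    n = math.floor((end - start) / step + 1e-9) + 1
    return [round(start + i * step, 10) for i in range(max(n, 0))]


print(linear_search_values(2, 10, 2))  # [2, 4, 6, 8, 10]
print(linear_search_values(0.01, 0.05, 0.01))
```

Computing each value as start + i * step, rather than adding step repeatedly, keeps floating-point error from accumulating for hyperparameters such as 'lr'.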

Note: experiment results are saved in the folders 'result', 'figures' and 'models'.


License: MIT


