Named Entity Recognition to detect landmarks in German texts

ConstantinSchmidts/2022_bavarian_landmarks

LandmarkNER - Identify Bavarian Landmarks in Text

This repo contains code to identify landmarks in subtitles of videos from Bayerischer Rundfunk (BR). To this end, a custom Named Entity Recognition (NER) model was trained with spaCy. The model builds on de_dep_news_trf, the pretrained German transformer pipeline included in spaCy, which is based on bert-base-german-cased. An NER component was created for this pipeline and fine-tuned on a corpus of subtitles annotated with Prodigy. The original subtitle files are not included in this repo.

Scripts

Notebooks

Data

Usage

You can adapt the scripts to a custom NER label of your choice. Start by creating a corpus from .txt files. Then create a pattern.jsonl file from a collection of example labels and use it to generate training data for an initial model. Train this initial model with the spaCy command line interface. Next, correct the initial model's predictions on the corpus with the custom Prodigy annotation recipe to obtain high-quality NER training data, and split the result into training and validation data. With these data, you can fine-tune the German BERT model. For this project, the LandmarkNER model was fine-tuned in a notebook in Google Colab, using the base config. Finally, create a test corpus from fresh subtitles, annotate it fully with the standard Prodigy recipe ner.manual, and evaluate your model on this test set.
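The pattern-file and split steps can be sketched in plain Python. This is a minimal illustration, not the repo's actual scripts: the label name `LANDMARK`, the helper names `make_patterns`, `to_jsonl` and `split_examples`, and the 80/20 split are all assumptions. It also assumes the example labels are plain names and splits them on whitespace, which only approximates spaCy's tokenizer (hyphenated or punctuated names may tokenize differently).

```python
import json
import random


def make_patterns(names, label="LANDMARK"):
    """Build one token-based match pattern per example name.

    Each name is split on whitespace (an approximation of real
    tokenization) and matched case-insensitively via the "lower" key.
    """
    patterns = []
    for name in names:
        tokens = name.split()
        if not tokens:  # skip empty or whitespace-only entries
            continue
        patterns.append(
            {"label": label, "pattern": [{"lower": t.lower()} for t in tokens]}
        )
    return patterns


def to_jsonl(records):
    """Serialize records as JSONL text: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def split_examples(examples, dev_fraction=0.2, seed=0):
    """Shuffle reproducibly and split into (train, dev) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Illustrative usage with made-up example names.
example_names = ["Schloss Neuschwanstein", "Frauenkirche"]
print(to_jsonl(make_patterns(example_names)), end="")
```

The text returned by `to_jsonl` could be saved as pattern.jsonl. Fixing the seed in `split_examples` keeps the training and validation sets stable across runs, which makes later evaluations comparable.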

Model

The trained LandmarkNER model is available on the Hugging Face hub.

mGenre can be used to disambiguate detected entities to Wikipedia titles. The resulting pipeline can be tested in an interactive web app on Hugging Face Spaces.

PoC for multimodal video understanding

Furthermore, this model is used in a proof of concept that extracts landmark names and frames showing landmarks from a video, using descriptive texts or subtitles associated with timecodes. OWL-ViT (an open-vocabulary object detection model) serves as a building detector. The text is analyzed with the LandmarkNER model, and its output is disambiguated to Wikipedia titles by mGenre. For timecodes at which OWL-ViT detects a building in the video and a landmark name is detected in the text, the frames and the associated landmark names are extracted. The proof of concept notebook can be executed in Colab.
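The matching step between the two modalities can be illustrated with a small sketch. It assumes the detectors have already run: `TextSegment` and `match_frames` are hypothetical names, the text segments are assumed to carry start and end times in seconds plus the disambiguated landmark titles, and the building detections are reduced to a list of frame timestamps. The notebook's actual data structures may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextSegment:
    start: float          # segment start in seconds
    end: float            # segment end in seconds
    landmarks: List[str]  # disambiguated landmark titles found in the text


def match_frames(
    segments: List[TextSegment], building_times: List[float]
) -> List[Tuple[float, str]]:
    """Pair each building-detection timestamp with the landmarks named
    in any text segment that covers that timestamp."""
    matches = []
    for t in sorted(building_times):
        for seg in segments:
            if seg.start <= t <= seg.end:
                for name in seg.landmarks:
                    matches.append((t, name))
    return matches


# Illustrative input: made-up segments and detection timestamps.
segments = [
    TextSegment(0.0, 5.0, ["Schloss Neuschwanstein"]),
    TextSegment(5.5, 9.0, []),  # text without a landmark name
    TextSegment(10.0, 15.0, ["Frauenkirche (München)"]),
]
building_times = [2.0, 6.0, 12.0, 20.0]

for timestamp, landmark in match_frames(segments, building_times):
    print(f"{timestamp:.1f}s -> {landmark}")
```

Only timestamps where both signals agree are kept: a building detected while the text names no landmark (6.0 s) or outside any text segment (20.0 s) yields no frame.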

License

MIT License