Skip to content

A number of tools that can be used in conjunction with REMEDI, for example tools to do the pre-/post- processing of text prior to translation

Notifications You must be signed in to change notification settings

NationalCrimeAgency/remedi-tools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

REMEDI Tools

This project contains a number of tools that can be used in conjunction with REMEDI, for example tools to do the pre-/post- processing of text prior to translation.

Filter

Takes a directory of JSON files and filters them based on certain properties. The expected format of the JSON is as follows:

{
  "source": "The source text in a foreign language",
  "translation": "The translated text in the target language",
  "rating": 4,   # The rating of the translation
  "correctedTranslation": "Optional corrected translation (null if translation hasn't been corrected)",
  "sourceLanguage": "The source language of the text",
  "targetLanguage": "The target language of the text",
  "author": "The person providing the rating and/or translation",
  "timestamp": 1531304578240    # The timestamp that the rating/correction was provided
}

The command to filter the data is as follows:

java -cp filter-1.0-SNAPSHOT.jar uk.gov.nca.remedi.filter.JsonFilter \
    -i [Input Directory] \
    -o [Output file root, excluding the language extension] \
    -l [Language code for the foreign language] \
    -n [Name of the foreign language] \
    -c [Cut off (minimum rating) - optional]

The filter will output two files (one English, and one foreign language) containing sentence pairs (one per line), that match the specified language, and have a corrected translation provided or a rating equal to or greater than the cut off.

This tool is licensed under ASLv2.

Processor

Performs pre- and post- processing tasks on text, primarily for use in preparing training data, and being used by the REMEDI processor server.

Contains 5 separate classes which can be ran individually:

  • CleanParallelCorpus - Takes a parallel corpus and performs some cleaning tasks on it such as deduplication and removing empty lines
  • PostProcessor - Takes a translated string, detokenizes it and performs true casing on it
  • PrepareTrainingData - Tokenizes text from a file, and transliterates Cyrillic characters using the GOST 2002b standard
  • PreProcessor - Tokenizes and transliterates text, and detects language if required
  • ShrinkParallelCorpus - Randomly selects a subset of sentences from a parallel corpus to reduce the size of a dataset

This tool is licensed under ASLv2.

Stanford Truecaser

Uses Stanford CoreNLP to perform true casing of lower case text. It implements the uk.gov.nca.remedi.utils.TrueCaser interface, allowing it to be used as part of the post processor for correcting the case on translation output.

This tool is licensed under GPLv3.

About

A number of tools that can be used in conjunction with REMEDI, for example tools to do the pre-/post- processing of text prior to translation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages