The corresponding blog post can be found here.
The classifier is implemented in the script classifier.py, which can be found in the directory code/. The script expects data partitioned into train and test directories, each containing files with the following names:
source_ht
: A text file containing the source sentences that were translated by a human
trans_ht
: A text file containing the target sentences translated by a human
source_mt
: A text file containing the source sentences that were translated by a machine
trans_mt
: A text file containing the target sentences translated by a machine
The files are aligned by line number: the sentence on a given line of a source file corresponds to the sentence on the same line of the matching translation file (source_ht with trans_ht, and source_mt with trans_mt).
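The line-by-line alignment described above can be illustrated with a minimal sketch. The helper name load_aligned_pairs is hypothetical and is not taken from classifier.py; it only assumes that each file holds one sentence per line and is UTF-8 encoded.

```python
from pathlib import Path


def load_aligned_pairs(directory, source_name, trans_name):
    """Pair each source sentence with the translation on the same line.

    Hypothetical helper for illustration; classifier.py may read the
    files differently.
    """
    source_lines = Path(directory, source_name).read_text(encoding="utf-8").splitlines()
    trans_lines = Path(directory, trans_name).read_text(encoding="utf-8").splitlines()

    # Alignment is purely positional, so differing line counts mean
    # the two files cannot be paired reliably.
    if len(source_lines) != len(trans_lines):
        raise ValueError(
            f"{source_name} has {len(source_lines)} lines but "
            f"{trans_name} has {len(trans_lines)}"
        )
    return list(zip(source_lines, trans_lines))
```

For example, load_aligned_pairs("data_for_code/train", "source_ht", "trans_ht") would return the human-translated pairs, and the same call with "source_mt" and "trans_mt" would return the machine-translated ones.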
By default, the script uses the data provided in the directory data_for_code/. To specify which aligned sentence pairs to use as training data, pass the "-tr" flag followed by the directory where the training data is stored. To specify which aligned sentence pairs to use as test data, pass the "-te" flag followed by the directory where the test data is stored. Without any parameters, the classifier trains on the aligned sentence pairs in data_for_code/train and tests on the aligned sentence pairs in data_for_code/dev.
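As a rough sketch of how the two flags and their defaults behave, the following uses Python's argparse module. This is an assumption made for illustration: the actual option parsing in classifier.py may be implemented differently, and the names build_parser, train_dir, and test_dir are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the -tr / -te flags described above."""
    parser = argparse.ArgumentParser(
        description="Train and test a human-vs-machine translation classifier."
    )
    # Defaults follow the README: train on data_for_code/train,
    # test on data_for_code/dev.
    parser.add_argument(
        "-tr", dest="train_dir", default="data_for_code/train",
        help="directory containing the training files",
    )
    parser.add_argument(
        "-te", dest="test_dir", default="data_for_code/dev",
        help="directory containing the test files",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("training on:", args.train_dir)
    print("testing on:", args.test_dir)
```

Under this sketch, running the script with no arguments selects the default directories, while passing -tr and -te, each followed by a directory, overrides them.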
By default, the classifier uses a Support Vector Machine. To change which type of classifier is used, uncomment one of the lines between lines 173 and 178 of classifier.py. This is not currently available as a command line argument.
For any questions or comments, please email me at [email protected]