Automatically convert documents in Polish from older encodings into UTF-8

marekyggdrasil/polishify

polishify

Setup

Simply run

pip install polishify

Usage

If you have some text in Polish whose characters look garbled, it might be encoded with windows-1250 or iso-8859-2 instead of UTF-8. If your file is sometext.txt, you can run

polishify sometext.txt

and it will show you something like

detected encoding is:  windows-1250

If you want the file converted to UTF-8, pass an output path as the second argument:

polishify sometext.txt properly-encoded.txt
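To make the idea of detecting and converting an encoding more concrete, here is a minimal Python sketch. It is not polishify's actual implementation: the candidate list, the scoring rule (counting Polish diacritics rather than matching against a word dataset), and the function names `detect_encoding` and `convert_to_utf8` are all assumptions made for illustration.

```python
# Hypothetical sketch, NOT polishify's real code: guess the encoding by
# decoding with each candidate and counting Polish-specific letters.
POLISH_LETTERS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")
CANDIDATES = ("utf-8", "windows-1250", "iso-8859-2")  # assumed candidate list


def score(text):
    """Count characters that are Polish diacritic letters."""
    return sum(ch in POLISH_LETTERS for ch in text)


def detect_encoding(raw, candidates=CANDIDATES):
    """Return the candidate whose decoding yields the most Polish letters.

    Candidates that cannot decode the bytes are skipped; on a tie the
    earlier candidate wins. Returns None if nothing decodes.
    """
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        current = score(text)
        if current > best_score:
            best, best_score = enc, current
    return best


def convert_to_utf8(raw):
    """Decode with the detected encoding and re-encode as UTF-8."""
    enc = detect_encoding(raw)
    if enc is None:
        raise ValueError("none of the candidate encodings could decode the input")
    return raw.decode(enc).encode("utf-8")
```

A letter count like this is only a rough heuristic; a word-based dataset, as the package uses, is likely to be more robust on short or unusual texts.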

If you run it from a bash script, you might not want to see any output. The script supports a silent mode:

polishify sometext.txt properly-encoded.txt --silent
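For bulk conversion, the silent invocation above can be wrapped in a small driver. The following Python sketch assumes the `polishify` command is on your PATH and that it exits with a non-zero status on failure; the helper names `build_command` and `convert_all`, the directory layout, and the `*.srt` pattern are hypothetical and only serve as an example.

```python
# Hypothetical batch driver around the polishify CLI shown above.
import subprocess
from pathlib import Path


def build_command(src, dst):
    """Build the argument list for one silent conversion."""
    return ["polishify", str(src), str(dst), "--silent"]


def convert_all(src_dir, dst_dir, pattern="*.srt"):
    """Convert every file matching `pattern` in src_dir into dst_dir.

    Assumes polishify is installed; check=True raises CalledProcessError
    if the command returns a non-zero exit status.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_dir.glob(pattern)):
        subprocess.run(build_command(src, dst_dir / src.name), check=True)
```

Calling `convert_all("subtitles", "subtitles-utf8")` would then write a converted copy of each matching file while leaving the originals untouched.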

The package ships with a dataset of words containing Polish letters. If you prefer, you can supply your own dataset file, for example dataset.json:

polishify sometext.txt properly-encoded.txt --silent --dataset dataset.json

We also provide a tool that generates such a dataset from a text file:

polishify-extract sometext.txt dataset.json --encoding windows-1250

Author

Made by Marek Narożniak, for the world and especially for people whose family members need subtitles in Polish and who want to bulk-convert their encodings. No warranty provided. Licensed under GPL-3.