Any work around to retain original form of words ? #5

Closed
PSanni opened this issue Jul 27, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

PSanni commented Jul 27, 2022

The model currently cannot retain the original form of words. For example, if the words in the image are "sunflower oil", it returns "sunfloweroil" without the space. Is there any workaround to address this?

Also, is it possible to fine-tune this model on another dataset such as XFUND (https://github.com/doc-analysis/XFUND)?

bmusq commented Jul 27, 2022

Hello @PSanni,

For your first problem, namely retaining the original form of words, I do not know how to address it.

For your second question, though, I was able to use another dataset of my own (it is currently being trained). Here is the solution I came up with. I hope it can be applied to your use case.

This project makes use of the datasets from this other project, https://github.com/ku21fan/STR-Fewer-Labels, as mentioned in Datasets.md, with a few workarounds.
If you look into this other project, you will find a section in the Readme.md named "When you need to train on your own dataset or Non-Latin language datasets.". I bet the name is explicit enough.
They provide code in create_lmdb_dataset.py, along with the input format it expects, to generate a dataset formatted for their algorithm and, a fortiori, for parseq as well.

I thoroughly followed the instructions and was able to start a training run with parseq on my own dataset.

Edit: the training terminates, but the test shows really inconsistent results. Maybe the .mdb file is still problematic. I am exploring this issue.

Owner

baudm commented Jul 27, 2022

@PSanni for now, you can just directly edit the code and comment out this line:

label = ''.join(label.split())

Note that some preprocessed datasets have had the spaces within labels removed. For the datasets which I preprocessed (COCO, OpenVINO, TextOCR), the spaces within the labels should be intact.
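
As a small illustration of why the space disappears, the snippet below runs that same expression on a sample label. The variable names are only for illustration and are not taken from the project.

```python
# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so joining the pieces with '' removes all of it.
label = "sunflower oil"
stripped = ''.join(label.split())
print(stripped)  # sunfloweroil

# Tabs and trailing newlines are removed in the same way.
messy = "sunflower\toil\n"
print(''.join(messy.split()))  # sunfloweroil
```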

For fine-tuning on other datasets, you have two options:

  1. Write your own Dataset subclass which follows the same public interface as LmdbDataset.
  2. Preprocess your dataset into an LMDB database: see one of the converter scripts in tools to write your own preprocessing script, then use create_lmdb_dataset.py to create the actual LMDB files.
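
For option 1, a rough sketch of what such a class might look like is given below. It assumes the public interface boils down to the map-style protocol (`__len__`, and `__getitem__` returning an (image, label) pair); the thread does not spell the interface out, so check LmdbDataset itself before relying on this. The names `PairListDataset` and `read_bytes` are hypothetical. To stay dependency-free the sketch does not import torch; for real training the class would derive from torch.utils.data.Dataset and the loader would decode an actual image.

```python
from pathlib import Path
from typing import Any, Callable, Iterable, Optional, Tuple


def read_bytes(path: str) -> bytes:
    """Placeholder loader: returns raw file bytes instead of a decoded image."""
    return Path(path).read_bytes()


class PairListDataset:
    """Hypothetical map-style dataset over (image_path, label) pairs."""

    def __init__(
        self,
        pairs: Iterable[Tuple[str, str]],
        loader: Callable[[str], Any] = read_bytes,
        transform: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        self.pairs = list(pairs)
        self.loader = loader
        self.transform = transform

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, index: int) -> Tuple[Any, str]:
        path, label = self.pairs[index]
        image = self.loader(path)
        if self.transform is not None:
            image = self.transform(image)
        # Labels are returned untouched, so inner spaces survive.
        return image, label
```

Passing the loader in as a parameter keeps the class testable without image files; any callable that maps a path to an image object can be substituted.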

baudm added a commit that referenced this issue Jul 28, 2022
- Expose normalize_unicode parameter of LmdbDataset
- Add remove_whitespace flag for disabling whitespace removal in labels
Owner

baudm commented Jul 28, 2022

@PSanni since commit e8ea463, you can now disable whitespace removal and/or Unicode normalization like so:
./train.py data.remove_whitespace=false data.normalize_unicode=false
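
To give a feel for what the two flags control, here is a minimal sketch of label cleaning with both switches. The function `clean_label` is hypothetical, and the Unicode step (NFKD decomposition followed by dropping non-ASCII characters) is an assumption about what normalization means here, not something confirmed in this thread.

```python
import unicodedata


def clean_label(label: str, remove_whitespace: bool = True,
                normalize_unicode: bool = True) -> str:
    """Hypothetical label cleaning controlled by two flags."""
    if remove_whitespace:
        # Drop every whitespace character, including inner spaces.
        label = ''.join(label.split())
    if normalize_unicode:
        # Assumed behaviour: decompose accented characters, then keep ASCII only.
        label = unicodedata.normalize('NFKD', label).encode('ascii', 'ignore').decode()
    return label


print(clean_label("sunflower oil"))                           # sunfloweroil
print(clean_label("sunflower oil", remove_whitespace=False))  # sunflower oil
print(clean_label("caf\u00e9"))                               # cafe
```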

Author

PSanni commented Jul 28, 2022

I think it's a good idea to include annotation samples and the required input format for the model.

Owner

baudm commented Jul 28, 2022

The LMDB format used is unchanged from prior work. create_lmdb_dataset.py expects a text file with one image path and label per line. The actual format is described in the README for the TextOCR and OpenVINO archives.
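
As a rough sketch of producing such a text file from your own annotations, the helper below writes one image path and label per line. The function name `write_gt_file` and the tab separator are assumptions made for illustration; check the READMEs mentioned above for the exact separator before using it.

```python
from pathlib import Path
from typing import Iterable, Tuple


def write_gt_file(pairs: Iterable[Tuple[str, str]], gt_path: str,
                  sep: str = "\t") -> None:
    """Write one '<image path><sep><label>' entry per line (separator is assumed)."""
    lines = []
    for image_path, label in pairs:
        # A newline in the label or a separator in the path would corrupt the file.
        if "\n" in label or sep in image_path:
            raise ValueError(f"cannot serialize entry: {image_path!r}")
        lines.append(f"{image_path}{sep}{label}")
    Path(gt_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


write_gt_file([("images/0001.png", "sunflower oil")], "gt.txt")
```

Labels are written verbatim, so inner spaces such as the one in "sunflower oil" are preserved in the file.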

The conversion from text labels to token IDs is handled by Tokenizer.encode() (in strhub/data/utils.py).

Owner

baudm commented Jul 29, 2022

> @PSanni since commit e8ea463, you can now disable whitespace removal and/or Unicode normalization like so:
> ./train.py data.remove_whitespace=false data.normalize_unicode=false

In addition to disabling whitespace (spaces, tabs, newlines, etc.) removal, make sure you add the space character ' ' to charset_train and charset_test so it won't get removed by CharsetAdapter.
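
To illustrate why the charset matters, here is a minimal sketch of a filter that drops every character not in a given charset. The function `filter_to_charset` is a hypothetical stand-in, not the actual CharsetAdapter code, which may do more than this.

```python
import re


def filter_to_charset(label: str, charset: str) -> str:
    """Remove every character that is not part of charset (illustrative only)."""
    unsupported = re.compile(f"[^{re.escape(charset)}]")
    return unsupported.sub("", label)


lowercase = "abcdefghijklmnopqrstuvwxyz"
print(filter_to_charset("sunflower oil", lowercase))        # sunfloweroil
print(filter_to_charset("sunflower oil", lowercase + " "))  # sunflower oil
```

With the space missing from the charset, the label loses its space even when whitespace removal is disabled, which is why both settings need to change together.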

Closing this now since all issues have been addressed already. Feel free to reopen if I missed anything.

@baudm baudm closed this as completed Jul 29, 2022
@baudm baudm added the enhancement New feature or request label Aug 8, 2022