This project generates USRs (Universal Sentence Representations) that help models translate sentences.
- Introduction
- Format of USR
- Installation
- Usage/Examples
- Requirements
- Documentation
- Contributing
- Screenshots
- License
Welcome to Automated USR Generator, an open-source project developed in Python. This tool generates USRs (Universal Sentence Representations) from a single input sentence or a batch of sentences. USRs support natural language processing and understanding tasks such as sentiment analysis, text classification, and machine translation.
The project is built from several modules. First, the "pos tagger and parser" module parses the input sentences and extracts their POS tags and syntactic dependencies. This gives the subsequent processing steps access to the linguistic structure of each sentence.
Next is the "wx" module, which takes the input in UTF-8 format and converts it into wx_format (WX notation).
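As a rough illustration of the UTF-8 to WX step, the sketch below converts a few Devanagari characters to WX notation. It is a toy and not the project's wx_utf8/utf8_wx tools: the function name `utf8_to_wx_toy` and the small character tables are assumptions made for this example, and only the letters listed in the tables are handled.

```python
# Toy subset of the Devanagari-to-WX mapping, for illustration only.
CONSONANTS = {"क": "k", "त": "w", "द": "x", "न": "n", "म": "m",
              "र": "r", "ल": "l", "स": "s", "ह": "h"}
INDEPENDENT_VOWELS = {"अ": "a", "आ": "A", "इ": "i", "ई": "I"}
# Dependent vowel signs (matras), written as escapes to avoid display issues.
MATRAS = {"\u093e": "A", "\u093f": "i", "\u0940": "I",
          "\u0947": "e", "\u094b": "o"}
SIGNS = {"\u0902": "M"}   # anusvara
VIRAMA = "\u094d"         # suppresses the inherent vowel


def utf8_to_wx_toy(text):
    """Convert the supported Devanagari letters to WX; pass others through."""
    out = []
    pending_a = False  # True while a consonant still carries its inherent 'a'
    for ch in text:
        if ch in MATRAS:
            # A matra replaces the inherent vowel of the previous consonant.
            out.append(MATRAS[ch])
            pending_a = False
            continue
        if ch == VIRAMA:
            # A virama removes the inherent vowel.
            pending_a = False
            continue
        if pending_a:
            out.append("a")
            pending_a = False
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch in SIGNS:
            out.append(SIGNS[ch])
        else:
            out.append(ch)  # unsupported characters are left unchanged
    if pending_a:
        out.append("a")
    return "".join(out)


print(utf8_to_wx_toy("राम"))  # rAma
```

In practice the conversion should be done with the wx_utf8 and utf8_wx tools shipped with the project, which cover the full character set.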
The "morph_call_1.py" module handles the "morph" component. This stage derives and separates the root_word, GNP (Gender, Number, Person) and TAM (Tense, Aspect, Modality) from the given input sentence.
After morphological analysis, the "converter.py" module takes the output from apertium.morph and converts it into CSV format, which is easier to process.
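To make the conversion step more concrete, here is a minimal sketch of turning Apertium-style analyses into CSV. It assumes the standard Apertium stream format, `^surface/lemma<tag><tag>$`, and keeps only the first analysis of each word. The function names, the column names (`surface`, `root`, `tags`) and the sample string used below are assumptions for illustration; the real converter.py may use a different layout.

```python
import csv
import io
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG = re.compile(r"<([^>]+)>")


def parse_apertium(stream):
    """Return one dict per lexical unit, using only its first analysis."""
    rows = []
    for surface, analyses in LEXICAL_UNIT.findall(stream):
        first = analyses.split("/")[0]
        root = first.split("<", 1)[0]   # text before the first tag
        tags = TAG.findall(first)       # e.g. ['n', 'm', 'sg']
        rows.append({"surface": surface, "root": root, "tags": " ".join(tags)})
    return rows


def to_csv(rows):
    """Serialise the parsed rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["surface", "root", "tags"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up analysis string, used only to show the shape of the data.
print(to_csv(parse_apertium("^ladake/ladakA<n><m><sg>$")))
```

Keeping the tags in a single space-separated column is just one possible choice; splitting GNP and TAM into their own columns would work equally well.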
Furthermore, the "ner_call.py" module is employed for Named Entity Recognition (NER). This process identifies entities such as names of people, organizations, locations, etc., and enriches the sentence representation with this valuable contextual information.
Finally, the heart of the USR generation process lies in the "generate_usr.py" module. This module orchestrates all the preceding steps, combining the output from the parser, wx, morph, prune, and NER components to create comprehensive and effective Universal Sentence Representations.
Whether you need to process a single sentence or analyze a large dataset of sentences, Automated USR Generator provides a simple yet powerful interface to cater to your needs. The project's open-source nature encourages contributions and collaboration from the community, fostering continuous improvement and expanding its capabilities.
A USR represents the meaning of a sentence in 11 rows of CSV (comma-separated values) format. This document guides annotators in annotating each row. The 11 rows are:
Row 1 Original Sentence
Row 2 Concept
Row 3 Index
Row 4 Semantic Category of Noun
Row 5 Morpho-Semantic Information
Row 6 Dependency Relation
Row 7 Discourse Element
Row 8 Speaker’s View
Row 9 Scope
Row 10 Sentence Type
Row 11 Construction
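The row layout above can be read programmatically. The following sketch labels the 11 lines of a USR with the row names listed above. It assumes that each row occupies exactly one line of the file and that Row 1 is kept as raw text because the sentence itself may contain commas. The key names and the function `label_usr_rows` are hypothetical and not part of the project.

```python
import csv

# Keys derived from the 11 row names; the spellings are an assumption.
USR_ROW_NAMES = [
    "original_sentence",
    "concept",
    "index",
    "semantic_category",
    "morpho_semantic",
    "dependency",
    "discourse",
    "speakers_view",
    "scope",
    "sentence_type",
    "construction",
]


def label_usr_rows(text):
    """Map each of the 11 USR lines to its row name."""
    lines = text.splitlines()
    if len(lines) != len(USR_ROW_NAMES):
        raise ValueError(
            f"expected {len(USR_ROW_NAMES)} rows, found {len(lines)}"
        )
    # Row 1 is free text, so it is not split on commas.
    labelled = {USR_ROW_NAMES[0]: lines[0]}
    for name, line in zip(USR_ROW_NAMES[1:], lines[1:]):
        # Rows 2-11 are comma-separated cells.
        labelled[name] = next(csv.reader([line]), [])
    return labelled
```

A file with more or fewer than 11 lines raises a `ValueError`, which gives a quick sanity check on generated or hand-annotated USRs.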
- Create a virtual environment inside the "usrproginst" folder using the following commands:
cd usrproginst
python3 -m venv virtual
source virtual/bin/activate
- Now install the iscnlp tokenizer, pos-tagger and parser from the following repository:
https://bitbucket.org/iscnlp/
- Install the tokenizer first, then the pos-tagger, and finally the parser.
- Read the README in the repository for each of the three (tokenizer, pos-tagger, parser) and run the commands given there in a terminal. Run the first command from your home directory.
- When running the 3rd step of the README for each of the three (tokenizer, pos-tagger, parser), replace python with python3, i.e. run:
sudo python3 setup.py install
- If the 3rd command fails with an error related to setuptools, pip is likely not installed on your system. Install it with the following command:
sudo apt install python3-pip
- In the pos-tagger and parser folders, install the dependencies with the following command after installing each package:
pip install -r requirements.txt
- Run the following commands in a terminal inside the parser folder:
sudo apt install python2
sudo snap install curl
curl https://bootstrap.pypa.io/pip/2.7/get-pip.py --output get-pip.py
sudo python2 get-pip.py
sudo apt-get install python-requests
sudo bash install-project.sh
- To run the NER model, install transformers and torch using the following commands:
pip3 install transformers
pip3 install torch
If an error occurs, run the following command:
pip3 install transformers[torch]
- Now copy the wx_utf8, utf8_wx and ir_no@ files to the /usr/bin folder by running the following commands in a terminal:
cd /usr/bin/
sudo cp ~/usrproginst/wx_utf8 .
sudo cp ~/usrproginst/utf8_wx .
sudo cp ~/usrproginst/ir_no@ .
- After running the above commands, run the following commands to grant the required permissions:
sudo chmod +777 utf8_wx
sudo chmod +777 wx_utf8
sudo chmod +777 ir_no@
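Once the three files have been copied and made executable, a quick check like the one below can confirm that they are visible on the PATH. This is only a convenience sketch: the function name `find_missing_tools` is hypothetical, and it relies on Python's standard `shutil.which`, which returns `None` when a command cannot be found or is not executable.

```python
import shutil

# Tools that the installation steps copy into /usr/bin.
REQUIRED_TOOLS = ("utf8_wx", "wx_utf8", "ir_no@")


def find_missing_tools(names=REQUIRED_TOOLS):
    """Return the tool names that are not found as executables on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Missing or non-executable tools:", ", ".join(missing))
else:
    print("All required tools were found.")
```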
- Steps for generating USRs:
- Keep each sentence with its ID, separated by two spaces, in the "sentences_for_USR" file.
- Now run the following command in a terminal:
python3 run_generate_usr.py
- The output USR files will be stored in the "bulk_USRs" folder.
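The input format described in the steps above can be checked before running the generator. The sketch below splits each line on the first pair of spaces. It assumes that the ID comes first and the sentence second, which the instructions do not state explicitly, and the function names are hypothetical.

```python
def parse_sentence_line(line):
    """Split one input line into (sentence_id, sentence) on the first double space."""
    sent_id, separator, sentence = line.rstrip("\n").partition("  ")
    sent_id, sentence = sent_id.strip(), sentence.strip()
    if not separator or not sent_id or not sentence:
        raise ValueError(f"expected '<id>  <sentence>', got: {line!r}")
    return sent_id, sentence


def read_sentences(lines):
    """Parse every non-blank line of the input file."""
    return [parse_sentence_line(line) for line in lines if line.strip()]
```

For example, `read_sentences(open("sentences_for_USR", encoding="utf-8"))` would return a list of `(id, sentence)` pairs, or raise a `ValueError` on the first line that does not contain a double space.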
functional_diagram
Click here to view the Documentation
Follow these steps to contribute to this project:
- Fork the repository on GitHub.
- Clone the forked repository to your local machine using git clone.
- Create a new branch for your contribution using git checkout -b <branch-name>. Choose a meaningful name that describes your changes.
- Make the necessary changes and improvements to the codebase.
- Before committing your changes, ensure that your code follows our coding standards and guidelines.
- Test your changes thoroughly to ensure they work as intended.
- Commit your changes with a descriptive commit message using git commit.
- Push the changes to your forked repository with git push origin <branch-name>.
- Go to the original repository on GitHub and create a Pull Request (PR) from your forked repository.
- Provide a clear and descriptive title for your PR, summarizing the changes you've made.
By following these guidelines, you can contribute to our open-source project effectively and help make it even better!
Thank you for your contributions and support.
picture...
- Flowchart
- Class Diagrams
- we can include the following things into it.