This Node.js application extracts and highlights gerunds from a corpus of .doc, .rtf, and .docx files, while providing the option to exclude specific words.
- Converts documents to .docx: Uses LibreOffice to convert legacy .doc and .rtf files to .docx for consistent processing.
- Identifies and highlights gerunds: Uses regular expressions and the docx library to highlight gerunds within the text.
- Exclusion list: Allows you to specify words to be excluded from highlighting, even if they match the gerund pattern.
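To make the matching and exclusion behavior concrete, here is a minimal sketch of how text could be split into highlighted and plain segments. The pattern and the names `GERUND_PATTERN` and `segmentGerunds` are assumptions for illustration, not the project's actual code. The pattern is a simple "-ing" heuristic, so it also catches non-gerunds such as "thing", which is the kind of case the exclusion list is meant to handle.

```javascript
// Illustrative sketch only: the regex and function names are assumptions,
// not taken from the project's source.

// Matches whole words of at least four letters that end in "ing".
const GERUND_PATTERN = /\b[A-Za-z]+ing\b/g;

/**
 * Splits `text` into consecutive segments and flags the ones to highlight.
 * Words found in `excluded` (compared in lowercase) stay unhighlighted.
 */
function segmentGerunds(text, excluded = new Set()) {
  const segments = [];
  let cursor = 0;

  for (const match of text.matchAll(GERUND_PATTERN)) {
    const word = match[0];
    if (excluded.has(word.toLowerCase())) continue; // leave excluded words as plain text

    // Plain text between the previous highlight and this match.
    if (match.index > cursor) {
      segments.push({ text: text.slice(cursor, match.index), highlight: false });
    }
    segments.push({ text: word, highlight: true });
    cursor = match.index + word.length;
  }

  // Trailing plain text after the last highlighted word.
  if (cursor < text.length) {
    segments.push({ text: text.slice(cursor), highlight: false });
  }
  return segments;
}

console.log(segmentGerunds("She was running and singing.", new Set(["running"])));
```

Each segment could then be written out as one text run in the output document, with a highlight applied only where `highlight` is true.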
- Node.js: Version 12.20.0 or higher (https://nodejs.org/)
- LibreOffice: Installed and accessible from the command line (for .doc and .rtf conversion)
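For context on why LibreOffice must be reachable from the command line, the sketch below shows one way a Node.js script could invoke its headless converter. It assumes the executable is named `soffice` (on some systems it is `libreoffice`), and the helper names `buildConvertArgs` and `convertToDocx` are hypothetical, not the project's own.

```javascript
// Illustrative sketch only: helper names and the "soffice" binary name are assumptions.
const { execFile } = require("child_process");

/** Builds the argument list for a headless LibreOffice conversion to .docx. */
function buildConvertArgs(inputPath, outputDir) {
  return ["--headless", "--convert-to", "docx", "--outdir", outputDir, inputPath];
}

/** Runs the conversion and resolves with LibreOffice's stdout. */
function convertToDocx(inputPath, outputDir, binary = "soffice") {
  return new Promise((resolve, reject) => {
    execFile(binary, buildConvertArgs(inputPath, outputDir), (error, stdout, stderr) => {
      if (error) {
        reject(new Error(`Conversion failed for ${inputPath}: ${stderr || error.message}`));
        return;
      }
      resolve(stdout);
    });
  });
}

// Example call (not executed here):
// convertToDocx("corpus/document1.doc", "output/converted_to_docx").then(console.log);
```

If the binary is not on the PATH, the promise rejects, which is a quick way to confirm whether this prerequisite is met.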
- Clone the repository:
git clone https://github.com/your-username/dialog-gerunds-extractor.git
cd dialog-gerunds-extractor
- Install dependencies:
pnpm install
- Run the application:
pnpm start
- Corpus Directory:
  - Place all your .doc, .rtf, and .docx files in the corpus directory.
  - Example:
    corpus/
      document1.doc
      document2.rtf
      document3.docx
- Exclusion List (Optional):
  - Create a text file named exclusion_list.txt in the project root directory.
  - List one excluded word per line in the file.
  - Example (exclusion_list.txt):
    running
    walking
    swimming
- Run the Application:
  - Open your terminal or command prompt.
  - Navigate to the project directory.
  - Run the following command:
    pnpm start
  - The processed files with highlighted gerunds will be saved in the following directories:
    - output/converted_to_docx: Converted .docx versions of the input files (if applicable).
    - output/highlighted: Final .docx files with highlighted gerunds.
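As an illustration of the one-word-per-line format of exclusion_list.txt described in the steps above, here is a rough sketch of how such a file could be read into a lookup set. The function names `parseExclusionList` and `loadExclusionList` are hypothetical, and the lowercasing and blank-line handling are assumptions about reasonable behavior rather than documented features.

```javascript
// Illustrative sketch only: function names and normalization rules are assumptions.
const fs = require("fs");

/** Turns file contents into a Set of lowercase words, skipping blank lines. */
function parseExclusionList(contents) {
  return new Set(
    contents
      .split(/\r?\n/) // accept both Unix and Windows line endings
      .map((line) => line.trim().toLowerCase())
      .filter((line) => line.length > 0)
  );
}

/** Reads the list from disk; the file is optional, so a missing file yields an empty Set. */
function loadExclusionList(filePath = "exclusion_list.txt") {
  if (!fs.existsSync(filePath)) return new Set();
  return parseExclusionList(fs.readFileSync(filePath, "utf8"));
}

const excluded = parseExclusionList("running\nwalking\nswimming\n");
console.log(excluded.has("walking")); // true
```

A Set keeps each lookup cheap, which matters when every candidate word in a large corpus is checked against the list.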
Contributions are welcome! If you have any improvements or bug fixes, please follow these steps:
- Fork the repository on GitHub.
- Create a new branch for your feature or bug fix.
- Make your changes and commit them with clear messages.
- Push your changes to your forked repository.
- Submit a pull request to the main repository.
This project is licensed under the MIT License.
- Mammoth.js: For converting .docx files to text.
- docx: For creating and manipulating .docx files.
- LibreOffice: For converting .doc and .rtf files to .docx.