This repo provides functionality to download, process & search 3gpp docs using an inverted index

abhinav-neil/search-3gpp

3GPP Document Search

This repository provides tools to download, process, and search through 3GPP documents via an inverted index, using Elasticsearch and Flask.

Scripts

  • main.ipynb: A Jupyter notebook to run all the scripts in order.
  • dl_docs.py: Download documents (in .zip format) from a 3GPP website.
  • process_docs.py: Scan '.doc' files, extract raw text, and save them as '.txt' files.
  • inverse_index.py: Tokenize the '.txt' files and index them using Elasticsearch. Also contains a function to search for documents directly.
  • app.py: A Flask application to search for a given 3GPP specification and display matching filenames along with links to view/download the files.
  • index.html: A web page to render the search form and display the results.
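To illustrate what "tokenize and index" means here, the following is a minimal in-memory sketch of an inverted index in plain Python. It is not the code in inverse_index.py, which delegates indexing to Elasticsearch; the function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # .get() avoids creating empty entries in the defaultdict for unknown tokens.
    matches = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(matches)


# Hypothetical sample documents, for illustration only.
docs = {
    "a.txt": "Handover procedure for NR",
    "b.txt": "NR radio resource control",
}
index = build_index(docs)
print(search(index, "nr handover"))
```

A real search engine such as Elasticsearch adds relevance scoring, text analysis, and on-disk storage on top of this basic token-to-documents mapping.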

Setup

Environment Setup

conda create -n search-3gpp python==3.11
conda activate search-3gpp
pip install -r requirements.txt

OS Libraries

Install the following libraries on your OS:

# For antiword
sudo apt-get update && sudo apt-get install antiword

# For Elasticsearch (assuming Debian/Ubuntu)
# (prefix each command with "!" only when running it from a Jupyter notebook cell)
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.0-amd64.deb
sudo dpkg -i elasticsearch-7.10.0-amd64.deb

Directory Structure

.
├── requirements.txt
└── src
    ├── app.py
    ├── dl_docs.py
    ├── inverse_index.py
    ├── main.ipynb
    ├── process_docs.py
    └── templates
        └── index.html

Usage

  1. Download Documents

    python src/dl_docs.py --base_url [BASE_URL] --save_dir [SAVE_DIR] --max_files [MAX_FILES]

  2. Process Documents

    python src/process_docs.py --src_dir [SRC_DIR] --dest_dir [DEST_DIR]

  3. Start Elasticsearch Instance

    sudo service elasticsearch start

  4. Index Documents to Elasticsearch

    python src/inverse_index.py --idx_name [IDX_NAME] --docs_dir [DOCS_DIR] --reset_idx [RESET_IDX]

  5. Run Flask Application

    python src/app.py

Visit http://127.0.0.1:5000/ in your browser to use the application.
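For readers curious about what a search against the index might look like, Elasticsearch accepts full-text queries as a JSON body. The sketch below builds a standard `match` query as a Python dictionary. The helper name `build_match_query` and the field name `content` are assumptions made for illustration; the field used by inverse_index.py may differ.

```python
import json


def build_match_query(text, field="content", size=10):
    """Build an Elasticsearch query body for a full-text `match` search.

    `field` is a hypothetical field name; use whichever field the
    documents were indexed under.
    """
    return {"size": size, "query": {"match": {field: text}}}


# Print the body that would be sent to the index's _search endpoint.
print(json.dumps(build_match_query("handover procedure"), indent=2))
```

The resulting dictionary can be passed to an Elasticsearch client or sent as the JSON body of a request to the `_search` endpoint of the index created in step 4.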

