forked from LIAAD/kep

Keyphrase Extraction Package - With latest libraries

sajidzaman/kep

KEP - Keyphrase Extraction Package

KEP is a Python package that extracts keyphrases from documents (a single document or several) by applying a number of algorithms, the large majority of which are provided by pke, an open-source package. Unlike pke, KEP provides ready-to-run code for extracting keyphrases not only from a single document but also in batch mode (i.e., from several documents). In addition, KEP covers 20 state-of-the-art datasets from which keyphrases may be extracted, together with the corresponding pre-computed dfs, lda and KEA models (pke, by contrast, only makes the semeval-2010 models available).

KEP is available on Docker Hub (ready to run) or as a download (in which case some configuration is needed). Whichever option you choose, we provide a Jupyter notebook to ease the process of extracting keyphrases. See the Installing KEP section for details.

List of Datasets

KEP can extract keyphrases from 20 datasets:

  • 110-PT-BN-KP (110 docs; PT)
  • 500N-KPCrowd-v1.1 (500 docs; EN)
  • cacic (888 docs; ES)
  • citeulike180 (183 docs; EN)
  • fao30 (30 docs; EN)
  • fao780 (779 docs; EN)
  • Inspec (2000 docs; EN)
  • kdd (755 docs; EN)
  • Krapivin2009 (2304 docs; EN)
  • Nguyen2007 (209 docs; EN)
  • pak2018 (50 docs; PL)
  • PubMed (500 docs; EN)
  • Schutz2008 (1231 docs; EN)
  • SemEval2010 (243 docs; EN)
  • SemEval2017 (493 docs; EN)
  • theses100 (100 docs; EN)
  • wicc (1640 docs; ES)
  • wiki20 (20 docs; EN)
  • WikiNews (100 docs; FR)
  • www (1330 docs; EN)

Note, however, that more datasets can be added as long as they follow this structure:
  • keys: a folder that contains for each document a file with the corresponding keywords (ground-truth)
  • docsutf8: a folder that contains the documents text
  • lan.txt: a file that specifies the language of the documents (e.g., EN); it is used to load the stopwords
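
Before adding a new dataset, it can help to verify that its folder matches the layout above. The following is a minimal sketch, not part of KEP: the function name check_dataset_layout is hypothetical, and the pairing of documents and ground-truth files by base name is an assumption about the naming convention.

```python
from pathlib import Path


def check_dataset_layout(dataset_dir):
    """Return a list of problems found in a dataset folder (empty if none)."""
    root = Path(dataset_dir)
    docs_dir, keys_dir = root / "docsutf8", root / "keys"
    problems = []

    # The two folders described in the list above.
    for folder in (docs_dir, keys_dir):
        if not folder.is_dir():
            problems.append(f"missing folder: {folder.name}")

    # lan.txt must exist and contain a language code such as EN.
    lan_file = root / "lan.txt"
    if not lan_file.is_file():
        problems.append("missing file: lan.txt")
    elif not lan_file.read_text(encoding="utf-8").strip():
        problems.append("lan.txt is empty")

    # Assumption: a document and its ground-truth file share a base name.
    if docs_dir.is_dir() and keys_dir.is_dir():
        docs = {p.stem for p in docs_dir.iterdir() if p.is_file()}
        keys = {p.stem for p in keys_dir.iterdir() if p.is_file()}
        for name in sorted(docs - keys):
            problems.append(f"no keys file for document: {name}")

    return problems
```

Calling check_dataset_layout on the dataset folder returns an empty list when the three required entries are present and every document has a matching keys file.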

Keyphrase Extraction Algorithms

Unsupervised Algorithms

Statistical Methods

Graph-based Methods

Supervised Algorithms

Installing KEP

Option 1: Docker

Install Docker

Windows

Docker for Windows requires 64-bit Windows 10 Pro with Hyper-V available. If your system meets this requirement, download it here: (https://docs.docker.com/docker-for-windows/install/#download-docker-for-windows) and click on Get Docker for Windows (Stable)

If your system does not meet the requirements to run Docker for Windows (e.g., 64-bit Windows 10 Home), you can install Docker Toolbox, which uses Oracle VirtualBox instead of Hyper-V. In that case, download it here: (https://docs.docker.com/toolbox/overview/#ready-to-get-started) and click on Get Docker Toolbox for Windows

MAC

Docker for Mac will launch only if all of these requirements (https://docs.docker.com/docker-for-mac/install/#what-to-know-before-you-install) are met. If your system meets them, download it here: (https://docs.docker.com/docker-for-mac/install/#download-docker-for-mac) and click on Get Docker for Mac (Stable)

If your system does not meet the requirements to run Docker for Mac, you can install Docker Toolbox, which uses Oracle VirtualBox for virtualization. In that case, download it here: (https://docs.docker.com/toolbox/overview/#ready-to-get-started) and click on Get Docker Toolbox for Mac

Linux

Proceed to download here: (https://docs.docker.com/engine/installation/#server)

Pull Image

Execute the following command on your docker machine:

docker pull liaad/kep

Run Image

On your docker machine run the following to launch the image:

docker run -p 9999:8888 --user root liaad/kep

Then go to your browser and type in the following url:

https://<DOCKER-MACHINE-IP>:9999

where the IP may be localhost, or 192.168.99.100 if you are using a Docker Machine VM.

You will be asked for a token, which you can find in your docker machine prompt. It will look similar to this: https://eac214218126:8888/?token=ce459c2f581a5f56b90256aaa52a96e7e4b1705113a657e8. Copy the token (in this example, ce459c2f581a5f56b90256aaa52a96e7e4b1705113a657e8), paste it into the browser, and voila, the KEP package is ready to run. Keep this token for future reference, or define a password.
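
If you prefer not to copy the token by hand, it can be read from the query string of the printed URL. This is a small sketch using only the Python standard library; the helper name extract_token is hypothetical and not part of KEP or Jupyter.

```python
from urllib.parse import parse_qs, urlparse


def extract_token(notebook_url):
    """Return the value of the `token` query parameter, or None if absent."""
    query = parse_qs(urlparse(notebook_url).query)
    values = query.get("token")
    return values[0] if values else None


# The example URL shown in the paragraph above.
url = "https://eac214218126:8888/?token=ce459c2f581a5f56b90256aaa52a96e7e4b1705113a657e8"
print(extract_token(url))
```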

Run Jupyter notebooks

Once you are logged in, proceed by running the notebook that we have prepared for you.

Shutdown

Once you are done, go to File - Shutdown.

Login again

If you later decide to reuse the same container, proceed as follows. The first thing to do is to get the container id:

docker ps -a

Next run the following commands:

docker start ContainerId
docker attach ContainerId  # attach to the running container

Nothing visible happens on your docker machine, but you are now ready to open your browser as you did before:

https://<DOCKER-MACHINE-IP>:9999

Hopefully, you have saved the token or defined a password. If that is not the case, run the following command to retrieve your token. Note that docker exec only works on a running container, so run it after docker start (and before docker attach, or from a second terminal):

docker exec -it <docker_container_name> jupyter notebook list

Option 2: Standalone Installation

Install KEP library and Dependency Packages

pip install git+https://github.com/liaad/kep
pip install git+https://github.com/boudinfl/pke
pip install git+https://github.com/LIAAD/yake.git
pip install langcodes
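
After running the commands above, a quick way to confirm that the packages are importable is to look them up without importing them. This is a minimal sketch; the helper name missing_modules is hypothetical, and the import names kep, pke, yake and langcodes are assumed to match the package names.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Assumption: each package is imported under the same name it is installed as.
print(missing_modules(["kep", "pke", "yake", "langcodes"]))
```

An empty list means all four packages were found in the current environment.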

Install External Resources

Spacy Language Models

pke uses spaCy for the pre-processing stage. Currently spaCy supports the following languages:

  • 'en': 'english',
  • 'pt': 'portuguese',
  • 'fr': 'french',
  • 'es': 'spanish',
  • 'it': 'italian',
  • 'nl': 'dutch',
  • 'de': 'german',
  • 'el': 'greek'

To install these language models, open your command line (e.g., the Anaconda prompt) with administrator privileges. Otherwise the models will be installed, but an error will be raised later on when they are used.

python -m spacy download en
python -m spacy download es
python -m spacy download fr
python -m spacy download pt
python -m spacy download de
python -m spacy download it
python -m spacy download nl
python -m spacy download el

To make sure that everything was properly installed, go to site-packages\spacy\data and check that a shortcut for every language is found there.

Datasets in languages other than the ones listed above will be handled (in the pre-processing stage) as if they were "english".
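
The fallback rule described above can be sketched as a dictionary lookup with a default. This is an illustration only, not KEP's or pke's actual code: the names SPACY_LANGUAGES and preprocessing_language are hypothetical, and lowercasing the code (lan.txt uses values such as EN) is an assumption.

```python
# Languages listed above as supported in the pre-processing stage.
SPACY_LANGUAGES = {
    "en": "english",
    "pt": "portuguese",
    "fr": "french",
    "es": "spanish",
    "it": "italian",
    "nl": "dutch",
    "de": "german",
    "el": "greek",
}


def preprocessing_language(code):
    """Map a language code to a pre-processing language, defaulting to english."""
    # Assumption: codes are compared case-insensitively.
    return SPACY_LANGUAGES.get(code.strip().lower(), "english")


print(preprocessing_language("PT"))  # a listed language
print(preprocessing_language("PL"))  # an unlisted language falls back to english
```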

pke can also apply stemming in the pre-processing stage (using SnowballStemmer) for the following languages:

  • 'en': 'english',
  • 'pt': 'portuguese',
  • 'fr': 'french',
  • 'es': 'spanish',
  • 'it': 'italian',
  • 'nl': 'dutch',
  • 'de': 'german',
  • 'da': 'danish',
  • 'fi': 'finnish',
  • 'hu': 'hungarian',
  • 'nb': 'norwegian',
  • 'ro': 'romanian',
  • 'ru': 'russian',
  • 'sv': 'swedish'
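
The list above can be read as a lookup that decides whether stemming is possible for a given dataset language. The sketch below is illustrative and does not reproduce pke's implementation: STEMMING_LANGUAGES and stemmer_language are hypothetical names, and returning None for "no stemming" is an assumed convention.

```python
# Languages listed above as supported for stemming.
STEMMING_LANGUAGES = {
    "en": "english", "pt": "portuguese", "fr": "french", "es": "spanish",
    "it": "italian", "nl": "dutch", "de": "german", "da": "danish",
    "fi": "finnish", "hu": "hungarian", "nb": "norwegian", "ro": "romanian",
    "ru": "russian", "sv": "swedish",
}


def stemmer_language(code, stemming_requested=True):
    """Return the stemmer language name, or None when no stemming applies."""
    if not stemming_requested:
        return None
    # Codes outside the list yield None, i.e. no stemmer is selected.
    return STEMMING_LANGUAGES.get(code.strip().lower())
```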

Stemming will not be applied (even if defined as a parameter) for languages other than the above r