Code for the paper
PDCDE : Patent Document Clustering with Deep Embeddings
Jaeyoung Kim, Janghyeok Yoon, Eunjeong Park, Sungchul Choi
https://www.researchgate.net/publication/325251122_Patent_Document_Clustering_with_Deep_Embeddings
KIPRIS dataset
- The KIPRIS dataset consists of abstracts from five categories of US patents.
- Categories : car, cameras, CPUs, memory, graphics.
The combination used in the paper
- Task 1 : car-camera (less relevant classes)
- Task 2 : memory-cpu (relevant classes)
- Task 3 : car, camera, cpu, memory, graphics.
- The 3-category task uses the KISTA dataset; we will add this dataset soon.
- TensorFlow 1.4.0
- Keras 2.2.0
- nltk 3.3
- pandas 0.23.0
- scikit-learn 0.19.1
#python2
$ pip install -r requirments.txt
#python3
$ pip3 install -r requirments.txt
- Replace "category" in the commands below with one of : car_camera, memory_cpu, 5_categories
$ python embedding_patent.py --dataset "category"
$ python train.py --dataset "category"
$ python train.py --dataset "category" --task test
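The three commands above can also be driven from Python. The following is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes the scripts are run from the repository root with exactly the flags shown above.

```python
import subprocess
import sys

# Dataset names listed in this README.
DATASETS = ("car_camera", "memory_cpu", "5_categories")


def build_commands(dataset, python=sys.executable):
    """Return the embedding, training, and test commands for one dataset.

    Hypothetical helper: it only mirrors the three shell commands above.
    """
    if dataset not in DATASETS:
        raise ValueError("unknown dataset: %r" % (dataset,))
    return [
        [python, "embedding_patent.py", "--dataset", dataset],
        [python, "train.py", "--dataset", dataset],
        [python, "train.py", "--dataset", dataset, "--task", "test"],
    ]


def run_pipeline(dataset):
    """Run the three steps in order and stop at the first failure."""
    for cmd in build_commands(dataset):
        subprocess.run(cmd, check=True)


# Print the commands without executing them.
for cmd in build_commands("car_camera", python="python"):
    print(" ".join(cmd))
```

Calling `run_pipeline("car_camera")` would execute the steps one after another; `check=True` raises an error if any step exits with a non-zero status.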
- dataset : categories of the dataset. You can select one of {"car_camera", "memory_cpu", "5_categories"}.
- save_embedding_vector : path to the embedding vectors.
- save_weight_path : path to the trained weights.
- dataset_path : path to the KIPRIS dataset. Default is ./dataset
- window_size : Doc2Vec window size. Default is 5.
- embedding_size : embedding vector dimension. Default is 50.
- doc_initializer : Doc2Vec word and document initializer. Default is uniform.
- negative_sample : number of negative samples used in the NCE loss. Default is 5.
- doc_lr : Doc2Vec initial learning rate. Default is 0.01.
- doc_batch_size : Doc2Vec batch size. Default is 256.
- doc_epochs : Doc2Vec epochs. Default is 500.
- dec_batch_size : DEC model batch size. Default is 256.
- dec_lr : DEC initial learning rate. Default is 0.001.
- dec_decay_step : the learning rate is step-decayed every n epochs.
- layerwise_pretrain_iters : layer-wise weight pretraining iterations (greedy layer-wise autoencoder). Default is 5000.
- finetune_iters : fine-tuning iterations after layer-wise weight pretraining. Default is 5000.
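For reference, the options above could be declared with `argparse` roughly as follows. This is a sketch under assumptions, not the repository's actual parser: it assumes each option is exposed as a `--name` flag, `build_parser` is a hypothetical name, and options with no documented default are set to `None` here.

```python
import argparse

# Dataset names listed in this README.
DATASETS = ("car_camera", "memory_cpu", "5_categories")


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    p = argparse.ArgumentParser(description="PDCDE options (illustrative sketch)")
    p.add_argument("--dataset", choices=DATASETS, required=True)
    # No default is documented for the next two paths; None is an assumption.
    p.add_argument("--save_embedding_vector", default=None)
    p.add_argument("--save_weight_path", default=None)
    p.add_argument("--dataset_path", default="./dataset")
    # Doc2Vec options
    p.add_argument("--window_size", type=int, default=5)
    p.add_argument("--embedding_size", type=int, default=50)
    p.add_argument("--doc_initializer", default="uniform")
    p.add_argument("--negative_sample", type=int, default=5)
    p.add_argument("--doc_lr", type=float, default=0.01)
    p.add_argument("--doc_batch_size", type=int, default=256)
    p.add_argument("--doc_epochs", type=int, default=500)
    # DEC options
    p.add_argument("--dec_batch_size", type=int, default=256)
    p.add_argument("--dec_lr", type=float, default=0.001)
    # No default is documented for dec_decay_step; None is an assumption.
    p.add_argument("--dec_decay_step", type=int, default=None)
    p.add_argument("--layerwise_pretrain_iters", type=int, default=5000)
    p.add_argument("--finetune_iters", type=int, default=5000)
    return p


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["--dataset", "car_camera", "--embedding_size", "100"])
print(args.dataset, args.embedding_size, args.window_size)
```

Any option not passed on the command line falls back to the default listed above.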