
CMPUT-656-Knowledge-Graphs

Two Approaches for Missing Entity Types in Imbalanced Datasets

To address the problem of missing entity type prediction in imbalanced datasets, we propose two techniques. First, deal with the imbalance in the dataset itself. If the distribution of types in a dataset is unbalanced, a model will likely give predictions biased toward the types that appear more frequently. We therefore propose reorganising the dataset into More Frequent Types (MFT) and Less Frequent Types (LFT) according to the frequency of each type. We use the distribution of types within the test dataset to determine a threshold K, and use K as the criterion for dividing the dataset into MFT and LFT, as sketched below.
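As a rough illustration, the split could look like the following sketch. The column name "type", the file names, and the fixed value of K are our placeholder assumptions; in the project itself K is derived from the test set's type distribution.

```python
import pandas as pd
from collections import Counter

# Hypothetical sketch of the MFT/LFT split. The column name "type", the
# file names, and the fixed threshold K are placeholder assumptions.
K = 1000  # minimum frequency for a type to count as "more frequent"

df = pd.read_csv("figer_dataset.csv")      # one row per (mention, type)
type_counts = Counter(df["type"])          # frequency of each entity type

mft_types = {t for t, n in type_counts.items() if n >= K}
mft = df[df["type"].isin(mft_types)]       # More Frequent Types
lft = df[~df["type"].isin(mft_types)]      # Less Frequent Types

mft.to_csv("MFT_dataset.csv", index=False)
lft.to_csv("LFT_dataset.csv", index=False)
```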

Second, employ a statistical model, or loss manipulation on a neural network. Out-of-vocabulary (OOV) words are a common problem for statistical embedding methods like word2vec and GloVe. To address it, we aim to use ConceptNet Numberbatch (Speer et al., 2017), a statistical embedding that combines word2vec, GloVe, and fastText to overcome the out-of-vocabulary problem. Its strategy is to remove letters from the end of an unknown word until the remainder is a prefix of known terms; if such terms exist, their embeddings are averaged.

For instance, suppose we want to produce the embedding of an unknown word, "upxyz", that is not in the vocabulary. ConceptNet Numberbatch will strip letters from the end of the word, look for known words sharing the remaining prefix, such as "upstairs", "upcoming", etc., and produce an averaged embedding from those terms.
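A minimal sketch of this prefix-fallback idea (our own simplified re-implementation for illustration, not the actual ConceptNet Numberbatch code):

```python
import numpy as np

def oov_embedding(word, embeddings):
    """Approximate an unknown word's vector by prefix fallback.

    Strips letters from the end of `word` until the remaining prefix
    matches the start of known words, then averages their vectors.
    Simplified sketch of the ConceptNet Numberbatch OOV strategy.
    """
    if word in embeddings:
        return embeddings[word]
    for end in range(len(word) - 1, 0, -1):
        prefix = word[:end]
        matches = [vec for w, vec in embeddings.items() if w.startswith(prefix)]
        if matches:
            return np.mean(matches, axis=0)
    return None  # no known word shares any prefix

# e.g. "upxyz" falls back to the prefix "up" and averages the vectors
# of "upstairs", "upcoming", ...
vocab = {"upstairs": np.ones(3), "upcoming": np.zeros(3)}
print(oov_embedding("upxyz", vocab))  # -> [0.5 0.5 0.5]
```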

We will try to avoid neural networks, since we cannot fully explain how they reach their predictions. However, if the statistical models perform worse than we expect, we will also consider a neural network whose loss is manipulated according to the data distribution.
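Should we go that route, one simple form of loss manipulation is weighting each class inversely to its frequency. A minimal Keras sketch with random stand-in data; the architecture, sizes, and exact weighting scheme are our placeholders, not a committed design:

```python
import numpy as np
from collections import Counter
from tensorflow import keras

# Placeholder sketch: weight each type inversely to its frequency so the
# loss penalises mistakes on rare types more heavily. All sizes and the
# random stand-in data are arbitrary assumptions.
y_train = np.random.randint(0, 10, size=1000)    # stand-in type labels
X_train = np.random.rand(1000, 300)              # stand-in embedding features

counts = Counter(y_train)
class_weight = {c: len(y_train) / (len(counts) * n) for c, n in counts.items()}

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(300,)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X_train, y_train, class_weight=class_weight, epochs=1)
```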

Dataset

We use the FIGER (Ling and Weld, 2012) dataset for this research. FIGER uses a two-stage process: the first stage detects entity mentions and the second classifies those mentions. We explored the GitHub repository referenced in the FIGER paper to obtain the dataset, but the data is encoded with Protocol Buffers, which we could not use directly, so we instead took the dataset from a related paper (Abhishek et al., 2019), whose authors released FIGER in JSON format.
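For orientation, reading the JSON release might look like the sketch below. The field names ("tokens", "mentions", "start", "end", "labels") and the one-object-per-line layout are assumptions on our part; check the actual file released with Abhishek et al. (2019) and adjust accordingly.

```python
import json

# Hypothetical sketch for reading a JSON-formatted FIGER release.
# All field names here are assumptions; adjust to the real schema.
with open("figer_train.json") as f:
    for line in f:                  # assuming one JSON object per line
        record = json.loads(line)
        tokens = record.get("tokens", [])
        for mention in record.get("mentions", []):
            span = tokens[mention["start"]:mention["end"]]
            print(" ".join(span), mention["labels"])
```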

Environment

  • OS: CentOS 7

  • CUDA version: 10

  • GPU: GeForce RTX 2080 (×4)

  • Libraries

  1. Keras: pip install keras
  2. GloVe: pip install glove_python
  3. ThunderSVM (GPU-accelerated support vector machine): pip install thundersvm-cuda10 (a short smoke test follows this list)
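To verify that ThunderSVM sees the GPU, here is a minimal smoke test on random placeholder data. The feature dimension and labels are arbitrary; the real pipeline trains on word-embedding features instead.

```python
import numpy as np
from thundersvm import SVC  # GPU-accelerated, scikit-learn-style API

# Smoke test with random placeholder data to confirm the GPU setup.
X = np.random.rand(200, 300)           # e.g. 300-d embedding vectors
y = np.random.randint(0, 2, size=200)  # stand-in binary type labels

clf = SVC(kernel="rbf")
clf.fit(X, y)
print(clf.predict(X[:5]))
```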

How to Run the Code

Please download the dataset first from here: Dataset. Unzip the file in the repository directory. Since we provide pickle files to save you time on data preprocessing, the directory is almost 10 GB. Without a GPU, running the models for MFT and for the entire dataset will take more than two days, so we strongly recommend running that code on a machine with a GPU.

Word2Vec

  • Location: Code/Word2Vec

  • LFT: run the code step by step in the Jupyter notebook SVM-Word2Vec (LFT).ipynb

  • MFT & All dataset:

  1. Global MFT: python svm_word2vec.py --data ../../Dataset/global_MFT_dataset.csv --emb ../../Dataset/global_MFT.pickle --cg 'global coarse grained'
  2. Local MFT: python svm_word2vec.py --data ../../Dataset/Local_MFT_dataset.csv --emb ../../Dataset/Local_MFT.pickle --cg coarse_grain
  3. Global entire data: python svm_word2vec.py --data ../../Dataset/global_database_figer.csv --emb ../../Dataset/allDataEntity.pickle --cg 'global coarse grained'
  4. Local entire data: python svm_word2vec.py --data ../../Dataset/Local_database_figer.csv --emb ../../Dataset/allDataEntity.pickle --cg coarse_grain

GloVe

  • Location: Code/GloVE

  • Global: run the code step by step in the Jupyter notebook SVM(glove)_global.ipynb

  • Local: run the code step by step in the Jupyter notebook SVM(glove)_local.ipynb

ConceptNet

  • Location: Code/ConceptNet

  • LFT:

  1. Global: run the code step by step in the Jupyter notebook Global LFT.ipynb
  2. Local: run the code step by step in the Jupyter notebook Local LFT.ipynb

  • MFT & entire dataset:

  1. Global MFT: python svm_conceptnet_Global.py
  2. Local MFT: python svm_conceptnet_Local.py
  3. Global entire data: edit the script to use the entire dataset, then run python svm_conceptnet_Global.py
  4. Local entire data: edit the script to use the entire dataset, then run python svm_conceptnet_Local.py

Results

Email us!

If you have any questions, or if the code doesn't work, please let us know by email: [email protected] or [email protected]
