spark-glove

A distributed implementation of the GloVe model for learning word representations from large text corpora, built on Apache Spark.

Based on the original implementation: https://github.com/stanfordnlp/GloVe

Details

This project contains a GloVe model representation, operations for training and using the model, and runnable examples.

Steps to train

Reading a corpus into a word Dataset.
Cleaning the dataset: removing special characters and stop words, trimming, and lowercasing.
Building a vocabulary and filtering out low-occurrence words.
Building a word cooccurrence matrix based on moving windows.
Initializing the vector and gradient data structures.
Running fit iterations to update the vector and gradient representations.
Creating a GloVeModel object from the vocabulary and trained word vectors.
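
The vocabulary and cooccurrence steps can be sketched without Spark. The block below is a minimal illustration on plain Scala collections, not the code used in this project: the names `CooccurrenceSketch`, `buildVocabulary` and `cooccurrences` are hypothetical, and weighting each pair by 1/distance and storing it symmetrically follows the original GloVe tool, which is an assumption about what spark-glove does.

```scala
// Illustrative sketch only: plain Scala collections instead of Spark Datasets.
object CooccurrenceSketch {
  // Keep words that occur at least minCount times and give each one an integer id.
  def buildVocabulary(words: Seq[String], minCount: Int): Map[String, Int] =
    words.groupBy(identity)
      .collect { case (w, occ) if occ.size >= minCount => w }
      .toSeq.sorted
      .zipWithIndex.toMap

  // Sum weighted cooccurrences of in-vocabulary words that lie within
  // windowSize tokens of each other. Each pair is stored in both orders.
  def cooccurrences(words: Seq[String],
                    vocab: Map[String, Int],
                    windowSize: Int): Map[(Int, Int), Double] = {
    // Out-of-vocabulary words are dropped before windows are formed.
    val ids = words.flatMap(w => vocab.get(w)).toIndexedSeq
    val pairs = for {
      i <- ids.indices
      d <- 1 to math.min(windowSize, i)
      weight = 1.0 / d // closer words count more (assumed, as in the original GloVe)
      pair <- Seq((ids(i), ids(i - d)), (ids(i - d), ids(i)))
    } yield (pair, weight)
    pairs.groupBy(_._1).map { case (pair, ws) => pair -> ws.map(_._2).sum }
  }
}
```

With `buildVocabulary("a b a b c".split(" "), 2)` the word `c` is filtered out, and the remaining ids feed `cooccurrences` directly.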

Parameters

inputFile - Path to the corpus file (example in resources: w_spok_2012.txt)
VOCAB_MIN_COUNT - Minimum number of occurrences for a word to be kept in the vocabulary (default: 5)
VECTOR_SIZE - Word embedding vector size (default: 50)
MAX_ITER - Number of fitting iterations (default: 15)
WINDOW_SIZE - Moving window size for building the cooccurrence matrix (default: 15)
X_MAX - Cooccurrence count cutoff in the GloVe weighting function (see the GloVe paper) (default: 100)
ALPHA - Exponent of the GloVe weighting function (see the GloVe paper) (default: 0.75)
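
For readers who do not have the paper at hand, X_MAX and ALPHA shape the weighting function f(x) = (x / X_MAX)^ALPHA for x < X_MAX and f(x) = 1 otherwise, which limits the influence of very frequent word pairs. The sketch below states that formula in Scala; the object and method names are hypothetical and are not part of the spark-glove API.

```scala
object GloVeWeighting {
  // Weighting function from the GloVe paper:
  // grows as (x / xMax)^alpha below the cutoff and is capped at 1.0 from xMax on.
  def weight(x: Double, xMax: Double = 100.0, alpha: Double = 0.75): Double =
    if (x < xMax) math.pow(x / xMax, alpha) else 1.0
}
```

With the defaults, a pair seen once gets a weight of about 0.03, while any pair seen 100 times or more gets a weight of 1.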

Constants

INITIAL_LEARNING_RATE = 0.05
NUM_FIT_PARTITIONS = 15
OVERFLOW_LENGTH = 1000
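
INITIAL_LEARNING_RATE matches the starting step size used by the original GloVe tool, which trains with AdaGrad. Assuming spark-glove follows the same scheme (this README does not say so explicitly), one update for a single cooccurrence entry could look like the sketch below. All names are hypothetical, and bias terms are left out to keep the example short.

```scala
object AdaGradSketch {
  val InitialLearningRate = 0.05 // same value as INITIAL_LEARNING_RATE

  // One AdaGrad update for a single cooccurrence count x (must be > 0).
  // w and c are the word and context vectors; gw and gc hold their accumulated
  // squared gradients and should start at 1.0 so the division below is safe.
  // Arrays are updated in place. Returns this entry's cost before the update.
  def step(w: Array[Double], c: Array[Double],
           gw: Array[Double], gc: Array[Double],
           x: Double, xMax: Double = 100.0, alpha: Double = 0.75): Double = {
    val fx = if (x < xMax) math.pow(x / xMax, alpha) else 1.0
    // Error of the model: dot(w, c) should approximate log(x).
    val diff = w.indices.map(k => w(k) * c(k)).sum - math.log(x)
    val scaled = InitialLearningRate * fx * diff
    for (k <- w.indices) {
      val dw = scaled * c(k) // scaled gradient with respect to w(k)
      val dc = scaled * w(k) // scaled gradient with respect to c(k)
      w(k) -= dw / math.sqrt(gw(k))
      c(k) -= dc / math.sqrt(gc(k))
      gw(k) += dw * dw
      gc(k) += dc * dc
    }
    0.5 * fx * diff * diff
  }
}
```

Calling `step` repeatedly on the same entry should return a shrinking cost, since each call moves the dot product closer to log(x).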

Requirements

Spark 2.1+.

Scala 2.11.

Usage

Scala API

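// Import the GloVe classes
import com.glove._

// Run GloVe
// Assumes `spark` is the active SparkSession and `words` is the corpus read into a word Dataset;
// the remaining arguments are the values described under Parameters.
println("Starting GloVe training...")
val model =
	GloVeModelOperations.fit(spark, words, VOCAB_MIN_COUNT, VECTOR_SIZE, MAX_ITER, WINDOW_SIZE, X_MAX, ALPHA)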

// Save model
GloVeModelOperations.save(model, "/tmp/glove_model")
	    
// Print output
println("Vocabulary : ")
model.vocabulary.take(100).foreach(println)
println("Vectors : ")
model.wordVectors.take(100).foreach(println)

val word1 = "man"
val word2 = "king"
val wordT = "woman"
println("Vector of man : ")
println(model.transform(word1))
println("Similar to man : ")
model.getTopKSimilarWords(word1, 10).foreach(println)
println("Analogy man->king, woman->? : ")
model.getTopKSimilarAnalogies(word1, word2, wordT, 5).foreach(println)
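
This README does not state which similarity measure `getTopKSimilarWords` uses. Cosine similarity is a common choice for word embeddings, so the sketch below shows how such a lookup could work on an in-memory map of vectors. It is an assumption for illustration, not the project's implementation, and `SimilaritySketch`, `cosine` and `topK` are hypothetical names.

```scala
object SimilaritySketch {
  // Cosine of the angle between two vectors; 0.0 if either vector is all zeros.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // The k words whose vectors are most similar to the query word's vector.
  // The query word itself is excluded; an unknown word yields an empty result.
  def topK(vectors: Map[String, Array[Double]], word: String, k: Int): Seq[(String, Double)] =
    vectors.get(word).toSeq.flatMap { query =>
      (vectors - word).toSeq
        .map { case (other, vec) => (other, cosine(query, vec)) }
        .sortBy { case (_, score) => -score }
        .take(k)
    }
}
```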

Running GloVe from command line

You can run spark-glove directly from the command line using spark-submit.

Arguments are passed positionally, in the order listed under Parameters.

/usr/lib/spark/bin/spark-submit --class com.glove.GloVeRunner /tmp/glove.jar /tmp/input/w_spok_2012.txt 5 50 15 15 100 0.75

Credits

Written and maintained by:

Daniel Marcous
