
pyspark-loglikelihood

PySpark Loglikelihood Examples.

Inspired by Apache Mahout's loglikelihood-based similarity.

Installation

$ pip install https://github.com/talmago/pyspark-loglikelihood/archive/master.zip

NOTICE: pyspark-loglikelihood is written for Python 2.7. pyenv and virtualenv are recommended for setting up an isolated Python environment.

Usage

After installation, use the spark-submit command line to run the item-similarity and user-similarity jobs on your dataset.

Item-Item Similarity (LogLikelihood)

$ spark-submit item_similarity.py \
               input.csv \
               output \
               --maxPrefs=10000 \
               --maxSimilaritiesPerItem 100

NOTICE: Input file lines are expected to be comma-separated pairs of USER_ID,ITEM_ID. Each output line consists of ITEM_ID1,ITEM_ID2,SCORE.
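For reference, Mahout-style item similarity is built on Dunning's log-likelihood ratio over a 2x2 contingency table of co-occurrence counts. The snippet below is a minimal standalone sketch of that formula, not necessarily the exact code this package runs:

import math

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    total = float(sum(counts))
    return sum(-c * math.log(c / total) for c in counts if c > 0)

def loglikelihood_ratio(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.

    k11 -- users who interacted with both items
    k12 -- users who interacted with item A but not item B
    k21 -- users who interacted with item B but not item A
    k22 -- users who interacted with neither item
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))

Mahout maps the ratio into [0, 1) as 1 - 1 / (1 + llr); the ~0.98 scores in the example output further down are consistent with that kind of mapping.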

User-User Similarity (N-neighborhood + Loglikelihood)

$ spark-submit user_similarity.py \
               input.csv \
               output \
               --numOfNeighbors=40 \
               --numOfRecommendations 1000

NOTICE: Input file lines are expected to be comma-separated pairs of USER_ID,ITEM_ID. Each output line consists of USER_ID,ITEM_ID,SCORE.
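Both jobs consume the same input shape. As a rough illustration only (hypothetical file name, not this package's actual code), parsing USER_ID,ITEM_ID pairs into per-user item sets with PySpark looks like:

from pyspark import SparkContext

sc = SparkContext(appName="parse-preferences")

# Each input line is "USER_ID,ITEM_ID".
pairs = (sc.textFile("input.csv")
           .map(lambda line: line.strip().split(","))
           .map(lambda f: (f[0], f[1])))

# Per-user item sets -- the starting point for both item-item
# co-occurrence counts and a user-user neighborhood computation.
user_items = pairs.groupByKey().mapValues(set)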

Example

Run the example from the command line:
wget -O - https://raw.githubusercontent.com/talmago/pyspark-loglikelihood/master/examples/item-sim-ml-100l-dataset | bash -x
Step by Step

Step 1: Download and re-format the MovieLens 100k dataset.

$ wget -O - https://files.grouplens.org/datasets/movielens/ml-100k/u.data | cut -f1,2 | tr '\t' ',' > input.csv

Step 2: Upload the dataset to local HDFS

$ hadoop fs -rm -r /item-sim
$ hadoop fs -mkdir -p /item-sim
$ hadoop fs -moveFromLocal input.csv /item-sim/input.csv

Step 3: Run the item-similarity job on the HDFS dataset

$ spark-submit item_similarity.py \
               /item-sim/input.csv \
               /item-sim/output \
               --maxPrefs=10000 \
               --maxSimilaritiesPerItem 100

Step 4: Merge the output part files into a single CSV file

$ hadoop fs -getmerge /item-sim/output result.csv

Step 5: Analyze / process / visualize the result set

$ head result.csv
26,381,0.9889748
26,732,0.9876871
26,70,0.98738647
26,715,0.98685825
26,238,0.98625606
26,58,0.98580784
26,1,0.985786
26,83,0.9857064
26,88,0.9856318
26,367,0.9854448
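To poke at the merged results in Python, something like the pandas snippet below works (the column names are assumptions; the file has no header row):

import pandas as pd

df = pd.read_csv("result.csv", header=None,
                 names=["item_a", "item_b", "score"])

# Top 5 most similar items to item 26.
print(df[df.item_a == 26].sort_values("score", ascending=False).head(5))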
