LIBMF is a library for large-scale sparse matrix factorization. For the
optimization problem it solves, please refer to [2].


Table of Contents
=================

- Installation
- Data Format
- Command Line Usage
- Examples
- Library Usage
- SSE, AVX, and OpenMP
- Building Windows and Mac Binaries
- References


Installation
============

- Unix & Cygwin

  Type `make' to build `mf-train' and `mf-predict.'

- Windows & Mac

  See `Building Windows and Mac Binaries' to compile. For Windows,
  pre-built binaries are available in the directory `windows.'


Data Format
===========

The data format is:

    <row_idx> <col_idx> <value>

See an example `bigdata.tr.txt.'

Note: If the values in the test set are unknown, please write dummy zeros.


Command Line Usage
==================

- `mf-train'

  usage: mf-train [options] training_set_file [model_file]

  options:
  -l <lambda>: set regularization cost (default 0.1)
  -k <factor>: set number of latent factors (default 8)
  -t <iter>: set number of iterations (default 20)
  -r <eta>: set learning rate (default 0.1)
  -s <threads>: set number of threads (default 12)
  -p <path>: set path to validation set
  -v <fold>: set number of folds for cross validation
  --quiet: quiet mode (no outputs)
  --nmf: perform non-negative matrix factorization

  In the training process, the following information is printed on the
  screen:

      - iter: the index of iteration
      - tr_rmse: RMSE in the training set
      - va_rmse: RMSE in the validation set if `-p' is specified
      - obj: objective function value

  Here `tr_rmse' and `obj' are estimates because calculating the true
  values can be time-consuming. At the end of the training process, the
  true tr_rmse is printed.

- `mf-predict'

  usage: mf-predict test_file model_file output_file


Examples
========

> mf-train bigdata.tr.txt model

train a model using the default parameters

> mf-train -l 0.5 -k 16 -t 30 -r 0.05 -s 4 bigdata.tr.txt model

train a model using the following parameters:

    regularization cost = 0.5
    latent factors = 16
    iterations = 30
    learning rate = 0.05
    threads = 4

> mf-train -p bigdata.te.txt bigdata.tr.txt model

use bigdata.te.txt as validation set

> mf-train -v 5 bigdata.tr.txt

do five-fold cross validation

> mf-train --nmf bigdata.tr.txt

do non-negative matrix factorization

> mf-train --quiet bigdata.tr.txt

do not print messages to the screen

> mf-predict bigdata.te.txt model output

do prediction


Library Usage
=============

These structures and functions are declared in the header file `mf.h.' You
need to #include `mf.h' in your C/C++ source files and link your program
with `mf.cpp.' You can see `mf-train.c' and `mf-predict.c' for examples
showing how to use them.

Before you predict test data, you need to construct a model (`mf_model')
using training data. A model can also be saved in a file for later use.
Once a model is available, you can use it to predict new data.

There are four public data structures in LIBMF.

- struct mf_node
  {
      mf_int u;
      mf_int v;
      mf_float r;
  };

  `mf_node' represents an element in a sparse matrix. `u' represents the
  row index, `v' represents the column index, and `r' represents the value.

- struct mf_problem
  {
      mf_int m;
      mf_int n;
      mf_long nnz;
      struct mf_node *R;
  };

  `mf_problem' represents a sparse matrix. Each element is represented by
  `mf_node.' `m' represents the number of rows, `n' represents the number
  of columns, `nnz' represents the number of non-zero elements, and `R' is
  an array of `mf_node' whose length is `nnz.'
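  As an illustration only (not code shipped with LIBMF), a tiny problem
  with made-up values might be filled in by hand like this:

      #include "mf.h"

      /* A 2x2 matrix with three known entries; the values are invented
         for illustration. Real data would normally be read from a file
         in the format described in `Data Format.' */
      struct mf_node nodes[3];
      struct mf_problem prob;

      void build_toy_problem()
      {
          nodes[0].u = 0; nodes[0].v = 0; nodes[0].r = 4.0f; /* R(0,0) = 4 */
          nodes[1].u = 0; nodes[1].v = 1; nodes[1].r = 2.0f; /* R(0,1) = 2 */
          nodes[2].u = 1; nodes[2].v = 1; nodes[2].r = 5.0f; /* R(1,1) = 5 */

          prob.m = 2;     /* number of rows               */
          prob.n = 2;     /* number of columns            */
          prob.nnz = 3;   /* number of non-zero elements  */
          prob.R = nodes; /* array of mf_node, length nnz */
      }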
- struct mf_parameter
  {
      mf_int k;
      mf_int nr_threads;
      mf_int nr_bins;
      mf_int nr_iters;
      mf_float lambda;
      mf_float eta;
      bool do_nmf;
      bool quiet;
      bool copy_data;
  };

  `mf_parameter' represents the parameters used for training. The meaning
  of each variable is:

      variable      meaning                              default
      =========================================================
      k             number of latent factors             8
      nr_threads    number of threads used               12
      nr_bins       number of blocks                     20
      nr_iters      number of iterations                 20
      lambda        regularization cost                  0.1
      eta           learning rate                        0.1
      do_nmf        perform NMF                          false
      quiet         no outputs to stdout                 false
      copy_data     copy data in training procedure      true

  In LIBMF, we parallelize the computation by gridding the data matrix into
  blocks. `nr_bins' is used to set the number of blocks. According to our
  experiments, both effectiveness and efficiency are insensitive to this
  parameter, so in most cases the default value should work well.

  By default, at the beginning of the training procedure, the data matrix
  is copied because it is modified in the training process. To save memory,
  `copy_data' can be set to false with the following effects.

      (1) The raw data is directly used without being copied.
      (2) The order of nodes may be changed.
      (3) The value in each node may become slightly different.

  To obtain a parameter with default values, use the function
  `mf_get_default_param.'

- struct mf_model
  {
      mf_int m;
      mf_int n;
      mf_int k;
      mf_float *P;
      mf_float *Q;
  };

  `mf_model' is used to store models in LIBMF. `m' represents the number of
  rows, `n' represents the number of columns, and `k' represents the number
  of latent factors. `P' is used to store a kxm matrix in column-oriented
  format. For example, if `P' stores a 3x4 matrix, then the content of `P'
  is:

      P11 P21 P31 P12 P22 P32 P13 P23 P33 P14 P24 P34

  `Q' is used to store a kxn matrix in the same manner.

Functions available in LIBMF include:

- mf_parameter mf_get_default_param();

  Get default parameters.

- mf_int mf_save_model(struct mf_model const *model, char const *path);

  Save a model. It returns 0 on success and 1 on failure.

- struct mf_model* mf_load_model(char const *path);

  Load a model. If the model could not be loaded, a nullptr is returned.

- void mf_destroy_model(struct mf_model **model);

  Destroy a model.

- struct mf_model* mf_train(
      struct mf_problem const *prob,
      mf_parameter param);

  Train a model.

- struct mf_model* mf_train_with_validation(
      struct mf_problem const *Tr,
      struct mf_problem const *Va,
      mf_parameter param);

  Train a model with training set `Tr' and validation set `Va.' The RMSE of
  the validation set is printed at each iteration.

- mf_float mf_cross_validation(
      struct mf_problem const *prob,
      mf_int nr_folds,
      mf_parameter param);

  Do cross validation with `nr_folds' folds.

- mf_float mf_predict(struct mf_model const *model, mf_int p_idx,
      mf_int q_idx);

  Predict the value at the position (p_idx, q_idx).

A short end-to-end sketch that strings these functions together is given
after the next section.


SSE, AVX, and OpenMP
====================

LIBMF utilizes SSE instructions to accelerate the computation. If you
cannot use SSE on your platform, then please comment out

    DFLAG = -DUSESSE

in Makefile to disable SSE.

Some modern CPUs support AVX, which is more powerful than SSE. To enable
AVX, please uncomment the following lines in Makefile.

    DFLAG = -DUSEAVX
    CFLAGS += -mavx

If OpenMP is not available on your platform, then please comment out the
following lines in Makefile.

    DFLAG += -DUSEOMP
    CXXFLAGS += -fopenmp

Note: Please always run `make clean all' if these flags are changed.
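The following is the end-to-end sketch referred to in `Library Usage.' It
is an illustration only, not code shipped with LIBMF: it assumes a `prob'
that has already been filled (for example as in the sketch under
`mf_problem' above) and uses only the functions listed earlier.

    #include <stdio.h>
    #include "mf.h"

    int run_example(struct mf_problem const *prob)
    {
        /* Start from the default parameters and adjust a few of them. */
        mf_parameter param = mf_get_default_param();
        param.k = 16;        /* number of latent factors */
        param.nr_iters = 30; /* number of iterations     */

        /* Train a model on the given problem. */
        struct mf_model *model = mf_train(prob, param);

        /* Predict the value at row 0, column 1. */
        mf_float value = mf_predict(model, 0, 1);
        printf("predicted value at (0, 1): %f\n", value);

        /* Save the model for later use; 0 means success. */
        if (mf_save_model(model, "model") != 0)
            fprintf(stderr, "could not save the model\n");

        mf_destroy_model(&model);
        return 0;
    }

Compile such a file together with `mf.cpp' as described at the beginning of
`Library Usage.'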
Building Windows and Mac Binaries
=================================

- Windows

  Windows binaries are in the directory `windows.' To build them via the
  command-line tools of Microsoft Visual Studio, use the following steps:

  1. Open a DOS command box (or Developer Command Prompt for Visual Studio)
     and go to the libmf directory. If the environment variables of VC++
     have not been set, type

         "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\amd64\vcvars64.bat"

     You may have to modify the above command according to which version of
     VC++ you have and where it is installed.

  2. Type

         nmake -f Makefile.win clean all

  3. (optional) To build the shared library mf_c.dll, type

         nmake -f Makefile.win lib

- Mac

  To compile LIBMF on Mac, a GCC compiler is required, and users need to
  slightly modify the Makefile. The following instructions are tested with
  GCC 4.9.

  1. Set the compiler path to your GCC compiler. For example, the first
     line in the Makefile can be

         CXX = g++-4.9

  2. Remove `-march=native' from `CXXFLAGS.' The second line in the
     Makefile should be

         CXXFLAGS = -O3 -pthread -std=c++0x

  3. If AVX is enabled, we add `-Wa,-q' to the `CXXFLAGS,' so the previous
     `CXXFLAGS' becomes

         CXXFLAGS = -O3 -pthread -std=c++0x -Wa,-q


References
==========

[1] W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Fast Parallel
    Stochastic Gradient Method for Matrix Factorization in Shared Memory
    Systems. ACM TIST, 2015.
    (www.csie.ntu.edu.tw/~cjlin/papers/libmf/libmf_journal.pdf)

[2] W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin. A Learning-rate
    Schedule for Stochastic Gradient Methods to Matrix Factorization.
    PAKDD, 2015.

For any questions and comments, please email: [email protected]