
[MXNET-404] elemwise_add/sub between rsp and rsp on GPU #11179

Merged
merged 2 commits into apache:master from elemwise_add_rsp_rsp on Jun 20, 2018

Conversation

haojin2 (Contributor) commented Jun 6, 2018

Description

As the title says: add support for elemwise_add/sub between two row_sparse (rsp) NDArrays on GPU.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add support for elemwise_add/sub between rsp and rsp on GPU
  • Optimization for the in-place case (see the usage sketch below)
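
A minimal usage sketch of what this change enables (assuming an MXNet build with GPU support; the shape and densities are illustrative, not taken from the PR):

import mxnet as mx
from mxnet.test_utils import rand_ndarray

ctx = mx.gpu(0)
# Two row_sparse (rsp) operands on the GPU
a = rand_ndarray((1000, 128), stype='row_sparse', density=0.01).as_in_context(ctx)
b = rand_ndarray((1000, 128), stype='row_sparse', density=0.01).as_in_context(ctx)

c = mx.nd.elemwise_add(a, b)  # result keeps row_sparse storage
d = mx.nd.elemwise_sub(a, b)
assert c.stype == 'row_sparse'

# In-place case (kWriteInplace): write the result back into an operand
mx.nd.elemwise_add(a, b, out=a)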

Comments

For performance benchmark results, please see the comments below.

haojin2 (Contributor, Author) commented Jun 6, 2018

Benchmark script:

import mxnet as mx
from mxnet.test_utils import rand_ndarray, assert_almost_equal
import time

def measure_cost(repeat, a, b, out=None):
    # Average wall-clock time of `repeat` elemwise_add calls;
    # wait_to_read() forces the asynchronous ops to finish before the clock stops.
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(mx.nd.elemwise_add(a, b, out=out))
    for result in results:
        result.wait_to_read()
    end = time.time()
    return (end - start) / repeat

def measure_fallback(repeat, a):
    # Average wall-clock time of converting a sparse array to dense
    # (the fallback cost; not used in main() below).
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(a.tostype('default'))
    for result in results:
        result.wait_to_read()
    end = time.time()
    return (end - start) / repeat

def main():
    shape = (1000000, 512)
    context = mx.gpu(0)
    # context = mx.cpu()
    for lhs_density in [0.01, 0.005, 0.001, 0.0005, 0.0001, 0.000]:
        mx_lhs = rand_ndarray(shape, stype='row_sparse', density=lhs_density).as_in_context(context)
        mx_lhs_dns = mx_lhs.tostype('default')
        for rhs_density in [0.01, 0.005, 0.001, 0.0005, 0.0001, 0.000]:
            mx_rhs = rand_ndarray(shape=shape, stype='row_sparse', density=rhs_density).as_in_context(context)
            mx_rhs_dns = mx_rhs.tostype('default')
            # Warmup + correctness check against a dense numpy reference
            sparse_cost = 0.0
            dns_cost = 0.0
            np_lhs = mx_lhs_dns.asnumpy()
            check = mx.nd.elemwise_add(mx_lhs, mx_rhs)
            np_lhs = np_lhs + mx_rhs.asnumpy()
            assert_almost_equal(check.asnumpy(), np_lhs, atol=1e-5, rtol=1e-4)
            mx.nd.waitall()
            for i in range(100):
                sparse_cost += measure_cost(1, mx_lhs, mx_rhs)
                dns_cost += measure_cost(1, mx_lhs_dns, mx_rhs_dns)
            print("%.2f %% %.2f %%" % (lhs_density*100, rhs_density*100), dns_cost / sparse_cost)

    for rhs_density in [1.000, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.000]:
        mx_lhs_dns = mx.nd.ones(shape, ctx=context)
        mx_lhs = mx_lhs_dns.tostype('row_sparse')
        mx_rhs = rand_ndarray(shape=shape, stype='row_sparse', density=rhs_density).as_in_context(context)
        mx_rhs_dns = mx_rhs.tostype('default')
        # Warmup + correctness check against a dense numpy reference
        sparse_cost = 0.0
        dns_cost = 0.0
        np_lhs = mx_lhs_dns.asnumpy()
        mx.nd.elemwise_add(mx_lhs, mx_rhs, out=mx_lhs)
        np_lhs = np_lhs + mx_rhs.asnumpy()
        assert_almost_equal(mx_lhs.asnumpy(), np_lhs, atol=1e-5, rtol=1e-4)
        mx.nd.waitall()
        for i in range(100):
            sparse_cost += measure_cost(1, mx_lhs, mx_rhs, out=mx_lhs)
            dns_cost += measure_cost(1, mx_lhs_dns, mx_rhs_dns, out=mx_lhs_dns)
        print("%.2f %% %.2f %%" % (1.00000*100, rhs_density*100), dns_cost / sparse_cost)

    for lhs_density in [1.000, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.000]:
        mx_rhs_dns = mx.nd.ones(shape, ctx=context)
        mx_rhs = mx_rhs_dns.tostype('row_sparse')
        mx_lhs = rand_ndarray(shape=shape, stype='row_sparse', density=lhs_density).as_in_context(context)
        mx_lhs_dns = mx_lhs.tostype('default')
        # Warmup + correctness check against a dense numpy reference
        sparse_cost = 0.0
        dns_cost = 0.0
        np_rhs = mx_rhs_dns.asnumpy()
        mx.nd.elemwise_add(mx_lhs, mx_rhs, out=mx_rhs)
        np_rhs = np_rhs + mx_lhs.asnumpy()
        assert_almost_equal(mx_rhs.asnumpy(), np_rhs, atol=1e-5, rtol=1e-4)
        mx.nd.waitall()
        for i in range(100):
            sparse_cost += measure_cost(1, mx_lhs, mx_rhs, out=mx_rhs)
            dns_cost += measure_cost(1, mx_lhs_dns, mx_rhs_dns, out=mx_rhs_dns)
        print("%.2f %% %.2f %%" % (1.00000*100, lhs_density*100), dns_cost / sparse_cost)


if __name__ == "__main__":
    main()

haojin2 (Contributor, Author) commented Jun 6, 2018

Benchmark results (speedup = dense elapsed time / sparse elapsed time, i.e. dns_cost / sparse_cost from the script above):
kWriteTo: (lhs_density rhs_density speedup)
1.00 % 1.00 % 25.124997331340733
1.00 % 0.50 % 31.238362675588835
1.00 % 0.10 % 39.14725913534424
1.00 % 0.05 % 40.186331497357656
1.00 % 0.01 % 41.54522092845207
1.00 % 0.00 % 115.1436676461348
0.50 % 1.00 % 30.684299577090243
0.50 % 0.50 % 41.066164904788266
0.50 % 0.10 % 55.053740609087725
0.50 % 0.05 % 57.572661839483324
0.50 % 0.01 % 59.64072908956329
0.50 % 0.00 % 173.54969421069572
0.10 % 1.00 % 38.829209311971795
0.10 % 0.50 % 55.40661678968209
0.10 % 0.10 % 82.4112095641801
0.10 % 0.05 % 87.61740457731939
0.10 % 0.01 % 93.37161339877105
0.10 % 0.00 % 306.6551990025265
0.05 % 1.00 % 39.83933545753898
0.05 % 0.50 % 57.84548585858899
0.05 % 0.10 % 87.84864769131102
0.05 % 0.05 % 95.30841559924941
0.05 % 0.01 % 101.20527355911096
0.05 % 0.00 % 334.61053120754985
0.01 % 1.00 % 41.291302756016655
0.01 % 0.50 % 59.7229061419413
0.01 % 0.10 % 95.01911403455347
0.01 % 0.05 % 101.25847977888479
0.01 % 0.01 % 109.54648495558651
0.01 % 0.00 % 365.9305234720645
0.00 % 1.00 % 119.17642326889458
0.00 % 0.50 % 181.91244221692375
0.00 % 0.10 % 302.7410802494129
0.00 % 0.05 % 318.57223936052355
0.00 % 0.01 % 360.25671221787485
0.00 % 0.00 % 556.9824540639702
kWriteInplace on lhs: (lhs_density rhs_density speedup)
100.00 % 100.00 % 0.9877658633734423
100.00 % 1.00 % 70.37739238060738
100.00 % 0.50 % 119.9069169140413
100.00 % 0.10 % 291.33264582259096
100.00 % 0.05 % 351.9332843469742
100.00 % 0.01 % 428.11531043350476
100.00 % 0.00 % 568.4419591440868
kWriteInplace on rhs: (lhs_density rhs_density speedup)
100.00 % 100.00 % 0.9823963498050752
1.00 % 100.00 % 69.63479862099362
0.50 % 100.00 % 118.04205950892886
0.10 % 100.00 % 294.0972686031126
0.05 % 100.00 % 358.42532087562114
0.01 % 100.00 % 429.2050067814533
0.00 % 100.00 % 592.4116369131955

haojin2 (Contributor, Author) commented Jun 6, 2018

@eric-haibin-lin Please give a review when you have time, thanks!

@@ -156,7 +156,7 @@ class NDArray {
   }

   /* \brief Check whether the two arrays are the same array */
-  inline bool IsSame(const NDArray& other) {
+  inline bool IsSame(const NDArray& other) const {
haojin2 (Contributor, Author):

@piiswrong I made the change here so that I can also call this function when I have a const NDArray object.

};

template<typename OP>
void ElemwiseBinaryOp::RspRspOp(mshadow::Stream<gpu> *s,
Member:

Do we have a unit test for the write-inplace case?

haojin2 (Contributor, Author):

The in-place case shares the same code path as the in-place case between dns and rsp, which already has a unit test.
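
A minimal sketch of what such an in-place check could look like (hypothetical: check_inplace_rsp_add is an illustrative name, not the PR's actual unit test; assumes a GPU-enabled MXNet build):

import mxnet as mx
from mxnet.test_utils import rand_ndarray, assert_almost_equal

def check_inplace_rsp_add(shape=(100, 50), density=0.05):
    ctx = mx.gpu(0)
    lhs = rand_ndarray(shape, stype='row_sparse', density=density).as_in_context(ctx)
    rhs = rand_ndarray(shape, stype='row_sparse', density=density).as_in_context(ctx)
    expected = lhs.asnumpy() + rhs.asnumpy()
    # out=lhs exercises the kWriteInplace path on the lhs operand
    mx.nd.elemwise_add(lhs, rhs, out=lhs)
    assert_almost_equal(lhs.asnumpy(), expected, atol=1e-5, rtol=1e-4)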

haojin2 (Contributor, Author):

BTW, correctness is also double-checked in the benchmark script during warmup.


CHECK(!scatter) << "scatter is not supported in RspRspOp on GPU yet...";
CHECK(lhs.storage_type() == kRowSparseStorage && rhs.storage_type() == kRowSparseStorage);
CHECK(output.storage_type() == kRowSparseStorage);
Member:

Does it support kAddTo? If not, should there be a CHECK_NE(req, kAddTo)?

ElemwiseBinaryOp::DnsRspDnsOp<gpu, OP>(s, attrs, ctx, dns, rsp, req, output, reverse);
return;
}
CHECK(req == kWriteTo) << "Should be kWriteTo but got " << req;
Member:

If this function assumes req is never kNullOp, it's better to document that in the header.

lhs.data().FlatTo1D<gpu, DType>(), s);
Copy(output.aux_data(kIdx).FlatTo1D<gpu, IType>(),
lhs.aux_data(kIdx).FlatTo1D<gpu, IType>(), s);
}
Member:

What about kWriteInplace in all these branches? Should we add a check?

haojin2 (Contributor, Author):

Extra checks and tests added.

haojin2 force-pushed the elemwise_add_rsp_rsp branch 2 times, most recently from 16c160b to a25a2cd, on June 10, 2018 02:11
eric-haibin-lin self-assigned this Jun 10, 2018
haojin2 (Contributor, Author) commented Jun 20, 2018

@eric-haibin-lin should be good for merge

eric-haibin-lin merged commit ba9784d into apache:master Jun 20, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* Support for elemwise_add/sub between rsp and rsp on GPU

* add extra test coverage for inplace cases
haojin2 deleted the elemwise_add_rsp_rsp branch July 19, 2018 20:13
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Support for elemwise_add/sub between rsp and rsp on GPU

* add extra test coverage for inplace cases
haojin2 added the Sparse label Aug 12, 2019