
[MXNET-404] elemwise_add/sub between rsp and rsp on GPU #11179

Merged
merged 2 commits into apache:master from elemwise_add_rsp_rsp on Jun 20, 2018

Conversation

haojin2 (Contributor) commented Jun 6, 2018

Description

As the title says: add support for elemwise_add/sub between two row_sparse (rsp) NDArrays on GPU.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add support for elemwise_add/sub between rsp and rsp on GPU
  • Optimization for the in-place case (see the usage sketch below)
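
A minimal usage sketch of what this change enables (assuming an MXNet build with GPU support; the shape and densities are illustrative, not taken from the PR):

import mxnet as mx
from mxnet.test_utils import rand_ndarray

ctx = mx.gpu(0)
# Two row_sparse (rsp) operands on the GPU
a = rand_ndarray((1000, 128), stype='row_sparse', density=0.01).as_in_context(ctx)
b = rand_ndarray((1000, 128), stype='row_sparse', density=0.01).as_in_context(ctx)

c = mx.nd.elemwise_add(a, b)  # result keeps row_sparse storage
d = mx.nd.elemwise_sub(a, b)
assert c.stype == 'row_sparse'

# In-place case (kWriteInplace): write the result back into an operand
mx.nd.elemwise_add(a, b, out=a)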

Comments

For performance benchmark results, please see the comments below.

haojin2 (Contributor, Author) commented Jun 6, 2018

Benchmark script:

import mxnet as mx
from mxnet.test_utils import rand_ndarray, assert_almost_equal
import time

def measure_cost(repeat, a, b, out=None):
    # Average wall-clock time of `repeat` elemwise_add calls;
    # wait_to_read() forces the asynchronous ops to finish before the clock stops.
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(mx.nd.elemwise_add(a, b, out=out))
    for result in results:
        result.wait_to_read()
    end = time.time()
    return (end - start) / repeat

def measure_fallback(repeat, a):
    # Average wall-clock time of converting a sparse array to dense
    # (the fallback cost; not used in main() below).
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(a.tostype('default'))
    for result in results:
        result.wait_to_read()
    end = time.time()
    return (end - start) / repeat

def main():
    shape = (1000000, 512)
    context = mx.gpu(0)
    # context = mx.cpu()
    for lhs_density in [0.01, 0.005, 0.001, 0.0005, 0.0001, 0.000]:
        mx_lhs = rand_ndarray(shape, stype='row_sparse', density=lhs_density).as_in_context(context)
        mx_lhs_dns = mx_lhs.tostype('default')
        for rhs_density in [0.01, 0.005, 0.001, 0.0005, 0.0001, 0.000]:
            mx_rhs = rand_ndarray(shape=shape, stype='row_sparse', density=rhs_density).as_in_context(context)
            mx_rhs_dns = mx_rhs.tostype('default')
            # Warmup + correctness check against a dense numpy reference
            sparse_cost = 0.0
            dns_cost = 0.0
            np_lhs = mx_lhs_dns.asnumpy()
            check = mx.nd.elemwise_add(mx_lhs, mx_rhs)
            np_lhs = np_lhs + mx_rhs.asnumpy()
            assert_almost_equal(check.asnumpy(), np_lhs, atol=1e-5, rtol=1e-4)
            mx.nd.waitall()
            for i in range(100):
                sparse_cost += measure_cost(1, mx_lhs, mx_rhs)
                dns_cost += measure_cost(1, mx_lhs_dns, mx_rhs_dns)
            print("%.2f %% %.2f %%" % (lhs_density*100, rhs_density*100), dns_cost / sparse_cost)

    for rhs_density in [1.000, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.000]:
        mx_lhs_dns = mx.nd.ones(shape, ctx=context)
        mx_lhs = mx_lhs_dns.tostype('row_sparse')
        mx_rhs = rand_ndarray(shape=shape, stype='row_sparse', density=rhs_density).as_in_context(context)
        mx_rhs_dns = mx_rhs.tostype('default')
        # Warmup + correctness check against a dense numpy reference
        sparse_cost = 0.0
        dns_cost = 0.0
        np_lhs = mx_lhs_dns.asnumpy()
        mx.nd.elemwise_add(mx_lhs, mx_rhs, out=mx_lhs)
        np_lhs = np_lhs + mx_rhs.asnumpy()
        assert_almost_equal(mx_lhs.asnumpy(), np_lhs, atol=1e-5, rtol=1e-4)
        mx.nd.waitall()
        for i in range(100):
            sparse_cost += measure_cost(1, mx_lhs, mx_rhs, out=mx_lhs)
            dns_cost += measure_cost(1, mx_lhs_dns, mx_rhs_dns, out=mx_lhs_dns)
        print("%.2f %% %.2f %%" % (1.00000*100, rhs_density*100), dns_cost / sparse_cost)

    for lhs_density in [1.000, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.000]:
        mx_rhs_dns = mx.nd.ones(shape, ctx=context)
        mx_rhs = mx_rhs_dns.tostype('row_sparse')
        mx_lhs = rand_ndarray(shape=shape, stype='row_sparse', density=lhs_density).as_in_context(context)
        mx_lhs_dns = mx_lhs.tostype('default')
        # Warmup + correctness check against a dense numpy reference
        sparse_cost = 0.0
        dns_cost = 0.0
        np_rhs = mx_rhs_dns.asnumpy()
        mx.nd.elemwise_add(mx_lhs, mx_rhs, out=mx_rhs)
        np_rhs = np_rhs + mx_lhs.asnumpy()
        assert_almost_equal(mx_rhs.asnumpy(), np_rhs, atol=1e-5, rtol=1e-4)
        mx.nd.waitall()
        for i in range(100):
            sparse_cost += measure_cost(1, mx_lhs, mx_rhs, out=mx_rhs)
            dns_cost += measure_cost(1, mx_lhs_dns, mx_rhs_dns, out=mx_rhs_dns)
        print("%.2f %% %.2f %%" % (1.00000*100, lhs_density*100), dns_cost / sparse_cost)


if __name__ == "__main__":
    main()

haojin2 (Contributor, Author) commented Jun 6, 2018

Benchmark results (speedup = dense elapsed time / sparse elapsed time, i.e. dns_cost / sparse_cost from the script above):
kWriteTo: (lhs_density rhs_density speedup)
1.00 % 1.00 % 25.124997331340733
1.00 % 0.50 % 31.238362675588835
1.00 % 0.10 % 39.14725913534424
1.00 % 0.05 % 40.186331497357656
1.00 % 0.01 % 41.54522092845207
1.00 % 0.00 % 115.1436676461348
0.50 % 1.00 % 30.684299577090243
0.50 % 0.50 % 41.066164904788266
0.50 % 0.10 % 55.053740609087725
0.50 % 0.05 % 57.572661839483324
0.50 % 0.01 % 59.64072908956329
0.50 % 0.00 % 173.54969421069572
0.10 % 1.00 % 38.829209311971795
0.10 % 0.50 % 55.40661678968209
0.10 % 0.10 % 82.4112095641801
0.10 % 0.05 % 87.61740457731939
0.10 % 0.01 % 93.37161339877105
0.10 % 0.00 % 306.6551990025265
0.05 % 1.00 % 39.83933545753898
0.05 % 0.50 % 57.84548585858899
0.05 % 0.10 % 87.84864769131102
0.05 % 0.05 % 95.30841559924941
0.05 % 0.01 % 101.20527355911096
0.05 % 0.00 % 334.61053120754985
0.01 % 1.00 % 41.291302756016655
0.01 % 0.50 % 59.7229061419413
0.01 % 0.10 % 95.01911403455347
0.01 % 0.05 % 101.25847977888479
0.01 % 0.01 % 109.54648495558651
0.01 % 0.00 % 365.9305234720645
0.00 % 1.00 % 119.17642326889458
0.00 % 0.50 % 181.91244221692375
0.00 % 0.10 % 302.7410802494129
0.00 % 0.05 % 318.57223936052355
0.00 % 0.01 % 360.25671221787485
0.00 % 0.00 % 556.9824540639702
kWriteInplace on lhs: (lhs_density rhs_density speedup)
100.00 % 100.00 % 0.9877658633734423
100.00 % 1.00 % 70.37739238060738
100.00 % 0.50 % 119.9069169140413
100.00 % 0.10 % 291.33264582259096
100.00 % 0.05 % 351.9332843469742
100.00 % 0.01 % 428.11531043350476
100.00 % 0.00 % 568.4419591440868
kWriteInplace on rhs: (lhs_density rhs_density speedup)
100.00 % 100.00 % 0.9823963498050752
1.00 % 100.00 % 69.63479862099362
0.50 % 100.00 % 118.04205950892886
0.10 % 100.00 % 294.0972686031126
0.05 % 100.00 % 358.42532087562114
0.01 % 100.00 % 429.2050067814533
0.00 % 100.00 % 592.4116369131955

haojin2 (Contributor, Author) commented Jun 6, 2018

@eric-haibin-lin Please give a review when you have time, thanks!

@@ -156,7 +156,7 @@ class NDArray {
   }

   /* \brief Check whether the two arrays are the same array */
-  inline bool IsSame(const NDArray& other) {
+  inline bool IsSame(const NDArray& other) const {
haojin2 (Contributor, Author):

@piiswrong I made the change here so that I can also call this function when I have a const NDArray object.

};

template<typename OP>
void ElemwiseBinaryOp::RspRspOp(mshadow::Stream<gpu> *s,
Member:

Do we have a unit test for the write-inplace case?

haojin2 (Contributor, Author):

The in-place case shares the same code path as the in-place case between dns and rsp, which already has a unit test.
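
A minimal sketch of what such an in-place check could look like (hypothetical: check_inplace_rsp_add is an illustrative name, not the PR's actual unit test; assumes a GPU-enabled MXNet build):

import mxnet as mx
from mxnet.test_utils import rand_ndarray, assert_almost_equal

def check_inplace_rsp_add(shape=(100, 50), density=0.05):
    ctx = mx.gpu(0)
    lhs = rand_ndarray(shape, stype='row_sparse', density=density).as_in_context(ctx)
    rhs = rand_ndarray(shape, stype='row_sparse', density=density).as_in_context(ctx)
    expected = lhs.asnumpy() + rhs.asnumpy()
    # out=lhs exercises the kWriteInplace path on the lhs operand
    mx.nd.elemwise_add(lhs, rhs, out=lhs)
    assert_almost_equal(lhs.asnumpy(), expected, atol=1e-5, rtol=1e-4)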

haojin2 (Contributor, Author):

BTW, correctness is also double-checked in the benchmark script during warmup.


CHECK(!scatter) << "scatter is not supported in RspRspOp on GPU yet...";
CHECK(lhs.storage_type() == kRowSparseStorage && rhs.storage_type() == kRowSparseStorage);
CHECK(output.storage_type() == kRowSparseStorage);
Member:

Does it support kAddTo? If not, should there be a CHECK_NE(req, kAddTo)?

ElemwiseBinaryOp::DnsRspDnsOp<gpu, OP>(s, attrs, ctx, dns, rsp, req, output, reverse);
return;
}
CHECK(req == kWriteTo) << "Should be kWriteTo but got " << req;
Member:

If this function assumes req is never kNullOp, it's better to document that in the header.

lhs.data().FlatTo1D<gpu, DType>(), s);
Copy(output.aux_data(kIdx).FlatTo1D<gpu, IType>(),
lhs.aux_data(kIdx).FlatTo1D<gpu, IType>(), s);
}
Member:

What about kWriteInplace in all these branches? Should we add a check?

haojin2 (Contributor, Author):

Extra checks and tests added.

haojin2 force-pushed the elemwise_add_rsp_rsp branch 2 times, most recently from 16c160b to a25a2cd, on June 10, 2018 02:11
eric-haibin-lin self-assigned this Jun 10, 2018
haojin2 (Contributor, Author) commented Jun 20, 2018

@eric-haibin-lin should be good for merge

eric-haibin-lin merged commit ba9784d into apache:master Jun 20, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
* Support for elemwise_add/sub between rsp and rsp on GPU

* add extra test coverage for inplace cases
haojin2 deleted the elemwise_add_rsp_rsp branch July 19, 2018 20:13
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
* Support for elemwise_add/sub between rsp and rsp on GPU

* add extra test coverage for inplace cases
haojin2 added the Sparse label Aug 12, 2019