
[MXNET-537] add_n(dense, csr, dense) = dense and add_n([dense, csr, rsp]*, dense, [dense, csr, rsp]*) = dense on CPU & GPU #11330

Merged: 2 commits into apache:master from the add_n_dns_csr_dns branch on Jun 29, 2018

Conversation

haojin2 (Contributor) commented Jun 19, 2018

Description

As title

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • support for add_n(dense, csr, dense) = dense on CPU & GPU (see the usage sketch after this list)
  • support for add_n(any combination of more than 4 inputs with at least one dense input) = dense on CPU & GPU
  • unit tests
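
For illustration (not from the PR itself), a minimal usage sketch of the new mixed-storage path; the shapes and density here are arbitrary:

```python
import mxnet as mx
from mxnet.test_utils import rand_ndarray

# add_n over mixed storage types now produces a dense result directly,
# instead of densifying every sparse input first
dns1 = mx.nd.random.uniform(shape=(4, 100))
dns2 = mx.nd.random.uniform(shape=(4, 100))
csr = rand_ndarray(shape=(4, 100), stype='csr', density=0.1)

out = mx.nd.sparse.add_n(dns1, csr, dns2)
assert out.stype == 'default'  # the output has dense ('default') storage
```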

Comments

This PR also comes with an optimized GPU kernel for elemwise_add/sub(dns, csr) and elemwise_add/sub(csr, dns); for benchmark results, please see the comments below.

haojin2 (Contributor, Author) commented Jun 19, 2018

@eric-haibin-lin

@haojin2 force-pushed the add_n_dns_csr_dns branch 5 times, most recently from 755682c to fedab5c, on June 19, 2018 17:47
@haojin2 changed the title from "[MXNET-537] add_n/elemwise_sum(dense, csr, dense) = dense on CPU & GPU" to "[MXNET-537] add_n/elemwise_sum([dense, csr, rsp]*, dense, [dense, csr, rsp]*) = dense on CPU & GPU" on Jun 19, 2018
@haojin2 force-pushed the add_n_dns_csr_dns branch 3 times, most recently from f787b49 to 7355c34, on June 19, 2018 23:44
@haojin2 changed the title to "[MXNET-537] add_n(dense, csr, dense) = dense and add_n([dense, csr, rsp]*, dense, [dense, csr, rsp]*) = dense on CPU & GPU" on Jun 19, 2018
haojin2 (Contributor, Author) commented Jun 19, 2018

Benchmark results for the warp-optimized GPU kernel for elemwise_add/sub(dense, csr):

| Density | Mode          | New speedup         | Old speedup        |
|---------|---------------|---------------------|--------------------|
| 1.00%   | write inplace | 4.3422253664233255  | 1.1133807433946643 |
| 1.00%   | write to      | 1.8064753025920386  | 1.1127745540337441 |
| 0.50%   | write inplace | 8.719801584535675   | 1.2989243065699914 |
| 0.50%   | write to      | 2.2845434302137857  | 1.2954892083078022 |
| 0.10%   | write inplace | 51.95314630061374   | 1.4716730306016637 |
| 0.10%   | write to      | 2.878010179453661   | 1.4621131161544634 |
| 0.05%   | write inplace | 90.41164608500259   | 1.4950209892594164 |
| 0.05%   | write to      | 2.9590177057354445  | 1.494533324414405  |
| 0.01%   | write inplace | 165.45560871876663  | 1.5066270652228635 |
| 0.01%   | write to      | 2.9965883337578574  | 1.4932449464071242 |
The benchmark script is the same one used in #10550.
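
The #10550 script is not reproduced here; for orientation, below is a rough sketch of an equivalent measurement, modeled on the add_n scripts later in this thread. The shape, density, repeat counts, and the assumption that mx.nd.elemwise_add dispatches to the sparse kernel for a (dense, csr) input pair are mine, not the PR author's:

```python
import time
import mxnet as mx
from mxnet.test_utils import rand_ndarray

def avg_time(repeat, fn):
    # enqueue all calls first, then block once, so the average
    # reflects engine execution time rather than Python overhead
    mx.nd.waitall()
    start = time.time()
    results = [fn() for _ in range(repeat)]
    for r in results:
        r.wait_to_read()
    return (time.time() - start) / repeat

ctx = mx.gpu(0)
shape = (128, 1000000)
dns = mx.nd.random.uniform(shape=shape, ctx=ctx)
csr = rand_ndarray(shape=shape, stype='csr', density=0.001).as_in_context(ctx)
csr_as_dns = csr.tostype('default')  # dense copy for the baseline

sparse_t = avg_time(20, lambda: mx.nd.elemwise_add(dns, csr))
dense_t = avg_time(20, lambda: mx.nd.elemwise_add(dns, csr_as_dns))
print('speedup:', dense_t / sparse_t)
```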

haojin2 (Contributor, Author) commented Jun 19, 2018

Benchmark result for add_n(dense, csr, dense) = dense:

| Density | CPU speedup        | GPU speedup        |
|---------|--------------------|--------------------|
| 1.00%   | 1.1282194997074237 | 1.1627124767202126 |
| 0.50%   | 1.1686160529139418 | 1.2392510678426578 |
| 0.10%   | 1.1909255730224886 | 1.3169708612264934 |
| 0.05%   | 1.1970586102280831 | 1.3275811285384644 |
| 0.01%   | 1.202483677804412  | 1.3358768672033845 |

Benchmark script:

```python
import mxnet as mx
import sys
import os
import scipy
import numpy as np
from mxnet.test_utils import rand_ndarray, assert_almost_equal
import time

def measure_cost(repeat, a, b, c, out=None):
    # start bench: enqueue all calls, then block on the results,
    # so the average reflects engine execution time only
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(mx.nd.sparse.add_n(a, b, c, out=out))
    for result in results:
        result.wait_to_read()
    end = time.time()
    diff = end - start
    return diff / repeat

def measure_fallback(repeat, a):
    # start bench: cost of densifying a sparse array (storage fallback)
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(a.tostype('default'))
    for result in results:
        result.wait_to_read()
    end = time.time()
    diff = end - start
    return diff / repeat

def main():
    shape = (128, 1000000)
    dns = np.random.uniform(size=shape)
    # context = mx.gpu(0)
    context = mx.cpu()
    mx_dns1 = mx.nd.array(dns, ctx=context)
    mx_dns2 = mx.nd.array(dns, ctx=context)
    for density in [0.01, 0.005, 0.001, 0.0005, 0.0001]:
        mx_csr = rand_ndarray(shape=shape, stype='csr', density=density).as_in_context(context)
        mx_csr_dns = mx_csr.tostype('default')
        sparse_cost = 0.0
        dns_cost = 0.0
        mx.nd.waitall()
        # warmup: validate the sparse result against a NumPy reference
        check = mx.nd.sparse.add_n(mx_dns1, mx_csr, mx_dns2)
        dns1 = dns + mx_csr_dns.asnumpy() + dns
        assert_almost_equal(check.asnumpy(), dns1, atol=1e-5, rtol=1e-4)
        mx.nd.waitall()
        for i in range(20):
            sparse_cost += measure_cost(5, mx_dns1, mx_csr, mx_dns2)
            dns_cost += measure_cost(5, mx_dns1, mx_csr_dns, mx_dns2)
        print("%.2f %%" % (density*100), dns_cost / sparse_cost)


if __name__ == "__main__":
    main()
```
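
Note on methodology: the script validates the mixed-storage result against a NumPy reference during the warmup step, then averages 20 batches of 5 enqueued calls per density; the reported speedup is the all-dense cost divided by the mixed-storage cost. The benchmark in the next comment follows the same pattern.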

haojin2 (Contributor, Author) commented Jun 20, 2018

Benchmark result for add_n(more than 4 inputs with at least 1 dense input) = dense; the combination benchmarked here is add_n(dense, csr, dense, rsp, dense) = dense:

| Density | CPU speedup        | GPU speedup        |
|---------|--------------------|--------------------|
| 1.00%   | 1.4248320861874664 | 1.5829503717448206 |
| 0.50%   | 1.4591373125830511 | 1.612348854910054  |
| 0.10%   | 1.487516900293522  | 1.6657770987040201 |
| 0.05%   | 1.4891773584928327 | 1.6743607944367647 |
| 0.01%   | 1.4833875047500007 | 1.6844786052948375 |

Benchmark script:

```python
import mxnet as mx
import sys
import os
import scipy
import numpy as np
from mxnet.test_utils import rand_ndarray, assert_almost_equal
import time

def measure_cost(repeat, a, b, c, d, e, out=None):
    # start bench: enqueue all calls, then block on the results,
    # so the average reflects engine execution time only
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(mx.nd.sparse.add_n(a, b, c, d, e, out=out))
    for result in results:
        result.wait_to_read()
    end = time.time()
    diff = end - start
    return diff / repeat

def measure_fallback(repeat, a):
    # start bench: cost of densifying a sparse array (storage fallback)
    start = time.time()
    results = []
    for i in range(repeat):
        results.append(a.tostype('default'))
    for result in results:
        result.wait_to_read()
    end = time.time()
    diff = end - start
    return diff / repeat

def main():
    shape = (1000000, 128)
    dns = np.random.uniform(size=shape)
    context = mx.gpu(0)
    # context = mx.cpu()
    mx_dns1 = mx.nd.array(dns, ctx=context)
    mx_dns2 = mx.nd.array(dns, ctx=context)
    mx_dns3 = mx.nd.array(dns, ctx=context)
    for density in [0.01, 0.005, 0.001, 0.0005, 0.0001]:
        mx_csr = rand_ndarray(shape=shape, stype='csr', density=density).as_in_context(context)
        mx_csr_dns = mx_csr.tostype('default')
        mx_rsp = rand_ndarray(shape=shape, stype='row_sparse', density=density).as_in_context(context)
        mx_rsp_dns = mx_rsp.tostype('default')
        sparse_cost = 0.0
        dns_cost = 0.0
        mx.nd.waitall()
        # warmup: validate the sparse result against a NumPy reference
        check = mx.nd.sparse.add_n(mx_dns1, mx_csr, mx_rsp, mx_dns2, mx_dns3)
        dns1 = dns + mx_csr_dns.asnumpy() + mx_rsp_dns.asnumpy() + dns + dns
        assert_almost_equal(check.asnumpy(), dns1, atol=1e-5, rtol=1e-4)
        mx.nd.waitall()
        for i in range(20):
            sparse_cost += measure_cost(5, mx_dns1, mx_csr, mx_dns2, mx_rsp, mx_dns3)
            dns_cost += measure_cost(5, mx_dns1, mx_csr_dns, mx_dns2, mx_rsp_dns, mx_dns3)
        print("%.2f %%" % (density*100), dns_cost / sparse_cost)


if __name__ == "__main__":
    main()
```

A Member commented on this snippet:

```c++
MSHADOW_IDX_TYPE_SWITCH(nd_indptr.type_flag_, CType, {  // indptr type
  if (nd.storage_initialized()) {
    Kernel<ElemwiseDnsCsrDnsWarpKernel<kWriteTo, mshadow_op::plus>, gpu>::Launch(
        s, 32 * num_rows, out_data.dptr<DType>(), out_data.dptr<DType>(),
```

> Suggest using a const variable with a meaningful name instead of 32.

@eric-haibin-lin removed their assignment on Jun 26, 2018
@haojin2 force-pushed the add_n_dns_csr_dns branch 4 times, most recently from dc54bf4 to c316d1b, on June 28, 2018 16:37
haojin2 (Contributor, Author) commented Jun 28, 2018

@eric-haibin-lin The build passed; this should be good to merge.

@eric-haibin-lin merged commit ca60b94 into apache:master on Jun 29, 2018
@haojin2 deleted the add_n_dns_csr_dns branch on July 19, 2018 20:12
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
…sp]*, dense, [dense, csr, rsp]*) = dense on CPU & GPU (apache#11330)

* support for add_n(dense, csr, dense) = dense with tests

* eliminate magic number
@haojin2 added the Sparse label on Aug 12, 2019