This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Large Tensor] Fix ravel_multi_index op #17644

Merged: 2 commits merged into apache:master from the fix_ravel_large_tensor branch on Feb 26, 2020

Conversation

@connorgoggins (Contributor) commented on Feb 21, 2020

Description

The ravel_multi_index op was previously breaking on large tensor data (any dimension with >= 2^32 elements). With the following input:

from benchmark.opperf.utils.benchmark_utils import run_performance_test  # OpPerf helper (import path assumed)
from mxnet import nd

run_performance_test(nd.ravel_multi_index, inputs=[{"data": (2, 2**32), "shape": (2, 10)}], run_backward=True, warmup=1, runs=1)

the following error was thrown:

Segmentation fault: 11

To root-cause the issue, I ran the command above from a Python script under GDB and found that the underlying problem was in the header of the ravel_index struct in ravel.h: the index variable i used the int dtype when it should have used index_t to handle long int indices. I switched this variable to index_t in the struct header and, after rebuilding, the same command displayed the correct output:

INFO:root:Begin Benchmark - ravel_multi_index
INFO:root:Complete Benchmark - ravel_multi_index
[{'ravel_multi_index': [{'inputs': {'data': (2, 4294967296), 'shape': (2, 10)}, 'max_storage_mem_alloc_cpu/0': 24696062.0, 'avg_time_forward_ravel_multi_index': 4257.312}]}]
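
For intuition (not part of the PR), the overflow behind the crash is easy to demonstrate: a flat index at or beyond 2^31 no longer fits in a 32-bit signed int, so a loop index declared as int wraps around once a dimension reaches 2^32 elements. A quick NumPy check:

import numpy as np

# Illustrative only: the largest index into a 2**32-element axis exceeds the
# range of a 32-bit signed int, which is why the kernel's loop index in
# ravel.h must be index_t (64-bit) rather than int.
largest_index = 2**32 - 1
print(np.iinfo(np.int32).max)                  # 2147483647
print(largest_index > np.iinfo(np.int32).max)  # True: int32 would overflow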

To ensure completeness and to prevent future breaking changes, I also added a nightly test for the ravel_multi_index op with large tensor data in tests/nightly/test_large_array.py.
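
The added test follows the existing large-tensor test pattern; a minimal sketch of its shape is below. The constant name LARGE_TENSOR_SHAPE is taken from the review excerpt further down, but its value and the exact input construction here are assumptions, not the merged code.

from mxnet import nd

LARGE_TENSOR_SHAPE = 2**32  # assumed: one dimension at the 2**32 boundary

def test_ravel_multi_index():
    # 2 x 2**32 multi-indices raveled into a small target shape; the op
    # should produce one flat index per input column
    data = nd.ones(shape=(2, LARGE_TENSOR_SHAPE))
    out = nd.ravel_multi_index(data=data, shape=(2, 10))
    assert out.shape[0] == LARGE_TENSOR_SHAPE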

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • M src/operator/ravel.h
  • M tests/nightly/test_large_array.py

Comments

Tested on r5dn.24xl (Ubuntu 16.04) and p2.16xl (Ubuntu 16.04) with:

  1. Individual op run
  2. Full OpPerf run

Results

The key difference between the CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remained the same, and both were tested using CPU context.

  • Single operator test - ravel_multi_index op (GPU)
  • Single operator test - ravel_multi_index op (CPU)
  • Full OpPerf test (GPU)
  • Full OpPerf test (CPU)

@apeforest @access2rohit @ChaiBapchya

@ChaiBapchya (Contributor) left a comment:

LGTM!

@connorgoggins (Contributor, Author) commented:

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 added the pr-awaiting-review label (PR is waiting for code review) on Feb 21, 2020
@connorgoggins force-pushed the fix_ravel_large_tensor branch 3 times, most recently from 7bd485d to 443d54f on February 25, 2020 at 10:05
@apeforest merged commit 13f5ad9 into apache:master on Feb 26, 2020

A reviewer (Contributor) commented on the new test in tests/nightly/test_large_array.py:

out = nd.ravel_multi_index(data=data, shape=shape)

assert out.shape[0] == LARGE_TENSOR_SHAPE

Can we have value checks here?
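
A small-scale illustration of the kind of value check being requested, cross-checking against NumPy on inputs where the expected result is easy to verify by hand (illustrative only, not code from the PR):

import numpy as np
from mxnet import nd

# ravel_multi_index turns column-wise multi-indices into flat indices for an
# array of the given shape; for shape (7, 6), (3, 4) -> 3*6 + 4 = 22, etc.
data = nd.array([[3, 6, 6], [4, 5, 1]])
out = nd.ravel_multi_index(data=data, shape=(7, 6))
expected = np.ravel_multi_index(np.array([[3, 6, 6], [4, 5, 1]]), dims=(7, 6))
assert np.array_equal(out.asnumpy(), expected)  # [22, 41, 37]

The large-tensor test could spot-check a handful of output positions the same way without materializing a full expected array.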

anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request on May 29, 2020:
* Fixed dtype on i

* Added nightly test for ravel_multi_index
Labels: pr-awaiting-review (PR is waiting for code review)

5 participants