Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Large Tensor] Fixed Spatial Transformer op #17617

Merged

Conversation

connorgoggins
Copy link
Contributor

@connorgoggins connorgoggins commented Feb 17, 2020

Description

The Spatial Transformer op was previously breaking on large tensor (dimension >= 2^32) data. With the following input:

run_performance_test(nd.SpatialTransformer, run_backward=True, inputs=[{'data': (2, 2**29,1,6), 'loc': nd.random_normal(shape=(2,6)), 'transform_type': 'affine', 'sampler_type': 'bilinear', 'target_shape': (2,6)}], warmup=1, runs=1)

the following error was thrown:

*** Error in `python3': double free or corruption (out): 0x00007f36bbffe010 ***

To root cause this issue, I ran the previous command in a Python script with GDB, and found that the underlying problem was in the iteration portion of the forward and backward methods of spatial_transformer.cc. Several of the variables used in the iteration used the int dtype when they should have been using index_t to properly handle long int indices. I switched these variables to index_t in the forward and backward methods, and after rebuilding, the previous input command displayed the correct output:

INFO:root:Begin Benchmark - SpatialTransformer
INFO:root:Complete Benchmark - SpatialTransformer
[{'SpatialTransformer': [{'inputs': {'data': (2, 536870912, 1, 6), 'loc': '<NDArray 2x6 @cpu(0)>', 'transform_type': 'affine', 'sampler_type': 'bilinear', 't
arget_shape': (2, 6)}, 'max_storage_mem_alloc_cpu/0': 102005472.0, 'avg_time_forward_SpatialTransformer': 551614.125, 'avg_time_backward_SpatialTransformer':
 511034.0938}]}]

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • M src/operator/tensor/spatial_transformer.cc

Comments

Tested on r5dn.24xl-ubuntu 16.04 and p2.16xl-ubuntu 16.04 with

  1. Individual op run
  2. Full OpPerf run

Results

The key difference between CPU and GPU tests was the instance type (r5dn.24xl for CPU, p2.16xl for GPU). All relevant build flags remain the same, and both were tested using CPU context.

Single operator test - SpatialTransformer op (GPU)
Single operator test - SpatialTransformer op (CPU)

Full OpPerf test (GPU)
Full OpPerf test (CPU)

@apeforest @access2rohit @ChaiBapchya

@connorgoggins connorgoggins changed the title [LT] Fixed Spatial Transformer op [Large Tensor] Fixed Spatial Transformer op Feb 17, 2020
@connorgoggins
Copy link
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Feb 17, 2020
@access2rohit
Copy link
Contributor

@connorgoggins
Copy link
Contributor Author

@access2rohit DType here represents the type of the interior elements of the input tensors, and since SpatialTransformer only supports floating point types then DType could be float32 or float64 (also potentially float16, but this DType doesn't seem to have been implemented for GEMM on CPU yet).

@connorgoggins connorgoggins force-pushed the fix_spatial_transformer_large_tensor branch from c5ff973 to 871849e Compare February 18, 2020 21:10
index_t top_right_v = 0;
index_t bottom_left_v = 0;
index_t bottom_right_v = 0;
DType top_left_v = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this changed back to Dtype?
It's the index position right?

Copy link
Contributor Author

@connorgoggins connorgoggins Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChaiBapchya changing DType to index_t for these specific variables makes the op generate incorrect output on standard inputs (e.g. the inputs in the CI run) - the values generated in the output NDArray are all integers instead of floats. This is due to the fact that these variables do not represent the index positions (as I also originally believed), but instead represent the underlying values at the vertices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. good catch.
_v indicates the value at that particular index. Thanks!

Copy link
Contributor

@ChaiBapchya ChaiBapchya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@connorgoggins connorgoggins force-pushed the fix_spatial_transformer_large_tensor branch 3 times, most recently from d3696f2 to 50d1530 Compare February 24, 2020 22:55
@connorgoggins connorgoggins force-pushed the fix_spatial_transformer_large_tensor branch from 50d1530 to 5ac85a6 Compare February 25, 2020 10:02
@apeforest apeforest merged commit 13b3893 into apache:master Feb 25, 2020
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
* Added CPU fix

* Added fix for backward on CPU

* Fixed lint error

* index_t for lower bound instead of hardcoded long int

* Fixed remaining lint errors

* Removed trailing whitespace

* Reverting to DType for vertices

* Added nightly test for SpatialTransformer
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants