test/blocksparse_conv_test.py fails and examples/simple.py sometimes raises an invalid memory access error #53

Open
xuyifangreeneyes opened this issue Jul 19, 2020 · 1 comment


System information

  • OS Platform and Distribution: Linux Ubuntu 18.04
  • TensorFlow version: 1.13.1 (with GPU support)
  • Python version: 3.7.7
  • CUDA/cuDNN version: 10.0 / 7
  • GPU: Tesla T4

Encountered problem
I tried both pip install blocksparse and building from source. After installation, import blocksparse works in Python and most tests pass. However, when I run test/blocksparse_conv_test.py, the following error occurs.

(tf13) ubuntu@xxx:~/blocksparse$ python test/blocksparse_conv_test.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/contextlib.py:82: TensorFlowTestCase.test_session (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `self.session()` or `self.cached_session()` instead.
2020-07-19 15:22:55.214905: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:22:55.236910: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:22:55.237482: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55b7bd771c50 executing computations on platform Host. Devices:
2020-07-19 15:22:55.237509: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:22:55.362344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:22:55.363172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:22:55.363193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:22:55.393925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:22:55.393972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:22:55.393981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:22:55.394077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:22:55.395613: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55b7bbfe59e0 executing computations on platform CUDA. Devices:
2020-07-19 15:22:55.395639: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5

test1
2020-07-19 15:22:55.429514: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at blocksparse_conv_op.cc:320 : Internal: device kernel image is invalid
ERROR:tensorflow:device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Caused by op 'test1/F4B4/BlocksparseConv', defined at:
  File "test/blocksparse_conv_test.py", line 213, in <module>
    tf.test.main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/test.py", line 64, in main
    return _googletest.main(argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 100, in main
    benchmark.benchmarks_main(true_main=main_wrapper)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/benchmark.py", line 371, in benchmarks_main
    true_main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 99, in main_wrapper
    return app.run(main=g_main, argv=args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 70, in g_main
    return unittest_main(argv=argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "test/blocksparse_conv_test.py", line 126, in testBlocksparseConv
    op   = bs_conv_op(devF, devI)
  File "/home/ubuntu/blocksparse/blocksparse/conv.py", line 511, in __call__
    dimF=F.get_shape().as_list(), fshare=self.fshared, bshare=self.bshared, debug=self.debug
  File "<string>", line 471, in blocksparse_conv
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Es
======================================================================
ERROR: testBlocksparseConv (__main__.BlocksparseConvTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: device kernel image is invalid
         [[{{node test1/F4B4/BlocksparseConv}}]]
         [[{{node test1/F4B4/BlocksparseConv}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test/blocksparse_conv_test.py", line 127, in testBlocksparseConv
    devO = sess.run( op )
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/test_util.py", line 1368, in run
    return super(ErrorLoggingSession, self).run(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Caused by op 'test1/F4B4/BlocksparseConv', defined at:
  File "test/blocksparse_conv_test.py", line 213, in <module>
    tf.test.main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/test.py", line 64, in main
    return _googletest.main(argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 100, in main
    benchmark.benchmarks_main(true_main=main_wrapper)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/benchmark.py", line 371, in benchmarks_main
    true_main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 99, in main_wrapper
    return app.run(main=g_main, argv=args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 70, in g_main
    return unittest_main(argv=argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "test/blocksparse_conv_test.py", line 126, in testBlocksparseConv
    op   = bs_conv_op(devF, devI)
  File "/home/ubuntu/blocksparse/blocksparse/conv.py", line 511, in __call__
    dimF=F.get_shape().as_list(), fshare=self.fshared, bshare=self.bshared, debug=self.debug
  File "<string>", line 471, in blocksparse_conv
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]


----------------------------------------------------------------------
Ran 2 tests in 0.231s

FAILED (errors=1, skipped=1)

In addition, an illegal memory access error sometimes occurs when running examples/simple.py. Here is the output from a run without the error.

(tf13) ubuntu@xxx:~/blocksparse$ python examples/simple.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-19 15:23:58.994318: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:23:59.016917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:23:59.017474: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56341a1f6330 executing computations on platform Host. Devices:
2020-07-19 15:23:59.017505: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:23:59.122639: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:23:59.123458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:23:59.123478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:23:59.152687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:23:59.152724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:23:59.152735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:23:59.152835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:23:59.154217: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5634190aa650 executing computations on platform CUDA. Devices:
2020-07-19 15:23:59.154239: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
[array([[-0.00464108, -0.00446517, -0.00446705, ..., -0.00433037,
        -0.00435545, -0.00431154],
       [ 0.00696341,  0.00687434,  0.00675924, ...,  0.00679887,
         0.00693929,  0.00719775],
       [ 0.01524079,  0.01537668,  0.01533529, ...,  0.01533816,
         0.01512151,  0.01528387],
       ...,
       [-0.00238256, -0.00245797, -0.0022754 , ..., -0.00224203,
        -0.00239737, -0.00237827],
       [-0.00508011, -0.00536294, -0.00516913, ..., -0.00537378,
        -0.00533525, -0.00540836],
       [ 0.01230985,  0.01257054,  0.01233936, ...,  0.01226609,
         0.012429  ,  0.01214379]], dtype=float32)]

And here is the output when the error appears.

(tf13) ubuntu@xxx:~/blocksparse$ python examples/simple.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-19 15:24:31.054902: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:24:31.076918: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:24:31.077469: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56258d1e4480 executing computations on platform Host. Devices:
2020-07-19 15:24:31.077494: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:24:31.176438: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:24:31.177252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:24:31.177274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:24:31.208119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:24:31.208164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:24:31.208176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:24:31.208278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:24:31.209716: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56258c098570 executing computations on platform CUDA. Devices:
2020-07-19 15:24:31.209739: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-07-19 15:24:31.685492: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-07-19 15:24:31.685539: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted (core dumped)
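
Because the crash is intermittent, it can help to measure how often it happens and to rerun with CUDA_LAUNCH_BLOCKING=1, a CUDA environment variable that makes kernel launches synchronous so the error is more likely to surface at the launch that caused it. The following is a minimal sketch, assuming the script is started from the repository root; count_failures is a hypothetical helper written for this illustration, not part of blocksparse or TensorFlow.

```python
import os
import subprocess
import sys


def count_failures(command, runs=10, extra_env=None):
    """Run `command` `runs` times and return how many runs exited non-zero.

    A process killed by a signal (for example "Aborted (core dumped)")
    has a negative return code on Linux, so it is counted as a failure too.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Example use (left as comments because it needs the blocksparse checkout):
#   cmd = [sys.executable, "examples/simple.py"]
#   plain = count_failures(cmd, runs=20)
#   blocking = count_failures(cmd, runs=20,
#                             extra_env={"CUDA_LAUNCH_BLOCKING": "1"})
#   print(plain, blocking)
```

If the failure rate changes noticeably between the two modes, that may hint at a race between kernels rather than a deterministic out-of-bounds access, although this sketch alone cannot confirm either.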

I guess that those problems are due to my TensorFlow and CUDA version. Could anyone help me? Thanks a lot!
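
One way to test the version guess is to compare the GPU's compute capability, which the logs above report as 7.5 for the Tesla T4, against the architectures the blocksparse kernels were built for. The sketch below only parses the capability out of a TensorFlow log line. BUILT_FOR is a hypothetical placeholder, not the real build list, and must be replaced with the architectures from your own build. That a missing architecture explains "device kernel image is invalid" is an assumption here; nothing in this thread confirms it.

```python
import re

# Hypothetical placeholder: replace with the (major, minor) architectures
# your blocksparse build actually targets. These values are NOT taken from
# the blocksparse sources.
BUILT_FOR = {(6, 0), (6, 1), (7, 0)}

# Matches both "compute capability: 7.5" and "Compute Capability 7.5",
# the two spellings that appear in the TensorFlow 1.13 logs above.
CAPABILITY_RE = re.compile(r"compute capability:? (\d+)\.(\d+)", re.IGNORECASE)


def parse_compute_capability(log_text):
    """Return (major, minor) from the first match in log_text, or None."""
    match = CAPABILITY_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_native_kernels(capability, built_for=BUILT_FOR):
    """True if the build list contains exactly this architecture."""
    return capability in built_for


LOG_LINE = (
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 "
    "with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, "
    "pci bus id: 0000:00:1e.0, compute capability: 7.5)"
)

capability = parse_compute_capability(LOG_LINE)
print("GPU compute capability:", capability)
print("Covered by placeholder build list:", has_native_kernels(capability))
```

With the placeholder list the check reports that 7.5 is not covered, which is only meaningful once BUILT_FOR reflects the real build flags.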

ujay-zheng commented Feb 8, 2022

System information

  • OS Platform and Distribution: Linux Ubuntu 18.04
  • TensorFlow version: 1.13.1 (with GPU support)
  • Python version: 3.7.7
  • CUDA/cuDNN version: 10.0 / 7
  • GPU: Tesla T4

Encountered problem
I tried both pip install blocksparse and building from source. After installation, import blocksparse works in Python and most tests pass. However, when I run test/blocksparse_conv_test.py, the following error occurs.

(tf13) ubuntu@xxx:~/blocksparse$ python test/blocksparse_conv_test.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/contextlib.py:82: TensorFlowTestCase.test_session (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `self.session()` or `self.cached_session()` instead.
2020-07-19 15:22:55.214905: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:22:55.236910: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:22:55.237482: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55b7bd771c50 executing computations on platform Host. Devices:
2020-07-19 15:22:55.237509: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:22:55.362344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:22:55.363172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:22:55.363193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:22:55.393925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:22:55.393972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:22:55.393981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:22:55.394077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:22:55.395613: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55b7bbfe59e0 executing computations on platform CUDA. Devices:
2020-07-19 15:22:55.395639: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5

test1
2020-07-19 15:22:55.429514: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at blocksparse_conv_op.cc:320 : Internal: device kernel image is invalid
ERROR:tensorflow:device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Caused by op 'test1/F4B4/BlocksparseConv', defined at:
  File "test/blocksparse_conv_test.py", line 213, in <module>
    tf.test.main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/test.py", line 64, in main
    return _googletest.main(argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 100, in main
    benchmark.benchmarks_main(true_main=main_wrapper)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/benchmark.py", line 371, in benchmarks_main
    true_main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 99, in main_wrapper
    return app.run(main=g_main, argv=args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 70, in g_main
    return unittest_main(argv=argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "test/blocksparse_conv_test.py", line 126, in testBlocksparseConv
    op   = bs_conv_op(devF, devI)
  File "/home/ubuntu/blocksparse/blocksparse/conv.py", line 511, in __call__
    dimF=F.get_shape().as_list(), fshare=self.fshared, bshare=self.bshared, debug=self.debug
  File "<string>", line 471, in blocksparse_conv
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Es
======================================================================
ERROR: testBlocksparseConv (__main__.BlocksparseConvTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: device kernel image is invalid
         [[{{node test1/F4B4/BlocksparseConv}}]]
         [[{{node test1/F4B4/BlocksparseConv}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test/blocksparse_conv_test.py", line 127, in testBlocksparseConv
    devO = sess.run( op )
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/test_util.py", line 1368, in run
    return super(ErrorLoggingSession, self).run(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Caused by op 'test1/F4B4/BlocksparseConv', defined at:
  File "test/blocksparse_conv_test.py", line 213, in <module>
    tf.test.main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/test.py", line 64, in main
    return _googletest.main(argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 100, in main
    benchmark.benchmarks_main(true_main=main_wrapper)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/benchmark.py", line 371, in benchmarks_main
    true_main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 99, in main_wrapper
    return app.run(main=g_main, argv=args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 70, in g_main
    return unittest_main(argv=argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "test/blocksparse_conv_test.py", line 126, in testBlocksparseConv
    op   = bs_conv_op(devF, devI)
  File "/home/ubuntu/blocksparse/blocksparse/conv.py", line 511, in __call__
    dimF=F.get_shape().as_list(), fshare=self.fshared, bshare=self.bshared, debug=self.debug
  File "<string>", line 471, in blocksparse_conv
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]


----------------------------------------------------------------------
Ran 2 tests in 0.231s

FAILED (errors=1, skipped=1)

In addition, an invalid memory access sometimes occurs when running examples/simple.py. Here is the output of a run without the error.

(tf13) ubuntu@xxx:~/blocksparse$ python examples/simple.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-19 15:23:58.994318: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:23:59.016917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:23:59.017474: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56341a1f6330 executing computations on platform Host. Devices:
2020-07-19 15:23:59.017505: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:23:59.122639: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:23:59.123458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:23:59.123478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:23:59.152687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:23:59.152724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:23:59.152735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:23:59.152835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:23:59.154217: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5634190aa650 executing computations on platform CUDA. Devices:
2020-07-19 15:23:59.154239: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
[array([[-0.00464108, -0.00446517, -0.00446705, ..., -0.00433037,
        -0.00435545, -0.00431154],
       [ 0.00696341,  0.00687434,  0.00675924, ...,  0.00679887,
         0.00693929,  0.00719775],
       [ 0.01524079,  0.01537668,  0.01533529, ...,  0.01533816,
         0.01512151,  0.01528387],
       ...,
       [-0.00238256, -0.00245797, -0.0022754 , ..., -0.00224203,
        -0.00239737, -0.00237827],
       [-0.00508011, -0.00536294, -0.00516913, ..., -0.00537378,
        -0.00533525, -0.00540836],
       [ 0.01230985,  0.01257054,  0.01233936, ...,  0.01226609,
         0.012429  ,  0.01214379]], dtype=float32)]

And here is the output when the error appears.

(tf13) ubuntu@xxx:~/blocksparse$ python examples/simple.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-19 15:24:31.054902: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:24:31.076918: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:24:31.077469: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56258d1e4480 executing computations on platform Host. Devices:
2020-07-19 15:24:31.077494: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:24:31.176438: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:24:31.177252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:24:31.177274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:24:31.208119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:24:31.208164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:24:31.208176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:24:31.208278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:24:31.209716: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56258c098570 executing computations on platform CUDA. Devices:
2020-07-19 15:24:31.209739: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-07-19 15:24:31.685492: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-07-19 15:24:31.685539: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted (core dumped)

I guess these problems are caused by my TensorFlow and CUDA versions. Could anyone help me? Thanks a lot!
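One way to narrow this guess down: "device kernel image is invalid" can mean that the loaded kernel binary was not built for this GPU's architecture. The sketch below only parses the compute capability from the TensorFlow log line above and compares it with a list of build targets. The list `HYPOTHETICAL_BUILT_ARCHS` is an assumption for illustration, not what blocksparse actually ships, and the check ignores PTX JIT compilation.

```python
import re

# Device property line copied from the TensorFlow log above.
LOG_LINE = "name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59"


def compute_capability(log_line):
    """Return (major, minor) parsed from a TensorFlow device property line."""
    match = re.search(r"major:\s*(\d+)\s+minor:\s*(\d+)", log_line)
    if match is None:
        raise ValueError("no compute capability found in log line")
    return int(match.group(1)), int(match.group(2))


def has_native_kernel(capability, built_archs):
    """True if some sm_XY target can run on the device.

    Simplified rule: same major version and built minor <= device minor.
    PTX forward compatibility is deliberately not modeled here.
    """
    major, minor = capability
    for arch in built_archs:
        digits = arch.split("_", 1)[1]
        built_major, built_minor = int(digits[:-1]), int(digits[-1])
        if built_major == major and built_minor <= minor:
            return True
    return False


# ASSUMPTION: hypothetical target list; replace it with the architectures
# your build actually compiled for.
HYPOTHETICAL_BUILT_ARCHS = ["sm_50", "sm_60", "sm_61"]

cap = compute_capability(LOG_LINE)
print(cap, has_native_kernel(cap, HYPOTHETICAL_BUILT_ARCHS))
```

If the check comes back False for the real target list, rebuilding from source with the T4's architecture (compute capability 7.5) included would be the next thing to try.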

@xuyifangreeneyes I successfully ran the Docker container by following this issue. However, when running simple.py after installing blocksparse, I had the same problem. I only changed hidden_size in simple.py to 4096*2, and it crashed. I wonder if you found a solution.
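Since the crash is intermittent, it may help to measure how often it happens before and after changing a setting. A minimal sketch, assuming the script is run as a subprocess and that any non-zero exit status (including an abort) counts as a failure; the helper name `count_failures` is made up for this example:

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited with a non-zero status."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX an aborted process reports a negative return code,
        # which is also counted as a failure here.
        if result.returncode != 0:
            failures += 1
    return failures


# Harmless placeholder command so the sketch runs anywhere.
demo_cmd = [sys.executable, "-c", "pass"]
print(count_failures(demo_cmd, 3))
```

To use it on the reported case, replace `demo_cmd` with `[sys.executable, "examples/simple.py"]` and run from the repository root.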
