std::bad_alloc issue with Ubuntu18.04 #10

Open
pramenku opened this issue Jul 18, 2018 · 22 comments

Comments

@pramenku
Contributor

pramenku commented Jul 18, 2018

I gave miopen-benchmark a try on Ubuntu 18.04.
The tests built fine, but running them failed with a "std::bad_alloc" error.

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

The issue comes from miopen-benchmark's header file, where it tries to construct a directory path.

Thread 1 "alexnet" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff2779801 in __GI_abort () at abort.c:79
#2  0x00007ffff2dce8fb in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff2dd4d3a in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff2dd4d95 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff2dd4fe8 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff2dfdf26 in std::__throw_bad_alloc() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000000000463ced in __gnu_cxx::new_allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> >::allocate
    (__n=12297829382473034424, this=<optimized out>) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/ext/new_allocator.h:102
#8  std::allocator_traits<std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::allocate (
    __n=12297829382473034424, __a=...) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/alloc_traits.h:436
#9  std::_Vector_base<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int>, std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::_M_allocate (__n=12297829382473034424, this=<optimized out>)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/stl_vector.h:172
#10 std::_Vector_base<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int>, std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::_M_create_storage (__n=12297829382473034424, this=<optimized out>)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/stl_vector.h:187
#11 std::_Vector_base<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int>, std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::_Vector_base (__n=12297829382473034424, this=<optimized out>, __a=...)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/stl_vector.h:138
#12 std::vector<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int>, std::allocator<std::pair<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, int> > >::vector (__n=12297829382473034424, this=<optimized out>, __a=...)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/stl_vector.h:284
#13 std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Executor (__begin=..., __end=..., 
    __results=..., __re=..., __flags=<optimized out>, this=<optimized out>) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex_executor.h:79
#14 std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char>, (std::__detail::_RegexExecutorPolicy)0, true> (__s=118 'v', __e=0 '\000', __m=..., __re=..., __flags=(unknown: 0)) at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex.tcc:78
#15 0x00000000004589d7 in std::regex_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> > (
    __s=<error reading variable: Cannot access memory at address 0x2>, __e=0 '\000', __m=..., __re=..., __flags=(unknown: 0))
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex.h:1995
#16 std::regex_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, char, std::__cxx11::regex_traits<char> > (
    __first=<error reading variable: Cannot access memory at address 0x2>, __last=0 '\000', __re=..., __flags=(unknown: 0))
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex.h:2022
#17 std::regex_match<std::char_traits<char>, std::allocator<char>, char, std::__cxx11::regex_traits<char> > (__re=..., __flags=(unknown: 0), __s=...)
    at /usr/lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/bits/regex.h:2127
#18 ls_dir (dname=..., match=...) at ./miopen.hpp:142
#19 0x000000000045b41f in Device::init_sys_paths (this=0xc8f270) at ./miopen.hpp:175
#20 0x00000000004590c3 in Devices::init_devices () at ./miopen.hpp:289
#21 0x000000000045b2b6 in device_init () at ./miopen.hpp:299
#22 main (argc=2, argv=0x7fffffffdb20) at alexnet.cpp:60

This may be relevant:
https://stackoverflow.com/questions/36106154/how-to-handle-or-avoid-exceptions-from-c11-regex-matching-functions-28-11

Thanks

@mattsinc

mattsinc commented Jul 18, 2018

I've had this same error before too. I assumed it was just something odd with my setup (I'm using RHEL7, which is not as well supported), but it appears not. The issue is with the regex commands. Specifically, it seems different OS's (or versions of an OS) implement the regex support differently. For example, I changed this line:

https://github.com/patflick/miopen-benchmark/blob/master/miopen.hpp#L175

to use this:

std::regex("card(\\d)+")

(I also applied similar fixes elsewhere in miopen.hpp, but this seems to be the line you're having a problem with)

Not sure if the same fix works for you?

Hope this helps,
Matt

@pramenku
Contributor Author

pramenku commented Jul 19, 2018

Hi Matt,
miopen.hpp already has that at L175.

However, we are still seeing the regex issue. The change doesn't help on Ubuntu 18.04.

Thanks,

@mattsinc

mattsinc commented Jul 19, 2018

Just in case we're miscommunicating, the change I'm proposing for line 175 is small and based on your response it seems like you may have thought it was identical to what is already there. Currently it is:

std::regex("card\\d+")

I changed it to this:

std::regex("card(\\d)+")

(I just put parentheses around the \\d ... I found this was necessary on RHEL7)

Matt

@pramenku
Contributor Author

pramenku commented Jul 19, 2018

I made the same change again as you described, but it still doesn't work.

for (std::string cardname : ls_dir("/sys/class/drm", std::regex("card(\\d)+")))

The "\" does not show up once the comment is posted, but I did use "\" in the code.

@mattsinc

mattsinc commented Jul 19, 2018

Yeah, the "\" doesn't show up unless you use the code feature (put "`" around the code part).

Sorry my fix didn't work for you. I will say that I played around with a bunch of the C++ regex options before settling on that. Perhaps one of the others will solve your problem?

Also, when I was making those changes, I broke apart line 175 so I could run with gdb and figure out exactly what is failing. I suggest you do the same.

EDIT: One last thing: did you use 1 or 2 backslashes in the above? I used 2, but it seems like you may have used one based on what you said.

Hope this helps,
Matt

@patflick
Owner

My apologies for the late reply.

Your stack trace points to something very odd. Somewhere in regex_match, it tries to allocate a std::vector of size 12297829382473034424. I can't reproduce this, nor can I imagine why it would happen. Putting parentheses around the \\d should not change the regex match; it just creates a capture for the card number. Also, technically it should then be "card(\\d+)" (note the + inside the capture).

Could you try printing the fname values just before the regex match in https://github.com/patflick/miopen-benchmark/blob/master/miopen.hpp#L142
(insert an INFO(fname); before that line) and see at which file name it fails with the regex error?

@pramenku
Contributor Author

pramenku commented Jul 23, 2018

Thanks, Patflick. Sorry for the delayed response.
I tried what you suggested and got the output below:

$ ./layerwise
[INFO]  Number of HIP devices found: 2
[INFO]  card1-DP-6
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

    while ((entry = readdir(dir)) != NULL) {
        std::string fname(entry->d_name);
        if (fname != "." && fname != "..") {
            INFO(fname);
            if (std::regex_match(fname, match)) {
                files.push_back(fname);

    void init_sys_paths() {
        bool found = false;
        for (std::string cardname : ls_dir("/sys/class/drm", std::regex("card(\\d+)"))) {
            std::string carddir = "/sys/class/drm/" + cardname;
            std::string fname = carddir + "/device/uevent";

@patflick
Owner

They seem to have changed the directory structure / folder naming scheme for the sysfs driver api.

I still don't know why the regex would segfault, but the regex card(\\d+) won't match the folder name card1-DP-6. When I first coded this, the cards/devices were named card0, card1, etc.

To check the folder structure, can you run tree in /sys/class/drm/ ?

Also, try changing the regex from card(\\d+) to card(\\d+).*. That should allow the code to match the card1-DP-6 folder.

@pramenku
Contributor Author

pramenku commented Jul 30, 2018

Thanks.
I tried that, but still no luck.

$ ./alexnet
[INFO]  Number of HIP devices found: 1
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Aborted (core dumped)

for (std::string cardname : ls_dir("/sys/class/drm", std::regex("card\\d+).*."))) {

        if (fname != "." && fname != "..") {
            INFO(fname);
            if (std::regex_match(fname, match)) {
                files.push_back(fname);

/sys/class/drm$ ls -lrt
total 0
-r--r--r-- 1 root root 4096 Jul 30 12:43 version
lrwxrwxrwx 1 root root    0 Jul 30 12:43 ttm -> ../../devices/virtual/drm/ttm
lrwxrwxrwx 1 root root    0 Jul 30 12:43 renderD128 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/renderD128
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0-HDMI-A-1 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0/card0-HDMI-A-1
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0-DP-3 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0/card0-DP-3
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0-DP-2 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0/card0-DP-2
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0-DP-1 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0/card0-DP-1
lrwxrwxrwx 1 root root    0 Jul 30 12:43 card0 -> ../../devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/drm/card0

Have you ever tried this on Ubuntu 18.04?

@pramenku
Contributor Author

Any more suggestions, or is someone looking into it?
Is it possible just to not use regex and instead roll your own parser?

@pramenku
Contributor Author

Hi patflick,
Could you please resolve this issue? It has been pending for a long time. All end users are seeing it on Ubuntu 18.04.
Thanks for the help.

patflick added a commit that referenced this issue Sep 17, 2018
@patflick
Owner

Hi @pramenku

I could never reproduce your error.

I just pushed a code change that might help, although I'm really just guessing. If this doesn't work, your best bet is to try to debug this yourself. Sorry.

@pramenku
Contributor Author

Thanks, @patflick. I will try it and update you.
In the meantime, could you please merge PR #12?

@patflick
Owner

Hi @pramenku. I merged the PR. Did you get a chance to try the potential fix?

@pramenku
Contributor Author

Sorry for the delay, @patflick. I have been tied up with higher-priority tasks.
I will update you by tomorrow without fail.

@pramenku
Contributor Author

Hi @patflick,
I tried with the latest changes as well, and the issue is still observed.
With ROCm release 1.9 and Ubuntu 18.04, we are clearly seeing the issue. I am surprised that no one else is seeing it. Ideally speaking, everyone should see this issue.

@mattsinc

Hi @pramenku,

I think I understand your problem now. Unfortunately I don't know if there is a happy solution. To the best of my knowledge, ROCm does not yet support Ubuntu 18.04. I believe the specific problem you are encountering is that 18.04 has gcc/g++ 7.2 as the "default" gcc/g++, but ROCm needs gcc/g++ 5.4. Have you tried installing 5.4 locally and pointing to that instead?

Matt

@pramenku
Contributor Author

pramenku commented Sep 28, 2018

Thanks, @mattsinc.
You understood exactly what I wanted to convey.
There is no issue on Ubuntu 16.04, which has gcc 5.4.
The issue occurs on Ubuntu 18.04, which has gcc 7.2.

With ROCm release 1.9, Ubuntu 18.04 is also supported. Please check https://github.com/RadeonOpenCompute/ROCm/blob/master/README.md.

So anyone trying Ubuntu 18.04 with ROCm 1.9 will see this issue.

@mattsinc

@pramenku, I did not realize that ROCm 1.9 had that support. If you use gcc 5.4 with ROCm 1.9 does it work? If so, I would guess the problem is a ROCm problem?

Matt

@pramenku
Contributor Author

pramenku commented Sep 29, 2018

@mattsinc
It's not an issue with ROCm 1.9. It's just that the app's source code needs to be modified for Ubuntu 18.04, which has gcc 7.2.
I am not sure whether CUDA supports Ubuntu 18.04, but I suspect the issue would show up there too.
I request that someone try to debug what needs to be changed for Ubuntu 18.04.

@carlushuang
Contributor

carlushuang commented Nov 27, 2018

Hi @patflick and all

It has been about 2 months since the last discussion, but I encountered the same issue with the tip of the code. My environment is Ubuntu 16.04 plus a manually installed gcc-7.3.0.

To narrow it down, I wrote a very simple example:

#include <regex>
#include <string>
#include <iostream>

int main(){
	std::string fname("amdgpu");
	std::regex card_re("card\\d+");
	bool result = std::regex_match(fname, card_re);

	std::cout<<result<<std::endl;
}

Saving the above code as "main.cc", I ran several tests:

  1. Compile with /opt/rocm/hip/bin/hipcc main.cc and run a.out: no problem.
  2. Compile with /opt/rocm/hip/bin/hipcc -O3 main.cc and run a.out: std::bad_alloc occurs.
  3. Compile with gcc-7.3.0: no problem, with either -O3 or the default flags.

To be concrete, I can reproduce this regex issue with hipcc and the -O3 flag.

So I'm curious whether the hipcc compiler has a compatibility issue with gcc-7.3.0. Or maybe @pramenku can help test this in an Ubuntu 18.04 environment?

Below is my hipcc info (/opt/rocm/bin/hipcc --version):

HIP version: 1.5.18442
HCC clang version 7.0.0 (ssh:https://gerritgit/compute/ec/hcc-tot/clang 4ed1d60af7c26e833d6d4452ba526d2daaa6ed35) (ssh:https://gerritgit/compute/ec/hcc-tot/llvm c57b310200941724972aa5c5c90cbc151d1978f4) (based on HCC 1.2.18451-82f39f1-4ed1d60-c57b310 )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin

@carlushuang
Contributor

carlushuang commented Nov 27, 2018

#13

@patflick Hi, I made a quick workaround for this issue that does not use regex to check the device name. If it's not acceptable, just drop it.
