
yum gets deadlocked/hung up (indefinitely) waiting for urlgrabber-ext-down #127

Open
brianjmurrell opened this issue Dec 18, 2020 · 6 comments

Comments

@brianjmurrell

While I appreciate that YUM is now deprecated, it's still the main package manager on EL7, and that is where I'm running into an issue with it hanging indefinitely until it is killed.

The process tree looks like this:

 8702 ?        S      0:05  |       \_ /usr/bin/python /usr/bin/yum -y --disablerepo=* --enablerepo=repo.dc.hpdd.intel.com_repository_*,build.hpdd.intel.com_job_daos-stack* install --exclude openmpi daos-1.1.2.1-1.5456.g02ce0510.el7.x86_64 daos-client-1.1.2.1-1.5456.g02ce0510.el7.x86_64 daos-tests-1.1.2.1-1.5456.g02ce0510.el7.x86_64 daos-server-1.1.2.1-1.5456.g02ce0510.el7.x86_64 openmpi3 hwloc ndctl fio patchutils ior-hpc-daos-0 romio-tests-cart-4-daos-0 testmpio-cart-4-daos-0 mpi4py-tests-cart-4-daos-0 hdf5-mpich2-tests-daos-0 hdf5-openmpi3-tests-daos-0 hdf5-vol-daos-mpich2-tests-daos-0 hdf5-vol-daos-openmpi3-tests-daos-0 MACSio-mpich2-daos-0 MACSio-openmpi3-daos-0 mpifileutils-mpich-daos-0
 8705 ?        S      0:00  |           \_ /usr/bin/python /usr/libexec/urlgrabber-ext-down
 8711 ?        S      0:00  |           \_ /usr/bin/python /usr/libexec/urlgrabber-ext-down
 8712 ?        S      0:00  |           \_ /usr/bin/python /usr/libexec/urlgrabber-ext-down

The status of each process is:

# /tmp/strace -f -p 8702
/tmp/strace: Process 8702 attached
wait4(8711, ^C/tmp/strace: Process 8702 detached
 <detached ...>
# /tmp/strace -f -p 8705
/tmp/strace: Process 8705 attached
read(0, ^C/tmp/strace: Process 8705 detached
 <detached ...>
# /tmp/strace -f -p 8711
/tmp/strace: Process 8711 attached
futex(0x1444c90, FUTEX_WAIT_PRIVATE, 2, NULL^C/tmp/strace: Process 8711 detached
 <detached ...>
# /tmp/strace -f -p 8712
/tmp/strace: Process 8712 attached
futex(0x2174c90, FUTEX_WAIT_PRIVATE, 2, NULL^C/tmp/strace: Process 8712 detached
 <detached ...>

which to me looks like 8702, 8711 and 8705 are deadlocked, all waiting/blocked on one another.
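
For context, urlgrabber's external downloader helpers are driven over pipes: the yum parent writes download requests to each helper's stdin and reads results back from its stdout. A rough sketch of that loop (a simplified model, not the actual /usr/libexec/urlgrabber-ext-down source) shows why a helper sitting in read(0, ... is simply waiting for the parent to send more work or to close the pipe:

import sys

def downloader_loop(handle_request):
    # Simplified model of an external-downloader helper: read one request
    # per line from stdin, perform it, and report the outcome on stdout.
    while True:
        line = sys.stdin.readline()       # blocks in read(0, ...) between requests
        if not line:                      # EOF: the parent closed its write end
            break
        result = handle_request(line.rstrip('\n'))  # hypothetical fetch of the request
        sys.stdout.write(result + '\n')
        sys.stdout.flush()                # the parent reads this on its end of the pipe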

@lukash (Contributor) commented Jan 4, 2021

Just as a heads-up, the read(0, ... indicates that process 8705 is blocked reading standard input.

@brianjmurrell (Author)

@lukash Yes, I do realize that, but why? Its stdin is most likely a pipe from the parent process, which is itself simply waiting on its children.
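
That is also why a helper can hang forever: a blocking read on a pipe only returns once the writer sends data or every copy of the write end has been closed. A minimal, self-contained illustration (hypothetical Python, not yum/urlgrabber code):

import os

r, w = os.pipe()
pid = os.fork()
if pid == 0:                        # child: plays the role of urlgrabber-ext-down
    os.close(w)
    os.read(r, 4096)                # blocks in read(...) until data arrives or EOF
    os._exit(0)

os.close(r)
# While the parent keeps w open without writing -- say, because it is stuck
# in wait4() on a different child -- the child above stays blocked in read().
# Closing the write end delivers EOF and lets it exit:
os.close(w)
os.waitpid(pid, 0)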

@lukash (Contributor) commented Jan 5, 2021

I don't know. You haven't really provided a reproducer; I thought you might want to investigate yourself. This seems like a rare corner case, since you're only hitting it now, long after development has stopped. For the same reason it is likely to be low priority for us unless the impact turns out to be bigger (even with a reproducer).

@mikebriggs2k

We're hitting the same issue with one of our Ansible playbooks. It definitely does seem to be an edge case, because the playbook will run 99 times without a problem, yet we still see this hang periodically.

I'm seeing the same futex waits and reads as reported by Brian.

root      3743  3726  3715  3715  0 15:57 ?        00:00:03                 /usr/bin/python /bin/yum -d 2 -y install container-selinux docker-ce-18.09.7-3.el7
root      3744  3743  3715  3715  0 15:57 ?        00:00:00                   /usr/bin/python /usr/libexec/urlgrabber-ext-down
root      3745  3743  3715  3715  0 15:57 ?        00:00:00                   /usr/bin/python /usr/libexec/urlgrabber-ext-down
root      3746  3743  3715  3715  0 15:57 ?        00:00:00                   /usr/bin/python /usr/libexec/urlgrabber-ext-down
root      3747  3743  3715  3715  0 15:57 ?        00:00:01                   /usr/bin/python /usr/libexec/urlgrabber-ext-down
[root@<HOST> <USER>]# strace -p 3747
strace: Process 3747 attached
read(0, 
^Cstrace: Process 3747 detached
 <detached ...>
[root@<HOST> <USER>]# strace -p 3746
strace: Process 3746 attached
futex(0x26fbb90, FUTEX_WAIT_PRIVATE, 2, NULL
^Cstrace: Process 3746 detached
 <detached ...>
[root@<HOST> <USER>]# strace -p 3745
strace: Process 3745 attached
futex(0x16acb70, FUTEX_WAIT_PRIVATE, 2, NULL
^Cstrace: Process 3745 detached
 <detached ...>
[root@<HOST> <USER>]# strace -p 3744
strace: Process 3744 attached
read(0, 
^Cstrace: Process 3744 detached
 <detached ...>

@james-antill (Contributor)

Are you setting minrate/timeout?
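
For reference (as I understand the yum.conf semantics), both options go in the [main] section of /etc/yum.conf; the values below are illustrative only, not recommendations:

[main]
# seconds to wait before a stalled connection is considered timed out (example value)
timeout=30
# low-speed threshold in bytes/second; transfers slower than this for
# 'timeout' seconds are aborted (example value)
minrate=1000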

@rponnuru commented Nov 29, 2022

When we try to install ROCm in a CentOS 7.9.2009 Docker container, we hit the same problem. It happens roughly once every 20 runs.

master@:~> ps -eaf | grep 26373
master 20906 18586 0 12:33 pts/2 00:00:00 grep --color=auto 26373
root 26373 518 0 03:40 ? 00:00:00 /usr/bin/python /usr/bin/yum -y install rocm-openmp-sdk5.3.2
root 26388 26373 0 03:40 ? 00:00:00 /usr/bin/python /usr/libexec/urlgrabber-ext-down
root 26389 26373 0 03:40 ? 00:00:00 /usr/bin/python /usr/libexec/urlgrabber-ext-down

master@:~> sudo strace -p 26388
strace: Process 26388 attached
futex(0x2233bb0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 26388 detached
<detached ...>

master@:~> sudo strace -p 26389
strace: Process 26389 attached
read(0, ^Cstrace: Process 26389 detached
<detached ...>

master@:~> sudo strace -p 26373
strace: Process 26373 attached
wait4(18278, ^Cstrace: Process 26373 detached
<detached ...>

Is there any solution or workaround for this problem?
