Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defunct java processes after upgrading to 23.10.1 #2567

Closed
linsomniac opened this issue Nov 13, 2023 · 13 comments · Fixed by #2914
Closed

Defunct java processes after upgrading to 23.10.1 #2567

linsomniac opened this issue Nov 13, 2023 · 13 comments · Fixed by #2914

Comments

@linsomniac
Copy link

Environment

self-hosted (https://develop.sentry.dev/self-hosted/)

Steps to Reproduce

  1. Upgraded from 23.6.2 to 23.10.1.
  2. Ran "ps awwlx | grep defunct".
  3. Several java processes are now showing as defunct.

Expected Result

No defunct java processes.

Actual Result

We have two self-hosted Sentry using docker compose, and with our upgrade to 23.10.1 our monitoring on the host started reporting defunct processes. It looks like these are Kafka related processes.

One system:

# ps awwlx | grep 1079277
0     0 1000210  999907  20   0   9032   720 pipe_w S+   pts/0      0:00 grep --color=auto 1079277
4     0 1079277 1078583  20   0 5880368 641220 futex_ Ssl ?       2060:08 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
4     0 1092665 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1096955 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1098576 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1100465 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1102716 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1104822 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1106544 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>
0     0 1110312 1079277  20   0      0     0 -      Z    ?          0:03 [java] <defunct>

The other system:

# ps -awwlx | grep 2554
4     0    2554    2466  20   0 7902972 1255820 futex_ Ssl ?      1706:40 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
4     0    6881    2554  20   0      0     0 -      Z    ?          0:05 [java] <defunct>
0     0   19774    2554  20   0      0     0 -      Z    ?          0:05 [java] <defunct>
0     0   20656    2554  20   0      0     0 -      Z    ?          0:04 [java] <defunct>
0     0   21318    2554  20   0      0     0 -      Z    ?          0:05 [java] <defunct>
0     0 2900197 2900049  20   0   9032   720 pipe_w S+   pts/0      0:00 grep --color=auto 2554

I have a 30MB log file from /var/log/kafka in the kafka container but don't see anything particularly interesting in it. Here's a snippet:

OpenJDK 64-Bit Server VM (25.222-b10) for linux-amd64 JRE (Zulu 8.40.0.25-CA-linux64) (1.8.0_222-b10), built on Jul 11 2019 11:36:39 by "zulu_re" with gcc 4.4.7 20120313 (Red Hat 4.4.7-3)
Memory: 4k page, physical 16380436k(7895772k free), swap 1046504k(1046504k free)
CommandLine flags: -XX:+ExplicitGCInvokesConcurrent -XX:GCLogFileSize=104857600 -XX:InitialHeapSize=1073741824 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ManagementServer -XX:MaxGCPauseMillis=20 -XX:MaxHeapSize=1073741824 -XX:MaxInlineLevel=15 -XX:NumberOfGCLogFiles=10 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation
2023-10-27T21:01:25.667+0000: 3.328: [GC pause (G1 Evacuation Pause) (young), 0.0408966 secs]
   [Parallel Time: 16.8 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 3330.2, Avg: 3334.9, Max: 3342.1, Diff: 11.9]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.8, Max: 3.6, Diff: 3.6, Sum: 6.3]
      [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Processed Buffers: Min: 0, Avg: 0.0, Max: 0, Diff: 0, Sum: 0]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.5, Diff: 0.5, Sum: 1.0]
      [Object Copy (ms): Min: 0.0, Avg: 6.2, Max: 9.9, Diff: 9.9, Sum: 49.7]
      [Termination (ms): Min: 0.0, Avg: 1.3, Max: 4.6, Diff: 4.6, Sum: 10.2]
         [Termination Attempts: Min: 1, Avg: 7.8, Max: 16, Diff: 15, Sum: 62]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [GC Worker Total (ms): Min: 0.9, Avg: 8.4, Max: 12.7, Diff: 11.8, Sum: 67.6]
      [GC Worker End (ms): Min: 3342.9, Avg: 3343.3, Max: 3346.7, Diff: 3.8]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.4 ms]
   [Other: 23.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 20.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.3 ms]
      [Humongous Register: 2.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.2 ms]
   [Eden: 52224.0K(52224.0K)->0.0B(45056.0K) Survivors: 0.0B->7168.0K Heap: 52224.0K(1024.0M)->9408.0K(1024.0M)]
 [Times: user=0.07 sys=0.02, real=0.04 secs]
2023-10-27T21:01:25.963+0000: 3.623: [GC pause (Metadata GC Threshold) (young) (initial-mark), 0.0404999 secs]
   [Parallel Time: 23.6 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 3623.8, Avg: 3624.8, Max: 3625.4, Diff: 1.7]
      [Ext Root Scanning (ms): Min: 0.2, Avg: 4.1, Max: 9.3, Diff: 9.2, Sum: 32.9]
      [Update RS (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 0.5]
         [Processed Buffers: Min: 0, Avg: 0.2, Max: 1, Diff: 1, Sum: 2]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.1, Max: 0.9, Diff: 0.9, Sum: 0.9]
      [Object Copy (ms): Min: 0.0, Avg: 2.6, Max: 5.6, Diff: 5.6, Sum: 20.6]
      [Termination (ms): Min: 0.0, Avg: 4.6, Max: 16.2, Diff: 16.2, Sum: 36.5]
         [Termination Attempts: Min: 1, Avg: 29.9, Max: 91, Diff: 90, Sum: 239]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [GC Worker Total (ms): Min: 9.3, Avg: 11.4, Max: 22.0, Diff: 12.7, Sum: 91.6]
      [GC Worker End (ms): Min: 3634.7, Avg: 3636.2, Max: 3646.8, Diff: 12.0]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 4.6 ms]
   [Other: 12.3 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 11.0 ms]

Everything seems to be working fine, there are just a few defunct processes. It has been running 2 weeks since the update with no problems and no additional defunct processes.

Product Area

Other

Link

No response

DSN

No response

Version

23.10.1

@getsantry
Copy link

getsantry bot commented Nov 13, 2023

Assigning to @getsentry/support for routing ⏲️

@kerenkhatiwada kerenkhatiwada transferred this issue from getsentry/sentry Nov 13, 2023
@hubertdeng123
Copy link
Member

Thanks for bringing this up. We seem to be seeing this as well, but will need to take time to investigate. Going to backlog this.

@e100
Copy link

e100 commented Dec 8, 2023

I've seen this on 23.11.2 also.
Seems to be triggered by too many processes fighting for resources during startup.
Throwing more CPU cores at the VM results in fewer zombies after startup.

@le0pard
Copy link
Contributor

le0pard commented Jan 17, 2024

$ ps aux | grep 'Z'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       89379  1.0  0.0      0     0 ?        Z    20:35   0:03 [java] <defunct>
root      100867  1.1  0.0      0     0 ?        Z    20:35   0:03 [java] <defunct>
root      101220  1.2  0.0      0     0 ?        Z    20:35   0:03 [java] <defunct>

same issue after update to 23.11.2. Restart or reboot not help with it

@mkhon
Copy link

mkhon commented Feb 26, 2024

Same issue here. Seems that zombie processes are related to io.confluent.support.metrics.SupportedKafka and the number of such processes grows over time

ps axf output:

2332779 ?        Ssl    1:50  \_ java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokes
Concurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -X
X:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxr
emote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp 
/usr/bin/../share/java/kafka/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java
/support-metrics-client/* io.confluent.support.metrics.SupportedKafka /etc/kafka/kafka.properties
2333049 ?        Z      0:08  |   \_ [java] <defunct>
2339805 ?        Z      0:03  |   \_ [java] <defunct>
2363937 ?        Z      0:03  |   \_ [java] <defunct>
2366252 ?        Z      0:13  |   \_ [java] <defunct>
2366672 ?        Z      0:13  |   \_ [java] <defunct>
2367186 ?        Z      0:12  |   \_ [java] <defunct>
2367759 ?        Z      0:10  |   \_ [java] <defunct>
2369610 ?        Z      0:09  |   \_ [java] <defunct>
2374115 ?        Z      0:07  |   \_ [java] <defunct>
2374789 ?        Z      0:05  |   \_ [java] <defunct>
2375282 ?        Z      0:05  |   \_ [java] <defunct>
2375845 ?        Z      0:03  |   \_ [java] <defunct>
2376271 ?        Z      0:03  |   \_ [java] <defunct>
2377949 ?        Z      0:02  |   \_ [java] <defunct>
2383236 ?        Z      0:03  |   \_ [java] <defunct>
2388878 ?        Z      0:13  |   \_ [java] <defunct>
2390177 ?        Z      0:15  |   \_ [java] <defunct>
2392281 ?        Z      0:12  |   \_ [java] <defunct>
2395285 ?        Z      0:13  |   \_ [java] <defunct>
2396333 ?        Z      0:11  |   \_ [java] <defunct>
2396946 ?        Z      0:09  |   \_ [java] <defunct>
2397393 ?        Z      0:07  |   \_ [java] <defunct>
2398811 ?        Z      0:05  |   \_ [java] <defunct>
2400486 ?        Z      0:04  |   \_ [java] <defunct>
2403333 ?        Z      0:03  |   \_ [java] <defunct>
2404875 ?        Z      0:02  |   \_ [java] <defunct>
2405871 ?        Z      0:02  |   \_ [java] <defunct>
2416303 ?        Z      0:14  |   \_ [java] <defunct>
2418205 ?        Z      0:15  |   \_ [java] <defunct>

@le0pard
Copy link
Contributor

le0pard commented Mar 11, 2024

yes, this week sentry stop accept errors and transactions. On server I see

There are 37 zombie processes.

So this is very bad issue. Sentry is latest - 24.2.0

@chris-laack
Copy link

same here Sentry 24.2.0

@le0pard
Copy link
Contributor

le0pard commented Mar 14, 2024

ok, looks like I found where this issue came from. You need replace in docker-compose.yml:

test: ["CMD-SHELL", "/usr/bin/kafka-topics --bootstrap-server kafka:9092 --list"]

to old one:

test: ["CMD-SHELL", "nc -z localhost 9092"]

after this start of docker compose stop generating defunct java processes

Change introduced by 69f973a

@hubertdeng123
Copy link
Member

Super good find @le0pard, if that's the case we should revert the change that was introduced above.

@hubertdeng123
Copy link
Member

@chris-laack @mkhon @linsomniac Could you try changing your docker compose healthcheck for kafka to

test: ["CMD-SHELL", "nc -z localhost 9092"]

@chris-laack
Copy link

@hubertdeng123
i've just come back from vacation. i'll try it out today and let you know.

@chris-laack
Copy link

works for us.

@hubertdeng123
Copy link
Member

I've put up a PR to address this

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
Archived in project
Archived in project
Archived in project
Development

Successfully merging a pull request may close this issue.

7 participants