
redis omem leaking issue on T2 supervisor #20680

Open
sdszhang opened this issue Nov 4, 2024 · 4 comments
Labels: Triaged (this issue has been triaged)

Comments


sdszhang commented Nov 4, 2024

Description

We are seeing a memory leak on the T2 Supervisor when running the nightly test, which causes redis memory to keep increasing until it fails the sanity_check in sonic-mgmt.

Following is one of the logs where db memory exceeded the sanity_check threshold:

06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0303 INFO   | asic0 db memory over the threshold 
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0304 INFO   | asic0 db memory omem non-zero output: 
id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0307 INFO   | Done checking database memory on svcstr2-8800-sup-1

06/10/2024 05:26:45 parallel.parallel_run                    L0221 INFO   | Completed running processes for target "_check_dbmemory_on_dut" in 0:00:02.809825 seconds
06/10/2024 05:26:45 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': True, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 11584760}]
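
For reference, a rough way to reproduce the number this sanity check reports is to sum the omem field over the redis CLIENT LIST output. The sketch below is an approximation only, assuming redis-cli is available on the supervisor and using the unix socket path shown in the logs above (adjust for per-asic instances).

# Minimal sketch: sum output-buffer memory (omem) across all redis clients.
redis-cli -s /var/run/redis/redis.sock client list \
  | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^omem=/) { split($i, kv, "="); total += kv[2] } }
         END { print "total omem:", total }'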

The memory leak was seen after running any one of the following 3 modules. Once total_omem becomes non-zero, it keeps increasing until it goes over the threshold.

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Steps to reproduce the issue:

  1. Run the full nightly test on a T2 testbed.
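
If a full nightly run is not practical, the leak can likely be triggered by running just the implicated modules from sonic-mgmt. The invocation below is only a sketch: the testbed name, inventory, and testbed file are placeholders, and the run_tests.sh flags are assumed from common sonic-mgmt usage rather than taken from this report.

# Hedged sketch: run only the implicated modules from sonic-mgmt/tests.
# <t2-testbed>, <inventory> and <testbed-file> are placeholders for your setup.
cd sonic-mgmt/tests
./run_tests.sh -n <t2-testbed> -i <inventory> -f <testbed-file> \
  -c "system_health/test_system_health.py platform_tests/test_platform_info.py"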

Describe the results you received:

The testbed fails the sanity check because omem goes over the threshold after running the nightly test on a T2 testbed.

Describe the results you expected:

Redis omem should be released after use and should not keep increasing.

Output of show version:

admin@svcstr2-8800-sup-1:~$ show version

SONiC Software Version: SONiC.jianquan.cicso.202405.08
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: b60548f2f6
Build date: Fri Nov  1 11:20:02 UTC 2024
Built by: azureuser@00df58e3c000000

Platform: x86_64-8800_rp-r0
HwSKU: Cisco-8800-RP
ASIC: cisco-8000
ASIC Count: 10
Serial Number: FOC2545N2CA
Model Number: 8800-RP
Hardware Revision: 1.0
Uptime: 00:50:48 up 14:22,  3 users,  load average: 13.28, 12.07, 11.65
Date: Mon 04 Nov 2024 00:50:48

Output of show techsupport:

When running the system_health/test_system_health.py test:
At the beginning of the test:

05/10/2024 23:23:32 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 0}]

At the end of the test:

06/10/2024 00:02:03 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 861168}]

This symptom is observed for all 3 test cases so far:

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Additional information you deem important (e.g. issue happens only occasionally):

sdszhang changed the title from "redis memory leaking issue on T2 supervisor" to "redis omem leaking issue on T2 supervisor" on Nov 4, 2024
Contributor

arlakshm commented Nov 6, 2024

@anamehra, @abdosi, can you please help triage this issue?

arlakshm added the Triaged label on Nov 6, 2024
@anamehra
Contributor

The issue is not seen in the last few runs on the Cisco and MSFT testbeds.
It looks like some redis client of the global database docker on the Supervisor fails to read its buffer from redis, and this causes omem to increase. The platform does not have any redis client for the global database, so it could be some SONiC infra client. Needs a repro to debug further.

Is there a way to map this client data to the client process? The id / fd from here were not very helpful for pinpointing the client (a possible approach is sketched after the client list below).

id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
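
One possible way to map a CLIENT LIST entry back to the owning process (an assumption, not verified on this setup): the fd field is redis-server's file descriptor for that connection, so the unix-socket peer, and therefore the client process, can be found with ss -xp run from the SONiC host.

# Hedged sketch (run from the SONiC host with sudo): find the server-side socket
# whose redis-server fd matches the leaking CLIENT LIST entry, e.g. fd=262.
sudo ss -xp | grep 'redis-server' | grep 'fd=262'
# Note the peer inode in the Peer Address column of that line, then look it up
# again; the unnamed ("*") side of the match shows the client process name/pid.
sudo ss -xp | grep -w '<peer-inode>'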

@anamehra
Contributor

Quick update: the client connections that are leaking memory are from the snmp docker. I see 100+ client connections from snmp, and restarting a process like thermalctld in pmon causes the omem increase on the snmp connections.
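
A quick way to confirm that observation, sketched under the assumption that pmon runs thermalctld under supervisord and that the global database uses the unix socket shown earlier:

# Hypothetical reproduction sketch: restart thermalctld inside pmon, then list
# psubscribe clients holding non-zero output-buffer memory.
docker exec pmon supervisorctl restart thermalctld
redis-cli -s /var/run/redis/redis.sock client list \
  | grep 'cmd=psubscribe' | grep -v ' omem=0 '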

Contributor

abdosi commented Nov 22, 2024

@SuvarnaMeenakshi: can you help look into this?
