
redis omem leaking issue on T2 supervisor #20680

Open
sdszhang opened this issue Nov 4, 2024 · 4 comments
Labels: Triaged (this issue has been triaged)

Comments


sdszhang commented Nov 4, 2024

Description

We are seeing a memory leak on the T2 Supervisor when running the nightly test, which causes redis memory to keep increasing until it fails the sanity_check in sonic-mgmt.

Following is one of the logs where db memory exceeded the sanity_check threshold:

06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0303 INFO   | asic0 db memory over the threshold 
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0304 INFO   | asic0 db memory omem non-zero output: 
id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
06/10/2024 05:26:43 checks._check_dbmemory_on_dut            L0307 INFO   | Done checking database memory on svcstr2-8800-sup-1

06/10/2024 05:26:45 parallel.parallel_run                    L0221 INFO   | Completed running processes for target "_check_dbmemory_on_dut" in 0:00:02.809825 seconds
06/10/2024 05:26:45 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': True, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 11584760}]
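
For reference, a rough way to reproduce the number this sanity check reports is to sum the omem field over the redis CLIENT LIST output. The sketch below is an approximation only, assuming redis-cli is available on the supervisor and using the unix socket path shown in the logs above (adjust for per-asic instances).

# Minimal sketch: sum output-buffer memory (omem) across all redis clients.
redis-cli -s /var/run/redis/redis.sock client list \
  | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^omem=/) { split($i, kv, "="); total += kv[2] } }
         END { print "total omem:", total }'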

The memory leak was seen after running any one of the following 3 modules. Once total_omem becomes non-zero, it keeps increasing until it goes over the threshold.

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Steps to reproduce the issue:

  1. Run the full nightly test on a T2 testbed.
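
If a full nightly run is not practical, the leak can likely be triggered by running just the implicated modules from sonic-mgmt. The invocation below is only a sketch: the testbed name, inventory, and testbed file are placeholders, and the run_tests.sh flags are assumed from common sonic-mgmt usage rather than taken from this report.

# Hedged sketch: run only the implicated modules from sonic-mgmt/tests.
# <t2-testbed>, <inventory> and <testbed-file> are placeholders for your setup.
cd sonic-mgmt/tests
./run_tests.sh -n <t2-testbed> -i <inventory> -f <testbed-file> \
  -c "system_health/test_system_health.py platform_tests/test_platform_info.py"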

Describe the results you received:

The testbed fails the sanity check because omem goes over the threshold after running the nightly test on a T2 testbed.

Describe the results you expected:

Redis omem should be released after use and should not keep increasing.

Output of show version:

admin@svcstr2-8800-sup-1:~$ show version

SONiC Software Version: SONiC.jianquan.cicso.202405.08
SONiC OS Version: 12
Distribution: Debian 12.6
Kernel: 6.1.0-22-2-amd64
Build commit: b60548f2f6
Build date: Fri Nov  1 11:20:02 UTC 2024
Built by: azureuser@00df58e3c000000

Platform: x86_64-8800_rp-r0
HwSKU: Cisco-8800-RP
ASIC: cisco-8000
ASIC Count: 10
Serial Number: FOC2545N2CA
Model Number: 8800-RP
Hardware Revision: 1.0
Uptime: 00:50:48 up 14:22,  3 users,  load average: 13.28, 12.07, 11.65
Date: Mon 04 Nov 2024 00:50:48

Output of show techsupport:

When running the system_health/test_system_health.py test:
At the beginning of the test:

05/10/2024 23:23:32 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 0}]

At the end of the test:

06/10/2024 00:02:03 __init__.do_checks                       L0120 DEBUG  | check results of each item [{'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc1-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc2-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-lc3-1', 'total_omem': 0}, {'failed': False, 'check_item': 'dbmemory', 'host': 'svcstr2-8800-sup-1', 'total_omem': 861168}]

This symptom is observed for all 3 test cases so far:

system_health/test_system_health.py
platform_tests/api/test_fan_drawer_fans.py
platform_tests/test_platform_info.py

Additional information you deem important (e.g. issue happens only occasionally):

sdszhang changed the title from "redis memory leaking issue on T2 supervisor" to "redis omem leaking issue on T2 supervisor" on Nov 4, 2024
Contributor

arlakshm commented Nov 6, 2024

@anamehra, @abdosi, can you please help triage this issue?

arlakshm added the Triaged label on Nov 6, 2024
@anamehra
Contributor

The issue is not seen in the last few runs on the Cisco and MSFT testbeds.
It looks like some redis client of the global database docker on the Supervisor fails to read its buffer from redis, and this causes omem to increase. The platform does not have any redis client for the global database, so it could be some SONiC infra client. Needs a repro to debug further.

Is there a way to map this client data to the client process? The id / fd from here were not very helpful for pinpointing the client (a possible approach is sketched after the client list below).

id=1367 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=262 name= age=36559 idle=20085 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=29 omem=594616 tot-mem=597464 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1368 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=263 name= age=36559 idle=19157 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=15 omem=307560 tot-mem=310408 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1369 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=264 name= age=36559 idle=20717 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=57 omem=1168728 tot-mem=1171576 events=rw cmd=psubscribe user=default redir=-1 resp=2
id=1370 addr=/var/run/redis/redis.sock:0 laddr=/var/run/redis/redis.sock:0 fd=265 name= age=36559 idle=21043 flags=PU db=6 sub=0 psub=1 ssub=0 multi=-1 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=2048 rbp=1024 obl=1024 oll=464 omem=9513856 tot-mem=9516704 events=rw cmd=psubscribe user=default redir=-1 resp=2
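
One possible way to map a CLIENT LIST entry back to the owning process (an assumption, not verified on this setup): the fd field is redis-server's file descriptor for that connection, so the unix-socket peer, and therefore the client process, can be found with ss -xp run from the SONiC host.

# Hedged sketch (run from the SONiC host with sudo): find the server-side socket
# whose redis-server fd matches the leaking CLIENT LIST entry, e.g. fd=262.
sudo ss -xp | grep 'redis-server' | grep 'fd=262'
# Note the peer inode in the Peer Address column of that line, then look it up
# again; the unnamed ("*") side of the match shows the client process name/pid.
sudo ss -xp | grep -w '<peer-inode>'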

@anamehra
Contributor

Quick update: the client connections that are leaking memory are from the snmp docker. I see 100+ client connections from snmp, and restarting a process like thermalctld in pmon causes the omem increase on the snmp connections.
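
A quick way to confirm that observation, sketched under the assumption that pmon runs thermalctld under supervisord and that the global database uses the unix socket shown earlier:

# Hypothetical reproduction sketch: restart thermalctld inside pmon, then list
# psubscribe clients holding non-zero output-buffer memory.
docker exec pmon supervisorctl restart thermalctld
redis-cli -s /var/run/redis/redis.sock client list \
  | grep 'cmd=psubscribe' | grep -v ' omem=0 '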

Contributor

abdosi commented Nov 22, 2024

@SuvarnaMeenakshi: can you help look into this?
