Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possible issue after having upgraded to v2.6 #1151

Closed
lano1106 opened this issue May 19, 2024 · 47 comments
Closed

possible issue after having upgraded to v2.6 #1151

lano1106 opened this issue May 19, 2024 · 47 comments

Comments

@lano1106
Copy link

lano1106 commented May 19, 2024

Since, I have upgraded to v2.6, I experience a periodic data corruption in the inbound TCP stream:

[2024-05-19 13:40:51] EROR WSCTX/log_emit_function 788: ssl error 167772603: error:0A0001BB:SSL routines::bad record type
[2024-05-19 13:40:51] EROR WSCTX/log_emit_function 788: ssl error 167772473: error:0A000139:SSL routines::record layer failure
[2024-05-19 13:40:51] INFO WSCTX/log_emit_function 788: lws_ssl_capable_read (0x7f9790015b80): ssl err 1 errno 0
[2024-05-19 13:40:51] INFO WSCTX/log_emit_function 788: lws_service_fd_tsi_nocheck (0x7f9790015b80): LWS_HPI_RET_PLEASE_CLOSE_ME
[2024-05-19 13:40:51] INFO WSCTX/log_emit_function 788: __lws_close_free_wsi (0x7f9790015b80): caller: lws_service_fd_tsi: close_and_handled

not sure if this could have anything to do with IORING_SETUP_NO_SQARRAY new feature...

I'll try to downgrade to v2.5 to see if the problem go away...

I'm not 100% sure where the problem comes from because I have changed few things all at the same time...

liburing
my own code
Change my instance type from 4 core machine to a 8 cores system...

I'll do some more experimentation, I'll report back if downgrading liburing makes the problem go away...

Important detail. This is with 6.8.9 kernel

@lano1106
Copy link
Author

data corruption occurs every 2-5 minutes on a ring managing 1 low-volume TCP connection... So ring buffer wraparound issue is a plausible cause.

Also worth mentioning. It happens only on a specific ring. The one that is configured with IORING_SETUP_ATTACH_WQ

@lano1106
Copy link
Author

I doubt this might be an issue but a particularity of my setup is that my 6.8.9 kernel has been patched with the NAPI busy polling code...

there is a very small probability that if the registering NAPI command value did change between what is in the patch vs the final official value.

imho, this would only make the registration fail but maybe not...

@lano1106
Copy link
Author

lano1106 commented May 20, 2024

here is an update.... I have downgraded liburing from 2.6 to 2.5 and my inbound TCP stream corruption is gone...

My intuition tells me that there is something wrong with IORING_SETUP_NO_SQARRAY feature because I am not seeing what else could cause the issue...

I'll try to look into the IORING_SETUP_NO_SQARRAY code to see if I could not find the problem and I will report back here if I find something...

I am not reading the io_uring list as much as I did in the past so I missed the discussion around IORING_SETUP_NO_SQARRAY but I guess that there is no reason to not use it and removing this level of indirection is slightly improving io_uring performance, right?

I guess this will become clearer once I look into its code...

@spikeh
Copy link
Contributor

spikeh commented May 20, 2024

I can take a look if we can get a reproducer.

If you can't share your code directly, then could you please share what io_uring flags you're using and what you're sending over TCP so I can build one?

@axboe
Copy link
Owner

axboe commented May 20, 2024

And in lieu of a reproducer, can you simply try and disable IORING_SETUP_NO_SQARRAY just to rule that out? Just removing the

p->flags |= IORING_SETUP_NO_SQARRAY;

line in src/setup.c:io_uring_queue_init_try_nosqarr() should do it. I don't see how this can be related at all, or how the upgrade in general could break this for you, but let's see if we can collect some data points.

Another option is to try and bisect it between 2.5 and 2.6, if it really is the liburing side that broke it.

@isilence
Copy link
Collaborator

isilence commented May 20, 2024

I agree with Jens on this one, if you suspect IORING_SETUP_NO_SQARRAY, please try removing it. It can only be the problem if:

  1. You hacked into liburing internals and was using not exposed to users sqarray.
  2. There is a bug, maybe with offset calculations or so. Is it x64-86? Or some other arch?

@lano1106
Copy link
Author

I have followed Jens suggestion to comment out the IORING_SETUP_NO_SQARRAY line in setup.c...

With that, my program works flawlessly with liburing 2.6...

you seem very confident that the problem cannot come from the IORING_SETUP_NO_SQARRAY kernel code. I'll trust you on that....

I may have another investigation lead...

I do use IORING_SETUP_NO_MMAP

and I assign remaining huge page memory to ring buffers if it has enough free memory. If not enough, my code fallback on using posix_memalign(). maybe that by modifying the amount of memory used by the io_uring with the IORING_SETUP_NO_SQARRAY option, this somehow make my code mismanage the memory...

@isilence
Copy link
Collaborator

isilence commented May 20, 2024

I have followed Jens suggestion to comment out the IORING_SETUP_NO_SQARRAY line in setup.c...
With that, my program works flawlessly with liburing 2.6...

you seem very confident that the problem cannot come from the IORING_SETUP_NO_SQARRAY kernel code. I'll trust you on that....

Not sure who you're replying to, but you never know, everything can happen. I'm getting you're not using sqarray, so might be some mishap in the offset calculation, especially in some less common cases like NO_MMAP or so.

I may have another investigation lead...

I do use IORING_SETUP_NO_MMAP

and I assign remaining huge page memory to ring buffers if it has enough free memory. If not enough, my code fallback on using posix_memalign(). maybe that by modifying the amount of memory used by the io_uring with the IORING_SETUP_NO_SQARRAY option, this somehow make my code mismanage the memory...

If you'd make it always use posix_memalign(), does it still reproduce or is it only happen with huge pages?

@lano1106
Copy link
Author

lano1106 commented May 20, 2024

Pavel,

I am going to give it a shot... To use posix_memalign() memory to the allocated ring buffers...

FYI, I allocate 3 io_urings in a single huge page. I use io_uring_queue_init_mem() return value to update the starting address to allocate the next ring structure... This might be another function to review...

I might not be able to continue the investigation today but I will for sure it tomorrow and update you on what I find...

Here are the log traces generated by the code setuping the rings... I am not expecting it to make any sense to anyone... I wrote the code many months ago, The code is in front of me and without some reviewing effort I cannot even figure out the traces...

for instance, I don't get why the huge page allocation trace appears after creating the first ring...

[2024-05-20 22:10:10] INFO WSCTX/createEventLoop tsi 0 io_uring loop took 282624 bytes from the huge page
[2024-05-20 22:10:10] INFO WSCTX/createEventLoop Master io_uring fd: 14
[2024-05-20 22:10:10] INFO WSCTX/getExtraHugePages mmap(142606336) succeeded
[2024-05-20 22:10:10] INFO WSCTX/initGroupBuffer Provide 8192 17408 bytes buffers to gid 1 from address 0x0x7fa37f800000 (142606336)
[2024-05-20 22:10:10] INFO Runnable/start ExecutionSVC initialized
[2024-05-20 22:10:10] INFO WSCTX/createEventLoop tsi 1 io_uring loop took 28672 bytes from the huge page
[2024-05-20 22:10:10] INFO WSCTX/initGroupBuffer Provide 32 17408 bytes buffers to gid 2 from address 0x0x7fa3a664c000 (557056)
[2024-05-20 22:10:10] INFO Runnable/start TradeFeed initialized
[2024-05-20 22:10:10] INFO WSCTX/createEventLoop tsi 2 io_uring loop took 28672 bytes from the huge page
[2024-05-20 22:10:10] INFO WSCTX/initGroupBuffer Provide 64 17408 bytes buffers to gid 3 from address 0x0x7fa3a66db000 (1114112)

here is something interesting... the io ring giving trouble is the one sandwiched in the middle. tsi 1 ring...

the more we discuss about the issue, the more it looks like some buffer overlap problem... The only remaining thing to determine is the source of the overlap...

Update:
I have made sense of my own traces...

My program makes 2 mmap() calls... One of 2MB to store the 3 io_uring structures and the allocated buffers of the 2 smallest rings + 1 extra mmap() to allocate huge pages to allocate the 142MB allocated buffers of the ring 0.

I am going to improve my traces... I think that printing out the start and end region of each sections in the first huge page should make it obvious if there is an overlap somewhere...

@lano1106
Copy link
Author

lano1106 commented May 21, 2024

here are improved logs:

[2024-05-21 05:21:06] INFO WSCTX/logRingMemory tsi 0 sq:256 cq:16384 io_uring mem with sq array:279552 rounded:282624 safety:3072
[2024-05-21 05:21:06] INFO WSCTX/logRingMemory tsi 0 sq:256 cq:16384 io_uring mem:278528 rounded:278528 safety:0
[2024-05-21 05:21:06] INFO WSCTX/createEventLoop tsi 0 io_uring loop took 282624 bytes from the huge page 0x0x7ff5b6c00000-0x0x7ff5b6c45000
[2024-05-21 05:21:06] INFO WSCTX/createEventLoop Master io_uring fd: 14
[2024-05-21 05:21:06] INFO WSCTX/getExtraHugePages mmap(142606336) succeeded: 0x0x7ff58f800000
[2024-05-21 05:21:06] INFO WSCTX/initGroupBuffer Provide 8192 17408 bytes buffers to gid 1 from address 0x0x7ff58f800000-0x0x7ff598000000 (142606336)
[2024-05-21 05:21:06] INFO Runnable/start ExecutionSVC initialized
[2024-05-21 05:21:06] INFO WSCTX/logRingMemory tsi 1 sq:256 cq:512 io_uring mem with sq array:25600 rounded:28672 safety:3072
[2024-05-21 05:21:06] INFO WSCTX/logRingMemory tsi 1 sq:256 cq:512 io_uring mem:24576 rounded:24576 safety:0
[2024-05-21 05:21:06] INFO WSCTX/createEventLoop tsi 1 io_uring loop took 28672 bytes from the huge page 0x0x7ff5b6c45000-0x0x7ff5b6c4c000
[2024-05-21 05:21:06] INFO WSCTX/initGroupBuffer Provide 32 17408 bytes buffers to gid 2 from address 0x0x7ff5b6c4c000-0x0x7ff5b6cd4000 (557056)
[2024-05-21 05:21:06] INFO Runnable/start TradeFeed initialized
[2024-05-21 05:21:06] INFO WSCTX/logRingMemory tsi 2 sq:256 cq:512 io_uring mem with sq array:25600 rounded:28672 safety:3072
[2024-05-21 05:21:06] INFO WSCTX/logRingMemory tsi 2 sq:256 cq:512 io_uring mem:24576 rounded:24576 safety:0
[2024-05-21 05:21:06] INFO WSCTX/createEventLoop tsi 2 io_uring loop took 28672 bytes from the huge page 0x0x7ff5b6cd4000-0x0x7ff5b6cdb000
[2024-05-21 05:21:06] INFO WSCTX/initGroupBuffer Provide 64 17408 bytes buffers to gid 3 from address 0x0x7ff5b6cdb000-0x0x7ff5b6deb000 (1114112)

I suspect maybe some aligment rounding requirements are not met anymore with IORING_SETUP_NO_SQARRAY.
Perfect aligment was maybe achieved accidentally with including the sqarrays.

one addition to the logs is that I have replicated the code in io_uring_alloc_huge() to find out the break out of the memory distributed in the io_uring structures. I was particularly interested in knowing the safety space created by round up mem_used to the next page boundary.

it turns out that with IORING_SETUP_NO_SQARRAY, there is 0 buffer. So if somewhere there is a buffer overflow, it was masked by the rounding up that is not present with IORING_SETUP_NO_SQARRAY.

maybe tsi 0 cq ring is overflowing into tsi 1 ring memory. I have looked a little bit the cq management code but it has become very complex... It does overflow management, it caches several entries and on and on... it is very hard for a non-initiated to review it... If it can help you, I can tell that with tsi 0, multishot recvmsg are used. multishots appears to have a special cq management...

@lano1106
Copy link
Author

lano1106 commented May 21, 2024

with provided buffers memory out of the way and only the rings adjacent to each other...

it seems like have a winner... no more ring corruption... My apologize... it seems like the problem is on my side...

I'll keep the issue open so that once I find out what is causing the problem, it is documented...

[2024-05-21 12:36:03] INFO WSCTX/WebSocketContext mmap(2097152) succeeded: 0x0x7fa439400000
[2024-05-21 12:36:03] INFO WSCTX/initContext Options: 00001030
[2024-05-21 12:36:03] INFO WSCTX/createEventLoop tsi 0 io_uring loop took 278528 bytes from the huge page 0x0x7fa439400000-0x0x7fa439444000
[2024-05-21 12:36:03] INFO WSCTX/createEventLoop Master io_uring fd: 14
[2024-05-21 12:36:03] INFO WSCTX/getExtraHugePages mmap(142606336) succeeded: 0x0x7fa413800000
[2024-05-21 12:36:03] INFO WSCTX/initGroupBuffer Provide 8192 17408 bytes buffers to gid 1 from address 0x0x7fa413800000-0x0x7fa41c000000 (142606336)
[2024-05-21 12:36:03] INFO Runnable/start ExecutionSVC initialized
[2024-05-21 12:36:03] INFO WSCTX/createEventLoop tsi 1 io_uring loop took 24576 bytes from the huge page 0x0x7fa439444000-0x0x7fa43944a000
[2024-05-21 12:36:03] INFO WSCTX/initGroupBuffer Provide 32 17408 bytes buffers to gid 2 from address 0x0x7fa40c002000-0x0x7fa40c08a000 (557056)
[2024-05-21 12:36:03] INFO Runnable/start TradeFeed initialized
[2024-05-21 12:36:03] INFO WSCTX/createEventLoop tsi 2 io_uring loop took 24576 bytes from the huge page 0x0x7fa43944a000-0x0x7fa439450000
[2024-05-21 12:36:03] INFO WSCTX/initGroupBuffer Provide 64 17408 bytes buffers to gid 3 from address 0x0x7fa404002000-0x0x7fa404112000 (1114112)

@lano1106
Copy link
Author

lano1106 commented May 21, 2024

but I am still perplexed... my program only reads into the provided buffers... it is io_uring that does the writing...

I suspect an underflow in the provided buffers... I was doing an ugly hack to support kTLS by writing into the io_uring_recvmsg_out (16 bytes) and moving the payload buffer 5 bytes ahead but I am not using this anymore since software kTLS is not providing any performance benefit (last time I checked)... This could have been my issue if io_uring_recvmsg_out layout did change...

    /*
     * the io_uring inline functions manipulating the recvmsg multishot
     * result do not need to have the exact same msghdr than the one
     * used to initiate the request. It only needs to have one
     * that has the exact same value for the fields msg_namelen & msg_controllen
     * so that it can navigate correctly into the buffer with the correct
     * offsets.
     */
    static struct msghdr s_msgh = {
        .msg_name = nullptr,
        .msg_namelen = 0,
        .msg_iov = nullptr,
        .msg_iovlen = 0,
        .msg_control = nullptr,
        .msg_controllen = CMSG_SPACE(sizeof(unsigned char)),
        .msg_flags = 0
    };
    /*
     * assume that recvmsg_out cannot be nullptr
     * If the assumption turns out to be false, it will be noticed
     * immediately with a SEGV.
     */
    auto *recvmsg_out{io_uring_recvmsg_validate(bidBuf, entry.bytes_read,
                                                &s_msgh)};
    char *payload{static_cast<char *>(io_uring_recvmsg_payload(recvmsg_out, &s_msgh))};
    /*
     * the initial bytes_read value included the space used by the recvmg_out
     * headers. We now set it to the payload size only.
     */
    entry.bytes_read =
    io_uring_recvmsg_payload_length(recvmsg_out,
                                    entry.bytes_read, &s_msgh);
#ifndef OPENSSL_NO_KTLS
    /*
     * A hack is required to work with kTLS.
     * recvmsg() with kTLS is passing payload buffer 5 bytes in front of the
     * user buffer so that it can recreate the TLS header in front of the TLS
     * payload.
     *
     * Also note that the code only prepend the SSL header if control data
     * is present but it should always be present in msgs coming from kTLS.
     */
    struct cmsghdr *cmsg = io_uring_recvmsg_cmsg_firsthdr(recvmsg_out, &s_msgh);
    if (cmsg) {
        if (cmsg->cmsg_type == TLS_GET_RECORD_TYPE) { // 2
            payload -= SSL3_RT_HEADER_LENGTH;         // 5
            payload[0] = *((unsigned char *)CMSG_DATA(cmsg));
            payload[1] = TLS1_2_VERSION_MAJOR;
            payload[2] = TLS1_2_VERSION_MINOR;
            /* returned length is limited to msg_iov.iov_len above */
            payload[3] = (entry.bytes_read >> 8) & 0xff;
            payload[4] = entry.bytes_read & 0xff;
            entry.bytes_read += SSL3_RT_HEADER_LENGTH;
        }
    }
#endif

here is what I will do next. I'll put back the provided buffers into my huge page but I'll put an empty 4KB space between the end of the rings and the provided buffer.

with no under/over flow this space should remain null. If there is no non-null data in it. It will be proof that something is writing there and it should not...

@axboe
Copy link
Owner

axboe commented May 21, 2024

but I am still perplexed... my program only reads into the provided buffers... it is io_uring that does the writing...

True, but the provided buffers start and length are controlled solely by you, so it'd be very possible that you'd run into problems if you have overlap. Maybe add some red zoning or similar as a debug thing to detect if anything is stepping over spots they should not be.

@axboe
Copy link
Owner

axboe commented May 21, 2024

This could be either a kernel side or application issue, so if you can debug a bit and figure out where it's coming from, that would certainly be helpful in figuring out who the culprit is.

@axboe
Copy link
Owner

axboe commented May 21, 2024

io_uring_recvmsg_out layout did not change, it's UABI so cannot change. But that doesn't mean that the kernel side handling couldn't be broken, though that seems unlikely if you are confident this happens with 2.6 and not with 2.5 of liburing (which still seems suspect, and I guess more likely causing a layout change for you that then further makes the bug more likely to trigger).

Have you tried running your app under valgrind, for example?

@isilence
Copy link
Collaborator

IIUC 2.6 is the first version switching *NO_SQARRAY on by default. I wouldn't dismiss that clue, especially since it's confirmed switching it off doing something that resolves the issue and NO_MMAP coverage is not as good.

Sounds like rings might be overlapping with provided buffers, I'll look through the offset calculation. How does the setup addresses calc look like? Similar to below or you do offsets by hand?

off = io_uring_queue_init_mem(buf);
register_provided_buffers(buf + off);

@lano1106
Copy link
Author

lano1106 commented May 21, 2024

but I am still perplexed... my program only reads into the provided buffers... it is io_uring that does the writing...

True, but the provided buffers start and length are controlled solely by you, so it'd be very possible that you'd run into problems if you have overlap. Maybe add some red zoning or similar as a debug thing to detect if anything is stepping over spots they should not be.

you are correct about that... idk for sure what is going on or else I would tell... All I have is suspicions on where the problem could be.

Add an empty page between the end of a ring and its provided buffers is the next step.

Checking that the provided buffers initialization was good, is the first thing that I did check. AFAIK, it seems to be good:

/*
 * initGroupBuffer()
 *
 * Called from:
 * - WSProtocol::initEventLoop()
 *
 * In the client constructor, the protocols are added to the context.
 * Once all the protocols have been added and before the main thread
 * enters its event loop, the other protocols are initialized.
 */
void WebSocketContext::initGroupBuffer(unsigned gid, int tsi, int nr)
{
    if (gid >= Kraken::GID_NUMS)
        throw std::runtime_error{"WebSocketContext::initGroupBuffer(): invalid gid parameter"};
    if (m_brData[gid].io_uringBuf)
        throw std::runtime_error{"WebSocketContext::initGroupBuffer(): Request to init an initialized buffer"};
    auto *loop{getLoop(tsi)};

    if (ev_backend(loop) != EVBACKEND_IOURING)
        return;

    int res;
    const size_t rxBufSize{getRxBufSize(tsi)};

    m_brData[gid].bufTotalSize = nr*rxBufSize;
    if (m_brData[gid].bufTotalSize <= hpFreeSpace()) {
        m_brData[gid].io_uringBuf = hpCurPos();
        m_brData[gid].bufOrig = IO_URING_HP_ORIG;
    }  else {
        m_brData[gid].io_uringBuf = getExtraHugePages(m_brData[gid].bufTotalSize);
        if (!m_brData[gid].io_uringBuf) {
            res = posix_memalign(reinterpret_cast<void **>(&m_brData[gid].io_uringBuf),
                                 4096, m_brData[gid].bufTotalSize);

            if (unlikely(res)) {
                ERROR2("posix_memalign: (%d) %s", res, strerror(res));
                throw std::runtime_error{"posix_memalign() failure"};
            }
            WARN0_SL("memory allocated with posix_memalign()");
            m_brData[gid].bufOrig = POSIX_ALIGN_ORIG;
        } else
            m_brData[gid].bufOrig = EXTRA_HP_ORIG;
    }
    m_brData[gid].ring = static_cast<struct io_uring *>(ev_io_uring_ring(loop));
    m_brData[gid].nentries = nr;
    constexpr unsigned int flags{};

    /*
     * Allocate memory for the new bufring and register it to the kernel by
     * calling io_uring_register_buf_ring() 
     */ 
    m_brData[gid].br = io_uring_setup_buf_ring(m_brData[gid].ring, nr, gid,
                                               flags, &res);
    if (unlikely(res)) {
        ERROR2("io_uring_setup_buf_ring: (%d) %s", res, strerror(res));
        throw std::runtime_error{"io_uring_setup_buf_ring() failure"};
    }
    auto brmask{io_uring_buf_ring_mask(nr)};
    char *ptr{m_brData[gid].io_uringBuf};

    // io_uring_buf_ring_init() is called in io_uring_setup_buf_ring()
//    io_uring_buf_ring_init(m_brData[gid].br);
    for (int i{}; i < nr; ++i) {
        io_uring_buf_ring_add(m_brData[gid].br, ptr, rxBufSize, i, brmask, i);
        ptr += rxBufSize;
    }
    io_uring_buf_ring_advance(m_brData[gid].br, nr);
    INFO6("Provide %d %lu bytes buffers to gid %u from address 0x%p-0x%p (%lu)",
          nr, rxBufSize, gid, m_brData[gid].io_uringBuf,
          m_brData[gid].io_uringBuf+m_brData[gid].bufTotalSize,
          m_brData[gid].bufTotalSize);

    res = ev_iouring_init_buffers(loop, m_brData[gid].io_uringBuf,
                                  m_brData[gid].br, rxBufSize, nr, gid);
    if (unlikely(res < 0)) {
        ERROR5("ev_iouring_provide_buffers: (%d) %s\n"\
               "ring_fd: %d, enter_ring_fd: %d, int_flags: %u",
               -res, strerror(-res), m_brData[gid].ring->ring_fd,
               m_brData[gid].ring->enter_ring_fd,
               static_cast<unsigned>(m_brData[gid].ring->int_flags));
        throw std::runtime_error{"ev_iouring_provide_buffers() failure"};
    }
    if (m_brData[gid].bufOrig == IO_URING_HP_ORIG)
        advanceHpOffset(m_brData[gid].bufTotalSize);
}

@axboe
Copy link
Owner

axboe commented May 21, 2024

IIUC 2.6 is the first version switching *NO_SQARRAY on by default. I wouldn't dismiss that clue, especially since it's confirmed switching it off doing something that resolves the issue and NO_MMAP coverage is not as good.

Sounds like rings might be overlapping with provided buffers, I'll look through the offset calculation. How does the setup addresses calc look like? Similar to below or you do offsets by hand?

Agree, it sounds like overlap, and that could of course be a liburing bug due to the combination of NO_SQARRAY and using NO_MMAP.

@lano1106
Copy link
Author

io_uring_recvmsg_out layout did not change, it's UABI so cannot change. But that doesn't mean that the kernel side handling couldn't be broken, though that seems unlikely if you are confident this happens with 2.6 and not with 2.5 of liburing (which still seems suspect, and I guess more likely causing a layout change for you that then further makes the bug more likely to trigger).

Have you tried running your app under valgrind, for example?

there is something that valgrind does not like about programs using io_uring... I have never been able to run valgrind on my program with great success... that might have changed. I have not tried out recently... What I have been using successfully in the past with very good result, it is -fsanitize-address (or something like that)...

if you read my updates since yesterday... I have narrowed down the issue to:

  • 2.6 is fine as long as IORING_SETUP_NO_SQARRAY is not used
  • IORING_SETUP_NO_SQARRAY is fine as long as tsi 1 ring memory is not adjacent to its provided buffers memory area.

@lano1106
Copy link
Author

IIUC 2.6 is the first version switching *NO_SQARRAY on by default. I wouldn't dismiss that clue, especially since it's confirmed switching it off doing something that resolves the issue and NO_MMAP coverage is not as good.

Sounds like rings might be overlapping with provided buffers, I'll look through the offset calculation. How does the setup addresses calc look like? Similar to below or you do offsets by hand?

Agree, it sounds like overlap, and that could of course be a liburing bug due to the combination of NO_SQARRAY and using NO_MMAP.

this is what I am leading to think.

In #1151 (comment)

there are logRingMemory traces showing the memory layout with or without NO_SQARRAY. When the option is not set, there is a 3KB safety zone. with NO_SQARRAY option set, everything is flush to the byte and immediately adjacent to the next structure...

@lano1106
Copy link
Author

lano1106 commented May 21, 2024

IIUC 2.6 is the first version switching *NO_SQARRAY on by default. I wouldn't dismiss that clue, especially since it's confirmed switching it off doing something that resolves the issue and NO_MMAP coverage is not as good.

Sounds like rings might be overlapping with provided buffers, I'll look through the offset calculation. How does the setup addresses calc look like? Similar to below or you do offsets by hand?

off = io_uring_queue_init_mem(buf);
register_provided_buffers(buf + off);

check the logs in

2.6 without the NO_SQARRAY option
#1151 (comment)

2.6 with the NO_SQARRAY, provided buffers away from the hp (allocated with posix_memalign()):
#1151 (comment)

I think that I made a good job improving the usefulness of the traces.
you can see the memory range of each component and which ones are adjacent to each other...
there is no obvious overlap there. The problem is more subtle... something is writing where it should not, imho...

as far as buffer initialization goes, I have shared the whole function doing it... I am not sure that provided buffers is the right terminology for what I use... maybe user buffer ring would be more accurate...
#1151 (comment)

@lano1106
Copy link
Author

I have added the following code:

        advanceHpOffset(init_params.res);
#ifdef DEBUG_IO_URING_BUFFER_ISSUE
        if (tsi == 1)
            m_safety_zone = hpCurPos();
#endif
        /*
         * leave safety space between ring memory and provided buffers memory to workaround buffer overflow
         * https://github.com/axboe/liburing/issues/1151
         */
        if (tsi != 0)
            advanceHpOffset(4096);

in the context class destructor:

#ifdef DEBUG_IO_URING_BUFFER_ISSUE
    if (m_safety_zone) {
        auto *end{m_safety_zone+4096};
        auto *res{std::find_if(m_safety_zone, end, [](auto curChar){
            return curChar != 0;
        })};

        if (res != end) {
            auto *corruptionEnd{std::find(res, end, 0)};
            size_t len = corruptionEnd - res;
            char abuf[Base::OST_SMALL_LINE];
            size_t dist = res - m_safety_zone;
            bool fromStart{dist < 2048};
            static constexpr std::array<std::string_view,2> s_refPointStr{
                "from end", "from start"
            };

            if (!fromStart)
                dist = end - res;

            INFO0_SZ(abuf, fmt::format_to_n(abuf, sizeof(abuf),
                     "{} long corruption at {} {} bytes from {}",
                     len, static_cast<void*>(res), dist,
                     s_refPointStr[fromStart]).size);
        }
    }
#endif

the resulting traces:

[2024-05-21 17:45:07] INFO WSCTX/WebSocketContext mmap(2097152) succeeded: 0x0x7fec5dc00000
[2024-05-21 17:45:07] INFO WSCTX/initContext Options: 00001030
[2024-05-21 17:45:07] INFO WSCTX/createEventLoop tsi 0 io_uring loop took 278528 bytes from the huge page 0x0x7fec5dc00000-0x0x7fec5dc44000
[2024-05-21 17:45:07] INFO WSCTX/createEventLoop Master io_uring fd: 14
[2024-05-21 17:45:07] INFO WSCTX/getExtraHugePages mmap(142606336) succeeded: 0x0x7fec37800000
[2024-05-21 17:45:07] INFO WSCTX/initGroupBuffer Provide 8192 17408 bytes buffers to gid 1 from address 0x0x7fec37800000-0x0x7fec40000000 (142606336)
[2024-05-21 17:45:07] INFO Runnable/start ExecutionSVC initialized
[2024-05-21 17:45:07] INFO WSCTX/createEventLoop tsi 1 io_uring loop took 24576 bytes from the huge page 0x0x7fec5dc44000-0x0x7fec5dc4a000
[2024-05-21 17:45:07] INFO WSCTX/initGroupBuffer Provide 32 17408 bytes buffers to gid 2 from address 0x0x7fec5dc4b000-0x0x7fec5dcd3000 (557056)
[2024-05-21 17:45:07] INFO Runnable/start TradeFeed initialized
[2024-05-21 17:45:07] INFO WSCTX/createEventLoop tsi 2 io_uring loop took 24576 bytes from the huge page 0x0x7fec5dcd3000-0x0x7fec5dcd9000
[2024-05-21 17:45:07] INFO WSCTX/initGroupBuffer Provide 64 17408 bytes buffers to gid 3 from address 0x0x7fec5dcda000-0x0x7fec5ddea000 (1114112)
[2024-05-21 17:53:39] INFO WSCTX/~WebSocketContext 1 long corruption at 0x7fec5dc4a000 0 bytes from from start

so it seems like the kernel code managing the ring is overflowing it by 1 byte... just enough to cause some issue if something is adjacent...

@axboe
Copy link
Owner

axboe commented May 21, 2024

Could I get you to turn this into a separate thing that shows the corruption? Doesn't look like it'd be a lot of work given the above simple trace? I'm still not quite sure how you're managing the memory, otherwise I'd just do it myself.

@lano1106
Copy link
Author

lano1106 commented May 21, 2024

Could I get you to turn this into a separate thing that shows the corruption? Doesn't look like it'd be a lot of work given the above simple trace? I'm still not quite sure how you're managing the memory, otherwise I'd just do it myself.

you mean that you would like to see the memory bytes value?

If this can hint you about what that might be yes, of course it is easy to add...

as to what is in the memory space

[2024-05-21 17:45:07] INFO WSCTX/createEventLoop tsi 1 io_uring loop took 24576 bytes from the huge page 0x0x7fec5dc44000-0x0x7fec5dc4a000
[2024-05-21 17:45:07] INFO WSCTX/initGroupBuffer Provide 32 17408 bytes buffers to gid 2 from address 0x0x7fec5dc4b000-0x0x7fec5dcd3000 (557056)

it is a 2.6 io_uring with 256 entries with NO_SQARRAY set (so no safety buffer at its end like when NO_SQARRAY is missing). The trace

[2024-05-21 05:21:06] INFO WSCTX/logRingMemory tsi 1 sq:256 cq:512 io_uring mem:24576 rounded:24576 safety:0

is created by mimicking the code found in io_uring_alloc_huge()

#ifdef USE_WSCTX_IOURING_HP_COMPUTE
/*
 * logRingMemory()
 */
void logRingMemory(int tsi, const struct io_uring_params *params, bool no_sqarray) noexcept
{
    static constexpr size_t page_size{4096};
    static constexpr size_t IOURING_INIT_ENTRIES{256}; // defined in libev/ev_iouring.c
    size_t sq_entries{IOURING_INIT_ENTRIES};
    size_t cq_entries{2*IOURING_INIT_ENTRIES};
    size_t ring_mem, sqes_mem;
	size_t mem_used = 0;

    if (params->flags & IORING_SETUP_CQSIZE)
        cq_entries = params->cq_entries;
    sqes_mem = sq_entries * sizeof(struct io_uring_sqe);
    sqes_mem = (sqes_mem + page_size - 1) & ~(page_size - 1);
    ring_mem = cq_entries * sizeof(struct io_uring_cqe);
    if (!no_sqarray)
        ring_mem += sq_entries * sizeof(unsigned);
    mem_used = sqes_mem + ring_mem;
    size_t rounded = (mem_used + page_size - 1) & ~(page_size - 1);
    static constexpr std::array<std::string_view,2> s_no_sqarrayStr{
        " with sq array", ""
    };
    char abuf[Base::OST_SMALL_LINE];

    INFO0_SZ(abuf, fmt::format_to_n(abuf, sizeof(abuf),
             "tsi {} sq:{} cq:{} io_uring mem{}:{} rounded:{} safety:{}",
             tsi, sq_entries, cq_entries, s_no_sqarrayStr[no_sqarray],
             mem_used, rounded, rounded-mem_used).size);
}
#endif

normally, I assigned the user buffer ring immediately after the io_uring memory but as suggested, I left a 4KB safety space which should be left zeroed (it is from mmap).

the very first byte after the tsi 1 io_uring area is non-zero... I will change my code to print out that value...

@isilence
Copy link
Collaborator

Jens is likely asking for a reproducer we can run ourselves.
Can you also print the entire safety zone? (There might be a stray 0 write or just gaps, so distance might not be telling much).

@isilence
Copy link
Collaborator

A quick sum up:

  • Something is writing right after the main ring's memory (CQ+SQ) of the second ring. Right after the rings there is a provided buffer ring
  • There is a red zone without NO_SQARRAY so that the stray write was not affecting us.
  • If we allocate rings back to back without provided rings in between it all works fine.

@isilence
Copy link
Collaborator

Can you clarify a little bit about provided buffers and rings. Where do you store the ring itself and where you allocate the buffers you put into the provided buffer ring? I assume there are

[2024-05-21 17:45:07] INFO WSCTX/initGroupBuffer Provide 32 17408 bytes buffers to gid 2 from address 0x0x7fec5dc4b000-0x0x7fec5dcd3000 (557056)

I read as it only contains buffers (32 * 17408 == 557056).

@lano1106
Copy link
Author

A quick sum up:

* Something is writing right after the main ring's memory (CQ+SQ) of the second ring. Right after the rings there is a provided buffer ring

* There is a red zone without `NO_SQARRAY` so that the stray write was not affecting us.

* If we allocate rings back to back without provided rings in between it all works fine.

Pavel, you understand 100% correctly. This perfectly sums up what I am experiencing... maybe the first byte of a ring memory is less sensitive than it seems for a user buffer ring...

@lano1106
Copy link
Author

lano1106 commented May 21, 2024

Can you clarify a little bit about provided buffers and rings. Where do you store the ring itself and where you allocate the buffers you put into the provided buffer ring? I assume there are

[2024-05-21 17:45:07] INFO WSCTX/initGroupBuffer Provide 32 17408 bytes buffers to gid 2 from address 0x0x7fec5dc4b000-0x0x7fec5dcd3000 (557056)

I read as it only contains buffers (32 * 17408 == 557056).

yes you read correctly... only 32 buffers... It is a low data stream. from experience I have figured that only 32 buffers were sufficient for that thread/ring.

Update:
It just occurred to me that perhaps the low buffer number ring and its frequent wraparound is why I see the problem way more often than most/all other users.

A 2MB Huge page is mmaped.
tsi 0 ring is allocated at the start.
Its buffers are too large (142MB) to fit in the first huge page so it is allocated in its own separate mmaped region.
next the layout is
tsi 1 ring
tsi 1 buffer
tsi 2 ring
tsi 2 buffers

the provided logs clearly show the memory range of each component...

I increment the current offset in the huge page where the next component will be placed by using the io_uring_queue_init_mem() return value.

concerning the where and how the buffers are allocated, I shared the whole function code at
#1151 (comment)

@lano1106
Copy link
Author

here is a full hex dump of the 4KB safety zone. It was a good thing to ask as it is not only the first byte that is corrupted... Hopefully the values will be recognizable as some sort of cqe record...

[2024-05-21 21:02:18] INFO WSCTX/~WebSocketContext 1 bytes long corruption at 0x7f801ac4a000 0 bytes from start
[2024-05-21 21:02:18] INFO WSCTX/~WebSocketContext safety zone content:
0000 : 1B 00 00 00 00 00 00 04  07 01 00 00 00 00 00 00 | ................ |
0010 : 38 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00 | 8............... |
0020 : 1B 00 00 00 00 00 00 04  01 01 00 00 00 00 00 00 | ................ |
0030 : 1B 00 00 00 00 00 02 06  15 05 00 00 03 00 0C 00 | ................ |
0040 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0050 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0060 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0070 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0080 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0090 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
00A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
00B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
00C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
00D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
00E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
00F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ | 
0100 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ | 
0110 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0120 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0130 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0140 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0150 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0160 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0170 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0180 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0190 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
01A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
01B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
01C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
01D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
01E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
01F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0200 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0210 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0220 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0230 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0240 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0250 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0260 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0270 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0280 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0290 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
02A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
02B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
02C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
02D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
02E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
02F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0300 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0310 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0320 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0330 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0340 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0350 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0360 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0370 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0380 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0390 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
03A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
03B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
03C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
03D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
03E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
03F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0400 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0410 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0420 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0430 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0440 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0450 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0460 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0470 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0480 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0490 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
04A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
04B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
04C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
04D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
04E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
04F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0500 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0510 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0520 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0530 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0540 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0550 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0560 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0570 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0580 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0590 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
05A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
05B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
05C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
05D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
05E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
05F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0600 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0610 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0620 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0630 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0640 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0650 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0660 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0670 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0680 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0690 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
06A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
06B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
06C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
06D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
06E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
06F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0700 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0710 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0720 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0730 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0740 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0750 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0760 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0770 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0780 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0790 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
07A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
07B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
07C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
07D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
07E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
07F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0800 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0810 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0820 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0830 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0840 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0850 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0860 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0870 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0880 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0890 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
08A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
08B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
08C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
08D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
08E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
08F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0900 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0910 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0920 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0930 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0940 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0950 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0960 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0970 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0980 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0990 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
09A0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
09B0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
09C0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
09D0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
09E0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
09F0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A00 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A10 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A20 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A30 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A40 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A50 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A60 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A70 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A80 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0A90 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0AA0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0AB0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0AC0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0AD0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0AE0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0AF0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B00 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B10 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B20 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B30 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B40 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B50 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B60 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B70 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B80 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0B90 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0BA0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0BB0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0BC0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0BD0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0BE0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0BF0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C00 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C10 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C20 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C30 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C40 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C50 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C60 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C70 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C80 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0C90 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0CA0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0CB0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0CC0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0CD0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0CE0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0CF0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D00 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D10 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D20 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D30 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D40 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D50 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D60 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D70 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D80 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0D90 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0DA0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0DB0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0DC0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0DD0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0DE0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0DF0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E00 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E10 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E20 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E30 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E40 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E50 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E60 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E70 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E80 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0E90 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0EA0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0EB0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0EC0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0ED0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0EE0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0EF0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F00 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F10 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F20 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F30 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F40 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F50 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F60 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F70 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F80 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0F90 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0FA0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0FB0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0FC0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0FD0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0FE0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |
0FF0 : 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 | ................ |

@isilence
Copy link
Collaborator

You mentioned you use provided buffer rings, can you show how and where you register them? E.g. how you call io_uring_register_buf_ring and init struct io_uring_buf_reg?

Apart from allocating buffers that you put into the ring and would be read/recv'ing data into, there is also memory for the ring itself, I Just wonder where that one is placed, i.e. struct io_uring_buf_reg::ring_addr

@lano1106
Copy link
Author

lano1106 commented May 21, 2024

I use io_uring_setup_buf_ring()

the whole code is in the comment:
#1151 (comment)

the ring itself is stored in the ev loop structure. I fetch it in my initGroupBuffer() function.

for the ring creation... nothing special to say about it... It is created by calling io_uring_queue_init_mem()

inline_size
int
iouring_init (EV_P_ int flags, void *params)
{
    ev_io_uring_init_params *init_params = (ev_io_uring_init_params *)params;
    struct io_uring_params emptyParams = {};

    if (!init_params->init_params)
        init_params->init_params = &emptyParams;

    iouring_entries = IOURING_INIT_ENTRIES;
    if (!init_params->buf)
        init_params->res = iouring_internal_init (EV_A_ init_params->init_params);
    else
        init_params->res = io_uring_queue_init_mem(iouring_entries, &iouring_ring,
                                                   init_params->init_params,
                                                   init_params->buf,
                                                   init_params->buf_size);

    if (init_params->res < 0) {
        char buf[256];
        snprintf(buf, 256, "error creating the io_uring instance: (%d) %s",
                 -init_params->res, strerror(-init_params->res));
        ev_syserr(buf);
        return 0;
    }
    backend_modify = iouring_modify;
    backend_poll   = iouring_poll;

    return EVBACKEND_IOURING;
}

/*
 * createIoUringLoop()
 */
struct ev_loop *
WebSocketContext::createIoUringLoop(int tsi, unsigned ev_flags,
                                    struct io_uring_params *params)
{
    /*
     * NOTE:
     * IORING_SETUP_ATTACH_WQ option is incompatible with
     * IORING_SETUP_REGISTERED_FD_ONLY
     */
    if (params->flags & IORING_SETUP_ATTACH_WQ) {
        if (m_master_io_uring_fd != -1)
            params->wq_fd = m_master_io_uring_fd;
        else {
            WARN0_SL("IORING_SETUP_ATTACH_WQ configured but no master fd available");
            params->flags &= ~IORING_SETUP_ATTACH_WQ;
        }
    }
#ifdef USE_WSCTX_IOURING_HP_COMPUTE
    logRingMemory(tsi, params, false);
    logRingMemory(tsi, params, true);
#endif
    ev_io_uring_init_params init_params{params,
        hpCurPos(),
        hpFreeSpace(), 0};

    struct ev_loop *loop = ev_io_uring_loop_new(&init_params, ev_flags);
    if (unlikely(!loop)) {
        // try without IORING_SETUP_NO_MMAP
        params->flags &= ~(IORING_SETUP_NO_MMAP|IORING_SETUP_REGISTERED_FD_ONLY);
        init_params.buf = nullptr;
        loop = ev_io_uring_loop_new(&init_params, ev_flags);
        if (unlikely(!loop))
            throw std::runtime_error{"error creating an io_uring instance"};
    }
    if (ev_backend(loop) != EVBACKEND_IOURING)
        throw std::runtime_error{"received a different backend type than EVBACKEND_IOURING"};
    INFO4("tsi %d io_uring loop took %d bytes from the huge page 0x%p-0x%p",
          tsi, init_params.res, init_params.buf,
          static_cast<char*>(init_params.buf)+init_params.res);
    advanceHpOffset(init_params.res);
#ifdef DEBUG_IO_URING_BUFFER_ISSUE
    if (tsi == 1)
        m_safety_zone = hpCurPos();
#endif
    /*
     * leave safety space between ring memory and provided buffers memory to workaround buffer overflow
     * https://github.com/axboe/liburing/issues/1151
     */
    if (tsi != 0)
        advanceHpOffset(4096);

    auto *ring{static_cast<struct io_uring *>(ev_io_uring_ring(loop))};

    if (m_master_io_uring_fd == -1 && ring->ring_fd >= 0) {
        m_master_io_uring_fd = ring->ring_fd;
        INFO1("Master io_uring fd: %d", m_master_io_uring_fd);
    }
    if (global_t::getInstance().useNapiBusyPoll(tsi)) {
        struct io_uring_napi napiSetting{100, 1};
        const int res{io_uring_register_napi(ring, &napiSetting)};

        if (unlikely(res < 0)) {
            ERROR3("io_uring_register_napi(%d): (%d) %s",
                   ring->ring_fd, -res, strerror(-res));
        }
    }
    return loop;
}

/*
 * createEventLoop()
 *
 * Called by protocol threads before entering the loop
 */
void
WebSocketContext::createEventLoop(int tsi, unsigned ev_flags,
                                  struct io_uring_params *params)
{
    struct ev_loop *loop;

    if (likely(ev_flags & EVBACKEND_IOURING))
        loop = createIoUringLoop(tsi, ev_flags, params);
    else
        loop = ev_loop_new(ev_flags/*|EVFLAG_SIGNALFD*/);

    m_threadData.push_back(loop);
}

@lano1106
Copy link
Author

Something else I thought I could do to help is modify my completion handler to check if the returned cage is outside the expected memory range and perhaps log its details if this can help direct the investigation.

@lano1106
Copy link
Author

lano1106 commented May 22, 2024

Jens is likely asking for a reproducer we can run ourselves. Can you also print the entire safety zone? (There might be a stray 0 write or just gaps, so distance might not be telling much).

correct me if I am wrong but checking for cq ring overflow is not something that is widely currently tested in the library unit tests. I suspect the issue is not that esoteric but since it is harmless with the most common 1 ring setup and it did never bite anyone so far, it is not something that has been checked... but I could be wrong...

this is maybe some cool task for someone such as @spikeh to look at. I can see how valuable it would be to have general teardown test testing invariants such as cq ring respecting the configured boundaries...

With minimal parameter tweaking such as making the number of cq entries small enough to ensure at least 1 wraparound during the test + applying a general teardown check... I suspect that with the currently existing unit tests... this issue would show up...

if it does not, yes, I will continue to help to nail the down the bug... but otherwise... my very interesting problems investigation budget has been fully spent in the last few days...

@isilence
Copy link
Collaborator

Something else I thought I could do to help is modify my completion handler to check if the returned cage is outside the expected memory range and perhaps log its details if this can help direct the investigation.

Good idea, check that buffers you pass to io_uring_buf_ring_add() are in the corresponding range. We can also try to trace similarly in the kernel via bpftrace.

One more thing I'm curious about, when you increase the red zone 4KB -> 8KB, where the non-zero pattern start?

@lano1106
Copy link
Author

Something else I thought I could do to help is modify my completion handler to check if the returned cage is outside the expected memory range and perhaps log its details if this can help direct the investigation.

Good idea, check that buffers you pass to io_uring_buf_ring_add() are in the corresponding range. We can also try to trace similarly in the kernel via bpftrace.

One more thing I'm curious about, when you increase the red zone 4KB -> 8KB, where the non-zero pattern start?

I have not increased the red zone (what I call the safety zone) to 8KB yet... so I guess that what you mean is that you would like me to try that.... If that is the case, yes I can do that... I will report back the result...

My prediction is that the corruption will remain in the first 64 bytes... but the best bug fishing catches that I have experienced is when you put aside all your preconceived ideas and assumptions... So I am willing to give it a try...

Did you attempt to map the corruption values into a io_uring_cqe record? I am not familiar enough with the structure and common constants to look at the hex dump and be sure that I was looking at io_uring_cqe entries... but I thought that it could have jumped at you immediately it they were io_uring_cqe...

@lano1106
Copy link
Author

Something else I thought I could do to help is modify my completion handler to check if the returned cage is outside the expected memory range and perhaps log its details if this can help direct the investigation.

Good idea, check that buffers you pass to io_uring_buf_ring_add() are in the corresponding range. We can also try to trace similarly in the kernel via bpftrace.

One more thing I'm curious about, when you increase the red zone 4KB -> 8KB, where the non-zero pattern start?

I have been thinking about what you want to validate by asking me to check the buffer base addresses before reinserting them in the ring...

2 days ago when you did ask me to place them in memory allocated with posix_memalign(), it was before I had code that checks for corruption in the red zone... Would redoing the posix_memalign() buffers allocation and check for corruption past the cq ring region not clear them out?

@axboe
Copy link
Owner

axboe commented May 22, 2024

I tried writing my own reproducer, and it does seem to indicate that we're writing past the end of the memory region that we said we'd use based on the return value of io_uring_queue_init_mem(). I'll poke a bit and report what I find.

@lano1106
Copy link
Author

lano1106 commented May 22, 2024

that is fascinating!

this may have been there for a very long time, and no one noticed until now... we have the best job that keeps giving us puzzles to solve!

I am always grateful to whatever life is throwing at us... This whole issue did make me review my own code and this has led to few minor improvements...

one of these minor improvement that might be of interest to you is this:

I have replaced io_uring_for_each_cqe() macro by its body minus io_uring_cqe_index() to eliminate io_uring_cqe_shift() and remove several instructions and memory access at each call since I am not using IORING_SETUP_CQE32.

This is the kind of assumption you, of course, cannot make but maybe moving out io_uring_cqe_shift() call from the loop could optimize the generated code... or perhaps provide 2 versions of the io_uring_for_each_cqe() macro to avoid having to make the bit shift at each iteration if it is known at compile time what will be required.

this is a minor improvement but considering that it is in the hot path, the improvement might not be non-negligeable...

@axboe
Copy link
Owner

axboe commented May 22, 2024

Yeah, it looks like bad math in liburing for the size. Seems to account the arrays just fine, but not the indexes. I'll be back with a patch!

@axboe
Copy link
Owner

axboe commented May 22, 2024

Can you try the below? Very lightly tested...

diff --git a/src/setup.c b/src/setup.c
index ab01ec615ca1..1997d2536f1c 100644
--- a/src/setup.c
+++ b/src/setup.c
@@ -199,6 +199,8 @@ __cold int io_uring_ring_dontfork(struct io_uring *ring)
 /* FIXME */
 static size_t huge_page_size = 2 * 1024 * 1024;
 
+#define KRING_SIZE	64
+
 /*
  * Returns negative for error, or number of bytes used in the buffer on success
  */
@@ -208,7 +210,7 @@ static int io_uring_alloc_huge(unsigned entries, struct io_uring_params *p,
 {
 	unsigned long page_size = get_page_size();
 	unsigned sq_entries, cq_entries;
-	size_t ring_mem, sqes_mem;
+	size_t ring_mem, sqes_mem, cqes_mem;
 	unsigned long mem_used = 0;
 	void *ptr;
 	int ret;
@@ -217,14 +219,18 @@ static int io_uring_alloc_huge(unsigned entries, struct io_uring_params *p,
 	if (ret)
 		return ret;
 
+	ring_mem = KRING_SIZE;
+
 	sqes_mem = sq_entries * sizeof(struct io_uring_sqe);
 	sqes_mem = (sqes_mem + page_size - 1) & ~(page_size - 1);
-	ring_mem = cq_entries * sizeof(struct io_uring_cqe);
-	if (p->flags & IORING_SETUP_CQE32)
-		ring_mem *= 2;
 	if (!(p->flags & IORING_SETUP_NO_SQARRAY))
-		ring_mem += sq_entries * sizeof(unsigned);
-	mem_used = sqes_mem + ring_mem;
+		sqes_mem += sq_entries * sizeof(unsigned);
+
+	cqes_mem = cq_entries * sizeof(struct io_uring_cqe);
+	if (p->flags & IORING_SETUP_CQE32)
+		cqes_mem *= 2;
+	ring_mem += sqes_mem + cqes_mem;
+	mem_used = ring_mem;
 	mem_used = (mem_used + page_size - 1) & ~(page_size - 1);
 
 	/*
@@ -498,8 +504,6 @@ static size_t npages(size_t size, long page_size)
 	return __fls((int) size);
 }
 
-#define KRING_SIZE	320
-
 static size_t rings_size(struct io_uring_params *p, unsigned entries,
 			 unsigned cq_entries, long page_size)
 {

@lano1106
Copy link
Author

lano1106 commented May 22, 2024

yes. I will do that. count on my feedback later today.

but just after reviewing it, I can conclude that it will fix the issue... it is just sad to reserve a full page for only 64 bytes of extra data...

@axboe
Copy link
Owner

axboe commented May 22, 2024

Thanks! If it's easier for you, I can put it in a liburing branch for you to just pull + checkout rather than need to fiddle with a pasted patch.

@lano1106
Copy link
Author

lano1106 commented May 22, 2024

he he... no the patch is a much easier route for me... I am by far NOT a git power user. Instead I just have to modify the source code fetched by my system packaging tools and rebuild...

axboe added a commit that referenced this issue May 22, 2024
It doesn't factor in the indexes themselves, so depending on the ring
sizes, we may end up undershooting slightly. This can cause the
returned value to be 64 bytes less than what is required. If an
application is packing things in tight, then that in turn can cause the
assumed ring memory to spill into the next part of that memory, causing
corruption of anything stored there.

We may overshoot with the current code depending on ring sizes, but
at least that's better then undershooting and causing corruption...

Link: #1151
Signed-off-by: Jens Axboe <[email protected]>
@axboe
Copy link
Owner

axboe commented May 22, 2024

That's fine, I did push an init-mem branch with the proposed fix and the test case as well. As it stands, if you just want to get it into your branch, you can just do:

$ git cherry-pick 9f4464d98e9b4a9a13f2e387e599789b93ebf4d0

and that should pick cleanly into the liburing 2.6 branch or the current git branch.

test/init-mem.c is the test case. Pretty basic, just setups up a bunch of differently sized rings and checks that nobody stomps on memory outside of what io_uring_queue_init_mem() told us it used.

@lano1106
Copy link
Author

[2024-05-22 15:46:55] INFO WSCTX/WebSocketContext mmap(2097152) succeeded: 0x0x7fbd5be00000
[2024-05-22 15:46:55] INFO WSCTX/initContext Options: 00001030
[2024-05-22 15:46:55] INFO WSCTX/logRingMemory tsi 0 sq:256 cq:16384 io_uring mem with sq array:279616 rounded:282624 safety:3008
[2024-05-22 15:46:55] INFO WSCTX/logRingMemory tsi 0 sq:256 cq:16384 io_uring mem:278592 rounded:282624 safety:4032
[2024-05-22 15:46:55] INFO WSCTX/createIoUringLoop tsi 0 io_uring loop took 282624 bytes from the huge page 0x0x7fbd5be00000-0x0x7fbd5be45000
[2024-05-22 15:46:55] INFO WSCTX/createIoUringLoop Master io_uring fd: 14
[2024-05-22 15:46:55] INFO WSCTX/getExtraHugePages mmap(142606336) succeeded: 0x0x7fbd3b800000
[2024-05-22 15:46:55] INFO WSCTX/initGroupBuffer Provide 8192 17408 bytes buffers to gid 1 from address 0x0x7fbd3b800000-0x0x7fbd44000000 (142606336)
[2024-05-22 15:46:55] INFO Runnable/start ExecutionSVC initialized
[2024-05-22 15:46:55] INFO WSCTX/logRingMemory tsi 1 sq:256 cq:512 io_uring mem with sq array:25664 rounded:28672 safety:3008
[2024-05-22 15:46:55] INFO WSCTX/logRingMemory tsi 1 sq:256 cq:512 io_uring mem:24640 rounded:28672 safety:4032
[2024-05-22 15:46:55] INFO WSCTX/createIoUringLoop tsi 1 io_uring loop took 28672 bytes from the huge page 0x0x7fbd5be45000-0x0x7fbd5be4c000
[2024-05-22 15:46:55] INFO WSCTX/initGroupBuffer Provide 32 17408 bytes buffers to gid 2 from address 0x0x7fbd5be4c000-0x0x7fbd5bed4000 (557056)
[2024-05-22 15:46:55] INFO Runnable/start TradeFeed initialized
[2024-05-22 15:46:55] INFO WSCTX/logRingMemory tsi 2 sq:256 cq:512 io_uring mem with sq array:25664 rounded:28672 safety:3008
[2024-05-22 15:46:55] INFO WSCTX/logRingMemory tsi 2 sq:256 cq:512 io_uring mem:24640 rounded:28672 safety:4032
[2024-05-22 15:46:55] INFO WSCTX/createIoUringLoop tsi 2 io_uring loop took 28672 bytes from the huge page 0x0x7fbd5bed4000-0x0x7fbd5bedb000
[2024-05-22 15:46:55] INFO WSCTX/initGroupBuffer Provide 64 17408 bytes buffers to gid 3 from address 0x0x7fbd5bedb000-0x0x7fbd5bfeb000 (1114112)

the patch provides an highest amount of reserved memory. Therefore placing other elements adjacent to the ring stops being corrupted.

the initial TCP inbound data is not observed anymore

@axboe
Copy link
Owner

axboe commented May 22, 2024

It will overshoot a bit, but that's better than undershooting... I have merged it into master, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants