
ord server not distributing work to threads correctly #2292

Open
gmart7t2 opened this issue Jul 20, 2023 · 8 comments

Comments

@gmart7t2
Contributor

gmart7t2 commented Jul 20, 2023

I've noticed that the ord web server doesn't distribute work very evenly between CPU cores. When I make a lot of requests to it at the same time, only one of the requests gets processed immediately and the rest have to wait until the first request is finished.

To demonstrate the effect, I made a bare-bones version of the ord server:

use {
  axum::{routing::get, Router},
  std::{net::SocketAddr, thread, time::Duration},
  tokio::runtime::Runtime,

  anyhow::Error,
  std::io,
  tokio::task,
};

type Result<T = (), E = Error> = std::result::Result<T, E>;

pub fn main() {
  Runtime::new().unwrap().block_on(async {
    let router = Router::new().route("/", get(|| async {
      thread::sleep(Duration::from_millis(1000)); "Hello, world!\n"
    }));

//    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
//    eprintln!("Listening on {}://{addr}", "http");
//    axum_server::Server::bind(addr).serve(router.into_make_service()).await.unwrap();

    spawn(router).unwrap().await.unwrap().unwrap();
  });
}

fn spawn(router: Router) -> Result<task::JoinHandle<io::Result<()>>> {
  let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
  eprintln!("Listening on {}://{addr}", "http");
  Ok(tokio::spawn(async move { axum_server::Server::bind(addr).serve(router.into_make_service()).await }))
}

Run it, and hit / 4 times very quickly. The first request will come back after 1 second but the other 3 usually take 2 seconds.

Then modify the code to the following and try again:

use {
  axum::{routing::get, Router},
  std::{net::SocketAddr, thread, time::Duration},
  tokio::runtime::Runtime,

  // anyhow::Error,
  // std::io,
  // tokio::task,
};

// type Result<T = (), E = Error> = std::result::Result<T, E>;

pub fn main() {
  Runtime::new().unwrap().block_on(async {
    let router = Router::new().route("/", get(|| async {
      thread::sleep(Duration::from_millis(1000)); "Hello, world!\n"
    }));

    let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
    eprintln!("Listening on {}://{addr}", "http");
    axum_server::Server::bind(addr).serve(router.into_make_service()).await.unwrap();

//    spawn(router).unwrap().await.unwrap().unwrap();
  });
}

// fn spawn(router: Router) -> Result<task::JoinHandle<io::Result<()>>> {
//   let addr = SocketAddr::from(([127, 0, 0, 1], 3000));
//   eprintln!("Listening on {}://{addr}", "http");
//   Ok(tokio::spawn(async move { axum_server::Server::bind(addr).serve(router.into_make_service()).await }))
// }

Now all 4 requests return after 1 second.

It appears to be the tokio::spawn() call that is messing things up, but I have no idea why.

@gmart7t2
Contributor Author

This is what I see from each piece of code.

The first one:

$ for i in {1..4}; do time wget -O/dev/null -o/dev/null http://localhost:3000/ & done
[1] 4006905
[2] 4006906
[3] 4006908
[4] 4006910

real	0m1.009s
real	0m2.008s
real	0m2.009s
real	0m2.008s

The second one:

$ for i in {1..4}; do time wget -O/dev/null -o/dev/null http://localhost:3000/ & done
[1] 4007012
[2] 4007013
[3] 4007015
[4] 4007017

real	0m1.009s
real	0m1.009s
real	0m1.009s
real	0m1.009s

@veryordinally
Collaborator

Interesting. I dug a little into how concurrency is handled by Tokio and by the Rust standard library's thread::sleep function. thread::sleep pauses the entire operating system thread for the specified duration, so calling it inside a Tokio async block blocks the Tokio executor thread. This is why you're seeing a sequential handling of requests: after the first request arrives, it puts the executor to sleep for 1 second, during which time no other tasks can be processed.

The reason the code seems to work better without tokio::spawn is that you're not actually running the server in a separate task. The server runs directly on the main Tokio runtime thread, and each request is handled in a new task. Because these tasks don't block the runtime thread, multiple requests can be processed concurrently.

So I assume the issue in the actual ord server is that we have some OS calls that block the Tokio executor thread and we'd have to make them async... Need to look into that more deeply.

@gmart7t2
Contributor Author

gmart7t2 commented Jul 20, 2023

Thanks for looking into it.

This is why you're seeing a sequential handling of requests: after the first request arrives, it puts the executor to sleep for 1 second, during which time no other tasks can be processed

If I hit my sleep endpoint 20 times very quickly, I get 1 response after 1 second, then 8 responses a second later, then 8 more another second later, and the last one a second after that. I have 8 CPU cores.

It appears that it is able to process 8 requests concurrently but only if they were already waiting to be processed when the current request finishes.

Does that match your understanding? Because it doesn't sound like it matches the line I quoted above.

(See the timing output for the first code snippet in #2292 (comment): I get one 1s timing and three 2s timings; the last 3 requests are processed concurrently.)
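That wave pattern is what you'd expect if each blocking sleep pins one worker thread, and the runtime has one worker per core (tokio's multi-thread runtime defaults to that): queued requests drain in waves the size of the worker pool. A std-only sketch of the effect, with scaled-down, made-up numbers (8 workers, 20 jobs of 100 ms each):

```rust
use std::{
  sync::{mpsc, Arc, Mutex},
  thread,
  time::{Duration, Instant},
};

fn main() {
  // 8 "worker threads" drain a queue of 20 blocking 100 ms "requests".
  // Expect three waves: 8 jobs, then 8, then 4.
  let (tx, rx) = mpsc::channel::<u32>();
  for job in 0..20 {
    tx.send(job).unwrap();
  }
  drop(tx); // close the queue so workers exit once it is drained

  let rx = Arc::new(Mutex::new(rx));
  let start = Instant::now();
  let workers: Vec<_> = (0..8)
    .map(|_| {
      let rx = Arc::clone(&rx);
      thread::spawn(move || loop {
        // The lock guard is a temporary released at the end of this statement,
        // so workers sleep concurrently instead of serializing on the mutex.
        let job = rx.lock().unwrap().recv();
        match job {
          Ok(_) => thread::sleep(Duration::from_millis(100)), // blocking handler
          Err(_) => break, // queue closed and empty
        }
      })
    })
    .collect();
  for w in workers {
    w.join().unwrap();
  }

  // ceil(20 / 8) = 3 waves of ~100 ms each: ~300 ms total, not 20 * 100 ms.
  let elapsed = start.elapsed();
  println!("drained 20 jobs with 8 workers in {elapsed:?}");
  assert!(elapsed >= Duration::from_millis(300));
  assert!(elapsed < Duration::from_millis(1000));
}
```

Under this reading, the first request of the 20 grabs a worker immediately; the rest queue up and then complete in core-sized batches, which matches the 1-then-8-then-8 pattern described above.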

@abelmcelroy

abelmcelroy commented Mar 15, 2024

I'm experiencing a similar issue. I have a node.js server sending batches of requests to my ord server; the requests are sent in a non-blocking way, but they are responded to sequentially, each taking a cumulatively larger amount of time, e.g. req 1: 10ms, req 2: 20ms, ... req 150: 1500ms. I don't have the best metrics to provide you, but I've tested my code extensively and have found this behavior only exhibits itself when making requests to ord.

Also, I am using the most recent version of the ord client (on GH) as of today, with an --index-sats and --index-runes index file. This issue has coincided with pulling recent commits within the last few days.

To help you reproduce the issue, just try requesting the content of each of the first 100 inscriptions in block 834061.

@raphjaph
Collaborator

Are you able to bisect it down to a specific commit that caused this regression?

@abelmcelroy

Are you able to bisect it down to a specific commit that caused this regression?

Unfortunately not. My full-time job is pretty demanding right now, and the aforementioned commits required me to update my index version and re-index everything, so somewhere in that span of commits is a paradigm shift I'd really prefer not to cross over again (the v18 index took a lot longer to build than my previous one). Sorry I can't be of more help.

@raphjaph
Collaborator

No worries! Would you be willing to share the node.js server code or benchmarking script you used? Then I could use that as a starting point for debugging.

@abelmcelroy

abelmcelroy commented Mar 21, 2024

No worries! Would you be willing to share the node.js server code or benchmarking script you used? Then I could use that as a starting point for debugging.

Gladly! Hope this helps. I chose a block number where the issue is especially pronounced. If page 1 doesn't demonstrate it, I recommend temporarily changing your page size to something like 1000 instead of 100.

const axios = require('axios');
const ordExplorerUrl = "<FILL ME IN>"

class OrdExplorer {
  async get(path) {
    console.log(`${path} started`);
    console.time(`${path} took`);
    const { data } = await axios.get(`http://${ordExplorerUrl}:81/${path}`, {
      headers: {
        "Accept": "application/json"
      }
    });
    console.timeEnd(`${path} took`);
    return data
  }
}

const ordExplorer = new OrdExplorer();
const currentBlock = 825951;

// Back-off helper used below when the server is overloaded.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function main() {

  console.log("searching block", currentBlock)
  try {
    let page = await ordExplorer.get(`inscriptions/block/${currentBlock}/1`);

    let ids = page.ids.filter(id => !!id);

    await Promise.all(ids.map(async (inscriptionId) => {
      let content = null;
      let errors = 0;
      while (content === null) {
        try {
          content = await ordExplorer.get(`content/${inscriptionId}`);
        } catch (err) {
          if (err.message.includes("connect EADDRNOTAVAIL")) {
            console.log("server overloaded, waiting...")
            await wait(5000);
            errors++;
          } else {
            console.log(`CONTENT FAILED ${inscriptionId}: ${err.message}`);
            return
          }
        }
      }
    }));
  } catch (error) {
    console.log(error);
    process.exit(1);
  }
  process.exit(0)
}

main();

Hopefully this recreates the same behavior for you. You'll notice some slow requests block all other requests while they're being handled. The requests are sent in a non-blocking way (which you can also see in the logs) but appear to be responded to sequentially.

Please let me know if I'm misunderstanding the behavior or if you have any suggestions. Thank you!
