[Feature Request, Nodejs 10.5+] Execute Workers inside worker threads #253

Open
edy opened this issue Jun 26, 2018 · 10 comments

@edy

edy commented Jun 26, 2018

Nodejs 10.5 has a new experimental feature: worker threads. It would be cool if node-resque would run its jobs inside worker threads.

One huge benefit I can think of is that you could kill stuck jobs.
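
For illustration, a minimal sketch of what "killing a stuck job" could look like with `worker_threads`. This is not node-resque code: `runWithTimeout` and `stuckJobSource` are hypothetical names, and the worker source is passed inline with `eval: true` only to keep the example in one file.

```javascript
const { Worker } = require('worker_threads');

// Inline worker source (eval: true). The busy loop stands in for a job
// that never finishes.
const stuckJobSource = `
  while (true) {} // simulate a stuck, CPU-bound job
`;

// Hypothetical helper: run a job in a thread and terminate it if it runs
// longer than timeoutMs. Resolves with { timedOut, exitCode }.
function runWithTimeout(source, timeoutMs) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, { eval: true });
    let timedOut = false;

    const timer = setTimeout(() => {
      timedOut = true;
      // terminate() stops the thread even while its event loop is blocked
      worker.terminate();
    }, timeoutMs);

    worker.once('error', (err) => {
      clearTimeout(timer);
      reject(err);
    });
    worker.once('exit', (exitCode) => {
      clearTimeout(timer);
      resolve({ timedOut, exitCode });
    });
  });
}

runWithTimeout(stuckJobSource, 100).then((result) => {
  console.log('timed out:', result.timedOut); // logs "timed out: true"
});
```

The parent stays responsive the whole time because the busy loop blocks only the worker's own event loop.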

@evantahler
Member

Closing for now... until that API becomes a little more stable...

@naz

naz commented Nov 12, 2020

@evantahler worker threads are stable as of Node v12 and can be polyfilled for older versions of node using a lib like https://github.com/chjj/bthreads.

@evantahler
Member

evantahler commented Nov 12, 2020

@naz cool!

Can you share some of the benefits you'd like to see with a threads implementation? Of course, moving CPU-bound jobs to another thread is a good idea. I'm a little worried about the need to really re-instantiate the whole process to get a worker (const worker = new Worker(__filename); from https://nodejs.org/api/worker_threads.html). Do you know of any good resources talking about sharing memory or other resources between workers and the main thread?
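
On the memory-sharing question, a small sketch for reference. Each worker thread gets its own V8 isolate and event loop, so module-level state is created again inside the thread, and values sent with `postMessage` or `workerData` are copied. A `SharedArrayBuffer` is the exception: both sides see the same bytes. The names `workerSource` and `countInWorker` are made up for this example, and the worker source is inline (`eval: true`) instead of `new Worker(__filename)` to keep it in one file.

```javascript
const { Worker } = require('worker_threads');

// Inline worker: increments a counter that lives in shared memory.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  const counter = new Int32Array(workerData.shared);
  for (let i = 0; i < 1000; i++) {
    Atomics.add(counter, 0, 1); // atomic increment, visible to the parent
  }
  parentPort.postMessage('done');
`;

// Hypothetical helper: resolves with the counter value the parent reads
// from the shared buffer after the worker reports it is done.
function countInWorker() {
  return new Promise((resolve, reject) => {
    const shared = new SharedArrayBuffer(4); // room for one Int32
    const counter = new Int32Array(shared);
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { shared }, // the buffer is shared, not copied
    });
    worker.once('message', () => resolve(Atomics.load(counter, 0)));
    worker.once('error', reject);
  });
}

countInWorker().then((value) => console.log(value)); // logs 1000
```

This only covers raw memory. Things like open database connections are not shareable this way, so a thread would still have to open its own.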

Either way, I think the place to try this out would be inside the multiWorker - with each worker (node-resque) being a new worker (node.js). I see some grammar issues in our future!

@glensc
Contributor

glensc commented Nov 12, 2020

perhaps re-open this issue to keep it visible

@evantahler evantahler reopened this Nov 12, 2020
@naz

naz commented Nov 16, 2020

Hey @evantahler! For now, my main use case has been offloading CPU-intensive work from the main thread/event loop. The worker instance creation cost is a real concern, and I haven't found a good approach to it just yet. The best way to reduce that cost is a thread pool; an example implementation is documented in the Node docs.

To be completely clear, I am not actively using node-resque. For my use case all queuing/scheduling has to be done in memory. I am experimenting with bree at the moment, which uses bthreads under the hood to polyfill worker threads. bthreads has a worker pool implemented (I haven't looked under the hood yet), and from the looks of it its main purpose is parallelization rather than saving on worker creation cost.

I was researching node-resque's codebase to see how/why things are done a certain way 😅 I didn't see worker threads used here and thought a ping would spark a conversation. Would be happy to use this issue as discussion ground for best approaches and knowledge sharing in the context of background job processing!
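
As a rough sketch of the thread pool idea (not the implementation from the Node docs, and not bthreads' pool): workers are created once and reused, so the creation cost is paid per worker rather than per job. `TinyPool` and `workerSource` are hypothetical names, and a crashed worker is simply not replaced here.

```javascript
const { Worker } = require('worker_threads');

// Inline worker: sums 1..n for each number it receives.
const workerSource = `
  const { parentPort } = require('worker_threads');
  parentPort.on('message', (n) => {
    let sum = 0;
    for (let i = 1; i <= n; i++) sum += i;
    parentPort.postMessage(sum);
  });
`;

class TinyPool {
  constructor(size) {
    this.workers = [];
    this.idle = [];  // workers waiting for a task
    this.queue = []; // tasks waiting for a worker
    for (let i = 0; i < size; i++) {
      const worker = new Worker(workerSource, { eval: true });
      this.workers.push(worker);
      this.idle.push(worker);
    }
  }

  // Queue a task; resolves with the worker's reply.
  run(input) {
    return new Promise((resolve, reject) => {
      this.queue.push({ input, resolve, reject });
      this._dispatch();
    });
  }

  // Hand queued tasks to idle workers until one of the two runs out.
  _dispatch() {
    while (this.idle.length > 0 && this.queue.length > 0) {
      const worker = this.idle.pop();
      const task = this.queue.shift();
      const onMessage = (result) => {
        worker.off('error', onError);
        this.idle.push(worker); // worker is reusable again
        task.resolve(result);
        this._dispatch();
      };
      const onError = (err) => {
        worker.off('message', onMessage);
        task.reject(err); // sketch only: the dead worker is not replaced
      };
      worker.once('message', onMessage);
      worker.once('error', onError);
      worker.postMessage(task.input);
    }
  }

  close() {
    return Promise.all(this.workers.map((w) => w.terminate()));
  }
}

async function main() {
  const pool = new TinyPool(2);
  const results = await Promise.all([10, 100, 1000].map((n) => pool.run(n)));
  console.log(results); // [ 55, 5050, 500500 ]
  await pool.close();
}
main();
```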

@evantahler
Member

evantahler commented Nov 16, 2020

@naz yeah, let's chat!

My world-view is roughly that these are the types of background task systems that can exist (from https://blog.evantahler.com/background-tasks-in-node-js-a-survey-with-redis-971d3575d9d2)
[Screenshot: Screen Shot 2020-11-16 at 2 05 17 PM]

... and when I talk about background tasks, I generally mean those that are:

  1. distributable across multiple processes / computers
  2. idempotent (at least internally) meaning you can run the task with only the state information included within the task's params, and look up the rest from somewhere else, like a database, API, or file

So with that worldview, node-resque really zooms in on the use case of an API deployed across multiple servers. I think in your case, you are working on what I called local messages above: one process or thread is in charge and sends out work to other threads/processes. In the node-resque use case, I'm curious what it looks like for each "worker" to "fork" (terribly imprecise terms) and reconnect to all the other resources it might need, such as persistent database connections, tmp file use, etc. It's certainly possible (Rails has been doing this for years... and if Ruby can do it, we can ;) but what does the developer API look like: on('workerNewThread' => connectToPostgres) or similar?

@naz

naz commented Nov 17, 2020

We are on the same page about the world-view, and you are spot on about the case I'm trying to solve right now. In the future there will be a need for a hybrid solution where the core processing of foreground/parallel/local messages is all done by the same "job manager", with an option of giving the manager a way to have its work queue persisted. In other words, the job manager will be able to change its task strategy from local to remote depending on the environment, which is a story for a completely different project :)

What I think this project might gain from using Worker (from worker_threads) or a forked process (from child_process) is a utility aspect (communication is something to solve, but it doesn't have to be solved immediately imo). The utility of a separate worker thread or forked process would be "sandboxing" workers from the parent event loop, allowing them to fail or leak memory without crashing the parent process, to run in parallel without blocking when there are multiple CPU-intensive jobs, and to be terminated when a job gets stuck.

With the above in mind, I don't think there should be much API change on node-resque's side, apart from allowing the creation of a new (modified) type of Worker that "forks" into a thread or child process. Because of the idempotent nature of background tasks, a worker definition should ideally be self-contained: able to connect to resources without any additional inputs except a few parameters specific to a task (which comes with the overhead of recreating all the connections).
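
A minimal sketch of the "fail without crashing the parent" part, assuming plain `worker_threads`. `runIsolated` and `crashingJobSource` are hypothetical names. The `resourceLimits` option (available in newer Node releases) is included as one possible way to cap a leaking job's heap; the 64 MB value is arbitrary.

```javascript
const { Worker } = require('worker_threads');

// Inline worker: a job that throws an uncaught error.
const crashingJobSource = `
  throw new Error('job blew up');
`;

// Hypothetical helper: runs a job in a thread and reports how it ended.
// It never rejects; a crash in the thread is returned as data.
function runIsolated(source) {
  return new Promise((resolve) => {
    const worker = new Worker(source, {
      eval: true,
      // Caps the thread's old-generation heap. Arbitrary value for the sketch.
      resourceLimits: { maxOldGenerationSizeMb: 64 },
    });
    let failure = null;
    // An uncaught error in the thread arrives here instead of crashing the parent.
    worker.once('error', (err) => {
      failure = err;
    });
    worker.once('exit', (exitCode) => resolve({ failure, exitCode }));
  });
}

runIsolated(crashingJobSource).then(({ failure }) => {
  console.log('parent still running, job failed with:', failure.message);
});
```

The parent only survives because it listens for the `'error'` event; what to do next (retry, mark the job failed) is left open here.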

Maybe I'm way off with this thinking, but hopefully it helps :)

@evantahler
Member

I think that makes a lot of sense, and is a good idea! I guess my concerns can all be met by making the use of worker_threads optional and opt-in... and default:false to be backwards-compatible.

Implementation Questions:
For your use-case, what would be more useful:

  1. A pool of workers that boots up when you start your app and is already running (like multiWorker), ready to be passed jobs. Pro: the workers are already up, so there is no startup cost per job. Con: they run one file forever and you can't change it.
  2. Each job gets a special __filename argument and will new Worker(filename) as its first command. Pro: flexible. Con: Each job may actually be kind of slow to start up.

In either case, we would pass the name of the job and JSON.stringify the args over as messages.
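
A rough sketch of that message shape, using option 2 (one thread per job). Everything here is hypothetical: the `{ jobName, argsJson }` message format, the `performInThread` helper, and the tiny job table inside the thread are stand-ins, not node-resque APIs.

```javascript
const { Worker } = require('worker_threads');

// Inline worker: looks up the job by name and runs it with the parsed args.
const jobRunnerSource = `
  const { parentPort } = require('worker_threads');
  // Hypothetical job table living inside the thread.
  const jobs = {
    add: async (a, b) => a + b,
    greet: async (name) => 'hello ' + name,
  };
  parentPort.once('message', async ({ jobName, argsJson }) => {
    try {
      const args = JSON.parse(argsJson);
      const result = await jobs[jobName](...args);
      parentPort.postMessage({ ok: true, resultJson: JSON.stringify(result) });
    } catch (err) {
      parentPort.postMessage({ ok: false, error: String(err && err.message) });
    }
  });
`;

// Hypothetical helper: starts a fresh thread for one job and resolves
// with the job's result.
function performInThread(jobName, args) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(jobRunnerSource, { eval: true });
    worker.once('message', (reply) => {
      worker.terminate(); // one thread per job, so it is done after one reply
      if (reply.ok) resolve(JSON.parse(reply.resultJson));
      else reject(new Error(reply.error));
    });
    worker.once('error', reject);
    // Job name plus JSON-encoded args, as described above.
    worker.postMessage({ jobName, argsJson: JSON.stringify(args) });
  });
}

performInThread('add', [2, 3]).then((sum) => console.log(sum)); // logs 5
```

With option 1 the same message shape would be sent to a long-lived worker from a pool instead of a freshly created one.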

I think we really need to be clear about the limitations and isolation for using worker_threads. For example, a really common resque job is for one task to enqueue another when it's done. If you run your job in a thread, you can't access worker.queue and all the related methods. I don't necessarily agree that each worker can be truly isolated (but yes, it should be idempotent). Consider this typical job:

const jobs = [{
  sendEmail: async (userId) => {
    // assumes the User model is already connected to the database
    const user = await User.findOne(userId)
    await emailThing(user).send()
  }
}]

This way of writing the job assumes you have already connected your User model to the database, done something like await User.connect(), etc. In the thread, you would need to do all of that again as part of the job.
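
To make both limitations concrete, here is a sketch under stated assumptions. The `connect()` function is a fake stand-in for something like `await User.connect()`, the `{ type: 'enqueue' }` message is an invented convention for asking the parent to enqueue a follow-up job (since the thread has no `worker.queue`), and `logEmailSent` is a made-up job name.

```javascript
const { Worker } = require('worker_threads');

const threadSource = `
  const { parentPort } = require('worker_threads');

  // Fake stand-in for connecting a model to the database.
  async function connect() {
    return { findOne: async (id) => ({ id, email: 'user' + id + '@example.com' }) };
  }
  const ready = connect(); // runs once per thread, not once per job

  parentPort.on('message', async ({ jobName, args }) => {
    const User = await ready;
    if (jobName === 'sendEmail') {
      const user = await User.findOne(args[0]);
      // No access to worker.queue in here, so ask the parent to enqueue.
      parentPort.postMessage({ type: 'enqueue', jobName: 'logEmailSent', args: [user.email] });
    }
    parentPort.postMessage({ type: 'done' });
  });
`;

// Hypothetical helper: runs sendEmail in a thread and resolves with the
// follow-up jobs the thread asked the parent to enqueue.
function runSendEmail(userId) {
  return new Promise((resolve, reject) => {
    const enqueued = []; // stands in for a real enqueue call made by the parent
    const worker = new Worker(threadSource, { eval: true });
    worker.on('message', (msg) => {
      if (msg.type === 'enqueue') {
        enqueued.push({ jobName: msg.jobName, args: msg.args });
      } else if (msg.type === 'done') {
        worker.terminate();
        resolve(enqueued);
      }
    });
    worker.once('error', reject);
    worker.postMessage({ jobName: 'sendEmail', args: [userId] });
  });
}

runSendEmail(7).then((followUps) => console.log(followUps));
```

Doing the connection at thread start rather than inside each job only pays off when the thread is long-lived, as in the pool option above.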

I'll try to get an example going soon!

@evantahler
Member

Lol - to test this out I decided to calculate Fibonacci numbers in background tasks while on laptop battery... that was not a smart idea.

@naz

naz commented Nov 17, 2020

Just to keep some references around: breejs/bree#45 is an issue in an alternative job manager lib. It will hopefully contain specific performance implications of running worker threads or forking processes (or might borrow data from here 😅).

For my current use case I have decided to stick with bree for now, as it's much more lightweight and easier to adjust to my current in-memory queuing needs. Will be lurking around here for sure!
