Added a simple inference webserver #353

Open · wants to merge 1 commit into base: main
Conversation

enn-nafnlaus

Usage example:

accelerate launch scripts/finetune.py summarize.yaml --inference --base_model=path/to/my/model --load_in_8bit=True --server --server_port 1567 --server_addr 127.0.0.1

Then in another terminal:

curl -X POST -d "$(cat test_text.txt)" http://localhost:1567/

@utensil (Sponsor Contributor) commented Aug 9, 2023

Is it possible to allow submitting inference requests during training? Just like running a manual eval.

@enn-nafnlaus (Author)

> Is it possible to allow submitting inference requests during training? Just like running a manual eval.

Could always run a separate server while training.

@utensil (Sponsor Contributor) commented Aug 9, 2023

> Is it possible to allow submitting inference requests during training? Just like running a manual eval.

> Could always run a separate server while training.

The idea is to integrate it fully into training. The server accepts requests and puts them into a queue, and the model being trained is registered with a callback that dequeues the requests and runs inference on them, just like running evals during training.

That way, we can reuse the model being trained to do live inference at will.

I admit that this might be a separate PR.
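
A minimal sketch of that idea, assuming a transformers TrainerCallback plus the standard-library http.server and queue modules (the names here are illustrative, not code from this PR):

import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import torch
from transformers import TrainerCallback

# Shared queue of (prompt, result_holder, done_event) tuples.
request_queue = queue.Queue()


class EnqueueHandler(BaseHTTPRequestHandler):
    """Accept a prompt over HTTP and block until the training loop answers it."""

    def do_POST(self):
        prompt = self.rfile.read(int(self.headers["Content-Length"])).decode()
        holder, done = {}, threading.Event()
        request_queue.put((prompt, holder, done))
        done.wait()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(holder["text"].encode())


class LiveInferenceCallback(TrainerCallback):
    """Drain queued prompts using the model that is currently being trained."""

    def on_step_end(self, args, state, control, model=None, tokenizer=None, **kwargs):
        while not request_queue.empty():
            prompt, holder, done = request_queue.get_nowait()
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                output = model.generate(**inputs, max_new_tokens=256)
            holder["text"] = tokenizer.decode(output[0], skip_special_tokens=True)
            done.set()


def start_live_inference_server(port=1567):
    """Run the HTTP listener in a daemon thread alongside training."""
    server = HTTPServer(("127.0.0.1", port), EnqueueHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

In a real setup the model may be wrapped (Accelerate/DDP), so it would need unwrapping and the callback should only run on one rank; that is omitted here.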

@enn-nafnlaus (Author) commented Aug 9, 2023

I agree, that's a separate PR. :) This is a very simple, lightweight server: just a modification of the existing inference code to take instructions from HTTP requests rather than the terminal, and to return the results to the requesting client rather than printing them to the screen. If you want to take the time to configure it to run training and inference through a queue, go ahead. :)

I implemented it because, at present, to run test inferences you either have to do everything completely manually (typing / pasting into a terminal for each one), or you pay the long overhead of waiting for inference to start up for every single test. By starting an HTTP server and accepting requests on it, you can automatically run many different tests without a ton of overhead. Or use it for non-testing / production purposes, for that matter.
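
As a rough illustration of that workflow (the test_prompts directory and the third-party requests library are assumptions, not part of this PR), a tiny driver could sweep a folder of prompts against the running server:

# Hypothetical test driver: POST each prompt file to the server started with
# --server --server_port 1567 and print the generated responses.
import pathlib

import requests

for path in sorted(pathlib.Path("test_prompts").glob("*.txt")):
    resp = requests.post("http://localhost:1567/", data=path.read_text().encode())
    print(f"=== {path.name} ===")
    print(resp.text)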

@utensil (Sponsor Contributor) commented Aug 9, 2023

Yes, currently it's not possible to reuse the loaded model to run inference on a batch of inputs, which is very inconvenient.

if not instruction:
    response = ""
else:
    default_tokens = {"unk_token": "<unk>", "bos_token": "<s>", "eos_token": "</s>"}
@winglian (Collaborator) commented Aug 9, 2023

we could probably grab these from the tokenizer, as any special tokens defined in the config are added to the tokenizer when it is instantiated.
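
For example, something along these lines would prefer whatever the instantiated tokenizer already carries, keeping the current literals only as a fallback (a sketch of the suggestion, not code from this PR):

# Prefer the tokens registered on the tokenizer; fall back to the old literals.
default_tokens = {
    "unk_token": tokenizer.unk_token or "<unk>",
    "bos_token": tokenizer.bos_token or "<s>",
    "eos_token": tokenizer.eos_token or "</s>",
}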

@enn-nafnlaus (Author)

Patches welcome. :) Again, this just adapts a copy of the preexisting inference section into a webserver; I didn't change the logic.

server_address = (cfg.server_addr, cfg.server_port)
httpd = socketserver.TCPServer(
    server_address,
    lambda *args, **kwargs: HttpHandler(
        *args, cfg=cfg, prompter=prompter, tokenizer=tokenizer, model=model, **kwargs
    ),
)
print(f"Server running on port {cfg.server_port}")
httpd.serve_forever()
Collaborator

should this be explicitly killed at the end of training?

@enn-nafnlaus (Author)

Can you train and run inference at the same time? This only runs if you're in --inference mode.

@Stillerman (Contributor)

Might be able to use Gradio as a web server #812
