
Running Demos

To execute a demo, use its configuration name. For instance:

python run_demo.py -t QA1

The server and UI will be spawned as subprocesses that run in the background. You can use the PIDs (Process IDs) to terminate them when needed.
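If you prefer to stop the background processes from Python rather than from the shell, a minimal sketch using only the standard library is shown below; the PID values are placeholders, so substitute the ones reported for your run.

import os
import signal

# Placeholder PIDs: replace with the PIDs of the server and UI subprocesses.
pids = [12345, 12346]

for pid in pids:
    try:
        os.kill(pid, signal.SIGTERM)  # request a graceful shutdown
    except ProcessLookupError:
        print(f"Process {pid} is not running")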

To list the available configurations, use the --help flag.

Available Demos

| Name | Description | Config Name |
|------|-------------|-------------|
| Q&A | Abstractive Q&A demo utilizing BM25, SBERT reranker, and FiD model. | QA1 |
| Q&A | Abstractive Q&A demo using ColBERT v2 (with PLAID index) retriever and FiD reader. | QA2 |
| Summarization | Summarization demo employing BM25, SBERT reranker, and long-T5 reader. | SUM |
| LLM | Retrieval augmented generation with generative LLM model. | LLM |

Please note that the ColBERT demo with a Wikipedia index may take around 15 minutes to load. Also, make sure to review the README for information regarding GPU usage requirements.

Additional Options

If you already have a fastRAG pipeline service running locally and wish to use it with one of the provided UIs, add the --only-ui flag to the demo script:

python run_demo.py -t LLM --only-ui

If your pipeline service is running on a remote machine or on a port other than 8000, use the --endpoint argument to specify its URL:

python run_demo.py -t LLM --endpoint https://hostname:80

To manually run a UI with API_ENDPOINT pointing to a fastRAG service, execute the following command:

API_ENDPOINT=https://localhost:8000 \
             python -m streamlit run fastrag/ui/webapp.py

Make sure to replace https://localhost:8000 with the appropriate URL of your fastRAG service.
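As an alternative to setting the environment variable inline, a small hypothetical Python wrapper (not part of fastRAG) can launch the same UI process with the endpoint configured:

import os
import subprocess

# Point the UI at your running fastRAG service; adjust the URL as needed.
env = dict(os.environ, API_ENDPOINT="https://localhost:8000")

subprocess.run(
    ["python", "-m", "streamlit", "run", "fastrag/ui/webapp.py"],
    env=env,
    check=True,
)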

Screenshot

[Screenshot of the demo UI]

🔥 (NEW) Chat with Documents and Images 🔥

We show how to set up a demo that combines textual and visual question-answering pipelines into a conversational chat system.

Chat Templates

For our chat model, we can specify how the chat template behaves. Each chat template must include:

  • The memory of the chat.
  • The current query from the user.

To use a template, specify its name (given in parentheses) in the configuration. The available chat templates are presented in the section below.
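As an illustration (a simplified sketch, not the library's actual implementation), a chat template is essentially a format string with {memory} and {query} slots; here it is filled using the Default template listed at the end of this document:

# Simplified sketch of how a chat template is filled with the chat memory and the current query.
DEFAULT_TEMPLATE = (
    "The following is a conversation between a human and an AI. "
    "Do not generate the user response to your output.\n"
    "{memory}\n"
    "Human: {query}\n"
    "AI:"
)

memory = "Human: Hello!\nAI: Hi, how can I help you?"  # accumulated conversation history
query = "What is fastRAG?"                             # current user query

prompt = DEFAULT_TEMPLATE.format(memory=memory, query=query)
print(prompt)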

Deploy the API

We deploy the REST API service as follows:

python -m fastrag.rest_api.application --app_type conversation --config config/doc_chat.yaml --port 8000

In the command above, we provide the chat YAML configuration file config/doc_chat.yaml, which has the following format:

chat_model:
  model_kwargs:
    device_map: {"": 3}
    model_max_length: 4096
    task_name: text-generation
  model_name_or_path: meta-llama/Llama-2-7b-chat-hf
  use_gpu: true
doc_pipeline_file: "config/empty_retrieval_pipeline.yaml"

In this example, we deploy a chat_model and also provide a doc_pipeline_file YAML file that specifies our document retrieval pipeline.
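If you prefer to generate such a configuration programmatically, here is a minimal sketch that reproduces the YAML above, assuming PyYAML is installed:

import yaml  # PyYAML

# Mirrors the doc_chat configuration shown above.
config = {
    "chat_model": {
        "model_kwargs": {
            "device_map": {"": 3},
            "model_max_length": 4096,
            "task_name": "text-generation",
        },
        "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
        "use_gpu": True,
    },
    "doc_pipeline_file": "config/empty_retrieval_pipeline.yaml",
}

with open("config/doc_chat.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)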

We can also add a separate model for chat summarization in our chat configuration file, like so:

chat_model:
  ...
summary_model:
  model_kwargs:
    device_map: {"": 3}
    model_max_length: 4096
    task_name: text-generation
  model_name_or_path: togethercomputer/Llama-2-7B-32K-Instruct
  use_gpu: true
...

If you want to use a different prompt template for the chat model, you can choose either the Llama2 or the UserAssistant format. Specify it by adding a chat_template entry to your chat configuration file:

chat_model:
  ...
doc_pipeline_file: ...
chat_template: "Llama2"

To inspect the templates, refer to the chat_template_initalizers.py file.

Deploy Visual Chat API

To use the visual chat system, we need to specify two models: a chat model and a separate summary model, as follows:

chat_model:
  model_kwargs:
    task_name: text-generation
    device_map: {"":0}
    load_in_4bit: true
    torch_dtype: torch.float16
  model_name_or_path: llava-hf/llava-1.5-7b-hf
  use_gpu: true
summary_model:
  model_kwargs:
    model_max_length: 4096
    task_name: text-generation
    device_map: {"":1}
  model_name_or_path: winglian/Llama-2-3b-hf
  use_gpu: true
summary_params:
  summary_frequency: 10
chat_template: "UserAssistantLlava"
doc_pipeline_file: "config/empty_retrieval_pipeline.yaml"
image_pipeline_file: "config/image_retrieval.yaml"

Notice that we are using the "UserAssistantLlava" chat template, since it is the chat_template supported by the specified Llava model.

In this case, we also added an image_pipeline_file, and changed the chat model to a visual chat model.

Using the new configuration file, we can deploy it as follows:

python -m fastrag.rest_api.application --app_type conversation --config config/visual_chat.yaml --port 8000
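Before starting a UI against the service, you can quickly check that something is listening on the API port; this sketch assumes the service runs locally on port 8000 and only tests TCP reachability, not the API itself:

import socket

host, port = "localhost", 8000  # adjust to wherever the service was deployed

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(3)
    if s.connect_ex((host, port)) == 0:
        print(f"The API port is reachable on {host}:{port}")
    else:
        print(f"Nothing is listening on {host}:{port} -- is the service running?")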

Deploy UI

Then, we can deploy the document chat user interface using:

API_CONFIG_PATH=config/doc_chat.yaml API_ENDPOINT=https://localhost:8000 python -m streamlit run fastrag/ui/chat_ui.py --server.port 8501

Alternatively, you can deploy the visual chat interface using:

API_CONFIG_PATH=config/visual_chat.yaml API_ENDPOINT=https://localhost:8000 python -m streamlit run fastrag/ui/chat_ui.py --server.port 8501

Screenshot

[Screenshot of the chat UI]

Available Chat Templates

Default Template

The following is a conversation between a human and an AI. Do not generate the user response to your output.
{memory}
Human: {query}
AI:

Llama 2 Template (Llama2)

<s>[INST] <<SYS>>
The following is a conversation between a human and an AI. Do not generate the user response to your output.
<</SYS>>

{memory}{query} [/INST]

Notice that here, the user messages will be formatted as:

<s>[INST] {USER_QUERY} [/INST]

And the model messages will be:

 {ASSISTANT_RESPONSE} </s>
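As a small illustration (not the library's own memory handling), past turns can be folded into the {memory} string using these two message formats:

# Illustration only: assemble {memory} from past turns using the message formats shown above.
turns = [
    ("Hello!", "Hi, how can I help you today?"),
    ("What is fastRAG?", "A framework for efficient retrieval-augmented generation."),
]

memory = "".join(
    f"<s>[INST] {user_query} [/INST] {assistant_response} </s>"
    for user_query, assistant_response in turns
)
print(memory)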

User-Assistant (UserAssistant)

### System:
The following is a conversation between a human and an AI. Do not generate the user response to your output.
{memory}

### User: {query}
### Assistant:

User-Assistant for Llava (UserAssistantLlava)

For the v1.5 Llava models, we define a specific template, as shown in this post regarding Llava models.

{memory}

USER: {query}
ASSISTANT: