Vision utilities for web interaction agents

Main site • Twitter • Discord
If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:
- How do you map LLM responses back into web elements?
- How can you mark up a page so an LLM can better understand its action space?
- How do you feed a "screenshot" to a text-only LLM?
At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects. Because of this, we're now open-sourcing this simple utility library for multimodal web agents: Tarsier! The video below demonstrates Tarsier usage by feeding a page snapshot into a LangChain agent and letting it take actions.
(Demo video: tarsier.mp4)
Tarsier works by visually "tagging" interactable elements on a page via brackets and an ID, such as `[1]`. In doing this, we provide a mapping between elements and IDs for GPT-4(V) to take actions upon. We define interactable elements as buttons, links, or input fields that are visible on the page.

Tarsier also provides OCR utilities to convert a page screenshot into a whitespace-structured string that an LLM without vision can understand. This means Tarsier enables deeper interaction even for non-multimodal LLMs, which is important given the performance issues of existing vision-language models.
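As a rough sketch of how that text snapshot could be used, a text-only LLM can be asked to pick a tag ID directly from the string produced by `page_to_text` (shown in the usage example below). The `choose_element` helper, the prompt wording, and the OpenAI model name here are illustrative assumptions, not part of Tarsier's API:

```python
# Illustrative sketch only: ask a text-only LLM which tagged element to act on.
# The helper name, prompt, and model choice are assumptions, not Tarsier APIs.
from openai import OpenAI

client = OpenAI()

def choose_element(page_text: str, goal: str) -> str:
    """Return the model's reply, expected to contain a tag ID such as '3'."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You control a web page. Reply only with the ID of the tagged element to interact with.",
            },
            {"role": "user", "content": f"Goal: {goal}\n\nPage:\n{page_text}"},
        ],
    )
    return response.choices[0].message.content
```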
```bash
pip install tarsier
```
Visit our cookbook for agent examples using Tarsier:
- An autonomous LangChain web agent
- An autonomous LlamaIndex web agent
Otherwise, basic Tarsier usage might look like the following:
```python
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page

if __name__ == '__main__':
    asyncio.run(main())
```
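Once a model has picked a tag, the `tag_to_xpath` mapping returned above can be used to drive the browser. A minimal sketch, where the `act_on_tag` helper is hypothetical and parsing the tag ID out of the model's reply is left to you:

```python
# Minimal sketch: resolve an LLM-chosen tag ID back to its element and click it.
# `act_on_tag` is a hypothetical helper, not part of Tarsier's API.
async def act_on_tag(page, tag_to_xpath: dict, chosen_tag: int) -> None:
    xpath = tag_to_xpath[chosen_tag]
    await page.click(f"xpath={xpath}")
```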
Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element. Specifically:

- `[#ID]`: text-insertable fields (e.g. `textarea`, `input` with textual type)
- `[@ID]`: hyperlinks (`<a>` tags)
- `[$ID]`: other interactable elements (e.g. `button`, `select`)
- `[ID]`: plain text (if you pass `tag_text_elements=True`)
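For illustration only, a tagged text snapshot of a hypothetical login page might look roughly like the following; the exact layout and spacing depend on the page and the OCR output:

```
[1] Welcome back
[#2] Email address    [#3] Password
[$4] Sign in          [@5] Forgot your password?
```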
We have provided a handy setup script to get you up and running with Tarsier development.
```bash
./script/setup.sh
```
If you modify any TypeScript files used by Tarsier, you'll need to execute the following command. This compiles the TypeScript into JavaScript, which can then be utilized in the Python package.
```bash
npm run build
```
We use pytest for testing. To run the tests, simply run:
```bash
poetry run pytest .
```
Prior to submitting a PR, please run the following to format your code:
```bash
./script/format.sh
```
Tarsier currently supports the following OCR services:

- Google Cloud Vision
- Amazon Textract (Coming Soon)
- Microsoft Azure Computer Vision (Coming Soon)
Roadmap:

- Add documentation and examples
- Clean up interfaces and add unit tests
- Launch
- Improve OCR text performance
- Add options to customize tagging styling
- Add support for other browser drivers as necessary
- Add support for other OCR services as necessary
To cite Tarsier, please use:

```bibtex
@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}
```