
🙈 Vision utilities for web interaction agents 🙈


🔗 Main site   •   🐦 Twitter   •   📢 Discord

Tarsier

If you've tried using an LLM to automate web interactions, you've probably run into questions like:

  • How should you feed the webpage to an LLM? (e.g. HTML, Accessibility Tree, Screenshot)
  • How do you map LLM responses back to web elements?
  • How can you inform a text-only LLM about the page's visual structure?

At Reworkd, we iterated on all these problems across tens of thousands of real web tasks to build a powerful perception system for web agents... Tarsier! In the video below, we use Tarsier to provide webpage perception for a minimalistic GPT-4 LangChain web agent.

[Demo video: tarsier.mp4]

How does it work?

Tarsier visually tags interactable elements on a page with brackets and an ID, e.g. [23]. This gives the LLM a mapping between IDs and elements so it can act on the page (e.g. CLICK [23]); a sketch of producing this mapping follows below. We define interactable elements as buttons, links, or input fields that are visible on the page; Tarsier can also tag all textual elements if you pass tag_text_elements=True.
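As a minimal sketch of how tagging might look in practice (assuming Tarsier exposes a Tarsier class, a GoogleVisionOCRService, and an async page_to_text method that returns the tagged text plus an ID-to-XPath mapping; these names are assumptions based on typical usage, not confirmed by this section):

```python
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService


async def main():
    # Hypothetical: an OCR backend built from a Google Cloud credentials dict.
    ocr_service = GoogleVisionOCRService({})
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        # Tag interactable (and, here, textual) elements. Returns the tagged
        # page text and a dict mapping tag IDs (e.g. 23) to element XPaths.
        page_text, tag_to_xpath = await tarsier.page_to_text(
            page, tag_text_elements=True
        )
        print(page_text)
        print(tag_to_xpath)


asyncio.run(main())
```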

Furthermore, we've developed an OCR algorithm to convert a page screenshot into a whitespace-structured string (almost like ASCII art) that even a text-only LLM can understand. This is critical because current vision-language models still lack the fine-grained spatial representations needed for web interaction tasks. On our internal benchmarks, unimodal GPT-4 + Tarsier-Text beats GPT-4V + Tarsier-Screenshot by 10-20%!
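The tag-to-XPath mapping closes the loop: the tagged text is fed to the LLM, and its response is resolved back to a concrete element. Below is a hypothetical sketch of that lookup with Playwright; the CLICK [23] action format is taken from above, but the execute_action helper and its parsing are illustrative, not part of Tarsier's API:

```python
import re

from playwright.async_api import Page


async def execute_action(page: Page, tag_to_xpath: dict[int, str], llm_response: str):
    # Parse an action like "CLICK [23]" out of the LLM's response.
    match = re.match(r"CLICK \[(\d+)\]", llm_response.strip())
    if match is None:
        raise ValueError(f"Unrecognized action: {llm_response!r}")

    # Look up the tagged element's XPath and click it.
    xpath = tag_to_xpath[int(match.group(1))]
    await page.locator(f"xpath={xpath}").click()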

[Figure: tagged screenshot (left) and its tagged text representation (right)]

Installation