Set-of-Mark Prompting - Visual Prompting for Vision!

🍇 [Read our arXiv Paper] 🍎 [Project Page]

Jianwei Yang*⚑, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao

* Core Contributors ⚑ Project Lead

We present Set-of-Mark (SoM) prompting, which simply overlays a set of spatial and speakable marks on an image to unleash the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V.

🔥 News

[10/18] We are going to release the SoM toolbox very soon. Stay tuned!

🔗 Related links

Our method combines the following models to generate the set of marks:

Mask DINO: State-of-the-art closed-set image segmentation model
SEEM: Versatile, promptable, interactive and semantic-aware segmentation model
Semantic-SAM: Segment and recognize anything at any granularity
Segment Anything: Segment anything

We are standing on the shoulders of the giant GPT-4V (playground)!

Set-of-Mark Prompting by Roboflow: Reimplementation of SoM by @SkalskiP from Roboflow
Set-of-Mark Prompting for UI Navigation Agent: A really brilliant work using GPT-4V and SoM as a web copilot!

👉 Comparing standard GPT-4V and its combination with SoM Prompting

📍 SoM Toolbox for image partition

Users can select which granularity of masks to generate, and which mode to use between automatic (top) and interactive (bottom). A higher alpha blending value (0.4) is used for better visualization.
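As a rough illustration of what alpha blending a mask and placing a mark involves, here is a minimal sketch. It is not the toolbox's implementation: `blend_mask` and `mark_position` are hypothetical names, the image is a plain nested list of RGB tuples to keep the example dependency-free, and placing the mark at the mask centroid is an assumption made for simplicity.

```python
def blend_mask(image, mask, color, alpha=0.4):
    """Return a copy of `image` with `color` alpha-blended wherever `mask` is truthy.

    `image` is a list of rows of (R, G, B) tuples; `mask` has the same shape.
    """
    blended = []
    for image_row, mask_row in zip(image, mask):
        row = []
        for pixel, inside in zip(image_row, mask_row):
            if inside:
                # Standard alpha blend: keep (1 - alpha) of the pixel, add alpha of the color.
                pixel = tuple(
                    round((1 - alpha) * p + alpha * c) for p, c in zip(pixel, color)
                )
            row.append(pixel)
        blended.append(row)
    return blended


def mark_position(mask):
    """Return the (row, col) centroid of `mask`, or None if the mask is empty.

    Assumption: the centroid is a reasonable spot for the numeric mark.
    For concave regions it can fall outside the mask.
    """
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return None
    n = len(coords)
    return (round(sum(r for r, _ in coords) / n), round(sum(c for _, c in coords) / n))


# Tiny usage example on a 2x3 gray image with one region in the top row.
image = [[(100, 100, 100)] * 3 for _ in range(2)]
mask = [[1, 1, 1], [0, 0, 0]]
overlay = blend_mask(image, mask, color=(255, 0, 0), alpha=0.4)
position = mark_position(mask)
```

A higher `alpha` makes the region color more visible at the cost of hiding more of the underlying image.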

🦄 Interleaved Prompt

SoM enables interleaved prompts which include textual and visual content. The visual content can be represented using the region indices.
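The sketch below shows one way region indices could be woven into a text prompt and read back out of a model's answer. The bracketed `[3]` notation and the helper names `build_prompt` and `referenced_marks` are assumptions for illustration, not a format defined by this repository.

```python
import re


def build_prompt(parts):
    """Join text fragments and integer region indices into one prompt string.

    Integers are rendered as bracketed marks, e.g. 3 -> "[3]" (assumed notation).
    """
    return "".join(f"[{p}]" if isinstance(p, int) else p for p in parts)


def referenced_marks(text, valid_ids):
    """Return the distinct mark indices mentioned in `text`, in order of first use.

    Indices not present in `valid_ids` are ignored, which filters out
    references to marks that were never drawn on the image.
    """
    seen = []
    for match in re.findall(r"\[(\d+)\]", text):
        index = int(match)
        if index in valid_ids and index not in seen:
            seen.append(index)
    return seen


prompt = build_prompt(["What is ", 3, " next to ", 5, "?"])
answer = "The cup [3] sits left of [5], and [3] is red. [9]"
marks = referenced_marks(answer, valid_ids={1, 2, 3, 5})
```

Mapping the returned indices back to their masks gives the image regions that the answer refers to.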

🎖️ Mark types used in SoM

🌋 Evaluation tasks examples

Use case

🌷 Grounded Reasoning and Cross-Image Reference

Compared to GPT-4V without SoM, adding marks enables GPT-4V to ground its reasoning in the detailed contents of the image (left). Clear cross-image object references are observed on the right.

🏕️ Problem Solving

Case study on solving CAPTCHA. Without SoM, GPT-4V gives a wrong answer with the wrong number of squares; after SoM prompting, it finds the correct squares and reports their corresponding marks.

🏔️ Knowledge Sharing

Case study on an image of a dish. GPT-4V does not produce a grounded answer from the original image. With SoM prompting, GPT-4V not only names the ingredients but also maps them to the corresponding regions.

🕌 Personalized Suggestion

SoM-prompted GPT-4V gives very precise suggestions, while the original one fails and even hallucinates foods, e.g., soft drinks.

🌼 Tool Usage Instruction

Likewise, GPT-4V with SoM can provide thorough tool usage instructions, teaching users the function of each button on a controller. Note that this image is not fully labeled, yet GPT-4V can also provide information about the non-labeled buttons.

🌻 2D Game Planning

GPT-4V with SoM gives a reasonable suggestion on how to achieve a goal in a gaming scenario.

🕌 Simulated Navigation

🌳 Results

We conduct experiments on various vision tasks to verify the effectiveness of SoM. Results show that GPT-4V+SoM outperforms specialists on most vision tasks and is comparable to MaskDINO on COCO panoptic segmentation.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{yang2023setofmark,
      title={Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V}, 
      author={Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chunyuan Li and Jianfeng Gao},
      journal={arXiv preprint arXiv:2310.11441},
      year={2023},
}
