Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou
We introduce GUI action dataset Act2Cap as well as an effective framework: GUI Narrator for GUI video captioning that utilizes the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots.
We release our paper on Arxiv.