Update index.html
tianjiedai committed Apr 22, 2024
1 parent cd7be8e commit 38e3191
Showing 1 changed file with 9 additions and 16 deletions.
index.html: 25 changes (9 additions, 16 deletions)
@@ -211,7 +211,7 @@
 </td>
 <td align="center" width="200px">
 <center>
-<span style="font-size:16px"><a href="https://feng-hong.github.io/research/">Feng Hong</a><sup>2</sup></span>
+<span style="font-size:16px"><a href="https://feng-hong.github.io/research/">Feng Hong</a><sup>1</sup></span>
 </center>
 </td>
 </tr>
@@ -254,7 +254,7 @@
 <tr>
 <td align="center" width="400px">
 <center>
-<span style="font-size:16px"><sup>2</sup>Shanghai AI Laboratory</span>
+<span style="font-size:16px"><sup>2</sup>Shanghai Artificial Intelligence Laboratory</span>
 </center>
 </td>
 </tr>
@@ -314,7 +314,7 @@ <h2> Abstract </h2>
 <h2> Architecture </h2>
 </center>
 <p style="text-align:justify; text-justify:inter-ideograph;">
-<p><div style="text-align: center;"><img class="left" src="./resources/framework.png" width="600px"></div></p>
+<p><div style="text-align: center;"><img class="left" src="./resources/framework.png" width="800px"></div></p>
 <p>
 <left>
 The framework of UniChest, which consists of two training stages. During the ``Conquer" stage, two modality encoders first project visual and textual representations into the common space with alignment, then feed them into the first transformer query networks for prediction. The multi-source common patterns are learnt as much as possible at this stage. During the ``Divide" stage, we freeze the modality encoders and squeeze the source-specific patterns via the MoE-QN module with the guidance of the enhanced supervised loss and the source contrastive learning.</left>
@@ -344,28 +344,21 @@ <h2>Results</h2>
 </center>
 <p>
 <left>
-We compare our best model with existing state-of-the-art approaches, to present a strong, yet simple baseline on video-text alignment for future research.
-As shown in the table, on the challenging HT-Step task,
-that aims to ground unordered procedural steps in videos,
-our model achieves 46.7% R@1, leading to an absolute improvement of 9.3%, over the existing state-of-the-art (37.4%) achieved by VINA;
-On HTM-Align, which aligns narrations in the video,
-our method exceeds sota model by 3.4%;
-On CrossTask, where we need to align video frames and task-specific steps without finetuning, our method outperforms existing state-of-the-art approach by 4.7%,
-demonstrating our model learns stronger joint video-text representation.</left>
+Performances under both in-domain and out-domain settings.</left>
 </p>
 <center>
-<p><img class="center" src="./resources/result.png" width="400px"></p>
+<p><img class="center" src="./resources/result.png" width="800px"></p>
 </center>
-<br>
+<!-- <br>
 <hr>
 <center>
 <h2> Acknowledgements </h2>
 </center>
 <p>
 Based on a template by <a href="http:https://web.mit.edu/phillipi/">Phillip Isola</a> and <a
-href="http:https://richzhang.github.io/">Richard Zhang</a>.
-</p>
-<br>
+href="http:https://richzhang.github.io/">Richard Zhang</a>. -->
+<!-- </p>
+<br> -->
 </body>

 </html>
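The framework caption quoted in the third hunk above describes a two-stage "Conquer" and "Divide" training scheme: encoders and a shared query network are first trained on all sources, then the encoders are frozen while a mixture-of-experts query network (MoE-QN) absorbs source-specific patterns. As a reading aid, here is a minimal PyTorch sketch of that flow. It is an assumption-laden illustration, not code from this repository or commit: ImageEncoder, TextEncoder, MoEQueryNetwork, multi_source_loader and source_contrastive_loss are hypothetical placeholders, and the losses, dimensions and hyper-parameters are guesses.

import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryNetwork(nn.Module):
    # Learnable disease queries attend to image tokens through a small transformer decoder.
    def __init__(self, dim=256, num_queries=14):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, img_tokens):  # img_tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        return self.head(self.decoder(q, img_tokens)).squeeze(-1)  # (B, num_queries) logits

def info_nce(a, b, tau=0.07):
    # Symmetric contrastive loss, used here as a stand-in for the image-text alignment objective.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    tgt = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, tgt) + F.cross_entropy(logits.t(), tgt))

# Stage 1 ("Conquer"): train both encoders and a shared query network on all sources,
# so that multi-source common patterns are captured in an aligned visual-textual space.
img_enc, txt_enc, shared_qn = ImageEncoder(), TextEncoder(), QueryNetwork()  # hypothetical encoders
opt = torch.optim.AdamW(
    [*img_enc.parameters(), *txt_enc.parameters(), *shared_qn.parameters()], lr=1e-4)
for images, reports, labels, _ in multi_source_loader:  # hypothetical loader; labels: (B, 14) floats
    v = img_enc(images)   # (B, N, dim) visual tokens
    t = txt_enc(reports)  # (B, dim) report embedding
    loss = F.binary_cross_entropy_with_logits(shared_qn(v), labels) + info_nce(v.mean(dim=1), t)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2 ("Divide"): freeze the modality encoders and train only the MoE query networks,
# which squeeze out source-specific patterns under a supervised loss plus a source
# contrastive term (both simplified here).
for p in [*img_enc.parameters(), *txt_enc.parameters()]:
    p.requires_grad_(False)
moe_qn = MoEQueryNetwork(num_experts=3)  # hypothetical mixture of QueryNetwork experts
opt2 = torch.optim.AdamW(moe_qn.parameters(), lr=1e-4)
for images, _, labels, source_id in multi_source_loader:
    with torch.no_grad():
        v = img_enc(images)
    loss = (F.binary_cross_entropy_with_logits(moe_qn(v, source_id), labels)
            + source_contrastive_loss(v.mean(dim=1), source_id))  # hypothetical contrastive term
    opt2.zero_grad()
    loss.backward()
    opt2.step()

The design point stressed by the caption is preserved in the sketch: the second stage updates only the MoE-QN, so whatever the frozen encoders learned about multi-source common patterns in the first stage remains intact.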
