Update README.md

xianshang33 · Jun 7, 2024 · cc4c6ac
1 parent a10ee89
commit cc4c6ac
Showing 16 changed files with 584 additions and 331 deletions.
diff --git a/CATEGORIES.md b/CATEGORIES.md
diff --git a/README.md b/README.md
diff --git a/README_en.md b/README_en.md
diff --git a/summary/2024-06/2406.01014.md b/summary/2024-06/2406.01014.md
@@ -0,0 +1,20 @@
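+#### Background
+- **Background**
+ The paper addresses mobile device operation tasks, an increasingly popular multimodal AI application scenario. Existing multimodal large language models (MLLMs) are limited by their training data and cannot function effectively as operation assistants. MLLM-based agents, whose capabilities are extended through tool invocation, are instead gradually being applied to this scenario.
+
+- **Existing Work**
+ Existing single-agent architectures struggle with the two main navigation challenges in mobile device operation tasks: task progress navigation and focus content navigation. Overly long token sequences and the interleaved text-image data format limit their performance.
+
+#### Core Contributions
+- **Proposes a multi-agent architecture, Mobile-Agent-v2**
+ - **Challenge 1: Task progress navigation**
+ The planning agent converts the lengthy, interleaved image-text history of operations and screen summaries into a pure-text task progress, which it passes to the decision agent. This compression shortens the context and makes it easier for the decision agent to navigate task progress.
+
+ - **Challenge 2: Focus content navigation and reflection**
+ The researchers designed a memory unit that is updated as the task progresses, plus a reflection agent. The decision agent updates the memory unit with focus content; the reflection agent evaluates whether the decision agent's operation met expectations and generates appropriate remedial measures when it did not.
+
+#### Implementation and Deployment
+Mobile-Agent-v2 was evaluated dynamically across a variety of operating systems, language environments, and applications. The experiments show an improvement of more than 30% in task completion rate over the single-agent architecture. The researchers also verified empirically that injecting manual operation knowledge further improves Mobile-Agent-v2's performance.
+
+#### Summary
+Mobile-Agent-v2 is a multi-agent architecture that effectively addresses the navigation challenges in mobile device operation tasks, in particular task progress navigation and focus content navigation. By introducing three specialized agent roles, it significantly raises the task completion rate compared with traditional single-agent architectures.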
diff --git a/summary/2024-06/2406.0166.md b/summary/2024-06/2406.0166.md
@@ -0,0 +1 @@
+No data [timeout]
diff --git a/summary/2024-06/2406.02061.md b/summary/2024-06/2406.02061.md
@@ -0,0 +1 @@
+No data [timeout]
diff --git a/summary/2024-06/2406.02148.md b/summary/2024-06/2406.02148.md
@@ -0,0 +1,20 @@
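+#### Background
+- **Background**
+ The paper discusses the challenges of cross-document event coreference resolution (CDECR), which clusters event mentions across multiple documents that refer to the same real-world event. Earlier approaches mainly fine-tune small language models such as BERT to judge the compatibility between event mentions, but because contexts are complex and diverse, these models tend to learn simple co-occurrence patterns rather than terms that are genuinely relevant to coreference.
+
+- **Existing Work**
+ Current work uses small language models (SLMs) to encode event mentions and obtains their embeddings through supervised coreference resolution. These methods have shortcomings: when contexts are diverse and complex, they tend to learn spurious features. Large language models (LLMs), by contrast, show excellent contextual understanding but are difficult to adapt to specific information extraction (IE) tasks.
+
+#### Core Contributions
+- **Proposes a collaborative approach**
+ - **Challenge 1: Contextual similarity of event mentions**
+ Because different events in different documents may be described in very similar ways, especially events of the same type, the model needs to extract coreference evidence from diverse contexts to make its judgments. The proposed method uses the LLM's inherent knowledge and contextual understanding to link each event mention with its corresponding context elements, which addresses the context-similarity challenge more fully and contributes substantially to the performance gains.
+
+ - **Challenge 2: Differing descriptions of the same event across documents**
+ The same event may be described very differently across documents. The paper designs a two-step workflow in which an LLM summarizes event mentions, guided by different generic prompts to understand the context of each mention, rather than by task-specific in-context learning or fine-tuning. Joint representation learning then integrates the original documents and the generated summaries into the SLM, which strengthens the SLM's understanding of event mentions during fine-tuning and lets it make coreference judgments from a more focused context.
+
+#### Implementation and Deployment
+The collaborative approach was evaluated on three CDECR datasets. Compared with methods that rely on an LLM or an SLM alone, it shows significant improvements, with the two model types complementing each other. Overall, it raises the CoNLL F1 score on ECB+, GVC, and FCC by 1%, 2.7%, and 7% respectively, reaching state-of-the-art performance.
+
+#### Summary
+The paper proposes a novel collaborative approach to cross-document event coreference resolution. Combining the general capabilities of LLMs with task-specific SLMs significantly improves model performance.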
diff --git a/summary/2024-06/2406.02543.md b/summary/2024-06/2406.02543.md
@@ -0,0 +1,20 @@
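+#### Background
+- **Background**
+ The paper studies uncertainty quantification in large language models (LLMs), with the goal of identifying when the uncertainty in an LLM's response to a query is large. It considers two kinds of uncertainty together: epistemic and aleatoric. The former stems from a lack of knowledge about the ground truth (such as facts or the language); the latter comes from irreducible randomness (such as multiple valid answers to the same query).
+
+- **Existing Work**
+ Existing methods mainly target questions with a single correct response. They aim to detect whether one response (or several responses with the same meaning) dominates, that is, whether the uncertainty in the prediction is small. When multiple correct responses exist (the ground truth itself has aleatoric uncertainty), estimating only the amount of uncertainty in the LLM's output is not enough: a perfect (ground-truth) predictor may have large aleatoric uncertainty and no epistemic uncertainty, whereas a completely useless predictor may have only large epistemic uncertainty.
+
+#### Core Contributions
+- **Proposes an information-theoretic metric**
+ - **Challenge 1: Separating epistemic from aleatoric uncertainty**
+ The paper uses an iterative prompting procedure to construct a joint distribution over multiple responses produced by the LLM, and uses it to quantify the gap between the LLM and the ground truth. This gap is insensitive to aleatoric uncertainty, so it quantifies epistemic uncertainty even when several responses are valid. In this way the procedure can detect when epistemic uncertainty is high.
+
+ - **Challenge 2: A computable lower bound on uncertainty**
+ The researchers derive a computable lower bound on the uncertainty metric and propose a finite-sample mutual information (MI) estimator. In some cases the estimator incurs only negligible error even though LLMs and the joint distributions derived from them may have infinite support (all possible strings in the language).
+
+#### Implementation and Deployment
+A series of experiments demonstrates the advantages of the new formulation. The experiments use closed-book open-domain question-answering benchmarks such as TriviaQA and AmbigQA, plus a dataset synthesized from WordNet. When the data consists mostly of single-label or mostly of multi-label queries, the MI-based hallucination detection method outperforms a naive baseline based on response likelihood and performs on par with a more advanced baseline based on output entropy.
+
+#### Summary
+The paper studies and proposes a new information-theoretic metric for quantifying uncertainty in large language models, aimed in particular at hallucinations in LLM-generated responses. The work offers new understanding of, and solutions for, identifying and handling hallucinations in LLMs.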
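The summary for 2406.02543 mentions a mutual information (MI) score computed over a joint distribution of multiple LLM responses. As a rough illustration only, the sketch below computes a plain plug-in MI estimate from sampled (first response, second response) pairs. It is not the paper's finite-sample estimator; the function name and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def plug_in_mutual_information(pairs):
    """Plug-in MI estimate (in nats) from (first_response, second_response) pairs.

    This is the textbook empirical estimator, used here as a stand-in;
    it is NOT the finite-sample estimator proposed in the paper.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one response pair")
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p(a, b) / (p(a) * p(b)), written with raw counts
        mi += p_ab * math.log(p_ab * n * n / (first[a] * second[b]))
    return mi


# Toy data (hypothetical): the second answer ignores the first one...
independent = [("Paris", "Paris"), ("Paris", "Lyon"),
               ("Lyon", "Paris"), ("Lyon", "Lyon")]
# ...versus the second answer simply copying the first one.
dependent = [("Paris", "Paris"), ("Lyon", "Lyon")] * 2

print(plug_in_mutual_information(independent))  # zero dependence
print(plug_in_mutual_information(dependent))    # log(2) nats
```

In this toy setting the score is zero when successive responses are independent and grows as they become dependent, which is the kind of signal an MI-based detector would threshold.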
diff --git a/summary/2024-06/2406.0373.md b/summary/2024-06/2406.0373.md
@@ -0,0 +1,20 @@
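+#### Background
+- **Background**
+ The paper discusses how large language models (LLMs) learn to handle new tasks through in-context learning (ICL), which uses a series of training instances as the prompt. Existing methods for selecting ICL instances, however, are often slow and expensive, which limits their practical use.
+
+- **Existing Work**
+ Earlier work reduces annotation cost by selecting a small subset of unlabeled instances to annotate, chosen for diversity and representativeness. These methods beat random selection but are clearly lacking in computational efficiency.
+
+#### Core Contributions
+- **Proposes Fast Graph-based Annotation Selection (FastGAS)**
+ - **Challenge 1: Improving the diversity and representativeness of the selected instances**
+ FastGAS uses a graph partitioning algorithm to split the data similarity graph into segments, each treated as a group of instances, which keeps the selection diverse. Within each segment it selects the instances with the highest node degree, which keeps the selection representative.
+
+ - **Challenge 2: Reducing the time needed for instance selection as far as possible**
+ FastGAS speeds up partitioning with a multi-level graph bisection algorithm, then selects instances within each segment using a simple but effective greedy algorithm. Compared with baselines that select iteratively over the whole graph, applying the greedy algorithm to each component separately greatly reduces computation time.
+
+#### Implementation and Deployment
+FastGAS was evaluated on datasets from several task categories, and the annotation subsets it selects outperform existing baselines. The experiments compare FastGAS with other selective annotation methods, such as Vote-k and IDEAL, and with other widely used methods such as Top-degree and PageRank. In most cases (13 of 14), FastGAS outperforms the two existing baselines. With an annotation budget of 18, all annotated examples fit within the language model's context limit, so no prompt retrieval is needed and the evaluation directly reflects the quality of the selected instances. At this budget FastGAS outperforms the baselines on most datasets, indicating that it selects higher-quality data. FastGAS also greatly reduces time cost relative to existing methods.
+
+#### Summary
+FastGAS improves the diversity and representativeness of selected ICL instances while significantly reducing the time and computation required. Experiments on multiple datasets confirm its effectiveness and efficiency and show its potential as an instance selection method.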
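The selection step described in the FastGAS summary (pick high-degree nodes inside each graph segment) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the segments are assumed to come from some external partitioner, the function name is hypothetical, and counting degree only inside the segment is an assumption the summary does not confirm.

```python
def select_representatives(adjacency, segments, per_segment=1):
    """Pick the highest-degree node(s) inside each segment of a similarity graph.

    adjacency: dict mapping node -> set of neighbouring nodes (undirected graph)
    segments:  list of node lists, assumed to come from a graph partitioner
    """
    selected = []
    for segment in segments:
        members = set(segment)
        # Assumption: degree is counted within the segment only.
        # Ties are broken by node name so the result is deterministic.
        by_degree = sorted(
            segment,
            key=lambda node: (-len(adjacency[node] & members), node),
        )
        selected.extend(by_degree[:per_segment])
    return selected


# Toy similarity graph (hypothetical): two star-like clusters.
adjacency = {
    "a": {"b", "c"}, "b": {"a"}, "c": {"a"},
    "d": {"e"}, "e": {"d", "f"}, "f": {"e"},
}
segments = [["a", "b", "c"], ["d", "e", "f"]]

print(select_representatives(adjacency, segments))  # ['a', 'e']
```

Because each segment is handled independently, the work per segment is a single sort, which matches the summary's point that per-component greedy selection avoids iterating over the whole graph.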
diff --git a/summary/2024-06/2406.04271.md b/summary/2024-06/2406.04271.md
@@ -0,0 +1,20 @@
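+#### Background
+- **Background**
+ The paper reviews how large language models (LLMs) such as GPT-4, PaLM, and LLaMA perform on a range of reasoning tasks, and notes that more effective prompting methods have been proposed to improve them further. Current prompting methods fall into single-query and multi-query reasoning, and both have limitations, such as a lack of universality and generalization and high computational cost.
+
+- **Existing Work**
+ Existing work relies on exemplars and reasoning structures designed for specific tasks. It overlooks general, high-level guidelines or thought templates that could be extracted from completed tasks and that would improve efficiency and accuracy on similar problems.
+
+#### Core Contributions
+- **Proposes Buffer of Thoughts (BoT), a new thought-augmented reasoning framework**
+ - **Challenge 1: Improving the accuracy, efficiency, and robustness of LLM reasoning**
+ BoT maintains a lightweight library, the meta-buffer, which stores general high-level thought templates distilled from many problem-solving processes and shareable across tasks. For each problem, it retrieves a relevant thought template and instantiates it with a specific reasoning structure to carry out thought-augmented reasoning, which removes the need to construct reasoning structures by hand.
+
+ - **Challenge 2: Keeping the framework scalable and stable**
+ BoT also introduces a buffer-manager that dynamically updates the meta-buffer, so its capacity grows as more tasks are solved, which keeps the framework scalable and stable.
+
+#### Implementation and Deployment
+In extensive experiments on 10 challenging reasoning-intensive tasks, BoT achieves significant gains over previous state-of-the-art methods: 11% on Game of 24, 20% on Geometric Shapes, and 51% on Checkmate-in-One. BoT also preserves reasoning efficiency while requiring on average only 12% of the cost of multi-query prompting methods. Notably, Llama3-8B combined with BoT shows the potential to surpass Llama3-70B.
+
+#### Summary
+By giving LLMs a meta-buffer of high-level thought templates, BoT improves reasoning accuracy, efficiency, and robustness, overcomes the limitations of existing methods, and achieves significant performance gains.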
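The retrieve, instantiate, and update cycle described for BoT's meta-buffer can be sketched roughly as below. Everything here is an illustrative assumption: the class and method names are hypothetical, token-overlap (Jaccard) similarity stands in for whatever retrieval the paper uses, and the novelty threshold is invented for the example.

```python
def jaccard(a, b):
    """Token-overlap similarity; a simple stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


class MetaBuffer:
    """Hypothetical store of (task description, thought template) pairs."""

    def __init__(self, novelty_threshold=0.5):
        self.templates = []
        self.novelty_threshold = novelty_threshold  # assumed value

    def retrieve(self, problem):
        """Return the template whose description best matches the problem."""
        if not self.templates:
            return None
        return max(self.templates, key=lambda item: jaccard(problem, item[0]))[1]

    def update(self, description, template):
        """Buffer-manager step: store a template only if it looks new."""
        is_new = all(
            jaccard(description, stored) < self.novelty_threshold
            for stored, _ in self.templates
        )
        if is_new:
            self.templates.append((description, template))
        return is_new


buffer = MetaBuffer()
buffer.update(
    "reach 24 using four numbers and arithmetic",
    "Enumerate operator orderings over {numbers}; keep expressions equal to {target}.",
)
buffer.update(
    "find a checkmate in one move on a chess board",
    "List legal moves for {side}; return the move that leaves no escape for the king.",
)

# Retrieve a template for a new problem and instantiate it.
template = buffer.retrieve("use four numbers and arithmetic to reach 24")
prompt = template.format(numbers="4 7 8 8", target=24)
print(prompt)
```

The instantiated prompt would then be sent to the LLM in a single query, which is where the summary's cost advantage over multi-query prompting would come from.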
diff --git a/summary_en/2024-06/2406.01014.md b/summary_en/2024-06/2406.01014.md
@@ -0,0 +1,20 @@
+#### Background
+- **Background**
+The paper discusses the popularity of mobile device operation tasks as a multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), restricted by their training data, lack the capability to effectively function as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are being applied to this scenario.
+
+- **Existing Work**
+Existing single-agent architectures do not effectively solve the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, because overly long token sequences and the interleaved text-image data format limit performance.
+
+#### Core Contributions
+- **Introduced a multi-agent architecture, Mobile-Agent-v2**
+ - **Challenge 1: Task Progress Navigation**
+ The planning agent condenses lengthy, interleaved image-text history operations and screen summaries into pure-text task progress, which is then passed to the decision agent. This compaction reduces context length, making it easier for the decision agent to navigate task progress.
+
+ - **Challenge 2: Focus Content Navigation and Reflection Ability**
+ A memory unit and a reflection agent were designed to address focus content navigation and reflection. The decision agent updates the memory unit with focus content, and the reflection agent assesses whether the decision agent's operation met expectations, providing appropriate remedial measures if it did not.
+
+#### Implementation and Deployment
+Mobile-Agent-v2 underwent dynamic evaluations across various operating systems, language environments, and applications. Results show more than a 30% improvement in task completion compared to the single-agent architecture. Additionally, manually injecting operation knowledge was empirically validated to further enhance performance.
+
+#### Summary
+Mobile-Agent-v2 is a multi-agent architecture designed to effectively tackle navigation challenges in mobile device operation tasks, particularly task progress and focus content navigation, significantly improving task completion rates over traditional single-agent architectures.
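The division of labour among the planning, decision, and reflection agents described above can be sketched as a simple control loop. This is only an illustration under stated assumptions: all function names and signatures are hypothetical, and each agent is a plain callable standing in for an MLLM call or a device API.

```python
def run_episode(instruction, observe, execute, planner, decider, reflector, max_steps=10):
    """Hypothetical control loop; every callable is a stand-in, not a real API."""
    progress = ""   # pure-text task progress from the planning agent
    memory = []     # focus content written by the decision agent
    history = []    # text log of past operations and their outcomes
    for _ in range(max_steps):
        before = observe()
        progress = planner(instruction, progress, history)
        action, focus = decider(instruction, progress, memory, before)
        if action == "STOP":
            break
        if focus:
            memory.append(focus)
        execute(action)
        # The reflection agent compares the screen before and after the action.
        ok, remedy = reflector(action, before, observe())
        history.append(action if ok else f"{action} (unexpected; remedy: {remedy})")
    return history, memory


# Toy stubs (hypothetical) so the loop can run without a device or a model.
state = {"screen": "home"}

def observe():
    return state["screen"]

def execute(action):
    if action == "tap Settings":
        state["screen"] = "settings"

def planner(instruction, progress, history):
    return f"{len(history)} step(s) completed"

def decider(instruction, progress, memory, screen):
    if screen == "settings":
        return "STOP", None
    return "tap Settings", "Settings icon is on the home screen"

def reflector(action, before, after):
    return before != after, "retry the tap"

history, memory = run_episode("Open Settings", observe, execute, planner, decider, reflector)
print(history, memory)
```

Note that the decision agent only ever sees the short text `progress` and `memory`, never the full interleaved history, which is the context-length reduction the summary attributes to the planning agent.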
diff --git a/summary_en/2024-06/2406.0166.md b/summary_en/2024-06/2406.0166.md
@@ -0,0 +1,20 @@
+#### Background
+- **Background**
+The existing online and offline Reinforcement Learning from Human Feedback (RLHF) methods like PPO and DPO have seen tremendous success in aligning AI with human preferences. However, they have a fundamental flaw as their optimal solutions are notably task-dependent and not robust to Out-of-Distribution (OOD) tasks.
+
+- **Existing Work**
+Existing methods are highly task-dependent, which makes adapting to OOD tasks difficult. A model that performs well on one task may underperform or fail on a new, unseen task, which limits its generalizability.
+
+#### Core Contributions
+- **Introduced Self-Improving Robust Preference Optimization (SRPO) framework**
+ - **Challenge 1: Task dependency and OOD robustness**
+ Traditional methods like PPO and DPO excel on the tasks they were trained on, but their performance cannot be guaranteed when tasks change. SRPO addresses this by recasting learning from human preferences as a process of self-improvement, which can be expressed mathematically as a min-max objective. This objective is task-independent, which makes the model more robust to task variations.
+
+ - **Challenge 2: Need for an optimization method without the necessity of reward models and online inference**
+ SRPO's objective can be expressed as a non-adversarial offline loss. This loss can be optimized with standard supervised optimization techniques, with no need for a reward model or online inference, which greatly simplifies training and deployment.
+
+#### Implementation and Deployment
+In the experiments, SRPO is compared against two offline preference learning methods, Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). It trains both the standard generative policy and the self-improvement policy simultaneously through a single optimization process. SRPO first generates completions with the 0-revision (0-rev.) model then improves these completions using the self-improvement model. For evaluation, SRPO was tested on the Reddit TL;DR Summarization dataset and XSum dataset, showing that SRPO 4-rev. produces high-quality summaries with the highest win rate in in-distribution testing. In the OOD setting, SRPO 5-rev. attained the highest win rate, outperforming the baseline methods with the same number of revisions.
+
+#### Summary
+SRPO alleviates the task dependency problem by demonstrating robustness to task variations within a theoretically grounded offline RLHF framework. It offers a simpler training and deployment process through the optimization of a non-adversarial offline loss. Experimental results indicate that SRPO outperforms existing methods across different environments, including OOD settings.
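The inference procedure described above (generate with the 0-rev. model, then refine with the self-improvement model N times) can be sketched as a short loop. This is a minimal sketch, assuming both models are exposed as plain callables; the function name and the toy stand-ins are hypothetical and say nothing about how SRPO is trained.

```python
def generate_with_revisions(prompt, policy, improver, num_revisions):
    """Return [0-rev output, 1-rev output, ..., N-rev output].

    policy(prompt) -> completion               (stand-in for the generative policy)
    improver(prompt, completion) -> completion (stand-in for the self-improvement policy)
    """
    completion = policy(prompt)  # 0-rev. completion
    trace = [completion]
    for _ in range(num_revisions):
        completion = improver(prompt, completion)
        trace.append(completion)
    return trace


# Toy stand-ins (hypothetical): each revision just appends a marker.
policy = lambda prompt: "draft"
improver = lambda prompt, completion: completion + "+"

trace = generate_with_revisions("Summarize the post.", policy, improver, num_revisions=3)
print(trace)  # ['draft', 'draft+', 'draft++', 'draft+++']
```

Keeping the full trace makes it easy to compare win rates at each revision count, as in the 4-rev. and 5-rev. results reported above.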