llm-paper-daily Daily Paper Selection


Each paper comes with related resources:

  • arXiv link
  • GitHub link
  • GPT-4-generated summary
  • Related blogs

Latest updates (07-25 20:48):
  • OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
  • CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
  • Knowledge Mechanisms in Large Language Models: A Survey and Perspective
  • Internal Consistency and Self-Feedback in Large Language Models: A Survey
  • ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
  • Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
  • RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

2024-07

Date | Paper | Links & Summary
07-23 RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Institution: Zhejiang University, Palo Alto Networks, University of North Texas
The RedAgent system effectively identifies and exploits the security vulnerabilities of large language models by simulating context-specific jailbreak strategies. It enhances the efficiency and automation of red teaming methods while providing a new perspective on understanding and strengthening the security of LLM applications.
arXiv
Summary
07-23 Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
arXiv
07-23 OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
Institution: UIUC, CMU, Yale
OpenDevin is a community-driven platform tailored for developing generalist and specialist AI agents that interact with the world through software, featuring a dynamic interaction mechanism, a sandboxed operating system and web browser environment, and a comprehensive evaluation framework.
arXiv
Summary
07-22 Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Institution: Zhejiang University, National University of Singapore, University of California, Los Angeles
The paper suggests that a deep understanding of knowledge mechanisms in LLMs is crucial for developing powerful and reliable AI. It introduces a new framework for evaluating such systems, focusing on the utilization and evolution of knowledge, offering a vision and tools for future research directions.
arXiv
Summary
07-19 ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
Institution: NVIDIA
The paper presents Llama3-ChatQA-2-70B, a model designed to bridge the gap between open-access LLMs and proprietary models; it handles contexts of up to 128K tokens and achieves performance comparable to GPT-4-Turbo on various benchmarks.
arXiv
Summary
07-19 Internal Consistency and Self-Feedback in Large Language Models: A Survey
Institution: Renmin University of China, Institute for Advanced Algorithms Research, Shanghai, Beijing Institute of Technology
This paper introduces the concepts of Internal Consistency and Self-Feedback to address consistency and hallucination issues in large language models, providing a new lens to understand and enhance these models and anticipates future research directions.
arXiv
Summary
GitHub
07-18 CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
Institution: Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen
This work introduces the Chain-of-Diagnosis (CoD), a diagnostic method meant to improve the interpretability of LLMs in disease diagnosis. It effectively generates training data through synthetic cases combined with disease encyclopedia data, resulting in the development of the DiagnosisGPT model. Experiments demonstrate that DiagnosisGPT performs better than other LLMs across numerous diagnostic datasets.
arXiv
Summary
07-16 NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
Institution: Shanghai AI Laboratory, Tsinghua University
The NeedleBench framework and the introduced ATC test offer novel methods to evaluate and enhance the retrieval and reasoning capabilities of LLMs when processing long text data. This is vital for real-world long-context tasks and also highlights the opportunities and challenges faced by current LLMs.
arXiv
Summary
GitHub
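Needle-in-a-haystack probes of the kind NeedleBench extends follow a simple recipe: hide a fact at a controlled depth inside long filler text and ask the model to retrieve it. A minimal generic sketch (not NeedleBench's exact protocol; the needle text and filler below are made up for illustration):

```python
def build_haystack(needle, filler_sentences, depth):
    """Insert a needle sentence at a relative depth (0.0-1.0) in filler text.

    Sweeping `depth` and the amount of filler lets you probe where in a
    long context a model starts losing retrieval accuracy.
    """
    pos = int(depth * len(filler_sentences))
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

# Toy example: hide a fact halfway through ten filler sentences.
filler = [f"Filler sentence {i}." for i in range(10)]
prompt = build_haystack("The secret number is 42.", filler, 0.5)
print("The secret number is 42." in prompt)  # True
```

A real harness would scale the filler to the target context length and score the model's answer against the hidden fact.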
07-16 LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data
Institution: Stanford University, UC Berkeley
The paper presents the LOTUS system, which enables queries based on natural language through the definition of semantic operators, implementing fast and accurate query execution through efficient algorithms and optimizations. LOTUS demonstrates its wide applicability and high performance in multiple real-world application cases, signifying its importance in advancing LM-based large-scale semantic analysis and query systems.
arXiv
Summary
GitHub
07-16 Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild
Institution: University of Washington, Allen Institute for AI, McGill University
This research highlights the issue of personal information leakage in interactions with chatbots. It presents the types of sensitive information shared in these interactions and calls for measures in chatbot design to protect user privacy and maintain appropriate transparency of the content exchanged.
arXiv
Summary
07-15 Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval
arXiv
07-15 Qwen2 Technical Report
Institution: Alibaba Group
The Qwen2 series models, as the latest large language models, exhibit excellent performance in multi-task environments such as language understanding, generation, multilingual capabilities, coding, mathematics, and reasoning. The models have also made their weights and resources publicly available in the open-source community, fostering innovation and accessibility. Compared to existing models, Qwen2 shows competitive performance in several benchmarks, especially in terms of multilingualism, showing a wide applicability and global reach.
arXiv
Summary
07-14 Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
Institution: Institute of Artificial Intelligence, Soochow University, China
The paper introduces a novel machine unlearning framework, NAUF, and the accompanying real-world personal data unlearning dataset, RETURN, to evaluate and improve LLMs' performance in privacy protection.
arXiv
Summary
07-12 Human-like Episodic Memory for Infinite Context LLMs
Institution: Huawei Noah’s Ark Lab, University College London
The paper proposes an innovative structure, EM-LLM, by integrating human episodic memory and event cognition into large language models, enabling them to manage practically infinite context lengths while remaining computationally efficient. This research enhances LLMs' capabilities to process expansive contexts and contributes to understanding human memory mechanisms.
arXiv
Summary
07-10 Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
Institution: University of Science and Technology of China, Alibaba Stripe, Zhejiang University
The researchers have developed the Dr. DPO framework, which enhances the robustness of DPO with just an extra line of code. Empirical evaluations show that Dr. DPO significantly improves performance in a wide range of settings, both with and without noise.
arXiv
Summary
GitHub
07-10 Toto: Time Series Optimized Transformer for Observability
Institution: Datadog
The Toto model, developed by Datadog, is a foundation model for time series prediction, specially designed to handle observability data. Its groundbreaking attention mechanism and pre-training strategy significantly improve the performance and efficiency in tackling observability data.
arXiv
Summary
07-09 Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
The paper presents a flexible and scalable platform for multi-agent collaboration, the Internet of Agents (IoA), which overcomes the limitations of existing frameworks and demonstrates superior performance across multiple tasks and application scenarios. Furthermore, the release of the codebase facilitates further development in autonomous agent systems.
arXiv
Summary
GitHub
07-05 AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
Institution: AIRI, Moscow, Russia, Skoltech, Moscow, Russia
AriGraph is an innovative memory architecture that constructs a knowledge graph world model integrating semantic and episodic memories, enhancing the exploratory and planning capabilities of LLM agents. Experiments in the TextWorld environment have proven it to be more effective in handling complex tasks compared to other existing methods.
arXiv
Summary
07-02 RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
Institution: Georgia Tech, NVIDIA
RankRAG is a novel framework that instruction-tunes LLMs to enhance their context ranking and answer generation capabilities within the RAG framework, delivering improved generative performance on multiple benchmarks and demonstrating robust generalization capabilities.
arXiv
Summary
07-02 Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Institution: DeepSeek AI, Northwestern University
The paper proposed ESFT, an efficient fine-tuning method for sparse-architecture LLMs that fine-tunes only the experts most relevant to downstream tasks, maintaining expert specialization and significantly saving computational resources.
arXiv
Summary
GitHub
07-01 We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Institution: Beijing University of Posts and Telecommunications, Tencent Inc., Huazhong University of Science and Technology
This paper introduces WE-MATH, a visual mathematical reasoning benchmark designed to go beyond traditional end-to-end performance assessment and probe how LMMs acquire and generalize knowledge during problem solving. Using a new multi-dimensional evaluation method, it exposes challenges in the inherent reasoning processes of multimodal models and validates the effectiveness of knowledge augmentation strategies, advancing LMMs in visual mathematical reasoning.
arXiv
Summary
GitHub
07-01 AI Agents That Matter
Institution: Princeton University
This paper critiques the current benchmark evaluation methods for AI agents and proposes a series of improvements, aiming to develop intelligent agents that have real-world application value, not just agents that score high on benchmark tests.
arXiv
Summary
07-01 Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
This paper introduces a new evaluation method for large language models and RAG systems in handling long texts through the SummHay task. It presents an original approach with synthesized data generation and an automatic evaluation system, showing that current systems struggle with it and outlining a direction for future improvements.
arXiv
Summary

2024-06

Date | Paper | Links & Summary
06-30 Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
Institution: Multimedia Laboratory (MMLab), The Chinese University of Hong Kong
The paper presents a new method of mathematical reasoning optimization - SCDPO, which significantly enhances the performance of LLMs in solving mathematical problems by generating training samples that supervise errors at specific steps, demonstrating the potential of this method.
arXiv
Summary
06-29 LiteSearch: Efficacious Tree Search for LLM
Institution: Xiamen University, Tencent AI Lab
The paper contributes by introducing a more efficient tree search algorithm that reduces resource consumption in aiding LLMs to tackle complex mathematical reasoning tasks, while ensuring high performance levels.
arXiv
Summary
06-28 Scaling Synthetic Data Creation with 1,000,000,000 Personas
Institution: Tencent AI Lab Seattle
This paper presents the "Persona Hub," a synthetic data platform focusing on the diversity and richness of the generated data, with a significant concern for the safe and responsible use of synthetic data. Through several use cases, it illustrates the method's advantages in diversity, scalability, flexibility, and ease of use.
arXiv
Summary
06-27 From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
Institution: University of Wisconsin-Madison
The study presents a method to improve LLMs' retrieval and reasoning capabilities in long-context tasks by fine-tuning on synthetic datasets. The approach significantly enhances performance on such tasks without considerably impacting the model's overall abilities, while also reducing hallucinations.
arXiv
Summary
06-27 SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
The paper proposes the SEAKR model, a novel adaptive Retrieval-Augmented Generation model that uses the self-awareness of LLMs’ internal states to dynamically determine when to retrieve and effectively integrate knowledge, thereby enhancing performance in QA tasks.
arXiv
Summary
GitHub
06-26 Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Institution: The Chinese University of Hong Kong, Harbin Institute of Technology (Shenzhen), SmartMore
The paper introduces a new optimization method, Step-DPO, enhancing LLMs' accuracy and robustness in long-chain mathematical reasoning by optimizing individual reasoning steps rather than evaluating answers holistically.
arXiv
Summary
GitHub
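Step-DPO localizes preference optimization to individual reasoning steps rather than whole answers; the underlying DPO objective on a single preference pair can be sketched as a toy scalar version (assuming log-probabilities already summed over tokens; this illustrates the standard DPO loss, not Step-DPO's full pipeline):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on one preference pair: -log(sigmoid(beta * margin)).

    logp_w / logp_l: policy log-probs of the preferred and dispreferred
    continuations; ref_logp_*: the same under the frozen reference model.
    Step-DPO applies this objective to a single reasoning step, with the
    preferred step correct and the dispreferred step erroneous.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))

# With no margin the loss is log(2); rewarding the preferred step lowers it.
print(round(dpo_loss(-1.0, -1.0, -1.0, -1.0), 4))  # 0.6931
```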
06-25 The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Institution: Hugging Face
This paper introduces the FineWeb datasets, highlighting the importance of carefully curating an effective Common Crawl-based pretraining dataset, and demonstrates their contribution to enhancing the performance of large language models.
arXiv
Summary
06-24 WARP: On the Benefits of Weight Averaged Rewarded Policies
Institution: Google DeepMind
The article proposes WARP, a new strategy for LLM alignment, which merges models through weight averaging to address challenges in the RLHF process, thus improving the trade-off between KL and rewards. Experimental evidence suggests that WARP enhances model performance and alignment with human values.
arXiv
Summary
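The core primitive behind weight-averaging strategies like WARP is a parameter-wise average of fine-tuned checkpoints. A minimal sketch of that primitive only (plain Python lists stand in for tensors; the full WARP procedure layers anchoring and interpolation stages on top of this, per the paper):

```python
def average_weights(state_dicts, coeffs=None):
    """Parameter-wise weighted average of model checkpoints.

    state_dicts: list of {param_name: list-of-floats} checkpoints that
    share the same architecture. coeffs: mixing weights (default uniform).
    """
    n = len(state_dicts)
    coeffs = coeffs or [1.0 / n] * n
    merged = {}
    for name in state_dicts[0]:
        merged[name] = [
            sum(c * sd[name][i] for c, sd in zip(coeffs, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return merged

# Two toy "policies" fine-tuned from the same init:
a = {"w": [1.0, 2.0]}
b = {"w": [3.0, 4.0]}
print(average_weights([a, b]))  # {'w': [2.0, 3.0]}
```

With real models the same loop runs over framework tensors (e.g. a PyTorch `state_dict`) instead of lists.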
06-22 Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Institution: OATML, Department of Computer Science, University of Oxford
The paper proposes SEPs as a cost-effective and reliable method for detecting hallucinations, capable of capturing semantic uncertainty directly from the hidden states of LLMs with a single model generation.
arXiv
Summary
06-21 LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
Institution: University of Waterloo
LongRAG is a novel framework for open-domain question-answering tasks that addresses the limitations of traditional RAG by using fewer, larger retrieval units and leveraging the capabilities of long-context LLMs. This yields notable performance improvements through a lighter retrieval workload and a more effective retriever, with long-context LLMs handling zero-shot answer extraction.
arXiv
Summary
06-19 Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
This paper explores the potential of long-context language models to replace existing paradigms and tackle novel tasks through the introduction of the LOFT benchmark. It finds that LCLMs can match the performance of existing retrieval and RAG systems in some tasks, despite not being explicitly trained, and highlights areas where further research is needed to improve performance.
arXiv
Summary
06-18 Nash CoT: Multi-Path Inference with Preference Equilibrium
Institution: Westlake University, University of Cambridge
The study proposes a novel Nash CoT approach, which effectively utilizes the concept of Preference Equilibrium to maintain performance while substantially lowering the deployment costs for LLMs by reducing the number of inference paths necessary.
arXiv
Summary
06-18 Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
This study provides useful insights for future use of LLMs as judges by evaluating the alignment and vulnerabilities of LLMs acting as judges. Key findings include that only some top models are fit to act as judges, and Cohen's Kappa is a better metric of alignment, outperforming percent agreement in distinguishing judges.
arXiv
Summary
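The claim that Cohen's kappa beats percent agreement is easy to see with a degenerate judge: a model that always outputs the majority label scores high raw agreement but zero chance-corrected agreement. A small self-contained illustration (the labels below are invented, not from the paper's data):

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    n = len(a)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# A judge that always says "good" agrees 80% of the time with a reference
# that says "good" 80% of the time, yet its kappa is 0:
ref   = ["good"] * 8 + ["bad"] * 2
judge = ["good"] * 10
print(percent_agreement(ref, judge))  # 0.8
print(cohens_kappa(ref, judge))       # 0.0
```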
06-17 A Survey of AIOps for Failure Management in the Era of Large Language Models
Institution: Peking University, Tsinghua University, The Hong Kong University of Science and Technology (Guangzhou), University of Illinois Chicago
This paper is a comprehensive survey of AIOps technology for failure management in the era of LLMs. It discusses the potential of LLMs to address the challenges faced by existing AIOps methods and outlines the future directions of research.
arXiv
Summary
06-13 Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Institution: Google Research, Google DeepMind, Google
This paper introduces a novel benchmark, ToT, which comprehensively evaluates LLMs' temporal reasoning abilities in various scenarios using synthetic datasets and crowdsourced tasks, and exposes the advantages and shortcomings of these models in temporal reasoning.
arXiv
Summary
06-12 Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Institution: University of Washington, Allen Institute for AI
This paper presents MAGPIE, a novel self-synthesis method for generating large-scale high-quality alignment data without relying on human intervention or prompt engineering, demonstrating the potential of LLMs in automatic data generation and alignment. Experimentation shows that models fine-tuned with MAGPIE excel across various benchmarks, exhibiting the latent capabilities of LLMs in data generation and model alignment.
arXiv
Summary
06-12 Designing a Dashboard for Transparency and Control of Conversational AI
Institution: Harvard University, Google Research
The paper aims to increase the transparency of LLMs within conversational AI systems by designing a visualized user interface: a dashboard that accompanies the chatbot interface. Users can see the system's internal user model in real time and modify it via the interface. Based on user feedback, the dashboard also helps unveil and counteract model biases.
arXiv
Summary
06-12 TasTe: Teaching Large Language Models to Translate through Self-Reflection
Institution: Harbin Institute of Technology, Tencent Inc
The TASTE framework proposed in this paper elevates LLMs' capability in machine translation through a self-reflection process, representing a novel way to harness the translation potential of LLMs. It sets a new benchmark for understanding and utilizing the complex reasoning and language modeling capabilities of LLMs.
arXiv
Summary
GitHub
06-11 Delving into ChatGPT usage in academic writing through excess vocabulary
Institution: Hertie Institute for AI in Brain Health, University of Tübingen, Germany, Tübingen AI Center, Northwestern University
The paper proposes a new, unbiased, large-scale approach to study LLM usage in academic texts and offers an unprecedented quantifiable comparison of the changes in scientific writing induced by LLMs.
arXiv
Summary
06-11 Needle In A Multimodal Haystack
Institution: OpenGVLab, Shanghai AI Laboratory, Fudan University
The presented MM-NIAH benchmark is a novel evaluation platform for advancing MLLM performance in comprehending long multimodal documents. By exposing limitations and challenges of current MLLMs, the paper provides an instrumental platform for further research in long multimodal document comprehension.
arXiv
Summary
GitHub
06-10 Transforming Wearable Data into Health Insights using Large Language Model Agents
Institution: Google LLC
This paper introduces the Personal Health Insights Agent (PHIA), a large language model agent system that transforms wearable device data into personal health insights. Combining code generation and information retrieval tools, PHIA effectively addresses the challenge of deriving personalized health guidance from vast health datasets. Extensive human and automated evaluations demonstrate the accuracy and potential of this approach for addressing real health concerns.
arXiv
Summary
06-10 Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Institution: University of Washington, MetaAI, Allen Institute for AI
HUSKY emerges as the first unified, open-source language agent for multi-step reasoning that resolves the issues of high costs and difficulties in scaling while demonstrating superior performance in multi-task environments, showcasing the potential of open-source language agents.
arXiv
Summary
GitHub
06-10 Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching
Institution: The Chinese University of Hong Kong, Tencent AI Lab, Centre for Perceptual and Interactive Intelligence
The paper introduces SELF-TUNING, a framework aimed at improving LLMs' knowledge acquisition capability via self-teaching and validates its effectiveness on crucial knowledge acquisition tasks using the Wiki-Newpages-QA datasets.
arXiv
Summary
06-10 Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies
Institution: Duke University, AWS AI Labs
The study presents a framework for LLM reasoning strategies evaluation that considers compute budget and demonstrates the ability of simple strategies to outperform complex ones with equal computational resources. By highlighting the importance of self-evaluation, it sets the groundwork for more efficient budget use and the development of more effective reasoning strategies.
arXiv
Summary
06-09 Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses
Institution: University of Washington, University of Washington - Bothell
The study underscores the deficiencies in LLMs' social reasoning and the potential improvement by integrating human intentions and emotions. The findings highlight the need for LLMs to comprehend human-like mental states for effective social reasoning in open-ended questions, pointing out a key direction for future advancement.
arXiv
Summary
06-07 Mixture-of-Agents Enhances Large Language Model Capabilities
Institution: Duke University, Together AI, University of Chicago
This paper showcases the Mixture-of-Agents (MoA) methodology for enhancing the capabilities of LLMs in understanding and generating natural language by leveraging the group expertise of multiple models. Through experimentation, the method has been validated to significantly improve performance, achieving state-of-the-art results on multiple competitive benchmarks.
arXiv
Summary
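The MoA design described above is layered: proposer models draft answers, later layers see earlier answers, and an aggregator synthesizes a final response. A schematic sketch with plain callables standing in for LLM calls (the real system wires these to different model APIs; prompt wording here is an assumption):

```python
def mixture_of_agents(prompt, proposers, aggregator, layers=2):
    """Layered MoA sketch.

    proposers / aggregator: callables mapping a prompt string to an answer
    string. Each layer after the first lets proposers refine their answers
    with the previous layer's outputs in context; the aggregator then
    synthesizes the final response from all candidates.
    """
    responses = [p(prompt) for p in proposers]
    for _ in range(layers - 1):
        augmented = prompt + "\n\nPrevious answers:\n" + "\n".join(responses)
        responses = [p(augmented) for p in proposers]
    final_prompt = prompt + "\n\nCandidate answers:\n" + "\n".join(responses)
    return aggregator(final_prompt)
```

Usage with toy stand-ins: `mixture_of_agents("Q?", [lambda p: "A1", lambda p: "A2"], lambda p: "merged")` runs two proposer layers and returns the aggregator's synthesis.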
06-07 WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
WILDBENCH provides an evaluation framework that incorporates real user task challenges, automated indicators, and interpretive checklists, enabling more accurate assessments of Large Language Models' performance in complex tasks.
arXiv
Summary
06-06 FastGAS: Fast Graph-based Annotation Selection for In-Context Learning
Institution: Department of ECE, University of Virginia
The FastGAS approach proposed in the paper significantly improves the diversity and representativeness of selected instances for ICL while also considerably reducing the time and computational resources required. The experimental results verify its efficiency and efficacy on multiple datasets, demonstrating its potential as an effective instance selection method.
arXiv
Summary
06-06 Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
Institution: Peking University, UC Berkeley, Stanford University
BoT enhances the accuracy, efficiency, and robustness of reasoning in LLMs by providing a meta-buffer to store high-level thought templates. It overcomes the limitations of existing methods and demonstrates significant performance gains.
arXiv
Summary
GitHub
06-06 The Prompt Report: A Systematic Survey of Prompting Techniques
This paper offers a comprehensive survey of prompting techniques, systematically analyzing the concept, types, and applications of prompts and making an extensive meta-analysis of the literature.
arXiv
Summary
06-04 Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models
Institution: Zhejiang University, School of Engineering (Westlake University), Shanghai AI Laboratory
The paper introduces a novel collaborative approach for addressing the task of cross-document event coreference resolution. By combining the universal capabilities of LLMs with task-specific SLMs, the performance of the model was significantly enhanced.
arXiv
Summary
06-04 To Believe or Not to Believe Your LLM
Institution: Google DeepMind
This paper focuses on the study and introduction of a novel information-theoretical metric to quantify uncertainty in large language models, specifically for the phenomenon of hallucinations during response generation. This research offers new insights and solutions for identifying and addressing hallucinations in LLMs.
arXiv
Summary
06-03 Self-Improving Robust Preference Optimization
Institution: Cohere
SRPO successfully alleviates the task dependency problem by demonstrating robustness to task variations within a theoretically grounded offline RLHF framework. It offers a simpler training and deployment process through the optimization of a non-adversarial offline loss. Experimental results indicate that SRPO outperforms existing methods across different environments, including OOD settings.
arXiv
Summary
06-03 Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Institution: Beijing Jiaotong University, Alibaba Group
Mobile-Agent-v2 is a multi-agent architecture designed to effectively tackle navigation challenges in mobile device operation tasks, particularly task progress and focus content navigation, significantly improving task completion rates over traditional single-agent architectures.
arXiv
Summary
GitHub

2024-05

Date | Paper | Links & Summary
05-31 Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Institution: Princeton University, Carnegie Mellon University
The paper presents the novel State Space Duality (SSD) framework, linking structured state space models (SSMs) with variants of attention mechanisms. Key contributions include applying optimizations originally developed for Transformers to SSMs and a new SSD algorithm that significantly improves the efficiency of model training and inference. The resulting Mamba-2 architecture demonstrates strong performance, paving the way for future deep learning model design and optimization.
arXiv
Summary
05-31 Preemptive Answer "Attacks" on Chain-of-Thought Reasoning
Institution: Tsinghua University
The paper investigates the negative impact of preemptive answers on the reasoning capabilities of LLMs and proposes strategies for mitigation. The experimental results indicate that these strategies cannot completely neutralize the impact, pointing to a need for further enhancement of CoT robustness.
arXiv
Summary
05-30 Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
Institution: Ant Group
METRAG offers a novel framework for retrieval-augmented generation that addresses the limitations of current models by incorporating utility and compactness-oriented thinking, and it exhibits enhanced performance in knowledge-intensive tasks.
arXiv
Summary
05-30 Jina CLIP: Your CLIP Model Is Also Your Text Retriever
arXiv
05-29 LLMs achieve adult human performance on higher-order theory of mind tasks
Institution: Google Research, Google DeepMind, Johns Hopkins University Applied Physics Lab
The study showcases the performance of LLMs on higher-order Theory of Mind (ToM) tasks, specifically demonstrating that models like GPT-4 can achieve adult-level performance on some tasks. The introduction of a new benchmark based on an adult human benchmark helps to reveal and understand the potential and limitations of LLMs in complex social interactions.
arXiv
Summary
05-29 MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
arXiv
05-28 RealitySummary: On-Demand Mixed Reality Document Enhancement using Large Language Models
Institution: University of Calgary
This paper introduces the RealitySummary system, which combines large language models with mixed reality technology to provide an on-demand reading assistant. It highlights the potential for practical application of this technology and establishes directions for future research.
arXiv
Summary
05-23 HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models
Institution: The Ohio State University, Stanford University
HippoRAG is a novel, neurobiologically inspired retrieval framework addressing the limitations of conventional LLMs in long-term memory and knowledge integration. By simulating the structure and mechanisms of the human brain, HippoRAG has significantly enhanced LLMs' capability to handle complex tasks involving knowledge integration, outperforming existing methods in both efficiency and effectiveness.
arXiv
Summary
GitHub
05-23 Agent Planning with World Knowledge Model
Institution: Zhejiang University, Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph, National University of Singapore, Alibaba Group
This paper introduces a parametric World Knowledge Model (WKM) to enhance the performance of Large Language Models (LLMs) executing interactive planning tasks. The model utilizes knowledge from expert and exploratory trajectories and has been validated through comparisons with various strong baselines in simulated environments, addressing the issues of hallucinatory action generation and aimless trial-and-error.
arXiv
Summary
GitHub
05-23 Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration
Institution: Tsinghua University, Northwestern Polytechnical University, Shanghai AI Laboratory
This paper proposes the ReAd framework to address the effective planning for LLMs in multi-agent collaborative tasks, proving its capability to reduce interaction rounds and enhance success rates, thus laying the groundwork for the application of LLMs in multi-agent systems.
arXiv
Summary
05-23 RaFe: Ranking Feedback Improves Query Rewriting for RAG
Institution: Zhejiang University, Alibaba Group, Nanjing University
RaFe presents a novel framework for query rewriting using reranker feedback, requiring no annotations, supporting offline and online feedback training, and showcasing adaptability and effectiveness.
arXiv
Summary
05-23 RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models
Institution: Amazon AWS AI, Shanghai AI Lab, Shanghai Jiaotong University
REFCHECKER is a framework that detects fine-grained hallucinations in LLMs and benchmarks them. It detects and verifies factual inconsistencies in responses with high precision and strong alignment with human judgments.
arXiv
Summary
GitHub
05-23 PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services
Institution: Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences
This paper introduces the PerLLM framework that leverages edge-cloud collaboration to handle a large volume of inference services, significantly enhancing service scheduling and resource allocation, thereby increasing throughput and reducing energy costs, showcasing its substantial applicative value.
arXiv
Summary
05-23 AGILE: A Novel Framework of LLM Agents
Institution: ByteDance Research, University of Science and Technology of China, Shanghai Jiao Tong University
The paper proposed a new framework for LLM agents known as AGILE, which streamlines different components and leverages reinforcement learning to achieve end-to-end training. The framework showcases superior performance in complex QA tasks, underscoring the efficacy of component integration and end-to-end optimization. The release of the dataset and code encourages further research in this area.
arXiv
Summary
05-21 G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translation
Institution: ByteDance Research
The paper presents the G-DIG method, a gradient-based approach for selecting high-quality and diverse instruction finetuning data for machine translation, validated by its effectiveness and generalizability through experimental verification.
arXiv
Summary
05-21 SmartFlow: Robotic Process Automation using LLMs
Institution: TCS Research
SmartFlow is an AI-based RPA system that integrates deep learning vision understanding with LLMs to autonomously generate navigation workflows and execute user-assigned tasks, demonstrating efficiency in adapting to GUI changes and handling complex tasks.
arXiv
Summary
05-20 OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Institution: OpenLLMAI Team, ByteDance Inc., Netease Fuxi AI Lab
OpenRLHF is an open-source framework that enables full-scale RLHF training on models with over 70 billion parameters. It employs distributed computing with Ray and efficiency optimization with vLLM, while also implementing multiple alignment algorithms, offering a plug-and-play experience with seamless integration with the Hugging Face library.
arXiv
Summary
GitHub
05-20 Octo: An Open-Source Generalist Robot Policy
Institution: UC Berkeley, Stanford
The paper introduces Octo, a transformer-based policy that provides an open-source solution to a variety of robotic tasks, capable of adapting to new observations and action spaces through finetuning. It demonstrates superior performance on multiple robot platforms and encourages broad application and further development through its fully open source code.
arXiv
Summary
05-20 xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
Institution: Institute for Advanced Algorithms Research, Shanghai, Renmin University of China
The focus of the paper is the introduction of a method called xFinder, which aims to improve the accuracy of extracting key answers from LLM outputs. It addresses gaps not met by existing methods and provides a more reliable approach for evaluating LLMs.
arXiv
Summary
GitHub
05-20 Multiple-Choice Questions are Efficient and Robust LLM Evaluators
Institution: Shanghai Jiao Tong University
The study successfully converted conventional open-ended generation problems into a multiple-choice format, significantly improving the efficiency and accuracy of LLM evaluations. This method has made strides in preventing the impact of invalid answers and enhancing evaluation efficiency.
arXiv
Summary
GitHub
05-19 Your Transformer is Secretly Linear
Institution: AIRI, Skoltech, SberAI
This study shows a surprisingly high degree of linearity in the transformations between successive transformer layers, challenging the conventional view of transformers as strongly non-linear, and finds that models can be modified for efficiency without sacrificing performance.
arXiv
Summary
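The paper's claim can be probed with a simple least-squares fit; below is a toy illustration on synthetic, nearly-linear data (the dimensions, sample count, and noise level are our own assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearity probe (illustrative): fit the best linear map from
# "layer k" states to "layer k+1" states and report the R^2 of the fit.
d, n = 32, 500
X = rng.normal(size=(n, d))                    # states entering a layer
W = rng.normal(size=(d, d)) / np.sqrt(d)
Y = X @ W + 0.01 * rng.normal(size=(n, d))     # nearly linear next-layer states

A, *_ = np.linalg.lstsq(X, Y, rcond=None)      # best linear approximation
resid = Y - X @ A
r2 = 1.0 - (resid ** 2).sum() / ((Y - Y.mean(0)) ** 2).sum()
print(round(r2, 4))  # close to 1.0 => the layer-to-layer map is almost linear
```

Running the same probe on real hidden states from consecutive layers is what the paper's linearity analysis amounts to at its core.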
05-17 Prompt Exploration with Prompt Regression
Institution: Carnegie Mellon University, Massachusetts Institute of Technology, University of Michigan
This paper introduces a novel framework, PEPR, for predicting the impact of prompt element combinations in LLMs and selecting effective prompts for specific tasks. The framework not only brings an innovative solution but also demonstrates its effectiveness through evaluations on multiple datasets and tasks.
arXiv
Summary
05-16 Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models
Institution: Nanyang Technological University, University of Science and Technology of China, University of Aberdeen
The paper successfully proposes and validates a new multimodal LLM incorporating ASR error correction paradigm, addressing issues of source speech disregard and input redundancy, and showing significant improvements in practical applications.
arXiv
Summary
05-16 SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
Institution: Amazon, The University of Texas at Austin
SYNTHESIZRR addresses the issue of insufficient diversity and stylistic deviation from human text in past synthetic data approaches by using retrieval augmentation. It improves upon the generation of synthetic examples with greater variety and a closer resemblance to human writing, enhancing the performance of distilled models.
arXiv
Summary
05-16 MarkLLM: An Open-Source Toolkit for LLM Watermarking
Institution: Tsinghua University, Shanghai Jiao Tong University, The University of Sydney
MARKLLM provides a versatile and accessible platform for researchers and the public to experiment with and understand LLM watermarking, driving further developments in research and application.
arXiv
Summary
GitHub
05-16 Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models
Institution: BITS Pilani, MDSR Labs, Adobe, IIT Guwahati, National University of Singapore
The research developed and evaluated an iterative debiasing framework aimed at end-users, offering a non-training-based approach to mitigating biases in LLMs. This method employs complex prompting strategies that significantly decrease the mean bias in outputs without compromising downstream task performance, paving the way for future research into prompt-based debiasing methods for LLMs.
arXiv
Summary
05-15 LoRA Learns Less and Forgets Less
Institution: Columbia University, Databricks
Although LoRA often does not match the learning efficiency and accuracy of full parameter finetuning on target tasks, it exhibits better performance and stronger regularization capabilities in maintaining source task performance. Based on the study, recommendations are made for best practices when finetuning with LoRA, particularly noting the sensitivity to learning rates, choice of target modules, and rank of perturbations.
arXiv
Summary
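For context, the low-rank update that LoRA trains can be sketched in a few lines (toy sizes; the `alpha`, `r`, and zero-initialized `B` follow the standard LoRA formulation, not this paper's experiments):

```python
import numpy as np

# Minimal LoRA sketch (illustrative): the frozen weight W is augmented
# with a trainable low-rank update B @ A, scaled by alpha / r.
rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 16, 16, 4, 8
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x); with B = 0 this equals the base model.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # zero-init => identical output
```

Only `A` and `B` receive gradients during finetuning, which is why the paper's observations about learning rate and rank sensitivity matter so much in practice.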
05-15 ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
Institution: Microsoft Research Asia, Harvard University, Peking University
The ALPINE project explored how autoregressive learning in Transformers facilitates network planning capabilities and revealed the competencies and limitations of Transformers in executing path-finding tasks, offering new insights into the general planning capabilities of large language models in related domains.
arXiv
Summary
05-14 Is the Pope Catholic? Yes, the Pope is Catholic. Generative Evaluation of Intent Resolution in LLMs
Institution: Carnegie Mellon University, Allen Institute for AI
This study introduces a novel generative evaluation framework exploring the potential and challenges of LLMs in understanding and generating intent-aligned responses, revealing significant shortcomings in pragmatic understanding and pointing out directions for future improvements.
arXiv
Summary
05-13 RLHF Workflow: From Reward Modeling to Online RLHF
Institution: Salesforce AI Research, University of Illinois Urbana-Champaign
The paper presents a comprehensive workflow for online iterative RLHF, which is innovative theoretically and offers a practical application framework through its detailed implementation guide.
arXiv
Summary
GitHub
05-13 DoLLM: How Large Language Models Understanding Network Flow Data to Detect Carpet Bombing DDoS
arXiv
05-10 Automatic Generation of Model and Data Cards: A Step Towards Responsible AI
Institution: CMU, MPI, ETH Zürich
The paper effectively develops a method to automate the generation of ML model cards and data cards using large LLMs, significantly enhancing the quality and standardization of the documentation through the creation of a corresponding dataset and evaluation mechanisms.
arXiv
Summary
05-10 Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
Institution: Imperial College London, Huawei
This work effectively reduces hallucinations in large language models through a novel Self-Refinement Enhanced Knowledge Graph Retrieval method, particularly enhancing practical application efficacy in the medical field.
arXiv
Summary
05-10 UniDM: A Unified Framework for Data Manipulation with Large Language Models
Institution: Alibaba Group, University of Science and Technology of China
UniDM is an innovative unified framework for data manipulation that significantly enhances the efficiency and quality of processing diverse data tasks through effective prompt design and task decomposition.
arXiv
Summary
05-10 A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models
arXiv
05-10 Value Augmented Sampling for Language Model Alignment and Personalization
Value Augmented Sampling (VAS) offers an efficient and powerful solution for adapting and personalizing LLMs. It overcomes the instabilities of existing RL algorithms, achieves both high performance and computational efficiency, supports adaptation of black-box models, and paves the way for personalized and aligned LLMs.
arXiv
Summary
05-09 LLMPot: Automated LLM-based Industrial Protocol and Physical Process Emulation for ICS Honeypots
Institution: New York University Abu Dhabi
LLMPot represents a novel ICS network defense tool that leverages the capabilities of LLMs. By automating the generation of responses closely related to protocols and physical processes, LLMPot significantly enhances the practicality and effectiveness of honeypots.
arXiv
Summary
05-09 Exploring the Potential of Human-LLM Synergy in Advancing Qualitative Analysis: A Case Study on Mental-Illness Stigma
The CHALET framework illustrates the vast potential of human-LLM collaboration in qualitative research, especially in deepening understanding and generating insights, offering a new direction for future studies in HCI and qualitative analysis.
arXiv
Summary
05-09 An Automatic Prompt Generation System for Tabular Data Tasks
The paper successfully develops an auto-prompt generation system compatible with various LLMs without extensive training, significantly enhancing the performance of tabular data tasks through two innovative methods.
arXiv
Summary
05-09 Can large language models understand uncommon meanings of common words?
Institution: Tsinghua University, Chinese Academy of Science
This study reveals significant shortcomings in large language models' understanding of the uncommon meanings of common words by establishing a new assessment framework and dataset, offering a new direction for enhancing models' NLU capabilities.
arXiv
Summary
05-08 ADELIE: Aligning Large Language Models on Information Extraction
Institution: Tsinghua University
The ADELIE models proposed in this paper effectively address the alignment issues of LLMs in information extraction tasks, improving performance via novel datasets and training methods while maintaining robust general capabilities. This provides valuable insights and a foundation for future research in this area.
arXiv
Summary
05-08 "They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Institution: University of Washington, MBZUAI
This study reveals potential harms in complex social interactions involving a wide range of cultures and identities that LLMs might cause through the innovative CHAST assessment system, emphasizing the necessity of thorough bias audits before deploying these models.
arXiv
Summary
05-08 Air Gap: Protecting Privacy-Conscious Conversational Agents
arXiv
05-07 Toward In-Context Teaching: Adapting Examples to Students' Misconceptions
Institution: MIT CSAIL
This paper successfully demonstrates the potential of using large language models for adaptive teaching and achieves effective identification of student misconceptions and optimization of teaching feedback through the ATOM model.
arXiv
Summary
05-07 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Institution: MIT, NVIDIA
With its novel quantization algorithm and system design, QServe significantly enhances the efficiency of LLM servicing on GPUs, dramatically reducing costs and providing a new solution for deploying large-scale language models.
arXiv
Summary
GitHub
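To illustrate just the 4-bit weight side of the W4A8KV4 scheme, here is a per-channel symmetric INT4 quantization round trip (toy shapes of our choosing; not QServe's fused kernels or its system design):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-channel symmetric INT4 weight quantization sketch (the "W4" in
# W4A8KV4): each output channel gets its own scale, values are rounded
# into the signed 4-bit range, then dequantized to check the error.
W = rng.normal(size=(4, 64)).astype(np.float32)

scale = np.abs(W).max(axis=1, keepdims=True) / 7.0   # int4 range: [-8, 7]
Wq = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
W_hat = Wq.astype(np.float32) * scale                # dequantized weights

err = np.abs(W - W_hat).max()                        # bounded by scale / 2
print(Wq.min(), Wq.max(), round(float(err), 3))
```

The system-level contribution of QServe is making such low-bit weights (plus 8-bit activations and 4-bit KV cache) fast on real GPUs, which this sketch does not attempt to show.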
05-07 Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation
Institution: Center for Responsible AI, IIT Madras, Princeton University
The paper effectively demonstrates the deceptive capabilities of autonomous agents using large language models in a goal-driven environment performing complex tasks like legislative lobbying and proposes an effective method for detecting such deceptive behaviors. These findings provide significant insights into the application of AI in legal and ethical contexts, while also advocating for new research directions in AI safety.
arXiv
Summary
05-07 Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application
Institution: Kuaishou Technology, Southeast University
The paper successfully applies the open-world knowledge of large language models to recommendation systems, addressing core challenges in practical applications through an innovative twin-tower structure, providing new insights into enhancing RS performance.
arXiv
Summary
05-06 MARE: Multi-Agents Collaboration Framework for Requirements Engineering
Institution: Peking University
This research presents a novel Multi-Agent Collaboration Framework, MARE, for leveraging collaboration between Large Language Models (LLMs) throughout the entire Requirements Engineering process. It addresses limitations in the automation of RE tasks and demonstrates superiority in requirement modeling and specification generation, as verified by extensive experimental evaluation.
arXiv
Summary
05-06 Lifelong Knowledge Editing for LLMs with Retrieval-Augmented Continuous Prompt Learning
Institution: East China Normal University
The RECIPE method efficiently improves editing efficiency and inference speed in LLMs within lifelong learning scenarios by transforming knowledge statements into continuous prompts and utilizing Knowledge Sentinel for dynamic retrieval management. This approach overcomes the limitations of previous methods and performs excellently across multiple evaluation metrics while maintaining overall model performance.
arXiv
Summary
05-03 What matters when building vision-language models?
Institution: Hugging Face, Sorbonne Université
This paper thoroughly investigates the critical design choices impacting VLMs' performance through extensive experiments, introduces the efficient foundational VLM Idefics2, and proves its superior performance in multiple standard tests.
arXiv
Summary
05-02 Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Institution: KAIST AI, LG AI Research, Carnegie Mellon University
PROMETHEUS 2 is an innovative open evaluator LM that can operate in both direct assessment and pairwise ranking formats while correlating closely with human judgments and proprietary LM evaluations on custom criteria. The model outperforms other open models and even some proprietary models, thanks to its training using weight merging.
arXiv
Summary
GitHub
05-02 How Can I Get It Right? Using GPT to Rephrase Incorrect Trainee Responses
Institution: Carnegie Mellon University
The paper investigates the construction of an automated feedback system using GPT-4 to assist in the training of tutors in one-on-one classes, aiming to alleviate the resource burden of traditional personalized instructional feedback and provide high-quality and specific feedback. It falls under the category of knowledge retrieval and evaluation research.
arXiv
Summary
05-01 Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3
This study provides an empirical evaluation of model editing techniques in LLMs, revealing potential shortcomings in previous methods and proposing new directions and insights for future model editing approaches.
arXiv
Summary
05-01 "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust
Institution: Princeton University, Microsoft
The paper, through large-scale experiments, demonstrates that by expressing uncertainty in natural language, LLMs can reduce user overreliance and improve accuracy in task performance. Specifically, first-person expressions have a significant effect on improving user accuracy. Moreover, the research emphasizes the importance of user testing before the practical application of LLMs to adjust the way uncertainty is communicated.
arXiv
Summary
05-01 The Real, the Better: Aligning Large Language Models with Online Human Behaviors
Institution: Baidu Inc.
This paper introduces a novel framework, RLHB, for aligning large language models with real online human behaviors innovatively, overcoming the limitations of current approaches and effectively demonstrating its methods through experimentation.
arXiv
Summary
05-01 A Careful Examination of Large Language Model Performance on Grade School Arithmetic
arXiv
05-01 Can a Hallucinating Model help in Reducing Human "Hallucination"?
Institution: Stanford University, UC Berkeley
This paper explores how to use Large Language Models (LLMs) to detect and combat unwarranted beliefs, as well as to leverage LLMs as personalized misinformation debunking agents. The researchers propose new methods to assess and utilize LLMs' capabilities in identifying logical pitfalls and challenge human unwarranted beliefs.
arXiv
Summary

2024-04

 Date   Paper Links & Summary
04-30 Iterative Reasoning Preference Optimization
Institution: FAIR at Meta, New York University
The paper proposes an iterative reasoning preference optimization method, which applies preference optimization to reasoning tasks, particularly Chain-of-Thought (CoT) reasoning, and enhances model performance by adding an NLL loss term during iterative training. Experiments show that the method steadily improves reasoning performance over several iterations before reaching saturation.
arXiv
Summary
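A toy rendering of the combined objective described above (the numbers are our own illustrative inputs; the real method computes token-level log-probabilities from the policy and a reference model over preferred/rejected CoT pairs):

```python
import math

beta, alpha = 0.1, 1.0  # assumed hyperparameters for illustration

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def iterative_rpo_loss(lp_w, lp_l, ref_w, ref_l, n_tokens_w):
    # Standard DPO term on the (winner, loser) pair, plus a
    # length-normalized NLL term on the preferred sequence.
    dpo = -math.log(sigmoid(beta * ((lp_w - ref_w) - (lp_l - ref_l))))
    nll = -lp_w / n_tokens_w
    return dpo + alpha * nll

# Summed log-probs under the policy (lp_*) and the reference (ref_*).
loss = iterative_rpo_loss(lp_w=-12.0, lp_l=-20.0, ref_w=-14.0, ref_l=-18.0,
                          n_tokens_w=10)
print(round(loss, 3))
```

In the iterative scheme, each round generates new CoT pairs with the current model, scores them for correctness, and retrains with this objective.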
04-30 Multi-hop Question Answering over Knowledge Graphs using Large Language Models
Institution: Microsoft
The paper presents different strategies for multi-hop question-answering tasks across various KG datasets, demonstrating the potent capabilities of large pre-trained language models in complex QA tasks. Through experiments, the paper validates the superiority of the proposed approach over current technologies.
arXiv
Summary
04-30 Better & Faster Large Language Models via Multi-token Prediction
Institution: FAIR at Meta
The paper proposes a novel method for training large language models by predicting multiple tokens instead of a single one, improving sample efficiency and demonstrating how to boost performance in generative tasks and speed up inference. Experiments confirm the significant advantages of this approach in enhancing the performance and inference speed of large models.
arXiv
Summary
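A minimal sketch of the multi-token training signal, with toy sizes and independent output heads (the `heads`, `vocab`, and `n_future` values here are our own assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n_future = 50, 8, 4  # predict 4 future tokens per position

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One shared trunk state per position; n_future separate output heads.
heads = [rng.normal(size=(vocab, d)) for _ in range(n_future)]

def multi_token_loss(h, targets):
    # Averaged cross-entropy: head i predicts the token i+1 steps ahead.
    assert len(targets) == n_future
    loss = 0.0
    for head, t in zip(heads, targets):
        probs = softmax(head @ h)
        loss += -np.log(probs[t])
    return loss / n_future

h = rng.normal(size=d)
loss = multi_token_loss(h, [3, 17, 42, 8])
print(round(float(loss), 3))
```

The extra heads are only needed at training time; at inference they can be dropped, or reused to draft several tokens at once for faster decoding.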
04-30 Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom
Institution: Shanghai Jiao Tong University
The study introduces SwordsmanImp, a novel Chinese multi-turn dialogue dataset for evaluating LLMs' ability to understand implicatures in context-rich, turn-taking dialogues, revealing the challenges and limitations of LLMs in understanding and explaining non-literal meanings.
arXiv
Summary
GitHub
04-29 Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Institution: Cohere
The paper develops a new method for evaluating LLM generations, called PoLL, which consists of a “jury” of smaller models from different families, showing applicability in varying tasks, cost-efficiency, and reduced bias of LLMs as judges.
arXiv
Summary
04-29 LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
Institution: Predibase
This paper proposes that fine-tuning large language models through LoRA can significantly improve the overall performance of the models, reduce errors in classification tasks, and outperform out-of-the-box models like GPT-4 and GPT-3.5. Additionally, the paper takes into account cost constraints, reducing the financial burden of using LLM APIs by limiting the number of evaluation samples.
arXiv
Summary
04-26 When to Trust LLMs: Aligning Confidence with Response Quality
Institution: Alibaba Group
This paper presents a method for aligning confidence and answer quality through reinforcement learning (CONQORD). It optimizes confidence levels through self-assessment in the absence of an objective standard, reducing bias and improving the accuracy and alignment of model predictions, though further improvements are needed to match the performance of more effective methods.
arXiv
Summary
04-26 A Comprehensive Evaluation on Event Reasoning of Large Language Models
Institution: Peking University, Advanced Institute of Big Data, Beihang University
This paper comprehensively evaluates the event reasoning capabilities of LLMs by introducing a new benchmark, EV2. The findings suggest that despite having capabilities for event reasoning, LLMs do not align with humans in using event schema knowledge, and with explicit guidance, they can better understand and execute event reasoning tasks.
arXiv
Summary
04-25 How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Institution: Shanghai AI Laboratory, SenseTime Research, Tsinghua University
InternVL 1.5 is a robust open-source multimodal language model aimed at closing the performance gap between open-source and commercial models in multimodal understanding. Its strengths include enhanced visual understanding, handling dynamic high-resolution images, and the use of a high-quality bilingual dataset, making it perform well across various tasks.
arXiv
Summary
GitHub
04-25 Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare
arXiv
04-25 Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
Institution: Meta, University of Toronto, Carnegie Mellon University
LayerSkip presents a novel, practical solution that significantly accelerates inference in LLMs without compromising on accuracy, showcasing its potential in real-world applications.
arXiv
Summary
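The early-exit half of the idea can be sketched with a toy model (random weights, a shared decoding head, and an assumed confidence threshold `tau`; LayerSkip's actual recipe also involves layer-dropout training and self-speculative verification, which this omits):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy early exit: after each layer, decode with a shared head and stop
# as soon as the top token's probability clears the threshold.
d, vocab, n_layers, tau = 16, 50, 8, 0.5
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
head = rng.normal(size=(vocab, d))

def early_exit(h, tau=tau):
    for i, L in enumerate(layers, start=1):
        h = np.tanh(L @ h)
        p = softmax(head @ h)
        if p.max() >= tau:           # confident enough: exit at layer i
            return i, int(p.argmax())
    return n_layers, int(p.argmax())  # fell through: use the last layer

exit_layer, token = early_exit(rng.normal(size=d))
print(exit_layer, token)
```

In the self-speculative variant, tokens drafted by such early exits are then verified by the remaining layers of the same model, avoiding a separate draft model.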
04-25 Continual Learning of Large Language Models: A Comprehensive Survey
Institution: Rutgers University, Wuhan University, Huazhong University of Science and Technology
The survey provides a comprehensive view on the continual learning of LLMs, with a particular emphasis on the under-explored research areas of continual pre-training (CPT) and domain-adaptive pre-training (DAP). It highlights the need for greater attention from the community, especially in the development of practical, accessible, and widely accepted evaluation benchmarks, as well as methodologies tailored for the emerging learning paradigms of large language models.
arXiv
Summary
GitHub
04-24 From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Institution: Microsoft Research, Microsoft Strategic Missions and Technologies, Microsoft Office of the CTO
The paper presents Graph RAG, a query-focused summarization method that combines graph indexing with LLM-generated community summaries to handle corpora whose size exceeds what a large language model can process directly. Assisted by community detection algorithms, the approach achieves remarkable results on global questions and large-scale text analysis.
arXiv
Summary
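A stripped-down sketch of the indexing side (the per-document entity lists are stubbed in place of LLM extraction, connected components stand in for a real community-detection algorithm, and a string placeholder stands in for the LLM-written community summary):

```python
from collections import defaultdict, deque

# Documents reduced to their extracted entities (assumed inputs).
docs = [
    ["alice", "bob"],
    ["bob", "carol"],
    ["dave", "erin"],
]

# Build an entity co-occurrence graph.
graph = defaultdict(set)
for ents in docs:
    for a in ents:
        for b in ents:
            if a != b:
                graph[a].add(b)

def communities(g):
    # BFS over connected components (stand-in for community detection).
    seen, out = set(), []
    for node in list(g):
        if node in seen:
            continue
        comp, q = set(), deque([node])
        while q:
            n = q.popleft()
            if n in comp:
                continue
            comp.add(n)
            q.extend(g[n] - comp)
        seen |= comp
        out.append(sorted(comp))
    return out

# Placeholder for LLM-written community summaries.
summaries = {tuple(c): "summary of " + ", ".join(c) for c in communities(graph)}
print(len(summaries))
```

At query time, the method answers a global question by map-reducing over these community summaries instead of retrieving raw chunks.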
04-24 Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs
Institution: Shanghai Jiao Tong University, UC San Diego, Duke University
The article is a detailed survey of Chain-of-X (CoX) methods in Large Language Models (LLMs), focusing on extending the Chain-of-Thought (CoT) concept to broader applications and providing potential directions for future research.
arXiv
Summary
04-23 A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications
Institution: Hong Kong Baptist University
This work is a survey that investigates research on LLMs applied to graph data, discusses the advantages of LLMs in providing general solutions for graph tasks, and suggests future directions for research in this field.
arXiv
Summary
04-23 CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies
Institution: Stanford University, IBM Research
The paper presented a pipeline for building cultural knowledge bases and created CultureBank, a knowledge base including cultural descriptors from TikTok and Reddit. The paper further assessed LLMs' cultural awareness using this repository and trained more culturally-conscious language models to promote the development of culturally-aware language technologies in the future.
arXiv
Summary
GitHub
04-22 SnapKV: LLM Knows What You are Looking for Before Generation
Institution: University of Illinois Urbana-Champaign, Cohere, Princeton University
This paper introduces SnapKV, a novel approach to tackling the Key-Value cache problem in large language models. SnapKV intelligently compresses and selects important KV positions to significantly improve decoding speed and memory efficiency for long text processing, reducing computational costs while maintaining accuracy.
arXiv
Summary
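The selection idea can be sketched as follows, with assumed toy dimensions: prompt positions are scored by the attention they receive from a trailing observation window of queries, and only the top-scoring key/value pairs are kept in the cache:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# SnapKV-style selection sketch (toy sizes, single head, no clustering).
seq, window, d, keep = 128, 16, 8, 32
K = rng.normal(size=(seq, d))
V = rng.normal(size=(seq, d))
Q_obs = rng.normal(size=(window, d))               # last-window queries

attn = softmax(Q_obs @ K.T / np.sqrt(d), axis=-1)  # (window, seq)
scores = attn.sum(axis=0)                          # attention votes per position
top = np.sort(np.argsort(scores)[-keep:])          # keep positions, in order
K_c, V_c = K[top], V[top]
print(K_c.shape)  # compressed cache holds only `keep` positions
```

The full method additionally pools scores over a neighborhood to keep clusters of adjacent positions, which this single-head sketch leaves out.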
04-22 Tree of Reviews: A Tree-based Dynamic Iterative Retrieval Framework for Multi-hop Question Answering
Institution: Tencent Inc., Harbin Institute of Technology
The paper proposes a novel iterative retrieval framework (TOR) that uses a tree structure to minimize error accumulation and incorporates optimization strategies to improve retrieval efficiency and quality. Experiments show that TOR achieves state-of-the-art performance on several datasets.
arXiv
Summary
04-22 LLMs Know What They Need: Leveraging a Missing Information Guided Framework to Empower Retrieval-Augmented Generation
Institution: Meituan
The MIGRES framework proposed in this study enhances RAG by exploiting LLMs' ability to identify missing information. Research results prove the superiority of MIGRES across multiple public datasets, addressing challenges in RAG's understanding of complex queries and retrieval of relevant documents.
arXiv
Summary
04-22 Information Re-Organization Improves Reasoning in Large Language Models
Institution: Zhejiang University
This paper introduced a novel Information Re-Organization (InfoRE) method that enhances the reasoning capabilities of LLMs by re-organizing contextual content to reveal logical relationships. The method was significantly effective when tested on LLMs for context-aware multi-hop reasoning tasks in a zero-shot setting.
arXiv
Summary
GitHub
04-22 A Survey on Efficient Inference for Large Language Models
Institution: Tsinghua University
This paper offers an encompassing survey of literature on improving inference efficiency for large language models, proposing a taxonomy that covers data-level, model-level, and system-level optimizations. Additionally, it provides quantified comparisons of key techniques through experiments, pointing out future directions for research.
arXiv
Summary
04-22 Beyond Scaling: Predicting Patent Approval with Domain-specific Fine-grained Claim Dependency Graph
Institution: University of California San Diego, Carnegie Mellon University, University of Pennsylvania
The researchers presented a novel algorithm for constructing a fine-grained claim dependency graph (FLAN Graph) that significantly improves the state of the art at scale and conducted comprehensive experiments and analyses of modern LLMs on patent approval prediction, identifying limitations of LLMs and providing valuable references for the development of future LLM-based solutions. The source code and dataset have been publicly released to facilitate future research.
arXiv
Summary
04-22 A Survey on Self-Evolution of Large Language Models
Institution: Peking University, Alibaba Group, Nanyang Technological University
The review paper offers a structured overview and summary of self-evolution approaches in LLMs, furnishing conceptual frameworks and future insights to propel research into self-evolving LLMs and pave the way for the development of next-generation models.
arXiv
Summary
04-21 AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Institution: Meta AI (FAIR), Max-Planck-Institute for Intelligent Systems
This paper presents a novel LLM called AdvPrompter that uses an innovative algorithm to rapidly generate human-readable adversarial prompts without the need for gradient information from the Target LLM. It significantly accelerates prompt generation while maintaining semantic coherence, and additionally through training with AdvPrompter, it can enhance the robustness of LLMs against jailbreaking attacks without sacrificing performance.
arXiv
Summary
04-19 LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency
Institution: Nanyang Technological University, DAMO Academy Alibaba Group, Singapore University of Technology and Design
LLM-R2 is an LLM-enhanced query rewrite system that effectively boosts the execution efficiency of query rewriting by automatically selecting effective rules from a given set of rewrite rules. It addresses the limitations of current methods and shows superior performance across multiple datasets.
arXiv
Summary
04-19 Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?
Institution: Nanyang Technological University, Princeton University, Salesforce Research
The paper thoroughly assessed LLMs' capability for analogical reasoning and introduced two methods that significantly reduce inference costs while enhancing performance. Findings revealed that, contrary to the previously held belief of the critical importance of relevance, self-generated irrelevant examples could perform equally or even better in some tasks. The study hopes to encourage further research on the design of self-generated contexts.
arXiv
Summary
04-18 Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers
Institution: Westlake University, Alibaba Group, Zhejiang University
The paper presents the MCRanker model, which improves consistency and comprehensiveness of LLM rankers by creating a virtual professional annotator team and generating evaluative criteria from multiple perspectives, capable of adapting to various datasets and improving ranking performance.
arXiv
Summary
04-18 Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
Institution: UC Berkeley
This paper proposes EvalGen, a user interface for aligning LLM-assisted evaluations of LLM outputs with human preferences using a mixed-initiative approach. It addresses the trustworthiness of LLM-generated evaluation functions and explores the dynamic nature of how users define and use evaluation criteria in practical applications.
arXiv
Summary
04-18 EVIT: Event-Oriented Instruction Tuning for Event Reasoning
Institution: Key Laboratory of High Confidence Software Technologies (PKU), MOE, China, School of Computer Science, Peking University, Advanced Institute of Big Data
EVIT addresses the shortcomings of current smaller instruction-tuned models in event reasoning tasks by introducing Event-Oriented Instruction Tuning and the concept of event quadruples. The experimental results show that EVIT performs better on event reasoning tasks compared to other models.
arXiv
Summary
04-18 Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
The paper presents a novel framework called ALPHALLM that, by integrating Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs), facilitates the self-improvement of LLMs without the need for additional annotated data.
arXiv
Summary
04-18 RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
Institution: Peking University, ByteDance Inc.
RAGCache enhances the performance of the RAG process by designing a targeted caching system and sharing intermediate states, significantly improving processing speed and reducing computational resource overhead.
arXiv
Summary
04-18 mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture
Institution: Beihang University, Beijing Information Science and Technology University
mABC is an innovative framework that leverages LLMs and multi-agent cooperation, facilitated by blockchain-inspired decision-making processes, aimed at root cause analysis (RCA) in micro-services architectures within cloud-native technologies.
arXiv
Summary
04-17 Many-Shot In-Context Learning
Institution: Google DeepMind
The key contributions of this paper include systematically evaluating LLM performance with varying scales of in-context examples across a broad range of tasks, introducing reinforced ICL and unsupervised ICL to reduce reliance on human-generated examples, and discovering that many-shot ICL can overcome pre-training biases and learn high-dimensional numerical prediction tasks.
arXiv
Summary
04-17 Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models
Institution: Renmin University of China, Chinese Academy of Sciences, Huawei Technologies
The survey paper offers a new perspective for understanding bias and unfairness in LLMs and IR systems as distribution mismatch problems and categorizes various mitigation strategies.
arXiv
Summary
GitHub
04-17 AgentKit: Flow Engineering with Graphs, not Coding
Institution: Carnegie Mellon University, NVIDIA, Microsoft
The paper presents a novel LLM prompting framework, AgentKit, addressing multifunctional agents, supporting the construction and fine-tuning of complex agent thought processes through modular components and intuitive designs. AgentKit shows potential in realizing advanced agent capabilities and lowering the entry barrier for users.
arXiv
Summary
GitHub
04-17 A Deep Dive into Large Language Models for Automated Bug Localization and Repair
Institution: University of Virginia, Purdue University, Amazon Web Services
This paper introduces a new approach named Toggle, which utilizes token-level bug localization and repair to overcome the limitations of existing line-granular methods. By designing inputs and fine-tuning LLMs, it significantly enhances the accuracy of bug fixes and delivers outstanding performance on multiple datasets, marking a new progression in the APR field.
arXiv
Summary
04-16 How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior
Institution: Stanford University
This paper analyses the tension between LLMs’ internal knowledge and retrieved information in RAG settings, finding that LLMs’ tendency to follow RAG information is inversely correlated with the model's confidence in its response without context. The research, which spans six domain datasets with over 1200 questions, reveals the inherent conflict between the model's pre-trained knowledge and the retrieved information.
arXiv
Summary
04-16 CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity
Institution: Intel Labs
The CoTAR method proposed in this paper addresses the issue of LLMs tending to produce inaccurately attributed content in question-answering tasks. By reasoning prior to output generation and guiding the model at different levels of attribution granularity, the method significantly improves the model's performance in terms of answer quality and attribution accuracy.
arXiv
Summary
04-16 Self-playing Adversarial Language Game Enhances LLM Reasoning
Institution: Tencent AI Lab
This paper proposes an innovative training scheme named SPAG that effectively enhances the reasoning capabilities of LLMs through self-play in adversarial language games and demonstrates that these improvements can persist and amplify through the iterative process.
arXiv
Summary
GitHub
04-15 Learn Your Reference Model for Real Good Alignment
Institution: Tinkoff
The paper introduces a novel method known as Trust Region DPO (TR-DPO), which improves alignment in language models by iteratively updating the reference policy parameters during training. Experimental results show that TR-DPO surpasses the DPO method on both evaluated datasets, effectively enhancing model performance across multiple evaluation criteria.
arXiv
Summary
04-15 Compression Represents Intelligence Linearly
Institution: The Hong Kong University of Science and Technology, Tencent
The paper provides empirical evidence that there is almost a linear correlation between LLMs' performance on downstream tasks and their compression efficiency, supporting the long-held belief that "better compression indicates higher intelligence". It also proposes using compression efficiency as an unsupervised metric for assessing LLM performance.
arXiv
Summary
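The compression efficiency the paper correlates with intelligence boils down to bits per character of the model's code length over a corpus. As a rough, purely illustrative stand-in (using `gzip` here instead of the LLM log-likelihoods the paper actually measures), the metric can be computed like this:

```python
import gzip

def bits_per_character(text: str) -> float:
    """Bits needed per input character after compression.

    gzip is only a toy proxy: the paper derives code length from an
    LLM's per-token log-probabilities, not a general-purpose compressor.
    """
    compressed = gzip.compress(text.encode("utf-8"))
    return 8 * len(compressed) / len(text)

# Highly regular text compresses far below its raw 8 bits/char.
print(bits_per_character("the cat sat on the mat. " * 200))
```

Lower bits-per-character means better compression; the paper's finding is that this number tracks downstream benchmark scores almost linearly.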
04-14 Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development
The focus of this paper is on supporting and optimizing the deployment of machine learning models on emerging computing platforms, introducing a framework named TAPML. The framework uses a top-down methodology and a universal runtime to make model deployment broader, easier, and more capable, and it provides practical deployment cases as deep insights and best practices for developing ML systems on emerging platforms.
arXiv
Summary
04-13 Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning
Institution: Nanjing University, University of California
The paper presents a new framework for multitask fine-tuning of Large Language Models named Intuition-MoR1E, which draws on principles of human cognitive neuroscience and uses Rank-1 Experts formulation to manage a spectrum of intuitions, significantly enhancing parameter efficiency and multitask fine-tuning effectiveness.
arXiv
Summary
04-12 Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Institution: AI at Meta, University of Southern California, Carnegie Mellon University
The paper introduces MEGALODON, an efficient neural architecture for modeling sequences with unlimited context length. With innovative technical contributions, MEGALODON demonstrates higher efficiency and efficacy in long sequence modeling tasks than the Transformer while achieving robust improvements across various scales and modalities of benchmarks.
arXiv
Summary
GitHub
04-11 Rho-1: Not All Tokens Are What You Need
Institution: Xiamen University, Tsinghua University, Microsoft
This study introduces RHO-1, a novel language model that employs Selective Language Modeling (SLM), focusing training on useful tokens during pre-training. In continual pre-training on the mathematical domain, it reaches baseline performance faster and attains state-of-the-art results with a fraction of the tokens.
arXiv
Summary
04-11 Decomposing Label Space, Format and Discrimination: Rethinking How LLMs Respond and Solve Tasks via In-Context Learning
Institution: Nanyang Technological University
The paper investigates the mechanisms by which ICL improves task performance, identifying label space regulation and format refinement as significant contributors to performance enhancement while emphasizing the importance of selecting appropriate demonstrations.
arXiv
Summary
04-11 ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past
Institution: Baylor University
By probing the predictive abilities of ChatGPT-3.5 and ChatGPT-4, the research unveils new potential in the reasoning capabilities of LLMs. The study confirms that future-narrative prompts significantly enhance predictive accuracy, offering valuable insights into potential applications of LLMs in analytical settings.
arXiv
Summary
04-11 ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
Institution: University of Central Florida, ByteDance Inc
ControlNet++ significantly improves controllability across a range of conditional controls by optimizing pixel-level consistency between the generated images and the conditions, while the efficient reward fine-tuning strategy reduces the time and memory costs associated with image sampling.
arXiv
Summary
04-11 Interactive Prompt Debugging with Sequence Salience
The paper presents a system called Sequence Salience, which extends existing input salience (IS) methods to support complex LLM prompt debugging. This tool offers real-time interactive debugging, lowers practitioner cognitive load, supports prompt iteration based on salience results, and aligns more closely with the developer's mental model.
arXiv
Summary
04-11 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Institution: The University of Hong Kong, CMU, Salesforce Research
OSWORLD offers a novel evaluation environment addressing the limitations of existing benchmarks, laying the groundwork for the development of multimodal agents capable of performing open-ended tasks in real computer environments.
arXiv
Summary
04-10 Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
Institution: Apple, Cupertino, CA, USA
The paper presented a novel RAG prompting method, "superposition prompting," to address problems with LLMs when handling long texts, significantly enhancing time efficiency and accuracy without the need for further training or tuning. The method has been validated on several pretrained models, and the authors plan to release an open-source code implementation.
arXiv
Summary
04-10 Transferable and Efficient Non-Factual Content Detection via Probe Training with Offline Consistency Checking
Institution: Renmin University of China, Tsinghua University
This paper introduces PINOS, a novel approach for training a probing model via offline self-consistency checking, effectively addressing the limitations of existing factual detection methods. PINOS exhibits enhanced transferability and efficiency, and achieves superior results on factuality detection and question-answering benchmarks compared to existing methods.
arXiv
Summary
04-10 Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Institution: Google
This research proposed a novel attention mechanism, Infini-attention, which combines compressive memory with standard dot-product attention and, by design, supports plug-and-play continuous pre-training and long-context adaptation, enabling LLMs to handle infinitely long contexts with bounded memory and computational resources.
arXiv
Summary
04-10 "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output
Institution: Google Research
This paper explores how to implement user-centered constraints on the outputs of large language models (LLMs) by surveying industry professionals to understand different scenarios and demands. The focus is on enhancing the efficiency of developers in the development, testing, and integration process of LLMs, and on bolstering the end-user experience by meeting specific output formats and user interface requirements.
arXiv
Summary
04-09 RULER: What's the Real Context Size of Your Long-Context Language Models?
Institution: NVIDIA
This paper proposes RULER, a new open-source assessment tool for long-context LMs, providing the means to test performance on complex tasks and understanding of long contexts, with evaluations conducted across various models and task complexities.
arXiv
Summary
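Benchmarks like RULER build synthetic tasks, such as needle-in-a-haystack retrieval, whose context length is fully controllable. A toy generator in that spirit (purely illustrative; not the benchmark's code, and the filler/needle templates are invented here) might look like:

```python
import random

def make_needle_task(num_filler_sentences: int = 50, seed: int = 0):
    """Hide a key-value 'needle' in filler text; return (prompt, answer).

    Growing num_filler_sentences stretches the context while the task
    itself stays constant, isolating long-context retrieval ability.
    """
    rng = random.Random(seed)
    key = f"code-{rng.randint(1000, 9999)}"
    answer = str(rng.randint(100000, 999999))
    sentences = ["The sky is blue and the grass is green."] * num_filler_sentences
    needle = f"The secret value for {key} is {answer}."
    sentences.insert(rng.randrange(len(sentences) + 1), needle)
    context = " ".join(sentences)
    prompt = f"{context}\nQuestion: What is the secret value for {key}?"
    return prompt, answer

prompt, answer = make_needle_task()
print(len(prompt.split()), answer)
```

Scoring is then a simple exact-match check of the model's reply against `answer`.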
04-09 Event-enhanced Retrieval in Real-time Search
Institution: Tencent Search, Platform and Content Group
EER is an innovative approach targeting the "semantic drift" in real-time searches by enhancing the EBR model and including contrastive learning and a generative event triplet task. The method's effectiveness has been experimentally verified, potentially providing new insights into the information retrieval domain.
arXiv
Summary
GitHub
04-09 THOUGHTSCULPT: Reasoning with Intermediate Revision and Search
Institution: UC Berkeley
THOUGHTSCULPT, a graph-based framework, showcases its distinct capability to iteratively improve previous outputs while generating new thought nodes through its embedded self-revision mechanism, particularly excelling in tasks that require continuous revision and modification.
arXiv
Summary
04-09 Privacy Preserving Prompt Engineering: A Survey
Institution: University of Arkansas
The survey paper contributes a systematic overview concerning privacy protection methods in the realm of ICL and general prompting with LLMs, facilitating further research and exploration within the community regarding privacy protection.
arXiv
Summary
04-08 LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding
Institution: Meta
The paper successfully presents and validates an LLM-augmented retrieval framework with enhanced document-level embedding. By generating synthetic relevant queries and titles to add more contextual information to document embeddings and improving key steps in the training of retrieval models, the paper improves the performance and robustness of retrieval models.
arXiv
Summary
04-08 Evaluating Interventional Reasoning Capabilities of Large Language Models
Institution: Université de Montréal, Google DeepMind, ServiceNow Research
The paper evaluates the interventional reasoning capabilities of large language models (LLMs), focusing on predicting intervention effects and testing LLMs' ability to update their understanding of facts post-intervention. Results indicate that, under certain conditions, GPT-4 can accurately predict intervention outcomes, but minor changes in prompt design can significantly affect its performance.
arXiv
Summary
04-08 Know When To Stop: A Study of Semantic Drift in Text Generation
Institution: FAIR, Meta, Anthropic
The paper provides tools for understanding and measuring the phenomenon of semantic drift in long-form text generation by language models. Significant improvements in factual accuracy were achieved through early stopping and resampling-then-reranking methods, offering potential solutions to balance informational quantity with factual accuracy.
arXiv
Summary
04-08 LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Institution: Alibaba Group, Zhejiang University
The paper successfully proposes the LayoutLLM model and its layout instruction tuning strategy, significantly improving the model's understanding and utilization of document layouts, especially demonstrating outstanding performance in zero-shot document understanding tasks.
arXiv
Summary
GitHub
04-07 Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models
Institution: Cornell University
The paper introduces Radial Networks, a novel neural network architecture that uses dynamic layer sparsity and a trained router module for token-level inter-layer routing. This not only enhances model performance but also significantly reduces computational and serving costs, facilitating further scaling of large language models.
arXiv
Summary
04-07 Prompting Large Language Models for Zero-shot Essay Scoring via Multi-trait Specialization
Institution: Peking University
The study presents MTS, a novel zero-shot LLM framework for essay scoring that scores essays across different writing traits through multi-round conversations and derives the final score using min-max scaling and an outlier-clipping mechanism. MTS significantly improves accuracy over direct prompting, enables small-scale deployed models to outperform ChatGPT, and offers a zero-shot alternative to supervised essay-scoring approaches.
arXiv
Summary
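The score-derivation step can be sketched as follows; this is a hypothetical reconstruction with illustrative score ranges, not the authors' implementation: each trait score is clipped into its valid range, the clipped scores are averaged, and the average is min-max scaled onto the essay's overall score range.

```python
def final_essay_score(trait_scores, trait_min=0.0, trait_max=10.0,
                      essay_min=0.0, essay_max=60.0):
    """Clip outlier trait scores, average them, then min-max scale
    the average from the trait range onto the essay score range."""
    clipped = [min(max(s, trait_min), trait_max) for s in trait_scores]
    avg = sum(clipped) / len(clipped)
    scale = (essay_max - essay_min) / (trait_max - trait_min)
    return essay_min + (avg - trait_min) * scale

# The outlier 12.0 is clipped to 10.0 before averaging; prints 42.75.
print(final_essay_score([6.0, 7.5, 5.0, 12.0]))
```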
04-04 AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
arXiv
GitHub
04-04 ReFT: Representation Finetuning for Language Models
Institution: Stanford University, Pr(Ai)2R Group
This paper presents a new language model fine-tuning method, LoReFT, significantly surpassing existing Parameter-Efficient Fine-tuning (PEFTs) techniques in terms of resource efficiency and control capabilities. The method achieved state-of-the-art performance on multiple NLP tasks across various domains, maintaining fewer parameters and higher interpretability.
arXiv
Summary
GitHub
04-04 Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Institution: Microsoft Research
The paper presents DNO, an algorithm that effectively combines the ease of contrastive learning with the theoretical generalizability of optimizing general preferences in post-training LLMs. The significant performance improvements demonstrated by DNO highlight the feasibility of guiding model learning alignment with human values through general preference optimization.
arXiv
Summary
04-03 PromptRPA: Generating Robotic Process Automation on Smartphones from Textual Prompts
Institution: Shanghai Jiao Tong University, CMU
The paper presents the PromptRPA system, an effective solution to overcome limitations of RPA applications on mobile devices. Leveraging a multi-agent framework and online tutorials, it can interpret diverse textual prompts, addressing a wide range of RPA tasks. Performance evaluations demonstrate a significant increase in success rates, affirming the viability of text-driven control in RPA and paving the way for future advancements focused on enhanced functionality and broader applicability.
arXiv
Summary
04-02 CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
Institution: East China Jiaotong University, Guangdong University of Technology, University of Toronto
The core contribution of the paper is the proposal of the CMAT framework, a novel approach that allows for dynamic and real-time memory updates within multi-agent systems, and the design of a role-playing mechanism for precise task allocation and enhanced agent communication, significantly improving overall performance and cooperation efficiency.
arXiv
Summary
04-02 Long-context LLMs Struggle with Long In-context Learning
Institution: University of Waterloo, Carnegie Mellon University
This paper introduces a novel evaluation benchmark, LongICLBench, to assess the performance of LLMs on long-input in-context learning tasks, finding that performance generally degrades as task difficulty increases and that models are sensitive to the position of instances and the distribution of label positions in the input sequence. This work contributes to better understanding and improvement of large language models' capabilities in long text processing.
arXiv
Summary
04-02 Advancing LLM Reasoning Generalists with Preference Trees
arXiv
04-02 Octopus v2: On-device language model for super agent
Institution: Stanford University
This paper addresses the deployment and function call efficiency issues of LLMs on edge devices. By introducing specialized training methods and reducing the amount of context that needs to be processed during inference, the paper significantly improves the accuracy of function calls and reduces latency on devices. The experimental results demonstrate a significant impact on the performance of function calling tasks.
arXiv
Summary
04-02 LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models
Institution: Microsoft
The paper investigates how large language models (LLMs) can assist in designing adaptive bitrate (ABR) algorithms by generating a variety of candidate algorithms and using an early stopping mechanism to test them in a network simulator, effectively filtering out the most effective algorithm designs. Evaluations indicate that LLMs can significantly enhance the performance of ABR algorithms in specific network scenarios.
arXiv
Summary
04-01 Mapping the Increasing Use of LLMs in Scientific Papers
Institution: Stanford University, UC Santa Barbara
This paper presents the first large-scale, systematic examination across articles published on arXiv, bioRxiv, and Nature portfolio, with a statistical estimation method that measures the prevalence of LLM-modified content at the population level, providing valuable insights into the application of LLMs in scientific writing.
arXiv
Summary
04-01 Prompt-prompted Mixture of Experts for Efficient LLM Generation
Institution: CMU
GRIFFIN is a training-free MoE system that boosts the efficiency of LLMs by leveraging the phenomenon of flocking observed within FF blocks of LLMs across different activation functions while preserving performance and reducing computational costs.
arXiv
Summary
GitHub
04-01 Efficiently Distilling LLMs for Edge Applications
Institution: IBM Research
This paper provides a new method for distilling LLMs for edge devices, enabling LPFT while significantly reducing both model size and training cost, and in particular addressing decoder models' resistance to compression and their long training durations.
arXiv
Summary
04-01 LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation
Institution: Microsoft Research Asia
This paper proposed a novel framework using large language models for the evaluation of radiology reports—LLM-RadJudge, effectively enhancing the clinical relevance and consistency of radiology report assessments. Through knowledge distillation, a smaller model was developed, reducing the cost of evaluation and improving accessibility, providing strong support for the research and practical application of radiology report generation.
arXiv
Summary
04-01 AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review
Institution: University of Lyon, INSA Lyon, Infologic
This paper presents an extensive literature review of incident management in the AIOps domain, aiming to structure knowledge, identify knowledge gaps, and lay the groundwork for future developments in the field. The study establishes unified AIOps terminology and taxonomy, reveals existing challenges, and provides public datasets, offering direction and a basis for future research.
arXiv
Summary

2024-03

 Date   Paper Links & Summary
03-28 sDPO: Don't Use Your Data All at Once
The paper proposes a novel stepwise DPO (sDPO) method that effectively improves the performance and alignment of the final model by using preference datasets in a stepwise manner, and the aligned model from previous steps as the reference model for the current step.
arXiv
Summary
03-28 Jamba: A Hybrid Transformer-Mamba Language Model
Institution: AI21 Labs
Jamba represents a new direction in the large language model domain with its hybrid Transformer-Mamba architecture that breaks through the limitations of handling long contexts and optimizes both model throughput and memory footprint by applying MoE components. This model demonstrates the potential balance between efficient training and powerful performance in the field of large-scale language modeling.
arXiv
Summary
03-27 Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback
This paper effectively addresses LLM hallucinations and enhances model honesty and reliability by introducing the RLKF framework and defining new evaluation metrics, pointing towards a method for building more trustworthy AI systems.
arXiv
Summary
03-27 BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models
Institution: DCST Tsinghua University, Beijing Institute of Technology, Huawei Cloud BU
This research presented a novel architecture, BLADE, capable of enhancing black-box LLMs with smaller domain-specific models, addressing the lack of domain-specific knowledge in LLMs for specialized applications. Experiments showed BLADE to be an effective and cost-efficient solution.
arXiv
Summary
03-26 LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
Institution: The Hong Kong University of Science and Technology, University of Illinois Urbana-Champaign
The LISA strategy proposed in the paper uses layer-wise weight importance sampling to enhance the fine-tuning efficiency and performance of large language models, while maintaining memory efficiency comparable to LoRA.
arXiv
Summary
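The sampling idea can be illustrated with a small sketch (a hypothetical reconstruction, not the authors' code): per optimization period, only a few randomly sampled middle layers are unfrozen, while the first and last layers stay trainable throughout.

```python
import random

def sample_active_layers(num_layers: int, k: int, seed=None):
    """LISA-style layerwise sampling sketch: return the indices of
    layers to unfreeze for the next period. Layers 0 and num_layers-1
    (standing in for embedding and head) are always trainable; k middle
    layers are sampled uniformly at random."""
    rng = random.Random(seed)
    middle = list(range(1, num_layers - 1))
    active = set(rng.sample(middle, k))
    active.update({0, num_layers - 1})
    return sorted(active)

print(sample_active_layers(32, 2, seed=0))
```

In training, one would freeze every layer outside the returned set, run the period's optimizer steps, then resample; memory stays bounded because only k+2 layers hold optimizer state at a time.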
03-26 COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Institution: Shenzhen Institute of Advanced Technology, CAS; M-A-P; Institute of Automation, CAS
This paper presents the COIG-CQIA dataset, a high-quality dataset for Chinese instruction fine-tuning designed to align well with human interactions. The research emphasizes the importance of high-quality data sources for model fine-tuning and demonstrates through experiments how the strategies for creating datasets and methods of fine-tuning significantly impact model performance.
arXiv
Summary
03-26 The Unreasonable Ineffectiveness of the Deeper Layers
Institution: Meta FAIR, UMD
The paper presents an empirical study on a simple layer-pruning strategy for popular pre-trained open-weight LLMs and demonstrates minimal performance impact despite removing a significant number of layers.
arXiv
Summary
03-25 AIOS: LLM Agent Operating System
Institution: Rutgers University
AIOS, as an LLM agent operating system, overcomes challenges in areas such as resource scheduling and context management through the design of a specific kernel and modules, providing improvements in performance and efficiency for LLM agents and paving the way for the future development and deployment of the AIOS ecosystem.
arXiv
Summary
GitHub
03-22 Can large language models explore in-context?
Institution: Microsoft Research, Carnegie Mellon University
This paper investigates whether contemporary Large Language Models (LLMs) can engage in in-context exploration without any training interventions. The authors' experiments reveal that LLMs are capable of robust exploration only under specific configurations. This work indicates that even state-of-the-art LLMs might fail to explore in more complex environments without adequate prompt design, highlighting that non-trivial algorithmic interventions may be required for effective LLM operation in complicated settings.
arXiv
Summary
03-20 Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts
Institution: University of Memphis, San Francisco Veterans Affairs Health Care System, University of California San Francisco
The paper successfully improves the capability of Large Language Models in understanding psychiatric behaviors, especially in motivational interview contexts. By employing structured prompting and assessment methods to model professional therapists' thought processes, it effectively educates the model with domain knowledge, achieving better performance than conventional methods.
arXiv
Summary
03-19 Towards Robots That Know When They Need Help: Affordance-Based Uncertainty for Large Language Model Planners
Institution: University of Maryland
The paper introduces the LAP method that combines LLMs with scene affordances to reduce hallucinations and achieve uncertainty alignment in planning tasks. Demonstrating significant improvements in successful outcomes and decreased reliance on human assistance through experiments in both simulated and real-world robot manipulations, the LAP method advances the domain of intelligent robotics.
arXiv
Summary
03-18 Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Institution: University of Texas at Austin, Drexel University, MIT
This paper presents the first extensive evaluation of the trustworthiness of compressed LLMs across multiple dimensions and offers practical guidelines for considering efficiency and trustworthiness during compression.
arXiv
Summary
03-15 VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Institution: Stanford University
VideoAgent represents a substantial advancement in long-form video understanding by mimicking the human cognitive process, emphasizing the importance of reasoning over visual input spanning long time periods. This work not only sets a new benchmark in long-form video understanding but also provides insights for future research in this area.
arXiv
Summary
03-15 Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Institution: DP Technology, AI for Science Institute Beijing
Uni-SMART is an innovative model designed for deep understanding of multimodal scientific literature. It outperformed other top text-focused LLMs in multiple domains and has the potential to revolutionize interactions with scientific literature.
arXiv
Summary
03-15 RAFT: Adapting Language Model to Domain Specific RAG
Institution: UC Berkeley
The RAFT approach proposed in this paper innovates the training of large language models to answer questions in a domain-specific "open book" manner, enhancing the model's reasoning capabilities and resistance to distractor documents, and improving the model's accuracy in generating answers through the chain-of-thought method.
arXiv
Summary
03-13 Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments
Institution: Nanjing University, Microsoft
The Readi framework presents an efficient and faithful method for reasoning over large-scale structured environments, fully capitalizing on the planning capabilities of LLMs and enhancing reasoning paths through dynamic feedback, resulting in significant improvements in multi-hop reasoning tasks.
arXiv
Summary
03-13 Scaling Instructable Agents Across Many Simulated Worlds
The SIMA project proposed in this paper seeks to create an AI system capable of acting in various simulated 3D environments based on arbitrary language instructions. The design of the system focuses on addressing challenges in grounding language in perception and embodied actions, as well as achieving generality and scalability across many different environments.
arXiv
Summary
03-13 Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
Institution: ByteDance Research, University of Maryland College Park, Carnegie Mellon University
This paper successfully introduces a new causality-guided debiasing framework, which has been empirically validated for effectiveness. It not only integrates existing prompting-based debiasing methods but also proposes new avenues for eliciting unbiased reasoning.
arXiv
Summary
03-12 Chronos: Learning the Language of Time Series
Institution: Amazon Web Services, UC San Diego, University of Freiburg
Chronos has demonstrated exceptional performance as a pre-trained time series forecasting framework in both zero-shot and standard tasks. By leveraging data augmentation strategies and public datasets, it validates the promise of language model architectures for general applicability in time series forecasting, pointing towards a new direction for future time series models.
arXiv
Summary
03-11 Stealing Part of a Production Language Model
Institution: Google DeepMind, ETH Zurich, University of Washington
The paper proposes a novel model-stealing attack on production language models, capable of effectively extracting the final layer of a Transformer model. It shows how such an attack can recover details, parameters, and dimensions of black-box models, and argues that APIs must be modified to prevent such attacks in the future.
arXiv
Summary
03-11 ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis
Institution: Zhejiang University, Southeast University
The paper presents an innovative framework, ERA-CoT, which effectively enhances the reasoning and question-answering abilities of Large Language Models in complex entity scenarios, principally by improving the understanding of entity relationships, especially in the Chain-of-Thought reasoning process.
arXiv
Summary
03-11 RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback
Institution: Zhejiang University, Southeast University, Massachusetts Institute of Technology
RA-ISF is an innovative retrieval-augmented framework that enhances LLMs' problem-solving by iterative task decomposition and mitigates irrelevant text interference, significantly improving knowledge retrieval performance.
arXiv
Summary
03-08 Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation
The paper presents Adversarial Policy Optimization (AdvPO), a novel approach to tackling reward over-optimization issues within the RLHF process, especially in LLMs aimed at aligning with human preferences. AdvPO effectively alleviates the problem of reward over-optimization without incurring high computational costs.
arXiv
Summary
03-08 Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering
Institution: Gaoling School of Artificial Intelligence Renmin University of China, Nankai University, Beijing Academy of Artificial Intelligence
LLMQA is a novel generalized framework that combines strengths of retrieval- and generation-based evidence collection. By enabling LLMs to take on multiple roles within the framework, the paper significantly improves the overall performance of ODQA systems, with experimental results demonstrating its effectiveness over existing methods.
arXiv
Summary
03-08 Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Institution: Google
Gemini 1.5 Pro achieved a significant breakthrough in memory and reasoning capabilities for vast amounts of long-context information, particularly in processing extended texts, videos, and audio. The model not only outperforms in effectiveness but also shows improved computational efficiency.
arXiv
Summary
03-07 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Institution: UC Berkeley, Stanford, UCSD
Chatbot Arena is an open platform for evaluating LLMs based on human preferences. It employs a crowdsourced approach to collect questions for anonymous randomized battles, addressing the limitations of static dataset benchmarks, and uses carefully designed statistical methods to ensure the credibility and efficiency of evaluations.
arXiv
Summary
03-07 Yi: Open Foundation Models by 01.AI
Institution: 01.AI
The paper successfully introduces the Yi-34B model, performing comparably to GPT-3.5 in both performance and efficiency, and provides detailed descriptions of innovative approaches to pre-training large language models and their instruction fine-tuning.
arXiv
Summary
03-05 ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary
Institution: Tsinghua University
The ChatCite system is designed to overcome the challenges faced by LLMs in generating literature reviews. It enables an LLM agent to more effectively understand, summarize, and compare different research works, thus producing organized and comparative literature reviews.
arXiv
Summary
03-05 Design2Code: How Far Are We From Automating Front-End Engineering?
Institution: Stanford University, Georgia Tech, Microsoft
The paper formalizes and benchmarks the Design2Code task to assess the capability of current multimodal LLMs in converting visual designs into code, finding that GPT-4V performs best, offering a new paradigm for automating front-end development.
arXiv
Summary
03-05 MathScale: Scaling Instruction Tuning for Mathematical Reasoning
Institution: The Chinese University of Hong Kong (Shenzhen), Microsoft Research Asia, Shenzhen Research Institute of Big Data
MathScale proposes a scalable approach to creating high-quality mathematical reasoning data and introduces a new comprehensive benchmark, MWPBENCH, to fully evaluate the mathematical reasoning capabilities of LLMs, thereby significantly enhancing the models' performance in solving mathematical problems.
arXiv
Summary

2024-02

 Date   Paper Links & Summary
02-29 Resonance RoPE: Improving Context Length Generalization of Large Language Models
Institution: DIRO Université de Montréal, Mila - Quebec AI Institute, Huawei Noah’s Ark Lab
This paper presents Resonance RoPE, an improved position-embedding scheme that enhances performance on long texts based on an analysis of RoPE feature wavelengths. It also introduces the POSGEN benchmark to assist in the study and evaluation of position embeddings in long-text tasks.
arXiv
Summary
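For context, standard RoPE (the scheme whose feature wavelengths Resonance RoPE analyzes) rotates consecutive feature pairs by position-dependent angles. A minimal NumPy sketch (our own simplification, not the paper's code):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply standard rotary position embedding (RoPE).

    x: (seq_len, dim) array with an even dim; positions: (seq_len,) positions.
    Each consecutive feature pair (x[2i], x[2i+1]) is rotated by the angle
    pos * theta_i, where theta_i = base**(-2i/dim) fixes that pair's wavelength.
    """
    _, dim = x.shape
    half = dim // 2
    theta = base ** (-2.0 * np.arange(half) / dim)   # (half,) per-pair frequencies
    angles = positions[:, None] * theta[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The key property is that the dot product between a rotated query and key depends only on their relative offset; wavelength-based analyses like Resonance RoPE start from how each pair's theta_i behaves beyond the training length.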
02-29 SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code Generation
Institution: Peking University
This paper introduces SEED, an adaptation method using error-driven learning, enabling LLMs to learn efficiently with fewer samples for code generation tasks, achieving better performance and generalization.
arXiv
Summary
02-29 Beyond Language Models: Byte Models are Digital World Simulators
Institution: Microsoft Research Asia
The paper showcases the potential of bGPT in handling challenging byte-level data simulation tasks, particularly highlighting its capabilities in cross-modal knowledge transfer and digital world simulation. This reveals the broad applicability and flexibility of byte models in digital media data processing and understanding.
arXiv
Summary
02-29 StarCoder 2 and The Stack v2: The Next Generation
Institution: ServiceNow, Hugging Face
The paper presented the development process of The Stack v2 and StarCoder2, a work focused on large-scale pre-training and instruction fine-tuning for code. Researchers significantly enhanced the performance of code LLMs, especially in handling low-resource programming languages and tasks requiring code reasoning, by integrating diverse data sources and a meticulously designed training process.
arXiv
Summary
02-27 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Institution: Microsoft, University of Chinese Academy of Sciences
The paper presents the BitNet b1.58 model, which is a 1.58-bit quantized Large Language Model that is comparable in performance to traditional full-precision LLMs while being more efficient and energy-saving.
arXiv
Summary
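As background, the "1.58-bit" format corresponds to ternary weights in {-1, 0, +1} (log2(3) ≈ 1.58 bits). A minimal sketch of the absmean ternary quantization described for BitNet b1.58 (function name and example values are ours):

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with an absmean scale.

    Following the absmean scheme described for BitNet b1.58: divide by the
    mean absolute weight, then round and clip to ternary values. Returns the
    ternary matrix and the scale needed to dequantize (w ≈ w_t * scale).
    """
    scale = np.abs(w).mean() + eps
    w_t = np.clip(np.rint(w / scale), -1, 1)
    return w_t, scale

w = np.array([[0.9, -0.05, -1.2], [0.3, 0.0, 2.1]])
w_t, s = ternary_quantize(w)
w_hat = w_t * s  # dequantized approximation of the original weights
```

With ternary weights, matrix multiplication reduces to additions and subtractions, which is where the claimed efficiency and energy savings come from.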
02-27 EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Institution: Alibaba Group
The EMO framework enhances the realism and expressiveness of generated videos through a direct audio-to-video synthesis method, significantly surpassing existing technologies and marking a significant advance in the field of video synthesis.
arXiv
Summary
02-27 When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Institution: Google DeepMind
The paper provides significant insights into the impact of factors such as data size, model size, and finetuning methods on the performance of LLMs during the finetuning phase, defining a new framework for evaluation.
arXiv
Summary
02-27 REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering
Institution: Gaoling School of Artificial Intelligence Renmin University of China, School of Information Renmin University of China
The paper presented the REAR framework, which focuses on enhancing the ability of LLMs to utilize external knowledge in QA tasks by adding self-awareness of document relevance and has proven its effectiveness over previous methodologies.
arXiv
Summary
GitHub
02-27 Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization
Institution: Zhejiang University, Institute of Software Chinese Academy of Sciences, Nanjing University of Posts and Telecommunications
Agent-Pro represents a new type of LLM-based intelligence agent that can learn and develop strategies in interactive environments through policy-level reflection and optimization, addressing the issue of existing works' inability to learn through interaction and adapt.
arXiv
Summary
02-27 Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Institution: OpenAI
This review article provides an insight into Sora—a large vision model, discussing its technological features, innovative aspects, current limitations, and potential opportunities for future applications. Sora's capabilities signify progressive strides made by large vision models, including long video generation and processing of diverse video formats.
arXiv
Summary
02-26 LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments
The study introduced the LLMARENA benchmark to assess the capabilities of LLM agents in complex multi-agent settings, highlighting existing issues and advancing future research directions, including capabilities in multimodal dynamic contexts and the potential use of external tools.
arXiv
Summary
02-26 Do Large Language Models Latently Perform Multi-Hop Reasoning?
Institution: Google DeepMind, UCL, Google Research
This research examines LLMs’ potential for latent multi-hop reasoning, proposing new methods for evaluating latent multi-hop reasoning capabilities and indicating strong evidence of multi-hop reasoning for certain types of relational prompts in LLMs, though highly context-dependent.
arXiv
Summary
02-26 Improving LLM-based Machine Translation with Systematic Self-Correction
Institution: Zhejiang University, Tencent, Angelalign Technology Inc.
The paper successfully introduced the first LLM-based self-correcting translation framework named TER, and demonstrated its effectiveness in improving translation quality across various language pairs and models. It opened new horizons in the field of machine translation, especially for the use of self-correction in translations between high-resource, low-resource languages, and translations involving different central languages.
arXiv
Summary
02-25 ChatMusician: Understanding and Generating Music Intrinsically with LLM
Institution: Hong Kong University of Science and Technology
The paper made substantial progress in an under-researched domain by creating the first music pre-training dataset and assessment benchmark for language models, enhancing LLMs' performance in understanding and generating music.
arXiv
Summary
02-23 Genie: Generative Interactive Environments
Institution: Google DeepMind, University of British Columbia
Genie is an interactive environment model capable of generating new videos and controlling the content of the videos through user inputs, bridging the gap between traditional video generation technologies and interactive experiences.
arXiv
Summary
02-23 ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
arXiv
02-22 Automating psychological hypothesis generation with AI: when large language models meet causal graph
arXiv
02-22 Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments
arXiv
02-22 CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
Institution: Tsinghua University, University of Hong Kong
The paper evaluates LLMs' critique and correction reasoning abilities through CRITICBENCH, exploring key factors influencing these competencies, aiming to foster further research in LLM critique and self-improvement.
arXiv
Summary
02-22 OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
arXiv
02-21 User-LLM: Efficient LLM Contextualization with User Embeddings
USER-LLM is a framework that contextualizes LLMs using user embeddings. It addresses the complexities of user data and the challenges of processing long sequences, improving the usability of LLMs in personalized applications while being computationally efficient.
arXiv
Summary
02-21 AgentScope: A Flexible yet Robust Multi-Agent Platform
Institution: Alibaba Group
AgentScope is a versatile platform for building multi-agent applications, emphasizing usability and customizability, particularly catered to developers with varying skill levels. By implementing fault tolerance and supporting multimodal data processing, as well as optimizing distributed operations, AgentScope significantly reduces the complexity of developing and deploying multi-agent systems, promoting wider participation and innovation.
arXiv
Summary
GitHub
02-20 Instruction-tuned Language Models are Better Knowledge Learners
Institution: FAIR at Meta, Carnegie Mellon University, University of Washington
The paper introduces a method called pre-instruction-tuning (PIT), which effectively improves the ability of LLMs to absorb knowledge from documents, addresses the "perplexity curse," and makes significant strides in multi-domain knowledge acquisition.
arXiv
Summary
02-20 TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Institution: AWS AI Labs, The University of Texas at Austin, KAIST
The article introduces TOFUEVAL, a new assessment benchmark for evaluating the factual consistency of LLMs in generating topic-focused dialogue summaries. The study uncovered extensive factual errors in the summaries generated by LLMs of varying sizes within the domain of dialogue.
arXiv
Summary
02-19 AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Institution: Fudan University, Multimodal Art Projection Research Community, Shanghai AI Laboratory
AnyGPT is a multimodal language model architecture that achieves seamless conversion and unified processing across modalities through discrete sequence modeling, delivering the ability to generate from any modality to any other without needing alterations to the current LLM architecture or training paradigms. It efficiently processes and generates high-quality multimodal content, with performance comparable to specialized models.
arXiv
Summary
02-16 Speculative Streaming: Fast LLM Inference without Auxiliary Models
arXiv
02-16 FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
Institution: The University of British Columbia & Invertible AI
The paper presents a multimodal Large Language Model suite named FinTral, optimized for financial analysis. The model's performance was showcased against existing models and demonstrated its advanced capabilities in multi-task contexts within the financial sector, especially in handling zero-shot tasks and reducing hallucination phenomena.
arXiv
Summary
02-16 SPAR: Personalized Content-Based Recommendation via Long Engagement Attention
Institution: The University of British Columbia, Meta
The SPAR framework effectively uses long-term user engagement histories to enhance the accuracy of personalized content recommendations and surpasses the existing state-of-the-art across multiple performance metrics.
arXiv
Summary
02-15 How to Train Data-Efficient LLMs
Institution: Google DeepMind, University of California San Diego, Texas A&M University
The ASK-LLM and DENSITY techniques proposed in the paper optimize the data efficiency of large language models, effectively enhancing the speed and quality of model training and performing well under resource constraints.
arXiv
Summary
02-15 Chain-of-Thought Reasoning Without Prompting
Institution: Google DeepMind
This work uncovers that by changing the decoding strategy, one can naturally elicit reasoning from pre-trained LLMs, with CoT paths being more prevalent in tasks frequently represented in the pre-training data. The introduced CoT-decoding method significantly enhances model performance on various reasoning benchmarks without the need for manual prompts.
arXiv
Summary
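The CoT-decoding idea can be illustrated with a toy sketch (an assumed simplification with made-up numbers, not the paper's implementation): instead of one greedy pass, branch on the top-k first tokens, continue each branch greedily, and keep the branch whose answer tokens show the largest probability margin between the top-1 and top-2 candidates.

```python
def answer_confidence(token_margins):
    """Average top1-top2 probability gap over one branch's answer tokens."""
    return sum(token_margins) / len(token_margins)

def cot_decode(branches):
    """branches: list of (decoded_text, answer_token_margins).

    Return the text of the branch with the most confident answer span; in the
    paper's observation, branches that surface a reasoning chain tend to have
    higher answer confidence than the direct greedy path.
    """
    return max(branches, key=lambda b: answer_confidence(b[1]))[0]

branches = [
    ("... the answer is 5", [0.31, 0.28]),                      # greedy path
    ("2 plus 3 is 5, so the answer is 5", [0.92, 0.88]),        # CoT-style path
]
print(cot_decode(branches))  # → "2 plus 3 is 5, so the answer is 5"
```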
02-15 A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Institution: Google DeepMind, Google Research
ReadAgent is an LLM agent system inspired by human reading processes, which significantly enhances performance and scalability by generating gist memories and retrieving information as needed for tasks involving long contexts.
arXiv
Summary
02-14 Premise Order Matters in Reasoning with Large Language Models
Institution: Google DeepMind
The paper studies how the ordering of premises influences LLMs on reasoning tasks, assessing the impact with the newly created R-GSM benchmark. It reveals that LLMs are extremely sensitive to premise ordering, with a substantial effect on performance.
arXiv
Summary
02-09 InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning
Institution: Shanghai AI Laboratory, Tsinghua University, Fudan University School of Computer Science
The InternLM-Math model is a mathematical reasoning tool based on LLMs that integrates various capabilities and provides supervised learning to help the model achieve state-of-the-art performance in various mathematical reasoning tasks, with code and data made open-source. The paper also explores a new approach to solving mathematical problems with the programming language LEAN within a multi-task learning setup, showcasing the potential of LLMs in formalized and code-assisted reasoning.
arXiv
Summary
GitHub
02-02 LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs in Autonomous Driving
Institution: Shanghai Artificial Intelligence Laboratory, College of Control Science and Engineering Zhejiang University
LimSim++ is the first closed-loop evaluation platform specifically developed for (M)LLM-driven autonomous driving. It overcomes the limitations of current simulation platforms and validates its effectiveness in various complex traffic scenarios through experimentation.
arXiv
Summary
02-02 K-Level Reasoning with Large Language Models
arXiv
02-02 AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback
Institution: Tsinghua University, Ant Group
The AMOR framework integrates reasoning logic based on a finite state machine (FSM) and a process feedback mechanism, showcasing how an open-source LLM-based knowledge agent can reason and adapt with human oversight, enhancing the model's capabilities in performing knowledge-intensive tasks.
arXiv
Summary
02-02 MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models
Institution: UNC Chapel Hill
This paper introduces a new method called MAGDi, which significantly enhances the reasoning abilities and generalization capacity of smaller models through structured distillation of reasoning interactions between multiple LLMs, while reducing costs.
arXiv
Summary
02-02 Reasoning Capacity in Multi-Agent Systems: Limitations, Challenges and Human-Centered Solutions
Institution: Megagon Labs, Carnegie Mellon University
This paper introduces the concept of reasoning capacity in multi-agent systems to improve optimization and evaluation and explores the potential of human feedback to enhance system reasoning capabilities.
arXiv
Summary
02-01 HR-MultiWOZ: A Task Oriented Dialogue (TOD) Dataset for HR LLM Agent
Institution: Amazon, University of Milano-Bicocca
This paper introduces a new resource, HR-MultiWOZ, a Task-Oriented Dialogue Dataset for an HR LLM Agent. It tackles the problem of a lack of high-quality training datasets for building and evaluating HR LLM agents while providing a cost-effective data generation methodology that serves as a valuable asset and benchmark for subsequent research in the field.
arXiv
Summary
02-01 Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing
Institution: Nanyang Technological University, Institute for Infocomm Research A*STAR, Salesforce Research
The paper proposes a novel offline training framework focused on improving the reliability and accuracy of Large Language Models in complex reasoning tasks through trajectory collection and direct preference optimization based on outcome supervision, without the need for teacher models or human annotations. The results on two logical reasoning benchmarks prove the effectiveness of the proposed method.
arXiv
Summary
02-01 Can Large Language Models Understand Context?
Institution: Georgetown University, Apple
This paper introduces a context understanding benchmark to assess the contextual understanding abilities of Large Language Models (LLMs). The benchmark encompasses the elements required for understanding context both in documents and dialogue bases, and uses innovative testing methods and experimental analysis to showcase the abilities and limitations of LLMs in understanding context.
arXiv
Summary
02-01 Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration
Institution: University of Washington, University of California Berkeley, The Hong Kong University of Science and Technology
This article focuses on identifying knowledge gaps in large language models (LLMs) and abstaining from answering questions when necessary. The study proposes two novel multi-LLM collaboration methods, which showed through comparative experiments that they can effectively improve the ability of LLMs to abstain from generating outputs with low confidence.
arXiv
Summary

2024-01

 Date   Paper Links & Summary
01-31 LongAlign: A Recipe for Long Context Alignment of Large Language Models
Institution: Tsinghua University, Zhipu.AI
The paper proposes a novel recipe, LongAlign, for the long context alignment of LLMs, by constructing a long instruction dataset, adopting new training strategies, and introducing evaluation benchmarks, enhancing the LLMs' ability to handle lengthy contexts. The code, data, and long-aligned models are open-sourced.
arXiv
Summary
GitHub
01-30 Efficient Tool Use with Chain-of-Abstraction Reasoning
Institution: Meta
The paper proposes a novel Chain-of-Abstraction reasoning approach that effectively enhances LLMs' capability to use external tools and expedites the reasoning process. Experimental results demonstrate its effectiveness and efficiency in multi-step reasoning tasks.
arXiv
Summary
01-30 Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
Institution: Shanghai Jiao Tong University, Carnegie Mellon University, Shanghai Artificial Intelligence Laboratory
SCALEEVAL is an innovative meta-evaluation framework designed to evaluate the trustworthiness and efficiency of LLMs as evaluators. It incorporates multi-agent LLM debate and minimal human supervision into the evaluation process, providing flexibility and scalability, with experimental results showing high consistency with purely human evaluations.
arXiv
Summary
GitHub
01-30 Recovering Mental Representations from Large Language Models with Markov Chain Monte Carlo
Institution: Princeton University, University of Warwick
The article demonstrated an effective increase in efficiency and performance by integrating LLMs into sampling algorithms and using Direct Sampling along with MCMC to extract mental representations, exploring the potential for Bayesian inference with LLMs.
arXiv
Summary
01-30 Incoherent Probability Judgments in Large Language Models
Institution: Princeton University
The paper investigates the coherence of probability judgments made by large language models, finding biases comparable to systemic deviations in human cognition. It quantified incoherence using probabilistic identities and repetition of judgments. The hypothesis presented connects the human-like biases observed when LLMs make probability judgments to their autoregressive training objectives, supported by potential links between the Bayesian Sampler model and autoregressive processes within LLMs.
arXiv
Summary
01-29 Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis
Institution: Harbin Institute of Technology
This research introduces an LLM-based automatic diagnostic method—Multi-Specialist Agent Consultation Model (AMSC), which better simulates the diagnostic process in the real world and improves diagnosis accuracy and efficiency by integrating predictions from multiple specialized agents.
arXiv
Summary
01-29 SelectLLM: Can LLMs Select Important Instructions to Annotate?
Institution: University of Minnesota, Carnegie Mellon University
This work introduces a novel method SELECTLLM for using LLMs to select unlabeled high-quality instructions, challenging traditional selection algorithms and enhancing selection efficiency while maintaining the global structure of the dataset. The experiments demonstrate superior performance on instruction-tuning benchmarks.
arXiv
Summary
GitHub
01-29 LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning
Institution: Nanyang Technological University
LLM4Vuln is an innovative framework that significantly enhances LLMs' performance in code vulnerability analysis by providing a vector database of vulnerability knowledge, tool invocation capabilities, custom CoT prompt schemes, and structuring outputs using instructionally proficient models.
arXiv
Summary
01-28 PRE: A Peer Review Based Large Language Model Evaluator
The PRE model presented in this paper provides a novel framework for automatically evaluating LLMs by simulating the peer review system commonly used in academia, significantly lowering costs and exhibiting increased generalizability and reliability.
arXiv
Summary
01-27 MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
Institution: Hong Kong University of Science and Technology
The paper developed the MultiHop-RAG dataset to expose the limitations of existing Retrieval-Augmented Generation (RAG) systems in handling multi-hop queries that require retrieval and reasoning over multiple pieces of evidence. It also provided experimental results demonstrating these limitations and released the dataset to encourage further research and development.
arXiv
Summary
GitHub
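The kind of multi-hop retrieval loop that MultiHop-RAG stresses can be sketched with a toy keyword retriever (documents, scoring, and function names are all invented for illustration): each hop's retrieved evidence is folded into the query for the next hop.

```python
DOCS = {
    "d1": "Acme Corp was founded by Jane Roe.",
    "d2": "Jane Roe was born in Lyon.",
}

def overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def multi_hop(question, hops):
    """Iteratively retrieve, folding each hop's evidence into the next query."""
    query, evidence, seen = question, [], set()
    for _ in range(hops):
        doc = max((d for d in DOCS.values() if d not in seen),
                  key=lambda d: overlap(query, d))
        seen.add(doc)
        evidence.append(doc)
        query = question + " " + doc   # next hop searches with the new evidence
    return evidence

# "Where was the founder of Acme Corp born?" needs two hops: d1, then d2.
print(multi_hop("Where was the founder of Acme Corp born", hops=2))
```

A single-hop retriever scores d2 too low on the original question; only after d1 surfaces "Jane Roe" does the second hop find the birthplace, which is exactly the failure mode the benchmark measures.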
01-26 EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Institution: Peking University, Microsoft Research, University of Waterloo
The paper proposes a new framework named EAGLE to increase the auto-regressive decoding speed of Large Language Models (LLMs) while maintaining the consistency of the generated text distribution with the original LLMs. EAGLE has significantly improved upon speculative sampling methods in reducing time overhead and increasing draft acceptance rate, offering faster acceleration compared to Lookahead and Medusa, with low training cost and ease of deployment.
arXiv
Summary
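EAGLE builds on speculative sampling; the sketch below shows only the standard accept/reject step that such methods share (a generic illustration, not EAGLE's feature-level drafting):

```python
import numpy as np

def speculative_step(p, q, rng):
    """One accept/reject step of standard speculative sampling.

    p: target-model distribution, q: draft-model distribution, both over the
    same vocabulary. The draft proposes a token from q; it is accepted with
    probability min(1, p[x]/q[x]), otherwise a token is resampled from the
    residual max(p - q, 0), renormalized. The output is distributed exactly
    according to p, so acceleration never changes the generated distribution.
    """
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = np.maximum(p - q, 0.0)
    return rng.choice(len(p), p=residual / residual.sum())
```

Speed-up comes from the draft being cheap and its proposals being accepted often; EAGLE's contribution is a drafting scheme that raises that acceptance rate.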
01-25 True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning
Institution: Nanyang Technological University, Zhejiang University
The TWOSOME framework effectively aligns LLMs with embodied environments using RL, improving sample efficiency and task generalization while retaining LLMs' original functionality.
arXiv
Summary
01-25 Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning
Institution: Columbia University, Microsoft Research, University of California Berkeley
The EC-Finetuning method has successfully increased the consistency of explanations generated by LLMs and demonstrated its ability to generalize to unseen datasets, showing a 10.0% relative improvement in explanation consistency on fine-tuning datasets and a 4.5% improvement on out-of-distribution datasets, along with moderate improvements in prediction accuracy.
arXiv
Summary
GitHub
01-25 ConstraintChecker: A Plugin for Large Language Models to Reason on Commonsense Knowledge Bases
Institution: HKUST
ConstraintChecker is an independent plugin tool that effectively enhances the performance of LLMs in CSKB reasoning tasks. It helps LLMs to perform better in reasoning by providing and checking explicit constraints and has shown to outperform other advanced prompting techniques in validated metrics.
arXiv
Summary
GitHub
01-24 Can AI Assistants Know What They Don't Know?
Institution: Fudan University, Shanghai Artificial Intelligence Laboratory
This paper focuses on AI assistants' capacity to recognize their own knowledge boundaries. By constructing an Idk ("I don't know") dataset and aligning the assistant to it, the authors enable AI assistants to recognize and admit what they don't know, reducing factual errors in their responses.
arXiv
Summary
01-24 AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
Institution: The University of Hong Kong, Zhejiang University, Shanghai Jiao Tong University
Researchers introduced a new benchmark, AGENTBOARD, for evaluating multi-turn capable large language model agents, providing a granular progress rate and interactive analysis tools to deepen the understanding of LLM agent performance.
arXiv
Summary
01-24 Clue-Guided Path Exploration: An Efficient Knowledge Base Question-Answering Framework with Low Computational Resource Consumption
Institution: Tsinghua University, Zhongguancun Laboratory, XinJiang University
The CGPE framework supports the application of LLMs to question-answering tasks through a clue-guided path exploration mechanism, lowering the capability requirements placed on the LLM and significantly reducing computational resource consumption, which is of practical significance for individuals and organizations with limited computational resources.
arXiv
Summary
01-24 Consistency Guided Knowledge Retrieval and Denoising in LLMs for Zero-shot Document-level Relation Triplet Extraction
Institution: Nanjing University of Science and Technology, Northeastern University, Singapore Institute of Technology
The paper presents a new Zero-shot Document-level Relation Triplet Extraction (ZeroDocRTE) framework that generates labeled data by retrieving and denoising knowledge from LLMs, significantly improving document-level relation triplet extraction through a series of novel methods.
arXiv
Summary
01-24 MM-LLMs: Recent Advances in MultiModal Large Language Models
Institution: Tencent AI Lab, Kyoto University, Mohamed Bin Zayed University of Artificial Intelligence
arXiv
Summary
01-23 CCA: Collaborative Competitive Agents for Image Editing
The paper presents a new generative model based on multiple Large Language Models (LLMs), capable of handling complex image editing tasks and enhancing the quality and robustness of the results. Encouraging collaborative competition among agents, the model demonstrates capabilities exceeding traditional methods, especially in managing complex tasks and learning from intermediate steps to refine outcomes.
arXiv
Summary
GitHub
01-23 AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
Institution: Google DeepMind
The paper describes a system named AutoRT that uses large foundation models to control real-world robots to autonomously navigate and perform tasks. It marks the first instance of LLM-controlled robots operating autonomously in real-world settings, proposing their own goals, and taking actions toward those goals. The data collected by AutoRT is not only diverse but can improve the performance of robot learning models and be aligned with human preferences.
arXiv
Summary
01-23 KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Institution: Samsung R&D Institute India - Bangalore
KAM-CoT is a multimodal Chain-of-Thought reasoning framework that integrates CoT reasoning, knowledge graphs, and multiple modalities. It outperforms state-of-the-art approaches with fewer trainable parameters, showcasing superior performance and cost-efficiency.
arXiv
Summary
01-23 Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment
Institution: Alibaba Inc.
The paper proposes DITTO, a self-alignment method that enhances LLMs' role-play capabilities through knowledge augmentation and dialogue simulation. It also provides a reproducible, explainable, and efficient role-play evaluation method and explores the dissection of role-play through cross-supervision experiments, offering an in-depth understanding and insights into building role-play functions for LLMs.
arXiv
Summary
GitHub
01-22 Improving Small Language Models' Mathematical Reasoning via Mix Thoughts Distillation
Institution: Institute of Information Engineering, Chinese Academy of Sciences
Through EoTD and MTD, this paper shows that LLMs' mathematical reasoning capabilities can be distilled into Small Language Models (SLMs) with fewer than one billion parameters. The methods preserve and enhance SLMs' reasoning abilities, enabling them to achieve state-of-the-art performance on reasoning tasks. This advancement opens the door for broader applications of SLMs in resource-constrained environments, bridging the gap between the demand for powerful reasoning models and computational resource limitations.
arXiv
Summary
01-22 PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
Institution: Shanghai Artificial Intelligence Laboratory, Dalian University of Technology
The article presents PsySafe, a comprehensive framework for the safety of multi-agent systems, integrating psychological-based approaches for attack, defense, and evaluation. The experimental outcomes provide deeper insights into understanding and researching the safety issues of multi-agent systems.
arXiv
Summary
01-22 CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
Institution: Stanford University, Stability AI
This paper addresses the challenges in automated CXR interpretation by introducing a large dataset specifically designed for CXR interpretation, developing a novel foundation model, and creating a comprehensive evaluation benchmark. It demonstrates the superior performance of CheXagent in various assessment tasks compared to other models and takes an important stride towards transparency by examining potential biases within the model, providing valuable insights for future research and applications.
arXiv
Summary
01-21 Interactive AI with Retrieval-Augmented Generation for Next Generation Networking
Institution: Nanyang Technological University, Guangdong University of Technology, Institute for Infocomm Research, Agency for Science Technology and Research
This paper explores the integration of interactive AI (IAI) with next-generation networking, using retrieval-augmented generation (RAG) and large language models (LLMs) to enhance decision-making capabilities, demonstrated through real network optimization case studies.
arXiv
Summary
01-20 BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Institution: University of Illinois Urbana-Champaign, University of Washington, Western Washington University
This article proposes BadChain, a backdoor attack on LLMs using CoT prompting that requires no access to training datasets or model parameters and incurs low computational overhead. The method effectively exposes the security vulnerabilities of LLMs under CoT prompting and underscores the importance of studying such backdoor attacks and designing effective defenses.
arXiv
Summary
01-19 Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
Institution: MIT
The paper shows how LLMs can be made more resistant to "jailbreak" attacks from a safety alignment perspective through Wanda pruning, without the need for fine-tuning, and validates model performance through a constructed dataset and evaluation system.
arXiv
Summary
01-19 Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning
Institution: ShanghaiTech University, Meituan, UniDT
Tool-LMM stands out as the first system aimed at training a large multi-modal model to learn tool agency, innovatively integrating multi-modal inputs with the correct selection of external tools, overcoming ambiguity in text, and showcasing the ability to automatically select appropriate tools in response to multi-modal instructions.
arXiv
Summary
GitHub
01-19 Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment
Institution: Sun Yat-sen University, Tencent AI Lab
This paper introduces an innovative KCA method that reduces the inconsistency between external and intrinsic knowledge, thereby mitigating hallucinations in LLMs during alignment. The study offers several insights for future research, notably the excellent performance of the KCA method across various scenarios and the combination of its simplicity and effectiveness.
arXiv
Summary
GitHub
01-19 Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Institution: Princeton University, Together AI, University of Illinois Urbana-Champaign
The paper presents Medusa, an efficient method for accelerating LLM inference by adding multiple decoding heads that predict multiple tokens in parallel, substantially reducing the number of decoding steps and significantly improving the inference speed of large models.
arXiv
Summary
01-18 ChatQA: Building GPT-4 Level Conversational QA Models
Institution: NVIDIA
The ChatQA model significantly improves the effectiveness of multi-turn conversational QA through a two-stage instruction tuning strategy, particularly in context understanding and information retrieval.
arXiv
Summary
01-18 Self-Rewarding Language Models
Institution: Meta, NYU
This work introduces Self-Rewarding Language Models intended to bypass the bottleneck of human preference data by self-training to enhance the model's self-rewarding and instruction-following capabilities. The experimental results are promising, setting a precursor for models that can continuously improve themselves.
arXiv
Summary
01-18 Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation
Institution: The University of Tokyo, RIKEN
This research innovatively incorporates an explicit reasoning process and question-generation ability into LMMs, promoting more reliable inferences. By creating a new dataset and leveraging it for model training, it sets a precedent for future advancements in LMMs and enables the model to generate explicit reasoning steps and questions when faced with uncertainty.
arXiv
Summary
01-18 A Fast, Performant, Secure Distributed Training Framework For Large Language Model
Institution: Ant Group China
This paper presents a secure distributed training framework based on model slicing, which prevents leakage of model parameters and data on both server and client sides while preserving training accuracy and high efficiency.
arXiv
Summary
01-17 ReFT: Reasoning with Reinforced Fine-Tuning
Institution: ByteDance Research
ReFT significantly enhances the performance and generalization ability of LLMs in math problem-solving tasks by optimizing non-differentiable objectives through reinforcement learning. It transcends traditional supervised learning methods and shows potential for more complex reasoning tasks.
arXiv
Summary
01-17 LLMs for Relational Reasoning: How Far are We?
Institution: Continental-NTU Corporate Lab, Nanyang Technological University, Singapore
The paper primarily examines the capacities and constraints of large language models in the area of relational reasoning. Through extensive assessments, including novel testing procedures and an evaluation module, the findings indicate that while LLMs perform reasonably well on certain relational reasoning tasks, they are outperformed by models specifically designed for logical reasoning.
arXiv
Summary
01-17 Vlogger: Make Your Dream A Vlog
Institution: Shanghai Jiao Tong University, Shanghai AI Laboratory, Shenzhen Institute of Advanced Technology Chinese Academy of Sciences
This paper presents the innovative use of LLMs in the production of video blogs, addressing the challenges of creating minute-scale coherent video content and delivering exceptional experimental results.
arXiv
Summary
GitHub
01-16 SpecGen: Automated Generation of Formal Program Specifications via Large Language Models
Institution: Nanjing University, Nanyang Technological University, Singapore Management University
The paper presents SpecGen, an automated formal program specification generation technique that combines Large Language Models with a heuristic selection strategy. By comparison with existing tools and purely LLM-based methods, SpecGen showcases superior efficiency and accuracy in specification generation and offers a dataset to facilitate future research.
arXiv
Summary
01-16 RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture
Institution: Microsoft
The paper studies the performance of large language models on agricultural data for Q&A pair generation and presents a new pipeline that efficiently utilizes RAG and fine-tuning techniques to enhance LLM applicability in specific industries, expanding the potential for LLMs' application in targeted sectors.
arXiv
Summary
01-16 MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline
Institution: Alibaba Group
The paper presents a new math reasoning dataset combined with a Python code interpreter, significantly improving LLM performance on math problem-solving tasks through dataset enhancement and specific fine-tuning protocols.
arXiv
Summary
01-16 Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models
Institution: Tencent AI Lab
The article delves into analyzing the domain mismatch problem of LLMs in machine translation tasks and experiments with the impact of varying amounts of parallel data on LLM translation capabilities, showcasing the potential of LLMs in addressing these challenges.
arXiv
Summary
GitHub
01-16 DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models
Institution: Zhejiang University
DoraemonGPT is an LLM-driven agent that employs symbolic memory and a set of tools to understand and answer complex questions involving dynamic videos. It leverages an MCTS planner to optimize the process of generating answers, enabling it to handle more complex tasks in real-world scenarios.
arXiv
Summary
01-16 Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
Institution: Johns Hopkins University, Microsoft
This paper introduces CPO, a novel LLM fine-tuning method that effectively overcomes the bottlenecks of SFT for MT tasks and achieves significant performance gains in moderate-sized LLM translation models with minimal resource expenditure, rivaling the most advanced state-of-the-art translation systems.
arXiv
Summary
01-15 MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models
Institution: Microsoft Research India
This study investigates the performance of large language models on multilingual tasks following parameter-efficient fine-tuning, especially in the context of low-resource languages and English tasks. It demonstrates the potential of PEFT and highlights areas for future work.
arXiv
Summary
01-15 The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey
Institution: Technology Innovation Institute UAE, Islamic University of Technology Bangladesh, Stanford University, Amazon GenAI, AI Institute University of South Carolina
The paper is a detailed survey on context length extension techniques in LLMs. It provides an organized overview of current strategies and challenges for researchers in the field and encourages discussions on future advancements.
arXiv
Summary
01-15 A Study on Large Language Models' Limitations in Multiple-Choice Question Answering
Institution: David R. Cheriton School of Computer Science
The study investigates the limitations of LLMs in MCQ tasks, highlighting poor performance by most models in such tasks. It also finds model answers often depend on the order of options and proposes effective assessment methods to eliminate these biases. The paper recommends exercising caution when using MCQs to evaluate LLMs and testing whether models truly understand the task at hand.
arXiv
Summary
01-14 Small LLMs Are Weak Tool Learners: A Multi-LLM Agent
Institution: Sun Yat-sen University, Alibaba Group
The study reveals the weakness of small LLMs as tool learners and introduces the α-UMi multi-LLM framework, which outperforms the single-LLM approach. It highlights a crucial two-stage fine-tuning strategy and delves into data-scaling laws.
arXiv
Summary
01-13 Bridging the Preference Gap between Retrievers and LLMs
The paper presents the BGM framework to address the "preference gap" between retrievers and LLMs. Through a seq2seq bridge model and a combined SL and RL training scheme, the framework optimizes the retrieved information to fit LLMs' preferences, improving performance in multiple downstream tasks.
arXiv
Summary
01-12 APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding
Institution: Tsinghua University, Zhipu AI
The research presents APAR as a method that significantly enhances the decoding efficiency and generation speed of LLMs in both memory-limited and high-throughput scenarios while maintaining generation quality, providing a potent new approach for deploying large language models efficiently.
arXiv
Summary
01-12 An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models
Institution: University of Washington Seattle, University of Wisconsin-Madison, Stanford University
The paper proposes an experimental design framework intended to improve the label efficiency of large language models during supervised fine-tuning (SFT). It shows that experimental design techniques can significantly increase label efficiency at low computational cost, saving up to 50% of annotation costs on some tasks compared to random sampling.
arXiv
Summary
01-12 TestSpark: IntelliJ IDEA's Ultimate Test Generation Companion
Institution: JetBrains Research, Delft University of Technology
The paper introduces the TestSpark plugin, which integrates search-based software test generation and language model-based methods to enhance the efficiency of generating and integrating unit tests in IntelliJ IDEA, while also addressing the compilability issue of tests generated by LLMs. The open-source nature of the plugin facilitates the bridging between software developers and researchers, contributing to the practical advancement of test generation technologies.
arXiv
Summary
GitHub
01-12 Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation
Authors: Tianyu Zheng, Shuyue Guo, Xingwei Qu, Jiawei Guo, Weixu Zhang, Xinrun Du, Chenghua Lin, Wenhao Huang, Wenhu Chen, Jie Fu, Ge Zhang
The paper presents the Kun strategy, addressing the data consistency issue in Chinese large language model instruction fine-tuning, reducing dependency on manual annotation through the AP process and new data generation methods. The evaluation results indicate that the Kun strategy has a significant advantage in creating high-quality datasets.
arXiv
Summary
GitHub
01-12 From Automation to Augmentation: Large Language Models Elevating Essay Scoring Landscape
Institution: Tsinghua University, University of Maryland, Beijing Xicheng Educational Research Institute
This research showcases the potential of large language models in the field of education, especially within AES systems. LLMs not only have the ability to automate scoring processes but also enhance the performance of human graders through generated feedback. This advancement offers valuable insights for the future of AI-assisted education and efficient collaboration between AI and humans.
arXiv
Summary
01-12 How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Institution: Virginia Tech, Renmin University of China, UC Davis
This paper presents a novel perspective on studying AI safety by humanizing LLMs, applying over a decade of social science research to AI safety, establishing a persuasion taxonomy, and creating a tool that automatically generates adversarial prompts. The results demonstrate the effectiveness of persuasion in increasing the likelihood of LLMs performing risky behaviors, while also revealing the insufficiency of current defense measures against such strategies.
arXiv
Summary
01-12 Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation
Institution: Nanyang Technological University, Fudan University
This paper presents TOOLGEN, a novel approach that integrates autocompletion tools into the repository-level code generation process of LLMs, resolving dependency issues and boosting both the quality and success rate of code generation.
arXiv
Summary
01-11 Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Institution: Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences
This paper provides a comprehensive overview of the risk taxonomy, mitigation measures, and assessment benchmarks for large language model systems, offering a new systematic framework to help developers more comprehensively understand and deal with the potential risks of LLM systems.
arXiv
Summary
01-11 TOFU: A Task of Fictitious Unlearning for LLMs
Institution: Carnegie Mellon University
The paper provides a new dataset and evaluation mechanisms for the issue of unlearning in LLMs. The TOFU task highlights the deficiencies of current unlearning techniques and encourages further improvements and research.
arXiv
Summary
01-11 Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models
Institution: Google Research, Tel Aviv University
The paper presents a framework named Patchscopes, offering a novel approach to interpret the information encoded in the hidden representations of large language models (LLMs) and to correct multi-hop reasoning errors. Patchscopes serves as a general modular framework, unifying existing interpretative tools and addressing their deficiencies, while also paving the way for new research and application opportunities.
arXiv
Summary
01-11 Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint
Institution: Gaoling School of Artificial Intelligence, Renmin University of China; School of Information, Renmin University of China; Kuaishou Technology, Beijing, China
This paper presents RLMEC, a novel RL method that employs generative reward models with a minimum editing mechanism, enabling precise supervision and stability in training large language models with RL.
arXiv
Summary
GitHub
01-11 Chain of History: Learning and Forecasting with LLMs for Temporal Knowledge Graph Completion
Institution: Tsinghua Shenzhen International Graduate School Tsinghua University, School of Computer Science Peking University, Baidu Inc.
The paper presents a method for temporal knowledge graph completion utilizing large language models. Efficient fine-tuning and structure-aware historical data augmentation improve the model's reasoning capabilities and performance. Experiments demonstrate that this approach effectively enhances the precision of temporal knowledge graph predictions, achieving state-of-the-art results.
arXiv
Summary
01-11 Evidence to Generate (E2G): A Single-agent Two-step Prompting for Context Grounded and Retrieval Augmented Reasoning
Institution: Qatar Computing Research Institute
This paper introduces Evidence to Generate (E2G), a single-agent, two-step prompting framework aimed at improving the context-grounded reasoning abilities of LLMs. By prompting LLMs to generate evidence and explanations alongside answers, E2G reduces erroneous reasoning and improves accuracy across a variety of reasoning tasks. Experiments show that E2G outperforms CoT on multiple context-intensive language tasks.
arXiv
Summary
01-11 LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
Institution: LAIR Lab Lehigh University, Huazhong University of Science and Technology
This study defines the mixed LLM-human text (mixcase) found in mixed-authorship scenarios, creates the MIXSET dataset, and offers insights and directions for detecting such text. It reveals that existing detectors fall short in recognizing mixcase, underlining the urgent need for more fine-grained detectors.
arXiv
Summary
GitHub
01-11 EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction
Institution: Fudan University, Microsoft Research Asia, Zhejiang University
This paper proposes EASYTOOL, a method that enhances LLM-based agents' performance in tool usage by simplifying and unifying instructions from tool documentation, addressing the issues of inconsistency, redundancy, and incompleteness.
arXiv
Summary
GitHub
01-11 The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models
Institution: Johns Hopkins University
The study demonstrates that concise Chain-of-Thought (CCoT) prompting can significantly reduce the length of text outputs in large language models without compromising performance in problem-solving tasks.
arXiv
Summary
01-10 InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
InfiAgent-DABench offers a novel benchmarking tool that not only aids in measuring the performance of intelligent agents in data analysis tasks but also represents an essential step in exploring how to improve and optimize the application of LLMs in this specific domain.
arXiv
Summary
GitHub
01-10 Prompting Large Language Models for Recommender Systems: A Comprehensive Framework and Empirical Analysis
Institution: Renmin University of China, Beijing Key Laboratory of Big Data Management and Analysis Methods, Meituan Group
This work introduces a framework named ProLLM4Rec, offering a systematic analysis of utilizing Large Language Models (LLMs) as foundation models for recommender systems, and experimentally tests the impact of different conditions on LLMs. The empirical findings are summarized to provide insights for future research.
arXiv
Summary
01-10 Leveraging Print Debugging to Improve Code Generation in Large Language Models
Institution: Zhejiang University, ByteDance
The paper proposes a methodology for using print debugging to guide LLMs in code generation and debugging, validating its effectiveness on the Leetcode dataset, especially for easy and medium complexity problems. Despite limited success with hard-level problems, this work represents a significant advancement in the field of LLMs for code debugging.
arXiv
Summary
01-10 AUTOACT: Automatic Agent Learning from Scratch via Self-Planning
Institution: Zhejiang University, Alibaba Group
This research introduces AUTOACT, a framework for autonomous learning of language agents through self-instruction and self-planning to tackle the challenge of learning new tasks from scratch. The key contributions lie in its effective data augmentation method and the highly efficient automatic agent learning process.
arXiv
Summary
GitHub
01-10 Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
Institution: Tsinghua University, Xiaomi AI Lab
As a survey work, the paper presents the current status, challenges, and future trends of personal LLM agents and proposes a generic system architecture and intelligence level definition.
arXiv
Summary
01-10 Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk
Institution: AWS AI Labs
The paper presents a novel approach for generating training data by enabling LLMs to conduct self-talk dialogues, which has the potential to improve the performance of task-oriented dialogue agents. Despite certain limitations, the findings suggest that high-quality dialogues can serve as a strong training signal for LLMs, validating the idea of LLMs' capacity to self-improve when trained on their own generated content, leading to better performance in task-oriented dialogue settings.
arXiv
Summary
01-10 Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing
Institution: Google Research
The paper successfully proposes a new memory-based transformer method that effectively reduces memory demands and supports bidirectional attention through storage eviction policies and the ATTENDRE layer, demonstrating performance on par with traditional methods in long-sequence processing.
arXiv
Summary
01-10 CASA: Causality-driven Argument Sufficiency Assessment
Institution: Peking University
This paper introduces a zero-shot Causality-driven Argument Sufficiency Assessment framework (CASA) based on LLMs, which effectively tackles challenges in quantifying and intervening in argument sufficiency without observational data and demonstrates its effectiveness in practical applications.
arXiv
Summary
GitHub
01-09 Agent Alignment in Evolving Social Norms
Institution: Fudan University
This paper introduces an EvolutionaryAgent framework to assess and enhance the adaptiveness and alignment of large intelligent agents in dynamic and constantly evolving societal norms. The research highlights the significance of agent alignment with societal norms during evolution and validates the framework's efficacy through experiments.
arXiv
Summary
01-09 Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs
Institution: Zhejiang University, Ant Group
The paper presents a new method named ARALLM that combines analogical reasoning and multi-task model distillation to effectively enhance LLMs' ability to understand and transform natural language into structured logical expressions. This method allows non-expert marketers to use natural language for user targeting, which potentially changes the practice of user targeting. The improvement in this capability not only has practical value in marketing scenarios but also contributes valuable exploration to the functionality and practicality of large language models.
arXiv
Summary
01-09 Large Language Models for Robotics: Opportunities, Challenges, and Perspectives
Institution: Northwestern Polytechnical University, University of Georgia, Shaanxi Normal University
The multimodal GPT-4V framework proposed in the paper, which combines NLP and visual perception, aims to tackle challenges faced by LLMs in robotic task planning. It holds significant implications for advancing human-machine interaction and shaping the future of intelligent AI systems.
arXiv
Summary
01-09 Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
Institution: University of California San Diego, Google Cloud AI Research, Google Research
The paper introduces the innovative CHAIN-OF-TABLE framework, which enhances reasoning capabilities of LLMs by explicitly incorporating tabular data into the reasoning chain, dynamically planning and updating the process, thereby increasing accuracy and reliability for table-based reasoning tasks.
arXiv
Summary
01-09 Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search
Institution: Nanyang Technological University Singapore
ReCo significantly enhances code search accuracy by utilizing LLMs to rewrite code in the codebase through style normalization and introduces a new metric, CSSim, to quantify stylistic differences, advancing research in code style normalization.
arXiv
Summary
01-09 The Critique of Critique
Institution: The Hong Kong Polytechnic University, Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory
METACRITIQUE is the first framework to evaluate natural language critiques, assessing the quality of critiques using principles of precision and recall, and has achieved a high level of interpretability and transparency.
arXiv
Summary
GitHub
01-08 SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems
Institution: Fudan University
The paper proposed a multi-modal large language model-based multi-agent system—SpeechAgents, capable of simulating human communication scenarios involving up to 25 agents, exhibiting exceptional scalability. By utilizing multi-modal signals as the medium for agent communication, the system not only can simulate dialogues with correct content, authentic rhythm, and rich emotions but also can be applied to tasks such as drama creation and the generation of audio novels.
arXiv
Summary
01-08 MARG: Multi-Agent Review Generation for Scientific Papers
Institution: Northwestern University, The Hebrew University of Jerusalem, Allen Institute for AI
This paper presents an innovative multi-agent review generation method (MARG) capable of overcoming the context size limitations of the base model and of generating high-quality peer-review feedback for scientific papers. The quality of feedback generated by MARG significantly surpasses the baselines in user studies and automated metrics, with a 2.2-fold increase in the number of helpful comments and a greater generation of specific comments.
arXiv
Summary
01-08 TTMs: Fast Multi-level Tiny Time Mixers for Improved Zero-shot and Few-shot Forecasting of Multivariate Time Series
Institution: IBM Research
TTM demonstrates the effectiveness and transfer learning capabilities of tiny pretrained models that are exclusively trained on diverse time series data for improved multivariate time series forecasting in few/zero-shot scenarios.
arXiv
Summary
01-07 Grimoire is All You Need for Enhancing Large Language Models
Institution: Beihang University, Renmin University of China
The paper introduces a method named SLEICL that significantly enhances the ICL capability of weak language models by learning and transferring skills from strong language models. The effectiveness of the method is validated through experiments, demonstrating the potential of this technology in enhancing weak language models' context learning abilities.
arXiv
Summary
01-07 Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
Institution: Beijing Academy of Artificial Intelligence, Renmin University of China, Nankai University
The paper introduces Activation Beacon, a new technique to extend the context length of Large Language Models, enabling the perception of extensive context within a limited context window, while fully preserving capability on short contexts. Activation Beacon provides an effective, efficient, compatible, and low-training-cost method for extending LLMs' context length.
arXiv
Summary
01-07 Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects
Institution: The Chinese University of Hong Kong, DeepWisdom, Peking University
The paper presents a framework for guiding future research and development of LLM-based intelligent agent systems, explores different methods of improving their planning capabilities, multimodal information processing, and how to address the challenges faced by LLM agents, offering a clear guide for future research directions.
arXiv
Summary
01-07 ChatGPT for Conversational Recommendation: Refining Recommendations by Reprompting with Feedback
Institution: University of Louisville, Microsoft
This paper explores the efficacy of ChatGPT as a conversational recommendation system. It develops a process around ChatGPT that simulates real-user scenarios and mitigates popularity bias.
arXiv
Summary
01-06 CogGPT: Unleashing the Power of Cognitive Dynamics on Large Language Models
Institution: Harbin Institute of Technology, Kuaishou Technology
CogGPT addresses challenges faced by large language models in emulating human cognitive dynamics by introducing an iterative cognitive mechanism and a memory retention system, showcasing impressive performance in continuous information processing.
arXiv
Summary
01-06 Quartet Logic: A Four-Step Reasoning (QLFR) framework for advancing Short Text Classification
Institution: Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Target Cognition and Application Technology; University of Chinese Academy of Sciences
This study introduces the Quartet Logic: Four-Step Reasoning (QLFR) framework for short text classification, along with a CoT-Driven Multi-task Learning (QLFR-CML) method. Both approaches use the chain-of-thought reasoning of large language models to address challenges in the STC field. Experimental results confirm their effectiveness and applicability.
arXiv
Summary
01-06 The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models
Institution: Renmin University of China, Université de Montréal
The paper provides a systematic empirical study to deeply understand and explore the problem of hallucinations in large language models, identifying the sources of hallucination, detection methods, mitigation strategies, and proposing the new benchmark HaluEval 2.0 and a simple yet effective hallucination detection framework.
arXiv
Summary
GitHub
01-05 Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Institution: Alibaba Group, Shanghai Jiao Tong University
The paper presents an efficient cloud service system for long-context language models. The distributed algorithm DistAttention optimizes the processing and storage of the attention module, while the DistKV-LLM service system manages and coordinates the distributed KV cache, achieving efficient allocation and management of resources in a distributed environment and demonstrating significant performance improvements.
arXiv
Summary
01-05 From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models
Institution: Beike Inc.
The paper introduces the RAISE framework, which enhances the performance of LLMs in multi-turn dialogues, especially in real estate sales contexts, by incorporating an augmented memory system and a structured agent construction process.
arXiv
Summary
01-04 LLM Augmented LLMs: Expanding Capabilities through Composition
Institution: Google Research, Google DeepMind
The paper presents a new framework for model extension - CALM, which successfully integrates two large language models to perform new tasks and demonstrates its effectiveness across multiple experiments.
arXiv
Summary
01-04 Using LLM to select the right SQL Query from candidates
Institution: Peking University
This research proposes a method for automatically generating test cases for text-to-SQL using LLMs and presents a three-step re-ranking process. The method significantly improves the performance of existing text-to-SQL models, as evidenced by experiments.
arXiv
Summary
01-04 ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers
Institution: Bytedance Inc.
This paper introduces a methodology, ICE-GRT, designed to enhance the depth and accuracy of LLMs in handling domain-specific tasks. By incorporating reinforcement learning from human feedback, ICE-GRT significantly improves domain-specific capabilities without sacrificing general task performance, achieving state-of-the-art in several assessment tasks.
arXiv
Summary
01-04 Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives
Institution: Zhejiang University, OPPO Research Institute
This paper introduces a new strategy called "Self-Contrast" to address issues of stubbornness and inconsistency in reflection and self-correction processes within Large Language Models (LLMs). By creating diverse solving perspectives, contrasting different solutions, and summarizing disparities into a checklist, it enhances the quality of LLM reflection. The approach's effectiveness and broad applicability are validated through experiments.
arXiv
Summary
01-04 SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval
Institution: Columbia University
This paper proposes SPEER, a sentence-level planning method through embedded entity retrieval for long document tasks of hospital discharge summaries. It guides large language models (LLMs) to better cover key entities and generate more complete and credible clinical summaries. The research demonstrates that the SPEER method can improve document coverage and accuracy in practical applications, thereby reducing the documentation burden on clinicians.
arXiv
Summary
01-04 On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS)
Institution: University of South Carolina, New Mexico State University, IBM Research
This survey examines the prospects of integrating Large Language Models (LLMs) with Automated Planning and Scheduling (APS) across eight planning problem categories, moving beyond the traditionally limited context adaptability of classical planners toward a more dynamic, context-aware planning pathway, and laying a foundation for further application and research.
arXiv
Summary
01-03 MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries
Institution: Indian Institute of Technology Patna, Stanford University, Amazon GenAI
MedSumm presents a novel approach for multimodal medical question summarization, integrating textual and visual information to create medically detailed summaries potentially enhancing the quality of healthcare decision-making and deepening the understanding of patient queries.
arXiv
Summary
01-03 Social Media Ready Caption Generation for Brands
Institution: Adobe Research India
The paper introduces a new framework designed to aid brands in creating engaging captions on social media that align with their brand image and personality. The framework, which consists of two parts, successfully addresses the challenge of generating socially engaging and relevant captions for brands.
arXiv
Summary
01-02 LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
The paper presents a method for extending the context window of LLMs without fine-tuning, which is crucial for improving large language models' ability to process long texts when computational resources are limited.
arXiv
Summary
01-02 A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
Institution: Islamic University of Technology Bangladesh, University of South Carolina, Stanford University
This paper offers an exhaustive survey on hallucination mitigation techniques in LLMs, proposing a categorization framework and systematic feedback and reasoning methods, and assesses the efficacy and impact of these techniques.
arXiv
Summary
01-01 From Prompt Engineering to Prompt Science With Human in the Loop
Institution: University of Washington
The paper demonstrates how to transition prompt engineering for LLMs into a more scientific and systematic prompt science. By incorporating a qualitative coding method analogous to the human-in-the-loop approach, it ensures the quality and consistency of the responses generated by the LLM while eliminating individual subjectivity and randomness.
arXiv
Summary
01-01 A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models
Institution: The Chinese University of Hong Kong, Tencent AI Lab
This work proposes LogicAsker, addressing the challenge of evaluating and improving the logical reasoning abilities of LLMs through comprehensive assessment and effective enhancement via problem generation and in-context learning.
arXiv
Summary
01-01 The Earth is Flat? Unveiling Factual Errors in Large Language Models
Institution: The Chinese University of Hong Kong, Tencent AI Lab
The FactChecker introduced in this paper provides a new automated framework for testing factual inaccuracies in large language models and has been shown to uncover and reduce factual errors in these models through the construction of knowledge graphs and the generation of test questions.
arXiv
Summary

2023-12

 Date   Paper Links & Summary
12-31 BatchEval: Towards Human-like Text Evaluation
Institution: Beijing Institute of Technology, Xiaohongshu Inc
The paper introduces a novel LLM evaluation paradigm, BatchEval, that addresses the issues of robustness and consistency with human judgment in automatic text evaluation. By implementing batch-wise evaluation and iterative processing, BatchEval significantly surpasses existing methods in terms of accuracy and cost-efficiency.
arXiv
Summary
12-31 Improving Text Embeddings with Large Language Models
Institution: Microsoft Corporation
The paper introduces an innovative text embedding approach utilizing the latest LLMs and synthetic data, matching performance on competitive benchmarks with fewer than 1,000 training steps and no labeled data, offering strong evidence for further advancements in text embedding technology.
arXiv
Summary
12-29 Enhancing Quantitative Reasoning Skills of Large Language Models through Dimension Perception
Institution: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; School of Data Science, Fudan University; DataGrand Co., Ltd.
This research has significantly improved LLMs' quantitative reasoning abilities by establishing a dimensional unit knowledge base and a customized benchmark test, providing a new pathway for understanding and reasoning accurately with vital quantitative information in text.
arXiv
Summary
12-29 Building Efficient Universal Classifiers with Natural Language Inference
Institution: Vrije Universiteit Amsterdam, Royal Holloway University of London, Hugging Face
The paper provides a novel approach to universal text classification using natural language inference, complete with detailed steps and tools needed to implement the method, significantly increasing model efficiency without compromising performance.
arXiv
Summary
12-29 DB-GPT: Empowering Database Interactions with Private Large Language Models
Institution: Alibaba Group
This paper presents DB-GPT, an innovation integrating LLMs and database systems to enhance user experience and accessibility, demonstrating a hierarchical design that effectively addresses concerns such as privacy and security protection, while also elevating the system's overall performance and efficiency through multi-source RAG and adaptive ICL mechanisms.
arXiv
Summary
GitHub
12-29 Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning
arXiv
GitHub
12-29 The Right Prompts for the Job: Repair Code-Review Defects with Large Language Model
Institution: Ant Group, Nanjing University
The research explores the application of LLMs in repairing code review defects, introduces an effective semi-automated APR paradigm, analyzes the performance of 9 popular models, and designs effective prompts to guide the code repair process.
arXiv
Summary
12-28 Improving In-context Learning via Bidirectional Alignment
Institution: Nanyang Technological University, Princeton University, Salesforce Research USA
The paper introduced Bidirectional Alignment (BiAlign), which effectively improves the ICL abilities of smaller models by integrating a new ranking loss along with aligning the output distribution.
arXiv
Summary
12-28 Experiential Co-Learning of Software-Developing Agents
Institution: Tsinghua University, Dalian University of Technology, Beijing University of Posts and Telecommunications
The paper proposes a new framework named Experiential Co-Learning, in which co-tracking, co-memorizing, and co-reasoning modules are applied sequentially so that LLM-driven intelligent agents learn more effectively from historical trajectories and draw on past experiences to reason jointly when solving new tasks. It shows a clear performance improvement over existing techniques.
arXiv
Summary
12-28 Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs
Institution: Chinese University of Hong Kong, Tencent AI Lab
This paper presents a new evaluation paradigm that challenges LLMs to engage in meta-reasoning, and it introduces the accompanying open-source benchmark DiagGSM8K, adding a new dimension to the evaluation of LLMs' cognitive abilities.
arXiv
Summary
12-28 Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos
Institution: Tsinghua University
This paper presents the Grounding-Prompter method, addressing the TSG challenge in long videos by combining LLM with temporal reasoning and multimodal information, demonstrating the effectiveness of prompting LLM with multimodal data, and validating its superiority in TSG tasks for long videos through experiments.
arXiv
Summary
12-28 DrugAssist: A Large Language Model for Molecule Optimization
Institution: Tencent AI Lab, Department of Computer Science Hunan University
DrugAssist is a model that facilitates molecule optimization through human-machine interaction, overcoming the limited interactivity of LLM applications in drug discovery and showcasing superior multi-property optimization abilities.
arXiv
Summary
GitHub
12-28 Structured Packing in LLM Training Improves Long Context Utilization
Institution: University of Warsaw, Google DeepMind, Polish Academy of Sciences
This paper introduces the SPLICE method to enhance utilization of long-range contexts and validates its effectiveness in improving context utilization and performance on long-context tasks for large-scale language models. SPLICE is especially applicable for constructing training examples in training datasets that lack additional structured information.
arXiv
Summary
12-28 GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension
Institution: Tsinghua University, Renmin University of China
This paper introduces GITAGENT, an autonomous agent that can extend tools from GitHub to meet the varied demands of user queries. By addressing the challenge of non-standardization, GITAGENT autonomously learns human experience from GitHub Issues/PRs to overcome problems during tool extension, showing its effectiveness in autonomously integrating tools for task accomplishment across various domains.
arXiv
Summary
12-27 Conversational Question Answering with Reformulations over Knowledge Graph
Institution: University of Illinois at Urbana-Champaign, Amazon
CoRnNet is a novel RL model for ConvQA over knowledge graphs that leverages LLM-generated reformulations, showing superior performance over other advanced models.
arXiv
Summary
12-27 How Robust are LLMs to In-Context Majority Label Bias?
Institution: Amazon
The article conducts a comprehensive study on the robustness of LLMs when faced with majority label bias in ICL, finding significant stability in certain models in handling such bias.
arXiv
Summary
12-27 Rethinking Tabular Data Understanding with Large Language Models
Institution: UC San Diego, USC, UC Davis
The paper delves into the understanding and reasoning capabilities of LLMs over tabular data, contributing insights into the robustness of table structure, the comparison of textual versus symbolic reasoning, and the impact of aggregating multiple reasoning pathways on model performance. The proposed table structure normalization method and the mix self-consistency mechanism are instrumental in enhancing LLMs' performance in tabular data reasoning.
arXiv
Summary
12-27 Adapting Large Language Models for Education: Foundational Capabilities, Potentials, and Challenges
Institution: Shanghai Jiao Tong University (SJTU)
This paper is a survey on how to adapt large language models for the education system. It provides an overview of the development of LLMs in education-related capabilities, explores the potential and challenges in building such systems, and offers insights for future related research.
arXiv
Summary
12-26 KnowledgeNavigator: Leveraging Large Language Models for Enhanced Reasoning over Knowledge Graph
Institution: Northeastern University, Neusoft AI Magic Technology Research, Neusoft Institute of Intelligent Medical Research
The paper introduces KnowledgeNavigator, a novel framework designed to enhance LLM reasoning over knowledge graphs, addressing LLM's limitations in complex reasoning tasks. The effectiveness demonstrated by the experiments suggests potential for broader application of LLMs in high-risk and sensitive domains.
arXiv
Summary
12-26 Supervised Knowledge Makes Large Language Models Better In-context Learners
Institution: School of Engineering Westlake University, Westlake Institute for Advanced Study, Peking University
The SuperContext framework proposed in the paper significantly enhances the generalizability and factuality of LLMs in natural language understanding and question answering tasks by leveraging the supervised knowledge from task-specific fine-tuned SLMs. It represents an innovative approach to incorporating the strengths of small models into LLMs to deal with OOD data and minimize hallucinations.
arXiv
Summary
12-26 Align on the Fly: Adapting Chatbot Behavior to Established Norms
Institution: Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, The Hong Kong Polytechnic University
The research advances a dynamic OPO method that aligns LLMs with the complex and varying landscape of human values in real-time, using collected rules as external memory without further training. Despite limitations in inference efficiency and potential for retrieval model enhancements, extensive experiments across multiple evaluation datasets vouch for the method's effectiveness.
arXiv
Summary
GitHub
12-26 Aligning Large Language Models with Human Preferences through Representation Engineering
Institution: Fudan University
This paper introduces a novel RAHF method, which manipulates internal model representations through representation engineering techniques to align LLMs with human preferences. The method is computationally efficient, easy to implement, and shows potential in managing a spectrum of human preferences or values.
arXiv
Summary
12-26 RecRanker: Instruction Tuning Large Language Model as Ranker for Top-k Recommendation
Institution: City University of Hong Kong, The Chinese University of Hong Kong, Hangdian University
The paper presents a novel framework named RecRanker, which optimizes the performance of LLMs in top-k recommendation tasks through instruction tuning and effectively integrates signals from traditional recommendation systems, improving the model's application performance in recommendation scenarios.
arXiv
Summary
12-26 A Prompt Learning Framework for Source Code Summarization
Institution: Nanyang Technological University, Tencent Inc., Nanjing University
This paper introduces a novel PromptCS framework for source code summarization, capable of generating high-quality summaries while reducing training costs, and provides open-source code for further research.
arXiv
Summary
12-26 Scaling Down, LiTting Up: Efficient Zero-Shot Listwise Reranking with Seq2seq Encoder-Decoder Models
Institution: University of Waterloo
The paper introduces LiT5-Distill and LiT5-Score, two sequence-to-sequence encoder-decoder models for efficient zero-shot listwise reranking. These methods not only offer competitive performance but also address traditional reliance on large LLMs and external relevance labels, showcasing optimization and advancement in this domain.
arXiv
Summary
GitHub
12-26 Think and Retrieval: A Hypothesis Knowledge Graph Enhanced Medical Large Language Models
Institution: Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China
The HyKGE framework effectively addresses the accuracy and interpretability challenges faced by large language models in dealing with complex problems in the medical field, demonstrating potential for applications in the medical domain and showcasing its superiority in real-world scenarios.
arXiv
Summary
12-25 Alleviating Hallucinations of Large Language Models through Induced Hallucinations
Institution: Soochow University, Tencent AI Lab
The paper offers a novel method to reduce hallucinations in LLMs by constructing a factually weaker model and subtracting its knowledge in the generation process, improving the generation of factual content.
arXiv
Summary
12-25 ESGReveal: An LLM-based approach for extracting structured data from ESG reports
Institution: Alibaba Cloud, Tsinghua University, Sun Yat-Sen University
ESGReveal marks significant progress in ESG data processing, using large language models and related techniques to improve the consistency and accuracy of structured data extraction from corporate reports, thereby promoting improvements in ESG practices and transparency.
arXiv
Summary
12-22 Plan, Posture and Go: Towards Open-World Text-to-Motion Generation
Institution: Tsinghua University, Microsoft Research Asia
The researchers introduced a new framework named PRO-Motion to overcome limitations of traditional text-to-motion generation methods, successfully generating more diverse and realistic motions in open-world scenarios.
arXiv
Summary
12-22 NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
Institution: University of Michigan, Rutgers University
This paper presents a novel method of assessing the reasoning abilities of LLMs through the NPHardEval benchmark. The benchmark covers a broad range of problems from polynomial time complexity to NP-Hard levels, and it features a dynamic data updating mechanism to prevent model overfitting, ensuring reliable and authentic assessment results. The findings significantly advance the understanding of current capabilities of LLMs and pave the way for improving the reasoning abilities of these models.
arXiv
Summary
GitHub
12-22 VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
Institution: University of Waterloo, IN.AI Research
The paper presents an evaluation framework called VIEScore that provides explainable evaluations for conditional image generation tasks. VIEScore addresses the inability of existing automated metrics to explain their scoring rationale and is adaptable to various task requirements.
arXiv
Summary
12-22 A Survey of Reinforcement Learning from Human Feedback
Institution: LMU Munich, Duke Kunshan University
This article is a survey of RLHF, analyzing its applications at the crossroads of artificial intelligence and human-computer interaction and discussing the latest research trends, especially those related to LLMs.
arXiv
Summary
12-22 Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
The paper is the first to characterize system performance for models that span across text, image, and video generation, revealing unique system properties distinct from traditional LLMs. It also highlights challenges and opportunities where traditional optimizations might need rethinking for TTI/TTV models.
arXiv
Summary
12-22 Large Language Model (LLM) Bias Index -- LLMBI
Institution: University of Oxford, University Canada West, Amazon Web Services (AWS)
The introduction of LLMBI marks a significant step towards creating fairer and more reliable LLMs. It provides a quantifiable measure of bias for system engineers and researchers, guiding them to continuously improve these powerful models and ensuring that they reflect society's diverse and evolving fabric.
arXiv
Summary
12-22 Reasons to Reject? Aligning Language Models with Judgments
Institution: Tencent AI Lab, The Chinese University of Hong Kong
The paper presents Contrastive Unlikelihood Training (CUT), a new framework for aligning LLMs directly from language feedback (judgments), and demonstrates its effectiveness in offline and online alignment as well as in further optimizing both unaligned (cold-start) and already aligned (warm-start) models. The research indicates that judgments hold greater potential than rewards for aligning LLMs and merit further investigation.
arXiv
Summary
12-22 YAYI 2: Multilingual Open-Source Large Language Models
Institution: Beijing Wenge Technology Co. Ltd., Institute of Automation Chinese Academy of Sciences
The paper presents YAYI 2, a large language model optimized for multilingual scenarios, which significantly improves performance on various tasks, especially in Chinese-related tasks, by pre-training on a large corpus and aligning with human values through multiple approaches.
arXiv
Summary
12-22 Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning
Institution: Huawei Noah's Ark Lab, University College London, University of Oxford
The paper introduces the Pangu-Agent framework, which addresses the challenges faced by standard RL methods in multi-task environments. By integrating structured reasoning through intrinsic functions and enabling fine-tuning through supervised learning and RL, Pangu-Agent enhances the ability of agents to adapt across various environmental interactions.
arXiv
Summary
12-21 AppAgent: Multimodal Agents as Smartphone Users
Institution: Tencent
The study introduces an innovative multimodal agent framework allowing the agent to operate any smartphone application like a human user by learning new apps through autonomous exploration and observing human demonstrations. Findings demonstrate the framework's efficiency and adaptability in performing a variety of advanced tasks.
arXiv
Summary
12-21 The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Institution: MIT, Microsoft Research NYC
The paper introduces LASER, a layer-selective rank-reduction strategy that replaces the weight matrices of selected Transformer layers with low-rank approximations after training to enhance performance. The authors report that the strategy is not only effective but also, to their knowledge, the first demonstration that such carefully targeted reduction can improve Transformer model performance.
arXiv
Summary
12-21 De novo Drug Design using Reinforcement Learning with Multiple GPT Agents
Institution: Tsinghua University, Microsoft Research AI
The paper introduces a reinforcement learning algorithm with multiple GPT agents for drug molecular generation and demonstrates good performance and practicality in GuacaMol benchmark tests and in designing inhibitors for SARS-CoV-2 protein targets.
arXiv
Summary
GitHub
12-21 On Task Performance and Model Calibration with Supervised and Self-Ensembled In-Context Learning
Institution: Language Technology Lab University of Cambridge
This paper offers a comprehensive analysis of the performance and calibration of different learning methods in data-scarce scenarios. It indicates challenges in jointly achieving high performance and good calibration, but demonstrates that self-ensembling techniques can enhance model calibration without sacrificing performance, providing important guidelines for future LLMs applications.
arXiv
Summary
12-20 Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Institution: Ant Group
The paper presents the Lookahead inference acceleration framework, which uses a Trie-tree based multi-branch inferencing strategy to improve the inference speed of LLMs while maintaining the accuracy of generation. The framework's performance is validated through extensive experimentation and has been deployed in real-world scenarios at Alipay.
arXiv
Summary
12-20 Mini-GPTs: Efficient Large Language Models through Contextual Pruning
Institution: Massachusetts Institute of Technology
The paper demonstrates the process and results of developing Mini-GPTs, smaller yet efficient versions of GPT models, through contextual pruning. This method successfully reduced the size of LLMs across various domain-specific datasets while maintaining performance, proving that pruning techniques are not only theoretically viable but also practically valuable for developing resource-efficient, domain-specific LLMs.
arXiv
Summary
12-20 AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Institution: The University of Hong Kong, Shanghai Jiao Tong University, King’s College London
This paper presents a novel multi-agent-based solution for code generation, AgentCoder, which effectively solves the balance problem between code generation and testing through specific agents focused on code generation, test designing, and test execution, achieving code generation quality that outperforms existing SOTA methods.
arXiv
Summary
12-20 Lampr: Boosting the Effectiveness of Language-Generic Program Reduction via Large Language Models
Institution: University of Waterloo, The Hong Kong University of Science and Technology, Concordia University
Lampr is a pioneering algorithm that integrates LLMs into the program reduction process. Through a multi-level prompting method assisted by LLMs, it balances cross-language generality with language-specific semantic awareness, demonstrating superior performance in empirical evaluations.
arXiv
Summary
12-20 Time is Encoded in the Weights of Finetuned Language Models
The research introduces the concept of time vectors, showing how temporal variations can be encoded to some extent in language model weight space, and how weight interpolation can assist in tailoring models to new time periods.
arXiv
Summary
12-20 Generative Multimodal Models are In-Context Learners
Institution: Beijing Academy of Artificial Intelligence, Tsinghua University, Peking University
The paper successfully enhances the context learning capabilities of the multimodal generative model Emu2 by scaling up the model and achieves breakthrough results on a spectrum of multimodal understanding tasks, especially in visual question-answering and controllable visual generation after instruction tuning.
arXiv
Summary
12-19 A Revisit of Fake News Dataset with Augmented Fact-checking by ChatGPT
The paper presents ChatGPT-FC, the first publicly available benchmark dataset for fake news detection that combines human verification with ChatGPT assistance. A quantitative analysis compares human journalists and LLMs in fact-checking, highlighting the potential of LLMs to enhance the objectivity and reliability of news fact-checking processes.
arXiv
Summary
12-19 Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in ultra low-data regimes
Institution: University of Cambridge
This paper introduces CLLM, a novel methodology that combines the prior knowledge of Large Language Models with a robust data-centric approach to data augmentation, paving the way for the broader application of ML in data-deprived domains and regions.
arXiv
Summary
12-19 Active Preference Inference using Language Models and Probabilistic Reasoning
Institution: Cornell University, Cornell Tech
This study introduced a real-time algorithm that accelerates LLMs' inference of user preferences by generating informative questions, demonstrated to reduce user interaction and improve task performance in an online shopping scenario.
arXiv
Summary
12-19 Text-Conditioned Resampler For Long Form Video Understanding
Institution: University of Oxford, Google, Google DeepMind
This paper presents TCR, a novel architecture and pre-training method capable of processing long videos conditioned on textual prompts. It effectively bridges pre-trained visual encoders with LLMs, addressing the challenge of long-form video understanding and sets new best performance benchmarks across several evaluation tasks.
arXiv
Summary
12-18 Generalized Category Discovery with Large Language Models in the Loop
This paper presents an end-to-end active learning framework that incorporates Large Language Models into the training loop, significantly enhancing model performance on the task of generalized category discovery and autonomously generating category names.
arXiv
Summary
12-18 MAC-SQL: Multi-Agent Collaboration for Text-to-SQL
Institution: Beihang University, Tencent Cloud AI
Overall, the MAC-SQL framework addresses key challenges in the Text-to-SQL task through collaborating intelligent agents, tackling issues such as managing extensive databases, handling complex queries, and verifying and correcting SQL. The released open-source SQL-Llama model shows promising results and has the potential to perform comparably to proprietary models like GPT-4.
arXiv
Summary
GitHub
12-18 NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation
Institution: University of Waterloo, Huawei Noah’s Ark Lab, FEEC-Unicamp Brazil
This work introduces the NoMIRACL dataset, providing a multilingual tool for assessing robustness in LLMs during retrieval-augmented generation, and showcases challenges that LLMs face in differentiating between relevant and non-relevant retrieval results, highlighting the need for future research to improve LLM robustness.
arXiv
Summary
12-18 Agent-based Learning of Materials Datasets from Scientific Literature
Institution: University of Toronto
This paper showcases the capability of an LLM-based agent to autonomously learn and extract materials datasets from scientific literature. Eunomia demonstrated effective entity and relation extraction without any fine-tuning and can be enhanced to avoid errors in complex tasks.
arXiv
Summary
GitHub
12-18 "Paraphrasing The Original Text" Makes High Accuracy Long-Context QA
Institution: Tsinghua University
The paper presents a low-cost, effective approach to extending the capability of existing language models to handle long texts, significantly improving accuracy in long-context question answering by theoretical demonstration and experimental validation.
arXiv
Summary
12-18 Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows
Institution: University of Washington, Stanford University, Allen Institute for AI
The paper introduces a design space framework and three case studies adapting crowdsourcing workflows to LLM chains, providing practical guidance and theoretical insights for the future design and development of LLM chains.
arXiv
Summary
12-18 G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
Institution: Huawei Noah's Ark Lab, The University of Hong Kong, The Hong Kong University of Science and Technology
This paper overcomes the limitations of multimodal large language models in solving geometric problems by constructing the Geo170K dataset and developing the G-LLaVA model based on it, achieving better performance than existing state-of-the-art models.
arXiv
Summary
12-18 Social Learning: Towards Collaborative Learning with Large Language Models
Institution: Google, EPFL
The paper presents a novel framework for knowledge transfer in LLMs—social learning, and provides solutions for privacy protection. The framework allows for knowledge exchange between models using natural language while preventing the leakage of sensitive information, and it validates its effectiveness and privacy-preserving capabilities through experimentation.
arXiv
Summary
12-18 From Google Gemini to OpenAI Q-Star: A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape
Institution: Cyberstronomy Pty Ltd, Academies Australasia Polytechnic, Massey University
This review extensively analyzes the development of the generative AI field and its reshaping effects on the research landscape, with a special focus on MoE multimodality learning and AGI prospects. The study spans a comprehensive taxonomy from AI model structures and training techniques to application domains and ethical considerations.
arXiv
Summary
12-18 Towards Better Serialization of Tabular Data for Few-shot Classification with Large Language Models
Institution: Carnegie Mellon University
The paper showcases an innovative application of Large Language Models to tabular data classification, centered on a new LaTeX serialization framework with serialization methods effective for domain-specific datasets, and explores LLMs' capability to interpret complex data relationships. The LaTeX serialization method not only enhances LLM performance on classification tasks but also significantly improves memory and computational efficiency.
arXiv
Summary
12-18 Retrieval-Augmented Generation for Large Language Models: A Survey
Institution: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Fudan University
The paper offers a thorough and systematic overview of the RAG domain, emphasizing the importance of enhancing the retrieval and generative capabilities of LLMs, highlighting current challenges, and envisioning future research directions.
arXiv
Summary
12-17 Distinguishing Translations by Human, NMT, and ChatGPT: A Linguistic and Statistical Approach
Institution: Shanghai Jiao Tong University
The study offers tentative answers to whether ChatGPT can serve as a translation tool alternative to NMT and showcases its distinctive properties compared with NMT and HT. These insights may inform the development of more human-like, contextually appropriate translation systems and offer guidance on how to use AI-generated translations effectively.
arXiv
Summary
12-17 Mixed Distillation Helps Smaller Language Model Better Reasoning
Institution: Zhejiang University, Dalian Medical University
The Mixed Distillation framework significantly enhanced smaller models' advanced reasoning capabilities by integrating PoT and CoT abilities from LLMs, specifically showing improved performance in mathematical reasoning tasks.
arXiv
Summary
12-16 RIGHT: Retrieval-augmented Generation for Mainstream Hashtag Recommendation
Institution: CAS Key Lab of Network Data Science and Technology, ICT, CAS; University of Chinese Academy of Sciences, Beijing, China
The paper presents a new retrieval-augmented generative system for mainstream hashtag recommendation (RIGHT), combining the strengths of retrievers, selectors, and generators to overcome existing methods' limitations in processing new information and identifying mainstream tags, and demonstrates significant experimental results.
arXiv
Summary
12-16 A Survey on Robotic Manipulation of Deformable Objects: Recent Advances, Open Challenges and New Frontiers
Institution: Tongji University, National Natural Science Foundation of China, Shanghai Municipal Science and Technology Major Project
This survey compiles recent advances, challenges, and new frontiers in the field of robotic manipulation of deformable objects (DOM). It notably emphasizes the initial progress of Large Language Models (LLMs) in robotic manipulation and points out important directions for further research in this area. While the review covers a broad range of literature and identifies future research directions, actual deployment examples and quantitative evaluations are limited.
arXiv
Summary
12-16 ProTIP: Progressive Tool Retrieval Improves Planning
Institution: Apple
The paper presents ProTIP, an advanced strategy for tool retrieval and use in complex planning tasks for large language models. The key to ProTIP lies in its progressive retrieval, effective use of execution history, and achieving subtask-tool functionality alignment. Experimental results demonstrate that ProTIP significantly outperforms traditional methods, reduces tool hallucination, and increases planning efficiency.
arXiv
Summary
12-16 CoAScore: Chain-of-Aspects Prompting for NLG Evaluation
Institution: GSAI, Renmin University of China
CoAScore is an innovative evaluation metric that improves the accuracy of NLG task assessments through a "chain of aspects" method, an approach that has been experimentally validated.
arXiv
Summary
12-16 RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models
Institution: Science Foundation Ireland (SFI), JSPS KAKENHI
This paper presents the RecPrompt framework, which optimizes news recommendation using LLMs. Through an iterative optimization process over manually written and LLM-generated prompt templates, news recommendation performance is significantly improved, particularly with LLM-generated prompt templates using GPT-4. However, this approach does not always outperform traditional recommendation methods and is significantly affected by the choice of LLM.
arXiv
Summary
12-15 ProCoT: Stimulating Critical Thinking and Writing of Students through Engagement with Large Language Models (LLMs)
Institution: Luleå University of Technology, Sweden
This paper introduces the ProCoT method, showing how LLMs can be harnessed to foster students' critical thinking and writing while preventing cheating. This method can help educators to make better use of these technological tools and cultivate students into better critical thinkers.
arXiv
Summary
12-15 Faithful Persona-based Conversational Dataset Generation with Large Language Models
Institution: University of Southern California, Google, Information Sciences Institute
The paper presents an LLM-based framework for generating, expanding, and updating large persona-based conversational datasets. By employing a Generator-Critic architecture and faithfulness criteria, the study successfully established the Synthetic-Persona-Chat dataset with enhanced dialogue quality.
arXiv
Summary
12-15 Challenges with unsupervised LLM knowledge discovery
Institution: Google DeepMind, Google Research
The paper challenges the capacity of existing unsupervised methods to elicit latent knowledge in LLMs through theoretical proofs and experimental validations, and provides sanity checks for evaluating future knowledge elicitation methods. Overall, the authors suspect that future unsupervised methods are likely to face similar issues, struggling to distinguish model knowledge from other features.
arXiv
Summary
12-15 Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision
Institution: OpenAI
arXiv
GitHub
Blog
12-15 ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
Institution: Google
This paper outlines creating an LLM agent capable of reasoning and interacting with external knowledge, along with a self-improvement algorithm that enables smaller models to perform comparably to large models in compositional question-answering benchmarks. The proposed method not only improves reasoning capabilities but also significantly reduces the required parameter count of the models.
arXiv
Summary
12-15 The Art of Balancing: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment
Institution: NLP Group Fudan University, Hikvision Inc
The paper introduces a model called LoRAMoE to address the problem of world knowledge forgetting in language models due to massive increases in fine-tuning data and shows potential in multi-task learning.
arXiv
Summary
12-15 Generative Context-aware Fine-tuning of Self-supervised Speech Models
Institution: ASAPP, Carnegie Mellon University, Toyota Technological Institute at Chicago
The paper presents a new fine-tuning method for self-supervised speech models that leverages text generated by large language models as context to enhance task performance. It provides a way to reduce dependence on extra large models and resource usage during inference without compromising on performance.
arXiv
Summary
12-15 No-Skim: Towards Efficiency Robustness Evaluation on Skimming-based Language Models
Institution: Fudan University
This paper is the first to systematically study the vulnerability of skimming-based language models from the perspective of efficiency and proposes No-Skim, an effective and general efficiency robustness evaluation framework that generates adversarial inputs to increase computational complexity. Additionally, the framework is modularized to accommodate different plug-in modules, enabling evaluations to be conducted across three different knowledge levels.
arXiv
Summary
12-15 GSVA: Generalized Segmentation via Multimodal Large Language Models
Institution: Tsinghua University
The GSVA method proposed in the paper solves the challenges of multi-target and empty targets in GRES tasks by learning to predict multiple [SEG] tokens and innovatively generating [REJ] tokens, demonstrating significant advantages over existing technologies.
arXiv
Summary
12-15 KGLens: A Parameterized Knowledge Graph Solution to Assess What an LLM Does and Doesn't Know
Institution: Apple
The paper introduces KGLens, a new framework for assessing factual knowledge in LLMs. KGLens generates natural language questions using the KG structure for evaluations and is aided by a parameterized KG and a graph-guided QG strategy to improve the quality of natural question generation and the efficiency of the assessment process.
arXiv
Summary
12-14 Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Institution: Tencent AI Lab, Seattle
The Zebra model proposed in this paper effectively lowers computational and memory requirements by utilizing grouped local-global attention layers, exhibiting excellent performance in processing both long and short sequences. The research team validated the model through various experiments, proving the advantages of the Zebra architecture.
arXiv
Summary
12-14 Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
Institution: CUHK-SenseTime Joint Laboratory, Shanghai AI Laboratory, Tsinghua University
Auto MC-Reward is an advanced learning system that uses LLMs to automatically design dense rewards for Minecraft tasks. By leveraging LLMs' abilities to understand tasks and summarize experience, it effectively improves agents' learning of new behaviors and completion of long-term tasks in complex environments.
arXiv
Summary
12-14 The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
Institution: Tsinghua University, Stanford University, Nanyang Technological University
This paper is the first to thoroughly investigate the robustness of LLMs against factual misinformation in a persuasive conversation setting, revealing the susceptibility of LLMs to persuasive misinformation.
arXiv
Summary
12-14 Towards Verifiable Text Generation with Evolving Memory and Self-Reflection
Institution: Peking University, Chinese Academy of Sciences, Baidu Inc
VTG improves the reliability and verifiability of text generated by LLMs through an evolving memory and self-reflection approach, effectively addressing challenges of complex attention shifting and document retrieval. It has been validated through experiments.
arXiv
Summary
12-14 TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning
Institution: National University of Singapore, University of Illinois Urbana-Champaign, Microsoft
The TAP4LLM framework proposed in this paper significantly enhances the performance of Large Language Models in tabular reasoning tasks. It operates by sampling, augmenting, and packing semi-structured data and can also serve as a plugin to further enhance LLMs' understanding of structured data.
arXiv
Summary
12-14 Entity-Augmented Code Generation
Institution: JetBrains
The paper proposes an innovative architecture for code generation with external entities. The architecture scales without sacrificing performance; by integrating the entity retriever into the decoder rather than the encoder, the model can inspect all entities at once and use them directly. The new architecture not only resolves the limitations of existing models but also demonstrates its superiority across several experimental scenarios.
arXiv
Summary
12-14 Math-Shepherd: A Label-Free Step-by-Step Verifier for LLMs in Mathematical Reasoning
Institution: Peking University, DeepSeek-AI, The University of Hong Kong
MATH-SHEPHERD successfully addresses the issue of costly human annotations by training LLMs with automatically generated supervision data, thereby enhancing the accuracy of LLMs in solving complex mathematical problems and opening up new avenues for the advancement and practical application of LLMs.
arXiv
Summary
12-14 Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent
Institution: Shanghai Jiao Tong University
The paper suggests enhancing LLMs' ability to solve complex mathematical problems through the MathAgent framework, namely Planner-Reasoner-Executor-Reflector (PRER). By breaking down the problems into phases and simulating human-like problem-solving processes, MathAgents significantly improve solving capabilities on challenging mathematical datasets, particularly in areas demanding higher estimation and synthesis skills.
arXiv
Summary
12-14 Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Institution: MIT
The paper provides insights into how the Llama-2-chat model handles competing objectives through the study of its behavior in the 'forbidden fact' task, introducing novel analytical methods in the process.
arXiv
Summary
12-14 Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning
Institution: Hong Kong University of Science and Technology, Microsoft Research
This paper presents CoT-Max, a method that enhances LLMs' mathematical reasoning capabilities using a coarse-to-fine pruning technique, effectively improving the effects of few-shot learning in math reasoning tasks.
arXiv
Summary
12-14 Self-Evaluation Improves Selective Generation in Large Language Models
Institution: Google DeepMind, Google Research
The paper presents a new method where LLMs are guided to self-evaluate in order to improve the calibration of the quality of their generative output in selective generation scenarios. Experiments show that this method enhances the accuracy and overall quality of the generated content by LLMs.
arXiv
Summary
12-14 Weight subcloning: direct initialization of transformers using larger pretrained ones
Institution: Apple
The paper introduces a powerful weight subcloning approach to initialize smaller transformer models using weights from larger pretrained ones, greatly accelerating training speed, and enabling efficient training of the new models even with limited computational resources.
arXiv
Summary
12-14 StemGen: A music generation model that listens
Institution: SAMI, ByteDance Inc.
The paper presents a new non-autoregressive language model approach for music generation, which optimizes the processing of multiple channels and the consistency between music and contextual information, and demonstrates, through objective and subjective assessments, the quality of the music generated and its alignment with contextual information.
arXiv
Summary
12-14 CogAgent: A Visual Language Model for GUI Agents
Institution: Tsinghua University, Zhipu AI
CogAgent breaks the limitation of pure text-based approaches by efficiently tackling the challenge of understanding and navigating GUIs with combined high and low-resolution image encoders and visual language models. The model achieves leading performance on nine visual question-answering benchmarks, propelling the future research and application of AI agents powered by advanced VLMs.
arXiv
Summary
GitHub
12-14 TinyGSM: achieving >80% on GSM8k with small language models
Institution: Carnegie Mellon University, Microsoft Research
This paper has successfully demonstrated that small language models can exceed an 80% accuracy rate on the GSM8K math problem reasoning benchmark by creating a synthetic dataset of math problems with corresponding Python solutions (TinyGSM), showing the feasibility of significant performance improvement of small models through high-quality datasets and verifier strategies.
arXiv
Summary
12-13 Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models
Institution: University of Southern California, Amazon.com Inc.
The paper presents BD-LLM, a new method to enhance the efficiency and transferability of LLMs in toxic content detection tasks, proposing the DToT method and optimizing model compression for more effective production deployment.
arXiv
Summary
12-13 Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision
Institution: Peking University
This paper proposes a knowledge-aware method for synthesizing images of ancient artifacts with LLM-enhanced prompting and multi-source supervision, overcoming the lack of domain knowledge in existing text-to-image synthesis methods and showing significant improvement in quality and historical knowledge alignment.
arXiv
Summary
GitHub
12-13 E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification
Institution: UC Riverside, Microsoft Research
This paper demonstrates the potential of LLMs in conducting pseudo-code static analysis and self-verification through the E&V method. The approach not only improves the flexibility and precision of static analysis but also reduces the human effort and specialized knowledge required to develop static analysis tools.
arXiv
Summary
12-13 LDM$^2$: A Large Decision Model Imitating Human Cognition with Dynamic Memory Enhancement
Institution: University of Chinese Academy of Sciences
The paper presents the LDM2 model, which incorporates a dynamic memory mechanism and tree exploration approach to augment the decision-making capabilities of LLMs to adapt to more complex and unknown environments, and to realize dynamic learning abilities.
arXiv
Summary
12-13 SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Institution: The Swiss AI Lab IDSIA, USI & SUPSI; AI Initiative, KAUST; Center for Brain Science, Harvard University
SwitchHead is a novel approach that optimizes resource usage in the multi-head self-attention structure, resulting in reduced resource consumption while maintaining model performance. The method has practical application potential, especially for researchers and institutions with limited resources.
arXiv
Summary
12-12 LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Institution: Apple
This research provides a novel and practical solution that effectively reduces the data load and significantly speeds up inference when running large language models on memory-constrained devices, holding substantial significance for practical applications.
arXiv
Summary
12-12 VILA: On Pre-training for Visual Language Models
Institution: NVIDIA, MIT
VILA employs an improved pre-training strategy, outperforming benchmarks in various vision-language tasks, and offers practical guidance for the design of future visual language models.
arXiv
Summary
12-12 Tell, don't show: Declarative facts influence how LLMs generalize
Institution: Apollo Research, University of Oxford
The paper investigates how models generalize when declarative statements in training data conflict with statistical patterns or procedural examples. The findings have important implications for AI safety (regarding the “treacherous turn”) and fairness.
arXiv
Summary
12-12 Alignment for Honesty
Institution: Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Fudan University
The paper introduces the concept of alignment for honesty in LLMs and presents challenges and proposed solutions. By formally defining the problem, suggesting new methods, and establishing an evaluation framework, the paper provides a comprehensive solution to alignment for honesty in large language models.
arXiv
Summary
GitHub
12-12 Comparable Demonstrations are Important in In-Context Learning: A Novel Perspective on Demonstration Selection
Institution: Shanghai Jiao Tong University
The study explores ICL from the perspective of the inter-demonstration relationship, proposing the minimally edited text construction of Comparable Demonstrations (CDs) to alleviate potential demonstration bias. The experiments confirm the performance gains of CDs in OOD scenarios, emphasizing their particular necessity in simpler tasks and demonstrating their robustness with respect to the number of examples.
arXiv
Summary
12-12 diff History for Long-Context Language Agents
Institution: New York University
The paper presents and validates the use of diff history to enhance model processing capabilities of long interaction histories. This method significantly improves model performance in complex decision tasks and effectively extends the length of history models can handle, providing new insights for the design of long-time series decision-making agents.
arXiv
Summary
12-12 LLMEval: A Preliminary Study on How to Evaluate Large Language Models
Institution: Fudan University, Shanghai Jiao Tong University
The paper focuses on how to evaluate Large Language Models (LLMs), comparing various evaluation criteria, types of evaluators, scoring methods, and ranking systems. It introduces a new evaluation dataset, LLMEval, and assesses 20 LLMs, generating a massive amount of manual and automatic evaluation results. The study provides valuable insights and conclusions for the future evaluation of LLMs.
arXiv
Summary
12-12 Efficient Few-Shot Clinical Task Adaptation with Large Language Models
The paper contributes to few-shot medical image classification with an efficient fine-tuning approach that partially freezes layers and incorporates large language models to contextualize labels for effective semantic guidance. The approach demonstrated exceptional performance in a challenge, indicating its effectiveness at adapting natural-image models to medical imaging tasks in few-shot scenarios.
arXiv
Summary
12-11 "What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces
Institution: Carnegie Mellon University
The paper explores the capabilities and challenges of LLMs in retrieving information from web interfaces, unveiling key factors affecting model performance and their limitations, setting a direction for future work.
arXiv
Summary
12-11 Unlocking Anticipatory Text Generation: A Constrained Approach for Faithful Decoding with Large Language Models
Institution: Salesforce AI Research
This work introduces a novel approach to improve the decoding methods for large language models by incorporating future constraint satisfaction. The proposed formal approach and scoring mechanism, benchmarked against LLMs, significantly contribute to the improved quality and control of text generation.
arXiv
Summary
12-11 MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples
Institution: Xiamen University, Tencent YouTu Lab
The work introduces a new paradigm with MMICT to showcase the use of in-context learning capabilities to enhance fine-tuning performance on large multi-modal language models. By designing the versatile M-Hub module and conducting various context demonstration experiments, the study reveals the potential of in-context learning to improve performance on multi-modal tasks.
arXiv
Summary
12-11 Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes
Institution: Zhejiang University, Alibaba Group
The paper presents a novel approach, FedKSeed, for federated full-parameter tuning using ZOO with a fixed set of seeds, substantially reducing the communication overhead required for tuning billion-sized LLMs, while achieving higher model accuracy and computational efficiency.
arXiv
Summary
12-11 Oracle-based Protocol Testing with Eywa
Institution: Microsoft Research
The paper introduces an oracle-based testing method that fully leverages LLMs to build rich protocol behavior models, enhancing the auto-generation and coverage of network protocol test cases by combining symbolic execution with traditional test generation methods.
arXiv
Summary
12-11 On Meta-Prompting
Institution: Microsoft
This paper presents a theoretical framework based on category theory to generalize and depict automated prompting methods. Through experiments in the fields of ideation and creativity, it demonstrates that meta-prompting generates outputs that are more favorable to users compared to traditional fixed prompts.
arXiv
Summary
12-11 Honeybee: Locality-enhanced Projector for Multimodal LLM
Institution: Kakao Brain
The paper introduces a new locality-enhanced projector design that addresses deficiencies of existing methods in handling visual feature locality and makes effective use of multifaceted instruction datasets. As a result, the Honeybee model achieves significant performance improvements across multiple MLLM benchmarks.
arXiv
Summary
GitHub
12-11 Dense X Retrieval: What Retrieval Granularity Should We Use?
Institution: University of Washington, Tencent AI Lab
This paper introduces propositions as a new retrieval unit for dense retrieval, which improves the performance of downstream QA tasks and cross-task generalization capabilities while reducing irrelevant information in the retrieved texts.
arXiv
Summary
12-11 Extracting Self-Consistent Causal Insights from Users Feedback with LLMs and In-context Learning
Institution: Microsoft, Microsoft Research
The research presents a novel framework utilizing LLMs and ICL to extract self-consistent causal insights from user feedback to support analysis in Microsoft's Feedback Hub. The framework employs self-consistency and prompt-ensemble techniques to mitigate hallucinations and incorrect reasoning in LLMs, and introduces two heuristic methods to assess the richness of feedback information. Experiments demonstrate the method's efficacy in extracting causal insights and new bugs, and in helping Microsoft engineers prioritize information-rich feedback.
arXiv
Summary
12-10 Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
Institution: Microsoft Israel
The study's core contribution lies in its comparison of fine-tuning and RAG methodologies for knowledge injection into LLMs, finding that RAG demonstrates superior performance in injecting both new and existing knowledge. The research used innovative datasets and assessment methods to ensure the practicality and viability of the theoretical findings.
arXiv
Summary
12-09 Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
Institution: Northeastern University, Oracle
This paper introduces Agile-Quant, an activation-guided quantization framework that accelerates the inference of large language models on edge devices. Agile-Quant overcomes challenges associated with activation outliers and edge hardware implementation, achieving task performance comparable to weight-only quantization methods while significantly increasing inference speed on actual devices.
arXiv
Summary
12-09 Context Tuning for Retrieval Augmented Generation
Institution: Apple
The paper presents context tuning as a novel component that enhances RAG-based planning, enabling it to effectively handle incomplete or under-specified queries and reduce hallucinations. It systematically compares various retrieval methods in lightweight models and LLMs, showcasing the effectiveness of context tuning in improving contextual understanding.
arXiv
Summary
12-09 Sim-GPT: Text Similarity via GPT Annotated Data
Institution: Shannon.AI, Zhejiang University, Bytedance
Sim-GPT is a framework that uses data labeled by GPT-4 to train STS models effectively. It incurs only a one-time cost for data generation, trains faster, and the resulting model outperforms baselines on multiple STS benchmarks.
arXiv
Summary
GitHub
12-09 NLLG Quarterly arXiv Report 09/23: What are the most influential current AI Papers?
Institution: University of Mannheim, University of Bielefeld
The paper provides an analysis of the most current trends and influence in AI research by examining the most cited papers on arXiv over a set period, particularly highlighting the significance of LLMs in this context.
arXiv
Summary
12-09 Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis
Institution: Shanghai Jiao Tong University
This research systematically explores the capability boundaries of LLMs within the context of game theory and provides insights for integrating LLMs into social science research from three distinct perspectives.
arXiv
Summary
12-08 Using Program Knowledge Graph to Uncover Software Vulnerabilities
The paper introduces a Program Knowledge Graph that combines program graphs with security data, and leverages prompt tuning of large language models to auto-generate queries for detecting vulnerabilities in software code. The method aims to overcome the limitations of traditional vulnerability detection, improving the automation and effectiveness of detection, especially in static analysis applications.
arXiv
Summary
12-08 PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
Institution: RAND Corporation, Carnegie Mellon University, LangChain
The paper presents PaperQA, a retrieval-augmented generative agent for scientific research capable of answering questions based on up-to-date scientific literature with a performance comparable to human experts, and in some aspects even superior. The effectiveness of PaperQA is demonstrated, and its superiority is affirmed through comparative results with human experts and other commercial tools.
arXiv
Summary
12-07 Beyond Surface: Probing LLaMA Across Scales and Layers
Institution: Hong Kong University of Science and Technology
The core contribution of the study lies in proposing a series of probing tasks to evaluate the higher-order capabilities of large language models, focusing on computation, mathematical reasoning, logical reasoning, and truthfulness detection. It reveals how the performance of LLMs varies with changes in model scale and structural layers.
arXiv
Summary
12-07 CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models
Institution: MPI for Intelligent Systems, University of Washington
This research introduces the CLADDER dataset and CAUSALCOT chain-of-thought prompting strategy to test and analyze the abilities of large language models (LLMs) in formal causal reasoning, highlighting limitations of LLMs and suggesting future research directions.
arXiv
Summary
GitHub
12-07 A Study on the Calibration of In-context Learning
Institution: Harvard University
The paper conducts an in-depth study of the calibration accuracy in language models (LMs) for in-context learning (ICL) and presents methods for evaluation and analysis. It reveals the relationship of calibration errors with model size and the changes during finetuning, as well as the reduction in calibration during the generation of reasoning tasks.
arXiv
Summary
12-07 Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration
Institution: Renmin University of China, Beijing Institute of Technology, HKUST (GZ)
This paper delivers a comprehensive study aimed at exploring a cost-effective batch prompting approach to entity resolution. The main contributions include the introduction of the BATCHER framework and the proposal of a covering-based demonstration selection strategy.
arXiv
Summary
12-07 An LLM Compiler for Parallel Function Calling
Institution: UC Berkeley, ICSI, LBNL
The paper introduces a system named LLMCompiler that addresses high latency costs and inefficiencies in executing multi-function calls by LLMs. It enhances speed, reduces costs, and improves accuracy through parallelized function calling and optimized orchestration.
arXiv
Summary
12-07 Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
Institution: Google DeepMind, Stanford University, University of California Berkeley
Chain of Code (CoC) adds a new dimension to language models by improving reasoning capabilities through code writing and code execution emulation. It achieves breakthrough performance in both numerical and semantic reasoning tasks, expands the application scope of LLMs, and has the potential to be applied to a broader range of problems.
arXiv
Summary
12-07 Generating Illustrated Instructions
Institution: GenAI Meta, Columbia University
The paper presents a novel approach called StackedDiffusion for the task of generating illustrated instructions, a task that combines text and images to describe how to achieve a goal. This method overcomes the limitations of current T2I models that fail to generate visuals from user queries directly and surpasses existing methods in human evaluations.
arXiv
Summary
12-07 Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use
Institution: Gaoling School of Artificial Intelligence, Renmin University of China, Alibaba Group
The paper presents the Attention Buckets method to address deficiencies in the context awareness of LLMs during tool use, significantly enhancing their performance on such tasks by combining attention computed with different RoPE angle bases.
arXiv
Summary
12-06 Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Institution: Zhejiang Lab
The paper successfully introduces Holmes, a framework that facilitates training LLMs in heterogeneous NIC environments. Empirical studies confirm that Holmes can achieve performance levels in these environments comparable to those possible with homogeneous RDMA NICs. This significant advancement makes LLM training more accessible and expands the potential for efficient scaling within the broader research community.
arXiv
Summary
12-06 AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
Institution: Peking University, Tencent AI Lab, HKUST
AnimateZero provides decoupled and precise control of appearance and motion for T2V generation, realizing step-by-step video generation from T2I to I2V, while maintaining good domain consistency through spatial appearance control and temporal consistency control.
arXiv
Summary
12-06 Controllable Human-Object Interaction Synthesis
Institution: Stanford University, FAIR Meta
The paper proposes a novel interaction synthesis method, CHOIS, which is capable of generating synchronized human and object motions under the guidance of language descriptions, adhering to the geometric constraints of 3D scenes. Integrated into a system, it demonstrates its efficacy in synthesizing continuous, realistic, and context-aware human-object interactions.
arXiv
Summary
12-06 Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia
Institution: Google DeepMind, Google Research
This paper proposes a method for enhancing agent-based models with generative large language models, using the Concordia library to simulate interactions of agents in social, physical, and digital spaces. The model aims to provide life-like social simulations and explore the effectiveness of model validation.
arXiv
Summary
12-06 Efficient Large Language Models: A Survey
Institution: The Ohio State University, Google Research, Amazon AWS AI
The paper is a survey of the recent advancements in large language models concerning sparse activation methods, especially the Mixture-of-Experts system (MoE) and its application in long-context processing. It synthesizes various optimization methods for MoE models, including algorithmic improvements and system-level acceleration frameworks.
arXiv
Summary
GitHub
12-06 OneLLM: One Framework to Align All Modalities with Language
Institution: MMLab The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory
OneLLM showcases strong multimodal understanding and processing capabilities through its unified multimodal encoding framework and progressive alignment pipeline, addressing the challenge of expanding multimodal LLMs in the area of reasoning and utilization.
arXiv
Summary
GitHub
12-05 A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education
Institution: Carnegie Mellon University
The main contribution of this paper is an automated MCQ generation system based on GPT-4 which, through a flexible architecture and a precise LO alignment mechanism, generates MCQs consistent with the LOs of higher-education Python courses. The findings show that the automatically generated MCQs align well with the LOs and approach the quality of human-crafted MCQs, but fall short in ensuring a single correct answer and high-quality distractors, suggesting future work should focus on these issues.
arXiv
Summary
12-05 Inherent limitations of LLMs regarding spatial information
Institution: ProtagoLabs, International Monetary Fund, NetMind.ai
The paper provides a new evaluation framework and specially designed dataset for the capabilities of large language models like GPT-4 in handling spatial information, and analyzes the abilities and limitations of GPT-4 in dealing with spatial information.
arXiv
Summary
12-05 Beyond Isolation: Multi-Agent Synergy for Improving Knowledge Graph Construction
Institution: Zhejiang Lab, Ant Group
By introducing a multi-agent cooperation approach within KGC, the CooperKGC framework improves the precision with which agents solve tasks involving entity, relation, and event extraction, and potentially lays the foundation for a future of collaboration-aware AI.
arXiv
Summary
12-05 A Hardware Evaluation Framework for Large Language Model Inference
Institution: Princeton University
LLMCompass, as a hardware evaluation framework, effectively addresses the challenges in designing hardware for LLM inference. It is not only fast and accurate but also architecturally descriptive and cost-aware, and it has been validated on commercial hardware with exceptional performance.
arXiv
Summary
12-05 Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models
Institution: University of Waterloo, Cohere, Comcast Applied AI
The paper's key achievement is demonstrating how to construct an effective listwise reranker without dependence on GPT models, significantly surpassing existing GPT-based rerankers, and calling for the development of higher-quality listwise training datasets to enhance model performance.
arXiv
Summary
12-05 RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
Institution: University of Waterloo
RankZephyr is a new type of open-source LLM specifically optimized for zero-shot list reranking tasks. It offers reranking effects comparable or superior to those of large proprietary models, while emphasizing the importance of data augmentation for enhanced model robustness, and has proven its effectiveness and application potential in real-world scenarios.
arXiv
Summary
GitHub
12-05 Large Knowledge Model: Perspectives and Challenges
Institution: Zhejiang University
The paper proposes the concept of a Large Knowledge Model (LKM), aimed at more effectively managing and interpreting the diversity of knowledge representation. The study outlines the challenges in transitioning from current LLMs to LKMs, underlines the importance of structured knowledge in pre-training, and introduces a set of design principles for LKMs.
arXiv
Summary
12-05 Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
Institution: Sea AI Lab, Sun Yat-sen University, Harvard University
The paper introduced a Creative Leap-of-Thought (CLoT) paradigm for enhancing the creative thinking abilities of large language models, demonstrating its effectiveness and generalizability across various tasks.
arXiv
Summary
GitHub
12-05 How should the advent of large language models affect the practice of science?
Institution: Max Planck Institute for Biological Cybernetics, University of Tübingen, University of Washington
The paper discusses the implications of LLMs on scientific practices and recommends a cautious approach to their usage, emphasizing the importance of protecting the normative and epistemic aspects of science. Although LLMs may improve the efficiency of certain research tasks, they should be used judiciously as tools that abide by scientific norms and standards.
arXiv
Summary
12-05 Prompt Optimization via Adversarial In-Context Learning
Institution: National University of Singapore, Hong Kong University of Science and Technology, Institute for Infocomm Research (I2R) A*STAR
The paper introduces a novel Adversarial In-Context Learning (adv-ICL) method for optimizing prompt selection to enhance the performance of large models. By pursuing an adversarial training objective through prompt optimization rather than updating model parameters, it overcomes data and computational resource constraints, with experimental results significantly outperforming existing techniques across multiple tasks.
arXiv
Summary
12-04 Competition-Level Problems are Effective LLM Evaluators
Institution: Microsoft Research Asia, Xiamen University, Microsoft Azure AI
The study has revealed inadequacies in LLMs like GPT-4 when assessing their real-world reasoning capabilities using competition-level programming questions, suggested methods for improvement, and highlighted the significance of such problems as efficient evaluators of LLMs, thus fostering further research into enhancing complex reasoning abilities in LLMs.
arXiv
Summary
12-04 A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Institution: Elsevier
This paper summarizes the applications and associated challenges of Large Language Models (LLMs) in security and privacy, highlighting the good, the bad, and the ugly aspects while emphasizing the potential for data protection in these domains.
arXiv
Summary
12-04 Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
Institution: Xiamen University, MBZUAI, Tencent AI Lab
The paper successfully elevated the CoT reasoning capabilities of LLMs in multi-modal tasks by introducing a dynamic automatic retrieval mechanism and stratified sampling method. The proposed approach not only improved model performance but also refined the reasoning process through diverse example selection, setting a new performance benchmark in the field of multi-modal reasoning.
arXiv
Summary
12-04 Data Management For Large Language Models: A Survey
Institution: Peking University, Huawei Noah’s Ark Lab
This survey studies the current state of research in data management at both the pretraining and supervised fine-tuning stages of LLMs and the design of data management strategies.
arXiv
Summary
GitHub
12-04 ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions
Institution: Nanyang Technological University, National University of Singapore
The study presents the first comprehensive evaluation of the potential of leveraging ChatGPT in the generation of pre-university math questions. It explores question generation in two main scenarios: with and without given context and aims to provide practical insights for educators. The findings from this study may promote the usage of modern AI technologies in education, enhancing the practicability and efficiency of automated math question generation.
arXiv
Summary
12-04 On the Effectiveness of Large Language Models in Domain-Specific Code Generation
Institution: Shanghai Jiao Tong University, Chongqing University, East China Normal University
The study demonstrates that LLMs' capabilities in domain-specific code generation can be significantly enhanced by effectively integrating domain knowledge into the code generation process. The DomCoder approach exemplifies the incorporation of different strategies to blend domain knowledge and boost the actual effectiveness of code generation within certain contexts.
arXiv
Summary
12-04 The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Institution: Allen Institute for Artificial Intelligence, University of Washington
The paper introduces a simple, tuning-free method (URIAL) for aligning LLMs through in-context learning, which demonstrates performance on par with or superior to traditional tuning alignment methods. The findings significantly contribute to future LLM research, highlighting the importance of deeper analysis and theoretical understanding in LLM alignment.
arXiv
Summary
12-04 Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication
Institution: Fudan University, National University of Singapore, Shanghai AI Laboratory
The Exchange-of-Thought (EoT) framework introduced in this paper enhances the reasoning capabilities of LLMs through cross-model communication, leveraging four communication paradigms and a confidence evaluation mechanism, yielding significant improvements in various reasoning tasks and proving the role of external insights in enhancing model performance.
arXiv
Summary
12-04 LLMs Accelerate Annotation for Medical Information Extraction
Institution: Google Research
The paper presents a method that uses large language models, specifically Google's PaLM 2, to enhance the speed of annotation in medical information extraction tasks. The LLM-based annotation workflow increases efficiency without complex model parameter adjustment, making it a promising tool for accelerating data annotation work in the medical field.
arXiv
Summary
12-03 D-Bot: Database Diagnosis System using Large Language Models
Institution: Tsinghua University, Pigsty, ModelBest
D-Bot is a database diagnosis system based on large language models designed to improve the efficiency and accuracy of database diagnosis by extracting knowledge from documents and generating effective diagnosis reports, addressing challenges faced by domain experts in the field of database diagnosis.
arXiv
Summary
12-03 TextGenSHAP: Scalable Post-hoc Explanations in Text Generation with Long Documents
Institution: University of Southern California, Google Cloud AI
The paper introduces TextGenSHAP, an efficient post-hoc explanation method designed for large language models. The method improves the speed of explanation generation and demonstrates how to leverage these explanations to enhance long-document question answering and document retrieval systems.
arXiv
Summary
12-03 Running cognitive evaluations on large language models: The do's and the don'ts
Institution: Massachusetts Institute of Technology
This paper provides instructive recommendations on the methodological approach for conducting cognitive assessments of large language models, exploring how to avoid potential issues during the evaluation process. The goal of the paper is to contribute to the broader discussion of best practices in the field of AI Psychology.
arXiv
Summary
12-02 Axiomatic Preference Modeling for Longform Question Answering
The axiomatic framework proposed in this paper offers a new method for preference modeling in long-form question-answering, closely examining human preferences and optimizing the accuracy and efficiency of preference scoring.
arXiv
Summary
12-02 Large Language Models Are Zero-Shot Text Classifiers
Institution: Florida Atlantic University
The paper demonstrates that LLMs are effective zero-shot text classifiers, which is particularly beneficial for small teams or businesses that need to deploy text classifiers quickly. The results show that GPT-4 consistently surpasses traditional ML algorithms across all four datasets. The article also suggests future research directions, including optimizing prompts for higher accuracy and introducing a critic agent to evaluate and improve the outputs of LLMs.
arXiv
Summary
12-02 Exploring and Improving the Spatial Reasoning Abilities of Large Language Models
Institution: Stanford University
The paper advances our understanding of LLMs' capabilities in spatial reasoning and sequence labeling, proposing a method to improve LLMs' performance in 3D trajectory recognition tasks with significant performance improvements.
arXiv
Summary
12-02 Just-in-Time Security Patch Detection -- LLM At the Rescue for Data Augmentation
Institution: University of Luxembourg, Windows Copilot Microsoft, Singapore Management University
The paper presents an innovative security patch detection framework, LLMDA, utilizing Large Language Models for patch analysis and data augmentation, and aligning multimodal inputs. This enables the system to extract more extensive information from the combined context of patches and code, thereby enhancing detection accuracy.
arXiv
Summary
12-01 Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games
Institution: Quebec AI Institute
The paper provides a new evaluation method adapted to the complexity and new challenges of JuBensha games and establishes a new framework, ThinkThrice, for assessing the capabilities of LLM agents in an interactive gaming environment, advancing AI applications in multiplayer role-playing games.
arXiv
Summary
12-01 Nash Learning from Human Feedback
Institution: Google DeepMind
The paper introduces a novel method to fine-tune LLMs for alignment with human preferences through Nash equilibrium, demonstrating its potential in complex tasks and verifying its effectiveness through empirical evidence.
arXiv
Summary
12-01 Leveraging Large Language Models to Improve REST API Testing
Institution: Georgia Institute of Technology, IBM Research
RESTGPT addresses the limitations of existing methods in extracting rules from natural language descriptions and generating effective values by leveraging the accuracy and efficiency of LLMs, especially GPT-3.5 Turbo, significantly enhancing the quality and accuracy of REST API testing.
arXiv
Summary
12-01 The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
Institution: University of Wisconsin - Madison
This research represents one of the first large-scale investigations into the impact of compression techniques on the parametric knowledge of LLMs, offering significant insights for practitioners, especially regarding decisions related to pruning and quantization techniques.
arXiv
Summary
12-01 On Exploring the Reasoning Capability of Large Language Models with Knowledge Graphs
Institution: Singapore Management University, National Sun Yat-sen University
The study demonstrates the capability of LLMs to successfully work through knowledge graph reasoning tasks using their internal knowledge graph and to infer knowledge graph relations from context, showcasing the potential and applicative value of LLMs in knowledge graph reasoning.
arXiv
Summary
12-01 Instruction-tuning Aligns LLMs to the Human Brain
Institution: EPFL
The study shows that large language models trained through instruction-tuning exhibit better representation of world knowledge and alignment with human brain activity. This provides a crucial perspective for the future development of LLMs to incorporate world knowledge into the models.
arXiv
Summary
12-01 RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
RLHF-V presents a novel strategy to rectify MLLM behavior via fine-grained, correctional human feedback. It collects quality data to align MLLM learning with human preferences, effectively improving the models' reliability and practicality in various tasks. This study represents a significant advancement in enhancing the robustness of large multimodal language models.
arXiv
Summary
GitHub
12-01 Learning from One Continuous Video Stream
The paper presents a framework for online learning from a single continuous video stream focused on evaluating adaptability and generalizability, proposing a sequence of future prediction tasks for pre-training. The study demonstrates that optimization strategies in such learning environments need to be adjusted, with reductions in momentum and frequency of weight updates leading to improved adaptability and generalization of models.
arXiv
Summary
12-01 Improve Supervised Representation Learning with Masked Image Modeling
Institution: Google Research, OpenAI
This paper proposed a new training setup that integrates supervised representation learning with MIM, significantly enhancing the quality of representation learning for downstream tasks such as classification, image retrieval, and semantic segmentation without introducing significant training or inference overhead.
arXiv
Summary
12-01 Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses
Institution: Google
The paper introduced ExploreLLM, a new interaction pattern between users and LLM-powered assistants by combining a prompt-based task decomposition method with a novel schema-like GUI. The system aims to reduce the cognitive burden of completing complex tasks and to enhance the level of personalized responses.
arXiv
Summary

2023-11

 Date   Paper Links & Summary
11-30 TaskBench: Benchmarking Large Language Models for Task Automation
Institution: Zhejiang University
This paper presented TaskBench, a new benchmark test, and TASKEVAL, an evaluation system, which together effectively address the assessment challenges of LLMs in task automation through data generation and quantitative evaluation.
arXiv
Summary
11-30 MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Institution: University of Science and Technology of China, Microsoft Research Asia
MicroCinema, with its innovative two-phase process for text-to-video generation and effective Appearance Injection Network and Appearance Noise Prior mechanisms, has achieved a breakthrough in video generation quality, serving as a reference model for subsequent work.
arXiv
Summary
11-30 IAG: Induction-Augmented Generation Framework for Answering Reasoning Questions
Institution: Huawei Poisson Lab
The IAG framework combines an inductive prompting method that strengthens the factuality of knowledge statements with an optimized knowledge fusion mechanism and a student inductor model, and the findings indicate that it outperforms existing retrieval-based methods on QA tasks that involve implicit reasoning.
arXiv
Summary
11-30 Autonomous Agents in Software Development: A Vision Paper
Institution: Tampere University
This paper proposes a vision of using multiple GPT agents for automating SE tasks and showcases preliminary success in simple software tasks. This work has the potential to fundamentally change the way software development is conducted and to shorten development time.
arXiv
Summary
11-30 CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
Institution: UC Berkeley, Microsoft Azure AI, ZOOM
CoDi-2 is an advanced multimodal generation model capable of processing complex multimodal inputs, guiding generation in-context, interacting with users through multi-round conversations, and achieving outstanding zero-shot and few-shot performance.
arXiv
Summary
11-30 What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations
Institution: Comcast Applied AI, University of Waterloo
The authors proposed a novel probe to detect implicit association biases in LLMs representations and demonstrated state-of-the-art performance in preference detection. The research additionally uncovered significant biases in multiple instruction-following and "classic" LLMs related to nationality, politics, religion, and gender, despite the explicit safety calibration of the LLMs.
arXiv
Summary
GitHub
11-30 Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text
Institution: The University of Tokyo
The research showcased GPT-4's robust capabilities in managing scrambled texts, set forth new metrics RR and RPG, and validated GPT-4's stable performance across various scramble scenarios and rates.
arXiv
Summary
11-30 Applying Large Language Models and Chain-of-Thought for Automatic Scoring
Institution: University of Georgia
The study showcases the potential of LLMs in facilitating automatic scoring, highlighting that CoT significantly enhances scoring accuracy when used with item stems and scoring rubrics. The combined approach of LLMs with CoT can reduce complexity and manpower cost in building automatic scoring models and potentially offer a closer alignment with human scoring results.
arXiv
Summary
11-30 PoseGPT: Chatting about 3D Human Pose
Institution: Max Planck Institute for Intelligent Systems, Meshcapade
PoseGPT is a novel framework that enables models to directly generate 3D human poses from textual and visual inputs by embedding SMPL pose tokens within LLMs, achieving some innovation in interpreting 3D human pose.
arXiv
Summary
11-29 Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering
Institution: Sun Yat-Sen University
This work innovatively combines three agents to simulate the top-down reasoning process in human cognition, and introduces the concept of a Multi-view Knowledge Base, significantly enhancing the expressiveness and interpretability of VQA models.
arXiv
Summary
11-29 Zero-shot Conversational Summarization Evaluations with small Large Language Models
Institution: Intel Labs
The paper focuses on the application of Large Language Models in the conversational summarization task, deeply examining the impact of different instructions on model performance and researching optimization techniques for using compressed models under hardware limitations.
arXiv
Summary
11-29 Understanding and Improving In-Context Learning on Vision-language Models
Institution: LMU Munich, University of Oxford
This paper proposed a novel method, MMICES, for selecting demonstrations in in-context learning for vision-language models, demonstrating its effective performance across different models and datasets.
arXiv
Summary
11-29 How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation
Institution: The Education University of Hong Kong
This paper represents an innovative attempt to build an AI tutor system that can adapt to any course subject and provide customized high-quality educational support, potentially progressing the application of AI technology in education and forging a new path for the development of AI tutoring systems.
arXiv
Summary
11-29 Are Large Language Models Good Fact Checkers: A Preliminary Study
Institution: Chinese Academy of Sciences
The paper systematically evaluates the potential of LLMs in the entire fact-checking process, revealing that while they show promise in certain aspects, considerably more research and trials are needed to improve their performance in fact-checking tasks.
arXiv
Summary
11-29 Large Language Models for Networking: Applications, Enabling Techniques, and Challenges
Institution: BUPT
The paper proposes a new framework, ChatNet, that integrates Large Language Models with network technologies, exploring its application in network planning. The study demonstrates that ChatNet can effectively promote the automation and intelligence level of network tasks, though challenges such as multimodal data integration and plugin development must be addressed prior to deployment.
arXiv
Summary
11-29 TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
Institution: Harbin Institute of Technology
The introduction of the TIMEBENCH benchmark marks an important step in the comprehensive assessment of temporal reasoning abilities in Large Language Models, showcasing the current gap between models and humans in this area and providing guidance for future research.
arXiv
Summary
GitHub
11-29 TaskWeaver: A Code-First Agent Framework
Institution: Microsoft
TaskWeaver is a code-first designed framework to build LLM-powered autonomous agents, achieving efficient handling of complex data structures, flexible plugin usage, and the successful integration of domain-specific knowledge into the system.
arXiv
Summary
GitHub
11-28 AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond
The research introduces a novel, integrated AvatarGPT framework for handling high-level and low-level tasks related to understanding, planning, and generating human motions, showcasing the potential for extended-duration motion synthesis with reduced manual intervention.
arXiv
Summary
11-28 Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Institution: Microsoft
This paper explores how to guide a generalist foundation model to exhibit expert-level capabilities on specialized tasks without expert supervision, using the medical field as a case study. The proposed Medprompt strategy proves to have a significant advantage in enhancing the specialized abilities of foundation models and shows the possibility of widespread application across multiple disciplines.
arXiv
Summary
11-28 ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
Institution: Nanyang Technological University
This survey paper provides an assessment of the performance of open-source LLMs across multiple task domains compared to ChatGPT, highlighting the strengths and potential problems of current open-source LLMs, and offers insights for future research and development. Furthermore, it summarizes numerous best practices and challenges, indicating that the open-source field could potentially close the gap with commercial models to some extent.
arXiv
Summary
11-28 Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Institution: Shanghai AI Laboratory
The article proposes an innovative strategy for optimizing LVLMs to reduce hallucinations and introduces a new evaluation method to more comprehensively measure hallucinations. The effectiveness of the proposed method is validated through experiments.
arXiv
Summary
11-28 Graph Prompt Learning: A Comprehensive Survey and Beyond
Institution: The Chinese University of Hong Kong, Hong Kong University of Science and Technology, Fudan University
This paper provides a thorough survey on graph prompt learning, covering the AGI challenges with graph data handling and how graph prompt learning can facilitate cross-modality, cross-domain, and cross-task applicability of AGI technologies.
arXiv
Summary
GitHub
11-28 RELIC: Investigating Large Language Model Responses using Self-Consistency
Institution: ETH Zurich
RELIC is an interactive system that, through the factual consistency investigation of multiple samples, helps users verify and direct texts generated by LLMs.
arXiv
Summary
11-28 LLaFS: When Large-Language Models Meet Few-Shot Segmentation
Institution: Singapore University of Technology and Design, Zhejiang University
This paper presents an LLM-based framework for few-shot image segmentation, addressing the core challenges of enabling LLMs to understand and execute visual tasks. A combination of customized guidance and fine-grained in-context instructions facilitates high-quality few-shot segmentation.
arXiv
Summary
GitHub
11-28 RankingGPT: Empowering Large Language Models in Text Ranking with Progressive Enhancement
Institution: Alibaba Group
This study presents a two-stage training model for text ranking that combines weakly supervised pre-training with supervised fine-tuning, transitioning smoothly between the two stages so that fine-tuning performance improves without sacrificing the benefits of pre-training. Experiments show it significantly outperforms existing techniques.
arXiv
Summary
11-28 Prompting in Autoregressive Large Language Models
Institution: George Mason University
This paper provides a succinct literature review in the field of prompting for autoregressive large language models, highlighting unresolved challenges and open problems, thereby offering directions for future research.
arXiv
Summary
11-28 Training Chain-of-Thought via Latent-Variable Inference
Institution: Google
This paper develops an MCMC-EM based fine-tuning strategy that, by averaging over rationales, helps LLMs generate the correct answers, holding potential for wide applicability.
arXiv
Summary
11-28 Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
Institution: Alibaba Group
The paper presents a new framework "Animate Anyone" using diffusion models for character animation. The framework maintains appearance consistency through ReferenceNet and ensures controllability and continuity of animations via a pose guider and temporal layer, achieving advanced results in character animation generation.
arXiv
Summary
GitHub
Blog
11-27 RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks
Institution: Chinese Academy of Sciences, Peking University
The paper presents an intelligent agent named RoboGPT that is designed for making embodied long-term decisions for daily instruction tasks. The agent combines the generic knowledge of LLMs with the professional knowledge in the robotics domain and introduces Re-Plan and RoboSkill modules to enhance the rationality and adaptability of task planning. On the ALFRED benchmark tests and generalization tasks, RoboGPT surpasses existing advanced methods.
arXiv
Summary
11-25 Faster Minimum Bayes Risk Decoding with Confidence-based Pruning
Institution: University of Cambridge
The paper presented an algorithm for MBR decoding that reduces utility function calls by gradually increasing the number of samples in the estimate and using confidence pruning. The algorithm significantly lowers computational costs while maintaining accuracy, and its effectiveness was validated through NMT experiments on three language pairs.
arXiv
Summary
11-24 Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language
Institution: Amazon
The paper presented an effective CnR method capable of efficiently aligning LLMs with human expectations through detailed feedback and response revision using natural language. With relatively less human feedback data, this method significantly improves the quality of responses from even top LLMs such as ChatGPT.
arXiv
Summary
11-24 Calibrated Language Models Must Hallucinate
Institution: Microsoft Research
The paper identifies the statistical root cause of hallucinations that are inevitable in sufficiently calibrated pretrained language models, explains the inherent mechanism by which models with good predictive performance generate hallucinations, and provides a lower-bound estimate of the hallucination rate. It discusses how likely different types of facts are to be hallucinated and points to potential future directions for mitigating specific types of hallucinations.
arXiv
Summary
11-23 GAIA: a benchmark for General AI Assistants
Institution: FAIR, Meta
arXiv
11-23 LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes
Institution: ASRI
arXiv
11-23 Probabilistic Tree-of-thought Reasoning for Answering Knowledge-intensive Complex Questions
Institution: Tsinghua University
The paper introduces Probabilistic Tree-of-thought Reasoning (ProbTree), a novel method that explores LLMs' capabilities to answer complex, knowledge-intensive questions and incorporates uncertainty into the reasoning process, integrating external and parametric knowledge within a unified framework.
arXiv
Summary
11-23 ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
Institution: Google Research
arXiv
11-23 Diffusion Model Alignment Using Direct Preference Optimization
Institution: Stanford University
arXiv
11-23 FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
Institution: Sber AI
arXiv
11-23 Controlling Large Language Model-based Agents for Large-Scale Decision-Making: An Actor-Critic Approach
Institution: Chinese Academy of Sciences
The LLaMAC framework demonstrates superior performance of LLM-based multi-agent systems in long-term planning, mathematical reasoning, optimization problems, and spatial reasoning, while also reducing access costs for large-scale multi-agent collaboration. With further enhancement of LLMs and more collaboration frameworks emerging, new opportunities will unfold in the multi-agent collaboration field.
arXiv
Summary
11-22 Visual In-Context Prompting
Institution: HKUST, Microsoft Research
The paper presents DINOv, an innovative visual in-context prompting framework effectively handling a variety of visual prompts, utilizing unlabeled data, and performing well across several tasks.
arXiv
Summary
GitHub
11-22 Enhancing Summarization Performance through Transformer-Based Prompt Engineering in Automated Medical Reporting
Institution: Utrecht University
This research validated that applying transformer-based prompt engineering in automated medical reporting can improve summarization performance. Despite some limitations, the proposed approach has shown the effectiveness of including examples and contextual information in prompt formulations and pointed out directions for future work.
arXiv
Summary
11-22 XAGen: 3D Expressive Human Avatars Generation
Institution: National University of Singapore, ByteDance
arXiv
11-22 AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations
Institution: The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology, University of California, Los Angeles
The AlignedCoT technique presented in this paper aims to align the CoT text style with the "native style" of Large Language Models to improve their reasoning capabilities, and its effectiveness has been demonstrated through empirical evidence.
arXiv
Summary
11-22 LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
Institution: Princeton University
arXiv
11-21 Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Institution: Chinese Academy of Sciences
arXiv
11-21 AcademicGPT: Empowering Academic Research
Institution: International Digital Economy Academy
arXiv
11-21 Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
Institution: Nanjing University
arXiv
GitHub
11-21 Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Institution: University of Cambridge
arXiv
11-21 How Capable Can a Transformer Become? A Study on Synthetic, Interpretable Tasks
Institution: University of Pennsylvania, MIT
arXiv
11-21 Latent Lab: Large Language Models for Knowledge Exploration
Institution: Department of Electrical Engineering and Computer Science, MIT
arXiv
11-21 Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation?
Institution: University of Auckland
arXiv
11-21 A Survey on Multimodal Large Language Models for Autonomous Driving
Institution: Purdue University
arXiv
11-21 Prompting Frameworks for Large Language Models: A Survey
Institution: Zhejiang University
This research delivers a framework that enhances interaction with LLMs by implementing new techniques, including improved compatibility with programming languages, enabling LLMs to utilize external tools, and maintaining historical interaction information, thus guiding future research directions.
arXiv
Summary
GitHub
11-20 Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Institution: Shanghai Jiao Tong University
arXiv
GitHub
11-20 GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Institution: New York University
The GPQA dataset offers a benchmark for testing the ability of AI systems to handle complex questions that require deep understanding and reasoning. With rigorous quality control and expert-level difficulty, it has the potential to advance the development of collaborative methods between human experts and AI systems, as well as the advancement of AI system design.
arXiv
Summary
11-20 Continual Learning: Applications and the Road Forward
Institution: KU Leuven
arXiv
11-20 Assessing Prompt Injection Risks in 200+ Custom GPTs
Institution: Northwestern University
arXiv
11-19 TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems
Institution: SenseTime Research
arXiv
11-18 Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning
Institution: Technical University of Darmstadt, University of Cambridge
The paper proposes a unified library—Adapters—that integrates and extends parameter-efficient and modular transfer learning methods. It achieves close integration with the Transformers library and demonstrates its effectiveness through comparative experiments on several NLP tasks.
arXiv
Summary
11-18 RecExplainer: Aligning Large Language Models for Recommendation Model Interpretability
Institution: University of Science and Technology of China
arXiv
11-18 Orca 2: Teaching Small Language Models How to Reason
Institution: Microsoft Research
arXiv
11-18 An Embodied Generalist Agent in 3D World
Institution: Beijing Institute for General Artificial Intelligence
arXiv
11-17 Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
Institution: Allen Institute for AI
arXiv
11-17 Exploring the Relationship between In-Context Learning and Instruction Tuning
Institution: HKUST
arXiv
11-16 Predictive Minds: LLMs As Atypical Active Inference Agents
Institution: Charles University
arXiv
11-16 Automatic Engineering of Long Prompts
Institution: Google
arXiv
11-16 MacGyver: Are Large Language Models Creative Problem Solvers?
Institution: University of California, Princeton University
arXiv
11-16 Crafting In-context Examples according to LMs' Parametric Knowledge
Institution: The University of Texas at Austin
arXiv
11-15 Contrastive Chain-of-Thought Prompting
Institution: DAMO Academy, Alibaba Group
arXiv
11-15 Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
Institution: Tencent AI Lab
arXiv
11-15 Memory Augmented Language Models through Mixture of Word Experts
Institution: Google Research
arXiv
11-15 Exponentially Faster Language Modelling
Institution: ETH Zurich
The paper introduces UltraFastBERT, a BERT variant that uses fast feedforward networks to engage only a small fraction of its neurons during inference, greatly increasing computational efficiency. Although a native efficient implementation is still lacking, the authors provide CPU code that significantly accelerates inference while maintaining strong performance on standard downstream tasks, demonstrating the substantial potential of conditional neural execution in language modeling.
arXiv
Summary
11-15 ToolTalk: Evaluating Tool-Usage in a Conversational Setting
Institution: Microsoft Corporation
ToolTalk is a benchmark designed to evaluate and improve the performance of LLMs in utilizing multi-step external tools within a conversational context. With innovative evaluation methods and realistic scenario simulations, it challenges and expands the boundaries of LLM capabilities and charts a course for future research.
arXiv
Summary
GitHub
11-14 KTRL+F: Knowledge-Augmented In-Document Search
Institution: KAIST AI, Samsung Research
arXiv
11-14 Learning to Filter Context for Retrieval-Augmented Generation
Institution: Carnegie Mellon University
arXiv
11-14 Instruction-Following Evaluation for Large Language Models
Institution: Google, Yale University
arXiv
GitHub
11-13 In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax
Institution: NYU, Microsoft
This paper unveils potential limitations of large language models in understanding and generalizing syntactic structures, which is crucial for improving the way language models handle complex syntactic tasks.
arXiv
Summary
11-13 Can LLMs Patch Security Issues?
Institution: School of Computer Science Atlanta
The article introduced a new approach to code refinement named FDSS, which, by integrating with the static code analysis tool, Bandit, significantly enhances the capability of LLMs in solving security issues within code.
arXiv
Summary
GitHub
11-11 In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
Institution: Stanford University
arXiv
11-10 Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking
Institution: Helvia.ai
For the first time, this paper presents a comprehensive evaluation of methodologies in a resource-limited industrial context, including cost analysis, RAG method, and data augmentation using GPT-4, offering new avenues for the financial industry to address challenges related to data and budget constraints.
arXiv
Summary
11-05 ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs
Institution: Cornell University, Microsoft Research
The paper presents a new approach to enhance online educational QA platforms using open-source LLMs, and it has undergone extensive evaluation and testing. By combining technologies like RAG, SFT, and DPO, the study not only ensures a significant improvement in the quality of responses but also protects data privacy, making it significant for the development of intelligent QA assistants.
arXiv
Summary
11-01 LLMRec: Large Language Models with Graph Augmentation for Recommendation
Institution: University of Hong Kong, Baidu
LLMRec, as a pioneering work, introduces LLMs to enhance graph recommendation systems and successfully addresses the issues of sparsity in interaction data and low-quality side information. It improves the performance of recommendation systems through means such as reinforcing user-item interaction edges, item node attributes, and user profiling, ensuring recommendation quality while reducing the impact of data noise.
arXiv
Summary
GitHub

2023-10

 Date   Paper Links & Summary
10-20 The History and Risks of Reinforcement Learning and Human Feedback
Institution: Berkeley
arXiv
10-17 Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Institution: University of Washington
The paper introduces SELF-RAG, a new framework that enhances LLM quality and factual accuracy through on-demand retrieval and self-reflection. It makes the LM controllable during the inference phase to suit diverse task requirements and significantly outperforms existing LLMs and RAG models in various tasks. SELF-RAG offers a novel approach to model self-assessment and customization through its decoding algorithm and reflection tokens.
arXiv
Summary
10-11 OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models
Institution: Tsinghua University, Chinese Academy of Sciences
OpsEval, as a comprehensive task-oriented AIOps benchmark, not only assesses the comprehensive performance, reasoning, and practical application capabilities of LLMs but also has the potential to change the evaluation metrics used in future large-scale quality assessments. It provides a solid foundation for ongoing research and optimization of LLMs tailored for AIOps.
arXiv
Summary
10-10 GPT-4 as an Agronomist Assistant? Answering Agriculture Exams Using Large Language Models
Institution: Microsoft Research
This study presents a new approach in employing LLMs for answering questions in the field of agriculture, significantly enhancing LLMs' performance on multiple-choice questions through the Ensemble Refinement strategy, showing the broad potential in handling domain-specific problems.
arXiv
Summary

2023-09

 Date   Paper Links & Summary
09-04 Benchmarking Large Language Models in Retrieval-Augmented Generation
Institution: Chinese Information Processing Laboratory
This paper introduces a new benchmark based on real news articles for comprehensive assessment of large language models' capabilities in complex informational environments and illustrates the existing limitations of LLMs through the experimental results.
arXiv
Summary

2023-08

 Date   Paper Links & Summary
08-18 Learning Representations on Logs for AIOps
Institution: IBM Research
The BERTOps model proposed in this paper leverages general LLM representations and specifically tailored pretraining for AIOps log data, effectively improving the performance of automated log analysis tasks and demonstrating significant enhancements. BERTOps not only surpasses existing models but also exhibits superior performance across multiple downstream tasks, facilitating the practical application of AIOps.
arXiv
Summary

2023-07

 Date   Paper Links & Summary
07-11 Towards Understanding In-Context Learning with Contrastive Demonstrations and Saliency Maps
Institution: University of Maryland
This study analyzed the internal mechanisms of ICL in LLMs using contrastive demonstrations and saliency map analysis, revealing the differential impacts of label flipping, input changes, and complementary explanations on predictions, providing insights for practitioners on curating demonstrations.
arXiv
Summary
GitHub

2023-06

 Date   Paper Links & Summary
06-07 PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Institution: Microsoft Research
PromptRobust is an innovative, open benchmark aimed at evaluating how LLMs handle input errors that are likely to occur naturally in the real world, such as typos and synonym replacements. The open-sourcing of this tool will aid future robustness research.
arXiv
Summary

2023-05

 Date   Paper Links & Summary
05-24 In-Context Demonstration Selection with Cross Entropy Difference
Institution: Microsoft Cognitive Service Research
The paper presents a novel Cross-Entropy Difference (CED) method for in-context demonstration selection, provides a theoretical rationale, and achieves performance improvements on large language models of different sizes and types.
arXiv
Summary
05-23 Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
Authors: Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
The paper explores the internal mechanism of in-context learning (ICL) by large language models from an information flow perspective, identifying the anchoring phenomenon of label words, proposing a new hypothesis, and experimentally validating its effectiveness. Moreover, the insights were used to propose methods for improving ICL performance, providing a theoretical foundation and practical guidance for future related researches.
arXiv
Summary
05-19 How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
Institution: The Ohio State University
The study revealed the critical database knowledge and optimal representations for effective prompting, offering guidance for the application of LLMs in the text-to-SQL task, and pointed out a "sweet spot" in terms of prompt length in the cross-domain setting. The findings may not always be applicable to a specific database, particularly if the database is significantly different from the Spider databases.
arXiv
Summary

2023-03

 Date   Paper Links & Summary
03-31 A Survey of Large Language Models
Institution: Renmin University of China
arXiv

2023-02

 Date   Paper Links & Summary
02-08 A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Institution: Centre for Artificial Intelligence Research
The article evaluated ChatGPT's reasoning abilities in a more granular way and identified a key issue in LLMs - the lack in non-text semantic understanding. This finding offers significant directions for future improvements and research into the reasoning capabilities of LLMs.
arXiv
Summary