llm-paper-daily Daily Paper Selection


Each paper comes with related resources:

  • arXiv link
  • GitHub link
  • GPT-4-generated summary
  • Related blogs

Latest updates (07-25 20:48):
  • OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
  • CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
  • Knowledge Mechanisms in Large Language Models: A Survey and Perspective
  • Internal Consistency and Self-Feedback in Large Language Models: A Survey
  • ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
  • Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
  • RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent

2024-07

Date | Paper | Links & Summary
07-23 RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Institution: Zhejiang University, Palo Alto Networks, University of North Texas
The RedAgent system effectively identifies and exploits the security vulnerabilities of large language models by simulating context-specific jailbreak strategies. It enhances the efficiency and automation of red teaming methods while providing a new perspective on understanding and strengthening the security of LLM applications.
arXiv
Summary
07-23 Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach
arXiv
07-23 OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
Institution: UIUC, CMU, Yale
OpenDevin is a community-driven platform tailored for developing generalist and specialist AI agents that interact with the world through software, featuring a dynamic interaction mechanism, a sandboxed operating system and web browser environment, and a comprehensive evaluation framework.
arXiv
Summary
07-22 Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Institution: Zhejiang University, National University of Singapore, University of California, Los Angeles
The paper suggests that a deep understanding of knowledge mechanisms in LLMs is crucial for developing powerful and reliable AI. It introduces a new framework for evaluating such systems, focusing on the utilization and evolution of knowledge, offering a vision and tools for future research directions.
arXiv
Summary
07-19 ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities
Institution: NVIDIA
The paper presents Llama3-ChatQA-2-70B, a model designed to bridge the gap between open-access LLMs and proprietary models; it handles contexts of up to 128K tokens and achieves performance comparable to GPT-4-Turbo on various benchmarks.
arXiv
Summary
07-19 Internal Consistency and Self-Feedback in Large Language Models: A Survey
Institution: Renmin University of China, Institute for Advanced Algorithms Research, Shanghai, Beijing Institute of Technology
This paper introduces the concepts of Internal Consistency and Self-Feedback to address consistency and hallucination issues in large language models, providing a new lens to understand and enhance these models and anticipates future research directions.
arXiv
Summary
GitHub
07-18 CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
Institution: Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen
This work introduces the Chain-of-Diagnosis (CoD), a diagnostic method meant to improve the interpretability of LLMs in disease diagnosis. It effectively generates training data through synthetic cases combined with disease encyclopedia data, resulting in the development of the DiagnosisGPT model. Experiments demonstrate that DiagnosisGPT performs better than other LLMs across numerous diagnostic datasets.
arXiv
Summary
07-16 NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
Institution: Shanghai AI Laboratory, Tsinghua University
The NeedleBench framework and the introduced ATC test offer novel methods to evaluate and enhance the retrieval and reasoning capabilities of LLMs when processing long text data. This is vital for real-world long-context tasks and also highlights the opportunities and challenges faced by current LLMs.
arXiv
Summary
GitHub
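Needle-in-a-haystack probes of the kind NeedleBench extends follow a simple recipe: hide a fact at a controlled depth inside long filler text and ask the model to retrieve it. A minimal generic sketch (not NeedleBench's exact protocol; the needle text and filler below are made up for illustration):

```python
def build_haystack(needle, filler_sentences, depth):
    """Insert a needle sentence at a relative depth (0.0-1.0) in filler text.

    Sweeping `depth` and the amount of filler lets you probe where in a
    long context a model starts losing retrieval accuracy.
    """
    pos = int(depth * len(filler_sentences))
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

# Toy example: hide a fact halfway through ten filler sentences.
filler = [f"Filler sentence {i}." for i in range(10)]
prompt = build_haystack("The secret number is 42.", filler, 0.5)
print("The secret number is 42." in prompt)  # True
```

A real harness would scale the filler to the target context length and score the model's answer against the hidden fact.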
07-16 LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data
Institution: Stanford University, UC Berkeley
The paper presents the LOTUS system, which enables queries based on natural language through the definition of semantic operators, implementing fast and accurate query execution through efficient algorithms and optimizations. LOTUS demonstrates its wide applicability and high performance in multiple real-world application cases, signifying its importance in advancing LM-based large-scale semantic analysis and query systems.
arXiv
Summary
GitHub
07-16 Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild
Institution: University of Washington, Allen Institute for AI, McGill University
This research highlights the issue of personal information leakage in interactions with chatbots. It presents the types of sensitive information shared in these interactions and calls for measures in chatbot design to protect user privacy and maintain appropriate transparency of the content exchanged.
arXiv
Summary
07-15 Think-on-Graph 2.0: Deep and Interpretable Large Language Model Reasoning with Knowledge Graph-guided Retrieval
arXiv
07-15 Qwen2 Technical Report
Institution: Alibaba Group
The Qwen2 series models, as the latest large language models, exhibit excellent performance in multi-task environments such as language understanding, generation, multilingual capabilities, coding, mathematics, and reasoning. The models have also made their weights and resources publicly available in the open-source community, fostering innovation and accessibility. Compared to existing models, Qwen2 shows competitive performance in several benchmarks, especially in terms of multilingualism, showing a wide applicability and global reach.
arXiv
Summary
07-14 Learning to Refuse: Towards Mitigating Privacy Risks in LLMs
Institution: Institute of Artificial Intelligence, Soochow University, China
The paper introduces a novel machine unlearning framework, NAUF, and the accompanying real-world personal data unlearning dataset, RETURN, to evaluate and improve LLMs' performance in privacy protection.
arXiv
Summary
07-12 Human-like Episodic Memory for Infinite Context LLMs
Institution: Huawei Noah’s Ark Lab, University College London
The paper proposes an innovative structure, EM-LLM, by integrating human episodic memory and event cognition into large language models, enabling them to manage practically infinite context lengths while remaining computationally efficient. This research enhances LLMs' capabilities to process expansive contexts and contributes to understanding human memory mechanisms.
arXiv
Summary
07-10 Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
Institution: University of Science and Technology of China, Alibaba Stripe, Zhejiang University
The researchers have developed the Dr. DPO framework, which enhances the robustness of DPO with just an extra line of code. Empirical evaluations show that Dr. DPO significantly improves performance in a wide range of settings, both with and without noise.
arXiv
Summary
GitHub
07-10 Toto: Time Series Optimized Transformer for Observability
Institution: Datadog
The Toto model, developed by Datadog, is a foundation model for time series prediction, specially designed to handle observability data. Its groundbreaking attention mechanism and pre-training strategy significantly improve the performance and efficiency in tackling observability data.
arXiv
Summary
07-09 Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
The paper presents a flexible and scalable platform for multi-agent collaboration, the Internet of Agents (IoA), which overcomes the limitations of existing frameworks and demonstrates superior performance across multiple tasks and application scenarios. Furthermore, the release of the codebase facilitates further development in autonomous agent systems.
arXiv
Summary
GitHub
07-05 AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
Institution: AIRI, Moscow, Russia, Skoltech, Moscow, Russia
AriGraph is an innovative memory architecture that constructs a knowledge graph world model integrating semantic and episodic memories, enhancing the exploratory and planning capabilities of LLM agents. Experiments in the TextWorld environment have proven it to be more effective in handling complex tasks compared to other existing methods.
arXiv
Summary
07-02 RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
Institution: Georgia Tech, NVIDIA
RankRAG is a novel framework that instruction-tunes LLMs to enhance their context ranking and answer generation capabilities within the RAG framework, delivering improved generative performance on multiple benchmarks and demonstrating robust generalization capabilities.
arXiv
Summary
07-02 Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Institution: DeepSeek AI, Northwestern University
The paper proposed ESFT, an efficient fine-tuning method for sparse-architecture LLMs that fine-tunes only the experts most relevant to downstream tasks, maintaining expert specialization and significantly saving computational resources.
arXiv
Summary
GitHub
07-01 We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Institution: Beijing University of Posts and Telecommunications, Tencent Inc., Huazhong University of Science and Technology
This paper introduces WE-MATH, a visual mathematical reasoning benchmark designed to go beyond traditional end-to-end performance assessment and probe how LMMs acquire and generalize knowledge during problem solving. Using a new multi-dimensional evaluation method, it exposes challenges in the inherent reasoning processes of multimodal models and validates the effectiveness of knowledge augmentation strategies, advancing LMMs in visual mathematical reasoning.
arXiv
Summary
GitHub
07-01 AI Agents That Matter
Institution: Princeton University
This paper critiques the current benchmark evaluation methods for AI agents and proposes a series of improvements, aiming to develop intelligent agents that have real-world application value, not just agents that score high on benchmark tests.
arXiv
Summary
07-01 Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
This paper introduces a new evaluation method for large language models and RAG systems in handling long texts through the SummHay task. It presents an original approach with synthesized data generation and an automatic evaluation system, showing that current systems struggle with it and outlining a direction for future improvements.
arXiv
Summary

2024-06

Date | Paper | Links & Summary
06-30 Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning
Institution: Multimedia Laboratory (MMLab), The Chinese University of Hong Kong
The paper presents a new method of mathematical reasoning optimization - SCDPO, which significantly enhances the performance of LLMs in solving mathematical problems by generating training samples that supervise errors at specific steps, demonstrating the potential of this method.
arXiv
Summary
06-29 LiteSearch: Efficacious Tree Search for LLM
Institution: Xiamen University, Tencent AI Lab
The paper contributes by introducing a more efficient tree search algorithm that reduces resource consumption in aiding LLMs to tackle complex mathematical reasoning tasks, while ensuring high performance levels.
arXiv
Summary
06-28 Scaling Synthetic Data Creation with 1,000,000,000 Personas
Institution: Tencent AI Lab Seattle
This paper presents the "Persona Hub," a synthetic data platform focusing on the diversity and richness of the generated data, with a significant concern for the safe and responsible use of synthetic data. Through several use cases, it illustrates the method's advantages in diversity, scalability, flexibility, and ease of use.
arXiv
Summary
06-27 From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
Institution: University of Wisconsin-Madison
The study presents a method to improve LLMs' retrieval and reasoning capabilities in long-context tasks by fine-tuning on synthetic datasets. The approach significantly enhances performance on such tasks without considerably impacting the model's overall abilities, while also reducing hallucinations.
arXiv
Summary
06-27 SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
The paper proposes the SEAKR model, a novel adaptive Retrieval-Augmented Generation model that uses the self-awareness of LLMs’ internal states to dynamically determine when to retrieve and effectively integrate knowledge, thereby enhancing performance in QA tasks.
arXiv
Summary
GitHub
06-26 Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Institution: The Chinese University of Hong Kong, Harbin Institute of Technology (Shenzhen), SmartMore
The paper introduces a new optimization method, Step-DPO, enhancing LLMs' accuracy and robustness in long-chain mathematical reasoning by optimizing individual reasoning steps rather than evaluating answers holistically.
arXiv
Summary
GitHub
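Step-DPO localizes preference optimization to individual reasoning steps rather than whole answers; the underlying DPO objective on a single preference pair can be sketched as a toy scalar version (assuming log-probabilities already summed over tokens; this illustrates the standard DPO loss, not Step-DPO's full pipeline):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on one preference pair: -log(sigmoid(beta * margin)).

    logp_w / logp_l: policy log-probs of the preferred and dispreferred
    continuations; ref_logp_*: the same under the frozen reference model.
    Step-DPO applies this objective to a single reasoning step, with the
    preferred step correct and the dispreferred step erroneous.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))

# With no margin the loss is log(2); rewarding the preferred step lowers it.
print(round(dpo_loss(-1.0, -1.0, -1.0, -1.0), 4))  # 0.6931
```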
06-25 The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Institution: Hugging Face
This paper introduces the FineWeb datasets, highlighting the importance of carefully curating an effective Common Crawl-based pretraining dataset, and demonstrates their contribution to enhancing the performance of large language models.
arXiv
Summary
06-24 WARP: On the Benefits of Weight Averaged Rewarded Policies
Institution: Google DeepMind
The article proposes WARP, a new strategy for LLM alignment, which merges models through weight averaging to address challenges in the RLHF process, thus improving the trade-off between KL and rewards. Experimental evidence suggests that WARP enhances model performance and alignment with human values.
arXiv
Summary
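The core primitive behind weight-averaging strategies like WARP is a parameter-wise average of fine-tuned checkpoints. A minimal sketch of that primitive only (plain Python lists stand in for tensors; the full WARP procedure layers anchoring and interpolation stages on top of this, per the paper):

```python
def average_weights(state_dicts, coeffs=None):
    """Parameter-wise weighted average of model checkpoints.

    state_dicts: list of {param_name: list-of-floats} checkpoints that
    share the same architecture. coeffs: mixing weights (default uniform).
    """
    n = len(state_dicts)
    coeffs = coeffs or [1.0 / n] * n
    merged = {}
    for name in state_dicts[0]:
        merged[name] = [
            sum(c * sd[name][i] for c, sd in zip(coeffs, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return merged

# Two toy "policies" fine-tuned from the same init:
a = {"w": [1.0, 2.0]}
b = {"w": [3.0, 4.0]}
print(average_weights([a, b]))  # {'w': [2.0, 3.0]}
```

With real models the same loop runs over framework tensors (e.g. a PyTorch `state_dict`) instead of lists.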
06-22 Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Institution: OATML, Department of Computer Science, University of Oxford
The paper proposes SEPs as a cost-effective and reliable method for detecting hallucinations, capable of capturing semantic uncertainty directly from the hidden states of LLMs with a single model generation.
arXiv
Summary
06-21 LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
Institution: University of Waterloo
LongRAG is a novel framework for open-domain question-answering tasks that addresses the limitations of traditional RAG by using fewer, larger retrieval units and leveraging the capabilities of long-context LLMs. This yields notable performance improvements through a lighter retrieval workload and a more effective retriever, with long-context LLMs handling zero-shot answer extraction.
arXiv
Summary
06-19 Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
This paper explores the potential of long-context language models to replace existing paradigms and tackle novel tasks through the introduction of the LOFT benchmark. It finds that LCLMs can match the performance of existing retrieval and RAG systems in some tasks, despite not being explicitly trained, and highlights areas where further research is needed to improve performance.
arXiv
Summary
06-18 Nash CoT: Multi-Path Inference with Preference Equilibrium
Institution: Westlake University, University of Cambridge
The study proposes a novel Nash CoT approach, which effectively utilizes the concept of Preference Equilibrium to maintain performance while substantially lowering the deployment costs for LLMs by reducing the number of inference paths necessary.
arXiv
Summary
06-18 Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
This study provides useful insights for future use of LLMs as judges by evaluating the alignment and vulnerabilities of LLMs acting as judges. Key findings include that only some top models are fit to act as judges, and Cohen's Kappa is a better metric of alignment, outperforming percent agreement in distinguishing judges.
arXiv
Summary
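The claim that Cohen's kappa beats percent agreement is easy to see with a degenerate judge: a model that always outputs the majority label scores high raw agreement but zero chance-corrected agreement. A small self-contained illustration (the labels below are invented, not from the paper's data):

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two judges give the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    n = len(a)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# A judge that always says "good" agrees 80% of the time with a reference
# that says "good" 80% of the time, yet its kappa is 0:
ref   = ["good"] * 8 + ["bad"] * 2
judge = ["good"] * 10
print(percent_agreement(ref, judge))  # 0.8
print(cohens_kappa(ref, judge))       # 0.0
```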
06-17 A Survey of AIOps for Failure Management in the Era of Large Language Models
Institution: Peking University, Tsinghua University, The Hong Kong University of Science and Technology (Guangzhou), University of Illinois Chicago
This paper is a comprehensive survey of AIOps technology for failure management in the era of LLMs. It discusses the potential of LLMs to address the challenges faced by existing AIOps methods and outlines the future directions of research.
arXiv
Summary
06-13 Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Institution: Google Research, Google DeepMind, Google
This paper introduces a novel benchmark, ToT, which comprehensively evaluates LLMs' temporal reasoning abilities in various scenarios using synthetic datasets and crowdsourced tasks, and exposes the advantages and shortcomings of these models in temporal reasoning.
arXiv
Summary
06-12 Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Institution: University of Washington, Allen Institute for AI
This paper presents MAGPIE, a novel self-synthesis method for generating large-scale high-quality alignment data without relying on human intervention or prompt engineering, demonstrating the potential of LLMs in automatic data generation and alignment. Experimentation shows that models fine-tuned with MAGPIE excel across various benchmarks, exhibiting the latent capabilities of LLMs in data generation and model alignment.
arXiv
Summary
06-12 Designing a Dashboard for Transparency and Control of Conversational AI
Institution: Harvard University, Google Research
The paper aims to increase the transparency of LLMs within conversational AI systems by designing a visualized user interface: a dashboard that accompanies the chatbot interface. Users can see the system's internal user model in real time and modify it via the interface. Based on user feedback, the dashboard also helps unveil and counteract model biases.
arXiv
Summary
06-12 TasTe: Teaching Large Language Models to Translate through Self-Reflection
Institution: Harbin Institute of Technology, Tencent Inc
The TASTE framework proposed in this paper elevates LLMs' capability in machine translation through a self-reflection process, representing a novel way to harness the translation potential of LLMs. It sets a new benchmark for understanding and utilizing the complex reasoning and language modeling capabilities of LLMs.
arXiv
Summary
GitHub
06-11 Delving into ChatGPT usage in academic writing through excess vocabulary
Institution: Hertie Institute for AI in Brain Health, University of Tübingen, Germany, Tübingen AI Center, Northwestern University
The paper proposes a new, unbiased, large-scale approach to study LLM usage in academic texts and offers an unprecedented quantifiable comparison of the changes in scientific writing induced by LLMs.
arXiv
Summary
06-11 Needle In A Multimodal Haystack
Institution: OpenGVLab, Shanghai AI Laboratory, Fudan University
The presented MM-NIAH benchmark is a novel evaluation platform for advancing MLLM performance in comprehending long multimodal documents. By exposing limitations and challenges of current MLLMs, the paper provides an instrumental platform for further research in long multimodal document comprehension.
arXiv
Summary
GitHub
06-10 Transforming Wearable Data into Health Insights using Large Language Model Agents
Institution: Google LLC
This paper introduces the Personal Health Insights Agent (PHIA), a large language model agent system that transforms wearable device data into personal health insights. Combining code generation and information retrieval tools, PHIA effectively addresses the challenge of deriving personalized health guidance from vast health datasets. Extensive human and automated evaluations demonstrate the accuracy and potential of this approach for addressing real health concerns.
arXiv
Summary
06-10 Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Institution: University of Washington, MetaAI, Allen Institute for AI
HUSKY emerges as the first unified, open-source language agent for multi-step reasoning that resolves the issues of high costs and difficulties in scaling while demonstrating superior performance in multi-task environments, showcasing the potential of open-source language agents.
arXiv
Summary
GitHub
06-10 Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching
Institution: The Chinese University of Hong Kong, Tencent AI Lab, Centre for Perceptual and Interactive Intelligence
The paper introduces SELF-TUNING, a framework aimed at improving LLMs' knowledge acquisition capability via self-teaching and validates its effectiveness on crucial knowledge acquisition tasks using the Wiki-Newpages-QA datasets.
arXiv
Summary
06-10 Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies
Institution: Duke University, AWS AI Labs
The study presents a framework for LLM reasoning strategies evaluation that considers compute budget and demonstrates the ability of simple strategies to outperform complex ones with equal computational resources. By highlighting the importance of self-evaluation, it sets the groundwork for more efficient budget use and the development of more effective reasoning strategies.
arXiv
Summary
06-09 Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses
Institution: University of Washington, University of Washington - Bothell
The study underscores the deficiencies in LLMs' social reasoning and the potential improvement by integrating human intentions and emotions. The findings highlight the need for LLMs to comprehend human-like mental states for effective social reasoning in open-ended questions, pointing out a key direction for future advancement.
arXiv
Summary
06-07 Mixture-of-Agents Enhances Large Language Model Capabilities
Institution: Duke University, Together AI, University of Chicago
This paper showcases the Mixture-of-Agents (MoA) methodology for enhancing the capabilities of LLMs in understanding and generating natural language by leveraging the group expertise of multiple models. Through experimentation, the method has been validated to significantly improve performance, achieving state-of-the-art results on multiple competitive benchmarks.
arXiv
Summary
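The MoA design described above is layered: proposer models draft answers, later layers see earlier answers, and an aggregator synthesizes a final response. A schematic sketch with plain callables standing in for LLM calls (the real system wires these to different model APIs; prompt wording here is an assumption):

```python
def mixture_of_agents(prompt, proposers, aggregator, layers=2):
    """Layered MoA sketch.

    proposers / aggregator: callables mapping a prompt string to an answer
    string. Each layer after the first lets proposers refine their answers
    with the previous layer's outputs in context; the aggregator then
    synthesizes the final response from all candidates.
    """
    responses = [p(prompt) for p in proposers]
    for _ in range(layers - 1):
        augmented = prompt + "\n\nPrevious answers:\n" + "\n".join(responses)
        responses = [p(augmented) for p in proposers]
    final_prompt = prompt + "\n\nCandidate answers:\n" + "\n".join(responses)
    return aggregator(final_prompt)
```

Usage with toy stand-ins: `mixture_of_agents("Q?", [lambda p: "A1", lambda p: "A2"], lambda p: "merged")` runs two proposer layers and returns the aggregator's synthesis.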
06-07 WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
WILDBENCH provides an evaluation framework that incorporates real user task challenges, automated indicators, and interpretive checklists, enabling more accurate assessments of Large Language Models' performance in complex tasks.
arXiv
Summary
06-06 FastGAS: Fast Graph-based Annotation Selection for In-Context Learning
Institution: Department of ECE, University of Virginia
The FastGAS approach proposed in the paper significantly improves the diversity and representativeness of selected instances for ICL while also considerably reducing the time and computational resources required. The experimental results verify its efficiency and efficacy on multiple datasets, demonstrating its potential as an effective instance selection method.
arXiv
Summary
06-06 Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models
Institution: Peking University, UC Berkeley, Stanford University
BoT enhances the accuracy, efficiency, and robustness of reasoning in LLMs by providing a meta-buffer to store high-level thought templates. It overcomes the limitations of existing methods and demonstrates significant performance gains.
arXiv
Summary
GitHub
06-06 The Prompt Report: A Systematic Survey of Prompting Techniques
This paper offers a comprehensive survey of prompting techniques, systematically analyzing the concept, types, and applications of prompts and making an extensive meta-analysis of the literature.
arXiv
Summary
06-04 Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models
Institution: Zhejiang University, School of Engineering (Westlake University), Shanghai AI Laboratory
The paper introduces a novel collaborative approach for addressing the task of cross-document event coreference resolution. By combining the universal capabilities of LLMs with task-specific SLMs, the performance of the model was significantly enhanced.
arXiv
Summary
06-04 To Believe or Not to Believe Your LLM
Institution: Google DeepMind
This paper focuses on the study and introduction of a novel information-theoretical metric to quantify uncertainty in large language models, specifically for the phenomenon of hallucinations during response generation. This research offers new insights and solutions for identifying and addressing hallucinations in LLMs.
arXiv
Summary
06-03 Self-Improving Robust Preference Optimization
Institution: Cohere
SRPO successfully alleviates the task dependency problem by demonstrating robustness to task variations within a theoretically grounded offline RLHF framework. It offers a simpler training and deployment process through the optimization of a non-adversarial offline loss. Experimental results indicate that SRPO outperforms existing methods across different environments, including OOD settings.
arXiv
Summary
06-03 Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
Institution: Beijing Jiaotong University, Alibaba Group
Mobile-Agent-v2 is a multi-agent architecture designed to effectively tackle navigation challenges in mobile device operation tasks, particularly task progress and focus content navigation, significantly improving task completion rates over traditional single-agent architectures.
arXiv
Summary
GitHub

2024-05

Date | Paper | Links & Summary
05-31 Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Institution: Princeton University, Carnegie Mellon University
The paper presents the novel State Space Duality (SSD) framework, linking structured state space models (SSMs) with variants of attention mechanisms. Key contributions include applying optimizations originally developed for Transformers to SSMs and a new SSD algorithm that significantly improves the efficiency of model training and inference. The resulting Mamba-2 architecture demonstrates strong performance, paving the way for future deep learning model design and optimization.
arXiv
Summary
05-31 Preemptive Answer "Attacks" on Chain-of-Thought Reasoning
Institution: Tsinghua University
The paper investigates the negative impact of preemptive answers on the reasoning capabilities of LLMs and proposes strategies for mitigation. The experimental results indicate that these strategies cannot completely neutralize the impact, pointing to a need for further enhancement of CoT robustness.
arXiv
Summary
05-30 Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts
Institution: Ant Group
METRAG offers a novel framework for retrieval-augmented generation that addresses the limitations of current models by incorporating utility and compactness-oriented thinking, and it exhibits enhanced performance in knowledge-intensive tasks.
arXiv
Summary
05-30 Jina CLIP: Your CLIP Model Is Also Your Text Retriever
arXiv
05-29 LLMs achieve adult human performance on higher-order theory of mind tasks
Institution: Google Research, Google DeepMind, Johns Hopkins University Applied Physics Lab
The study showcases the performance of LLMs on higher-order Theory of Mind (ToM) tasks, specifically demonstrating that models like GPT-4 can achieve adult-level performance on some tasks. The introduction of a new benchmark based on an adult human benchmark helps to reveal and understand the potential and limitations of LLMs in complex social interactions.
arXiv
Summary
05-29 MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series
arXiv
05-28 RealitySummary: On-Demand Mixed Reality Document Enhancement using Large Language Models
Institution: University of Calgary
This paper introduces the RealitySummary system, which combines large language models with mixed reality technology to provide an on-demand reading assistant. It highlights the potential for practical application of this technology and establishes directions for future research.
arXiv
Summary
05-23 HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models
Institution: The Ohio State University, Stanford University
HippoRAG is a novel, neurobiologically inspired retrieval framework addressing the limitations of conventional LLMs in long-term memory and knowledge integration. By simulating the structure and mechanisms of the human brain, HippoRAG has significantly enhanced LLMs' capability to handle complex tasks involving knowledge integration, outperforming existing methods in both efficiency and effectiveness.
arXiv
Summary
GitHub
05-23 Agent Planning with World Knowledge Model
Institution: Zhejiang University, Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph, National University of Singapore, Alibaba Group
This paper introduces a parametric World Knowledge Model (WKM) to enhance the performance of Large Language Models (LLMs) executing interactive planning tasks. The model utilizes knowledge from expert and exploratory trajectories and has been validated through comparisons with various strong baselines in simulated environments, addressing the issues of hallucinatory action generation and aimless trial-and-error.
arXiv
Summary
GitHub
05-23 Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration
Institution: Tsinghua University, Northwestern Polytechnical University, Shanghai AI Laboratory
This paper proposes the ReAd framework to address the effective planning for LLMs in multi-agent collaborative tasks, proving its capability to reduce interaction rounds and enhance success rates, thus laying the groundwork for the application of LLMs in multi-agent systems.
arXiv
Summary
05-23 RaFe: Ranking Feedback Improves Query Rewriting for RAG
Institution: Zhejiang University, Alibaba Group, Nanjing University
RaFe presents a novel framework for query rewriting using reranker feedback, requiring no annotations, supporting offline and online feedback training, and showcasing adaptability and effectiveness.
arXiv
Summary
05-23 RefChecker: Reference-based Fine-grained Hallucination Checker and Benchmark for Large Language Models
Institution: Amazon AWS AI, Shanghai AI Lab, Shanghai Jiaotong University
REFCHECKER is a framework that detects fine-grained hallucinations in LLMs and benchmarks them. It detects and verifies factual inconsistencies in responses with high precision and strong alignment with human judgments.
arXiv
Summary
GitHub
05-23 PerLLM: Personalized Inference Scheduling with Edge-Cloud Collaboration for Diverse LLM Services
Institution: Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences
This paper introduces the PerLLM framework that leverages edge-cloud collaboration to handle a large volume of inference services, significantly enhancing service scheduling and resource allocation, thereby increasing throughput and reducing energy costs, showcasing its substantial applicative value.
arXiv
Summary
05-23 AGILE: A Novel Framework of LLM Agents
Institution: ByteDance Research, University of Science and Technology of China, Shanghai Jiao Tong University
The paper proposed a new framework for LLM agents known as AGILE, which streamlines different components and leverages reinforcement learning to achieve end-to-end training. The framework showcases superior performance in complex QA tasks, underscoring the efficacy of component integration and end-to-end optimization. The release of the dataset and code encourages further research in this area.
arXiv
Summary
05-21 G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translation
Institution: ByteDance Research
The paper presents the G-DIG method, a gradient-based approach for selecting high-quality and diverse instruction finetuning data for machine translation, validated by its effectiveness and generalizability through experimental verification.
arXiv
Summary
05-21 SmartFlow: Robotic Process Automation using LLMs
Institution: TCS Research
SmartFlow is an AI-based RPA system that integrates deep learning vision understanding with LLMs to autonomously generate navigation workflows and execute user-assigned tasks, demonstrating efficiency in adapting to GUI changes and handling complex tasks.
arXiv
Summary
05-20 OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
Institution: OpenLLMAI Team, ByteDance Inc., Netease Fuxi AI Lab
OpenRLHF is an open-source framework that enables full-scale RLHF training on models with over 70 billion parameters. It employs distributed computing with Ray and efficiency optimization with vLLM, while also implementing multiple alignment algorithms, offering a plug-and-play experience with seamless integration with the Hugging Face library.
arXiv
Summary
GitHub
05-20 Octo: An Open-Source Generalist Robot Policy
Institution: UC Berkeley, Stanford
The paper introduces Octo, a transformer-based policy that provides an open-source solution to a variety of robotic tasks, capable of adapting to new observations and action spaces through finetuning. It demonstrates superior performance on multiple robot platforms and encourages broad application and further development through its fully open source code.
arXiv
Summary
05-20 xFinder: Robust and Pinpoint Answer Extraction for Large Language Models
Institution: Institute for Advanced Algorithms Research, Shanghai, Renmin University of China
The focus of the paper is the introduction of a method called xFinder, which aims to improve the accuracy of extracting key answers from LLM outputs. It addresses gaps not met by existing methods and provides a more reliable approach for evaluating LLMs.
arXiv
Summary
GitHub
05-20 Multiple-Choice Questions are Efficient and Robust LLM Evaluators
Institution: Shanghai Jiao Tong University
The study successfully converted conventional open-ended generation problems into a multiple-choice format, significantly improving the efficiency and accuracy of LLM evaluations. This method has made strides in preventing the impact of invalid answers and enhancing evaluation efficiency.
arXiv
Summary
GitHub
05-19 Your Transformer is Secretly Linear
Institution: AIRI, Skoltech, SberAI
This study shows a surprisingly high degree of linearity in the transformations between successive transformer layers, challenging the conventional view of transformers as strongly non-linear, and finds that models can be modified for efficiency without sacrificing performance.
arXiv
Summary
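The paper's claim can be probed with a simple least-squares fit; below is a toy illustration on synthetic, nearly-linear data (the dimensions, sample count, and noise level are our own assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearity probe (illustrative): fit the best linear map from
# "layer k" states to "layer k+1" states and report the R^2 of the fit.
d, n = 32, 500
X = rng.normal(size=(n, d))                    # states entering a layer
W = rng.normal(size=(d, d)) / np.sqrt(d)
Y = X @ W + 0.01 * rng.normal(size=(n, d))     # nearly linear next-layer states

A, *_ = np.linalg.lstsq(X, Y, rcond=None)      # best linear approximation
resid = Y - X @ A
r2 = 1.0 - (resid ** 2).sum() / ((Y - Y.mean(0)) ** 2).sum()
print(round(r2, 4))  # close to 1.0 => the layer-to-layer map is almost linear
```

Running the same probe on real hidden states from consecutive layers is what the paper's linearity analysis amounts to at its core.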
05-17 Prompt Exploration with Prompt Regression
Institution: Carnegie Mellon University, Massachusetts Institute of Technology, University of Michigan
This paper introduces a novel framework, PEPR, for predicting the impact of prompt element combinations in LLMs and selecting effective prompts for specific tasks. The framework not only brings an innovative solution but also demonstrates its effectiveness through evaluations on multiple datasets and tasks.
arXiv
Summary
05-16 Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models
Institution: Nanyang Technological University, University of Science and Technology of China, University of Aberdeen
The paper successfully proposes and validates a new multimodal LLM incorporating ASR error correction paradigm, addressing issues of source speech disregard and input redundancy, and showing significant improvements in practical applications.
arXiv
Summary
05-16 SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
Institution: Amazon, The University of Texas at Austin
SYNTHESIZRR addresses the issue of insufficient diversity and stylistic deviation from human text in past synthetic data approaches by using retrieval augmentation. It improves upon the generation of synthetic examples with greater variety and a closer resemblance to human writing, enhancing the performance of distilled models.
arXiv
Summary
05-16 MarkLLM: An Open-Source Toolkit for LLM Watermarking
Institution: Tsinghua University, Shanghai Jiao Tong University, The University of Sydney
MARKLLM provides a versatile and accessible platform for researchers and the public to experiment with and understand LLM watermarking, driving further developments in research and application.
arXiv
Summary
GitHub
05-16 Thinking Fair and Slow: On the Efficacy of Structured Prompts for Debiasing Language Models
Institution: BITS Pilani, MDSR Labs, Adobe, IIT Guwahati, National University of Singapore
The research developed and evaluated an iterative debiasing framework aimed at end-users, offering a non-training-based approach to mitigating biases in LLMs. This method employs complex prompting strategies that significantly decrease the mean bias in outputs without compromising downstream task performance, paving the way for future research into prompt-based debiasing methods for LLMs.
arXiv
Summary
05-15 LoRA Learns Less and Forgets Less
Institution: Columbia University, Databricks
Although LoRA often does not match the learning efficiency and accuracy of full parameter finetuning on target tasks, it exhibits better performance and stronger regularization capabilities in maintaining source task performance. Based on the study, recommendations are made for best practices when finetuning with LoRA, particularly noting the sensitivity to learning rates, choice of target modules, and rank of perturbations.
arXiv
Summary
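For context, the low-rank update that LoRA trains can be sketched in a few lines (toy sizes; the `alpha`, `r`, and zero-initialized `B` follow the standard LoRA formulation, not this paper's experiments):

```python
import numpy as np

# Minimal LoRA sketch (illustrative): the frozen weight W is augmented
# with a trainable low-rank update B @ A, scaled by alpha / r.
rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 16, 16, 4, 8
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-init

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x); with B = 0 this equals the base model.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # zero-init => identical output
```

Only `A` and `B` receive gradients during finetuning, which is why the paper's observations about learning rate and rank sensitivity matter so much in practice.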
05-15 ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models
Institution: Microsoft Research Asia, Harvard University, Peking University
The ALPINE project explored how autoregressive learning in Transformers facilitates network planning capabilities and revealed the competencies and limitations of Transformers in executing path-finding tasks, offering new insights into the general planning capabilities of large language models in related domains.
arXiv
Summary
05-14 Is the Pope Catholic? Yes, the Pope is Catholic. Generative Evaluation of Intent Resolution in LLMs
Institution: Carnegie Mellon University, Allen Institute for AI
This study introduces a novel generative evaluation framework exploring the potential and challenges of LLMs in understanding and generating intent-aligned responses, revealing significant shortcomings in pragmatic understanding and pointing out directions for future improvements.
arXiv
Summary
05-13 RLHF Workflow: From Reward Modeling to Online RLHF
Institution: Salesforce AI Research, University of Illinois Urbana-Champaign
The paper presents a comprehensive workflow for online iterative RLHF, which is innovative theoretically and offers a practical application framework through its detailed implementation guide.
arXiv
Summary
GitHub
05-13 DoLLM: How Large Language Models Understanding Network Flow Data to Detect Carpet Bombing DDoS
arXiv
05-10 Automatic Generation of Model and Data Cards: A Step Towards Responsible AI
Institution: CMU, MPI, ETH Zürich
The paper effectively develops a method to automate the generation of ML model cards and data cards using large LLMs, significantly enhancing the quality and standardization of the documentation through the creation of a corresponding dataset and evaluation mechanisms.
arXiv
Summary
05-10 Mitigating Hallucinations in Large Language Models via Self-Refinement-Enhanced Knowledge Retrieval
Institution: Imperial College London, Huawei
This work effectively reduces hallucinations in large language models through a novel Self-Refinement Enhanced Knowledge Graph Retrieval method, particularly enhancing practical application efficacy in the medical field.
arXiv
Summary
05-10 UniDM: A Unified Framework for Data Manipulation with Large Language Models
Institution: Alibaba Group, University of Science and Technology of China
UniDM is an innovative unified framework for data manipulation that significantly enhances the efficiency and quality of processing diverse data tasks through effective prompt design and task decomposition.
arXiv
Summary
05-10 A Survey on RAG Meets LLMs: Towards Retrieval-Augmented Large Language Models
arXiv
05-10 Value Augmented Sampling for Language Model Alignment and Personalization
Value Augmented Sampling (VAS) offers an efficient and powerful solution for adapting and personalizing LLMs. It overcomes the instabilities of existing RL algorithms, achieves both high performance and computational efficiency, supports adaptation of black-box models, and paves the way for personalized and aligned LLMs.
arXiv
Summary
05-09 LLMPot: Automated LLM-based Industrial Protocol and Physical Process Emulation for ICS Honeypots
Institution: New York University Abu Dhabi
LLMPot represents a novel ICS network defense tool that leverages the capabilities of LLMs. By automating the generation of responses closely related to protocols and physical processes, LLMPot significantly enhances the practicality and effectiveness of honeypots.
arXiv
Summary
05-09 Exploring the Potential of Human-LLM Synergy in Advancing Qualitative Analysis: A Case Study on Mental-Illness Stigma
The CHALET framework illustrates the vast potential of human-LLM collaboration in qualitative research, especially in deepening understanding and generating insights, offering a new direction for future studies in HCI and qualitative analysis.
arXiv
Summary
05-09 An Automatic Prompt Generation System for Tabular Data Tasks
The paper successfully develops an auto-prompt generation system compatible with various LLMs without extensive training, significantly enhancing the performance of tabular data tasks through two innovative methods.
arXiv
Summary
05-09 Can large language models understand uncommon meanings of common words?
Institution: Tsinghua University, Chinese Academy of Science
This study reveals significant shortcomings in large language models' understanding of the uncommon meanings of common words by establishing a new assessment framework and dataset, offering a new direction for enhancing models' NLU capabilities.
arXiv
Summary
05-08 ADELIE: Aligning Large Language Models on Information Extraction
Institution: Tsinghua University
The ADELIE models proposed in this paper effectively address the alignment issues of LLMs in information extraction tasks, improving performance via novel datasets and training methods while maintaining robust general capabilities. This provides valuable insights and a foundation for future research in this area.
arXiv
Summary
05-08 "They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
Institution: University of Washington, MBZUAI
This study reveals potential harms in complex social interactions involving a wide range of cultures and identities that LLMs might cause through the innovative CHAST assessment system, emphasizing the necessity of thorough bias audits before deploying these models.
arXiv
Summary
05-08 Air Gap: Protecting Privacy-Conscious Conversational Agents
arXiv
05-07 Toward In-Context Teaching: Adapting Examples to Students' Misconceptions
Institution: MIT CSAIL
This paper successfully demonstrates the potential of using large language models for adaptive teaching and achieves effective identification of student misconceptions and optimization of teaching feedback through the ATOM model.
arXiv
Summary
05-07 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Institution: MIT, NVIDIA
With its novel quantization algorithm and system design, QServe significantly enhances the efficiency of LLM servicing on GPUs, dramatically reducing costs and providing a new solution for deploying large-scale language models.
arXiv
Summary
GitHub
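To illustrate just the 4-bit weight side of the W4A8KV4 scheme, here is a per-channel symmetric INT4 quantization round trip (toy shapes of our choosing; not QServe's fused kernels or its system design):

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-channel symmetric INT4 weight quantization sketch (the "W4" in
# W4A8KV4): each output channel gets its own scale, values are rounded
# into the signed 4-bit range, then dequantized to check the error.
W = rng.normal(size=(4, 64)).astype(np.float32)

scale = np.abs(W).max(axis=1, keepdims=True) / 7.0   # int4 range: [-8, 7]
Wq = np.clip(np.round(W / scale), -8, 7).astype(np.int8)
W_hat = Wq.astype(np.float32) * scale                # dequantized weights

err = np.abs(W - W_hat).max()                        # bounded by scale / 2
print(Wq.min(), Wq.max(), round(float(err), 3))
```

The system-level contribution of QServe is making such low-bit weights (plus 8-bit activations and 4-bit KV cache) fast on real GPUs, which this sketch does not attempt to show.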
05-07 Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation
Institution: Center for Responsible AI, IIT Madras, Princeton University
The paper effectively demonstrates the deceptive capabilities of autonomous agents using large language models in a goal-driven environment performing complex tasks like legislative lobbying and proposes an effective method for detecting such deceptive behaviors. These findings provide significant insights into the application of AI in legal and ethical contexts, while also advocating for new research directions in AI safety.
arXiv
Summary
05-07 Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application
Institution: Kuaishou Technology, Southeast University
The paper successfully applies the open-world knowledge of large language models to recommendation systems, addressing core challenges in practical applications through an innovative twin-tower structure, providing new insights into enhancing RS performance.
arXiv
Summary
05-06 MARE: Multi-Agents Collaboration Framework for Requirements Engineering
Institution: Peking University
This research presents a novel Multi-Agent Collaboration Framework, MARE, for leveraging collaboration between Large Language Models (LLMs) throughout the entire Requirements Engineering process. It addresses limitations in the automation of RE tasks and demonstrates superiority in requirement modeling and specification generation, as verified by extensive experimental evaluation.
arXiv
Summary
05-06 Lifelong Knowledge Editing for LLMs with Retrieval-Augmented Continuous Prompt Learning
Institution: East China Normal University
The RECIPE method efficiently improves editing efficiency and inference speed in LLMs within lifelong learning scenarios by transforming knowledge statements into continuous prompts and utilizing Knowledge Sentinel for dynamic retrieval management. This approach overcomes the limitations of previous methods and performs excellently across multiple evaluation metrics while maintaining overall model performance.
arXiv
Summary
05-03 What matters when building vision-language models?
Institution: Hugging Face, Sorbonne Université
This paper thoroughly investigates the critical design choices impacting VLMs' performance through extensive experiments, introduces the efficient foundational VLM Idefics2, and proves its superior performance in multiple standard tests.
arXiv
Summary
05-02 Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Institution: KAIST AI, LG AI Research, Carnegie Mellon University
PROMETHEUS 2 is an innovative open evaluator LM that can operate in both direct assessment and pairwise ranking formats while correlating closely with human judgments and proprietary LM evaluations on custom criteria. The model outperforms other open models and even some proprietary models, thanks to its training using weight merging.
arXiv
Summary
GitHub
05-02 How Can I Get It Right? Using GPT to Rephrase Incorrect Trainee Responses
Institution: Carnegie Mellon University
The paper investigates the construction of an automated feedback system using GPT-4 to assist in the training of tutors in one-on-one classes, aiming to alleviate the resource burden of traditional personalized instructional feedback and provide high-quality and specific feedback. It falls under the category of knowledge retrieval and evaluation research.
arXiv
Summary
05-01 Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3
This study provides an empirical evaluation of model editing techniques in LLMs, revealing potential shortcomings in previous methods and proposing new directions and insights for future model editing approaches.
arXiv
Summary
05-01 "I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust
Institution: Princeton University, Microsoft
The paper, through large-scale experiments, demonstrates that by expressing uncertainty in natural language, LLMs can reduce user overreliance and improve accuracy in task performance. Specifically, first-person expressions have a significant effect on improving user accuracy. Moreover, the research emphasizes the importance of user testing before the practical application of LLMs to adjust the way uncertainty is communicated.
arXiv
Summary
05-01 The Real, the Better: Aligning Large Language Models with Online Human Behaviors
Institution: Baidu Inc.
This paper introduces a novel framework, RLHB, for aligning large language models with real online human behaviors innovatively, overcoming the limitations of current approaches and effectively demonstrating its methods through experimentation.
arXiv
Summary
05-01 A Careful Examination of Large Language Model Performance on Grade School Arithmetic
arXiv
05-01 Can a Hallucinating Model help in Reducing Human "Hallucination"?
Institution: Stanford University, UC Berkeley
This paper explores how to use Large Language Models (LLMs) to detect and combat unwarranted beliefs, as well as to leverage LLMs as personalized misinformation debunking agents. The researchers propose new methods to assess and utilize LLMs' capabilities in identifying logical pitfalls and challenge human unwarranted beliefs.
arXiv
Summary

2024-04

 Date   Paper Links & Summary
04-30 Iterative Reasoning Preference Optimization
Institution: FAIR at Meta, New York University
The paper proposes an iterative reasoning preference optimization method, which applies preference optimization to reasoning tasks, particularly Chain-of-Thought (CoT) reasoning, and enhances model performance by adding an NLL loss term during iterative training. Experiments show that the method steadily improves reasoning performance over several iterations before reaching saturation.
arXiv
Summary
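A toy rendering of the combined objective described above (the numbers are our own illustrative inputs; the real method computes token-level log-probabilities from the policy and a reference model over preferred/rejected CoT pairs):

```python
import math

beta, alpha = 0.1, 1.0  # assumed hyperparameters for illustration

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def iterative_rpo_loss(lp_w, lp_l, ref_w, ref_l, n_tokens_w):
    # Standard DPO term on the (winner, loser) pair, plus a
    # length-normalized NLL term on the preferred sequence.
    dpo = -math.log(sigmoid(beta * ((lp_w - ref_w) - (lp_l - ref_l))))
    nll = -lp_w / n_tokens_w
    return dpo + alpha * nll

# Summed log-probs under the policy (lp_*) and the reference (ref_*).
loss = iterative_rpo_loss(lp_w=-12.0, lp_l=-20.0, ref_w=-14.0, ref_l=-18.0,
                          n_tokens_w=10)
print(round(loss, 3))
```

In the iterative scheme, each round generates new CoT pairs with the current model, scores them for correctness, and retrains with this objective.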
04-30 Multi-hop Question Answering over Knowledge Graphs using Large Language Models
Institution: Microsoft
The paper presents different strategies for multi-hop question-answering tasks across various KG datasets, demonstrating the potent capabilities of large pre-trained language models in complex QA tasks. Through experiments, the paper validates the superiority of the proposed approach over current technologies.
arXiv
Summary
04-30 Better & Faster Large Language Models via Multi-token Prediction
Institution: FAIR at Meta
The paper proposes a novel method for training large language models by predicting multiple tokens instead of a single one, improving sample efficiency and demonstrating how to boost performance in generative tasks and speed up inference. Experiments confirm the significant advantages of this approach in enhancing the performance and inference speed of large models.
arXiv
Summary
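A minimal sketch of the multi-token training signal, with toy sizes and independent output heads (the `heads`, `vocab`, and `n_future` values here are our own assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n_future = 50, 8, 4  # predict 4 future tokens per position

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One shared trunk state per position; n_future separate output heads.
heads = [rng.normal(size=(vocab, d)) for _ in range(n_future)]

def multi_token_loss(h, targets):
    # Averaged cross-entropy: head i predicts the token i+1 steps ahead.
    assert len(targets) == n_future
    loss = 0.0
    for head, t in zip(heads, targets):
        probs = softmax(head @ h)
        loss += -np.log(probs[t])
    return loss / n_future

h = rng.normal(size=d)
loss = multi_token_loss(h, [3, 17, 42, 8])
print(round(float(loss), 3))
```

The extra heads are only needed at training time; at inference they can be dropped, or reused to draft several tokens at once for faster decoding.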
04-30 Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom
Institution: Shanghai Jiao Tong University
The study introduces SwordsmanImp, a novel Chinese multi-turn dialogue dataset for evaluating LLMs' ability to understand implicatures in context-rich, turn-taking dialogues, revealing the challenges and limitations of LLMs in understanding and explaining non-literal meanings.
arXiv
Summary
GitHub
04-29 Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Institution: Cohere
The paper develops a new method for evaluating LLM generations, called PoLL, which consists of a “jury” of smaller models from different families, showing applicability in varying tasks, cost-efficiency, and reduced bias of LLMs as judges.
arXiv
Summary
04-29 LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report
Institution: Predibase
This paper proposes that fine-tuning large language models through LoRA can significantly improve the overall performance of the models, reduce errors in classification tasks, and outperform out-of-the-box models like GPT-4 and GPT-3.5. Additionally, the paper takes into account cost constraints, reducing the financial burden of using LLM APIs by limiting the number of evaluation samples.
arXiv
Summary
04-26 When to Trust LLMs: Aligning Confidence with Response Quality
Institution: Alibaba Group
This paper presents a method for aligning confidence and answer quality through reinforcement learning (CONQORD). It optimizes confidence levels through self-assessment in the absence of an objective standard, reducing bias and improving the accuracy and alignment of model predictions, though further improvements are needed to match the performance of more effective methods.
arXiv
Summary
04-26 A Comprehensive Evaluation on Event Reasoning of Large Language Models
Institution: Peking University, Advanced Institute of Big Data, Beihang University
This paper comprehensively evaluates the event reasoning capabilities of LLMs by introducing a new benchmark, EV2. The findings suggest that despite having capabilities for event reasoning, LLMs do not align with humans in using event schema knowledge, and with explicit guidance, they can better understand and execute event reasoning tasks.
arXiv
Summary
04-25 How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Institution: Shanghai AI Laboratory, SenseTime Research, Tsinghua University
InternVL 1.5 is a robust open-source multimodal language model aimed at closing the performance gap between open-source and commercial models in multimodal understanding. Its strengths include enhanced visual understanding, handling dynamic high-resolution images, and the use of a high-quality bilingual dataset, making it perform well across various tasks.
arXiv
Summary
GitHub
04-25 Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare
arXiv
04-25 Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
Institution: Meta, University of Toronto, Carnegie Mellon University
LayerSkip presents a novel, practical solution that significantly accelerates inference in LLMs without compromising on accuracy, showcasing its potential in real-world applications.
arXiv
Summary
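The early-exit half of the idea can be sketched with a toy model (random weights, a shared decoding head, and an assumed confidence threshold `tau`; LayerSkip's actual recipe also involves layer-dropout training and self-speculative verification, which this omits):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy early exit: after each layer, decode with a shared head and stop
# as soon as the top token's probability clears the threshold.
d, vocab, n_layers, tau = 16, 50, 8, 0.5
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
head = rng.normal(size=(vocab, d))

def early_exit(h, tau=tau):
    for i, L in enumerate(layers, start=1):
        h = np.tanh(L @ h)
        p = softmax(head @ h)
        if p.max() >= tau:           # confident enough: exit at layer i
            return i, int(p.argmax())
    return n_layers, int(p.argmax())  # fell through: use the last layer

exit_layer, token = early_exit(rng.normal(size=d))
print(exit_layer, token)
```

In the self-speculative variant, tokens drafted by such early exits are then verified by the remaining layers of the same model, avoiding a separate draft model.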
04-25 Continual Learning of Large Language Models: A Comprehensive Survey
Institution: Rutgers University, Wuhan University, Huazhong University of Science and Technology
The survey provides a comprehensive view on the continual learning of LLMs, with a particular emphasis on the under-explored research areas of continual pre-training (CPT) and domain-adaptive pre-training (DAP). It highlights the need for greater attention from the community, especially in the development of practical, accessible, and widely accepted evaluation benchmarks, as well as methodologies tailored for the emerging learning paradigms of large language models.
arXiv
Summary
GitHub
04-24 From Local to Global: A Graph RAG Approach to Query-Focused Summarization
Institution: Microsoft Research, Microsoft Strategic Missions and Technologies, Microsoft Office of the CTO
The paper presents Graph RAG, a query-focused summarization method that combines graph indexing with LLM-generated community summaries to handle corpora whose size exceeds what a large language model can process directly. Assisted by community detection algorithms, the approach achieves remarkable results on global questions and large-scale text analysis.
arXiv
Summary
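A stripped-down sketch of the indexing side (the per-document entity lists are stubbed in place of LLM extraction, connected components stand in for a real community-detection algorithm, and a string placeholder stands in for the LLM-written community summary):

```python
from collections import defaultdict, deque

# Documents reduced to their extracted entities (assumed inputs).
docs = [
    ["alice", "bob"],
    ["bob", "carol"],
    ["dave", "erin"],
]

# Build an entity co-occurrence graph.
graph = defaultdict(set)
for ents in docs:
    for a in ents:
        for b in ents:
            if a != b:
                graph[a].add(b)

def communities(g):
    # BFS over connected components (stand-in for community detection).
    seen, out = set(), []
    for node in list(g):
        if node in seen:
            continue
        comp, q = set(), deque([node])
        while q:
            n = q.popleft()
            if n in comp:
                continue
            comp.add(n)
            q.extend(g[n] - comp)
        seen |= comp
        out.append(sorted(comp))
    return out

# Placeholder for LLM-written community summaries.
summaries = {tuple(c): "summary of " + ", ".join(c) for c in communities(graph)}
print(len(summaries))
```

At query time, the method answers a global question by map-reducing over these community summaries instead of retrieving raw chunks.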
04-24 Beyond Chain-of-Thought: A Survey of Chain-of-X Paradigms for LLMs
Institution: Shanghai Jiao Tong University, UC San Diego, Duke University
The article is a detailed survey of Chain-of-X (CoX) methods in Large Language Models (LLMs), focusing on extending the Chain-of-Thought (CoT) concept to broader applications and providing potential directions for future research.
arXiv
Summary
04-23 A Survey of Large Language Models on Generative Graph Analytics: Query, Learning, and Applications
Institution: Hong Kong Baptist University
This work is a survey that investigates research on LLMs applied to graph data, discusses the advantages of LLMs in providing general solutions for graph tasks, and suggests future directions for research in this field.
arXiv
Summary
04-23 CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies
Institution: Stanford University, IBM Research
The paper presented a pipeline for building cultural knowledge bases and created CultureBank, a knowledge base including cultural descriptors from TikTok and Reddit. The paper further assessed LLMs' cultural awareness using this repository and trained more culturally-conscious language models to promote the development of culturally-aware language technologies in the future.
arXiv
Summary
GitHub
04-22 SnapKV: LLM Knows What You are Looking for Before Generation
Institution: University of Illinois Urbana-Champaign, Cohere, Princeton University
This paper introduces SnapKV, a novel approach to tackling the Key-Value cache problem in large language models. SnapKV intelligently compresses and selects important KV positions to significantly improve decoding speed and memory efficiency for long text processing, reducing computational costs while maintaining accuracy.
arXiv
Summary
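The selection idea can be sketched as follows, with assumed toy dimensions: prompt positions are scored by the attention they receive from a trailing observation window of queries, and only the top-scoring key/value pairs are kept in the cache:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# SnapKV-style selection sketch (toy sizes, single head, no clustering).
seq, window, d, keep = 128, 16, 8, 32
K = rng.normal(size=(seq, d))
V = rng.normal(size=(seq, d))
Q_obs = rng.normal(size=(window, d))               # last-window queries

attn = softmax(Q_obs @ K.T / np.sqrt(d), axis=-1)  # (window, seq)
scores = attn.sum(axis=0)                          # attention votes per position
top = np.sort(np.argsort(scores)[-keep:])          # keep positions, in order
K_c, V_c = K[top], V[top]
print(K_c.shape)  # compressed cache holds only `keep` positions
```

The full method additionally pools scores over a neighborhood to keep clusters of adjacent positions, which this single-head sketch leaves out.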
04-22 Tree of Reviews: A Tree-based Dynamic Iterative Retrieval Framework for Multi-hop Question Answering
Institution: Tencent Inc., Harbin Institute of Technology
The paper proposes a novel iterative retrieval framework (TOR) that uses a tree structure to minimize error accumulation and incorporates optimization strategies to improve retrieval efficiency and quality. Experiments show that TOR achieves state-of-the-art performance on several datasets.
arXiv
Summary
04-22 LLMs Know What They Need: Leveraging a Missing Information Guided Framework to Empower Retrieval-Augmented Generation
Institution: Meituan
The MIGRES framework proposed in this study enhances RAG by exploiting LLMs' ability to identify missing information. Research results prove the superiority of MIGRES across multiple public datasets, addressing challenges in RAG's understanding of complex queries and retrieval of relevant documents.
arXiv
Summary
04-22 Information Re-Organization Improves Reasoning in Large Language Models
Institution: Zhejiang University
This paper introduced a novel Information Re-Organization (InfoRE) method that enhances the reasoning capabilities of LLMs by re-organizing contextual content to reveal logical relationships. The method was significantly effective when tested on LLMs for context-aware multi-hop reasoning tasks in a zero-shot setting.
arXiv
Summary
GitHub
04-22 A Survey on Efficient Inference for Large Language Models
Institution: Tsinghua University
This paper offers an encompassing survey of literature on improving inference efficiency for large language models, proposing a taxonomy that covers data-level, model-level, and system-level optimizations. Additionally, it provides quantified comparisons of key techniques through experiments, pointing out future directions for research.
arXiv
Summary
04-22 Beyond Scaling: Predicting Patent Approval with Domain-specific Fine-grained Claim Dependency Graph
Institution: University of California San Diego, Carnegie Mellon University, University of Pennsylvania
The researchers presented a novel algorithm for constructing a fine-grained claim dependency graph (FLAN Graph) that significantly improves the state of the art at scale and conducted comprehensive experiments and analyses of modern LLMs on patent approval prediction, identifying limitations of LLMs and providing valuable references for the development of future LLM-based solutions. The source code and dataset have been publicly released to facilitate future research.
arXiv
Summary
04-22 A Survey on Self-Evolution of Large Language Models
Institution: Peking University, Alibaba Group, Nanyang Technological University
The review paper offers a structured overview and summary of self-evolution approaches in LLMs, furnishing conceptual frameworks and future insights to propel research into self-evolving LLMs and pave the way for the development of next-generation models.
arXiv
Summary
04-21 AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Institution: Meta AI (FAIR), Max-Planck-Institute for Intelligent Systems
This paper presents a novel LLM called AdvPrompter that uses an innovative algorithm to rapidly generate human-readable adversarial prompts without the need for gradient information from the Target LLM. It significantly accelerates prompt generation while maintaining semantic coherence, and additionally through training with AdvPrompter, it can enhance the robustness of LLMs against jailbreaking attacks without sacrificing performance.
arXiv
Summary
04-19 LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency
Institution: Nanyang Technological University, DAMO Academy Alibaba Group, Singapore University of Technology and Design
LLM-R2 is an LLM-enhanced query rewrite system that effectively boosts the execution efficiency of query rewriting by automatically selecting effective rules from a given set of rewrite rules. It addresses the limitations of current methods and shows superior performance across multiple datasets.
arXiv
Summary
04-19 Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?
Institution: Nanyang Technological University, Princeton University, Salesforce Research
The paper thoroughly assessed LLMs' capability for analogical reasoning and introduced two methods that significantly reduce inference costs while enhancing performance. Findings revealed that, contrary to the previously held belief of the critical importance of relevance, self-generated irrelevant examples could perform equally or even better in some tasks. The study hopes to encourage further research on the design of self-generated contexts.
arXiv
Summary
04-18 Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers
Institution: Westlake University, Alibaba Group, Zhejiang University
The paper presents the MCRanker model, which improves consistency and comprehensiveness of LLM rankers by creating a virtual professional annotator team and generating evaluative criteria from multiple perspectives, capable of adapting to various datasets and improving ranking performance.
arXiv
Summary
04-18 Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
Institution: UC Berkeley
This paper proposes EvalGen, a user interface for aligning LLM-assisted evaluations of LLM outputs with human preferences using a mixed-initiative approach. It addresses the trustworthiness of LLM-generated evaluation functions and explores the dynamic nature of how users define and use evaluation criteria in practical applications.
arXiv
Summary
04-18 EVIT: Event-Oriented Instruction Tuning for Event Reasoning
Institution: Key Laboratory of High Confidence Software Technologies (PKU), MOE, China, School of Computer Science, Peking University, Advanced Institute of Big Data
EVIT addresses the shortcomings of current smaller instruction-tuned models in event reasoning tasks by introducing Event-Oriented Instruction Tuning and the concept of event quadruples. The experimental results show that EVIT performs better on event reasoning tasks compared to other models.
arXiv
Summary
04-18 Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
The paper presents a novel framework called ALPHALLM that, by integrating Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs), facilitates the self-improvement of LLMs without the need for additional annotated data.
arXiv
Summary
04-18 RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
Institution: Peking University, ByteDance Inc.
RAGCache enhances the performance of the RAG process by designing a targeted caching system and sharing intermediate states, significantly improving processing speed and reducing computational resource overhead.
arXiv
Summary
04-18 mABC: multi-Agent Blockchain-Inspired Collaboration for root cause analysis in micro-services architecture
Institution: Beihang University, Beijing Information Science and Technology University
mABC is an innovative framework that leverages LLMs and multi-agent cooperation, facilitated by blockchain-inspired decision-making processes, aimed at root cause analysis (RCA) in micro-services architectures within cloud-native technologies.
arXiv
Summary
04-17 Many-Shot In-Context Learning
Institution: Google DeepMind
The key contributions of this paper include systematically evaluating LLM performance with varying scales of in-context examples across a broad range of tasks, introducing reinforced ICL and unsupervised ICL to reduce reliance on human-generated examples, and discovering that many-shot ICL can overcome pre-training biases and learn high-dimensional numerical prediction tasks.
arXiv
Summary
04-17 Unifying Bias and Unfairness in Information Retrieval: A Survey of Challenges and Opportunities with Large Language Models
Institution: Renmin University of China, Chinese Academy of Sciences, Huawei Technologies
The survey paper offers a new perspective for understanding bias and unfairness in LLMs and IR systems as distribution mismatch problems and categorizes various mitigation strategies.
arXiv
Summary
GitHub
04-17 AgentKit: Flow Engineering with Graphs, not Coding
Institution: Carnegie Mellon University, NVIDIA, Microsoft
The paper presents a novel LLM prompting framework, AgentKit, addressing multifunctional agents, supporting the construction and fine-tuning of complex agent thought processes through modular components and intuitive designs. AgentKit shows potential in realizing advanced agent capabilities and lowering the entry barrier for users.
arXiv
Summary
GitHub
04-17 A Deep Dive into Large Language Models for Automated Bug Localization and Repair
Institution: University of Virginia, Purdue University, Amazon Web Services
This paper introduces a new approach named Toggle, which utilizes token-level bug localization and repair to overcome the limitations of existing line-granular methods. By designing inputs and fine-tuning LLMs, it significantly enhances the accuracy of bug fixes and delivers outstanding performance on multiple datasets, marking a new progression in the APR field.
arXiv
Summary
04-16 How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior
Institution: Stanford University
This paper analyses the tension between LLMs’ internal knowledge and retrieved information in RAG settings, finding that LLMs’ tendency to follow RAG information is inversely correlated with the model's confidence in its response without context. The research, which spans six domain datasets with over 1200 questions, reveals the inherent conflict between the model's pre-trained knowledge and the retrieved information.
arXiv
Summary
04-16 CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity
Institution: Intel Labs
The CoTAR method proposed in this paper addresses the issue of LLMs tending to produce inaccurately attributed content in question-answering tasks. By reasoning prior to output generation and guiding the model at different levels of attribution granularity, the method significantly improves the model's performance in terms of answer quality and attribution accuracy.
arXiv
Summary
04-16 Self-playing Adversarial Language Game Enhances LLM Reasoning
Institution: Tencent AI Lab
This paper proposes an innovative training scheme named SPAG that effectively enhances the reasoning capabilities of LLMs through self-play in adversarial language games and demonstrates that these improvements can persist and amplify through the iterative process.
arXiv
Summary
GitHub
04-15 Learn Your Reference Model for Real Good Alignment
Institution: Tinkoff
The paper introduces a novel method known as Trust Region DPO (TR-DPO), which improves alignment in language models by iteratively updating the reference policy parameters during training. Experimental results show that TR-DPO surpasses the DPO method on both evaluated datasets, effectively enhancing model performance across multiple evaluation criteria.
arXiv
Summary
04-15 Compression Represents Intelligence Linearly
Institution: The Hong Kong University of Science and Technology, Tencent
The paper provides empirical evidence that there is almost a linear correlation between LLMs' performance on downstream tasks and their compression efficiency, supporting the long-held belief that "better compression indicates higher intelligence". It also proposes using compression efficiency as an unsupervised metric for assessing LLM performance.
arXiv
Summary
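The compression efficiency the paper correlates with intelligence boils down to bits per character of the model's code length over a corpus. As a rough, purely illustrative stand-in (using `gzip` here instead of the LLM log-likelihoods the paper actually measures), the metric can be computed like this:

```python
import gzip

def bits_per_character(text: str) -> float:
    """Bits needed per input character after compression.

    gzip is only a toy proxy: the paper derives code length from an
    LLM's per-token log-probabilities, not a general-purpose compressor.
    """
    compressed = gzip.compress(text.encode("utf-8"))
    return 8 * len(compressed) / len(text)

# Highly regular text compresses far below its raw 8 bits/char.
print(bits_per_character("the cat sat on the mat. " * 200))
```

Lower bits-per-character means better compression; the paper's finding is that this number tracks downstream benchmark scores almost linearly.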
04-14 Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development
The focus of this paper is on supporting and optimizing the deployment of machine learning models on emerging computing platforms, introducing a framework named TAPML. The framework uses a top-down methodology and a universal runtime to make model deployment broader, easier, and more capable, and it provides practical deployment cases as deep insights and best practices for developing ML systems on emerging platforms.
arXiv
Summary
04-13 Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning
Institution: Nanjing University, University of California
The paper presents a new framework for multitask fine-tuning of Large Language Models named Intuition-MoR1E, which draws on principles of human cognitive neuroscience and uses Rank-1 Experts formulation to manage a spectrum of intuitions, significantly enhancing parameter efficiency and multitask fine-tuning effectiveness.
arXiv
Summary
04-12 Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
Institution: AI at Meta, University of Southern California, Carnegie Mellon University
The paper introduces MEGALODON, an efficient neural architecture for modeling sequences with unlimited context length. With innovative technical contributions, MEGALODON demonstrates higher efficiency and efficacy in long sequence modeling tasks than the Transformer while achieving robust improvements across various scales and modalities of benchmarks.
arXiv
Summary
GitHub
04-11 Rho-1: Not All Tokens Are What You Need
Institution: Xiamen University, Tsinghua University, Microsoft
This study introduces RHO-1, a novel language model that employs Selective Language Modeling (SLM), focusing training on useful tokens during pre-training. In continual pre-training on the mathematical domain, it reaches baseline performance faster and attains state-of-the-art results with a fraction of the tokens.
arXiv
Summary
04-11 Decomposing Label Space, Format and Discrimination: Rethinking How LLMs Respond and Solve Tasks via In-Context Learning
Institution: Nanyang Technological University
The paper investigates the mechanisms by which ICL improves task performance, identifying label space regulation and format refinement as significant contributors to performance enhancement while emphasizing the importance of selecting appropriate demonstrations.
arXiv
Summary
04-11 ChatGPT Can Predict the Future when it Tells Stories Set in the Future About the Past
Institution: Baylor University
By probing the predictive abilities of ChatGPT-3.5 and ChatGPT-4, the research unveils new potential in the reasoning capabilities of LLMs. The study confirms that future-narrative prompts significantly enhance predictive accuracy, offering valuable insights into potential applications of LLMs in analytical settings.
arXiv
Summary
04-11 ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback
Institution: University of Central Florida, ByteDance Inc
ControlNet++ significantly improves controllability across a range of conditional controls by optimizing pixel-level consistency between the generated images and the conditions, while the efficient reward fine-tuning strategy reduces the time and memory costs associated with image sampling.
arXiv
Summary
04-11 Interactive Prompt Debugging with Sequence Salience
The paper presents a system called Sequence Salience, which extends existing input salience (IS) methods to support complex LLM prompt debugging. This tool offers real-time interactive debugging, lowers practitioner cognitive load, supports prompt iteration based on salience results, and aligns more closely with the developer's mental model.
arXiv
Summary
04-11 OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Institution: The University of Hong Kong, CMU, Salesforce Research
OSWORLD offers a novel evaluation environment addressing the limitations of existing benchmarks, laying the groundwork for the development of multimodal agents capable of performing open-ended tasks in real computer environments.
arXiv
Summary
04-10 Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
Institution: Apple, Cupertino, CA, USA
The paper presented a novel RAG prompting method, "superposition prompting," to address problems with LLMs when handling long texts, significantly enhancing time efficiency and accuracy without the need for further training or tuning. The method has been validated on several pretrained models, and the authors plan to release an open-source code implementation.
arXiv
Summary
04-10 Transferable and Efficient Non-Factual Content Detection via Probe Training with Offline Consistency Checking
Institution: Renmin University of China, Tsinghua University
This paper introduces PINOS, a novel approach for training a probing model via offline self-consistency checking, effectively addressing the limitations of existing factual detection methods. PINOS exhibits enhanced transferability and efficiency, and achieves superior results on factuality detection and question-answering benchmarks compared to existing methods.
arXiv
Summary
04-10 Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Institution: Google
This research proposed a novel attention mechanism, Infini-attention, which combines compressive memory with standard dot-product attention and, by design, supports plug-and-play continuous pre-training and long-context adaptation, enabling LLMs to handle infinitely long contexts with bounded memory and computational resources.
arXiv
Summary
04-10 "We Need Structured Output": Towards User-centered Constraints on Large Language Model Output
Institution: Google Research
This paper explores how to implement user-centered constraints on the outputs of large language models (LLMs) by surveying industry professionals to understand different scenarios and demands. The focus is on enhancing the efficiency of developers in the development, testing, and integration process of LLMs, and on bolstering the end-user experience by meeting specific output formats and user interface requirements.
arXiv
Summary
04-09 RULER: What's the Real Context Size of Your Long-Context Language Models?
Institution: NVIDIA
This paper proposes RULER, a new open-source assessment tool for long-context LMs, providing the means to test performance on complex tasks and understanding of long contexts, with evaluations conducted across various models and task complexities.
arXiv
Summary
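Benchmarks like RULER build synthetic tasks, such as needle-in-a-haystack retrieval, whose context length is fully controllable. A toy generator in that spirit (purely illustrative; not the benchmark's code, and the filler/needle templates are invented here) might look like:

```python
import random

def make_needle_task(num_filler_sentences: int = 50, seed: int = 0):
    """Hide a key-value 'needle' in filler text; return (prompt, answer).

    Growing num_filler_sentences stretches the context while the task
    itself stays constant, isolating long-context retrieval ability.
    """
    rng = random.Random(seed)
    key = f"code-{rng.randint(1000, 9999)}"
    answer = str(rng.randint(100000, 999999))
    sentences = ["The sky is blue and the grass is green."] * num_filler_sentences
    needle = f"The secret value for {key} is {answer}."
    sentences.insert(rng.randrange(len(sentences) + 1), needle)
    context = " ".join(sentences)
    prompt = f"{context}\nQuestion: What is the secret value for {key}?"
    return prompt, answer

prompt, answer = make_needle_task()
print(len(prompt.split()), answer)
```

Scoring is then a simple exact-match check of the model's reply against `answer`.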
04-09 Event-enhanced Retrieval in Real-time Search
Institution: Tencent Search, Platform and Content Group
EER is an innovative approach targeting the "semantic drift" in real-time searches by enhancing the EBR model and including contrastive learning and a generative event triplet task. The method's effectiveness has been experimentally verified, potentially providing new insights into the information retrieval domain.
arXiv
Summary
GitHub
04-09 THOUGHTSCULPT: Reasoning with Intermediate Revision and Search
Institution: UC Berkeley
THOUGHTSCULPT, a graph-based framework, showcases its distinct capability to iteratively improve previous outputs while generating new thought nodes through its embedded self-revision mechanism, particularly excelling in tasks that require continuous revision and modification.
arXiv
Summary
04-09 Privacy Preserving Prompt Engineering: A Survey
Institution: University of Arkansas
The survey paper contributes a systematic overview concerning privacy protection methods in the realm of ICL and general prompting with LLMs, facilitating further research and exploration within the community regarding privacy protection.
arXiv
Summary
04-08 LLM-Augmented Retrieval: Enhancing Retrieval Models Through Language Models and Doc-Level Embedding
Institution: Meta
The paper successfully presents and validates an LLM-augmented retrieval framework with enhanced document-level embedding. By generating synthetic relevant queries and titles to add more contextual information to document embeddings and improving key steps in the training of retrieval models, the paper improves the performance and robustness of retrieval models.
arXiv
Summary
04-08 Evaluating Interventional Reasoning Capabilities of Large Language Models
Institution: Université de Montréal, Google DeepMind, ServiceNow Research
The paper evaluates the interventional reasoning capabilities of large language models (LLMs), focusing on predicting intervention effects and testing LLMs' ability to update their understanding of facts post-intervention. Results indicate that, under certain conditions, GPT-4 can accurately predict intervention outcomes, but minor changes in prompt design can significantly affect its performance.
arXiv
Summary
04-08 Know When To Stop: A Study of Semantic Drift in Text Generation
Institution: FAIR, Meta, Anthropic
The paper provides tools for understanding and measuring the phenomenon of semantic drift in long-form text generation by language models. Significant improvements in factual accuracy were achieved through early stopping and resampling-then-reranking methods, offering potential solutions to balance informational quantity with factual accuracy.
arXiv
Summary
04-08 LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Institution: Alibaba Group, Zhejiang University
The paper successfully proposes the LayoutLLM model and its layout instruction tuning strategy, significantly improving the model's understanding and utilization of document layouts, especially demonstrating outstanding performance in zero-shot document understanding tasks.
arXiv
Summary
GitHub
04-07 Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models
Institution: Cornell University
The paper introduces Radial Networks, a novel neural network architecture that uses dynamic layer sparsity and a trained router module for token-level inter-layer routing. This not only enhances model performance but also significantly reduces computational and serving costs, facilitating further scaling of large language models.
arXiv
Summary
04-07 Prompting Large Language Models for Zero-shot Essay Scoring via Multi-trait Specialization
Institution: Peking University
The study presents MTS, a novel zero-shot LLM framework for essay scoring that scores essays across different writing traits through multi-round conversations and derives the final score using min-max scaling and an outlier-clipping mechanism. MTS significantly improves accuracy over direct prompting, enables small-scale deployed models to outperform ChatGPT, and offers a zero-shot alternative to supervised essay-scoring approaches.
arXiv
Summary
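The score-derivation step can be sketched as follows; this is a hypothetical reconstruction with illustrative score ranges, not the authors' implementation: each trait score is clipped into its valid range, the clipped scores are averaged, and the average is min-max scaled onto the essay's overall score range.

```python
def final_essay_score(trait_scores, trait_min=0.0, trait_max=10.0,
                      essay_min=0.0, essay_max=60.0):
    """Clip outlier trait scores, average them, then min-max scale
    the average from the trait range onto the essay score range."""
    clipped = [min(max(s, trait_min), trait_max) for s in trait_scores]
    avg = sum(clipped) / len(clipped)
    scale = (essay_max - essay_min) / (trait_max - trait_min)
    return essay_min + (avg - trait_min) * scale

# The outlier 12.0 is clipped to 10.0 before averaging; prints 42.75.
print(final_essay_score([6.0, 7.5, 5.0, 12.0]))
```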
04-04 AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent
arXiv
GitHub
04-04 ReFT: Representation Finetuning for Language Models
Institution: Stanford University, Pr(Ai)2R Group
This paper presents a new language model fine-tuning method, LoReFT, significantly surpassing existing Parameter-Efficient Fine-tuning (PEFTs) techniques in terms of resource efficiency and control capabilities. The method achieved state-of-the-art performance on multiple NLP tasks across various domains, maintaining fewer parameters and higher interpretability.
arXiv
Summary
GitHub
04-04 Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Institution: Microsoft Research
The paper presents DNO, an algorithm that effectively combines the ease of contrastive learning with the theoretical generalizability of optimizing general preferences in post-training LLMs. The significant performance improvements demonstrated by DNO highlight the feasibility of guiding model learning alignment with human values through general preference optimization.
arXiv
Summary
04-03 PromptRPA: Generating Robotic Process Automation on Smartphones from Textual Prompts
Institution: Shanghai Jiao Tong University, CMU
The paper presents the PromptRPA system, an effective solution to overcome limitations of RPA applications on mobile devices. Leveraging a multi-agent framework and online tutorials, it can interpret diverse textual prompts, addressing a wide range of RPA tasks. Performance evaluations demonstrate a significant increase in success rates, affirming the viability of text-driven control in RPA and paving the way for future advancements focused on enhanced functionality and broader applicability.
arXiv
Summary
04-02 CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
Institution: East China Jiaotong University, Guangdong University of Technology, University of Toronto
The core contribution of the paper is the proposal of the CMAT framework, a novel approach that allows for dynamic and real-time memory updates within multi-agent systems, and the design of a role-playing mechanism for precise task allocation and enhanced agent communication, significantly improving overall performance and cooperation efficiency.
arXiv
Summary
04-02 Long-context LLMs Struggle with Long In-context Learning
Institution: University of Waterloo, Carnegie Mellon University
This paper introduces a novel evaluation benchmark, LongICLBench, to assess the performance of LLMs on long-input in-context learning tasks, finding that performance generally degrades as task difficulty increases and that models are sensitive to the position of instances and the distribution of label positions in the input sequence. This work contributes to better understanding and improvement of large language models' capabilities in long text processing.
arXiv
Summary
04-02 Advancing LLM Reasoning Generalists with Preference Trees
arXiv
04-02 Octopus v2: On-device language model for super agent
Institution: Stanford University
This paper addresses the deployment and function call efficiency issues of LLMs on edge devices. By introducing specialized training methods and reducing the amount of context that needs to be processed during inference, the paper significantly improves the accuracy of function calls and reduces latency on devices. The experimental results demonstrate a significant impact on the performance of function calling tasks.
arXiv
Summary
04-02 LLM-ABR: Designing Adaptive Bitrate Algorithms via Large Language Models
Institution: Microsoft
The paper investigates how large language models (LLMs) can assist in designing adaptive bitrate (ABR) algorithms by generating a variety of candidate algorithms and using an early stopping mechanism to test them in a network simulator, effectively filtering out the most effective algorithm designs. Evaluations indicate that LLMs can significantly enhance the performance of ABR algorithms in specific network scenarios.
arXiv
Summary
04-01 Mapping the Increasing Use of LLMs in Scientific Papers
Institution: Stanford University, UC Santa Barbara
This paper presents the first large-scale, systematic examination across articles published on arXiv, bioRxiv, and Nature portfolio, with a statistical estimation method that measures the prevalence of LLM-modified content at the population level, providing valuable insights into the application of LLMs in scientific writing.
arXiv
Summary
04-01 Prompt-prompted Mixture of Experts for Efficient LLM Generation
Institution: CMU
GRIFFIN is a training-free MoE system that boosts the efficiency of LLMs by leveraging the phenomenon of flocking observed within FF blocks of LLMs across different activation functions while preserving performance and reducing computational costs.
arXiv
Summary
GitHub
04-01 Efficiently Distilling LLMs for Edge Applications
Institution: IBM Research
This paper provides a new method for distilling LLMs for edge devices, enabling LPFT while significantly reducing both model size and training cost, and in particular addressing decoder models' resistance to compression and their long training durations.
arXiv
Summary
04-01 LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation
Institution: Microsoft Research Asia
This paper proposed a novel framework using large language models for the evaluation of radiology reports—LLM-RadJudge, effectively enhancing the clinical relevance and consistency of radiology report assessments. Through knowledge distillation, a smaller model was developed, reducing the cost of evaluation and improving accessibility, providing strong support for the research and practical application of radiology report generation.
arXiv
Summary
04-01 AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review
Institution: University of Lyon, INSA Lyon, Infologic
This paper presents an extensive literature review of incident management in the AIOps domain, aiming to structure knowledge, identify knowledge gaps, and lay the groundwork for future developments in the field. The study establishes unified AIOps terminology and taxonomy, reveals existing challenges, and provides public datasets, offering direction and a basis for future research.
arXiv
Summary

2024-03

 Date   Paper Links & Summary
03-28 sDPO: Don't Use Your Data All at Once
The paper proposes a novel stepwise DPO (sDPO) method that effectively improves the performance and alignment of the final model by using preference datasets in a stepwise manner, and the aligned model from previous steps as the reference model for the current step.
arXiv
Summary
03-28 Jamba: A Hybrid Transformer-Mamba Language Model
Institution: AI21 Labs
Jamba represents a new direction in the large language model domain with its hybrid Transformer-Mamba architecture that breaks through the limitations of handling long contexts and optimizes both model throughput and memory footprint by applying MoE components. This model demonstrates the potential balance between efficient training and powerful performance in the field of large-scale language modeling.
arXiv
Summary
03-27 Rejection Improves Reliability: Training LLMs to Refuse Unknown Questions Using RL from Knowledge Feedback
This paper effectively addresses LLM hallucinations and enhances model honesty and reliability by introducing the RLKF framework and defining new evaluation metrics, pointing towards a method for building more trustworthy AI systems.
arXiv
Summary
03-27 BLADE: Enhancing Black-box Large Language Models with Small Domain-Specific Models
Institution: DCST Tsinghua University, Beijing Institute of Technology, Huawei Cloud BU
This research presented a novel architecture, BLADE, capable of enhancing black-box LLMs with smaller domain-specific models, addressing the lack of domain-specific knowledge in LLMs for specialized applications. Experiments showed BLADE to be an effective and cost-efficient solution.
arXiv
Summary
03-26 LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
Institution: The Hong Kong University of Science and Technology, University of Illinois Urbana-Champaign
The LISA strategy proposed in the paper uses layer-wise weight importance sampling to enhance the fine-tuning efficiency and performance of large language models, while maintaining memory efficiency comparable to LoRA.
arXiv
Summary
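The sampling idea can be illustrated with a small sketch (a hypothetical reconstruction, not the authors' code): per optimization period, only a few randomly sampled middle layers are unfrozen, while the first and last layers stay trainable throughout.

```python
import random

def sample_active_layers(num_layers: int, k: int, seed=None):
    """LISA-style layerwise sampling sketch: return the indices of
    layers to unfreeze for the next period. Layers 0 and num_layers-1
    (standing in for embedding and head) are always trainable; k middle
    layers are sampled uniformly at random."""
    rng = random.Random(seed)
    middle = list(range(1, num_layers - 1))
    active = set(rng.sample(middle, k))
    active.update({0, num_layers - 1})
    return sorted(active)

print(sample_active_layers(32, 2, seed=0))
```

In training, one would freeze every layer outside the returned set, run the period's optimizer steps, then resample; memory stays bounded because only k+2 layers hold optimizer state at a time.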
03-26 COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Institution: Shenzhen Institute of Advanced Technology, CAS; M-A-P; Institute of Automation, CAS
This paper presents the COIG-CQIA dataset, a high-quality dataset for Chinese instruction fine-tuning designed to align well with human interactions. The research emphasizes the importance of high-quality data sources for model fine-tuning and demonstrates through experiments how the strategies for creating datasets and methods of fine-tuning significantly impact model performance.
arXiv
Summary
03-26 The Unreasonable Ineffectiveness of the Deeper Layers
Institution: Meta FAIR, UMD
The paper presents an empirical study on a simple layer-pruning strategy for popular pre-trained open-weight LLMs and demonstrates minimal performance impact despite removing a significant number of layers.
arXiv
Summary
03-25 AIOS: LLM Agent Operating System
Institution: Rutgers University
AIOS, as an LLM agent operating system, overcomes challenges in areas such as resource scheduling and context management through the design of a specific kernel and modules, providing improvements in performance and efficiency for LLM agents and paving the way for the future development and deployment of the AIOS ecosystem.
arXiv
Summary
GitHub
03-22 Can large language models explore in-context?
Institution: Microsoft Research, Carnegie Mellon University
This paper investigates whether contemporary Large Language Models (LLMs) can engage in in-context exploration without any training interventions. The authors' experiments reveal that LLMs are capable of robust exploration only under specific configurations. This work indicates that even state-of-the-art LLMs might fail to explore in more complex environments without adequate prompt design, highlighting that non-trivial algorithmic interventions may be required for effective LLM operation in complicated settings.
arXiv
Summary
03-20 Chain-of-Interaction: Enhancing Large Language Models for Psychiatric Behavior Understanding by Dyadic Contexts
Institution: University of Memphis, San Francisco Veterans Affairs Health Care System, University of California San Francisco
The paper successfully improves the capability of Large Language Models in understanding psychiatric behaviors, especially in motivational interview contexts. By employing structured prompting and assessment methods to model professional therapists' thought processes, it effectively educates the model with domain knowledge, achieving better performance than conventional methods.
arXiv
Summary
03-19 Towards Robots That Know When They Need Help: Affordance-Based Uncertainty for Large Language Model Planners
Institution: University of Maryland
The paper introduces the LAP method that combines LLMs with scene affordances to reduce hallucinations and achieve uncertainty alignment in planning tasks. Demonstrating significant improvements in successful outcomes and decreased reliance on human assistance through experiments in both simulated and real-world robot manipulations, the LAP method advances the domain of intelligent robotics.
arXiv
Summary
03-18 Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Institution: University of Texas at Austin, Drexel University, MIT
This paper presents the first extensive evaluation of the trustworthiness of compressed LLMs across multiple dimensions and offers practical guidelines for considering efficiency and trustworthiness during compression.
arXiv
Summary
03-15 VideoAgent: Long-form Video Understanding with Large Language Model as Agent
Institution: Stanford University
VideoAgent represents a substantial advancement in long-form video understanding by mimicking the human cognitive process, emphasizing the importance of reasoning over visual input spanning long time periods. This work not only sets a new benchmark in long-form video understanding but also provides insights for future research in this area.
arXiv
Summary
03-15 Uni-SMART: Universal Science Multimodal Analysis and Research Transformer
Institution: DP Technology, AI for Science Institute Beijing
Uni-SMART is an innovative model designed for deep understanding of multimodal scientific literature. It outperformed other top text-focused LLMs in multiple domains and has the potential to revolutionize interactions with scientific literature.
arXiv
Summary
03-15 RAFT: Adapting Language Model to Domain Specific RAG
Institution: UC Berkeley
The RAFT approach proposed in this paper innovates the training of large language models to answer questions in a domain-specific "open book" manner, enhancing the model's reasoning capabilities and resistance to distractor documents, and improving the model's accuracy in generating answers through the chain-of-thought method.
arXiv
Summary
03-13 Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments
Institution: Nanjing University, Microsoft
The Readi framework presents an efficient and faithful method for reasoning over large-scale structured environments, fully capitalizing on the planning capabilities of LLMs and enhancing reasoning paths through dynamic feedback, resulting in significant improvements in multi-hop reasoning tasks.
arXiv
Summary
03-13 Scaling Instructable Agents Across Many Simulated Worlds
The SIMA project proposed in this paper seeks to create an AI system capable of acting in various simulated 3D environments based on arbitrary language instructions. The design of the system focuses on addressing challenges in grounding language in perception and embodied actions, as well as achieving generality and scalability across many different environments.
arXiv
Summary
03-13 Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
Institution: ByteDance Research, University of Maryland College Park, Carnegie Mellon University
This paper successfully introduces a new causality-guided debiasing framework, which has been empirically validated for effectiveness. It not only integrates existing prompting-based debiasing methods but also proposes new avenues for eliciting unbiased reasoning.
arXiv
Summary
03-12 Chronos: Learning the Language of Time Series
Institution: Amazon Web Services, UC San Diego, University of Freiburg
Chronos has demonstrated exceptional performance as a pre-trained time series forecasting framework in both zero-shot and standard tasks. By leveraging data augmentation strategies and public datasets, it validates the promise of language model architectures for general applicability in time series forecasting, pointing towards a new direction for future time series models.
arXiv
Summary
03-11 Stealing Part of a Production Language Model
Institution: Google DeepMind, ETH Zurich, University of Washington
The paper proposes a novel model-stealing attack on production language models, capable of effectively extracting the final layer of a Transformer model. It shows how such an attack can recover details, parameters, and dimensions of black-box models, and argues that APIs must be modified to prevent such attacks in the future.
arXiv
Summary
03-11 ERA-CoT: Improving Chain-of-Thought through Entity Relationship Analysis
Institution: Zhejiang University, Southeast University
The paper presents an innovative framework, ERA-CoT, which effectively enhances the reasoning and question-answering abilities of Large Language Models in complex entity scenarios, principally by improving the understanding of entity relationships, especially in the Chain-of-Thought reasoning process.
arXiv
Summary
03-11 RA-ISF: Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback
Institution: Zhejiang University, Southeast University, Massachusetts Institute of Technology
RA-ISF is an innovative retrieval-augmented framework that enhances LLMs' problem-solving by iterative task decomposition and mitigates irrelevant text interference, significantly improving knowledge retrieval performance.
arXiv
Summary
03-08 Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation
The paper presents Adversarial Policy Optimization (AdvPO), a novel approach to tackling reward over-optimization issues within the RLHF process, especially in LLMs aimed at aligning with human preferences. AdvPO effectively alleviates the problem of reward over-optimization without incurring high computational costs.
arXiv
Summary
03-08 Harnessing Multi-Role Capabilities of Large Language Models for Open-Domain Question Answering
Institution: Gaoling School of Artificial Intelligence Renmin University of China, Nankai University, Beijing Academy of Artificial Intelligence
LLMQA is a novel generalized framework that combines strengths of retrieval- and generation-based evidence collection. By enabling LLMs to take on multiple roles within the framework, the paper significantly improves the overall performance of ODQA systems, with experimental results demonstrating its effectiveness over existing methods.
arXiv
Summary
03-08 Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Institution: Google
Gemini 1.5 Pro achieved a significant breakthrough in memory and reasoning capabilities for vast amounts of long-context information, particularly in processing extended texts, videos, and audio. The model not only outperforms in effectiveness but also shows improved computational efficiency.
arXiv
Summary
03-07 Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Institution: UC Berkeley, Stanford, UCSD
Chatbot Arena is an open platform for evaluating LLMs based on human preferences. It employs a crowdsourced approach to collect questions for anonymous randomized battles, addressing the limitations of static dataset benchmarks, and uses carefully designed statistical methods to ensure the credibility and efficiency of evaluations.
arXiv
Summary
03-07 Yi: Open Foundation Models by 01.AI
Institution: 01.AI
The paper successfully introduces the Yi-34B model, performing comparably to GPT-3.5 in both performance and efficiency, and provides detailed descriptions of innovative approaches to pre-training large language models and their instruction fine-tuning.
arXiv
Summary
03-05 ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary
Institution: Tsinghua University
The ChatCite system is designed to overcome the challenges faced by LLMs in generating literature reviews. It enables an LLM agent to more effectively understand, summarize, and compare different research works, thus producing organized and comparative literature reviews.
arXiv
Summary
03-05 Design2Code: How Far Are We From Automating Front-End Engineering?
Institution: Stanford University, Georgia Tech, Microsoft
The paper formalizes and benchmarks the Design2Code task to assess the capability of current multimodal LLMs in converting visual designs into code, finding that GPT-4V performs best, offering a new paradigm for automating front-end development.
arXiv
Summary
03-05 MathScale: Scaling Instruction Tuning for Mathematical Reasoning
Institution: The Chinese University of Hong Kong (Shenzhen), Microsoft Research Asia, Shenzhen Research Institute of Big Data
MathScale proposes a scalable approach to creating high-quality mathematical reasoning data and introduces a new comprehensive benchmark, MWPBENCH, to fully evaluate the mathematical reasoning capabilities of LLMs, thereby significantly enhancing the models' performance in solving mathematical problems.
arXiv
Summary

2024-02

 Date   Paper Links & Summary
02-29 Resonance RoPE: Improving Context Length Generalization of Large Language Models
Institution: DIRO Université de Montréal, Mila - Quebec AI Institute, Huawei Noah’s Ark Lab
This paper presents Resonance RoPE, an improved position-embedding scheme that enhances performance on long texts based on an analysis of RoPE feature wavelengths. It also introduces the POSGEN benchmark to assist in the study and evaluation of position embeddings in long-text tasks.
arXiv
Summary
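For context, standard RoPE (the scheme whose feature wavelengths Resonance RoPE analyzes) rotates consecutive feature pairs by position-dependent angles. A minimal NumPy sketch (our own simplification, not the paper's code):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply standard rotary position embedding (RoPE).

    x: (seq_len, dim) array with an even dim; positions: (seq_len,) positions.
    Each consecutive feature pair (x[2i], x[2i+1]) is rotated by the angle
    pos * theta_i, where theta_i = base**(-2i/dim) fixes that pair's wavelength.
    """
    _, dim = x.shape
    half = dim // 2
    theta = base ** (-2.0 * np.arange(half) / dim)   # (half,) per-pair frequencies
    angles = positions[:, None] * theta[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

The key property is that the dot product between a rotated query and key depends only on their relative offset; wavelength-based analyses like Resonance RoPE start from how each pair's theta_i behaves beyond the training length.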
02-29 SEED: Customize Large Language Models with Sample-Efficient Adaptation for Code Generation
Institution: Peking University
This paper introduces SEED, an adaptation method using error-driven learning, enabling LLMs to learn efficiently with fewer samples for code generation tasks, achieving better performance and generalization.
arXiv
Summary
02-29 Beyond Language Models: Byte Models are Digital World Simulators
Institution: Microsoft Research Asia
The paper showcases the potential of bGPT in handling challenging byte-level data simulation tasks, particularly highlighting its capabilities in cross-modal knowledge transfer and digital world simulation. This reveals the broad applicability and flexibility of byte models in digital media data processing and understanding.
arXiv
Summary
02-29 StarCoder 2 and The Stack v2: The Next Generation
Institution: ServiceNow, Hugging Face
The paper presented the development process of The Stack v2 and StarCoder2, a work focused on large-scale pre-training and instruction fine-tuning for code. Researchers significantly enhanced the performance of code LLMs, especially in handling low-resource programming languages and tasks requiring code reasoning, by integrating diverse data sources and a meticulously designed training process.
arXiv
Summary
02-27 The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Institution: Microsoft, University of Chinese Academy of Sciences
The paper presents the BitNet b1.58 model, which is a 1.58-bit quantized Large Language Model that is comparable in performance to traditional full-precision LLMs while being more efficient and energy-saving.
arXiv
Summary
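As background, the "1.58-bit" format corresponds to ternary weights in {-1, 0, +1} (log2(3) ≈ 1.58 bits). A minimal sketch of the absmean ternary quantization described for BitNet b1.58 (function name and example values are ours):

```python
import numpy as np

def ternary_quantize(w, eps=1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with an absmean scale.

    Following the absmean scheme described for BitNet b1.58: divide by the
    mean absolute weight, then round and clip to ternary values. Returns the
    ternary matrix and the scale needed to dequantize (w ≈ w_t * scale).
    """
    scale = np.abs(w).mean() + eps
    w_t = np.clip(np.rint(w / scale), -1, 1)
    return w_t, scale

w = np.array([[0.9, -0.05, -1.2], [0.3, 0.0, 2.1]])
w_t, s = ternary_quantize(w)
w_hat = w_t * s  # dequantized approximation of the original weights
```

With ternary weights, matrix multiplication reduces to additions and subtractions, which is where the claimed efficiency and energy savings come from.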
02-27 EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Institution: Alibaba Group
The EMO framework enhances the realism and expressiveness of generated videos through a direct audio-to-video synthesis method, significantly surpassing existing technologies and marking a significant advance in the field of video synthesis.
arXiv
Summary
02-27 When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
Institution: Google DeepMind
The paper provides significant insights into the impact of factors such as data size, model size, and finetuning methods on the performance of LLMs during the finetuning phase, defining a new framework for evaluation.
arXiv
Summary
02-27 REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering
Institution: Gaoling School of Artificial Intelligence Renmin University of China, School of Information Renmin University of China
The paper presented the REAR framework, which focuses on enhancing the ability of LLMs to utilize external knowledge in QA tasks by adding self-awareness of document relevance and has proven its effectiveness over previous methodologies.
arXiv
Summary
GitHub
02-27 Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization
Institution: Zhejiang University, Institute of Software Chinese Academy of Sciences, Nanjing University of Posts and Telecommunications
Agent-Pro represents a new type of LLM-based intelligence agent that can learn and develop strategies in interactive environments through policy-level reflection and optimization, addressing the issue of existing works' inability to learn through interaction and adapt.
arXiv
Summary
02-27 Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Institution: OpenAI
This review article provides an insight into Sora—a large vision model, discussing its technological features, innovative aspects, current limitations, and potential opportunities for future applications. Sora's capabilities signify progressive strides made by large vision models, including long video generation and processing of diverse video formats.
arXiv
Summary
02-26 LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments
The study introduced the LLMARENA benchmark to assess the capabilities of LLM agents in complex multi-agent settings, highlighting existing issues and advancing future research directions, including capabilities in multimodal dynamic contexts and the potential use of external tools.
arXiv
Summary
02-26 Do Large Language Models Latently Perform Multi-Hop Reasoning?
Institution: Google DeepMind, UCL, Google Research
This research examines LLMs’ potential for latent multi-hop reasoning, proposing new methods for evaluating latent multi-hop reasoning capabilities and indicating strong evidence of multi-hop reasoning for certain types of relational prompts in LLMs, though highly context-dependent.
arXiv
Summary
02-26 Improving LLM-based Machine Translation with Systematic Self-Correction
Institution: Zhejiang University, Tencent, Angelalign Technology Inc.
The paper successfully introduced the first LLM-based self-correcting translation framework named TER, and demonstrated its effectiveness in improving translation quality across various language pairs and models. It opened new horizons in the field of machine translation, especially for the use of self-correction in translations between high-resource, low-resource languages, and translations involving different central languages.
arXiv
Summary
02-25 ChatMusician: Understanding and Generating Music Intrinsically with LLM
Institution: Hong Kong University of Science and Technology
The paper made substantial progress in an under-researched domain by creating the first music pre-training dataset and assessment benchmark for language models, enhancing LLMs' performance in understanding and generating music.
arXiv
Summary
02-23 Genie: Generative Interactive Environments
Institution: Google DeepMind, University of British Columbia
Genie is an interactive environment model capable of generating new videos and controlling the content of the videos through user inputs, bridging the gap between traditional video generation technologies and interactive experiences.
arXiv
Summary
02-23 ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
arXiv
02-22 Automating psychological hypothesis generation with AI: when large language models meet causal graph
arXiv
02-22 Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments
arXiv
02-22 CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
Institution: Tsinghua University, University of Hong Kong
The paper evaluates LLMs' critique and correction reasoning abilities through CRITICBENCH, exploring key factors influencing these competencies, aiming to foster further research in LLM critique and self-improvement.
arXiv
Summary
02-22 OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
arXiv
02-21 User-LLM: Efficient LLM Contextualization with User Embeddings
USER-LLM is a framework that contextualizes LLMs using user embeddings. It addresses the complexities of user data and the challenges of processing long sequences, improving the usability of LLMs in personalized applications while being computationally efficient.
arXiv
Summary
02-21 AgentScope: A Flexible yet Robust Multi-Agent Platform
Institution: Alibaba Group
AgentScope is a versatile platform for building multi-agent applications, emphasizing usability and customizability, particularly catered to developers with varying skill levels. By implementing fault tolerance and supporting multimodal data processing, as well as optimizing distributed operations, AgentScope significantly reduces the complexity of developing and deploying multi-agent systems, promoting wider participation and innovation.
arXiv
Summary
GitHub
02-20 Instruction-tuned Language Models are Better Knowledge Learners
Institution: FAIR at Meta, Carnegie Mellon University, University of Washington
The paper introduces a method called pre-instruction-tuning (PIT), which effectively improves the ability of LLMs to absorb knowledge from documents, addresses the "perplexity curse," and makes significant strides in multi-domain knowledge acquisition.
arXiv
Summary
02-20 TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Institution: AWS AI Labs, The University of Texas at Austin, KAIST
The article introduces TOFUEVAL, a new assessment benchmark for evaluating the factual consistency of LLMs in generating topic-focused dialogue summaries. The study uncovered extensive factual errors in the summaries generated by LLMs of varying sizes within the domain of dialogue.
arXiv
Summary
02-19 AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Institution: Fudan University, Multimodal Art Projection Research Community, Shanghai AI Laboratory
AnyGPT is a multimodal language model architecture that achieves seamless conversion and unified processing across modalities through discrete sequence modeling, delivering the ability to generate from any modality to any other without needing alterations to the current LLM architecture or training paradigms. It efficiently processes and generates high-quality multimodal content, with performance comparable to specialized models.
arXiv
Summary
02-16 Speculative Streaming: Fast LLM Inference without Auxiliary Models
arXiv
02-16 FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
Institution: The University of British Columbia & Invertible AI
The paper presents a multimodal Large Language Model suite named FinTral, optimized for financial analysis. The model's performance was showcased against existing models and demonstrated its advanced capabilities in multi-task contexts within the financial sector, especially in handling zero-shot tasks and reducing hallucination phenomena.
arXiv
Summary
02-16 SPAR: Personalized Content-Based Recommendation via Long Engagement Attention
Institution: The University of British Columbia, Meta
The SPAR framework effectively uses long-term user engagement histories to enhance the accuracy of personalized content recommendations and surpasses the existing state-of-the-art across multiple performance metrics.
arXiv
Summary
02-15 How to Train Data-Efficient LLMs
Institution: Google DeepMind, University of California San Diego, Texas A&M University
The ASK-LLM and DENSITY techniques proposed in the paper optimize the data efficiency of large language models, effectively enhancing the speed and quality of model training and performing well under resource constraints.
arXiv
Summary
02-15 Chain-of-Thought Reasoning Without Prompting
Institution: Google DeepMind
This work uncovers that by changing the decoding strategy, one can naturally elicit reasoning from pre-trained LLMs, with CoT paths being more prevalent in tasks frequently represented in the pre-training data. The introduced CoT-decoding method significantly enhances model performance on various reasoning benchmarks without the need for manual prompts.
arXiv
Summary
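The CoT-decoding idea can be illustrated with a toy sketch (an assumed simplification with made-up numbers, not the paper's implementation): instead of one greedy pass, branch on the top-k first tokens, continue each branch greedily, and keep the branch whose answer tokens show the largest probability margin between the top-1 and top-2 candidates.

```python
def answer_confidence(token_margins):
    """Average top1-top2 probability gap over one branch's answer tokens."""
    return sum(token_margins) / len(token_margins)

def cot_decode(branches):
    """branches: list of (decoded_text, answer_token_margins).

    Return the text of the branch with the most confident answer span; in the
    paper's observation, branches that surface a reasoning chain tend to have
    higher answer confidence than the direct greedy path.
    """
    return max(branches, key=lambda b: answer_confidence(b[1]))[0]

branches = [
    ("... the answer is 5", [0.31, 0.28]),                      # greedy path
    ("2 plus 3 is 5, so the answer is 5", [0.92, 0.88]),        # CoT-style path
]
print(cot_decode(branches))  # → "2 plus 3 is 5, so the answer is 5"
```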
02-15 A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
Institution: Google DeepMind, Google Research
ReadAgent is an LLM agent system inspired by human reading processes, which significantly enhances performance and scalability by generating gist memories and retrieving information as needed for tasks involving long contexts.
arXiv
Summary
02-14 Premise Order Matters in Reasoning with Large Language Models
Institution: Google DeepMind
The paper studies how the ordering of premises influences LLMs on reasoning tasks, assessing the impact with the newly created R-GSM benchmark. It reveals that LLMs are extremely sensitive to premise ordering, with a substantial effect on performance.
arXiv
Summary
02-09 InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning
Institution: Shanghai AI Laboratory, Tsinghua University, Fudan University School of Computer Science
The InternLM-Math model is a mathematical reasoning tool based on LLMs that integrates various capabilities and provides supervised learning to help the model achieve state-of-the-art performance in various mathematical reasoning tasks, with code and data made open-source. The paper also explores a new approach to solving mathematical problems with the programming language LEAN within a multi-task learning setup, showcasing the potential of LLMs in formalized and code-assisted reasoning.
arXiv
Summary
GitHub
02-02 LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs in Autonomous Driving
Institution: Shanghai Artificial Intelligence Laboratory, College of Control Science and Engineering Zhejiang University
LimSim++ is the first closed-loop evaluation platform specifically developed for (M)LLM-driven autonomous driving. It overcomes the limitations of current simulation platforms and validates its effectiveness in various complex traffic scenarios through experimentation.
arXiv
Summary
02-02 K-Level Reasoning with Large Language Models
arXiv
02-02 AMOR: A Recipe for Building Adaptable Modular Knowledge Agents Through Process Feedback
Institution: Tsinghua University, Ant Group
The AMOR framework integrates reasoning logic based on a finite state machine (FSM) and a process feedback mechanism, showcasing how an open-source LLM-based knowledge agent can reason and adapt with human oversight, enhancing the model's capabilities in performing knowledge-intensive tasks.
arXiv
Summary
02-02 MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models
Institution: UNC Chapel Hill
This paper introduces a new method called MAGDi, which significantly enhances the reasoning abilities and generalization capacity of smaller models through structured distillation of reasoning interactions between multiple LLMs, while reducing costs.
arXiv
Summary
02-02 Reasoning Capacity in Multi-Agent Systems: Limitations, Challenges and Human-Centered Solutions
Institution: Megagon Labs, Carnegie Mellon University
This paper introduces the concept of reasoning capacity in multi-agent systems to improve optimization and evaluation and explores the potential of human feedback to enhance system reasoning capabilities.
arXiv
Summary
02-01 HR-MultiWOZ: A Task Oriented Dialogue (TOD) Dataset for HR LLM Agent
Institution: Amazon, University of Milano-Bicocca
This paper introduces a new resource, HR-MultiWOZ, a Task-Oriented Dialogue Dataset for an HR LLM Agent. It tackles the problem of a lack of high-quality training datasets for building and evaluating HR LLM agents while providing a cost-effective data generation methodology that serves as a valuable asset and benchmark for subsequent research in the field.
arXiv
Summary
02-01 Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing
Institution: Nanyang Technological University, Institute for Infocomm Research A*STAR, Salesforce Research
The paper proposes a novel offline training framework focused on improving the reliability and accuracy of Large Language Models in complex reasoning tasks through trajectory collection and direct preference optimization based on outcome supervision, without the need for teacher models or human annotations. The results on two logical reasoning benchmarks prove the effectiveness of the proposed method.
arXiv
Summary
02-01 Can Large Language Models Understand Context?
Institution: Georgetown University, Apple
This paper introduces a context understanding benchmark to assess the contextual understanding abilities of Large Language Models (LLMs). The benchmark encompasses the elements required for understanding context both in documents and dialogue bases, and uses innovative testing methods and experimental analysis to showcase the abilities and limitations of LLMs in understanding context.
arXiv
Summary
02-01 Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration
Institution: University of Washington, University of California Berkeley, The Hong Kong University of Science and Technology
This article focuses on identifying knowledge gaps in large language models (LLMs) and abstaining from answering questions when necessary. The study proposes two novel multi-LLM collaboration methods, which showed through comparative experiments that they can effectively improve the ability of LLMs to abstain from generating outputs with low confidence.
arXiv
Summary

2024-01

 Date   Paper Links & Summary
01-31 LongAlign: A Recipe for Long Context Alignment of Large Language Models
Institution: Tsinghua University, Zhipu.AI
The paper proposes a novel recipe, LongAlign, for the long context alignment of LLMs, by constructing a long instruction dataset, adopting new training strategies, and introducing evaluation benchmarks, enhancing the LLMs' ability to handle lengthy contexts. The code, data, and long-aligned models are open-sourced.
arXiv
Summary
GitHub
01-30 Efficient Tool Use with Chain-of-Abstraction Reasoning
Institution: Meta
The paper proposes a novel Chain-of-Abstraction reasoning approach that effectively enhances LLMs' capability to use external tools and expedites the reasoning process. Experimental results demonstrate its effectiveness and efficiency in multi-step reasoning tasks.
arXiv
Summary
01-30 Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate
Institution: Shanghai Jiao Tong University, Carnegie Mellon University, Shanghai Artificial Intelligence Laboratory
SCALEEVAL is an innovative meta-evaluation framework designed to evaluate the trustworthiness and efficiency of LLMs as evaluators. It incorporates multi-agent LLM debate and minimal human supervision into the evaluation process, providing flexibility and scalability, with experimental results showing high consistency with purely human evaluations.
arXiv
Summary
GitHub
01-30 Recovering Mental Representations from Large Language Models with Markov Chain Monte Carlo
Institution: Princeton University, University of Warwick
The article demonstrated an effective increase in efficiency and performance by integrating LLMs into sampling algorithms and using Direct Sampling along with MCMC to extract mental representations, exploring the potential for Bayesian inference with LLMs.
arXiv
Summary
01-30 Incoherent Probability Judgments in Large Language Models
Institution: Princeton University
The paper investigates the coherence of probability judgments made by large language models, finding biases comparable to systemic deviations in human cognition. It quantified incoherence using probabilistic identities and repetition of judgments. The hypothesis presented connects the human-like biases observed when LLMs make probability judgments to their autoregressive training objectives, supported by potential links between the Bayesian Sampler model and autoregressive processes within LLMs.
arXiv
Summary
01-29 Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis
Institution: Harbin Institute of Technology
This research introduces an LLM-based automatic diagnostic method—Multi-Specialist Agent Consultation Model (AMSC), which better simulates the diagnostic process in the real world and improves diagnosis accuracy and efficiency by integrating predictions from multiple specialized agents.
arXiv
Summary
01-29 SelectLLM: Can LLMs Select Important Instructions to Annotate?
Institution: University of Minnesota, Carnegie Mellon University
This work introduces a novel method SELECTLLM for using LLMs to select unlabeled high-quality instructions, challenging traditional selection algorithms and enhancing selection efficiency while maintaining the global structure of the dataset. The experiments demonstrate superior performance on instruction-tuning benchmarks.
arXiv
Summary
GitHub
01-29 LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs' Vulnerability Reasoning
Institution: Nanyang Technological University
LLM4Vuln is an innovative framework that significantly enhances LLMs' performance in code vulnerability analysis by providing a vector database of vulnerability knowledge, tool invocation capabilities, custom CoT prompt schemes, and structuring outputs using instructionally proficient models.
arXiv
Summary
01-28 PRE: A Peer Review Based Large Language Model Evaluator
The PRE model presented in this paper provides a novel framework for automatically evaluating LLMs by simulating the peer review system commonly used in academia, significantly lowering costs and exhibiting increased generalizability and reliability.
arXiv
Summary
01-27 MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
Institution: Hong Kong University of Science and Technology
The paper developed the MultiHop-RAG dataset to expose the limitations of existing Retrieval-Augmented Generation (RAG) systems in handling multi-hop queries that require retrieval and reasoning over multiple pieces of evidence. It also provided experimental results demonstrating these limitations and released the dataset to encourage further research and development.
arXiv
Summary
GitHub
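The kind of multi-hop retrieval loop that MultiHop-RAG stresses can be sketched with a toy keyword retriever (documents, scoring, and function names are all invented for illustration): each hop's retrieved evidence is folded into the query for the next hop.

```python
DOCS = {
    "d1": "Acme Corp was founded by Jane Roe.",
    "d2": "Jane Roe was born in Lyon.",
}

def overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def multi_hop(question, hops):
    """Iteratively retrieve, folding each hop's evidence into the next query."""
    query, evidence, seen = question, [], set()
    for _ in range(hops):
        doc = max((d for d in DOCS.values() if d not in seen),
                  key=lambda d: overlap(query, d))
        seen.add(doc)
        evidence.append(doc)
        query = question + " " + doc   # next hop searches with the new evidence
    return evidence

# "Where was the founder of Acme Corp born?" needs two hops: d1, then d2.
print(multi_hop("Where was the founder of Acme Corp born", hops=2))
```

A single-hop retriever scores d2 too low on the original question; only after d1 surfaces "Jane Roe" does the second hop find the birthplace, which is exactly the failure mode the benchmark measures.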
01-26 EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Institution: Peking University, Microsoft Research, University of Waterloo
The paper proposes a new framework named EAGLE to increase the auto-regressive decoding speed of Large Language Models (LLMs) while maintaining the consistency of the generated text distribution with the original LLMs. EAGLE has significantly improved upon speculative sampling methods in reducing time overhead and increasing draft acceptance rate, offering faster acceleration compared to Lookahead and Medusa, with low training cost and ease of deployment.
arXiv
Summary
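EAGLE builds on speculative sampling; the sketch below shows only the standard accept/reject step that such methods share (a generic illustration, not EAGLE's feature-level drafting):

```python
import numpy as np

def speculative_step(p, q, rng):
    """One accept/reject step of standard speculative sampling.

    p: target-model distribution, q: draft-model distribution, both over the
    same vocabulary. The draft proposes a token from q; it is accepted with
    probability min(1, p[x]/q[x]), otherwise a token is resampled from the
    residual max(p - q, 0), renormalized. The output is distributed exactly
    according to p, so acceleration never changes the generated distribution.
    """
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = np.maximum(p - q, 0.0)
    return rng.choice(len(p), p=residual / residual.sum())
```

Speed-up comes from the draft being cheap and its proposals being accepted often; EAGLE's contribution is a drafting scheme that raises that acceptance rate.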
01-25 True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning
Institution: Nanyang Technological University, Zhejiang University
The TWOSOME framework effectively aligns LLMs with embodied environments using RL, improving sample efficiency and task generalization while retaining LLMs' original functionality.
arXiv
Summary
01-25 Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning
Institution: Columbia University, Microsoft Research, University of California Berkeley
The EC-Finetuning method has successfully increased the consistency of explanations generated by LLMs and demonstrated its ability to generalize to unseen datasets, showing a 10.0% relative improvement in explanation consistency on fine-tuning datasets and a 4.5% improvement on out-of-distribution datasets, along with moderate improvements in prediction accuracy.
arXiv
Summary
GitHub
01-25 ConstraintChecker: A Plugin for Large Language Models to Reason on Commonsense Knowledge Bases
Institution: HKUST
ConstraintChecker is an independent plugin tool that effectively enhances the performance of LLMs in CSKB reasoning tasks. It helps LLMs to perform better in reasoning by providing and checking explicit constraints and has shown to outperform other advanced prompting techniques in validated metrics.
arXiv
Summary
GitHub
01-24 Can AI Assistants Know What They Don't Know?
Institution: Fudan University, Shanghai Artificial Intelligence Laboratory
This paper focuses on AI assistants' capacity to recognize their own knowledge boundaries. By constructing an Idk ("I don't know") dataset and aligning the assistant to it, the authors enable AI assistants to recognize and admit what they don't know, reducing factual errors in their responses.
arXiv
Summary
01-24 AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
Institution: The University of Hong Kong, Zhejiang University, Shanghai Jiao Tong University
Researchers introduced a new benchmark, AGENTBOARD, for evaluating multi-turn capable large language model agents, providing a granular progress rate and interactive analysis tools to deepen the understanding of LLM agent performance.
arXiv
Summary
01-24 Clue-Guided Path Exploration: An Efficient Knowledge Base Question-Answering Framework with Low Computational Resource Consumption
Institution: Tsinghua University, Zhongguancun Laboratory, XinJiang University
The CGPE framework supports the application of LLMs to question-answering tasks through a clue-guided path exploration mechanism, lowering the capability requirements placed on the LLM and significantly reducing computational resource consumption, which is of practical significance for individuals and organizations with limited computational resources.
arXiv
Summary
01-24 Consistency Guided Knowledge Retrieval and Denoising in LLMs for Zero-shot Document-level Relation Triplet Extraction
Institution: Nanjing University of Science and Technology, Northeastern University, Singapore Institute of Technology
The paper presents a new Zero-shot Document-level Relation Triplet Extraction (ZeroDocRTE) framework that generates labeled data by retrieving and denoising knowledge from LLMs, significantly improving document-level relation triplet extraction through a series of novel methods.
arXiv
Summary
01-24 MM-LLMs: Recent Advances in MultiModal Large Language Models
Institution: Tencent AI Lab, Kyoto University, Mohamed Bin Zayed University of Artificial Intelligence
arXiv
Summary
01-23 CCA: Collaborative Competitive Agents for Image Editing
The paper presents a new generative model based on multiple Large Language Models (LLMs), capable of handling complex image editing tasks and enhancing the quality and robustness of the results. Encouraging collaborative competition among agents, the model demonstrates capabilities exceeding traditional methods, especially in managing complex tasks and learning from intermediate steps to refine outcomes.
arXiv
Summary
GitHub
01-23 AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
Institution: Google DeepMind
The paper describes a system named AutoRT that uses large foundation models to control real-world robots to autonomously navigate and perform tasks. It marks the first instance of LLM-controlled robots operating autonomously in real-world settings, proposing their own goals, and taking actions toward those goals. The data collected by AutoRT is not only diverse but can improve the performance of robot learning models and be aligned with human preferences.
arXiv
Summary
01-23 KAM-CoT: Knowledge Augmented Multimodal Chain-of-Thoughts Reasoning
Institution: Samsung R&D Institute India - Bangalore
KAM-CoT is a multimodal Chain-of-Thought reasoning framework that integrates CoT reasoning, knowledge graphs, and multiple modalities. It outperforms state-of-the-art approaches with fewer trainable parameters, showcasing superior performance and cost-efficiency.
arXiv
Summary
01-23 Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment
Institution: Alibaba Inc.
The paper proposes DITTO, a self-alignment method that enhances LLMs' role-play capabilities through knowledge augmentation and dialogue simulation. It also provides a reproducible, explainable, and efficient role-play evaluation method and explores the dissection of role-play through cross-supervision experiments, offering an in-depth understanding and insights into building role-play functions for LLMs.
arXiv
Summary
GitHub
01-22 Improving Small Language Models' Mathematical Reasoning via Mix Thoughts Distillation
Institution: Institute of Information Engineering, Chinese Academy of Sciences
Through EoTD and MTD, this paper shows that LLMs' mathematical reasoning capabilities can be distilled into Small Language Models (SLMs) with fewer than one billion parameters. The methods preserve and enhance SLMs' reasoning abilities, enabling them to achieve state-of-the-art performance on reasoning tasks. This advancement opens the door for broader applications of SLMs in resource-constrained environments, bridging the gap between the demand for powerful reasoning models and computational resource limitations.
arXiv
Summary
01-22 PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
Institution: Shanghai Artificial Intelligence Laboratory, Dalian University of Technology
The article presents PsySafe, a comprehensive framework for the safety of multi-agent systems, integrating psychological-based approaches for attack, defense, and evaluation. The experimental outcomes provide deeper insights into understanding and researching the safety issues of multi-agent systems.
arXiv
Summary
01-22 CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation
Institution: Stanford University, Stability AI
This paper addresses the challenges in automated CXR interpretation by introducing a large dataset specifically designed for CXR interpretation, developing a novel foundation model, and creating a comprehensive evaluation benchmark. It demonstrates the superior performance of CheXagent in various assessment tasks compared to other models and takes an important stride towards transparency by examining potential biases within the model, providing valuable insights for future research and applications.
arXiv
Summary
01-21 Interactive AI with Retrieval-Augmented Generation for Next Generation Networking
Institution: Nanyang Technological University, Guangdong University of Technology, Institute for Infocomm Research, Agency for Science Technology and Research
This paper explores the integration of interactive AI (IAI) with next-generation networking, using retrieval-augmented generation (RAG) and large language models (LLMs) to enhance decision-making capabilities, demonstrated through real network optimization case studies.
arXiv
Summary
01-20 BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models
Institution: University of Illinois Urbana-Champaign, University of Washington, Western Washington University
This article proposes BadChain, a backdoor attack on LLMs using CoT prompting that requires no access to training datasets or model parameters and incurs low computational overhead. The method effectively exposes the security vulnerabilities of LLMs under CoT prompting and underscores the importance of studying such backdoor attacks and designing effective defenses.
arXiv
Summary
01-19 Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
Institution: MIT
The paper shows how LLMs can be made more resistant to "jailbreak" attacks from a safety alignment perspective through Wanda pruning, without the need for fine-tuning, and validates model performance through a constructed dataset and evaluation system.
arXiv
Summary
01-19 Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning
Institution: ShanghaiTech University, Meituan, UniDT
Tool-LMM stands out as the first system aimed at training a large multi-modal model to learn tool agency, innovatively integrating multi-modal inputs with the correct selection of external tools, overcoming ambiguity in text, and showcasing the ability to automatically select appropriate tools in response to multi-modal instructions.
arXiv
Summary
GitHub
01-19 Mitigating Hallucinations of Large Language Models via Knowledge Consistent Alignment
Institution: Sun Yat-sen University, Tencent AI Lab
This paper introduces an innovative KCA method that reduces the inconsistency between external and intrinsic knowledge, thereby mitigating hallucinations in LLMs during alignment. The study offers several insights for future research, notably the excellent performance of the KCA method across various scenarios and the combination of its simplicity and effectiveness.
arXiv
Summary
GitHub
01-19 Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Institution: Princeton University, Together AI, University of Illinois Urbana-Champaign
The paper presents Medusa, an efficient method for accelerating LLM inference by adding multiple decoding heads that predict multiple tokens in parallel, substantially reducing the number of decoding steps and significantly improving the inference speed of large models.
arXiv
Summary
01-18 ChatQA: Building GPT-4 Level Conversational QA Models
Institution: NVIDIA
The ChatQA model significantly improves the effectiveness of multi-turn conversational QA through a two-stage instruction tuning strategy, particularly in context understanding and information retrieval.
arXiv
Summary
01-18 Self-Rewarding Language Models
Institution: Meta, NYU
This work introduces Self-Rewarding Language Models intended to bypass the bottleneck of human preference data by self-training to enhance the model's self-rewarding and instruction-following capabilities. The experimental results are promising, setting a precursor for models that can continuously improve themselves.
arXiv
Summary
01-18 Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation
Institution: The University of Tokyo, RIKEN
This research innovatively incorporates an explicit reasoning process and question-generation ability into LMMs, promoting more reliable inferences. By creating a new dataset and leveraging it for model training, it sets a precedent for future advancements in LMMs and enables the model to generate explicit reasoning steps and questions when faced with uncertainty.
arXiv
Summary
01-18 A Fast, Performant, Secure Distributed Training Framework For Large Language Model
Institution: Ant Group China
This paper presents a secure distributed training framework based on model slicing, which prevents leakage of model parameters and data on both server and client sides while preserving training accuracy and high efficiency.
arXiv
Summary
01-17 ReFT: Reasoning with Reinforced Fine-Tuning
Institution: ByteDance Research
ReFT significantly enhances the performance and generalization ability of LLMs in math problem-solving tasks by optimizing non-differentiable objectives through reinforcement learning. It transcends traditional supervised learning methods and shows potential for more complex reasoning tasks.
arXiv
Summary
01-17 LLMs for Relational Reasoning: How Far are We?
Institution: Continental-NTU Corporate Lab, Nanyang Technological University, Singapore
The paper primarily examines the capacities and constraints of large language models in the area of relational reasoning. Through extensive assessments, including novel testing procedures and an evaluation module, the findings indicate that while LLMs perform reasonably well on certain relational reasoning tasks, they are outperformed by models specifically designed for logical reasoning.
arXiv
Summary
01-17 Vlogger: Make Your Dream A Vlog
Institution: Shanghai Jiao Tong University, Shanghai AI Laboratory, Shenzhen Institute of Advanced Technology Chinese Academy of Sciences
This paper presents the innovative use of LLMs in the production of video blogs, addressing the challenges of creating minute-scale coherent video content and delivering exceptional experimental results.
arXiv
Summary
GitHub
01-16 SpecGen: Automated Generation of Formal Program Specifications via Large Language Models
Institution: Nanjing University, Nanyang Technological University, Singapore Management University
The paper presents SpecGen, an automated formal program specification generation technique that combines Large Language Models with a heuristic selection strategy. By comparison with existing tools and purely LLM-based methods, SpecGen showcases superior efficiency and accuracy in specification generation and offers a dataset to facilitate future research.
arXiv
Summary
01-16 RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture
Institution: Microsoft
The paper studies the performance of large language models on agricultural data for Q&A pair generation and presents a new pipeline that efficiently utilizes RAG and fine-tuning techniques to enhance LLM applicability in specific industries, expanding the potential for LLMs' application in targeted sectors.
arXiv
Summary
01-16 MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline
Institution: Alibaba Group
The paper presents a new math reasoning dataset combined with a Python code interpreter, significantly improving LLM performance on math problem-solving tasks through dataset enhancement and specific fine-tuning protocols.
arXiv
Summary
01-16 Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models
Institution: Tencent AI Lab
The article delves into analyzing the domain mismatch problem of LLMs in machine translation tasks and experiments with the impact of varying amounts of parallel data on LLM translation capabilities, showcasing the potential of LLMs in addressing these challenges.
arXiv
Summary
GitHub
01-16 DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models
Institution: Zhejiang University
DoraemonGPT is an LLM-driven agent that employs symbolic memory and a set of tools to understand and answer complex questions involving dynamic videos. It leverages an MCTS planner to optimize the process of generating answers, enabling it to handle more complex tasks in real-world scenarios.
arXiv
Summary
01-16 Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation
Institution: Johns Hopkins University, Microsoft
This paper introduces CPO, a novel LLM fine-tuning method that effectively overcomes the bottlenecks of SFT for MT tasks and achieves significant performance gains in moderate-sized LLM translation models with minimal resource expenditure, rivaling the most advanced state-of-the-art translation systems.
arXiv
Summary
01-15 MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models
Institution: Microsoft Research India
This study investigates the performance of large language models on multilingual tasks following parameter-efficient fine-tuning, especially in the context of low-resource languages and English tasks. It demonstrates the potential of PEFT and highlights areas for future work.
arXiv
Summary
01-15 The What, Why, and How of Context Length Extension Techniques in Large Language Models -- A Detailed Survey
Institution: Technology Innovation Institute UAE, Islamic University of Technology Bangladesh, Stanford University, Amazon GenAI, AI Institute University of South Carolina
The paper is a detailed survey on context length extension techniques in LLMs. It provides an organized overview of current strategies and challenges for researchers in the field and encourages discussions on future advancements.
arXiv
Summary
01-15 A Study on Large Language Models' Limitations in Multiple-Choice Question Answering
Institution: David R. Cheriton School of Computer Science
The study investigates the limitations of LLMs in MCQ tasks, highlighting poor performance by most models in such tasks. It also finds model answers often depend on the order of options and proposes effective assessment methods to eliminate these biases. The paper recommends exercising caution when using MCQs to evaluate LLMs and testing whether models truly understand the task at hand.
arXiv
Summary
01-14 Small LLMs Are Weak Tool Learners: A Multi-LLM Agent
Institution: Sun Yat-sen University, Alibaba Group
The study reveals the weakness of small LLMs as tool learners and introduces the α-UMi multi-LLM framework, which outperforms the single-LLM approach. It highlights a crucial two-stage fine-tuning strategy and delves into data-scaling laws.
arXiv
Summary
01-13 Bridging the Preference Gap between Retrievers and LLMs
The paper presents the BGM framework to address the "preference gap" between retrievers and LLMs. Through a seq2seq bridge model and a combined SL and RL training scheme, the framework optimizes the retrieved information to fit LLMs' preferences, improving performance in multiple downstream tasks.
arXiv
Summary
01-12 APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding
Institution: Tsinghua University, Zhipu AI
The research presents APAR as a method that significantly enhances the decoding efficiency and generation speed of LLMs in both memory-limited and high-throughput scenarios while maintaining generation quality, providing a potent new approach for deploying large language models efficiently.
arXiv
Summary
01-12 An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models
Institution: University of Washington Seattle, University of Wisconsin-Madison, Stanford University
The paper proposes an experimental design framework intended to improve the label efficiency of large language models during supervised fine-tuning (SFT). It shows that experimental design techniques can significantly increase label efficiency at low computational cost, saving up to 50% of annotation costs on some tasks compared to random sampling.
arXiv
Summary
01-12 TestSpark: IntelliJ IDEA's Ultimate Test Generation Companion
Institution: JetBrains Research, Delft University of Technology
The paper introduces the TestSpark plugin, which integrates search-based software test generation and language model-based methods to enhance the efficiency of generating and integrating unit tests in IntelliJ IDEA, while also addressing the compilability issue of tests generated by LLMs. The open-source nature of the plugin facilitates the bridging between software developers and researchers, contributing to the practical advancement of test generation technologies.
arXiv
Summary
GitHub
01-12 Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation
Authors: Tianyu Zheng, Shuyue Guo, Xingwei Qu, Jiawei Guo, Weixu Zhang, Xinrun Du, Chenghua Lin, Wenhao Huang, Wenhu Chen, Jie Fu, Ge Zhang
The paper presents the Kun strategy, addressing the data consistency issue in Chinese large language model instruction fine-tuning, reducing dependency on manual annotation through the AP process and new data generation methods. The evaluation results indicate that the Kun strategy has a significant advantage in creating high-quality datasets.
arXiv
Summary
GitHub
01-12 From Automation to Augmentation: Large Language Models Elevating Essay Scoring Landscape
Institution: Tsinghua University, University of Maryland, Beijing Xicheng Educational Research Institute
This research showcases the potential of large language models in the field of education, especially within AES systems. LLMs not only have the ability to automate scoring processes but also enhance the performance of human graders through generated feedback. This advancement offers valuable insights for the future of AI-assisted education and efficient collaboration between AI and humans.
arXiv
Summary
01-12 How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Institution: Virginia Tech, Renmin University of China, UC Davis
This paper presents a novel perspective on studying AI safety by humanizing LLMs, applying over a decade of social science research to AI safety, establishing a persuasion taxonomy, and creating a tool that automatically generates adversarial prompts. The results demonstrate the effectiveness of persuasion in increasing the likelihood of LLMs performing risky behaviors, while also revealing the insufficiency of current defense measures against such strategies.
arXiv
Summary
01-12 Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation
Institution: Nanyang Technological University, Fudan University
This paper presents TOOLGEN, a novel approach that integrates autocompletion tools into the repository-level code generation process of LLMs, resolving dependency issues and boosting both the quality and success rate of code generation.
arXiv
Summary
01-11 Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems
Institution: Zhongguancun Laboratory, Tsinghua University, Institute of Information Engineering Chinese Academy of Sciences
This paper provides a comprehensive overview of the risk taxonomy, mitigation measures, and assessment benchmarks for large language model systems, offering a new systematic framework to help developers more comprehensively understand and deal with the potential risks of LLM systems.
arXiv
Summary
01-11 TOFU: A Task of Fictitious Unlearning for LLMs
Institution: Carnegie Mellon University
The paper provides a new dataset and evaluation mechanisms for the issue of unlearning in LLMs. The TOFU task highlights the deficiencies of current unlearning techniques and encourages further improvements and research.
arXiv
Summary
01-11 Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models
Institution: Google Research, Tel Aviv University
The paper presents a framework named Patchscopes, offering a novel approach to interpret the information encoded in the hidden representations of large language models (LLMs) and to correct multi-hop reasoning errors. Patchscopes serves as a general modular framework, unifying existing interpretative tools and addressing their deficiencies, while also paving the way for new research and application opportunities.
arXiv
Summary
01-11 Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint
Institution: Gaoling School of Artificial Intelligence, Renmin University of China; School of Information, Renmin University of China; Kuaishou Technology, Beijing, China
This paper presents RLMEC, a novel RL method that employs generative reward models with a minimum editing mechanism, enabling precise supervision and stability in training large language models with RL.
arXiv
Summary
GitHub
01-11 Chain of History: Learning and Forecasting with LLMs for Temporal Knowledge Graph Completion
Institution: Tsinghua Shenzhen International Graduate School Tsinghua University, School of Computer Science Peking University, Baidu Inc.
The paper presents a method for temporal knowledge graph completion utilizing large language models. Efficient fine-tuning and structure-aware historical data augmentation improve the model's reasoning capabilities and performance. Experiments demonstrate that this approach effectively enhances the precision of temporal knowledge graph predictions, achieving state-of-the-art results.
arXiv
Summary
01-11 Evidence to Generate (E2G): A Single-agent Two-step Prompting for Context Grounded and Retrieval Augmented Reasoning
Institution: Qatar Computing Research Institute
This paper introduces Evidence to Generate (E2G), a single-agent, two-step prompting framework aimed at improving the context-grounded reasoning abilities of LLMs. By prompting LLMs to generate evidence and explanations alongside answers, E2G reduces erroneous reasoning and improves accuracy across a variety of reasoning tasks. Experiments show that E2G outperforms CoT on multiple context-intensive language tasks.
arXiv
Summary
01-11 LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase
Institution: LAIR Lab Lehigh University, Huazhong University of Science and Technology
This study defines the mixed LLM-human text (mixcase) found in mixed-authorship scenarios, creates the MIXSET dataset, and offers insights and directions for detecting such text. It reveals that existing detectors fall short in recognizing mixcase, underlining the urgent need for more fine-grained detectors.
arXiv
Summary
GitHub
01-11 EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction
Institution: Fudan University, Microsoft Research Asia, Zhejiang University
This paper proposes EASYTOOL, a method that enhances LLM-based agents' performance in tool usage by simplifying and unifying instructions from tool documentation, addressing the issues of inconsistency, redundancy, and incompleteness.
arXiv
Summary
GitHub
01-11 The Benefits of a Concise Chain of Thought on Problem-Solving in Large Language Models
Institution: Johns Hopkins University
The study demonstrates that concise Chain-of-Thought (CCoT) prompting can significantly reduce the length of text outputs in large language models without compromising performance in problem-solving tasks.
arXiv
Summary
01-10 InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks
InfiAgent-DABench offers a novel benchmarking tool that not only aids in measuring the performance of intelligent agents in data analysis tasks but also represents an essential step in exploring how to improve and optimize the application of LLMs in this specific domain.
arXiv
Summary
GitHub
01-10 Prompting Large Language Models for Recommender Systems: A Comprehensive Framework and Empirical Analysis
Institution: Renmin University of China, Beijing Key Laboratory of Big Data Management and Analysis Methods, Meituan Group
This work introduces a framework named ProLLM4Rec, offering a systematic analysis of utilizing Large Language Models (LLMs) as foundation models for recommender systems, and experimentally tests the impact of different conditions on LLMs. The empirical findings are summarized to provide insights for future research.
arXiv
Summary
01-10 Leveraging Print Debugging to Improve Code Generation in Large Language Models
Institution: Zhejiang University, ByteDance
The paper proposes a methodology for using print debugging to guide LLMs in code generation and debugging, validating its effectiveness on the Leetcode dataset, especially for easy and medium complexity problems. Despite limited success with hard-level problems, this work represents a significant advancement in the field of LLMs for code debugging.
arXiv
Summary
01-10 AUTOACT: Automatic Agent Learning from Scratch via Self-Planning
Institution: Zhejiang University, Alibaba Group
This research introduces AUTOACT, a framework for autonomous learning of language agents through self-instruction and self-planning to tackle the challenge of learning new tasks from scratch. The key contributions lie in its effective data augmentation method and the highly efficient automatic agent learning process.
arXiv
Summary
GitHub
01-10 Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
Institution: Tsinghua University, Xiaomi AI Lab
As a survey work, the paper presents the current status, challenges, and future trends of personal LLM agents and proposes a generic system architecture and intelligence level definition.
arXiv
Summary
01-10 Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk
Institution: AWS AI Labs
The paper presents a novel approach for generating training data by enabling LLMs to conduct self-talk dialogues, which has the potential to improve the performance of task-oriented dialogue agents. Despite certain limitations, the findings suggest that high-quality dialogues can serve as a strong training signal for LLMs, validating the idea of LLMs' capacity to self-improve when trained on their own generated content, leading to better performance in task-oriented dialogue settings.
arXiv
Summary
01-10 Attendre: Wait To Attend By Retrieval With Evicted Queries in Memory-Based Transformers for Long Context Processing
Institution: Google Research
The paper successfully proposes a new memory-based transformer method that effectively reduces memory demands and supports bidirectional attention through storage eviction policies and the ATTENDRE layer, demonstrating performance on par with traditional methods in long-sequence processing.
arXiv
Summary
01-10 CASA: Causality-driven Argument Sufficiency Assessment
Institution: Peking University
This paper introduces a zero-shot Causality-driven Argument Sufficiency Assessment framework (CASA) based on LLMs, which effectively tackles challenges in quantifying and intervening in argument sufficiency without observational data and demonstrates its effectiveness in practical applications.
arXiv
Summary
GitHub
01-09 Agent Alignment in Evolving Social Norms
Institution: Fudan University
This paper introduces an EvolutionaryAgent framework to assess and enhance the adaptiveness and alignment of large intelligent agents in dynamic and constantly evolving societal norms. The research highlights the significance of agent alignment with societal norms during evolution and validates the framework's efficacy through experiments.
arXiv
Summary
01-09 Know Your Needs Better: Towards Structured Understanding of Marketer Demands with Analogical Reasoning Augmented LLMs
Institution: Zhejiang University, Ant Group
The paper presents a new method named ARALLM that combines analogical reasoning and multi-task model distillation to effectively enhance LLMs' ability to understand and transform natural language into structured logical expressions. This method allows non-expert marketers to use natural language for user targeting, which potentially changes the practice of user targeting. The improvement in this capability not only has practical value in marketing scenarios but also contributes valuable exploration to the functionality and practicality of large language models.
arXiv
Summary
01-09 Large Language Models for Robotics: Opportunities, Challenges, and Perspectives
Institution: Northwestern Polytechnical University, University of Georgia, Shaanxi Normal University
The multimodal GPT-4V framework proposed in the paper, which combines NLP and visual perception, aims to tackle challenges faced by LLMs in robotic task planning. It holds significant implications for advancing human-machine interaction and shaping the future of intelligent AI systems.
arXiv
Summary
01-09 Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding
Institution: University of California San Diego, Google Cloud AI Research, Google Research
The paper introduces the innovative CHAIN-OF-TABLE framework, which enhances reasoning capabilities of LLMs by explicitly incorporating tabular data into the reasoning chain, dynamically planning and updating the process, thereby increasing accuracy and reliability for table-based reasoning tasks.
arXiv
Summary
01-09 Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search
Institution: Nanyang Technological University Singapore
ReCo significantly enhances code search accuracy by utilizing LLMs to rewrite code in the codebase through style normalization and introduces a new metric, CSSim, to quantify stylistic differences, advancing research in code style normalization.
arXiv
Summary
01-09 The Critique of Critique
Institution: The Hong Kong Polytechnic University, Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory
METACRITIQUE is the first framework to evaluate natural language critiques, assessing the quality of critiques using principles of precision and recall, and has achieved a high level of interpretability and transparency.
arXiv
Summary
GitHub
01-08 SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems
Institution: Fudan University
The paper proposed a multi-modal large language model-based multi-agent system—SpeechAgents, capable of simulating human communication scenarios involving up to 25 agents, exhibiting exceptional scalability. By utilizing multi-modal signals as the medium for agent communication, the system not only can simulate dialogues with correct content, authentic rhythm, and rich emotions but also can be applied to tasks such as drama creation and the generation of audio novels.
arXiv
Summary
01-08 MARG: Multi-Agent Review Generation for Scientific Papers
Institution: Northwestern University, The Hebrew University of Jerusalem, Allen Institute for AI
This paper presents an innovative multi-agent review generation method (MARG) capable of overcoming the context size limitations of the base model and of generating high-quality peer-review feedback for scientific papers. The quality of feedback generated by MARG significantly surpasses the baselines in user studies and automated metrics, with a 2.2-fold increase in the number of helpful comments and a greater generation of specific comments.
arXiv
Summary
01-08 TTMs: Fast Multi-level Tiny Time Mixers for Improved Zero-shot and Few-shot Forecasting of Multivariate Time Series
Institution: IBM Research
TTM demonstrates the effectiveness and transfer learning capabilities of tiny pretrained models that are exclusively trained on diverse time series data for improved multivariate time series forecasting in few/zero-shot scenarios.
arXiv
Summary
01-07 Grimoire is All You Need for Enhancing Large Language Models
Institution: Beihang University, Renmin University of China
The paper introduces a method named SLEICL that significantly enhances the ICL capability of weak language models by learning and transferring skills from strong language models. The effectiveness of the method is validated through experiments, demonstrating the potential of this technology in enhancing weak language models' context learning abilities.
arXiv
Summary
01-07 Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon
Institution: Beijing Academy of Artificial Intelligence, Renmin University of China, Nankai University
The paper introduces Activation Beacon, a new technique to extend the context length of Large Language Models, enabling the perception of extensive context within a limited context window, while fully preserving capability on short contexts. Activation Beacon provides an effective, efficient, compatible, and low-training-cost method for extending LLMs' context length.
arXiv
Summary
01-07 Exploring Large Language Model based Intelligent Agents: Definitions, Methods, and Prospects
Institution: The Chinese University of Hong Kong, DeepWisdom, Peking University
The paper presents a framework for guiding future research and development of LLM-based intelligent agent systems, explores different methods of improving their planning capabilities, multimodal information processing, and how to address the challenges faced by LLM agents, offering a clear guide for future research directions.
arXiv
Summary
01-07 ChatGPT for Conversational Recommendation: Refining Recommendations by Reprompting with Feedback
Institution: University of Louisville, Microsoft
This paper explores the efficacy of ChatGPT as a conversational recommendation system. It develops a process around ChatGPT that simulates real-user scenarios and mitigates popularity bias.
arXiv
Summary
01-06 CogGPT: Unleashing the Power of Cognitive Dynamics on Large Language Models
Institution: Harbin Institute of Technology, Kuaishou Technology
CogGPT addresses challenges faced by large language models in emulating human cognitive dynamics by introducing an iterative cognitive mechanism and a memory retention system, showcasing impressive performance in continuous information processing.
arXiv
Summary
01-06 Quartet Logic: A Four-Step Reasoning (QLFR) framework for advancing Short Text Classification
Institution: Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Target Cognition and Application Technology; University of Chinese Academy of Sciences
This study introduces the Quartet Logic: Four-Step Reasoning (QLFR) framework for short text classification, along with a CoT-Driven Multi-task Learning (QLFR-CML) method. Both approaches use the chain-of-thought reasoning of large language models to address challenges in the STC field. Experimental results confirm their effectiveness and applicability.
arXiv
Summary
01-06 The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models
Institution: Renmin University of China, Université de Montréal
The paper provides a systematic empirical study to deeply understand and explore the problem of hallucinations in large language models, identifying the sources of hallucination, detection methods, mitigation strategies, and proposing the new benchmark HaluEval 2.0 and a simple yet effective hallucination detection framework.
arXiv
Summary
GitHub
01-05 Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Institution: Alibaba Group, Shanghai Jiao Tong University
The paper presents an efficient cloud service system for long-context language models. The distributed algorithm DistAttention optimizes the processing and storage of the attention module, while the DistKV-LLM service system manages and coordinates the distributed KV cache, achieving efficient allocation and management of resources in a distributed environment and demonstrating significant performance improvements.
arXiv
Summary
01-05 From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models
Institution: Beike Inc.
The paper introduces the RAISE framework, which enhances the performance of LLMs in multi-turn dialogues, especially in real estate sales contexts, by incorporating an augmented memory system and a structured agent construction process.
arXiv
Summary
01-04 LLM Augmented LLMs: Expanding Capabilities through Composition
Institution: Google Research, Google DeepMind
The paper presents a new framework for model extension - CALM, which successfully integrates two large language models to perform new tasks and demonstrates its effectiveness across multiple experiments.
arXiv
Summary
01-04 Using LLM to select the right SQL Query from candidates
Institution: Peking University
This research proposes a method for automatically generating test cases for text-to-SQL using LLMs and presents a three-step re-ranking process. The method significantly improves the performance of existing text-to-SQL models, as evidenced by experiments.
arXiv
Summary
01-04 ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers
Institution: Bytedance Inc.
This paper introduces a methodology, ICE-GRT, designed to enhance the depth and accuracy of LLMs in handling domain-specific tasks. By incorporating reinforcement learning from human feedback, ICE-GRT significantly improves domain-specific capabilities without sacrificing general task performance, achieving state-of-the-art in several assessment tasks.
arXiv
Summary
01-04 Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives
Institution: Zhejiang University, OPPO Research Institute
This paper introduces a new strategy called "Self-Contrast" to address issues of stubbornness and inconsistency in reflection and self-correction processes within Large Language Models (LLMs). By creating diverse solving perspectives, contrasting different solutions, and summarizing disparities into a checklist, it enhances the quality of LLM reflection. The approach's effectiveness and broad applicability are validated through experiments.
arXiv
Summary
01-04 SPEER: Sentence-Level Planning of Long Clinical Summaries via Embedded Entity Retrieval
Institution: Columbia University
This paper proposes SPEER, a sentence-level planning method through embedded entity retrieval for long document tasks of hospital discharge summaries. It guides large language models (LLMs) to better cover key entities and generate more complete and credible clinical summaries. The research demonstrates that the SPEER method can improve document coverage and accuracy in practical applications, thereby reducing the documentation burden on clinicians.
arXiv
Summary
01-04 On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS)
Institution: University of South Carolina, New Mexico State University, IBM Research
This survey examines the prospects of integrating Large Language Models (LLMs) with Automated Planning and Scheduling (APS) across eight planning problem categories, moving beyond the traditionally limited context adaptability of classical planners toward a more dynamic, context-aware planning pathway, and laying a foundation for further application and research.
arXiv
Summary
01-03 MedSumm: A Multimodal Approach to Summarizing Code-Mixed Hindi-English Clinical Queries
Institution: Indian Institute of Technology Patna, Stanford University, Amazon GenAI
MedSumm presents a novel approach for multimodal medical question summarization, integrating textual and visual information to create medically detailed summaries potentially enhancing the quality of healthcare decision-making and deepening the understanding of patient queries.
arXiv
Summary
01-03 Social Media Ready Caption Generation for Brands
Institution: Adobe Research India
The paper introduces a new framework designed to aid brands in creating engaging captions on social media that align with their brand image and personality. The framework, which consists of two parts, successfully addresses the challenge of generating socially engaging and relevant captions for brands.
arXiv
Summary
01-02 LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
The paper presents a method for extending the context window of LLMs without fine-tuning, which is crucial for improving large language models' ability to process long texts when computational resources are limited.
arXiv
Summary
01-02 A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
Institution: Islamic University of Technology Bangladesh, University of South Carolina, Stanford University
This paper offers an exhaustive survey on hallucination mitigation techniques in LLMs, proposing a categorization framework and systematic feedback and reasoning methods, and assesses the efficacy and impact of these techniques.
arXiv
Summary
01-01 From Prompt Engineering to Prompt Science With Human in the Loop
Institution: University of Washington
The paper demonstrates how to transition prompt engineering for LLMs into a more scientific and systematic prompt science. By incorporating a qualitative coding method analogous to the human-in-the-loop approach, it ensures the quality and consistency of the responses generated by the LLM while eliminating individual subjectivity and randomness.
arXiv
Summary
01-01 A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models
Institution: The Chinese University of Hong Kong, Tencent AI Lab
This work proposes LogicAsker, addressing the challenge of evaluating and improving the logical reasoning abilities of LLMs through comprehensive assessment and effective enhancement via problem generation and in-context learning.
arXiv
Summary
01-01 The Earth is Flat? Unveiling Factual Errors in Large Language Models
Institution: The Chinese University of Hong Kong, Tencent AI Lab
The FactChecker introduced in this paper provides a new automated framework for testing factual inaccuracies in large language models and has been shown to uncover and reduce factual errors in these models through the construction of knowledge graphs and the generation of test questions.
arXiv
Summary

2023-12

 Date   Paper Links & Summary
12-31 BatchEval: Towards Human-like Text Evaluation
Institution: Beijing Institute of Technology, Xiaohongshu Inc
The paper introduces a novel LLM evaluation paradigm, BatchEval, that addresses the issues of robustness and consistency with human judgment in automatic text evaluation. By implementing batch-wise evaluation and iterative processing, BatchEval significantly surpasses existing methods in terms of accuracy and cost-efficiency.
arXiv
Summary
12-31 Improving Text Embeddings with Large Language Models
Institution: Microsoft Corporation
The paper introduces an innovative text embedding approach utilizing the latest LLMs and synthetic data, matching performance on competitive benchmarks with fewer than 1,000 training steps and no labeled data, offering strong evidence for further advancements in text embedding technology.
arXiv
Summary
12-29 Enhancing Quantitative Reasoning Skills of Large Language Models through Dimension Perception
Institution: Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; School of Data Science, Fudan University; DataGrand Co., Ltd.
This research has significantly improved LLMs' quantitative reasoning abilities by establishing a dimensional unit knowledge base and a customized benchmark test, providing a new pathway for understanding and reasoning accurately with vital quantitative information in text.
arXiv
Summary
12-29 Building Efficient Universal Classifiers with Natural Language Inference
Institution: Vrije Universiteit Amsterdam, Royal Holloway University of London, Hugging Face
The paper provides a novel approach to universal text classification using natural language inference, complete with detailed steps and tools needed to implement the method, significantly increasing model efficiency without compromising performance.
arXiv
Summary
12-29 DB-GPT: Empowering Database Interactions with Private Large Language Models
Institution: Alibaba Group
This paper presents DB-GPT, an innovation integrating LLMs and database systems to enhance user experience and accessibility, demonstrating a hierarchical design that effectively addresses concerns such as privacy and security protection, while also elevating the system's overall performance and efficiency through multi-source RAG and adaptive ICL mechanisms.
arXiv
Summary
GitHub
12-29 Truth Forest: Toward Multi-Scale Truthfulness in Large Language Models through Intervention without Tuning
arXiv
GitHub
12-29 The Right Prompts for the Job: Repair Code-Review Defects with Large Language Model
Institution: Ant Group, Nanjing University
The research explores the application of LLMs in repairing code review defects, introduces an effective semi-automated APR paradigm, analyzes the performance of 9 popular models, and designs effective prompts to guide the code repair process.
arXiv
Summary
12-28 Improving In-context Learning via Bidirectional Alignment
Institution: Nanyang Technological University, Princeton University, Salesforce Research USA
The paper introduced Bidirectional Alignment (BiAlign), which effectively improves the ICL abilities of smaller models by integrating a new ranking loss along with aligning the output distribution.
arXiv
Summary
12-28 Experiential Co-Learning of Software-Developing Agents
Institution: Tsinghua University, Dalian University of Technology, Beijing University of Posts and Telecommunications
The paper proposes a new framework named Experiential Co-Learning, in which co-tracking, co-memorizing, and co-reasoning modules are applied sequentially so that LLM-driven intelligent agents learn more effectively from historical trajectories and draw on past experiences to reason jointly when solving new tasks. It shows a clear performance improvement over existing techniques.
arXiv
Summary
12-28 Challenge LLMs to Reason About Reasoning: A Benchmark to Unveil Cognitive Depth in LLMs
Institution: Chinese University of Hong Kong, Tencent AI Lab
This paper presents a new evaluation paradigm that challenges LLMs to engage in meta-reasoning, and it introduces the accompanying open-source benchmark DiagGSM8K, adding a new dimension to the evaluation of LLMs' cognitive abilities.
arXiv
Summary
12-28 Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos
Institution: Tsinghua University
This paper presents the Grounding-Prompter method, addressing the TSG challenge in long videos by combining LLM with temporal reasoning and multimodal information, demonstrating the effectiveness of prompting LLM with multimodal data, and validating its superiority in TSG tasks for long videos through experiments.
arXiv
Summary
12-28 DrugAssist: A Large Language Model for Molecule Optimization
Institution: Tencent AI Lab, Department of Computer Science Hunan University
DrugAssist is a model that facilitates molecule optimization through human-machine interaction, overcoming the limited interactivity of LLM applications in drug discovery and showcasing superior multi-property optimization abilities.
arXiv
Summary
GitHub
12-28 Structured Packing in LLM Training Improves Long Context Utilization
Institution: University of Warsaw, Google DeepMind, Polish Academy of Sciences
This paper introduces the SPLICE method to enhance utilization of long-range contexts and validates its effectiveness in improving context utilization and performance on long-context tasks for large-scale language models. SPLICE is especially applicable for constructing training examples in training datasets that lack additional structured information.
arXiv
Summary
12-28 GitAgent: Facilitating Autonomous Agent with GitHub by Tool Extension
Institution: Tsinghua University, Renmin University of China
This paper introduces GITAGENT, an autonomous agent that can extend tools from GitHub to meet the varied demands of user queries. By addressing the challenge of non-standardization, GITAGENT autonomously learns human experience from GitHub Issues/PRs to overcome problems during tool extension, showing its effectiveness in autonomously integrating tools for task accomplishment across various domains.
arXiv
Summary
12-27 Conversational Question Answering with Reformulations over Knowledge Graph
Institution: University of Illinois at Urbana-Champaign, Amazon
CoRnNet is a novel RL model for ConvQA over knowledge graphs that leverages LLM-generated reformulations, showing superior performance over other advanced models.
arXiv
Summary
12-27 How Robust are LLMs to In-Context Majority Label Bias?
Institution: Amazon
The article conducts a comprehensive study on the robustness of LLMs when faced with majority label bias in ICL, finding significant stability in certain models in handling such bias.
arXiv
Summary
12-27 Rethinking Tabular Data Understanding with Large Language Models
Institution: UC San Diego, USC, UC Davis
The paper delves into the understanding and reasoning capabilities of LLMs over tabular data, contributing insights into the robustness of table structure, the comparison of textual versus symbolic reasoning, and the impact of aggregating multiple reasoning pathways on model performance. The proposed table structure normalization method and the mix self-consistency mechanism are instrumental in enhancing LLMs' performance in tabular data reasoning.
arXiv
Summary
12-27 Adapting Large Language Models for Education: Foundational Capabilities, Potentials, and Challenges
Institution: Shanghai Jiao Tong University (SJTU)
This paper is a survey on how to adapt large language models for the education system. It provides an overview of the development of LLMs in education-related capabilities, explores the potential and challenges in building such systems, and offers insights for future related research.
arXiv
Summary
12-26 KnowledgeNavigator: Leveraging Large Language Models for Enhanced Reasoning over Knowledge Graph
Institution: Northeastern University, Neusoft AI Magic Technology Research, Neusoft Institute of Intelligent Medical Research
The paper introduces KnowledgeNavigator, a novel framework designed to enhance LLM reasoning over knowledge graphs, addressing LLM's limitations in complex reasoning tasks. The effectiveness demonstrated by the experiments suggests potential for broader application of LLMs in high-risk and sensitive domains.
arXiv
Summary
12-26 Supervised Knowledge Makes Large Language Models Better In-context Learners
Institution: School of Engineering Westlake University, Westlake Institute for Advanced Study, Peking University
The SuperContext framework proposed in the paper significantly enhances the generalizability and factuality of LLMs in natural language understanding and question answering tasks by leveraging the supervised knowledge from task-specific fine-tuned SLMs. It represents an innovative approach to incorporating the strengths of small models into LLMs to deal with OOD data and minimize hallucinations.
arXiv
Summary
12-26 Align on the Fly: Adapting Chatbot Behavior to Established Norms
Institution: Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, The Hong Kong Polytechnic University
The research advances a dynamic OPO method that aligns LLMs with the complex and varying landscape of human values in real-time, using collected rules as external memory without further training. Despite limitations in inference efficiency and potential for retrieval model enhancements, extensive experiments across multiple evaluation datasets vouch for the method's effectiveness.
arXiv
Summary
GitHub
12-26 Aligning Large Language Models with Human Preferences through Representation Engineering
Institution: Fudan University
This paper introduces a novel RAHF method, which manipulates internal model representations through representation engineering techniques to align LLMs with human preferences. The method is computationally efficient, easy to implement, and shows potential in managing a spectrum of human preferences or values.
arXiv
Summary
12-26 RecRanker: Instruction Tuning Large Language Model as Ranker for Top-k Recommendation
Institution: City University of Hong Kong, The Chinese University of Hong Kong, Hangdian University
The paper presents a novel framework named RecRanker, which optimizes the performance of LLMs in top-k recommendation tasks through instruction tuning and effectively integrates signals from traditional recommendation systems, improving the model's application performance in recommendation scenarios.
arXiv
Summary
12-26 A Prompt Learning Framework for Source Code Summarization
Institution: Nanyang Technological University, Tencent Inc., Nanjing University
This paper introduces a novel PromptCS framework for source code summarization, capable of generating high-quality summaries while reducing training costs, and provides open-source code for further research.
arXiv
Summary
12-26 Scaling Down, LiTting Up: Efficient Zero-Shot Listwise Reranking with Seq2seq Encoder-Decoder Models
Institution: University of Waterloo
The paper introduces LiT5-Distill and LiT5-Score, two sequence-to-sequence encoder-decoder models for efficient zero-shot listwise reranking. These methods not only offer competitive performance but also address traditional reliance on large LLMs and external relevance labels, showcasing optimization and advancement in this domain.
arXiv
Summary
GitHub
12-26 Think and Retrieval: A Hypothesis Knowledge Graph Enhanced Medical Large Language Models
Institution: Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University, Beijing, China
The HyKGE framework effectively addresses the accuracy and interpretability challenges faced by large language models in dealing with complex problems in the medical field, demonstrating potential for applications in the medical domain and showcasing its superiority in real-world scenarios.
arXiv
Summary
12-25 Alleviating Hallucinations of Large Language Models through Induced Hallucinations
Institution: Soochow University, Tencent AI Lab
The paper offers a novel method to reduce hallucinations in LLMs by constructing a factually weaker model and subtracting its knowledge in the generation process, improving the generation of factual content.
arXiv
Summary
12-25 ESGReveal: An LLM-based approach for extracting structured data from ESG reports
Institution: Alibaba Cloud, Tsinghua University, Sun Yat-Sen University
ESGReveal marks significant progress in ESG data processing, using large language models and related techniques to improve the consistency and accuracy of structured data extraction from corporate reports, thereby promoting improvements in ESG practices and transparency.
arXiv
Summary
12-22 Plan, Posture and Go: Towards Open-World Text-to-Motion Generation
Institution: Tsinghua University, Microsoft Research Asia
The researchers introduced a new framework named PRO-Motion to overcome limitations of traditional text-to-motion generation methods, successfully generating more diverse and realistic motions in open-world scenarios.
arXiv
Summary
12-22 NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
Institution: University of Michigan, Rutgers University
This paper presents a novel method of assessing the reasoning abilities of LLMs through the NPHardEval benchmark. The benchmark covers a broad range of problems from polynomial time complexity to NP-Hard levels, and it features a dynamic data updating mechanism to prevent model overfitting, ensuring reliable and authentic assessment results. The findings significantly advance the understanding of current capabilities of LLMs and pave the way for improving the reasoning abilities of these models.
arXiv
Summary
GitHub
12-22 VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
Institution: University of Waterloo, IN.AI Research
The paper presents an evaluation framework called VIEScore that provides explainable evaluations for conditional image generation tasks. VIEScore addresses the inability of existing automated metrics to explain their scoring rationale and is adaptable to various task requirements.
arXiv
Summary
12-22 A Survey of Reinforcement Learning from Human Feedback
Institution: LMU Munich, Duke Kunshan University
This article is a survey of RLHF, analyzing its applications at the crossroads of artificial intelligence and human-computer interaction and discussing the latest research trends, especially those related to LLMs.
arXiv
Summary
12-22 Generative AI Beyond LLMs: System Implications of Multi-Modal Generation
The paper is the first to characterize system performance for models that span across text, image, and video generation, revealing unique system properties distinct from traditional LLMs. It also highlights challenges and opportunities where traditional optimizations might need rethinking for TTI/TTV models.
arXiv
Summary
12-22 Large Language Model (LLM) Bias Index -- LLMBI
Institution: University of Oxford, University Canada West, Amazon Web Services (AWS)
The introduction of LLMBI marks a significant step towards creating fairer and more reliable LLMs. It provides a quantifiable measure of bias for system engineers and researchers, guiding them to continuously improve these powerful models and ensuring that they reflect society's diverse and evolving fabric.
arXiv
Summary
12-22 Reasons to Reject? Aligning Language Models with Judgments
Institution: Tencent AI Lab, The Chinese University of Hong Kong
The paper presents Contrastive Unlikelihood Training (CUT), a new framework for aligning LLMs directly from language feedback (judgments), and demonstrates its effectiveness in offline and online alignment as well as in further optimizing both unaligned (cold-start) and already aligned (warm-start) models. The research indicates that judgments hold greater potential than rewards for aligning LLMs and merit further investigation.
arXiv
Summary
12-22 YAYI 2: Multilingual Open-Source Large Language Models
Institution: Beijing Wenge Technology Co. Ltd., Institute of Automation Chinese Academy of Sciences
The paper presents YAYI 2, a large language model optimized for multilingual scenarios, which significantly improves performance on various tasks, especially in Chinese-related tasks, by pre-training on a large corpus and aligning with human values through multiple approaches.
arXiv
Summary
12-22 Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning
Institution: Huawei Noah's Ark Lab, University College London, University of Oxford
The paper introduces the Pangu-Agent framework, which addresses the challenges faced by standard RL methods in multi-task environments. By integrating structured reasoning through intrinsic functions and enabling fine-tuning through supervised learning and RL, Pangu-Agent enhances the ability of agents to adapt across various environmental interactions.
arXiv
Summary
12-21 AppAgent: Multimodal Agents as Smartphone Users
Institution: Tencent
The study introduces an innovative multimodal agent framework allowing the agent to operate any smartphone application like a human user by learning new apps through autonomous exploration and observing human demonstrations. Findings demonstrate the framework's efficiency and adaptability in performing a variety of advanced tasks.
arXiv
Summary
12-21 The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction
Institution: MIT, Microsoft Research NYC
The paper introduces LASER, a layer-selective rank-reduction strategy that replaces the weight matrices of selected Transformer layers with low-rank approximations after training to enhance performance. The authors report that the strategy is not only effective but also, to their knowledge, the first demonstration that such carefully targeted reduction can improve Transformer model performance.
arXiv
Summary
12-21 De novo Drug Design using Reinforcement Learning with Multiple GPT Agents
Institution: Tsinghua University, Microsoft Research AI
The paper introduces a reinforcement learning algorithm with multiple GPT agents for drug molecular generation and demonstrates good performance and practicality in GuacaMol benchmark tests and in designing inhibitors for SARS-CoV-2 protein targets.
arXiv
Summary
GitHub
12-21 On Task Performance and Model Calibration with Supervised and Self-Ensembled In-Context Learning
Institution: Language Technology Lab University of Cambridge
This paper offers a comprehensive analysis of the performance and calibration of different learning methods in data-scarce scenarios. It indicates challenges in jointly achieving high performance and good calibration, but demonstrates that self-ensembling techniques can enhance model calibration without sacrificing performance, providing important guidelines for future LLMs applications.
arXiv
Summary
12-20 Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Institution: Ant Group
The paper presents the Lookahead inference acceleration framework, which uses a Trie-tree based multi-branch inferencing strategy to improve the inference speed of LLMs while maintaining the accuracy of generation. The framework's performance is validated through extensive experimentation and has been deployed in real-world scenarios at Alipay.
arXiv
Summary
12-20 Mini-GPTs: Efficient Large Language Models through Contextual Pruning
Institution: Massachusetts Institute of Technology
The paper demonstrates the process and results of developing Mini-GPTs, smaller yet efficient versions of GPT models, through contextual pruning. This method successfully reduced the size of LLMs across various domain-specific datasets while maintaining performance, proving that pruning techniques are not only theoretically viable but also practically valuable for developing resource-efficient, domain-specific LLMs.
arXiv
Summary
12-20 AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Institution: The University of Hong Kong, Shanghai Jiao Tong University, King’s College London
This paper presents a novel multi-agent-based solution for code generation, AgentCoder, which effectively solves the balance problem between code generation and testing through specific agents focused on code generation, test designing, and test execution, achieving code generation quality that outperforms existing SOTA methods.
arXiv
Summary
12-20 Lampr: Boosting the Effectiveness of Language-Generic Program Reduction via Large Language Models
Institution: University of Waterloo, The Hong Kong University of Science and Technology, Concordia University
Lampr is a pioneering algorithm that integrates LLMs into the program reduction process. Through a multi-level prompting method assisted by LLMs, it balances cross-language generality with language-specific semantic awareness, demonstrating superior performance in empirical evaluations.
arXiv
Summary
12-20 Time is Encoded in the Weights of Finetuned Language Models
The research introduces the concept of time vectors, showing how temporal variations can be encoded to some extent in language model weight space, and how weight interpolation can assist in tailoring models to new time periods.
arXiv
Summary
12-20 Generative Multimodal Models are In-Context Learners
Institution: Beijing Academy of Artificial Intelligence, Tsinghua University, Peking University
The paper successfully enhances the context learning capabilities of the multimodal generative model Emu2 by scaling up the model and achieves breakthrough results on a spectrum of multimodal understanding tasks, especially in visual question-answering and controllable visual generation after instruction tuning.
arXiv
Summary
12-19 A Revisit of Fake News Dataset with Augmented Fact-checking by ChatGPT
The paper presents ChatGPT-FC, the first publicly available benchmark dataset for fake news detection that combines human verification with ChatGPT assistance. A quantitative analysis compares human journalists and LLMs in fact-checking, highlighting the potential of LLMs to enhance the objectivity and reliability of news fact-checking processes.
arXiv
Summary
12-19 Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in ultra low-data regimes
Institution: University of Cambridge
This paper introduces CLLM, a novel methodology that combines the prior knowledge of Large Language Models with a robust data-centric approach to data augmentation, paving the way for the broader application of ML in data-deprived domains and regions.
arXiv
Summary
12-19 Active Preference Inference using Language Models and Probabilistic Reasoning
Institution: Cornell University, Cornell Tech
This study introduced a real-time algorithm that accelerates LLMs' inference of user preferences by generating informative questions, demonstrated to reduce user interaction and improve task performance in an online shopping scenario.
arXiv
Summary
12-19 Text-Conditioned Resampler For Long Form Video Understanding
Institution: University of Oxford, Google, Google DeepMind
This paper presents TCR, a novel architecture and pre-training method capable of processing long videos conditioned on textual prompts. It effectively bridges pre-trained visual encoders with LLMs, addressing the challenge of long-form video understanding and sets new best performance benchmarks across several evaluation tasks.
arXiv
Summary
12-18 Generalized Category Discovery with Large Language Models in the Loop
This paper presents an end-to-end active learning framework that incorporates Large Language Models into the training loop, significantly enhancing model performance on the task of generalized category discovery and autonomously generating category names.
arXiv
Summary
12-18 MAC-SQL: Multi-Agent Collaboration for Text-to-SQL
Institution: Beihang University, Tencent Cloud AI
Overall, the MAC-SQL framework addresses key challenges in the Text-to-SQL task through collaborating intelligent agents, tackling issues such as managing extensive databases, handling complex queries, and verifying and correcting SQL. The released open-source SQL-Llama model shows promising results and has the potential to perform comparably to proprietary models like GPT-4.
arXiv
Summary
GitHub
12-18 NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation
Institution: University of Waterloo, Huawei Noah’s Ark Lab, FEEC-Unicamp Brazil
This work introduces the NoMIRACL dataset, providing a multilingual tool for assessing robustness in LLMs during retrieval-augmented generation, and showcases challenges that LLMs face in differentiating between relevant and non-relevant retrieval results, highlighting the need for future research to improve LLM robustness.
arXiv
Summary
12-18 Agent-based Learning of Materials Datasets from Scientific Literature
Institution: University of Toronto
This paper showcases the capability of an LLM-based agent to autonomously learn and extract materials datasets from scientific literature. Eunomia demonstrated effective entity and relation extraction without any fine-tuning and can be enhanced to avoid errors in complex tasks.
arXiv
Summary
GitHub
12-18 "Paraphrasing The Original Text" Makes High Accuracy Long-Context QA
Institution: Tsinghua University
The paper presents a low-cost, effective approach to extending the capability of existing language models to handle long texts, significantly improving accuracy in long-context question answering by theoretical demonstration and experimental validation.
arXiv
Summary
12-18 Designing LLM Chains by Adapting Techniques from Crowdsourcing Workflows
Institution: University of Washington, Stanford University, Allen Institute for AI
The paper introduces a design space framework and three case studies adapting crowdsourcing workflows to LLM chains, providing practical guidance and theoretical insights for the future design and development of LLM chains.
arXiv
Summary
12-18 G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
Institution: Huawei Noah's Ark Lab, The University of Hong Kong, The Hong Kong University of Science and Technology
This paper overcomes the limitations of multimodal large language models in solving geometric problems by constructing the Geo170K dataset and developing the G-LLaVA model based on it, achieving better performance than existing state-of-the-art models.
arXiv
Summary
12-18 Social Learning: Towards Collaborative Learning with Large Language Models
Institution: Google, EPFL
The paper presents a novel framework for knowledge transfer in LLMs—social learning, and provides solutions for privacy protection. The framework allows for knowledge exchange between models using natural language while preventing the leakage of sensitive information, and it validates its effectiveness and privacy-preserving capabilities through experimentation.
arXiv
Summary
12-18 From Google Gemini to OpenAI Q-Star: A Survey of Reshaping the Generative Artificial Intelligence (AI) Research Landscape
Institution: Cyberstronomy Pty Ltd, Academies Australasia Polytechnic, Massey University
This review extensively analyzes the development of the generative AI field and its reshaping effects on the research landscape, with a special focus on MoE multimodality learning and AGI prospects. The study spans a comprehensive taxonomy from AI model structures and training techniques to application domains and ethical considerations.
arXiv
Summary
12-18 Towards Better Serialization of Tabular Data for Few-shot Classification with Large Language Models
Institution: Carnegie Mellon University
The paper showcases an innovative application of Large Language Models to tabular data classification, centered on a new LaTeX serialization framework with serialization methods effective for domain-specific datasets, and explores LLMs' capability to interpret complex data relationships. The LaTeX serialization method not only enhances LLM performance on classification tasks but also significantly improves memory and computational efficiency.
arXiv
Summary
12-18 Retrieval-Augmented Generation for Large Language Models: A Survey
Institution: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Fudan University
The paper offers a thorough and systematic overview of the RAG domain, emphasizing the importance of enhancing the retrieval and generative capabilities of LLMs, highlighting current challenges, and envisioning future research directions.
arXiv
Summary
12-17 Distinguishing Translations by Human, NMT, and ChatGPT: A Linguistic and Statistical Approach
Institution: Shanghai Jiao Tong University
The study offers tentative answers to whether ChatGPT can serve as a translation tool alternative to NMT and showcases its distinctive properties compared with NMT and HT. These insights may inform the development of more human-like, contextually appropriate translation systems and offer guidance on how to use AI-generated translations effectively.
arXiv
Summary
12-17 Mixed Distillation Helps Smaller Language Model Better Reasoning
Institution: Zhejiang University, Dalian Medical University
The Mixed Distillation framework significantly enhanced smaller models' advanced reasoning capabilities by integrating PoT and CoT abilities from LLMs, specifically showing improved performance in mathematical reasoning tasks.
arXiv
Summary
12-16 RIGHT: Retrieval-augmented Generation for Mainstream Hashtag Recommendation
Institution: CAS Key Lab of Network Data Science and Technology, ICT, CAS; University of Chinese Academy of Sciences, Beijing, China
The paper presents a new retrieval-augmented generative system for mainstream hashtag recommendation (RIGHT), combining the strengths of retrievers, selectors, and generators to overcome existing methods' limitations in processing new information and identifying mainstream tags, and demonstrates significant experimental results.
arXiv
Summary
12-16 A Survey on Robotic Manipulation of Deformable Objects: Recent Advances, Open Challenges and New Frontiers
Institution: Tongji University, National Natural Science Foundation of China, Shanghai Municipal Science and Technology Major Project
This survey compiles recent advances, challenges, and new frontiers in the field of robotic manipulation of deformable objects (DOM). It notably emphasizes the initial progress of Large Language Models (LLMs) in robotic manipulation and points out important directions for further research in this area. While the review covers a broad range of literature and identifies future research directions, actual deployment examples and quantitative evaluations are limited.
arXiv
Summary
12-16 ProTIP: Progressive Tool Retrieval Improves Planning
Institution: Apple
The paper presents ProTIP, an advanced strategy for tool retrieval and use in complex planning tasks for large language models. The key to ProTIP lies in its progressive retrieval, effective use of execution history, and achieving subtask-tool functionality alignment. Experimental results demonstrate that ProTIP significantly outperforms traditional methods, reduces tool hallucination, and increases planning efficiency.
arXiv
Summary
12-16 CoAScore: Chain-of-Aspects Prompting for NLG Evaluation
Institution: GSAI, Renmin University of China
CoAScore is an innovative evaluation metric that improves the accuracy of NLG task assessments through a "chain of aspects" method, an approach that has been experimentally validated.
arXiv
Summary
12-16 RecPrompt: A Prompt Tuning Framework for News Recommendation Using Large Language Models
Institution: Science Foundation Ireland (SFI), JSPS KAKENHI
This paper presents the RecPrompt framework, which optimizes news recommendation using LLMs. Through an iterative optimization process over manually written and LLM-generated prompt templates, news recommendation performance is significantly improved, particularly with LLM-generated prompt templates using GPT-4. However, this approach does not always outperform traditional recommendation methods and is significantly affected by the choice of LLM.
arXiv
Summary
12-15 ProCoT: Stimulating Critical Thinking and Writing of Students through Engagement with Large Language Models (LLMs)
Institution: Luleå University of Technology, Sweden
This paper introduces the ProCoT method, showing how LLMs can be harnessed to foster students' critical thinking and writing while preventing cheating. This method can help educators to make better use of these technological tools and cultivate students into better critical thinkers.
arXiv
Summary
12-15 Faithful Persona-based Conversational Dataset Generation with Large Language Models
Institution: University of Southern California, Google, Information Sciences Institute
The paper presents an LLM-based framework for generating, expanding, and updating large persona-based conversational datasets. By employing a Generator-Critic architecture and faithfulness criteria, the study successfully established the Synthetic-Persona-Chat dataset with enhanced dialogue quality.
arXiv
Summary
12-15 Challenges with unsupervised LLM knowledge discovery
Institution: Google DeepMind, Google Research
The paper challenges the capacity of existing unsupervised methods to elicit latent knowledge in LLMs through theoretical proofs and experimental validations, and provides sanity checks for evaluating future knowledge elicitation methods. Overall, the authors suspect that future unsupervised methods are likely to face similar issues, struggling to distinguish model knowledge from other features.
arXiv
Summary
12-15 Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision
Institution: OpenAI
arXiv
GitHub
Blog
12-15 ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent
Institution: Google
This paper outlines creating an LLM agent capable of reasoning and interacting with external knowledge, along with a self-improvement algorithm that enables smaller models to perform comparably to large models in compositional question-answering benchmarks. The proposed method not only improves reasoning capabilities but also significantly reduces the required parameter count of the models.
arXiv
Summary
12-15 The Art of Balancing: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment
Institution: NLP Group Fudan University, Hikvision Inc
The paper introduces a model called LoRAMoE to address the problem of world knowledge forgetting in language models due to massive increases in fine-tuning data and shows potential in multi-task learning.
arXiv
Summary
12-15 Generative Context-aware Fine-tuning of Self-supervised Speech Models
Institution: ASAPP, Carnegie Mellon University, Toyota Technological Institute at Chicago
The paper presents a new fine-tuning method for self-supervised speech models that leverages text generated by large language models as context to enhance task performance. It provides a way to reduce dependence on extra large models and resource usage during inference without compromising on performance.
arXiv
Summary
12-15 No-Skim: Towards Efficiency Robustness Evaluation on Skimming-based Language Models
Institution: Fudan University
This paper is the first to systematically study the vulnerability of skimming-based language models from the perspective of efficiency and proposes No-Skim, an effective and general efficiency robustness evaluation framework that generates adversarial inputs to increase computational complexity. Additionally, the framework is modularized to accommodate different plug-in modules, enabling evaluations to be conducted across three different knowledge levels.
arXiv
Summary
12-15 GSVA: Generalized Segmentation via Multimodal Large Language Models
Institution: Tsinghua University
The GSVA method proposed in the paper solves the challenges of multi-target and empty targets in GRES tasks by learning to predict multiple [SEG] tokens and innovatively generating [REJ] tokens, demonstrating significant advantages over existing technologies.
arXiv
Summary
12-15 KGLens: A Parameterized Knowledge Graph Solution to Assess What an LLM Does and Doesn't Know
Institution: Apple
The paper introduces KGLens, a new framework for assessing factual knowledge in LLMs. KGLens generates natural language questions using the KG structure for evaluations and is aided by a parameterized KG and a graph-guided QG strategy to improve the quality of natural question generation and the efficiency of the assessment process.
arXiv
Summary
12-14 Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention
Institution: Tencent AI Lab, Seattle
The Zebra model proposed in this paper effectively lowers computational and memory requirements by utilizing grouped local-global attention layers, exhibiting excellent performance in processing both long and short sequences. The research team validated the model through various experiments, proving the advantages of the Zebra architecture.
arXiv
Summary
12-14 Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
Institution: CUHK-SenseTime Joint Laboratory, Shanghai AI Laboratory, Tsinghua University
Auto MC-Reward is an advanced learning system that uses LLMs to automatically design dense rewards for Minecraft tasks. By leveraging LLMs' abilities to understand tasks and summarize experience, it effectively improves agents' learning of new behaviors and completion of long-term tasks in complex environments.
arXiv
Summary
12-14 The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
Institution: Tsinghua University, Stanford University, Nanyang Technological University
This paper is the first to thoroughly investigate the robustness of LLMs against factual misinformation in a persuasive conversation setting, revealing the susceptibility of LLMs to persuasive misinformation.
arXiv
Summary
12-14 Towards Verifiable Text Generation with Evolving Memory and Self-Reflection
Institution: Peking University, Chinese Academy of Sciences, Baidu Inc
VTG improves the reliability and verifiability of text generated by LLMs through an evolving memory and self-reflection approach, effectively addressing challenges of complex attention shifting and document retrieval. It has been validated through experiments.
arXiv
Summary
12-14 TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning
Institution: National University of Singapore, University of Illinois Urbana-Champaign, Microsoft
The TAP4LLM framework proposed in this paper significantly enhances the performance of Large Language Models in tabular reasoning tasks. It operates by sampling, augmenting, and packing semi-structured data and can also serve as a plugin to further enhance LLMs' understanding of structured data.
arXiv
Summary
12-14 Entity-Augmented Code Generation
Institution: JetBrains
The paper proposes an innovative architecture for code generation with external entities. The architecture scales without sacrificing performance; by integrating the entity retriever into the decoder rather than the encoder, the model can inspect all entities at once and use them directly. The new architecture not only resolves the limitations of existing models but also demonstrates its superiority across several experimental scenarios.
arXiv
Summary
12-14 Math-Shepherd: A Label-Free Step-by-Step Verifier for LLMs in Mathematical Reasoning
Institution: Peking University, DeepSeek-AI, The University of Hong Kong
MATH-SHEPHERD successfully addresses the issue of costly human annotations by training LLMs with automatically generated supervision data, thereby enhancing the accuracy of LLMs in solving complex mathematical problems and opening up new avenues for the advancement and practical application of LLMs.
arXiv
Summary
12-14 Modeling Complex Mathematical Reasoning via Large Language Model based MathAgent
Institution: Shanghai Jiao Tong University
The paper suggests enhancing LLMs' ability to solve complex mathematical problems through the MathAgent framework, namely Planner-Reasoner-Executor-Reflector (PRER). By breaking down the problems into phases and simulating human-like problem-solving processes, MathAgents significantly improve solving capabilities on challenging mathematical datasets, particularly in areas demanding higher estimation and synthesis skills.
arXiv
Summary
12-14 Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Institution: MIT
The paper provides insights into how the Llama-2-chat model handles competing objectives through the study of its behavior in the 'forbidden fact' task, introducing novel analytical methods in the process.
arXiv
Summary
12-14 Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning
Institution: Hong Kong University of Science and Technology, Microsoft Research
This paper presents CoT-Max, a method that enhances LLMs' mathematical reasoning capabilities using a coarse-to-fine pruning technique, effectively improving the effects of few-shot learning in math reasoning tasks.
arXiv
Summary
12-14 Self-Evaluation Improves Selective Generation in Large Language Models
Institution: Google DeepMind, Google Research
The paper presents a new method where LLMs are guided to self-evaluate in order to improve the calibration of the quality of their generative output in selective generation scenarios. Experiments show that this method enhances the accuracy and overall quality of the generated content by LLMs.
arXiv
Summary
12-14 Weight subcloning: direct initialization of transformers using larger pretrained ones
Institution: Apple
The paper introduces a powerful weight subcloning approach to initialize smaller transformer models using weights from larger pretrained ones, greatly accelerating training speed, and enabling efficient training of the new models even with limited computational resources.
arXiv
Summary
12-14 StemGen: A music generation model that listens
Institution: SAMI, ByteDance Inc.
The paper presents a new non-autoregressive language model approach for music generation, which optimizes the processing of multiple channels and the consistency between music and contextual information, and demonstrates, through objective and subjective assessments, the quality of the music generated and its alignment with contextual information.
arXiv
Summary
12-14 CogAgent: A Visual Language Model for GUI Agents
Institution: Tsinghua University, Zhipu AI
CogAgent breaks the limitation of pure text-based approaches by efficiently tackling the challenge of understanding and navigating GUIs with combined high and low-resolution image encoders and visual language models. The model achieves leading performance on nine visual question-answering benchmarks, propelling the future research and application of AI agents powered by advanced VLMs.
arXiv
Summary
GitHub
12-14 TinyGSM: achieving >80% on GSM8k with small language models
Institution: Carnegie Mellon University, Microsoft Research
This paper has successfully demonstrated that small language models can exceed an 80% accuracy rate on the GSM8K math problem reasoning benchmark by creating a synthetic dataset of math problems with corresponding Python solutions (TinyGSM), showing the feasibility of significant performance improvement of small models through high-quality datasets and verifier strategies.
arXiv
Summary
12-13 Efficient Toxic Content Detection by Bootstrapping and Distilling Large Language Models
Institution: University of Southern California, Amazon.com Inc.
The paper presents BD-LLM, a new method to enhance the efficiency and transferability of LLMs in toxic content detection tasks, proposing the DToT method and optimizing model compression for more effective production deployment.
arXiv
Summary
12-13 Knowledge-Aware Artifact Image Synthesis with LLM-Enhanced Prompting and Multi-Source Supervision
Institution: Peking University
This paper proposes a knowledge-aware method for synthesizing images of ancient artifacts with LLM-enhanced prompting and multi-source supervision, overcoming the lack of domain knowledge in existing text-to-image synthesis methods and showing significant improvement in quality and historical knowledge alignment.
arXiv
Summary
GitHub
12-13 E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification
Institution: UC Riverside, Microsoft Research
This paper demonstrates the potential of LLMs in conducting pseudo-code static analysis and self-verification through the E&V method. The approach not only improves the flexibility and precision of static analysis but also reduces the human effort and specialized knowledge required to develop static analysis tools.
arXiv
Summary
12-13 LDM$^2$: A Large Decision Model Imitating Human Cognition with Dynamic Memory Enhancement
Institution: University of Chinese Academy of Sciences
The paper presents the LDM2 model, which incorporates a dynamic memory mechanism and tree exploration approach to augment the decision-making capabilities of LLMs to adapt to more complex and unknown environments, and to realize dynamic learning abilities.
arXiv
Summary
12-13 SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Institution: The Swiss AI Lab IDSIA, USI & SUPSI; AI Initiative, KAUST; Center for Brain Science, Harvard University
SwitchHead is a novel approach that optimizes resource usage in the multi-head self-attention structure, resulting in reduced resource consumption while maintaining model performance. The method has practical application potential, especially for researchers and institutions with limited resources.
arXiv
Summary
12-12 LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Institution: Apple
This research provides a novel and practical solution that effectively reduces the data load and significantly speeds up inference when running large language models on memory-constrained devices, holding substantial significance for practical applications.
arXiv
Summary
12-12 VILA: On Pre-training for Visual Language Models
Institution: NVIDIA, MIT
VILA employs an improved pre-training strategy, outperforming benchmarks in various vision-language tasks, and offers practical guidance for the design of future visual language models.
arXiv
Summary
12-12 Tell, don't show: Declarative facts influence how LLMs generalize
Institution: Apollo Research, University of Oxford
The paper investigates how models generalize when declarative statements in training data conflict with statistical patterns or procedural examples. The findings have important implications for AI safety (regarding the “treacherous turn”) and fairness.
arXiv
Summary
12-12 Alignment for Honesty
Institution: Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Fudan University
The paper introduces the concept of alignment for honesty in LLMs and presents challenges and proposed solutions. By formally defining the problem, suggesting new methods, and establishing an evaluation framework, the paper provides a comprehensive solution to alignment for honesty in large language models.
arXiv
Summary
GitHub
12-12 Comparable Demonstrations are Important in In-Context Learning: A Novel Perspective on Demonstration Selection
Institution: Shanghai Jiao Tong University
The study explores ICL from the perspective of the inter-demonstration relationship, proposing the minimally edited text construction of Comparable Demonstrations (CDs) to alleviate potential demonstration bias. The experiments confirm the performance gains of CDs in OOD scenarios, emphasizing their particular necessity in simpler tasks and demonstrating their robustness with respect to the number of examples.
arXiv
Summary
12-12 diff History for Long-Context Language Agents
Institution: New York University
The paper presents and validates the use of diff history to enhance model processing capabilities of long interaction histories. This method significantly improves model performance in complex decision tasks and effectively extends the length of history models can handle, providing new insights for the design of long-time series decision-making agents.
arXiv
Summary
12-12 LLMEval: A Preliminary Study on How to Evaluate Large Language Models
Institution: Fudan University, Shanghai Jiao Tong University
The paper focuses on how to evaluate Large Language Models (LLMs), comparing various evaluation criteria, types of evaluators, scoring methods, and ranking systems. It introduces a new evaluation dataset, LLMEval, and assesses 20 LLMs, generating a massive amount of manual and automatic evaluation results. The study provides valuable insights and conclusions for the future evaluation of LLMs.
arXiv
Summary
12-12 Efficient Few-Shot Clinical Task Adaptation with Large Language Models
The paper contributes to few-shot medical image classification with an efficient fine-tuning approach that partially freezes layers and incorporates large language models to contextualize labels for effective semantic guidance. The approach demonstrated exceptional performance in a challenge, indicating its effectiveness at adapting natural-image models to medical imaging tasks in few-shot scenarios.
arXiv
Summary
12-11 "What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces
Institution: Carnegie Mellon University
The paper explores the capabilities and challenges of LLMs in retrieving information from web interfaces, unveiling key factors affecting model performance and their limitations, setting a direction for future work.
arXiv
Summary
12-11 Unlocking Anticipatory Text Generation: A Constrained Approach for Faithful Decoding with Large Language Models
Institution: Salesforce AI Research
This work introduces a novel approach to improve the decoding methods for large language models by incorporating future constraint satisfaction. The proposed formal approach and scoring mechanism, benchmarked against LLMs, significantly contribute to the improved quality and control of text generation.
arXiv
Summary
12-11 MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples
Institution: Xiamen University, Tencent YouTu Lab
The work introduces a new paradigm with MMICT to showcase the use of in-context learning capabilities to enhance fine-tuning performance on large multi-modal language models. By designing the versatile M-Hub module and conducting various context demonstration experiments, the study reveals the potential of in-context learning to improve performance on multi-modal tasks.
arXiv
Summary
12-11 Federated Full-Parameter Tuning of Billion-Sized Language Models with Communication Cost under 18 Kilobytes
Institution: Zhejiang University, Alibaba Group
The paper presents a novel approach, FedKSeed, for federated full-parameter tuning using ZOO with a fixed set of seeds, substantially reducing the communication overhead required for tuning billion-sized LLMs, while achieving higher model accuracy and computational efficiency.
arXiv
Summary
12-11 Oracle-based Protocol Testing with Eywa
Institution: Microsoft Research
The paper introduces an oracle-based testing method that fully leverages LLMs to build rich protocol behavior models, enhancing the auto-generation and coverage of network protocol test cases by combining symbolic execution with traditional test generation methods.
arXiv
Summary
12-11 On Meta-Prompting
Institution: Microsoft
This paper presents a theoretical framework based on category theory to generalize and depict automated prompting methods. Through experiments in the fields of ideation and creativity, it demonstrates that meta-prompting generates outputs that are more favorable to users compared to traditional fixed prompts.
arXiv
Summary
12-11 Honeybee: Locality-enhanced Projector for Multimodal LLM
Institution: Kakao Brain
The paper introduces a new locality-enhanced projector design that addresses deficiencies of existing methods in handling visual feature locality and makes effective use of multifaceted instruction datasets. As a result, the Honeybee model achieves significant performance improvements across multiple MLLM benchmarks.
arXiv
Summary
GitHub
12-11 Dense X Retrieval: What Retrieval Granularity Should We Use?
Institution: University of Washington, Tencent AI Lab
This paper introduces propositions as a new retrieval unit for dense retrieval, which improves the performance of downstream QA tasks and cross-task generalization capabilities while reducing irrelevant information in the retrieved texts.
arXiv
Summary
12-11 Extracting Self-Consistent Causal Insights from Users Feedback with LLMs and In-context Learning
Institution: Microsoft, Microsoft Research
The research presents a novel framework utilizing LLMs and ICL to extract self-consistent causal insights from user feedback to support analysis in Microsoft's Feedback Hub. The framework employs self-consistency and prompt-ensemble techniques to mitigate hallucinations and incorrect reasoning in LLMs, and introduces two heuristic methods to assess the richness of feedback information. Experiments demonstrate the method's efficacy in extracting causal insights and new bugs, and in helping Microsoft engineers prioritize information-rich feedback.
arXiv
Summary
12-10 Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
Institution: Microsoft Israel
The study's core contribution lies in its comparison of fine-tuning and RAG methodologies for knowledge injection into LLMs, finding that RAG demonstrates superior performance in injecting both new and existing knowledge. The research used innovative datasets and assessment methods to ensure the practicality and viability of the theoretical findings.
arXiv
Summary
12-09 Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
Institution: Northeastern University, Oracle
This paper introduces Agile-Quant, an activation-guided quantization framework that accelerates the inference of large language models on edge devices. Agile-Quant overcomes challenges associated with activation outliers and edge hardware implementation, achieving task performance comparable to weight-only quantization methods while significantly increasing inference speed on actual devices.
arXiv
Summary
12-09 Context Tuning for Retrieval Augmented Generation
Institution: Apple
The paper presents context tuning as a novel component that enhances RAG-based planning, enabling it to effectively handle incomplete or under-specified queries and reduce hallucinations. It systematically compares various retrieval methods in lightweight models and LLMs, showcasing the effectiveness of context tuning in improving contextual understanding.
arXiv
Summary
12-09 Sim-GPT: Text Similarity via GPT Annotated Data
Institution: Shannon.AI, Zhejiang University, Bytedance
Sim-GPT is a framework that uses data labeled by GPT-4 to train STS models effectively. It incurs only a one-time cost for data generation, trains faster, and the resulting model outperforms baselines on multiple STS benchmarks.
arXiv
Summary
GitHub
12-09 NLLG Quarterly arXiv Report 09/23: What are the most influential current AI Papers?
Institution: University of Mannheim, University of Bielefeld
The paper provides an analysis of the most current trends and influence in AI research by examining the most cited papers on arXiv over a set period, particularly highlighting the significance of LLMs in this context.
arXiv
Summary
12-09 Can Large Language Models Serve as Rational Players in Game Theory? A Systematic Analysis
Institution: Shanghai Jiao Tong University
This research systematically explores the capability boundaries of LLMs within the context of game theory and provides insights for integrating LLMs into social science research from three distinct perspectives.
arXiv
Summary
12-08 Using Program Knowledge Graph to Uncover Software Vulnerabilities
The paper introduces a Program Knowledge Graph that combines program graphs with security data, and leverages prompt tuning of large language models to auto-generate queries for detecting vulnerabilities in software code. The method aims to overcome the limitations of traditional vulnerability detection, improving the automation and effectiveness of detection, especially in static analysis applications.
arXiv
Summary
12-08 PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
Institution: RAND Corporation, Carnegie Mellon University, LangChain
The paper presents PaperQA, a retrieval-augmented generative agent for scientific research capable of answering questions based on up-to-date scientific literature with a performance comparable to human experts, and in some aspects even superior. The effectiveness of PaperQA is demonstrated, and its superiority is affirmed through comparative results with human experts and other commercial tools.
arXiv
Summary
12-07 Beyond Surface: Probing LLaMA Across Scales and Layers
Institution: Hong Kong University of Science and Technology
The core contribution of the study lies in proposing a series of probing tasks to evaluate the higher-order capabilities of large language models, focusing on computation, mathematical reasoning, logical reasoning, and truthfulness detection. It reveals how the performance of LLMs varies with changes in model scale and structural layers.
arXiv
Summary
12-07 CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models
Institution: MPI for Intelligent Systems, University of Washington
This research introduces the CLADDER dataset and CAUSALCOT chain-of-thought prompting strategy to test and analyze the abilities of large language models (LLMs) in formal causal reasoning, highlighting limitations of LLMs and suggesting future research directions.
arXiv
Summary
GitHub
12-07 A Study on the Calibration of In-context Learning
Institution: Harvard University
The paper conducts an in-depth study of the calibration accuracy in language models (LMs) for in-context learning (ICL) and presents methods for evaluation and analysis. It reveals the relationship of calibration errors with model size and the changes during finetuning, as well as the reduction in calibration during the generation of reasoning tasks.
arXiv
Summary
12-07 Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration
Institution: Renmin University of China, Beijing Institute of Technology, HKUST (GZ)
This paper delivers a comprehensive study aimed at exploring a cost-effective batch prompting approach to entity resolution. The main contributions include the introduction of the BATCHER framework and the proposal of a covering-based demonstration selection strategy.
arXiv
Summary
12-07 An LLM Compiler for Parallel Function Calling
Institution: UC Berkeley, ICSI, LBNL
The paper introduces a system named LLMCompiler that addresses high latency costs and inefficiencies in executing multi-function calls by LLMs. It enhances speed, reduces costs, and improves accuracy through parallelized function calling and optimized orchestration.
arXiv
Summary
12-07 Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
Institution: Google DeepMind, Stanford University, University of California Berkeley
Chain of Code (CoC) adds a new dimension to language models by improving reasoning capabilities through code writing and code execution emulation. It achieves breakthrough performance in both numerical and semantic reasoning tasks, expands the application scope of LLMs, and has the potential to be applied to a broader range of problems.
arXiv
Summary
12-07 Generating Illustrated Instructions
Institution: GenAI Meta, Columbia University
The paper presents a novel approach called StackedDiffusion for the task of generating illustrated instructions, a task that combines text and images to describe how to achieve a goal. This method overcomes the limitations of current T2I models that fail to generate visuals from user queries directly and surpasses existing methods in human evaluations.
arXiv
Summary
12-07 Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use
Institution: Gaoling School of Artificial Intelligence, Renmin University of China, Alibaba Group
The paper presents the Attention Buckets method to address deficiencies in the context awareness of LLMs during tool use, significantly enhancing their performance on such tasks by combining attention computed with different RoPE angle bases.
arXiv
Summary
12-06 Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Institution: Zhejiang Lab
The paper successfully introduces Holmes, a framework that facilitates training LLMs in heterogeneous NIC environments. Empirical studies confirm that Holmes can achieve performance levels in these environments comparable to those possible with homogeneous RDMA NICs. This significant advancement makes LLM training more accessible and expands the potential for efficient scaling within the broader research community.
arXiv
Summary
12-06 AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
Institution: Peking University, Tencent AI Lab, HKUST
AnimateZero provides decoupled and precise control of appearance and motion for T2V generation, realizing step-by-step video generation from T2I to I2V, while maintaining good domain consistency through spatial appearance control and temporal consistency control.
arXiv
Summary
12-06 Controllable Human-Object Interaction Synthesis
Institution: Stanford University, FAIR Meta
The paper proposes a novel interaction synthesis method, CHOIS, which is capable of generating synchronized human and object motions under the guidance of language descriptions, adhering to the geometric constraints of 3D scenes. Integrated into a system, it demonstrates its efficacy in synthesizing continuous, realistic, and context-aware human-object interactions.
arXiv
Summary
12-06 Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia
Institution: Google DeepMind, Google Research
This paper proposes a method for enhancing agent-based models with generative large language models, using the Concordia library to simulate interactions of agents in social, physical, and digital spaces. The model aims to provide life-like social simulations and explore the effectiveness of model validation.
arXiv
Summary
12-06 Efficient Large Language Models: A Survey
Institution: The Ohio State University, Google Research, Amazon AWS AI
The paper is a survey of the recent advancements in large language models concerning sparse activation methods, especially the Mixture-of-Experts system (MoE) and its application in long-context processing. It synthesizes various optimization methods for MoE models, including algorithmic improvements and system-level acceleration frameworks.
arXiv
Summary
GitHub
12-06 OneLLM: One Framework to Align All Modalities with Language
Institution: MMLab The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory
OneLLM showcases strong multimodal understanding and processing capabilities through its unified multimodal encoding framework and progressive alignment pipeline, addressing the challenge of expanding multimodal LLMs in the area of reasoning and utilization.
arXiv
Summary
GitHub
12-05 A Comparative Study of AI-Generated (GPT-4) and Human-crafted MCQs in Programming Education
Institution: Carnegie Mellon University
The main contribution of this paper is an automated MCQ generation system based on GPT-4 which, through a flexible architecture and a precise LO alignment mechanism, generates MCQs consistent with the LOs of higher-education Python courses. The findings show that the automatically generated MCQs align well with the LOs and approach the quality of human-crafted MCQs, but fall short in ensuring a single correct answer and high-quality distractors, suggesting future work should focus on these issues.
arXiv
Summary
12-05 Inherent limitations of LLMs regarding spatial information
Institution: ProtagoLabs, International Monetary Fund, NetMind.ai
The paper provides a new evaluation framework and specially designed dataset for the capabilities of large language models like GPT-4 in handling spatial information, and analyzes the abilities and limitations of GPT-4 in dealing with spatial information.
arXiv
Summary
12-05 Beyond Isolation: Multi-Agent Synergy for Improving Knowledge Graph Construction
Institution: Zhejiang Lab, Ant Group
By introducing a multi-agent cooperation approach within KGC, the CooperKGC framework improves the precision with which agents solve tasks involving entity, relation, and event extraction, and potentially lays the foundation for a future of collaboration-aware AI.
arXiv
Summary
12-05 A Hardware Evaluation Framework for Large Language Model Inference
Institution: Princeton University
LLMCompass, as a hardware evaluation framework, effectively addresses the challenges in designing hardware for LLM inference. It is not only fast and accurate but also architecturally descriptive and cost-aware, and it has been validated on commercial hardware with exceptional performance.
arXiv
Summary
12-05 Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models
Institution: University of Waterloo, Cohere, Comcast Applied AI
The paper's key achievement is demonstrating how to construct an effective listwise reranker without dependence on GPT models, significantly surpassing existing GPT-based rerankers, and calling for the development of higher-quality listwise training datasets to enhance model performance.
arXiv
Summary
12-05 RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!
Institution: University of Waterloo
RankZephyr is a new type of open-source LLM specifically optimized for zero-shot list reranking tasks. It offers reranking effects comparable or superior to those of large proprietary models, while emphasizing the importance of data augmentation for enhanced model robustness, and has proven its effectiveness and application potential in real-world scenarios.
arXiv
Summary
GitHub
12-05 Large Knowledge Model: Perspectives and Challenges
Institution: Zhejiang University
The paper proposes the concept of a Large Knowledge Model (LKM), aimed at more effectively managing and interpreting the diversity of knowledge representation. The study outlines the challenges in transitioning from current LLMs to LKMs, underlines the importance of structured knowledge in pre-training, and introduces a set of design principles for LKMs.
arXiv
Summary
12-05 Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
Institution: Sea AI Lab, Sun Yat-sen University, Harvard University
The paper introduced a Creative Leap-of-Thought (CLoT) paradigm for enhancing the creative thinking abilities of large language models, demonstrating its effectiveness and generalizability across various tasks.
arXiv
Summary
GitHub
12-05 How should the advent of large language models affect the practice of science?
Institution: Max Planck Institute for Biological Cybernetics, University of Tübingen, University of Washington
The paper discusses the implications of LLMs on scientific practices and recommends a cautious approach to their usage, emphasizing the importance of protecting the normative and epistemic aspects of science. Although LLMs may improve the efficiency of certain research tasks, they should be used judiciously as tools that abide by scientific norms and standards.
arXiv
Summary
12-05 Prompt Optimization via Adversarial In-Context Learning
Institution: National University of Singapore, Hong Kong University of Science and Technology, Institute for Infocomm Research (I2R) A*STAR
The paper introduces a novel Adversarial In-Context Learning (adv-ICL) method for optimizing prompt selection to enhance the performance of large models. By pursuing an adversarial training objective through prompt optimization rather than updating model parameters, it overcomes data and computational resource constraints, with experimental results significantly outperforming existing techniques across multiple tasks.
arXiv
Summary
12-04 Competition-Level Problems are Effective LLM Evaluators
Institution: Microsoft Research Asia, Xiamen University, Microsoft Azure AI
The study has revealed inadequacies in LLMs like GPT-4 when assessing their real-world reasoning capabilities using competition-level programming questions, suggested methods for improvement, and highlighted the significance of such problems as efficient evaluators of LLMs, thus fostering further research into enhancing complex reasoning abilities in LLMs.
arXiv
Summary
12-04 A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly
Institution: Elsevier
This paper summarizes the applications and associated challenges of Large Language Models (LLMs) in security and privacy, highlighting the good, the bad, and the ugly aspects while emphasizing the potential for data protection in these domains.
arXiv
Summary
12-04 Retrieval-augmented Multi-modal Chain-of-Thoughts Reasoning for Large Language Models
Institution: Xiamen University, MBZUAI, Tencent AI Lab
The paper successfully elevated the CoT reasoning capabilities of LLMs in multi-modal tasks by introducing a dynamic automatic retrieval mechanism and stratified sampling method. The proposed approach not only improved model performance but also refined the reasoning process through diverse example selection, setting a new performance benchmark in the field of multi-modal reasoning.
arXiv
Summary
12-04 Data Management For Large Language Models: A Survey
Institution: Peking University, Huawei Noah’s Ark Lab
This survey studies the current state of research in data management at both the pretraining and supervised fine-tuning stages of LLMs and the design of data management strategies.
arXiv
Summary
GitHub
12-04 ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions
Institution: Nanyang Technological University, National University of Singapore
The study presents the first comprehensive evaluation of the potential of leveraging ChatGPT in the generation of pre-university math questions. It explores question generation in two main scenarios: with and without given context and aims to provide practical insights for educators. The findings from this study may promote the usage of modern AI technologies in education, enhancing the practicability and efficiency of automated math question generation.
arXiv
Summary
12-04 On the Effectiveness of Large Language Models in Domain-Specific Code Generation
Institution: Shanghai Jiao Tong University, Chongqing University, East China Normal University
The study demonstrates that LLMs' capabilities in domain-specific code generation can be significantly enhanced by effectively integrating domain knowledge into the code generation process. The DomCoder approach exemplifies the incorporation of different strategies to blend domain knowledge and boost the actual effectiveness of code generation within certain contexts.
arXiv
Summary
12-04 The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
Institution: Allen Institute for Artificial Intelligence, University of Washington
The paper introduces a simple, tuning-free method (URIAL) for aligning LLMs through in-context learning, which demonstrates performance on par with or superior to traditional tuning alignment methods. The findings significantly contribute to future LLM research, highlighting the importance of deeper analysis and theoretical understanding in LLM alignment.
arXiv
Summary
12-04 Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication
Institution: Fudan University, National University of Singapore, Shanghai AI Laboratory
The Exchange-of-Thought (EoT) framework introduced in this paper enhances the reasoning capabilities of LLMs through cross-model communication, leveraging four communication paradigms and a confidence evaluation mechanism, yielding significant improvements in various reasoning tasks and proving the role of external insights in enhancing model performance.
arXiv
Summary
12-04 LLMs Accelerate Annotation for Medical Information Extraction
Institution: Google Research
The paper presents a method that uses large language models, specifically Google's PaLM 2, to enhance the speed of annotation in medical information extraction tasks. The LLM-based annotation workflow increases efficiency without complex model parameter adjustment, making it a promising tool for accelerating data annotation work in the medical field.
arXiv
Summary
12-03 D-Bot: Database Diagnosis System using Large Language Models
Institution: Tsinghua University, Pigsty, ModelBest
D-Bot is a database diagnosis system based on large language models designed to improve the efficiency and accuracy of database diagnosis by extracting knowledge from documents and generating effective diagnosis reports, addressing challenges faced by domain experts in the field of database diagnosis.
arXiv
Summary
12-03 TextGenSHAP: Scalable Post-hoc Explanations in Text Generation with Long Documents
Institution: University of Southern California, Google Cloud AI
The paper introduces TextGenSHAP, an efficient post-hoc explanation method designed for large language models. The method improves the speed of explanation generation and demonstrates how to leverage these explanations to enhance long-document question answering and document retrieval systems.
arXiv
Summary
12-03 Running cognitive evaluations on large language models: The do's and the don'ts
Institution: Massachusetts Institute of Technology
This paper provides instructive recommendations on the methodological approach for conducting cognitive assessments of large language models, exploring how to avoid potential issues during the evaluation process. The goal of the paper is to contribute to the broader discussion of best practices in the field of AI Psychology.
arXiv
Summary
12-02 Axiomatic Preference Modeling for Longform Question Answering
The axiomatic framework proposed in this paper offers a new method for preference modeling in long-form question-answering, closely examining human preferences and optimizing the accuracy and efficiency of preference scoring.
arXiv
Summary
12-02 Large Language Models Are Zero-Shot Text Classifiers
Institution: Florida Atlantic University
The paper demonstrates that LLMs are effective zero-shot text classifiers, which is particularly beneficial for small teams or businesses that need to deploy text classifiers quickly. The results show that GPT-4 consistently surpasses traditional ML algorithms across all four datasets. The article also suggests future research directions, including optimizing prompts for higher accuracy and introducing a critic agent to evaluate and improve the outputs of LLMs.
arXiv
Summary
12-02 Exploring and Improving the Spatial Reasoning Abilities of Large Language Models
Institution: Stanford University
The paper advances our understanding of LLMs' capabilities in spatial reasoning and sequence labeling, proposing a method to improve LLMs' performance in 3D trajectory recognition tasks with significant performance improvements.
arXiv
Summary
12-02 Just-in-Time Security Patch Detection -- LLM At the Rescue for Data Augmentation
Institution: University of Luxembourg, Windows Copilot Microsoft, Singapore Management University
The paper presents an innovative security patch detection framework, LLMDA, utilizing Large Language Models for patch analysis and data augmentation, and aligning multimodal inputs. This enables the system to extract more extensive information from the combined context of patches and code, thereby enhancing detection accuracy.
arXiv
Summary
12-01 Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games
Institution: Quebec AI Institute
The paper provides a new evaluation method adapted to the complexity and new challenges of JuBensha games and establishes a new framework, ThinkThrice, for assessing the capabilities of LLM agents in an interactive gaming environment, advancing AI applications in multiplayer role-playing games.
arXiv
Summary
12-01 Nash Learning from Human Feedback
Institution: Google DeepMind
The paper introduces a novel method to fine-tune LLMs for alignment with human preferences through Nash equilibrium, demonstrating its potential in complex tasks and verifying its effectiveness through empirical evidence.
arXiv
Summary
12-01 Leveraging Large Language Models to Improve REST API Testing
Institution: Georgia Institute of Technology, IBM Research
RESTGPT addresses the limitations of existing methods in extracting rules from natural language descriptions and generating effective values by leveraging the accuracy and efficiency of LLMs, especially GPT-3.5 Turbo, significantly enhancing the quality and accuracy of REST API testing.
arXiv
Summary
12-01 The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models
Institution: University of Wisconsin - Madison
This research represents one of the first large-scale investigations into the impact of compression techniques on the parametric knowledge of LLMs, offering significant insights for practitioners, especially regarding decisions related to pruning and quantization techniques.
arXiv
Summary
12-01 On Exploring the Reasoning Capability of Large Language Models with Knowledge Graphs
Institution: Singapore Management University, National Sun Yat-sen University
The study demonstrates the capability of LLMs to successfully work through knowledge graph reasoning tasks using their internal knowledge graph and to infer knowledge graph relations from context, showcasing the potential and applicative value of LLMs in knowledge graph reasoning.
arXiv
Summary
12-01 Instruction-tuning Aligns LLMs to the Human Brain
Institution: EPFL
The study shows that large language models trained through instruction-tuning exhibit better representation of world knowledge and alignment with human brain activity. This provides a crucial perspective for the future development of LLMs to incorporate world knowledge into the models.
arXiv
Summary
12-01 RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
RLHF-V presents a novel strategy to rectify MLLM behavior via fine-grained, correctional human feedback. It collects quality data to align MLLM learning with human preferences, effectively improving the models' reliability and practicality in various tasks. This study represents a significant advancement in enhancing the robustness of large multimodal language models.
arXiv
Summary
GitHub
12-01 Learning from One Continuous Video Stream
The paper presents a framework for online learning from a single continuous video stream focused on evaluating adaptability and generalizability, proposing a sequence of future prediction tasks for pre-training. The study demonstrates that optimization strategies in such learning environments need to be adjusted, with reductions in momentum and frequency of weight updates leading to improved adaptability and generalization of models.
arXiv
Summary
12-01 Improve Supervised Representation Learning with Masked Image Modeling
Institution: Google Research, OpenAI
This paper proposed a new training setup that integrates supervised representation learning with MIM, significantly enhancing the quality of representation learning for downstream tasks such as classification, image retrieval, and semantic segmentation without introducing significant training or inference overhead.
arXiv
Summary
12-01 Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses
Institution: Google
The paper introduced ExploreLLM, a new interaction pattern between users and LLM-powered assistants by combining a prompt-based task decomposition method with a novel schema-like GUI. The system aims to reduce the cognitive burden of completing complex tasks and to enhance the level of personalized responses.
arXiv
Summary

2023-11

 Date   Paper Links & Summary
11-30 TaskBench: Benchmarking Large Language Models for Task Automation
Institution: Zhejiang University
This paper presented TaskBench, a new benchmark test, and TASKEVAL, an evaluation system, which together effectively address the assessment challenges of LLMs in task automation through data generation and quantitative evaluation.
arXiv
Summary
11-30 MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Institution: University of Science and Technology of China, Microsoft Research Asia
MicroCinema, with its innovative two-phase process for text-to-video generation and effective Appearance Injection Network and Appearance Noise Prior mechanisms, has achieved a breakthrough in video generation quality, serving as a reference model for subsequent work.
arXiv
Summary
11-30 IAG: Induction-Augmented Generation Framework for Answering Reasoning Questions
Institution: Huawei Poisson Lab
The IAG framework combines an inductive prompting method that strengthens the factuality of knowledge statements with an optimized knowledge fusion mechanism and a student inductor model, and the findings indicate that it outperforms existing retrieval-based methods on QA tasks that involve implicit reasoning.
arXiv
Summary
11-30 Autonomous Agents in Software Development: A Vision Paper
Institution: Tampere University
This paper proposes a vision of using multiple GPT agents for automating SE tasks and showcases preliminary success in simple software tasks. This work has the potential to fundamentally change the way software development is conducted and to shorten development time.
arXiv
Summary
11-30 CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation
Institution: UC Berkeley, Microsoft Azure AI, ZOOM
CoDi-2 is an advanced multimodal generation model capable of processing complex multimodal inputs, guiding generation in-context, interacting with users through multi-round conversations, and achieving outstanding zero-shot and few-shot performance.
arXiv
Summary
11-30 What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations
Institution: Comcast Applied AI, University of Waterloo
The authors proposed a novel probe to detect implicit association biases in LLMs representations and demonstrated state-of-the-art performance in preference detection. The research additionally uncovered significant biases in multiple instruction-following and "classic" LLMs related to nationality, politics, religion, and gender, despite the explicit safety calibration of the LLMs.
arXiv
Summary
GitHub
11-30 Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text
Institution: The University of Tokyo
The research showcased GPT-4's robust capabilities in managing scrambled texts, set forth new metrics RR and RPG, and validated GPT-4's stable performance across various scramble scenarios and rates.
arXiv
Summary
11-30 Applying Large Language Models and Chain-of-Thought for Automatic Scoring
Institution: University of Georgia
The study showcases the potential of LLMs in facilitating automatic scoring, highlighting that CoT significantly enhances scoring accuracy when used with item stems and scoring rubrics. The combined approach of LLMs with CoT can reduce complexity and manpower cost in building automatic scoring models and potentially offer a closer alignment with human scoring results.
arXiv
Summary
11-30 PoseGPT: Chatting about 3D Human Pose
Institution: Max Planck Institute for Intelligent Systems, Meshcapade
PoseGPT is a novel framework that enables models to directly generate 3D human poses from textual and visual inputs by embedding SMPL pose tokens within LLMs, achieving some innovation in interpreting 3D human pose.
arXiv
Summary
11-29 Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering
Institution: Sun Yat-Sen University
This work innovatively combines three agents to simulate the top-down reasoning process in human cognition, and introduces the concept of a Multi-view Knowledge Base, significantly enhancing the expressiveness and interpretability of VQA models.
arXiv
Summary
11-29 Zero-shot Conversational Summarization Evaluations with small Large Language Models
Institution: Intel Labs
The paper focuses on the application of Large Language Models in the conversational summarization task, deeply examining the impact of different instructions on model performance and researching optimization techniques for using compressed models under hardware limitations.
arXiv
Summary
11-29 Understanding and Improving In-Context Learning on Vision-language Models
Institution: LMU Munich, University of Oxford
This paper proposed a novel method, MMICES, for selecting demonstrations in in-context learning for vision-language models, demonstrating its effective performance across different models and datasets.
arXiv
Summary
11-29 How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation
Institution: The Education University of Hong Kong
This paper represents an innovative attempt to build an AI tutor system that can adapt to any course subject and provide customized high-quality educational support, potentially progressing the application of AI technology in education and forging a new path for the development of AI tutoring systems.
arXiv
Summary
11-29 Are Large Language Models Good Fact Checkers: A Preliminary Study
Institution: Chinese Academy of Sciences
The paper systematically evaluates the potential of LLMs in the entire fact-checking process, revealing that while they show promise in certain aspects, considerably more research and trials are needed to improve their performance in fact-checking tasks.
arXiv
Summary
11-29 Large Language Models for Networking: Applications, Enabling Techniques, and Challenges
Institution: BUPT
The paper proposes a new framework, ChatNet, that integrates Large Language Models with network technologies, exploring its application in network planning. The study demonstrates that ChatNet can effectively promote the automation and intelligence level of network tasks, though challenges such as multimodal data integration and plugin development must be addressed prior to deployment.
arXiv
Summary
11-29 TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
Institution: Harbin Institute of Technology
The introduction of the TIMEBENCH benchmark marks an important step in the comprehensive assessment of temporal reasoning abilities in Large Language Models, showcasing the current gap between models and humans in this area and providing guidance for future research.
arXiv
Summary
GitHub
11-29 TaskWeaver: A Code-First Agent Framework
Institution: Microsoft
TaskWeaver is a code-first designed framework to build LLM-powered autonomous agents, achieving efficient handling of complex data structures, flexible plugin usage, and the successful integration of domain-specific knowledge into the system.
arXiv
Summary
GitHub
11-28 AvatarGPT: All-in-One Framework for Motion Understanding, Planning, Generation and Beyond
The research introduces a novel, integrated AvatarGPT framework for handling high-level and low-level tasks related to understanding, planning, and generating human motions, showcasing the potential for extended-duration motion synthesis with reduced manual intervention.
arXiv
Summary
11-28 Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Institution: Microsoft
This paper explores how to guide a generalist foundation model to exhibit expert-level capabilities on specialized tasks without expert supervision, using the medical field as a case study. The proposed Medprompt strategy proves to have a significant advantage in enhancing the specialized abilities of foundation models and shows the possibility of widespread application across multiple disciplines.
arXiv
Summary
11-28 ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
Institution: Nanyang Technological University
This survey paper provides an assessment of the performance of open-source LLMs across multiple task domains compared to ChatGPT, highlighting the strengths and potential problems of current open-source LLMs, and offers insights for future research and development. Furthermore, it summarizes numerous best practices and challenges, indicating that the open-source field could potentially close the gap with commercial models to some extent.
arXiv
Summary
11-28 Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Institution: Shanghai AI Laboratory
The article proposes an innovative strategy for optimizing LVLMs to reduce hallucinations and introduces a new evaluation method to more comprehensively measure hallucinations. The effectiveness of the proposed method is validated through experiments.
arXiv
Summary
11-28 Graph Prompt Learning: A Comprehensive Survey and Beyond
Institution: The Chinese University of Hong Kong, Hong Kong University of Science and Technology, Fudan University
This paper provides a thorough survey on graph prompt learning, covering the AGI challenges with graph data handling and how graph prompt learning can facilitate cross-modality, cross-domain, and cross-task applicability of AGI technologies.
arXiv
Summary
GitHub
11-28 RELIC: Investigating Large Language Model Responses using Self-Consistency
Institution: ETH Zurich
RELIC is an interactive system that, through the factual consistency investigation of multiple samples, helps users verify and direct texts generated by LLMs.
arXiv
Summary
11-28 LLaFS: When Large-Language Models Meet Few-Shot Segmentation
Institution: Singapore University of Technology and Design, Zhejiang University
This paper presents an LLM-based framework for few-shot image segmentation, addressing the core challenges of enabling LLMs to understand and execute visual tasks. A combination of customized guidance and fine-grained in-context instructions facilitates high-quality few-shot segmentation.
arXiv
Summary
GitHub
11-28 RankingGPT: Empowering Large Language Models in Text Ranking with Progressive Enhancement
Institution: Alibaba Group
This study presents a two-stage training model for text ranking that combines weakly supervised pre-training with supervised fine-tuning, transitioning smoothly between the two stages so that fine-tuning performance improves without sacrificing the benefits of pre-training. Experiments show it significantly outperforms existing techniques.
arXiv
Summary
11-28 Prompting in Autoregressive Large Language Models
Institution: George Mason University
This paper provides a succinct literature review in the field of prompting for autoregressive large language models, highlighting unresolved challenges and open problems, thereby offering directions for future research.
arXiv
Summary
11-28 Training Chain-of-Thought via Latent-Variable Inference
Institution: Google
This paper develops an MCMC-EM based fine-tuning strategy that, by averaging over rationales, helps LLMs generate the correct answers, holding potential for wide applicability.
arXiv
Summary
11-28 Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
Institution: Alibaba Group
The paper presents a new framework "Animate Anyone" using diffusion models for character animation. The framework maintains appearance consistency through ReferenceNet and ensures controllability and continuity of animations via a pose guider and temporal layer, achieving advanced results in character animation generation.
arXiv
Summary
GitHub
Blog
11-27 RoboGPT: an intelligent agent of making embodied long-term decisions for daily instruction tasks
Institution: Chinese Academy of Sciences, Peking University
The paper presents an intelligent agent named RoboGPT that is designed for making embodied long-term decisions for daily instruction tasks. The agent combines the generic knowledge of LLMs with the professional knowledge in the robotics domain and introduces Re-Plan and RoboSkill modules to enhance the rationality and adaptability of task planning. On the ALFRED benchmark tests and generalization tasks, RoboGPT surpasses existing advanced methods.
arXiv
Summary
11-25 Faster Minimum Bayes Risk Decoding with Confidence-based Pruning
Institution: University of Cambridge
The paper presented an algorithm for MBR decoding that reduces utility function calls by gradually increasing the number of samples in the estimate and using confidence pruning. The algorithm significantly lowers computational costs while maintaining accuracy, and its effectiveness was validated through NMT experiments on three language pairs.
arXiv
Summary
11-24 Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language
Institution: Amazon
The paper presented an effective CnR method capable of efficiently aligning LLMs with human expectations through detailed feedback and response revision using natural language. With relatively less human feedback data, this method significantly improves the quality of responses from even top LLMs such as ChatGPT.
arXiv
Summary
11-24 Calibrated Language Models Must Hallucinate
Institution: Microsoft Research
The paper identifies the statistical root cause of hallucinations that are inevitable in sufficiently calibrated pretrained language models, explains the inherent mechanism by which models with good predictive performance generate hallucinations, and provides a lower-bound estimate of the hallucination rate. It discusses how likely different types of facts are to be hallucinated and points to potential future directions for mitigating specific types of hallucinations.
arXiv
Summary
11-23 GAIA: a benchmark for General AI Assistants
Institution: FAIR, Meta
arXiv
11-23 LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes
Institution: ASRI
arXiv
11-23 Probabilistic Tree-of-thought Reasoning for Answering Knowledge-intensive Complex Questions
Institution: Tsinghua University
The paper introduces Probabilistic Tree-of-thought Reasoning (ProbTree), a novel method that explores LLMs' capabilities to answer complex, knowledge-intensive questions and incorporates uncertainty into the reasoning process, integrating external and parametric knowledge within a unified framework.
arXiv
Summary
11-23 ZipLoRA: Any Subject in Any Style by Effectively Merging LoRAs
Institution: Google Research
arXiv
11-23 Diffusion Model Alignment Using Direct Preference Optimization
Institution: Stanford University
arXiv
11-23 FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline
Institution: Sber AI
arXiv
11-23 Controlling Large Language Model-based Agents for Large-Scale Decision-Making: An Actor-Critic Approach
Institution: Chinese Academy of Sciences
The LLaMAC framework demonstrates superior performance of LLM-based multi-agent systems in long-term planning, mathematical reasoning, optimization problems, and spatial reasoning, while also reducing access costs for large-scale multi-agent collaboration. With further enhancement of LLMs and more collaboration frameworks emerging, new opportunities will unfold in the multi-agent collaboration field.
arXiv
Summary
11-22 Visual In-Context Prompting
Institution: HKUST, Microsoft Research
The paper presents DINOv, an innovative visual in-context prompting framework effectively handling a variety of visual prompts, utilizing unlabeled data, and performing well across several tasks.
arXiv
Summary
GitHub
11-22 Enhancing Summarization Performance through Transformer-Based Prompt Engineering in Automated Medical Reporting
Institution: Utrecht University
This research validated that applying transformer-based prompt engineering in automated medical reporting can improve summarization performance. Despite some limitations, the proposed approach has shown the effectiveness of including examples and contextual information in prompt formulations and pointed out directions for future work.
arXiv
Summary
11-22 XAGen: 3D Expressive Human Avatars Generation
Institution: National University of Singapore, ByteDance
arXiv
11-22 AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations
Institution: The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology, University of California, Los Angeles
The AlignedCoT technique presented in this paper aims to align the CoT text style with the "native style" of Large Language Models to improve their reasoning capabilities, and its effectiveness has been demonstrated through empirical evidence.
arXiv
Summary
11-22 LIMIT: Less Is More for Instruction Tuning Across Evaluation Paradigms
Institution: Princeton University
arXiv
11-21 Oasis: Data Curation and Assessment System for Pretraining of Large Language Models
Institution: Chinese Academy of Sciences
arXiv
11-21 AcademicGPT: Empowering Academic Research
Institution: International Digital Economy Academy
arXiv
11-21 Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
Institution: Nanjing University
arXiv
GitHub
11-21 Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Institution: University of Cambridge
arXiv
11-21 How Capable Can a Transformer Become? A Study on Synthetic, Interpretable Tasks
Institution: University of Pennsylvania, MIT
arXiv
11-21 Latent Lab: Large Language Models for Knowledge Exploration
Institution: Department of Electrical Engineering and Computer Science, MIT
arXiv
11-21 Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation?
Institution: University of Auckland
arXiv
11-21 A Survey on Multimodal Large Language Models for Autonomous Driving
Institution: Purdue University
arXiv
11-21 Prompting Frameworks for Large Language Models: A Survey
Institution: Zhejiang University
This research delivers a framework that enhances interaction with LLMs by implementing new techniques, including improved compatibility with programming languages, enabling LLMs to utilize external tools, and maintaining historical interaction information, thus guiding future research directions.
arXiv
Summary
GitHub
11-20 Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents
Institution: Shanghai Jiao Tong University
arXiv
GitHub
11-20 GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Institution: New York University
The GPQA dataset offers a benchmark for testing the ability of AI systems to handle complex questions that require deep understanding and reasoning. With rigorous quality control and expert-level difficulty, it has the potential to advance the development of collaborative methods between human experts and AI systems, as well as the advancement of AI system design.
arXiv
Summary
11-20 Continual Learning: Applications and the Road Forward
Institution: KU Leuven
arXiv
11-20 Assessing Prompt Injection Risks in 200+ Custom GPTs
Institution: Northwestern University
arXiv
11-19 TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems
Institution: SenseTime Research
arXiv
11-18 Adapters: A Unified Library for Parameter-Efficient and Modular Transfer Learning
Institution: Technical University of Darmstadt, University of Cambridge
The paper proposes a unified library—Adapters—that integrates and extends parameter-efficient and modular transfer learning methods. It achieves close integration with the Transformers library and demonstrates its effectiveness through comparative experiments on several NLP tasks.
arXiv
Summary
11-18 RecExplainer: Aligning Large Language Models for Recommendation Model Interpretability
Institution: University of Science and Technology of China
arXiv
11-18 Orca 2: Teaching Small Language Models How to Reason
Institution: Microsoft Research
arXiv
11-18 An Embodied Generalist Agent in 3D World
Institution: Beijing Institute for General Artificial Intelligence
arXiv
11-17 Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
Institution: Allen Institute for AI
arXiv
11-17 Exploring the Relationship between In-Context Learning and Instruction Tuning
Institution: HKUST
arXiv
11-16 Predictive Minds: LLMs As Atypical Active Inference Agents
Institution: Charles University
arXiv
11-16 Automatic Engineering of Long Prompts
Institution: Google
arXiv
11-16 MacGyver: Are Large Language Models Creative Problem Solvers?
Institution: University of California, Princeton University
arXiv
11-16 Crafting In-context Examples according to LMs' Parametric Knowledge
Institution: The University of Texas at Austin
arXiv
11-15 Contrastive Chain-of-Thought Prompting
Institution: DAMO Academy, Alibaba Group
arXiv
11-15 Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
Institution: Tencent AI Lab
arXiv
11-15 Memory Augmented Language Models through Mixture of Word Experts
Institution: Google Research
arXiv
11-15 Exponentially Faster Language Modelling
Institution: ETH Zurich
The paper introduces UltraFastBERT, a BERT variant that uses fast feedforward networks to engage only a small fraction of its neurons during inference, greatly increasing computational efficiency. Although a native efficient implementation is still lacking, the authors provide CPU code that significantly accelerates inference while maintaining strong performance on standard downstream tasks, demonstrating the substantial potential of conditional neural execution in language modeling.
arXiv
Summary
11-15 ToolTalk: Evaluating Tool-Usage in a Conversational Setting
Institution: Microsoft Corporation
ToolTalk is a benchmark designed to evaluate and improve the performance of LLMs in utilizing multi-step external tools within a conversational context. With innovative evaluation methods and realistic scenario simulations, it challenges and expands the boundaries of LLM capabilities and charts a course for future research.
arXiv
Summary
GitHub
11-14 KTRL+F: Knowledge-Augmented In-Document Search
Institution: KAIST AI, Samsung Research
arXiv
11-14 Learning to Filter Context for Retrieval-Augmented Generation
Institution: Carnegie Mellon University
arXiv
11-14 Instruction-Following Evaluation for Large Language Models
Institution: Google, Yale University
arXiv
GitHub
11-13 In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax
Institution: NYU, Microsoft
This paper unveils potential limitations of large language models in understanding and generalizing syntactic structures, which is crucial for improving the way language models handle complex syntactic tasks.
arXiv
Summary
11-13 Can LLMs Patch Security Issues?
Institution: School of Computer Science Atlanta
The article introduced a new approach to code refinement named FDSS, which, by integrating with the static code analysis tool, Bandit, significantly enhances the capability of LLMs in solving security issues within code.
arXiv
Summary
GitHub
11-11 In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
Institution: Stanford University
arXiv
11-10 Making LLMs Worth Every Penny: Resource-Limited Text Classification in Banking
Institution: Helvia.ai
For the first time, this paper presents a comprehensive evaluation of methodologies in a resource-limited industrial context, including cost analysis, RAG method, and data augmentation using GPT-4, offering new avenues for the financial industry to address challenges related to data and budget constraints.
arXiv
Summary
11-05 ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs
Institution: Cornell University, Microsoft Research
The paper presents a new approach to enhance online educational QA platforms using open-source LLMs, and it has undergone extensive evaluation and testing. By combining technologies like RAG, SFT, and DPO, the study not only ensures a significant improvement in the quality of responses but also protects data privacy, making it significant for the development of intelligent QA assistants.
arXiv
Summary
11-01 LLMRec: Large Language Models with Graph Augmentation for Recommendation
Institution: University of Hong Kong, Baidu
LLMRec, as a pioneering work, introduces LLMs to enhance graph recommendation systems and successfully addresses the issues of sparsity in interaction data and low-quality side information. It improves the performance of recommendation systems through means such as reinforcing user-item interaction edges, item node attributes, and user profiling, ensuring recommendation quality while reducing the impact of data noise.
arXiv
Summary
GitHub

2023-10

 Date   Paper Links & Summary
10-20 The History and Risks of Reinforcement Learning and Human Feedback
Institution: Berkeley
arXiv
10-17 Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Institution: University of Washington
The paper introduces SELF-RAG, a new framework that enhances LLM quality and factual accuracy through on-demand retrieval and self-reflection. It makes the LM controllable during the inference phase to suit diverse task requirements and significantly outperforms existing LLMs and RAG models in various tasks. SELF-RAG offers a novel approach to model self-assessment and customization through its decoding algorithm and reflection tokens.
arXiv
Summary
10-11 OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models
Institution: Tsinghua University, Chinese Academy of Sciences
OpsEval, as a comprehensive task-oriented AIOps benchmark, not only assesses the comprehensive performance, reasoning, and practical application capabilities of LLMs but also has the potential to change the evaluation metrics used in future large-scale quality assessments. It provides a solid foundation for ongoing research and optimization of LLMs tailored for AIOps.
arXiv
Summary
10-10 GPT-4 as an Agronomist Assistant? Answering Agriculture Exams Using Large Language Models
Institution: Microsoft Research
This study presents a new approach in employing LLMs for answering questions in the field of agriculture, significantly enhancing LLMs' performance on multiple-choice questions through the Ensemble Refinement strategy, showing the broad potential in handling domain-specific problems.
arXiv
Summary

2023-09

 Date   Paper Links & Summary
09-04 Benchmarking Large Language Models in Retrieval-Augmented Generation
Institution: Chinese Information Processing Laboratory
This paper introduces a new benchmark based on real news articles for comprehensive assessment of large language models' capabilities in complex informational environments and illustrates the existing limitations of LLMs through the experimental results.
arXiv
Summary

2023-08

 Date   Paper Links & Summary
08-18 Learning Representations on Logs for AIOps
Institution: IBM Research
The BERTOps model proposed in this paper leverages general LLM representations and specifically tailored pretraining for AIOps log data, effectively improving the performance of automated log analysis tasks and demonstrating significant enhancements. BERTOps not only surpasses existing models but also exhibits superior performance across multiple downstream tasks, facilitating the practical application of AIOps.
arXiv
Summary

2023-07

 Date   Paper Links & Summary
07-11 Towards Understanding In-Context Learning with Contrastive Demonstrations and Saliency Maps
Institution: University of Maryland
This study analyzed the internal mechanisms of ICL in LLMs using contrastive demonstrations and saliency map analysis, revealing the differential impacts of label flipping, input changes, and complementary explanations on predictions, providing insights for practitioners on curating demonstrations.
arXiv
Summary
GitHub

2023-06

 Date   Paper Links & Summary
06-07 PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
Institution: Microsoft Research
PromptRobust is an innovative, open benchmark aimed at evaluating how LLMs handle input errors that are likely to occur naturally in the real world, such as typos and synonym replacements. The open-sourcing of this tool will aid future robustness research.
arXiv
Summary

2023-05

 Date   Paper Links & Summary
05-24 In-Context Demonstration Selection with Cross Entropy Difference
Institution: Microsoft Cognitive Service Research
The paper presents a novel Cross-Entropy Difference (CED) method for in-context demonstration selection, provides a theoretical rationale, and achieves performance improvements on large language models of different sizes and types.
arXiv
Summary
05-23 Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
Authors: Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun
The paper explores the internal mechanism of in-context learning (ICL) by large language models from an information flow perspective, identifying the anchoring phenomenon of label words, proposing a new hypothesis, and experimentally validating its effectiveness. Moreover, the insights were used to propose methods for improving ICL performance, providing a theoretical foundation and practical guidance for future related researches.
arXiv
Summary
05-19 How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
Institution: The Ohio State University
The study revealed the critical database knowledge and optimal representations for effective prompting, offering guidance for the application of LLMs in the text-to-SQL task, and pointed out a "sweet spot" in terms of prompt length in the cross-domain setting. The findings may not always be applicable to a specific database, particularly if the database is significantly different from the Spider databases.
arXiv
Summary

2023-03

 Date   Paper Links & Summary
03-31 A Survey of Large Language Models
Institution: Renmin University of China
arXiv

2023-02

 Date   Paper Links & Summary
02-08 A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
Institution: Centre for Artificial Intelligence Research
The article evaluated ChatGPT's reasoning abilities in a more granular way and identified a key issue in LLMs - the lack in non-text semantic understanding. This finding offers significant directions for future improvements and research into the reasoning capabilities of LLMs.
arXiv
Summary