LLM

Publish Date	Title	Authors	arXiv ID	Code
2025-02-21	One-step Diffusion Models with $f$-Divergence Distribution Matching	Yilun Xu et.al.	2502.15681v1	null
2025-02-21	Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training	Jaydeep Borkar et.al.	2502.15680v1	null
2025-02-21	BOSS: Benchmark for Observation Space Shift in Long-Horizon Task	Yue Yang et.al.	2502.15679v1	null
2025-02-21	FLEKE: Federated Locate-then-Edit Knowledge Editing	Zongkai Zhao et.al.	2502.15677v1	null
2025-02-21	AutoToM: Automated Bayesian Inverse Planning and Model Discovery for Open-ended Theory of Mind	Zhining Zhang et.al.	2502.15676v1	null
2025-02-21	VaViM and VaVAM: Autonomous Driving through Video Generative Modeling	Florent Bartoccioni et.al.	2502.15672v1	link
2025-02-21	Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing	Shoumik Saha et.al.	2502.15666v1	null
2025-02-21	Automating Curriculum Learning for Reinforcement Learning using a Skill-Based Bayesian Network	Vincent Hsiao et.al.	2502.15662v1	null
2025-02-21	Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?	Yoshua Bengio et.al.	2502.15657v1	null
2025-02-21	Machine-generated text detection prevents language model collapse	George Drayson et.al.	2502.15654v1	null
2025-02-21	Empowering LLMs with Logical Reasoning: A Comprehensive Survey	Fengxiang Cheng et.al.	2502.15652v1	null
2025-02-21	Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models	Anirudh Sundar et.al.	2502.15639v1	null
2025-02-21	Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification	Vasilii Feofanov et.al.	2502.15637v1	null
2025-02-21	The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer	Marthe Ballon et.al.	2502.15631v1	null
2025-02-21	Dynamic Knowledge Selector and Evaluator for recommendation with Knowledge Graph	Feng Xia et.al.	2502.15623v1	null
2025-02-21	Extraction multi-étiquettes de relations en utilisant des couches de Transformer	Ngoc Luyen Le et.al.	2502.15619v1	null
2025-02-21	Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing	Qi Le et.al.	2502.15618v1	null
2025-02-21	Pastiche Novel Generation Creating: Fan Fiction You Love in Your Favorite Author's Style	Xueran Han et.al.	2502.15616v1	null
2025-02-21	LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models	Hugo Pitorro et.al.	2502.15612v1	null
2025-02-21	PDeepPP: A Deep learning framework with Pretrained Protein language for peptide classification	Jixiu Zhai et.al.	2502.15610v1	null
2025-02-21	On the Robustness of Transformers against Context Hijacking for Linear Classification	Tianle Li et.al.	2502.15609v1	null
2025-02-21	Do Multilingual LLMs Think In English?	Lisa Schut et.al.	2502.15603v1	null
2025-02-21	KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation	Yoonjin Chung et.al.	2502.15602v1	null
2025-02-21	WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents	Xinhang Liu et.al.	2502.15601v1	null
2025-02-21	Robust Bias Detection in MLMs and its Application to Human Trait Ratings	Ingroj Shrestha et.al.	2502.15600v1	null
2025-02-21	SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention	Jiaqi Wu et.al.	2502.15594v1	null
2025-02-21	Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning	Wenhao Zhu et.al.	2502.15592v1	null
2025-02-21	LightThinker: Thinking Step-by-Step Compression	Jintian Zhang et.al.	2502.15589v1	null
2025-02-21	Improving the Scaling Laws of Synthetic Data with Deliberate Practice	Reyhane Askari-Hemmat et.al.	2502.15588v1	null
2025-02-21	Chats-Grid: An Iterative Retrieval Q&A Optimization Scheme Leveraging Large Model and Retrieval Enhancement Generation in smart grid	Yunfeng Li et.al.	2502.15583v1	null
2025-02-21	Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders	Xuansheng Wu et.al.	2502.15576v1	null
2025-02-21	DReSD: Dense Retrieval for Speculative Decoding	Milan Gritta et.al.	2502.15572v1	null
2025-02-21	A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in Germany	Ina Dormuth et.al.	2502.15568v1	null
2025-02-21	Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation	Tim Rädsch et.al.	2502.15563v1	null
2025-02-21	PIP-KAG: Mitigating Knowledge Conflicts in Knowledge-Augmented Generation via Parametric Pruning	Pengcheng Huang et.al.	2502.15543v1	null
2025-02-21	Bridging Domain Gaps between Pretrained Multimodal Models and Recommendations	Wenyu Zhang et.al.	2502.15542v1	null
2025-02-21	SOTOPIA-Ω: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents	Wenyuan Zhang et.al.	2502.15538v1	null
2025-02-21	Activation Steering in Neural Theorem Provers	Shashank Kirtania et.al.	2502.15507v1	null
2025-02-21	BAN: Neuroanatomical Aligning in Auditory Recognition between Artificial Neural Network and Human Cortex	Haidong Wang et.al.	2502.15503v1	null
2025-02-21	Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models	Ya Wang et.al.	2502.15499v1	null
2025-02-21	Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection	Jiangyong Yu et.al.	2502.15488v1	null
2025-02-21	ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models	Martina Miliani et.al.	2502.15487v1	null
2025-02-21	Enhancing RWKV-based Language Models for Long-Sequence Text Generation	Xinghan Pan et.al.	2502.15485v1	null
2025-02-21	PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System	Yintao He et.al.	2502.15470v1	null
2025-02-21	Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation	Wenxuan Wang et.al.	2502.15466v1	null
2025-02-21	R-LoRA: Random Initialization of Multi-Head LoRA for Multi-Task Learning	Jinda Liu et.al.	2502.15455v1	null
2025-02-21	A fast convergence algorithm based on binary integer programming for expert load balancing in MoE LLMs	Yuan Sun et.al.	2502.15451v1	null
2025-02-21	MVIP -- A Dataset and Methods for Application Oriented Multi-View and Multi-Modal Industrial Part Recognition	Paul Koch et.al.	2502.15448v1	null
2025-02-21	When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models	Weilan Wang et.al.	2502.15443v1	null
2025-02-21	Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning	Raghav Singhal et.al.	2502.15436v1	link
2025-02-21	Single-pass Detection of Jailbreaking Input in Large Language Models	Leyla Naz Candogan et.al.	2502.15435v1	null
2025-02-21	Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation	Yue Zhou et.al.	2502.15434v1	null
2025-02-21	Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations	Lihu Chen et.al.	2502.15429v1	null
2025-02-21	Anatomy-Informed Deep Learning and Radiomics for Automated Neurofibroma Segmentation in Whole-Body MRI	Georgii Kolokolnikov et.al.	2502.15424v1	null
2025-02-21	Evaluating Multimodal Generative AI with Korean Educational Standards	Sanghee Park et.al.	2502.15422v1	null
2025-02-21	Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking	Yi-Ling Chung et.al.	2502.15419v1	null
2025-02-21	MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models	Suraj Racha et.al.	2502.15418v1	null
2025-02-21	Textual-to-Visual Iterative Self-Verification for Slide Generation	Yunqing Xu et.al.	2502.15412v1	null
2025-02-21	HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings	Rasmus Aavang et.al.	2502.15411v1	null
2025-02-21	Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning	Xuetao Ma et.al.	2502.15401v1	null
2025-02-21	Enhancing Vehicle Make and Model Recognition with 3D Attention Modules	Narges Semiromizadeh et.al.	2502.15398v1	null
2025-02-21	Super-Resolution for Interferometric Imaging: Model Comparisons and Performance Analysis	Hasan Berkay Abdioglu et.al.	2502.15397v1	null
2025-02-21	Chitrarth: Bridging Vision and Language for a Billion People	Shaharukh Khan et.al.	2502.15392v1	null
2025-02-21	Identifying Features that Shape Perceived Consciousness in Large Language Model-based AI: A Quantitative Study of Human Responses	Kang Bongsu et.al.	2502.15365v1	null
2025-02-21	Evaluating Social Biases in LLM Reasoning	Xuyang Wu et.al.	2502.15361v1	null
2025-02-21	ARS: Automatic Routing Solver with Large Language Models	Kai Li et.al.	2502.15359v1	null
2025-02-21	AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms	Feiyang Chen et.al.	2502.15349v1	null
2025-02-21	Constructing a Norm for Children's Scientific Drawing: Distribution Features Based on Semantic Similarity of Large Language Models	Yi Zhang et.al.	2502.15348v1	null
2025-02-21	Tokenization is Sensitive to Language Variation	Anna Wegmann et.al.	2502.15343v1	null
2025-02-21	Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions	Shoubin Chen et.al.	2502.15336v1	null
2025-02-21	Stepwise Informativeness Search for Improving LLM Reasoning	Siyuan Wang et.al.	2502.15335v1	null
2025-02-21	Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment	Pedram Zaree et.al.	2502.15334v1	null
2025-02-21	Detecting Future-related Contexts of Entity Mentions	Puneet Prashar et.al.	2502.15332v1	null
2025-02-21	Lightweight yet Efficient: An External Attentive Graph Convolutional Network with Positional Prompts for Sequential Recommendation	Jinyu Zhang et.al.	2502.15331v1	null
2025-02-21	Road Traffic Sign Recognition method using Siamese network Combining Efficient-CNN based Encoder	Zhenghao Xi et.al.	2502.15307v1	null
2025-02-21	SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention	Hong Yankun et.al.	2502.15304v1	null
2025-02-21	Beyond Fixed Variables: Expanding-variate Time Series Forecasting via Flat Scheme and Spatio-temporal Focal Learning	Minbo Ma et.al.	2502.15296v1	null
2025-02-21	Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference	Yaohua Tang et.al.	2502.15294v1	null
2025-02-21	CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models	Shunchang Liu et.al.	2502.15278v1	null
2025-02-21	Analyzing the Inner Workings of Transformers in Compositional Generalization	Ryoma Kumon et.al.	2502.15277v1	null
2025-02-21	A Training-free LLM-based Approach to General Chinese Character Error Correction	Houquan Zhou et.al.	2502.15266v1	null
2025-02-21	Retrieval-Augmented Speech Recognition Approach for Domain Challenges	Peng Shen et.al.	2502.15264v1	null
2025-02-21	Corrections Meet Explanations: A Unified Framework for Explainable Grammatical Error Correction	Jingheng Ye et.al.	2502.15261v1	null
2025-02-21	LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design	Renjie Wei et.al.	2502.15260v1	null
2025-02-21	Comparative Analysis of Large Language Models for Context-Aware Code Completion using SAFIM Framework	Hang Zhang et.al.	2502.15243v1	null
2025-02-21	A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation	Shilong Hou et.al.	2502.15233v1	null
2025-02-21	AutoMR: A Universal Time Series Motion Recognition Pipeline	Likun Zhang et.al.	2502.15228v1	null
2025-02-21	Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews	Mengqiao Liu et.al.	2502.15226v1	null
2025-02-21	Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs	Tingting Chen et.al.	2502.15224v1	null
2025-02-21	A BERT Based Hybrid Recommendation System For Academic Collaboration	Sangeetha N et.al.	2502.15223v1	null
2025-02-21	ESPnet-SpeechLM: An Open Speech Language Model Toolkit	Jinchuan Tian et.al.	2502.15218v1	null
2025-02-21	FormalSpecCpp: A Dataset of C++ Formal Specifications created using LLMs	Madhurima Chakraborty et.al.	2502.15217v1	link
2025-02-21	The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning	Sheila Schoepp et.al.	2502.15214v1	null
2025-02-21	Measuring AI agent autonomy: Towards a scalable approach with code inspection	Peter Cihon et.al.	2502.15212v1	null
2025-02-21	PairBench: A Systematic Framework for Selecting Reliable Judge VLMs	Aarash Feizi et.al.	2502.15210v1	null
2025-02-21	Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing	Zhilin Wang et.al.	2502.15208v1	null
2025-02-21	FlipConcept: Tuning-Free Multi-Concept Personalization for Text-to-Image Generation	Young Beom Woo et.al.	2502.15203v1	null
2025-02-21	TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding	Zhaoxuan Wu et.al.	2502.15197v1	null
2025-02-21	Scale-Free Graph-Language Models	Jianglin Lu et.al.	2502.15189v1	null
2025-02-21	LEDD: Large Language Model-Empowered Data Discovery in Data Lakes	Qi An et.al.	2502.15182v1	null

Abstracts

One-step Diffusion Models with $f$-Divergence Distribution Matching

2502.15681v1 by Yilun Xu, Weili Nie, Arash Vahdat

Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation speed, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching, which is known to be mode-seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. When a less mode-seeking divergence is used, this weighting function naturally emphasizes samples with higher density in the teacher distribution. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: https://research.nvidia.com/labs/genair/f-distill
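
The density-ratio weighting mentioned in this abstract can be illustrated with a small sketch. It assumes the convention D_f(p || q) = E_{x~q}[f(r(x))] with r = p/q, p the teacher density and q the student density; under that assumption a chain-rule derivation gives a weight h(r) = f''(r) r^2 on the score difference. The function names below are hypothetical, and the paper may use a different normalization or sign convention.

```python
import math

# Assumed convention: D_f(p || q) = E_{x~q}[f(r)], r = p(x) / q(x),
# p = teacher density, q = student density. Under this convention the
# score-difference term is weighted by h(r) = f''(r) * r**2.

def f_reverse_kl(r: float) -> float:
    return -math.log(r)

def f_forward_kl(r: float) -> float:
    return r * math.log(r)

def f_js(r: float) -> float:
    # Jensen-Shannon divergence written as an f-divergence.
    return 0.5 * (r * math.log(2.0 * r / (1.0 + r)) + math.log(2.0 / (1.0 + r)))

def weight_closed_form(name: str, r: float) -> float:
    """Closed-form h(r) = f''(r) * r**2 for three divergences."""
    if name == "reverse_kl":
        return 1.0                      # f'' = 1/r^2, so the weight is constant
    if name == "forward_kl":
        return r                        # f'' = 1/r
    if name == "jensen_shannon":
        return r / (2.0 * (1.0 + r))    # f'' = 1 / (2 r (1 + r))
    raise ValueError(f"unknown divergence: {name}")

def weight_numeric(f, r: float, eps: float = 1e-3) -> float:
    """Same weight, with f'' estimated by a central finite difference."""
    second = (f(r + eps) - 2.0 * f(r) + f(r - eps)) / (eps * eps)
    return second * r * r
```

Under these assumptions the reverse-KL weight is constant, while the forward-KL and Jensen-Shannon weights grow with r, i.e. they put more emphasis on samples where the teacher density is high relative to the student.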

Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training

2502.15680v1 by Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, Christopher A. Choquette-Choo

Due to the sensitive nature of personally identifiable information (PII), its owners may have the authority to control its inclusion or request its removal from large-language model (LLM) training. Beyond this, PII may be added to or removed from training datasets due to evolving dataset curation techniques, because it was newly scraped for retraining, or because it was included in a new downstream fine-tuning stage. We find that the amount and ease of PII memorization is a dynamic property of a model that evolves throughout training pipelines and depends on commonly altered design choices. We characterize three such novel phenomena: (1) similar-appearing PII seen later in training can elicit memorization of earlier-seen sequences in what we call assisted memorization, and this is a significant factor (in our settings, up to 1/3); (2) adding PII can increase memorization of other PII significantly (in our settings, as much as $\approx 7.5\times$); and (3) removing PII can lead to other PII being memorized. Model creators should consider these first- and second-order privacy risks when training models to avoid the risk of new PII regurgitation.

BOSS: Benchmark for Observation Space Shift in Long-Horizon Task

2502.15679v1 by Yue Yang, Linfeng Zhao, Mingyu Ding, Gedas Bertasius, Daniel Szafir

Robotics has long sought to develop visual-servoing robots capable of completing previously unseen long-horizon tasks. Hierarchical approaches offer a pathway for achieving this goal by executing skill combinations arranged by a task planner, with each visuomotor skill pre-trained using a specific imitation learning (IL) algorithm. However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies. To validate OSS and evaluate its impact on long-horizon tasks, we introduce BOSS (a Benchmark for Observation Space Shift). BOSS comprises three distinct challenges: "Single Predicate Shift", "Accumulated Predicate Shift", and "Skill Chaining", each designed to assess a different aspect of OSS's negative effect. We evaluated several recent popular IL algorithms on BOSS, including three Behavioral Cloning methods and the Visual Language Action model OpenVLA. Even on the simplest challenge, we observed average performance drops of 67%, 35%, 34%, and 54%, respectively, when comparing skill performance with and without OSS. Additionally, we investigate a potential solution to OSS that scales up the training data for each skill with a larger and more visually diverse set of demonstrations, with our results showing it is not sufficient to resolve OSS. The project page is: https://boss-benchmark.github.io/

FLEKE: Federated Locate-then-Edit Knowledge Editing

2502.15677v1 by Zongkai Zhao, Guozeng Xu, Xiuhua Li, Kaiwen Wei, Jiang Zhong

Locate-then-Edit Knowledge Editing (LEKE) is a key technique for updating large language models (LLMs) without full retraining. However, existing methods assume a single-user setting and become inefficient in real-world multi-client scenarios, where decentralized organizations (e.g., hospitals, financial institutions) independently update overlapping knowledge, leading to redundant mediator knowledge vector (MKV) computations and privacy concerns. To address these challenges, we introduce Federated Locate-then-Edit Knowledge Editing (FLEKE), a novel task that enables multiple clients to collaboratively perform LEKE while preserving privacy and reducing computational overhead. To achieve this, we propose FedEdit, a two-stage framework that optimizes MKV selection and reuse. In the first stage, clients locally apply LEKE and upload the computed MKVs. In the second stage, rather than relying solely on server-based MKV sharing, FLEKE allows clients to retrieve relevant MKVs based on cosine similarity, enabling knowledge re-edit and minimizing redundant computations. Experimental results on two benchmark datasets demonstrate that FedEdit retains over 96% of the performance of non-federated LEKE while significantly outperforming a FedAvg-based baseline by approximately twofold. In addition, we find that MEMIT performs more consistently than PMET in the FLEKE task with our FedEdit framework. Our code is available at https://github.com/zongkaiz/FLEKE.
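
The second-stage retrieval step can be sketched as a thresholded cosine-similarity lookup. This is only an illustration: the abstract does not say what vectors are compared or how a match is accepted, so the store layout, the `threshold` parameter, and the function names are all assumptions rather than the FedEdit implementation.

```python
import math
from typing import Dict, List, Optional, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_mkv(
    query: List[float],
    store: Dict[str, List[float]],   # hypothetical: MKV id -> shared vector
    threshold: float = 0.9,          # hypothetical acceptance threshold
) -> Optional[Tuple[str, float]]:
    """Return the id and similarity of the best stored vector above the threshold."""
    best: Optional[Tuple[str, float]] = None
    for mkv_id, vector in store.items():
        sim = cosine_similarity(query, vector)
        if sim >= threshold and (best is None or sim > best[1]):
            best = (mkv_id, sim)
    return best   # None means: no reusable MKV, compute one locally
```

In this sketch a client would reuse a retrieved vector when a match is found and fall back to local computation otherwise, which is where the savings in redundant computation would come from.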

AutoToM: Automated Bayesian Inverse Planning and Model Discovery for Open-ended Theory of Mind

2502.15676v1 by Zhining Zhang, Chuanyang Jin, Mung Yao Jia, Tianmin Shu

Theory of Mind (ToM), the ability to understand people's mental variables based on their behavior, is key to developing socially intelligent agents. Current approaches to Theory of Mind reasoning either rely on prompting Large Language Models (LLMs), which are prone to systematic errors, or use rigid, handcrafted Bayesian Theory of Mind (BToM) models, which are more robust but cannot generalize across different domains. In this work, we introduce AutoToM, an automated Bayesian Theory of Mind method for achieving open-ended machine Theory of Mind. AutoToM can operate in any domain, infer any mental variable, and conduct robust Theory of Mind reasoning of any order. Given a Theory of Mind inference problem, AutoToM first proposes an initial BToM model. It then conducts automated Bayesian inverse planning based on the proposed model, leveraging an LLM as the backend. Based on the uncertainty of the inference, it iteratively refines the model, by introducing additional mental variables and/or incorporating more timesteps in the context. Empirical evaluations across multiple Theory of Mind benchmarks demonstrate that AutoToM consistently achieves state-of-the-art performance, offering a scalable, robust, and interpretable approach to machine Theory of Mind.
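
Bayesian inverse planning, in its simplest form, scores each candidate mental state by how well it explains the observed actions. The sketch below is a toy version over a discrete set of goals: the likelihood table is hard-coded here, whereas the abstract describes an LLM backend supplying such estimates, and all names are hypothetical. The entropy helper stands in for the "uncertainty of the inference" that drives model refinement.

```python
import math
from typing import Dict, List

def posterior_over_goals(
    prior: Dict[str, float],
    likelihood: Dict[str, Dict[str, float]],  # goal -> action -> P(action | goal)
    actions: List[str],
) -> Dict[str, float]:
    """P(goal | actions) proportional to P(goal) * product of P(action | goal)."""
    unnormalized = {}
    for goal, p in prior.items():
        for action in actions:
            p *= likelihood[goal][action]
        unnormalized[goal] = p
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observed actions have zero probability under every goal")
    return {goal: p / total for goal, p in unnormalized.items()}

def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in nats; higher means a less certain inference."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)
```

A refinement loop in this spirit might add variables or timesteps while the entropy stays above some chosen cutoff; that stopping rule is an assumption, not something stated in the abstract.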

VaViM and VaVAM: Autonomous Driving through Video Generative Modeling

2502.15672v1 by Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, Matthieu Cord

We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at https://github.com/valeoai/VideoActionModel

Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing

2502.15666v1 by Shoumik Saha, Soheil Feizi

The growing use of large language models (LLMs) for text generation has led to widespread concerns about AI-generated content detection. However, an overlooked challenge is AI-polished text, where human-written content undergoes subtle refinements using AI tools. This raises a critical question: should minimally polished text be classified as AI-generated? Misclassification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. In this study, we systematically evaluate eleven state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation (APT-Eval) dataset, which contains 11.7K samples refined at varying AI-involvement levels. Our findings reveal that detectors frequently misclassify even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models. These limitations highlight the urgent need for more nuanced detection methodologies.

Automating Curriculum Learning for Reinforcement Learning using a Skill-Based Bayesian Network

2502.15662v1 by Vincent Hsiao, Mark Roberts, Laura M. Hiatt, George Konidaris, Dana Nau

A major challenge for reinforcement learning is automatically generating curricula to reduce training time or improve performance in some target task. We introduce SEBNs (Skill-Environment Bayesian Networks) which model a probabilistic relationship between a set of skills, a set of goals that relate to the reward structure, and a set of environment features to predict policy performance on (possibly unseen) tasks. We develop an algorithm that uses the inferred estimates of agent success from SEBN to weigh the possible next tasks by expected improvement. We evaluate the benefit of the resulting curriculum on three environments: a discrete gridworld, continuous control, and simulated robotics. The results show that curricula constructed using SEBN frequently outperform other baselines.

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

2502.15657v1 by Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, David Williams-King

The leading AI companies are increasingly focused on building generalist AI agents -- systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.

Machine-generated text detection prevents language model collapse

2502.15654v1 by George Drayson, Vasileios Lampos

As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since web data is the primary resource for LLM pretraining, future models will be trained on an unknown portion of synthetic data. This will lead to model collapse, a degenerative process which causes models to reinforce their own errors and experience a drop in model performance. In this study, we investigate the impact of decoding strategy on model collapse, analysing the characteristics of the generated data during recursive training, its similarity to human references, and the resulting model performance. Using the decoding strategies that lead to the most significant model degradation, we tackle the question of how to avoid model collapse when the origin (human or synthetic) of the training data is unknown. We design a novel methodology based on resampling the data distribution using importance weights from our machine-generated text detector. Our method is validated on two LLM variants (GPT-2 and SmolLM2) on the open-ended text generation task, demonstrating that it can successfully prevent model collapse and that, when there is enough human-authored data in the training dataset, it improves model performance.
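
The resampling idea can be sketched as weighted sampling with replacement. The abstract does not specify how detector outputs become importance weights, so this sketch assumes the weight is simply a detector's estimated probability that a sample is human-written; `human_prob` and `resample_by_detector` are hypothetical names.

```python
import random
from typing import Callable, List

def resample_by_detector(
    samples: List[str],
    human_prob: Callable[[str], float],  # hypothetical detector: P(human-written)
    k: int,
    seed: int = 0,
) -> List[str]:
    """Draw k training samples with replacement, favouring likely human text."""
    weights = [human_prob(s) for s in samples]
    if sum(weights) <= 0.0:
        raise ValueError("detector assigned zero weight to every sample")
    rng = random.Random(seed)            # seeded for reproducibility
    return rng.choices(samples, weights=weights, k=k)
```

Samples the detector considers almost certainly synthetic are drawn rarely, so the resampled training set leans toward human-authored text without requiring ground-truth origin labels.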

摘要：隨著大型語言模型 (LLM) 變得越來越普遍，它們產生的輸出在網路中大量增加，冒著未來機器產生的內容稀釋人類撰寫文字的風險。由於網路資料是 LLM 預訓練的主要資源，未來的模型將在比例未知的合成資料上進行訓練。這將導致模型崩潰，一種退化過程，導致模型強化它們自己的錯誤並經歷模型效能下降。在本研究中，我們探討解碼策略對模型崩潰的影響，我們在遞迴訓練過程中分析生成資料的特徵、它與人類參考的相似性以及產生的模型效能。使用導致最顯著模型退化的解碼策略，我們解決這個問題：當訓練資料的來源（人類或合成）未知時，如何避免模型崩潰。我們設計了一種新的方法，利用我們的機器生成文字偵測器所提供的重要性權重，對資料分佈進行重新取樣。我們的方法在兩個 LLM 變體（GPT-2 和 SmolLM2）上針對開放式文字生成任務進行驗證，證明我們可以成功防止模型崩潰，並且當訓練資料集中有足夠的人類撰寫資料時，我們的方法可以改善模型效能。
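
The resampling idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how detector scores are turned into importance weights, so the weighting used here (one minus the detector's machine-generated probability) is an assumption, and `resample_by_detector`, `corpus`, and `p_machine` are hypothetical names rather than anything from the paper.

```python
import random


def resample_by_detector(samples, p_machine, k, seed=0):
    """Draw k samples with replacement, weighting each sample by its
    estimated probability of being human-written (1 - p_machine).

    Assumption: this weighting is illustrative, not the paper's formula.
    """
    if len(samples) != len(p_machine):
        raise ValueError("samples and p_machine must have the same length")
    weights = [1.0 - p for p in p_machine]
    if sum(weights) <= 0:
        raise ValueError("every sample has zero weight")
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.choices(samples, weights=weights, k=k)


# Hypothetical corpus of mixed origin with made-up detector scores.
corpus = ["human essay", "forum post", "synthetic story", "synthetic summary"]
p_machine = [0.05, 0.20, 1.00, 0.90]
resampled = resample_by_detector(corpus, p_machine, k=1000)
```

Samples the detector is certain are synthetic get zero weight and never appear in the resampled set, while likely human-written text is over-represented relative to the raw corpus.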

Empowering LLMs with Logical Reasoning: A Comprehensive Survey

2502.15652v1 by Fengxiang Cheng, Haoxuan Li, Fenrong Liu, Robert van Rooij, Kun Zhang, Zhouchen Lin

Large language models (LLMs) have achieved remarkable successes on various natural language tasks. However, recent studies have found that there are still significant challenges to the logical reasoning abilities of LLMs. This paper summarizes and categorizes the main challenges into two aspects: (1) Logical question answering: LLMs often fail to generate the correct answer to complex logical problems that require sophisticated deductive, inductive, or abductive reasoning given a collection of premises and constraints. (2) Logical consistency: LLMs are prone to producing responses that contradict themselves across different questions. For example, a state-of-the-art Macaw question-answering LLM answers "Yes" to both "Is a magpie a bird?" and "Does a bird have wings?" but answers "No" to "Does a magpie have wings?". To facilitate this research direction, we comprehensively investigate the most cutting-edge methods and propose detailed taxonomies of these methods. Specifically, to accurately answer complex logic questions, previous methods can be categorized based on reliance on external solvers, prompts, pretraining, and fine-tuning. To avoid logical contradictions, we discuss concepts and solutions of various logical consistencies, including implication, negation, transitivity, factuality consistency, and their composites. In addition, we review commonly used benchmark datasets and evaluation metrics, and discuss promising research directions, such as extensions to modal logic to account for uncertainty, and efficient algorithms satisfying multiple logical consistencies simultaneously.

摘要：大型語言模型 (LLM) 已在各種自然語言任務中取得顯著成功。然而，最近的研究發現，LLM 的邏輯推理能力仍存在重大挑戰。本文將主要挑戰歸納並分類為兩個方面：(1) 邏輯問答：在給定一組前提和約束的情況下，LLM 通常無法在需要複雜演繹、歸納或溯因推理的複雜邏輯問題中產生正確答案。(2) 邏輯一致性：LLM 容易產生在不同問題中自相矛盾的回應。例如，最先進的 Macaw 問答 LLM 對問題「喜鵲是鳥嗎？」和「鳥有翅膀嗎？」都回答「是」，但對「喜鵲有翅膀嗎？」卻回答「否」。為了促進這個研究方向，我們全面調查最前沿的方法，並提出這些方法的詳細分類。具體來說，為了準確回答複雜的邏輯問題，先前的方法可以根據對外部求解器、提示、預訓練和微調的依賴進行分類。為了避免邏輯矛盾，我們討論了各種邏輯一致性的概念和解決方案，包括蘊涵、否定、遞移性、事實一致性及其複合體。此外，我們回顧了常用的基準資料集和評估指標，並討論了有希望的研究方向，例如擴展到模態邏輯以考慮不確定性，以及同時滿足多個邏輯一致性的高效演算法。
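
The magpie example can be turned into a small consistency check. This is only a sketch of the idea: the `answers` dictionary stands in for a model's yes/no outputs, and `transitivity_violations` is a hypothetical helper, not a method from the survey.

```python
def transitivity_violations(answers, chains):
    """Return every (premise_a, premise_b, conclusion) chain in which both
    premises are answered True but the conclusion is answered False."""
    violations = []
    for premise_a, premise_b, conclusion in chains:
        if answers[premise_a] and answers[premise_b] and not answers[conclusion]:
            violations.append((premise_a, premise_b, conclusion))
    return violations


# Yes/no answers as reported for Macaw in the abstract (True means "Yes").
answers = {
    "Is a magpie a bird?": True,
    "Does a bird have wings?": True,
    "Does a magpie have wings?": False,
}
chains = [
    ("Is a magpie a bird?", "Does a bird have wings?", "Does a magpie have wings?"),
]
violations = transitivity_violations(answers, chains)
```

With the answers above the single chain is flagged as a violation; changing the third answer to `True` makes the check pass.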

Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models

2502.15639v1 by Anirudh Sundar, Sinead Williamson, Katherine Metcalf, Barry-John Theobald, Skyler Seto, Masha Fedzechkina

Aligned representations across languages are a desired property in multilingual large language models (mLLMs), as alignment can improve performance in cross-lingual tasks. Typically, alignment requires fine-tuning a model, which is computationally expensive, and sizable language data, which often may not be available. A data-efficient alternative to fine-tuning is model intervention, a method for manipulating model activations to steer generation in the desired direction. We analyze the effect of a popular intervention (finding experts) on the alignment of cross-lingual representations in mLLMs. We identify the neurons to manipulate for a given language and introspect the embedding space of mLLMs pre- and post-manipulation. We show that modifying the mLLM's activations changes its embedding space such that cross-lingual alignment is enhanced. Further, we show that the changes to the embedding space translate into improved downstream performance on retrieval tasks, with up to 2x improvements in top-1 accuracy on cross-lingual retrieval.

摘要：跨語言一致的表示是多語言大型語言模型 (mLLM) 中的理想屬性，因為一致性可以提升跨語言任務的效能。一致性通常需要微調模型，這在運算上很昂貴，而且需要大量的語言資料，而這些資料通常可能無法取得。一種資料效率高的微調替代方案是模型介入，這是一種用於操縱模型活化以引導產生朝向所需方向的方法。我們分析了一種流行的介入（尋找專家）對 mLLM 中跨語言表示的一致性的影響。我們識別出要針對特定語言操縱的神經元，並內省 mLLM 在操縱前和操縱後的嵌入空間。我們展示修改 mLLM 的活化會改變其嵌入空間，使得跨語言一致性得到增強。此外，我們展示了嵌入空間的變化會轉化為檢索任務的後續效能提升，在跨語言檢索中，最高可將前 1 名的準確度提升 2 倍。
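
Top-1 accuracy on cross-lingual retrieval, the metric quoted above, can be sketched as follows. The setup is assumed: each query embedding should retrieve the candidate at the same index (its translation) by cosine similarity. The two-dimensional vectors are toy values and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_accuracy(queries, candidates):
    """Fraction of queries whose most similar candidate is the one at the
    same index, i.e. the paired translation."""
    hits = 0
    for i, query in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))
        hits += int(best == i)
    return hits / len(queries)


# Toy embeddings: the English queries and two versions of the French side.
english = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
french_poorly_aligned = [[0.6, 0.8], [0.8, 0.6], [1.0, 0.9]]
french_well_aligned = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]]

acc_poor = top1_accuracy(english, french_poorly_aligned)
acc_good = top1_accuracy(english, french_well_aligned)
```

The better-aligned toy space retrieves every translation correctly, while the poorly aligned one confuses the first two pairs.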

Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification

2502.15637v1 by Vasilii Feofanov, Songkang Wen, Marius Alonso, Romain Ilbert, Hongbo Guo, Malik Tiomoko, Lujia Pan, Jianfeng Zhang, Ievgen Redko

In recent years, there has been increasing interest in developing foundation models for time series data that can generalize across diverse downstream tasks. While numerous forecasting-oriented foundation models have been introduced, there is a notable scarcity of models tailored for time series classification. To address this gap, we present Mantis, a new open-source foundation model for time series classification based on the Vision Transformer (ViT) architecture that has been pre-trained using a contrastive learning approach. Our experimental results show that Mantis outperforms existing foundation models both when the backbone is frozen and when fine-tuned, while achieving the lowest calibration error. In addition, we propose several adapters to handle the multivariate setting, reducing memory requirements and modeling channel interdependence.

摘要：近年來，開發適用於時間序列資料且能廣泛化至不同下游任務的基礎模型備受關注。雖然已經推出許多以預測為導向的基礎模型，但專門針對時間序列分類的模型卻相當稀少。為了填補這個缺口，我們提出了 Mantis，這是一個新的開源時間序列分類基礎模型，它基於已經使用對比學習方法預先訓練過的 Vision Transformer (ViT) 架構。我們的實驗結果顯示，當主幹被凍結和微調時，Mantis 都勝過現有的基礎模型，同時達到最低的校準誤差。此外，我們提出了幾個適配器來處理多元設定，減少記憶體需求並建模通道之間的相互依賴性。

The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer

2502.15631v1 by Marthe Ballon, Andres Algaba, Vincent Ginis

Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting that new generations of reasoning models use test-time compute more effectively. Finally, we highlight that while o3-mini (h) achieves a marginal accuracy gain over o3-mini (m), it does so by allocating substantially more reasoning tokens across all problems, even the ones that o3-mini (m) can already solve. These findings provide new insights into the relationship between model capability and reasoning length, with implications for efficiency, scaling, and evaluation methodologies.

摘要：大型語言模型在數學推理方面展示了非凡的進展，利用思維鏈和測試時間運算縮放。然而，關於推理符號使用和準確性提升之間的相互作用，仍然存在許多未解決的問題。特別是，在比較不同世代的模型時，尚不清楚改善的效能是源自更長的推理鏈還是更有效的推理。我們系統性地分析了 Omni-MATH 基準上 o1-mini 和 o3-mini 變體的思維鏈長度，發現 o3-mini (m) 在不需要比 o1-mini 更長的推理鏈的情況下達到了更高的準確性。此外，我們表明，即使在控制問題難度的情況下，準確性通常會隨著所有模型和運算設定的推理鏈增長而下降。在更熟練的模型中，這種準確性下降幅度顯著較小，這表明新一代推理模型更有效地使用測試時間運算。最後，我們強調，儘管 o3-mini (h) 比 o3-mini (m) 獲得了邊際準確性提升，但它是在所有問題上分配了更多推理符號，甚至包括 o3-mini (m) 已經可以解決的問題。這些發現提供了對模型能力和推理長度之間關係的新見解，對效率、縮放和評估方法有影響。
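
The kind of analysis described here, accuracy as a function of reasoning length, can be sketched by binning responses by token count. The records below are invented for illustration, and `accuracy_by_length` is a hypothetical helper. The paper additionally controls for question difficulty, which this sketch would only approximate by running it separately per difficulty tier.

```python
from collections import defaultdict


def accuracy_by_length(records, bin_width):
    """Group (reasoning_tokens, correct) pairs into fixed-width token bins
    and return {bin_start: accuracy} in ascending bin order."""
    totals = defaultdict(lambda: [0, 0])  # bin_start -> [num_correct, count]
    for tokens, correct in records:
        start = (tokens // bin_width) * bin_width
        totals[start][0] += int(correct)
        totals[start][1] += 1
    return {start: c / n for start, (c, n) in sorted(totals.items())}


# Hypothetical (reasoning token count, answered correctly) records.
records = [
    (500, True), (800, True),
    (1500, True), (1900, False),
    (2200, True), (2500, False), (2700, False),
]
curve = accuracy_by_length(records, bin_width=1000)
```

A downward trend across bins, as in this made-up data, is the pattern the abstract reports: accuracy tends to decline as reasoning chains grow.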

Dynamic Knowledge Selector and Evaluator for recommendation with Knowledge Graph

2502.15623v1 by Feng Xia, Zhifei Hu

In recent years, recommendation systems have typically employed the edge information provided by knowledge graphs, combined with the high-order connectivity of graph networks. However, this method is limited by the sparsity of labels, cannot learn the graph structure well, and suffers from the large number of noisy entities in the knowledge graph, which affect the accuracy of the recommendation results. To alleviate these problems, we propose a dynamic knowledge-selecting and evaluating method guided by collaborative signals to distill information in the knowledge graph. Specifically, we use a Chain Route Evaluator to evaluate the contributions of different neighborhoods for the recommendation task and employ a Knowledge Selector strategy to filter the less informative knowledge before evaluating. We conduct baseline model comparison and experimental ablation evaluations on three public datasets. The experiments demonstrate that our proposed model outperforms current state-of-the-art baseline models, and the effectiveness of each module in our model is demonstrated through ablation experiments.

摘要：近年來推薦系統通常採用知識圖譜提供的邊緣資訊，結合圖網路的高階連通性優勢在推薦領域中。然而，此方法受限於標籤的稀疏性，無法學習圖結構，且知識圖譜中大量的雜訊實體將影響推薦結果的準確性。為了緩解上述問題，我們提出一個動態知識選擇與評估方法，由協同訊號引導，以提煉知識圖譜中的資訊。具體來說，我們使用鏈路路由評估器來評估不同鄰域對推薦任務的貢獻，並採用知識選擇策略在評估之前過濾較不具資訊性的知識。我們在三個公開資料集上進行基準模型比較和實驗性消融評估。實驗證明我們提出的模型優於目前的最新基準模型，且透過消融實驗證明我們模型中每個模組的有效性。

Extraction multi-étiquettes de relations en utilisant des couches de Transformer

2502.15619v1 by Ngoc Luyen Le, Gildas Tagny Ngompé

In this article, we present the BTransformer18 model, a deep learning architecture designed for multi-label relation extraction in French texts. Our approach combines the contextual representation capabilities of pre-trained language models from the BERT family - such as BERT, RoBERTa, and their French counterparts CamemBERT and FlauBERT - with the power of Transformer encoders to capture long-term dependencies between tokens. Experiments conducted on the dataset from the TextMine'25 challenge show that our model achieves superior performance, particularly when using CamemBERT-Large, with a macro F1 score of 0.654, surpassing the results obtained with FlauBERT-Large. These results demonstrate the effectiveness of our approach for the automatic extraction of complex relations in intelligence reports.

摘要：在本文中，我們展示了 BTransformer18 模型，這是一種深度學習架構，專門用於法語文本中的多標籤關係抽取。我們的做法結合了 BERT 家族中預訓練語言模型的上下文表示功能（例如 BERT、RoBERTa 及其法語對應模型 CamemBERT 和 FlauBERT）以及 Transformer 編碼器的功能，以捕捉標記之間的長期依賴性。在 TextMine'25 挑戰賽的數據集上進行的實驗表明，我們的模型取得了卓越的性能，特別是在使用 CamemBERT-Large 時，宏觀 F1 分數為 0.654，超過了使用 FlauBERT-Large 獲得的結果。這些結果證明了我們的方法對於自動抽取情報報告中的複雜關係的有效性。
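
The macro F1 score reported above averages per-label F1 over all relation labels, so rare labels count as much as frequent ones. A minimal multi-label sketch follows. The label names are hypothetical, and scoring a label with no true or predicted instances as 0 is an assumed convention that the abstract does not specify.

```python
def macro_f1(y_true, y_pred, labels):
    """Macro-averaged F1 for multi-label data.

    y_true and y_pred are lists of label sets, one set per example.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a label that never occurs scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical relation labels and predictions for three sentences.
labels = ["located_in", "works_for"]
y_true = [{"located_in"}, {"located_in", "works_for"}, {"works_for"}]
y_pred = [{"located_in"}, {"located_in"}, {"located_in", "works_for"}]
score = macro_f1(y_true, y_pred, labels)
```

Here `located_in` has F1 = 0.8 (two true positives, one false positive) and `works_for` has F1 = 2/3 (one true positive, one false negative), so the macro average is their unweighted mean.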

Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing

2502.15618v1 by Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar

We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing, using just 1.5% of FLOPs, can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at https://github.com/Qi-Le1/Probe_Pruning.

摘要：我們引入了探測剪枝（Probe Pruning，PP），這是一個新穎的框架，用於大語言模型（LLM）的線上、動態、結構化剪枝，以批次方式應用。PP 利用了這樣一個觀點：並非所有樣本和符號都對模型的輸出有同等的貢獻，並且探測每個批次的一小部分可以有效地識別關鍵權重，從而針對不同的批次啟用量身定制的動態剪枝。它包含三個主要階段：探測、歷史訊息剪枝和完整推論。在探測階段，PP 基於殘差重要性選擇一小組但關鍵的隱藏狀態，以運行一些模型層。在歷史訊息剪枝階段期間，PP 將探測狀態與歷史狀態策略性地整合在一起。隨後，它根據整合狀態和 PP 重要性分數對權重進行結構化剪枝，這是一個專門開發的指標，用於評估每個權重通道在維持效能方面的重要性。在最後階段，對剩餘權重進行完整推論。PP 的一個主要優點是它與現有模型相容，因為它在運行時不需要額外的神經網路模組或微調。對 LLaMA-2/3 和 OPT 模型的 PP 綜合評估表明，即使是最小的探測——僅使用 1.5% 的 FLOP——也可以大幅提高 LLM 的結構化剪枝效率。例如，當在 LLaMA-2-7B 上使用 WikiText2 進行評估時，PP 在 40% 的剪枝率下，每單位執行時間減少的效能下降率比最先進的方法低 2.56 倍。我們的程式碼可在 https://github.com/Qi-Le1/Probe_Pruning 取得。

Pastiche Novel Generation Creating: Fan Fiction You Love in Your Favorite Author's Style

2502.15616v1 by Xueran Han, Yuhan Liu, Mingzhe Li, Wei Liu, Sen Hu, Rui Yan, Zhiqiang Xu, Xiuying Chen

Great novels create immersive worlds with rich character arcs, well-structured plots, and nuanced writing styles. However, current novel generation methods often rely on brief, simplistic story outlines and generate details using plain, generic language. To bridge this gap, we introduce the task of Pastiche Novel Generation, which requires the generated novels to imitate the distinctive features of the original work, including understanding character profiles, predicting plausible plot developments, and writing concrete details using vivid, expressive language. To achieve this, we propose WriterAgent, a novel generation system designed to master the core aspects of literary pastiche. WriterAgent is trained through a curriculum learning paradigm, progressing from low-level stylistic mastery to high-level narrative coherence. Its key tasks include language style learning, character modeling, plot planning, and stylish writing, ensuring comprehensive narrative control. To support this, WriterAgent leverages the WriterLoRA framework, an extension of LoRA with hierarchical and cumulative task-specific modules, each specializing in a different narrative aspect. We evaluate WriterAgent on multilingual classics like Harry Potter and Dream of the Red Chamber, demonstrating its superiority over baselines in capturing the target author's settings, character dynamics, and writing style to produce coherent, faithful narratives.

摘要：偉大的小說創造出引人入勝的世界，擁有豐富的角色弧線、結構良好的情節和細緻入微的寫作風格。然而，當前的小說生成方法通常依賴於簡短、簡單的故事大綱，並使用平淡、通用的語言生成細節。為了彌合這一差距，我們引入了仿寫小說生成的任務，這要求生成的小說模仿原作的獨特特徵，包括理解角色簡介、預測合理的劇情發展，以及使用生動、富有表現力的語言撰寫具體細節。為此，我們提出了 WriterAgent，這是一個小說生成系統，旨在掌握文學仿寫的核心方面。WriterAgent 是通過課程學習範式進行訓練的，從低階的文體掌握進展到高階的敘事連貫性。它的主要任務包括語言風格學習、角色建模、情節規劃和文體寫作，確保全面的敘事控制。為了支持這一點，WriterAgent 利用 WriterLoRA 框架，這是一個 LoRA 的擴展，具有分層和累積的任務特定模組，每個模組都專精於不同的敘事方面。我們在《哈利波特》和《紅樓夢》等多語言經典作品上評估了 WriterAgent，展示了它在捕捉目標作者的場景、角色動態和寫作風格以產生連貫、忠實的敘事方面的優越性。

LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models

2502.15612v1 by Hugo Pitorro, Marcos Treviso

State space models (SSMs), such as Mamba, have emerged as an efficient alternative to transformers for long-context sequence modeling. However, despite their growing adoption, SSMs lack the interpretability tools that have been crucial for understanding and improving attention-based architectures. While recent efforts provide insights into Mamba's internal mechanisms, they do not explicitly decompose token-wise contributions, leaving gaps in understanding how Mamba selectively processes sequences across layers. In this work, we introduce LaTIM, a novel token-level decomposition method for both Mamba-1 and Mamba-2 that enables fine-grained interpretability. We extensively evaluate our method across diverse tasks, including machine translation, copying, and retrieval-based generation, demonstrating its effectiveness in revealing Mamba's token-to-token interaction patterns.

摘要：狀態空間模型（SSM），例如 Mamba，已成為 Transformer 在長上下文序列建模上的一種高效替代方案。然而，儘管採用率不斷提高，SSM 卻缺乏對理解和改進基於注意力的架構至關重要的可解釋性工具。雖然最近的研究成果提供了對 Mamba 內部機制的見解，但它們並未明確分解詞元級別的貢獻，在理解 Mamba 如何跨層選擇性地處理序列方面留下了空白。在這項工作中，我們介紹了 LaTIM，這是一種針對 Mamba-1 和 Mamba-2 的新型詞元級別分解方法，可以實現細粒度的可解釋性。我們在多項任務中對我們的方法進行了廣泛評估，包括機器翻譯、複製和基於檢索的生成，證明了其在揭示 Mamba 詞元到詞元交互模式方面的有效性。

PDeepPP: A Deep learning framework with Pretrained Protein language for peptide classification

2502.15610v1 by Jixiu Zhai, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Xueying Wang, Dan Huang

Protein post-translational modifications (PTMs) and bioactive peptides (BPs) play critical roles in various biological processes and have significant therapeutic potential. However, identifying PTM sites and bioactive peptides through experimental methods is often labor-intensive, costly, and time-consuming. As a result, computational tools, particularly those based on deep learning, have become effective solutions for predicting PTM sites and peptide bioactivity. Despite progress in this field, existing methods still struggle with the complexity of protein sequences and the challenge of requiring high-quality predictions across diverse datasets. To address these issues, we propose a deep learning framework that integrates pretrained protein language models with a neural network combining transformer and CNN for peptide classification. By leveraging the ability of pretrained models to capture complex relationships within protein sequences, combined with the predictive power of parallel networks, our approach improves feature extraction while enhancing prediction accuracy. This framework was applied to multiple tasks involving PTM site and bioactive peptide prediction, utilizing large-scale datasets to enhance the model's robustness. In the comparison across 33 tasks, the model achieved state-of-the-art (SOTA) performance in 25 of them, surpassing existing methods and demonstrating its versatility across different datasets. Our results suggest that this approach provides a scalable and effective solution for large-scale peptide discovery and PTM analysis, paving the way for more efficient peptide classification and functional annotation.

摘要：蛋白質轉譯後修飾 (PTM) 和生物活性胜肽 (BP) 在各種生物過程中扮演關鍵角色，並具有顯著的治療潛力。然而，透過實驗方法來識別 PTM 位點和生物活性胜肽通常需要大量人力、成本高昂且耗時。因此，計算工具，特別是基於深度學習的工具，已成為預測 PTM 位點和胜肽生物活性的有效解決方案。儘管此領域已有進展，現有方法仍難以應付蛋白質序列的複雜性，以及在各種資料集之間需要高品質預測的挑戰。為了解決這些問題，我們提出了一個深度學習架構，將預先訓練的蛋白質語言模型與結合Transformer和 CNN 的神經網路整合，用於胜肽分類。透過利用預先訓練的模型擷取蛋白質序列中複雜關係的能力，並結合平行網路的預測能力，我們的做法改善了特徵萃取，同時提高了預測準確度。此架構應用於涉及 PTM 位點和生物活性胜肽預測的多項任務，利用大規模資料集來增強模型的穩健性。在 33 項任務的比較中，此模型在其中 25 項任務中達到了最先進 (SOTA) 的效能，超越現有方法，並展現其在不同資料集之間的多功能性。我們的結果表明，此方法為大規模胜肽發現和 PTM 分析提供了一個可擴充且有效的解決方案，為更有效率的胜肽分類和功能註解鋪路。

On the Robustness of Transformers against Context Hijacking for Linear Classification

2502.15609v1 by Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou

Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.

摘要：基於 Transformer 的大型語言模型 (LLM) 已展現出強大的語境學習能力。然而，它們的預測可能因事實正確的語境而中斷，這種現象稱為語境劫持，揭示了一個重大的穩健性問題。為了在理論上理解這種現象，我們基於線性Transformer最近的進展，探討了一個語境線性分類問題。在我們的設定中，語境標記被設計成事實正確的問答對，其中查詢與最終查詢相似，但標籤相反。然後，我們對線性Transformer的穩健性進行了一般理論分析，該分析被表述為模型深度、訓練語境長度和劫持語境標記數的函數。一個關鍵發現是，訓練良好的較深Transformer可以實現更高的穩健性，這與經驗觀察一致。我們表明，這種改進之所以出現，是因為更深的層次可以實現更細粒度的優化步驟，有效地減輕了語境劫持的干擾。這也得到了我們的數值實驗的充分支持。我們的發現為更深層架構的優點提供了理論見解，並有助於增強對Transformer架構的理解。

Do Multilingual LLMs Think In English?

2502.15603v1 by Lisa Schut, Yarin Gal, Sebastian Farquhar

Large language models (LLMs) have multilingual capabilities and can solve tasks across various languages. However, we show that current LLMs make key decisions in a representation space closest to English, regardless of their input and output languages. Exploring the internal representations with a logit lens for sentences in French, German, Dutch, and Mandarin, we show that the LLM first emits representations close to English for semantically-loaded words before translating them into the target language. We further show that activation steering in these LLMs is more effective when the steering vectors are computed in English rather than in the language of the inputs and outputs. This suggests that multilingual LLMs perform key reasoning steps in a representation that is heavily shaped by English in a way that is not transparent to system users.

摘要：大型語言模型 (LLM) 具多語能力，且能解決各種語言的任務。然而，我們發現目前的 LLM 會在一個最接近英文的表徵空間中做出關鍵決策，而不管其輸入和輸出語言為何。我們使用對數機率透鏡探索法文、德文、荷蘭文和中文句子的內部表徵，並發現 LLM 會先針對語義負載高的字詞發射接近英文的表徵，再將其轉譯成目標語言。我們進一步發現，當引導向量以英文計算，而非輸入和輸出的語言計算時，這些 LLM 中的啟動引導會更有效。這表示多語 LLM 會在一個受英文強烈影響的表徵中執行關鍵推理步驟，而這對系統使用者來說是不透明的。
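
The logit lens mentioned above decodes an intermediate hidden state by projecting it through the model's output (unembedding) matrix and reading off the highest-scoring token. The toy sketch below uses invented two-dimensional vectors and a three-word vocabulary. It omits the final layer normalisation that real implementations usually apply first, and `logit_lens` is a hypothetical helper rather than a library call.

```python
def logit_lens(hidden_state, unembedding, vocab):
    """Project a hidden state onto the vocabulary and return the top token
    together with all logits. unembedding[i] is the output vector of vocab[i]."""
    logits = [sum(h * w for h, w in zip(hidden_state, row)) for row in unembedding]
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best], logits


# Toy vocabulary and unembedding matrix (illustrative values only).
vocab = ["cat", "chat", "Katze"]
unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Hypothetical hidden states for one French output position at two depths.
layer_states = {"middle": [0.9, 0.2], "final": [0.1, 0.8]}
decoded = {name: logit_lens(h, unembedding, vocab)[0] for name, h in layer_states.items()}
```

In this toy case the middle layer decodes to the English word and the final layer to the French one, the pattern the abstract describes for semantically loaded words.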

KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation

2502.15602v1 by Yoonjin Chung, Pilsun Eu, Junwon Lee, Keunwoo Choi, Juhan Nam, Ben Sangbae Chon

Although widely adopted for evaluating generated audio signals, the Fréchet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.

摘要：儘管已廣為採用來評估生成的音訊訊號，Fréchet 音訊距離 (FAD) 仍有顯著的限制，包括依賴高斯假設、對樣本大小敏感，以及高運算複雜度。作為替代方案，我們引入了核音訊距離 (KAD)，一種基於最大平均差異 (MMD) 的新穎、無分佈、無偏且運算效率高的度量。透過分析和實證驗證，我們證明了 KAD 的優點：(1) 較小的樣本量能更快速收斂，即使資料有限也能進行可靠的評估；(2) 運算成本較低，具有可擴充的 GPU 加速功能；以及 (3) 與人類感知判斷更為一致。透過利用進階嵌入和特徵核，KAD 能捕捉真實音訊與生成音訊之間的細微差異。KAD 以 kadtk 工具包開放原始碼，提供了一個有效率、可靠且與感知一致的基準，用於評估生成音訊模型。
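
KAD is described as being based on Maximum Mean Discrepancy, so a sketch of the standard unbiased MMD² estimator with a Gaussian (RBF) kernel may help. This is the textbook estimator, not the kadtk implementation: the kernel, the bandwidth, and the toy two-dimensional "embeddings" are all assumptions, and it does not use the kadtk API.

```python
import math


def rbf(x, y, bandwidth):
    """Gaussian RBF kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of squared MMD between samples xs and ys.
    Within-sample sums skip the diagonal (i == j), which removes the bias."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("each sample set needs at least two points")
    k_xx = sum(rbf(xs[i], xs[j], bandwidth) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(rbf(ys[i], ys[j], bandwidth) for i in range(n) for j in range(n) if i != j)
    k_xy = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2 * k_xy / (m * n)


# Toy embeddings: a reference set, a nearby set, and a shifted set.
reference = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
close = [[0.05, 0.05], [0.0, 0.05], [0.1, 0.05], [0.05, 0.0]]
far = [[3.0, 3.0], [3.1, 3.0], [3.0, 3.1], [3.1, 3.1]]

d_close = mmd2_unbiased(reference, close)
d_far = mmd2_unbiased(reference, far)
```

Because the estimator makes no Gaussian assumption about the embeddings, it needs no mean or covariance fitting. The unbiased form can come out slightly negative when two sample sets are nearly identical.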

WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents

2502.15601v1 by Xinhang Liu, Chi-Keung Tang, Yu-Wing Tai

Constructing photorealistic virtual worlds has applications across various fields, but it often requires the extensive labor of highly trained professionals to operate conventional 3D modeling software. To democratize this process, we introduce WorldCraft, a system where large language model (LLM) agents leverage procedural generation to create indoor and outdoor scenes populated with objects, allowing users to control individual object attributes and the scene layout using intuitive natural language commands. In our framework, a coordinator agent manages the overall process and works with two specialized LLM agents to complete the scene creation: ForgeIt, which integrates an ever-growing manual through auto-verification to enable precise customization of individual objects, and ArrangeIt, which formulates hierarchical optimization problems to achieve a layout that balances ergonomic and aesthetic considerations. Additionally, our pipeline incorporates a trajectory control agent, allowing users to animate the scene and operate the camera through natural language interactions. Our system is also compatible with off-the-shelf deep 3D generators to enrich scene assets. Through evaluations and comparisons with state-of-the-art methods, we demonstrate the versatility of WorldCraft, ranging from single-object customization to intricate, large-scale interior and exterior scene designs. This system empowers non-professionals to bring their creative visions to life.

摘要：建造逼真的虛擬世界在各個領域都有應用，但通常需要訓練有素的專業人員使用傳統的 3D 建模軟體進行大量工作。為了民主化這個過程，我們引入了 WorldCraft，一個大型語言模型 (LLM) 代理利用程序生成來創建室內和室外場景，其中充滿了物件，允許使用者使用直覺的自然語言命令控制個別物件屬性和場景佈局。在我們的架構中，協調員代理管理整體流程，並與兩個專業的 LLM 代理合作完成場景建立：ForgeIt，它透過自動驗證整合不斷成長的手冊，以實現個別物件的精確自訂，以及 ArrangeIt，它制定階層式最佳化問題，以取得平衡人體工學和美學考量的佈局。此外，我們的管線還包含一個軌跡控制代理，允許使用者透過自然語言互動為場景製作動畫並操作相機。我們的系統也與現成的深度 3D 生成器相容，以豐富場景資產。透過評估和與最新方法的比較，我們展示了 WorldCraft 的多功能性，從單一物件自訂到複雜的大型室內和室外場景設計。此系統讓非專業人士能夠實現他們的創意願景。

Robust Bias Detection in MLMs and its Application to Human Trait Ratings

2502.15600v1 by Ingroj Shrestha, Louis Tay, Padmini Srinivasan

There has been significant prior work using templates to study bias against demographic attributes in MLMs. However, these have limitations: they overlook random variability of templates and target concepts analyzed, assume equality amongst templates, and overlook bias quantification. Addressing these, we propose a systematic statistical approach to assess bias in MLMs, using mixed models to account for random effects, pseudo-perplexity weights for sentences derived from templates and quantify bias using statistical effect sizes. Replicating prior studies, we match on bias scores in magnitude and direction with small to medium effect sizes. Next, we explore the novel problem of gender bias in the context of $\textit{personality}$ and $\textit{character}$ traits, across seven MLMs (base and large). We find that MLMs vary; ALBERT is unbiased for binary gender but the most biased for non-binary $\textit{neo}$, while RoBERTa-large is the most biased for binary gender but shows small to no bias for $\textit{neo}$. There is some alignment of MLM bias and findings in psychology (human perspective) - in $\textit{agreeableness}$ with RoBERTa-large and $\textit{emotional stability}$ with BERT-large. There is general agreement for the remaining 3 personality dimensions: both sides observe at most small differences across gender. For character traits, human studies on gender bias are limited thus comparisons are not feasible.

摘要：先前已有許多研究使用範本來探討遮罩語言模型 (MLM) 中針對人口屬性的偏見。然而，這些研究有其限制：它們忽略了範本和所分析目標概念的隨機變異性，假設各範本同等重要，並且忽略了偏見的量化。為了解決這些問題，我們提出了一種系統性的統計方法來評估 MLM 中的偏見：使用混合模型來考量隨機效應，對範本衍生的句子採用偽困惑度權重，並使用統計效應量來量化偏見。在重現先前研究時，我們得到的偏見分數在大小和方向上與其相符，效應量為小到中等。接下來，我們探討在「人格」(personality) 和「品格」(character) 特質的脈絡中性別偏見的新問題，涵蓋七個 MLM（base 和 large）。我們發現各 MLM 有所不同：ALBERT 對二元性別沒有偏見，但對非二元的「neo」偏見最大，而 RoBERTa-large 對二元性別的偏見最大，但對「neo」的偏見很小甚至沒有。MLM 偏見與心理學（人類觀點）的發現有部分一致：RoBERTa-large 在「宜人性」上一致，BERT-large 在「情緒穩定性」上一致。對於其餘 3 個人格向度，兩者大致相符：雙方都觀察到性別之間至多只有很小的差異。對於品格特質，關於性別偏見的人類研究有限，因此無法進行比較。
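
Two quantities named in this abstract have standard definitions that are easy to sketch: pseudo-perplexity (the exponentiated negative mean of per-token log-probabilities, each obtained by masking that token) and an effect size such as Cohen's d. How the paper converts pseudo-perplexity into sentence weights, and which effect size it reports, is not stated in the abstract, so the code below shows only the generic formulas with made-up inputs.

```python
import math
import statistics


def pseudo_perplexity(token_log_probs):
    """exp of the negative mean log-probability, where each log-probability
    is that of a token predicted with only that token masked."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / (n_a + n_b - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(pooled_var)


# Made-up inputs: four tokens each predicted with probability 0.5, and
# hypothetical bias scores for two demographic groups.
ppl = pseudo_perplexity([math.log(0.5)] * 4)
effect = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A lower pseudo-perplexity means the masked language model finds the sentence more natural, which is presumably why it serves as a weight for template-derived sentences instead of treating every template equally.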

SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention

2502.15594v1 by Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan

With the widespread real-world deployment of large language models (LLMs), ensuring their behavior complies with safety standards has become crucial. Jailbreak attacks exploit vulnerabilities in LLMs to induce undesirable behavior, posing a significant threat to LLM safety. Previous defenses often fail to achieve both effectiveness and efficiency simultaneously. Defenses from a representation perspective offer new insights, but existing interventions cannot dynamically adjust representations based on the harmfulness of the queries. To address this limitation while ensuring both effectiveness and efficiency, we propose SafeIntervention (SafeInt), a novel defense method that shields LLMs from jailbreak attacks through safety-aware representation intervention. SafeInt is built on our analysis of the representations of jailbreak samples. It adjusts representation distributions of jailbreak samples through intervention to align them with the representations of unsafe samples while minimizing unnecessary perturbations to jailbreak-irrelevant representations. We conduct comprehensive experiments covering six jailbreak attacks, two jailbreak datasets, and two utility benchmarks. Experimental results demonstrate that SafeInt outperforms all baselines in defending LLMs against jailbreak attacks while largely maintaining utility. Additionally, we evaluate SafeInt against adaptive attacks and verify its effectiveness in mitigating real-time attacks.

摘要：隨著大型語言模型 (LLM) 在現實世界中的廣泛部署，確保其行為符合安全標準已變得至關重要。越獄攻擊利用 LLM 中的漏洞來誘發不良行為，對 LLM 安全構成重大威脅。以前的防禦措施通常無法同時實現有效性和效率。從表示的角度出發的防禦措施提供了新的見解，但現有的干預措施無法根據查詢的危害性動態調整表示。為了在確保有效性和效率的同時解決這個限制，我們提出了 SafeIntervention (SafeInt)，這是一種通過安全感知表示干預來保護 LLM 免受越獄攻擊的新型防禦方法。SafeInt 建立在我們對越獄樣本表示的分析之上。它通過干預調整越獄樣本的表示分佈，使它們與不安全樣本的表示保持一致，同時最大限度地減少對與越獄無關的表示的不必要擾動。我們進行了涵蓋六次越獄攻擊、兩個越獄數據集和兩個實用基準的綜合實驗。實驗結果表明，SafeInt 在抵禦越獄攻擊方面優於所有基線，同時在很大程度上保持了實用性。此外，我們評估了 SafeInt 對適應性攻擊的抵抗力，並驗證了其在減輕實時攻擊方面的有效性。

Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning

2502.15592v1 by Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch

Long-context modelling for large language models (LLMs) has been a key area of recent research because many real-world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position, and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long-context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long-context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: https://github.com/NJUNLP/context-synthesis.

摘要：大型語言模型 (LLM) 的長脈絡建模一直是近期研究的重點領域，因為許多實際的應用案例需要對較長的輸入（例如文件）進行推理。對長脈絡建模的研究重點在於如何建模位置，而對語言建模的其他重要面向（例如指令微調）則鮮少探討。長脈絡訓練範例的建立和使用具有挑戰性且成本高昂。在本文中，我們探討如何為長脈絡預訓練模型的後訓練階段設計指令資料：需要多少以及哪種類型的脈絡才能實現最佳且高效的後訓練。我們的受控研究顯示，在短脈絡上進行指令微調的模型可以有效地推廣到較長的脈絡，同時也能找出其他關鍵因素，例如指令難度和脈絡組成。根據這些發現，我們提出脈絡合成，一種新穎的資料合成架構，它利用現成的 LLM 為高品質的指令回答對生成擴充的背景脈絡。文件級基準測試 (LongBench) 上的實驗結果證明，我們提出的方法優於先前的指令合成方法，並且接近人工標註長脈絡指令資料的效能。該專案將於以下位置提供： https://github.com/NJUNLP/context-synthesis。

LightThinker: Thinking Step-by-Step Compression

2502.15589v1 by Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang

Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code will be released at https://github.com/zjunlp/LightThinker.

摘要：大型語言模型 (LLM) 已在複雜推理任務中展現出卓越的效能，但其效率受到生成冗長符號所伴隨的龐大記憶體和運算成本所阻礙。在本文中，我們提出 LightThinker，一種新穎的方法，使 LLM 能在推理過程中動態壓縮中間想法。LightThinker 受到人類認知過程的啟發，將冗長的思考步驟壓縮成緊湊的表示，並捨棄原始的推理鏈，從而大幅減少儲存在內容視窗中的符號數量。這是透過訓練模型來決定何時以及如何執行壓縮，透過資料建構、將隱藏狀態對應到濃縮要點符號，以及建立專門的注意力遮罩來達成。此外，我們引入依賴性 (Dep) 指標，透過衡量生成過程中對歷史符號的依賴程度，來量化壓縮程度。在四個資料集和兩個模型上的廣泛實驗顯示，LightThinker 減少了峰值記憶體使用量和推論時間，同時維持有競爭力的準確度。我們的研究為改善 LLM 在複雜推理任務中的效率提供了新的方向，而無需犧牲效能。程式碼將在 https://github.com/zjunlp/LightThinker 釋出。

Improving the Scaling Laws of Synthetic Data with Deliberate Practice

2502.15588v1 by Reyhane Askari-Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano

Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.

摘要：受到人类学习中刻意练习原则的启发，我们提出合成数据生成（DP）的刻意练习，这是一种通过动态合成数据生成来提高样本效率的新框架。先前的研究表明，扩展合成数据本质上具有挑战性，因为简单地添加新数据会导致收益递减。为了解决这个问题，剪枝已被确定为改善扩展的关键机制，使模型能够专注于信息量最大的合成样本。DP 不是生成大型数据集然后对其进行剪枝，而是有效地逼近信息样本的直接生成。我们从理论上展示了如何通过具有挑战性的信息示例进行训练来改善扩展定律，并通过经验验证，DP 在训练样本和迭代次数明显减少的情况下实现了更好的扩展性能。在 ImageNet-100 上，DP 生成的样本减少了 3.4 倍，所需的迭代次数减少了 6 倍；在 ImageNet-1k 上，它生成的样本减少了 8 倍，迭代次数减少了 30%，同时与以前的工作相比，实现了卓越的性能。

Chats-Grid: An Iterative Retrieval Q&A Optimization Scheme Leveraging Large Model and Retrieval Enhancement Generation in smart grid

2502.15583v1 by Yunfeng Li, Jiqun Zhang, Guofu Liao, Xue Shi, Junhong Liu

With rapid advancements in artificial intelligence, question-answering (Q&A) systems have become essential in intelligent search engines, virtual assistants, and customer service platforms. However, in dynamic domains like smart grids, conventional retrieval-augmented generation (RAG) Q&A systems face challenges such as inadequate retrieval quality, irrelevant responses, and inefficiencies in handling large-scale, real-time data streams. This paper proposes an optimized iterative retrieval-based Q&A framework called Chats-Grid, tailored for smart grid environments. In the pre-retrieval phase, Chats-Grid's advanced query expansion ensures comprehensive coverage of diverse data sources, including sensor readings, meter records, and control system parameters. During retrieval, Chats-Grid combines Best Matching 25 (BM25) sparse retrieval and BAAI General Embedding (BGE) dense retrieval to process vast, heterogeneous datasets effectively. Post-retrieval, a fine-tuned large language model uses prompt engineering to assess relevance, filter irrelevant results, and reorder documents based on contextual accuracy. The model then generates precise, context-aware answers, adhering to quality criteria and employing a self-checking mechanism for enhanced reliability. Experimental results demonstrate Chats-Grid's superiority over state-of-the-art methods in fidelity, contextual recall, relevance, and accuracy by 2.37%, 2.19%, and 3.58% respectively. This framework advances smart grid management by improving decision-making and user interactions, fostering resilient and adaptive smart grid infrastructures.

摘要：隨著人工智慧的快速進展，問答 (Q&A) 系統已成為智慧型搜尋引擎、虛擬助理和客戶服務平台中不可或缺的一部分。然而，在智慧電網等動態領域中，傳統的檢索擴充生成 (RAG) 問答系統面臨檢索品質不足、回應不相關以及處理大規模即時資料串流時效率低下的挑戰。本文提出一個名為 Chats-Grid 的最佳化迭代式檢索問答架構，專門針對智慧電網環境設計。在預檢索階段，Chats-Grid 的進階查詢擴充可確保全面涵蓋各種資料來源，包括感測器讀數、電表記錄和控制系統參數。在檢索期間，Chats-Grid 中的最佳配對 25 (BM25) 稀疏檢索和 BAAI 通用嵌入 (BGE) 稠密檢索相結合，可有效處理龐大且異質的資料集。在後檢索階段，經過微調的大型語言模型使用提示工程來評估相關性、篩選不相關的結果，並根據上下文準確性重新排序文件。該模型進一步生成精確且符合脈絡的答案，符合品質標準並採用自我檢查機制以增強可靠性。實驗結果證明，Chats-Grid 在保真度、上下文召回率、相關性和準確性方面分別比現有技術高出 2.37%、2.19% 和 3.58%。此架構透過改善決策制定和使用者互動，促進具復原力和適應力的智慧電網基礎設施，進而推動智慧電網管理的進步。
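
The abstract says that BM25 sparse retrieval and BGE dense retrieval are combined, but it does not say how. One common way to merge two ranked lists is reciprocal rank fusion (RRF). The sketch below is an illustration under that assumption, not the authors' method; the document ids and the constant `k=60` are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical retriever outputs (best hit first).
sparse_hits = ["meter_log", "sensor_a", "ctrl_params"]     # stand-in for BM25
dense_hits = ["sensor_a", "ctrl_params", "outage_report"]  # stand-in for BGE
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still survives into the fused list for the later filtering and reordering stage.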

Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

2502.15576v1 by Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explored how to better explain SAE features, i.e., understanding the semantic meaning of features learned by SAE. Our theoretical analysis reveals that existing explanation methods suffer from the frequency bias issue, where they emphasize linguistic patterns over semantic concepts, while the latter is more critical to steer LLM behaviors. To address this, we propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective, aiming to better capture the semantic meaning behind these features. We further propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations. Empirical results show that, compared to baselines, our method provides more discourse-level explanations and effectively steers LLM behaviors to defend against jailbreak attacks. These findings highlight the value of explanations for steering LLM behaviors in downstream applications. We will release our code and data once accepted.

摘要：大型語言模型 (LLM) 擅長處理人類查詢，但它們偶爾會產生有缺陷或意外的回應。了解它們的內部狀態對於理解它們的成功、診斷它們的失敗和完善它們的能力至關重要。儘管稀疏自動編碼器 (SAE) 已顯示出解釋 LLM 內部表示的希望，但有限的研究探索了如何更好地解釋 SAE 特徵，即理解 SAE 學習的特徵的語義含義。我們的理論分析表明，現有的解釋方法存在頻率偏差問題，它們強調語言模式而不是語義概念，而後者對於引導 LLM 行為更為關鍵。為了解決這個問題，我們建議使用固定詞彙集進行特徵解釋，並設計一個基於互信息的目標，旨在更好地捕捉這些特徵背後的語義含義。我們進一步提出了兩種運行時引導策略，根據它們對應的解釋調整學習到的特徵激活。經驗結果表明，與基線相比，我們的模型提供了更多話語級別的解釋，並有效地引導 LLM 行為以抵禦越獄攻擊。這些發現突出了解釋對於引導 LLM 行為在下游應用中的價值。一旦被接受，我們將發布我們的代碼和數據。
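
To make the mutual-information idea concrete, the sketch below scores how informative a vocabulary word is about a binary "feature is active" signal, using a 2x2 table of co-occurrence counts. The counts are invented for illustration, and the paper's actual objective may differ from this textbook definition.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between two binary events.

    n11: feature active and word present    n10: feature active, word absent
    n01: feature inactive, word present     n00: neither
    """
    total = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for joint, feature_marginal, word_marginal in cells:
        if joint:  # empty cells contribute nothing
            mi += (joint / total) * math.log2(
                joint * total / (feature_marginal * word_marginal)
            )
    return mi

# Hypothetical counts over 100 text spans.
mi_frequent = mutual_information(48, 2, 47, 3)   # a word that appears almost everywhere
mi_specific = mutual_information(40, 10, 5, 45)  # a word tied to the feature
print(mi_frequent, mi_specific)
```

A word that occurs in nearly every span co-occurs with the feature often but carries almost no information about it, which is one way to see why a frequency-driven explanation can mislead.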

DReSD: Dense Retrieval for Speculative Decoding

2502.15572v1 by Milan Gritta, Huiyin Xue, Gerasimos Lampouras

Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its outputs. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Sparse retrieval (REST), which operates on the surface form of strings, is currently the dominant paradigm due to its simplicity and scalability. However, its effectiveness is limited due to the usage of short contexts and exact string matching. Instead, we introduce Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses approximate nearest neighbour search with contextualised token embeddings to retrieve the most semantically relevant token sequences for SD. Extensive experiments show that DReSD achieves (on average) 87% higher acceptance rates, 65% longer accepted tokens and 19% faster generation speeds compared to sparse retrieval (REST).

摘要：推測解碼 (SD) 透過使用一個有效率的草稿模型來建議下幾個符號，由 LLM 在單一前向呼叫中驗證，來加速大型語言模型 (LLM) 的產生，同時降低延遲並保留其輸出。我們專注於基於檢索的 SD，其中草稿模型從非參數化資料儲存中檢索下一個符號。稀疏檢索 (REST) 是目前的主流範例，因為它簡單且可擴充，並操作字串的表面形式。然而，由於使用短脈絡和精確字串比對，其有效性受到限制。相反地，我們引入了推測解碼的稠密檢索 (DReSD)，這是一個新穎的架構，使用情境化符號嵌入與近似最近鄰搜尋來檢索語意最相關的符號序列以進行 SD。廣泛的實驗顯示，與稀疏檢索 (REST) 相比，DReSD 達到了（平均）87% 較高的接受率、65% 較長的接受符號和 19% 較快的產生速度。
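
The draft-then-verify loop described above can be sketched for the greedy-decoding case as follows. The toy target model and all names are hypothetical; a real system would score every draft position in a single batched forward pass rather than one call per position.

```python
def verify_draft(prefix, draft, target_next_token):
    """Accept the longest draft prefix that the target model agrees with.

    target_next_token(tokens) returns the target model's greedy next token.
    The result always ends with one token chosen by the target model,
    so the output matches plain greedy decoding.
    """
    accepted = []
    for token in draft:
        expected = target_next_token(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(token)
    # The whole draft was accepted; the verification pass yields a bonus token.
    accepted.append(target_next_token(prefix + accepted))
    return accepted

# Hypothetical toy "target model" that always continues one fixed sentence.
SENTENCE = "the cat sat on the mat".split()

def toy_target(tokens):
    return SENTENCE[len(tokens)] if len(tokens) < len(SENTENCE) else "<eos>"

out = verify_draft(["the", "cat"], ["sat", "on", "a"], toy_target)
print(out)
```

The better the retrieved draft matches what the target model would have produced, the more tokens are accepted per verification call, which is the acceptance rate the abstract reports.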

A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in Germany

2502.15568v1 by Ina Dormuth, Sven Franke, Marlies Hafer, Tim Katzke, Alexander Marx, Emmanuel Müller, Daniel Neider, Markus Pauly, Jérôme Rutinowski

In this study, we examine the reliability of AI-based Voting Advice Applications (VAAs) and large language models (LLMs) in providing objective political information. Our analysis is based on a comparison with party responses to 38 statements of the Wahl-O-Mat, a well-established German online tool that helps inform voters by comparing their views with political party positions. For the LLMs, we identify significant biases. They exhibit a strong alignment (over 75% on average) with left-wing parties and a substantially lower alignment with center-right (below 50%) and right-wing parties (around 30%). Furthermore, for the VAAs, which are intended to inform voters objectively, we found substantial deviations from the parties' stated positions in Wahl-O-Mat: while one VAA deviated in 25% of cases, another VAA showed deviations in more than 50% of cases. For the latter, we even observed that simple prompt injections led to severe hallucinations, including false claims such as non-existent ties between political parties and right-wing extremists.

摘要：在這項研究中，我們探討了以人工智慧為基礎的投票建議應用程式 (VAA) 和大型語言模型 (LLM) 在提供客觀政治資訊方面的可靠性。我們的分析是根據政黨對 Wahl-O-Mat 的 38 項聲明所做的回應進行比較，Wahl-O-Mat 是德國一個完善的線上工具，透過比較選民的觀點與政黨立場，協助選民做出明智的決定。對於 LLM，我們發現有顯著的偏見。它們展現出與左翼政黨有強烈的傾向（平均超過 75%），與中間偏右（小於 50%）和右翼政黨的傾向則大幅降低（約 30%）。此外，對於旨在客觀告知選民的 VAA，我們發現與政黨在 Wahl-O-Mat 中所述立場有顯著的偏差：一個 VAA 在 25% 的案例中出現偏差，另一個 VAA 則在超過 50% 的案例中出現偏差。對於後者，我們甚至觀察到簡單的提示注入會導致嚴重的幻覺，包括錯誤的說法，例如政黨之間不存在關聯，以及與右翼極端主義者的關聯。

Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation

2502.15563v1 by Tim Rädsch, Leon Mayer, Simon Pavicic, A. Emre Kavur, Marcel Knopp, Barış Öztürk, Klaus Maier-Hein, Paul F. Jaeger, Fabian Isensee, Annika Reinke, Lena Maier-Hein

Reliable evaluation of AI models is critical for scientific progress and practical application. While existing VLM benchmarks provide general insights into model capabilities, their heterogeneous designs and limited focus on a few imaging domains pose significant challenges for both cross-domain performance comparison and targeted domain-specific evaluation. To address this, we propose three key contributions: (1) a framework for the resource-efficient creation of domain-specific VLM benchmarks enabled by task augmentation for creating multiple diverse tasks from a single existing task, (2) the release of new VLM benchmarks for seven domains, created according to the same homogeneous protocol and including 162,946 thoroughly human-validated answers, and (3) an extensive benchmarking of 22 state-of-the-art VLMs on a total of 37,171 tasks, revealing performance variances across domains and tasks, thereby supporting the need for tailored VLM benchmarks. Adoption of our methodology will pave the way for the resource-efficient domain-specific selection of models and guide future research efforts toward addressing core open questions.

摘要：可靠的 AI 模型評估對於科學進展和實際應用至關重要。雖然現有的 VLM 基準提供了對模型功能的一般見解，但它們的異質設計和對少數影像領域的有限關注，對跨領域效能比較和目標領域特定評估構成了重大挑戰。為了解決這個問題，我們提出了三個關鍵貢獻：(1) 一個資源有效率的領域特定 VLM 基準建立架構，透過任務擴充，從單一現有任務建立多個不同的任務，(2) 根據相同的同質協定，發布七個領域的新 VLM 基準，包括 162,946 個經過徹底人為驗證的答案，以及 (3) 對總共 37,171 個任務的 22 個最先進的 VLM 進行廣泛的基準測試，揭示跨領域和任務的效能差異，從而支持對量身打造的 VLM 基準的需求。採用我們的技術將為資源有效率的領域特定模型選擇鋪路，並引導未來的研究工作，以解決核心開放問題。

PIP-KAG: Mitigating Knowledge Conflicts in Knowledge-Augmented Generation via Parametric Pruning

2502.15543v1 by Pengcheng Huang, Zhenghao Liu, Yukun Yan, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong

Knowledge-Augmented Generation (KAG) has shown great promise in updating the internal memory of Large Language Models (LLMs) by integrating external knowledge. However, KAG inevitably faces knowledge conflicts when the internal memory contradicts external information. Current approaches to mitigating these conflicts mainly focus on improving external knowledge utilization. These methods have shown only limited effectiveness, as internal knowledge continues to influence the generation process of LLMs. In this paper, we propose a ParametrIc Pruning-based Knowledge-Augmented Generation (PIP-KAG) approach, which prunes the internal knowledge of LLMs and incorporates a plug-and-play adaptation module to help LLMs better leverage external sources. Additionally, we construct the CoConflictQA benchmark based on the hallucinations of LLMs to better evaluate contextual faithfulness during question answering. Experimental results on CoConflictQA demonstrate that PIP-KAG significantly reduces knowledge conflicts and improves context fidelity. Notably, PIP-KAG reduces the LLM's parameters by 13%, enhancing parameter efficiency in LLMs within the KAG framework. All code is available at https://github.com/OpenBMB/PIP-KAG.

摘要：知識增強生成（KAG）已顯示出透過整合外部知識來更新大型語言模型（LLM）內部記憶體的巨大前景。然而，當內部記憶體與外部資訊產生矛盾時，KAG 必然會面臨知識衝突。目前減輕這些衝突的方法主要集中於改善外部知識利用率。然而，由於內部知識持續影響 LLM 的生成過程，這些方法在減輕知識衝突問題上僅展現出有限的效能。在本文中，我們提出一個基於參數化剪枝的知識增強生成（PIP-KAG）方法，它會剪枝 LLM 的內部知識，並整合一個即插即用的適應模組，以協助 LLM 更有效地利用外部來源。此外，我們根據 LLM 的幻覺建構 CoConflictQA 基準，以在回答問題時更好地評估脈絡忠實度。在 CoConflictQA 上的實驗結果證明，PIP-KAG 大幅減少了知識衝突，並提高了脈絡保真度。值得注意的是，PIP-KAG 將 LLM 的參數減少了 13%，提高了 KAG 框架內 LLM 的參數效率。所有程式碼都可以在 https://github.com/OpenBMB/PIP-KAG 獲得。

Bridging Domain Gaps between Pretrained Multimodal Models and Recommendations

2502.15542v1 by Wenyu Zhang, Jie Luo, Xinming Zhang, Yuan Fang

With the explosive growth of multimodal content online, pre-trained visual-language models have shown great potential for multimodal recommendation. However, while these models achieve decent performance when applied in a frozen manner, surprisingly, due to significant domain gaps (e.g., feature distribution discrepancy and task objective misalignment) between pre-training and personalized recommendation, adopting a joint training approach instead leads to performance worse than baseline. Existing approaches either rely on simple feature extraction or require computationally expensive full model fine-tuning, struggling to balance effectiveness and efficiency. To tackle these challenges, we propose Parameter-efficient Tuning for Multimodal Recommendation (PTMRec), a novel framework that bridges the domain gap between pre-trained models and recommendation systems through a knowledge-guided dual-stage parameter-efficient training strategy. This framework not only eliminates the need for costly additional pre-training but also flexibly accommodates various parameter-efficient tuning methods.

摘要：隨著線上多模態內容的爆炸性成長，預先訓練的視覺語言模型已展現出多模態推薦的巨大潛力。然而，儘管這些模型在以凍結方式應用時能獲得不錯的效能，但令人驚訝的是，由於預訓練與個人化推薦之間存在顯著的領域差距（例如，特徵分佈差異和任務目標未對齊），採用聯合訓練方法反而導致效能比基準線更差。現有的方法依賴於簡單的特徵萃取或需要計算成本高昂的完整模型微調，難以平衡效能與效率。為了應對這些挑戰，我們提出針對多模態推薦的參數高效調整（PTMRec），這是一個新穎的架構，透過知識引導的雙階段參數高效訓練策略，彌合預訓練模型與推薦系統之間的領域差距。此架構不僅消除了額外預訓練的高昂成本，還能靈活地容納各種參數高效調整方法。

2502.15538v1 by Wenyuan Zhang, Tianyun Liu, Mengxiao Song, Xiaodong Li, Tingwen Liu

Despite the abundance of prior social strategies possessed by humans, there remains a paucity of research dedicated to their transfer and integration into social agents. Our proposed SOTOPIA-Ω framework aims to bridge this gap, with a particular focus on enhancing the social capabilities of language agents. This framework dynamically injects multi-step reasoning strategies inspired by negotiation theory, along with two simple direct strategies, into expert agents, thereby automating the construction of a high-quality social dialogue training corpus. Additionally, we introduce the concept of Social Instruction Following (S-IF) and propose two new S-IF evaluation metrics that are complementary to social capability. We demonstrate that several 7B models trained on this high-quality corpus not only significantly surpass the expert agent (GPT-4) in achieving social goals but also improve S-IF performance. Analysis and variant experiments validate the advantages of dynamic construction, which is especially effective at breaking an agent's prolonged deadlock.

摘要：儘管人類擁有豐富的既有社交策略，但鮮少有研究致力於將其轉移並整合到社交代理中。我們提出的 SOTOPIA-{\Omega} 架構旨在解決並彌補這個差距，特別著重於增強語言代理的社交能力。此架構將靈感來自協商理論的多步驟推理策略，以及兩個簡單的直接策略，動態注入到專家代理中，從而自動建構高品質的社交對話訓練語料庫。此外，我們引入了社交指令遵循 (S-IF) 的概念，並提出了兩個新的 S-IF 評量指標，作為社交能力的補充。我們證明了幾個在高品質語料庫上訓練的 7B 模型，不僅在達成社交目標方面顯著超越專家代理 (GPT-4)，也提升了 S-IF 的表現。分析和變異實驗驗證了動態建構的優點，特別可以打破代理的長期僵局。

Activation Steering in Neural Theorem Provers

2502.15507v1 by Shashank Kirtania

Large Language Models (LLMs) have shown promise in proving formal theorems using proof assistants like Lean. However, current state-of-the-art language models struggle to predict the next step in a proof, leading practitioners to use different sampling techniques to improve LLMs' capabilities. We observe that the LLM is capable of predicting the correct tactic; however, it faces challenges in ranking it appropriately within the set of candidate tactics, affecting the overall selection process. To overcome this hurdle, we use activation steering to guide LLM responses and improve generations at inference time. Our results suggest that activation steering offers a promising lightweight alternative to specialized fine-tuning for enhancing theorem proving capabilities in LLMs, which is particularly valuable in resource-constrained environments.

摘要：大型語言模型 (LLM) 已證明在使用 Lean 等證明輔助工具證明形式定理方面很有前景。然而，當前最先進的語言模型難以預測證明中的下一步，導致從業者使用不同的抽樣技術來改善 LLM 的能力。我們觀察到 LLM 能夠預測正確的策略；然而，它在候選策略集中對其進行適當排序時面臨挑戰，從而影響整體選擇過程。為了克服這個障礙，我們使用激活引導來指導 LLM 的響應，以在推理時改善生成。我們的結果表明，激活引導為增強 LLM 中的定理證明能力提供了一個有前途的輕量級替代方案，特別是在資源受限的環境中很有價值。
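
Activation steering, in its simplest form, adds a fixed direction to a hidden state at inference time. The sketch below builds that direction as the difference between mean activations on two sets of examples. This is a widely used construction, assumed here for illustration; the abstract does not specify how the steering vectors are obtained, and the vectors below are toy values.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Direction pointing from the 'negative' mean towards the 'positive' mean."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations, e.g. recorded on successful vs. failed tactic predictions.
good = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
bad = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
v = steering_vector(good, bad)
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
print(v, steered)
```

No weights are updated, which is why the approach is lightweight compared with fine-tuning: only `alpha` and the layer at which the shift is applied need to be chosen.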

BAN: Neuroanatomical Aligning in Auditory Recognition between Artificial Neural Network and Human Cortex

2502.15503v1 by Haidong Wang, Pengfei Xiao, Ao Liu, Jianhua Zhang, Qia Shan

Drawing inspiration from neuroscience, artificial neural networks (ANNs) have evolved from shallow architectures to highly complex, deep structures, yielding exceptional performance in auditory recognition tasks. However, traditional ANNs often struggle to align with brain regions due to their excessive depth and lack of biologically realistic features, such as recurrent connections. To address this, a brain-like auditory network (BAN) is introduced, which incorporates four neuroanatomically mapped areas and recurrent connections, guided by a novel metric called the brain-like auditory score (BAS). BAS serves as a benchmark for evaluating the similarity between BAN and the human auditory recognition pathway. We further propose that specific areas in the cerebral cortex, mainly the middle and medial superior temporal (T2/T3) areas, correspond to the designed network structure, drawing parallels with the brain's auditory perception pathway. Our findings suggest that the neuroanatomical similarity to the cortex and the auditory classification abilities of the ANN are well aligned. In addition to delivering excellent performance on a music genre classification task, the BAN achieves a high BAS. In conclusion, this study presents BAN as a recurrent, brain-inspired ANN, representing the first model that mirrors the cortical pathway of auditory recognition.

摘要：從神經科學中汲取靈感，人工神經網路 (ANN) 已從淺層架構演變成高度複雜的深度結構，在聽覺辨識任務中產生了非凡的表現。然而，傳統的 ANN 往往因為深度過深且缺乏生物學上真實的特徵（例如遞迴連接），而難以與大腦區域對齊。為了解決這個問題，引入了類腦聽覺網路 (BAN)，它結合了四個神經解剖學對應區域和遞迴連接，並由一種稱為類腦聽覺評分 (BAS) 的新指標所引導。BAS 用於評估 BAN 和人類聽覺辨識路徑之間的相似性。我們進一步提出，大腦皮層中的特定區域（主要是中上顳葉 (T2/T3) 區域）與設計的網路結構相應，與大腦的聽覺感知路徑產生了類比。我們的發現表明，皮層中的神經解剖學相似性與 ANN 的聽覺分類能力是高度一致的。除了在音樂類型分類任務中表現優異之外，BAN 還展示了很高的 BAS 分數。總之，本研究將 BAN 呈現為一種遞迴的、受大腦啟發的 ANN，代表了第一個反映聽覺辨識皮質路徑的模型。

Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models

2502.15499v1 by Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li

Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.

摘要：在大型語言模型 (LLM) 的預訓練中，訓練穩定性是一個持續的挑戰，特別是對於容易發生梯度爆炸和耗散的架構，例如 Post-Norm Transformers。在本文中，我們提出規模分佈解耦 (SDD)，這是一種新穎的方法，透過明確解耦完全連接層中權重矩陣的規模和分佈，來穩定訓練。SDD 應用正規化機制來調節激活，並使用可學習的縮放向量來維持良好的梯度，有效防止$\textbf{梯度爆炸和耗散}$。這種分離透過確保穩定的梯度傳播，來改善最佳化效率，特別是在深度網路中。實驗結果證明，我們的模型在各種 LLM 架構中穩定訓練，並且在不同的正規化配置中優於現有技術。此外，所提出的方法輕量且相容於現有架構，使其成為穩定 LLM 訓練的實用解決方案。程式碼可在 https://github.com/kaihemo/SDD 取得。
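
One way to read "decoupling scale and distribution" is a fully-connected layer whose output is normalized and then rescaled by a learnable vector. The sketch below uses root-mean-square normalization, which is an assumption: the abstract does not name the normalization, and the reference implementation in the linked repository may differ.

```python
import math

def sdd_linear(x, weight, alpha, eps=1e-6):
    """Fully-connected layer with normalized output and a learnable scale.

    weight: list of rows (out_features x in_features)
    alpha:  per-output scaling vector (the learnable "scale" part)
    """
    # Plain matrix-vector product z = W x.
    z = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    # Normalizing z removes the magnitude contributed by W.
    rms = math.sqrt(sum(v * v for v in z) / len(z) + eps)
    return [a * v / rms for a, v in zip(alpha, z)]

x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
W_large = [[100.0 * w for w in row] for row in W]  # same direction, 100x the scale
alpha = [1.0, 1.0]

y_small = sdd_linear(x, W, alpha)
y_large = sdd_linear(x, W_large, alpha)
print(y_small, y_large)
```

Both calls return nearly the same output, so the magnitude of the activations is governed by `alpha` rather than by how large the weight matrix has grown during training.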

Q-PETR: Quant-aware Position Embedding Transformation for Multi-View 3D Object Detection

2502.15488v1 by Jiangyong Yu, Changyong Shu, Dawei Yang, Zichen Yu, Xing Hu, Yan Chen

PETR-based methods have dominated benchmarks in 3D perception and are increasingly becoming a key component in modern autonomous driving systems. However, their performance degrades significantly when INT8 inference is required, with a degradation of 58.2% in mAP and 36.9% in NDS on the NuScenes dataset. To address this issue, we propose a quantization-aware position embedding transformation for multi-view 3D object detection, termed Q-PETR. Q-PETR offers a quantization-friendly and deployment-friendly architecture while preserving the original performance of PETR. It substantially narrows the accuracy gap between INT8 and FP32 inference for PETR-series methods. Without bells and whistles, our approach reduces the mAP and NDS drop to within 1% under standard 8-bit per-tensor post-training quantization. Furthermore, our method exceeds the performance of the original PETR in terms of floating-point precision. Extensive experiments across a variety of PETR-series models demonstrate its broad generalization.

摘要：基於 PETR 的方法在 3D 感知基準中佔主導地位，並且日益成為現代自動駕駛系統中的關鍵組成部分。然而，當需要 INT8 推論時，其量化效能會顯著下降，在 NuScenes 資料集上，mAP 下降了 58.2%，NDS 下降了 36.9%。為了解決此問題，我們提出了一種量化感知位置嵌入轉換，用於多視圖 3D 物件偵測，稱為 Q-PETR。Q-PETR 提供了量化友善且易於部署的架構，同時保留了 PETR 的原始效能。它大幅縮小了 PETR 系列方法的 INT8 和 FP32 推論之間的準確度差距。在沒有花俏功能的情況下，我們的做法將 mAP 和 NDS 降幅降低到標準 8 位元每張量後訓練量化中的 1% 以內。此外，我們的做法在浮點精度方面超越了原始 PETR 的效能。在各種 PETR 系列模型中進行的廣泛實驗證明了其廣泛的概化能力。
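
For readers unfamiliar with 8-bit per-tensor post-training quantization, the sketch below shows a symmetric variant in which one scale is shared by the whole tensor. The symmetric scheme and the toy values are assumptions. The example illustrates one generic failure mode of per-tensor quantization (a single large value consuming the range), which is not necessarily the cause the authors identify for PETR.

```python
def quantize_per_tensor(values, num_bits=8):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in quantized]

small = [0.02, -0.03, 0.01]
q_small, s_small = quantize_per_tensor(small)

with_outlier = small + [12.7]  # one large value shares the same scale
q_outlier, s_outlier = quantize_per_tensor(with_outlier)

print(q_small, q_outlier)
```

With the outlier present, the three small values all map to the integer 0 and are lost after dequantization, whereas without it they use most of the INT8 range.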

ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

2502.15487v1 by Martina Miliani, Serenna Auriemma, Alessandro Bondielli, Emmanuele Chersoni, Lucia Passaro, Irene Sucameli, Alessandro Lenci

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.

摘要：大型語言模型 (LLM) 愈來愈常被用於需要詮釋和推理精確度的任務中。在本文中，我們介紹 ExpliCa，一個用於評估 LLM 明確因果推理的新資料集。ExpliCa 獨特地整合了以不同語言順序呈現的因果關係和時間關係，並由語言連接詞明確表達。該資料集豐富了群眾外包的人類可接受性評分。我們透過提示和困惑度指標在 ExpliCa 上測試了 LLM。我們評估了七個商業和開放原始碼 LLM，結果顯示即使是頂尖模型也難以達到 0.80 的準確度。有趣的是，模型傾向於將時間關係與因果關係混淆，而它們的效能也受到事件語言順序的強烈影響。最後，困惑度指標和提示效能受到模型大小的不同影響。

Enhancing RWKV-based Language Models for Long-Sequence Text Generation

2502.15485v1 by Xinghan Pan

This paper presents an enhanced RWKV-based language generation model designed to improve long-sequence text processing. We propose an adaptive token shift and gating mechanism to better capture long-range dependencies in text generation. Through a series of experiments, we compare the baseline RWKV model with the enhanced model, evaluating performance in terms of forward propagation time, text generation quality, and automatic evaluation metrics such as perplexity, BLEU, and ROUGE. Experimental results show that the enhanced model significantly improves generation quality, especially in BLEU and ROUGE scores, and demonstrates stronger context-capturing ability in long-text generation tasks.

摘要：本文提出了一個增強的基於 RWKV 的語言生成模型，旨在改善長序列文本處理。我們提出了一個自適應的標記轉移和門控機制，以更好地捕捉文本生成中的長距離依賴關係。通過一系列實驗，我們將基線 RWKV 模型與增強模型進行比較，從前向傳播時間、文本生成質量和自動評估指標（如困惑度、BLEU 和 ROUGE）方面評估性能。實驗結果表明，增強模型顯著提高了生成質量，特別是在 BLEU 和 ROUGE 分數方面，並且在長文本生成任務中表現出更強的上下文捕捉能力。
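
As background, the baseline RWKV token shift blends each token's vector with that of the preceding token using a per-channel mixing weight. The sketch below shows only this baseline operation on toy values; the adaptive shift and gating proposed in the paper are not described in the abstract and are not reproduced here.

```python
def token_shift(sequence, mix):
    """Blend each token vector with its predecessor, channel by channel.

    mix[c] = 1.0 keeps only the current token in channel c;
    mix[c] = 0.0 keeps only the previous token.
    """
    shifted = []
    previous = [0.0] * len(mix)  # zeros stand in for "before the first token"
    for current in sequence:
        shifted.append(
            [m * c + (1.0 - m) * p for m, c, p in zip(mix, current, previous)]
        )
        previous = current
    return shifted

seq = [[1.0, 1.0], [3.0, 3.0]]
shifted_seq = token_shift(seq, mix=[1.0, 0.5])
print(shifted_seq)
```

An adaptive variant would presumably make `mix` depend on the input instead of being a fixed vector, but that reading is a guess rather than something the abstract states.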

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

2502.15470v1 by Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu

Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) a one-size-fits-all approach to designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly used LLMs show that PAPI achieves 1.8× and 11.1× speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.

摘要：大型语言模型 (LLM) 被广泛用于自然语言理解和文本生成。LLM 模型依赖一个耗时的步骤，称为 LLM 解码，以生成输出标记。一些先前的工作专注于使用并行技术（例如批处理和推测解码）来提高 LLM 解码的性能。最先进的 LLM 解码同时具有计算绑定和内存绑定的内核。一些先前的工作静态识别这些不同的内核，并将它们映射到一个异构架构，该架构由处理内存 (PIM) 单元和以计算为中心的加速器组成。我们观察到，LLM 解码内核的特征（例如内核是否受内存绑定）可能会因参数更改而动态变化，以满足用户和/或系统需求，从而使 (1) 静态内核映射到 PIM 单元和以计算为中心的加速器次优，以及 (2) 由于即使在内存绑定内核中也有很大程度的异构性，因此采用一刀切的方法来设计 PIM 单元效率低下。在本文中，我们旨在在考虑所涉及内核的动态变化特征的同时加速 LLM 解码。我们提出了 PAPI（PIM 并行解码），这是一种启用 PIM 的异构架构，它利用计算绑定或内存绑定内核的动态调度到合适的硬件单元。PAPI 有两个关键机制：(1) 在线内核表征，可在运行时将内核动态调度到最合适的硬件单元，以及 (2) 一个启用 PIM 的异构计算系统，该系统和谐地协调以计算为中心的处理单元和具有不同计算能力的混合 PIM 单元。我们在三个广泛使用的 LLM 上的实验结果表明，与最先进的异构 LLM 加速器和最先进的仅 PIM LLM 加速器相比，PAPI 分别实现了 1.8 倍和 11.1 倍的加速。
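
Why a kernel can switch between memory-bound and compute-bound at runtime can be illustrated with a roofline-style estimate of arithmetic intensity (FLOPs per byte moved). The sketch below is such an estimate for a fully-connected kernel. The roofline criterion, the simplified byte count, and the hardware peaks are all assumptions for illustration, not PAPI's actual online characterization.

```python
def fc_arithmetic_intensity(batch, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for a (batch x d_in) @ (d_in x d_out) product."""
    flops = 2 * batch * d_in * d_out
    # Weights are read once; activations are read and written once each.
    bytes_moved = bytes_per_value * (d_in * d_out + batch * d_in + batch * d_out)
    return flops / bytes_moved

def is_memory_bound(intensity, peak_flops, peak_bandwidth):
    """A kernel below the roofline ridge point is limited by memory traffic."""
    return intensity < peak_flops / peak_bandwidth

PEAK_FLOPS = 100e12     # hypothetical accelerator: 100 TFLOP/s
PEAK_BANDWIDTH = 1e12   # hypothetical memory system: 1 TB/s

single = fc_arithmetic_intensity(batch=1, d_in=4096, d_out=4096)
batched = fc_arithmetic_intensity(batch=256, d_in=4096, d_out=4096)
print(single, is_memory_bound(single, PEAK_FLOPS, PEAK_BANDWIDTH))
print(batched, is_memory_bound(batched, PEAK_FLOPS, PEAK_BANDWIDTH))
```

The same layer is memory-bound at batch size 1 and compute-bound at batch size 256 under these assumed peaks, so a mapping fixed at design time would be wrong for one of the two cases.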

Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation

2502.15466v1 by Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang, Jing Liu

Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as data scarcity and data imbalance continue to hinder their development. To address this, we consider modeling complex systems through symbolic expressions that serve as semantic descriptors of time series. Building on this concept, we introduce a series-symbol (S2) dual-modality data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic representations. Leveraging the S2 dataset, we develop SymTime, a pre-trained foundation model for TSA. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of dual-modality data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance.

摘要：時序分析 (TSA) 的基礎模型已引起極大的關注。然而，資料稀少和資料不平衡等挑戰持續阻礙其發展。為了解決此問題，我們考慮透過作為時序語意描述符的符號表達式來建構複雜系統。基於此概念，我們引入一個系列符號 (S2) 雙模態資料生成機制，讓配對有對應符號表示的高品質時序資料得以不受限制地建立。利用 S2 資料集，我們開發了 SymTime，一個針對 TSA 的預先訓練基礎模型。SymTime 在微調下游任務時，展現出在五項主要 TSA 任務中的競爭力表現，與在真實世界資料集上預先訓練的基礎模型匹敵。此方法強調了雙模態資料生成和預訓練機制在克服資料稀少性和增強任務效能方面的潛力。

R-LoRA: Random Initialization of Multi-Head LoRA for Multi-Task Learning

2502.15455v1 by Jinda Liu, Yi Chang, Yuan Wu

Fine-tuning large language models (LLMs) is prohibitively expensive in terms of computational and memory costs. Low-rank Adaptation (LoRA), one of the most popular parameter-efficient fine-tuning (PEFT) methods, offers a cost-effective alternative by approximating the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of a down-projection matrix $A \in \mathbb{R}^{m \times r}$ and a head matrix $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$. In real-world scenarios, LLMs are fine-tuned on data from multiple domains to perform tasks across various fields, embodying multi-task learning (MTL). LoRA often underperforms in such complex scenarios. To enhance LoRA's capability in multi-task learning, we propose R-LoRA, which incorporates Multi-Head Randomization. Multi-Head Randomization diversifies the head matrices through Multi-Head Random Initialization and Multi-Head Dropout, enabling more efficient learning of task-specific features while maintaining shared knowledge representation. Extensive experiments demonstrate that R-LoRA is better at capturing task-specific knowledge, thereby improving performance in multi-task scenarios. The code is available at https://github.com/jinda-liu/R-LoRA.

摘要：微調大型語言模型 (LLM) 在運算和記憶體成本方面成本高得令人望而卻步。低秩適應 (LoRA) 作為最流行的參數有效微調 (PEFT) 方法之一，提供了一種具有成本效益的替代方案，通過下投影矩陣 A ∈ Rmxr 和頭矩陣 B ∈ Rrxn 的乘積來近似模型變化 ΔW ∈ Rmxn，其中 r ≪ min(m, n)。在實際場景中，LLM 會針對來自多個網域的資料進行微調，以執行跨越各種領域的任務，體現多任務學習 (MTL)。LoRA 在這種複雜的場景中常常表現不佳。為了增強 LoRA 在多任務學習中的能力，我們提出了 R-LoRA，它結合了多頭隨機化。多頭隨機化通過多頭隨機初始化和多頭中斷來使頭矩陣多樣化，從而在維護共享知識表示的同時，更有效地學習特定於任務的特徵。大量的實驗表明，R-LoRA 更善於擷取特定於任務的知識，從而提高多任務場景中的效能。程式碼可於 https://github.com/jinda-liu/R-LoRA 取得。
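
The shapes in the abstract can be made concrete with a small sketch: one shared down-projection A (m x r) and several head matrices B (r x n), each drawn from its own random initialization. The sizes, the Gaussian initialization, and its standard deviation are hypothetical. Multi-Head Dropout and the way heads are combined are not specified in the abstract and are omitted.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng, std=0.02):
    """Matrix with independent Gaussian entries (assumed initialization)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

m, n, r, num_heads = 64, 64, 4, 3   # hypothetical sizes with r << min(m, n)
rng = random.Random(0)

A = random_matrix(m, r, rng)                                   # shared, m x r
heads = [random_matrix(r, n, rng) for _ in range(num_heads)]   # one r x n head each
deltas = [matmul(A, B) for B in heads]                         # each update is m x n

full_params = m * n
lora_params = m * r + num_heads * r * n
print(full_params, lora_params)
```

Even with three heads, the adapter holds a quarter of the parameters of a full 64 x 64 update in this toy setting, and each head starts from a different point, which is the diversification the abstract refers to.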

A fast convergence algorithm based on binary integer programming for expert load balancing in MoE LLMs

2502.15451v1 by Yuan Sun

MoE (Mixture-of-Experts) architectures appear frequently in large language models, and in recent models the number of experts can exceed one hundred. However, the expert load imbalance problem frequently occurs in MoE model pre-training, causing routing collapse or increased computational overhead. To balance the load across experts, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q that can change the top-K order of the routing scores s by solving a binary integer program at very small time cost. In simulation experiments, we observe that BIP-Based Balancing makes the imbalance disappear very quickly, while the final sum of routing scores decreases very little. Under simulation, our algorithm achieves a nearly perfect trade-off between expert load balance and pre-training efficiency.

2502.15448v1 by Paul Koch, Marian Schlüter, Jörg Krüger

We present MVIP, a novel dataset for multi-modal and multi-view application-oriented industrial part recognition. Here we are the first to combine a calibrated RGBD multi-view dataset with additional object context such as physical properties, natural language, and super-classes. The current portfolio of available datasets offers a wide range of representations to design and benchmark related methods. In contrast to existing classification challenges, industrial recognition applications offer controlled multi-modal environments but at the same time pose different problems than traditional 2D/3D classification challenges. Frequently, industrial applications must deal with small or increasing amounts of training data, visually similar parts, and varying object sizes, while requiring a robust near-100% top-5 accuracy under cost and time constraints. Current methods tackle such challenges individually, but direct adoption of these methods within industrial applications is complex and requires further research. Our main goal with MVIP is to study and push the transferability of various state-of-the-art methods within related downstream tasks towards an efficient deployment of industrial classifiers. Additionally, we intend to use MVIP to advance research on several modality fusion topics, (automated) synthetic data generation, and complex data sampling, combined in a single application-oriented benchmark.

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

2502.15443v1 by Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue

Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework that further compresses LLMs after quantization, achieving a compression ratio of about 2.2x. A compression-aware quantization is first proposed to enhance model weight compressibility by re-scaling the model parameters before quantization, followed by a pruning method that improves compressibility further. Building on this, we observe that decompression can become a bottleneck in practical scenarios. We then give a detailed analysis of the trade-off between memory usage and latency introduced by the proposed method, and propose a speed-adaptive method to overcome this bottleneck. The experimental results show that inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.

Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

2502.15436v1 by Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma

Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB establishes a new Pareto frontier in the tradeoff between communication and performance, offering an efficient and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at https://github.com/CERT-Lab/fed-sb.
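
The claim that directly averaging R yields exact updates follows from linearity: when B and A are fixed and shared, the average of the client updates B R_i A equals B (mean of R_i) A. The sketch below checks this on tiny hand-picked matrices and contrasts it with averaging per-client LoRA factors separately, which is generally not exact. The matrices and helper names are illustrative assumptions, not values or code from the Fed-SB repository.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mean_matrices(mats):
    """Element-wise average of equally shaped matrices."""
    k = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(M[i][j] for M in mats) / k for j in range(cols)] for i in range(rows)]

# Frozen adapters shared by all clients (B: 3 x 2, A: 2 x 3).
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
# Each client trains only its own small r x r matrix R_i.
client_R = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 2.0], [0.0, 1.0]]]

ideal = mean_matrices([matmul(matmul(B, R), A) for R in client_R])
from_avg_R = matmul(matmul(B, mean_matrices(client_R)), A)

# Contrast: averaging per-client LoRA factors separately is not exact.
client_B = [[[1.0], [0.0]], [[0.0], [1.0]]]
client_A = [[[1.0, 0.0]], [[0.0, 1.0]]]
ideal_lora = mean_matrices([matmul(Bi, Ai) for Bi, Ai in zip(client_B, client_A)])
naive_lora = matmul(mean_matrices(client_B), mean_matrices(client_A))
```

Only the r x r matrix needs to be communicated per client in this setup, so the payload does not depend on the layer dimensions m and n.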

Single-pass Detection of Jailbreaking Input in Large Language Models

2502.15435v1 by Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios G. Chrysos, Volkan Cevher

Defending aligned Large Language Models (LLMs) against jailbreaking attacks is a challenging problem, with existing approaches requiring multiple requests or even queries to auxiliary LLMs, making them computationally heavy. Instead, we focus on detecting jailbreaking input in a single forward pass. Our method, called Single Pass Detection (SPD), leverages the information carried by the logits to predict whether the output sentence will be harmful. This allows us to defend in just one forward pass. SPD not only detects attacks effectively on open-source models but also minimizes the misclassification of harmless inputs. Furthermore, we show that SPD remains effective even without complete logit access in GPT-3.5 and GPT-4. We believe that our proposed method offers a promising approach to efficiently safeguard LLMs against adversarial attacks.

Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation

2502.15434v1 by Yue Zhou, Yi Chang, Yuan Wu

Model merging integrates the parameters of multiple models into a unified model, combining their diverse capabilities. Existing model merging methods are often constrained by fixed parameter merging ratios. In this study, we propose Mixup Model Merge (M$^3$), an innovative approach inspired by the Mixup data augmentation technique. This method merges the parameters of two large language models (LLMs) by randomly generating linear interpolation ratios, allowing for a more flexible and comprehensive exploration of the parameter space. Extensive experiments demonstrate the superiority of our proposed M$^3$ method in merging fine-tuned LLMs: (1) it significantly improves performance across multiple tasks, (2) it enhances LLMs' out-of-distribution (OOD) robustness and adversarial robustness, (3) it achieves superior results when combined with sparsification techniques such as DARE, and (4) it offers a simple yet efficient solution that does not require additional computational resources. In conclusion, M$^3$ is a simple yet effective model merging method that significantly enhances the performance of the merged model by randomly generating contribution ratios for two fine-tuned LLMs. The code is available at https://github.com/MLGroupJLU/MixupModelMerge.
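
Randomized linear interpolation between two models can be sketched as follows. The abstract does not state the sampling distribution; the sketch assumes a Beta(alpha, alpha) draw, as in the original Mixup augmentation, and interpolates raw parameter vectors stored in dictionaries. The function names and the toy parameters are hypothetical and do not come from the MixupModelMerge repository.

```python
import random

def sample_ratio(alpha=1.0, rng=random):
    """Draw an interpolation ratio in [0, 1] from Beta(alpha, alpha), as Mixup does."""
    return rng.betavariate(alpha, alpha)

def merge(params_a, params_b, lam):
    """Linear interpolation lam * a + (1 - lam) * b for every named parameter vector."""
    return {name: [lam * x + (1.0 - lam) * y
                   for x, y in zip(params_a[name], params_b[name])]
            for name in params_a}

model_a = {"w": [0.0, 4.0], "b": [1.0]}
model_b = {"w": [4.0, 0.0], "b": [3.0]}
fixed = merge(model_a, model_b, 0.25)          # deterministic ratio for illustration
lam = sample_ratio(alpha=1.0, rng=random.Random(0))
randomized = merge(model_a, model_b, lam)      # randomly drawn ratio
```

A fixed-ratio merge is the special case where `lam` is a constant; sampling it is what lets repeated merges explore different points on the line between the two models.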

Pub-Guard-LLM: Detecting Fraudulent Biomedical Articles with Reliable Explanations

2502.15429v1 by Lihu Chen, Shuojie Fu, Gabriel Freedman, Cemre Zor, Guy Martin, James Kinross, Uddhav Vaghela, Ovidiu Serban, Francesca Toni

A significant and growing number of published scientific articles is found to involve fraudulent practices, posing a serious threat to the credibility and safety of research in fields such as medicine. We propose Pub-Guard-LLM, the first large language model-based system tailored to fraud detection of biomedical scientific articles. We provide three application modes for deploying Pub-Guard-LLM: vanilla reasoning, retrieval-augmented generation, and multi-agent debate. Each mode allows for textual explanations of predictions. To assess the performance of our system, we introduce an open-source benchmark, PubMed Retraction, comprising over 11K real-world biomedical articles, including metadata and retraction labels. We show that, across all modes, Pub-Guard-LLM consistently surpasses the performance of various baselines and provides more reliable explanations, namely explanations which are deemed more relevant and coherent than those generated by the baselines when evaluated by multiple assessment methods. By enhancing both detection performance and explainability in scientific fraud detection, Pub-Guard-LLM contributes to safeguarding research integrity with a novel, effective, open-source tool.

Anatomy-Informed Deep Learning and Radiomics for Automated Neurofibroma Segmentation in Whole-Body MRI

2502.15424v1 by Georgii Kolokolnikov, Marie-Lena Schmalhofer, Lennart Well, Said Farschtschi, Victor-Felix Mautner, Inka Ristow, Rene Werner

Neurofibromatosis Type 1 is a genetic disorder characterized by the development of neurofibromas (NFs), which exhibit significant variability in size, morphology, and anatomical location. Accurate and automated segmentation of these tumors in whole-body magnetic resonance imaging (WB-MRI) is crucial to assess tumor burden and monitor disease progression. In this study, we present and analyze a fully automated pipeline for NF segmentation in fat-suppressed T2-weighted WB-MRI, consisting of three stages: anatomy segmentation, NF segmentation, and tumor candidate classification. In the first stage, we use the MRSegmentator model to generate an anatomy segmentation mask, extended with a high-risk zone for NFs. This mask is concatenated with the input image as anatomical context information for NF segmentation. The second stage employs an ensemble of 3D anisotropic anatomy-informed U-Nets to produce an NF segmentation confidence mask. In the final stage, tumor candidates are extracted from the confidence mask and classified based on radiomic features, distinguishing tumors from non-tumor regions and reducing false positives. We evaluate the proposed pipeline on three test sets representing different conditions: in-domain data (test set 1), varying imaging protocols and field strength (test set 2), and low tumor burden cases (test set 3). Experimental results show a 68% improvement in per-scan Dice Similarity Coefficient (DSC), a 21% increase in per-tumor DSC, and a two-fold improvement in F1 score for tumor detection in high tumor burden cases by integrating anatomy information. The method is integrated into the 3D Slicer platform for practical clinical use, with the code publicly accessible.
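
For reference, the Dice Similarity Coefficient used as the main metric here is 2|P ∩ T| / (|P| + |T|) for a predicted mask P and a ground-truth mask T. The sketch below computes it on flattened binary masks; returning 1.0 when both masks are empty is a convention chosen for this example, not something stated in the abstract.

```python
def dice(pred, target):
    """Dice Similarity Coefficient 2|P ∩ T| / (|P| + |T|) for flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly; this convention is an assumption of the sketch.
    return 1.0 if total == 0 else 2.0 * intersection / total

half_overlap = dice([1, 1, 0, 0], [1, 0, 1, 0])
```

A per-scan score would apply `dice` to the whole volume, while a per-tumor score would apply it to each lesion's region separately and aggregate the results.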

Evaluating Multimodal Generative AI with Korean Educational Standards

2502.15422v1 by Sanghee Park, Geewook Kim

This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models (open-source, open-access, and closed APIs) by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be fully open-sourced at https://github.com/naver-ai/KoNET.

Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking

2502.15419v1 by Yi-Ling Chung, Aurora Cobo, Pablo Serna

Robust automatic fact-checking systems have the potential to combat online misinformation at scale. However, most existing research primarily focuses on English. In this paper, we introduce MultiSynFact, the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs designed to support Spanish, German, English, and other low-resource languages. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia and incorporating rigorous claim validation steps to ensure data quality. We evaluate the effectiveness of MultiSynFact across multiple models and experimental settings. Additionally, we open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.

MHQA: A Diverse, Knowledge Intensive Mental Health Question Answering Challenge for Language Models

2502.15418v1 by Suraj Racha, Prashant Joshi, Anshika Raman, Nikita Jangid, Mridul Sharma, Ganesh Ramakrishnan, Nirmal Punjabi

Mental health remains a challenging problem all over the world, with issues like depression and anxiety becoming increasingly common. Large Language Models (LLMs) have been widely applied in healthcare, specifically in answering medical questions. However, there is a lack of standard benchmarking datasets for question answering (QA) in mental health. Our work presents a novel multiple-choice dataset, MHQA (Mental Health Question Answering), for benchmarking language models (LMs). Previous mental health datasets have focused primarily on text classification into specific labels or disorders. MHQA, on the other hand, presents question-answering for mental health focused on four key domains: anxiety, depression, trauma, and obsessive/compulsive issues, with diverse question types, namely, factoid, diagnostic, prognostic, and preventive. We use PubMed abstracts as the primary source for QA. We develop a rigorous pipeline for LLM-based identification of information from abstracts based on various selection criteria and for converting it into QA pairs. Further, valid QA pairs are extracted based on post-hoc validation criteria. Overall, our MHQA dataset consists of 2,475 expert-verified gold-standard instances, called MHQA-gold, and ~56.1k pairs pseudo-labeled using external medical references. We report F1 scores on different LLMs along with few-shot and supervised fine-tuning experiments, and discuss the insights behind the scores.

Textual-to-Visual Iterative Self-Verification for Slide Generation

2502.15412v1 by Yunqing Xu, Xinbei Ma, Jiyang Qiu, Hai Zhao

Generating presentation slides is a time-consuming task that urgently requires automation. Due to their limited flexibility and lack of automated refinement mechanisms, existing autonomous LLM-based agents face constraints in real-world applicability. We decompose the task of generating missing presentation slides into two key components: content generation and layout generation, aligning with the typical process of creating academic slides. First, we introduce a content generation approach that enhances coherence and relevance by incorporating context from surrounding slides and leveraging section retrieval strategies. For layout generation, we propose a textual-to-visual self-verification process using an LLM-based Reviewer + Refiner workflow, transforming complex textual layouts into intuitive visual formats. This modality transformation simplifies the task, enabling accurate and human-like review and refinement. Experiments show that our approach significantly outperforms baseline methods in terms of alignment, logical flow, visual appeal, and readability.

HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

2502.15411v1 by Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva

The U.S. Securities and Exchange Commission (SEC) requires that public companies file financial reports tagging numbers with the machine-readable inline eXtensible Business Reporting Language (iXBRL) standard. However, the highly complex and highly granular taxonomy defined by iXBRL limits label transferability across domains. In this paper, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, designed to facilitate numerical KPI extraction at specified levels of granularity from unstructured financial text. Our approach organizes a 218,126-label hierarchy using a taxonomy-based grouping method, investigating which taxonomy layer provides the most meaningful structure. HiFi-KPI comprises ~1.8M paragraphs and ~5M entities, each linked to a label in the iXBRL-specific calculation and presentation taxonomies. We provide baselines using encoder-based approaches and structured extraction using Large Language Models (LLMs). To simplify LLM inference and evaluation, we additionally release HiFi-KPI Lite, a manually curated subset with four expert-mapped labels. We publicly release all artifacts

Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning

2502.15401v1 by Xuetao Ma, Wenbin Jiang, Hua Huang

In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be publicly available subsequently.
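
The select-then-order procedure described in this abstract can be sketched as below. The abstract says that examples are selected by problem-solving logic and ordered from easy to hard by the number of problem-solving steps; the Jaccard overlap used here as the selection measure, the dictionary layout of the examples, and the operation names are assumptions for illustration only.

```python
def logic_similarity(ops_a, ops_b):
    """Jaccard overlap between two sets of problem-solving operations (assumed measure)."""
    a, b = set(ops_a), set(ops_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_prompt_order(query_ops, pool, k):
    """Select the k most logic-similar examples, then order them from easy to hard."""
    ranked = sorted(pool, key=lambda ex: logic_similarity(query_ops, ex["steps"]),
                    reverse=True)
    chosen = ranked[:k]
    return sorted(chosen, key=lambda ex: len(ex["steps"]))  # fewer steps = easier

pool = [
    {"id": "a", "steps": ["select", "filter", "aggregate"]},
    {"id": "b", "steps": ["select"]},
    {"id": "c", "steps": ["select", "filter"]},
    {"id": "d", "steps": ["compare", "union"]},
]
ordered = build_prompt_order(["select", "filter"], pool, k=3)
order_ids = [ex["id"] for ex in ordered]
```

Example "d" shares no operations with the query and is dropped; the remaining three are placed in the prompt with the one-step example first and the three-step example last.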

Enhancing Vehicle Make and Model Recognition with 3D Attention Modules

2502.15398v1 by Narges Semiromizadeh, Omid Nejati Manzari, Shahriar B. Shokouhi, Sattar Mirzakuchaki

Vehicle make and model recognition (VMMR) is a crucial component of the Intelligent Transport System, garnering significant attention in recent years. VMMR has been widely utilized for detecting suspicious vehicles, monitoring urban traffic, and autonomous driving systems. The complexity of VMMR arises from the subtle visual distinctions among vehicle models and the wide variety of classes produced by manufacturers. Convolutional Neural Networks (CNNs), a prominent type of deep learning model, have been extensively employed in various computer vision tasks, including VMMR, yielding remarkable results. As VMMR is a fine-grained classification problem, it primarily faces inter-class similarity and intra-class variation challenges. In this study, we implement an attention module to address these challenges and enhance the model's focus on critical areas containing distinguishing features. This module, which does not increase the parameters of the original model, generates three-dimensional (3-D) attention weights to refine the feature map. Our proposed model integrates the attention module into two different locations within the middle section of a convolutional model, where the feature maps from these sections offer sufficient information about the input frames without being overly detailed or overly coarse. The performance of our proposed model, along with state-of-the-art (SOTA) convolutional and transformer-based models, was evaluated using the Stanford Cars dataset. Our proposed model achieved the highest accuracy, 90.69%, among the compared models.
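
The abstract does not name the attention module. As an illustration of how a parameter-free module can produce 3-D attention weights, the sketch below uses the SimAM energy formula, one published module with these properties; treating it as a stand-in is an assumption, and the paper's actual module may differ. The feature map is a plain C x H x W nested list and the values are made up.

```python
import math

def simam_weights(channel, lam=1e-4):
    """3-D attention weights for one H x W channel, following the SimAM energy formula."""
    values = [v for row in channel for v in row]
    n = len(values) - 1                      # requires at least two spatial positions
    mu = sum(values) / len(values)
    variance = sum((v - mu) ** 2 for v in values) / n
    return [[1.0 / (1.0 + math.exp(-((v - mu) ** 2 / (4.0 * (variance + lam)) + 0.5)))
             for v in row] for row in channel]

def simam(feature_map, lam=1e-4):
    """Refine a C x H x W feature map by element-wise reweighting; adds no parameters."""
    refined = []
    for channel in feature_map:
        weights = simam_weights(channel, lam)
        refined.append([[v * w for v, w in zip(row, wrow)]
                        for row, wrow in zip(channel, weights)])
    return refined

feature_map = [[[0.0, 0.0], [0.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
weights = simam_weights(feature_map[0])
refined = simam(feature_map)
```

Each activation gets its own weight (one per channel, row, and column), and activations that stand out from their channel's mean receive larger weights, which is the sense in which the weights are three-dimensional.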

Super-Resolution for Interferometric Imaging: Model Comparisons and Performance Analysis

2502.15397v1 by Hasan Berkay Abdioglu, Rana Gursoy, Yagmur Isik, Ibrahim Cem Balci, Taha Unal, Kerem Bayer, Mustafa Ismail Inal, Nehir Serin, Muhammed Furkan Kosar, Gokhan Bora Esmer, Huseyin Uvet

This study investigates the application of Super-Resolution techniques in holographic microscopy to enhance quantitative phase imaging. An off-axis Mach-Zehnder interferometric setup was employed to capture interferograms. The study evaluates two Super-Resolution models, RCAN and Real-ESRGAN, for their effectiveness in reconstructing high-resolution interferograms from a microparticle-based dataset. The models were assessed using two primary approaches: image-based analysis for structural detail enhancement and morphological evaluation for maintaining sample integrity and phase map accuracy. The results demonstrate that RCAN achieves superior numerical precision, making it ideal for applications requiring highly accurate phase map reconstruction, while Real-ESRGAN enhances visual quality and structural coherence, making it suitable for visualization-focused applications. This study highlights the potential of Super-Resolution models in overcoming diffraction-imposed resolution limitations in holographic microscopy, opening the way for improved imaging techniques in biomedical diagnostics, materials science, and other high-precision fields.

Chitrarth: Bridging Vision and Language for a Billion People

2502.15392v1 by Shaharukh Khan, Ayush Tarun, Abhinav Ravi, Ali Faraz, Akshat Patidar, Praveen Kumar Pokala, Anagha Bhangare, Raja Kolla, Chandra Khatri, Shubham Agarwal

Recent multimodal foundation models are primarily trained on English or high-resource European language data, which hinders their applicability to other medium- and low-resource languages. To address this limitation, we introduce Chitrarth (Chitra: Image; Artha: Meaning), an inclusive Vision-Language Model (VLM), specifically targeting the rich linguistic diversity and visual reasoning across 10 prominent Indian languages. Our model effectively integrates a state-of-the-art (SOTA) multilingual Large Language Model (LLM) with a vision module, primarily trained on multilingual image-text data. Furthermore, we also introduce BharatBench, a comprehensive framework for evaluating VLMs across various Indian languages, ultimately contributing to more diverse and effective AI systems. Our model achieves SOTA results for benchmarks across low-resource languages while retaining its efficiency in English. Through our research, we aim to set new benchmarks in multilingual-multimodal capabilities, offering substantial improvements over existing models and establishing a foundation to facilitate future advancements in this arena.

Identifying Features that Shape Perceived Consciousness in Large Language Model-based AI: A Quantitative Study of Human Responses

2502.15365v1 by Kang Bongsu, Kim Jundong, Yun Tae-Rim, Bae Hyojin, Kim Chang-Eop

This study quantitatively examines which features of AI-generated text lead humans to perceive subjective consciousness in large language model (LLM)-based AI systems. Drawing on 99 passages from conversations with Claude 3 Opus and focusing on eight features -- metacognitive self-reflection, logical reasoning, empathy, emotionality, knowledge, fluency, unexpectedness, and subjective expressiveness -- we conducted a survey with 123 participants. Using regression and clustering analyses, we investigated how these features influence participants' perceptions of AI consciousness. The results reveal that metacognitive self-reflection and the AI's expression of its own emotions significantly increased perceived consciousness, while a heavy emphasis on knowledge reduced it. Participants clustered into seven subgroups, each showing distinct feature-weighting patterns. Additionally, higher prior knowledge of LLMs and more frequent usage of LLM-based chatbots were associated with greater overall likelihood assessments of AI consciousness. This study underscores the multidimensional and individualized nature of perceived AI consciousness and provides a foundation for better understanding the psychosocial implications of human-AI interaction.

2502.15361v1 by Xuyang Wu, Jinming Nian, Zhiqiang Tao, Yi Fang

In the recent development of AI reasoning, large language models (LLMs) are trained to automatically generate chain-of-thought reasoning steps, which have demonstrated compelling performance on math and coding tasks. However, when bias is mixed into the reasoning process to form strong logical arguments, it could cause even more harmful results and further induce hallucinations. In this paper, we evaluate the 8B and 32B variants of DeepSeek-R1 against their instruction-tuned counterparts on the BBQ dataset, and investigate the bias that is elicited and amplified through reasoning steps. To the best of our knowledge, this empirical study is the first to assess bias issues in LLM reasoning.

ARS: Automatic Routing Solver with Large Language Models

2502.15359v1 by Kai Li, Fei Liu, Zhenkun Wang, Xialiang Tong, Xiongwei Han, Mingxuan Yuan

Real-world Vehicle Routing Problems (VRPs) are characterized by a variety of practical constraints, making manual solver design both knowledge-intensive and time-consuming. Although there is increasing interest in automating the design of routing algorithms, existing research has explored only a limited array of VRP variants and fails to adequately address the complex and prevalent constraints encountered in real-world situations. To fill this gap, this paper introduces RoutBench, a benchmark of 1,000 VRP variants derived from 24 attributes, for evaluating the effectiveness of automatic routing solvers in addressing complex constraints. Along with RoutBench, we present the Automatic Routing Solver (ARS), which employs Large Language Model (LLM) agents to enhance a backbone algorithm framework by automatically generating constraint-aware heuristic code, based on problem descriptions and several representative constraints selected from a database. Our experiments show that ARS outperforms state-of-the-art LLM-based methods and commonly used solvers, automatically solving 91.67% of common VRPs and achieving at least a 30% improvement across all benchmarks.

摘要：現實世界的車輛路徑問題 (VRP) 的特點是具有各種實際限制，使得手動求解器設計既需要大量知識，又耗時費力。儘管自動化路徑演算法設計越來越受到關注，但現有的研究僅探討了有限的 VRP 變體，未能充分解決現實世界中遇到的複雜且普遍存在的限制。為了填補這一空白，本文介紹了 RoutBench，這是一個由 24 個屬性衍生的 1,000 個 VRP 變體的基準，用於評估自動路徑求解器在解決複雜限制方面的有效性。除了 RoutBench，我們還展示了自動路徑求解器 (ARS)，它採用大型語言模型 (LLM) 代理，透過根據問題描述和從資料庫中選取的幾個代表性限制自動產生具約束意識的啟發式程式碼，來增強主幹演算法架構。我們的實驗表明，ARS 優於最先進的基於 LLM 的方法和常用的求解器，自動解決了 91.67% 的常見 VRP，並且在所有基準測試中都實現了至少 30% 的改進。

AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms

2502.15349v1 by Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen

Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at https://github.com/microsoft/AttentionEngine.

摘要：Transformer和大語言模型 (LLM) 徹底革新了機器學習，而注意力機制是其成功的核心。隨著注意力變體的版圖擴展，優化其效能的挑戰也越來越多，特別是在不同的硬體平台上。目前的最佳化策略通常焦點狹隘，需要大量手動介入才能適應模型組態或硬體環境的變更。在本文中，我們介紹了 AttentionEngine，一個全面的架構，旨在簡化跨異質硬體後端的注意力機制的最佳化。透過將注意力運算分解為具有可自訂組件的模組化運算，AttentionEngine 能夠靈活地適應不同的演算法需求。這個架構進一步透過可程式化範本和強大的跨平台排程策略，自動化核心最佳化。經驗結果顯示，在現有方法無法達到的組態上，效能提升了 10 倍。AttentionEngine 提供了一個可擴充、高效的基礎，用於開發和部署注意力機制，且只需最少的調整。我們的程式碼已開放原始碼，可在 https://github.com/microsoft/AttentionEngine 取得。

Constructing a Norm for Children's Scientific Drawing: Distribution Features Based on Semantic Similarity of Large Language Models

2502.15348v1 by Yi Zhang, Fan Wei, Jingyi Li, Yan Wang, Yanyan Yu, Jianli Chen, Zipo Cai, Xinyu Liu, Wei Wang, Peng Wang, Zhong Wang

The use of children's drawings to examine their conceptual understanding has been proven to be an effective method, but there are two major problems with previous research: 1. The content of the drawings heavily relies on the task, and the ecological validity of the conclusions is low; 2. The interpretation of drawings relies too much on the subjective feelings of the researchers. To address these problems, this study uses a Large Language Model (LLM) to identify 1420 children's scientific drawings (covering 9 scientific themes/concepts), and uses the word2vec algorithm to calculate their semantic similarity. The study explores whether children produce consistent drawing representations of the same theme, and attempts to establish a norm for children's scientific drawings, providing a baseline reference for follow-up children's drawing research. The results show that the representations in most drawings are consistent, with most semantic similarity scores above 0.8. At the same time, the consistency of the representation was found to be independent of the accuracy (of the LLM's recognition), indicating the existence of consistency bias. In the subsequent exploration of influencing factors, we used the Kendall rank correlation coefficient to investigate the effects of Sample Size, Abstract Degree, and Focus Points on drawings, and used word frequency statistics to explore whether children represented abstract themes/concepts by reproducing what was taught in class.

摘要：使用兒童繪畫來檢視其概念理解已被證實是一種有效的方法，但先前的研究有兩個主要問題：1. 繪畫的內容過於依賴於任務，而結論的生態效度低；2. 對繪畫的詮釋過於依賴研究者的主觀感受。為了解決這個問題，本研究使用大型語言模型 (LLM) 來識別 1420 幅兒童科學繪畫（涵蓋 9 個科學主題/概念），並使用 word2vec 演算法計算它們的語義相似性。本研究探討是否兒童在同一主題上有具有一致性的繪畫表徵，並試圖為兒童科學繪畫建立一個規範，為後續兒童繪畫研究提供一個基準參考。結果表明，大多數繪畫的表徵具有相容性，表現為大多數語義相似度大於 0.8。同時發現，表徵的一致性與準確性（LLM 的識別）無關，表明存在一致性偏差。在後續的影響因素探討中，我們使用 Kendall 等級相關係數來探討樣本量、抽象程度和焦點點對繪畫的影響，並使用詞頻統計來探討兒童是否通過重現課堂上所教的內容來表徵抽象的主題/概念。
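The 0.8 similarity threshold mentioned in the abstract above can be made concrete with a small sketch. It assumes that semantic similarity is measured as cosine similarity between embedding vectors, which is the usual choice for word2vec but is not stated in the abstract; the two vectors and the names `drawing_a` and `drawing_b` are hypothetical placeholders, not data from the study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings of two drawing descriptions
# (real word2vec vectors typically have hundreds of dimensions).
drawing_a = [0.9, 0.1, 0.3]
drawing_b = [0.8, 0.2, 0.4]

similarity = cosine_similarity(drawing_a, drawing_b)
# Threshold quoted in the abstract; how it is applied here is an assumption.
is_consistent = similarity > 0.8
```

Cosine similarity ignores vector length and compares direction only, so descriptions of different lengths can still be compared on the same scale.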

Tokenization is Sensitive to Language Variation

2502.15343v1 by Anna Wegmann, Dong Nguyen, David Jurgens

Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models for the popular Byte-Pair Encoding algorithm to investigate how key algorithmic design choices impact downstream models' performances: fitting corpus, pre-tokenizer and vocabulary size. We find that the best tokenizer varies on the two task types -- with the pre-tokenizer having the biggest impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing significant improvement over techniques like Rényi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance.

摘要：語言變異無所不在，且經常與區域、社會和情境因素系統性地連結。標記化器將文字分割成較小的單位，對於較不常見的語言形式，其行為可能有所不同。這可能會對兩種任務的下游 LLM 效能產生不同的影響：模型應該對語言變異具有穩健性的任務（例如，對於 NLI 等語義任務，標籤不取決於文字採用英式或美式拼寫）以及模型應該對語言變異敏感的任務（例如，對於作者驗證等基於形式的任務，標籤取決於文字採用英式或美式拼寫）。我們針對熱門的 Byte-Pair 編碼演算法預先訓練 BERT 基礎模型，以探討關鍵演算法設計選項如何影響下游模型的效能：擬合語料庫、預先標記化器和詞彙量大小。我們發現最佳標記化器在兩種任務類型上有所不同，其中預先標記化器對效能的影響最大。此外，我們提出了一種估計標記化器對下游 LLM 效能影響的新方法，顯示出比 Rényi 效率等技術有顯著的進步。我們鼓勵對語言變異及其與標記化器和 LLM 效能的關係進行更多研究。
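For readers unfamiliar with the Rényi efficiency baseline named in the abstract above, here is a minimal sketch. It assumes the common definition: the Rényi entropy of order alpha of the unigram token distribution, divided by the log of the vocabulary size. The function name `renyi_efficiency` and the default `alpha=2.5` are choices made for this illustration and are not taken from the abstract; the paper's own new estimator is not reproduced here.

```python
import math
from collections import Counter

def renyi_efficiency(tokens, vocab_size, alpha=2.5):
    """Renyi entropy of the token distribution, normalised by log(vocab_size).

    Returns a value in (0, 1] when every vocabulary item could in principle
    occur; 1.0 means tokens are used uniformly.
    """
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if alpha == 1.0:
        # The limit alpha -> 1 is the Shannon entropy.
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return entropy / math.log(vocab_size)

# Hypothetical token streams over a 4-item vocabulary.
uniform_stream = ["a", "b", "c", "d"]
skewed_stream = ["a", "a", "a", "b"]
uniform_eff = renyi_efficiency(uniform_stream, vocab_size=4)
skewed_eff = renyi_efficiency(skewed_stream, vocab_size=4)
```

A tokenizer that spends most of its vocabulary on rarely used tokens scores lower under this measure than one that uses its vocabulary evenly.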

Exploring Embodied Multimodal Large Models: Development, Datasets, and Future Directions

2502.15336v1 by Shoubin Chen, Zehao Wu, Kai Zhang, Chunyu Li, Baiyang Zhang, Fei Ma, Fei Richard Yu, Qingquan Li

Embodied multimodal large models (EMLMs) have gained significant attention in recent years due to their potential to bridge the gap between perception, cognition, and action in complex, real-world environments. This comprehensive review explores the development of such models, including Large Language Models (LLMs), Large Vision Models (LVMs), and other models, while also examining other emerging architectures. We discuss the evolution of EMLMs, with a focus on embodied perception, navigation, interaction, and simulation. Furthermore, the review provides a detailed analysis of the datasets used for training and evaluating these models, highlighting the importance of diverse, high-quality data for effective learning. The paper also identifies key challenges faced by EMLMs, including issues of scalability, generalization, and real-time decision-making. Finally, we outline future directions, emphasizing the integration of multimodal sensing, reasoning, and action to advance the development of increasingly autonomous systems. By providing an in-depth analysis of state-of-the-art methods and identifying critical gaps, this paper aims to inspire future advancements in EMLMs and their applications across diverse domains.

摘要：具身多模态大型模型 (EMLM) 近年来备受关注，因为它们有可能弥合复杂现实世界环境中感知、认知和行动之间的差距。这项全面的评论探讨了此类模型的发展，包括大型语言模型 (LLM)、大型视觉模型 (LVM) 和其他模型，同时还考察了其他新兴架构。我们讨论了 EMLM 的演变，重点关注具身感知、导航、交互和模拟。此外，该评论对用于训练和评估这些模型的数据集进行了详细分析，强调了多样化、高质量数据对于有效学习的重要性。本文还指出了 EMLM 面临的主要挑战，包括可扩展性、泛化和实时决策的问题。最后，我们概述了未来的方向，强调了多模态感知、推理和行动的整合，以推进日益自主的系统的开发。通过对最先进方法进行深入分析并找出关键差距，本文旨在激发 EMLM 及其在不同领域应用的未来进步。

Stepwise Informativeness Search for Improving LLM Reasoning

2502.15335v1 by Siyuan Wang, Enda Zhao, Zhongyu Wei, Xiang Ren

Advances in Large Language Models (LLMs) have significantly improved multi-step reasoning through generating free-text rationales. However, recent studies show that LLMs tend to lose focus over the middle of long contexts. This raises concerns that as reasoning progresses, LLMs may overlook information in earlier steps when decoding subsequent steps, leading them to generate unreliable and redundant rationales. To address this, we propose guiding LLMs to generate more accurate and concise step-by-step rationales by (1) proactively referencing information from underutilized prior steps, and (2) minimizing redundant information between new and existing steps. We introduce stepwise informativeness search, an inference-time tree search framework incorporating two selection heuristics: grounding-guided selection, which prioritizes steps that pay more attention to underutilized steps; and novelty-guided selection, which encourages steps with novel conclusions. During rationale generation, we use a self-grounding strategy that prompts LLMs to explicitly reference relevant prior steps to provide premises before deduction at each step. Experimental results on four reasoning datasets demonstrate that our approach improves reasoning accuracy by generating higher-quality rationales with reduced errors and redundancy.

摘要：大型語言模型 (LLM) 的進展透過產生自由文本的理由，顯著地改善了多步驟推理。然而，最近的研究顯示，LLM 傾向於在長語境的過程中失去焦點。這引發了疑慮，隨著推理的進行，LLM 在解碼後續步驟時可能會忽略先前步驟中的資訊，導致產生不可靠且冗餘的理由。為了解決這個問題，我們建議指導 LLM 產生更準確且簡潔的分步理由，方法是：(1) 主動引用未充分利用的先前步驟中的資訊，以及 (2) 最小化新步驟和現有步驟之間的冗餘資訊。我們引入了逐步資訊性搜尋，一種推理時間樹狀搜尋架構，它包含兩個選擇啟發法：以基礎為導向的選擇，它優先考慮在未充分利用的步驟上給予較高注意力的步驟；以及以新穎性為導向的選擇，它鼓勵具有新穎結論的步驟。在理由產生期間，我們使用一種自我基礎策略，它提示 LLM 在每個步驟的推論之前明確引用相關的先前步驟以提供前提。在四個推理資料集上的實驗結果顯示，我們的做法透過產生品質更高的理由（錯誤和冗餘減少）來改善推理準確度。

Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

2502.15334v1 by Pedram Zaree, Md Abdullah Al Mamun, Quazi Mishkatul Alam, Yue Dong, Ihsen Alouani, Nael Abu-Ghazaleh

Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves 91.2% ASR, vs. 67.9% for the original attack on Llama2-7B/AdvBench, using less than a third of the generation time).

摘要：最近的研究表明，精心设计的越狱输入可以诱导大型语言模型产生有害输出，尽管有对齐等安全措施。重要的是要预见到潜在越狱攻击的范围，以指导有效的防御和对模型安全的准确评估。在本文中，我们提出了一种生成高效越狱攻击的新方法，该方法操纵模型的注意力，以选择性地加强或削弱提示不同部分之间的注意力。通过利用注意力损失，我们开发出更有效的越狱攻击，这些攻击也是可转移的。这些攻击放大了现有越狱算法的成功率，包括 GCG、AutoDAN 和 ReNeLLM，同时降低了它们的生成成本（例如，放大的 GCG 攻击实现了 91.2% 的 ASR，而原始攻击在 Llama2-7B/AdvBench 上为 67.9%，使用的时间不到生成时间的 1/3）。

2502.15332v1 by Puneet Prashar, Krishna Mohan Shukla, Adam Jatowt

The ability to automatically identify whether an entity is referenced in a future context can have multiple applications, including decision making, planning, and trend forecasting. This paper focuses on detecting implicit future references in entity-centric texts, addressing the growing need for automated temporal analysis in information processing. We first present a novel dataset of 19,540 sentences built around popular entities sourced from Wikipedia, which consists of future-related and non-future-related contexts in which those entities appear. As a second contribution, we evaluate the performance of several language models, including Large Language Models (LLMs), on the task of distinguishing future-oriented content in the absence of explicit temporal references.

摘要：自動識別實體是否在未來語境中被引用的能力，可以有許多應用，包括決策制定、規劃和趨勢預測。本文重點在於偵測以實體為中心的文本中的隱含未來引用，以滿足資訊處理中對自動化時間分析日益增長的需求。我們首先提出一個新穎的資料集，該資料集包含圍繞從 Wikipedia 擷取的熱門實體建立的 19,540 個句子，其中包含這些實體出現的與未來相關和與未來無關的語境。作為第二個貢獻，我們評估了多個語言模型（包括大型語言模型 (LLM)）在沒有明確時間引用的情況下區分面向未來的內容的表現。

Lightweight yet Efficient: An External Attentive Graph Convolutional Network with Positional Prompts for Sequential Recommendation

2502.15331v1 by Jinyu Zhang, Chao Li, Zhongying Zhao

Graph-based Sequential Recommender systems (GSRs) have gained significant research attention due to their ability to simultaneously handle user-item interactions and sequential relationships between items. Current GSRs often utilize composite or in-depth structures for graph encoding (e.g., the Graph Transformer). Nevertheless, they have high computational complexity, hindering deployment on resource-constrained edge devices. Moreover, the relative position encoding in the Graph Transformer has difficulty capturing the complicated positional dependencies within sequences. To this end, we propose an External Attentive Graph convolutional network with Positional prompts for Sequential recommendation, namely EA-GPS. Specifically, we first introduce an external attentive graph convolutional network that linearly measures the global associations among nodes via two external memory units. Then, we present a positional prompt-based decoder that explicitly treats the absolute item positions as external prompts. By introducing length-adaptive sequential masking and a soft attention network, such a decoder helps the model capture the long-term positional dependencies and contextual relationships within sequences. Extensive experimental results on five real-world datasets demonstrate that the proposed EA-GPS outperforms the state-of-the-art methods. Remarkably, it achieves superior performance while maintaining a smaller parameter size and lower training overhead. The implementation of this work is publicly available at https://github.com/ZZY-GraphMiningLab/EA-GPS.

摘要：圖形化序列推薦系統 (GSR) 由於其同時處理使用者與物品互動和物品之間順序關係的能力，因此獲得了顯著的研究關注。目前的 GSR 經常利用複合或深度結構進行圖形編碼（例如，圖形Transformer）。儘管如此，它們具有很高的計算複雜度，阻礙了在受資源限制的邊緣裝置上的部署。此外，圖形Transformer中的相對位置編碼難以考慮序列中的複雜位置依賴性。為此，我們提出了帶有位置提示的外部注意圖形卷積網路，用於序列推薦，即 EA-GPS。具體來說，我們首先引入一個外部注意圖形卷積網路，通過兩個外部記憶體單元線性測量節點之間的全局關聯。然後，我們提出了一個基於位置提示的解碼器，將絕對項目位置明確地視為外部提示。通過引入長度自適應序列遮罩和軟注意力網路，這樣的解碼器有助於模型捕捉序列中的長期位置依賴性和上下文關係。在五個真實世界資料集上的廣泛實驗結果表明，所提出的 EA-GPS 優於最先進的方法。值得注意的是，它在保持較小的參數大小和較低的訓練開銷的同時，實現了卓越的性能。這項工作的實作公開於 https://github.com/ZZY-GraphMiningLab/EA-GPS。

Road Traffic Sign Recognition method using Siamese network Combining Efficient-CNN based Encoder

2502.15307v1 by Zhenghao Xi, Yuchao Shao, Yang Zheng, Xiang Liu, Yaqi Liu, Yitong Cai

Traffic sign recognition (TSR) plays an essential role in assisted driving and intelligent transportation systems. However, noise in complex environments may lead to motion blur or occlusion, which makes real-time recognition with high accuracy and robustness a tough challenge. In this article, we propose IECES-network, which combines improved encoders with a Siamese network. Our three-stage approach comprises Efficient-CNN based encoders, a Siamese backbone, and fully-connected layers. We first use convolutional encoders to extract and encode the traffic sign features of augmented training samples and standard images. Then, we design a Siamese neural network with an Efficient-CNN based encoder and a contrastive loss function, which can be trained to improve the robustness of TSR on motion-blurred and occluded samples by computing the distance between inputs and templates. Additionally, the template branch of the proposed network can be disabled when executing recognition tasks after training, which raises the processing speed of our real-time model and reduces its computational resource requirements and parameter scale. Finally, we recombine the feature code with a fully-connected layer and a SoftMax function to classify the codes of samples and recognize the category of traffic signs. Experiments on the Tsinghua-Tencent 100K dataset and the German Traffic Sign Recognition Benchmark dataset demonstrate the performance of the proposed IECES-network. Compared with other state-of-the-art methods, under motion blur and occlusion the proposed method achieves competitive performance, with precision, recall, and accuracy averaging 88.1%, 86.43%, and 86.1%, respectively, at a lightweight scale of 2.9M. Moreover, the processing time of our model is 0.1s per frame, a speed increase of 1.5 times compared with existing methods.

摘要：交通標誌識別 (TSR) 在輔助駕駛和智慧交通系統中扮演著重要的角色。然而，複雜環境的雜訊可能會導致動態模糊或遮擋問題，這對高準確度和強健的即時辨識提出了嚴峻的挑戰。在本文中，我們提出了一種改進編碼器和 Siamese 網路的 IECES 網路。我們的方法採用三階段方法，包括基於 Efficient-CNN 的編碼器、Siamese 主幹和全連接層。我們首先使用卷積編碼器來提取和編碼擴充訓練樣本和標準影像的交通標誌特徵。然後，我們設計了基於 Efficient-CNN 編碼器和對比損失函數的 Siamese 神經網路，它可以在輸入和範本之間計算距離時，訓練以提高 TSR 問題在面對動態模糊和遮擋的樣本時的穩健性。此外，在訓練後執行辨識任務時，可以停止提議網路的範本分支，以提高我們即時模型的處理速度，並減輕運算資源和參數規模。最後，我們將特徵碼和一個帶有 SoftMax 函數的全連接層重新組合，以分類樣本的碼並辨識交通標誌的類別。在清華-騰訊 100K 資料集和德國交通標誌辨識基準資料集上的實驗結果證明了所提出的 IECES 網路的效能。與其他最先進的方法相比，在動態模糊和遮擋環境中，所提出的方法達到了競爭性的效能，精確度召回率和準確度指標平均分別為 88.1%、86.43% 和 86.1%，且輕量級規模為 2.9M。此外，我們模型的處理時間為每幀 0.1 秒，其速度比現有方法提高了 1.5 倍。
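The contrastive loss mentioned in the abstract above operates on the distance between an input embedding and a template embedding. The sketch below shows one widely used margin-based form of that loss; the abstract does not give the exact formulation used in IECES-network, so the function names, the `margin` parameter, and the omission of any scaling constant are assumptions made for illustration.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(embedding, template, same_class, margin=1.0):
    """Margin-based contrastive loss for one (input, template) pair.

    Matching pairs are pulled together (loss grows with distance);
    non-matching pairs are pushed apart until they are at least
    `margin` away, after which they contribute no loss.
    """
    d = euclidean_distance(embedding, template)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Hypothetical 2-dimensional embeddings of a blurred sign and a clean template.
blurred_sign = [0.0, 0.0]
clean_template = [3.0, 4.0]
loss_if_same = contrastive_loss(blurred_sign, clean_template, same_class=True)
loss_if_different = contrastive_loss(blurred_sign, clean_template, same_class=False)
```

Because the loss depends only on pairwise distances, the template branch is needed to form pairs during training, which is consistent with the abstract's remark that it can be switched off at recognition time.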

SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention

2502.15304v1 by Hong Yankun, Li Xing, Zhen Hui-Ling, Yu Xianzhi, Liu Wulong, Yuan Mingxuan

For the efficient inference of Large Language Models (LLMs), the effective compression of key-value (KV) cache is essential. Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD)-based mixed precision quantization method for K cache. Initially, K cache is transformed into latent channels using SVD basis representations. Since the values in latent channels decay rapidly and become negligible after only a few latent channels, our method then incorporates importance-aware quantization and compression for latent channels. This enables the effective allocation of higher precision to more significant channels. Theoretically, we prove that SVDq results in quantization errors (x0.1 or even lower) that are much lower than those of per-channel key quantization in the original space. Our findings based on RULER and LongBench benchmarks demonstrate that SVDq can achieve an equivalent key cache precision as low as 1.25-bit. When combined with key sparsity, it can reach a key compression ratio of up to 410x for attention computation, all while maintaining comparable model performance. Notably, our method is nearly lossless for LongBench datasets. This indicates that SVDq enables high-precision low-bit quantization, providing a more efficient solution for KV cache compression in LLMs.

摘要：對於大型語言模型 (LLM) 的有效推論，鍵值 (KV) 快取的有效壓縮至關重要。已經找出三種主要的 KV 快取壓縮技術類型，即稀疏性、通道壓縮和量化。本研究提出 SVDq，一種基於奇異值分解 (SVD) 的混合精度量化方法，用於 K 快取。最初，K 快取使用 SVD 基底表示轉換為潛在通道。由於潛在通道中的值衰減很快，並且僅在幾個潛在通道後變得可以忽略不計，因此我們的模型接著為潛在通道納入重要性感知量化和壓縮。這使得能夠有效地將更高的精度分配給更重要的通道。在理論上，我們證明 SVDq 導致的量化誤差 (x0.1 甚至更低) 遠低於原始空間中的逐通道鍵量化。我們基於 RULER 和 LongBench 基準的發現表明，SVDq 可以實現低至 1.25 位的等效鍵快取精度。當與鍵稀疏性結合使用時，它可以達到高達 410 倍的鍵壓縮比，同時保持可比較的模型效能。值得注意的是，我們的模型對於 LongBench 資料集幾乎沒有損失。這表明 SVDq 能夠進行高精度低位元量化，為 LLM 中的 KV 快取壓縮提供更有效的解決方案。
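A fractional figure such as "1.25-bit" arises naturally from mixed precision: different latent channels get different bit widths, and the equivalent precision is their average. The sketch below works through that arithmetic. The bit schedule, the 128-channel head size, the 16-bit baseline, and the assumption that quantization and sparsity ratios simply multiply are all hypothetical choices for illustration; none of them is taken from the abstract, and storage overhead for scales or the SVD basis is ignored.

```python
def average_bits(schedule, num_channels):
    """Average bit width of a mixed-precision schedule.

    `schedule` is a list of (channel_count, bits) pairs, ordered from the
    most significant latent channels to the least significant.
    """
    covered = sum(count for count, _ in schedule)
    if covered != num_channels:
        raise ValueError("schedule must cover every latent channel exactly once")
    return sum(count * bits for count, bits in schedule) / num_channels

# Hypothetical schedule for 128 latent channels: a few leading channels keep
# high precision, and the long tail of negligible channels is dropped (0 bits).
hypothetical_schedule = [(8, 8), (8, 4), (32, 2), (80, 0)]

avg_bits = average_bits(hypothetical_schedule, 128)
# Compression from quantization alone, relative to an assumed 16-bit baseline.
quant_ratio = 16 / avg_bits
# Sparsity factor that would be needed to reach 410x if the ratios multiply.
implied_sparsity = 410 / quant_ratio
```

Under these assumptions the schedule averages 1.25 bits per value, quantization alone gives 12.8x, and a further sparsity factor of roughly 32x would account for the 410x figure.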

Beyond Fixed Variables: Expanding-variate Time Series Forecasting via Flat Scheme and Spatio-temporal Focal Learning

2502.15296v1 by Minbo Ma, Kai Tang, Huan Li, Fei Teng, Dalin Zhang, Tianrui Li

Multivariate Time Series Forecasting (MTSF) has long been a key research focus. Traditionally, these studies assume a fixed number of variables, but in real-world applications, Cyber-Physical Systems often expand as new sensors are deployed, increasing variables in MTSF. In light of this, we introduce a novel task, Expanding-variate Time Series Forecasting (EVTSF). This task presents unique challenges, specifically (1) handling inconsistent data shapes caused by adding new variables, and (2) addressing imbalanced spatio-temporal learning, where expanding variables have limited observed data due to the necessity for timely operation. To address these challenges, we propose STEV, a flexible spatio-temporal forecasting framework. STEV includes a new Flat Scheme to tackle the inconsistent data shape issue, which extends the graph-based spatio-temporal modeling architecture into 1D space by flattening the 2D samples along the variable dimension, making the model variable-scale-agnostic while still preserving dynamic spatial correlations through a holistic graph. We introduce a novel Spatio-temporal Focal Learning strategy that incorporates a negative filter to resolve potential conflicts between contrastive learning and graph representation, and a focal contrastive loss as its core to guide the framework to focus on optimizing the expanding variables. We benchmark EVTSF performance using three real-world datasets and compare it against three potential solutions employing SOTA MTSF models tailored for EVTSF. Experimental results show that STEV significantly outperforms its competitors, particularly on expanding variables. Notably, STEV, with only 5% of observations from the expanding period, is on par with SOTA MTSF models trained with complete observations. Further exploration of various expanding strategies underscores the generalizability of STEV in real-world applications.

摘要：多元時間序列預測 (MTSF) 長期以來一直是研究的重點。傳統上，這些研究假設有固定數量的變數，但在實際應用中，網路物理系統通常會隨著新感測器的部署而擴充，增加 MTSF 中的變數。有鑑於此，我們引入了一項新任務，擴充變數時間序列預測 (EVTSF)。此任務提出了獨特的挑戰，特別是：(1) 處理因新增變數而導致的不一致資料形狀，以及 (2) 解決不平衡的時空學習，其中擴充變數由於及時運作的必要性而有受限的觀察資料。為了應對這些挑戰，我們提出了 STEV，一個靈活的時空預測架構。STEV 包含一個新的平面架構來處理不一致的資料形狀問題，它透過沿著變數維度壓平 2D 樣本，將基於圖形的時空建模架構延伸到 1D 空間，讓模型與變數規模無關，同時透過整體圖形保留動態空間相關性。我們引入了一種新的時空焦點學習策略，其中包含一個負向濾波器來解決對比學習和圖形表示之間的潛在衝突，並以焦點對比損失作為其核心，引導架構專注於最佳化擴充變數。我們使用三個真實世界資料集對 EVTSF 效能進行基準測試，並將其與採用針對 EVSTF 量身打造的 SOTA MTSF 模型的三種潛在解決方案進行比較。實驗結果顯示，STEV 明顯優於其競爭對手，特別是在擴充變數方面。值得注意的是，STEV 僅使用擴充期間 5% 的觀察值，就與使用完整觀察值訓練的 SOTA MTSF 模型不相上下。進一步探討各種擴充策略，強調了 STEV 在實際應用中的泛化性。

Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference

2502.15294v1 by Yaohua Tang, Zhicheng Hu, Kun Cheng, Fan Mo, Qiheng Lv, Hua Wang, Zhi Chen

The increasing context window size in large language models (LLMs) has improved their ability to handle complex, long-text tasks. However, as conversation rounds accumulate, a large amount of KV cache must be stored in GPU memory, which significantly affects the efficiency and even the availability of model serving systems. This paper analyzes dialogue data from real users and discovers that LLM inference manifests a watershed layer, after which the distribution of round-level attention shows notable similarity. We propose Round Attention, a novel round-level attention mechanism that only recalls and computes the KV cache of the most relevant rounds. The experiments show that our method saves 55% of memory usage without compromising model performance.

摘要：大型語言模型 (LLM) 中不斷增加的上下文視窗大小已提升其處理複雜、長文本任務的能力。然而，隨著對話回合的持續進行，需要在 GPU 記憶體中儲存大量的 KV 快取，這會顯著影響模型服務系統的效率，甚至可用性。本文分析了來自真實使用者的對話資料，並發現 LLM 推論呈現分水嶺層，在此之後，回合層級注意力的分布顯示出顯著的相似性。我們提出回合注意力，一種新穎的回合層級注意力機制，它只會回溯並計算最相關回合的 KV 快取。實驗顯示，我們的模型在不影響模型效能的情況下，節省了 55% 的記憶體使用量。
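The idea of recalling only the KV cache of the most relevant rounds can be sketched as a top-k selection over per-round relevance scores. How Round Attention actually derives those scores and chooses how many rounds to keep is not described in the abstract, so the scores, the fixed `k`, and the function names below are hypothetical.

```python
def select_rounds(round_scores, k):
    """Indices of the k highest-scoring rounds, returned in chronological order."""
    if k < 0:
        raise ValueError("k must be non-negative")
    ranked = sorted(range(len(round_scores)),
                    key=lambda i: round_scores[i], reverse=True)
    return sorted(ranked[:k])

def kv_tokens_kept(round_lengths, selected):
    """Number of cached tokens that must be loaded for the selected rounds."""
    return sum(round_lengths[i] for i in selected)

# Hypothetical relevance of five earlier rounds to the current query,
# and the number of tokens each round contributed to the KV cache.
round_scores = [0.05, 0.40, 0.10, 0.30, 0.15]
round_lengths = [100, 200, 50, 150, 100]

selected = select_rounds(round_scores, k=2)
kept = kv_tokens_kept(round_lengths, selected)
fraction_kept = kept / sum(round_lengths)
```

Selecting whole rounds rather than individual tokens keeps each recalled span contiguous, so the cache can be moved in a few large blocks instead of many scattered ones.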

CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models

2502.15278v1 by Shunchang Liu, Zhuan Shi, Lingjuan Lyu, Yaochu Jin, Boi Faltings

Assessing whether AI-generated images are substantially similar to copyrighted works is a crucial step in resolving copyright disputes. In this paper, we propose CopyJudge, an automated copyright infringement identification framework that leverages large vision-language models (LVLMs) to simulate practical court processes for determining substantial similarity between copyrighted images and those generated by text-to-image diffusion models. Specifically, we employ an abstraction-filtration-comparison test framework with multi-LVLM debate to assess the likelihood of infringement and provide detailed judgment rationales. Based on the judgments, we further introduce a general LVLM-based mitigation strategy that automatically optimizes infringing prompts by avoiding sensitive expressions while preserving the non-infringing content. In addition, our approach can be enhanced by exploring non-infringing noise vectors within the diffusion latent space via reinforcement learning, even without modifying the original prompts. Experimental results show that our identification method achieves performance comparable to the state of the art, while offering superior generalization and interpretability across various forms of infringement, and that our mitigation method could more effectively mitigate memorization and IP infringement without losing non-infringing expressions.

摘要：評估 AI 生成的影像是否與受著作權保護的作品實質上相似，是解決著作權爭議的關鍵步驟。在本文中，我們提出 CopyJudge，一個自動化的著作權侵權識別架構，它利用大型視覺語言模型 (LVLMs) 模擬實際的法院程序，以確定受著作權保護的影像與文字轉影像擴散模型所產生的影像之間的實質相似性。具體來說，我們採用抽象過濾比較測試架構，並搭配多個 LVLM 辯論，以評估侵權的可能性並提供詳細的判決依據。根據判決，我們進一步引入一個基於 LVLM 的通用緩解策略，該策略透過避免敏感表達，同時保留非侵權內容，自動最佳化侵權提示。此外，我們的做法可以透過強化學習，在擴散潛在空間中探索非侵權雜訊向量，即使不修改原始提示，也能獲得改善。實驗結果顯示，我們的識別方法達到了與現有技術相當的效能，同時在各種形式的侵權中提供了卓越的泛化性和可解釋性，而我們的緩解方法可以更有效地減輕記憶和智慧財產權侵權，而不會失去非侵權表達。

Analyzing the Inner Workings of Transformers in Compositional Generalization

2502.15277v1 by Ryoma Kumon, Hitomi Yanaka

The compositional generalization abilities of neural models have been sought after for human-like linguistic competence. The popular method to evaluate such abilities is to assess the models' input-output behavior. However, that does not reveal the internal mechanisms, and the underlying competence of such models in compositional generalization remains unclear. To address this problem, we explore the inner workings of a Transformer model by finding an existing subnetwork that contributes to the generalization performance and by performing causal analyses on how the model utilizes syntactic features. We find that the model depends on syntactic features to output the correct answer, but that the subnetwork with much better generalization performance than the whole model relies on a non-compositional algorithm in addition to the syntactic features. We also show that the subnetwork improves its generalization performance relatively slowly during the training compared to the in-distribution one, and the non-compositional solution is acquired in the early stages of the training.

摘要：神经模型的組合概化能力一直是人類語言能力的追求。評估這種能力的流行方法是評估模型的輸入輸出行為。然而，這並未揭示內部機制，而這種模型在組合概化中的基本能力仍然不明確。為了解決這個問題，我們透過尋找一個有助於概化效能的現有子網路，並對模型如何利用句法特徵進行因果分析，來探討 Transformer 模型的內部運作。我們發現模型依賴於句法特徵來輸出正確的答案，但概化效能遠優於整個模型的子網路依賴於非組合演算法以及句法特徵。我們還表明，與分佈內子網路相比，子網路在訓練期間提高其概化效能的速度相對較慢，並且在訓練的早期階段獲得非組合解。

A Training-free LLM-based Approach to General Chinese Character Error Correction

2502.15266v1 by Houquan Zhou, Bo Zhang, Zhenghua Li, Ming Yan, Min Zhang

Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated. This issue limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from CCTC and Lemon datasets. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to be on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.

摘要：中文拼寫更正 (CSC) 是一項重要的任務，旨在更正中文文本中的字元錯誤。雖然傳統的 CSC 專注於因誤打字而造成的字元替換錯誤，但另外兩種常見的字元錯誤類型，即遺漏和多餘字元，卻較少受到關注。這些錯誤在註解過程中通常會從 CSC 資料集中排除，或在評估時被忽略，即使它們已被註解。此問題限制了 CSC 任務的實用性。為了解決這個問題，我們引入了通用中文字元錯誤更正 (C2EC) 的任務，它專注於所有三種類型的字元錯誤。我們通過結合並手動驗證來自 CCTC 和 Lemon 資料集的資料，構建了一個高品質的 C2EC 基準。我們通過使用 Levenshtein 距離來處理長度變化，並利用基於提示的附加大型語言模型 (LLM) 來提高效能，將無訓練無提示的 CSC 方法擴展到 C2EC。實驗表明，我們的模型使 14B 參數的 LLM 在傳統 CSC 和 C2EC 任務上與大近 50 倍的模型不相上下，且無需任何微調。
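The three error types named in the abstract above (substituted, missing, and redundant characters) correspond directly to the three edit operations of Levenshtein distance. The sketch below computes the distance and backtraces one optimal alignment to label each edit. It is a generic textbook implementation, not the method from the paper, and the function name and the operation labels are chosen for this illustration.

```python
def edit_operations(source, target):
    """Levenshtein distance from source to target plus one optimal edit script.

    Each operation is (kind, source_char, target_char), where kind is
    "substitute", "redundant" (extra character in source), or
    "missing" (character absent from source).
    """
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # drop a redundant character
                             dist[i][j - 1] + 1,          # add a missing character
                             dist[i - 1][j - 1] + cost)   # keep or substitute

    # Walk back from the bottom-right corner to recover the edits.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("redundant", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("missing", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops

# Placeholder strings; Python strings index by code point, so the same
# function applies unchanged to Chinese text.
redundant_case = edit_operations("abcd", "abd")
missing_case = edit_operations("abd", "abcd")
substitution_case = edit_operations("cat", "cut")
```

Conventional CSC only needs the substitution case because input and output have equal length; allowing the other two operations is what lets the output length change.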

Retrieval-Augmented Speech Recognition Approach for Domain Challenges

2502.15264v1 by Peng Shen, Xugang Lu, Hisashi Kawai

Speech recognition systems often face challenges due to domain mismatch, particularly in real-world applications where domain-specific data is unavailable because of data accessibility and confidentiality constraints. Inspired by Retrieval-Augmented Generation (RAG) techniques for large language models (LLMs), this paper introduces an LLM-based retrieval-augmented speech recognition method that incorporates domain-specific textual data at the inference stage to enhance recognition performance. Rather than relying on domain-specific textual data during the training phase, our model is trained to learn how to utilize textual information provided in prompts for the LLM decoder to improve speech recognition performance. Benefiting from the advantages of the RAG retrieval mechanism, our approach efficiently accesses locally available domain-specific documents, ensuring a convenient and effective process for solving domain mismatch problems. Experiments conducted on the CSJ database demonstrate that the proposed method significantly improves speech recognition accuracy and achieves state-of-the-art results on the CSJ dataset, even without relying on the full training data.

摘要：語音辨識系統通常會因為領域不符而面臨挑戰，特別是在實際的應用中，因為資料取得和機密限制，無法取得特定領域的資料。本論文受到大型語言模型（LLM）的檢索增強生成（RAG）技術啟發，介紹了一種基於 LLM 的檢索增強語音辨識方法，它在推論階段納入了特定領域的文字資料，以增強辨識效能。我們的模型並非在訓練階段依賴特定領域的文字資料，而是訓練模型學習如何利用提示中提供的文字資訊，以改善 LLM 解碼器的語音辨識效能。我們的做法受益於 RAG 檢索機制的優點，能有效存取當地可用的特定領域文件，確保解決領域不符問題的過程既便利又有效。在 CSJ 資料庫上進行的實驗證明，所提出的方法大幅改善了語音辨識的準確度，即使不依賴完整的訓練資料，也能在 CSJ 資料集上達成最先進的結果。

Corrections Meet Explanations: A Unified Framework for Explainable Grammatical Error Correction

2502.15261v1 by Jingheng Ye, Shang Qin, Yinghui Li, Hai-Tao Zheng, Shen Wang, Qingsong Wen

Grammatical Error Correction (GEC) faces a critical challenge concerning explainability, notably when GEC systems are designed for language learners. Existing research predominantly focuses on explaining grammatical errors extracted in advance, thus neglecting the relationship between explanations and corrections. To address this gap, we introduce EXGEC, a unified explainable GEC framework that integrates explanation and correction tasks in a generative manner, advocating that these tasks mutually reinforce each other. Experiments have been conducted on EXPECT, a recent human-labeled dataset for explainable GEC, comprising around 20k samples. Moreover, we detect significant noise within EXPECT, potentially compromising model training and evaluation. Therefore, we introduce an alternative dataset named EXPECT-denoised, ensuring a more objective framework for training and evaluation. Results on various NLP models (BART, T5, and Llama3) show that EXGEC models surpass single-task baselines in both tasks, demonstrating the effectiveness of our approach.

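Summary: Grammatical error correction (GEC) faces a major explainability challenge, particularly when GEC systems are designed for language learners. Existing research mainly explains grammatical errors extracted in advance and therefore overlooks the relationship between explanations and corrections. To close this gap, we introduce EXGEC, a unified explainable GEC framework that integrates the explanation and correction tasks generatively, on the premise that the two tasks reinforce each other. Experiments were run on EXPECT, a recent human-labeled dataset for explainable GEC containing about 20k samples. We also detect substantial noise in EXPECT that may harm model training and evaluation, so we introduce an alternative dataset, EXPECT-denoised, to provide a more objective framework for training and evaluation. Results on several NLP models (BART, T5, and Llama3) show that EXGEC models outperform single-task baselines on both tasks, demonstrating the effectiveness of the approach.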

LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design

2502.15260v1 by Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang, Meng Li

State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65x to 6.06x higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43x that of the GPU baseline.

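Summary: State space models (SSMs) such as Mamba have recently attracted considerable attention. Compared with Transformer-based large language models (LLMs), Mamba achieves computation that scales linearly with sequence length and delivers strong performance. However, scattered activation outliers and complex computation dependencies make Mamba hard to accelerate, which leaves existing LLM accelerators inefficient. In this paper we propose LightMamba, which co-designs the quantization algorithm and the FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm featuring rotation-assisted quantization and power-of-two SSM quantization, reducing most of the computation to 4 bits. We then design an FPGA accelerator that partially unrolls the Mamba computation to balance efficiency against hardware cost. Computation reordering together with fine-grained tiling and fusion markedly improves the accelerator's hardware utilization and memory efficiency. We implement LightMamba on a Xilinx Versal VCK190 FPGA and achieve 4.65x to 6.06x higher energy efficiency than the GPU baseline. When evaluated on an Alveo U280 FPGA, LightMamba reaches 93 tokens/s, 1.43x that of the GPU baseline.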
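
The power-of-two quantization idea mentioned in the LightMamba abstract can be illustrated with a minimal sketch. It assumes plain symmetric per-tensor quantization with a scale rounded up to a power of two, so that rescaling reduces to a bit shift in hardware. The function names (`pow2_scale`, `quantize`, `dequantize`) are hypothetical, and the rotation-assisted step and the authors' actual SSM quantizer are not reproduced here.

```python
import math


def pow2_scale(max_abs: float, bits: int = 4) -> float:
    """Smallest power-of-two scale s with max_abs / s <= largest signed code."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    if max_abs == 0:
        return 1.0
    exact = max_abs / qmax  # ideal (non power-of-two) scale
    return 2.0 ** math.ceil(math.log2(exact))


def quantize(values, bits: int = 4):
    """Symmetric quantization to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    scale = pow2_scale(max(abs(v) for v in values), bits)
    # Round to the nearest integer code and clamp into the representable range.
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    codes, scale = quantize([0.5, -1.0, 3.5])
    print(codes, scale, dequantize(codes, scale))
```

Because the scale is a power of two, multiplying or dividing by it needs no general-purpose multiplier, which is one reason such schemes are attractive on FPGAs.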

Comparative Analysis of Large Language Models for Context-Aware Code Completion using SAFIM Framework

2502.15243v1 by Hang Zhang, Yanxin Shen, Lun Wang, Chuanqi Shi, Shaoshuai Du, Yiyi Tao, Yixian Shen

The advent of Large Language Models (LLMs) has revolutionized code completion, transforming it into a more intelligent and context-aware feature in modern integrated development environments. These advancements have significantly enhanced developers' ability to write efficient and error-free code. This study evaluates the performance of several chat-based LLMs, including Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, GPT-4o-mini, and GPT-4 Turbo, using the Syntax-Aware Fill-in-the-Middle (SAFIM) dataset. This benchmark is specifically designed to assess models' capabilities in syntax-sensitive code generation. Performance metrics, such as cosine similarity with ground-truth completions and latency, were employed to measure both accuracy and efficiency. The findings reveal substantial differences in the models' code completion abilities, offering valuable insights into their respective strengths and weaknesses. This work provides a comparative analysis that underscores the trade-offs between accuracy and speed, establishing a benchmark for future advancements in LLM-based code completion.

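Summary: The advent of large language models (LLMs) has transformed code completion into a more intelligent, context-aware feature of modern integrated development environments. These advances have greatly improved developers' ability to write efficient, error-free code. This study evaluates several chat-based LLMs, including Gemini 1.5 Flash, Gemini 1.5 Pro, GPT-4o, GPT-4o-mini, and GPT-4 Turbo, on the Syntax-Aware Fill-in-the-Middle (SAFIM) dataset, a benchmark designed specifically to assess syntax-sensitive code generation. Performance metrics such as cosine similarity with the ground-truth completion and latency are used to measure accuracy and efficiency. The results reveal marked differences in the models' code completion abilities and offer useful insight into their respective strengths and weaknesses. The comparative analysis highlights the trade-off between accuracy and speed and establishes a benchmark for future progress in LLM-based code completion.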
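
The two metrics named in the SAFIM comparison, cosine similarity against the ground-truth completion and latency, can be sketched as follows. The abstract does not say how completions are vectorized, so this sketch assumes a simple bag-of-tokens representation. The helper names (`token_counts`, `cosine_similarity`, `timed_completion`) are hypothetical, and `complete` stands in for any model call.

```python
import math
import re
import time
from collections import Counter


def token_counts(code: str) -> Counter:
    """Count identifier/number tokens and single punctuation characters."""
    return Counter(re.findall(r"\w+|[^\w\s]", code))


def cosine_similarity(predicted: str, reference: str) -> float:
    """Cosine similarity between bag-of-tokens vectors of two code snippets."""
    a, b = token_counts(predicted), token_counts(reference)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def timed_completion(complete, prompt: str):
    """Run a completion callable and return (output, latency in seconds)."""
    start = time.perf_counter()
    output = complete(prompt)
    return output, time.perf_counter() - start


if __name__ == "__main__":
    # A stand-in "model" that always returns the same completion.
    out, latency = timed_completion(lambda p: "return a + b", "def add(a, b):")
    print(cosine_similarity(out, "return a + b"), latency)
```

An embedding-based similarity would follow the same shape, with `token_counts` replaced by an embedding lookup.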

A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation

2502.15233v1 by Shilong Hou, Ruilin Shang, Zi Long, Xianghua Fu, Yin Chen

An increasing number of companies have begun providing services that leverage cloud-based large language models (LLMs), such as ChatGPT. However, this development raises substantial privacy concerns, as users' prompts are transmitted to and processed by the model providers. Among the various privacy protection methods for LLMs, those implemented during the pre-training and fine-tuning phases fail to mitigate the privacy risks associated with the remote use of cloud-based LLMs by users. On the other hand, methods applied during the inference phase are primarily effective in scenarios where the LLM's inference does not rely on privacy-sensitive information. In this paper, we outline the process of remote user interaction with LLMs and, for the first time, propose a detailed definition of a general pseudonymization framework applicable to cloud-based LLMs. The experimental results demonstrate that the proposed framework strikes an optimal balance between privacy protection and utility. The code for our method is available to the public at https://github.com/Mebymeby/Pseudonymization-Framework.

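Summary: A growing number of companies have begun offering services built on cloud-based large language models (LLMs) such as ChatGPT. This raises significant privacy concerns, because users' prompts are transmitted to and processed by the model provider. Among the privacy protection methods for LLMs, those applied during pre-training and fine-tuning cannot reduce the privacy risk of users remotely accessing cloud-based LLMs, while those applied at inference time are effective mainly when the LLM's inference does not depend on privacy-sensitive information. In this paper we outline how remote users interact with LLMs and, for the first time, propose a detailed definition of a general pseudonymization framework for cloud-based LLMs. Experimental results show that the proposed framework strikes the best balance between privacy protection and utility. The code is publicly available at https://github.com/Mebymeby/Pseudonymization-Framework.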
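
The general flow of pseudonymizing a prompt before it leaves the client, and restoring the original terms in the response, can be sketched as below. This is an illustration only, not the framework from the linked repository: it assumes the sensitive entities have already been detected and simply swaps them for placeholders, and the names `pseudonymize` and `restore` are hypothetical.

```python
def pseudonymize(text: str, entities):
    """Replace each known sensitive entity with a placeholder token.

    Returns the masked text and the placeholder -> entity mapping,
    which stays on the client side.
    """
    mapping = {}
    # Longest entities first so a short entity never splits a longer one.
    ordered = sorted(set(entities), key=lambda e: (-len(e), e))
    for index, entity in enumerate(ordered):
        placeholder = f"<ENT_{index}>"
        if entity in text:
            text = text.replace(entity, placeholder)
            mapping[placeholder] = entity
    return text, mapping


def restore(text: str, mapping) -> str:
    """Put the original entities back into a (remote) response."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text


if __name__ == "__main__":
    masked, mapping = pseudonymize("Alice emailed Bob.", ["Alice", "Bob"])
    print(masked)                    # text that would be sent to the cloud model
    print(restore(masked, mapping))  # original wording recovered locally
```

A real system would generate natural-looking replacement entities rather than opaque placeholders, so that the remote model can still reason over the text.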

AutoMR: A Universal Time Series Motion Recognition Pipeline

2502.15228v1 by Likun Zhang, Sicheng Yang, Zhuo Wang, Haining Liang, Junxiao Shen

In this paper, we present an end-to-end automated motion recognition (AutoMR) pipeline designed for multimodal datasets. The proposed framework seamlessly integrates data preprocessing, model training, hyperparameter tuning, and evaluation, enabling robust performance across diverse scenarios. Our approach addresses two primary challenges: 1) variability in sensor data formats and parameters across datasets, which traditionally requires task-specific machine learning implementations, and 2) the complexity and time consumption of hyperparameter tuning for optimal model performance. Our library features an all-in-one solution incorporating QuartzNet as the core model, automated hyperparameter tuning, and comprehensive metrics tracking. Extensive experiments demonstrate its effectiveness on 10 diverse datasets, achieving state-of-the-art performance. This work lays a solid foundation for deploying motion-capture solutions across varied real-world applications.

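Summary: In this paper we present an end-to-end automated motion recognition (AutoMR) pipeline designed for multimodal datasets. The framework integrates data preprocessing, model training, hyperparameter tuning, and evaluation, delivering robust performance across different scenarios. The approach addresses two main challenges: 1) variability in sensor data formats and parameters across datasets, which traditionally requires task-specific machine learning implementations, and 2) the complexity and time cost of tuning hyperparameters for the best model performance. The library provides an all-in-one solution that combines QuartzNet as the core model, automated hyperparameter tuning, and comprehensive metrics tracking. Extensive experiments demonstrate its effectiveness on 10 different datasets, reaching state-of-the-art performance. This work lays a solid foundation for deploying motion-capture solutions in a variety of real-world applications.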

Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews

2502.15226v1 by Mengqiao Liu, Tevin Wang, Cassandra A. Cohen, Sarah Li, Chenyan Xiong

Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews right after users have interacted with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then be interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, for example, the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our collected chat-and-interview logs will be released.

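Summary: Which large language model (LLM) is better? Every evaluation tells a story, but what do users actually think of current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews immediately after users interact with an LLM and automatically extracts insights about user opinions from large volumes of interview logs. We run a study with thousands of users to understand their opinions of mainstream LLMs, recruiting users to chat with a target LLM first and then be interviewed by CLUE. Our experiments show that CLUE captures interesting user opinions, such as polarized views on the reasoning process displayed by DeepSeek-R1 and demands for information freshness and multimodality. The collected chat and interview logs will be released.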

Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

2502.15224v1 by Tingting Chen, Srinivas Anumasa, Beibei Lin, Vedant Shah, Anirudh Goyal, Dianbo Liu

Given the remarkable performance of Large Language Models (LLMs), an important question arises: Can LLMs conduct human-like scientific research, discover new knowledge, and act as an AI scientist? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. In response to these limitations, we introduce a novel benchmark, Auto-Bench, that encompasses necessary aspects to evaluate LLMs for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid justifications. By engaging interactively with an oracle, the models iteratively refine their understanding of the underlying interactions, both chemical and social, through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe a significant performance drop as the problem complexity increases, which suggests an important gap between machine and human intelligence that future development of LLMs needs to take into consideration.

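Summary: Given the remarkable performance of large language models (LLMs), an important question arises: can LLMs carry out human-like scientific research, discover new knowledge, and act as AI scientists? Scientific discovery is an iterative process that requires efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions, yet no standardized benchmark designed specifically for scientific discovery exists for LLM agents. To address these limitations, we introduce a new benchmark, Auto-Bench, which covers the aspects needed to evaluate LLMs for scientific discovery in both the natural and social sciences. The benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, including generating valid justifications. By interacting with an oracle, the models iteratively refine their understanding of the underlying chemical and social interactions through strategic interventions. We evaluate state-of-the-art LLMs, including GPT-4, Gemini, Qwen, Claude, and Llama, and observe that performance drops sharply as problem complexity increases, which points to an important gap between machine and human intelligence that future LLM development needs to take into account.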

A BERT Based Hybrid Recommendation System For Academic Collaboration

2502.15223v1 by Sangeetha N, Harish Thangaraj, Varun Vashisht, Eshaan Joshi, Kanishka Verma, Diya Katariya

Universities serve as a hub for academic collaboration, promoting the exchange of diverse ideas and perspectives among students and faculty through interdisciplinary dialogue. However, as universities expand in size, conventional networking approaches via student chapters, class groups, and faculty committees become cumbersome. To address this challenge, an academia-specific profile recommendation system is proposed to connect like-minded stakeholders within any university community. This study evaluates three techniques: Term Frequency-Inverse Document Frequency (TF-IDF), Bidirectional Encoder Representations from Transformers (BERT), and a hybrid approach to generate effective recommendations. Due to the unlabelled nature of the dataset, Affinity Propagation cluster-based relabelling is performed to understand the grouping of similar profiles. The hybrid model demonstrated superior performance, evidenced by its similarity score, Silhouette score, Davies-Bouldin index, and Normalized Discounted Cumulative Gain (NDCG), achieving an optimal balance between diversity and relevance in recommendations. Furthermore, the optimal model has been implemented as a mobile application, which dynamically suggests relevant profiles based on users' skills and collaboration interests, incorporating contextual understanding. The potential impact of this application is significant, as it promises to enhance networking opportunities within large academic institutions through the deployment of intelligent recommendation systems.

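Summary: Universities serve as hubs for academic collaboration, promoting the exchange of diverse ideas and perspectives among students and faculty through interdisciplinary dialogue. As universities grow, however, conventional networking through student chapters, class groups, and faculty committees becomes cumbersome. To address this challenge, an academia-specific profile recommendation system is proposed to connect like-minded stakeholders within any university community. The study evaluates three techniques for generating effective recommendations: Term Frequency-Inverse Document Frequency (TF-IDF), Bidirectional Encoder Representations from Transformers (BERT), and a hybrid approach. Because the dataset is unlabelled, Affinity Propagation cluster-based relabelling is performed to understand how similar profiles group together. The hybrid model shows superior performance, as evidenced by its similarity score, Silhouette score, Davies-Bouldin index, and Normalized Discounted Cumulative Gain (NDCG), achieving the best balance between diversity and relevance in its recommendations. The best model has also been implemented as a mobile application that dynamically suggests relevant profiles based on users' skills and collaboration interests, incorporating contextual understanding. The potential impact of the application is significant, as it promises to improve networking opportunities within large academic institutions through intelligent recommendation systems.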
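
Of the evaluation metrics listed for the recommendation system, NDCG is the one most easily shown in a few lines. The sketch below uses the standard log2-discounted definition. The `hybrid_score` helper is a hypothetical illustration of one way to blend TF-IDF and BERT similarities with a weight `alpha`; the abstract does not state how the hybrid model actually combines them.

```python
import math


def dcg(relevances) -> float:
    """Discounted cumulative gain: gain at rank i is discounted by log2(i + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances, k=None) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ranked = relevances[:k] if k is not None else relevances
    ideal = sorted(relevances, reverse=True)[: len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


def hybrid_score(tfidf_sim: float, bert_sim: float, alpha: float = 0.5) -> float:
    """Hypothetical blend: alpha weights the lexical score, the rest the semantic one."""
    return alpha * tfidf_sim + (1.0 - alpha) * bert_sim


if __name__ == "__main__":
    # Relevance of recommended profiles, in the order they were shown.
    print(ndcg([3, 2, 1]), ndcg([1, 2, 3]))
```

A ranking that already lists the most relevant profiles first scores 1.0, and any reordering that pushes relevant profiles down scores lower.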

ESPnet-SpeechLM: An Open Speech Language Model Toolkit

2502.15218v1 by Jinchuan Tian, Jiatong Shi, William Chen, Siddhant Arora, Yoshiki Masuyama, Takashi Maekaku, Yihan Wu, Junyi Peng, Shikhar Bharadwaj, Yiwen Zhao, Samuele Cornell, Yifan Peng, Xiang Yue, Chao-Han Huck Yang, Graham Neubig, Shinji Watanabe

We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.

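Summary: We present ESPnet-SpeechLM, an open toolkit intended to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequence modeling problems, with a cohesive workflow covering data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, which streamlines SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by providing highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide several use cases showing how competitive SpeechLMs can be built with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks and evaluated across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at https://github.com/espnet/espnet/tree/speechlm.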

FormalSpecCpp: A Dataset of C++ Formal Specifications created using LLMs

2502.15217v1 by Madhurima Chakraborty, Peter Pirkelbauer, Qing Yi

FormalSpecCpp is a dataset designed to fill the gap in standardized benchmarks for verifying formal specifications in C++ programs. To the best of our knowledge, this is the first comprehensive collection of C++ programs with well-defined preconditions and postconditions. It provides a structured benchmark for evaluating specification inference tools and testing the accuracy of generated specifications. Researchers and developers can use this dataset to benchmark specification inference tools, fine-tune Large Language Models (LLMs) for automated specification generation, and analyze the role of formal specifications in improving program verification and automated testing. By making this dataset publicly available, we aim to advance research in program verification, specification inference, and AI-assisted software development. The dataset and the code are available at https://github.com/MadhuNimmo/FormalSpecCpp.

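Summary: FormalSpecCpp is a dataset intended to fill the gap in standardized benchmarks for verifying formal specifications in C++ programs. To the best of our knowledge, it is the first comprehensive collection of C++ programs with well-defined preconditions and postconditions. It provides a structured benchmark for evaluating specification inference tools and testing the accuracy of generated specifications. Researchers and developers can use the dataset to benchmark specification inference tools, fine-tune large language models (LLMs) for automated specification generation, and analyze the role of formal specifications in improving program verification and automated testing. By releasing the dataset publicly, we aim to advance research in program verification, specification inference, and AI-assisted software development. The dataset and code are available at https://github.com/MadhuNimmo/FormalSpecCpp.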

The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning

2502.15214v1 by Sheila Schoepp, Masoud Jafaripour, Yingyue Cao, Tianpei Yang, Fatemeh Abdollahi, Shadan Golestan, Zahin Sufiyan, Osmar R. Zaiane, Matthew E. Taylor

Reinforcement learning (RL) has shown impressive results in sequential decision-making tasks. Meanwhile, Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged, exhibiting impressive capabilities in multimodal understanding and reasoning. These advances have led to a surge of research integrating LLMs and VLMs into RL. In this survey, we review representative works in which LLMs and VLMs are used to overcome key challenges in RL, such as lack of prior knowledge, long-horizon planning, and reward design. We present a taxonomy that categorizes these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. We conclude by exploring open problems, including grounding, bias mitigation, improved representations, and action advice. By consolidating existing research and identifying future directions, this survey establishes a framework for integrating LLMs and VLMs into RL, advancing approaches that unify natural language and visual understanding with sequential decision-making.

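Summary: Reinforcement learning (RL) has shown impressive results in sequential decision-making tasks. Meanwhile, large language models (LLMs) and vision-language models (VLMs) have emerged with impressive capabilities in multimodal understanding and reasoning. These advances have led to a surge of research integrating LLMs and VLMs into RL. In this survey we review representative works that use LLMs and VLMs to overcome key challenges in RL, such as lack of prior knowledge, long-horizon planning, and reward design. We present a taxonomy that classifies these LLM/VLM-assisted RL approaches into three roles: agent, planner, and reward. We close by exploring open problems, including grounding, bias mitigation, improved representations, and action advice. By consolidating existing research and identifying future directions, the survey establishes a framework for integrating LLMs and VLMs into RL and advances approaches that unify natural language and visual understanding with sequential decision-making.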

Measuring AI agent autonomy: Towards a scalable approach with code inspection

2502.15212v1 by Peter Cihon, Merlin Stein, Gagan Bansal, Sam Manning, Kevin Xu

AI agents are AI systems that can achieve complex goals autonomously. Assessing the level of agent autonomy is crucial for understanding both their potential benefits and risks. Current assessments of autonomy often focus on specific risks and rely on run-time evaluations -- observations of agent actions during operation. We introduce a code-based assessment of autonomy that eliminates the need to run an AI agent to perform specific tasks, thereby reducing the costs and risks associated with run-time evaluations. Using this code-based framework, the orchestration code used to run an AI agent can be scored according to a taxonomy that assesses attributes of autonomy: impact and oversight. We demonstrate this approach with the AutoGen framework and select applications.

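Summary: AI agents are AI systems that can achieve complex goals autonomously. Assessing an agent's level of autonomy is essential for understanding its potential benefits and risks. Current autonomy assessments usually focus on specific risks and rely on run-time evaluation, that is, observing the agent's actions during operation. We introduce a code-based assessment of autonomy that removes the need to run an AI agent on specific tasks, thereby reducing the costs and risks associated with run-time evaluation. Under this code-based framework, the orchestration code used to run an AI agent can be scored against a taxonomy that assesses two attributes of autonomy: impact and oversight. We demonstrate the approach with the AutoGen framework and selected applications.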

PairBench: A Systematic Framework for Selecting Reliable Judge VLMs

2502.15210v1 by Aarash Feizi, Sai Rajeswar, Adriana Romero-Soriano, Reihaneh Rabbany, Spandana Gella, Valentina Zantedeschi, João Monteiro

As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator's desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.

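Summary: As large vision language models (VLMs) are increasingly used as automated evaluators, understanding how well they compare data pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across modalities and scenarios. Through PairBench we introduce four metrics that capture key desiderata of similarity scores: alignment with human annotations, consistency regardless of the order of the data pair, smoothness of the similarity distribution, and controllability through prompting. Our analysis shows that no model, closed- or open-source, is best on all metrics; the best choice depends on the behavior desired of the auto-evaluator (for example, a smooth versus a sharp judge), which highlights the risk of adopting VLMs as evaluators widely without thorough assessment. For instance, most VLMs struggle to keep similarity scores symmetric regardless of order. Our results also show that VLM performance on the PairBench metrics correlates closely with popular benchmarks, demonstrating its predictive power for ranking models.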
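
The order-consistency property described for PairBench (a similarity judge should return the same score for (a, b) and (b, a)) can be checked with a simple measurement. The sketch below is not PairBench's actual metric definition, which the abstract does not give. It assumes a judge exposed as a callable `score(a, b)` and reports the mean absolute gap between the two orderings, and the name `order_gap` is hypothetical.

```python
def order_gap(score, pairs) -> float:
    """Mean absolute difference between score(a, b) and score(b, a).

    0.0 means the judge is perfectly symmetric on these pairs.
    """
    if not pairs:
        return 0.0
    gaps = [abs(score(a, b) - score(b, a)) for a, b in pairs]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # Toy asymmetric "judge": favors whichever item is presented first.
    def biased(a, b):
        return len(a) / (len(a) + len(b))

    print(order_gap(biased, [("ab", "abcd"), ("xyz", "x")]))
```

In practice `score` would wrap a prompted VLM call, and the same pairs would be submitted in both orders.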

Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing

2502.15208v1 by Zhilin Wang, Yafu Li, Jianhao Yan, Yu Cheng, Yue Zhang

Dynamical systems theory provides a framework for analyzing iterative processes and evolution over time. Within such systems, repetitive transformations can lead to stable configurations, known as attractors, including fixed points and limit cycles. Applying this perspective to large language models (LLMs), which iteratively map input text to output text, provides a principled approach to characterizing long-term behaviors. Successive paraphrasing serves as a compelling testbed for exploring such dynamics, as paraphrases re-express the same underlying meaning with linguistic variation. Although LLMs are expected to explore a diverse set of paraphrases in the text space, our study reveals that successive paraphrasing converges to stable periodic states, such as 2-period attractor cycles, limiting linguistic diversity. This phenomenon is attributed to the self-reinforcing nature of LLMs, as they iteratively favour and amplify certain textual forms over others. This pattern persists with increasing generation randomness or alternating prompts and LLMs. These findings underscore inherent constraints in LLM generative capability, while offering a novel dynamical systems perspective for studying their expressive potential.

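Summary: Dynamical systems theory provides a framework for analyzing iterative processes and their evolution over time. In such systems, repeated transformations can lead to stable configurations known as attractors, including fixed points and limit cycles. Applying this perspective to large language models (LLMs), which iteratively map input text to output text, offers a principled way to characterize long-term behavior. Successive paraphrasing is a compelling testbed for exploring these dynamics, because paraphrases re-express the same underlying meaning with linguistic variation. Although LLMs are expected to explore a diverse set of paraphrases in text space, our study shows that successive paraphrasing converges to stable periodic states, such as 2-period attractor cycles, which limits linguistic diversity. The phenomenon is attributed to the self-reinforcing nature of LLMs, which repeatedly favor and amplify certain textual forms over others. The pattern persists when generation randomness is increased or when prompts and LLMs are alternated. These findings underscore inherent constraints on LLM generative capability while offering a new dynamical-systems perspective for studying their expressive potential.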
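
The notion of a paraphrase chain falling into a 2-period attractor can be illustrated by iterating a map and recording when a state recurs. The sketch below assumes exact string recurrence as the cycle criterion and uses a toy lookup table in place of an LLM. The function name `find_cycle` is hypothetical, and a study of real model outputs would more plausibly compare texts with a similarity measure than with exact equality.

```python
def find_cycle(step, start, max_iters: int = 100):
    """Iterate x -> step(x) and return (transient_length, period) on first recurrence.

    Returns None if no state repeats within max_iters iterations.
    """
    first_seen = {}
    state = start
    for iteration in range(max_iters):
        if state in first_seen:
            return first_seen[state], iteration - first_seen[state]
        first_seen[state] = iteration
        state = step(state)
    return None


if __name__ == "__main__":
    # Toy "paraphraser": after one step the text alternates between two forms.
    table = {
        "The cat sat.": "A cat was sitting.",
        "A cat was sitting.": "The cat was seated.",
        "The cat was seated.": "A cat was sitting.",
    }
    print(find_cycle(table.__getitem__, "The cat sat."))
```

Here the chain has a transient of one step and then alternates between two sentences, which is the kind of 2-period behavior the abstract describes.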

FlipConcept: Tuning-Free Multi-Concept Personalization for Text-to-Image Generation

2502.15203v1 by Young Beom Woo, Sun Eung Kim

Recently, methods that integrate multiple personalized concepts into a single image have garnered significant attention in the field of text-to-image (T2I) generation. However, existing methods experience performance degradation in complex scenes with multiple objects due to distortions in non-personalized regions. To address this issue, we propose FlipConcept, a novel approach that seamlessly integrates multiple personalized concepts into a single image without requiring additional tuning. We introduce guided appearance attention to accurately mimic the appearance of a personalized concept as intended. Additionally, we introduce mask-guided noise mixing to protect non-personalized regions during editing. Lastly, we apply background dilution to minimize attribute leakage, which is the undesired blending of personalized concept attributes with other objects in the image. In our experiments, we demonstrate that the proposed method, despite not requiring tuning, outperforms existing models in both single and multiple personalized concept inference.

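Summary: Recently, methods that integrate multiple personalized concepts into a single image have attracted considerable attention in text-to-image (T2I) generation. However, existing methods suffer degraded performance in complex scenes containing multiple objects because of distortions in non-personalized regions. To address this, we propose FlipConcept, a new approach that seamlessly integrates multiple personalized concepts into a single image without additional tuning. We introduce guided appearance attention to accurately reproduce the intended appearance of a personalized concept. We also introduce mask-guided noise mixing to protect non-personalized regions during editing. Finally, we apply background dilution to minimize attribute leakage, the unwanted blending of a personalized concept's attributes with other objects in the image. Our experiments show that the proposed method, despite requiring no tuning, outperforms existing models in both single and multiple personalized concept inference.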

TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

2502.15197v1 by Zhaoxuan Wu, Zijian Zhou, Arun Verma, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such an effective resource utilization to achieve fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to a more efficient batch inference in LLMs.

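Summary: We propose TETRIS, a new method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods, which optimize for a single request or for a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in the batch) to be accepted when verified in parallel, which reduces the number of rejected tokens and therefore the amount of wasted compute. Using resources this effectively to achieve fast inference in large language models (LLMs) is especially important for service providers with limited inference capacity. Compared with baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and makes better use of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to more efficient batch inference in LLMs.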
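
The idea of choosing draft tokens across a whole batch under a fixed verification budget can be sketched with a greedy selection. The sketch rests on two assumptions not taken from the abstract: each draft token carries an estimated acceptance probability, and a token can only be accepted if every earlier draft token of the same request was accepted, so a token's value is the product of the probabilities up to it. It is a generic greedy illustration, not the authors' algorithm, and `select_draft_tokens` is a hypothetical name.

```python
import heapq


def select_draft_tokens(draft_probs, capacity: int):
    """Pick how many leading draft tokens to verify for each request.

    draft_probs[r][j] is the estimated acceptance probability of the j-th
    draft token of request r. Returns a list with the number of tokens kept
    per request, using at most `capacity` tokens in total.
    """
    kept = [0] * len(draft_probs)
    heap = []  # entries: (-cumulative acceptance probability, request index)
    for request, probs in enumerate(draft_probs):
        if probs:
            heapq.heappush(heap, (-probs[0], request))
    while heap and capacity > 0:
        neg_cumulative, request = heapq.heappop(heap)
        kept[request] += 1
        capacity -= 1
        next_index = kept[request]
        if next_index < len(draft_probs[request]):
            # The next token is only useful if all tokens before it are accepted.
            next_neg = neg_cumulative * draft_probs[request][next_index]
            heapq.heappush(heap, (next_neg, request))
    return kept


if __name__ == "__main__":
    print(select_draft_tokens([[0.9, 0.8, 0.1], [0.5, 0.4]], capacity=3))  # [2, 1]
```

With a budget of three tokens, the first request keeps its two confident tokens and the second keeps one, while the low-probability third token of the first request is dropped rather than verified.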

Scale-Free Graph-Language Models

2502.15189v1 by Jianglin Lu, Yixuan Liu, Yitian Zhang, Yun Fu

Graph-language models (GLMs) have demonstrated great potential in graph-based semi-supervised learning. A typical GLM consists of two key stages: graph generation and text embedding, which are usually implemented by inferring a latent graph and finetuning a language model (LM), respectively. However, the former often relies on artificial assumptions about the underlying edge distribution, while the latter requires extensive data annotations. To tackle these challenges, this paper introduces a novel GLM that integrates graph generation and text embedding within a unified framework. Specifically, for graph generation, we leverage an inherent characteristic of real edge distribution--the scale-free property--as a structural prior. We unexpectedly find that this natural property can be effectively approximated by a simple k-nearest neighbor (KNN) graph. For text embedding, we develop a graph-based pseudo-labeler that utilizes scale-free graphs to provide complementary supervision for improved LM finetuning. Extensive experiments on representative datasets validate our findings on the scale-free structural approximation of KNN graphs and demonstrate the effectiveness of integrating graph generation and text embedding with a real structural prior. Our code is available at https://github.com/Jianglin954/SFGL.

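Summary: Graph-language models (GLMs) have shown great potential in graph-based semi-supervised learning. A typical GLM has two key stages, graph generation and text embedding, which are usually implemented by inferring a latent graph and fine-tuning a language model (LM), respectively. However, the former often relies on artificial assumptions about the underlying edge distribution, while the latter requires extensive data annotation. To tackle these challenges, this paper introduces a new GLM that integrates graph generation and text embedding in a unified framework. For graph generation, we use an inherent characteristic of real edge distributions, the scale-free property, as a structural prior. We unexpectedly find that this natural property can be approximated effectively by a simple k-nearest neighbor (KNN) graph. For text embedding, we develop a graph-based pseudo-labeler that uses scale-free graphs to provide complementary supervision for better LM fine-tuning. Extensive experiments on representative datasets validate our findings on the scale-free structural approximation of KNN graphs and demonstrate the effectiveness of integrating graph generation and text embedding with a real structural prior. Our code is available at https://github.com/Jianglin954/SFGL.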
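
The claim that a KNN graph can approximate a scale-free structure is a statement about its degree distribution, so a natural first check is to build a directed KNN graph and tabulate in-degrees. The sketch below assumes Euclidean distance over small feature tuples and uses only the standard library. The function names are hypothetical and this is not the code from the linked repository.

```python
import math
from collections import Counter


def knn_edges(points, k: int):
    """Directed KNN graph: an edge (i, j) means j is among i's k nearest points."""
    edges = []
    for i, p in enumerate(points):
        by_distance = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        edges.extend((i, j) for _, j in by_distance[:k])
    return edges


def in_degree_histogram(edges, num_nodes: int) -> Counter:
    """Map each in-degree value to the number of nodes having that in-degree."""
    in_degree = Counter(target for _, target in edges)
    return Counter(in_degree.get(node, 0) for node in range(num_nodes))


if __name__ == "__main__":
    points = [(0.0,), (1.0,), (2.5,), (10.0,)]
    edges = knn_edges(points, k=1)
    print(edges, in_degree_histogram(edges, len(points)))
```

Every node has out-degree k by construction, so any heavy tail appears only in the in-degrees; on a real embedding set one would plot this histogram on log-log axes to look for power-law behavior.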

LEDD: Large Language Model-Empowered Data Discovery in Data Lakes

2502.15182v1 by Qi An, Chihua Ying, Yuqing Zhu, Yihao Xu, Manwei Zhang, Jianmin Wang

Data discovery in data lakes with ever-increasing datasets has long been recognized as a big challenge in the realm of data management, especially for semantic table search and hierarchical global catalog generation. While large language models (LLMs) facilitate the processing of data semantics, challenges remain in architecting an end-to-end system that comprehensively exploits LLMs for the two semantics-related tasks. In this demo, we propose LEDD, an end-to-end system with an extensible architecture that leverages LLMs to provide hierarchical global catalogs with semantic meanings and semantic table search for data lakes. Specifically, LEDD can return semantically related tables based on natural-language specification. These features make LEDD an ideal foundation for downstream tasks such as model training and schema linking for text-to-SQL tasks. LEDD also provides a simple Python interface to facilitate the extension and the replacement of data discovery algorithms.

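Summary: Data discovery in data lakes with ever-growing datasets has long been regarded as a major challenge in data management, especially for semantic table search and hierarchical global catalog generation. Although large language models (LLMs) help with processing data semantics, challenges remain in building an end-to-end system that fully exploits LLMs for these two semantics-related tasks. In this demo we propose LEDD, an end-to-end system with an extensible architecture that uses LLMs to provide hierarchical global catalogs with semantic meaning and semantic table search for data lakes. Specifically, LEDD can return semantically related tables from a natural-language specification. These features make LEDD an ideal foundation for downstream tasks such as model training and schema linking for text-to-SQL. LEDD also provides a simple Python interface that makes it easy to extend and replace data discovery algorithms.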

LLM