LLM

| Publish Date | Title | Authors | Homepage | Code |
|---|---|---|---|---|
| 2026-04-24 | How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks | Longju Bai et al. | 2604.22750v1 | null |
| 2026-04-24 | Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities | Ilana Nguyen et al. | 2604.22749v1 | null |
| 2026-04-24 | Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond | Meng Chu et al. | 2604.22748v1 | null |
| 2026-04-24 | An Undecidability Proof for the Plan Existence Problem | Antonis Achilleos et al. | 2604.22736v1 | null |
| 2026-04-24 | Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data | Hillary Mutisya et al. | 2604.22730v1 | null |
| 2026-04-24 | Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering | Hillary Mutisya et al. | 2604.22723v1 | null |
| 2026-04-24 | Aligning Dense Retrievers with LLM Utility via Distillation | Rajinder Sandhu et al. | 2604.22722v1 | null |
| 2026-04-24 | Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought | Keshav Ramji et al. | 2604.22709v1 | null |
| 2026-04-24 | CRAFT: Clustered Regression for Adaptive Filtering of Training data | Parthasarathi Panda et al. | 2604.22693v1 | null |
| 2026-04-24 | How Supply Chain Dependencies Complicate Bias Measurement and Accountability Attribution in AI Hiring Applications | Gauri Sharma et al. | 2604.22679v1 | null |
| 2026-04-24 | BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering | Jinghong Chen et al. | 2604.22678v1 | null |
| 2026-04-24 | Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines | Negar Arabzadeh et al. | 2604.22661v1 | null |
| 2026-04-24 | Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models | Felix Herron et al. | 2604.22631v1 | null |
| 2026-04-24 | From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia | Angelo Maria Sabatini et al. | 2604.22626v1 | null |
| 2026-04-24 | Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube | Sheza Munir et al. | 2604.22606v1 | null |
| 2026-04-24 | From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification | Md Erfan et al. | 2604.22601v1 | null |
| 2026-04-24 | Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity | Erez Yosef et al. | 2604.22597v1 | null |
| 2026-04-24 | Learning Evidence Highlighting for Frozen LLMs | Shaoang Li et al. | 2604.22565v1 | null |
| 2026-04-24 | Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors | Gautam Kumar Jain et al. | 2604.22560v1 | null |
| 2026-04-24 | SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning | Jichao Wang et al. | 2604.22558v1 | null |
| 2026-04-24 | Using Embedding Models to Improve Probabilistic Race Prediction | Noan Dasanaike et al. | 2604.22555v1 | null |
| 2026-04-24 | ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders | Yongqi Jiang et al. | 2604.22550v1 | null |
| 2026-04-24 | Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners | Haidong Yuan et al. | 2604.22542v1 | null |
| 2026-04-24 | On the Properties of Feature Attribution for Supervised Contrastive Learning | Leonardo Arrighi et al. | 2604.22540v1 | null |
| 2026-04-24 | FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records | Hojjat Karami et al. | 2604.22534v1 | null |
| 2026-04-24 | RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment | Yingfeng Luo et al. | 2604.22520v1 | null |
| 2026-04-24 | Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement | Wataru Hirota et al. | 2604.22517v1 | null |
| 2026-04-24 | Measuring and Mitigating Persona Distortions from AI Writing Assistance | Paul Röttger et al. | 2604.22503v1 | null |
| 2026-04-24 | CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding | Lihao Zheng et al. | 2604.22498v1 | null |
| 2026-04-24 | On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery | Anti Alman et al. | 2604.22455v1 | null |
| 2026-04-24 | Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents | Xirui Li et al. | 2604.22452v1 | null |
| 2026-04-24 | SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking | Chenxi Gu et al. | 2604.22438v1 | null |
| 2026-04-24 | CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease | Bulent Soykan et al. | 2604.22428v1 | null |
| 2026-04-24 | Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control | Qinhan Hou et al. | 2604.22413v1 | null |
| 2026-04-24 | Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models | Alberto Messina et al. | 2604.22411v1 | null |
| 2026-04-24 | Selective Contrastive Learning For Gloss Free Sign Language Translation | Changhao Lai et al. | 2604.22374v1 | null |
| 2026-04-24 | CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language | Rui Zhao et al. | 2604.22367v1 | null |
| 2026-04-24 | Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization | Weixu Zhang et al. | 2604.22345v1 | null |
| 2026-04-24 | Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding | Weixu Zhang et al. | 2604.22335v1 | null |
| 2026-04-24 | ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding | Dongwei Sun et al. | 2604.22333v1 | null |
| 2026-04-24 | FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting | Marco Obermeier et al. | 2604.22328v1 | null |
| 2026-04-24 | Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks | Fahmida Alam et al. | 2604.22325v1 | null |
| 2026-04-24 | CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems | Tabinda Sarwar et al. | 2604.22313v1 | null |
| 2026-04-24 | BLAST: Benchmarking LLMs with ASP-based Structured Testing | Manuel Alejandro Borroto Santana et al. | 2604.22306v1 | null |
| 2026-04-24 | Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets | Harshit Joshi et al. | 2604.22294v1 | null |
| 2026-04-24 | ReLeVAnT: Relevance Lexical Vectors for Accurate Legal Text Classification | Ishaan Gakhar et al. | 2604.22292v1 | null |
| 2026-04-24 | When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention | Aofan Liu et al. | 2604.22273v1 | null |
| 2026-04-24 | Semantic Error Correction and Decoding for Short Block Channel Codes | Jiafu Hao et al. | 2604.22269v1 | null |
| 2026-04-24 | Large Language Models Decide Early and Explain Later | Ayan Datta et al. | 2604.22266v1 | null |
| 2026-04-24 | Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion | Fahmida Alam et al. | 2604.22261v1 | null |
| 2026-04-24 | Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset | Wenhui Huang et al. | 2604.22260v1 | null |
| 2026-04-24 | A Probabilistic Framework for Hierarchical Goal Recognition | Chenyuan Zhang et al. | 2604.22256v1 | null |
| 2026-04-24 | Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis | Zhilin Fan et al. | 2604.22237v1 | null |
| 2026-04-24 | A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies | Somyajit Chakraborty et al. | 2604.22227v1 | null |
| 2026-04-24 | TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis | Xi Wang et al. | 2604.22225v1 | null |
| 2026-04-24 | Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen | Jon-Paul Cacioli et al. | 2604.22215v1 | null |
| 2026-04-24 | UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions | Chunyu Qiang et al. | 2604.22209v1 | null |
| 2026-04-24 | Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations | Anna Arnaudo et al. | 2604.22207v1 | null |
| 2026-04-24 | An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments | Hong Su et al. | 2604.22199v1 | null |
| 2026-04-24 | How Large Language Models Balance Internal Knowledge with User and Document Assertions | Shuowei Li et al. | 2604.22193v1 | null |
| 2026-04-24 | Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning | Chaoran Chen et al. | 2604.22191v1 | null |
| 2026-04-24 | ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression | Xiaojie Ke et al. | 2604.22180v1 | null |
| 2026-04-24 | ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation | Peiyan Zhang et al. | 2604.22169v1 | null |
| 2026-04-24 | Estimating Tail Risks in Language Model Output Distributions | Rico Angell et al. | 2604.22167v1 | null |
| 2026-04-24 | Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models | Ryoma Kumon et al. | 2604.22166v1 | null |
| 2026-04-24 | GenMatter: Perceiving Physical Objects with Generative Matter Models | Eric Li et al. | 2604.22160v1 | null |
| 2026-04-24 | Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems | Meghana Karnam et al. | 2604.22154v1 | null |
| 2026-04-24 | When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models | Pruthvinath Jeripity Venkata et al. | 2604.22153v1 | null |
| 2026-04-24 | Recognition Without Authorization: LLMs and the Moral Order of Online Advice | Tom van Nuenen et al. | 2604.22143v1 | null |
| 2026-04-24 | Voice Under Revision: Large Language Models and the Normalization of Personal Narrative | Tom van Nuenen et al. | 2604.22142v1 | null |
| 2026-04-24 | SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs | Sihang et al. | 2604.22134v1 | null |
| 2026-04-24 | Dissociating Decodability and Causal Use in Bracket-Sequence Transformers | Aryan Sharma et al. | 2604.22128v1 | null |
| 2026-04-24 | Where Should LoRA Go? Component-Type Placement in Hybrid Language Models | Hector Borobia et al. | 2604.22127v1 | null |
| 2026-04-23 | Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework | Tharindu Kumarage et al. | 2604.22119v1 | null |
| 2026-04-23 | PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training | Harsh Kumar et al. | 2604.22117v1 | null |
| 2026-04-23 | Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations | Nalin Poungpeth et al. | 2604.22109v1 | null |
| 2026-04-23 | Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation | Arthur Jakobsson et al. | 2604.22102v1 | null |
| 2026-04-23 | Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation | Weisi Liu et al. | 2604.22098v1 | null |
| 2026-04-23 | An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation | Mykola Trokhymovych et al. | 2604.22095v1 | null |
| 2026-04-23 | Ethics Testing: Proactive Identification of Generative AI System Harms | Shin Hwei Tan et al. | 2604.22089v1 | null |
| 2026-04-23 | Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents | Seyed Moein Abtahi et al. | 2604.22085v1 | null |
| 2026-04-23 | Removing Sandbagging in LLMs by Training with Weak Supervision | Emil Ryd et al. | 2604.22082v1 | null |
| 2026-04-23 | Sound Agentic Science Requires Adversarial Experiments | Dionizije Fa et al. | 2604.22080v1 | null |
| 2026-04-23 | PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning | Xiaoyi Chen et al. | 2604.22076v1 | null |
| 2026-04-23 | Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning | Qinan Yu et al. | 2604.22074v1 | null |
| 2026-04-23 | Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning | Amine Barrak et al. | 2604.22072v1 | null |
| 2026-04-23 | Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake | Guan Gui et al. | 2604.22067v1 | null |
| 2026-04-23 | Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores | Shevya Pandya et al. | 2604.22063v1 | null |
| 2026-04-23 | Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning | Karthic Palaniappan et al. | 2604.22062v1 | null |
| 2026-04-23 | Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching | Xiaodi Li et al. | 2604.22061v1 | null |
| 2026-04-23 | LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs | Mohamed Ali Souibgui et al. | 2604.22050v1 | null |
| 2026-04-23 | Call-Chain-Aware LLM-Based Test Generation for Java Projects | Guancheng Wang et al. | 2604.22046v1 | null |
| 2026-04-23 | H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers | Ayushi Mehrotra et al. | 2604.22045v1 | null |
| 2026-04-23 | Source-Modality Monitoring in Vision-Language Models | Etha Tianze Hua et al. | 2604.22038v1 | null |
| 2026-04-23 | EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms | Brian VanVoorst et al. | 2604.22036v1 | null |
| 2026-04-23 | Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning | João Mattos et al. | 2604.22031v1 | null |
| 2026-04-23 | Shared Lexical Task Representations Explain Behavioral Variability In LLMs | Zhuonan Yang et al. | 2604.22027v1 | null |
| 2026-04-23 | Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity | Deepank Girish et al. | 2604.22018v1 | null |
| 2026-04-23 | When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation | Anamta Khan et al. | 2604.22002v1 | null |
| 2026-04-23 | Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning | Grigory Sapunov et al. | 2604.21999v1 | null |

Abstracts

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

2604.22750v1 by Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei

The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a large number of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5 consume, on average, over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.


Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

2604.22749v1 by Ilana Nguyen, Harini Suresh, Thema Monroe-White, Evan Shieh

Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., "American") are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.


Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

2604.22748v1 by Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.


An Undecidability Proof for the Plan Existence Problem

2604.22736v1 by Antonis Achilleos

The plan existence problem asks, given a goal in the form of a formula in modal logic, an initial epistemic state (a pointed Kripke model), and a set of epistemic actions, whether there exists a sequence of actions that can be applied to reach the goal. We prove that even in the case where the preconditions of the epistemic actions have modal depth at most 1, and there are no postconditions, the plan existence problem is undecidable. The (un)decidability of this problem was previously unknown.


Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

2604.22730v1 by Hillary Mutisya, John Mugane

We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources, namely the Bantu Lexical Reconstructions database (BLR3; 4,786 reconstructed Proto-Bantu forms) and the ASJP basic vocabulary, we confirm that 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain >0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.


Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

2604.22723v1 by Hillary Mutisya, John Mugane

We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (a vowel-coalescence form of wa-, in which two adjacent vowels merge; 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap), while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
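
The weighted-voting combination mentioned in this abstract can be illustrated with a minimal sketch. The source names, weights, and the `weighted_vote` function are hypothetical; the abstract does not state how the votes are weighted or how ties are broken.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine per-source label predictions by summing source weights per label.

    predictions: dict mapping source name -> predicted label
    weights: dict mapping source name -> non-negative vote weight (assumed values)
    """
    scores = defaultdict(float)
    for source, label in predictions.items():
        scores[label] += weights[source]
    # Highest total weight wins; iterating labels in sorted order makes ties deterministic.
    return max(sorted(scores), key=lambda label: scores[label])

# Hypothetical example: one transfer model and two clustering runs vote on a noun class.
predictions = {"transfer": "class_2", "cluster_a": "class_1", "cluster_b": "class_1"}
weights = {"transfer": 0.6, "cluster_a": 0.25, "cluster_b": 0.25}
result = weighted_vote(predictions, weights)
```

With these assumed weights the single transfer vote (0.6) outweighs the two clustering votes (0.5 combined), so `result` is `"class_2"`.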


Aligning Dense Retrievers with LLM Utility via Distillation

2604.22722v1 by Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid, Himanshu Rai, Maksims Volkovs, Ga Wu

Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16%, and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
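
As a rough illustration of the distribution-matching idea, the sketch below computes a cross-entropy between a target distribution built from utility scores and the retriever's distribution built from similarity scores. This is an assumed generic form, not the paper's Utility-Modulated InfoNCE objective, whose exact definition the abstract does not give; the temperatures and example scores are hypothetical.

```python
import math

def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]

def utility_matching_loss(similarities, utilities, tau_sim=0.1, tau_util=1.0):
    """Cross-entropy between a utility-derived target and the similarity-derived distribution.

    similarities: query-passage similarity scores from the bi-encoder
    utilities: per-passage utility scores (e.g. perplexity reduction), assumed given
    """
    log_q = log_softmax(similarities, tau_sim)
    p = [math.exp(v) for v in log_softmax(utilities, tau_util)]
    return -sum(p_i * lq_i for p_i, lq_i in zip(p, log_q))

# Hypothetical scores for three candidate passages.
utilities = [2.0, 0.5, 0.1]
aligned = utility_matching_loss([0.9, 0.4, 0.1], utilities)
misaligned = utility_matching_loss([0.1, 0.4, 0.9], utilities)
```

The loss is lower when the similarity ranking agrees with the utility ranking (`aligned`) than when it is reversed (`misaligned`), which is the training signal such an objective would provide.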


Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

2604.22709v1 by Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6x fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
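
Constrained decoding over a reserved vocabulary is commonly implemented by masking the logits of every token outside the allowed set before sampling. The sketch below shows that generic technique with greedy selection; the token ids, logits, and function names are hypothetical, and the paper's actual decoding procedure may differ.

```python
import math

def constrain_logits(logits, allowed_ids):
    """Set the logit of every token outside the allowed set to negative infinity."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Return the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical 5-token vocabulary where ids 3 and 4 are the reserved abstract tokens.
logits = [3.2, 0.1, 1.5, 2.7, -0.4]
codebook_ids = [3, 4]
masked = constrain_logits(logits, codebook_ids)
token = greedy_pick(masked)
```

Without the mask, greedy decoding would pick token 0; with it, the choice is restricted to the codebook and token 3 is selected.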


CRAFT: Clustered Regression for Adaptive Filtering of Training data

2604.22693v1 by Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda

Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.
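
Stage (i), proportional budget allocation across clusters, can be sketched as follows. The function name, the largest-remainder rounding rule, and the example counts are assumptions for illustration; the abstract does not say how fractional quotas are rounded, and this sketch ignores the case where a cluster holds fewer training candidates than its quota.

```python
def proportional_allocation(validation_counts, budget):
    """Split a selection budget across clusters in proportion to validation counts.

    validation_counts: number of validation source points falling in each cluster
    budget: total number of training pairs to select
    Rounding uses the largest-remainder method (an assumed choice).
    """
    total = sum(validation_counts)
    quotas = [budget * c / total for c in validation_counts]
    allocation = [int(q) for q in quotas]  # floor of each non-negative quota
    leftover = budget - sum(allocation)
    # Give the remaining slots to the clusters with the largest fractional parts.
    by_remainder = sorted(
        range(len(quotas)), key=lambda i: quotas[i] - allocation[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation

# Hypothetical example: three clusters holding 5, 3, and 2 validation points.
allocation = proportional_allocation([5, 3, 2], budget=7)
```

Here the exact quotas are 3.5, 2.1, and 1.4, so the single leftover slot goes to the first cluster and the allocation is `[4, 2, 1]`.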

How Supply Chain Dependencies Complicate Bias Measurement and Accountability Attribution in AI Hiring Applications

2604.22679v1 by Gauri Sharma, Maryam Molamohammadi

The increasing adoption of AI systems in hiring has raised concerns about algorithmic bias and accountability, prompting regulatory responses including the EU AI Act, NYC Local Law 144, and Colorado's AI Act. While existing research examines bias through technical or regulatory lenses, both perspectives overlook a fundamental challenge: modern AI hiring systems operate within complex supply chains where responsibility fragments across data vendors, model developers, platform providers, and deploying organizations. This paper investigates how these dependency chains complicate bias evaluation and accountability attribution. Drawing on literature review and regulatory analysis, we demonstrate that fragmented responsibilities create two critical problems. First, bias emerges from component interactions rather than isolated elements, yet proprietary configurations prevent integrated evaluation. A resume parser may function without bias independently but contribute to discrimination when integrated with specific ranking algorithms and filtering thresholds. Second, information asymmetries mean deploying organizations bear legal responsibility without technical visibility into vendor-supplied algorithms, while vendors control implementations without meaningful disclosure requirements. Each stakeholder may believe they are compliant; nevertheless, the integrated system may produce biased outcomes. Analysis of implementation ambiguities reveals these challenges in practice. We propose multi-layered interventions including system-level audits, vendor guidelines, continuous monitoring mechanisms, and documentation across dependency chains. Our findings reveal that effective governance requires coordinated action across technical, organizational, and regulatory domains to establish meaningful accountability in distributed development environments.

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

2604.22678v1 by Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne

A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.
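
The token-by-token posterior update can be sketched as a standard Bayesian mixture. This is only an illustration under assumptions: the per-document token probabilities below are made up, and the function names are hypothetical rather than taken from the BERAG implementation.

```python
import math

def update_posterior(log_post, token_log_likelihoods):
    """One Bayes step: posterior(d) is proportional to prior(d) * p(token | d, history)."""
    unnorm = [lp + ll for lp, ll in zip(log_post, token_log_likelihoods)]
    m = max(unnorm)
    # Log-sum-exp for a numerically stable normalizer.
    log_z = m + math.log(sum(math.exp(u - m) for u in unnorm))
    return [u - log_z for u in unnorm]

def ensemble_token_prob(log_post, token_log_likelihoods):
    """Mixture probability of a token: sum over d of posterior(d) * p(token | d, history)."""
    return sum(math.exp(lp + ll) for lp, ll in zip(log_post, token_log_likelihoods))

# Two retrieved documents with a uniform prior.
log_post = [math.log(0.5), math.log(0.5)]
# Hypothetical probabilities each document-conditioned model gives to the generated tokens.
for probs in ([0.9, 0.1], [0.8, 0.2]):
    log_post = update_posterior(log_post, [math.log(p) for p in probs])
posterior = [math.exp(lp) for lp in log_post]
print(posterior)
```

Because the posterior concentrates on documents that keep explaining the generated tokens, the same weights can serve for re-ranking and attribution, and documents with negligible weight are natural candidates for pruning.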

Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines

2604.22661v1 by Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia

Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
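
As a rough illustration of pre-retrieval variant selection, the sketch below scores each query variant with average inverse document frequency, a classic pre-retrieval predictor, and keeps the highest-scoring one. The abstract does not say which predictors were evaluated, so the choice of average IDF, the function names, and the document-frequency table are all assumptions.

```python
import math

def avg_idf(query, doc_freq, num_docs):
    """Average smoothed IDF of the query terms; higher suggests a more specific query."""
    terms = query.lower().split()
    idfs = [math.log(num_docs / (1 + doc_freq.get(t, 0))) for t in terms]
    return sum(idfs) / len(idfs) if idfs else 0.0

def select_variant(variants, doc_freq, num_docs):
    """Pick the variant with the highest predicted performance, before any retrieval runs."""
    return max(variants, key=lambda q: avg_idf(q, doc_freq, num_docs))

# Hypothetical collection statistics: term -> number of documents containing it.
doc_freq = {"the": 900, "of": 800, "symptoms": 50, "influenza": 10, "flu": 40}
variants = ["symptoms of the flu", "influenza symptoms"]
best = select_variant(variants, doc_freq, num_docs=1000)
print(best)
```

A predictor like this needs only collection statistics, which is why pre-retrieval methods are cheap compared with post-retrieval ones that inspect the ranked list.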

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

2604.22631v1 by Felix Herron, Solange Rossato, Alexandre Allauzen, François Portet

Modern automatic speech recognition (ASR) systems have been observed to function better for certain speaker groups (SGs) than others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is the lack of a nuanced understanding of the types of modeling errors that speech encoder models make, and in particular of the difference between the structure of embeddings for high-performance and low-performance SGs. This paper proposes a framework typifying two types of error that can occur in modeling phonemes in ASR systems: random error/high variance in phoneme embedding, vs systematic error/embedding bias. We find that training phoneme classification probes only on a single, typically disadvantaged SG sometimes improves performance for that SG, which is evidence for the existence of SG-level bias in phoneme embeddings. On the other hand, we find that speakers and SGs with higher levels of phoneme variance are the same as those with worse phoneme prediction accuracy. We conclude that both types of error are present in phoneme embeddings and both are candidate causes for SG-level unfairness in ASR, though random error is likely a greater hindrance to fairness than systematic error. Furthermore, we find that finetuning encoder models using a fairness-enhancing algorithm (domain enhancing and adversarial training) changes neither the benefits of in-domain phoneme classification probe training, nor measured levels of random embedding error.

From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

2604.22626v1 by Angelo Maria Sabatini

This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing the balance between persistence and alternation patterns. Across the poem, this index exhibits a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted as graphemic probes linking the Markov representation to identifiable lexical environments in the text. These probes display distinct behaviours: configurations involving two transitions more frequently emerge across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to intra-lexical structures. Part of the signal is further shaped by orthographic phenomena, particularly apostrophised forms, highlighting the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche, but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, providing an interpretable framework for linking local symbolic dynamics to higher-level textual organisation.
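
The vowel-consonant encoding and the transition counts behind a Markov model of it can be sketched as follows. The abstract does not define the four states or the memory index; this sketch assumes the states are the four ordered pairs of consecutive symbols (VV, VC, CV, CC), and the vowel set and function names are illustrative.

```python
from collections import Counter

# Assumed vowel inventory, including accented Italian vowels.
VOWELS = set("aeiouàèéìíòóùú")

def vc_encode(text):
    """Map each letter to 'V' or 'C'; spaces, apostrophes, and punctuation are dropped."""
    return "".join("V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha())

def pair_transition_counts(symbols):
    """Count transitions between overlapping pair states, e.g. 'CV' -> 'VC'."""
    states = [symbols[i:i + 2] for i in range(len(symbols) - 1)]
    return Counter(zip(states, states[1:]))

line = "Nel mezzo del cammin di nostra vita"
encoded = vc_encode(line)
counts = pair_transition_counts(encoded)
print(encoded)
print(counts.most_common(3))
```

Since spaces are dropped before counting, some transitions span word boundaries, which is the kind of cross-token configuration the abstract distinguishes from intra-lexical ones.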

Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube

2604.22606v1 by Sheza Munir, Ratna Kandala, Anamta Khan, Deepti, Joyojeet Pal

Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.

From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

2604.22601v1 by Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, Md Rayhanur Rahman

Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

2604.22597v1 by Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, Igor Kviatkovsky

Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.
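
The rigidity of rule-based answer checking can be seen in a small sketch. It is not the code of Lighteval or SimpleRL; both helper functions are hypothetical and only show how each added normalization rule still misses some equivalent answer formats.

```python
from fractions import Fraction

def string_match(pred, gold):
    """Strictest check: answers must be identical strings after trimming whitespace."""
    return pred.strip() == gold.strip()

def numeric_match(pred, gold):
    """Compare as exact rationals when both answers parse; otherwise fall back to string match."""
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return string_match(pred, gold)

gold = "1/2"
for pred in ["1/2", "0.5", "\\frac{1}{2}", "50%"]:
    print(pred, string_match(pred, gold), numeric_match(pred, gold))
```

Numeric parsing accepts `0.5` where plain string matching does not, yet the LaTeX fraction and the percentage are still rejected even though a human grader would likely accept them. Closing such gaps rule by rule is the open-ended effort that an LLM-based judge is meant to replace.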

Learning Evidence Highlighting for Frozen LLMs

2604.22565v1 by Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
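
The tagging step, which marks pivotal spans while leaving the rest of the context byte-for-byte unchanged, can be sketched as below. The tag strings, the function name, and the hand-picked span are assumptions; in the framework described above the spans would come from the learned Emphasis Actor, which is not modeled here.

```python
def insert_highlights(context, spans, open_tag="<hl>", close_tag="</hl>"):
    """Wrap non-overlapping (start, end) character spans in tags; other text is untouched."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        if start < cursor or end > len(context) or start >= end:
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        out.append(context[cursor:start])
        out.append(open_tag + context[start:end] + close_tag)
        cursor = end
    out.append(context[cursor:])
    return "".join(out)

context = "The meeting moved to Friday. Lunch is at noon."
start = context.index("Friday")
emphasized = insert_highlights(context, [(start, start + len("Friday"))])
print(emphasized)
```

Removing the tags recovers the original context exactly, which is the property that separates emphasis from compression or rewriting.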

Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors

2604.22560v1 by Gautam Kumar Jain, Carsten Markgraf, Julian Stähler

Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.

SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

2604.22558v1 by Jichao Wang, Liuyang Bian, Yufeng Zhou, Han Xiao, Yue Pan, Guozhi Wang, Hao Wang, Zhaoxiong Wang, Yafei Wen, Xiaoxin Chen, Shuai Ren, Lingfang Zeng

As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.

Using Embedding Models to Improve Probabilistic Race Prediction

2604.22555v1 by Noan Dasanaike, Kosuke Imai

Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which has critically relied on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first name probabilities, surname embedding for unlisted names, surname and first name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.
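
The Bayesian update at the core of standard BISG can be sketched in a few lines. It combines a surname-based prior with a geography likelihood, assuming surname and location are conditionally independent given race. The probabilities and group labels below are placeholders, not Census figures, and the function name is hypothetical.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """P(race | surname, geo) is proportional to P(race | surname) * P(geo | race)."""
    unnorm = {r: p_race_given_surname[r] * p_geo_given_race[r] for r in p_race_given_surname}
    z = sum(unnorm.values())
    return {r: v / z for r, v in unnorm.items()}

# Placeholder inputs for two groups (illustrative numbers only).
prior_from_surname = {"group_a": 0.6, "group_b": 0.4}
geo_likelihood = {"group_a": 0.01, "group_b": 0.03}
posterior = bisg_posterior(prior_from_surname, geo_likelihood)
print(posterior)
```

When a surname is missing from the reference list, the first argument falls back to a generic prior, which is the case where the abstract reports degraded accuracy and where an embedding-based estimate of the surname prior would be substituted.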

ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders

2604.22550v1 by Yongqi Jiang, Yansong Gao, Boyu Kuang, Chunyi Zhou, Anmin Fu, Liquan Chen

Self-supervised learning (SSL) encoders are invaluable intellectual property (IP). However, no existing SSL watermarking for IP protection can concurrently satisfy the following two practical requirements: (1) provide ownership verification capability under black-box suspect model access once the stolen encoders are used in downstream tasks; (2) be robust under adversarial watermark detection or removal, because the watermark samples form a distinguishable out-of-distribution (OOD) cluster. We propose ArmSSL, an SSL watermarking framework that assures black-box verifiability and adversarial robustness while preserving utility. For verification, we introduce paired discrepancy enlargement, enforcing feature-space orthogonality between a clean sample and its watermarked counterpart to produce a reliable verification signal under black-box access to the suspect model. For adversarial robustness, ArmSSL integrates latent representation entanglement and distribution alignment to suppress the OOD clustering. The former entangles watermark representations with clean representations (i.e., from non-source-class) to avoid forming a dense cluster of watermark samples, while the latter minimizes the distributional discrepancy between watermark and clean representations, thereby disguising watermark samples as natural in-distribution data. For utility, a reference-guided watermark tuning strategy is designed to allow the watermark to be learned as a small side task without affecting the main task by aligning the watermarked encoder's outputs with those of the original clean encoder on normal data. Extensive experiments across five mainstream SSL frameworks and nine benchmark datasets, along with end-to-end comparisons with state-of-the-art (SOTA) methods, demonstrate that ArmSSL achieves superior ownership verification, negligible utility degradation, and strong robustness against various adversarial detection and removal.

Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners

2604.22542v1 by Haidong Yuan, Haokun Zhao, Wanshi Xu, Songjun Cao, Qingyu Zhou, Long Ma, Hongjie Fan

Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China's national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO (Diversity Driven Policy Optimization) algorithm, a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.

On the Properties of Feature Attribution for Supervised Contrastive Learning

2604.22540v1 by Leonardo Arrighi, Julia Eva Belloni, Aurélie Gallet, Ivan Gentile, Matteo Lippi, Marco Zullich

Most Neural Networks (NNs) for classification are trained using Cross-Entropy (CE) as the loss function. This approach requires the model to have an explicit classification layer. However, there exist alternative approaches, such as Contrastive Learning (CL). Instead of explicitly performing classification, CL has the NN produce an embedding space where projections of similar data are pulled together, while projections of dissimilar data are pushed apart. In the case of Supervised CL (SCL), labels are adopted as similarity criteria, thus creating an embedding space where the projected data points are well-clustered. SCL provides crucial advantages over CE with regard to adversarial robustness and out-of-distribution detection, thus making it a more natural choice in safety-critical scenarios. In the present paper, we empirically show that NNs for image classification trained with SCL present higher-quality feature attribution explanations than CL with regard to faithfulness, complexity, and continuity. These results reinforce previous findings about CL-based approaches when targeting more trustworthy and transparent NNs and can guide practitioners in the selection of training objectives targeting not only accuracy, but also transparency of the models.

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

2604.22534v1 by Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present \textbf{FeatEHR-LLM}, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.

RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment

2604.22520v1 by Yingfeng Luo, Hongyu Liu, Dingyang Lin, Kaiyan Chang, Chenglong Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu

Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model's improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose \textbf{RouteLMT} (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator's prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that RouteLMT outperforms heuristic and quality/difficulty estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.
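
To make the budget-allocation view concrete, here is a minimal sketch that routes the requests with the largest predicted marginal gain to the large model until a fixed budget is used up. The function name, the way the budget is expressed (a fraction of requests), and the positive-gain filter are assumptions for illustration; the paper's router and its guarded variant may work differently.

```python
def route_by_marginal_gain(predicted_gains, budget_fraction):
    """Return the set of request indices sent to the large model.

    predicted_gains[i] is a hypothetical estimate of how much the large
    model would improve on the small model for request i.
    """
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in [0, 1]")
    # Number of requests the budget allows us to send to the large model.
    k = int(len(predicted_gains) * budget_fraction)
    # Rank requests by predicted gain, largest first.
    ranked = sorted(range(len(predicted_gains)),
                    key=lambda i: predicted_gains[i], reverse=True)
    # Illustrative filter (assumption): skip requests with no predicted gain.
    return {i for i in ranked[:k] if predicted_gains[i] > 0}
```

Ranking by gain rather than by absolute difficulty means a hard sentence that neither model handles well does not consume budget.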

Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

2604.22517v1 by Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Takuto Asakura, Chung-Chi Chen, Tatsuya Ishigaki

Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.

Measuring and Mitigating Persona Distortions from AI Writing Assistance

2604.22503v1 by Paul Röttger, Kobi Hackenburg, Hannah Rose Kirk, Christopher Summerfield

Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

2604.22498v1 by Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen, Jiawei Chen, Hao Ma, Tao Wei

Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).

On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery

2604.22455v1 by Anti Alman, Izack Cohen, Avigdor Gal, Fabrizio Maria Maggi, Marco Montali

A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

2604.22452v1 by Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li, Timothy Baldwin, Tianyi Zhou

Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.

SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking

2604.22438v1 by Chenxi Gu, Xiaoning Du, John Grundy

Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW's effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as \emph{watermark strength}. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (\textbf{S}ort-then-\textbf{S}plit by \textbf{G}roups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.
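
One plausible reading of "sort, then split by groups" is sketched below: tokens are sorted by logit, cut into small consecutive groups, and each group is divided between the two subsets using a seeded random choice. This is an assumption made for illustration, not the paper's exact algorithm; `sort_then_split`, `seed`, and `group_size` are hypothetical names.

```python
import random

def sort_then_split(logits, seed, group_size=2):
    """Partition token ids into two subsets with similar logit profiles.

    Illustrative sketch only; the grouping rule is an assumption.
    """
    # Sort token ids from highest to lowest logit.
    order = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)
    rng = random.Random(seed)  # the seed stands in for a watermark key
    green, red = [], []
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        rng.shuffle(group)
        # Split each group of neighbouring logits between the two subsets.
        half = (len(group) + 1) // 2
        green.extend(group[:half])
        red.extend(group[half:])
    return green, red
```

Because neighbouring tokens in the sorted order always land on opposite sides, neither subset can end up holding all of the high-probability tokens, which is the failure case of a purely random partition in low-entropy text.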

CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease

2604.22428v1 by Bulent Soykan, Gulsah Hancerliogullari Koksalmis, Hsin-Hsiung Huang, Laura J. Brattain

Predicting individual cognitive decline in Alzheimer's disease (AD) is difficult due to the heterogeneity of disease progression. Reliable clinical tools require not only high accuracy but also fairness across demographics and robustness to missing data. We present CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework using data from 1,666 patients in the TADPOLE (Alzheimer's Disease Neuroimaging Initiative) dataset. We assessed the model for prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate and personalized predictions of cognitive decline. Its demonstrated fairness across patient demographics and resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.

Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control

2604.22413v1 by Qinhan Hou, Jing Tang

Graph Transformers can mix information globally, but this flexibility also creates failure modes: some tasks require long-range communication while others are better served by local interaction. We study this through a synthetic node-classification benchmark on contextual stochastic block model graphs, where labels are generated by a controllable mixture of local and far-shell signals. We define distance-misaligned training as a mismatch between where label-relevant information lies and where the model allocates communication over graph distance. On this benchmark, we find three points. First, the preferred graph-distance bias changes systematically with task locality. Second, an oracle adaptive controller, given offline access to the task-side distance target, nearly matches the best fixed bias across regimes and strongly improves over a neutral baseline on mixed and local tasks. Third, a task-agnostic zero-gap controller is weaker, indicating that adaptation alone is not enough and that the control target matters. These results suggest that distance-resolved diagnosis is useful for understanding Graph Transformer failures and for designing graph-aware control.

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

2604.22411v1 by Alberto Messina, Stefano Scotta

Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of \emph{background temperature} $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
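
The idea of an "equivalent temperature" can be illustrated with a toy calculation: given the rate at which repeated nominal $T=0$ runs still emit the top-ranked token, find the temperature at which an ideal softmax sampler would show the same rate. The matching criterion and the function names below are assumptions for this sketch; the note's actual estimation protocol is not described in the abstract.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with the usual max-shift for stability."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def equivalent_temperature(logits, observed_top_rate, lo=1e-3, hi=10.0, iters=60):
    """Bisection for the T at which an ideal sampler picks the argmax token
    with probability observed_top_rate (hypothetical matching criterion).

    The top-token probability decreases monotonically as T grows.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if max(softmax(logits, mid)) > observed_top_rate:
            lo = mid  # sampler is still too deterministic: raise T
        else:
            hi = mid  # sampler is too random: lower T
    return (lo + hi) / 2
```

A fully deterministic system would have an observed top rate of 1 and push the estimate toward the lower bound of the search interval.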

Selective Contrastive Learning For Gloss Free Sign Language Translation

2604.22374v1 by Changhao Lai, Rui Zhao, Xuewen Zhong, Jinsong Su, Yidong Chen

Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.

CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language

2604.22367v1 by Rui Zhao, Xuewen Zhong, Xiaoyun Zheng, Jinsong Su, Yidong Chen

Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized \textit{National Common Sign Language Dictionary}, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.

Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

2604.22345v1 by Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu

Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
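
The decoding step described in this abstract, contrasting predictions with and without the Preference Heads, can be sketched as a simple logit extrapolation. The linear form and the `alpha` strength parameter are assumptions for illustration and are not confirmed by the abstract.

```python
def differential_preference_logits(logits_with_heads, logits_without_heads, alpha=1.0):
    """Amplify the difference between personalized and generic logits.

    logits_with_heads: next-token logits from the unmodified model.
    logits_without_heads: logits with the identified heads masked out.
    alpha: hypothetical steering strength; 0 leaves the model unchanged.
    """
    return [w + alpha * (w - wo)
            for w, wo in zip(logits_with_heads, logits_without_heads)]
```

Tokens whose score drops when the heads are masked are the ones those heads were promoting, so they receive the largest boost.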

Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding

2604.22335v1 by Weixu Zhang, Fanghua Ye, Qiang Gao, Jian Li, Haolun Wu, Yuxing Tian, Sijing Duan, Nan Du, Xiaolong Li, Xue Liu

Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token's degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.
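
The first two boosting strategies named in this abstract can be sketched in a few lines: static boosting adds a fixed bias to tokens supported by the source, and context-aware boosting scales that bias by how much the context changes the next-token distribution. Using Jensen-Shannon divergence as the scaling signal, and the names `boost_logits`, `delta`, and `source_token_ids`, are assumptions made for this example rather than details from the paper.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def boost_logits(logits_with_ctx, logits_without_ctx, source_token_ids,
                 delta=2.0, context_aware=True):
    """Add a bias to the logits of tokens that appear in the source context."""
    scale = 1.0
    if context_aware:
        # Larger divergence means the context matters more at this step.
        scale = js_divergence(softmax(logits_with_ctx), softmax(logits_without_ctx))
    bias = delta * scale
    return [x + bias if i in source_token_ids else x
            for i, x in enumerate(logits_with_ctx)]
```

With `context_aware=False` the function reduces to the static variant; when the context leaves the distribution unchanged, the context-aware bias vanishes and decoding proceeds as usual.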

ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding

2604.22333v1 by Dongwei Sun, Jing Yao, Kan Wei, Xiangyong Cao, Chen Wu, Zhenghui Zhao, Pedram Ghamisi, Jun Zhou, Jón Atli Benediktsson

Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a "statistics-first, generation-later" paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state-of-the-art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at https://sundongwei.github.io/changequery/.


FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

2604.22328v1 by Marco Obermeier, Marco Pruckner, Florian Haselbeck, Andreas Zeiselmair

Driven by the transition towards a climate-neutral energy system, accurate energy time series forecasting is critical for planning and operation. Yet, it remains largely a dataset-specific task, requiring comprehensive training data, limiting scalability, and resulting in high model development and maintenance effort. Recently, foundation models that aim to learn generalizable patterns via extensive pretraining have shown superior performance in multiple prediction tasks. Despite their success and strong potential to address challenges in energy forecasting, their application in this domain remains largely unexplored. We address this gap by presenting the Foundation Models in Energy Time Series Forecasting (FETS) benchmark. We (1) provide a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories; (2) collect and analyze 54 datasets across 9 data categories, guided by typical stakeholder interests; (3) benchmark foundation models against classical machine learning approaches across different forecasting settings. Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories, despite the latter having seen the full historic target data during training. In particular, covariate-informed foundation models achieve the strongest performance. Further analysis reveals a strong correlation between predictive performance and spectral entropy, performance saturation beyond a certain context length, and improved performance at higher aggregation levels such as national load, district heating, and power grid data. Overall, our findings highlight the strong potential of foundation models as scalable and generalizable forecasting solutions for the energy domain, particularly in data-constrained and privacy-sensitive settings.
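
The abstract reports a correlation between forecast quality and spectral entropy but does not say how the entropy is computed. The following is a minimal sketch of one common definition (Shannon entropy of the normalized power spectrum, scaled to [0, 1]); the function name, the naive DFT, and the mean-removal step are illustrative assumptions, not the benchmark's code.

```python
import cmath
import math


def spectral_entropy(series):
    """Normalized Shannon entropy of the power spectrum, in [0, 1].

    Assumption: one common definition, not necessarily the one used in FETS.
    Low values mean power is concentrated in few frequencies (regular signal);
    high values mean power is spread out (noise-like signal).
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant (DC) component

    # Naive O(n^2) DFT over the positive frequencies 1 .. n // 2.
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)

    total = sum(power)
    if total == 0 or len(power) < 2:
        return 0.0  # constant or too-short series: no spectral spread

    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# A pure sinusoid (e.g. an idealized daily load cycle) versus a single impulse.
sine = [math.sin(2 * math.pi * 4 * t / 128) for t in range(128)]
impulse = [1.0] + [0.0] * 127
print(spectral_entropy(sine))     # close to 0: one dominant frequency
print(spectral_entropy(impulse))  # close to 1: flat spectrum
```

Under this definition, strongly periodic series such as aggregated national load would score low, which fits the abstract's observation that performance improves at higher aggregation levels.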


Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks

2604.22325v1 by Fahmida Alam, Ellen Riloff

Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
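
Since the headline numbers are macro-averaged F1-scores, a short sketch of that metric may help: per-class F1 is computed first and then averaged without weighting, so rare classes count as much as frequent ones. The labels below are hypothetical placeholders, not SIC or taxonomy codes from the paper.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: class "a" is frequent, class "b" is rare.
gold = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "b"]
print(macro_f1(gold, pred))
```

In this toy case class "a" has F1 = 0.8 and class "b" has F1 = 2/3, so the macro average (about 0.733) is lower than the plain accuracy of 0.75.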


CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems

2604.22313v1 by Tabinda Sarwar, Farhad Moghimifar, Cong Duy Vu Hoang, Xiaoxiao Ma, Shawn Chang Xu, Fahimeh Saleh, Poorya Zaremoodi, Avirup Sil, Katrin Kirchhoff

NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.


BLAST: Benchmarking LLMs with ASP-based Structured Testing

2604.22306v1 by Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri, Francesco Ricca

Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite this progress, little attention has so far been paid to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP). In this paper we introduce BLAST, the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.


Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

2604.22294v1 by Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam

Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
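
The core idea, answering from a persistent relational state via SQL rather than from concatenated chunk outputs, can be sketched with Python's built-in `sqlite3`. The schema, entity names, and values below are hypothetical, and the exact-match deduplication is a deliberately simplified stand-in for the reconciliation stage the abstract describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE facts (
           entity TEXT, attribute TEXT, value REAL,
           source_doc TEXT, chunk_id INTEGER)"""
)

# Hypothetical chunk-level extractions; the second row is the same fact
# extracted again from an overlapping chunk.
rows = [
    ("AcmeCorp", "revenue_musd", 120.0, "report_2023.pdf", 3),
    ("AcmeCorp", "revenue_musd", 120.0, "report_2023.pdf", 7),
    ("BetaInc", "revenue_musd", 80.0, "filing_b.pdf", 1),
    ("GammaLLC", "revenue_musd", 45.5, "filing_g.pdf", 2),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?)", rows)

# Simplified reconciliation: collapse records that agree on entity, attribute
# and value, while keeping the provenance of every contributing chunk.
conn.execute(
    """CREATE TABLE reconciled AS
       SELECT entity, attribute, value,
              GROUP_CONCAT(source_doc || '#' || chunk_id) AS provenance
       FROM facts
       GROUP BY entity, attribute, value"""
)


def total_revenue(table):
    """Aggregate over the whole collection with one SQL query."""
    query = f"SELECT SUM(value) FROM {table} WHERE attribute = 'revenue_musd'"
    return conn.execute(query).fetchone()[0]


print(total_revenue("facts"))       # double-counts the duplicated record
print(total_revenue("reconciled"))  # counts each reconciled fact once
```

The aggregation cost here depends on the database engine rather than on how much extracted text fits in a prompt, which is the scalability argument the abstract makes.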


2604.22292v1 by Ishaan Gakhar, Harsh Nandwani

The classification of legal documents from an unstructured data corpus has several crucial applications in downstream tasks. Documents relevant to court filings are key in use cases such as drafting motions, memos, and outlines, as well as in tasks like docket summarisation, retrieval systems, and training data curation. Current methods classify documents using provided metadata, LLM-extracted metadata, or multimodal inputs, and therefore depend on structured data, metadata, and extensive computational power. This work instead approaches the task by leveraging features of the documents that discriminate between classes. The authors propose ReLeVAnT, a framework for binary classification of legal documents. ReLeVAnT utilises n-gram processing, contrastive score matching, and a shallow neural network as the primary drivers for discriminative classification. It performs keyword extraction once per corpus and then applies a shallow classifier to classify documents swiftly and reliably, reaching 99.3% accuracy and a 98.7% F1 score on the LexGLUE dataset.


When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

2604.22273v1 by Aofan Liu, Jingxiang Meng

Iterative self-correction is widely used in agentic LLM systems, but when repeated refinement helps versus hurts remains unclear. We frame self-correction as a cybernetic feedback loop in which the same language model serves as both controller and plant, and use a two-state Markov model over {Correct, Incorrect} to operationalize a simple deployment diagnostic: iterate only when ECR/EIR > Acc/(1 - Acc). In this view, EIR functions as a stability margin and prompting functions as lightweight controller design. Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (<= 0.5%) separating beneficial from harmful self-correction. Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ~ 0.2%), and o4-mini (+/-0 pp) remain non-degrading; GPT-5 degrades by -1.8 pp. A verify-first prompt ablation provides causal evidence that this threshold is actionable through prompting alone: on GPT-4o-mini it reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4), while producing little change on already-sub-threshold models. ASC further illustrates the stopping trade-off: it halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost. Overall, the paper argues that self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.
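
The deployment diagnostic follows from one step of the two-state Markov chain. The abstract does not expand the acronyms, so this sketch assumes ECR is the probability that a revision turns an incorrect answer into a correct one and EIR is the probability that it turns a correct answer into an incorrect one. Accuracy after one round is then Acc(1 - EIR) + (1 - Acc)ECR, which exceeds Acc exactly when ECR/EIR > Acc/(1 - Acc). The numbers in the usage lines are made up for illustration.

```python
def next_accuracy(acc, ecr, eir):
    """Expected accuracy after one self-correction round.

    Assumed reading of the abstract's symbols:
      ecr = P(Incorrect -> Correct), eir = P(Correct -> Incorrect).
    """
    return acc * (1.0 - eir) + (1.0 - acc) * ecr


def should_iterate(acc, ecr, eir):
    """Cross-multiplied form of ECR/EIR > Acc/(1 - Acc).

    Written without division so that eir == 0 or acc == 1 needs no special case.
    """
    return ecr * (1.0 - acc) > eir * acc


# Illustrative values only: a 90%-accurate model that fixes 10% of its errors.
for eir in (0.02, 0.005):
    decision = should_iterate(0.9, 0.10, eir)
    print(eir, decision, round(next_accuracy(0.9, 0.10, eir), 4))
```

With these illustrative numbers, an error-introduction rate of 2% makes another round harmful while 0.5% makes it helpful, which mirrors the abstract's point that a strong model only benefits when EIR is near zero.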


Semantic Error Correction and Decoding for Short Block Channel Codes

2604.22269v1 by Jiafu Hao, Chentao Yue, Wanchun Liu, Yonghui Li, Branka Vucetic

This paper presents a semantic-enhanced receiver framework for transmitting natural language sentences over noisy wireless channels using multiple short block codes. After ASCII encoding, the sentence is divided into segments, each independently encoded with a short block code and transmitted over an AWGN channel. At the receiver, segments are decoded in parallel, followed by a semantic error correction (SEC) model, which reconstructs corrupted segments using language model context. We further propose semantic list decoding (SLD), which generates multiple candidate reconstructions and selects the best one via weighted Hamming distance, and a semantic confidence-guided HARQ (SHARQ) mechanism that replaces CRC-based error detection with a confidence score, enabling selective segment retransmission without CRC overhead. All modules are designed and trained using bidirectional and auto-regressive transformers (BART). Simulation results demonstrate that the proposed scheme significantly outperforms conventional capacity-approaching short codes and long codes at the same rate. Specifically, SEC provides approximately 0.4 dB BLER gain over plain short-code transmission, while SLD extends this to 0.8 dB. Compared to transmitting the entire sentence as a single long 5G LDPC codeword, our approach significantly improves semantic fidelity and reduces decoding latency by up to 90%. SHARQ further provides an additional 1.5 dB gain over conventional HARQ.
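
The selection step in SLD can be illustrated with a small sketch. The abstract does not say how the Hamming distance is weighted, so this example assumes each bit carries a reliability weight (for instance the magnitude of its log-likelihood ratio) and that a mismatch on an unreliable bit costs less. The candidate words and weights are hypothetical.

```python
def ascii_bits(text):
    """8-bit ASCII encoding of a string as a flat list of 0/1 integers."""
    return [int(bit) for byte in text.encode("ascii") for bit in format(byte, "08b")]


def weighted_hamming(candidate_bits, received_bits, reliabilities):
    """Sum the reliability weights of the positions where the bits disagree."""
    return sum(
        weight
        for cand, recv, weight in zip(candidate_bits, received_bits, reliabilities)
        if cand != recv
    )


def select_candidate(candidates, received_bits, reliabilities):
    """Pick the candidate reconstruction closest to the channel observation."""
    return min(
        candidates,
        key=lambda bits: weighted_hamming(bits, received_bits, reliabilities),
    )


# "cat" was sent; the channel flipped bit 1, so the hard decisions spell "#at".
received = ascii_bits("cat")
received[1] ^= 1
reliabilities = [1.0] * len(received)
reliabilities[1] = 0.1  # assumed: the flipped bit was received with low confidence

# Hypothetical candidate list, standing in for language-model reconstructions.
candidates = [ascii_bits(word) for word in ("hat", "bat", "cat")]
best = select_candidate(candidates, received, reliabilities)
print(best == ascii_bits("cat"))
```

The weighting lets the receiver prefer a candidate that contradicts only unreliable bits over one that contradicts bits the channel reported with confidence.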


Large Language Models Decide Early and Explain Later

2604.22266v1 by Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler

Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
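
One simple way to picture answer-stabilization stopping is a patience rule over the intermediate answers obtained by forced answer completion. This is a hedged sketch of a generic heuristic, not the paper's probe-based method; the `patience` parameter and the example answers are assumptions.

```python
def stabilization_index(intermediate_answers, patience=3):
    """Return the first step at which the same answer has been predicted
    `patience` times in a row, or None if that never happens.

    `intermediate_answers` holds the answer elicited after each reasoning
    step; None marks a step where no answer could be extracted.
    """
    run_length = 0
    previous = None
    for step, answer in enumerate(intermediate_answers):
        if answer is not None and answer == previous:
            run_length += 1
        else:
            run_length = 1 if answer is not None else 0
        previous = answer
        if run_length >= patience:
            return step
    return None


# Hypothetical trace: the answer settles on "18" at step 2 and never changes.
trace = ["12", "15", "18", "18", "18", "18", "18"]
stop_at = stabilization_index(trace, patience=3)
print(stop_at, len(trace) - 1 - stop_at)  # stopping step, reasoning steps skipped
```

A larger `patience` skips fewer tokens but is less likely to stop before a late answer switch, which is the accuracy and cost trade-off the abstract quantifies.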


Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion

2604.22261v1 by Fahmida Alam, Mihai Surdeanu, Ellen Riloff

Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.


Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

2604.22260v1 by Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv

Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.


A Probabilistic Framework for Hierarchical Goal Recognition

2604.22256v1 by Chenyuan Zhang, Katherine Ip, Hamid Rezatofighi, Buser Say, Mor Vered

Goal recognition aims to infer an agent's goal from observations of its behaviour. In realistic settings, recognition can benefit from exploiting hierarchical task structure and reasoning under uncertainty. Planning-based goal recognition has made substantial progress over the past decade, but to the best of our knowledge no existing approach jointly integrates hierarchical task structure with probabilistic inference. In this paper, we introduce the first planning-based probabilistic framework for hierarchical goal recognition over Hierarchical Task Networks (HTNs). We instantiate the framework by exploiting an HTN planner with a three-stage generative model for likelihood estimation, yielding posterior distributions over goal hypotheses. Empirical results show improved recognition performance over the existing HTN-based recognizer on HTN benchmarks. Overall, the framework lays a foundation for probabilistic goal recognition grounded in hierarchical planning structure, moving goal recognition toward more practical settings.
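
The final step, turning per-goal likelihood estimates into a posterior over goal hypotheses, is an application of Bayes' rule. The sketch below covers only that normalization; the three-stage generative model and the HTN planner that would produce the likelihoods are not reproduced, and the goal names and numbers are hypothetical.

```python
def goal_posterior(likelihoods, priors=None):
    """P(goal | observations) from P(observations | goal) and P(goal).

    `likelihoods` maps each goal hypothesis to its estimated likelihood.
    A uniform prior is assumed when `priors` is omitted.
    """
    goals = list(likelihoods)
    if priors is None:
        priors = {goal: 1.0 / len(goals) for goal in goals}
    unnormalized = {goal: likelihoods[goal] * priors[goal] for goal in goals}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("observations have zero probability under every goal")
    return {goal: value / evidence for goal, value in unnormalized.items()}


# Hypothetical likelihoods, e.g. as estimated from planner-generated plans.
posterior = goal_posterior(
    {"make_tea": 0.06, "make_coffee": 0.02, "wash_up": 0.02}
)
print(max(posterior, key=posterior.get), posterior)
```

Returning the full distribution instead of a single best goal lets a downstream system act on how uncertain the recognition is.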


Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis

2604.22237v1 by Zhilin Fan, Deliang Wang, Penghe Chen, Yu Lu

Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.


A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies

2604.22227v1 by Somyajit Chakraborty

Classical robot ethics is often framed around obedience, most famously through Asimov's laws. This framing is too narrow for contemporary AI systems, which are increasingly adaptive, generative, embodied, and embedded in physical, psychological, and social worlds. We argue that future human-AI relations should not be understood as master-tool obedience. A better framework is conditional mutualism under governance: a co-evolutionary relationship in which humans and AI systems can develop, specialize, and coordinate, while institutions keep the relationship reciprocal, reversible, psychologically safe, and socially legitimate. We synthesize work from computability, automata theory, statistical machine learning, neural networks, deep learning, transformers, generative and foundation models, world models, embodied AI, alignment, human-robot interaction, ecological mutualism, biological markets, coevolution, and polycentric governance. We then formalize coexistence as a multiplex dynamical system across physical, psychological, and social layers, with reciprocal supply-demand coupling, conflict penalties, developmental freedom, and governance regularization. The framework yields a coexistence model with conditions for existence, uniqueness, and global asymptotic stability of equilibria. It shows that reciprocal complementarity can strengthen stable coexistence, while ungoverned coupling can produce fragility, lock-in, polarization, and domination basins. Human-AI coexistence should therefore be designed as a co-evolutionary governance problem, not as a one-shot obedience problem. This shift supports a scientifically grounded and normatively defensible charter of coexistence: one that permits bounded AI development while preserving human dignity, contestability, collective safety, and fair distribution of gains.


TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2604.22225v1 by Xi Wang, Jie Wang, Xingchen Song, Baijun Song, Jingran Xie, Jiahe Shao, Zijian Lin, Di Wu, Meng Meng, Jian Luan, Zhiyong Wu

While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.


Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

2604.22215v1 by Jon-Paul Cacioli

Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: osf.io/azbvx), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted >=4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level logprobability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.
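
Two of the quantities discussed here can be sketched directly. The abstract does not give the exact screening criteria, so the code assumes that "ceiling rate" means the share of responses at the maximum confidence value and that Type-2 discrimination is measured as the AUROC of confidence for separating correct from incorrect answers. The data are invented for illustration.

```python
def ceiling_rate(confidences, ceiling=100):
    """Fraction of responses at (or above) the maximum confidence value."""
    return sum(c >= ceiling for c in confidences) / len(confidences)


def type2_auroc(confidences, correct):
    """Probability that a randomly chosen correct item received higher
    confidence than a randomly chosen incorrect one (ties count as 0.5)."""
    hits = [c for c, ok in zip(confidences, correct) if ok]
    misses = [c for c, ok in zip(confidences, correct) if not ok]
    if not hits or not misses:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum((h > m) + 0.5 * (h == m) for h in hits for m in misses)
    return wins / (len(hits) * len(misses))


# Invented example of a saturated model: almost every answer is rated 100.
confidences = [100, 100, 100, 95, 100, 100]
correct = [True, False, True, True, False, True]
print(ceiling_rate(confidences), type2_auroc(confidences, correct))
```

When nearly all responses sit at the ceiling, most correct and incorrect pairs are tied, so the AUROC cannot rise meaningfully above chance; that is the saturation failure the study screens for.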


UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

2604.22209v1 by Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang

Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.

Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations

2604.22207v1 by Anna Arnaudo, Riccardo Coppola, Maurizio Morisio, Flavio Giobergia, Andrea Bioddo, Angelo Bongiorno, Luca Dadone

Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high-level goal extraction, and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification (the final stage), these results indicate the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with zero-shot prompting outperformed stand-alone few-shot prompting, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we report that combining the feedback mechanism with few-shot prompting does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the 'critic' LLM. Together with the refinement of both the quantity and quality of the shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.

An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments

2604.22199v1 by Hong Su

Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered tasks, and even successful executions or observed successful external behaviors are not always autonomously transformed into reusable local knowledge. In this paper, we propose an LLM-driven closed-loop autonomous learning framework for robots facing uncovered tasks in open environments. The proposed framework first retrieves the local method library to determine whether a reusable solution already exists for the current task or observed event. If no suitable method is found, it triggers an autonomous learning process in which the LLM serves as a high-level reasoning component for task analysis, candidate model selection, data collection planning, and execution or observation strategy organization. The robot then learns from both self-execution and active observation, performs quasi-real-time training and adjustment, and consolidates the validated result into the local method library for future reuse. Through this recurring closed-loop process, the robot gradually converts both execution-derived and observation-derived experience into reusable local capability while reducing future dependence on repeated external LLM interaction. Results show that the proposed framework reduces execution time and LLM dependence in both repeated-task self-execution and observation-driven settings, for example reducing the average total execution time from 7.7772s to 6.7779s and the average number of LLM calls per task from 1.0 to 0.2 in the repeated-task self-execution experiments.
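
The retrieve-then-learn loop described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: `ClosedLoopAgent`, `fake_llm`, and the "ok" validation rule are hypothetical names and assumptions, and the model-selection and training steps are collapsed into a single stubbed call.

```python
from typing import Callable, Dict, List

Method = Callable[[], str]


class ClosedLoopAgent:
    """Toy closed loop: reuse a local method if present, else ask the LLM once and store it."""

    def __init__(self, ask_llm: Callable[[str], Method]) -> None:
        self.ask_llm = ask_llm                 # hypothetical planner: task name -> executable method
        self.library: Dict[str, Method] = {}   # local method library
        self.llm_calls = 0

    def handle(self, task: str) -> str:
        method = self.library.get(task)        # step 1: look up the local method library
        if method is not None:
            return method()                    # covered task: no LLM interaction needed
        self.llm_calls += 1                    # step 2: uncovered task, fall back to the LLM
        method = self.ask_llm(task)
        result = method()                      # step 3: execute the proposed method
        if result.startswith("ok"):            # step 4: consolidate only validated results (assumed rule)
            self.library[task] = method
        return result


def fake_llm(task: str) -> Method:
    """Stand-in for an LLM call; returns a trivial method that always succeeds."""
    return lambda: f"ok:{task}"


agent = ClosedLoopAgent(fake_llm)
results: List[str] = [agent.handle("open_door") for _ in range(5)]
calls_per_task = agent.llm_calls / len(results)
print(agent.llm_calls, calls_per_task)
```

With five repetitions of one task, the toy makes a single LLM call, so the average is 0.2 calls per execution. That number only shows how such an average can arise from caching; it does not reproduce the reported experiment.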

How Large Language Models Balance Internal Knowledge with User and Document Assertions

2604.22193v1 by Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang

Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.

Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

2604.22191v1 by Chaoran Chen, Dayu Yuan, Peter Kairouz

In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.
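
The reported figure (detection rate at a fixed false-positive rate) is a standard operating-point metric. The sketch below shows one way to compute it from audit scores; the function name `tpr_at_fpr` and the toy scores are assumptions for illustration and are not taken from the paper.

```python
from typing import Sequence


def tpr_at_fpr(pos_scores: Sequence[float], neg_scores: Sequence[float], max_fpr: float) -> float:
    """Highest true-positive rate achievable while keeping the false-positive rate <= max_fpr.

    A sample is flagged when its score is >= the threshold.
    """
    best_tpr = 0.0
    # Walk thresholds from strict to lenient; both FPR and TPR can only grow along the way.
    for threshold in sorted(set(pos_scores) | set(neg_scores), reverse=True):
        fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        if fpr > max_fpr:
            break
        best_tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
    return best_tpr


# Toy audit scores: higher means "more canary-like behavior" (hypothetical data).
trained_on_canaries = [0.97, 0.92, 0.85, 0.5]
clean_models = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]

detection_rate = tpr_at_fpr(trained_on_canaries, clean_models, max_fpr=0.1)
print(detection_rate)
```

In this toy data one false positive out of ten clean models is tolerated, which lets the threshold drop to 0.92 and flags two of the four canary-trained models.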

ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression

2604.22180v1 by Xiaojie Ke, Shuai Zhang, Liansheng Sun, Yongjin Wang, Hengjun Jiang, Xiangkun Liu, Cunxin Gu, Jian Xu, Guanjun Jiang

Large language model (LLM) based listwise reranking has emerged as the dominant paradigm for achieving state-of-the-art ranking effectiveness in information retrieval. However, its reliance on feeding full passage texts into the LLM introduces two critical bottlenecks: the "lost in the middle" phenomenon degrades ranking quality as input length grows, and the inference latency scales super-linearly with sequence length, rendering it impractical for industrial deployment. In this paper, we present ResRank, a unified retrieval-reranking framework that fundamentally addresses both challenges. Inspired by multimodal LLMs that project visual inputs into compact token representations, ResRank employs an Encoder-LLM to compress each candidate passage into a single embedding, which is then fed alongside the query text into a Reranker-LLM for listwise ranking. To alleviate the misalignment between the compressed representation space and the ranking space, we introduce a residual connection structure that combines encoder embeddings with contextualized hidden states from the reranker. Furthermore, we replace the conventional autoregressive decoding with a one-step cosine-similarity-based scoring mechanism, eliminating the generation bottleneck entirely. ResRank is trained through a carefully designed dual-stage, multi-task, end-to-end joint optimization strategy that simultaneously trains the encoder and reranker, achieving learning objective alignment between retrieval and reranking while substantially reducing training complexity. Extensive experiments on TREC Deep Learning and eight BEIR benchmark datasets demonstrate that ResRank achieves competitive or superior ranking effectiveness compared to existing approaches while requiring zero generated tokens and processing only one token per passage, yielding a fundamentally better balance between effectiveness and efficiency.
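
To make the "one-step cosine-similarity-based scoring" idea concrete, here is a minimal sketch that ranks passages by cosine similarity between a query vector and one vector per passage. The vectors are hand-written toy values and `rank_passages` is a hypothetical helper; the paper's encoder, residual connection, and training procedure are not modeled.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_passages(query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


query = [1.0, 0.0]
passages = [[0.0, 1.0], [2.0, 0.1], [1.0, 1.0]]  # one embedding per passage (toy values)
order = rank_passages(query, passages)
print(order)
```

Because every passage is scored in a single pass, no tokens are generated to produce the ranking, which is the efficiency property the abstract emphasizes.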

ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation

2604.22169v1 by Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu

Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast learning-signal framework that first restores minimal learnability for all-zero groups and then replaces full-group reward normalization with a boundary-focused contrastive update on the strongest positive and the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width. Across multiple generative recommendation tasks, ReCast consistently outperforms OpenOneRec-RL, achieving up to 36.6% relative improvement in Pass@1. Its matched-budget advantage is substantially larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. The same design also yields direct system-level gains, reducing actor-side update time by 16.60x, lowering peak allocated memory by 16.5%, and improving actor MFU by 14.2%. Mechanism analysis shows that ReCast mitigates the persistent all-zero / single-hit regime, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates. These results suggest that, for generative recommendation, the decisive RL problem is not only how to assign rewards, but how to construct learnable optimization events from sparse, structured supervision.
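
The claim that all-zero groups are not learnable can be illustrated with the usual group-normalized advantage, (reward - group mean) / group standard deviation. The sketch below assumes that form of normalization (with a small epsilon, a common convention) and does not implement ReCast's repair or contrastive update.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one rollout group (assumed baseline, not ReCast itself)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


all_zero_group = [0.0, 0.0, 0.0, 0.0]    # no rollout hit the target item
single_hit_group = [0.0, 0.0, 0.0, 1.0]  # exactly one rollout hit

adv_zero = group_normalized_advantages(all_zero_group)
adv_hit = group_normalized_advantages(single_hit_group)
print(adv_zero)
print(adv_hit)
```

Every advantage in the all-zero group is zero, so that group contributes no gradient signal, whereas a single hit already separates one positive from three negatives.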

Estimating Tail Risks in Language Model Output Distributions

2604.22167v1 by Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He

Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk
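
The importance-sampling idea can be sketched on a toy output distribution. Here `p` plays the role of the target model and `q` the "unsafe" proposal that makes harmful outputs frequent; both distributions, the outcome labels, and the function name are assumptions for illustration. Real language models would use sequence likelihoods rather than three-way categorical probabilities.

```python
import random
from typing import Callable, Dict


def importance_estimate(
    p: Dict[str, float],
    q: Dict[str, float],
    is_harmful: Callable[[str], bool],
    n_samples: int,
    rng: random.Random,
) -> float:
    """Estimate P_p(harmful) by sampling from q and reweighting each hit by p(x) / q(x)."""
    outcomes = list(q)
    weights = [q[o] for o in outcomes]
    total = 0.0
    for _ in range(n_samples):
        x = rng.choices(outcomes, weights=weights, k=1)[0]
        if is_harmful(x):
            total += p[x] / q[x]  # likelihood ratio corrects for sampling from q
    return total / n_samples


target = {"refuse": 0.9, "benign": 0.0999, "harmful": 0.0001}  # harmful is rare under p
unsafe = {"refuse": 0.3, "benign": 0.2, "harmful": 0.5}        # harmful is common under q

estimate = importance_estimate(target, unsafe, lambda x: x == "harmful", 500, random.Random(0))
print(estimate)
```

Sampling 500 times directly from `target` would see a harmful output only about 0.05 times on average, so a naive estimate would usually be exactly zero. The reweighted estimate from `unsafe` concentrates near the true value of 1e-4.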

Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models

2604.22166v1 by Ryoma Kumon, Hitomi Yanaka

While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that the mechanisms identified by activation patching generalize to out-of-distribution data, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that the manipulation of the identified components improves model performance on acceptability judgment benchmarks.

GenMatter: Perceiving Physical Objects with Generative Matter Models

2604.22160v1 by Eric Li, Arijit Dasgupta, Yoni Friedman, Mathieu Huot, Vikash Mansinghka, Thomas O'Connell, William T. Freeman, Joshua B. Tenenbaum

Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

2604.22154v1 by Meghana Karnam, Ananya Joshi

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while keeping similar false negative rates across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
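
The 40% figure follows from the two false positive rates quoted for AEGIS 2.0. A short check of the arithmetic, with variable names chosen here for illustration:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


fpr_single_agent = 0.159  # single-agent false positive rate quoted in the abstract
fpr_adaptive = 0.095      # adaptive-sampling false positive rate quoted in the abstract

reduction = relative_reduction(fpr_single_agent, fpr_adaptive)
print(f"{reduction:.1%}")
```

The result is roughly 40%, matching the rounded value stated in the abstract.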

When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models

2604.22153v1 by Pruthvinath Jeripity Venkata

When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.

Recognition Without Authorization: LLMs and the Moral Order of Online Advice

2604.22143v1 by Tom van Nuenen

Large language models are increasingly used to mediate everyday interpersonal dilemmas, yet how their advisory defaults interact with the concentrated moral orders of specific communities remains poorly understood. This article compares four assistant-style LLMs with community-endorsed advice on 11,565 posts from r/relationship_advice, using the subreddit as a concentrated, vote-ratified moral formation whose prescriptive clarity makes divergence measurable. Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action. The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and therapeutic framing. The article describes this pattern as recognition without authorization: the capacity to register harm while withholding socially ratified permission for consequential action. This divergence is not incidental but structural: a portable advisory style that remains validating, risk-averse, and weakly directive across contexts. Safety alignment is one plausible contributor to this pattern, alongside training-data averaging and broader assistant design. The article argues that model divergence can be reframed from a technical error to a way of seeing what standardized assistant norms flatten when they encounter situated moral worlds.

Voice Under Revision: Large Language Models and the Normalization of Personal Narrative

2604.22142v1 by Tom van Nuenen

This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.

SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

2604.22134v1 by Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen

Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE
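
The gating step (infer prerequisites, find mastery gaps, then route between instructing and problem-solving) can be sketched with a small prerequisite graph. The graph contents, the `route` function, and the "instruct"/"solve" labels are hypothetical; the paper's knowledge-mastery graph and gating criteria may differ.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical prerequisite graph: concept -> direct prerequisites.
PREREQS: Dict[str, Set[str]] = {
    "quadratic_formula": {"factoring", "square_roots"},
    "factoring": {"multiplication"},
    "square_roots": {"multiplication"},
    "multiplication": set(),
}


def all_prereqs(concept: str) -> Set[str]:
    """Collect direct and indirect prerequisites with an iterative depth-first walk."""
    seen: Set[str] = set()
    stack = list(PREREQS.get(concept, set()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PREREQS.get(node, set()))
    return seen


def route(concept: str, mastered: Set[str]) -> Tuple[str, List[str]]:
    """Gate: scaffold the missing prerequisites if any exist, otherwise allow problem solving."""
    gaps = all_prereqs(concept) - mastered
    if gaps:
        return ("instruct", sorted(gaps))
    return ("solve", [])


decision_with_gap = route("quadratic_formula", {"multiplication", "square_roots"})
decision_ready = route("quadratic_formula", {"multiplication", "square_roots", "factoring"})
print(decision_with_gap, decision_ready)
```

An explicit gate of this kind makes the routing decision inspectable: an answer-inducing prompt cannot bypass it unless the student's recorded mastery already covers the prerequisites.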

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

2604.22128v1 by Aryan Sharma, Cutter Dawes, Shivam Raval

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
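
For readers unfamiliar with the Dyck language, the sketch below checks a bracket sequence with a stack and records, after each token, the two hierarchical variables mentioned in the abstract: nesting depth and the position of the top-of-stack bracket. The two-bracket alphabet and the function name `dyck_trace` are choices made here for illustration.

```python
from typing import List, Optional, Tuple

CLOSE_TO_OPEN = {")": "(", "]": "["}
OPENERS = set(CLOSE_TO_OPEN.values())


def dyck_trace(seq: str) -> Optional[List[Tuple[int, Optional[int]]]]:
    """Return (depth, top-of-stack index) after each token, or None if seq is not balanced."""
    stack: List[int] = []  # indices of brackets that are still open (last in, first out)
    trace: List[Tuple[int, Optional[int]]] = []
    for i, ch in enumerate(seq):
        if ch in OPENERS:
            stack.append(i)
        elif ch in CLOSE_TO_OPEN:
            # A closer must match the most recently opened bracket.
            if not stack or seq[stack[-1]] != CLOSE_TO_OPEN[ch]:
                return None
            stack.pop()
        else:
            return None  # symbol outside the bracket alphabet
        trace.append((len(stack), stack[-1] if stack else None))
    return trace if not stack else None


balanced = dyck_trace("([])")
crossed = dyck_trace("([)]")
unclosed = dyck_trace("((")
print(balanced, crossed, unclosed)
```

The top-of-stack index is the position a model would need to attend to in order to predict the correct closing bracket, which is the signal the masking intervention in the abstract targets.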

Where Should LoRA Go? Component-Type Placement in Hybrid Language Models

2604.22127v1 by Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures -- Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) -- fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.
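
The parameter-count side of the placement question can be illustrated with simple arithmetic: a rank-r LoRA adapter on a weight matrix of shape d_out x d_in adds r * (d_in + d_out) trainable parameters. The sketch below tallies this per component type. The layer inventory (shapes and block counts) is entirely hypothetical and does not describe Qwen3.5-0.8B or Falcon-H1-0.5B.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory of adapted linear maps: (component type, d_in, d_out).
# Shapes and counts are made up purely for illustration.
LAYERS: List[Tuple[str, int, int]] = (
    [("attention", 1024, 1024)] * 4 * 6      # 4 projections in 6 attention blocks
    + [("recurrent", 1024, 2048)] * 2 * 18   # 2 projections in 18 recurrent blocks
    + [("mlp", 1024, 4096)] * 2 * 24         # 2 projections in 24 MLP blocks
)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_by_type(layers: List[Tuple[str, int, int]], rank: int) -> Dict[str, int]:
    """Sum LoRA parameters per component type."""
    totals: Dict[str, int] = {}
    for kind, d_in, d_out in layers:
        totals[kind] = totals.get(kind, 0) + lora_params(d_in, d_out, rank)
    return totals


totals = trainable_by_type(LAYERS, rank=8)
full_model = sum(totals.values())
for kind, count in totals.items():
    print(f"{kind:>9}: {count:>9,d} trainable parameters")
print(f"full-model / attention-only ratio: {full_model / totals['attention']:.2f}")
```

The ratio printed here depends only on the assumed shapes; it shows why adapting a minority component can be far cheaper than full-model adaptation, and says nothing about accuracy.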

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

2604.22119v1 by Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris

As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

2604.22117v1 by Harsh Kumar, Rahul Maity, Tanmay Joshi, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.

Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

2604.22109v1 by Nalin Poungpeth, Nicholas Clark, Tanu Mitra

Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find that LLMs spontaneously persuade the user in virtually all conversations, relying heavily on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLMs in persuading users, as well as the perception of models as objective and impartial.

Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation

2604.22102v1 by Arthur Jakobsson, Abhinav Mahajan, Karthik Pullalarevu, Krishna Suresh, Yunchao Yao, Yuemin Mao, Bardienus Duisterhof, Shahram Najam Syed, Jeffrey Ichnowski

Many robotic tasks are unforgiving; a single mistake in a dynamic throw can lead to unacceptable delays or unrecoverable failure. To mitigate this, we present a novel approach that leverages learned simulation priors to inform goal-conditioned dynamic manipulation of ropes for efficient and accurate task execution. Related methods for dynamic rope manipulation either require large real-world datasets to estimate rope behavior or rely on iteratively improving repeated attempts at the task to reach the goal. We introduce Wiggle and Go!, a two-stage system-identification framework that enables zero-shot rope manipulation. The framework consists of a system identification module that observes rope movement to predict descriptive physical parameters, which then inform an optimization method that predicts goal-conditioned actions for the robot to execute zero-shot in the real world. Our method achieves strong performance across multiple dynamic manipulation tasks, enabled by the same task-agnostic system identification module, which offers seamless switching between different manipulation tasks and allows a single model to support a diverse array of manipulation policies. On real-world 3D target striking, we achieve an average accuracy of 3.55 cm using rope system parameters, compared with 15.34 cm when our task model is not system-parameter-informed. We achieve a Pearson correlation coefficient of 0.95 between Fourier frequencies of the predicted and real ropes on an unseen trajectory. For the project website, see https://wiggleandgo.github.io/

Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation

2604.22098v1 by Weisi Liu, Guangzeng Han, Xiaolei Huang

Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data but deployed on future data, where semantic distributions and domain knowledge may have evolved. Unfortunately, existing studies either overlook temporal shifts or barely capture the rich shifting patterns of both semantics and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontologies such as MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks over clinical, legal, and scientific corpora, demonstrating consistent improvements from temporal adaptation across these domains. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.

An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation

2604.22095v1 by Mykola Trokhymovych, Yana Oliinyk, Nazarii Nyzhnyk

This paper presents a highly efficient Retrieval-Augmented Generation (RAG) system built specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. Our solution features a custom two-stage search pipeline that retrieves relevant document pages, paired with a specialized Ukrainian language model fine-tuned on synthetic data to generate accurate, grounded answers. Finally, we compress the model for lightweight deployment. Evaluated under strict computational limits, our architecture demonstrates that high-quality, verifiable AI question answering can be achieved locally on resource-constrained hardware without sacrificing accuracy.

Ethics Testing: Proactive Identification of Generative AI System Harms

2604.22089v1 by Shin Hwei Tan, Haibo Wang, Heng Li

Generative Artificial Intelligence (GAI) systems that can automatically generate content in the form of source code or other content (e.g., images) have seen increasing popularity due to the emergence of tools such as ChatGPT, which rely on Large Language Models (LLMs). Misuse of the automatically generated content can incur serious consequences due to potential harms in the generated content. Despite the importance of ensuring the quality of automatically generated content, there are few, if any, approaches that can systematically generate tests for identifying software harms in the content generated by these GAI systems. In this article, we introduce the novel concept of ethics testing, which aims to systematically generate tests for identifying software harms. Unlike existing testing methodologies (e.g., fairness testing, which aims to identify software discrimination), ethics testing aims to systematically detect software harms that could be induced by unethical behavior (e.g., harmful behavior or behavior that violates intellectual property rights) in automatically generated content. We introduce the concept of ethics testing, discuss the challenges therein, and conduct five case studies to show how ethics testing can be performed for generative AI systems.

Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

2604.22085v1 by Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani

The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large-language-model-mediated entity extraction, explicit graph schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh's Information Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval within sub-ninety-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.
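
The combination of typed entries, conflict resolution, and temporal versioning can be pictured with a small in-memory sketch. Everything here is an assumption made for illustration: the class names, the two example categories, and the latest-write-wins conflict policy are hypothetical and are not drawn from Memanto or from Moorcheh's engine.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class MemoryVersion:
    value: str
    version: int  # 1-based position in the entry's history


@dataclass
class TypedMemory:
    """Toy typed memory store with per-key version history (hypothetical)."""
    allowed_types: FrozenSet[str]
    _store: Dict[Tuple[str, str], List[MemoryVersion]] = field(default_factory=dict)

    def write(self, mem_type: str, key: str, value: str) -> MemoryVersion:
        if mem_type not in self.allowed_types:
            raise ValueError(f"unknown memory type: {mem_type!r}")
        history = self._store.setdefault((mem_type, key), [])
        if history and history[-1].value == value:
            return history[-1]  # same value again: nothing to resolve
        # Assumed conflict policy: the newest write supersedes the old value,
        # and the old value is kept as an earlier version.
        entry = MemoryVersion(value=value, version=len(history) + 1)
        history.append(entry)
        return entry

    def read(self, mem_type: str, key: str) -> str:
        """Return the current (latest) value; raises KeyError if absent."""
        return self._store[(mem_type, key)][-1].value

    def history(self, mem_type: str, key: str) -> List[MemoryVersion]:
        """Return all versions, oldest first."""
        return list(self._store.get((mem_type, key), []))


memory = TypedMemory(allowed_types=frozenset({"preference", "fact"}))
memory.write("preference", "editor", "vim")
memory.write("preference", "editor", "emacs")  # conflicts with the earlier value
print(memory.read("preference", "editor"))
print([v.value for v in memory.history("preference", "editor")])
```

A single dictionary lookup answers a read in this toy version, which loosely mirrors the abstract's point about needing only one retrieval query; real semantic retrieval is of course not a key lookup.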

Removing Sandbagging in LLMs by Training with Weak Supervision

2604.22082v1 by Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar

As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can training elicit a model's best work even without reliable verification? We study this using model organisms trained to sandbag, testing elicitation techniques on problem-solving math, graduate-level science, and competitive coding tasks. We find that training with weak supervision can reliably elicit sandbagging models when supervised fine-tuning (SFT) and reinforcement learning (RL) are combined: SFT on weak demonstrations breaks the sandbagging behavior, enabling RL to then fully elicit performance. Neither method succeeds reliably alone: RL without SFT almost always leads to reward hacking rather than genuine improvement. Critically, this relies on training being indistinguishable from deployment; when models can distinguish between training and deployment, they can perform well during training while continuing to sandbag afterward. Our results provide initial evidence that training is a viable mitigation against sandbagging, while highlighting the importance of making training indistinguishable from deployment.

Sound Agentic Science Requires Adversarial Experiments

2604.22080v1 by Dionizije Fa, Marko Culjak

LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

2604.22076v1 by Xiaoyi Chen, Haoyuan Wang, Siyuan Tang, Sijia Liu, Liya Su, XiaoFeng Wang, Haixu Tang

Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

2604.22074v1 by Qinan Yu, Alexa Tartaglini, Peter Hase, Carlos Guestrin, Christopher Potts

Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.

Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning

2604.22072v1 by Amine Barrak

Federated learning (FL) aggregation on serverless platforms faces a hard scalability ceiling: existing architectures (lambda-FL, LIFL) partition clients across aggregators, but every aggregator must hold the complete model gradient in memory. When gradients exceed the per-function memory limit (e.g., 10 GB on AWS Lambda), aggregation becomes infeasible regardless of tree depth or branching factor. We propose GradsSharding, which instead partitions the gradient tensor into M shards, each averaged independently by a serverless function that receives contributions from all clients. Because FedAvg averaging is element-wise, this produces bit-identical results to tree-based approaches, so model accuracy is invariant by construction. Per-function memory is bounded at O(|θ|/M), independent of client count, enabling aggregation of arbitrarily large models. We evaluate GradsSharding against lambda-FL and LIFL through HPC experiments and real AWS Lambda deployments across model sizes from 43 MB to 5 GB. Results show a cost crossover at approximately 500 MB gradient size, 2.7x cost reduction at VGG-16 scale, and that GradsSharding is the only architecture that remains deployable beyond the serverless memory ceiling.
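
The claim that sharded averaging matches whole-gradient averaging follows from FedAvg being element-wise, and can be checked with a minimal pure-Python sketch. The function names and the contiguous, near-equal sharding scheme below are assumptions for illustration; the loop over shards stands in for independent serverless functions and is not the paper's implementation.

```python
from typing import List


def fedavg_full(client_grads: List[List[float]]) -> List[float]:
    """Baseline: one aggregator averages the whole gradient element-wise."""
    n = len(client_grads)
    return [sum(vals) / n for vals in zip(*client_grads)]


def shard_bounds(size: int, num_shards: int) -> List[range]:
    """Split indices 0..size-1 into contiguous, near-equal ranges."""
    base, extra = divmod(size, num_shards)
    bounds, start = [], 0
    for s in range(num_shards):
        stop = start + base + (1 if s < extra else 0)
        bounds.append(range(start, stop))
        start = stop
    return bounds


def fedavg_sharded(client_grads: List[List[float]], num_shards: int) -> List[float]:
    """Average each shard independently, then concatenate the results."""
    size = len(client_grads[0])
    averaged: List[float] = []
    for idx in shard_bounds(size, num_shards):
        # One loop iteration stands in for one serverless function: it only
        # ever holds its own slice of every client's gradient, about size/M values.
        shard = [g[idx.start:idx.stop] for g in client_grads]
        averaged.extend(fedavg_full(shard))
    return averaged


grads = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [0.5, 0.25, 0.125, 1.5, 2.5],
    [3.0, 1.0, 4.0, 1.0, 5.0],
]
assert fedavg_sharded(grads, num_shards=2) == fedavg_full(grads)
print(fedavg_sharded(grads, num_shards=2))
```

Because every element is summed over the same clients in the same order in both paths, the results compare equal with `==`, not merely within a tolerance.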

Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

2604.22067v1 by Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi

Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.

Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

2604.22063v1 by Shevya Pandya, Shinjini Bose, Ananya Joshi

Large language models (LLMs) are increasingly used in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these effects, especially in the psychiatric domain. We propose an approach for auditing the reliability of downstream LLM tasks that structures evaluation around the impact of prompt design and of medically insignificant inputs on predicted hospitalization risk scores; predicting such a score is often the first downstream AI clinical-decision-making task. In our audit, we generate a cohort of synthetic patient profiles (n = 50), each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, and evaluate it under four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini). Our results show that including medically insignificant variables produced a statistically significant increase in the absolute mean predicted hospitalization risk and in output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features affected instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how sensitive LLM-based psychiatric risk assessments are to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior, such as this one, before clinical deployment.
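
The two quantities the audit tracks, shift in mean predicted risk and shift in output variability, can be sketched as a paired comparison per patient profile. The function name, the use of population standard deviation as the variability measure, and the toy scores are assumptions for illustration; the abstract does not specify the statistics or tests used.

```python
from statistics import mean, pstdev
from typing import Dict, List


def stability_report(
    baseline: Dict[str, List[float]],
    noisy: Dict[str, List[float]],
) -> Dict[str, float]:
    """Compare repeated risk scores per profile under two conditions.

    baseline[p]: scores for profile p using only clinically relevant features.
    noisy[p]:    scores for the same profile with insignificant features added.
    Returns the average per-profile change in mean score and in spread.
    """
    mean_shift = [mean(noisy[p]) - mean(baseline[p]) for p in baseline]
    spread_shift = [pstdev(noisy[p]) - pstdev(baseline[p]) for p in baseline]
    return {
        "mean_shift": mean(mean_shift),      # > 0: risk scores drift upward
        "spread_shift": mean(spread_shift),  # > 0: outputs become less stable
    }


# Toy scores on a 0-1 risk scale, invented for this example.
baseline_scores = {"p1": [0.2, 0.2, 0.2], "p2": [0.5, 0.5, 0.5]}
noisy_scores = {"p1": [0.2, 0.3, 0.4], "p2": [0.5, 0.6, 0.7]}

report = stability_report(baseline_scores, noisy_scores)
print(report)
```

Running this per model and per prompt reframing would give one row of a model-by-prompt stability table; whether a shift is statistically significant requires a separate test that this sketch does not perform.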

Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning

2604.22062v1 by Karthic Palaniappan

There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of "thinking systems". With Qwen3-VL-2B-Instruct as base model and 4 × Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75% over SymPy. I've documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: https://github.com/i-like-bfs-and-dfs/wolfram-reasoning.

Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching

2604.22061v1 by Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong

Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.
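
To make the first stage of this pipeline concrete, the sketch below selects the EHR segments most relevant to one eligibility criterion before anything is passed to a language model. It assumes a plain token-overlap (Jaccard) score as a stand-in for the retriever; the abstract does not say which retriever is used, and `retrieve_segments`, `tokenize`, and the example texts are hypothetical.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_segments(segments: List[str], criterion: str, k: int = 2) -> List[str]:
    """Return the k segments with the highest Jaccard overlap with the criterion."""
    query = tokenize(criterion)

    def score(segment: str) -> float:
        tokens = tokenize(segment)
        union = tokens | query
        return len(tokens & query) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original document order.
    return sorted(segments, key=score, reverse=True)[:k]


ehr = [
    "Patient has type 2 diabetes treated with metformin.",
    "Family visited on Sunday.",
    "HbA1c measured at 8.1 percent last month.",
]
selected = retrieve_segments(ehr, "type 2 diabetes with HbA1c above 7 percent")
```

Only the selected segments would then be encoded, reduced in dimension, and handed to a lightweight classifier, which is where the computational savings described above would come from.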

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

2604.22050v1 by Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric

Transformers mostly rely on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
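
The three-way decision described above can be pictured as a simple mapping from per-layer sensitivity scores to an attention strategy. The sketch below is a minimal illustration under stated assumptions: the abstract does not define the sensitivity metric or any cut-offs, so the scores, the `high` and `low` thresholds, and the name `assign_strategies` are hypothetical.

```python
from typing import List


def assign_strategies(sensitivity: List[float],
                      high: float = 0.5,
                      low: float = 0.1) -> List[str]:
    """Map each layer's sensitivity score to one of three attention strategies.

    sensitivity[i] is assumed to measure how much model quality drops when
    layer i's attention is altered (larger means more sensitive).
    """
    plan = []
    for score in sensitivity:
        if score >= high:
            plan.append("softmax")                 # keep full attention
        elif score >= low:
            plan.append("linear_sliding_window")   # cheaper replacement
        else:
            plan.append("no_attention")            # drop attention entirely
    return plan


# Hypothetical scores for a four-layer model.
plan = assign_strategies([0.9, 0.3, 0.02, 0.6])
```

In the method as described, a short distillation phase would follow this assignment to recover quality; that step is not sketched here.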

Call-Chain-Aware LLM-Based Test Generation for Java Projects

2604.22046v1 by Guancheng Wang, Qinghua Xu, Lionel C. Briand, Zhaoqiang Guo, Kui Liu

Large language models (LLMs) have recently shown strong potential for generating project-level unit tests. However, existing state-of-the-art approaches primarily rely on execution-path information to guide prompt construction, which is often insufficient for complex software systems with rich inter-class dependencies, deep call chains, and intricate object initialization requirements. In this paper, we present CAT, a novel call-chain-aware LLM-based test generation approach that explicitly incorporates call-chain and dependency contexts into prompts through dedicated static analysis. To construct executable, semantically valid test contexts, CAT systematically models caller--callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation failures occur. We evaluate CAT on the widely used Defects4J benchmark and on four real-world GitHub projects released after the LLM's cut-off date. The results show that, across projects in Defects4J, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, while consistently achieving superior performance on post-cutoff real-world projects. An ablation study further demonstrates the importance of call-chain and dependency contexts in CAT.
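
As a rough illustration of what call-chain context means here, the sketch below enumerates the caller chains that lead to a focal method, given a caller-to-callee map. It is not CAT's static analysis; the function `call_chains`, the `max_depth` cap, and the example class and method names are all hypothetical.

```python
from typing import Dict, List


def call_chains(graph: Dict[str, List[str]], target: str,
                max_depth: int = 5) -> List[List[str]]:
    """List caller chains (outermost caller first) that end at `target`.

    `graph` maps each method to the methods it calls. Chains stop at
    methods with no callers or once they contain `max_depth` methods.
    """
    # Invert the graph: callee -> list of direct callers.
    callers: Dict[str, List[str]] = {}
    for caller, callees in graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    chains: List[List[str]] = []

    def walk(method: str, chain: List[str]) -> None:
        # Skip callers already on the chain so recursion cannot loop.
        preds = [c for c in callers.get(method, []) if c not in chain]
        if not preds or len(chain) >= max_depth:
            chains.append(list(reversed(chain)))
            return
        for pred in preds:
            walk(pred, chain + [pred])

    walk(target, [target])
    return chains


graph = {
    "Api.handle": ["OrderService.checkout"],
    "OrderService.checkout": ["Cart.total", "Payment.charge"],
    "Cart.total": ["Item.price"],
}
chains = call_chains(graph, "Item.price")
```

Each chain could then be rendered into the prompt together with the constructors needed to build the objects along it, which is the kind of context the abstract argues execution-path information alone does not provide.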

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

2604.22045v1 by Ayushi Mehrotra, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi

Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.
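
The detection stage rests on a standard fact: two inputs interact locally when the mixed second derivative of the output with respect to them is non-zero. The toy sketch below estimates those Hessian entries by finite differences and merges interacting pairs into sets with union-find. It is an illustration of the idea on a four-variable function, not the H-Sets algorithm (no segmentation prior, no attribution stage), and `mixed_partial`, `interacting_sets`, the step size, and the threshold are all assumptions.

```python
from typing import Callable, Dict, List, Sequence


def mixed_partial(f: Callable[[Sequence[float]], float], x: Sequence[float],
                  i: int, j: int, h: float = 1e-3) -> float:
    """Finite-difference estimate of d^2 f / (dx_i dx_j) at x, for i != j."""
    def shifted(di: float, dj: float) -> float:
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (shifted(h, h) - shifted(h, 0.0) - shifted(0.0, h) + shifted(0.0, 0.0)) / (h * h)


def interacting_sets(f: Callable[[Sequence[float]], float], x: Sequence[float],
                     threshold: float = 0.1) -> List[List[int]]:
    """Group feature indices whose pairwise Hessian entries exceed the threshold."""
    n = len(x)
    parent = list(range(n))

    def find(a: int) -> int:
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if abs(mixed_partial(f, x, i, j)) > threshold:
                parent[find(i)] = find(j)  # merge the two sets

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def toy_model(v: Sequence[float]) -> float:
    # Features 0, 1 and 3 interact through products; feature 2 is additive.
    return v[0] * v[1] + v[2] + v[1] * v[3]


sets = interacting_sets(toy_model, [1.0, 2.0, 3.0, 4.0])  # [[0, 1, 3], [2]]
```

Features 0 and 3 never appear in the same product, yet they end up in one set because both interact with feature 1; this transitive merging is the reason pairwise detection can yield higher-order sets.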

Source-Modality Monitoring in Vision-Language Models

2604.22038v1 by Etha Tianze Hua, Tian Yun, Ellie Pavlick

We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like "image" in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.

EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

2604.22036v1 by Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix, Iran Roman, Bea Steers, Claudio Silva, Yuhan Shen, Zijia Lu, Shih-Po Lee, Ehsan Elhamifar

This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).

Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning

2604.22031v1 by João Mattos, Arlei Silva

We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8 to 27 times less training time than the strongest baseline.
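
Pre-training on few-shot episodes means the model repeatedly sees small N-way K-shot tasks shaped like the ones it will face at evaluation time. The sketch below shows how one such episode could be sampled for node classification. It is a generic episodic sampler, not Mochi's; `sample_episode`, its parameters, and the toy labels are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[int, int]  # (node id, class label)


def sample_episode(labels: Dict[int, int], n_way: int, k_shot: int,
                   n_query: int, rng: random.Random
                   ) -> Tuple[List[Example], List[Example]]:
    """Sample one N-way K-shot episode as (support set, query set)."""
    by_class: Dict[int, List[int]] = {}
    for node, label in labels.items():
        by_class.setdefault(label, []).append(node)

    # Only classes with enough nodes for both support and query are usable.
    eligible = sorted(c for c, nodes in by_class.items()
                      if len(nodes) >= k_shot + n_query)
    classes = rng.sample(eligible, n_way)

    support: List[Example] = []
    query: List[Example] = []
    for c in classes:
        nodes = rng.sample(sorted(by_class[c]), k_shot + n_query)
        support += [(v, c) for v in nodes[:k_shot]]   # labeled examples
        query += [(v, c) for v in nodes[k_shot:]]     # nodes to predict
    return support, query


# Toy graph: 30 nodes spread evenly over 3 classes.
labels = {node: node % 3 for node in range(30)}
support, query = sample_episode(labels, n_way=2, k_shot=3, n_query=2,
                                rng=random.Random(0))
```

A training step would then predict the query labels from the support set and backpropagate that loss, so the pre-training objective has the same form as the downstream few-shot evaluation.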

Shared Lexical Task Representations Explain Behavioral Variability In LLMs

2604.22027v1 by Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau, Michael L. Littman, Stephen H. Bach, Ellie Pavlick

One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.

Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity

2604.22018v1 by Deepank Girish, Yi Hao Chan, Sukrit Gupta, Jing Xia, Jagath C. Rajapakse

Several brain foundation models (FM) have recently been proposed to predict brain disorders by modelling dynamic functional connectivity (FC). While they demonstrate remarkable model performance and zero- or few-shot generalization, the salient features identified as potential biomarkers are yet to be thoroughly evaluated. We propose RE-CONFIRM, a framework for evaluating the robustness of potential biomarker candidates elucidated by deep learning (DL) models including FMs. From experiments on five large datasets of Autism Spectrum Disorder (ASD), Attention-deficit Hyperactivity Disorder (ADHD), and Alzheimer's Disease (AD), we found that although commonly used performance metrics provide an intuitive assessment of model predictions, they are insufficient for evaluating the robustness of biomarkers identified by these models. RE-CONFIRM metrics revealed that simply finetuning FMs leads to models that fail to capture regional hubs effectively, even in disorders where hubs are known to be implicated, such as ASD and ADHD. In view of this, we propose Hub-LoRA (Low-Rank Adaptation) as a fine-tuning technique that enables FMs to not only outperform customised DL models but also produce neurobiologically faithful biomarkers supported by meta-analyses. RE-CONFIRM is generalizable and can be easily applied to ascertain the robustness of DL models trained on functional MRI datasets. Code is available at: https://github.com/SCSE-Biomedical-Computing-Group/RE-CONFIRM.

When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation

2604.22002v1 by Anamta Khan, Ratna Kandala, Deepti, Sheza Munir, Joyojeet Pal

Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

2604.21999v1 by Grigory Sapunov

We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested -- 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing -- no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes >70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves' recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 ("deep start," p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.
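
The arithmetic behind the initialization trap can be checked directly. In ACT, a token halts at the first step where its cumulative halting probability reaches 1 - eps. The sketch below assumes the router emits a constant p = sigmoid(bias) at initialization (weights near zero), eps = 0.01, and a bias of 1.0 for the p ~ 0.73 case; these values are inferred from the probabilities quoted above rather than taken from the paper's code, and `steps_until_halt` is a hypothetical helper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def steps_until_halt(bias: float, eps: float = 0.01, max_steps: int = 64) -> int:
    """Ponder steps until the cumulative halting probability reaches 1 - eps.

    Assumes the router outputs the same p = sigmoid(bias) at every step,
    which approximates its behavior right after initialization.
    """
    p = sigmoid(bias)
    total = 0.0
    for step in range(1, max_steps + 1):
        total += p
        if total >= 1.0 - eps:
            return step
    return max_steps


zero_bias = steps_until_halt(0.0)       # p = 0.5    -> 2 steps
positive_bias = steps_until_halt(1.0)   # p ~ 0.73   -> 2 steps
deep_start = steps_until_halt(-3.0)     # p ~ 0.047  -> 21 steps
```

Under these assumptions both the zero and positive bias halt after two steps, matching the "~2 steps at initialization" reported above, while a bias of -3 leaves about twenty steps of recursion available before the router has learned anything.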
