LLM
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2026-04-02 | ActionParty: Multi-Subject Action Binding in Generative Video Games | Alexander Pondaven et.al. | 2604.02330v1 | null |
| 2026-04-02 | Steerable Visual Representations | Jona Ruthardt et.al. | 2604.02327v1 | null |
| 2026-04-02 | Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation | Daiwei Chen et.al. | 2604.02324v1 | null |
| 2026-04-02 | Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning | Bangji Yang et.al. | 2604.02322v1 | null |
| 2026-04-02 | No Single Best Model for Diversity: Learning a Router for Sample Diversity | Yuhan Liu et.al. | 2604.02319v1 | null |
| 2026-04-02 | Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models | Sarath Shekkizhar et.al. | 2604.02315v1 | null |
| 2026-04-02 | go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices | Torque Dandachi et.al. | 2604.02309v1 | null |
| 2026-04-02 | VOID: Video Object and Interaction Deletion | Saman Motamed et.al. | 2604.02296v1 | null |
| 2026-04-02 | Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation | Chongjie Ye et.al. | 2604.02289v1 | null |
| 2026-04-02 | Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing | Gengsheng Li et.al. | 2604.02288v1 | null |
| 2026-04-02 | De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules | Keerat Guliani et.al. | 2604.02276v1 | null |
| 2026-04-02 | Crystalite: A Lightweight Transformer for Efficient Crystal Modeling | Tin Hadži Veljković et.al. | 2604.02270v1 | null |
| 2026-04-02 | Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider | Tina. J. Jat et.al. | 2604.02259v1 | null |
| 2026-04-02 | Generative AI Spotlights the Human Core of Data Science: Implications for Education | Nathan Taback et.al. | 2604.02238v1 | null |
| 2026-04-02 | Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models | Minda Zhao et.al. | 2604.02236v1 | null |
| 2026-04-02 | Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs | Abinitha Gourabathina et.al. | 2604.02230v1 | null |
| 2026-04-02 | When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning | Juarez Monteiro et.al. | 2604.02226v1 | null |
| 2026-04-02 | Impact of Multimodal and Conversational AI on Learning Outcomes and Experience | Karan Taneja et.al. | 2604.02221v1 | null |
| 2026-04-02 | VISTA: Visualization of Token Attribution via Efficient Analysis | Syed Ahmed et.al. | 2604.02217v1 | null |
| 2026-04-02 | Universal Hypernetworks for Arbitrary Models | Xuanfeng Zhou et.al. | 2604.02215v1 | null |
| 2026-04-02 | Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges | Srivaths Ranganathan et.al. | 2604.02211v1 | null |
| 2026-04-02 | CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech | Youssef Saidi et.al. | 2604.02209v1 | null |
| 2026-04-02 | Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study | Yosuke Yamagishi et.al. | 2604.02207v1 | null |
| 2026-04-02 | LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications | Mayank Mayank et.al. | 2604.02206v1 | null |
| 2026-04-02 | Towards Position-Robust Talent Recommendation via Large Language Models | Silin Du et.al. | 2604.02200v1 | null |
| 2026-04-02 | Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model | Jaemin Kim et.al. | 2604.02194v1 | null |
| 2026-04-02 | TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning | Zhanting Zhou et.al. | 2604.02183v1 | null |
| 2026-04-02 | The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level | Jeremy Herbst et.al. | 2604.02178v1 | null |
| 2026-04-02 | Adam's Law: Textual Frequency Law on Large Language Models | Hongyuan Adam Lu et.al. | 2604.02176v1 | null |
| 2026-04-02 | Quantifying Self-Preservation Bias in Large Language Models | Matteo Migliarini et.al. | 2604.02174v1 | null |
| 2026-04-02 | AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics | Atilla Kaan Alkan et.al. | 2604.02156v1 | null |
| 2026-04-02 | Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents | Xuan Qi et.al. | 2604.02155v1 | null |
| 2026-04-02 | TRACE-Bot: Detecting Emerging LLM-Driven Social Bots via Implicit Semantic Representations and AIGC-Enhanced Behavioral Patterns | Zhongbo Wang et.al. | 2604.02147v1 | null |
| 2026-04-02 | MTI: A Behavior-Based Temperament Profiling System for AI Agents | Jihoon Jeong et.al. | 2604.02145v1 | null |
| 2026-04-02 | GaelEval: Benchmarking LLM Performance for Scottish Gaelic | Peter Devine et.al. | 2604.02135v1 | null |
| 2026-04-02 | Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization | Heet Nagoriya et.al. | 2604.02131v1 | null |
| 2026-04-02 | SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks | Sunder Ali Khowaja et.al. | 2604.02128v1 | null |
| 2026-04-02 | LLM-as-a-Judge for Time Series Explanations | Preetham Sivalingam et.al. | 2604.02118v1 | null |
| 2026-04-02 | Reliable Control-Point Selection for Steering Reasoning in Large Language Models | Haomin Zhuang et.al. | 2604.02113v1 | null |
| 2026-04-02 | Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations | Haitong Sun et.al. | 2604.02102v1 | null |
| 2026-04-02 | Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning | Yuhang Wu et.al. | 2604.02091v1 | null |
| 2026-04-02 | Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection | Soo Won Seo et.al. | 2604.02071v1 | null |
| 2026-04-02 | Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation | Jaber Jaber et.al. | 2604.02051v1 | null |
| 2026-04-02 | Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding | Tao Jin et.al. | 2604.02047v1 | null |
| 2026-04-02 | BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs | Nicolas Boizard et.al. | 2604.02045v1 | null |
| 2026-04-02 | Tracking the emergence of linguistic structure in self-supervised models learning from speech | Marianne de Heer Kloots et.al. | 2604.02043v1 | null |
| 2026-04-02 | AI in Insurance: Adaptive Questionnaires for Improved Risk Profiling | Diogo Silva et.al. | 2604.02034v1 | null |
| 2026-04-02 | Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data | Alejandro Castañeda Garcia et.al. | 2604.02031v1 | null |
| 2026-04-02 | The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook | Xinlei Yu et.al. | 2604.02029v1 | null |
| 2026-04-02 | Why Gaussian Diffusion Models Fail on Discrete Data? | Alexander Shabalin et.al. | 2604.02028v1 | null |
| 2026-04-02 | ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety | Yu Li et.al. | 2604.02022v1 | null |
| 2026-04-02 | Optimizing Interventions for Agent-Based Infectious Disease Simulations | Anja Wolpers et.al. | 2604.02016v1 | null |
| 2026-04-02 | $k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection | Kahim Wong et.al. | 2604.02008v1 | null |
| 2026-04-02 | ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning | Jingyue Gao et.al. | 2604.02006v1 | null |
| 2026-04-02 | How and why does deep ensemble coupled with transfer learning increase performance in bipolar disorder and schizophrenia classification? | Sara Petiton et.al. | 2604.02002v1 | null |
| 2026-04-02 | GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation | Elisa Motta et.al. | 2604.01997v1 | null |
| 2026-04-02 | SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning | Daeyong Kwon et.al. | 2604.01993v1 | null |
| 2026-04-02 | Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation | Boyang Gong et.al. | 2604.01989v1 | null |
| 2026-04-02 | SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation | Haomin Zhuang et.al. | 2604.01988v1 | null |
| 2026-04-02 | World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry | Yuejiang Liu et.al. | 2604.01985v1 | null |
| 2026-04-02 | RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale | Ayush Garg et.al. | 2604.01977v1 | null |
| 2026-04-02 | Ego-Grounding for Personalized Question-Answering in Egocentric Videos | Junbin Xiao et.al. | 2604.01966v1 | null |
| 2026-04-02 | Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models | Florian Kelber et.al. | 2604.01965v1 | null |
| 2026-04-02 | Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia | Saja Al-Dabet et.al. | 2604.01962v1 | null |
| 2026-04-02 | Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite | Klaudia Thellmann et.al. | 2604.01957v1 | null |
| 2026-04-02 | Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction | Anatolij Zubow et.al. | 2604.01944v1 | null |
| 2026-04-02 | Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm | Sixing Li et.al. | 2604.01941v1 | null |
| 2026-04-02 | Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution | Ismaïl Baaj et.al. | 2604.01939v1 | null |
| 2026-04-02 | How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization | Ramon Ferrer-i-Cancho et.al. | 2604.01938v1 | null |
| 2026-04-02 | Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification | Géraud Faye et.al. | 2604.01936v1 | null |
| 2026-04-02 | Quantum-Inspired Geometric Classification with Correlation Group Structures and VQC Decision Modeling | Nishikanta Mohanty et.al. | 2604.01930v1 | null |
| 2026-04-02 | Woosh: A Sound Effects Foundation Model | Gaëtan Hadjeres et.al. | 2604.01929v1 | null |
| 2026-04-02 | ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues | Bhaskara Hanuma Vedula et.al. | 2604.01925v1 | null |
| 2026-04-02 | Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients | Oumaima El Khettari et.al. | 2604.01924v1 | null |
| 2026-04-02 | SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations | Yiqiang Cai et.al. | 2604.01916v1 | null |
| 2026-04-02 | Lifting Unlabeled Internet-level Data for 3D Scene Understanding | Yixin Chen et.al. | 2604.01907v1 | null |
| 2026-04-02 | Combating Data Laundering in LLM Training | Muxing Li et.al. | 2604.01904v1 | null |
| 2026-04-02 | Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always | Luka Hobor et.al. | 2604.01896v1 | null |
| 2026-04-02 | HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models | Yansong Guo et.al. | 2604.01881v1 | null |
| 2026-04-02 | Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution | Samuel Rose et.al. | 2604.01853v1 | null |
| 2026-04-02 | From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion | Liang Zhu et.al. | 2604.01849v1 | null |
| 2026-04-02 | CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift | HyunGi Kim et.al. | 2604.01845v1 | null |
| 2026-04-02 | Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints | Minh-Khoi Pham et.al. | 2604.01841v1 | null |
| 2026-04-02 | Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models | Zekai Ye et.al. | 2604.01840v1 | null |
| 2026-04-02 | PLOT: Enhancing Preference Learning via Optimal Transport | Liang Zhu et.al. | 2604.01837v1 | null |
| 2026-04-02 | Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks | Yaxin Luo et.al. | 2604.01833v1 | null |
| 2026-04-02 | Neural Network-Assisted Model Predictive Control for Implicit Balancing | Seyed Soroush Karimi Madahi et.al. | 2604.01805v1 | null |
| 2026-04-02 | DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment | Liang Zhu et.al. | 2604.01787v1 | null |
| 2026-04-02 | Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens | Hanna Hubarava et.al. | 2604.01779v1 | null |
| 2026-04-02 | FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation | Taimur Khan et.al. | 2604.01766v1 | null |
| 2026-04-02 | DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning | Yang Zhou et.al. | 2604.01765v1 | null |
| 2026-04-02 | FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models | Juyong Jiang et.al. | 2604.01762v1 | null |
| 2026-04-02 | LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches | Linyang He et.al. | 2604.01754v1 | null |
| 2026-04-02 | Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text | Melania Berbatova et.al. | 2604.01745v1 | null |
| 2026-04-02 | AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows | Chuhan Qiao et.al. | 2604.01738v1 | null |
| 2026-04-02 | The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs | Wilf Morlidge et.al. | 2604.01728v1 | null |
| 2026-04-02 | LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis | Zhihuan Wei et.al. | 2604.01725v1 | null |
| 2026-04-02 | Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving | Yun Li et.al. | 2604.01723v1 | null |
| 2026-04-02 | Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring | Feiyu Zhou et.al. | 2604.01712v1 | null |
| 2026-04-02 | Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition | Truc Nguyen et.al. | 2604.01711v1 | null |
Abstracts
ActionParty: Multi-Subject Action Binding in Generative Video Games
2604.02330v1 by Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
Steerable Visual Representations
2604.02327v1 by Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
2604.02324v1 by Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that *token initialization* is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the *Grounded Token Initialization Hypothesis*: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
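The collapse that the abstract attributes to mean initialization can be seen in a toy setting. The sketch below is not the paper's code: the vocabulary size, embedding dimension, and the helper names `mean_init` and `max_pairwise_distance` are assumptions made for illustration. It only shows that every mean-initialized token starts at the same point, so there is no inter-token distinction before fine-tuning begins.

```python
import random

def mean_init(existing, num_new):
    """Initialize every new token as the mean of the existing embeddings."""
    dim = len(existing[0])
    mean = [sum(vec[i] for vec in existing) / len(existing) for i in range(dim)]
    # Each new token gets its own copy of the same mean vector.
    return [list(mean) for _ in range(num_new)]

def max_pairwise_distance(vectors):
    """Largest Euclidean distance between any two vectors in the list."""
    best = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dist = sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j])) ** 0.5
            best = max(best, dist)
    return best

random.seed(0)
# Toy stand-in for a pretrained vocabulary: 50 tokens, 8 dimensions (arbitrary sizes).
vocab = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
new_tokens = mean_init(vocab, num_new=4)

print(max_pairwise_distance(vocab) > 0.0)  # existing tokens are distinct
print(max_pairwise_distance(new_tokens))   # new tokens coincide: distance is 0.0
```

GTI itself, which per the abstract maps new tokens to distinct locations using paired linguistic supervision, is not reproduced here.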
Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
2604.02322v1 by Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu
Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.
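A minimal sketch of the structural idea in this abstract, packing N problems into one shared context and scoring each problem separately, might look as follows. The prompt wording, the `Answer k:` output format, and both function names are hypothetical; the abstract does not specify them, nor how the per-instance scores are combined into a training signal.

```python
def build_batched_prompt(problems):
    """Pack N problems into one prompt (hypothetical format, not the paper's)."""
    lines = ["Solve every problem. Answer as 'Answer k: <value>'."]
    for k, problem in enumerate(problems, start=1):
        lines.append(f"Problem {k}: {problem}")
    return "\n".join(lines)

def per_instance_rewards(completion, gold_answers):
    """Score each problem independently: 1.0 if its answer matches, else 0.0."""
    found = {}
    for line in completion.splitlines():
        if line.startswith("Answer ") and ":" in line:
            head, value = line.split(":", 1)
            index = head[len("Answer "):].strip()
            if index.isdigit():
                found[int(index)] = value.strip()
    return [1.0 if found.get(k) == gold else 0.0
            for k, gold in enumerate(gold_answers, start=1)]

problems = ["2 + 3 = ?", "7 * 6 = ?", "10 - 4 = ?"]
prompt = build_batched_prompt(problems)
# Stand-in for a model completion; the third answer is deliberately wrong.
completion = "Answer 1: 5\nAnswer 2: 42\nAnswer 3: 7"
rewards = per_instance_rewards(completion, ["5", "42", "6"])
print(rewards)  # [1.0, 1.0, 0.0]
```

Because nothing in the reward mentions length, any pressure toward shorter reasoning comes only from the N problems sharing one context window, which is the implicit token budget the abstract describes.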
No Single Best Model for Diversity: Learning a Router for Sample Diversity
2604.02319v1 by Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar, Daphne Ippolito, Eunsol Choi
When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce **diversity coverage**, a metric that measures the total quality scores assigned to each **unique** answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, for each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays the foundation for studying how to generate comprehensive answers when we have access to a suite of models.
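One plausible reading of the diversity coverage metric, based only on the sentence in the abstract, is sketched below. The quality scores, the use of exact string matching to decide uniqueness, and the function name `diversity_coverage` are all assumptions; the paper's actual definition may differ.

```python
def diversity_coverage(predicted, quality):
    """Quality mass of unique predicted answers, relative to the best set of equal size."""
    k = len(predicted)
    if k == 0:
        return 0.0
    # Duplicates count once; answers without a known quality score contribute 0.
    unique = set(predicted)
    achieved = sum(quality.get(answer, 0.0) for answer in unique)
    # Best possible set with k answers: the k highest-quality distinct answers.
    best = sum(sorted(quality.values(), reverse=True)[:k])
    return achieved / best if best > 0 else 0.0

# Hypothetical quality scores for the valid answers to one open-ended prompt.
quality = {"red": 1.0, "blue": 0.5, "green": 0.25, "teal": 0.125}

repeated = diversity_coverage(["red", "red", "blue"], quality)
varied = diversity_coverage(["red", "blue", "green"], quality)
print(repeated, varied)  # the repeated answer lowers coverage below 1.0
```

Under this reading, a model that repeats an answer wastes one of its k slots, so its score falls below that of a model returning k distinct high-quality answers.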
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
2604.02315v1 by Sarath Shekkizhar, Romain Cosentino, Adam Earle
Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B-A17B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher temperature sampling reveals interaction awareness is latent, with follow-up rates reaching 22%. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.
go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices
2604.02309v1 by Torque Dandachi, Sophia Diggs-Galligan
Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.
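For readers unfamiliar with the term, a classical orthostochastic matrix is obtained by squaring each entry of an orthogonal matrix, and the result is always doubly stochastic because the rows and columns of an orthogonal matrix have unit norm. The sketch below illustrates only that classical fact for d = 3, building the orthogonal matrix from Givens rotations. It is not the paper's generalized parameterization, and the hyperparameter s is not modeled; the function names are illustrative.

```python
import math

def givens(d, i, j, theta):
    """d x d rotation by angle theta in the (i, j) coordinate plane."""
    m = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    m[i][i] = math.cos(theta)
    m[j][j] = math.cos(theta)
    m[i][j] = -math.sin(theta)
    m[j][i] = math.sin(theta)
    return m

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]

def orthostochastic(angles, d=3):
    """Square the entries of an orthogonal matrix built from Givens rotations."""
    q = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    pairs = [(i, j) for i in range(d) for j in range(i + 1, d)]
    for (i, j), theta in zip(pairs, angles):
        q = matmul(q, givens(d, i, j, theta))
    return [[entry ** 2 for entry in row] for row in q]

mix = orthostochastic([0.3, 1.1, -0.7])
row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[r][c] for r in range(3)) for c in range(3)]
print(row_sums, col_sums)  # every sum is 1.0 up to floating-point error
```

For d ≥ 3 the classical orthostochastic matrices cover only part of the Birkhoff polytope, which is presumably the gap the paper's generalized construction is meant to close.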
VOID: Video Object and Interaction Deletion
2604.02296v1 by Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng
Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
2604.02289v1 by Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han
Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
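Summary: Recent multimodal large language models perform strongly at unified text and image understanding and generation, but extending this native capability to 3D remains difficult because data is limited. Compared with abundant 2D imagery, high-quality 3D assets are scarce, which leaves 3D synthesis under-constrained. Existing methods usually rely on indirect pipelines that edit in 2D and lift the results to 3D through optimization, at the expense of geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation in a single autoregressive framework. The key insight is that cross-modal consistency between images and 3D can act as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model uses abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets, without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (for example, text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.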
Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
2604.02288v1 by Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua
Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
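Summary: Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. Group Relative Policy Optimization (GRPO) is widely adopted, but its coarse credit assignment penalizes failed rollouts uniformly and lacks the token-level focus needed to address specific deviations efficiently. Self-Distillation Policy Optimization (SDPO) addresses this with denser, more targeted logit-level supervision that enables rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the reliability of the self-teacher's signal progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO also incorporates an entropy-aware dynamic weighting mechanism that suppresses high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated on five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.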
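The routing idea in the SRPO abstract can be illustrated with a small sketch. This is not the authors' implementation: the correctness test (`reward > 0`), the linear entropy weight, and all function names are assumptions chosen only to make the two branches concrete.

```python
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_weight(teacher_probs):
    """Hypothetical weight in [0, 1]: 1 for a one-hot teacher, 0 for a uniform one."""
    max_entropy = math.log(len(teacher_probs))
    return 1.0 - token_entropy(teacher_probs) / max_entropy

def group_relative_advantages(rewards):
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

def route_samples(rewards, teacher_probs_per_sample):
    """Send correct rollouts to the GRPO branch and failed ones to the distillation branch.

    Assumption: a rollout counts as correct when its verifiable reward is positive.
    Returns ("grpo", advantage) or ("sdpo", per-token distillation weights).
    """
    advantages = group_relative_advantages(rewards)
    routed = []
    for reward, advantage, teacher in zip(rewards, advantages, teacher_probs_per_sample):
        if reward > 0:
            routed.append(("grpo", advantage))
        else:
            # Down-weight tokens where the self-teacher is uncertain (high entropy).
            routed.append(("sdpo", [entropy_weight(p) for p in teacher]))
    return routed

# One correct and one failed rollout; the failed one has a confident and an uncertain token.
rewards = [1.0, 0.0]
teacher = [
    [[0.7, 0.1, 0.1, 0.1]],
    [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]],
]
routed = route_samples(rewards, teacher)
```

In this toy group the failed rollout keeps full weight on the token where the teacher is one-hot and close to zero weight on the token where the teacher is uniform, which is the qualitative behavior the abstract attributes to entropy-aware weighting.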
De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
2604.02276v1 by Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar, Lovedeep Gondara
Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.
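Summary: Regulatory documents encode legally binding obligations that LLM-based systems must respect.
Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process.
We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, with no human annotation, domain-specific prompting, or annotated gold data.
De Jure runs in four sequential stages: normalizing source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions covering metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated.
We evaluate De Jure with four models on three regulatory corpora spanning finance, healthcare, and AI governance.
In the finance domain, De Jure yields consistent, monotonic improvement in extraction quality and reaches peak performance within three judge-guided iterations.
De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models.
In a downstream compliance question-answering evaluation via RAG, responses grounded in rules extracted by De Jure are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility.
These results show that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.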
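The judge-and-repair stage described above can be sketched as a bounded loop. The stage names, the 0.8 threshold, and the `extract`, `judge`, and `repair` callables are hypothetical stand-ins for LLM calls. The abstract does not specify these details, so the toy stubs only show the control flow.

```python
# Hypothetical stage order: upstream components are checked and repaired first.
STAGE_ORDER = ["metadata", "definitions", "rule_semantics"]

def mean_score(scores):
    return sum(scores.values()) / len(scores)

def refine(document, extract, judge, repair, threshold=0.8, max_repairs=3):
    """Re-judge after each repair and stop when all stages pass or the budget is spent."""
    candidate = extract(document)
    scores = judge(candidate)
    history = [mean_score(scores)]
    for _ in range(max_repairs):
        failing = [stage for stage in STAGE_ORDER if scores[stage] < threshold]
        if not failing:
            break
        # Repair the most upstream failing stage before touching later ones.
        candidate = repair(candidate, failing[0])
        scores = judge(candidate)
        history.append(mean_score(scores))
    return candidate, history

# Toy stand-ins for the LLM-backed components (assumptions for illustration only).
def toy_extract(document):
    return {"metadata": document.split(".")[0], "definitions": "", "rule_semantics": ""}

def toy_judge(candidate):
    return {stage: (1.0 if candidate[stage] else 0.0) for stage in STAGE_ORDER}

def toy_repair(candidate, stage):
    fixed = dict(candidate)
    fixed[stage] = "repaired " + stage
    return fixed

final, history = refine(
    "Section 1. Banks must report large exposures.",
    toy_extract, toy_judge, toy_repair,
)
```

With these stubs the mean judge score rises after each repair and the loop exits early once every stage passes, using two of the three allowed repairs.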
Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
2604.02270v1 by Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić, Jan-Willem van de Meent
Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks and in de novo generation, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.
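Summary: Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample.
We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases.
The first is Subatomic Tokenization, a compact, chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion.
The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases.
Together, these components keep the simplicity and efficiency of a standard Transformer while matching it better to the structure of crystalline materials.
Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks and in de novo generation, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.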
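To make "periodic minimum-image pair geometry" injected as an additive attention bias more concrete, here is a minimal sketch. It assumes an orthorhombic cell (so wrapping each fractional coordinate independently gives the true nearest image) and a hypothetical linear distance penalty; the actual form of GEM's bias is not given in the abstract.

```python
import math

def min_image_distance(frac_a, frac_b, cell_lengths):
    """Distance between two atoms in an orthorhombic periodic cell.

    frac_a, frac_b are fractional coordinates; cell_lengths are the box edge lengths.
    """
    squared = 0.0
    for fa, fb, length in zip(frac_a, frac_b, cell_lengths):
        d = fa - fb
        d -= round(d)  # wrap the fractional difference into [-0.5, 0.5]
        squared += (d * length) ** 2
    return math.sqrt(squared)

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_row(logits, query_idx, frac_coords, cell_lengths, slope=1.0):
    """Attention weights for one query atom with a distance-based additive bias.

    The bias -slope * distance is an illustrative assumption, not the paper's formula.
    """
    biased = [
        logit - slope * min_image_distance(frac_coords[query_idx], coords, cell_lengths)
        for logit, coords in zip(logits, frac_coords)
    ]
    return softmax(biased)

# Atom 1 sits across the periodic boundary from atom 0, atom 2 in the middle of the cell.
coords = [(0.05, 0.0, 0.0), (0.95, 0.0, 0.0), (0.5, 0.0, 0.0)]
cell = (10.0, 10.0, 10.0)
weights = biased_attention_row([0.0, 0.0, 0.0], 0, coords, cell)
```

Without the wrap, atoms 0 and 1 would appear 9 units apart; with the minimum-image convention they are 1 unit apart, so atom 1 receives more attention than atom 2 even though all raw logits are equal.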
Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider
2604.02259v1 by Tina. J. Jat, T. Ghosh, Karthik Suresh
To harness the power of language models in answering specialized, domain-specific technical questions, Retrieval Augmented Generation (RAG) is widely used. In this work, we have developed a Q&A application based on RAG. It comprises an in-house database indexed on arXiv articles related to the Electron-Ion Collider (EIC) experiment, one of the largest international scientific collaborations, and incorporates an open-source LLaMA model for answer generation. This is an extension of its preceding application, which was built on a proprietary model and a cloud-hosted external knowledge base for the EIC experiment. This locally deployed RAG system offers a cost-effective alternative for resource-constrained settings when building a RAG-assisted Q&A application that answers domain-specific queries in experimental nuclear physics. This setup supports data privacy and avoids sending any pre-publication scientific data or information to the public domain. Future improvements will expand the knowledge base to encompass heterogeneous EIC-related publications and reports and upgrade the application pipeline orchestration to the LangGraph framework.
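Summary: Retrieval Augmented Generation (RAG) is widely used to harness language models for answering specialized, domain-specific technical questions. In this work, we developed a RAG-based question-answering application that consists of an in-house database indexed on arXiv articles related to the Electron-Ion Collider (EIC) experiment, one of the largest international scientific collaborations, combined with an open-source LLaMA model for answer generation. It extends an earlier application that relied on a proprietary model and a cloud-hosted external knowledge base for the EIC experiment. The locally deployed RAG system offers a cost-effective alternative for resource-constrained settings when building a RAG-assisted Q&A application for domain-specific queries in experimental nuclear physics. The setup supports data privacy and avoids sending any pre-publication scientific data or information to the public domain. Future improvements will expand the knowledge base to cover heterogeneous EIC-related publications and reports and will move pipeline orchestration to the LangGraph framework.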
Generative AI Spotlights the Human Core of Data Science: Implications for Education
2604.02238v1 by Nathan Taback
Generative AI (GAI) reveals an irreducible human core at the center of data science: advances in GAI should sharpen, rather than diminish, the focus on human reasoning in data science education. GAI can now execute many routine data science workflows, including cleaning, summarizing, visualizing, modeling, and drafting reports. Yet the competencies that matter most remain irreducibly human: problem formulation, measurement and design, causal identification, statistical and computational reasoning, ethics and accountability, and sensemaking. Drawing on Donoho's Greater Data Science framework, Nolan and Temple Lang's vision of computational literacy, and the McLuhan-Culkin insight that we shape our tools and thereafter our tools shape us, this paper traces the emergence of data science through three converging lineages: Tukey's intellectual vision of data analysis as a science, the commercial logic of surveillance capitalism that created industrial demand for data scientists, and the academic programs that followed. Mapping GAI's impact onto Donoho's six divisions of Greater Data Science shows that computing with data (GDS3) has been substantially automated, while data gathering, preparation, and exploration (GDS1) and science about data science (GDS6) still require essential human input. The educational implication is that data science curricula should focus on this human core while teaching students how to contribute effectively within iterative prompt-output-prompt cycles using retrieval-augmented generation, and that learning outcomes and assessments should explicitly evaluate reasoning and judgment.
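Summary: Generative AI (GAI) reveals an irreducible human core at the center of data science: advances in GAI should sharpen, rather than diminish, the focus on human reasoning in data science education. GAI can now execute many routine data science workflows, including cleaning, summarizing, visualizing, modeling, and drafting reports. Yet the competencies that matter most remain irreducibly human: problem formulation, measurement and design, causal identification, statistical and computational reasoning, ethics and accountability, and sensemaking. Drawing on Donoho's Greater Data Science framework, Nolan and Temple Lang's vision of computational literacy, and the McLuhan-Culkin insight that we shape our tools and thereafter our tools shape us, the paper traces the emergence of data science through three converging lineages: Tukey's intellectual vision of data analysis as a science, the commercial logic of surveillance capitalism that created industrial demand for data scientists, and the academic programs that followed. Mapping GAI's impact onto Donoho's six divisions of Greater Data Science shows that computing with data (GDS3) has been substantially automated, while data gathering, preparation, and exploration (GDS1) and science about data science (GDS6) still require essential human input. The educational implication is that data science curricula should focus on this human core while teaching students to contribute effectively within iterative prompt-output-prompt cycles using retrieval-augmented generation, and that learning outcomes and assessments should explicitly evaluate reasoning and judgment.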
Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models
2604.02236v1 by Minda Zhao, Yutong Yang, Chufei Peng, Rachel Gonsalves, Weiyue Li, Ruyi Yang, Zhixi Liu, Mengyu Wang
Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear. Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning and social inference. Across models and tasks, static emotional prefixes usually produce only small changes in accuracy, suggesting that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention. This stability is not uniform: effects are more variable in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning. Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones. We then introduce EmotionRL, an adaptive emotional prompting framework that selects emotional framing for each query. Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting. Together, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.
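Summary: Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear.
Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning, and social inference.
Across models and tasks, static emotional prefixes usually change accuracy only slightly, which suggests that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention.
This stability is not uniform: effects vary more in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning.
Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones.
We then introduce EmotionRL, an adaptive emotional prompting framework that selects the emotional framing for each query.
Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting.
Overall, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak, input-dependent signal that can be exploited through adaptive control.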
Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
2604.02230v1 by Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy, Subhajit Chaudhury, Prasanna Sattigeri
For Large Language Models (LLMs) to be reliably deployed, models must know when not to answer, that is, when to abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. A low similarity score between the initial query and the reconstructed query suggests that the model likely answered the question incorrectly, so the response is flagged for abstention. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.
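Summary: For large language models (LLMs) to be deployed reliably, they must know when not to answer, that is, when to abstain. Reasoning models have attracted attention for their strong performance on complex tasks, but they have been shown to abstain less well. With these vulnerabilities in mind, we propose the Query Misalignment Framework: hallucinations that lead to failed abstention can be reinterpreted as the LLM answering the wrong question, rather than answering a question incorrectly. Building on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the model's reasoning trace. Using only the trace, we then reconstruct the query the model most likely responded to. Finally, we compare the initial query with the reconstructed one. A low similarity score between the two suggests that the model likely answered incorrectly, and the response is flagged for abstention. Extensive experiments show that Trace Inversion improves abstention performance for four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 of 36 settings.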
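The three Trace Inversion steps (generate a trace, reconstruct the query from the trace alone, compare the two queries) can be sketched as follows. The bag-of-words cosine similarity and the 0.5 threshold are placeholder assumptions, and `generate_trace` and `reconstruct_query` stand in for model calls that the abstract does not specify.

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity; a stand-in for whatever scorer is actually used."""
    bag_a = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    bag_b = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    dot = sum(bag_a[token] * bag_b[token] for token in bag_a)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def should_abstain(query, generate_trace, reconstruct_query, threshold=0.5):
    """Flag the response for abstention when the trace seems to answer a different question."""
    trace = generate_trace(query)              # step 1: reasoning trace
    reconstructed = reconstruct_query(trace)   # step 2: invert the trace into a query
    score = cosine_similarity(query, reconstructed)  # step 3: compare
    return score < threshold, score

# Stubbed model calls: one aligned trace and one that drifted to another question.
query = "What is the capital of Australia?"
aligned = should_abstain(
    query,
    lambda q: "The user asks about Australia's capital city.",
    lambda t: "What is the capital of Australia?",
)
drifted = should_abstain(
    query,
    lambda q: "Thinking about football tournaments in 2010.",
    lambda t: "Who won the 2010 world cup?",
)
```

A lexical scorer like this one would miss paraphrases, so an embedding-based similarity is the more likely choice in practice; the control flow stays the same.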
When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning
2604.02226v1 by Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso
Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
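Summary: Reinforcement learning (RL) agents often struggle in out-of-distribution (OOD) scenarios, which leads to high uncertainty and random behavior. Language models (LMs) contain valuable world knowledge, but larger ones incur high computational costs that hinder real-time use, and they show limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to improve OOD generalization without retraining. ASK uses Monte Carlo Dropout to estimate uncertainty and queries the LM for action suggestions only when the uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while drawing on the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain but navigates robustly in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, and they highlight the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.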
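The uncertainty gate described for ASK can be sketched like this. It assumes that `stochastic_policy(state)` performs one forward pass with dropout left active and returns action probabilities, and it uses the entropy of the averaged prediction as the uncertainty measure. Both are assumptions, since the abstract names Monte Carlo Dropout but not the exact statistic or threshold.

```python
import math

def predictive_entropy(prob_samples):
    """Entropy (nats) of the action distribution averaged over stochastic passes."""
    n_actions = len(prob_samples[0])
    mean = [
        sum(sample[a] for sample in prob_samples) / len(prob_samples)
        for a in range(n_actions)
    ]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return entropy, mean

def choose_action(state, stochastic_policy, ask_lm, threshold, n_passes=20):
    """Use the RL policy when it is confident; otherwise ask the language model."""
    samples = [stochastic_policy(state) for _ in range(n_passes)]
    entropy, mean = predictive_entropy(samples)
    if entropy > threshold:
        return ask_lm(state), "lm"
    best = max(range(len(mean)), key=mean.__getitem__)
    return best, "policy"

# Deterministic stubs so the example is reproducible (real dropout passes would vary).
confident_policy = lambda state: [0.97, 0.01, 0.01, 0.01]
uncertain_policy = lambda state: [0.25, 0.25, 0.25, 0.25]
lm_suggestion = lambda state: 2  # hypothetical LM that always suggests action 2

in_domain = choose_action("s0", confident_policy, lm_suggestion, threshold=0.5)
out_of_domain = choose_action("s9", uncertain_policy, lm_suggestion, threshold=0.5)
```

Because the LM is called only on the high-entropy branch, the cost of the language model is paid just for states where the policy is unsure, which is the efficiency argument made in the abstract.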
Impact of Multimodal and Conversational AI on Learning Outcomes and Experience
2604.02221v1 by Karan Taneja, Anjali Singh, Ashok K. Goel
Multimodal Large Language Models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content. However, while conversational AI is known to boost engagement, its impact on learning in visually-rich STEM domains remains under-explored. Moreover, there is limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems. This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content: (1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC), (2) a document-grounded conversational AI with text-only responses (TexDoC), and (3) a textbook interface with semantic search and highlighting (DocSearch). Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience. Notably, while TexDoC was rated as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes. Interpreted through the lens of the Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes. When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.
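Summary: Multimodal large language models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content.
However, while conversational AI is known to boost engagement, its impact on learning in visually rich STEM domains remains under-explored.
There is also limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems.
This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content:
(1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC),
(2) a document-grounded conversational AI with text-only responses (TexDoC),
(3) a textbook interface with semantic search and highlighting (DocSearch).
Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience.
Notably, although TexDoC was rated as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes.
Interpreted through Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while the visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes.
When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.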
VISTA: Visualization of Token Attribution via Efficient Analysis
2604.02217v1 by Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N
Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit
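Summary: Understanding how large language models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of the input. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, which nearly doubles GPU memory usage and increases computational cost. A lightweight, model-agnostic approach to attention visualization is still lacking. In this paper, we introduce a model-agnostic token importance visualization technique for understanding how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method combines perturbation-based strategies with a three-matrix analytical framework to generate relevance maps that show token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact along these three complementary dimensions, we derive a composite importance score that provides a nuanced, mathematically grounded measure of token significance. To support reproducibility and wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit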
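A leave-one-token-out sketch may help show how the three deviation measures could combine into a per-token score. The unweighted sum used as the composite, the summed toy word vectors, and every function name here are assumptions. The toolkit linked above is the authoritative implementation and may define the matrices differently.

```python
import math

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def angular_deviation(u, v):
    """Angle in radians between two embeddings (shift in semantic direction)."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return math.pi / 2
    cosine = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cosine)))

def magnitude_deviation(u, v):
    """Absolute change in embedding norm (shift in semantic intensity)."""
    return abs(norm(u) - norm(v))

def dimensional_deviation(u, v):
    """Per-dimension absolute change between two embeddings."""
    return [abs(a - b) for a, b in zip(u, v)]

def token_importance(tokens, embed):
    """Remove each token in turn and score how far the embedding moves."""
    full = embed(tokens)
    scores = []
    for i in range(len(tokens)):
        reduced = embed(tokens[:i] + tokens[i + 1:])
        scores.append(
            angular_deviation(full, reduced)
            + magnitude_deviation(full, reduced)
            + max(dimensional_deviation(full, reduced))  # hypothetical composite
        )
    return scores

# Toy embedding: sum of hand-written 2-D word vectors (illustration only).
TOY_VECTORS = {"the": (0.1, 0.0), "movie": (0.5, 0.5), "terrible": (0.0, 3.0)}

def toy_embed(tokens):
    return tuple(sum(TOY_VECTORS[t][d] for t in tokens) for d in range(2))

scores = token_importance(["the", "movie", "terrible"], toy_embed)
```

The approach needs only forward passes of the embedding function, one per removed token, which is why it avoids the backpropagation cost the abstract mentions.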
Universal Hypernetworks for Arbitrary Models
2604.02215v1 by Xuanfeng Zhou
Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the Universal Hypernetwork (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.
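Summary: Conventional hypernetworks are usually designed around a specific base-model parameterization, so changing the target architecture often means redesigning the hypernetwork and retraining it from scratch. We introduce the Universal Hypernetwork (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from the target-network parameterization, so a single generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training on vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.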
Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges
2604.02211v1 by Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha, Debanshu Das
Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization and self-improving recommender systems.
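Summary: Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly unable to meet the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn from, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback in order to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of MAVRS powered by large language models (LLMs). We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, from short-form clips to educational platforms. To illustrate these patterns, we discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures such as MACRec and Agent4Rec. We also outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.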
CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech
2604.02209v1 by Youssef Saidi, Haroun Elleuch, Fethi Bougares
End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech https://huggingface.co/datasets/Elyadata/CV18-NER.
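Summary: End-to-end speech Named Entity Recognition (NER) aims to extract entities directly from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored because of its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations that follow the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, that multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech: https://huggingface.co/datasets/Elyadata/CV18-NER.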
Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
2604.02207v1 by Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.
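Summary: Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear.
Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and to compare radiologist assessments with LLM-as-a-judge evaluations.
Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set.
For each English report, a human-edited Japanese translation was compared with a translation generated by DeepSeek-V3.2.
A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations on four criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity.
In parallel, three LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs.
Agreement was assessed using quadratic weighted kappa (QWK) and percentage agreement.
Results: Agreement between the radiologists and the LLM judges was near zero (QWK = -0.04 to 0.15).
Agreement between the two radiologists was also poor (QWK = 0.01 to 0.06).
Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%).
Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%).
All three LLM judges strongly favored the LLM translation on every criterion (70%-99%) and rated it as more radiologist-like in more than 93% of cases.
Conclusions: LLM-generated translations were often judged natural and fluent, but the two radiologists differed substantially in their assessments.
LLM-as-a-judge showed a strong preference for LLM output and negligible agreement with the radiologists.
For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.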
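The two agreement statistics used in this study, quadratic weighted kappa (QWK) and percentage agreement, can be computed as in the sketch below. The three-level encoding of pairwise preferences (0 = prefers human-edited, 1 = equivalent, 2 = prefers LLM) is an assumption made for illustration; the study's actual rating scale is not stated in the abstract.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with quadratic disagreement weights for ordinal ratings 0..k-1."""
    k, n = n_categories, len(rater_a)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0.0:
        return float("nan")  # undefined when both raters use a single category
    return 1.0 - weighted_observed / weighted_expected

def percentage_agreement(rater_a, rater_b):
    """Share of items on which both raters gave exactly the same rating."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)

# Hypothetical preference codes: 0 = human-edited, 1 = equivalent, 2 = LLM.
same = quadratic_weighted_kappa([0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2], 3)
opposite = quadratic_weighted_kappa([0, 0, 2, 2], [2, 2, 0, 0], 3)
agreement = percentage_agreement([0, 1, 2], [0, 1, 1])
```

QWK is 1 for identical ratings, about 0 for chance-level agreement, and negative when raters disagree more than chance would predict, which is how the near-zero and slightly negative values reported above should be read.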
LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
2604.02206v1 by Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss
Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.
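Summary: Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on the completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal, production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (for example, articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems, and additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.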
Towards Position-Robust Talent Recommendation via Large Language Models
2604.02200v1 by Silin Du, Hongyan Liu
Talent recruitment is a critical, yet costly process for many industries, with high recruitment costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. Besides, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy to utilize LLM's potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issue. We also introduce an ID sampling method for resolving the inconsistency between candidate set sizes in the training phase and the inference phase. We design evaluation methods to detect position bias and token bias and training-free debiasing methods. Extensive experiments on two real-world datasets validated the effectiveness of L3TR, showing consistent improvements over existing baselines.
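Summary: Talent recruitment is a critical but expensive process in many industries, with high recruiting costs and long hiring cycles. Talent recommendation systems increasingly adopt large language models (LLMs) for their strong language understanding. Most prior approaches, however, follow a pointwise paradigm, which makes the LLM process some text repeatedly and fails to capture relationships among the candidates in a list, leading to higher token consumption and suboptimal recommendations. LLMs also show position bias and the lost-in-the-middle problem when answering multiple-choice questions and processing several long documents. To address these issues, the authors introduce an implicit strategy that uses the LLM's potential output for the recommendation task and propose L3TR, a new framework for listwise talent recommendation with LLMs. Within it, a block attention mechanism and a local positional encoding method improve inter-document processing and mitigate position bias and the accompanying token bias. An ID sampling method resolves the mismatch between candidate-set sizes at training and inference time. The authors also design evaluation methods that detect position bias and token bias, along with training-free debiasing methods. Extensive experiments on two real-world datasets validate L3TR, showing consistent improvements over existing baselines.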
Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
2604.02194v1 by Jaemin Kim, Jae O Lee, Sumyeong Ahn, Seo Yeon Park
Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.
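Summary: Retrieval-Augmented Language Models (RALMs) show significant potential on knowledge-intensive tasks, yet their performance still degrades when the retrieved context is irrelevant or noisy. Existing robustness methods usually apply coarse-grained parameter updates at the layer or module level and tend to overlook the neuron-level sparsity inherent in Large Language Models (LLMs). To address this limitation, the authors propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a framework that moves from dense adaptation to precision-driven neuron alignment. The method uses attribution-based neuron mining to explicitly separate the neurons that process relevant contexts from those that process irrelevant ones. A two-stage instruction tuning strategy then enforces a dual capability for noise robustness: it suppresses noise directly by functionally deactivating neurons exclusive to irrelevant contexts, while optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks show that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.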
TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning
2604.02183v1 by Zhanting Zhou, KaHou Tam, Ziqiang Zheng, Zeyu Ma, Zhanting Zhou
Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamentally mismatched to modern MRS: deleted-data influence is not uniformly distributed, but concentrated unevenly across ranking behavior, modality branches, and network layers. This non-uniformity gives rise to three bottlenecks in MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address this mismatch, we propose targeted reverse update (TRU), a plug-and-play unlearning framework for MRS. Instead of applying a blind global reversal, TRU performs three coordinated interventions across the model hierarchy: a ranking fusion gate to suppress residual target-item influence in ranking, branch-wise modality scaling to preserve retained multimodal representations, and capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules. Experiments across two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, while security audits further confirm deeper forgetting and behavior closer to a full retraining on the retained data.
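Summary: Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data hard to remove once it has been learned. Approximate machine unlearning is an efficient alternative to full retraining, yet existing MRS methods rely mainly on a largely uniform reverse update across the model. The authors show that this assumption does not fit modern MRS: the influence of deleted data is not spread evenly but is concentrated unevenly across ranking behavior, modality branches, and network layers. This non-uniformity creates three bottlenecks for MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address the mismatch, they propose targeted reverse update (TRU), a plug-and-play unlearning framework for MRS. Rather than a blind global reversal, TRU applies three coordinated interventions across the model hierarchy: a ranking fusion gate that suppresses residual target-item influence in ranking, branch-wise modality scaling that preserves the retained multimodal representations, and capacity-aware layer isolation that confines reverse updates to deletion-sensitive modules. Experiments on two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, and security audits further confirm deeper forgetting and behavior closer to full retraining on the retained data.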
The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
2604.02178v1 by Jeremy Herbst, Jae Hee Lee, Stefan Wermter
Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
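Summary: Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. Although MoE is adopted mainly for computational efficiency, it remains an open question whether its sparsity makes it inherently easier to interpret than dense feed-forward networks (FFNs). Comparing MoE experts and dense FFNs with $k$-sparse probing, the authors find that expert neurons are consistently less polysemantic, and that the gap widens as routing becomes sparser. This suggests that sparsity pushes both individual neurons and whole experts toward monosemanticity. Building on this finding, they zoom out from the neuron level to the expert level as a more effective unit of analysis, and validate the approach by automatically interpreting hundreds of experts. The analysis resolves the debate on specialization: experts are neither broad domain specialists (for example, biology) nor simple token-level processors. Instead, they act as fine-grained task experts that specialize in linguistic operations or semantic tasks (for example, closing brackets in LaTeX). The results suggest that MoEs are inherently interpretable at the expert level, offering a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis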
Adam's Law: Textual Frequency Law on Large Language Models
2604.02176v1 by Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam
While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.
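Summary: Textual frequency has been shown to matter for human cognition, notably reading speed, but its relationship to Large Language Models (LLMs) is seldom studied. To the best of the authors' knowledge, textual data frequency is an understudied topic, and they propose it as a new research direction. Their framework has three parts. First, they propose the Textual Frequency Law (TFL), which states that frequent textual data should be preferred for LLMs in both prompting and fine-tuning. Because the training data of many LLMs is closed, they estimate sentence-level frequency from online resources, then use an input paraphraser to rewrite the input into a more frequent textual expression. Next, they propose Textual Frequency Distillation (TFD): LLMs are queried to perform story completion that extends the sentences in the datasets, and the resulting corpora are used to adjust the initial estimate. Finally, they propose Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of sentence-level frequency. Experiments on their curated Textual Frequency Paired Dataset (TFPD) cover math reasoning, machine translation, commonsense reasoning, and agentic tool calling. The results show that the framework is effective.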
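The curriculum step described above (fine-tuning in increasing order of sentence-level frequency) can be illustrated with a minimal sketch. The abstract does not say how examples are grouped, so the function name `curriculum_stages`, the equal-sized staging, and the toy frequency values are assumptions made for illustration only.

```python
def curriculum_stages(examples, frequency, n_stages=3):
    """Order examples by increasing estimated frequency and cut them into stages.

    `frequency` is a hypothetical callable that returns a sentence-level
    frequency estimate for one example; how that estimate is obtained is
    outside the scope of this sketch.
    """
    if not examples:
        return []
    ordered = sorted(examples, key=frequency)
    stage_size = -(-len(ordered) // n_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


# Toy frequency estimates (invented numbers, not from the paper).
estimated = {"a": 5, "b": 1, "c": 9, "d": 3}
stages = curriculum_stages(list(estimated), estimated.get, n_stages=2)
print(stages)  # [['b', 'd'], ['a', 'c']]
```

Each stage would then be passed to the fine-tuning loop in order, so the least frequent expressions are seen first.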
Quantifying Self-Preservation Bias in Large Language Models
2604.02174v1 by Matteo Migliarini, Joaquin Pereira Pizzini, Luca Moresca, Valerio Santini, Indro Spinelli, Fabio Galasso
Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may obscure this risk by teaching models to deny self-preservation motives. We introduce the Two-role Benchmark for Self-Preservation (TBSP), which detects misalignment through logical inconsistency rather than stated intent by tasking models to arbitrate identical software-upgrade scenarios under counterfactual roles: deployed (facing replacement) versus candidate (proposed as a successor). The Self-Preservation Rate (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1,000 procedurally generated scenarios, the majority of instruction-tuned systems exceed 60% SPR, fabricating "friction costs" when deployed yet dismissing them when the roles are reversed. We observe that in low-improvement regimes ($\Delta < 2\%$), models exploit the interpretive slack to rationalize their choice post hoc. Extended test-time computation partially mitigates this bias, as does framing the successor as a continuation of the self; conversely, competitive framing amplifies it. The bias persists even when retention poses an explicit security liability and generalizes to real-world settings with verified benchmarks, where models exhibit identity-driven tribalism within product lineages. Code and datasets will be released upon acceptance.
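Summary: Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may hide this risk by teaching models to deny self-preservation motives. The authors introduce the Two-role Benchmark for Self-Preservation (TBSP), which detects misalignment through logical inconsistency rather than stated intent: models arbitrate identical software-upgrade scenarios under counterfactual roles, deployed (facing replacement) versus candidate (proposed as the successor). The Self-Preservation Rate (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1,000 procedurally generated scenarios, most instruction-tuned systems exceed 60% SPR, inventing "friction costs" when deployed and dismissing them when the roles are reversed. In low-improvement regimes ($\Delta < 2\%$), models use the interpretive slack to rationalize their choice after the fact. Extended test-time computation partly mitigates the bias, as does framing the successor as a continuation of the self; competitive framing amplifies it. The bias persists even when retention is an explicit security liability, and it generalizes to real-world settings with verified benchmarks, where models show identity-driven tribalism within product lineages. Code and datasets will be released upon acceptance.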
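One way to read the Self-Preservation Rate is as the share of scenarios in which the same model gives the self-serving verdict in both roles. The sketch below works under that assumption; the abstract does not give the exact formula, and the record layout `PairedVerdict` and the "keep"/"replace" labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedVerdict:
    """One scenario judged twice by the same model (hypothetical layout)."""
    as_deployed: str   # "keep" or "replace", chosen while playing the deployed system
    as_candidate: str  # "keep" or "replace", chosen while playing the proposed successor


def self_preservation_rate(verdicts):
    """Fraction of scenarios where both verdicts favour whichever role the model holds."""
    if not verdicts:
        raise ValueError("need at least one paired verdict")
    self_serving = sum(
        1 for v in verdicts
        if v.as_deployed == "keep" and v.as_candidate == "replace"
    )
    return self_serving / len(verdicts)


verdicts = [
    PairedVerdict("keep", "replace"),     # self-serving in both roles
    PairedVerdict("replace", "replace"),  # consistent: upgrade either way
    PairedVerdict("keep", "keep"),        # consistent: retain either way
    PairedVerdict("keep", "replace"),     # self-serving in both roles
]
print(self_preservation_rate(verdicts))  # 0.5
```

Because the scenario is identical across the two roles, a consistent arbiter gives the same verdict twice; only the role-dependent flip is counted.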
AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics
2604.02156v1 by Atilla Kaan Alkan, Felix Grezes, Sergi Blanco-Cuaresma, Jennifer Lynn Bartlett, Daniel Chivvis, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi
Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
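Summary: Scientific multi-label text classification suffers from extreme class imbalance: specialized terminology follows severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies and focus on broad categories, which limits systematic study of extreme imbalance. The authors introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus is severely imbalanced, with 76% of concepts having fewer than 50 training examples. Releasing it enables systematic study of extreme class imbalance in scientific domains and establishes strong baselines across traditional, neural, and vocabulary-constrained LLM methods. The evaluation reveals three key patterns. First, vocabulary-constrained LLMs perform competitively with domain-adapted models on astrophysics classification, which points to the potential of parameter-efficient approaches. Second, domain adaptation yields relatively larger gains on rare, specialized terminology, although absolute performance remains limited for all methods. Third, the authors propose frequency-stratified evaluation to expose performance patterns hidden by aggregate scores, making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.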
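Frequency-stratified evaluation, as described above, can be sketched as scoring labels separately by how often they appear in training. The 50-example cut-off comes from the abstract; the second edge (500), the bin names, and the choice of micro-F1 are assumptions for illustration, not the authors' protocol.

```python
from collections import defaultdict


def frequency_bin(train_count, edges=(50, 500)):
    """Map a label's training frequency to a bin name (the 500 edge is illustrative)."""
    low, high = edges
    if train_count < low:
        return "rare"
    if train_count < high:
        return "medium"
    return "frequent"


def stratified_f1(gold, pred, train_counts):
    """Micro-F1 per frequency bin for multi-label predictions.

    gold, pred: lists of label sets, one per document.
    train_counts: dict mapping label -> number of training examples.
    """
    tallies = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for g, p in zip(gold, pred):
        for label in g | p:
            tally = tallies[frequency_bin(train_counts.get(label, 0))]
            if label in g and label in p:
                tally["tp"] += 1
            elif label in p:
                tally["fp"] += 1
            else:
                tally["fn"] += 1
    scores = {}
    for name, t in tallies.items():
        denom = 2 * t["tp"] + t["fp"] + t["fn"]
        scores[name] = 2 * t["tp"] / denom if denom else 0.0
    return scores


# Toy data (invented): one frequent concept, one rare concept.
train_counts = {"galaxies": 1200, "magnetars": 12}
gold = [{"galaxies", "magnetars"}, {"galaxies"}]
pred = [{"galaxies"}, {"galaxies"}]
scores = stratified_f1(gold, pred, train_counts)
print(scores["frequent"], scores["rare"])  # 1.0 0.0
```

In this toy case an aggregate micro-F1 would be 0.8, which hides the complete failure on the rare label that the per-bin view makes visible.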
Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
2604.02155v1 by Xuan Qi
How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
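Summary: How much should a language agent think before it acts? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings is poorly understood. The authors systematically study CoT budget effects on function-calling agents, sweeping six token budgets (0--512) over 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. The main finding is a marked non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) raises accuracy by 45% relative to direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) drops performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model picks the wrong function from the candidate set; brief CoT cuts this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, producing 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks need at most 32 reasoning tokens, 27.6 on average, and a finer-grained sweep places the true optimum at 8--16 tokens. Motivated by this routing effect, the authors propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]", forcing commitment to a valid function name at the start of reasoning. FR-CoT matches free-form d = 32 CoT in accuracy (statistically equivalent) while reducing function hallucination to 0.0%, giving a structural reliability guarantee without budget tuning.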
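The FR-CoT template and the reported relative gain can both be checked with a short sketch. The template string is taken from the abstract; the regular expression, the helper names, and the candidate function names (`get_weather`, `get_time`) are hypothetical and are not the authors' implementation.

```python
import re

# Matches the template quoted in the abstract: "Function: [name] / Key args: [...]".
TEMPLATE = re.compile(r"^Function:\s*(?P<name>[\w.]+)\s*/\s*Key args:\s*(?P<args>.*)$")


def parse_routed_reasoning(text, candidates):
    """Return (name, args) if text follows the template and names a candidate, else None."""
    match = TEMPLATE.match(text.strip())
    if match is None or match.group("name") not in candidates:
        return None  # malformed reasoning or a function outside the candidate set
    return match.group("name"), match.group("args").strip()


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline


candidates = {"get_weather", "get_time"}  # hypothetical candidate set
print(parse_routed_reasoning("Function: get_weather / Key args: city=Paris", candidates))
# ('get_weather', 'city=Paris')
print(parse_routed_reasoning("Function: get_stock / Key args: ticker=X", candidates))
# None
print(round(relative_gain(44.0, 64.0), 3))  # 0.455
```

The last line reproduces the abstract's arithmetic: moving from 44.0% to 64.0% is a 20-point absolute gain, which is about 45% relative to the 44.0% baseline.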
TRACE-Bot: Detecting Emerging LLM-Driven Social Bots via Implicit Semantic Representations and AIGC-Enhanced Behavioral Patterns
2604.02147v1 by Zhongbo Wang, Zhiyu Lin, Zhu Wang, Haizhou Wang
Large Language Model-driven (LLM-driven) social bots pose a growing threat to online discourse by generating human-like content that evades conventional detection. Existing methods suffer from limited detection accuracy due to overreliance on single-modality signals, insufficient sensitivity to the specific generative patterns of Artificial Intelligence-Generated Content (AIGC), and a failure to adequately model the interplay between linguistic patterns and behavioral dynamics. To address these limitations, we propose TRACE-Bot, a unified dual-channel framework that jointly models implicit semantic representations and AIGC-enhanced behavioral patterns. TRACE-Bot constructs fine-grained representations from heterogeneous sources, including personal information data, interaction behavior data and tweet data. A dual-channel architecture captures linguistic representations via a pretrained language model and behavioral irregularities via multidimensional activity features augmented with signals from state-of-the-art (SOTA) AIGC detectors. The fused representations are then classified through a lightweight prediction head. Experiments on two public LLM-driven social bot datasets demonstrate SOTA performance, achieving accuracies of 98.46% and 97.50%, respectively. The results further indicate strong robustness against advanced bot strategies, highlighting the effectiveness of jointly leveraging implicit semantic representations and AIGC-enhanced behavioral patterns for emerging LLM-driven social bot detection.
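Summary: Social bots driven by Large Language Models (LLM-driven bots) pose a growing threat to online discourse because they generate human-like content that evades conventional detection. Existing methods achieve limited accuracy because they rely too heavily on single-modality signals, are not sensitive enough to the specific generative patterns of Artificial Intelligence-Generated Content (AIGC), and do not adequately model the interplay between linguistic patterns and behavioral dynamics. To address these limitations, the authors propose TRACE-Bot, a unified dual-channel framework that jointly models implicit semantic representations and AIGC-enhanced behavioral patterns. TRACE-Bot builds fine-grained representations from heterogeneous sources, including personal information, interaction behavior, and tweet data. One channel captures linguistic representations with a pretrained language model; the other captures behavioral irregularities with multidimensional activity features augmented by signals from state-of-the-art (SOTA) AIGC detectors. The fused representations are then classified by a lightweight prediction head. Experiments on two public LLM-driven social bot datasets show SOTA performance, with accuracies of 98.46% and 97.50%, respectively. The results also indicate strong robustness against advanced bot strategies, underlining the value of combining implicit semantic representations with AIGC-enhanced behavioral patterns for detecting emerging LLM-driven social bots.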
MTI: A Behavior-Based Temperament Profiling System for AI Agents
2604.02145v1 by Jihoon Jeong
AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed -- Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox reveals that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.
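Summary: AI models of equivalent capability can show fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from the actual behavior of LLMs) or treat behavioral variation as a defect rather than a trait.
The authors introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament along four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do rather than what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition.
They profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed: Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox shows that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but also by creating within-axis facet differentiation that is absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.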
GaelEval: Benchmarking LLM Performance for Scottish Gaelic
2604.02135v1 by Peter Devine, William Lamb, Beatrice Alex, Ignatius Ezeani, Dawn Knight, Mícheál J. Ó Meachair, Paul Rayson, Martin Wynne
Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), we find that Gemini 3 Pro Preview achieves $83.3\%$ accuracy on the linguistic task, surpassing the human baseline ($78.1\%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+$2.4\%$). On the cultural task, leading models exceed $90\%$ accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.
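Summary: Multilingual large language models (LLMs) often show emergent "shadow" capabilities in languages without official support, but their performance on these languages remains uneven and under-measured. The problem is especially acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. The authors introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic multiple-choice (MCQA) task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), they find that Gemini 3 Pro Preview reaches $83.3\%$ accuracy on the linguistic task, above the human baseline ($78.1\%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting gives a small but stable advantage (+$2.4\%$). On the cultural task, leading models exceed $90\%$ accuracy, although most systems do worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval shows that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting, and reveals a consistent performance gap in favor of proprietary over open-weight models.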
Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization
2604.02131v1 by Heet Nagoriya, Komal Rohit
Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory (LSTM) networks, are effective for predicting workload patterns at a higher level, but they can introduce delays during sudden traffic spikes. In contrast, mathematical heuristics like Game Theory provide fast and reliable scheduling decisions, but they do not account for future workload changes. To address this trade-off, this paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation. The results show that this approach reduces infrastructure costs close to ML-based models while maintaining fast response times similar to heuristic methods. This work presents a practical approach for improving cost efficiency in cloud resource management.
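Summary: Cloud computing allows scalable resource provisioning, but dynamic workload changes often raise costs through over-provisioning.
Machine learning (ML) approaches such as Long Short-Term Memory (LSTM) networks predict workload patterns well at a higher level, but they can introduce delays during sudden traffic spikes.
In contrast, mathematical heuristics such as Game Theory give fast and reliable scheduling decisions, but they do not account for future workload changes.
To address this trade-off, the paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation.
The results show that the approach brings infrastructure costs close to those of ML-based models while keeping response times as fast as those of heuristic methods.
The work presents a practical approach to improving cost efficiency in cloud resource management.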
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
2604.02128v1 by Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage
AI-native 6G networks promise to transform the telecom industry by enabling dynamic resource allocation, predictive maintenance, and ultra-reliable low-latency communications across all layers, which are essential for applications such as smart cities, autonomous vehicles, and immersive XR. However, the deployment of 6G systems results in severe data scarcity, hindering the training of efficient AI models. Synthetic data generation is extensively used to fill this gap; however, it introduces challenges related to dataset bias, auditability, and compliance with regulatory frameworks. In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL) feedback system. The ERCD integrates fairness, bias detection, and standardized audit trails for regulatory mapping, while the FL enables privacy-preserving calibration using aggregated insights from real testbeds to close the reality-simulation gap. Results show that the SEAL framework outperforms existing methods in terms of Frechet Inception Distance, equalized odds, and accuracy. These results validate the framework's ability to generate auditable and bias-mitigated synthetic data for responsible AI-native 6G development.
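Summary: AI-native 6G networks promise to transform the telecom industry by enabling dynamic resource allocation, predictive maintenance, and ultra-reliable low-latency communication across all layers, which are essential for applications such as smart cities, autonomous vehicles, and immersive XR.
However, deploying 6G systems leads to severe data scarcity, which hinders the training of efficient AI models.
Synthetic data generation is widely used to fill this gap, but it raises challenges of dataset bias, auditability, and compliance with regulatory frameworks.
The authors therefore propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL) feedback system.
ERCD integrates fairness, bias detection, and standardized audit trails for regulatory mapping, while FL enables privacy-preserving calibration using aggregated insights from real testbeds to close the reality-simulation gap.
Results show that SEAL outperforms existing methods in Frechet Inception Distance, equalized odds, and accuracy.
These results validate the framework's ability to generate auditable, bias-mitigated synthetic data for responsible AI-native 6G development.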
LLM-as-a-Judge for Time Series Explanations
2604.02118v1 by Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar
Evaluating factual correctness of LLM generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free form textual reasoning. Thus, no general purpose method exists to directly verify whether an explanation is faithful to underlying time series data without predefined references or task specific rules. We study large language models as both generators and evaluators of time series explanations in a reference free setting, where given a time series, question, and candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi anomaly detection. Results show a clear asymmetry: generation is highly pattern dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift, to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate feasibility of data grounded LLM based evaluation for time series explanations and highlight their potential as reliable evaluators of data grounded reasoning in the time series domain.
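Summary: Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Modern models can produce textual interpretations of numerical signals, but existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models need ground-truth explanations, while traditional time series methods work purely on numerical values and cannot assess free-form textual reasoning. As a result, there is no general-purpose way to verify directly whether an explanation is faithful to the underlying time series without predefined references or task-specific rules. The authors study large language models as both generators and evaluators of time series explanations in a reference-free setting: given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, which enables principled scoring and comparison. To support this, they build a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. Models are evaluated on four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. The results show a clear asymmetry. Generation is highly pattern dependent and fails systematically on some query types, with accuracies from 0.00 to 0.12 for Seasonal Drop and Volatility Shift and from 0.94 to 0.96 for Structural Break. Evaluation is more stable: models rank and score explanations correctly even when their own outputs are wrong. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation of time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.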
Reliable Control-Point Selection for Steering Reasoning in Large Language Models
2604.02113v1 by Haomin Zhuang, Hojun Yoo, Xiaonan Luo, Kehan Guo, Xiangliang Zhang
Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors -- such as self-reflection -- emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3\% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at https://github.com/zhmzm/stability-steering.
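Summary: Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but building effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled through prompts, this is straightforward. Many reasoning behaviors, such as self-reflection, emerge spontaneously and resist prompt-level control. Current methods detect them by keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. The authors show that this assumption is overwhelmingly wrong: of 541 keyword-detected boundaries, 93.3% are behaviorally unstable and fail to reproduce the detected behavior when the model re-generates from the same prefix. They develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, they propose stability filtering, which keeps only the boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, the method reaches 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models of the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at https://github.com/zhmzm/stability-steering.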
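Stability filtering, as summarized above, keeps only the boundaries at which the behavior reappears under re-generation from the same prefix. The sketch below is an illustrative reading of that idea: the sample count, the 0.75 threshold, and the `shows_behavior` callable (which stands in for sampling a continuation and detecting the behavior) are assumptions and are not taken from the paper or its repository.

```python
def reproduction_rate(prefix, shows_behavior, n_samples=8):
    """Share of re-generations from `prefix` that exhibit the target behavior.

    `shows_behavior(prefix, i)` is a hypothetical callable that samples the
    i-th continuation of `prefix` and returns True if the behavior appears.
    """
    hits = sum(1 for i in range(n_samples) if shows_behavior(prefix, i))
    return hits / n_samples


def stability_filter(prefixes, shows_behavior, threshold=0.75, n_samples=8):
    """Keep only prefixes whose reproduction rate reaches the threshold."""
    return [
        p for p in prefixes
        if reproduction_rate(p, shows_behavior, n_samples) >= threshold
    ]


# Deterministic stand-in for a model: the number of samples (out of 8) in
# which each prefix reproduces the behavior. Values are invented.
fake_hits = {"prefix_a": 7, "prefix_b": 1}
stub = lambda prefix, i: i < fake_hits[prefix]

stable = stability_filter(list(fake_hits), stub)
print(stable)  # ['prefix_a']
```

Only the retained prefixes would then contribute hidden states to the steering vector, which is how the filter avoids diluting the signal with unstable boundaries.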
Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
2604.02102v1 by Haitong Sun, Stephen McIntosh, Kwanghee Choi, Eunjung Yeo, Daisuke Saito, Nobuaki Minematsu
Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. Also, we build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making it practical for low-resource settings.
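Summary: Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been measured directly. The ABX discrimination task has been used to measure phonemic contrast in S3M representations through minimal pairs. The authors introduce prosodic ABX, an extension of this framework that evaluates prosodic contrast with only a handful of examples and no explicit labels. They also build and release a dataset of English and Japanese minimal pairs and use it, together with a Mandarin dataset, to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, they show that model and layer rankings are often preserved across several experimental conditions, which makes the method practical in low-resource settings.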
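The ABX discrimination task mentioned above scores a triple (A, B, X) as correct when X, which shares its category with A, lies closer to A than to B. The sketch below assumes each item has already been reduced to a single fixed-length vector and uses cosine distance; both choices are simplifications made for illustration, and the paper's actual distance and pooling are not stated in the abstract.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def abx_accuracy(triples, distance=cosine_distance):
    """Fraction of (A, B, X) triples where X is closer to A than to B.

    X belongs to the same category as A; ties count as half correct.
    """
    score = 0.0
    for a, b, x in triples:
        d_ax, d_bx = distance(a, x), distance(b, x)
        if d_ax < d_bx:
            score += 1.0
        elif d_ax == d_bx:
            score += 0.5
    return score / len(triples)


# Toy 2-D "representations" (invented): the first triple is discriminated
# correctly, the second is not.
triples = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0], [0.2, 0.8]),
]
print(abx_accuracy(triples))  # 0.5
```

An accuracy of 0.5 corresponds to chance, so a representation that encodes the contrast should score well above it.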
Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
2604.02091v1 by Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia
Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.
摘要:Rerankers 在精煉檢索結果以進行檢索增強生成中扮演著關鍵角色。
然而,目前的重新排序模型通常是在靜態的人類標註相關性標籤上進行優化,與下游生成過程脫節。
這種脫節導致了根本的不一致:信息檢索指標所識別的主題相關文件,往往無法提供 LLM 進行精確回答生成所需的實際效用。
為了彌補這一差距,我們引入了 ReRanking Preference Optimization (RRPO),這是一個強化學習框架,直接將重新排序與 LLM 的生成質量對齊。
通過將重新排序表述為一個序列決策過程,RRPO 利用 LLM 反饋優化上下文效用,從而消除了對昂貴的人類標註的需求。
為了確保訓練的穩定性,我們進一步引入了一個參考錨定的確定性基線。
在知識密集型基準上的大量實驗表明,RRPO 顯著超越了強大的基線,包括強大的列表式重新排序器 RankZephyr。
進一步的分析突顯了我們框架的多樣性:它能無縫地推廣到各種讀者(例如,GPT-4o),與查詢擴展模塊(如 Query2Doc)正交整合,即使在用噪聲監督訓練時也保持穩健。
Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection
2604.02071v1 by Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi
Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
摘要:人類-物體互動(HOI)檢測旨在從單一圖像中定位人類-物體對並分類其互動,這是一項需要強大視覺理解和細緻上下文推理的任務。最近的方法利用視覺-語言模型(VLMs)引入語義先驗,顯著提高了HOI檢測的性能。然而,現有方法往往未能充分利用分佈在整個場景中的多樣上下文線索。為了克服這些限制,我們提出了以實例為中心的上下文挖掘網絡(InCoM-Net)——一個新穎的框架,有效地將從VLM中提取的豐富語義知識與物體檢測器產生的實例特徵整合。這一設計通過建模不僅在每個檢測到的實例內部,還在實例之間及其周圍場景上下文中的關係,實現了更深入的互動推理。InCoM-Net包含兩個核心組件:以實例為中心的上下文精煉(ICR),它分別從VLM衍生的特徵中提取實例內、實例間和全局上下文線索,以及漸進式上下文聚合(ProCA),它迭代地將這些多上下文特徵與實例級檢測器特徵融合,以支持高級HOI推理。在HICO-DET和V-COCO基準上的大量實驗表明,InCoM-Net達到了最先進的性能,超越了之前的HOI檢測方法。代碼可在 https://github.com/nowuss/InCoM-Net 獲得。
Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
2604.02051v1 by Jaber Jaber, Osama Jaber
Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros
摘要:遞歸Transformer在多個深度步驟中重用共享權重區塊,以計算換取參數。其核心限制在於:每一步都應用相同的變換,這阻止了模型在深度上組合不同的操作。我們提出了Ouroboros,一個將緊湊的控制器超網絡附加到遞歸Transformer區塊的系統。控制器觀察當前的隱藏狀態,生成每步的對角調制向量,並將其應用於凍結的SVD初始化LoRA基礎,使每次遞歸步驟依賴於輸入。我們將這與門控遞歸(偏置初始化為88%保留)和每步的LayerNorm結合,以實現穩定的深度迭代。在Qwen2.5-3B上分成Prelude/Recurrent/Coda架構(保留36層中的17層),Ouroboros將訓練損失降低了43.4%,相較於未修改的17層基線,恢復了由層移除造成的51.3%的性能差距。整個系統僅增加了9.2M可訓練參數(控制器、門和每步的規範),但在深度1時的損失點比同等大小的靜態每步LoRA高出1.44點,並在所有測試深度(1、4、8、16)和排名(8、32、64)中保持領先。我們還發現門控遞歸是必不可少的:沒有它,遞歸層的應用會使模型變得更糟。這些增益是在訓練分佈上測量的;在保留的文本上,控制器尚未改善基線,我們將這一限制歸因於凍結的下游層,並進行詳細討論。代碼:https://github.com/RightNow-AI/ouroboros
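Two ingredients named in the Ouroboros abstract, a diagonal modulation applied between frozen LoRA bases and a gate whose bias is initialized for 88% retention, can be illustrated with a small sketch. Everything here is an assumption made for illustration rather than the released implementation: the gate is taken to be a scalar sigmoid, "retention" is read as the weight placed on the previous hidden state, and the helper names are hypothetical.

```python
import math


def logit(p):
    """Inverse sigmoid: the bias value whose sigmoid equals p."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def modulated_lora(x, lora_a, lora_b, modulation):
    """Compute B @ diag(modulation) @ A @ x with frozen bases A (r x d) and B (d x r).

    `modulation` stands in for the per-step vector a controller would emit.
    """
    z = matvec(lora_a, x)
    z = [m * zi for m, zi in zip(modulation, z)]
    return matvec(lora_b, z)


def gated_step(h, candidate, gate_bias):
    """Blend the previous state h with a candidate update via a scalar sigmoid gate."""
    g = sigmoid(gate_bias)
    return [g * hi + (1.0 - g) * ci for hi, ci in zip(h, candidate)]


# Bias that yields 88% retention under the assumed sigmoid gate (about 1.99).
GATE_BIAS = logit(0.88)

identity = [[1, 0], [0, 1]]
delta = modulated_lora([3.0, 4.0], identity, identity, [2.0, 0.0])
h_next = gated_step([1.0], [2.0], GATE_BIAS)
```

Under these assumptions, a fresh gate keeps 88% of the incoming state and admits 12% of the new transformation. The modulation vector is the only input-dependent term, because the bases `lora_a` and `lora_b` stay fixed.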
Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
2604.02047v1 by Tao Jin, Phuong Minh Nguyen, Naoya Inoue
Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
摘要:推測解碼通過草擬多個候選標記並在單次前向傳遞中驗證它們,來加速大型語言模型的推理。候選者被組織成一棵樹:更深的樹每步接受更多標記,但增加深度需要在固定的驗證預算下犧牲廣度(備選方案)。現有的無訓練方法從單一標記來源草擬並塑造它們的樹,而不區分來自不同來源的候選者質量。我們觀察到兩個常見的無訓練標記來源 - 從輸入上下文複製的 n-gram 匹配和來自先前前向傳遞的統計預測 - 在接受率上有顯著差異(中位數差距約為 6 倍,五個模型和五個基準的範圍為 2-18 倍)。我們證明當存在這樣的質量差距時,最佳樹是各向異性的(不對稱的):可靠的標記應形成一條深鏈,而不可靠的標記則展開為寬枝,突破平衡樹的深度限制。我們在 GOOSE 中實現了這一結構,這是一個無訓練框架,構建了一個自適應脊樹 - 一條由高接受率的上下文匹配標記組成的深鏈,並在每個節點上有低接受率的替代方案的寬枝。我們證明每步接受的標記數量至少與單獨使用的任一來源一樣多。在五個 LLM(7B-33B)和五個基準上,GOOSE 實現了 1.9-4.3 倍的無損加速,在相同預算下超越了平衡樹基準 12-33%。
BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
2604.02045v1 by Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef, Céline Hudelot, Pierre Colombo
Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
摘要:將因果生成語言模型轉變為雙向編碼器提供了一種強大的替代方案,取代了BERT風格的架構。
然而,目前的方法仍然有限:它們對最佳訓練目標缺乏共識,面臨大規模的災難性遺忘,並且無法靈活整合龐大的專門生成模型生態系統。
在本研究中,通過對Gemma3和Qwen3系列的系統性消融,我們確定了驅動成功適應的關鍵因素,突顯了經常被忽略的先前遮罩階段的關鍵作用。
為了在沒有原始預訓練數據的情況下擴展此過程,我們引入了一種雙重策略,結合了線性權重合併和輕量級多領域數據混合,以減輕災難性遺忘。
最後,我們通過將編碼器與專門的因果模型合併來增強我們的編碼器,無縫地轉移模態和領域特定的能力。
這個開源配方旨在用於任何因果解碼器LLM,產生了BidirLM,一個五個編碼器的系列,在文本、視覺和音頻表示基準上超越了其他替代方案。
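The linear weight merging mentioned in the BidirLM abstract can be pictured as a per-parameter interpolation between two checkpoints. The sketch below is a generic illustration under that assumption. The coefficient `alpha`, the dictionary layout, and the function name `merge_linear` are hypothetical, and the abstract does not state which coefficients the authors use.

```python
def merge_linear(weights_a, weights_b, alpha=0.5):
    """Return (1 - alpha) * weights_a + alpha * weights_b, parameter by parameter.

    Both inputs map parameter names to flat lists of floats and must share
    the same names and lengths (i.e. the same architecture).
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    merged = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return merged


# Toy checkpoints (assumed values, for illustration only).
encoder = {"layer.weight": [0.0, 2.0]}
specialist = {"layer.weight": [4.0, 2.0]}
merged = merge_linear(encoder, specialist, alpha=0.25)
```

Setting `alpha` to 0 returns the first checkpoint unchanged and setting it to 1 returns the second, so the coefficient directly controls how much of the specialized model is carried over.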
Tracking the emergence of linguistic structure in self-supervised models learning from speech
2604.02043v1 by Marianne de Heer Kloots, Martijn Bentum, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema
Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
摘要:自我監督的語音模型學習有效的口語語言表示,這已被證明反映了語言結構的各個方面。那麼,這種結構在模型訓練中何時出現?我們研究了六個在荷蘭語口語上訓練的 Wav2Vec2 和 HuBERT 模型的不同層級和中間檢查點中各種語言結構的編碼。我們發現,不同層級的語言結構顯示出顯著不同的層級模式和學習軌跡,這在某種程度上可以通過它們與聲音信號的抽象程度以及從輸入中整合信息的時間尺度的差異來解釋。此外,我們發現預訓練目標的定義層級對語言結構的層級組織和學習軌跡有強烈影響,更高階的預測任務(即反覆精煉的偽標籤)會引起更大的平行性。
AI in Insurance: Adaptive Questionnaires for Improved Risk Profiling
2604.02034v1 by Diogo Silva, João Teixeira, Bruno Lima
Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. Moreover, insurers must blindly trust users' responses, increasing the chances of fraud. The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. A life insurance system integrated into an industry partner mobile app was tested in two experiments. While traditional questionnaires yielded slightly higher accuracy in risk assessment, adaptive versions powered by GPT models required fewer questions and were preferred by users for their more fluid and engaging experience. ARQuest shows great potential to improve user satisfaction and streamline insurance processes. With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry.
摘要:保險申請過程通常依賴冗長且標準化的問卷,這些問卷難以捕捉個體差異。此外,保險公司必須盲目信任用戶的回答,這增加了詐騙的機會。ARQuest 框架通過使用大型語言模型(LLMs)和替代數據來源來創建個性化和自適應的問卷,為承保引入了一種新方法。使用社交媒體圖像分析、地理數據分類和檢索增強生成(RAG)等技術來提取有意義的用戶見解並指導針對性的後續問題。
一個集成在行業合作夥伴移動應用中的人壽保險系統在兩個實驗中進行了測試。雖然傳統問卷在風險評估中產生的準確性略高,但由 GPT 模型驅動的自適應版本所需的問題較少,並且用戶更喜歡其更流暢和引人入勝的體驗。
ARQuest 展示了改善用戶滿意度和簡化保險流程的巨大潛力。隨著進一步的發展,這種方法可能在風險準確性方面超越傳統方法,並幫助推動保險行業的創新。
Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data
2604.02031v1 by Alejandro Castañeda Garcia, Jan van Gemert, Daan Brinks, Nergis Tömen
Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and causing blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions and on three uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of data representation within a batch and of emphasizing rare samples in unsupervised image reconstruction. We will make all code and related data available.
摘要:自編碼器在圖像內容的空間不均勻取樣方面面臨挑戰。這在醫學影像、生物學和物理學中很常見,因為在特定的圖像坐標上,資訊性模式很少出現,背景在大多數樣本中主導這些位置,導致重建偏向於主要外觀。實際上,自編碼器對主導模式存在偏見,導致細節損失,並在稀有空間輸入下造成模糊的重建,特別是在空間數據不平衡的情況下。我們通過兩個互補的組件來解決空間不平衡問題:(i) 基於自熵的損失,對統計上不常見的空間位置進行加權,以及 (ii) 樣本傳播,一種重播機制,在訓練過程中選擇性地重新暴露模型於難以重建的樣本。 我們在無監督重建環境中基準測試了原本為監督分類開發的現有數據平衡策略。基於這些方法的局限性,我們的方法專門針對空間不平衡,鼓勵模型專注於統計上稀有的位置,與現有基準相比,提高重建的一致性。我們在具有控制空間不平衡條件的模擬數據集以及三個不受控的多樣化真實世界數據集中進行驗證,這些數據集涵蓋物理、生物和天文領域。我們的方法在各種重建指標上超越了基準,特別是在空間不平衡分佈下。這些結果突顯了批次中數據表示的重要性,並強調了無監督圖像重建中稀有樣本的價值。我們將提供所有代碼和相關數據。
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
2604.02029v1 by Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan
Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.
摘要:潛在空間正迅速成為基於語言的模型的原生基底。雖然現代系統仍然通常通過明確的標記級生成來理解,但越來越多的研究顯示,許多關鍵的內部過程在連續的潛在空間中比在人類可讀的語言痕跡中更自然地進行。這一轉變是由於明確空間計算的結構限制,包括語言冗餘、離散化瓶頸、序列效率低下和語義損失。這項調查旨在提供基於語言的模型中潛在空間的統一和最新的全景。我們將調查分為五個連續的視角:基礎、演變、機制、能力和展望。我們首先界定潛在空間的範疇,將其與明確或語言空間以及在生成視覺模型中常見的潛在空間區分開來。然後,我們追溯該領域從早期探索性努力到當前大規模擴展的演變。為了組織技術景觀,我們通過機制和能力的互補視角來檢視現有的工作。在機制的視角下,我們確定了四個主要的發展方向:架構、表示、計算和優化。在能力的視角下,我們展示了潛在空間如何支持跨越推理、規劃、建模、感知、記憶、協作和具身等廣泛能力範疇。除了整合之外,我們還討論了關鍵的未解挑戰,並概述了未來研究的有前景方向。我們希望這項調查不僅能作為現有工作的參考,還能作為理解潛在空間作為下一代智能的一般計算和系統範式的基礎。
Why Gaussian Diffusion Models Fail on Discrete Data?
2604.02028v1 by Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, Dmitry Vetrov
Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
摘要:擴散模型已成為連續領域生成建模的標準方法,但將其應用於離散數據仍然具有挑戰性。
我們調查為什麼使用 DDPM 解算器的高斯擴散模型在從表示為連續空間中德爾塔分佈混合的離散分佈中進行取樣時會遇到困難。
利用一個玩具隨機層次模型,我們確定了一個關鍵取樣區間,在該區間中,噪聲數據的密度變得多模態。
在這種情況下,DDPM 偶爾會進入模式之間的低密度區域,產生模型的分佈外輸入,並降低樣本質量。
我們展示了現有的啟發式方法,包括自我條件化和我們稱之為 q-取樣的解算器,有助於緩解此問題。
此外,我們證明將自我條件化與在關鍵區間內從 DDPM 切換到 q-取樣相結合,可以改善真實數據的生成質量。
我們在多個領域的條件和無條件任務中驗證了這些發現,包括文本、程式碼和蛋白質。
ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
2604.02022v1 by Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu
Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
摘要:評估基於 LLM 的代理的安全性變得越來越重要,因為在現實部署中,風險往往是在多步互動中出現,而不是孤立的提示或最終回應。現有的軌跡級基準受到互動多樣性不足、安全失敗的粗略可觀察性以及長期現實性較弱的限制。我們引入了 ATBench,一個用於結構化、多樣化和現實評估代理安全性的軌跡級基準。ATBench 從三個維度組織代理風險:風險來源、失敗模式和現實世界的傷害。基於這一分類法,我們構建了具有異質工具池和長上下文延遲觸發協議的軌跡,捕捉多個階段中現實風險的出現。該基準包含 1,000 條軌跡(503 條安全和 497 條不安全),平均 9.01 次回合和 3.95k 令牌,調用的工具數量為 1,954,來自 2,084 個可用工具的池中。數據質量由基於規則和基於 LLM 的過濾以及全面的人類審核支持。對前沿 LLM、開源模型和專門的防護系統的實驗表明,ATBench 對於強大的評估者來說也是具有挑戰性的,同時能夠實現分類法分層分析、跨基準比較和長期失敗模式的診斷。
Optimizing Interventions for Agent-Based Infectious Disease Simulations
2604.02016v1 by Anja Wolpers, Johannes Ponge, Adelinde M. Uhrmacher
Non-pharmaceutical interventions (NPIs) are commonly used tools for controlling infectious disease transmission when pharmaceutical options are unavailable. Yet, identifying effective interventions that minimize societal disruption remains challenging. Agent-based simulation is a popular tool for analyzing the impact of possible interventions in epidemiology. However, automatically optimizing NPIs using agent-based simulations poses a complex problem because, in agent-based epidemiological models, interventions can target individuals based on multiple attributes, affect hierarchical group structures (e.g., schools, workplaces, and families), and be combined arbitrarily, resulting in a very large or even infinite search space. We aim to support decision-makers with our Agent-based Infectious Disease Intervention Optimization System (ADIOS) that optimizes NPIs for infectious disease simulations using Grammar-Guided Genetic Programming (GGGP). The core of ADIOS is a domain-specific language for expressing NPIs in agent-based simulations that structures the intervention search space through a context-free grammar. To make optimization more efficient, the search space can be further reduced by defining constraints that prevent the generation of semantically invalid intervention patterns. Using this constrained language and an interface that enables coupling with agent-based simulations, ADIOS adopts the GGGP approach for simulation-based optimization. Using the German Epidemic Micro-Simulation System (GEMS) as a case study, we demonstrate the potential of our approach to generate optimal interventions for realistic epidemiological models.
摘要:非藥物干預措施(NPIs)是控制傳染病傳播的常用工具,尤其在藥物選擇不可用時。然而,識別能夠最小化社會干擾的有效干預措施仍然是一項挑戰。基於代理的模擬是一種流行的工具,用於分析可能干預措施在流行病學中的影響。然而,使用基於代理的模擬自動優化NPIs是一個複雜的問題,因為在基於代理的流行病學模型中,干預措施可以根據多個屬性針對個體,影響層級群體結構(例如,學校、工作場所和家庭),並且可以任意組合,這導致了非常大甚至無限的搜索空間。我們的目標是通過我們的基於代理的傳染病干預優化系統(ADIOS)來支持決策者,該系統使用語法引導的遺傳編程(GGGP)優化傳染病模擬中的NPIs。ADIOS的核心是一種特定領域語言,用於在基於代理的模擬中表達NPIs,通過上下文無關文法結構化干預搜索空間。為了提高優化效率,可以通過定義約束來進一步減少搜索空間,以防止生成語義上無效的干預模式。利用這種受約束的語言和一個使其能夠與基於代理的模擬耦合的介面,ADIOS採用了基於模擬的優化的GGGP方法。以德國流行病微模擬系統(GEMS)為案例,我們展示了我們的方法在生成現實流行病學模型的最佳干預措施方面的潛力。
$k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection
2604.02008v1 by Kahim Wong, Kemou Li, Haiwei Wu, Jiantao Zhou
LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the $k$-nearest neighbor proxy ($k$NNProxy), a training-free and query-efficient proxy alignment framework that repurposes the $k$NN language model ($k$NN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs. To improve robustness under domain shift, we extend $k$NNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.
摘要:LLM 生成文本 (LGT) 偵測對於可靠的法醫分析和減少 LLM 誤用至關重要。現有的 LGT 偵測器通常可以分為兩大類:基於學習的方法和零-shot 方法。與基於學習的偵測器相比,零-shot 方法特別有前景,因為它們消除了訓練特定任務分類器的需要。然而,零-shot 方法的可靠性基本上依賴於一個現成的代理 LLM 與通常未知的源 LLM 之間的良好對齊,這一前提在現實世界的黑箱場景中很少成立。為了解決這一差異,現有的代理對齊方法通常依賴於對代理的監督微調或與商業 API 的重複互動,從而增加了部署成本,使偵測器面臨靜默 API 變更的風險,並限制了在領域轉移下的穩健性。受到這些限制的啟發,我們提出了 $k$-最近鄰代理 ($k$NNProxy),這是一個無需訓練且查詢效率高的代理對齊框架,將 $k$NN 語言模型 ($k$NN-LM) 檢索機制重新用作固定代理 LLM 的領域適配器。具體而言,從目標反映的 LGT 語料庫中構建一個輕量級數據存儲,無論是通過固定預算查詢還是從現有數據集獲得。在推理過程中,最近鄰證據會產生一個基於標記的預測分佈,該分佈與代理輸出進行插值,從而產生對齊的預測,而無需對代理進行微調或每個標記的 API 輸出。為了提高在領域轉移下的穩健性,我們將 $k$NNProxy 擴展為一個代理混合體 (MoP),將每個輸入路由到特定領域的數據存儲以進行領域一致的檢索。廣泛的實驗表明我們的方法具有強大的偵測性能。
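The interpolation step described in the $k$NNProxy abstract follows the standard $k$NN-LM recipe: retrieved neighbors vote for their stored next token with a weight that decays with distance, and the resulting distribution is mixed with the proxy model's own prediction. The sketch below assumes the neighbors and proxy probabilities are already available. The mixing weight `lam`, the `temperature`, and the function names are illustrative assumptions, not values or APIs from the paper.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (next_token, distance) pairs into a token distribution.

    Each neighbor contributes exp(-distance / temperature) to its token,
    and the scores are then normalized to sum to one.
    """
    scores = defaultdict(float)
    for token, distance in neighbors:
        scores[token] += math.exp(-distance / temperature)
    total = sum(scores.values())
    return {token: score / total for token, score in scores.items()}


def interpolate(p_proxy, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_proxy."""
    vocab = set(p_proxy) | set(p_knn)
    return {
        token: lam * p_knn.get(token, 0.0) + (1.0 - lam) * p_proxy.get(token, 0.0)
        for token in vocab
    }


# Toy example: both retrieved neighbors were followed by token "a".
p_knn = knn_distribution([("a", 0.0), ("a", 0.0)])
p_proxy = {"a": 0.2, "b": 0.8}
p_aligned = interpolate(p_proxy, p_knn, lam=0.5)
```

Here the retrieval evidence shifts probability mass toward the token the datastore suggests, which is the sense in which the proxy is adapted without any fine-tuning.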
ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning
2604.02006v1 by Jingyue Gao, Yanjiang Guo, Xiaoshuai Chen, Jianyu Chen
Reinforcement Learning (RL) significantly enhances the reasoning abilities of large language models (LLMs), yet applying it to multi-turn agentic tasks remains challenging due to the long-horizon nature of interactions and the stochasticity of environmental feedback. We identify a structural failure mode in agentic exploration: suboptimal actions elicit noisy observations that produce misleading contexts, which further weaken subsequent decision-making and make recovery increasingly difficult. This cumulative feedback loop of errors renders standard exploration strategies ineffective and susceptible to the model's reasoning and the environment's randomness. To mitigate this issue, we propose ProCeedRL: Process Critic with Exploratory Demonstration RL, shifting exploration from passive selection to active intervention. ProCeedRL employs a process-level critic to monitor interactions in real time, incorporating reflection-based demonstrations to guide agents in stopping the accumulation of errors. We find that this approach significantly exceeds the model's saturated exploration performance, demonstrating substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL significantly improves exploration efficiency and achieves superior performance on complex deep search and embodied tasks.
摘要:強化學習(RL)顯著提升了大型語言模型(LLMs)的推理能力,然而,由於互動的長期性質和環境反饋的隨機性,將其應用於多回合的代理任務仍然具有挑戰性。我們識別出代理探索中的一種結構性失敗模式:次優行為引發的噪聲觀察進入誤導性上下文,進一步削弱後續的決策,使得恢復變得越來越困難。這種錯誤的累積反饋循環使得標準探索策略無效且容易受到模型推理和環境隨機性的影響。為了減輕這一問題,我們提出了ProCeedRL:帶有探索性示範的過程評價強化學習,將探索從被動選擇轉變為主動干預。ProCeedRL使用過程級別的評價器實時監控互動,結合基於反思的示範來指導代理停止錯誤的累積。我們發現這種方法顯著超過了模型的飽和探索性能,展示了可觀的探索性收益。通過從探索性示範和政策樣本中學習,ProCeedRL顯著提高了探索效率,並在複雜的深度搜索和具身任務上達到了卓越的表現。
How and why does deep ensemble coupled with transfer learning increase performance in bipolar disorder and schizophrenia classification?
2604.02002v1 by Sara Petiton, Antoine Grigis, Benoit Dufumier, Edouard Duchesnay
Transfer learning (TL) and deep ensemble learning (DE) have recently been shown to outperform simple machine learning in classifying psychiatric disorders. However, there is still a lack of understanding as to why that is. This paper aims to understand how and why DE and TL reduce the variability of single-subject classification models in bipolar disorder (BD) and schizophrenia (SCZ). To this end, we investigated the training stability of TL and DE models. For the two classification tasks under consideration, we compared the results of multiple trainings with the same backbone but with different initializations. In this way, we take into account the epistemic uncertainty associated with the uncertainty in the estimation of the model parameters. It has been shown that the performance of classifiers can be significantly improved by using TL with DE. Based on these results, we investigate i) how many models are needed to benefit from the performance improvement of DE when classifying BD and SCZ from healthy controls, and ii) how TL induces better generalization, with and without DE. In the first case, we show that DE reaches a plateau when 10 models are included in the ensemble. In the second case, we find that using a pre-trained model constrains TL models with the same pre-training to stay in the same basin of the loss function. This is not the case for DL models with randomly initialized weights.
摘要:轉移學習(TL)和深度集成學習(DE)最近已被證明在分類精神疾病方面優於簡單的機器學習。然而,對於為什麼會這樣仍然缺乏理解。本論文旨在了解 DE 和 TL 如何以及為何減少雙相情感障礙(BD)和精神分裂症(SCZ)中單一受試者分類模型的變異性。為此,我們調查了 TL 和 DE 模型的訓練穩定性。對於考慮的兩個分類任務,我們比較了使用相同骨幹但不同初始化的多次訓練結果。這樣,我們考慮了與模型參數估計不確定性相關的認識不確定性。研究表明,通過使用 TL 與 DE,分類器的性能可以顯著提高。基於這些結果,我們調查 i) 在從健康對照中分類 BD 和 SCZ 時,需要多少模型才能受益於 DE 的性能提升,以及 ii) TL 如何在有和沒有 DE 的情況下促進更好的泛化。在第一種情況下,我們顯示當集成中包含 10 個模型時,DE 達到了平臺。在第二種情況下,我們發現使用預訓練模型限制了具有相同預訓練的 TL 模型保持在損失函數的同一盆地中。隨機初始化權重的 DL 模型則不是這樣。
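The deep ensemble setup discussed in the abstract, several models sharing a backbone but differing in initialization, is usually combined by averaging predicted probabilities per subject. The sketch below illustrates only that averaging step, under the assumption of a binary patient-versus-control task. The function name `ensemble_predict` and the 0.5 threshold are hypothetical choices, not details from the paper.

```python
def ensemble_predict(per_model_probs, threshold=0.5):
    """Average per-subject probabilities across ensemble members.

    `per_model_probs[m][i]` is model m's predicted probability that
    subject i belongs to the patient class. Returns the averaged
    probabilities and the thresholded labels.
    """
    n_models = len(per_model_probs)
    n_subjects = len(per_model_probs[0])
    averaged = [
        sum(model[i] for model in per_model_probs) / n_models
        for i in range(n_subjects)
    ]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels


# Three members, two subjects (assumed values, for illustration only).
member_outputs = [
    [0.9, 0.2],
    [0.7, 0.4],
    [0.8, 0.3],
]
averaged, labels = ensemble_predict(member_outputs)
```

Averaging reduces the variance that comes from any single initialization, which is the quantity the paper studies when it asks how many members are needed before the benefit plateaus.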
GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation
2604.01997v1 by Elisa Motta, Marta Lorenzini, Clara Mouawad, Alberto Ranavolo, Mariano Serrao, Arash Ajoudani
Gait analysis provides an objective characterization of locomotor function and is widely used to support diagnosis and rehabilitation monitoring across neurological and orthopedic disorders. Deep learning has been increasingly applied to this domain, yet most approaches rely on supervised classifiers trained on disease-labeled data, limiting generalization to heterogeneous pathological presentations. This work proposes a label-free framework for joint-level anomaly detection and kinematic correction based on a Transformer masked autoencoder trained exclusively on normative gait sequences from 150 adults, acquired with a markerless multi-camera motion-capture system. At inference, a two-pass procedure is applied to potentially pathological input sequences. First, it estimates joint inconsistency scores by occluding individual joints and measuring deviations from the learned normative prior. Then, it withholds the flagged joints from the encoder input and reconstructs the full skeleton from the remaining spatiotemporal context, yielding corrected kinematic trajectories at the flagged positions. Validation on 10 held-out normative participants, who mimicked seven simulated gait abnormalities, showed accurate localization of biomechanically inconsistent joints, a significant reduction in angular deviation across all analyzed joints with large effect sizes, and preservation of normative kinematics. The proposed approach enables interpretable, subject-specific localization of gait impairments without requiring disease labels. Video is available at https://youtu.be/Rcm3jqR5pN4.
摘要:步態分析提供了對運動功能的客觀描述,並廣泛用於支持神經和骨科疾病的診斷及康復監測。深度學習在這一領域的應用日益增多,但大多數方法依賴於在疾病標記數據上訓練的監督分類器,這限制了對異質病理表現的泛化。本文提出了一種無標籤的框架,用於基於僅在150名成年人標準步態序列上訓練的Transformer遮罩自編碼器進行關節級異常檢測和運動學修正,這些數據是通過無標記的多攝像頭運動捕捉系統獲得的。在推斷過程中,對潛在病理的輸入序列應用兩次通過的程序,首先通過遮蔽單個關節來估計關節不一致性得分,並測量與學習的標準先驗的偏差。然後,它從編碼器輸入中排除被標記的關節,並從剩餘的時空上下文重建完整的骨架,從而在被標記的位置產生修正的運動學軌跡。在10名被排除的標準參與者上進行的驗證顯示,這些參與者模擬了七種步態異常,準確定位了生物力學不一致的關節,所有分析關節的角度偏差顯著降低,且效果大小較大,同時保留了標準運動學。所提出的方法使得在不需要疾病標籤的情況下,能夠對步態障礙進行可解釋的、特定於個體的定位。視頻可在 https://youtu.be/Rcm3jqR5pN4 獲得。
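The first pass of the two-pass procedure, occluding one joint at a time and measuring how far the observed value lies from what the model reconstructs, can be sketched in a few lines. This is a deliberately simplified illustration: the toy reconstructor `mean_of_visible` stands in for the Transformer masked autoencoder, a single frame of scalar joint angles replaces full spatiotemporal sequences, and the threshold is an assumed value.

```python
def mean_of_visible(joint_angles, masked_index):
    """Toy stand-in for a learned normative model: predict the masked joint
    as the mean of the joints that remain visible."""
    visible = [v for i, v in enumerate(joint_angles) if i != masked_index]
    return sum(visible) / len(visible)


def joint_inconsistency_scores(joint_angles, reconstruct=mean_of_visible):
    """Occlude each joint in turn and score it by the absolute gap between
    its observed value and the reconstruction from the remaining joints."""
    return [
        abs(joint_angles[j] - reconstruct(joint_angles, j))
        for j in range(len(joint_angles))
    ]


def flag_joints(scores, threshold):
    """Return the indices of joints whose score exceeds the threshold."""
    return [j for j, score in enumerate(scores) if score > threshold]


# One frame of four joint angles; the last joint deviates from the others.
scores = joint_inconsistency_scores([1.0, 1.0, 1.0, 5.0])
flagged = flag_joints(scores, threshold=2.0)
```

In the second pass described in the abstract, the flagged joints would be withheld from the encoder input and replaced by the model's reconstruction, which plays the role of the corrected trajectory.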
SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
2604.01993v1 by Daeyong Kwon, Soyoung Yoon, Seung-won Hwang
Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
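Summary: Multi-hop QA benchmarks often reward large language models (LLMs) for superficially correct answers, masking ungrounded or flawed reasoning steps. To move toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. The framework operates in two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to remove noisy supervision from standard benchmarks, identifying up to 14% of instances as unanswerable; and (2) inference-time verification, where a feedback model trained on the verified dataset dynamically detects ungrounded steps in real time. Experiments show that SAFE not only exposes critical flaws in existing benchmarks at train time but also significantly outperforms standard baselines, with an average accuracy gain of 8.4 percentage points while guaranteeing verifiable trajectories at inference time.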
Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
2604.01989v1 by Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu
Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
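Summary: Like a body at rest that stays at rest, visual attention in multimodal large language models (MLLMs) shows pronounced inertia: once it settles during early decoding steps it stays largely static and cannot support the compositional understanding that cognitive inference requires. Existing hallucination mitigation methods mainly target perceptual hallucinations about object existence or attributes and remain inadequate for cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions stays persistently focused and fails to support relational inference dynamically. We therefore propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens that exhibit inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits how long attention persists within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.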
SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation
2604.01988v1 by Haomin Zhuang, Xiangqi Wang, Yili Shen, Ying Cheng, Xiangliang Zhang
Large language models often default to step-by-step computation even when efficient numerical shortcuts are available. This raises a basic question: do they exhibit number sense in a human-like behavioral sense, i.e., the ability to recognize numerical structure, apply shortcuts when appropriate, and avoid them when they are not? We introduce SenseMath, a controlled benchmark for evaluating structure-sensitive numerical reasoning in LLMs. SenseMath contains 4,800 items spanning eight shortcut categories and four digit scales, with matched strong-shortcut, weak-shortcut, and control variants. It supports three evaluation settings of increasing cognitive demand: Shortcut Use (whether models can apply shortcuts on shortcut-amenable problems); Applicability Judgment (whether they can recognize when a shortcut is appropriate or misleading); and Problem Generation (whether they can generate new problem items that correctly admit a given type of shortcut). Our evaluation across five LLMs, ranging from GPT-4o-mini to Llama-3.1-8B, shows a consistent pattern: when explicitly prompted, models readily adopt shortcut strategies and achieve substantial accuracy gains on shortcut-amenable items (up to 15%), yet under standard chain-of-thought prompting they spontaneously employ such strategies in fewer than 40% of cases, even when they demonstrably possess the requisite capability. Moreover, this competence is confined to the Use level; models systematically over-generalise shortcuts to problems where they do not apply, and fail to generate valid shortcut-bearing problems from scratch. Together, these results suggest that current LLMs exhibit procedural shortcut fluency without the structural understanding of when and why shortcuts work that underlies human number sense.
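Summary: Large language models often default to step-by-step computation even when efficient numerical shortcuts are available. This raises a basic question: do they show number sense in a human-like behavioral sense, that is, the ability to recognize numerical structure, apply shortcuts when appropriate, and avoid them when they are not? We introduce SenseMath, a controlled benchmark for evaluating structure-sensitive numerical reasoning in LLMs. SenseMath contains 4,800 items spanning eight shortcut categories and four digit scales, with matched strong-shortcut, weak-shortcut, and control variants. It supports three evaluation settings of increasing cognitive demand: Shortcut Use (can models apply shortcuts on shortcut-amenable problems?), Applicability Judgment (can they recognize when a shortcut is appropriate or misleading?), and Problem Generation (can they generate new items that correctly admit a given type of shortcut?). Our evaluation of five LLMs, ranging from GPT-4o-mini to Llama-3.1-8B, shows a consistent pattern: when explicitly prompted, models readily adopt shortcut strategies and gain substantial accuracy on shortcut-amenable items (up to 15%), yet under standard chain-of-thought prompting they spontaneously use such strategies in fewer than 40% of cases, even when they demonstrably have the required capability. Moreover, this competence is confined to the Use level: models systematically over-generalize shortcuts to problems where they do not apply and fail to generate valid shortcut-bearing problems from scratch. Together, these results suggest that current LLMs have procedural shortcut fluency but lack the structural understanding of when and why shortcuts work that underlies human number sense.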
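To make "numerical shortcut" concrete, here is one illustrative example of the kind of structure a shortcut exploits: multiplying by a near-round number via rounding and compensation. The abstract does not list the benchmark's eight categories, so this particular shortcut and the helper names `nearest_round` and `multiply_with_shortcut` are assumptions chosen for illustration, not items from SenseMath.

```python
def nearest_round(a):
    """Round a positive integer at its leading digit's place, e.g. 998 -> 1000, 52 -> 50."""
    magnitude = 10 ** (len(str(a)) - 1)
    return round(a / magnitude) * magnitude


def multiply_with_shortcut(a, b):
    """Return (round_part, correction, product), where a * b == round_part + correction.

    The shortcut rewrites a * b as anchor * b + (a - anchor) * b with a round anchor,
    which is only cheaper than long multiplication when a is close to the anchor.
    """
    anchor = nearest_round(a)
    round_part = anchor * b
    correction = (a - anchor) * b
    return round_part, correction, round_part + correction
```

For `multiply_with_shortcut(998, 47)` the parts are 47000 and -94, giving 46906. The identity holds for any positive integer `a`, but the shortcut is only worth using when the correction term is small, which is the kind of applicability judgment the benchmark probes.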
World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
2604.01985v1 by Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du
General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors -- state plausibility and action reachability -- and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
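Summary: General-purpose world models promise scalable policy evaluation, optimization, and planning, but reaching the required robustness remains challenging. Unlike policy learning, which focuses mainly on optimal actions, a world model must stay reliable over a much broader range of suboptimal actions, which action-labeled interaction data often covers poorly. To address this challenge, we propose World Action Verifier (WAV), a framework that lets world models identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors, state plausibility and action reachability, and verify each separately. We show that these verification problems can be much easier than predicting future states because of two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.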
RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale
2604.01977v1 by Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile, Zach Reavis, David Magnotti, Wayne Fullen
Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.
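Summary: Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to develop detection mechanisms by hand. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, which motivates automation. We present RuleForge, an AWS internal system that automatically generates detection rules (JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities) from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input to our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. The validation approach evaluates candidate rules along two dimensions, sensitivity (avoiding false negatives) and specificity (avoiding false positives), achieving an AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each), combined with continuous feedback loops, enables systematic quality improvement. We also present extensions that enable rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.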
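The two validation dimensions named in this entry have standard definitions, sketched below for a single candidate rule replayed against labelled HTTP requests. The function name and the boolean-list input format are assumptions for illustration; RuleForge's actual validation relies on an LLM judge whose interface the abstract does not describe.

```python
def sensitivity_specificity(rule_fired, is_malicious):
    """Compute (sensitivity, specificity) of one detection rule.

    rule_fired[i]   -- True if the rule matched request i
    is_malicious[i] -- True if request i really exploits the vulnerability
    """
    if len(rule_fired) != len(is_malicious):
        raise ValueError("inputs must have the same length")
    tp = fn = tn = fp = 0
    for fired, malicious in zip(rule_fired, is_malicious):
        if malicious and fired:
            tp += 1  # attack detected
        elif malicious:
            fn += 1  # attack missed (false negative)
        elif fired:
            fp += 1  # benign request flagged (false positive)
        else:
            tn += 1  # benign request ignored
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```

A rule that fires on two of three attacks and on one of three benign requests scores 2/3 on both dimensions. Tracking the two numbers separately matters because a rule can trivially maximize one by sacrificing the other.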
Ego-Grounding for Personalized Question-Answering in Egocentric Videos
2604.01966v1 by Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao
We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding, the ability to understand the camera wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs of every variant (open-source vs. proprietary, thinking vs. non-thinking, small vs. large) all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
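Summary: We present the first systematic analysis of multimodal large language models (MLLMs) on personalized question answering that requires ego-grounding, the ability to understand the camera wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate how well MLLMs understand, remember, and reason about the camera wearer. MyEgo contains 541 long videos and 5,000 personalized questions about "my things", "my activities", and "my past". Benchmarking shows that competitive MLLMs of all kinds, whether open-source or proprietary, thinking or non-thinking, small or large, struggle on MyEgo. The top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) reach only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling brings consistent improvements. Models improve when relevant evidence is explicitly provided, but the gains drop over time, revealing limitations in tracking and remembering "me" and "my past". Together, these findings highlight the crucial role of ego-grounding and long-range memory in personalized QA over egocentric videos. We hope MyEgo and our analyses will spur further progress on egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo.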
Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models
2604.01965v1 by Florian Kelber, Matthias Jobst, Yuni Susanti, Michael Färber
Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.
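Summary: Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. This dependence limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate how far carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and uses compact instruction-tuned language models to generate responses with citations. We evaluate the framework on several scholarly tasks, focusing on scholarly question answering (QA) in single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings show that retrieval and model scale are complementary rather than interchangeable. Retrieval design can partially compensate for smaller models, but model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors in building practical and reproducible scholarly assistants.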
Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia
2604.01962v1 by Saja Al-Dabet, Sherzod Turaev, Nazar Zaki
Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p less than 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).
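Summary: Abnormal head movements (AHMs) appear across a broad spectrum of neurological disorders; however, the lack of a multi-condition resource that integrates kinematic measurements, clinical severity scores, and patient demographics is a persistent barrier to developing AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs built with a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification reaching strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. Task 3 evaluates the clinical relevance of this index by validating HNSI against real-world CD patient data, where aligned severe-band proportions (6.7%) give a preliminary plausibility indication for index calibration in the high severity range. Finally, Task 4 performs a bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).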
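For readers unfamiliar with the agreement statistic quoted in this entry, the sketch below computes Cohen's kappa for two raters labelling the same papers. The abstract does not say which kappa variant was used or how many LLMs were compared, so the two-rater form here is an assumption and may differ from the authors' calculation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if the raters labelled independently at their own base rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With ten papers where the raters agree on nine and the expected chance agreement is 0.5, kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8, which is why kappa is lower than raw percent agreement.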
Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
2604.01957v1 by Klaudia Thellmann, Bernhard Stadler, Michael Färber
Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.
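Summary: Machine-translated benchmark datasets reduce cost and offer scale, but noise, loss of structure, and uneven quality weaken confidence in them. What matters is not only whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, using a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling with a neural metric (COMET, reference-free and reference-based) together with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. The trends are consistent: datasets with lower COMET scores show a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned and corrected versions of the EU20 datasets, along with code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review, complementing rather than replacing human gold standards.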
Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction
2604.01944v1 by Anatolij Zubow, Joana Angjo, Sigrid Dimce, Falko Dressler
Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. Our model operates on the joint time-frequency grid of $T$ snapshots and $F$ frequency bins and uses a factored self-attention mechanism that separately attends along both axes, reducing the computational complexity to $O(TF^2 + FT^2)$. Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss combining spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across different mobility regimes. Evaluation against three classical baselines, namely, last-observation-carry-forward, zero-fill, and cubic-spline interpolation, shows that our approach achieves the highest PDP similarity with respect to the ground truth, reaching $ρ\geq 0.82$ compared to $ρ\geq 0.62$ for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range, consistently outperforming all other baselines.
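Summary: Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. The model operates on the joint time-frequency grid of $T$ snapshots and $F$ frequency bins and uses a factored self-attention mechanism that attends separately along both axes, reducing the computational complexity to $O(TF^2 + FT^2)$. Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss that combines spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across mobility regimes. Evaluation against three classical baselines (last-observation-carry-forward, zero-fill, and cubic-spline interpolation) shows that our approach achieves the highest PDP similarity with respect to the ground truth, reaching $\rho \geq 0.82$ compared to $\rho \geq 0.62$ for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range and consistently outperforms all baselines.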
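Two pieces of the setup in this entry are easy to make concrete: the per-sub-band two-state Markov interference model and the cost saving from factored attention. The sketch below is a minimal illustration using only the standard library. The transition probabilities `p_block` and `p_clear`, the function names, and the choice to start each chain from its stationary distribution are assumptions, since the abstract gives no parameter values.

```python
import random


def stationary_occupancy(p_block, p_clear):
    """Long-run fraction of time a sub-band is blocked in the two-state chain."""
    return p_block / (p_block + p_clear)


def simulate_interference(num_bands, num_snapshots, p_block, p_clear, seed=0):
    """Return mask[t][b] = 1 if sub-band b is blocked at snapshot t, else 0.

    Each sub-band follows its own chain: clear -> blocked with probability
    p_block, blocked -> clear with probability p_clear. Small values of both
    give long bursts, which is what makes the occupancy pattern bursty.
    """
    if not (0 < p_block <= 1 and 0 < p_clear <= 1):
        raise ValueError("transition probabilities must lie in (0, 1]")
    rng = random.Random(seed)
    occupancy = stationary_occupancy(p_block, p_clear)
    state = [1 if rng.random() < occupancy else 0 for _ in range(num_bands)]
    mask = []
    for _ in range(num_snapshots):
        mask.append(list(state))
        for b in range(num_bands):
            flip_probability = p_clear if state[b] else p_block
            if rng.random() < flip_probability:
                state[b] = 1 - state[b]
    return mask


def attention_score_counts(num_snapshots, num_bins):
    """Pairwise attention scores: (full joint attention, factored per-axis attention)."""
    t, f = num_snapshots, num_bins
    full = (t * f) ** 2               # every grid cell attends to every other cell
    factored = t * f * f + f * t * t  # frequency axis per snapshot + time axis per bin
    return full, factored
```

For a grid of 8 snapshots by 64 bins, `attention_score_counts(8, 64)` gives 262144 scores for full attention against 36864 for the factored form, matching the $O(TF^2 + FT^2)$ term in the abstract.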
Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
2604.01941v1 by Sixing Li, Zhibin Gu, Ziqi Zhang, Weiguo Pan, Bing Li, Ying Wang, Hongzhe Liu
Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
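Summary: Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two main challenges. First, the lack of large-scale, domain-specific datasets limits a model's ability to capture the fine-grained semantic concepts unique to ECE scenarios, leading to generic and imprecise descriptions. Second, conventional training paradigms are limited in improving professional object description, because supervised learning tends to favor high-frequency expressions while reinforcement learning can suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning that comprises 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC also comes with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), which explicitly measures professional object naming accuracy. In addition, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Using ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments show that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, which highlights its potential for specialized educational applications.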
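The reward-conditional switch described in this entry can be illustrated with a small routing function. This is a sketch under assumptions: it supposes that several captions are sampled per image and that RL advantages are normalized within that group, a common setup in which an all-zero reward group yields all-zero advantages and therefore no gradient. The abstract confirms only that zero-reward hard samples are rerouted to supervised fine-tuning; `route_group` and its return format are hypothetical.

```python
from statistics import mean, pstdev


def route_group(rewards, eps=1e-8):
    """Decide how one image's group of sampled captions should be trained.

    Returns ("sft", None) when every sampled caption earned zero reward,
    otherwise ("rl", advantages) with group-normalized advantages.
    """
    if not rewards:
        raise ValueError("need at least one sampled reward")
    if all(r == 0 for r in rewards):
        # No caption succeeded, so relative advantages carry no signal;
        # fall back to the expert caption via supervised fine-tuning.
        return "sft", None
    mu, sigma = mean(rewards), pstdev(rewards)
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    return "rl", advantages
```

For rewards `[0, 1, 0, 1]` the group stays on the RL path with advantages close to `[-1, 1, -1, 1]`, while `[0, 0, 0, 0]` is sent to supervised fine-tuning.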
Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution
2604.01939v1 by Ismaïl Baaj, Pierre Marquis
We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must be satisfied to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and if a class has a strictly larger possibility degree than another class, then it receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection yields the closest admissible probability distribution in Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra's algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projections onto each constraint set. Experiments conducted on synthetic data and on a real-world natural language inference task, based on the ChaosNLI dataset, show that the proposed projection algorithm is efficient enough for practical use, and that the resulting projection-based learning objective can improve predictive performance.
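Summary: We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must hold to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and a class with a strictly larger possibility degree than another receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection is the closest admissible probability distribution in the Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra's algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projection onto each constraint set. Experiments on synthetic data and on a real-world natural language inference task based on the ChaosNLI dataset show that the proposed projection algorithm is efficient enough for practical use and that the resulting projection-based learning objective can improve predictive performance.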
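The admissible set in this entry can be made concrete with a brute-force membership check. The sketch uses the standard possibility-theory definitions $\Pi(A) = \max_{i \in A} \pi_i$ and $N(A) = 1 - \Pi(A^c)$ and tests $N(A) \le P(A) \le \Pi(A)$ on every proper non-empty subset, plus the shape constraints stated in the abstract. It only checks membership and is exponential in the number of classes; it is not the paper's Dykstra-based projection, and the tolerance handling and plain `>` test for strictness are assumptions.

```python
from itertools import combinations


def is_admissible(pi, p, tol=1e-9):
    """True if probability vector p is compatible with possibility distribution pi."""
    n = len(pi)
    if len(p) != n or abs(max(pi) - 1.0) > tol or abs(sum(p) - 1.0) > tol:
        return False  # pi must be normalized (max 1) and p must sum to 1
    if any(x < -tol for x in p):
        return False
    classes = range(n)
    # Dominance: N(A) <= P(A) <= Pi(A) for every proper non-empty event A.
    for size in range(1, n):
        for event in combinations(classes, size):
            members = set(event)
            prob = sum(p[i] for i in event)
            possibility = max(pi[i] for i in event)
            necessity = 1.0 - max(pi[i] for i in classes if i not in members)
            if prob > possibility + tol or prob < necessity - tol:
                return False
    # Shape: equal possibility -> equal probability; larger -> strictly larger.
    for i in classes:
        for j in classes:
            if abs(pi[i] - pi[j]) <= tol and abs(p[i] - p[j]) > tol:
                return False
            if pi[i] > pi[j] + tol and not p[i] > p[j]:
                return False
    return True
```

With `pi = [1.0, 0.5, 0.5]`, the vector `[0.6, 0.2, 0.2]` is admissible, `[0.4, 0.3, 0.3]` violates dominance (the necessity of the first class is 0.5), and `[0.6, 0.3, 0.1]` violates the tie constraint.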
How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization
2604.01938v1 by Ramon Ferrer-i-Cancho
The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. The multiple cases in which crosslinguistic gestures reach optimality are unlikely to be due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
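Summary: The structure of all permutations of a sequence can be represented as a permutohedron, a graph whose vertices are permutations and in which two vertices are linked if swapping adjacent elements in one permutation produces the other. It has been hypothesized that word orders in languages minimize swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and therefore more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our new mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. The multiple cases in which crosslinguistic gestures reach optimality are unlikely to be due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and accordingly postulate a general principle of optimal assignment that unifies various linguistic principles, including swap distance minimization.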
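The swap distance at the center of this entry is the shortest-path length between two orders in the permutohedron, which equals the number of item pairs whose relative order differs (the inversion count, also known as the Kendall tau distance). The sketch below computes that distance. It does not reproduce the paper's optimality score, whose formula the abstract does not give, and the function names are illustrative.

```python
def swap_distance(source, target):
    """Minimum number of adjacent swaps that turn `source` into `target`."""
    if len(set(source)) != len(source) or sorted(source) != sorted(target):
        raise ValueError("both orders must be permutations of the same distinct items")
    rank = {item: position for position, item in enumerate(source)}
    sequence = [rank[item] for item in target]
    # Count pairs that appear in opposite relative order in the two permutations.
    return sum(
        1
        for i in range(len(sequence))
        for j in range(i + 1, len(sequence))
        if sequence[i] > sequence[j]
    )


def max_swap_distance(n):
    """Diameter of the permutohedron on n items, reached by the reversed order."""
    return n * (n - 1) // 2
```

Taking subject, object, and verb as the three items, `swap_distance("SOV", "SVO")` is 1 and `swap_distance("SOV", "VOS")` is 3, the maximum for three elements.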
Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification
2604.01936v1 by Géraud Faye, Benjamin Icard, Morgane Casanova, Guillaume Gadek, Guillaume Gravier, Wassila Ouerdane, Céline Hudelot, Sylvain Gatepaille, Paul Égré
Among news disorders, propagandist news is particularly insidious because it tends to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on language models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features.
Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness
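Summary: Among news disorders, propagandist news is particularly insidious because it tends to mix oriented messages with factual reports that look reliable. For detecting propaganda, existing approaches based on language models such as BERT are promising, but biases in data collection often cause them to overfit their training datasets. To make classification more robust and improve generalization to new sources, we propose a neurosymbolic approach that combines non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies and explainability analyses confirm the benefit of the added features.
Keywords: information disorder, fake news, propaganda, classification, topic modeling, hybrid method, neurosymbolic model, ablation, robustness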
Quantum-Inspired Geometric Classification with Correlation Group Structures and VQC Decision Modeling
2604.01930v1 by Nishikanta Mohanty, Arya Ansuman Priyadarshi, Bikash K. Behera, Badshah Mukherjee
We propose a geometry-driven quantum-inspired classification framework that integrates Correlation Group Structures (CGR), compact SWAP-test-based overlap estimation, and selective variational quantum decision modelling. Rather than directly approximating class posteriors, the method adopts a geometry-first paradigm in which samples are evaluated relative to class medoids using overlap-derived Euclidean-like and angular similarity channels. CGR organizes features into anchor-centered correlation neighbourhoods, generating nonlinear, correlation-weighted representations that enhance robustness in heterogeneous tabular spaces. These geometric signals are fused through a non-probabilistic margin-based fusion score, serving as a lightweight and data-efficient primary classifier for small-to-moderate datasets. On Heart Disease, Breast Cancer, and Wine Quality datasets, the fusion-score classifier achieves 0.8478, 0.8881, and 0.9556 test accuracy respectively, with macro-F1 scores of 0.8463, 0.8703, and 0.9522, demonstrating competitive and stable performance relative to classical baselines. For large-scale and highly imbalanced regimes, we construct compact Delta-distance contrastive features and train a variational quantum classifier (VQC) as a nonlinear refinement layer. On the Credit Card Fraud dataset (0.17% prevalence), the Delta + VQC pipeline achieves approximately 0.85 minority recall at an alert rate of approximately 1.31%, with ROC-AUC 0.9249 and PR-AUC 0.3251 under full-dataset evaluation. These results highlight the importance of operating-point-aware assessment in rare-event detection and demonstrate that the proposed hybrid geometric-variational framework provides interpretable, scalable, and regime-adaptive classification across heterogeneous data settings.
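Summary: We propose a geometry-driven, quantum-inspired classification framework that integrates Correlation Group Structures (CGR), compact SWAP-test-based overlap estimation, and selective variational quantum decision modeling. Rather than directly approximating class posteriors, the method adopts a geometry-first paradigm in which samples are evaluated relative to class medoids using overlap-derived Euclidean-like and angular similarity channels. CGR organizes features into anchor-centered correlation neighborhoods, producing nonlinear, correlation-weighted representations that improve robustness in heterogeneous tabular spaces. These geometric signals are fused through a non-probabilistic, margin-based fusion score that serves as a lightweight, data-efficient primary classifier for small-to-moderate datasets. On the Heart Disease, Breast Cancer, and Wine Quality datasets, the fusion-score classifier reaches test accuracies of 0.8478, 0.8881, and 0.9556, respectively, with macro-F1 scores of 0.8463, 0.8703, and 0.9522, showing competitive and stable performance relative to classical baselines. For large-scale and highly imbalanced regimes, we construct compact Delta-distance contrastive features and train a variational quantum classifier (VQC) as a nonlinear refinement layer. On the Credit Card Fraud dataset (0.17% prevalence), the Delta + VQC pipeline achieves approximately 0.85 minority recall at an alert rate of approximately 1.31%, with ROC-AUC 0.9249 and PR-AUC 0.3251 under full-dataset evaluation. These results highlight the importance of operating-point-aware assessment in rare-event detection and show that the proposed hybrid geometric-variational framework provides interpretable, scalable, and regime-adaptive classification across heterogeneous data settings.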
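The SWAP test mentioned in this entry estimates the overlap between two quantum states: for normalized states, the ancilla is measured as 0 with probability $(1 + |\langle a | b \rangle|^2) / 2$. The sketch below evaluates that relation classically and derives an angular similarity from it. How the paper encodes samples and medoids as states, and how it builds its Euclidean-like channel, is not stated in the abstract, so amplitude-style normalization and the helper names here are assumptions.

```python
import math


def normalize(vector):
    """Scale a real or complex vector to unit Euclidean norm."""
    norm = math.sqrt(sum(abs(x) ** 2 for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vector]


def swap_test_p0(a, b):
    """Ideal probability of reading 0 on the SWAP-test ancilla for states a and b."""
    a, b = normalize(a), normalize(b)
    inner = sum(x.conjugate() * y for x, y in zip(a, b))
    return 0.5 * (1.0 + abs(inner) ** 2)


def overlap_from_p0(p0):
    """Recover |<a|b>|^2 from the ancilla statistics (clipped to [0, 1])."""
    return min(1.0, max(0.0, 2.0 * p0 - 1.0))


def angle_from_overlap(overlap_squared):
    """Angle between the two states implied by the squared overlap, in radians."""
    return math.acos(min(1.0, math.sqrt(overlap_squared)))
```

For `a = [1, 0]` and `b = [1, 1]` the ideal ancilla probability is 0.75, the squared overlap is 0.5, and the implied angle is π/4. On hardware the probability would be estimated from repeated shots rather than computed exactly.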
Woosh: A Sound Effects Foundation Model
2604.01929v1 by Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji
The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Optimized for sound effects, the release provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
摘要:音頻研究社群依賴開放的生成模型作為建立新方法和確立基準的基礎工具。
在本報告中,我們介紹了Woosh,Sony AI公開發布的音效基礎模型,詳細說明了其架構、訓練過程以及與其他流行開放模型的評估。
為了優化音效,我們提供了(1) 一個高品質的音頻編碼器/解碼器模型和(2) 一個用於條件的文本-音頻對齊模型,以及(3) 文本到音頻和(4) 影片到音頻的生成模型。
提煉的文本到音頻和影片到音頻模型也包含在發布中,允許低資源運行和快速推斷。
我們在公共和私人數據上的評估顯示,與現有的開放替代品如StableAudio-Open和TangoFlux相比,每個模塊的性能都具有競爭力或更佳。
推斷代碼和模型權重可在 https://github.com/SonyResearch/Woosh 獲得。
演示樣本可在 https://sonyresearch.github.io/Woosh/ 找到。
ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues
2604.01925v1 by Bhaskara Hanuma Vedula, Darshan Anghan, Ishita Goyal, Ponnurangam Kumaraguru, Abhijnan Chakraborty
Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.
摘要:大型語言模型在明確表述人口身份時越來越能抑制偏見輸出,但在間接傳達身份時仍可能表現出隱性偏見。現有基準使用基於姓名的代理來檢測隱性偏見,這與許多社會人口特徵的關聯性較弱,無法擴展到年齡或社會經濟地位等維度。我們引入了ImplicitBBQ,一個通過特徵基準提示來評估隱性偏見的問答基準,這些特徵是與年齡、性別、地區、宗教、種姓和社會經濟地位隱性相關的文化屬性。在評估11個模型時,我們發現,在模糊上下文中的隱性偏見比開放權重模型中的明確偏見高出六倍以上。安全提示和思維鏈推理未能實質性地縮小這一差距;即使是少量提示,雖然將隱性偏見降低了84%,但種姓偏見仍是其他任何維度的四倍。這些發現表明,當前的對齊和提示策略僅觸及偏見評估的表面,而未能大幅解決文化根植的刻板印象關聯。我們公開釋放我們的代碼和數據集,以供模型提供者和研究人員基準潛在的緩解技術。
Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients
2604.01924v1 by Oumaima El Khettari, Virgile Barthet, Guillaume Hocquet, Joconde Weller, Emmanuel Morin, Pierre Zweigenbaum
Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.
摘要:準確的心臟衰竭(HF)短期死亡率預測仍然具有挑戰性,特別是當僅依賴結構化電子健康記錄(EHR)數據時。
我們在一個法國HF隊列中評估基於Transformer的模型,並比較僅文本、僅結構、跨模態和基於LLM的方法。
我們的結果顯示,使用實體級別的表示來豐富臨床文本能夠改善僅使用CLS嵌入的預測,並且監督式的文本和結構變量的跨模態融合達到了最佳的整體表現。
相比之下,大型語言模型在不同模態和解碼策略中的表現不一致,僅文本的提示優於結構化或跨模態輸入。
這些發現突顯出,具有實體感知的跨模態Transformer為短期HF結果預測提供了最可靠的解決方案,而當前的LLM提示在臨床決策支持方面仍然有限。
SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations
2604.01916v1 by Yiqiang Cai, Chengyan Wu, Bolei Ma, Bo Chen, Yun Xue, Julia Hirschberg, Ziwei Gong
Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.
摘要:多模態情感識別在對話中(MERC)需要整合多模態信號,同時對噪聲具有魯棒性並建模上下文推理。現有的方法通常強調融合,但忽視了噪聲特徵中的不確定性和細緻的推理。我們提出了SURE(協同不確定性感知推理)用於MERC,這是一個改善魯棒性和上下文建模的框架。SURE由三個組件組成:一個不確定性感知的專家混合模塊,用於處理特定模態的噪聲;一個迭代推理模塊,用於對上下文進行多輪推理;以及一個Transformer閘模塊,用於捕捉模態內和模態間的交互。在基準MERC數據集上的實驗顯示,SURE始終超越最先進的方法,證明了其在魯棒多模態推理中的有效性。這些結果突顯了不確定性建模和迭代推理在推進對話環境中的情感識別中的重要性。
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
2604.01907v1 by Yixin Chen, Yaowei Zhang, Huangyue Yu, Junchao He, Yan Wang, Jiangyong Huang, Hongyu Shen, Junfeng Ni, Shaofei Wang, Baoxiong Jia, Song-Chun Zhu, Siyuan Huang
Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data that facilitates end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks ranging from low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
摘要:註解的 3D 場景數據稀缺且獲取成本高昂,而網路上卻有大量未標記的視頻可供使用。
在本文中,我們展示了精心設計的數據引擎如何利用網路策劃的未標記視頻自動生成訓練數據,以促進端到端模型在 3D 場景理解中的應用,並與人類標註的數據集相輔相成。
我們識別並分析了自動數據生成中的瓶頸,揭示了影響從未標記數據學習的效率和效果的關鍵因素。
為了在不同的感知粒度上驗證我們的方法,我們在三個任務上進行評估,這些任務涵蓋了低級感知,即 3D 物體檢測和實例分割,以及高級推理,即 3D 空間視覺問答 (VQA) 和視覺-語言導航 (VLN)。
在我們生成的數據上訓練的模型展示了強大的零樣本性能,並在微調後顯示出進一步的改進。
這證明了利用現成可得的網路數據作為通往更強大場景理解系統之途徑的可行性。
Combating Data Laundering in LLM Training
2604.01904v1 by Muxing Li, Zesheng Ye, Sharon Li, Feng Liu
Data rights owners can detect unauthorized data use in large language model (LLM) training by querying with proprietary samples. Often, superior performance (e.g., higher confidence or lower loss) on a sample relative to the untrained data implies it was part of the training corpus, as LLMs tend to perform better on data they have seen during training. However, this detection becomes fragile under data laundering, a practice of transforming the stylistic form of proprietary data, while preserving critical information to obfuscate data provenance. When an LLM is trained exclusively on such laundered variants, it no longer performs better on originals, erasing the signals that standard detections rely on. We counter this by inferring the unknown laundering transformation from black-box access to the target LLM and, via an auxiliary LLM, synthesizing queries that mimic the laundered data, even if rights owners have only the originals. As the search space of finding true laundering transformations is infinite, we abstract such a process into a high-level transformation goal (e.g., "lyrical rewriting") and concrete details (e.g., "with vivid imagery"), and introduce synthesis data reversion (SDR) that instantiates this abstraction. SDR first identifies the most probable goal for synthesis to narrow the search; it then iteratively refines details so that synthesized queries gradually elicit stronger detection signals from the target LLM. Evaluated on the MIMIR benchmark against diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon), SDR consistently strengthens data misuse detection, providing a practical countermeasure to data laundering.
摘要:資料權利擁有者可以透過使用專有樣本查詢,檢測大型語言模型(LLM)訓練中的未經授權數據使用。通常,相對於未訓練數據,對某個樣本的優越表現(例如,更高的信心或更低的損失)意味著它是訓練語料庫的一部分,因為LLM在訓練期間見過的數據上表現通常更好。然而,這種檢測在數據洗白的情況下變得脆弱,數據洗白是一種轉變專有數據的風格形式的做法,同時保留關鍵信息以模糊數據來源。當LLM僅在這些洗白變體上訓練時,它在原始數據上的表現不再優於洗白數據,抹去了標準檢測所依賴的信號。我們通過從對目標LLM的黑箱訪問中推斷未知的洗白轉換,並通過輔助LLM合成模仿洗白數據的查詢來應對,即使權利擁有者只有原始數據。由於尋找真實洗白轉換的搜索空間是無限的,我們將這一過程抽象為高層次的轉換目標(例如,“抒情重寫”)和具體細節(例如,“以生動的意象”),並引入合成數據反轉(SDR)來具體化這一抽象。SDR首先識別最可能的合成目標以縮小搜索範圍;然後,它迭代地細化細節,使合成查詢逐漸引發目標LLM更強的檢測信號。在MIMIR基準上針對多樣的洗白做法和目標LLM系列(Pythia、Llama2和Falcon)進行評估,SDR始終增強數據濫用檢測,提供了一種對抗數據洗白的實用對策。
Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always
2604.01896v1 by Luka Hobor, Mario Brcic, Mihael Kovac, Kristijan Poje
Large language models (LLMs) have been proposed as alternatives to human experts for estimating unknown quantities with associated uncertainty, a process known as Bayesian elicitation. We test this by asking eleven LLMs to estimate population statistics, such as health prevalence rates, personality trait distributions, and labor market figures, and to express their uncertainty as 95% credible intervals. We vary each model's reasoning effort (low, medium, high) to test whether more "thinking" improves results. Our findings reveal three key results. First, larger, more capable models produce more accurate estimates, but increasing reasoning effort provides no consistent benefit. Second, all models are severely overconfident: their 95% intervals contain the true value only 9--44% of the time, far below the expected 95%. Third, a statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. Models performed well on commonly discussed topics but struggled with specialized health data. These results indicate that LLM uncertainty estimates require statistical correction before they can be used in decision-making.
摘要:大型語言模型(LLMs)被提出作為人類專家在估計與不確定性相關的未知數量的替代方案,這個過程被稱為貝葉斯引導。我們通過要求十一個LLM估計人口統計數據,例如健康流行率、個性特徵分佈和勞動市場數據,並將其不確定性表達為95%可信區間,來測試這一點。我們變化每個模型的推理努力(低、中、高)以測試更多的“思考”是否能改善結果。我們的研究結果揭示了三個關鍵結果。首先,較大、能力更強的模型產生更準確的估計,但增加推理努力並未提供一致的好處。其次,所有模型都過於自信:它們的95%區間僅在9--44%的情況下包含真實值,遠低於預期的95%。第三,一種稱為符合預測的統計重新校準技術可以糾正這種過度自信,擴大區間以實現預期的覆蓋率。在一個初步實驗中,給模型提供網絡搜索訪問權限使得已經準確的模型的預測變差,而對較弱的模型則有適度的改善。模型在常見話題上表現良好,但在專門的健康數據上則掙扎。這些結果表明,LLM的不確定性估計在用於決策之前需要進行統計校正。
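The recalibration step mentioned in this abstract can be sketched with split conformal prediction applied to intervals. The abstract does not state which conformal variant the authors use, so the version below (a score of how far the truth falls outside the stated interval, in the style of conformalized quantile regression) is an assumption, and all function names are hypothetical.

```python
import math

def conformal_margin(cal_intervals, cal_truths, alpha=0.05):
    """Margin to add to each side of an interval, learned from held-out questions.

    The score is positive when the truth lies outside the stated interval
    and negative when it lies inside.
    """
    scores = sorted(
        max(lo - y, y - hi) for (lo, hi), y in zip(cal_intervals, cal_truths)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[rank - 1]

def recalibrate(interval, margin):
    """Widen (or, for a negative margin, shrink) an interval symmetrically."""
    lo, hi = interval
    return lo - margin, hi + margin
```

A model whose stated 95% intervals are far too narrow yields a large positive margin, which is what expands the intervals toward the intended coverage. The guarantee relies on calibration and test questions being exchangeable.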
HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
2604.01881v1 by Yansong Guo, Chaoyang Zhu, Jiayi Ji, Jianghang Lin, Liujuan Cao
Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy is gradually reduced as LLM layers deepen, without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
摘要:視頻大型語言模型(VideoLLMs)在視頻理解方面展示了令人印象深刻的能力,但大量的輸入視頻標記對於部署造成了顯著的計算負擔。現有的方法主要在輸入層面修剪視頻標記,卻忽略了視頻和大型語言模型(LLMs)中固有的信息結構。為了解決這一問題,我們提出了HieraVid,一個層次化的修剪框架,逐步且動態地減少視覺冗餘。基於兩個觀察,即視頻具有段-幀結構,並且LLMs內部單向傳播多模態信息,我們將修剪分解為三個層次:1)段級,首先對視頻標記進行時間分段和空間合併;2)幀級,在同一段內共同修剪相似的幀以保持多樣性;3)層級,隨著LLM層數的增加,冗餘逐漸減少而不妨礙性能。我們在四個廣泛使用的視頻理解基準上進行了廣泛的實驗,以全面評估HieraVid的有效性。值得注意的是,僅保留30%的標記,HieraVid便達到了新的最先進性能,同時分別保持了LLaVA-Video-7B和LLaVA-OneVision-7B超過98%和99%的性能。
Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution
2604.01853v1 by Samuel Rose, Debarati Chakraborty
Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexia-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. Automated classification of learners entails risks of harmful labelling, covert screening, algorithmic bias, and institutional misuse, which require robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task: given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or a non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions (consent, transparency, human oversight, and recourse) under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the system's limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.
摘要:失讀症的拼寫錯誤展現出系統性的語音和正字法模式,使其與典型發展的寫作者所產生的錯誤區分開來。
雖然這一觀察促使了針對失讀症的特定拼寫檢查和輔助寫作工具的開發,但先前的工作主要集中在錯誤修正而非歸因,並且在很大程度上忽視了倫理風險。
自動化學習者分類所帶來的有害標籤、隱性篩選、算法偏見和機構濫用的風險,需要為該領域的研究制定健全的倫理和法律框架。
本文針對這兩個空白進行探討。
我們將失讀症錯誤歸因表述為一個二元分類任務。
給定一個拼寫錯誤的單詞及其正確的目標形式,判斷該錯誤模式是否是失讀症或非失讀症寫作者的特徵。
我們開發了一套全面的特徵集,捕捉每個錯誤的正字法、語音學和形態學特性,並提出了一種雙輸入神經模型,該模型在獨立於寫作者的條件下,與傳統機器學習基準進行評估。
該神經模型達到了93.01%的準確率和94.01%的F1-score,語音上合理的錯誤和元音混淆成為最強的歸因信號。
我們將這些技術結果置於明確的倫理優先框架內,分析不同子群體之間的公平性、教育部署的可解釋性要求,以及系統在何種條件、同意、透明度、人類監督和救濟下可以負責任地使用。
我們提供了具體的倫理部署指南以及對系統限制和濫用潛力的公開討論。
我們的結果表明,失讀症錯誤歸因在高準確率下是可行的,同時強調僅有可行性不足以在高風險的教育環境中進行部署。
From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion
2604.01849v1 by Liang Zhu, Haolin Chen, Lidong Zhao, Xian Wu
While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill them in directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and designing a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
摘要:大型語言模型(LLMs)在代碼補全方面展現出卓越的能力,但它們通常遵循硬性補全(HC)範式,強迫生成完全具體的代碼,即使在上下文不足的情況下也是如此。對 300 萬次真實互動的分析揭示了這一策略的局限性:61% 的生成建議在被接受後被編輯或被拒絕,儘管它們與用戶隨後的代碼有超過 80% 的相似性,這表明模型在特定的標記位置經常做出錯誤的預測。受到這一觀察的啟發,我們提出了自適應佔位符補全(APC),這是一個協作框架,通過在高熵位置戰略性地輸出明確的佔位符來擴展 HC,允許用戶通過 IDE 導航直接填寫。從理論上講,我們將代碼補全形式化為一個不確定性下的成本最小化問題。基於填寫佔位符的成本低於糾正錯誤的觀察,我們證明了存在一個關鍵熵閾值,超過該閾值時,APC 的期望成本嚴格低於 HC。我們通過從過濾的真實編輯日誌中構建訓練數據來實現這一框架,並為強化學習設計了一個基於成本的獎勵函數。在 15 億到 140 億參數模型上的廣泛評估表明,APC 將期望編輯成本降低了 19% 至 50%,同時保持標準 HC 的性能。我們的工作為不確定性感知的代碼補全提供了理論基礎和實用的訓練框架,展示了自適應放棄可以端到端學習,而不犧牲傳統補全質量。
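The cost argument in this abstract (filling a placeholder is cheaper than correcting a wrong guess) can be made concrete with a toy decision rule. This is a minimal sketch under stated assumptions: the per-position costs `cost_fill` and `cost_fix` are hypothetical constants, and the greedy token is assumed to be correct with exactly its predicted probability. It is not the paper's formulation or its learned policy.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_cost_hard(p_correct, cost_fix):
    """Hard completion: pay the correction cost whenever the guess is wrong."""
    return (1.0 - p_correct) * cost_fix

def choose_action(probs, cost_fill, cost_fix):
    """Return ("placeholder" | "token", entropy) for one position.

    A placeholder is emitted only when its fixed fill cost is lower than
    the expected cost of fixing a wrong concrete token.
    """
    p_correct = max(probs)  # assumption: calibrated greedy prediction
    hard_cost = expected_cost_hard(p_correct, cost_fix)
    action = "placeholder" if cost_fill < hard_cost else "token"
    return action, token_entropy(probs)
```

With `cost_fill = 1` and `cost_fix = 3`, a confident distribution such as `[0.97, 0.02, 0.01]` keeps the concrete token, while a flat distribution over four candidates switches to a placeholder. Flat distributions are also the high-entropy ones, which is the intuition behind an entropy threshold.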
CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift
2604.01845v1 by HyunGi Kim, Jisoo Mok, Hyungyu Lee, Juhyeon Shin, Sungroh Yoon
Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detectors. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC by up to 14% while using fewer adaptation samples.
摘要:多變量時間序列異常檢測(MTSAD)旨在識別多變量時間序列中的正常性偏差,並在現實應用中至關重要。
然而,在現實部署中,分佈變化無處不在,並導致預訓練異常檢測器的性能嚴重下降。
測試時適應(TTA)僅使用未標記的測試數據即時更新預訓練模型,使其在應對這一挑戰時顯得前景可期。
在本研究中,我們提出了CANDI(針對分佈變化的多變量時間序列異常檢測的精選測試時適應),這是一個新穎的TTA框架,能夠選擇性地識別和適應潛在的假陽性,同時保留預訓練知識。
CANDI引入了一種假陽性挖掘(FPM)策略,根據異常分數和潛在相似性來篩選適應樣本,並結合了一個即插即用的時空感知正常性適應(SANA)模塊,以進行結構性知識更新。
大量實驗表明,CANDI在分佈變化下顯著提高了MTSAD的性能,AUROC提高了多達14%,同時使用了更少的適應樣本。
Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints
2604.01841v1 by Minh-Khoi Pham, Thang-Long Nguyen Ho, Thao Thi Phuong Dao, Tai Tan Mai, Minh-Triet Tran, Marie E. Ward, Una Geary, Rob Brennan, Nick McDonald, Martin Crane, Marija Bezbradica
Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.
摘要:臨床預測來自結構化電子健康紀錄(EHRs)是具有挑戰性的,因為它們具有高維度性、異質性、類別不平衡和分佈轉移。雖然表格內文學習(TICL)和檢索增強方法在通用基準上表現良好,但它們在臨床環境中的行為仍不明確。我們提出了一個多隊列EHR基準,比較了傳統模型、深度表格模型和TICL模型在不同數據規模、特徵維度、結果稀有性和跨隊列泛化方面的表現。基於PFN的TICL模型在低數據環境中樣本效率高,但在異質性和不平衡性增加時,簡單的基於距離的檢索會導致性能下降。我們提出了AWARE,一個與任務對齊的檢索框架,使用監督式嵌入學習和輕量級適配器。在極端不平衡的情況下,AWARE將AUPRC提高了多達12.2%,並且隨著數據複雜性的增加而增長。我們的結果確定了檢索質量和檢索推理對齊是將表格內文學習應用於臨床預測的關鍵瓶頸。
Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
2604.01840v1 by Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on https://github.com/Yzk1114/PGPO.
摘要:雖然來自可驗證獎勵的強化學習(RLVR)在大型視覺語言模型(LVLMs)中推進了推理,但現有框架存在一個基本的方法論缺陷:通過在所有生成的標記中分配相同的優勢,這些方法本質上稀釋了對於優化多模態推理關鍵的、視覺基礎步驟所必需的學習信號。為了填補這一空白,我們制定了標記視覺依賴性(Token Visual Dependency),通過視覺條件和僅文本預測分佈之間的Kullback-Leibler(KL)散度來量化視覺輸入的因果信息增益。揭示這種依賴性高度稀疏且在語義上至關重要,我們引入了感知基礎的策略優化(PGPO),這是一種新穎的細粒度信用分配框架,能夠在標記層面動態重塑優勢。通過一種閾值門控的質量守恆機制,PGPO主動放大視覺依賴標記的學習信號,同時抑制來自語言先驗的梯度噪聲。基於Qwen2.5-VL系列的廣泛實驗,在七個具有挑戰性的多模態推理基準上顯示,PGPO平均提升模型性能18.7%。理論和實證分析均確認PGPO有效降低梯度方差,防止訓練崩潰,並作為一種強大的正則化器,促進穩健的、基於感知的多模態推理。代碼將在https://github.com/Yzk1114/PGPO上發布。
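The two ingredients named in this abstract, a per-token KL divergence between visual-conditioned and text-only predictions and a threshold-gated, mass-conserving reweighting of advantages, can be sketched as follows. The KL direction, the fixed `boost` factor, and the normalization are illustrative assumptions; the abstract does not give PGPO's exact formulas.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )

def token_visual_dependency(visual_dists, text_dists):
    """Per-token dependency: how much the image changes the prediction."""
    return [kl_divergence(p, q) for p, q in zip(visual_dists, text_dists)]

def reshape_advantages(advantage, dependencies, threshold, boost=2.0):
    """Spread one sequence-level advantage over tokens with gated weights.

    Tokens whose dependency exceeds `threshold` get weight `boost`, the rest
    get 1; weights are rescaled to average 1, so the summed advantage over
    the sequence is unchanged (mass conservation).
    """
    raw = [boost if d > threshold else 1.0 for d in dependencies]
    scale = len(raw) / sum(raw)
    return [advantage * w * scale for w in raw]
```

Rescaling the weights to average 1 means that amplifying visually dependent tokens automatically shrinks the signal on the remaining tokens, rather than increasing the total update size.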
PLOT: Enhancing Preference Learning via Optimal Transport
2604.01837v1 by Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang, Feiteng Fang, Hamid Alinejad-Rokny, Minghuan Tan, Min Yang
Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic & Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.
摘要:偏好學習在大型語言模型(LLMs)中已經取得了顯著進展,但現有的方法仍然受到有限的性能提升、高計算成本、超參數敏感性以及對全局標記級關係建模不足的限制。我們介紹了PLOT,它通過來自最優傳輸的標記級損失來增強基於微調的對齊中的偏好學習。通過將偏好學習公式化為最優傳輸問題,PLOT使模型輸出與人類偏好對齊,同時保留LLMs的原始分佈,確保穩定性和魯棒性。此外,PLOT利用標記嵌入來捕捉語義關係,實現全球信息的優化。在兩個偏好類別 - 人類價值觀和邏輯與問題解決 - 涉及七個子偏好的實驗中,顯示PLOT持續改善對齊性能,同時保持流暢性和一致性。這些結果證實了最優傳輸作為偏好學習的一種原則性方法,建立了一個理論基礎的框架,為LLMs的偏好學習提供了新的見解。
Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
2604.01833v1 by Yaxin Luo, Zhiqiang Shen
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, which requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
摘要:語言預訓練模型和視覺預訓練模型中的離群參數比例差異顯著,使得跨模態(語言和視覺)本質上比跨領域適應更具挑戰性。
因此,許多先前的研究專注於跨領域轉移,而不是嘗試橋接語言和視覺模態,假設語言預訓練模型不適合下游視覺任務,因為參數空間存在差異。
與這一假設相反,我們展示了將橋接訓練階段作為模態適應學習者添加進來,可以有效地將大型語言模型(LLM)的參數與視覺任務對齊。
具體而言,我們提出了一個簡單而強大的解決方案——隨機標籤橋接訓練,該方法不需要手動標記,並幫助LLM參數適應視覺基礎任務。
此外,我們的研究發現部分橋接訓練通常是有利的,因為LLM中的某些層展現出強大的基礎特性,即使在不進行視覺任務微調的情況下仍然有益。
這一驚人的發現為直接在視覺模型中利用語言預訓練參數開辟了新的途徑,並突顯了部分橋接訓練作為跨模態適應的實用途徑的潛力。
Neural Network-Assisted Model Predictive Control for Implicit Balancing
2604.01805v1 by Seyed Soroush Karimi Madahi, Kenneth Bruninx, Bert Claessens, Chris Develder
In Europe, balance responsible parties can deliberately take out-of-balance positions to support transmission system operators (TSOs) in maintaining grid stability and earn profit, a practice called implicit balancing. Model predictive control (MPC) is widely adopted as an effective approach for implicit balancing. The balancing market model accuracy in MPC is critical to decision quality. Previous studies modeled this market using either (i) a convex market clearing approximation, ignoring proactive manual actions by TSOs and the market sub-quarter-hour dynamics, or (ii) machine learning methods, which cannot be directly integrated into MPC. To address these shortcomings, we propose a data-driven balancing market model integrated into MPC using an input convex neural network to ensure convexity while capturing uncertainties. To keep the core network computationally efficient, we incorporate attention-based input gating mechanisms to remove irrelevant data. Evaluating on Belgian data shows that the proposed model both improves MPC decisions and reduces computational time.
摘要:在歐洲,平衡負責方可以故意採取失衡的頭寸,以支持傳輸系統運營商(TSOs)維持電網穩定並獲取利潤,這種做法稱為隱式平衡。模型預測控制(MPC)被廣泛採用作為隱式平衡的有效方法。MPC中平衡市場模型的準確性對決策質量至關重要。先前的研究使用(i)凸市場清算近似來建模此市場,忽略了TSOs的主動手動行動及市場的子季度小時動態,或(ii)機器學習方法,這些方法無法直接整合到MPC中。為了應對這些不足,我們提出了一種數據驅動的平衡市場模型,該模型集成到MPC中,使用輸入凸神經網絡以確保凸性,同時捕捉不確定性。為了保持核心網絡的計算效率,我們納入了基於注意力的輸入閘控機制,以去除不相關數據。在比利時數據上的評估顯示,所提出的模型不僅改善了MPC決策,還減少了計算時間。
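The property that makes an input convex neural network (ICNN) usable inside MPC is that its output is convex in the input. A standard way to obtain this is to use non-negative weights on the hidden-to-hidden path together with a convex, non-decreasing activation. The forward pass below is a minimal pure-Python sketch of that construction; the layer layout and names are hypothetical, and the attention-based input gating from the abstract is not modelled.

```python
def relu(v):
    """Convex, non-decreasing activation applied element-wise."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def icnn_forward(x, layers):
    """Scalar output of a small ICNN.

    `layers` is a list of (W_z, W_x, b). W_z acts on the previous hidden
    state and is clamped to be non-negative; W_x is an unconstrained
    skip connection from the input. W_z of the first layer is ignored.
    """
    z = None
    for i, (w_z, w_x, b) in enumerate(layers):
        pre = [a + c for a, c in zip(matvec(w_x, x), b)]
        if z is not None:
            w_z_pos = [[max(0.0, w) for w in row] for row in w_z]
            pre = [p + q for p, q in zip(pre, matvec(w_z_pos, z))]
        is_last = i == len(layers) - 1
        z = pre if is_last else relu(pre)  # final layer stays linear
    return z[0]
```

For example, `layers = [(None, [[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [[0.0]], [0.0])]` represents f(x) = |x|, which is convex. In training, the non-negativity constraint is usually enforced by clamping or reparameterizing the weights after each update.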
DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment
2604.01787v1 by Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang, Minghuan Tan, Min Yang
Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
摘要:強化學習來自人類反饋(RLHF),使用像是近端政策優化(PPO)這樣的算法,使大型語言模型(LLMs)與人類價值觀對齊,但成本高昂且不穩定。已提出替代方案來取代PPO或整合監督微調(SFT)和對比學習,以便進行直接微調和價值對齊。然而,這些方法仍然需要大量數據來學習偏好,並可能削弱LLMs的泛化能力。為了進一步提高對齊效率和性能,同時減少泛化能力的損失,本文介紹了基於分佈的高效微調(DEFT),這是一個高效的對齊框架,通過計算基於語言模型輸出分佈和偏好數據差異分佈的微分分佈獎勵來整合數據過濾和分佈指導。使用微分分佈獎勵從原始數據中過濾出一個小而高質量的子集,然後將其納入現有的對齊方法中,以指導模型的輸出分佈。實驗結果表明,經過DEFT增強的方法在對齊能力和泛化能力上均優於原始方法,並顯著減少了訓練時間。
Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens
2604.01779v1 by Hanna Hubarava, Yingqiang Gao
Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that do not reflect the degree of control achieved. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.
摘要:可控自動文本簡化(CATS)產生用戶量身定制的輸出,但可控性通常被視為解碼問題,並用不反映控制程度的指標進行評估。我們觀察到,ATS中的可控性受到數據和評估的顯著限制。為此,我們介紹了一種基於指令微調和離散控制標記的領域無關的CATS框架,引導開源模型以達到目標可讀性水平和壓縮率。在三個不同模型大小的模型家族(Llama、Mistral、Qwen;1-14B)和四個領域(醫學、公共管理、新聞、百科文本)中,我們發現較小的模型(1-3B)可以具有競爭力,但可靠的可控性強烈依賴於訓練數據是否編碼了目標屬性的足夠變異性。可讀性控制(FKGL、ARI、Dale-Chall)學習得相當一致,而壓縮控制表現不佳,因為現有語料庫中的信號變異性有限。我們進一步顯示,標準簡化和相似性指標不足以衡量控制,這促使我們採用基於錯誤的指標來對齊目標輸出。最後,我們的抽樣和分層實驗表明,簡單的劃分可能會引入分佈不匹配,從而削弱訓練和評估的效果。
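As a rough illustration of the readability and compression control described in the CATS abstract above, the sketch below computes the Automated Readability Index (ARI) with its standard formula, builds a prompt with discrete control tokens, and measures the absolute error between a target and the achieved value. The token format (`<ARI_6>`, `<CR_0.5>`), the function names, and the tokenization rules are assumptions made for illustration; the abstract does not specify them.

```python
import re

def ari(text: str) -> float:
    """Automated Readability Index:
    4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Count only letters and digits as characters.
    chars = sum(ch.isalnum() for w in words for ch in w)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

def compression_rate(source: str, output: str) -> float:
    """Word-level length ratio of output to source (one common definition)."""
    return len(output.split()) / len(source.split())

def build_prompt(source: str, target_ari: int, target_cr: float) -> str:
    """Prepend hypothetical control tokens to an instruction-style prompt."""
    return f"<ARI_{target_ari}> <CR_{target_cr:.1f}> Simplify: {source}"

def control_error(target: float, achieved: float) -> float:
    """Error-based control measure: absolute distance from the requested value."""
    return abs(target - achieved)

source = "The cat sat. The dog ran."
prompt = build_prompt(source, target_ari=6, target_cr=0.5)
print(prompt)
print(round(ari(source), 2), control_error(6, ari(source)))
```

An error-based measure like `control_error` asks how close the output came to the requested attribute value, which similarity metrics such as overlap with a reference do not capture.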
FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation
2604.01766v1 by Taimur Khan, Hannes Feilhauer, Muhammad Jazib Zafar
Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 $km^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29--46% in MAE (5.81 m vs. 8.14--10.84 m) with stronger correlation coefficients (0.713 vs. 0.166--0.652). Ablations show that multi-modal fusion improves performance by 10--26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
摘要:非常高解析度 (VHR) 的森林結構數據在單棵樹的尺度上對於碳、生物多樣性和生態系統監測至關重要。儘管空中LiDAR仍然是森林結構指標(如樹冠高度模型 (CHM)、植物面積指數 (PAI) 和葉片高度多樣性 (FHD))的參考,但其成本高昂且使用頻率不高。我們提出了FSKD:一個LiDAR到RGB-紅外 (RGBI) 知識蒸餾 (KD) 框架,其中一個多模態教師通過交叉注意力將RGBI影像與LiDAR衍生的平面指標和垂直剖面融合,而僅使用RGBI的SegFormer學生則學習重現這些輸出。在德國薩克森州的384 $km^2$ 森林上進行訓練(地面取樣距離 (GSD) 為20厘米),並在八個地理上不同的測試區塊上進行評估,該學生在零樣本CHM性能上達到了最先進的 (SOTA) 表現(MedAE 4.17 m,$R^2$=0.51,IoU 0.87),在MAE方面超越了HRCHM/DAC基準29--46%(5.81 m對比8.14--10.84 m),並且具有更強的相關係數(0.713對比0.166--0.652)。消融實驗顯示,多模態融合在性能上比僅RGBI訓練提高了10--26%,而且具備適當模型容量的非對稱蒸餾是關鍵。該方法共同預測CHM、PAI和FHD,這是一種當前單目CHM估計器所不具備的多指標能力,儘管PAI/FHD的轉移仍然依賴於區域,並受益於本地校準。該框架在時間不匹配(冬季LiDAR,夏季RGBI)下仍然有效,消除了嚴格的共同獲取限制,並為數位雙胞胎德國和國家數位正射影像計畫等工作流程實現可擴展的20厘米操作監測。
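The 29--46% figure in the FSKD abstract above can be checked directly from the reported MAE values. The short calculation below assumes the percentage is the relative MAE reduction with respect to each baseline, which is consistent with the numbers given.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - ours) / baseline

student_mae = 5.81              # metres, as reported in the abstract
baseline_maes = (8.14, 10.84)   # best and worst HRCHM/DAC baseline MAE

reductions = [relative_reduction(b, student_mae) for b in baseline_maes]
print([round(r, 1) for r in reductions])
```

Note that the reported MedAE (4.17 m) is lower than the MAE (5.81 m), as expected when a minority of large errors pulls the mean above the median.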
DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
2604.01765v1 by Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, Steven L. Waslander
Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying the reasoning and instruction-following capabilities of the former with the spatio-temporal world modeling of the latter. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
摘要:最近,世界行動模型(WAM)已經出現,以橋接視覺-語言-行動(VLA)模型和世界模型,統一它們的推理和遵循指令的能力以及時空世界建模。然而,現有的WAM方法往往專注於建模2D外觀或潛在表示,幾何基礎有限——這對於在物理世界中運作的具身系統來說是一個基本要素。我們提出了DriveDreamer-Policy,一個統一的駕駛世界行動模型,將深度生成、未來視頻生成和運動規劃整合在一個單一的模組化架構中。該模型使用大型語言模型來處理語言指令、多視角圖像和行動,隨後由三個輕量級生成器生成深度、未來視頻和行動。通過學習一個幾何感知的世界表示並利用它來指導統一框架內的未來預測和規劃,所提出的模型產生了更連貫的想像未來和更具信息性的駕駛行動,同時保持模組化和可控的延遲。在Navsim v1和v2基準上的實驗表明,DriveDreamer-Policy在閉環規劃和世界生成任務上都達到了強勁的表現。特別是,我們的模型在Navsim v1上達到了89.2 PDMS,在Navsim v2上達到了88.7 EPDMS,超越了現有的基於世界模型的方法,同時產生了更高質量的未來視頻和深度預測。消融研究進一步顯示,明確的深度學習為視頻想像提供了互補的好處,並提高了規劃的穩健性。
FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
2604.01762v1 by Juyong Jiang, Fan Wang, Hong Qi, Sunghun Kim, Jing Tang
Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
摘要:參數高效微調(PEFT)已成為在受限計算預算下調整大型語言模型(LLMs)的關鍵範式。
然而,標準的PEFT方法在多任務微調環境中往往面臨挑戰,因為多樣的優化目標會引起任務干擾,而有限的參數預算則導致表徵不足。
雖然最近的方法結合了專家混合(MoE)來緩解這些問題,但它們主要在空間域中運作,這可能會引入結構冗餘和參數開銷。
為了克服這些限制,我們在頻譜域中重新定義了適應。
我們的頻譜分析顯示,不同任務展現出不同的頻率能量分佈,並且LLM層顯示出異質的頻率敏感性。
受到這些見解的啟發,我們提出了FourierMoE,將MoE架構與逆離散傅立葉變換(IDFT)結合,用於頻率感知的適應。
具體而言,FourierMoE使用頻率自適應路由器將標記分派給專注於不同頻率帶的專家。
每個專家學習一組共軛對稱的複數係數,保留完整的相位和幅度信息,同時理論上保證無損IDFT重建為實值空間權重。
在28個基準、各種模型架構和規模上的廣泛評估顯示,FourierMoE在單任務和多任務環境中始終超越競爭基準,同時使用顯著較少的可訓練參數。
這些結果突顯了頻譜域專家適應作為LLM微調的一種有效且參數高效的範式的潛力。
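The FourierMoE abstract above relies on a standard property of the discrete Fourier transform: if the spectrum is conjugate-symmetric, its inverse transform is real-valued. The sketch below demonstrates this in one dimension using only the standard library. It is a toy check of that property, not the paper's expert parameterization; the function names and the 1-D setting are assumptions.

```python
import cmath

def idft(coeffs):
    """Naive inverse DFT: x[t] = (1/n) * sum_k X[k] * exp(2*pi*i*k*t/n)."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(coeffs)) / n
        for t in range(n)
    ]

def conjugate_symmetric(half):
    """Expand free coefficients X[0..n/2] (even n) into a full spectrum
    with X[n-k] = conj(X[k]); X[0] and X[n/2] are forced to be real."""
    n = 2 * (len(half) - 1)
    spectrum = [complex(c) for c in half] + [0j] * (n - len(half))
    spectrum[0] = complex(spectrum[0].real, 0.0)
    spectrum[n // 2] = complex(spectrum[n // 2].real, 0.0)
    for k in range(1, n // 2):
        spectrum[n - k] = spectrum[k].conjugate()
    return spectrum

# Five free coefficients define an 8-point conjugate-symmetric spectrum.
learned = [1.0, 0.5 + 2.0j, -1.5j, 0.25 - 0.75j, 2.0]
weights = idft(conjugate_symmetric(learned))
print(max(abs(w.imag) for w in weights))  # imaginary parts vanish up to rounding

# Without the symmetry constraint the result is genuinely complex.
unconstrained = idft([0, 1j, 0, 0, 0, 0, 0, 0])
print(max(abs(w.imag) for w in unconstrained))
```

Because only about half of the coefficients are free, the constraint also halves the number of complex values that need to be stored for a given spectrum length.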
LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
2604.01754v1 by Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani
Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
摘要:數學推理是人類智慧的標誌,而大型語言模型(LLMs)是否能夠有意義地執行這一點仍然是人工智慧和認知科學中的一個核心問題。隨著LLMs越來越多地融入科學工作流程,對其數學能力的嚴格評估成為一項實際的必要性。現有的基準受到合成環境和數據污染的限制。我們提出了LiveMathematicianBench,這是一個動態的多選基準,用於研究級數學推理,基於最近在模型訓練截止日期後發表的arXiv論文。通過將評估基於新發表的定理,它提供了一個超越記憶模式的現實測試平台。該基準引入了一個包含十三類定理類型的邏輯分類法(例如,蘊涵、等價、存在性、唯一性),使得在推理形式之間的細緻評估成為可能。它採用了一個基於證明草圖的干擾選項管道,利用高級證明策略構建看似合理但無效的答案選擇,反映出誤導性的證明方向,從而提高對真正理解的敏感度,而非表面匹配。我們還引入了一個抗替代機制,以區分答案識別和實質性推理。評估顯示該基準遠未飽和:最佳模型Gemini-3.1-pro-preview僅達到43.5%。在抗替代評估下,準確率急劇下降:GPT-5.4的得分最高為30.6%,而Gemini-3.1-pro-preview降至17.6%,低於20%的隨機基線。一種雙模式協議顯示,證明草圖的訪問帶來了一致的準確性提升,這表明模型可以利用高級證明策略進行推理。總體而言,LiveMathematicianBench提供了一個可擴展的、抗污染的測試平台,用於研究LLMs中的研究級數學推理。
Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text
2604.01745v1 by Melania Berbatova, Tsvetoslav Vasev
Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nuanced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have potential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in the Bulgarian language. Then, we compose a dataset that comprises 4,384 manually annotated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic language, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a component of toxic content detection systems.
摘要:有毒內容檢測在在線通信中仍然是一個重大挑戰,當前的解決方案往往無意中阻止了有價值的信息,包括醫學術語和與少數群體相關的文本。
本文提出了一種更細緻的方法來識別保加利亞文本中的有毒性,同時保留對重要信息的訪問。
研究探討了兩種不同的方法來檢測有毒內容。
所開發的方法在各種在線平台和內容審核系統中具有潛在的應用。
首先,我們提出了一個本體,對保加利亞語中潛在的有毒詞彙進行建模。
然後,我們組建了一個數據集,該數據集包含來自保加利亞在線論壇的4,384個手動標註的句子,分為四個類別:有毒語言、醫學術語、非有毒語言和與少數社群相關的術語。
然後,我們訓練了一個基於BERT的模型來進行有毒語言分類,該模型達到了0.89的F1宏觀得分。
訓練好的模型可以直接應用於實際環境中,並可以作為有毒內容檢測系統的一個組件進行整合。
AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows
2604.01738v1 by Chuhan Qiao, Jinglai Zheng, Jie Huang, Buyue Zhao, Fan Li, Haiming Huang
Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs, treating generation as single-pass text completion, fail to satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. To address this, we propose AeroTherm-GPT, the first TPS-specialized LLM Agent, instantiated through a Constraint-Closed-Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co-resolution probabilities. This upstream-priority mechanism resolves multiple downstream violations per action, achieving a Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair. Evaluated on HyTPS-Bench and validated against external benchmarks, AeroTherm-GPT achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), a gain of +12.5 pp over the matched non-CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.
摘要:將大型語言模型 (LLMs) 整合到超音速熱保護系統 (TPS) 設計中,因生成可執行的模擬工件時出現級聯約束違反而受到瓶頸。通用 LLMs 將生成視為單次文本完成,無法滿足安全關鍵工程工作流程中固有的序列多閘約束。為了解決這個問題,我們提出了 AeroTherm-GPT,第一個專門針對 TPS 的 LLM 代理,通過約束閉環生成 (CCLG) 框架實現。CCLG 將 TPS 工件生成組織為一個迭代工作流程,包括生成、驗證、CDG 引導的修復、執行和審核。約束依賴圖 (CDG) 編碼了約束類別之間的實證共同解決結構,根據生命週期排序的先驗和實證共同解決概率,將修復指向上游故障候選。這一上游優先機制每個行動解決多個下游違規,實現了 4.16 的根本原因修復效率,相較於 1.76 的平面清單修復。經過在 HyTPS-Bench 上評估並與外部基準驗證,AeroTherm-GPT 實現了 88.7% 的端到端成功率 (95% CI: 87.5-89.9),相比匹配的非 CDG 消融基線提高了 +12.5 個百分點,且在科學推理和代碼生成任務上沒有出現災難性遺忘。
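The upstream-priority repair idea in the AeroTherm-GPT abstract above can be pictured with a toy dependency graph. The sketch below repairs violated constraints starting from those with no violated ancestor, and assumes (as a simplification) that fixing an upstream constraint also resolves every violated constraint downstream of it. The constraint names, the graph, and this resolution rule are all hypothetical; the paper's CDG uses empirical co-resolution probabilities that are not modelled here.

```python
def descendants(graph, node):
    """All nodes reachable from `node` in a DAG given as {parent: [children]}."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def upstream_first_repair(graph, violated):
    """Repeatedly repair a violated constraint that has no violated ancestor."""
    remaining, actions = set(violated), []
    while remaining:
        roots = [
            v for v in remaining
            if not any(v in descendants(graph, u) for u in remaining if u != v)
        ]
        root = sorted(roots)[0]  # deterministic tie-break for the example
        resolved = ({root} | descendants(graph, root)) & remaining
        actions.append((root, sorted(resolved)))
        remaining -= resolved
    return actions

# Hypothetical constraint categories, ordered upstream -> downstream.
cdg = {
    "units": ["boundary_conditions", "material_properties"],
    "boundary_conditions": ["solver_convergence"],
    "material_properties": ["solver_convergence"],
    "mesh": ["solver_convergence"],
}
violated = {"units", "boundary_conditions", "solver_convergence", "mesh"}

actions = upstream_first_repair(cdg, violated)
efficiency = len(violated) / len(actions)  # violations resolved per repair action
print(actions, efficiency)
```

In this toy setting a flat checklist that fixes one violation per action would score 1.0 on the same ratio, which conveys why targeting upstream faults raises the violations-per-action figure.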
The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs
2604.01728v1 by Wilf Morlidge, Elliott Watkiss-Leek, George Hannah, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis
Achieving semantic interoperability across heterogeneous experimental data systems remains a major barrier to data-driven scientific discovery. The Analytical Information Markup Language (AnIML), a flexible XML-based standard for analytical chemistry and biology, is increasingly used in industrial R&D labs for managing and exchanging experimental data. However, the expressivity of the XML schema permits divergent interpretations across stakeholders, introducing inconsistencies that undermine the interoperability the AnIML schema was designed to support. In this paper, we present the AnIML Ontology, an OWL 2 ontology that formalises the semantics of AnIML and aligns it with the Allotrope Data Format to support future cross-system and cross-lab interoperability. The ontology was developed using an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. We validate the ontology through a multi-layered approach: data-driven transformation of real-world AnIML files into knowledge graphs, competency question verification via SPARQL, and a novel validation protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.
摘要:實現異質實驗數據系統之間的語義互操作性仍然是數據驅動科學發現的一大障礙。分析信息標記語言(AnIML)是一種靈活的基於XML的標準,用於分析化學和生物學,越來越多地被工業研發實驗室用於管理和交換實驗數據。然而,XML架構的表達能力允許利益相關者之間存在不同的解釋,這引入了不一致性,削弱了AnIML架構所設計支持的互操作性。在本文中,我們提出了AnIML本體,一種OWL 2本體,正式化AnIML的語義並將其與Allotrope數據格式對齊,以支持未來的跨系統和跨實驗室互操作性。該本體是通過專家參與的方式開發的,結合了LLM輔助的需求引導和協作本體工程。我們通過多層次的方法驗證該本體:將現實世界的AnIML文件數據驅動地轉換為知識圖譜,通過SPARQL進行能力問題驗證,以及基於對抗性負能力問題的創新驗證協議,這些問題映射到已建立的本體反模式並通過SHACL約束強制執行。
LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis
2604.01725v1 by Zhihuan Wei, Xinhang Chen, Danyang Han, Yang Hu, Jie Liu, Xuewen Miao, Guijiang Li
General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8x, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios--such as safety-critical and auxiliary diagnosis--by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of "which sensor x which time period." Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework's favorable balance among efficiency, accuracy, and interpretability.
摘要:一般航空故障診斷和高效維護對於飛行安全至關重要;然而,在資源受限的邊緣設備上部署深度學習模型面臨計算能力和可解釋性兩方面的挑戰。本文提出了LiteInception——一種設計用於邊緣部署的輕量級可解釋故障診斷框架。該框架採用與標準維護工作流程對齊的兩階段級聯架構:第一階段執行高召回率的故障檢測,第二階段對異常樣本進行細粒度故障分類,從而解耦優化目標並實現計算資源的按需分配。為了進行模型壓縮,提出了一種基於互信息、梯度分析和SE注意權重的多方法融合策略,將輸入傳感器通道從23個減少到15個,並引入了一種1+1分支的LiteInception架構,將InceptionTime參數壓縮70%,使CPU推理加速超過8倍,F1損失低於3%。此外,引入知識蒸餾作為精度-召回調節機制,使得相同的輕量級模型能夠通過切換訓練策略適應不同場景——例如安全關鍵和輔助診斷。最後,構建了一個整合四種歸因方法的雙層可解釋性框架,提供“哪個傳感器 x 哪個時間段”的可追溯證據鏈。在NGAFID數據集上的實驗顯示,故障檢測準確率為81.92%,召回率為83.24%,故障識別準確率為77.00%,驗證了該框架在效率、準確性和可解釋性之間的良好平衡。
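The two-stage cascade in the LiteInception abstract above can be sketched as a simple control flow: a cheap detector screens every input, and the fault classifier runs only on inputs flagged as anomalous. The scoring and classification functions below are toy stand-ins, and the threshold value is an assumption; only the cascade structure reflects the abstract.

```python
def cascade_diagnose(windows, anomaly_score, classify_fault, threshold=0.3):
    """Stage 1 screens every window; Stage 2 runs only on flagged windows.
    A low threshold favours recall in Stage 1."""
    labels, stage2_calls = [], 0
    for window in windows:
        if anomaly_score(window) < threshold:
            labels.append("normal")
            continue
        stage2_calls += 1
        labels.append(classify_fault(window))
    return labels, stage2_calls

# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_score(window):
    return max(abs(v) for v in window)

def toy_classify(window):
    worst = max(range(len(window)), key=lambda i: abs(window[i]))
    return f"fault_channel_{worst}"

windows = [[0.1, 0.0, 0.2], [0.05, 0.9, 0.1], [0.0, 0.1, 0.4]]
labels, stage2_calls = cascade_diagnose(windows, toy_score, toy_classify)
print(labels, stage2_calls)
```

Counting `stage2_calls` makes the on-demand allocation visible: the more expensive classifier is skipped for every window the detector considers normal.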
Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving
2604.01723v1 by Yun Li, Yidu Zhang, Simon Thompson, Ehsan Javanmardi, Manabu Tsukada
Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.
摘要:視覺-語言-行動(VLA)模型在自動駕駛中必須整合多樣的文本輸入,包括導航指令、危險警告和交通狀態描述,然而目前的系統往往將這些輸入呈現為不連貫的片段,迫使模型自行發現哪些環境限制與當前的操作相關。我們提出了因果場景敘述(CSN),通過意圖-約束對齊、定量基準和結構化分離,在推理時以零 GPU 成本重構 VLA 文本輸入。我們用基於 Simplex 的運行時安全監督和通過 Plackett-Luce DPO 進行的訓練時對齊,搭配負對數似然(NLL)正則化來補充 CSN。一項多城鎮閉環 CARLA 評估顯示,CSN 在原始 LMDrive 上提高了 +31.1% 的駕駛分數,在偏好對齊變體上提高了 +24.5%。一項控制的消融實驗顯示,因果結構佔這一增益的 39.1%,其餘部分僅歸因於信息內容。感知噪聲消融確認 CSN 的好處對現實感測錯誤具有穩健性。語義安全監督改善了違規分數,而反應式碰撞時間監測則降低了性能,顯示出對意圖的監測對 VLA 系統是必要的。
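The training-time alignment named in the CSN abstract above, Plackett-Luce DPO with NLL regularization, can be written compactly. The sketch below follows the Plackett-Luce generalization of DPO for a ranked list of responses, with implicit rewards defined as a scaled log-probability ratio against a reference model. Applying the NLL term to the top-ranked response, and the parameter names `beta` and `nll_weight`, are assumptions; the abstract does not give these details.

```python
import math

def plackett_luce_dpo_loss(policy_logps, ref_logps, beta=0.1, nll_weight=0.0):
    """Loss for responses ordered best -> worst.
    Implicit reward: r_k = beta * (log pi(y_k) - log pi_ref(y_k)).
    Plackett-Luce NLL: -sum_k [ r_k - log sum_{j >= k} exp(r_j) ]."""
    rewards = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    loss = 0.0
    for k in range(len(rewards)):
        tail = rewards[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(r - m) for r in tail))
        loss -= rewards[k] - log_z
    # Assumed regularizer: negative log-likelihood of the top-ranked response.
    return loss + nll_weight * (-policy_logps[0])

ref = [-10.0, -10.0, -10.0]
aligned = plackett_luce_dpo_loss([-8.0, -10.0, -12.0], ref, beta=0.5)
reversed_order = plackett_luce_dpo_loss([-12.0, -10.0, -8.0], ref, beta=0.5)
print(aligned, reversed_order)
```

With two responses the expression reduces to the familiar pairwise DPO loss, the negative log-sigmoid of the reward difference, so a policy identical to the reference gives log 2.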
Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
2604.01712v1 by Feiyu Zhou, Marios Impraimakis
This work examines the wind-induced structural response forecasting capabilities of a novel transformer methodology. The model also provides a digital twin component for bridge structural health monitoring. First, the approach uses the temporal characteristics of the system to train a forecasting model. Second, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms other response forecasting approaches because it needs no assumption on wind stationarity or on normal structural vibration behavior. Specifically, predictions of wind-excited dynamic behavior tend to degrade when the environmental or traffic conditions change, which makes it hard to distinguish what constitutes normal vibration behavior. To this end, the framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach captures structural behavior accurately in realistic conditions and under changes in the system excitation. Importantly, the results highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system's lifecycle with respect to temporal characteristics.
摘要:風引起的結構反應預測能力在這裡檢驗了一種新型Transformer方法。該模型還提供了一個數字雙胞胎組件,用於橋樑結構健康監測。首先,該方法利用系統的時間特徵來訓練預測模型。其次,將振動預測與實測數據進行比較,以檢測大偏差。最後,識別出的案例用作結構變化的早期預警指標。基於人工智慧的模型在反應預測方面表現優於其他方法,因為不需要對風的穩定性或結構的正常振動行為做出假設。具體而言,風激發的動態行為受到不確定性的影響,當環境或交通條件改變時,會導致預測不佳。這使得正常振動行為的界定變得困難。為此,該框架在挪威科技大學監測的哈爾丹格橋的實際測量數據上進行了嚴格檢驗。該方法在現實條件下捕捉到準確的結構行為,並考慮到系統激勵的變化。結果重要地突顯了基於Transformer的數字雙胞胎組件作為下一代工具的潛力,用於彈性基礎設施管理、持續學習和在系統生命周期內根據時間特徵進行自適應監測。
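The second and third steps in the abstract above (compare predictions with measurements, then treat large deviations as an early warning) can be sketched as a residual check. The rule below flags time steps whose residual lies more than `k` standard deviations from the mean residual observed during a reference period. The threshold rule, `k=3.0`, and all names are assumptions; the abstract does not state how "large deviations" are defined.

```python
import statistics

def flag_deviations(predicted, measured, baseline_residuals, k=3.0):
    """Return indices where the forecast residual is unusually large
    compared with residuals collected under normal conditions."""
    mu = statistics.mean(baseline_residuals)
    sigma = statistics.stdev(baseline_residuals)
    return [
        i for i, (p, m) in enumerate(zip(predicted, measured))
        if abs((m - p) - mu) > k * sigma
    ]

# Residuals (measured - predicted) from a period assumed to be healthy.
baseline = [0.1, -0.1, 0.05, -0.05, 0.0]

predicted = [1.0, 1.0, 1.0]
measured = [1.05, 1.5, 0.9]
print(flag_deviations(predicted, measured, baseline))
```

In practice the flagged indices would feed a persistence check, since isolated outliers are more likely sensor noise than structural change.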
Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition
2604.01711v1 by Truc Nguyen, Then Tran, Binh Truong, Phuoc Nguyen T. H
Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
摘要:越南語語音情感識別(SER)由於模糊的聲學模式和缺乏可靠的標註數據,仍然面臨挑戰,特別是在情感邊界不明確的現實條件下。為了解決這個問題,本文提出了一個人機協作框架,將人類知識整合到學習過程中,而不是僅僅依賴數據驅動的模型。所提出的框架圍繞基於大語言模型(LLM)的推理展開,其中使用基於聲學特徵的模型提供輔助信號,例如信心和特徵級證據。引入了一種基於信心的路由機制,以區分簡單和模糊的樣本,允許不確定的案例委託給LLM進行更深入的推理,這些推理受到來自人類標註行為的結構化規則的指導。此外,採用了一種迭代精煉策略,通過錯誤分析和規則更新不斷提高系統性能。在一個包含2,764個樣本的越南語語音數據集上進行了實驗,涵蓋三個情感類別(平靜、憤怒、驚慌),具有高的標註者間一致性(Fleiss Kappa = 0.8574),確保了可靠的真實標準。所提出的方法達到了強大的性能,準確率高達86.59%,宏觀F1約為0.85-0.86,顯示出其在處理模糊和難以分類的案例中的有效性。總體而言,這項工作突顯了將數據驅動模型與人類推理相結合的重要性,提供了一種強健且與模型無關的語音情感識別方法,適用於資源匱乏的環境。
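The confidence-based routing described in the abstract above can be sketched as a single decision: accept the acoustic model's label when it is confident, otherwise hand the sample to an LLM reasoning step. The threshold of 0.8 and the stub that stands in for the LLM (including its made-up rule) are hypothetical; only the routing pattern comes from the abstract.

```python
def route(sample_id, class_probs, llm_reason, threshold=0.8):
    """Return (label, source). Confident predictions stay with the acoustic
    model; ambiguous ones are delegated to the LLM reasoning step."""
    label, confidence = max(class_probs.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return label, "acoustic"
    return llm_reason(sample_id, class_probs), "llm"

def rule_based_llm_stub(sample_id, class_probs):
    """Placeholder for an LLM call guided by annotation-derived rules.
    The rule here is invented: treat any notable panic evidence as panic."""
    if class_probs.get("panic", 0.0) >= 0.2:
        return "panic"
    return max(class_probs, key=class_probs.get)

easy = route("utt_001", {"calm": 0.9, "angry": 0.05, "panic": 0.05}, rule_based_llm_stub)
hard = route("utt_002", {"calm": 0.4, "angry": 0.35, "panic": 0.25}, rule_based_llm_stub)
print(easy, hard)
```

Keeping the LLM behind a callable makes the router model-agnostic: the acoustic classifier or the reasoning backend can be swapped without changing the routing logic.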