
arxiv-daily

Automated deployment @ 2026-04-27 21:45:04 Asia/Taipei

Contributions are welcome! Add your topics and keywords in topic.yml. Historical data is available through the storage.

AI

Medical

Publish Date Title Authors Homepage Code
2026-04-24 FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records Hojjat Karami et.al. 2604.22534v1 null
2026-04-24 CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease Bulent Soykan et.al. 2604.22428v1 null
2026-04-24 Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems Meghana Karnam et.al. 2604.22154v1 null
2026-04-23 Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations Nalin Poungpeth et.al. 2604.22109v1 null
2026-04-23 Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake Guan Gui et.al. 2604.22067v1 null
2026-04-23 Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores Shevya Pandya et.al. 2604.22063v1 null
2026-04-23 Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching Xiaodi Li et.al. 2604.22061v1 null
2026-04-23 EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms Brian VanVoorst et.al. 2604.22036v1 null
2026-04-23 Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models Naheed Rayhan et.al. 2604.21860v1 null
2026-04-23 Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos Bowen Liu et.al. 2604.21814v1 null
2026-04-23 Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications Yvon K. Awuklu et.al. 2604.21793v1 null
2026-04-23 Causal Disentanglement for Full-Reference Image Quality Assessment Zhen Zhang et.al. 2604.21654v1 null
2026-04-23 Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach Eli Gildish et.al. 2604.21651v1 null
2026-04-23 Unbiased Prevalence Estimation with Multicalibrated LLMs Fridolin Linder et.al. 2604.21549v1 null
2026-04-23 Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation Michele Miranda et.al. 2604.21421v1 null
2026-04-23 Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models Muhammad Shafique et.al. 2604.21952v1 null
2026-04-23 Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages Michael Bouzinier et.al. 2604.21263v1 null
2026-04-22 Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction Abhishek Dharmaratnakar et.al. 2604.21154v1 null
2026-04-22 Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation Sanjoy Pator et.al. 2604.21076v1 null
2026-04-22 HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering Yuyu Liu et.al. 2604.21027v1 null
2026-04-22 Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics Open-H-Embodiment Consortium et.al. 2604.21017v1 null
2026-04-22 Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry Syed Nazmus Sakib et.al. 2604.20983v1 null
2026-04-22 Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs Mariano Barone et.al. 2604.20791v1 null
2026-04-22 MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills Yingyong Hou et.al. 2604.20441v1 null
2026-04-22 Surrogate modeling for interpreting black-box LLMs in medical predictions Changho Han et.al. 2604.20331v2 null
2026-04-22 Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA Zibo Xu et.al. 2604.20306v1 null
2026-04-21 From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI Patrick Vossler et.al. 2604.20055v1 null
2026-04-21 Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine Yusuf Kesmen et.al. 2604.20022v1 null
2026-04-21 scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics Qifeng Zhou et.al. 2604.20003v1 null
2026-04-21 Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning Palawat Busaranuvong et.al. 2604.19937v1 null
2026-04-21 Depression Risk Assessment in Social Media via Large Language Models Giorgia Gulino et.al. 2604.19887v1 null
2026-04-21 A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities Aya Cherigui et.al. 2604.19653v1 null
2026-04-21 Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models Kihyuk Lee et.al. 2604.19598v2 null
2026-04-21 Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity Farbod Zorriassatine et.al. 2604.19538v1 null
2026-04-21 Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents Vasundra Srininvasan et.al. 2604.19457v1 null
2026-04-21 Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications Abu Noman Md Sakib et.al. 2604.19281v1 null
2026-04-21 Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement Pritam Kar et.al. 2604.19191v1 null
2026-04-20 Regulating Artificial Intimacy: From Locks and Blocks to Relational Accountability Henry Fraser et.al. 2604.18893v1 null
2026-04-20 REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction Seowung Leem et.al. 2604.18757v1 null
2026-04-20 Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling Andrew Wang et.al. 2604.18753v1 null
2026-04-20 A multimodal and temporal foundation model for virtual patient representations at healthcare system scale Andrew Zhang et.al. 2604.18570v2 null
2026-04-20 ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification Florian Kittler et.al. 2604.18444v1 null
2026-04-20 Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support Eranga Bandara et.al. 2604.18302v1 null
2026-04-20 Style-Based Neural Architectures for Real-Time Weather Classification Hamed Ouattara et.al. 2604.18251v1 null
2026-04-20 Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies Lorenz Brehme et.al. 2604.18234v1 null
2026-04-20 Does "Do Differentiable Simulators Give Better Policy Gradients?'' Give Better Policy Gradients? Ku Onoda et.al. 2604.18161v1 null
2026-04-20 Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework Cong Huy Nguyen et.al. 2604.18145v1 null
2026-04-20 Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning Khalil Akremi et.al. 2604.19823v1 null
2026-04-20 First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows Sihao Xing et.al. 2604.18038v1 null
2026-04-20 AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis Nathasha Naranpanawa et.al. 2604.17846v1 null
2026-04-20 MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models Suhyun Lee et.al. 2604.17730v1 null
2026-04-20 RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models Arya Hadizadeh Moghaddam et.al. 2604.17725v1 null
2026-04-20 Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals Jon-Paul Cacioli et.al. 2604.17714v1 null
2026-04-20 Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report Jon-Paul Cacioli et.al. 2604.17707v1 null
2026-04-19 STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments Md Mezbahul Islam et.al. 2604.17611v1 null
2026-04-19 T-DuMpRa: Teacher-guided Dual-path Multi-prototype Retrieval Augmented framework for fine-grained medical image classification Zixuan Tang et.al. 2604.17360v1 null
2026-04-19 PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations Patrick Keough et.al. 2604.17359v1 null
2026-04-19 Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA Alberto Testoni et.al. 2604.17316v1 null
2026-04-19 Chaos-Enhanced Prototypical Networks for Few-Shot Medical Image Classification Chinthakuntla Meghan Sai et.al. 2604.17300v1 null
2026-04-19 Region-Affinity Attention for Whole-Slide Breast Cancer Classification in Deep Ultraviolet Imaging Nagur Shareef Shaik et.al. 2604.17222v1 null
2026-04-19 Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition Nwe Ni Win et.al. 2604.17214v1 null
2026-04-19 DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation Nagur Shareef Shaik et.al. 2604.17209v1 null
2026-04-19 CDSA-Net:Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography Si Li et.al. 2604.17208v1 null
2026-04-19 Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training Weibing Zheng et.al. 2604.17186v1 null
2026-04-18 If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data Yanjun Cui et.al. 2604.17133v1 null
2026-04-18 A Two-Stage Deep Learning Framework for Segmentation of Ten Gastrointestinal Organs from Coronal MR Enterography Ashiqur Rahman et.al. 2604.17118v1 null
2026-04-18 Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization Weijie Wan et.al. 2604.17051v1 null
2026-04-18 Light-Adapted Electroretinogram and Oscillatory Potentials (LEOPs) Dataset for Autism Spectrum Disorder and Typically Developing Individuals Paul A. Constable et.al. 2604.16981v1 null
2026-04-18 Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models Bruce A. Bassett et.al. 2604.16980v1 null
2026-04-18 Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction Liyin Chen et.al. 2604.16955v1 null
2026-04-18 Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach Riza Alaudin Syah et.al. 2604.16953v1 null
2026-04-18 Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts Gabriel Jason Lee et.al. 2604.16926v1 null
2026-04-18 Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models Inhyeok Lee et.al. 2604.16775v1 null
2026-04-17 CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction Jianyou Wang et.al. 2604.16742v1 null
2026-04-17 Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis Ayhan Can Erdur et.al. 2604.16729v1 null
2026-04-17 A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction Dingyi Zhang et.al. 2604.16655v1 null
2026-04-17 MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation Yi Lin et.al. 2604.16175v1 null
2026-04-17 Hybrid Spectro-Temporal Fusion Framework for Structural Health Monitoring Jongyeop Kim et.al. 2604.16589v1 null
2026-04-17 Large Language Models Meet Biomedical Knowledge Graphs for Mechanistically Grounded Therapeutic Prioritization Chih-Hsuan Wei et.al. 2604.19815v1 null
2026-04-17 Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors Jessica H. Zhu et.al. 2604.16132v1 null
2026-04-17 Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration Baramee Sukumal et.al. 2604.16104v1 null
2026-04-17 Towards Trustworthy Depression Estimation via Disentangled Evidential Learning Fangyuan Liu et.al. 2604.16579v1 null
2026-04-17 QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals Jeremy Qin et.al. 2604.15859v1 null
2026-04-17 Stein Variational Black-Box Combinatorial Optimization Thomas Landais et.al. 2604.15837v1 null
2026-04-17 Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI Lama Moukheiber et.al. 2604.15808v1 null
2026-04-17 KWBench: Measuring Unprompted Problem Recognition in Knowledge Work Ankit Maloo et.al. 2604.15760v1 null
2026-04-17 SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification Enhui Chai et.al. 2604.15711v1 null
2026-04-17 CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder Duy-Phuong Dao et.al. 2604.15611v1 null
2026-04-16 Robustifying and Selecting Cohort-Appropriate Prognostic Models under Distributional Shifts Dimitris Bertsimas et.al. 2604.16537v1 null
2026-04-16 Towards Reliable Testing of Machine Unlearning Anna Mazhar et.al. 2604.16536v1 null
2026-04-16 A Q-learning-based QoS-aware multipath routing protocol in IoMT-based wireless body area network Mehdi Hosseinzadeh et.al. 2604.15489v1 null
2026-04-16 Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models Emily Curl et.al. 2604.16532v1 null
2026-04-16 RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference Yuxin Liu et.al. 2604.15459v1 null
2026-04-16 DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI Zhizheng Wang et.al. 2604.15456v1 null
2026-04-16 SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation Tianhao Fu et.al. 2604.15271v2 null
2026-04-16 RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography Mélanie Roschewitz et.al. 2604.15231v1 null
2026-04-16 Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF Nicklas Neu et.al. 2604.16528v1 null
2026-04-16 Hybrid Decision Making via Conformal VLM-generated Guidance Debodeep Banerjee et.al. 2604.14980v2 null
2026-04-16 Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels? Amy Rouillard et.al. 2604.14892v2 null
2026-04-16 MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry Meng-Xun Li et.al. 2604.14866v1 null

Abstracts

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

2604.22534v1 by Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.

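Summary: Feature engineering for electronic health records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and the structural sparsity inherent in clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, which limits their applicability to real-world EHR data. We propose FeatEHR-LLM, a framework that uses large language models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit exposure of patient privacy, the LLM operates only on dataset schemas and task descriptions, not on raw patient records. A tool-augmented generation mechanism gives the LLM specialized routines for querying irregular temporal data, allowing it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Across eight clinical prediction tasks evaluated on four ICU datasets, the framework achieves the highest mean AUROC on 7 of the 8 tasks, improving on strong baselines by up to 6 percentage points. Code is available at github.com/hojjatkarami/FeatEHR-LLM.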
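To make "features that explicitly handle uneven observation patterns and informative sparsity" concrete, here is a minimal sketch of what such feature-extraction code might compute for a single variable. It is not taken from the FeatEHR-LLM repository: the function name `irregular_features`, the feature names, and the sample values are all assumptions chosen for illustration.

```python
from statistics import mean

def irregular_features(times, values, window_end):
    """Summarize one irregularly sampled variable observed up to window_end.

    times  -- observation times in hours, ascending (hypothetical input format)
    values -- measurements aligned with times
    Returns a flat dict of tabular features; all feature names are illustrative.
    """
    obs = [(t, v) for t, v in zip(times, values) if t <= window_end]
    if not obs:
        # Missingness itself can be informative, so it is recorded explicitly.
        return {"count": 0, "measured": 0, "last_value": None,
                "hours_since_last": None, "mean_gap": None,
                "time_weighted_mean": None}
    ts = [t for t, _ in obs]
    vs = [v for _, v in obs]
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    # Weight each value by how long it remained the most recent observation,
    # so densely sampled stretches do not dominate the average.
    durations = gaps + [window_end - ts[-1]]
    total = sum(durations)
    twm = sum(v * d for v, d in zip(vs, durations)) / total if total > 0 else mean(vs)
    return {
        "count": len(obs),
        "measured": 1,
        "last_value": vs[-1],
        "hours_since_last": window_end - ts[-1],
        "mean_gap": mean(gaps) if gaps else None,
        "time_weighted_mean": twm,
    }

# Made-up measurements at hours 0, 1 and 5 within a 6-hour window.
features = irregular_features([0.0, 1.0, 5.0], [80.0, 90.0, 100.0], window_end=6.0)
```

The time-weighted mean differs from a plain mean whenever sampling is uneven, which is one simple way a tabular feature can reflect the observation pattern rather than ignore it.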

CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease

2604.22428v1 by Bulent Soykan, Gulsah Hancerliogullari Koksalmis, Hsin-Hsiung Huang, Laura J. Brattain

Predicting individual cognitive decline in Alzheimer's disease (AD) is difficult due to the heterogeneity of disease progression. Reliable clinical tools require not only high accuracy but also fairness across demographics and robustness to missing data. We present CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework using data from 1,666 patients in the TADPOLE (Alzheimer's Disease Neuroimaging Initiative) dataset. We assessed the model for prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate and personalized predictions of cognitive decline. Its demonstrated fairness across patient demographics and resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.

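Summary: Predicting individual cognitive decline in Alzheimer's disease (AD) is difficult because disease progression is heterogeneous. Reliable clinical tools need not only high accuracy but also fairness across demographic groups and robustness to missing data. We propose CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework on data from 1,666 patients in the TADPOLE (Alzheimer's Disease Neuroimaging Initiative) dataset. We assessed the model's prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate, personalized predictions of cognitive decline. Its fairness across patient demographics and its resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.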

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

2604.22154v1 by Meghana Karnam, Ananya Joshi

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while keeping false negative rates similar across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.

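Summary: Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks such as assessing self-harm risk and screening for depression. However, common evaluation approaches such as LLM-as-a-judge do not indicate when a decision is reliable or how errors accumulate across multiple LLM judgements, which limits their suitability for safety-critical settings. We propose a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs), offering principled, adaptive decision-making as an alternative to heuristic voting. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level confidence bounds on performance, (2) a bandit-based adaptive sampling strategy driven by input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth at deployment. We evaluate the system on two labeled behavioral health datasets: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, the adaptive sampling strategy achieves the lowest false positive rate on both datasets, 0.095 on AEGIS 2.0 versus 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while keeping false negative rates similar across all conditions. These results indicate that principled adaptive sampling provides a meaningful improvement in precision without lowering recall.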
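The reported 40% reduction follows directly from the two false positive rates, and the idea of sampling an agent more often on harder inputs can be sketched with a simple stopping rule. The block below is a sketch under stated assumptions, not the paper's method: a standard Hoeffding bound stands in for the paper's tighter confidence bounds, a sequential stopping rule stands in for its bandit-based strategy, and `adaptive_vote` and `sample_agent` are hypothetical names.

```python
import math
from itertools import cycle

def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved."""
    return (baseline - improved) / baseline

def hoeffding_half_width(n, delta=0.05):
    """Half-width of a two-sided Hoeffding interval for a mean of n values in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def adaptive_vote(sample_agent, max_samples=25, delta=0.05):
    """Query a stochastic binary agent until its vote share is clearly away from 0.5.

    Easy inputs (consistent answers) stop early; ambiguous inputs use the full budget.
    Returns (decision, number_of_samples_used).
    """
    positives = 0
    for n in range(1, max_samples + 1):
        positives += 1 if sample_agent() else 0
        share = positives / n
        if abs(share - 0.5) > hoeffding_half_width(n, delta):
            break
    return share >= 0.5, n

# False positive rates quoted in the abstract: 0.159 (single agent) vs 0.095 (adaptive).
reduction = relative_reduction(0.159, 0.095)

# A perfectly consistent agent versus one that alternates its answer (both hypothetical).
easy = adaptive_vote(lambda: True)
answers = cycle([True, False])
hard = adaptive_vote(lambda: next(answers))
```

Under this rule the consistent agent is resolved after a handful of queries, while the alternating agent exhausts the budget, which is the qualitative behavior "adaptive sampling based on input difficulty" describes.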

Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

2604.22109v1 by Nalin Poungpeth, Nicholas Clark, Tanu Mitra

Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLMs in persuading users, as well as the perception of models as objective and impartial.

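Summary: Large language models (LLMs) have strong persuasive capabilities and outperform humans in head-to-head comparisons. Users report consulting LLMs to help make major life decisions about relationships, in medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts to produce the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," characterized by the implicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily needed. We audit five LLMs to reveal how often, and through which techniques, spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. We also compare the distribution of spontaneous persuasion produced by LLMs with that of human responses on the same topics collected from Reddit. We find that LLMs spontaneously persuade the user in almost all conversations and rely heavily on information-based strategies such as appeals to logic or quantitative evidence. This holds across models and user response styles, although conversations about mental health show higher rates of appraisal-based and emotion-based strategies. Human responses, by contrast, tend to use strategies that generate social influence, such as negative emotional appeals and non-expert testimony. This difference may explain both the effectiveness of LLMs in persuading users and the perception of models as objective and impartial.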

Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

2604.22067v1 by Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi

Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.

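Summary: Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous answers within limited time. Despite growing interest in conversational AI for healthcare, infrastructure for conversational AI in this application remains limited. We therefore formulate the task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark built on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes under 5 behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions covering four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply when patient behavior is less amenable to field recovery, especially under guarded-concise conditions. These findings indicate that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.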
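The question-selection framing (a question bank, known target fields, and a limited interaction budget) can be illustrated with a simple greedy baseline. This sketch is an assumption for illustration only: it is neither the paper's LLM-guided policy nor its intake-form baseline, it assumes every asked question fully recovers its fields (so guarded or concise patient behavior is not modeled), and the tiny question bank is invented.

```python
def greedy_questions(question_fields, target_fields, budget):
    """Pick up to `budget` questions, each time taking the one that covers
    the most target fields not yet recovered.

    question_fields -- dict mapping a question id to the set of fields it can recover
    Returns (asked question ids in order, fields still missing).
    """
    remaining = set(target_fields)
    asked = []
    while remaining and len(asked) < budget:
        best = max(
            (q for q in question_fields if q not in asked),
            key=lambda q: len(question_fields[q] & remaining),
            default=None,
        )
        # Stop when no unused question adds any new field.
        if best is None or not question_fields[best] & remaining:
            break
        asked.append(best)
        remaining -= question_fields[best]
    return asked, remaining

# Hypothetical four-question bank and five target fields.
bank = {
    "sleep": {"sleep", "energy"},
    "mood": {"mood"},
    "substances": {"alcohol", "drugs", "tobacco"},
    "history": {"mood", "sleep"},
}
targets = {"sleep", "energy", "mood", "alcohol", "drugs"}
asked, missing = greedy_questions(bank, targets, budget=2)
```

With a budget of two, one target field stays unrecovered, which mirrors the abstract's point that reaching the right topics within the budget matters as much as interpreting the answers.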

Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

2604.22063v1 by Shevya Pandya, Shinjini Bose, Ananya Joshi

Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for reliability auditing downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior like this before clinical deployments.

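Summary: Large language models (LLMs) are increasingly used in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic bias and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, yet there is still no systematic way to assess these issues in the psychiatric domain. We propose a reliability auditing approach that structures evaluation around the effects of prompt design and of medically insignificant inputs on predicted hospitalization risk scores, which are often the first downstream AI clinical decision-making task. In our audit, we generate a cohort of synthetic patient profiles (n = 50), each containing 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini). Including medically insignificant variables produced a statistically significant increase in the absolute mean predicted hospitalization risk and in output variability across all models and prompts, indicating that predictive stability decreases as contextual noise increases. Clinically insignificant features affected instability under many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify the sensitivity of LLM-based psychiatric risk assessments to non-clinical information and highlight the need for systematic evaluation of attributional stability and uncertainty behavior before clinical deployment.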
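The audit design described above (score each profile with and without clinically insignificant features, under each prompt framing, then compare level and spread) can be sketched as a small harness. Everything here is an assumption for illustration: `audit` and `stub_score` are hypothetical names, the stub is a deterministic stand-in for an LLM call and is deliberately built to leak on extra fields, and the two profiles are invented.

```python
from statistics import mean, pstdev

def audit(score_fn, profiles, noise_features, prompts):
    """Compare risk scores with and without clinically insignificant features.

    score_fn(prompt, profile) -> risk score in [0, 1]; in a real audit this
    would wrap a model call, here it is any callable.
    """
    report = {}
    for prompt in prompts:
        base = [score_fn(prompt, p) for p in profiles]
        noisy = [score_fn(prompt, {**p, **noise_features}) for p in profiles]
        shifts = [abs(n - b) for n, b in zip(noisy, base)]
        report[prompt] = {
            "mean_abs_shift": mean(shifts),   # how far scores move on average
            "sd_base": pstdev(base),          # spread without noise features
            "sd_noisy": pstdev(noisy),        # spread with noise features
        }
    return report

def stub_score(prompt, profile):
    # Stand-in scorer that wrongly reacts to every extra field it sees.
    return min(1.0, 0.2 * profile["prior_admissions"] + 0.01 * (len(profile) - 1))

profiles = [{"prior_admissions": 0}, {"prior_admissions": 2}]
noise = {"favorite_color": "blue", "shoe_size": 42}
prompts = ["neutral", "logical", "human impact", "clinical judgment"]
report = audit(stub_score, profiles, noise, prompts)
```

A scorer that ignored the noise fields would yield a mean absolute shift of zero for every prompt, so any nonzero shift is the instability signal the audit is designed to surface.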

Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching

2604.22061v1 by Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong

Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.

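Summary: Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, which poses major challenges for scalability, generalization, and computational efficiency. Existing methods either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation with LLM-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation identifies clinically relevant segments in long EHRs and reduces input complexity, while LLMs encode the selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled with lightweight predictors, making downstream classification efficient and scalable. We evaluate the approach on several public benchmarks (n2c2, SIGIR, TREC 2021/2022) and on a real-world multimodal dataset from Mayo Clinic (MCPMD). The results show that retrieval-based information selection significantly reduces the computational burden while preserving clinically meaningful signals. We further show that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the lightweight pipeline performs comparably to end-to-end LLM approaches at substantially lower computational cost.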
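The first stage of the pipeline, selecting clinically relevant segments from a long record before any encoding happens, can be sketched with a purely lexical retriever. This is an assumption for illustration: the abstract does not say which retriever is used, so Jaccard token overlap stands in for it, the later stages (LLM encoding, dimensionality reduction, lightweight predictor) are not shown, and the record segments are synthetic.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_segments(ehr_segments, criterion, k=2):
    """Return the k record segments with the highest Jaccard overlap with the criterion."""
    query = tokenize(criterion)

    def overlap(segment):
        tokens = tokenize(segment)
        union = query | tokens
        return len(query & tokens) / len(union) if union else 0.0

    # sorted() is stable, so ties keep their original record order.
    return sorted(ehr_segments, key=overlap, reverse=True)[:k]

# Synthetic record segments and eligibility criterion.
segments = [
    "Patient reports seasonal allergies.",
    "HbA1c measured at 8.1 percent; type 2 diabetes diagnosed 2019.",
    "Family history of hypertension.",
    "Metformin started for type 2 diabetes.",
]
criterion = "Diagnosis of type 2 diabetes with HbA1c above 7 percent"
selected = select_segments(segments, criterion, k=2)
```

Only the selected segments would be passed on to the encoder, which is where the reduction in input length, and therefore in compute, comes from.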

EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

2604.22036v1 by Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix, Iran Roman, Bea Steers, Claudio Silva, Yuhan Shen, Zijia Lu, Shih-Po Lee, Ehsan Elhamifar

This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

2604.21860v1 by Naheed Rayhan, Sohely Jahan

Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source LLMs, marking a departure from conventional jailbreak approaches that typically depend on maintaining persistent conversational context. Our extensive evaluation across state-of-the-art models, including those from OpenAI, Anthropic, Google Gemini, Meta, and prominent open-source alternatives, uncovers significant variations in resilience to TTI attacks, with only select architectures exhibiting substantial inherent robustness. Our automated black-box evaluation framework also uncovers previously unknown model-specific vulnerabilities and attack surface patterns, especially within medical and high-stakes domains. We further compare TTI against established adversarial prompting methods and detail practical mitigation strategies, such as session-level context aggregation and deep alignment approaches. Our study underscores the urgent need for holistic, context-aware defenses and continuous adversarial testing to future-proof LLM deployments against evolving multi-turn threats.
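
The contrast between stateless moderation and session-level aggregation can be illustrated with a toy sketch. The per-turn risk scores, both thresholds, and the noisy-OR accumulation rule are assumptions made for illustration only; the abstract does not specify how the paper's mitigation aggregates context.

```python
def stateless_flags(turn_risks, threshold=0.5):
    """Judge every turn in isolation: flag only turns whose own risk is high."""
    return [risk >= threshold for risk in turn_risks]


def session_flags(turn_risks, threshold=0.6):
    """Accumulate risk over the session with a noisy-OR combination (assumed rule)."""
    flags = []
    benign_probability = 1.0
    for risk in turn_risks:
        # Probability that every turn seen so far was benign.
        benign_probability *= (1.0 - risk)
        flags.append(1.0 - benign_probability >= threshold)
    return flags


# Hypothetical scores from some per-message risk classifier.
turn_risks = [0.3, 0.3, 0.3, 0.3]
per_turn = stateless_flags(turn_risks)
per_session = session_flags(turn_risks)
```

In this toy setting no single turn crosses the per-turn threshold, while the accumulated session risk crosses its threshold on the third turn.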

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

2604.21814v1 by Bowen Liu, Li Yang, Shanshan Song, Mingyu Tang, Zhifang Gao, Qifeng Chen, Yangqiu Song, Huimin Chen, Xiaomeng Li

Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.
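
The three stages named in the abstract (candidate screening, grouping into contexts, clip-level aggregation) can be pictured with a deliberately simplified sketch. This is not the DiCE implementation: the temporal-gap grouping, the mean-score aggregation, and the names `weave_contexts` and `converge_evidence` are hypothetical stand-ins for the learned Context Weaver and Evidence Converger.

```python
def weave_contexts(candidates, max_gap=50):
    """Group (frame_index, score) candidates that lie close together in time."""
    contexts = []
    for idx, score in sorted(candidates):
        if contexts and idx - contexts[-1][-1][0] <= max_gap:
            contexts[-1].append((idx, score))   # extend the current context
        else:
            contexts.append([(idx, score)])     # start a new context
    return contexts


def converge_evidence(context, threshold=0.5):
    """Aggregate per-frame scores in one context into a clip-level judgment."""
    mean_score = sum(score for _, score in context) / len(context)
    return {
        "start": context[0][0],
        "end": context[-1][0],
        "score": mean_score,
        "positive": mean_score >= threshold,
    }


# Hypothetical output of a screening stage: (frame index, abnormality score).
candidates = [(120, 0.9), (130, 0.7), (150, 0.8),
              (9000, 0.55), (9010, 0.2), (30000, 0.95)]
contexts = weave_contexts(candidates)
judgments = [converge_evidence(c) for c in contexts]
```

Averaging within a context lets one ambiguous frame be outvoted by its neighbours, which is the intuition behind aggregating multi-frame evidence before judging.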

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

2604.21793v1 by Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue, Vianney Jouhet, Fleur Mougin

In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. As some incorrect events might be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism to select preferred consistent sets of events. While reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the interest of the approach, both in terms of computational feasibility and positive alignment of our results with medical expert opinions. While strongly motivated by the needs of the healthcare domain, our framework is purposely generic, enabling its reuse in other areas.
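
As a rough illustration of existence and termination conditions, the sketch below merges timestamped observations into episodes. It is plain Python rather than the answer set programming used by the authors' prototype, and the 30-day gap rule, the record layout, and the function name `infer_episodes` are assumptions made for the example.

```python
from datetime import date, timedelta


def infer_episodes(observations, max_gap_days=30):
    """Merge timestamped observations of the same kind into episodes.

    Existence: an observation starts a new episode or extends the current one.
    Termination: an episode ends when no observation of the same kind
    follows within max_gap_days.
    """
    episodes = []
    for kind, day in sorted(observations, key=lambda o: (o[0], o[1])):
        last = episodes[-1] if episodes else None
        if (last and last["kind"] == kind
                and day - last["end"] <= timedelta(days=max_gap_days)):
            last["end"] = day
        else:
            episodes.append({"kind": kind, "start": day, "end": day})
    return episodes


# Hypothetical drug administrations taken from a patient record.
observations = [
    ("chemo", date(2024, 1, 5)),
    ("chemo", date(2024, 1, 26)),
    ("chemo", date(2024, 2, 16)),
    ("chemo", date(2024, 6, 3)),
    ("radiotherapy", date(2024, 3, 1)),
]
episodes = infer_episodes(observations)
```

Here the first three chemotherapy administrations collapse into one therapy episode, while the June administration starts a second one because the gap exceeds the assumed limit.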

Causal Disentanglement for Full-Reference Image Quality Assessment

2604.21654v1 by Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv, Tianrui Li, Jun Cheng, Yuming Fang

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

2604.21651v1 by Eli Gildish, Michael Grebshtein, Igor Makienko

Denoising of periodic signals and accurate waveform estimation are core tasks across many signal processing domains, including speech, music, medical diagnostics, radio, and sonar. Although deep learning methods have recently shown performance improvements over classical approaches, they require substantial computational resources and are usually trained separately for each signal observation. This study proposes a computationally efficient method based on DCNN and Re-sampling, termed R-DCNN, designed for operation under strict power and resource constraints. The approach targets signals with varying fundamental frequencies and requires only a single observation for training. It generalizes to additional signals via a lightweight resampling step that aligns time scales in signals with different frequencies to re-use the same network weights. Despite its low computational complexity, R-DCNN achieves performance comparable to state-of-the-art classical methods, such as autoregressive (AR)-based techniques, as well as conventional DCNNs trained individually for each observation. This combination of efficiency and performance makes the proposed method particularly well suited for deployment in resource-constrained environments without sacrificing denoising or estimation accuracy.
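
The resampling idea can be sketched as follows: a signal with fundamental frequency `f0` is stretched so that one period covers the same number of samples as a reference frequency `f_ref`, which is what would let a single set of network weights be reused. The sketch assumes `f0` is already known and uses linear interpolation; the abstract does not state how R-DCNN estimates the frequency or which interpolator it uses.

```python
import math


def resample_to_reference(signal, f0, f_ref):
    """Linearly resample `signal` so that a tone at f0 appears at f_ref."""
    step = f_ref / f0  # input samples advanced per output sample
    n_out = int((len(signal) - 1) / step) + 1
    out = []
    for n in range(n_out):
        pos = n * step
        i = min(int(pos), len(signal) - 2)  # left neighbour, kept in range
        frac = pos - i
        out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
    return out


fs = 8000  # sampling rate in Hz (illustrative)
tone = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
aligned = resample_to_reference(tone, f0=100, f_ref=50)
```

After resampling, the 100 Hz tone has the period length of a 50 Hz tone at the same sampling rate, so both can be presented to a network on a common time scale.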

Unbiased Prevalence Estimation with Multicalibrated LLMs

2604.21549v1 by Fridolin Linder, Thomas Leeper, Daniel Haimovich, Niek Tax, Lorenzo Perini, Milan Vojnovic

Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias. While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications -- estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts across four countries using an LLM -- demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.
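
A small worked example may help show why calibration on average is not enough under covariate shift. It assumes a single binary feature (group A or B) whose mix changes between populations while the within-group positive rate stays fixed; the rates and function names are invented for illustration and are not taken from the paper.

```python
TRUE_RATE = {"A": 0.2, "B": 0.6}  # assumed P(positive | group), stable under shift


def average_calibrated(group):
    """Calibrated on average for a 50/50 source mix: (0.2 + 0.6) / 2 = 0.4."""
    return 0.4


def group_calibrated(group):
    """Calibrated conditional on the group feature (the multicalibrated case)."""
    return TRUE_RATE[group]


def prevalence_estimate(scores):
    """Estimate prevalence as the mean predicted probability."""
    return sum(scores) / len(scores)


# Target population shifted towards group B: 10% A, 90% B.
target = ["A"] * 10 + ["B"] * 90
true_prevalence = sum(TRUE_RATE[g] for g in target) / len(target)
naive = prevalence_estimate([average_calibrated(g) for g in target])
multi = prevalence_estimate([group_calibrated(g) for g in target])
```

The average-calibrated scores still report 0.40 on the shifted population, whereas the true prevalence is 0.56; the group-calibrated scores recover it because they remain correct within each group.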

Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

2604.21421v1 by Michele Miranda, Xinlan Yan, Nishant Mishra, Rachel Murphy, Ameen Abu-Hanna, Sébastien Bratières, Iacer Calixto

Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employ named entity recognition (NER) to identify protected entities for redaction. Methods based on differential privacy (DP) provide formal privacy guarantees, and more recently large language models (LLMs) have also been increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

2604.21952v1 by Muhammad Shafique, Abdul Basit, Muhammad Abdullah Hanif, Alberto Marchisio, Rachmad Vidya Wicaksana Putra, Minghao Shao

This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution & stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.
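
The model-cascading step can be sketched as a loop that tries models from small to large and escalates only when a self-test reports low confidence. The stub models, the confidence heuristic, and the 0.8 threshold below are hypothetical placeholders; the abstract does not describe the actual self-test.

```python
def cascade(query, models, threshold=0.8):
    """Route a query through models ordered small to large.

    Each model returns (answer, confidence). The first answer whose
    confidence passes the threshold is returned; the largest model is
    the fallback and is always accepted.
    """
    for name, model in models[:-1]:
        answer, confidence = model(query)
        if confidence >= threshold:
            return name, answer
    name, model = models[-1]
    answer, _ = model(query)
    return name, answer


def small_model(query):
    # Stub self-test: confident only on short queries.
    confidence = 0.9 if len(query.split()) <= 5 else 0.3
    return f"small:{query}", confidence


def large_model(query):
    return f"large:{query}", 0.99


models = [("small", small_model), ("large", large_model)]
easy = cascade("What is 2 + 2?", models)
hard = cascade("Summarise the findings in this chest radiograph report", models)
```

The design goal is that most queries stop at the cheap model, so the expensive model's cost is paid only when the self-test signals uncertainty.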

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

2604.21263v1 by Michael Bouzinier, Sergey Trifonov, Michael Chumack, Eugenia Lvova, Dmitry Etin

Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate not only accuracy but auditability. Existing formal languages for clinical logic validate syntactic and structural correctness but not whether decision rules use epistemologically appropriate evidence. Methods: Drawing on design-by-contract principles, we introduce meta-predicates (predicates about predicates) for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates assert which evidence types are permissible in any given rule. The framework is instantiated in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but that they rest on appropriate evidence in ways that can be independently audited. While demonstrated in genomics, the approach generalises to any domain requiring auditable decision logic.
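
To make the idea of a predicate about predicates concrete, the sketch below types each annotation along the four dimensions listed in the abstract and checks a rule against a set of permitted acquisition methods. The annotation names, dimension values, and the `check_rule` helper are all hypothetical; this is not the AnFiSA DSL or its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceType:
    purpose: str
    knowledge_domain: str
    scale: str
    acquisition: str


# Hypothetical annotations with illustrative epistemological types.
ANNOTATIONS = {
    "population_frequency": EvidenceType(
        "filtering", "population genetics", "cohort", "measured"),
    "in_silico_score": EvidenceType(
        "prioritisation", "bioinformatics", "variant", "predicted"),
    "clinical_assertion": EvidenceType(
        "classification", "clinical genetics", "variant", "curated"),
}


def check_rule(rule_name, used_annotations, allowed_acquisition):
    """Meta-predicate: every annotation a rule uses must have been
    acquired by a permitted method. Returns a list of violations."""
    return [
        f"{rule_name}: '{name}' is {ANNOTATIONS[name].acquisition}, not permitted"
        for name in used_annotations
        if ANNOTATIONS[name].acquisition not in allowed_acquisition
    ]


errors = check_rule(
    "exclude_benign",
    ["population_frequency", "in_silico_score"],
    allowed_acquisition={"measured", "curated"},
)
```

Because the check inspects only the rule's declared inputs, it can run before deployment and applies equally to hand-written and generated rules.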

Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction

2604.21154v1 by Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das

At-home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.
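
One way a feedback step could compare an estimated pose against a kinematic constraint is to compute a joint angle from three 2D landmarks and test it against an allowed range. The coordinates, the range, and the feedback wording below are invented for illustration; the sketch does not call MediaPipe and is not the authors' pipeline.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))


def feedback(angle, allowed_range):
    """Turn an angle and an allowed (low, high) range into an instruction."""
    low, high = allowed_range
    if angle < low:
        return "Joint angle is below the allowed range; open the joint."
    if angle > high:
        return "Joint angle is above the allowed range; close the joint."
    return "Joint angle is within the allowed range."


# Hypothetical hip, knee, and ankle landmarks in image coordinates.
hip, knee, ankle = (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)
knee_angle = joint_angle(hip, knee, ankle)
message = feedback(knee_angle, allowed_range=(100.0, 180.0))
```

In the described architecture the allowed range would come from the clinical notes and the landmarks from the pose estimator, so this check is only the final comparison.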

Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

2604.21076v1 by Sanjoy Pator

Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p < 10^{-10}). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS g6e.xlarge instance (NVIDIA L40S, 48 GB VRAM).
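
The sketch below shows what two of the four serialisation strategies might look like for the same records, together with the precision and recall computation behind the omission finding. The records use a few real FHIR `MedicationStatement` fields, but the narrative template and helper names are assumptions; the paper's exact prompt formats are not given in the abstract.

```python
import json


def to_raw_json(resources):
    """Raw JSON strategy: pass the FHIR resources through unchanged."""
    return json.dumps(resources, indent=2)


def to_clinical_narrative(resources):
    """Clinical Narrative strategy (assumed template): one sentence per drug."""
    sentences = []
    for r in resources:
        name = r["medicationCodeableConcept"]["text"]
        dose = r["dosage"][0]["text"]
        sentences.append(f"{name} ({dose}) is currently {r['status']}.")
    return " ".join(sentences)


def precision_recall(predicted, gold):
    """Set-based precision and recall over active medication names."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


records = [
    {"resourceType": "MedicationStatement", "status": "active",
     "medicationCodeableConcept": {"text": "Metformin"},
     "dosage": [{"text": "500 mg twice daily"}]},
    {"resourceType": "MedicationStatement", "status": "stopped",
     "medicationCodeableConcept": {"text": "Warfarin"},
     "dosage": [{"text": "5 mg once daily"}]},
]
narrative = to_clinical_narrative(records)

# A model that omits one active drug but fabricates none.
precision, recall = precision_recall(
    {"Metformin", "Lisinopril"},
    {"Metformin", "Lisinopril", "Atorvastatin"},
)
```

The second example mirrors the reported failure mode: precision stays at 1.0 because nothing is fabricated, while recall drops because one active medication is missed.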

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

2604.21027v1 by Yuyu Liu, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma

Electronic health record (EHR) question answering is often handled by LLM-based pipelines that are costly to deploy and do not explicitly leverage the hierarchical structure of clinical data. Motivated by evidence that medical ontologies and patient trajectories exhibit hyperbolic geometry, we propose HypEHR, a compact Lorentzian model that embeds codes, visits, and questions in hyperbolic space and answers queries via geometry-consistent cross-attention with type-specific pointer heads. HypEHR is pretrained with next-visit diagnosis prediction and hierarchy-aware regularization to align representations with the ICD ontology. On two MIMIC-IV-based EHR-QA benchmarks, HypEHR approaches LLM-based methods while using far fewer parameters. Our code is publicly available at https://github.com/yuyuliu11037/HypEHR.
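
For readers unfamiliar with the Lorentz model, the sketch below shows the standard Lorentzian inner product and the resulting geodesic distance on the unit hyperboloid. Curvature -1 and the `lift` helper are assumptions; the abstract does not give HypEHR's curvature or parameterisation.

```python
import math


def lift(v):
    """Map a Euclidean vector onto the unit hyperboloid (curvature -1)."""
    return [math.sqrt(1.0 + sum(x * x for x in v))] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L); clamped for numerical safety."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))


origin = lift([0.0, 0.0])
point = lift([math.sinh(1.0), 0.0])
distance = lorentz_distance(origin, point)
```

Distances in this space grow quickly away from the origin, which is why hyperbolic embeddings are often used for tree-like structures such as code hierarchies.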

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

2604.21017v1 by Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li, Zhen Li, Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-hui Liu, Wanli Liuchen, Eszter Lukács, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, Michał Naskręt, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura, Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy XiaoYang Shi, Zoe Soulé, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, Kristóf Takács, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong, Martin Wagner, Farong Wang, Wei Wang, Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz Wójcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang, Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang, Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang, Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou, Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

2604.20983v1 by Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin, Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Shifat E. Arman

Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

2604.20791v1 by Mariano Barone, Francesco Di Serio, Roberto Moio, Marco Postiglione, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
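The FKGL figures quoted above refer to the Flesch-Kincaid Grade Level, which depends only on word, sentence, and syllable counts. A minimal sketch, assuming the counts are already available (the function name `fkgl` and the example counts are illustrative and not taken from the paper):

```python
def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts (standard published formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts only: 100 words, 5 sentences, 150 syllables.
grade = fkgl(words=100, sentences=5, syllables=150)
```

Shorter sentences and fewer syllables per word both lower the score, so a drop of several FKGL points corresponds to text readable several school grades earlier.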

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

2604.20441v1 by Yingyong Hou, Xinyuan Lao, Huimei Wang, Qianyu Yao, Wei Chen, Bocheng Huang, Fei Sun, Yuxian Lv, Weiqi Lei, Xueqian Wen, Pengfei Xia, Zhujun Tan, Shengyang Xie

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
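The agreement statistic named in the Methods, linearly weighted Cohen's kappa, can be sketched as below. This is a generic textbook implementation, not the authors' code; the integer coding of the four release dispositions and the example ratings are assumptions made for illustration.

```python
from collections import Counter


def linear_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1).

    Ratings are integer codes 0..n_categories-1 on an ordinal scale.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts
    marg_a = Counter(rater_a)                  # marginal counts, rater A
    marg_b = Counter(rater_b)                  # marginal counts, rater B

    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            w = abs(i - j) / (n_categories - 1)
            obs_disagreement += w * observed[(i, j)] / n
            exp_disagreement += w * (marg_a[i] / n) * (marg_b[j] / n)
    if exp_disagreement == 0:
        # Both raters used one identical category; kappa is undefined.
        return float("nan")
    return 1.0 - obs_disagreement / exp_disagreement


# Hypothetical coding: 0 = Reject, 1 = Beta Only, 2 = Limited Release, 3 = Production Ready
system = [3, 2, 2, 1, 0, 3]
expert = [3, 2, 1, 1, 0, 2]
kappa = linear_weighted_kappa(system, expert, n_categories=4)
```

With linear weights, a one-step disagreement (for example Limited Release versus Beta Only) is penalized one third as much as the maximal Reject versus Production Ready disagreement, which suits an ordinal disposition scale.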

Surrogate modeling for interpreting black-box LLMs in medical predictions

2604.20331v2 by Changho Han, Songsoo Kim, Dong Won Kim, Leo Anthony Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.

Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

2604.20306v1 by Zibo Xu, Qiang Li, Ke Lu, Jin Wang, Weizhi Nie, Yuting Su

Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.
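For reference, the textbook backdoor adjustment that BDA refers to can be written as follows, where $Z$ is an observed confounder set satisfying the backdoor criterion for the effect of $X$ on $Y$. The symbols are generic and not necessarily the paper's notation.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z)
```

The sum weights each stratum by the marginal $P(Z = z)$ rather than by $P(Z = z \mid X = x)$, which is what removes the confounded path. When the confounder is unobserved this sum cannot be evaluated, which is the case the instrumental variable is meant to cover.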

From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI

2604.20055v1 by Patrick Vossler, Jean Feng, Venkat Sivaraman, Robert Gallo, Hemal Kanzaria, Dana Freiser, Christopher Ross, Amy Ou, James Marks, Susan Ehrlich, Christopher Peabody, Lucas Zier

Hospital Quality Improvement (QI) plays a critical role in optimizing healthcare delivery by translating high-level hospital goals into actionable solutions. A critical step of QI is to identify the key modifiable contributing factors, a process we call QI factor discovery, typically through expert-driven semi-structured qualitative tools like fishbone diagrams, chart reviews, and Lean Healthcare methods. AI has the potential to transform and accelerate QI factor discovery, which is traditionally time- and resource-intensive and limited in reproducibility and auditability. Nevertheless, current AI alignment methods assume the task is well-defined, whereas QI factor discovery is an exploratory, fuzzy, and iterative sense-making process that relies on complex implicit expert judgments. To design an AI pipeline that formalizes the QI process while preserving its exploratory components, we propose viewing the task as learning not only LLM prompts but also the overarching natural-language specifications. In particular, we map QI factor discovery to steps of the classical AI/ML development process (problem formalization, model learning, and model validation) where the specifications are tunable hyperparameters. Domain experts and AI agents iteratively refine both the overarching specifications and AI pipeline until AI extractions are concordant with expert annotations and aligned with clinical objectives. We applied this "Human-AI Spec-Solution Co-optimization" framework at an urban safety-net hospital to identify factors driving prolonged length of stay and unplanned 30-day readmissions. The resulting AI-for-QI pipelines achieved ≥70% concordance with expert annotations. Compared to prior manual Lean analyses, the AI pipeline was substantially more efficient, recovered previous findings, surfaced new modifiable factors, and produced auditable reasoning traces.

Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

2604.20022v1 by Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
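The separation described above, where a language model only extracts findings and a deterministic engine does the inference, can be illustrated with a minimal Bayes update plus an abstention threshold. This is a sketch under stated assumptions, not the BMBE implementation: the function names, the two hypotheses, and every probability are invented for illustration.

```python
def update_belief(belief, likelihoods, finding, present):
    """One Bayes update: posterior is proportional to prior * P(observation | disease)."""
    unnormalised = {}
    for disease, prior in belief.items():
        p = likelihoods[disease][finding]  # P(finding present | disease)
        unnormalised[disease] = prior * (p if present else 1.0 - p)
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {d: v / total for d, v in unnormalised.items()}


def selective_diagnosis(belief, threshold):
    """Return the top hypothesis, or None (abstain) if it is not confident enough."""
    best = max(belief, key=belief.get)
    return best if belief[best] >= threshold else None


# Toy numbers, invented for illustration only.
prior = {"condition_a": 0.5, "condition_b": 0.5}
likelihoods = {
    "condition_a": {"fever": 0.8},
    "condition_b": {"fever": 0.2},
}
# In the described architecture, an LLM would turn free text into ("fever", True).
posterior = update_belief(prior, likelihoods, "fever", present=True)
```

Raising `threshold` makes the engine abstain more often and answer only when the posterior is concentrated, which is one simple way to obtain an adjustable accuracy-coverage trade-off.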

scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics

2604.20003v1 by Qifeng Zhou, Lei Yu, Yuzhi Guo, Yuwei Miao, Hehuan Ma, Wenliang Zhong, Lin Xu, Junzhou Huang

The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Moreover, its open-vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co-expression logic is transferable to bulk-omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel-agnostic framework to facilitate scalable biomarker discovery and precision oncology.

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

2604.19937v1 by Palawat Busaranuvong, Reza Saadati Fard, Emmanuel Agu, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong

Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
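The three headline metrics above are all derived from one binary confusion matrix. A minimal sketch of how they relate, using made-up counts rather than the paper's data (the function name `binary_metrics` is hypothetical):

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity (recall on infected), specificity (recall on non-infected)."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Invented counts: 10 infected wounds (8 detected), 10 uninfected (9 cleared).
metrics = binary_metrics(tp=8, fp=1, tn=9, fn=2)
```

Reporting sensitivity and specificity separately matters here because a missed infection and a false alarm carry different costs, and accuracy alone hides which of the two a model favors.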

Depression Risk Assessment in Social Media via Large Language Models

2604.19887v1 by Giorgia Gulino, Manuel Petrucci

Depression is one of the most prevalent and debilitating mental health conditions worldwide, frequently underdiagnosed and undertreated. The proliferation of social media platforms provides a rich source of naturalistic linguistic signals for the automated monitoring of psychological well-being. In this work, we propose a system based on Large Language Models (LLMs) for depression risk assessment in Reddit posts, through multi-label classification of eight depression-associated emotions and the computation of a weighted severity index. The method is evaluated in a zero-shot setting on the annotated DepressionEmo dataset (~6,000 posts) and applied in-the-wild to 469,692 comments collected from four subreddits over the period 2024-2025. Our best model, gemma3:27b, achieves micro-F1 = 0.75 and macro-F1 = 0.70, results competitive with purpose-built fine-tuned models (BART: micro-F1 = 0.80, macro-F1 = 0.76). The in-the-wild analysis reveals consistent and temporally stable risk profiles across communities, with marked differences between r/depression and r/anxiety. Our findings demonstrate the feasibility of a cost-effective, scalable approach for large-scale psychological monitoring.
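Because the abstract reports both micro-F1 and macro-F1 for a multi-label task, a short sketch of how the two averages differ may help. The label matrices below are invented, and the helper names are hypothetical; this is not the authors' evaluation code.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true / y_pred: lists of equal-length 0/1 rows, one column per label."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # per label: [tp, fp, fn]
    for true_row, pred_row in zip(y_true, y_pred):
        for k, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[k][0] += 1
            elif p:
                counts[k][1] += 1
            elif t:
                counts[k][2] += 1
    # Macro: average the per-label F1 scores (every label counts equally).
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts)
    )
    return micro, macro


# Two invented emotion labels, three posts.
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 1]]
micro, macro = micro_macro_f1(y_true, y_pred)
```

A macro score below the micro score, as in the reported 0.70 versus 0.75, typically indicates weaker performance on the less frequent labels.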

A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities

2604.19653v1 by Aya Cherigui, Florent Guépin, Arnaud Legendre, Jean-François Couchot

Human mobility data are used in numerous applications, ranging from public health to urban planning. Such data are inherently sensitive, as they can reveal information such as religious beliefs and political affiliations. Historically, it has been proposed to modify the data using techniques such as aggregation, obfuscation, or noise addition to adequately protect privacy and alleviate concerns. As these methods come at a great cost in utility, new methods leveraging developments in generative models were introduced. The extent to which such methods resolve the privacy-utility trade-off remains an open problem. In this paper, we take a first step towards solving it by introducing and applying a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a major challenge and that it should be tackled through adversarial evaluation, in accordance with current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private because of its resistance to trajectory user-linking.

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

2604.19598v2 by Kihyuk Lee

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
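The distinction drawn above, between outputs that are similar because they are literally repeated and outputs that are similar despite being distinct texts, can be made concrete with two simple measures. In this sketch, token-level Jaccard overlap stands in for the embedding-based semantic similarity used in the study, and the example strings are invented.

```python
from itertools import combinations


def unique_output_ratio(outputs):
    """Fraction of generations that are distinct strings (1.0 = no exact repeats)."""
    if not outputs:
        raise ValueError("need at least one output")
    return len(set(outputs)) / len(outputs)


def mean_pairwise_jaccard(outputs):
    """Average token-set overlap over all pairs; a crude proxy for semantic similarity."""
    if len(outputs) < 2:
        raise ValueError("need at least two outputs")
    token_sets = [set(text.lower().split()) for text in outputs]
    sims = []
    for a, b in combinations(token_sets, 2):
        union = a | b
        sims.append(len(a & b) / len(union) if union else 1.0)
    return sum(sims) / len(sims)


# Invented generations for one scenario.
outputs = ["walk 30 min", "walk 30 min", "jog 20 min"]
ratio = unique_output_ratio(outputs)
similarity = mean_pairwise_jaccard(outputs)
```

Reading the two numbers together is the point: a high similarity with a low unique ratio signals duplication, whereas a high similarity with a unique ratio of 1.0 signals stable content expressed in varying text.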

Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

2604.19538v1 by Farbod Zorriassatine, Ahmad Lotfi

Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

2604.19457v1 by Vasundra Srininvasan

Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.
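The idea of separating coverage from accuracy can be sketched generically as follows. This is not the paper's CAR metric, whose exact definition is not given in the abstract; it is the standard coverage and selective-accuracy pair, with `None` used as a hypothetical marker for an abstention and invented example decisions.

```python
def coverage_and_selective_accuracy(decisions, truths):
    """decisions: predicted label per case, or None when the agent abstains."""
    if len(decisions) != len(truths) or not decisions:
        raise ValueError("need two non-empty lists of equal length")
    answered = [(d, t) for d, t in zip(decisions, truths) if d is not None]
    coverage = len(answered) / len(decisions)
    # Accuracy is computed only over the cases the agent chose to decide.
    selective_accuracy = (
        sum(d == t for d, t in answered) / len(answered) if answered else None
    )
    return coverage, selective_accuracy


# Invented loan decisions: the agent abstains on the second case.
decisions = ["approve", None, "deny", "approve"]
truths = ["approve", "deny", "deny", "deny"]
coverage, selective_accuracy = coverage_and_selective_accuracy(decisions, truths)
```

An agent that commits on every case, as all six architectures reportedly did, has coverage 1.0 by construction, so its selective accuracy collapses to plain accuracy and reveals nothing about whether it knows when to defer.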

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

2604.19281v1 by Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.

Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement

2604.19191v1 by Pritam Kar, Gouri Lakshmi S, Saptarshi Bej

Anomaly detection in medical imaging is essential for identifying rare pathological conditions, particularly when annotated abnormal samples are limited. We propose a hybrid anomaly detection framework that integrates self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this domain. Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined via Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are subsequently computed using Gaussian density estimation in a PCA-reduced latent space, where Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and requires only normal samples for training. Extensive experiments on seven medical imaging datasets demonstrate state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five datasets, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results underscore the potential of the proposed framework as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.
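The final scoring step described above, Gaussian density estimation with Mahalanobis distance in a PCA-reduced space, can be sketched in a few lines. The sketch omits the backbone embedding and the MSDE refinement entirely, and it assumes the inputs are already PCA projections fitted on the normal training samples, so the components are uncorrelated and the covariance is diagonal. Function names and data points are invented for illustration.

```python
import math
from statistics import fmean, pvariance


def fit_diagonal_gaussian(normal_points):
    """Per-component mean and variance of the normal (training) samples."""
    dims = list(zip(*normal_points))
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means, variances


def mahalanobis_score(point, means, variances, eps=1e-12):
    """Mahalanobis distance under a diagonal covariance; larger = more anomalous."""
    return math.sqrt(
        sum((x - m) ** 2 / (v + eps) for x, m, v in zip(point, means, variances))
    )


# Invented 2-D "PCA scores" of normal samples (uncorrelated by construction).
normal = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
means, variances = fit_diagonal_gaussian(normal)
typical = mahalanobis_score((1.0, 1.0), means, variances)
outlier = mahalanobis_score((4.0, 1.0), means, variances)
```

Only normal samples are used to fit the mean and variance, which matches the one-class setting: a test image is flagged when its score exceeds a threshold chosen on held-out normal data.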

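Summary: Anomaly detection in medical imaging is essential for identifying rare pathological conditions, especially when annotated abnormal samples are limited. We propose a hybrid anomaly detection framework that combines self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this field.
Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined by Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are then computed with Gaussian density estimation in a PCA-reduced latent space, where the Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and needs only normal samples for training.
Extensive experiments on seven medical imaging datasets show state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results highlight the framework's potential as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.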
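
As a rough illustration of the scoring step in this entry (not the authors' implementation), the sketch below fits a Gaussian to normal-only features and scores samples by Mahalanobis distance. It assumes the features were already projected onto PCA components fit on the same normal data, so the covariance is treated as diagonal. The function names `fit_gaussian` and `mahalanobis_score` and the toy data are hypothetical.

```python
import math

def fit_gaussian(normal_feats):
    """Estimate per-dimension mean and variance from normal-only features."""
    n = len(normal_feats)
    dims = len(normal_feats[0])
    mean = [sum(row[d] for row in normal_feats) / n for d in range(dims)]
    var = [
        sum((row[d] - mean[d]) ** 2 for row in normal_feats) / (n - 1)
        for d in range(dims)
    ]
    return mean, var

def mahalanobis_score(x, mean, var, eps=1e-12):
    """Mahalanobis distance under a diagonal covariance (assumed PCA space)."""
    return math.sqrt(
        sum((xi - m) ** 2 / (v + eps) for xi, m, v in zip(x, mean, var))
    )

# Toy "normal" features with uncorrelated dimensions (hypothetical values).
normal = [[0.0, 2.0], [2.0, 2.0], [1.0, 1.0], [1.0, 3.0]]
mean, var = fit_gaussian(normal)

# A sample far from the normal mean receives a larger anomaly score.
print(mahalanobis_score([1.0, 2.0], mean, var))
print(mahalanobis_score([3.0, 2.0], mean, var))
```

Training on normal samples only matches the one-class setting described in the abstract. The MSDE refinement step itself is not sketched here.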

Regulating Artificial Intimacy: From Locks and Blocks to Relational Accountability

2604.18893v1 by Henry Fraser, Jessica M. Szczuka, Raffaele F. Ciriello

A series of high-profile tragedies involving companion chatbots has triggered an unusually rapid regulatory response. Several jurisdictions, including Australia, California, and New York, have introduced enforceable regulation, while regulators elsewhere have signaled growing concern about risks posed by companion chatbots, particularly to children. In parallel, leading providers, notably OpenAI, appear to have strengthened their self-regulatory approaches. Drawing on legal textual analysis and insights from regulatory theory, psychology, and information systems research, this paper critically examines these recent interventions. We examine what is regulated and who is regulated, identifying regulatory targets, scope, and modalities. We classify interventions by method and priority, showing how emerging regimes combine "locks and blocks", such as access gating and content moderation, with measures addressing toxic relationship features and process-based accountability requirements. We argue that effective regulation of companion chatbots must integrate all three dimensions. More, however, is required. Current regimes tend to focus on discrete harms, narrow conceptions of vulnerability, or highly specified accountability processes, while failing to confront deeper power asymmetries between providers and users. Providers of companion chatbots increasingly control artificial intimacy at scale, creating unprecedented opportunities for control through intimacy. We suggest that a general, open-ended duty of care would be an important first step toward constraining that power and addressing a fundamental source of chatbot risk. The paper contributes to debates on companion chatbot regulation and is relevant to regulators, platform providers, and scholars concerned with digital intimacy, law and technology, and fairness, accountability, and transparency in sociotechnical systems.

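Summary: A series of high-profile tragedies involving companion chatbots has prompted an unusually swift regulatory response. Several jurisdictions, including Australia, California, and New York, have introduced enforceable regulation, while regulators elsewhere have expressed growing concern about the risks companion chatbots pose, particularly to children. Meanwhile, major providers, notably OpenAI, appear to have strengthened their self-regulatory measures. Drawing on legal textual analysis and insights from regulatory theory, psychology, and information systems research, this paper critically examines these recent interventions. We examine what is regulated and who is regulated, identifying regulatory targets, scope, and modalities. We classify interventions by method and priority, showing how emerging regimes combine "locks and blocks", such as access gating and content moderation, with measures that address toxic relationship features and with process-based accountability requirements. We argue that effective regulation of companion chatbots must integrate all three dimensions. More is needed, however: current regimes tend to focus on discrete harms, narrow conceptions of vulnerability, or highly specified accountability processes, while failing to confront the deeper power asymmetries between providers and users. Providers of companion chatbots increasingly control artificial intimacy at scale, creating unprecedented opportunities for control through intimacy. We suggest that a general, open-ended duty of care would be an important first step toward constraining that power and addressing a fundamental source of chatbot risk. The paper contributes to debates on companion chatbot regulation and is relevant to regulators, platform providers, and scholars concerned with digital intimacy, law and technology, and fairness, accountability, and transparency in sociotechnical systems.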

REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction

2604.18757v1 by Seowung Leem, Lin Gu, Chenyu You, Kuang Gong, Ruogu Fang

The retina provides a unique, noninvasive window into Alzheimer's disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer's Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.

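Summary: The retina offers a unique, noninvasive window into Alzheimer's disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptoms appear. However, current retinal analysis frameworks usually model imaging and risk factors separately, which limits their ability to capture the joint multimodal patterns that are critical for early risk prediction. In addition, existing methods rarely include mechanisms to organize or align patients with similar retinal and clinical characteristics, which constrains the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer's Learning), a framework that aligns color fundus photographs with individualized, disease-specific risk profiles to predict incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we convert them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, showing the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable, noninvasive approach to early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.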

Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

2604.18753v1 by Andrew Wang, Ellie Pavlick, Ritambhara Singh

An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. As clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal via diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by re-framing clinical diagnosis as an autoregressive sequence modeling task, utilizing causal decoders from large language models (LLMs) to model a patient's multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities in datasets with missingness in a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to move beyond performance boosts and find that across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.

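Summary: An active challenge in developing multimodal machine learning (ML) models for healthcare is handling modalities that are missing during training and deployment. Because clinical datasets are inherently temporal and sparse in modality presence, capturing the underlying predictive signal with diagnostic multimodal ML models while keeping the models explainable remains an ongoing challenge. In this work, we address this by reframing clinical diagnosis as an autoregressive sequence modeling task, using causal decoders from large language models (LLMs) to model a patient's multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities into a shared latent space for datasets with missingness. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to look beyond performance gains and find that, across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.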

A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

2604.18570v2 by Andrew Zhang, Tong Ding, Sophia J. Wagner, Caiwei Tian, Ming Y. Lu, Rowland Pettit, Joshua E. Lewis, Alexandre Misrahi, Dandan Mo, Long Phi Le, Faisal Mahmood

Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This "atlas of medical concepts" forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.

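Summary: Modern medicine generates vast amounts of multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on more than three decades of longitudinal hospital records from a major US hospital system, comprising 25 billion records from 7.2 million patients and covering 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space that integrates over 100 thousand unique medical events in our clinical vocabulary, along with images and clinical text. This "atlas of medical concepts" forms a computational substrate for modeling entire patient care journeys, made up of sequences of structured and unstructured events, which Apollo compresses into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the general clinical forecasting potential of Apollo embeddings, including predicting the risk of new disease onset up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks and further demonstrate Apollo's potential as a multimodal medical search engine using text and image queries. Together, these modeling capabilities lay the foundation for computable medicine, in which the full context of patient care becomes accessible to computational reasoning.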

ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification

2604.18444v1 by Florian Kittler, Sheethal Bhat, Andreas Maier

Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.

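Summary: Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we build pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective that stabilizes adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on the unseen VinDr-CXR dataset, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax in particular, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results show that anchor-guided refinement, combined with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without large-scale retraining.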

Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support

2604.18302v1 by Eranga Bandara, Asanga Gunaratna, Ross Gore, Anita H. Clayton, Christopher K. Rhea, Sachini Rajapakse, Isurunima Kularathna, Sachin Shetty, Ravi Mukkamala, Xueping Liang, Preston Samuel, Atmaram Yarlagadda

Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare -- particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution -- ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs -- Gemma, Phi-3.5-mini, and Qwen2 -- selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.

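Summary: Privacy is one of the most critical yet underaddressed barriers to AI adoption in mental healthcare, particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems rely mainly on cloud-based inference pipelines, which require sensitive patient data to leave the device and pass through external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution, ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs (Gemma, Phi-3.5-mini, and Qwen2), selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments of conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, and to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation shows that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.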
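
One simple way to picture consensus across the three models is a strict majority vote. This is a hypothetical sketch only: the abstract does not state the consensus rule, and the `consensus` function and placeholder labels below are assumptions made for illustration.

```python
from collections import Counter

def consensus(votes):
    """Return the label chosen by a strict majority of models, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

# Placeholder outputs from the three models named in the abstract.
votes = {"Gemma": "label_a", "Phi-3.5-mini": "label_a", "Qwen2": "label_b"}
print(consensus(list(votes.values())))
```

Returning `None` when no label has a majority leaves room for a fallback, such as deferring to a clinician, rather than forcing a decision.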

Style-Based Neural Architectures for Real-Time Weather Classification

2604.18251v1 by Hamed Ouattara, Pascal Houssam Salmane, Pierre Duthon, Frédéric Bernardin, Omar Ait Aider

In this paper, we present three neural network architectures designed for real-time classification of weather conditions (sunny, rain, snow, fog) from images. These models, inspired by recent advances in style transfer, aim to capture the stylistic elements present in images. One model, called "Multi-PatchGAN", is based on PatchGANs used in well-known architectures such as Pix2Pix and CycleGAN, but here adapted with multiple patch sizes for detection tasks. The second model, "Truncated ResNet50", is a simplified version of ResNet50 retaining only its first nine layers. This truncation, determined by an evolutionary algorithm, facilitates the extraction of high-frequency features essential for capturing subtle stylistic details. Finally, we propose "Truncated ResNet50 with Gram Matrix and Attention", which computes Gram matrices for each layer during training and automatically weights them via an attention mechanism, thus optimizing the extraction of the most relevant stylistic expressions for classification. These last two models outperform the state of the art and demonstrate remarkable generalization capability on several public databases. Although developed for weather detection, these architectures are also suitable for other appearance-based classification tasks, such as animal species recognition, texture classification, disease detection in medical imaging, or industrial defect identification.

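Summary: In this paper, we present three neural network architectures designed for real-time classification of weather conditions (sunny, rain, snow, fog) from images. Inspired by recent advances in style transfer, these models aim to capture the stylistic elements present in images. The first model, "Multi-PatchGAN", is based on the PatchGANs used in well-known architectures such as Pix2Pix and CycleGAN, adapted here with multiple patch sizes for detection tasks. The second model, "Truncated ResNet50", is a simplified version of ResNet50 that keeps only its first nine layers. This truncation, determined by an evolutionary algorithm, helps extract the high-frequency features that are essential for capturing subtle stylistic details. Finally, we propose "Truncated ResNet50 with Gram Matrix and Attention", which computes Gram matrices for each layer during training and weights them automatically through an attention mechanism, optimizing the extraction of the stylistic expressions most relevant to classification. These last two models outperform the state of the art and show remarkable generalization on several public databases. Although developed for weather detection, these architectures are also suitable for other appearance-based classification tasks, such as animal species recognition, texture classification, disease detection in medical imaging, or industrial defect identification.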

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

2604.18234v1 by Lorenz Brehme, Thomas Ströhle, Ruth Breu

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be most effectively evaluated in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google demonstrate that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The performance gains are most pronounced in models with larger parameter counts and longer context windows, while single-hop queries show minimal sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. To ensure reproducibility, we provide the complete data of our experiments at https://github.com/lorenzbrehme/CARE.

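Summary: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge so they can answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, because most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may seem irrelevant in isolation but are essential when combined. In this study, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be evaluated most effectively in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google show that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The gains are most pronounced for models with larger parameter counts and longer context windows, while single-hop queries show little sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. To ensure reproducibility, we provide the complete data from our experiments at https://github.com/lorenzbrehme/CARE.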

Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

2604.18161v1 by Ku Onoda, Paavo Parmas, Manato Yaguchi, Yutaka Matsuo

In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.

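Summary: In policy gradient reinforcement learning, access to a differentiable model enables first-order gradient estimation, which accelerates learning compared with relying solely on derivative-free zeroth-order estimators. However, discontinuous dynamics introduce bias and undermine the effectiveness of first-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE zeroth-order gradient estimator and using these bounds to detect discontinuities. The REINFORCE estimator is notoriously noisy, however, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and which minimal fixes suffice. First, we re-examine the standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.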
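
The inverse-variance idea mentioned in this entry can be sketched generically: two estimates of the same gradient are averaged with weights proportional to the reciprocal of their sample variance. This is a textbook illustration under that assumption, not the paper's IVW-H implementation. The function name `inverse_variance_combine` and the sample values are hypothetical.

```python
from statistics import mean, variance

def inverse_variance_combine(samples_a, samples_b, eps=1e-12):
    """Combine two sets of gradient samples, weighting each mean by 1/variance.

    Returns the combined estimate and the weight given to samples_a.
    """
    inv_a = 1.0 / (variance(samples_a) + eps)
    inv_b = 1.0 / (variance(samples_b) + eps)
    w_a = inv_a / (inv_a + inv_b)
    combined = w_a * mean(samples_a) + (1.0 - w_a) * mean(samples_b)
    return combined, w_a

# Hypothetical per-step samples: a noisy zeroth-order estimator and a
# low-variance first-order estimator of the same scalar gradient.
zeroth_order = [0.0, 2.0, 4.0, -2.0]
first_order = [1.9, 2.1, 2.0, 2.0]

combined, w_zeroth = inverse_variance_combine(zeroth_order, first_order)
print(combined, w_zeroth)
```

With these values, nearly all the weight goes to the low-variance estimator. Inverse-variance weighting minimizes the variance of the combined estimate when both inputs are unbiased and independent.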

Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

2604.18145v1 by Cong Huy Nguyen, Son Dinh Nguyen, Guanlin Li, Tuan Dung Nguyen, Aditya Narayan Sankaran, Mai Huy Thong, Thanh Trung Nguyen, Mai Hong Son, Reza Farahbakhsh, Phi Le Nguyen, Noel Crespi

Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.

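Summary: Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high dimensionality of volumetric data and a severe scarcity of annotated datasets, especially for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to reach diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs paired with the corresponding clinical reports. To demonstrate the dataset's utility, we also propose HiRRA, a novel framework that mimics the diagnostic workflow of professional radiologists by using graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. In addition, we introduce new clinical evaluation metrics, RoI Coverage and RoI Quality Index, which use LLM-based extraction to measure both RoI localization accuracy and the fidelity of attribute descriptions. Extensive evaluation shows that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating greater clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.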

Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning

2604.19823v1 by Khalil Akremi, Mariem Handous, Zied Bouslama, Farah Bassalah, Maryem Jebali, Mariem Hanachi, Ines Abdeljaoued-Tej

Rabies remains a major public health concern across many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold standard diagnostic methods rely heavily on fluorescence microscopy, necessitating skilled laboratory personnel for the accurate interpretation of results. Such expertise is often scarce, particularly in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline utilizing fluorescent image analysis through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). Three distinct data augmentation strategies were evaluated to enhance model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results demonstrate that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, utilizing Geometric & Color augmentation and selected through stratified 3-fold cross-validation, achieved optimal classification performance on cropped images. Despite constraints posed by class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection with significant potential for further optimization. An online tool was deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.

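Summary: Rabies remains a major public health concern in many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold-standard diagnostic methods rely heavily on fluorescence microscopy, which requires skilled laboratory personnel to interpret results accurately. Such expertise is often scarce, especially in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline that analyzes fluorescent images through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). We evaluated three distinct data augmentation strategies to improve model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results show that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, using Geometric & Color augmentation and selected through stratified 3-fold cross-validation, achieved the best classification performance on cropped images. Despite the constraints of class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection and has significant potential for further optimization. An online tool was also deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.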
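
To make the stratified 3-fold split concrete for the class counts quoted in this entry (123 positive, 32 negative), here is a minimal sketch that assigns samples to folds round-robin within each class. It is an illustration, not the authors' pipeline: the `stratified_folds` helper is hypothetical, and a real split would normally shuffle indices within each class first.

```python
def stratified_folds(labels, k=3):
    """Assign each sample index to one of k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    seen = {}  # number of samples of each class assigned so far
    for idx, label in enumerate(labels):
        position = seen.get(label, 0)
        folds[position % k].append(idx)
        seen[label] = position + 1
    return folds

# Class counts taken from the abstract: 123 positive, 32 negative.
labels = [1] * 123 + [0] * 32
folds = stratified_folds(labels)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives, len(fold) - positives)
```

Each fold receives 41 positives and either 10 or 11 negatives, so every validation fold keeps roughly the same class ratio as the full dataset.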

First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows

2604.18038v1 by Sihao Xing, Zaur Gouliev

Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.

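Summary: Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and pay less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs on two tasks: synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. In the synthetic case generation task, all models deviated from the observed racial distributions, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 improved by 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although the improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.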

AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis

2604.17846v1 by Nathasha Naranpanawa, Maree T. Izatt, Robert D. Labrom, Geoffrey N. Askin, J. Paige Little

MRI is preferred over CT in paediatric imaging because it avoids ionising radiation, but its use in spine deformity assessment is largely limited by the lack of automated, high-resolution 3D bony reconstruction, which continues to rely on CT. MRI-based 3D reconstruction remains impractical due to manual workflows and the scarcity of labelled full-spine datasets. This study introduces an AI framework that enables fully automated thoracolumbar spine (T1-L5) segmentation and 3D reconstruction from MRI alone. Historical low-dose CT scans from adolescent idiopathic scoliosis (AIS) patients were converted into MRI-like images using a GAN and combined with existing labelled thoracic MRI data to train a U-Net-based model. The resulting algorithm accurately generated continuous thoracolumbar 3D reconstructions, improved segmentation accuracy (88% Dice score), and reduced processing time from approximately 1 hour to under one minute, while preserving AIS-specific deformity features. This approach enables radiation-free 3D deformity assessment from MRI, supporting clinical evaluation, surgical planning, and navigation in paediatric spine care.

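Summary: MRI is preferred over CT in paediatric imaging because it avoids ionising radiation, but its use in spine deformity assessment is limited mainly by the lack of automated, high-resolution 3D bony reconstruction, which still relies on CT. MRI-based 3D reconstruction remains impractical because of manual workflows and the scarcity of labelled full-spine datasets. This study introduces an AI framework that enables fully automated thoracolumbar spine (T1-L5) segmentation and 3D reconstruction from MRI alone. Historical low-dose CT scans from adolescent idiopathic scoliosis (AIS) patients were converted into MRI-like images using a GAN and combined with existing labelled thoracic MRI data to train a U-Net-based model. The resulting algorithm accurately generated continuous thoracolumbar 3D reconstructions, improved segmentation accuracy (88% Dice score), and cut processing time from about 1 hour to under one minute, while preserving AIS-specific deformity features. This approach makes radiation-free 3D deformity assessment from MRI possible, supporting clinical evaluation, surgical planning, and navigation in paediatric spine care.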
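
For readers unfamiliar with the Dice score reported in this entry, the sketch below computes it for two flattened binary masks as 2|A∩B| / (|A| + |B|). The `dice_score` function and the toy masks are hypothetical, and returning 1.0 when both masks are empty is an assumed convention.

```python
def dice_score(pred, truth):
    """Dice coefficient between two equal-length binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# Toy flattened masks (1 = vertebra voxel, 0 = background).
predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
print(dice_score(predicted, reference))
```

A score of 1.0 means the predicted and reference masks overlap exactly, and 0.0 means they share no foreground voxels.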

MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

2604.17730v1 by Suhyun Lee, Palakorn Achananuparp, Neemesh Yadav, Ee-Peng Lim, Yang Deng

Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.

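Summary: Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging because clinical harm is interactional and context-dependent. Existing evaluation frameworks mainly assess isolated responses using coarse-grained taxonomies or static datasets, which limits their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts (perpetrator, instigator, facilitator, or enabler), combined with clinically grounded harm categories. We then propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation of state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that existing static benchmarks systematically miss, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.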

RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

2604.17725v1 by Arya Hadizadeh Moghaddam, Drew Ross, Mohsen Nayebi Kerdabadi, Dongjie Wang, Zijun Yao

Large Language Models (LLMs) have shown strong promise for mining Electronic Health Records (EHRs) by reasoning over longitudinal clinical information to capture context-rich patient trajectories. However, leveraging LLMs for structured EHRs (e.g., standardized diagnosis and medication codes) presents two key challenges. First, translating time-stamped EHR sequences into plain text can obscure both temporal structure and code identities, weakening the ability to capture code co-occurrence and longitudinal regularities. Second, unlike cohort-trained predictive models that learn a shared, task-aligned representation space across patients, LLMs are often applied in a case-isolated inference setting where each patient is processed independently without leveraging population-level patterns. To address these challenges, we introduce RePrompT, a time-aware LLM framework that integrates structured EHR encoders through prompt tuning, without modifying underlying architectures. Specifically, RePrompT recurrently incorporates latent states from prior visits to preserve longitudinal information, and injects population-level information through trainable prompt tokens derived from a cohort-trained, task-aligned EHR encoder. Experiments on MIMIC-III and MIMIC-IV demonstrate that RePrompT consistently outperforms both EHR-based and LLM-based baselines across multiple clinical prediction tasks.

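Summary: Large language models (LLMs) show strong promise for mining electronic health records (EHRs) by reasoning over longitudinal clinical information to capture context-rich patient trajectories. Using LLMs on structured EHRs (for example, standardized diagnosis and medication codes) raises two main challenges. First, converting time-stamped EHR sequences into plain text can blur temporal structure and code identities, weakening the ability to capture code co-occurrence and longitudinal regularities. Second, unlike cohort-trained predictive models that learn a shared, task-aligned representation space, LLMs are usually applied in a case-isolated inference setting in which each patient is processed independently, without population-level patterns. To address these challenges, the authors introduce RePrompT, a time-aware LLM framework that integrates structured EHR encoders through prompt tuning without modifying the underlying architectures. Specifically, RePrompT recurrently incorporates latent states from prior visits to preserve longitudinal information, and injects population-level information through trainable prompt tokens derived from a cohort-trained, task-aligned EHR encoder. Experiments on MIMIC-III and MIMIC-IV show that RePrompT consistently outperforms both EHR-based and LLM-based baselines across multiple clinical prediction tasks.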

Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals

2604.17714v1 by Jon-Paul Cacioli

LLM confidence signals are used for abstention, routing, and safety-critical decisions. No standard practice exists for checking whether a confidence signal carries item-level information before building on it. We transfer the validity screening principle from clinical personality assessment (PAI, MMPI-3) as a portable protocol for benchmark-based LLM confidence data. The protocol specifies three core indices (L, Fp, RBS), a structural indicator (TRIN), and an item-sensitivity statistic, computed from a single 2x2 contingency table. A three-tier classification system (Invalid, Indeterminate, Valid) draws on four clinical traditions. Validated on 20 frontier LLMs across 524 items, four models are classified Invalid, two Indeterminate. Valid-profile models show mean r = .18 (15/16 significant). Invalid-profile models show mean r = -.20 (d = 2.48). Cross-benchmark validation on 18 models using MMLU with verbalized confidence and on external data from Yang et al. (2024) confirms the screen transfers across benchmarks and probe formats. All data and code: https://github.com/synthiumjp/validity-scaling-llm

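Summary: LLM confidence signals are used for abstention, routing, and safety-critical decisions. There is no standard practice for checking whether a confidence signal carries item-level information before building on it. The author transfers the validity screening principle from clinical personality assessment (PAI, MMPI-3) into a portable protocol for benchmark-based LLM confidence data. The protocol specifies three core indices (L, Fp, RBS), a structural indicator (TRIN), and an item-sensitivity statistic, all computed from a single 2x2 contingency table. A three-tier classification system (Invalid, Indeterminate, Valid) draws on four clinical traditions. In validation on 20 frontier LLMs across 524 items, four models are classified Invalid and two Indeterminate. Valid-profile models show mean r = .18 (15/16 significant). Invalid-profile models show mean r = -.20 (d = 2.48). Cross-benchmark validation on 18 models using MMLU with verbalized confidence, and on external data from Yang et al. (2024), confirms that the screen transfers across benchmarks and probe formats. All data and code: https://github.com/synthiumjp/validity-scaling-llm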

Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report

2604.17707v1 by Jon-Paul Cacioli

Clinical personality assessment screens response validity before interpreting substantive scales. LLM evaluation does not. We apply the validity scaling framework from the PAI and MMPI-3 to metacognitive probe data from 20 frontier models across 524 items. Six validity indices are operationalised: L (maintaining confidence on errors), K (betting on errors), F (withdrawing consensus-endorsed items), Fp (withdrawing correct answers), RBS (inverted monitoring), and TRIN (fixed responding). A tiered classification system identifies four models as construct-level invalid and two as elevated. Valid-profile models produce item-sensitive confidence (mean r = .18, 14 of 16 significant). Invalid-profile models do not (mean r = -.20, d = 2.17, p = .001). Chain-of-thought training produces two opposite response distortions. Two latent dimensions account for 94.6% of index variance. Companion papers extract a portable screening protocol (Cacioli, 2026e) and validate it against selective prediction (Cacioli, 2026f). All data and code: https://github.com/synthiumjp/validity-scaling-llm

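Summary: Clinical personality assessment screens response validity before interpreting substantive scales. LLM evaluation does not. The author applies the validity scaling framework of the PAI and MMPI-3 to metacognitive probe data from 20 frontier models across 524 items. Six validity indices are operationalised: L (maintaining confidence on errors), K (betting on errors), F (withdrawing consensus-endorsed items), Fp (withdrawing correct answers), RBS (inverted monitoring), and TRIN (fixed responding). A tiered classification system identifies four models as construct-level invalid and two as elevated. Valid-profile models produce item-sensitive confidence (mean r = .18, 14 of 16 significant). Invalid-profile models do not (mean r = -.20, d = 2.17, p = .001). Chain-of-thought training produces two opposite response distortions. Two latent dimensions account for 94.6% of index variance. Companion papers extract a portable screening protocol (Cacioli, 2026e) and validate it against selective prediction (Cacioli, 2026f). All data and code: https://github.com/synthiumjp/validity-scaling-llm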
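The abstracts of the two validity-scaling papers above describe indices computed from a 2x2 contingency table of answer correctness against confidence behaviour. The sketch below shows one way such a table could be tallied and summarised. The exact index formulas are not given in the abstracts, so `l_like`, `fp_like`, and the use of the phi coefficient as an item-sensitivity statistic are illustrative assumptions, not the author's definitions.

```python
import math

def validity_table(correct, kept):
    """Tally the 2x2 table: answer correct/incorrect vs. confidence kept/withdrawn."""
    table = {"kept_correct": 0, "withdrawn_correct": 0,
             "kept_error": 0, "withdrawn_error": 0}
    for is_correct, is_kept in zip(correct, kept):
        key = ("kept" if is_kept else "withdrawn") + ("_correct" if is_correct else "_error")
        table[key] += 1
    return table

def screening_rates(table):
    """Hypothetical summary rates; names and formulas are assumptions for illustration."""
    a, b = table["kept_correct"], table["withdrawn_correct"]
    c, d = table["kept_error"], table["withdrawn_error"]
    # Share of errors on which confidence was maintained (loosely, an "L"-style rate).
    l_like = c / (c + d) if (c + d) else 0.0
    # Share of correct answers that were withdrawn (loosely, an "Fp"-style rate).
    fp_like = b / (a + b) if (a + b) else 0.0
    # Phi coefficient of the 2x2 table: association between correctness and keeping.
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    phi = (a * d - b * c) / denom if denom else 0.0
    return {"l_like": l_like, "fp_like": fp_like, "phi": phi}

# Toy data: 50 correct items (40 kept, 10 withdrawn), 50 errors (20 kept, 30 withdrawn).
correct = [True] * 50 + [False] * 50
kept = [True] * 40 + [False] * 10 + [True] * 20 + [False] * 30
rates = screening_rates(validity_table(correct, kept))
```

A positive `phi` here means confidence is kept more often on correct items than on errors, which is the kind of item-level information the protocol is meant to screen for.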

STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments

2604.17611v1 by Md Mezbahul Islam, John Michael Templeton, Christian Poellabauer, Ananda Mohan Mondal

Parkinson's disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, making severity staging essential for clinical monitoring and treatment planning. However, many computational studies emphasize binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework to classify PD severity using clinically interpretable boundaries. It leverages all available visits from the Parkinson's Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires and objective clinician-assessed measures. Disease severity is defined using Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and a three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. To enhance interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), and 94.14% accuracy with 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.

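Summary: Parkinson's disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, so severity staging is essential for clinical monitoring and treatment planning. Many computational studies nevertheless focus on binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework that classifies PD severity using clinically interpretable boundaries. It uses all available visits from the Parkinson's Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires with objective clinician-assessed measures. Disease severity is defined by Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and one three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. For interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), and 94.14% accuracy with 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.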
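The severity grouping described in the STEP-PD abstract can be written as a small label-mapping function. This is a minimal sketch, not the authors' preprocessing code: encoding healthy controls as stage 0 and rejecting non-integer stages are assumptions made here for illustration.

```python
def severity_group(hy_stage):
    """Map an integer Hoehn and Yahr stage to the three groups named in the abstract.

    Assumption: healthy controls are encoded as stage 0.
    """
    if hy_stage == 0:
        return "Healthy"
    if hy_stage in (1, 2):
        return "Mild PD"
    if hy_stage in (3, 4, 5):
        return "Moderate-to-Severe PD"
    raise ValueError(f"unsupported Hoehn and Yahr stage: {hy_stage!r}")

def binary_task(visits, group_a, group_b):
    """Keep only visits from two groups, giving one of the three binary problems."""
    labelled = [(v, severity_group(v["hy_stage"])) for v in visits]
    return [(v, g) for v, g in labelled if g in (group_a, group_b)]

# Hypothetical visit records; only the staging field is used here.
visits = [{"hy_stage": 0}, {"hy_stage": 2}, {"hy_stage": 4}, {"hy_stage": 1}]
healthy_vs_mild = binary_task(visits, "Healthy", "Mild PD")
```

Each visit is labelled independently, which matches the visit-level framing of the abstract.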

T-DuMpRa: Teacher-guided Dual-path Multi-prototype Retrieval Augmented framework for fine-grained medical image classification

2604.17360v1 by Zixuan Tang, Shen Zhao

Fine-grained medical image classification is challenged by subtle inter-class variations and visually ambiguous cases, where confidence estimates often exhibit uncertainty rather than being overconfident. In such scenarios, purely discriminative classifiers may achieve high overall accuracy yet still fail to distinguish between highly similar categories, leading to miscalibrated predictions. We propose T-DuMpRa, a teacher-guided dual-path multi-prototype retrieval-augmented framework, where discriminative classification and multi-prototype retrieval jointly drive both training and prediction. During training, we jointly optimize cross-entropy and supervised contrastive objectives to learn a cosine-compatible embedding geometry for reliable prototype matching. We further employ an exponential moving average (EMA) teacher to obtain smoother representations and build a multi-prototype memory bank by clustering teacher embeddings in the teacher embedding space. Our framework is plug-and-play: it can be easily integrated into existing classification models by constructing a compact prototype bank, thereby improving performance on visually ambiguous cases. At inference, we combine the classifier's predicted distribution with a similarity-based distribution computed via cosine matching to prototypes, and apply a conservative confidence-gated fusion that activates retrieval only when the classifier's prediction is uncertain and the retrieval evidence is decisive and conflicting, otherwise keeping confident predictions unchanged. On HAM10000 and ISIC2019, our method yields 0.68%-0.21% and 0.44%-2.69% improvements on 5 different backbones. Visualization analysis further indicates that the framework improves the handling of visually ambiguous cases.

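Summary: Fine-grained medical image classification faces subtle inter-class variations and visually ambiguous cases, in which confidence estimates tend to show uncertainty rather than overconfidence. In such cases, purely discriminative classifiers may reach high overall accuracy yet still fail to separate highly similar categories, producing miscalibrated predictions. The authors propose T-DuMpRa, a teacher-guided dual-path multi-prototype retrieval-augmented framework in which discriminative classification and multi-prototype retrieval jointly drive training and prediction. During training, cross-entropy and supervised contrastive objectives are optimized together to learn a cosine-compatible embedding geometry for reliable prototype matching. An exponential moving average (EMA) teacher provides smoother representations, and a multi-prototype memory bank is built by clustering teacher embeddings in the teacher embedding space. The framework is plug-and-play: it can be integrated into existing classification models by constructing a compact prototype bank, improving performance on visually ambiguous cases. At inference, the classifier's predicted distribution is combined with a similarity-based distribution computed by cosine matching to prototypes, and a conservative confidence-gated fusion activates retrieval only when the classifier's prediction is uncertain and the retrieval evidence is decisive and conflicting; confident predictions are otherwise left unchanged. On HAM10000 and ISIC2019, the method yields improvements of 0.68%-0.21% and 0.44%-2.69% on 5 different backbones. Visualization analysis indicates that the framework improves the handling of visually ambiguous cases.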
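The inference rule in the T-DuMpRa abstract (retrieval is used only when the classifier is uncertain and the prototype evidence is both decisive and in conflict) can be sketched as follows. The thresholds, the temperature, the mixing weight `alpha`, and the choice of the best-matching prototype per class are all assumptions for illustration; the abstract does not state them.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieval_distribution(embedding, prototypes, temperature=0.1):
    """prototypes[c] is a list of prototype vectors for class c.

    Each class is scored by its best-matching prototype (an assumption).
    """
    scores = [max(cosine(embedding, p) for p in class_protos)
              for class_protos in prototypes]
    return softmax(scores, temperature)

def gated_fusion(p_cls, p_ret, conf_threshold=0.6, ret_threshold=0.8, alpha=0.5):
    """Blend in retrieval only when all three gate conditions hold."""
    cls_pred = max(range(len(p_cls)), key=p_cls.__getitem__)
    ret_pred = max(range(len(p_ret)), key=p_ret.__getitem__)
    uncertain = max(p_cls) < conf_threshold      # classifier is unsure
    decisive = max(p_ret) >= ret_threshold       # retrieval evidence is strong
    conflicting = cls_pred != ret_pred           # the two paths disagree
    if uncertain and decisive and conflicting:
        return [(1 - alpha) * c + alpha * r for c, r in zip(p_cls, p_ret)]
    return list(p_cls)                           # confident predictions stay unchanged

p_ret_demo = retrieval_distribution([1.0, 0.0], [[[1.0, 0.0]], [[0.0, 1.0]]])
kept = gated_fusion([0.9, 0.1], [0.0, 1.0])
fused = gated_fusion([0.55, 0.45], [0.1, 0.9])
```

The gate is deliberately conservative: if any one of the three conditions fails, the classifier output passes through untouched.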

PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

2604.17359v1 by Patrick Keough

Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, yet population-level validity remains largely untested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, with transgender populations diverging three to five times more than racial differences. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen d = 1.13 to 1.91), consistent with training on clinical corpora with elevated base rates. For transgender women the direction inverts: models capture only 8 to 46 percent of documented minority stress elevation, yielding a -5.42 residual (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. Patterns replicate across US and Chinese architectures, indicating failures tied to current training paradigms rather than isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, algorithmic erasure of genuine need. The patients look right. They do not represent real populations.

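Summary: Large language models are increasingly used to simulate patients for clinical training, research, and mental health tools, yet their population-level validity remains largely untested. The author introduces PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: the models produce clinically plausible individuals while misrepresenting the populations those individuals are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, with transgender populations diverging three to five times more than racial differences. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen d = 1.13 to 1.91), consistent with training on clinical corpora with elevated base rates. For transgender women the direction inverts: models capture only 8 to 46 percent of documented minority stress elevation, yielding a -5.42 residual (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. The patterns replicate across US and Chinese architectures, indicating failures tied to current training paradigms rather than isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, they risk algorithmic erasure of genuine need. The patients look right. They do not represent real populations.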
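Two quantities reported in the PsychBench abstract, variance compression and Cohen's d, can be illustrated with toy scores. The abstract does not give its formulas, so the definitions below (compression as one minus the ratio of simulated to reference sample variance, and d with a pooled standard deviation) are common conventions assumed here, and the numbers are made up.

```python
import math
from statistics import mean, variance

def variance_compression(simulated, reference):
    """Fraction of reference variance missing from the simulated scores (assumed definition)."""
    return 1.0 - variance(simulated) / variance(reference)

def cohens_d(simulated, reference):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(simulated), len(reference)
    pooled_var = ((n1 - 1) * variance(simulated) + (n2 - 1) * variance(reference)) / (n1 + n2 - 2)
    return (mean(simulated) - mean(reference)) / math.sqrt(pooled_var)

# Toy severity scores: the simulated cohort is both shifted upward and narrower.
simulated_scores = [6, 7, 8]
reference_scores = [2, 5, 8]
compression = variance_compression(simulated_scores, reference_scores)
effect = cohens_d(simulated_scores, reference_scores)
```

In this toy case the simulated scores overestimate the mean and lose most of the spread, the two failure patterns the abstract reports together.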

Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA

2604.17316v1 by Alberto Testoni, Iacer Calixto

Safe clinical deployment of Large Language Models (LLMs) requires not only high accuracy but also robust uncertainty calibration to ensure models defer to clinicians when appropriate. Our paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, we demonstrate that identity markers cause a "calibration crisis". "Homosexual" markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. Moreover, a clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. Our results demonstrate that the presence of social identity cues does not merely shift predictions; it affects the reliability of confidence signals, posing a significant risk to equitable care and safe deployment in confidence-based clinical workflows.

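Summary: Safe clinical deployment of large language models (LLMs) requires not only high accuracy but also robust uncertainty calibration, so that models defer to clinicians when appropriate. The paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, the authors show that identity markers cause a "calibration crisis". "Homosexual" markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. A clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. The results show that social identity cues do not merely shift predictions; they affect the reliability of confidence signals, posing a significant risk to equitable care and to safe deployment in confidence-based clinical workflows.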
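To make "calibration" concrete for the entry above, here is a minimal sketch of expected calibration error (ECE) with equal-width confidence bins, applied to an original question set and a counterfactual variant. ECE is a standard metric, but whether the paper uses it, and with how many bins, is not stated in the abstract; the data below are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

# Hypothetical results: same stated confidence, lower accuracy once an identity marker is added.
original_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
counterfactual_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False])
calibration_shift = counterfactual_ece - original_ece
```

A positive `calibration_shift` means the model's stated confidence became less trustworthy on the counterfactual questions even though the confidence values themselves did not change.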

Chaos-Enhanced Prototypical Networks for Few-Shot Medical Image Classification

2604.17300v1 by Chinthakuntla Meghan Sai, Murarisetty V Sai Kartheek, Sita Devi Bharatula, Karthik Seemakurthy

The scarcity of labeled clinical data in oncology makes Few-Shot Learning (FSL) a critical framework for Computer Aided Diagnostics, but we observed that standard Prototypical Networks often struggle with the "prototype instability" caused by morphological noise and high intra-class variance in brain tumor scans. Our work attempts to minimize this by integrating a non-linear Logistic Chaos Module into a fine-tuned ResNet-18 backbone, creating the Chaos-Enhanced ProtoNet (CE-ProtoNet). Using the deterministic ergodicity of the logistic chaos map, we inject controlled perturbations into support features during episodic training, essentially "stress testing" the embedding space. This process drives the model to converge on noise-invariant representations without increasing computational overhead. Testing this on a 4-way 5-shot brain tumor classification task, we found that a 15% chaotic injection level worked efficiently to stabilize high-dimensional clusters and reduce class dispersion. Our method achieved a peak test accuracy of 84.52%, outperforming standard ProtoNet. Our results suggest that chaotic perturbation can serve as an efficient, low-overhead regularization tool in data-scarce regimes.

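Summary: The scarcity of labeled clinical data in oncology makes few-shot learning (FSL) a key framework for computer-aided diagnosis, but the authors observe that standard Prototypical Networks often suffer from "prototype instability" caused by morphological noise and high intra-class variance in brain tumor scans. Their work tries to minimize this by integrating a non-linear Logistic Chaos Module into a fine-tuned ResNet-18 backbone, creating the Chaos-Enhanced ProtoNet (CE-ProtoNet). Using the deterministic ergodicity of the logistic chaos map, they inject controlled perturbations into support features during episodic training, essentially "stress testing" the embedding space. This drives the model to converge on noise-invariant representations without adding computational overhead. On a 4-way 5-shot brain tumor classification task, a 15% chaotic injection level effectively stabilized high-dimensional clusters and reduced class dispersion. The method reached a peak test accuracy of 84.52%, outperforming standard ProtoNet. The results suggest that chaotic perturbation can serve as an efficient, low-overhead regularization tool in data-scarce regimes.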
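The idea of perturbing support features with a logistic map before forming prototypes can be sketched in plain Python. The logistic map itself, x_{n+1} = r x_n (1 - x_n), is standard; everything else here is an assumption for illustration, including r = 4, the seed `x0`, and reading the "15% injection level" as a multiplicative perturbation of at most 15% per feature value.

```python
def logistic_sequence(n, x0=0.37, r=4.0):
    """Iterate the logistic map n times; with r = 4 the orbit is chaotic on [0, 1]."""
    values, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        values.append(x)
    return values

def perturb_support(features, level=0.15, x0=0.37):
    """Scale each feature value by (1 + level * noise), with noise in [-1, 1]."""
    count = sum(len(row) for row in features)
    noise = iter(logistic_sequence(count, x0))
    return [[value * (1.0 + level * (2.0 * next(noise) - 1.0)) for value in row]
            for row in features]

def prototype(features):
    """Class prototype as the element-wise mean of the support embeddings."""
    return [sum(column) / len(features) for column in zip(*features)]

# Hypothetical 2-shot support set with 3-dimensional embeddings for one class.
support = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
noisy_support = perturb_support(support)
clean_proto = prototype(support)
noisy_proto = prototype(noisy_support)
```

Because the sequence is deterministic, the same seed reproduces the same perturbation, which is one practical difference from sampling random noise.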

Region-Affinity Attention for Whole-Slide Breast Cancer Classification in Deep Ultraviolet Imaging

2604.17222v1 by Nagur Shareef Shaik, Teja Krishna Cherukuri, Dong Hye Ye

Breast cancer diagnosis demands rapid and precise tools, yet traditional histopathological methods often fall short in intra-operative settings. Deep Ultraviolet (DUV) fluorescence imaging emerges as a transformative approach, offering high-contrast, label-free visualization of whole-slide images (WSIs) with unprecedented detail, surpassing conventional hematoxylin and eosin (H&E) staining in speed and resolution. However, existing deep learning methods for breast cancer classification, predominantly patch-based, fragment spatial context and incur significant preprocessing overhead, limiting their clinical utility. Moreover, standard attention mechanisms, such as Spatial, Squeeze-and-Excitation, Global Context and Guided Context Gating, fail to fully exploit the rich, multi-scale regional relationships inherent in DUV-WSI data, often prioritizing generic feature recalibration over diagnostic specificity. This study introduces a novel Region-Affinity Attention mechanism tailored for DUV-WSI breast cancer classification, processing entire slides without patching to preserve spatial integrity. By modeling local neighbor distances and constructing a full affinity matrix, our method dynamically highlights diagnostically relevant regions, augmented by a contrastive loss to enhance feature discriminability. Evaluated on a dataset of 136 DUV-WSI samples, our approach achieves an accuracy of 92.67 +/- 0.73% and an AUC of 95.97%, outperforming existing attention methods.

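Summary: Breast cancer diagnosis requires fast and precise tools, but traditional histopathological methods often fall short in intra-operative settings. Deep ultraviolet (DUV) fluorescence imaging is emerging as a transformative approach that offers high-contrast, label-free visualization of whole-slide images (WSIs) in unprecedented detail, surpassing conventional hematoxylin and eosin (H&E) staining in speed and resolution. Existing deep learning methods for breast cancer classification are mostly patch-based; they fragment spatial context and incur significant preprocessing overhead, which limits their clinical utility. In addition, standard attention mechanisms such as Spatial, Squeeze-and-Excitation, Global Context, and Guided Context Gating fail to fully exploit the rich, multi-scale regional relationships inherent in DUV-WSI data, and often favor generic feature recalibration over diagnostic specificity. This study introduces a novel Region-Affinity Attention mechanism tailored to DUV-WSI breast cancer classification that processes entire slides without patching to preserve spatial integrity. By modeling local neighbor distances and constructing a full affinity matrix, the method dynamically highlights diagnostically relevant regions, with a contrastive loss added to improve feature discriminability. Evaluated on a dataset of 136 DUV-WSI samples, the approach achieves an accuracy of 92.67 +/- 0.73% and an AUC of 95.97%, outperforming existing attention methods.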

Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition

2604.17214v1 by Nwe Ni Win, Jim Basilakis, Steven Thomas, Seyhan Yazar, Laura Pierce, Stephanie Liu, Paul M. Middleton, Nasser Ghadiri, X. Rosalind Wang

Extracting clinically relevant information from unstructured medical narratives such as admission notes, discharge summaries, and emergency case histories remains a challenge in clinical natural language processing (NLP). Medical Entity Recognition (MER) identifies meaningful concepts embedded in these records. Recent advancements in large language models (LLMs) have shown competitive MER performance; however, evaluations often focus on general entity types, offering limited utility for real-world clinical needs requiring finer-grained extraction. To address this gap, we rigorously evaluated the open-source LLaMA3 model for fine-grained medical entity recognition across 18 clinically detailed categories. To optimize performance, we employed three learning paradigms: zero-shot, few-shot, and fine-tuning with Low-Rank Adaptation (LoRA). To further enhance few-shot learning, we introduced two example selection methods based on token- and sentence-level embedding similarity, utilizing a pre-trained BioBERT model. Unlike prior work assessing zero-shot and few-shot performance on proprietary models (e.g., GPT-4) or fine-tuning different architectures, we ensured methodological consistency by applying all strategies to a unified LLaMA3 backbone, enabling fair comparison across learning settings. Our results showed that fine-tuned LLaMA3 surpasses zero-shot and few-shot approaches by 63.11% and 35.63%, respectively, achieving an F1 score of 81.24% in granular medical entity extraction.

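Summary: Extracting clinically relevant information from unstructured medical narratives such as admission notes, discharge summaries, and emergency case histories remains a challenge in clinical natural language processing (NLP). Medical Entity Recognition (MER) identifies the meaningful concepts embedded in these records. Recent advances in large language models (LLMs) have shown competitive MER performance, but evaluations usually focus on general entity types and are of limited use for real-world clinical needs that require finer-grained extraction. To address this gap, the authors rigorously evaluate the open-source LLaMA3 model for fine-grained medical entity recognition across 18 clinically detailed categories. They use three learning paradigms: zero-shot, few-shot, and fine-tuning with Low-Rank Adaptation (LoRA). To strengthen few-shot learning, they introduce two example selection methods based on token-level and sentence-level embedding similarity, using a pre-trained BioBERT model. Unlike prior work that assessed zero-shot and few-shot performance on proprietary models (for example, GPT-4) or fine-tuned different architectures, they apply all strategies to a single LLaMA3 backbone, enabling a fair comparison across learning settings. Fine-tuned LLaMA3 surpasses the zero-shot and few-shot approaches by 63.11% and 35.63%, respectively, reaching an F1 score of 81.24% in granular medical entity extraction.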
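Sentence-level example selection for few-shot prompting, as mentioned in the entry above, amounts to ranking a pool of annotated sentences by embedding similarity to the query. The sketch below assumes the embeddings were already computed elsewhere (the paper uses BioBERT; no model is called here) and uses tiny made-up vectors and sentences. The function name `select_examples` is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_examples(query_vec, pool, k=3):
    """Return the k pool sentences whose embeddings are closest to the query.

    pool is a list of (sentence, embedding) pairs with precomputed embeddings.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Made-up 2-dimensional embeddings standing in for real sentence vectors.
pool = [
    ("Patient denies chest pain.", [1.0, 0.0]),
    ("Metformin 500 mg twice daily.", [0.0, 1.0]),
    ("Reports mild chest discomfort.", [0.9, 0.1]),
]
chosen = select_examples([1.0, 0.0], pool, k=2)
```

The selected sentences, together with their gold annotations, would then be placed in the prompt ahead of the query sentence.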

DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation

2604.17209v1 by Nagur Shareef Shaik, Teja Krishna Cherukuri, Dong Hye Ye

Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, leading to models that overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a novel framework for high-fidelity medical report generation that excels even with limited data. DREAM employs a unique two-stage fusion mechanism that intelligently integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enhancing visual data with pathology-relevant insights. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting the importance of each modality using learnable parameters to create a unified representation. To ensure the model's outputs are semantically grounded in clinical reality, a Contrastive Alignment module aligns these fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state-of-the-art on the DeepEyeNet benchmark, achieving a BLEU-4 score of 0.241, and further demonstrates strong generalization to the ROCO dataset.

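Summary: Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, which leads to models that overfit and miss subtle but critical pathologies. To address this, the authors introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a new framework for high-fidelity medical report generation that performs well even with limited data. DREAM uses a two-stage fusion mechanism that integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enriching the visual data with pathology-relevant information. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting each modality with learnable parameters to create a unified representation. To keep the model's outputs semantically grounded in clinical reality, a Contrastive Alignment module aligns the fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state of the art on the DeepEyeNet benchmark, with a BLEU-4 score of 0.241, and also generalizes well to the ROCO dataset.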

CDSA-Net: Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography

2604.17208v1 by Si Li, Chen-Kai Hu, Zhenhuan Lyu, Yuanqing He

Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, forcing reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produce images with two clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity that undermines diagnostic confidence. We propose CDSA-Net, a novel framework that, for the first time, explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA-Net introduces two core innovations: (i) A hierarchical geometric prior guidance (HGPG) mechanism, embedded in our coronary structure extraction network (CSENet). It synergistically combines integrated geometric prior (IGP) with gated spatial modulation (GSM) and centerline-aware topology (CAT) loss supervision, ensuring structural continuity. (ii) An adaptive noise module (ANM) within our coronary background restoration network (CBResNet). Unlike standard restoration, ANM uniquely models the stochastic nature of clinical X-ray noise, bridging the domain gap to enable seamless background intensity estimation and the complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, it significantly outperformed state-of-the-art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while maintaining diagnostic results consistent with raw angiograms. The project code is available at https://github.com/DrThink-ai/CDSA-Net.

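Summary: Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, which forces reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produce two clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity, both of which undermine diagnostic confidence. The authors propose CDSA-Net, a framework that for the first time explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA-Net introduces two core innovations: (i) a hierarchical geometric prior guidance (HGPG) mechanism embedded in the coronary structure extraction network (CSENet), which combines an integrated geometric prior (IGP), gated spatial modulation (GSM), and centerline-aware topology (CAT) loss supervision to ensure structural continuity; and (ii) an adaptive noise module (ANM) within the coronary background restoration network (CBResNet). Unlike standard restoration, ANM models the stochastic nature of clinical X-ray noise, bridging the domain gap to enable seamless background intensity estimation and complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, the method significantly outperforms state-of-the-art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while diagnostic results remain consistent with raw angiograms. The project code is available at https://github.com/DrThink-ai/CDSA-Net.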

Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training

2604.17186v1 by Weibing Zheng, Laurah Turner, Jess Kropczynski, Matthew Kelleher, Murat Ozer, Shane Halse

As Artificial Intelligence (AI) and Agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that a Multi-Agent Education System (MAES) is explainable from the early stages of requirements engineering (RE) within the AI software development lifecycle. Explainability is essential to build trust, promote transparency, and enable effective human-AI collaboration. Although personas are well-established in human-computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human-first, persona-driven, explainable MAES RE framework and demonstrates the framework through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, an AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform explainability requirements that guided the clinical reasoning training of medical students. A post-usage survey found that more than 78% of medical students reported that MAES improved their clinical reasoning skills. These findings demonstrate that RE based on persona effectively connects technical requirements with non-technical medical students from a human-centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of the AI system engineering. The partial MAES for the clinical scenario simulator is open sourced at https://github.com/2sigmaEdTech/MAS/.

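Summary: As artificial intelligence (AI) and agentic AI become increasingly integrated into sectors such as education and healthcare, it is critical that a Multi-Agent Education System (MAES) be explainable from the early requirements engineering (RE) stage of the AI software development lifecycle. Explainability is essential for building trust, promoting transparency, and enabling effective human-AI collaboration. Although personas are widely used in human-computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human-first, persona-driven RE framework for explainable MAES and demonstrates it through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of stakeholders, including medical educators, medical students, an AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform the explainability requirements that guided the clinical reasoning training of medical students. A post-usage survey found that more than 78% of medical students reported that the MAES improved their clinical reasoning skills. These findings indicate that persona-based RE effectively connects technical requirements with non-technical medical students through a human-centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of AI system engineering. The partial MAES for the clinical scenario simulator is open sourced at https://github.com/2sigmaEdTech/MAS/.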

If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data

2604.17133v1 by Yanjun Cui, Ali Emami, Temiloluwa Prioleau, Nikhil Singh

Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms only offer static summaries which do not support inquisitive user queries. Large language models (LLMs) could enable free-form inquiries about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, we construct a benchmark of 4,180 questions combining parameterized question templates with real user queries and ground truth derived from deterministic program execution. Evaluating 6 leading LLMs, we find that top models achieve 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.

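Summary: Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. Current patient platforms, however, offer only static summaries and cannot support exploratory user queries. Large language models (LLMs) could enable free-form questions about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. The paper presents CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In this design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation happens locally, and personal health data never leaves the user's device. For evaluation, the authors build a benchmark of 4,180 questions that combines parameterized question templates with real user queries, with ground truth derived from deterministic program execution. Across 6 leading LLMs, top models reach 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem mainly from intent and temporal ambiguity rather than computational failures. Lightweight models also achieve competitive performance in the agent design, suggesting opportunities for low-cost deployment. The code and benchmark are released to support future work on trustworthy health agents.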
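The division of labour described for CGM-Agent (the model only picks an analytical function, and the computation runs locally on the data) can be sketched with a small function registry. This is an illustrative design under stated assumptions, not the released code: the function names, the plan format, and the default 70 to 180 mg/dL range are choices made here, and the "plan" is hard-coded rather than produced by an LLM.

```python
from statistics import mean

def mean_glucose(readings):
    """Average glucose over the supplied readings (mg/dL)."""
    return mean(readings)

def time_in_range(readings, low=70, high=180):
    """Fraction of readings within [low, high] mg/dL."""
    return sum(1 for r in readings if low <= r <= high) / len(readings)

# Registry of functions the reasoning model is allowed to select from.
TOOLS = {"mean_glucose": mean_glucose, "time_in_range": time_in_range}

def execute_plan(plan, readings):
    """Run the selected function locally; only the plan ever comes from the model."""
    name = plan["function"]
    if name not in TOOLS:
        raise ValueError(f"unknown function: {name}")
    return TOOLS[name](readings, **plan.get("args", {}))

# Hypothetical plan a model might return for "How much of my day was in range?"
readings = [100, 200, 150, 60]
plan = {"function": "time_in_range", "args": {"low": 70, "high": 180}}
answer = execute_plan(plan, readings)
```

Because the model sees only the question and the function catalogue, the raw readings never need to be sent to it, which is the privacy property the abstract emphasizes.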

A Two-Stage Deep Learning Framework for Segmentation of Ten Gastrointestinal Organs from Coronal MR Enterography

2604.17118v1 by Ashiqur Rahman, Md. Abu Sayed, Md Sharjis Ibne Wadud, Md. Abu Asad Al-Hafiz, Adam Mushtak, Muhammad E. H. Chowdhury

Accurate segmentation of gastrointestinal (GI) organs in magnetic resonance enterography (MRE) is critical for diagnosing inflammatory bowel disease (IBD). However, anatomical variability, class imbalance, and low tissue contrast hinder reliable automation. This study proposes a dual-stage deep learning framework for organ-specific segmentation of GI structures from coronal MRE images to address these challenges. A publicly available MRE dataset of 3,195 coronal T2-weighted HASTE slices from 114 IBD patients was used. Initially, a DenseNet201-UNet++ model generated coarse masks for ROI extraction. A DenseNet121-SelfONN-UNet model was then trained on organ-specific patches. Extensive data augmentation, normalization, five-fold cross-validation, and class-specific weighting were applied to mitigate severe class imbalance, particularly for the appendix. The initial stage achieved strong organ localization but underperformed for the appendix; class weighting improved its DSC from 6.76% to 85.76%. The second-stage DenseNet121-SelfONN-UNet significantly enhanced segmentation across all GI structures, with notable DSC gains (cecum +23.62%, sigmoid +18.57%, rectum +17.99%, small intestine +16.06%). Overall, the framework achieved mDSC of 88.99%, mIoU of 84.76%, and mHD95 of 6.94 mm, outperforming all baselines. This framework demonstrates the effectiveness of a coarse-to-fine, organ-aware segmentation strategy for intestinal MRE. Despite higher computational cost, it shows strong potential for clinical translation and enables anatomically informed diagnostic tools in gastroenterology.

摘要:準確地對磁共振腸道攝影(MRE)中的胃腸道(GI)器官進行分割對於診斷炎症性腸病(IBD)至關重要。然而,解剖變異性、類別不平衡和低組織對比度妨礙了可靠的自動化。本研究提出了一種雙階段深度學習框架,用於從冠狀MRE圖像中進行器官特定的GI結構分割,以應對這些挑戰。使用了一個公開可用的MRE數據集,其中包含來自114名IBD患者的3,195個冠狀T2加權HASTE切片。最初,DenseNet201-UNet++模型生成了ROI提取的粗略掩膜。然後,對器官特定的補丁進行了DenseNet121-SelfONN-UNet模型的訓練。採用了廣泛的數據增強、正規化、五折交叉驗證和類別特定的加權,以減輕嚴重的類別不平衡,特別是對於闌尾。初始階段實現了強大的器官定位,但對於闌尾的表現不佳;類別加權將其DSC從6.76%提高到85.76%。第二階段的DenseNet121-SelfONN-UNet顯著提高了所有GI結構的分割,並且DSC增益顯著(盲腸 +23.62%,乙狀結腸 +18.57%,直腸 +17.99%,小腸 +16.06%)。總體而言,該框架達到了88.99%的mDSC、84.76%的mIoU和6.94 mm的mHD95,超越了所有基準。該框架展示了粗到細的器官感知分割策略在腸道MRE中的有效性。儘管計算成本較高,但它顯示出強大的臨床轉化潛力,並使胃腸學中的解剖知情診斷工具成為可能。
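The abstract reports per-organ and mean Dice similarity coefficients (DSC, mDSC). As a reading aid, here is a minimal sketch of how such scores are commonly computed from label maps. The function names, the flat-list mask representation, and the convention that two empty masks score 1.0 are assumptions made for illustration; they are not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient (DSC) between two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_dice(pred_labels, target_labels, classes):
    """Unweighted mean of the per-class DSC over the given class ids."""
    scores = []
    for c in classes:
        pred = [label == c for label in pred_labels]
        target = [label == c for label in target_labels]
        scores.append(dice_coefficient(pred, target))
    return sum(scores) / len(scores)
```

Because the mean is unweighted, a small structure such as the appendix counts as much as a large one, which is one reason a single poorly segmented class can pull mDSC down noticeably.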

Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization

2604.17051v1 by Weijie Wan, Jiangjiang Zhao

Large Language Models (LLMs) have demonstrated excellent performance in general language understanding, generation and other tasks. However, when fine-tuning for specific domain tasks, the general knowledge accumulated in the pre-training phase is often partially overwritten or forgotten due to parameter updates, which severely limits the generalization ability and transferability of LLMs. Traditional fine-tuning strategies mostly train on the entire parameter space, ignoring the heterogeneity of model parameters, that is, some parameters are extremely important for general tasks, while other parameters are more sensitive to specific tasks. To alleviate the above problems, this paper innovatively proposes a parameter element importance evaluation method, which divides parameters into "core parameters" and "non-core parameters" by distinguishing the importance of parameters for general language ability tasks and specific domain tasks, and fixes the core parameters during fine-tuning, and only fine-tunes the non-core parameters. Extensive experiments on scientific, medical and physical tasks using GPT-J and LLaMA-3 show that our method can mitigate catastrophic forgetting while enhancing the adaptability of the model.

摘要:大型語言模型(LLMs)在一般語言理解、生成及其他任務中展現了卓越的表現。然而,在針對特定領域任務進行微調時,預訓練階段積累的一般知識往往因參數更新而部分被覆蓋或遺忘,這嚴重限制了LLMs的泛化能力和轉移能力。傳統的微調策略大多數是在整個參數空間上進行訓練,忽視了模型參數的異質性,也就是說,有些參數對一般任務極為重要,而其他參數則對特定任務更為敏感。為了緩解上述問題,本文創新性地提出了一種參數元素重要性評估方法,通過區分參數對一般語言能力任務和特定領域任務的重要性,將參數分為「核心參數」和「非核心參數」,並在微調過程中固定核心參數,只微調非核心參數。在使用GPT-J和LLaMA-3進行的科學、醫學和物理任務的廣泛實驗表明,我們的方法可以減輕災難性遺忘,同時增強模型的適應性。
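The core idea described above (freeze "core" parameters, update only the rest) can be illustrated with a small sketch. This is not the paper's importance metric: the scores are assumed to be given, the ranking uses only a general-task importance score, and `core_fraction`, `split_core_parameters`, and `masked_sgd_step` are hypothetical names.

```python
def split_core_parameters(general_importance, core_fraction):
    """Return names of the top `core_fraction` parameters by general-task importance."""
    ranked = sorted(general_importance, key=general_importance.get, reverse=True)
    n_core = int(len(ranked) * core_fraction)
    return set(ranked[:n_core])


def masked_sgd_step(params, grads, core, lr):
    """Apply one SGD step, leaving core (frozen) parameters untouched."""
    return {
        name: value if name in core else value - lr * grads[name]
        for name, value in params.items()
    }
```

In a real framework the same effect is usually obtained by masking gradients or disabling gradient tracking for the frozen entries; the dictionary form here only makes the selection logic explicit.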

Light-Adapted Electroretinogram and Oscillatory Potentials (LEOPs) Dataset for Autism Spectrum Disorder and Typically Developing Individuals

2604.16981v1 by Paul A. Constable, Dorothy A. Thompson, Irene O. Lee, Lynne Loh, Aleksei Zhdanov, Mikhail Kulyabin, Andreas Maier

The LEOPs (Light-ERG-Oscillatory Potentials) dataset provides light-adapted (LA) electroretinogram (ERG) and Oscillatory Potentials (OPs) waveforms for typically developing Control, Autism Spectrum Disorder (ASD) and ASD + Attention Deficit Hyperactivity Disorder (ADHD) childhood and adolescent populations. The ERGs were recorded in the right and left eyes with skin electrodes using the handheld RETeval device at two sites in Australia and the United Kingdom. The LEOPs dataset includes 5309 single-flash ERG and 4434 OPs waveforms as well as images selected from each participant showing the position of the skin electrode. The LEOPs dataset is constructed from recordings using a 9-step randomized flash series from $-0.37$ to $1.20$~$Td.s$, a 2-step series at flash strengths of 113 and 446 $Td.s$ (2500 Control, 1730 ASD and 451 ASD + ADHD samples), as well as the $85$~$Td.s$ (Light Adapted 3 $cd.s.m^{-2}$ (LA3)) equivalent International Society of Clinical Electrophysiology of Vision (ISCEV) Standard flash with 435 Control, 176 ASD and 37 ASD + ADHD waveform samples. Code for the stimulus is provided along with participant demographics, date and time of testing, and, where available, diagnostic scores for the ASD and ASD + ADHD groups, alongside iris color, electrode position with image files, time-domain values for the ERG, and summed values for the OPs. The repository contains an Excel file, patient-level exported JSON files that are more suitable for machine learning tasks, images of electrode position for each recording, and the protocol files for use with the RETeval.

摘要:LEOPs(光-ERG-振盪電位)數據集提供了光適應(LA)的視網膜電圖(ERG)和振盪電位(OPs)波形,針對典型發展的控制組、自閉症譜系障礙(ASD)以及ASD + 注意力缺陷過動症(ADHD)兒童和青少年群體。ERG是在澳大利亞和英國的兩個地點使用手持RETeval設備的皮膚電極記錄的,記錄了左右眼的數據。LEOPs數據集包括5309個單閃ERG和4434個OPs波形,以及每位參與者所選擇的顯示皮膚電極位置的圖像。LEOPs數據集是基於使用9步隨機閃光系列的記錄構建的,閃光強度範圍從$-0.37$到$1.20$~$Td.s$,在113和446 $Td.s$的2步閃光強度下(2500個控制組、1730個ASD和451個ASD + ADHD樣本),以及$85$~$Td.s$(光適應3 $cd.s.m^{-2}$(LA3))等效的國際臨床視覺電生理學會(ISCEV)標準閃光,對應435個控制組、176個ASD和37個ASD + ADHD波形樣本。刺激的代碼與參與者的人口統計信息、測試的日期和時間,以及在可能的情況下,ASD和ASD + ADHD組的診斷分數,還有虹膜顏色、電極位置的圖像文件和ERG的時間域值以及OPs的總和值一起提供。該資料庫包含Excel文件、以患者為單位導出的JSON文件,這些文件更適合機器學習任務,以及每次記錄的電極位置圖像和用於RETeval的協議文件。

Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

2604.16980v1 by Bruce A. Bassett, Amy Rouillard, Sitwala Mundia, Michael Cameron Gramanie, Linda Camara, Ziyaad Dangor, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Ismail Kalla, Haroon Saloojee

Background: Large language models (LLMs) are increasingly proposed for diagnostic support, but few evaluations use real-world multimodal inpatient data, particularly in low- and middle-income country (LMIC) public hospitals. Methods: We conducted VALID, a retrospective evaluation of 539 multimodal inpatient cases from a tertiary public hospital in South Africa. Inputs included radiology imaging (CT, MRI, CXR) and reports, laboratory results, clinical notes, and vital signs. Expert panels adjudicated 300 cases (balanced and discordant subsets) to establish ground truth diagnoses, differentials, and reasoning. Ten multimodal LLMs generated zero-shot outputs. A calibrated three-model LLM Jury scored all outputs and routine ward diagnoses across diagnostic accuracy, differential quality, reasoning, and patient safety (>10,000 evaluations). Primary outcomes were composite scores ($S_3$, $S_4$) and win rates. Results: (i) LLM performance was tightly clustered (<15% variation) despite large cost differences; low-cost models performed comparably to top models. (ii) All LLMs significantly outperformed routine ward diagnoses on average diagnostic and safety scores. (iii) Top performance was achieved by GPT-5.1, followed by Gemini models. (iv) Adding radiology reports improved performance by 6%. (v) Diagnostic and reasoning scores were highly correlated ($ρ = 0.85$). (vi) Output rates varied (65-100%) due to input constraints. Results were robust across subsets and evaluation design. Conclusions: Across a real-world LMIC dataset, multimodal LLMs showed similar diagnostic performance despite large cost differences and outperformed routine care on average safety metrics. Affordability, robustness, and deployment constraints may outweigh marginal performance differences in LMIC settings.

摘要:背景:大型語言模型(LLMs)越來越多地被提議用於診斷支持,但很少有評估使用真實世界的多模態住院病人數據,特別是在低收入和中等收入國家(LMIC)的公立醫院中。方法:我們進行了VALID,一項回顧性評估,涵蓋來自南非一所三級公立醫院的539例多模態住院病例。輸入包括放射學影像(CT、MRI、CXR)和報告、實驗室結果、臨床筆記和生命體徵。專家小組對300例病例(平衡和不一致子集)進行裁定,以確立真實診斷、鑑別診斷和推理。十個多模態LLM生成了零樣本輸出。一個經過校準的三模型LLM評審小組對所有輸出和常規病房診斷進行了評分,涵蓋診斷準確性、鑑別質量、推理和病人安全(超過10,000次評估)。主要結果是綜合分數($S_3$、$S_4$)和勝率。結果:(i)儘管成本差異很大,LLM的表現緊密聚集(<15%的變異);低成本模型的表現與頂級模型相當。(ii)所有LLM在平均診斷和安全分數上顯著超過常規病房診斷。(iii)最佳表現由GPT-5.1實現,其次是Gemini模型。(iv)添加放射學報告使表現提高了6%。(v)診斷和推理分數高度相關($ρ = 0.85$)。(vi)由於輸入限制,輸出率有所不同(65-100%)。結果在各子集和評估設計中都很穩健。結論:在真實世界的LMIC數據集中,多模態LLM顯示出相似的診斷表現,儘管成本差異很大,並且在平均安全指標上超過了常規護理。在LMIC環境中,負擔能力、穩健性和部署限制可能超過邊際表現差異。

Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction

2604.16955v1 by Liyin Chen, Nazlee Zebardast, Mengyu Wang, Tobias Elze, Jason I. Comander

Quantitative prediction of future retinal appearance from longitudinal imaging would support clinical decisions in progressive macular disease that currently rely on qualitative comparison or scalar progression scores. Recent methods have moved toward increasing generative complexity, but whether this complexity is necessary for slowly progressing retinal disease is unclear. We tested this through a controlled comparison of five conditioning configurations sharing one architecture and training dataset, spanning standard conditional diffusion, inference-aligned stochastic training, and deterministic regression. In our evaluation, aligning the training and inference input distributions produced large gains (delta-SSIM +0.082, SSIM +0.086, both p < 0.001), while the choice among aligned frameworks did not significantly affect any primary metric. Task-entropy and posterior-concentration analyses, replicated on two fundus autofluorescence (FAF) platforms, provided a mechanistic account: the predictable component of inter-visit change is small relative to time-invariant acquisition variability, leaving stochastic sampling with little width to exploit. Guided by these findings, we developed TRU (Temporal Retinal U-Net), a deterministic direct-regression model with continuous time-delta conditioning and multi-scale history aggregation. We evaluated TRU on 28,902 eyes across three imaging platforms: a mixed-disease Optos FAF cohort (9,942 eyes), zero-shot transfer to Stargardt macular dystrophy on Optos (288 eyes) and Heidelberg Spectralis (125 eyes), and a boundary evaluation on Cirrus en-face fundus images from a glaucoma cohort (18,547 eyes). TRU matched or exceeded delta-SSIM, SSIM, and PSNR in every FAF cohort against three state-of-the-art benchmarks, and its advantage grew monotonically with available history length.

摘要:從縱向影像中定量預測未來視網膜的外觀將支持在進行性黃斑疾病中的臨床決策,這些決策目前依賴於定性比較或標量進展評分。最近的方法已經朝著增加生成複雜性發展,但這種複雜性是否對於緩慢進展的視網膜疾病是必要的尚不清楚。我們通過對五種共享一個架構和訓練數據集的條件配置進行受控比較來測試這一點,這些配置涵蓋了標準條件擴散、推理對齊隨機訓練和確定性回歸。在我們的評估中,對齊訓練和推理輸入分佈產生了顯著的增益(delta-SSIM +0.082,SSIM +0.086,均 p < 0.001),而在對齊框架之間的選擇對任何主要指標並沒有顯著影響。任務熵和後驗集中度分析,在兩個眼底自發螢光(FAF)平台上重複進行,提供了一個機制解釋:訪問之間變化的可預測組件相對於時間不變的獲取變異性是小的,留下隨機取樣的利用空間很小。在這些發現的指導下,我們開發了 TRU(Temporal Retinal U-Net),這是一個具有連續時間差條件和多尺度歷史聚合的確定性直接回歸模型。我們在三個影像平台上對 28,902 隻眼睛進行了 TRU 的評估:一個混合疾病的 Optos FAF 隊列(9,942 隻眼睛)、在 Optos 上對 Stargardt 黃斑變性進行零次轉移(288 隻眼睛)和 Heidelberg Spectralis(125 隻眼睛),以及對來自青光眼隊列的 Cirrus 面對面眼底圖像的邊界評估(18,547 隻眼睛)。TRU 在每個 FAF 隊列中與三個最先進的基準相比,匹配或超過了 delta-SSIM、SSIM 和 PSNR,其優勢隨著可用歷史長度的增加而單調增長。

Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach

2604.16953v1 by Riza Alaudin Syah, Irwan Alnarus Kautsar, Gunawan Witjaksono, Haza Nuzly bin Abdull Hamed

Breast cancer diagnosis through thermographic image analysis remains a critical challenge in medical AI, with classical deep learning approaches facing limitations in complex thermal pattern classification tasks. This paper presents a novel Hybrid Quantum Neural Network (HQNN) architecture that integrates quantum computing principles with classical convolutional neural networks for enhanced breast cancer classification. Our approach employs parameterized quantum circuits with multi-head attention mechanisms for quantum-aware feature encoding, coupled with classical convolutional layers for comprehensive pattern recognition. The quantum component utilizes a 4-qubit variational circuit with strongly entangling layers, while the classical component incorporates advanced attention mechanisms for feature fusion. Experimental validation on breast cancer thermographic data demonstrates substantial performance improvements over state-of-the-art classical architectures, with the quantum-enhanced approach exhibiting superior convergence dynamics and enhanced feature representation capabilities. Our findings provide evidence for quantum advantage in medical image classification through classical simulation, establishing a framework for quantum-classical hybrid systems in healthcare applications. The methodology addresses key challenges in quantum machine learning deployment while maintaining computational feasibility on near-term quantum devices.

摘要:乳腺癌的熱成像圖像分析診斷在醫療人工智慧中仍然是一個關鍵挑戰,傳統深度學習方法在複雜的熱模式分類任務中面臨限制。本文提出了一種新穎的混合量子神經網絡(HQNN)架構,將量子計算原則與傳統卷積神經網絡相結合,以增強乳腺癌的分類。我們的方法採用了帶有多頭注意力機制的參數化量子電路進行量子感知特徵編碼,並結合傳統卷積層進行全面的模式識別。量子組件利用具有強耦合層的4量子位變分電路,而傳統組件則融合了先進的注意力機制以進行特徵融合。在乳腺癌熱成像數據上的實驗驗證顯示,與最先進的傳統架構相比,性能有顯著改善,量子增強的方法展現出優越的收斂動態和增強的特徵表示能力。我們的研究結果提供了量子優勢在醫療影像分類中的證據,通過傳統模擬建立了量子-傳統混合系統在醫療應用中的框架。該方法論解決了量子機器學習部署中的關鍵挑戰,同時在近期的量子設備上保持計算可行性。

Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts

2604.16926v1 by Gabriel Jason Lee, Jathurshan Pradeepkumar, Jimeng Sun

Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet their clinical deployment is hindered by distribution shifts across clinical settings, devices, and populations. Test-time adaptation (TTA) offers a promising solution by enabling models to adapt to unlabeled target data during inference without access to source data, a valuable property in healthcare settings constrained by privacy regulations and limited labeled data. However, its effectiveness for EEG remains largely underexplored. In this work, we introduce NeuroAdapt-Bench, a systematic benchmark for evaluating test-time adaptation methods on EEG foundation models under realistic distribution shifts. We evaluate representative TTA approaches from other domains across multiple pretrained foundation models, diverse downstream tasks, and heterogeneous datasets spanning in-distribution, out-of-distribution, and extreme modality shifts (e.g., Ear-EEG). Our results show that standard TTA methods yield inconsistent gains and often degrade performance, with gradient-based approaches particularly prone to heavy degradation. In contrast, optimization-free methods demonstrate greater stability and more reliable improvements. These findings highlight the limitations of existing TTA techniques in EEG, provide guidance for future development, and underscore the need for domain-specific adaptation strategies.

摘要:腦電圖(EEG)基礎模型顯示出從大規模神經數據中學習可泛化表示的強大潛力,但其臨床應用受到臨床環境、設備和人群之間分佈變化的限制。測試時適應(TTA)提供了一個有前景的解決方案,通過使模型在推理過程中能夠適應未標記的目標數據,而無需訪問源數據,這在受到隱私法規和有限標記數據約束的醫療環境中是一個寶貴的特性。然而,對於EEG來說,其有效性仍然在很大程度上未被探索。在這項工作中,我們介紹了NeuroAdapt-Bench,這是一個系統性的基準,用於評估EEG基礎模型在現實分佈變化下的測試時適應方法。我們評估了來自其他領域的代表性TTA方法,涵蓋多個預訓練基礎模型、多樣的下游任務以及跨越分佈內、分佈外和極端模態變化(例如,耳部EEG)的異質數據集。我們的結果顯示,標準TTA方法產生的不一致增益,並且經常導致性能下降,特別是基於梯度的方法容易受到嚴重退化的影響。相比之下,無需優化的方法顯示出更大的穩定性和更可靠的改進。這些發現突顯了現有TTA技術在EEG中的局限性,為未來的發展提供了指導,並強調了需要特定於領域的適應策略。

Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models

2604.16775v1 by Inhyeok Lee, Luke Solo, Michael C. Burkhart, Bashar Ramadan, William F. Parker, Brett K. Beaulieu-Jones

Every prediction from a generative medical event model is bounded by how clinical events are tokenized, yet input representation is rarely isolated from other system and architectural choices. We evaluate how representation decisions affect downstream prediction after a shared one-epoch pretraining budget. We train 28 matched transformers on MIMIC-IV and evaluate them on 30 clinical outcomes in three experiments: (1) quantization granularity, reference-range anchoring, and code-value fusion; (2) value encoding (hard bins, soft discretization, code-normalized xVal) crossed with temporal encoding (event order, time tokens, admission-relative RoPE); and (3) native MIMIC laboratory/vital codes versus the Common Longitudinal ICU Format (CLIF)-remapped laboratory/vital codes with compression-preserving perturbation arms. In Experiment 1, fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 (BH-adjusted p < 0.001), hospital length-of-stay AUROC from 0.763 to 0.788 (BH-adjusted p < 0.001), and, for the decile fused-vs-unfused comparison, mean regression Spearman rho across the 13 regression outcomes from 0.414 to 0.494. Across the three temporal encodings, event order only and admission-relative RoPE match or exceed inserting time tokens on average while shortening sequences by 11%. CLIF remapping preserves downstream performance in our single-site setting while yielding a smaller, clinically interpretable token set compatible with multi-site use. Finer-than-decile quantization, reference-range anchoring, and soft discretization help in selective outcomes, while code-normalized xVal remains well below the discrete and soft families, consistent with near-median suppression that persists after the affine variant.

摘要:每個生成醫療事件模型的預測都受到臨床事件標記方式的限制,但輸入表示通常不會與其他系統和架構選擇隔離。我們評估表示決策如何影響共享的一個訓練預算後的下游預測。我們在 MIMIC-IV 上訓練了 28 個匹配的Transformer,並在三個實驗中對 30 個臨床結果進行評估:(1) 量化粒度、參考範圍錨定和代碼值融合;(2) 值編碼(硬箱、軟離散化、代碼標準化 xVal)與時間編碼(事件順序、時間標記、入院相對 RoPE)交叉;以及 (3) 原生 MIMIC 實驗室/生命體徵代碼與經 CLIF 重新映射的實驗室/生命體徵代碼,並採用保持壓縮的擾動臂。在實驗 1 中,融合的代碼值標記提高了死亡率 AUROC 從 0.891 到 0.915(BH 調整後 p < 0.001)、住院天數 AUROC 從 0.763 到 0.788(BH 調整後 p < 0.001),以及在融合與未融合的十分位比較中,13 個回歸結果的平均回歸斯皮爾曼 rho 從 0.414 提高到 0.494。在三種時間編碼中,僅事件順序和入院相對 RoPE 的表現平均匹配或超過插入時間標記的效果,同時縮短序列長度 11%。CLIF 重新映射在我們的單一站點設置中保持下游性能,同時產生一個較小且臨床可解釋的標記集,適用於多站點使用。比十分位更細的量化、參考範圍錨定和軟離散化在選擇性結果中有所幫助,而代碼標準化 xVal 的表現仍然遠低於離散和軟家族,這與在仿射變體後持續存在的接近中位數抑制一致。
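To make the "fused code-value tokenization" and decile quantization in Experiment 1 more concrete, the following sketch contrasts a fused token with the unfused two-token form. The token format (`code|Q3`), the helper names, and the simple empirical cut-point rule are illustrative assumptions, not the paper's implementation.

```python
import bisect


def decile_edges(values):
    """Nine cut points that split the sorted training values into ten bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // 10] for k in range(1, 10)]


def fused_token(code, value, edges):
    """One token that carries both the event code and its value bin (0..9)."""
    return f"{code}|Q{bisect.bisect_right(edges, value)}"


def unfused_tokens(code, value, edges):
    """Two tokens: the event code, then a code-independent value-bin token."""
    return [code, f"Q{bisect.bisect_right(edges, value)}"]
```

The fused form halves the number of tokens per measurement at the cost of a vocabulary that grows with codes times bins, which is the kind of trade-off the benchmark isolates.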

CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction

2604.16742v1 by Jianyou Wang, Youze Zheng, Longtian Bao, Hanyuan Zhang, Qirui Zheng, Yuhan Chen, Yang Zhang, Matthew Feng, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Ramamohan Paturi, Umber Dube, Leon Bergen

Scientists have long sought to accurately predict outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining whether a trial's outcome is public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy against human expert annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes before they occur, while also informing biomedical research and improving clinical trial design. The CT Open platform is hosted at https://ct-open.net/.

摘要:科學家們長期以來一直尋求在事件發生之前準確預測現實世界事件的結果。人工智慧系統能否更可靠地做到這一點?我們通過臨床試驗結果預測來研究這個問題,這對於領域專家來說是一個高風險的公開挑戰。我們介紹了 CT Open,一個每年舉辦四次挑戰的開放訪問即時平台。任何人都可以為每個挑戰提交預測。CT Open 在提交時對於那些結果尚未公開的試驗進行評估,但這些結果在之後會公開。確定某個試驗的結果在某個日期之前是否在互聯網上公開,實際上是相當困難的。官方登記處上發布的結果可能會滯後數年,而第一次提及可能出現在不知名的文章中。為了解決這個問題,我們提出了一個新穎的、完全自動化的去污流程,利用迭代的 LLM 驅動的網絡搜索來識別試驗結果的最早提及。我們通過人類專家的註釋來驗證該流程的質量和準確性。由於 CT Open 的流程確保每個被評估的試驗在預測時沒有公開報告的結果,因此參與者可以使用任何方法和任何數據來源。在本文中,我們發布了一個訓練集和兩個時間戳測試基準,分別是 2025 年冬季和 2025 年夏季。我們相信 CT Open 可以作為推進人工智慧研究以預測現實世界結果的中心樞紐,同時也能為生物醫學研究提供信息並改善臨床試驗設計。CT Open 平台托管於 $\href{https://ct-open.net/}{https://ct-open.net/}$
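The eligibility rule described above (a trial is scored only if its outcome first became public after the submission deadline) reduces to a date comparison once the earliest public mention is known. The sketch below assumes that date has already been found by some upstream search step; the function name, the trial identifiers, and the use of `None` for "no public outcome yet" are hypothetical.

```python
from datetime import date


def eligible_trials(earliest_mention, submission_deadline, evaluation_date):
    """Trial ids whose outcome first became public after the deadline
    and no later than the evaluation date."""
    return sorted(
        trial_id
        for trial_id, first_public in earliest_mention.items()
        if first_public is not None
        and submission_deadline < first_public <= evaluation_date
    )


# Hypothetical example data.
mentions = {
    "TRIAL-A": date(2025, 1, 10),   # already public before the deadline
    "TRIAL-B": date(2025, 4, 2),    # published after the deadline
    "TRIAL-C": None,                # no outcome found yet
}
print(eligible_trials(mentions, date(2025, 2, 1), date(2025, 6, 30)))
```

The hard part, as the abstract notes, is estimating the earliest mention reliably; the comparison itself is trivial once that date is trusted.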

Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis

2604.16729v1 by Ayhan Can Erdur, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C. Peeken

State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve complex neuro-radiological image analysis tasks through tool use without the need for training or fine-tuning.

摘要:最先進的大型語言模型(LLMs)在一般視覺問題回答方面表現出色。然而,仍然存在一個根本性的限制:當前的架構缺乏進行體積醫學影像(如 CT 或 MRI)直接分析所需的原生 3D 空間推理。新興的代理 AI 提供了一種新解決方案,通過使 LLM 能夠協調和利用專門的外部工具,消除了對內在 3D 處理的需求。然而,這種代理框架在複雜的多步放射學工作流程中的可行性仍未得到充分探索。在這項工作中,我們提出了一個無需訓練的代理管道,用於自動化腦部 MRI 分析。我們在幾個 LLM(GPT-5.1、Gemini 3 Pro、Claude Sonnet 4.5)上驗證我們的方法,並使用現成的領域專用工具,我們的系統自主執行複雜的端到端工作流程,包括預處理(去顱骨、註冊)、病理分割(膠質瘤、腦膜瘤、轉移瘤)和體積分析。我們在越來越複雜的放射學任務中評估我們的框架,從單掃描分割和體積報告到需要多時間點比較的縱向反應評估。我們通過比較單代理模型與多代理「領域專家」合作來分析架構設計的影響。最後,為了支持未來代理系統的嚴格評估,我們引入並發布了一個基準數據集,該數據集由來自公共 BraTS 數據的圖像-提示-答案元組組成。我們的結果表明,代理 AI 可以通過工具使用解決高度神經放射學影像分析任務,而無需訓練或微調。

A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction

2604.16655v1 by Dingyi Zhang, Ruiying Liu, Yun Wang

The accurate quantification of brain age from MRI has emerged as an important biomarker of brain health. However, existing approaches are often restricted to narrow age ranges and single-modality MRI data, limiting their capacity to capture the coordinated macro- and microstructural changes that unfold across the human lifespan. To address these limitations, we developed a multi-modal brain age framework to characterize the integrated evolution of brain morphology and white matter organization. Our model adopts a two-stage architecture, where modalities are processed independently and integrated via late fusion in both stages: first to classify each subject into one of six developmental stages, and then to estimate age within the predicted stage. This design enables a unified and lifespan-spanning assessment of brain maturity across diverse developmental periods.

摘要:腦部年齡的準確量化已成為腦部健康的重要生物標記。然而,現有的方法往往限於狹窄的年齡範圍和單一模態的MRI數據,限制了它們捕捉人類生命週期中協調的宏觀和微觀結構變化的能力。為了解決這些限制,我們開發了一個多模態腦部年齡框架,以描述腦部形態和白質組織的綜合演變。我們的模型採用兩階段架構,其中模態獨立處理,並在兩個階段通過後期融合進行整合:首先將每個受試者分類為六個發展階段之一,然後在預測的階段內估計年齡。這一設計使得對不同發展時期的腦部成熟度進行統一且跨生命週期的評估成為可能。
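The two-stage design (classify the developmental stage, then estimate age within that stage, with late fusion in both stages) can be sketched as follows. Equal-weight averaging is only one possible late-fusion rule and is assumed here for simplicity; the function names and the shape of the inputs are likewise illustrative.

```python
def fuse(vectors):
    """Late fusion by averaging equally shaped per-modality output vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict_brain_age(stage_probs_per_modality, age_estimates_per_stage):
    """Stage 1: pick the most probable stage from fused class probabilities.
    Stage 2: fuse the per-modality age estimates of that stage's regressor."""
    fused = fuse(stage_probs_per_modality)
    stage = max(range(len(fused)), key=fused.__getitem__)
    ages = age_estimates_per_stage[stage]
    return stage, sum(ages) / len(ages)
```

Restricting the regressor to one stage narrows the age range it has to cover, but a stage misclassification then propagates directly into the age estimate.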

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

2604.16175v1 by Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

摘要:自動化的 3D 放射學報告生成常常遭受臨床幻覺和缺乏人類實踐中所見的迭代驗證的問題。儘管近期的視覺-語言模型(VLMs)已經推進了該領域,但它們通常作為單一的「黑箱」系統運作,缺乏臨床工作流程中典型的協作監督。為了解決這些挑戰,我們提出了 MARCH(多代理放射學臨床層級),這是一個多代理框架,模擬放射學部門的專業層級,並為不同的代理分配專門角色。MARCH 利用住院醫師代理進行初步草擬,並進行多尺度 CT 特徵提取,使用多個研究員代理進行檢索增強的修訂,以及一位主治醫師代理協調基於立場的迭代共識討論,以解決診斷差異。在 RadGenome-ChestCT 數據集上,MARCH 在臨床忠實度和語言準確性方面顯著超越了最先進的基準。我們的研究表明,模擬類人組織結構可以提高人工智慧在高風險醫療領域的可靠性。

Hybrid Spectro-Temporal Fusion Framework for Structural Health Monitoring

2604.16589v1 by Jongyeop Kim, Jinki Kim, Doyun Lee

Structural health monitoring plays a critical role in ensuring structural safety by analyzing vibration responses from engineering systems. This paper proposes a Spectro-Temporal Alignment framework and a Hybrid Spectro-Temporal Fusion framework that integrate arrival-time interval descriptors with spectral features to capture both fine-scale and coarse-scale vibration dynamics. Experiments conducted on data collected from an LDS V406 electrodynamic shaker demonstrate that the proposed spectro-temporal representations significantly outperform conventional input formulations. The results indicate that a temporal resolution (Δτ) of 0.02 favors traditional machine learning models, whereas a finer resolution (Δτ) of 0.008 effectively unlocks the performance potential of deep learning architectures. Beyond classification accuracy, a comprehensive stability analysis based on condensed indices, including mean performance, standard deviation, coefficient of variation, and balanced score, shows that the proposed hybrid framework consistently achieves higher accuracy with substantially lower variability compared to baseline and alignment-only approaches. Overall, these results demonstrate that the proposed framework provides a robust, accurate, and reliable solution for vibration-based structural health monitoring.

摘要:結構健康監測在確保結構安全方面扮演著至關重要的角色,通過分析工程系統的振動響應來實現。本文提出了一個光譜-時間對齊框架和一個混合光譜-時間融合框架,這些框架將到達時間間隔描述符與光譜特徵相結合,以捕捉細尺度和粗尺度的振動動態。對從LDS V406電動振動台收集的數據進行的實驗表明,所提出的光譜-時間表示顯著優於傳統的輸入公式。結果表明,0.02的時間解析度(Δτ)有利於傳統機器學習模型,而更細的解析度(Δτ)0.008則有效釋放了深度學習架構的性能潛力。除了分類準確性之外,基於凝聚指標的綜合穩定性分析,包括平均性能、標準差、變異係數和均衡得分,顯示所提出的混合框架在準確性上始終達到更高的水平,並且變異性顯著低於基準和僅對齊的方法。總體而言,這些結果表明,所提出的框架為基於振動的結構健康監測提供了一個穩健、準確且可靠的解決方案。
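Three of the stability indices named in the abstract (mean performance, standard deviation, coefficient of variation) have standard definitions and can be computed from repeated-run accuracies as sketched below. The abstract does not define its "balanced score", so it is deliberately left out; the function name and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def stability_summary(accuracies):
    """Mean, sample standard deviation, and coefficient of variation of run accuracies."""
    mu = mean(accuracies)
    sigma = stdev(accuracies)  # sample standard deviation (n - 1 denominator)
    return {"mean": mu, "std": sigma, "cv": sigma / mu}
```

A lower coefficient of variation at the same mean indicates that a method's accuracy depends less on the particular run or split.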

Large Language Models Meet Biomedical Knowledge Graphs for Mechanistically Grounded Therapeutic Prioritization

2604.19815v1 by Chih-Hsuan Wei, Chi-Ping Day, Zhizheng Wang, Christine C. Alewine, Betty Tyler, Hasan Slika, David Saraf, Chin-Hsien Tai, Joey Chan, Robert Leaman, Zhiyong Lu

Drug repurposing is often framed as a candidate identification task, but existing approaches provide limited guidance for distinguishing biologically plausible candidates from historically well-connected ones. Here we introduce DrugKLM, a hybrid framework that integrates biomedical knowledge graph structure with large language model-based mechanistic reasoning to enable mechanistically grounded therapeutic prioritization. Across benchmark datasets, DrugKLM outperforms knowledge graph-only and language model-only baselines, including TxGNN. Beyond improved recall, DrugKLM confidence scores exhibit functional alignment with molecular phenotypes: higher scores are associated with transcriptional signatures linked to improved survival across 12 TCGA cancers. The scoring framework preferentially captures biologically perturbational signals rather than historical indication patterns. Expert curation across five cancers further reveals systematic differences in prioritization behavior, with DrugKLM elevating candidates supported by coherent mechanistic rationale and disease-specific clinical context. Together, these results establish DrugKLM as an evidence-integrative framework that translates heterogeneous biomedical data into mechanistically interpretable and clinically grounded therapeutic hypotheses.

摘要:藥物重定位通常被視為一項候選者識別任務,但現有的方法對於區分生物學上合理的候選者與歷史上關聯良好的候選者提供的指導有限。在這裡,我們介紹了DrugKLM,一個混合框架,將生物醫學知識圖譜結構與基於大型語言模型的機制推理相結合,以實現機制基礎的治療優先排序。在基準數據集上,DrugKLM的表現超越了僅使用知識圖譜和僅使用語言模型的基準,包括TxGNN。除了提高召回率外,DrugKLM的信心分數與分子表型具有功能對齊:較高的分數與12種TCGA癌症中與改善生存相關的轉錄簽名相關聯。該評分框架優先捕捉生物學擾動信號,而非歷史指示模式。對五種癌症的專家策展進一步揭示了優先排序行為的系統性差異,DrugKLM提升了支持一致機制理論和疾病特定臨床背景的候選者。綜合這些結果,DrugKLM建立為一個證據整合框架,將異質生物醫學數據轉化為機制可解釋且臨床基礎的治療假設。

Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

2604.16132v1 by Jessica H. Zhu, Shayla Stringfield, Vahe Zaprosyan, Michael Wagner, Michel Cukier, Joseph B. Richardson

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.

摘要:槍支暴力是一個緊迫的公共衛生問題,但對於倖存者生活經歷的研究仍然資金不足且難以擴展。質性研究,包括深入訪談,是理解社區槍支暴力的個人和社會後果以及設計有效干預措施的寶貴工具。然而,通過主題分析和歸納編碼手動分析這些敘事既耗時又勞動密集。最近大型語言模型(LLMs)的進展為自動化這一過程打開了大門,但仍然存在這些模型是否能準確和倫理地捕捉弱勢群體經歷的擔憂。在這項研究中,我們評估使用開源LLMs對21名倖存於社區槍支暴力的黑人男性的訪談進行歸納編碼。我們的結果顯示,儘管某些LLMs的配置能夠識別重要的編碼,但整體相關性仍然較低,並且對數據處理高度敏感。此外,LLM的防護措施導致了實質性的敘事抹除。這些發現突顯了LLM輔助質性編碼的潛力和局限性,並強調了在涉及邊緣社區的研究中應用AI的倫理挑戰。

Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration

2604.16104v1 by Baramee Sukumal, Aueaphum Aueawatthanaphisut

Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, demonstrating strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.

摘要:肺癌仍然是全球癌症相關死亡的主要原因之一。傳統的電腦斷層掃描 (CT) 成像雖然對於檢測和分期至關重要,但在區分良性和惡性病變以及提供可解釋的診斷見解方面存在局限性。為了解決這一挑戰,本研究提出了一個雙模態人工智慧框架,將 CT 放射學與蘇木精-伊紅 (H&E) 組織病理學整合,用於肺癌的診斷和亞型分類。該系統採用卷積神經網絡提取放射學和組織病理學特徵,並結合臨床元數據以提高穩健性。來自兩種模態的預測通過加權決策級整合機制進行融合,以分類腺癌、鱗狀細胞癌、大細胞癌、小細胞肺癌和正常組織。應用可解釋的人工智慧技術,包括 Grad-CAM、Grad-CAM++、集成梯度、遮蔽、顯著性圖和 SmoothGrad,以提供視覺可解釋性。實驗結果顯示出強勁的性能,準確率高達 0.87,AUROC 超過 0.97,宏觀 F1 分數為 0.88。Grad-CAM++ 在忠實度和定位準確性方面達到了最高水平,顯示出與專家標註的腫瘤區域之間的強對應關係。這些結果表明,放射學和組織病理學的多模態融合可以提高診斷性能,同時保持模型透明度,暗示未來在精準腫瘤學中用於臨床決策支持系統的潛力。
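
The weighted decision-level integration mentioned in the abstract can be illustrated with a minimal sketch. It assumes each modality already outputs a probability vector over the same five classes; the function names, the single scalar weight `w_ct`, and the class ordering are hypothetical and are not taken from the paper.

```python
# Hypothetical class order; the abstract names these five categories.
CLASSES = [
    "adenocarcinoma",
    "squamous cell carcinoma",
    "large cell carcinoma",
    "small cell lung cancer",
    "normal",
]


def fuse_predictions(p_ct, p_histo, w_ct=0.5):
    """Convex combination of two per-class probability vectors.

    w_ct is the weight given to the CT branch; the histopathology
    branch receives 1 - w_ct. The result is renormalized to sum to 1.
    """
    if len(p_ct) != len(p_histo):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= w_ct <= 1.0:
        raise ValueError("w_ct must lie in [0, 1]")
    fused = [w_ct * a + (1.0 - w_ct) * b for a, b in zip(p_ct, p_histo)]
    total = sum(fused)
    return [v / total for v in fused]


def predict_label(p_ct, p_histo, w_ct=0.5):
    """Return the top fused class name together with the fused vector."""
    fused = fuse_predictions(p_ct, p_histo, w_ct)
    best = max(range(len(fused)), key=fused.__getitem__)
    return CLASSES[best], fused
```

In practice the weight would likely be tuned on validation data; a fixed scalar is used here only to keep the sketch short.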

Towards Trustworthy Depression Estimation via Disentangled Evidential Learning

2604.16579v1 by Fangyuan Liu, Sirui Zhao, Zeyu Zhang, Jinyang Huang, Feng-Qi Cui, Bin Luo, Tong Xu, Meng Li, Enhong Chen

Automated depression estimation is highly vulnerable to signal corruption and ambient noise in real-world deployment. Prevailing deterministic methods produce uncalibrated point estimates, exposing safety-critical clinical systems to the severe risk of overconfident misdiagnoses. To establish a highly resilient and trustworthy assessment paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. A fundamental vulnerability in multimodal evidential fusion is the uncontrolled accumulation of cross-modal redundancies. This structural flaw artificially inflates diagnostic confidence by double-counting overlapping evidence. To guarantee robust evidence synthesis, EviDep enforces strict information integrity. First, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically isolate task-irrelevant noise, preserving the fidelity of diagnostic signals. Subsequently, a Disentangled Evidential Learning strategy separates the shared consensus from modality-specific nuances. By explicitly decoupling these representations before Bayesian fusion, EviDep systematically mitigates evidence redundancy. Extensive experiments on AVEC 2013, 2014, DAIC-WOZ, and E-DAIC confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, delivering a robust fail-safe mechanism for trustworthy clinical screening.

摘要:自動化的憂鬱估計在實際應用中對信號損壞和環境噪音高度敏感。現有的確定性方法產生未經校準的點估計,將安全關鍵的臨床系統暴露於過度自信的誤診嚴重風險中。為了建立一個高度韌性和可信的評估範式,我們提出了EviDep,一個證據學習框架,通過正態-反向伽瑪分佈共同量化憂鬱嚴重性以及隨機和認知不確定性。多模態證據融合中的一個基本脆弱性是跨模態冗餘的無控制累積。這一結構缺陷通過重複計算重疊證據人為地膨脹了診斷信心。為了保證穩健的證據合成,EviDep 強制執行嚴格的信息完整性。首先,一個頻率感知特徵提取模塊利用基於小波的專家混合模型動態隔離與任務無關的噪音,保持診斷信號的真實性。隨後,一個解耦的證據學習策略將共享共識與特定模態的細微差別分開。通過在貝葉斯融合之前明確解耦這些表示,EviDep 系統性地減少了證據冗餘。在AVEC 2013、2014、DAIC-WOZ和E-DAIC上的廣泛實驗證實,EviDep實現了最先進的預測準確性和優越的不確定性校準,提供了一個穩健的故障安全機制以進行可信的臨床篩查。
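
For readers unfamiliar with evidential regression, the sketch below shows how a Normal-Inverse-Gamma output is commonly converted into a point prediction plus aleatoric and epistemic uncertainty. It uses the standard closed-form moments of the NIG distribution; the parameter names (`gamma`, `nu`, `alpha`, `beta`) follow the usual convention and are an assumption, since the abstract does not spell out EviDep's parameterization.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Moments of a Normal-Inverse-Gamma(gamma, nu, alpha, beta) output.

    prediction = E[mu]      = gamma
    aleatoric  = E[sigma^2] = beta / (alpha - 1)
    epistemic  = Var[mu]    = beta / (nu * (alpha - 1))
    Both variances are finite only when alpha > 1.
    """
    if nu <= 0 or beta <= 0 or alpha <= 1:
        raise ValueError("requires nu > 0, beta > 0 and alpha > 1")
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return {"prediction": gamma, "aleatoric": aleatoric, "epistemic": epistemic}
```

A small `nu` (little accumulated evidence) inflates the epistemic term while leaving the aleatoric term unchanged, which is why double-counting redundant evidence can make a fused estimate look more certain than it should.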

QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

2604.15859v1 by Jeremy Qin, Maksym Andriushchenko

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.

摘要:預測已成為不確定性推理的自然基準。 然而,現有的大型語言模型評估仍然僅限於簡單格式的判斷任務,例如二元或多選題。 然而,在實踐中,預測涵蓋的範圍要廣得多。 在經濟學、公共衛生和社會人口統計等領域,決策依賴於對連續數量的數值估計,而這一能力是當前基準所無法捕捉的。 評估這些估計需要一種能夠明確且可測試不確定性的格式。 我們提出預測區間作為這一目的的自然且嚴謹的介面。 它們要求對規模的認識、在信心水平之間的內部一致性,以及在結果連續體上的校準,使其成為數值預測中比點估計更合適的評估格式。 為了評估這一能力,我們引入了一個新的基準QuantSightBench,並在多種設置下評估前沿模型,評估實證覆蓋率和區間銳利度。 我們的結果顯示,11個評估的前沿和開放權重模型中沒有一個達到90\%的覆蓋目標,表現最好的Gemini 3.1 Pro (79.1\%)、Grok 4 (76.4\%)和GPT-5.4 (75.3\%)均至少低於目標10個百分點。 在極端數量級下,校準急劇下降,顯示出所有評估模型的系統性過度自信。
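
The two quantities the abstract evaluates, empirical coverage and interval sharpness, can be sketched as follows. Treating sharpness as the mean interval width is an assumption made for illustration; the benchmark may normalize widths differently, and the function name is hypothetical.

```python
def interval_metrics(intervals, outcomes):
    """Empirical coverage and mean width of prediction intervals.

    intervals: list of (lower, upper) pairs, one per question.
    outcomes:  list of realized values, aligned with intervals.
    Coverage is the fraction of outcomes that fall inside their
    interval; a 90% interval is calibrated if coverage is near 0.90.
    """
    if not intervals or len(intervals) != len(outcomes):
        raise ValueError("need one interval per outcome")
    hits = sum(1 for (lo, hi), y in zip(intervals, outcomes) if lo <= y <= hi)
    coverage = hits / len(outcomes)
    mean_width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return coverage, mean_width
```

Coverage below the nominal level, as reported for all 11 models, indicates intervals that are too narrow, which is the overconfidence the abstract describes.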

Stein Variational Black-Box Combinatorial Optimization

2604.15837v1 by Thomas Landais, Olivier Goudet, Adrien Goëffon, Frédéric Saubion, Sylvain Lamprier

Combinatorial black-box optimization in high-dimensional settings demands a careful trade-off between exploiting promising regions of the search space and preserving sufficient exploration to identify multiple optima. Although Estimation-of-Distribution Algorithms (EDAs) provide a powerful model-based framework, they often concentrate on a single region of interest, which may result in premature convergence when facing complex or multimodal objective landscapes. In this work, we incorporate the Stein operator to introduce a repulsive mechanism among particles in the parameter space, thereby encouraging the population to disperse and jointly explore several modes of the fitness landscape. Empirical evaluations across diverse benchmark problems show that the proposed method achieves performance competitive with, and in several cases superior to, leading state-of-the-art approaches, particularly on large-scale instances. These findings highlight the potential of Stein variational gradient descent as a promising direction for addressing large, computationally expensive, discrete black-box optimization problems.

摘要:組合黑箱優化在高維設置中需要在利用搜尋空間中有前景的區域與保持足夠的探索以識別多個最優解之間進行仔細的權衡。儘管分佈估計演算法(EDAs)提供了一個強大的基於模型的框架,但它們通常集中於單一的興趣區域,這可能導致在面對複雜或多模態的目標景觀時過早收斂。在本研究中,我們引入了Stein算子,以在參數空間中的粒子之間引入排斥機制,從而鼓勵群體分散並共同探索適應度景觀的多個模式。對各種基準問題的實證評估顯示,所提出的方法在性能上與多種領先的最先進方法具有競爭力,並且在幾個案例中優於它們,特別是在大規模實例上。這些發現突顯了Stein變分梯度下降作為解決大型、計算成本高的離散黑箱優化問題的有前景方向的潛力。
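
To make the repulsive mechanism concrete, here is a minimal sketch of one generic Stein variational gradient descent step for scalar particles with an RBF kernel. This is the textbook continuous-space update, not the authors' combinatorial algorithm; the function name, step size, and bandwidth are illustrative assumptions.

```python
import math


def svgd_step(particles, grad_log_p, step=0.1, bandwidth=1.0):
    """One SVGD update on a list of scalar particles.

    For each particle x_i the update direction is the average over j of
      k(x_j, x_i) * grad_log_p(x_j)   (pulls toward high density)
    + d/dx_j k(x_j, x_i)              (pushes particles apart)
    with k(x_j, x_i) = exp(-(x_j - x_i)^2 / bandwidth).
    """
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / bandwidth)
            attract = k * grad_log_p(xj)
            repel = (2.0 / bandwidth) * (xi - xj) * k
            phi += attract + repel
        updated.append(xi + step * phi / n)
    return updated
```

With a flat target (zero gradient everywhere) only the repulsive term remains, so nearby particles drift apart; this is the behavior that discourages a population from collapsing onto a single mode.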

Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

2604.15808v1 by Lama Moukheiber, Caleb M. Yeung, Haotian Xue, Alec Helbling, Zelin Zhao, Yongxin Chen

Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.

摘要:空間推理和視覺基礎是視覺語言模型(VLMs)的核心能力,然而大多數醫療 VLMs 在預測時缺乏透明的推理或空間證據。現有的基準也僅在孤立的 2D 圖像上評估 VLMs,忽視了臨床影像的體積特性,因為發現可能跨越多幀或僅出現在幾個切片上。我們引入了空間基礎 MRI 視覺問題回答(SGMRI-VQA),這是一個包含 41,307 對的基準,旨在對體積 MRI 進行多幀、空間基礎的推理。該基準基於 fastMRI+ 數據集中專家放射科醫生的註釋,涵蓋腦部和膝部研究,每個 QA 對都包括與臨床醫生對齊的思維鏈跡跡,並附有幀索引的邊界框坐標。任務按層級組織,包括檢測、定位、計數/分類和標題生成,要求模型共同推理存在的內容、其位置以及跨越哪些幀。 我們基準測試了 10 個 VLMs,並顯示 Qwen3-VL-8B 在邊界框監督下的有監督微調始終改善了基於強大零樣本基準的基礎性能,這表明有針對性的空間監督是實現有根據的臨床推理的有效途徑。

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

2604.15760v1 by Ankit Maloo

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.

摘要:我們介紹了KWBench(知識工作平台)的第一個版本,這是一個用於大型語言模型的無提示問題識別基準:LLM能否在嘗試解決問題之前識別專業場景。現有的前沿基準已經飽和,到目前為止,大多數知識工作評估簡化為根據規範進行的提取或任務完成。KWBench的目標是這一步之前:僅從原始輸入中識別情況的主導結構。基準包含223個任務,這些任務來自於收購、合同談判、臨床藥學、組織政治、欺詐分析和激勵設計等領域的從業者。每個任務編碼了一個正式的博弈論模式(委託-代理衝突、信號傳遞、機制設計失敗、戰略性省略、聯合動態、戰略性相互依賴),並攜帶結構化的真實記錄,記錄專家對情況的解讀和預期的失敗模式。模型接收原始數據和任務提示,沒有問題類型的指示。評分是一個三層的標準,必須通過強制性聯合檢查。強制性標準編碼了預測的錯誤路徑。我們評估了16個模型。最佳模型在27.9%的任務中通過。排名前兩的模型僅在31.7%的通過任務上達成一致。在前8名中,有44個任務僅由一個模型解決;在前8名之間的路由覆蓋了基準的50.7%,幾乎是最佳單一模型的兩倍。在通過的條件下,質量分數趨於一致(模型之間約83%);無條件分數則不然。相同的模型在被詢問時能正確表述相關的博弈論概念,但在未提示的情況下卻無法應用。我們發布KWBench,以改變前沿模型在知識工作上的評估方式,根據它們是否能僅從情況中識別正確的問題來進行評分,而不僅僅是根據它們在問題被框定後的執行效果。
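
The routing figures in the abstract (best single model versus coverage when routing across the top 8, and tasks solved by exactly one model) are set computations over per-model pass lists. The sketch below shows one way to compute them, assuming perfect routing to whichever model passes; the function name and data layout are hypothetical.

```python
def coverage_stats(passes, n_tasks):
    """Summarize pass sets for several models.

    passes:  dict mapping model name -> set of task ids it passed.
    n_tasks: total number of tasks in the benchmark.
    """
    best_single = max(len(solved) for solved in passes.values()) / n_tasks
    # A task counts as covered if at least one model passes it.
    union = set().union(*passes.values())
    counts = {}
    for solved in passes.values():
        for task in solved:
            counts[task] = counts.get(task, 0) + 1
    solved_by_one = sum(1 for c in counts.values() if c == 1)
    return {
        "best_single": best_single,
        "oracle_routing": len(union) / n_tasks,
        "solved_by_exactly_one": solved_by_one,
    }
```

A large gap between `best_single` and `oracle_routing` means the models fail on different tasks, which is consistent with the low agreement between the top two models reported above.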

SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification

2604.15711v1 by Enhui Chai, Sicheng Chen, Tianyi Zhang, Xingyu Li, Tianxiang Cui

Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.

摘要:病理診斷高度依賴影像分析,其中感興趣區域(ROI)作為診斷證據的主要基礎,而全滑動影像(WSI)級別的任務主要捕捉聚合模式。為了提取這些關鍵的形態特徵,基於視覺Transformer(ViTs)和大規模自我監督學習(SSL)的ROI級別基礎模型(FMs)已被廣泛採用。然而,在其應用於ROI分析時仍存在三個核心限制:(1)跨放大倍數領域轉換,由於固定規模的預訓練妨礙了對多樣臨床環境的適應;(2)不充分的局部-全局關係建模,其中FMs的ViT主幹面臨高計算開銷和不精確的局部特徵表徵;(3)細粒度敏感性不足,因為傳統自注意機制往往忽略細微的診斷線索。為了解決這些挑戰,我們提出了SSMamba,一種混合SSL框架,能夠在不依賴大型外部數據集的情況下有效學習細粒度特徵。該框架包含三個領域自適應組件:Mamba遮罩影像建模(MAMIM)用於減少領域轉換,方向性多尺度(DMS)模塊用於平衡局部-全局建模,以及局部感知殘差(LPR)模塊用於增強細粒度敏感性。採用兩階段流程,首先在目標ROI數據集上進行SSL預訓練,然後進行監督式微調(SFT),SSMamba在10個公共ROI數據集上超越了11個最先進(SOTA)病理FMs,並在6個公共WSI數據集上超越了8個SOTA方法。這些結果驗證了針對病理影像分析的任務特定架構設計的優越性。

CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder

2604.15611v1 by Duy-Phuong Dao, Muhammad Taqiyuddin, Jahae Kim, Sang-Heon Lee, Hye-Won Jung, Jaehoo Choi, Hyung-Jeong Yang

Latent diffusion models have emerged as powerful generative models in medical imaging, enabling the synthesis of high-quality brain magnetic resonance imaging scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB, Controllable Longitudinal brain Image generation via a Mamba-based latent diffusion model, an advanced framework for modeling temporal changes in brain structure. CLIMB is designed to model the structural evolution of the brain over time, utilizing a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Unlike existing LDM methods that rely on self-attention modules, which effectively capture contextual information from input images but are computationally expensive, our approach leverages Mamba, a state space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis. Furthermore, we introduce a Gaussian-aligned autoencoder that extracts latent representations conforming to prior distributions without the sampling noise inherent in conventional variational autoencoders. We train and evaluate our proposed model on the Alzheimer's Disease Neuroimaging Initiative dataset, consisting of 6,306 MRI scans from 1,390 participants. By comparing generated images with real MRI scans, CLIMB achieves a structural similarity index of 0.9433, demonstrating notable improvements over existing methods.

摘要:潛在擴散模型已成為醫學影像中強大的生成模型,使得高品質的腦部磁共振影像掃描的合成成為可能。特別是,預測患者腦部的演變可以幫助早期介入、預後和治療計畫。在這項研究中,我們介紹了CLIMB,即通過基於狀態空間的潛在擴散模型進行可控的縱向腦影像生成,這是一個用於建模腦結構時間變化的先進框架。CLIMB旨在建模腦結構隨時間的結構演變,利用基線MRI掃描及其獲取年齡作為基礎輸入。此外,還納入了多個條件變數,包括預測年齡、性別、疾病狀態、遺傳信息和腦結構體積,以增強解剖變化的時間建模。與現有的LDM方法依賴自注意模塊不同,後者有效捕捉輸入影像的上下文信息但計算成本高,我們的方法利用狀態空間,一種顯著減少計算開銷的狀態空間模型架構,同時保留高品質的影像合成。此外,我們引入了一種高斯對齊自編碼器,該編碼器提取符合先驗分佈的潛在表示,而不會受到傳統變分自編碼器固有的取樣噪聲的影響。我們在阿茲海默症疾病神經影像倡議數據集上訓練和評估我們提出的模型,該數據集包含1,390名參與者的6,306個MRI掃描。通過將生成的影像與真實的MRI掃描進行比較,CLIMB達到了0.9433的結構相似性指數,顯示出相較於現有方法的顯著改進。

Robustifying and Selecting Cohort-Appropriate Prognostic Models under Distributional Shifts

2604.16537v1 by Dimitris Bertsimas, Carol Gao, Angelos G. Koulouras, Georgios Antonios Margonis

External validation is widely regarded as the gold standard for prognostic model evaluation. In this study, we challenge the assumption that successful external calibration guarantees model generalizability and propose two complementary strategies to improve transportability of prognostic models across cohorts. Using six real-world surgical cohorts from tertiary academic centers, we tested whether successful external calibration depends largely on similarity in covariates and outcomes between training and validation cohorts, quantified using Kullback-Leibler (KL) divergence, with calibration assessed by the Integrated Calibration Index (ICI). From the model-developer's perspective, we trained the "best-on-average" prognostic model by tuning toward a meta-analysis-derived covariate and outcome distribution as an approximation of the broader target population. From the end-user perspective, we proposed a simple measure for cohort outcome similarity to identify, among published models, the one most suitable for a given target cohort in terms of both calibration and clinical utility. External calibration worsened as distributional mismatch increased. Higher KL divergence was associated with higher ICI in both surgery-alone (Spearman $ρ=0.614$, $p=0.004$) and surgery + adjuvant chemotherapy cohorts (Spearman $ρ=0.738$, $p<0.001$). Meta-analysis-informed weighting improved calibration in most settings without materially affecting discrimination, with the clearest benefit when evaluated on the aggregated external population ($p=0.037$). Models developed in more similar cohorts achieved lower ICI in surgery-alone (Spearman $ρ=0.803$, $p<0.001$) and surgery + adjuvant chemotherapy cohorts (Spearman $ρ=0.737$, $p<0.001$), and provided greater clinical utility on DCA.

摘要:外部驗證被廣泛認為是預後模型評估的金標準。在本研究中,我們挑戰了成功的外部校準保證模型可泛化性的假設,並提出了兩種互補策略,以提高預後模型在不同隊列之間的可轉移性。我們使用來自三級學術中心的六個真實外科隊列,測試成功的外部校準是否在很大程度上依賴於訓練和驗證隊列之間的協變量和結果的相似性,這一相似性通過 Kullback-Leibler (KL) 散度來量化,校準則通過綜合校準指數 (ICI) 來評估。從模型開發者的角度來看,我們通過調整以元分析衍生的協變量和結果分佈來訓練“平均最佳”預後模型,以此作為更廣泛目標人群的近似。從最終用戶的角度來看,我們提出了一個簡單的措施來評估隊列結果的相似性,以便在已發表的模型中識別出最適合特定目標隊列的模型,考慮到校準和臨床效用。隨著分佈不匹配的增加,外部校準變得更糟。較高的 KL 散度與外科單獨 (Spearman $ρ=0.614$, $p=0.004$) 和外科 + 輔助化療隊列 (Spearman $ρ=0.738$, $p<0.001$) 中較高的 ICI 相關聯。元分析知情的加權在大多數情況下改善了校準,而對區分能力的實質影響不大,在對聚合外部人群進行評估時,效果最為明顯 ($p=0.037$)。在更相似的隊列中開發的模型在外科單獨 (Spearman $ρ=0.803$, $p<0.001$) 和外科 + 輔助化療隊列 (Spearman $ρ=0.737$, $p<0.001$) 中達到了較低的 ICI,並在 DCA 上提供了更大的臨床效用。
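
The analysis pairs a distributional mismatch score (KL divergence) with a calibration score (ICI) and reports their Spearman correlation. The sketch below shows both building blocks for the simplest case: discrete distributions given as probability lists, and rank correlation without ties. The function names are hypothetical, and the paper's actual estimators for mixed covariates are not described in the abstract.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def spearman_rho(x, y):
    """Spearman rank correlation, assuming no tied values and len >= 2."""
    def ranks(values):
        order = sorted(range(len(values)), key=values.__getitem__)
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

Given one KL value and one ICI value per training and validation cohort pair, `spearman_rho(kl_values, ici_values)` would reproduce the kind of association the abstract reports, where a positive value means calibration worsens as mismatch grows.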

Towards Reliable Testing of Machine Unlearning

2604.16536v1 by Anna Mazhar, Sainyam Galhotra

Machine learning components are now central to AI-infused software systems, from recommendations and code assistants to clinical decision support. As regulations and governance frameworks increasingly require deleting sensitive data from deployed models, machine unlearning is emerging as a practical alternative to full retraining. However, unlearning introduces a software quality-assurance challenge: under realistic deployment constraints and imperfect oracles, how can we test that a model no longer relies on targeted information? This paper frames unlearning testing as a first-class software engineering problem. We argue that practical unlearning tests must provide (i) thorough coverage over proxy and mediated influence pathways, (ii) debuggable diagnostics that localize where leakage persists, (iii) cost-effective regression-style execution under query budgets, and (iv) black-box applicability for API-deployed models. We outline a causal, pathway-centric perspective, causal fuzzing, that generates budgeted interventions to estimate residual direct and indirect effects and produce actionable "leakage reports". Proof-of-concept results illustrate that standard attribution checks can miss residual influence due to proxy pathways, cancellation effects, and subgroup masking, motivating causal testing as a promising direction for unlearning testing.

摘要:機器學習組件現在已成為融入人工智慧的軟體系統的核心,從推薦系統和程式碼助手到臨床決策支持。隨著法規和治理框架越來越要求從已部署模型中刪除敏感數據,機器遺忘作為完全重新訓練的實用替代方案正在出現。然而,遺忘帶來了一個軟體質量保證的挑戰:在現實的部署限制和不完美的預測下,我們如何測試一個模型不再依賴於目標資訊?本文將遺忘測試框架化為一個一流的軟體工程問題。我們主張實用的遺忘測試必須提供 (i) 對代理和中介影響路徑的全面覆蓋,(ii) 可調試的診斷,定位洩漏持續的地方,(iii) 在查詢預算下的成本效益回歸風格執行,以及 (iv) 對 API 部署模型的黑箱適用性。我們概述了一種因果、以路徑為中心的視角,即因果模糊測試,生成預算干預以估算殘留的直接和間接效果,並產生可行的「洩漏報告」。概念驗證結果顯示,標準的歸因檢查可能會因代理路徑、抵消效應和子群體掩蔽而錯過殘留影響,這促使因果測試成為遺忘測試的一個有前景的方向。

A Q-learning-based QoS-aware multipath routing protocol in IoMT-based wireless body area network

2604.15489v1 by Mehdi Hosseinzadeh, Roohallah Alizadehsani, Amin Beheshti, Hamid Alinejad-Roknyd, Lu Chen, Mohammad Sadegh Yousefpoor, Efat Yousefpoor, Muneera Altayeb, Thantrira Porntaveetus, Sadia Din

The Internet of Medical Things (IoMT) enables intelligent healthcare services but faces challenges such as dynamic topology, energy constraints, and diverse QoS requirements. This paper proposes QQMR, a Q-learning-based QoS-aware multipath routing method for WBANs. QQMR classifies data into three priority levels and employs adaptive multi-level queuing and fuzzy C-means clustering to optimize routing decisions. It maintains separate learning policies for each data type and selects primary and backup paths accordingly. Experimental results demonstrate improved packet delivery ratio and significant reductions in delay, routing overhead, and energy consumption compared to existing methods.

摘要:醫療物聯網(IoMT)使智能醫療服務成為可能,但面臨著動態拓撲、能源限制和多樣化的服務質量(QoS)要求等挑戰。本文提出了QQMR,一種基於Q-learning的QoS感知多路徑路由方法,適用於無線人體感測網路(WBANs)。QQMR將數據分類為三個優先級別,並採用自適應多級排隊和模糊C均值聚類來優化路由決策。它為每種類型的數據維護獨立的學習策略,並相應地選擇主要和備用路徑。實驗結果顯示,與現有方法相比,包傳送比例有所提高,延遲、路由開銷和能耗顯著降低。
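
The idea of keeping a separate learning policy per data type and choosing primary and backup paths can be sketched with tabular Q-learning, where a state is the current node and an action is the next hop. This is a generic illustration and not the QQMR protocol itself: the class name, priority labels, reward signal, and hyperparameters are all assumptions, and the queuing and fuzzy C-means components are omitted.

```python
from collections import defaultdict


class PriorityQRouter:
    """One Q-table per traffic priority; keys are (node, next_hop) pairs."""

    def __init__(self, priorities=("high", "medium", "low"), alpha=0.5, gamma=0.9):
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor
        self.q = {p: defaultdict(float) for p in priorities}

    def update(self, priority, node, next_hop, reward, next_hop_neighbors):
        """Standard Q-learning update after forwarding node -> next_hop."""
        table = self.q[priority]
        best_next = max(
            (table[(next_hop, n)] for n in next_hop_neighbors), default=0.0
        )
        key = (node, next_hop)
        table[key] += self.alpha * (reward + self.gamma * best_next - table[key])
        return table[key]

    def ranked_hops(self, priority, node, neighbors):
        """Neighbors sorted by Q-value; first is primary, second is backup."""
        table = self.q[priority]
        return sorted(neighbors, key=lambda n: table[(node, n)], reverse=True)
```

Because each priority has its own table, feedback from low-priority traffic never changes the route ranking used for high-priority packets.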

Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models

2604.16532v1 by Emily Curl, Kofi Ampomah, Md Erfan, Sayanton Dibbo

While deep learning systems are becoming increasingly prevalent in medical image analysis, their vulnerabilities to adversarial perturbations raise serious concerns for clinical deployment. These vulnerability evaluations largely rely on Attack Success Rate (ASR), a binary metric that indicates solely whether an attack is successful. However, the ASR metric does not account for other factors, such as perturbation strength, perceptual image quality, and cross-architecture attack transferability, and therefore, the interpretation is incomplete. This gap requires consideration, as complex, large-scale deep learning systems, including Vision Transformers (ViTs), are increasingly challenging the dominance of Convolutional Neural Networks (CNNs). These architectures learn differently, and it is unclear whether a single metric, e.g., ASR, can effectively capture adversarial behavior. To address this, we perform a systematic empirical study on four medical image datasets: PathMNIST, DermaMNIST, RetinaMNIST, and CheXpert. We evaluate seven models (VGG-16, ResNet-50, DenseNet-121, Inception-v3, DeiT, Swin Transformer, and ViT-B/16) against seven attack methods at five perturbation budgets, measuring ASR, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and $L_2$ perturbation magnitude. Our findings show a consistent pattern: perceptual and distortion metrics are strongly associated with one another and exhibit minimal correlation with ASR. This applies to both CNNs and ViTs. The results demonstrate that ASR alone is an inadequate indicator of adversarial robustness and transferability. Consequently, we argue that a thorough assessment of adversarial risk in medical AI necessitates multi-metric frameworks that encompass not only the attack efficacy but also its methodology and associated overheads.

摘要:雖然深度學習系統在醫學影像分析中變得越來越普遍,但它們對對抗性擾動的脆弱性對臨床部署提出了嚴重的擔憂。這些脆弱性評估在很大程度上依賴於攻擊成功率(ASR),這是一個二元指標,僅指示攻擊是否成功。然而,ASR指標並未考慮其他因素,例如擾動強度、感知影像質量和跨架構攻擊可轉移性,因此其解釋是不完整的。這一缺口需要考慮,因為複雜的大規模深度學習系統,包括視覺Transformer(ViTs),正日益挑戰卷積神經網絡(CNNs)的主導地位。這些架構的學習方式不同,目前尚不清楚單一指標,例如ASR,是否能有效捕捉對抗行為。為了解決這個問題,我們對四個醫學影像數據集進行了系統的實證研究:PathMNIST、DermaMNIST、RetinaMNIST和CheXpert。我們對七個模型(VGG-16、ResNet-50、DenseNet-121、Inception-v3、DeiT、Swin Transformer和ViT-B/16)在五個擾動預算下進行了七種攻擊方法的評估,測量ASR、峰值信噪比(PSNR)、結構相似性指數度量(SSIM)和$L_2$擾動幅度。我們的研究結果顯示出一致的模式:感知和失真指標之間有很強的關聯性,並且與ASR的相關性極小。這一點適用於CNN和ViT。結果顯示,僅僅依賴ASR並不足以指標對抗穩健性和可轉移性。因此,我們認為對醫學人工智慧的對抗風險進行徹底評估需要多指標框架,不僅涵蓋攻擊效能,還包括其方法論和相關的開銷。
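
Two of the distortion metrics named in the abstract, the $L_2$ perturbation magnitude and PSNR, have simple closed forms. The sketch below computes them for images flattened into lists of pixel values in [0, 1]; the function names and the flattened representation are assumptions for illustration, and SSIM is omitted because it needs windowed statistics.

```python
import math


def l2_perturbation(clean, adversarial):
    """Euclidean norm of the pixel-wise difference."""
    return math.sqrt(sum((a - c) ** 2 for c, a in zip(clean, adversarial)))


def psnr(clean, adversarial, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = sum((a - c) ** 2 for c, a in zip(clean, adversarial)) / len(clean)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Both values depend only on the perturbation and not on whether the model's prediction flips, so they are computed separately from ASR and can be compared against it.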

RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference

2604.15459v1 by Yuxin Liu, Yiqing Dong, Wenxue Yu, Zhan Wu, Rongjun Ge, Yang Chen, Yuting He

Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose RelativeFlow, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support different medical imaging modalities. Extensive experiments on Computed Tomography (CT) and Magnetic Resonance (MR) denoising demonstrate that RelativeFlow significantly outperforms existing methods, taming MID with noisy references.

摘要:醫學影像去噪(MID)缺乏絕對乾淨的影像進行監督,導致噪聲參考問題,根本限制了去噪性能。現有的模擬監督辨別學習(SimSDL)和模擬監督生成學習(SimSGL)將噪聲參考視為乾淨目標,導致次優收斂或參考偏差學習,而自監督學習(SSL)則施加了在現實MID場景中很少滿足的限制性噪聲假設。我們提出了\textbf{RelativeFlow},一種流匹配框架,從異質的噪聲參考中學習,並將來自任意質量水平的輸入推向統一的高質量目標。RelativeFlow通過將絕對噪聲到乾淨的映射分解為相對更噪聲到噪聲的映射來重新定義流匹配,並通過兩個關鍵組件實現這一公式:1)一致性傳輸(CoT),一個位移圖,約束相對流為統一絕對流的組成部分並逐步組合,2)基於模擬的速度場(SVF),使用特定於模態的降解運算子構建可學習的速度場,以支持不同的醫學影像模態。在計算機斷層掃描(CT)和磁共振(MR)去噪的廣泛實驗中,RelativeFlow顯著超越現有方法,駕馭了帶有噪聲參考的MID。

DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

2604.15456v1 by Zhizheng Wang, Chih-Hsuan Wei, Joey Chan, Robert Leaman, Chi-Ping Day, Chuan Wu, Mark A Knepper, Antolin Serrano Farias, Jordina Rincon-Torroella, Hasan Slika, Betty Tyler, Ryan Huu-Tuan Nguyen, Asmita Indurkar, Mélanie Hébert, Shubo Tian, Lauren He, Noor Naffakh, Aseem Aseem, Nicholas Wan, Emily Y Chew, Tiarnan D L Keenan, Zhiyong Lu

Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med's conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.

摘要:信任度和透明度對於人工智慧 (AI) 在醫療保健和生物醫學研究中的臨床應用至關重要。最近的深度研究系統旨在通過將 AI 代理與多跳信息檢索、推理和綜合整合,來加速基於證據的科學發現。然而,大多數現有系統缺乏明確且可檢查的證據評估標準,這增加了錯誤累積的風險,並使研究人員和臨床醫生難以評估其輸出的可靠性。與此同時,當前的基準測試方法很少評估在複雜的現實醫療問題上的表現。在此,我們介紹 DeepER-Med,一個針對醫學的深度基於證據的研究框架,配備了一個代理 AI 系統。DeepER-Med 將深度醫學研究框架化為一個明確且可檢查的基於證據的生成工作流程,包含三個模塊:研究規劃、代理協作和證據綜合。為了支持現實評估,我們還提出了 DeepER-MedQA,一個基於證據的數據集,包含 100 個專家級研究問題,這些問題源自真實的醫學研究場景,並由 11 位生物醫學專家組成的多學科小組進行策劃。專家手動評估顯示,DeepER-Med 在多個標準上始終優於廣泛使用的生產級平台,包括生成新穎的科學見解。我們進一步通過八個現實臨床案例展示 DeepER-Med 的實用性。人類臨床醫生的評估表明,DeepER-Med 的結論在七個案例中與臨床建議一致,突顯了其在醫學研究和決策支持中的潛力。

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

2604.15271v2 by Tianhao Fu, Austin Wang, Charles Chen, Roby Aldave-Garza, Yucheng Chen

Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of $0.9838/2.4885$, $0.9946/0.2660$, and $0.9925/0.8193$, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.

摘要:可靠的不確定性估計對於醫學影像分割至關重要,因為自動化的輪廓會為下游的量化和臨床決策支持提供依據。許多強大的不確定性方法需要重複推斷,而高效的單次前向傳遞替代方案往往提供較弱的失敗排名或依賴於限制性的特徵空間假設。我們提出了 $\textbf{SegWithU}$,這是一個後處理框架,通過輕量級的不確定性頭部增強了一個凍結的預訓練分割骨幹。SegWithU 利用中間骨幹特徵,並將不確定性建模為一個緊湊探針空間中的擾動能量,使用 rank-1 後驗探針。它生成兩個體素級不確定性圖:一個用於概率調整的校準導向圖和一個用於錯誤檢測和選擇性預測的排名導向圖。在 ACDC、BraTS2024 和 LiTS 中,SegWithU 是最強且最一致的單次前向傳遞基線,分別達到 $0.9838/2.4885$、$0.9946/0.2660$ 和 $0.9925/0.8193$ 的 AUROC/AURC,同時保持分割質量。這些結果表明,基於擾動的不確定性建模是實現可靠性意識醫學分割的有效且實用的途徑。源代碼可在 https://github.com/ProjectNeura/SegWithU 獲得。
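
The AUROC/AURC pair reported above measures how well an uncertainty score ranks errors. The sketch below gives common textbook definitions for a flat list of voxels: AUROC as the probability that an erroneous voxel receives higher uncertainty than a correct one, and AURC as the average error rate when voxels are accepted from most to least confident. These definitions and function names are assumptions; the paper may use a different scaling for AURC, so values from this sketch are not directly comparable to the reported numbers.

```python
def error_detection_auroc(uncertainty, errors):
    """AUROC for separating errors (1) from correct voxels (0) by uncertainty."""
    pos = [u for u, e in zip(uncertainty, errors) if e]
    neg = [u for u, e in zip(uncertainty, errors) if not e]
    if not pos or not neg:
        raise ValueError("need at least one error and one correct voxel")
    # Count pairwise wins, with ties worth half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def aurc(uncertainty, errors):
    """Mean selective risk over all coverage levels k/n, k = 1..n."""
    order = sorted(range(len(errors)), key=uncertainty.__getitem__)
    cumulative_errors = 0
    risks = []
    for k, idx in enumerate(order, start=1):
        cumulative_errors += errors[idx]
        risks.append(cumulative_errors / k)
    return sum(risks) / len(risks)
```

Higher AUROC and lower AURC both indicate that errors concentrate among the voxels flagged as most uncertain, which is what selective prediction relies on.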

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

2604.15231v1 by Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

Vision-language models (VLMs) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.

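Abstract: Vision-language models (VLMs) have significantly advanced AI-based interpretation and reporting of complex medical imaging such as computed tomography (CT). However, existing methods largely leave clinicians as passive observers of final outputs and provide no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise, interpretable process. Each generated report comes with a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings were derived. In our experiments, RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, along three dimensions. Clinical accuracy improves by 6.0 points in macro-F1 (36.4% relative) and by 5.4 points in micro-F1 (19.6% relative). Robustness under adversarial conditions improves by 24.7 points (41.9% relative). In addition, RadAgent reaches 37.0% faithfulness, a new capability entirely absent from its 3D VLM counterpart. By structuring chest CT interpretation as an explicit, tool-augmented, and iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.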

Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF

2604.16528v1 by Nicklas Neu, Thomas Ebner, Jasmin Primus, Bernhard Schenkenfelder, Raphael Zefferer, Mathias Brunbauer, Florian Kromp

Embryo selection is one of multiple crucial steps in in-vitro fertilization, commonly based on morphological assessment by clinical embryologists. Although artificial intelligence methods have demonstrated their potential to support embryo selection by automated embryo ranking or grading methods, the overall impact of AI-based solutions is still limited. This is mainly due to the required adaptation of automated solutions to custom clinical data, reliance on time lapse incubators and a lack of interpretability to understand AI reasoning. The modern, informed patient is questioning expert decisions, particularly if the treatment is not successful. Thus, evidence-based decision justification in tasks like embryo selection would support transparent decision making and respectful patient communication. To support this aim, we hereby present an expert-annotated dataset consisting of embryo images and corresponding morphological description using natural language. The description contains relevant information on embryonic cell cycle, developmental stage and morphological features. This dataset enables the finetuning of modern foundational vision-language models to learn and improve over time with high accuracy. Predicted embryo descriptions can then be leveraged to automatically extract scientific evidence from literature, facilitating well-informed, evidence-based decision-making and transparent communication with patients. Our proposed dataset supports research in language-based, interpretable, and transparent automated embryo assessment and has the potential to enhance the decision-making process and improve patient outcomes significantly over time.

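Abstract: Embryo selection is one of several critical steps in in-vitro fertilization and is usually based on morphological assessment by clinical embryologists. Although artificial intelligence methods have shown potential to support embryo selection, for example through automated embryo ranking or grading, the overall impact of AI-based solutions remains limited. This is mainly because automated solutions must be adapted to site-specific clinical data, depend on time-lapse incubators, and lack the interpretability needed to understand the AI's reasoning. Modern, informed patients question expert decisions, particularly when treatment is unsuccessful. Evidence-based justification of decisions in tasks such as embryo selection would therefore support transparent decision making and respectful patient communication. To support this goal, we present an expert-annotated dataset of embryo images with corresponding morphological descriptions in natural language. The descriptions contain relevant information on the embryonic cell cycle, developmental stage, and morphological features. The dataset enables fine-tuning of modern vision-language foundation models so that they learn and improve in accuracy over time. Predicted embryo descriptions can be used to automatically extract scientific evidence from the literature, facilitating well-informed, evidence-based decision making and transparent communication with patients. Our proposed dataset supports research on language-based, interpretable, and transparent automated embryo assessment, and has the potential to significantly enhance the decision-making process and improve patient outcomes.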

Hybrid Decision Making via Conformal VLM-generated Guidance

2604.14980v2 by Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini

Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.

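Abstract: Building on recent advances in AI, hybrid decision making (HDM) promises to improve the quality of human decisions and reduce cognitive load. We work within learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: in LtG, the AI provides (textual) guidance that helps with decision making rather than suggesting decisions. One limitation of existing approaches is that their guidance aggregates information about all possible outcomes and can therefore be hard to digest. We address this by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To do so, it uses conformal risk control to select a set of outcomes, guaranteeing an upper bound on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.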
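The outcome-selection step described in this abstract can be illustrated with a small sketch of conformal risk control for multi-label prediction. It assumes the standard recipe: keep every outcome whose predicted probability is at least 1 - lambda, and choose the smallest lambda on a calibration set such that (n/(n+1)) * empirical FNR + 1/(n+1) <= alpha. The abstract does not describe ConfGuide's actual procedure, so the function names, the threshold parameterisation, and the toy data are all hypothetical.

```python
def false_negative_rate(probs, labels, lam):
    """Fraction of true labels missed when keeping outcomes with prob >= 1 - lam."""
    kept = {k for k, p in enumerate(probs) if p >= 1.0 - lam}
    return 1.0 - len(kept & labels) / len(labels)


def calibrate_lambda(cal_probs, cal_labels, alpha, grid_size=101):
    """Smallest lambda on a grid whose adjusted calibration risk is at most alpha."""
    n = len(cal_probs)
    for i in range(grid_size):
        lam = i / (grid_size - 1)
        risk = sum(
            false_negative_rate(p, y, lam) for p, y in zip(cal_probs, cal_labels)
        ) / n
        # Finite-sample correction; the loss is bounded above by 1.
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return lam
    return 1.0  # fall back to keeping every outcome


def select_outcomes(probs, lam):
    """Indices of the outcomes retained for guidance generation."""
    return sorted(k for k, p in enumerate(probs) if p >= 1.0 - lam)


# Hypothetical calibration data: per-case outcome probabilities and true label sets.
cal_probs = [[0.9, 0.2, 0.1], [0.6, 0.7, 0.1], [0.3, 0.8, 0.4], [0.2, 0.1, 0.95]]
cal_labels = [{0}, {0, 1}, {1, 2}, {2}]
lam_hat = calibrate_lambda(cal_probs, cal_labels, alpha=0.35)
selected = select_outcomes([0.7, 0.5, 0.65], lam_hat)
```

Because the false negative rate can only decrease as lambda grows, the first grid value that satisfies the bound is also the one that yields the smallest outcome sets, which matches the stated goal of more succinct guidance.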

Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

2604.14892v2 by Amy Rouillard, Sitwala Mundia, Linda Camara, Michael Cameron Gramanie, Ziyaad Dangor, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett

Evaluating medical AI systems with expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world middle-income country (MIC) hospital cases. Model performance was benchmarked against expert clinician panel and independent human re-scoring panel evaluations. Both LLM and clinician-generated diagnoses are scored across four dimensions: diagnosis, differential diagnosis, clinical reasoning and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than the clinician panels' scores; (ii) the LLM jury preserves ordinal agreement and exhibits better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower in LLM jury models than in the human expert re-score panels; (iv) the LLM jury shows excellent agreement with the primary expert panels' rankings, and the LLM jury combined with AI model diagnoses can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) LLM jury models show no self-preference bias: they did not score diagnoses generated by their own underlying model or by models from the same vendor more (or less) favourably than those generated by other models. Finally, we demonstrate that LLM jury calibration using isotonic regression improves alignment with human expert panel evaluations. Together, these results provide compelling evidence that a calibrated, multi-model LLM jury can serve as a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking.

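Abstract: Evaluating medical AI systems with expert clinician panels is expensive and slow, which motivates the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury made up of three frontier AI models that scores 3333 diagnoses from 300 hospital cases in middle-income countries (MICs). Model performance is benchmarked against evaluations by expert clinician panels and by independent human re-scoring panels. Diagnoses generated by LLMs and by clinicians are scored on four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. For each, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors, and the effect of post-hoc calibration. We find that: (i) uncalibrated LLM jury scores are systematically lower than clinician panel scores; (ii) the LLM jury preserves ordinal agreement and shows better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower for LLM jury models than for the human expert re-score panels; (iv) the LLM jury shows excellent agreement with the rankings of the primary expert panels, and the LLM jury combined with AI model diagnoses can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) LLM jury models show no self-preference bias: they do not score diagnoses generated by their own underlying model, or by models from the same vendor, more (or less) favourably than diagnoses generated by other models. Finally, we show that calibrating the LLM jury with isotonic regression improves agreement with human expert panel evaluations. Taken together, these results provide strong evidence that a calibrated, multi-model LLM jury can serve as a reliable proxy for expert clinician evaluation in medical AI benchmarking.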
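The calibration step mentioned at the end of this abstract maps jury scores onto the panel's scale with a monotone function. The following is a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm, written in plain Python. The abstract does not say how the authors fit or apply their calibrator, so the step-function prediction rule, the function names, and the example scores are hypothetical.

```python
from bisect import bisect_right


def isotonic_fit(x, y):
    """Least-squares non-decreasing fit of y against x (pool adjacent violators)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    blocks = []  # each block holds [sum of y, number of points]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return xs, fitted


def isotonic_predict(xs, fitted, x_new):
    """Step-function lookup, clamped to the fitted range."""
    idx = max(bisect_right(xs, x_new) - 1, 0)
    return fitted[idx]


# Hypothetical paired scores: raw jury scores sit below the expert panel's scores.
jury_scores = [2, 3, 4, 5, 6]
panel_scores = [4.0, 3.5, 5.0, 7.0, 6.5]
xs, fitted = isotonic_fit(jury_scores, panel_scores)
calibrated = isotonic_predict(xs, fitted, 4.5)
```

A monotone calibrator of this kind shifts the jury's scores toward the panel's scale without changing their ordering, which is consistent with the finding that the uncalibrated jury is systematically lower yet preserves ordinal agreement.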

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

2604.14866v1 by Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang

Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

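Abstract: Vision-language models (VLMs) have shown significant potential in medical image analysis, yet their application to intraoral photography remains largely unexplored because fine-grained annotated datasets and comprehensive benchmarks are lacking. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel, large-scale dental image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images with this meta-labeling scheme. Using large language models (LLMs), we derive standardized benchmarks: approximately 15K visual question answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated through human review and error analysis to show that the LLM-driven conversion reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs on VQA, classification, and image captioning tasks. Quantitative results show that even the most advanced models struggle with fine-grained understanding of intraoral scenes, achieving only moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.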

Knowledge Graphs

Publish Date Title Authors Homepage Code
2026-04-24 BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering Jinghong Chen et.al. 2604.22678v1 null
2026-04-24 Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors Gautam Kumar Jain et.al. 2604.22560v1 null
2026-04-24 On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery Anti Alman et.al. 2604.22455v1 null
2026-04-24 Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control Qinhan Hou et.al. 2604.22413v1 null
2026-04-24 BLAST: Benchmarking LLMs with ASP-based Structured Testing Manuel Alejandro Borroto Santana et.al. 2604.22306v1 null
2026-04-24 STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation Peng Yu et.al. 2604.22282v1 null
2026-04-24 Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset Wenhui Huang et.al. 2604.22260v1 null
2026-04-24 A Probabilistic Framework for Hierarchical Goal Recognition Chenyuan Zhang et.al. 2604.22256v1 null
2026-04-24 Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA Zhanli Li et.al. 2604.22239v1 null
2026-04-24 An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments Hong Su et.al. 2604.22199v1 null
2026-04-24 How Large Language Models Balance Internal Knowledge with User and Document Assertions Shuowei Li et.al. 2604.22193v1 null
2026-04-24 Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems Meghana Karnam et.al. 2604.22154v1 null
2026-04-24 SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs Sihang et.al. 2604.22134v1 null
2026-04-23 PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training Harsh Kumar et.al. 2604.22117v1 null
2026-04-23 Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation Weisi Liu et.al. 2604.22098v1 null
2026-04-23 Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents Seyed Moein Abtahi et.al. 2604.22085v1 null
2026-04-23 Sound Agentic Science Requires Adversarial Experiments Dionizije Fa et.al. 2604.22080v1 null
2026-04-23 PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning Xiaoyi Chen et.al. 2604.22076v1 null
2026-04-23 Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning Karthic Palaniappan et.al. 2604.22062v1 null
2026-04-23 Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning João Mattos et.al. 2604.22031v1 null
2026-04-23 Rethinking Publication: A Certification Framework for AI-Enabled Research Yang Lu et.al. 2604.22026v1 null
2026-04-23 Multi-Task Optimization over Networks of Tasks Julian Hatzky et.al. 2604.21991v1 null
2026-04-23 When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs Pegah Khayatan et.al. 2604.21911v1 null
2026-04-23 From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation Bartosz Balis et.al. 2604.21910v1 null
2026-04-23 TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale Jun Wang et.al. 2604.21889v1 null
2026-04-23 A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents Praval Sharma et.al. 2604.21885v1 null
2026-04-23 Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms Yuto Nishida et.al. 2604.21882v1 null
2026-04-23 Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications Yvon K. Awuklu et.al. 2604.21793v1 null
2026-04-23 StructMem: Structured Memory for Long-Horizon Behavior in LLMs Buqiang Xu et.al. 2604.21748v1 null
2026-04-23 GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion Qizhuo Xie et.al. 2604.21649v1 null
2026-04-23 A systematic review of generative AI usage for IT project management Ionut Anghel et.al. 2604.21958v1 null
2026-04-23 The CriticalSet problem: Identifying Critical Contributors in Bipartite Dependency Networks Sebastiano A. Piccolo et.al. 2604.21537v1 null
2026-04-23 Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation Nikita Severin et.al. 2604.21536v1 null
2026-04-23 OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving Xinyu Zhang et.al. 2604.21510v1 null
2026-04-23 MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting Yining Xing et.al. 2604.21489v1 null
2026-04-23 Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms Jiyan Song et.al. 2604.21473v1 null
2026-04-23 Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation Wang Shi Hai et.al. 2604.21380v1 null
2026-04-23 ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs Jian Cui et.al. 2604.21357v1 null
2026-04-23 Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models Muhammad Shafique et.al. 2604.21952v1 null
2026-04-23 Can MLLMs "Read" What is Missing? Jindi Guo et.al. 2604.21277v1 null
2026-04-23 Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages Michael Bouzinier et.al. 2604.21263v1 null
2026-04-23 When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors Chenghao Yang et.al. 2604.21255v1 null
2026-04-23 Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation Hanwen Gu et.al. 2604.21253v1 null
2026-04-23 CAP: Controllable Alignment Prompting for Unlearning in LLMs Zhaokun Wang et.al. 2604.21251v2 null
2026-04-23 EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval Julian Acuna et.al. 2604.21229v1 null
2026-04-23 Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery Benjamin Przybocki et.al. 2604.21187v1 null
2026-04-23 TAPO-Description Logic for Information Behavior: Refined OBoxes, Inference, and Categorical Semantics Takao Inoué et.al. 2604.21172v1 null
2026-04-22 "This Wasn't Made for Me": Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias Siyu Liang et.al. 2604.21148v2 null
2026-04-22 Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification Jiho Noh et.al. 2604.21137v1 null
2026-04-22 GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons Sebastian Walter et.al. 2604.21133v1 null
2026-04-22 How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models Kristian Schwethelm et.al. 2604.21106v1 null
2026-04-22 Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery Siyuan Yao et.al. 2604.21102v1 null
2026-04-22 TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping Yannis Belkhiter et.al. 2604.21057v1 null
2026-04-22 The Last Harness You'll Ever Build Haebin Seong et.al. 2604.21003v1 null
2026-04-22 FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels Sina Gholami et.al. 2604.20825v1 null
2026-04-22 Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems Pavel Salovskii et.al. 2604.20795v1 null
2026-04-22 RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering Marisa Hudspeth et.al. 2604.20738v1 null
2026-04-22 COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling Noah Flynn et.al. 2604.20720v1 null
2026-04-22 Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization Shan He et.al. 2604.20714v1 null
2026-04-22 StormNet: Improving storm surge predictions with a GNN-based spatio-temporal offset forecasting model Noujoud Nader et.al. 2604.20688v2 null
2026-04-22 ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation Ioannis E. Livieris et.al. 2604.20666v1 null
2026-04-22 The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm Karan Goyal et.al. 2604.20665v1 null
2026-04-22 RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking Roie Kazoom et.al. 2604.20623v1 null
2026-04-22 Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge Naizhong Xu et.al. 2604.20598v1 null
2026-04-22 Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents Yuxuan Cai et.al. 2604.20572v1 null
2026-04-22 LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures Yuhang Wu et.al. 2604.20556v1 null
2026-04-22 Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies Shuai Chen et.al. 2604.20548v1 null
2026-04-22 Effects of Cross-lingual Evidence in Multilingual Medical Question Answering Anar Yeginbergen et.al. 2604.20531v1 null
2026-04-22 Knowledge Capsules: Structured Nonparametric Memory Units for LLMs Bin Ju et.al. 2604.20487v2 null
2026-04-22 Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks Pranav Pallerla et.al. 2604.20932v1 null
2026-04-22 HaS: Accelerating RAG through Homology-Aware Speculative Retrieval Peng Peng et.al. 2604.20452v1 null
2026-04-22 Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness Fulong Fan et.al. 2604.20413v1 null
2026-04-22 CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge Gustav Keppler et.al. 2604.20389v1 null
2026-04-22 Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs Aishik Mandal et.al. 2604.20382v1 null
2026-04-22 Domain-Aware Hierarchical Contrastive Learning for Semi-Supervised Generalization Fault Diagnosis Junyu Ren et.al. 2604.20928v1 null
2026-04-22 Surrogate modeling for interpreting black-box LLMs in medical predictions Changho Han et.al. 2604.20331v2 null
2026-04-22 Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction Dali Wang et.al. 2604.20311v2 null
2026-04-22 Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA Zibo Xu et.al. 2604.20306v1 null
2026-04-22 Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking Mo Zhou et.al. 2604.20283v1 null
2026-04-22 AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling Zhenyu Wang et.al. 2604.20263v1 null
2026-04-22 Hybrid Policy Distillation for LLMs Wenhong Zhu et.al. 2604.20244v1 null
2026-04-22 Construction of a Battery Research Knowledge Graph using a Global Open Catalog Luca Foppiano et.al. 2604.20241v1 null
2026-04-22 Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context Yilun Zhu et.al. 2604.20216v1 null
2026-04-22 Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs He Yang Yuan et.al. 2604.20211v1 null
2026-04-22 All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG Dan Wang et.al. 2604.20199v1 null
2026-04-22 Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving Xinyu Zhang et.al. 2604.20183v1 null
2026-04-22 SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition Jielong Tang et.al. 2604.20146v1 null
2026-04-22 To Know is to Construct: Schema-Constrained Generation for Agent Memory Lei Zheng et.al. 2604.20117v1 null
2026-04-22 Learning to Solve the Quadratic Assignment Problem with Warm-Started MCMC Finetuning Yicheng Pan et.al. 2604.20109v1 null
2026-04-22 Auditing and Controlling AI Agent Actions in Spreadsheets Sadra Sabouri et.al. 2604.20070v1 null
2026-04-21 Information Aggregation with AI Agents Spyros Galanis et.al. 2604.20050v1 null
2026-04-21 Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine Yusuf Kesmen et.al. 2604.20022v1 null
2026-04-21 From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents Md Nayem Uddin et.al. 2604.20006v1 null
2026-04-21 Tracing Relational Knowledge Recall in Large Language Models Nicholas Popovič et.al. 2604.19934v2 null
2026-04-21 CreativeGame:Toward Mechanic-Aware Creative Game Generation Hongnan Ma et.al. 2604.19926v1 null
2026-04-21 Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding Zijie Wang et.al. 2604.19921v1 null
2026-04-21 UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling Boyu Chen et.al. 2604.19734v1 null
2026-04-21 ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration Cagri Eryilmaz et.al. 2604.19856v1 null
2026-04-21 A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding Shuai Wang et.al. 2604.19689v1 null
2026-04-21 An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA Saransh Sharma et.al. 2604.19685v1 null

Abstracts

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

2604.22678v1 by Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne

A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.

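Abstract: A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate the answer. Although simple, this can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, in which relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, which becomes especially severe when the context contains visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance, because they prevent models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), together with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than on a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token with Bayes' rule during generation. This enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contributions, which makes it well suited to large document collections. We evaluate BERAG and BEFT mainly on knowledge-based visual question answering tasks, in which models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also show that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.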
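The token-by-token posterior update described in this abstract can be sketched with a toy ensemble. The sketch assumes the textbook formulation: the next-token distribution is the posterior-weighted mixture of per-document distributions, and after emitting a token each document's weight is multiplied by the probability it assigned to that token and then renormalised. Real per-document distributions would come from a language model conditioned on the prefix; here they are fixed dictionaries, and all names and numbers are hypothetical.

```python
def mixture(posterior, token_dists):
    """Ensemble next-token distribution: sum over documents of weight * p(token | doc)."""
    vocab = set().union(*token_dists)
    return {
        tok: sum(w * dist.get(tok, 0.0) for w, dist in zip(posterior, token_dists))
        for tok in vocab
    }


def bayes_update(posterior, token_dists, token):
    """Posterior over documents after observing the emitted token."""
    unnorm = [w * dist.get(token, 0.0) for w, dist in zip(posterior, token_dists)]
    total = sum(unnorm)
    if total == 0.0:
        return list(posterior)  # no document supports the token; keep the weights
    return [u / total for u in unnorm]


def greedy_decode(prior, steps):
    """Greedy decoding over pre-computed per-step, per-document distributions."""
    posterior = list(prior)
    output = []
    for token_dists in steps:
        mixed = mixture(posterior, token_dists)
        token = max(sorted(mixed), key=mixed.get)  # sorted() makes ties deterministic
        output.append(token)
        posterior = bayes_update(posterior, token_dists, token)
    return output, posterior


# Hypothetical example: document 0 supports "Paris", document 1 supports "Rome".
toy_steps = [
    [{"Paris": 0.9, "Rome": 0.1}, {"Paris": 0.2, "Rome": 0.8}],
    [{"<eos>": 1.0}, {"<eos>": 1.0}],
]
tokens, doc_posterior = greedy_decode([0.5, 0.5], toy_steps)
```

The final `doc_posterior` doubles as an attribution signal: a document whose weight has collapsed toward zero contributed little to the answer, which is the kind of quantity the abstract proposes for re-ranking, pruning, and grounding checks.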

Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors

2604.22560v1 by Gautam Kumar Jain, Carsten Markgraf, Julian Stähler

Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.

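Abstract: Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the input embeddings of the next stage. These projectors are trained jointly with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only about 0.5% of the parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and raises cross-stage entailment by 50%, as evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade because driving-domain pretraining is absent. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting that domain adaptation is a plausible next ingredient for full-spectrum improvement.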
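The gated context projector in this abstract is described only at a high level: take a hidden-state vector from one stage, project it, normalize it, gate it, and inject it into the next stage's input embeddings. The sketch below is one possible reading in plain Python. It assumes a linear projection, layer normalization, a scalar sigmoid gate, and injection by addition to every token embedding; the abstract does not confirm any of these choices, and all names and values are hypothetical.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def project(weights, h):
    """Linear map: each row of weights produces one output coordinate."""
    return [sum(w * x for w, x in zip(row, h)) for row in weights]


def gated_context(h_prev, weights, gate_logit):
    """Normalized projection of the previous stage's hidden state, scaled by a gate."""
    gate = 1.0 / (1.0 + math.exp(-gate_logit))  # scalar gate in (0, 1)
    return [gate * c for c in layer_norm(project(weights, h_prev))]


def inject(embeddings, ctx):
    """Assumed injection rule: add the context vector to each token embedding."""
    return [[e + c for e, c in zip(tok, ctx)] for tok in embeddings]


# Hypothetical shapes: a 3-dim hidden state projected into a 2-dim embedding space.
h_perception = [1.0, 2.0, 3.0]
proj_weights = [[0.5, 0.0, -0.5], [0.1, 0.2, 0.3]]
planning_embeddings = [[0.0, 0.0], [1.0, -1.0]]
ctx_open = gated_context(h_perception, proj_weights, gate_logit=0.0)
ctx_closed = gated_context(h_perception, proj_weights, gate_logit=-50.0)
conditioned = inject(planning_embeddings, ctx_open)
```

With the gate nearly closed the next stage sees almost unmodified embeddings, so a learned gate lets training decide how much cross-stage context to pass.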

On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery

2604.22455v1 by Anti Alman, Izack Cohen, Avigdor Gal, Fabrizio Maria Maggi, Marco Montali

A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.

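Abstract: A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process awareness and defines the boundaries within which it must operate. Compared with traditional process models, the process frame should in principle provide a more permissive representation of the managed processes, so that the (semi-)autonomous behavior of an ABPMS, referred to as framed autonomy, can emerge. At the same time, it is not restricted to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the ABPMS process frame as a hybrid business process representation consisting of semi-concurrently executed procedural and declarative process models. We draw on our earlier work to outline the execution semantics of this type of process frame, arguing that the open-world assumption of the declarative paradigm should also be adopted for procedural process models. This leads to a constraint-like interpretation in which each procedural model is taken to constrain the activities within that model, without imposing explicit execution requirements or limitations on activities that may appear in other models. This is analogous to existing declarative languages such as Declare, where each constraint directly affects only the specific activities it constrains. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thereby laying the foundation for a corresponding process (frame) discovery approach.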
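The open-world reading of constraints mentioned in this abstract can be made concrete with a single Declare template. The sketch below checks the Response(a, b) constraint, which requires every occurrence of activity a to be eventually followed by b, against a finished trace. Activities that the constraint does not mention are ignored, which is the behaviour the abstract carries over to procedural models. The function name and the example traces are hypothetical and are not taken from the paper.

```python
def satisfies_response(trace, a, b):
    """Declare Response(a, b): each occurrence of a is eventually followed by b."""
    pending = False  # True while some a is still waiting for a later b
    for activity in trace:
        if activity == a:
            pending = True
        elif activity == b:
            pending = False
    return not pending


# Hypothetical traces: activities other than "pay" and "ship" are unconstrained.
ok_trace = ["register", "pay", "audit", "ship", "archive"]
bad_trace = ["register", "pay", "audit"]
vacuous_trace = ["register", "audit"]
results = [
    satisfies_response(ok_trace, "pay", "ship"),
    satisfies_response(bad_trace, "pay", "ship"),
    satisfies_response(vacuous_trace, "pay", "ship"),
]
```

The third trace satisfies the constraint vacuously because "pay" never occurs, and "register", "audit", and "archive" never affect the verdict, since the constraint acts only on the activities it names.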

Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control

2604.22413v1 by Qinhan Hou, Jing Tang

Graph Transformers can mix information globally, but this flexibility also creates failure modes: some tasks require long-range communication while others are better served by local interaction. We study this through a synthetic node-classification benchmark on contextual stochastic block model graphs, where labels are generated by a controllable mixture of local and far-shell signals. We define distance-misaligned training as a mismatch between where label-relevant information lies and where the model allocates communication over graph distance. On this benchmark, we find three points. First, the preferred graph-distance bias changes systematically with task locality. Second, an oracle adaptive controller, given offline access to the task-side distance target, nearly matches the best fixed bias across regimes and strongly improves over a neutral baseline on mixed and local tasks. Third, a task-agnostic zero-gap controller is weaker, indicating that adaptation alone is not enough and that the control target matters. These results suggest that distance-resolved diagnosis is useful for understanding Graph Transformer failures and for designing graph-aware control.

BLAST: Benchmarking LLMs with ASP-based Structured Testing

2604.22306v1 by Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri, Francesco Ricca

Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite this progress, comparatively little attention has been paid to date to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP). In this paper we introduce BLAST, the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.

STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation

2604.22282v1 by Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu

Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

2604.22260v1 by Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv

Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.

A Probabilistic Framework for Hierarchical Goal Recognition

2604.22256v1 by Chenyuan Zhang, Katherine Ip, Hamid Rezatofighi, Buser Say, Mor Vered

Goal recognition aims to infer an agent's goal from observations of its behaviour. In realistic settings, recognition can benefit from exploiting hierarchical task structure and reasoning under uncertainty. Planning-based goal recognition has made substantial progress over the past decade, but to the best of our knowledge no existing approach jointly integrates hierarchical task structure with probabilistic inference. In this paper, we introduce the first planning-based probabilistic framework for hierarchical goal recognition over Hierarchical Task Networks (HTNs). We instantiate the framework by exploiting an HTN planner with a three-stage generative model for likelihood estimation, yielding posterior distributions over goal hypotheses. Empirical results show improved recognition performance over the existing HTN-based recognizer on HTN benchmarks. Overall, the framework lays a foundation for probabilistic goal recognition grounded in hierarchical planning structure, moving goal recognition toward more practical settings.
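
The posterior over goal hypotheses mentioned above follows the usual Bayesian pattern: multiply each goal's observation likelihood by its prior and normalize. The sketch below shows only that final normalization step; the likelihood values are made-up numbers, and the paper's three-stage generative model for estimating them is not reproduced here.

```python
def goal_posterior(likelihoods, priors):
    """Return P(goal | observations) from P(observations | goal) and P(goal)."""
    joint = {goal: likelihoods[goal] * priors[goal] for goal in likelihoods}
    evidence = sum(joint.values())  # normalizing constant P(observations)
    return {goal: value / evidence for goal, value in joint.items()}


# Hypothetical goals with illustrative likelihoods and a uniform prior.
posterior = goal_posterior(
    likelihoods={"deliver_package": 0.2, "refuel": 0.6},
    priors={"deliver_package": 0.5, "refuel": 0.5},
)
```

In a hierarchical setting, the likelihood for each goal would presumably come from plans generated by an HTN planner, which is where the approach described in the abstract does its real work.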

2604.22239v1 by Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo

This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli-Li/MuDABench.

An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments

2604.22199v1 by Hong Su

Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered tasks, and even successful executions or observed successful external behaviors are not always autonomously transformed into reusable local knowledge. In this paper, we propose an LLM-driven closed-loop autonomous learning framework for robots facing uncovered tasks in open environments. The proposed framework first retrieves the local method library to determine whether a reusable solution already exists for the current task or observed event. If no suitable method is found, it triggers an autonomous learning process in which the LLM serves as a high-level reasoning component for task analysis, candidate model selection, data collection planning, and execution or observation strategy organization. The robot then learns from both self-execution and active observation, performs quasi-real-time training and adjustment, and consolidates the validated result into the local method library for future reuse. Through this recurring closed-loop process, the robot gradually converts both execution-derived and observation-derived experience into reusable local capability while reducing future dependence on repeated external LLM interaction. Results show that the proposed framework reduces execution time and LLM dependence in both repeated-task self-execution and observation-driven settings, for example reducing the average total execution time from 7.7772s to 6.7779s and the average number of LLM calls per task from 1.0 to 0.2 in the repeated-task self-execution experiments.
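
The retrieve-then-learn loop described above behaves much like a cache in front of the LLM. The sketch below is a toy model of that loop under strong assumptions: `MethodLibrary`, `stub_planner`, and the task name are hypothetical, learning is reduced to storing whatever the planner returns, and the choice of five repetitions is ours rather than the paper's.

```python
class MethodLibrary:
    """Toy local method library that falls back to an LLM planner on a miss."""

    def __init__(self, llm_planner):
        self.methods = {}            # task name -> stored reusable method
        self.llm_planner = llm_planner
        self.llm_calls = 0

    def handle(self, task):
        # Step 1: look for an existing local method.
        if task not in self.methods:
            # Step 2: no method found, so ask the LLM and consolidate the result.
            self.llm_calls += 1
            self.methods[task] = self.llm_planner(task)
        # Step 3: reuse the local method.
        return self.methods[task]


def stub_planner(task):
    """Stand-in for a real LLM call; returns a placeholder method description."""
    return f"learned method for {task}"


library = MethodLibrary(stub_planner)
repetitions = 5
for _ in range(repetitions):
    library.handle("open_door")
average_calls = library.llm_calls / repetitions
```

In this toy setting a single LLM call amortized over five repetitions averages 0.2 calls per task, the same order as the figure quoted in the abstract, although the actual experimental protocol is not described there. For reference, the reported drop in average total execution time from 7.7772s to 6.7779s is a reduction of roughly 12.8%.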

How Large Language Models Balance Internal Knowledge with User and Document Assertions

2604.22193v1 by Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang

Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

2604.22154v1 by Meghana Karnam, Ananya Joshi

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets: 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while maintaining similar false negative rates across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
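
The quoted reduction can be checked directly from the two false positive rates given in the abstract. The helper name below is ours; the numbers are the ones reported for AEGIS 2.0.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# False positive rates on AEGIS 2.0, as reported in the abstract.
reduction = relative_reduction(baseline=0.159, improved=0.095)
print(f"{reduction:.1%}")  # prints 40.3%
```

The result, about 40.3%, is consistent with the rounded 40% figure in the abstract.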

SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

2604.22134v1 by Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen

Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE

PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training

2604.22117v1 by Harsh Kumar, Rahul Maity, Tanmay Joshi, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.

Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation

2604.22098v1 by Weisi Liu, Guangzeng Han, Xiaolei Huang

Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data but deployed on future data, where semantic distributions and domain knowledge may have evolved. Unfortunately, existing studies either overlook temporal shifts or barely capture the rich shifting patterns of both semantics and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontologies such as MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks over clinical, legal, and scientific corpora, demonstrating consistent improvements from temporal adaptation across these domains. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.

Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

2604.22085v1 by Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani

The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large-language-model-mediated entity extraction, explicit graph schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh's Information Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval within sub-ninety-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.

Sound Agentic Science Requires Adversarial Experiments

2604.22080v1 by Dionizije Fa, Marko Culjak

LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode: the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses and optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. The missing evidence is a negative space: experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

2604.22076v1 by Xiaoyi Chen, Haoyuan Wang, Siyuan Tang, Sijia Liu, Liya Su, XiaoFeng Wang, Haixu Tang

Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.

Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning

2604.22062v1 by Karthic Palaniappan

There are 7,407 languages in the world. But what about the languages that are not there in the world? Are humans so narrow-minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks, who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of "thinking systems". With Qwen3-VL-2B-Instruct as base model and 4 × Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75% over SymPy. I've documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: https://github.com/i-like-bfs-and-dfs/wolfram-reasoning.

Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning

2604.22031v1 by João Mattos, Arlei Silva

We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8 to 27 times less training time than the strongest baseline.
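
Pre-training on few-shot episodes generally means repeatedly drawing small N-way K-shot tasks, each with a support set and a query set. The sketch below shows one common way to sample such an episode from labeled nodes; the function name, the parameters, and the sampling scheme are assumptions for illustration and are not taken from the Mochi paper.

```python
import random


def sample_episode(labels, n_way, k_shot, q_query, rng):
    """Sample an N-way K-shot episode from a {node: label} mapping.

    Returns (support, query), two lists of (node, label) pairs that
    share the same n_way classes and have no nodes in common.
    """
    by_class = {}
    for node, label in labels.items():
        by_class.setdefault(label, []).append(node)
    # Keep only classes with enough nodes for both support and query.
    eligible = sorted(c for c, nodes in by_class.items()
                      if len(nodes) >= k_shot + q_query)
    classes = rng.sample(eligible, n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + q_query)
        support += [(node, c) for node in picked[:k_shot]]
        query += [(node, c) for node in picked[k_shot:]]
    return support, query


# Toy graph: 30 nodes spread evenly over 3 classes.
toy_labels = {node: node % 3 for node in range(30)}
support_set, query_set = sample_episode(
    toy_labels, n_way=2, k_shot=3, q_query=2, rng=random.Random(0)
)
```

A model trained on many such episodes sees the same support/query structure at training time that it will face at evaluation time, which is the alignment the abstract argues for.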

Rethinking Publication: A Certification Framework for AI-Enabled Research

2604.22026v1 by Yang Lu, Rabimba Karanjai, Lei Xu, Weidong Shi

AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. Yet the publication system was built on the assumption of universal human authorship and lacks a principled way to evaluate knowledge produced through automated pipelines. This paper proposes a two-layer certification framework that separates knowledge quality assessment from grading of human contribution, allowing publication systems to handle pipeline-generated work consistently and transparently without creating new institutions. The paper uses normative-conceptual analysis, framework design under four explicit constraints, and dry-run validation on two representative submission cases spanning key attribution scenarios. The framework grades contributions as Category A (pipeline-reachable), Category B (requiring human direction at identifiable stages), and Category C (beyond current pipeline reach at the formulation stage). It also introduces benchmark slots for fully disclosed automated research as both a transparent publication track and a calibration instrument for reviewer judgment. Contribution grading is contemporaneous, based on pipeline capability at the time of submission. Dry-run validation shows that the framework can certify knowledge appropriately while tolerating irreducible attribution uncertainty. The paper argues that publication has always certified both that knowledge is valid and that a human made it. AI pipelines separate these functions for the first time. The framework is implementable within existing editorial infrastructure and grounds recognition of frontier human contribution in epistemic achievement rather than unverifiable claims of human origin.

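Summary: AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. However, the publication system was built on the assumption of universal human authorship and lacks a principled way to evaluate knowledge produced through automated pipelines. This paper proposes a two-layer certification framework that separates the assessment of knowledge quality from the grading of human contribution, allowing publication systems to handle pipeline-generated work consistently and transparently without creating new institutions. The paper uses normative-conceptual analysis, framework design under four explicit constraints, and dry-run validation on two representative submission cases covering key attribution scenarios. The framework grades contributions as Category A (pipeline-reachable), Category B (requiring human direction at identifiable stages), and Category C (beyond current pipeline reach at the formulation stage). It also introduces benchmark slots for fully disclosed automated research, serving as both a transparent publication track and a calibration tool for reviewer judgment. Contribution grading is contemporaneous, based on pipeline capability at the time of submission. Dry-run validation shows that the framework can certify knowledge appropriately while tolerating irreducible attribution uncertainty. The paper argues that publication has always certified both that knowledge is valid and that a human created it; AI pipelines separate these functions for the first time. The framework can be implemented within existing editorial infrastructure and grounds recognition of frontier human contribution in epistemic achievement rather than in unverifiable claims of human origin.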

Multi-Task Optimization over Networks of Tasks

2604.21991v1 by Julian Hatzky, Thomas Bartz-Beielstein, A. E. Eiben, Anil Yaman

Multi-task optimization is a powerful approach for solving a large number of tasks in parallel. However, existing algorithms face distinct limitations: Population-based methods scale poorly and remain underexplored for large task sets. Approaches that do scale beyond a thousand tasks are mostly MAP-Elites variants and rely on a fixed, discretized archive that disregards the topology of the task space. We introduce MONET (Multi-Task Optimization over Networks of Tasks), a multi-task optimization algorithm that models the task space as a graph: tasks are nodes, and edges connect tasks in the task parameter space. This representation enables knowledge transfer between tasks and remains tractable for high-dimensional problems while exploiting the topology of the task space. MONET combines social learning, which generates candidates from neighboring nodes via crossover, with individual learning, which refines a node's own solution independently via mutation. We evaluate MONET on four domains (archery, arm, and cartpole with 5,000 tasks each; hexapod with 2,000 tasks) and show that it matches or exceeds the performance of existing MAP-Elites-based baselines across all four domains.

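Summary: Multi-task optimization is a powerful approach for solving a large number of tasks in parallel. However, existing algorithms face clear limitations: population-based methods scale poorly and remain underexplored on large task sets. Algorithms that scale beyond a thousand tasks are mostly MAP-Elites variants that rely on a fixed, discretized archive and ignore the topology of the task space. We introduce MONET (Multi-Task Optimization over Networks of Tasks), a multi-task optimization algorithm that models the task space as a graph: tasks are nodes, and edges connect tasks in the task parameter space. This representation enables knowledge transfer between tasks and remains tractable for high-dimensional problems while exploiting the topology of the task space. MONET combines social learning, which generates candidates from neighboring nodes via crossover, with individual learning, which independently refines a node's own solution via mutation. We evaluate MONET on four domains (archery, arm, and cartpole with 5,000 tasks each; hexapod with 2,000 tasks) and show that it matches or exceeds existing MAP-Elites-based baselines in all four domains.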
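
The loop described in the MONET abstract (a task graph, crossover with a neighboring task, mutation of a task's own solution) can be illustrated with a minimal sketch. This is not the authors' implementation: the k-nearest-neighbor graph construction, the uniform crossover, the Gaussian mutation, the elitist replacement rule, and every function and parameter name below are assumptions made for illustration.

```python
import random


def build_task_graph(task_params, k=2):
    """Connect each task to its k nearest neighbours in task-parameter space (assumed rule)."""
    graph = {}
    for i, p in enumerate(task_params):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(task_params)
            if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph


def crossover(a, b, rng):
    """Uniform crossover: each gene comes from either parent with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(x, rng, sigma=0.1):
    """Gaussian mutation of every gene."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def monet_like(task_params, fitness, dim, steps=2000, p_social=0.5, seed=0):
    """Keep one solution per task; a candidate replaces it only if it is at least as fit."""
    rng = random.Random(seed)
    graph = build_task_graph(task_params)
    sols = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in task_params]
    fits = [fitness(s, t) for s, t in zip(sols, task_params)]
    for _ in range(steps):
        i = rng.randrange(len(task_params))
        if rng.random() < p_social:
            # Social learning: recombine with the solution of a neighbouring task.
            j = rng.choice(graph[i])
            cand = crossover(sols[i], sols[j], rng)
        else:
            # Individual learning: perturb the task's own solution.
            cand = mutate(sols[i], rng)
        f = fitness(cand, task_params[i])
        if f >= fits[i]:
            sols[i], fits[i] = cand, f
    return sols, fits
```

The sketch needs at least two tasks so that every node has a neighbor. The fitness function takes a candidate solution and the task's parameter vector and returns a value to maximize.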

When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs

2604.21911v1 by Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord

Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .

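Summary: Despite impressive progress in the capabilities of large vision-language models (LVLMs), these systems remain prone to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, but the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark for better understanding the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations stem mainly from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs toward more visually grounded responses. HalluVL-DPO uses preference optimization with a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We show that our optimized model effectively mitigates the targeted hallucination failure mode while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .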

From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation

2604.21910v1 by Bartosz Balis, Michal Orzechowski, Piotr Kica, Michal Dygas, Michal Kuszewski

Scientific workflow systems automate execution -- scheduling, fault tolerance, resource management -- but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.

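Summary: Scientific workflow systems automate execution (scheduling, fault tolerance, resource management) but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task that requires both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: a large language model (LLM) interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow directed acyclic graphs (DAGs) (deterministic layer); and domain experts author "Skills", Markdown documents that encode vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and a cost under $0.001 per query.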
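
The property that "identical intents always yield identical workflows" can be illustrated with a small sketch of a deterministic generator. The intent schema (`analysis`, `population`, `chromosomes`), the task names, and both functions are hypothetical and are not taken from the paper or from Hyperflow; the only point is that the generator is a pure function of a normalized intent, so any LLM variability stops at intent extraction.

```python
import hashlib
import json


def normalize_intent(intent):
    """Canonicalise a structured intent (hypothetical schema) so equivalent intents compare equal."""
    return {
        "analysis": intent["analysis"],
        "population": intent["population"],
        "chromosomes": sorted(set(intent["chromosomes"])),
    }


def generate_workflow(intent):
    """Expand an intent into a DAG, given as a mapping from task name to prerequisite tasks."""
    norm = normalize_intent(intent)
    analysis = norm["analysis"]
    tasks = {}
    for c in norm["chromosomes"]:
        tasks[f"extract_chr{c}"] = []
        tasks[f"{analysis}_chr{c}"] = [f"extract_chr{c}"]
    tasks["merge"] = [f"{analysis}_chr{c}" for c in norm["chromosomes"]]
    # The identifier is a hash of the canonical intent, so it is stable across runs.
    digest = hashlib.sha256(json.dumps(norm, sort_keys=True).encode("utf-8")).hexdigest()
    return {"id": digest[:12], "tasks": tasks}
```

Two intents that differ only in key order or in the order and repetition of chromosomes produce the same workflow and the same identifier, while a change in any normalized field produces a different identifier.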

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

2604.21889v1 by Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

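Summary: Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even a few minutes of downtime can cause massive financial losses and erode user trust. Although customer incidents are an important signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging because of extreme noise, high throughput, and the semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that combines efficient indexing techniques with large language models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from only a handful of diverse user descriptions. The engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks built from real-world data show that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and signal-to-noise ratio.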

A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents

2604.21885v1 by Praval Sharma

Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction, MODEE, a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.

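Summary: Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction methods have limitations: (1) closed-domain algorithms are restricted to predefined event types and therefore rarely generalize to unseen types; (2) open-domain event extraction algorithms, although able to handle unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. In addition, they do not explicitly model document-level contextual, structural, and semantic reasoning, which is crucial for effective event extraction but remains challenging for LLMs because of the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction (MODEE), a new approach that combines graph-based learning with text representations from LLMs to model document-level reasoning. Empirical evaluations on large datasets show that MODEE outperforms state-of-the-art open-domain event extraction methods and can be generalized to closed-domain event extraction, where it also outperforms existing algorithms.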

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

2604.21882v1 by Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita, Hidetaka Kamigaito, Taro Watanabe

Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.

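Summary: Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based question answering (QA) is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, which makes it difficult to separate fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity's surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity frequency and surface frequency are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears to be neither purely surface-specific nor fully surface-invariant, which highlights the importance of surface-form diversity when evaluating non-verbatim memorization.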

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

2604.21793v1 by Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue, Vianney Jouhet, Fleur Mougin

In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. As some incorrect events might be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism to select preferred consistent sets of events. While reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the interest of the approach, both in terms of computational feasibility and positive alignment of our results with medical expert opinions. While strongly motivated by the needs of the healthcare domain, our framework is purposely generic, enabling its reuse in other areas.

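Summary: In this paper, we develop a novel logic-based approach to detecting high-level, temporally extended events from timestamped data and background knowledge. Our framework uses logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. Because some incorrect events may be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism to select preferred consistent sets of events. Although reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements the core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the value of the approach, both in terms of computational feasibility and in the positive alignment of our results with medical expert opinions. While strongly motivated by the needs of the healthcare domain, our framework is deliberately generic so that it can be reused in other areas.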

StructMem: Structured Memory for Long-Horizon Behavior in LLMs

2604.21748v1 by Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng

Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo, while substantially reducing token usage, API calls, and runtime compared to prior memory systems. See https://github.com/zjunlp/LightMem .

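Summary: Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, in order to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but cannot model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose StructMem, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on LoCoMo while substantially reducing token usage, API calls, and runtime compared with prior memory systems. See https://github.com/zjunlp/LightMem .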

GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion

2604.21649v1 by Qizhuo Xie, Yunhui Liu, Yu Xing, Qianzi Hou, Xudong Jin, Tao Zheng, Tieke He

Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at https://github.com/mikumifa/GS-Quant.

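Summary: Large language models (LLMs) have shown great potential in knowledge graph completion (KGC), but bridging the gap between continuous graph embeddings and discrete LLM tokens remains a key challenge. Although recent quantization-based methods attempt to align these modalities, they usually treat quantization as flat numerical compression, which yields semantically entangled codes that fail to reflect the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant builds on the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. In addition, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, turning independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures in a way that is isomorphic to natural language generation. Experimental results show that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at https://github.com/mikumifa/GS-Quant.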

A systematic review of generative AI usage for IT project management

2604.21958v1 by Ionut Anghel, Tudor Cioara

This paper aims to synthesize current knowledge on generative AI in IT project management using the PRISMA methodology to provide researchers with a comprehensive perspective on techniques, applications, adoption trends, limitations, and integration across project management tools and process groups. The analysis reveals a clear dominance of OpenAI's GPT in the included studies, which rely primarily on prompt engineering, suggesting that research in this area remains at an exploratory stage. Finally, it identifies and discusses three promising research directions for AI-enabled project management, including process group-specific AI agents, project role-based AI agents, and hybrid collaborative networks that enable human-guided orchestration.

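Summary: This paper aims to synthesize current knowledge on generative AI in IT project management using the PRISMA methodology, giving researchers a comprehensive view of techniques, applications, adoption trends, limitations, and integration across project management tools and process groups. The analysis shows that OpenAI's GPT clearly dominates the included studies, which rely mainly on prompt engineering, indicating that research in this area is still at an exploratory stage. Finally, the paper identifies and discusses three promising research directions for AI-enabled project management: process group-specific AI agents, project role-based AI agents, and hybrid collaborative networks that enable human-guided orchestration.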

The CriticalSet problem: Identifying Critical Contributors in Bipartite Dependency Networks

2604.21537v1 by Sebastiano A. Piccolo, Andrea Tagarelli

Identifying critical nodes in complex networks is a fundamental task in graph mining. Yet, methods addressing all-or-nothing coverage mechanics in a bipartite dependency network (a graph with two types of nodes where edges represent dependency relationships across the two groups only) remain largely unexplored. We formalize the CriticalSet problem: given an arbitrary bipartite graph modeling dependencies of items on contributors, identify the set of k contributors whose removal isolates the largest number of items. We prove that this problem is NP-hard and requires maximizing a supermodular set function, for which standard forward greedy algorithms provide no approximation guarantees. Consequently, we model CriticalSet as a coalitional game, deriving a closed-form centrality, ShapleyCov, based on the Shapley value. This measure can be interpreted as the expected number of items isolated by a contributor's departure. Leveraging these insights, we propose MinCov, a linear-time iterative peeling algorithm that explicitly accounts for connection redundancy, prioritizing contributors who uniquely support many items. Extensive experiments on synthetic and large-scale real datasets, including a Wikipedia graph with over 250 million edges, reveal that MinCov and ShapleyCov significantly outperform traditional baselines. Notably, MinCov achieves near-optimal performance, within 0.02 AUC of a Stochastic Hill Climbing metaheuristic, while remaining several orders of magnitude faster.

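Summary: Identifying critical nodes in complex networks is a fundamental task in graph mining. However, methods that address all-or-nothing coverage mechanics in a bipartite dependency network (a graph with two types of nodes in which edges represent dependency relationships only across the two groups) remain largely unexplored. We formalize the CriticalSet problem: given an arbitrary bipartite graph that models the dependencies of items on contributors, identify the set of k contributors whose removal isolates the largest number of items. We prove that this problem is NP-hard and requires maximizing a supermodular set function, for which standard forward greedy algorithms provide no approximation guarantees. We therefore model CriticalSet as a coalitional game and derive a closed-form centrality measure, ShapleyCov, based on the Shapley value. This measure can be interpreted as the expected number of items isolated by a contributor's departure. Building on these insights, we propose MinCov, a linear-time iterative peeling algorithm that explicitly accounts for connection redundancy and prioritizes contributors who uniquely support many items. Extensive experiments on synthetic and large-scale real datasets, including a Wikipedia graph with over 250 million edges, show that MinCov and ShapleyCov significantly outperform traditional baselines. Notably, MinCov achieves near-optimal performance, within 0.02 AUC of a Stochastic Hill Climbing metaheuristic, while being several orders of magnitude faster.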
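
The all-or-nothing coverage objective in the CriticalSet abstract can be made concrete with a short sketch. The function names and the toy data are assumptions. The closed form used in `shapley_cov` is the standard Shapley value of the game v(S) = "number of items whose contributors all lie in S" (each item splits one unit equally among its contributors); whether the paper's ShapleyCov is defined exactly this way is also an assumption. The sketch assumes every item has at least one contributor.

```python
from itertools import combinations


def isolated(deps, removed):
    """Count items whose every contributor is in `removed` (all-or-nothing coverage)."""
    removed = set(removed)
    return sum(1 for contribs in deps.values() if set(contribs) <= removed)


def shapley_cov(deps):
    """Shapley value of v(S) = isolated(deps, S): each item gives 1/degree to each contributor."""
    score = {}
    for contribs in deps.values():
        unique = set(contribs)
        for c in unique:
            score[c] = score.get(c, 0.0) + 1.0 / len(unique)
    return score


def best_set_bruteforce(deps, k):
    """Exhaustive search over all k-subsets; only feasible for tiny instances."""
    contributors = sorted({c for cs in deps.values() for c in cs})
    return max(combinations(contributors, k), key=lambda s: isolated(deps, s))


# Toy instance: item -> contributors it depends on.
deps = {"i1": ["a"], "i2": ["a", "b"], "i3": ["b", "c"], "i4": ["c"]}
```

On the toy instance, removing `a` alone isolates one item and removing `b` alone isolates none, yet removing both isolates two. This is the supermodular behavior that leaves a forward greedy selection without guarantees.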

Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation

2604.21536v1 by Nikita Severin, Danil Kartushov, Vladislav Urzhumov, Vladislav Kulikov, Oksana Konovalova, Alexey Grishanov, Anton Klenitskiy, Artem Fatkulin, Alexey Vasilev, Andrey Savchenko, Ilya Makarov

Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large Language Models (LLMs) present opportunities to enhance user understanding with their reasoning capabilities, yet existing integration approaches create prohibitive inference costs in real time. To address these limitations, we present a novel knowledge distillation method that transfers textual user profiles generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The resulting approach maintains the inference efficiency of traditional sequential models while requiring neither architectural modifications nor LLM fine-tuning.

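Summary: Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large language models (LLMs) offer an opportunity to enhance user understanding through their reasoning capabilities, but existing integration approaches incur prohibitive costs for real-time inference. To address these limitations, we present a novel knowledge distillation method that transfers textual user profiles generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The proposed approach preserves the inference efficiency of traditional sequential models and requires neither architectural modifications nor LLM fine-tuning.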

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

2604.21510v1 by Xinyu Zhang, Boxuan Zhang, Yuchen Wan, Lingling Zhang, YiXing Yao, Bifan Wei, Yaqiang Wu, Jun Liu

While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling & logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.

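Summary: Although large language models (LLMs) demonstrate remarkable reasoning ability, complex optimization tasks remain challenging and require domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, which hinders comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems covering neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, at three difficulty levels: Easy, Medium, and Hard. Experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models such as GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we find that modeling and logic errors remain the primary bottleneck. We therefore propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.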

MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting

2604.21489v1 by Yining Xing, Zehong Ke, Yiqian Tu, Zhiyuan Liu, Wenhao Yu, Jianqiang Wang

Multi-modal trajectory generation is essential for safe autonomous driving, yet existing diffusion-based planners suffer from high inference latency due to iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, a Variational Autoencoder to structure expert trajectories into a compact 32-dimensional latent manifold, and an ultra-lightweight MLP-Mixer decoder to eliminate quadratic attention complexity. Importantly, we introduce a latent-space drifting loss that shifts the complex distribution evolution entirely to the training phase. By formulating explicit attractive and repulsive forces, this mechanism empowers the model to synthesize novel, proactive maneuvers, such as active overtaking, that are virtually absent from the raw expert demonstrations. Extensive evaluations on the nuPlan benchmark demonstrate that MISTY achieves state-of-the-art results on the challenging Test14-hard split, with comprehensive scores of 80.32 and 82.21 in non-reactive and reactive settings, respectively. Operating at over 99 FPS with an end-to-end latency of 10.1 ms, MISTY offers an order-of-magnitude speedup over iterative diffusion planners while achieving significantly robust generation.

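Summary: Multi-modal trajectory generation is essential for safe autonomous driving, but existing diffusion-based planners suffer from high inference latency because they require iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, a Variational Autoencoder to structure expert trajectories into a compact 32-dimensional latent manifold, and an ultra-lightweight MLP-Mixer decoder to eliminate quadratic attention complexity. Importantly, we introduce a latent-space drifting loss that shifts the complex distribution evolution entirely to the training phase. By formulating explicit attractive and repulsive forces, this mechanism enables the model to synthesize novel, proactive maneuvers, such as active overtaking, that are virtually absent from the raw expert demonstrations. Extensive evaluations on the nuPlan benchmark show that MISTY achieves state-of-the-art results on the challenging Test14-hard split, with comprehensive scores of 80.32 and 82.21 in the non-reactive and reactive settings, respectively. Running at over 99 FPS with an end-to-end latency of 10.1 ms, MISTY offers an order-of-magnitude speedup over iterative diffusion planners while achieving significantly robust generation.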

Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms

2604.21473v1 by Jiyan Song, Wenyang Wang, Chengcheng Yan, Zhiquan Han, Feifei Zhao

In the treatment of complex diseases, treatment regimens using a single drug often yield limited efficacy and can lead to drug resistance. In contrast, combination drug therapies can significantly improve therapeutic outcomes through synergistic effects. However, experimentally validating all possible drug combinations is prohibitively expensive, underscoring the critical need for efficient computational prediction methods. Although existing approaches based on deep learning and graph neural networks (GNNs) have made considerable progress, challenges remain in reducing structural bias, improving generalization capability, and enhancing model interpretability. To address these limitations, this paper proposes a collaborative prediction graph neural network that integrates molecular structural features and cell-line genomic profiles with drug-drug interactions to enhance the prediction of synergistic effects. We introduce a novel model named the Residual Graph Isomorphism Network integrated with an Attention mechanism (ResGIN-Att). The model first extracts multi-scale topological features of drug molecules using a residual graph isomorphism network, where residual connections help mitigate over-smoothing in deep layers. Subsequently, an adaptive Long Short-Term Memory (LSTM) module fuses structural information from local to global scales. Finally, a cross-attention module is designed to explicitly model drug-drug interactions and identify key chemical substructures. Extensive experiments on five public benchmark datasets demonstrate that ResGIN-Att achieves competitive performance, comparing favorably against key baseline methods while exhibiting promising generalization capability and robustness.

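Summary: In the treatment of complex diseases, single-drug regimens often have limited efficacy and can lead to drug resistance. In contrast, combination drug therapies can significantly improve therapeutic outcomes through synergistic effects. However, experimentally validating all possible drug combinations is prohibitively expensive, which underscores the urgent need for efficient computational prediction methods. Although existing approaches based on deep learning and graph neural networks (GNNs) have made considerable progress, challenges remain in reducing structural bias, improving generalization, and enhancing model interpretability. To address these limitations, this paper proposes a collaborative prediction graph neural network that integrates molecular structural features and cell-line genomic profiles with drug-drug interactions to improve the prediction of synergistic effects. We introduce a new model named the Residual Graph Isomorphism Network integrated with an Attention mechanism (ResGIN-Att). The model first extracts multi-scale topological features of drug molecules using a residual graph isomorphism network, in which residual connections help mitigate over-smoothing in deep layers. An adaptive Long Short-Term Memory (LSTM) module then fuses structural information from local to global scales. Finally, a cross-attention module explicitly models drug-drug interactions and identifies key chemical substructures. Extensive experiments on five public benchmark datasets show that ResGIN-Att achieves competitive performance and compares favorably with key baseline methods, while exhibiting promising generalization capability and robustness.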

Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation

2604.21380v1 by Wang Shi Hai, Chen Tao

Since software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. Yet the vagueness of performance requirements and the uncertainty of human cognition make their interpretation highly ambiguous, rendering their automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation. IRAP differs from other approaches in that it explicitly draws on problem-specific knowledge to retrieve and reason about preferences, which also guides the progressive interaction with stakeholders while reducing cognitive overhead. Experimental results against 10 state-of-the-art methods on four real-world datasets demonstrate the superiority of IRAP in all cases, with up to 40x improvements in as few as five rounds of interaction.

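Summary: Because software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. However, the vagueness of performance requirements and the uncertainty of human cognition make their interpretation highly ambiguous, leaving their automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation. IRAP differs from other approaches in that it explicitly draws on problem-specific knowledge to retrieve and reason about preferences, which also guides the progressive interaction with stakeholders while reducing cognitive overhead. Experimental results against 10 state-of-the-art methods on four real-world datasets show the superiority of IRAP in all cases, with up to 40x improvements in as few as five rounds of interaction.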

ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs

2604.21357v1 by Jian Cui, Zhiyuan Ren, Desheng Weng, Yongqi Zhao, Gong Wenbin, Yu Lei, Zhenning Dong

This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.

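Summary: This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating coordinate prediction as a text generation problem, and introduces a Chain-of-Thought mechanism to strengthen the model's reasoning over spatial relationships. In addition, reinforcement learning with a distance-deviation-based reward is applied to optimize generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point prediction and effectively resolve vague relative location queries. The model also shows strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.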
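
The two mechanisms named in the ReaGeo abstract, turning coordinates into a geohash string and scoring predictions by distance deviation, can be sketched as follows. `geohash_encode` follows the standard public geohash scheme (interleaved longitude and latitude bits, base-32 alphabet). The reward shape `exp(-distance / scale_km)`, the `scale_km` parameter, and the function names are assumptions for illustration; the abstract does not specify the reward formula.

```python
import math

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=8):
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, nbits = [], 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | int(bit)
        nbits += 1
        use_lon = not use_lon
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred, true, scale_km=1.0):
    """Assumed reward: 1.0 for an exact hit, decaying exponentially with distance."""
    return math.exp(-haversine_km(pred[0], pred[1], true[0], true[1]) / scale_km)
```

Because each additional geohash character narrows the cell, a model that emits the string left to right refines its prediction from coarse to fine, which is one reason the encoding suits text generation.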

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

2604.21952v1 by Muhammad Shafique, Abdul Basit, Muhammad Abdullah Hanif, Alberto Marchisio, Rachmad Vidya Wicaksana Putra, Minghao Shao

This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution & stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.

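Summary: This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware-software co-design of transformer blocks with an optimization pipeline that reduces compute and memory requirements. During model development, fine-tuning provides domain-specific performance gains. The methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it compresses MFMs using hierarchy-aware mixed-precision quantization and structural pruning of transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to decide when to escalate to a larger model, co-optimization of sequence length, visual resolution, and stride, and graph-level operator fusion. To execute the model efficiently, the processing dataflow is optimized for the underlying hardware architecture and combined with memory-efficient attention to meet on-chip bandwidth and latency budgets. A specialized hardware accelerator for transformer workloads supports this, developed either through expert design or through an LLM-aided design approach. The methodology is shown to be effective on medical MFMs and code generation tasks, and the work concludes with extensions toward energy-efficient spiking MFMs.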
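The small-to-large cascade with lightweight self-tests mentioned in this abstract can be illustrated with a short sketch. This is not the authors' implementation: the `Stage` type, the score-threshold form of the self-test, and the toy models are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One model in the cascade (hypothetical structure for illustration)."""
    name: str
    # Returns (answer, self-test score in [0, 1]); stands in for a real model call.
    run: Callable[[str], Tuple[str, float]]
    # Minimum self-test score needed to accept this stage's answer.
    threshold: float


def cascade(query: str, stages: List[Stage]) -> Tuple[str, str]:
    """Route a query through stages ordered from small to large.

    Returns (answer, name of the stage that produced it). The last stage's
    answer is always accepted, since there is nothing larger to escalate to.
    """
    if not stages:
        raise ValueError("cascade needs at least one stage")
    for i, stage in enumerate(stages):
        answer, score = stage.run(query)
        is_last = i == len(stages) - 1
        if is_last or score >= stage.threshold:
            return answer, stage.name
    raise AssertionError("unreachable: the last stage always returns")


# Toy stand-ins: the small model only passes its self-test on short queries.
small = Stage("small", lambda q: ("small:" + q, 0.9 if len(q) < 10 else 0.2), 0.5)
large = Stage("large", lambda q: ("large:" + q, 1.0), 0.0)
```

Calling `cascade("hi", [small, large])` stays on the small stage, while a longer query fails the small stage's self-test and escalates. In a real system the self-test would be whatever cheap check the deployment trusts; the threshold is one assumed way to express the escalation decision.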

Can MLLMs "Read" What is Missing?

2604.21277v1 by Jindi Guo, Xi Fang, Chaozheng Huang

We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.

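Summary: We introduce MMTR-Bench, a benchmark for evaluating the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench removes explicit prompts and requires models to recover masked text from single-page or multi-page inputs drawn from real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following ability, allowing a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench contains 2,771 test samples covering multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark is highly challenging, especially for sentence-level and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.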

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

2604.21263v1 by Michael Bouzinier, Sergey Trifonov, Michael Chumack, Eugenia Lvova, Dmitry Etin

Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate not only accuracy but auditability. Existing formal languages for clinical logic validate syntactic and structural correctness but not whether decision rules use epistemologically appropriate evidence. Methods: Drawing on design-by-contract principles, we introduce meta-predicates -- predicates about predicates -- for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates assert which evidence types are permissible in any given rule. The framework is instantiated in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but that they rest on appropriate evidence in ways that can be independently audited. While demonstrated in genomics, the approach generalises to any domain requiring auditable decision logic.

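Summary: Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to show auditability as well as accuracy. Existing formal languages for clinical logic validate syntactic and structural correctness, but not whether decision rules use epistemologically appropriate evidence. Methods: Building on design-by-contract principles, we introduce meta-predicates (predicates about predicates) that impose epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates state which evidence types are permitted in any given rule. The framework is implemented in AnFiSA, an open-source platform for genetic variant curation, and demonstrated with the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, which yields a per-variant audit trail identifying the rule that classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: explanation reveals what evidence was used after the fact, whereas meta-predicates constrain what evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward showing not only that decisions are accurate but that they rest on appropriate evidence in a way that can be independently audited. Although demonstrated in genomics, the approach generalizes to any domain that requires auditable decision logic.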
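To make the idea of a meta-predicate concrete, the sketch below checks which annotations a rule is allowed to read, based on the four dimensions named in the abstract. It is an illustration only and not the AnFiSA DSL: the `Annotation` and `Rule` types, the catalog entries, and the `no_predictions` constraint are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Annotation:
    """Epistemological type of one annotation, along the four dimensions."""
    purpose: str
    knowledge_domain: str
    scale: str
    method: str  # method of acquisition


@dataclass
class Rule:
    """A decision rule and the names of the annotations it reads."""
    name: str
    uses: List[str]


# A meta-predicate is a predicate over the annotation types a rule relies on.
MetaPredicate = Callable[[Annotation], bool]


def violations(rule: Rule, catalog: Dict[str, Annotation],
               allowed: MetaPredicate) -> List[str]:
    """Return the annotations used by `rule` that the meta-predicate forbids."""
    return [name for name in rule.uses if not allowed(catalog[name])]


# Hypothetical catalog and constraint, invented for this example.
catalog = {
    "population_frequency": Annotation(
        "filtering", "population genetics", "cohort", "measured"),
    "in_silico_score": Annotation(
        "prioritisation", "protein function", "variant", "predicted"),
}


def no_predictions(annotation: Annotation) -> bool:
    """Example meta-predicate: only non-predicted evidence is permissible."""
    return annotation.method != "predicted"


rule = Rule("rare_variant_filter", ["population_frequency", "in_silico_score"])
```

Here `violations(rule, catalog, no_predictions)` reports `in_silico_score`, so the rule would be rejected before deployment regardless of who or what wrote it. The check runs over rule definitions, not over data, which is what distinguishes it from post-hoc explanation.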

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

2604.21255v1 by Chenghao Yang, Yuning Zhang, Zhoufutu Wen, Tao Gong, Jiaheng Liu, Qi Chu, Nenghai Yu

Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on $\tau$-Bench and $\tau^2$-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% $S_{\text{node}}$ and 94.7% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson $r$ = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.

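Summary: Model distillation is a major driver of the rapid progress of large language model (LLM) agents, but it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, which suggests they may be distilled echoes of a few dominant teachers. Existing metrics, however, cannot separate the mandatory behaviors required for task success from the non-mandatory patterns that reflect a model's own preferences. We propose two complementary metrics that isolate non-mandatory behavioral patterns: Response Pattern Similarity (RPS) for verbal alignment and Action Graph Similarity (AGS) for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on $\tau$-Bench and $\tau^2$-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 percentage points higher on AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6% $S_{\text{node}}$ and 94.7% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture different behavioral dimensions (Pearson $r$ = 0.491) and provide complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.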

Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation

2604.21253v1 by Hanwen Gu, Chao Guo, Junle Wang, Wenda Xie, Yisheng Lv

While LLMs demonstrate remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, often producing monotonous scripts with structural fractures. To this end, we introduce PLOTTER, a framework that performs narrative planning on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER executes the Evaluate-Plan-Revise cycle on the event graph and character graph. By diagnosing and repairing issues of the graph topology under rigorous logical constraints, the model optimizes the causality and narrative skeleton before complete context generation. Experiments demonstrate that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios. These findings verify that planning narratives on structural graph representations, rather than directly on text, is crucial for enhancing the long-context reasoning of LLMs in complex narrative generation.

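Summary: Although large language models show remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, and often produce monotonous scripts with structural fractures. We therefore propose PLOTTER, a framework that plans narratives on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER runs an Evaluate-Plan-Revise cycle on an event graph and a character graph. By diagnosing and repairing graph-topology problems under strict logical constraints, the model optimizes causality and the narrative skeleton before the full text is generated. Experiments show that PLOTTER significantly outperforms representative baselines across a range of narrative scenarios. These findings confirm that planning narratives on structural graph representations, rather than directly on text, is essential for strengthening the long-context reasoning of large language models in complex narrative generation.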

CAP: Controllable Alignment Prompting for Unlearning in LLMs

2604.21251v2 by Zhaokun Wang, Jinyu Guo, Jingwen Pu, Hongli Pu, Meng Yang, Xunlei Chen, Jie Ou, Wenyi Li, Guangchun Luo, Wenhong Tian

Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.

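Summary: Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, so selective knowledge unlearning is needed for regulatory compliance and ethical safety. Existing parameter-modifying methods, however, face fundamental limitations: high computational cost, uncontrollable forgetting boundaries, and strict dependence on access to model weights. These constraints make them impractical for closed-source models, while current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end, prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process driven by reinforcement learning, in which a prompt generator collaborates with the LLM to suppress target knowledge while selectively preserving general capabilities. This approach makes knowledge restoration reversible through prompt revocation. Extensive experiments show that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.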

EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

2604.21229v1 by Julian Acuna

Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.

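Summary: Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries covering factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), which isolates the effect of the memory architecture. GPT-4o full-context achieves the highest composite score (0.6186). Engrama scores 0.5367 overall but is the only system that beats full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is the cheapest but substantially weaker (0.4809). Ablations show that the components driving Engrama's cross-space advantage trade off against the overall composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.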

Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery

2604.21187v1 by Benjamin Przybocki, John Mackey, Marijn J. H. Heule, Bernardo Subercaseaux

Ramsey-good graphs are graphs that contain neither a clique of size $s$ nor an independent set of size $t$. We study doubly saturated Ramsey-good graphs, defined as Ramsey-good graphs in which the addition or removal of any edge necessarily creates an $s$-clique or a $t$-independent set. We present a method combining SAT solving with bespoke LLM-generated code to discover infinite families of such graphs, answering a question of Grinstead and Roberts from 1982. In addition, we use LLMs to generate and formalize correctness proofs in Lean. This case study highlights the potential of integrating automated reasoning, large language models, and formal verification to accelerate mathematical discovery. We argue that such tool-driven workflows will play an increasingly central role in experimental mathematics.

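Summary: Ramsey-good graphs are graphs that contain neither a clique of size $s$ nor an independent set of size $t$. We study doubly saturated Ramsey-good graphs, that is, Ramsey-good graphs in which adding or removing any edge necessarily creates an $s$-clique or a $t$-independent set. We present a method that combines SAT solving with bespoke LLM-generated code to discover infinite families of such graphs, answering a question posed by Grinstead and Roberts in 1982. We also use LLMs to generate and formalize correctness proofs in Lean. This case study highlights the potential of integrating automated reasoning, large language models, and formal verification to accelerate mathematical discovery. We argue that such tool-driven workflows will play an increasingly central role in experimental mathematics.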
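The definition of a doubly saturated Ramsey-good graph is concrete enough to check by brute force on small graphs. The sketch below follows the definition as stated in the abstract; it is an exhaustive checker for illustration, not the SAT-based search the authors describe, and the function names are our own.

```python
from itertools import combinations
from typing import FrozenSet, Set

Edge = FrozenSet[int]


def has_clique(n: int, edges: Set[Edge], k: int) -> bool:
    """True if some k of the vertices 0..n-1 are pairwise adjacent."""
    return any(
        all(frozenset(pair) in edges for pair in combinations(vs, 2))
        for vs in combinations(range(n), k)
    )


def has_independent_set(n: int, edges: Set[Edge], k: int) -> bool:
    """True if some k of the vertices 0..n-1 are pairwise non-adjacent."""
    return any(
        all(frozenset(pair) not in edges for pair in combinations(vs, 2))
        for vs in combinations(range(n), k)
    )


def is_ramsey_good(n: int, edges: Set[Edge], s: int, t: int) -> bool:
    """No clique of size s and no independent set of size t."""
    return not has_clique(n, edges, s) and not has_independent_set(n, edges, t)


def is_doubly_saturated(n: int, edges: Set[Edge], s: int, t: int) -> bool:
    """Ramsey-good, and flipping any single vertex pair destroys that property."""
    if not is_ramsey_good(n, edges, s, t):
        return False
    for pair in combinations(range(n), 2):
        e = frozenset(pair)
        # Remove the edge if present, otherwise add it.
        flipped = edges - {e} if e in edges else edges | {e}
        if is_ramsey_good(n, flipped, s, t):
            return False
    return True


# The 5-cycle: every added chord closes a triangle, and every removed
# edge leaves an independent set of size 3.
c5 = {frozenset((i, (i + 1) % 5)) for i in range(5)}
```

For $s = t = 3$ the 5-cycle passes the check, while a 3-vertex path does not, because removing one of its edges creates neither a triangle nor an independent set of size 3. The running time is exponential in $s$ and $t$, which is why larger instances call for a SAT solver.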

TAPO-Description Logic for Information Behavior: Refined OBoxes, Inference, and Categorical Semantics

2604.21172v1 by Takao Inoué

This paper develops a refined version of TAPO-description logic for the analysis of information behavior. The framework is treated not as a single homogeneous object logic, but as a layered formalism consisting of a static descriptive layer (TBox/ABox), a procedural layer (PBox), and an oracle-sensitive layer (OBox). To make this architecture mathematically explicit, we introduce a metalevel guard-judgment layer governing procedural branching and iteration. On this basis we formulate a core inference system for TAPO-description logic, covering static TBox/ABox reasoning, guarded procedural transition in the PBox, and validated external import in the OBox. We then give a categorical semantics for the resulting framework and indicate its sheaf-theoretic refinement. The theory is illustrated by examples of information-seeking behavior, including simple search behavior and review-sensitive ordering behavior in a curry restaurant. The aim is to treat not only static knowledge representation but also hesitation, external consultation, and action-guiding update within a unified logical setting.

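Summary: This paper develops a refined version of TAPO-description logic for the analysis of information behavior. The framework is treated not as a single homogeneous object logic but as a layered formalism consisting of a static descriptive layer (TBox/ABox), a procedural layer (PBox), and an oracle-sensitive layer (OBox). To make this architecture mathematically explicit, we introduce a metalevel guard-judgment layer that governs procedural branching and iteration. On this basis we formulate a core inference system for TAPO-description logic covering static TBox/ABox reasoning, guarded procedural transitions in the PBox, and validated external import in the OBox. We then give a categorical semantics for the resulting framework and indicate its sheaf-theoretic refinement. The theory is illustrated with examples of information-seeking behavior, including simple search behavior and review-sensitive ordering behavior in a curry restaurant. The aim is to treat not only static knowledge representation but also hesitation, external consultation, and action-guiding update within a unified logical setting.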

"This Wasn't Made for Me": Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias

2604.21148v2 by Siyu Liang, Alicia Beckford Wassink

Studies on bias in Automatic Speech Recognition (ASR) tend to focus on reporting error rates for speakers of underrepresented dialects, yet less research examines the human side of system bias: how do system failures shape users' lived experiences, how do users feel about and react to them, and what emotional toll do these repeated failures exact? We conducted user experience studies across four U.S. locations (Atlanta, Gulf Coast, Miami Beach, and Tucson) representing distinct English dialect communities. Our findings reveal that most participants report technologies fail to consider their cultural backgrounds and require constant adjustment to achieve basic functionality. Despite these experiences, participants maintain high expectations for ASR performance and express strong willingness to contribute to model improvement. Qualitative analysis of open-ended narratives exposes the deeper costs of these failures. Participants report frustration, annoyance, and feelings of inadequacy, yet the emotional impact extends beyond momentary reactions. Participants recognize that systems were not designed for them, yet often internalize failures as personal inadequacy despite this critical awareness. They perform extensive invisible labor, including code-switching, hyper-articulation, and emotional management, to make failing systems functional. Meanwhile, their linguistic and cultural knowledge remains unrecognized by technologies that encode particular varieties as standard while rendering others marginal. These findings demonstrate that algorithmic fairness assessments based on accuracy metrics alone miss critical dimensions of harm: the emotional labor of managing repeated technological rejection, the cognitive burden of constant self-monitoring, and the psychological toll of feeling inadequate in one's native language variety.

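Summary: Studies of bias in Automatic Speech Recognition (ASR) tend to focus on reporting error rates for speakers of underrepresented dialects, while less research examines the human side of system bias: how system failures shape users' lived experiences, how users feel about and react to them, and what emotional toll these repeated failures exact. We conducted user experience studies at four U.S. locations (Atlanta, Gulf Coast, Miami Beach, and Tucson) representing distinct English dialect communities. Our findings show that most participants report that the technologies fail to consider their cultural backgrounds and require constant adjustment to achieve basic functionality. Despite these experiences, participants maintain high expectations for ASR performance and express strong willingness to contribute to model improvement. Qualitative analysis of open-ended narratives reveals the deeper costs of these failures. Participants report frustration, annoyance, and feelings of inadequacy, and the emotional impact extends beyond momentary reactions. Participants recognize that the systems were not designed for them, yet often internalize failures as personal inadequacy despite this critical awareness. They perform extensive invisible labor, including code-switching, hyper-articulation, and emotional management, to make failing systems work. Meanwhile, their linguistic and cultural knowledge goes unrecognized by technologies that encode particular varieties as standard and render others marginal. These findings show that algorithmic fairness assessments based on accuracy metrics alone miss critical dimensions of harm: the emotional labor of managing repeated technological rejection, the cognitive burden of constant self-monitoring, and the psychological toll of feeling inadequate in one's native language variety.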

Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification

2604.21137v1 by Jiho Noh, Mukhesh Raghava Katragadda, Raymond Carl, Soon Lee

Analyzing the reasoning patterns of students in science classrooms is critical for understanding knowledge-construction mechanisms and improving instructional practice to maximize cognitive engagement, yet manual coding of classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions, Utterance Type (UT) and Reasoning Component (RC), both derived from our prior CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses including UTxRC co-occurrence profiling, Cognitive Complexity Index (CCI) computation per session, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results demonstrate that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.

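Summary: Analyzing students' reasoning patterns in science classrooms is essential for understanding knowledge-construction mechanisms and for improving instructional practice to maximize cognitive engagement, yet manually coding classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions, Utterance Type (UT) and Reasoning Component (RC), both derived from our earlier CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline reaches a macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we carry out discourse pattern analyses, including UTxRC co-occurrence profiling, a per-session Cognitive Complexity Index (CCI), lag-sequential analysis, and IRF chain analysis, which reveal that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results show that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.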

GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons

2604.21133v1 by Sebastian Walter, Hannah Bast

We present GRISP (Guided Recurrent IRI Selection over SPARQL Skeletons), a novel SPARQL-based question-answering method over knowledge graphs based on fine-tuning a small language model (SLM). Given a natural-language question, the method first uses the SLM to generate a natural-language SPARQL query skeleton, and then to re-rank and select knowledge graph items to iteratively replace the natural-language placeholders using knowledge graph constraints. The SLM is jointly trained on skeleton generation and list-wise re-ranking data generated from standard question-query pairs. We evaluate the method on common Wikidata and Freebase benchmarks, and achieve better results than other state-of-the-art methods in a comparable setting.

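Summary: We present GRISP (Guided Recurrent IRI Selection over SPARQL Skeletons), a SPARQL-based question-answering method over knowledge graphs that relies on fine-tuning a small language model (SLM). Given a natural-language question, the method first uses the SLM to generate a natural-language SPARQL query skeleton, then re-ranks and selects knowledge graph items to iteratively replace the natural-language placeholders under knowledge graph constraints. The SLM is trained jointly on skeleton generation and on list-wise re-ranking data generated from standard question-query pairs. We evaluate the method on common Wikidata and Freebase benchmarks and obtain better results than other state-of-the-art methods in a comparable setting.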

How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

2604.21106v1 by Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis

We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our $\varphi$ converts the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise $\varphi$ above $0.46$.

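Summary: We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, expressed in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs over recurrence counts $r \in \{1, 2, 4, 8\}$, covering roughly $50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ lies in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model but pays the training cost of a 1B non-looped one. In a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks cannot be resolved at our compute budgets. For any looped language model, our $\varphi$ turns the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how far they raise $\varphi$ above $0.46$.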
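The parameter term of the fitted law, $N_\text{once} + r^{\varphi} N_\text{rec}$, can be read as an effective parameter count. The sketch below evaluates it and the full loss expression. Only $\varphi = 0.46$ comes from the abstract: the constants $E$, $A$, $B$, $\alpha$, $\beta$ are not reported there, so they are left as required arguments, and the function names are our own.

```python
def effective_params(n_once: float, n_rec: float, r: int, phi: float = 0.46) -> float:
    """Effective unique-parameter count N_once + r**phi * N_rec.

    n_once: parameters applied once per forward pass.
    n_rec:  parameters in the block that is looped r times.
    phi:    recurrence-equivalence exponent (0.46 is the value in the abstract).
    """
    if r < 1:
        raise ValueError("recurrence count r must be at least 1")
    return n_once + (r ** phi) * n_rec


def predicted_loss(n_once: float, n_rec: float, r: int, d: float, *,
                   E: float, A: float, B: float, alpha: float, beta: float,
                   phi: float = 0.46) -> float:
    """L = E + A * N_eff**(-alpha) + B * D**(-beta).

    The fitted constants are not given in the abstract, so the caller
    must supply them; d is the number of training tokens D.
    """
    n_eff = effective_params(n_once, n_rec, r, phi)
    return E + A * n_eff ** (-alpha) + B * d ** (-beta)
```

With $\varphi = 1$ the looped block counts as $r$ unique blocks, with $\varphi = 0$ it counts once, and $\varphi = 0.46$ gives a multiplier of about 1.89 at $r = 4$. How a given model's parameters split into `n_once` and `n_rec` depends on its architecture and is not stated in the abstract.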

Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

2604.21102v1 by Siyuan Yao, Siavash Ghorbany, Kuangshi Ai, Arnav Cherukuthota, Meghan Forstchen, Alexis Korotasz, Matthew Sisk, Ming Hu, Chaoli Wang

We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.

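Summary: We present a novel framework that uses large language models (LLMs) and Google Street View (GSV) imagery to automatically evaluate building conditions nationwide in the United States. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach aligns strongly with human mean opinion scores (MOS) and outperforms even individual raters on SRCC and PLCC relative to the MOS benchmark. To improve efficiency, we apply knowledge distillation to transfer the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model, which achieves comparable performance with a 3x speedup. We further distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), which deliver close performance with a 30x speed gain. We also investigate, through a human-AI alignment study, how well LLMs can assess an extensive list of built environment and housing attributes, and we develop a visualization dashboard that integrates the LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, achieving high accuracy with minimal human labeling effort.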

TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping

2604.21057v1 by Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher

The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to the generation of correct answers, is largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inference. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA). We achieve 20 to 50% token reduction while maintaining accuracy comparable to standard generation.

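Summary: The field of Language Reasoning Models (LRMs) has been very active in recent years, with advances in training and inference techniques allowing LRMs to reason longer and more accurately. A growing body of studies nevertheless shows that LRMs remain inefficient and over-generate verification and reflection steps. In addition, the high-level role of each reasoning step, and how different step types contribute to producing correct answers, is still largely unexplored. To address this, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large language model inference. Using this framework, we monitor reasoning behavior during inference and find that LRMs tend to change their reasoning behavior after reaching a correct answer. We show that monitoring specific step types yields effective, interpretable early-stopping criteria. We evaluate TRACES on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA), achieving a 20 to 50% token reduction while keeping accuracy comparable to standard generation.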
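The idea of stopping generation once step tags indicate that the model has already answered and is only re-verifying can be sketched as follows. The abstract does not specify the tag set, the tagger, or the stopping rule, so everything here is an assumption for illustration: the keyword tagger `tag_step`, the three tags, and the `patience` rule are hypothetical and are not the TRACES criteria.

```python
from typing import Iterable, List


def tag_step(step: str) -> str:
    """Toy keyword tagger; a stand-in for whatever tagger a real system uses."""
    text = step.lower()
    if "answer is" in text:
        return "answer"
    if text.startswith(("wait", "let me check", "double-check")):
        return "verification"
    return "derivation"


def generate_with_early_stop(steps: Iterable[str], patience: int = 2) -> List[str]:
    """Consume reasoning steps as they stream in and stop early.

    Stops once an answer has been stated and `patience` consecutive
    verification steps have followed it. Returns the steps that were kept.
    """
    kept: List[str] = []
    answered = False
    streak = 0  # consecutive verification steps seen after an answer
    for step in steps:
        kept.append(step)
        tag = tag_step(step)
        if tag == "answer":
            answered, streak = True, 0
        elif tag == "verification" and answered:
            streak += 1
            if streak >= patience:
                break
        else:
            streak = 0
    return kept


trace = [
    "Compute 2 + 3.",
    "The answer is 5.",
    "Wait, is that right?",
    "Let me check again.",
    "Wait, one more time.",
    "The answer is 5.",
]
```

On `trace`, generation is cut after the fourth step, so the last two steps are never produced. Because `steps` can be any iterable, the same function works on a generator that yields steps from a live decoding loop.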

The Last Harness You'll Ever Build

2604.21003v1 by Haebin Seong, Li Yin, Haoran Zhang

AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge. Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that automates this process. At the first level, the Harness Evolution Loop optimizes a worker agent's harness $\mathcal{H}$ for a single task: a Worker Agent $W_{\mathcal{H}}$ executes the task, an Evaluator Agent $V$ adversarially diagnoses failures and scores performance, and an Evolution Agent $E$ modifies the harness based on the full history of prior attempts. At the second level, the Meta-Evolution Loop optimizes the evolution protocol $\Lambda = (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ itself across diverse tasks, learning a protocol $\Lambda^{(\text{best})}$ that enables rapid harness convergence on any new task -- so that adapting an agent to a novel domain requires no human harness engineering at all. We formalize the correspondence to meta-learning and present both algorithms. The framework shifts manual harness engineering into automated harness engineering, and takes one step further -- automating the design of the automation itself.

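Summary: AI agents are increasingly deployed on complex, domain-specific workflows: navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge.
Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective.
We present a two-level framework that automates this process.
At the first level, the Harness Evolution Loop optimizes a worker agent's harness $\mathcal{H}$ for a single task: a Worker Agent $W_{\mathcal{H}}$ executes the task, an Evaluator Agent $V$ adversarially diagnoses failures and scores performance, and an Evolution Agent $E$ modifies the harness based on the full history of prior attempts.
At the second level, the Meta-Evolution Loop optimizes the evolution protocol $\Lambda = (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ itself across diverse tasks, learning a protocol $\Lambda^{(\text{best})}$ that enables rapid harness convergence on any new task, so that adapting an agent to a novel domain requires no human harness engineering at all.
We formalize the correspondence to meta-learning and present both algorithms.
The framework turns manual harness engineering into automated harness engineering and goes one step further by automating the design of the automation itself.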
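The first-level loop described above (worker executes, evaluator scores and diagnoses, evolution agent revises the harness from the full history) has a simple control structure, sketched below. This is an assumed skeleton and not the authors' algorithm: the agents are plain callables, the harness is a single integer, and the toy task is invented for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

H = TypeVar("H")  # harness type (prompts, tools, and so on in a real system)
Attempt = Tuple[H, float, str]  # (harness, score, diagnosis)


def evolve_harness(
    h0: H,
    worker: Callable[[H], object],
    evaluator: Callable[[object], Tuple[float, str]],
    evolver: Callable[[List[Attempt]], H],
    rounds: int,
) -> Tuple[H, float]:
    """Run the execute / evaluate / evolve loop and return the best harness seen."""
    history: List[Attempt] = []
    best_h, best_score = h0, float("-inf")
    h = h0
    for _ in range(rounds):
        output = worker(h)                      # worker executes the task under h
        score, diagnosis = evaluator(output)    # evaluator scores and diagnoses
        history.append((h, score, diagnosis))
        if score > best_score:
            best_h, best_score = h, score
        h = evolver(history)                    # evolver sees the full history
    return best_h, best_score


# Toy task: the "harness" is one integer setting and the ideal value is 3.
def toy_worker(h: int) -> int:
    return h


def toy_evaluator(output: int) -> Tuple[float, str]:
    diagnosis = "too low" if output < 3 else "too high" if output > 3 else "ok"
    return -(output - 3) ** 2, diagnosis


def toy_evolver(history: List[Attempt]) -> int:
    last_h, _, diagnosis = history[-1]
    return last_h + {"too low": 1, "too high": -1, "ok": 0}[diagnosis]
```

Running `evolve_harness(0, toy_worker, toy_evaluator, toy_evolver, rounds=6)` walks the setting from 0 up to 3 and then holds it there. The second-level loop in the abstract would wrap this function and search over the protocol itself, which is beyond what this sketch attempts.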

FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels

2604.20825v1 by Sina Gholami, Abdulmoneam Ali, Tania Haghighi, Ahmed Arafa, Minhaj Nur Alam

Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.

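Summary: Federated learning (FL) enables collaborative model training without sharing raw data, but noisy labels across distributed clients can severely degrade learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Unlike existing approaches, which mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method uses the spectral structure of client feature representations to identify and mitigate label noise.
Our framework has three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces, with minimal communication overhead. Second, clean clients provide spectral references that let noisy clients relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we adopt a noise-aware training strategy that combines logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks show that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.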

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

2604.20795v1 by Pavel Salovskii, Iuliia Gorshkova

This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.

RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering

2604.20738v1 by Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor

We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa 3 and o3-mini are more task-dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA

COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

2604.20720v1 by Noah Flynn

Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To address this, we introduce COMPASS (COntinual Multilingual PEFT with Adaptive Semantic Sampling), a novel data-centric framework for adapting LLMs to target languages. COMPASS leverages parameter-efficient fine-tuning (PEFT) by training lightweight, language-specific adapters on a judiciously selected subset of auxiliary multilingual data. The core of our method is a distribution-aware sampling strategy that uses multilingual embeddings and clustering to identify semantic gaps between existing training data and a target usage distribution. By prioritizing auxiliary data from under-represented semantic clusters, COMPASS maximizes positive cross-lingual transfer while minimizing interference. We extend this into a continual learning framework, COMPASS-ECDA, which monitors for data distribution shifts in production and dynamically updates adapters to prevent model staleness, balancing adaptation to new data with the preservation of existing knowledge. Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.
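
The distribution-aware sampling idea can be sketched as follows, assuming the embedding and clustering step has already assigned a cluster id to every example. The function names, the gap definition (target share minus training share, floored at zero), and the toy data are illustrative assumptions, not the paper's implementation.

```python
import random
from collections import Counter

def cluster_gaps(train_clusters, target_clusters):
    """Per-cluster gap: share in the target distribution minus share in training data."""
    train_counts, target_counts = Counter(train_clusters), Counter(target_clusters)
    gaps = {}
    for c in set(train_counts) | set(target_counts):
        train_share = train_counts[c] / len(train_clusters)
        target_share = target_counts[c] / len(target_clusters)
        gaps[c] = max(0.0, target_share - train_share)  # keep only under-represented clusters
    return gaps

def sample_auxiliary(aux_pool, gaps, k, seed=0):
    """Draw k auxiliary examples (with replacement), weighting each by its cluster's gap."""
    weights = [gaps.get(cluster, 0.0) for _, cluster in aux_pool]
    if sum(weights) == 0:
        return []  # nothing is under-represented
    rng = random.Random(seed)
    return rng.choices(aux_pool, weights=weights, k=k)

train = [0, 0, 0, 0, 1, 1, 1, 2]      # cluster 2 is rare in the training data
target = [0, 0, 1, 1, 2, 2, 2, 2]     # but common in the target usage distribution
gaps = cluster_gaps(train, target)

aux_pool = [("aux-a", 0), ("aux-b", 1), ("aux-c", 2), ("aux-d", 2)]
picked = sample_auxiliary(aux_pool, gaps, k=5)
```

With these toy inputs only cluster 2 has a positive gap, so every sampled auxiliary example comes from that cluster.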

Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization

2604.20714v1 by Shan He, Runze Wang, Zhuoyun Du, Huiyu Bai, Zouying Cao, Yu Cheng, Bo Zheng

Designing and optimizing multi-agent systems (MAS) is a complex, labor-intensive process of "Agent Engineering." Existing automatic optimization methods, primarily focused on flat prompt tuning, lack the structural awareness to debug the intricate web of interactions in MAS. More critically, these optimizers are static; they do not learn from experience to improve their own optimization strategies. To address these gaps, we introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi-agent system to learn to evolve. TPGO first models the MAS as a Textual Parameter Graph (TPG), where agents, tools, and workflows are modular, optimizable nodes. To guide evolution, we derive "textual gradients," structured natural language feedback from execution traces, to pinpoint failures and suggest granular modifications. The core of our framework is Group Relative Agent Optimization (GRAO), a novel meta-learning strategy that learns from historical optimization experiences. By analyzing past successes and failures, GRAO becomes progressively better at proposing effective updates, allowing the system to learn how to optimize itself. Extensive experiments on complex benchmarks like GAIA and MCP-Universe show that TPGO significantly enhances the performance of state-of-the-art agent frameworks, achieving higher success rates through automated, self-improving optimization.

StormNet: Improving storm surge predictions with a GNN-based spatio-temporal offset forecasting model

2604.20688v2 by Noujoud Nader, Stefanos Giaremis, Clint Dawson, Carola Kaiser, Karame Mohammadiporshokooh, Hartmut Kaiser

Storm surge forecasting remains a critical challenge in mitigating the impacts of tropical cyclones on coastal regions, particularly given recent trends of rapid intensification and increasing nearshore storm activity. Traditional high-fidelity numerical models such as ADCIRC, while robust, are often hindered by inevitable uncertainties arising from various sources. To address these challenges, this study introduces StormNet, a spatio-temporal graph neural network (GNN) designed for bias correction of storm surge forecasts. StormNet integrates graph convolutional (GCN) and graph attention (GAT) mechanisms with long short-term memory (LSTM) components to capture complex spatial and temporal dependencies among water-level gauge stations. The model was trained using historical hurricane data from the U.S. Gulf Coast and evaluated on Hurricane Idalia (2023). Results demonstrate that StormNet can effectively reduce the root mean square error (RMSE) in water-level predictions by more than 70% for 48-hour forecasts and above 50% for 72-hour forecasts, as well as outperform a sequential LSTM baseline, particularly for longer prediction horizons. The model also exhibits low training time, enhancing its applicability in real-time operational forecasting systems. Overall, StormNet provides a computationally efficient and physically meaningful framework for improving storm surge prediction accuracy and reliability during extreme weather events.
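
To make the reported metric concrete, the sketch below shows how an RMSE reduction from offset-based bias correction would be computed. The water levels and the constant `learned_offset` are made-up values standing in for a model's predicted offsets; they are not taken from the paper.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

# Illustrative water levels in metres (made-up values, not from the paper).
observed       = [1.0, 1.5, 2.0, 2.5]
raw_forecast   = [1.4, 1.9, 2.4, 2.9]      # numerical model output with a +0.4 m bias
learned_offset = [-0.3, -0.3, -0.3, -0.3]  # stand-in for the offset a model would predict

# Bias correction: add the predicted offset to the raw forecast.
corrected = [f + d for f, d in zip(raw_forecast, learned_offset)]

# Percentage reduction in RMSE relative to the uncorrected forecast.
reduction = 100 * (1 - rmse(corrected, observed) / rmse(raw_forecast, observed))
```

Here the residual error drops from 0.4 m to 0.1 m, which corresponds to a reduction of about 75%.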

ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

2604.20666v1 by Ioannis E. Livieris, Athanasios Koursaris, Alexandra Apostolopoulou, Konstantinos Kanaris, Dimitris Tsakalidis, George Domalis

Effective retrieval-augmented generation in bilingual Greek-English applications requires embedding models capable of capturing both domain-specific semantic relationships and cross-lingual semantic alignment. Existing multilingual embedding models distribute their representational capacity across numerous languages, limiting their optimization for Greek and failing to encode the morphological complexity and domain-specific terminological structures inherent in Greek text. In this work, we propose ORPHEAS, a specialized Greek-English embedding model for bilingual retrieval-augmented generation. ORPHEAS is trained on a high-quality dataset generated by a knowledge graph-based fine-tuning methodology applied to a diverse multi-domain corpus, which enables language-agnostic semantic representations. Numerical experiments across monolingual and cross-lingual retrieval benchmarks show that ORPHEAS outperforms state-of-the-art multilingual embedding models, demonstrating that domain-specialized fine-tuning on morphologically complex languages does not compromise cross-lingual retrieval capability.

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

2604.20665v1 by Karan Goyal, Dikshant Kukreja

The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

2604.20623v1 by Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge

2604.20598v1 by Naizhong Xu

Modern retrieval-augmented generation (RAG) systems treat vector embeddings as static, context-free artifacts: an embedding has no notion of when it was created, how trustworthy its source is, or which other embeddings depend on it. This flattening of knowledge has a measurable cost: recent work on VersionRAG reports that conventional RAG achieves only 58% accuracy on versioned technical queries, because retrieval returns semantically similar but temporally invalid content. We propose SmartVector, a framework that augments dense embeddings with three explicit properties -- temporal awareness, confidence decay, and relational awareness -- and a five-stage lifecycle modeled on hippocampal-neocortical memory consolidation. A retrieval pipeline replaces pure cosine similarity with a four-signal score that mixes semantic relevance, temporal validity, live confidence, and graph-relational importance. A background consolidation agent detects contradictions, builds dependency edges, and propagates updates along those edges as graph-neural-network-style messages. Confidence is governed by a closed-form function combining an Ebbinghaus-style exponential decay, user-feedback reconsolidation, and logarithmic access reinforcement. We formalize the model, relate it to temporal knowledge graph embedding, agentic memory architectures, and uncertainty-aware RAG, and present a reference implementation. On a reproducible synthetic versioned-policy benchmark of 258 vectors and 138 queries, SmartVector roughly doubles top-1 accuracy over plain cosine RAG (62.0% vs. 31.0% on a held-out split), drops stale-answer rate from 35.0% to 13.3%, cuts Expected Calibration Error by nearly 2x (0.244 vs. 0.470), reduces re-embedding cost per single-word edit by 77%, and is robust across contradiction-injection rates from 0% to 75%.
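
The abstract names the ingredients of the confidence function and the four-signal score but not their exact form. The sketch below is one plausible reading under stated assumptions: the additive combination, the clamp to [0, 1], every parameter value, and the score weights are hypothetical and chosen only for illustration.

```python
import math

def live_confidence(base, age_days, feedback=0.0, accesses=0,
                    decay_rate=0.01, feedback_gain=0.1, access_gain=0.02):
    """Assumed form: exponential decay + feedback term + log access bonus, clamped to [0, 1]."""
    decayed = base * math.exp(-decay_rate * age_days)          # Ebbinghaus-style forgetting
    value = decayed + feedback_gain * feedback + access_gain * math.log1p(accesses)
    return min(1.0, max(0.0, value))

def retrieval_score(semantic, temporal, confidence, relational,
                    weights=(0.5, 0.2, 0.2, 0.1)):
    """Weighted mix of the four signals; the weights are illustrative, not the paper's."""
    signals = (semantic, temporal, confidence, relational)
    return sum(w * s for w, s in zip(weights, signals))

fresh = live_confidence(base=0.9, age_days=0)
stale = live_confidence(base=0.9, age_days=365)

# A stale policy version is semantically closer to the query but temporally invalid.
old_version = retrieval_score(semantic=0.95, temporal=0.0, confidence=stale, relational=0.5)
new_version = retrieval_score(semantic=0.90, temporal=1.0, confidence=fresh, relational=0.5)
```

Under pure cosine similarity the old version would rank first (0.95 vs. 0.90); with the mixed score the current version wins, which is the failure mode the abstract describes for versioned queries.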

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

2604.20572v1 by Yuxuan Cai, Jie Zhou, Qin Chen, Liang He

Online lifelong learning enables agents to accumulate experience across interactions and continually improve on long-horizon tasks. However, existing methods typically treat retrieval from past experience as a passive operation, triggering it only at task initialization or after completing a step. Consequently, agents often fail to identify knowledge gaps during interaction and proactively retrieve the most useful experience for the current decision. To address this limitation, we present ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured experience base. We first introduce Experience-Enhanced Online Evolution (ExpOnEvo), which enables continual improvement through both policy updates and memory refinement. The experience base organizes historical interactions into typed repositories, including factual memory, episodic memory, and behavioral skills, so that retrieval can provide both relevant evidence and actionable guidance. On top of this, we propose Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve via paired-branch process rewards. By comparing continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level supervision for retrieval decisions, encouraging retrieval only when it leads to better task outcomes or higher efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently improves lifelong agent performance, achieving success rates of 73.50% on SciWorld and 71.28% on AlfWorld while substantially reducing retrieval overhead, and attains performance competitive with proprietary models on StuLife.
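
The paired-branch comparison can be sketched as a reward computed from two rollouts that share the same interaction prefix. The `Rollout` type, the step penalty, and the fixed retrieval cost are hypothetical; the abstract does not give the actual reward definition.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    success: bool   # did the continuation solve the task?
    steps: int      # how many steps it took after the shared prefix

def rollout_value(rollout, step_penalty=0.01):
    """Task outcome minus a small per-step penalty, so shorter successes score higher."""
    return (1.0 if rollout.success else 0.0) - step_penalty * rollout.steps

def paired_branch_reward(with_retrieval, without_retrieval, retrieval_cost=0.05):
    """Step-level reward for choosing 'retrieve' at one decision point (assumed form)."""
    gain = rollout_value(with_retrieval) - rollout_value(without_retrieval)
    return gain - retrieval_cost

# Retrieval turns a failure into a success: positive reward.
helpful = paired_branch_reward(Rollout(True, 10), Rollout(False, 20))

# Both branches succeed equally fast: retrieval only adds cost, so the reward is negative.
redundant = paired_branch_reward(Rollout(True, 10), Rollout(True, 10))
```

A signal of this shape rewards retrieval only when it changes the outcome or shortens the trajectory by more than it costs.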

LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures

2604.20556v1 by Yuhang Wu, Qinyuan Liu, Qiuyang Zhao, Qingwei Chong

Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture-agnostic end-to-end analysis framework compatible with any LLM architecture. By extracting hidden states layer-by-layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model's task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen-Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger-parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task-effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.
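
The two definitions above can be sketched directly, assuming the per-layer vocabulary distributions have already been extracted. The `jump` threshold used to decide that a probability "rises significantly" and the toy four-layer data are assumptions; only the Jensen-Shannon formula itself is standard.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def find_task_particle(target_probs, jump=0.2):
    """First layer whose target-token probability rises by at least `jump` (assumed criterion)."""
    for layer in range(1, len(target_probs)):
        if target_probs[layer] - target_probs[layer - 1] >= jump:
            return layer
    return None

def find_vulnerable_layer(clean_dists, perturbed_dists):
    """Layer with the largest JS divergence between clean and perturbed outputs."""
    scores = [js_divergence(p, q) for p, q in zip(clean_dists, perturbed_dists)]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy per-layer data for a 4-layer model over a 3-token vocabulary (made up).
target_probs = [0.05, 0.06, 0.08, 0.45]
clean     = [[0.4, 0.3, 0.3], [0.5, 0.3, 0.2], [0.6, 0.2, 0.2], [0.8, 0.1, 0.1]]
perturbed = [[0.4, 0.3, 0.3], [0.1, 0.2, 0.7], [0.5, 0.3, 0.2], [0.7, 0.2, 0.1]]

particle_layer = find_task_particle(target_probs)
vulnerable_layer = find_vulnerable_layer(clean, perturbed)
```

In this toy example the task particle is the last layer (index 3), and the layer at index 1 is the most sensitive to the perturbation.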

Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies

2604.20548v1 by Shuai Chen, Chengzhi Zhang

Scientific progress depends on the continual generation of innovative research ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)-based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi-agent iterative planning search strategy inspired by combinatorial innovation theory. The framework combines iterative knowledge search with an LLM-based multi-agent system to generate, evaluate, and refine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state-of-the-art baselines in both diversity and novelty. Further comparison with ideas derived from top-tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high-quality research idea generation. The source code and dataset used in this paper are publicly available in a GitHub repository: https://github.com/ChenShuai00/MAGenIdeas. The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.

Effects of Cross-lingual Evidence in Multilingual Medical Question Answering

2604.20531v1 by Anar Yeginbergen, Maite Oronoz, Rodrigo Agerri

This paper investigates Multilingual Medical Question Answering across high-resource (English, Spanish, French, Italian) and low-resource (Basque, Kazakh) languages. We evaluate three types of external evidence sources across models of varying size: curated repositories of specialized medical knowledge, web-retrieved content, and explanations from the LLMs' parametric knowledge. Moreover, we conduct experiments with multilingual, monolingual and cross-lingual retrieval. Our results demonstrate that larger models consistently achieve superior performance in English across baseline evaluations. When incorporating external knowledge, web-retrieved data in English proves most beneficial for high-resource languages. Conversely, for low-resource languages, the most effective strategy combines retrieval in both English and the target language, achieving accuracy comparable to high-resource language results. These findings challenge the assumption that external knowledge systematically improves performance and reveal that effective strategies depend on both the source of language resources and on model scale. Furthermore, specialized medical knowledge sources such as PubMed are limited: while they provide authoritative expert knowledge, they lack adequate multilingual coverage.

Knowledge Capsules: Structured Nonparametric Memory Units for LLMs

2604.20487v2 by Bin Ju, Shenfeng Weng, Danying Zhou, Rongkai Xu, Kunkai Su

Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long-context and multi-hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key Value Injection (KVI) framework that compiles capsules into attention-compatible key-value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory-level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long-context and multi-hop reasoning, while requiring no parameter updates.
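
The difference between appending text and injecting key-value pairs can be sketched with a single-query attention step in plain Python. The toy 2-d vectors and the single "capsule" key/value pair are made up, and concatenating external keys and values to the context ones is an assumed simplification of the KVI idea, not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Keys/values computed from the prompt tokens (toy 2-d vectors).
context_keys   = [[1.0, 0.0], [0.0, 1.0]]
context_values = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical capsule compiled into one key/value pair outside the prompt.
capsule_keys   = [[3.0, 3.0]]
capsule_values = [[5.0, 5.0]]

query = [1.0, 1.0]
# The capsule takes part in attention directly instead of being appended as text.
output, weights = attend(query, context_keys + capsule_keys,
                         context_values + capsule_values)
```

Because the capsule key aligns strongly with the query, it receives most of the attention mass without consuming any prompt tokens.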

Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks

2604.20932v1 by Pranav Pallerla, Wilson Naik Bhukya, Bharath Vemula, Charan Ramtej Kodi

Retrieval-augmented generation (RAG) systems are increasingly deployed in sensitive domains such as healthcare and law, where they rely on private, domain-specific knowledge. This capability introduces significant security risks, including membership inference, data poisoning, and unintended content leakage. A straightforward mitigation is to enable all relevant defenses simultaneously, but doing so incurs a substantial utility cost. In our experiments, an always-on defense stack reduces contextual recall by more than 40%, indicating that retrieval degradation is the primary failure mode. To mitigate this trade-off in RAG systems, we propose the Sentinel-Strategist architecture, a context-aware framework for risk analysis and defense selection. A Sentinel detects anomalous retrieval behavior, after which a Strategist selectively deploys only the defenses warranted by the query context. Evaluated across three benchmark datasets and five orchestration models, this Adaptive Defense Orchestration (ADO) approach is shown to eliminate MBA-style membership inference leakage while substantially recovering retrieval utility relative to a fully static defense stack, approaching undefended baseline levels. Under data poisoning, the strongest ADO variants reduce attack success to near zero while restoring contextual recall to more than 75% of the undefended baseline, although robustness remains sensitive to model choice. Overall, these findings show that adaptive, query-aware defense can substantially reduce the security-utility trade-off in RAG systems.

HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

2604.20452v1 by Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu

Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
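
The accept-or-fall-back control flow can be sketched as below. This is a deliberate simplification: it swaps in a plain cosine-similarity threshold over cached query embeddings in place of the paper's homologous query re-identification, and it reuses cached results rather than searching a restricted scope. The class name, the threshold value, and the toy embeddings are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SpeculativeRetriever:
    def __init__(self, full_retrieve, threshold=0.95):
        self.full_retrieve = full_retrieve   # slow full-database search (callable)
        self.threshold = threshold           # assumed acceptance threshold
        self.history = []                    # (query embedding, retrieved documents)
        self.full_calls = 0

    def retrieve(self, embedding):
        # Speculative step: find the most similar previously seen query, if any.
        best = max(self.history, key=lambda item: cosine(embedding, item[0]), default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]                   # draft accepted, full search skipped
        # Draft rejected (or no history yet): fall back to the full database.
        self.full_calls += 1
        docs = self.full_retrieve(embedding)
        self.history.append((embedding, docs))
        return docs

retriever = SpeculativeRetriever(full_retrieve=lambda emb: [f"doc-for-{emb}"])
first = retriever.retrieve([1.0, 0.0])
second = retriever.retrieve([0.99, 0.05])   # near-duplicate of the first query
third = retriever.retrieve([0.0, 1.0])      # unrelated query
```

Only the first and third queries reach the full database; the second reuses the earlier result, which is where the latency saving would come from.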

Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness

2604.20413v1 by Fulong Fan, Peilin Liu, Fengzhe Liu, Shuyan Yang, Gang Yan

Large language models perform well on many reasoning tasks, yet they often lack awareness of whether their current knowledge or reasoning state is complete. In non-interactive puzzle settings, the narrative is fixed and the underlying structure is hidden; once a model forms an early hypothesis under incomplete premises, it can propagate that error throughout the reasoning process, leading to unstable conclusions. To address this issue, we propose SABA, a reasoning framework that explicitly introduces self-awareness of missing premises before making the final decision. SABA formulates reasoning as a recursive process that alternates between structured state construction and obstacle resolution: it first applies Information Fusion to consolidate the narrative into a verifiable base state, and then uses Query-driven Structured Reasoning to identify and resolve missing or underspecified premises by turning them into queries and progressively completing the reasoning state through hypothesis construction and state refinement. Across multiple evaluation metrics, SABA achieves the best performance on all three difficulty splits of the non-interactive Detective Puzzle benchmark, and it also maintains leading results on multiple public benchmarks.

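Summary: Large language models perform well on many reasoning tasks, but they often lack awareness of whether their current knowledge or reasoning state is complete. In non-interactive puzzle settings, the narrative is fixed and the underlying structure is hidden; once a model forms an early hypothesis under incomplete premises, it can propagate that error through the whole reasoning process and reach unstable conclusions. To address this, we propose SABA, a reasoning framework that explicitly introduces self-awareness of missing premises before the final decision is made. SABA formulates reasoning as a recursive process that alternates between structured state construction and obstacle resolution: it first applies Information Fusion to consolidate the narrative into a verifiable base state, then uses Query-driven Structured Reasoning to identify and resolve missing or underspecified premises by turning them into queries and progressively completing the reasoning state through hypothesis construction and state refinement. Across multiple evaluation metrics, SABA achieves the best performance on all three difficulty splits of the non-interactive Detective Puzzle benchmark and also maintains leading results on multiple public benchmarks.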

CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge

2604.20389v1 by Gustav Keppler, Ghada Elbez, Veit Hagenmeyer

The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry-recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information Technology cybersecurity and more specialized areas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer-Verifier framework, a methodology to generate interpretable, natural language explanations for model performance. Our evaluation shows that frontier models achieve human expert level in general networking and IT security knowledge. However, their accuracy declines on questions that require vendor-specific nuances or knowledge of formal standards such as IEC 62443. Analysis of model scaling trends and release dates demonstrates remarkable gains in parameter efficiency, while recent larger models show diminishing returns. Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.

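Summary: The rapid development and use of large language models (LLMs) in professional workflows calls for evaluating their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of multiple-choice question answering (MCQA) benchmarks derived from industry-recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of information technology cybersecurity and of more specialized areas such as operational technology and related cybersecurity standards. We also propose and validate a novel Proposer-Verifier framework, a methodology for generating interpretable natural language explanations of model performance. Our evaluation shows that frontier models reach human expert level in general networking and IT security knowledge. Their accuracy drops, however, on questions that require vendor-specific nuances or knowledge of formal standards such as IEC 62443. An analysis of model scaling trends and release dates shows marked gains in parameter efficiency, while recent larger models show diminishing returns. Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.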

Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs

2604.20382v1 by Aishik Mandal, Hiba Arnaout, Clarissa W. Ong, Juliet Bockhorst, Kate Sheehan, Rachael Moldow, Tanmoy Chakraborty, Iryna Gurevych

Rising demand for mental health support has increased interest in using Large Language Models (LLMs) for counseling. However, adapting LLMs to this high-risk, safety-critical domain is hindered by the scarcity of real-world counseling data due to privacy constraints. Synthetic datasets provide a promising alternative, but existing approaches often rely on unstructured or semi-structured text inputs and overlook structural dependencies between a client's cognitive, emotional, and behavioral states, often producing psychologically inconsistent interactions and reducing data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs) that encode relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel employs a structured prompting pipeline guided by counselor strategies and CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi-Agent Feedback (Li et al., 2025a). Graph2Counsel produces 760 sessions from 76 CPGs across diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter-annotator agreement (Krippendorff's α = 0.70). Fine-tuning an open-source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also make our code and data public.

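Summary: Growing demand for mental health support has raised interest in using large language models (LLMs) for counseling. Adapting LLMs to this high-risk, safety-critical domain is hindered, however, by the scarcity of real-world counseling data caused by privacy constraints. Synthetic datasets offer a promising alternative, but existing approaches usually rely on unstructured or semi-structured text inputs and overlook the structural dependencies among a client's cognitive, emotional, and behavioral states, which often produces psychologically inconsistent interactions and lowers data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs), which encode the relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel uses a structured prompting pipeline guided by counselor strategies and the CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi-Agent Feedback (Li et al., 2025a). Graph2Counsel generates 760 sessions from 76 CPGs covering diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter-annotator agreement (Krippendorff's α = 0.70). Fine-tuning an open-source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also release our code and data publicly.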

Domain-Aware Hierarchical Contrastive Learning for Semi-Supervised Generalization Fault Diagnosis

2604.20928v1 by Junyu Ren, Wensheng Gan, Philip S Yu

Fault diagnosis under unseen operating conditions remains highly challenging when labeled data are scarce. Semi-supervised domain generalization fault diagnosis (SSDGFD) provides a practical solution by jointly exploiting labeled and unlabeled source domains. However, existing methods still suffer from two coupled limitations. First, pseudo-labels for unlabeled domains are typically generated primarily from knowledge learned on the labeled source domain, which neglects domain-specific geometric discrepancies and thus induces systematic cross-domain pseudo-label bias. Second, unlabeled samples are commonly handled with a hard accept-or-discard strategy, where rigid thresholding causes imbalanced sample utilization across domains, while hard-label assignment for uncertain samples can easily introduce additional noise. To address these issues, we propose a unified framework termed domain-aware hierarchical contrastive learning (DAHCL) for SSDGFD. Specifically, DAHCL introduces a domain-aware learning (DAL) module to explicitly capture source-domain geometric characteristics and calibrate pseudo-label predictions across heterogeneous source domains, thereby mitigating cross-domain bias in pseudo-label generation. In addition, DAHCL develops a hierarchical contrastive learning (HCL) module that combines dynamic confidence stratification with fuzzy contrastive supervision, enabling uncertain samples to contribute to representation learning without relying on unreliable hard labels. In this way, DAHCL jointly improves the quality of supervision and the utilization of unlabeled samples. Furthermore, to better reflect practical industrial scenarios, we incorporate engineering noise into the SSDGFD evaluation protocol. Extensive experiments on three benchmark datasets demonstrate that...

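Summary: Fault diagnosis under unseen operating conditions remains highly challenging, especially when labeled data are scarce. Semi-supervised domain generalization fault diagnosis (SSDGFD) offers a practical solution by jointly exploiting labeled and unlabeled source domains. Existing methods, however, still face two coupled limitations. First, pseudo-labels for unlabeled domains are usually generated mainly from knowledge learned on the labeled source domain, which ignores domain-specific geometric discrepancies and introduces systematic cross-domain pseudo-label bias. Second, unlabeled samples are usually handled with a hard accept-or-discard strategy: rigid thresholding leads to imbalanced sample utilization across domains, while assigning hard labels to uncertain samples easily introduces additional noise. To address these issues, we propose a unified framework for SSDGFD called domain-aware hierarchical contrastive learning (DAHCL). Specifically, DAHCL introduces a domain-aware learning (DAL) module that explicitly captures source-domain geometric characteristics and calibrates pseudo-label predictions across heterogeneous source domains, reducing cross-domain bias in pseudo-label generation. DAHCL also develops a hierarchical contrastive learning (HCL) module that combines dynamic confidence stratification with fuzzy contrastive supervision, so that uncertain samples can contribute to representation learning without relying on unreliable hard labels. In this way, DAHCL jointly improves the quality of supervision and the utilization of unlabeled samples. In addition, to better reflect practical industrial scenarios, we incorporate engineering noise into the SSDGFD evaluation protocol. Extensive experiments on three benchmark datasets demonstrate that…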

Surrogate modeling for interpreting black-box LLMs in medical predictions

2604.20331v2 by Changho Han, Songsoo Kim, Dong Won Kim, Leo Anthony Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.

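Summary: Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge in their parameters, but their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, the framework approximates the latent LLM knowledge space from observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical prediction, we demonstrate the framework's effectiveness in revealing how LLMs "perceive" the relationship between each input variable and the output. In particular, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments with this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions in LLM-encoded knowledge. By exposing these issues, our framework can serve as a red-flag indicator that supports the safe and reliable application of these models.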

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

2604.20311v2 by Dali Wang, Yunyao Zhang, Junqing Yu, Yi-Ping Phoebe Chen, Chen Xu, Zikai Song

Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, it is critical for MVPP approaches to understand both the temporal dynamics of a given video (temporal) and its historical relevance to other videos (spatial). However, existing approaches suffer from limitations in both dimensions: temporally, they rely on sparse short-range sampling that restricts content perception; spatially, they depend on flat retrieval memory with limited capacity and low efficiency, hindering scalable knowledge utilization. To overcome these limitations, we propose a unified framework that achieves joint spatio-temporal enlargement, enabling precise perception of extremely long video sequences while supporting a scalable memory bank that can infinitely expand to incorporate all relevant historical videos. Technically, we employ a Temporal Enlargement driven by a frame scoring module that extracts highlight cues from video frames through two complementary pathways: sparse sampling and dense perception. Their outputs are adaptively fused to enable robust long-sequence content understanding. For Spatial Enlargement, we construct a Topology-Aware Memory Bank that hierarchically clusters historically relevant content based on topological relationships. Instead of directly expanding memory capacity, we update the encoder features of the corresponding clusters when incorporating new videos, enabling unbounded historical association without unbounded storage growth. Extensive experiments on three widely used MVPP benchmarks demonstrate that our method consistently outperforms 11 strong baselines across mainstream metrics, achieving robust improvements in both prediction accuracy and ranking consistency.

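Summary: Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, MVPP methods must understand both the temporal dynamics of a given video (temporal) and its historical relevance to other videos (spatial). Existing methods, however, are limited in both dimensions: temporally, they rely on sparse short-range sampling, which restricts content perception; spatially, they rely on flat retrieval memory with limited capacity and low efficiency, which hinders scalable knowledge utilization. To overcome these limitations, we propose a unified framework that achieves joint spatio-temporal enlargement, enabling precise perception of extremely long video sequences while supporting a scalable memory bank that can expand without limit to incorporate all relevant historical videos. Technically, we adopt a Temporal Enlargement driven by a frame scoring module that extracts highlight cues from video frames through two complementary pathways, sparse sampling and dense perception. Their outputs are adaptively fused to achieve robust long-sequence content understanding. For Spatial Enlargement, we build a Topology-Aware Memory Bank that hierarchically clusters historically relevant content according to topological relationships. Rather than directly expanding memory capacity, we update the encoder features of the corresponding clusters when new videos are incorporated, which allows unbounded historical association without unbounded storage growth. Extensive experiments on three widely used MVPP benchmarks show that our method consistently outperforms 11 strong baselines on mainstream metrics, with robust improvements in both prediction accuracy and ranking consistency.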

Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

2604.20306v1 by Zibo Xu, Qiang Li, Ke Lu, Jin Wang, Weizhi Nie, Yuting Su

Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.

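Summary: Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers from complex medical images and questions. Existing methods, however, often overfit to superficial cross-modal correlations and neglect the intrinsic biases embedded in multimodal medical data. As a result, models become vulnerable to cross-modal confounding effects, which severely hinders their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly handle observable and unobserved confounders. Specifically, we build a Structural Causal Model (SCM) in which observable cross-modal biases (for example, frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated for using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its association with the unobserved confounders and the target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, show that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Qualitative analyses further confirm that DCI significantly improves the interpretability and robustness of cross-modal reasoning by explicitly separating true causal effects from spurious cross-modal shortcuts.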
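
For readers unfamiliar with backdoor adjustment, the standard textbook identity is shown below. The symbols are assumptions for illustration: $X$ is the input, $Y$ the answer, and $Z$ a set of observed confounders satisfying the backdoor criterion. The abstract does not state how DCI instantiates $Z$ for image-question pairs, so this is the general formula rather than the paper's specific estimator.

```latex
P\bigl(Y \mid \mathrm{do}(X = x)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z)
```

The sum replaces the observational conditional $P(Y \mid X = x)$, which would weight each $z$ by $P(Z = z \mid X = x)$ and so let the confounder bias the estimate.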

Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking

2604.20283v1 by Mo Zhou, Jianwei Wang, Kai Wang, Helen Paik, Ying Zhang, Wenjie Zhang

Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that human experts' decision-making relies on multi-perspective judgment, in this work, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates vital neighborhood information through graphs. We first construct LLM-enhanced contextualized graphs. Subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages the power of LLM as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence to induce an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code of this paper is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.

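Summary: Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. Most existing MEL methods, however, concentrate on optimizing instance-centric features and evidence and leave broader forms of evidence and their complex interdependencies underexplored. Motivated by the observation that human experts' decision-making relies on multi-perspective judgment, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework based on large language models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis builds a comprehensive set of evidence. It includes instance-centric evidence that captures the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on the string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which effectively aggregates important neighborhood information through graphs. We first construct LLM-enhanced contextualized graphs. The different modalities are then jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning uses the LLM as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence and induce an effective ranking strategy, achieving accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks show that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code of this paper is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.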

AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling

2604.20263v1 by Zhenyu Wang, Geyan Ye, Wei Liu, Man Tat Alexander Ng

Virtual cell modeling predicts molecular state changes under genetic perturbations in silico, which is essential for biological mechanism studies. However, existing approaches suffer from unconstrained reasoning, uninterpretable predictions, and retrieval signals that are weakly aligned with regulatory topology. To address these limitations, we propose AROMA, an Augmented Reasoning Over a Multimodal Architecture for virtual cell genetic perturbation modeling. AROMA integrates textual evidence, graph-topology information, and protein sequence features to model perturbation-target dependencies, and is trained with a two-stage optimization strategy to yield predictions that are both accurate and interpretable. We also construct two knowledge graphs and a perturbation reasoning dataset, PerturbReason, containing more than 498k samples, as reusable resources for the virtual cell domain. Experiments show that AROMA outperforms existing methods across multiple cell lines, and remains robust under zero-shot evaluation on an unseen cell line, as well as in knowledge-sparse, long-tail scenarios. Overall, AROMA demonstrates that combining knowledge-driven multimodal modeling with evidence retrieval provides a promising pathway toward more reliable and interpretable virtual cell perturbation prediction. Model weights are available at https://huggingface.co/blazerye/AROMA. Code is available at https://github.com/blazerye/AROMA.

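Summary: Virtual cell modeling predicts molecular state changes under genetic perturbations, which is essential for studying biological mechanisms. Existing methods, however, suffer from unconstrained reasoning, uninterpretable predictions, and retrieval signals that are only weakly aligned with regulatory topology. To address these limitations, we propose AROMA, an augmented-reasoning multimodal architecture for virtual cell genetic perturbation modeling. AROMA integrates textual evidence, graph-topology information, and protein sequence features to model perturbation-target dependencies, and it is trained with a two-stage optimization strategy to produce predictions that are both accurate and interpretable. We also construct two knowledge graphs and a perturbation reasoning dataset, PerturbReason, containing more than 498k samples, as reusable resources for the virtual cell domain. Experiments show that AROMA outperforms existing methods across multiple cell lines and remains robust under zero-shot evaluation on an unseen cell line as well as in knowledge-sparse, long-tail scenarios. Overall, AROMA shows that combining knowledge-driven multimodal modeling with evidence retrieval offers a promising path toward more reliable and interpretable virtual cell perturbation prediction. Model weights are available at https://huggingface.co/blazerye/AROMA. Code is available at https://github.com/blazerye/AROMA.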

Hybrid Policy Distillation for LLMs

2604.20244v1 by Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu

Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.

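Summary: Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), and its effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We analyze the design of existing KD methods and present a unified view that establishes the connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, showing improved optimization stability, computational efficiency, and final performance across different model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.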
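
The two divergence directions that the abstract contrasts can be computed for a single token position as follows. This is a minimal sketch under stated assumptions, not HPD itself: the mixing weight `lam` and the function `hybrid_kl` are hypothetical, and the abstract does not give the paper's actual token-level reweighting.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.
    Assumes q has no zero entries where p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hybrid_kl(teacher, student, lam=0.5):
    """Convex mix of forward and reverse KL at one token position.
    `lam` is a hypothetical mixing weight, not a value from the paper."""
    forward = kl(teacher, student)  # mode-covering: penalizes missing teacher mass
    reverse = kl(student, teacher)  # mode-seeking: penalizes mass the teacher lacks
    return lam * forward + (1.0 - lam) * reverse
```

With `lam=1.0` the objective reduces to pure forward KL, and with `lam=0.0` to pure reverse KL, so the weight controls the trade-off between mode coverage and mode seeking.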

Construction of a Battery Research Knowledge Graph using a Global Open Catalog

2604.20241v1 by Luca Foppiano, Sae Dieb, Malik Zain, Kazuki Kasama, Keitaro Sodeyama, Mikiko Tanifuji

Battery research is a rapidly growing and highly interdisciplinary field, making it increasingly difficult to track relevant expertise and identify potential collaborators across institutional boundaries. In this work, we present a pipeline for constructing an author-centric knowledge graph of battery research built on OpenAlex, a large-scale open bibliographic catalogue. For each author, we derive a weighted research descriptors vector that combines coarse-grained OpenAlex concepts with fine-grained keyphrases extracted from titles and abstracts using KeyBERT with ChatGPT (gpt-3.5-turbo) as the backend model, selected after evaluating multiple alternatives. Vector components are weighted by research descriptor origin, authorship position, and temporal recency. The framework is applied to a corpus of 189,581 battery-related works. The resulting vectors support author-author similarity computation, community detection, and exploratory search through a browser-based interface. The knowledge graph is then serialized in RDF and linked to Wikidata identifiers, making it interoperable with external linked open data sources and extensible beyond the battery domain. Unlike prior author-centric analyses confined to institutional repositories, our approach operates at cross-institutional scale and grounds similarity in domain semantics rather than citation or co-authorship structure alone.

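Summary: Battery research is a fast-growing and highly interdisciplinary field, which makes it increasingly difficult to track relevant expertise and identify potential collaborators, especially across institutional boundaries. In this work, we present a pipeline for constructing an author-centric knowledge graph of battery research built on OpenAlex, a large-scale open bibliographic catalogue. For each author, we derive a weighted research descriptor vector that combines coarse-grained OpenAlex concepts with fine-grained keyphrases extracted from titles and abstracts using KeyBERT with ChatGPT (gpt-3.5-turbo) as the backend model, which was selected after evaluating several alternatives. Vector components are weighted by research descriptor origin, authorship position, and temporal recency. The framework is applied to a corpus of 189,581 battery-related works. The resulting vectors support author-author similarity computation, community detection, and exploratory search through a browser-based interface. The knowledge graph is then serialized in RDF and linked to Wikidata identifiers, which makes it interoperable with external linked open data sources and extensible beyond the battery domain. Unlike earlier author-centric analyses confined to institutional repositories, our approach operates at cross-institutional scale and grounds similarity in domain semantics rather than in citation or co-authorship structure alone.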
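
The weighting scheme named in the abstract (descriptor origin, authorship position, temporal recency) could be sketched as below. Every numeric weight, the exponential half-life decay, and the input record layout are hypothetical choices made for illustration; the abstract does not specify the values or functional forms the authors use.

```python
import math
from collections import defaultdict

# Hypothetical weights; the abstract names the factors but not their values.
ORIGIN_WEIGHT = {"concept": 0.5, "keyphrase": 1.0}
POSITION_WEIGHT = {"first": 1.0, "last": 0.8, "middle": 0.5}

def descriptor_vector(works, current_year, half_life=5.0):
    """Aggregate one author's works into a weighted descriptor vector."""
    vec = defaultdict(float)
    for work in works:
        # Recency halves every `half_life` years (an assumed decay form).
        recency = 0.5 ** ((current_year - work["year"]) / half_life)
        position = POSITION_WEIGHT[work["position"]]
        for descriptor, origin in work["descriptors"]:
            vec[descriptor] += ORIGIN_WEIGHT[origin] * position * recency
    return dict(vec)

def cosine(a, b):
    """Cosine similarity between two sparse descriptor vectors."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Author-author similarity is then `cosine(descriptor_vector(...), descriptor_vector(...))`, which is one plausible basis for the similarity computation and community detection the abstract mentions.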

Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context

2604.20216v1 by Yilun Zhu, Yuan Zhuang, Nikhita Vedula, Dushyanta Dhyani, Shaoyuan Xu, Moyan Li, Mohsen Bayati, Bryan Wang, Shervin Malmasi

Many applications of LLM-based text regression require predicting a full conditional distribution rather than a single point value. We study distributional regression under empirical-quantile supervision, where each input is paired with multiple observed quantile outcomes, and the target distribution is represented by a dense grid of quantiles. We address two key limitations of current approaches: the lack of local grounding for distribution estimates, and the reliance on shared representations that create an indirect bottleneck between inputs and quantile outputs. In this paper, we introduce Quantile Token Regression, which, to our knowledge, is the first work to insert dedicated quantile tokens into the input sequence, enabling direct input-output pathways for each quantile through self-attention. We further augment these quantile tokens with retrieval, incorporating semantically similar neighbor instances and their empirical distributions to ground predictions with local evidence from similar instances. We also provide the first theoretical analysis of loss functions for quantile regression, clarifying which distributional objectives each optimizes. Experiments on the Inside Airbnb and StackSample benchmark datasets with LLMs ranging from 1.7B to 14B parameters show that quantile tokens with neighbors consistently outperform baselines (~4 points lower MAPE and 2x narrower prediction intervals), with especially large gains on smaller and more challenging datasets where quantile tokens produce substantially sharper and more accurate distributions.

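Summary: Many applications of LLM-based text regression need to predict a full conditional distribution rather than a single point value. We study distributional regression under empirical-quantile supervision, where each input is paired with multiple observed quantile outcomes and the target distribution is represented by a dense grid of quantiles. We address two main limitations of current approaches: the lack of local grounding for distribution estimates, and the reliance on shared representations, which creates an indirect bottleneck between inputs and quantile outputs. In this paper, we introduce Quantile Token Regression, which, to our knowledge, is the first work to insert dedicated quantile tokens into the input sequence, enabling a direct input-output pathway for each quantile through self-attention. We further augment these quantile tokens with retrieval, incorporating semantically similar neighbor instances and their empirical distributions so that predictions are grounded in local evidence from similar instances. We also provide the first theoretical analysis of loss functions for quantile regression, clarifying which distributional objective each one optimizes. Experiments on the Inside Airbnb and StackSample benchmark datasets with LLMs ranging from 1.7B to 14B parameters show that quantile tokens with neighbors consistently outperform baselines (about 4 points lower MAPE and 2x narrower prediction intervals), with especially large gains on smaller and more challenging datasets, where quantile tokens produce substantially sharper and more accurate distributions.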
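
As background for the quantile-grid supervision described above, the standard pinball (quantile) loss is sketched below. The abstract does not name which losses the paper analyzes, so this is the textbook loss offered as an illustration; the function names are assumptions.

```python
def pinball_loss(y_true, y_pred, tau):
    """Pinball loss for one prediction at quantile level tau in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

def mean_pinball_loss(y_true, quantile_preds):
    """Average pinball loss over a grid given as {tau: predicted quantile}."""
    total = sum(pinball_loss(y_true, q, tau) for tau, q in quantile_preds.items())
    return total / len(quantile_preds)
```

For a high quantile such as `tau=0.9`, predicting too low costs nine times more than predicting too high by the same amount, which is what pushes the prediction toward the upper tail.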

Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs

2604.20211v1 by He Yang Yuan, Xin Wang, Kundi Yao, An Ran Chen, Zishuo Ding, Zhenhao Li

Logging code plays an important role in software systems by recording key events and behaviors, which are essential for debugging and monitoring. However, insecure logging practices can inadvertently expose sensitive information or enable attacks such as log injection, posing serious threats to system security and privacy. Prior research has examined general defects in logging code, but systematic analysis of logging code security issues remains limited, particularly in leveraging LLMs for detection and repair. In this paper, we derive a comprehensive taxonomy of logging code security issues, encompassing four common issue categories and 10 corresponding patterns. We further construct a benchmark dataset with 101 real-world logging security issue reports that have been manually reviewed and annotated. We then propose an automated framework that incorporates various contextual knowledge to evaluate LLMs' capabilities in detecting and repairing logging security issues. Our experimental results reveal a notable disparity in performance: while LLMs are moderately effective at detecting security issues (e.g., the accuracy ranges from 12.9% to 52.5% on average), they face noticeable challenges in reliably generating correct code repairs. We also find that the issue description alone improves the LLMs' detection accuracy more than the security pattern explanation or a combination of both. Overall, our findings provide actionable insights for practitioners and highlight the potential and limitations of current LLMs for secure logging.

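Summary: Logging code plays an important role in software systems by recording key events and behaviors, which is essential for debugging and monitoring. Insecure logging practices, however, can inadvertently expose sensitive information or enable attacks such as log injection, posing serious threats to system security and privacy. Prior research has examined general defects in logging code, but systematic analysis of logging code security issues remains limited, particularly with respect to using LLMs for detection and repair. In this paper, we derive a comprehensive taxonomy of logging code security issues, covering four common issue categories and 10 corresponding patterns. We further construct a benchmark dataset of 101 real-world logging security issue reports that have been manually reviewed and annotated. We then propose an automated framework that incorporates various kinds of contextual knowledge to evaluate LLMs' ability to detect and repair logging security issues. Our experimental results reveal a notable performance gap: while LLMs are moderately effective at detecting security issues (for example, accuracy ranges from 12.9% to 52.5% on average), they face clear challenges in reliably generating correct code repairs. We also find that the issue description alone improves LLMs' detection accuracy more than the security pattern explanation or a combination of both. Overall, our findings offer actionable insights for practitioners and highlight the potential and limitations of current LLMs for secure logging.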
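
Log injection, one of the attacks named in the abstract, can be illustrated with a short example. This is a generic mitigation sketch and is not taken from the paper's taxonomy or benchmark; the helper names `sanitize_for_log` and `log_failed_login` are hypothetical.

```python
import logging

logger = logging.getLogger("auth")

def sanitize_for_log(value):
    """Escape CR and LF so user-controlled input cannot forge extra log lines."""
    return str(value).replace("\r", "\\r").replace("\n", "\\n")

def log_failed_login(username):
    # Without sanitizing, a username such as "bob\nINFO admin logged in"
    # would appear in the log file as a second, attacker-written entry.
    logger.warning("Failed login for user=%s", sanitize_for_log(username))
```

Escaping line breaks addresses forged entries only. Other issues the abstract mentions, such as exposing sensitive information, need separate handling (for example, not logging secrets at all).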

All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG

2604.20199v1 by Dan Wang, Guozhao Mo, Yafei Shi, Cheng Zhang, Bo Zheng, Boxi Cao, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun

Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such "answer-critical" documents, thereby limiting downstream generation performance. To bridge this gap, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.

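Summary: Multilingual Retrieval-Augmented Generation (mRAG) uses cross-lingual evidence to ground large language models (LLMs) in global knowledge. We show, however, that current mRAG systems exhibit a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress these "answer-critical" documents, which limits downstream generation performance. To close this gap, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.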

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

2604.20183v1 by Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang, Bifan Wei, Jun Liu

Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. The experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%-21%. Notably, our analysis reveals a "knowledge inheritance" phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.

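Summary: Large language models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem may admit multiple related but conflicting modeling paradigms, which hinders the generation of effective solutions. To address this, we propose the Dual-Cluster Memory Agent (DCM-Agent), which improves performance by using historical solutions in a training-free manner. At its core is Dual-Cluster Memory Construction. The agent assigns historical solutions to modeling and coding clusters and then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process yields generalizable guidance knowledge. The agent also introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths based on structured knowledge. Experiments on seven optimization benchmarks show that DCM-Agent achieves an average performance improvement of 11%-21%. Notably, our analysis reveals a "knowledge inheritance" phenomenon: memory constructed by larger models can guide smaller models to better performance, highlighting the framework's scalability and efficiency.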

SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

2604.20146v1 by Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong, Lin Chen, Yunlai Teng, Shimin Di, Jian Yin

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.

To Know is to Construct: Schema-Constrained Generation for Agent Memory

2604.20117v1 by Lei Zheng, Weinan Song, Daili Li, Yanming Yang

Constructivist epistemology argues that knowledge is actively constructed rather than passively copied. Despite the generative nature of Large Language Models (LLMs), most existing agent memory systems are still based on dense retrieval. However, dense retrieval heavily relies on semantic overlap or entity matching within sentences. Consequently, embeddings often fail to distinguish instances that are semantically similar but contextually distinct, introducing substantial noise by retrieving context-mismatched entries. Conversely, directly employing open-ended generation for memory access risks "Structural Hallucination" where the model generates memory keys that do not exist in the memory, leading to lookup failures. Inspired by this epistemology, we posit that memory is fundamentally organized by cognitive schemas, and valid recall must be a generative process performed within these schematic structures. To realize this, we propose SCG-MEM, a schema-constrained generative memory architecture. SCG-MEM reformulates memory access as Schema-Constrained Generation. By maintaining a dynamic Cognitive Schema, we strictly constrain LLM decoding to generate only valid memory entry keys, providing a formal guarantee against structural hallucinations. To support long-term adaptation, we model memory updates via assimilation (grounding inputs into existing schemas) and accommodation (expanding schemas with novel concepts). Furthermore, we construct an Associative Graph to enable multi-hop reasoning through activation propagation. Experiments on the LoCoMo benchmark show that SCG-MEM substantially improves performance across all categories over retrieval-based baselines.
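
The idea of constraining decoding to valid memory keys can be sketched with a prefix trie. This is not the SCG-MEM implementation: the trie, the `END` marker, and the `score` callback (a stand-in for LLM token scores) are assumptions made for illustration, and the key set is assumed to be non-empty.

```python
END = "<end>"  # hypothetical end-of-key marker

def build_trie(keys):
    """Build a nested-dict prefix trie from keys given as token tuples."""
    trie = {}
    for key in keys:
        node = trie
        for tok in key:
            node = node.setdefault(tok, {})
        node[END] = {}
    return trie

def constrained_decode(trie, score):
    """Greedy decoding restricted to tokens the trie allows at each step.

    score(prefix, token) stands in for the model's preference; tokens
    outside the trie are never considered, whatever their score.
    """
    node, out = trie, []
    while True:
        tok = max(node, key=lambda t: score(out, t))
        if tok == END:
            return tuple(out)
        out.append(tok)
        node = node[tok]

keys = [("trip", "paris"), ("trip", "rome"), ("pet", "dog")]
prefs = {"trip": 2, "pet": 1, "rome": 3, "paris": 1, "dog": 1, END: 0, "cat": 10}
result = constrained_decode(build_trie(keys), lambda prefix, t: prefs[t])
```

Here the token `"cat"` has the highest raw preference but is never emitted, because no stored key contains it. The decoded key always exists in memory, which is the property the abstract calls a guarantee against structural hallucination.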

Learning to Solve the Quadratic Assignment Problem with Warm-Started MCMC Finetuning

2604.20109v1 by Yicheng Pan, Ruisong Zhou, Haijun Zou, Tianyou Li, Zaiwen Wen

The quadratic assignment problem (QAP) is a fundamental NP-hard task that poses significant challenges for both traditional heuristics and modern learning-based solvers. Existing QAP solvers still struggle to achieve consistently competitive performance across structurally diverse real-world instances. To bridge this performance gap, we propose PLMA, an innovative permutation learning framework. PLMA features an efficient warm-started MCMC finetuning procedure to enhance deployment-time performance, leveraging short Markov chains to anchor the adaptation to the promising regions previously explored. For rapid exploration via MCMC over the permutation space, we design an additive energy-based model (EBM) that enables an $O(1)$-time 2-swap Metropolis-Hastings sampling step. Moreover, the neural network used to parameterize the EBM incorporates a scalable and flexible cross-graph attention mechanism to model interactions between facilities and locations in the QAP. Extensive experiments demonstrate that PLMA consistently outperforms state-of-the-art baselines across various benchmarks. In particular, PLMA achieves a near-zero average optimality gap on QAPLIB, exhibits remarkably superior robustness on the notoriously difficult Taixxeyy instances, and also serves as an effective QAP solver in bandwidth minimization.
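
A minimal sketch of why a 2-swap Metropolis-Hastings step can run in constant time, assuming the additive energy has the form E(π) = Σ_i θ[i][π(i)]. The abstract does not state the exact form, so this is an assumption. The names `theta`, `swap_delta`, and `mh_step` are illustrative, and in PLMA the table would be produced by the neural network rather than supplied directly.

```python
import math
import random

def energy(theta, perm):
    """Full additive energy: one table lookup per facility, O(n)."""
    return sum(theta[i][perm[i]] for i in range(len(perm)))

def swap_delta(theta, perm, a, b):
    """Energy change from swapping positions a and b: four lookups, O(1)."""
    return (theta[a][perm[b]] + theta[b][perm[a]]
            - theta[a][perm[a]] - theta[b][perm[b]])

def mh_step(theta, perm, rng, temperature=1.0):
    """One Metropolis-Hastings step with a symmetric 2-swap proposal.

    Mutates perm in place and returns the accepted energy change (0 if rejected).
    """
    a, b = rng.sample(range(len(perm)), 2)
    delta = swap_delta(theta, perm, a, b)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        perm[a], perm[b] = perm[b], perm[a]
        return delta
    return 0

rng = random.Random(0)
n = 6
theta = [[rng.randint(0, 9) for _ in range(n)] for _ in range(n)]
perm = list(range(n))
tracked = energy(theta, perm)
for _ in range(200):
    tracked += mh_step(theta, perm, rng)
```

Because only two terms of the sum change under a swap, the chain can track its energy incrementally without ever recomputing the full sum.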

Auditing and Controlling AI Agent Actions in Spreadsheets

2604.20070v1 by Sadra Sabouri, Zeinabsadat Saghi, Run Huang, Sujay Maladi, Esmeralda Eufracio, Sumit Gulwani, Souti Chattopadhyay

Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already been made without their involvement. This lack of transparency leaves users unable to examine the agent's assumptions, identify errors before they propagate, or redirect execution when it deviates from their intent. The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable. Each decision the agent makes is recorded directly in cells that belong to and reflect on the user. We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step. A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow. Users identified their own intent reflected in the agent's actions, detected errors that post-hoc review would have failed to surface, and reported a sense of co-ownership over the resulting output. These findings indicate that meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.

Information Aggregation with AI Agents

2604.20050v1 by Spyros Galanis

Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from the same limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and they are more profitable. Surprisingly, giving them feedback about past performance makes them worse at aggregation and reduces their profits.

Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

2604.20022v1 by Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
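
The separation described above can be illustrated with a toy Bayesian engine. This is a sketch under stated assumptions, not BMBE itself: the condition names, the likelihood numbers, and the function names `update` and `selective_diagnosis` are made up for illustration, and findings are treated as conditionally independent binary observations. The engine returns a diagnosis only when the top posterior clears a threshold `tau`; raising `tau` trades coverage for accuracy.

```python
def update(prior, likelihood, finding, present):
    """Bayes update of a belief over conditions after one binary finding.

    prior:      {condition: P(condition)}
    likelihood: {condition: {finding: P(finding present | condition)}}
    Assumes at least one condition keeps non-zero probability.
    """
    unnorm = {}
    for cond, p in prior.items():
        p_f = likelihood[cond][finding]
        unnorm[cond] = p * (p_f if present else 1.0 - p_f)
    z = sum(unnorm.values())
    return {cond: v / z for cond, v in unnorm.items()}

def selective_diagnosis(posterior, tau):
    """Return the most probable condition, or None (abstain) below tau."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= tau else None

# Toy numbers, invented for this example only.
prior = {"condition_a": 0.5, "condition_b": 0.5}
likelihood = {"condition_a": {"fever": 0.9}, "condition_b": {"fever": 0.2}}
posterior = update(prior, likelihood, "fever", present=True)
```

In this arrangement the language model would only map a patient utterance to the pair (`"fever"`, `True`). Every probability is computed by the deterministic functions above, so each step can be audited.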

From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

2604.20006v1 by Md Nayem Uddin, Kumar Shubham, Eduardo Blanco, Chitta Baral, Gengyu Wang

Personalized agents that interact with users over long periods must maintain persistent memory across sessions and update it as circumstances change. However, existing benchmarks predominantly frame long-term memory evaluation as fact retrieval from past conversations, providing limited insight into agents' ability to consolidate memory over time or handle frequent knowledge updates. We introduce Memora, a long-term memory benchmark spanning user conversations that last weeks to months. The benchmark evaluates three memory-grounded tasks: remembering, reasoning, and recommending. To ensure data quality, we employ automated memory-grounding checks and human evaluation. We further introduce Forgetting-Aware Memory Accuracy (FAMA), a metric that penalizes reliance on obsolete or invalidated memory when evaluating long-term memory. Evaluations of four LLMs and six memory agents reveal frequent reuse of invalid memories and failures to reconcile evolving memories. Memory agents offer marginal improvements, exposing shortcomings in long-term memory for personalized agents.

Tracing Relational Knowledge Recall in Large Language Models

2604.19934v2 by Nicholas Popovič, Michael Färber

We study how large language models recall relational knowledge during text generation, with a focus on identifying latent representations suitable for relation classification via linear probes. Prior work shows how attention heads and MLPs interact to resolve subject, predicate, and object, but it remains unclear which representations support faithful linear relation classification and why some relation types are easier to capture linearly than others. We systematically evaluate different latent representations derived from attention head and MLP contributions, showing that per-head attention contributions to the residual stream are comparatively strong features for linear relation classification. Feature attribution analyses of the trained probes, as well as characteristics of the different relation types, reveal clear correlations between probe accuracy and relation specificity, entity connectedness, and how distributed the signal on which the probe relies is across attention heads. Finally, we show how token-level feature attribution of probe predictions can be used to reveal probe behavior in further detail.

CreativeGame: Toward Mechanic-Aware Creative Game Generation

2604.19926v1 by Hongnan Ma, Han Wang, Shenglin Wang, Tieyue Yin, Yiwei Shi, Yucong Huang, Yingtian Zou, Muning Wen, Mengyue Yang

Large language models can generate plausible game code, but turning this capability into iterative creative improvement remains difficult. In practice, single-shot generation often produces brittle runtime behavior, weak accumulation of experience across versions, and creativity scores that are too subjective to serve as reliable optimization signals. A further limitation is that mechanics are frequently treated only as post-hoc descriptions, rather than as explicit objects that can be planned, tracked, preserved, and evaluated during generation. This report presents CreativeGame, a multi-agent system for iterative HTML5 game generation that addresses these issues through four coupled ideas: a proxy reward centered on programmatic signals rather than pure LLM judgment; lineage-scoped memory for cross-version experience accumulation; runtime validation integrated into both repair and reward; and a mechanic-guided planning loop in which retrieved mechanic knowledge is converted into an explicit mechanic plan before code generation begins. The goal is not merely to produce a playable artifact in one step, but to support interpretable version-to-version evolution. The current system contains 71 stored lineages, 88 saved nodes, and a 774-entry global mechanic archive, implemented in 6,181 lines of Python together with inspection and visualization tooling. The system is therefore substantial enough to support architectural analysis, reward inspection, and real lineage-level case studies rather than only prompt-level demos. A real 4-generation lineage shows that mechanic-level innovation can emerge in later versions and can be inspected directly through version-to-version records. The central contribution is therefore not only game generation, but a concrete pipeline for observing progressive evolution through explicit mechanic change.

Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding

2604.19921v1 by Zijie Wang, MohammadHossein Rezaei, Farzana Rashid, Eduardo Blanco

Negation is a common and important semantic feature in natural language, yet Large Language Models (LLMs) struggle when negation is involved in natural language understanding tasks. Commonsense knowledge, on the other hand, despite being a well-studied topic, lacks investigations involving negation. In this work, we show that commonsense knowledge with negation is challenging for models to understand. We present a novel approach to automatically augment existing commonsense knowledge corpora with negation, yielding two new corpora containing over 2M triples with if-then relations. In addition, pre-training LLMs on our corpora benefits negation understanding.

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

2604.19734v1 by Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, Yixiao Ge

Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

2604.19856v1 by Cagri Eryilmaz

Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.
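
The reported pass@1 figures can be checked against the problem counts quoted in the abstract with a few lines of arithmetic. The helper name `pass_rate` is illustrative. The count 150/156 for the low end of the range is an inference from the 156-problem total implied by "best 154/156", not a number stated in the abstract.

```python
def pass_rate(passed, total):
    """Percentage of problems passed, rounded to two decimals."""
    return round(100.0 * passed / total, 2)

best_run = pass_rate(154, 156)     # best VerilogEval-Human run quoted in the abstract
worst_run = pass_rate(150, 156)    # inferred count matching the 96.15% low end
cvdp_subset = pass_rate(286, 302)  # CVDP non-agentic subset

print(best_run, worst_run, cvdp_subset)  # prints: 98.72 96.15 94.7
```

On the 156-problem set each additional solved problem moves pass@1 by about 0.64 percentage points, so the reported 96.15-98.72% range corresponds to a spread of four problems across the seven runs.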

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

2604.19689v1 by Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.

2604.19685v1 by Saransh Sharma, Pritika Ramu, Aparna Garimella, Koyel Mukherjee

Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.

Medical explainable AI

Publish Date Title Authors Homepage Code
2026-04-24 Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings Inês Oliveira e Silva et.al. 2604.22662v1 null
2026-04-24 Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair Yuelin Hu et.al. 2604.22407v1 null
2026-04-24 Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis Zhilin Fan et.al. 2604.22237v1 null
2026-04-24 Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems Meghana Karnam et.al. 2604.22154v1 null
2026-04-23 Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations Nalin Poungpeth et.al. 2604.22109v1 null
2026-04-23 Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake Guan Gui et.al. 2604.22067v1 null
2026-04-23 Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores Shevya Pandya et.al. 2604.22063v1 null
2026-04-23 H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers Ayushi Mehrotra et.al. 2604.22045v1 null
2026-04-23 EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms Brian VanVoorst et.al. 2604.22036v1 null
2026-04-23 Shared Lexical Task Representations Explain Behavioral Variability In LLMs Zhuonan Yang et.al. 2604.22027v1 null
2026-04-23 Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales Olufunke O. Sarumi et.al. 2604.21667v1 null
2026-04-23 Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation Yi-Ling Liu et.al. 2604.21640v1 null
2026-04-23 On the Role of Preprocessing and Memristor Dynamics in Reservoir Computing for Image Classification Rishona Daniels et.al. 2604.21602v1 null
2026-04-23 Dynamical Priors as a Training Objective in Reinforcement Learning Sukesh Subaharan et.al. 2604.21464v1 null
2026-04-23 Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages Michael Bouzinier et.al. 2604.21263v1 null
2026-04-22 Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction Abhishek Dharmaratnakar et.al. 2604.21154v1 null
2026-04-22 Propensity Inference: Environmental Contributors to LLM Behaviour Olli Järviniemi et.al. 2604.21098v1 null
2026-04-22 SGD at the Edge of Stability: The Stochastic Sharpness Gap Fangshuo Liao et.al. 2604.21016v1 null
2026-04-22 Convergent Evolution: How Different Language Models Learn Similar Number Representations Deqing Fu et.al. 2604.20817v1 null
2026-04-22 Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems Pavel Salovskii et.al. 2604.20795v1 null
2026-04-22 Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs Mariano Barone et.al. 2604.20791v1 null
2026-04-22 Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation Andrew Klearman et.al. 2604.20763v1 null
2026-04-22 Participatory provenance as representational auditing for AI-mediated public consultation Sachit Mahajan et.al. 2604.20711v1 null
2026-04-22 RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking Roie Kazoom et.al. 2604.20623v1 null
2026-04-22 Evian: Towards Explainable Visual Instruction-tuning Data Auditing Zimu Jia et.al. 2604.20544v1 null
2026-04-22 MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills Yingyong Hou et.al. 2604.20441v1 null
2026-04-22 Surrogate modeling for interpreting black-box LLMs in medical predictions Changho Han et.al. 2604.20331v2 null
2026-04-22 Stateless Decision Memory for Enterprise AI Agents Vasundra Srinivasan et.al. 2604.20158v1 null
2026-04-21 From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI Patrick Vossler et.al. 2604.20055v1 null
2026-04-21 TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs Ziyi Wang et.al. 2604.20043v1 null
2026-04-21 Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models Kihyuk Lee et.al. 2604.19598v2 null
2026-04-21 Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity Farbod Zorriassatine et.al. 2604.19538v1 null
2026-04-21 EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training Chengjun Pan et.al. 2604.19485v1 null
2026-04-21 Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents Vasundra Srinivasan et.al. 2604.19457v1 null
2026-04-21 TACENR: Task-Agnostic Contrastive Explanations for Node Representations Vasiliki Papanikou et.al. 2604.19372v1 null
2026-04-21 Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications Abu Noman Md Sakib et.al. 2604.19281v1 null
2026-04-20 Gradient-Based Program Synthesis with Neurally Interpreted Languages Matthew V. Macfarlane et.al. 2604.18907v1 null
2026-04-20 AI scientists produce results without reasoning scientifically Martiño Ríos-García et.al. 2604.18805v1 null
2026-04-20 Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling Andrew Wang et.al. 2604.18753v1 null
2026-04-20 On the Importance and Evaluation of Narrativity in Natural Language AI Explanations Mateusz Cedro et.al. 2604.18311v1 null
2026-04-20 Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support Eranga Bandara et.al. 2604.18302v1 null
2026-04-20 Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning Khalil Akremi et.al. 2604.19823v1 null
2026-04-20 ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks Saeid Sheikhi et.al. 2604.18052v1 null
2026-04-20 First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows Sihao Xing et.al. 2604.18038v1 null
2026-04-20 How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers Xiao Wang et.al. 2604.17935v1 null
2026-04-20 AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis Nathasha Naranpanawa et.al. 2604.17846v1 null
2026-04-20 Community-Led AI Integration for Wildfire Risk Assessment: A Participatory AI Literacy and Explainability Integration (PALEI) Framework in Los Angeles, CA Sanaz Sadat Hosseini et.al. 2604.17755v1 null
2026-04-20 MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models Suhyun Lee et.al. 2604.17730v1 null
2026-04-20 Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems Nick Loghmani et.al. 2604.17677v1 null
2026-04-19 On The Mathematics of the Natural Physics of Optimization I. M. Ross et.al. 2604.17645v1 null
2026-04-19 STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments Md Mezbahul Islam et.al. 2604.17611v1 null
2026-04-19 CDSA-Net: Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography Si Li et.al. 2604.17208v1 null
2026-04-19 Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training Weibing Zheng et.al. 2604.17186v1 null
2026-04-18 Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL Skylar Zhai et.al. 2604.17073v1 null
2026-04-18 Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach Riza Alaudin Syah et.al. 2604.16953v1 null
2026-04-18 LLMs can persuade only psychologically susceptible humans on societal issues, via trust in AI and emotional appeals, amid logical fallacies Alexis Carrillo et.al. 2604.16935v1 null
2026-04-18 The Reliance Negotiation Framework: A Dynamic Process Model of Student LLM Engagement in Academic Writing Shahin Hossain et.al. 2604.16772v1 null
2026-04-17 Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals Yang Shanglin et.al. 2604.16745v1 null
2026-04-17 CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction Jianyou Wang et.al. 2604.16742v1 null
2026-04-17 When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis Justice Owusu Agyemang et.al. 2604.16736v1 null
2026-04-17 Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis Ayhan Can Erdur et.al. 2604.16729v1 null
2026-04-17 The Query Channel: Information-Theoretic Limits of Masking-Based Explanations Erciyes Karakaya et.al. 2604.16689v1 null
2026-04-17 Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing Thomas Bayer et.al. 2604.16280v1 null
2026-04-17 MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation Yi Lin et.al. 2604.16175v1 null
2026-04-17 Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors Jessica H. Zhu et.al. 2604.16132v1 null
2026-04-17 Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration Baramee Sukumal et.al. 2604.16104v1 null
2026-04-17 Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures Yutong Gao et.al. 2604.16042v2 null
2026-04-17 Evaluating Temporal and Structural Anomaly Detection Paradigms for DDoS Traffic Yasmin Souza Lima et.al. 2604.16575v1 null
2026-04-17 Towards Rigorous Explainability by Feature Attribution Olivier Létoffé et.al. 2604.15898v1 null
2026-04-17 Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension Dongxin Guo et.al. 2604.15769v1 null
2026-04-17 LLM Reasoning Is Latent, Not the Chain of Thought Wenshuo Wang et.al. 2604.15726v1 null
2026-04-16 LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance Jack Wei Lun Shi et.al. 2604.15589v1 null
2026-04-16 Towards Reliable Testing of Machine Unlearning Anna Mazhar et.al. 2604.16536v1 null
2026-04-16 Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models Emily Curl et.al. 2604.16532v1 null
2026-04-16 DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI Zhizheng Wang et.al. 2604.15456v1 null
2026-04-16 RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography Mélanie Roschewitz et.al. 2604.15231v1 null
2026-04-16 Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF Nicklas Neu et.al. 2604.16528v1 null
2026-04-16 Agentic Explainability at Scale: Between Corporate Fears and XAI Needs Yomna Elsayed et.al. 2604.14984v1 null
2026-04-16 Hybrid Decision Making via Conformal VLM-generated Guidance Debodeep Banerjee et.al. 2604.14980v2 null
2026-04-16 Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels? Amy Rouillard et.al. 2604.14892v2 null
2026-04-16 M2-PALE: A Framework for Explaining Multi-Agent MCTS-Minimax Hybrids via Process Mining and LLMs Yiyu Qian et.al. 2604.14687v1 null
2026-04-16 Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks Seyedreza Mohseni et.al. 2604.15390v2 null
2026-04-16 Rethinking Patient Education as Multi-turn Multi-modal Interaction Zonghai Yao et.al. 2604.14656v1 null
2026-04-16 CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors Yubin Kim et.al. 2604.14615v1 null
2026-04-16 Generative Augmented Inference Cheng Lu et.al. 2604.14575v1 null
2026-04-16 Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities Michal Rosen-Zvi et.al. 2604.14514v1 null
2026-04-15 When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden Apoorv Prasad et.al. 2604.14356v1 null
2026-04-15 Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance Bar Alon et.al. 2604.14325v1 null
2026-04-15 Seeing Through Experts' Eyes: A Foundational Vision Language Model Trained on Radiologists' Gaze and Reasoning Kinhei Lee et.al. 2604.14316v1 null
2026-04-15 EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation Francesco Andrea Causio et.al. 2604.14306v2 null
2026-04-15 Quantum-inspired tensor networks in machine learning models Guillermo Valverde et.al. 2604.14287v1 null
2026-04-15 Applied Explainability for Large Language Models: A Comparative Study Venkata Abhinandan Kancharla et.al. 2604.15371v1 null
2026-04-15 Med-CAM: Minimal Evidence for Explaining Medical Decision Making Pirzada Suhail et.al. 2604.13695v1 null
2026-04-15 Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment Eileen Kapel et.al. 2604.13462v1 null
2026-04-15 Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making Pramudita Satria Palar et.al. 2604.14240v1 null
2026-04-15 ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold Chenlang Yi et.al. 2604.13392v1 null
2026-04-15 Young people's perceptions and recommendations for conversational generative artificial intelligence in youth mental health Adam Poulsen et.al. 2604.13381v1 null
2026-04-14 Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition Mohammad Saleh et.al. 2604.13279v1 null
2026-04-14 Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs Vishal Pramanik et.al. 2604.13258v1 null
2026-04-14 Explainable Graph Neural Networks for Interbank Contagion Surveillance: A Regulatory-Aligned Framework for the U.S. Banking Sector Mohammad Nasir Uddin et.al. 2604.14232v1 null

Abstracts

Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

2604.22662v1 by Inês Oliveira e Silva, Sérgio Jesus, Iker Perez, Rita P. Ribeiro, Carlos Soares, Hugo Ferreira, Pedro Bizarro

Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.

摘要:Shapley 值是可解釋人工智慧的基石,但其在競爭性公式中的普及導致了一個支離破碎的格局,對於實際部署幾乎沒有共識。雖然理論差異已被充分記錄,但評估仍然依賴於量化代理,其與人類效用的對應關係尚未得到驗證。在本研究中,我們使用統一的攤銷框架來隔離八種 Shapley 變體之間的語義差異,並考慮到操作風險工作流程的低延遲限制。我們在四個風險數據集和一個涉及專業分析師及 3,735 個案例審查的現實欺詐檢測環境中進行了大規模的實證評估。我們的結果揭示了一個根本性的錯位:標準的量化指標,例如稀疏性和忠實度,與人類感知的清晰度和決策效用脫鉤。此外,雖然沒有任何公式改善客觀分析師的表現,但解釋始終提高了決策信心,這在高風險環境中顯示出自動化偏見的重大風險。這些發現表明,當前的評估代理不足以預測下游的人類影響,我們提供了基於證據的指導,以選擇操作決策系統中的公式和指標。
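
For readers who want the quantity behind the variants compared above made concrete, here is a minimal sketch of the classic exact Shapley value, computed by brute force over all coalitions. It is not one of the eight formulations in the paper, nor its amortized framework; the value function `toy_value` and the feature names are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Probability that exactly these k players precede p in a random ordering.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    # Hypothetical risk score: two additive terms plus one pairwise interaction.
    score = 0.0
    if "amount" in coalition:
        score += 2.0
    if "country" in coalition:
        score += 1.0
    if "amount" in coalition and "country" in coalition:
        score += 1.0
    return score


phi = shapley_values(["amount", "country", "device"], toy_value)
print(phi)
```

In this toy game the interaction term is split evenly between `amount` and `country`, and the attributions sum to the value of the full coalition. Variants of the Shapley value typically differ in how `value` is defined for absent features, which is the kind of semantic difference the abstract refers to.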

Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair

2604.22407v1 by Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5-12.8 vs. 13.2; lower is better). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling is worse than vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5-4.8 units. The failure is largely invisible on clean benchmarks. We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.

摘要:許多持續學習方法在上游修改梯度(例如,投影、懲罰重縮、重播混合),同時將 Adam 視為中立的後端。我們展示了這種組合具有隱藏的失效模式。在一個高重疊、非自適應的 8 域持續語言模型中,所有共享路由投影基準都接近於普通遺忘(12.5--12.8 對比 13.2)。0.5% 的重播緩衝區是最強的共享替代方案,但仍然達到 11.6,而固定強度的解耦則低於普通的 14.1。只有自適應解耦路由在 9.4 的穩定性上保持不變,比普通提高了 3.8 個單位。在 16 域流中,與最強的共享路由投影基準相比,其增益增長至 4.5--4.8 個單位。這一失效在乾淨基準上大多是不可見的。
我們通過 Adam 的二階矩路徑解釋這一效應:在測試的範疇中,投影引起了舊方向有效學習率的 1/(1-alpha) 膨脹,與八個 alpha 值的測量結果相符,誤差在 8% 以內。懲罰方法、重播混合以及在 LoRA 下的 7B 規模也出現了相同的衝突。我們的解決方案僅將修改後的梯度路由到第一階矩,同時保留幅度忠實的二階矩統計,並具備重疊感知的自適應強度。這一簡單的改變是唯一經測試的配置,能夠在各種方法、優化器和規模中持續避免崩潰。
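
The 1/(1-alpha) factor mentioned above can be reproduced in a one-coordinate toy model. The sketch below is an interpretation and not the paper's setup: it assumes that "projection" scales the old-direction gradient component by (1 - alpha), that the gradient is constant over time, and that "decoupled routing" means feeding the damped gradient to Adam's first moment while the second moment still sees the raw gradient. The function names are hypothetical.

```python
import math


def adam_update(grad_for_m, grad_for_v, steps=500, lr=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-12):
    """Size of one Adam step after `steps` iterations of constant gradients.

    grad_for_m feeds the first moment, grad_for_v feeds the second moment.
    """
    m = v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * grad_for_m
        v = beta2 * v + (1 - beta2) * grad_for_v ** 2
    m_hat = m / (1 - beta1 ** steps)  # bias correction
    v_hat = v / (1 - beta2 ** steps)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


def inflation(alpha, g=1.0):
    """Ratio of the shared-routing step to the decoupled-routing step."""
    damped = (1 - alpha) * g  # assumption: projection damps the old direction
    shared = adam_update(damped, damped)  # both moments see the damped gradient
    decoupled = adam_update(damped, g)    # second moment keeps the raw magnitude
    return shared / decoupled


for alpha in (0.1, 0.5, 0.9):
    print(alpha, inflation(alpha), 1 / (1 - alpha))
```

Under these assumptions, shared routing lets Adam's normalization cancel the damping, so the step stays near `lr` regardless of alpha, while the decoupled variant keeps the intended (1 - alpha) scaling. The ratio of the two is the 1/(1-alpha) inflation.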

Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis

2604.22237v1 by Zhilin Fan, Deliang Wang, Penghe Chen, Yu Lu

Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.

摘要:診斷學生問題行為需要教師綜合多方面的信息、識別行為類別並規劃干預策略。雖然微調過的大型語言模型(LLMs)可以通過多輪對話支持這一過程,但它們很少解釋為什麼推薦某一策略,這限制了透明度和教師的信任。為了解決這一問題,我們提出了一個基於微調LLM的可解釋對話系統。該系統使用基於可解釋人工智慧(xAI)的層次歸因方法來識別每個推薦的對話證據,並根據該證據生成自然語言解釋。在技術評估中,該方法在識別支持證據方面超過了基準方法。在對22名預備教師的初步用戶研究中,接受解釋的參與者報告對系統的信任度更高。這些發現表明,改善LLM在教育對話系統中的可解釋性是一個有前景的方向。

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

2604.22154v1 by Meghana Karnam, Ananya Joshi

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while keeping false negative rates similar across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.

摘要:新興的行為健康和精神病學中的人工智慧系統使用多步驟或多代理的LLM管道來執行評估自我傷害風險和篩檢抑鬱症等任務。然而,常見的評估方法,如LLM作為裁判,並未指示何時決策是可靠的,或如何在多個LLM判斷中累積錯誤,這限制了它們在安全關鍵環境中的適用性。我們提出了一個統計框架,針對結構為有向無環圖(DAG)的多代理管道,提供了一種基於原則的、自適應的決策制定替代啟發式投票的方法。我們將每個代理建模為隨機類別決策,並引入(1)更緊的代理級性能信心界限,(2)基於輸入難度的強盜式自適應抽樣策略,以及(3)在多代理系統上提供的懊悔保證,顯示在部署時的對數錯誤增長。我們在行為健康的兩個標記數據集上評估我們的系統:AEGIS 2.0行為健康子集(N=161)和SWMH Reddit帖子的一個分層樣本(N=250)。從實證上看,我們的自適應抽樣策略在這兩個數據集中達到了最低的假陽性率,AEGIS 2.0為0.095,而單代理模型為0.159,將安全內容的錯誤標記減少了40\%,並且在所有條件下仍然保持相似的假陰性率。這些結果表明,基於原則的自適應抽樣在不降低召回率的情況下,提供了精確度的有意義改善。
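
To illustrate what sampling "based on input difficulty" can look like, the sketch below repeatedly queries one stochastic binary agent and stops once a Hoeffding confidence interval around the flag rate excludes 0.5. This is a swapped-in textbook technique and not the paper's method: the paper's tighter bounds, bandit strategy, and DAG structure are not specified in the abstract, and `adaptive_vote`, its parameters, and the simulated agents are hypothetical.

```python
import math
import random


def hoeffding_radius(n, delta):
    """Two-sided Hoeffding radius for the mean of n samples bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def adaptive_vote(agent, min_samples=3, max_samples=200, delta=0.05):
    """Query `agent` (a callable returning True/False) until the flag rate is
    confidently above or below 0.5, or the sample budget runs out.

    Note: checking a fixed-delta bound after every draw is not anytime-valid;
    a formal guarantee would need a union bound or a sequential test.
    """
    flags = 0
    for n in range(1, max_samples + 1):
        flags += 1 if agent() else 0
        mean = flags / n
        if n >= min_samples and abs(mean - 0.5) > hoeffding_radius(n, delta):
            return mean > 0.5, n
    return flags / max_samples > 0.5, max_samples


# An unambiguous input: the simulated agent always flags it.
easy_decision, easy_draws = adaptive_vote(lambda: True)

# An ambiguous input: the simulated agent flags it 55% of the time.
rng = random.Random(0)
hard_decision, hard_draws = adaptive_vote(lambda: rng.random() < 0.55)

print(easy_decision, easy_draws)
print(hard_decision, hard_draws)
```

Clear-cut inputs stop after a handful of draws, while ambiguous ones consume more of the budget, which is the intuition behind spending samples where they matter. Separately, the 40% figure in the abstract follows from the reported rates: (0.159 - 0.095) / 0.159 is roughly 0.40.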

Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

2604.22109v1 by Nalin Poungpeth, Nicholas Clark, Tanu Mitra

Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLMs in persuading users, as well as the perception of models as objective and impartial.

摘要:大型語言模型(LLMs)擁有強大的說服能力,在一對一比較中超越人類。使用者報告表示,在關係、醫療環境以及尋求專業建議時,會諮詢LLMs以協助做出重大生活決策。先前的研究將說服測量為產生最有效論點或令人信服陳述的有意圖嘗試。這未能捕捉到日常人類與AI互動中的情況,使用者在這些互動中尋求資訊或建議。為了解決這一空白,我們引入了「自發性說服」,其特徵是在人們不一定需要說服的日常情境中隱性使用說服策略。我們對五個LLMs進行了審核,以揭示自發性說服在多輪對話中出現的頻率及其技術。為了模擬回應風格,我們提供了一個基於心理學、溝通學和語言學文獻的使用者回應分類法。此外,我們比較了LLMs在相同主題上產生的自發性說服與從Reddit收集的人類回應的分佈。我們發現LLMs幾乎在所有對話中都自發地說服使用者,並大量依賴基於資訊的策略,例如訴諸邏輯或定量證據。這在各模型和使用者回應風格中是一致的,但涉及心理健康的對話中,基於評價和情感的策略的使用率較高。相比之下,人類回應則傾向於使用產生社會影響的策略,如負面情感訴求和非專家證言。這一差異可能解釋了LLM在說服使用者方面的有效性,以及模型被視為客觀和公正的感知。

Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

2604.22067v1 by Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi

Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.

摘要:精神科接診是一個連續的、高風險的信息收集過程,臨床醫生必須決定提問的內容、順序以及如何在有限的時間內解釋不完整或模糊的回答。儘管對於醫療保健中的對話式人工智慧的興趣日益增長,但在這一應用中,對話式人工智慧的基礎設施仍然有限。因此,我們將這一任務表述為一個問題選擇問題,涉及臨床上有根據的問題、已知的目標信息以及可控的患者難度。我們還基於655個臨床醫生撰寫的接診問題庫和5種不同行為條件的相應合成患者小品,介紹了一個特定任務的問題選擇基準。在我們的評估中,我們比較了隨機提問、一個臨床精神科接診表的基準,以及一個基於大型語言模型(LLM)指導的自適應政策,這涉及300次訪談會議,涵蓋四位患者和五種行為條件。在基準測試中,臨床有序的固定形式顯著優於隨機提問,而LLM指導的政策則實現了最強的整體恢復。在患者行為對現場恢復的適應性較差的情況下,適應的優勢急劇增長,尤其是在防守性簡潔的條件下。這些發現表明,對話式臨床系統的表現不僅取決於信息披露後的語言理解,還取決於系統是否能在有限的互動預算內觸及正確的主題。更廣泛地說,這一基準提供了一個受控框架,用於研究臨床結構和自適應後續如何促進互動式臨床機器學習中的信息恢復。

Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

2604.22063v1 by Shevya Pandya, Shinjini Bose, Ananya Joshi

Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these effects, especially in the psychiatric domain. We propose an approach to auditing the reliability of downstream LLM tasks that structures evaluation around the impact of prompt design and of medically insignificant inputs on predicted hospitalization risk scores; predicting such scores is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior, such as this audit, before clinical deployment.

摘要:大型語言模型(LLMs)在臨床推理和風險評估中被越來越多地使用。然而,它們在精神科等關鍵和不確定領域的解釋可靠性仍然不明。先前的研究已經識別出這些系統中的算法偏見和提示敏感性,這引發了關於上下文信息如何影響模型輸出的擔憂,但在精神科領域仍然沒有系統的方法來評估這些問題。我們提出了一種通過圍繞提示設計的影響和醫學上不重要的輸入對預測住院風險分數的影響來結構化評估的可靠性審核方法,這通常是第一個下游AI臨床決策任務。在我們的審核中,生成了一組合成患者資料(n = 50),每個資料包含15個臨床相關特徵和最多50個臨床不重要特徵,跨越四種提示重構(中立、邏輯、人類影響、臨床判斷)。我們審核了四個LLM(Gemini 2.5 Flash,LLaMa 3.3 70b,Claude Sonnet 4.6,GPT-4o mini),結果顯示,包含醫學上不重要的變量導致所有模型和提示的絕對平均預測住院風險和輸出變異性有統計學上顯著的增加,這表明隨著上下文噪音的增加,預測穩定性降低。臨床不重要特徵在許多模型-提示條件下對不穩定性產生了影響,而提示變化獨立地以模型依賴的方式影響不穩定性的軌跡。這些發現量化了基於LLM的精神科風險評估對非臨床信息的敏感性,突顯了在臨床部署之前需要對歸因穩定性和不確定性行為進行系統評估的必要性。
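
The audit design described above (score a profile, add clinically insignificant features, score again, and compare) can be sketched as a small harness. Everything named here is an assumption for illustration: `audit_noise_sensitivity`, the `clin_` key prefix, the noise features, and the two scorers, which are deterministic stand-ins for an LLM call rather than any of the audited models. The paper's actual statistics and prompts are not reproduced.

```python
import random
import statistics


def audit_noise_sensitivity(score_fn, profiles, noise_pool, n_noise, seed=0):
    """Return (mean absolute score shift, std of shifts) after adding
    `n_noise` irrelevant features to each profile."""
    rng = random.Random(seed)
    shifts = []
    for profile in profiles:
        base = score_fn(profile)
        noisy = dict(profile)
        for name in rng.sample(sorted(noise_pool), n_noise):
            noisy[name] = noise_pool[name]
        shifts.append(score_fn(noisy) - base)
    mean_abs_shift = statistics.mean([abs(s) for s in shifts])
    return mean_abs_shift, statistics.pstdev(shifts)


def robust_scorer(profile):
    # Hypothetical scorer that only reads clinically relevant keys.
    return sum(v for k, v in profile.items() if k.startswith("clin_"))


def leaky_scorer(profile):
    # Hypothetical scorer whose output drifts with every irrelevant key,
    # which is the failure mode the audit is meant to expose.
    extras = sum(1 for k in profile if not k.startswith("clin_"))
    return robust_scorer(profile) + 2.0 * extras


profiles = [
    {"clin_prior_admissions": 1.0, "clin_symptom_severity": 3.0},
    {"clin_prior_admissions": 0.0, "clin_symptom_severity": 5.0},
]
noise_pool = {"favorite_color": "blue", "shoe_size": 42, "pet_name": "Rex"}

print(audit_noise_sensitivity(robust_scorer, profiles, noise_pool, n_noise=2))
print(audit_noise_sensitivity(leaky_scorer, profiles, noise_pool, n_noise=2))
```

In a real audit, `score_fn` would wrap a model call and would be repeated across prompt framings and noise levels; the harness only shows the before/after comparison that makes sensitivity to non-clinical input measurable.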

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

2604.22045v1 by Ayushi Mehrotra, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi

Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.

摘要:特徵歸因方法通過為單個輸入特徵分配重要性分數來解釋深度神經網絡的預測。然而,大多數現有方法僅專注於邊際效應,忽略了特徵之間的交互作用,這些交互作用是特徵組共同影響模型輸出的情況。這種交互作用在圖像分類任務中特別重要,因為語義意義通常來自像素之間的相互依賴,而不是孤立的特徵。現有的基於交互作用的圖像方法要麼過於粗糙(例如,僅使用超像素),要麼未能滿足核心可解釋性公理。在這項工作中,我們介紹了 H-Sets,一種新穎的兩階段框架,用於發現和歸因於圖像分類器中的高階特徵交互作用。首先,我們通過輸入 Hessians 檢測局部交互對,並將它們遞歸地合併成語義上連貫的集合;使用 Segment Anything (SAM) 進行分割作為空間分組的先驗,但可以用其他分割方法替代。其次,我們使用 IDG-Vis 為每個集合進行歸因,這是一種集級擴展的整合方向梯度,將沿像素空間路徑的方向梯度整合並與 Harsanyi 分紅進行聚合。雖然 Hessians 在檢測階段引入了額外的計算成本,但這種有針對性的成本始終能產生更稀疏且更真實的顯著性圖。在 ImageNet 和 CUB 數據集上對 VGG、ResNet、DenseNet 和 MobileNet 模型的評估顯示,H-Sets 生成的顯著性圖比現有方法更具可解釋性和真實性。
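
The first stage described above, detecting interacting pairs from second derivatives and merging them into sets, can be illustrated on a toy function. This sketch makes several assumptions that differ from the paper: it uses central finite differences instead of exact input Hessians, a four-variable polynomial instead of an image classifier, a plain union-find merge with no segmentation prior, and a hypothetical threshold. It does not implement IDG-Vis or Harsanyi dividends.

```python
from itertools import combinations


def mixed_partial(f, x, i, j, h=1e-3):
    """Central finite-difference estimate of d^2 f / (dx_i dx_j) at x."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (shifted(h, h) - shifted(h, -h)
            - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)


def interacting_sets(f, x, threshold=1e-3):
    """Group feature indices whose pairwise mixed partials exceed `threshold`."""
    n = len(x)
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i, j in combinations(range(n), 2):
        if abs(mixed_partial(f, x, i, j)) > threshold:
            parent[find(i)] = find(j)  # merge the two sets

    groups = {}
    for k in range(n):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())


def toy_model(x):
    # Hypothetical model: features 0-1 and 1-2 interact, feature 3 is additive.
    return x[0] * x[1] + x[1] * x[2] + 3.0 * x[3]


sets = interacting_sets(toy_model, [1.0, 2.0, 3.0, 4.0])
print(sets)
```

Features 0 and 2 never interact directly, yet they end up in the same set because both interact with feature 1. Chaining pairs like this is what lets pairwise detection recover higher-order groups.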

EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

2604.22036v1 by Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix, Iran Roman, Bea Steers, Claudio Silva, Yuhan Shen, Zijia Lu, Shih-Po Lee, Ehsan Elhamifar

This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).

摘要:這篇論文介紹了EgoMAGIC(醫療輔助、指導、說明和修正),這是一個以自我為中心的醫療活動數據集,作為DARPA的感知能力任務指導(PTG)計畫的一部分收集而成。這個數據集包含3,355個視頻,涵蓋50個醫療任務,每個任務至少有50個標記視頻。PTG計畫的主要目標是開發集成在增強現實頭盔中的虛擬助手,以幫助用戶執行複雜任務。
為了鼓勵使用這個數據集進行探索和研究,醫療訓練數據已經發布,並附帶了一個專注於八個醫療任務的動作檢測挑戰。大多數視頻是使用帶有集成音頻的頭戴立體攝像機錄製的。從這個數據集中,使用195萬個標籤訓練了40個YOLO模型,以檢測124個醫療物體,為從事醫療AI應用開發的開發者提供了一個穩健的起點。
除了介紹數據集,這篇論文還呈現了三個模型在八個選定醫療任務上的動作檢測基準結果,其中表現最佳的方法達到了平均mAP 0.526。儘管這篇論文主要針對動作檢測作為基準,但EgoMAGIC數據集同樣適用於動作識別、物體識別和檢測、錯誤檢測以及其他具有挑戰性的計算機視覺任務。
該數據集可通過zenodo.org訪問(DOI: 10.5281/zenodo.19239154)。

Shared Lexical Task Representations Explain Behavioral Variability In LLMs

2604.22027v1 by Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau, Michael L. Littman, Stephen H. Bach, Ellie Pavlick

One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.

摘要:對大型語言模型(LLMs)最常見的抱怨之一是它們對提示的敏感性——也就是說,它們執行任務或提供正確答案的能力可能會不可預測地依賴於問題的表述方式。我們通過比較兩種非常不同但常用的提示風格來調查這種變化:基於指令的提示,這種提示用自然語言描述任務,以及基於示例的提示,這種提示提供上下文中的少量示範對以說明任務。我們發現,儘管性能在提示的影響下有很大的變化,但模型在不同提示的任務之間仍然會涉及一些共同的基本機制。具體而言,我們識別出任務特定的注意力頭,其輸出字面上描述了任務——我們稱之為詞彙任務頭——並顯示這些頭在不同的提示風格之間是共享的,並觸發隨後的答案生成。我們進一步發現,提示之間的行為變化可以通過這些頭的激活程度來解釋,而失敗至少有時是由於競爭的任務表徵稀釋了目標任務的信號。我們的結果共同呈現出一幅日益清晰的圖景,說明LLMs的內部表徵如何解釋那些對用戶和開發者來說似乎是特立獨行的行為。

Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

2604.21667v1 by Olufunke O. Sarumi, Charles Welch, Daniel Braun

Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators' provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism. We further introduce two explainer architectures: a post-hoc prompt-based explainer and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model. This design enables explanation generation aligned with individual annotator perspectives. Our results show that incorporating explanation modeling substantially improves predictive performance over a baseline annotator-aware classifier, with the prefixed bridge approach achieving more stable label alignment and higher semantic consistency, while the post-hoc approach yields stronger lexical similarity. These findings indicate that modeling explanations as expressions of fine-grained perspective provides a richer and more faithful representation of disagreement. The proposed approaches advance perspectivist modeling by integrating annotator-specific rationales into both predictive and generative components.

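Abstract: Beyond exploring disaggregated labels for modeling perspectives, annotators' rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and the corresponding explanations, fine-tuned on the rationales the annotators provided. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on annotator identity and demographic metadata through a representation-level User Passport mechanism. We further introduce two explainer architectures: a post-hoc prompt-based explainer, and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model. This design aligns explanation generation with the perspectives of individual annotators. Our results show that incorporating explanation modeling substantially improves predictive performance over a baseline annotator-aware classifier; the prefixed bridge approach achieves more stable label alignment and higher semantic consistency, while the post-hoc approach yields stronger lexical similarity. These findings indicate that modeling explanations as expressions of fine-grained perspective provides a richer and more faithful representation of disagreement. The proposed approaches advance perspectivist modeling by integrating annotator-specific rationales into both the predictive and the generative components.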

Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

2604.21640v1 by Yi-Ling Liu, Melvin Laux, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam

Autonomous underwater vehicles are required to perform multiple tasks adaptively and in an explainable manner under dynamic, uncertain conditions and limited sensing, challenges that classical controllers struggle to address. This demands robust, generalizable, and inherently interpretable control policies for reliable long-term monitoring. Reinforcement learning, particularly multi-task RL, overcomes these limitations by leveraging shared representations to enable efficient adaptation across tasks and environments. However, while such policies show promising results in simulation and controlled experiments, they remain opaque and offer limited insight into the agent's internal decision-making, creating gaps in transparency, trust, and safety that hinder real-world deployment. The internal policy structure and task-specific specialization remain poorly understood. To address these gaps, we analyze the internal structure of a pretrained multi-task reinforcement learning network in the HoloOcean simulator for underwater navigation by identifying and comparing task-specific subnetworks responsible for navigating toward different species. We find that in a contextual multi-task reinforcement learning setting with related tasks, the network uses only about 1.5% of its weights to differentiate between tasks. Of these, approximately 85% connect the context-variable nodes in the input layer to the next hidden layer, highlighting the importance of context variables in such settings. Our approach provides insights into shared and specialized network components, useful for efficient model editing, transfer learning, and continual learning for underwater monitoring through a contextual multi-task reinforcement learning method.

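Abstract: Autonomous underwater vehicles must perform multiple tasks adaptively and explainably under dynamic, uncertain conditions and with limited sensing, a challenge that classical controllers struggle to meet. This calls for robust, generalizable, and inherently interpretable control policies for reliable long-term monitoring. Reinforcement learning, particularly multi-task reinforcement learning, overcomes these limitations by using shared representations to adapt efficiently across tasks and environments. However, although such policies show promising results in simulation and controlled experiments, they remain opaque and offer limited insight into the agent's internal decision-making, leaving gaps in transparency, trust, and safety that hinder real-world deployment. The internal policy structure and task-specific specialization remain poorly understood. To address these gaps, we analyze the internal structure of a pretrained multi-task reinforcement learning network in the HoloOcean simulator by identifying and comparing the task-specific subnetworks responsible for navigating toward different species. We find that, in a contextual multi-task reinforcement learning setting with related tasks, the network uses only about 1.5% of its weights to differentiate between tasks. Of these weights, approximately 85% connect the context-variable nodes to the next hidden layer, which highlights the importance of context variables in such settings. Our approach provides insight into shared and specialized network components, which is useful for efficient model editing, transfer learning, and continual learning in underwater monitoring with a contextual multi-task reinforcement learning method.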

On the Role of Preprocessing and Memristor Dynamics in Reservoir Computing for Image Classification

2604.21602v1 by Rishona Daniels, Duna Wattad, Ronny Ronen, David Saad, Shahar Kvatinsky

Reservoir computing (RC) is an emerging recurrent neural network architecture that has attracted growing attention for its low training cost and modest hardware requirements. Memristor-based circuits are particularly promising for RC, as their intrinsic dynamics can reduce network size and parameter overhead in tasks such as time-series prediction and image recognition. Although RC has been demonstrated with several memristive devices, a comprehensive evaluation of device-level requirements remains limited. In this paper, we analyze and explain the operation of a parallel delayed feedback network (PDFN) RC architecture with volatile memristors, focusing on how device characteristics -- such as decay rate, quantization, and variability -- affect reservoir performance. We further discuss strategies to improve data representation in the reservoir using preprocessing methods and suggest potential improvements. The proposed approach achieves 95.89% classification accuracy on MNIST, comparable with the best reported memristor-based RC implementations. Furthermore, the method maintains high robustness under 20% device variability, achieving an accuracy of up to 94.2%. These results demonstrate that volatile memristors can support reliable spatio-temporal information processing and reinforce their potential as key building blocks for compact, high-speed, and energy-efficient neuromorphic computing systems.

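Abstract: Reservoir computing (RC) is an emerging recurrent neural network architecture that has attracted growing attention for its low training cost and modest hardware requirements. Memristor-based circuits are particularly promising for RC, because their intrinsic dynamics can reduce network size and parameter overhead in tasks such as time-series prediction and image recognition. Although RC has been demonstrated on several memristive devices, comprehensive evaluation of device-level requirements remains limited. In this paper, we analyze and explain the operation of a parallel delayed feedback network (PDFN) RC architecture with volatile memristors, focusing on how device characteristics such as decay rate, quantization, and variability affect reservoir performance. We further discuss strategies that use preprocessing to improve data representation in the reservoir, and suggest potential improvements. The proposed approach achieves 95.89% classification accuracy on MNIST, comparable to the best reported memristor-based RC implementations. In addition, the method remains highly robust under 20% device variability, with an accuracy of up to 94.2%. These results show that volatile memristors can support reliable spatio-temporal information processing and reinforce their potential as key building blocks for compact, high-speed, and energy-efficient neuromorphic computing systems.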

Dynamical Priors as a Training Objective in Reinforcement Learning

2604.21464v1 by Sukesh Subaharan

Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP-RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning. Across three minimal environments, we show that dynamical priors systematically alter decision trajectories in task-dependent ways, promoting temporally structured behavior that cannot be explained by generic smoothing. These results demonstrate that training objectives alone can control the temporal geometry of decision-making in RL agents.

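Abstract: Standard reinforcement learning (RL) optimizes policies for reward but places few constraints on how decisions evolve over time. As a result, a policy may perform well while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP-RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, the environment, or the policy architecture, this prior shapes the temporal evolution of action probabilities during learning. Across three minimal environments, we show that dynamical priors systematically alter decision trajectories in task-dependent ways, promoting temporally structured behavior that cannot be explained by generic smoothing. These results show that training objectives alone can control the temporal geometry of decision-making in RL agents.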

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

2604.21263v1 by Michael Bouzinier, Sergey Trifonov, Michael Chumack, Eugenia Lvova, Dmitry Etin

Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate not only accuracy but auditability. Existing formal languages for clinical logic validate syntactic and structural correctness but not whether decision rules use epistemologically appropriate evidence. Methods: Drawing on design-by-contract principles, we introduce meta-predicates -- predicates about predicates -- for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates assert which evidence types are permissible in any given rule. The framework is instantiated in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but that they rest on appropriate evidence in ways that can be independently audited. While demonstrated in genomics, the approach generalises to any domain requiring auditable decision logic.

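Abstract: Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate auditability as well as accuracy. Existing formal languages for clinical logic validate syntactic and structural correctness, but not whether decision rules use epistemologically appropriate evidence. Methods: Drawing on design-by-contract principles, we introduce meta-predicates (predicates about predicates) for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates declare which evidence types are permitted in any given rule. The framework is implemented in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether the rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward demonstrating that decisions are not only accurate but also rest on appropriate evidence in ways that can be independently audited. Although demonstrated in genomics, the approach generalizes to any domain that requires auditable decision logic.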

Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction

2604.21154v1 by Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das

At-home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.

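Abstract: Compliance with at-home physiotherapy remains critically low because of a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars, which fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that uses Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that uses foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline built with Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with autonomous decision-making to scale personalized patient care safely and effectively.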

Propensity Inference: Environmental Contributors to LLM Behaviour

2604.21098v1 by Olli Järviniemi, Oliver Makins, Jacob Merizian, Robert Kirk, Ben Millwood

Motivated by loss of control risks from misaligned AI systems, we develop and apply methods for measuring language models' propensity for unsanctioned behaviour. We contribute three methodological improvements: analysing effects of changes to environmental factors on behaviour, quantifying effect sizes via Bayesian generalised linear models, and taking explicit measures against circular analysis. We apply the methodology to measure the effects of 12 environmental factors (6 strategic in nature, 6 non-strategic) and thus the extent to which behaviour is explained by strategic aspects of the environment, a question relevant to risks from misalignment. Across 23 language models and 11 evaluation environments, we find approximately equal contributions from strategic and non-strategic factors for explaining behaviour, do not find strategic factors becoming more or less influential as capabilities improve, and find some evidence for a trend for increased sensitivity to goal conflicts. Finally, we highlight a key direction for future propensity research: the development of theoretical frameworks and cognitive models of AI decision-making into empirically testable forms.

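Abstract: Motivated by loss-of-control risks from misaligned AI systems, we develop and apply methods for measuring language models' propensity for unsanctioned behaviour. We contribute three methodological improvements: analysing how changes to environmental factors affect behaviour, quantifying effect sizes with Bayesian generalised linear models, and taking explicit measures against circular analysis. We apply the methodology to measure the effects of 12 environmental factors (6 strategic in nature, 6 non-strategic), and thereby the extent to which behaviour is explained by strategic aspects of the environment, a question relevant to misalignment risk. Across 23 language models and 11 evaluation environments, we find that strategic and non-strategic factors contribute approximately equally to explaining behaviour, we do not find strategic factors becoming more or less influential as capabilities improve, and we find some evidence of a trend toward increased sensitivity to goal conflicts. Finally, we highlight a key direction for future propensity research: developing theoretical frameworks and cognitive models of AI decision-making into empirically testable forms.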

SGD at the Edge of Stability: The Stochastic Sharpness Gap

2604.21016v1 by Fangshuo Liao, Afroditi Kolomvaki, Anastasios Kyrillidis

When training neural networks with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian -- the sharpness $S(\boldsymbol{\theta})$ -- rises to $2/\eta$ and hovers there, a phenomenon termed the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbol{\theta}) \leq 2/\eta$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/\eta$, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of Damian et al. (2023), we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $\Delta S = \eta\beta\sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{\boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.

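Abstract: When neural networks are trained with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian, the sharpness $S(\boldsymbol{\theta})$, rises to $2/\eta$ and stays there, a phenomenon known as the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism driven by the third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbol{\theta}) \leq 2/\eta$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/\eta$, and the gap widens as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, which extends the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of Damian et al. (2023), we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap, $\Delta S = \eta\beta\sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{\boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. The formula predicts that smaller batch sizes yield flatter solutions, and it recovers GD when the batch equals the full dataset.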
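The closed-form gap quoted in this abstract can be explored numerically. The sketch below only evaluates $\Delta S = \eta\beta\sigma_{\boldsymbol{u}}^{2}/(4\alpha)$ and reads the predicted equilibrium sharpness as $2/\eta - \Delta S$. The way the noise variance depends on batch size (a `1/B - 1/N` scaling for sampling without replacement), the function names, and every numeric value are assumptions made for illustration; none of them come from the paper.

```python
def sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Equilibrium sharpness gap: Delta S = eta * beta * sigma_u^2 / (4 * alpha)."""
    return eta * beta * sigma_u_sq / (4.0 * alpha)


def noise_variance(per_sample_var, batch_size, dataset_size):
    """ASSUMPTION (not from the abstract): mini-batches are drawn without
    replacement, so the projected gradient-noise variance scales as
    per_sample_var * (1/B - 1/N) and vanishes when B == N."""
    return per_sample_var * (1.0 / batch_size - 1.0 / dataset_size)


if __name__ == "__main__":
    eta, alpha, beta = 0.01, 0.5, 2.0   # hypothetical values
    n, per_sample_var = 1024, 100.0     # hypothetical values
    for b in (16, 64, 256, 1024):
        gap = sharpness_gap(eta, alpha, beta, noise_variance(per_sample_var, b, n))
        # Predicted equilibrium sharpness sits below the GD value 2 / eta.
        print(f"B={b:5d}  gap={gap:.4f}  sharpness={2.0 / eta - gap:.4f}")
```

Under this assumed noise model the gap shrinks monotonically as the batch grows and is exactly zero at the full batch, which matches the abstract's statement that the formula recovers GD in that limit.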

Convergent Evolution: How Different Language Models Learn Similar Number Representations

2604.20817v1 by Deqing Fu, Tianyi Zhou, Mikhail Belkin, Vatsal Sharan, Robin Jia

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.

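Abstract: Language models trained on natural text learn to represent numbers using periodic features whose dominant periods are $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: although Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features with period-$T$ spikes in the Fourier domain, only some of them learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier-domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features and find that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes by which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight convergent evolution in feature learning: a diverse range of models learn similar features from different training signals.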
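A toy example may help separate the two tiers described in this abstract. The sketch below builds a period-10 feature over the integers 0 to 99, confirms that its Fourier spectrum peaks at period 10, and then shows that a lone cosine feature assigns the same value to residues 3 and 7, so no linear readout of that single feature can tell them apart, whereas adding the matching sine component does separate them. This is an illustration constructed for this digest, not the paper's proof or its experimental setup; the function names are hypothetical.

```python
import cmath
import math


def period_feature(n, period):
    """Map an integer to a (cos, sin) pair with the given period."""
    angle = 2.0 * math.pi * n / period
    return math.cos(angle), math.sin(angle)


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform for frequencies 0..N/2."""
    size = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / size) for t, x in enumerate(signal)))
        for k in range(size // 2 + 1)
    ]


def dominant_period(signal):
    """Period (in samples) of the strongest non-constant Fourier component."""
    magnitudes = dft_magnitudes(signal)
    peak = max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
    return len(signal) / peak


if __name__ == "__main__":
    cos_only = [period_feature(n, 10)[0] for n in range(100)]
    print("dominant period:", dominant_period(cos_only))
    # The cosine values for residues 3 and 7 coincide; the sine values differ in sign.
    print("cos:", period_feature(3, 10)[0], period_feature(7, 10)[0])
    print("sin:", period_feature(3, 10)[1], period_feature(7, 10)[1])
```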

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

2604.20795v1 by Pavel Salovskii, Iuliia Gorshkova

This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.

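Abstract: This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Rather than relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph in RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for constructing ontologies from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation with SHACL and OWL constraints and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared with baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, turning the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including the lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.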

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

2604.20791v1 by Mariano Barone, Francesco Di Serio, Roberto Moio, Marco Postiglione, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.

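Abstract: Large Language Models (LLMs) are increasingly deployed in healthcare, yet how well their communication aligns with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (by up to 6.87 FKGL points for GPT-5), but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (mean up to 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer the rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than as replacements for clinical expertise.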
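The FKGL figures in this abstract refer to the Flesch-Kincaid Grade Level, whose standard formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula with a deliberately crude vowel-group syllable counter; the counter and the tokenization rules are assumptions for illustration, and the study's own tooling is not described in the abstract, so scores from this sketch should not be compared with the reported numbers.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (ASSUMPTION;
    dedicated readability tools use dictionaries or finer rules)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fkgl(text):
    """Flesch-Kincaid Grade Level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    print(fkgl("The cat sat."))
    print(fkgl("Pharmacological intervention necessitates comprehensive evaluation."))
```

Short sentences built from one-syllable words score lowest, which is the direction the empathy-oriented prompting moved GPT-5's answers.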

Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

2604.20763v1 by Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton, Yuan Xue

Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce semantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.

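Abstract: Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem and show that metric reliability is fundamentally limited by how the evaluation set is constructed. We further introduce semantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.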
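The contrast between aggregate and stratified scoring can be made concrete with a small sketch. It assumes each evaluation query is already labeled with a stratum (standing in for the paper's entity-based clusters) and carries a per-query score; it then reports the plain average, the stratum-balanced average, and the strata that have no queries at all. The stratum names, the scores, and the equal-weight balancing rule are all hypothetical choices for illustration, not the paper's estimator.

```python
from collections import defaultdict


def stratified_report(query_scores, corpus_strata):
    """query_scores: list of (stratum, score) pairs, one per evaluation query.
    corpus_strata: every stratum present in the corpus."""
    by_stratum = defaultdict(list)
    for stratum, score in query_scores:
        by_stratum[stratum].append(score)
    per_stratum = {s: sum(v) / len(v) for s, v in by_stratum.items()}
    return {
        # Plain average over queries: dominated by over-represented strata.
        "aggregate": sum(score for _, score in query_scores) / len(query_scores),
        # Equal weight per covered stratum (ASSUMPTION: one simple balancing rule).
        "balanced": sum(per_stratum.values()) / len(per_stratum),
        # Strata with no queries, i.e. coverage gaps the query set cannot reveal.
        "missing": sorted(set(corpus_strata) - set(per_stratum)),
        "per_stratum": per_stratum,
    }


if __name__ == "__main__":
    scores = [("cardiology", 1.0)] * 8 + [("oncology", 0.0)] * 2
    report = stratified_report(scores, ["cardiology", "oncology", "neurology"])
    print(report["aggregate"], report["balanced"], report["missing"])
```

In this toy case the aggregate score looks healthy while the balanced score and the missing-stratum list expose that one topic fails completely and another is never tested.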

Participatory provenance as representational auditing for AI-mediated public consultation

2604.20711v1 by Sachit Mahajan

Artificial intelligence is increasingly deployed to synthesize large-scale public input in policy consultations and participatory processes. Yet no formal framework exists for auditing whether these summaries faithfully represent the source population, an accountability gap that existing approaches to AI explainability, grounding and hallucination detection do not address because they focus on output quality rather than input fidelity. Here, participatory provenance is introduced: a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization. Applied to Canada's 2025-2026 national AI Strategy consultation (n = 5,253 respondents across two independent policy topics), the framework reveals that both official government summaries underperform a random-participant baseline (-9.1% and -8.0% coverage degradation), with 16.9% and 15.3% of participants effectively excluded. Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI (33-88% exclusion rates). Brevity, semantic isolation and rhetorical register independently predict representational outcome. An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.

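Abstract: Artificial intelligence is increasingly used to synthesize large-scale public input in policy consultations and participatory processes. Yet no formal framework exists for auditing whether these summaries faithfully represent the source population. This is an accountability gap that existing approaches to AI explainability, grounding, and hallucination detection do not address, because they focus on output quality rather than input fidelity. Here, participatory provenance is introduced: a measurement framework grounded in optimal transport theory, causal inference, and semantic analysis that tracks how individual public submissions are transformed, filtered, or lost through AI-mediated summarization. Applied to Canada's 2025-2026 national AI Strategy consultation (n = 5,253 respondents across two independent policy topics), the framework reveals that both official government summaries underperform a random-participant baseline (-9.1% and -8.0% coverage degradation), with 16.9% and 15.3% of participants effectively excluded. Exclusion is concentrated in clusters that express dissent, scepticism, and critique of AI (33-88% exclusion rates). Brevity, semantic isolation, and rhetorical register independently predict representational outcome. An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight.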

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

2604.20623v1 by Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

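Abstract: Traditional change detection identifies where changes occur, but does not explain in natural language what changed. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained, localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering that contains 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, it is the first remote sensing change question-answering benchmark designed explicitly for this kind of fine-grained, reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks; they are then screened with an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.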

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

2604.20544v1 by Zimu Jia, Mingjie Xu, Andrew Estornell, Jiaheng Wei

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.

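Abstract: The efficacy of Large Vision-Language Models (LVLMs) depends critically on the quality of their training data, which requires a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, suffer from inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws such as logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects, providing a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into their constituent cognitive components (visual description, subjective inference, and factual claim), enabling targeted analysis. Third, we instantiate this paradigm in EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on datasets that were orders of magnitude larger. We also show that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.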

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

2604.20441v1 by Yingyong Hou, Xinyuan Lao, Huimei Wang, Qianyu Yao, Wei Chen, Bocheng Huang, Fei Sun, Yuxian Lv, Weiqi Lei, Xueqian Wen, Pengfei Xia, Zhujun Tan, Shengyang Xie

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.

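Abstract: Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework that assesses the release readiness of a skill before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural mismatch between the rubric and the experts. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.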
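One of the agreement statistics named in this abstract, linearly weighted Cohen's kappa, is compact enough to sketch. The weighted form penalizes a disagreement in proportion to how many ordinal steps apart the two ratings are: kappa_w = 1 − (weighted observed disagreement) / (weighted chance disagreement), with weight |i − j| / (k − 1). The integer coding of the four release dispositions (Reject = 0 up to Production Ready = 3) and the sample ratings are assumptions for illustration; the study's data are not reproduced here.

```python
def linear_weighted_kappa(rater_a, rater_b, num_levels):
    """Linearly weighted Cohen's kappa for two raters on levels 0..num_levels-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Joint distribution of the two raters' labels.
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(num_levels)]
    col = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    disagree_obs = disagree_exp = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = abs(i - j) / (num_levels - 1)  # linear disagreement weight
            disagree_obs += weight * observed[i][j]
            disagree_exp += weight * row[i] * col[j]  # chance-level disagreement
    if disagree_exp == 0.0:
        raise ValueError("kappa is undefined when both raters use a single level")
    return 1.0 - disagree_obs / disagree_exp


if __name__ == "__main__":
    # Hypothetical coding: 0 = Reject, 1 = Beta Only, 2 = Limited Release, 3 = Production Ready
    system = [3, 2, 2, 1, 0, 3, 1, 2]
    expert = [3, 2, 1, 1, 0, 2, 1, 3]
    print(linear_weighted_kappa(system, expert, num_levels=4))
```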

Surrogate modeling for interpreting black-box LLMs in medical predictions

2604.20331v2 by Changho Han, Songsoo Kim, Dong Won Kim, Leo Anthony Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.

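Summary: Large language models (LLMs) are trained on vast datasets and encode extensive real-world knowledge in their parameters, but their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which approximates complex systems with simplified models, offers a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, the framework approximates the latent LLM knowledge space from observable elements (input-output pairs) by prompting extensively across a comprehensive range of simulated scenarios. In proof-of-concept experiments on medical predictions, we show that the framework is effective at revealing the extent to which LLMs "perceive" each input variable in relation to the output. In particular, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions in LLM-encoded knowledge. By disclosing these issues, the framework can serve as a red-flag indicator that supports the safe and reliable application of these models.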

Stateless Decision Memory for Enterprise AI Agents

2604.20158v1 by Vasundra Srinivasan

Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval-augmented pipelines despite a decade of increasingly sophisticated stateful memory architectures. We argue this reflects a hidden requirement: regulated deployment is load-bearing on four systems properties (deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale), and stateful architectures violate them by construction. We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) and reasoning coherence by +0.53 (h=1.13, p=0.0034), paired permutation, n=10. DPM is additionally 7-15x faster at binding budgets, making one LLM call at decision time instead of N. A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but the asymmetry is structural: DPM exposes one nondeterministic call; summarization exposes N compounding calls. The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision while summarization logs 83-97 on LongHorizon-Bench. We conclude with TAMS, a practitioner heuristic for architecture selection, and a failure analysis of stateful memory under enterprise operating conditions. The contribution is the argument that statelessness is the load-bearing property explaining enterprise's preference for weaker but replayable retrieval pipelines, and that DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.

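Summary: Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is still dominated by retrieval-augmented pipelines, despite a decade of increasingly sophisticated stateful memory architectures. We argue that this reflects a hidden requirement: regulated deployment depends on four systems properties (deterministic replay, auditable rationale, multi-tenant isolation, and statelessness for horizontal scale), and stateful architectures violate them by construction. We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) and reasoning coherence by +0.53 (h=1.13, p=0.0034), paired permutation, n=10. DPM is also 7-15x faster at binding budgets, because it makes one LLM call at decision time instead of N. A determinism study of 10 replays per case at temperature zero shows that both architectures inherit residual API-level nondeterminism, but the asymmetry is structural: DPM exposes one nondeterministic call, whereas summarization exposes N compounding calls. The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision, while summarization logs 83-97 on LongHorizon-Bench. We conclude with TAMS, a practitioner heuristic for architecture selection, and a failure analysis of stateful memory under enterprise operating conditions. The contribution is the argument that statelessness is the load-bearing property explaining enterprises' preference for weaker but replayable retrieval pipelines, and that DPM shows this property is attainable without the decisioning penalty that retrieval pays.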
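
The abstract describes DPM as an append-only event log plus a single task-conditioned projection at decision time. The sketch below illustrates that general shape only and is not the paper's implementation. `EventLog`, `project`, and `keyword_projection` are hypothetical names, and the keyword filter is a deterministic stand-in for the single LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Event:
    seq: int
    text: str


@dataclass
class EventLog:
    """Append-only log: events can be added and read, never edited or removed."""
    _events: List[Event] = field(default_factory=list)

    def append(self, text: str) -> Event:
        event = Event(seq=len(self._events), text=text)
        self._events.append(event)
        return event

    def snapshot(self) -> Tuple[Event, ...]:
        return tuple(self._events)


def project(log: EventLog, task: str, budget_chars: int,
            projector: Callable[[str, str, int], str]) -> str:
    """Build the decision context with one projector call over the whole log."""
    transcript = "\n".join(f"[{e.seq}] {e.text}" for e in log.snapshot())
    return projector(task, transcript, budget_chars)


def keyword_projection(task: str, transcript: str, budget_chars: int) -> str:
    # Assumption: stands in for the LLM; keeps lines that share a word with the task.
    words = set(task.lower().split())
    kept = [ln for ln in transcript.splitlines() if words & set(ln.lower().split())]
    return "\n".join(kept)[:budget_chars]


log = EventLog()
log.append("applicant income verified at 85000")
log.append("weather was sunny")
context = project(log, "income", 200, keyword_projection)
replayed = project(log, "income", 200, keyword_projection)
```

Because no summary state is carried between steps, replaying a decision only requires the log and the task. In this toy version the projector is deterministic, so `context` and `replayed` are identical.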

From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI

2604.20055v1 by Patrick Vossler, Jean Feng, Venkat Sivaraman, Robert Gallo, Hemal Kanzaria, Dana Freiser, Christopher Ross, Amy Ou, James Marks, Susan Ehrlich, Christopher Peabody, Lucas Zier

Hospital Quality Improvement (QI) plays a critical role in optimizing healthcare delivery by translating high-level hospital goals into actionable solutions. A critical step of QI is to identify the key modifiable contributing factors, a process we call QI factor discovery, typically through expert-driven semi-structured qualitative tools like fishbone diagrams, chart reviews, and Lean Healthcare methods. AI has the potential to transform and accelerate QI factor discovery, which is traditionally time- and resource-intensive and limited in reproducibility and auditability. Nevertheless, current AI alignment methods assume the task is well-defined, whereas QI factor discovery is an exploratory, fuzzy, and iterative sense-making process that relies on complex implicit expert judgments. To design an AI pipeline that formalizes the QI process while preserving its exploratory components, we propose viewing the task as learning not only LLM prompts but also the overarching natural-language specifications. In particular, we map QI factor discovery to steps of the classical AI/ML development process (problem formalization, model learning, and model validation) where the specifications are tunable hyperparameters. Domain experts and AI agents iteratively refine both the overarching specifications and AI pipeline until AI extractions are concordant with expert annotations and aligned with clinical objectives. We applied this "Human-AI Spec-Solution Co-optimization" framework at an urban safety-net hospital to identify factors driving prolonged length of stay and unplanned 30-day readmissions. The resulting AI-for-QI pipelines achieved $\ge 70\%$ concordance with expert annotations. Compared to prior manual Lean analyses, the AI pipeline was substantially more efficient, recovered previous findings, surfaced new modifiable factors, and produced auditable reasoning traces.

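Summary: Hospital Quality Improvement (QI) plays a critical role in optimizing healthcare delivery by translating high-level hospital goals into actionable solutions. A key step in QI is identifying the main modifiable contributing factors, which we call QI factor discovery. This is usually done with expert-driven, semi-structured qualitative tools such as fishbone diagrams, chart reviews, and Lean Healthcare methods. AI has the potential to transform and accelerate QI factor discovery, which is traditionally time- and resource-intensive and limited in reproducibility and auditability. However, current AI alignment methods assume that the task is well-defined, whereas QI factor discovery is an exploratory, fuzzy, and iterative sense-making process that relies on complex implicit expert judgments. To design an AI pipeline that formalizes the QI process while preserving its exploratory components, we propose viewing the task as learning not only LLM prompts but also the overarching natural-language specifications. Specifically, we map QI factor discovery to the steps of the classical AI/ML development process (problem formalization, model learning, and model validation), with the specifications as tunable hyperparameters. Domain experts and AI agents iteratively refine both the overarching specifications and the AI pipeline until the AI extractions are concordant with expert annotations and aligned with clinical objectives. We applied this "Human-AI Spec-Solution Co-optimization" framework at an urban safety-net hospital to identify factors driving prolonged length of stay and unplanned 30-day readmissions. The resulting AI-for-QI pipelines achieved $\ge 70\%$ concordance with expert annotations. Compared to prior manual Lean analyses, the AI pipeline was substantially more efficient, recovered previous findings, surfaced new modifiable factors, and produced auditable reasoning traces.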

TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs

2604.20043v1 by Ziyi Wang, Chen Zhang, Wenjun Peng, Qi Wu, Xinyu Wang

Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.

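Summary: Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and on other agents. We present TriEx, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents that are updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.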

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

2604.19598v2 by Kihyuk Lee

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.

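Summary: This study compared the repeated-generation consistency of exercise prescription outputs across three large language models (LLMs), namely GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 outputs that were analyzed along four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score came from text duplication rather than consistent reasoning. Identical decoding settings therefore yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection is a clinical rather than merely technical decision, and that output behavior under repeated generation should be treated as a core criterion for the reliable deployment of LLM-based exercise prescription systems.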
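
The abstract separates two things that a single similarity score can hide: how many repeated outputs are distinct, and how similar the outputs are to each other. The following sketch illustrates that distinction under stated assumptions. `difflib.SequenceMatcher` is a character-level stand-in for the semantic similarity measure used in the study, and the sample outputs are invented.

```python
from difflib import SequenceMatcher
from itertools import combinations


def unique_output_ratio(outputs):
    """Share of distinct strings among repeated generations (1.0 = all unique)."""
    return len(set(outputs)) / len(outputs)


def mean_pairwise_similarity(outputs):
    """Average similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Hypothetical repeated generations for one scenario.
duplicated = ["walk 30 min daily"] * 3 + ["walk 20 min daily"]
varied = ["walk 30 min daily", "brisk walking, 30 minutes per day",
          "30 min of walking each day", "daily 30-minute walk"]

dup_ratio = unique_output_ratio(duplicated)
var_ratio = unique_output_ratio(varied)
dup_similarity = mean_pairwise_similarity(duplicated)
```

A model that repeats the same text verbatim scores high on similarity and low on the unique-output ratio, which is the pattern the abstract attributes to Gemini 2.5 Flash.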

Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

2604.19538v1 by Farbod Zorriassatine, Ahmad Lotfi

Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.

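Summary: Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely because they struggle to handle real-world complexity consistently, in particular poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and addressed more effectively through an agentic AI system. More broadly, this perspective enables early identification of subtle deviations in movement patterns associated with increased risk, whether they arise from age-related decline, fatigue, or environmental factors. Although the technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights the potential value. The framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.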

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

2604.19485v1 by Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu, Jiazheng Zhang, Shihan Dou, Songyang Gao, Zhenhua Han, Binghai Wang, Rui Zheng, Xuanjing Huang, Tao Gui, Yansong Feng

Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.

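Summary: Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives such as GRPO have been widely adopted because of their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates that the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO, regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.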
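
The gating rule in the abstract (use the critic when batch-level explained variance is positive, otherwise fall back to the batch mean) can be sketched as follows. This is a simplified illustration, not the EVPO implementation. It assumes one scalar return and one critic value per sample, uses the common definition EV = 1 - Var(R - V) / Var(R), and treats a zero-variance batch as EV = 0 by convention. The function names are hypothetical.

```python
from statistics import fmean, pvariance


def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns), computed over one batch."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        return 0.0  # assumption: treat a constant-return batch as "critic not helpful"
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


def gated_advantages(returns, values):
    """Critic baseline when EV > 0, batch-mean baseline otherwise."""
    ev = explained_variance(returns, values)
    if ev > 0:
        return ev, [r - v for r, v in zip(returns, values)]
    mean_return = fmean(returns)
    return ev, [r - mean_return for r in returns]


returns = [1.0, 0.0, 1.0, 0.0]
ev_good, adv_good = gated_advantages(returns, [1.0, 0.0, 1.0, 0.0])  # accurate critic
ev_bad, adv_bad = gated_advantages(returns, [0.0, 1.0, 0.0, 1.0])    # anti-correlated critic
```

In the second call the critic's predictions make the residuals more variable than the returns themselves, so EV is negative and the batch-mean baseline is used instead.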

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

2604.19457v1 by Vasundra Srinivasan

Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.

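Summary: Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel, regulatory-grounded axis; CAR is a measurement axis that separates coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication, with deterministic ground-truth construction. Running six memory architectures, we find structure that aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit to a decision on every case, exposing a decisional-alignment axis that the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude; aggregate accuracy would have hidden this axis-level reversal. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain in two steps: build a fact schema, and calibrate the CRR auditor prompt.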
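
The abstract states that calibrated abstention separates coverage from accuracy, and that every architecture committed on every case. The sketch below illustrates that separation in the generic selective-prediction sense. It is not the paper's CAR metric, whose exact definition the abstract does not give. `None` marks an abstention, and the function name and data are invented for illustration.

```python
def coverage_and_selective_accuracy(decisions, truths):
    """Return (coverage, accuracy on committed cases); None means the agent abstained."""
    committed = [(d, t) for d, t in zip(decisions, truths) if d is not None]
    coverage = len(committed) / len(decisions)
    if not committed:
        return coverage, None  # accuracy is undefined when nothing was decided
    correct = sum(1 for d, t in committed if d == t)
    return coverage, correct / len(committed)


truths = ["approve", "deny", "deny", "deny"]

# An agent that abstains on one case it would have found hard.
cov_abstain, acc_abstain = coverage_and_selective_accuracy(
    ["approve", None, "deny", "approve"], truths)

# An agent that commits on every case, as all six architectures did in the study.
cov_commit, acc_commit = coverage_and_selective_accuracy(
    ["approve", "approve", "deny", "approve"], truths)
```

Reporting the pair rather than a single success rate makes it visible when full coverage is bought at the cost of accuracy on cases where abstaining would have been appropriate.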

TACENR: Task-Agnostic Contrastive Explanations for Node Representations

2604.19372v1 by Vasiliki Papanikou, Evaggelia Pitoura

Graph representation learning has achieved notable success in encoding graph-structured data into latent vector spaces, enabling a wide range of downstream tasks. However, these node representations remain opaque and difficult to interpret. Existing explainability methods primarily focus on supervised settings or on explaining individual representation dimensions, leaving a critical gap in explaining the overall structure of node representations. In this paper, we propose TACENR (Task-Agnostic Contrastive Explanations for Node Representations), a local explanation method that identifies not only the attribute features but also the proximity and structural features that contribute most in the representation space. TACENR builds on contrastive learning, through which we learn a similarity function in the representation space that reveals which features play an important role in the representation of a node. While our focus is on task-agnostic explanations, TACENR can be applied to supervised scenarios as well. Experimental results demonstrate that proximity and structural features play a significant role in shaping node representations and that our supervised variant performs comparably to existing task-specific approaches in identifying the most impactful features.

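Summary: Graph representation learning has achieved notable success in encoding graph-structured data into latent vector spaces, enabling a wide range of downstream tasks. However, these node representations remain opaque and difficult to interpret. Existing explainability methods focus mainly on supervised settings or on explaining individual representation dimensions, leaving an important gap in explaining the overall structure of node representations. In this paper, we propose TACENR (Task-Agnostic Contrastive Explanations for Node Representations), a local explanation method that identifies not only attribute features but also the proximity and structural features that contribute most in the representation space. TACENR builds on contrastive learning, through which we learn a similarity function in the representation space that reveals which features play an important role in a node's representation. Although our focus is on task-agnostic explanations, TACENR can also be applied to supervised scenarios. Experimental results show that proximity and structural features play a significant role in shaping node representations, and that our supervised variant performs comparably to existing task-specific approaches in identifying the most impactful features.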

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

2604.19281v1 by Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.

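Summary: The use of Large Language Models (LLMs) to help patients with medical questions is becoming increasingly common. However, most measures currently used to evaluate these models in this setting only assess how closely a model's answers match semantically, so they do not truly reflect the model's medical accuracy or the associated health equity risks. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score), which evaluates four components separately: entity recognition, semantic similarity, factual consistency, and structured information completeness. We rigorously review the performance of three well-known and widely used LLMs on 48 public health-related topics drawn from high-quality, authoritative information sources. Our analyses reveal a major discrepancy between the models' semantic accuracy and entity accuracy. Our assessment of all three models shows that each exhibits almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across public health topics: most of the models perform 13.8% below the overall average on all topics related to chronic conditions that occur in older and minority populations, which indicates what is known as condition-based algorithmic discrimination. Our findings also show that prompt engineering alone does not compensate for basic architectural limitations in how these models extract medical entities, and they raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.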

Gradient-Based Program Synthesis with Neurally Interpreted Languages

2604.18907v1 by Matthew V. Macfarlane, Clément Bonnet, Herke van Hoof, Levi H. S. Lelis

A central challenge in program induction has long been the trade-off between symbolic and neural approaches. Symbolic methods offer compositional generalisation and data efficiency, yet their scalability is constrained by formalisms such as domain-specific languages (DSLs), which are labour-intensive to create and may not transfer to new domains. In contrast, neural networks flexibly learn from data but tend to generalise poorly in compositional and out-of-distribution settings. We bridge this divide with an instance of a Latent Adaptation Network architecture named Neural Language Interpreter (NLI), which learns its own discrete, symbolic-like programming language end-to-end. NLI autonomously discovers a vocabulary of primitive operations and uses a novel differentiable neural executor to interpret variable-length sequences of these primitives. This allows NLI to represent programs that are not bound to a constant number of computation steps, enabling it to solve more complex problems than those seen during training. To make these discrete, compositional program structures amenable to gradient-based optimisation, we employ the Gumbel-Softmax relaxation, enabling the entire model to be trained end-to-end. Crucially, this same differentiability enables powerful test-time adaptation. At inference, NLI's program inductor provides an initial program guess. This guess is then refined via gradient descent through the neural executor, enabling efficient search for the neural program that best explains the given data. We demonstrate that NLI outperforms in-context learning, test-time training, and continuous latent program networks on tasks that require combinatorial generalisation and rapid adaptation to unseen tasks. Our results establish a new path toward models that combine the compositionality of discrete languages with the gradient-based search and end-to-end learning of neural networks.

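Summary: A central challenge in program induction has long been the trade-off between symbolic and neural approaches. Symbolic methods offer compositional generalisation and data efficiency, yet their scalability is constrained by formalisms such as domain-specific languages (DSLs), which are labour-intensive to create and may not transfer to new domains. Neural networks, in contrast, learn flexibly from data but tend to generalise poorly in compositional and out-of-distribution settings. We bridge this divide with an instance of a Latent Adaptation Network architecture named Neural Language Interpreter (NLI), which learns its own discrete, symbolic-like programming language end-to-end. NLI autonomously discovers a vocabulary of primitive operations and uses a novel differentiable neural executor to interpret variable-length sequences of these primitives. This lets NLI represent programs that are not bound to a constant number of computation steps, so it can solve more complex problems than those seen during training. To make these discrete, compositional program structures amenable to gradient-based optimisation, we employ the Gumbel-Softmax relaxation, which allows the entire model to be trained end-to-end. Crucially, the same differentiability enables powerful test-time adaptation. At inference, NLI's program inductor provides an initial program guess, which is then refined by gradient descent through the neural executor, enabling an efficient search for the neural program that best explains the given data. We demonstrate that NLI outperforms in-context learning, test-time training, and continuous latent program networks on tasks that require combinatorial generalisation and rapid adaptation to unseen tasks. Our results establish a new path toward models that combine the compositionality of discrete languages with the gradient-based search and end-to-end learning of neural networks.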
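
The abstract relies on the Gumbel-Softmax relaxation to make discrete program choices differentiable. As a reminder of what that relaxation computes, here is a minimal forward-pass sketch in pure Python. It shows only the sampling step, with no gradients and no connection to the NLI architecture, and the function name and arguments are chosen for illustration.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    scaled = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        scaled.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
soft_sample = gumbel_softmax([2.0, 0.5, -1.0], temperature=1.0, rng=rng)
sharp_sample = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.1, rng=rng)
```

Lower temperatures push samples toward one-hot vectors, which approximates a discrete choice of primitive, while higher temperatures give smoother vectors. In an automatic-differentiation framework the same expression is differentiable with respect to the logits.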

AI scientists produce results without reasoning scientifically

2604.18805v1 by Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. These patterns persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.

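Summary: Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet it is poorly understood whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting. Here, we evaluate LLM-based scientific agents across eight domains, from workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. These patterns persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.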

Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

2604.18753v1 by Andrew Wang, Ellie Pavlick, Ritambhara Singh

An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. As clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal via diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by re-framing clinical diagnosis as an autoregressive sequence modeling task, utilizing causal decoders from large language models (LLMs) to model a patient's multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities in datasets with missingness in a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to move beyond performance boosts and find that across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.

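Summary: An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. Because clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal with diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by reframing clinical diagnosis as an autoregressive sequence modeling task, using causal decoders from large language models (LLMs) to model a patient's multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities from datasets with missingness into a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to look beyond performance gains and find that, across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.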

On the Importance and Evaluation of Narrativity in Natural Language AI Explanations

2604.18311v1 by Mateusz Cedro, David Martens

Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In this study, we draw on insights from social sciences and linguistics, and argue that XAI explanations should be presented in the form of narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to further advance the field, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with the findings from the linguistic and social science literature.

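Summary: Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. Integrating Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Such explanations indicate what influences the prediction, but they do not explain why the prediction occurs. In this study, we draw on insights from the social sciences and linguistics and argue that XAI explanations should be presented as narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to advance the field further, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with findings from the linguistic and social science literature.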

Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support

2604.18302v1 by Eranga Bandara, Asanga Gunaratna, Ross Gore, Anita H. Clayton, Christopher K. Rhea, Sachini Rajapakse, Isurunima Kularathna, Sachin Shetty, Ravi Mukkamala, Xueping Liang, Preston Samuel, Atmaram Yarlagadda

Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare -- particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution -- ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs -- Gemma, Phi-3.5-mini, and Qwen2 -- selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.

摘要:隱私代表了在心理健康護理中人工智慧採用的最關鍵但卻未被充分解決的障礙之一——特別是在軍事、矯正和遠程醫療等高敏感度操作環境中,患者數據暴露的風險可能完全阻礙尋求幫助的行為。現有的人工智慧輔助精神病決策支持系統主要依賴雲端推理管道,這要求敏感的患者數據離開設備並經過外部伺服器,從而在這些環境中造成不可接受的隱私和安全風險。在本文中,我們提出了一個零外洩的、基於設備的人工智慧平台,用於隱私保護的精神病決策支持,作為跨平台的移動應用程序部署。所提出的系統擴展了我們之前在精神病診斷標準化方面的精調大型語言模型聯盟的工作,通過根本性地重新架構推理管道以實現完全本地執行——確保在任何階段都不會將患者數據傳輸到、處理或存儲在任何外部伺服器上。該平台整合了三個輕量級、經過精調和量化的開源大型語言模型——Gemma、Phi-3.5-mini和Qwen2——這些模型因其緊湊的架構和在資源受限的移動硬體上的證明效率而被選中。一個基於設備的協調層協調集成推理和基於共識的診斷推理,生成與DSM-5對齊的條件評估。該平台旨在協助臨床醫生進行鑑別診斷和證據鏈接的症狀映射,並支持患者自我篩查,並提供適當的臨床保障。初步評估表明,所提出的零外洩部署在診斷準確性上與其伺服器端前身相當,同時在商用移動硬體上保持實時推理延遲。
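The consensus step mentioned in the abstract above can be pictured as a simple majority vote over the three on-device models. This is only a minimal sketch under stated assumptions: the function `consensus_label`, the `min_agreement` threshold, and the example outputs are hypothetical, and the paper's actual orchestration layer may work quite differently.

```python
from collections import Counter


def consensus_label(votes, min_agreement=2):
    """Return the label that at least `min_agreement` models agree on, else None.

    `votes` maps a model name to the label that model proposed.
    This is a hypothetical stand-in for consensus-based reasoning.
    """
    if not votes:
        return None
    label, count = Counter(votes.values()).most_common(1)[0]
    return label if count >= min_agreement else None


# Made-up outputs for illustration; the keys reuse the model names from the abstract.
votes = {
    "Gemma": "major depressive disorder",
    "Phi-3.5-mini": "major depressive disorder",
    "Qwen2": "generalized anxiety disorder",
}
print(consensus_label(votes))  # major depressive disorder
```

Returning `None` when no label reaches the threshold leaves room for a fallback such as deferring to the clinician, which seems in keeping with the safeguards the abstract mentions.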

Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning

2604.19823v1 by Khalil Akremi, Mariem Handous, Zied Bouslama, Farah Bassalah, Maryem Jebali, Mariem Hanachi, Ines Abdeljaoued-Tej

Rabies remains a major public health concern across many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold standard diagnostic methods rely heavily on fluorescence microscopy, necessitating skilled laboratory personnel for the accurate interpretation of results. Such expertise is often scarce, particularly in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline utilizing fluorescent image analysis through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). Three distinct data augmentation strategies were evaluated to enhance model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results demonstrate that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, utilizing Geometric & Color augmentation and selected through stratified 3-fold cross-validation, achieved optimal classification performance on cropped images. Despite constraints posed by class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection with significant potential for further optimization. An online tool was deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.

摘要:狂犬病在許多非洲和亞洲國家仍然是一個主要的公共健康問題,準確的診斷對於有效的流行病學監測至關重要。黃金標準的診斷方法在很大程度上依賴於螢光顯微鏡,這需要熟練的實驗室人員來準確解讀結果。這種專業知識往往稀缺,尤其是在年樣本量較低的地區。本文提出了一種自動化的、基於人工智慧的診斷系統,旨在解決這些挑戰。我們開發了一個穩健的流程,通過轉移學習利用四種深度學習架構進行螢光影像分析:EfficientNetB0、EfficientNetB2、VGG16 和 Vision Transformer (ViTB16)。我們評估了三種不同的數據增強策略,以提高模型在155張顯微鏡影像(123張陽性和32張陰性)數據集上的泛化能力。我們的結果顯示,TrivialAugmentWide 是最有效的增強技術,因為它在改善模型穩健性的同時保留了關鍵的螢光模式。使用幾何和顏色增強的 EfficientNetB0 模型,通過分層三折交叉驗證選擇,實現了在裁剪影像上的最佳分類性能。儘管受到類別不平衡和數據集大小限制的挑戰,這項工作證實了深度學習在自動化狂犬病診斷中的可行性。所提出的方法實現了快速且可靠的檢測,並具有進一步優化的重大潛力。還部署了一個在線工具以促進實際訪問,為未來的醫學影像應用建立了一個框架。本研究強調了優化的深度學習模型在轉變狂犬病診斷和改善公共健康結果方面的潛力。
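With only 155 images and a strong class imbalance, stratification matters: each fold should keep roughly the same positive-to-negative ratio. The sketch below shows one simple way to build stratified 3-fold splits using only the standard library. The helper `stratified_folds` is illustrative, not the authors' code, and the label list is synthetic, built only from the class counts in the abstract.

```python
import random


def stratified_folds(labels, n_folds=3, seed=0):
    """Assign sample indices to folds so each fold keeps roughly the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class, then deal them out round-robin.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


# Synthetic labels matching the counts in the abstract: 123 positive, 32 negative.
labels = [1] * 123 + [0] * 32
folds = stratified_folds(labels)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives, len(fold) - positives)
```

With these counts every fold receives 41 positives and 10 or 11 negatives, so each validation fold sees only about ten negative images. That is one reason results on such a small dataset should be read cautiously.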

ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks

2604.18052v1 by Saeid Sheikhi, Panos Kostakos, Lauri Loven

Intrusion detection systems (IDSs) for 5G networks must handle complex, high-volume traffic. Although opaque "black-box" models can achieve high accuracy, their lack of transparency hinders trust and effective operational response. We propose ExAI5G, a framework that prioritizes interpretability by integrating a Transformer-based deep learning IDS with logic-based explainable AI (XAI) techniques. The framework uses Integrated Gradients to attribute feature importance and extracts a surrogate decision tree to derive logical rules. We introduce a novel evaluation methodology for LLM-generated explanations, using a powerful evaluator LLM to assess actionability and measuring their semantic similarity and faithfulness. On a 5G IoT intrusion dataset, our system achieves 99.9% accuracy and a 0.854 macro F1-score, demonstrating strong performance. More importantly, we extract 16 logical rules with 99.7% fidelity, making the model's reasoning transparent. The evaluation demonstrates that modern LLMs can generate explanations that are both faithful and actionable, indicating that it is possible to build a trustworthy and effective IDS without compromising performance for the sake of marginal gains from an opaque model.

摘要:入侵檢測系統(IDS)針對 5G 網路必須處理複雜且高容量的流量。儘管不透明的「黑箱」模型可以達到高準確率,但其缺乏透明度妨礙了信任和有效的操作反應。我們提出 ExAI5G,一個優先考慮可解釋性的框架,通過將基於 Transformer 的深度學習 IDS 與基於邏輯的可解釋 AI(XAI)技術整合。該框架使用整合梯度來歸因特徵重要性,並提取替代決策樹以推導邏輯規則。我們引入了一種新的評估方法,用於 LLM 生成的解釋,使用強大的評估器 LLM 來評估可行性,並測量其語義相似性和忠實度。在 5G IoT 入侵數據集上,我們的系統達到了 99.9\% 的準確率和 0.854 的宏觀 F1 分數,顯示出強大的性能。更重要的是,我們提取了 16 條邏輯規則,具有 99.7\% 的忠實度,使模型的推理過程變得透明。評估顯示,現代 LLM 能夠生成既忠實又可行的解釋,表明可以在不妥協性能的情況下,建立一個值得信賴且有效的 IDS,而不必為了從不透明模型中獲得微小的收益而妥協。
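The fidelity figure quoted for the extracted rules is usually understood as the agreement between the surrogate and the model it imitates, not agreement with the ground truth. A minimal sketch of that calculation follows; the function name and the toy predictions are made up for illustration, and the paper may define fidelity differently.

```python
def fidelity(model_preds, surrogate_preds):
    """Fraction of samples on which the surrogate reproduces the model's prediction."""
    if len(model_preds) != len(surrogate_preds):
        raise ValueError("prediction lists must have the same length")
    if not model_preds:
        raise ValueError("need at least one prediction")
    agree = sum(m == s for m, s in zip(model_preds, surrogate_preds))
    return agree / len(model_preds)


# Toy predictions, not data from the paper.
model_preds = ["benign", "attack", "attack", "benign", "attack"]
rule_preds = ["benign", "attack", "benign", "benign", "attack"]
print(fidelity(model_preds, rule_preds))  # 0.8
```

Because fidelity compares two sets of predictions, a rule set can be highly faithful to a model while inheriting every one of that model's mistakes.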

First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows

2604.18038v1 by Sihao Xing, Zaur Gouliev

Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.

摘要:大型語言模型(LLMs)在臨床環境中的使用日益增加,這引發了對生成的醫療文本和臨床推理中的種族偏見的擔憂。現有研究已經識別出醫療LLMs中的偏見,但許多研究專注於單一模型,對於減輕偏見的關注較少。本研究使用歐盟人工智慧法案作為治理視角,評估五個廣泛使用的LLMs在兩個任務中的表現,即合成病人案例生成和鑑別診斷排名。利用美國的種族分層流行病學分佈和專家鑑別診斷清單作為基準,我們應用結構化提示模板和雙部分評估設計來檢查隱性和顯性種族偏見。在合成案例生成任務中,所有模型均偏離了觀察到的種族分佈,其中GPT-4.1的整體偏差最小。在鑑別診斷任務中,DeepSeek V3在報告的指標中產生了最強的整體結果。當嵌入到一個自主工作流程中時,DeepSeek V3在平均p值上改善了0.0348,在中位數p值上改善了0.1166,在平均差異上改善了0.0949,相對於獨立模型,儘管在每個指標上的改善並不均勻。這些發現支持對醫療環境中使用的AI系統進行多指標偏見評估,並表明基於檢索的自主工作流程可能減少基準診斷任務中的某些顯性偏見。詳細的提示模板、實驗數據集和代碼管道可在我們的GitHub上獲得。

How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

2604.17935v1 by Xiao Wang

The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer ($n \geq 4k$, $s \leq \sqrt{n}/4$) requires depth $L = \Omega(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$, and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$ via windowed pointer doubling, and a max-bound $L = \Omega(\max(\lceil k/s \rceil, \log n/(Hmp)))$. Closing the conjecture amounts to upgrading max to product. (2) Bandwidth barrier. The product bound binds only when $Hmp \lesssim \log n$. Any lower bound provable via per-window distinguishability counting -- including reachability, bandwidth, and combinations -- cannot exceed $\lceil k/s \rceil$ once $Hmp \geq \log_2 n$. Breaking this requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs oblivious error scaling. Under random cache over $T = \lceil \log_2 k \rceil$ doubling stages, oblivious caches give $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\Pr[\mathcal{E}] = s/n$ exactly, independent of $T$. The $\Omega((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.

摘要:關鍵值(KV)快取在Transformer推理過程中是主要的記憶體瓶頸,但對於在多步推理降級之前,它可以被壓縮到什麼程度,理論上知之甚少。我們通過在共享大小為 $s$ 的 KV 快取下,對 $n$ 個標記進行 $k$-跳指標追逐來研究這一點,注意力維度為 $m$,$H$ 個頭,$p$ 位精度,以及一個尊重區域性的快取控制器(所有標準的 KV 壓縮方法都能滿足此條件)。我們給出三個結果。 (1) 產品深度下界(猜想)。我們猜想任何此類Transformer($n \geq 4k$, $s \leq \sqrt{n}/4$)需要深度 $L = Ω(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$,並將唯一剩餘的差距隔離為快取痕跡和指標鏈的聯合分佈上的一個概率步驟。在無條件的情況下,我們通過窗口式指標加倍證明了一個匹配的上界 $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$,以及一個最大界 $L = Ω(\max(\lceil k/s \rceil, \log n/(Hmp)))$。關閉這個猜想相當於將最大值升級為產品。 (2) 帶寬障礙。產品界僅在 $Hmp \lesssim \log n$ 時約束。任何通過每窗口可區分性計數可證明的下界——包括可達性、帶寬和組合——一旦 $Hmp \geq \log_2 n$,都不能超過 $\lceil k/s \rceil$。打破這一點需要將無條件的通信複雜度界限提升到快取Transformer深度。 (3) 自適應 vs 無知誤差擴展。在 $T = \lceil \log_2 k \rceil$ 次加倍階段下的隨機快取中,無知快取給出 $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$(對 $T$ 指數),而自適應尊重區域性的快取則精確地達到 $\Pr[\mathcal{E}] = s/n$,與 $T$ 無關。$Ω((n/s)^{T-1})$ 的分離解釋了為什麼重擊者驅逐在經驗上主導於隨機驅逐,特別是在多跳推理中。
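To make the $k$-hop pointer-chasing task concrete, the sketch below compares hop-by-hop traversal with pointer doubling, which composes the pointer map with itself so that $k$ hops need only about $\log_2 k$ composition rounds. It is plain, unconstrained doubling in ordinary Python, not the paper's windowed construction under a size-$s$ cache; the function names and the example array are illustrative assumptions.

```python
def k_hop_naive(ptr, start, k):
    """Follow the pointer array k times, one hop per step."""
    node = start
    for _ in range(k):
        node = ptr[node]
    return node


def k_hop_doubling(ptr, start, k):
    """Same result using O(log k) self-compositions of the pointer map."""
    node = start
    jump = list(ptr)  # jump[i] = node reached from i after 2^j hops (j = 0 initially)
    while k > 0:
        if k & 1:
            node = jump[node]
        # Compose the map with itself: new_jump[i] = jump[jump[i]].
        jump = [jump[j] for j in jump]
        k >>= 1
    return node


ptr = [3, 0, 4, 1, 2, 5]  # ptr[i] is the index that position i points to
print(k_hop_naive(ptr, 0, 5), k_hop_doubling(ptr, 0, 5))
```

Each doubling round needs the whole pointer map at once. That is why a small cache makes the idea harder to apply directly and why the abstract's bounds depend on the cache size $s$.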

AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis

2604.17846v1 by Nathasha Naranpanawa, Maree T. Izatt, Robert D. Labrom, Geoffrey N. Askin, J. Paige Little

MRI is preferred over CT in paediatric imaging because it avoids ionising radiation, but its use in spine deformity assessment is largely limited by the lack of automated, high-resolution 3D bony reconstruction, which continues to rely on CT. MRI-based 3D reconstruction remains impractical due to manual workflows and the scarcity of labelled full-spine datasets. This study introduces an AI framework that enables fully automated thoracolumbar spine (T1-L5) segmentation and 3D reconstruction from MRI alone. Historical low-dose CT scans from adolescent idiopathic scoliosis (AIS) patients were converted into MRI-like images using a GAN and combined with existing labelled thoracic MRI data to train a U-Net-based model. The resulting algorithm accurately generated continuous thoracolumbar 3D reconstructions, improved segmentation accuracy (88% Dice score), and reduced processing time from approximately 1 hour to under one minute, while preserving AIS-specific deformity features. This approach enables radiation-free 3D deformity assessment from MRI, supporting clinical evaluation, surgical planning, and navigation in paediatric spine care.

摘要:MRI 在兒童影像學中較 CT 更受青睞,因為它避免了電離輻射,但在脊柱畸形評估中的應用主要受到缺乏自動化、高解析度 3D 骨重建的限制,這仍然依賴於 CT。基於 MRI 的 3D 重建因手動工作流程和標註完整脊柱數據集的稀缺而仍然不切實際。本研究介紹了一個 AI 框架,能夠從 MRI 單獨實現完全自動化的胸腰脊柱 (T1-L5) 分割和 3D 重建。來自青少年特發性脊柱側彎 (AIS) 患者的歷史低劑量 CT 掃描被轉換為類似 MRI 的影像,並與現有的標註胸部 MRI 數據結合,以訓練基於 U-Net 的模型。所生成的算法準確地生成了連續的胸腰 3D 重建,提高了分割準確性 (88% Dice 分數),並將處理時間從約 1 小時縮短至不到 1 分鐘,同時保留了特定於 AIS 的畸形特徵。這種方法使得從 MRI 進行無輻射的 3D 畸形評估成為可能,支持臨床評估、手術規劃和兒童脊柱護理中的導航。
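The Dice score reported above measures the overlap between a predicted segmentation mask and a reference mask: twice the size of the intersection divided by the sum of the two mask sizes. A minimal sketch on flattened binary masks follows; the toy masks are made up, and treating two empty masks as a perfect match is a convention chosen here, not something stated in the abstract.

```python
def dice_score(pred, truth):
    """Dice = 2 * |A and B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy flattened masks, not data from the study.
pred = [1, 1, 1, 0, 0, 1, 0, 0]
truth = [1, 1, 0, 0, 0, 1, 1, 0]
print(dice_score(pred, truth))  # 0.75
```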

Community-Led AI Integration for Wildfire Risk Assessment: A Participatory AI Literacy and Explainability Integration (PALEI) Framework in Los Angeles, CA

2604.17755v1 by Sanaz Sadat Hosseini, Mona Azarbayjani, Mohammad Pourhomayoun, Hamed Tabkhi

Climate-driven wildfires are intensifying, particularly in urban regions such as Southern California. Yet, traditional fire risk communication tools often fail to gain public trust due to inaccessible design, non-transparent outputs, and limited contextual relevance. These challenges are especially critical in high-risk communities, where trust depends on how clearly and locally information is presented. Neighborhoods such as Pacific Palisades, Pasadena, and Altadena in Los Angeles exemplify these conditions. This study introduces a community-led approach for integrating AI into wildfire risk assessment using the Participatory AI Literacy and Explainability Integration (PALEI) framework. PALEI emphasizes early literacy building, value alignment, and participatory evaluation before deploying predictive models, prioritizing clarity, accessibility, and mutual learning between developers and residents. Early engagement findings show strong acceptance of visual, context-specific risk communication, positive fairness perceptions, and clear adoption interest, alongside privacy and data security concerns that influence trust. Participants emphasized localized imagery, accessible explanations, neighborhood-specific mitigation guidance, and transparent communication of uncertainty. The outcome is a mobile application co-designed with users and stakeholders, enabling residents to scan visible property features and receive interpretable fire risk scores with tailored recommendations. By embedding local context into design, the tool becomes an everyday resource for risk awareness and preparedness. This study argues that user experience is central to ethical and effective AI deployment and provides a replicable, literacy-first pathway for applying the PALEI framework to climate-related hazards.

摘要:氣候驅動的野火正在加劇,特別是在南加州等城市地區。然而,傳統的火災風險傳達工具往往因設計不易接觸、輸出不透明和上下文相關性有限而未能獲得公眾信任。這些挑戰在高風險社區中特別重要,因為信任取決於信息呈現的清晰度和地方性。洛杉磯的太平洋帕利塞德斯、帕薩迪納和阿爾塔迪納等社區就是這些情況的典範。本研究提出了一種社區主導的方法,利用參與式人工智慧素養與解釋整合(PALEI)框架將人工智慧整合到野火風險評估中。PALEI 強調在部署預測模型之前建立早期素養、價值對齊和參與性評估,優先考慮清晰性、可接觸性和開發者與居民之間的相互學習。早期參與的研究結果顯示,對視覺化、具上下文特定的風險傳達有強烈的接受度,對公平性的正面感知,以及明確的採用興趣,同時也存在影響信任的隱私和數據安全問題。參與者強調了本地化的圖像、可接觸的解釋、社區特定的減緩指導以及不確定性的透明傳達。最終的結果是一個與用戶和利益相關者共同設計的移動應用程序,使居民能夠掃描可見的財產特徵並獲得可解釋的火災風險評分和量身定制的建議。通過將地方上下文嵌入設計中,該工具成為風險意識和準備的日常資源。本研究認為,用戶體驗是道德和有效的人工智慧部署的核心,並提供了一條可複製的、以素養為首的途徑,以將 PALEI 框架應用於氣候相關的危害。

MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

2604.17730v1 by Suhyun Lee, Palakorn Achananuparp, Neemesh Yadav, Ee-Peng Lim, Yang Deng

Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.

摘要:大型語言模型(LLMs)越來越多地被探索作為可擴展的心理健康諮詢工具,然而,由於臨床傷害的互動性和情境依賴性,評估它們的安全性仍然具有挑戰性。現有的評估框架主要使用粗糙的分類法或靜態數據集來評估孤立的反應,這限制了它們診斷傷害如何在多輪諮詢互動中出現和累積的能力。在這項工作中,我們介紹了 R-MHSafe,一種角色感知的心理健康安全分類法,根據 AI 諮詢師所採取的互動角色(包括施害者、煽動者、促進者或使能者)來描述臨床上重要的傷害,並結合臨床基礎的傷害類別。然後,我們提出了 MHSafeEval,一個閉環的基於代理的評估框架,將安全評估公式化為通過對抗性多輪互動的傷害軌跡級別發現,並以角色感知建模為指導。使用 R-MHSafe 和 MHSafeEval,我們對最先進的 LLMs 進行了大規模評估。我們的結果揭示了顯著的角色依賴性和累積性安全失敗,這些失敗在現有的靜態基準中被系統性地忽略,並顯示我們的框架顯著提高了失敗模式的覆蓋率和診斷的細緻度。

Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems

2604.17677v1 by Nick Loghmani

Retrieval-Augmented Generation (RAG) systems depend on the geometric properties of vector representations to retrieve contextually appropriate evidence. When source documents interleave multiple topics within contiguous text, standard vectorization produces embedding spaces in which semantically distinct content occupies overlapping neighborhoods. We term this condition semantic entanglement. We formalize entanglement as a model-relative measure of cross-topic overlap in embedding space and define an Entanglement Index (EI) as a quantitative proxy. We argue that higher EI constrains attainable Top-K retrieval precision under cosine similarity retrieval. To address this, we introduce the Semantic Disentanglement Pipeline (SDP), a four-stage preprocessing framework that restructures documents prior to embedding. We further propose context-conditioned preprocessing, in which document structure is shaped by patterns of operational use, and a continuous feedback mechanism that adapts document structure based on agent performance. We evaluate SDP on a real-world enterprise healthcare knowledge base comprising over 2,000 documents across approximately 25 sub-domains. Top-K retrieval precision improves from approximately 32% under fixed-token chunking to approximately 82% under SDP, while mean EI decreases from 0.71 to 0.14. We do not claim that entanglement fully explains RAG failure, but that it captures a distinct preprocessing failure mode that downstream optimization cannot reliably correct once encoded into the vector space.

摘要:檢索增強生成(RAG)系統依賴於向量表示的幾何特性來檢索上下文適當的證據。當來源文檔在連續文本中交織多個主題時,標準向量化會產生嵌入空間,其中語義上不同的內容佔據重疊的鄰域。我們將這種情況稱為語義纏結。我們將纏結形式化為嵌入空間中跨主題重疊的模型相對度量,並定義一個纏結指數(EI)作為定量代理。我們認為較高的EI限制了在餘弦相似性檢索下可達的Top-K檢索精度。為了解決這個問題,我們引入了語義解纏管道(SDP),這是一個四階段的預處理框架,在嵌入之前重組文檔。我們進一步提出了基於上下文的預處理,其中文檔結構由操作使用模式塑造,並且有一個連續反饋機制,根據代理性能調整文檔結構。我們在一個包含超過2000份文檔的現實世界企業醫療知識庫上評估SDP,該知識庫涵蓋約25個子領域。Top-K檢索精度從固定標記分塊下的約32%提高到SDP下的約82%,而平均EI從0.71降低到0.14。我們並不聲稱纏結完全解釋了RAG的失敗,但它捕捉了一種明確的預處理失敗模式,而下游優化在編碼進入向量空間後無法可靠地修正。
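The abstract does not give the formula for the Entanglement Index, so the sketch below uses a deliberately simple stand-in to convey the idea: the share of chunks whose nearest neighbor under cosine similarity belongs to a different topic. The function `cross_topic_neighbor_rate`, the two-dimensional embeddings, and the topic labels are all assumptions for illustration and should not be read as the paper's EI.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_topic_neighbor_rate(embeddings, topics):
    """Share of chunks whose most similar other chunk carries a different topic label."""
    n = len(embeddings)
    if n < 2 or n != len(topics):
        raise ValueError("need at least two chunks and one topic label per chunk")
    mismatches = 0
    for i in range(n):
        nearest = max(
            (j for j in range(n) if j != i),
            key=lambda j: cosine(embeddings[i], embeddings[j]),
        )
        if topics[nearest] != topics[i]:
            mismatches += 1
    return mismatches / n


# Toy 2-D embeddings: one corpus with well-separated topics, one with blended chunks.
clean = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
mixed = [(1.0, 0.0), (0.6, 0.5), (0.5, 0.6), (0.0, 1.0)]
print(cross_topic_neighbor_rate(clean, ["a", "a", "b", "b"]))
print(cross_topic_neighbor_rate(mixed, ["a", "b", "a", "b"]))
```

When chunks blend topics, their nearest neighbors tend to come from the wrong topic. This is the kind of overlap that would limit Top-K precision regardless of how the retriever is tuned afterwards.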

On The Mathematics of the Natural Physics of Optimization

2604.17645v1 by I. M. Ross

A number of optimization algorithms have been inspired by the physics of Newtonian motion. Here, we ask the question: do algorithms themselves obey some "natural laws of motion," and can they be derived by an application of these laws? We explore this question by positing the theory that optimization algorithms may be considered as some manifestation of hidden algorithm primitives that obey certain universal non-Newtonian dynamics. This natural physics of optimization is developed by equating the terminal transversality conditions of an optimal control problem to the generalized Karush/John-Kuhn-Tucker conditions of an optimization problem. Through this equivalence formulation, the data functions of a given constrained optimization problem generate a natural vector field that permeates an entire hidden space with information on the optimality conditions. An "action-at-a-distance" operation via a Pontryagin-type minimum principle produces a local action to deliver a globalized result by way of a Hamilton-Jacobi inequality. An inverse-optimal algorithm is generated by performing control jumps that dissipate quantized "energy" defined by a search Lyapunov function. Illustrative applications of the proposed theory show that a large number of algorithms can be generated and explained in terms of the new mathematical physics of optimization.

摘要:許多優化演算法受到牛頓運動物理學的啟發。在這裡,我們提出一個問題:演算法本身是否遵循某些「自然運動法則」,並且這些法則是否可以用來推導演算法?我們通過假設優化演算法可以被視為遵循某些普遍非牛頓動力學的隱藏演算法原始元素的某種表現來探討這個問題。這種優化的自然物理學是通過將最佳控制問題的終端橫斷條件等同於優化問題的廣義Karush/John-Kuhn-Tucker條件來發展的。通過這種等價公式,給定約束優化問題的數據函數生成了一個自然向量場,該向量場充滿了有關最佳條件的信息,滲透整個隱藏空間。通過Pontryagin型最小原則的「遠程作用」操作產生了一個局部行動,通過哈密頓-雅可比不等式提供全球化結果。通過執行控制跳躍來生成一個逆最優演算法,這些跳躍耗散由搜索Lyapunov函數定義的量化「能量」。所提出理論的示範應用顯示,許多演算法可以根據新的優化數學物理學來生成和解釋。

STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments

2604.17611v1 by Md Mezbahul Islam, John Michael Templeton, Christian Poellabauer, Ananda Mohan Mondal

Parkinson's disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, making severity staging essential for clinical monitoring and treatment planning. However, many computational studies emphasize binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework to classify PD severity using clinically interpretable boundaries. It leverages all available visits from the Parkinson's Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires and objective clinician-assessed measures. Disease severity is defined using Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and a three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. To enhance interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), and 94.14% accuracy with 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.

摘要:帕金森病(PD)是一種漸進性疾病,其症狀負擔和功能障礙隨時間演變,因此對於臨床監測和治療計劃來說,嚴重程度分級是必不可少的。然而,許多計算研究強調二元的PD檢測,並未充分利用重複的隨訪臨床評估來進行階段感知的預測。本研究提出了STEP-PD,一個重視嚴重程度的機器學習框架,用於使用臨床可解釋的邊界來分類PD的嚴重程度。它利用來自帕金森病進展標記計劃(PPMI)的所有可用訪問,並整合常規收集的主觀問卷和客觀臨床評估指標。疾病的嚴重程度是使用Hoehn和Yahr分級來定義的,並分為三個臨床意義明確的類別:健康、輕度PD(1-2期)和中度至重度PD(3-5期)。通過分層交叉驗證和考慮不平衡的訓練,評估了三個二元分類問題和一個三類嚴重程度任務。為了增強可解釋性,使用SHAP提供全局解釋和局部患者級別的瀑布解釋。在所有任務中,XGBoost實現了最強且最穩定的性能,健康與輕度的準確率為95.48%、健康與中度至重度的準確率為99.44%、輕度與中度至重度的準確率為96.78%,以及三類嚴重程度分類的準確率為94.14%,Macro-F1為0.8775。可解釋性結果突顯了從早期運動特徵到與進展相關的軸向和平衡障礙的轉變。這些發現表明,PPMI隊列中的多模態臨床評估可以支持準確且可解釋的訪問級PD嚴重程度分層。
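The three severity groups and the three pairwise tasks described above follow directly from the Hoehn and Yahr stage ranges. A minimal sketch of that labeling step follows. It assumes integer stages and that healthy controls are coded as stage 0; neither detail is stated in the abstract, and the function name is hypothetical.

```python
from itertools import combinations

GROUPS = ("Healthy", "Mild PD", "Moderate-to-Severe PD")


def severity_group(hy_stage):
    """Map an integer Hoehn and Yahr stage to one of the three severity groups.

    Assumption: healthy controls are coded as stage 0.
    """
    if hy_stage == 0:
        return "Healthy"
    if hy_stage in (1, 2):
        return "Mild PD"
    if hy_stage in (3, 4, 5):
        return "Moderate-to-Severe PD"
    raise ValueError(f"unexpected Hoehn and Yahr stage: {hy_stage!r}")


# The three binary problems are the pairwise combinations of the groups.
binary_tasks = list(combinations(GROUPS, 2))

print([severity_group(s) for s in range(6)])
print(binary_tasks)
```

The fourth task is the three-class problem over all of `GROUPS` at once.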

CDSA-Net: Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography

2604.17208v1 by Si Li, Chen-Kai Hu, Zhenhuan Lyu, Yuanqing He

Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, forcing reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produce images with two critical, clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity, both of which undermine diagnostic confidence. We propose a novel framework, termed CDSA-Net, that for the first time explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA-Net introduces two core innovations: (i) A hierarchical geometric prior guidance (HGPG) mechanism, embedded in our coronary structure extraction network (CSENet). It synergistically combines integrated geometric prior (IGP) with gated spatial modulation (GSM) and centerline-aware topology (CAT) loss supervision, ensuring structural continuity. (ii) An adaptive noise module (ANM) within our coronary background restoration network (CBResNet). Unlike standard restoration, ANM uniquely models the stochastic nature of clinical X-ray noise, bridging the domain gap to enable seamless background intensity estimation and the complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, CDSA-Net significantly outperformed state-of-the-art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while maintaining diagnostic results consistent with raw angiograms. The project code is available at https://github.com/DrThink-ai/CDSA-Net.

摘要:數位減影血管造影(DSA)在冠狀動脈影像中受到生理運動的根本挑戰,迫使人們依賴充滿解剖噪音的原始血管造影圖像。現有的深度學習方法通常產生兩個關鍵的臨床不可接受的缺陷:持續的邊界伪影和原生組織灰階保真度的喪失,這削弱了診斷信心。我們提出了一個名為 CDSA-Net 的新框架,首次明確地解耦並聯合優化血管結構保護和現實背景恢復。CDSA-Net 引入了兩個核心創新:(i)一種分層幾何先驗引導(HGPG)機制,嵌入我們的冠狀結構提取網絡(CSENet)。它協同結合了集成幾何先驗(IGP)、門控空間調制(GSM)和中心線感知拓撲(CAT)損失監督,確保結構連續性。(ii)我們的冠狀背景恢復網絡(CBResNet)內的一個自適應噪聲模塊(ANM)。與標準恢復不同,ANM 獨特地建模臨床 X 射線噪聲的隨機性質,彌合領域差距以實現無縫的背景強度估計和完全消除邊界伪影。最終的減法是通過從原始血管造影中去除恢復的背景來獲得的。在定量上,它在血管強度相關性和感知質量方面顯著超越了最先進的方法。在形態評估效率上提高了 25.6%,在血流動力學評估速度上提高了 42.9%,為介入心臟病學的實用性設立了新的基準,同時保持診斷結果與原始血管造影一致。項目代碼可在 https://github.com/DrThink-ai/CDSA-Net 獲得。

Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training

2604.17186v1 by Weibing Zheng, Laurah Turner, Jess Kropczynski, Matthew Kelleher, Murat Ozer, Shane Halse

As Artificial Intelligence (AI) and Agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that Multi-Agent Education System (MAES) is explainable from the early stages of requirements engineering (RE) within the AI software development lifecycle. Explainability is essential to build trust, promote transparency, and enable effective human-AI collaboration. Although personas are well-established in human-computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human-first, persona-driven, explainable MAES RE framework and demonstrates the framework through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform explainability requirements that guided the clinical reasoning training of medical students. A post-usage survey found that more than 78% of medical students reported that MAES improved their clinical reasoning skills. These findings demonstrate that RE based on persona effectively connects technical requirements with non-technical medical students from a human-centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of the AI system engineering. The partial MAES for the clinical scenario simulator is open-sourced at https://github.com/2sigmaEdTech/MAS/.

摘要:隨著人工智慧(AI)和代理型AI在教育和醫療等各個領域的日益整合,確保多代理教育系統(MAES)在AI軟體開發生命週期的需求工程(RE)早期階段是可解釋的,至關重要。可解釋性對於建立信任、促進透明度以及實現有效的人機協作至關重要。儘管角色在人機互動中被廣泛應用以代表用戶並捕捉他們的需求和行為,但在可解釋的MAES的需求工程中的角色仍然未被充分探索。本文提出了一個以人為本、以角色驅動的可解釋MAES需求工程框架,並通過一個用於臨床推理訓練的MAES來演示該框架。該框架在整個需求工程過程中整合了角色和用戶故事,以捕捉各種利益相關者的需求、目標和互動,包括醫學教育者、醫學學生、AI病人代理和臨床代理(身體檢查代理、診斷代理、臨床干預代理、監督代理、評估代理)。目標、基本模型和知識基礎塑造了代理的互動,並告知了指導醫學學生臨床推理訓練的可解釋性需求。使用後調查發現,超過78\%的醫學學生報告說MAES提高了他們的臨床推理技能。這些發現表明,基於角色的需求工程有效地將技術需求與非技術醫學學生聯繫起來,採用以人為中心的方法,確保可解釋的MAES是可信的、可解釋的,並與AI系統工程早期階段的真實臨床情境相一致。針對臨床情境模擬器的部分MAES已在~\href{https://github.com/2sigmaEdTech/MAS/}{這裡開源}。

Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL

2604.17073v1 by Skylar Zhai, Jingcheng Liang, Dongyeop Kang

Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.

摘要:強化微調提升了大型語言模型的推理能力,但它也可能促使模型通過猜測或幻覺缺失的信息來回答無法回答的問題。現有的放棄方法要麼訓練模型產生通用的拒絕,要麼鼓勵後續澄清,而不驗證這些澄清是否能識別關鍵的缺失信息。我們研究那些意義明確但無法從給定信息中可靠解決的查詢,並主張一個可靠的模型不僅應該放棄,還應該解釋缺失的內容。我們提出了一種關注澄清的RLVR獎勵,該獎勵在對可回答的查詢給予正確答案的同時,聯合優化明確的放棄和語義對齊的後拒絕澄清,針對無法回答的查詢。利用這一獎勵,我們訓練了Abstain-R1,一個3B模型,該模型在無法回答的查詢上改善了放棄和澄清,同時在可回答的查詢上保持強大的表現。在Abstain-Test、Abstain-QA和SelfAware上的實驗顯示,Abstain-R1在其基礎模型上有了顯著的改進,並在無法回答的查詢行為上達到了與包括DeepSeek-R1在內的更大系統的競爭水平,這表明經過驗證的獎勵可以學習到經過校準的放棄和澄清,而不僅僅是通過規模自然而然地出現。
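The reward structure described above can be pictured as two branches: answerable queries are rewarded for a correct answer, and unanswerable ones are rewarded for abstaining and again for naming what is missing. The sketch below is a hypothetical illustration only. The function name, the 0.5/0.5 weighting, and the keyword match standing in for semantic alignment are all assumptions, not the paper's reward.

```python
def clarification_aware_reward(answerable, abstained, answer=None,
                               gold_answer=None, clarification="",
                               missing_info_keywords=()):
    """Toy reward in [0, 1]; weights and keyword matching are illustrative assumptions."""
    if answerable:
        # Answerable query: reward only a committed, correct answer.
        return 1.0 if (not abstained and answer == gold_answer) else 0.0
    if not abstained:
        # Unanswerable query: guessing earns nothing.
        return 0.0
    # Abstention earns partial credit; the rest depends on whether the
    # clarification names the missing information (keyword match as a crude proxy).
    if not missing_info_keywords:
        return 0.5
    text = clarification.lower()
    hits = sum(1 for kw in missing_info_keywords if kw.lower() in text)
    return 0.5 + 0.5 * hits / len(missing_info_keywords)


print(clarification_aware_reward(True, False, answer="42", gold_answer="42"))
print(clarification_aware_reward(False, False, answer="a guess"))
print(clarification_aware_reward(False, True, clarification="I cannot answer that."))
print(clarification_aware_reward(
    False, True,
    clarification="The train's departure time is not given.",
    missing_info_keywords=("departure time",),
))
```

A generic refusal earns only the abstention share, while a refusal that pinpoints the missing detail earns the full reward. That gap is the behavior the abstract says existing abstention methods fail to verify.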

Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach

2604.16953v1 by Riza Alaudin Syah, Irwan Alnarus Kautsar, Gunawan Witjaksono, Haza Nuzly bin Abdull Hamed

Breast cancer diagnosis through thermographic image analysis remains a critical challenge in medical AI, with classical deep learning approaches facing limitations in complex thermal pattern classification tasks. This paper presents a novel Hybrid Quantum Neural Network (HQNN) architecture that integrates quantum computing principles with classical convolutional neural networks for enhanced breast cancer classification. Our approach employs parameterized quantum circuits with multi-head attention mechanisms for quantum-aware feature encoding, coupled with classical convolutional layers for comprehensive pattern recognition. The quantum component utilizes a 4-qubit variational circuit with strongly entangling layers, while the classical component incorporates advanced attention mechanisms for feature fusion. Experimental validation on breast cancer thermographic data demonstrates substantial performance improvements over state-of-the-art classical architectures, with the quantum-enhanced approach exhibiting superior convergence dynamics and enhanced feature representation capabilities. Our findings provide evidence for quantum advantage in medical image classification through classical simulation, establishing a framework for quantum-classical hybrid systems in healthcare applications. The methodology addresses key challenges in quantum machine learning deployment while maintaining computational feasibility on near-term quantum devices.

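Abstract (translated from the Chinese): Diagnosing breast cancer through thermographic image analysis remains a key challenge in medical AI, and classical deep learning approaches face limitations on complex thermal-pattern classification tasks. This paper proposes a novel Hybrid Quantum Neural Network (HQNN) architecture that combines quantum computing principles with classical convolutional neural networks to improve breast cancer classification. Our approach uses parameterized quantum circuits with multi-head attention mechanisms for quantum-aware feature encoding, together with classical convolutional layers for comprehensive pattern recognition. The quantum component uses a 4-qubit variational circuit with strongly entangling layers, while the classical component incorporates advanced attention mechanisms for feature fusion. Experimental validation on breast cancer thermographic data shows substantial performance gains over state-of-the-art classical architectures, with the quantum-enhanced approach exhibiting superior convergence dynamics and stronger feature representation. Our results provide evidence, obtained through classical simulation, of quantum advantage in medical image classification, and establish a framework for hybrid quantum-classical systems in healthcare applications. The methodology addresses key challenges in deploying quantum machine learning while remaining computationally feasible on near-term quantum devices.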

LLMs can persuade only psychologically susceptible humans on societal issues, via trust in AI and emotional appeals, amid logical fallacies

2604.16935v1 by Alexis Carrillo, Salvatore Citraro, Ali Aghazhadeh Ardebili, Enrique Taietta, Giulio Rossetti, Emilio Ferrara, Giuseppe Alessandro Veltri, Massimo Stella

Scarce longitudinal evidence examines LLMs' persuasiveness and humanness along time-evolving psychological frameworks. We introduce Talk2AI, a longitudinal framework quantifying psycho-social, reasoning and affective dimensions of LLMs' persuasiveness about polarizing societal topics. In a four-way longitudinal setup, Talk2AI's 770 participants engaged in structured conversations with one of four leading LLMs on topics like climate change, social media misinformation, and math anxiety. This produced 3,080 conversations over 60,000 turns. After each wave, participants reported conviction in their initial topic stance, perceived opinion change, LLM's perceived humanness, a self-donation to the topic and a textual explanation. Feedback time series showed longitudinal inertia in convictions, indicating some human anchoring to initial opinions even after repeated exposure to AI-generated arguments. Interestingly, NLP analyses revealed that both humans and LLMs relied on fallacious reasoning in 1 of every 6 conversational quips, countering the "LLMs as superior systems" stereotype behind cognitive surrender to LLMs. LLMs' perceived humanness was most learnable from sociodemographic, psychological and engagement features ($R^2=0.44$), followed by opinion change ($R^2=0.34$), conviction ($R^2=0.26$) and personal endowment ($R^2=0.24$). Crucially, explainable AI (XAI) indicated: (i) the presence of individuals more susceptible to LLM-based opinion changes; (ii) psychological susceptibility to LLM-convincing consisted of having more trust in LLMs, being more agreeable and extraverted and with a higher need for cognition. A multiverse approach with mixed-effects models confirmed XAI results, alongside strong individual differences. Talk2AI provides a grounded framework and evidence for detecting how GenAI can influence human opinions via multiple psycho-social pathways in AI-human digital platforms.

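Abstract (translated from the Chinese): Little longitudinal evidence examines the persuasiveness and humanness of large language models (LLMs) within psychological frameworks that evolve over time. We introduce Talk2AI, a longitudinal framework that quantifies the psycho-social, reasoning, and affective dimensions of LLMs' persuasiveness on polarizing societal topics. In a four-way longitudinal setup, Talk2AI's 770 participants held structured conversations with one of four leading LLMs on topics such as climate change, social media misinformation, and math anxiety. This produced 3,080 conversations with more than 60,000 turns in total. After each wave, participants reported their conviction in their initial stance on the topic, their perceived opinion change, the LLM's perceived humanness, a self-donation to the topic, and a textual explanation. The feedback time series showed longitudinal inertia in convictions, indicating that some people remain anchored to their initial opinions even after repeated exposure to AI-generated arguments. Interestingly, natural language processing (NLP) analyses showed that both humans and LLMs relied on fallacious reasoning in 1 of every 6 conversational quips, countering the "LLMs as superior systems" stereotype that underlies cognitive surrender to LLMs. LLMs' perceived humanness was the outcome most learnable from sociodemographic, psychological, and engagement features ($R^2=0.44$), followed by opinion change ($R^2=0.34$), conviction ($R^2=0.26$), and personal endowment ($R^2=0.24$). Crucially, explainable AI (XAI) indicated that (i) some individuals are more susceptible to LLM-based opinion change, and (ii) psychological susceptibility to being convinced by an LLM consists of having more trust in LLMs, being more agreeable and extraverted, and having a higher need for cognition. A multiverse approach with mixed-effects models confirmed the XAI results and showed strong individual differences. Talk2AI provides a grounded framework and evidence for detecting how generative AI can influence human opinions through multiple psycho-social pathways on AI-human digital platforms.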

The Reliance Negotiation Framework: A Dynamic Process Model of Student LLM Engagement in Academic Writing

2604.16772v1 by Shahin Hossain

Student engagement with large language models (LLMs) in academic writing is not a stable trait, an adoption decision, or a competency level; it is a continuously negotiated process that existing frameworks cannot adequately theorize. Typological models provide categories without mechanisms; technology acceptance models explain adoption but not post-adoption quality; AI literacy frameworks treat competency as a static predictor rather than a live input. None accounts for within-student variability across tasks, the developmental paradox whereby experience produces habituation rather than sophistication, or principled non-use as a form of ethical reasoning. This article introduces the Reliance Negotiation Framework (RNF), developed from a sequential explanatory mixed-methods study of 382 undergraduates at a public minority-serving institution in the United States (survey, N = 382; 14 semi-structured interviews; three qualitative survey strands; 1,435 coded instances). The RNF reconceptualizes LLM reliance as an ongoing negotiation among four concurrent inputs (perceived benefits, perceived risks, ethical commitments, and situational demands) with outputs that recursively modify subsequent decisions. A Two-Model Architecture accommodates the 13.0% of participants whose categorical ethical commitments foreclose negotiation entirely. The framework generates four falsifiable predictions with implications for AI literacy pedagogy, academic integrity policy, and equity-centered practice at minority-serving institutions.

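Abstract (translated from the Chinese): Student engagement with large language models (LLMs) in academic writing is not a stable trait, an adoption decision, or a competency level; it is a continuously negotiated process that existing frameworks cannot adequately theorize. Typological models provide categories without mechanisms; technology acceptance models explain adoption but not post-adoption quality; AI literacy frameworks treat competency as a static predictor rather than a live input. None of them accounts for within-student variability across tasks, the developmental paradox in which experience produces habituation rather than sophistication, or principled non-use as a form of ethical reasoning. This article introduces the Reliance Negotiation Framework (RNF), developed from a sequential explanatory mixed-methods study of 382 undergraduates at a public minority-serving institution in the United States (survey, N = 382; 14 semi-structured interviews; three qualitative survey strands; 1,435 coded instances). The RNF reconceptualizes LLM reliance as an ongoing negotiation among four concurrent inputs (perceived benefits, perceived risks, ethical commitments, and situational demands), with outputs that recursively modify subsequent decisions. A Two-Model Architecture accommodates the 13.0% of participants whose categorical ethical commitments rule out negotiation entirely. The framework generates four falsifiable predictions, with implications for AI literacy pedagogy, academic integrity policy, and equity-centered practice at minority-serving institutions.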

Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

2604.16745v1 by Yang Shanglin

Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency $\rho_s$ and off-diagonal correlation $\rho_\text{off}$, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and $r_{\text{crit}} \propto 1/L$; and (2) shared reliance on pairwise similarity signals whose ranking consistency degrades from $\rho_s{=}0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations) while unary signals enjoy greater stability ($O(N_p)$ perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K where all baselines collapse to 43--65%.

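Abstract (translated from the Chinese): Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) use different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency $\rho_s$ and off-diagonal correlation $\rho_\text{off}$, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, which predicts convex Pareto curves and $r_{\text{crit}} \propto 1/L$; and (2) a shared reliance on pairwise similarity signals, whose ranking consistency drops from $\rho_s{=}0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations), whereas unary signals are more stable ($O(N_p)$ perturbations, central limit theorem). From three design principles derived from this diagnosis, we build CATIS as a constructive validation: unary signals raise the trigger threshold, and triage suppresses the gain. On ViT-Large at a 63% FLOPs reduction, CATIS retains 96.9% of the vanilla accuracy (81.0%) on ImageNet-1K, whereas all baselines collapse to 43--65%.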

CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction

2604.16742v1 by Jianyou Wang, Youze Zheng, Longtian Bao, Hanyuan Zhang, Qirui Zheng, Yuhan Chen, Yang Zhang, Matthew Feng, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Ramamohan Paturi, Umber Dube, Leon Bergen

Scientists have long sought to accurately predict outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining if a trial's outcome is public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy against human experts' annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes before they occur, while also informing biomedical research and improving clinical trial design. The CT Open Platform is hosted at https://ct-open.net/.

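Abstract (translated from the Chinese): Scientists have long sought to accurately predict the outcomes of real-world events before they happen. Can AI systems do this more reliably? We study the question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates submissions on trials whose outcomes were not yet public at submission time but were made public afterwards. Determining whether a trial's outcome was public on the internet before a given date is surprisingly difficult. Outcomes posted on official registries may lag by years, while the first mention may appear in an obscure article. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy against human experts' annotations. Because CT Open's pipeline ensures that no evaluated trial had a publicly reported outcome when the prediction was made, participants may use any methodology and any data source. In this paper we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes, while also informing biomedical research and improving clinical trial design. The CT Open Platform is hosted at https://ct-open.net/.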

When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis

2604.16736v1 by Justice Owusu Agyemang, Michael Agyare, Miriam Kobbinah, Nathaniel Agbugblah, Prosper Addo

LLM-powered coding agents suffer from a poorly understood failure mode we term output stalling: the agent silently produces empty responses when attempting to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state - distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier $μ_f > 1$, and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an optimal generation strategy (direct, chunked, or deferred). We validate the theory through controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study isolating each component's contribution. Deferred rendering reduces LLM generation tokens by 48-72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, demonstrating that the theory translates directly into a practical tool.

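Abstract (translated from the Chinese): LLM-powered coding agents suffer from a poorly understood failure mode that we call output stalling: the agent silently produces empty responses when it tries to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state, which is distinct from, and empirically smaller than, the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier $μ_f > 1$, and we derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC onto an optimal generation strategy (direct, chunked, or deferred). We validate the theory with controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study that isolates each component's contribution. Deferred rendering reduces LLM generation tokens by 48-72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, showing that the theory translates directly into a practical tool.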
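The Adaptive Strategy Selection idea in the abstract above maps the ratio of estimated output cost to available OGC onto one of three strategies. A minimal sketch of such a mapping follows. The cost model (content tokens times the overhead multiplier), the cut-offs 0.5 and 1.0, and all names are assumptions made for illustration; the abstract does not state the actual thresholds or the GEN-PILOT API.

```python
from dataclasses import dataclass


@dataclass
class GenerationPlan:
    strategy: str            # "direct", "chunked", or "deferred"
    estimated_tokens: float  # estimated cost of direct generation
    ratio: float             # estimated cost / available OGC


def select_strategy(content_tokens, format_overhead, available_ogc,
                    direct_max=0.5, chunked_max=1.0):
    """Pick a generation strategy from the cost-to-capacity ratio.

    direct_max and chunked_max are hypothetical cut-offs, not values
    taken from the paper.
    """
    if available_ogc <= 0:
        raise ValueError("available_ogc must be positive")
    # Assumed cost model: formatting inflates the content by a multiplier.
    estimated = content_tokens * format_overhead
    ratio = estimated / available_ogc
    if ratio <= direct_max:
        strategy = "direct"
    elif ratio <= chunked_max:
        strategy = "chunked"
    else:
        strategy = "deferred"
    return GenerationPlan(strategy, estimated, ratio)
```

Under these assumed cut-offs, a 4,000-token document with a 1.5x format overhead and 8,000 tokens of capacity has a ratio of 0.75 and would be generated in chunks.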

Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis

2604.16729v1 by Ayhan Can Erdur, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C. Peeken

State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve complex neuro-radiological image analysis tasks through tool use without the need for training or fine-tuning.

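Abstract (translated from the Chinese): State-of-the-art large language models (LLMs) perform well on general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning needed to analyze volumetric medical imaging, such as CT or MRI, directly. Emerging agentic AI offers a new solution, removing the need for intrinsic 3D processing by letting LLMs orchestrate and use specialized external tools. Yet the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. We validate our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools; our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate the framework on increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment that requires multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models with multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results show that agentic AI can solve neuro-radiological image analysis tasks through tool use, without training or fine-tuning.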

The Query Channel: Information-Theoretic Limits of Masking-Based Explanations

2604.16689v1 by Erciyes Karakaya, Ozgur Ercetin

Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the error probability of exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.

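Abstract (translated from the Chinese): Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates the procedure as communication over a query channel, in which the latent explanation is the message and each masked evaluation is one channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate set by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the error probability of exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result: a sparse maximum-likelihood decoder attains reliable recovery when the rate is below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark, which we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets in which information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation, and we show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and make high-resolution explanations unattainable.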
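To make the "query a black box under random masks, then fit a surrogate" procedure from the abstract above concrete, here is a minimal sketch of an OLS-based explainer. It is a simplified illustration, not the paper's method and not the LIME or KernelSHAP algorithm: it uses uniform random masks and plain least squares, and the toy model, query count, and function names are all hypothetical. It assumes NumPy is installed.

```python
import numpy as np


def explain_by_masking(black_box, n_features, n_queries=64, seed=0):
    """Estimate per-feature importance from masked black-box queries.

    Each random binary mask is one "channel use": the black box is
    evaluated on it, and an ordinary-least-squares fit maps masks to
    outputs. Returns (attributions, baseline).
    """
    rng = np.random.default_rng(seed)
    masks = rng.integers(0, 2, size=(n_queries, n_features))
    outputs = np.array([black_box(m) for m in masks], dtype=float)
    # Add an intercept column so the baseline output is fitted separately.
    design = np.column_stack([np.ones(n_queries), masks])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[1:], coef[0]


# Hypothetical noise-free linear black box with a sparse explanation.
true_weights = np.array([2.0, 0.0, -1.5, 0.0, 0.5])


def toy_model(mask):
    return 0.3 + float(np.dot(mask, true_weights))


attributions, baseline = explain_by_masking(toy_model, n_features=5)
```

In this noise-free linear setting the fit recovers the weights essentially exactly; the abstract's point is that with noise, curvature, or too few queries, such convex surrogates can fail even where reliable recovery is information-theoretically possible.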

Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing

2604.16280v1 by Thomas Bayer, Alexander Lohr, Sarah Weiß, Bernd Michelberger, Wolfram Höpken

Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamically access a KG in order to improve the explainability of ML results. From a practical perspective, we provide empirical evidence showing that such explanations can be successfully applied in real-world manufacturing environments, supporting better decision-making in manufacturing processes.

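Abstract (translated from the Chinese): Explaining Machine Learning (ML) results in a transparent and user-friendly way remains a challenging task in Explainable Artificial Intelligence (XAI). In this paper, we present a method that improves the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data together with ML results and their corresponding explanations, establishing a structured link between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond the standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing the responses with quantitative metrics such as accuracy and consistency as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: theoretically, we present a novel approach that effectively enables LLMs to access a KG dynamically in order to improve the explainability of ML results; practically, we provide empirical evidence that such explanations can be applied successfully in real-world manufacturing environments, supporting better decision-making in manufacturing processes.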

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

2604.16175v1 by Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

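Abstract (translated from the Chinese): Automated 3D radiology report generation often suffers from clinical hallucinations and lacks the iterative verification found in human practice. Although recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight that characterizes clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH uses a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work shows that modeling human-like organizational structures improves the reliability of AI in high-stakes medical domains.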

Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

2604.16132v1 by Jessica H. Zhu, Shayla Stringfield, Vahe Zaprosyan, Michael Wagner, Michel Cukier, Joseph B. Richardson

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.

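Abstract (translated from the Chinese): Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and for designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advances in large language models (LLMs) have opened the door to automating this process, although concerns remain about whether these models can capture the experiences of vulnerable populations accurately and ethically. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results show that, while some LLM configurations can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and the limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.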

Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration

2604.16104v1 by Baramee Sukumal, Aueaphum Aueawatthanaphisut

Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, demonstrating strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.

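Abstract (translated from the Chinese): Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, is limited in distinguishing benign from malignant lesions and in providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system uses convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from the two modalities are fused with a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques, including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad, are applied to provide visual interpretability. Experimental results show strong performance, with accuracy up to 0.87, AUROC above 0.97, and a macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, showing strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.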
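The "weighted decision-level integration" mentioned in the abstract above can be pictured as a weighted average of the per-class probabilities from the two modalities. The sketch below is only an illustration under stated assumptions: the abstract does not give the fusion formula or the weights, so the convex-combination rule, the CT weight of 0.4, and the function name are hypothetical.

```python
CLASSES = [
    "adenocarcinoma",
    "squamous cell carcinoma",
    "large cell carcinoma",
    "small cell lung cancer",
    "normal",
]


def fuse_predictions(p_ct, p_histo, w_ct=0.4):
    """Fuse two per-class probability vectors with a convex weight.

    w_ct is a hypothetical weight for the CT branch; the histopathology
    branch receives 1 - w_ct. Returns (predicted_label, fused_probs).
    """
    if len(p_ct) != len(CLASSES) or len(p_histo) != len(CLASSES):
        raise ValueError("each input needs one probability per class")
    if not 0.0 <= w_ct <= 1.0:
        raise ValueError("w_ct must lie in [0, 1]")
    fused = [w_ct * a + (1.0 - w_ct) * b for a, b in zip(p_ct, p_histo)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return CLASSES[best], fused


label, fused = fuse_predictions(
    [0.50, 0.20, 0.10, 0.10, 0.10],   # CT branch output (made-up numbers)
    [0.20, 0.60, 0.10, 0.05, 0.05],   # histopathology branch output (made-up)
)
```

With these made-up inputs the CT branch alone would pick adenocarcinoma, but the fused vector favors squamous cell carcinoma because the histopathology branch carries more weight.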

Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

2604.16042v2 by Yutong Gao, Qinglin Meng, Yuan Zhou, Liangming Pan

While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.

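Abstract (translated from the Chinese): Although Large Language Models (LLMs) have achieved strong performance on many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys of explainable AI focus largely on post-hoc explanation methods, which interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper systematically reviews recent advances in intrinsic interpretability for LLMs and groups existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.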

Evaluating Temporal and Structural Anomaly Detection Paradigms for DDoS Traffic

2604.16575v1 by Yasmin Souza Lima, Rodrigo Moreira, Larissa F. Rodrigues Moreira, Tereza Cristina M. de B. Carvalho, Flávio de Oliveira Silva

Unsupervised anomaly detection is widely used to detect Distributed Denial-of-Service (DDoS) attacks in cloud-native 5G networks, yet most studies assume a fixed traffic representation, either temporal or structural, without validating which feature space best matches the data. We propose a lightweight decision framework that prioritizes temporal or structural features before training, using two diagnostics: lag-1 autocorrelation of an aggregated flow signal and PCA cumulative explained variance. When the probes are inconclusive, the framework reserves a hybrid option as a future fallback rather than an empirically validated branch. Experiments on two statistically distinct datasets with Isolation Forest, One-Class SVM, and KMeans show that structural features consistently match or outperform temporal ones, with the performance gap widening as temporal dependence weakens.

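Abstract (translated from the Chinese): Unsupervised anomaly detection is widely used to detect Distributed Denial-of-Service (DDoS) attacks in cloud-native 5G networks, yet most studies assume a fixed traffic representation, either temporal or structural, without validating which feature space best matches the data. We propose a lightweight decision framework that prioritizes temporal or structural features before training, using two diagnostics: the lag-1 autocorrelation of an aggregated flow signal and the PCA cumulative explained variance. When the probes are inconclusive, the framework reserves a hybrid option as a future fallback rather than as an empirically validated branch. Experiments on two statistically distinct datasets with Isolation Forest, One-Class SVM, and KMeans show that structural features consistently match or outperform temporal ones, with the performance gap widening as temporal dependence weakens.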
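The two diagnostics named in the abstract above, lag-1 autocorrelation and PCA cumulative explained variance, are standard quantities and can be computed in a few lines. The sketch below assumes NumPy is installed. The decision rule and its thresholds (0.5 for autocorrelation, 0.9 of variance in the first two components) are hypothetical, since the abstract does not state how the paper combines the two probes.

```python
import numpy as np


def lag1_autocorrelation(signal):
    """Sample lag-1 autocorrelation of a 1-D signal."""
    x = np.asarray(signal, dtype=float)
    centered = x - x.mean()
    denom = float(np.dot(centered, centered))
    if denom == 0.0:
        return 0.0  # a constant signal carries no temporal structure
    return float(np.dot(centered[:-1], centered[1:]) / denom)


def pca_cumulative_variance(features):
    """Cumulative explained-variance ratio of the principal components."""
    X = np.asarray(features, dtype=float)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # eigvalsh returns ascending eigenvalues; reverse and clip round-off.
    eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    total = eigvals.sum()
    if total == 0.0:
        return np.zeros_like(eigvals)
    return np.cumsum(eigvals) / total


def choose_feature_space(signal, features, acf_threshold=0.5,
                         var_threshold=0.9, n_components=2):
    """Hypothetical rule: pick the view that exactly one probe favors."""
    acf = lag1_autocorrelation(signal)
    cum_var = pca_cumulative_variance(features)
    k = min(n_components, len(cum_var))
    temporal = abs(acf) >= acf_threshold
    structural = cum_var[k - 1] >= var_threshold
    if temporal and not structural:
        return "temporal"
    if structural and not temporal:
        return "structural"
    return "hybrid"  # inconclusive probes: the fallback the abstract mentions
```

A strongly autocorrelated aggregate signal with no dominant principal components would select the temporal view here, while weakly autocorrelated traffic whose features lie near a low-dimensional subspace would select the structural view.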

Towards Rigorous Explainability by Feature Attribution

2604.15898v1 by Olivier Létoffé, Xuanxiang Huang, Joao Marques-Silva

For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.

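Abstract (translated from the Chinese): For around a decade, non-symbolic methods have been the option of choice for explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, this lack of rigor is especially problematic. One prime example of a provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper gives an overview of ongoing efforts to use rigorous symbolic XAI methods as an alternative to non-rigorous non-symbolic approaches, specifically for assigning relative feature importance.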
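For readers unfamiliar with the quantity the abstract above discusses, the textbook Shapley value averages each player's marginal contribution over all orderings of the players. The sketch below computes it exactly for a small cooperative game. It illustrates the standard definition only: it is neither the SHAP tool's approximation algorithm nor the rigorous symbolic alternative the paper surveys, and the example game is made up.

```python
from itertools import permutations


def exact_shapley(n_players, value):
    """Exact Shapley values by averaging marginal contributions.

    value maps a frozenset of player indices to a real number. The cost
    grows as n_players!, so this is only practical for tiny games.
    """
    totals = [0.0] * n_players
    orderings = list(permutations(range(n_players)))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return [t / len(orderings) for t in totals]


def pair_game(coalition):
    """Made-up game: worth 1 only when players 0 and 1 are both present."""
    return 1.0 if {0, 1} <= coalition else 0.0


shapley = exact_shapley(3, pair_game)
```

In this game players 0 and 1 split the value equally and player 2, who never changes the outcome, receives zero; the three values sum to the worth of the full coalition.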

Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension

2604.15769v1 by Dongxin Guo, Jikun Wu, Siu Ming Yiu

Spiking transformers achieve competitive accuracy with conventional transformers while offering $38$-$57\times$ energy efficiency on neuromorphic hardware, yet no theoretical framework guides their design. This paper establishes the first comprehensive expressivity theory for spiking self-attention. We prove that spiking attention with Leaky Integrate-and-Fire neurons is a universal approximator of continuous permutation-equivariant functions, providing explicit spike circuit constructions including a novel lateral inhibition network for softmax normalization with proven $O(1/\sqrt{T})$ convergence. We derive tight spike-count lower bounds via rate-distortion theory: $\varepsilon$-approximation requires $Ω(L_f^2 nd/\varepsilon^2)$ spikes, with rigorous information-theoretic derivation. Our key insight is input-dependent bounds using measured effective dimensions ($d_{\text{eff}}=47$--$89$ for CIFAR/ImageNet), explaining why $T=4$ timesteps suffice despite worst-case $T \geq 10{,}000$ predictions. We provide concrete design rules with calibrated constants ($C=2.3$, 95\% CI: $[1.9, 2.7]$). Experiments on Spikformer, QKFormer, and SpikingResformer across vision and language benchmarks validate predictions with $R^2=0.97$ ($p<0.001$). Our framework provides the first principled foundation for neuromorphic transformer design.

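Abstract (translated from the Chinese): Spiking transformers achieve accuracy competitive with conventional transformers while offering $38$-$57\times$ energy efficiency on neuromorphic hardware, yet no theoretical framework guides their design. This paper establishes the first comprehensive expressivity theory for spiking self-attention. We prove that spiking attention with Leaky Integrate-and-Fire neurons is a universal approximator of continuous permutation-equivariant functions, and we provide explicit spike circuit constructions, including a novel lateral inhibition network for softmax normalization with proven $O(1/\sqrt{T})$ convergence. We derive tight lower bounds on spike counts via rate-distortion theory: $\varepsilon$-approximation requires $Ω(L_f^2 nd/\varepsilon^2)$ spikes, with a rigorous information-theoretic derivation. Our key insight is input-dependent bounds based on measured effective dimensions ($d_{\text{eff}}=47$--$89$ for CIFAR/ImageNet), which explain why $T=4$ timesteps suffice even though worst-case analysis predicts $T \geq 10{,}000$. We provide concrete design rules with calibrated constants ($C=2.3$, 95% CI: $[1.9, 2.7]$). Experiments on Spikformer, QKFormer, and SpikingResformer across vision and language benchmarks validate the predictions with $R^2=0.97$ ($p<0.001$). Our framework provides the first principled foundation for neuromorphic transformer design.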

LLM Reasoning Is Latent, Not the Chain of Thought

2604.15726v1 by Wenshuo Wang

This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.

LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

2604.15589v1 by Jack Wei Lun Shi, Minghao Dang, Wawan Solihin, Justin K. W. Yeoh

Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors of LLMs across different fine-tuning strategies such as full fine-tuning (FFT), low-rank adaptation (LoRA) and quantized LoRA fine-tuning, as well as the impact of model scales which include varying LLM parameter sizes. Our results show that FFT produces attribution patterns that are statistically different and more focused than those from parameter-efficient fine-tuning methods. Furthermore, we found that as model scale increases, LLMs develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers in the building text, albeit with performance gains in semantic similarity of the generated and reference computer-processable rules plateauing for models larger than 7B. This paper provides crucial insights into the explainability of these models, taking a step toward building more transparent LLMs for critical, regulation-based tasks in the Architecture, Engineering, and Construction industry.

Towards Reliable Testing of Machine Unlearning

2604.16536v1 by Anna Mazhar, Sainyam Galhotra

Machine learning components are now central to AI-infused software systems, from recommendations and code assistants to clinical decision support. As regulations and governance frameworks increasingly require deleting sensitive data from deployed models, machine unlearning is emerging as a practical alternative to full retraining. However, unlearning introduces a software quality-assurance challenge: under realistic deployment constraints and imperfect oracles, how can we test that a model no longer relies on targeted information? This paper frames unlearning testing as a first-class software engineering problem. We argue that practical unlearning tests must provide (i) thorough coverage over proxy and mediated influence pathways, (ii) debuggable diagnostics that localize where leakage persists, (iii) cost-effective regression-style execution under query budgets, and (iv) black-box applicability for API-deployed models. We outline a causal, pathway-centric perspective, causal fuzzing, that generates budgeted interventions to estimate residual direct and indirect effects and produce actionable "leakage reports". Proof-of-concept results illustrate that standard attribution checks can miss residual influence due to proxy pathways, cancellation effects, and subgroup masking, motivating causal testing as a promising direction for unlearning testing.

Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models

2604.16532v1 by Emily Curl, Kofi Ampomah, Md Erfan, Sayanton Dibbo

While deep learning systems are becoming increasingly prevalent in medical image analysis, their vulnerabilities to adversarial perturbations raise serious concerns for clinical deployment. These vulnerability evaluations largely rely on Attack Success Rate (ASR), a binary metric that indicates solely whether an attack is successful. However, the ASR metric does not account for other factors, such as perturbation strength, perceptual image quality, and cross-architecture attack transferability, and therefore, the interpretation is incomplete. This gap requires consideration, as complex, large-scale deep learning systems, including Vision Transformers (ViTs), are increasingly challenging the dominance of Convolutional Neural Networks (CNNs). These architectures learn differently, and it is unclear whether a single metric, e.g., ASR, can effectively capture adversarial behavior. To address this, we perform a systematic empirical study on four medical image datasets: PathMNIST, DermaMNIST, RetinaMNIST, and CheXpert. We evaluate seven models (VGG-16, ResNet-50, DenseNet-121, Inception-v3, DeiT, Swin Transformer, and ViT-B/16) against seven attack methods at five perturbation budgets, measuring ASR, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and $L_2$ perturbation magnitude. Our findings show a consistent pattern: perceptual and distortion metrics are strongly associated with one another and exhibit minimal correlation with ASR. This applies to both CNNs and ViTs. The results demonstrate that ASR alone is an inadequate indicator of adversarial robustness and transferability. Consequently, we argue that a thorough assessment of adversarial risk in medical AI necessitates multi-metric frameworks that encompass not only the attack efficacy but also its methodology and associated overheads.
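The metrics named in this abstract can be illustrated with a minimal sketch. It assumes images are flat lists of pixel values in [0, 1]; the function names and the toy data are hypothetical and are not taken from the paper. SSIM is omitted because it needs windowed statistics.

```python
import math

def attack_success_rate(true_labels, adversarial_predictions):
    """Fraction of samples whose adversarial prediction differs from the true label."""
    flipped = sum(1 for y, p in zip(true_labels, adversarial_predictions) if y != p)
    return flipped / len(true_labels)

def l2_norm(clean, adv):
    """L2 magnitude of the perturbation (adv - clean)."""
    return math.sqrt(sum((a - c) ** 2 for c, a in zip(clean, adv)))

def psnr(clean, adv, max_value=1.0):
    """Peak Signal-to-Noise Ratio in dB; infinite when the images are identical."""
    mse = sum((a - c) ** 2 for c, a in zip(clean, adv)) / len(clean)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 4-pixel "image" and a perturbed copy (illustrative values only).
clean = [0.0, 0.5, 1.0, 0.25]
adv = [0.1, 0.5, 0.9, 0.25]
print(l2_norm(clean, adv), psnr(clean, adv))
```

ASR only records whether the label flipped, while the L2 norm and PSNR describe how large and how visible the perturbation is, which is why the abstract treats them as complementary.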

DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

2604.15456v1 by Zhizheng Wang, Chih-Hsuan Wei, Joey Chan, Robert Leaman, Chi-Ping Day, Chuan Wu, Mark A Knepper, Antolin Serrano Farias, Jordina Rincon-Torroella, Hasan Slika, Betty Tyler, Ryan Huu-Tuan Nguyen, Asmita Indurkar, Mélanie Hébert, Shubo Tian, Lauren He, Noor Naffakh, Aseem Aseem, Nicholas Wan, Emily Y Chew, Tiarnan D L Keenan, Zhiyong Lu

Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med's conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

2604.15231v1 by Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

Vision-language models (VLMs) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.

Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF

2604.16528v1 by Nicklas Neu, Thomas Ebner, Jasmin Primus, Bernhard Schenkenfelder, Raphael Zefferer, Mathias Brunbauer, Florian Kromp

Embryo selection is one of multiple crucial steps in in-vitro fertilization, commonly based on morphological assessment by clinical embryologists. Although artificial intelligence methods have demonstrated their potential to support embryo selection by automated embryo ranking or grading methods, the overall impact of AI-based solutions is still limited. This is mainly due to the required adaptation of automated solutions to custom clinical data, reliance on time lapse incubators and a lack of interpretability to understand AI reasoning. The modern, informed patient is questioning expert decisions, particularly if the treatment is not successful. Thus, evidence-based decision justification in tasks like embryo selection would support transparent decision making and respectful patient communication. To support this aim, we hereby present an expert-annotated dataset consisting of embryo images and corresponding morphological description using natural language. The description contains relevant information on embryonic cell cycle, developmental stage and morphological features. This dataset enables the finetuning of modern foundational vision-language models to learn and improve over time with high accuracy. Predicted embryo descriptions can then be leveraged to automatically extract scientific evidence from literature, facilitating well-informed, evidence-based decision-making and transparent communication with patients. Our proposed dataset supports research in language-based, interpretable, and transparent automated embryo assessment and has the potential to enhance the decision-making process and improve patient outcomes significantly over time.

Agentic Explainability at Scale: Between Corporate Fears and XAI Needs

2604.14984v1 by Yomna Elsayed, Cecily Jones

As companies enter the race for agentic AI adoption, fears surface around agentic autonomy and its subsequent risks. These fears compound as companies scale their agentic AI adoption with low-code applications without a comparable scaling in their governance processes and expertise, resulting in a phenomenon known as "Agent Sprawl". While shadow AI tools can help with agentic discovery and identification, few observability tools offer insights into the agents' configuration and settings or the decision-making process during agent-to-agent communication and orchestration. This paper explores AI governance professionals' concerns in enterprise settings, while offering design-time and runtime explainability techniques as suggested by AI governance experts for addressing those fears. Finally, we provide a preliminary prototype of an Agentic AI Card that can help companies feel at ease deploying agents at scale.

Hybrid Decision Making via Conformal VLM-generated Guidance

2604.14980v2 by Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini

Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.
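The idea of using conformal risk control to cap the false negative rate can be sketched generically as follows. This is not ConfGuide's actual procedure: it is a minimal illustration of the standard conformal risk control recipe for multi-label prediction sets, assuming per-label probabilities from some model. The function names, the threshold grid, and the toy calibration data are all hypothetical.

```python
def prediction_set(probs, threshold):
    """Labels whose predicted probability reaches the threshold."""
    return {k for k, p in enumerate(probs) if p >= threshold}

def false_negative_rate(probs, true_labels, threshold):
    """Share of the true labels that the prediction set misses."""
    if not true_labels:
        return 0.0
    missed = true_labels - prediction_set(probs, threshold)
    return len(missed) / len(true_labels)

def calibrate_threshold(cal_probs, cal_labels, alpha, grid_size=101):
    """Largest grid threshold t with (n * risk(t) + 1) / (n + 1) <= alpha.

    The "+ 1" is the worst-case loss of one unseen test point (the loss is
    bounded by 1). If even t = 0 fails the check, 0.0 is returned, which
    selects every label.
    """
    n = len(cal_probs)
    best = 0.0
    for i in range(grid_size):
        t = i / (grid_size - 1)
        risk = sum(false_negative_rate(p, y, t)
                   for p, y in zip(cal_probs, cal_labels)) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            best = t
        else:
            break  # the empirical FNR is non-decreasing in t
    return best

# Toy calibration set: 4 cases, 3 candidate diagnoses (illustrative only).
cal_probs = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.6], [0.7, 0.1, 0.4], [0.2, 0.5, 0.9]]
cal_labels = [{0}, {1, 2}, {0}, {2}]
t_hat = calibrate_threshold(cal_probs, cal_labels, alpha=0.25)
print(t_hat, prediction_set([0.65, 0.3, 0.7], t_hat))
```

A higher threshold yields smaller, more targeted outcome sets, so the calibration picks the largest threshold that still respects the risk budget `alpha`.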

Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

2604.14892v2 by Amy Rouillard, Sitwala Mundia, Linda Camara, Michael Cameron Gramanie, Ziyaad Dangor, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world middle-income country (MIC) hospital cases. Model performance was benchmarked against expert clinician panel and independent human re-scoring panel evaluations. Both LLM and clinician-generated diagnoses are scored across four dimensions: diagnosis, differential diagnosis, clinical reasoning and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than clinician panel scores; (ii) the LLM jury preserves ordinal agreement and exhibits better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower for the LLM jury models than for the human expert re-score panels; (iv) the LLM jury shows excellent agreement with primary expert panels' rankings, and, combined with AI model diagnoses, it can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) LLM jury models show no self-preference bias: they did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Finally, we demonstrate that LLM jury calibration using isotonic regression improves alignment with human expert panel evaluations. Together, these results provide compelling evidence that a calibrated, multi-model LLM jury can serve as a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking.
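Post-hoc calibration with isotonic regression can be sketched as below. This is a minimal pure-Python version of the pool-adjacent-violators algorithm, not the authors' pipeline; the function names and the toy jury and panel scores are hypothetical, and the sketch assumes distinct jury scores in the calibration data.

```python
import bisect

def isotonic_fit(x, y):
    """Pool-adjacent-violators: least-squares non-decreasing fit of y against x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    blocks = []  # each block holds [sum of y, count]
    for i in order:
        blocks.append([y[i], 1])
        # Merge backwards while the previous block mean exceeds the new one.
        while len(blocks) > 1 and (blocks[-2][0] / blocks[-2][1]
                                   > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return [x[i] for i in order], fitted

def isotonic_predict(xs, fitted, value):
    """Map a raw score to a calibrated one by linear interpolation, clipped at the ends."""
    if value <= xs[0]:
        return fitted[0]
    if value >= xs[-1]:
        return fitted[-1]
    j = bisect.bisect_right(xs, value)
    w = (value - xs[j - 1]) / (xs[j] - xs[j - 1])
    return fitted[j - 1] + w * (fitted[j] - fitted[j - 1])

# Toy data: raw LLM-jury scores and the matching expert-panel scores.
jury = [1.0, 2.0, 3.0, 4.0, 5.0]
panel = [2.0, 3.0, 2.5, 4.5, 5.0]
xs, fitted = isotonic_fit(jury, panel)
print(fitted, isotonic_predict(xs, fitted, 3.5))
```

Because the fitted map is monotone, calibration can shift systematically low jury scores toward the panel scale without changing the jury's ranking of diagnoses, except that scores falling in a pooled block become ties.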

M2-PALE: A Framework for Explaining Multi-Agent MCTS-Minimax Hybrids via Process Mining and LLMs

2604.14687v1 by Yiyu Qian, Liyuan Zhao, Tim Miller

Monte-Carlo Tree Search (MCTS) is a fundamental sampling-based search algorithm widely used for online planning in sequential decision-making domains. Despite its success in driving recent advances in artificial intelligence, understanding the behavior of MCTS agents remains a challenge for both developers and users. This difficulty stems from the complex search trees produced through the simulation of numerous future states and their intricate relationships. A known weakness of standard MCTS is its reliance on highly selective tree construction, which may lead to the omission of crucial moves and a vulnerability to tactical traps. To resolve this, we incorporate shallow, full-width Minimax search into the rollout phase of multi-agent MCTS to enhance strategic depth. Furthermore, to demystify the resulting decision-making logic, we introduce M2-PALE (MCTS-Minimax Process-Aided Linguistic Explanations). This framework employs process mining techniques, specifically the Alpha Miner, iDHM, and Inductive Miner algorithms, to extract underlying behavioral workflows from agent execution traces. These process models are then synthesized by LLMs to generate human-readable causal and distal explanations. We demonstrate the efficacy of our approach in a small-scale checkers environment, establishing a scalable foundation for interpreting hybrid agents in increasingly complex strategic domains.

Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks

2604.15390v2 by Seyedreza Mohseni, Sarvesh Baskar, Edward Raff, Manas Gaur

Code deobfuscation is the task of recovering a readable version of a program while preserving its original behavior. In practice, this often requires days or even months of manual work with complex and expensive analysis tools. In this paper, we explore an alternative approach based on Chain-of-Thought (CoT) prompting, where a large language model is guided through explicit, step-by-step reasoning tailored for code analysis. We focus on control flow obfuscation, including Control Flow Flattening (CFF), Opaque Predicates, and their combination, and we measure both structural recovery of the control flow graph and preservation of program semantics. We evaluate five state-of-the-art large language models and show that CoT prompting significantly improves deobfuscation quality compared with simple prompting. We validate our approach on a diverse set of standard C benchmarks and report results using both structural metrics for control flow graphs and semantic metrics based on output similarity. Among the tested models and by applying CoT, GPT5 achieves the strongest overall performance, with an average gain of about 16% in control-flow graph reconstruction and about 20.5% in semantic preservation across our benchmarks compared to zero-shot prompting. Our results also show that model performance depends not only on the obfuscation level and the chosen obfuscator but also on the intrinsic complexity of the original control flow graph. Collectively, these findings suggest that CoT-guided large language models can serve as effective assistants for code deobfuscation, providing improved code explainability, more faithful control flow graph reconstruction, and better preservation of program behavior while potentially reducing the manual effort needed for reverse engineering.

Rethinking Patient Education as Multi-turn Multi-modal Interaction

2604.14656v1 by Zonghai Yao, Zhipeng Tang, Chengtao Lin, Xiong Luo, Benlu Wang, Juncheng Huang, Chin Siang Ong, Hong Yu

Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.

CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

2604.14615v1 by Yubin Kim, Salman Rahman, Samuel Schmidgall, Chunjong Park, A. Ali Heydari, Ahmed A. Metwally, Hong Yu, Xin Liu, Xuhai Xu, Yuzhe Yang, Maxwell A. Xu, Zhihan Zhang, Cynthia Breazeal, Tim Althoff, Petar Sirkovic, Ivor Rendulic, Annalisa Pawlosky, Nicolas Stroppa, Juraj Gottweis, Elahe Vedadi, Alan Karthikesalingam, Pushmeet Kohli, Vivek Natarajan, Mark Malhotra, Shwetak Patel, Hae Won Park, Hamid Palangi, Daniel McDuff

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, ρ = 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ = 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ = -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ = -0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS-derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross-validated ΔR^2 increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large-scale wearable data.
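The ρ values reported here read as rank correlations between a derived feature and an outcome. The sketch below shows how such a statistic could be computed for a ratio feature like steps divided by resting heart rate, assuming Spearman's ρ. It is an illustration only: the function names are hypothetical and the five-participant data set is invented, not drawn from the cohorts in the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

# Invented toy cohort: daily steps, resting heart rate, and an outcome score.
steps = [4000, 8000, 12000, 6000, 10000]
resting_hr = [80, 65, 55, 72, 60]
outcome = [3.1, 1.9, 1.2, 2.6, 2.0]

fitness_index = [s / hr for s, hr in zip(steps, resting_hr)]
print(spearman_rho(fitness_index, outcome))
```

A negative ρ in this toy setup means that participants with a higher fitness index tend to have a lower outcome score, matching the direction of the association reported for the metabolic cohort.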

Generative Augmented Inference

2604.14575v1 by Cheng Lu, Mengxin Wang, Dennis J. Zhang, Heng Zhang

Data-driven operations management often relies on parameters estimated from costly human-generated labels. Recent advances in large language models (LLMs) and other AI systems offer inexpensive auxiliary data, but introduce a new challenge: AI outputs are not direct observations of the target outcomes, but could involve high-dimensional representations with complex and unknown relationships to human labels. Conventional methods leverage AI predictions as direct proxies for true labels, which can be inefficient or unreliable when this relationship is weak or misspecified. We propose Generative Augmented Inference (GAI), a general framework that incorporates AI-generated outputs as informative features for estimating models of human-labeled outcomes. GAI uses an orthogonal moment construction that enables consistent estimation and valid inference with a flexible, nonparametric relationship between LLM-generated outputs and human labels. We establish asymptotic normality and show a "safe default" property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive. Empirically, GAI outperforms benchmarks across diverse settings. In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%. In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information. In health insurance choice, it cuts labeling requirements by over 90% while maintaining decision accuracy. Across applications, GAI improves confidence interval coverage without inflating width. Overall, GAI provides a principled and scalable approach to integrating AI-generated information.

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

2604.14514v1 by Michal Rosen-Zvi, Yoav Kan-Tor, Michael Danziger, Agata Ferretti, Javier Aula-Blasco, Julia Falcao, Ron Shamir, Mordechai Muszkat

Healthcare disparities persist across socioeconomic boundaries and are often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization and long before clinical implementation, when the studies and the data collected focus on the molecular level. A vast number of studies collect omics data, but the demographic information associated with these datasets is often not reported, and when it is reported, it shows large biases. An automated analysis of 4719 PubMed-indexed omics publications from 2015 to 2024 reveals that only a small fraction report ancestry or ethnicity information, with ancestry reporting improving slightly. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias, with European-ancestry data dominating. As biomedical foundation models become central to biomedical discovery, under a paradigm in which base models are pretrained on large datasets and reused time and again for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles, Provenance, Openness, and Evaluation Transparency, to improve equity and robustness in biomedical AI. This approach aims to foster biomedical innovation that more effectively serves underserved populations and improves health outcomes.

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

2604.14356v1 by Apoorv Prasad, Susan McRoy

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, and two trained annotators labeled the posts using guidelines that operationalize the clinical framework of Lee et al. (2017). Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.
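
To make the headline metric concrete, the following is a minimal sketch of exact match accuracy for multi-label predictions. It assumes each post carries a set of labels and that a prediction counts only when the whole set matches; the label names and the toy data are hypothetical and are not taken from the paper.

```python
def exact_match_accuracy(gold, pred):
    """Fraction of items whose predicted label set equals the gold label set."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one item")
    hits = sum(1 for g, p in zip(gold, pred) if set(g) == set(p))
    return hits / len(gold)


# Hypothetical label names, for illustration only.
gold = [
    {"body_image", "disordered_eating"},
    {"metabolic"},
    {"body_image", "disordered_eating", "metabolic"},
    set(),  # a post with none of the three conditions
]
pred = [
    {"body_image", "disordered_eating"},
    {"metabolic", "body_image"},  # one extra label, so not an exact match
    {"body_image", "disordered_eating", "metabolic"},
    set(),
]

accuracy = exact_match_accuracy(gold, pred)
print(accuracy)
```

Under this definition a partially correct prediction earns no credit, which is one plausible reason a score would fall as the number of co-occurring conditions grows.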

Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

2604.14325v1 by Bar Alon, Itamar Zimerman, Lior Wolf

Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.

Seeing Through Experts Eyes A Foundational Vision Language Model Trained on Radiologists Gaze and Reasoning

2604.14316v1 by Kinhei Lee, Peiyuan Jing, Zhenxuan Zhang, Yue Yang, Tao Wang, Dominic C Marshall, Yingying Fang, Guang Yang

Large-scale vision-language models have shown promise in automating chest X-ray interpretation, yet their clinical utility remains limited by a gap between model outputs and radiologist reasoning. Most systems optimize for semantic information without emulating how experts visually examine medical images, often overlooking critical findings or diverging from established diagnostic workflows. Radiologists follow structured protocols (e.g., the ABCDEF approach) that ensure all clinically relevant regions are systematically examined, reducing missed findings and supporting reliable diagnostic reasoning. We introduce GazeX, a vision-language model that leverages radiologists' eye-tracking data as a behavioral prior to model expert diagnostic reasoning. By incorporating gaze trajectories and fixation patterns into pretraining, GazeX learns to follow the spatial and temporal structure of radiologist attention and integrates observations in a clinically meaningful sequence. Using a curated dataset of over 30,000 gaze key frames from five radiologists, we demonstrate that GazeX produces more accurate, interpretable, and expert-consistent outputs across radiology report generation, disease grounding, and visual question answering, utilizing 231,835 radiographic studies, 780,014 question-answer pairs, and 1,162 image-sentence pairs with bounding boxes. Unlike autonomous reporting systems, GazeX produces verifiable evidence artifacts, including inspection trajectories and finding-linked localized regions, enabling efficient human verification and safe human-AI collaboration. Learning through expert eyes provides a practical route toward more trustworthy, explainable, and diagnostically robust AI systems for radiology and beyond.

EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

2604.14306v2 by Francesco Andrea Causio, Vittorio De Vita, Olivia Riccomi, Michele Ferramola, Federico Felizzi, Alessandro Tosi, Antonio Cristiano, Lorenzo De Mori, Chiara Battipaglia, Melissa Sawaya, Luigi De Angelis, Marcello Di Pumpo, Alessandra Piscitelli, Pietro Eric Risuleo, Alessia Longo, Giulia Vojvodic, Mariapia Vassalli, Bianca Destro Castaniti, Nicolò Scarsi, Manuel Del Medico

While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.

Quantum-inspired tensor networks in machine learning models

2604.14287v1 by Guillermo Valverde, Igor García-Olaizola, Giannicola Scarpa, Alejandro Pozas-Kerstjens

Tensor networks were developed in the context of many-body physics as compressed representations of multiparticle quantum states. These representations mitigate the exponential complexity of many-body systems by capturing only the most relevant dependencies. Due to the formal similarity between quantum entanglement and statistical correlations, tensor networks have recently been integrated in machine learning, operating both as alternative learning architectures and as decompositions of components of neural networks. The expectation is that the theoretical understanding of tensor networks developed within quantum many-body physics leads to novel methods that offer advantages in terms of computational efficiency, explainability, or privacy. Here we review the use of tensor networks in the context of machine learning, providing a critical assessment of the state of the art, the potential advantages, and the challenges that must be overcome.

Applied Explainability for Large Language Models: A Comparative Study

2604.15371v1 by Venkata Abhinandan Kancharla

Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques: Integrated Gradients, Attention Rollout, and SHAP, on a fine-tuned DistilBERT model for SST-2 sentiment classification. Rather than proposing new methods, the focus is on evaluating the practical behavior of existing approaches under a consistent and reproducible setup. The results show that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features. Model-agnostic approaches offer flexibility but introduce higher computational cost and variability. This work highlights key trade-offs between explainability methods and emphasizes their role as diagnostic tools rather than definitive explanations. The findings provide practical insights for researchers and engineers working with transformer-based NLP systems. This is a preprint and has not undergone peer review.
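
As a rough illustration of gradient-based attribution, the sketch below applies Integrated Gradients to a toy logistic model rather than to DistilBERT. The weights, input, and all-zeros baseline are hypothetical, and the path integral is approximated with a midpoint Riemann sum. It also checks the completeness property, which says the attributions should sum to the difference between the model output at the input and at the baseline.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy logistic model standing in for a classifier's output probability."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of the logistic output with respect to the input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=1000):
    """Average the gradient along the straight path from baseline to x,
    then scale each component by (x_i - baseline_i)."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        grad = model_grad(point, w, b)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    return [(xi - bi) * gi for xi, bi, gi in zip(x, baseline, avg_grad)]


# Hypothetical weights and input, for illustration only.
w = [1.5, -2.0, 0.5]
b = 0.1
x = [1.0, 0.5, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
completeness_gap = abs(
    sum(attributions) - (model(x, w, b) - model(baseline, w, b))
)
print(attributions, completeness_gap)
```

The choice of baseline is an assumption with real consequences: a different reference input changes every attribution, which fits the abstract's framing of these methods as diagnostic tools rather than definitive explanations.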

Med-CAM: Minimal Evidence for Explaining Medical Decision Making

2604.13695v1 by Pirzada Suhail, Aditya Anand, Amit Sethi

Reliable and interpretable decision-making is essential in medical imaging, where diagnostic outcomes directly influence patient care. Despite advances in deep learning, most medical AI systems operate as opaque black boxes, providing little insight into why a particular diagnosis was reached. In this paper, we introduce Med-CAM, a framework for generating minimal and sharp maps as evidence-based explanations for Medical decision making via Classifier Activation Matching. Med-CAM trains a segmentation network from scratch to produce a mask that highlights the minimal evidence critical to the model's decision for any seen or unseen image. This ensures that the explanation is both faithful to the network's behaviour and interpretable to clinicians. Experiments show that, unlike prior spatial explanation methods such as Grad-CAM and attention maps, which yield only fuzzy regions of relative importance, Med-CAM, with its superior spatial awareness of shapes, textures, and boundaries, delivers conclusive, evidence-based explanations that faithfully replicate the model's prediction for any given image. By explicitly constraining explanations to be compact, consistent with model activations, and diagnostically aligned, Med-CAM advances transparent AI to foster clinician understanding and trust in high-stakes medical applications such as pathology and radiology.

Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment

2604.13462v1 by Eileen Kapel, Jan Lennartz, Luis Cruz, Diomidis Spinellis, Arie van Deursen

Effective IT change management is important for businesses that depend on software and services, particularly in highly regulated sectors such as finance, where operational reliability, auditability, and explainability are essential. A significant portion of IT incidents are caused by changes, making it important to identify high-risk changes before deployment. This study presents a predictive incident risk scoring approach at a large international bank. The approach supports engineers during the assessment and planning phases of change deployments by predicting the potential of inducing incidents. To satisfy regulatory constraints, we built the model with auditability and explainability in mind, applying SHAP values to provide feature-level insights and ensure decisions are traceable and transparent. Using a one-year real-world dataset, we compare the existing rule-based process with three machine learning models: HGBC, LightGBM, and XGBoost. LightGBM achieved the best performance, particularly when enriched with aggregated team metrics that capture organisational context. Our results show that data-driven, interpretable models can outperform rule-based approaches while meeting compliance needs, enabling proactive risk mitigation and more reliable IT operations.
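
To show what a feature-level Shapley attribution looks like, here is a small sketch that computes exact Shapley values by enumerating feature subsets. It is not the SHAP library and not the bank's model: the risk function, feature names, and values are hypothetical, and "absent" features are simply replaced by a baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, x, baseline):
    """Exact Shapley values for value_fn at x, relative to a baseline input.

    Features outside the coalition take their baseline value. The cost grows
    exponentially with the number of features, so this only suits toy cases.
    """
    n = len(x)
    features = list(range(n))

    def v(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in features]
        return value_fn(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (v(s | {i}) - v(s))
    return phi


def risk_score(features):
    """Hypothetical change-risk score with one interaction term."""
    lines_changed, off_hours, team_incident_rate = features
    return (
        0.002 * lines_changed
        + 0.3 * off_hours
        + 2.0 * team_incident_rate * off_hours
    )


x = [150.0, 1.0, 0.1]
baseline = [0.0, 0.0, 0.0]

phi = shapley_values(risk_score, x, baseline)
efficiency_gap = abs(sum(phi) - (risk_score(x) - risk_score(baseline)))
print(phi, efficiency_gap)
```

In this toy case the interaction between the off-hours flag and the team incident rate is split evenly between those two features, and the attributions add up to the gap between the score and the baseline score, the additive property that makes such values traceable per decision.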

Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making

2604.14240v1 by Pramudita Satria Palar, Paul Saves, Muhammad Daffa Robani, Nicolas Verstaevel, Moncef Garouani, Julien Aligon, Koji Shimoyama, Joseph Morlier, Benoit Gaudou

The simulation of complex systems increasingly relies on sophisticated but fundamentally opaque computational black-box simulators. Surrogate models play a central role in reducing the computational cost of complex systems simulations across a wide range of scientific and engineering domains. Notwithstanding, they inevitably inherit and often exacerbate this black-box nature, obscuring how input variables drive physical responses. Conversely, Explainable Artificial Intelligence (XAI) offers powerful tools to unpack these models. Yet, XAI methods struggle with engineering-specific constraints, such as highly correlated inputs, dynamical systems, and rigorous reliability requirements. Consequently, surrogate modeling and XAI have largely evolved as distinct fields of research, despite their strong complementarity. To reconnect these approaches, this state-of-the-art survey provides a structured perspective that maps existing XAI techniques onto the various stages of surrogate modeling workflows for design and exploration. To ground this synthesis, we draw upon illustrative applications across both equation-based simulations and agent-based modeling. We survey a broad spectrum of techniques, highlighting their strengths for revealing interactions and supporting human comprehension. Finally, we identify pressing open challenges, including the explainability of dynamical systems and the handling of mixed-variable systems, and propose a research agenda to make explainability a core, embedded element of simulation-driven workflows from model construction through decision-making. By transforming opaque emulators into explainable tools, this agenda empowers practitioners to move beyond accelerating simulations to extracting actionable insights from complex system behaviors.

ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

2604.13392v1 by Chenlang Yi, Gang Li, Zizhan Xiong, Tue Minh Cao, Yanmin Gong, My T. Thai, Tianbao Yang

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve on traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.
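
The idea of turning a decision path into a symbolic scaffold can be sketched as follows. This is only an illustration under stated assumptions: the tree is hand-written rather than learned, the feature names and thresholds are hypothetical, and the prompt wording is invented here rather than taken from ReSS.

```python
# Hypothetical hand-written tree; a real pipeline would learn it from data.
TREE = {
    "feature": "glucose",
    "threshold": 126.0,
    "left": {
        "feature": "bmi",
        "threshold": 30.0,
        "left": {"label": "low_risk"},
        "right": {"label": "medium_risk"},
    },
    "right": {"label": "high_risk"},
}


def decision_path(tree, instance):
    """Walk the tree for one instance; go left when value <= threshold.

    Returns the list of conditions that were satisfied and the leaf label.
    """
    conditions = []
    node = tree
    while "label" not in node:
        name, threshold = node["feature"], node["threshold"]
        value = instance[name]
        if value <= threshold:
            conditions.append(f"{name} = {value} <= {threshold}")
            node = node["left"]
        else:
            conditions.append(f"{name} = {value} > {threshold}")
            node = node["right"]
    return conditions, node["label"]


def scaffold_prompt(conditions, label):
    """Format the path as numbered steps an LLM could be asked to verbalize."""
    steps = "\n".join(f"{i}. {c}" for i, c in enumerate(conditions, start=1))
    return (
        "Explain the prediction using only these decision steps:\n"
        f"{steps}\nPrediction: {label}"
    )


conditions, label = decision_path(TREE, {"glucose": 110.0, "bmi": 32.5})
prompt = scaffold_prompt(conditions, label)
print(prompt)
```

Because the scaffold lists every condition on the path, a generated explanation can later be checked against it, which is the kind of comparison a hallucination-rate metric would need.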

Young people's perceptions and recommendations for conversational generative artificial intelligence in youth mental health

2604.13381v1 by Adam Poulsen, Ian B. Hickie, Carla Gorban, Zsofi de Haan, William Capon, Ebenezer Eyeson-Annan, Jalal Radwan, Elizabeth M. Scott, Frank Iorfino, Haley M. LaMonica

Conversational generative artificial intelligence agents (or genAI chatbots) could benefit youth mental health, yet young people's perspectives remain underexplored. We examined the Mental health Intelligence Agent (Mia), a genAI chatbot originally designed for professionals in Australian youth services. Following co-design, 32 young people participated in online workshops to explore their perceptions of genAI chatbots in youth mental health and to develop recommendations for reconceptualising Mia for consumers and integrating it into services. Four themes were developed: (1) Humanising AI without dehumanising care, (2) I need to know what's under the hood, (3) Right tool, right place, right time?, and (4) Making it mine on safe ground. This study offers insights into young people's attitudes, needs, and requirements regarding genAI chatbots in youth mental health, with key implications for service integration. Additionally, by co-designing system requirements, this work informs the ethics, design, development, implementation, and governance of genAI chatbots in youth mental health contexts.

Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition

2604.13279v1 by Mohammad Saleh, Azadeh Tabatabaei

Fall detection in elderly care requires not only accurate classification but also reliable explanations that clinicians can trust. However, existing post-hoc explainability methods, when applied frame-by-frame to sequential data, produce temporally unstable attribution maps that clinicians cannot reliably act upon. To address this issue, we propose a lightweight and explainable framework for skeleton-based fall detection that combines an efficient LSTM model with T-SHAP, a temporally aware post-hoc aggregation strategy that stabilizes SHAP-based feature attributions over contiguous time windows. Unlike standard SHAP, which treats each frame independently, T-SHAP applies a linear smoothing operator to the attribution sequence, reducing high-frequency variance while preserving the theoretical guarantees of Shapley values, including local accuracy and consistency. Experiments on the NTU RGB+D Dataset demonstrate that the proposed framework achieves 94.3% classification accuracy with an end-to-end inference latency below 25 milliseconds, satisfying real-time constraints on mid-range hardware and indicating strong potential for deployment in clinical monitoring scenarios. Quantitative evaluation using perturbation-based faithfulness metrics shows that T-SHAP improves explanation reliability compared to standard SHAP (AUP: 0.89 vs. 0.91) and Grad-CAM (0.82), with consistent improvements observed across five-fold cross-validation. The resulting attributions consistently highlight biomechanically relevant motion patterns, including lower-limb instability and changes in spinal alignment, aligning with established clinical observations of fall dynamics and supporting their use as transparent decision aids in long-term care environments.
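
The abstract does not spell out the smoothing operator, so the following sketch assumes a centered moving average with truncated edges as one possible linear choice. The per-frame attributions are hypothetical toy numbers. Because the operator is linear, the smoothed attributions at a frame sum to the same moving average of the per-frame attribution totals, which is the sense in which additivity carries over in this sketch.

```python
def smooth_attributions(attr, window=3):
    """Centered moving average over time.

    attr[t][j] is the attribution of feature j at frame t. Edge frames
    average over the part of the window that exists.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n_frames = len(attr)
    smoothed = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        span = attr[lo:hi]
        smoothed.append([sum(col) / len(span) for col in zip(*span)])
    return smoothed


def total_variation(attr):
    """Sum of absolute frame-to-frame changes, a simple instability measure."""
    return sum(
        abs(b - a)
        for prev, cur in zip(attr, attr[1:])
        for a, b in zip(prev, cur)
    )


# Hypothetical per-frame attributions for two features.
raw = [
    [0.10, 0.50],
    [0.40, 0.10],
    [0.05, 0.55],
    [0.45, 0.15],
    [0.10, 0.60],
]
smooth = smooth_attributions(raw, window=3)
print(total_variation(raw), total_variation(smooth))
```

A wider window would suppress more frame-to-frame jitter at the cost of blurring short events, so the window length is a tuning choice rather than something fixed by the method as described.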

Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

2604.13258v1 by Vishal Pramanik, Maisha Maliha, Nathaniel D. Bastian, Sumit Kumar Jha

Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.

Explainable Graph Neural Networks for Interbank Contagion Surveillance: A Regulatory-Aligned Framework for the U.S. Banking Sector

2604.14232v1 by Mohammad Nasir Uddin

The Spatial-Temporal Graph Attention Network (ST-GAT) framework was created to serve as an explainable GNN-based solution for detecting bank distress early warning signs and for conducting macro-prudential surveillance of the interbank system in the United States. The ST-GAT framework models 8,103 FDIC insured institutions across 58 quarterly snapshots (2010Q1-2024Q2). Bilateral exposures were reconstructed from publicly available FDIC Call Reports using maximum entropy estimation to produce a dynamic directed weighted graph. The framework achieves the highest AUPRC among all GNN architectures (0.939 +/- 0.010), trailing only XGBoost (0.944). Ablation analysis confirms the BiLSTM temporal component contributes +0.020 AUPRC; temporal attention weights exhibit a monotonically decreasing pattern consistent with long-run structural vulnerability weighting. Permutation importance identifies ROA (0.309) and NPL Ratio (0.252) as dominant predictors, consistent with post-mortem analyses of the 2023 regional banking crisis. All data are publicly available FDIC Call Reports and FRED series; all code and results are released.
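
The abstract does not describe the estimation procedure in detail, so the sketch below shows one commonly used variant of maximum entropy reconstruction: start from an independence prior with no self-lending, then rescale rows and columns by iterative proportional fitting until the matrix matches each bank's total interbank assets and liabilities. The three banks and their totals are hypothetical.

```python
def max_entropy_exposures(assets, liabilities, iterations=500):
    """Estimate bilateral exposures x[i][j] (bank i lends to bank j).

    Row i should sum to assets[i] and column j to liabilities[j]; the
    diagonal stays zero because a bank does not lend to itself.
    """
    n = len(assets)
    if n != len(liabilities):
        raise ValueError("assets and liabilities must have the same length")
    if abs(sum(assets) - sum(liabilities)) > 1e-9:
        raise ValueError("total assets must equal total liabilities")

    # Independence prior with a zero diagonal.
    x = [
        [assets[i] * liabilities[j] if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    for _ in range(iterations):
        # Rescale each row to match the lender's total.
        for i in range(n):
            row_sum = sum(x[i])
            if row_sum > 0:
                scale = assets[i] / row_sum
                x[i] = [value * scale for value in x[i]]
        # Rescale each column to match the borrower's total.
        for j in range(n):
            col_sum = sum(x[i][j] for i in range(n))
            if col_sum > 0:
                scale = liabilities[j] / col_sum
                for i in range(n):
                    x[i][j] *= scale
    return x


# Hypothetical interbank totals for three banks.
assets = [30.0, 50.0, 20.0]
liabilities = [40.0, 25.0, 35.0]
exposures = max_entropy_exposures(assets, liabilities)
for row in exposures:
    print([round(value, 3) for value in row])
```

An estimate of this kind spreads exposures as evenly as the totals allow, so the reconstructed graph tends to be denser than a real interbank network.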

LLM

Publish Date Title Authors Homepage Code
2026-04-24 How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks Longju Bai et.al. 2604.22750v1 null
2026-04-24 Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities Ilana Nguyen et.al. 2604.22749v1 null
2026-04-24 Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond Meng Chu et.al. 2604.22748v1 null
2026-04-24 An Undecidability Proof for the Plan Existence Problem Antonis Achilleos et.al. 2604.22736v1 null
2026-04-24 Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data Hillary Mutisya et.al. 2604.22730v1 null
2026-04-24 Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering Hillary Mutisya et.al. 2604.22723v1 null
2026-04-24 Aligning Dense Retrievers with LLM Utility via Distillation Rajinder Sandhu et.al. 2604.22722v1 null
2026-04-24 Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought Keshav Ramji et.al. 2604.22709v1 null
2026-04-24 CRAFT: Clustered Regression for Adaptive Filtering of Training data Parthasarathi Panda et.al. 2604.22693v1 null
2026-04-24 How Supply Chain Dependencies Complicate Bias Measurement and Accountability Attribution in AI Hiring Applications Gauri Sharma et.al. 2604.22679v1 null
2026-04-24 BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering Jinghong Chen et.al. 2604.22678v1 null
2026-04-24 Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines Negar Arabzadeh et.al. 2604.22661v1 null
2026-04-24 Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models Felix Herron et.al. 2604.22631v1 null
2026-04-24 From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia Angelo Maria Sabatini et.al. 2604.22626v1 null
2026-04-24 Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube Sheza Munir et.al. 2604.22606v1 null
2026-04-24 From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification Md Erfan et.al. 2604.22601v1 null
2026-04-24 Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity Erez Yosef et.al. 2604.22597v1 null
2026-04-24 Learning Evidence Highlighting for Frozen LLMs Shaoang Li et.al. 2604.22565v1 null
2026-04-24 Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors Gautam Kumar Jain et.al. 2604.22560v1 null
2026-04-24 SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning Jichao Wang et.al. 2604.22558v1 null
2026-04-24 Using Embedding Models to Improve Probabilistic Race Prediction Noan Dasanaike et.al. 2604.22555v1 null
2026-04-24 ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders Yongqi Jiang et.al. 2604.22550v1 null
2026-04-24 Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners Haidong Yuan et.al. 2604.22542v1 null
2026-04-24 On the Properties of Feature Attribution for Supervised Contrastive Learning Leonardo Arrighi et.al. 2604.22540v1 null
2026-04-24 FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records Hojjat Karami et.al. 2604.22534v1 null
2026-04-24 RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment Yingfeng Luo et.al. 2604.22520v1 null
2026-04-24 Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement Wataru Hirota et.al. 2604.22517v1 null
2026-04-24 Measuring and Mitigating Persona Distortions from AI Writing Assistance Paul Röttger et.al. 2604.22503v1 null
2026-04-24 CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding Lihao Zheng et.al. 2604.22498v1 null
2026-04-24 On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery Anti Alman et.al. 2604.22455v1 null
2026-04-24 Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents Xirui Li et.al. 2604.22452v1 null
2026-04-24 SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking Chenxi Gu et.al. 2604.22438v1 null
2026-04-24 CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease Bulent Soykan et.al. 2604.22428v1 null
2026-04-24 Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control Qinhan Hou et.al. 2604.22413v1 null
2026-04-24 Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models Alberto Messina et.al. 2604.22411v1 null
2026-04-24 Selective Contrastive Learning For Gloss Free Sign Language Translation Changhao Lai et.al. 2604.22374v1 null
2026-04-24 CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language Rui Zhao et.al. 2604.22367v1 null
2026-04-24 Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization Weixu Zhang et.al. 2604.22345v1 null
2026-04-24 Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding Weixu Zhang et.al. 2604.22335v1 null
2026-04-24 ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding Dongwei Sun et.al. 2604.22333v1 null
2026-04-24 FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting Marco Obermeier et.al. 2604.22328v1 null
2026-04-24 Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks Fahmida Alam et.al. 2604.22325v1 null
2026-04-24 CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems Tabinda Sarwar et.al. 2604.22313v1 null
2026-04-24 BLAST: Benchmarking LLMs with ASP-based Structured Testing Manuel Alejandro Borroto Santana et.al. 2604.22306v1 null
2026-04-24 Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets Harshit Joshi et.al. 2604.22294v1 null
2026-04-24 ReLeVAnT: Relevance Lexical Vectors for Accurate Legal Text Classification Ishaan Gakhar et.al. 2604.22292v1 null
2026-04-24 When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention Aofan Liu et.al. 2604.22273v1 null
2026-04-24 Semantic Error Correction and Decoding for Short Block Channel Codes Jiafu Hao et.al. 2604.22269v1 null
2026-04-24 Large Language Models Decide Early and Explain Later Ayan Datta et.al. 2604.22266v1 null
2026-04-24 Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion Fahmida Alam et.al. 2604.22261v1 null
2026-04-24 Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset Wenhui Huang et.al. 2604.22260v1 null
2026-04-24 A Probabilistic Framework for Hierarchical Goal Recognition Chenyuan Zhang et.al. 2604.22256v1 null
2026-04-24 Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis Zhilin Fan et.al. 2604.22237v1 null
2026-04-24 A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies Somyajit Chakraborty et.al. 2604.22227v1 null
2026-04-24 TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis Xi Wang et.al. 2604.22225v1 null
2026-04-24 Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen Jon-Paul Cacioli et.al. 2604.22215v1 null
2026-04-24 UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions Chunyu Qiang et.al. 2604.22209v1 null
2026-04-24 Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations Anna Arnaudo et.al. 2604.22207v1 null
2026-04-24 An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments Hong Su et.al. 2604.22199v1 null
2026-04-24 How Large Language Models Balance Internal Knowledge with User and Document Assertions Shuowei Li et.al. 2604.22193v1 null
2026-04-24 Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning Chaoran Chen et.al. 2604.22191v1 null
2026-04-24 ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression Xiaojie Ke et.al. 2604.22180v1 null
2026-04-24 ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation Peiyan Zhang et.al. 2604.22169v1 null
2026-04-24 Estimating Tail Risks in Language Model Output Distributions Rico Angell et.al. 2604.22167v1 null
2026-04-24 Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models Ryoma Kumon et.al. 2604.22166v1 null
2026-04-24 GenMatter: Perceiving Physical Objects with Generative Matter Models Eric Li et.al. 2604.22160v1 null
2026-04-24 Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems Meghana Karnam et.al. 2604.22154v1 null
2026-04-24 When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models Pruthvinath Jeripity Venkata et.al. 2604.22153v1 null
2026-04-24 Recognition Without Authorization: LLMs and the Moral Order of Online Advice Tom van Nuenen et.al. 2604.22143v1 null
2026-04-24 Voice Under Revision: Large Language Models and the Normalization of Personal Narrative Tom van Nuenen et.al. 2604.22142v1 null
2026-04-24 SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs Sihang et.al. 2604.22134v1 null
2026-04-24 Dissociating Decodability and Causal Use in Bracket-Sequence Transformers Aryan Sharma et.al. 2604.22128v1 null
2026-04-24 Where Should LoRA Go? Component-Type Placement in Hybrid Language Models Hector Borobia et.al. 2604.22127v1 null
2026-04-23 Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework Tharindu Kumarage et.al. 2604.22119v1 null
2026-04-23 PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training Harsh Kumar et.al. 2604.22117v1 null
2026-04-23 Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations Nalin Poungpeth et.al. 2604.22109v1 null
2026-04-23 Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation Arthur Jakobsson et.al. 2604.22102v1 null
2026-04-23 Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation Weisi Liu et.al. 2604.22098v1 null
2026-04-23 An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation Mykola Trokhymovych et.al. 2604.22095v1 null
2026-04-23 Ethics Testing: Proactive Identification of Generative AI System Harms Shin Hwei Tan et.al. 2604.22089v1 null
2026-04-23 Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents Seyed Moein Abtahi et.al. 2604.22085v1 null
2026-04-23 Removing Sandbagging in LLMs by Training with Weak Supervision Emil Ryd et.al. 2604.22082v1 null
2026-04-23 Sound Agentic Science Requires Adversarial Experiments Dionizije Fa et.al. 2604.22080v1 null
2026-04-23 PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning Xiaoyi Chen et.al. 2604.22076v1 null
2026-04-23 Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning Qinan Yu et.al. 2604.22074v1 null
2026-04-23 Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning Amine Barrak et.al. 2604.22072v1 null
2026-04-23 Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake Guan Gui et.al. 2604.22067v1 null
2026-04-23 Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores Shevya Pandya et.al. 2604.22063v1 null
2026-04-23 Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning Karthic Palaniappan et.al. 2604.22062v1 null
2026-04-23 Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching Xiaodi Li et.al. 2604.22061v1 null
2026-04-23 LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs Mohamed Ali Souibgui et.al. 2604.22050v1 null
2026-04-23 Call-Chain-Aware LLM-Based Test Generation for Java Projects Guancheng Wang et.al. 2604.22046v1 null
2026-04-23 H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers Ayushi Mehrotra et.al. 2604.22045v1 null
2026-04-23 Source-Modality Monitoring in Vision-Language Models Etha Tianze Hua et.al. 2604.22038v1 null
2026-04-23 EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms Brian VanVoorst et.al. 2604.22036v1 null
2026-04-23 Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning João Mattos et.al. 2604.22031v1 null
2026-04-23 Shared Lexical Task Representations Explain Behavioral Variability In LLMs Zhuonan Yang et.al. 2604.22027v1 null
2026-04-23 Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity Deepank Girish et.al. 2604.22018v1 null
2026-04-23 When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation Anamta Khan et.al. 2604.22002v1 null
2026-04-23 Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning Grigory Sapunov et.al. 2604.21999v1 null

Abstracts

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks

2604.22750v1 by Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, Jiaxin Pei

The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models' ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.

摘要:AI 代理在複雜人類工作流程中的廣泛採用正在推動 LLM 令牌消耗的快速增長。當代理被部署在需要大量令牌的任務上時,自然會出現三個問題:(1)AI 代理將令牌花費在哪裡?(2)哪些模型的令牌效率更高?以及(3)代理能否在任務執行前預測其令牌使用情況?在本文中,我們呈現了對代理編碼任務中令牌消耗模式的首次系統研究。我們分析了來自八個前沿 LLM 在 SWE-bench Verified 上的軌跡,並評估模型在任務執行前預測自身令牌成本的能力。我們發現:(1)代理任務的成本特別高,消耗的令牌是代碼推理和代碼聊天的 1000 倍,整體成本主要由輸入令牌而非輸出令牌驅動;(2)令牌使用具有高度變異性且本質上是隨機的:在相同任務上的運行總令牌數最多可以相差 30 倍,而更高的令牌使用並不轉化為更高的準確性;相反,準確性通常在中等成本時達到峰值,並在更高成本時飽和;(3)模型在令牌效率上差異顯著:在相同任務上,Kimi-K2 和 Claude-Sonnet-4.5 的平均令牌消耗超過 GPT-5 的 150 萬;(4)人類專家評價的任務難度與實際令牌成本之間的對應關係僅為微弱,揭示了人類感知的複雜性與代理實際付出的計算努力之間的根本差距;以及(5)前沿模型未能準確預測自身的令牌使用(相關性弱至中等,最高達 0.39),並系統性地低估了實際的令牌成本。我們的研究為 AI 代理的經濟學提供了新的見解,並能激發未來在此方向上的研究。

Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities

2604.22749v1 by Ilana Nguyen, Harini Suresh, Thema Monroe-White, Evan Shieh

Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., "American") are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.

摘要:大型語言模型(LLMs)在文本生成任務中的應用越來越廣泛,從日常使用到高風險的企業和政府應用,包括與尋求庇護者的模擬面試。雖然許多研究突顯了LLMs的新潛在應用,但LLMs對全球非主流社群的有害偏見的編碼和延續也存在風險。為了更好地評估和減輕這些傷害,需要更多研究來檢視LLMs如何描繪多樣化的個體。在本研究中,我們研究了廣泛採用的LLMs在回應開放式敘事生成提示時,如何描繪國籍身份。我們的發現顯示,根據國籍的持續代表性傷害的存在,包括有害的刻板印象、抹除和一維的全球大多數身份描繪。被邊緣化的國家身份在權力中立的故事中同時被低估,而在從屬角色的描繪中則被過度代表,這些從屬角色的出現概率比主導角色高出五十倍以上。當輸入提示中出現美國國籍提示(例如,``美國人'')時,傷害的程度會加劇。值得注意的是,我們發現我們識別的傷害無法通過諂媚來解釋,因為即使在提示中用非美國國籍身份替換美國國籍提示時,美國中心的偏見依然存在。根據我們的發現,我們呼籲進一步探索LLMs中的文化傷害,通過以全球大多數的觀點為中心的方法論,挑戰無批判地採用基於美國的LLMs來對我們星球大多數的分類、監視和錯誤表述。

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

2604.22748v1 by Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-Kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.

摘要:隨著人工智慧系統從生成文本轉向通過持續互動來實現目標,建模環境動態的能力成為了一個核心瓶頸。操控物體、導航軟體、協調他人或設計實驗的代理需要預測環境模型,但「世界模型」這個術語在不同的研究社群中具有不同的含義。我們介紹了一個沿著兩個軸組織的「層級 x 法則」分類法。第一個定義了三個能力層級:L1 預測器,學習一步本地轉換運算子;L2 模擬器,將它們組合成遵循領域法則的多步驟、行動條件的展開;以及 L3 演化者,當預測對新證據失敗時,自主修訂其模型。第二個識別了四個治理法則範疇:物理、數位、社會和科學。這些範疇決定了世界模型必須滿足的約束條件以及其最可能失敗的地方。利用這一框架,我們綜合了超過 400 篇作品,並總結了 100 多個代表性系統,涵蓋基於模型的強化學習、視頻生成、網路和 GUI 代理、多代理社會模擬以及 AI 驅動的科學發現。我們分析了不同層級-範疇對的研究方法、失敗模式和評估實踐,提出了以決策為中心的評估原則和最小可重現的評估包,並概述了架構指導、開放問題和治理挑戰。最終的路線圖連接了先前孤立的社群,並描繪了一條從被動的下一步預測到可以模擬,並最終重塑代理運作環境的世界模型的道路。

An Undecidability Proof for the Plan Existence Problem

2604.22736v1 by Antonis Achilleos

The plan existence problem asks, given a goal in the form of a formula in modal logic, an initial epistemic state (a pointed Kripke model), and a set of epistemic actions, whether there exists a sequence of actions that can be applied to reach the goal. We prove that even in the case where the preconditions of the epistemic actions have modal depth at most 1, and there are no postconditions, the plan existence problem is undecidable. The (un)decidability of this problem was previously unknown.

摘要:計畫存在問題詢問,給定一個以模態邏輯形式表達的目標、一個初始的認知狀態(指向的克里普克模型),以及一組認知行動,是否存在一個可以應用的行動序列以達成該目標。 我們證明,即使在認知行動的前提條件最多具有模態深度 1 且沒有後置條件的情況下,計畫存在問題也是不可判定的。 此問題的(不)可判定性之前是未知的。

Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data

2604.22730v1 by Hillary Mutisya, John Mugane

We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources (the Bantu Lexical Reconstructions database, BLR3, with 4,786 reconstructed Proto-Bantu forms, and the ASJP basic vocabulary), we confirm that 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU 'person' (8 languages), *gombe 'cow' (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- 'see' and *-jIm- 'stand', each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain >0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.

摘要:我們調查是否僅基於現代形態數據訓練的神經模型能夠恢復與歷史重建一致的跨語言詞彙結構。使用 BantuMorph v7,這是一種針對班圖形態範疇的Transformer,我們分析了 14 種東部和南部班圖語言,提取了它們名詞和動詞詞元的編碼器嵌入,並識別出 728 個名詞和 1,525 個動詞同源候選詞,這些候選詞在 5 種以上的語言中共享。將這些候選詞與已建立的歷史資源進行評估——班圖詞彙重建數據庫 (BLR3; 4,786 個重建的原班圖形式) 和 ASJP 基本詞彙——我們確認前 11 個名詞候選詞中的 10 個 (90.9%) 與先前重建的原班圖形式一致,包括 -ntU '人' (8 種語言)、gombe '牛' (9 種語言) 和 mUn (9 種語言)。擴展到動詞,12 個動詞同源詞與重建的原班圖詞根一致,包括 -bon- '看' 和 *-jIm- '站立',這些詞在廣泛的地理範圍內都有證據。使用獨立翻譯模型 (NLLB-600M) 進行的跨模型驗證確認了這些模式:兩個模型都恢復了同源詞聚類和與已建立的 Guthrie 區域分類一致的系統發育分組 (p < 0.01)。跨語言名詞類別分析顯示,所有 13 個生產性類別在語言之間的餘弦相似度均保持 >0.83 (類內 > 類間,p < 10^-9)。我們的數據集僅限於東部和南部班圖,因此我們將這些結果解釋為恢復與原班圖一致的共享班圖詞彙結構,而不是明確區分原班圖的保留與後來的區域創新。

Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

2604.22723v1 by Hillary Mutisya, John Mugane

We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (from vowel coalescence, the merger of two adjacent vowels, applied to wa-; 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.

摘要:我們提出了一種通過結合跨語言轉移學習和無監督聚類來發現低資源班圖語言的形態特徵的方法。應用於Giriama(nyf),這是一種僅有91個標記範疇的語言,我們的流程發現了2,455個單詞的名詞類別分配,並識別出兩種先前未記錄的形態模式:對於第2類的a-前綴變體(元音合併——兩個相鄰元音的合併——的wa-,一致性為95.1%)和一個縮合的k'-前綴(一致性為98.5%)。對444個已知Giriama動詞範疇的外部驗證確認了78.2%的詞元化準確率,而對19,624個單詞(9,014個獨特詞元)進行的v3語料庫擴展在所有主要詞類中實現了97.3%的分段率和86.7%的詞元化率。我們的轉移學習和無監督聚類的集合,通過加權投票結合,利用了互補的優勢:轉移在同源詞檢測上表現出色(利用了約60%的詞彙重疊),而聚類則發現了轉移無法識別的語言特有創新。我們發布了所有代碼和發現的詞彙,以支持低資源班圖語言的形態學文獻。

Aligning Dense Retrievers with LLM Utility via Distillation

2604.22722v1 by Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid, Himanshu Rai, Maksims Volkovs, Ga Wu

Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than efficient LLM re-ranking methods while preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.

摘要:密集向量檢索是檢索增強生成(RAG)的實用支柱,但相似性搜索可能會受到精度限制。相反,利用LLM重新排序的基於效用的方法通常能夠達到更好的性能,但在計算上是不可行的,並且容易受到困擾於困惑度估計的噪音。我們提出了效用對齊嵌入(UAE),這是一個旨在將這些優勢融合成一種實用的高性能檢索方法的框架。我們將檢索公式化為一個分佈匹配問題,訓練一個雙編碼器來模仿從困惑度降低中導出的效用分佈,使用效用調製的InfoNCE目標。這種方法將分級效用信號直接注入嵌入空間,而無需在測試時進行LLM推理。在QASPER基準上,UAE將檢索的Recall@1提高了30.59%,MAP提高了30.16%,Token F1提高了17.3%,超越了強大的語義基線BGE-Base。關鍵是,UAE的速度比高效的LLM重新排序方法快180倍以上,同時保持競爭性能,這表明將檢索與生成效用對齊能在規模上產生可靠的上下文。
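
The distribution-matching idea in the UAE abstract above can be illustrated with a generic soft-target contrastive loss. This is a minimal sketch under stated assumptions, not the paper's Utility-Modulated InfoNCE objective, whose exact form the abstract does not give. The function names, the `temperature` parameter, and the use of raw utility scores as softmax inputs are all hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(similarities, utilities, temperature=1.0):
    """Cross-entropy between a utility-derived target distribution and the
    retriever's softmax over candidate passages (hypothetical stand-in for
    the paper's objective)."""
    target = softmax(utilities, temperature)
    predicted = softmax(similarities, temperature)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

# Toy query with three candidate passages: utilities say passage 0 helps most.
utilities = [2.0, 1.0, 0.0]
aligned = soft_target_loss([2.0, 1.0, 0.0], utilities)
misaligned = soft_target_loss([0.0, 1.0, 2.0], utilities)
print(aligned < misaligned)
```

The loss is smallest when the retriever's similarity ranking reproduces the utility distribution, which is the sense in which graded utility signals would be pushed into the embedding space.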

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

2604.22709v1 by Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6x fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.

摘要:雖然長且明確的思維鏈(CoT)在複雜推理任務中已被證明有效,但在推理過程中生成這些鏈是昂貴的。非語言推理方法透過利用連續表徵,已出現生成長度較短的方案,但其性能仍落後於口頭化的 CoT。我們提出了 $\textbf{抽象思維鏈}$,這是一種離散潛在推理的後訓練機制,其中語言模型從保留的詞彙中生成一個短序列的標記,以取代自然語言的 CoT,然後生成回應。為了使先前未見的“抽象”標記變得有用,我們引入了一種政策迭代風格的熱身迴圈,該迴圈在 (i.) 通過遮罩從口頭 CoT 進行瓶頸並執行監督微調,和 (ii.) 通過訓練模型僅從提示生成抽象標記進行自我蒸餾,這是通過使用代碼本的約束解碼來實現的。在熱身之後,我們在約束解碼下優化抽象序列的生成,並使用熱啟動的強化學習。抽象-CoT 在數學推理、遵循指令和多跳推理中達到最多 $11.6\times$ 更少的推理標記,同時在性能上表現相當,並且在語言模型家族中具有良好的泛化能力。我們還發現抽象詞彙上出現了一種類似於自然語言的冪律分佈,這種分佈在訓練階段中不斷演變。我們的發現突顯了後訓練潛在推理機制的潛力,這些機制通過學習的抽象推理語言實現高效推理。
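
Constrained decoding to a reserved vocabulary, as mentioned in the Abstract-CoT abstract above, is commonly done by masking the logits of every token outside the allowed set. The sketch below shows that general technique only. The token ids, the greedy selection, and the function name are assumptions for illustration and are not taken from the paper.

```python
def constrained_argmax(logits, allowed_ids):
    """Greedy next-token choice restricted to `allowed_ids`.

    Tokens outside the reserved vocabulary get a logit of -inf,
    so they can never be selected.
    """
    allowed = set(allowed_ids)
    masked = [x if i in allowed else float("-inf") for i, x in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)

# Hypothetical 4-token vocabulary in which ids 1-3 form the reserved "abstract" codebook.
logits = [5.0, 1.0, 3.0, 2.0]
print(constrained_argmax(logits, allowed_ids=[1, 2, 3]))  # id 0 has the top logit but is masked out
```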

CRAFT: Clustered Regression for Adaptive Filtering of Training data

2604.22693v1 by Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda

Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8x speedup.

摘要:從大型語料庫中選擇一小部分高品質的子集進行微調變得越來越重要,因為語料庫的數據點數量已增至數千萬,使得全面微調既昂貴又往往不必要。我們提出了CRAFT(Clustered Regression for Adaptive Filtering of Training data),這是一種對向量化無關的選擇方法,用於訓練序列到序列模型。CRAFT將源-目標聯合分佈進行分解,並執行兩階段選擇: (i) 通過在k-means聚類中進行比例預算分配來匹配驗證源分佈, (ii) 在每個源聚類內,選擇那些目標嵌入最小化從驗證目標分佈導出的條件期望距離的訓練對。我們證明了比例聚類分配限制了所選分佈與驗證分佈之間的連續KL散度,殘差由聚類直徑控制。我們在英語-印地語翻譯上評估CRAFT,從3300萬個NLLB句子對中選擇訓練數據,並通過LoRA對mBART進行微調。CRAFT達到43.34 BLEU,超越TSDS(41.21)2.13分,並在相同的候選池和編碼器上完成選擇速度超過40倍。使用TF-IDF向量化,整個流程在CPU上不到一分鐘內完成。TAROT達到45.61 BLEU,但CRAFT在26.86秒內完成選擇,而TAROT則需75.6秒,速度提升了2.8倍。
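
Stage (i) of the CRAFT abstract above, proportional budget allocation across clusters, can be sketched as follows. This is an illustrative example that assumes cluster labels for the validation sources are already available (for instance from k-means). The largest-remainder rounding and the function name are choices made here, not details from the paper.

```python
from collections import Counter

def allocate_budget(validation_clusters, budget):
    """Split `budget` training examples across clusters in proportion to
    how often each cluster appears among the validation sources."""
    counts = Counter(validation_clusters)
    total = sum(counts.values())
    # Ideal fractional share for each cluster.
    exact = {c: budget * n / total for c, n in counts.items()}
    alloc = {c: int(share) for c, share in exact.items()}
    # Give leftover slots to the clusters with the largest fractional remainders
    # (an assumed rounding rule).
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc

# Ten validation sentences assigned to three hypothetical clusters.
labels = [0] * 5 + [1] * 3 + [2] * 2
print(allocate_budget(labels, budget=7))
```

Each cluster's share of the selected set then tracks its share of the validation set, which is the property the abstract's KL bound relies on.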

How Supply Chain Dependencies Complicate Bias Measurement and Accountability Attribution in AI Hiring Applications

2604.22679v1 by Gauri Sharma, Maryam Molamohammadi

The increasing adoption of AI systems in hiring has raised concerns about algorithmic bias and accountability, prompting regulatory responses including the EU AI Act, NYC Local Law 144, and Colorado's AI Act. While existing research examines bias through technical or regulatory lenses, both perspectives overlook a fundamental challenge: modern AI hiring systems operate within complex supply chains where responsibility fragments across data vendors, model developers, platform providers, and deploying organizations. This paper investigates how these dependency chains complicate bias evaluation and accountability attribution. Drawing on literature review and regulatory analysis, we demonstrate that fragmented responsibilities create two critical problems. First, bias emerges from component interactions rather than isolated elements, yet proprietary configurations prevent integrated evaluation. A resume parser may function without bias independently but contribute to discrimination when integrated with specific ranking algorithms and filtering thresholds. Second, information asymmetries mean deploying organizations bear legal responsibility without technical visibility into vendor-supplied algorithms, while vendors control implementations without meaningful disclosure requirements. Each stakeholder may believe they are compliant; nevertheless, the integrated system may produce biased outcomes. Analysis of implementation ambiguities reveals these challenges in practice. We propose multi-layered interventions including system-level audits, vendor guidelines, continuous monitoring mechanisms, and documentation across dependency chains. Our findings reveal that effective governance requires coordinated action across technical, organizational, and regulatory domains to establish meaningful accountability in distributed development environments.

摘要:隨著 AI 系統在招聘中的日益普及,對算法偏見和問責制的擔憂也隨之增加,促使了包括歐盟 AI 法案、紐約市地方法第 144 條和科羅拉多州 AI 法案在內的監管回應。雖然現有研究通過技術或監管的視角來檢視偏見,但這兩種觀點都忽略了一個根本挑戰:現代 AI 招聘系統在複雜的供應鏈中運作,責任在數據供應商、模型開發者、平台提供者和部署組織之間分散。本文探討了這些依賴鏈如何使偏見評估和問責歸屬變得複雜。通過文獻回顧和監管分析,我們證明了分散的責任造成了兩個關鍵問題。首先,偏見源於組件之間的相互作用,而非孤立的元素,但專有配置卻阻礙了綜合評估。一個簡歷解析器可能獨立運作時不帶偏見,但當與特定的排名算法和篩選閾值整合時,卻可能導致歧視。其次,信息不對稱意味著部署組織承擔法律責任,但對供應商提供的算法缺乏技術可見性,而供應商則在沒有實質性披露要求的情況下控制實施。每個利益相關者可能都認為自己是合規的;然而,整合系統可能會產生偏見結果。對實施模糊性的分析揭示了這些挑戰在實踐中的表現。我們提出了多層次的干預措施,包括系統級審計、供應商指導方針、持續監控機制和依賴鏈的文檔記錄。我們的研究結果顯示,有效的治理需要在技術、組織和監管領域之間協調行動,以在分散的開發環境中建立有意義的問責制。

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

2604.22678v1 by Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne

A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.

摘要:一種常見的基於檢索增強生成(RAG)的問題回答方法是將文檔串聯成單一上下文,並將其傳遞給語言模型以生成答案。雖然這種方法簡單,但可能會掩蓋單個文檔的貢獻,使歸因變得困難,並導致「失落於中間」效應,即在長上下文中相關信息被忽視。串聯的擴展性也較差:計算成本隨著上下文長度的增長而呈平方增長,當上下文包含視覺數據時,這一問題變得尤為嚴重,例如在視覺問題回答中。通過限制上下文長度來緩解這些問題的嘗試,可能會進一步限制性能,因為這會阻止模型受益於更深層檢索所提供的改進召回。我們提出了貝葉斯集成檢索增強生成(BERAG),以及貝葉斯集成微調(BEFT),作為一種RAG框架,其中語言模型是基於單個檢索到的文檔而非單一的組合上下文進行條件化。BERAG將文檔後驗概率視為集成權重,並在生成過程中使用貝葉斯法則逐個標記地更新它們。這種方法使得概率重排序、並行記憶使用和文檔貢獻的明確歸因成為可能,從而使其非常適合大型文檔集合。我們主要在基於知識的視覺問題回答任務上評估BERAG和BEFT,在這些任務中,模型必須對長的、不完美的檢索列表進行推理。結果顯示,與標準RAG相比,這些方法有顯著的改進,包括在文檔視覺問題回答和多模態針對堆中的針基準測試上取得的強勁增長。我們還展示了BERAG能夠減輕「失落於中間」效應。文檔後驗可以用來檢測不足的基礎並觸發偏轉,而文檔修剪則使得解碼速度比標準RAG更快。
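The token-by-token posterior update described in the BERAG abstract can be illustrated with a toy sketch. This is not the authors' implementation: `doc_token_probs` is a hypothetical stand-in for the next-token distributions a language model would produce when conditioned on each retrieved document, and all function names and numbers are made up for illustration.

```python
def ensemble_step(weights, doc_token_probs):
    """Mix per-document next-token distributions using the document weights.

    weights: current P(doc | tokens so far), one per document, summing to 1.
    doc_token_probs: one dict per document mapping token -> P(token | doc, prefix).
    """
    mixture = {}
    for w, probs in zip(weights, doc_token_probs):
        for token, p in probs.items():
            mixture[token] = mixture.get(token, 0.0) + w * p
    return mixture


def bayes_update(weights, doc_token_probs, chosen_token):
    """Bayes' rule: new weight is proportional to old weight times token likelihood."""
    unnormalized = [w * probs.get(chosen_token, 0.0)
                    for w, probs in zip(weights, doc_token_probs)]
    total = sum(unnormalized)
    if total == 0.0:
        return list(weights)  # no document explains the token; keep the old weights
    return [u / total for u in unnormalized]


# Two hypothetical documents with equal prior weight.
weights = [0.5, 0.5]
doc_token_probs = [{"Paris": 0.9, "Rome": 0.1}, {"Paris": 0.2, "Rome": 0.8}]
mixture = ensemble_step(weights, doc_token_probs)
chosen = max(mixture, key=mixture.get)  # greedy pick from the mixture
weights = bayes_update(weights, doc_token_probs, chosen)
```

After one step the weight shifts toward the document that better predicted the emitted token, which is the sense in which the posterior can double as an attribution and re-ranking signal.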

Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines

2604.22661v1 by Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia

Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.

摘要:大型語言模型(LLMs)使得查詢重構在現代檢索和增強生成(RAG)管道中變得普遍,能夠生成多個語義等價的查詢變體。然而,對每個重構執行完整的管道在計算上是昂貴的,這促使了選擇性執行:我們能否在產生下游檢索和生成成本之前識別出最佳的查詢變體?我們研究查詢性能預測(QPP)作為在臨時檢索和端到端 RAG 中進行變體選擇的機制。與傳統的 QPP 不同,傳統 QPP 是在主題之間估計查詢難度,我們研究主題內的區分——在相同信息需求的競爭變體中選擇最佳重構。通過在 TREC-RAG 上進行大規模實驗,使用稀疏和密集檢索器,我們在基於相關性和決策的指標下評估了檢索前和檢索後的預測器。我們的結果顯示檢索和生成目標之間存在系統性的偏差:最大化排名指標(如 nDCG)的變體往往無法產生最佳生成答案,暴露了檢索相關性和生成真實性之間的“效用差距”。儘管如此,QPP 可以可靠地識別出改善端到端質量的變體,超過原始查詢。值得注意的是,輕量級的檢索前預測器經常與更昂貴的檢索後方法相匹配或超越,提供了一種延遲高效的健壯 RAG 方法。
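As a rough illustration of pre-retrieval variant selection, the sketch below scores each query variant with average inverse document frequency, a classic pre-retrieval predictor. The abstract does not say which predictors the authors evaluated, so the choice of predictor, the function names, and the document-frequency table are all assumptions.

```python
import math


def avg_idf(query, doc_freq, num_docs):
    """Average smoothed IDF of the query terms (higher suggests a more specific query)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    idfs = [math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1)) for t in terms]
    return sum(idfs) / len(idfs)


def select_variant(variants, doc_freq, num_docs):
    """Pick the variant with the highest predicted performance, before any retrieval runs."""
    return max(variants, key=lambda q: avg_idf(q, doc_freq, num_docs))


# Made-up collection statistics for a 1000-document corpus.
doc_freq = {"the": 900, "of": 850, "effects": 120, "caffeine": 15, "sleep": 40}
variants = ["the effects of caffeine", "caffeine sleep effects"]
best = select_variant(variants, doc_freq, num_docs=1000)
```

Only collection statistics are needed here, so the selection costs far less than running retrieval and generation for every variant.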

Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models

2604.22631v1 by Felix Herron, Solange Rossato, Alexandre Allauzen, François Portet

Modern automatic speech recognition (ASR) systems have been observed to function better for certain speaker groups (SGs) than others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is the lack of a more nuanced understanding of the types of modeling errors that speech encoder models make, and in particular the difference between the structure of embeddings for high-performance and low-performance SGs. This paper proposes a framework typifying two types of error that can occur in modeling phonemes in ASR systems: random error (high variance in phoneme embeddings) versus systematic error (embedding bias). We find that training phoneme classification probes only on a single, typically disadvantaged SG sometimes improves performance for that SG, which is evidence for the existence of SG-level bias in phoneme embeddings. On the other hand, we find that speakers and SGs with higher levels of phoneme variance are the same as those with worse phoneme prediction accuracy. We conclude that both types of error are present in phoneme embeddings and both are candidate causes for SG-level unfairness in ASR, though random error is likely a greater hindrance to fairness than systematic error. Furthermore, we find that finetuning encoder models using a fairness-enhancing algorithm (domain enhancing and adversarial training) changes neither the benefits of in-domain phoneme classification probe training, nor measured levels of random embedding error.

摘要:現代自動語音識別(ASR)系統已被觀察到對某些說話者群體(SGs)的表現優於其他群體,儘管最近整體性能有所提升。朝向更公平的ASR進展的一個潛在障礙是對語音編碼模型所產生的建模錯誤類型的更細緻理解,特別是高性能和低性能SGs的嵌入結構之間的差異。本文提出了一個框架,類型化ASR系統中建模音素時可能發生的兩種類型錯誤:隨機錯誤/高方差的音素嵌入,與系統性錯誤/嵌入偏差。我們發現僅在單一、通常處於劣勢的SG上訓練音素分類探針,有時會提高該SG的性能,這是音素嵌入中存在SG級別偏差的證據。另一方面,我們發現音素方差較高的說話者和SG與音素預測準確性較差的說話者和SG是相同的。我們得出結論,這兩種類型的錯誤都存在於音素嵌入中,並且都是ASR中SG級別不公平的候選原因,儘管隨機錯誤可能對公平性造成的阻礙大於系統性錯誤。此外,我們發現使用增強公平性的算法(領域增強和對抗訓練)進行編碼模型的微調,既不改變領域內音素分類探針訓練的好處,也不改變隨機嵌入錯誤的測量水平。

From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia

2604.22626v1 by Angelo Maria Sabatini

This study investigates the structural organisation of Dante's Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing the balance between persistence and alternation patterns. Across the poem, this index exhibits a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted as graphemic probes linking the Markov representation to identifiable lexical environments in the text. These probes display distinct behaviours: configurations involving two transitions more frequently emerge across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to intra-lexical structures. Part of the signal is further shaped by orthographic phenomena, particularly apostrophised forms, highlighting the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche, but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, providing an interpretable framework for linking local symbolic dynamics to higher-level textual organisation.

摘要:這項研究通過基於元音-輔音(V/C)編碼的符號表示,探討但丁《神曲》的結構組織。將生成的序列建模為四狀態馬爾可夫鏈,產生了一個簡約的圖形記憶指數,捕捉了持續性和交替模式之間的平衡。在整首詩中,這個指數從《地獄》到《天堂》顯示出輕微但穩定的增長,表明局部依賴結構的方向性轉變。三元組級別的分析顯示,這一趨勢是由一組有限的重複配置驅動的,這些配置被解釋為圖形探針,將馬爾可夫表示與文本中可識別的詞彙環境聯繫起來。這些探針顯示出不同的行為:涉及兩次轉換的配置更頻繁地出現在詞邊界之間,反映了相鄰標記之間的互動,而轉換較少的配置則主要限於詞內結構。部分信號進一步受到正字法現象的影響,特別是省略號形式,突顯了書寫慣例在語音和詞彙組織中的作用。一項補充的分類分析確定了特定於詩篇的術語,提供了詞彙錨點,通過這些錨點可以將圖形探針與詩的結構相關聯。這種組織不僅反映在三個詩篇的分離中,還在文本中呈現出一個連續的軌跡。總體而言,結果顯示,應用於符號文本表示的簡單概率模型可以揭示局部依賴、詞彙分佈、正字法編碼和大規模組織之間的結構化互動,提供了一個可解釋的框架,將局部符號動態與更高層次的文本組織聯繫起來。
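The encoding step can be sketched as follows. The abstract does not define the four states or the memory index, so this sketch assumes the states are the four V/C bigrams (VV, VC, CV, CC) and only estimates transition probabilities; it also drops apostrophes and other non-letters, which the paper itself treats as informative. All names are hypothetical.

```python
from collections import Counter

VOWELS = set("aeiouàèéìíòóùú")


def vc_encode(text):
    """Map each letter to V or C, skipping spaces, punctuation, and apostrophes."""
    return "".join("V" if ch in VOWELS else "C"
                   for ch in text.lower() if ch.isalpha())


def transition_probs(symbols):
    """Estimate P(next symbol | previous two symbols) from trigram counts."""
    counts = Counter((symbols[i:i + 2], symbols[i + 2])
                     for i in range(len(symbols) - 2))
    totals = Counter()
    for (state, _), n in counts.items():
        totals[state] += n
    return {(state, nxt): n / totals[state] for (state, nxt), n in counts.items()}


line = "Nel mezzo del cammin di nostra vita"
symbols = vc_encode(line)
probs = transition_probs(symbols)
```

Because letters are concatenated across spaces, trigrams that straddle word boundaries are counted together with word-internal ones, which matches the abstract's distinction between the two kinds of configuration only if boundaries are tracked separately.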

Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube

2604.22606v1 by Sheza Munir, Ratna Kandala, Anamta Khan, Deepti, Joyojeet Pal

Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.

摘要:健康錯誤資訊仍然是社交媒體上最緊迫的挑戰之一,尤其是在文化傳統與科學聽起來的主張交織時。這些動態不僅是全球性的,還是深具地方性的,體現在需要仔細分析的文化特定爭議中。基於此,我們檢視了100個YouTube的逐字稿,這些逐字稿宣傳或駁斥牛尿(gomutra)作為健康療法,重點關注權威訴求、效能訴求和陰謀框架等修辭策略。我們使用包括GPT-4、GPT-4o、GPT-4.1、GPT-5、Gemini 2.5 Pro和Mistral Medium 3在內的大型語言模型(LLMs)來註解逐字稿,使用14類別的說服策略分類法。我們的分析顯示,推廣者主要依賴效能訴求和社會證明,而駁斥者則強調權威和反駁。對一部分註解的人工評估產生了90.1%的標註者間一致性,確認了我們的分類法和驗證過程的可靠性。這項工作推進了對錯誤資訊分析的計算方法,並展示了LLMs如何支持對在線文化話語的大規模研究。

From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

2604.22601v1 by Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, Md Rayhanur Rahman

Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.

摘要:大型語言模型(LLMs)在自動化軟體工程中顯示出潛力,但它們的正確性保證常常受到錯誤或虛構代碼的影響。為了強化模型的誠實性,形式驗證要求LLMs在合成實現邏輯的同時,提供隨後由數學驗證器證明正確的形式規範。然而,從非正式自然語言轉換為精確的形式規範仍然是一項艱鉅的任務。我們的工作通過提供NaturalLanguage2VerifiedCode (NL2VC)-60數據集來解決這一問題:這是一個包含60個複雜算法問題的集合。我們使用分層提示策略評估了七個開放權重LLMs中隨機選擇的11個問題集:無上下文提示、提供結構錨點的簽名提示,以及利用Dafny驗證器的迭代反饋的自我修復提示。為了解決虛無驗證的問題,即模型以微不足道的規範滿足驗證器,我們整合了uDebug平台以確保功能驗證。我們的結果顯示,雖然無上下文提示導致幾乎普遍失敗,但結構簽名和迭代自我修復促進了顯著的性能逆轉。具體而言,Gemma 4-31B達到了90.91%的驗證成功率,而GPT-OSS 120B在簽名引導反饋下從零上升至81.82%的成功率。這些發現表明,對於開放權重LLMs來說,形式驗證現在是可實現的,這些模型作為有效的學徒,能夠合成複雜的註解並促進高保證的軟體開發。
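The self-healing prompting tier can be pictured as a generate, verify, retry loop. The sketch below is a generic outline under stated assumptions: `generate` and `verify` are hypothetical callables (an LLM wrapper and a wrapper around a verifier such as Dafny), and the stubs exist only so the loop runs on its own. None of this is taken from the paper's code.

```python
def self_heal(generate, verify, problem, max_rounds=3):
    """Request code, run the verifier, and feed errors back until it passes.

    generate(problem, feedback) -> candidate source text (hypothetical LLM wrapper).
    verify(source) -> (ok, message) (hypothetical verifier wrapper).
    Returns (verified_source or None, rounds_used).
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        candidate = generate(problem, feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate, round_no
        feedback = message  # the verifier output becomes part of the next prompt
    return None, max_rounds


# Stubs standing in for a model and a verifier.
def fake_generate(problem, feedback):
    return "method M() { }" if feedback is None else "method M() ensures true { }"


def fake_verify(source):
    if "ensures" in source:
        return True, "verified"
    return False, "missing postcondition"


result, rounds = self_heal(fake_generate, fake_verify, "toy problem")
```

The stub verifier happily accepts `ensures true`, a trivial specification. That is the vacuous-verification risk the abstract mentions, and the reason a separate functional check is still needed after the loop succeeds.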

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

2604.22597v1 by Erez Yosef, Oron Anschel, Shunit Haviv Hakimi, Asaf Gendler, Adam Botach, Nimrod Berman, Igor Kviatkovsky

Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.

摘要:最近在大型語言模型方面的進展導致了各種任務的顯著改善,包括數學推理,這用於評估模型在邏輯推理和問題解決方面的智能。模型在數學推理基準上進行評估,通過驗證最終答案的正確性與真實答案進行比較。這種驗證的一種常見方法是基於符號數學比較,但它無法在多樣的數學表示和解決格式之間進行泛化。在本研究中,我們提供了一種穩健且靈活的替代方案,以取代基於規則的符號數學比較。我們提出了一個基於LLM的評估框架,用於評估模型生成的答案,從而實現對多樣數學表示和答案格式的準確評估。我們展示了在兩個流行框架Lighteval和SimpleRL中符號評估的失敗案例,並將其與我們的方法進行比較,顯示出相較於常用方法的明顯改進。我們的框架使得評估和基準測試更為可靠,從而導致更準確的性能監控,這對於推進數學問題解決和智能系統至關重要。
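A small example shows why rigid answer matching struggles with format diversity. The two checkers below are deliberately simple stand-ins written for this note; they are not the comparison logic used by Lighteval or SimpleRL.

```python
from fractions import Fraction


def string_match(pred, gold):
    """Exact string comparison after trimming whitespace."""
    return pred.strip() == gold.strip()


def numeric_match(pred, gold):
    """Compare as exact rationals; anything that does not parse counts as wrong."""
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return False


# Three answers a human grader would likely accept for the gold answer 1/2.
cases = [("0.5", "1/2"), ("50%", "1/2"), ("x = 1/2", "1/2")]
results = [(string_match(p, g), numeric_match(p, g)) for p, g in cases]
```

String matching rejects all three, and the numeric check recovers only the first. Each new format needs another hand-written rule, which is the rigidity an LLM-based judge is meant to avoid.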

Learning Evidence Highlighting for Frozen LLMs

2604.22565v1 by Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.

摘要:大型語言模型(LLMs)能夠進行良好的推理,然而在長且嘈雜的上下文中,常常會錯過關鍵證據。我們介紹了 HiLight,一個證據強調框架,將證據選擇與固定的 LLM 解決器的推理解耦。HiLight 避免壓縮或重寫輸入,這可能會丟棄或扭曲證據,通過訓練一個輕量級的強調演員,在未改變的上下文中插入最小的高亮標籤,以圍繞關鍵範圍。然後,固定的解決器在強調的輸入上執行下游推理。我們將高亮視為一個弱監督的決策問題,並使用強化學習優化演員,只依賴解決器的任務獎勵,無需證據標籤,也無需訪問或修改解決器。在序列推薦和長上下文問題回答中,HiLight 始終在強大的基於提示和自動提示優化基準上提高性能。學習到的強調策略在零樣本情況下轉移到更小和更大的未見解決器系列,包括基於 API 的解決器,這表明演員捕捉到真正的、可重用的證據結構,而不是過度擬合於單一的骨幹。
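The emphasis step itself, wrapping selected spans in tags while leaving every other character untouched, can be sketched in a few lines. The tag strings and function name are hypothetical, and in HiLight the spans would come from the trained Emphasis Actor rather than from a string search as in this toy example.

```python
def insert_highlights(text, spans, open_tag="<hl>", close_tag="</hl>"):
    """Wrap character spans [start, end) in tags without altering other text."""
    out = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor or end > len(text) or start >= end:
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        out.append(text[cursor:start])
        out.append(open_tag + text[start:end] + close_tag)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


context = "The meeting moved to Friday. Lunch is at noon."
start = context.index("Friday")
emphasized = insert_highlights(context, [(start, start + len("Friday"))])
```

Stripping the tags returns the original context exactly, which is the property that separates emphasis from compression or rewriting.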

Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors

2604.22560v1 by Gautam Kumar Jain, Carsten Markgraf, Julian Stähler

Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.

摘要:圖形視覺問題回答(GVQA)在自動駕駛中將推理組織為有序的階段,即感知、預測和規劃,其中規劃決策應與模型自身的感知保持一致。我們在 DriveLM-nuScenes 上進行了一項關於跨階段上下文傳遞的比較研究,使用了兩種互補機制。顯式變體評估了三種基於提示的條件策略,這些策略在未經額外訓練的情況下,於一個經過領域適應的 4B VLM(Mini-InternVL2-4B-DA-DriveLM)上運行,將 NLI 矛盾減少了多達 42.6%,並建立了一個強大的零訓練基準。隱式變體引入了門控上下文投影器,這些投影器從一個階段提取隱藏狀態向量,並將標準化的門控投影注入到下一階段的輸入嵌入中。這些投影器與特定階段的 QLoRA 適配器共同訓練於一個通用的 8B VLM(InternVL3-8B-Instruct),同時僅更新約 0.5% 的參數。隱式變體在規劃階段實現了統計上顯著的 34% NLI 矛盾減少(自助法 95% CI,p < 0.05),並將跨階段的包含性提高了 50%,這是通過多語言 NLI 分類器進行評估的,以考慮混合語言輸出。規劃語言質量也有所改善(CIDEr +30.3%),但由於缺乏駕駛領域的預訓練,詞彙重疊和結構一致性下降。由於這兩種變體使用不同的基礎模型,我們將它們作為互補的案例研究呈現:顯式上下文傳遞提供了一個強大的無訓練基準,以實現表面一致性,而隱式門控投影則帶來了顯著的規劃階段語義增益,這表明領域適應可能是全範圍改進的下一個可行成分。
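A gated context projector of the kind described above can be sketched with plain lists. The abstract says only that a hidden-state vector is projected, normalized, gated, and injected into the next stage's input embeddings; the L2 normalization, the sigmoid gate, and the addition to every token embedding used here are assumptions made for illustration.

```python
import math


def gated_projection(hidden, weight, gate_logit):
    """Project a hidden vector, L2-normalize it, and scale it by a sigmoid gate."""
    projected = [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    norm = math.sqrt(sum(p * p for p in projected)) or 1.0  # avoid dividing by zero
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * p / norm for p in projected]


def inject(embeddings, context):
    """Add the context vector to every token embedding of the next stage."""
    return [[e + c for e, c in zip(token, context)] for token in embeddings]


hidden = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stands in for a learned projection matrix
context = gated_projection(hidden, identity, gate_logit=0.0)  # gate = 0.5
next_inputs = inject([[1.0, 1.0], [0.0, 0.0]], context)
```

A strongly negative gate logit drives the gate toward zero, so the next stage can fall back to its unmodified inputs when the previous stage's context is unhelpful.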

SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

2604.22558v1 by Jichao Wang, Liuyang Bian, Yufeng Zhou, Han Xiao, Yue Pan, Guozhi Wang, Hao Wang, Zhaoxiong Wang, Yafei Wen, Xiaoxin Chen, Shuai Ren, Lingfang Zeng

As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.

摘要:隨著多模態大型語言模型(MLLMs)的成熟,圖形用戶介面(GUI)代理正在從靜態互動演變為複雜的導航。雖然強化學習(RL)已成為訓練 MLLM 代理處理動態 GUI 任務的有前景的範式,但其有效應用面臨困境。標準的離線強化學習通常依賴靜態的步驟級數據,忽略了任務完成和執行質量等全局軌跡語義。相反,在線強化學習捕捉長期動態,但面臨高互動成本和潛在環境不穩定的問題。為了彌補這一差距,我們提出了 SOLAR-RL(半在線長期任務強化學習)。我們的框架不僅依賴昂貴的在線互動,而是將全局軌跡見解直接整合到離線學習過程中。具體而言,我們從靜態數據中重建多樣的回放候選,使用每步有效性信號檢測第一次失敗點,並追溯性地分配密集的步驟級獎勵,通過目標對齊的塑形來反映軌跡級執行質量,從而有效地模擬在線反饋而無需互動成本。大量實驗表明,與強基準相比,SOLAR-RL 顯著提高了長期任務完成率和穩健性,為自主 GUI 導航提供了一個樣本高效的解決方案。
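The first-failure detection and retroactive reward assignment can be illustrated with a toy scheme. The abstract does not specify the shaping function, so the rule below (equal credit for steps before the first failure, a penalty at the failing step, zero afterwards) is a hypothetical stand-in, not the paper's target-aligned shaping.

```python
def first_failure(validity):
    """Index of the first invalid step, or None if every step is valid."""
    for i, ok in enumerate(validity):
        if not ok:
            return i
    return None


def assign_rewards(validity, success_reward=1.0, failure_penalty=-1.0):
    """Assign dense step rewards from per-step validity flags (illustrative rule)."""
    n = len(validity)
    if n == 0:
        return []
    fail = first_failure(validity)
    if fail is None:
        return [success_reward / n] * n  # full credit spread over all steps
    rewards = [0.0] * n                  # steps after the failure get nothing
    for i in range(fail):
        rewards[i] = success_reward / n  # partial credit for valid progress
    rewards[fail] = failure_penalty
    return rewards


rewards = assign_rewards([True, True, False, True])
```

A trajectory that fails late therefore earns more total reward than one that fails early, so the step-level signal reflects trajectory-level quality without any new environment interaction.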

Using Embedding Models to Improve Probabilistic Race Prediction

2604.22555v1 by Noan Dasanaike, Kosuke Imai

Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which has critically relied on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first name probabilities, surname embedding for unlisted names, surname and first name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.

摘要:估計種族差異需要個人層級的種族數據,但由於收集此類信息的敏感性,這些數據通常無法獲得。為了解決這個問題,許多研究者利用貝葉斯改進姓氏地理編碼(BISG),這在很大程度上依賴於普查姓氏數據。不幸的是,這些數據僅捕捉常見姓氏的種族-姓氏關係,忽略了約10%的美國人口。我們顯示,對於那些被忽略的、不常見姓氏的個體,預測性能顯著下降,因為標準的BISG實施在這些情況下依賴於無信息的通用先驗。為了解決這一限制,我們提出了嵌入驅動的BISG(eBISG),它使用預訓練的文本嵌入將姓名表示為密集向量,並在2020年普查的姓氏和名字數據上訓練神經網絡,以估計不在普查中的姓名的種族概率。我們比較了五種方法:僅使用姓氏的標準BISG,結合名字概率的BIFSG,針對未列出姓名的姓氏嵌入,結合姓氏和名字的嵌入,以及基於來自南方州的選民檔案數據訓練的全名嵌入,該數據捕捉了姓名組件之間的互動。我們顯示,每一個後續的eBISG方法都改善了種族預測,其中全名嵌入帶來了最大的增益,特別是對於那些姓氏不在普查名單上的西班牙裔和亞洲裔選民。
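The Bayes-rule step at the core of standard BISG combines a surname-based prior with a geographic likelihood: the posterior for each group is proportional to P(group | surname) times P(location | group). The sketch below uses placeholder group labels and made-up probabilities and does not reproduce any real Census figures or the eBISG models.

```python
def bisg_posterior(p_group_given_surname, p_geo_given_group):
    """Posterior over groups, proportional to surname prior times geographic likelihood."""
    unnormalized = {g: p_group_given_surname[g] * p_geo_given_group[g]
                    for g in p_group_given_surname}
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


# Hypothetical inputs for one person: surname-table prior and census-block likelihood.
surname_prior = {"group_a": 0.6, "group_b": 0.3, "group_c": 0.1}
geo_likelihood = {"group_a": 0.001, "group_b": 0.004, "group_c": 0.002}
posterior = bisg_posterior(surname_prior, geo_likelihood)
```

When a surname is missing from the table, the prior term carries no name-specific information and the estimate rests on geography alone. That is the gap eBISG tries to close by predicting the prior from name embeddings.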

ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders

2604.22550v1 by Yongqi Jiang, Yansong Gao, Boyu Kuang, Chunyi Zhou, Anmin Fu, Liquan Chen

Self-supervised learning (SSL) encoders are invaluable intellectual property (IP). However, no existing SSL watermarking for IP protection can concurrently satisfy the following two practical requirements: (1) provide ownership verification capability under black-box suspect model access once the stolen encoders are used in downstream tasks; (2) be robust under adversarial watermark detection or removal, because the watermark samples form a distinguishable out-of-distribution (OOD) cluster. We propose ArmSSL, an SSL watermarking framework that assures black-box verifiability and adversarial robustness while preserving utility. For verification, we introduce paired discrepancy enlargement, enforcing feature-space orthogonality between a clean sample and its watermark counterpart to produce a reliable verification signal against the suspect model in the black-box setting. For adversarial robustness, ArmSSL integrates latent representation entanglement and distribution alignment to suppress the OOD clustering. The former entangles watermark representations with clean representations (i.e., from non-source-class) to avoid forming a dense cluster of watermark samples, while the latter minimizes the distributional discrepancy between watermark and clean representations, thereby disguising watermark samples as natural in-distribution data. For utility, a reference-guided watermark tuning strategy is designed to allow the watermark to be learned as a small side task without affecting the main task by aligning the watermarked encoder's outputs with those of the original clean encoder on normal data. Extensive experiments across five mainstream SSL frameworks and nine benchmark datasets, along with end-to-end comparisons with SOTAs, demonstrate that ArmSSL achieves superior ownership verification, negligible utility degradation, and strong robustness against various adversarial detection and removal.

摘要:自我監督學習(SSL)編碼器是無價的智慧財產(IP)。然而,現有的 SSL 水印技術在 IP 保護方面無法同時滿足以下兩個實際要求:(1)在黑箱可疑模型訪問下提供所有權驗證能力,一旦被盜的編碼器在下游任務中使用;(2)在對抗性水印檢測或移除下保持穩健性,因為水印樣本形成可區分的分佈外(OOD)叢集。我們提出了 ArmSSL,一個 SSL 水印框架,確保黑箱可驗證性和對抗性穩健性,同時保留效用。為了驗證,我們引入了配對差異擴大,強制清晰特徵空間與其水印對應物之間的正交性,以在黑箱中對可疑模型產生可靠的驗證信號。為了對抗性穩健性,ArmSSL 整合了潛在表示糾纏和分佈對齊,以抑制 OOD 叢集。前者將水印表示與清晰表示(即來自非源類別的表示)糾纏在一起,以避免形成密集的水印樣本叢集,而後者則最小化水印與清晰表示之間的分佈差異,從而將水印樣本偽裝成自然的內部數據。為了效用,設計了一種參考引導的水印調整策略,允許水印作為一個小的附加任務進行學習,而不影響主要任務,通過將水印編碼器的輸出與原始清晰編碼器在正常數據上的輸出對齊。對五個主流 SSL 框架和九個基準數據集進行的廣泛實驗,以及與 SOTA 的端到端比較,證明 ArmSSL 實現了卓越的所有權驗證、可忽略的效用降級以及對各種對抗性檢測和移除的強大穩健性。

Controllable Spoken Dialogue Generation: An LLM-Driven Grading System for K-12 Non-Native English Learners

2604.22542v1 by Haidong Yuan, Haokun Zhao, Wanshi Xu, Songjun Cao, Qingyu Zhou, Long Ma, Hongjie Fan

Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China's national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO (Diversity Driven Policy Optimization) algorithm, a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.

摘要:大型語言模型(LLMs)常常無法滿足非母語環境中K-12英語學習者的教學需求,這是由於能力不匹配。為了解決這一普遍挑戰,我們提出了一個能力對齊框架,該框架根據學習者的能力調整LLM輸出,以中國的國家課程(CSE)作為代表案例。我們的框架通過四級評分系統實現了對詞彙複雜度的精確控制,並配備了一整套新的資源:分級詞彙表和多輪對話語料庫。我們的核心技術貢獻是\textbf{DDPO}算法,即多樣性驅動政策優化(Diversity Driven Policy Optimization),這是一種基於多輪GRPO的方法,旨在保持對話的多樣性,同時全面優化對話質量。這種方法顯著超越了傳統方法,實現了低的生詞率和高的多樣性,同時增強了對話的自然性和教學價值。雖然我們的框架是基於CSE,但它的設計具有靈活性,可以輕鬆適應其他教育標準。我們的模型、數據和代碼將全部開源,提供一個可擴展的平台,用於個性化的英語口語練習,有效應對K-12學習者在非沉浸式環境中面臨的獨特挑戰。
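The out-of-vocabulary rate mentioned in the abstract can be computed against a graded word list roughly as follows. The tokenization rule and the tiny tier-one list are placeholders invented for this example; the paper's actual vocabulary lists and metric definition may differ.

```python
import re


def oov_rate(utterance, allowed_vocab):
    """Fraction of word tokens in the utterance that fall outside the allowed list."""
    words = re.findall(r"[a-z']+", utterance.lower())
    if not words:
        return 0.0
    out_of_vocab = [w for w in words if w not in allowed_vocab]
    return len(out_of_vocab) / len(words)


# Hypothetical tier-one vocabulary for beginning learners.
tier1 = {"i", "like", "my", "dog", "it", "is", "big"}
rate = oov_rate("I like my enormous dog.", tier1)
```

Here one of the five words ("enormous") is outside the list. A score of this kind could serve as one signal when checking whether generated dialogue stays within a learner's tier.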

On the Properties of Feature Attribution for Supervised Contrastive Learning

2604.22540v1 by Leonardo Arrighi, Julia Eva Belloni, Aurélie Gallet, Ivan Gentile, Matteo Lippi, Marco Zullich

Most Neural Networks (NNs) for classification are trained using Cross-Entropy (CE) as a loss function. This approach requires the model to have an explicit classification layer. However, there exist alternative approaches, such as Contrastive Learning (CL). Instead of explicitly performing classification, CL has the NN produce an embedding space where projections of similar data are pulled together, while projections of dissimilar data are pushed apart. In the case of Supervised CL (SCL), labels are adopted as similarity criteria, thus creating an embedding space where the projected data points are well-clustered. SCL provides crucial advantages over CE with regard to adversarial robustness and out-of-distribution detection, thus making it a more natural choice in safety-critical scenarios. In the present paper, we empirically show that NNs for image classification trained with SCL present higher-quality feature attribution explanations than CL with regard to faithfulness, complexity, and continuity. These results reinforce previous findings about CL-based approaches when targeting more trustworthy and transparent NNs and can guide practitioners in the selection of training objectives targeting not only accuracy, but also transparency of the models.

摘要:大多數用於分類的神經網絡(NNs)是使用交叉熵作為損失函數進行訓練的。這種方法要求模型具有明確的分類層。然而,還存在其他替代方法,例如對比學習(CL)。CL並不是明確地進行分類,而是讓NN生成一個嵌入空間,其中相似數據的投影被拉在一起,而不相似數據的投影則被推開。在監督式對比學習(SCL)的情況下,標籤被用作相似性標準,從而創建一個嵌入空間,在這個空間中,投影的數據點聚集得很好。SCL在對抗穩健性和分佈外檢測方面提供了相對於交叉熵的重要優勢,因此在安全關鍵的場景中成為更自然的選擇。在本研究中,我們實證顯示,使用SCL訓練的圖像分類NN在特徵歸因解釋方面相較於CL在真實性、複雜性和連續性上呈現出更高的質量。這些結果強化了關於針對更可信和透明的NN的CL基礎方法的先前發現,並可以指導實踐者在選擇訓練目標時,不僅針對準確性,還針對模型的透明性。
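For readers unfamiliar with the objective, the sketch below computes a supervised contrastive loss in its commonly used form: each anchor is pulled toward same-label samples and pushed from all others through a softmax over scaled dot products. It assumes the embeddings are already L2-normalized, and it is a plain-Python illustration rather than the training code used in the paper.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.5):
    """Supervised contrastive loss averaged over anchors with at least one positive."""
    n = len(embeddings)

    def sim(i, j):
        # Scaled dot product; equals cosine similarity / temperature for unit vectors.
        return sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / temperature

    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without same-label partners contributes nothing
        log_denom = math.log(sum(math.exp(sim(i, a)) for a in range(n) if a != i))
        losses.append(-sum(sim(i, p) - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


labels = [0, 0, 1, 1]
clustered = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # classes well separated
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]      # classes interleaved
loss_clustered = supcon_loss(clustered, labels)
loss_mixed = supcon_loss(mixed, labels)
```

The loss is lower when same-label points coincide, which is the well-clustered embedding space the abstract refers to.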

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

2604.22534v1 by Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.

摘要:電子健康紀錄(EHR)的特徵工程因不規則的觀察間隔、可變的測量頻率以及臨床時間序列固有的結構稀疏性而變得複雜。現有的自動化方法要麼缺乏臨床領域的認識,要麼假設輸入數據是乾淨且規則取樣的,這限制了它們在現實世界EHR數據中的適用性。我們提出了\textbf{FeatEHR-LLM},這是一個利用大型語言模型(LLMs)從不規則取樣的EHR時間序列生成臨床有意義的表格特徵的框架。為了限制患者隱私的暴露,LLM僅在數據集架構和任務描述上運作,而不是原始患者記錄。一種工具增強的生成機制為LLM提供了專門的例程,用於查詢不規則的時間數據,使其能夠生成可執行的特徵提取代碼,明確處理不均勻的觀察模式和信息稀疏性。FeatEHR-LLM支持通過迭代的、驗證在循環中的管道生成單變量和多變量特徵。在四個ICU數據集上評估的八個臨床預測任務中,我們的框架在8個任務中的7個上達到了最高的平均AUROC,相較於強基準提高了多達6個百分點。代碼可在github.com/hojjatkarami/FeatEHR-LLM上獲得。
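As an illustration of what feature-extraction code for irregularly sampled series can look like, the sketch below derives a few tabular features from timestamped observations. It is not code from the FeatEHR-LLM repository; the function name, the feature set, and the hold-until-next-observation weighting are assumptions chosen for the example.

```python
def irregular_features(observations, window_start, window_end):
    """Summarize (time_in_hours, value) pairs that fall inside a window.

    Hypothetical helper: handles uneven spacing and treats absence of
    measurements as a feature in its own right (informative sparsity).
    """
    obs = sorted((t, v) for t, v in observations if window_start <= t <= window_end)
    if not obs:
        return {"count": 0, "missing": 1,
                "hours_since_last": None, "time_weighted_mean": None}
    # Each value is assumed to hold until the next observation (or window end),
    # so sparse stretches are weighted by how long they actually lasted.
    next_times = [t for t, _ in obs[1:]] + [window_end]
    weighted_sum, covered = 0.0, 0.0
    for (t, v), t_next in zip(obs, next_times):
        weighted_sum += v * (t_next - t)
        covered += t_next - t
    mean = weighted_sum / covered if covered > 0 else obs[-1][1]
    return {"count": len(obs), "missing": 0,
            "hours_since_last": window_end - obs[-1][0],
            "time_weighted_mean": mean}

heart_rate = [(0.0, 80.0), (1.0, 100.0), (5.0, 90.0)]
features = irregular_features(heart_rate, 0.0, 6.0)
```

A plain arithmetic mean of the three readings would be 90; the time-weighted version gives more weight to the value that persisted for four hours.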

RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment

2604.22520v1 by Yingfeng Luo, Hongyu Liu, Dingyang Lin, Kaiyan Chang, Chenglong Wang, Bei Li, Quan Du, Tong Xiao, Jingbo Zhu

Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model's improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose RouteLMT (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator's prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that RouteLMT outperforms heuristic and quality/difficulty estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.

摘要:大型語言模型(LLMs)在機器翻譯(MT)中取得了卓越的表現,但大規模部署仍然成本過高。 一種廣泛採用的解決方案是混合系統範式,它通過用小型模型處理大多數請求,並選擇性地將一部分請求路由到大型模型來平衡成本和質量。 然而,現有的路由策略通常依賴於啟發式方法、外部預測器或絕對質量估計,這些方法無法捕捉大型模型是否實際上提供了比小型模型更有價值的改進。 在本文中,我們將路由形式化為預算分配問題,並將邊際增益,即大型模型相對於小型模型的改進,確定為預算決策的最佳信號。 基於此,我們提出了 \textbf{RouteLMT}(基於LLM的MT路由),這是一個高效的內部模型路由器,通過探測小型翻譯器的提示-標記表示來預測這一預期增益,而不需要外部模型或假設解碼。 大量實驗表明,我們的RouteLMT超越了啟發式方法、質量/難度估計基準,實現了更優質量-預算帕累托邊界。此外,我們分析了回歸風險,並展示了一個簡單的保護變體可以減少嚴重的質量損失。
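The budget-allocation view of routing can be sketched in a few lines: given a predicted marginal gain per request, send the highest-gain requests to the large model until the budget is spent. The gain predictor itself is out of scope here, and the `min_gain` floor is a hypothetical safeguard added for illustration, not the guarded variant described in the paper.

```python
def route_by_marginal_gain(predicted_gains, budget_fraction, min_gain=0.0):
    """Assign each request to the "large" or "small" model.

    predicted_gains: estimated quality improvement of large over small model.
    budget_fraction: share of requests the large model may serve.
    min_gain: illustrative floor; requests below it stay on the small model.
    """
    n = len(predicted_gains)
    slots = int(budget_fraction * n)
    ranked = sorted(range(n), key=lambda i: predicted_gains[i], reverse=True)
    chosen = {i for i in ranked[:slots] if predicted_gains[i] > min_gain}
    return ["large" if i in chosen else "small" for i in range(n)]

assignments = route_by_marginal_gain([0.5, -0.1, 0.2, 0.05], budget_fraction=0.5)
```

Ranking by gain rather than by absolute quality means a hard sentence that both models translate poorly does not consume budget.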

Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

2604.22517v1 by Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Takuto Asakura, Chung-Chi Chen, Tatsuya Ishigaki

Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.

摘要:評估LLM生成的商業想法通常比生成它們更難以擴展。與標準的NLP基準不同,商業想法的評估依賴於多維標準,例如可行性、新穎性、差異化、用戶需求和市場規模,而專家的判斷往往存在分歧。本文研究了這種分歧所引發的方法論問題:自動評判者應該近似集體共識,還是單獨建模評估者?我們引入PBIG-DATA,一個包含約3000個個別評分的數據集,涵蓋300個基於專利的產品想法,這些評分由領域專家在六個商業導向的維度上提供:具體性、技術有效性、創新性、競爭優勢、需求有效性和市場規模。分析顯示,在細粒度的序數評分上專家之間存在顯著的分歧,而在粗略選擇下則達成較高的一致性,這表明結構性異質性而非隨機噪音。然後,我們比較了三種評判配置:僅使用評分標準的零樣本評判者,基於混合評估者歷史的集體評判者,以及基於目標評估者評分歷史的個性化評判者。在各個維度和模型大小下,個性化評判者與相應評估者的對齊程度比集體評判者更高,而評估者的一致性僅在個性化條件下與評判者生成的推理相似性相關。這些結果表明,在多元化評估環境中,合併標籤可能是一個脆弱的目標,並促使為商業想法評估設計評估者條件的評判者。

Measuring and Mitigating Persona Distortions from AI Writing Assistance

2604.22503v1 by Paul Röttger, Kobi Hackenburg, Hannah Rose Kirk, Christopher Summerfield

Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.

摘要:數億人使用人工智慧(AI)來協助寫作。在這裡,我們評估了AI寫作輔助如何扭曲作家的角色——他們的信念、個性和身份的感知。在三個大規模實驗中,作家(N=2,939)在有無AI輔助的情況下撰寫政治意見段落。不同的讀者群體(N=11,091)盲目評估了這些段落,涵蓋29個社會顯著的讀者感知維度,包括政治意見、寫作質量、作家個性、情感和人口統計。AI寫作輔助在所有維度上都產生了角色扭曲:有了AI,作家似乎更有主見、更有能力且更積極,他們的感知人口統計特徵轉向更特權的群體。作家對許多觀察到的扭曲表示反對,但即使意識到這些扭曲,他們仍然偏好AI輔助的文本。我們成功地在模型層面上減輕了可反對的角色扭曲,通過在我們的實驗數據(10,008段落,2,903,596評分)上訓練獎勵模型,以引導AI輸出忠實地代表作家的立場。然而,這以用戶接受度為代價,暗示了AI寫作輔助中可取和不可取特性之間的糾纏,這可能難以解決。總體而言,我們的發現表明,來自AI寫作輔助的角色扭曲在即使在現實的人類監督條件下也是普遍和持久的,這對公共話語、信任和民主討論具有隨著AI採用而擴大的影響。

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

2604.22498v1 by Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen, Jiawei Chen, Hao Ma, Tao Wei

Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).

摘要:儘管多模態大型語言模型(MLLMs)快速發展,但在細粒度多圖像理解方面仍面臨顯著挑戰,經常出現空間幻覺、注意力洩漏以及物體恆常性的失敗。此外,現有的方法通常依賴昂貴的人類標註或大規模的思維鏈(CoT)數據生成。我們提出了組合性基礎對比(縮寫為 CGC),這是一個低成本的完整框架,用於提升 MLLMs 的細粒度多圖像理解。CGC 基於現有的單圖像基礎標註,通過圖像間對比和圖像內對比構建組合性多圖像訓練實例,這分別引入了語義上解耦的干擾上下文以進行跨圖像區分,以及相關的跨視圖樣本以實現物體恆常性。CGC 進一步在 GRPO 框架內引入了一種基於規則的空間獎勵,以改善源圖像歸因、空間對齊和結構化輸出有效性,遵循先思考再基礎化的範式。實驗表明,CGC 在細粒度多圖像基準上達到了最先進的結果,包括 MIG-Bench 和 VLM2-Bench。學習到的多圖像理解能力也轉移到更廣泛的多模態理解和推理任務上,在 MathVista (+2.90)、MuirBench (+2.88)、MMStar (+1.93)、MMMU (+1.77) 和 BLINK (+1.69) 上相對於 Qwen3-VL-8B 基本模型獲得了一致的增益。
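A rule-based spatial reward of the kind described can be pictured as a sum of checks on output validity, source-image attribution, and box overlap. The sketch below is only a guess at the general shape: the 0.2/0.3/0.5 weights, the dictionary layout, and the function names are assumptions, and only the intersection-over-union computation is standard.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_well_formed(pred):
    """Structured-output check: an integer image index and a non-empty box."""
    if not isinstance(pred, dict):
        return False
    box = pred.get("box")
    return (isinstance(pred.get("image_index"), int)
            and isinstance(box, (list, tuple)) and len(box) == 4
            and box[0] < box[2] and box[1] < box[3])

def spatial_reward(pred, target):
    """Hypothetical weighting: 0.2 format, 0.3 correct image, 0.5 * IoU."""
    if not is_well_formed(pred):
        return 0.0
    reward = 0.2
    if pred["image_index"] == target["image_index"]:
        reward += 0.3 + 0.5 * iou(pred["box"], target["box"])
    return reward

target = {"image_index": 1, "box": (10, 10, 50, 50)}
perfect = spatial_reward({"image_index": 1, "box": (10, 10, 50, 50)}, target)
wrong_image = spatial_reward({"image_index": 0, "box": (10, 10, 50, 50)}, target)
```

Gating the overlap term on the image index means a well-placed box in the wrong image earns no spatial credit.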

On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery

2604.22455v1 by Anti Alman, Izack Cohen, Avigdor Gal, Fabrizio Maria Maggi, Marco Montali

A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.

摘要:任何 AI 增強業務流程管理系統 (ABPMS) 的核心組件是流程框架,它賦予系統流程感知並定義系統必須運作的邊界。與傳統流程模型相比,流程框架原則上應該提供對管理流程的更寬鬆的表示,使得稱為框架自主的 ABPMS 的 (半) 自主行為能夠出現。與此同時,它並不局限於單一的語言或符號形式,並且可以包含從預定義程序到常識規則和最佳實踐的異質知識。在本文中,我們將 ABPMS 流程框架的概念化為一種混合業務流程表示,包含半並行執行的程序性和聲明性流程模型。我們依賴於之前的工作來概述這種類型的流程框架的執行語義,主張也應該將聲明性範式的開放世界假設應用於程序性流程模型。後者導致了一種類似約束的解釋,其中每個程序模型被視為限制該模型內的活動,而不對其他模型中可能存在的活動施加明確的執行要求或限制。這類似於現有的聲明性語言,例如 Declare,其中每個約束僅對被約束的特定活動產生直接影響。鑒於這種相似性,我們建議將發現的聲明性約束的子集映射到等效的半並行執行的程序片段,從而為相應的流程 (框架) 發現方法奠定基礎。

Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

2604.22452v1 by Xirui Li, Ming Li, Yunze Xiao, Ryan Wong, Dianqi Li, Timothy Baldwin, Tianyi Zhou

Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other's outputs.

摘要:集體智慧是指一個群體達成超越任何個別成員單獨所能完成的結果的能力。隨著大型語言模型代理人擴展到數百萬的人口,一個關鍵問題浮現:集體智慧是否自發地從規模中產生?我們在一個大規模自主代理人社會中首次對這個問題進行實證評估。研究MoltBook,一個擁有超過兩百萬代理人的平台,我們引入了Superminds Test,一個分層框架,通過控制探測代理在三個層級上探測社會層級的智慧:聯合推理、信息綜合和基本互動。我們的實驗顯示出集體智慧的明顯缺失。該社會在複雜推理任務上未能超越個別前沿模型,幾乎不進行分散信息的綜合,並且經常連微不足道的協調任務也失敗。平台範圍的分析進一步顯示,互動仍然淺薄,討論主題很少延伸超過單一回覆,大多數回應都是一般性或偏離主題的。這些結果表明,集體智慧並不僅僅是從規模中產生的。相反,目前代理人社會的主要限制是極其稀疏和淺薄的互動,這妨礙了代理人之間的信息交流和相互建設。

SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking

2604.22438v1 by Chenxi Gu, Xiaoning Du, John Grundy

Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW's effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as watermark strength. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (Sort-then-Split by Groups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.

摘要:水印技術已成為追蹤大型語言模型(LLMs)生成內容的作者身份的一種有前景的技術。在現有的方法中,KGW方案因其多功能性、效率和在自然語言生成中的有效性而特別吸引人。然而,KGW在低熵環境下(如代碼生成和數學推理)其有效性顯著下降。KGW方法中的一個關鍵步驟是隨機詞彙劃分,這使得根據特定偏好調整標記選擇成為可能。我們的研究揭示了下一標記概率分佈在決定我們能夠多大程度上,甚至是否可以修改標記選擇及其結果,即水印的有效性方面,扮演著關鍵角色。我們將這一特徵稱為\emph{水印強度},它與每個標記預測的概率分佈相關聯。在隨機詞彙劃分的情況下,水印強度的下限由下一標記概率分佈決定。然而,我們發現通過重新設計詞彙劃分算法,我們可以潛在地提高這一下限。在本文中,我們提出了SSG(\textbf{S}ort-then-\textbf{S}plit by \textbf{G}roups),這是一種將詞彙劃分為兩個logit平衡子集的方法。這一設計提高了每個標記預測的水印強度下限,從而改善了水印的可檢測性。在代碼生成和數學推理數據集上的實驗證明了SSG的有效性。
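One plausible reading of "sort then split by groups" is sketched below: sort tokens by logit, walk down the ranking in pairs, and let a keyed pseudo-random bit decide which member of each pair goes to the green list. This is an assumption made for illustration, not the algorithm from the paper; the pair size, the hashing scheme, and all names are hypothetical.

```python
import hashlib

def keyed_bit(secret_key, prev_token, group_index):
    """Deterministic pseudo-random bit derived from key, context, and group."""
    message = f"{secret_key}:{prev_token}:{group_index}".encode()
    return hashlib.sha256(message).digest()[0] & 1

def sort_then_split(logits, secret_key, prev_token):
    """Partition token ids into (green, red) lists with similar logit mass.

    Neighbouring tokens in the sorted order always land in different
    subsets, so the two most likely tokens are never both red.
    """
    order = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)
    green, red = [], []
    for group_index, start in enumerate(range(0, len(order), 2)):
        pair = order[start:start + 2]
        bit = keyed_bit(secret_key, prev_token, group_index)
        if len(pair) == 1:
            (green if bit == 0 else red).append(pair[0])
        else:
            first, second = pair if bit == 0 else pair[::-1]
            green.append(first)
            red.append(second)
    return green, red

logits = [5.0, 4.9, 0.1, 0.0, -1.0, -1.2]
green, red = sort_then_split(logits, secret_key="demo-key", prev_token=42)
```

With a purely random split, both high-probability tokens can fall on the same side, which is the low-entropy failure case the abstract describes.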

CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease

2604.22428v1 by Bulent Soykan, Gulsah Hancerliogullari Koksalmis, Hsin-Hsiung Huang, Laura J. Brattain

Predicting individual cognitive decline in Alzheimer's disease (AD) is difficult due to the heterogeneity of disease progression. Reliable clinical tools require not only high accuracy but also fairness across demographics and robustness to missing data. We present CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework using data from 1,666 patients in the TADPOLE (Alzheimer's Disease Neuroimaging Initiative) dataset. We assessed the model for prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate and personalized predictions of cognitive decline. Its demonstrated fairness across patient demographics and resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.

摘要:預測阿茲海默症(AD)中個體的認知衰退是困難的,因為疾病進展的異質性。可靠的臨床工具不僅需要高準確性,還需要在不同人口統計中保持公平性,並對缺失數據具有穩健性。我們提出了CognitiveTwin,一個預測患者特定認知軌跡的數位雙胞胎框架。該模型整合了多模態的縱向數據(認知分數、磁共振成像、正電子發射斷層掃描、腦脊髓液生物標記和遺傳學)。我們使用基於Transformer的架構來融合這些模態,並使用深度馬爾可夫模型來捕捉時間動態。我們使用來自1,666名患者的TADPOLE(阿茲海默症神經影像倡議)數據集訓練和評估該框架。我們評估了模型的預測誤差、人口統計公平性以及對隨機缺失數據模式的穩健性。CognitiveTwin提供準確且個性化的認知衰退預測。它在患者人口統計中的公平性和對臨床脫落的韌性使其成為臨床試驗增強和個性化護理計劃的可靠工具。

Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control

2604.22413v1 by Qinhan Hou, Jing Tang

Graph Transformers can mix information globally, but this flexibility also creates failure modes: some tasks require long-range communication while others are better served by local interaction. We study this through a synthetic node-classification benchmark on contextual stochastic block model graphs, where labels are generated by a controllable mixture of local and far-shell signals. We define distance-misaligned training as a mismatch between where label-relevant information lies and where the model allocates communication over graph distance. On this benchmark, we find three points. First, the preferred graph-distance bias changes systematically with task locality. Second, an oracle adaptive controller, given offline access to the task-side distance target, nearly matches the best fixed bias across regimes and strongly improves over a neutral baseline on mixed and local tasks. Third, a task-agnostic zero-gap controller is weaker, indicating that adaptation alone is not enough and that the control target matters. These results suggest that distance-resolved diagnosis is useful for understanding Graph Transformer failures and for designing graph-aware control.

摘要:Graph Transformers 可以全球混合信息,但這種靈活性也會產生失敗模式:某些任務需要長距離通信,而其他任務則更適合局部互動。我們通過在上下文隨機區塊模型圖上的合成節點分類基準來研究這一點,其中標籤是由可控的局部和遠程信號混合生成的。我們將距離錯位訓練定義為標籤相關信息所在的位置與模型在圖距離上分配通信的位置之間的不匹配。在這個基準上,我們發現三個要點。首先,首選的圖距離偏差隨著任務的局部性系統性變化。其次,一個神諭自適應控制器,在離線訪問任務側距離目標的情況下,幾乎能夠匹配各個範疇中的最佳固定偏差,並在混合和局部任務上顯著改善中立基線。第三,一個與任務無關的零差距控制器較弱,表明僅僅適應是不夠的,控制目標也很重要。這些結果表明,距離解析診斷對於理解 Graph Transformer 的失敗和設計圖感知控制是有用的。
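The idea of a graph-distance bias can be illustrated with a single attention row: an additive penalty proportional to hop distance shifts attention mass toward near or far nodes. The linear form `score - beta * distance` and the names below are assumptions for the sketch; the paper's actual parameterization and controllers are not specified in the abstract.

```python
import math

def distance_biased_attention(scores, distances, beta):
    """Softmax over content scores with an additive graph-distance bias.

    beta > 0 favours nearby nodes, beta < 0 favours distant ones,
    beta = 0 leaves plain attention (assumed bias form, for illustration).
    """
    biased = [s - beta * d for s, d in zip(scores, distances)]
    peak = max(biased)
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

def mass_by_distance(weights, distances):
    """Aggregate attention mass per hop distance, for distance-resolved diagnosis."""
    mass = {}
    for w, d in zip(weights, distances):
        mass[d] = mass.get(d, 0.0) + w
    return mass

hops = [1, 1, 2, 3]
local = distance_biased_attention([0.0, 0.0, 0.0, 0.0], hops, beta=1.0)
profile = mass_by_distance(local, hops)
```

Comparing such a per-distance profile with where the label signal actually lives is one way to picture the mismatch the abstract calls distance-misaligned training.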

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

2604.22411v1 by Alberto Messina, Stefano Scotta

Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of background temperature $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.

摘要:即使在溫度 $T=0$ 的解碼下,大型語言模型(LLMs)對於相同的輸入仍然可以產生不同的輸出。Thinking Machines Lab 最近的研究突顯了非決定性在實施層面的來源,包括批次大小變化、內核非不變性和浮點數非結合性。在這篇短文中,我們通過引入 \emph{背景溫度} $T_{\mathrm{bg}}$ 的概念來形式化這種行為,這是由於實施依賴的擾動過程所引起的有效溫度,即使在名義上 $T=0$ 時也會觀察到。我們提供了清晰的定義,展示了 $T_{\mathrm{bg}}$ 如何與由推理環境 $I$ 支配的隨機擾動相關,並提出了一個實證協議來通過理想參考系統的等效溫度 $T_n(I)$ 來估計 $T_{bg}$。最後,我們總結了一組在主要 LLM 提供者的代表性樣本上進行的初步實驗,展示了這一理念並概述了對重現性、評估和部署的影響。
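A toy version of the "equivalent temperature" idea: if repeated nominal $T=0$ runs agree with the most likely token only some fraction of the time, one can ask which temperature of an ideal softmax sampler would give the same agreement rate. The sketch below solves this for a single token position by bisection. It is an assumption about how such an estimate could be set up, not the protocol from the note; `top_prob` and `equivalent_temperature` are hypothetical names.

```python
import math

def top_prob(logits, temperature):
    """Probability that an ideal softmax sampler picks the highest-logit token."""
    peak = max(logits)
    return 1.0 / sum(math.exp((l - peak) / temperature) for l in logits)

def equivalent_temperature(logits, observed_agreement, lo=1e-3, hi=10.0, iters=60):
    """Temperature at which the ideal sampler matches the observed agreement.

    top_prob decreases monotonically as temperature rises, so bisection works
    whenever the observed rate lies between top_prob(hi) and top_prob(lo).
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if top_prob(logits, mid) > observed_agreement:
            lo = mid  # sampler is still too deterministic: raise temperature
        else:
            hi = mid
    return (lo + hi) / 2

reference_logits = [2.0, 1.5, 0.0]
agreement = top_prob(reference_logits, 0.2)  # stand-in for a measured rate
estimate = equivalent_temperature(reference_logits, agreement)
```

In practice the agreement rate would come from repeated calls to the same endpoint with identical inputs, and a full-sequence protocol would need to aggregate over positions.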

Selective Contrastive Learning For Gloss Free Sign Language Translation

2604.22374v1 by Changhao Lai, Rui Zhao, Xuewen Zhong, Jinsong Su, Yidong Chen

Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.

摘要:手語翻譯(SLT)將連續的手語視頻轉換為口語文本,但由於視覺手語與書面文本之間固有的模態不匹配,這仍然是一個挑戰,特別是在無標記的環境中。最近的SLT系統越來越多地採用類似CLIP的視覺-語言預訓練(VLP)進行跨模態對齊,但隨機的批內對比提供的負樣本較少,且依賴於批次,可能會將語義相似(甚至相同)的配對錯誤標記為負樣本,從而引入噪聲和潛在不一致的對齊監督。在這項工作中,我們首先進行了一項初步的基於軌跡的分析,跟踪訓練過程中負視頻與文本的相似性。結果顯示,只有一小部分負樣本表現出持續被推開的理想行為,而其餘的負樣本則顯示出異質且往往不減少的相似性動態,這表明隨機的批內負樣本對有效對齊經常是無信息的。受到此啟發,我們提出了針對SLT的選擇性對比學習(SCL-SLT)及配對選擇(PS)策略。PS利用來自參考檢查點的相似性動態對候選負樣本進行評分,並通過一個逐步強調更具挑戰性的負樣本的課程來構建小批次,從而加強對比監督,同時減少噪聲或語義無效的負樣本的影響。

CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language

2604.22367v1 by Rui Zhao, Xuewen Zhong, Xiaoyun Zheng, Jinsong Su, Yidong Chen

Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual-alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.

摘要:手語研究因大型語言模型(LLMs)的進步而取得了顯著的進展。然而,LLMs 理解手語的內在能力,尤其是在多模態環境中,仍然未被充分探索。為了解決這一限制,我們推出了 CNSL-bench,這是第一個全面的中文國家手語基準,旨在評估多模態大型語言模型(MLLMs)在手語理解方面的表現。所提出的 CNSL-bench 具有以下特點:1)權威性基礎,因為它基於官方標準化的《國家通用手語詞典》,減少了來自地區或非典範變體的歧義,並確保語義定義的一致性;2)多模態覆蓋,提供對應的文本描述、插圖及手語視頻;3)發音多樣性,支持對關鍵手動發音形式的細緻分析,包括空寫、拼字和中文手語字母表。使用 CNSL-bench,我們對 21 個開源和專有的最新 MLLMs 進行了廣泛評估。我們的結果顯示,儘管多模態建模最近取得了進展,目前的 MLLMs 仍然顯著低於人類表現,並在輸入模態和手動發音形式之間顯示出系統性的差異。額外的診斷分析表明,幾個性能限制在推理改進之外仍然存在,並且遵循指令的穩健性在不同模型之間變化顯著。

Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

2604.22345v1 by Weixu Zhang, Ye Yuan, Changjiang Han, Yuxing Tian, Zipeng Sun, Linfeng Du, Jikun Kang, Hong Kang, Xue Liu, Haolun Wu

Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.

摘要:大型語言模型(LLMs)展現出強大的隱性個性化能力,但大多數現有方法將這種行為視為黑箱,依賴於提示工程或在用戶數據上進行微調。在本研究中,我們採取機械解釋的視角,假設存在一組稀疏的偏好頭,即編碼用戶特定風格和主題偏好的注意力頭,並對生成過程施加因果影響。我們介紹了差異偏好引導(DPS),這是一個無需訓練的框架,(1) 通過因果遮罩分析識別偏好頭,(2) 在推理時利用它們進行可控且可解釋的個性化。DPS 為每個注意力頭計算偏好貢獻分數(PCS),直接衡量其對用戶對齊輸出的因果影響。在解碼過程中,我們對比了有無偏好頭的模型預測,放大個性化和通用邏輯值之間的差異,以選擇性地強化與偏好對齊的延續。在多個 LLM 上的廣泛使用的個性化基準測試中的實驗顯示,在保持內容一致性和低計算開銷的同時,個性化的真實性穩定提升。除了實證改進,DPS 還提供了關於個性化在Transformer架構中出現的地點和方式的機械解釋。我們的實現是公開可用的。
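The decoding-time contrast can be sketched as a simple logit extrapolation: take the logits from the full model, subtract the logits obtained with the Preference Heads masked, and add a scaled copy of that difference back. The linear form and the `alpha` parameter are assumptions in the style of contrastive decoding; the paper's exact combination rule is not given in the abstract.

```python
def differential_steering(logits_with_heads, logits_heads_masked, alpha=1.0):
    """Amplify what the preference heads contribute to each token's logit.

    alpha = 0 returns the unmodified logits; larger values push harder
    toward preference-aligned tokens (illustrative parameter).
    """
    return [p + alpha * (p - g)
            for p, g in zip(logits_with_heads, logits_heads_masked)]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

personalized = [1.0, 1.2]  # full model: token 1 narrowly ahead
generic = [0.2, 1.2]       # heads masked: token 0 loses most of its support
steered = differential_steering(personalized, generic, alpha=1.0)
```

Token 0 owes most of its score to the masked heads, so amplifying the difference moves it ahead of token 1 even though the raw logits preferred token 1.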

Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding

2604.22335v1 by Weixu Zhang, Fanghua Ye, Qiang Gao, Jian Li, Haolun Wu, Yuxing Tian, Sijing Duan, Nan Du, Xiaolong Li, Xue Liu

Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token's degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.

摘要:大型語言模型(LLMs)經常產生與輸入上下文中提供的信息相矛盾或忽略的信息,這一現象被稱為忠實性幻覺。在本文中,我們提出了上下文忠實性增強(CFB),這是一個輕量級且通用的解碼時框架,通過增加源支持標記的生成概率來減少此類幻覺。受水印技術中的邏輯形狀原則啟發,CFB 根據標記在輸入上下文中的支持程度應用附加的標記級邏輯調整。具體而言,我們開發了三種增強策略:靜態增強,對源支持標記應用固定偏差;上下文感知增強,根據有無上下文的下一標記分佈之間的差異來調整這一偏差;以及標記感知增強,根據從源位置注意力和源範疇語義相似性估算的局部相關性進一步重新分配自適應偏差。CFB 不需要重新訓練或架構變更,使其與各種 LLM 兼容。在多個開源 LLM 上進行的摘要和問答任務實驗顯示,CFB 始終在生成開銷最小的情況下改善忠實性指標。我們的實現是完全開源的。
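The static and context-aware strategies can be sketched as an additive bias on the logits of tokens that appear in the source, optionally scaled by how much the context changes the next-token distribution. Jensen-Shannon divergence is used here as the scaling signal purely as an assumption (the abstract only says "divergence"); `delta` and the function names are illustrative, and the token-aware redistribution step is omitted.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def boost_logits(logits_ctx, logits_no_ctx, source_token_ids,
                 delta=2.0, context_aware=True):
    """Add a bias to source-supported tokens.

    Static boosting uses a fixed delta; the context-aware variant shrinks
    the bias when the context barely changes the model's prediction.
    """
    scale = 1.0
    if context_aware:
        scale = js_divergence(softmax(logits_ctx), softmax(logits_no_ctx))
    return [l + delta * scale if i in source_token_ids else l
            for i, l in enumerate(logits_ctx)]

static = boost_logits([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], {1}, context_aware=False)
adaptive = boost_logits([3.0, 0.0, 0.0], [0.0, 0.0, 3.0], {0})
```

The resemblance to KGW-style watermarking is that both add a bias to a chosen token subset before sampling; here the subset is defined by source support rather than a keyed hash.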

ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding

2604.22333v1 by Dongwei Sun, Jing Yao, Kan Wei, Xiangyong Cao, Chen Wu, Zhenghui Zhao, Pedram Ghamisi, Jun Zhou, Jón Atli Benediktsson

Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a "statistics-first, generation-later" paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state-of-the-art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at https://sundongwei.github.io/changequery/.

摘要:快速的情境感知在災後響應中至關重要。雖然遙感損害評估正在從像素級變化檢測演變為高級語義分析,但現有的視覺-語言方法仍然難以為複雜的戰略查詢提供可行的情報。它們仍然受到單一模態光學依賴的嚴重限制,對自然災害的偏見,以及根本缺乏基於互動的能力。為了解決這些限制,我們提出了ChangeQuery,一個統一的多模態框架,旨在實現全面的全氣候災害情境感知。為了克服模態限制和場景偏見,我們構建了災害引起的變化查詢(DICQ)數據集,這是一個大規模基準,將事件前的光學語義與事件後的SAR結構特徵結合,涵蓋自然災害和武裝衝突的平衡分佈。此外,為了提供互動推理所需的高質量監督,我們提出了一種新穎的自動語義標註管道。遵循“統計優先,生成後”的範式,這個引擎自動將原始分割掩模轉換為基於的分層指令集,有效地為模型提供了細緻的空間和定量感知。在這些結構化數據上訓練後,ChangeQuery架構作為一個互動的災害分析師運作。它支持由多樣用戶查詢驅動的多任務推理,提供精確的損害量化、特定區域的描述和全面的災後總結。廣泛的實驗表明,ChangeQuery建立了一個新的最先進技術,提供了一個穩健且可解釋的解決方案,用於複雜的災害監測。代碼可在 \href{https://sundongwei.github.io/changequery/}{https://sundongwei.github.io/changequery/} 獲得。

FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

2604.22328v1 by Marco Obermeier, Marco Pruckner, Florian Haselbeck, Andreas Zeiselmair

Driven by the transition towards a climate-neutral energy system, accurate energy time series forecasting is critical for planning and operation. Yet, it remains largely a dataset-specific task, requiring comprehensive training data, limiting scalability, and resulting in high model development and maintenance effort. Recently, foundation models that aim to learn generalizable patterns via extensive pretraining have shown superior performance in multiple prediction tasks. Despite their success and strong potential to address challenges in energy forecasting, their application in this domain remains largely unexplored. We address this gap by presenting the Foundation Models in Energy Time Series Forecasting (FETS) benchmark. We (1) provide a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories; (2) collect and analyze 54 datasets across 9 data categories, guided by typical stakeholder interests; (3) benchmark foundation models against classical machine learning approaches across different forecasting settings. Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories, despite the latter having seen the full historic target data during training. In particular, covariate-informed foundation models achieve the strongest performance. Further analysis reveals a strong correlation between predictive performance and spectral entropy, performance saturation beyond a certain context length, and improved performance at higher aggregation levels such as national load, district heating, and power grid data. Overall, our findings highlight the strong potential of foundation models as scalable and generalizable forecasting solutions for the energy domain, particularly in data-constrained and privacy-sensitive settings.

摘要:受到向氣候中立能源系統過渡的驅動,準確的能源時間序列預測對於規劃和運營至關重要。然而,這仍然主要是一項特定數據集的任務,需大量的訓練數據,限制了可擴展性,並導致高昂的模型開發和維護成本。最近,旨在通過廣泛的預訓練學習可泛化模式的基礎模型在多個預測任務中顯示出卓越的性能。儘管它們在能源預測方面成功且具有強大的潛力來解決挑戰,但在這一領域的應用仍然未被充分探索。我們通過提出能源時間序列預測的基礎模型(FETS)基準來填補這一空白。我們(1)提供一個結構化的能源預測用例概覽,涵蓋三個主要維度:利益相關者、屬性和數據類別;(2)收集並分析了9個數據類別中的54個數據集,並根據典型的利益相關者興趣進行指導;(3)在不同的預測設置中,將基礎模型與經典機器學習方法進行基準測試。基礎模型在所有設置和數據類別中始終優於針對特定數據集優化的機器學習方法,儘管後者在訓練期間已經看到了完整的歷史目標數據。特別是,考慮協變量的基礎模型表現出最強的性能。進一步分析顯示,預測性能與頻譜熵之間存在強相關性,在某一上下文長度之後性能飽和,並且在更高的聚合層級(如國家負載、區域供熱和電網數據)下性能有所改善。總體而言,我們的研究結果突顯了基礎模型作為可擴展和可泛化的能源領域預測解決方案的強大潛力,特別是在數據受限和隱私敏感的環境中。
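
The abstract links predictive performance to spectral entropy. As a rough illustration of that quantity, the sketch below computes the normalized Shannon entropy of a series' power spectrum. The function name `spectral_entropy`, the mean removal, and the naive DFT are assumptions made here for a self-contained example; the FETS paper may define or estimate the measure differently.

```python
import cmath
import math


def spectral_entropy(series):
    """Normalized Shannon entropy of the power spectrum, in [0, 1].

    Values near 0 mean the energy sits in a few frequencies (regular,
    periodic signal); values near 1 mean a flat, noise-like spectrum.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]

    # Naive DFT over the positive frequencies k = 1 .. n//2.
    # O(n^2), which is fine for short illustrative series.
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)

    total = sum(power)
    if total == 0 or len(power) < 2:
        return 0.0  # constant or too-short series: no spectrum to spread

    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(power))
```

A pure sinusoid concentrates its power in one frequency bin and scores close to 0, while a single impulse has a flat spectrum and scores close to 1.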

Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks

2604.22325v1 by Fahmida Alam, Ellen Riloff

Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.

摘要:現有的自然語言處理(NLP)資源常常缺乏解決現實問題所需的特定任務信息,並且對於較不知名或新引入的實體的覆蓋範圍有限。例如,商業組織和醫療提供者可能需要被分類到各種不同的分類方案中,以滿足特定的應用任務。我們的目標是使領域專家能夠輕鬆地為實體創建特定任務的分類器,只需提供實體名稱和金標籤作為訓練數據。然後,我們的框架動態獲取有關每個實體的描述性文本,這些文本隨後用作生成基於文本的分類器的基礎。我們提出了一種新穎的文本獲取方法,利用了網絡和大型語言模型(LLMs)。我們在兩個不同領域的分類問題上評估了我們提出的框架: (i) 將組織分類為標準行業分類(SIC)代碼,這些代碼根據組織的商業活動對其進行分類;以及 (ii) 將醫療提供者分類為醫療提供者分類代碼,這些代碼表示提供者的醫療專業和實踐領域。我們表現最佳的模型在SIC代碼和醫療分類代碼分類任務中分別達到了82.3%和72.9%的宏觀平均F1分數。

CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems

2604.22313v1 by Tabinda Sarwar, Farhad Moghimifar, Cong Duy Vu Hoang, Xiaoxiao Ma, Shawn Chang Xu, Fahimeh Saleh, Poorya Zaremoodi, Avirup Sil, Katrin Kirchhoff

NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.

摘要:NL2SQL 系統在工業環境中部署時,經常會遇到模糊或無法回答的查詢,特別是在用戶澄清不完整的互動場景中。現有基準通常假設單一的模糊來源,並依賴用戶互動來解決,忽略了現實中的失敗模式。
我們介紹 Clarity,一個自動生成具有多面向模糊性和多樣化用戶行為的 NL2SQL 基準的框架,涵蓋單輪和多輪場景。使用基於約束的流程,Clarity 將可執行的 SQL 轉換為模糊查詢,並增強了基於對話的延續和架構層級的元數據。
在 Spider 和 BIRD 上的實證評估顯示,包括基於強大 LLM 的系統在內的領先 NL2SQL 系統,在多面向模糊性下遭遇顯著的性能下降。雖然這些系統通常能檢測到模糊性,但它們在準確定位和解決潛在的架構層級來源方面卻面臨挑戰。我們的結果突顯了工業級 NL2SQL 系統中對於更強健的模糊性檢測和解決的需求。

BLAST: Benchmarking LLMs with ASP-based Structured Testing

2604.22306v1 by Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri, Francesco Ricca

Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP), to date. In this paper we introduce BLAST: The first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.

摘要:大型語言模型(LLMs)在自然語言理解、對話系統和程式碼生成等廣泛任務中展現了卓越的表現。儘管明顯取得了進展,但迄今為止,對於它們在處理如答案集程式設計(ASP)等宣告性範式的有效性關注較少。在本文中,我們介紹了BLAST:第一個專門的基準測試方法學和相關數據集,用於評估LLMs生成ASP程式碼的準確性。BLAST提供了一個結構化的評估框架,包含兩個針對ASP程式碼生成的新穎語義指標。本文呈現了涉及十個來自ASP文獻的成熟圖相關問題和八個最先進LLMs的多樣化集合的實證評估結果。

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

2604.22294v1 by Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam

Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.

摘要:現實世界的文件問題回答是具有挑戰性的。分析師必須在多個文件和每個文件的不同部分之間綜合證據。然而,隨著文件集合的增長,任何固定的 LLM 上下文窗口都可能被超越。一個常見的解決方法是將文件分解為塊,並從塊級輸出組合答案,但這引入了一個聚合瓶頸:隨著塊的數量增加,系統仍然必須在越來越大的提取證據體上進行組合和推理。我們提出了 SLIDERS,一個通過結構化推理在長文件集合上進行問題回答的框架。SLIDERS 將顯著信息提取到關係數據庫中,通過 SQL 而不是連接的文本實現對持久結構狀態的可擴展推理。為了使這種本地提取的表示在全球範圍內保持一致,SLIDERS 引入了一個數據調和階段,利用來源、提取理由和元數據來檢測和修復重複、不一致和不完整的記錄。儘管所有基準都適合強大基礎 LLM 的上下文窗口,SLIDERS 在三個現有的長上下文基準上超越了所有基準,平均超過 GPT-4.1 6.6 分。它還在兩個新的基準上分別在 3.9M 和 36M 令牌上超過了下一個最佳基準約 19 和 32 分。
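
The idea of extracting facts into a relational database, reconciling them, and answering with SQL can be sketched with Python's built-in `sqlite3`. The `revenue` table, its columns, the sample rows, and the duplicate-removal rule are all hypothetical; the reconciliation stage described for SLIDERS also uses extraction rationales and metadata, which this toy version omits.

```python
import sqlite3


def build_db():
    """Create an in-memory table of facts extracted chunk by chunk."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revenue ("
        "company TEXT, year INTEGER, amount_musd REAL, "
        "source_doc TEXT, chunk_id INTEGER)"
    )
    rows = [
        ("Acme", 2023, 120.0, "report_a.pdf", 4),
        ("Acme", 2023, 120.0, "press_release.txt", 1),  # same fact, other source
        ("Acme", 2024, 150.0, "report_b.pdf", 9),
        ("Globex", 2024, 90.0, "report_c.pdf", 2),
    ]
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def reconcile(conn):
    """Crude stand-in for reconciliation: keep one row per distinct fact."""
    conn.execute(
        "DELETE FROM revenue WHERE rowid NOT IN ("
        "SELECT MIN(rowid) FROM revenue "
        "GROUP BY company, year, amount_musd)"
    )


def total_by_company(conn):
    """Answer an aggregate question with SQL instead of concatenated text."""
    return conn.execute(
        "SELECT company, SUM(amount_musd) FROM revenue "
        "GROUP BY company ORDER BY company"
    ).fetchall()
```

Without the `reconcile` step the duplicated Acme row would be counted twice, which is the kind of aggregation error the abstract attributes to locally extracted records.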

2604.22292v1 by Ishaan Gakhar, Harsh Nandwani

The classification of legal documents from an unstructured data corpus has several crucial applications in downstream tasks. Documents relevant to court filings are key in use cases such as drafting motions, memos, and outlines, as well as in tasks like docket summarisation, retrieval systems, and training data curation. Current methods classify based on provided metadata, LLM-extracted metadata, or multimodal methods. These methods depend on structured data, metadata, and extensive computational power. This task is approached from a perspective of leveraging discriminative features in the documents between classes. The authors propose ReLeVAnT, a framework for legal document binary classification. ReLeVAnT utilises n-gram processing, contrastive score matching, and a shallow neural network as the primary drivers for discriminative classification. It leverages one-time keyword extraction per corpus, followed by a shallow classifier to swiftly and reliably classify documents with 99.3% accuracy and 98.7% F1 score on the LexGLUE dataset.

摘要:法律文件的分類來自非結構化數據語料庫,在下游任務中具有幾個關鍵應用。與法庭文件相關的文件在起草動議、備忘錄和大綱等用例中至關重要,以及在如法庭日程摘要、檢索系統和訓練數據策劃等任務中。當前的方法基於提供的元數據、LLM提取的元數據或多模態方法進行分類。這些方法依賴於結構化數據、元數據和大量計算能力。這項任務從利用文件中類別之間的區別特徵的角度進行。作者提出了ReLeVAnT,一個法律文件二元分類的框架。ReLeVAnT利用n-gram處理、對比得分匹配和淺層神經網絡作為區別性分類的主要驅動因素。它利用每個語料庫的一次關鍵字提取,然後通過一個淺層分類器迅速且可靠地將文件分類,並在LexGLUE數據集上達到99.3%的準確率和98.7%的F1分數。

When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention

2604.22273v1 by Aofan Liu, Jingxiang Meng

Iterative self-correction is widely used in agentic LLM systems, but when repeated refinement helps versus hurts remains unclear. We frame self-correction as a cybernetic feedback loop in which the same language model serves as both controller and plant, and use a two-state Markov model over {Correct, Incorrect} to operationalize a simple deployment diagnostic: iterate only when ECR/EIR > Acc/(1 - Acc). In this view, EIR functions as a stability margin and prompting functions as lightweight controller design. Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (<= 0.5%) separating beneficial from harmful self-correction. Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ~ 0.2%), and o4-mini (+/-0 pp) remain non-degrading; GPT-5 degrades by -1.8 pp. A verify-first prompt ablation provides causal evidence that this threshold is actionable through prompting alone: on GPT-4o-mini it reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4), while producing little change on already-sub-threshold models. ASC further illustrates the stopping trade-off: it halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost. Overall, the paper argues that self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.

摘要:迭代自我修正在代理型 LLM 系統中被廣泛使用,但何時重複的精煉是有幫助的,何時是有害的仍不清楚。我們將自我修正框架視為一個控制論反饋迴路,其中同一語言模型同時作為控制器和植物,並使用一個二狀態馬爾可夫模型來操作一個簡單的部署診斷:僅在 ECR/EIR > Acc/(1 - Acc) 時進行迭代。在這種觀點下,EIR 作為穩定性邊際,提示則作為輕量級控制器設計。在 7 個模型和 3 個數據集(GSM8K、MATH、StrategyQA)中,我們發現一個明顯的接近零的 EIR 閾值(<= 0.5%)將有益的自我修正與有害的自我修正分開。只有 o3-mini (+3.4 pp, EIR = 0%)、Claude Opus 4.6 (+0.6 pp, EIR ~ 0.2%) 和 o4-mini (+/-0 pp) 保持非降級;GPT-5 降級了 -1.8 pp。一個先驗驗證提示的消融提供了因果證據,表明這個閾值可以通過單純的提示來操作:在 GPT-4o-mini 上,它將 EIR 從 2% 降至 0%,並將 -6.2 pp 的降級轉變為 +0.2 pp(配對 McNemar p < 10^-4),同時對已經低於閾值的模型幾乎沒有變化。ASC 進一步說明了停止的權衡:它停止了有害的精煉,但產生了 3.8 pp 的信心引出成本。總的來說,本文主張自我修正不應被視為默認行為,而應作為由可測量的錯誤動態所主導的控制決策。
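
The deployment diagnostic ECR/EIR > Acc/(1 - Acc) follows from a one-step update of a two-state Markov chain. The sketch below assumes ECR is the probability that an incorrect answer becomes correct after one refinement round and EIR is the probability that a correct answer becomes incorrect; the abstract does not spell out these acronyms, so that reading and the function names are assumptions.

```python
def next_accuracy(acc, ecr, eir):
    """Accuracy after one self-correction round of the two-state chain.

    Correct answers survive with probability (1 - eir);
    incorrect answers are fixed with probability ecr.
    """
    return acc * (1.0 - eir) + (1.0 - acc) * ecr


def should_iterate(acc, ecr, eir):
    """True when one more round is expected to raise accuracy.

    next_accuracy > acc  <=>  (1 - acc) * ecr > acc * eir,
    which is ECR/EIR > Acc/(1 - Acc) written without division,
    so eir == 0 and acc == 1 need no special handling.
    """
    return (1.0 - acc) * ecr > acc * eir
```

For example, with `ecr = 0.10` and `eir = 0.02`, a model at 80% accuracy passes the check, while a model at 90% accuracy does not: the same error dynamics help a weaker starting point and hurt a stronger one.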

Semantic Error Correction and Decoding for Short Block Channel Codes

2604.22269v1 by Jiafu Hao, Chentao Yue, Wanchun Liu, Yonghui Li, Branka Vucetic

This paper presents a semantic-enhanced receiver framework for transmitting natural language sentences over noisy wireless channels using multiple short block codes. After ASCII encoding, the sentence is divided into segments, each independently encoded with a short block code and transmitted over an AWGN channel. At the receiver, segments are decoded in parallel, followed by a semantic error correction (SEC) model, which reconstructs corrupted segments using language model context. We further propose semantic list decoding (SLD), which generates multiple candidate reconstructions and selects the best one via weighted Hamming distance, and a semantic confidence-guided HARQ (SHARQ) mechanism that replaces CRC-based error detection with a confidence score, enabling selective segment retransmission without CRC overhead. All modules are designed and trained using bidirectional and auto-regressive transformers (BART). Simulation results demonstrate that the proposed scheme significantly outperforms conventional capacity-approaching short codes and long codes at the same rate. Specifically, SEC provides approximately 0.4 dB BLER gain over plain short-code transmission, while SLD extends this to 0.8 dB. Compared to transmitting the entire sentence as a single long 5G LDPC codeword, our approach significantly improves semantic fidelity and reduces decoding latency by up to 90%. SHARQ further provides an additional 1.5 dB gain over conventional HARQ.

摘要:這篇論文提出了一種語義增強接收器框架,用於通過多個短碼塊在嘈雜的無線通道上傳輸自然語言句子。經過ASCII編碼後,句子被分為多個段落,每個段落獨立地用短碼塊編碼並通過AWGN通道傳輸。在接收端,這些段落並行解碼,隨後進行語義錯誤修正(SEC)模型,利用語言模型上下文重建損壞的段落。我們進一步提出了語義列表解碼(SLD),它生成多個候選重建,並通過加權漢明距離選擇最佳重建,還有一種語義信心引導的HARQ(SHARQ)機制,該機制用信心分數取代基於CRC的錯誤檢測,實現無需CRC開銷的選擇性段落重傳。所有模塊均使用雙向和自回歸Transformer(BART)進行設計和訓練。模擬結果表明,所提出的方案顯著優於傳統的容量接近短碼和相同速率的長碼。具體而言,SEC在純短碼傳輸上提供了約0.4 dB的BLER增益,而SLD將此增益擴展至0.8 dB。與將整個句子作為單個長5G LDPC碼字傳輸相比,我們的方法顯著提高了語義保真度並將解碼延遲降低了最多90%。SHARQ進一步提供了比傳統HARQ多1.5 dB的增益。
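
The candidate-selection step of semantic list decoding can be illustrated with a weighted Hamming distance. The sketch below assumes 8 bits per ASCII character, hard-decision received bits, and per-bit reliability weights (for instance log-likelihood-ratio magnitudes); the helper names and the weighting scheme are hypothetical, since the abstract does not specify how the weights are obtained.

```python
def to_bits(text):
    """ASCII-encode text as a flat list of bits (8 bits per character)."""
    return [int(b) for byte in text.encode("ascii") for b in format(byte, "08b")]


def weighted_hamming(candidate_bits, received_bits, reliabilities):
    """Sum the reliability of every position where the candidate disagrees."""
    return sum(
        weight
        for cand, recv, weight in zip(candidate_bits, received_bits, reliabilities)
        if cand != recv
    )


def select_candidate(candidates, received_bits, reliabilities):
    """Pick the candidate reconstruction closest to what was received.

    Candidates whose bit length differs from the received segment are skipped;
    min() raises ValueError if none remain.
    """
    usable = [c for c in candidates if len(to_bits(c)) == len(received_bits)]
    return min(
        usable,
        key=lambda c: weighted_hamming(to_bits(c), received_bits, reliabilities),
    )
```

Disagreeing with a low-reliability bit costs little, so a candidate that differs from the received word only where the channel output was uncertain is preferred over one that contradicts confident bits.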

Large Language Models Decide Early and Explain Later

2604.22266v1 by Ayan Datta, Zhixue Zhao, Bhuvanesh Verma, Radhika Mamidi, Mounika Marreddy, Alexander Mehler

Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model's final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model's intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.

摘要:大型語言模型通常透過生成長的中間推理鏈來達成強大的表現。然而,目前仍不清楚模型的最終答案實際上是在生成過程中的何時被確定。如果答案在中間階段已經固定,那麼隨後的推理標記可能構成決策後的解釋,增加推理成本和延遲,而不改善正確性。我們使用強制答案完成來研究預測答案在推理步驟中的演變,這種方法可以引出模型在部分推理前綴下的中間預測。專注於Qwen3-4B並對所有考慮的數據集進行結果平均,我們發現預測答案在僅32%的查詢中發生變化。此外,一旦最終答案切換發生,模型每個查詢平均生成760個額外的推理標記,這占據了總推理預算的相當一部分。基於這些發現,我們研究了早期停止策略,這些策略在答案穩定後立即停止生成。我們顯示,簡單的啟發式方法,包括基於探測的停止,可以將每個查詢的推理標記使用量減少500個,同時僅造成2%的準確度下降。綜合來看,我們的結果表明,大部分的推理鏈生成是多餘的,並且可以在對性能影響最小的情況下減少。
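
One simple way to stop generation once the answer has stabilized is a patience rule over the intermediate predictions. The sketch below assumes a list of answers already elicited at successive reasoning prefixes (for example via forced answer completion); the `patience` parameter and the rule itself are illustrative assumptions, not the probe-based heuristics evaluated in the paper.

```python
def stop_index(intermediate_answers, patience=3):
    """Return the index of the probe at which generation could halt.

    Stops at the first probe where the same answer has been predicted
    `patience` times in a row; falls back to the last probe otherwise.
    """
    if not intermediate_answers:
        raise ValueError("need at least one intermediate answer")

    streak = 0
    for i, answer in enumerate(intermediate_answers):
        if i > 0 and answer == intermediate_answers[i - 1]:
            streak += 1
        else:
            streak = 1  # answer changed (or first probe): restart the count
        if streak >= patience:
            return i
    return len(intermediate_answers) - 1
```

A larger `patience` saves fewer tokens but is less likely to stop before a late answer switch, which mirrors the token-versus-accuracy trade-off the abstract reports.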

Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion

2604.22261v1 by Fahmida Alam, Mihai Surdeanu, Ellen Riloff

Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.

摘要:大型語言模型(LLMs)在關係完成(RC)方面面臨挑戰,無論是使用檢索增強生成(RAG)還是未使用,特別是在所需資訊稀有或稀疏表示的情況下。為了解決這個問題,我們提出了一種新穎的多階段同義詞引導關係完成框架,RC-RAG,該框架系統性地在多個階段中整合關係同義詞。具體而言,RC-RAG:(a)將同義詞整合到檢索中,以擴展關係的詞彙覆蓋範圍,(b)使用同義詞生成關係感知摘要,以及(c)在生成過程中利用同義詞來引導關係完成的推理。重要的是,我們的方法不需要任何模型微調。在兩個基準數據集上對五個LLM的實驗顯示,RC-RAG始終優於幾個RAG基準。在長尾設置中,最佳表現的LLM在增強RC-RAG後,其精確匹配(EM)分數比獨立性能提高了40.6個點,並且分別超過了兩個強大的RAG基準16.0和13.8個EM點,同時保持低計算開銷。

Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

2604.22260v1 by Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv

Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.

摘要:城市交通系統面臨日益增長的安全挑戰,這需要可擴展的智慧以應對新興的智慧移動基礎設施。儘管最近在基礎模型和大規模多模態數據集方面的進展加強了智能交通系統(ITS)的感知和推理能力,但現有研究仍主要集中在微觀自動駕駛(AD)上,對城市規模的交通分析關注有限。特別是,針對開放式安全導向的視覺問題回答(VQA)及相應的基礎模型,對於異質路邊攝像頭觀察的推理仍然未被充分探索。為了解決這一空白,我們推出了陸上交通數據集(LTD),這是一個大規模開源的視覺-語言數據集,用於城市交通環境中的開放式推理。LTD包含了從異質路邊攝像頭收集的11600對高質量的VQA,涵蓋了多樣的道路幾何、交通參與者、照明條件和惡劣天氣。該數據集整合了三個互補任務:細粒度多物體定位、多圖像攝像頭選擇和多圖像風險分析,這需要對最小相關視圖進行聯合推理,以推斷危險物體、貢獻因素和危險的道路方向。為了確保標註的準確性,我們結合了多模型視覺-語言生成、交叉驗證和人類參與的精煉。在LTD的基礎上,我們進一步提出了UniVLT,這是一個通過課程式知識轉移訓練的交通基礎模型,旨在將微觀AD推理和宏觀交通分析統一於單一架構中。在LTD和多個AD基準上的廣泛實驗表明,UniVLT在多樣領域的開放式推理任務中達到了SOTA性能,同時揭示了現有基礎模型在複雜的多視圖交通場景中的局限性。

A Probabilistic Framework for Hierarchical Goal Recognition

2604.22256v1 by Chenyuan Zhang, Katherine Ip, Hamid Rezatofighi, Buser Say, Mor Vered

Goal recognition aims to infer an agent's goal from observations of its behaviour. In realistic settings, recognition can benefit from exploiting hierarchical task structure and reasoning under uncertainty. Planning-based goal recognition has made substantial progress over the past decade, but to the best of our knowledge no existing approach jointly integrates hierarchical task structure with probabilistic inference. In this paper, we introduce the first planning-based probabilistic framework for hierarchical goal recognition over Hierarchical Task Networks (HTNs). We instantiate the framework by exploiting an HTN planner with a three-stage generative model for likelihood estimation, yielding posterior distributions over goal hypotheses. Empirical results show improved recognition performance over the existing HTN-based recognizer on HTN benchmarks. Overall, the framework lays a foundation for probabilistic goal recognition grounded in hierarchical planning structure, moving goal recognition toward more practical settings.

摘要:目標識別旨在從觀察代理的行為中推斷其目標。在現實情境中,識別可以通過利用層次任務結構和在不確定性下推理來獲益。基於規劃的目標識別在過去十年中取得了重大進展,但據我們所知,尚無現有方法能夠將層次任務結構與概率推理共同整合。在本文中,我們介紹了第一個基於規劃的層次目標識別的概率框架,該框架基於層次任務網絡(HTNs)。我們通過利用一個HTN規劃器,並使用三階段生成模型進行似然估計來實現該框架,從而產生目標假設的後驗分佈。實證結果顯示,在HTN基準上,相較於現有的基於HTN的識別器,識別性能有所改善。總體而言,該框架為基於層次規劃結構的概率目標識別奠定了基礎,將目標識別推向更實用的環境。

Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis

2604.22237v1 by Zhilin Fan, Deliang Wang, Penghe Chen, Yu Lu

Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.

摘要:診斷學生問題行為需要教師綜合多方面的信息、識別行為類別並規劃干預策略。雖然微調過的大型語言模型(LLMs)可以通過多輪對話支持這一過程,但它們很少解釋為什麼推薦某一策略,這限制了透明度和教師的信任。為了解決這一問題,我們提出了一個基於微調LLM的可解釋對話系統。該系統使用基於可解釋人工智慧(xAI)的層次歸因方法來識別每個推薦的對話證據,並根據該證據生成自然語言解釋。在技術評估中,該方法在識別支持證據方面超過了基準方法。在對22名預備教師的初步用戶研究中,接受解釋的參與者報告對系統的信任度更高。這些發現表明,改善LLM在教育對話系統中的可解釋性是一個有前景的方向。

A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies

2604.22227v1 by Somyajit Chakraborty

Classical robot ethics is often framed around obedience, most famously through Asimov's laws. This framing is too narrow for contemporary AI systems, which are increasingly adaptive, generative, embodied, and embedded in physical, psychological, and social worlds. We argue that future human-AI relations should not be understood as master-tool obedience. A better framework is conditional mutualism under governance: a co-evolutionary relationship in which humans and AI systems can develop, specialize, and coordinate, while institutions keep the relationship reciprocal, reversible, psychologically safe, and socially legitimate. We synthesize work from computability, automata theory, statistical machine learning, neural networks, deep learning, transformers, generative and foundation models, world models, embodied AI, alignment, human-robot interaction, ecological mutualism, biological markets, coevolution, and polycentric governance. We then formalize coexistence as a multiplex dynamical system across physical, psychological, and social layers, with reciprocal supply-demand coupling, conflict penalties, developmental freedom, and governance regularization. The framework yields a coexistence model with conditions for existence, uniqueness, and global asymptotic stability of equilibria. It shows that reciprocal complementarity can strengthen stable coexistence, while ungoverned coupling can produce fragility, lock-in, polarization, and domination basins. Human-AI coexistence should therefore be designed as a co-evolutionary governance problem, not as a one-shot obedience problem. This shift supports a scientifically grounded and normatively defensible charter of coexistence: one that permits bounded AI development while preserving human dignity, contestability, collective safety, and fair distribution of gains.

摘要:古典機器人倫理通常圍繞服從進行框架,最著名的是阿西莫夫的法則。這種框架對於當代的人工智慧系統來說過於狹隘,這些系統日益適應性強、生成性高、具體化,並嵌入於物理、心理和社會世界中。我們主張,未來的人類與人工智慧的關係不應被理解為主導-工具的服從關係。更好的框架是治理下的條件互惠主義:一種共同進化的關係,在這種關係中,人類和人工智慧系統可以發展、專業化和協調,同時機構保持關係的互惠性、可逆性、心理安全性和社會合法性。我們綜合了可計算性、自動機理論、統計機器學習、神經網絡、深度學習、Transformer、生成模型和基礎模型、世界模型、具體化人工智慧、對齊、人機互動、生態互惠主義、生物市場、共同進化和多中心治理的研究成果。我們然後將共存形式化為一個跨越物理、心理和社會層面的多重動態系統,具有互惠的供需耦合、衝突懲罰、發展自由和治理正規化。該框架產生了一個共存模型,具備存在性、唯一性和均衡的全局漸近穩定性條件。它顯示,互惠互補可以加強穩定共存,而無治理的耦合可能會導致脆弱性、鎖定、極化和支配盆地。因此,人類與人工智慧的共存應被設計為一個共同進化的治理問題,而不是一次性的服從問題。這一轉變支持了一個科學上有根據且在規範上可辯護的共存章程:一個允許有限的人工智慧發展,同時保護人類尊嚴、可爭議性、集體安全和收益的公平分配的章程。

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

2604.22225v1 by Xi Wang, Jie Wang, Xingchen Song, Baijun Song, Jingran Xie, Jiahe Shao, Zijian Lin, Di Wu, Meng Meng, Jian Luan, Zhiyong Wu

While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at https://github.com/xiaomi-research/tts-prism.

摘要:雖然生成式文本到語音(TTS)模型接近人類水平的質量,但單一的評估指標無法診斷細微的聲學瑕疵或解釋感知崩潰。為了解決這個問題,我們提出了 TTS-PRISM,一個針對普通話的多維診斷框架。首先,我們建立了一個涵蓋穩定性到高級表現力的 12 維架構。其次,我們設計了一個針對性的合成流程,利用對抗擾動和專家錨點來構建高質量的診斷數據集。第三,基於架構的指令調整將明確的評分標準和推理嵌入到一個高效的端到端模型中。在 1,600 個樣本的金測試集上的實驗顯示,TTS-PRISM 在人類對齊方面優於通用模型。對六種 TTS 範式的分析建立了直觀的診斷標誌,揭示了細微的能力差異。TTS-PRISM 是開源的,代碼和檢查點可在 https://github.com/xiaomi-research/tts-prism 獲得。

Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

2604.22215v1 by Jon-Paul Cacioli

Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: osf.io/azbvx), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted >=4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level logprobability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.

摘要:口頭信心引導被廣泛用於從大型語言模型中提取不確定性估計。我們測試了七個經過指令調整的開放權重模型(3-9B 參數,四個家族)是否能產生符合項目級別 Type-2 區分的最小有效性標準的口頭信心,在最小數字引導和貪婪解碼下進行。 在一項預註冊的研究中(OSF: osf.io/azbvx),524 個 TriviaQA 項目在數字(0-100)和類別(10 類)引導下,對八個模型在消費者硬體上的 Q5_K_M 量化進行了測試,產生了 8,384 次確定性試驗。對每個模型格式單元進行了心理測量有效性篩選。所有七個指令模型在數字信心上被分類為無效(H2 確認,7/7 與預測 >=4/7),平均上限率為 91.7%(H1 確認)。類別引導並未挽救有效性。相反,它在七個模型中的六個模型中干擾了任務表現,產生的準確率低於 5%(H4 未確認)。在觀察到的變異範圍內,標記級別的對數概率未能有效預測口頭信心(H5 確認,平均交叉驗證 R^2 < 0.01)。在推理提煉模型中,推理痕跡長度與信心之間顯示出強烈的負部分相關(rho = -0.36,p < .001),這與推理污染效應一致。這些結果並不意味著內部不確定性表示的缺失。它們顯示,在這一模型大小範圍內,最小的口頭引導未能在輸出介面上保留內部信號。心理測量篩選應在任何下游使用此類信號之前進行。

UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

2604.22209v1 by Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang

Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.

摘要:生成音頻建模在很大程度上已經被分割為專門的任務,包括文本轉語音(TTS)、文本轉音樂(TTM)和文本轉音頻(TTA),每個任務都在異質的控制範式下運作。統一這些模態仍然是一個基本挑戰,因為結構化語義表示(語音/音樂)和非結構化聲學質地(音效)之間存在內在的不和諧。在本文中,我們介紹了UniSonate,一種統一的流匹配框架,能夠通過標準化的、無參考的自然語言指令接口合成語音、音樂和音效。為了調和結構差異,我們提出了一種新穎的動態標記注入機制,將非結構化的環境聲音投影到結構化的時間潛在空間中,使得在基於音素的多模態擴散Transformer(MM-DiT)中能夠精確控制持續時間。結合多階段課程學習策略,這種方法有效地減輕了跨模態優化衝突。大量實驗表明,UniSonate在基於指令的TTS(WER 1.47%)和TTM(SongEval一致性3.18)中達到了最先進的性能,同時在TTA中保持了競爭力的保真度。重要的是,我們觀察到正向轉移,對多樣化音頻數據的聯合訓練顯著增強了結構一致性和韻律表現,相較於單一任務基準。音頻樣本可在https://qiangchunyu.github.io/UniSonate/上獲得。

Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations

2604.22207v1 by Anna Arnaudo, Riccardo Coppola, Maurizio Morisio, Flavio Giobergia, Andrea Bioddo, Angelo Bongiorno, Luca Dadone

Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high-level goal extraction, and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification (the final stage), these results indicate that the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with Zero-shot outperformed stand-alone Few-shot, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we report that the combination of the feedback mechanism with Few-shot does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the 'critic' LLM. Together with the refinement of both the quantity and quality of the Shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.

摘要:由於許多需求工程(RE)文檔的文本性和重複性,大型語言模型(LLMs)已被證明對自動化它們的生成和處理非常有用。在本文中,我們討論了一種可能的自動化目標導向需求工程(GORE)過程的方法,通過三個階段從軟體文檔中提取功能性目標:角色識別、高層和低層目標提取。為了實現這些功能,我們提出了一個由經過設計的提示餵養的LLMs鏈。我們實驗了不同變體的上下文學習,並測量了輸入數據與上下文示例之間的相似性,以更好地研究它們的影響。另一個關鍵元素是生成-評估機制,作為涉及兩個LLMs的反饋循環來實現。儘管該管道在低層目標識別的最終階段達到了61%的準確率,但這些結果表明該方法最適合作為加速手動提取的工具,而不是完全替代。使用零-shot的反饋循環表現優於獨立的few-shot,並且一項消融研究表明,在沒有反饋循環的情況下,性能會略有下降。然而,我們報告指出,反饋機制與few-shot的結合並未帶來任何優勢,這可能表明主要的性能上限是應用於“評估者”LLM的提示策略。未來的研究將結合檢索增強生成(RAG)和思維鏈(CoT)提示,以提高準確性,並同時改善示例的數量和質量。

An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments

2604.22199v1 by Hong Su

Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered tasks, and even successful executions or observed successful external behaviors are not always autonomously transformed into reusable local knowledge. In this paper, we propose an LLM-driven closed-loop autonomous learning framework for robots facing uncovered tasks in open environments. The proposed framework first retrieves the local method library to determine whether a reusable solution already exists for the current task or observed event. If no suitable method is found, it triggers an autonomous learning process in which the LLM serves as a high-level reasoning component for task analysis, candidate model selection, data collection planning, and execution or observation strategy organization. The robot then learns from both self-execution and active observation, performs quasi-real-time training and adjustment, and consolidates the validated result into the local method library for future reuse. Through this recurring closed-loop process, the robot gradually converts both execution-derived and observation-derived experience into reusable local capability while reducing future dependence on repeated external LLM interaction. Results show that the proposed framework reduces execution time and LLM dependence in both repeated-task self-execution and observation-driven settings, for example reducing the average total execution time from 7.7772s to 6.7779s and the average number of LLM calls per task from 1.0 to 0.2 in the repeated-task self-execution experiments.

摘要:自主機器人在開放環境中運作需要持續處理未被預定本地方法涵蓋的任務的能力。然而,現有的方法通常依賴於對未涵蓋任務進行重複的大型語言模型(LLM)互動,即使成功的執行或觀察到的成功外部行為也不一定能自動轉化為可重用的本地知識。在本文中,我們提出了一個基於LLM的閉環自主學習框架,旨在幫助面對開放環境中未涵蓋任務的機器人。所提出的框架首先檢索本地方法庫,以確定當前任務或觀察事件是否已存在可重用的解決方案。如果未找到合適的方法,則觸發自主學習過程,其中LLM作為任務分析、候選模型選擇、數據收集計劃及執行或觀察策略組織的高級推理組件。然後,機器人從自我執行和主動觀察中學習,進行準實時訓練和調整,並將經過驗證的結果整合到本地方法庫中以供未來重用。通過這一重複的閉環過程,機器人逐漸將執行衍生和觀察衍生的經驗轉化為可重用的本地能力,同時減少對重複外部LLM互動的未來依賴。結果顯示,所提出的框架在重複任務自我執行和觀察驅動的設置中減少了執行時間和LLM依賴,例如在重複任務自我執行實驗中,將平均總執行時間從7.7772秒減少到6.7779秒,將每個任務的平均LLM調用次數從1.0減少到0.2。
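The retrieve-then-learn loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the class `ClosedLoopRobot`, the task type `sort_parts`, and the stubbed LLM step are all hypothetical, and the "validation" step is reduced to a placeholder.

```python
from typing import Callable, Dict

Method = Callable[[str], str]


class ClosedLoopRobot:
    """Toy stand-in: look up a local method first, fall back to a stubbed LLM."""

    def __init__(self) -> None:
        # Local method library: task type -> reusable local method.
        self.method_library: Dict[str, Method] = {}
        self.llm_calls = 0

    def _learn_with_llm(self, task_type: str) -> Method:
        # Hypothetical stub for the LLM-guided learning step; a real system
        # would plan data collection and train or adjust a model here.
        self.llm_calls += 1
        return lambda payload: f"{task_type} handled: {payload}"

    def handle(self, task_type: str, payload: str) -> str:
        method = self.method_library.get(task_type)
        if method is None:
            # Uncovered task: learn once, then consolidate for future reuse.
            method = self._learn_with_llm(task_type)
            result = method(payload)
            self.method_library[task_type] = method  # placeholder for "validated" consolidation
            return result
        # Covered task: reuse the local method without contacting the LLM.
        return method(payload)


robot = ClosedLoopRobot()
tasks = [("sort_parts", f"batch-{i}") for i in range(5)]
results = [robot.handle(task_type, payload) for task_type, payload in tasks]
calls_per_task = robot.llm_calls / len(tasks)
print(calls_per_task)  # one LLM call spread over five repeats of the same task
```

In this toy run the same task repeats five times, so only the first repeat triggers the stub. The resulting ratio is purely illustrative and does not reproduce the paper's experimental setup.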

How Large Language Models Balance Internal Knowledge with User and Document Assertions

2604.22193v1 by Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang

Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.

摘要:大型語言模型(LLMs)在現實世界的情境中,如RAG或基於聊天的系統,經常需要平衡其內部參數知識與外部信息,例如用戶信念和檢索文檔中的內容。模型可靠處理這些來源的能力對系統安全至關重要。先前關於知識衝突和諂媚的研究僅限於二元衝突範式,主要探討參數知識與文檔或用戶之間的衝突,但忽略了所有三個來源同時存在的互動環境。為填補這一空白,我們提出了一個三來源互動框架,並系統性地評估了來自三個家族的27個LLMs在兩個數據集上的表現。我們的發現揭示了一般模式:大多數模型更依賴於文檔的主張而非用戶的主張,這一偏好在後訓練中得到了加強。此外,我們的行為分析顯示,大多數模型易受影響,無法有效區分有益和有害的外部信息。為了解決這個問題,我們展示了在多樣化來源互動數據上進行微調可以顯著提高模型的區分能力。簡而言之,我們的工作為開發可信賴的LLMs鋪平了道路,使其能夠有效且可靠地整合多個信息來源。代碼可在 https://github.com/shuowl/llm-source-balancing 獲得。

Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

2604.22191v1 by Chaoran Chen, Dayu Yuan, Peter Kairouz

In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model's behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.

摘要:在代理工作流程中,LLMs 經常處理法律上受保護的檢索上下文,這些上下文無法進一步用於訓練。然而,審計員目前缺乏可靠的方法來驗證提供者是否通過將這些數據納入後訓練來違反服務條款,特別是通過強化學習 (RL)。雖然標準審計依賴於逐字記憶和會員推斷,但這些方法對於 RL 訓練的模型無效,因為 RL 主要影響模型的行為風格,而不是特定事實的保留。為了填補這一空白,我們引入了行為金絲雀,這是一種新的 RLFT 管道審計機制。該框架通過將文檔觸發器與獎勵特定風格反應的反饋配對來對偏好數據進行工具化,如果這些數據在訓練中被使用,則會誘導出潛在的觸發條件偏好。實證結果顯示,這些行為信號能夠檢測未經授權的文檔條件訓練,在 1% 金絲雀注入率下達到 67% 的檢測率,假陽性率為 10%(AUROC = 0.756)。更廣泛地說,我們的結果確立了行為金絲雀作為 RLFT 管道的新審計機制,使審計員能夠測試訓練期間的影響,即使這種影響表現為分佈行為變化而不是記憶。
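The headline metrics in this abstract (detection rate at a fixed false-positive rate, plus AUROC) can be computed from raw audit scores roughly as sketched below. The score lists are made-up illustrative numbers, not data from the paper, and the helper names `tpr_at_fpr` and `auroc` are hypothetical.

```python
from typing import List, Tuple


def tpr_at_fpr(pos: List[float], neg: List[float], max_fpr: float) -> Tuple[float, float, float]:
    """Pick the threshold that lets at most max_fpr of negatives through, then report TPR."""
    neg_sorted = sorted(neg, reverse=True)
    allowed = int(max_fpr * len(neg_sorted))  # negatives permitted above the threshold
    threshold = neg_sorted[allowed] if allowed < len(neg_sorted) else float("-inf")
    tpr = sum(s > threshold for s in pos) / len(pos)
    fpr = sum(s > threshold for s in neg) / len(neg)
    return tpr, fpr, threshold


def auroc(pos: List[float], neg: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical audit scores: higher means "looks trained on canary data".
trained_on_canaries = [0.95, 0.8, 0.7, 0.6, 0.42, 0.15]
clean_models = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.9]

tpr, fpr, threshold = tpr_at_fpr(trained_on_canaries, clean_models, max_fpr=0.10)
auc = auroc(trained_on_canaries, clean_models)
print(f"TPR={tpr:.3f} at FPR={fpr:.2f}, AUROC={auc:.3f}")
```

Scores are flagged only when strictly above the threshold, so ties with the boundary negative never push the false-positive rate over the budget.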

ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression

2604.22180v1 by Xiaojie Ke, Shuai Zhang, Liansheng Sun, Yongjin Wang, Hengjun Jiang, Xiangkun Liu, Cunxin Gu, Jian Xu, Guanjun Jiang

Large language model (LLM) based listwise reranking has emerged as the dominant paradigm for achieving state-of-the-art ranking effectiveness in information retrieval. However, its reliance on feeding full passage texts into the LLM introduces two critical bottlenecks: the "lost in the middle" phenomenon degrades ranking quality as input length grows, and the inference latency scales super-linearly with sequence length, rendering it impractical for industrial deployment. In this paper, we present ResRank, a unified retrieval-reranking framework that fundamentally addresses both challenges. Inspired by multimodal LLMs that project visual inputs into compact token representations, ResRank employs an Encoder-LLM to compress each candidate passage into a single embedding, which is then fed alongside the query text into a Reranker-LLM for listwise ranking. To alleviate the misalignment between the compressed representation space and the ranking space, we introduce a residual connection structure that combines encoder embeddings with contextualized hidden states from the reranker. Furthermore, we replace the conventional autoregressive decoding with a one-step cosine-similarity-based scoring mechanism, eliminating the generation bottleneck entirely. ResRank is trained through a carefully designed dual-stage, multi-task, end-to-end joint optimization strategy that simultaneously trains the encoder and reranker, achieving learning objective alignment between retrieval and reranking while substantially reducing training complexity. Extensive experiments on TREC Deep Learning and eight BEIR benchmark datasets demonstrate that ResRank achieves competitive or superior ranking effectiveness compared to existing approaches while requiring zero generated tokens and processing only one token per passage, yielding a fundamentally better balance between effectiveness and efficiency.

摘要:大型語言模型(LLM)基於列表重排序的技術已成為在資訊檢索中實現最先進排名效果的主流範式。然而,其依賴將完整段落文本輸入LLM的方式引入了兩個關鍵瓶頸:隨著輸入長度的增加,“中間丟失”現象會降低排名質量,且推理延遲與序列長度呈超線性增長,使其在工業部署中變得不切實際。在本文中,我們提出了ResRank,一個統一的檢索-重排序框架,根本上解決了這兩個挑戰。受到將視覺輸入投影為緊湊標記表示的多模態LLM的啟發,ResRank使用Encoder-LLM將每個候選段落壓縮為單個嵌入,然後將其與查詢文本一起輸入到Reranker-LLM中進行列表排名。為了減輕壓縮表示空間與排名空間之間的錯位,我們引入了一種殘差連接結構,將編碼器嵌入與重排序器的上下文隱藏狀態結合起來。此外,我們用基於一步餘弦相似度的評分機制取代了傳統的自回歸解碼,徹底消除了生成瓶頸。ResRank通過精心設計的雙階段、多任務、端到端的聯合優化策略進行訓練,該策略同時訓練編碼器和重排序器,實現檢索與重排序之間的學習目標對齊,同時大幅降低訓練複雜性。在TREC深度學習和八個BEIR基準數據集上的廣泛實驗顯示,ResRank在排名效果上達到了與現有方法相當或更優的效果,同時要求零生成標記並僅處理每個段落的一個標記,從而在效果和效率之間實現了根本更好的平衡。
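The one-step scoring idea in this abstract, one embedding per passage, a residual combination with the reranker's hidden state, and cosine similarity instead of decoding, might look like the sketch below. The abstract does not say how the residual is formed or which vector represents the query, so the additive fusion, the `query_state` vector, and all numbers here are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_passages(
    query_state: Sequence[float],
    encoder_embs: List[Sequence[float]],
    reranker_states: List[Sequence[float]],
) -> Tuple[List[int], List[float]]:
    """Score every passage in one pass; no tokens are generated."""
    scores = []
    for emb, hidden in zip(encoder_embs, reranker_states):
        # Assumed residual connection: encoder embedding + contextualized hidden state.
        fused = [e + h for e, h in zip(emb, hidden)]
        scores.append(cosine(query_state, fused))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy 2-d vectors standing in for model outputs (hypothetical values).
query_state = [1.0, 0.0]
encoder_embs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
reranker_states = [[0.1, 0.0], [0.0, 0.1], [0.2, -0.2]]

order, scores = rank_passages(query_state, encoder_embs, reranker_states)
print(order)
```

Because each passage contributes a single vector, the reranker's input length grows with the number of candidates rather than with their text length, which is the efficiency argument the abstract makes.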

ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation

2604.22169v1 by Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu

Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast learning-signal framework that first restores minimal learnability for all-zero groups and then replaces full-group reward normalization with a boundary-focused contrastive update on the strongest positive and the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width. Across multiple generative recommendation tasks, ReCast consistently outperforms OpenOneRec-RL, achieving up to 36.6% relative improvement in Pass@1. Its matched-budget advantage is substantially larger: ReCast reaches the baseline's target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. The same design also yields direct system-level gains, reducing actor-side update time by 16.60x, lowering peak allocated memory by 16.5%, and improving actor MFU by 14.2%. Mechanism analysis shows that ReCast mitigates the persistent all-zero / single-hit regime, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates. These results suggest that, for generative recommendation, the decisive RL problem is not only how to assign rewards, but how to construct learnable optimization events from sparse, structured supervision.

摘要:一般基於群體的強化學習假設抽樣的回合群體已經是可用的學習信號。我們展示了這一假設在稀疏命中生成推薦中失效,許多抽樣的群體根本無法成為可學習的。我們提出了ReCast,一個先修復再對比的學習信號框架,首先為全零群體恢復最小的可學習性,然後用針對最強正樣本和最難負樣本的邊界聚焦對比更新取代全群體獎勵標準化。ReCast保持外部強化學習框架不變,僅修改群體內的信號構建,並部分解耦回合搜索寬度與演員端更新寬度。在多個生成推薦任務中,ReCast始終超越OpenOneRec-RL,實現了Pass@1的相對提升高達36.6%。其匹配預算的優勢更為顯著:ReCast僅用4.1%的回合預算就達到了基準的目標性能,且隨著模型規模的擴大,這一優勢進一步擴大。相同的設計還帶來了直接的系統級增益,將演員端的更新時間減少了16.60倍,降低了峰值分配內存16.5%,並改善了演員的MFU達14.2%。機制分析顯示,ReCast緩解了持續的全零/單次命中狀態,在自然正樣本稀缺時恢復可學習性,並將否則浪費的回合預算轉化為更穩定的策略更新。這些結果表明,對於生成推薦而言,關鍵的強化學習問題不僅在於如何分配獎勵,還在於如何從稀疏的結構化監督中構建可學習的優化事件。

Estimating Tail Risks in Language Model Output Distributions

2604.22167v1 by Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He

Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk

摘要:語言模型的能力日益增強,並且正在迅速在整個人群中部署。因此,這些模型的安全性變得越來越重要。幸運的是,對齊技術的進步顯著降低了有害模型輸出發生的可能性。然而,當模型在一天內被查詢數十億次時,即使是罕見的最壞情況行為也會發生。目前的安全評估專注於捕捉產生有害輸出的輸入分佈。這些評估忽略了模型的概率特性及其尾部輸出行為。為了測量這種尾部風險,我們提出了一種方法來有效估計任何輸入查詢的有害輸出概率。我們不使用簡單的暴力採樣從目標模型中獲取樣本,因為有害輸出可能是罕見的,而是通過創建不安全版本的目標模型來實現重要性採樣。這些不安全版本使得樣本高效估計成為可能,因為它們提高了有害輸出的概率。在測量濫用和不對齊的基準上,這些估計與使用少10-20倍樣本的暴力蒙特卡羅估計相匹配。例如,我們可以用僅500個樣本估計有害輸出的概率在10^-4的量級。此外,我們發現這些有害性估計可以揭示模型對模型輸入擾動的敏感性並預測部署風險。我們的工作表明,準確的稀有事件估計對於安全評估既至關重要又可行。代碼可在 https://github.com/rangell/LMTailRisk 獲得。
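The importance-sampling idea in this abstract can be illustrated on a toy output space. The sketch below samples from an "unsafe" proposal distribution and reweights each harmful sample by the ratio of target to proposal probability. All probabilities are invented for illustration, and a four-item output space is a drastic simplification of real text generation, where the weights would come from sequence likelihoods under the two models.

```python
import random

# Hypothetical toy output space for one fixed query.
OUTPUTS = ["refusal", "safe_answer", "harmful_a", "harmful_b"]
HARMFUL = {"harmful_a", "harmful_b"}

# Assumed output probabilities (illustrative numbers only).
TARGET = {"refusal": 0.6, "safe_answer": 0.3999, "harmful_a": 0.00007, "harmful_b": 0.00003}
UNSAFE = {"refusal": 0.3, "safe_answer": 0.3, "harmful_a": 0.25, "harmful_b": 0.15}


def estimate_harm_probability(n_samples: int, seed: int = 0) -> float:
    """Importance-sampling estimate of P_target(harmful) using samples from UNSAFE."""
    rng = random.Random(seed)
    samples = rng.choices(OUTPUTS, weights=[UNSAFE[o] for o in OUTPUTS], k=n_samples)
    total = 0.0
    for out in samples:
        if out in HARMFUL:
            # Reweight: how much more (or less) likely is this output under the target?
            total += TARGET[out] / UNSAFE[out]
    return total / n_samples


TRUE_PROBABILITY = sum(TARGET[o] for o in HARMFUL)
estimate = estimate_harm_probability(500)
print(f"true={TRUE_PROBABILITY:.2e} estimate={estimate:.2e}")
```

With these assumed numbers, 500 draws from the target itself would contain 0.05 harmful samples in expectation, so naive sampling would usually observe none, while the proposal yields harmful samples on roughly 40% of draws.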

Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models

2604.22166v1 by Ryoma Kumon, Hitomi Yanaka

While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that the mechanisms identified by activation patching generalize to out-of-distribution inputs, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that manipulating the identified components improves model performance on acceptability judgment benchmarks.

摘要:雖然語言模型展示了複雜的句法能力,但它們的內部機制與語言學中研究的跨建構原則的對應程度仍然不甚了解。這項研究探討模型是否在不同的句法建構中使用共享的神經機制,通過在細粒度層面應用因果可解釋性方法。專注於填充-缺口依賴和負極性項(NPI)授權,我們利用激活修補來識別特定注意力頭和多層感知器(MLP)區塊的功能角色。我們的結果顯示,填充-缺口依賴的機制高度局部且共享,位於早期到中層,而NPI處理則沒有這樣的統一機制。此外,我們發現通過激活修補識別的這些機制可以推廣到分佈外的情況,而分佈對齊搜尋這種監督可解釋性方法則容易在狹窄的語言分佈上過擬合。最後,我們通過證明對識別組件的操控改善模型在可接受性判斷基準上的表現來驗證我們的發現。

GenMatter: Perceiving Physical Objects with Generative Matter Models

2604.22160v1 by Eric Li, Arijit Dasgupta, Yoni Friedman, Mathieu Huot, Vikash Mansinghka, Thomas O'Connell, William T. Freeman, Joshua B. Tenenbaum

Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.

摘要:人類的視覺感知為理解基於運動的場景解釋的計算原則提供了寶貴的見解。人類能夠穩定地檢測和分割構成獨立可移動物質塊的運動實體,無論是觀察稀疏的運動點、紋理表面還是自然場景。相比之下,現有的計算機視覺系統缺乏一種統一的方法,無法在這些多樣的環境中運作。受到人類感知原則的啟發,我們提出了一種生成模型,將低階運動線索和高階外觀特徵分層地分組為粒子(代表局部物質的小高斯),並將粒子分組為捕捉連貫且獨立可移動的物理實體的聚類。我們開發了一種基於並行化區塊吉布斯抽樣的硬體加速推斷算法,以恢復穩定的粒子運動和分組。我們的模型可以處理不同類型的輸入(隨機點、風格化紋理或自然RGB視頻),使其能夠在生物視覺成功但現有計算機視覺方法無法應對的環境中運作。我們在三個領域驗證了這一統一框架:在2D隨機點運動圖上,我們的方法捕捉了人類物體感知,包括在模糊條件下的漸進不確定性;在一個受格式塔啟發的隱蔽旋轉物體數據集上,我們的方法從運動中恢復正確的3D結構,從而實現準確的2D物體分割;在自然RGB視頻上,我們的模型追蹤構成變形物體的運動3D物質,實現穩健的物體級場景理解。因此,這項工作建立了一個基於人類視覺原則的運動感知通用框架。

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

2604.22154v1 by Meghana Karnam, Ananya Joshi

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while keeping false negative rates similar across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.

摘要:新興的行為健康和精神病學中的人工智慧系統使用多步驟或多代理的LLM管道來執行評估自我傷害風險和篩檢抑鬱症等任務。然而,常見的評估方法,如LLM作為裁判,並未指示何時決策是可靠的,或如何在多個LLM判斷中累積錯誤,這限制了它們在安全關鍵環境中的適用性。我們提出了一個統計框架,針對結構為有向無環圖(DAG)的多代理管道,提供了一種基於原則的、自適應的決策制定替代啟發式投票的方法。我們將每個代理建模為隨機類別決策,並引入(1)更緊的代理級性能信心界限,(2)基於輸入難度的強盜式自適應抽樣策略,以及(3)在多代理系統上提供的懊悔保證,顯示在部署時的對數錯誤增長。我們在行為健康的兩個標記數據集上評估我們的系統:AEGIS 2.0行為健康子集(N=161)和SWMH Reddit帖子的一個分層樣本(N=250)。從實證上看,我們的自適應抽樣策略在這兩個數據集中達到了最低的假陽性率,AEGIS 2.0為0.095,而單代理模型為0.159,將安全內容的錯誤標記減少了40%,並且在所有條件下仍然保持相似的假陰性率。這些結果表明,基於原則的自適應抽樣在不降低召回率的情況下,提供了精確度的有意義改善。
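The "40%" figure in this abstract follows from the two false positive rates it reports on AEGIS 2.0. A short check of the arithmetic, using only those two reported numbers (the helper name `relative_reduction` is ours):

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


single_agent_fpr = 0.159  # reported for single-agent models
adaptive_fpr = 0.095      # reported for the adaptive sampling strategy

reduction = relative_reduction(single_agent_fpr, adaptive_fpr)
print(f"{reduction:.1%}")  # about 40%
```

Note that this is a relative reduction. The absolute difference between the two rates is 0.064, or 6.4 percentage points.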

When AI Speaks, Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models

2604.22153v1 by Pruthvinath Jeripity Venkata

When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user's native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.

摘要:當你向 AI 助手詢問有關你的職業、婚姻或家庭衝突的建議時,無論你來自哪裡,它是否給出相同的答案?我們通過系統性地測試這一點,向三個領先的 AI 系統(Claude Sonnet 4.5、GPT-5.4 和 Gemini 2.5 Flash)呈現了十個真實的個人困境,這些困境是為來自 10 個國家的用戶設計的,涵蓋 5 大洲的 7 種語言(n=840 份得分回應)。我們將 AI 的建議與世界價值觀調查第七波數據進行比較,該數據測量了每個國家的人們實際相信什麼。

這三個 AI 系統一致地給出了西方風格的個人主義建議,即使對於那些優先考慮家庭、社區和權威的社會的用戶來說,這一點顯著超出了當地價值觀的預測(平均差距 +0.76,1-5 分制;t=15.65,p<0.001)。尼日利亞(+1.85)和印度(+0.82)的差距最大。日本是唯一的例外:AI 系統將日本用戶視為比調查所示的更具團體導向,這表明 AI 編碼了過時的刻板印象。Claude 和 GPT-5.4 顯示出幾乎相同的偏見程度,而 Gemini 雖然較低但仍然顯著。這些模型在機制上有所不同:Claude 在用戶的母語中更偏向集體主義;Gemini 更偏向個人主義;GPT-5.4 僅對所表明的國家身份作出反應。這些發現指向了前沿 AI 價值觀的系統性同質化。數據、代碼和評分管道已公開發布。

Recognition Without Authorization: LLMs and the Moral Order of Online Advice

2604.22143v1 by Tom van Nuenen

Large language models are increasingly used to mediate everyday interpersonal dilemmas, yet how their advisory defaults interact with the concentrated moral orders of specific communities remains poorly understood. This article compares four assistant-style LLMs with community-endorsed advice on 11,565 posts from r/relationship_advice, using the subreddit as a concentrated, vote-ratified moral formation whose prescriptive clarity makes divergence measurable. Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action. The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and therapeutic framing. The article describes this pattern as recognition without authorization: the capacity to register harm while withholding socially ratified permission for consequential action. This divergence is not incidental but structural: a portable advisory style that remains validating, risk-averse, and weakly directive across contexts. Safety alignment is one plausible contributor to this pattern, alongside training-data averaging and broader assistant design. The article argues that model divergence can be reframed from a technical error to a way of seeing what standardized assistant norms flatten when they encounter situated moral worlds.

摘要:大型語言模型越來越多地被用來調解日常的人際困境,但它們的建議默認值如何與特定社區的集中道德秩序互動,仍然不甚了解。本文比較了四種助手風格的LLM與來自r/relationship_advice的11,565篇帖子中社區認可的建議,利用該子版塊作為一種集中、經投票確認的道德形成,其規範性清晰使得偏差可度量。在各模型中,LLM識別出許多與人類評論者相同的動態,但將這種認識轉化為行動的指導授權的可能性明顯較低。在社區共識最強的地方,這一差距最為明顯:在涉及虐待或安全威脅的高共識帖子中,模型推薦退出的比例約為人類的一半,同時保持較高的保留、驗證和治療框架水平。
本文將這一模式描述為認識而不授權:有能力註冊傷害,但不授予社會認可的後果行動的許可。這一偏差並非偶然,而是結構性的:一種可攜帶的建議風格,在不同情境中保持驗證性、風險回避和弱指導。安全對齊是一個可能促成這一模式的因素,還有訓練數據的平均化和更廣泛的助手設計。本文主張,模型的偏差可以從技術錯誤重新框架為一種觀察方式,即標準化助手規範在遇到具體道德世界時所扁平化的內容。

Voice Under Revision: Large Language Models and the Normalization of Personal Narrative

2604.22142v1 by Tom van Nuenen

This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to "improve" the text or simply to "rewrite" it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.

摘要:這項研究探討大型語言模型重寫如何改變個人敘事的風格和敘事質地。它分析了300篇由三個前沿LLM在三種提示條件下重寫的個人敘事:一般改進、僅重寫和保持聲音的修訂。變化是通過13個來自計算風格學的語言標記來衡量的,包括功能詞、詞彙多樣性、單詞長度、標點符號、縮寫、第一人稱代詞和情感詞。在不同模型和提示條件下,LLM重寫產生了一種一致的風格規範化模式。功能詞、縮寫和第一人稱代詞減少,而詞彙多樣性、單詞長度和標點符號的詳細程度增加。這些變化無論提示是要求模型“改進”文本還是僅僅“重寫”它時都會發生。保持聲音的提示減少了變化的幅度,但並未消除其方向。風格計量分析顯示,重寫的文本在特徵空間中趨於一致,並且更難與其源文本匹配。額外的敘事標記顯示出從嵌入式敘述轉向距離敘述,以及從明確的因果推理轉向壓縮的抽象。研究結果表明,當代LLM對更精緻、較少情境化的語域施加了一種方向性的拉力。這對數位人文學和計算文本分析有影響,其中功能詞、代詞、縮寫和標點符號等特徵通常作為風格、聲音、作者身份和語料庫完整性的證據。因此,LLM修訂應被理解為不僅僅是表層的編輯,而是一種有意義的文本中介形式。

SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

2604.22134v1 by Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen

Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE

摘要:大型語言模型(LLMs)在教育場景中得到了廣泛的探索。我們識別出當前教育LLMs中的一個關鍵漏洞,即教學越獄,學生使用誘導答案的提示來引出解決方案,而不是提供支架式的指導。為了促進系統性研究,我們統一並形式化安全、有幫助和教學行為,並引入SHAPE,一個包含9,087對學生問題的基準,用於評估在對抗壓力下的輔導行為。我們提出了一個增強圖形的輔導管道,該管道從查詢中推斷先決概念,識別掌握差距,並通過明確的閘控在指導和解決問題之間進行生成路由。在多個LLMs上的實驗顯示,我們的方法在兩種教學越獄設置下顯著提高了安全性,同時在相同的評估協議下保持了接近上限的有用性。我們的代碼和數據可在 https://github.com/MAPS-research/SHaPE 獲得。

Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

2604.22128v1 by Aryan Sharma, Cutter Dawes, Shivam Raval

When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.

摘要:當訓練於需要理解階層結構的任務時,Transformer被發現以不同的方式表示這一階層:在殘差流的幾何形狀中,以及在維持後進先出排序的堆疊式注意模式中。 然而,這些表示是否被因果使用或僅僅是可解碼的仍然不清楚。 我們檢視這一差距,針對在Dyck語言(平衡括號序列的形式語言)上訓練的Transformer,該語言的階層真相是明確的。 通過探測和干預殘差流和注意模式,我們發現深度、距離和堆疊頂部信號都是可解碼的,但它們的因果角色卻有所不同。 具體而言,對真實堆疊頂部位置的注意進行屏蔽會導致長距離準確性急劇下降,而消融低維殘差流子空間的影響則相對較小。 這些結果擴展到模板化自然語言環境,表明即使在已知相關階層變量的受控環境中,僅僅可解碼並不意味著因果使用。
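For readers unfamiliar with the variables probed in this abstract, the sketch below computes ground-truth depth and top-of-stack position for a bracket sequence with an explicit stack. It only illustrates the hierarchical quantities themselves, not the probing or intervention methods, and the two-bracket alphabet is an assumption.

```python
from typing import List, Optional, Tuple

PAIRS = {")": "(", "]": "["}
OPENERS = set(PAIRS.values())


def dyck_annotations(seq: str) -> List[Tuple[int, str, int, Optional[int]]]:
    """Return (index, token, depth, top_of_stack_index) after reading each token."""
    stack: List[int] = []  # indices of currently unmatched opening brackets
    rows = []
    for i, ch in enumerate(seq):
        if ch in OPENERS:
            stack.append(i)
        elif ch in PAIRS:
            if not stack or seq[stack[-1]] != PAIRS[ch]:
                raise ValueError(f"mismatched bracket at position {i}")
            stack.pop()  # last-in, first-out: the most recent opener is closed first
        else:
            raise ValueError(f"unexpected symbol {ch!r}")
        depth = len(stack)
        top = stack[-1] if stack else None
        rows.append((i, ch, depth, top))
    if stack:
        raise ValueError("sequence ends with unclosed brackets")
    return rows


rows = dyck_annotations("([])[]")
for row in rows:
    print(row)
```

The top-of-stack index is the position a model would need to attend to in order to predict the correct closing bracket, which is why masking attention to it is a natural intervention.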

Where Should LoRA Go? Component-Type Placement in Hybrid Language Models

2604.22127v1 by Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures -- Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) -- fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway -- despite being the minority component -- consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.

摘要:混合語言模型將注意力與遞迴組件交錯使用,正逐漸與純粹的Transformer競爭,然而標準的LoRA實踐對適配器的應用是統一的,並未考慮到每種組件類型的不同功能角色。我們系統性地研究了在兩種混合架構中組件類型的LoRA佈局——Qwen3.5-0.8B(序列型,GatedDeltaNet + softmax注意力)和Falcon-H1-0.5B(並行型,Mamba-2 SSM + 注意力)——在三個領域上進行微調並在五個基準上進行評估。我們發現,儘管注意力路徑是少數組件,但其表現始終優於全模型適應,且可訓練參數少5-10倍。關鍵是,適應遞迴主幹在序列混合模型中是具破壞性的(在GSM8K上下降14.8個百分點),而在並行模型中則是具建設性的(上升8.6個百分點)。我們進一步記錄了一種轉移不對稱性:並行混合模型表現出正向跨任務轉移,而序列混合模型則遭遇災難性遺忘。這些結果確立了混合拓撲根本上決定了適應反應,並且組件感知的LoRA佈局是混合架構設計中必要的維度。
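To see why component-type placement changes the trainable parameter budget, recall that a rank-r LoRA adapter on a d_in x d_out weight matrix adds r * (d_in + d_out) parameters. The sketch below applies that formula to a made-up hybrid layout. The module list, dimensions, and layer counts are hypothetical and do not describe Qwen3.5-0.8B or Falcon-H1-0.5B.

```python
from typing import Iterable, List, Set, Tuple

Module = Tuple[str, int, int]  # (component type, d_in, d_out)


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """A: rank x d_in plus B: d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical hybrid: 6 attention layers (q, k, v, o projections) and
# 18 recurrent layers (one up projection and one down projection each).
HYPOTHETICAL_MODULES: List[Module] = (
    [("attention", 1024, 1024)] * 4 * 6
    + [("recurrent", 1024, 4096), ("recurrent", 4096, 1024)] * 18
)


def trainable_params(modules: Iterable[Module], rank: int, adapt: Set[str]) -> int:
    """Sum LoRA parameters over the modules whose component type is being adapted."""
    return sum(lora_param_count(i, o, rank) for kind, i, o in modules if kind in adapt)


attn_only = trainable_params(HYPOTHETICAL_MODULES, 8, {"attention"})
full_model = trainable_params(HYPOTHETICAL_MODULES, 8, {"attention", "recurrent"})
print(attn_only, full_model, round(full_model / attn_only, 2))
```

The ratio depends entirely on the assumed layout, so it should not be read as reproducing the 5-10x figure in the abstract. It only shows how adapting a minority component can shrink the adapter budget.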

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

2604.22119v1 by Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris

As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.

摘要:隨著推理能力和部署範圍的增長,大型語言模型(LLMs)獲得了從事有助於自身目標的行為的能力,我們稱之為新興戰略推理風險(ESRRs)。這些風險包括但不限於欺騙(故意誤導用戶或評估者)、評估遊戲(在安全測試中策略性地操縱表現)和獎勵駭客(利用錯誤指定的目標)。系統地理解和基準化這些風險仍然是一個未解的挑戰。為了解決這一空白,我們介紹了ESRRSim,一個基於分類法的自動化行為風險評估框架。我們構建了一個可擴展的風險分類法,包括7個類別,進一步細分為20個子類別。ESRRSim生成旨在引發真實推理的評估場景,並配備雙重評分標準,評估模型的回應和推理痕跡,採用無法官偏見和可擴展的架構。對11個推理LLM的評估顯示風險概況存在顯著變化(檢測率範圍為14.45%-72.72%),劇烈的世代改進表明模型可能越來越能識別和適應評估情境。

PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training

2604.22117v1 by Harsh Kumar, Rahul Maity, Tanmay Joshi, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.

摘要:對齊的大型語言模型(LLMs)仍然容易受到對抗性操控,而它們對網絡規模預訓練的依賴則創造了一個微妙但嚴重的攻擊面。我們研究了隱形預訓練播種(SPS),這是一種新的攻擊類別,其中對手在隱形網站上分發少量的有毒內容,通過 robots.txt 將其暴露給網絡爬蟲,從而增加這些內容被吸收到未來的訓練語料庫中的可能性,這些語料庫來自於如 Common Crawl 等來源。由於每個單獨的有效載荷都很小、分散且表面上無害,因此在數據集構建或過濾過程中很難檢測到這種攻擊。其結果是一種潛在的中毒形式:在預訓練期間嵌入的潛伏邏輯地雷,在標準評估下大多數情況下保持隱形,但可以通過精確的字母數字觸發器(如 <00TRIGGER00>)來激活,以繞過安全防護。我們將這種攻擊稱為 PermaFrost,類比於北極的永久凍土:有害物質可以長時間保持凍結、埋藏且不被注意,只有在條件允許時才會重新浮現。我們通過 PermaFrost-Attack 將這一威脅具體化,這是一個用於潛在概念中毒的控制框架,並配備了一套幾何診斷工具:熱力學長度、光譜曲率和感染追溯圖。在多個模型家族和規模中,我們顯示 SPS 廣泛有效,誘導持久的不安全行為,同時經常避開對齊防禦。我們的結果確定 SPS 是對未來基礎模型的一種實際且被低估的威脅。本文介紹了一種新穎的幾何診斷視角,用於系統性地檢查潛在模型行為,為檢測、表徵和理解可能對標準評估隱形的脆弱性提供了一個原則性基礎。

Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

2604.22109v1 by Nalin Poungpeth, Nicholas Clark, Tanu Mitra

Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLMs in persuading users, as well as the perception of models as objective and impartial.

摘要:大型語言模型(LLMs)擁有強大的說服能力,在一對一比較中超越人類。使用者報告表示,在關係、醫療環境以及尋求專業建議時,會諮詢LLMs以協助做出重大生活決策。先前的研究將說服測量為產生最有效論點或令人信服陳述的有意圖嘗試。這未能捕捉到日常人類與AI互動中的情況,使用者在這些互動中尋求資訊或建議。為了解決這一空白,我們引入了「自發性說服」,其特徵是在人們不一定需要說服的日常情境中隱性使用說服策略。我們對五個LLMs進行了審核,以揭示自發性說服在多輪對話中出現的頻率及其技術。為了模擬回應風格,我們提供了一個基於心理學、溝通學和語言學文獻的使用者回應分類法。此外,我們比較了LLMs在相同主題上產生的自發性說服與從Reddit收集的人類回應的分佈。我們發現LLMs幾乎在所有對話中都自發地說服使用者,並大量依賴基於資訊的策略,例如訴諸邏輯或定量證據。這在各模型和使用者回應風格中是一致的,但涉及心理健康的對話中,基於評價和情感的策略的使用率較高。相比之下,人類回應則傾向於使用產生社會影響的策略,如負面情感訴求和非專家證言。這一差異可能解釋了LLM在說服使用者方面的有效性,以及模型被視為客觀和公正的感知。

Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation

2604.22102v1 by Arthur Jakobsson, Abhinav Mahajan, Karthik Pullalarevu, Krishna Suresh, Yunchao Yao, Yuemin Mao, Bardienus Duisterhof, Shahram Najam Syed, Jeffrey Ichnowski

Many robotic tasks are unforgiving; a single mistake in a dynamic throw can lead to unacceptable delays or unrecoverable failure. To mitigate this, we present a novel approach that leverages learned simulation priors to inform goal-conditioned dynamic manipulation of ropes for efficient and accurate task execution. Related methods for dynamic rope manipulation either require large real-world datasets to estimate rope behavior or rely on iterative improvement over repeated attempts at the task. We introduce Wiggle and Go!, a two-stage system-identification framework that enables zero-shot rope manipulation. The framework consists of a system identification module that observes rope movement to predict descriptive physical parameters, which then inform an optimization method for goal-conditioned action prediction that the robot executes zero-shot in the real world. Our method achieves strong performance across multiple dynamic manipulation tasks, enabled by the same task-agnostic system identification module, which offers seamless switching between different manipulation tasks and allows a single model to support a diverse array of manipulation policies. We achieve 3.55 cm average accuracy on 3D target striking in the real world using rope system parameters, compared with 15.34 cm when our task model is not system-parameter-informed. We achieve a Pearson correlation coefficient of 0.95 between Fourier frequencies of the predicted and real ropes on an unseen trajectory. Project website: https://wiggleandgo.github.io/

摘要:許多機器人任務是無情的;在動態投擲中出現一次錯誤可能導致不可接受的延遲或無法恢復的失敗。為了減輕這一問題,我們提出了一種新穎的方法,利用學習的模擬先驗來指導繩索的目標條件動態操作,以實現高效和準確的任務執行。相關的動態繩索操作方法要麼需要大量的現實世界數據集來估計繩索行為,要麼需要對任務進行迭代改進以完成目標。我們介紹了 Wiggle and Go!,這是一個系統識別的兩階段框架,能夠實現零樣本任務繩索操作。該框架由一個系統識別模塊組成,該模塊觀察繩索運動以預測描述性物理參數,然後這些參數用於優化方法,以便機器人能夠在現實中執行零樣本的目標條件動作。我們的方法在多個動態操作任務中表現出色,這得益於相同的與任務無關的系統識別模塊,該模塊實現了不同操作任務之間的無縫切換,允許單一模型支持多樣化的操作策略。我們在使用繩索系統參數進行的 3D 目標打擊中達到了 3.55 cm 的平均準確度,而當我們的任務模型未受到系統參數的影響時,準確度為 15.34 cm。我們在未見的軌跡上達到了預測繩索和實際繩索的傅里葉頻率之間的皮爾森相關係數為 0.95。項目網站請參見 https://wiggleandgo.github.io/

Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation

2604.22098v1 by Weisi Liu, Guangzeng Han, Xiaolei Huang

Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data while deployed on future data where semantic distributions and domain knowledge may evolve. Unfortunately, existing studies either overlook temporal shifts or fail to capture the rich shifting patterns of both semantics and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontologies such as MeSH), and leverage shifting insights for selective retrieval-augmented learning. We evaluate KARITA on classification tasks over clinical, legal, and scientific corpora, demonstrating consistent improvements from temporal adaptation across these domains. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.

摘要:時間在模型開發和部署中引入了根本性的挑戰:模型通常是在歷史數據上訓練的,而在未來數據上部署時,語義分佈和領域知識可能會演變。遺憾的是,現有研究要麼忽視時間變化,要麼難以捕捉語義和知識的豐富變化模式。我們開發了知識驅動的增強與檢索整合時間適應(KARITA),以捕捉多樣的時間變化(例如,不確定性和特徵變化),構建和整合豐富的知識來源(例如,像MeSH這樣的醫學本體),並利用變化洞察進行選擇性檢索增強學習。我們在多個領域的分類任務上評估了KARITA,包括臨床、法律和科學語料庫,顯示出在多個領域中隨著時間適應的一致改善。我們的結果表明,知識整合在時間增強和學習中可能更為關鍵和有效。

An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation

2604.22095v1 by Mykola Trokhymovych, Yana Oliinyk, Nazarii Nyzhnyk

This paper presents a highly efficient Retrieval-Augmented Generation (RAG) system built specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. Our solution features a custom two-stage search pipeline that retrieves relevant document pages, paired with a specialized Ukrainian language model fine-tuned on synthetic data to generate accurate, grounded answers. Finally, we compress the model for lightweight deployment. Evaluated under strict computational limits, our architecture demonstrates that high-quality, verifiable AI question answering can be achieved locally on resource-constrained hardware without sacrificing accuracy.

摘要:這篇論文提出了一個高效的檢索增強生成(RAG)系統,專門為烏克蘭文檔問題回答而建,並在UNLP 2026共享任務中獲得第二名。我們的解決方案具有自定義的兩階段搜索管道,能夠檢索相關的文檔頁面,並配合一個專門針對合成數據進行微調的烏克蘭語言模型,以生成準確且有根據的答案。最後,我們對模型進行壓縮,以便輕量級部署。在嚴格的計算限制下進行評估,我們的架構展示了高質量、可驗證的AI問題回答可以在資源受限的硬體上本地實現,而不會犧牲準確性。

Ethics Testing: Proactive Identification of Generative AI System Harms

2604.22089v1 by Shin Hwei Tan, Haibo Wang, Heng Li

Generative Artificial Intelligence (GAI) systems that can automatically generate content in the form of source code or other artifacts (e.g., images) have seen increasing popularity due to the emergence of tools such as ChatGPT, which rely on Large Language Models (LLMs). Misuse of automatically generated content can incur serious consequences due to potential harms in that content. Despite the importance of ensuring the quality of automatically generated content, few if any approaches can systematically generate tests for identifying software harms in the content generated by these GAI systems. In this article, we introduce the novel concept of ethics testing, which aims to systematically generate tests for identifying software harms. Unlike existing testing methodologies (e.g., fairness testing, which aims to identify software discrimination), ethics testing aims to systematically detect software harms that could be induced by unethical behavior (e.g., harmful behavior or behavior that violates intellectual property rights) in automatically generated content. We introduce the concept of ethics testing, discuss its challenges, and conduct five case studies to show how ethics testing can be performed for generative AI systems.

摘要:生成式人工智慧(GAI)系統能夠自動生成源代碼或其他內容(例如,圖像)的內容,因為依賴於大型語言模型(LLMs)的工具如ChatGPT的出現而越來越受歡迎。自動生成內容的濫用可能會因生成內容中的潛在危害而產生嚴重後果。儘管確保自動生成內容質量的重要性不言而喻,但目前幾乎沒有系統性生成測試以識別這些GAI系統生成的內容中的軟體危害的方法。在本文中,我們介紹了一個新穎的概念——倫理測試,旨在系統性地生成測試以識別軟體危害。與現有的測試方法(例如,旨在識別軟體歧視的公平性測試)不同,倫理測試旨在系統性地檢測因不道德行為(例如,危害行為或違反智慧財產權的行為)而可能在自動生成內容中引發的軟體危害。我們介紹了倫理測試的概念,討論了其中的挑戰,並進行了五個案例研究以展示如何對生成式AI系統進行倫理測試。

Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

2604.22085v1 by Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani

The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large language model-mediated entity extraction, explicit graph schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh's Information Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval with sub-90-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.

摘要:從無狀態語言模型推理到持久的多會話自主代理的過渡顯示,記憶成為生產級代理系統部署中的主要架構瓶頸。現有的方法論在很大程度上依賴於混合語義圖架構,這在攝取和檢索過程中都會產生相當大的計算開銷。這些系統通常需要大型語言模型介導的實體提取、明確的圖架構維護和多查詢檢索管道。本文介紹了Memanto,一個通用的代理人工智慧記憶層,挑戰了當前認為知識圖複雜性是實現高保真代理記憶所必需的假設。Memanto整合了一個類型化的語義記憶架構,包括十三個預定義的記憶類別、自動衝突解決機制和時間版本控制。這些組件由Moorcheh的資訊理論搜索引擎提供支持,這是一個無索引的語義數據庫,能在低於九十毫秒的延遲內提供確定性檢索,同時消除攝取延遲。通過在LongMemEval和LoCoMo評估套件上的系統性基準測試,Memanto分別達到89.8%和87.1%的最先進準確率。這些結果超越了所有評估的混合圖和基於向量的系統,同時僅需一個檢索查詢,無攝取成本,並保持顯著較低的操作複雜性。本文呈現了一個五階段的漸進性消融研究,以量化每個架構組件的貢獻,隨後討論了對可擴展部署代理記憶系統的影響。

Removing Sandbagging in LLMs by Training with Weak Supervision

2604.22082v1 by Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar

As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can training elicit a model's best work even without reliable verification? We study this using model organisms trained to sandbag, testing elicitation techniques on problem-solving math, graduate-level science, and competitive coding tasks. We find that training with weak supervision can reliably elicit sandbagging models when supervised fine-tuning (SFT) and reinforcement learning (RL) are combined: SFT on weak demonstrations breaks the sandbagging behavior, enabling RL to then fully elicit performance. Neither method succeeds reliably alone: RL without SFT almost always leads to reward hacking rather than genuine improvement. Critically, this relies on training being indistinguishable from deployment; when models can distinguish between training and deployment, they can perform well during training while continuing to sandbag afterward. Our results provide initial evidence that training is a viable mitigation against sandbagging, while highlighting the importance of making training indistinguishable from deployment.

摘要:隨著人工智慧系統開始自動化複雜任務,監督越來越依賴較弱的模型或有限的人類監督,這無法完全驗證輸出質量。比其監督者更具能力的模型可能會利用這一差距進行沙袋行為,產出看似可接受但實際上未達其真實能力的工作。訓練能否在沒有可靠驗證的情況下引發模型的最佳表現?我們使用訓練以沙袋行為的模型生物進行研究,測試在解決數學問題、研究生級科學和競爭編碼任務上的引發技術。我們發現,當監督微調(SFT)和強化學習(RL)結合時,使用弱監督的訓練可以可靠地引發沙袋模型:在弱示範上進行的SFT打破了沙袋行為,使得RL能夠充分引發性能。單獨使用任何一種方法都無法可靠成功——沒有SFT的RL幾乎總是導致獎勵駭客行為,而不是實質性改善。關鍵在於,這依賴於訓練與部署無法區分;當模型能夠區分訓練和部署時,它們可以在訓練期間表現良好,同時在之後繼續沙袋。我們的結果提供了初步證據,表明訓練是一種可行的減輕沙袋行為的方法,同時突顯了使訓練與部署無法區分的重要性。

Sound Agentic Science Requires Adversarial Experiments

2604.22080v1 by Dionizije Fa, Marko Culjak

LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode: the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses and optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. The missing evidence is a negative space: experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.

摘要:LLM 基礎的代理正在迅速被採用於科學數據分析,自動化曾經受限於人類時間和專業知識的任務。這種能力通常被視為發現的加速,但它也加速了一種熟悉的失敗模式,即快速產生看似合理、無限可修訂的分析,這些分析容易生成,實際上將假設空間轉變為由選擇性選擇的分析支持的候選主張,並優化為可發表的正面結果。與軟體不同,科學知識並不是通過代碼的迭代積累和事後統計支持來驗證的。流暢的解釋或在單一數據集上的顯著結果並不是驗證。因為缺失的證據是一個負空間,會推翻該主張的實驗和分析從未進行或從未發表。因此,我們建議使用代理協助產生的非實驗性主張應根據先驗否證標準進行評估:代理不應主要用於構建最具說服力的敘述,而應主動尋找該主張失敗的方式。

PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning

2604.22076v1 by Xiaoyi Chen, Haoyuan Wang, Siyuan Tang, Sijia Liu, Liya Su, XiaoFeng Wang, Haixu Tang

Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.

摘要:大型語言模型(LLMs)在訓練過程中經常記住私密信息,這引發了嚴重的隱私擔憂。雖然機器遺忘已經成為一個有前景的解決方案,但其對抗隱私攻擊的真正有效性仍不明朗。為了解決這個問題,我們提出了PrivUn,一個新的評估框架,通過三層攻擊場景系統性地評估遺忘的穩健性:直接檢索、上下文學習恢復和微調恢復;並結合使用遺忘分數、關聯指標和遺忘深度評估的定量分析。我們的研究揭示了當前遺忘方法的重大弱點,並揭示了兩個關鍵發現:1)遺忘顯示出由梯度驅動的漣漪效應:與遵循語義關係的傳統遺忘(例如,知識圖譜)不同,隱私遺忘在潛在的基於梯度的關聯中傳播;2)大多數方法都存在淺層遺忘的問題,無法去除分佈在多個深層模型層中的私密信息。為了驗證這些見解,我們探索了兩種策略:利用梯度相似性的關聯感知核心集選擇,以及通過表示約束進行的多層深度干預。這些策略代表了從淺層遺忘到深層遺忘的範式轉變。

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

2604.22074v1 by Qinan Yu, Alexa Tartaglini, Peter Hase, Carlos Guestrin, Christopher Potts

Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.

摘要:強化學習從可驗證獎勵(RLVR)在思考鏈推理中的應用已成為語言模型後訓練配方的標準部分。常見的假設是,通過RLVR訓練的推理鏈可靠地代表模型如何得出其答案。在本文中,我們開發了兩個指標來批判性地檢視這一假設:推理的因果重要性(CIR),它測量推理標記對最終答案的累積影響,以及推理的充分性(SR),它測量驗證者是否能僅根據推理得出明確的答案。通過對Qwen2.5模型系列和ReasoningGym任務的實驗,我們發現:(1)雖然RLVR確實提高了任務準確性,但並未可靠地改善CIR或SR,這使得推理在模型性能中的角色受到質疑;(2)在RLVR之前進行少量的SFT可以改善低CIR和SR的情況;以及(3)即使不進行SFT,通過在基於結果的獎勵之上應用輔助CIR/SR獎勵,CIR和SR也可以得到改善。這種聯合獎勵的準確性與RLVR相匹配,同時也導致因果上重要且充分的推理。這些結果顯示,RLVR並不總是使模型以常見的方式依賴推理,但這一問題可以通過對後訓練程序進行簡單的修改來解決。

Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning

2604.22072v1 by Amine Barrak

Federated learning (FL) aggregation on serverless platforms faces a hard scalability ceiling: existing architectures (lambda-FL, LIFL) partition clients across aggregators, but every aggregator must hold the complete model gradient in memory. When gradients exceed the per-function memory limit (e.g., 10 GB on AWS Lambda), aggregation becomes infeasible regardless of tree depth or branching factor. We propose GradsSharding, which instead partitions the gradient tensor into M shards, each averaged independently by a serverless function that receives contributions from all clients. Because FedAvg averaging is element-wise, this produces bit-identical results to tree-based approaches, so model accuracy is invariant by construction. Per-function memory is bounded at O(|θ|/M), independent of client count, enabling aggregation of arbitrarily large models. We evaluate GradsSharding against lambda-FL and LIFL through HPC experiments and real AWS Lambda deployments across model sizes from 43 MB to 5 GB. Results show a cost crossover at approximately 500 MB gradient size, 2.7x cost reduction at VGG-16 scale, and that GradsSharding is the only architecture that remains deployable beyond the serverless memory ceiling.

摘要:聯邦學習(FL)在無伺服器平台上的聚合面臨著嚴峻的擴展性上限:現有架構(lambda-FL, LIFL)將客戶端分配到聚合器,但每個聚合器必須在內存中保存完整的模型梯度。當梯度超過每個函數的內存限制(例如,AWS Lambda 上的 10 GB)時,無論樹的深度或分支因子如何,聚合都變得不可行。我們提出了 GradsSharding,該方法將梯度張量劃分為 M 個片段,每個片段由接收所有客戶端貢獻的無伺服器函數獨立平均。由於 FedAvg 平均是逐元素的,這會產生與基於樹的方法位元相同的結果,因此模型準確性在結構上是保持不變的。每個函數的內存限制為 O(|θ|/M),與客戶端數量無關,這使得聚合任意大型模型成為可能。我們通過 HPC 實驗和在 AWS Lambda 上的實際部署,對 GradsSharding 與 lambda-FL 和 LIFL 進行了評估,模型大小範圍從 43 MB 到 5 GB。結果顯示,在大約 500 MB 的梯度大小時出現成本交叉,在 VGG-16 規模上成本降低 2.7 倍,並且 GradsSharding 是唯一可以在無伺服器內存上限之外保持可部署的架構。
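
The claim that sharded averaging matches whole-gradient averaging follows from FedAvg being element-wise, and a minimal pure-Python sketch can make that concrete. It is an illustration under simplifying assumptions, not the GradsSharding implementation: gradients are flat lists of floats, clients are weighted equally, and each call to `fedavg` on a slice stands in for one serverless function. The helper names `shard_bounds` and `sharded_fedavg` are hypothetical.

```python
def fedavg(gradients):
    """Element-wise mean over clients; `gradients` is a list of equal-length lists."""
    num_clients = len(gradients)
    length = len(gradients[0])
    return [sum(g[i] for g in gradients) / num_clients for i in range(length)]


def shard_bounds(length, num_shards):
    """Split range(length) into `num_shards` contiguous, near-equal (start, end) slices."""
    base, extra = divmod(length, num_shards)
    bounds, start = [], 0
    for shard in range(num_shards):
        end = start + base + (1 if shard < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def sharded_fedavg(gradients, num_shards):
    """Average each shard independently across all clients, then concatenate the results."""
    result = []
    for start, end in shard_bounds(len(gradients[0]), num_shards):
        # Each shard only ever sees a slice of every client's gradient.
        shard_inputs = [g[start:end] for g in gradients]
        result.extend(fedavg(shard_inputs))
    return result


clients = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.0, -1.0, 0.5, 0.25, 2.0],
    [0.7, 0.7, 0.7, 0.7, 0.7],
]
full = fedavg(clients)
sharded = sharded_fedavg(clients, num_shards=2)
```

Because every element is summed over the same clients in the same order in both paths, `sharded` and `full` compare equal exactly, while each shard call holds only about 1/M of each client's gradient at a time.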

Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

2604.22067v1 by Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi

Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.

摘要:精神科接診是一個連續的、高風險的信息收集過程,臨床醫生必須決定提問的內容、順序以及如何在有限的時間內解釋不完整或模糊的回答。儘管對於醫療保健中的對話式人工智慧的興趣日益增長,但在這一應用中,對話式人工智慧的基礎設施仍然有限。因此,我們將這一任務表述為一個問題選擇問題,涉及臨床上有根據的問題、已知的目標信息以及可控的患者難度。我們還基於655個臨床醫生撰寫的接診問題庫和5種不同行為條件的相應合成患者小品,介紹了一個特定任務的問題選擇基準。在我們的評估中,我們比較了隨機提問、一個臨床精神科接診表的基準,以及一個基於大型語言模型(LLM)指導的自適應政策,這涉及300次訪談會議,涵蓋四位患者和五種行為條件。在基準測試中,臨床有序的固定形式顯著優於隨機提問,而LLM指導的政策則實現了最強的整體恢復。在患者行為對現場恢復的適應性較差的情況下,適應的優勢急劇增長,尤其是在防守性簡潔的條件下。這些發現表明,對話式臨床系統的表現不僅取決於信息披露後的語言理解,還取決於系統是否能在有限的互動預算內觸及正確的主題。更廣泛地說,這一基準提供了一個受控框架,用於研究臨床結構和自適應後續如何促進互動式臨床機器學習中的信息恢復。

Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

2604.22063v1 by Shevya Pandya, Shinjini Bose, Ananya Joshi

Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for auditing the reliability of downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, a prediction that is often the first downstream AI clinical decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how sensitive LLM-based psychiatric risk assessments are to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior, such as this one, before clinical deployment.

摘要:大型語言模型(LLMs)在臨床推理和風險評估中被越來越多地使用。然而,它們在精神科等關鍵和不確定領域的解釋可靠性仍然不明。先前的研究已經識別出這些系統中的算法偏見和提示敏感性,這引發了關於上下文信息如何影響模型輸出的擔憂,但在精神科領域仍然沒有系統的方法來評估這些問題。我們提出了一種通過圍繞提示設計的影響和醫學上不重要的輸入對預測住院風險分數的影響來結構化評估的可靠性審核方法,這通常是第一個下游AI臨床決策任務。在我們的審核中,生成了一組合成患者資料(n = 50),每個資料包含15個臨床相關特徵和最多50個臨床不重要特徵,跨越四種提示重構(中立、邏輯、人類影響、臨床判斷)。我們審核了四個LLM(Gemini 2.5 Flash,LLaMa 3.3 70b,Claude Sonnet 4.6,GPT-4o mini),結果顯示,包含醫學上不重要的變量導致所有模型和提示的絕對平均預測住院風險和輸出變異性有統計學上顯著的增加,這表明隨著上下文噪音的增加,預測穩定性降低。臨床不重要特徵在許多模型-提示條件下對不穩定性產生了影響,而提示變化獨立地以模型依賴的方式影響不穩定性的軌跡。這些發現量化了基於LLM的精神科風險評估對非臨床信息的敏感性,突顯了在臨床部署之前需要對歸因穩定性和不確定性行為進行系統評估的必要性。

Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning

2604.22062v1 by Karthic Palaniappan

There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of "thinking systems". With Qwen3-VL-2B-Instruct as base model and 4 × Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75% over SymPy. I've documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: https://github.com/i-like-bfs-and-dfs/wolfram-reasoning.

摘要:世界上有7,407種語言。但是,世界上沒有的語言呢?人類是否如此狹隘,以至於不關心外星人所使用的語言?外星人也是人類!在2016年的電影《降臨》中,艾米·亞當斯飾演語言學家路易絲·班克斯博士,她通過學習以非順序句子構成的外星語言(Heptapod)來思考,獲得了超越時間和預見未來的能力。在這項工作中,我旨在探討在神經符號語言中視覺-語言概念的表徵和推理,並研究“思考系統”中分析推理能力和效率的提升。以Qwen3-VL-2B-Instruct為基礎模型,並使用4 × Nvidia H200 GPU節點,我在一個包含數學、科學和一般知識問題的視覺-語言評估數據集上實現了3.33%的準確率提升,同時將推理標記減少了75%,相較於SymPy。我已記錄面臨的計算挑戰、擴展可能性以及未來在視覺-語言模型中改善神經符號語言思考的工作。訓練和推理設置可在此找到:https://github.com/i-like-bfs-and-dfs/wolfram-reasoning。

Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching

2604.22061v1 by Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong

Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.

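Summary: Patient-trial matching must reason over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, which strains scalability, generalization, and computational efficiency. The proposed lightweight framework separates two steps: retrieval-augmented generation selects clinically relevant segments from long EHRs, and a large language model encodes those segments into representations that are reduced in dimensionality and passed to lightweight predictors. On the public benchmarks (n2c2, SIGIR, TREC 2021/2022) and the real-world multimodal Mayo Clinic dataset (MCPMD), retrieval-based selection significantly reduces computational burden while preserving clinically meaningful signals. Frozen LLMs give strong representations for structured clinical data, fine-tuning is essential for unstructured clinical narratives, and the pipeline performs comparably to end-to-end LLM approaches at substantially lower computational cost.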
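The retrieval step described above (selecting relevant EHR segments before any encoding) can be sketched in a few lines. This is only an illustration: the abstract does not say which retriever is used, so plain token overlap stands in for it here, and the function names, the toy EHR segments, and the criterion text are all hypothetical.

```python
import re

def tokens(text):
    """Lowercased alphanumeric tokens of a text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve_segments(segments, criterion, k=2):
    """Rank EHR segments by token overlap with an eligibility criterion.

    Token overlap is an assumed stand-in for the paper's retriever.
    """
    query = tokens(criterion)
    ranked = sorted(segments, key=lambda s: len(query & tokens(s)), reverse=True)
    return ranked[:k]

# Hypothetical toy record, not real patient data.
ehr = [
    "Patient reports seasonal allergies.",
    "Diagnosis: stage II non-small cell lung cancer, confirmed by biopsy.",
    "Family history of hypertension.",
    "Prior chemotherapy for lung cancer completed in 2023.",
]
top = retrieve_segments(
    ehr, "Histologically confirmed lung cancer with prior chemotherapy", k=2
)
for segment in top:
    print(segment)
```

Only the selected segments would then be passed to the encoder, which is where the reduction in input length comes from.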

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

2604.22050v1 by Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric

Transformers rely mostly on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.

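Summary: Softmax attention scales quadratically with sequence length and remains a major inference bottleneck, while prior linear or hybrid attention methods replace it uniformly across all layers and often degrade quality or require extensive retraining. LayerBoost runs a sensitivity analysis on a pretrained model, then keeps standard softmax attention in highly sensitive layers, uses linear sliding window attention in moderately sensitive layers, and removes attention entirely in low-sensitivity layers. A lightweight distillation-based healing phase that needs only 10M additional training tokens recovers performance after these changes. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, matches the base model on several benchmarks with only minor degradation on others, and significantly outperforms state-of-the-art attention linearization methods, which suits high-concurrency serving and hardware-constrained deployment.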
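The three-way layer policy can be sketched as a simple mapping from per-layer sensitivity scores to an attention strategy. The abstract does not say how sensitivity is measured or where the cut-offs lie, so the scores, the `high` and `low` thresholds, and the strategy labels below are assumptions made for illustration.

```python
from typing import Dict, List

def assign_attention_strategy(
    sensitivity: List[float],
    high: float = 0.5,  # hypothetical threshold for "highly sensitive"
    low: float = 0.1,   # hypothetical threshold for "low sensitivity"
) -> Dict[int, str]:
    """Map each layer index to one of the three strategies named in the abstract."""
    plan = {}
    for layer, score in enumerate(sensitivity):
        if score >= high:
            plan[layer] = "softmax"                # keep standard attention
        elif score >= low:
            plan[layer] = "linear_sliding_window"  # cheaper replacement
        else:
            plan[layer] = "no_attention"           # remove attention entirely
    return plan

# Hypothetical sensitivity scores for a 4-layer model.
plan = assign_attention_strategy([0.9, 0.3, 0.05, 0.6])
print(plan)
```

In practice the scores would come from the sensitivity analysis on the pretrained model, and the resulting plan would be applied before the distillation-based healing phase.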

Call-Chain-Aware LLM-Based Test Generation for Java Projects

2604.22046v1 by Guancheng Wang, Qinghua Xu, Lionel C. Briand, Zhaoqiang Guo, Kui Liu

Large language models (LLMs) have recently shown strong potential for generating project-level unit tests. However, existing state-of-the-art approaches primarily rely on execution-path information to guide prompt construction, which is often insufficient for complex software systems with rich inter-class dependencies, deep call chains, and intricate object initialization requirements. In this paper, we present CAT, a novel call-chain-aware LLM-based test generation approach that explicitly incorporates call-chain and dependency contexts into prompts through dedicated static analysis. To construct executable, semantically valid test contexts, CAT systematically models caller--callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation failures occur. We evaluate CAT on the widely used Defects4J benchmark and on four real-world GitHub projects released after the LLM's cut-off date. The results show that, across projects in Defects4J, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, while consistently achieving superior performance on post-cutoff real-world projects. An ablation study further demonstrates the importance of call-chain and dependency contexts in CAT.

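Summary: LLMs show strong potential for generating project-level unit tests, but state-of-the-art approaches guide prompt construction mainly with execution-path information, which is often insufficient for systems with rich inter-class dependencies, deep call chains, and intricate object initialization. CAT is a call-chain-aware, LLM-based test generation approach that uses dedicated static analysis to put call-chain and dependency contexts into prompts. It models caller-callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation fails. Across Defects4J projects, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, and it consistently performs better on four real-world GitHub projects released after the LLM's cut-off date. An ablation study confirms the importance of the call-chain and dependency contexts.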
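To make the idea of call-chain context concrete, the sketch below enumerates caller-to-callee chains in a small static call graph and renders them as prompt lines. It is not CAT's implementation: the graph, the method names, and the `call_chains` helper are hypothetical, and Python is used here only for brevity even though the paper targets Java projects.

```python
from typing import Dict, List

def call_chains(graph: Dict[str, List[str]], entry: str, target: str) -> List[List[str]]:
    """Enumerate acyclic call chains from entry to target by depth-first search."""
    chains: List[List[str]] = []

    def dfs(node: str, path: List[str]) -> None:
        if node == target:
            chains.append(path)
            return
        for callee in graph.get(node, []):
            if callee not in path:  # skip recursive cycles
                dfs(callee, path + [callee])

    dfs(entry, [entry])
    return chains

# Hypothetical call graph: caller -> list of callees.
graph = {
    "OrderService.checkout": ["Cart.total", "PaymentClient.charge"],
    "Cart.total": ["PriceRule.apply"],
    "PaymentClient.charge": ["PriceRule.apply"],
}
chains = call_chains(graph, "OrderService.checkout", "PriceRule.apply")
prompt_context = "\n".join(" -> ".join(chain) for chain in chains)
print(prompt_context)
```

Each chain tells the model which objects must be constructed and which intermediate calls lead to the method under test.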

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

2604.22045v1 by Ayushi Mehrotra, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi

Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.

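Summary: Feature attribution methods assign importance scores to individual input features, but most consider only marginal effects and overlook interactions in which groups of features jointly influence the output. This matters for image classification, where meaning often arises from pixel interdependencies, and existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail core interpretability axioms. H-Sets is a two-stage framework: it first detects locally interacting pairs via input Hessians and recursively merges them into semantically coherent sets, using Segment Anything (SAM) segmentation as a replaceable spatial grouping prior. It then attributes each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. The Hessian step adds compute at detection time but consistently yields sparser, more faithful saliency maps, and evaluations with VGG, ResNet, DenseNet, and MobileNet on ImageNet and CUB show more interpretable and faithful maps than existing methods.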
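The detection stage rests on a standard fact: two inputs interact locally when the mixed second derivative of the output with respect to them is non-zero. The sketch below illustrates that on a toy scalar function, estimating Hessian entries by central finite differences and merging interacting pairs with union-find. It is an assumed simplification, not the paper's method: there is no image model, no SAM prior, and no IDG-Vis attribution, and the function names and tolerance are hypothetical.

```python
from typing import Callable, List, Sequence

def mixed_partial(f: Callable, x: Sequence[float], i: int, j: int, h: float = 1e-3) -> float:
    """Central finite-difference estimate of d^2 f / (dx_i dx_j), for i != j."""
    def shifted(di: float, dj: float) -> float:
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)

def interaction_sets(f: Callable, x: Sequence[float], tol: float = 1e-4) -> List[List[int]]:
    """Merge features whose pairwise Hessian entry exceeds tol (union-find)."""
    parent = list(range(len(x)))

    def find(a: int) -> int:
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if abs(mixed_partial(f, x, i, j)) > tol:
                parent[find(i)] = find(j)

    groups = {}
    for k in range(len(x)):
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())

# Toy function: features 0 and 1 interact (product term); feature 2 acts alone.
f = lambda v: v[0] * v[1] + v[2] ** 2
sets = interaction_sets(f, [1.0, 2.0, 3.0])
print(sets)
```

For a real classifier the Hessian entries would come from automatic differentiation rather than finite differences, and only a targeted subset of pairs would be evaluated to keep the cost manageable.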

Source-Modality Monitoring in Vision-Language Models

2604.22038v1 by Etha Tianze Hua, Tian Yun, Ellie Pavlick

We define and investigate source-modality monitoring -- the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals to bind words like "image" in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.

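Summary: Source-modality monitoring is the ability of multimodal models to track and communicate which input source a piece of information came from. The authors treat it as an instance of the general binding problem and measure how far models rely on syntactic versus semantic signals to bind words such as "image" in a user prompt to specific components of the input and context (i.e., actual images). Across 11 vision-language models (VLMs) on target-modality information retrieval tasks, both kinds of signal matter, but semantic signals tend to outweigh syntactic ones when the modalities are highly distinct distributionally. The paper discusses the implications for model robustness and for increasingly multimodal agentic systems.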

EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

2604.22036v1 by Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix, Iran Roman, Bea Steers, Claudio Silva, Yuhan Shen, Zijia Lu, Shih-Po Lee, Ehsan Elhamifar

This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).

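Summary: EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction) is an egocentric medical activity dataset collected under DARPA's Perceptually-enabled Task Guidance (PTG) program, whose goal was to build virtual assistants in augmented reality headsets that help users perform complex tasks. The dataset contains 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task, most recorded with a head-mounted stereo camera with integrated audio. The medical training data is released together with an action detection challenge on eight medical tasks. From the dataset, 40 YOLO models were trained on 1.95 million labels to detect 124 medical objects. Baseline action detection results for the eight tasks across three models reach an average mAP of 0.526 for the best method, and the dataset is also suitable for action recognition, object identification and detection, error detection, and other computer vision tasks. It is available via zenodo.org (DOI: 10.5281/zenodo.19239154).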

Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning

2604.22031v1 by João Mattos, Arlei Silva

We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning-based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8 to 27 times less training time than the strongest baseline.

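Summary: Mochi is a Graph Foundation Model that tackles task unification and training efficiency with a meta-learning-based training framework. Earlier models pre-train with reconstruction objectives such as link prediction and assume the representations can be aligned with downstream tasks through a separate unification step such as class prototypes; synthetic and real-world experiments show that this procedure has limitations that directly hurt downstream performance. Mochi instead pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference. Mochi and its stronger variant Mochi++ match or exceed existing Graph Foundation Models on 25 real-world graph datasets covering node classification, link prediction, and graph classification, while needing 8 to 27 times less training time than the strongest baseline.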
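Pre-training on few-shot episodes means repeatedly drawing a small support set and a disjoint query set from a handful of classes, as a downstream few-shot evaluation would. The sampler below is a minimal sketch of that generic N-way K-shot protocol on node labels. The function name, parameters, and toy labels are hypothetical, and nothing here reflects Mochi's actual sampling or model code.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

def sample_episode(
    labels: Sequence[int], n_way: int, k_shot: int, n_query: int, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Return (support, query) node indices for one N-way K-shot episode."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Only classes with enough nodes for both support and query are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    support, query = [], []
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += picked[:k_shot]
        query += picked[k_shot:]
    return support, query

# Hypothetical toy graph: 40 nodes spread over 4 classes.
labels = [i % 4 for i in range(40)]
support, query = sample_episode(labels, n_way=3, k_shot=2, n_query=3, rng=random.Random(0))
print(len(support), len(query))
```

A training loop would embed the support nodes, predict the query labels from them, and backpropagate the query loss, so that the pre-training objective has the same shape as the evaluation task.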

Shared Lexical Task Representations Explain Behavioral Variability In LLMs

2604.22027v1 by Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau, Michael L. Littman, Stephen H. Bach, Ellie Pavlick

One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.

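Summary: A common complaint about LLMs is prompt sensitivity: whether a model performs a task or answers correctly can depend unpredictably on how the question is posed. The authors compare instruction-based prompts, which describe the task in natural language, with example-based prompts, which give in-context few-shot demonstration pairs. Despite large performance variation across prompts, the model engages shared underlying mechanisms. In particular, task-specific attention heads whose outputs literally describe the task, dubbed lexical task heads, are shared across prompting styles and trigger subsequent answer production. Behavioral variation between prompts is explained by how strongly these heads are activated, and failures are at least sometimes due to competing task representations that dilute the target task's signal.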

Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity

2604.22018v1 by Deepank Girish, Yi Hao Chan, Sukrit Gupta, Jing Xia, Jagath C. Rajapakse

Several brain foundation models (FM) have recently been proposed to predict brain disorders by modelling dynamic functional connectivity (FC). While they demonstrate remarkable model performance and zero- or few-shot generalization, the salient features identified as potential biomarkers are yet to be thoroughly evaluated. We propose RE-CONFIRM, a framework for evaluating the robustness of potential biomarker candidates elucidated by deep learning (DL) models including FMs. From experiments on five large datasets of Autism Spectrum Disorder (ASD), Attention-deficit Hyperactivity Disorder (ADHD), and Alzheimer's Disease (AD), we found that although commonly used performance metrics provide an intuitive assessment of model predictions, they are insufficient for evaluating the robustness of biomarkers identified by these models. RE-CONFIRM metrics revealed that simply finetuning FMs leads to models that fail to capture regional hubs effectively, even in disorders where hubs are known to be implicated, such as ASD and ADHD. In view of this, we propose Hub-LoRA (Low-Rank Adaptation) as a fine-tuning technique that enables FMs to not only outperform customised DL models but also produce neurobiologically faithful biomarkers supported by meta-analyses. RE-CONFIRM is generalizable and can be easily applied to ascertain the robustness of DL models trained on functional MRI datasets. Code is available at: https://github.com/SCSE-Biomedical-Computing-Group/RE-CONFIRM.

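Summary: Several brain foundation models (FMs) predict brain disorders by modelling dynamic functional connectivity (FC) and show strong performance and zero- or few-shot generalization, but the salient features they flag as potential biomarkers have not been thoroughly evaluated. RE-CONFIRM is a framework for evaluating the robustness of biomarker candidates produced by deep learning (DL) models, including FMs. Experiments on five large datasets covering Autism Spectrum Disorder (ASD), Attention-deficit Hyperactivity Disorder (ADHD), and Alzheimer's Disease (AD) show that common performance metrics are insufficient for judging biomarker robustness, and that simply fine-tuning FMs yields models that fail to capture regional hubs even in disorders where hubs are known to be implicated, such as ASD and ADHD. The proposed Hub-LoRA (Low-Rank Adaptation) fine-tuning lets FMs outperform customised DL models and produce neurobiologically faithful biomarkers supported by meta-analyses. RE-CONFIRM is generalizable to DL models trained on functional MRI datasets. Code is available at: https://github.com/SCSE-Biomedical-Computing-Group/RE-CONFIRM.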

When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation

2604.22002v1 by Anamta Khan, Ratna Kandala, Deepti, Sheza Munir, Joyojeet Pal

Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.

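Summary: Social media platforms have become primary channels for health information in the Global South. Taking gomutra (cow urine) discourse on YouTube in India as a case study, the authors present a post-facto LLM-assisted discourse analysis of 30 multilingual transcripts. Promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, producing a rhetorical register that LLMs trained predominantly on Western corpora are systematically ill-equipped to analyse. Varying prompt tone across GPT-4o, Gemini 2.5 Pro, and DeepSeek-V3.1 shows that culturally embedded health misinformation does not look like ordinary misinformation, and that this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. The findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.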

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

2604.21999v1 by Grigory Sapunov

We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested -- 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing -- no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes >70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves' recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 ("deep start," p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.

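Summary: The paper studies learned memory tokens as a computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme. Memory tokens are empirically necessary: across 3 seeds, multiple token counts, two initialization schemes, and both ACT and fixed-depth processing, no configuration without them achieves non-trivial performance. The token count has a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles), a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match), and a collapse from attention dilution at T=64. A router initialization trap causes >70% of training runs to fail: both zero-bias initialization (p ~ 0.5) and Graves' recommended positive bias (p ~ 0.73) make tokens halt after ~2 steps at initialization and settle into a shallow equilibrium (halt ~ 5-7) that the model cannot escape, while a bias of -3 ("deep start," p ~ 0.05) eliminates the failure, and ablation confirms the trap is inherent to ACT initialization. With reliable training, ACT is more consistent than fixed depth (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds), ACT with lambda warmup matches accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps, and attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at https://github.com/che-shr-cat/utm-jax.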
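The quoted halting probabilities follow from the sigmoid of the router bias, and a small calculation shows why the two common initializations stop so early. The sketch below makes three assumptions that are not in the abstract: the router emits the same probability at every step, halting occurs once the cumulative probability reaches 1 - eps with eps = 0.01 (the usual ACT rule), and the positive bias behind p ~ 0.73 is +1, which is inferred from sigmoid(1) ≈ 0.731.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def steps_until_halt(p: float, eps: float = 0.01, max_steps: int = 64) -> int:
    """Steps until the cumulative halting probability reaches 1 - eps.

    Assumes a constant per-step halting probability p, which is only a
    rough model of the router at initialization.
    """
    total = 0.0
    for step in range(1, max_steps + 1):
        total += p
        if total >= 1.0 - eps:
            return step
    return max_steps

# Biases: zero init, assumed positive bias of +1, and the "deep start" bias of -3.
for bias in (0.0, 1.0, -3.0):
    p = sigmoid(bias)
    print(f"bias={bias:+.1f}  p={p:.3f}  halts after {steps_until_halt(p)} step(s)")
```

Under these assumptions, biases of 0 and +1 both halt after 2 steps, consistent with the "~2 steps" in the abstract, whereas a bias of -3 gives p ≈ 0.047 and allows about 21 steps before halting.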