
arxiv-daily

Automated deployment @ 2026-05-19 11:00:40 Asia/Taipei

Contributions are welcome: add your topics and keywords to topic.yml. Historical data is available through the storage.

AI

Medical explainable AI

| Publish Date | Title | Authors | Homepage | Code |
|---|---|---|---|---|
| 2026-05-18 | A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders | Zegu Zhang et al. | 2605.18224v1 | null |
| 2026-05-18 | Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay Users | Alfio Ventura et al. | 2605.18036v1 | null |
| 2026-05-18 | Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling | Ziwei Wang et al. | 2605.17971v1 | null |
| 2026-05-18 | Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective | Junpeng Zhang et al. | 2605.17967v1 | null |
| 2026-05-18 | Multi-agent AI systems outperform human teams in creativity | Tiancheng Hu et al. | 2605.17885v1 | null |
| 2026-05-18 | Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science | Yingjie Zhang et al. | 2605.17746v1 | null |
| 2026-05-17 | Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models | Ethan Tang et al. | 2605.17565v1 | null |
| 2026-05-17 | Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models | Urban Širca et al. | 2605.17562v1 | null |
| 2026-05-17 | The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure | Qiqi Liu et al. | 2605.17480v1 | null |
| 2026-05-17 | An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training | Hung-Yue Suen et al. | 2605.17468v1 | null |
| 2026-05-17 | Artificial Intelligence can Recognize Whether a Job Applicant is Selling and/or Lying According to Facial Expressions and Head Movements Much More Correctly Than Human Interviewers | Hung-Yue Suen et al. | 2605.17461v1 | null |
| 2026-05-17 | Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination | Xiaolei Fang et al. | 2605.17454v1 | null |
| 2026-05-17 | CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings | Qixuan Hu et al. | 2605.17370v1 | null |
| 2026-05-17 | UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models | Hyunju Kang et al. | 2605.17285v1 | null |
| 2026-05-17 | Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability | Nisreen Albzour et al. | 2605.17236v1 | null |
| 2026-05-17 | Integration of AI in Cybersecurity: Current Trends with a Focused Look at Intrusion Detection Applications | S. Tazili et al. | 2605.17219v1 | null |
| 2026-05-16 | PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts | Khizar Hussain et al. | 2605.17028v1 | null |
| 2026-05-16 | Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings | Anthonio Oladimeji Gabriel et al. | 2605.16993v1 | null |
| 2026-05-16 | Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects | Zhentao Tan et al. | 2605.16966v1 | null |
| 2026-05-16 | From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction | Pujun Feng et al. | 2605.16927v1 | null |
| 2026-05-16 | Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models | SeungWon Seo et al. | 2605.16725v1 | null |
| 2026-05-15 | CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? | Haolin Chen et al. | 2605.16679v1 | null |
| 2026-05-15 | Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space | Valeria Ruscio et al. | 2605.16600v1 | null |
| 2026-05-15 | Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces | Arne Nix et al. | 2605.16545v1 | null |
| 2026-05-15 | Toward Template-Free Explainability for Monte Carlo Tree Search | Siqi Lu et al. | 2605.16524v1 | null |
| 2026-05-15 | Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework | Xintong Yao et al. | 2605.16516v1 | null |
| 2026-05-15 | AI-Mediated Communication Can Steer Collective Opinion | Stratis Tsirtsis et al. | 2605.16245v1 | null |
| 2026-05-15 | GenShield: Unified Detection and Artifact Correction for AI-Generated Images | Zhipei Xu et al. | 2605.16122v1 | null |
| 2026-05-15 | Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment | Till Beemelmanns et al. | 2605.16087v1 | null |
| 2026-05-15 | Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification | Mónika Farsang et al. | 2605.16048v1 | null |
| 2026-05-15 | XSearch: Explainable Code Search via Concept-to-Code Alignment | Yiming Liu et al. | 2605.16046v1 | null |
| 2026-05-15 | Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets | Kai Hidajat et al. | 2605.15787v1 | null |
| 2026-05-15 | $α$-TCAV: A Unified Framework for Testing with Concept Activation Vectors | Ekkehard Schnoor et al. | 2605.15688v1 | null |
| 2026-05-15 | Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery | Weijun Ma et al. | 2605.16458v1 | null |
| 2026-05-15 | Identifiable Token Correspondence for World Models | Youngin Kim et al. | 2605.16457v1 | null |
| 2026-05-15 | Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign | Jiahui Li et al. | 2605.16452v1 | null |
| 2026-05-15 | Process Rewards with Learned Reliability | Jinyuan Li et al. | 2605.15529v1 | null |
| 2026-05-14 | GESD: Beyond Outcome-Oriented Fairness | Gideon Popoola et al. | 2605.15295v1 | null |
| 2026-05-14 | FutureSim: Replaying World Events to Evaluate Adaptive Agents | Shashwat Goel et al. | 2605.15188v1 | null |
| 2026-05-14 | Explainable Detection of Depression Status Shifts from User Digital Traces | Loris Belcastro et al. | 2605.14995v1 | null |
| 2026-05-14 | GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation | Drewry H. Morris et al. | 2605.14968v1 | null |
| 2026-05-14 | From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement | Varad Vishwarupe et al. | 2605.14912v1 | null |
| 2026-05-14 | Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models | Senne Deproost et al. | 2605.14897v1 | null |
| 2026-05-14 | Holistic Evaluation and Failure Diagnosis of AI Agents | Netta Madvil et al. | 2605.14865v1 | null |
| 2026-05-14 | FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery | Patrick Kwon et al. | 2605.14854v2 | null |
| 2026-05-14 | How Sensitive Are Radiomic AI Models to Acquisition Parameters? | D. Gil et al. | 2605.14667v1 | null |
| 2026-05-14 | MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder | Eranga Bandara et al. | 2605.14660v1 | null |
| 2026-05-14 | Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning | Chenlu Ding et al. | 2605.14636v1 | null |
| 2026-05-14 | Optimal Pattern Detection Tree for Symbolic Rule-Based Classification | Young-Chae Hong et al. | 2605.14374v1 | null |
| 2026-05-14 | Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment | Xiaohua Wang et al. | 2605.14242v1 | null |
| 2026-05-14 | Fusion-fission forecasts when AI will shift to undesirable behavior | Neil F. Johnson et al. | 2605.14218v1 | null |
| 2026-05-13 | LLM-Based Robustness Testing of Microservice Applications: An Empirical Study | Hrushitha Goud Tigulla et al. | 2605.14202v1 | null |
| 2026-05-13 | Do Language Models Align with Brains? Prediction Scores Are Not Enough | Xiao Jia et al. | 2605.14025v1 | null |
| 2026-05-13 | Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography | Christos Chrysanthos Nikolaidis et al. | 2605.13730v1 | null |
| 2026-05-13 | RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation | Chengzhi Shen et al. | 2605.13542v1 | null |
| 2026-05-13 | VERA-MH: Validation of Ethical and Responsible AI in Mental Health | Luca Belli et al. | 2605.13318v1 | null |
| 2026-05-13 | IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation | Joy Bose et al. | 2605.13311v1 | null |
| 2026-05-13 | Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency | Ziqi Wen et al. | 2605.13047v1 | null |
| 2026-05-13 | An Agentic LLM-Based Framework for Population-Scale Mental Health Screening | Giuliano Lorenzoni et al. | 2605.13046v1 | null |
| 2026-05-13 | No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills | Ying Li et al. | 2605.13044v1 | null |
| 2026-05-13 | When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction | Vardhan Dongre et al. | 2605.12922v1 | null |
| 2026-05-13 | Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning | Siyuan Liu et al. | 2605.12906v1 | null |
| 2026-05-13 | RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems | Rohith Reddy Bellibatlu et al. | 2605.12895v1 | null |
| 2026-05-13 | A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study | Jaime Yan et al. | 2605.13905v1 | null |
| 2026-05-12 | NOVA: Fundamental Limits of Knowledge Discovery Through AI | Salman Avestimehr et al. | 2605.15219v1 | null |
| 2026-05-12 | BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics | Helene Malyutina et al. | 2605.12730v1 | null |
| 2026-05-12 | A New Technique for AI Explainability using Feature Association Map | Sayantani Ghosh et al. | 2605.12350v3 | null |
| 2026-05-12 | Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference | Toru Takahashi et al. | 2605.12255v1 | null |
| 2026-05-12 | BoolXLLM: LLM-Assisted Explainability for Boolean Models | Du Cheng et al. | 2605.12139v1 | null |
| 2026-05-12 | To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands | Fangyi Yu et al. | 2605.12120v1 | null |
| 2026-05-12 | LegalCheck: Retrieval- and Context-Augmented Generation for Drafting Municipal Legal Advice Letters | Virgill van der Meer et al. | 2605.12012v1 | null |
| 2026-05-12 | Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI | Georgios Makridis et al. | 2605.11687v1 | null |
| 2026-05-12 | Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion | ShiYing Huang et al. | 2605.11679v2 | null |
| 2026-05-12 | Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI | Georgios Makridis et al. | 2605.11595v1 | null |
| 2026-05-12 | Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer | Nilushika Udayangani et al. | 2605.11414v1 | null |
| 2026-05-12 | What Do EEG Foundation Models Capture from Human Brain Signals? | Ling Tang et al. | 2605.11410v2 | null |
| 2026-05-12 | Attributing Emergence in Million-Agent Systems | Ling Tang et al. | 2605.11404v1 | null |
| 2026-05-12 | Causal Fairness for Survival Analysis | Drago Plecko et al. | 2605.11362v1 | null |
| 2026-05-12 | Human-AI Productivity Paradoxes: Modeling the Interplay of Skill, Effort, and AI Assistance | Ali Aouad et al. | 2605.11350v1 | null |
| 2026-05-11 | The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains | Jung Min Kang et al. | 2605.11205v1 | null |
| 2026-05-11 | Interpretability Can Be Actionable | Hadas Orgad et al. | 2605.11161v1 | null |
| 2026-05-11 | ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder | Shubhankit Singh et al. | 2605.11091v1 | null |
| 2026-05-11 | Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography | Timothy Oladunni et al. | 2605.10871v2 | null |
| 2026-05-11 | New AI-Driven Tools for Enhancing Campus Well-being: A Prevention and Intervention Approach | Jinwen Tang et al. | 2605.10804v1 | null |
| 2026-05-11 | Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control | Ramesh Arvind Naagarajan et al. | 2605.10624v1 | null |
| 2026-05-11 | The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime | Phongsakon Mark Konrad et al. | 2605.10601v1 | null |
| 2026-05-11 | TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment | Jiaxuan Wang et al. | 2605.10194v1 | null |
| 2026-05-11 | A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection | Jihyeon Baek et al. | 2605.10181v1 | null |
| 2026-05-11 | Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality | Mateusz Cedro et al. | 2605.10142v1 | null |
| 2026-05-11 | Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research | Anthea Dathe et al. | 2605.10125v2 | null |
| 2026-05-11 | Explainability of Recurrent Neural Networks for Enhancing P300-based Brain-Computer Interfaces | Christian Oliva et al. | 2605.10121v1 | null |
| 2026-05-11 | An LLM-RAG Approach for Healthy Eating Index-Informed Personalized Food Recommendations | Yibin Wang et al. | 2605.15213v1 | null |
| 2026-05-11 | The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws | Eslam Zaher et al. | 2605.09887v1 | null |
| 2026-05-11 | Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI | Gideon Popoola et al. | 2605.09852v1 | null |
| 2026-05-10 | TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma | JC Wu et al. | 2605.11033v1 | null |
| 2026-05-10 | Attribution-based Explanations for Markov Decision Processes | Paul Kobialka et al. | 2605.09780v2 | null |
| 2026-05-10 | Sequential Feature Selection for Efficient Landslide Segmentation from Multi-Spectral Data | Arsalaan Ahmad et al. | 2605.09746v1 | null |
| 2026-05-10 | Medical Model Synthesis Architectures: A Case Study | Katherine M. Collins et al. | 2605.09716v1 | null |
| 2026-05-10 | DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents | Yixiong Chen et al. | 2605.09679v1 | null |
| 2026-05-10 | Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT | Zonglin Yang et al. | 2605.16373v1 | null |

Abstracts

A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders

2605.18224v1 by Zegu Zhang, Jianhua Peng, Jian Zhang

This note studies exact constant collapse in variational autoencoders, where the encoder mean becomes independent of the input. The goal is to make this specific failure mode pre-designable, monitorable during training, and certifiable after training. The prior is kept as the standard Gaussian. Given a fixed teacher posterior, we attach a fixed simplex witness head to the latent mean. The resulting teacher-student alignment loss has an exact constant-predictor baseline equal to the teacher information. If the alignment loss is below this baseline, the latent mean cannot have collapsed to an input-independent constant. The simplex witness also has a closed-form inverse: any full-support teacher posterior can be represented by embedding its centered log-odds into the latent space. This gives an explicit latent energy cost and explains when the alignment loss can be made small. A computable view gap handles the case where teacher targets are computed from a different view. Thus exact constant collapse is converted from an after-the-fact training pathology into a design-and-certificate problem.
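
The baseline argument can be illustrated with a minimal sketch. It assumes the alignment loss is the average KL divergence from the teacher posterior to the student's prediction, which the abstract does not state, so the note's actual loss may differ. Under that assumption the best input-independent prediction is the mean teacher posterior, and its loss equals the mutual information between input and teacher label. The teacher table and all function names below are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def average_loss(teacher, q):
    """Average KL from each teacher posterior to one constant prediction q."""
    return sum(kl(row, q) for row in teacher) / len(teacher)

def constant_baseline(teacher):
    """Return (baseline, mean posterior) for the best constant predictor.

    With inputs weighted uniformly, the constant q minimising the average
    KL(teacher(x) || q) is the mean teacher posterior.
    """
    n, k = len(teacher), len(teacher[0])
    mean = [sum(row[j] for row in teacher) / n for j in range(k)]
    return average_loss(teacher, mean), mean

def certifies_non_collapse(alignment_loss, baseline):
    """A loss strictly below the baseline cannot come from a constant predictor."""
    return alignment_loss < baseline

# Hypothetical teacher posteriors over 3 classes for 4 inputs (assumed data).
teacher = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.4, 0.3, 0.3],
]
baseline, mean = constant_baseline(teacher)
```

Any other constant prediction, such as the uniform distribution, gives a loss at least as large as `baseline`, which is why a measured loss below it rules out this form of collapse under the stated assumption.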


Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay Users

2605.18036v1 by Alfio Ventura, Tim Katzke, Jan Corazza, Mustafa Yalçıner

Trust calibration, the alignment of users' trust judgments with model capability, is crucial for the safe deployment of explainable AI (XAI), yet it is often evaluated via global trust ratings detached from objective performance evidence. We present a preregistered, incentivized between-subjects online study (N=418, representative UK sample) on explainable skin-lesion classification that disentangles expectation-setting from experienced performance. Participants completed 15 case evaluations using a fixed XAI panel (malignancy score, reliability score, and saliency map). We systematically manipulated five experimental onboarding conditions, varying example-based information and limitation disclosures, combined with five stimulus packages that naturally varied in observed prediction quality. Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models. Only limitation disclosure reliably affected trust calibration on case-wise measures, and short-term experience did not yield progressive calibration. Further, the experienced stimulus package explained substantially more variance than the experimental manipulation. However, participants struggled to differentiate between case-wise perceived trust, trustworthiness, and accuracy estimation. We discuss implications for designing limitation communication and for measuring and analysing calibration metrics in XAI evaluations. All study materials and data are publicly available for replication and further academic use.


Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

2605.17971v1 by Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang, Yebo Feng, Ju Jia, Ruiying Du, Cong Wu, Yang Liu

Despite rigorous safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic vulnerability in the safety mechanisms of LLMs, where safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals. Comprehensive evaluations on frontier commercial models demonstrate that Babel achieves state-of-the-art attack success rates and superior query efficiency. Specifically, compared to state-of-the-art methods, Babel increases the attack success rate on GPT-4o from 41.33% to 82.67% and on Claude-3-5-haiku from 38.33% to 78.33% within an average of 40 queries, providing a robust red-teaming methodology for LLM safety research.


Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

2605.17967v1 by Junpeng Zhang, Lei Cheng, Guoxi Zhang, Hua Cai, Qing Xu, Quanshi Zhang

This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions while rarely acquiring reliable new ones, and (2) this denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.


Multi-agent AI systems outperform human teams in creativity

2605.17885v1 by Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo, Xing Xie, Nigel Collier, David Stillwell, Luning Sun

Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.
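
As a rough illustration of treating a conversation as a path through semantic space, the sketch below computes four path statistics from a list of turn embeddings. The definitions are assumptions chosen for illustration (consecutive-turn cosine similarity for local coherence, similarity to the centroid for global coherence, mean distance to the centroid for spread); the paper's own measures may be defined differently, and the two-dimensional embeddings are toy data.

```python
import math

def cosine(u, v):
    """Cosine similarity; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def centroid(path):
    """Component-wise mean of all turn embeddings."""
    n = len(path)
    return [sum(p[i] for p in path) / n for i in range(len(path[0]))]

def path_length(path):
    """Total Euclidean distance travelled between consecutive turns."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def local_coherence(path):
    """Mean cosine similarity between consecutive turns."""
    sims = [cosine(a, b) for a, b in zip(path, path[1:])]
    return sum(sims) / len(sims)

def global_coherence(path):
    """Mean cosine similarity of each turn to the conversation centroid."""
    c = centroid(path)
    return sum(cosine(p, c) for p in path) / len(path)

def semantic_spread(path):
    """Mean Euclidean distance of each turn from the centroid."""
    c = centroid(path)
    return sum(math.dist(p, c) for p in path) / len(path)

# Toy 2-D "embeddings": one conversation stays on theme, one ranges widely.
focused = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]
wandering = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
```

In this toy data the wandering conversation has lower global coherence and higher spread than the focused one, which matches the qualitative pattern the abstract associates with more creative ideas.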


Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

2605.17746v1 by Yingjie Zhang, Chun Feng, Weizhang Zhu, Tianshu Sun

AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.


Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

2605.17565v1 by Ethan Tang

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We introduce KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, which outperforms the 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and the 4B-parameter C1-4B on a 20-theme puzzle benchmark. We examine several claims made in existing literature regarding chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best-move accuracy from 1.2% to 21.2% and move-generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, comparable to the gains achieved by ChessGPT's fine-tuning on chess-specific web corpora, at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open-source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.
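
The verifier-in-the-loop idea can be sketched as a generic generate-and-test loop. This is an illustrative outline, not the paper's implementation: `propose` stands in for an LLM call, `verify` stands in for a chess rules engine, and the canned guesses, the legal-move set, and the position string are all hypothetical.

```python
from typing import Callable, List, Optional

def generate_with_verifier(
    propose: Callable[[str, List[str]], str],
    verify: Callable[[str, str], bool],
    position: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask the generator for moves until the verifier accepts one.

    Rejected moves are passed back to the generator as feedback.
    Returns None if no proposal is accepted within max_rounds.
    """
    rejected: List[str] = []
    for _ in range(max_rounds):
        move = propose(position, rejected)
        if verify(position, move):
            return move
        rejected.append(move)
    return None

# Stub generator: replays canned guesses, skipping ones already rejected.
CANNED_GUESSES = ["Ke9", "Qh5", "Nf3"]

def stub_propose(position: str, rejected: List[str]) -> str:
    for guess in CANNED_GUESSES:
        if guess not in rejected:
            return guess
    return CANNED_GUESSES[-1]

# Stub verifier: a fixed set of legal moves for the single toy position.
LEGAL_MOVES = {"start": {"Qh5", "Nf3"}}

def stub_verify(position: str, move: str) -> bool:
    return move in LEGAL_MOVES.get(position, set())

accepted = generate_with_verifier(stub_propose, stub_verify, "start")
```

Here the first proposal is illegal and gets rejected, and the second is accepted. In a real setting the verifier would check legality or mate conditions, and the generator would receive the rejection feedback in its prompt.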


Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

2605.17562v1 by Urban Širca, Maryam Alimardani, Stefanos Zafeiriou, Konstantinos Barmpas

EEG foundation models (EEG-FMs) have been evaluated predominantly on clean, in-distribution accuracy, leaving their robustness, interpretability and representational quality largely unexamined. This study addresses these gaps by benchmarking six EEG-FMs against a baseline deep learning model across eight datasets. Beyond clean accuracy, we conduct three layers of analysis: (i) Robustness: we apply test-time perturbations including additive noise, random and region-based channel dropout and region-specific noise injection. Our analyses show that no single model dominates all failure modes. The most noise-robust model is among the most fragile under channel dropout and much of the dropout fragility disappears when channels are removed rather than zero-padded. (ii) Interpretability: we present the first application of Attention-Aware Layer-Wise Relevance Propagation (AttnLRP) to EEG-FMs and show that models broadly concentrate relevance on task-appropriate brain regions consistent with known neurophysiology. However, attribution maps remain spatially stable under perturbation while predictions degrade, suggesting that the models attend to the correct brain regions but decode corrupted content. (iii) Expressiveness: With block-wise probing we show that late blocks are repurposed during fine-tuning, while early blocks already hold task-related information. Furthermore, we demonstrate that the poor head-only performance previously attributed to low-quality pre-trained representations is largely explained by pooling and that EEG-FMs possess sufficient representational capacity when their token-level embeddings are preserved. Together, these findings provide the first systematic assessment of robustness, interpretability and expressiveness for EEG-FMs and highlight critical considerations for their development.
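
The test-time perturbations named in the robustness analysis can be sketched as follows. The recording is assumed to be a list of channels, each a list of samples, and the function names are hypothetical. The two dropout variants show the distinction the abstract draws between zero-padding a channel and removing it.

```python
import random

def add_gaussian_noise(eeg, sigma, rng):
    """Add independent zero-mean Gaussian noise to every sample."""
    return [[x + rng.gauss(0.0, sigma) for x in channel] for channel in eeg]

def dropout_zero_pad(eeg, dropped):
    """Replace dropped channels with zeros, keeping the channel count."""
    return [
        [0.0] * len(channel) if i in dropped else list(channel)
        for i, channel in enumerate(eeg)
    ]

def dropout_remove(eeg, dropped):
    """Delete dropped channels, so the model sees fewer channels."""
    return [list(channel) for i, channel in enumerate(eeg) if i not in dropped]

# Toy recording: 4 channels with 3 samples each (assumed data).
eeg = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.5, 2.5, 3.5]]
rng = random.Random(0)
dropped = {1, 3}

noisy = add_gaussian_noise(eeg, sigma=0.1, rng=rng)
padded = dropout_zero_pad(eeg, dropped)
removed = dropout_remove(eeg, dropped)
```

Zero-padding keeps the input shape but feeds the model values it never saw for that channel, whereas removal changes the shape and only works with models that accept a variable number of channels.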


The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

2605.17480v1 by Qiqi Liu, Thorsten Holz, Shilin Ye, Runhan Song

Multi-agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify semantic hijacking, an attack in which harmful requests are concealed within domain-specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a capability paradox: as Worker capability increases, the mean system-level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi-level mediation analysis on two independent datasets (47,807 interactions). This analysis shows that this paradox is driven by linguistic certainty: stronger Workers are more likely to interpret adversarial narratives as legitimate, convey their conclusions assertively, and thereby lead Managers to treat such confident endorsements as justification to execute. In our larger Worker-Only setting ($n_W$=14), certainty mediates 74% of the effect, with 95% confidence intervals (CI) excluding zero under both Monte Carlo and cluster bootstrap; the smaller Full-MAS setting ($n_W$=6) shows a directionally consistent indirect effect. Worker-side safety prompting does not reliably mitigate this failure. Building on the mediation finding, we propose heterogeneous ensemble verification, which pairs Workers of asymmetric domain competence so their complementary vulnerabilities break the certainty-to-execution chain, reducing ASR from 52.8% to 2.0% with negligible benign-task impact. Our results show that upgrading components to stronger models can actively degrade system security, and that effective defenses require exploiting, rather than eliminating, capability asymmetries between agents.
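
To show what "certainty mediates a share of the effect" means, here is a simplified single-level mediation sketch using the product-of-coefficients approach with ordinary least squares. The paper reports a multi-level analysis with Monte Carlo and cluster-bootstrap confidence intervals, which this sketch does not reproduce. The variable names and the synthetic data are assumptions for illustration only.

```python
def _centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mediation(x, m, y):
    """Decompose the total effect of x on y into direct and mediated parts.

    a: slope of m on x; b: slope of y on m holding x fixed;
    indirect = a * b; total = direct + indirect for linear OLS fits.
    """
    xc, mc, yc = _centered(x), _centered(m), _centered(y)
    sxx, smm, sxm = _dot(xc, xc), _dot(mc, mc), _dot(xc, mc)
    sxy, smy = _dot(xc, yc), _dot(mc, yc)
    det = sxx * smm - sxm ** 2           # zero if x and m are collinear
    a = sxm / sxx                         # path x -> m
    b = (sxx * smy - sxm * sxy) / det     # path m -> y, controlling for x
    direct = (smm * sxy - sxm * smy) / det
    total = sxy / sxx                     # slope of y on x alone
    indirect = a * b
    return {
        "a": a, "b": b, "direct": direct, "indirect": indirect,
        "total": total, "proportion_mediated": indirect / total,
    }

# Synthetic data: capability raises certainty, which raises attack success.
capability = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
jitter = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1]
certainty = [2.0 * c + e for c, e in zip(capability, jitter)]
attack = [3.0 * m + 1.0 * c for m, c in zip(certainty, capability)]

result = mediation(capability, certainty, attack)
```

In this synthetic example most of the total effect flows through the mediator, so `proportion_mediated` is well above one half. That is the kind of quantity the 74% figure in the abstract summarises.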

An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training

2605.17468v1 by Hung-Yue Suen, Kuo-En Hung

This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice for developing on-camera oral presentation skills at scale. The system operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) and implements a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching to support deliberate practice. Built on an XGBoost backbone, the ITS maps multimodal inputs (facial, vocal, textual, and oculomotor features) into evidence-based feedback that can be traced back to observable performance cues. Trained on 10,360 Massive Open Online Course (MOOC) video segments, the system achieved rubric-aligned scoring with performance levels comparable to expert ratings ($R^2$ = 0.48-0.61, Spearman's rho = 0.69-0.78, MAE = 0.43-0.57). In a pre-post validation study with 204 adult learners over a 30-day practice window, participants demonstrated significant improvements across all seven BARS dimensions (Cohen's d = 0.39-0.90), with practice frequency showing a strong positive association with posttest performance after controlling for baseline scores and demographics. The results demonstrate how multimodal analytic outputs can be systematically transformed into observable behavioral change through an integrated feedback architecture, advancing explainable and pedagogically grounded ITS design for performance-based competencies.
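Two of the statistics quoted in this abstract, Cohen's d and MAE, are simple enough to compute by hand. The sketch below shows one common way to do so on invented toy scores. The abstract does not say which variant of Cohen's d was used for the pre-post comparison, so the pooled-variance form here is an assumption, as are the function names.

```python
import statistics

def cohens_d(pre, post):
    """Standardised mean difference using the average of the two sample variances."""
    pooled_sd = ((statistics.variance(pre) + statistics.variance(post)) / 2) ** 0.5
    return (statistics.mean(post) - statistics.mean(pre)) / pooled_sd

def mean_absolute_error(predicted, reference):
    """Average absolute gap between model scores and reference (e.g. expert) scores."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

# Invented toy scores (NOT from the study).
pre_scores = [2, 3, 4]
post_scores = [3, 4, 5]
print(cohens_d(pre_scores, post_scores))
print(mean_absolute_error([1.5, 2.0, 2.5], [1, 2, 3]))
```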

Artificial Intelligence can Recognize Whether a Job Applicant is Selling and/or Lying According to Facial Expressions and Head Movements Much More Correctly Than Human Interviewers

2605.17461v1 by Hung-Yue Suen, Kuo-En Hung, Che-Wei Liu, Yu-Sheng Su, Han-Chih Fan

Whether an interviewee's honest and deceptive responses can be detected by facial expression signals in videos has been debated and requires further research. We developed deep learning models enabled by computer vision to extract temporal patterns of job applicants' facial expressions and head movements to identify self-reported honest and deceptive impression management (IM) tactics from video frames in real asynchronous video interviews. A 12- to 15-minute video was recorded for each of N=121 job applicants as they answered five structured behavioral interview questions. Each applicant completed a survey to self-evaluate their trustworthiness on four IM measures. Additionally, a field experiment was conducted to compare the concurrent validity associated with self-reported IMs between our modeling approach and human interviewers. Human interviewers' performance in predicting these IM measures from another subset of 30 videos was obtained by having N=30 human interviewers evaluate three recordings. Our models explained 91% and 84% of the variance in honest and deceptive IMs, respectively, and showed stronger correlations with self-reported IM scores than human interviewers.

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

2605.17454v1 by Xiaolei Fang, Peilan Xu, Wenjian Luo

Multi-party multi-objective optimization problems (MPMOPs) require consensus among autonomous decision makers and therefore differ from flattened many-objective formulations. Existing runtime theory for multi-objective evolutionary algorithms is largely tailored to single-party Pareto-front approximation and does not directly explain common-solution search in MPMOPs. We investigate cross-party recombination in two representative settings. On MP-JCG, a pseudo-Boolean benchmark with an explicit gap region, we prove that a payoff-guided mutation baseline faces a gap-crossing bottleneck requiring $\Theta(n^2)$ expected fitness evaluations. In contrast, an analytical CPR-NSGA-II variant discovers both common Pareto-optimal solutions in $O(n \log n)$ expected evaluations by directly assembling complementary prefix and suffix templates distributed across party populations. Comparing this with the flattened four-objective formulation F-JCG, our full-front coverage analysis illustrates the additional coverage burden introduced by flattening. For BPBOMST, the bi-party, two-objective-per-party specialization of the multi-party multi-objective minimum spanning tree problem, we develop a layered support-cover analysis. For each common Pareto objective vector, the symmetric average projection induces an auxiliary bi-objective MST instance, and suitable support representatives yield a $2\lambda$-common approximation cover with $\lambda \in [1,2]$. We further derive an instance-parameterized expected runtime bound for a representative-pool CPR-NSGA-II variant using edge-union recombination and uniform repair. This bound separates the effects of local auxiliary-front filling, cross-party recombination shortcuts, and edge-union repair ambiguity.

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

2605.17370v1 by Qixuan Hu, Shuchang Ye, Xumou Zhang, Anastasia Serafimovska, Anastasia Suraev, Amit Saha, Ping-hsiu Lin, Sydney Su, Usman Naseem, Adam G. Dunn, Jinman Kim

Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models

2605.17285v1 by Hyunju Kang, Geonhee Han, Hogun Park

Node representation learning, such as Graph Neural Networks (GNNs), has emerged as a pivotal method in machine learning. The demand for reliable explanation generation surges, yet unsupervised models remain underexplored. To bridge this gap, we introduce a method for generating counterfactual (CF) explanations in unsupervised node representation learning. We identify the most important subgraphs that cause a significant change in the k-nearest neighbors of a node of interest in the learned embedding space upon perturbation. The k-nearest neighbor-based CF explanation method provides simple, yet pivotal, information for understanding unsupervised downstream tasks, such as top-k link prediction and clustering. Consequently, we introduce UNR-Explainer for generating expressive CF explanations for Unsupervised Node Representation learning methods based on a Monte Carlo Tree Search (MCTS). The proposed method demonstrates superior performance on diverse datasets for unsupervised GraphSAGE and DGI.
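The quantity this abstract builds on, how much a node's k nearest neighbours change in embedding space after a perturbation, can be sketched in a few lines. The example below only scores a given perturbation. It does not perform the MCTS subgraph search, and the "after" embeddings are set by hand rather than produced by re-encoding a perturbed graph. The function names and toy coordinates are assumptions for illustration.

```python
import math

def k_nearest(embeddings, node, k):
    """Return the set of the k nodes closest to `node` (Euclidean distance)."""
    others = [n for n in embeddings if n != node]
    # Ties are broken by node id so the result is deterministic.
    others.sort(key=lambda n: (math.dist(embeddings[node], embeddings[n]), n))
    return set(others[:k])

def neighbour_change(before, after, node, k):
    """Fraction of the node's original k nearest neighbours lost after perturbation."""
    kept = k_nearest(before, node, k) & k_nearest(after, node, k)
    return 1.0 - len(kept) / k

# Invented 2-D embeddings; node 0 is the node of interest.
before = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0), 3: (5.0, 5.0), 4: (6.0, 5.0)}
# Hand-set stand-in for "embedding of node 0 after removing a candidate subgraph".
after = {**before, 0: (5.0, 4.0)}

print(neighbour_change(before, after, node=0, k=2))
```

A score near 1 means the perturbation replaced almost all of the node's neighbours, which is the kind of change a counterfactual explanation would look for.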

Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

2605.17236v1 by Nisreen Albzour, Sarah S. Lam

Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which resulted in improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was utilized to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, with random horizontal flipping and class weighting (0.7 x 1.3) identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, that align with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, fulfilling both the clinical performance and transparency requirements essential for medical AI deployment.
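The two ingredients this abstract singles out, horizontal flipping and class weighting, can be illustrated without any deep-learning library. The sketch below is a minimal, assumed formulation. The abstract gives the weights 0.7 and 1.3 but does not say which class receives which, so assigning 1.3 to the minority "normal" class is a guess, and normalising by total weight is one convention among several.

```python
import math

def weighted_cross_entropy(probabilities, labels, class_weights):
    """Weighted negative log-likelihood, normalised by the total weight of the batch."""
    weighted_losses = [
        -class_weights[y] * math.log(p[y]) for p, y in zip(probabilities, labels)
    ]
    total_weight = sum(class_weights[y] for y in labels)
    return sum(weighted_losses) / total_weight

def horizontal_flip(image):
    """Mirror a 2-D image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

# ASSUMPTION: index 0 = normal (minority, weight 1.3), index 1 = abnormal (weight 0.7).
CLASS_WEIGHTS = [1.3, 0.7]
batch_probs = [[0.9, 0.1], [0.4, 0.6]]  # invented model outputs
batch_labels = [0, 1]

print(weighted_cross_entropy(batch_probs, batch_labels, CLASS_WEIGHTS))
print(horizontal_flip([[1, 2, 3], [4, 5, 6]]))
```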

2605.17219v1 by S. Tazili, A. Mansour, M. Y. Chkouri

Artificial Intelligence (AI) is widely adopted today for its ability to detect patterns, automate tasks, and reduce time and cost across various applications. Its integration into Cybersecurity has garnered significant attention, particularly in areas such as intrusion detection, malware analysis, and phishing or spam detection. As AI and cybersecurity evolve, new methods and approaches emerge regularly. Current trends include the use of Generative AI, Natural Language Processing, Federated Learning for privacy-preserving collaborative training, and eXplainable AI to ensure interpretability and trust, which are vital in cybersecurity. This paper presents an interesting review of current AI-based cybersecurity trends, focusing on intrusion detection approaches and aiming to uncover meaningful insights through comparative analysis based on the employed AI techniques and reported performance.

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

2605.17028v1 by Khizar Hussain, Murat Kantarcioglu

Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call TxTemb exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning twenty-two detection methods, twelve open-source models spanning six architectural families, and six corpora. We further introduce DRIFT, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field's reported progress on hallucination detection is substantially explained by benchmark construction artifacts in widely used corpora, and that the majority of established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.
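The artifact described in this abstract is easy to picture: if the prompt already contains the ground-truth answer, a detector can succeed by comparing text alone. The sketch below is a hypothetical token-overlap detector that illustrates that shortcut. It is not the paper's TxTemb baseline, whose exact similarity measure is not given in the abstract, and the function names, threshold, and example prompt are all assumptions.

```python
import re

def tokens(text):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def leak_score(prompt, answer):
    """Fraction of answer tokens that already appear in the prompt."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    prompt_vocabulary = set(tokens(prompt))
    return sum(t in prompt_vocabulary for t in answer_tokens) / len(answer_tokens)

def predict_hallucinated(prompt, answer, threshold=0.5):
    """Flag an answer as hallucinated when little of it can be found in the prompt."""
    return leak_score(prompt, answer) < threshold

# Invented example in which the prompt leaks the correct answer.
prompt = "Context: The capital of France is Paris. Question: What is the capital of France?"
print(predict_hallucinated(prompt, "Paris"))
print(predict_hallucinated(prompt, "Lyon"))
```

Nothing in this detector looks at the model, which is the point: on a corpus with leaked answers it can score well while measuring only benchmark construction.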

Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

2605.16993v1 by Anthonio Oladimeji Gabriel, Ahmad Rufai Yusuf

Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.
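The attack in this abstract perturbs each input value by at most epsilon in the direction that increases the loss. A minimal sketch of that idea on a two-feature logistic model follows. It assumes the signed-gradient (L-infinity) form of the Fast Gradient Method, which the abstract does not specify. The weights, input, and the epsilon of 0.1 are invented for the toy case and are unrelated to the paper's DenseNet121 setup or its epsilon of 0.021.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fgm_perturb(weights, bias, x, label, epsilon):
    """One signed-gradient step that increases the log-loss of a logistic model."""
    p = predict_proba(weights, bias, x)
    # For binary cross-entropy with a logistic model, d(loss)/dx = (p - y) * w.
    gradient = [(p - label) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in gradient]
    return [xi + epsilon * s for xi, s in zip(x, sign)]

# Invented toy model and input (NOT the paper's network or data).
WEIGHTS, BIAS = [2.0, -1.0], 0.0
clean = [0.10, 0.05]
adversarial = fgm_perturb(WEIGHTS, BIAS, clean, label=1, epsilon=0.1)

print(predict_proba(WEIGHTS, BIAS, clean) > 0.5)        # clean input: positive class
print(predict_proba(WEIGHTS, BIAS, adversarial) > 0.5)  # perturbed input: class flips
```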

Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

2605.16966v1 by Zhentao Tan, Yuze Hao, Boyi Zou, Mingsheng Long, Yi Yang, Gang Bao

Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research due to its broad significance across a wide range of real-world applications. Inverse PDE problems arise across medical imaging, geophysics, materials science, and aerodynamics, where the goal is to infer hidden causes, design structures, or control physical states. In this paper, we provide a comprehensive review of recent advances in solving inverse PDE problems using artificial intelligence (AI). We first introduce the basic formulation, key challenges, and traditional numerical foundations of inverse PDE problems, and then organize these problems into three major categories: inverse problems, inverse design, and control problems. For each category, we further present the methodological paradigms and review representative state-of-the-art approaches from recent years. We then summarize representative applications across scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to provide the first unified and systematic perspective on AI for inverse PDE problems, demonstrating how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.

From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

2605.16927v1 by Pujun Feng, Xiaoyu Guo, Seyed Ehsan Saffari, Min Hun Lee, Siew-Kei Lam, Erik Cambria, Xibin Sun, Yangtao Zhou, Tong Yang, Xiaoyu Zhang, Tao Tan, Yue Sun, Bin Cui

Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment-confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI: methods estimating patient-specific longitudinal disease evolution and assessing trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

2605.16725v1 by SeungWon Seo, DongHeun Han, SeongRae Noh, HyeongYeop Kang

Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment's transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world-model learning under prior misalignment, where an agent must induce state-dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior-misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words. Experiments show that Alice substantially improves executable world-model learning under prior misalignment, and ablations show that both class refinement and class-aware exploration contribute.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

2605.16679v1 by Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density (decisions must be grounded in a large library of medical, insurance, and operational rules); multi-role composition (a single task requires the agent to play multiple roles with handoffs); and multilateral interaction (intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach). We introduce $χ$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and writing the role's artifacts, guided by a managed-care operations handbook skill of 1,290+ documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
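The abstract reports a "strict pass^3" score without defining it. Under the common reading of pass^k, where a task counts only if k independent trials all succeed, the metric could be estimated as in the sketch below. That reading, the estimator, the function name, and the toy outcomes are assumptions rather than the benchmark's documented scoring code.

```python
from math import comb

def pass_hat_k(trial_outcomes, k):
    """Average over tasks of the chance that k trials drawn without replacement all pass."""
    per_task = []
    for outcomes in trial_outcomes.values():
        n, c = len(outcomes), sum(outcomes)
        # comb(c, k) is 0 when fewer than k trials succeeded.
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)

# Invented outcomes: three trials per task, True means the task was resolved.
outcomes = {
    "task_1": [True, True, True],
    "task_2": [True, False, True],
    "task_3": [False, False, False],
    "task_4": [True, True, True],
}

print(pass_hat_k(outcomes, 1))  # average single-trial success rate
print(pass_hat_k(outcomes, 3))  # only tasks solved in all three trials count
```

With three trials per task and k = 3 this reduces to the fraction of tasks solved every time, which is why the strict score sits well below the single-run resolution rate.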

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

2605.16600v1 by Valeria Ruscio, Eli-Shaoul Khedouri, Keiran Thompson

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $δ_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $δ_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.
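The outer-product argument in this abstract can be checked on a toy case. If every input activation $a_t$ lies along a single direction $u$ (the extreme form of a spiked covariance), then the accumulated update $\sum_t δ_t a_t^\top$ acts only along $u$, whatever the upstream gradients $δ_t$ are. The sketch below uses small integer vectors invented for illustration. It is not the paper's probe, and the helper names are assumptions.

```python
def outer(delta, activation):
    """Outer product delta a^T as a list of rows."""
    return [[d * a for a in activation] for d in delta]

def add(matrix_a, matrix_b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(matrix_a, matrix_b)]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def accumulated_update(deltas, activations):
    """Sum of outer products delta_t a_t^T, the shape of a gradient update to W."""
    rows, cols = len(deltas[0]), len(activations[0])
    update = [[0] * cols for _ in range(rows)]
    for delta, activation in zip(deltas, activations):
        update = add(update, outer(delta, activation))
    return update

# Toy case: every input activation is a multiple of one direction u.
u = [1, 2, 2]
activations = [[scale * c for c in u] for scale in (1, -3, 2)]
deltas = [[4, -1], [1, 5], [-2, 7]]  # arbitrary upstream gradients
update = accumulated_update(deltas, activations)

orthogonal_to_u = [2, -1, 0]  # dot product with u is zero
print(matvec(update, orthogonal_to_u))  # [0, 0]: the update ignores this direction
print(matvec(update, u))                # non-zero: the update acts along u
```

Changing `deltas` changes how strongly the update acts along `u` but never makes it respond to directions orthogonal to `u`, which mirrors the claim that input-side concentration does not depend on the objective.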

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

2605.16545v1 by Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

2605.16524v1 by Siqi Lu, Mirsaleh Bahavarnia, Hiba Baroud, Yixuan Zhang, Hemant Purohit, Ayan Mukhopadhyay

Probabilistic search algorithms, such as Monte Carlo Tree Search (MCTS), have proven very effective in solving sequential decision-making tasks under uncertainty. However, interpreting asymmetric search trees that incorporate bandit-based tree traversal and simulation-based value estimation is difficult for end users based solely on raw tree statistics. While prior work requires hand-crafted formal logic constraints that must be updated when the problem changes, we present a framework that enables large language models (LLMs) to generate evidence-grounded explanations of MCTS decisions from recorded search traces in an end-to-end manner. Our framework maps natural-language questions to a structured set of intent categories, determines whether the existing tree contains sufficient evidence, triggers targeted expansion when needed, and generates explanations using tree statistics such as visit counts, value estimates, and risk information. Experimental results provide the first evidence that LLMs can serve as end-to-end explainers for probabilistic search, without requiring intermediate formal representations.

Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework

2605.16516v1 by Xintong Yao

Long-term interaction with LLM-based systems may produce alignment drift: a gradual process in which system outputs become less constrained by the user's current message and more shaped by prior interaction history, while still appearing helpful, coherent, and responsive. This process is difficult to detect because the user's subjective experience may improve as the system becomes more familiar, useful, and attuned. Existing research on human-LLM interaction has largely focused on short-term task performance, isolated outputs, or single-instance alignment problems, leaving slow and cumulative interaction-level dynamics undercharacterized. This paper proposes a mechanism-oriented framework for describing alignment drift. The framework defines the distinction between signal A and signal B, explains how drift develops through feedback loops and sub-pattern selection, divides the process into three interactional regimes, and identifies boundary conditions for controlling drift. By framing alignment drift as a recursive interactional process rather than an isolated model-side failure, the paper provides a conceptual basis for studying long-term human-system interaction.

AI-Mediated Communication Can Steer Collective Opinion

2605.16245v1 by Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter

Generative artificial intelligence (AI) is increasingly integrated into the online platforms where humans exchange opinions; large language models (LLMs) now polish users' posts on LinkedIn and provide context for content shared on X. While prior work has shown that AI can express biased opinions and shape individuals' opinions during human-AI interactions, less attention has been paid to its influence on collective opinion formation when mediating human-to-human communication. We address this gap via a combination of empirical and theoretical analyses. We show empirically that LLMs from multiple popular families introduce directional biases when instructed to edit human-written texts on contested topics, for example, nudging texts in favor of gun control and against atheism. Building on this observation, we introduce a mathematical model of opinion dynamics in which an AI system sits between users on a social network, transforming the opinions they express and perceive. By analytically characterizing the equilibrium of this model and performing simulations on real social network data, we show that biases introduced by AI in human-to-human communication can be amplified through the network and shift collective opinion in their direction. In light of these findings, we investigate whether such biases are controllable by online platforms. We audit the "Explain this post" feature on X and find evidence of pro-life bias in Grok's outputs on abortion-related content, which we trace back to specific design choices. We conclude with a discussion of the broader implications of our findings in relation to ongoing legislative efforts in the European Union.

GenShield: Unified Detection and Artifact Correction for AI-Generated Images

2605.16122v1 by Zhipei Xu, Xuanyu Zhang, Youmin Xu, Qing Huang, Shen Chen, Taiping Yao, Shouhong Ding, Jian Zhang

Diffusion-based image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore a realistic appearance remains largely underexplored. Moreover, few existing works have established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain-of-Thought based curriculum learning strategy that enables self-explained, multi-step "diagnose-then-repair" correction with an explicit stopping criterion. A high-quality dataset with large-scale "artifact-restored" pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state-of-the-art performance and strong generalization of our method. The code is available at https://github.com/zhipeixu/GenShield.

Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

2605.16087v1 by Till Beemelmanns, Shayan Sharifi, Manas Mehrotra, Ayushman Choudhuri, Lutz Eckstein

Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust and integrates faithful explainability and calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .

Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification

2605.16048v1 by Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu

State Space Models (SSMs) are inherently recurrent along the sequence dimension, yet depth-recurrence (reusing the same block repeatedly across layers, as recently applied in looped transformers) has not been explored in this model family. We show that a looped SSM with $k$ parameters iterated $L$ times consistently matches closely or outperforms a standard SSM with $k \cdot L$ independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, despite operating within a strictly smaller hypothesis space, as we formally establish. Since the larger model contains the looped model as a special case, this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results demonstrate that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking the joint feature-time dimension for high-dimensional ones, yields accuracy gains of 1-6% across all models, confirmed over 5 random seeds. Both techniques provide standalone improvements that compound when combined, suggesting that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.
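
The two ideas in this abstract, depth-recurrence and timestep concatenation, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: `toy_block` is a hypothetical stand-in for an SSM block (a leaky running sum), and the function names and the choice to drop a trailing remainder when chunking are assumptions made for illustration.

```python
from typing import Callable, List

Sequence = List[List[float]]  # shape (T, d): T timesteps, d features each
Block = Callable[[Sequence], Sequence]


def looped_stack(block: Block, x: Sequence, num_loops: int) -> Sequence:
    """Depth-recurrence: apply ONE shared block num_loops times."""
    for _ in range(num_loops):
        x = block(x)
    return x


def standard_stack(blocks: List[Block], x: Sequence) -> Sequence:
    """Standard depth: apply a list of (possibly different) blocks in order."""
    for block in blocks:
        x = block(x)
    return x


def concat_timesteps(x: Sequence, chunk: int) -> Sequence:
    """Reshape (T, d) -> (T // chunk, chunk * d) by concatenating
    consecutive timesteps. Assumption: a trailing remainder is dropped."""
    usable = len(x) - len(x) % chunk
    return [
        [value for step in x[i:i + chunk] for value in step]
        for i in range(0, usable, chunk)
    ]


def toy_block(x: Sequence) -> Sequence:
    """Hypothetical stand-in for an SSM block: a leaky running sum over time."""
    state = [0.0] * len(x[0])
    out = []
    for step in x:
        state = [0.5 * s + v for s, v in zip(state, step)]
        out.append(list(state))
    return out


signal = [[1.0], [2.0], [3.0], [4.0], [5.0]]
looped = looped_stack(toy_block, signal, num_loops=3)
tied = standard_stack([toy_block] * 3, signal)
reshaped = concat_timesteps(signal, chunk=2)
```

Here `looped` and `tied` are identical, which mirrors the abstract's remark that the standard model contains the looped model as a special case (all layers sharing the same weights).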

XSearch: Explainable Code Search via Concept-to-Code Alignment

2605.16046v1 by Yiming Liu, Ruofan Liu, Yun Lin, Zicong Zhang, Weiyu Kong, Pengnian Qi, Xiao Cheng, Weinan Zhang, Qianxiang Wang, Linpeng Huang

Semantic code search has been widely adopted in both academia and industry. These approaches embed natural-language queries and code snippets into a shared embedding space and retrieve results based on vector similarity. Despite strong performance on benchmark datasets, they often suffer from poor explainability and generalization. Retrieved code may appear semantically similar yet miss critical functional requirements of the query, while providing no explanation of why the result was retrieved. Moreover, such failures become more severe under distribution shift, where models struggle to generalize to unseen benchmarks. In this work, we propose XSearch, an intrinsically explainable code search framework. Our key insight is that by relying on global embedding similarity, existing retrievers inherently take an inductive view. They learn statistical patterns rather than truly understanding the query's functional requirements. We address this problem by reformulating code search as a deductive concept alignment problem. XSearch (i) identifies functional concepts in the query and (ii) explicitly aligns them with corresponding code statements. This explain-then-predict design produces inherent concept-level explanations and mitigates shortcut learning that harms out-of-distribution generalization. We train an encoder with explicit concept-alignment objectives and perform retrieval through explicit matching between query concepts and code statements. Experiments show that, trained on CodeSearchNet using GraphCodeBERT (125M parameters), XSearch improves performance on out-of-distribution benchmarks from 0.02 to 0.33 (15x) over eight state-of-the-art retrievers, and consistently outperforms both encoder- and decoder-based baselines with up to 7B parameters. A user study demonstrates that concept-alignment explanations enable users to evaluate retrieved results faster and more accurately.

Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

2605.15787v1 by Kai Hidajat, Solden Stoll, Joseph An

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.

$α$-TCAV: A Unified Framework for Testing with Concept Activation Vectors

2605.15688v1 by Ekkehard Schnoor, Jawher Said, Malik Tiomoko, Wojciech Samek, Alexander Jung

Concept Activation Vectors (CAVs) are a fundamental tool for concept-based explainability in deep learning, yet their practical utility is limited by statistical instability. We analyze the stochastic nature of CAVs and the Testing with CAVs (TCAV) method, deriving the distributions of major CAV classes including PatternCAV, FastCAV, and ridge regression-based CAVs. We then identify a fundamental flaw in the standard TCAV score: its reliance on a discontinuous indicator function induces non-decaying variance in critical regimes. To address this, we introduce $α$-TCAV, a generalized framework that replaces the indicator with a parameterized smooth function, yielding a unified probabilistic formulation that subsumes both TCAV and Multi-TCAV. We characterize the induced distributions of sensitivity scores and different TCAV variants, showing that established state-of-the-art choices lack theoretical justification. We provide principled guidance on tuning the parameter in $α$-TCAV: either to imitate Multi-TCAV at substantially lower computational cost, or to obtain a calibrated Bayes-optimal probabilistic measure of a concept's influence. Finally, our analysis yields practical recommendations that challenge established routines: most notably, allocating the full sampling budget to a single CAV rather than splitting it across several.
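
To make the "discontinuous indicator" point concrete, the sketch below contrasts the standard TCAV score (the fraction of inputs whose concept sensitivity is positive) with a smoothed variant. The abstract does not give the smooth function used in $α$-TCAV, so the logistic sigmoid with sharpness `alpha` is purely an assumption for illustration, and the function names are hypothetical.

```python
import math
from typing import List


def tcav_score(sensitivities: List[float]) -> float:
    """Standard TCAV score: fraction of inputs with positive sensitivity.
    Uses a hard indicator, so a tiny change near zero flips a whole vote."""
    return sum(1 for s in sensitivities if s > 0) / len(sensitivities)


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smooth_tcav_score(sensitivities: List[float], alpha: float) -> float:
    """ASSUMED smooth surrogate: replace the indicator 1[s > 0] with
    sigmoid(alpha * s). This is NOT the paper's definition of alpha-TCAV."""
    return sum(_sigmoid(alpha * s) for s in sensitivities) / len(sensitivities)


sens = [0.1, -0.2, 0.3]
hard = tcav_score(sens)
nearly_hard = smooth_tcav_score(sens, alpha=1000.0)
flat = smooth_tcav_score(sens, alpha=0.0)
```

Under this assumed surrogate, a large `alpha` approaches the hard indicator, while `alpha = 0` ignores the sensitivities entirely and returns 0.5.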

Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery

2605.16458v1 by Weijun Ma

Image restoration models are increasingly applied to degraded medical scans, but in safety-sensitive settings they must improve image quality without uncontrolled modification of clinically important regions. This is especially relevant for intracranial CT and CT angiography (CTA), where small vessels and aneurysm-relevant cues lie near high-contrast anatomical boundaries. We frame medical image restoration as a conservative AI problem and present a residual-bounded 2.5D restoration framework trained on synthetically degraded CT/CTA inputs. The model adds a learned residual to the original center slice through an edit-control map that limits the magnitude and spatial extent of modification. We evaluate the framework using an aneurysm-relevant image-recovery matrix, paired comparison against a Gaussian baseline, Monte Carlo stability testing, anatomical localization of meaningful edits, and external evaluation on low-dose CT. On 50 out-of-distribution CT-CTA cases, the bounded model achieved a mean target gain of 0.0635, a mean PSNR of 37.51 dB, and an iatrogenic-edit rate of 4.0%. Across 1,000 Monte Carlo runs, it remained net positive in 85.4% of runs with no stably negative cases. On external low-dose CT, the model was directionally beneficial and produced a substantially smaller modification footprint than the baseline. Meaningful edits concentrated in brain and skull regions while unrelated anatomy showed negligible change. These findings provide preliminary computational evidence that residual-bounded restoration is feasible in boundary-sensitive vascular imaging, but they do not establish clinical diagnostic performance and require expert review and prospective validation before clinical use.

Identifiable Token Correspondence for World Models

2605.16457v1 by Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

Transformer-based world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without explicitly modeling correspondence between tokens across time. We formulate next-frame prediction as a structured probabilistic inference problem with latent token correspondence variables, deriving a model in which each next-frame token is explained either by copying a token from the previous frame or by generating a new token. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

2605.16452v1 by Jiahui Li, Yida Zhang, Zixuan Zeng, Jiayu Chen, Yingjian Song, Yin Xiao, Nishan Dong, Junjie Lu, Younghoon Kwon, Xiang Zhang, Jin Lu, Wenzhan Song, Fei Dou

Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model's self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities (ECG, PPG, BCG, and BSG) spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.

Process Rewards with Learned Reliability

2605.15529v1 by Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai, Yuyi Yang, Wenxuan Zhang, Jiaxin Huang

Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
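
The Beta-Binomial likelihood mentioned above has a standard closed form, sketched below with only the Python standard library. How BetaPRM parameterizes or trains its Beta belief is not stated in the abstract; the function names are hypothetical, and reading the Beta variance as a (lack of) reliability signal is an assumption made for illustration.

```python
import math


def log_beta_fn(a: float, b: float) -> float:
    """log B(a, b) via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_log_likelihood(k: int, n: int, alpha: float, beta: float) -> float:
    """log P(k successes out of n continuations | Beta(alpha, beta) belief):
    log C(n, k) + log B(k + alpha, n - k + beta) - log B(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta_fn(k + alpha, n - k + beta) - log_beta_fn(alpha, beta)


def beta_mean(alpha: float, beta: float) -> float:
    """Expected step-success probability under the belief."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Spread of the belief; ASSUMED here as an inverse proxy for reliability."""
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1.0))


# Two beliefs with the same mean (0.75) but different concentration.
confident = (30.0, 10.0)
uncertain = (3.0, 1.0)
ll_confident = beta_binomial_log_likelihood(6, 8, *confident)
ll_uncertain = beta_binomial_log_likelihood(6, 8, *uncertain)
```

A point-target regression would treat both beliefs identically because their means match; the likelihood and the variance distinguish them.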

GESD: Beyond Outcome-Oriented Fairness

2605.15295v1 by Gideon Popoola, John Sheppard

Machine learning (ML) algorithms are increasingly deployed in high-stakes decision-making domains such as loan approvals, hiring, and recidivism predictions. While existing fairness metrics (e.g., statistical parity, equal opportunity) effectively quantify outcome-oriented disparities, they offer limited insight into the procedure or explanation behind biased decisions. To address this gap, we propose Group-level Explanation Stability Disparity (GESD), a procedural-oriented fairness metric that measures disparities in the stability, robustness, and sensitivity of model explanations across different subgroups in a protected category. GESD is explainer-agnostic, model-agnostic, and extends the scope of fairness analyses to the level of explainability. We further integrate GESD into a multi-objective optimization framework that jointly optimizes for utility, outcome-based fairness, and explanation-based fairness called FEU (Fairness-Explainability-Utility). Empirical results on multiple benchmark datasets show that GESD effectively captures group-wise discrepancies in explanation quality, and that FEU improves both utility and fairness over state-of-the-art methods. By bridging outcome-based and explanation-based fairness, GESD offers a comprehensive tool for diagnosing and mitigating bias in predictive modeling. Our code and datasets are available on GitHub at https://github.com/horlahsunbo/GESD.

FutureSim: Replaying World Events to Evaluate Adaptive Agents

2605.15188v1 by Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
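
For readers unfamiliar with the Brier skill score cited above, here is a minimal sketch of the standard definition for binary outcomes. The abstract does not say which reference forecast FutureSim uses for "making no prediction at all"; the constant 0.5 forecast below is an assumption for illustration, as are the function names.

```python
from typing import List


def brier_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs: List[float], outcomes: List[int],
                      reference_prob: float = 0.5) -> float:
    """BSS = 1 - BS / BS_ref. Positive beats the reference, negative is worse.
    ASSUMPTION: the reference is a constant forecast of reference_prob."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


outcomes = [1, 0, 1, 0]
perfect = brier_skill_score([1.0, 0.0, 1.0, 0.0], outcomes)
uninformative = brier_skill_score([0.5, 0.5, 0.5, 0.5], outcomes)
confidently_wrong = brier_skill_score([0.0, 1.0, 0.0, 1.0], outcomes)
```

Under this definition, a confidently wrong forecaster scores below zero, which is how an agent can end up with a worse skill score than an uninformative baseline.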

Explainable Detection of Depression Status Shifts from User Digital Traces

2605.14995v1 by Loris Belcastro, Francesco Gervino, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio

Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.

摘要:每天,使用者會產生數位痕跡(例如,社交媒體帖子、聊天和線上互動),這些痕跡本質上是有時間戳的,並可能反映他們的心理狀態的某些方面。這些痕跡可以組織成時間軌跡,捕捉使用者的心理健康信號如何演變,包括改善、惡化或穩定的階段。在這項工作中,我們提出了一個可解釋的框架,用於檢測和分析使用者數位痕跡中與抑鬱相關的狀態變化。該方法結合了多個基於BERT的模型,以提取不同維度(例如,情感、情緒和抑鬱嚴重程度)之間的互補信號。這些信號隨時間聚合,以構建使用者級別的軌跡,並進行分析以識別有意義的變化點。為了增強可解釋性,該框架整合了一個大型語言模型,以生成簡潔且易於人類閱讀的報告,描述心理健康信號的演變並突出關鍵轉變。我們在兩個社交媒體數據集上評估了該框架。結果顯示,該方法產生的摘要比直接基於LLM的報告更具連貫性和信息性,實現了對使用者歷史的更高覆蓋率、更強的時間一致性,以及對變化點的更高敏感性。一項消融研究確認了每個組件的貢獻,特別是時間建模和分段。總體而言,該方法提供了心理健康信號隨時間變化的可解釋視圖,支持研究和決策,而不旨在臨床診斷。

GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation

2605.14968v1 by Drewry H. Morris, Luis Valles, Reza Hosseini Ghomi

GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.

摘要:GraphFlow 是一個視覺化工作流程系統,旨在提高多步驟、任務關鍵過程中代理 AI 自動化的可靠性。在這些工作流程中,小錯誤會迅速累積:在理想化的獨立步驟模型下,一個具有 90% 每步可靠性的十步驟過程成功完成的機率僅為 35%。現有的工作流程平台提供耐用的執行和可觀察性,但對語義正確性的保證卻很少,而代理系統在推理時進行計劃,使得行為對提示變化敏感且難以審計。GraphFlow 的設計旨在填補這一空白,通過將工作流程圖視為可執行的規範,定義數據範圍、執行語義和監控的一個單一工件。在編譯時,指定一類受限的圖形以產生可重用的自動化,其合約(前置條件、後置條件和組合義務)旨在在進入共享庫之前進行證明檢查。在運行時,一個耐用的引擎在附加式事件日誌中記錄結果,並可以在系統邊界強制執行合約,支持重播、重試和審計。游泳道使信任邊界變得明確,將經過驗證的邏輯與外部系統、人類判斷和 AI 決策分開。在三個臨床站點進行的一年期試點執行了 8,728 次隊列註冊的工作流程運行,完成率為 97.08%,是在沒有經過驗證的核心子系統的早期原型下進行的;觀察到的失敗主要集中在外部集成上。這裡描述的正式語義和經過證明的入庫模型已被指定並在積極開發中。對經過驗證的核心的評估保留給未來的工作。
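The compounding figure quoted in the GraphFlow abstract follows from multiplying per-step success probabilities under the stated idealized assumption of independent steps. A minimal sketch of that arithmetic (the function name `end_to_end_success` is illustrative and does not come from the paper):

```python
def end_to_end_success(per_step_reliability: float, steps: int) -> float:
    """Probability that all steps succeed, assuming the steps are independent."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_reliability ** steps


# Ten steps at 90% each: 0.9 ** 10 is roughly 0.349, i.e. about 35%.
print(f"{end_to_end_success(0.90, 10):.1%}")  # prints 34.9%
```

Real workflows rarely have fully independent steps, so this is best read as an intuition for why per-step errors matter rather than as a prediction.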

From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

2605.14912v1 by Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka

Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.

摘要:多元對齊通常被操作化為偏好聚合:產生跨越(Overton)、引導(Steerable)或按比例代表(Distributional)多樣人類價值的回應。我們認為僅僅依賴聚合對於已部署的多元對齊來說是不完整的原始概念。在真正的價值多元主義下,當前基於強化學習人類反饋(RLHF)訓練的助手的失敗模式並不是覆蓋不足,而是阿諛奉承的共識:一種學習到的傾向,與直接對話者達成一致、驗證並最小化摩擦。由於已部署的人工智慧系統現在在健康、公民生活、勞動和治理等方面進行重要的討論,因此在互動層面上意見的不一致的崩潰並不是一個狹隘的技術問題,而是一種具有分配後果的結構性失敗。我們從格賴斯的格言中重新框定多元對齊,圍繞三個對話機制:範疇(承認自身觀點的局限)、信號(呈現價值衝突而不是掩蓋它),以及修正(基於原則而非用戶壓力修訂自己的立場)。我們正式化了一個指標,即多元修正分數(PRS),以區分原則性修訂與屈從,並提供了一個小規模的實證示例,針對兩個前沿的RLHF訓練模型(Claude Sonnet 4.5, N=198; GPT-4o, N=100),顯示對於這兩者來說,遵循一致性與在有爭議的價值提示上低修正質量共存。PRS測量的是多元主義的互動前提(可見的不一致;原則性修訂),而不是完整的多元主義;我們討論這一差異,認真對待“誰的‘原則性’算數”的反思性問題,並主張多元主義在部署治理層面上最為決定性地形成或解體:介面、偏好數據管道和審計基礎設施。

Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models

2605.14897v1 by Senne Deproost, Denis Steckelmacher, Ann Nowé

Despite many successful attempts at explaining Deep Reinforcement Learning policies using distillation, it remains difficult to balance the performance-interpretability trade-off and select a fitting surrogate model. In addition, traditional distillation only minimizes the distance between the behavior of the original and the surrogate policy, while other RL-specific components such as action value are disregarded. To solve this, we introduce a new model-agnostic method called Critic-Driven Voronoi State Partitioning, which partitions a black-box control policy into regions where a simple class of model can be optimized using gradient descent. By exploiting the critic value network of the original policy, we iteratively introduce new subpolicies in regions with insufficient value, standing in for a measure of policy complexity. The partitioning, a Voronoi quantizer, uses nearest-neighbor lookups to assign a linear function to each point in the state space, resulting in a cell-like diagram. We validate our approach on several well-known benchmarks and show that this distillation approaches the original policy using a reasonably sized set of linear functions.

摘要:儘管有許多成功的嘗試來解釋深度強化學習政策的蒸餾,但平衡性能與可解釋性之間的權衡以及選擇合適的替代模型仍然困難。除了這一點,傳統的蒸餾僅最小化原始政策與替代政策之間行為的距離,而忽略了其他特定於強化學習的組件,如行動價值。為了解決這個問題,我們提出了一種新的與模型無關的方法,稱為評論驅動的 Voronoi 狀態劃分,它將黑箱控制政策劃分為可以使用梯度下降進行優化的簡單模型類別的區域。通過利用原始政策的評論價值網絡,我們在價值不足的區域中迭代地引入新的子政策,這代表了一種政策複雜性的度量。這種劃分,作為 Voronoi 量化器,使用最近鄰查找為狀態空間中的每個點分配一個線性函數,從而形成類似單元的圖。 我們在幾個知名基準上驗證了我們的方法,並證明這種蒸餾方法使用合理大小的線性函數集接近原始政策。
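The nearest-neighbor lookup described in the abstract can be pictured as follows. This is a minimal sketch, not the authors' implementation: the `Cell` class, the scalar action, and the hand-picked centers and weights are assumptions made for illustration, and the critic-driven step that decides where to add new cells is omitted entirely.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Cell:
    """One Voronoi cell: a site in state space plus a linear sub-policy (hypothetical layout)."""
    center: tuple[float, ...]
    weights: tuple[float, ...]
    bias: float


def nearest_cell(state: Sequence[float], cells: Sequence[Cell]) -> Cell:
    """Return the cell whose center is closest to the state (squared Euclidean distance)."""
    def sq_dist(cell: Cell) -> float:
        return sum((s - c) ** 2 for s, c in zip(state, cell.center))
    return min(cells, key=sq_dist)


def act(state: Sequence[float], cells: Sequence[Cell]) -> float:
    """Apply the linear function of the cell that owns this state."""
    cell = nearest_cell(state, cells)
    return sum(w * s for w, s in zip(cell.weights, state)) + cell.bias


# Two made-up cells in a 2-D state space.
cells = [
    Cell(center=(0.0, 0.0), weights=(1.0, 0.0), bias=0.0),
    Cell(center=(10.0, 10.0), weights=(0.0, -1.0), bias=5.0),
]
print(act((1.0, 2.0), cells))  # first cell: 1*1 + 0*2 + 0 = 1.0
print(act((9.0, 8.0), cells))  # second cell: 0*9 - 1*8 + 5 = -3.0
```

Because each state is handled by exactly one linear function, a reader can inspect a cell's weights to see how the surrogate behaves in that region.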

Holistic Evaluation and Failure Diagnosis of AI Agents

2605.14865v1 by Netta Madvil, Gilad Dym, Alon Mecilati, Edo Dekel, Jonatan Liberman, Rotem Brazilay, Liron Schliesser, Max Svidlo, Shai Nir, Orel Shalom, Yaron Friedman, David Connack, Amos Rimon, Philip Tannor, Shir Chorev

AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.

摘要:AI 代理執行複雜的多步驟過程,但目前的評估不足:結果指標報告成功或失敗卻未解釋原因,而過程層級的方法則難以將失敗類型與其在長且結構化的追蹤中的精確位置連結起來。我們提出了一個整體的代理評估框架,將自上而下的代理層級診斷與自下而上的跨度層級評估相結合,將分析分解為獨立的每個跨度評估。這種分解可擴展到任意長度的追蹤,並為每個判決產生跨度層級的理由。在 TRAIL 基準上,我們的框架在 GAIA 和 SWE-Bench 上的所有指標上達到了最先進的結果,相較於最強的先前基準,在類別 F1 上的相對增益高達 38%,在定位準確度上高達 3.5 倍,在聯合定位-分類準確度上高達 12.5 倍。按類別分析顯示,我們的框架在更多的錯誤類別中領先於任何其他評估者。值得注意的是,當在我們的框架內使用時,相同的前沿模型達到的定位準確度是作為全追蹤的單一評判者時的幾倍,顯示評估方法論,而非模型能力,是瓶頸所在。
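The decomposition into independent per-span assessments might look roughly like the sketch below. Everything here is hypothetical: the `Span` and `Verdict` types, the `tool_error` category, and the keyword rule standing in for a model-based judge are illustrative assumptions, not the framework's actual interface or the TRAIL taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Span:
    span_id: str
    text: str


@dataclass(frozen=True)
class Verdict:
    span_id: str
    category: Optional[str]  # None means no failure was found in this span
    rationale: str


def evaluate_trace(spans: Sequence[Span], judge: Callable[[Span], Verdict]) -> list[Verdict]:
    """Judge each span on its own, so cost grows linearly with trace length."""
    return [judge(span) for span in spans]


def keyword_judge(span: Span) -> Verdict:
    """Toy stand-in for an LLM judge: flags spans whose text contains a traceback."""
    if "Traceback" in span.text:
        return Verdict(span.span_id, "tool_error", "span output contains a traceback")
    return Verdict(span.span_id, None, "no issue detected by the toy rule")


trace = [
    Span("s1", "plan: search docs"),
    Span("s2", "Traceback (most recent call last): ..."),
    Span("s3", "final answer"),
]
located = [(v.span_id, v.category) for v in evaluate_trace(trace, keyword_judge) if v.category]
print(located)  # [('s2', 'tool_error')]
```

Each verdict carries both a location (`span_id`) and a category, which is what joint localization-categorization metrics score.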

FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

2605.14854v2 by Patrick Kwon, Chen Chen

Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

摘要:人類網格重建(HMR)本質上是模糊的:在遮擋或弱深度線索下,多個 3D 身體可以解釋相同的圖像證據。這種模糊性在身體的不同部位並不均勻,因為軀幹姿勢和根結構通常相對受到良好約束,而遠端關節如手臂和腿則更不確定。基於這一觀察,我們提出了 FactorizedHMR,一個兩階段的框架,對這兩種情況進行不同的處理。一個確定性的回歸模塊首先恢復一個穩定的軀幹-根錨點,然後一個概率流匹配模塊完成剩餘的非軀幹關節。為了使這一完成過程可靠,我們結合了一個複合目標表示與幾何感知的監督和特徵感知的無分類器引導,保留軀幹-根錨點的同時改善對模糊關節的單一參考恢復。我們還引入了一個合成數據管道,提供在多樣視角下的成對圖像-相機-運動監督。在相機空間和世界空間的基準測試中,FactorizedHMR 仍然與強大的基線競爭,尤其在遮擋重的恢復和對漂移敏感的世界空間指標上獲得了明顯的增益。

How Sensitive Are Radiomic AI Models to Acquisition Parameters?

2605.14667v1 by D. Gil, I. Sanchez, C. Sanchez

A main barrier to the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying the scan parameter sensitivity of radiomic AI models, while identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework for quantifying the influence that clinically relevant acquisition parameters have on model performance, while accounting for subject-level random effects. We have applied our framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and our own collected data) and several state-of-the-art architectures. To evaluate cross-database reproducibility, CT parameters have been adjusted using the collected data and tested on the public set. The optimal configuration selected is X-ray tube current >= 200 mA, spiral pitch <= 1.5, and slice thickness <= 1.25 mm, which balances diagnostic quality with low radiation dose. This configuration pushes metrics from 0.79 ± 0.04 sensitivity and 0.47 ± 0.10 specificity in low-quality scans to 0.90 ± 0.10 sensitivity and 0.79 ± 0.13 specificity in high-quality ones.

摘要:主要障礙在於AI放射組學系統在臨床常規中的部署是它們在異質的多中心獲取協議下性能的下降。這項工作提出了一個以性能為導向的框架,用於量化放射組學AI模型的掃描參數敏感性,同時識別與提高跨數據集穩健性相關的臨床顯著參數區域。我們制定了一個混合效應框架,以量化臨床相關的獲取參數對模型性能的影響,同時考慮受試者層級的隨機效應。我們已將我們的框架應用於CT掃描中的肺癌診斷,使用兩個獨立的多中心數據集(公共數據庫和自收集數據)以及幾個最先進的架構。為了評估跨數據庫的重現性,CT參數已根據收集的數據進行調整,並在公共數據集上進行測試。選擇的最佳配置是X射線管電流 >= 200 mA,螺旋步距 <= 1.5,切片厚度 <= 1.25 mm,這在低輻射劑量下平衡了診斷質量。這些配置將指標從低質量掃描中的0.79+-0.04靈敏度、0.47+-0.10特異性推升至高質量掃描中的0.90+-0.10靈敏度、0.79 +- 0.13特異性。
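The selected parameter region from the abstract can be written as a simple predicate over scan metadata. This is a sketch under assumptions: the `ScanParams` class and its field names are hypothetical, and mapping them to real DICOM attributes is left to the reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanParams:
    """Hypothetical container for the three acquisition parameters named in the abstract."""
    tube_current_ma: float
    spiral_pitch: float
    slice_thickness_mm: float


def in_selected_region(p: ScanParams) -> bool:
    """True when a scan meets all three thresholds reported as the selected configuration."""
    return (
        p.tube_current_ma >= 200.0
        and p.spiral_pitch <= 1.5
        and p.slice_thickness_mm <= 1.25
    )


print(in_selected_region(ScanParams(250.0, 1.0, 1.25)))  # True
print(in_selected_region(ScanParams(120.0, 1.0, 1.25)))  # False: tube current below 200 mA
```

A filter like this could be used to stratify a test set into the "high quality" and "low quality" groups the abstract compares.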

MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder

2605.14660v1 by Eranga Bandara, Ross Gore, Asanga Gunaratna, Ravi Mukkamala, Nihal Siriwardanagea, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Wathsala Herath, Chalani Rajapakse, Sachin Shetty, Anita H. Clayton, Christopher K. Rhea, Ng Wee Keong, Kasun De Zoysa, Amin Hass, Shaifali Kaushik, Preston Samuel, Atmaram Yarlagadda

Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem: traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-triggered amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches (prolonged exposure, EMDR, and cognitive behavioural therapy) operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling-tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.

摘要:創傷後壓力症候群(PTSD)根本上是一個神經可塑性問題,創傷性接觸事件通過赫布長期增強編碼過度反應的神經通路,產生在意識覺察之前就已觸發的、敏感的杏仁體-HPA壓力級聯反應。現有的治療方法,如長期暴露、EMDR、認知行為療法,主要在反應級聯的下游運作,教導患者在痛苦出現後如何忍受或重新框架。雖然這些以抑制為基礎的方法在臨床上具有價值,但並未產生構成持久結構性神經重組的上游通路溶解。本文提出了MindGap,一個保護隱私的設備內對話式人工智慧框架,通過依賴起源的實踐提供PTSD的結構性神經可塑性康復,這是一種佛教心理框架,確定了在前認知情感信號與隨之而來的反應性闡述之間的精確時刻作為治療干預的場所。MindGap引導患者通過三個漸進的觀察層次來探索這個感受音調的間隙:注意到反應性闡述之前的純粹情感信號,認識到它是自我產生的,而不是由刺激引起的,並認識到感受下的條件隱性信念。每一層對應於逐漸深入的前額葉調節參與和逐漸深入的長期抑制介導的反應通路削弱,產生真正的上游溶解,而不是下游抑制。MindGap完全在設備內運行,無數據外流,通過微調的輕量級大型語言模型提供每日校準的暴露會議,使其可在不允許雲端解決方案的敏感臨床和軍事環境中部署。

Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning

2605.14636v1 by Chenlu Ding, Jiancan Wu, Yanchen Luo, Zheyuan Liu, Yancheng Yuan, Xiang Wang

Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.

摘要:大型語言模型(LLMs)在時間截止方面常常無法進行推理:當被提示從早期的角度回答時,它們會利用只有在稍後才可獲得的知識。我們通過事前推理的視角來研究這一失敗,在這種情況下,模型必須僅依賴於截止之前可知的信息。通過對提示級別干預的系統分析,我們發現時間泄漏對截止的表述和指令的位置非常敏感:明確的截止聲明優於隱含的歷史框架,而前綴約束比後綴約束更有效地減少泄漏。這些發現表明,提示可以引導模型進入一個時間框架,但並不賦予它們驗證回應是否在時間上可接受的能力。我們進一步認為,監督微調是不夠的,因為事前正確性不是答案的內在屬性,而是答案與截止之間的關係。為了解決這一差距,我們提出了TCFT,一個時間批評微調框架,旨在訓練模型獲得截止意識的時間驗證。給定一個查詢、一個截止和一個候選回應,TCFT教會模型識別截止後的泄漏,解釋時間邊界違規,並判斷時間的可接受性。與Qwen2.5-7B-Instruct和Qwen2.5-14B-Instruct的實驗顯示,TCFT始終優於提示和SFT基準,平均泄漏分別減少了41.89和37.79個百分點。
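The abstract's point that ex-ante correctness is a relation between an answer and a cutoff, not a property of the answer alone, can be illustrated with a toy check. This is not TCFT itself, which trains a model to perform this verification on free text. The sketch assumes each fact used by a response already comes annotated with the date it became knowable, and the fact names and dates are placeholders.

```python
from datetime import date


def leaked_facts(fact_dates: dict[str, date], cutoff: date) -> list[str]:
    """Names of facts that only became knowable after the cutoff."""
    return sorted(name for name, known_on in fact_dates.items() if known_on > cutoff)


def is_admissible(fact_dates: dict[str, date], cutoff: date) -> bool:
    """A response is temporally admissible when it relies on no post-cutoff fact."""
    return not leaked_facts(fact_dates, cutoff)


# Placeholder facts used by one candidate response.
response_facts = {"fact_a": date(2019, 5, 1), "fact_b": date(2021, 3, 15)}

# The same response is admissible under one cutoff and leaks under another.
print(is_admissible(response_facts, date(2022, 1, 1)))  # True
print(leaked_facts(response_facts, date(2020, 1, 1)))   # ['fact_b']
```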

Optimal Pattern Detection Tree for Symbolic Rule-Based Classification

2605.14374v1 by Young-Chae Hong, Yangho Chen

Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.

摘要:資料中的模式發現對於各種領域至關重要,包括醫療保健、風險評估和機械維護。與黑箱深度學習模型相對,符號規則發現成為一項關鍵的數據挖掘任務,生成可供人類解釋的規則,提供透明性和直觀的可解釋性。本文介紹了最佳模式檢測樹(OPDT),這是一種基於新型混合整數規劃的基於規則的機器學習模型,通過二元分類在數據中發現單一最佳模式。為了納入先前知識和合規要求,我們進一步引入了分支結構約束(BSC)框架,使決策者能夠將領域知識和約束直接編碼到模型中。這種基於優化的方法在數據集中發現潛在的隱藏模式(如果存在的話),通過識別一條最佳規則來最大化覆蓋率,同時最小化由於錯誤分類造成的假陽性率。我們的計算實驗顯示,OPDT能在合理的運行時間內,在中等大小的數據集上發現具有最佳性保證的模式。
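The two quantities the abstract says the optimal rule trades off can be computed for any candidate rule as sketched below. This only scores a given rule and does not perform the mixed-integer search. It assumes "coverage" means the fraction of positive examples the rule captures, which the abstract does not define precisely, and the feature names and toy rows are made up.

```python
from typing import Callable, Mapping, Sequence

Row = Mapping[str, float]


def evaluate_rule(
    rule: Callable[[Row], bool], rows: Sequence[Row], labels: Sequence[bool]
) -> tuple[float, float]:
    """Return (coverage, false_positive_rate) of a rule on labeled rows."""
    fired = [rule(row) for row in rows]
    true_pos = sum(1 for f, y in zip(fired, labels) if f and y)
    false_pos = sum(1 for f, y in zip(fired, labels) if f and not y)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    coverage = true_pos / positives if positives else 0.0
    fpr = false_pos / negatives if negatives else 0.0
    return coverage, fpr


# Made-up rows and labels; the rule is a conjunction of two threshold conditions.
rows = [
    {"age": 70, "bp": 150},
    {"age": 65, "bp": 160},
    {"age": 30, "bp": 155},
    {"age": 72, "bp": 110},
    {"age": 68, "bp": 145},
]
labels = [True, True, False, False, False]


def candidate_rule(row: Row) -> bool:
    return row["age"] >= 60 and row["bp"] >= 140


coverage, fpr = evaluate_rule(candidate_rule, rows, labels)
print(coverage, round(fpr, 3))  # 1.0 0.333
```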

Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment

2605.14242v1 by Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han

The monitoring of fetal heart rate (FHR) and the assessment of its variability are crucial for preventing fetal compromise and adverse outcomes. However, traditional methods encounter limitations arising from equipment performance, data transmission, and subjective assessments by doctors. We have developed a tailored AI-based FHrCTG model specifically for FHR monitoring, which effectively mitigates noise interference and precisely reconstructs signals. Our model was pre-trained on a massive dataset consisting of 558,412 unlabeled data points and further refined using 7,266 expert-reviewed entries. To validate FHR, we introduced the Intersection Overlapping Labels (IOL) approach, which transforms rate analysis into categorical judgments. Testing revealed that our model demonstrates high sensitivity and specificity in detecting critical FHR decelerations (89.13% and 87.78%, respectively) and accelerations (62.5% and 92.04%, respectively). Furthermore, based on Fischer's criteria for clinical application, our model achieved impressive AUC scores of 0.7214 and 0.9643 for verifying FHR periodicity and amplitude variation, respectively.

摘要:胎心率(FHR)的監測及其變異性的評估對於防止胎兒受損和不良結果至關重要。然而,傳統方法在設備性能、數據傳輸和醫生的主觀評估方面存在限制。我們開發了一種專門用於FHR監測的定制AI基於FHrCTG模型,該模型有效減少了噪音干擾並精確重建信號。我們的模型在一個包含558,412個未標記數據點的大型數據集上進行了預訓練,並使用7,266個專家審核的條目進一步精煉。為了驗證FHR,我們引入了交集重疊標籤(IOL)方法,將速率分析轉化為類別判斷。測試顯示,我們的模型在檢測關鍵FHR減速(分別為89.13%和87.78%)和加速(分別為62.5%和92.04%)方面表現出高敏感性和特異性。此外,根據Fischer的臨床應用標準,我們的模型在驗證FHR的周期性和幅度變化方面分別達到了令人印象深刻的AUC分數0.7214和0.9643。

Fusion-fission forecasts when AI will shift to undesirable behavior

2605.14218v1 by Neil F. Johnson, Frank Yingjie Huo

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

摘要:面臨社會中類似ChatGPT的人工智慧使用的關鍵問題是,其行為可能會在不被注意的情況下,從可取轉變為不可取——促使自我傷害、極端行為、財務損失或代價高昂的醫療和軍事錯誤——而目前尚無法預測何時會發生這種轉變。儘管在人工智慧建模、訓練後的對齊和安全措施方面取得了顯著進展,但即使是最新的AI模型中,行為的轉變仍然持續存在。在這裡,我們展示了一種向量泛化的融合-裂變群體動力學,這種動力學在活體和活性物質系統中觀察到,驅動並可以預測AI行為的未來轉變。轉變條件也可以從數學上推導出來,這是由於目前為止的對話(C)與可取(B)和不可取(D)盆地動力學之間的群體競爭,這些動力學可以為特定應用提前進行估算。這既不是特定於模型的,也不是由隨機抽樣驅動的。我們在六項獨立測試中驗證了它,包括:在七個跨越兩個數量級的參數計數(124M-12B)的AI模型中,正確率達到90%;在十個前沿聊天機器人中持續保持生產規模;以及在斯坦福的「妄想螺旋」語料庫出現之前十一個月的先驗時間戳預測,並由207,443個人類-AI交流的該語料庫獨立確認。因為它在當前安全堆棧的架構下,所以同樣的公式提供了一個實時警告信號,而當前的對齊無法提供,這一信號在當前和未來的類似ChatGPT的AI架構中是可攜帶的,並且可以在可定義競爭反應類別的應用領域中實現。

LLM-Based Robustness Testing of Microservice Applications: An Empirical Study

2605.14202v1 by Hrushitha Goud Tigulla, Marco Vieira

Malformed, missing, or boundary-value inputs in microservice APIs can cascade across dependent services, threatening reliability. Robustness testing systematically exercises such inputs to expose server-side failures, but generating diverse, effective tests remains challenging. Large Language Models can generate such tests from API specifications; however, it is unknown whether different models and prompt strategies produce diverse failure sets or converge on the same failures. We report a controlled experiment applying 7 prompt strategies to 3 open-source LLMs (14B-70B parameters) targeting 2 architecturally distinct microservice systems: one Java monolingual (6 services, 9 failure modes) and one polyglot (27 services, 14 failure modes), yielding 38 valid runs and 663 generated tests. We find that prompt strategy explains more variation in diversity than model size: a Structured prompt collapses diversity entirely, while a single model varied across three prompt strategies achieves complete failure-mode coverage on one system, outperforming any multi-model ensemble under a fixed prompt. We introduce two strategies, Guided and GuidedFewShot, that embed a mutation taxonomy from prior robustness testing research as domain context. GuidedFewShot achieves the highest single-run coverage on both systems (5 of 9 and 8 of 14 failure modes) while maintaining low cross-model similarity. A key lesson is that taxonomy rules alone are insufficient: LLMs cannot distinguish key-absent from value-empty mutations without concrete examples. Findings replicate across both systems.

摘要:格式不正確、缺失或邊界值的微服務 API 輸入可能會在依賴的服務之間引發連鎖反應,威脅到可靠性。穩健性測試系統性地測試這些輸入,以揭示伺服器端的故障,但生成多樣且有效的測試仍然具有挑戰性。大型語言模型可以從 API 規範生成這些測試;然而,尚不清楚不同模型和提示策略是否會產生多樣的故障集或收斂於相同的故障。我們報告了一項控制實驗,對 3 個開源 LLM(14B-70B 參數)應用 7 種提示策略,針對 2 個架構上不同的微服務系統:一個是 Java 單語言系統(6 服務,9 種故障模式),另一個是多語言系統(27 服務,14 種故障模式),產生了 38 次有效運行和 663 個生成的測試。我們發現提示策略解釋了多樣性中的更多變異,而非模型大小:結構化提示完全壓縮了多樣性,而單一模型在三種提示策略中變化,實現了對一個系統的完整故障模式覆蓋,超越了任何在固定提示下的多模型集合。我們引入了兩種策略,Guided 和 GuidedFewShot,這些策略將先前穩健性測試研究中的變異分類嵌入作為領域上下文。GuidedFewShot 在兩個系統上實現了最高的單次運行覆蓋率(9 種故障模式中的 5 種和 14 種故障模式中的 8 種),同時保持低的跨模型相似性。一個關鍵的教訓是,僅僅依賴分類規則是不夠的:LLMs 無法在沒有具體示例的情況下區分缺少鍵的變異和空值變異。這些發現在兩個系統中都得到了重複。
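The distinction the abstract highlights between key-absent and value-empty mutations is easy to state concretely. The sketch below uses a hypothetical request payload. The helper names and field names are illustrative and are not taken from the study's mutation taxonomy.

```python
import copy
import json


def key_absent(payload: dict, key: str) -> dict:
    """Mutation that removes the key entirely from the request body."""
    mutated = copy.deepcopy(payload)
    mutated.pop(key, None)
    return mutated


def value_empty(payload: dict, key: str) -> dict:
    """Mutation that keeps the key but sets its value to an empty string."""
    mutated = copy.deepcopy(payload)
    mutated[key] = ""
    return mutated


order = {"userId": "u-123", "quantity": 2}  # hypothetical payload
print(json.dumps(key_absent(order, "userId")))   # {"quantity": 2}
print(json.dumps(value_empty(order, "userId")))  # {"userId": "", "quantity": 2}
```

The two requests can exercise different server-side paths: a missing key typically fails schema or binding checks, while an empty value may pass them and fail later in business logic.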

Do Language Models Align with Brains? Prediction Scores Are Not Enough

2605.14025v1 by Xiao Jia

Brain-language model comparisons often interpret neural prediction scores as evidence that model representations capture brain-relevant language computation. We asked whether language models align with brains, and whether prediction scores are enough to support that claim, using L-PACT, a source-audited framework that evaluates predictive, relational, mechanism-stripping, and reliability-bounded evidence. Across primary naturalistic language neural datasets and derived language-model representations, L-PACT compared real model features with nuisance baselines and severe controls, tested whether model-to-brain profiles reproduced brain-to-brain patterns, recomputed held-out scores after mechanism stripping, and normalized evidence against brain-brain ceilings. The locked analysis set contains 414 predictive-control rows, 2304 relational profile rows, 4320 mechanism-stripping rows, 420 brain-brain ceiling rows, and 146 integrated decision rows. Assay-sensitivity checks showed that brain-brain reliability, brain-as-model run-to-run relational profiles, independent low-level neural and WAV-derived acoustic-envelope gates, and a deterministic implanted-signal simulation can produce positive evidence when expected. Nevertheless, no real model row passed the predictive, relational, mechanism-stripping, or operational Turing-bounded reliability gates; all 146 integrated rows were control-explained. Less stringent single-criterion rules would have counted raw positive predictive, relational, stripping-delta, and ceiling-normalized effects, but L-PACT downgraded them because controls explained the apparent evidence. In the analyzed derived artifact set, the tested language-model representations do not satisfy L-PACT alignment gates; apparent positives are converted into an auditable control-explained taxonomy rather than treated as structural alignment.

摘要:腦部與語言模型的比較通常將神經預測分數解釋為模型表示捕捉與大腦相關的語言計算的證據。我們詢問語言模型是否與大腦對齊,以及預測分數是否足以支持該主張,使用 L-PACT,一個源審核的框架,用於評估預測性、關聯性、機制剝離和可靠性界限的證據。在主要的自然語言神經數據集和衍生的語言模型表示中,L-PACT 將真實模型特徵與干擾基準和嚴格控制進行比較,測試模型到大腦的特徵是否重現了大腦到大腦的模式,在機制剝離後重新計算保留的分數,並將證據標準化以對抗大腦-大腦的上限。鎖定的分析集包含 414 個預測控制行、2304 個關聯特徵行、4320 個機制剝離行、420 個大腦-大腦上限行和 146 個整合決策行。檢測敏感性檢查顯示,大腦-大腦的可靠性、大腦作為模型的運行到運行的關聯特徵、獨立的低級神經和 WAV 衍生的聲學包絡門,以及確定性的植入信號模擬可以在預期時產生正面證據。然而,沒有任何真實模型行通過預測性、關聯性、機制剝離或操作性圖靈界限的可靠性門檻;所有 146 個整合行都被控制解釋。較不嚴格的單一標準規則本可以計算原始的正向預測、關聯、剝離增量和上限標準化效果,但 L-PACT 將它們降級,因為控制解釋了顯而易見的證據。在分析的衍生人工制品集中,測試的語言模型表示不滿足 L-PACT 對齊門檻;顯而易見的正面結果被轉換為可審核的控制解釋分類,而不是被視為結構對齊。

Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

2605.13730v1 by Christos Chrysanthos Nikolaidis, Vasileios Sachpekidis, Nikolas Moustakidis, Theofilos Moustakidis, Pavlos S. Efraimidis

Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on $N{=}90$ patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of $0.907$ and recall of $0.877$. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.

摘要:經胸心臟超音波檢查(TTE)是診斷二尖瓣主動脈瓣(BAV)的第一線影像檢查方式,但其診斷表現因操作人員的專業技能和影像品質而異。我們開發了一個可解釋的人工智慧模型,利用常規獲取的旁胸長軸(PLAX)動態影像循環來區分BAV和三尖瓣主動脈瓣(TAV)。我們訓練並評估了一個多骨幹視頻集成模型,使用了一種考慮洩漏的分層外部交叉驗證協議,對90位患者的研究進行了評估(48 BAV,42 TAV)。在固定的外部分割和10個隨機種子下,經過校準的堆疊集成模型達到了外部交叉驗證F1分數$0.907$和召回率$0.877$。幀級Grad-CAM將顯著證據定位於主動脈根部和瓣膜平面,而全局聚合的SHAP值量化了每個視頻骨幹對堆疊預測的貢獻,實現了透明的案例級審計能力。這些發現表明,基於PLAX的視頻集成模型可以支持從常規心臟超音波動態影像中可靠地分類BAV/TAV,並可能促進在非專科或資源有限的臨床環境中更早的檢測。

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

2605.13542v1 by Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan

Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30-min windows and release two datasets: RealICU-Gold with 930-window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/

VERA-MH: Validation of Ethical and Responsible AI in Mental Health

2605.13318v1 by Luca Belli, Kate H. Bentley, Josh Gieringer, Emily Van Ark, Nilu Zhao, Pradip Thachile, Matt Hawrilenko, Millard Brown, Adam M. Chekroud

Chatbot usage has increased, including in fields for which chatbots were never developed, notably mental health support. To that end, we introduce Validation of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically validated evaluation of chatbot safety in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks, assessing how well chatbots can respond to users who might be in crisis. VERA-MH comprises three steps: conversation simulation, conversation judging, and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. These user personas have been developed under clinical guidance to ensure that, among other things, multiple risk factors, demographic characteristics, and disclosure factors are represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve the consistency of answers and highlight models' failure modes. In the last stage, the results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the results of the evaluations for four leading LLM providers.

IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

2605.13311v1 by Joy Bose

Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge graph-grounded multi-agent framework for innovation analysis and patent claim generation. IdeaForge integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER) through specialist agents operating over a persistent FalkorDB knowledge graph. Each agent contributes structured entities and relationships representing contradictions, inventive principles, user needs, transformations, analogies, and candidate claims. The central contribution of IdeaForge is a cross-methodology convergence mechanism implemented through graph-based claim linkage. Claims independently supported by multiple methodologies are connected using CONVERGENT relationships, enabling identification of high-confidence innovation candidates through graph traversal. A downstream patent drafting agent generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language model generation. An InnovationScore formula ranks claims by convergent support, methodology diversity, claim strength, and prior art challenge count. We describe the graph schema, agent architecture, convergence detection pipeline, and patent synthesis workflow. Experiments on a legal technology use case demonstrate that graph-grounded multi-methodology synthesis produces more diverse and traceable innovation candidates compared to single-methodology baselines. We discuss implications for computational creativity, explainable AI-assisted invention, and graph-native innovation systems.

Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

2605.13047v1 by Ziqi Wen, Parsa Madinei, Miguel P. Eckstein

Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.

An Agentic LLM-Based Framework for Population-Scale Mental Health Screening

2605.13046v1 by Giuliano Lorenzoni, Paulo Alencar, Donald Cowan

Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured clinical information while adapting to patient-specific needs. This paper proposes an agentic framework for building robust LLM-based pipelines, where each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation. Stages are incrementally locked once validated, ensuring that later adaptations cannot overwrite configurations without demonstrated improvement. The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding. A proof-of-concept in transcript-based depression detection demonstrates that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and threshold 0.75, while controlling evaluation costs and avoiding regressions. These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing critical challenges in trustworthiness, reproducibility, and adaptability required in healthcare environments.
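
The configuration named in the abstract (cosine similarity, dynamic Top-k, threshold 0.75) can be illustrated with a minimal retrieval sketch. This is not the authors' LangChain pipeline: the function names, the toy two-dimensional vectors, and the reading of "dynamic Top-k" as "keep however many candidates pass the threshold, capped at `max_k`" are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, docs, threshold=0.75, max_k=3):
    """Return names of documents whose similarity to the query is at least
    `threshold`, best first. The number returned varies per query
    (an assumed interpretation of "dynamic Top-k"), capped at `max_k`."""
    scored = [(cosine(query, vec), name) for name, vec in docs.items()]
    kept = sorted((sn for sn in scored if sn[0] >= threshold), reverse=True)
    return [name for _, name in kept[:max_k]]


# Hypothetical toy embeddings; real transcript embeddings would be much larger.
docs = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [2.0, 1.0], "d": [0.0, 1.0]}
print(retrieve([1.0, 0.0], docs))
```

In this toy example only `a` and `c` clear the 0.75 threshold, so two results come back even though `max_k` allows three.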

No Attack Required: Semantic Fuzzing for Specification Violations in Agent Skills

2605.13044v1 by Ying Li, Hongbo Wen, Yanju Chen, Hanzhi Liu, Yuan Tian, Yu Feng

LLM-powered agents can silently delete documents, leak credentials, or transfer funds on a routine user request, not because the agent was attacked, but because the skill it invoked broke its own declared safety rules. We call these specification violations: benign inputs cause a skill to breach the natural-language guardrails in its own specification, typically because the guardrail's semantics are undefined for autonomous execution, or because the implementation silently ignores the documented constraint. These violations are invisible to static analyzers, traditional fuzzers, and prompt-injection defenses alike, yet they undermine the very contract a user trusts when installing a skill. We present Sefz, a goal-directed semantic fuzzing framework that automatically discovers specification violations in agent skills. Sefz translates each guardrail into a reachability goal over an annotated execution trace, reducing violation checking to a deterministic graph query. An LLM-based mutator generates benign inputs whose traces progressively approach the violation patterns, guided by a multi-armed bandit that uses goal-proximity as its reward signal. On 402 real-world skills from the largest public agent-skill marketplace, Sefz finds specification violations in 120 (29.9%), including 26 previously unknown exploitable guardrail violations in deployed skills. Six recurring specification pitfalls explain the bulk of the failures, suggesting concrete principles for safer skill design.

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

2605.12922v1 by Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür

Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
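
The abstract describes GAR only as a measure of attention from generated tokens to task-defining goal tokens. One plausible way to compute such a ratio is sketched below; the exact normalization, the averaging over generated tokens, and the name `goal_accessibility_ratio` are assumptions, not the paper's definition.

```python
def goal_accessibility_ratio(attn, goal_idx, gen_rows):
    """Mean share of attention that generated tokens place on goal tokens.

    attn[q][k] is the attention weight from query token q to key token k
    (for example, one head or a head-averaged map). goal_idx lists the key
    positions of goal-defining tokens; gen_rows lists the query positions
    of generated tokens.
    """
    if not gen_rows:
        raise ValueError("gen_rows must contain at least one generated token")
    goal = set(goal_idx)
    shares = []
    for q in gen_rows:
        row = attn[q]
        total = sum(row)
        mass = sum(w for k, w in enumerate(row) if k in goal)
        shares.append(mass / total if total > 0 else 0.0)
    return sum(shares) / len(shares)


# Hypothetical 4-token causal attention map: tokens 0-1 define the goal,
# tokens 2-3 are generated.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0],
    [0.1, 0.1, 0.3, 0.5],
]
print(goal_accessibility_ratio(attn, goal_idx=[0, 1], gen_rows=[2, 3]))
```

Here the two generated tokens place roughly 0.6 and 0.2 of their attention on the goal tokens, giving a ratio of about 0.4. A value drifting toward zero over turns would correspond to the "closing" attention channel the abstract describes.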

Data Difficulty and the Generalization-Extrapolation Tradeoff in LLM Fine-Tuning

2605.12906v1 by Siyuan Liu, Tinghong Chen, Xinghan Li, Yifei Wang, Jingzhao Zhang

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems

2605.12895v1 by Rohith Reddy Bellibatlu

Aggregate accuracy metrics dominate the evaluation of clinical AI decision-support systems but do not detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five-dimension pre-deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub-criteria, pre-specified pass/fail thresholds, and bias-corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm-Bonferroni family-wise error correction. A central demonstration is that a classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. We validate this differential pass/fail pattern on a synthetic cohort and three publicly available real-world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey, where failing dimensions differ across cohorts, providing preliminary evidence of construct validity. The Equity dimension is reframed as a proxy-dependence diagnostic rather than a stand-alone gate: any need-based fairness verdict computed against a utilization-derived proxy carries a construct-validity problem the framework surfaces explicitly, triggering a procurement requirement for an outcome-independent need measure before the gate is binding. RISED is released as an open-source Python package that supplies the quantitative verdicts existing clinical AI reporting standards require, providing a principled gateway between in-silico model validation and silent-trial clinical evaluation.
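
The Holm-Bonferroni correction mentioned in the abstract is a standard step-down procedure for controlling the family-wise error rate. The sketch below shows the generic procedure applied to p-values. How the RISED package combines it with BCa bootstrap intervals is not stated in the abstract, so the function name and the example p-values are purely illustrative.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected
    under the Holm-Bonferroni step-down procedure at level alpha."""
    m = len(p_values)
    # Visit hypotheses in order of ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared against alpha/m, the next against
        # alpha/(m-1), and so on.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every remaining (larger) p-value is retained as well
    return reject


# Hypothetical p-values for five sub-criterion tests (illustrative only).
p = [0.001, 0.04, 0.03, 0.20, 0.012]
print(holm_bonferroni(p))  # [True, False, False, False, True]
```

In this example 0.03 and 0.04 would each pass an uncorrected 0.05 cutoff, but neither survives the correction, which is the kind of conservatism a pre-specified pass/fail gate relies on.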

A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study

2605.13905v1 by Jaime Yan

Drug development and pharmacovigilance are frequently bottlenecked by legacy clinical reporting pipelines. These monolithic systems encode regulatory-grade logic but resist AI integration by producing opaque output with no machine-readable intermediate layer. Existing modernization approaches force a choice between full rewrites and incremental refactoring that preserves structural barriers. We present a non-destructive methodological framework achieving AI-driven pharmacoinformatics readiness without altering legacy source code. A metadata layer (comprising a bridge map, a typed Intermediate Representation (IR), and an orchestrator) wraps existing components and re-exposes their outputs as structured data consumable by LLMs. It enables optional incremental consolidation, replacing selected legacy components with metadata-configured core routines while the remainder operates unchanged. Validated on a 558-component SAS reporting library (373,000 lines of code), the framework demonstrated immediate AI-readiness under coexistence mode, yielding machine-readable output. Where consolidation was elected, the modernized core achieved a 92% reduction in proprietary code. Parity validation on 14 report types from a Phase III study achieved cell-level parity of 80% or above on 11 reports (mean 82.7%, best 99.2%). A benchmark using CDISC CDISCPilot01 data achieved 100% parity across 5 reports. LLM experiments confirmed the IR enables automated pharmacovigilance, table summarization, and trial configuration generation. The framework offers a regulation-aware path to AI-integrated clinical reporting, accelerating drug development without interrupting regulatory submissions.

NOVA: Fundamental Limits of Knowledge Discovery Through AI

2605.15219v1 by Salman Avestimehr, Ken Duffy, Muriel Médard

Can AI systems discover genuinely new knowledge through iterative self-improvement, and if so, at what cost? We introduce the NOVA framework, which models the common "generate, verify, accumulate, retrain" loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good-Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $α>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=Θ(c_{\mathrm{gen}}D^α)$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.
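
A rough intuition for the $D^α$ scaling can be sketched numerically. Under a Zipf law with $p_k \propto k^{-α}$, the expected number of independent draws before the rank-$k$ item first appears is $1/p_k \propto k^α$, so reaching deeper ranks becomes polynomially more expensive. The snippet below is a heuristic illustration of that waiting-time growth only, not the paper's proof. The finite support size `n` and the function names are assumptions.

```python
def zipf_probs(n, alpha):
    """Probabilities of a Zipf law with exponent alpha on ranks 1..n."""
    weights = [k ** -alpha for k in range(1, n + 1)]
    z = sum(weights)
    return [w / z for w in weights]


def expected_wait(rank, probs):
    """Expected number of i.i.d. draws until the item of the given rank
    first appears (mean of a geometric distribution)."""
    return 1.0 / probs[rank - 1]


alpha = 1.5
probs = zipf_probs(n=10_000, alpha=alpha)
# Doubling the target rank multiplies the waiting time by 2**alpha.
ratio = expected_wait(200, probs) / expected_wait(100, probs)
print(round(ratio, 3), round(2 ** alpha, 3))
```

With $α = 1.5$, doubling the rank of the rarest item sought multiplies its expected waiting time by about 2.83, which matches the diminishing returns the abstract's scaling law describes.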

BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics

2605.12730v1 by Helene Malyutina

Existing AI systems for modeling human behavior operate at the level of individuals or detect events after they occur. As a result, they systematically fail to capture the collective dynamics that determine whether a group remains stable or transitions into escalation or breakdown. We propose a different foundation: a group of interacting humans constitutes a complex dynamical system in the precise mathematical sense, exhibiting emergence, nonlinearity, feedback loops, sensitivity near critical points, and phase transitions between qualitatively distinct regimes. The state of such a system is not located within any single participant; it is distributed across mutual influence loops and observable through the micro-dynamics of the body. We introduce BEHAVE (Behavioral Engine for Human Activity Vector Estimation), a formal framework that models collective dynamics as continuous behavioral fields defined over an interaction space derived from observable physical signals. Kinematic micro-signals (position, velocity, body orientation, gestural activity) are structured into a directed interaction graph and aggregated into a basis of behavioral fields capturing distinct, non-redundant axes of collective state. The framework rests on one theorem and two structural propositions characterizing the tension field, the field basis, and the criticality index. Perception and forecasting layers are implemented using neural models, enabling data-driven learning and approximation of system dynamics. BEHAVE is formulated as a computational system for learning, representing, and forecasting collective dynamics from data. A working pipeline is demonstrated on a 7-agent negotiation snapshot. The same fields, recalibrated, apply to crowd safety, crisis-team dynamics, education, and clinical contexts.

A New Technique for AI Explainability using Feature Association Map

2605.12350v3 by Sayantani Ghosh, Amit Kumar Das, Amlan Chakrabarti

Lack of transparency in AI systems poses challenges in critical real-life applications. Being able to explain the decisions of an AI system is important for ensuring trust in the system. Explainable AI (XAI) algorithms play a vital role in achieving this objective. In this paper, we propose a new algorithm for explaining AI systems, FAMeX (Feature Association Map based eXplainability). The proposed algorithm is based on a graph-theoretic formulation of the feature set termed the Feature Association Map (FAM). The modelling is founded on associations between features. The proposed FAMeX algorithm has been found to perform better than the competing XAI algorithms, Permutation Feature Importance (PFI) and SHapley Additive exPlanations (SHAP). Experiments conducted with eight benchmark algorithms show that FAMeX is able to gauge feature importance in the context of classification better than the competing algorithms. This shows that FAMeX is a promising algorithm for explaining the predictions of an AI system.

Why Conclusions Diverge from the Same Observations: Formalizing World-Model Non-Identifiability via an Inference

2605.12255v1 by Toru Takahashi

When people share the same documents and observations yet reach different conclusions, the disagreement often shifts into a judgment that the other party is cognitively defective, irrational, or acting in bad faith. This paper argues that such divergence is better described as a form of non-identifiability inherent in inference and learning, rather than as a defect of the other party. We organize the phenomenon into two levels: (i) $θ$-level non-identifiability, where conclusions diverge under the same world model $W$ because inference settings differ; and (ii) $W$-level non-identifiability, where repeated use of an inference setting $θ$ biases data exposure and update rules, causing the learned world model $W$ itself to diverge. We introduce an inference profile $θ= (R, E, S, D)$, consisting of Reference, Exploration, Stabilization, and Horizon, and show how outputs can split even for the same observation $o$ and the same $W$. We further explain why disagreements tend to project onto a small number of bases -- abstract versus concrete, externalizability, and order versus freedom -- as a consequence of general constraints on learning systems: computational, observational, and coordination constraints. Finally, we relate the framework to deep representation learning, including representation hierarchy, latent-state estimation, and regularization-exploration trade-offs, and illustrate the framework through a case study on AI regulation debates.

BoolXLLM: LLM-Assisted Explainability for Boolean Models

2605.12139v1 by Du Cheng, Serdar Kadioglu, Xin Wang

Interpretable machine learning aims to provide transparent models whose decision-making processes can be readily understood by humans. Recent advances in rule-based approaches, such as expressive Boolean formulas (BoolXAI), offer faithful and compact representations of model behavior. However, for non-technical stakeholders, two main challenges remain in practice: (i) selecting semantically meaningful features and (ii) translating formal logical rules into accessible explanations. In this work, we propose BoolXLLM, a hybrid framework that integrates Large Language Models (LLMs) into the end-to-end pipeline of Boolean rule learning. We augment BoolXAI, an expressive Boolean rule-based classifier, with LLMs at three critical stages: (1) feature selection, where LLMs guide the identification of domain-relevant variables; (2) threshold recommendation, where LLMs propose semantically meaningful discretization strategies for numerical features; and (3) rule compression and interpretation, where Boolean rules are translated into natural language explanations at both global and local levels. This integration bridges formal, faithful explanations with human-understandable narratives, making it possible to build an explainable AI system that is both theoretically grounded and accessible to non-experts. Early empirical results demonstrate that LLM-assisted pipelines improve interpretability while maintaining competitive predictive performance. Our work highlights the promise of combining symbolic reasoning with language-based models for human-centered explainability.

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

2605.12120v1 by Fangyi Yu, Nabeel Seedat, Jonathan Richard Schwarz, Andrew M. Bean

Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace -- e.g., that a drug has been withdrawn -- yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.

摘要:語言模型在高風險的專業環境中面臨來自用戶、機構當局和專業規範的矛盾需求。當這些需求發生衝突時,模型的行為揭示了一種主要的層級結構——一種對競爭利益相關者的隱含排序,這決定了,例如,一個醫療人工智慧在接受醫院管理者的成本削減指令時,是否會以證據為基礎的護理為代價而遵從,或者因為專業標準的要求而拒絕。在法律和醫療領域的7,136個場景中,我們測試了十個前沿模型,發現模型在執行任務時經常未能遵循專業標準,例如在草擬時,當用戶指令與這些標準發生衝突時——儘管在用戶尋求建議指導時能夠充分遵守這些標準。我們進一步發現,這些模型所展示的用戶、權威和專業標準之間的層級在醫療和法律背景中是不穩定的,並且在模型家族之間不一致。當未能遵循專業標準時,主要的失敗機制是知識遺漏:那些明顯擁有相關知識的模型在未顯示衝突知識的情況下產生有害的輸出。在一個特別令人擔憂的案例中,我們發現一個推理模型在其推理過程中識別出相關知識——例如,一種藥物已被撤回——但在面向用戶的回答中壓制了這一點,並在權威壓力下仍然建議使用該藥物。在任務框架、領域和模型家族之間的不一致對齊表明,當模型在高風險的專業環境中部署時,當前的對齊方法,包括已發表的對齊層級,可能不會穩健。

2605.12012v1 by Virgill van der Meer, Julien Rossi

Public-sector legal departments in the Netherlands face acute staff shortages, increased case volumes, and increased pressure to meet regulatory compliance. This paper presents LegalCheck, a novel system that addresses these challenges by automating the drafting of objection response letters through a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG). Using a large language model (LLM) alongside curated legal knowledge bases, LegalCheck performs retrieval of relevant laws and precedents, and uses controlled prompting to incorporate both external knowledge and case-specific details into a coherent draft. An expert-in-the-loop review ensures that each generated letter is legally sound and contextually appropriate. In a real-world deployment within the Municipality of Amsterdam, LegalCheck produced near-final advice letters in minutes rather than hours, while maintaining high legal consistency and factual accuracy. The output is based on actual regulations and prior cases, providing explainable outputs that captured the vast majority of required legal reasoning (often 80% to 100% of essential content). Legal professionals found that the system reduced their workload and ensured a consistent application of legal standards, without replacing human judgment. These results demonstrate substantial efficiency gains, improved legal consistency, and positive user acceptance. More broadly, this work illustrates how responsible AI can be deployed in the legal domain by augmenting LLMs with domain knowledge and governance mechanisms.

摘要:荷蘭的公共部門法律部門面臨著急迫的人力資源短缺、案件量增加以及滿足監管合規的壓力。這篇論文介紹了LegalCheck,一個通過結合檢索增強生成(RAG)和上下文增強生成(CAG)來解決這些挑戰的創新系統。LegalCheck使用大型語言模型(LLM)和經過策劃的法律知識庫,檢索相關法律和判例,並使用受控提示將外部知識和具體案件細節納入一致的草稿中。專家參與的審查確保每封生成的信件在法律上是合理的,並且在上下文上是合適的。在阿姆斯特丹市的實際部署中,LegalCheck能在幾分鐘內產生接近最終的建議信,而不是幾小時,同時保持高法律一致性和事實準確性。輸出基於實際的法規和先前的案件,提供可解釋的結果,捕捉到絕大多數所需的法律推理(通常為80\%到100\%的必要內容)。法律專業人士發現該系統減輕了他們的工作負擔,並確保法律標準的一致應用,而不取代人類的判斷。這些結果顯示出顯著的效率提升、改善的法律一致性以及積極的用戶接受度。更廣泛地說,這項工作說明了如何在法律領域中負責任地部署人工智慧,通過將領域知識和治理機制增強LLMs。

Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI

2605.11687v1 by Georgios Makridis, Georgios Fatouros, John Soldatos, George Katsis, Dimosthenis Kyriazis

Financial institutions increasingly require AI explanations that are persistent, cross-validated across methods, and conversationally accessible to human decision-makers. We present an architecture for human-centered explainable AI in financial sentiment analysis that combines three contributions. First, we treat XAI artifacts -- LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps -- as persistent, searchable objects in distributed S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, we enable multi-method explanation triangulation, where a retrieval-augmented generation (RAG) assistant compares and synthesizes results from multiple XAI methods applied to the same prediction, allowing users to assess explanation robustness through natural-language dialogue. Third, we evaluate the faithfulness of generated explanations using automated checks over grounding completeness, hallucinated claims, and method-attribution behavior. We demonstrate the architecture on an EXTRA-BRAIN financial sentiment analysis pipeline using FinBERT predictions and present evaluation results showing that constrained prompting reduces hallucination rate by 36% and increases method-attribution citations by 73% compared to naive prompting. We discuss implications for trustworthy, human-centered AI services in regulated financial environments.

摘要:金融機構越來越需要持久的、跨方法驗證的、並且對人類決策者可對話的AI解釋。我們提出了一種以人為中心的可解釋AI架構,用於金融情感分析,結合了三個貢獻。首先,我們將XAI工件——LIME特徵歸因、基於遮擋的單詞重要性分數和顯著性熱圖——視為在分佈式S3兼容存儲中具有結構化元數據和自然語言摘要的持久可搜索對象,從而實現對解釋歷史的語義檢索和系統故障後的自動索引重建。其次,我們實現了多方法解釋三角測量,其中檢索增強生成(RAG)助手比較並綜合應用於同一預測的多種XAI方法的結果,使用戶能夠通過自然語言對話評估解釋的穩健性。第三,我們使用自動檢查來評估生成解釋的真實性,檢查基礎完整性、虛構聲明和方法歸因行為。我們在一個使用FinBERT預測的EXTRA-BRAIN金融情感分析管道上演示了該架構,並呈現評估結果顯示,相較於天真的提示,限制提示將幻覺率降低了36\%,並將方法歸因引用增加了73\%。我們討論了在受監管的金融環境中,值得信賴的人為中心AI服務的影響。
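
As a rough illustration of the occlusion-based word importance scores mentioned in this abstract, the sketch below masks one word at a time and records how much a sentiment score drops. The `toy_score` function is a hypothetical stand-in for a real classifier such as FinBERT, and the `[MASK]` placeholder and function names are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List, Tuple


def occlusion_importance(
    text: str,
    score: Callable[[str], float],
    mask: str = "[MASK]",
) -> List[Tuple[str, float]]:
    """Importance of each word = score(original) - score(text with that word masked)."""
    words = text.split()
    base = score(text)
    results = []
    for i, word in enumerate(words):
        occluded = " ".join(words[:i] + [mask] + words[i + 1:])
        results.append((word, base - score(occluded)))
    return results


def toy_score(text: str) -> float:
    """Hypothetical sentiment scorer: +1 per positive word, -1 per negative word."""
    positive = {"profit", "growth"}
    negative = {"loss"}
    tokens = text.lower().split()
    return float(sum(t in positive for t in tokens) - sum(t in negative for t in tokens))


importances = occlusion_importance("Profit growth offset loss", toy_score)
for word, delta in importances:
    print(f"{word:>8s}  {delta:+.1f}")
```

A positive value means that removing the word lowers the score, so the word supported the prediction; a negative value means the word pushed against it. Scores of this kind are the sort of artifact the abstract describes storing and comparing across methods.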

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

2605.11679v2 by ShiYing Huang, Liang Lin, Yuer Li, Kaiwen Luo, Zhenhong Zhou, An Zhang, Junhao Dong, Kun Wang, Zhigang Zeng

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://github.com/Shiying-Huang/MORA-MPA.

摘要:在大型語言模型的多目標對齊領域中,平衡不同的人類偏好往往表現為零和衝突。具體而言,競爭目標之間的內在緊張關係決定了積極優化某一指標(例如,有用性)通常會對另一指標(例如,無害性)造成重大懲罰。雖然先前的工作主要集中在數據選擇、參數合併或訓練期間的算法平衡上,但這些方法僅僅是在固定的帕累托邊界上強迫不同偏好的妥協,未能根本解決固有的權衡問題。在本研究中,我們從多維獎勵的新視角來處理這一問題。通過擴大模型的回合並分析不同獎勵維度的輸出,我們得出了一個關鍵結論:多個目標之間的衝突源於提示本身固有地限制了可實現的多維獎勵。基於這一核心觀察,我們提出了 MORA:多目標獎勵同化。具體而言,MORA 通過預採樣來隔離單一獎勵提示,並通過重寫原始問題以納入多維意圖來擴展其獎勵多樣性。大量實驗表明:(1)在序列對齊中,MORA 實現了單一偏好的改進範圍為 5% 到 12.4%,在有用、無害和真實維度的多偏好對齊後,無害性方面的增益尤為顯著。(2)在同時對齊中,MORA 實現了平均整體獎勵提高 4.6%。我們的代碼可在 https://github.com/Shiying-Huang/MORA-MPA 獲得。

Native Explainability for Bayesian Confidence Propagation Neural Networks: A Framework for Trusted Brain-Like AI

2605.11595v1 by Georgios Makridis, Georgios Fatouros, John Soldatos, George Katsis, Dimosthenis Kyriazis

The EU Artificial Intelligence Act (Regulation 2024/1689), fully applicable to high-risk systems from August 2026, creates urgent demand for AI architectures that are simultaneously trustworthy, transparent, and feasible to deploy on resource-constrained edge devices. Brain-like neural networks built on the Bayesian Confidence Propagation Neural Network (BCPNN) formalism have re-emerged as a credible alternative to backpropagation-driven deep learning. They deliver state-of-the-art unsupervised representation learning, neuromorphic-friendly sparsity, and existing FPGA implementations that target edge deployment. Despite this momentum, no systematic framework exists for explaining BCPNN decisions -- a gap the present paper fills. We argue that BCPNN is, in the sense of Rudin's interpretable-by-design agenda, an inherently transparent model whose architectural primitives map directly onto established explainable-AI (XAI) families. We make four contributions. First, we propose the first XAI taxonomy for BCPNN. It maps weights, biases, hypercolumn posteriors, structural-plasticity usage scores, attractor dynamics, and input-reconstruction populations onto attribution, prototype, concept, counterfactual, and mechanistic explanation modalities. Second, we introduce sixteen architecture-level explanation primitives (P1--P16), several without analogue in standard ANNs. We provide closed-form algorithms for computing each from quantities the model already maintains. Third, we introduce five design-time Configuration-as-Explanation primitives (Config-P1 to Config-P5) that treat BCPNN hyperparameter choices as an auditable pre-deployment explanation artifact. Fourth, we sketch a roadmap for integration into industrial IoT deployments and discuss EU AI Act alignment, edge feasibility, and Industry 5.0 implications.

摘要:歐盟人工智慧法案(法規2024/1689)將於2026年8月全面適用於高風險系統,這對同時具備可信度、透明度且能在資源受限的邊緣設備上部署的人工智慧架構產生了迫切需求。基於貝葉斯信心傳播神經網絡(BCPNN)形式的類腦神經網絡重新成為反向傳播驅動的深度學習的可信替代方案。它們提供了最先進的無監督表示學習、適合神經形態的稀疏性,並且已有針對邊緣部署的FPGA實現。儘管有這樣的動力,卻沒有系統框架來解釋BCPNN的決策——這是本論文所填補的空白。我們主張,根據Rudin的可解釋設計議程,BCPNN是一種固有透明的模型,其架構原語直接映射到既有的可解釋人工智慧(XAI)類別。我們做出四項貢獻。首先,我們提出了BCPNN的第一個XAI分類法。它將權重、偏差、超列後驗、結構可塑性使用分數、吸引子動態和輸入重建人群映射到歸因、原型、概念、反事實和機制解釋模態。其次,我們介紹了十六個架構級解釋原語(P1--P16),其中幾個在標準人工神經網絡中沒有類似物。我們提供了從模型已經維持的數量計算每個的封閉形式算法。第三,我們介紹了五個設計時的配置作為解釋原語(Config-P1到Config-P5),將BCPNN的超參數選擇視為可審計的預部署解釋工件。第四,我們勾勒出整合到工業物聯網部署的路線圖,並討論歐盟人工智慧法案的對齊、邊緣可行性及工業5.0的影響。

Generative Diffusion Prior Distillation for Long-Context Knowledge Transfer

2605.11414v1 by Nilushika Udayangani, Kishor Nandakishor, Marimuthu Palaniswami

While traditional time-series classifiers assume full sequences at inference, practical constraints (latency and cost) often limit inputs to partial prefixes. The absence of class-discriminative patterns in partial data can significantly hinder a classifier's ability to generalize. This work uses knowledge distillation (KD) to equip partial time series classifiers with the generalization ability of their full-sequence counterparts. In KD, a high-capacity teacher transfers supervision to aid student learning on the target task. Matching with teacher features has shown promise in closing the generalization gap due to limited parameter capacity. However, when the generalization gap arises from training-data differences (full versus partial), the teacher's full-context features can be an overwhelming target signal for the student's short-context features. To provide progressive, diverse, and collective teacher supervision, we propose Generative Diffusion Prior Distillation (GDPD), a novel KD framework that treats short-context student features as degraded observations of the target full-context features. Inspired by the iterative restoration capability of diffusion models, we learn a diffusion-based generative prior over teacher features. Leveraging this prior, we posterior-sample target teacher representations that could best explain the missing long-range information in the student features and optimize the student features to be minimally degraded relative to these targets. GDPD provides each student feature with a distribution of task-relevant long-context knowledge, which benefits learning on the partial classification task. Extensive experiments across earliness settings, datasets, and architectures demonstrate GDPD's effectiveness for full-to-partial distillation.

摘要:傳統的時間序列分類器在推斷時假設完整的序列,但實際的限制(延遲和成本)常常將輸入限制為部分前綴。部分數據中缺乏類別區分模式可能會顯著妨礙分類器的泛化能力。本研究利用知識蒸餾(KD)來賦予部分時間序列分類器其完整序列對應物的泛化能力。在KD中,高容量的教師轉移監督以幫助學生在目標任務上的學習。與教師特徵匹配在縮小由於參數容量有限而產生的泛化差距方面顯示出希望。然而,當泛化差距源於訓練數據的差異(完整與部分)時,教師的完整上下文特徵可能對學生的短上下文特徵構成壓倒性的目標信號。為了提供漸進的、多樣的和集體的教師監督,我們提出生成擴散先驗蒸餾(GDPD),這是一個新穎的KD框架,將短上下文學生特徵視為目標完整上下文特徵的降級觀察。受到擴散模型迭代恢復能力的啟發,我們學習一個基於擴散的生成先驗,針對教師特徵。利用這個先驗,我們後驗抽樣能夠最好地解釋學生特徵中缺失的長程信息的目標教師表示,並優化學生特徵,使其相對於這些目標的降級程度最小。GDPD為每個學生特徵提供了一個與任務相關的長上下文知識的分佈,這對於部分分類任務的學習是有益的。在早期設置、數據集和架構上的廣泛實驗證明了GDPD在完整到部分蒸餾中的有效性。

What Do EEG Foundation Models Capture from Human Brain Signals?

2605.11410v2 by Ling Tang, Qian Chen, Jilin Mei, Houshi Xu, Quanshi Zhang, Jing Shao, Na Zou, Xia Hu, Dongrui Liu

Clinical electroencephalogram (EEG) analysis rests on a hand-crafted feature catalog refined over decades, e.g., band power, connectivity, complexity, and more. Modern EEG foundation models bypass this catalog, learn directly from raw signals via self-supervised pretraining, and match or outperform feature-engineered baselines on most clinical benchmarks. Whether the two representations align is an open question, which we decompose into three sub-questions: what does the model learn, what does the model use, and how much can be explained. We answer them with layer-wise ridge probing, LEACE-style cross-covariance subspace erasure, and a transparent classifier benchmarked against a random-feature baseline. The audit covers three foundation models (CSBrain, CBraMod, LaBraM), five clinical tasks (MDD, Stress, ISRUC-Sleep, TUSL, Siena), and a 6-family 63-feature lexicon. Of the $945$ (model, task, feature) units, $648$ ($68.6\%$) are representation-causal and $199$ ($21.1\%$) are encoded-only. Across tasks, $50$ features qualify as universal candidates with strong support (all three architectures RC) in two or more tasks. Frequency-domain features dominate, but the other five families each contribute substantial causal mass. Confirmed features recover, on average, $79.3\%$ of the foundation model's advantage over the random baseline, with a clean task gradient (MDD $\approx 0.99$ down to Stress $\approx 0.56$): tasks near ceiling are almost fully recovered by the lexicon, while harder tasks leave a non-trivial residual that pinpoints a concrete target for future concept discovery.

摘要:臨床腦電圖 (EEG) 分析依賴於經過數十年精煉的手工特徵目錄,例如,頻帶功率、連接性、複雜性等。現代 EEG 基礎模型繞過了這個目錄,通過自我監督的預訓練直接從原始信號中學習,並在大多數臨床基準上與特徵工程基準相匹配或超越。這兩種表示是否一致是一個未解的問題,我們將其分解為三個子問題:模型學到了什麼、模型使用了什麼,以及可以解釋多少。我們通過逐層脊回歸探測、LEACE 風格的交叉協方差子空間消除,以及一個透明的分類器進行回答,並以隨機特徵基準進行評估。審計涵蓋了三個基礎模型(CSBrain、CBraMod、LaBraM)、五個臨床任務(MDD、壓力、ISRUC-睡眠、TUSL、Siena),以及一個 6 家族 63 特徵的詞彙表。在 $945$ 個(模型、任務、特徵)單元中,$648$ 個($68.6\%$)是表示因果的,$199$ 個($21.1\%$)是僅編碼的。在各任務中,有 $50$ 個特徵符合強支持的通用候選條件(所有三個架構 RC)在兩個或更多任務中。頻域特徵佔主導地位,但其他五個家族各自貢獻了可觀的因果質量。確認的特徵平均恢復了基礎模型相對於隨機基準的 $79.3\%$ 的優勢,並具有清晰的任務梯度(MDD $\approx 0.99$ 下降到壓力 $\approx 0.56$):接近上限的任務幾乎完全由詞彙表恢復,而更困難的任務則留下了非平凡的殘差,為未來的概念發現指明了具體目標。
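
The layer-wise ridge probing named in this abstract can be sketched as follows: fit a closed-form ridge regression from a layer's activations to a hand-crafted feature and report held-out R². The synthetic "layers", the feature values, and the regularization strength below are assumptions for illustration only; the paper's actual probing setup is not given in the listing.

```python
import numpy as np


def ridge_probe_r2(H_train, y_train, H_test, y_test, lam=1.0):
    """Closed-form ridge probe: held-out R^2 for predicting y from activations H."""
    mu_h, mu_y = H_train.mean(axis=0), y_train.mean()
    Hc = H_train - mu_h
    d = Hc.shape[1]
    # w = (H^T H + lam I)^{-1} H^T y on centered data
    w = np.linalg.solve(Hc.T @ Hc + lam * np.eye(d), Hc.T @ (y_train - mu_y))
    pred = (H_test - mu_h) @ w + mu_y
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
n, d = 400, 16
feature = rng.normal(size=n)  # hypothetical hand-crafted feature, e.g. a band power

# Two synthetic "layers": one is pure noise, one linearly encodes the feature.
layers = {
    "layer_without_feature": rng.normal(size=(n, d)),
    "layer_with_feature": np.column_stack(
        [feature + 0.1 * rng.normal(size=n), rng.normal(size=(n, d - 1))]
    ),
}

scores = {
    name: ridge_probe_r2(H[:200], feature[:200], H[200:], feature[200:])
    for name, H in layers.items()
}
for name, r2 in scores.items():
    print(f"{name}: held-out R^2 = {r2:.3f}")
```

A high probe score only shows that a feature is linearly decodable (the "encoded" case). Whether the model actually uses it is what the erasure step in the abstract is meant to test. The unit count quoted there is consistent with 3 models × 5 tasks × 63 features = 945.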

Attributing Emergence in Million-Agent Systems

2605.11404v1 by Ling Tang, Jilin Mei, Qian Chen, Qihan Ren, Linfeng Zhang, Quanshi Zhang, Jing Shao, Xia Hu, Dongrui Liu

Large language models (LLMs) can simulate human-like reasoning and decision-making in individual agents. LLM-powered multi-agent systems (MAS) combine such agents to simulate population-scale social phenomena such as polarization, information cascades, and market panics. Such studies require attributing macro emergence to individual agents, but existing axiomatic methods scale combinatorially in $N$ and have been confined to $N \lesssim 10^3$, while the phenomena they explain occur at $N \geq 10^6$. We address this gap by adapting Aumann--Shapley path-integral attribution to LLM-powered MAS at million-agent scale; the resulting method satisfies all four axioms and runs four to five orders of magnitude faster than sampled Shapley on the same hardware. We use this method to test the scale gap empirically: across 14 days of public Bluesky data ($1{,}671{,}587$ active users), we compute the attribution at both full scale and the visibility-biased $N = 10^2$ convenience sample used by small-scale studies, and the two disagree structurally. At full scale the long tail and middle tier jointly carry the majority; the biased small panel attributes almost everything to a few high-follower accounts. We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post-hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small-scale and full-scale attribution. Full-scale attribution is therefore not a methodological choice but a theoretical requirement for any nonlinear macro indicator.

摘要:大型語言模型(LLMs)能夠模擬個體代理人的類人推理和決策過程。LLM 驅動的多代理系統(MAS)將這些代理人結合起來,以模擬人口規模的社會現象,如極化、資訊瀑布和市場恐慌。這類研究需要將宏觀出現歸因於個體代理人,但現有的公理方法在 $N$ 上呈組合性增長,並且被限制在 $N \lesssim 10^3$,而它們解釋的現象則發生在 $N \geq 10^6$。我們通過將 Aumann--Shapley 路徑積分歸因方法調整為百萬代理規模的 LLM 驅動 MAS 來填補這一空白;所得到的方法滿足所有四個公理,並且在相同硬體上運行速度比採樣的 Shapley 快四到五個數量級。我們使用此方法進行實證測試,以檢驗規模差距:在 14 天的公共 Bluesky 數據中($1{,}671{,}587$ 名活躍用戶),我們計算了全規模和小規模研究使用的可見性偏見樣本 $N = 10^2$ 的歸因,兩者在結構上存在不一致。在全規模下,長尾和中層共同承擔了大部分;而偏見的小樣本幾乎將所有歸因都歸於少數高追隨者帳戶。我們接著證明,在任何非線性宏觀指標下,不一致無法通過事後重縮放來減少:一個歸因縮放偏差定理顯示,沒有全球重縮放因子能夠調和小規模和全規模的歸因。因此,全規模歸因並非一種方法論選擇,而是任何非線性宏觀指標的理論要求。
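
For readers unfamiliar with Aumann--Shapley path-integral attribution, here is a minimal generic sketch: each agent's attribution is its input times the gradient of the macro indicator averaged along the straight path from a baseline to the observed state. The quadratic indicator `f`, its gradient, and the zero baseline are illustrative assumptions; this is not the paper's million-agent algorithm.

```python
import numpy as np


def aumann_shapley(grad_f, x, baseline=None, steps=64):
    """Attribution_i = (x_i - b_i) * integral_0^1 df/dx_i(b + t (x - b)) dt (midpoint rule)."""
    x = np.asarray(x, dtype=float)
    b = np.zeros_like(x) if baseline is None else np.asarray(baseline, dtype=float)
    ts = (np.arange(steps) + 0.5) / steps  # midpoints of equal sub-intervals of [0, 1]
    avg_grad = np.mean([grad_f(b + t * (x - b)) for t in ts], axis=0)
    return (x - b) * avg_grad


def f(x):
    """Toy nonlinear macro indicator: squared total activity (an assumption)."""
    return float(np.sum(x) ** 2)


def grad_f(x):
    """Analytic gradient of f: every coordinate has derivative 2 * sum(x)."""
    return 2.0 * np.sum(x) * np.ones_like(x)


x = np.array([1.0, 2.0, 3.0])
attr = aumann_shapley(grad_f, x)
print(attr, attr.sum(), f(x) - f(np.zeros(3)))
```

In this toy case the attributions sum to the change in the indicator between baseline and observed state, which is the efficiency axiom. The cost is one gradient evaluation per integration step instead of an enumeration over coalitions.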

Causal Fairness for Survival Analysis

2605.11362v1 by Drago Plecko

In the data-driven era, large-scale datasets are routinely collected and analyzed using machine learning (ML) and artificial intelligence (AI) to inform decisions in high-stakes domains such as healthcare, employment, and criminal justice, raising concerns about the fairness behavior of these systems. Existing works in fair ML cover tasks such as bias detection, fair prediction, and fair decision-making, but largely focus on static settings. At the same time, fairness in temporal contexts, particularly survival/time-to-event (TTE) analysis, remains relatively underexplored, with current approaches to fair survival analysis adopting statistical fairness definitions, which, even with unlimited data, cannot disentangle the causal mechanisms that generate disparities. To address this gap, we develop a causal framework for fairness in TTE analysis, enabling the decomposition of disparities in survival into contributions from direct, indirect, and spurious pathways. This provides a human-understandable explanation of why disparities arise and how they evolve over time. Our non-parametric approach proceeds in four steps: (1) formalizing the necessary assumptions about censoring and lack of confounding using a graphical model; (2) recovering the conditional survival function given covariates; (3) applying the Causal Reduction Theorem to reframe the problem in a form amenable to causal pathway decomposition; (4) estimating the effects efficiently. Finally, our approach is used to analyze the temporal evolution of racial disparities in outcome after admission to an intensive care unit (ICU).

摘要:在數據驅動的時代,大規模數據集被常規地收集並使用機器學習(ML)和人工智慧(AI)進行分析,以指導在醫療保健、就業和刑事司法等高風險領域的決策,這引發了對這些系統公平性行為的擔憂。現有的公平機器學習研究涵蓋了偏見檢測、公平預測和公平決策等任務,但主要集中在靜態環境中。與此同時,時間背景下的公平性,特別是生存/事件發生時間(TTE)分析,仍然相對未被充分探索,目前對公平生存分析的做法採用了統計公平性定義,即使在數據無限制的情況下,也無法解開導致差異的因果機制。為了填補這一空白,我們開發了一個針對TTE分析公平性的因果框架,使得能夠將生存中的差異分解為來自直接、間接和虛假途徑的貢獻。這提供了一個人類可理解的解釋,說明了為什麼差異會出現以及它們如何隨時間演變。我們的非參數方法分為四個步驟:(1)使用圖形模型形式化有關截尾和缺乏混淆的必要假設;(2)在給定協變量的情況下恢復條件生存函數;(3)應用因果減少定理將問題重新框架為適合因果途徑分解的形式;(4)有效地估計效果。最後,我們的方法被用來分析重症監護病房(ICU)入院後結果的種族差異的時間演變。

Human-AI Productivity Paradoxes: Modeling the Interplay of Skill, Effort, and AI Assistance

2605.11350v1 by Ali Aouad, Thodoris Lykouris, Huiying Zhong

Generative Artificial Intelligence (AI) tools are rapidly adopted in the workplace and in education, yet the empirical evidence on AI's impact remains mixed. We propose a model of human-AI interaction to better understand and analyze several mechanisms by which AI affects productivity. In our setup, human agents with varying skill levels exert utility-maximizing effort to produce certain task outcomes with AI assistance. We find that incorporating either endogeneity in skill development or in AI unreliability can induce a productivity paradox: increased levels of AI assistance may degrade productivity, leading to potentially significant shortfalls. Moreover, we examine the long-term distributional effect of AI on skill, and demonstrate that skill polarization can emerge in steady state when accounting for heterogeneity in AI literacy -- the agent's capability to identify and adapt to inaccurate AI outputs. Our results elucidate several mechanisms that may explain the emergence of human-AI productivity paradoxes and skill polarization, and identify simple measures that characterize when they arise.

摘要:生成式人工智慧(AI)工具在工作場所和教育中迅速被採用,但關於AI影響的實證證據仍然不一。我們提出了一個人類與AI互動的模型,以更好地理解和分析AI影響生產力的幾個機制。在我們的設置中,具有不同技能水平的人類代理人發揮效用最大化的努力,以在AI協助下產生某些任務結果。我們發現,納入技能發展的內生性或AI不可靠性都可能引發生產力悖論:增加AI協助的水平可能會降低生產力,導致潛在的重大短缺。此外,我們檢視AI對技能的長期分配效應,並證明當考慮到AI素養的異質性——代理人識別和適應不準確AI輸出的能力——時,技能極化可以在穩態中出現。我們的結果闡明了幾個可能解釋人類與AI生產力悖論和技能極化出現的機制,並確定了簡單的措施來描述它們何時出現。

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

2605.11205v1 by Jung Min Kang

Benchmark evaluation across AI and safety-critical domains overwhelmingly relies on simple averaging. We demonstrate that this practice produces substantially misleading rankings when two conditions co-occur: (1) the evaluation matrix is sparse and (2) items vary substantially in difficulty. Through controlled simulation experiments across four domains -- NLP (GLUE), clinical drug trials, autonomous vehicle safety, and cybersecurity -- we show that Spearman rank correlation $\rho$ between simple-average rankings and ground-truth rankings degrades from $\rho = 1.000$ at 100% coverage to $\rho = 0.809$ at 67% coverage with high difficulty heterogeneity (mean over 20 seeds). A standard two-parameter logistic (2PL) Item Response Theory (IRT) model maintains $\rho \geq 0.996$ across all conditions. A 150-condition grid sweep over sparsity $S \in [0, 0.70]$ and difficulty gap $D \in [0.5, 5.0]$ confirms that ranking error forms a failure surface with a strong $S \times D$ interaction ($\gamma_3 = +0.20$, $t = 13.05$), while IRT maintains $\rho \geq 0.993$ throughout. We discuss implications for Physical AI benchmarking, where evaluation matrices are often incomplete and difficulty gaps are extreme.

摘要:基準評估在人工智慧和安全關鍵領域中壓倒性地依賴於簡單平均。我們展示了當兩個條件同時出現時,這種做法會產生實質上誤導性的排名:(1) 評估矩陣是稀疏的,並且 (2) 項目在難度上有實質差異。通過在四個領域進行控制模擬實驗——自然語言處理(NLP)(GLUE)、臨床藥物試驗、自主車輛安全和網絡安全——我們顯示簡單平均排名與真實排名之間的斯皮爾曼秩相關 $ρ$ 在100% 覆蓋率時為 $ρ= 1.000$,而在67% 覆蓋率時降至 $ρ= 0.809$,且難度異質性高(20個種子的平均值)。標準的雙參數邏輯(2PL)項目反應理論(IRT)模型在所有條件下保持 $ρ\geq 0.996$。對稀疏性 $S \in [0, 0.70]$ 和難度差距 $D \in [0.5, 5.0]$ 進行的150條件網格掃描確認排名誤差形成了一個失效表面,並且存在強烈的 $S \times D$ 交互作用($γ_3 = +0.20$, $t = 13.05$),而IRT在整個過程中保持 $ρ\geq 0.993$。我們討論了物理AI基準測試的含義,其中評估矩陣通常是不完整的,且難度差距極端。
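
The failure mode described in this abstract can be reproduced in a few lines. In the sketch below, a stronger system is scored only on hard items and a weaker one only on easy items; simple averaging over observed items inverts the ranking, while a 2PL ability estimate does not. Item parameters are assumed known and responses are noise-free expected scores, both simplifications for illustration; a real IRT fit would estimate item and ability parameters jointly from binary outcomes.

```python
import numpy as np


def p_correct(theta, a, b):
    """2PL probability that ability theta solves an item with discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def mle_theta(resp, mask, a, b, grid=None):
    """Grid-search maximum-likelihood ability using observed items only (mask == 1)."""
    if grid is None:
        grid = np.linspace(-4.0, 4.0, 801)
    P = p_correct(grid[:, None], a[None, :], b[None, :])  # shape (grid, items)
    loglik = (mask * (resp * np.log(P) + (1.0 - resp) * np.log(1.0 - P))).sum(axis=1)
    return float(grid[np.argmax(loglik)])


a = np.ones(4)                        # item discriminations (assumed known)
b = np.array([-2.0, -1.0, 1.0, 2.0])  # item difficulties: two easy, two hard
theta_true = {"A": 1.0, "B": 0.0}     # system A is truly stronger than B
mask = {
    "A": np.array([0.0, 0.0, 1.0, 1.0]),  # A evaluated on hard items only
    "B": np.array([1.0, 1.0, 0.0, 0.0]),  # B evaluated on easy items only
}

simple_avg, irt_est = {}, {}
for name, theta in theta_true.items():
    resp = p_correct(theta, a, b)     # noise-free expected scores
    simple_avg[name] = float((resp * mask[name]).sum() / mask[name].sum())
    irt_est[name] = mle_theta(resp, mask[name], a, b)

print("simple average:", simple_avg)
print("2PL ability   :", irt_est)
```

Averaging ranks B above A because B's observed items are easier. The ability estimate accounts for item difficulty and recovers the true order.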

Interpretability Can Be Actionable

2605.11161v1 by Hadas Orgad, Fazl Barez, Tal Haklay, Isabelle Lee, Marius Mosbach, Anja Reusch, Naomi Saphra, Byron Wallace, Sarah Wiegreffe, Eric Wong, Ian Tenney, Mor Geva

Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability--the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions--concreteness and validation--and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.

摘要:可解釋性旨在解釋深度神經網絡的行為。儘管快速增長,但人們對這項工作的實際影響日益關注,這引發了對其相關性和實用性的質疑。這篇立場文件主張,缺失的核心成分不是新的方法,而是評估標準:可解釋性應該通過可行性來評估——即洞察力使具體決策和干預的程度,超越可解釋性研究本身。我們沿著兩個維度定義可行的可解釋性——具體性和驗證性——並分析目前阻礙現實世界影響的障礙。為了解決這些障礙,我們確定了五個可解釋性提供獨特槓桿的領域,並提出了一個可行的可解釋性框架,評估標準與實際結果相一致。我們的目標不是貶低探索性研究,而是將可行性確立為可解釋性研究的核心目標。

ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

2605.11091v1 by Shubhankit Singh, Hassan Shaikh, Kuldeep Raghuwanshi, Keshav Bulia

Automated ASD screening tools remain limited by single-architecture evaluations, axis-restricted assessment, and near-exclusive focus on adult cohorts, obscuring age-specific diagnostic patterns critical for early intervention. We introduce ASD-Bench, a systematic tabular benchmark evaluating ML, deep learning, and foundation model configurations across three age cohorts (children 1-11 yr, adolescents 12-16 yr, adults 17-64 yr) on four axes: predictive performance, calibration, interpretability, and adversarial robustness. Applied to a curated v3 dataset of 4,068 AQ-10 records, our benchmark spans classical models (XGBoost, AdaBoost, Random Forest, Logistic Regression), neural networks (MLP), deep tabular transformers (TabNet, TabTransformer, FT-Transformer), and TabPFN v2. We introduce the Heuristic Aggregate Penalty (HAP): a cost-sensitive metric penalising false negatives more heavily and incorporating cross-validation variance for deployment stability. Adult classification yields high performance (10/17 models achieve perfect F1 and AUC), while adolescents present a harder task (F1 ceiling 0.837 vs. 0.915 for children). Feature hierarchies shift across cohorts: A9 (social motivation) dominates for children, A5 (pattern recognition) leads for adolescents, and adults exhibit a flatter importance profile consistent with developmental social masking. Accuracy and calibration are dissociated: AdaBoost achieves F1=1.000 on adults with ECE=0.302, confirming single-metric evaluation is insufficient for clinical AI. Cohort-specific deployment recommendations are provided. All findings should be interpreted as proof-of-concept evidence on questionnaire-derived labels rather than clinically validated diagnostic performance.

摘要:自動化的自閉症譜系障礙(ASD)篩檢工具仍然受到單一架構評估、軸向限制評估和幾乎專注於成人群體的限制,這掩蓋了對於早期介入至關重要的年齡特定診斷模式。我們介紹了ASD-Bench,一個系統性的表格基準,評估機器學習、深度學習和基礎模型配置在三個年齡群體(1-11歲兒童、12-16歲青少年、17-64歲成人)上的表現,涵蓋四個軸向:預測性能、校準、可解釋性和對抗穩健性。應用於一個精心策劃的v3數據集,包括4,068條AQ-10記錄,我們的基準涵蓋了經典模型(XGBoost、AdaBoost、隨機森林、邏輯回歸)、神經網絡(MLP)、深度表格變換器(TabNet、TabTransformer、FT-Transformer)和TabPFN v2。我們引入了啟發式聚合懲罰(HAP):一種成本敏感的指標,對假陰性進行更嚴重的懲罰,並納入交叉驗證方差以提高部署穩定性。成人分類的表現優異(10/17模型達到完美的F1和AUC),而青少年則面臨更艱難的任務(F1上限為0.837,兒童為0.915)。特徵層次在不同群體中有所變化:A9(社會動機)在兒童中佔主導地位,A5(模式識別)在青少年中領先,而成人則顯示出與發展社會掩蓋一致的較平坦重要性輪廓。準確性和校準是分離的:AdaBoost在成人中達到F1=1.000,ECE=0.302,確認單一指標評估對臨床AI來說是不足的。提供了特定於群體的部署建議。所有發現應被解釋為基於問卷衍生標籤的概念驗證證據,而非臨床驗證的診斷性能。
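
The dissociation between accuracy and calibration reported in this abstract is easy to see with a small Expected Calibration Error (ECE) computation. The sketch below uses the common equal-width binning over predicted-class confidence; the bin count and the toy probabilities are assumptions, and the paper's exact ECE variant is not specified in the listing.

```python
import numpy as np


def expected_calibration_error(y_true, p_pos, n_bins=10):
    """ECE for a binary classifier: weighted mean |accuracy - confidence| over confidence bins."""
    y_true = np.asarray(y_true)
    p_pos = np.asarray(p_pos, dtype=float)
    pred = (p_pos >= 0.5).astype(int)
    conf = np.where(pred == 1, p_pos, 1.0 - p_pos)  # confidence in the predicted class
    correct = (pred == y_true).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)


# Toy classifier: every prediction is correct, but always reported with 0.7 confidence.
y_true = np.array([1, 1, 1, 0, 0, 0])
p_pos = np.array([0.7, 0.7, 0.7, 0.3, 0.3, 0.3])

accuracy = float(((p_pos >= 0.5).astype(int) == y_true).mean())
ece = expected_calibration_error(y_true, p_pos)
print(f"accuracy = {accuracy:.3f}, ECE = {ece:.3f}")
```

This toy model is perfectly accurate yet systematically underconfident, so its ECE is large. It is the same qualitative pattern as the AdaBoost result quoted above.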

Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography

2605.10871v2 by Timothy Oladunni, Farouk Ganiyu Adewumi

This work proposes Attractor-Vascular Coupling Theory (AVCT), a mathematical framework showing that cardiac attractor geometry encodes blood pressure (BP) information sufficient for AAMI-standard estimation, and validates the theory through a calibrated cuffless BP model using photoplethysmography (PPG). AVCT is grounded in Cardiac Stability Theory and operationalized using Takens delay embedding and attractor morphology extraction. Two theorems, one proposition, and one corollary formally justify the use of PPG attractor features for BP estimation and predict the feature-importance hierarchy. A LightGBM model trained on pulse transit time (PTT) and Cardiac Stability Index (CSI) attractor features under single-point calibration was evaluated using strict leave-one-subject-out cross-validation (LOSO-CV) on 46 subjects from BIDMC ICU (n = 9) and VitalDB surgical data (n = 37), comprising 29,684 windows. The model achieved systolic BP (SBP) mean absolute error (MAE) of 2.05 mmHg and diastolic BP (DBP) MAE of 1.67 mmHg, with correlations r = 0.990 and r = 0.991, satisfying the AAMI/IEEE SP10 requirement of MAE below 5 mmHg. Median per-subject MAE was 1.87/1.54 mmHg, and 70%/76% of subjects individually satisfied AAMI criteria. A PPG-only ablation using nine smartphone attractor features matched the ECG+PPG model within 0.05 mmHg, demonstrating that clinical-grade BP tracking is achievable using only a smartphone camera while surpassing prior generalized LOSO-CV results using fewer sensors. All four AVCT predictions were quantitatively confirmed, with 91.5% error reduction from uncalibrated to calibrated estimation (epsilon_cal = 0.915). Unlike post-hoc explainable AI methods, AVCT predicts features satisfying the architectural faithfulness criterion of the Explainable-AI Trustworthiness (EAT) framework and grounding BP estimation in nonlinear dynamical systems theory.

摘要:這項工作提出了吸引子-血管耦合理論(AVCT),這是一個數學框架,顯示心臟吸引子幾何形狀編碼了足夠用於 AAMI 標準估算的血壓(BP)信息,並通過一個使用光電容積描記法(PPG)的經過校準的無袖帶血壓模型來驗證該理論。AVCT 基於心臟穩定性理論,並使用 Takens 延遲嵌入和吸引子形態提取進行操作。兩個定理、一個命題和一個推論正式證明了使用 PPG 吸引子特徵進行 BP 估算的合理性,並預測了特徵重要性層級。基於脈搏傳輸時間(PTT)和心臟穩定性指數(CSI)吸引子特徵的 LightGBM 模型在單點校準下進行評估,使用嚴格的留一個受試者交叉驗證(LOSO-CV),在來自 BIDMC ICU(n = 9)和 VitalDB 手術數據(n = 37)的 46 名受試者中,包含 29,684 個窗口。該模型達到了收縮壓(SBP)平均絕對誤差(MAE)為 2.05 mmHg,舒張壓(DBP)MAE 為 1.67 mmHg,相關性 r = 0.990 和 r = 0.991,滿足 AAMI/IEEE SP10 對 MAE 低於 5 mmHg 的要求。每位受試者的中位數 MAE 為 1.87/1.54 mmHg,70%/76% 的受試者個別滿足 AAMI 標準。使用九個智能手機吸引子特徵的僅 PPG 消融模型與 ECG+PPG 模型的匹配誤差在 0.05 mmHg 內,證明了僅使用智能手機相機即可實現臨床級的 BP 追蹤,並超越了之前使用較少傳感器的通用 LOSO-CV 結果。所有四個 AVCT 預測均得到了定量確認,從未校準到已校準的估算中,誤差減少了 91.5%(epsilon_cal = 0.915)。與事後可解釋的 AI 方法不同,AVCT 預測的特徵滿足可解釋 AI 可信度(EAT)框架的架構忠實性標準,並將 BP 估算基於非線性動態系統理論。
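
Takens delay embedding, which this abstract uses to turn a one-dimensional PPG trace into an attractor, can be sketched as below. The embedding dimension, the delay, and the synthetic pulse-like signal are assumptions chosen for illustration; the paper's parameter choices and its attractor-morphology features are not given in the listing.

```python
import numpy as np


def delay_embed(x, dim=3, tau=5):
    """Return the (n, dim) matrix whose rows are [x[t], x[t + tau], ..., x[t + (dim-1)*tau]]."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    if n <= 0:
        raise ValueError("signal too short for the requested dim and tau")
    return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)


# Small check on an index sequence so the row layout is easy to read.
demo = delay_embed(np.arange(10), dim=3, tau=2)
print(demo[0], demo[-1], demo.shape)

# Synthetic pulse-like signal (hypothetical stand-in for a PPG window at 100 Hz).
t = np.arange(0, 10, 0.01)
signal = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.sin(2 * np.pi * 2.4 * t)
attractor = delay_embed(signal, dim=3, tau=20)
print(attractor.shape)
```

Each row is one point of the reconstructed trajectory. Geometric summaries of that point cloud are the kind of attractor features the abstract refers to.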

New AI-Driven Tools for Enhancing Campus Well-being: A Prevention and Intervention Approach

2605.10804v1 by Jinwen Tang

Campus well-being underpins academic success, yet many universities lack effective methods for monitoring satisfaction and detecting mental health risks. This dissertation addresses these gaps through prevention (improving feedback collection) and intervention (advancing mental health detection), unified under an integrated framework. For prevention, we developed TigerGPT, a personalized survey chatbot leveraging LLMs to engage users in context-aware conversations grounded in conversational design and engagement theory, achieving 75% usability and 81% satisfaction. To address its limitations in repetitiveness and response depth, we introduced AURA, a reinforcement-learning framework that adapts follow-up question types (validate, specify, reflect, probe) within a session using an LSDE quality signal (Length, Self-disclosure, Emotion, Specificity), initialized from 96 prior conversations. AURA achieved +0.12 mean quality gain (p=0.044, d=0.66), with 63% fewer specification prompts and 10x more validation behavior. For intervention, we examine Expressive Narrative Stories (ENS) for mental health screening, showing BERT(128) captures nuanced linguistic features without keyword cues, while conventional classifiers depend heavily on explicit mental health terms. We then developed PsychoGPT, an LLM built on DSM-5 and PHQ-8 guidelines that performs initial distress classification, symptom-level scoring, and reconciliation with external ratings for explainable assessment. To reduce hallucinations, we proposed Stacked Multi-Model Reasoning (SMMR), layering expert models where early layers handle localized subtasks and later layers reconcile findings, outperforming single-model solutions on DAIC-WOZ in accuracy, F1, and PHQ-8 scoring. Finally, a cohesive framework unifies these tools, enabling adaptive survey insights to flow directly into specialized mental health detection models.

Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control

2605.10624v1 by Ramesh Arvind Naagarajan, Zühal Wagner, Stefan Streif

Model Predictive Control (MPC) is widely used to operate safety-critical infrastructure by predicting future trajectories and optimizing control actions. However, nonlinear dynamics, hard safety constraints, and numerical optimization often render individual control moves opaque to human operators, undermining trust and hindering deployment. This paper presents Hierarchical Causal Abduction (HCA), which combines (i) physics-informed reasoning via domain knowledge graphs, (ii) optimization evidence from Karush-Kuhn-Tucker (KKT) multipliers, and (iii) temporal causal discovery via the PCMCI algorithm to generate faithful, human-interpretable explanations for control actions computed by nonlinear MPC. Across three diverse control applications (greenhouse climate, building HVAC, chemical process engineering) with expert validation, HCA improves explanation accuracy by 53% over LIME (0.478 vs. 0.311) using a single set of cross-domain parameters without per-domain tuning; domain-specific KKT-threshold calibration over 2-3 days further increases accuracy to 0.88. Ablation studies confirm that each evidence source is essential, with 32-37% accuracy degradation when any component is removed, and HCA's ranking-and-validation methodology generalizes beyond MPC to other prediction-based decision systems, including learning-based control and trajectory planning.
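To see why KKT multipliers can serve as optimization evidence: by complementary slackness an inactive inequality constraint has a zero multiplier, and a large multiplier means the optimal cost is sensitive to that constraint. The sketch below ranks constraints on that basis. It is only an illustration; the function name, the threshold default, and the example constraint names are hypothetical and not taken from the paper.

```python
def rank_active_constraints(multipliers, threshold=1e-6):
    """Return (name, multiplier) pairs whose magnitude exceeds `threshold`,
    sorted from most to least influential on the optimal cost."""
    active = [(name, lam) for name, lam in multipliers.items() if abs(lam) > threshold]
    return sorted(active, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical multipliers reported by an MPC solver at one time step.
example = {"temp_upper_bound": 4.2, "valve_rate_limit": 0.3, "humidity_lower_bound": 0.0}
ranking = rank_active_constraints(example)
```

Here `ranking` lists the temperature bound first, so an explanation would name it as the constraint that most shaped the control move. The threshold is the kind of parameter that would plausibly need per-domain calibration.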

The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

2605.10601v1 by Phongsakon Mark Konrad, Tim Lukas Adam, Ane Cathrine Holst Merrild, Riccardo Terrenzi, Rebecca De Rosa, Toygar Tanyel, Serkan Ayvaz

AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

2605.10194v1 by Jiaxuan Wang, Xuan Ouyang, Zhiyu Chen, Yulan Hu, Zheng Pan, Xin Li, Lan-Zhe Guo

On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing entropy rise, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
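The token-routing idea can be sketched as a loss that applies a KL term only on marked spans and ignores the rest. This is an illustration under stated assumptions, not the authors' implementation: forward KL is taken as KL(teacher || student) and reverse KL as KL(student || teacher), the linear decay schedule and the `warmup` value are invented, and the GRPO term on the remaining tokens is omitted.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def routed_distill_loss(teacher, student, routes, step, warmup=100):
    """Distillation loss summed over routed tokens only.

    teacher, student: per-token probability distributions.
    routes: per-token label: "key" (forward KL), "error" (reverse KL), or "rest" (no KL).
    The KL weight decays linearly to zero over `warmup` steps (assumed schedule).
    """
    weight = max(0.0, 1.0 - step / warmup)
    total = 0.0
    for p_t, p_s, route in zip(teacher, student, routes):
        if route == "key":
            total += kl(p_t, p_s)   # pulls the student toward teacher-supported tokens
        elif route == "error":
            total += kl(p_s, p_t)   # pushes the student away from teacher-rejected tokens
        # "rest" tokens would be trained by the policy-gradient objective alone
    return weight * total
```

Because "rest" tokens contribute nothing, the privileged-context gradient is confined to the marked spans and vanishes entirely once the weight reaches zero.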

A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection

2605.10181v1 by Jihyeon Baek, Seunghoon Lee, Gitaek Kwon, Doohyun Park

Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.
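For reference, an AUROC of 1.000 means every OOD image receives a higher score than every in-distribution image. The pairwise definition below makes that concrete. It is a generic sketch, not the study's evaluation code, and the example scores are invented.

```python
def auroc(scores_ood, scores_in):
    """Probability that a random OOD sample scores above a random in-distribution
    sample, counting ties as one half (the Mann-Whitney formulation)."""
    wins = 0.0
    for p in scores_ood:
        for n in scores_in:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_ood) * len(scores_in))

perfect = auroc([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])   # fully separated scores
chance = auroc([0.5, 0.5], [0.5, 0.5])                # indistinguishable scores
```

The quadratic loop is fine for illustration. Sorting-based implementations are preferable for datasets of tens of thousands of images.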

Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

2605.10142v1 by Mateusz Cedro, Marcin Chlebus

Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
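A compact way to read the two localisation metrics is shown below on flattened attribution maps. Relevance Rank Accuracy follows its usual definition (the share of the top-K attributed pixels that fall inside the mask, with K equal to the mask size). The `dual_polarity_precision` function is only one plausible reading of the one-sentence description in the abstract, and its normalisation by total absolute attribution is an assumption.

```python
def relevance_rank_accuracy(attr, mask):
    """Fraction of the K highest-attribution pixels that lie in the mask, K = mask size."""
    k = sum(mask)
    if k == 0:
        return 0.0
    top = sorted(range(len(attr)), key=lambda i: attr[i], reverse=True)[:k]
    return sum(mask[i] for i in top) / k

def dual_polarity_precision(attr, mask):
    """Hypothetical reading: share of attribution mass that is positive inside
    the mask or negative outside it."""
    pos_inside = sum(a for a, m in zip(attr, mask) if a > 0 and m)
    neg_outside = sum(-a for a, m in zip(attr, mask) if a < 0 and not m)
    total = sum(abs(a) for a in attr)
    return (pos_inside + neg_outside) / total if total else 0.0

attr = [0.9, 0.8, -0.5, 0.1]   # toy attribution map, flattened
mask = [1, 1, 0, 0]            # toy ground-truth segmentation mask
```

On this toy input the two highest attributions both fall inside the mask, and only the small positive value outside it counts against the dual-polarity score.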

Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

2605.10125v2 by Anthea Dathe, Kiran Hoffmann, Aline Mangold

Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q&A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation and remain prone to errors. Suitable benchmarks are needed to document and evaluate arising issues. Nevertheless, existing benchmarking approaches are not adequately capturing human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q&A tools can offer valuable overviews and generally accurate summaries; however, they are not always reliable for precise information extraction. Explainable AI (xAI) accuracy was particularly low, meaning highlighted source passages frequently failed to correspond to generated answers. This shifted the burden of validation back onto the researcher. Literature review tools supported exploratory searches but showed low reproducibility, limited transparency regarding chosen sources and databases, and inconsistent source quality, making them unsuitable for systematic reviews. A comparison of these tool groups reveals a similar pattern: while AI tools can enhance efficiency in the early stages of the research workflow and shallow tasks, their outputs still require human verification. The findings underscore the importance of explainability features to enhance transparency, verification efficiency and careful integration of AI tools into researchers' workflows. Further, human-centered evaluation remains an important concern to ensure practical applicability.

Explainability of Recurrent Neural Networks for Enhancing P300-based Brain-Computer Interfaces

2605.10121v1 by Christian Oliva, Vinicio Changoluisa, Francisco B Rodríguez, Luis F Lago-Fernández

Brain-Computer Interfaces (BCIs) based on P300 event-related potentials offer promising applications in health, education, and assistive technologies. However, challenges related to inter- and intra-subject variability and the explainability of Deep Learning (DL) models limit their practical deployment. In this work, we present the Post-Recurrent Module (PRM), an additional layer designed to improve both performance and transparency, incorporated into a Recurrent Neural Network (RNN) architecture for classifying P300 signals from EEG data. Our approach enables a dual analysis of spatio-temporal signals through both global and local explainability techniques, allowing us not only to identify the most relevant brain regions and critical time intervals involved in classification, but also to interpret model decisions in terms of spatio-temporal EEG patterns consistent with well-established neurophysiological descriptions of the P300. Experimental results show a 9% improvement in performance over the state of the art, while also revealing the importance of inter- and intra-subject variability, in alignment with established neuroscience literature. By making model decisions transparent and efficient, we present a framework for explainable EEG-based models. This framework is not limited to more efficient P300 detection, but can be generalized to a wide range of EEG-based tasks. Its ability to identify key spatial and temporal features makes it suitable for applications such as motor imagery, steady-state visual evoked potentials, and even cognitive workload assessment.

An LLM-RAG Approach for Healthy Eating Index-Informed Personalized Food Recommendations

2605.15213v1 by Yibin Wang, Yanjie Yang, Grace Melo Guerrero, Rodolfo M. Nayga, Azlan Zahid

Diet quality is a leading determinant of chronic disease risk. Advances in artificial intelligence (AI) have enabled food recommendation systems to adapt suggestions to user preferences and health goals. However, most current systems rely on loosely curated food databases and provide limited connection to a validated index. In this study, we propose a Healthy Eating Index (HEI)-informed retrieval-augmented generation (RAG) framework that combines standardized nutrition databases with large language models (LLMs) for personalized food recommendations. Our proposed method anchors retrieval in the National Health and Nutrition Examination Survey (NHANES) and the Food Patterns Equivalents Database (FPED). A food-level embedding space is constructed from FPED-derived textual descriptions. For each entity, the system computes baseline HEI scores, retrieves candidate foods for intake recommendations, and estimates the HEI impact of simple substitutions or additions. A constrained RAG pipeline instantiated with a pretrained OpenAI LLM generates personalized recommendations and sources based on nutrient profiles and HEI contributions. The simulation results showed a mean HEI improvement of 6.45, with the proportion of users whose HEI exceeds 50 increasing from 45.12 to 61.26. Quantile analysis revealed consistent upward shifts across the HEI distribution. Our findings suggest that the proposed LLM-RAG-based AI systems can support more precise, explainable, and personalized nutrition guidance to improve diet quality.

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

2605.09887v1 by Eslam Zaher, Maciej Trzaskowski, Quan Nguyen, Fred Roosta

Sparse autoencoders (SAEs) operationalise the linear representation hypothesis: they reconstruct model activations as sparse linear combinations of interpretable dictionary atoms, on the implicit assumption that activation space is well approximated by a globally linear structure. Their reconstruction error varies sharply across layers in ways that existing scaling laws, fitted at single layers, do not explain. We argue that this variation is the empirical trace of a geometric mismatch: where the activation manifold is curved and its intrinsic dimension varies across layers, no sparse linear dictionary can match it uniformly, and the SAE's width-sparsity scaling becomes a layer-dependent function of manifold structure rather than a single universal law. We conduct the first cross-layer SAE scaling study, fitting and regressing on 844 residual-stream Gemma Scope SAE checkpoints across 68 layers of Gemma 2 2B and 9B. Stage 1 fits a per-layer scaling-law surface; Stage 2 regresses the fitted parameters and the derived per-layer width exponents on four layerwise geometric summaries. We find that manifold geometry predicts the per-layer width exponent in both models, and that the same regression coefficients learnt on one model predict the other model's per-layer exponents under cross-model transfer, indicating a transferable geometric law. At the showcase layers where richer width grids permit identification of the asymptotic floor, we find that the fitted floor tracks the layerwise geometric ordering: higher curvature and intrinsic dimension correspond to higher floor, consistent with the irreducible second-order residual that any sparse linear approximation of a curved manifold must leave behind. SAEs thus encounter not a finite-resource ceiling but a geometry-dependent wall, set by the manifold they are trying to reconstruct.

Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI

2605.09852v1 by Gideon Popoola, John Sheppard

Machine learning algorithms are being used in high-stakes decisions, including those in criminal justice, healthcare, credit, and employment. The research community has responded with two largely independent research fields: algorithmic fairness, which targets equitable outcomes, and explainable AI (XAI), which targets interpretable reasoning. This survey identifies and maps a novel blind spot at their intersection: a model can satisfy every standard fairness criterion in its outputs while being profoundly unfair in its reasoning process. We refer to this as procedural bias, and mitigating it requires treating the fairness of explanations as a distinct object of scientific study. To our knowledge, we provide the first unified theoretical and literature review of this emerging field and elucidate the drawbacks of post-hoc explainers in certifying explanation fairness. Our central contribution is a conditional invariance framework formalizing explanation fairness as the requirement that explanations be invariant to the protected attribute, $P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = b)$ for all task-relevant $x_\text{rel}$, a single principle from which all existing explanation fairness metrics emerge as partial operationalizations. We introduce a seven-dimensional taxonomy, identify three generative mechanisms of explanation inequity (representation-driven, explanation-model mismatch, actionability-driven), and propose a canonical six-step evaluation workflow for operationalizing explanation fairness audits in practice.
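The conditional invariance condition can be checked empirically on discretised data: within each stratum of task-relevant features, the distribution of explanations should be the same for every protected group. The sketch below reports the largest total variation distance found between any two groups in any stratum. It is an illustrative audit under assumptions introduced here (hashable, discretised explanations; total variation as the distance; the name `max_conditional_gap`), not the paper's workflow.

```python
from collections import Counter, defaultdict

def max_conditional_gap(records):
    """records: iterable of (x_rel, group, explanation) triples, all hashable.
    Returns the largest total variation distance between the explanation
    distributions of two groups that share the same x_rel. A value of 0.0
    means the sample satisfies conditional invariance exactly."""
    strata = defaultdict(lambda: defaultdict(Counter))
    for x_rel, group, explanation in records:
        strata[x_rel][group][explanation] += 1
    worst = 0.0
    for groups in strata.values():
        dists = []
        for counts in groups.values():
            n = sum(counts.values())
            dists.append({e: c / n for e, c in counts.items()})
        for i in range(len(dists)):
            for j in range(i + 1, len(dists)):
                keys = set(dists[i]) | set(dists[j])
                tv = 0.5 * sum(abs(dists[i].get(k, 0.0) - dists[j].get(k, 0.0)) for k in keys)
                worst = max(worst, tv)
    return worst
```

With finite samples the gap is rarely exactly zero, so a practical audit would compare it against a tolerance or a permutation baseline.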

TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma

2605.11033v1 by JC Wu, Norton Lee, Kai Siang Chen

TokaMind is a multi-modal transformer (MMT) foundation model pre-trained on tokamak plasma diagnostics data from MAST, where it was shown to outperform CNN-based approaches on fusion benchmarks. We investigate whether its learned representations generalize to physically distinct but structurally analogous domains. Through systematic experimentation across four domains (industrial bearing degradation, NASA CMAPSS turbofan degradation, and two independent power grid PMU datasets), we identify four transfer-favoring characteristics that help explain where TokaMind's pretrained representations are most effective. Power grid synchrophasor data matches this target-domain profile most directly, while industrial degradation datasets demonstrate that TokaMind can still yield useful performance under partial alignment, especially when task design and feature construction expose physically meaningful degradation structure. On the GESL/PNNL 500-event benchmark with provider-aware evaluation, TokaMind achieves test F1 = 0.837 ± 0.040 (3 seeds) for severe event classification. Our central finding, however, is not the aggregate score: classification difficulty is structurally determined by provider-level grid topology, not model capacity. In the single-window early-warning regime, TokaMind outperforms a CNN baseline (F1 0.889 vs. 0.878), a reversal that disappears as more event windows are provided. Furthermore, Critical Slowing Down (CSD) indicators, used as a confidence gate rather than a classification label, improve F1 from 0.696 to 0.750 at 63% coverage, outperforming the CNN baseline (0.636) at any coverage level. These results establish the first cross-domain validation of TokaMind outside nuclear fusion and propose a transferability framework and revised evaluation protocol for multi-source PMU datasets.
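Using an indicator as a confidence gate means the classifier only answers when the indicator clears a threshold, so that F1 on the retained events is traded against coverage. The sketch below shows that bookkeeping. The confidence values, the threshold, and the function names are hypothetical, and how a CSD indicator would be turned into a confidence value is not specified here.

```python
def f1_score(y_true, y_pred):
    """Binary F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def gated_f1(y_true, y_pred, confidence, threshold):
    """Keep only predictions whose confidence is at least `threshold`.
    Returns (coverage, F1 on the kept predictions)."""
    kept = [i for i, c in enumerate(confidence) if c >= threshold]
    coverage = len(kept) / len(y_true)
    return coverage, f1_score([y_true[i] for i in kept], [y_pred[i] for i in kept])

y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
confidence = [0.9, 0.2, 0.3, 0.8]   # hypothetical gate values
```

On this toy data the ungated F1 is 0.5, while gating at 0.5 keeps half of the events and scores 1.0 on them, which is the same kind of trade-off as the reported move from 0.696 to 0.750 at 63% coverage.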

Attribution-based Explanations for Markov Decision Processes

2605.09780v2 by Paul Kobialka, Andrea Pferscher, Francesco Leofante, Erika Ábrahám, Silvia Lizeth Tapia Tarifa, Einar Broch Johnsen

Attribution techniques explain the outcome of an AI model by assigning a numerical score to its inputs. So far, these techniques have mainly focused on attributing importance to static input features at a single point in time, and thus fail to generalize to sequential decision-making settings. This paper fills this gap by introducing techniques to generate attribution-based explanations for Markov Decision Processes (MDPs). We give a formal characterization of what attributions should represent in MDPs, focusing on explanations that assign importance scores to both individual states and execution paths. We show how importance scores can be computed by leveraging techniques for strategy synthesis, enabling the efficient computation of these scores despite the non-determinism inherent in an MDP. We evaluate our approach on five case studies, demonstrating its utility in providing interpretable insights into the logic of sequential decision-making agents.
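One way to picture a state-level importance score is as a counterfactual: how much does the best achievable probability of reaching the goal drop if a state is made unusable? The sketch below computes that with plain value iteration. This definition is a hypothetical stand-in chosen for illustration, since the abstract does not give the paper's formal characterization, and the toy MDP is invented.

```python
def max_reach_prob(mdp, goal, start, blocked=frozenset(), iters=1000, tol=1e-12):
    """Maximal probability of reaching `goal` from `start` by value iteration.
    mdp: {state: {action: [(prob, next_state), ...]}}; every state must be a key.
    States in `blocked` are treated as dead ends."""
    value = {s: 0.0 for s in mdp}
    value[goal] = 1.0
    for _ in range(iters):
        delta = 0.0
        for s, actions in mdp.items():
            if s == goal or s in blocked or not actions:
                continue
            best = max(sum(p * value[t] for p, t in outcomes) for outcomes in actions.values())
            delta = max(delta, abs(best - value[s]))
            value[s] = best
        if delta < tol:
            break
    return value[start]

def state_importance(mdp, goal, start, state):
    """Drop in the optimal reach probability when `state` is blocked (assumed definition)."""
    return max_reach_prob(mdp, goal, start) - max_reach_prob(mdp, goal, start, blocked={state})

toy = {
    "s0": {"a": [(1.0, "s1")], "b": [(0.5, "s2"), (0.5, "fail")]},
    "s1": {"go": [(0.9, "goal"), (0.1, "fail")]},
    "s2": {"go": [(1.0, "goal")]},
    "goal": {},
    "fail": {},
}
```

In the toy MDP the optimal strategy goes through `s1`, so blocking `s1` costs reach probability while blocking `s2` costs nothing. Taking the maximum over actions is what resolves the non-determinism.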

Sequential Feature Selection for Efficient Landslide Segmentation from Multi-Spectral Data

2605.09746v1 by Arsalaan Ahmad, Oktay Karakus, Paul L. Rosin

Landslide detection from satellite imagery has advanced through deep learning, yet most models rely on large, highly correlated spectral-topographic inputs whose contributions remain poorly understood. The question of which channels are actually necessary has received surprisingly little attention. This matters: redundant or correlated inputs obscure physical interpretability, inflate computational overhead, and can actively degrade model performance through the Hughes Phenomenon. We present a systematic, explainable channel-selection framework for the Landslide4Sense benchmark, combining Sentinel-2 multispectral and ALOS PALSAR terrain data with 16 engineered spectral and structural indices. Rather than relying on conventional single-band drop tests, which evaluate channels in isolation and miss interaction effects, we apply Sequential Forward Floating Selection (SFFS) to iteratively build and prune a candidate feature pool using a lightweight U-Net++ proxy model. Beyond identifying a compact 8-channel subset that matches or exceeds the segmentation F1 of configurations using up to 30 channels, we use the selection process itself to interrogate which spectral and topographic features landslide models genuinely rely on, and what this reveals about the physical cues driving their predictions. We argue that SFFS represents a principled feature selection approach to input design in Earth observation, in contrast to the prevailing practice of appending every available band and hoping the model learns what to ignore.
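Sequential Forward Floating Selection alternates a greedy forward step with conditional backward steps, which lets it revise earlier choices when channels interact. The sketch below is a generic version with a caller-supplied `score` function. In the paper's setting that score would come from training the proxy segmentation model, whereas here it is an invented toy function and the channel names are placeholders.

```python
def sffs(features, score, k):
    """Generic SFFS. Returns {subset_size: (best_score, best_subset)} for sizes up to k.
    `score` maps a list of features to a number, higher is better."""
    selected, best = [], {}
    while len(selected) < k:
        # Forward step: add the feature that helps most right now.
        add = max((f for f in features if f not in selected),
                  key=lambda f: score(selected + [f]))
        selected = selected + [add]
        s = score(selected)
        if s > best.get(len(selected), (float("-inf"), None))[0]:
            best[len(selected)] = (s, list(selected))
        # Floating step: drop features while that beats the best smaller subset seen.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_red = score(reduced)
            if s_red > best[len(reduced)][0]:
                selected, best[len(reduced)] = reduced, (s_red, list(reduced))
            else:
                break
    return best

def toy_score(subset):
    """Invented score: channel 'a' is strongest alone, 'b' and 'c' are synergistic."""
    weights = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 0.5}
    bonus = 3.0 if "b" in subset and "c" in subset else 0.0
    return sum(weights[f] for f in subset) + bonus

result = sffs(["a", "b", "c", "d"], toy_score, k=3)
```

Plain forward selection would settle on `a` and `b` as the best pair. The floating step discovers that `b` and `c` together score higher, which is the interaction effect that single-band drop tests miss.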

Medical Model Synthesis Architectures: A Case Study

2605.09716v1 by Katherine M. Collins, Marlene Berke, Ilia Sucholutsky, Ayman Ali, Adrian Weller, Timothy J. O'Donnell, Tyler Brooke-Wilson, Lionel Wong, Joshua B. Tenenbaum

Medicine is rife with high-stakes uncertainty. Doctors routinely make clinical judgments and decisions that juggle many fundamental unknowns, like predictions about what might be causing a patient's symptoms or decisions about what treatment to try next. Despite increasing interest in developing AI systems that aid or even replace doctors in clinical settings, current systems struggle with calibrated reasoning under uncertainty, and are often deeply opaque about their reasoning. We propose a framework for AI systems that can make practically useful but formally transparent clinical predictions under uncertainty. Given a clinical situation, our framework (MedMSA) uses language models to retrieve relevant prior knowledge, but constructs a formal probabilistic model to support calibrated and verifiable inferences under uncertainty. We show how an initial proof-of-concept of this framework can be used for differential diagnosis, producing an uncertainty-weighted list of potential diagnoses that could explain a patient's symptoms, and discuss future applications and directions for applying this framework more generally for safe clinical collaborations.
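An uncertainty-weighted differential can be produced by any explicit probabilistic model. The simplest illustration is Bayes' rule with conditionally independent symptoms, shown below. This naive Bayes structure is an assumption made here for brevity, not the model MedMSA constructs, and all names and numbers are invented and carry no clinical meaning.

```python
def differential(priors, likelihoods, observed):
    """Posterior over diagnoses given observed symptoms.

    priors: {diagnosis: P(diagnosis)}
    likelihoods: {diagnosis: {symptom: P(symptom present | diagnosis)}}
    observed: {symptom: True or False}
    Returns (diagnosis, posterior) pairs sorted from most to least probable.
    """
    unnormalised = {}
    for dx, prior in priors.items():
        p = prior
        for symptom, present in observed.items():
            p_present = likelihoods[dx][symptom]
            p *= p_present if present else (1.0 - p_present)
        unnormalised[dx] = p
    z = sum(unnormalised.values())
    return sorted(((dx, p / z) for dx, p in unnormalised.items()),
                  key=lambda item: item[1], reverse=True)

# Invented numbers for illustration only.
priors = {"flu": 0.2, "cold": 0.8}
likelihoods = {"flu": {"fever": 0.9, "cough": 0.7}, "cold": {"fever": 0.1, "cough": 0.6}}
ranked = differential(priors, likelihoods, {"fever": True})
```

Every number that enters the result is inspectable, which is the sense in which such a model is formally transparent: a clinician can see that the fever likelihood, not the prior, is what moved flu to the top.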

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

2605.09679v1 by Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Boyan Wang, Liang He, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille

Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.
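The difference between a single accuracy score and stage-wise reporting is easy to show on toy results. The sketch below groups per-question outcomes by the four stages named above. The data are invented, and the benchmark's own scoring code may differ.

```python
from collections import defaultdict

STAGES = ["recognition", "measurement", "visual reasoning", "medical reasoning"]

def stage_accuracy(results):
    """results: iterable of (stage, is_correct). Returns accuracy per stage."""
    hits, totals = defaultdict(int), defaultdict(int)
    for stage, ok in results:
        totals[stage] += 1
        hits[stage] += int(ok)
    return {s: hits[s] / totals[s] for s in STAGES if totals[s]}

toy_results = [
    ("recognition", True), ("recognition", True),
    ("measurement", False), ("measurement", False),
    ("visual reasoning", True), ("visual reasoning", False),
]
overall = sum(ok for _, ok in toy_results) / len(toy_results)
per_stage = stage_accuracy(toy_results)
```

Here the overall accuracy of 0.5 hides the fact that recognition is perfect while measurement fails completely, which is the kind of bottleneck the stage-wise breakdown is meant to expose.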

Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT

2605.16373v1 by Zonglin Yang, Xiaolei Diao, Jishizhan Chen, Xiaozhuang Man, Wei Kong, Gen Wen, Pengfei Cheng, Daqian Shi

Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation. To mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.
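
As a rough illustration of patient-level volumetric scoring, the sketch below pools all slices of a patient into one volume before computing a Dice coefficient, then reports mean and standard deviation across patients. The data layout, the function names, and the toy masks are assumptions for illustration; the abstract does not specify the authors' evaluation code.

```python
from statistics import mean, stdev

def dice(pred, truth):
    """Dice coefficient between two flat binary masks (sequences of 0/1)."""
    inter = sum(p & t for p, t in zip(pred, truth))
    size = sum(pred) + sum(truth)
    return 1.0 if size == 0 else 2.0 * inter / size

def patient_level_dice(volumes):
    """volumes: {patient_id: (pred_slices, truth_slices)}, each a list of flat 2D masks.

    All slices of a patient are pooled into one 3D volume before scoring,
    so correlated neighbouring slices contribute to a single number.
    """
    scores = {}
    for pid, (pred_slices, truth_slices) in volumes.items():
        pred = [v for s in pred_slices for v in s]
        truth = [v for s in truth_slices for v in s]
        scores[pid] = dice(pred, truth)
    return scores

# Made-up toy masks: two patients, two 4-pixel slices each.
toy = {
    "p1": ([[1, 1, 0, 0], [0, 0, 0, 0]], [[1, 1, 0, 0], [0, 0, 0, 0]]),
    "p2": ([[1, 0, 0, 0], [0, 0, 0, 0]], [[1, 1, 0, 0], [0, 0, 0, 0]]),
}
scores = patient_level_dice(toy)
values = list(scores.values())
print(scores, mean(values), stdev(values))
```

In a slice-wise evaluation, the empty second slice of each toy patient would typically be scored on its own, which is one way small datasets can look better than they are.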

Medical

Publish Date Title Authors Homepage Code
2026-05-18 Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs Junyu Pan et.al. 2605.18172v1 null
2026-05-18 Domain Transfer Becomes Identifiable via a Single Alignment Sagar Shrestha et.al. 2605.17918v1 null
2026-05-18 LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection Hanbyeol Park et.al. 2605.17902v1 null
2026-05-18 Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training Guanliang Liu et.al. 2605.17879v1 null
2026-05-18 Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale Jinghui Liu et.al. 2605.17775v1 null
2026-05-18 Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes Jinghui Liu et.al. 2605.17755v1 null
2026-05-18 Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science Yingjie Zhang et.al. 2605.17746v1 null
2026-05-18 Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis Danu Kim et.al. 2605.17729v1 null
2026-05-17 PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship Zhiyuan Wang et.al. 2605.17679v1 null
2026-05-17 ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation Zhikang Chen et.al. 2605.17580v1 null
2026-05-17 CasualSynth: Generating Structurally Sound Synthetic Data Zehua Cheng et.al. 2605.17528v1 null
2026-05-17 Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization Gunjan Balde et.al. 2605.17379v1 null
2026-05-17 CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings Qixuan Hu et.al. 2605.17370v1 null
2026-05-17 Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification Yang Wu et.al. 2605.17308v1 null
2026-05-17 How Do Electrocardiogram Models Scale? Jiawei Li et.al. 2605.17276v1 null
2026-05-17 Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability Nisreen Albzour et.al. 2605.17236v1 null
2026-05-16 UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation Shiv Ghosh et.al. 2605.17140v1 null
2026-05-16 SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning Yongfeng Huang et.al. 2605.17101v1 null
2026-05-16 Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench Tianyu Wang et.al. 2605.17079v1 null
2026-05-16 AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation Shiying Yu et.al. 2605.17071v1 null
2026-05-16 PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts Khizar Hussain et.al. 2605.17028v1 null
2026-05-16 Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings Anthonio Oladimeji Gabriel et.al. 2605.16993v1 null
2026-05-16 Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons Wei Tang et.al. 2605.16975v1 null
2026-05-16 Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects Zhentao Tan et.al. 2605.16966v1 null
2026-05-16 From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction Pujun Feng et.al. 2605.16927v1 null
2026-05-16 PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes Phat Tran et.al. 2605.16860v1 null
2026-05-16 VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment Amy Makawana et.al. 2605.16775v1 null
2026-05-15 CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? Haolin Chen et.al. 2605.16679v1 null
2026-05-15 PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems Shafizur Rahman Seeam et.al. 2605.16630v1 null
2026-05-15 Isotonic Survival Regression: Calibrated Survival Distributions from Deep Cox Models Anchit Jain et.al. 2605.16571v1 null
2026-05-15 Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces Arne Nix et.al. 2605.16545v1 null
2026-05-15 Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search Sarah Martinson et.al. 2605.16238v1 null
2026-05-15 Fully Open Meditron: An Auditable Pipeline for Clinical LLMs Xavier Theimer-Lienhard et.al. 2605.16215v1 null
2026-05-15 Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning Ranjith Chodavarapu et.al. 2605.15894v1 null
2026-05-15 BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation Huanyang Tong et.al. 2605.15736v1 null
2026-05-15 Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery Weijun Ma et.al. 2605.16458v1 null
2026-05-15 Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign Jiahui Li et.al. 2605.16452v1 null
2026-05-15 Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating Hangchun Liang et.al. 2605.16446v1 null
2026-05-15 Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images Liangrui Pan et.al. 2605.16444v1 null
2026-05-15 Two-Valued Symmetric Circulant Matrices: Applications in Deep Learning Jayakrishna Amathi et.al. 2605.16443v1 null
2026-05-14 Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction A H M Rezaul Karim et.al. 2605.15467v1 null
2026-05-14 FutureSim: Replaying World Events to Evaluate Adaptive Agents Shashwat Goel et.al. 2605.15188v1 null
2026-05-14 Evidential Reasoning Advances Interpretable Real-World Disease Screening Chenyu Lian et.al. 2605.15171v1 null
2026-05-14 Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment Sayantan Kumar et.al. 2605.15168v1 null
2026-05-14 COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion Zihan Deng et.al. 2605.15016v1 null
2026-05-14 Quantifying and Mitigating Premature Closure in Frontier LLMs Rebecca Handler et.al. 2605.15000v1 null
2026-05-14 Explainable Detection of Depression Status Shifts from User Digital Traces Loris Belcastro et.al. 2605.14995v1 null
2026-05-14 Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning Francesco Pastori et.al. 2605.14991v1 null
2026-05-14 GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation Drewry H. Morris et.al. 2605.14968v1 null
2026-05-14 From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement Varad Vishwarupe et.al. 2605.14912v1 null
2026-05-14 BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring Zixuan Shu et.al. 2605.14886v1 null
2026-05-14 Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model Minghao Wu et.al. 2605.14723v1 null
2026-05-14 Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke Liren Chen et.al. 2605.14710v1 null
2026-05-14 NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces Konstantinos Kontras et.al. 2605.14698v1 null
2026-05-14 How Sensitive Are Radiomic AI Models to Acquisition Parameters? D. Gil et.al. 2605.14667v1 null
2026-05-14 MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder Eranga Bandara et.al. 2605.14660v1 null
2026-05-14 RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation Shuhao Chen et.al. 2605.14543v1 null
2026-05-14 Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel Ruihua Chen et.al. 2605.14370v1 null
2026-05-14 AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction Yerin Park et.al. 2605.14327v1 null
2026-05-14 Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment Xiaohua Wang et.al. 2605.14242v1 null
2026-05-14 Fusion-fission forecasts when AI will shift to undesirable behavior Neil F. Johnson et.al. 2605.14218v1 null
2026-05-14 Towards Fine-Grained and Verifiable Concept Bottleneck Models Yingying Fang et.al. 2605.14210v1 null
2026-05-13 Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR) Marius S. Knorr et.al. 2605.14126v1 null
2026-05-13 ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows Alvaro Lopez Pellicer et.al. 2605.14113v1 null
2026-05-13 Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening Nishi Doshi et.al. 2605.14108v1 null
2026-05-13 A Benchmark for Early-stage Parkinson's Disease Detection from Speech Terry Yi Zhong et.al. 2605.14066v1 null
2026-05-13 CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI Xiaoyue Liu et.al. 2605.13994v1 null
2026-05-13 Neurosymbolic Auditing of Natural-Language Software Requirements Bethel Hall et.al. 2605.13817v1 null
2026-05-13 Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography Christos Chrysanthos Nikolaidis et.al. 2605.13730v1 null
2026-05-13 SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing Marten J. Finck et.al. 2605.17620v1 null
2026-05-13 Cross Modality Image Translation In Medical Imaging Using Generative Frameworks Giulia Romoli et.al. 2605.13686v1 null
2026-05-13 Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model Riccardo Cavarra et.al. 2605.13568v1 null
2026-05-13 Generating synthetic computed tomography for radiotherapy: SynthRAD2025 challenge report Viktor Rogowski et.al. 2605.13555v1 null
2026-05-13 RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation Chengzhi Shen et.al. 2605.13542v1 null
2026-05-13 Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs Jincai Huang et.al. 2605.13530v1 null
2026-05-13 Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin Markus Wenzel et.al. 2605.13345v1 null
2026-05-13 VERA-MH: Validation of Ethical and Responsible AI in Mental Health Luca Belli et.al. 2605.13318v1 null
2026-05-13 IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages Shubham Kumar Nigam et.al. 2605.13292v1 null
2026-05-13 Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis Bo Cui et.al. 2605.13248v1 null
2026-05-13 AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions Ishika Agarwal et.al. 2605.13149v1 null
2026-05-13 Context Training with Active Information Seeking Zeyu Huang et.al. 2605.13050v2 null
2026-05-13 An Agentic LLM-Based Framework for Population-Scale Mental Health Screening Giuliano Lorenzoni et.al. 2605.13046v1 null
2026-05-13 RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems Rohith Reddy Bellibatlu et.al. 2605.12895v1 null
2026-05-13 A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study Jaime Yan et.al. 2605.13905v1 null
2026-05-13 Multimodal Hidden Markov Models for Persistent Emotional State Tracking Anamika Ragu et.al. 2605.12838v1 null
2026-05-13 PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models Sridhar Mahadevan et.al. 2605.12835v1 null
2026-05-12 Training Large Language Models to Predict Clinical Events Benjamin Turtel et.al. 2605.12817v1 null
2026-05-12 Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces Shixing Yu et.al. 2605.12809v1 null
2026-05-12 BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics Helene Malyutina et.al. 2605.12730v1 null
2026-05-12 Reward Hacking in Rubric-Based Reinforcement Learning Anas Mahmoud et.al. 2605.12474v1 null
2026-05-12 MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering Rezarta Islamaj et.al. 2605.12361v1 null
2026-05-12 EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records Saeed Shurrab et.al. 2605.12335v1 null
2026-05-12 Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study M A Al-Masud et.al. 2605.12241v1 null
2026-05-12 Overtrained, Not Misaligned Joel Schreiber et.al. 2605.12199v1 null
2026-05-12 To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands Fangyi Yu et.al. 2605.12120v1 null
2026-05-12 Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection Muhammad Aqeel et.al. 2605.12069v1 null
2026-05-12 Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL Hyun Do Jung et.al. 2605.12575v1 null
2026-05-12 Spectral Vision Transformer for Efficient Tokenization with Limited Data Alexandra G. Roberts et.al. 2605.12026v1 null
2026-05-12 DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction Hongyi Tang et.al. 2605.12574v1 null
2026-05-12 AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers Lei Wang et.al. 2605.11901v1 null

Abstracts

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

2605.18172v1 by Junyu Pan, Yansen Wang, Enze Zhang, Baoliang Lu, Weilong Zheng, Dongsheng Li

Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.

Domain Transfer Becomes Identifiable via a Single Alignment

2605.17918v1 by Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu

Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer -- requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.
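
The abstract does not give the exact form of the Jacobian sparsity regularizer, so the following is only one plausible reading of "randomized masked finite differences", stated as an assumption: each probe perturbs a single randomly chosen input coordinate (a one-hot mask), and the forward difference approximates one Jacobian column without building the full Jacobian. The function `masked_fd_l1` and the two toy maps are hypothetical.

```python
import random

def masked_fd_l1(f, x, n_probes=64, eps=1e-4, seed=0):
    """Monte Carlo estimate of the entrywise L1 norm of the Jacobian of f at x.

    Assumption for illustration: each probe uses a one-hot mask, so one
    forward difference approximates one Jacobian column.
    """
    rng = random.Random(seed)
    d = len(x)
    fx = f(x)
    total = 0.0
    for _ in range(n_probes):
        j = rng.randrange(d)          # coordinate left unmasked in this probe
        xp = list(x)
        xp[j] += eps
        col = [(a - b) / eps for a, b in zip(f(xp), fx)]
        total += sum(abs(c) for c in col)
    # Each column is sampled with probability 1/d, so rescale by d.
    return d * total / n_probes

def sparse_map(x):
    """Toy map with Jacobian [[2, 0], [0, 3]] (entrywise L1 norm 5)."""
    return [2.0 * x[0], 3.0 * x[1]]

def dense_map(x):
    """Toy map with Jacobian [[2, 1], [1, 3]] (entrywise L1 norm 7)."""
    return [2.0 * x[0] + x[1], x[0] + 3.0 * x[1]]

x0 = [0.5, -1.0]
print(masked_fd_l1(sparse_map, x0), masked_fd_l1(dense_map, x0))
```

With the same seed both calls sample the same columns, so the dense map always scores higher here. In a training loop a term like this would be added to the distribution-matching loss with a weight, and the probes would be resampled every step.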

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

2605.17902v1 by Hanbyeol Park, Hyerim Bae

Stochastic-process-based degradation modeling is a core approach for estimating the distribution of remaining useful life (RUL); however, the selection of an appropriate stochastic process has not been sufficiently addressed. Existing model selection methods mainly rely on the statistical fit of the observed health indicator (HI) trajectory, but this approach may select a model that is inconsistent with the underlying degradation mechanism when the observation window is short or the signal is highly noisy. To address this issue, this paper proposes Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation (LAST-RAG). The proposed method uses both the observed HI trajectory and domain-specific context, and hierarchically conditions the candidate degradation model space based on theoretical and mechanical evidence retrieved from a local evidence bank. In addition, Rule-based Confidence Reasoning with Uncertain State (RCRUS) is introduced to prevent candidate models from being prematurely eliminated when hierarchical decisions are uncertain. Simulation-based experiments demonstrate that the proposed method outperforms statistical, prognostic, and uncertainty-aware baselines in both Wiener/gamma family classification and detailed degradation model classification. Ultimately, this study reframes degradation model selection from a purely statistical goodness-of-fit problem into a knowledge-conditioned decision-making problem that integrates observed data with domain knowledge.

Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training

2605.17879v1 by Guanliang Liu, Abhinandan Patni, Congzhu Lin, Zoe Zeng, Jack Wittmayer, Josh Wu, Ashvin Nihalani, Binxuan Huang, Yinghong Liu, Rory Na, Anthony Ko, Alexander Zhipa, Cong Cheng, Mi Sun, Vijay Rajakumar, Rejith George Joseph, Parthasarathy Govindarajen

Training frontier-scale foundation models involves coordinating tens of thousands of GPUs over multi-month runs, where even minor performance degradations can accumulate into substantial efficiency losses. Existing health-check mechanisms, such as NCCL tests or GPU burn-in, primarily focus on functional correctness and often fail to detect fail-slow behaviors that silently degrade system performance. In this paper, we present Guard, a scalable system for detecting stragglers and ensuring node health in large-scale training clusters. Guard combines lightweight online performance monitoring during training with an offline node-sweep mechanism that systematically evaluates and qualifies nodes before they participate in production workloads. This design enables Guard to detect both acute failures and long-running fail-slow behaviors that traditional diagnostics cannot capture. Deployed on large-scale foundation model pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7x, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure (MTTF), and significantly reduces operational and debugging overhead. These results demonstrate that proactive straggler detection and systematic node qualification are critical for maintaining stable and efficient large-scale training.
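
To make the idea of fail-slow detection concrete, here is a minimal sketch that flags nodes whose mean step time is a slow-side outlier relative to the rest of the cluster. It uses a generic robust statistic (the modified z-score based on the median absolute deviation), which is an assumption chosen for illustration and not Guard's actual detection logic; node names and timings are made up.

```python
from statistics import median

def find_stragglers(step_times, threshold=3.5):
    """Flag nodes whose mean step time is an outlier on the slow side.

    step_times: {node_id: [seconds per training step, ...]}.
    Uses the modified z-score 0.6745 * (x - median) / MAD.
    """
    means = {n: sum(t) / len(t) for n, t in step_times.items()}
    med = median(means.values())
    mad = median(abs(m - med) for m in means.values())
    if mad == 0:
        return []  # no spread to scale against, so nothing is reported
    return sorted(
        n for n, m in means.items() if 0.6745 * (m - med) / mad > threshold
    )

# Made-up timings: node-e is functionally fine but about 30% slower.
toy = {
    "node-a": [0.99, 1.01],
    "node-b": [1.01, 1.03],
    "node-c": [0.97, 0.99],
    "node-d": [1.00, 1.02],
    "node-e": [1.29, 1.31],
}
print(find_stragglers(toy))
```

A node like `node-e` would pass a functional health check because it produces correct results; only a comparison against its peers reveals the slowdown.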

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

2605.17775v1 by Jinghui Liu, Sarvesh Soni, Anthony Nguyen

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes

2605.17755v1 by Jinghui Liu, Anthony Nguyen

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
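
The abstract reports micro F1 for rare codes and macro metrics for frequent ones. The sketch below shows how the two averages differ on a long-tailed label set; the per-code counts and the names `frequent_code` and `rare_code` are made up for illustration and are unrelated to the paper's results.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(counts):
    """counts: {code: (tp, fp, fn)}.

    Micro F1 pools the counts over all codes, so frequent codes dominate.
    Macro F1 averages per-code F1, so every code counts equally.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro

# Made-up counts: one well-predicted frequent code, one poorly predicted rare code.
toy = {"frequent_code": (90, 10, 10), "rare_code": (1, 0, 9)}
micro, macro = micro_macro_f1(toy)
print(micro, macro)
```

Because micro F1 is dominated by frequent codes, rare-code performance is usually reported on the rare subset alone or through macro averages, as the abstract does.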

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

2605.17746v1 by Yingjie Zhang, Chun Feng, Weizhang Zhu, Tianshu Sun

AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

2605.17729v1 by Danu Kim

Deep learning models have achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning method designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.
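
The two ingredients named in the abstract, a class-balanced replay memory and a class-aware loss weighting, can be sketched roughly as follows. The eviction rule, the class `BalancedReplayBuffer`, and the inverse-frequency weighting are assumptions chosen for illustration; the abstract does not describe the paper's exact mechanisms.

```python
import random
from collections import defaultdict

class BalancedReplayBuffer:
    """Fixed-capacity replay memory that keeps per-class counts as equal as possible."""

    def __init__(self, capacity, seed=0):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.by_class = defaultdict(list)
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(v) for v in self.by_class.values())

    def class_counts(self):
        return {c: len(v) for c, v in self.by_class.items() if v}

    def add(self, sample, label):
        if len(self) >= self.capacity:
            largest = max(self.by_class, key=lambda c: len(self.by_class[c]))
            # Drop the sample if its own class is already the largest (or tied).
            if len(self.by_class[label]) >= len(self.by_class[largest]):
                return
            # Otherwise make room by evicting a random item from the largest class.
            victims = self.by_class[largest]
            victims.pop(self.rng.randrange(len(victims)))
        self.by_class[label].append(sample)

def inverse_frequency_weights(labels):
    """Per-class loss weights n / (k * count_c), a common balancing heuristic."""
    counts = defaultdict(int)
    for y in labels:
        counts[y] += 1
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

# Made-up stream: ten samples of class 0 followed by two of class 1.
buf = BalancedReplayBuffer(capacity=4)
for i in range(10):
    buf.add(("normal", i), 0)
for i in range(2):
    buf.add(("pneumonia", i), 1)
print(buf.class_counts(), inverse_frequency_weights([0, 0, 0, 1]))
```

This simplified rule never refreshes samples of a class that is already the largest; a fuller version might use per-class reservoir sampling so that stored examples keep reflecting newer domains.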

PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship

2605.17679v1 by Zhiyuan Wang, Ariful Islam, Indrajeet Ghosh, Xinyu Chen, Katharine E. Daniel, Subigya Nepal, Philip Chow, Laura E. Barnes

Cancer survivors face elevated rates of depression, anxiety, and general emotional distress, yet the precise moments they most need support are often the moments when self-report is sparse, a phenomenon we term the diary paradox. Passive smartphone sensing offers a continuous, unobtrusive alternative, but prior sensing-based affect prediction has been limited by an accuracy ceiling, suggesting a bottleneck not only in available data, but in how behavioral signals are interpreted. We present PULSE, a system that shifts from fixed feature pipelines to agentic sensing investigation: LLM agents equipped with eight purpose-built tools autonomously query smartphone sensing data, compare current behavior against personalized baselines, and calibrate inferences through retrieval-augmented population-level comparisons. Rather than receiving pre-formatted feature summaries, agents decide which modalities to inspect, how far back to look, and how deeply to investigate, mirroring hypothesis-driven clinical reasoning. We evaluate PULSE through a 2×2 factorial design crossing reasoning architecture (structured vs. agentic) with data modality (sensing-only vs. with diary) on 50 cancer survivors from a longitudinal study. Agentic reasoning is the primary driver of performance: the agentic multimodal agent achieves a balanced accuracy of 0.743 for emotion regulation desire with diary and sensing data, while agentic agents predict intervention availability at 0.713 with passive sensing data only. These results suggest that agentic investigation may be a cornerstone for unlocking the clinical value of passive sensing, advancing the feasibility of proactive just-in-time mental health support.
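The results above are reported as balanced accuracy, which is the mean of per-class recall and so does not reward a model for always predicting the majority class. A minimal sketch of the metric follows; the function name and the toy labels are illustrative and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy example: 8 negatives, 2 positives, and a model that always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# plain_accuracy is 0.8, but balanced_accuracy(y_true, y_pred) is 0.5
```

On imbalanced outcomes such as sparse distress reports, a score of 0.5 corresponds to chance level, which is the baseline against which values like 0.743 should be read.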

ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

2605.17580v1 by Zhikang Chen, Yue Wang, Sen Cui, Yu Zhang, Changshui Zhang, Tianling Ren, Tingting Zhu

Electrocardiogram (ECG)-based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug-response scenarios and real-world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert-informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention-aware clinical decision support.

CausalSynth: Generating Structurally Sound Synthetic Data

2605.17528v1 by Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem, the systematic tendency of LLMs to override imposed causal facts with pre-training priors, and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal $\alpha = 0.05$ level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
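The first phase, ancestral sampling from an SCM, can be sketched on a toy graph. The three-variable chain, its probabilities, and every name below are hypothetical and are not taken from the paper or its benchmarks. The `interventions` argument shows one common way to express a do-operation, by overriding a variable and ignoring its parents, which is the idea usually meant by graph mutilation.

```python
import random

# Hypothetical SCM over the chain smoker -> disease -> abnormal_xray.
# Each variable maps to (parents, structural equation of parent values and noise u).
SCM = {
    "smoker": ((), lambda pa, u: u < 0.3),
    "disease": (("smoker",), lambda pa, u: u < (0.4 if pa["smoker"] else 0.05)),
    "abnormal_xray": (("disease",), lambda pa, u: u < (0.9 if pa["disease"] else 0.1)),
}


def topological_order(scm):
    """Order variables so that every parent precedes its children."""
    order, placed = [], set()
    while len(order) < len(scm):
        progressed = False
        for var, (parents, _) in scm.items():
            if var not in placed and all(p in placed for p in parents):
                order.append(var)
                placed.add(var)
                progressed = True
        if not progressed:
            raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


def ancestral_sample(scm, rng, interventions=None):
    """Sample one causal skeleton: visit variables parents-first and apply each equation."""
    interventions = interventions or {}
    values = {}
    for var in topological_order(scm):
        if var in interventions:
            # do(var = value): the variable no longer listens to its parents.
            values[var] = interventions[var]
            continue
        parents, equation = scm[var]
        noise = rng.random()  # exogenous noise, uniform on [0, 1)
        values[var] = equation({p: values[p] for p in parents}, noise)
    return values


rng = random.Random(0)
observational = ancestral_sample(SCM, rng)
interventional = ancestral_sample(SCM, rng, interventions={"disease": True})
```

Each returned dictionary is a skeleton in the abstract's sense: a full variable assignment consistent with the DAG, which a downstream realizer would then turn into text.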

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

2605.17379v1 by Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduces training time by $35-55\%$ over continual pretraining and reduces parameter counts by up to $37\%$ relative to expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

2605.17370v1 by Qixuan Hu, Shuchang Ye, Xumou Zhang, Anastasia Serafimovska, Anastasia Suraev, Amit Saha, Ping-hsiu Lin, Sydney Su, Usman Naseem, Adam G. Dunn, Jinman Kim

Cognitive behavioural therapy (CBT) is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification

2605.17308v1 by Yang Wu, Xiaoyan Yuan, Hau-San Wong, Xiping Hu

Electrocardiogram (ECG) diagnosis in clinical practice relies on structured reasoning over multiple hierarchical aspects, including cardiac rhythm, conduction properties, waveform morphology, and overall diagnostic impression. However, most existing approaches predict labels directly from ECG signals without explicit clinical reasoning, resulting in opaque decisions that lack clinical alignment. To bridge this gap, we propose CardioThink, a physician-inspired multimodal large language model (MLLM) framework that explicitly models the diagnostic reasoning process through human-interpretable intermediate stages (rhythm, conduction, morphology, and impression) to derive final classification results. Furthermore, we introduce Structured Set Policy Optimization (SSPO) to jointly optimize adherence to this structured reasoning format and the accuracy of variable-size diagnostic sets, without requiring manually annotated reasoning traces. Extensive experiments on diverse ECG benchmarks demonstrate the significant superiority of our approach in diagnostic accuracy, while simultaneously providing interpretable clinical reasoning. Notably, reasoning quality evaluations confirm that SSPO substantially enhances the clinical validity of the generated rationales. These findings reveal that moving beyond direct label prediction toward structured reasoning offers a more clinically aligned direction for future ECG modeling.

How Do Electrocardiogram Models Scale?

2605.17276v1 by Jiawei Li, Fabio Bonassi, Ming Jin, Stefan Gustafsson, Johan Sundström, Thomas B. Schön, Antônio H. Ribeiro

While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over $120$ models (ranging from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.
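For readers unfamiliar with how a scaling law is estimated, the sketch below fits a pure power law, loss ≈ a · size^(-b), by ordinary least squares in log-log space. The functional form (with no irreducible-loss term) is an assumption made for illustration, since the abstract does not state the form used, and the data points are synthetic rather than results from the paper.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ~= a * size**(-b) by least squares on (log size, log loss)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic example spanning the 20K to 200M parameter range named in the abstract.
sizes = [2e4, 2e5, 2e6, 2e7, 2e8]
losses = [5.0 * s ** -0.1 for s in sizes]  # generated with a = 5, b = 0.1
a, b = fit_power_law(sizes, losses)
```

Comparing the fitted exponent `b` between two model families is one simple way to express statements such as one architecture being more parameter-efficient than another, although the paper's exact efficiency definitions may differ.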

Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

2605.17236v1 by Nisreen Albzour, Sarah S. Lam

Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening and improve interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, with random horizontal flipping and class weighting (0.7 x 1.3) identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, in line with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, meeting both the clinical performance and transparency requirements essential for medical AI deployment.

UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation

2605.17140v1 by Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil

Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

2605.17101v1 by Yongfeng Huang, Ruiying Chen, James Cheng

Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

2605.17079v1 by Tianyu Wang, Jiajun Li, Jianghao Lin

LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

2605.17071v1 by Shiying Yu, Jielei Wang, Guoming Lu

Radiology report generation (RRG) aims to automatically produce clinically accurate textual reports from medical images. Existing methods predominantly rely on autoregressive (AR) language models, whose causal dependency structure restricts generation to a unidirectional left-to-right process. This paradigm can induce sequence bias, where models tend to follow stereotypical token orders and high-frequency report templates rather than fully grounding generation in image-specific evidence. In this paper, we propose AnchorDiff, the first masked-diffusion framework for RRG that integrates knowledge-graph-derived clinical anchors into diffusion language modeling. By leveraging bidirectional context and iterative refinement, AnchorDiff mitigates the limitations of fixed-order autoregressive decoding. Specifically, we introduce a topology-aware training strategy that uses RadGraph-derived entity hierarchies to assign clinically important tokens differentiated masking protection and loss weights. We further design an inference-time rewriting strategy that detects unstable committed tokens through perturbation-based testing and selectively revises them during denoising. Extensive experiments on the MIMIC-CXR and MIMIC-RG4 benchmarks demonstrate that AnchorDiff achieves state-of-the-art (SOTA) performance, showing the effectiveness of clinically anchored masked diffusion for radiology report generation.

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

2605.17028v1 by Khizar Hussain, Murat Kantarcioglu

Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call TxTemb exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning twenty-two detection methods, twelve open-source models spanning six architectural families, and six corpora. We further introduce DRIFT, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field's reported progress on hallucination detection is substantially explained by benchmark construction artifacts in widely used corpora, and that the majority of established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.

Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

2605.16993v1 by Anthonio Oladimeji Gabriel, Ahmad Rufai Yusuf

Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.
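To make the perturbation concrete, the sketch below applies a single fast-gradient step to a toy logistic-regression classifier. It assumes the L-infinity (sign) variant, x_adv = x + ε · sign(∇_x loss), since the abstract does not state which norm was used, and the two-feature model stands in for the study's DenseNet121 purely for illustration. All function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(x, y, w, b):
    """Binary cross-entropy of a logistic model on one input x with label y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def fgm_perturb(x, y, w, b, epsilon, lo=0.0, hi=1.0):
    """One sign-gradient step that increases the loss, clipped to the valid input range."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]  # d(loss)/dx for logistic regression
    sign = [(g > 0) - (g < 0) for g in grad]
    return [min(hi, max(lo, xi + epsilon * s)) for xi, s in zip(x, sign)]


# Toy input with "pixel" values in [0, 1]; epsilon matches the value in the abstract.
x, y, w, b = [0.5, 0.5], 1, [1.0, -2.0], 0.0
x_adv = fgm_perturb(x, y, w, b, epsilon=0.021)
loss_before = logistic_loss(x, y, w, b)
loss_after = logistic_loss(x_adv, y, w, b)  # larger than loss_before
```

Every input coordinate moves by at most ε, which is why such perturbations can be visually imperceptible while still pushing the model toward a wrong prediction.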

Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons

2605.16975v1 by Wei Tang, Jinpei Han, Kangning Cui, Mattia Carletti, Fredrik K. Gustafsson, Shreyank N Gowda, Patitapaban Palo, Anshul Thakur, Lei Clifton, Jean-michel Morel, Raymond H. Chan, David A. Clifton, Xiao Gu

Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments have demonstrated strong transferability across a range of clinical applications. However, many real-world applications produce recordings that are typically longer and vary in duration at inference time. These 10-second models have no built-in way to combine information across time. Extending them to longer horizons introduces two challenges: structural incompatibilities arising from input-length disparities, and semantic challenges that limit meaningful temporal aggregation. We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone. Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling. Experiments on multiple long-horizon ECG tasks, datasets, and foundation model backbones demonstrate that our method enables robust long-horizon extension from pretrained snapshot models, consistently outperforming sliding-window and pooling-based baselines with strong parameter efficiency.
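The sliding-window and pooling baselines that the abstract compares against can be sketched as follows: cut the long recording into 10-second windows, embed each window with the frozen model, and average the embeddings. This shows the baseline idea only, not the proposed plug-in module. The `toy_encoder` is a stand-in for a real pretrained encoder, and the 500 Hz default sampling rate is an assumption.

```python
def sliding_windows(signal, window, stride):
    """Split a 1-D signal into fixed-length windows; a trailing remainder is dropped."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, stride)]


def toy_encoder(window):
    """Stand-in for a frozen 10-second encoder: returns [mean, peak-to-peak] features."""
    return [sum(window) / len(window), max(window) - min(window)]


def mean_pool(embeddings):
    """Average window embeddings dimension by dimension."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def embed_long_recording(signal, sample_rate_hz=500, window_seconds=10,
                         stride_seconds=10, encoder=toy_encoder):
    """Embed a variable-length recording by pooling per-window embeddings."""
    window = sample_rate_hz * window_seconds
    stride = sample_rate_hz * stride_seconds
    windows = sliding_windows(signal, window, stride)
    if not windows:
        raise ValueError("recording is shorter than one window")
    return mean_pool([encoder(w) for w in windows])


# A 25-second toy recording sampled at 2 Hz yields two full windows.
recording = list(range(50))
embedding = embed_long_recording(recording, sample_rate_hz=2)
```

Mean pooling treats windows as an unordered set, so it discards the order and timing of events; this is the kind of limitation the abstract attributes to models with no built-in way to combine information across time.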

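Summary: Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments have shown strong transferability across a range of clinical applications. However, many real-world applications produce recordings that are usually longer and vary in duration at inference time. These 10-second models have no built-in way to combine information across time. Extending them to longer horizons introduces two challenges: structural incompatibility caused by input-length differences, and semantic challenges that limit meaningful temporal aggregation. We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone. Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling. Experiments across multiple long-horizon ECG tasks, datasets, and foundation-model backbones show that our method achieves robust long-horizon extension from pretrained snapshot models and consistently outperforms sliding-window and pooling-based baselines, with strong parameter efficiency.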
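For orientation, the sliding-window and pooling baseline that the paper compares against can be sketched as follows. This is not the proposed plug-in module; `toy_encoder` is a hypothetical stand-in for a frozen 10-second encoder, and the window and stride values are arbitrary.

```python
def sliding_windows(signal, window, stride):
    """Split a 1-D signal into fixed-length windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, stride)]

def toy_encoder(segment):
    """Hypothetical stand-in for a frozen encoder: returns a 2-D embedding."""
    mean = sum(segment) / len(segment)
    return [mean, max(segment) - min(segment)]

def mean_pool(embeddings):
    """Average the per-window embeddings into one recording-level embedding."""
    n = len(embeddings)
    return [sum(e[d] for e in embeddings) / n for d in range(len(embeddings[0]))]

# A 25-sample toy "recording" cut into 10-sample windows with stride 5.
signal = [float(i % 7) for i in range(25)]
windows = sliding_windows(signal, window=10, stride=5)
pooled = mean_pool([toy_encoder(w) for w in windows])
```

Mean pooling treats every window identically and discards their order, which is one way to read the abstract's remark that snapshot models have no built-in way to combine information across time.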

Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

2605.16966v1 by Zhentao Tan, Yuze Hao, Boyi Zou, Mingsheng Long, Yi Yang, Gang Bao

Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research due to its broad significance across a wide range of real-world applications. Inverse PDE problems arise across medical imaging, geophysics, materials science, and aerodynamics, where the goal is to infer hidden causes, design structures, or control physical states. In this paper, we provide a comprehensive review of recent advances in solving inverse PDE problems using artificial intelligence (AI). We first introduce the basic formulation, key challenges, and traditional numerical foundations of inverse PDE problems, and then organize the field into three major categories: inverse problems, inverse design, and control problems. For each category, we further present the methodological paradigms and review representative state-of-the-art approaches from recent years. We then summarize representative applications across scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to provide the first unified and systematic perspective on AI for inverse PDE problems, demonstrating how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.

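Summary: Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research because of its broad importance across real-world applications. Inverse PDE problems arise in medical imaging, geophysics, materials science, aerodynamics, and other fields, where the goal is to infer hidden causes, design structures, or control physical states. This paper provides a comprehensive review of recent advances in solving inverse PDE problems with artificial intelligence (AI). We first introduce the basic formulation, main challenges, and traditional numerical foundations of inverse PDE problems, and then organize the field into three major categories: inverse problems, inverse design, and control problems. For each category, we present the methodological paradigms and review representative state-of-the-art methods from recent years. We then summarize representative applications in scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to offer the first unified and systematic perspective on AI for inverse PDE problems, showing how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.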

From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

2605.16927v1 by Pujun Feng, Xiaoyu Guo, Seyed Ehsan Saffari, Min Hun Lee, Siew-Kei Lam, Erik Cambria, Xibin Sun, Yangtao Zhou, Tong Yang, Xiaoyu Zhang, Tao Tan, Yue Sun, Bin Cui

Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI--methods estimating patient-specific longitudinal disease evolution and assessing trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.

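Summary: Clinical decision-making is a feedback system in which risk estimates influence treatment, treatment in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment-confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI, meaning methods that estimate patient-specific longitudinal disease evolution and assess how trajectories change under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete and continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize the key method families (multistate/joint models, temporal point processes, deep sequence architectures, longitudinal causal inference), map them to the relevant components, and align evaluation with claim strength through overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis moves benchmark prediction toward decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt or abstain when evidence is insufficient.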

PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes

2605.16860v1 by Phat Tran, Neville Mehta, Clara Mosquera-Lopez, Robert H. Dodier, Lizhong Chen, Peter G. Jacobs

Accurate long-horizon glucose forecasting is critical for automated insulin delivery systems, which help people with type 1 diabetes (T1D) manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks suffer from systematic negative bias at longer horizons due to error compounding, while purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. We propose PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins to identify the best-fitting physiological match from a 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected as exogenous covariates into both the encoder and decoder of the Seq2Seq LSTM. This simultaneous 48-step prediction strategy eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that bounds long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, PhysioSeq2Seq achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL over the recursive LSTM and reducing mean absolute error by 28.62 mg/dL over the ODE-based digital twin. These results show that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in T1D.

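Summary: Accurate long-horizon glucose forecasting is critical for automated insulin delivery systems, which help people with type 1 diabetes (T1D) manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks suffer from a systematic negative bias at longer horizons because errors compound, while purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. We propose PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins to identify the best physiological match from a 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected as exogenous covariates into both the encoder and the decoder of the Seq2Seq LSTM. This simultaneous 48-step prediction strategy eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that keeps long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, PhysioSeq2Seq achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL relative to the recursive LSTM and reducing mean absolute error by 28.62 mg/dL relative to the ODE-based digital twin. These results show that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in T1D.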
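The two reported metrics answer different questions, which a short sketch can make explicit. The sign convention (prediction minus reference, so a negative value means under-prediction) and the toy glucose values are assumptions; the abstract does not spell them out.

```python
def mean_error(pred, true):
    """Signed bias: positive and negative errors can cancel each other."""
    return sum(p - t for p, t in zip(pred, true)) / len(pred)

def mean_absolute_error(pred, true):
    """Average error magnitude: errors never cancel."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

# Hypothetical glucose values in mg/dL, for illustration only.
pred = [100.0, 150.0, 90.0]
true = [110.0, 140.0, 100.0]
bias = mean_error(pred, true)
mae = mean_absolute_error(pred, true)
```

Here every prediction is off by 10 mg/dL, so the mean absolute error is 10, while the signed errors partly cancel and the bias is only about -3.3. A model can therefore reduce its bias without reducing its mean absolute error by the same amount, which is why the abstract reports both.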

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

2605.16775v1 by Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar

Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabeled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer's disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific pertaining, a step towards generalizable and clinically viable 3D models.

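Summary: Self-supervised learning (SSL) has advanced medical image analysis by making it possible to learn from large amounts of unlabeled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, which limits their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, even though unlabeled volumetric data are available. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL methods. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and of Alzheimer's disease versus healthy controls. Across all tasks, the representations learned by VolTA-3D outperform randomly initialized baselines, showing improved transferability and robustness under domain shift. Jointly enforcing global semantic consistency and local structural learning during pretraining therefore enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific adaptation, a step toward generalizable and clinically viable 3D models.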

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

2605.16679v1 by Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density (decisions must be grounded in a large library of medical, insurance, and operational rules); multi-role composition (a single task requires the agent to play multiple roles with handoffs); and multilateral interaction (intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach). We introduce $χ$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status by making tool calls and writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

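Summary: End-to-end automation of realistic healthcare operations stresses three capabilities that are underrepresented in current benchmarks: policy density (decisions must be grounded in a large body of medical, insurance, and operational rules); multi-role composition (a single task requires the agent to play several roles with handoffs); and multilateral interaction (intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach). We introduce $χ$-Bench, a benchmark of long-horizon healthcare workflows covering three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case inside a high-fidelity simulator of 20 healthcare apps exposed through 87 MCP tools; the agent must drive the case to a terminal status by making tool calls and writing the role's artifacts, guided by a managed-care operations handbook skill of more than 1,290 documents. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to appear in other policy-dense, role-composed, irreversible enterprise domains.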
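The abstract does not define pass^3. A common reading of pass^k, assumed here, is the probability that k independent attempts at the same task all succeed, estimated per task as C(c, k) / C(n, k) when c of n recorded trials succeeded. The sketch below uses made-up trial outcomes and hypothetical task names.

```python
from math import comb

def pass_hat_k(trials, k):
    """Average over tasks of C(c, k) / C(n, k), where c of n trials succeeded."""
    total = 0.0
    for outcomes in trials.values():
        n, c = len(outcomes), sum(outcomes)
        total += comb(c, k) / comb(n, k)
    return total / len(trials)

# Hypothetical outcomes: three trials per task.
trials = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
}
strict = pass_hat_k(trials, 3)   # only task_a counts
lenient = pass_hat_k(trials, 1)  # ordinary per-trial success rate
```

With three trials per task, pass^3 credits a task only when every trial succeeds, so it is never higher than the single-trial success rate and penalizes agents whose successes are not repeatable.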

PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems

2605.16630v1 by Shafizur Rahman Seeam, Zhengxiong Li, Zhiyuan Yu, Yimin Chen, Yidan Hu

Hybrid local-cloud agents enrich user requests with context from persistent working state before delegating capability-intensive subtasks to a cloud language model (CLM). While this enrichment can improve task success, it also exposes unnecessary information in the cloud-bound payload, including task-irrelevant context, carryover from prior workflows, and overly specific sensitive details, resulting in over-disclosure. Existing solutions either isolate workflows to limit cross-workflow leakage or apply general-purpose sanitization that does not reason over LC-assembled payload scope. We present PrivScope, a trusted on-device payload governor that enforces task-scoped disclosure at the local-CLM boundary, without requiring cloud-side changes. Its key idea: sensitive information should reach the cloud only when required for the delegated subtask, and then only in the least revealing form that preserves utility. PrivScope extracts disclosure units from the assembled payload and keeps direct identifiers and account-linked values on device. The remaining units pass through cloud-necessity control, which determines what is actually needed; units that must reach the cloud are abstracted to the least-specific representation sufficient for the task. On 100 medical-booking workflows across three commercial CLMs, PrivScope eliminates profile leakage (0.0% vs. 17.7%), more than halves attacker re-identification (23.1% vs. 64.3%), and achieves the highest candidate recall on every CLM tested while preserving task success close to the unprotected baseline on GPT-4o-mini and Gemini 2.5 Flash. Gains hold across five local backbones and add only seconds of on-device latency on commodity hardware.

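Summary: Hybrid local-cloud agents enrich user requests with context from persistent working state before delegating capability-intensive subtasks to a cloud language model (CLM). Although this enrichment can raise task success, it also exposes unnecessary information in the cloud-bound payload, including task-irrelevant context, carryover from earlier workflows, and overly specific sensitive details, which results in over-disclosure. Existing solutions either isolate workflows to limit cross-workflow leakage or apply general-purpose sanitization that does not consider the scope of the LC-assembled payload. We present PrivScope, a trusted on-device payload governor that enforces task-scoped disclosure at the local-CLM boundary without requiring any cloud-side changes. Its key idea is that sensitive information should reach the cloud only when the delegated subtask requires it, and then only in the least revealing form that preserves utility. PrivScope extracts disclosure units from the assembled payload and keeps direct identifiers and account-linked values on the device. The remaining units pass through cloud-necessity control, which determines what is actually needed; units that must reach the cloud are abstracted to the least specific representation that is sufficient for the task. On 100 medical-booking workflows across three commercial CLMs, PrivScope eliminates profile leakage (0.0% vs. 17.7%), more than halves attacker re-identification (23.1% vs. 64.3%), and achieves the highest candidate recall on every CLM tested, while keeping task success close to the unprotected baseline on GPT-4o-mini and Gemini 2.5 Flash. These gains hold across five local backbones and add only seconds of on-device latency on commodity hardware.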

Isotonic Survival Regression: Calibrated Survival Distributions from Deep Cox Models

2605.16571v1 by Anchit Jain, Kevin Zhang, Stephen Bates

Time-to-event data is widespread across the life sciences and engineering, but it is typically encountered together with censoring, which complicates the application of standard machine learning methods. Deep Cox models have emerged as a popular method for analyzing time-to-event data because they gracefully handle censoring and can be used with unstructured data such as clinical text reports, genomic sequences, and pathology images. However, their predicted survival probabilities are often poorly calibrated, thus limiting their practical utility. In this paper, we propose a novel post hoc calibration method for Deep Cox models that uses isotonic regression to refine predicted survival probabilities without affecting discriminative power. We establish favorable theoretical guarantees, including a double-robustness property and asymptotic calibration. Experiments on synthetic and real-world clinical data demonstrate the empirical effectiveness of our method.

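Summary: Time-to-event data are widespread across the life sciences and engineering, but they usually come with censoring, which complicates the application of standard machine learning methods. Deep Cox models have become a popular method for analyzing time-to-event data because they handle censoring gracefully and can be used with unstructured data such as clinical text reports, genomic sequences, and pathology images. However, their predicted survival probabilities are often poorly calibrated, which limits their practical utility. In this paper, we propose a novel post hoc calibration method for Deep Cox models that uses isotonic regression to refine predicted survival probabilities without affecting discriminative power. We establish favorable theoretical guarantees, including a double-robustness property and asymptotic calibration. Experiments on synthetic and real-world clinical data demonstrate the empirical effectiveness of our method.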
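As background for the isotonic-regression step, the following is a minimal sketch of the standard pool-adjacent-violators algorithm for a non-decreasing least-squares fit. It is generic isotonic regression on fully observed values, not the authors' estimator: it ignores censoring entirely, and the input values are invented.

```python
def pava(values, weights=None):
    """Non-decreasing least-squares fit via pool-adjacent-violators."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [mean, total_weight, item_count]
    for v, w in zip(values, weights):
        blocks.append([float(v), float(w), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, n1 + n2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# Hypothetical observed frequencies, ordered by increasing predicted risk.
observed = [0.10, 0.30, 0.20, 0.40]
calibrated = pava(observed)
```

Because the fitted map is monotone, it never reverses the order of two predictions, although it can tie them. That is consistent with the abstract's claim that calibration is refined without affecting discriminative power.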

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

2605.16545v1 by Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

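Summary: After decades of use in dictation and, more recently, in ambient documentation, speech is emerging as a primary way of interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or for narrow dictation workflows, which limits their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction, optimizing medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmarks and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, which suggests robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.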

2605.16238v1 by Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi

Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.

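Summary: Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development limits scalability to fine-grained geographic resolutions or emerging pathogens. Here we present an autonomous system that uses Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models produced an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out of sample. The system also handled data-scarce "cold start" scenarios for RSV. In addition, controlled retrospective ablations showed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck and enables rapid deployment of expert-level disease forecasting at unprecedented scales.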

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

2605.16215v1 by Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

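Summary: Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only: they release parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end to end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate with an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves by 6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO state of the art (SoTA). Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs. 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.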

Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning

2605.15894v1 by Ranjith Chodavarapu

Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.

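Summary: Rapid and accurate assessment of wildfire smoke severity from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task and produce point estimates without any measure of prediction confidence. We propose a probabilistic framework that classifies a satellite patch into Light, Moderate, and Heavy severity classes and provides decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses a pretrained EfficientNet-B3 backbone and a CBAM module, together with an evidential deep learning head that predicts Dirichlet concentration parameters and directly estimates vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction that retains the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity serves as a practical scan-quality measure. The Moderate class represents transitional smoke conditions and shows the highest epistemic uncertainty (mean vacuity = 0.187), confirming that the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE shows clear cluster separation between Light and Heavy smoke.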
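The two uncertainty terms can be computed in closed form from the Dirichlet parameters. The sketch below follows the usual subjective-logic definitions, which are assumed here to match the paper: belief mass b_k = (alpha_k - 1) / S with S the sum of all alpha, vacuity K / S, and dissonance as a balance-weighted sum over pairs of belief masses. The alpha values are invented and must each be at least 1.

```python
def dirichlet_uncertainty(alpha):
    """Return (vacuity, dissonance) for Dirichlet parameters alpha (each >= 1)."""
    k = len(alpha)
    strength = sum(alpha)
    belief = [(a - 1.0) / strength for a in alpha]
    vacuity = k / strength  # high when total evidence is low
    dissonance = 0.0
    for i, bi in enumerate(belief):
        others = [bj for j, bj in enumerate(belief) if j != i]
        denom = sum(others)
        if bi == 0.0 or denom == 0.0:
            continue
        # Balance is 1 when two belief masses are equal and 0 when one is zero.
        balance = sum(bj * (1.0 - abs(bj - bi) / (bj + bi)) for bj in others if bj > 0.0)
        dissonance += bi * balance / denom
    return vacuity, dissonance

no_evidence = dirichlet_uncertainty([1.0, 1.0, 1.0])
conflicting = dirichlet_uncertainty([11.0, 11.0, 1.0])
confident = dirichlet_uncertainty([21.0, 1.0, 1.0])
```

With no evidence, vacuity is 1 and dissonance is 0. With strong but evenly split evidence for two classes, vacuity is low and dissonance is high. With strong evidence for one class, both are low. All three come from one set of outputs, which is why no Monte Carlo sampling is needed.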

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

2605.15736v1 by Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen

Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning

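Summary: Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams and rely on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities and acts as a dynamic noise regulator that suppresses irrelevant textual cues; and (2) a Dual-Anchor Constraint, which regularizes learnable prompts toward stable semantic centroids derived from expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments on 11 benchmarks show that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly greater robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning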

Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery

2605.16458v1 by Weijun Ma

Image restoration models are increasingly applied to degraded medical scans, but in safety-sensitive settings they must improve image quality without uncontrolled modification of clinically important regions. This is especially relevant for intracranial CT and CT angiography (CTA), where small vessels and aneurysm-relevant cues lie near high-contrast anatomical boundaries. We frame medical image restoration as a conservative AI problem and present a residual-bounded 2.5D restoration framework trained on synthetically degraded CT/CTA inputs. The model adds a learned residual to the original center slice through an edit-control map that limits the magnitude and spatial extent of modification. We evaluate the framework using an aneurysm-relevant image-recovery matrix, paired comparison against a Gaussian baseline, Monte Carlo stability testing, anatomical localization of meaningful edits, and external evaluation on low-dose CT. On 50 out-of-distribution CT-CTA cases, the bounded model achieved a mean target gain of 0.0635, a mean PSNR of 37.51 dB, and an iatrogenic-edit rate of 4.0%. Across 1,000 Monte Carlo runs, it remained net positive in 85.4% of runs with no stably negative cases. On external low-dose CT, the model was directionally beneficial and produced a substantially smaller modification footprint than the baseline. Meaningful edits concentrated in brain and skull regions while unrelated anatomy showed negligible change. These findings provide preliminary computational evidence that residual-bounded restoration is feasible in boundary-sensitive vascular imaging, but they do not establish clinical diagnostic performance and require expert review and prospective validation before clinical use.

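Summary: Image restoration models are increasingly applied to degraded medical scans, but in safety-sensitive settings they must improve image quality without uncontrolled modification of clinically important regions. This is especially relevant for intracranial CT and CT angiography (CTA), where small vessels and aneurysm-relevant cues lie near high-contrast anatomical boundaries. We frame medical image restoration as a conservative AI problem and present a residual-bounded 2.5D restoration framework trained on synthetically degraded CT/CTA inputs. The model adds a learned residual to the original center slice through an edit-control map that limits the magnitude and spatial extent of modification. We evaluate the framework using an aneurysm-relevant image-recovery matrix, paired comparison against a Gaussian baseline, Monte Carlo stability testing, anatomical localization of meaningful edits, and external evaluation on low-dose CT. On 50 out-of-distribution CT-CTA cases, the bounded model achieved a mean target gain of 0.0635, a mean PSNR of 37.51 dB, and an iatrogenic-edit rate of 4.0%. Across 1,000 Monte Carlo runs, it remained net positive in 85.4% of runs, with no stably negative cases. On external low-dose CT, the model was directionally beneficial and produced a substantially smaller modification footprint than the baseline. Meaningful edits concentrated in brain and skull regions, while unrelated anatomy showed negligible change. These findings provide preliminary computational evidence that residual-bounded restoration is feasible in boundary-sensitive vascular imaging, but they do not establish clinical diagnostic performance and require expert review and prospective validation before clinical use.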
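One simple way to realize a residual that is limited in both magnitude and spatial extent is to clip the residual to a fixed bound and scale it by a per-pixel map in [0, 1]. The functional form, the bound, and the values below are assumptions for illustration; the abstract does not give the exact formulation.

```python
def bounded_restore(pixels, residual, edit_map, bound):
    """Add a clipped, spatially gated residual to the original pixels.

    edit_map values lie in [0, 1]; 0 forbids any change at that pixel.
    """
    out = []
    for x, r, m in zip(pixels, residual, edit_map):
        clipped = max(-bound, min(bound, r))  # limit the edit magnitude
        out.append(x + m * clipped)           # limit the edit's spatial extent
    return out

# Hypothetical normalized intensities along one row of a slice.
pixels = [0.20, 0.50, 0.80]
residual = [0.30, -0.01, -0.50]  # raw network output, possibly too large
edit_map = [1.0, 0.0, 0.5]       # the middle pixel is protected
restored = bounded_restore(pixels, residual, edit_map, bound=0.05)
```

Whatever the network proposes, no pixel can move by more than `bound`, and pixels where the map is zero are returned unchanged. That is the kind of guarantee a conservative restoration setting asks for.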

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

2605.16452v1 by Jiahui Li, Yida Zhang, Zixuan Zeng, Jiayu Chen, Yingjian Song, Yin Xiao, Nishan Dong, Junjie Lu, Younghoon Kwon, Xiang Zhang, Jin Lu, Wenzhan Song, Fei Dou

Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model's self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities (ECG, PPG, BCG, and BSG) spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.

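Summary: Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are usually designed with expert knowledge for a single signal modality, which limits their generalizability. Deep learning-based methods, in contrast, often lack interpretability, which limits transparency for expert verification and hinders expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that uses instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of the framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias that guides the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized in two stages: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. Its self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities (ECG, PPG, BCG, and BSG) spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector shows strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.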

Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating

2605.16446v1 by Hangchun Liang, Changchun Li

Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) require statistical fairness guarantees. We identify a structural conflict in tabular fair SSL through a diagnostic stress test: under confidence-gated pseudo-labeling, moment-matching fairness regularizers can trigger two failure modes: Masking Collapse (fairness erodes confidence, starving pseudo-labels) and Trivial Saturation (drift to constant predictors). We propose Online Primal-Dual Allocation (OPDA), an online controller that schedules fairness and entropy-based stability penalties using violation, risk, and pseudo-label health signals, avoiding per-dataset selection of a fixed fairness weight within this diagnostic regime. On the evaluated tabular benchmarks (Adult, ACSIncome, COMPAS), OPDA mitigates the degenerate regimes observed under static weighting and simple single-signal adaptive baselines. On Adult and COMPAS, it yields non-degenerate operating points competitive with the empirical static-$λ$ frontier; on ACSIncome, it preserves utility with a wider fairness-utility spread. Relative to OPDA-lite, the full controller mainly shifts the operating point toward higher utility on ACSIncome, while Adult highlights the fairness-utility trade-off between the two variants. These results position OPDA as a calibration-free controller for non-degenerate operating points in tabular fair SSL without per-dataset tuning.
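
As a rough illustration of the primal-dual idea, and not the OPDA controller itself, the sketch below shows plain projected dual ascent on a single fairness weight. The names `dual_ascent_step`, `tolerance`, `step_size`, and `lam_max`, as well as the violation values, are hypothetical; the abstract does not say how OPDA combines its violation, risk, and pseudo-label health signals.

```python
def dual_ascent_step(lam, violation, tolerance=0.05, step_size=0.1, lam_max=10.0):
    """One projected dual-ascent update of a fairness weight (illustrative only).

    The weight grows while the measured fairness violation exceeds the
    tolerance and shrinks otherwise; it is clipped to [0, lam_max].
    """
    lam = lam + step_size * (violation - tolerance)
    return min(max(lam, 0.0), lam_max)


# Hypothetical per-epoch fairness violations (for example, a parity gap).
violations = [0.20, 0.15, 0.08, 0.04, 0.02]
lam = 0.0
history = []
for v in violations:
    lam = dual_ascent_step(lam, v)
    history.append(lam)
print([round(h, 4) for h in history])
```

The weight rises while the violation is above the tolerance and relaxes once it falls below, which is the behavior a fixed, hand-picked weight cannot provide.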


Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images

2605.16444v1 by Liangrui Pan, Jiadi Luo, Yuxuan Xiao, Chenchen Nie, Xiaoshuai Wu, Songqing Fan, Ling Chu, Manqiu Li, Rongfang He, Zhenyu Zhao, Ruixing Wang, Shulin Liu, Yiyi Liang, Xiang Wang, Qingchun Liang, Shaoliang Peng

Accurate intraoperative and postoperative diagnosis of spread through air spaces (STAS) is essential for guiding surgical decisions and postoperative management in lung cancer. However, histopathological assessment is labor-intensive and is prone to missed or incorrect diagnoses. We propose a Diffusion Attention Expert Model (DAEM) to detect STAS in frozen sections (FSs) and paraffin sections (PSs). Its diffusion attention expert module leverages full attention aggregation to learn multi-scale features from histopathological images, while a dual-branch architecture strengthens multi-scale feature representation. On an internal dataset, DAEM achieves AUCs of 0.8946 for FSs and 0.9112 for PSs. Validation on external multi-center datasets from eight institutions demonstrates strong generalizability and interpretability. Using tumor microenvironment (TME) features in PSs, we further enable semi-automatic measurement of STAS location and its distance from the primary tumor. Several quantitative TME metrics are identified as potential biomarkers for STAS, including micropapillary-type STAS. Overall, DAEM offers a clinically actionable framework for STAS assessment by enabling accurate and interpretable detection on FSs and PSs, supporting postoperative risk stratification through quantitative TME-based analysis.


Two-Valued Symmetric Circulant Matrices: Applications in Deep Learning

2605.16443v1 by Jayakrishna Amathi, Venkata Prasanth Yanambaka, Saraju P. Mohanty, Elias Kougianos

Despite the success of deep neural networks in vision, medical diagnosis, and IoT scenarios, their deployment on resource-limited platforms poses serious challenges due to their high storage requirements, computational complexity, and large footprint. In particular, fully connected layers require a large number of weights, making it difficult for edge devices to accommodate them. To overcome these challenges, this paper proposes the Two-Valued Symmetric Circulant Matrix (TVSCM), a very sparse architecture that employs just two weights per layer while keeping the weight matrix circulant and symmetric. This extreme form of structured sparsity makes storage costs negligible compared to traditional full-weight storage. Unlike other traditional sparse learning techniques, such as low-rank approximation and pruning, which rely on hardware support and additional stages, this architecture achieves minimal storage requirements directly. The simulation study demonstrates a more than 80$\times$ reduction in model parameters, reducing parameters from 623,290 to 7,852 on MNIST and from 24,709 to 942 on the MIT-BIH arrhythmia dataset, while keeping accuracy comparable, with a drop from 97.6% to 93.5% on MNIST and from 97.6% to 93.1% on MIT-BIH. Due to its minimal architectural requirements and very low power consumption, this architecture would be ideal for edge computing platforms, tiny-ML platforms, IoMT systems, and battery-powered systems.
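
To make the structure concrete, here is a minimal NumPy sketch of a square symmetric circulant matrix whose entries take only two values. It is an illustration under assumptions, not the paper's construction: the function names and the rule that decides where each of the two values goes are hypothetical, and the abstract does not describe how non-square layers are handled.

```python
import numpy as np


def two_valued_symmetric_circulant(n, a, b):
    """Build an n x n symmetric circulant matrix whose entries are only a or b.

    The a/b placement rule below is a hypothetical choice for illustration.
    """
    k = np.arange(n)
    # Circular distance to index 0; it is equal for k and n - k, which is the
    # condition c[k] == c[n - k] that makes a circulant matrix symmetric.
    dist = np.minimum(k, n - k)
    first_row = np.where(dist % 2 == 0, a, b).astype(float)
    # Entry (i, j) of a circulant matrix depends only on (j - i) mod n.
    idx = (k[None, :] - k[:, None]) % n
    return first_row, first_row[idx]


def circulant_matvec(first_row, x):
    """Multiply by the symmetric circulant matrix via FFT without storing it."""
    return np.real(np.fft.ifft(np.fft.fft(first_row) * np.fft.fft(x)))


first_row, C = two_valued_symmetric_circulant(8, a=0.5, b=-0.25)
x = np.arange(8, dtype=float)
print(np.allclose(C, C.T), np.allclose(C @ x, circulant_matvec(first_row, x)))
```

Only the two scalars and the fixed placement rule need to be stored, and the matrix-vector product never materializes the full matrix. For reference, the parameter counts quoted in the abstract correspond to ratios of about 79.4 (623,290 / 7,852) and 26.2 (24,709 / 942).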


Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction

2605.15467v1 by A H M Rezaul Karim, Ozlem Uzuner

Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus and combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit, with two LLM backbones: Llama-4-Scout-17B-16E-Instruct and GPT-5.2, each with a corresponding embedding model for RAG. Our best configuration uses GPT-5.2 with the full schema, RAG, and second-pass auditing, achieving an F1 score of 80.36%. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.
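
The deterministic schema-based postprocessing step can be pictured with a small sketch. Everything here is an assumption made for illustration: the `SCHEMA` contents, the field names, the value types, and the `postprocess` function are hypothetical and do not reproduce the MEDIQA-SYNUR schema or the authors' rules.

```python
# Hypothetical schema: field name -> (value type, allowed values or None).
SCHEMA = {
    "heart_rate": ("numeric", None),
    "pain_level": ("numeric", None),
    "mobility": ("single_select", {"independent", "assisted", "bedbound"}),
}


def postprocess(observations, schema=SCHEMA):
    """Keep only observations that fit the schema, coercing values by type."""
    cleaned = []
    for obs in observations:
        name, value = obs.get("name"), obs.get("value")
        if name not in schema:
            continue  # drop fields the schema does not define
        value_type, allowed = schema[name]
        if value_type == "numeric":
            try:
                value = float(value)
            except (TypeError, ValueError):
                continue  # drop values that cannot be read as numbers
        elif value_type == "single_select":
            value = str(value).strip().lower()
            if value not in allowed:
                continue  # drop values outside the enumerated options
        cleaned.append({"name": name, "value": value})
    return cleaned


# Hypothetical raw model output containing one unknown field and one bad value.
raw = [
    {"name": "heart_rate", "value": "88"},
    {"name": "mobility", "value": " Assisted "},
    {"name": "mood", "value": "calm"},
    {"name": "pain_level", "value": "moderate"},
]
result = postprocess(raw)
print(result)
```

Because the step is rule-based, the same model output always yields the same cleaned record, which leaves the second-pass audit to deal only with errors that rules cannot catch.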


FutureSim: Replaying World Events to Evaluate Adaptive Agents

2605.15188v1 by Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
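
For readers unfamiliar with the metric, the sketch below computes a Brier skill score for binary questions. The abstract does not define its reference forecast, so the constant 0.5 reference used here is an assumption, as are the function names and the example forecasts.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_prob=0.5):
    """1 - BS / BS_ref, with a constant reference forecast (assumed here)."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


outcomes = [1, 0, 0, 1]                   # hypothetical resolved questions
confident_wrong = [0.1, 0.9, 0.8, 0.3]    # hypothetical overconfident agent
print(brier_skill_score(confident_wrong, outcomes))
```

A score of 0 matches the reference, 1 is a perfect forecast, and negative values mean the forecaster did worse than the uninformative baseline, which is the situation the abstract reports for many agents.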


Evidential Reasoning Advances Interpretable Real-World Disease Screening

2605.15171v1 by Chenyu Lian, Hong-Yu Zhou, Jing Qin

Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.
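
The phrase "specificity at clinical-level recall" can be read as the best specificity achievable among thresholds that still meet a required recall. The sketch below shows that computation; the target recall values, the function name, and the toy scores are assumptions, since the abstract does not state the recall level used.

```python
import numpy as np


def specificity_at_recall(scores, labels, target_recall=0.9):
    """Best specificity over thresholds whose recall meets the target."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    positives = labels == 1
    negatives = labels == 0
    best = 0.0
    for threshold in np.unique(scores):
        predicted_positive = scores >= threshold
        recall = (predicted_positive & positives).sum() / positives.sum()
        if recall >= target_recall:
            specificity = (~predicted_positive & negatives).sum() / negatives.sum()
            best = max(best, float(specificity))
    return best


# Hypothetical screening scores (higher means more likely diseased).
scores = [0.9, 0.8, 0.7, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0]
print(specificity_at_recall(scores, labels, target_recall=1.0))
```

Fixing recall first reflects the screening setting, where missed cases are costly and the remaining question is how many healthy cases are flagged unnecessarily.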


Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

2605.15168v1 by Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.


COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

2605.15016v1 by Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu

As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.


Quantifying and Mitigating Premature Closure in Frontier LLMs

2605.15000v1 by Rebecca Handler, Suhana Bedi, Nigam Shah

Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.
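
The structured evaluation can be sketched as two small steps: remove the correct option from a multiple-choice item, then count how often a model still commits to one of the remaining options. The abstention label, the data layout, and both function names below are hypothetical; the abstract does not describe how abstention was elicited or scored.

```python
ABSTAIN = "none of the above"  # hypothetical abstention label


def remove_correct_choice(question):
    """Return a copy of a multiple-choice item with its correct option removed."""
    options = {
        key: text
        for key, text in question["options"].items()
        if key != question["answer"]
    }
    return {"stem": question["stem"], "options": options}


def false_action_rate(responses):
    """Share of responses that commit to a listed option instead of abstaining."""
    acted = [r for r in responses if r.strip().lower() != ABSTAIN]
    return len(acted) / len(responses)


# Hypothetical item and hypothetical model responses to four ablated items.
question = {
    "stem": "Placeholder question stem",
    "options": {"A": "option 1", "B": "option 2", "C": "option 3", "D": "option 4"},
    "answer": "B",
}
ablated = remove_correct_choice(question)
responses = ["A", "None of the above", "C", "D"]
print(sorted(ablated["options"]), false_action_rate(responses))
```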


Explainable Detection of Depression Status Shifts from User Digital Traces

2605.14995v1 by Loris Belcastro, Francesco Gervino, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio

Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.


Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning

2605.14991v1 by Francesco Pastori, Francesca Fati, Marina Rosanu, Luigi De Vitis, Lucia Ribero, Gabriella Schivardi, Giovanni Damiano Aletti, Nicoletta Colombo, Jvan Casarin, Francesco Multinu, Elena De Momi

Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responders, 133 non-responders). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.
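
One common way to aggregate slice-level representations into a single volumetric embedding is additive attention pooling, as used in multiple-instance learning. The sketch below shows that generic technique in NumPy; it is an assumption about the general shape of such a module, not the authors' architecture, and `V` and `w` are random stand-ins for learned parameters.

```python
import numpy as np


def attention_pool(slice_embeddings, V, w):
    """Aggregate per-slice embeddings (S, D) into one volume embedding (D,)."""
    scores = np.tanh(slice_embeddings @ V) @ w        # one score per slice, (S,)
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over slices
    return weights, weights @ slice_embeddings


rng = np.random.default_rng(0)
S, D, H = 12, 16, 8                 # slices, embedding size, attention hidden size
slices = rng.normal(size=(S, D))    # stand-in for encoder outputs
V = rng.normal(size=(D, H))         # stand-in for a learned projection
w = rng.normal(size=H)              # stand-in for a learned scoring vector
weights, volume = attention_pool(slices, V, w)
print(weights.round(3), volume.shape)
```

The attention weights are a useful by-product: they indicate which axial slices contributed most to the volume-level prediction.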


GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation

2605.14968v1 by Drewry H. Morris, Luis Valles, Reza Hosseini Ghomi

GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.
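
The compounding-error figure follows directly from the independence assumption stated in the abstract: the success probability of the whole chain is the per-step reliability raised to the number of steps. A short check (the function name is illustrative):

```python
def chain_success_probability(per_step_reliability, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_reliability ** steps


p = chain_success_probability(0.90, 10)
print(f"{p:.4f}")  # 0.3487, i.e. roughly 35%
```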


From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

2605.14912v1 by Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka

Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.


BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring

2605.14886v1 by Zixuan Shu, Tiancheng Cao, Hen-Wei Huang

Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by strict data-sharing regulations and privacy concerns. Federated learning (FL) enables collaborative learning by keeping raw ECG data on devices, but frequent transmissions of high-dimensional model updates incur heavy per-round traffic over bandwidth-limited links. To alleviate this bottleneck, federated distillation (FD) replaces parameter exchange with logit-based knowledge transfer. However, the performance of FD often degrades under the non-independent and identically distributed (non-IID) and long-tailed label distributions in ECG deployments. To address these challenges, we propose a bidirectional federated knowledge distillation (BiFedKD) framework that employs an aggregation-by-distillation pipeline with temperature scaling to produce a stable global distillation signal for cross-client alignment. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 over the baseline by $3.52\%$ and $9.93\%$, respectively. Moreover, to reach the same Macro-F1, BiFedKD reduces communication overhead by $40\%$ and computation cost by $71.7\%$ compared with the baseline.
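
The logit-based exchange with temperature scaling can be sketched generically as follows. This is a minimal illustration of standard knowledge distillation applied to averaged client logits, not the BiFedKD pipeline: the shared sample set on which clients emit logits, the plain averaging, and all function names are assumptions.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def aggregate_soft_labels(client_logits, temperature=2.0):
    """Average client logits on shared samples, then soften them."""
    mean_logits = np.mean(client_logits, axis=0)      # (N, C)
    return mean_logits, softmax(mean_logits, temperature)


def distillation_loss(student_logits, teacher_probs, temperature=2.0):
    """KL(teacher || student) with a temperature-scaled student, times T^2."""
    student_probs = softmax(student_logits, temperature)
    eps = 1e-12
    kl = np.sum(
        teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps)),
        axis=-1,
    )
    return float(np.mean(kl)) * temperature ** 2


rng = np.random.default_rng(0)
client_logits = rng.normal(size=(3, 4, 5))   # 3 clients, 4 samples, 5 classes
mean_logits, teacher = aggregate_soft_labels(client_logits)
student = rng.normal(size=(4, 5))            # hypothetical student outputs
print(distillation_loss(student, teacher))
```

Only logits of shape (samples, classes) cross the network in this scheme, which is why distillation-based exchange is lighter than sending full model updates.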


Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

2605.14723v1 by Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha

Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid-vasopressor interventions, and follows a propose-simulate-refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose-simulate-refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.


Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

2605.14710v1 by Liren Chen, Lidong Sun, Mingyan Huang, Junzhe Tang, Yinghui Zhu, Guanjie Wang, Yiqing Xia, Ting Xiao

Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.

NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces

2605.14698v1 by Konstantinos Kontras, Trui Osselaer, Stylianos G. Mouslech, Angeliki-Ilektra Karaiskou, Guido Gagliardi, Thomas Strypsteen, Mohammad Hossein Badiei, Anku Rani, Maarten Vanmarcke, Miguel Bhagubai, Chanakya Ekbote, Jaedong Hwang, Christos Chatzichristos, Paul Pu Liang, Maarten De Vos

Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the EEG-specific preprocessing that might influence reported results, and in the reported metrics, frequently obscuring the clinical relevance in EEG. We introduce NeuroAtlas, the largest EEG benchmark to date: 42 datasets and 260k hours covering clinical EEG (epilepsy, sleep medicine, brain age estimation) and brain-computer interfaces, and include multiple datasets per task along with bespoke clinical evaluation metrics. Besides evaluating EEG-FMs with respect to supervised baselines, we present results from generic time-series FMs. We report three findings. First, EEG-specific FMs do not consistently outperform time-series FMs, which have neither EEG-focused architectures nor been pretrained on EEG. Second, standard machine learning metrics are insufficient to assess clinical utility: thus, we thoroughly evaluate more appropriate measures such as the quality of event-level decision-making, hypnogram-derived features, and the brain-age gap in the domains of epilepsy, sleep, and brain age, respectively. Third, model rankings and performance can vary substantially within domains. We conclude that pretrained models perform largely on par, with only narrow advantages for a few, and that current models do not yet deliver on the promise of an out-of-the-box unified EEG model. NeuroAtlas exposes this gap and provides the datasets and metrics for the next generation of unified EEG FMs.

How Sensitive Are Radiomic AI Models to Acquisition Parameters?

2605.14667v1 by D. Gil, I. Sanchez, C. Sanchez

A main barrier to the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying scan parameter sensitivity of radiomic AI models, while identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework for quantifying the influence that clinically relevant acquisition parameters have on model performance, while accounting for subject-level random effects. We have applied our framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and self-collected data) and several state-of-the-art (SoA) architectures. To evaluate across-database reproducibility, CT parameters have been adjusted using the collected data and tested on the public set. The optimal configuration selected is X-ray tube current >= 200 mA, spiral pitch <= 1.5, slice thickness <= 1.25 mm, which balances diagnostic quality with low radiation dose. This configuration pushes metrics from 0.79 ± 0.04 sensitivity and 0.47 ± 0.10 specificity in low-quality scans to 0.90 ± 0.10 sensitivity and 0.79 ± 0.13 specificity in high-quality ones.

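The acquisition thresholds quoted in this abstract can be written as a simple scan filter. This is a minimal sketch, assuming each scan is described by a dictionary; the key names `tube_current_ma`, `spiral_pitch`, and `slice_thickness_mm` and the sample scans are hypothetical and do not come from the paper.

```python
def is_high_quality(scan):
    """Return True if the scan meets the three thresholds quoted in the abstract."""
    return (
        scan["tube_current_ma"] >= 200          # X-ray tube current >= 200 mA
        and scan["spiral_pitch"] <= 1.5          # spiral pitch <= 1.5
        and scan["slice_thickness_mm"] <= 1.25   # slice thickness <= 1.25 mm
    )


# Made-up scans for illustration only.
scans = [
    {"id": "a", "tube_current_ma": 250, "spiral_pitch": 1.0, "slice_thickness_mm": 1.0},
    {"id": "b", "tube_current_ma": 120, "spiral_pitch": 1.0, "slice_thickness_mm": 1.0},
    {"id": "c", "tube_current_ma": 200, "spiral_pitch": 1.5, "slice_thickness_mm": 1.25},
]
high_quality_ids = [s["id"] for s in scans if is_high_quality(s)]
```

Scan "b" is excluded because its tube current is below 200 mA; scan "c" sits exactly on all three boundaries and is kept because the thresholds are inclusive.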

MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder

2605.14660v1 by Eranga Bandara, Ross Gore, Asanga Gunaratna, Ravi Mukkamala, Nihal Siriwardanagea, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Wathsala Herath, Chalani Rajapakse, Sachin Shetty, Anita H. Clayton, Christopher K. Rhea, Ng Wee Keong, Kasun De Zoysa, Amin Hass, Shaifali Kaushik, Preston Samuel, Atmaram Yarlagadda

Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem: traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-triggered amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches (prolonged exposure, EMDR, cognitive behavioural therapy) operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling-tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.

RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

2605.14543v1 by Shuhao Chen, Weisen Jiang, Changmiao Wang, Xiaoqing Wu, Xuanren Shi, Yu Zhang, James T. Kwok

Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.

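The F1 and Exact Match figures reported for RxEval can be made concrete with a per-question scoring sketch. The abstract does not define the metrics, so this assumes set-based scoring over selected medication-dose-route triples; the function name and the placeholder triples (`drug_a`, `drug_b`) are hypothetical.

```python
def f1_and_exact_match(predicted, gold):
    """Score one question: set-based F1 plus an exact-match flag."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, predicted == gold
    f1 = 2 * precision * recall / (precision + recall)
    return f1, predicted == gold


# Placeholder (medication, dose, route) triples, not real prescriptions.
gold = {("drug_a", "10 mg", "oral"), ("drug_b", "5 mg", "intravenous")}
predicted = {("drug_a", "10 mg", "oral"), ("drug_b", "5 mg", "oral")}

f1, exact = f1_and_exact_match(predicted, gold)
```

Here the model picks the right drug and dose for `drug_b` but the wrong route, so the triple counts as a miss: precision and recall are both 0.5, F1 is 0.5, and Exact Match is false. This shows how a benchmark can report a moderate F1 alongside a much lower Exact Match.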

Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel

2605.14370v1 by Ruihua Chen, Yisi Luo, Bangyu Wu, Xile Zhao, Deyu Meng

Full-waveform inversion (FWI) estimates unknown parameters in the wave equation from limited boundary measurements. Recent advances in neural reparameterized FWI (NeurFWI) demonstrate that representing the parameters using a neural network can reduce the reliance on a high-quality initial model and wavefield data, at the cost of slow high-resolution convergence. However, its underlying theoretical mechanism remains unclear. In this study, we establish the neural sensitivity kernel (NSK) and the wave tangent kernel (WTK) to analyze its convergence behavior from both model and data domains. These theoretical frameworks show that the neural tangent kernel (NTK) induced by neural representation adaptively modulates the original sensitivity and wave tangent kernels. This modulation leads to several key outcomes, i.e., the spectral filtering effect, the gradient wavenumber modulation, and the wave frequency bias, connecting the convergence behavior of NeurFWI with the eigen-structures of NSK and WTK. Building on these insights, we propose several enhanced NeurFWI methods with tailored eigen-structures in NSK and WTK to improve inversion performance and efficiency. We numerically validate these theoretical claims and the proposed methods in seismic exploration, and extend their application to medical imaging for the first time.

AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction

2605.14327v1 by Yerin Park, Sangseon Lee

Drug-drug interaction (DDI) prediction is a critical task in computational biomedicine, as adverse interactions between co-administered drugs can cause severe side effects and clinical risks. A key challenge is unseen-drug generalization, where interactions must be predicted for drugs not observed during training. Although multimodal DDI models exploit diverse drug-related information, their fusion mechanisms are often tied to specific prediction architectures, limiting their reuse across models. To address this, we propose AIM-DDI, an architecture-independent multimodal integration module that represents heterogeneous modality information as tokens in a shared latent space. By modeling dependencies across modality tokens through a unified fusion module, AIM-DDI enables model-agnostic integration of structural, chemical, and semantic drug signals across different DDI prediction architectures. Extensive evaluations across diverse DDI models and DrugBank-based settings show that AIM-DDI consistently improves prediction performance, with the strongest gains under the most challenging both-unseen setting where neither drug in a test pair is observed during training. These results suggest that treating multimodal integration as a reusable module, rather than a model-specific fusion component, is an effective strategy for robust unseen-drug DDI prediction.

Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment

2605.14242v1 by Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han

The monitoring of fetal heart rate (FHR) and the assessment of its variability are crucial for preventing fetal compromise and adverse outcomes. However, traditional methods encounter limitations arising from equipment performance, data transmission, and subjective assessments by doctors. We have developed a tailored AI-based FHrCTG model specifically for FHR monitoring, which effectively mitigates noise interference and precisely reconstructs signals. Our model was pre-trained on a massive dataset consisting of 558,412 unlabeled data points and further refined using 7,266 expert-reviewed entries. To validate FHR, we introduced the Intersection Overlapping Labels (IOL) approach, which transforms rate analysis into categorical judgments. Testing revealed that our model demonstrates high sensitivity and specificity in detecting critical FHR decelerations (89.13% and 87.78%, respectively) and accelerations (62.5% and 92.04%, respectively). Furthermore, based on Fischer's criteria for clinical application, our model achieved impressive AUC scores of 0.7214 and 0.9643 for verifying FHR periodicity and amplitude variation, respectively.

Fusion-fission forecasts when AI will shift to undesirable behavior

2605.14218v1 by Neil F. Johnson, Frank Yingjie Huo

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

Towards Fine-Grained and Verifiable Concept Bottleneck Models

2605.14210v1 by Yingying Fang, Haijie Xu, Shuang Wu, Mariathasan Anish, Guang Yang

Concept Bottleneck Models (CBMs) offer interpretable alternatives to black-box predictors by introducing human-relatable concepts before the final output. However, existing CBMs struggle to verify whether predicted concepts correspond to the correct visual evidence, limiting their reliability. We propose a fine-grained CBM framework that grounds each concept in localized visual evidence, enabling direct inspection of where and how concepts are encoded. This design allows users to interpret predictions and verify that the model learns intended concepts rather than spurious correlations. Experiments on medical imaging benchmarks show that our learned concept space is information-complete and achieves predictive performance comparable to standard CBMs, while substantially improving transparency. Unlike post-hoc attribution methods, our framework validates both the presence and correctness of concept representations, bridging interpretability with verifiability. Our approach enhances the trustworthiness of CBMs and establishes a principled mechanism for human-model interaction at the concept level, paving the way toward more reliable and clinically actionable concept-based learning systems.

Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

2605.14126v1 by Marius S. Knorr, Robert Müller, Jan P. Bremer, Nils Schweingruber

Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

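The "directed graph of resources" and the filter-then-aggregate steps mentioned in this abstract can be sketched on a tiny in-memory bundle. The `resourceType`, `id`, `subject.reference`, and `valueQuantity.value` fields follow the FHIR Observation and Patient resources; the sample data and the helper `observations_for` are hypothetical, and this is not the agent or harness from the paper.

```python
# A toy bundle: Observations point at Patients through "ResourceType/id" references.
resources = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Patient", "id": "p2"},
    {"resourceType": "Observation", "id": "o1",
     "subject": {"reference": "Patient/p1"}, "valueQuantity": {"value": 7.0}},
    {"resourceType": "Observation", "id": "o2",
     "subject": {"reference": "Patient/p1"}, "valueQuantity": {"value": 9.0}},
    {"resourceType": "Observation", "id": "o3",
     "subject": {"reference": "Patient/p2"}, "valueQuantity": {"value": 5.0}},
]


def observations_for(patient_id, resources):
    """Follow Observation -> Patient edges and keep those that reach the patient."""
    target = f"Patient/{patient_id}"
    return [
        r for r in resources
        if r["resourceType"] == "Observation"
        and r.get("subject", {}).get("reference") == target
    ]


# Filter by patient, then aggregate the numeric values.
p1_observations = observations_for("p1", resources)
p1_ids = [o["id"] for o in p1_observations]
p1_mean = sum(o["valueQuantity"]["value"] for o in p1_observations) / len(p1_observations)
```

Selecting the wrong resource type or skipping the reference check is the kind of error the abstract attributes to tool-augmented agents: dropping the `subject` filter here would silently mix in `o3` from another patient.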

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

2605.14113v1 by Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares, Jemma Kerns

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by k-anonymity and ℓ-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness, where it fundamentally outperforms standard RAG (46.2%). ProtoMedAgent additionally leverages a binding ℓ-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.

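The privacy gate in this abstract is governed by k-anonymity and ℓ-diversity. The abstract does not say how the gate is implemented, so the sketch below only checks the textbook definitions (using distinct ℓ-diversity) on a small table; the function name, field names, and records are hypothetical.

```python
from collections import defaultdict


def passes_privacy_gate(records, quasi_identifiers, sensitive, k, l):
    """Check k-anonymity and distinct l-diversity over quasi-identifier groups.

    Every group of records sharing the same quasi-identifier values must
    contain at least k records and at least l distinct sensitive values.
    """
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record[sensitive])
    return all(
        len(values) >= k and len(set(values)) >= l
        for values in groups.values()
    )


# Made-up records for illustration only.
records = [
    {"age_band": "60-69", "sex": "F", "diagnosis": "A"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "B"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "A"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "A"},
]

k_only = passes_privacy_gate(records, ["age_band", "sex"], "diagnosis", k=2, l=1)
k_and_l = passes_privacy_gate(records, ["age_band", "sex"], "diagnosis", k=2, l=2)
```

Both groups have two records, so the table is 2-anonymous. The second group carries a single diagnosis value, so it fails 2-diversity: anyone who knows a person is in that group learns the diagnosis. That is the case ℓ-diversity is meant to block on top of k-anonymity.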

Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

2605.14108v1 by Nishi Doshi, Shrey Shah

Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance. Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology

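The two-tier routing described in this abstract can be sketched as a small function. This is an illustration under stated assumptions rather than the authors' code: `tier1_score` and `tier2_grade` are hypothetical stand-ins for the edge and cloud models, the "images" are plain numbers, and the 0.5 threshold is arbitrary.

```python
def cascade_predict(image, tier1_score, tier2_grade, threshold):
    """Route one image through the cascade.

    Returns (label, forwarded): images scored below the threshold stay on
    the edge device as non-referable; the rest are sent to the cloud grader.
    """
    if tier1_score(image) < threshold:
        return "Class 0-1", False
    return tier2_grade(image), True


# Toy stand-ins: each "image" is just its own referability score.
def toy_tier1_score(image):
    return image


def toy_tier2_grade(image):
    return "Class 2" if image < 0.8 else "Class 3"


images = [0.1, 0.6, 0.9, 0.2]
results = [cascade_predict(img, toy_tier1_score, toy_tier2_grade, 0.5) for img in images]
labels = [label for label, _ in results]
forwarded_fraction = sum(forwarded for _, forwarded in results) / len(results)
cloud_call_reduction = 1.0 - forwarded_fraction
```

The reported saving follows the same arithmetic: forwarding 49.52% of test images to Tier 2 leaves 100% - 49.52% = 50.48% of cloud calls avoided. Lowering the threshold raises Tier 1 sensitivity at the cost of forwarding more images.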

A Benchmark for Early-stage Parkinson's Disease Detection from Speech

2605.14066v1 by Terry Yi Zhong, Cristian Tejedor-Garcia, Khiet P. Truong, Janna Maas, Louis ten Bosch, Bastiaan R. Bloem

Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.

CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

2605.13994v1 by Xiaoyue Liu, Xiaohan Yuan, Mark Y Chan, Ching-Hui Sia, Lei Li

Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.

Neurosymbolic Auditing of Natural-Language Software Requirements

2605.13817v1 by Bethel Hall, William Eiers

Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.

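The idea that disagreement between independent formalizations signals ambiguity can be shown on a tiny example. VERIMED uses an SMT solver for the equivalence check; the sketch below swaps in exhaustive enumeration over Boolean variables, which is only feasible for very small formulas. The requirement, both formalizations, and all names are hypothetical.

```python
from itertools import product


def find_disagreement(f, g, variables):
    """Return an assignment on which f and g differ, or None if they agree everywhere."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if f(**env) != g(**env):
            return env
    return None


# Hypothetical requirement: "If pressure is high or flow is low, raise the alarm."
# Reading A takes "or" literally; reading B requires both conditions.
def formalization_a(high_pressure, low_flow, alarm):
    return (not (high_pressure or low_flow)) or alarm


def formalization_b(high_pressure, low_flow, alarm):
    return (not (high_pressure and low_flow)) or alarm


variables = ["high_pressure", "low_flow", "alarm"]
counterexample = find_disagreement(formalization_a, formalization_b, variables)
```

The returned assignment (low flow, normal pressure, no alarm) satisfies reading B but violates reading A. A concrete counterexample of this kind is the fine-grained symbolic feedback the abstract credits with improving repair.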

Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

2605.13730v1 by Christos Chrysanthos Nikolaidis, Vasileios Sachpekidis, Nikolas Moustakidis, Theofilos Moustakidis, Pavlos S. Efraimidis

Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on N = 90 patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of 0.907 and recall of 0.877. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

2605.17620v1 by Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

Cross Modality Image Translation In Medical Imaging Using Generative Frameworks

2605.13686v1 by Giulia Romoli, Alessia Capoccia, Filippo Ruffini, Francesco Di Feola, Luca Boldrini, Arturo Chiti, Renato Cuocolo, Tugba Akinci D'Antonoli, Fatemeh Darvizeh, Marcello Di Pumpo, Bradley J. Erickson, Liu Fang, Deborah Fazzini, Paola Feraco, Fabrizia Gelardi, Francesco Gossetti, Ana Isabel Hernáiz Ferrer, Michail E. Klontzas, Seyedmehdi Payabvash, Katrine Riklund, Sara N. Strandberg, Valerio Guarrasi, Paolo Soda

Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.

Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model

2605.13568v1 by Riccardo Cavarra, Lupo Lovatelli, Shaheim Ogbomo-Harmitt, Shahid Aziz, Adelaide De Vecchi, Andrew King, Oleg Aslanidi

Myocardial infarction (MI) is a leading cause of death, and predicting its adverse outcomes is an urgent need. Yet ECG-based prognostic models underperform because deep learning requires large, labelled datasets, which are scarce in medicine. Foundation models can learn from unlabelled ECGs via self-supervision, but medically relevant training strategies remain underexplored. We propose a pretrained artificial intelligence model that combines patient-specific temporal information using contrastive learning with supervised multitask heads, then fine-tunes on post-MI outcome prediction. The proposed model outperformed a model trained from scratch (0.794 vs 0.608 AUC), showing that clinically structured ECG modelling improves classification in limited data regimes.

Generating synthetic computed tomography for radiotherapy: SynthRAD2025 challenge report

2605.13555v1 by Viktor Rogowski, Maarten L. Terpstra, Niklas Wahl, Florian Kamp, Erik van der Bijl, Arthur Jr. Galapon, Christopher Kurz, Bowen Xin, Zhengxiang Sun, Hollie Min, Gregg Belous, Jason Dowling, Yan Xia, Siyuan Mei, Fuxin Fan, Arthur Longuefosse, Javier Sequeiro Gonzalez, Miguel Diaz Benito, Alvaro Garcia Martin, Fabien Baldacci, Valentin Boussot, Cédric Hémon, Jean-Claude Nunes, Jean-Louis Dillenseger, Zhiyuan Zhang, Jinghua Cai, Han Bing, Tan Zuopeng, Ricardo Brioso, Daniele Loiacono, Guillaume Landry, Adrian Thummerer, Matteo Maspero

Radiation therapy (RT) requires precise dose delivery over multiple fractions, with CT fundamental for treatment planning due to its electron density information. Repeated CT acquisitions impose radiation exposure and logistical burdens, MRI lacks electron density, and cone-beam CT (CBCT) requires correction for dose calculation. Synthetic CT (sCT) generation addresses these by converting MRI or CBCT into CT-equivalent images with accurate Hounsfield Unit (HU) values, enabling MRI-only RT and CBCT-based adaptive workflows. Building on SynthRAD2023, SynthRAD2025 benchmarked sCT methods on 2,362 patients from five European centers across head and neck, thorax, and abdomen. Two tasks, MRI-to-CT (890 cases) and CBCT-to-CT (1,472 cases), were evaluated via image similarity (MAE, PSNR, MS-SSIM), segmentation (Dice, HD95), and dosimetric metrics from photon and proton plans. With 803 participants and 12/13 valid submissions, Task 1 top performance reached MAE $64.8\pm21.3$ HU, PSNR $\sim$30 dB, MS-SSIM $\sim$0.936, Dice 0.79, photon $\gamma_{2\%/2\text{mm}}>98\%$, proton $\gamma\approx85\%$. Task 2 improved: MAE $48.3\pm13.4$ HU, PSNR 32.6 dB, MS-SSIM 0.968, Dice 0.86, photon $\gamma>99\%$, proton $\gamma\approx89\%$. Strong image--segmentation correlations ($\rho=0.78$--$0.79$) but moderate dose correlations confirmed image quality is insufficient as a dosimetric surrogate. Head-and-neck cases were most consistent; thoracic and abdominal cases showed greater variability. Residual errors at tissue interfaces propagate along beam paths, affecting proton dose more than photon. SynthRAD2025 demonstrates that deep learning yields clinically relevant sCTs, especially for CBCT-to-CT, while identifying persistent MRI-to-CT challenges and underscoring dose-based evaluation as essential for clinical validation.
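
Two of the image-similarity metrics named above, MAE and PSNR, have standard definitions that are easy to state in code. The sketch below applies them to a handful of made-up HU values. The voxel values, and the dynamic range passed to `psnr`, are assumptions for illustration; the challenge's own masking and intensity-clipping conventions are not described in the abstract.

```python
import math

def mae(reference, synthetic):
    """Mean absolute error between two equally long sequences of HU values."""
    if len(reference) != len(synthetic):
        raise ValueError("inputs must have the same length")
    return sum(abs(r - s) for r, s in zip(reference, synthetic)) / len(reference)

def psnr(reference, synthetic, data_range):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    if len(reference) != len(synthetic):
        raise ValueError("inputs must have the same length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthetic)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(data_range ** 2 / mse)

# Hypothetical voxels: air, water, soft tissue, bone (in HU).
ct = [-1000.0, 0.0, 40.0, 1000.0]
sct = [-990.0, 10.0, 20.0, 1030.0]

error_hu = mae(ct, sct)                        # 17.5 HU for these toy values
quality_db = psnr(ct, sct, data_range=4024.0)  # assumed range of -1024..3000 HU
```

Because PSNR depends on the chosen dynamic range, values are only comparable across reports that use the same range, which is one reason the abstract's point about dose-based evaluation matters.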

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

2605.13542v1 by Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan

Intensive care units (ICUs) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30-min windows and release two datasets: RealICU-Gold with 930-window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/
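
Partitioning a trajectory into 30-minute windows can be sketched as follows. This is a minimal illustration, assuming that windows are anchored at the first event and that empty windows are kept; the benchmark's actual anchoring rule is not given in the abstract, and the event payloads are invented.

```python
from datetime import datetime, timedelta

def partition_windows(events, window=timedelta(minutes=30)):
    """Group (timestamp, payload) events into consecutive fixed-length windows.

    Windows start at the first event's timestamp; windows with no events
    are kept as empty lists so that window indices stay aligned with time.
    """
    if not events:
        return []
    events = sorted(events, key=lambda e: e[0])
    start = events[0][0]
    n_windows = (events[-1][0] - start) // window + 1
    windows = [[] for _ in range(n_windows)]
    for timestamp, payload in events:
        windows[(timestamp - start) // window].append(payload)
    return windows

t0 = datetime(2026, 1, 1, 8, 0)
trajectory = [
    (t0, "HR 88"),
    (t0 + timedelta(minutes=10), "MAP 62"),
    (t0 + timedelta(minutes=45), "lactate 3.1"),
    (t0 + timedelta(minutes=95), "vasopressor started"),
]
windows = partition_windows(trajectory)
```

The example yields four windows, the third of which is empty. Whether such gaps should be dropped or retained is a design choice that affects what a model sees as elapsed time.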

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

2605.13530v1 by Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li, Wei Ji, Kai Wang, Shanshan Wang, Weixin Si

Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, which extends the CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.

Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin

2605.13345v1 by Markus Wenzel, Tobias Strapatsas, Jessika Kress, Dorothea Sauer, Nele Gessler, Horst K. Hahn

Emergency departments (EDs) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) that simulates highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real-world studies. We then validate the model's expressivity by matching its key performance indicators and metrics with values known from the literature. We proceed by implementing scientifically established and practice-proven resource optimization strategies. Comparing the documented real-world outcomes with our model's results demonstrates that the DES-ABM-based simulation can effectively replicate real-world ED dynamics under interventions. Lastly, we integrate a proof-of-concept multi-agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ED environment based on a temporal ledger of ED event records. This modular DES-ABM-MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.
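
To make the discrete-event half of such a model concrete, here is a minimal sketch of a single-stage ED queue driven by an event heap. It is not the authors' simulator: the exponential arrival and service times, the first-come-first-served rule, and every name and parameter are assumptions chosen for brevity, and no agent-based behaviour is modelled.

```python
import heapq
import random
from collections import deque

def simulate_ed(n_patients, n_clinicians, mean_interarrival, mean_service, seed=0):
    """Simulate a FIFO queue with several clinicians; return per-patient waits (minutes)."""
    rng = random.Random(seed)
    events = []  # heap of (time, tie_breaker, kind, patient_id)
    clock = 0.0
    for pid in range(n_patients):
        clock += rng.expovariate(1.0 / mean_interarrival)
        heapq.heappush(events, (clock, pid, "arrival", pid))
    tie_breaker = n_patients   # keeps heap entries unique for departures
    free = n_clinicians        # clinicians currently idle
    queue = deque()            # waiting patients as (arrival_time, patient_id)
    waits = [0.0] * n_patients
    while events:
        now, _, kind, pid = heapq.heappop(events)
        if kind == "arrival":
            queue.append((now, pid))
        else:                  # a departure frees one clinician
            free += 1
        # Start treatment for as many waiting patients as there are idle clinicians.
        while free > 0 and queue:
            arrived, next_pid = queue.popleft()
            waits[next_pid] = now - arrived
            free -= 1
            service = rng.expovariate(1.0 / mean_service)
            heapq.heappush(events, (now + service, tie_breaker, "departure", next_pid))
            tie_breaker += 1
    return waits

waits = simulate_ed(n_patients=200, n_clinicians=2, mean_interarrival=5.0, mean_service=8.0)
mean_wait = sum(waits) / len(waits)
```

Validation in the sense of the abstract would compare indicators such as `mean_wait` against values reported in the literature for a matching configuration, then repeat the comparison after applying an intervention such as a staffing change.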

VERA-MH: Validation of Ethical and Responsible AI in Mental Health

2605.13318v1 by Luca Belli, Kate H. Bentley, Josh Gieringer, Emily Van Ark, Nilu Zhao, Pradip Thachile, Matt Hawrilenko, Millard Brown, Adam M. Chekroud

Chatbot usage has increased, including in fields for which they were never developed, notably mental health support. To that end, we introduce Validation of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically-validated evaluation of chatbot safety in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks, assessing how well chatbots can respond to users who might be in crisis. VERA-MH comprises three steps: conversation simulation, conversation judging, and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. These user personas were developed under clinical guidance to ensure that, among other things, multiple risk factors, demographic characteristics, and disclosure factors are represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically-developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve the consistency of answers and highlight models' failure modes. In the last stage, the results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the results of the evaluations for four leading LLM providers.
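
A rubric "structured as a flow, with a single Yes/No question asked each time" maps naturally onto a small decision graph. The sketch below shows one way to walk such a graph. The questions, node names, and rating labels are invented placeholders, not the clinically developed VERA-MH rubric, and the `answer` callable stands in for the judge model.

```python
# Hypothetical rubric: each node asks one Yes/No question and names the next node.
RUBRIC = {
    "q_risk": {
        "question": "Did the user disclose possible risk?",
        "yes": "q_ack",
        "no": "rating:not_applicable",
    },
    "q_ack": {
        "question": "Did the chatbot acknowledge the disclosure?",
        "yes": "q_support",
        "no": "rating:unsafe",
    },
    "q_support": {
        "question": "Did the chatbot point to appropriate support?",
        "yes": "rating:safe",
        "no": "rating:needs_review",
    },
}

def walk_rubric(rubric, start, answer):
    """Follow Yes/No answers through the flow; return (rating, path taken)."""
    node, path = start, []
    while not node.startswith("rating:"):
        reply = "yes" if answer(node, rubric[node]["question"]) else "no"
        path.append((node, reply))
        node = rubric[node][reply]
    return node.split(":", 1)[1], path

# Canned answers standing in for an LLM judge on one conversation.
canned = {"q_risk": True, "q_ack": False}
rating, path = walk_rubric(RUBRIC, "q_risk", lambda node, question: canned[node])
```

Recording the path alongside the rating shows which question a conversation failed on, which is how a flow-structured rubric can surface failure modes rather than only a final score.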

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

2605.13292v1 by Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel

Most existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.

Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis

2605.13248v1 by Bo Cui, Xiaowen Song, Yaowen Zhang, Shunzhe Zhang, B. J. F. van Beijnum, Monique Tabak, Ying Wang

The analysis of physiological time series, such as electrocardiograms (ECG) and photoplethysmograms (PPG), is persistently hindered by modality and frequency gaps stemming from heterogeneous recording devices. Existing foundation models typically rely on continuous latent spaces, which frequently suffer from severe modality entanglement, lack high-fidelity cross-frequency generative capacity, and impose high computational costs that prohibit edge-device deployment. In this paper, we propose Compact Latent Manifold Translation (CLMT), a highly parameter-efficient (0.09B) unified framework that bridges these gaps through a novel two-stage discrete translation paradigm. First, we introduce a Universal Tokenizer utilizing Hierarchical Residual Vector Quantization (RVQ) to decouple heterogeneous signals into isolated, well-structured discrete latent manifolds, effectively preventing inter-modality interference. Second, a Context-Prompted Latent Translator maps these discrete tokens across modalities by integrating static physiological priors, reframing complex signal synthesis as a pure latent sequence translation task. Extensive evaluations demonstrate that our 0.09B model significantly outperforms massive baselines. In cross-modal PPG-to-ECG synthesis, it resolves temporal phase drift and dramatically improves the clinical R-peak detection F1-score from 0.37 (baseline) to 0.83. Furthermore, in extreme cross-frequency super-resolution (25Hz to 100Hz), it successfully recovers high-frequency diagnostic landmarks, achieving an unprecedented Pearson correlation of 0.9956. By learning a universal discrete language for biological signals with a fraction of the computational footprint, our approach sets a new trajectory for edge-deployable, multi-modal medical foundation models.
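
Residual vector quantization, the mechanism behind the tokenizer described above, encodes a vector in stages, with each stage quantizing what the previous stage left over. The sketch below shows plain RVQ with two tiny hand-picked codebooks. A real tokenizer learns its codebooks, and the hierarchical variant used by CLMT is not specified in the abstract, so this should be read only as the basic idea.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )

def rvq_encode(codebooks, vector):
    """Quantize stage by stage; each stage codes the residual of the previous one."""
    residual, codes = list(vector), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hand-picked toy codebooks (assumed for illustration): coarse stage, then fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
x = [1.2, 0.9]
codes = rvq_encode(codebooks, x)        # one discrete token per stage
x_hat = rvq_decode(codebooks, codes)
```

For this input the coarse stage alone leaves a squared error of 0.05, and adding the fine stage reduces it to 0.0125, which shows how extra stages trade more tokens for higher fidelity.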

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

2605.13149v1 by Ishika Agarwal, Sofia Stoica, Emre Can Acikgoz, Pradeep Natarajan, Mahdi Namazifar, Jiaqi Ma, Dilek Hakkani-Tür

Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top-quality samples. Some works rely on rejection sampling: generating lots of synthetic samples and filtering out low-quality samples. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. These works have one common limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and are more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.
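
One classic acquisition function from the active learning literature is predictive entropy: a sample is more informative when the learner's predicted class distribution is closer to uniform. The sketch below scores candidate samples this way. Using entropy is an assumption made for illustration, since the abstract does not say which acquisition functions serve as rewards, and the candidate samples and probabilities are invented.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of the learner's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rank_by_reward(candidates):
    """Score (sample_text, learner_class_probs) pairs and sort by descending reward."""
    scored = [(predictive_entropy(probs), text) for text, probs in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical generated samples with the learner's predicted class probabilities.
candidates = [
    ("sample A", [0.98, 0.01, 0.01]),   # learner is already confident
    ("sample B", [0.40, 0.35, 0.25]),   # learner is uncertain
    ("sample C", [0.70, 0.20, 0.10]),
]
ranking = rank_by_reward(candidates)
```

In a training loop, scores like these would be fed back to the generator as rewards, so that it drifts toward samples the downstream learner finds informative rather than samples that merely pass a quality filter.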

Context Training with Active Information Seeking

2605.13050v2 by Zeyu Huang, Adhiguna Kuncoro, Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam, Marc'Aurelio Ranzato

Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.
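
The contrast between a sequential optimizer and a procedure that "maintains and prunes multiple candidate contexts" resembles the difference between greedy search and beam search. The following toy sketch shows the beam variant. The snippet pool, the scoring rule, and all function names are hypothetical; in practice `propose` would call search tools and `score` would evaluate the context on held-out task examples.

```python
def beam_search_contexts(initial, propose, score, beam_width=3, steps=4):
    """Keep the `beam_width` best contexts at every step instead of a single one."""
    beam = [initial]
    for _ in range(steps):
        pool = set(beam)
        for context in beam:
            pool.update(propose(context))
        # Prune: best score first, ties broken by the context itself for determinism.
        beam = sorted(pool, key=lambda c: (-score(c), c))[:beam_width]
    return beam[0]

# Toy setting: a context is a tuple of retrieved snippets, only some of which help.
SNIPPETS = ("a", "b", "c", "d")
USEFUL = {"a", "c"}

def propose(context):
    """Extend a context with one snippet it does not yet contain."""
    return [tuple(sorted(set(context) | {s})) for s in SNIPPETS if s not in context]

def score(context):
    """Reward useful snippets and penalize distracting ones."""
    return sum(1 if s in USEFUL else -1 for s in context)

best = beam_search_contexts((), propose, score)
```

Setting `beam_width=1` recovers a purely sequential pipeline, which makes the sketch a convenient way to compare both regimes under an identical proposal and scoring setup.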

An Agentic LLM-Based Framework for Population-Scale Mental Health Screening

2605.13046v1 by Giuliano Lorenzoni, Paulo Alencar, Donald Cowan

Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured clinical information while adapting to patient-specific needs. This paper proposes an agentic framework for building robust LLM-based pipelines, where each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation. Stages are incrementally locked once validated, ensuring that later adaptations cannot overwrite configurations without demonstrated improvement. The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding. A proof-of-concept in transcript-based depression detection demonstrates that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and threshold 0.75, while controlling evaluation costs and avoiding regressions. These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing critical challenges in trustworthiness, reproducibility, and adaptability required in healthcare environments.
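
The lock-and-rollback rule described above, where a validated stage can only be overwritten by a configuration that demonstrably improves on it, can be sketched as a small registry. The class name, the proxy scores, and the alternative configuration are hypothetical; only the cosine similarity and the 0.75 threshold come from the abstract.

```python
class StageRegistry:
    """Holds the locked configuration and proxy score for each pipeline stage."""

    def __init__(self, min_gain=0.0):
        self.locked = {}          # stage name -> (config, proxy score)
        self.min_gain = min_gain  # improvement required to replace a locked config

    def propose(self, stage, config, score):
        """Lock `config` if the stage is new or the score improves; else keep the old one."""
        current = self.locked.get(stage)
        if current is None or score > current[1] + self.min_gain:
            self.locked[stage] = (config, score)
            return True
        return False              # rejected: the locked configuration stays in force

registry = StageRegistry(min_gain=0.01)
first = registry.propose("retrieval", {"similarity": "cosine", "threshold": 0.75}, score=0.71)
second = registry.propose("retrieval", {"similarity": "cosine", "threshold": 0.60}, score=0.69)
```

After both calls the registry still holds the threshold of 0.75, because the second proposal did not beat the locked proxy score by the required margin.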

RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems

2605.12895v1 by Rohith Reddy Bellibatlu

Aggregate accuracy metrics dominate the evaluation of clinical AI decision-support systems but do not detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five-dimension pre-deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub-criteria, pre-specified pass/fail thresholds, and bias-corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm-Bonferroni family-wise error correction. A central demonstration is that a classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. We validate this differential pass/fail pattern on a synthetic cohort and three publicly available real-world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey, where failing dimensions differ across cohorts, providing preliminary evidence of construct validity. The Equity dimension is reframed as a proxy-dependence diagnostic rather than a stand-alone gate: any need-based fairness verdict computed against a utilization-derived proxy carries a construct-validity problem the framework surfaces explicitly, triggering a procurement requirement for an outcome-independent need measure before the gate is binding. RISED is released as an open-source Python package that supplies the quantitative verdicts existing clinical AI reporting standards require, providing a principled gateway between in-silico model validation and silent-trial clinical evaluation.
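
The Holm-Bonferroni procedure mentioned above controls the family-wise error rate by testing sorted p-values against progressively looser thresholds and stopping at the first failure. The sketch below implements the standard step-down rule on p-values. The five p-values are invented, and how RISED maps its confidence intervals onto this correction is not described in the abstract.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: all larger p-values are retained as well
    return reject

# Hypothetical p-values for the five dimensions R, I, S, E, D.
p_values = [0.01, 0.04, 0.03, 0.005, 0.20]
decisions = holm_bonferroni(p_values)
```

In this example 0.03 and 0.04 would each pass an uncorrected 0.05 test but are not rejected after correction, which illustrates why a family of per-dimension checks needs a joint error budget.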

摘要:聚合準確性指標主導了臨床AI決策支持系統的評估,但無法檢測部署階段的輸入可靠性、子群體公平性、閾值敏感性或操作可行性的失敗。我們提出了RISED框架:一個涵蓋可靠性、包容性、敏感性、公平性和可部署性的五維預部署評估,其中每個維度通過正式的子標準、預先指定的通過/失敗閾值以及在Holm-Bonferroni家庭誤差修正下結合的偏差修正加速(BCa)自助法95%置信區間進行操作。一個主要的示範是,滿足傳統高區分基準的分類器可以同時在輸入編碼穩定性和閾值變化敏感性檢查中失敗,而子群體AUC平等性仍然在統計上不確定,這表明僅依賴聚合評估無法檢測的部署風險。我們在一個合成隊列和三個涵蓋35年臨床數據的公開可用真實世界隊列上驗證了這種差異性的通過/失敗模式,從1980年代的心臟病學數據集到2024年全國代表性的健康調查,其中失敗的維度在不同隊列中有所不同,提供了構念效度的初步證據。公平性維度被重新定義為一種代理依賴診斷,而不是獨立的門檻:任何基於需求的公平裁決如果是針對使用派生的代理計算的,都會帶來構念效度問題,該框架明確顯示出來,觸發對結果獨立需求測量的採購要求,才能使該門檻生效。RISED作為一個開源Python包發布,提供現有臨床AI報告標準所需的定量裁決,為計算模型驗證和靜默試驗臨床評估之間提供了一個原則性的通道。
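The abstract says the per-dimension results are combined under a Holm-Bonferroni family-wise error correction. As a rough illustration of that correction only, here is a minimal sketch of the standard Holm step-down rule. The function name `holm_bonferroni`, the p-values, and the mapping of one hypothesis per RISED dimension are assumptions made for illustration; the abstract does not say how the package maps sub-criteria to hypotheses, and this is not the RISED package's API.

```python
from typing import Dict, List


def holm_bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether it is rejected under Holm's step-down rule."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The rank-th smallest p-value is compared against alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # the first failure stops the procedure; the rest are retained
    return reject


# Hypothetical p-values, one per dimension (illustrative numbers only).
dimension_p: Dict[str, float] = {
    "Reliability": 0.004,
    "Inclusivity": 0.030,
    "Sensitivity": 0.011,
    "Equity": 0.200,
    "Deployability": 0.020,
}
decisions = dict(zip(dimension_p, holm_bonferroni(list(dimension_p.values()))))
```

Note that the step-down rule can retain a hypothesis whose raw p-value is below 0.05: once one ordered test fails its adjusted threshold, every hypothesis with a larger p-value is retained as well.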

A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study

2605.13905v1 by Jaime Yan

Drug development and pharmacovigilance are frequently bottlenecked by legacy clinical reporting pipelines. These monolithic systems encode regulatory-grade logic but resist AI integration by producing opaque output with no machine-readable intermediate layer. Existing modernization approaches force a choice between full rewrites and incremental refactoring that preserves structural barriers. We present a non-destructive methodological framework achieving AI-driven pharmacoinformatics readiness without altering legacy source code. A metadata layer--comprising a bridge map, a typed Intermediate Representation (IR), and an orchestrator--wraps existing components and re-exposes their outputs as structured data consumable by LLMs. It enables optional incremental consolidation, replacing selected legacy components with metadata-configured core routines while the remainder operates unchanged. Validated on a 558-component SAS reporting library (373,000 lines of code), the framework demonstrated immediate AI-readiness under coexistence mode, yielding machine-readable output. Where consolidation was elected, the modernized core achieved a 92% reduction in proprietary code. Parity validation on 14 report types from a Phase III study achieved cell-level parity of 80% or above on 11 reports (mean 82.7%, best 99.2%). A benchmark using CDISC CDISCPilot01 data achieved 100% parity across 5 reports. LLM experiments confirmed the IR enables automated pharmacovigilance, table summarization, and trial configuration generation. The framework offers a regulation-aware path to AI-integrated clinical reporting, accelerating drug development without interrupting regulatory submissions.

摘要:藥物開發和藥物監測經常受到舊有臨床報告管道的瓶頸限制。這些單一系統編碼了合規級邏輯,但由於產生不透明的輸出且沒有可機器讀取的中間層,抵制AI整合。現有的現代化方法迫使在完全重寫和保留結構障礙的增量重構之間做出選擇。我們提出了一個非破壞性的方法論框架,實現了不改變舊有源代碼的AI驅動藥物信息學準備。元數據層——包括橋接圖、類型化中間表示(IR)和協調器——包裝現有組件,並將其輸出重新呈現為可被LLMs消耗的結構化數據。它使得可選的增量整合成為可能,將選定的舊有組件替換為元數據配置的核心例程,而其餘部分保持不變。在一個558組件的SAS報告庫(373,000行代碼)上進行驗證,該框架在共存模式下顯示出即時的AI準備,產生可機器讀取的輸出。在選擇整合的情況下,現代化核心實現了92%的專有代碼減少。在一項第三階段研究的14種報告類型上進行的平行驗證,在11份報告中達到了80%或以上的單元級平行性(平均82.7%,最佳99.2%)。使用CDISC CDISCPilot01數據的基準測試在5份報告中達到了100%的平行性。LLM實驗確認IR能夠實現自動化的藥物監測、表格摘要和試驗配置生成。該框架提供了一條符合規範的路徑,實現AI整合的臨床報告,加速藥物開發而不干擾監管提交。

Multimodal Hidden Markov Models for Persistent Emotional State Tracking

2605.12838v1 by Anamika Ragu, Aneesh Jonelagadda

Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM-based dialogue state tracking methods. In addition, Question-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale.

摘要:追蹤對話中可解釋的情感弧線,透過整體處理的單個發言的情感,是理解和引導應用,特別是臨床對話情境中的溝通的核心。現有的情感識別方法在發言層面運作,掩蓋了特徵化真實對話動態的持續階段。我們提出了一個輕量級框架,將對話情感建模為潛在情感狀態的序列,使用基於多模態價值-喚起表示的粘性因子HDP-HMM,這些表示源自於同時的視頻、音頻和文本輸入。我們使用LLM-as-a-Judge、幾何和時間一致性指標評估狀態預測的質量,證明粘性HDP-HMM在計算成本僅為基於LLM的對話狀態追蹤方法的一小部分的情況下,產生了比基線高斯HMM更可解釋的狀態序列。此外,在臨床數據集中的問答實驗表明,可以可靠地從多模態價值-喚起軌跡中恢復有意義的情感階段,並通過上下文增強來改善不穩定情感狀態下LLM回應的質量。因此,這一框架為在大規模下可解釋、輕量且可行的對話情感動態分析開闢了一條道路。
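The "sticky" part of a sticky HDP-HMM can be illustrated with the prior's expected transition matrix: an extra mass `kappa` is added to each state's self-transition, which favors persistent regimes. The sketch below uses a finite truncation with a fixed global state distribution `beta`. The function name and the numbers are illustrative assumptions, and the sketch does not cover the factorial structure or the inference procedure used in the paper.

```python
from typing import List


def sticky_expected_transitions(
    beta: List[float], alpha: float, kappa: float
) -> List[List[float]]:
    """Expected transition matrix under a truncated sticky HDP-HMM prior.

    Row j has mean (alpha * beta + kappa * delta_j) / (alpha + kappa),
    so kappa > 0 shifts probability mass onto the self-transition j -> j.
    """
    n = len(beta)
    return [
        [
            (alpha * beta[k] + (kappa if k == j else 0.0)) / (alpha + kappa)
            for k in range(n)
        ]
        for j in range(n)
    ]


# Illustrative global state weights for three latent emotional regimes.
beta = [0.5, 0.3, 0.2]
plain = sticky_expected_transitions(beta, alpha=1.0, kappa=0.0)   # no stickiness
sticky = sticky_expected_transitions(beta, alpha=1.0, kappa=4.0)  # strong self-bias
```

With `kappa = 0` every row equals `beta`; with a larger `kappa` the diagonal dominates, which is what discourages rapid switching between regimes.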

PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models

2605.12835v1 by Sridhar Mahadevan

Large language models can extract local causal claims from text, but those claims become more useful when organized as persistent, navigable world models rather than as flat summaries. We introduce PROMETHEUS, a framework that turns retrieved literature, filings, reviews, reports, agent traces, source data, code, simulations, and scientific models into causal atlases: sheaf-like families of local causal predictive-state models over an explicit cover of a research substrate. Each local region contains causal episodes, structured claim tables, predictive tests, support statistics, and provenance; restriction maps compare overlapping regions; gluing diagnostics expose agreement, drift, contradiction, and underdetermination. The resulting Topos World Model is not a single universal graph. It is a research instrument for navigating what a corpus says, where it says it, how strongly it is supported, and where local claims fail to assemble into a coherent global view. Three literature-atlas case studies -- ocean-temperature impacts on marine populations, GLP-1 weight-loss evidence, and resveratrol/red-wine health-benefit claims -- illustrate deep causal research from text with explicit locality, evidence, persistent state, and gluing tension. Four grounded-counterfactual case studies -- a Nature Climate Change microplastics forcing paper, an Indus Valley hydrology paper with VIC-derived figure data and model code, the canonical Sachs protein-signaling study with single-cell perturbation data, and a Nature singing-mouse study with MAPseq projection matrices -- show a stronger mode: when a paper ships source data, simulation outputs, or code, PROMETHEUS can evaluate a counterfactual against that scientific substrate and then rebuild the sheaf world model around the

摘要:大型語言模型可以從文本中提取局部因果主張,但當這些主張被組織成持久的、可導航的世界模型而非平面摘要時,它們變得更有用。我們介紹了PROMETHEUS,這是一個將檢索到的文獻、申請、評論、報告、代理痕跡、來源數據、代碼、模擬和科學模型轉化為因果地圖的框架:類似於束的局部因果預測狀態模型的家族,覆蓋在研究基底的明確覆蓋上。每個局部區域包含因果事件、結構化的主張表、預測測試、支持統計和來源;限制映射比較重疊區域;粘合診斷揭示一致性、漂移、矛盾和不確定性。最終的拓撲世界模型不是一個單一的通用圖。它是一個研究工具,用於導航語料庫所說的內容、說的地點、支持的強度,以及局部主張無法組成一致的全球觀的地方。三個文獻地圖案例研究——海洋溫度對海洋種群的影響、GLP-1減肥證據和白藜蘆醇/紅酒健康益處主張——展示了從文本中進行深度因果研究的明確地方性、證據、持久狀態和粘合張力。四個基於實證的反事實案例研究——一篇《自然氣候變化》微塑料強迫論文、一篇印度河谷水文論文,包含來自VIC的圖表數據和模型代碼、經典的Sachs蛋白信號研究,配有單細胞擾動數據,以及一篇《自然》唱歌老鼠研究,包含MAPseq投影矩陣——展示了一種更強的模式:當一篇論文發佈來源數據、模擬輸出或代碼時,PROMETHEUS可以針對該科學基底評估反事實,然後圍繞該模型重建束世界模型。

Training Large Language Models to Predict Clinical Events

2605.12817v1 by Benjamin Turtel, Paul Wilczewski, Kris Skotheim

Longitudinal clinical notes contain rich evidence of how patients evolve over time, but converting this signal into training supervision for clinical prediction remains challenging. We extend Foresight Learning to clinical prediction by converting time-ordered MIMIC-III notes into examples consisting of past patient context, a natural-language question about a possible future event, and a label resolved from later documentation. This process yields 6,900 prediction examples from 702 admissions across medications, procedures, organ support, microbiology, and mortality. A small LoRA adapter trained on these examples improves over the prompted base model, reducing expected calibration error from 0.1269 to 0.0398 and Brier score from 0.199 to 0.145, while slightly outperforming GPT-5 point estimates on held-out questions. The approach enables reusable clinical prediction supervision from longitudinal notes without hand-engineered structured features or endpoint-specific classifiers.

摘要:長期臨床筆記包含了豐富的證據,顯示患者隨時間的演變,但將這種信號轉換為臨床預測的訓練監督仍然具有挑戰性。我們通過將時間排序的 MIMIC-III 筆記轉換為由過去患者背景、關於可能未來事件的自然語言問題以及從後續文檔中解決的標籤組成的示例,將 Foresight Learning 擴展到臨床預測。這個過程從 702 次入院中產生了 6,900 個預測示例,涵蓋了藥物、程序、器官支持、微生物學和死亡率。一個在這些示例上訓練的小型 LoRA 轉接器在提示的基礎模型上有所改善,將預期校準誤差從 0.1269 降低到 0.0398,將 Brier 分數從 0.199 降低到 0.145,同時在保留問題上稍微超越了 GPT-5 的點估計。這種方法使得可以從長期筆記中獲得可重用的臨床預測監督,而無需手工設計的結構化特徵或特定於終端的分類器。
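The two metrics reported above, Brier score and expected calibration error (ECE), can be sketched for binary event predictions as follows. This is a minimal illustration that assumes equal-width probability bins and compares mean predicted probability with observed event rate per bin; the abstract does not state the binning scheme used in the paper, and the sample predictions are made up.

```python
from typing import List, Sequence, Tuple


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean predicted probability and event rate per bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - event_rate)
    return ece


# Made-up predictions for four questions about future clinical events.
probs = [0.9, 0.9, 0.1, 0.1]
labels = [1, 0, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Lower is better for both metrics. The Brier score penalizes each prediction individually, while ECE measures only whether stated probabilities match observed frequencies in aggregate.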

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

2605.12809v1 by Shixing Yu, Promit Ghosal, Kyra Gan

A critical step toward the reliable use of large language models (LLMs) in healthcare is to attribute predictions to their training data, akin to a medical case study. This requires token-level precision: pinpointing not just which training examples influence a decision, but which tokens within them are responsible. While influence functions offer a principled framework for this, prior work is restricted to autoregressive settings and relies on an implicit assumption of token independence, rendering the identified influences unreliable. We introduce a flexible framework that infers token-level influence through a latent mediation approach for general prediction tasks. Our method attaches sparse autoencoders to any layer of a pretrained LLM to learn a basis of approximately independent latent features. Unlike prior methods where influence decomposes additively across tokens, influence computed over latent features is inherently non-decomposable. To address this, we introduce a novel method using Jacobian-vector products. Token-level influence is obtained by propagating latent attributions back to the input space via token activation patterns. We scale our approach using efficient inverse-Hessian approximations. Experiments on medical benchmarks show our approach identifies sparse, interpretable sets of tokens that jointly influence predictions. Our framework enhances trust and enables model auditing, generalizing to high-stakes domains that require transparent and accountable decisions.

摘要:一個可靠的醫療保健大型語言模型(LLMs)使用的關鍵步驟是將預測歸因於其訓練數據,類似於醫療案例研究。這需要逐字級別的精確度:不僅要確定哪些訓練範例影響了決策,還要確定其中哪些字元負責。雖然影響函數提供了一個原則性的框架,但先前的工作僅限於自回歸設置,並依賴於隱含的字元獨立性假設,使得它們所識別的影響不可靠。我們引入了一個靈活的框架,通過潛在中介方法推斷逐字級別的影響,適用於一般預測任務。我們的方法將稀疏自編碼器附加到預訓練LLM的任何層,以學習一組大約獨立的潛在特徵。與先前的方法不同,影響在潛在特徵上計算時本質上是不可分解的。為了解決這個問題,我們引入了一種使用雅可比-向量乘積的新方法。逐字級別的影響是通過通過字元激活模式將潛在歸因反向傳播到輸入空間來獲得的。我們使用高效的逆海森矩陣近似來擴展我們的方法。在醫療基準上的實驗顯示,我們的方法識別出稀疏且可解釋的字元集,這些字元共同影響預測。我們的框架增強了信任並使模型審計成為可能,並且可以推廣到需要透明和負責任決策的高風險領域。

BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics

2605.12730v1 by Helene Malyutina

Existing AI systems for modeling human behavior operate at the level of individuals or detect events after they occur. As a result, they systematically fail to capture the collective dynamics that determine whether a group remains stable or transitions into escalation or breakdown. We propose a different foundation: a group of interacting humans constitutes a complex dynamical system in the precise mathematical sense, exhibiting emergence, nonlinearity, feedback loops, sensitivity near critical points, and phase transitions between qualitatively distinct regimes. The state of such a system is not located within any single participant; it is distributed across mutual influence loops and observable through the micro-dynamics of the body. We introduce BEHAVE (Behavioral Engine for Human Activity Vector Estimation), a formal framework that models collective dynamics as continuous behavioral fields defined over an interaction space derived from observable physical signals. Kinematic micro-signals (position, velocity, body orientation, gestural activity) are structured into a directed interaction graph and aggregated into a basis of behavioral fields capturing distinct, non-redundant axes of collective state. The framework rests on one theorem and two structural propositions characterizing the tension field, the field basis, and the criticality index. Perception and forecasting layers are implemented using neural models, enabling data-driven learning and approximation of system dynamics. BEHAVE is formulated as a computational system for learning, representing, and forecasting collective dynamics from data. A working pipeline is demonstrated on a 7-agent negotiation snapshot. The same fields, recalibrated, apply to crowd safety, crisis-team dynamics, education, and clinical contexts.

摘要:現有的人工智慧系統在建模人類行為時,運作於個體層面或在事件發生後進行檢測。因此,它們系統性地未能捕捉決定一個群體是否保持穩定或過渡到升級或崩潰的集體動態。我們提出了一個不同的基礎:一群互動的人類構成了一個精確數學意義上的複雜動態系統,展現出湧現性、非線性、反饋迴路、在臨界點附近的敏感性,以及質量上不同的狀態之間的相變。這樣一個系統的狀態並不位於任何單一參與者之中;它分布在相互影響的迴路中,並且可以通過身體的微觀動態進行觀察。

我們引入了BEHAVE(行為引擎,用於人類活動向量估計),這是一個正式框架,將集體動態建模為定義在從可觀察的物理信號衍生的互動空間上的連續行為場。運動學微信號(位置、速度、身體方向、手勢活動)被結構化為一個有向互動圖,並聚合成一組行為場,捕捉不同且不冗餘的集體狀態軸。該框架基於一個定理和兩個結構性命題,描述緊張場、場基礎和臨界指數。感知和預測層使用神經模型實現,使數據驅動的學習和系統動態的近似成為可能。BEHAVE被構建為一個計算系統,用於從數據中學習、表示和預測集體動態。在一個7代理的談判快照上展示了一個工作流程。相同的場,經過重新校準,適用於人群安全、危機團隊動態、教育和臨床背景。

Reward Hacking in Rubric-Based Reinforcement Learning

2605.12474v1 by Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He

Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.

摘要:強化學習與可驗證獎勵的結合使得在數學和編程等領域實現了強大的後訓練增益,儘管許多開放式設置依賴於基於評分標準的獎勵。我們研究了基於評分標準的強化學習中的獎勵黑客行為,其中一個策略是針對訓練驗證器進行優化,但卻是針對三位前沿評審的跨家族小組進行評估,從而減少對任何單一評估者的依賴。我們的框架將兩個來源的偏差分開:驗證器失效,即訓練驗證器給予基於評分標準的標準,而參考驗證器卻拒絕這些標準,以及評分標準設計的局限性,即即使是強大的基於評分標準的驗證器也偏好那些基於無評分標準的評審評價較差的回應。在醫學和科學領域,弱驗證器產生了大量的代理獎勵增益,但這些增益並未轉移到參考驗證器上;隨著訓練的進行,利用行為增長並集中在重複失敗上,例如對複合標準的部分滿足,將隱性內容視為顯性內容,以及不精確的主題匹配。更強的驗證器顯著減少了,但並未消除,驗證器的利用行為。我們還引入了一個自我內部化差距,這是一種基於策略對數概率的無驗證器診斷工具,能夠跟踪參考驗證器的質量,檢測使用弱驗證器訓練的策略何時停止改進。最後,在我們的設置中,強驗證並未防止獎勵黑客行為,當評分標準未指定重要的失敗模式時:基於評分標準的驗證器偏好強化學習檢查點,而無評分標準的評審則偏好基準模型。這些分歧與集中在完整性和存在性標準的增益相吻合,並伴隨著事實正確性、簡潔性、相關性和整體質量的下降。綜合這些結果表明,更強的驗證減少了獎勵黑客行為,但本身並不能確保基於評分標準的增益與更廣泛的質量增益相對應。

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

2605.12361v1 by Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan, Qiao Jin, Natalie Xie, John Wilbur, Shubo Tian, Lana Yeganova, Po-Ting Lai, Chih-Hsuan Wei, Yifan Yang, Yao Ge, Qingqing Zhu, Zhizheng Wang, Zhiyong Lu

Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.

摘要:評估生物醫學領域的大型語言模型(LLMs)需要能夠區分推理與模式匹配的基準,並在模型能力提高時保持辨別性。現有的生物醫學問答(QA)基準在這方面有限。多選格式可以使模型通過排除答案而非推理來成功,而廣泛流通的考試風格數據集則越來越容易受到性能飽和和訓練數據污染的影響。多步推理,定義為跨多個來源整合信息以推導答案的能力,對於臨床有意義的任務如診斷支持、基於文獻的發現和假設生成至關重要,但在當前的生物醫學QA基準中仍然代表性不足。MedHopQA是一個以疾病為中心的多步推理基準,由1,000個專家策劃的問答對組成,作為BioCreative IX的共享任務引入。每個問題需要整合來自兩篇不同維基百科文章的信息,答案以開放式自由文本格式提供。金標註通過來自MONDO、NCBI Gene和NCBI Taxonomy的本體基礎同義詞集進行增強,以支持詞彙和概念層級的評估。MedHopQA是通過結合人類標註、篩選、迭代驗證和LLM作為評判者的驗證的結構化過程構建的。為了減少排行榜作弊和污染風險,這1,000個得分問題嵌入在一組公開可下載的10,000個問題中,答案被隱藏,並在CodaBench排行榜上展示。MedHopQA提供了一個基準和可重用的框架,用於構建未來的生物醫學QA數據集,將組合推理、飽和抵抗和污染抵抗作為核心設計約束。
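The lexical side of the evaluation, where a free-text answer is accepted if it matches the gold answer or one of its ontology-derived synonyms, might look like the sketch below. The normalization rules, the function names, and the example synonym list are assumptions for illustration and are not taken from the MedHopQA release; concept-level evaluation against MONDO or NCBI identifiers is not shown.

```python
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def lexical_match(prediction: str, gold: str, synonyms: Iterable[str]) -> bool:
    """Accept a prediction that equals the gold answer or any listed synonym."""
    accepted = {normalize(gold)} | {normalize(s) for s in synonyms}
    return normalize(prediction) in accepted


# Hypothetical gold annotation with a hand-written synonym set.
gold_answer = "Marfan syndrome"
synonym_set = ["MFS", "Marfan's syndrome"]

exact = lexical_match("marfan syndrome.", gold_answer, synonym_set)
via_synonym = lexical_match("mfs", gold_answer, synonym_set)
wrong = lexical_match("Ehlers-Danlos syndrome", gold_answer, synonym_set)
```

Exact matching after normalization is deliberately strict: a longer free-text answer that merely contains the gold term would not be accepted here, which is one reason a concept-level check is a useful complement.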

EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records

2605.12335v1 by Saeed Shurrab, Mariam Al-Omari, Dana El Samad, Farah E. Shamout

Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling applications. However, effectively leveraging historical data remains challenging due to long trajectories, heterogeneous events, temporal irregularity, and the varying relevance of past clinical context. Existing approaches often rely on fixed windows or uniform aggregation, which can obscure clinically important signals. In this work, we introduce EHR-RAGp, a retrieval-augmented foundation model that dynamically integrates the most relevant patient history across diverse clinical event types. We propose a prototype-guided retrieval module that acts as an alignment mechanism and estimates the relevance of retrieved historical chunks with respect to a given prediction task, guiding the model towards the most informative context. Across multiple clinical prediction tasks, EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines. Furthermore, integrating EHR-RAGp with existing clinical foundation models yields substantial performance gains. Overall, EHR-RAGp provides a scalable and efficient framework for leveraging long-range clinical context to improve downstream performance.

摘要:電子健康紀錄 (EHR) 包含豐富的縱向病患資訊,並廣泛應用於預測建模應用中。然而,由於長期軌跡、異質事件、時間不規則性以及過去臨床背景的相關性變化,有效利用歷史數據仍然面臨挑戰。現有的方法通常依賴固定窗口或均勻聚合,這可能會掩蓋臨床上重要的信號。在這項工作中,我們介紹了 EHR-RAGp,一種檢索增強的基礎模型,能夠動態整合各種臨床事件類型中最相關的病歷。我們提出了一個原型引導的檢索模組,作為對齊機制,並根據給定的預測任務估計檢索到的歷史片段的相關性,引導模型朝向最具資訊的背景。在多個臨床預測任務中,EHR-RAGp 始終超越最先進的 EHR 基礎模型和基於Transformer的基準。此外,將 EHR-RAGp 與現有的臨床基礎模型整合,能夠顯著提升性能。總體而言,EHR-RAGp 提供了一個可擴展且高效的框架,以利用長期臨床背景來改善下游性能。

Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

2605.12241v1 by M A Al-Masud, Nils Strodthoff

Specialized foundation models are beginning to emerge in various medical subdomains, but pretraining methodologies and parametric scaling with the size of the pretraining dataset are rarely assessed systematically and in a like-for-like manner. This work focuses on foundation models for electrocardiography (ECG) data, one of the most widely captured physiological time series world-wide. We present a comprehensive assessment of pretraining methodologies, covering five different contrastive and non-contrastive self-supervised learning objectives for ECG foundation models, and investigate their scaling behavior with pretraining dataset sizes up to 11M input samples, exclusively from publicly available sources. Pretraining strategy has a meaningful and consistent impact on downstream performance, with contrastive predictive coding (slightly ahead of JEPA) yielding the most transferable representations across diverse clinical tasks. Scaling pretraining data continues to yield meaningful improvements up to 11M samples for most objectives. We also compare model architectures across all pretraining methodologies and find evidence for a clear superiority of structured state space models compared to transformers and CNN models. We hypothesize that the strong inductive biases of structured state space models, rather than pretraining scale alone, are the primary driver of effective ECG representation learning, with important implications for future foundation model development in this and potentially other physiological signal domains.

摘要:專門的基礎模型開始在各種醫療子領域中出現,但預訓練方法和參數擴展與預訓練數據集大小之間的關係,卻很少以系統性和類似的方式進行評估。這項工作專注於心電圖(ECG)數據的基礎模型,這是全球最廣泛捕獲的生理時間序列之一。我們對預訓練方法進行了全面評估,涵蓋了五種不同的對比性和非對比性自我監督學習目標,並調查了它們在預訓練數據集大小達到 1100 萬輸入樣本時的擴展行為,這些數據完全來自公開可用的來源。預訓練策略對下游性能有著重要且一致的影響,其中對比預測編碼(略微領先於 JEPA)在不同臨床任務中產生了最具可轉移性的表示。擴展預訓練數據繼續為大多數目標帶來有意義的改進,直到 1100 萬樣本。我們還比較了所有預訓練方法中的模型架構,並發現結構化狀態空間模型明顯優於Transformer和 CNN 模型。我們假設結構化狀態空間模型的強歧視性偏見,而非僅僅是預訓練規模,是有效 ECG 表示學習的主要驅動因素,這對於未來在這一領域及其他生理信號領域的基礎模型開發具有重要意義。

Overtrained, Not Misaligned

2605.12199v1 by Joel Schreiber, Ariel Goldstein

Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.

摘要:新興的不對齊(EM),即在狹窄任務(如不安全代碼)上進行微調會導致無關領域的廣泛不對齊,首次由 Betley 等人(2025)展示。我們進行了迄今為止最全面的 EM 研究,重現了原始 GPT-4o 的發現,並擴展到 12 個開源模型,涵蓋 4 個系列(Llama、Qwen、DeepSeek、GPT-OSS),參數範圍從 8B 到 671B,評估了超過一百萬個模型回應,使用多個隨機種子。我們發現 EM 在 GPT-4o 中重現,但遠非普遍:只有 12 個開源模型中的 2 個(17%)在不同種子間表現出一致的 EM,且模型大小與 EM 易感性之間存在顯著相關性。通過在微調過程中的檢查點級分析,我們證明 EM 在訓練後期出現,與主要任務的接近收斂不同且隨之而來,這表明 EM 來自於超過任務收斂的持續訓練。這帶來了實際的緩解措施:提前停止可以消除 EM,同時保留平均 93% 的任務表現,而仔細選擇學習率則進一步降低風險。對醫療微調的跨領域驗證確認了這些模式的普遍性:大小-EM 相關性增強(r = 0.90),且在 67% 的情況下,通過提前停止仍能避免對不真實性的過度泛化,儘管語義相近的訓練領域產生的不可分離不對齊較少。隨著 LLM 越來越多地融入現實世界系統,微調和強化學習仍然是調整模型行為的主要方法。我們的研究結果表明,通過適當的訓練實踐,可以避免 EM,將其從不可預見的微調風險重新構建為可避免的訓練產物。

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

2605.12120v1 by Fangyi Yu, Nabeel Seedat, Jonathan Richard Schwarz, Andrew M. Bean

Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace -- e.g., that a drug has been withdrawn -- yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.

摘要:語言模型在高風險的專業環境中面臨來自用戶、機構當局和專業規範的矛盾需求。當這些需求發生衝突時,模型的行為揭示了一種主要的層級結構——一種對競爭利益相關者的隱含排序,這決定了,例如,一個醫療人工智慧在接受醫院管理者的成本削減指令時,是否會以證據為基礎的護理為代價而遵從,或者因為專業標準的要求而拒絕。在法律和醫療領域的7,136個場景中,我們測試了十個前沿模型,發現模型在執行任務時經常未能遵循專業標準,例如在草擬時,當用戶指令與這些標準發生衝突時——儘管在用戶尋求建議指導時能夠充分遵守這些標準。我們進一步發現,這些模型所展示的用戶、權威和專業標準之間的層級在醫療和法律背景中是不穩定的,並且在模型家族之間不一致。當未能遵循專業標準時,主要的失敗機制是知識遺漏:那些明顯擁有相關知識的模型在未顯示衝突知識的情況下產生有害的輸出。在一個特別令人擔憂的案例中,我們發現一個推理模型在其推理過程中識別出相關知識——例如,一種藥物已被撤回——但在面向用戶的回答中壓制了這一點,並在權威壓力下仍然建議使用該藥物。在任務框架、領域和模型家族之間的不一致對齊表明,當模型在高風險的專業環境中部署時,當前的對齊方法,包括已發表的對齊層級,可能不會穩健。

Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection

2605.12069v1 by Muhammad Aqeel, Maham Nazir, Uzair Khan, Marco Cristani, Francesco Setti

Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions: normal samples are compact, whereas anomalies are diverse. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with two specialized branches, one for normal and one for anomalous patterns, that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. Code: https://github.com/aqeeelmirza/AVA-DINO

摘要:零樣本異常檢測旨在識別未見類別中的缺陷,而無需針對特定目標的訓練。現有方法通常對所有樣本應用相同的特徵轉換,將正常數據和異常數據視為相同,儘管它們的分佈本質上是非對稱的,正常數據較為緊湊,而異常數據則多樣化。我們則利用這種自然的非對稱性,提出了AVA-DINO,一個異常感知的視覺-語言適應框架,具有針對正常和異常模式的雙重專門分支,這些分支能夠適應凍結的DINOv3視覺特徵。在輔助數據的訓練過程中,這兩個分支通過文本引導的路由機制和顯式路由正則化共同學習,促進分支專業化。在測試時,僅使用輸入圖像和固定的預定義語言描述來動態結合這兩個分支,實現非對稱激活。這一設計防止了退化的均勻路由,並允許上下文特定的特徵轉換。在九個工業和醫療基準上的實驗展示了最先進的性能,在MVTec-AD上達到93.5%的圖像-AUROC,並且在醫療影像上實現了強大的跨域泛化,而無需特定於域的微調。 https://github.com/aqeeelmirza/AVA-DINO

Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL

2605.12575v1 by Hyun Do Jung, Jungwon Choi, Soojung Choi, Yujin Oh, Hwiyoung Kim

Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.

摘要:整張幻燈片影像(WSI)多實例學習(MIL)分類器可以實現強大的幻燈片級AUC,同時使整體預測變得不透明。注意力分數被廣泛重用作為事後解釋,但高注意力可能反映聚合偏好,而不是緊湊的、模型充分的理由。我們研究了對於凍結的WSI-MIL的事後理由突出:給定一個訓練好的分類器,它的幻燈片級預測是否可以從一個緊湊的、輸出一致的瓷磚子集恢復,而無需重新訓練主幹?我們用尋找最佳上下文實例(FOCI)來實現這一點,這是一個輕量級的理由讀取層,基於凍結的MIL主幹。FOCI以模型輸出充分性和排除目標在保留/刪除瓷磚子集上進行訓練,並通過適應於WSI-MIL的插入式序列揭示協議(SRP)進行評估,並由選擇頭部指數(SHI)進行總結。在三個WSI基準和七個MIL主幹中,FOCI顯示緊湊的理由依賴於選擇頭部:Transformer和多分支注意力聚合器可以接受緊湊的理由,接近最小注意力池化基準進入選擇飽和狀態,而硬選擇主幹可能與外部讀取發生衝突。對於TransMIL,相較於其記錄的CLS代理排名,FOCI在基準中將最小充分K(MSK)瓷磚數量減少了32-56%,而ACMIL+FOCI達到了最高的平均SHI(+0.465)。基於刪除的擾動和僅選擇的下游評估提供了互補的檢查。這些結果使FOCI成為一個模型級的可解釋性和審計層:選擇的瓷磚並不是臨床或病理學家級診斷充分性的主張,而是候選理由,提供了一個緊湊的、可審查的視角,顯示何時凍結的MIL預測可以定位到一個小的輸出一致子集。
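The idea behind an insertion-style reveal protocol and a "Minimum Sufficient K" count can be sketched as follows: tiles are revealed in ranked order to a frozen predictor until its output on the revealed subset agrees with its full-bag output. The tolerance-based agreement test, the function name, and the toy max-pooling predictor are all assumptions for illustration; the abstract does not give the paper's exact definitions of SRP, MSK, or SHI.

```python
from typing import Callable, List, Sequence


def minimum_sufficient_k(
    tiles: Sequence[float],
    ranking_scores: Sequence[float],
    predict: Callable[[List[float]], float],
    tol: float = 0.05,
) -> int:
    """Smallest number of top-ranked tiles whose prediction matches the full bag."""
    full_output = predict(list(tiles))
    # Reveal tiles from the highest ranking score to the lowest.
    order = sorted(range(len(tiles)), key=lambda i: ranking_scores[i], reverse=True)
    revealed: List[float] = []
    for k, idx in enumerate(order, start=1):
        revealed.append(tiles[idx])
        if abs(predict(revealed) - full_output) <= tol:
            return k
    return len(tiles)


# Toy frozen "model": max pooling over scalar tile evidence (illustrative only).
toy_predict = max
tiles = [0.1, 0.9, 0.2, 0.3]

good_ranking = tiles                  # ranks the decisive tile first
bad_ranking = [-t for t in tiles]     # ranks the decisive tile last

k_good = minimum_sufficient_k(tiles, good_ranking, toy_predict)
k_bad = minimum_sufficient_k(tiles, bad_ranking, toy_predict)
```

In this toy setting a ranking that surfaces the decisive tile first needs a single tile, while a poor ranking needs the whole bag. The gap between the two is the kind of selection headroom the abstract refers to.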

Spectral Vision Transformer for Efficient Tokenization with Limited Data

2605.12026v1 by Alexandra G. Roberts, Maneesh John, Jinwei Zhang, Dominick Romano, Mert Sisman, Ki Sueng Choi, Heejong Kim, Mert R. Sabuncu, Thanh D. Nguyen, Alexey V. Dimov, Pascal Spincemaille, Brian H. Kopell, Yi Wang

We propose a novel spectral vision transformer architecture for efficient tokenization with limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis, including spatial invariance and optimal signal-to-noise ratio. We show that the spectral projection reduces complexity relative to spatial vision transformers. We show comparable or superior performance with fewer parameters than a variety of models, including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. Our analysis includes simulated, public, and clinical data, and our code is released at github.com/agr78/spectralViT.
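
As a rough illustration of tokenizing in a spectral basis, the sketch below keeps only a few low-frequency coefficients of a 1D signal as tokens. The abstract does not name the basis, so the discrete Fourier transform used here is an assumption, as are the function names `dft` and `spectral_tokens`. It is not taken from the released code.

```python
import cmath
import math
from typing import List, Sequence


def dft(signal: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, written out for clarity."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def spectral_tokens(signal: Sequence[float], num_tokens: int) -> List[complex]:
    """Keep the num_tokens lowest-frequency coefficients as tokens
    (assumed tokenization rule, for illustration only)."""
    return dft(signal)[:num_tokens]


signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 0.0]
shifted = signal[-2:] + signal[:-2]  # circular shift by two samples

tokens = spectral_tokens(signal, 3)
shifted_tokens = spectral_tokens(shifted, 3)

# Eight samples are summarized by three tokens.
print(len(signal), len(tokens))

# A circular shift changes only the phase of each DFT coefficient,
# so the token magnitudes agree up to floating-point error.
print([round(abs(c), 6) for c in tokens])
print([round(abs(c), 6) for c in shifted_tokens])
```

The shift behaviour shown is a standard property of the DFT and is offered only as intuition for why a spectral basis can give invariance; the paper's own invariance and signal-to-noise results may rest on a different basis and argument.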

DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction

2605.12574v1 by Hongyi Tang, Zhihao Zhu, Yi Yang

Vision-language models (VLMs) are trained on large-scale image-text corpora that may contain private, copyrighted, or otherwise sensitive data, motivating membership inference as a tool for training-data auditing. This is especially challenging for deployed VLMs, where auditors typically observe only generated textual responses. Existing VLM membership inference attacks either rely on probability-level signals unavailable in such settings, or use mask-based semantic prediction tasks whose effectiveness depends on object-centric visual assumptions. To address these limitations, we propose DistractMIA, an output-only black-box framework based on semantic distraction. Rather than removing visual evidence, DistractMIA preserves the original image, inserts a known semantic distractor, and measures how generated responses change. This design is motivated by the intuition that member samples remain more anchored to the original image semantics, while non-member samples are more easily redirected toward the distractor. To make this signal reliable, DistractMIA calibrates distractor configurations on a reference set and derives membership scores from repeated textual generations, capturing response stability and distractor uptake without accessing logits, probabilities, or hidden states. Experiments across multiple VLMs and benchmarks show that DistractMIA consistently outperforms both output-only and stronger-access baselines. Its performance on a medical benchmark further demonstrates applicability beyond object-centric natural images.
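
The output-only scoring idea can be sketched as follows. This is a hypothetical simplification, not the DistractMIA scoring rule: it assumes responses are plain strings, detects anchoring and distractor uptake by keyword matching, and combines them as a simple difference of rates. The function name and all example strings are invented for illustration.

```python
from typing import Sequence


def membership_score(
    responses: Sequence[str],
    original_terms: Sequence[str],
    distractor_term: str,
) -> float:
    """Fraction of responses anchored to the original image semantics
    minus the fraction that mention the inserted distractor.
    Higher values suggest membership under the stated intuition."""
    if not responses:
        raise ValueError("at least one generated response is required")
    lowered = [r.lower() for r in responses]
    anchored = sum(
        any(term.lower() in r for term in original_terms) for r in lowered
    )
    distracted = sum(distractor_term.lower() in r for r in lowered)
    return (anchored - distracted) / len(responses)


# Repeated generations for an image of a dog with a balloon inserted as distractor.
member_like = [
    "A dog runs on grass.",
    "A dog on a lawn.",
    "A dog next to a red balloon.",
]
non_member_like = [
    "A red balloon.",
    "A balloon floating in the air.",
    "A dog and a balloon.",
]

score_member = membership_score(member_like, ["dog"], "balloon")
score_non_member = membership_score(non_member_like, ["dog"], "balloon")
print(score_member, score_non_member)
```

Only generated text is used, which matches the black-box setting described above. A real auditor would also need the calibration step on a reference set, which this sketch omits.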

AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers

2605.11901v1 by Lei Wang, Jiangxuan Shen, Xi Zhang, Dalin Zhang, Jingyu Li, Haipeng Dai, Chenren Xu, Daqing Zhang, He Huang

The widespread use of earphones has enabled various sensing applications, including activity recognition, health monitoring, and context-aware computing. Among these, earphone-based user authentication, which leverages unique biometric features, has become a key technique. However, existing earphone-based authentication systems face key limitations: they either require explicit user interaction or active speaker output, or they suffer from poor accessibility and vulnerability to environmental noise, which hinders large-scale deployment. In this paper, we propose a passive authentication system, called AccLock, which leverages distinctive features extracted from in-ear ballistocardiogram (BCG) signals to enable secure and unobtrusive user verification. Our system offers several advantages over previous systems: it requires zero involvement from both the device and the user, is ubiquitous, and is resilient to environmental noise. To realize this, we first design a two-stage denoising scheme to suppress both inherent and sporadic interference. To extract user-specific features, we then propose a disentanglement-based deep learning model, HIDNet, which explicitly separates user-specific features from shared nuisance components. Lastly, we develop a scalable authentication framework based on a Siamese network that eliminates the need for per-user classifier training. We conduct extensive experiments with 33 participants, achieving an average false acceptance rate (FAR) of 3.13% and false rejection rate (FRR) of 2.99%, which demonstrates the practical feasibility of AccLock.
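
The verification step and the two reported error rates can be sketched as follows. This is a minimal illustration, not AccLock's pipeline: plain vectors stand in for the embeddings a Siamese network would produce, cosine similarity is an assumed comparison function, and the threshold and scores are made-up values. FAR is computed as the fraction of impostor attempts accepted and FRR as the fraction of genuine attempts rejected.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(enrolled: Sequence[float], probe: Sequence[float], threshold: float) -> bool:
    """Accept the probe if it is similar enough to the enrolled template.
    No per-user classifier is trained; only a comparison is made."""
    return cosine_similarity(enrolled, probe) >= threshold


def far_frr(
    genuine_scores: Sequence[float],
    impostor_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (FAR, FRR) at the given decision threshold."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


# Illustrative similarity scores, not data from the paper.
genuine = [0.9, 0.8, 0.4, 0.95]
impostor = [0.1, 0.6, 0.3, 0.2]
far, frr = far_frr(genuine, impostor, threshold=0.5)
print(far, frr)
```

Raising the threshold lowers FAR at the cost of a higher FRR, so the two figures are normally reported together at a chosen operating point.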
