Medical

Publish Date	Title	Authors	arXiv ID	Code
2026-04-03	PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction	Daniel C. MacRae et.al.	2604.03203v1	null
2026-04-03	Analyzing Healthcare Interoperability Vulnerabilities: Formal Modeling and Graph-Theoretic Approach	Jawad Mohammed et.al.	2604.03043v1	null
2026-04-03	Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution	KN Ajay Shastry et.al.	2604.03035v1	null
2026-04-03	ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents	Chao Li et.al.	2604.02834v1	null
2026-04-03	Eligibility-Aware Evidence Synthesis: An Agentic Framework for Clinical Trial Meta-Analysis	Yao Zhao et.al.	2604.02678v1	null
2026-04-02	An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis	Md. Sajeebul Islam Sk. et.al.	2604.02502v1	null
2026-04-02	Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview	Shramana Dey et.al.	2604.02448v1	null
2026-04-02	Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models	Minda Zhao et.al.	2604.02236v1	null
2026-04-02	When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning	Juarez Monteiro et.al.	2604.02226v1	null
2026-04-02	Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study	Yosuke Yamagishi et.al.	2604.02207v1	null
2026-04-02	Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data	Alejandro Castañeda Garcia et.al.	2604.02031v1	null
2026-04-02	Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia	Saja Al-Dabet et.al.	2604.01962v1	null
2026-04-02	Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always	Luka Hobor et.al.	2604.01896v1	null
2026-04-02	Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints	Minh-Khoi Pham et.al.	2604.01841v1	null
2026-04-02	A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection	Arezoo Borji et.al.	2604.01798v1	null
2026-04-02	Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring	Feiyu Zhou et.al.	2604.01712v1	null
2026-04-02	Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy	Ruijie Yang et.al.	2604.01705v1	null
2026-04-02	Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology	Tianhao Shi et.al.	2604.01690v1	null
2026-04-02	Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture	Florian Odi Stummer et.al.	2604.01661v1	null
2026-04-02	CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery	Ao Qu et.al.	2604.01658v1	null
2026-04-02	NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy	Kyeonghun Kim et.al.	2604.01612v1	null
2026-04-02	Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training	Abdelrahman Abouzeid et.al.	2604.01563v1	null
2026-04-02	Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging	Mengxian Lyu et.al.	2604.01538v1	null
2026-04-02	PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance	Ayan Das et.al.	2604.01532v1	null
2026-04-02	A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies	Congjing Zhang et.al.	2604.01529v1	null
2026-04-01	DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data	Arshia Ilaty et.al.	2604.01481v1	null
2026-04-01	Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis	Keshav Shankar et.al.	2604.01463v1	null
2026-04-01	When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems	Khalid Adnan Alsayed et.al.	2604.01449v2	null
2026-04-01	AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction	Aiza Maksutova et.al.	2604.01371v1	null
2026-04-01	Safety, Security, and Cognitive Risks in World Models	Manoj Parmar et.al.	2604.01346v1	null
2026-04-01	Regularizing Attention Scores with Bootstrapping	Neo Christopher Chung et.al.	2604.01339v1	null
2026-04-01	AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation	Prantik Deb et.al.	2604.01167v1	null
2026-04-01	Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning	Mohammad R. Abu Ayyash et.al.	2604.01152v1	null
2026-04-01	PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor	Yutao Yang et.al.	2604.00931v2	null
2026-04-01	OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images	Okan Uçar et.al.	2604.01264v1	null
2026-04-01	BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction	Sayed Hashim et.al.	2604.00739v1	null
2026-04-01	Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition	Axiu Mao et.al.	2604.00517v1	null
2026-04-01	MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning	Kyeonghun Kim et.al.	2604.00514v1	null
2026-04-01	A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation	Yabin Zhang et.al.	2604.00493v1	null
2026-04-01	Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions	Yuchen Yang et.al.	2604.00397v1	null
2026-04-01	EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts	Alibek T. Kaliyev et.al.	2604.00392v1	null
2026-03-31	Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry	Syed Eqbal Alam et.al.	2604.00319v1	null
2026-03-31	SANA I2I: A Text Free Flow Matching Framework for Paired Image to Image Translation with a Case Study in Fetal MRI Artifact Reduction	Italo Felix Santos et.al.	2604.00298v1	null
2026-03-31	A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation	Ha Na Cho et.al.	2604.00249v1	null
2026-03-31	One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction	Yuxing Lu et.al.	2604.00085v1	null
2026-03-31	Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System	Xiaoshan Huang et.al.	2603.29950v1	null
2026-03-31	Four Generations of Quantum Biomedical Sensors	Xin Jin et.al.	2603.29944v2	null
2026-03-31	ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules	Jonas Landsgesell et.al.	2603.29928v1	null
2026-03-31	Training deep learning based dynamic MR image reconstruction using synthetic fractals	Anirudh Raman et.al.	2603.29922v1	null
2026-03-31	Brain MR Image Synthesis with Multi-contrast Self-attention GAN	Zaid A. Abod et.al.	2604.00070v1	null
2026-03-31	Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding	Joakim Edin et.al.	2603.29709v1	null
2026-03-31	Few-shot Writer Adaptation via Multimodal In-Context Learning	Tom Simon et.al.	2603.29450v1	null
2026-03-31	NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification	Youngung Han et.al.	2603.29449v1	null
2026-03-31	AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding	Moiz Sadiq Awan et.al.	2603.29366v1	null
2026-03-31	Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model	Siyuan Du et.al.	2603.29176v1	null
2026-03-31	Knowledge database development by large language models for countermeasures against viruses and marine toxins	Hung N. Do et.al.	2603.29149v1	null
2026-03-31	Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health	Yuqing Xiao et.al.	2603.29114v1	null
2026-03-30	A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank	Iness Halimi et.al.	2603.29041v1	null
2026-03-30	Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction	Diego C. Lerma-Torres et.al.	2603.29023v1	null
2026-03-30	Towards a Medical AI Scientist	Hongtao Wu et.al.	2603.28589v1	null
2026-03-30	Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework	Ya Zhou et.al.	2603.28532v1	null
2026-03-30	FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation	Tiantian Wang et.al.	2603.28455v1	null
2026-03-30	The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation	Doan Nam Long Vu et.al.	2603.28387v1	null
2026-03-30	Bit-Identical Medical Deep Learning via Structured Orthogonal Initialization	Yakov Pyotr Shkolnikov et.al.	2603.28040v1	null
2026-03-29	Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images	Laura Rayón Ropero et.al.	2603.27798v1	null
2026-03-29	RAP: Retrieve, Adapt, and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation	Zhihao Mao et.al.	2603.27705v1	null
2026-03-29	Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development	Zhongying Deng et.al.	2603.27460v1	null
2026-03-28	Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models	Mehedi Hasan Tusar et.al.	2603.27325v1	null
2026-03-28	Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection	Jinhu Fu et.al.	2603.27240v1	null
2026-03-28	MediHive: A Decentralized Agent Collective for Medical Reasoning	Xiaoyang Wang et.al.	2603.27150v1	null
2026-03-28	Bayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data	Amuche Ibenegbu et.al.	2603.27142v1	null
2026-03-28	Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic Disorders	Hongzhuo Chen et.al.	2603.27104v1	null
2026-03-27	When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models	Juan Gabriel Kostelec et.al.	2603.26556v1	null
2026-03-27	Foundation Model for Cardiac Time Series via Masked Latent Attention	Moritz Vandenhirtz et.al.	2603.26475v1	null
2026-03-27	PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management	Eugenio Rodrigo Zimmer Neves et.al.	2603.26324v1	null
2026-03-27	Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation	Xue Liu et.al.	2604.02368v1	null
2026-03-27	Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI	Jing Zhang et.al.	2603.26186v1	null
2026-03-27	SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis	Zhangtianyi Chen et.al.	2603.26122v1	null
2026-03-27	Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays	Kang Liu et.al.	2603.26049v1	null
2026-03-27	Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features	Mengdi Liu et.al.	2603.26019v1	null
2026-03-27	FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants	Mahesh Bhosale et.al.	2603.26008v1	null
2026-03-27	Longitudinal Boundary Sharpness Coefficient Slopes Predict Time to Alzheimer's Disease Conversion in Mild Cognitive Impairment: A Survival Analysis Using the ADNI Cohort	Ishaan Cherukuri et.al.	2603.26007v1	null
2026-03-26	When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models	Binesh Sadanandan et.al.	2603.25960v1	null
2026-03-26	Methods for Knowledge Graph Construction from Text Collections: Development and Applications	Vanni Zavarella et.al.	2603.25862v1	null
2026-03-26	Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI	Anna Kozlova et.al.	2603.25821v1	null
2026-03-26	Beyond identifiability: Learning causal representations with few environments and finite samples	Inbeom Lee et.al.	2603.25796v1	null
2026-03-26	DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial	Zhenchen Zhu et.al.	2603.25607v1	null
2026-03-26	Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models	Moazzam Umer Gondal et.al.	2603.25495v1	null
2026-03-26	Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models	Eyal Hadad et.al.	2603.25403v2	null
2026-03-26	A Causal Framework for Evaluating ICU Discharge Strategies	Sagar Nagaraj Simha et.al.	2603.25397v1	null
2026-03-26	Evaluating Language Models for Harmful Manipulation	Canfer Akbulut et.al.	2603.25326v3	null
2026-03-26	AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study	Wenlong Hou et.al.	2603.25322v1	null
2026-03-26	A Gait Foundation Model Predicts Multi-System Health Phenotypes from 3D Skeletal Motion	Adam Gabet et.al.	2603.25283v1	null
2026-03-26	A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations	Andong Tan et.al.	2603.25196v1	null
2026-03-26	Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control	Mutong Liu et.al.	2603.25771v1	null
2026-03-26	Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models	Chengyu Fang et.al.	2603.25155v1	null
2026-03-26	Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence	Vehid Geruslu et.al.	2603.25146v1	null
2026-03-26	Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators	Ray-Yuan Chung et.al.	2603.24986v1	null
2026-03-26	Subject-Specific Low-Field MRI Synthesis via a Neural Operator	Ziqi Gao et.al.	2603.24968v1	null
2026-03-26	Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence	Vasu Srinivasan et.al.	2603.24898v1	null

Abstracts

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

2604.03203v1 by Daniel C. MacRae, Luuk van der Hoek, Robert van der Wal, Suzanne P. M. de Vette, Hendrike Neh, Baoqiang Ma, Peter M. A. van Ooijen, Lisanne V. van Dijk

Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid these developments we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to "plug in" their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as few as two lines of code.

Analyzing Healthcare Interoperability Vulnerabilities: Formal Modeling and Graph-Theoretic Approach

2604.03043v1 by Jawad Mohammed, Gahangir Hossain

In healthcare environments, interoperability platforms based on HL7 FHIR allow independent systems (EHR systems, pharmacy systems, lab systems, and devices) concurrent, asynchronous access to a set of shared patient resources. The FHIR specification lacks a protocol for concurrency control, and existing research on race condition detection targets only the OS kernel. Research on FHIR security targets only authentication and injection attacks and treats concurrent access to patient resources as sequential. This gap is addressed by introducing the FHIR Resource Access Graph (FRAG), a formally defined graph G = (P, R, E, λ, τ, S) in which the nodes are the concurrent processes, the typed edges represent resource access events, and race conditions appear as detectable structural properties. Three clinically relevant race condition classes are formally specified: Simultaneous Write Conflict (SWC), TOCTOU Authorization Violation (TAV), and Cascading Update Race (CUR). The FRAG model is implemented as a three-pass graph traversal detection algorithm and tested against a time window-based baseline on 1,500 synthetic FHIR R4 transaction logs. Under full concurrent access (C2), FRAG attains a 90.0% F1 score vs. 25.5% for the baseline, a 64.5 pp improvement.
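
As a rough illustration of how a race class can be treated as a structural property of an access graph, the sketch below flags Simultaneous Write Conflicts: two write events by different processes on the same resource whose time intervals overlap. The `AccessEvent` fields, the interval-overlap rule, and the example resource identifiers are assumptions made for this example; the abstract does not describe FRAG's three-pass algorithm at this level of detail, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class AccessEvent:
    process: str   # node: a concurrent process (e.g. an EHR or pharmacy system)
    resource: str  # shared resource identifier (hypothetical format)
    op: str        # edge type: "read" or "write"
    start: float   # time the access begins
    end: float     # time the access completes


def find_write_conflicts(events):
    """Return pairs of overlapping writes to one resource by different processes."""
    writes_by_resource = {}
    for ev in events:
        if ev.op == "write":
            writes_by_resource.setdefault(ev.resource, []).append(ev)

    conflicts = []
    for writes in writes_by_resource.values():
        for a, b in combinations(writes, 2):
            overlap = a.start < b.end and b.start < a.end
            if a.process != b.process and overlap:
                conflicts.append((a, b))
    return conflicts


# Made-up events: two systems write the same resource at overlapping times.
events = [
    AccessEvent("ehr", "MedicationRequest/1", "write", 0.0, 2.0),
    AccessEvent("pharmacy", "MedicationRequest/1", "write", 1.0, 3.0),
    AccessEvent("lab", "Observation/7", "write", 0.5, 1.5),
    AccessEvent("ehr", "Observation/7", "read", 0.6, 0.9),
]
conflicts = find_write_conflicts(events)
print([(a.process, b.process, a.resource) for a, b in conflicts])
```

The reported improvement is the plain difference of the two F1 scores: 90.0 - 25.5 = 64.5 percentage points.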

Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution

2604.03035v1 by KN Ajay Shastry, Ganesh Senrayan, Shrey Satapara, Pranoy Panda, Chaitanya Devaguptapu

Existing datasets for coding agents evaluate performance on isolated, single pull request (PR) tasks in a stateless manner, failing to capture the reality of real-world software development, where code changes accumulate, technical debt accrues, and test suites grow over time. To bridge this gap, we introduce an automated coding task generation framework, used to generate our dataset SWE-STEPS, which evaluates coding agents on long-horizon tasks through two realistic settings mirroring actual developer workflows: conversational coding with iterative requests, and single-shot Project Requirement Document (PRD)-based coding. Unlike existing datasets that evaluate agents on disjointed PRs, our framework assesses performance across chains of dependent PRs, enabling evaluation of sequential execution, regression verification, and long-term repository health. We discover that widely used isolated PR evaluations yield inflated success rates relative to our settings, overshooting performance by as much as 20 percentage points, because they ignore the "spillover" effects of previous inefficient or buggy code. Furthermore, our analysis reveals that even when agents successfully resolve issues, they degrade repository health by generating code with higher cognitive complexity and technical debt compared to human developers, underscoring the necessity for multidimensional evaluation.

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

2604.02834v1 by Chao Li, Cailiang Liu, Ang Gao, Kexin Deng, Shu Zhang, Langping Xu, Xiaotong Shi, Xionghao Ding, Jian Pei, Xun Jiang

Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events - yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1-5 year trajectory comprising a health profile, a multi-phase narrative plan, daily device measurements, periodic exam records, and an event log with explicit per-indicator impact parameters. Each indicator follows a baseline stochastic process driven by discrete events with sigmoid-onset, exponential-decay kernels under saturation and projection constraints; a hybrid pipeline delegates sparse semantic artifacts to LLM-based planning and dense indicator dynamics to algorithmic simulation with hard physiological bounds. Users are each paired with 100 evaluation queries across five dimensions - Lookup, Trend, Comparison, Anomaly, Explanation - stratified into Easy, Medium, and Hard tiers, with all ground-truth answers programmatically computable from the recorded event-indicator relationships. Evaluating 13 methods spanning LLMs with tools, DB-native agents, and memory-augmented RAG, we find that DB agents (48-58%) substantially outperform memory RAG baselines (30-38%), with the gap concentrated on Comparison and Explanation queries where multi-hop reasoning and evidence attribution are required.
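
To make the "sigmoid-onset, exponential-decay" idea concrete, here is a minimal sketch of one possible kernel shape and of how event impacts might be summed onto a baseline. The functional form (logistic rise multiplied by an exponential fall), every parameter value, and the use of simple clipping as a stand-in for the saturation and projection constraints are assumptions for illustration; the abstract does not give the benchmark's actual equations, and the stochastic part of the baseline is left out here.

```python
import math


def event_kernel(t, onset=1.0, steepness=4.0, decay=5.0):
    """Impact of one event t days after it occurs (zero before the event)."""
    if t < 0:
        return 0.0
    rise = 1.0 / (1.0 + math.exp(-steepness * (t - onset)))  # sigmoid onset
    fall = math.exp(-t / decay)                               # exponential decay
    return rise * fall


def indicator_value(day, baseline, events, lower, upper):
    """Baseline plus summed event impacts, clipped to hard bounds."""
    total = baseline
    for event_day, magnitude in events:
        total += magnitude * event_kernel(day - event_day)
    return min(upper, max(lower, total))


# Made-up example: two events raise a resting-heart-rate-like indicator.
events = [(10, 8.0), (12, 5.0)]  # (day of event, per-indicator impact)
series = [indicator_value(d, 70.0, events, 40.0, 80.0) for d in range(40)]
print(round(max(series), 2), round(series[-1], 2))
```

Because every value is computed from the recorded events, a ground-truth answer to a question such as "which event explains the peak?" can be derived programmatically, which is the property the benchmark relies on.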

Eligibility-Aware Evidence Synthesis: An Agentic Framework for Clinical Trial Meta-Analysis

2604.02678v1 by Yao Zhao, Zhiyue Zhang, Yanxun Xu

Clinical evidence synthesis requires identifying relevant trials from large registries and aggregating results that account for population differences. While recent LLM-based approaches have automated components of systematic review, they do not support end-to-end evidence synthesis. Moreover, conventional meta-analysis weights studies by statistical precision without considering clinical compatibility reflected in eligibility criteria. We propose EligMeta, an agentic framework that integrates automated trial discovery with eligibility-aware meta-analysis, translating natural-language queries into reproducible trial selection and incorporating eligibility alignment into study weighting to produce cohort-specific pooled estimates. EligMeta employs a hybrid architecture separating LLM-based reasoning from deterministic execution: LLMs generate interpretable rules from natural-language queries and perform schema-constrained parsing of trial metadata, while all logical operations, weight computations, and statistical pooling are executed deterministically to ensure reproducibility. The framework structures eligibility criteria and computes similarity-based study weights reflecting population alignment between target and comparator trials. In a gastric cancer landscape analysis, EligMeta reduced 4,044 candidate trials to 39 clinically relevant studies through rule-based filtering, recovering all 13 guideline-cited trials. In an olaparib adverse events meta-analysis across four trials, eligibility-aware weighting shifted the pooled risk ratio from 2.18 (95% CI: 1.71-2.79) under conventional Mantel-Haenszel estimation to 1.97 (95% CI: 1.76-2.20), demonstrating quantifiable impact of incorporating eligibility alignment. EligMeta bridges automated trial discovery with eligibility-aware meta-analysis, providing a scalable and reproducible framework for evidence synthesis in precision medicine.
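
For readers unfamiliar with the pooling step, the sketch below computes the standard Mantel-Haenszel pooled risk ratio and then a variant in which each trial's contribution is scaled by a similarity score. The scaling scheme, the `similarity` values, and the trial counts are hypothetical and only illustrate the general idea of eligibility-aware weighting; the abstract does not specify how EligMeta computes or applies its weights.

```python
def pooled_risk_ratio(trials, similarity=None):
    """Mantel-Haenszel pooled risk ratio over 2x2 tables.

    Each trial is (events_treated, n_treated, events_control, n_control).
    `similarity` optionally scales each trial's contribution (an assumption
    of this sketch); with no scores it reduces to the classic estimator.
    """
    if similarity is None:
        similarity = [1.0] * len(trials)
    numerator = 0.0
    denominator = 0.0
    for (a, n1, c, n0), s in zip(trials, similarity):
        total = n1 + n0
        numerator += s * a * n0 / total
        denominator += s * c * n1 / total
    return numerator / denominator


# Made-up counts for three trials.
trials = [(30, 100, 15, 100), (40, 200, 25, 200), (12, 50, 4, 50)]
rr_conventional = pooled_risk_ratio(trials)
# Down-weight the third trial, whose population is assumed to match poorly.
rr_weighted = pooled_risk_ratio(trials, similarity=[1.0, 1.0, 0.25])
print(round(rr_conventional, 3), round(rr_weighted, 3))
```

All arithmetic here is deterministic, which mirrors the design choice described above: language models parse and propose rules, while weight computation and pooling stay reproducible.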

An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

2604.02502v1 by Md. Sajeebul Islam Sk., Md. Mehedi Hasan Shawon, Md. Golam Rabiul Alam

Lumbar Spinal Stenosis (LSS) diagnosis remains a critical clinical challenge: it depends heavily on labor-intensive manual interpretation of multi-view Magnetic Resonance Imaging (MRI), leading to substantial inter-observer variability and diagnostic delays. Existing vision-language models fail to address the extreme class imbalance prevalent in clinical segmentation datasets while also preserving spatial accuracy, primarily because global pooling mechanisms discard crucial anatomical hierarchies. We present an end-to-end Explainable Vision-Language Model framework designed to overcome these limitations through two principal contributions. First, a Spatial Patch Cross-Attention module enables precise, text-directed localization of spinal anomalies. Second, a novel Adaptive PID-Tversky Loss function integrates control theory principles to dynamically modify training penalties, specifically addressing difficult, under-segmented minority instances. By incorporating foundational VLMs alongside an Automated Radiology Report Generation module, our framework achieves a diagnostic classification accuracy of 90.69%, a macro-averaged Dice score of 0.9512 for segmentation, and a CIDEr score of 92.80%. Furthermore, the framework provides explainability by converting complex segmentation predictions into radiologist-style clinical reports, thereby establishing a new benchmark for transparent, interpretable AI in clinical medical imaging that retains essential human supervision while enhancing diagnostic capabilities.
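
The Tversky loss itself is standard: it is 1 - TP / (TP + alpha * FP + beta * FN), so a larger beta penalizes missed foreground (under-segmentation) more heavily. The sketch below pairs it with a small PID controller that raises beta when observed recall falls short of a target. The controller, its gains, the recall-based error signal, and the clamping range are all assumptions for illustration; the abstract does not say how the authors couple the PID terms to the loss.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """1 - Tversky index for flat lists of soft predictions and binary labels."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


class PIDBeta:
    """Hypothetical PID controller that adapts the false-negative weight."""

    def __init__(self, kp=0.5, ki=0.1, kd=0.05, base_beta=0.5):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.base_beta = base_beta
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target_recall, observed_recall):
        error = target_recall - observed_recall
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        beta = (self.base_beta + self.kp * error
                + self.ki * self.integral + self.kd * derivative)
        return min(0.9, max(0.1, beta))  # keep beta in a sane range


# Under-segmented toy mask: one of three foreground pixels is found.
pred, target = [1.0, 0.0, 0.0, 0.0], [1, 1, 1, 0]
controller = PIDBeta()
beta = controller.update(target_recall=0.9, observed_recall=1 / 3)
loss = tversky_loss(pred, target, alpha=1.0 - beta, beta=beta)
print(round(beta, 3), round(loss, 3))
```

With alpha = beta = 0.5 the Tversky index equals the Dice coefficient, so the adaptive version can be read as a Dice loss whose balance shifts toward recall while minority structures are being missed.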

Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview

2604.02448v1 by Shramana Dey, Zahir Khan, T. A. PramodKumar, B. Uma Shankar, Ashis K. Dhara, Ramachandran Rajalakshmi, Rajiv Raman, Sushmita Mitra

Diabetic Retinopathy (DR) is a serious microvascular complication of diabetes and one of the leading causes of vision loss worldwide. Although automated detection and grading with Deep Learning (DL) can reduce the burden on ophthalmologists, it is constrained by the limited availability of high-quality datasets. Existing repositories often remain geographically narrow, contain limited samples, and exhibit inconsistent annotations or variable image quality, thereby restricting their clinical reliability. This paper presents a comprehensive review and comparative analysis of fundus image datasets used in the management of DR. The study evaluates their usability across key tasks, including binary classification, severity grading, lesion localization, and multi-disease screening. It also categorizes the datasets by size, accessibility, and annotation type (such as image-level, lesion-level, and multi-disease). Finally, a recently published dataset is presented as a case study to illustrate broader challenges in dataset curation and usage. The review consolidates current knowledge while highlighting persistent gaps such as the lack of standardized lesion-level annotations and longitudinal data. It also outlines recommendations for future dataset development to support clinically reliable and explainable solutions in DR screening.

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

2604.02236v1 by Minda Zhao, Yutong Yang, Chufei Peng, Rachel Gonsalves, Weiyue Li, Ruyi Yang, Zhixi Liu, Mengyu Wang

Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear. Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning and social inference. Across models and tasks, static emotional prefixes usually produce only small changes in accuracy, suggesting that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention. This stability is not uniform: effects are more variable in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning. Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones. We then introduce EmotionRL, an adaptive emotional prompting framework that selects the emotional framing for each query. Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting. Together, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

2604.02226v1 by Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso

Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
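
The gating logic described above can be sketched in a few lines: run several stochastic forward passes with dropout left on, average the action distributions, and defer to the language model only when the predictive entropy crosses a threshold. The toy linear policy, the entropy-based uncertainty measure, the threshold value, and the `ask_lm` stub are assumptions for illustration; the abstract confirms only that ASK uses Monte Carlo Dropout with an uncertainty threshold.

```python
import math
import random


def mc_dropout_probs(weights, features, rng, passes=30, drop_p=0.2):
    """Average softmax outputs over forward passes with random dropout masks."""
    mean = [0.0] * len(weights)
    for _ in range(passes):
        # Inverted dropout: zero a feature with probability drop_p, rescale the rest.
        kept = [f / (1.0 - drop_p) if rng.random() >= drop_p else 0.0
                for f in features]
        logits = [sum(w * x for w, x in zip(row, kept)) for row in weights]
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        norm = sum(exps)
        for i, e in enumerate(exps):
            mean[i] += e / norm / passes
    return mean


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_action(weights, features, ask_lm, threshold, rng):
    """Use the policy when it is confident, otherwise ask the language model."""
    probs = mc_dropout_probs(weights, features, rng)
    if entropy(probs) > threshold:
        return ask_lm(features), "lm"
    return max(range(len(probs)), key=probs.__getitem__), "policy"


def ask_lm(features):
    """Stub for a language model query; always suggests action 1 here."""
    return 1


rng = random.Random(0)
weights = [[10.0, 10.0], [-10.0, -10.0]]  # toy two-action linear policy
familiar = choose_action(weights, [1.0, 1.0], ask_lm, 0.6, rng)
unfamiliar = choose_action(weights, [0.0, 0.0], ask_lm, 0.6, rng)
print(familiar, unfamiliar)
```

In this contrived setup the all-zero input gives the policy no signal, so its averaged distribution is uniform and the query is routed to the stub, while the familiar input stays on the cheap policy path.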

摘要：強化學習（RL）代理在處理分佈外（OOD）情境時常常面臨困難，導致高度的不確定性和隨機行為。雖然語言模型（LM）包含有價值的世界知識，但較大的模型會產生高計算成本，妨礙實時使用，並在自主規劃方面顯示出限制。我們引入了通過知識的自適應安全（ASK），它將較小的LM與訓練過的RL策略結合，以增強OOD泛化而無需重新訓練。ASK採用蒙特卡羅隨機失活來評估不確定性，並僅在不確定性超過設定閾值時查詢LM以獲取行動建議。這種選擇性使用保留了現有策略的效率，同時利用語言模型在不確定情況下的推理能力。在FrozenLake環境的實驗中，ASK在領域內沒有顯示出改善，但在轉移任務中顯示出穩健的導航，獲得了0.95的獎勵。我們的研究結果表明，有效的神經符號整合需要謹慎的協調，而非簡單的組合，突顯了成功的OOD泛化所需的足夠模型規模和有效的混合機制。
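
The gating rule described in the ASK abstract can be illustrated with a minimal sketch. The abstract does not say which uncertainty measure or threshold is used, so the predictive entropy, the `threshold` value, and the `policy` and `ask_lm` callables below are hypothetical stand-ins: `policy(state)` is assumed to return action probabilities from one stochastic (dropout-enabled) forward pass, and `ask_lm(state)` is assumed to return an action index suggested by a language model.

```python
import math
from statistics import mean


def mc_dropout_action_probs(policy, state, n_samples=20):
    """Average action probabilities over several stochastic forward passes."""
    samples = [policy(state) for _ in range(n_samples)]
    n_actions = len(samples[0])
    return [mean(s[a] for s in samples) for a in range(n_actions)]


def predictive_entropy(probs):
    """Entropy (in nats) of the averaged action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_action(policy, ask_lm, state, threshold, n_samples=20):
    """Use the RL policy unless its uncertainty exceeds the threshold."""
    probs = mc_dropout_action_probs(policy, state, n_samples)
    if predictive_entropy(probs) > threshold:
        # Uncertain state: defer to the (hypothetical) language-model helper.
        return ask_lm(state), "lm"
    # Confident state: take the policy's most probable action.
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, "policy"
```

With this structure the language model is only called on states the policy is unsure about, which is the efficiency argument the abstract makes.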

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

2604.02207v1 by Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe

Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using quadratic weighted kappa (QWK) and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.

摘要：背景：準確翻譯放射學報告對於多語言研究、臨床溝通和放射學教育至關重要，但基於大型語言模型（LLM）的評估有效性仍不清楚。目標：評估LLM生成的胸部CT報告日文翻譯的教育適用性，並比較放射科醫生的評估與LLM作為評判者的評估。方法：我們分析了來自CT-RATE-JPN驗證集的150份胸部CT報告。對於每份英文報告，將人工編輯的日文翻譯與DeepSeek-V3.2生成的翻譯進行比較。一名經過認證的放射科醫生和一名放射科住院醫師獨立進行了盲評，根據四個標準進行配對評估：術語準確性、可讀性、整體質量和放射科醫生風格的真實性。同時，三名LLM評判者（DeepSeek-V3.2、Mistral Large 3和GPT-5）對相同的配對進行評估。使用QWK和百分比一致性評估協議。結果：放射科醫生與LLM評判者之間的協議接近於零（QWK=-0.04至0.15）。兩名放射科醫生之間的協議也很差（QWK=0.01至0.06）。放射科醫生1在59%的案例中將術語評為等同，並偏好LLM翻譯的可讀性（51%）和整體質量（51%）。放射科醫生2在75%的案例中將可讀性評為等同，並偏好人工編輯的翻譯在整體質量上（40%對21%）。所有三名LLM評判者在所有標準上都強烈偏好LLM翻譯（70%-99%），並在超過93%的案例中將其評為更像放射科醫生的翻譯。結論：LLM生成的翻譯通常被評為自然流暢，但兩名放射科醫生的評價存在顯著差異。LLM作為評判者對LLM輸出表現出強烈偏好，與放射科醫生的協議微不足道。對於翻譯放射學報告的教育使用，僅依賴自動化的LLM基於評估是不夠的；專家放射科醫生的審查仍然很重要。
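
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of quadratic weighted kappa (QWK) for two raters. The encoding of the pairwise judgments as three ordered categories (0 = prefers the human-edited translation, 1 = equivalent, 2 = prefers the LLM translation) is an assumption made for illustration; the abstract does not state how ratings were coded.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """QWK for two equal-length lists of integer ratings in [0, n_categories)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k, n = n_categories, len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal histograms, used for the chance-agreement matrix.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance, which is how the near-zero figures reported above should be read.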

Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data

2604.02031v1 by Alejandro Castañeda Garcia, Jan van Gemert, Daan Brinks, Nergis Tömen

Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, which leads to the loss of fine-grained detail and to blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions, and on three uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of how data are represented within a batch and of emphasizing rare samples in unsupervised image reconstruction. We will make all code and related data available.

摘要：自編碼器在圖像內容的空間不均勻取樣方面面臨挑戰。這在醫學影像、生物學和物理學中很常見，因為在特定的圖像坐標上，資訊性模式很少出現，背景在大多數樣本中主導這些位置，導致重建偏向於主要外觀。實際上，自編碼器對主導模式存在偏見，導致細節損失，並在稀有空間輸入下造成模糊的重建，特別是在空間數據不平衡的情況下。我們通過兩個互補的組件來解決空間不平衡問題：(i) 基於自熵的損失，對統計上不常見的空間位置進行加權，以及 (ii) 樣本傳播，一種重播機制，在訓練過程中選擇性地重新暴露模型於難以重建的樣本。我們在無監督重建環境中基準測試了原本為監督分類開發的現有數據平衡策略。基於這些方法的局限性，我們的方法專門針對空間不平衡，鼓勵模型專注於統計上稀有的位置，與現有基準相比，提高重建的一致性。我們在具有控制空間不平衡條件的模擬數據集以及三個不受控的多樣化真實世界數據集中進行驗證，這些數據集涵蓋物理、生物和天文領域。我們的方法在各種重建指標上超越了基準，特別是在空間不平衡分佈下。這些結果突顯了批次中數據表示的重要性，並強調了無監督圖像重建中稀有樣本的價值。我們將提供所有代碼和相關數據。

Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia

2604.01962v1 by Saja Al-Dabet, Sherzod Turaev, Nazar Zaki

Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).

摘要：異常頭部運動（AHMs）在廣泛的神經疾病中表現出來；然而，缺乏一個整合運動學測量、臨床嚴重程度評分和患者人口統計的多條件資源，構成了開發基於人工智慧的診斷工具的持續障礙。為了解決這一問題，本研究介紹了NeuroPose-AHM，這是一個基於知識的神經誘發AHMs數據集，通過應用於1,430篇經過同行評審的出版物的多LLM提取框架構建而成。該數據集包含2,756個患者群體級別的記錄，涵蓋57種神經疾病，來源於846篇與AHM相關的論文。跨LLM可靠性分析確認了穩健的提取性能，研究級別的分類達到強一致性（kappa = 0.822）。為了展示該數據集的分析效用，將四任務框架應用於頸部肌張力障礙（CD），這是由病理性頭部運動最直接定義的疾病。首先，任務1執行多標籤AHM類型分類（F1 = 0.856）。任務2構建頭頸嚴重程度指數（HNSI），這是一個統一的指標，將異質的臨床評分標準進行標準化。然後在任務3中評估該指數的臨床相關性，其中HNSI與現實世界的CD患者數據進行驗證，對應的重度比例（6.7%）為指數在高嚴重程度範圍內的校準提供了初步的合理性指示。最後，任務4在運動類型概率和HNSI分數之間進行橋接分析，產生了顯著的相關性（p小於0.001）。這些結果展示了NeuroPose-AHM作為一個結構化的、基於知識的神經AHM研究資源的分析效用。NeuroPose-AHM數據集在Zenodo上公開可用（https://doi.org/10.5281/zenodo.19386862）。

Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always

2604.01896v1 by Luka Hobor, Mario Brcic, Mihael Kovac, Kristijan Poje

Large language models (LLMs) have been proposed as alternatives to human experts for estimating unknown quantities with associated uncertainty, a process known as Bayesian elicitation. We test this by asking eleven LLMs to estimate population statistics, such as health prevalence rates, personality trait distributions, and labor market figures, and to express their uncertainty as 95% credible intervals. We vary each model's reasoning effort (low, medium, high) to test whether more "thinking" improves results. Our findings reveal three key results. First, larger, more capable models produce more accurate estimates, but increasing reasoning effort provides no consistent benefit. Second, all models are severely overconfident: their 95% intervals contain the true value only 9-44% of the time, far below the expected 95%. Third, a statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. Models performed well on commonly discussed topics but struggled with specialized health data. These results indicate that LLM uncertainty estimates require statistical correction before they can be used in decision-making.

摘要：大型語言模型（LLMs）被提出作為人類專家在估計與不確定性相關的未知數量的替代方案，這個過程被稱為貝葉斯引導。我們通過要求十一個LLM估計人口統計數據，例如健康流行率、個性特徵分佈和勞動市場數據，並將其不確定性表達為95\%可信區間，來測試這一點。我們變化每個模型的推理努力（低、中、高）以測試更多的“思考”是否能改善結果。我們的研究結果揭示了三個關鍵結果。首先，較大、能力更強的模型產生更準確的估計，但增加推理努力並未提供一致的好處。其次，所有模型都過於自信：它們的95\%區間僅在9--44\%的情況下包含真實值，遠低於預期的95\%。第三，一種稱為符合預測的統計重新校準技術可以糾正這種過度自信，擴大區間以實現預期的覆蓋率。在一個初步實驗中，給模型提供網絡搜索訪問權限使得已經準確的模型的預測變差，而對較弱的模型則有適度的改善。模型在常見話題上表現良好，但在專門的健康數據上則掙扎。這些結果表明，LLM的不確定性估計在用於決策之前需要進行統計校正。

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

2604.01841v1 by Minh-Khoi Pham, Thang-Long Nguyen Ho, Thao Thi Phuong Dao, Tai Tan Mai, Minh-Triet Tran, Marie E. Ward, Una Geary, Rob Brennan, Nick McDonald, Martin Crane, Marija Bezbradica

Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.

摘要：臨床預測來自結構化電子健康紀錄（EHRs）是具有挑戰性的，因為它們具有高維度性、異質性、類別不平衡和分佈轉移。雖然表格內文學習（TICL）和檢索增強方法在通用基準上表現良好，但它們在臨床環境中的行為仍不明確。我們提出了一個多隊列EHR基準，比較了傳統模型、深度表格模型和TICL模型在不同數據規模、特徵維度、結果稀有性和跨隊列泛化方面的表現。基於PFN的TICL模型在低數據環境中樣本效率高，但在異質性和不平衡性增加時，簡單的基於距離的檢索會導致性能下降。我們提出了AWARE，一個與任務對齊的檢索框架，使用監督式嵌入學習和輕量級適配器。在極端不平衡的情況下，AWARE將AUPRC提高了多達12.2%，並且隨著數據複雜性的增加而增長。我們的結果確定了檢索質量和檢索推理對齊是將表格內文學習應用於臨床預測的關鍵瓶頸。

A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection

2604.01798v1 by Arezoo Borji, Gernot Kronreif, Bernhard Angermayr, Francisco Mario Calisto, Wolfgang Birkfellner, Inna Servetnyk, Yinyin Yuan, Sepideh Hatamikia

Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset (627 WSIs from the TCGA-BRCA cohort), the method achieved an F1-score of 0.8812 and an AUC of 0.9841. On the external validation dataset, it achieved an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.

摘要：乳腺癌是一種高度異質性的疾病，具有多樣的分子特徵。PAM50基因特徵被廣泛認可為將乳腺癌分類為內在亞型的標準，從而使得更個性化的治療策略成為可能。在本研究中，我們介紹了一種新穎的優化驅動深度學習框架，旨在通過直接從H&E染色的全切片圖像（WSIs）預測PAM50亞型，以減少對昂貴分子檢測的依賴。我們的方法通過將非支配排序遺傳算法II（NSGA-II）與基於蒙特卡羅丟棄的不確定性估計相結合，聯合優化補丁信息量、空間多樣性、不確定性和補丁數量。所提出的方法可以識別出一小部分但高度信息豐富的補丁子集進行分類。我們使用ResNet18作為特徵提取的骨幹，並使用自定義CNN頭進行分類。為了評估，我們使用內部的TCGA-BRCA數據集作為訓練隊列，並使用外部的CPTAC-BRCA數據集作為測試隊列。在內部數據集上，使用來自TCGA-BRCA隊列的627個WSIs達到了0.8812的F1分數和0.9841的AUC。所提出方法在外部驗證數據集上的表現顯示F1分數為0.7952，AUC為0.9512。這些發現表明，所提出的優化引導、不確定性感知的補丁選擇能夠實現高性能，並提高基於組織病理學的PAM50分類的計算效率，相較於現有方法，這表明了一種可擴展的基於影像的替代方案，具有支持臨床決策的潛力。
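
The core idea behind NSGA-II, keeping candidates that no other candidate beats on every objective, can be sketched as a non-dominated filter. This shows only the first Pareto front, not the full algorithm (no crowding distance, crossover, or mutation). The objective tuple below is an assumption: every objective is expressed as a quantity to minimize, for example (negative informativeness, negative spatial diversity, uncertainty, patch count).

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def first_front(points):
    """Return the candidates that are not dominated by any other candidate."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Each point here would correspond to one candidate patch subset scored on the four objectives; the front contains the trade-offs from which a final subset is chosen.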

Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring

2604.01712v1 by Feiyu Zhou, Marios Impraimakis

The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms other response forecasting approaches because it requires no assumption about wind stationarity or about normal structural vibration behavior. Specifically, predictions of wind-excited dynamic behavior become uncertain, and often poor, when environmental or traffic conditions change, which makes it hard to distinguish what constitutes normal vibration behavior. To this end, the framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach accurately captures structural behavior in realistic conditions and under changes in the system excitation. The results, importantly, highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system's lifecycle with respect to temporal characteristics.

摘要：風引起的結構反應預測能力在這裡檢驗了一種新型Transformer方法。該模型還提供了一個數字雙胞胎組件，用於橋樑結構健康監測。首先，該方法利用系統的時間特徵來訓練預測模型。其次，將振動預測與實測數據進行比較，以檢測大偏差。最後，識別出的案例用作結構變化的早期預警指標。基於人工智慧的模型在反應預測方面表現優於其他方法，因為不需要對風的穩定性或結構的正常振動行為做出假設。具體而言，風激發的動態行為受到不確定性的影響，當環境或交通條件改變時，會導致預測不佳。這使得正常振動行為的界定變得困難。為此，該框架在挪威科技大學監測的哈爾丹格橋的實際測量數據上進行了嚴格檢驗。該方法在現實條件下捕捉到準確的結構行為，並考慮到系統激勵的變化。結果重要地突顯了基於Transformer的數字雙胞胎組件作為下一代工具的潛力，用於彈性基礎設施管理、持續學習和在系統生命周期內根據時間特徵進行自適應監測。
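
The second and third steps of the procedure (compare predictions with measurements, then flag large deviations) can be sketched as a residual threshold check. The abstract does not specify the detection rule, so the k-sigma threshold fitted on a reference period and the function names below are assumptions for illustration only.

```python
from statistics import mean, stdev


def fit_threshold(baseline_pred, baseline_meas, k=3.0):
    """Estimate the residual centre and a k-sigma limit from a reference period."""
    residuals = [m - p for p, m in zip(baseline_pred, baseline_meas)]
    return mean(residuals), k * stdev(residuals)


def flag_deviations(pred, meas, center, limit):
    """Indices where the measured response departs strongly from the forecast."""
    return [
        i
        for i, (p, m) in enumerate(zip(pred, meas))
        if abs((m - p) - center) > limit
    ]
```

The flagged indices would then serve as the early-warning candidates; in practice the forecasts would come from the trained transformer rather than from fixed lists.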

Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

2604.01705v1 by Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang, Quanlin Li, Pinghong Zhou, Shuo Wang

Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.

摘要：自動語音識別（ASR）是人機互動中一個關鍵介面，尤其是在胃腸內視鏡檢查中，但其在現實臨床環境中的可靠性受到特定領域術語和複雜聲學條件的限制。在此，我們介紹EndoASR，一個為內視鏡工作流程實時部署而設計的領域適應ASR系統。我們基於合成內視鏡報告開發了一種兩階段適應策略，針對特定領域的語言建模和噪音穩健性。在對六位內視鏡醫生的回顧性評估中，EndoASR顯著提高了轉錄準確性和臨床可用性，將字符錯誤率（CER）從20.52%降低至14.14%，並將醫療術語準確性（Med ACC）從54.30%提高至87.59%。在一項跨越五個獨立內視鏡中心的前瞻性多中心研究中，EndoASR在異質的現實條件下顯示出一致的泛化能力。與基線Paraformer模型相比，CER從16.20%降低至14.97%，而Med ACC從61.63%提高至84.16%，確認了其在實際部署情境中的穩健性。值得注意的是，EndoASR實現了0.005的實時因子（RTF），顯著快於Whisper-large-v3（RTF 0.055），同時保持220M參數的緊湊模型大小，實現高效的邊緣部署。此外，與大型語言模型的整合顯示，改善的ASR質量直接增強了下游結構化信息提取和臨床醫生與AI的互動。這些結果表明，領域適應的ASR可以作為胃腸內視鏡中人機協作的可靠介面，其一致的性能在多中心現實臨床環境中得到了驗證。
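
The two headline metrics in this abstract have simple definitions that can be sketched directly: character error rate is the character-level edit distance divided by the reference length, and real-time factor is processing time divided by audio duration. The function names are illustrative, and any text normalization applied before scoring in the study is not modeled here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = character edits needed / number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1 means transcription runs faster than the audio plays."""
    return processing_seconds / audio_seconds
```

Under this definition, an RTF of 0.005 corresponds to about half a second of compute for 100 seconds of audio.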

Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology

2604.01690v1 by Tianhao Shi, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Yang Song, Yongdong Zhang, Fuli Feng

The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidates the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identify a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovers the ability of the algorithmic content distribution mechanism to moderate these competing interests regarding AIGC. These findings advocate for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of online content platforms.

摘要：人工智慧生成內容（AIGC）的快速增長正在根本重塑線上內容生態系統，迫切需要對其行為和分配影響進行嚴格的檢視。本研究利用來自一家領先中國視頻分享平台的數千萬用戶的綜合縱向數據集，闡明了AIGC與人類生成內容（HGC）之間的獨特創作和消費行為。我們識別出一種普遍的規模優於偏好的動態，即AIGC創作者通過高產量的生產實現了與HGC創作者相當的總體參與度，儘管消費者對HGC的偏好顯著。更深入的分析揭示了算法內容分發機制在調節這些關於AIGC的競爭利益方面的能力。這些發現提倡實施對AIGC敏感的分發算法和精確的治理框架，以確保線上內容平台的長期健康。

Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture

2604.01661v1 by Florian Odi Stummer

Clinical AI systems routinely train on health data structurally distorted by documentation workflows, billing incentives, and terminology fragmentation. Prior work has characterised the mechanisms of this distortion: the three-forces model of documentary enactment, the reification feedback loop through which AI may amplify coding artefacts, and terminology governance failures that allow semantic drift to accumulate. Yet translating these insights into implementable software architecture remains an open problem. This paper proposes seven ontology-aware design patterns in Gang-of-Four pattern language for building clinical AI pipelines resilient to ontological distortion. The patterns address data ingestion validation (Ontological Checkpoint), low-frequency signal preservation (Dormancy-Aware Pipeline), continuous drift monitoring (Drift Sentinel), parallel representation maintenance (Dual-Ontology Layer), feedback loop interruption (Reification Circuit Breaker), terminology evolution management (Terminology Version Gate), and pluggable regulatory compliance (Regulatory Compliance Adapter). Each pattern is specified with Problem, Forces, Solution, Consequences, Known Uses, and Related Patterns. We illustrate their composition in a reference architecture for a primary care AI system and provide a walkthrough tracing all seven patterns through a diabetes risk prediction scenario. This paper does not report empirical validation; it offers a design vocabulary grounded in theoretical analysis, subject to future evaluation in production systems. Three patterns have partial precedent in existing systems; the remaining four have not been formally described. Limitations include the absence of runtime benchmarks and restriction to the German and EU regulatory context.

摘要：臨床 AI 系統通常在受到文檔工作流程、計費激勵和術語碎片化結構性扭曲的健康數據上進行訓練。先前的研究已經描述了這種扭曲的機制：文檔執行的三力模型、AI 可能放大編碼工件的具象反饋循環，以及允許語義漂移累積的術語治理失敗。然而，將這些洞見轉化為可實施的軟體架構仍然是一個未解決的問題。本文提出了七種基於本體的設計模式，使用四人幫模式語言來構建對本體扭曲具有韌性的臨床 AI 管道。這些模式針對數據攝取驗證（本體檢查點）、低頻信號保留（休眠感知管道）、持續漂移監控（漂移哨兵）、平行表示維護（雙本體層）、反饋循環中斷（具象電路斷路器）、術語演變管理（術語版本閘）和可插拔的監管合規性（監管合規適配器）。每個模式都包含問題、力量、解決方案、後果、已知用途和相關模式的規範。我們在一個初級護理 AI 系統的參考架構中展示了它們的組合，並提供了一個通過糖尿病風險預測場景追蹤所有七種模式的步驟。本文不報告實證驗證；它提供了一個基於理論分析的設計詞彙，待未來在生產系統中進行評估。三種模式在現有系統中有部分先例；其餘四種尚未被正式描述。限制包括缺乏運行時基準和僅限於德國及歐盟的監管背景。

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

2604.01658v1 by Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.

摘要：大型語言模型（LLM）基礎的演化是一種有前景的開放式發現方法，其中進展需要持續的搜索和知識積累。現有的方法仍然在很大程度上依賴於固定的啟發式和硬編碼的探索規則，這限制了LLM代理的自主性。我們提出了CORAL，這是第一個針對開放式問題的自主多代理演化框架。CORAL用長期運行的代理取代了僵化的控制，這些代理通過共享的持久記憶、異步多代理執行和基於心跳的干預進行探索、反思和合作。它還提供了實用的安全措施，包括隔離的工作空間、評估者分離、資源管理以及代理會話和健康管理。在多樣的數學、算法和系統優化任務上進行評估，CORAL在10個任務上設置了新的最先進結果，實現了3-10倍的更高改進率，並且在任務中所需的評估次數遠少於固定的演化搜索基準。在Anthropic的核心工程任務上，四個共同演化的代理將最佳已知分數從1363改善到1103個循環。機械分析進一步顯示這些增益是如何來自知識重用和多代理的探索與通信。總體而言，這些結果表明，更大的代理自主性和多代理演化可以顯著改善開放式發現。代碼可在 https://github.com/Human-Agent-Society/CORAL 獲得。

NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy

2604.01612v1 by Kyeonghun Kim, Hyeonseok Jung, Youngung Han, Hyunsu Go, Eunseob Choi, Seongbin Park, Junsu Lim, Jiwon Yang, Sumin Lee, Insung Hwang, Ken Ying-Kai Liao, Nam-Joon Kim

Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.

摘要：體積CT影像對臨床診斷至關重要，但標註3D體積既昂貴又耗時，這促使了從未標記數據中進行自我監督學習（SSL）。然而，由於全體積Transformer的高記憶體成本以及CT數據的各向異性空間結構，將SSL應用於3D CT仍然具有挑戰性，傳統的遮罩策略無法很好地捕捉這一點。我們提出了NEMESIS，一個在局部128x128x128超補丁上運行的遮罩自編碼器（MAE）框架，實現了記憶體高效的訓練，同時保留了解剖細節。NEMESIS引入了三個關鍵組件：（i）作為前置任務的噪聲增強重建，（ii）通過平行平面和軸向標記移除進行雙重遮罩的遮罩解剖Transformer塊（MATB），以及（iii）用於跨尺度上下文聚合的NEMESIS標記（NT）。在BTCV多器官分類基準上，NEMESIS與冷凍主幹和線性分類器的組合達到了0.9633的平均AUROC，超越了完全微調的SuPreM（0.9493）和VoCo（0.9387）。在僅有10%可用標註的低標籤情況下，它仍然保持0.9075的AUROC，顯示出強大的標籤效率。此外，基於超補丁的設計將每次前向傳播的計算成本降低至31.0 GFLOPs，相較於全體積基線的985.8 GFLOPs，為3D醫學影像提供了一個可擴展且穩健的基礎。

Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training

2604.01563v1 by Abdelrahman Abouzeid

In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3x2 factorial at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 under Muon, approximately three times larger. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, shows no such penalty. Our evidence points to two failure modes of erf under Muon's faster spectral-norm growth: saturation (lossy compression) and scale blindness (discarding activation magnitude). An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap. Separately, reducing Derf's alpha from its published default of 0.5 to 0.3 recovers ~80% by keeping erf in its near-linear regime, where it approximately preserves relative scale; the 0.3 setting is not the published default of Chen & Liu (2025). Using Derf's published default alpha with Muon incurs a 0.66-nat interaction penalty without producing NaNs or divergence, making the failure easy to miss in short pilot runs.

摘要：在LLM訓練中，正規化層和優化器通常被視為獨立的設計選擇。在1B參數和1000訓練步驟的3x2因子實驗中，我們顯示這一假設可能會失效：動態誤差函數（Derf；Chen & Liu, 2025）與Muon（Jordan, 2024）之間存在較大的負交互作用，其與RMSNorm的差距從AdamW下的+0.31 nats增長至Muon下的+0.97，約大三倍。作為有界正規化控制的動態雙曲正切（DyT；Zhu et al., 2025）並未顯示出這樣的懲罰。我們的證據指向在Muon的更快光譜範數增長下，erf的兩種失效模式：飽和（有損壓縮）和尺度盲（丟棄激活幅度）。一種重新引入運行尺度估計的EMA混合方法恢復了約84%的差距。另外，將Derf的alpha從其已發表的默認值（0.5調整至0.3）恢復了約80%，因為這樣可以使erf保持在其接近線性的範疇內，並大致保持相對尺度；這一設置並不是Chen & Liu（2025）所發表的默認值。使用Derf已發表的默認alpha與Muon結合會產生0.66-nat的交互懲罰，而不會產生NaNs或發散，使得在短期試點運行中容易忽略這一失敗。
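
The 0.66-nat figure follows from the two gaps quoted in the abstract, and the saturation argument can be made concrete with a small numeric check. In the sketch below, the bare form erf(alpha * x) is a simplifying assumption; the published Derf layer may include additional learnable parameters that are not modeled here.

```python
import math

# Gaps to RMSNorm quoted in the abstract (nats).
gap_under_adamw = 0.31
gap_under_muon = 0.97

# Interaction penalty: how much the gap grows when switching optimizer.
interaction = gap_under_muon - gap_under_adamw
ratio = gap_under_muon / gap_under_adamw


def erf_squash(x, alpha):
    """Simplified bounded normalizer: erf applied to a scaled activation."""
    return math.erf(alpha * x)


def relative_scale_kept(x_small, x_large, alpha):
    """Ratio of outputs for two inputs; equals x_large / x_small if fully linear."""
    return erf_squash(x_large, alpha) / erf_squash(x_small, alpha)
```

For inputs 1 and 4 a purely linear map would return a ratio of 4; a smaller alpha keeps the returned ratio closer to that value because fewer activations reach the flat part of erf, which is the sense in which magnitude information is preserved.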

Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

2604.01538v1 by Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu

Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.

摘要：大型語言模型已在醫療領域被採用，用於臨床文檔以減輕臨床醫師的負擔。然而，研究報告指出，當使用特定任務的醫療數據集進行微調時，LLMs經常會「遺忘」大量的指令跟隨能力，這是將通用LLMs應用於臨床的關鍵挑戰。本研究提出了一種模型合併框架，以有效地將通用LLMs適應於醫療領域，通過對抗這一遺忘問題。通過將臨床基礎模型（GatorTronLlama）與通用指令模型（Llama-3.1-8B-Instruct）通過基於插值的合併方法進行合併，我們旨在推導出一個在臨床任務上表現強勁的領域適應模型，同時保留指令跟隨能力。在醫療基準和五個臨床生成任務（例如，放射學和出院摘要）的全面評估顯示，合併模型可以有效減輕災難性遺忘，保留臨床領域專業知識，並保持指令跟隨能力。此外，我們的模型合併策略展示了訓練效率，在嚴格限制的監督下（例如，64-shot對比256-shot）達到與完全微調基準相當的性能。因此，權重空間合併構成了一種高度可擴展的解決方案，用於將開源LLMs適應於臨床應用，促進在資源有限的醫療環境中的更廣泛部署。
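
The simplest interpolation-based merge is a per-parameter linear blend of two checkpoints that share an architecture. The sketch below uses plain dictionaries of float lists as a stand-in for model state; the coefficient `lam` and the function name are hypothetical, and the abstract does not say which interpolation variants or coefficients were used.

```python
def interpolate_weights(clinical, instruct, lam):
    """Blend two checkpoints: lam * clinical + (1 - lam) * instruct.

    Both arguments map parameter names to equal-length lists of floats.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if clinical.keys() != instruct.keys():
        raise ValueError("checkpoints must have identical parameter names")

    merged = {}
    for name, clinical_values in clinical.items():
        instruct_values = instruct[name]
        if len(clinical_values) != len(instruct_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            lam * c + (1.0 - lam) * g
            for c, g in zip(clinical_values, instruct_values)
        ]
    return merged
```

Setting `lam` to 1 returns the clinical model and 0 returns the instruct model; intermediate values trade domain knowledge against instruction following without any gradient updates, which is where the training-efficiency claim comes from.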

PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

2604.01532v1 by Ayan Das, Dhaval Patel

Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.

摘要：大型語言模型（LLM）代理人越來越多地被部署於複雜的工具協調任務中，然而現有的基準無法捕捉到工業領域的嚴格需求，在這些領域中，不正確的決策會帶來重大的安全和財務後果。為了解決這一關鍵缺口，我們推出了PHMForge，這是第一個專門設計用於評估LLM代理人在預測與健康管理（PHM）任務上的綜合基準，通過與特定領域的MCP伺服器進行現實互動。我們的基準涵蓋了75個專家策劃的場景，跨越7個工業資產類別（渦扇發動機、軸承、電動馬達、齒輪箱、航空發動機），涵蓋5個核心任務類別：剩餘使用壽命（RUL）預測、故障分類、發動機健康分析、成本效益分析和安全/政策評估。為了實現嚴格的評估，我們在兩個MCP伺服器上構建了65個專門工具，並實施了基於執行的評估者，使用與任務相稱的指標：回歸的MAE/RMSE、分類的F1分數以及健康評估的類別匹配。通過對領先框架（ReAct、Cursor Agent、Claude Code）與前沿LLM（Claude Sonnet 4.0、GPT-4o、Granite-3.0-8B）的廣泛評估，我們發現即使是表現最佳的配置也僅達到68%的任務完成率，在工具協調（23%的不正確排序）、多資產推理（下降14.9個百分點）和跨設備泛化（在保留數據集上為42.7%）方面存在系統性失敗。我們開源了完整的基準，包括場景規範、真實模板、工具實現和評估腳本，以促進代理工業AI的研究。
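The PHMForge abstract names MAE/RMSE for regression and F1-score for classification as its task-commensurate metrics. The sketch below shows how such evaluators could be written in plain Python. The function names and the toy RUL values are assumptions for illustration and are not taken from the benchmark's released scripts.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, e.g. for Remaining Useful Life predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalises large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def f1_binary(y_true, y_pred, positive=1):
    """F1-score for one positive class (e.g. 'fault present')."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical RUL values in cycles and hypothetical fault labels.
rul_true, rul_pred = [100, 80, 60], [90, 85, 60]
fault_true, fault_pred = [1, 1, 0, 0], [1, 0, 1, 0]
```

Reporting MAE and RMSE together is useful because a large gap between them signals a few badly mispredicted assets rather than uniform error.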

A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies

2604.01529v1 by Congjing Zhang, Ruoxuan Bao, Jingyu Li, Yoav Ackerman, Shuai Huang, Yanfang Su

Current Large Language Model (LLM) approaches for information extraction (IE) in the healthy food policy domain are often hindered by various factors, including misinformation, specifically hallucinations, misclassifications, and omissions that result from the structural diversity and inconsistency of policy documents. To address these limitations, this study proposes a role-based LLM framework that automates the IE from unstructured policy data by assigning specialized roles: an LLM policy analyst for metadata and mechanism classification, an LLM legal strategy specialist for identifying complex legal approaches, and an LLM food system expert for categorizing food system stages. This framework mimics expert analysis workflows by incorporating structured domain knowledge, including explicit definitions of legal mechanisms and classification criteria, into role-specific prompts. We evaluate the framework using 608 healthy food policies from the Healthy Food Policy Project (HFPP) database, comparing its performance against zero-shot, few-shot, and chain-of-thought (CoT) baselines using Llama-3.3-70B. Our proposed framework demonstrates superior performance in complex reasoning tasks, offering a reliable and transparent methodology for automating IE from health policies.

摘要：目前在健康食品政策領域中，針對信息提取（IE）的大型語言模型（LLM）方法常常受到各種因素的阻礙，包括錯誤信息，特別是幻覺、錯誤分類以及由於政策文件的結構多樣性和不一致性而導致的遺漏。為了解決這些限制，本研究提出了一個基於角色的LLM框架，通過分配專門角色來自動化從非結構化政策數據中提取信息：一個LLM政策分析師負責元數據和機制分類，一個LLM法律策略專家負責識別複雜的法律方法，以及一個LLM食品系統專家負責對食品系統階段進行分類。該框架通過將結構化的領域知識納入角色特定的提示，模仿專家分析工作流程，包括法律機制和分類標準的明確定義。我們使用來自健康食品政策項目（HFPP）數據庫的608個健康食品政策來評估該框架，並將其性能與使用Llama-3.3-70B的零樣本、少樣本和思維鏈（CoT）基準進行比較。我們提出的框架在複雜推理任務中顯示出卓越的性能，提供了一種可靠且透明的方法來自動化從健康政策中提取信息。

DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data

2604.01481v1 by Arshia Ilaty, Hossein Shirazi, Amir Rahmani, Hajar Homayouni

The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often struggle to capture the complex, non-linear dependencies and severe class imbalances inherent in Electronic Health Records (EHR), leading to statistically plausible but clinically invalid records. To bridge this gap, we introduce DISCO-TAB (DIScriminator-guided COntrol for TABular synthesis), a novel framework that orchestrates a fine-tuned LLM with a multi-objective discriminator system optimized via Reinforcement Learning. Unlike prior methods relying on scalar feedback, DISCO-TAB evaluates synthesis at four granularities, token, sentence, feature, and row, while integrating Automated Constraint Discovery and Inverse-Frequency Reward Shaping to autonomously preserve latent medical logic and resolve minority-class collapse. We rigorously validate our framework across diverse benchmarks, including high-dimensional, small-sample medical datasets (e.g., Heart Failure, Parkinson's). Our results demonstrate that hierarchical feedback yields state-of-the-art performance, achieving up to 38.2% improvement in downstream clinical classifier utility compared to GAN and Diffusion baselines, while ensuring exceptional statistical fidelity (JSD < 0.01) and robust resistance to membership inference attacks. This work establishes a new standard for generating trustworthy, utility-preserving synthetic tabular data for sensitive healthcare applications.

摘要：臨床決策支持系統的發展常常受到高保真、隱私保護的生物醫學數據稀缺的阻礙。雖然生成大型語言模型（LLMs）為合成數據生成提供了一個有前景的途徑，但它們往往難以捕捉電子健康記錄（EHR）中固有的複雜非線性依賴關係和嚴重的類別不平衡，導致統計上合理但臨床上無效的記錄。為了彌補這一差距，我們提出了DISCO-TAB（基於判別器的表格合成控制），這是一個新穎的框架，協調了一個經過微調的LLM與通過強化學習優化的多目標判別器系統。與依賴標量反饋的先前方法不同，DISCO-TAB在四個粒度上評估合成：標記、句子、特徵和行，同時整合自動約束發現和逆頻率獎勵塑造，以自主保留潛在醫療邏輯並解決少數類別崩潰。我們在多個基準上嚴格驗證了我們的框架，包括高維度、小樣本醫療數據集（例如，心力衰竭、帕金森病）。我們的結果表明，分層反饋產生了最先進的性能，與GAN和擴散基準相比，在下游臨床分類器效用上提高了多達38.2%，同時確保了卓越的統計保真度（JSD < 0.01）以及對成員推斷攻擊的強大抵抗力。這項工作為生成值得信賴、保留效用的合成表格數據設立了新的標準，適用於敏感的醫療保健應用。
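Two quantities in the DISCO-TAB abstract lend themselves to a short worked example: the Jensen-Shannon divergence (JSD) used to report statistical fidelity, and inverse-frequency weighting, which the abstract uses for reward shaping against minority-class collapse. The sketch below is an assumption-laden illustration. The log base (2), the normalisation of the weights, and both function names are choices made here and are not taken from the paper.

```python
import math
from collections import Counter


def jsd(p, q):
    """Jensen-Shannon divergence (log base 2, so the range is [0, 1])
    between two discrete distributions given as equal-length lists."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count): rare classes get larger weights.
    This 'balanced' normalisation is an assumed choice."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Hypothetical imbalanced outcome column: nine negatives, one positive.
weights = inverse_frequency_weights([0] * 9 + [1])
```

Under this normalisation a perfectly balanced column yields a weight of 1.0 for every class, so the weights can be read as a multiplier relative to the balanced case.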

Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis

2604.01463v1 by Keshav Shankar, Dan Ding, Wei Gao

Physically Assistive Robots (PARs) require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause severe physical and cognitive fatigue for users with profound motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework (OTPF). This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, independent clinical experts confirmed the generated policies are safe and accurately reflect user preferences.

摘要：身體輔助機器人（PARs）需要個性化的行為以確保使用者的安全和舒適。然而，傳統的偏好學習方法，如全面的成對比較，會對有嚴重運動障礙的使用者造成嚴重的身體和認知疲勞。為了解決這個問題，我們提出了一個低負擔的離線框架，將非結構化的自然語言反饋直接轉換為確定性的機器人控制政策。為了安全地彌合模糊的人類語言與機器人代碼之間的差距，我們的流程使用基於職業治療實踐框架（OTPF）的大型語言模型（LLMs）。這種臨床推理將主觀的使用者反應解碼為明確的身體和心理需求，然後將其映射到透明的決策樹中。在部署之前，自動化的“LLM作為評判者”驗證代碼的結構安全性。我們在一項模擬的餐飲準備研究中驗證了這個系統，參與者為10名癱瘓成人。結果顯示，我們的自然語言方法顯著減少了使用者的工作負擔，與傳統基準相比。此外，獨立的臨床專家確認生成的政策是安全的，並準確反映了使用者的偏好。

When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

2604.01449v2 by Khalid Adnan Alsayed

Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.

摘要：人工智慧（AI）系統越來越多地融入醫療和藥學工作流程，支持如藥物建議、劑量確定和藥物相互作用檢測等任務。雖然這些系統在標準評估指標下通常表現良好，但它們在現實世界決策中的可靠性仍然不夠明瞭。在高風險領域如藥物管理中，即使是一個錯誤的建議也可能導致嚴重的患者傷害。本文通過聚焦於系統故障及其潛在臨床後果，檢視AI輔助藥物系統的可靠性。這項工作不僅僅通過聚合指標來評估性能，而是將注意力轉向錯誤是如何發生的，以及當AI系統產生不正確輸出時會發生什麼。通過一系列控制的模擬場景，涉及藥物相互作用和劑量決策，我們分析不同類型的系統故障，包括漏掉的相互作用、不正確的風險標記和不當的劑量建議。研究結果強調，在藥物相關的情境中，AI錯誤可能導致不良藥物反應、無效治療或延遲護理，特別是在系統未經充分人類監督的情況下。此外，本文討論了過度依賴AI建議的風險以及決策過程中透明度有限所帶來的挑戰。這項工作為醫療領域的AI評估提供了一個以可靠性為重點的視角，強調理解故障行為和現實世界影響的重要性。它突顯了在安全關鍵領域如藥學實踐中，將傳統性能指標與風險意識評估方法相結合的必要性。

AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

2604.01371v1 by Aiza Maksutova, Lalithkumar Seenivasan, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Yiqing Shen, Mathias Unberath

Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.

摘要：外科手術行動自動化已迅速進展，朝著實現類似外科醫生的靈巧控制邁進，主要受到從示範學習和視覺-語言-行動模型的進步驅動。雖然這些在桌面實驗中已顯示出成功，但將其轉化為臨床應用仍然具有挑戰性：當前的方法對於儀器在組織表面上的互動位置提供的預測能力有限，並且缺乏明確的條件輸入來強制執行工具-行動特定的安全互動區域。為了解決這一差距，我們介紹了AffordTissue，一個多模態框架，用於在膽囊切除術中預測工具-行動特定的組織可用性區域，並以密集熱圖的形式呈現。我們的方法結合了一個時間視覺編碼器，捕捉多個視角下的工具運動和組織動態，語言條件化使得在多樣的儀器-行動對中進行泛化，以及一個DiT風格的解碼器，用於密集的可用性預測。我們通過策劃和標註103個膽囊切除術中的15,638個視頻片段，建立了第一個組織可用性基準，涵蓋六個獨特的工具-行動對，涉及四種儀器（鉤子、抓鉗、剪刀、夾子）及其相關任務：解剖、抓取、夾持和切割。實驗顯示，相較於視覺-語言模型基準，我們的架構在密集外科可用性預測上有顯著改善（20.6 px ASSD 對比 60.2 px 的 Molmo-VLM），顯示我們的任務特定架構優於大型基礎模型。通過預測工具-行動特定的組織可用性區域，AffordTissue 為安全的外科自動化提供了明確的空間推理，潛在地解鎖了對適當組織區域的明確政策指導，並在儀器偏離預測的安全區域時及早安全停止。
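The AffordTissue results are reported in ASSD (average symmetric surface distance, in pixels). As a rough illustration of the metric, the sketch below computes ASSD between two sets of boundary points. How the authors extract boundaries from predicted heatmaps is not stated in the abstract, so the inputs here are assumed to be already-extracted 2D point lists, and the function name is hypothetical.

```python
import math


def assd(boundary_a, boundary_b):
    """Average symmetric surface distance between two point sets.

    For every point, take the distance to the nearest point of the other
    set; average over all points of both sets.
    """
    if not boundary_a or not boundary_b:
        raise ValueError("both boundaries need at least one point")

    def directed_sum(src, dst):
        return sum(min(math.dist(s, d) for d in dst) for s in src)

    total = directed_sum(boundary_a, boundary_b) + directed_sum(boundary_b, boundary_a)
    return total / (len(boundary_a) + len(boundary_b))


# Two vertical two-pixel segments, three pixels apart.
predicted = [(0, 0), (0, 1)]
reference = [(3, 0), (3, 1)]
```

Lower is better: an ASSD of 0 means the two boundaries coincide, and the value is expressed in the same units as the coordinates.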

Safety, Security, and Cognitive Risks in World Models

2604.01346v1 by Manoj Parmar

World models -- learned internal simulators of environment dynamics -- are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause catastrophic failures in safety-critical deployments. World model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking precisely because they can simulate the consequences of their own actions. Authoritative world model predictions further foster automation bias and miscalibrated human trust that operators lack the tools to audit. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker capability taxonomy; and develops a unified threat model extending MITRE ATLAS and the OWASP LLM Top 10 to the world model stack. We provide an empirical proof-of-concept on trajectory-persistent adversarial attacks (GRU-RSSM: A_1 = 2.26x amplification, -59.5% reduction under adversarial fine-tuning; stochastic RSSM proxy: A_1 = 0.65x; DreamerV3 checkpoint: non-zero action drift confirmed). We illustrate risks through four deployment scenarios and propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design. We argue that world models must be treated as safety-critical infrastructure requiring the same rigour as flight-control software or medical devices.

摘要：世界模型——學習的環境動態內部模擬器——正迅速成為機器人、自主車輛和自主人工智慧中自主決策的基礎。然而，這種預測能力引入了一組獨特的安全性、保安性和認知風險。對手可以腐敗訓練數據、毒害潛在表示，並利用累積的展開錯誤來造成在安全關鍵部署中的災難性失敗。配備世界模型的代理更容易出現目標誤泛化、欺騙性對齊和獎勵駭客，正因為它們能夠模擬自身行動的後果。權威的世界模型預測進一步助長了自動化偏見和不當校準的人類信任，操作員缺乏審計工具。
本文調查了世界模型的現狀；引入了軌跡持續性和表示風險的正式定義；提出了五種攻擊者能力分類法；並發展了一個統一的威脅模型，將MITRE ATLAS和OWASP LLM前10名擴展到世界模型堆棧。我們提供了一個關於軌跡持續性對抗攻擊的實證概念驗證（GRU-RSSM: A_1 = 2.26倍增強，對抗微調下減少59.5%；隨機RSSM代理: A_1 = 0.65倍；DreamerV3檢查點: 確認非零行動漂移）。我們通過四個部署場景說明了風險，並提出了跨學科的緩解措施，涵蓋對抗加固、對齊工程、NIST AI RMF和EU AI法案治理，以及人因設計。我們認為，世界模型必須被視為安全關鍵基礎設施，需遵循與飛行控制軟件或醫療設備相同的嚴謹性。

Regularizing Attention Scores with Bootstrapping

2604.01339v1 by Neo Christopher Chung, Maxim Laletin

Vision transformers (ViT) rely on the attention mechanism to weigh input features, and attention scores have therefore naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffuse attention maps and limiting interpretability. Can we quantify the uncertainty of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. This bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization

摘要：視覺轉換器（ViT）依賴注意力機制來加權輸入特徵，因此注意力分數自然被視為其決策過程的解釋。然而，注意力分數幾乎總是非零的，這導致了噪聲和擴散的注意力圖，限制了可解釋性。我們能否量化注意力分數的不確定性度量並獲得正則化的注意力分數？為此，我們在統計框架中考慮ViT的注意力分數，其中獨立噪聲會導致不重要但非零的分數。利用統計學習技術，我們引入了注意力分數的自助法，通過重新抽樣輸入特徵生成注意力分數的基準分佈。這樣的自助分佈隨後用於估計注意力分數的顯著性和後驗概率。在自然和醫學圖像中，所提出的\emph{注意力正則化}方法展示了直接去除由噪聲引起的虛假注意力，顯著改善了收縮性和稀疏性。定量評估使用模擬和現實世界數據集進行。我們的研究強調自助法作為使用注意力分數作為ViT解釋的實用正則化工具。
代碼可用：https://github.com/ncchung/AttentionRegularization
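The abstract describes using a bootstrap baseline distribution to estimate the significance of each attention score. A minimal sketch of that idea follows, assuming the baseline scores have already been produced by resampling input features. The function names, the add-one p-value estimator, and the hard threshold `alpha` are assumptions for illustration. The authors' repository may differ, and the posterior-probability step is not shown.

```python
def empirical_p_values(observed, null_scores):
    """Fraction of baseline (bootstrap) scores at least as large as each
    observed score, with the usual add-one correction."""
    b = len(null_scores)
    return [(1 + sum(1 for n in null_scores if n >= s)) / (1 + b)
            for s in observed]


def regularize(observed, null_scores, alpha=0.05):
    """Keep attention scores that are significant at level alpha; zero the rest."""
    p_values = empirical_p_values(observed, null_scores)
    return [s if p <= alpha else 0.0 for s, p in zip(observed, p_values)]


# Hypothetical baseline distribution and three observed scores.
null_scores = [i / 100 for i in range(1, 100)]
observed = [0.005, 0.505, 2.0]
sparse = regularize(observed, null_scores)
```

Zeroing non-significant scores is what produces the sparsity mentioned in the abstract: only attention that stands out from the resampled baseline survives.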

AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation

2604.01167v1 by Prantik Deb, Srimanth Dhondy, N. Ramakrishna, Anu Kapoor, Raju S. Bapi, Tapabrata Chakraborti

Chest X-ray (CXR) segmentation is an important step in computer-aided diagnosis, yet deploying large foundation models in clinical settings remains challenging due to computational constraints. We propose AdaLoRA-QAT, a two-stage fine-tuning framework that combines adaptive low-rank encoder adaptation with full quantization-aware training. Adaptive rank allocation improves parameter efficiency, while selective mixed-precision INT8 quantization preserves structural fidelity crucial for clinical reliability. Evaluated across large-scale CXR datasets, AdaLoRA-QAT achieves 95.6% Dice, matching full-precision SAM decoder fine-tuning while reducing trainable parameters by 16.6× and yielding 2.24× model compression. A Wilcoxon signed-rank test confirms that quantization does not significantly degrade segmentation accuracy. These results demonstrate that AdaLoRA-QAT effectively balances accuracy, efficiency, and structural trustworthiness, enabling compact and deployable foundation models for medical image segmentation. Code and pretrained models are available at: https://prantik-pdeb.github.io/adaloraqat.github.io/

摘要：胸部X光（CXR）分割是電腦輔助診斷中的一個重要步驟，但由於計算限制，在臨床環境中部署大型基礎模型仍然具有挑戰性。我們提出AdaLoRA-QAT，一個兩階段的微調框架，結合了自適應低秩編碼器適應和全面量化感知訓練。自適應秩分配提高了參數效率，而選擇性混合精度INT8量化則保留了對臨床可靠性至關重要的結構保真度。在大規模CXR數據集上評估，AdaLoRA-QAT達到了95.6%的Dice，與全精度SAM解碼器微調相匹配，同時將可訓練參數減少了16.6\times，並實現了2.24\times的模型壓縮。威爾科克森符號秩檢驗確認量化並未顯著降低分割準確性。這些結果表明，AdaLoRA-QAT有效地平衡了準確性、效率和結構可信度，使得醫學影像分割的緊湊和可部署的基礎模型成為可能。代碼和預訓練模型可在以下網址獲得：https://prantik-pdeb.github.io/adaloraqat.github.io/
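Two building blocks mentioned in the AdaLoRA-QAT abstract can be illustrated compactly: the Dice score used for evaluation and the quantize-dequantize ("fake quantization") step that quantization-aware training typically inserts into the forward pass. The sketch below assumes symmetric per-tensor INT8 quantization. That choice, and both function names, are assumptions rather than details confirmed by the abstract.

```python
def dice(pred, target):
    """Dice coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def fake_quantize_int8(values):
    """Symmetric per-tensor INT8 quantize-dequantize.

    Values are scaled to [-127, 127], rounded to integers, then mapped
    back to floats, so downstream layers see the rounding error.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


weights = [0.3, -1.0, 0.77]          # made-up weights
quantized = fake_quantize_int8(weights)
```

The per-value error is bounded by half the quantization step (`scale / 2`), which is why tensors with a few large outliers lose the most precision under a single shared scale.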

Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

2604.01152v1 by Mohammad R. Abu Ayyash

We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.

摘要：我們提出了 Brainstacks，一種模組化架構，用於大型語言模型的持續多領域微調，將領域專業知識打包為凍結的適配器堆疊，這些堆疊在推理時在共享的凍結基礎上進行加法組合。五個相互交織的組件：(1) MoE-LoRA，使用 Shazeer 風格的噪聲 top-2 路由，跨越所有七個Transformer投影，在 QLoRA 4 位量化下，並使用 rsLoRA 縮放；(2) 內部循環通過凍結訓練堆疊並添加新的堆疊來執行殘差增強；(3) 外部循環訓練具有課程排序依賴關係的序列領域特定堆疊；(4) 通過隨機 SVD 的零空間投影，將新的堆疊約束到與先前方向正交的子空間，實現零遺忘；(5) 基於結果的 sigmoid 元路由器，根據經驗發現的領域組合目標進行訓練，選擇性地加權堆疊，使跨領域組合成為可能。兩個邊界實驗：(6) 在隨機初始化的模型上進行 PSN 預訓練；(7) 每個領域的強化學習（DPO/GRPO）驗證與後 SFT 對齊的兼容性。在 TinyLlama-1.1B（4 個領域，9 個堆疊）和 Gemma 3 12B IT（5 個領域，10 個堆疊）上進行驗證，MoE-LoRA 的收斂速度比參數匹配的單一 LoRA 快 2.5 倍，殘差增強突破了單堆疊的天花板，路由系統恢復了因無閘堆疊累積而損失的生成質量。核心發現：基於結果的路由器發現領域堆疊編碼了可轉移的認知原語（遵循指令的清晰度、數字推理、程序邏輯、思維鏈結構），而不是特定於領域的知識，儘管這些堆疊中沒有醫療數據，但醫療提示在 97% 的情況下路由到 chat+math 堆疊。
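Component (4) of the Brainstacks abstract constrains new stacks to subspaces orthogonal to prior directions. The sketch below illustrates only the projection step. It swaps the randomized SVD named in the abstract for classical Gram-Schmidt orthogonalisation, which is easier to show in a few lines, and operates on toy vectors. All names here are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(update, prior_directions):
    """Remove from `update` every component lying in the span of the
    prior directions, leaving only the orthogonal remainder."""
    w = list(update)
    for b in orthonormal_basis(prior_directions):
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


prior = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]   # directions used by earlier stacks
new_update = [1.0, 2.0, 3.0]
safe_update = project_to_null_space(new_update, prior)
```

Because the projected update has zero inner product with every prior direction, applying it cannot change the model's response along those directions, which is the intuition behind the "zero forgetting in isolation" claim.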

PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

2604.00931v2 by Yutao Yang, Junsong Li, Qianjun Pan, Jie Zhou, Kai Chen, Qin Chen, Jingyuan Zhao, Ningning Zhou, Xin Li, Liang He

Existing methods for AI psychological counselors predominantly rely on supervised fine-tuning using static dialogue datasets. However, this contrasts with human experts, who continuously refine their proficiency through clinical practice and accumulated experience. To bridge this gap, we propose an Experience-Driven Lifelong Learning Agent (PsychAgent) for psychological counseling. First, we establish a Memory-Augmented Planning Engine tailored for longitudinal multi-session interactions, which ensures therapeutic continuity through persistent memory and strategic planning. Second, to support self-evolution, we design a Skill Evolution Engine that extracts new practice-grounded skills from historical counseling trajectories. Finally, we introduce a Reinforced Internalization Engine that integrates the evolved skills into the model via rejection fine-tuning, aiming to improve performance across diverse scenarios. Comparative analysis shows that our approach achieves higher scores than strong general LLMs (e.g., GPT-5.4, Gemini-3) and domain-specific baselines across all reported evaluation dimensions. These results suggest that lifelong learning can improve the consistency and overall quality of multi-session counseling responses.

摘要：現有的AI心理諮詢師方法主要依賴於使用靜態對話數據集的監督微調。然而，這與人類專家形成對比，人類專家通過臨床實踐和積累的經驗不斷提升自己的專業能力。為了彌補這一差距，我們提出了一個以經驗驅動的終身學習代理（\texttt{PsychAgent}）用於心理諮詢。首先，我們建立了一個針對長期多次會話互動的記憶增強規劃引擎，這確保了通過持久記憶和戰略規劃實現治療的連續性。其次，為了支持自我演化，我們設計了一個技能演化引擎，從歷史諮詢軌跡中提取基於實踐的新技能。最後，我們引入了一個強化內化引擎，通過拒絕微調將演化的技能整合到模型中，旨在提高在各種情境下的表現。比較分析顯示，我們的方法在所有報告的評估維度上都達到了比強大的通用LLM（例如，GPT-5.4、Gemini-3）和特定領域基準更高的分數。這些結果表明，終身學習可以提高多次會話諮詢回應的一致性和整體質量。

OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images

2604.01264v1 by Okan Uçar, Murat Kurt

Medical imaging techniques, especially Magnetic Resonance Imaging (MRI), are accepted as the gold standard in the diagnosis and treatment planning of neurological diseases. However, the manual analysis of MRI images is a time-consuming process for radiologists and is prone to human error due to fatigue. In this study, two different Deep Learning approaches were developed and analyzed comparatively for the automatic detection and classification of brain tumors (Glioma, Meningioma, Pituitary, and No Tumor). In the first approach, a custom Convolutional Neural Network (CNN) architecture named "OkanNet", which has a low computational cost and fast training time, was designed from scratch. In the second approach, the Transfer Learning method was applied using the 50-layer ResNet-50 [1] architecture, pre-trained on the ImageNet dataset. In experiments conducted on an extended dataset compiled by Masoud Nickparvar containing a total of 7,023 MRI images, the Transfer Learning-based ResNet-50 model exhibited superior classification performance, achieving 96.49% Accuracy and 0.963 Precision. In contrast, the custom OkanNet architecture reached an accuracy rate of 88.10%; however, it proved to be a strong alternative for mobile and embedded systems with limited computational power by yielding results approximately 3.2 times faster (311 seconds) than ResNet-50 in terms of training time. This study demonstrates the trade-off between model depth and computational efficiency in medical image analysis through experimental data.

摘要：醫學影像技術，特別是磁共振成像（MRI），被認為是神經疾病診斷和治療計劃的金標準。然而，MRI影像的手動分析對放射科醫生來說是一個耗時的過程，並且由於疲勞容易出現人為錯誤。在這項研究中，開發並比較分析了兩種不同的深度學習方法，用於自動檢測和分類腦腫瘤（膠質瘤、腦膜瘤、腦垂體瘤和無腫瘤）。在第一種方法中，從零開始設計了一種名為“OkanNet”的自定義卷積神經網絡（CNN）架構，其計算成本低且訓練時間快。在第二種方法中，應用了轉移學習方法，使用了在ImageNet數據集上預訓練的50層ResNet-50 [1]架構。在由Masoud Nickparvar編輯的擴展數據集上進行的實驗中，該數據集包含總共$7,023$幅MRI影像，基於轉移學習的ResNet-50模型顯示出優越的分類性能，達到$96.49\%$的準確率和$0.963$的精確度。相比之下，自定義的OkanNet架構達到了$88.10\%$的準確率；然而，它在訓練時間上比ResNet-50快約$3.2$倍（$311$秒），證明它對計算能力有限的移動和嵌入式系統是一個強有力的替代方案。這項研究通過實驗數據展示了醫學影像分析中模型深度與計算效率之間的權衡。

BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction

2604.00739v1 by Sayed Hashim, Frank Soboczenski, Paul Cairns

Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models combined with self-supervised learning generalise better than threshold-based biomarkers, but their performance is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model's intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can improve the generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.

摘要：使用於免疫療法反應預測的數據集通常規模較小，且在癌症類型、所施用的藥物和使用的測序儀器上具有多樣性。當在未包含於訓練過程中的患者群體上進行測試時，模型的性能往往會下降。最近的研究顯示，基於Transformer的模型結合自我監督學習在泛化性能上優於基於閾值的生物標記，但仍然不夠理想。我們提出了BioCOMPASS，這是一個基於Transformer的模型COMPASS的擴展，整合了生物標記和治療信息以進一步提高其泛化能力。我們並不是將生物標記數據作為輸入，而是構建了損失組件以使其與模型的中間表示對齊。我們發現，治療閘控和通路一致性損失等組件在使用Leave-one-cohort-out、Leave-one-cancer-type-out和Leave-one-treatment-out策略進行評估時，提高了泛化能力。結果顯示，構建利用生物標記和治療信息的組件可以幫助提高免疫療法反應預測的泛化能力。仔細策劃利用互補臨床信息和領域知識的額外組件，代表了未來研究的一個有前景的方向。
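The BioCOMPASS evaluation relies on leave-one-cohort-out, leave-one-cancer-type-out, and leave-one-treatment-out strategies. All three are the same splitting scheme applied to a different grouping column. A minimal sketch follows; the function name and the toy cohort labels are assumptions for illustration.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) once per group.

    `groups` holds one label per sample, e.g. the cohort, cancer type,
    or treatment of each patient.
    """
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical cohort label per patient.
cohorts = ["A", "A", "B", "C", "B"]
splits = list(leave_one_group_out(cohorts))
```

Unlike a random split, every test fold here contains only patients from a group the model never saw during training, which is what makes the protocol a test of generalisation across cohorts.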

Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition

2604.00517v1 by Axiu Mao, Meilu Zhu, Lei Shen, Xiaoshuai Wang, Tomas Norton, Kai Liu

With the rapid advancements in deep learning techniques, wearable sensor-aided animal activity recognition (AAR) has demonstrated promising performance, thereby improving livestock management efficiency as well as animal health and welfare monitoring. However, existing research often prioritizes overall performance, overlooking the fact that classification accuracies for specific animal behavioral categories may remain unsatisfactory. This issue typically stems from suboptimal sampling rates or class imbalance problems. To address these challenges and achieve high classification accuracy across all individual behaviors in farm animals, we propose a novel Individual-Behavior-Aware Network (IBA-Net). This network enhances the recognition of each specific behavior by simultaneously customizing features and calibrating the classifier. Specifically, considering that different behaviors require varying sampling rates to achieve optimal performance, we design a Mixture-of-Experts (MoE)-based Feature Customization (MFC) module. This module adaptively fuses data from multiple sampling rates, capturing customized features tailored to various animal behaviors. Additionally, to mitigate classifier bias toward majority classes caused by class imbalance, we develop a Neural Collapse-driven Classifier Calibration (NC3) module. This module introduces a fixed equiangular tight frame (ETF) classifier during the classification stage, maximizing the angles between pair-wise classifier vectors and thereby improving the classification performance for minority classes. To validate the effectiveness of IBA-Net, we conducted experiments on three public datasets covering goat, cattle, and horse activity recognition. The results demonstrate that our method consistently outperforms existing approaches across all datasets.

摘要：隨著深度學習技術的快速進步，穿戴式傳感器輔助的動物活動識別（AAR）顯示出良好的性能，從而提高了畜牧管理效率以及動物健康和福利監測。然而，現有研究通常優先考慮整體性能，忽視了特定動物行為類別的分類準確率可能仍然不令人滿意。這個問題通常源於次優的取樣率或類別不平衡問題。為了解決這些挑戰並在農場動物的所有個體行為中實現高分類準確率，我們提出了一種新穎的個體行為感知網絡（IBA-Net）。該網絡通過同時定制特徵和校準分類器來增強對每種特定行為的識別。具體而言，考慮到不同的行為需要不同的取樣率以實現最佳性能，我們設計了一個基於專家混合（MoE）的特徵定制（MFC）模塊。該模塊自適應地融合來自多個取樣率的數據，捕捉針對各種動物行為量身定制的特徵。此外，為了減輕由於類別不平衡而導致的分類器對多數類別的偏見，我們開發了一個基於神經崩潰的分類器校準（NC3）模塊。該模塊在分類階段引入了一個固定的等角緊框（ETF）分類器，最大化成對分類器向量之間的角度，從而提高對少數類別的分類性能。為了驗證IBA-Net的有效性，我們在涵蓋山羊、牛和馬活動識別的三個公共數據集上進行了實驗。結果表明，我們的方法在所有數據集上始終優於現有方法。
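The NC3 module described above fixes the classifier to an equiangular tight frame (ETF) so that class vectors are maximally separated. The sketch below constructs the standard simplex ETF for K classes in K dimensions, where every pair of class vectors has cosine similarity -1/(K-1). The embedding into a higher-dimensional feature space (usually a fixed orthogonal map) is omitted, and the function names are assumptions.

```python
import math


def simplex_etf(num_classes):
    """Rows are K unit vectors m_k = sqrt(K/(K-1)) * (e_k - (1/K) * 1)."""
    k = num_classes
    if k < 2:
        raise ValueError("need at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def cosine(u, v):
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)


etf = simplex_etf(4)   # e.g. four behaviour classes
```

Because the geometry is fixed rather than learned, majority classes cannot pull the classifier vectors toward themselves, which is the mechanism the abstract credits for better minority-class accuracy.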

MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning

2604.00514v1 by Kyeonghun Kim, Hyeonseok Jung, Youngung Han, Junsu Lim, YeonJu Jean, Seongbin Park, Eunseob Choi, Hyunsu Go, SeoYoung Ju, Seohyoung Park, Gyeongmin Kim, MinJu Kwon, KyungSeok Yuh, Soo Yong Kim, Ken Ying-Kai Liao, Nam-Joon Kim, Hyuk-Jae Lee

Training deep learning models for three-dimensional (3D) medical imaging, such as Computed Tomography (CT), is fundamentally challenged by the scarcity of labeled data. While pre-training on natural images is common, it results in a significant domain shift, limiting performance. Self-Supervised Learning (SSL) on unlabeled medical data has emerged as a powerful solution, but prominent frameworks often fail to exploit the inherent 3D nature of CT scans. These methods typically process 3D scans as a collection of independent 2D slices, an approach that discards critical axial coherence and the 3D structural context. To address this limitation, we propose the Masked Autoencoder for Enhanced Self-supervised Medical Image Learning (MAESIL), a novel self-supervised learning framework designed to capture 3D structural information efficiently. The core innovation is the 'superpatch', a 3D chunk-based input unit that balances 3D context preservation with computational efficiency. Our framework partitions the volume into superpatches and employs a 3D masked autoencoder with a dual-masking strategy to learn comprehensive spatial representations. We validated our approach on three diverse large-scale public CT datasets. Our experimental results show that MAESIL achieves significant improvements over existing methods such as AE, VAE and VQ-VAE in key reconstruction metrics such as PSNR and SSIM. This establishes MAESIL as a robust and practical pre-training solution for 3D medical imaging tasks.

摘要：訓練三維 (3D) 醫學影像的深度學習模型，例如計算機斷層掃描 (CT)，基本上受到標記數據稀缺的挑戰。雖然在自然影像上進行預訓練是常見的做法，但這會導致顯著的領域轉移，限制了性能。對未標記醫學數據進行自我監督學習 (SSL) 已經成為一種強大的解決方案，但現有的主要框架往往未能充分利用 CT 掃描的固有 3D 特性。這些方法通常將 3D 掃描處理為獨立的 2D 切片集合，這種方法基本上忽略了關鍵的軸向一致性和 3D 結構上下文。為了解決這一限制，我們提出了增強自我監督醫學影像學習的自編碼器 (MAESIL)，這是一個旨在高效捕捉 3D 結構信息的新型自我監督學習框架。核心創新是「超補丁」，這是一種基於 3D 塊的輸入單元，平衡了 3D 上下文的保留與計算效率。我們的框架將體積劃分為超補丁，並採用 3D 掩碼自編碼器策略，結合雙重掩碼策略來學習全面的空間表示。我們在三個多樣化的大型公共 CT 數據集上驗證了我們的方法。我們的實驗結果顯示，MAESIL 在關鍵重建指標如 PSNR 和 SSIM 上顯示出相較於現有方法如 AE、VAE 和 VQ-VAE 的顯著改進。這使 MAESIL 成為 3D 醫學影像任務的穩健且實用的預訓練解決方案。
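The MAESIL abstract describes partitioning a CT volume into 3D "superpatches" and masking some of them for reconstruction. The sketch below shows one plausible reading: non-overlapping 3D chunks, followed by a single random mask. The volume size, chunk size, and mask ratio are assumed values, and the paper's dual-masking strategy is not specified in the abstract, so it is not reproduced here.

```python
import random


def superpatch_origins(volume_shape, patch_shape):
    """Corner coordinates (z, y, x) of non-overlapping 3D chunks."""
    d, h, w = volume_shape
    pd, ph, pw = patch_shape
    if d % pd or h % ph or w % pw:
        raise ValueError("volume must be divisible by the superpatch shape")
    return [(z, y, x)
            for z in range(0, d, pd)
            for y in range(0, h, ph)
            for x in range(0, w, pw)]


def random_mask(num_patches, mask_ratio, seed=0):
    """Indices of superpatches hidden from the encoder."""
    rng = random.Random(seed)  # seeded for reproducibility
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


# Assumed sizes: a 64x128x128 volume cut into 16x32x32 chunks.
origins = superpatch_origins((64, 128, 128), (16, 32, 32))
masked = random_mask(len(origins), mask_ratio=0.75)
```

Each chunk spans several slices, so the model sees contiguous axial context that a slice-by-slice 2D pipeline would discard.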

A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

2604.00493v1 by Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz

Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

摘要：胸部X光檢查（CXRs）是全球最常執行的影像檢查之一，但不斷增加的影像量增加了放射科醫師的工作負擔和診斷錯誤的風險。儘管人工智慧（AI）系統在CXR解讀方面顯示出潛力，但大多數僅生成最終預測，而未明確說明視覺證據如何轉化為放射學發現和診斷預測。我們提出了CheXOne，一種具備推理能力的視覺-語言模型，用於CXR解讀。CheXOne共同生成診斷預測和明確的、臨床基礎的推理痕跡，將視覺證據、放射學發現和這些預測連結起來。該模型在從30個公共數據集中策劃的1470萬個指令和推理樣本上進行訓練，涵蓋36個CXR解讀任務，使用一種結合指令調整和強化學習的兩階段框架，以提高推理質量。我們在零樣本設置下評估CheXOne，涵蓋視覺問題回答、報告生成、視覺定位和推理評估，共涉及17個評估設置。CheXOne在現有的醫學和通用領域基礎模型中表現優於其他模型，並在獨立公共基準上取得了良好的表現。一項臨床讀者研究顯示，CheXOne撰寫的報告在55%的案例中與住院醫師撰寫的報告相當或更好，同時有效地解決臨床指徵，並提高報告撰寫和CXR解讀的效率。進一步的分析涉及放射科醫師，顯示生成的推理痕跡具有高臨床事實性，並為最終預測提供因果支持，為性能提升提供了合理的解釋。這些結果表明，明確的推理可以改善模型性能、可解釋性和在AI輔助CXR解讀中的臨床實用性。

Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions

2604.00397v1 by Yuchen Yang, Shuangyang Zhong, Haijun Yu, Langcuomu Suo, Hongbin Han, Florian Putz, Yixing Huang

Background: Deep learning has demonstrated significant potential for automated brain metastases (BM) segmentation; however, models trained at a single institution often perform poorly at other sites due to differences in scanner hardware, imaging protocols, and patient demographics. The goal of this work is to create a domain adaptation framework that allows BM segmentation to be used across multiple institutions. Methods: We propose a VAE-MMD preprocessing pipeline that combines variational autoencoders (VAE) with a maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms alongside nnU-Net segmentation. The method was tested on 740 patients from four public databases (Stanford, UCSF, UCLM, and PKG) and evaluated by domain-classifier accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95). Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. Reconstructed volumes attained a PSNR greater than 36 dB, maintaining anatomical accuracy. The combined method raised the mean F1 by 11.1% (0.700 to 0.778), the mean sDice by 7.93% (0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (11.33 to 3.91 mm) across all four centers compared to the baseline nnU-Net. Conclusions: VAE-MMD effectively diminishes cross-institutional data heterogeneity and enhances BM segmentation generalization across volumetric, detection, and boundary-level metrics without requiring target-domain labels, thereby overcoming a significant obstacle to the clinical implementation of AI-assisted segmentation.

摘要：背景：深度學習在自動化腦轉移（BM）分割方面顯示出顯著潛力；然而，在單一機構訓練的模型在不同地點的表現往往不佳，這是由於掃描儀硬體、影像協議和患者人口統計的差異。本研究的目標是創建一個領域適應框架，使得BM分割能夠在多個機構之間使用。
方法：我們提出了一個VAE-MMD預處理管道，該管道將變分自編碼器（VAE）與最大均值差異（MMD）損失結合，並在nnU-Net分割的基礎上融入跳躍連接和自注意力機制。該方法在來自四個公共數據庫的740名患者上進行了測試：斯坦福大學、加州大學舊金山分校、馬德里大學和PKG，通過領域分類器的準確性、靈敏度、精確度、F1/F2分數、表面Dice（sDice）和第95百分位Hausdorff距離（HD95）進行評估。
結果：VAE-MMD將領域分類器的準確性從0.91降低到0.50，表明成功實現了跨機構的特徵對齊。重建的體積達到了大於36 dB的PSNR，保持了解剖準確性。該綜合方法使得平均F1提高了11.1%（從0.700到0.778），平均sDice提高了7.93%（從0.7121到0.7686），並且在四個中心之間相比基線nnU-Net將平均HD95降低了65.5%（從11.33到3.91 mm）。
結論：VAE-MMD有效減少了跨機構數據的異質性，並增強了BM分割在體積、檢測和邊界級別指標上的泛化能力，而無需目標領域標籤，從而克服了臨床實施AI輔助分割的一個重大障礙。
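
The abstract above relies on a maximum mean discrepancy (MMD) loss to align features across institutions. The following is a minimal sketch of a biased (V-statistic) squared-MMD estimate with a Gaussian kernel, written in plain Python. The kernel choice, the bandwidth, and the toy "site" data are assumptions made for illustration; the paper's actual loss and feature space are not specified in the abstract.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical toy features: two "sites" with different means, plus a second
# draw from the first site's distribution for comparison.
rng = random.Random(0)
site_a = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(100)]
site_b = [[rng.gauss(2.0, 1.0) for _ in range(2)] for _ in range(100)]
site_a_again = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(100)]

shifted = mmd_squared(site_a, site_b)        # distributions differ
matched = mmd_squared(site_a, site_a_again)  # same distribution, new draw
print(f"shifted sites: {shifted:.3f}, matched sites: {matched:.3f}")
```

A smaller value means the two feature samples are harder to tell apart, which is the same intuition behind the reported drop in domain-classifier accuracy toward 0.50.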

EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

2604.00392v1 by Alibek T. Kaliyev, Artem Maryanskyy

Modern LLM agents increasingly create their own tools at runtime -- from Python functions to API clients -- yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool-Bench, a diagnostic benchmark for LLM-generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library-level software quality metrics -- reuse, redundancy, composition success, regression stability, and safety -- alongside a per-tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head-to-head comparison of code-level and strategy-level tool evolution (ARISE vs. EvoSkill vs. one-shot baselines, 99 tasks, two models), we show that systems with similar task completion (63-68%) differ by up to 18% in library health, revealing software quality risks invisible to task-only evaluation. Our results highlight that evaluation and governance of LLM-generated tools require treating the evolving tool library as a first-class software artifact, not a black box.

摘要：現代的 LLM 代理越來越多地在運行時創建自己的工具——從 Python 函數到 API 客戶端——然而現有的基準幾乎僅通過下游任務的完成來評估它們。這類似於僅根據一名軟體工程師的程式碼是否能運行來評價他們，卻忽略了冗餘、回歸和安全性。我們引入了 EvolveTool-Bench，這是一個用於 LLM 生成的工具庫在軟體工程工作流程中的診斷基準。在三個需要實際工具執行的領域（專有數據格式、API 協調和數值計算）中，我們定義了庫級軟體質量指標——重用、冗餘、組合成功、回歸穩定性和安全性——以及每個工具的工具質量分數，該分數衡量正確性、穩健性、通用性和程式碼質量。在首次的代碼級和策略級工具演化的正面比較中（ARISE vs. EvoSkill vs. 一次性基準，99 個任務，兩個模型），我們顯示出具有相似任務完成率（63-68%）的系統在庫健康度上最多相差 18%，揭示了僅依賴任務評估而無法察覺的軟體質量風險。我們的結果強調，對 LLM 生成工具的評估和治理需要將不斷演變的工具庫視為一級軟體工件，而不是黑箱。

Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

2604.00319v1 by Syed Eqbal Alam, Zhan Shu

We develop algorithms for collaborative control of AI agents and critics in a multi-actor, multi-critic federated multi-agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The AI agents and critics collaborate with a central server to complete multimodal tasks such as fault detection, severity, and cause analysis in a network telemetry system, text-to-image generation, video generation, and healthcare diagnostics from medical images and patient records. The AI agents complete their tasks and send them to AI critics for evaluation. The critics then send feedback to agents to improve their responses. Collaboratively, they minimize the overall cost to the system with no inter-agent or inter-critic communication. AI agents and critics keep their cost functions or derivatives of cost functions private. Using multi-time scale stochastic approximation techniques, we provide convergence guarantees on the time-average active states of AI agents and critics. The communication overhead on the system is small, of order $\mathcal{O}(m)$ for $m$ modalities, and is independent of the number of AI agents and critics. Finally, we present an example of fault detection, severity, and cause analysis in network telemetry, together with a thorough evaluation of the algorithm's efficacy.

摘要：我們為多演員、多評論者的聯邦多智能體系統開發了協作控制的算法。每個AI智能體和評論者都可以訪問傳統機器學習或生成AI基礎模型。AI智能體和評論者與中央伺服器合作，以完成多模態任務，例如在網絡遙測系統中的故障檢測、嚴重性和原因分析、文本到圖像生成、視頻生成、從醫療影像和病歷進行的醫療診斷等。AI智能體完成其任務並將其發送給AI評論者進行評估。評論者然後向智能體發送反饋，以改善其回應。他們協同工作，最小化系統的整體成本，且不進行智能體或評論者之間的通信。AI智能體和評論者將其成本函數或成本函數的導數保持私密。使用多時間尺度隨機近似技術，我們提供了AI智能體和評論者的時間平均活動狀態的收斂保證。通信開銷對系統的影響較小，約為$\mathcal{O}(m)$，其中$m$為模態數，且與AI智能體和評論者的數量無關。最後，我們展示了一個在網絡遙測中進行故障檢測、嚴重性和原因分析的例子，以及徹底的評估以檢查算法的有效性。

SANA I2I: A Text Free Flow Matching Framework for Paired Image to Image Translation with a Case Study in Fetal MRI Artifact Reduction

2604.00298v1 by Italo Felix Santos, Gilson Antonio Giraldi, Heron Werner Junior

We propose SANA-I2I, a text-free high-resolution image-to-image generation framework that extends the SANA family by removing textual conditioning entirely. In contrast to SanaControlNet, which combines text and image-based control, SANA-I2I relies exclusively on paired source-target images to learn a conditional flow-matching model in latent space. The model learns a conditional velocity field that maps one image distribution to another, enabling supervised image translation without reliance on language prompts. We evaluate the proposed approach on the challenging task of fetal MRI motion artifact reduction. To enable paired training in this application, where real paired data are difficult to acquire, we adopt a synthetic data generation strategy based on the method proposed by Duffy et al., which simulates realistic motion artifacts in fetal magnetic resonance imaging (MRI). Experimental results demonstrate that SANA-I2I effectively suppresses motion artifacts while preserving anatomical structure, achieving competitive performance with few inference steps. These results highlight the efficiency and suitability of our proposed flow-based, text-free generative models for supervised image-to-image tasks in medical imaging.

摘要：我們提出了 SANA-I2I，一個無文本的高解析度圖像到圖像生成框架，通過完全移除文本條件來擴展 SANA 家族。與結合文本和基於圖像控制的 SanaControlNet 相比，SANA-I2I 僅依賴配對的源目標圖像來學習潛在空間中的條件流匹配模型。該模型學習一個條件速度場，將目標圖像分佈映射到另一個圖像，實現無需依賴語言提示的監督圖像轉換。我們在挑戰性的胎兒 MRI 運動伪影減少任務上評估了所提出的方法。為了在這一應用中實現配對訓練，因為真實配對數據難以獲得，我們採用了基於 Duffy 等人提出的方法的合成數據生成策略，該方法模擬胎兒磁共振成像 (MRI) 中的真實運動伪影。實驗結果表明，SANA-I2I 有效抑制運動伪影，同時保留解剖結構，實現了在少量推斷步驟下的競爭性能。這些結果突顯了我們提出的基於流的無文本生成模型在醫學影像中進行監督圖像到圖像任務的效率和適用性。
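
To make the idea of a velocity field between paired images concrete, here is a minimal sketch of generic linear-path flow matching on toy vectors. It is an illustration under stated assumptions, not SANA-I2I's formulation: the linear interpolation path, the Euler sampler, and all function names are hypothetical, and an oracle velocity stands in for the trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field on the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical paired sample: "source" (corrupted) and "target" (clean) vectors.
source = [0.0, 1.0]
target = [2.0, -1.0]

# Oracle velocity standing in for a trained network (assumption for the demo).
oracle = lambda x, t: target_velocity(source, target)

result = euler_sample(source, oracle, steps=4)
print(result)
```

With a perfectly straight path, the velocity is constant, so a handful of Euler steps already reaches the target; this is one intuition for why flow-based samplers can work with few inference steps.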

A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation

2604.00249v1 by Ha Na Cho

Single-agent large language model (LLM) systems struggle to simultaneously support diverse conversational functions and maintain safety in behavioral health communication. We propose a safety-aware, role-orchestrated multi-agent LLM framework designed to simulate supportive behavioral health dialogue through coordinated, role-differentiated agents. Conversational responsibilities are decomposed across specialized agents, including empathy-focused, action-oriented, and supervisory roles, while a prompt-based controller dynamically activates relevant agents and enforces continuous safety auditing. Using semi-structured interview transcripts from the DAIC-WOZ corpus, we evaluate the framework with scalable proxy metrics capturing structural quality, functional diversity, and computational characteristics. Results illustrate clear role differentiation, coherent inter-agent coordination, and predictable trade-offs between modular orchestration, safety oversight, and response latency when compared to a single-agent baseline. This work emphasizes system design, interpretability, and safety, positioning the framework as a simulation and analysis tool for behavioral health informatics and decision-support research rather than a clinical intervention.

摘要：單一代理的大型語言模型（LLM）系統在同時支持多樣化的對話功能和維持行為健康溝通的安全性方面面臨挑戰。我們提出了一種安全意識、角色協調的多代理LLM框架，旨在通過協調的、角色區分的代理模擬支持性的行為健康對話。對話責任被分解到專門的代理中，包括以同理心為重點、以行動為導向和監督角色，而基於提示的控制器則動態激活相關代理並執行持續的安全審核。使用來自DAIC-WOZ語料庫的半結構化訪談轉錄，我們利用可擴展的代理指標評估該框架，捕捉結構質量、功能多樣性和計算特徵。結果顯示出明確的角色區分、一致的代理間協調，以及與單一代理基準相比，模組化協調、安全監督和響應延遲之間的可預測權衡。這項工作強調系統設計、可解釋性和安全性，將該框架定位為行為健康信息學和決策支持研究的模擬和分析工具，而非臨床干預。

One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

2604.00085v1 by Yuxing Lu, Yushuhong Lin, Jason Zhang

Large language models applied to clinical prediction exhibit case-level heterogeneity: simple cases yield consistent outputs, while complex cases produce divergent predictions under minor prompt changes. Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement. We propose CAMP (Case-Adaptive Multi-agent Panel), where an attending-physician agent dynamically assembles a specialist panel tailored to each case's diagnostic uncertainty. Each specialist evaluates candidates via three-valued voting (KEEP/REFUSE/NEUTRAL), enabling principled abstention outside one's expertise. A hybrid router directs each diagnosis through strong consensus, fallback to the attending physician's judgment, or evidence-based arbitration that weighs argument quality over vote counts. On diagnostic prediction and brief hospital course generation from MIMIC-IV across four LLM backbones, CAMP consistently outperforms strong baselines while consuming fewer tokens than most competing multi-agent methods, with voting records and arbitration traces offering transparent decision audits.

摘要：大型語言模型應用於臨床預測顯示出案例層級的異質性：簡單案例產生一致的輸出，而複雜案例在輕微提示變更下產生不同的預測。現有的單一代理策略從一個角色條件分布中進行抽樣，而多代理框架使用固定角色和平坦的多數投票，忽略了不一致中的診斷信號。我們提出了 CAMP（案例自適應多代理小組），其中一名主治醫生代理根據每個案例的診斷不確定性動態組建專家小組。每位專家通過三值投票（KEEP/REFUSE/NEUTRAL）評估候選人，使得在自己專業範疇之外能夠有原則地選擇不投票。一個混合路由器將每個診斷引導通過強共識、回退到主治醫生的判斷，或根據證據的仲裁，權衡論點質量而非投票數量。在 MIMIC-IV 上進行的診斷預測和簡短住院過程生成中，CAMP 在四個 LLM 主幹上始終超越強基準，同時消耗的標記數量少於大多數競爭的多代理方法，投票記錄和仲裁痕跡提供了透明的決策審計。
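
The three-valued voting and hybrid routing described above can be sketched as follows. This is an illustrative reading of the abstract, not CAMP's implementation: the consensus threshold, the minimum number of non-abstaining voters, and the route names are all hypothetical.

```python
from collections import Counter
from enum import Enum


class Vote(Enum):
    KEEP = "KEEP"
    REFUSE = "REFUSE"
    NEUTRAL = "NEUTRAL"  # abstention outside the specialist's expertise


def route(votes, consensus=0.8, min_voters=2):
    """Pick a decision path for one candidate diagnosis.

    `consensus` and `min_voters` are hypothetical thresholds chosen for
    illustration; the abstract does not state the actual routing rules.
    """
    counts = Counter(votes)
    decided = counts[Vote.KEEP] + counts[Vote.REFUSE]
    if decided < min_voters:
        # Too many abstentions: defer to the attending physician agent.
        return "attending_fallback"
    keep_share = counts[Vote.KEEP] / decided
    if keep_share >= consensus:
        return "consensus_keep"
    if keep_share <= 1.0 - consensus:
        return "consensus_refuse"
    # Specialists disagree: escalate to evidence-based arbitration.
    return "arbitration"


print(route([Vote.KEEP, Vote.KEEP, Vote.KEEP, Vote.NEUTRAL]))
print(route([Vote.KEEP, Vote.REFUSE, Vote.KEEP, Vote.REFUSE]))
print(route([Vote.NEUTRAL, Vote.NEUTRAL, Vote.KEEP]))
```

Counting only KEEP and REFUSE in the denominator keeps abstentions from diluting a real consensus, which is the practical point of allowing a NEUTRAL vote.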

Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

2603.29950v1 by Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie

Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.

摘要：有效的合作需要團隊通過社會共享學習調節（SSRL）來管理複雜的認知和情感狀態。生理同步（即生理信號的縱向對齊）可以指示這些狀態，但單獨解釋起來較為困難。我們研究了四個醫療二人組在使用智能輔導系統診斷虛擬病人案例時的生理和對話動態。對話中的語義變化與瞬時生理同步峰值相關聯。我們還對發言片段進行了SSRL編碼，並使用句子嵌入推導了餘弦相似度。結果顯示，激活先前知識的語義相似度顯著低於較簡單的任務執行。高生理同步與較低的語義相似度相關，這表明這樣的時刻涉及探索性和多樣化的語言使用。質性分析將這些同步峰值三角測量為“關鍵時刻”：成功的團隊在共享發現時同步，而不成功的團隊在共享不確定性時達到峰值。本研究通過展示如何將生物信號與對話融合，以理解問題解決中的關鍵時刻，推進了以人為中心的人工智慧。
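
As a small illustration of the cosine-similarity measure mentioned above, the sketch below scores consecutive utterance embeddings. The three-dimensional vectors are hypothetical stand-ins; real sentence embeddings would come from an embedding model, which the abstract does not name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def consecutive_similarity(embeddings):
    """Similarity of each utterance embedding to the one that follows it."""
    return [cosine_similarity(a, b) for a, b in zip(embeddings, embeddings[1:])]


# Hypothetical embeddings: the third utterance shifts topic sharply.
utterances = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores = consecutive_similarity(utterances)
print(scores)
```

A drop in consecutive similarity marks a semantic shift in the dialogue, the kind of moment the study relates to physiological synchrony peaks.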

Four Generations of Quantum Biomedical Sensors

2603.29944v2 by Xin Jin, Priyam Srivastava, Ronghe Wang, Yuqing Li, Jonathan Beaumariage, Tom Purdy, M. V. Gurudev Dutt, Kang Kim, Kaushik Seshadreesan, Junyu Liu

Quantum sensing technologies offer transformative potential for ultra-sensitive biomedical sensing, yet their clinical translation remains constrained by classical noise limits and a reliance on macroscopic ensembles. We propose a unifying generational framework to organize the evolving landscape of quantum biosensors based on their utilization of quantum resources. First-generation devices utilize discrete energy levels for signal transduction but follow classical scaling laws. Second-generation sensors exploit quantum coherence to reach the standard quantum limit, while third-generation architectures leverage entanglement and spin squeezing to approach Heisenberg-limited precision. We further define an emerging fourth generation characterized by the end-to-end integration of quantum sensing with quantum learning and variational circuits, enabling adaptive inference directly within the quantum domain. By analyzing critical parameters such as bandwidth matching and sensor-tissue proximity, we identify key technological bottlenecks and propose a roadmap for transitioning from measuring physical observables to extracting structured biological information with quantum-enhanced intelligence.

摘要：量子感測技術為超敏感生物醫學感測提供了變革性的潛力，然而其臨床轉化仍受到經典噪聲限制和依賴宏觀集合的制約。我們提出了一個統一的生成框架，以組織基於量子資源利用的量子生物感測器的演變格局。第一代設備利用離散能級進行信號轉導，但遵循經典縮放法則。第二代感測器利用量子相干性達到標準量子極限，而第三代架構則利用糾纏和自旋擠壓接近海森堡限制的精度。我們進一步定義了一個新興的第四代，其特徵是量子感測與量子學習和變分電路的端到端整合，實現了在量子領域內直接進行自適應推斷。通過分析帶寬匹配和感測器與組織的接近等關鍵參數，我們識別出主要技術瓶頸，並提出了一個從測量物理可觀察量到提取結構化生物信息的量子增強智能的過渡路線圖。
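
For reference, the two precision limits named in the abstract are usually stated as scaling laws in the number of probes. The symbols below ($N$ for the number of independent probes or particles, $\delta\theta$ for the uncertainty of the estimated parameter) are notation introduced here, not taken from the paper.

```latex
% Standard quantum limit (uncorrelated probes) vs. Heisenberg limit (entangled probes)
\delta\theta_{\mathrm{SQL}} \propto \frac{1}{\sqrt{N}},
\qquad
\delta\theta_{\mathrm{HL}} \propto \frac{1}{N}
```

For example, with $N = 10^{4}$ probes the standard quantum limit scales as $10^{-2}$, while the Heisenberg limit scales as $10^{-4}$, which is the gap that entanglement and spin squeezing aim to close.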

ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

2603.29928v1 by Jonas Landsgesell, Pascal Knoll

Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, R2). These aggregate measures often obscure model performance in the tails of the distribution, a critical deficit for high-stakes decision-making in domains like finance and clinical research, where asymmetric risk profiles are the norm. We introduce ScoringBench, an open benchmark that computes a comprehensive suite of proper scoring rules, such as CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score, alongside standard point metrics, providing a richer picture of probabilistic forecast quality. We evaluate realTabPFNv2.5 fine-tuned with different scoring-rule objectives, and TabICL, relative to untuned realTabPFNv2.5 across a suite of regression benchmarks. Our results confirm that model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal. This demonstrates that, for applications sensitive to extreme events, the choice of evaluation metric is as much a domain-specific requirement as the data itself. ScoringBench is available at https://github.com/jonaslandsgesell/ScoringBench. A live preview of the current leaderboard is available at https://scoringbench.bolt.host. The leaderboard is maintained via git pull requests to ensure transparency, traceability, agility, and reproducibility.

摘要：表格基礎模型如 TabPFN 和 TabICL 已經能產生完整的預測分佈，但目前的回歸基準幾乎僅通過點估計指標 RMSE 和 R2 來評估它們。這些綜合指標往往掩蓋了模型在分佈尾部的表現，這對於金融和臨床研究等高風險決策領域來說是一個關鍵的缺陷，因為不對稱風險輪廓是常態。我們介紹了 ScoringBench，一個開放的基準，計算一套全面的適當評分規則，如 CRPS、CRLS、區間分數、能量分數、加權 CRPS 和 Brier 分數，並與標準點指標一起提供更豐富的概率預測質量圖景。我們評估了經過不同評分規則目標微調的 realTabPFNv2.5 和未經調整的 realTabPFNv2.5 相對於一系列回歸基準的 TabICL。我們的結果確認模型排名取決於所選的評分規則，且沒有單一的預訓練目標是普遍最佳的。這表明，對於對極端事件敏感的應用，評估指標的選擇與數據本身一樣，是一個特定於領域的要求。ScoringBench 可在 https://github.com/jonaslandsgesell/ScoringBench 獲得。當前排行榜的實時預覽可在 https://scoringbench.bolt.host 獲得。排行榜通過 git pull 請求進行維護，以確保透明度、可追溯性、靈活性和可重現性。
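
To show why a proper scoring rule can separate forecasts that a point metric treats as equal, here is a minimal sample-based CRPS sketch using the identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|. It is a generic estimator written for illustration, not ScoringBench's code, and the two toy forecasts are hypothetical.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one predictive sample is required")
    accuracy = sum(abs(x - y) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


# Two hypothetical forecasts with the same mean (1.0), hence the same
# squared error of the point estimate, but very different sharpness.
observed = 1.0
sharp = [0.9, 1.0, 1.1]
wide = [-1.0, 1.0, 3.0]

sharp_score = crps_from_samples(sharp, observed)
wide_score = crps_from_samples(wide, observed)
print(f"sharp: {sharp_score:.4f}, wide: {wide_score:.4f}")
```

Both forecasts have identical point error at their mean, yet CRPS rewards the sharper one; with a single sample the estimator reduces to absolute error.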

Training deep learning based dynamic MR image reconstruction using synthetic fractals

2603.29922v1 by Anirudh Raman, Olivier Jaubert, Mark Wrobel, Tina Yao, Ruaraidh Campbell, Rebecca Baker, Ruta Virsinskaite, Daniel Knight, Michael Quail, Jennifer Steeden, Vivek Muthurangu

Purpose: To investigate whether synthetically generated fractal data can be used to train deep learning (DL) models for dynamic MRI reconstruction, thereby avoiding the privacy, licensing, and availability limitations associated with cardiac MR training datasets. Methods: A training dataset was generated using quaternion Julia fractals to produce 2D+time images. Multi-coil MRI acquisition was simulated to generate paired fully sampled and radially undersampled k-space data. A 3D UNet deep artefact suppression model was trained using these fractal data (F-DL) and compared with an identical model trained on cardiac MRI data (CMR-DL). Both models were evaluated on prospectively acquired radial real-time cardiac MRI from 10 patients. Reconstructions were compared against compressed sensing (CS) and low-rank deep image prior (LR-DIP). All reconstructions were ranked for image quality, while ventricular volumes and ejection fraction were compared with reference breath-hold cine MRI. Results: There was no significant difference in qualitative ranking between F-DL and CMR-DL (p=0.9), while both outperformed CS and LR-DIP (p<0.001). Ventricular volumes and function derived from F-DL were similar to CMR-DL, showing no significant bias and acceptable limits of agreement compared to reference cine imaging. However, LR-DIP had a significant bias (p=0.016) and wider limits of agreement. Conclusion: DL models trained using synthetic fractal data can reconstruct real-time cardiac MRI with image quality and clinical measurements comparable to models trained on true cardiac MRI data. Fractal training data provide an open, scalable alternative to clinical datasets and may enable development of more generalisable DL reconstruction models for dynamic MRI.

摘要：目的：調查合成生成的分形數據是否可以用於訓練深度學習（DL）模型以進行動態MRI重建，從而避免與心臟MRI訓練數據集相關的隱私、授權和可用性限制。方法：使用四元數Julia分形生成訓練數據集，以產生2D+時間影像。模擬多線圈MRI獲取，以生成配對的完全採樣和徑向欠採樣的k空間數據。使用這些分形數據（F-DL）訓練了一個3D UNet深度伪影抑制模型，並與在心臟MRI數據（CMR-DL）上訓練的相同模型進行比較。這兩個模型在10名患者的前瞻性獲取的徑向實時心臟MRI上進行評估。重建結果與壓縮感知（CS）和低秩深度影像先驗（LR-DIP）進行比較。所有重建結果根據影像質量進行排名，而心室體積和射血分數則與參考的屏息動態MRI進行比較。結果：F-DL和CMR-DL之間的質量排名沒有顯著差異（p=0.9），而且兩者均優於CS和LR-DIP（p<0.001）。從F-DL衍生的心室體積和功能與CMR-DL相似，顯示出與參考動態影像相比沒有顯著偏差和可接受的協議範圍。然而，LR-DIP有顯著偏差（p=0.016）和更寬的協議範圍。結論：使用合成分形數據訓練的DL模型可以重建實時心臟MRI，其影像質量和臨床測量可與基於真實心臟MRI數據訓練的模型相媲美。分形訓練數據提供了一種開放、可擴展的替代臨床數據集的選擇，並可能促進更具通用性的動態MRI重建DL模型的發展。

Brain MR Image Synthesis with Multi-contrast Self-attention GAN

2604.00070v1 by Zaid A. Abod, Furqan Aziz

Accurate and complete multi-modal Magnetic Resonance Imaging (MRI) is essential for neuro-oncological assessment, as each contrast provides complementary anatomical and pathological information. However, acquiring all modalities (e.g., T1c, T1n, T2, T2f) for every patient is often impractical due to time, cost, and patient discomfort, potentially limiting comprehensive tumour evaluation. We propose 3D-MC-SAGAN (3D Multi-Contrast Self-Attention generative adversarial network), a unified 3D multi-contrast synthesis framework that generates high-fidelity missing modalities from a single T2 input while explicitly preserving tumour characteristics. The model employs a multi-scale 3D encoder-decoder generator with residual connections and a novel Memory-Bounded Hybrid Attention (MBHA) block to capture long-range dependencies efficiently, and is trained with a WGAN-GP critic and an auxiliary contrast-conditioning branch to produce T2f, T1n, and T1c volumes within a single unified network. A frozen 3D U-Net-based segmentation module introduces a segmentation-consistency constraint to preserve lesion morphology. The composite objective integrates adversarial, reconstruction, perceptual, structural similarity, contrast-classification, and segmentation-guided losses to align global realism with tumour-preserving structure. Extensive evaluation on 3D brain MRI datasets demonstrates that 3D-MC-SAGAN achieves state-of-the-art quantitative performance and generates visually coherent, anatomically plausible contrasts with improved distribution-level realism. Moreover, it maintains tumour segmentation accuracy comparable to fully acquired multi-modal inputs, highlighting its potential to reduce acquisition burden while preserving clinically meaningful information.

摘要：準確且完整的多模態磁共振成像（MRI）對於神經腫瘤評估至關重要，因為每種對比提供互補的解剖和病理信息。然而，由於時間、成本和患者不適，為每位患者獲取所有模態（例如，T1c、T1n、T2、T2f）通常是不切實際的，這可能限制了全面的腫瘤評估。我們提出了3D-MC-SAGAN（3D多對比自注意生成對抗網絡），這是一個統一的3D多對比合成框架，能從單一的T2輸入生成高保真度的缺失模態，同時明確保留腫瘤特徵。該模型採用多尺度3D編碼器-解碼器生成器，具有殘差連接和新穎的記憶約束混合注意（MBHA）模塊，以高效捕捉長距離依賴性，並使用WGAN-GP評價器和輔助對比條件分支進行訓練，以在單一統一網絡內生成T2f、T1n和T1c體積。一個凍結的3D U-Net基於分割模塊引入了分割一致性約束，以保留病變形態。綜合目標整合了對抗性、重建、感知、結構相似性、對比分類和分割引導損失，以將全球現實與腫瘤保護結構對齊。在3D腦MRI數據集上的廣泛評估表明，3D-MC-SAGAN實現了最先進的定量性能，並生成視覺上連貫、解剖上合理的對比，具有改善的分佈級現實性。此外，它保持的腫瘤分割準確性可與完全獲取的多模態輸入相媲美，突顯了其在減少獲取負擔的同時保留臨床有意義信息的潛力。

Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding

2603.29709v1 by Joakim Edin, Andreas Motzfeldt, Simon Flachs, Lars Maaløe

Medical coding translates free-text clinical documentation into standardized codes drawn from classification systems that contain tens of thousands of entries and are updated annually. It is central to billing, clinical research, and quality reporting, yet remains largely manual, slow, and error-prone. Existing automated approaches learn to predict a fixed set of codes from labeled data, thereby preventing adaptation to new codes or different coding systems without retraining on different data. They also provide no explanation for their predictions, limiting trust in safety-critical settings. We introduce Symphony for Medical Coding, a system that approaches the task the way expert human coders do: by reasoning over the clinical narrative with direct access to the coding guidelines. This design allows Symphony to operate across any coding system and to provide span-level evidence linking each predicted code to the text that supports it. We evaluate on two public benchmarks and three real-world datasets spanning inpatient, outpatient, emergency, and subspecialty settings across the United States and the United Kingdom. Symphony achieves state-of-the-art results across all settings, establishing itself as a flexible, deployment-ready foundation for automated clinical coding.

摘要：醫療編碼將自由文本的臨床文檔轉換為標準化代碼，這些代碼來自包含數萬條條目的分類系統，並每年更新一次。它對於計費、臨床研究和質量報告至關重要，但仍然主要依賴手動操作，速度慢且容易出錯。現有的自動化方法從標記數據中學習預測固定的代碼集，因此無法在不重新訓練不同數據的情況下適應新代碼或不同的編碼系統。它們也不提供對其預測的解釋，限制了在安全關鍵環境中的信任。我們介紹了醫療編碼的交響樂系統，這一系統以專業人類編碼員的方式處理任務：通過直接訪問編碼指南來推理臨床敘事。這一設計使得交響樂能夠在任何編碼系統中運作，並提供將每個預測代碼與支持其的文本聯繫起來的跨度級證據。我們在兩個公共基準和三個涵蓋美國和英國住院、門診、急診和專科環境的真實世界數據集上進行評估。交響樂在所有環境中都達到了最先進的結果，確立了其作為自動化臨床編碼的靈活、可部署的基礎。

Few-shot Writer Adaptation via Multimodal In-Context Learning

2603.29450v1 by Tom Simon, Stephane Nicolas, Pierrick Tranouez, Clement Chatelain, Thierry Paquet

While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.

摘要：雖然最先進的手寫文本識別（HTR）模型在標準基準測試中表現良好，但它們經常在處理具有高度特定風格的寫作者時遇到困難，這些風格在訓練數據中代表性不足。為了處理未見過的和不典型的寫作者，寫作者適應技術使 HTR 模型個性化以符合個別的手寫風格。領先的寫作者適應方法需要離線微調或在推理時更新參數，這兩者都涉及梯度計算和反向傳播，從而增加計算成本並要求仔細的超參數調整。在這項工作中，我們提出了一種新穎的基於上下文的 HTR 框架，受到多模態上下文學習的啟發，使得在推理時僅使用目標寫作者的幾個示例即可進行寫作者適應，而無需任何參數更新。我們進一步展示了上下文長度的影響，設計了一個緊湊的 8M 參數 CNN-Transformer，使得少量示例的上下文適應成為可能，並顯示結合基於上下文和標準 OCR 訓練策略會帶來互補的改進。在 IAM 和 RIMES 上的實驗驗證了我們的方法，字符錯誤率分別為 3.92% 和 2.34%，超越了所有不依賴寫作者的 HTR 模型，且在推理時無需任何參數更新。
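
The Character Error Rate reported above is commonly computed as the Levenshtein edit distance between the recognized text and the reference, divided by the reference length. The sketch below follows that common definition; the abstract does not state the exact normalization used, so treat this as an assumption.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1      # drop a reference character
            insertion = current[j - 1] + 1  # extra hypothesis character
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical transcription with one inserted character.
cer = character_error_rate("handwriting", "handwritting")
print(f"CER = {cer:.2%}")
```

Because the denominator is the reference length, CER can exceed 100% when the hypothesis contains many spurious characters.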

NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification

2603.29449v1 by Youngung Han, Minkyung Cha, Kyeonghun Kim, Induk Um, Myeongbin Sho, Joo Young Bae, Jaewon Jung, Jung Hyeok Park, Seojun Lee, Nam-Joon Kim, Woo Kyoung Jeong, Won Jae Lee, Pa Hong, Ken Ying-Kai Liao, Hyuk-Jae Lee

Minimizing invasive diagnostic procedures to reduce the risk of patient injury and infection is a central goal in medical imaging. Yet noninvasive diagnosis of perineural invasion (PNI), a critical prognostic factor involving infiltration of tumor cells along the surrounding nerve, remains challenging due to the lack of clear and consistent imaging criteria for identifying PNI. To address this challenge, we present NeoNet, an integrated end-to-end 3D deep learning framework for PNI prediction in cholangiocarcinoma that does not rely on predefined image features. NeoNet integrates three modules: (1) NeoSeg, utilizing a Tumor-Localized ROI Crop (TLCR) algorithm; (2) NeoGen, a 3D Latent Diffusion Model (LDM) with ControlNet, conditioned on anatomical masks to generate synthetic image patches, specifically balancing the dataset to a 1:1 ratio; and (3) NeoCls, the final prediction module. For NeoCls, we developed the PNI-Attention Network (PattenNet), which uses the frozen LDM encoder and specialized 3D Dual Attention Blocks (DAB) designed to detect subtle intensity variations and spatial patterns indicative of PNI. In 5-fold cross-validation, NeoNet outperformed baseline 3D models and achieved the highest performance with a maximum AUC of 0.7903.

摘要：最小化侵入性診斷程序以降低病人受傷和感染的風險是醫學影像中的一個核心目標。然而，周圍神經侵犯（PNI）的非侵入性診斷仍然具有挑戰性，這是一個關鍵的預後因素，涉及腫瘤細胞沿著周圍神經的浸潤，因為缺乏明確且一致的影像標準來識別PNI。為了解決這一挑戰，我們提出了NeoNet，一個集成的端到端3D深度學習框架，用於膽管癌中的PNI預測，該框架不依賴於預定義的影像特徵。NeoNet整合了三個模塊： (1) NeoSeg，利用腫瘤定位ROI裁剪（TLCR）算法； (2) NeoGen，一個帶有ControlNet的3D潛在擴散模型（LDM），基於解剖掩膜生成合成影像區塊，特別是將數據集平衡到1:1的比例；以及 (3) NeoCls，最終的預測模塊。對於NeoCls，我們開發了PNI-注意力網絡（PattenNet），該網絡使用凍結的LDM編碼器和專門設計的3D雙重注意力區塊（DAB），旨在檢測微妙的強度變化和空間模式，以指示PNI。在5折交叉驗證中，NeoNet的表現超過了基線3D模型，並以最高0.7903的AUC達到了最佳性能。

AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding

2603.29366v1 by Moiz Sadiq Awan, Maryam Raza

Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.

摘要：事前授權仍然是美國醫療保健中最繁重的行政流程之一，每年耗費數十億美元和數千小時的醫生時間。雖然大型語言模型在臨床文本任務中顯示出潛力，但它們產生提交準備好的事前授權信的能力僅受到有限的關注，現有的工作僅限於單一案例的示範，而非結構化的多場景評估。我們評估了三種商業可用的LLM（GPT-4o、Claude Sonnet 4.5和Gemini 2.5 Pro），涵蓋了45個醫生驗證的合成場景，涉及風濕病學、精神病學、腫瘤學、心臟病學和骨科。這三個模型生成的信件具有強大的臨床內容：準確的診斷、結構良好的醫療必要性論證以及全面的步驟治療文檔。然而，對現實世界行政要求的次要分析揭示了一致的差距，這些差距僅僅依靠臨床評分無法捕捉，包括缺失的計費代碼、缺少的授權期限請求和不充分的後續計劃。這些發現重新框定了問題：臨床部署的挑戰不在於LLM是否能撰寫臨床上合格的信件，而在於圍繞它們構建的系統是否能提供支付者工作流程所需的行政精確性。

Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model

2603.29176v1 by Siyuan Du, Siyi Li, Shuwei Bai, Ang Li, Haolin Li, Mingqing Xiao, Yang Pan, Dongsheng Li, Weidi Xie, Yanfeng Wang, Ya Zhang, Chencheng Zhang, Jiangchao Yao

Parkinson's disease (PD) affects over ten million people worldwide. Although temporal interference (TI) and deep brain stimulation (DBS) are promising therapies, inter-individual variability limits empirical treatment selection, increasing non-negligible surgical risk and cost. Previous explorations either resort to limited statistical biomarkers that are insufficient to characterize variability, or employ AI-driven methods that are prone to overfitting and opacity. We bridge this gap with a pretraining-finetuning framework to predict outcomes directly from resting-state fMRI. Critically, a generative virtual brain foundation model, pretrained on a collective dataset (2707 subjects, 5621 sessions) to capture universal disorder patterns, was finetuned on PD cohorts receiving TI (n=51) or DBS (n=55) to yield individualized virtual brains with high fidelity to empirical functional connectivity (r=0.935). By constructing counterfactual estimations between pathological and healthy neural states within these personalized models, we predicted clinical responses (TI: AUPR=0.853; DBS: AUPR=0.915), substantially outperforming baselines. External and prospective validations (n=14, n=11) highlight the feasibility of clinical translation. Moreover, our framework provides state-dependent regional patterns linked to response, offering hypothesis-generating mechanistic insights.

摘要：帕金森病（PD）影響全球超過一千萬人。儘管時間干擾（TI）和深腦刺激（DBS）是有前景的療法，但個體間的變異性限制了實證治療的選擇，增加了不可忽視的手術風險和成本。先前的探索要麼依賴於有限的統計生物標記，這些標記不足以表徵變異性，要麼使用易於過擬合和不透明的 AI 驅動方法。我們通過一個預訓練-微調框架來填補這一空白，直接從靜息態 fMRI 預測結果。關鍵是，一個生成的虛擬大腦基礎模型，在一個集體數據集（2707 名受試者，5621 次會議）上進行預訓練，以捕捉普遍的疾病模式，並在接受 TI（n=51）或 DBS（n=55）的 PD 群體上進行微調，以產生與實證功能連接高度一致的個性化虛擬大腦（r=0.935）。通過在這些個性化模型中構建病理和健康神經狀態之間的反事實估計，我們預測了臨床反應（TI: AUPR=0.853; DBS: AUPR=0.915），顯著超越基準。外部和前瞻性驗證（n=14, n=11）突顯了臨床轉化的可行性。此外，我們的框架提供了與反應相關的狀態依賴區域模式，提供了生成假設的機制見解。
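
The fidelity figure quoted above (r=0.935) compares simulated and empirical functional connectivity (FC). The abstract does not say how the correlation is computed. A common convention, assumed here, is the Pearson correlation over the strictly upper-triangular entries of the two FC matrices. The function names and the toy 4-region matrices below are illustrative and not taken from the paper.

```python
import math


def upper_triangle(matrix):
    """Return the strictly upper-triangular entries of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fc_fidelity(empirical_fc, simulated_fc):
    """Correlate two FC matrices over their off-diagonal upper triangles (assumed convention)."""
    return pearson_r(upper_triangle(empirical_fc), upper_triangle(simulated_fc))


# Made-up symmetric 4-region FC matrices, for illustration only.
empirical = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.2, 0.7, 1.0],
]
simulated = [
    [1.0, 0.7, 0.3, 0.1],
    [0.7, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.6],
    [0.1, 0.3, 0.6, 1.0],
]

fidelity = fc_fidelity(empirical, simulated)
```

The diagonal is excluded because self-connectivity is always 1 and would inflate the score.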

Knowledge database development by large language models for countermeasures against viruses and marine toxins

2603.29149v1 by Hung N. Do, Jessica Z. Kubicek-Sutherland, S. Gnanakaran

Access to the most up-to-date information on medical countermeasures is important for the research and development of effective treatments for viruses and marine toxins. However, there is a lack of comprehensive databases that curate data on viruses and marine toxins, making decisions on medical countermeasures slow and difficult. In this work, we employ two large language models (LLMs), ChatGPT and Grok, to design two comprehensive databases of therapeutic countermeasures for five viruses (Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis) as well as marine toxins. With high-level human-provided inputs, the two LLMs identify public databases containing data on the five viruses and marine toxins, collect relevant information from these databases and the literature, iteratively cross-validate the collected information, and design interactive webpages for easy access to the curated, comprehensive databases. Notably, the ChatGPT LLM is employed to design agentic AI workflows (consisting of two AI agents for research and decision-making) to rank countermeasures for viruses and marine toxins in the databases. Together, our work explores the potential of LLMs as a scalable, updatable approach for building comprehensive knowledge databases and supporting evidence-based decision-making.

摘要：獲取有關醫療對策的最新資訊對於病毒和海洋毒素的有效治療的研究與開發至關重要。然而，缺乏綜合數據庫來整理有關病毒和海洋毒素的數據，使得醫療對策的決策變得緩慢且困難。在這項工作中，我們使用兩個大型語言模型（LLMs），即ChatGPT和Grok，設計了兩個針對拉薩病毒、馬爾堡病毒、埃博拉病毒、尼帕病毒和委內瑞拉馬腦炎病毒以及海洋毒素的治療對策的綜合數據庫。在高層次的人類提供的輸入下，這兩個LLMs識別出包含五種病毒和海洋毒素數據的公共數據庫，從這些數據庫和文獻中收集相關信息，迭代交叉驗證所收集的信息，並設計互動網頁以便於訪問整理好的綜合數據庫。值得注意的是，ChatGPT LLM被用來設計代理式AI工作流程（由兩個AI代理組成，分別負責研究和決策），以對數據庫中的病毒和海洋毒素對策進行排名。總之，我們的工作探討了LLMs作為可擴展、可更新的方法來建立綜合知識數據庫並支持基於證據的決策的潛力。

Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health

2603.29114v1 by Yuqing Xiao, John Grundy, Anuradha Madugalla, Elizabeth Manias

Requirements engineering for aged-care digital health must account for human aspects, because requirement priorities are shaped not only by technical functionality but also by stakeholders' health conditions, socioeconomics, and lived experience. Knowing which human aspects matter most, and for whom, is critical for inclusive and evidence-based requirements prioritisation. Yet in practice, while some studies have examined human aspects in RE, they have largely relied on expert judgement or model-driven analysis rather than large-scale user studies with meaningful human-in-the-loop validation to determine which aspects matter most and why. To address this gap, we conducted a mixed-methods study with 103 older adults, 105 developers, and 41 caregivers. We first applied explainable machine learning to identify the human aspects most strongly associated with requirement priorities across 8 aged-care digital health themes, and then conducted 12 semi-structured interviews to validate and interpret the quantitative patterns. The results identify the key human aspects shaping requirement priorities, reveal their directional effects, and expose substantial misalignment across stakeholder groups. Together, these findings show that human-centric requirements analysis should engage stakeholder groups explicitly rather than collapsing their perspectives into a single aggregate view. This paper contributes an identification of the key human aspects driving requirement priorities in aged-care digital health and an explainable, human-centric RE framework that combines ML-derived importance rankings with qualitative validation to surface the stakeholder misalignments that inclusive requirements engineering must address.

摘要：老年護理數位健康的需求工程必須考慮人類因素，因為需求優先級不僅受到技術功能的影響，還受到利益相關者的健康狀況、社會經濟狀況和生活經驗的影響。了解哪些人類因素最重要，以及對誰最重要，對於包容性和基於證據的需求優先級排序至關重要。然而在實踐中，儘管一些研究已經考察了需求工程中的人類因素，但它們在很大程度上依賴於專家判斷或模型驅動的分析，而非大規模的用戶研究，這些研究具有意義的人類參與驗證，以確定哪些因素最重要及其原因。為了填補這一空白，我們對103位老年人、105位開發者和41位護理人員進行了混合方法研究。我們首先應用可解釋的機器學習來識別與8個老年護理數位健康主題的需求優先級最強相關的人類因素，然後進行了12次半結構化訪談，以驗證和解釋定量模式。結果確定了塑造需求優先級的關鍵人類因素，揭示了它們的方向性影響，並暴露了利益相關者群體之間的重大不一致性。這些發現表明，以人為中心的需求分析應該明確地吸引利益相關者群體，而不是將他們的觀點合併為單一的總體觀。本文貢獻了對推動老年護理數位健康需求優先級的關鍵人類因素的識別，以及一個可解釋的以人為中心的需求工程框架，該框架結合了機器學習導出的重要性排名與定性驗證，以揭示包容性需求工程必須解決的利益相關者不一致性。

A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank

2603.29041v1 by Iness Halimi, Emmanuel Piffo, Oumnia Boudersa, Yvan Marcel Carre Vilmorin, Melissa Ait-ikhlef, Karima Kone, Andy Tan, Augustin Medina, Juliette Hernando, Sheila Ernest, Vatche Bartekian, Karine Lalonde, Mireille E Schnitzer, Gianolli Dorcelus

Clinical trials are characterized by high costs, extended timelines, and substantial operational risk, yet reliable prospective methods for predicting trial success before initiation remain limited. Existing artificial intelligence approaches often focus on isolated metrics or specific development stages and frequently rely on variables unavailable at the trial design phase, limiting real-world applicability. We present a hierarchical latent risk-aware machine learning framework for prospective prediction of clinical trial operational success using a curated subset of TrialsBank, a proprietary AI-ready database developed by Sorintellis, comprising 13,700 trials. Operational success was defined as the ability to initiate, conduct, and complete a clinical trial according to planned timelines, recruitment targets, and protocol specifications through database lock. This approach decomposes operational success prediction into two modeling stages. First, intermediate latent operational risk factors are predicted using more than 180 drug- and trial-level features available before trial initiation. These predicted latent risks are then integrated into a downstream model to estimate the probability of operational success. A staged data-splitting strategy was employed to prevent information leakage, and models were benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines. Across Phase I-III, the framework achieves strong out-of-sample performance, with F1-scores of 0.93, 0.92, and 0.91, respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation. These results demonstrate that clinical trial operational success can be prospectively forecasted using a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decision-making.

摘要：臨床試驗的特點是高成本、長時間和相當大的操作風險，然而在試驗啟動之前，可靠的預測試驗成功的前瞻性方法仍然有限。現有的人工智慧方法通常專注於孤立的指標或特定的開發階段，並且經常依賴於在試驗設計階段無法獲得的變數，限制了其在現實世界中的適用性。我們提出了一個層次性潛在風險感知的機器學習框架，用於利用經過整理的TrialsBank子集預測臨床試驗的操作成功，TrialsBank是由Sorintellis開發的專有AI準備數據庫，包含13,700個試驗。操作成功被定義為根據計劃的時間表、招募目標和協議規範，啟動、進行和完成臨床試驗的能力，直到數據庫鎖定。這種方法將操作成功的預測分解為兩個建模階段。首先，使用在試驗啟動之前可用的180多個藥物和試驗級別特徵來預測中間潛在操作風險因素。這些預測的潛在風險然後被整合到下游模型中，以估計操作成功的概率。採用了分階段數據拆分策略以防止信息洩漏，並使用XGBoost、CatBoost和可解釋增強機器進行模型基準測試。在第一至第三階段，該框架在樣本外表現強勁，F1分數分別為0.93、0.92和0.91。納入潛在風險驅動因素提高了對操作失敗的區分能力，並且在獨立推斷評估下性能依然穩健。這些結果表明，臨床試驗的操作成功可以通過潛在風險感知的AI框架進行前瞻性預測，使早期風險評估成為可能，並支持基於數據的臨床開發決策。
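
The abstract mentions a staged data-splitting strategy to prevent information leakage between the two modeling stages but does not describe it. One plausible reading, sketched below as an assumption rather than the authors' procedure, is to fit the latent-risk models on one partition, fit the downstream success model on a disjoint partition, and evaluate on a third. The partition names, fractions, and trial IDs are hypothetical.

```python
import random


def staged_split(trial_ids, fractions=(0.4, 0.4, 0.2), seed=0):
    """Shuffle trial IDs once and cut them into three disjoint partitions.

    fractions gives the shares for (stage-1 training, stage-2 training, test).
    Any rounding remainder goes to the final (test) partition.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = list(trial_ids)
    random.Random(seed).shuffle(ids)  # seeded, so the split is reproducible
    cut1 = int(len(ids) * fractions[0])
    cut2 = cut1 + int(len(ids) * fractions[1])
    return {
        "latent_risk_train": ids[:cut1],   # fit stage-1 latent-risk models here
        "success_train": ids[cut1:cut2],   # fit the stage-2 success model here
        "held_out_test": ids[cut2:],       # never seen by either stage
    }


# Hypothetical trial identifiers.
split = staged_split([f"trial_{i:03d}" for i in range(100)])
```

Because the stage-2 model only sees latent risks predicted for trials the stage-1 models were not trained on, its inputs resemble what it would receive at inference time.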

Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction

2603.29023v1 by Diego C. Lerma-Torres

Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85% - even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy's belief hierarchy, dual-process cognition, and fuzzy-trace theory, organized around three principles: (1) Memory has valence, not just content - pre-computed emotional-associative summaries (valence vectors) organized in an emergent belief hierarchy inspired by Beck's cognitive model enable instant orientation before deliberation; (2) Retrieval defaults to System 1 with System 2 escalation - automatic spreading activation and passive priming as default, with deliberate retrieval only when needed, and graded epistemic states that address hallucination structurally; and (3) Encoding is active, present, and feedback-dependent - a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation, not passive exposure. Seven functional properties specify what any implementation must satisfy. Over time, the system converges toward System 1 processing - the computational analog of clinical expertise - producing interactions that become cheaper, not more expensive, with experience.

摘要：大型語言模型缺乏持久的、結構化的記憶，以便於長期互動和上下文敏感的檢索。擴大上下文窗口並不能解決這個問題：最近的證據顯示，僅僅延長上下文長度會使推理能力下降多達85%——即使檢索是完美的。我們提出了一個受生物啟發的記憶框架，基於互補學習系統理論、認知行為療法的信念層級、雙過程認知和模糊痕跡理論，圍繞三個原則組織：(1) 記憶具有價值，而不僅僅是內容——預先計算的情感聯想摘要（價值向量）組織在一個受貝克認知模型啟發的緊急信念層級中，使得在深思熟慮之前能夠快速定位；(2) 檢索默認為系統1，並在需要時升級到系統2——自動擴散激活和被動引導作為默認情況，只有在需要時才進行深思熟慮的檢索，並且有分級的認知狀態結構性地解決幻覺；(3) 編碼是主動的、當前的，並依賴於反饋——丘腦閘道標記並路由信息在存儲之間，而執行系統則通過好奇心驅動的調查形成要旨，而不是被動的接觸。七個功能屬性規定了任何實現必須滿足的條件。隨著時間的推移，系統向系統1處理收斂——臨床專業知識的計算類比——產生的互動隨著經驗的增長變得更便宜，而不是更昂貴。

Towards a Medical AI Scientist

2603.28589v1 by Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

摘要：自主系統生成科學假設、進行實驗並撰寫手稿，最近已成為加速發現的一個有前景的範式。然而，現有的人工智慧科學家在很大程度上仍然是領域無關的，這限制了它們在臨床醫學中的應用，因為研究需要基於醫學證據並具有專門的數據模式。在這項工作中，我們介紹了醫療人工智慧科學家，這是第一個專為臨床自主研究量身定制的自主研究框架。它通過臨床醫生與工程師的共同推理機制，將廣泛調查的文獻轉化為可行的證據，從而實現臨床基礎的創意，並提高生成研究想法的可追溯性。它進一步促進了基於證據的手稿撰寫，並遵循結構化的醫學組合慣例和倫理政策。該框架在三種研究模式下運作，即基於論文的重現、受文獻啟發的創新和任務驅動的探索，每種模式對應於不同層次的自動化科學探究，並逐步增加自主性。大型語言模型和人類專家的全面評估顯示，醫療人工智慧科學家生成的想法在171個案例、19個臨床任務和6種數據模式中，質量顯著高於商業LLM所產生的想法。同時，我們的系統在所提出的方法與其實施之間實現了強大的對齊，並在可執行實驗中顯示出顯著更高的成功率。人類專家和斯坦福代理審稿人的雙盲評估表明，生成的手稿接近MICCAI級別的質量，同時穩定超越來自ISBI和BIBM的手稿。所提出的醫療人工智慧科學家突顯了利用人工智慧進行醫療保健自主科學發現的潛力。

Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework

2603.28532v1 by Ya Zhou, Tianxiang Hao, Ziyi Cai, Haojie Zhu, Hejun He, Jia Liu, Xiaohan Fan, Jing Yuan

Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.

摘要：低左心室射血分數（LEF）常常在進展到有症狀的心臟衰竭之前未被檢測到，這突顯了可擴展篩查策略的必要性。儘管人工智慧驅動的心電圖（AI-ECG）顯示出潛力，但現有的方法僅依賴於端到端的黑箱模型，這些模型的可解釋性有限，或依賴於依賴商業心電圖測量算法的表格系統，這些算法的性能不佳。我們引入了基於心電圖的預測驅動LEF（ECGPD-LEF），這是一個結構化框架，將基礎模型衍生的診斷概率與可解釋建模相結合，以從心電圖中檢測LEF。該框架在基準EchoNext數據集上進行訓練，該數據集包含72,475對心電圖-超聲心動圖，並在預定的獨立內部（n=5,442）和外部（n=16,017）隊列中進行評估，我們的框架在中度LEF的區分上達到了穩健的表現（內部AUROC 88.4%，F1 64.5%；外部AUROC 86.8%，F1 53.6%），在各個人口和臨床子群中始終優於基準提供的官方端到端基線。可解釋性分析確定了高影響力的預測因子，包括正常心電圖、不完全的左束支傳導阻滯和前外側導聯的心內膜下損傷，這些因子驅動了LEF風險評估。值得注意的是，這些預測因子獨立地實現了類似零樣本的推斷，而無需特定任務的再訓練（內部AUROC 75.3-81.0%；外部AUROC 71.6-78.6%），這表明心室功能障礙本質上在結構化診斷概率表示中被編碼。這一框架調和了預測性能與機制透明度，支持通過額外的預測因子和與現有AI-ECG系統的無縫整合來實現可擴展的增強。

FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation

2603.28455v1 by Tiantian Wang, Xiang Xiang, Simon S. Du

In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm, enabling continuous adaptive model learning among distributed clients while safeguarding data privacy. However, in practical applications, data across agent nodes within the distributed framework often exhibits non-independent and identically distributed (non-IID) characteristics, rendering traditional continual learning methods inapplicable. To address these challenges, this paper covers more comprehensive incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. This strategy fully taps into the inherent potential of data heterogeneity, while taking into account the performance fairness of all participating clients, thereby establishing a balanced and adaptive solution to mitigate catastrophic forgetting. Unlike the fixed allocation of client exemplar memory, the proposed scheme emphasizes the rational allocation of limited storage resources among clients to improve model performance. Furthermore, extensive experiments are conducted on three medical image datasets, and the results demonstrate significant performance improvements compared to existing baseline models.

摘要：在聯邦醫療系統中，聯邦類別增量學習（FCIL）已成為一個關鍵範式，使得分散式客戶端之間能夠持續適應性地學習模型，同時保護數據隱私。然而，在實際應用中，分散框架內的代理節點之間的數據往往表現出非獨立同分佈（non-IID）的特徵，這使得傳統的持續學習方法無法適用。為了解決這些挑戰，本文涵蓋了更全面的增量任務場景，並提出了一種基於數據重播機制的示例存儲動態內存分配策略。該策略充分挖掘了數據異質性的內在潛力，同時考慮到所有參與客戶端的性能公平性，從而建立了一個平衡且適應性的解決方案，以減輕災難性遺忘。與客戶端示例內存的固定分配不同，所提出的方案強調在客戶端之間合理分配有限的存儲資源，以提高模型性能。此外，對三個醫學影像數據集進行了廣泛的實驗，結果顯示與現有基準模型相比，性能有顯著提升。
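
The abstract contrasts a fixed per-client exemplar memory with a dynamic allocation of a limited storage budget, without giving the allocation rule. The sketch below shows one simple way such a rule could look. It assumes each client has a non-negative score (for example, a heterogeneity or forgetting measure; the scores here are hypothetical) and splits the budget proportionally, with a per-client floor for fairness. It is not the FeDMRA algorithm itself.

```python
def allocate_exemplar_memory(total_slots, client_scores, min_slots=1):
    """Split a fixed exemplar budget across clients in proportion to their scores.

    Every client first receives min_slots; the remaining slots are shared
    proportionally, using the largest-remainder method so the total is exact.
    """
    n = len(client_scores)
    if total_slots < n * min_slots:
        raise ValueError("budget too small for the per-client floor")
    if any(score < 0 for score in client_scores.values()):
        raise ValueError("scores must be non-negative")

    spare = total_slots - n * min_slots
    total_score = sum(client_scores.values())
    if total_score == 0:
        shares = {c: spare / n for c in client_scores}  # fall back to an equal split
    else:
        shares = {c: spare * s / total_score for c, s in client_scores.items()}

    allocation = {c: min_slots + int(shares[c]) for c in client_scores}
    leftover = total_slots - sum(allocation.values())
    # Give leftover slots to the largest fractional parts; ties break by client name.
    by_remainder = sorted(client_scores, key=lambda c: (-(shares[c] - int(shares[c])), c))
    for client in by_remainder[:leftover]:
        allocation[client] += 1
    return allocation


# Hypothetical scores: client "a" is three times as "needy" as "b" or "c".
allocation = allocate_exemplar_memory(100, {"a": 3.0, "b": 1.0, "c": 1.0}, min_slots=5)
```

The floor keeps low-scoring clients from being starved of replay memory, which is one way to read the abstract's concern for performance fairness across clients.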

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

2603.28387v1 by Doan Nam Long Vu, Simone Balloccu

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

摘要：值得信賴的臨床人工智慧要求性能提升反映真實的證據整合，而非表面層次的人工產物。我們在兩個臨床神經影像學群體上對12個開放權重的視覺-語言模型（VLMs）進行二元分類評估，FOR2107（情感障礙）和OASIS-3（認知衰退）。這兩個數據集都包含結構性MRI數據，但不帶有可靠的個體級診斷信號。在這些條件下，較小的VLM在引入神經影像學背景後，F1得分提升高達58%，而經過提煉的模型與規模大一個數量級的對應模型競爭。對比信心分析顯示，僅僅在任務提示中提及MRI的可用性就佔據了70-80%的變化，與影像數據是否存在無關，這是一個我們稱之為支架效應的領域特定的模態崩潰實例。專家評估顯示在所有條件下均存在基於神經影像的理由的虛構，而偏好對齊在消除MRI參考行為的同時，使兩個條件都趨向隨機基線。我們的發現表明，表面評估並不足以作為多模態推理的指標，這對於在臨床環境中部署VLM有直接的影響。

Bit-Identical Medical Deep Learning via Structured Orthogonal Initialization

2603.28040v1 by Yakov Pyotr Shkolnikov

Deep learning training is non-deterministic: identical code with different random seeds produces models that agree on aggregate metrics but disagree on individual predictions, with per-class AUC swings exceeding 20 percentage points on rare clinical classes. We present a framework for verified bit-identical training that eliminates three sources of randomness: weight initialization (via structured orthogonal basis functions), batch ordering (via golden ratio scheduling), and non-deterministic GPU operations (via architecture selection and custom autograd). The pipeline produces MD5-verified identical trained weights across independent runs. On PTB-XL ECG rhythm classification, structured initialization significantly exceeds Kaiming across two architectures (n=20; Conformer p = 0.016, Baseline p < 0.001), reducing aggregate variance by 2-3x and reducing per-class variability on rare rhythms by up to 7.5x (TRIGU range: 4.1pp vs 30.9pp under Kaiming, independently confirmed by 3-fold CV). A four-basis comparison at n=20 shows all structured orthogonal bases produce equivalent performance (Friedman p=0.48), establishing that the contribution is deterministic structured initialization itself, not any particular basis function. Cross-domain validation on seven MedMNIST benchmarks (n=20, all p > 0.14) confirms no performance penalty on standard tasks; per-class analysis on imbalanced tasks (ChestMNIST, RetinaMNIST) shows the same variance reduction on rare classes observed in ECG. Cross-dataset evaluation on three external ECG databases confirms zero-shot generalization (>0.93 AFIB AUC).

摘要：深度學習訓練是非確定性的：相同的代碼在不同的隨機種子下產生的模型在整體指標上達成一致，但在個別預測上存在分歧，對於稀有臨床類別，每類的AUC波動超過20個百分點。我們提出了一個經過驗證的位元相同訓練框架，消除了三個隨機性來源：權重初始化（通過結構化正交基函數）、批次排序（通過黃金比例調度）和非確定性GPU操作（通過架構選擇和自定義自動微分）。該管道在獨立運行中產生MD5驗證的相同訓練權重。在PTB-XL ECG節律分類中，結構化初始化在兩個架構中顯著超過Kaiming（n=20；Conformer p = 0.016，Baseline p < 0.001），將整體變異性降低2-3倍，並將稀有節律的每類變異性降低高達7.5倍（TRIGU範圍：4.1pp對比Kaiming下的30.9pp，經3倍交叉驗證獨立確認）。在n=20的四基比較中，所有結構化正交基都產生等效的性能（Friedman p=0.48），確立了貢獻是確定性的結構化初始化本身，而不是任何特定的基函數。對七個MedMNIST基準的跨領域驗證（n=20，所有p > 0.14）確認在標準任務上沒有性能懲罰；對不平衡任務（ChestMNIST，RetinaMNIST）的每類分析顯示在稀有類別上觀察到的變異性降低與ECG相同。對三個外部ECG數據庫的跨數據集評估確認零樣本泛化（>0.93 AFIB AUC）。
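
Two of the ingredients named above, a seed-free batch ordering and a hash check on the trained weights, can be illustrated with the standard library. The abstract does not define "golden ratio scheduling". The sketch assumes it means ordering samples by the fractional part of a multiple of the golden ratio conjugate, a standard low-discrepancy sequence. The function names and the float32 serialization are likewise assumptions, not the paper's pipeline.

```python
import hashlib
import math
import struct

GOLDEN_RATIO_CONJUGATE = (math.sqrt(5) - 1) / 2  # approximately 0.618


def golden_ratio_order(num_samples):
    """Return a fixed permutation of range(num_samples) with no random seed.

    Index i gets the key frac((i + 1) * golden ratio conjugate); sorting by
    that key spreads neighbouring indices apart in a fully repeatable way.
    """
    keys = [((i + 1) * GOLDEN_RATIO_CONJUGATE) % 1.0 for i in range(num_samples)]
    return sorted(range(num_samples), key=lambda i: keys[i])


def weights_md5(weights):
    """MD5 digest of a flat list of weights serialized as little-endian float32."""
    payload = struct.pack(f"<{len(weights)}f", *weights)
    return hashlib.md5(payload).hexdigest()


order_run_1 = golden_ratio_order(10)
order_run_2 = golden_ratio_order(10)

# Stand-ins for the final weights of two independent training runs.
digest_run_1 = weights_md5([0.1, 0.2, 0.3])
digest_run_2 = weights_md5([0.1, 0.2, 0.3])
runs_match = digest_run_1 == digest_run_2
```

Comparing digests rather than metrics is what makes the claim bit-identical: two runs can agree on AUC to many decimals and still differ in their weights.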

Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images

2603.27798v1 by Laura Rayón Ropero, Jasper De Laet, Filip Lemic, Pau Sabater Nácher, Nabeel Nisar Bhat, Sergi Abadal, Jeroen Famaey, Eduard Alarcón, Xavier Costa-Pérez

Facial Emotion Recognition (FER) is a critical research area within Affective Computing due to its wide-ranging applications in Human-Computer Interaction, mental health assessment and fatigue monitoring. Current FER methods predominantly rely on Deep Learning techniques trained on 2D image data, which pose significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, through the generation of detailed 3D facial pointclouds via on-person sensors embedded in wearables. We present arguments supporting the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly stringent data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. Towards addressing this issue, we introduce a FLAME-based method to generate 3D facial pointclouds from existing public 2D datasets. Using this approach, we create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy exceeding 70%, comparable to oracle-level performance. To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experimental results show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.

摘要：面部情感識別是情感計算中一個關鍵的研究領域，因為它在人體計算機互動、心理健康評估和疲勞監測等方面有著廣泛的應用。當前的面部情感識別方法主要依賴於基於2D圖像數據訓練的深度學習技術，這些技術存在顯著的隱私問題，且不適合持續的實時監測。作為替代方案，我們提出高頻無線感測（HFWS）作為持續、注重隱私的面部情感識別的實現方式，通過在可穿戴設備中嵌入的個人傳感器生成詳細的3D面部點雲。我們提出支持HFWS在隱私優勢方面優於傳統2D成像的論據，特別是在日益嚴格的數據保護法規下。採用HFWS進行面部情感識別的一個主要障礙是標記的3D面部情感識別數據集的稀缺。為了解決這一問題，我們引入了一種基於FLAME的方法，從現有的公共2D數據集中生成3D面部點雲。通過這種方法，我們創建了AffectNet3D，這是AffectNet數據庫的3D版本。為了評估生成數據的質量和可用性，我們設計了一個點雲精煉管道，專注於隔離面部區域，並在精煉後的點雲上訓練流行的PointNet++模型。在未見的3D面部情感識別數據集BU-3DFE的小子集上微調模型，得到了超過70%的分類準確率，與預測級別的表現相當。為了進一步研究基於HFWS的面部情感識別在持續監測中的潛力，我們通過遮罩生成的點雲的部分區域來模擬可穿戴感測條件。實驗結果顯示，在AffectNet3D上訓練並僅用25%的BU-3DFE進行微調的模型，表現優於僅在BU-3DFE上訓練的模型。這些發現突顯了我們的管道的可行性，並支持通過可穿戴HFWS系統實現持續、注重隱私的面部情感識別的可行性。
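
The abstract says wearable sensing conditions are simulated by masking portions of the generated pointclouds, without specifying the masks. A minimal sketch of the idea, assuming an axis-aligned visibility box (a hypothetical choice, as are the coordinates), keeps only the points a sensor could see:

```python
def mask_pointcloud(points, visible_box):
    """Keep only the (x, y, z) points that fall inside an axis-aligned box.

    visible_box is ((x_min, x_max), (y_min, y_max), (z_min, z_max)) and stands
    in for the part of the face a wearable sensor is assumed to cover.
    """
    (x_min, x_max), (y_min, y_max), (z_min, z_max) = visible_box
    return [
        (x, y, z)
        for (x, y, z) in points
        if x_min <= x <= x_max and y_min <= y <= y_max and z_min <= z <= z_max
    ]


# Toy pointcloud: two points in the upper face (y >= 0), one in the lower face.
face = [(0.0, 0.5, 0.1), (0.2, 0.0, 0.0), (0.0, -0.6, 0.1)]
upper_face_only = mask_pointcloud(face, ((-1.0, 1.0), (0.0, 1.0), (-1.0, 1.0)))
```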

RAP: Retrieve, Adapt, and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation

2603.27705v1 by Zhihao Mao, Bangpu Chen

Few-shot medical image segmentation (FSMIS) has achieved notable progress, yet most existing methods mainly rely on semantic correspondences from scarce annotations while under-utilizing a key property of medical imagery: anatomical targets exhibit repeatable high-frequency morphology (e.g., boundary geometry and spatial layout) across patients and acquisitions. We propose RAP, a training-free framework that retrieves, adapts, and prompts Segment Anything Model 2 (SAM2) for FSMIS. First, RAP retrieves morphologically compatible supports from an archive using DINOv3 features to reduce brittleness in single-support choice. Second, it adapts the retrieved support mask to the query by fitting boundary-aware structural cues, yielding an anatomy-consistent pre-mask under domain shifts. Third, RAP converts the pre-mask into prompts by sampling positive points via Voronoi partitioning and negative points via sector-based sampling, and feeds them into SAM2 for final refinement without any fine-tuning. Extensive experiments on multiple medical segmentation benchmarks show that RAP consistently surpasses prior FSMIS baselines and achieves state-of-the-art performance. Overall, RAP demonstrates that explicit structural fitting combined with retrieval-augmented prompting offers a simple and effective route to robust training-free few-shot medical segmentation.

摘要：少樣本醫學影像分割（FSMIS）已取得顯著進展，然而大多數現有方法主要依賴於稀缺標註的語義對應，同時未充分利用醫學影像的一個關鍵特性：解剖目標在不同患者和獲取過程中展現可重複的高頻形態（例如，邊界幾何和空間佈局）。我們提出了RAP，一個無需訓練的框架，能夠檢索、適應並提示Segment Anything Model 2（SAM2）以進行FSMIS。首先，RAP使用DINOv3特徵從檔案中檢索形態上相容的支持，以減少單一支持選擇的脆弱性。其次，它通過擬合邊界感知的結構線索，將檢索到的支持掩模適應於查詢，從而在領域轉移下產生解剖一致的預掩模。第三，RAP通過Voronoi劃分抽樣正點和基於扇區的抽樣負點，將預掩模轉換為提示，並將其輸入SAM2進行最終精煉，而無需任何微調。在多個醫學分割基準上的廣泛實驗表明，RAP始終超越先前的FSMIS基準，並達到最先進的性能。總體而言，RAP展示了明確的結構擬合結合檢索增強提示提供了一條簡單有效的途徑，以實現穩健的無訓練少樣本醫學分割。
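
The third RAP step turns a pre-mask into point prompts, with negative points drawn by sector-based sampling. The abstract gives no further detail, so the sketch below is an assumption: it takes sectors to be equal angular wedges around the pre-mask centroid and picks, in each wedge, the background pixel nearest to the centroid. The Voronoi-based positive sampling is not covered. The function name and the toy mask are illustrative.

```python
import math


def sector_negative_points(mask, num_sectors=4):
    """Pick up to one background (row, col) pixel per angular sector around the mask centroid.

    mask is a 2D list of 0/1 values. Sectors with no background pixel are skipped.
    """
    foreground = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not foreground:
        raise ValueError("mask has no foreground pixels")
    center_r = sum(r for r, _ in foreground) / len(foreground)
    center_c = sum(c for _, c in foreground) / len(foreground)

    nearest = {}  # sector index -> (distance to centroid, (row, col))
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                continue  # only background pixels can be negative prompts
            angle = math.atan2(r - center_r, c - center_c)  # in [-pi, pi]
            sector = int((angle + math.pi) / (2 * math.pi) * num_sectors)
            sector = min(sector, num_sectors - 1)  # angle == pi maps to the last sector
            distance = math.hypot(r - center_r, c - center_c)
            if sector not in nearest or distance < nearest[sector][0]:
                nearest[sector] = (distance, (r, c))
    return [nearest[s][1] for s in sorted(nearest)]


# Toy 5x5 pre-mask: a 3x3 foreground block surrounded by background.
pre_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
negatives = sector_negative_points(pre_mask, num_sectors=4)
```

Spreading negatives over all directions gives the promptable segmenter a background hint on every side of the target, rather than clustering them where the background happens to be largest.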

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

2603.27460v1 by Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, Chaoyang Zhang, Wenjie Li, Shaohao Rui, Weijie Ma, Xingyue Zhao, Yibin Wang, Kun Yuan, Zhaohui Lu, Shujun Wang, Jinjie Wei, Lihao Liu, Dingkang Yang, Lin Wang, Yulong Li, Haolin Yang, Yiqing Shen, Lequan Yu, Xiaowei Hu, Yun Gu, Yicheng Wu, Benyou Wang, Minghui Zhang, Angelica I. Aviles-Rivero, Qi Gao, Hongming Shan, Xiaoyu Ren, Fang Yan, Hongyu Zhou, Haodong Duan, Maosong Cao, Shanshan Wang, Bin Fu, Xiaomeng Li, Zhi Hou, Chunfeng Song, Lei Bai, Yuan Cheng, Yuandong Pu, Xiang Li, Wenhai Wang, Hao Chen, Jiaxin Zhuang, Songyang Zhang, Huiguang He, Mengzhang Li, Bohan Zhuang, Zhian Bai, Rongshan Yu, Liansheng Wang, Yukun Zhou, Xiaosong Wang, Xin Guo, Guanbin Li, Xiangru Lin, Dakai Jin, Mianxin Liu, Wenlong Zhang, Qi Qin, Conghui He, Yuqiang Li, Ye Luo, Nanqing Dong, Jie Xu, Wenqi Shao, Bo Zhang, Qiujuan Yan, Yihao Liu, Jun Ma, Zhi Lu, Yuewen Cao, Zongwei Zhou, Jianming Liang, Shixiang Tang, Qi Duan, Dongzhan Zhou, Chen Jiang, Yuyin Zhou, Yanwu Xu, Jiancheng Yang, Shaoting Zhang, Xiaohong Liu, Siqi Luo, Yi Xin, Chaoyu Liu, Haochen Wen, Xin Chen, Alejandro Lozano, Min Woo Sun, Yuhui Zhang, Yue Yao, Xiaoxiao Sun, Serena Yeung-Levy, Xia Li, Jing Ke, Chunhui Zhang, Zongyuan Ge, Ming Hu, Jin Ye, Zhifeng Li, Yirong Chen, Yu Qiao, Junjun He

Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily due to the growing availability of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such medical datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.

摘要：基礎模型在多樣領域和任務中展現了顯著的成功，主要歸功於大規模、多樣且高質量數據集的蓬勃發展。然而，在醫學影像領域，這類醫學數據集的策劃和組建面臨極大的挑戰，因為這依賴於臨床專業知識以及嚴格的倫理和隱私限制，導致大規模統一醫學數據集的稀缺，並阻礙了強大醫學基礎模型的發展。在這項工作中，我們提出了迄今為止最大的醫學影像數據集調查，涵蓋了超過1,000個開放訪問數據集，並系統性地編目它們的模態、任務、解剖、註釋、限制和整合潛力。我們的分析揭示了一個規模適中、在狹窄任務範疇中碎片化、並在器官和模態之間分佈不均的現狀，這反過來限制了現有醫學影像數據集在開發多功能和穩健的醫學基礎模型中的效用。為了將碎片化轉變為規模，我們提出了一種基於元數據的融合範式（MDFP），該範式整合了具有共享模態或任務的公共數據集，從而將多個小數據孤島轉變為更大、更連貫的資源。基於MDFP，我們發布了一個互動式發現入口，實現端到端的自動醫學影像數據集整合，並將所有調查的數據集編輯成一個統一的結構化表格，清晰地總結其關鍵特徵並提供參考鏈接，為社群提供一個可訪問且全面的資料庫。通過描繪當前的地形並提供一條有原則的數據集整合路徑，我們的調查為擴大醫學影像語料庫提供了一個實用的路線圖，支持更快的數據發現、更有原則的數據集創建，以及更強大的醫學基礎模型。
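
The metadata-driven fusion idea, grouping datasets that share a modality or a task so that small silos can be combined, reduces at its simplest to a group-by over catalog metadata. The sketch below illustrates only that grouping step. The catalog entries, field names, and function name are hypothetical and do not reflect the schema of the released portal.

```python
from collections import defaultdict


def fusion_candidates(catalog, field):
    """Group dataset names by one metadata field and keep groups of two or more.

    catalog is a list of dicts with a "name" key plus metadata keys such as
    "modality" or "task". Singleton groups have nothing to fuse with.
    """
    groups = defaultdict(list)
    for entry in catalog:
        groups[entry[field]].append(entry["name"])
    return {value: sorted(names) for value, names in groups.items() if len(names) >= 2}


# Hypothetical catalog entries.
catalog = [
    {"name": "toy_ct_liver", "modality": "CT", "task": "segmentation"},
    {"name": "toy_ct_lung", "modality": "CT", "task": "detection"},
    {"name": "toy_mri_brain", "modality": "MRI", "task": "segmentation"},
    {"name": "toy_xray_chest", "modality": "X-ray", "task": "classification"},
]

by_modality = fusion_candidates(catalog, "modality")
by_task = fusion_candidates(catalog, "task")
```

In practice the harder work lies in harmonizing label spaces, spacing, and licenses inside each group, which this sketch does not attempt.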

Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models

2603.27325v1 by Mehedi Hasan Tusar, Fateme Fayyazbakhsh, Igor Melnychuk, Ming C. Leu

Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. However, most existing AI models are limited, focusing on a narrow set of wound types or performing only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a deep learning model based on YOLOv11 that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). A wound-type balanced dataset of 2,963 annotated images was created to train the models for both tasks, with stratified five-fold cross-validation ensuring robust and unbiased evaluation. The models trained on the original non-augmented dataset achieved consistent performance across folds, though BI detection accuracy was relatively lower. Therefore, the dataset was augmented using rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more generalized and invariant features. This augmentation significantly improved model performance, particularly in detecting visually subtle BI cases. Among tested variants, YOLOv11x achieved the highest performance with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n provided comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model's robustness against complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in both clinical and remote care settings.

摘要：準確的傷口分類和邊界分割對於指導慢性和急性傷口管理中的臨床決策至關重要。然而，大多數現有的人工智慧模型都有限，專注於狹窄的傷口類型或僅執行單一任務（分割或分類），這降低了它們的臨床適用性。本研究提出了一個基於YOLOv11的深度學習模型，能同時執行五種臨床相關傷口類型的傷口邊界分割（WBS）和傷口分類（WC）：燒傷（BI）、壓力傷（PI）、糖尿病足潰瘍（DFU）、血管潰瘍（VU）和手術傷口（SW）。為了訓練這兩項任務的模型，創建了一個包含2,963張註釋圖像的傷口類型平衡數據集，並通過分層五折交叉驗證確保了穩健和無偏的評估。在原始未增強數據集上訓練的模型在各折中表現一致，儘管BI檢測的準確性相對較低。因此，通過旋轉、翻轉以及亮度、飽和度和曝光的變化來增強數據集，以幫助模型學習更通用和不變的特徵。這種增強顯著改善了模型的性能，特別是在檢測視覺上微妙的BI案例方面。在測試的變體中，YOLOv11x以0.9341（WBS）和0.8736（WC）的F1分數達到了最高性能，而輕量級的YOLOv11n在較低的計算成本下提供了可比的準確性，使其適合資源有限的部署。通過混淆矩陣和視覺檢測輸出支持，結果確認了模型在複雜背景和高類內變異性下的穩健性，展示了基於YOLOv11的架構在臨床和遠程護理環境中進行準確實時傷口分析的潛力。
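
The augmentations listed above (rotation, flipping, brightness changes) can be shown on a tiny grayscale image stored as nested lists. This is a simplified illustration, not the authors' training pipeline: rotation is limited to 90-degree steps, and saturation and exposure are omitted because they need colour images.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def adjust_brightness(image, factor):
    """Scale pixel values by factor and clamp the result to the 0-255 range."""
    return [[min(255, max(0, round(pixel * factor))) for pixel in row] for row in image]


# Toy 2x2 grayscale patch.
patch = [
    [10, 200],
    [30, 40],
]
flipped = flip_horizontal(patch)
rotated = rotate_90_clockwise(patch)
brighter = adjust_brightness(patch, 1.5)
```

For segmentation, geometric transforms such as flips and rotations have to be applied identically to the wound mask, whereas brightness changes affect the image only.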

2603.27240v1 by Jinhu Fu, Yihang Lou, Qingyi Si, Shudong Zhang, Yan Bai, Sen Su

Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs (CARE). We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity. Extensive experiments on multiple safety benchmarks demonstrate that our causal-subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Additionally, our method exhibits good transferability, defending against unseen attacks.

摘要：大型視覺-語言模型（LVLMs）在多模態理解和推理任務中取得了令人印象深刻的表現，但其內部安全機制仍然不透明且控制不佳。在這項工作中，我們提出了一個全面的框架，用於診斷和修復LVLM中的不安全通道（CARE）。我們首先進行因果中介分析，以識別對不安全行為負有因果責任的神經元和層。基於這些發現，我們引入了一種雙模態安全子空間投影方法，通過良性和惡性激活之間的廣義特徵分解來學習視覺和文本模態的廣義安全子空間。在推理過程中，激活通過一種混合融合機制動態投影到這些安全子空間，該機制自適應地平衡視覺和文本的修正，有效抑制不安全特徵，同時保持語義的真實性。在多個安全基準上的廣泛實驗表明，我們的因果子空間修復框架顯著增強了安全穩健性，而不會降低一般的多模態能力，超越了先前的激活引導和對齊基準。此外，我們的方法展現了良好的可轉移性，能夠抵禦未見攻擊。

MediHive: A Decentralized Agent Collective for Medical Reasoning

2603.27150v1 by Xiaoyang Wang, Christopher C. Yang

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise enhanced autonomy and resilience via peer-to-peer interactions, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning-intensive tasks.

摘要：大型語言模型（LLMs）已經徹底改變了醫療推理任務，但單一代理系統在處理需要強大不確定性和衝突證據的複雜跨學科問題時，往往表現不佳。利用LLMs的多代理系統（MAS）能夠實現協作智能，但現有的集中式架構在資源有限的環境中面臨可擴展性瓶頸、單點故障和角色混淆的問題。去中心化的多代理系統（D-MAS）通過點對點互動承諾增強自主性和韌性，但其在高風險醫療領域的應用仍然未得到充分探索。我們介紹了MediHive，一個新穎的去中心化多代理框架，用於醫療問題回答，該框架整合了共享記憶池和迭代融合機制。MediHive 部署了基於LLM的代理，這些代理能夠自主自我分配專業角色，進行初步分析，通過條件證據辯論檢測分歧，並在多輪中本地融合同伴見解以達成共識。實證結果表明，MediHive在MedQA和PubMedQA數據集上的表現優於單一LLM和集中基準，分別達到84.3%和78.4%的準確率。我們的工作推進了可擴展、容錯的D-MAS在醫療AI中的應用，解決了集中設計的關鍵限制，同時在推理密集型任務中展示了卓越的性能。

Bayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data

2603.27142v1 by Amuche Ibenegbu, Pierre Lafaye de Micheaux, Rohitash Chandra

Time-series analysis is often affected by missing data, a common problem across several fields, including healthcare and environmental monitoring. Multiple Imputation by Chained Equations (MICE) has been prominent for imputing missing values through "fully conditional specification". We extend MICE using the Bayesian framework (Bayes-MICE), utilising Bayesian inference to impute missing values via Markov Chain Monte Carlo (MCMC) sampling to account for uncertainty in MICE model parameters and imputed values. We also include temporally informed initialisation and time-lagged features in the model to respect the sequential nature of time-series data. We evaluate the Bayes-MICE method using two real-world datasets (AirQuality and PhysioNet), and using both the Random Walk Metropolis (RWM) and the Metropolis-Adjusted Langevin Algorithm (MALA) samplers. Our results demonstrate that Bayes-MICE reduces imputation errors relative to the baseline methods over all variables and accounts for uncertainty in the imputation process, thereby providing a more accurate measure of imputation error. We also found that MALA converges faster than RWM, achieving comparable accuracy while providing more consistent posterior exploration. Overall, these findings suggest that the Bayes-MICE framework represents a practical and efficient approach to time-series imputation, balancing increased accuracy with meaningful quantification of uncertainty in various environmental and clinical settings.

摘要：時間序列分析常常受到缺失資料的影響，這是多個領域中的常見問題，包括醫療保健和環境監測。多重插補鏈式方程（MICE）在通過「完全條件規範」插補缺失值方面非常重要。我們使用貝葉斯框架擴展MICE（Bayes-MICE），利用貝葉斯推斷通過馬爾可夫鏈蒙特卡羅（MCMC）抽樣來插補缺失值，以考慮MICE模型參數和插補值的不確定性。我們還在模型中包含了時間信息初始化和時間滯後特徵，以尊重時間序列數據的序列性質。我們使用兩個真實世界數據集（AirQuality和PhysioNet）來評估Bayes-MICE方法，並使用隨機漫步梅特羅波利斯（RWM）和梅特羅波利斯調整的朗之萬算法（MALA）抽樣器。我們的結果表明，Bayes-MICE在所有變數上相對於基準方法減少了插補誤差，並考慮了插補過程中的不確定性，從而提供了更準確的插補誤差測量。我們還發現MALA的收斂速度比RWM更快，實現了可比的準確性，同時提供了更一致的後驗探索。總體而言，這些發現表明，Bayes-MICE框架代表了一種實用且高效的時間序列插補方法，在各種環境和臨床設置中平衡了提高的準確性與不確定性的有意義量化。
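
The Random Walk Metropolis (RWM) sampler named in the abstract above can be illustrated with a minimal sketch. This is not the authors' Bayes-MICE implementation: it omits the chained equations, the time-lagged features, and MALA. The toy readings, the flat prior, the known noise level `sigma`, and all function names are assumptions made for illustration.

```python
import math
import random


def random_walk_metropolis(log_density, start, step, n_samples, rng):
    """Draw samples from an unnormalised 1-D log density using Gaussian proposals."""
    samples = []
    current, current_lp = start, log_density(start)
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_density(proposal)
        # Metropolis rule: accept with probability min(1, p(proposal) / p(current)).
        # 1.0 - rng.random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


# Toy series with one missing reading; the values and noise level are made up.
observed = [4.8, 5.1, 5.3, 4.9, 5.4]
sigma = 0.5  # assumed known observation noise


def log_posterior(mu):
    # Flat prior on the mean, Gaussian likelihood for the observed readings.
    return -sum((x - mu) ** 2 for x in observed) / (2.0 * sigma ** 2)


rng = random.Random(0)
draws = random_walk_metropolis(log_posterior, observed[0], 0.3, 5000, rng)
posterior_mu = draws[1000:]  # discard the first 1000 draws as burn-in
# One imputed value per posterior draw, so parameter uncertainty carries into the imputations.
imputations = [rng.gauss(mu, sigma) for mu in posterior_mu]
posterior_mean = sum(posterior_mu) / len(posterior_mu)
print(f"posterior mean of mu: {posterior_mean:.2f}")
```

Drawing one imputation per posterior sample, rather than plugging in a single point estimate, is what lets the spread of the imputed values reflect uncertainty in the model parameters as well as observation noise.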

Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic Disorders

2603.27104v1 by Hongzhuo Chen, Zhanliang Wang, Quan M. Nguyen, Gongbo Zhang, Chunhua Weng, Kai Wang

Background: Medical Digital Twins (MDTs) are computational representations of individual patients that integrate clinical, genomic, and physiological data to support diagnosis, treatment planning, and outcome prediction. However, most MDTs remain static or passively updated, creating a critical synchronization gap, especially in rare genetic disorders where phenotypes, genomic interpretations, and care guidelines evolve over time. Methods: We propose an agent-orchestrated digital twin framework using OpenClaw's proactive "heartbeat" mechanism and modular Agent Skills. This Autonomous Agent-orchestrated Digital Twin (AADT) system continuously monitors local and external data streams (e.g., patient-reported phenotypes and updates in variant classification databases) and executes automated workflows for data ingestion, normalization, state updates, and trigger-based analysis. Results: A prototype implementation demonstrates that agent orchestration can continuously synchronize MDT states with both longitudinal phenotype updates and evolving genomic knowledge. In rare disease settings, this enables earlier diagnosis and more accurate modeling of disease progression. We present two case studies, including variant reinterpretation and longitudinal phenotype tracking, highlighting how AADTs support timely, auditable updates for both research and clinical care. Conclusion: The AADT framework addresses the key bottleneck of real-time synchronization in MDTs, enabling scalable and continuously updated patient models. We also discuss data security considerations and mitigation strategies through human-in-the-loop system design.

摘要：背景：醫療數位雙胞胎（MDTs）是個別病人的計算表示，整合臨床、基因組和生理數據，以支持診斷、治療計劃和結果預測。然後，大多數MDTs仍然是靜態或被動更新，造成了關鍵的同步差距，特別是在罕見遺傳疾病中，表型、基因組解釋和護理指導隨時間演變。
方法：我們提出了一個代理協調的數位雙胞胎框架，使用OpenClaw的主動“心跳”機制和模組化的代理技能。這個自主代理協調的數位雙胞胎（AADT）系統持續監控本地和外部數據流（例如，病人報告的表型和變異分類數據庫的更新），並執行自動化工作流程以進行數據攝取、標準化、狀態更新和基於觸發的分析。
結果：原型實現展示了代理協調如何持續同步MDT狀態，與縱向表型更新和不斷演變的基因組知識相結合。在罕見疾病環境中，這使得更早的診斷和更準確的疾病進展建模成為可能。我們呈現了兩個案例研究，包括變異重新解釋和縱向表型追蹤，突顯AADTs如何支持及時、可審計的更新，無論是對於研究還是臨床護理。
結論：AADT框架解決了MDTs中實時同步的關鍵瓶頸，使可擴展且持續更新的病人模型成為可能。我們還討論了數據安全考量和通過人機協作系統設計的緩解策略。

When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

2603.26556v1 by Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86-90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2-4× at 128K-token contexts.

摘要：將預訓練的 Transformer 轉換為更高效的混合模型通過蒸餾提供了一種減少推理成本的有前景的方法。然而，在蒸餾模型中實現高質量生成需要學生架構和蒸餾過程的仔細聯合設計。許多先前的蒸餾工作通過使用對數似然對候選答案進行排名來評估下游的多選基準，而不是要求自回歸生成，這可能會掩蓋模型質量的重要差異。例如，我們展示了一個 7B 參數的蒸餾模型，在對數似然評分下幾乎與其教師匹配到 0.2\,pp，但當模型必須自回歸生成答案時，實際上落後 20.8\,pp。我們提出了一種混合 Kimi Delta 注意力（Hybrid-KDA）架構，搭配 GenDistill，一個多階段的蒸餾管道，並在整個過程中使用基於生成的評估來指導設計決策。將這種方法應用於 Qwen3-0.6B，我們系統性地消融了六個設計軸：訓練目標、損失遮蔽、訓練持續時間、數據集選擇、參數凍結和架構選擇。我們發現基於對數似然的評估始終低估了教師和學生之間的差距，在某些情況下甚至會顛倒設計選擇的排名，這意味著僅基於困惑度的評估得出的結論可能會誤導。在我們研究的因素中，數據集選擇、僅完成遮蔽以及在後訓練期間凍結注意力層對生成質量的影響最大。我們最好的 Hybrid-KDA 模型在知識基準上保留了 86--90\% 的教師準確性，同時將 KV 緩存記憶減少了最多 75\%，並在 128K 令牌上下文中將首次令牌的時間提高了 2--4$\times$。

Foundation Model for Cardiac Time Series via Masked Latent Attention

2603.26475v1 by Moritz Vandenhirtz, Samuel Ruipérez-Campillo, Simon Böhi, Sonia Laguna, Irene Cannistraci, Andrea Agostini, Ece Ozkan, Thomas M. Sutter, Julia E. Vogt

Electrocardiograms (ECGs) are among the most widely available clinical signals and play a central role in cardiovascular diagnosis. While recent foundation models (FMs) have shown promise for learning transferable ECG representations, most existing pretraining approaches treat leads as independent channels and fail to explicitly leverage their strong structural redundancy. We introduce the latent attention masked autoencoder (LAMAE) FM, which directly exploits this structure by learning cross-lead connection mechanisms during self-supervised pretraining. Our approach models higher-order interactions across leads through latent attention, enabling permutation-invariant aggregation and adaptive weighting of lead-specific representations. We provide empirical evidence on the MIMIC-IV-ECG database that leveraging the cross-lead connection constitutes an effective form of structural supervision, improving representation quality and transferability. Our method shows strong performance in predicting ICD-10 codes, outperforming independent-lead masked modeling and alignment-based baselines.

摘要：心電圖（ECG）是最廣泛可用的臨床信號之一，並在心血管診斷中扮演核心角色。儘管最近的基礎模型（FM）在學習可轉移的ECG表示方面顯示出潛力，但大多數現有的預訓練方法將導聯視為獨立通道，未能明確利用其強大的結構冗餘。我們引入了潛在注意力遮罩自編碼器（LAMAE）FM，通過在自監督預訓練期間學習導聯之間的連接機制，直接利用這一結構。我們的方法通過潛在注意力建模導聯之間的高階交互，實現了排列不變的聚合和導聯特定表示的自適應加權。我們在Mimic-IV-ECG數據庫上提供了實證證據，表明利用導聯之間的連接構成了一種有效的結構監督形式，提高了表示質量和可轉移性。我們的方法在預測ICD-10代碼方面表現出色，超越了獨立導聯遮罩建模和基於對齊的基準。
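
To make the "permutation-invariant aggregation and adaptive weighting" in the abstract above concrete, here is a minimal sketch of attention pooling with a single latent query. It is not the LAMAE architecture. The embeddings, the query vector, and the function name are invented for illustration, and a real model would learn the query and use multi-head attention.

```python
import math


def attention_pool(query, lead_embeddings):
    """Pool per-lead embeddings with softmax attention against one latent query."""
    # Dot-product score of each lead embedding against the query.
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in lead_embeddings]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    dim = len(query)
    # Weighted average of the embeddings: the adaptive weighting of leads.
    return [
        sum(w * emb[d] for w, emb in zip(weights, lead_embeddings)) / norm
        for d in range(dim)
    ]


# Three hypothetical lead embeddings and a hypothetical latent query.
leads = [[0.2, 1.0, -0.5], [1.5, 0.1, 0.3], [-0.7, 0.4, 0.9]]
query = [0.5, -0.2, 0.1]

pooled = attention_pool(query, leads)
pooled_reversed = attention_pool(query, leads[::-1])
# Reordering the leads changes only the summation order, not the result.
print(all(abs(a - b) < 1e-12 for a, b in zip(pooled, pooled_reversed)))
```

Because each weight depends only on the lead's own embedding and the shared query, the pooled vector is the same for any ordering of the leads, up to floating-point rounding.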

PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management

2603.26324v1 by Eugenio Rodrigo Zimmer Neves, Amanda Vanon Correa, Camila Campioni, Gabielli Pare Guglielmi, Bruno Morelli

Most existing approaches to AI in pharmacy collapse three epistemologically distinct operations into a single technical layer: document preservation, semantic interpretation, and contextual presentation. This conflation is a root cause of recurring fragilities including loss of provenance, interpretive opacity, alert fatigue, and erosion of accountability. This paper proposes the PATOS-Lector-PRISMA (PLP) infrastructure as a normative information architecture for responsible pharmaceutical knowledge management. PATOS preserves regulatory documents with explicit versioning and provenance; Lector implements machine-assisted reading with human curation, producing typed assertions anchored to primary sources; PRISMA delivers contextual presentation through the RPDA framework (Regulatory, Prescription, Dispensing, Administration), refracting the same informational core into distinct professional views. The architecture introduces the Evidence Pack as a formal unit of accountable assertion (versioned, traceable, epistemically bounded, and curatorially validated), with assertions typified by illocutionary force. A worked example traces dipyrone monohydrate across all three layers using real system data. Developed and validated in Brazil's regulatory context, the architecture is grounded in an operational implementation comprising over 16,000 official documents and 38 curated Evidence Packs spanning five reference medications. The proposal is demonstrated as complementary to operational decision support systems, providing infrastructural conditions that current systems lack: documentary anchoring, interpretive transparency, and institutional accountability.

摘要：大多數現有的藥學人工智慧方法將三個在認識論上明顯不同的操作合併為一個單一的技術層：文件保存、語義解釋和上下文呈現。這種混淆是導致反覆出現的脆弱性的根本原因，包括來源丟失、解釋不明、警報疲勞和問責制侵蝕。本文提出PATOS--Lector--PRISMA (PLP) 基礎設施作為負責任的藥學知識管理的規範性信息架構。PATOS以明確的版本控制和來源保存監管文件；Lector實施人機協作的閱讀，產生與主要來源相連的類型化斷言；PRISMA通過RPDA框架（監管、處方、配藥、管理）提供上下文呈現，將相同的信息核心折射成不同的專業視角。該架構引入了證據包作為一個正式的可問責斷言單位（版本化、可追溯、認識論界定且經過策展驗證），其斷言以言外之意的強度為特徵。一個實例追踪了單硫酸二氫鈉在所有三個層面上的應用，使用真實系統數據。在巴西的監管背景下開發和驗證，該架構基於一個操作實施，包含超過16,000份官方文件和38個策展的證據包，涵蓋五種參考藥物。該提案被證明是對操作決策支持系統的補充，提供了當前系統所缺乏的基礎設施條件：文件錨定、解釋透明度和機構問責制。

XpertBench: Expert Level Tasks with Rubrics-Based Evaluation

2604.02368v1 by Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Yueyan Qiu, Yanle Ren, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu

As Large Language Models (LLMs) exhibit plateauing performance on conventional benchmarks, a pivotal challenge persists: evaluating their proficiency in the complex, open-ended tasks that characterize genuine expert-level cognition. Existing frameworks suffer from narrow domain coverage, reliance on generalist tasks, or self-evaluation biases. To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains. XpertBench consists of 1,346 meticulously curated tasks across 80 categories, spanning finance, healthcare, legal services, education, and dual-track research (STEM and Humanities). These tasks are derived from over 1,000 submissions by domain experts (including researchers from elite institutions and practitioners with extensive clinical or industrial experience), ensuring superior ecological validity. Each task uses a detailed rubric, typically with 15-40 weighted checkpoints, to assess professional rigor. To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases. Our empirical evaluation of state-of-the-art LLMs reveals a pronounced performance ceiling: even leading models achieve a peak success rate of only ~66%, with a mean score around 55%. Models also exhibit domain-specific divergence, showing non-overlapping strengths in quantitative reasoning versus linguistic synthesis. These findings underscore a significant "expert-gap" in current AI systems and establish XpertBench as a critical instrument for navigating the transition from general-purpose assistants to specialized professional collaborators.

摘要：隨著大型語言模型（LLMs）在傳統基準上表現出停滯的趨勢，一個關鍵挑戰依然存在：評估它們在複雜的、開放式任務中的能力，這些任務特徵是真正專家級的認知。現有框架存在狹窄的領域覆蓋、依賴於一般性任務或自我評估偏見的問題。為了填補這一空白，我們提出了 XpertBench，一個高保真度的基準，旨在評估 LLMs 在真實專業領域的表現。XpertBench 包含 1,346 個精心策劃的任務，涵蓋 80 個類別，涉及金融、醫療保健、法律服務、教育和雙軌研究（STEM 和人文學科）。這些任務源自超過 1,000 份來自領域專家的提交，包括來自精英機構的研究人員和具有豐富臨床或行業經驗的實踐者，確保了卓越的生態有效性。每個任務使用詳細的評分標準，通常包含 15-40 個加權檢查點，以評估專業的嚴謹性。為了促進可擴展但又符合人類需求的評估，我們引入了 ShotJudge，一種新穎的評估範式，利用經過專家少量示例校準的 LLM 評審，以減少自我獎勵偏見。我們對最先進的 LLMs 的實證評估顯示出明顯的性能上限：即使是領先模型的最高成功率也僅約為 66%，平均得分約為 55%。模型還表現出領域特定的差異，在定量推理與語言綜合方面顯示出不重疊的優勢。這些發現凸顯了當前 AI 系統中存在的重大「專家差距」，並確立了 XpertBench 作為從通用助手過渡到專業合作夥伴的重要工具。
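
The abstract above describes rubrics built from weighted checkpoints but does not say how checkpoint results are combined into a task score. One plausible reading, shown here purely as an assumption, is a weighted pass fraction. The checkpoint weights and pass/fail values below are invented, and `rubric_score` is a hypothetical helper, not part of XpertBench or ShotJudge.

```python
def rubric_score(checkpoints):
    """Weighted fraction of rubric checkpoints that a response satisfies.

    `checkpoints` is a list of (weight, passed) pairs.
    """
    total = sum(weight for weight, _ in checkpoints)
    earned = sum(weight for weight, passed in checkpoints if passed)
    return earned / total


# Hypothetical four-checkpoint rubric (real rubrics are said to have 15-40).
example_rubric = [(3, True), (2, False), (1, True), (4, True)]
score = rubric_score(example_rubric)
print(f"{score:.2f}")  # 8 of 10 weighted points earned
```

Under this reading, a judge only needs to decide each checkpoint independently, and the weights encode how much each criterion matters to professional rigor.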

Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI

2603.26186v1 by Jing Zhang, Bastien Bergere, Emilie Bollache, Jonas Leite, Mikaël Laredo, Alban Redheuil, Nadjia Kachenoura

Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and the lack of anatomical constraints, often leading to unreliable predictions. Accordingly, our aim was to propose a progressive learning strategy, inspired by a clinical workflow, to segment LA scar from LGE images. A 3-stage framework based on SwinUNETR was implemented, comprising: 1) an LA cavity pre-learning model, 2) a dual-task model that further learns the spatial relationship between LA geometry and scar patterns, and 3) fine-tuning for precise scar segmentation. Furthermore, we introduced an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge by constraining scar predictions to anatomically plausible LA wall regions while mitigating annotation bias. In preliminary results on validation LGE volumes from the LASCARQS public dataset with 5-fold cross-validation, LA segmentation reached a Dice score of 0.94, and LA scar segmentation achieved a Dice score of 0.50, a Hausdorff Distance of 11.84 mm, and an Average Surface Distance of 1.80 mm, outperforming a single-stage scar segmentation (0.49, 13.02 mm, and 1.96 mm, respectively). By explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning, the proposed approach improved the accuracy and reliability of LA scar segmentation from LGE, revealing the importance of clinically informed model design.

摘要：心臟 MRI 晚期鉺增強 (LGE) 能夠非侵入性地識別左心房 (LA) 瘢痕，其空間分佈與心房顫動 (AF) 的嚴重程度和復發密切相關。然而，由於對比度低、標註變異性以及缺乏解剖約束，自動化的 LA 瘢痕分割仍然具有挑戰性，這常常導致不可靠的預測。因此，我們的目標是提出一種漸進式學習策略，從 LGE 圖像中分割 LA 瘢痕，靈感來自臨床工作流程。我們實現了一個基於 SwinUNETR 的三階段框架，包括：1) 第一個 LA 腔體預學習模型，2) 雙任務模型進一步學習 LA 幾何形狀與瘢痕模式之間的空間關係，以及 3) 對瘢痕的精確分割進行微調。此外，我們引入了一種解剖學意識的空間加權損失，通過將瘢痕預測約束於解剖上合理的 LA 壁區域，同時減輕標註偏差，來融入先前的臨床知識。我們在 LASCARQS 公共數據集中經過 5 倍交叉驗證後獲得的初步結果顯示，LA 分割的 Dice 分數為 0.94，LA 瘢痕分割的 Dice 分數為 0.50，Hausdorff 距離為 11.84 mm，平均表面距離為 1.80 mm，分別優於僅有一階段瘢痕分割的 0.49、13.02 mm 和 1.96 mm。通過明確地將臨床解剖先驗和診斷推理嵌入深度學習中，所提出的方法提高了從 LGE 中進行 LA 瘢痕分割的準確性和可靠性，揭示了臨床知情模型設計的重要性。
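
For readers unfamiliar with the Dice score reported in the abstract above, the following is a minimal sketch of the standard definition, 2|P∩T| / (|P| + |T|), on flattened binary masks. The two toy masks are invented. This is not the authors' evaluation code, which operates on 3-D volumes.

```python
def dice_score(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy six-voxel masks: the prediction and the reference agree on two scar voxels.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 1, 1, 0]
dice = dice_score(pred_mask, true_mask)
print(f"{dice:.3f}")
```

Small, thin structures such as atrial scar are penalized heavily by this metric, since a few misplaced voxels are a large share of the total, which helps explain why scar Dice (0.50) sits far below cavity Dice (0.94).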

SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

2603.26122v1 by Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou

While recent advances in Large Language Models have significantly improved dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and a +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate a rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin disease data, which contains 564 clinical samples covering eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen's Kappa improvement.

摘要：雖然近期在大型語言模型方面的進展顯著推進了皮膚科診斷，但單一的 LLM 在細粒度、大規模多類別診斷任務和罕見皮膚疾病診斷方面經常面臨困難，這主要是由於訓練數據的稀疏性，同時也缺乏臨床推理所需的可解釋性和可追溯性。儘管多代理系統可以提供更透明和可解釋的診斷，但現有框架主要集中在視覺問答和對話任務上，並且對靜態知識庫的高度依賴限制了其在複雜現實臨床環境中的適應性。在此，我們提出了 SkinGPT-X，一個多模態協作多代理系統，專為皮膚科診斷而設，並整合了自我演化的皮膚科記憶機制。通過模擬皮膚科醫生的診斷工作流程並實現持續的記憶演變，SkinGPT-X 提供了透明且值得信賴的診斷，以管理複雜和罕見的皮膚科病例。為了驗證 SkinGPT-X 的穩健性，我們設計了一個三級比較實驗。首先，我們將 SkinGPT-X 與四個最先進的 LLM 在四個公共數據集上進行基準測試，顯示其在 DDI31 上的準確率提高了 +9.6%，在 Dermnet 上的加權 F1 分數提高了 +13%。其次，我們構建了一個涵蓋 498 種不同皮膚科類別的大型多類別數據集，以評估其細粒度分類能力。最後，我們整理了罕見皮膚疾病數據集，這是首個針對臨床罕見皮膚疾病稀缺問題的基準，包含 564 份臨床樣本，涵蓋八種罕見皮膚病。在這個數據集上，SkinGPT-X 實現了 +9.8% 的準確率提高，+7.1% 的加權 F1 提高，以及 +10% 的 Cohen's Kappa 提高。

Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

2603.26049v1 by Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao

Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze, a crucial cue for visual reasoning, remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context (including patient history, symptoms, and diagnostic intent) to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.

摘要：儘管最近在醫學視覺-語言預訓練方面取得了進展，但現有模型仍然難以捕捉診斷工作流程：放射線片通常被視為與上下文無關的圖像，而放射科醫師的注視—這一對視覺推理至關重要的線索—在現有方法中仍然大多未被探索。這些限制阻礙了對特定疾病模式的建模，並削弱了跨模態的對齊。為了填補這一空白，我們提出了CoGaze，一種針對胸部X光片的上下文和注視引導的視覺-語言預訓練框架。我們首先提出了一種上下文融入的視覺編碼器，該編碼器建模放射科醫師如何整合臨床上下文—包括病史、症狀和診斷意圖—以指導診斷推理。然後，我們提出了一種多層次的監督範式，該範式 (1) 通過混合正對比學習強化內部和跨模態的語義對齊，(2) 通過疾病感知的跨模態表示學習注入診斷先驗，(3) 利用放射科醫師的注視作為概率先驗，引導注意力集中在診斷上重要的區域。大量實驗表明，CoGaze在各種任務中始終優於最先進的方法，在自由文本和結構化報告生成中達到高達 +2.0% 的 CheXbertF1 和 +1.2% 的 BLEU2，在零樣本分類中達到 +23.2% 的 AUROC，以及在圖像-文本檢索中達到 +12.2% 的 Precision@1。代碼可在 https://github.com/mk-runner/CoGaze 獲得。

Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features

2603.26019v1 by Mengdi Liu, Qiang Li, Weizhi Nie, Shaopeng Zhang, Yuting Su

Type A Aortic Dissection (TAAD) is a life-threatening cardiovascular emergency that demands rapid and precise preoperative evaluation. While key anatomical and pathological features are decisive for surgical planning, current research focuses predominantly on improving segmentation accuracy, leaving the reliable, quantitative extraction of clinically actionable features largely under-explored. Furthermore, constructing comprehensive TAAD datasets requires labor-intensive, expert-level pixel-wise annotations, which are impractical for most clinical institutions. Due to significant domain shift, models trained on a single-center dataset also suffer from severe performance degradation during cross-institutional deployment. This study addresses a clinically critical challenge: the accurate extraction of key TAAD clinical features during cross-institutional deployment in the total absence of target-domain annotations. To this end, we propose an unsupervised domain adaptation (UDA)-driven framework for the automated extraction of TAAD clinical features. The framework leverages limited source-domain labels while effectively adapting to unlabeled data from target domains. Tailored for real-world emergency workflows, our framework aims to achieve stable cross-institutional multi-class segmentation, reliable and quantifiable clinical feature extraction, and practical deployability independent of high-cost annotations. Extensive experiments demonstrate that our method significantly improves cross-domain segmentation performance compared to existing state-of-the-art approaches. More importantly, a reader study involving multiple cardiovascular surgeons confirms that the automatically extracted clinical features provide meaningful assistance for preoperative assessment, highlighting the practical utility of the proposed end-to-end segmentation-to-feature pipeline.

摘要：型 A 主動脈剝離 (TAAD) 是一種危及生命的心血管緊急情況，要求快速且精確的術前評估。雖然關鍵的解剖和病理特徵對於手術規劃至關重要，但當前的研究主要集中在提高分割準確性上，導致臨床可行特徵的可靠、定量提取仍然未被充分探索。此外，構建全面的 TAAD 數據集需要耗時的專家級像素級註釋，這對於大多數臨床機構來說是不切實際的。由於顯著的領域轉移，基於單一中心數據集訓練的模型在跨機構部署期間也會遭遇嚴重的性能下降。本研究針對一個臨床關鍵挑戰：在完全缺乏目標域註釋的情況下，準確提取關鍵的 TAAD 臨床特徵。為此，我們提出了一個基於無監督領域適應 (UDA) 的框架，用於自動提取 TAAD 臨床特徵。該框架利用有限的源域標籤，同時有效地適應來自目標域的未標記數據。我們的框架專為現實世界的緊急工作流程量身定制，旨在實現穩定的跨機構多類別分割、可靠且可量化的臨床特徵提取，以及獨立於高成本註釋的實用部署。廣泛的實驗表明，我們的方法在跨域分割性能上顯著優於現有的最先進方法。更重要的是，涉及多位心血管外科醫生的讀者研究確認，自動提取的臨床特徵對術前評估提供了有意義的幫助，突顯了所提議的端到端分割到特徵管道的實用性。

FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

2603.26008v1 by Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong

While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

摘要：雖然在圖像條件生成方面具有強大能力，多模態大型語言模型（MLLMs）在不同人口群體之間的表現卻可能不均衡，凸顯了公平風險。在安全至關重要的臨床環境中，這種差異可能導致不平等的診斷敘事，並侵蝕對AI輔助決策的信任。儘管公平性在僅限於視覺和僅限於語言的模型中已被廣泛研究，但其對MLLMs的影響仍然大多未被探討。為了解決這些偏見，我們引入了FairLLaVA，一種參數高效的微調方法，能在不妥協整體性能的情況下減輕視覺指令調整中的群體差異。通過最小化目標屬性之間的互信息，FairLLaVA使模型的表示變得與人口統計無關。該方法可以作為輕量級插件納入，保持低秩適配器微調的效率，並提供一種與架構無關的公平視覺指令跟隨方法。在大規模胸部放射報告生成和皮膚鏡視覺問題回答基準上的廣泛實驗表明，FairLLaVA持續減少群體間的差異，同時提高各種醫學影像模式下的公平性縮放臨床表現和自然語言生成質量。代碼可在 https://github.com/bhosalems/FairLLaVA 獲取。

Longitudinal Boundary Sharpness Coefficient Slopes Predict Time to Alzheimer's Disease Conversion in Mild Cognitive Impairment: A Survival Analysis Using the ADNI Cohort

2603.26007v1 by Ishaan Cherukuri

Predicting whether someone with mild cognitive impairment (MCI) will progress to Alzheimer's disease (AD) is crucial in the early stages of neurodegeneration. This uncertainty limits enrollment in clinical trials and delays urgent treatment. The Boundary Sharpness Coefficient (BSC) measures how well-defined the gray-white matter boundary looks on structural MRI. This study shows that measuring how BSC changes over time, namely how fast the boundary degrades each year, works much better than looking at a single baseline scan for predicting MCI-to-AD conversion. This study analyzed 1,824 T1-weighted MRI scans from 450 ADNI subjects (95 converters, 355 stable; mean follow-up: 4.84 years). BSC voxel-wise maps were computed using tissue segmentation at the gray-white matter cortical ribbon. Previous studies have used CNN and RNN models that reached 96.0% accuracy for AD classification and 84.2% for MCI conversion, but those approaches disregard specific regions within the brain. This study focused specifically on the gray-white matter interface. The approach uses temporal slope features that capture boundary degradation rates and feeds them into a Random Survival Forest, a non-parametric ensemble method for right-censored survival data. The Random Survival Forest trained on BSC slopes achieved a test C-index of 0.63, a 163% improvement over baseline parametric models (test C-index: 0.24). Structural MRI costs a fraction of PET imaging ($800-$1,500 vs. $5,000-$7,000) and does not require CSF collection. These temporal biomarkers could help with patient-centered safety screening as well as risk assessment.

摘要：預測輕度認知障礙（MCI）患者是否會進展為阿茲海默症（AD）在神經退行性疾病的早期階段至關重要。這種不確定性限制了臨床試驗的入組並延遲了緊急治療。邊界銳利度係數（BSC）衡量結構性MRI中灰白質邊界的清晰程度。這項研究測量了BSC隨時間的變化，即邊界每年退化的速度，這比僅僅查看單一基線掃描在預測MCI轉換為AD方面要好得多。這項研究分析了來自450名ADNI受試者的1,824個T1加權MRI掃描（95名轉換者，355名穩定者；平均隨訪：4.84年）。BSC體素級地圖是通過在灰白質皮質帶進行組織分割計算得出的。先前的研究使用了CNN和RNN模型，達到了96.0%的AD分類準確率和84.2%的MCI轉換準確率，但這些方法忽略了大腦內的特定區域。這項研究特別關注灰白質界面。該方法使用捕捉邊界退化速率的時間斜率特徵，並將其輸入隨機生存森林（Random Survival Forest），這是一種針對右截尾生存數據的非參數集成方法。基於BSC斜率訓練的隨機生存森林達到了0.63的測試C指數，較基線參數模型（測試C指數：0.24）提高了163%。結構性MRI的成本僅為PET成像的一小部分（$800--$1,500對比$5,000--$7,000），且不需要收集腦脊液。這些時間生物標記可能有助於以患者為中心的安全篩查以及風險評估。
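
The C-index figures in the abstract above can be unpacked with a short sketch of Harrell's concordance index on right-censored data, followed by the relative-improvement arithmetic. The four toy subjects are invented, and this is the textbook pairwise definition rather than the authors' evaluation code.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the predicted risks order correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i converted before subject j's
            # last observed time (whether j later converted or was censored).
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable


# Toy cohort: follow-up in years, 1 = converted, 0 = censored, and a predicted risk.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.3, 0.1, 0.4]
c_index = concordance_index(times, events, risks)
print(f"toy C-index: {c_index:.2f}")

# Relative improvement implied by the reported test C-indices (0.63 vs. 0.24).
improvement_pct = (0.63 - 0.24) / 0.24 * 100
print(f"relative improvement: {improvement_pct:.1f}%")
```

The computed relative improvement is about 162.5%, consistent with the reported 163% after rounding. A C-index of 0.5 corresponds to random ordering, which is useful context when reading both the 0.63 result and the 0.24 baseline.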

When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models

2603.25960v1 by Binesh Sadanandan, Vahid Behzadan

Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasing position bias from 0.14 to 0.47. Shuffling answer options causes the model to change predictions 59.1% of the time, with accuracy dropping up to 27.4 percentage points. Front-truncating context to 50% causes accuracy to plummet below the no-context baseline, yet back-truncation preserves 97% of full-context accuracy. We further show that cloze scoring (selecting the highest log-probability option token) achieves 51.8% (4B) and 64.5% (27B), surpassing all prompting strategies and revealing that models "know" more than their generated text shows. Permutation voting recovers 4 percentage points over single-ordering inference. These results demonstrate that prompt engineering techniques validated on general-purpose models do not transfer to domain-specific medical LLMs, and that reliable alternatives exist.
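
Permutation voting, as mentioned above, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: `predict` is a hypothetical callable that returns the index of the chosen option in the order shown, and the toy model below is a made-up stand-in with a strong position bias.

```python
from collections import Counter
from itertools import permutations


def permutation_vote(question, options, predict):
    """Majority vote over every ordering of the answer options.

    `predict(question, shown_options)` is a hypothetical callable returning
    the index of the chosen option within `shown_options`.
    """
    votes = Counter()
    for order in permutations(range(len(options))):
        shown = [options[j] for j in order]
        picked = predict(question, shown)
        votes[order[picked]] += 1  # map back to the original option index
    winner, _ = votes.most_common(1)[0]
    return winner


def position_biased_model(question, shown):
    """Toy model: finds the right answer only in the first two slots,
    otherwise falls back to slot 0 (position bias)."""
    if "warfarin" in shown[:2]:
        return shown.index("warfarin")
    return 0


options = ["aspirin", "heparin", "warfarin", "insulin"]
single = position_biased_model("toy question", options)             # picks slot 0
voted = permutation_vote("toy question", options, position_biased_model)
```

With four options this costs 24 model calls per question. In this toy setting the single-ordering answer is wrong (index 0), while the vote recovers index 2 because the biased picks are spread across the other options.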

摘要：大型語言模型（LLMs）在醫療環境中的應用日益增多，但它們對提示格式的敏感性仍然缺乏充分的特徵描述。我們在一系列穩健性測試中評估了 MedGemma（4B 和 27B 參數）在 MedMCQA（4,183 題問題）和 PubMedQA（1,000 題問題）上的表現。我們的實驗揭示了幾個令人擔憂的發現。思維鏈（CoT）提示的準確性比直接回答降低了 5.7%。少量示例使性能下降了 11.9%，同時將位置偏見從 0.14 增加到 0.47。打亂答案選項導致模型在 59.1% 的情況下改變預測，準確率下降最多達 27.4 個百分點。將上下文前置截斷到 50% 使準確率跌至低於無上下文基準，而後置截斷則保留了 97% 的全上下文準確率。我們進一步顯示，填空評分（選擇最高對數概率選項標記）在 4B 中達到 51.8% 和在 27B 中達到 64.5%，超越了所有提示策略，並揭示模型“知道”的信息超過其生成文本所顯示的。排列投票在單次排序推理上恢復了 4 個百分點。這些結果表明，針對通用模型驗證的提示工程技術並不適用於特定領域的醫療 LLM，並且存在可靠的替代方案。

Methods for Knowledge Graph Construction from Text Collections: Development and Applications

2603.25862v1 by Vanni Zavarella

Virtually every sector of society is experiencing a dramatic growth in the volume of unstructured textual data that is generated and published, from news and social media online interactions, through open access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across all this range of domains has created both unprecedented opportunities and pressing challenges for extracting actionable knowledge for several application scenarios. However, the extraction of rich semantic knowledge demands the deployment of scalable and flexible automatic methods adaptable across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques for the construction of full-fledged Knowledge Graphs, that are semantically transparent, explainable by design and interoperable. In this thesis, we experiment with the application of Natural Language Processing, Machine Learning and Generative AI methods, powered by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora, in three use case applications: the analysis of the Digital Transformation discourse in the global news and social media platforms; the mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; the generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The contributions of this thesis to the research community are in terms of benchmark evaluation results, the design of customized algorithms and the creation of data resources in the form of Knowledge Graphs, together with data analysis results built on top of them.

摘要：幾乎每個社會領域都在經歷著未結構化文本數據生成和發佈量的劇增，這些數據來自於新聞和社交媒體的在線互動、開放存取的學術交流以及以數位健康記錄和在線藥物評價形式呈現的觀察數據。這些領域中數據的量和多樣性創造了前所未有的機會和迫切的挑戰，以提取可行的知識以應用於多個場景。然而，提取豐富的語義知識需要部署可擴展且靈活的自動化方法，這些方法能夠適應不同的文本類型和架構規範。此外，這些數據的全部潛力僅能通過將信息提取方法與語義網技術結合來釋放，以構建語義透明、設計上可解釋且可互操作的完整知識圖譜。在本論文中，我們實驗應用自然語言處理、機器學習和生成式AI方法，這些方法由語義網最佳實踐驅動，實現從大型文本語料庫自動構建知識圖譜，並針對三個使用案例進行應用：分析全球新聞和社交媒體平台中的數位轉型話語；從大量出版物中映射和趨勢分析建築、工程、建設和運營領域的最新研究；從電子健康記錄和患者撰寫的藥物評價中生成生物醫學實體的因果關係圖。這篇論文對研究社群的貢獻體現在基準評估結果、定制算法的設計以及以知識圖譜形式創建的數據資源，連同基於這些資源構建的數據分析結果。

Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

2603.25821v1 by Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk

We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.

摘要：我們提出了 Doctorina MedBench，這是一個基於模擬現實醫生-病人互動的代理醫療 AI 的綜合評估框架。與依賴解決標準化測試問題的傳統醫療基準不同，所提出的方法模型化了一個多步驟的臨床對話，在這個對話中，醫生或 AI 系統必須收集病歷、分析附加材料（包括實驗室報告、影像和醫療文件）、制定鑑別診斷並提供個性化建議。系統性能使用 D.O.T.S. 指標進行評估，該指標由四個組成部分構成：診斷、觀察/調查、治療和步驟計數，能夠評估臨床正確性和對話效率。

該系統還包含一個多層次的測試和質量監控架構，旨在在開發和部署期間檢測模型退化。該框架支持以安全為導向的陷阱案例、基於類別的臨床場景隨機抽樣以及全面的回歸測試。數據集目前包含超過 1,000 個臨床案例，涵蓋超過 750 種診斷。評估指標的普遍性使得該框架不僅可以用來評估醫療 AI 系統，還可以評估醫生並支持臨床推理技能的發展。我們的結果表明，臨床對話的模擬可能提供比傳統考試風格基準更現實的臨床能力評估。

Beyond identifiability: Learning causal representations with few environments and finite samples

2603.25796v1 by Inbeom Lee, Tongtong Jin, Bryon Aragam

We provide explicit, finite-sample guarantees for learning causal representations from data with a sublinear number of environments. Causal representation learning seeks to provide a rigorous foundation for the general representation learning problem by bridging causal models with latent factor models in order to learn interpretable representations with causal semantics. Despite a blossoming theory of identifiability in causal representation learning, estimation and finite-sample bounds are less well understood. We show that causal representations can be learned with only a logarithmic number of unknown, multi-node interventions, and that the intervention targets need not be carefully designed in advance. Through a careful perturbation analysis, we provide a new analysis of this problem that guarantees consistent recovery of (a) the latent causal graph, (b) the mixing matrix and representations, and (c) unknown intervention targets.

摘要：我們提供了明確的有限樣本保證，以從具有次線性環境數量的數據中學習因果表示。因果表示學習旨在通過將因果模型與潛在因子模型相結合，為一般表示學習問題提供嚴謹的基礎，以學習具有因果語義的可解釋表示。儘管因果表示學習的可識別性理論正在蓬勃發展，但估計和有限樣本界限的理解仍然較少。我們展示了因果表示可以僅通過對數量的未知多節點干預來學習，並且干預目標不必事先仔細設計。通過仔細的擾動分析，我們提供了這個問題的新分析，保證一致地恢復 (a) 潛在因果圖，(b) 混合矩陣和表示，以及 (c) \emph{未知} 干預目標。

DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

2603.25607v1 by Zhenchen Zhu, Ge Hu, Weixiong Tan, Kai Gao, Chao Sun, Zhen Zhou, Kepei Xu, Wei Han, Meixia Shang, Xiaoming Qiu, Yiqing Tan, Jinhua Wang, Zhoumeng Ying, Li Peng, Wei Song, Lan Song, Zhengyu Jin, Nan Hong, Yizhou Yu

The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.

摘要：CT的廣泛應用顯著增加了檢測到的肺結節數量。然而，當前用於分類良性和惡性結節的深度學習方法往往未能全面整合全球和局部特徵，且大多數尚未通過臨床試驗進行驗證。為了解決這個問題，我們開發了DeepFAN，一種基於Transformer的模型，該模型在超過10K病理確認的結節上進行了訓練，並進一步進行了多讀者、多案例的臨床試驗，以評估其在輔助初級放射科醫生方面的有效性。DeepFAN在內部測試集上達到了0.939的診斷曲線下面積（AUC）（95% CI 0.930-0.948），在涉及三個獨立醫療機構的400個案例的臨床試驗數據集上達到了0.954（95% CI 0.934-0.973）。可解釋性分析顯示全球特徵的貢獻高於局部特徵。十二位讀者的平均表現顯著提高了10.9%（95% CI 8.3%-13.5%）的AUC，10.0%（95% CI 8.9%-11.1%）的準確率，7.6%（95% CI 6.1%-9.2%）的敏感性，以及12.6%（95% CI 10.9%-14.3%）的特異性（所有P<0.001）。結節級別的讀者間診斷一致性從公平改善到中等（整體k: 0.313 vs. 0.421; P=0.019）。總之，DeepFAN有效地輔助了初級放射科醫生，並可能有助於均化診斷質量，減少對不確定肺結節的不必要隨訪。中國臨床試驗登記：ChiCTR2400084624。

Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models

2603.25495v1 by Moazzam Umer Gondal, Hamad ul Qudous, Asma Ahmad Farhan, Sultan Alamri

Accurate short-term air-quality forecasting is essential for public health protection and urban management, yet many recent forecasting frameworks rely on complex, data-intensive, and computationally demanding models. This study investigates whether lightweight and interpretable forecasting approaches can provide competitive performance for hourly PM2.5 prediction in Beijing, China. Using multi-year pollutant and meteorological time-series data, we developed a leakage-aware forecasting workflow that combined chronological data partitioning, preprocessing, feature selection, and exogenous-driver modeling under the Perfect Prognosis setting. Three forecasting families were evaluated: SARIMAX, Facebook Prophet, and NeuralProphet. To assess practical deployment behavior, the models were tested under two adaptive regimes: weekly walk-forward refitting and frozen forecasting with online residual correction. Results showed clear differences in both predictive accuracy and computational efficiency. Under walk-forward refitting, Facebook Prophet achieved the strongest completed performance, with an MAE of 37.61 and an RMSE of 50.10, while also requiring substantially less execution time than NeuralProphet. In the frozen-model regime, online residual correction improved Facebook Prophet and SARIMAX, with corrected SARIMAX yielding the lowest overall error (MAE 32.50; RMSE 46.85). NeuralProphet remained less accurate and less stable across both regimes, and residual correction did not improve its forecasts. Notably, corrected Facebook Prophet reached nearly the same error as its walk-forward counterpart while reducing runtime from 15 min 21.91 sec to 46.60 sec. These findings show that lightweight additive forecasting strategies can remain highly competitive for urban air-quality prediction, offering a practical balance between accuracy, interpretability, ...
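
The frozen-model regime with online residual correction can be sketched as below. The abstract does not specify the correction scheme, so the exponentially smoothed residual used here is an assumption, as are the function names and the made-up PM2.5 values. MAE and RMSE follow their standard definitions.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def residual_corrected(actual, frozen_forecast, alpha=0.5):
    """Add a smoothed estimate of past residuals to each frozen forecast.

    At step t only residuals observed before t are used, so no future
    observation leaks into the correction (assumed scheme, not the paper's).
    """
    corrected = []
    bias = 0.0
    for y, f in zip(actual, frozen_forecast):
        corrected.append(f + bias)
        bias = alpha * (y - f) + (1 - alpha) * bias  # update once y is observed
    return corrected


# Made-up hourly PM2.5 values and a frozen model that runs consistently low.
observed = [80.0, 90.0, 100.0, 110.0, 120.0]
frozen = [60.0, 70.0, 80.0, 90.0, 100.0]
corrected = residual_corrected(observed, frozen)

frozen_mae = mae(observed, frozen)        # 20.0
corrected_mae = mae(observed, corrected)  # 7.75
```

The model itself is never refit here, which is why this regime is cheap compared with weekly walk-forward refitting: only a running bias term is updated as observations arrive.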

摘要：準確的短期空氣質量預測對於公共健康保護和城市管理至關重要，但許多最近的預測框架依賴於複雜、數據密集且計算需求高的模型。這項研究探討輕量且可解釋的預測方法是否能在中國北京的每小時 PM2.5 預測中提供競爭性的表現。利用多年來的污染物和氣象時間序列數據，我們開發了一個考慮洩漏的預測工作流程，該流程結合了時間數據分區、預處理、特徵選擇和外部驅動建模，並在完美預測的設置下進行。評估了三個預測家族：SARIMAX、Facebook Prophet 和 NeuralProphet。為了評估實際部署行為，這些模型在兩種自適應模式下進行了測試：每週前向重擬合和帶有在線殘差修正的冷凍預測。結果顯示在預測準確性和計算效率上存在明顯差異。在前向重擬合下，Facebook Prophet 實現了最強的完成表現，MAE 為 $37.61$，RMSE 為 $50.10$，同時所需的執行時間也顯著少於 NeuralProphet。在冷凍模型模式下，在線殘差修正改善了 Facebook Prophet 和 SARIMAX，修正後的 SARIMAX 產生了最低的整體誤差（MAE $32.50$；RMSE $46.85$）。NeuralProphet 在這兩種模式下的準確性和穩定性均較低，且殘差修正並未改善其預測。值得注意的是，修正後的 Facebook Prophet 的誤差幾乎與其前向對應物相同，同時將運行時間從 $15$ 分 $21.91$ 秒減少到 $46.60$ 秒。這些發現顯示，輕量的加法預測策略在城市空氣質量預測中仍然可以保持高度競爭力，提供準確性、可解釋性之間的實際平衡，...

Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

2603.25403v2 by Eyal Hadad, Mordechai Guri

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.
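
The source of the Tier 1 signal, a patch count that depends on image geometry, can be illustrated with a simplified tiling rule. This is a sketch under assumptions, not the preprocessing code of any specific model: the tile size, the candidate canvases, and the selection rule (largest retained resolution, then least wasted area) are illustrative choices in the spirit of AnyRes-style preprocessing.

```python
BASE = 336  # assumed tile side in pixels
# Assumed candidate canvases as (width, height), each a multiple of BASE.
CANDIDATES = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]


def pick_canvas(width, height, candidates=CANDIDATES):
    """Choose the canvas that keeps the most image resolution, then wastes the least area."""
    best, best_key = None, None
    for cw, ch in candidates:
        scale = min(cw / width, ch / height)
        fitted = int(width * scale) * int(height * scale)
        effective = min(fitted, width * height)  # upscaling adds no information
        wasted = cw * ch - effective
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best, best_key = (cw, ch), key
    return best


def tile_count(width, height):
    """Number of tiles sent to the vision encoder, plus one downscaled global view."""
    cw, ch = pick_canvas(width, height)
    return (cw // BASE) * (ch // BASE) + 1


small_square = tile_count(300, 300)    # 3
large_square = tile_count(1000, 1000)  # 5
wide_banner = tile_count(2000, 500)    # 4
```

Because encoder work grows with the tile count, inputs of different shapes produce different amounts of computation, which is the kind of workload-dependent behavior an execution-time observer could pick up.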

摘要：在裝置上的視覺-語言模型（VLMs）透過本地執行承諾數據隱私。然而，我們顯示出向動態高解析度預處理（例如 AnyRes）的架構轉變引入了一個固有的算法側信道。與靜態模型不同，動態預處理根據圖像的長寬比將其分解為可變數量的補丁，創造出依賴於工作負載的輸入。我們展示了一個針對本地 VLMs 的雙層攻擊框架。在第一層，未特權的攻擊者可以利用標準未特權操作系統指標來利用顯著的執行時間變化，可靠地指紋輸入的幾何形狀。在第二層，通過分析最後級快取（LLC）的競爭，攻擊者可以解決相同幾何形狀中的語義模糊，區分視覺上密集（例如醫療 X 光）和稀疏（例如文本文件）內容。通過評估最先進的模型，如 LLaVA-NeXT 和 Qwen2-VL，我們顯示結合這些信號能夠可靠地推斷隱私敏感的上下文。最後，我們分析了減輕此漏洞的安全工程權衡，揭示了使用常數工作填充的顯著性能開銷，並提出了安全邊緣 AI 部署的實用設計建議。

A Causal Framework for Evaluating ICU Discharge Strategies

2603.25397v1 by Sagar Nagaraj Simha, Juliette Ortholand, Dave Dongelmans, Jessica D. Workum, Olivier W. M. Thijssens, Ameen Abu-Hanna, Giovanni Cinà

In this applied paper, we address the difficult open problem of when to discharge patients from the Intensive Care Unit. This can be conceived as an optimal stopping scenario with three added challenges: 1) the evaluation of a stopping strategy from observational data is itself a complex causal inference problem, 2) the composite objective is to minimize the length of intervention and maximize the outcome, but the two cannot be collapsed to a single dimension, and 3) the recording of variables stops when the intervention is discontinued. Our contributions are two-fold. First, we generalize the implementation of the g-formula Python package, providing a framework to evaluate stopping strategies for problems with the aforementioned structure, including positivity and coverage checks. Second, with a fully open-source pipeline, we apply this approach to MIMIC-IV, a public ICU dataset, demonstrating the potential for strategies that improve upon current care.

摘要：在這篇應用論文中，我們探討了何時將病人從加護病房出院這一困難的未解問題。這可以被視為一個最佳停止情境，但面臨三個額外挑戰：1) 從觀察數據評估停止策略本身就是一個複雜的因果推斷問題，2) 複合目標是最小化干預時間並最大化結果，但這兩者無法簡化為單一維度，以及 3) 當干預停止時，變數的記錄也會停止。我們的貢獻有兩方面。首先，我們推廣了 g-formula Python 套件的實施，提供了一個框架來評估具有上述結構的問題的停止策略，包括正性和覆蓋性檢查。其次，通過一個完全開源的流程，我們將這一方法應用於 MIMIC-IV，一個公共的加護病房數據集，展示了改進當前護理的策略潛力。

Evaluating Language Models for Harmful Manipulation

2603.25326v3 by Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

摘要：對於由人工智慧驅動的有害操控概念的興趣正在增長，但目前評估這一概念的方法仍然有限。本文提出了一個通過特定情境的人機互動研究來評估有害AI操控的框架。我們通過評估一個涉及10,101名參與者的AI模型來說明這一框架的實用性，這些參與者的互動涵蓋了三個AI使用領域（公共政策、金融和健康）以及三個地區（美國、英國和印度）。總體而言，我們發現，經過測試的模型在被提示時可以產生操控行為，並且在實驗環境中能夠引發研究參與者的信念和行為變化。我們進一步發現，情境是重要的：AI操控在不同領域之間有所不同，這表明它需要在AI系統可能被使用的高風險情境中進行評估。我們還發現，在我們測試的地理區域之間存在顯著差異，這表明來自某一地理區域的AI操控結果可能無法推廣到其他地區。最後，我們發現AI模型的操控行為頻率（傾向性）並不總是能準確預測操控成功的可能性（有效性），這強調了分開研究這些維度的重要性。為了促進我們評估框架的採用，我們詳細說明了我們的測試協議並公開相關材料。我們最後討論了評估AI模型有害操控的開放挑戰。

AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study

2603.25322v1 by Wenlong Hou, Sheng Bi, Guangqian Yang, Lihao Liu, Ye Du, Hanxiao Xue, Juncheng Wang, Yuxiang Feng, Yue Xun, Nanxi Yu, Ning Mao, Mo Yang, Yi Wah Eva Cheung, Ling Long, Kay Chen Tan, Lequan Yu, Xiaomeng Ma, Shaozhen Yan, Shujun Wang

Alzheimer's disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remain robust (80.4%-98.8%), and the agent consistently outperforms all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute gains over eight backbone LLMs and converges their performance. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.

摘要：阿茲海默症（AD）隨著人口老化而成為一個日益嚴重的全球健康挑戰，及時且準確的診斷對於減少個人和社會負擔至關重要。然而，現實世界中的AD評估受到不完整、異質的多模態數據以及不同地點和患者人口統計變異的阻礙。儘管大型語言模型（LLMs）在生物醫學領域顯示出潛力，但它們在AD中的應用主要限於回答狹隘的、特定於疾病的問題，而不是生成支持臨床決策的綜合診斷報告。在此，我們通過引入AD-CARE擴展LLM在臨床決策支持中的能力，這是一個與模態無關的代理，能夠從不完整、異質的輸入中執行基於指南的診斷評估，而無需填補缺失的模態。通過動態協調專門的診斷工具並將臨床指南嵌入LLM驅動的推理中，AD-CARE生成與現實世界臨床工作流程相符的透明報告風格輸出。在包括10,303個案例的六個隊列中，AD-CARE達到了84.9%的診斷準確率，較基準方法提高了4.2%-13.7%的相對改進。儘管隊列層級存在差異，數據集特定的準確率仍然穩健（80.4%-98.8%），且該代理始終超越所有基準。AD-CARE減少了不同種族和年齡子群之間的性能差異，分別降低了四個指標的平均離散度21%-68%和28%-51%。在一項受控讀者研究中，該代理提高了神經科醫生和放射科醫生的準確率6%-11%，並將決策時間減半以上。該框架在八個基礎LLM上實現了2.29%-10.66%的絕對增益，並使其性能收斂。這些結果顯示，AD-CARE是一個可擴展、實際可部署的框架，可以整合進入AD的常規臨床工作流程中，以提供多模態的決策支持。

A Gait Foundation Model Predicts Multi-System Health Phenotypes from 3D Skeletal Motion

2603.25283v1 by Adam Gabet, Sarah Kohn, Guy Lutsker, Shira Gelman, Anastasia Godneva, Gil Sasson, Arad Zulti, David Krongauz, Rotem Shaulitch, Assaf Rotem, Ohad Doron, Yuval Brodsky, Adina Weinberger, Eran Segal

Gait is increasingly recognized as a vital sign, yet current approaches treat it as a symptom of specific pathologies rather than a systemic biomarker. We developed a gait foundation model for 3D skeletal motion from 3,414 deeply phenotyped adults, recorded via a depth camera during five motor tasks. Learned embeddings outperformed engineered features, predicting age (Pearson r = 0.69), BMI (r = 0.90), and visceral adipose tissue area (r = 0.82). Embeddings significantly predicted 1,980 of 3,210 phenotypic targets; after adjustment for age, BMI, VAT, and height, gait provided independent gains in all 18 body systems in males and 17 of 18 in females, and improved prediction of clinical diagnoses and medication use. Anatomical ablation revealed that legs dominated metabolic and frailty predictions while torso encoded sleep and lifestyle phenotypes. These findings establish gait as an independent multi-system biosignal, motivating translation to consumer-grade video and its integration as a scalable, passive vital sign.

摘要：步態越來越被認為是一種重要的生命體徵，然而目前的方法將其視為特定病理的症狀，而非系統性的生物標記。我們從3,414名深度表型的成年人中開發了一個3D骨骼運動的步態基礎模型，這些數據是通過深度相機在五項運動任務中錄製的。學習到的嵌入超越了工程特徵，能夠預測年齡（Pearson r = 0.69）、BMI（r = 0.90）和內臟脂肪組織面積（r = 0.82）。嵌入顯著預測了3,210個表型目標中的1,980個；在調整年齡、BMI、VAT和身高後，步態在男性的所有18個身體系統中提供了獨立的增益，而在女性中則是18個系統中的17個，並改善了臨床診斷和用藥的預測。解剖性切除顯示，腿部主導了代謝和虛弱的預測，而軀幹則編碼了睡眠和生活方式的表型。這些發現確立了步態作為一種獨立的多系統生物信號，促使其轉化為消費級視頻並作為可擴展的被動生命體徵整合。

A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

2603.25196v1 by Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen

Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents published in the last decade by 9 countries/regions and 2 international organizations, spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and knowing where they come from. The adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety-critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.

摘要：臨床實踐指導方針（CPGs）在確保基於證據的決策和改善患者結果方面發揮著關鍵作用。儘管大型語言模型（LLMs）在醫療場景中的應用越來越普遍，但尚不清楚LLMs在對話中能在多大程度上識別和遵循CPGs。為了解決這一空白，我們引入了CPGBench，這是一個自動化框架，用於基準測試LLMs在多輪對話中檢測和遵循臨床指導方針的能力。我們收集了來自9個國家/地區和2個國際組織的3,418份CPG文件，這些文件在過去十年內發表，涵蓋了24個專業領域。從這些文件中，我們提取了32,155條臨床建議，並附上相應的出版機構、日期、國家、專業、建議強度、證據水平等信息。根據每條建議生成一個多輪對話，以評估8個領先LLMs的檢測和遵循能力。我們發現71.1%-89.6%的建議可以被正確檢測，而只有3.6%-29.7%的相應標題可以被正確引用，揭示了了解指導方針內容與其來源之間的差距。不同模型的遵循率範圍從21.8%到63.2%，顯示出了解指導方針與能夠應用它們之間的巨大差距。為了確認我們自動分析的有效性，我們進一步進行了一項綜合性的人類評估，涉及來自不同專業的56名臨床醫生。據我們所知，CPGBench是第一個系統性揭示LLMs在對話中未能檢測或遵循的臨床建議的基準測試。鑑於每條臨床建議可能影響大量人群，且臨床應用本質上具有安全關鍵性，解決這些差距對於在現實世界臨床實踐中安全和負責任地部署LLMs至關重要。

Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control

2603.25771v1 by Mutong Liu, Yang Liu, Jiming Liu

Reinforcement learning (RL), owing to its adaptability to various dynamic systems in many real-world scenarios and the capability of maximizing long-term outcomes under different constraints, has been used in infectious disease control to optimize the intervention strategies for controlling infectious disease spread and responding to outbreaks in recent years. The potential of RL for assisting public health sectors in preventing and controlling infectious diseases is gradually emerging and being explored by rapidly increasing publications relevant to COVID-19 and other infectious diseases. However, few surveys exclusively discuss this topic, that is, the development and application of RL approaches for optimizing strategies of non-pharmaceutical and pharmaceutical interventions of public health. Therefore, this paper aims to provide a concise review and discussion of the latest literature on how RL approaches have been used to assist in controlling the spread and outbreaks of infectious diseases, covering several critical topics addressing public health demands: resource allocation, balancing between lives and livelihoods, mixed policy of multiple interventions, and inter-regional coordinated control. Finally, we conclude the paper with a discussion of several potential directions for future research.

摘要：強化學習（RL）因其對許多現實世界場景中各種動態系統的適應性以及在不同約束條件下最大化長期結果的能力，近年來已被用於傳染病控制，以優化控制傳染病擴散和應對疫情的干預策略。 RL在協助公共衛生部門預防和控制傳染病方面的潛力逐漸顯現，並且與COVID-19及其他傳染病相關的出版物迅速增加，正在被探索。然而，專門討論這一主題的調查很少，即針對優化公共衛生非藥物和藥物干預策略的RL方法的發展和應用。因此，本文旨在提供對最新文獻的簡要回顧和討論，探討RL方法如何被用於協助控制傳染病的擴散和疫情，涵蓋幾個關鍵主題以應對公共衛生需求：資源分配、生命與生計之間的平衡、多種干預的混合政策，以及區域間的協調控制。最後，我們以對未來研究幾個潛在方向的討論作為本文的結尾。

Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

2603.25155v1 by Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li, Minfeng Xu

Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.

摘要：多模態大型語言模型在臨床視覺問題回答任務中展現出良好的前景，但在擴展到3D影像時受到高計算成本的阻礙。先前的方法通常依賴於2D切片或固定長度的標記壓縮，這破壞了體積連續性並掩蓋了微妙的發現。我們提出了Photon，一個用變長標記序列表示3D醫療體積的框架。Photon引入了基於指令的標記調度和替代梯度傳播，能夠在訓練和推理過程中自適應地減少標記，從而降低計算成本，同時減輕由冗餘標記引起的注意力稀釋。它還結合了一個自定義的反向傳播規則，通過梯度恢復來實現可微分優化，儘管存在離散標記丟失。為了穩定標記壓縮並確保視覺證據的可靠使用，Photon進一步應用了正則化目標，以減輕僅基於語言的偏見並提高可靠性。在多樣的醫療視覺問題回答任務中的實驗顯示，Photon在降低資源使用和加速訓練及推理的同時，達到了最先進的準確率。

Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence

2603.25146v1 by Vehid Geruslu, Zulfiyya Aliyeva, Eray Tüzün

Context: The rapid adoption of AI-assisted code generation tools, such as large language models (LLMs), is transforming software development practices. While these tools promise significant productivity gains, concerns regarding the quality, reliability, and security of AI-generated code are increasingly reported in both academia and industry. --Objective: This study aims to systematically synthesize existing empirical evidence on the factors influencing the quality of AI-generated source code and to analyze how these factors impact software quality outcomes across different evaluation contexts. --Method: We conducted a systematic literature review (SLR) following established guidelines, supported by an AI-assisted workflow with human oversight. A total of 24 primary studies were selected through a structured search and screening process across major digital libraries. Data were extracted and analyzed using qualitative, pattern-based evidence synthesis. --Results: The findings reveal that code quality in AI-assisted development is influenced by a combination of human factors, AI system characteristics, and human-AI interaction dynamics. Key influencing factors include prompt design, task specification, and developer expertise. The results also show variability in quality outcomes such as correctness, security, maintainability, and complexity across studies, with both improvements and risks reported. --Conclusion: AI-assisted code generation represents a socio-technical shift in software engineering, where achieving high-quality outcomes depends on both technological and human factors. While promising, AI-generated code requires careful validation and integration into development workflows.

摘要：Context: AI輔助的程式碼生成工具（如大型語言模型（LLMs））的快速採用正在改變軟體開發實踐。雖然這些工具承諾顯著的生產力提升，但關於AI生成程式碼的質量、可靠性和安全性的擔憂在學術界和業界中越來越多地被報導。--Objective: 本研究旨在系統性地綜合現有的實證證據，探討影響AI生成源碼質量的因素，並分析這些因素如何影響不同評估背景下的軟體質量結果。--Method: 我們按照既定指導方針進行了系統性文獻回顧（SLR），並在人工監督下支持AI輔助的工作流程。通過結構化的搜索和篩選過程，從主要數字圖書館中選擇了共24項主要研究。數據通過定性、基於模式的證據綜合進行提取和分析。--Results: 研究結果顯示，AI輔助開發中的程式碼質量受到人為因素、AI系統特徵和人機互動動態的綜合影響。主要影響因素包括提示設計、任務規範和開發者專業知識。結果還顯示，不同研究中的質量結果（如正確性、安全性、可維護性和複雜性）存在變異，報告了改進和風險。--Conclusion: AI輔助程式碼生成代表了軟體工程中的社會技術轉變，實現高質量結果依賴於技術和人為因素。雖然前景看好，但AI生成的程式碼需要仔細驗證並整合進開發工作流程中。

Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators

2603.24986v1 by Ray-Yuan Chung, Xuhai Xu, Ari Pollack

Large language model-based health agents are increasingly used by health consumers and clinicians to interpret health information and guide health decisions. However, most AI systems in healthcare operate in siloed configurations, supporting individual users rather than the multi-stakeholder relationships central to healthcare. Such use can fragment understanding and exacerbate misalignment among patients, caregivers, and clinicians. We reframe AI not as a standalone assistant, but as a collaborator embedded within multi-party care interactions. Through a clinically validated fictional pediatric chronic kidney disease case study, we show that breakdowns in adherence stem from fragmented situational awareness and misaligned goals, and that siloed use of general-purpose AI tools does little to address these collaboration gaps. We propose a conceptual framework for designing AI collaborators that surface contextual information, reconcile mental models, and scaffold shared understanding while preserving human decision authority.

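Summary: Health agents based on large language models are increasingly used by health consumers and clinicians to interpret health information and guide health decisions. However, most AI systems in healthcare run in siloed configurations that support individual users rather than the multi-stakeholder relationships at the core of healthcare. This kind of use can fragment understanding and worsen misalignment among patients, caregivers, and clinicians. We reframe AI as a collaborator embedded in multi-party care interactions rather than a standalone assistant. Through a clinically validated fictional case study of pediatric chronic kidney disease, we show that breakdowns in adherence stem from fragmented situational awareness and misaligned goals, and that siloed use of general-purpose AI tools does little to close these collaboration gaps. We propose a conceptual framework for designing AI collaborators that surface contextual information, reconcile mental models, and scaffold shared understanding while preserving human decision authority.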

Subject-Specific Low-Field MRI Synthesis via a Neural Operator

2603.24968v1 by Ziqi Gao, Nicha Dvornek, Xiaoran Zhang, Gigi Galiana, Hemant Tagare, Todd Constable

Low-field (LF) magnetic resonance imaging (MRI) improves accessibility and reduces costs but generally has lower signal-to-noise ratios and degraded contrast compared to high-field (HF) MRI, limiting its clinical utility. Simulating LF MRI from HF MRI enables virtual evaluation of novel imaging devices and development of LF algorithms. Existing low-field simulators rely on noise injection and smoothing, which fail to capture the contrast degradation seen in LF acquisitions. To this end, we introduce an end-to-end LF-MRI synthesis framework that learns HF-to-LF image degradation directly from a small number of paired HF-LF MRIs. Specifically, we introduce a novel HF-to-LF coordinate-image decoupled neural operator (H2LO) to model the underlying degradation process, and tailor it to capture high-frequency noise textures and image structure. Experimental results in T1w and T2w MRI demonstrate that H2LO produces more faithful simulated low-field images than existing parameterized noise synthesis models and popular image-to-image translation models. Furthermore, it improves performance in downstream image enhancement tasks, showcasing its potential to enhance LF MRI diagnostic capabilities.

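Summary: Low-field (LF) magnetic resonance imaging (MRI) improves accessibility and lowers cost, but compared with high-field (HF) MRI it generally has a lower signal-to-noise ratio and poorer contrast, which limits its clinical utility. Simulating LF MRI from HF MRI makes it possible to evaluate new imaging devices virtually and to develop LF algorithms. Existing low-field simulators rely on noise injection and smoothing and fail to capture the contrast degradation observed in LF acquisitions. To address this, we introduce an end-to-end LF-MRI synthesis framework that learns HF-to-LF image degradation directly from a small number of paired HF-LF MRIs. Specifically, we introduce a novel HF-to-LF coordinate-image decoupled neural operator (H2LO) that models the underlying degradation process and is tailored to capture high-frequency noise textures and image structure. Experiments on T1w and T2w MRI show that H2LO produces more faithful simulated low-field images than existing parameterized noise synthesis models and popular image-to-image translation models. It also improves performance on downstream image enhancement tasks, showing its potential to strengthen the diagnostic capabilities of LF MRI.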
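To make the baseline criticized in this abstract concrete, here is a minimal sketch of a "smoothing plus noise injection" low-field simulator. It is not H2LO and not taken from the paper: the box filter, the zero-mean Gaussian noise, and the names `box_blur`, `simulate_low_field`, `noise_sigma`, and `radius` are all illustrative assumptions. Because the blur is a local average and the noise has zero mean, the sketch preserves the mean intensity of uniform regions, so it cannot model a change in tissue contrast, which is the shortcoming the abstract describes.

```python
import random


def box_blur(img, radius=1):
    """Mean-filter a 2D image (list of rows of floats).

    At the borders only in-bounds neighbours are averaged.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def simulate_low_field(hf_img, noise_sigma=0.05, radius=1, seed=0):
    """Hypothetical baseline: blur the HF image, then add Gaussian noise.

    Assumption: additive zero-mean Gaussian noise is a simplification
    chosen for brevity, not a model of real scanner noise.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    blurred = box_blur(hf_img, radius)
    return [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in blurred]


if __name__ == "__main__":
    # Toy "HF image": a bright square on a dark background.
    hf = [[1.0 if 2 <= x < 6 and 2 <= y < 6 else 0.0 for x in range(8)]
          for y in range(8)]
    lf = simulate_low_field(hf, noise_sigma=0.1)
    for row in lf:
        print(" ".join(f"{v:5.2f}" for v in row))
```

Running the script prints a blurred, noisy copy of the toy image; the edges of the square soften and every pixel jitters, but the average brightness inside and outside the square stays roughly where it started.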

Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence

2603.24898v1 by Vasu Srinivasan, Dhriti Vasu

We present a Sovereign AI architecture for clinical triage in which all inference is performed on-device and inbound data is delivered via a physically unidirectional channel, implemented using receive-only broadcast infrastructure or certified hardware data diodes, with no return path to any external network. This design removes the network-mediated attack surface by construction, rather than attempting to secure it through software controls. The system performs conversational symptom intake, integrates device-captured vitals, and produces structured, triage-aligned clinical records at the point of care. We formalize the security properties of receiver-side unidirectionality and show that the architecture is transport-agnostic across broadcast and diode-enforced deployments. We further analyze threat models, enforcement mechanisms, and deployment configurations, demonstrating how physical one-way data flow enables high-assurance operation in both resource-constrained and high-risk environments. This work positions physically unidirectional channels as a foundational primitive for sovereign, on-device clinical intelligence at the front door of care.

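Summary: We present a Sovereign AI architecture for clinical triage in which all inference runs on-device and inbound data arrives over a physically unidirectional channel, implemented with receive-only broadcast infrastructure or certified hardware data diodes, with no return path to any external network. The design removes the network-mediated attack surface by construction instead of trying to secure it with software controls. The system performs conversational symptom intake, integrates device-captured vitals, and produces structured, triage-aligned clinical records at the point of care. We formalize the security properties of receiver-side unidirectionality and show that the architecture is transport-agnostic across broadcast and diode-enforced deployments. We further analyze threat models, enforcement mechanisms, and deployment configurations, showing how physical one-way data flow enables high-assurance operation in both resource-constrained and high-risk environments. This work positions physically unidirectional channels as a foundational primitive for sovereign, on-device clinical intelligence at the front door of care.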

Medical

Abstracts

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Analyzing Healthcare Interoperability Vulnerabilities: Formal Modeling and Graph-Theoretic Approach

Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution

ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

Eligibility-Aware Evidence Synthesis: An Agentic Framework for Clinical Trial Meta-Analysis

An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

Managing Diabetic Retinopathy with Deep Learning: A Data Centric Overview

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data

Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia

Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection

Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring

Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology

Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy

Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training

Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies

DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data

Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis

When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

Safety, Security, and Cognitive Risks in World Models

Regularizing Attention Scores with Bootstrapping

AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation

Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images

BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction

Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition

MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning

A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions

EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

SANA I2I: A Text Free Flow Matching Framework for Paired Image to Image Translation with a Case Study in Fetal MRI Artifact Reduction

A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation

One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

Four Generations of Quantum Biomedical Sensors

ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

Training deep learning based dynamic MR image reconstruction using synthetic fractals

Brain MR Image Synthesis with Multi-contrast Self-attention GAN

Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding

Few-shot Writer Adaptation via Multimodal In-Context Learning

NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification

AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding

Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model

Knowledge database development by large language models for countermeasures against viruses and marine toxins

Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health

A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank

Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction

Towards a Medical AI Scientist

Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework

FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Bit-Identical Medical Deep Learning via Structured Orthogonal Initialization

Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images

RAP: Retrieve, Adapt, and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models

Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection

MediHive: A Decentralized Agent Collective for Medical Reasoning

Bayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data

Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic Disorders

When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

Foundation Model for Cardiac Time Series via Masked Latent Attention

PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI