Medical

Publish Date Title Authors Homepage Code
2026-04-24 FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records Hojjat Karami et.al. 2604.22534v1 null
2026-04-24 CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease Bulent Soykan et.al. 2604.22428v1 null
2026-04-24 Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems Meghana Karnam et.al. 2604.22154v1 null
2026-04-23 Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations Nalin Poungpeth et.al. 2604.22109v1 null
2026-04-23 Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake Guan Gui et.al. 2604.22067v1 null
2026-04-23 Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores Shevya Pandya et.al. 2604.22063v1 null
2026-04-23 Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching Xiaodi Li et.al. 2604.22061v1 null
2026-04-23 EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms Brian VanVoorst et.al. 2604.22036v1 null
2026-04-23 Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models Naheed Rayhan et.al. 2604.21860v1 null
2026-04-23 Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos Bowen Liu et.al. 2604.21814v1 null
2026-04-23 Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications Yvon K. Awuklu et.al. 2604.21793v1 null
2026-04-23 Causal Disentanglement for Full-Reference Image Quality Assessment Zhen Zhang et.al. 2604.21654v1 null
2026-04-23 Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach Eli Gildish et.al. 2604.21651v1 null
2026-04-23 Unbiased Prevalence Estimation with Multicalibrated LLMs Fridolin Linder et.al. 2604.21549v1 null
2026-04-23 Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation Michele Miranda et.al. 2604.21421v1 null
2026-04-23 Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models Muhammad Shafique et.al. 2604.21952v1 null
2026-04-23 Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages Michael Bouzinier et.al. 2604.21263v1 null
2026-04-22 Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction Abhishek Dharmaratnakar et.al. 2604.21154v1 null
2026-04-22 Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation Sanjoy Pator et.al. 2604.21076v1 null
2026-04-22 HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering Yuyu Liu et.al. 2604.21027v1 null
2026-04-22 Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics Open-H-Embodiment Consortium et.al. 2604.21017v1 null
2026-04-22 Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry Syed Nazmus Sakib et.al. 2604.20983v1 null
2026-04-22 Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs Mariano Barone et.al. 2604.20791v1 null
2026-04-22 MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills Yingyong Hou et.al. 2604.20441v1 null
2026-04-22 Surrogate modeling for interpreting black-box LLMs in medical predictions Changho Han et.al. 2604.20331v2 null
2026-04-22 Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA Zibo Xu et.al. 2604.20306v1 null
2026-04-21 From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI Patrick Vossler et.al. 2604.20055v1 null
2026-04-21 Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine Yusuf Kesmen et.al. 2604.20022v1 null
2026-04-21 scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics Qifeng Zhou et.al. 2604.20003v1 null
2026-04-21 Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning Palawat Busaranuvong et.al. 2604.19937v1 null
2026-04-21 Depression Risk Assessment in Social Media via Large Language Models Giorgia Gulino et.al. 2604.19887v1 null
2026-04-21 A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities Aya Cherigui et.al. 2604.19653v1 null
2026-04-21 Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models Kihyuk Lee et.al. 2604.19598v2 null
2026-04-21 Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity Farbod Zorriassatine et.al. 2604.19538v1 null
2026-04-21 Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents Vasundra Srininvasan et.al. 2604.19457v1 null
2026-04-21 Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications Abu Noman Md Sakib et.al. 2604.19281v1 null
2026-04-21 Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement Pritam Kar et.al. 2604.19191v1 null
2026-04-20 Regulating Artificial Intimacy: From Locks and Blocks to Relational Accountability Henry Fraser et.al. 2604.18893v1 null
2026-04-20 REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction Seowung Leem et.al. 2604.18757v1 null
2026-04-20 Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling Andrew Wang et.al. 2604.18753v1 null
2026-04-20 A multimodal and temporal foundation model for virtual patient representations at healthcare system scale Andrew Zhang et.al. 2604.18570v2 null
2026-04-20 ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification Florian Kittler et.al. 2604.18444v1 null
2026-04-20 Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support Eranga Bandara et.al. 2604.18302v1 null
2026-04-20 Style-Based Neural Architectures for Real-Time Weather Classification Hamed Ouattara et.al. 2604.18251v1 null
2026-04-20 Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies Lorenz Brehme et.al. 2604.18234v1 null
2026-04-20 Does "Do Differentiable Simulators Give Better Policy Gradients?'' Give Better Policy Gradients? Ku Onoda et.al. 2604.18161v1 null
2026-04-20 Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework Cong Huy Nguyen et.al. 2604.18145v1 null
2026-04-20 Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning Khalil Akremi et.al. 2604.19823v1 null
2026-04-20 First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows Sihao Xing et.al. 2604.18038v1 null
2026-04-20 AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis Nathasha Naranpanawa et.al. 2604.17846v1 null
2026-04-20 MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models Suhyun Lee et.al. 2604.17730v1 null
2026-04-20 RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models Arya Hadizadeh Moghaddam et.al. 2604.17725v1 null
2026-04-20 Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals Jon-Paul Cacioli et.al. 2604.17714v1 null
2026-04-20 Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report Jon-Paul Cacioli et.al. 2604.17707v1 null
2026-04-19 STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments Md Mezbahul Islam et.al. 2604.17611v1 null
2026-04-19 T-DuMpRa: Teacher-guided Dual-path Multi-prototype Retrieval Augmented framework for fine-grained medical image classification Zixuan Tang et.al. 2604.17360v1 null
2026-04-19 PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations Patrick Keough et.al. 2604.17359v1 null
2026-04-19 Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA Alberto Testoni et.al. 2604.17316v1 null
2026-04-19 Chaos-Enhanced Prototypical Networks for Few-Shot Medical Image Classification Chinthakuntla Meghan Sai et.al. 2604.17300v1 null
2026-04-19 Region-Affinity Attention for Whole-Slide Breast Cancer Classification in Deep Ultraviolet Imaging Nagur Shareef Shaik et.al. 2604.17222v1 null
2026-04-19 Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition Nwe Ni Win et.al. 2604.17214v1 null
2026-04-19 DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation Nagur Shareef Shaik et.al. 2604.17209v1 null
2026-04-19 CDSA-Net: Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography Si Li et.al. 2604.17208v1 null
2026-04-19 Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training Weibing Zheng et.al. 2604.17186v1 null
2026-04-18 If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data Yanjun Cui et.al. 2604.17133v1 null
2026-04-18 A Two-Stage Deep Learning Framework for Segmentation of Ten Gastrointestinal Organs from Coronal MR Enterography Ashiqur Rahman et.al. 2604.17118v1 null
2026-04-18 Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization Weijie Wan et.al. 2604.17051v1 null
2026-04-18 Light-Adapted Electroretinogram and Oscillatory Potentials (LEOPs) Dataset for Autism Spectrum Disorder and Typically Developing Individuals Paul A. Constable et.al. 2604.16981v1 null
2026-04-18 Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models Bruce A. Bassett et.al. 2604.16980v1 null
2026-04-18 Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction Liyin Chen et.al. 2604.16955v1 null
2026-04-18 Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach Riza Alaudin Syah et.al. 2604.16953v1 null
2026-04-18 Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts Gabriel Jason Lee et.al. 2604.16926v1 null
2026-04-18 Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models Inhyeok Lee et.al. 2604.16775v1 null
2026-04-17 CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction Jianyou Wang et.al. 2604.16742v1 null
2026-04-17 Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis Ayhan Can Erdur et.al. 2604.16729v1 null
2026-04-17 A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction Dingyi Zhang et.al. 2604.16655v1 null
2026-04-17 MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation Yi Lin et.al. 2604.16175v1 null
2026-04-17 Hybrid Spectro-Temporal Fusion Framework for Structural Health Monitoring Jongyeop Kim et.al. 2604.16589v1 null
2026-04-17 Large Language Models Meet Biomedical Knowledge Graphs for Mechanistically Grounded Therapeutic Prioritization Chih-Hsuan Wei et.al. 2604.19815v1 null
2026-04-17 Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors Jessica H. Zhu et.al. 2604.16132v1 null
2026-04-17 Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration Baramee Sukumal et.al. 2604.16104v1 null
2026-04-17 Towards Trustworthy Depression Estimation via Disentangled Evidential Learning Fangyuan Liu et.al. 2604.16579v1 null
2026-04-17 QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals Jeremy Qin et.al. 2604.15859v1 null
2026-04-17 Stein Variational Black-Box Combinatorial Optimization Thomas Landais et.al. 2604.15837v1 null
2026-04-17 Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI Lama Moukheiber et.al. 2604.15808v1 null
2026-04-17 KWBench: Measuring Unprompted Problem Recognition in Knowledge Work Ankit Maloo et.al. 2604.15760v1 null
2026-04-17 SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification Enhui Chai et.al. 2604.15711v1 null
2026-04-17 CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder Duy-Phuong Dao et.al. 2604.15611v1 null
2026-04-16 Robustifying and Selecting Cohort-Appropriate Prognostic Models under Distributional Shifts Dimitris Bertsimas et.al. 2604.16537v1 null
2026-04-16 Towards Reliable Testing of Machine Unlearning Anna Mazhar et.al. 2604.16536v1 null
2026-04-16 A Q-learning-based QoS-aware multipath routing protocol in IoMT-based wireless body area network Mehdi Hosseinzadeh et.al. 2604.15489v1 null
2026-04-16 Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models Emily Curl et.al. 2604.16532v1 null
2026-04-16 RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference Yuxin Liu et.al. 2604.15459v1 null
2026-04-16 DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI Zhizheng Wang et.al. 2604.15456v1 null
2026-04-16 SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation Tianhao Fu et.al. 2604.15271v2 null
2026-04-16 RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography Mélanie Roschewitz et.al. 2604.15231v1 null
2026-04-16 Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF Nicklas Neu et.al. 2604.16528v1 null
2026-04-16 Hybrid Decision Making via Conformal VLM-generated Guidance Debodeep Banerjee et.al. 2604.14980v2 null
2026-04-16 Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels? Amy Rouillard et.al. 2604.14892v2 null
2026-04-16 MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry Meng-Xun Li et.al. 2604.14866v1 null

Abstracts

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

2604.22534v1 by Hojjat Karami, David Atienza, Jean-Philippe Thiran, Anisoara Ionescu

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
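
To make "features that handle uneven observation patterns and informative sparsity" concrete, here is a minimal sketch of a univariate extractor for one irregularly sampled variable. It is not taken from the FeatEHR-LLM repository: the function name `irregular_features`, the feature names, and the toy heart-rate series are all assumptions made for illustration.

```python
from statistics import mean

def irregular_features(times, values, t_end, window):
    """Summarize one irregularly sampled variable up to prediction time t_end.

    times  : observation times in hours, ascending (hypothetical input layout)
    values : measurements aligned with times
    window : look-back length in hours
    """
    past = [(t, v) for t, v in zip(times, values) if t <= t_end]
    recent = [(t, v) for t, v in past if t >= t_end - window]
    feats = {
        # Sampling intensity is itself a signal (informative sparsity).
        "n_obs": len(recent),
        "obs_per_hour": len(recent) / window,
        # Staleness of the most recent measurement.
        "hours_since_last": (t_end - past[-1][0]) if past else None,
        "last_value": past[-1][1] if past else None,
        "window_mean": mean(v for _, v in recent) if recent else None,
    }
    # Slope uses real elapsed time, not the observation index,
    # so uneven gaps do not distort the trend.
    if len(recent) >= 2 and recent[-1][0] > recent[0][0]:
        dt = recent[-1][0] - recent[0][0]
        feats["slope_per_hour"] = (recent[-1][1] - recent[0][1]) / dt
    else:
        feats["slope_per_hour"] = None
    return feats

# Toy heart-rate series with uneven gaps (made-up values).
hr_times = [0.0, 0.5, 3.0, 9.5, 10.0]
hr_values = [88, 92, 101, 110, 114]
features = irregular_features(hr_times, hr_values, t_end=12.0, window=6.0)
print(features)
```

Missing values are returned as `None` rather than imputed, so a downstream model can treat absence of measurement as its own signal.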

CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease

2604.22428v1 by Bulent Soykan, Gulsah Hancerliogullari Koksalmis, Hsin-Hsiung Huang, Laura J. Brattain

Predicting individual cognitive decline in Alzheimer's disease (AD) is difficult due to the heterogeneity of disease progression. Reliable clinical tools require not only high accuracy but also fairness across demographics and robustness to missing data. We present CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework using data from 1,666 patients in the TADPOLE (Alzheimer's Disease Neuroimaging Initiative) dataset. We assessed the model for prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate and personalized predictions of cognitive decline. Its demonstrated fairness across patient demographics and resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

2604.22154v1 by Meghana Karnam, Ananya Joshi

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that replaces heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while maintaining similar false negative rates across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
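
The 40% figure follows directly from the two reported false positive rates, and the idea of spending more samples on harder inputs can be illustrated with a simple stopping rule. The sketch below is an assumption-laden stand-in, not the paper's bandit algorithm: `adaptive_vote` and its parameters are hypothetical, and it uses a plain Hoeffding interval. Checking a fixed-sample bound after every draw does not strictly preserve the stated confidence level; an anytime-valid bound would be needed for a real guarantee.

```python
import math

def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved."""
    return (baseline - improved) / baseline

def hoeffding_halfwidth(n, delta):
    """Half-width of a two-sided (1 - delta) Hoeffding interval
    for the mean of n values bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def adaptive_vote(draw_vote, delta=0.05, min_samples=3, max_samples=50):
    """Keep sampling binary votes (1 = flag, 0 = safe) until the interval
    around the flag fraction excludes 0.5, or the budget runs out.
    Easy inputs stop early; ambiguous inputs consume more samples."""
    votes = []
    frac = 0.5
    while len(votes) < max_samples:
        votes.append(draw_vote())
        n = len(votes)
        frac = sum(votes) / n
        if n >= min_samples and abs(frac - 0.5) > hoeffding_halfwidth(n, delta):
            break
    return frac >= 0.5, len(votes)

# Reported rates: 0.159 (single agent) vs 0.095 (adaptive sampling).
fpr_reduction = relative_reduction(0.159, 0.095)
print(f"relative FPR reduction: {fpr_reduction:.1%}")  # 40.3%

# A unanimous (easy) input stops well before the 50-sample budget.
decision, n_used = adaptive_vote(lambda: 1)
print(decision, n_used)
```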

Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

2604.22109v1 by Nalin Poungpeth, Nicholas Clark, Tanu Mitra

Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLMs in persuading users, as well as the perception of models as objective and impartial.

Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

2604.22067v1 by Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi

Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.

Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

2604.22063v1 by Shevya Pandya, Shinjini Bose, Ananya Joshi

Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for reliability auditing downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior like this before clinical deployments.

Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching

2604.22061v1 by Xiaodi Li, Yang Xiao, Munhwan Lee, Konstantinos Leventakos, Young J. Juhn, David Jones, Terence T. Sio, Wei Liu, Maria Vassilaki, Nansu Zong

Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.
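
The retrieval step described above, selecting only the EHR segments relevant to an eligibility criterion before any LLM encoding, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: it scores segments with bag-of-words cosine similarity where a real system would more likely use dense embeddings, and the `retrieve` function, the segments, and the criterion are all invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(segments, criterion, k=2):
    """Return up to k segments most similar to the criterion,
    dropping segments with no lexical overlap at all."""
    query = Counter(tokenize(criterion))
    scored = [(cosine(Counter(tokenize(s)), query), i)
              for i, s in enumerate(segments)]
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:k]
    return [segments[i] for score, i in top if score > 0]

# Made-up record segments and eligibility criterion.
ehr_segments = [
    "Patient reports seasonal allergies, takes loratadine as needed.",
    "HbA1c 8.2 percent in March; type 2 diabetes managed with metformin.",
    "Family history of hypertension. No tobacco use.",
    "Diabetes follow-up: metformin dose increased, HbA1c rechecked in June.",
]
criterion = "Type 2 diabetes with HbA1c above 7.5 percent on metformin"
selected = retrieve(ehr_segments, criterion, k=2)
print(selected)
```

Only the selected segments would then be passed to the encoder, which is where the reduction in input length and compute comes from.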

EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

2604.22036v1 by Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix, Iran Roman, Bea Steers, Claudio Silva, Yuhan Shen, Zijia Lu, Shih-Po Lee, Ehsan Elhamifar

This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).

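Summary: This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. The dataset contains 3,355 videos covering 50 medical tasks, with at least 50 labeled videos per task. The main goal of the PTG program was to develop virtual assistants integrated into augmented reality headsets to help users perform complex tasks. To encourage exploration and research with this dataset, the medical training data has been released together with an action detection challenge focused on eight medical tasks. Most of the videos were recorded with a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained on 1.95 million labels to detect 124 medical objects, giving developers of medical AI applications a robust starting point. Beyond introducing the dataset, the paper presents baseline action detection results for three models on the eight selected medical tasks, with the best-performing method reaching an average mAP of 0.526. Although the paper mainly uses action detection as its benchmark, the EgoMAGIC dataset is equally suited to action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is available via zenodo.org (DOI: 10.5281/zenodo.19239154).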

Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models

2604.21860v1 by Naheed Rayhan, Sohely Jahan

Large language models (LLMs) are increasingly integrated into sensitive workflows, raising the stakes for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI leverages automated attacker agents powered by large language models to iteratively test and evade policy enforcement in both commercial and open-source LLMs, marking a departure from conventional jailbreak approaches that typically depend on maintaining persistent conversational context. Our extensive evaluation across state-of-the-art models (including those from OpenAI, Anthropic, Google Gemini, Meta, and prominent open-source alternatives) uncovers significant variations in resilience to TTI attacks, with only select architectures exhibiting substantial inherent robustness. Our automated black-box evaluation framework also uncovers previously unknown model-specific vulnerabilities and attack surface patterns, especially within medical and high-stakes domains. We further compare TTI against established adversarial prompting methods and detail practical mitigation strategies, such as session-level context aggregation and deep alignment approaches. Our study underscores the urgent need for holistic, context-aware defenses and continuous adversarial testing to future-proof LLM deployments against evolving multi-turn threats.

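Summary: Large language models (LLMs) are increasingly integrated into sensitive workflows, which raises the requirements for adversarial robustness and safety. This paper introduces Transient Turn Injection (TTI), a new multi-turn attack technique that systematically exploits stateless moderation by distributing adversarial intent across isolated interactions. TTI uses automated attacker agents driven by large language models to iteratively test and evade policy enforcement in commercial and open-source LLMs, in contrast to conventional jailbreak methods, which usually depend on maintaining persistent conversational context. Our extensive evaluation of state-of-the-art models, including those from OpenAI, Anthropic, Google Gemini, Meta, and other prominent open-source alternatives, reveals significant variation in resilience to TTI attacks, with only a few architectures showing substantial inherent robustness. Our automated black-box evaluation framework also reveals previously unknown model-specific vulnerabilities and attack surface patterns, especially in medical and high-stakes domains. We further compare TTI with established adversarial prompting methods and detail practical mitigation strategies, such as session-level context aggregation and deep alignment approaches. Our study underscores the urgent need for holistic, context-aware defenses and continuous adversarial testing to future-proof LLM deployments against evolving multi-turn threats.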

Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos

2604.21814v1 by Bowen Liu, Li Yang, Shanshan Song, Mingyu Tang, Zhifang Gao, Qifeng Chen, Yangqiu Song, Huimin Chen, Xiaomeng Li

Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.

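Summary: Capsule endoscopy (CE) makes non-invasive gastrointestinal screening possible, but current CE research is still largely limited to frame-level classification and detection, and video-level analysis remains underexplored. To fill this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those frames. The setting is challenging because diagnostically relevant events are extremely sparse and can be swamped by tens of thousands of redundant normal frames, while individual observations are often ambiguous because of motion blur, debris, specular highlights, and rapid viewpoint changes. To support research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP contains 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address the task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize the candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate the multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods and produces concise, clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.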

Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications

2604.21793v1 by Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue, Vianney Jouhet, Fleur Mougin

In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. As some incorrect events might be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism to select preferred consistent sets of events. While reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the interest of the approach, both in terms of computational feasibility and positive alignment of our results with medical expert opinions. While strongly motivated by the needs of the healthcare domain, our framework is purposely generic, enabling its reuse in other areas.

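Summary: In this paper, we develop a novel logic-based approach to detecting high-level, temporally extended events from timestamped data and background knowledge. Our framework uses logical rules to capture the existence and termination conditions of simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as the diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. Because some incorrect events may be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism that selects preferred consistent sets of events. Although reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements the core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the value of the approach, both in computational feasibility and in the positive alignment of our results with medical expert opinion. Although strongly motivated by the needs of the healthcare domain, the framework is deliberately generic so that it can be reused in other areas.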
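To make the notion of existence and termination conditions concrete, here is a minimal sketch in Python rather than answer set programming. It assumes a single hypothetical rule: a therapy episode exists from its first drug administration and terminates once no further administration follows within `max_gap_days`. The function name, the 30-day threshold, and the dates are illustrative assumptions, not taken from the paper.

```python
from datetime import date, timedelta

def infer_episodes(observation_dates, max_gap_days=30):
    """Merge timestamped observations into (start, end) episodes.

    Hypothetical rule: an episode starts at an observation and terminates
    once no further observation follows within max_gap_days.
    """
    episodes = []
    for day in sorted(observation_dates):
        if episodes and day - episodes[-1][1] <= timedelta(days=max_gap_days):
            episodes[-1][1] = day          # observation extends the open episode
        else:
            episodes.append([day, day])    # gap too large: start a new episode
    return [tuple(e) for e in episodes]

# Made-up drug administration dates for one patient.
administrations = [date(2024, 1, 1), date(2024, 1, 15), date(2024, 2, 1), date(2024, 6, 1)]
therapy_episodes = infer_episodes(administrations)
```

With these dates the first three administrations fall into one episode and the June administration opens a second one. The constraint-based repair step described in the abstract is not modeled here.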

Causal Disentanglement for Full-Reference Image Quality Assessment

2604.21654v1 by Zhen Zhang, Jielei Chu, Tian Zhang, Weide Liu, Fengmao Lv, Tianrui Li, Jun Cheng, Yuming Fang

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

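Summary: Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by comparing deep features of the reference and distorted images pairwise. In this paper, we approach the problem from a different angle and propose a new FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module that models the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments show that our method achieves highly competitive performance on standard IQA benchmarks in fully supervised, few-label, and label-free settings. In addition, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Because it can perform scenario-specific training and prediction without labeled IQA data, our method generalizes across domains better than existing training-free FR-IQA models.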

Dilated CNNs for Periodic Signal Processing: A Low-Complexity Approach

2604.21651v1 by Eli Gildish, Michael Grebshtein, Igor Makienko

Denoising of periodic signals and accurate waveform estimation are core tasks across many signal processing domains, including speech, music, medical diagnostics, radio, and sonar. Although deep learning methods have recently shown performance improvements over classical approaches, they require substantial computational resources and are usually trained separately for each signal observation. This study proposes a computationally efficient method based on DCNN and Re-sampling, termed R-DCNN, designed for operation under strict power and resource constraints. The approach targets signals with varying fundamental frequencies and requires only a single observation for training. It generalizes to additional signals via a lightweight resampling step that aligns time scales in signals with different frequencies to re-use the same network weights. Despite its low computational complexity, R-DCNN achieves performance comparable to state-of-the-art classical methods, such as autoregressive (AR)-based techniques, as well as conventional DCNNs trained individually for each observation. This combination of efficiency and performance makes the proposed method particularly well suited for deployment in resource-constrained environments without sacrificing denoising or estimation accuracy.

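Summary: Denoising periodic signals and accurately estimating waveforms are core tasks in many signal processing domains, including speech, music, medical diagnostics, radio, and sonar. Although deep learning methods have recently shown performance improvements over classical approaches, they require substantial computational resources and are usually trained separately for each signal observation. This study proposes a computationally efficient method based on DCNN and resampling, termed R-DCNN, designed to operate under strict power and resource constraints. The approach targets signals with varying fundamental frequencies and needs only a single observation for training. It generalizes to additional signals through a lightweight resampling step that aligns the time scales of signals with different frequencies so that the same network weights can be reused. Despite its low computational complexity, R-DCNN performs comparably to state-of-the-art classical methods, such as autoregressive (AR)-based techniques, and to conventional DCNNs trained individually for each observation. This combination of efficiency and performance makes the proposed method particularly well suited to deployment in resource-constrained environments without sacrificing denoising or estimation accuracy.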
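The resampling idea can be illustrated with a short sketch, which is not the authors' implementation. It assumes the fundamental frequency `f0` of a new signal is already known and uses plain linear interpolation so that one period of the resampled signal spans the same number of samples as a reference signal at `f_ref`, the precondition for reusing weights trained on the reference. All names and numbers below are made up for illustration.

```python
import math

def resample_linear(signal, ratio):
    """Linearly interpolate `signal` onto a grid `ratio` times denser."""
    n_out = int(round(len(signal) * ratio))
    out = []
    for i in range(n_out):
        pos = i / ratio                         # position on the original grid
        lo = min(int(pos), len(signal) - 1)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out

def upward_crossings(signal):
    """Indices i where the signal goes from negative to non-negative."""
    return [i for i in range(len(signal) - 1) if signal[i] < 0.0 <= signal[i + 1]]

fs, f0, f_ref = 8000, 100.0, 50.0   # made-up sample rate and frequencies (Hz)
x = [math.sin(2 * math.pi * f0 * k / fs + 0.3) for k in range(800)]
y = resample_linear(x, f0 / f_ref)  # period of y in samples now matches f_ref
```

Here the 100 Hz input has a period of 80 samples, and after resampling by `f0 / f_ref = 2` the period is 160 samples, the same as a 50 Hz signal at this sample rate. A practical system would probably use a band-limited resampler instead of linear interpolation.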

Unbiased Prevalence Estimation with Multicalibrated LLMs

2604.21549v1 by Fridolin Linder, Thomas Leeper, Daniel Haimovich, Niek Tax, Lorenzo Perini, Milan Vojnovic

Estimating the prevalence of a category in a population using imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume these rates remain stable across populations. We show this assumption fails under covariate shift and that multicalibration, which enforces calibration conditional on the input features rather than just on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods fail to provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem spanning nearly all academic disciplines. A simulation confirms that standard methods exhibit bias growing with shift magnitude, while a multicalibrated estimator maintains near-zero bias. While we focus the discussion mostly on LLMs, our theoretical results apply to any classification model. Two empirical applications -- estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts across four countries using an LLM -- demonstrate that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.

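Summary: Estimating the prevalence of a category in a population with imperfect measurement devices (diagnostic tests, classifiers, or large language models) is fundamental to science, public health, and online trust and safety. Standard approaches correct for known device error rates but assume that these rates stay stable across populations. We show that this assumption fails under covariate shift, and that multicalibration, which enforces calibration conditional on the input features rather than only on average, is sufficient for unbiased prevalence estimation under such shift. Standard calibration and quantification methods do not provide this guarantee. Our work connects recent theoretical work on fairness to a longstanding measurement problem that spans nearly all academic disciplines. A simulation confirms that standard methods show bias that grows with the magnitude of the shift, while a multicalibrated estimator keeps bias near zero. Although our discussion focuses mostly on LLMs, the theoretical results apply to any classification model. Two empirical applications (estimating employment prevalence across U.S. states using the American Community Survey, and classifying political texts from four countries with an LLM) show that multicalibration substantially reduces bias in practice, while highlighting that calibration data should cover the key feature dimensions along which target populations may differ.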
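A toy calculation shows why calibration on average is not enough under covariate shift. The sketch below is a deliberately simplified stand-in for multicalibration: it conditions on a single binary group feature, and every number in it is made up. One score is calibrated only over the 50/50 source mix, and the other is calibrated within each group.

```python
def mean(values):
    return sum(values) / len(values)

TRUE_RATE = {"A": 0.8, "B": 0.2}   # made-up P(category | group)

def average_calibrated_score(group):
    return 0.5                      # matches the 50/50 source mix only on average

def group_calibrated_score(group):
    return TRUE_RATE[group]         # calibrated conditional on the group feature

source = ["A"] * 50 + ["B"] * 50
target = ["A"] * 90 + ["B"] * 10    # covariate shift: the group mix changes

def bias(score_fn, population):
    """Estimated prevalence (mean score) minus true prevalence."""
    estimate = mean([score_fn(g) for g in population])
    truth = mean([TRUE_RATE[g] for g in population])
    return estimate - truth
```

On the source population both scores give an unbiased estimate. On the shifted target the average-calibrated score underestimates the true prevalence of 0.74 by 0.24, while the group-calibrated score stays unbiased.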

Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

2604.21421v1 by Michele Miranda, Xinlan Yan, Nishant Mishra, Rachel Murphy, Ameen Abu-Hanna, Sébastien Bratières, Iacer Calixto

Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employ named entity recognition (NER) to identify protected entities for redaction. Although methods based on differential privacy (DP) provide formal privacy guarantees, large language models (LLMs) are also increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.

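Summary: Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. Manual de-identification remains the gold standard, but it is costly and slow, which motivates automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines use named entity recognition (NER) to identify the protected entities to redact. Methods based on differential privacy (DP) provide formal privacy guarantees, and large language models (LLMs) are also increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for de-identifying Dutch clinical text. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing before DP, and we assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but that combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.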

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

2604.21952v1 by Muhammad Shafique, Abdul Basit, Muhammad Abdullah Hanif, Alberto Marchisio, Rachmad Vidya Wicaksana Putra, Minghao Shao

This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution & stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.

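Summary: This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it improves performance through fine-tuning for domain-specific adaptation. The methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it compresses MFMs using hierarchy-aware mixed-precision quantization and structural pruning of transformer blocks and MLP channels. It also optimizes operations through speculative decoding, through model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to decide when to escalate to a larger model, and through co-optimization of sequence length, visual resolution and stride, and graph-level operator fusion. To execute the model efficiently, the processing dataflow is optimized for the underlying hardware architecture and combined with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for transformer workloads is used, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking MFMs.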

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

2604.21263v1 by Michael Bouzinier, Sergey Trifonov, Michael Chumack, Eugenia Lvova, Dmitry Etin

Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate not only accuracy but auditability. Existing formal languages for clinical logic validate syntactic and structural correctness but not whether decision rules use epistemologically appropriate evidence. Methods: Drawing on design-by-contract principles, we introduce meta-predicates -- predicates about predicates -- for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates assert which evidence types are permissible in any given rule. The framework is instantiated in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but that they rest on appropriate evidence in ways that can be independently audited. While demonstrated in genomics, the approach generalises to any domain requiring auditable decision logic.

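Summary: Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate auditability as well as accuracy. Existing formal languages for clinical logic validate syntactic and structural correctness, but not whether decision rules use epistemologically appropriate evidence. Methods: Drawing on design-by-contract principles, we introduce meta-predicates (predicates about predicates) for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates state which evidence types are permissible in any given rule. The framework is implemented in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, which enables per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether the rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals which evidence was used after the fact, meta-predicates constrain which evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but also that they rest on appropriate evidence in ways that can be independently audited. Although demonstrated in genomics, the approach generalises to any domain that requires auditable decision logic.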
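The idea of a predicate about predicates can be sketched in a few lines of Python. This is an illustration only and does not reproduce the AnFiSA DSL. The field names, the evidence-type labels, and the `uses_only` helper are all hypothetical, and only one of the four typing dimensions (method of acquisition) is modeled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    fields: tuple  # annotation fields the rule reads

# Hypothetical epistemological typing of annotation fields (method of acquisition).
FIELD_TYPE = {
    "clinvar_significance": "curated",
    "population_frequency": "measured",
    "in_silico_score": "predicted",
}

def uses_only(allowed_types):
    """Meta-predicate: holds for a rule iff every field it reads has an allowed type."""
    def check(rule):
        return all(FIELD_TYPE[f] in allowed_types for f in rule.fields)
    return check

no_predicted_evidence = uses_only({"curated", "measured"})
rule_ok = Rule("rare_and_pathogenic", ("clinvar_significance", "population_frequency"))
rule_bad = Rule("score_only", ("in_silico_score",))
```

Running `no_predicted_evidence` over a rule set before deployment would flag `rule_bad`, in the spirit of the pre-deployment validation the abstract describes.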

Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction

2604.21154v1 by Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das

At-home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.

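Summary: Compliance with at-home physiotherapy remains critically low because of a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that uses generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that uses foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline built with large language models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.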

Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation

2604.21076v1 by Sanjoy Pator

Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before being passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p < 10^{-10}). This advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients, the patients most at risk from reconciliation errors, systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, showing that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models up to 8B, Raw JSON for 70B and above. The complete pipeline is reproducible on open-source tools running on an AWS g6e.xlarge instance (NVIDIA L40S, 48 GB VRAM).

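Summary: Medication reconciliation at clinical handoffs is a high-stakes, error-prone process. Large language models are increasingly proposed to assist with this task using FHIR-structured patient records, but a fundamental and largely unstudied variable is how the FHIR data is serialised before it is passed to the model. We present the first systematic comparison of four FHIR serialisation strategies (Raw JSON, Markdown Table, Clinical Narrative, and Chronological Timeline) across five open-weight models (Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, Llama-3.3-70B) on a controlled benchmark of 200 synthetic patients, totalling 4,000 inference runs. We find that serialisation strategy has a large, statistically significant effect on performance for models of up to 8B parameters: Clinical Narrative outperforms Raw JSON by up to 19 F1 points for Mistral-7B (r = 0.617, p < 10^{-10}). The advantage reverses at 70B, where Raw JSON achieves the best mean F1 of 0.9956. In all 20 model and strategy combinations, mean precision exceeds mean recall: omission is the dominant failure mode, with models more often missing an active medication than fabricating one, which changes how clinical safety auditing priorities should be set. Smaller models plateau at roughly 7-10 concurrent active medications, so polypharmacy patients, who are most at risk from reconciliation errors, are systematically underserved. BioMistral-7B, a domain-pretrained model without instruction tuning, produces zero usable output in all conditions, which shows that domain pretraining alone is not sufficient for structured extraction. These results offer practical, evidence-based format recommendations for clinical LLM deployment: Clinical Narrative for models of up to 8B, and Raw JSON for 70B and above. The complete pipeline can be reproduced with open-source tools running on an AWS g6e.xlarge instance (NVIDIA L40S, 48 GB VRAM).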
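To show what the serialisation choice looks like in practice, the sketch below renders the same two medication records as Raw JSON and as a Clinical Narrative. The records are made-up, FHIR R4-style `MedicationRequest` fragments, and the sentence template is an assumption; the paper's actual prompts and templates are not given in the abstract.

```python
import json

# Made-up, heavily simplified FHIR R4-style MedicationRequest resources.
meds = [
    {"resourceType": "MedicationRequest", "status": "active",
     "medicationCodeableConcept": {"text": "Metformin 500 mg tablet"},
     "dosageInstruction": [{"text": "1 tablet twice daily"}],
     "authoredOn": "2024-03-01"},
    {"resourceType": "MedicationRequest", "status": "stopped",
     "medicationCodeableConcept": {"text": "Lisinopril 10 mg tablet"},
     "dosageInstruction": [{"text": "1 tablet daily"}],
     "authoredOn": "2023-11-12"},
]

def to_raw_json(resources):
    """Raw JSON strategy: pass the resources through unchanged."""
    return json.dumps(resources, indent=2)

def to_clinical_narrative(resources):
    """Clinical Narrative strategy: one plain sentence per medication, oldest first."""
    sentences = []
    for r in sorted(resources, key=lambda r: r["authoredOn"]):
        name = r["medicationCodeableConcept"]["text"]
        dose = r["dosageInstruction"][0]["text"]
        sentences.append(
            f"{name}, {dose}, was prescribed on {r['authoredOn']} (status: {r['status']})."
        )
    return " ".join(sentences)

narrative = to_clinical_narrative(meds)
```

Both strings carry the same facts. Only the surface form handed to the model differs, which is the variable the study isolates.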

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

2604.21027v1 by Yuyu Liu, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma

Electronic health record (EHR) question answering is often handled by LLM-based pipelines that are costly to deploy and do not explicitly leverage the hierarchical structure of clinical data. Motivated by evidence that medical ontologies and patient trajectories exhibit hyperbolic geometry, we propose HypEHR, a compact Lorentzian model that embeds codes, visits, and questions in hyperbolic space and answers queries via geometry-consistent cross-attention with type-specific pointer heads. HypEHR is pretrained with next-visit diagnosis prediction and hierarchy-aware regularization to align representations with the ICD ontology. On two MIMIC-IV-based EHR-QA benchmarks, HypEHR approaches LLM-based methods while using far fewer parameters. Our code is publicly available at https://github.com/yuyuliu11037/HypEHR.

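Summary: Electronic health record (EHR) question answering is often handled by LLM-based pipelines that are costly to deploy and do not explicitly exploit the hierarchical structure of clinical data. Motivated by evidence that medical ontologies and patient trajectories exhibit hyperbolic geometry, we propose HypEHR, a compact Lorentzian model that embeds codes, visits, and questions in hyperbolic space and answers queries through geometry-consistent cross-attention with type-specific pointer heads. HypEHR is pretrained with next-visit diagnosis prediction and hierarchy-aware regularization to align its representations with the ICD ontology. On two MIMIC-IV-based EHR-QA benchmarks, HypEHR approaches the performance of LLM-based methods while using far fewer parameters. Our code is publicly available at https://github.com/yuyuliu11037/HypEHR.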
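For readers unfamiliar with Lorentzian embeddings, the following sketch computes the standard geodesic distance in the hyperboloid (Lorentz) model with curvature -1. It is textbook geometry and not code from the HypEHR repository; the helper names are illustrative.

```python
import math

def lift(v):
    """Map a Euclidean vector onto the hyperboloid -x0^2 + sum(xi^2) = -1."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time components plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance; the max() guards against rounding just below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift([0.0])
point = lift([math.sinh(1.0)])        # lies at hyperbolic distance 1 from the origin
d = lorentz_distance(origin, point)
```

Distances grow roughly exponentially with Euclidean norm in this model, which is a common argument for why tree-like hierarchies such as code ontologies fit well in hyperbolic space.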

Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

2604.21017v1 by Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang, Alaa Eldin Abdelaal, Alberto Arezzo, Ayberk Acar, Farshid Alambeigi, Carlo Alberto Ammirati, Yunke Ao, Pablo David Aranda Rodriguez, Soofiyan Atar, Mattia Ballo, Noah Barnes, Federica Barontini, Filip Binkiewicz, Peter Black, Sebastian Bodenstedt, Leonardo Borgioli, Nikola Budjak, Benjamin Calmé, Fabio Carrillo, Nicola Cavalcanti, Changwei Chen, Haoxin Chen, Sihang Chen, Qihan Chen, Zhongyu Chen, Ziyang Chen, Shing Shin Cheng, Meiqing Cheng, Min Cheng, Zih-Yun Sarah Chiu, Xiangyu Chu, Camilo Correa-Gallego, Giulio Dagnino, Anton Deguet, Jacob Delgado, Jonathan C. DeLong, Kaizhong Deng, Alexander Dimitrakakis, Qingpeng Ding, Hao Ding, Giovanni Distefano, Daniel Donoho, Anqing Duan, Marco Esposito, Shane Farritor, Jad Fayad, Zahi Fayad, Mario Ferradosa, Filippo Filicori, Chelsea Finn, Philipp Fürnstahl, Jiawei Ge, Stamatia Giannarou, Xavier Giralt Ludevid, Frederic Giraud, Aditya Amit Godbole, Ken Goldberg, Antony Goldenberg, Diego Granero Marana, Xiaoqing Guo, Tamás Haidegger, Evan Hailey, Pascal Hansen, Ziyi Hao, Kush Hari, Kengo Hayashi, Jonathon Hawkins, Shelby Haworth, Ortrun Hellig, S. Duke Herrell, Zhouyang Hong, Andrew Howe, Junlei Hu, Ria Jain, Mohammad Rafiee Javazm, Howard Ji, Rui Ji, Jianmin Ji, Zhongliang Jiang, Dominic Jones, Jeffrey Jopling, Britton Jordan, Ran Ju, Michael Kam, Luoyao Kang, Fausto Kang, Siddhartha Kapuria, Peter Kazanzides, Sonika Kiehler, Ethan Kilmer, Ji Woong Kim, Przemysław Korzeniowski, Chandra Kuchi, Nithesh Kumar, Alan Kuntz, Federico Lavagno, Yu Chung Lee, Hao-Chih Lee, Hang Li, Zhen Li, Xiao Liang, Xinxin Lin, Jinsong Lin, Chang Liu, Fei Liu, Pei Liu, Yun-hui Liu, Wanli Liuchen, Eszter Lukács, Sareena Mann, Miles Mannas, Brett Marinelli, Sabina Martyniak, Francesco Marzola, Lorenzo Mazza, Xueyan Mei, Maria Clara Morais, Luigi Muratore, Chetan Reddy Narayanaswamy, Michał Naskręt, David Navarro-Alarcon, Cyrus Neary, Chi Kit Ng, Christopher Nguan, David Noonan, Ki Hwan Oh, Tom Christian Olesch, Allison M. Okamura, Justin Opfermann, Matteo Pescio, Doan Xuan Viet Pham, Tito Porras, Hongliang Ren, Ariel Rodriguez Jimenez, Ferdinando Rodriguez y Baena, Septimiu E. Salcudean, Asmitha Sathya, Preethi Satish, Lalithkumar Seenivasan, Jiaqi Shao, Yiqing Shen, Yu Sheng, Lucy XiaoYang Shi, Zoe Soulé, Stefanie Speidel, Mingwu Su, Jianhao Su, Idris Sunmola, Kristóf Takács, Yunxi Tang, Patrick Thornycroft, Yu Tian, Jordan Thompson, Mehmet K. Turkcan, Mathias Unberath, Pietro Valdastri, Carlos Vives, Quan Vuong, Martin Wagner, Farong Wang, Wei Wang, Lidian Wang, Chung-Pang Wang, Guankun Wang, Junyi Wang, Erqi Wang, Ziyi Wang, Tanner Watts, Wolfgang Wein, Yimeng Wu, Zijian Wu, Hongjun Wu, Luohong Wu, Jie Ying Wu, Junlin Wu, Victoria Wu, Kaixuan Wu, Mateusz Wójcikowski, Yunye Xiao, Nan Xiao, Wenxuan Xie, Hao Yang, Tianqi Yang, Yinuo Yang, Menglong Ye, Ryan S. Yeung, Nural Yilmaz, Chim Ho Yin, Michael Yip, Rayan Younis, Chenhao Yu, Sayem Nazmuz Zaman, Milos Zefran, Han Zhang, Yuelin Zhang, Yidong Zhang, Yanyong Zhang, Xuyang Zhang, Yameng Zhang, Joyce Zhang, Ning Zhong, Peng Zhou, Haoying Zhou, Xiuli Zuo, Nassir Navab, Mahdi Azizian, Sean D. Huver, Axel Krieger

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

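Summary: Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, which restricts the development of the foundation models the field needs. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date. It spans more than 49 institutions and multiple robotic platforms, including the CMR Versius, Intuitive Surgical's da Vinci, the da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, and covers surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research this dataset enables through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics. It is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and reaches 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.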

Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

2604.20983v1 by Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin, Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Shifat E. Arman

Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

2604.20791v1 by Mariano Barone, Francesco Di Serio, Roberto Moio, Marco Postiglione, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
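
The FKGL figures quoted above refer to the Flesch-Kincaid Grade Level, a readability score computed from average sentence length and average syllables per word. The sketch below applies the standard formula to pre-computed counts. The function name and the example counts are illustrative assumptions, and the abstract does not say how the authors counted syllables, so this is not their pipeline.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level computed from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # Longer sentences and more syllables per word both raise the grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage (counts invented for illustration):
# 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(grade, 2))  # prints 9.91
```

A score near 17, as reported for the larger models, corresponds to graduate-level text, while 11 to 12 corresponds to late high school.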

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

2604.20441v1 by Yingyong Hou, Xinyuan Lao, Huimei Wang, Qianyu Yao, Wei Chen, Bocheng Huang, Fei Sun, Yuxian Lv, Weiqi Lei, Xueqian Wen, Pengfei Xia, Zhujun Tan, Shengyang Xie

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.
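
To make the agreement statistic concrete, here is a minimal sketch of linearly weighted Cohen's kappa for an ordinal scale such as the four release dispositions. The integer coding (0 = Reject up to 3 = Production Ready) and the sample ratings are assumptions made for illustration, and the code is not taken from MedSkillAudit.

```python
from collections import Counter


def linear_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with linear disagreement weights |i - j| / (k - 1)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    n = len(rater_a)
    joint = Counter(zip(rater_a, rater_b))  # counts of (a, b) rating pairs
    marg_a = Counter(rater_a)               # marginal counts for rater A
    marg_b = Counter(rater_b)               # marginal counts for rater B
    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(n_categories):
        for j in range(n_categories):
            weight = abs(i - j) / (n_categories - 1)
            observed += weight * joint[(i, j)] / n
            expected += weight * (marg_a[i] / n) * (marg_b[j] / n)
    if expected == 0:
        return float("nan")  # undefined when both raters use one category
    return 1.0 - observed / expected


# Hypothetical coding: 0 = Reject, 1 = Beta Only,
# 2 = Limited Release, 3 = Production Ready.
system = [3, 2, 2, 1, 0, 3]
expert = [3, 2, 1, 1, 0, 2]
print(linear_weighted_kappa(system, expert, n_categories=4))
```

Linear weights penalize a one-step disagreement (Limited Release versus Beta Only) less than a three-step one (Production Ready versus Reject), which suits an ordinal disposition scale.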

Surrogate modeling for interpreting black-box LLMs in medical predictions

2604.20331v2 by Changho Han, Songsoo Kim, Dong Won Kim, Leo Anthony Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.

Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA

2604.20306v1 by Zibo Xu, Qiang Li, Ke Lu, Jin Wang, Weizhi Nie, Yuting Su

Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.
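
For readers unfamiliar with the backdoor adjustment mentioned above, its textbook form is shown below. Here $X$ stands for the multimodal input, $Y$ for the answer, and $Z$ for a set of observed confounders satisfying the backdoor criterion. This symbol mapping is an illustrative assumption, and the paper's own SCM may be parameterized differently.

```latex
\[
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{z} P\bigl(Y \mid X = x,\, Z = z\bigr)\, P(Z = z)
\]
```

Ordinary conditioning would weight each stratum by $P(Z = z \mid X = x)$. The adjustment instead weights by the marginal $P(Z = z)$, which removes the dependence between the input and the observed confounder. Unobserved confounders cannot be summed over this way, which is why the abstract pairs the adjustment with an instrumental variable.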

From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI

2604.20055v1 by Patrick Vossler, Jean Feng, Venkat Sivaraman, Robert Gallo, Hemal Kanzaria, Dana Freiser, Christopher Ross, Amy Ou, James Marks, Susan Ehrlich, Christopher Peabody, Lucas Zier

Hospital Quality Improvement (QI) plays a critical role in optimizing healthcare delivery by translating high-level hospital goals into actionable solutions. A critical step of QI is to identify the key modifiable contributing factors, a process we call QI factor discovery, typically through expert-driven semi-structured qualitative tools like fishbone diagrams, chart reviews, and Lean Healthcare methods. AI has the potential to transform and accelerate QI factor discovery, which is traditionally time- and resource-intensive and limited in reproducibility and auditability. Nevertheless, current AI alignment methods assume the task is well-defined, whereas QI factor discovery is an exploratory, fuzzy, and iterative sense-making process that relies on complex implicit expert judgments. To design an AI pipeline that formalizes the QI process while preserving its exploratory components, we propose viewing the task as learning not only LLM prompts but also the overarching natural-language specifications. In particular, we map QI factor discovery to steps of the classical AI/ML development process (problem formalization, model learning, and model validation) where the specifications are tunable hyperparameters. Domain experts and AI agents iteratively refine both the overarching specifications and AI pipeline until AI extractions are concordant with expert annotations and aligned with clinical objectives. We applied this "Human-AI Spec-Solution Co-optimization" framework at an urban safety-net hospital to identify factors driving prolonged length of stay and unplanned 30-day readmissions. The resulting AI-for-QI pipelines achieved ≥70% concordance with expert annotations. Compared to prior manual Lean analyses, the AI pipeline was substantially more efficient, recovered previous findings, surfaced new modifiable factors, and produced auditable reasoning traces.

Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine

2604.20022v1 by Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
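
The separation described above can be pictured with a small sketch: the language model's only job is to turn an utterance into a structured finding, and a plain Bayes update does the reasoning. Everything below (function names, the two-disease knowledge base, the probabilities, the threshold values) is invented for illustration and is not BMBE's actual engine or data.

```python
def update_belief(prior, likelihood, finding, present):
    """One Bayes step: posterior(d) is proportional to prior(d) * P(evidence | d)."""
    unnormalised = {}
    for disease, p in prior.items():
        p_present = likelihood[disease][finding]
        unnormalised[disease] = p * (p_present if present else 1.0 - p_present)
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return {d: v / total for d, v in unnormalised.items()}


def selective_diagnosis(belief, threshold):
    """Commit to the top hypothesis only if its posterior clears the threshold."""
    best = max(belief, key=belief.get)
    return best if belief[best] >= threshold else None  # None means abstain


# Hypothetical knowledge base: P(finding present | disease).
likelihood = {"flu": {"fever": 0.8}, "cold": {"fever": 0.2}}
belief = {"flu": 0.5, "cold": 0.5}

# Assume the LLM "sensor" parsed the patient's utterance into (finding, present).
belief = update_belief(belief, likelihood, finding="fever", present=True)
print(belief)
print(selective_diagnosis(belief, threshold=0.7))  # commits
print(selective_diagnosis(belief, threshold=0.9))  # abstains
```

Raising the threshold lowers coverage and should raise accuracy on the cases that remain, which is the adjustable accuracy-coverage tradeoff the abstract describes. Swapping in a different `likelihood` table changes the target population without touching the language model.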

scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics

2604.20003v1 by Qifeng Zhou, Lei Yu, Yuzhi Guo, Yuwei Miao, Hehuan Ma, Wenliang Zhong, Lin Xu, Junzhou Huang

The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Moreover, its open-vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co-expression logic is transferable to bulk-omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel-agnostic framework to facilitate scalable biomarker discovery and precision oncology.

Infection-Reasoner: A Compact Vision-Language Model for Wound Infection Classification with Evidence-Grounded Clinical Reasoning

2604.19937v1 by Palawat Busaranuvong, Reza Saadati Fard, Emmanuel Agu, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong

Assessing chronic wound infection from photographs is challenging because visual appearance varies across wound etiologies, anatomical locations, and imaging conditions. Prior image-based deep learning methods have mainly focused on classification with limited interpretability, despite the need for evidence-grounded explanations to support point-of-care decision making. We present Infection-Reasoner, a compact 4B-parameter reasoning vision-language model for chronic wound infection classification and rationale generation. To address the scarcity of expert-labeled wound images with reasoning annotations, Infection-Reasoner is trained using a two-stage pipeline: (1) reasoning distillation, in which GPT-5.1 generates chain-of-thought rationales for unlabeled wound images to initialize wound-specific reasoning in a smaller student model (Qwen3-VL-4B-Thinking), and (2) reinforcement learning post-training with Group Relative Policy Optimization on a small labeled infection dataset to refine classification reasoning. On a held-out heterogeneous wound dataset, Infection-Reasoner achieved 86.8% accuracy, 86.4% sensitivity, and 87.1% specificity, outperforming several strong baselines, including GPT-5.1. Rationale quality was further evaluated using both multimodal large language model (MLLM) judges and wound expert review. Across four MLLM judges, visual-support agreement scores ranged from 0.722 to 0.903, while expert review rated 61.8% of rationales as Correct and 32.4% as Partially Correct.
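
The second training stage relies on Group Relative Policy Optimization, whose core idea is to score each sampled response against the other responses drawn for the same input instead of against a learned value function. The sketch below shows only that normalization step. The binary reward (1 when the predicted infection label is correct) and the use of the population standard deviation are assumptions, not details given in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and spread of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rationales sampled for one wound image,
# rewarded 1.0 when the final label is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Responses that beat their group average receive positive advantages and are reinforced. If every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.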

Depression Risk Assessment in Social Media via Large Language Models

2604.19887v1 by Giorgia Gulino, Manuel Petrucci

Depression is one of the most prevalent and debilitating mental health conditions worldwide, frequently underdiagnosed and undertreated. The proliferation of social media platforms provides a rich source of naturalistic linguistic signals for the automated monitoring of psychological well-being. In this work, we propose a system based on Large Language Models (LLMs) for depression risk assessment in Reddit posts, through multi-label classification of eight depression-associated emotions and the computation of a weighted severity index. The method is evaluated in a zero-shot setting on the annotated DepressionEmo dataset (~6,000 posts) and applied in-the-wild to 469,692 comments collected from four subreddits over the period 2024-2025. Our best model, gemma3:27b, achieves micro-F1 = 0.75 and macro-F1 = 0.70, results competitive with purpose-built fine-tuned models (BART: micro-F1 = 0.80, macro-F1 = 0.76). The in-the-wild analysis reveals consistent and temporally stable risk profiles across communities, with marked differences between r/depression and r/anxiety. Our findings demonstrate the feasibility of a cost-effective, scalable approach for large-scale psychological monitoring.
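
Because the abstract reports both micro-F1 and macro-F1, a short sketch of how the two differ for multi-label predictions may help. The label names and toy predictions below are placeholders chosen for illustration, not data from DepressionEmo.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for multi-label predictions given as sets."""
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, pred in zip(y_true, y_pred):
        for label in labels:
            if label in pred and label in truth:
                tp[label] += 1
            elif label in pred:
                fp[label] += 1
            elif label in truth:
                fn[label] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-label F1 scores, so rare labels count equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool all counts first, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Placeholder labels and toy posts.
labels = ["sadness", "hopelessness"]
y_true = [{"sadness"}, {"sadness"}, {"hopelessness"}]
y_pred = [{"sadness"}, {"sadness", "hopelessness"}, {"sadness"}]
micro, macro = f1_scores(y_true, y_pred, labels)
print(round(micro, 3), round(macro, 3))  # prints 0.571 0.4
```

A macro score below the micro score, as in the reported 0.70 versus 0.75, usually indicates weaker performance on the less frequent labels.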

A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities

2604.19653v1 by Aya Cherigui, Florent Guépin, Arnaud Legendre, Jean-François Couchot

Human mobility data are used in numerous applications, ranging from public health to urban planning. Such data are inherently sensitive, as they can reveal information such as religious beliefs and political affiliations. Historically, techniques such as aggregation, obfuscation, and noise addition have been proposed to modify the data and protect privacy. Because these methods come at a great cost in utility, new methods leveraging advances in generative models have been introduced. The extent to which such methods resolve the privacy-utility trade-off remains an open problem. In this paper, we take a first step towards solving it by introducing and applying a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a major challenge and that it should be tackled through adversarial evaluation, in accordance with current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private because of its resistance to the trajectory user-linking problem.

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

2604.19598v2 by Kihyuk Lee

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
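
The distinction drawn above, between high similarity that comes from consistent content and high similarity that comes from literal repetition, can be sketched with two small measurements. Token-level Jaccard overlap is used here purely as a stand-in for the study's semantic similarity metric, which the abstract does not specify, and the sample outputs are invented.

```python
from itertools import combinations


def unique_output_ratio(outputs):
    """Fraction of generations that are textually distinct."""
    return len(set(outputs)) / len(outputs)


def mean_pairwise_jaccard(outputs):
    """Average token-set overlap over all pairs (needs >= 2 non-empty outputs)."""
    token_sets = [set(text.lower().split()) for text in outputs]
    pairs = list(combinations(token_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Hypothetical repeated generations for one clinical scenario.
outputs = ["walk 30 min", "walk 30 min", "walk 30 min", "cycle 30 min"]
print(unique_output_ratio(outputs))    # prints 0.5
print(mean_pairwise_jaccard(outputs))  # prints 0.75
```

Reporting both numbers side by side exposes the case the study highlights: a model can score high on similarity simply because many of its outputs are exact copies.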

Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

2604.19538v1 by Farbod Zorriassatine, Ahmad Lotfi

Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

2604.19457v1 by Vasundra Srininvasan

Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.
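
One way to picture why separating coverage from accuracy matters is the sketch below, in which an agent may return `None` to abstain. This is a generic selective-prediction measurement written for illustration. The function name and the toy decisions are assumptions, and it should not be read as the paper's definition of CAR.

```python
def coverage_and_selective_accuracy(decisions, truths):
    """Coverage = share of cases committed to; accuracy is over committed cases only."""
    committed = [(d, t) for d, t in zip(decisions, truths) if d is not None]
    coverage = len(committed) / len(decisions)
    if not committed:
        return coverage, float("nan")  # accuracy is undefined with no commitments
    accuracy = sum(d == t for d, t in committed) / len(committed)
    return coverage, accuracy


# Hypothetical loan decisions; None marks an abstention.
decisions = ["approve", None, "deny", "approve"]
truths = ["approve", "deny", "approve", "approve"]
print(coverage_and_selective_accuracy(decisions, truths))
```

An agent that commits on every case, as all six architectures in the study did, always has coverage 1.0, so a single aggregate accuracy figure cannot show whether it would have been safer to abstain on the cases it got wrong.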

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

2604.19281v1 by Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.

Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement

2604.19191v1 by Pritam Kar, Gouri Lakshmi S, Saptarshi Bej

Anomaly detection in medical imaging is essential for identifying rare pathological conditions, particularly when annotated abnormal samples are limited. We propose a hybrid anomaly detection framework that integrates self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this domain. Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined via Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are subsequently computed using Gaussian density estimation in a PCA-reduced latent space, where Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and requires only normal samples for training. Extensive experiments on seven medical imaging datasets demonstrate state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five datasets, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results underscore the potential of the proposed framework as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.
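
The final scoring step described above, a Gaussian fitted to normal samples with Mahalanobis distance as the anomaly score, can be sketched in a few lines. To stay dependency-free the example assumes the features have already been reduced to two dimensions, and it omits the backbone embedding, the PCA step, and the mean shift refinement. The function names and toy points are illustrative assumptions, not the authors' code.

```python
import math


def fit_gaussian_2d(points):
    """Estimate the mean and sample covariance of 2-D normal-class features."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_score(point, mean, cov):
    """Distance from the normal distribution; larger means more anomalous."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Closed-form inverse of the 2x2 covariance applied to (dx, dy).
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)


# Toy "normal" features; only normal samples are needed for fitting.
normal = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
mean, cov = fit_gaussian_2d(normal)
print(mahalanobis_score((0.5, 0.5), mean, cov))  # at the mean
print(mahalanobis_score((5.0, 5.0), mean, cov))  # far from the normal cluster
```

Because the distance is scaled by the covariance, a deviation along a direction where normal samples vary little counts for more than the same deviation along a high-variance direction.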

摘要:醫學影像中的異常檢測對於識別罕見的病理狀況至關重要,特別是在標註的異常樣本有限的情況下。我們提出了一種混合異常檢測框架,將自我監督的表示學習與基於流形的密度估計相結合,這一組合在該領域仍然基本未被探索。
醫學影像首先使用預訓練的、潛在的特定領域的骨幹嵌入到潛在特徵空間中。這些表示隨後通過均值漂移密度增強(MSDE)進行精煉,這是一種迭代的流形轉移過程,將樣本移動到更高可能性的區域。隨後,使用在PCA降維的潛在空間中的高斯密度估計計算異常分數,其中馬哈拉諾比斯距離衡量與學習到的正常分佈的偏差。該框架遵循單類學習範式,僅需要正常樣本進行訓練。
在七個醫學影像數據集上進行的大量實驗展示了最先進的性能。MSDE在四個數據集上達到了最高的AUC,在五個數據集上達到了最高的平均精度,包括在腦腫瘤檢測中接近完美的表現(0.981 AUC/AP)。這些結果強調了所提出框架作為可擴展的臨床決策支持工具的潛力,適用於早期疾病檢測、低標籤環境中的篩查,以及在多樣影像模態中的穩健部署。
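
The final scoring step described in the abstract above (fit a Gaussian to normal-only features, then score new samples by Mahalanobis distance) can be illustrated with a minimal sketch. This is not the authors' MSDE implementation: it assumes the features have already been embedded and PCA-reduced to two dimensions, it omits the mean-shift refinement entirely, and the function names and toy points are hypothetical.

```python
import math


def fit_gaussian_2d(points):
    """Estimate the mean and sample covariance of 2-D feature vectors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of x from a 2-D Gaussian (closed-form 2x2 inverse)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # inverse of [[sxx, sxy], [sxy, syy]] is (1/det) * [[syy, -sxy], [-sxy, sxx]]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(d2, 0.0))


# Hypothetical "normal" training features (one-class setting: no anomalies used).
normal_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
mean, cov = fit_gaussian_2d(normal_features)

score_typical = mahalanobis_2d((0.5, 0.5), mean, cov)
score_outlier = mahalanobis_2d((3.0, 3.0), mean, cov)
print(score_typical, score_outlier)
```

A larger distance means the sample lies further from the learned normal distribution, so it is treated as more anomalous. In a real pipeline the covariance would be estimated in the full PCA-reduced space with a linear-algebra library.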

Regulating Artificial Intimacy: From Locks and Blocks to Relational Accountability

2604.18893v1 by Henry Fraser, Jessica M. Szczuka, Raffaele F. Ciriello

A series of high-profile tragedies involving companion chatbots has triggered an unusually rapid regulatory response. Several jurisdictions, including Australia, California, and New York, have introduced enforceable regulation, while regulators elsewhere have signaled growing concern about risks posed by companion chatbots, particularly to children. In parallel, leading providers, notably OpenAI, appear to have strengthened their self-regulatory approaches. Drawing on legal textual analysis and insights from regulatory theory, psychology, and information systems research, this paper critically examines these recent interventions. We examine what is regulated and who is regulated, identifying regulatory targets, scope, and modalities. We classify interventions by method and priority, showing how emerging regimes combine "locks and blocks", such as access gating and content moderation, with measures addressing toxic relationship features and process-based accountability requirements. We argue that effective regulation of companion chatbots must integrate all three dimensions. More, however, is required. Current regimes tend to focus on discrete harms, narrow conceptions of vulnerability, or highly specified accountability processes, while failing to confront deeper power asymmetries between providers and users. Providers of companion chatbots increasingly control artificial intimacy at scale, creating unprecedented opportunities for control through intimacy. We suggest that a general, open-ended duty of care would be an important first step toward constraining that power and addressing a fundamental source of chatbot risk. The paper contributes to debates on companion chatbot regulation and is relevant to regulators, platform providers, and scholars concerned with digital intimacy, law and technology, and fairness, accountability, and transparency in sociotechnical systems.

摘要:一系列涉及伴侶聊天機器人的高調悲劇引發了異常迅速的監管反應。包括澳大利亞、加州和紐約在內的幾個司法管轄區已經引入了可執行的監管措施,而其他地區的監管機構則對伴侶聊天機器人所帶來的風險,特別是對兒童的風險,表示出日益增長的擔憂。與此同時,主要提供商,尤其是OpenAI,似乎已經加強了他們的自我監管措施。本論文基於法律文本分析以及監管理論、心理學和信息系統研究的見解,對這些近期的干預措施進行了批判性檢視。我們考察了什麼被監管以及誰被監管,識別監管目標、範圍和方式。我們根據方法和優先級對干預措施進行分類,展示新興制度如何將“鎖和阻擋”相結合,例如訪問門檻和內容審核,與解決有毒關係特徵和基於過程的問責要求的措施相結合。我們認為,對伴侶聊天機器人的有效監管必須整合這三個維度。然而,更需要的是,目前的制度往往專注於離散的傷害、狹隘的脆弱性概念或高度具體的問責過程,同時未能面對提供商與用戶之間更深層的權力不對稱。伴侶聊天機器人的提供商越來越大規模地控制人工親密性,創造了通過親密性進行控制的前所未有的機會。我們建議,一項一般性、開放式的關懷責任將是限制這種權力並解決聊天機器人風險根本來源的重要第一步。本論文對伴侶聊天機器人監管的辯論作出了貢獻,並對關注數字親密性、法律與技術以及社會技術系統中的公平、問責和透明度的監管者、平台提供商和學者具有相關性。

REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction

2604.18757v1 by Seowung Leem, Lin Gu, Chenyu You, Kuang Gong, Ruogu Fang

The retina provides a unique, noninvasive window into Alzheimer's disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer's Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.

摘要:視網膜提供了一個獨特的、非侵入性的窗口,讓我們了解阿茲海默症(AD)和癡呆症,通過形態計量特徵捕捉早期的結構變化,而系統性和生活方式的風險因素則反映了在臨床症狀出現之前,對疾病易感性的已知貢獻者。然而,當前的視網膜分析框架通常將影像和風險因素分開建模,限制了它們捕捉對早期風險預測至關重要的聯合多模態模式的能力。此外,現有方法很少納入組織或對齊具有相似視網膜和臨床特徵的患者的機制,限制了對一致的跨模態關聯的學習。為了解決這些限制,我們引入了REVEAL(REtinal-risk Vision-Language Early Alzheimer's Learning),這是一個將彩色眼底攝影與個體化的疾病特定風險概況對齊的框架,用於預測阿茲海默症和癡呆症的發生,平均在診斷前8年(範圍:1-11年)。由於現實世界的風險因素是結構化的問卷數據,我們將它們轉換為與預訓練的視覺-語言模型(VLMs)兼容的臨床可解釋敘事。我們進一步提出了一種群體感知對比學習(GACL)策略,將具有相似視網膜形態學和風險因素的患者聚類為正樣本對,增強多模態對齊。這一統一的表示學習框架在性能上顯著超越了與臨床文本編碼器配對的最先進的視網膜影像模型,以及通用的VLMs,顯示了聯合建模視網膜生物標記和臨床風險因素的價值。通過提供一種可概括且非侵入性的早期AD和癡呆症風險分層方法,REVEAL有潛力促進更早的干預並改善人口層面的預防護理。

Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

2604.18753v1 by Andrew Wang, Ellie Pavlick, Ritambhara Singh

An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. As clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal via diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by re-framing clinical diagnosis as an autoregressive sequence modeling task, utilizing causal decoders from large language models (LLMs) to model a patient's multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities in datasets with missingness in a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to move beyond performance boosts and find that across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.

摘要:在為醫療保健開發多模態機器學習(ML)模型的過程中,一個活躍的挑戰是處理訓練和部署期間缺失的模態。由於臨床數據集本質上是時間性的,並且在模態存在方面稀疏,因此通過診斷多模態 ML 模型捕捉潛在的預測信號,同時保持模型的可解釋性,仍然是一個持續的挑戰。在這項工作中,我們通過將臨床診斷重新框架為自回歸序列建模任務來解決這個問題,利用大型語言模型(LLMs)中的因果解碼器來建模患者的多模態軌跡。我們首先介紹了一種考慮缺失性的對比預訓練目標,該目標在具有缺失性的數據集中將多個模態整合到共享潛在空間中。然後,我們展示了基於Transformer架構的自回歸序列建模在 MIMIC-IV 和 eICU 微調基準測試中超越了基準。我們最後使用可解釋性技術超越性能提升,發現隨著各種患者住院的進展,去除模態會導致不同的行為,而我們的對比預訓練可以減輕這種情況。通過將臨床診斷抽象為序列建模並解釋患者住院軌跡,我們開發了一個框架來描述和處理缺失模態,同時解決安全、透明的臨床 AI 的基本願望。

A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

2604.18570v2 by Andrew Zhang, Tong Ding, Sophia J. Wagner, Caiwei Tian, Ming Y. Lu, Rowland Pettit, Joshua E. Lewis, Alexandre Misrahi, Dandan Mo, Long Phi Le, Faisal Mahmood

Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This "atlas of medical concepts" forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.

摘要:現代醫學在孤立的系統中產生大量的多模態數據,但目前沒有任何現有模型能將臨床記錄的全部範圍和時間深度整合成統一的患者表徵。我們介紹了Apollo,一個多模態時間基礎模型,該模型在美國一家主要醫院系統的三十多年長期住院記錄上進行訓練和評估,這些記錄包含來自720萬名患者的250億條記錄,代表28種不同的醫療模態和12個主要醫療專科。Apollo學習了一個統一的表徵空間,整合了我們臨床詞彙中超過10萬個獨特的醫療事件,以及影像和臨床文本。這個“醫療概念地圖”形成了一個計算基底,用於建模整個患者護理旅程,這些旅程由結構化和非結構化事件的序列組成,Apollo將其壓縮為虛擬患者表徵。為了評估這些整體患者表徵的潛力,我們從140萬名患者的保留測試集中創建了322個預後和檢索任務。我們展示了Apollo嵌入的通用臨床預測潛力,包括預測新疾病發作風險最多提前五年(95個任務)、疾病進展(78個任務)、治療反應(59個任務)、治療相關不良事件風險(17個任務)和醫院運營結束點(12個任務)。利用特徵歸因技術,我們顯示模型預測與臨床可解釋的多模態生物標誌物相一致。我們在61個檢索任務上評估了語義相似性搜索,並進一步展示了Apollo作為多模態醫療搜索引擎的潛力,使用文本和圖像查詢。這些建模能力共同建立了可計算醫學的基礎,使患者護理的完整上下文能夠被計算推理所訪問。

ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification

2604.18444v1 by Florian Kittler, Sheethal Bhat, Andreas Maier

Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.

摘要:零-shot 視覺-語言模型 (VLMs) 在胸部放射線影像分類方面顯示出潛力,但其性能常常受到混淆標籤共現、長尾類別不平衡和在領域轉移下的轉移不穩定性限制。我們提出了 ProtoCLIP,一種針對 CLIP 風格 VLM 的精煉策略,通過有針對性地數據策展和提煉錨點對齊來改善零-shot 判別。具體而言,我們構建了以病理為重點的訓練子集,並策展了負樣本,以減少共現偏差。我們還引入了一個保持表示的蒸餾目標,以穩定適應,同時保持語義結構並改善臨床相關共現病理的判別。在未見數據集 VinDr-CXR 上進行評估,ProtoCLIP 在多個發現上提高了 AUC 2-10 個百分點,超過了一個強大的基於 CLIP 的基準。特別是對於氣胸,ProtoCLIP 實現了 0.94 的最先進 AUC。這些結果表明,錨點引導的精煉,結合策展的監督和受控的適應,可以減輕醫療 VLM 中常見的零-shot 轉移失敗,而無需大規模的重新訓練。
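
For readers unfamiliar with the zero-shot setup that ProtoCLIP refines, the sketch below shows how a CLIP-style model typically turns image and text-prompt embeddings into class probabilities. It does not implement ProtoCLIP's curation or distillation. The embeddings, prompt labels, and the temperature value of 0.07 are all hypothetical placeholders chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(image_emb, prompt_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities to each text prompt."""
    logits = {label: cosine(image_emb, emb) / temperature
              for label, emb in prompt_embs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Hypothetical embeddings; a real system would obtain these from the VLM encoders.
image_embedding = [0.9, 0.1, 0.0]
prompt_embeddings = {
    "pneumothorax": [1.0, 0.0, 0.0],
    "no pneumothorax": [0.0, 1.0, 0.0],
}
probs = zero_shot_probs(image_embedding, prompt_embeddings)
print(probs)
```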

Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support

2604.18302v1 by Eranga Bandara, Asanga Gunaratna, Ross Gore, Anita H. Clayton, Christopher K. Rhea, Sachini Rajapakse, Isurunima Kularathna, Sachin Shetty, Ravi Mukkamala, Xueping Liang, Preston Samuel, Atmaram Yarlagadda

Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare -- particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution -- ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs -- Gemma, Phi-3.5-mini, and Qwen2 -- selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.

摘要:隱私代表了在心理健康護理中人工智慧採用的最關鍵但卻未被充分解決的障礙之一——特別是在軍事、矯正和遠程醫療等高敏感度操作環境中,患者數據暴露的風險可能完全阻礙尋求幫助的行為。現有的人工智慧輔助精神病決策支持系統主要依賴雲端推理管道,這要求敏感的患者數據離開設備並經過外部伺服器,從而在這些環境中造成不可接受的隱私和安全風險。在本文中,我們提出了一個零外洩的、基於設備的人工智慧平台,用於隱私保護的精神病決策支持,作為跨平台的移動應用程序部署。所提出的系統擴展了我們之前在精神病診斷標準化方面的精調大型語言模型聯盟的工作,通過根本性地重新架構推理管道以實現完全本地執行——確保在任何階段都不會將患者數據傳輸到、處理或存儲在任何外部伺服器上。該平台整合了三個輕量級、經過精調和量化的開源大型語言模型——Gemma、Phi-3.5-mini和Qwen2——這些模型因其緊湊的架構和在資源受限的移動硬體上的證明效率而被選中。一個基於設備的協調層協調集成推理和基於共識的診斷推理,生成與DSM-5對齊的條件評估。該平台旨在協助臨床醫生進行鑑別診斷和證據鏈接的症狀映射,並支持患者自我篩查,並提供適當的臨床保障。初步評估表明,所提出的零外洩部署在診斷準確性上與其伺服器端前身相當,同時在商用移動硬體上保持實時推理延遲。

Style-Based Neural Architectures for Real-Time Weather Classification

2604.18251v1 by Hamed Ouattara, Pascal Houssam Salmane, Pierre Duthon, Frédéric Bernardin, Omar Ait Aider

In this paper, we present three neural network architectures designed for real-time classification of weather conditions (sunny, rain, snow, fog) from images. These models, inspired by recent advances in style transfer, aim to capture the stylistic elements present in images. One model, called "Multi-PatchGAN", is based on PatchGANs used in well-known architectures such as Pix2Pix and CycleGAN, but here adapted with multiple patch sizes for detection tasks. The second model, "Truncated ResNet50", is a simplified version of ResNet50 retaining only its first nine layers. This truncation, determined by an evolutionary algorithm, facilitates the extraction of high-frequency features essential for capturing subtle stylistic details. Finally, we propose "Truncated ResNet50 with Gram Matrix and Attention", which computes Gram matrices for each layer during training and automatically weights them via an attention mechanism, thus optimizing the extraction of the most relevant stylistic expressions for classification. These last two models outperform the state of the art and demonstrate remarkable generalization capability on several public databases. Although developed for weather detection, these architectures are also suitable for other appearance-based classification tasks, such as animal species recognition, texture classification, disease detection in medical imaging, or industrial defect identification.

摘要:在本文中,我們提出了三種神經網絡架構,旨在從圖像中實時分類天氣條件(晴天、雨天、雪天、霧天)。這些模型受到近期風格轉換進展的啟發,旨在捕捉圖像中的風格元素。第一個模型稱為「Multi-PatchGAN」,基於在知名架構如Pix2Pix和CycleGAN中使用的PatchGAN,但在這裡為檢測任務調整為多種補丁大小。第二個模型「Truncated ResNet50」是ResNet50的簡化版本,只保留其前九層。這種截斷是由進化算法決定的,有助於提取對捕捉微妙風格細節至關重要的高頻特徵。最後,我們提出了「Truncated ResNet50 with Gram Matrix and Attention」,該模型在訓練期間為每一層計算Gram矩陣,並通過注意力機制自動加權,從而優化最相關風格表達的提取以進行分類。這最後兩個模型超越了當前的技術水平,並在幾個公共數據庫上展示了卓越的泛化能力。雖然這些架構是為了天氣檢測而開發的,但它們也適用於其他基於外觀的分類任務,如動物物種識別、紋理分類、醫學影像中的疾病檢測或工業缺陷識別。
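
The Gram matrices mentioned in the abstract above are the standard style-transfer statistic: the inner products between channel activations of one layer. A minimal sketch follows, assuming each channel's feature map has been flattened to a list and that the matrix is normalized by the number of spatial positions (a common convention, though the paper's exact normalization is not stated here). The attention-based weighting across layers is not shown.

```python
def gram_matrix(features):
    """Return the C x C Gram matrix of C flattened channel activations.

    features: list of C channels, each a list of N activations.
    Entry (i, j) is the dot product of channels i and j divided by N.
    """
    n = len(features[0])
    if any(len(channel) != n for channel in features):
        raise ValueError("all channels must have the same length")
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


# Toy layer output: 3 channels, 4 spatial positions each (hypothetical values).
layer_features = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
]
gram = gram_matrix(layer_features)
print(gram)  # [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5], [0.5, 0.5, 1.0]]
```

Because the Gram matrix discards spatial arrangement and keeps only channel co-activation, it captures texture-like "style" cues, which is plausibly why it suits appearance-based tasks such as weather classification.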

Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

2604.18234v1 by Lorenz Brehme, Thomas Ströhle, Ruth Breu

Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be most effectively evaluated in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google demonstrate that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The performance gains are most pronounced in models with larger parameter counts and longer context windows, while single-hop queries show minimal sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. To ensure reproducibility, we provide the complete data of our experiments at https://github.com/lorenzbrehme/CARE.

摘要:檢索增強生成(RAG)透過外部知識增強大型語言模型(LLMs),以更準確地回答問題。然而,對於評估RAG系統的研究——特別是檢索器組件——仍然有限,因為大多數現有的工作專注於單一上下文檢索,而非多跳查詢,在這種情況下,單獨的上下文可能看起來無關緊要,但在結合時卻至關重要。在本研究中,我們使用HotPotQA、MuSiQue和SQuAD數據集來模擬RAG系統,並比較三種LLM作為評估者的評估策略,包括我們提出的上下文感知檢索器評估(CARE)。我們的目標是更好地理解如何在RAG系統中最有效地評估多跳推理。來自OpenAI、Meta和Google的LLM實驗表明,CARE在評估RAG系統中的多跳推理方面始終優於現有方法。性能提升在參數較多和上下文窗口較長的模型中最為明顯,而單跳查詢對上下文感知評估的敏感度則較低。總體而言,結果突顯了上下文感知評估在提高檢索增強生成系統的可靠性和準確性方面的關鍵作用,特別是在複雜查詢場景中。為了確保可重複性,我們在https://github.com/lorenzbrehme/CARE提供了我們實驗的完整數據。

Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

2604.18161v1 by Ku Onoda, Paavo Parmas, Manato Yaguchi, Yutaka Matsuo

In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.

摘要:在策略梯度強化學習中,訪問可微分模型使得一階梯度估計成為可能,這加速了學習,相較於僅依賴無導數的零階估計器。然而,不連續的動態會造成偏差,並削弱一階估計器的有效性。先前的研究通過在REINFORCE零階梯度估計器周圍構建置信區間來解決這一偏差,並利用這些界限來檢測不連續性。然而,REINFORCE估計器以噪聲著稱,我們發現這種方法需要特定任務的超參數調整,並且樣本效率低下。本文探討這種偏差是否是主要障礙,以及哪些最小的修正措施足夠。首先,我們重新檢視先前工作的標準不連續設置,並引入DDCG,一種在不光滑區域切換估計器的輕量級測試;通過一個超參數,DDCG實現了穩健的性能,並在小樣本下保持可靠。其次,在可微分的機器人控制任務中,我們提出了IVW-H,一種逐步的逆方差實現,該實現穩定了方差而無需明確的不連續性檢測,並產生了強勁的結果。綜合這些發現表明,雖然估計器切換在控制研究中提高了穩健性,但在實際部署中,仔細的方差控制往往占主導地位。
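
The idea of inverse-variance weighting that the abstract above refers to can be sketched generically: two unbiased estimates of the same gradient are averaged with weights proportional to the reciprocal of their variances, so the noisier estimator contributes less. The code below is a textbook illustration for a scalar parameter, not the paper's IVW-H or DDCG; the function name and sample values are hypothetical.

```python
from statistics import mean, variance


def inverse_variance_combine(first_order, zeroth_order, eps=1e-12):
    """Combine two sets of scalar gradient samples by inverse-variance weighting.

    Returns (combined_gradient, weight_on_first_order).
    """
    # Variance of each sample mean: sample variance divided by sample count.
    var_first = variance(first_order) / len(first_order) + eps
    var_zeroth = variance(zeroth_order) / len(zeroth_order) + eps
    precision_first = 1.0 / var_first
    precision_zeroth = 1.0 / var_zeroth
    w_first = precision_first / (precision_first + precision_zeroth)
    combined = w_first * mean(first_order) + (1.0 - w_first) * mean(zeroth_order)
    return combined, w_first


# Hypothetical samples: a low-noise 1st-order estimator and a noisy 0th-order one.
first_order_samples = [1.0, 1.2, 0.8, 1.0]
zeroth_order_samples = [3.0, -1.0, 2.0, 0.0]
grad, w_first = inverse_variance_combine(first_order_samples, zeroth_order_samples)
print(grad, w_first)
```

When the first-order estimator becomes unreliable, for example near a discontinuity where its empirical variance grows sharply, the same rule shifts weight toward the zeroth-order estimate without an explicit discontinuity test.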

Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

2604.18145v1 by Cong Huy Nguyen, Son Dinh Nguyen, Guanlin Li, Tuan Dung Nguyen, Aditya Narayan Sankaran, Mai Huy Thong, Thanh Trung Nguyen, Mai Hong Son, Reza Farahbakhsh, Phi Le Nguyen, Noel Crespi

Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.

摘要:自動化醫療報告生成對於3D PET/CT影像的挑戰根本上來自於體積數據的高維特性以及註釋數據集的嚴重匱乏,特別是對於低資源語言。當前的黑箱方法將整個體積映射到報告,忽略了分析局部感興趣區域(RoIs)以得出診斷結論的臨床工作流程。在本文中,我們通過引入VietPET-RoI來填補這一空白,這是第一個針對低資源語言的大規模3D PET/CT數據集,具有精細的RoI註釋,包括600個PET/CT樣本和1,960個手動註釋的RoIs,並配有相應的臨床報告。此外,為了展示這個數據集的實用性,我們提出了HiRRA,一個新穎的框架,通過採用基於圖的關係模塊來捕捉RoI屬性之間的依賴,模擬專業放射科醫生的診斷工作流程。這種方法從全球模式匹配轉向局部臨床發現。此外,我們引入了新的臨床評估指標,即RoI覆蓋率和RoI質量指數,這些指標使用基於LLM的提取來測量RoI定位準確性和屬性描述的真實性。廣泛的評估表明,我們的框架達到了SOTA性能,在BLEU上超越現有模型19.7%,在ROUGE-L上超越4.7%,同時在臨床指標上實現了驚人的45.8%的改進,顯示出增強的臨床可靠性和減少的幻覺。我們的代碼和數據集已在GitHub上發布。

Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning

2604.19823v1 by Khalil Akremi, Mariem Handous, Zied Bouslama, Farah Bassalah, Maryem Jebali, Mariem Hanachi, Ines Abdeljaoued-Tej

Rabies remains a major public health concern across many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold standard diagnostic methods rely heavily on fluorescence microscopy, necessitating skilled laboratory personnel for the accurate interpretation of results. Such expertise is often scarce, particularly in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline utilizing fluorescent image analysis through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). Three distinct data augmentation strategies were evaluated to enhance model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results demonstrate that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, utilizing Geometric & Color augmentation and selected through stratified 3-fold cross-validation, achieved optimal classification performance on cropped images. Despite constraints posed by class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection with significant potential for further optimization. An online tool was deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.

摘要:狂犬病在許多非洲和亞洲國家仍然是一個主要的公共健康問題,準確的診斷對於有效的流行病學監測至關重要。黃金標準的診斷方法在很大程度上依賴於螢光顯微鏡,這需要熟練的實驗室人員來準確解讀結果。這種專業知識往往稀缺,尤其是在年樣本量較低的地區。本文提出了一種自動化的、基於人工智慧的診斷系統,旨在解決這些挑戰。我們開發了一個穩健的流程,通過轉移學習利用四種深度學習架構進行螢光影像分析:EfficientNetB0、EfficientNetB2、VGG16 和 Vision Transformer (ViTB16)。我們評估了三種不同的數據增強策略,以提高模型在155張顯微鏡影像(123張陽性和32張陰性)數據集上的泛化能力。我們的結果顯示,TrivialAugmentWide 是最有效的增強技術,因為它在改善模型穩健性的同時保留了關鍵的螢光模式。使用幾何和顏色增強的 EfficientNetB0 模型,通過分層三折交叉驗證選擇,實現了在裁剪影像上的最佳分類性能。儘管受到類別不平衡和數據集大小限制的挑戰,這項工作證實了深度學習在自動化狂犬病診斷中的可行性。所提出的方法實現了快速且可靠的檢測,並具有進一步優化的重大潛力。還部署了一個在線工具以促進實際訪問,為未來的醫學影像應用建立了一個框架。本研究強調了優化的深度學習模型在轉變狂犬病診斷和改善公共健康結果方面的潛力。

First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows

2604.18038v1 by Sihao Xing, Zaur Gouliev

Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.

摘要:大型語言模型(LLMs)在臨床環境中的使用日益增加,這引發了對生成的醫療文本和臨床推理中的種族偏見的擔憂。現有研究已經識別出醫療LLMs中的偏見,但許多研究專注於單一模型,對於減輕偏見的關注較少。本研究使用歐盟人工智慧法案作為治理視角,評估五個廣泛使用的LLMs在兩個任務中的表現,即合成病人案例生成和鑑別診斷排名。利用美國的種族分層流行病學分佈和專家鑑別診斷清單作為基準,我們應用結構化提示模板和雙部分評估設計來檢查隱性和顯性種族偏見。在合成案例生成任務中,所有模型均偏離了觀察到的種族分佈,其中GPT-4.1的整體偏差最小。在鑑別診斷任務中,DeepSeek V3在報告的指標中產生了最強的整體結果。當嵌入到一個自主工作流程中時,DeepSeek V3在平均p值上改善了0.0348,在中位數p值上改善了0.1166,在平均差異上改善了0.0949,相對於獨立模型,儘管在每個指標上的改善並不均勻。這些發現支持對醫療環境中使用的AI系統進行多指標偏見評估,並表明基於檢索的自主工作流程可能減少基準診斷任務中的某些顯性偏見。詳細的提示模板、實驗數據集和代碼管道可在我們的GitHub上獲得。

AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis

2604.17846v1 by Nathasha Naranpanawa, Maree T. Izatt, Robert D. Labrom, Geoffrey N. Askin, J. Paige Little

MRI is preferred over CT in paediatric imaging because it avoids ionising radiation, but its use in spine deformity assessment is largely limited by the lack of automated, high-resolution 3D bony reconstruction, which continues to rely on CT. MRI-based 3D reconstruction remains impractical due to manual workflows and the scarcity of labelled full-spine datasets. This study introduces an AI framework that enables fully automated thoracolumbar spine (T1-L5) segmentation and 3D reconstruction from MRI alone. Historical low-dose CT scans from adolescent idiopathic scoliosis (AIS) patients were converted into MRI-like images using a GAN and combined with existing labelled thoracic MRI data to train a U-Net-based model. The resulting algorithm accurately generated continuous thoracolumbar 3D reconstructions, improved segmentation accuracy (88% Dice score), and reduced processing time from approximately 1 hour to under one minute, while preserving AIS-specific deformity features. This approach enables radiation-free 3D deformity assessment from MRI, supporting clinical evaluation, surgical planning, and navigation in paediatric spine care.

摘要:MRI 在兒童影像學中較 CT 更受青睞,因為它避免了電離輻射,但在脊柱畸形評估中的應用主要受到缺乏自動化、高解析度 3D 骨重建的限制,這仍然依賴於 CT。基於 MRI 的 3D 重建因手動工作流程和標註完整脊柱數據集的稀缺而仍然不切實際。本研究介紹了一個 AI 框架,能夠從 MRI 單獨實現完全自動化的胸腰脊柱 (T1-L5) 分割和 3D 重建。來自青少年特發性脊柱側彎 (AIS) 患者的歷史低劑量 CT 掃描被轉換為類似 MRI 的影像,並與現有的標註胸部 MRI 數據結合,以訓練基於 U-Net 的模型。所生成的算法準確地生成了連續的胸腰 3D 重建,提高了分割準確性 (88% Dice 分數),並將處理時間從約 1 小時縮短至不到 1 分鐘,同時保留了特定於 AIS 的畸形特徵。這種方法使得從 MRI 進行無輻射的 3D 畸形評估成為可能,支持臨床評估、手術規劃和兒童脊柱護理中的導航。
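
The Dice score reported in the abstract above measures overlap between a predicted and a reference segmentation mask: twice the size of their intersection divided by the sum of their sizes. A minimal sketch for binary masks flattened to lists follows. The masks are toy values, and treating two empty masks as a perfect match is a convention assumed here, not something the paper specifies.

```python
def dice_score(pred, target):
    """Dice coefficient between two equally sized binary masks (flat 0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: counted as perfect agreement
    return 2.0 * intersection / total


# Toy example: 8 voxels, 3 of which are labelled foreground in both masks.
predicted_mask = [1, 1, 1, 0, 0, 0, 1, 0]
reference_mask = [1, 1, 0, 0, 0, 0, 1, 1]
print(dice_score(predicted_mask, reference_mask))  # 0.75
```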

MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

2604.17730v1 by Suhyun Lee, Palakorn Achananuparp, Neemesh Yadav, Ee-Peng Lim, Yang Deng

Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.

摘要:大型語言模型(LLMs)越來越多地被探索作為可擴展的心理健康諮詢工具,然而,由於臨床傷害的互動性和情境依賴性,評估它們的安全性仍然具有挑戰性。現有的評估框架主要使用粗糙的分類法或靜態數據集來評估孤立的反應,這限制了它們診斷傷害如何在多輪諮詢互動中出現和累積的能力。在這項工作中,我們介紹了 R-MHSafe,一種角色感知的心理健康安全分類法,根據 AI 諮詢師所採取的互動角色(包括施害者、煽動者、促進者或使能者)來描述臨床上重要的傷害,並結合臨床基礎的傷害類別。然後,我們提出了 MHSafeEval,一個閉環的基於代理的評估框架,將安全評估公式化為通過對抗性多輪互動的傷害軌跡級別發現,並以角色感知建模為指導。使用 R-MHSafe 和 MHSafeEval,我們對最先進的 LLMs 進行了大規模評估。我們的結果揭示了顯著的角色依賴性和累積性安全失敗,這些失敗在現有的靜態基準中被系統性地忽略,並顯示我們的框架顯著提高了失敗模式的覆蓋率和診斷的細緻度。

RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

2604.17725v1 by Arya Hadizadeh Moghaddam, Drew Ross, Mohsen Nayebi Kerdabadi, Dongjie Wang, Zijun Yao

Large Language Models (LLMs) have shown strong promise for mining Electronic Health Records (EHRs) by reasoning over longitudinal clinical information to capture context-rich patient trajectories. However, leveraging LLMs for structured EHRs (e.g., standardized diagnosis and medication codes) presents two key challenges. First, translating time-stamped EHR sequences into plain text can obscure both temporal structure and code identities, weakening the ability to capture code co-occurrence and longitudinal regularities. Second, unlike cohort-trained predictive models that learn a shared, task-aligned representation space across patients, LLMs are often applied in a case-isolated inference setting where each patient is processed independently without leveraging population-level patterns. To address these challenges, we introduce RePrompT, a time-aware LLM framework that integrates structured EHR encoders through prompt tuning, without modifying underlying architectures. Specifically, RePrompT recurrently incorporates latent states from prior visits to preserve longitudinal information, and injects population-level information through trainable prompt tokens derived from a cohort-trained, task-aligned EHR encoder. Experiments on MIMIC-III and MIMIC-IV demonstrate that RePrompT consistently outperforms both EHR-based and LLM-based baselines across multiple clinical prediction tasks.

摘要:大型語言模型(LLMs)在挖掘電子健康紀錄(EHRs)方面顯示出強大的潛力,通過推理長期臨床信息來捕捉豐富的患者軌跡。然而,利用LLMs處理結構化EHR(例如,標準化診斷和藥物代碼)面臨兩個主要挑戰。首先,將帶有時間戳的EHR序列轉換為純文本可能會模糊時間結構和代碼身份,削弱捕捉代碼共現和長期規律的能力。其次,與學習共享、任務對齊表示空間的隊列訓練預測模型不同,LLMs通常在案例孤立的推斷環境中應用,其中每位患者獨立處理,而不利用人口層面的模式。為了解決這些挑戰,我們介紹了RePrompT,一個時間感知的LLM框架,通過提示調整整合結構化EHR編碼器,而不修改底層架構。具體而言,RePrompT重複性地整合來自先前訪問的潛在狀態,以保留長期信息,並通過可訓練的提示標記注入來自隊列訓練的、任務對齊的EHR編碼器的群體級信息。在MIMIC-III和MIMIC-IV上的實驗表明,RePrompT在多個臨床預測任務中始終優於基於EHR和基於LLM的基準。

Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals

2604.17714v1 by Jon-Paul Cacioli

LLM confidence signals are used for abstention, routing, and safety-critical decisions. No standard practice exists for checking whether a confidence signal carries item-level information before building on it. We transfer the validity screening principle from clinical personality assessment (PAI, MMPI-3) as a portable protocol for benchmark-based LLM confidence data. The protocol specifies three core indices (L, Fp, RBS), a structural indicator (TRIN), and an item-sensitivity statistic, computed from a single 2x2 contingency table. A three-tier classification system (Invalid, Indeterminate, Valid) draws on four clinical traditions. Validated on 20 frontier LLMs across 524 items, four models are classified Invalid, two Indeterminate. Valid-profile models show mean r = .18 (15/16 significant). Invalid-profile models show mean r = -.20 (d = 2.48). Cross-benchmark validation on 18 models using MMLU with verbalized confidence and on external data from Yang et al. (2024) confirms the screen transfers across benchmarks and probe formats. All data and code: https://github.com/synthiumjp/validity-scaling-llm

摘要:LLM 信心信號用於自我放棄、路由和安全關鍵決策。尚無標準做法來檢查信心信號是否攜帶項目級別的信息,然後再基於此進行構建。我們將臨床人格評估(PAI, MMPI-3)的有效性篩選原則轉移為基於基準的 LLM 信心數據的可攜式協議。該協議指定了三個核心指標(L, Fp, RBS)、一個結構指標(TRIN)和一個項目敏感性統計,這些都是從單個 2x2 交叉表中計算得出的。三層分類系統(無效、不確定、有效)借鑒了四個臨床傳統。在 524 個項目中對 20 個前沿 LLM 進行驗證,四個模型被分類為無效,兩個為不確定。有效型模型顯示平均 r = .18(15/16 顯著)。無效型模型顯示平均 r = -.20(d = 2.48)。在 18 個模型上進行的跨基準驗證,使用 MMLU 進行口頭信心和來自 Yang et al. (2024) 的外部數據,確認了篩選在基準和探測格式之間的轉移。所有數據和代碼: https://github.com/synthiumjp/validity-scaling-llm

Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report

2604.17707v1 by Jon-Paul Cacioli

Clinical personality assessment screens response validity before interpreting substantive scales. LLM evaluation does not. We apply the validity scaling framework from the PAI and MMPI-3 to metacognitive probe data from 20 frontier models across 524 items. Six validity indices are operationalised: L (maintaining confidence on errors), K (betting on errors), F (withdrawing consensus-endorsed items), Fp (withdrawing correct answers), RBS (inverted monitoring), and TRIN (fixed responding). A tiered classification system identifies four models as construct-level invalid and two as elevated. Valid-profile models produce item-sensitive confidence (mean r = .18, 14 of 16 significant). Invalid-profile models do not (mean r = -.20, d = 2.17, p = .001). Chain-of-thought training produces two opposite response distortions. Two latent dimensions account for 94.6% of index variance. Companion papers extract a portable screening protocol (Cacioli, 2026e) and validate it against selective prediction (Cacioli, 2026f). All data and code: https://github.com/synthiumjp/validity-scaling-llm

摘要:臨床人格評估在解釋實質量表之前會篩選反應有效性。LLM 評估則不會。我們將 PAI 和 MMPI-3 的有效性縮放框架應用於來自 20 個前沿模型的元認知探測數據,涵蓋 524 個項目。六個有效性指標被操作化:L(對錯誤保持信心)、K(對錯誤下注)、F(撤回共識支持的項目)、Fp(撤回正確答案)、RBS(反向監控)和 TRIN(固定反應)。一個分級分類系統將四個模型識別為構念層級無效,兩個模型則被標記為升高。有效型檔案模型產生對項目敏感的信心(平均 r = .18,16 個中的 14 個顯著)。無效型檔案模型則不然(平均 r = -.20,d = 2.17,p = .001)。思維鏈訓練產生兩種相反的反應扭曲。兩個潛在維度解釋了 94.6% 的指標變異性。伴隨的論文提取了一個可攜式篩檢協議(Cacioli, 2026e),並將其與選擇性預測進行驗證(Cacioli, 2026f)。所有數據和代碼:https://github.com/synthiumjp/validity-scaling-llm
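The parenthetical index definitions above can be illustrated with a 2x2 table that crosses answer correctness with whether the model kept or withdrew its answer. The following is a minimal sketch under stated assumptions: the cell names, the reading of L as the share of errors on which confidence is maintained, the reading of Fp as the share of correct answers withdrawn, and the use of the phi coefficient as an item-sensitivity correlation are all illustrative choices, not the paper's operationalisation. The counts are invented.

```python
import math

def validity_sketch(kept_correct, withdrew_correct, kept_error, withdrew_error):
    """Illustrative rates from a 2x2 correctness-by-confidence table (assumed layout)."""
    n_correct = kept_correct + withdrew_correct
    n_error = kept_error + withdrew_error
    # Assumed reading of L: confidence maintained on errors.
    l_rate = kept_error / n_error if n_error else 0.0
    # Assumed reading of Fp: correct answers withdrawn.
    fp_rate = withdrew_correct / n_correct if n_correct else 0.0
    # Phi coefficient: correlation between "kept the answer" and "answer was correct".
    num = kept_correct * withdrew_error - withdrew_correct * kept_error
    den = math.sqrt(n_correct * n_error
                    * (kept_correct + kept_error)
                    * (withdrew_correct + withdrew_error))
    phi = num / den if den else 0.0
    return {"L": l_rate, "Fp": fp_rate, "phi": phi}

# Invented counts that sum to 524 items.
result = validity_sketch(kept_correct=300, withdrew_correct=50,
                         kept_error=100, withdrew_error=74)
```

A positive phi means the model keeps correct answers more often than wrong ones; a value near zero or below would correspond to the uninformative or inverted monitoring that the abstract describes.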

STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments

2604.17611v1 by Md Mezbahul Islam, John Michael Templeton, Christian Poellabauer, Ananda Mohan Mondal

Parkinson's disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, making severity staging essential for clinical monitoring and treatment planning. However, many computational studies emphasize binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework to classify PD severity using clinically interpretable boundaries. It leverages all available visits from the Parkinson's Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires and objective clinician-assessed measures. Disease severity is defined using Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and a three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. To enhance interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), and 94.14% accuracy with 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.

摘要:帕金森病(PD)是一種漸進性疾病,其症狀負擔和功能障礙隨時間演變,因此對於臨床監測和治療計劃來說,嚴重程度分級是必不可少的。然而,許多計算研究強調二元的PD檢測,並未充分利用重複的隨訪臨床評估來進行階段感知的預測。本研究提出了STEP-PD,一個重視嚴重程度的機器學習框架,用於使用臨床可解釋的邊界來分類PD的嚴重程度。它利用來自帕金森病進展標記計劃(PPMI)的所有可用訪問,並整合常規收集的主觀問卷和客觀臨床評估指標。疾病的嚴重程度是使用Hoehn和Yahr分級來定義的,並分為三個臨床意義明確的類別:健康、輕度PD(1-2期)和中度至重度PD(3-5期)。通過分層交叉驗證和考慮不平衡的訓練,評估了三個二元分類問題和一個三類嚴重程度任務。為了增強可解釋性,使用SHAP提供全局解釋和局部患者級別的瀑布解釋。在所有任務中,XGBoost實現了最強且最穩定的性能,健康與輕度的準確率為95.48%、健康與中度至重度的準確率為99.44%、輕度與中度至重度的準確率為96.78%,以及三類嚴重程度分類的準確率為94.14%,Macro-F1為0.8775。可解釋性結果突顯了從早期運動特徵到與進展相關的軸向和平衡障礙的轉變。這些發現表明,PPMI隊列中的多模態臨床評估可以支持準確且可解釋的訪問級PD嚴重程度分層。
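The three severity categories described above amount to a simple mapping from Hoehn and Yahr stage to label. A minimal sketch follows; encoding healthy controls as stage 0 is an assumption, and intermediate stages that fall between the stated ranges are rejected rather than guessed.

```python
def severity_group(hy_stage):
    """Map a Hoehn and Yahr stage to the three categories named in the abstract."""
    if hy_stage == 0:  # assumption: stage 0 encodes a healthy control
        return "Healthy"
    if 1 <= hy_stage <= 2:
        return "Mild PD"
    if 3 <= hy_stage <= 5:
        return "Moderate-to-Severe PD"
    raise ValueError(f"unexpected Hoehn and Yahr stage: {hy_stage}")

visits = [0, 1, 2, 3, 5]  # toy visit-level stages
labels = [severity_group(stage) for stage in visits]
```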

T-DuMpRa: Teacher-guided Dual-path Multi-prototype Retrieval Augmented framework for fine-grained medical image classification

2604.17360v1 by Zixuan Tang, Shen Zhao

Fine-grained medical image classification is challenged by subtle inter-class variations and visually ambiguous cases, where confidence estimates often exhibit uncertainty rather than being overconfident. In such scenarios, purely discriminative classifiers may achieve high overall accuracy yet still fail to distinguish between highly similar categories, leading to miscalibrated predictions. We propose T-DuMpRa, a teacher-guided dual-path multi-prototype retrieval-augmented framework, where discriminative classification and multi-prototype retrieval jointly drive both training and prediction. During training, we jointly optimize cross-entropy and supervised contrastive objectives to learn a cosine-compatible embedding geometry for reliable prototype matching. We further employ an exponential moving average (EMA) teacher to obtain smoother representations and build a multi-prototype memory bank by clustering the teacher embeddings. Our framework is plug-and-play: it can be easily integrated into existing classification models by constructing a compact prototype bank, thereby improving performance on visually ambiguous cases. At inference, we combine the classifier's predicted distribution with a similarity-based distribution computed via cosine matching to prototypes, and apply a conservative confidence-gated fusion that activates retrieval only when the classifier's prediction is uncertain and the retrieval evidence is decisive and conflicting, otherwise keeping confident predictions unchanged. On HAM10000 and ISIC2019, our method yields 0.68%-0.21% and 0.44%-2.69% improvements on 5 different backbones. Visualization analysis further indicates that the framework improves the handling of visually ambiguous cases.

摘要:精細醫學影像分類面臨著微妙的類別間變化和視覺上模糊的情況,這些情況下,信心估計往往表現出不確定性,而不是過於自信。在這種情況下,純粹的區別性分類器可能達到高整體準確率,但仍然無法區分高度相似的類別,導致預測不準確。我們提出了 T-DuMpRa,一種教師引導的雙路徑多原型檢索增強框架,其中區別性分類和多原型檢索共同推動訓練和預測。在訓練期間,我們共同優化交叉熵和監督對比目標,以學習可靠的原型匹配的餘弦相容嵌入幾何。我們進一步使用指數移動平均(EMA)教師來獲得更平滑的表示,並通過在教師嵌入空間中聚類教師嵌入來建立多原型記憶庫。我們的框架是即插即用的:它可以通過構建緊湊的原型庫輕鬆集成到現有的分類模型中,從而提高在視覺上模糊情況下的性能。在推理時,我們將分類器的預測分佈與通過餘弦匹配計算的基於相似性的分佈相結合,並應用保守的信心閘融合,只有在分類器的預測不確定且檢索證據決定性且矛盾時才啟動檢索,否則保持自信的預測不變。在 HAM10000 和 ISIC2019 上,我們的方法在 5 個不同的骨幹上分別提高了 0.68%-0.21% 和 0.44%-2.69%。而可視化分析證明我們的模型能增強模型處理視覺模糊情況的能力。
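The inference rule described above (fuse only when the classifier is uncertain and the retrieval evidence is both decisive and conflicting) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the thresholds, the mixing weight `alpha`, the softmax temperature, and the toy prototypes are all hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def gated_fusion(clf_probs, embedding, prototypes,
                 clf_threshold=0.6, retr_threshold=0.7, alpha=0.5, temperature=0.1):
    """prototypes maps class index -> list of prototype vectors; thresholds are assumed."""
    # Per-class similarity is the best cosine match among that class's prototypes.
    sims = [max(cosine(embedding, p) for p in prototypes[c])
            for c in range(len(clf_probs))]
    retr_probs = softmax(sims, temperature)
    clf_top = max(range(len(clf_probs)), key=clf_probs.__getitem__)
    retr_top = max(range(len(retr_probs)), key=retr_probs.__getitem__)
    uncertain = clf_probs[clf_top] < clf_threshold
    decisive = retr_probs[retr_top] >= retr_threshold
    conflicting = retr_top != clf_top
    if uncertain and decisive and conflicting:
        return [alpha * p + (1 - alpha) * r for p, r in zip(clf_probs, retr_probs)]
    return list(clf_probs)  # confident predictions pass through unchanged

prototypes = {0: [[1.0, 0.0]], 1: [[0.0, 1.0], [0.6, 0.8]]}  # toy 2-D prototypes
embedding = [0.1, 1.0]
fused = gated_fusion([0.55, 0.45], embedding, prototypes)      # gate opens
unchanged = gated_fusion([0.9, 0.1], embedding, prototypes)    # gate stays closed
```

In the first call the classifier leans weakly toward class 0 while the nearest prototype clearly belongs to class 1, so the fused distribution flips the decision; in the second call the classifier is confident and its output is returned as is.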

PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

2604.17359v1 by Patrick Keough

Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, yet population-level validity remains largely untested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, with transgender populations diverging three to five times more than racial differences. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen d = 1.13 to 1.91), consistent with training on clinical corpora with elevated base rates. For transgender women the direction inverts: models capture only 8 to 46 percent of documented minority stress elevation, yielding a -5.42 residual (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. Patterns replicate across US and Chinese architectures, indicating failures tied to current training paradigms rather than isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, algorithmic erasure of genuine need. The patients look right. They do not represent real populations.

摘要:大型語言模型越來越多地被用來模擬患者,以進行臨床訓練、研究和心理健康工具,但其在人口層面的有效性仍然大部分未經測試。我們介紹了PsychBench,首個針對LLM患者模擬的流行病學審核:來自四個前沿模型(GPT-4o-mini、DeepSeek-V3、Gemini-3-Flash、GLM-4.7)的28,800個資料檔,與NHANES和NESARC-III基準進行評估,涵蓋120個交叉群體。中心發現是一致性與保真度的分離:模型生成臨床上合理的個體,但卻錯誤地表現出其所來源的人群。變異壓縮範圍從14%(GLM-4.7)到62%(DeepSeek-V3),消除了臨床現實的分佈尾部。儘管測試-重測相關性超過r = 0.90,但36.66%的案例在不同運行之間跨越診斷閾值。症狀相關矩陣在不同的人口群體中超越了分半噪音而出現分歧,跨性別人群的分歧程度是種族差異的三到五倍。校準偏差是系統性的和不對稱的。模型對大多數群體的抑鬱嚴重程度高估了3.6到6.1分(Cohen d = 1.13到1.91),這與在基數較高的臨床語料庫上的訓練一致。對於跨性別女性來說,方向則相反:模型僅捕捉到8%到46%的記錄在案的少數族裔壓力上升,產生-5.42的殘差(d = -1.55)。模型還將易怒歸因於黑人男性,將疲勞歸因於女性,超出匹配控制組,編碼了種族化和性別化的假設。這些模式在美國和中國的架構中重複出現,表明失敗與當前的訓練範式有關,而非孤立的實施。對於大多數用戶來說,LLM心理健康工具有使普通痛苦病理化的風險;對於跨性別用戶來說,則是算法抹去真正需求的風險。患者看起來正確。他們並不代表真實的人口。
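Two of the statistics reported above, variance compression and Cohen's d, can be computed as in the sketch below. The compression formula (one minus the ratio of simulated to reference variance) is an assumed definition, since the abstract does not give one; Cohen's d uses the standard pooled-standard-deviation form. The scores are toy values, not NHANES or NESARC-III data.

```python
import statistics

def variance_compression(simulated, reference):
    """Fraction of the reference variance missing from the simulated scores (assumed definition)."""
    return 1.0 - statistics.variance(simulated) / statistics.variance(reference)

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * statistics.variance(group_a)
                  + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_var ** 0.5

reference = [2, 4, 6, 8, 10, 12, 14]   # toy severity scores with wide spread
simulated = [6, 7, 8, 8, 9, 10, 11]    # toy scores clustered near the middle
compression = variance_compression(simulated, reference)
effect = cohens_d(simulated, reference)
```

A compression value close to 1 means the simulated scores have lost most of the spread seen in the reference population, which is the "missing tails" effect the abstract describes.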

Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA

2604.17316v1 by Alberto Testoni, Iacer Calixto

Safe clinical deployment of Large Language Models (LLMs) requires not only high accuracy but also robust uncertainty calibration to ensure models defer to clinicians when appropriate. Our paper investigates how social descriptors of a patient (specifically sexual orientation and religious affiliation) distort these uncertainty signals and model accuracy. Evaluating nine general-purpose and biomedical LLMs on 2,364 medical questions and their counterfactual variants, we demonstrate that identity markers cause a "calibration crisis". "Homosexual" markers consistently trigger performance drops, and intersectional identities produce idiosyncratic, non-additive harms to calibration. Moreover, a clinician-validated case study in an open-ended generation setting confirms that these failures are not an artifact of the multiple-choice format. Our results demonstrate that the presence of social identity cues does not merely shift predictions; it affects the reliability of confidence signals, posing a significant risk to equitable care and safe deployment in confidence-based clinical workflows.

摘要:安全臨床部署大型語言模型(LLMs)不僅需要高準確性,還需要穩健的不確定性校準,以確保模型在適當時候能夠聽從臨床醫生的意見。我們的論文探討了患者的社會描述符(特別是性取向和宗教信仰)如何扭曲這些不確定性信號和模型準確性。在2,364個醫療問題及其反事實變體上評估九個通用和生物醫學LLMs,我們證明身份標記會導致“校準危機”。“同性戀”標記持續觸發性能下降,而交叉身份則對校準產生特異性、非加性損害。此外,在開放式生成環境中的臨床驗證案例研究確認,這些失敗並不是多選格式的產物。我們的結果顯示,社會身份線索的存在不僅僅是改變預測;它影響了信心信號的可靠性,對公平護理和基於信心的臨床工作流程的安全部署構成了重大風險。
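One common way to quantify the calibration shifts discussed above is expected calibration error (ECE), which bins predictions by confidence and averages the gap between accuracy and mean confidence in each bin. The abstract does not state which calibration metric the authors use, so ECE here is an assumption, and the confidences and outcomes are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - mean confidence| over confidence bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece

confidences = [0.9, 0.9, 0.6, 0.6]
# Same stated confidence, but accuracy drops on the hypothetical counterfactual variant.
ece_original = expected_calibration_error(confidences, [True, True, True, False])
ece_variant = expected_calibration_error(confidences, [True, False, False, False])
```

Comparing the two values shows how an identity marker could leave stated confidence untouched while accuracy falls, which degrades the confidence signal itself rather than merely shifting predictions.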

Chaos-Enhanced Prototypical Networks for Few-Shot Medical Image Classification

2604.17300v1 by Chinthakuntla Meghan Sai, Murarisetty V Sai Kartheek, Sita Devi Bharatula, Karthik Seemakurthy

The scarcity of labeled clinical data in oncology makes Few-Shot Learning (FSL) a critical framework for Computer Aided Diagnostics, but we observed that standard Prototypical Networks often struggle with the "prototype instability" caused by morphological noise and high intra-class variance in brain tumor scans. Our work attempts to minimize this by integrating a non-linear Logistic Chaos Module into a fine-tuned ResNet-18 backbone, creating the Chaos-Enhanced ProtoNet (CE-ProtoNet). Using the deterministic ergodicity of the logistic chaos map, we inject controlled perturbations into support features during episodic training, essentially "stress testing" the embedding space. This process drives the model to converge on noise-invariant representations without increasing computational overhead. Testing this on a 4-way 5-shot brain tumor classification task, we found that a 15% chaotic injection level worked efficiently to stabilize high-dimensional clusters and reduce class dispersion. Our method achieved a peak test accuracy of 84.52%, outperforming standard ProtoNet. Our results suggest that chaotic perturbation can serve as an efficient, low-overhead regularization tool in data-scarce regimes.

摘要:臨床數據在腫瘤學中的稀缺性使得少樣本學習(FSL)成為計算機輔助診斷的一個關鍵框架,但我們觀察到標準原型網絡經常因形態噪聲和腦腫瘤掃描中的高類內方差而面臨“原型不穩定”的問題。我們的工作試圖通過將非線性邏輯混沌模塊整合到微調的ResNet-18骨幹中來最小化這一問題,從而創建混沌增強原型網絡(CE-ProtoNet)。利用邏輯混沌映射的確定性遍歷性,我們在情節訓練期間將受控擾動注入支持特徵,基本上是為了“壓力測試”嵌入空間。這一過程使得模型能夠收斂到對噪聲不變的表示,而不增加計算開銷。在一個4路5樣本的腦腫瘤分類任務中,我們發現15%的混沌注入水平能有效穩定高維聚類並減少類別分散。我們的方法達到了84.52%的峰值測試準確率,超越了標準原型網絡。我們的結果表明,使用混沌擾動作為一種高效、低開銷的正則化工具的想法,適用於數據稀缺的情境。
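The perturbation idea above can be sketched with the logistic map x_{n+1} = r * x_n * (1 - x_n), which behaves chaotically at r = 4. The sketch below is an illustration under assumptions: it interprets the "15% injection level" as a noise amplitude of 15% of each feature's magnitude, which the abstract does not confirm, and it uses toy 3-D embeddings in place of ResNet-18 features.

```python
def logistic_sequence(length, x0=0.37, r=4.0):
    """Iterate x_{n+1} = r * x_n * (1 - x_n); values stay within [0, 1] for r = 4."""
    values, x = [], x0
    for _ in range(length):
        x = r * x * (1.0 - x)
        values.append(x)
    return values

def perturb(features, level=0.15, x0=0.37):
    """Add zero-centred chaotic noise scaled to `level` of each feature's magnitude."""
    noise = logistic_sequence(len(features), x0=x0)
    return [f + level * abs(f) * (2.0 * n - 1.0) for f, n in zip(features, noise)]

def prototype(support_embeddings):
    """Class prototype: element-wise mean of the support embeddings."""
    k = len(support_embeddings)
    return [sum(col) / k for col in zip(*support_embeddings)]

# Toy 5-shot support set for one class.
support = [[1.0, 2.0, 3.0], [1.2, 1.8, 3.1], [0.9, 2.1, 2.9],
           [1.1, 2.0, 3.0], [1.0, 1.9, 3.2]]
noisy = [perturb(e, x0=0.13 + 0.17 * i) for i, e in enumerate(support)]
proto = prototype(noisy)
```

Because the map is deterministic, the same seeds reproduce the same perturbations, so the "noise" adds no random-number overhead and is repeatable across runs.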

Region-Affinity Attention for Whole-Slide Breast Cancer Classification in Deep Ultraviolet Imaging

2604.17222v1 by Nagur Shareef Shaik, Teja Krishna Cherukuri, Dong Hye Ye

Breast cancer diagnosis demands rapid and precise tools, yet traditional histopathological methods often fall short in intra-operative settings. Deep Ultraviolet (DUV) fluorescence imaging emerges as a transformative approach, offering high-contrast, label-free visualization of whole-slide images (WSIs) with unprecedented detail, surpassing conventional hematoxylin and eosin (H&E) staining in speed and resolution. However, existing deep learning methods for breast cancer classification, predominantly patch-based, fragment spatial context and incur significant preprocessing overhead, limiting their clinical utility. Moreover, standard attention mechanisms, such as Spatial, Squeeze-and-Excitation, Global Context and Guided Context Gating, fail to fully exploit the rich, multi-scale regional relationships inherent in DUV-WSI data, often prioritizing generic feature recalibration over diagnostic specificity. This study introduces a novel Region-Affinity Attention mechanism tailored for DUV-WSI breast cancer classification, processing entire slides without patching to preserve spatial integrity. By modeling local neighbor distances and constructing a full affinity matrix, our method dynamically highlights diagnostically relevant regions, augmented by a contrastive loss to enhance feature discriminability. Evaluated on a dataset of 136 DUV-WSI samples, our approach achieves an accuracy of 92.67 +/- 0.73% and an AUC of 95.97%, outperforming existing attention methods.

摘要:乳腺癌的診斷需要快速且精確的工具,但傳統的組織病理學方法在手術過程中往往無法滿足需求。深紫外(DUV)螢光成像作為一種變革性的方法,提供高對比度、無標籤的全片影像(WSIs)可視化,細節前所未有,超越了傳統的蘇木精-伊紅(H&E)染色在速度和解析度上的表現。然而,現有的乳腺癌分類深度學習方法主要基於補丁,破壞了空間上下文並且產生了顯著的預處理開銷,限制了其臨床實用性。此外,標準的注意力機制,如空間注意力、壓縮與激勵、全局上下文和引導上下文門控,未能充分利用DUV-WSI數據中固有的豐富多尺度區域關係,往往優先考慮通用特徵的重新校準而非診斷特異性。本研究提出了一種新穎的區域親和力注意力機制,專為DUV-WSI乳腺癌分類而設計,處理整個切片而不進行補丁,以保持空間完整性。通過建模局部鄰域距離並構建完整的親和力矩陣,我們的方法動態突出診斷相關區域,並通過對比損失來增強特徵的可區分性。在136個DUV-WSI樣本的數據集上進行評估,我們的方法達到了92.67 +/- 0.73%的準確率和95.97%的AUC,超越了現有的注意力方法。

Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition

2604.17214v1 by Nwe Ni Win, Jim Basilakis, Steven Thomas, Seyhan Yazar, Laura Pierce, Stephanie Liu, Paul M. Middleton, Nasser Ghadiri, X. Rosalind Wang

Extracting clinically relevant information from unstructured medical narratives such as admission notes, discharge summaries, and emergency case histories remains a challenge in clinical natural language processing (NLP). Medical Entity Recognition (MER) identifies meaningful concepts embedded in these records. Recent advancements in large language models (LLMs) have shown competitive MER performance; however, evaluations often focus on general entity types, offering limited utility for real-world clinical needs requiring finer-grained extraction. To address this gap, we rigorously evaluated the open-source LLaMA3 model for fine-grained medical entity recognition across 18 clinically detailed categories. To optimize performance, we employed three learning paradigms: zero-shot, few-shot, and fine-tuning with Low-Rank Adaptation (LoRA). To further enhance few-shot learning, we introduced two example selection methods based on token- and sentence-level embedding similarity, utilizing a pre-trained BioBERT model. Unlike prior work assessing zero-shot and few-shot performance on proprietary models (e.g., GPT-4) or fine-tuning different architectures, we ensured methodological consistency by applying all strategies to a unified LLaMA3 backbone, enabling fair comparison across learning settings. Our results showed that fine-tuned LLaMA3 surpasses zero-shot and few-shot approaches by 63.11% and 35.63%, respectively, achieving an F1 score of 81.24% in granular medical entity extraction.

摘要:從未結構化的醫療敘述中提取臨床相關資訊,例如入院記錄、出院摘要和急診病歷,仍然是臨床自然語言處理(NLP)中的一個挑戰。醫療實體識別(MER)識別這些記錄中嵌入的有意義概念。最近在大型語言模型(LLMs)方面的進展顯示出競爭性的MER表現;然而,評估通常集中在一般實體類型上,對於需要更細緻提取的現實臨床需求提供的效用有限。為了解決這一差距,我們嚴格評估了開源的LLaMA3模型在18個臨床詳細類別中的細粒度醫療實體識別表現。為了優化性能,我們採用了三種學習範式:零樣本、少樣本和使用低秩適應(LoRA)的微調。為了進一步增強少樣本學習,我們引入了基於標記和句子級嵌入相似性的兩種範例選擇方法,利用預訓練的BioBERT模型。與之前評估零樣本和少樣本性能的專有模型(例如GPT-4)或微調不同架構的工作不同,我們通過將所有策略應用於統一的LLaMA3骨幹來確保方法的一致性,從而實現學習設置之間的公平比較。我們的結果顯示,微調的LLaMA3在細粒度醫療實體提取中分別超越零樣本和少樣本方法63.11%和35.63%,達到81.24%的F1分數。
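The similarity-based few-shot example selection mentioned above can be sketched as a top-k ranking by cosine similarity. This is a hedged illustration: the hand-written 3-D vectors stand in for BioBERT sentence embeddings, and `select_examples` is a hypothetical helper, not the authors' code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def select_examples(query_vec, pool, k=2):
    """pool is a list of (text, vector); return the k texts most similar to the query."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy annotated pool; the vectors are placeholders for real sentence embeddings.
pool = [
    ("Patient denies chest pain.", [0.9, 0.1, 0.0]),
    ("Started metformin 500 mg daily.", [0.1, 0.9, 0.1]),
    ("Reports shortness of breath on exertion.", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.1, 0.0]
chosen = select_examples(query, pool, k=2)
```

The selected sentences would then be placed in the prompt as demonstrations, so the model sees examples that resemble the note it is asked to annotate.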

DREAM: Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion for Expert Precision Medical Report Generation

2604.17209v1 by Nagur Shareef Shaik, Teja Krishna Cherukuri, Dong Hye Ye

Automating medical reports for retinal images requires a sophisticated blend of visual pattern recognition and deep clinical knowledge. Current Large Vision-Language Models (LVLMs) often struggle in specialized medical fields where data is scarce, leading to models that overfit and miss subtle but critical pathologies. To address this, we introduce DREAM (Dynamic Retinal Enhancement with Adaptive Multi-modal Fusion), a novel framework for high-fidelity medical report generation that excels even with limited data. DREAM employs a unique two-stage fusion mechanism that intelligently integrates visual data with clinical keywords curated by ophthalmologists. First, the Abstractor module maps image and keyword features into a shared space, enhancing visual data with pathology-relevant insights. Next, the Adaptor performs adaptive multi-modal fusion, dynamically weighting the importance of each modality using learnable parameters to create a unified representation. To ensure the model's outputs are semantically grounded in clinical reality, a Contrastive Alignment module aligns these fused representations with ground-truth medical reports during training. By combining medical expertise with an efficient fusion strategy, DREAM sets a new state-of-the-art on the DeepEyeNet benchmark, achieving a BLEU-4 score of 0.241, and further demonstrates strong generalization to the ROCO dataset.

摘要:自動化視網膜影像的醫療報告需要視覺模式識別和深厚臨床知識的精妙結合。當前的大型視覺語言模型(LVLMs)在數據稀缺的專業醫療領域中經常遇到困難,導致模型過擬合並錯過微妙但關鍵的病理特徵。為了解決這個問題,我們引入了DREAM(動態視網膜增強與自適應多模態融合),這是一個高保真醫療報告生成的新框架,即使在數據有限的情況下也能表現出色。DREAM採用獨特的兩階段融合機制,智能地將視覺數據與眼科醫生策劃的臨床關鍵詞整合。首先,抽象模塊將影像和關鍵詞特徵映射到共享空間中,增強視覺數據與病理相關的見解。接下來,適配器執行自適應多模態融合,動態地根據可學習參數加權每種模態的重要性,以創建統一的表示。為了確保模型的輸出在臨床現實中具有語義基礎,對比對齊模塊在訓練期間將這些融合表示與真實醫療報告對齊。通過將醫療專業知識與高效的融合策略相結合,DREAM在DeepEyeNet基準上設立了新的最先進水平,達到了0.241的BLEU-4分數,並進一步展示了對ROCO數據集的強大泛化能力。

CDSA-Net: Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography

2604.17208v1 by Si Li, Chen-Kai Hu, Zhenhuan Lyu, Yuanqing He

Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, forcing reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produce images with two clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity that undermines diagnostic confidence. We propose a novel framework, termed CDSA-Net, that for the first time explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA-Net introduces two core innovations: (i) A hierarchical geometric prior guidance (HGPG) mechanism, embedded in our coronary structure extraction network (CSENet). It synergistically combines integrated geometric prior (IGP) with gated spatial modulation (GSM) and centerline-aware topology (CAT) loss supervision, ensuring structural continuity. (ii) An adaptive noise module (ANM) within our coronary background restoration network (CBResNet). Unlike standard restoration, ANM uniquely models the stochastic nature of clinical X-ray noise, bridging the domain gap to enable seamless background intensity estimation and the complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, CDSA-Net significantly outperformed state-of-the-art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while maintaining diagnostic results consistent with raw angiograms. The project code is available at https://github.com/DrThink-ai/CDSA-Net.

摘要:數位減影血管造影(DSA)在冠狀動脈影像中受到生理運動的根本挑戰,迫使人們依賴充滿解剖噪音的原始血管造影圖像。現有的深度學習方法通常產生兩個關鍵的臨床不可接受的缺陷:持續的邊界伪影和原生組織灰階保真度的喪失,這削弱了診斷信心。我們提出了一個名為 CDSA-Net 的新框架,首次明確地解耦並聯合優化血管結構保護和現實背景恢復。CDSA-Net 引入了兩個核心創新:(i)一種分層幾何先驗引導(HGPG)機制,嵌入我們的冠狀結構提取網絡(CSENet)。它協同結合了集成幾何先驗(IGP)、門控空間調制(GSM)和中心線感知拓撲(CAT)損失監督,確保結構連續性。(ii)我們的冠狀背景恢復網絡(CBResNet)內的一個自適應噪聲模塊(ANM)。與標準恢復不同,ANM 獨特地建模臨床 X 射線噪聲的隨機性質,彌合領域差距以實現無縫的背景強度估計和完全消除邊界伪影。最終的減法是通過從原始血管造影中去除恢復的背景來獲得的。在定量上,它在血管強度相關性和感知質量方面顯著超越了最先進的方法。在形態評估效率上提高了 25.6%,在血流動力學評估速度上提高了 42.9%,為介入心臟病學的實用性設立了新的基準,同時保持診斷結果與原始血管造影一致。項目代碼可在 https://github.com/DrThink-ai/CDSA-Net 獲得。

Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training

2604.17186v1 by Weibing Zheng, Laurah Turner, Jess Kropczynski, Matthew Kelleher, Murat Ozer, Shane Halse

As Artificial Intelligence (AI) and Agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that Multi-Agent Education Systems (MAES) are explainable from the early stages of requirements engineering (RE) within the AI software development lifecycle. Explainability is essential to build trust, promote transparency, and enable effective human-AI collaboration. Although personas are well-established in human-computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human-first, persona-driven, explainable MAES RE framework and demonstrates the framework through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, an AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform explainability requirements that guided the clinical reasoning training of medical students. A post-usage survey found that more than 78% of medical students reported that MAES improved their clinical reasoning skills. These findings demonstrate that persona-based RE effectively connects technical requirements with non-technical medical students through a human-centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of AI system engineering. The partial MAES for the clinical scenario simulator is open sourced at https://github.com/2sigmaEdTech/MAS/.

摘要:隨著人工智慧(AI)和代理型AI在教育和醫療等各個領域的日益整合,確保多代理教育系統(MAES)在AI軟體開發生命週期的需求工程(RE)早期階段是可解釋的,至關重要。可解釋性對於建立信任、促進透明度以及實現有效的人機協作至關重要。儘管角色在人機互動中被廣泛應用以代表用戶並捕捉他們的需求和行為,但在可解釋的MAES的需求工程中的角色仍然未被充分探索。本文提出了一個以人為本、以角色驅動的可解釋MAES需求工程框架,並通過一個用於臨床推理訓練的MAES來演示該框架。該框架在整個需求工程過程中整合了角色和用戶故事,以捕捉各種利益相關者的需求、目標和互動,包括醫學教育者、醫學學生、AI病人代理和臨床代理(身體檢查代理、診斷代理、臨床干預代理、監督代理、評估代理)。目標、基本模型和知識基礎塑造了代理的互動,並告知了指導醫學學生臨床推理訓練的可解釋性需求。使用後調查發現,超過78%的醫學學生報告說MAES提高了他們的臨床推理技能。這些發現表明,基於角色的需求工程有效地將技術需求與非技術醫學學生聯繫起來,採用以人為中心的方法,確保可解釋的MAES是可信的、可解釋的,並與AI系統工程早期階段的真實臨床情境相一致。針對臨床情境模擬器的部分MAES已開源:https://github.com/2sigmaEdTech/MAS/

If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data

2604.17133v1 by Yanjun Cui, Ali Emami, Temiloluwa Prioleau, Nikhil Singh

Continuous glucose monitors (CGMs) used in diabetes care collect rich personal health data that could improve day-to-day self-management. However, current patient platforms only offer static summaries which do not support inquisitive user queries. Large language models (LLMs) could enable free-form inquiries about continuous glucose data, but deploying them over sensitive health records raises privacy and accuracy concerns. In this paper, we present CGM-Agent, a privacy-preserving framework for question answering over personal glucose data. In our design, the LLM serves purely as a reasoning engine that selects analytical functions. All computation occurs locally, and personal health data never leaves the user's device. For evaluation, we construct a benchmark of 4,180 questions combining parameterized question templates with real user queries and ground truth derived from deterministic program execution. Evaluating 6 leading LLMs, we find that top models achieve 94% value accuracy on synthetic queries and 88% on ambiguous real-world queries. Errors stem primarily from intent and temporal ambiguity rather than computational failures. Additionally, lightweight models achieve competitive performance in our agent design, suggesting opportunities for low-cost deployment. We release our code and benchmark to support future work on trustworthy health agents.

摘要:持續血糖監測器(CGMs)在糖尿病護理中收集豐富的個人健康數據,這些數據可以改善日常自我管理。然而,目前的病人平台僅提供靜態摘要,無法支持好奇的用戶查詢。大型語言模型(LLMs)可以使對持續血糖數據的自由形式查詢成為可能,但在敏感健康記錄上部署它們會引發隱私和準確性問題。在本文中,我們提出了CGM-Agent,一個針對個人血糖數據的隱私保護問答框架。在我們的設計中,LLM純粹作為一個推理引擎,選擇分析功能。所有計算都在本地進行,個人健康數據從不離開用戶的設備。為了進行評估,我們構建了一個基準,包含4,180個問題,結合了參數化問題模板、真實用戶查詢和來自確定性程序執行的真實數據。在評估6個領先的LLM時,我們發現頂級模型在合成查詢上達到94%的價值準確率,在模糊的現實查詢上達到88%。錯誤主要源於意圖和時間的模糊性,而不是計算失敗。此外,輕量級模型在我們的代理設計中達到了競爭性能,這表明低成本部署的機會。我們發布了我們的代碼和基準,以支持未來在可信健康代理上的工作。
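The design described above, in which the model only chooses an analytical function and all computation runs on the device, can be sketched as a local function registry driven by a small JSON plan. Everything named here is hypothetical: the function names, the plan schema, and the default 70 to 180 mg/dL range are illustrative assumptions rather than CGM-Agent's actual interface, and the readings are toy values.

```python
import json
import statistics

# Hypothetical local analytics; names are illustrative, not the paper's API.
def mean_glucose(readings):
    return statistics.mean(readings)

def time_in_range(readings, low=70, high=180):
    """Percentage of readings within [low, high] mg/dL."""
    return 100.0 * sum(low <= r <= high for r in readings) / len(readings)

TOOLS = {"mean_glucose": mean_glucose, "time_in_range": time_in_range}

def run_plan(plan_json, readings):
    """Execute a model-chosen plan locally; the readings are never sent to the model."""
    plan = json.loads(plan_json)
    func = TOOLS[plan["function"]]  # an unknown function name raises KeyError
    return func(readings, **plan.get("args", {}))

readings = [95, 110, 150, 190, 210, 130, 65, 120]  # toy mg/dL values
# In this sketch the plan string stands in for what a language model would return.
plan = '{"function": "time_in_range", "args": {"low": 70, "high": 180}}'
answer = run_plan(plan, readings)
```

Because the answer comes from deterministic code, it can be checked against ground truth in the way the benchmark describes; the model can only be wrong about which function and arguments it selects.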

A Two-Stage Deep Learning Framework for Segmentation of Ten Gastrointestinal Organs from Coronal MR Enterography

2604.17118v1 by Ashiqur Rahman, Md. Abu Sayed, Md Sharjis Ibne Wadud, Md. Abu Asad Al-Hafiz, Adam Mushtak, Muhammad E. H. Chowdhury

Accurate segmentation of gastrointestinal (GI) organs in magnetic resonance enterography (MRE) is critical for diagnosing inflammatory bowel disease (IBD). However, anatomical variability, class imbalance, and low tissue contrast hinder reliable automation. This study proposes a dual-stage deep learning framework for organ-specific segmentation of GI structures from coronal MRE images to address these challenges. A publicly available MRE dataset of 3,195 coronal T2-weighted HASTE slices from 114 IBD patients was used. Initially, a DenseNet201-UNet++ model generated coarse masks for ROI extraction. A DenseNet121-SelfONN-UNet model was then trained on organ-specific patches. Extensive data augmentation, normalization, five-fold cross-validation, and class-specific weighting were applied to mitigate severe class imbalance, particularly for the appendix. The initial stage achieved strong organ localization but underperformed for the appendix; class weighting improved its DSC from 6.76% to 85.76%. The second-stage DenseNet121-SelfONN-UNet significantly enhanced segmentation across all GI structures, with notable DSC gains (cecum +23.62%, sigmoid +18.57%, rectum +17.99%, small intestine +16.06%). Overall, the framework achieved mDSC of 88.99%, mIoU of 84.76%, and mHD95 of 6.94 mm, outperforming all baselines. This framework demonstrates the effectiveness of a coarse-to-fine, organ-aware segmentation strategy for intestinal MRE. Despite higher computational cost, it shows strong potential for clinical translation and enables anatomically informed diagnostic tools in gastroenterology.

Efficient Task Adaptation in Large Language Models via Selective Parameter Optimization

2604.17051v1 by Weijie Wan, Jiangjiang Zhao

Large Language Models (LLMs) have demonstrated excellent performance in general language understanding, generation, and other tasks. However, when they are fine-tuned for domain-specific tasks, the general knowledge accumulated during pre-training is often partially overwritten or forgotten as parameters are updated, which severely limits their generalization ability and transferability. Traditional fine-tuning strategies mostly train over the entire parameter space and ignore the heterogeneity of model parameters: some parameters are extremely important for general tasks, while others are more sensitive to specific tasks. To alleviate these problems, this paper proposes a parameter-element importance evaluation method that divides parameters into "core parameters" and "non-core parameters" by distinguishing their importance for general language ability tasks and for specific domain tasks. The core parameters are fixed during fine-tuning, and only the non-core parameters are fine-tuned. Extensive experiments on scientific, medical, and physical tasks using GPT-J and LLaMA-3 show that our method can mitigate catastrophic forgetting while enhancing the adaptability of the model.
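
The idea of freezing core parameters and updating only the rest can be illustrated with a toy update rule. The abstract does not specify how importance is scored or thresholded, so the top-fraction rule, the function names, and the plain SGD step below are all assumptions made for the sketch.

```python
def select_core_indices(general_importance, core_fraction):
    """Mark the highest-scoring fraction of parameters as 'core'.

    Assumption: importance for general tasks is given as one score per
    parameter, and the top `core_fraction` of them are frozen.
    """
    n = len(general_importance)
    k = int(n * core_fraction)
    ranked = sorted(range(n), key=lambda i: general_importance[i], reverse=True)
    return set(ranked[:k])


def masked_sgd_step(params, grads, core, lr):
    """Apply one SGD step, leaving core parameters untouched."""
    return [
        p if i in core else p - lr * g
        for i, (p, g) in enumerate(zip(params, grads))
    ]
```

In a real framework the same effect would typically be obtained by masking gradients per tensor element; the list-based version here only shows the logic.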

Light-Adapted Electroretinogram and Oscillatory Potentials (LEOPs) Dataset for Autism Spectrum Disorder and Typically Developing Individuals

2604.16981v1 by Paul A. Constable, Dorothy A. Thompson, Irene O. Lee, Lynne Loh, Aleksei Zhdanov, Mikhail Kulyabin, Andreas Maier

The LEOPs (Light-ERG-Oscillatory Potentials) dataset provides light-adapted (LA) electroretinogram (ERG) and Oscillatory Potentials (OPs) waveforms for typically developing Control, Autism Spectrum Disorder (ASD), and ASD + Attention Deficit Hyperactivity Disorder (ADHD) childhood and adolescent populations. The ERGs were recorded in the right and left eyes with skin electrodes using the handheld RETeval device at two sites in Australia and the United Kingdom. The LEOPs dataset includes 5309 single-flash ERG and 4434 OPs waveforms as well as images selected from each participant showing the position of the skin electrode. The dataset is constructed from recordings using a 9-step randomized flash series from $-0.37$ to $1.20$ $Td.s$, a 2-step series at flash strengths of 113 and 446 $Td.s$ (2500 Control, 1730 ASD, and 451 ASD + ADHD samples), as well as the $85$ $Td.s$ (Light Adapted 3 $cd.s.m^{-2}$ (LA3)) equivalent International Society of Clinical Electrophysiology of Vision (ISCEV) Standard flash with 435 Control, 176 ASD, and 37 ASD + ADHD waveform samples. Code for the stimulus is provided along with participant demographics, date and time of testing, and, where available, diagnostic scores for the ASD and ASD + ADHD groups, alongside iris color, electrode position with image files, time-domain values for the ERG, and summed values for the OPs. The repository contains an Excel file, patient-level exported JSON files that are more suitable for machine learning tasks, images of electrode position for each recording, and the protocol files for use with the RETeval.

Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models

2604.16980v1 by Bruce A. Bassett, Amy Rouillard, Sitwala Mundia, Michael Cameron Gramanie, Linda Camara, Ziyaad Dangor, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Ismail Kalla, Haroon Saloojee

Background: Large language models (LLMs) are increasingly proposed for diagnostic support, but few evaluations use real-world multimodal inpatient data, particularly in low- and middle-income country (LMIC) public hospitals. Methods: We conducted VALID, a retrospective evaluation of 539 multimodal inpatient cases from a tertiary public hospital in South Africa. Inputs included radiology imaging (CT, MRI, CXR) and reports, laboratory results, clinical notes, and vital signs. Expert panels adjudicated 300 cases (balanced and discordant subsets) to establish ground truth diagnoses, differentials, and reasoning. Ten multimodal LLMs generated zero-shot outputs. A calibrated three-model LLM Jury scored all outputs and routine ward diagnoses across diagnostic accuracy, differential quality, reasoning, and patient safety (>10,000 evaluations). Primary outcomes were composite scores ($S_3$, $S_4$) and win rates. Results: (i) LLM performance was tightly clustered (<15% variation) despite large cost differences; low-cost models performed comparably to top models. (ii) All LLMs significantly outperformed routine ward diagnoses on average diagnostic and safety scores. (iii) Top performance was achieved by GPT-5.1, followed by Gemini models. (iv) Adding radiology reports improved performance by 6%. (v) Diagnostic and reasoning scores were highly correlated ($\rho = 0.85$). (vi) Output rates varied (65-100%) due to input constraints. Results were robust across subsets and evaluation design. Conclusions: Across a real-world LMIC dataset, multimodal LLMs showed similar diagnostic performance despite large cost differences and outperformed routine care on average safety metrics. Affordability, robustness, and deployment constraints may outweigh marginal performance differences in LMIC settings.

Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction

2604.16955v1 by Liyin Chen, Nazlee Zebardast, Mengyu Wang, Tobias Elze, Jason I. Comander

Quantitative prediction of future retinal appearance from longitudinal imaging would support clinical decisions in progressive macular disease that currently rely on qualitative comparison or scalar progression scores. Recent methods have moved toward increasing generative complexity, but whether this complexity is necessary for slowly progressing retinal disease is unclear. We tested this through a controlled comparison of five conditioning configurations sharing one architecture and training dataset, spanning standard conditional diffusion, inference-aligned stochastic training, and deterministic regression. In our evaluation, aligning the training and inference input distributions produced large gains (delta-SSIM +0.082, SSIM +0.086, both p < 0.001), while the choice among aligned frameworks did not significantly affect any primary metric. Task-entropy and posterior-concentration analyses, replicated on two fundus autofluorescence (FAF) platforms, provided a mechanistic account: the predictable component of inter-visit change is small relative to time-invariant acquisition variability, leaving stochastic sampling with little width to exploit. Guided by these findings, we developed TRU (Temporal Retinal U-Net), a deterministic direct-regression model with continuous time-delta conditioning and multi-scale history aggregation. We evaluated TRU on 28,902 eyes across three imaging platforms: a mixed-disease Optos FAF cohort (9,942 eyes), zero-shot transfer to Stargardt macular dystrophy on Optos (288 eyes) and Heidelberg Spectralis (125 eyes), and a boundary evaluation on Cirrus en-face fundus images from a glaucoma cohort (18,547 eyes). TRU matched or exceeded delta-SSIM, SSIM, and PSNR in every FAF cohort against three state-of-the-art benchmarks, and its advantage grew monotonically with available history length.

Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach

2604.16953v1 by Riza Alaudin Syah, Irwan Alnarus Kautsar, Gunawan Witjaksono, Haza Nuzly bin Abdull Hamed

Breast cancer diagnosis through thermographic image analysis remains a critical challenge in medical AI, with classical deep learning approaches facing limitations in complex thermal pattern classification tasks. This paper presents a novel Hybrid Quantum Neural Network (HQNN) architecture that integrates quantum computing principles with classical convolutional neural networks for enhanced breast cancer classification. Our approach employs parameterized quantum circuits with multi-head attention mechanisms for quantum-aware feature encoding, coupled with classical convolutional layers for comprehensive pattern recognition. The quantum component utilizes a 4-qubit variational circuit with strongly entangling layers, while the classical component incorporates advanced attention mechanisms for feature fusion. Experimental validation on breast cancer thermographic data demonstrates substantial performance improvements over state-of-the-art classical architectures, with the quantum-enhanced approach exhibiting superior convergence dynamics and enhanced feature representation capabilities. Our findings provide evidence for quantum advantage in medical image classification through classical simulation, establishing a framework for quantum-classical hybrid systems in healthcare applications. The methodology addresses key challenges in quantum machine learning deployment while maintaining computational feasibility on near-term quantum devices.

Test-Time Adaptation for EEG Foundation Models: A Systematic Study under Real-World Distribution Shifts

2604.16926v1 by Gabriel Jason Lee, Jathurshan Pradeepkumar, Jimeng Sun

Electroencephalography (EEG) foundation models have shown strong potential for learning generalizable representations from large-scale neural data, yet their clinical deployment is hindered by distribution shifts across clinical settings, devices, and populations. Test-time adaptation (TTA) offers a promising solution by enabling models to adapt to unlabeled target data during inference without access to source data, a valuable property in healthcare settings constrained by privacy regulations and limited labeled data. However, its effectiveness for EEG remains largely underexplored. In this work, we introduce NeuroAdapt-Bench, a systematic benchmark for evaluating test-time adaptation methods on EEG foundation models under realistic distribution shifts. We evaluate representative TTA approaches from other domains across multiple pretrained foundation models, diverse downstream tasks, and heterogeneous datasets spanning in-distribution, out-of-distribution, and extreme modality shifts (e.g., Ear-EEG). Our results show that standard TTA methods yield inconsistent gains and often degrade performance, with gradient-based approaches particularly prone to heavy degradation. In contrast, optimization-free methods demonstrate greater stability and more reliable improvements. These findings highlight the limitations of existing TTA techniques in EEG, provide guidance for future development, and underscore the need for domain-specific adaptation strategies.

Representation Before Training: A Fixed-Budget Benchmark for Generative Medical Event Models

2604.16775v1 by Inhyeok Lee, Luke Solo, Michael C. Burkhart, Bashar Ramadan, William F. Parker, Brett K. Beaulieu-Jones

Every prediction from a generative medical event model is bounded by how clinical events are tokenized, yet input representation is rarely isolated from other system and architectural choices. We evaluate how representation decisions affect downstream prediction after a shared one-epoch pretraining budget. We train 28 matched transformers on MIMIC-IV and evaluate them on 30 clinical outcomes in three experiments: (1) quantization granularity, reference-range anchoring, and code-value fusion; (2) value encoding (hard bins, soft discretization, code-normalized xVal) crossed with temporal encoding (event order, time tokens, admission-relative RoPE); and (3) native MIMIC laboratory/vital codes versus the Common Longitudinal ICU Format (CLIF)-remapped laboratory/vital codes with compression-preserving perturbation arms. In Experiment 1, fused code-value tokenization improves mortality AUROC from 0.891 to 0.915 (BH-adjusted p < 0.001), hospital length-of-stay AUROC from 0.763 to 0.788 (BH-adjusted p < 0.001), and, for the decile fused-vs-unfused comparison, mean regression Spearman rho across the 13 regression outcomes from 0.414 to 0.494. Across the three temporal encodings, event order only and admission-relative RoPE match or exceed inserting time tokens on average while shortening sequences by 11%. CLIF remapping preserves downstream performance in our single-site setting while yielding a smaller, clinically interpretable token set compatible with multi-site use. Finer-than-decile quantization, reference-range anchoring, and soft discretization help in selective outcomes, while code-normalized xVal remains well below the discrete and soft families, consistent with near-median suppression that persists after the affine variant.
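
To make the decile quantization and the fused versus unfused tokenization choices concrete, here is a small sketch. The token spelling (`CODE_Q3`), the helper names, and the way cut points are picked are hypothetical; the paper's actual tokenizer is not described in the abstract.

```python
from bisect import bisect_right


def decile_edges(values):
    """Nine interior cut points estimated from a non-empty list of values."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // 10] for k in range(1, 10)]


def bin_index(value, edges):
    """Bin 0 is below the first cut point; bin 9 is at or above the last."""
    return bisect_right(edges, value)


def tokenize(code, value, edges, fused=True):
    """Emit one fused code-value token, or a code token plus a value token."""
    q = bin_index(value, edges)
    if fused:
        return [f"{code}_Q{q}"]
    return [code, f"Q{q}"]
```

Fusing halves the number of tokens per measurement at the cost of a larger vocabulary (one entry per code and bin combination), which is the kind of trade-off the experiments above isolate.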

CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction

2604.16742v1 by Jianyou Wang, Youze Zheng, Longtian Bao, Hanyuan Zhang, Qirui Zheng, Yuhan Chen, Yang Zhang, Matthew Feng, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Ramamohan Paturi, Umber Dube, Leon Bergen

Scientists have long sought to accurately predict outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining whether a trial's outcome was public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy against human expert annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes before they occur, while also informing biomedical research and improving clinical trial design. The CT Open platform is hosted at https://ct-open.net/.
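
The evaluation rule described above (score only trials whose outcome first became public after the submission) reduces to a date comparison once the earliest mention is known. The sketch below assumes the earliest-mention dates have already been found; the trial identifiers and the function name are hypothetical, and the web-search step itself is not modeled.

```python
from datetime import date


def eligible_trials(earliest_mention, submission_date):
    """Return trial IDs whose outcome first appeared after the submission.

    `earliest_mention` maps a trial ID to the date of the earliest public
    mention of its outcome, or None if no outcome has been found yet.
    """
    return sorted(
        trial_id
        for trial_id, first_seen in earliest_mention.items()
        if first_seen is not None and first_seen > submission_date
    )
```

Trials with no known outcome are excluded because they cannot be scored yet, and trials with an earlier mention are excluded because a model could have seen the answer.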

Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis

2604.16729v1 by Ayhan Can Erdur, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C. Peeken

State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve complex neuro-radiological image analysis tasks through tool use without the need for training or fine-tuning.

A Two-Stage Multi-Modal MRI Framework for Lifespan Brain Age Prediction

2604.16655v1 by Dingyi Zhang, Ruiying Liu, Yun Wang

The accurate quantification of brain age from MRI has emerged as an important biomarker of brain health. However, existing approaches are often restricted to narrow age ranges and single-modality MRI data, limiting their capacity to capture the coordinated macro- and microstructural changes that unfold across the human lifespan. To address these limitations, we developed a multi-modal brain age framework to characterize the integrated evolution of brain morphology and white matter organization. Our model adopts a two-stage architecture, where modalities are processed independently and integrated via late fusion in both stages: first to classify each subject into one of six developmental stages, and then to estimate age within the predicted stage. This design enables a unified and lifespan-spanning assessment of brain maturity across diverse developmental periods.
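
The two-stage, late-fusion flow described above can be sketched with plain lists. This is an illustration only: averaging as the fusion rule, the function names, and the use of three stages in the usage example (the paper uses six) are assumptions, and the per-modality models are replaced by precomputed outputs.

```python
def late_fuse(per_modality_scores):
    """Average scores position by position across modalities."""
    n = len(per_modality_scores)
    return [sum(column) / n for column in zip(*per_modality_scores)]


def predict_age(stage_probs_by_modality, age_estimates_by_modality):
    """Stage 1: pick the developmental stage from fused probabilities.
    Stage 2: fuse the age estimates made for that stage by each modality.

    Each argument holds one list per modality, indexed by stage.
    """
    fused_probs = late_fuse(stage_probs_by_modality)
    stage = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    stage_ages = [estimates[stage] for estimates in age_estimates_by_modality]
    return stage, sum(stage_ages) / len(stage_ages)
```

One consequence of this design is that a stage misclassification in the first step bounds how accurate the second step can be, since the age is estimated only within the predicted stage.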

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

2604.16175v1 by Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

Hybrid Spectro-Temporal Fusion Framework for Structural Health Monitoring

2604.16589v1 by Jongyeop Kim, Jinki Kim, Doyun Lee

Structural health monitoring plays a critical role in ensuring structural safety by analyzing vibration responses from engineering systems. This paper proposes a Spectro-Temporal Alignment framework and a Hybrid Spectro-Temporal Fusion framework that integrate arrival-time interval descriptors with spectral features to capture both fine-scale and coarse-scale vibration dynamics. Experiments conducted on data collected from an LDS V406 electrodynamic shaker demonstrate that the proposed spectro-temporal representations significantly outperform conventional input formulations. The results indicate that a temporal resolution (Δτ) of 0.02 favors traditional machine learning models, whereas a finer resolution (Δτ) of 0.008 effectively unlocks the performance potential of deep learning architectures. Beyond classification accuracy, a comprehensive stability analysis based on condensed indices, including mean performance, standard deviation, coefficient of variation, and balanced score, shows that the proposed hybrid framework consistently achieves higher accuracy with substantially lower variability compared to baseline and alignment-only approaches. Overall, these results demonstrate that the proposed framework provides a robust, accurate, and reliable solution for vibration-based structural health monitoring.
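
Three of the stability indices named above (mean, standard deviation, coefficient of variation) can be computed from repeated-run accuracies as in this sketch. The function name is illustrative, the population standard deviation is an assumption (the paper may use the sample form), and the balanced score is omitted because the abstract does not define it.

```python
from statistics import mean, pstdev


def stability_summary(accuracies):
    """Summarize repeated-run accuracies with mean, std, and CV."""
    mu = mean(accuracies)
    sigma = pstdev(accuracies)  # population standard deviation (assumed)
    return {"mean": mu, "std": sigma, "cv": sigma / mu}
```

A lower coefficient of variation at the same mean indicates results that depend less on the particular run or data split, which is the sense in which the hybrid framework is described as more stable.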

Large Language Models Meet Biomedical Knowledge Graphs for Mechanistically Grounded Therapeutic Prioritization

2604.19815v1 by Chih-Hsuan Wei, Chi-Ping Day, Zhizheng Wang, Christine C. Alewine, Betty Tyler, Hasan Slika, David Saraf, Chin-Hsien Tai, Joey Chan, Robert Leaman, Zhiyong Lu

Drug repurposing is often framed as a candidate identification task, but existing approaches provide limited guidance for distinguishing biologically plausible candidates from historically well-connected ones. Here we introduce DrugKLM, a hybrid framework that integrates biomedical knowledge graph structure with large language model-based mechanistic reasoning to enable mechanistically grounded therapeutic prioritization. Across benchmark datasets, DrugKLM outperforms knowledge graph-only and language model-only baselines, including TxGNN. Beyond improved recall, DrugKLM confidence scores exhibit functional alignment with molecular phenotypes: higher scores are associated with transcriptional signatures linked to improved survival across 12 TCGA cancers. The scoring framework preferentially captures biologically perturbational signals rather than historical indication patterns. Expert curation across five cancers further reveals systematic differences in prioritization behavior, with DrugKLM elevating candidates supported by coherent mechanistic rationale and disease-specific clinical context. Together, these results establish DrugKLM as an evidence-integrative framework that translates heterogeneous biomedical data into mechanistically interpretable and clinically grounded therapeutic hypotheses.

Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

2604.16132v1 by Jessica H. Zhu, Shayla Stringfield, Vahe Zaprosyan, Michael Wagner, Michel Cukier, Joseph B. Richardson

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.

Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration

2604.16104v1 by Baramee Sukumal, Aueaphum Aueawatthanaphisut

Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, demonstrating strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.
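
Weighted decision-level fusion, as described above, combines the class probabilities from the two modality-specific models after each has made its own prediction. The sketch below assumes a single scalar weight and a convex combination; the weight value, function name, and class ordering are hypothetical and not taken from the paper.

```python
CLASSES = [
    "adenocarcinoma",
    "squamous cell carcinoma",
    "large cell carcinoma",
    "small cell lung cancer",
    "normal",
]


def fuse_predictions(ct_probs, histo_probs, ct_weight=0.5):
    """Blend CT and histopathology class probabilities and pick the top class.

    ct_weight is the share given to the CT model; the histopathology model
    receives 1 - ct_weight.
    """
    if not (len(ct_probs) == len(histo_probs) == len(CLASSES)):
        raise ValueError("expected one probability per class from each model")
    fused = [
        ct_weight * c + (1.0 - ct_weight) * h
        for c, h in zip(ct_probs, histo_probs)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return CLASSES[best], fused
```

Because fusion happens on outputs rather than features, either model can be retrained or replaced without touching the other, at the cost of ignoring interactions between imaging and tissue features.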

Towards Trustworthy Depression Estimation via Disentangled Evidential Learning

2604.16579v1 by Fangyuan Liu, Sirui Zhao, Zeyu Zhang, Jinyang Huang, Feng-Qi Cui, Bin Luo, Tong Xu, Meng Li, Enhong Chen

Automated depression estimation is highly vulnerable to signal corruption and ambient noise in real-world deployment. Prevailing deterministic methods produce uncalibrated point estimates, exposing safety-critical clinical systems to the severe risk of overconfident misdiagnoses. To establish a highly resilient and trustworthy assessment paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. A fundamental vulnerability in multimodal evidential fusion is the uncontrolled accumulation of cross-modal redundancies. This structural flaw artificially inflates diagnostic confidence by double-counting overlapping evidence. To guarantee robust evidence synthesis, EviDep enforces strict information integrity. First, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically isolate task-irrelevant noise, preserving the fidelity of diagnostic signals. Subsequently, a Disentangled Evidential Learning strategy separates the shared consensus from modality-specific nuances. By explicitly decoupling these representations before Bayesian fusion, EviDep systematically mitigates evidence redundancy. Extensive experiments on AVEC 2013, 2014, DAIC-WOZ, and E-DAIC confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, delivering a robust fail-safe mechanism for trustworthy clinical screening.
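
For readers unfamiliar with Normal-Inverse-Gamma (NIG) evidential outputs, the sketch below shows how a point estimate and the two uncertainty types are commonly read off the four NIG parameters in deep evidential regression. It is an assumption that EviDep uses this standard parameterization; the abstract does not give its exact formulas, and the function name is hypothetical.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Summarize a Normal-Inverse-Gamma output (gamma, nu, alpha, beta).

    Returns (prediction, aleatoric, epistemic) using the standard moments:
      prediction = E[mu]      = gamma
      aleatoric  = E[sigma^2] = beta / (alpha - 1)
      epistemic  = Var[mu]    = beta / (nu * (alpha - 1))
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return gamma, aleatoric, epistemic
```

A larger `nu` (more "virtual evidence") shrinks the epistemic term while leaving the aleatoric term unchanged, which is why double-counted evidence can make a model look more certain than it should be.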

QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

2604.15859v1 by Jeremy Qin, Maksym Andriushchenko

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
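
The two quantities named in this abstract, empirical coverage and interval sharpness, can be computed as follows. This is a generic sketch with hypothetical function names; using mean interval width as the sharpness measure is an assumption, since the abstract does not say how QuantSightBench defines it.

```python
def empirical_coverage(intervals, outcomes):
    """Fraction of realized outcomes that fall inside their predicted interval."""
    if len(intervals) != len(outcomes) or not outcomes:
        raise ValueError("need one interval per outcome and at least one outcome")
    hits = sum(1 for (lo, hi), y in zip(intervals, outcomes) if lo <= y <= hi)
    return hits / len(outcomes)


def mean_width(intervals):
    """Average interval width; narrower intervals are sharper."""
    return sum(hi - lo for lo, hi in intervals) / len(intervals)
```

A forecaster that targets 90% intervals is calibrated when `empirical_coverage` is close to 0.90; coverage below the target at a given width indicates overconfidence.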

Stein Variational Black-Box Combinatorial Optimization

2604.15837v1 by Thomas Landais, Olivier Goudet, Adrien Goëffon, Frédéric Saubion, Sylvain Lamprier

Combinatorial black-box optimization in high-dimensional settings demands a careful trade-off between exploiting promising regions of the search space and preserving sufficient exploration to identify multiple optima. Although Estimation-of-Distribution Algorithms (EDAs) provide a powerful model-based framework, they often concentrate on a single region of interest, which may result in premature convergence when facing complex or multimodal objective landscapes. In this work, we incorporate the Stein operator to introduce a repulsive mechanism among particles in the parameter space, thereby encouraging the population to disperse and jointly explore several modes of the fitness landscape. Empirical evaluations across diverse benchmark problems show that the proposed method achieves performance competitive with, and in several cases superior to, leading state-of-the-art approaches, particularly on large-scale instances. These findings highlight the potential of Stein variational gradient descent as a promising direction for addressing large, computationally expensive, discrete black-box optimization problems.
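
The repulsive mechanism mentioned in this abstract comes from the kernel-gradient term of Stein variational gradient descent (SVGD). The sketch below shows one generic SVGD step for scalar particles with an RBF kernel. It is not the paper's algorithm: the paper applies the idea to distribution parameters in a discrete optimization setting, whereas the fixed bandwidth, step size, and function names here are illustrative assumptions.

```python
import math


def rbf_kernel(a, b, bandwidth):
    """RBF kernel k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def svgd_direction(particles, grad_log_p, bandwidth=1.0):
    """SVGD update direction for each scalar particle."""
    n = len(particles)
    directions = []
    for x_i in particles:
        total = 0.0
        for x_j in particles:
            k = rbf_kernel(x_j, x_i, bandwidth)
            attract = k * grad_log_p(x_j)             # pulls toward high density
            repel = k * (x_i - x_j) / bandwidth ** 2  # d/dx_j of k(x_j, x_i)
            total += attract + repel
        directions.append(total / n)
    return directions


def svgd_step(particles, grad_log_p, step_size=0.1, bandwidth=1.0):
    """Move every particle one step along its SVGD direction."""
    phi = svgd_direction(particles, grad_log_p, bandwidth)
    return [x + step_size * d for x, d in zip(particles, phi)]
```

With a flat target (zero gradient) only the repulsive term remains, so nearby particles drift apart; this is the dispersion effect that counteracts premature convergence to a single mode.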

Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI

2604.15808v1 by Lama Moukheiber, Caleb M. Yeung, Haotian Xue, Alec Helbling, Zelin Zhao, Yongxin Chen

Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

2604.15760v1 by Ankit Maloo

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.

SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification

2604.15711v1 by Enhui Chai, Sicheng Chen, Tianyi Zhang, Xingyu Li, Tianxiang Cui

Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.

CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder

2604.15611v1 by Duy-Phuong Dao, Muhammad Taqiyuddin, Jahae Kim, Sang-Heon Lee, Hye-Won Jung, Jaehoo Choi, Hyung-Jeong Yang

Latent diffusion models (LDMs) have emerged as powerful generative models in medical imaging, enabling the synthesis of high-quality brain magnetic resonance imaging (MRI) scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB, Controllable Longitudinal brain Image generation via Mamba-based latent diffusion model, an advanced framework for modeling temporal changes in brain structure. CLIMB is designed to model the structural evolution of the brain over time, utilizing a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Unlike existing LDM methods that rely on self-attention modules, which effectively capture contextual information from input images but are computationally expensive, our approach leverages Mamba, a state space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis. Furthermore, we introduce a Gaussian-aligned autoencoder that extracts latent representations conforming to prior distributions without the sampling noise inherent in conventional variational autoencoders. We train and evaluate our proposed model on the Alzheimer's Disease Neuroimaging Initiative dataset, consisting of 6,306 MRI scans from 1,390 participants. By comparing generated images with real MRI scans, CLIMB achieves a structural similarity index of 0.9433, demonstrating notable improvements over existing methods.

Robustifying and Selecting Cohort-Appropriate Prognostic Models under Distributional Shifts

2604.16537v1 by Dimitris Bertsimas, Carol Gao, Angelos G. Koulouras, Georgios Antonios Margonis

External validation is widely regarded as the gold standard for prognostic model evaluation. In this study, we challenge the assumption that successful external calibration guarantees model generalizability and propose two complementary strategies to improve transportability of prognostic models across cohorts. Using six real-world surgical cohorts from tertiary academic centers, we tested whether successful external calibration depends largely on similarity in covariates and outcomes between training and validation cohorts, quantified using Kullback-Leibler (KL) divergence, with calibration assessed by the Integrated Calibration Index (ICI). From the model-developer's perspective, we trained the "best-on-average" prognostic model by tuning toward a meta-analysis-derived covariate and outcome distribution as an approximation of the broader target population. From the end-user perspective, we proposed a simple measure for cohort outcome similarity to identify, among published models, the one most suitable for a given target cohort in terms of both calibration and clinical utility. External calibration worsened as distributional mismatch increased. Higher KL divergence was associated with higher ICI in both surgery-alone (Spearman $ρ=0.614$, $p=0.004$) and surgery + adjuvant chemotherapy cohorts (Spearman $ρ=0.738$, $p<0.001$). Meta-analysis-informed weighting improved calibration in most settings without materially affecting discrimination, with the clearest benefit when evaluated on the aggregated external population ($p=0.037$). Models developed in more similar cohorts achieved lower ICI in surgery-alone (Spearman $ρ=0.803$, $p<0.001$) and surgery + adjuvant chemotherapy cohorts (Spearman $ρ=0.737$, $p<0.001$), and provided greater clinical utility on decision curve analysis (DCA).
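
The cohort-mismatch measure in this abstract is the Kullback-Leibler divergence. A minimal sketch for two discrete distributions (for example, the proportions of patients in each category of a covariate) is shown below. How the authors estimate the distributions from cohort data is not stated in the abstract, so the inputs and the function name are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same categories."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total
```

The divergence is zero only when the two cohorts have identical category proportions, and it is asymmetric, so the choice of which cohort plays the role of `p` matters.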

Towards Reliable Testing of Machine Unlearning

2604.16536v1 by Anna Mazhar, Sainyam Galhotra

Machine learning components are now central to AI-infused software systems, from recommendations and code assistants to clinical decision support. As regulations and governance frameworks increasingly require deleting sensitive data from deployed models, machine unlearning is emerging as a practical alternative to full retraining. However, unlearning introduces a software quality-assurance challenge: under realistic deployment constraints and imperfect oracles, how can we test that a model no longer relies on targeted information? This paper frames unlearning testing as a first-class software engineering problem. We argue that practical unlearning tests must provide (i) thorough coverage over proxy and mediated influence pathways, (ii) debuggable diagnostics that localize where leakage persists, (iii) cost-effective regression-style execution under query budgets, and (iv) black-box applicability for API-deployed models. We outline a causal, pathway-centric perspective, causal fuzzing, that generates budgeted interventions to estimate residual direct and indirect effects and produce actionable "leakage reports". Proof-of-concept results illustrate that standard attribution checks can miss residual influence due to proxy pathways, cancellation effects, and subgroup masking, motivating causal testing as a promising direction for unlearning testing.

A Q-learning-based QoS-aware multipath routing protocol in IoMT-based wireless body area network

2604.15489v1 by Mehdi Hosseinzadeh, Roohallah Alizadehsani, Amin Beheshti, Hamid Alinejad-Roknyd, Lu Chen, Mohammad Sadegh Yousefpoor, Efat Yousefpoor, Muneera Altayeb, Thantrira Porntaveetus, Sadia Din

The Internet of Medical Things (IoMT) enables intelligent healthcare services but faces challenges such as dynamic topology, energy constraints, and diverse quality-of-service (QoS) requirements. This paper proposes QQMR, a Q-learning-based QoS-aware multipath routing method for wireless body area networks (WBANs). QQMR classifies data into three priority levels and employs adaptive multi-level queuing and fuzzy C-means clustering to optimize routing decisions. It maintains separate learning policies for each data type and selects primary and backup paths accordingly. Experimental results demonstrate improved packet delivery ratio and significant reductions in delay, routing overhead, and energy consumption compared to existing methods.
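
The idea of one learning policy per data type with primary and backup paths can be sketched with the standard tabular Q-learning update. Everything specific here is an assumption for illustration: the priority labels, node names, reward values, and hyperparameters are hypothetical, and QQMR's actual state, action, and reward design is not given in the abstract.

```python
from collections import defaultdict

PRIORITIES = ("high", "medium", "low")  # hypothetical labels for the three levels


def make_q_tables():
    """One independent Q-table per priority level, keyed by (node, next_hop)."""
    return {p: defaultdict(float) for p in PRIORITIES}


def q_update(q, state, action, reward, next_state, next_actions, alpha=0.5, gamma=0.9):
    """Standard update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


def ranked_next_hops(q, state, actions):
    """Next hops sorted by Q-value: first is the primary path, second the backup."""
    return sorted(actions, key=lambda a: q[(state, a)], reverse=True)
```

Keeping a separate table per priority lets the reward for urgent traffic emphasize delay while the reward for routine traffic emphasizes energy, without the two objectives overwriting each other.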

Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models

2604.16532v1 by Emily Curl, Kofi Ampomah, Md Erfan, Sayanton Dibbo

While deep learning systems are becoming increasingly prevalent in medical image analysis, their vulnerabilities to adversarial perturbations raise serious concerns for clinical deployment. These vulnerability evaluations largely rely on Attack Success Rate (ASR), a binary metric that indicates solely whether an attack is successful. However, the ASR metric does not account for other factors, such as perturbation strength, perceptual image quality, and cross-architecture attack transferability, and therefore, the interpretation is incomplete. This gap requires consideration, as complex, large-scale deep learning systems, including Vision Transformers (ViTs), are increasingly challenging the dominance of Convolutional Neural Networks (CNNs). These architectures learn differently, and it is unclear whether a single metric, e.g., ASR, can effectively capture adversarial behavior. To address this, we perform a systematic empirical study on four medical image datasets: PathMNIST, DermaMNIST, RetinaMNIST, and CheXpert. We evaluate seven models (VGG-16, ResNet-50, DenseNet-121, Inception-v3, DeiT, Swin Transformer, and ViT-B/16) against seven attack methods at five perturbation budgets, measuring ASR, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and $L_2$ perturbation magnitude. Our findings show a consistent pattern: perceptual and distortion metrics are strongly associated with one another and exhibit minimal correlation with ASR. This applies to both CNNs and ViTs. The results demonstrate that ASR alone is an inadequate indicator of adversarial robustness and transferability. Consequently, we argue that a thorough assessment of adversarial risk in medical AI necessitates multi-metric frameworks that encompass not only the attack efficacy but also its methodology and associated overheads.
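
Two of the distortion metrics listed in this abstract, the $L_2$ perturbation magnitude and PSNR, have simple closed forms. The sketch below computes them on flattened pixel lists; it assumes pixel values scaled to `[0, max_value]` and uses hypothetical function names. SSIM is omitted because it requires windowed local statistics.

```python
import math


def l2_norm(clean, adversarial):
    """L2 magnitude of the perturbation (adversarial - clean)."""
    return math.sqrt(sum((a - c) ** 2 for c, a in zip(clean, adversarial)))


def psnr(clean, adversarial, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((a - c) ** 2 for c, a in zip(clean, adversarial)) / len(clean)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Both metrics depend only on the size of the perturbation, not on whether the model's prediction flips, which is consistent with the abstract's finding that they correlate with each other but only weakly with ASR.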

RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference

2604.15459v1 by Yuxin Liu, Yiqing Dong, Wenxue Yu, Zhan Wu, Rongjun Ge, Yang Chen, Yuting He

Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose RelativeFlow, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support different medical imaging modalities. Extensive experiments on Computed Tomography (CT) and Magnetic Resonance (MR) denoising demonstrate that RelativeFlow significantly outperforms existing methods, taming MID with noisy references.

DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

2604.15456v1 by Zhizheng Wang, Chih-Hsuan Wei, Joey Chan, Robert Leaman, Chi-Ping Day, Chuan Wu, Mark A Knepper, Antolin Serrano Farias, Jordina Rincon-Torroella, Hasan Slika, Betty Tyler, Ryan Huu-Tuan Nguyen, Asmita Indurkar, Mélanie Hébert, Shubo Tian, Lauren He, Noor Naffakh, Aseem Aseem, Nicholas Wan, Emily Y Chew, Tiarnan D L Keenan, Zhiyong Lu

Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med's conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.

SegWithU: Uncertainty as Perturbation Energy for Single-Forward-Pass Risk-Aware Medical Image Segmentation

2604.15271v2 by Tianhao Fu, Austin Wang, Charles Chen, Roby Aldave-Garza, Yucheng Chen

Reliable uncertainty estimation is critical for medical image segmentation, where automated contours feed downstream quantification and clinical decision support. Many strong uncertainty methods require repeated inference, while efficient single-forward-pass alternatives often provide weaker failure ranking or rely on restrictive feature-space assumptions. We present SegWithU, a post-hoc framework that augments a frozen pretrained segmentation backbone with a lightweight uncertainty head. SegWithU taps intermediate backbone features and models uncertainty as perturbation energy in a compact probe space using rank-1 posterior probes. It produces two voxel-wise uncertainty maps: a calibration-oriented map for probability tempering and a ranking-oriented map for error detection and selective prediction. Across ACDC, BraTS2024, and LiTS, SegWithU is the strongest and most consistent single-forward-pass baseline, achieving AUROC/AURC of 0.9838/2.4885, 0.9946/0.2660, and 0.9925/0.8193, respectively, while preserving segmentation quality. These results suggest that perturbation-based uncertainty modeling is an effective and practical route to reliability-aware medical segmentation. Source code is available at https://github.com/ProjectNeura/SegWithU.

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

2604.15231v1 by Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

Vision-language models (VLMs) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.
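
The abstract reports each gain both in absolute points and as a relative percentage, which together imply approximate baseline and improved scores. The short calculation below recovers them. The resulting numbers are derived from rounded figures and are not reported in the abstract, so they should be read as estimates; the function name is hypothetical.

```python
def implied_scores(abs_gain, rel_gain):
    """Given an absolute gain and its relative size, return (baseline, improved).

    Uses rel_gain = abs_gain / baseline, so baseline = abs_gain / rel_gain.
    """
    if rel_gain <= 0:
        raise ValueError("relative gain must be positive")
    baseline = abs_gain / rel_gain
    return baseline, baseline + abs_gain


macro_f1 = implied_scores(6.0, 0.364)     # macro-F1: 6.0 points, 36.4% relative
micro_f1 = implied_scores(5.4, 0.196)     # micro-F1: 5.4 points, 19.6% relative
robustness = implied_scores(24.7, 0.419)  # robustness: 24.7 points, 41.9% relative
```

Under this reading, macro-F1 rises from roughly 16.5 to roughly 22.5, which shows that the large relative gain starts from a low baseline.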

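Summary: Vision-language models (VLMs) have markedly advanced AI-based interpretation and reporting of complex medical imaging such as computed tomography (CT). However, existing methods largely leave clinicians as passive observers of final outputs and provide no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise, interpretable process. Each generated report comes with a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings were derived. In our experiments, RadAgent improves chest CT report generation over its 3D VLM counterpart, CT-Chat, along three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and by 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). In addition, RadAgent reaches 37.0% in faithfulness, a new capability entirely absent from its 3D VLM counterpart. By structuring chest CT interpretation as an explicit, tool-augmented, iterative reasoning trace, RadAgent brings us closer to transparent and reliable AI for radiology.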
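
The abstract reports each gain both in absolute points and as a relative percentage, which together imply approximate scores for the CT-Chat baseline and for RadAgent. The sketch below back-computes them. The helper name `implied_scores` is hypothetical, and the results are only estimates: the published figures are rounded and the abstract does not state the underlying scores.

```python
def implied_scores(absolute_gain, relative_gain_pct):
    """Back out (baseline, improved) scores from an absolute and a relative gain.

    relative gain = absolute gain / baseline, so baseline = absolute / relative.
    """
    baseline = absolute_gain / (relative_gain_pct / 100.0)
    return baseline, baseline + absolute_gain


if __name__ == "__main__":
    # (absolute gain in points, relative gain in percent), as quoted in the abstract.
    reported = {
        "macro-F1": (6.0, 36.4),
        "micro-F1": (5.4, 19.6),
        "adversarial robustness": (24.7, 41.9),
    }
    for metric, (gain, pct) in reported.items():
        baseline, improved = implied_scores(gain, pct)
        print(f"{metric}: baseline ~{baseline:.1f}, RadAgent ~{improved:.1f}")
```

Read this way, the large relative gain in macro-F1 reflects a low implied baseline rather than a large absolute jump.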

Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF

2604.16528v1 by Nicklas Neu, Thomas Ebner, Jasmin Primus, Bernhard Schenkenfelder, Raphael Zefferer, Mathias Brunbauer, Florian Kromp

Embryo selection is one of multiple crucial steps in in-vitro fertilization, commonly based on morphological assessment by clinical embryologists. Although artificial intelligence methods have demonstrated their potential to support embryo selection by automated embryo ranking or grading methods, the overall impact of AI-based solutions is still limited. This is mainly due to the required adaptation of automated solutions to custom clinical data, reliance on time lapse incubators and a lack of interpretability to understand AI reasoning. The modern, informed patient is questioning expert decisions, particularly if the treatment is not successful. Thus, evidence-based decision justification in tasks like embryo selection would support transparent decision making and respectful patient communication. To support this aim, we hereby present an expert-annotated dataset consisting of embryo images and corresponding morphological description using natural language. The description contains relevant information on embryonic cell cycle, developmental stage and morphological features. This dataset enables the finetuning of modern foundational vision-language models to learn and improve over time with high accuracy. Predicted embryo descriptions can then be leveraged to automatically extract scientific evidence from literature, facilitating well-informed, evidence-based decision-making and transparent communication with patients. Our proposed dataset supports research in language-based, interpretable, and transparent automated embryo assessment and has the potential to enhance the decision-making process and improve patient outcomes significantly over time.

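Summary: Embryo selection is one of several crucial steps in in-vitro fertilization and is commonly based on morphological assessment by clinical embryologists. Although artificial intelligence methods have shown potential to support embryo selection through automated embryo ranking or grading, the overall impact of AI-based solutions remains limited. This is mainly because automated solutions must be adapted to custom clinical data, rely on time-lapse incubators, and lack the interpretability needed to understand the AI's reasoning. The modern, informed patient questions expert decisions, particularly when treatment is unsuccessful. Evidence-based justification of decisions in tasks such as embryo selection would therefore support transparent decision making and respectful patient communication. To support this aim, we present an expert-annotated dataset of embryo images with corresponding morphological descriptions in natural language. The descriptions contain relevant information on embryonic cell cycle, developmental stage, and morphological features. The dataset enables fine-tuning of modern foundational vision-language models so that they learn and improve in accuracy over time. Predicted embryo descriptions can then be used to automatically extract scientific evidence from the literature, facilitating well-informed, evidence-based decision making and transparent communication with patients. Our proposed dataset supports research on language-based, interpretable, and transparent automated embryo assessment, and has the potential to significantly enhance the decision-making process and improve patient outcomes.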

Hybrid Decision Making via Conformal VLM-generated Guidance

2604.14980v2 by Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini

Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.

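Summary: Building on recent advances in artificial intelligence, hybrid decision making (HDM) promises to improve the quality of human decisions and reduce cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: in LtG the AI supplies (textual) guidance that helps decision making rather than suggesting decisions. One limiting factor of existing approaches is that their guidance combines information about all possible outcomes and can therefore be hard to digest. We address this by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it uses conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.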
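
The abstract does not spell out how ConfGuide applies conformal risk control, so the following is only a generic sketch of the technique for a multi-label setting, under stated assumptions: outcomes are selected by thresholding per-label probabilities, the loss is the per-case false negative rate (bounded by 1), and calibration cases are exchangeable with test cases. The function names, the grid search, and the data layout are hypothetical and not taken from the paper.

```python
def false_negative_rate(probs, labels, threshold):
    """Fraction of a case's true labels whose probability falls below the threshold."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    if not positives:
        return 0.0  # no true labels, so nothing can be missed
    missed = sum(1 for p in positives if p < threshold)
    return missed / len(positives)


def calibrate_threshold(cal_probs, cal_labels, alpha, grid_size=101):
    """Largest grid threshold whose adjusted calibration risk stays <= alpha.

    Uses the conformal risk control condition for a loss bounded by 1:
        (n * empirical_risk + 1) / (n + 1) <= alpha
    """
    n = len(cal_probs)
    best = 0.0  # fallback: threshold 0 keeps every outcome
    for step in range(grid_size):
        threshold = step / (grid_size - 1)
        risk = sum(
            false_negative_rate(p, y, threshold)
            for p, y in zip(cal_probs, cal_labels)
        ) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            best = threshold
        else:
            break  # the FNR can only grow as the threshold rises
    return best


def select_outcomes(probs, threshold):
    """Indices of the outcomes kept for guidance at the calibrated threshold."""
    return [k for k, p in enumerate(probs) if p >= threshold]
```

A higher threshold yields a smaller, more targeted outcome set, and the calibration step picks the most selective threshold that still keeps the expected false negative rate at or below `alpha`. If the calibration set is too small for any threshold to qualify, the sketch falls back to keeping every outcome.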

Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

2604.14892v2 by Amy Rouillard, Sitwala Mundia, Linda Camara, Michael Cameron Gramanie, Ziyaad Dangor, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world middle-income country (MIC) hospital cases. Model performance was benchmarked against expert clinician panel and independent human re-scoring panel evaluations. Both LLM and clinician-generated diagnoses are scored across four dimensions: diagnosis, differential diagnosis, clinical reasoning and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than clinician panel scores; (ii) the LLM jury preserves ordinal agreement and exhibits better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower in LLM jury models compared to the human expert re-score panels; (iv) the LLM jury shows excellent agreement with primary expert panels' rankings. We find that the LLM jury combined with AI model diagnoses can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) LLM jury models show no self-preference bias. They did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Finally, we demonstrate that LLM jury calibration using isotonic regression improves alignment with human expert panel evaluations. Together, these results provide compelling evidence that a calibrated, multi-model LLM jury can serve as a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking.

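Summary: Evaluating medical AI systems with expert clinician panels is costly and slow, which motivates the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models that scores 3333 diagnoses on 300 middle-income country (MIC) hospital cases. Model performance is benchmarked against evaluations by an expert clinician panel and by an independent human re-scoring panel. Both LLM-generated and clinician-generated diagnoses are scored on four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors, and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than the clinician panel scores; (ii) the LLM jury preserves ordinal agreement and shows better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower for the LLM jury models than for the human expert re-score panels; (iv) the LLM jury shows excellent agreement with the primary expert panels' rankings. We find that the LLM jury combined with AI model diagnoses can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) the LLM jury models show no self-preference bias. They did not score diagnoses generated by their own underlying model, or by models from the same vendor, more (or less) favourably than those generated by other models. Finally, we demonstrate that calibrating the LLM jury with isotonic regression improves alignment with human expert panel evaluations. Taken together, these results provide compelling evidence that a calibrated, multi-model LLM jury can serve as a reliable proxy for expert clinician evaluation in medical AI benchmarking.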
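
The abstract names isotonic regression as the calibration step but gives no implementation details. As a rough illustration of the idea, the sketch below fits a non-decreasing step function from raw jury scores to panel scores with the pool-adjacent-violators algorithm. The function names and the clamping behaviour outside the fitted range are assumptions of this sketch, not the authors' pipeline.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(jury_scores, panel_scores):
    """Least-squares non-decreasing fit via pool-adjacent-violators.

    Returns (knots, values): values[i] applies to scores up to knots[i].
    """
    # Pool repeated jury scores first so ties share one fitted value.
    groups = defaultdict(lambda: [0.0, 0])
    for x, y in zip(jury_scores, panel_scores):
        groups[x][0] += y
        groups[x][1] += 1

    blocks = []  # each block is [sum_of_y, count, right_edge_x]
    for x in sorted(groups):
        total, count = groups[x]
        blocks.append([total, count, x])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right

    knots = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return knots, values


def calibrate(score, knots, values):
    """Map a raw jury score onto the fitted step function, clamped at the top."""
    idx = bisect_left(knots, score)
    return values[min(idx, len(values) - 1)]
```

Because the fitted map is monotone, calibration never reverses the jury's ordering of diagnoses, though it can merge neighbouring scores into ties. That is in keeping with the abstract's observation that the uncalibrated jury already preserves ordinal agreement while scoring systematically lower than the clinician panels.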

MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

2604.14866v1 by Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang

Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.

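Summary: Vision-language models (VLMs) have shown significant potential in medical image analysis, yet their application to intraoral photography remains largely unexplored because fine-grained annotated datasets and comprehensive benchmarks are lacking. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel, large-scale dental image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images with this meta-labeling scheme. Using large language models (LLMs), we derive standardized benchmarks: approximately 15K visual question answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated through human review and error analysis to show that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs on VQA, classification, and image captioning tasks. Quantitative results show that even the most advanced models struggle with fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.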