Medical

Publish Date	Title	Authors	arXiv ID	Code
2026-06-17	Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA	Ikram Belmadani et.al.	2606.19266v1	null
2026-06-17	A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers	Keran Wang et.al.	2606.19247v1	null
2026-06-17	Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis	Soheyl Bateni et.al.	2606.19183v1	null
2026-06-17	A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies	Fangyijie Wang et.al.	2606.19174v1	https://github.com/13204942/SonoRate
2026-06-17	A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI	Syed Mujtaba Haider et.al.	2606.18970v1	null
2026-06-17	Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems	Bernardo Feijó Junqueira et.al.	2606.18882v1	null
2026-06-17	RedactionBench	Sean Brynjólfsson et.al.	2606.18782v1	null
2026-06-17	Augmenting Dysarthric Speech Severity Assessment with MOS Supervision	Kaimeng Jia et.al.	2606.18645v1	null
2026-06-17	Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance	Tianming Du et.al.	2606.18613v1	null
2026-06-17	Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep	Amama Mahmood et.al.	2606.18596v1	null
2026-06-16	PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization	Arshia Ilaty et.al.	2606.18518v1	null
2026-06-16	From Specification to Execution: AI Assisted Scientific Workflow Management	Komal Thareja et.al.	2606.18425v1	null
2026-06-16	RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills	Weizhi Zhang et.al.	2606.18203v1	null
2026-06-16	WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning	Yuwei Zhang et.al.	2606.18147v1	null
2026-06-16	Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour	Abeer Badawi et.al.	2606.18129v1	null
2026-06-16	Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications	Divyansh Srivastava et.al.	2606.18068v1	null
2026-06-16	When LLMs Analyze Scars: From Images to Clinically-Meaningful Features	Ruman Wang et.al.	2606.18063v1	null
2026-06-16	ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents	Ander Alvarez et.al.	2606.18037v1	null
2026-06-16	Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis	Yonghao Chen et.al.	2606.17989v1	null
2026-06-16	STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training	Jinjie Shen et.al.	2606.17979v1	null
2026-06-16	Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation	Andrea Santomauro et.al.	2606.17961v1	null
2026-06-16	A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease	Antonio Scardace et.al.	2606.17867v1	null
2026-06-16	When Multiple Scripts Matter: Evaluating ASR in Clinical Settings	Jean Seo et.al.	2606.17826v1	null
2026-06-16	Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection	Nikola Kovacevic et.al.	2606.17767v1	null
2026-06-16	Vision-language models for chest radiography do not always need the image	Mahshad Lotfinia et.al.	2606.17710v1	null
2026-06-16	SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology	Wan Siti Halimatul Munirah Wan Ahmad et.al.	2606.17702v1	null
2026-06-16	AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows	Jiahui Niu et.al.	2606.17474v1	null
2026-06-16	A Machine-Learned Comorbidity Index	Suleman Baloch et.al.	2606.17450v1	null
2026-06-16	Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems	Xi Chu et.al.	2606.17443v1	null
2026-06-16	Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos	Bo Gou et.al.	2606.17437v1	null
2026-06-16	Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization	Jianwei Zhang et.al.	2606.17420v1	null
2026-06-16	Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation	Xinyu Qin et.al.	2606.17405v1	null
2026-06-15	Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation	Hongchao Shu et.al.	2606.17340v1	null
2026-06-15	SpeechDx: A Multi-Task Benchmark for Clinical Speech AI	Sejal Bhalla et.al.	2606.17339v1	null
2026-06-15	Symbolic Informalization: Fluent, Productive, Multilingual	Aarne Ranta et.al.	2606.16893v1	null
2026-06-15	Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering	Sanjay Basu et.al.	2606.16890v1	null
2026-06-15	Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection	Markus Bujotzek et.al.	2606.16868v1	null
2026-06-15	GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents	Rahul Suresh Babu et.al.	2606.16813v1	null
2026-06-15	AgentFairBench: Do LLM Agents Discriminate When They Act?	Triveni Morla et.al.	2606.16723v1	null
2026-06-15	Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies	Ke Liu et.al.	2606.16721v1	null
2026-06-15	Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation	Rutherford A. Patamia et.al.	2606.16568v1	null
2026-06-15	Unified Multimodal Model for Brain MRI Imputation and Understanding	Zhiyun Song et.al.	2606.16484v1	null
2026-06-15	Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis	Jingyu Hu et.al.	2606.17115v1	null
2026-06-15	Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning	Junting Wen et.al.	2606.16434v1	null
2026-06-15	Input-Dependent Fisher Information for Local Sensitivity Analysis of Medical Image Classifiers	Sourya Sengupta, Mark A. Anastasio et.al.	2606.16362v1	null
2026-06-15	Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules	Wei Xu et.al.	2606.16337v2	null
2026-06-15	Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans	Tengfei Ma et.al.	2606.16234v1	null
2026-06-15	Embedded Arena: Iterative Optimization via Hardware Feedback	Zhihan Zhang et.al.	2606.16190v1	null
2026-06-15	A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond	Pengyu Zhu et.al.	2606.16153v1	null
2026-06-15	LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis	Minh-Ha Nguyen et.al.	2606.16149v1	null
2026-06-15	PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization	Samah Fodeh et.al.	2606.16074v1	null
2026-06-14	DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts	Zijian Carl Ma et.al.	2606.15931v1	null
2026-06-14	Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot	Yuyang Dai et.al.	2606.15810v1	null
2026-06-14	EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries	Jiyoun Kim et.al.	2606.15735v2	null
2026-06-14	Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning	Zhenyu Yu et.al.	2606.15733v1	null
2026-06-14	AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan	Mohammed Fasha et.al.	2606.15709v1	null
2026-06-14	LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science	Eyup Engin Kucuk et.al.	2606.15566v1	null
2026-06-13	Hierarchical Modeling of ICD Codes in EHR Foundation Models	Megha Thukral et.al.	2606.15447v1	null
2026-06-13	Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models	Mayur Sanap et.al.	2606.15436v1	null
2026-06-13	Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering	Zaifu Zhan et.al.	2606.15419v1	null
2026-06-13	APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents	Ya-Chuan Chen et.al.	2606.15363v1	null
2026-06-13	CAP: Towards PPG Universal Representation Learning with Patient-level Supervision	Chenyang He et.al.	2606.15284v1	null
2026-06-13	RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning	Jinhan Liu et.al.	2606.15278v1	null
2026-06-13	Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs	Zhisen Hu et.al.	2606.15250v1	null
2026-06-13	Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings	Weihao Gao et.al.	2606.15176v1	null
2026-06-13	Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting	Xinze Zhang et.al.	2606.15055v1	null
2026-06-13	Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling	Zhemin Zhang et.al.	2606.15038v1	null
2026-06-12	Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability	Alyssa Unell et.al.	2606.15029v1	null
2026-06-12	ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning	Sicheng Yang et.al.	2606.14697v1	null
2026-06-12	Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts	Farica Zhuang et.al.	2606.14608v1	null
2026-06-12	A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health	Pavlos Nicolaou et.al.	2606.14604v1	null
2026-06-12	CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation	Guanyu Liu et.al.	2606.14581v1	null
2026-06-12	Securing the Future of IoMT in the Post-Quantum Era: An Edge-Native Federated Learning Approach	Taym Alshoghri et.al.	2606.14515v1	null
2026-06-12	Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport	Paula Joy B. Martinez et.al.	2606.14157v1	null
2026-06-12	Applicability Condition Extraction for Therapeutic Drug-Disease Relations	Guanting Luo et.al.	2606.14031v1	null
2026-06-11	Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography	Louis Chen et.al.	2606.13839v1	null
2026-06-11	ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages	Tanmoy Kanti Halder et.al.	2606.13572v1	null
2026-06-11	Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation	Aruna Dey et.al.	2606.13556v2	null
2026-06-11	MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment	Minlin Zeng et.al.	2606.13258v2	null
2026-06-11	Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints	Omar Alshahrani et.al.	2606.13211v1	null
2026-06-11	Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework	Abhishek H S et.al.	2606.13188v1	null
2026-06-11	Mental-R1: Aligning LLM Reasoning for Mental Health Assessment	Xin Wang et.al.	2606.13176v1	null
2026-06-11	Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation	Elena S. Kozachok et.al.	2606.13135v1	null
2026-06-11	AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction	Fabien Maury et.al.	2606.13051v1	null
2026-06-11	A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis	Manex Atxa et.al.	2606.12988v1	null
2026-06-11	OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models	Ibrahim Gulluk et.al.	2606.12953v1	null
2026-06-11	Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata	Daniel Soliman et.al.	2606.12824v1	null
2026-06-10	Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System	Alyssa Unell et.al.	2606.12702v1	null
2026-06-10	LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data	Yifan Gao et.al.	2606.12699v1	null
2026-06-10	CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents	Siyu Shen et.al.	2606.12666v2	null
2026-06-10	Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs	Shayan Mohammadizadehsamakosh et.al.	2606.12590v1	null
2026-06-10	EDEN: A Large-Scale Corpus of Clinical Notes for Italian	Tiziano Labruna et.al.	2606.12569v1	null
2026-06-10	Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy	Kai Standvoss et.al.	2606.12346v1	null
2026-06-10	Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification	Veerendhra Kumar Dangeti et.al.	2606.12252v1	null
2026-06-10	OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models	Negin Baghbanzadeh et.al.	2606.12169v1	null
2026-06-10	Towards Responsibly Non-Compliant Machines	Marija Slavkovik et.al.	2606.12147v1	null
2026-06-10	Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation	Minh-Khoi Pham et.al.	2606.12006v1	null
2026-06-10	Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability	Kuo-En Hung et.al.	2606.11930v2	null
2026-06-10	Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task	Qianyu Yao et.al.	2606.11830v1	null
2026-06-10	Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data	Boris-Stephan Rauchmann et.al.	2606.11794v1	null

Abstracts

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

2606.19266v1 by Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.
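
The difference between greedy and constrained decoding for MCQA can be illustrated with a minimal sketch. The token names, scores, and helper functions below are hypothetical and are not taken from the paper; the sketch only shows that restricting the choice to the answer labels can change the selected output.

```python
def greedy_choice(scores):
    """Return the highest-scoring token over the whole vocabulary."""
    return max(scores, key=scores.get)


def constrained_choice(scores, allowed):
    """Return the highest-scoring token among the allowed answer labels."""
    candidates = {tok: s for tok, s in scores.items() if tok in allowed}
    if not candidates:
        raise ValueError("no allowed token has a score")
    return max(candidates, key=candidates.get)


# Hypothetical next-token log-probabilities after a multiple-choice prompt.
scores = {"La": -0.4, "A": -1.9, "B": -1.2, "C": -3.0, "D": -2.5}

greedy = greedy_choice(scores)  # may pick a token that is not an answer label
constrained = constrained_choice(scores, allowed={"A", "B", "C", "D"})
```

With these made-up scores, greedy decoding starts a free-text answer, while constrained decoding is forced to commit to one of the four options.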

A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

2606.19247v1 by Keran Wang, Drishti Goel, Jiayue Melissa Shi, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

Family members caring for individuals with Alzheimer's disease and related dementias (AD/ADRD) provide the foundation of long-term care worldwide. In 2023, more than 11 million U.S. family and friends contributed 18 billion hours of unpaid care, often at the cost of their own physical and mental health. These informal caregivers -- also referred to as the "invisible second patients" -- experience elevated rates of mental health problems. Yet research commonly reduces their complex psychosocial experiences to a single construct of caregiver burden, obscuring which specific needs are unmet or effectively supported. At the same time, digital and AI-enabled technologies are rapidly expanding, from smartphone apps and videoconferencing to sensor platforms and AI chatbots. However, the absence of shared frameworks across medicine, psychology, and technology research limits cumulative progress. This study introduces a Caregiver Mental Health and Technology Taxonomy that systematically links AD/ADRD caregiver needs with corresponding classes of technology-based interventions. Drawing from an interdisciplinary literature review and two qualitative studies with caregivers, the taxonomy identifies mismatches between caregiver priorities and existing technological support, highlights under-served domains such as relational strain and compassion fatigue, and proposes design directions for adaptive, responsive systems. The framework offers a shared vocabulary to guide clinicians, researchers, and technology designers in developing more person-centered and clinically grounded innovation in dementia care.

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

2606.19183v1 by Soheyl Bateni, Maryam Abdolali

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.
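
The "schema-constrained extraction followed by deterministic plausibility checks" step can be sketched roughly as follows. This is not ClaMPAPP's code: the feature names, the numeric ranges, and the `validate_features` helper are all hypothetical, chosen only to show how implausible or missing extractions might be turned into explicit missing values before a tabular classifier sees them.

```python
# Hypothetical schema: feature name -> (type, min, max). Ranges are illustrative only.
SCHEMA = {
    "age_years": (float, 0.0, 18.0),
    "body_temperature_c": (float, 30.0, 43.0),
    "wbc_count_10e9_per_l": (float, 0.0, 100.0),
    "appendix_diameter_mm": (float, 0.0, 30.0),
}


def validate_features(extracted):
    """Keep values that match the schema and fall inside a plausible range.

    Returns (validated, issues). Implausible or missing values become None
    so that a downstream model can treat them as missing.
    """
    validated, issues = {}, []
    for name, (typ, lo, hi) in SCHEMA.items():
        raw = extracted.get(name)
        try:
            value = typ(raw)
        except (TypeError, ValueError):
            validated[name] = None
            issues.append(f"{name}: missing or not numeric")
            continue
        if lo <= value <= hi:
            validated[name] = value
        else:
            validated[name] = None
            issues.append(f"{name}: {value} outside [{lo}, {hi}]")
    return validated, issues


# Pretend this dict came from the LLM extraction step (made-up values).
llm_output = {
    "age_years": "9",
    "body_temperature_c": 38.6,
    "wbc_count_10e9_per_l": 410,   # implausible, will be rejected
    "appendix_diameter_mm": None,  # not mentioned in the note
}
features, problems = validate_features(llm_output)
```

Keeping the checks deterministic means the same narrative always yields the same validated feature vector, and the `problems` list gives a simple audit trail for rejected values.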

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

2606.19174v1 by Fangyijie Wang, Jianjun Yu, Wentao Shi, Haixia Huang, Ran Shi, Guénolé Silvestre, Kathleen M. Curran

Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. Existing medical image platforms primarily focus on dataset labeling. They lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces to enable clinicians to perform annotation, blinded ranking, and review without local dataset downloads. The pipeline also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall's $\tau$, and top-1 selection statistics. Results indicated moderate to strong agreement across experts and other groups. The blinded evaluation results showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub (https://github.com/13204942/SonoRate).
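
The three agreement statistics named in the abstract can be computed from blinded rankings along the following lines. This is a small sketch under stated assumptions, not the SonoRate implementation: rankings are assumed to be tie-free, Kendall's tau is computed as tau-a, and the rater data are invented.

```python
from itertools import combinations


def spearman_rho(rank_a, rank_b):
    """Spearman correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / pairs


def top1_counts(rankings):
    """Count how often each model index receives rank 1 (best)."""
    counts = {}
    for ranks in rankings:
        best = ranks.index(1)
        counts[best] = counts.get(best, 0) + 1
    return counts


# Hypothetical ranks given by two raters to four models (1 = best).
rater_1 = [4, 3, 2, 1]
rater_2 = [4, 2, 3, 1]

rho = spearman_rho(rater_1, rater_2)
tau = kendall_tau(rater_1, rater_2)
firsts = top1_counts([rater_1, rater_2])
```

In practice a library routine that handles ties (for example the rank-correlation functions in SciPy) would be the safer choice; the hand-written versions here only make the definitions explicit.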

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

2606.18970v1 by Syed Mujtaba Haider, Silvia Figini

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.
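
The abstract does not say which multiple-comparison correction was applied. As one common option, the sketch below applies the Holm-Bonferroni step-down adjustment to a set of raw p-values; the p-values themselves are invented and stand in for paired tests at different labeled-data fractions.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four paired comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adj = holm_adjust(raw)
significant = [p < 0.05 for p in adj]
```

Here three of the four raw p-values fall below 0.05, but only one comparison survives the correction, which is the kind of gap between uncorrected and corrected conclusions that a multi-fraction benchmark needs to guard against.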

Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

2606.18882v1 by Bernardo Feijó Junqueira, Claudio Kiyoshi Umezu, Bruno Bilhar Karaziack, Tomaz Junior, Daniel Alves Castello

This work investigates the application of a domain-shift aware neural network for regression tasks aimed at estimating unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, equipped with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could be optionally activated to introduce domain discrepancy. The unbalance masses were positioned at a fixed radial distance, and the dynamic response of the system was recorded using triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, where the network is trained with a maximum mean discrepancy strategy to align feature representations across source and target distributions. The results demonstrate the effectiveness of explicitly addressing domain shift in improving prediction accuracy, especially when the system's physical behavior and sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.
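
To make the maximum mean discrepancy (MMD) term concrete, the sketch below computes a biased estimate of squared MMD between two small sets of feature vectors with a Gaussian kernel. The kernel choice, the `gamma` value, and the toy feature vectors are assumptions for illustration; the abstract does not specify how the authors configure the loss.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2 * mean_kernel(source, target))


# Toy features: the "shifted" set mimics a target domain with a different operating condition.
source = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
shifted = [[1.0, 1.1], [1.2, 0.9], [1.1, 1.0]]

same = mmd_squared(source, source)   # identical distributions give zero
gap = mmd_squared(source, shifted)   # shifted distribution gives a positive value
```

In a domain adaptation setup of the kind described, a term like `mmd_squared` would typically be added to the regression loss so that minimizing it pulls source and target feature distributions together.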

RedactionBench

2606.18782v1 by Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.
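
The abstract does not define R-Score, so the following is not that metric. It is only a generic sketch of character-level scoring over redaction spans, which shows why comparing character offsets rather than masked strings makes the masking style irrelevant. The example sentence, spans, and function names are hypothetical.

```python
def span_chars(spans):
    """Expand (start, end) spans, end exclusive, into a set of character offsets."""
    return {i for start, end in spans for i in range(start, end)}


def char_level_prf(gold_spans, pred_spans):
    """Character-level precision, recall, and F1 between two sets of spans."""
    gold, pred = span_chars(gold_spans), span_chars(pred_spans)
    overlap = len(gold & pred)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


text = "Call Dr. Lee at 555-0142 today."
gold = [(16, 24)]  # the phone number only
pred = [(13, 24)]  # a system that also masked the preceding "at "

precision, recall, f1 = char_level_prf(gold, pred)
```

Because only offsets are compared, a system that outputs `[PHONE]` and one that outputs `XXX-XXXX` receive the same score as long as they cover the same characters.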

Augmenting Dysarthric Speech Severity Assessment with MOS Supervision

2606.18645v1 by Kaimeng Jia, Minzhu Tu, Zengrui Jin, Siyin Wang, Chao Zhang

Dysarthria is a speech disorder marked by reduced intelligibility and communicative effectiveness. Automatic utterance-level assessment of dysarthric speech can support scalable speech monitoring and therapy-related analysis. Yet training such systems is bottlenecked by the scarcity of clinically annotated dysarthric speech. This work proposes to augment dysarthric speech assessment using data from speech synthesis evaluations, specifically human-annotated utterances with Mean Opinion Score (MOS) labels from the QualiSpeech corpus. Experiments show that fine-tuning on speech synthesis assessment data consistently improves performance on both intelligibility and naturalness prediction, while joint training yields gains primarily on naturalness. These results suggest that synthesis artifacts and dysarthric speech share perceptual commonalities, and speech synthesis evaluation corpora offer a practical augmentation source that reduces reliance on scarce clinical annotations.

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

2606.18613v1 by Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

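Summary: The most plausible near-term role for medical LLMs is to assist physicians rather than replace them, yet current evaluations test clinical knowledge, EHR interaction, and patient communication in isolation. Real assistance must coordinate all three within one interaction, where physicians give underspecified requests, patients describe symptoms ambiguously, and EHR systems require precise tool use. PhysAssistBench, built from real MIMIC-IV cases, uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. Its curated bilingual evaluation set contains 1,296 manually reviewed, physician-validated turns. Leading LLMs remain unreliable in this setting, which points to coordination across knowledge, communication, and systems, rather than isolated gains in any one of them, as the key bottleneck.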

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

2606.18596v1 by Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

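Summary: Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, but daily completion is hard to sustain and static forms give little context for night-to-night variation. The authors built an LLM-powered conversational voice diary that asks clinically grounded morning and evening questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. In a four-week between-subjects field study with 30 university students, it was compared with a text-based mobile diary using matched items, reporting windows, and reminder intervals. The voice diary showed higher adherence and elicited richer contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors, and participants found it easier to fit into daily routines despite longer perceived completion time. It also produced lower completeness for some structured fields, revealing a trade-off between expressive richness and structured precision.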

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

2606.18518v1 by Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde, Dhanalakshmi Ramesh, Rashmi S. Manjunath, Amir Rahmani, Hajar Homayouni

The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution, but existing methods lack principled mechanisms to explicitly manage the privacy-utility trade-off, often degrading clinically meaningful patterns or risking patient re-identification. We present PSyGenTAB, a privacy-preserving generative framework that formulates synthetic healthcare data generation as a constrained optimization problem solved using the Augmented Lagrangian Method. By embedding configurable privacy constraints directly into model training, PSyGenTAB enforces minimum privacy thresholds while maximizing clinical data utility. Across multiple clinically motivated benchmarks, PSyGenTAB preserves inter-feature clinical relationships and minority-class diagnostic patterns essential for reliable health AI. Downstream evaluation using Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic protocols shows that models trained on synthetic data achieve performance comparable to those trained on real patient records. Privacy auditing further demonstrates reduced exact record reproduction and strong resilience to membership inference attacks. These results establish PSyGenTAB as a principled framework for balancing privacy protection and clinical utility in synthetic healthcare data, supporting secure cross-institutional AI development.

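Summary: Medical AI development is constrained by limited access to high-quality clinical data because of institutional silos and strict privacy regulations such as HIPAA and GDPR. Existing synthetic data methods lack principled mechanisms for managing the privacy-utility trade-off, and they often degrade clinically meaningful patterns or risk patient re-identification. PSyGenTAB formulates synthetic healthcare data generation as a constrained optimization problem solved with the Augmented Lagrangian Method, embedding configurable privacy constraints directly into training so that minimum privacy thresholds are enforced while clinical utility is maximized. Across multiple clinically motivated benchmarks it preserves inter-feature clinical relationships and minority-class diagnostic patterns. Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic evaluations show performance comparable to training on real patient records, and privacy auditing shows reduced exact record reproduction and strong resilience to membership inference attacks.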
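
The constrained-optimization framing in the abstract above can be illustrated on a toy problem. The sketch below is not PSyGenTAB's training procedure; it only shows the generic Augmented Lagrangian Method for one inequality constraint. The "utility loss" `(x - 3)^2` and the "privacy constraint" `x - 1 <= 0` are made-up stand-ins, and all function and parameter names are assumptions chosen for illustration.

```python
def augmented_lagrangian(f_grad, g, g_grad, x0, rho=10.0, outer=20, inner=200, lr=0.01):
    """Minimize f(x) subject to g(x) <= 0 for a scalar x (toy sketch).

    Uses the standard augmented Lagrangian for an inequality constraint:
    the inner loop runs gradient descent on x, the outer loop updates the
    multiplier with lam <- max(0, lam + rho * g(x)).
    """
    x, lam = x0, 0.0
    for _ in range(outer):
        for _ in range(inner):
            # Effective multiplier; zero while the constraint is comfortably satisfied.
            mult = max(0.0, lam + rho * g(x))
            x -= lr * (f_grad(x) + mult * g_grad(x))
        lam = max(0.0, lam + rho * g(x))
    return x, lam


# Hypothetical toy problem: the unconstrained optimum is x = 3,
# but the constraint x <= 1 forces the solution onto the boundary.
x_star, lam_star = augmented_lagrangian(
    f_grad=lambda x: 2.0 * (x - 3.0),  # gradient of (x - 3)^2
    g=lambda x: x - 1.0,               # constraint g(x) = x - 1 <= 0
    g_grad=lambda x: 1.0,
    x0=0.0,
)
print(x_star, lam_star)
```

On this toy problem the iterate settles on the constraint boundary near `x = 1`, and the multiplier approaches 4, the value that balances the objective gradient there. In a generative model the scalar `x` would be replaced by network parameters and `g` by a measured privacy metric minus its threshold, which is where the real design effort lies.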

From Specification to Execution: AI Assisted Scientific Workflow Management

2606.18425v1 by Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.

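Summary: Scientific workflow management systems (WMS) support scalable, reproducible execution of complex pipelines, but designing, implementing, and debugging workflows is still largely manual and expertise-heavy. LLM-based workflow generation from natural language is promising but usually relies on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. The authors combine specification-driven workflow generation, automated debugging, and distributed execution: a structured specification phase separates intent, design, and implementation so that workflows can be validated before code generation, and an LLM-based debugging agent diagnoses and resolves failures across system layers. Pegasus, a widely used WMS, is integrated with a Model Context Protocol (MCP) layer that gives a unified interface for submission, monitoring, and control. On a federated learning workflow for medical imaging, the system generated and executed workflows with thousands of jobs, reduced debugging effort, and let non-expert users build workflows with expert-level design patterns.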

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

2606.18203v1 by Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

LLM-empowered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

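Summary: LLM-powered personal health agents that use user health (sensor) metrics could help reduce global disparities in healthcare access, but large-scale clinical deployment is held back by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. RubricsTree is a scalable evaluation framework built on an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, evolved from 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset for each query. In meta-evaluation, RubricsTree substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries, reliably penalizes contextually degraded responses, and, when used as structured instructions, text feedback, or training rewards, yields up to about 66% relative gains on HealthBench for the Gemini, GPT, and Qwen model families.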
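
The routing-plus-Boolean-rubric idea described above can be sketched in a few lines. This is an illustrative guess at the general shape, not RubricsTree's implementation: the `Rubric` fields, the topic-overlap router, the weights, and the keyword checks are all hypothetical, and in a real system each check would be a clinically verified judgment rather than a substring test.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Rubric:
    name: str
    topics: FrozenSet[str]        # hypothetical routing tags
    weight: float                 # hypothetical importance weight
    check: Callable[[str], bool]  # atomic Boolean criterion


def route(rubrics: List[Rubric], query_topics: FrozenSet[str]) -> List[Rubric]:
    """Activate only the rubrics whose tags overlap the query's topics."""
    return [r for r in rubrics if r.topics & query_topics]


def score(rubrics: List[Rubric], query_topics: FrozenSet[str], response: str) -> Optional[float]:
    """Weighted fraction of active rubrics that the response satisfies."""
    active = route(rubrics, query_topics)
    total = sum(r.weight for r in active)
    if total == 0:
        return None  # no rubric applies to this query
    passed = sum(r.weight for r in active if r.check(response))
    return passed / total


# Made-up rubrics with keyword checks standing in for real criteria.
RUBRICS = [
    Rubric("states_sleep_duration", frozenset({"sleep"}), 2.0, lambda r: "hours" in r.lower()),
    Rubric("suggests_clinician", frozenset({"sleep", "cardio"}), 1.0, lambda r: "doctor" in r.lower()),
    Rubric("cites_heart_rate", frozenset({"cardio"}), 1.0, lambda r: "bpm" in r.lower()),
]

print(score(RUBRICS, frozenset({"sleep"}), "You averaged 6.2 hours of sleep this week."))
```

Because each rubric is a Boolean with a recorded weight, the resulting score can be traced back to the individual criteria that passed or failed, which is what makes this style of evaluation auditable.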

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

2606.18147v1 by Yuwei Zhang, Tong Xia, Bianca Emmerich, Yu Yvonne Wu, Dimitris Spathis, Xin Liu, Daniel McDuff, Cecilia Mascolo

Language models are remarkably capable at medical question answering, in some cases surpassing the accuracy of general physicians. However, answering questions about wearable health data remains challenging and understudied, as these ubiquitous sensors produce continuous, high-dimensional, and longitudinal data, which is non-trivial to align with text-centric distributions in LLM pretraining. The diversity of sensor modalities and user intents cannot be effectively handled by a fixed reasoning workflow or a single pretrained foundation model. To address these challenges, we propose WEQA, a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analytical and modeling tools. An LLM controller is employed to synthesize execution plans and dynamically route each query to the appropriate combination of sensor analysis and pretrained models, and perform grounded response auditing with external knowledge. We also curate a benchmark spanning four open wearable datasets comprising analytic and predictive tasks in three different health domains. Experiments show that our framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.

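Summary: Language models are highly capable at medical question answering, in some cases surpassing general physicians in accuracy, but answering questions about wearable health data remains challenging and understudied. Wearable sensors produce continuous, high-dimensional, longitudinal data that is hard to align with the text-centric distributions seen in LLM pretraining, and the diversity of sensor modalities and user intents cannot be handled well by a fixed reasoning workflow or a single pretrained foundation model. WEQA is a query-adaptive agent framework that unifies LLM reasoning with specialized wearable analysis and modeling tools: an LLM controller synthesizes execution plans, routes each query to the appropriate combination of sensor analysis and pretrained models, and audits responses against external knowledge. The authors also curate a benchmark spanning four open wearable datasets with analytic and predictive tasks in three health domains. The framework is 24% more accurate than LLM and agentic baselines, and a blinded study with 12 medical experts and 8 users shows substantial gains in usefulness and clinical soundness.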

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

2606.18129v1 by Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

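Summary: Recent incidents involving LLMs used for mental-health support reveal an evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but not whether interactions help users keep reflecting, coping, and making decisions for themselves. The authors formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure distinct from safety and helpfulness, and introduce COGNITIVE ATROPHY BENCH, built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. The work also introduces the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. All five models show a consistent moderate-to-high level of atrophy-aligned behaviour in single- and multi-turn settings; they generally respond to overt safety cues but adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection.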

Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

2606.18068v1 by Divyansh Srivastava, Shreya Ghosh, Anshul Verma, Rajkumar Buyya

Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing "LLM-as-a-judge" routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.

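Summary: Advances in LLMs and multi-agent systems have driven the rise of Agentic AI and show promise for medical reasoning, but open-ended conversational agents remain prone to two failure modes: premature diagnostic handoff and silent clinical hallucinations that go undetected before reaching the patient. The proposed multi-agent framework replaces "LLM-as-a-judge" routing with deterministic orchestration constraints and adds two safety mechanisms. A neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until every required dimension is collected. An epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to intercept divergent outputs before delivery. With simulated patient agents powered by llama-3.1-70b-instruct on 150 test cases, the full architecture reaches 49.3% diagnostic precision, an absolute improvement of 11.3 percentage points over an unconstrained baseline. OLDCARTS completeness (σ) and semantic entropy (H) show a statistically significant negative correlation (r = -0.181, p < 0.05), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.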
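
The two gates described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's system: semantic entropy is normally computed over clusters of semantically equivalent answers, and here a simple lowercase/strip canonicalization stands in for that clustering step. The `max_entropy` threshold, the gate's return labels, and the field names are hypothetical.

```python
import math
from collections import Counter

OLDCARTS = (
    "onset", "location", "duration", "character",
    "aggravating_alleviating", "radiation", "timing", "severity",
)


def completeness(collected):
    """Fraction of OLDCARTS dimensions with a non-empty answer (sigma in the abstract)."""
    filled = sum(1 for key in OLDCARTS if collected.get(key))
    return filled / len(OLDCARTS)


def semantic_entropy(samples, canon=lambda s: s.strip().lower()):
    """Entropy (in nats) of the distribution over answer clusters.

    `canon` is a stand-in for semantic clustering: samples mapping to the
    same canonical string are treated as one meaning.
    """
    counts = Counter(canon(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def gate(collected, samples, max_entropy=0.7):
    """Deterministic routing: gather history first, then check agreement."""
    if completeness(collected) < 1.0:
        return "ask_more"   # block the diagnostic transition
    if semantic_entropy(samples) > max_entropy:
        return "escalate"   # samples diverge; intercept before delivery
    return "deliver"


history = {key: "answered" for key in OLDCARTS}
print(gate(history, ["Migraine", "migraine", "migraine ", "Tension headache", "tension headache"]))
```

With K=5 samples the entropy ranges from 0 (all samples agree) to ln 5, roughly 1.61 (all samples differ), so any threshold has to be chosen within that range.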

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

2606.18063v1 by Ruman Wang, Hangting Ye

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.

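Summary: Deep learning models for medical image classification perform well at scale, but real clinical scenarios often face severe data scarcity because of annotation costs, privacy constraints, and disease rarity. The problem is acute in pathological scar classification, where distinguishing keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. ScaFE (Scar Feature Engineering) repositions LLMs as knowledge-driven feature engineers rather than end-to-end classifiers: an LLM is prompted with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale, turning high-dimensional images into low-dimensional, clinically interpretable representations. The approach offers data efficiency, privacy preservation (raw images are processed locally and never exposed to external LLMs), and interpretability. Under limited data conditions it consistently outperforms end-to-end deep learning baselines and LLMs used as black-box classifiers.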

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

2606.18037v1 by Ander Alvarez, Santhiya Rajan, Samuel Mugel, Román Orús

Tool-using LLM agents increasingly use the Model Context Protocol (MCP) to answer from heterogeneous evidence sources, including search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics usually test whether an answer is supported by pooled evidence, missing a provenance-sensitive failure mode: a claim may be supported somewhere while being attributed to the wrong source. We call this cross-source conflation. We introduce ProvenanceGuard, a source-aware verifier for MCP-grounded answers. It consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented answer revision and re-verified. We evaluate on 281 medical-domain MCP-agent traces. A 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace; 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846, while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult with semantically close sources. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback. In 50 controlled clinical conflation probes, ProvenanceGuard detects all injected attribution swaps with no retained wrong attribution. These results show that source attribution is an independent axis for factuality verification in MCP-based agents.

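Summary: Tool-using LLM agents increasingly rely on the Model Context Protocol (MCP) to answer from heterogeneous evidence sources such as search, APIs, databases, clinical records, and formulary tools. Standard factuality metrics test whether an answer is supported by pooled evidence and therefore miss cross-source conflation, where a claim is supported somewhere but attributed to the wrong source. ProvenanceGuard is a source-aware verifier that consumes captured MCP traces with stable tool IDs, source IDs, and raw outputs; decomposes answers into atomic claims; routes claims to source-specific evidence; checks support with NLI and a token-alignment proxy; compares stated attribution with the routed source; and returns per-claim verdicts plus an answer-level allow/block decision. Blocked answers can be repaired with retrieval-augmented revision and re-verified. The evaluation uses 281 medical-domain MCP-agent traces; a 266-trace adjudicated subset yields 2,325 LLM-assisted claim labels split by trace, of which 361 held-out labels are human-verified. On the 40-trace held-out split, ProvenanceGuard achieves block F1 0.802 and source accuracy 0.858 over 260 source-eligible claims, outperforming source-blind baselines that do not emit claim-to-source IDs. On a harder multi-source benchmark it reaches block F1 0.846 while source-plus-relation accuracy drops to 0.229, showing that exact source ownership remains difficult when sources are semantically close. Repair-and-reverify resolves all blocked answers in the full trace set, often via conservative fallback, and in 50 controlled clinical conflation probes all injected attribution swaps are detected with no wrong attribution retained.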
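
The difference between pooled support and source-aware support can be made concrete with a small sketch. It is not ProvenanceGuard's implementation: a plain token-overlap ratio stands in for the NLI and token-alignment checks, the `threshold` value is arbitrary, and the source IDs, evidence strings, and claims are made-up examples.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support(claim, evidence):
    """Share of claim tokens found in the evidence (crude stand-in for NLI)."""
    claim_tokens = tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokens(evidence)) / len(claim_tokens)


def verify(claims, sources, threshold=0.8):
    """Return per-claim verdicts and an answer-level allow/block decision.

    claims:  list of (claim_text, stated_source_id)
    sources: dict mapping source_id -> raw evidence text
    """
    verdicts = []
    for text, stated in claims:
        if support(text, sources.get(stated, "")) >= threshold:
            verdicts.append("supported")
        elif any(support(text, ev) >= threshold
                 for sid, ev in sources.items() if sid != stated):
            # Supported somewhere, but not by the source the answer cites.
            verdicts.append("misattributed")
        else:
            verdicts.append("unsupported")
    decision = "allow" if all(v == "supported" for v in verdicts) else "block"
    return verdicts, decision


# Hypothetical trace content for illustration only.
SOURCES = {
    "formulary": "Drug X maximum daily dose is 2000 mg",
    "record": "Patient eGFR was 38 at last visit",
}
CLAIMS = [
    ("maximum daily dose is 2000 mg", "formulary"),
    ("patient eGFR was 38", "formulary"),           # true, but cited to the wrong source
    ("patient is allergic to penicillin", "record"),
]
print(verify(CLAIMS, SOURCES))
```

A source-blind checker that pooled both evidence strings would accept the second claim, since the pooled text does support it; keeping the source ID attached to each claim is what exposes the conflation.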

Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis

2606.17989v1 by Yonghao Chen, Sicheng Yang, Rui Tang, Lei Zhu

Multi-contrast magnetic resonance imaging (MRI) provides complementary information for clinical diagnosis. However, acquiring all MRI sequences is often time-consuming and costly. Recent generative models perform cross-contrast synthesis to address this issue by inferring absent contrasts from the available ones. Nevertheless, synthesizing 3D MRI presents significant challenges. Due to the massive volume sizes, operating directly in the pixel space is computationally prohibitive; therefore, a common approach is to first compress the 3D volumes into a latent space and subsequently train generative models in that space. We observe that existing compression architectures face several critical issues: they under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on optimization objectives that lead to over-smoothed reconstructions. Ultimately, these shortcomings compromise the performance of subsequent generative models. In this work, we propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis. Specifically, we introduce a Latent Harmonization Encoder (LHE) to capture global anatomical dependencies, ensuring coherent volumetric representations. To mitigate semantic degradation during latent compression, we further design a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher, enhancing contrast-aware separability in the latent space. Additionally, we propose an Anatomy-aware Frequency Loss (AFL) to adaptively preserve diagnostically relevant high-frequency structures. Extensive experiments on two public multi-contrast MRI datasets demonstrate consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. Our code is available at https://github.com/script-Yang/RSF.

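Summary: Multi-contrast MRI provides complementary diagnostic information, but acquiring every sequence is time-consuming and costly, so recent generative models synthesize missing contrasts from available ones. For 3D MRI, operating directly in pixel space is computationally prohibitive because of the volume sizes, so volumes are usually compressed into a latent space where the generative model is trained. The authors observe that existing compression architectures under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on objectives that over-smooth reconstructions, all of which hurt downstream generation. Their semantics-first latent modeling framework introduces a Latent Harmonization Encoder (LHE) to capture global anatomical dependencies, a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher to improve contrast-aware separability in the latent space, and an Anatomy-aware Frequency Loss (AFL) that adaptively preserves diagnostically relevant high-frequency structures. Experiments on two public multi-contrast MRI datasets show consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. Code is available at https://github.com/script-Yang/RSF.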

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

2606.17979v1 by Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose SpatioTemporal Adaptive Reward (STAR) Allocation for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.

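Summary: Existing RL post-training methods for text-to-image generation usually turn the final-image reward into a single scalar advantage and apply it with equal strength across the whole generative trajectory. Generation has temporal and spatial structure, however: different denoising steps handle different stages, and the content that determines text alignment often occupies only part of the image, so policy updates struggle to focus on the components that actually affect the reward. SpatioTemporal Adaptive Reward (STAR) Allocation, for RL post-training of text-to-image diffusion and flow models, uses text-image attention inside the generative model, starting from the core content of the prompt, to build spatial allocation maps that vary across denoising steps and rollouts. It allocates the same group-relative advantage to more relevant latent regions with almost no extra computation and applies stronger updates there through a spatially resolved policy objective. With Stable Diffusion 3.5 Medium as the base model, STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, reaching 0.9759 on GenEval, 0.9757 on OCR, and 23.60 on PickScore.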
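
One way to picture spatial allocation of a scalar advantage is shown below. The abstract does not give STAR's formulas, so this is only a guess at the general shape: group-relative advantages are computed as reward z-scores within a rollout group, and each rollout's scalar is spread over latent positions using weights derived from an attention map. The `floor` parameter and the mean-one normalization are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each rollout's reward within its group."""
    mu, sd = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def spatial_allocation(attn, floor=0.5):
    """Turn a non-negative 2D attention map into per-position weights with mean 1.

    Positions with more prompt-relevant attention get weights above 1;
    `floor` keeps a minimum update strength everywhere (hypothetical choice).
    """
    flat = [a for row in attn for a in row]
    m = mean(flat)
    if m == 0:
        return [[1.0 for _ in row] for row in attn]
    return [[floor + (1.0 - floor) * a / m for a in row] for row in attn]


def allocate(advantage, attn, floor=0.5):
    """Per-position advantage: the scalar advantage scaled by the spatial weights."""
    return [[advantage * w for w in row] for row in spatial_allocation(attn, floor)]


advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
toy_attention = [[0, 0], [0, 4]]  # all attention mass on one latent position
print(allocate(advantages[0], toy_attention))
```

Keeping the weights at mean 1 leaves the overall update magnitude comparable to the uniform case while shifting emphasis toward the attended region; in practice the map would be recomputed per denoising step and per rollout.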

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

2606.17961v1 by Andrea Santomauro, Luigi Portinale, Giorgio Leonardi

Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets--a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST)--in which training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.

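Summary: Positional encoding is a fundamental component of Transformer architectures because it injects information about the spatial or sequential arrangement of inputs. Similarity-based positional encoding (simPE) represents positional structure through pairwise relations and was originally designed for medical imaging, where small rotations arise naturally from imaging instruments, patient positioning, or slight acquisition misalignments. Its theoretical behavior under geometric perturbations had not been fully characterized. The paper shows that simPE is generally not rotation-invariant, then proves that under mild Lipschitz assumptions on its elementary components it is stable under rotational perturbations, with explicit perturbation bounds in Frobenius norm. Experiments use four controlled datasets (synthetic Arrow, synthetic Shapes with four geometric shape categories, synthetic Digits, and FashionMNIST), with training and validation images in a fixed canonical orientation and test images rotated by increasing angles. Across all datasets simPE consistently outperforms standard learned positional encoding in accuracy, F1 score, precision, and recall under rotation, particularly at small-to-moderate angles.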

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

2606.17867v1 by Antonio Scardace, Daniele Ravì

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research, which aim to integrate molecular, structural, clinical, and genetic biomarkers to enhance disease characterization, the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.

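Summary: Multimodal approaches that integrate molecular, structural, clinical, and genetic biomarkers are increasingly used in Alzheimer's Disease (AD) research, but the relationships among these modalities remain poorly understood. A systematic analysis of their interaction matters for disease modeling, for identifying redundant assessments, and for reducing patient burden and acquisition costs. The paper analyzes tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 ADNI subjects. It (A) quantifies cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examines associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) statistically decomposes the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identifies a dominant neurodegenerative trajectory that aligns with cognitive decline. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.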

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

2606.17826v1 by Jean Seo, Minkyu Kim, Jeonguk Lee, Jisoo Jung, Wooseok Han, Eunho Yang

Automatic speech recognition (ASR) in non-English clinical settings is challenged by multiscript variability, where the same term may appear in multiple valid orthographic forms. Conventional string-matching evaluation metrics often underestimate ASR performance by treating orthographic variants as errors. To address this issue, we introduce MultiClin, a clinical ASR benchmark designed to evaluate robustness to multiscript variability. Experiments across diverse ASR models show that multiscript-aware evaluation provides a fairer assessment of recognition quality than conventional single-reference evaluation. We further investigate the impact of script consistency during training and find that inconsistent script mappings increase orthographic uncertainty and hinder model convergence, with a balanced 50% mapping ratio producing the highest entropy. In contrast, script unification consistently yields the best ASR performance. Our dataset and code are publicly available at: https://github.com/aitrics-ronaldo/Interspeech_MultiClin.

摘要：自動語音識別（ASR）在非英語臨床環境中面臨多種文字變異的挑戰，其中相同的術語可能以多種有效的正字法形式出現。傳統的字符串匹配評估指標通常低估了ASR的性能，因為它們將正字法變體視為錯誤。為了解決這個問題，我們介紹了MultiClin，一個旨在評估對多文字變異的穩健性的臨床ASR基準。在多種ASR模型中的實驗顯示，考慮多文字的評估提供了比傳統單一參考評估更公正的識別質量評估。我們進一步研究了訓練期間文字一致性的影響，發現不一致的文字映射增加了正字法的不確定性並妨礙了模型的收斂，平衡的50%映射比例產生了最高的熵。相比之下，文字統一始終產生最佳的ASR性能。我們的數據集和代碼可在以下網址公開獲得：https://github.com/aitrics-ronaldo/Interspeech_MultiClin。
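The contrast between single-reference and multiscript-aware scoring can be sketched as follows. This is a minimal illustration that assumes "multiscript-aware" means scoring a hypothesis against every valid orthographic variant and keeping the best match; the MultiClin metric may be defined differently, and the example strings are made up rather than drawn from the dataset.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate against a single reference."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

def multiscript_cer(hypothesis, references):
    """Score against every valid orthographic variant and keep the lowest error."""
    return min(cer(hypothesis, ref) for ref in references)

# Hypothetical example: the same term written in Latin script or in Hangul.
hypothesis = "씨티 촬영"
variants = ["CT 촬영", "씨티 촬영"]
print(cer(hypothesis, variants[0]))           # penalized as two character errors
print(multiscript_cer(hypothesis, variants))  # 0.0, because a valid variant matches
```

Taking the minimum over variants means a transcript is never penalized for choosing one valid script over another, which is the effect the abstract attributes to multiscript-aware evaluation.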

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

2606.17767v1 by Nikola Kovacevic, Bastien Husler, Di Zhuang, Rafael Wampfler, Barbara Solenthaler

Personal health data from wearables are typically presented through dashboards of charts and summary statistics, requiring users to actively interpret patterns and implications. We explore an alternative interaction paradigm: engaging with personal health data through an embodied conversational agent that facilitates objective data reflection in dialogue with the user. We present a system that combines lightweight preprocessing of wearable data with a Unity-based embodied character. Internally, the system follows a dual-agent design in which an Observer agent extracts descriptive statistics and temporal trends, and a Presenter agent communicates these findings through "spoken statistics," intentionally refraining from clinical advice to isolate the impact of the interaction modality. We evaluate this approach through a simulated-self user study (N=5) using a within-subject design. Participants adopted health personas and goals derived from the LifeSnaps dataset to compare traditional dashboard exploration with embodied conversational reflection. Our evaluation focuses on perceived understanding, the specificity of generated actions, and the cognitive shift from passive viewing to active sensemaking. The paper contributes a functional prototype, a design pattern for objective health data narrative generation, and early empirical insights into how embodiment affects the interpretation of personal health metrics.

摘要：個人健康數據通常通過圖表和摘要統計的儀表板呈現，要求用戶主動解釋模式和含義。我們探索了一種替代的互動範式：通過一個具身的對話代理與個人健康數據互動，促進用戶與數據之間的客觀反思。我們提出了一個系統，結合了可穿戴數據的輕量級預處理和基於Unity的具身角色。在內部，該系統遵循雙代理設計，其中觀察者代理提取描述性統計和時間趨勢，而呈現者代理通過“口述統計”傳達這些發現，故意避免臨床建議，以隔離互動模式的影響。我們通過一項模擬自我用戶研究（N=5）使用內部受試者設計來評估這種方法。參與者採用了來自LifeSnaps數據集的健康角色和目標，以比較傳統的儀表板探索與具身的對話反思。我們的評估重點在於感知理解、生成行動的具體性，以及從被動觀看到主動意義建構的認知轉變。本文貢獻了一個功能原型、一種客觀健康數據敘事生成的設計模式，以及對具身性如何影響個人健康指標解釋的早期實證見解。

Vision-language models for chest radiography do not always need the image

2606.17710v1 by Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

摘要：醫學視覺-語言模型報告顯示胸部X光片的準確性很高，這越來越被解讀為它們使用了影像。這種推論是不安全的：一個利用發現-名稱先驗的模型得分與一個閱讀掃描的模型相似，並且沒有標準基準能夠將它們區分開來。我們引入了一個因果審計，對影像進行干預，遮蔽相關區域，遮蔽不相關的區域，並替換為另一位患者的同標籤掃描，並結合三個行為指標來測試正確答案是否依賴於影像。在九個系統中，一個僅使用文本且無影像訪問的模型達到了距離最佳多模態模型僅5.7的準確性點，而一個1190億參數的多模態模型在統計上與一個70億文本模型的基準無法區分。該審計將樣本分為三個忽略影像的模型，一個不穩定的模型，以及五個選擇性使用影像的模型，針對一部分發現；這些類別在第二個數據集、解析度和提示措辭中保持一致。與董事會認證的放射科醫生相比，僅使用文本的模型在基準為零時在統計上與放射科醫生的準確性無法區分，而使用影像的模型則在與放射科醫生可比的比率下進行基準。報告的信心僅在模型使用影像時標記未基準的答案。應該以基準審計，而非準確性，來限制臨床部署。
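The abstract does not spell out its three behavioral metrics, so the sketch below is only one plausible reading of the image-intervention audit. It assumes a correct answer counts as image-dependent when it changes under occlusion of the relevant region and stays the same under the two control interventions. The record fields, the rule, and the toy records are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class AuditRecord:
    label: str                # ground-truth finding
    original: str             # answer on the unmodified image
    relevant_occluded: str    # answer with the finding's region masked
    irrelevant_occluded: str  # answer with an unrelated region masked
    same_label_swap: str      # answer on another patient's scan with the same label

def is_grounded(r):
    """Hypothetical rule: a correct answer is image-dependent only if it breaks
    when the relevant region is hidden and survives both control interventions."""
    return (r.original == r.label
            and r.relevant_occluded != r.original
            and r.irrelevant_occluded == r.original
            and r.same_label_swap == r.original)

def grounding_rate(records):
    """Fraction of correct answers that pass the grounding rule."""
    correct = [r for r in records if r.original == r.label]
    if not correct:
        return 0.0
    return sum(is_grounded(r) for r in correct) / len(correct)

# A model that answers from text priors gives the same answer under every intervention.
prior_only = [AuditRecord("effusion", "effusion", "effusion", "effusion", "effusion")]
# A model that reads the scan loses the finding once its region is masked.
image_using = [AuditRecord("effusion", "effusion", "normal", "effusion", "effusion")]
print(grounding_rate(prior_only), grounding_rate(image_using))
```

Under this rule both toy models are equally accurate on the original image, yet only the second one is grounded, which is the distinction accuracy alone cannot make.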

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

2606.17702v1 by Wan Siti Halimatul Munirah Wan Ahmad, Faris Syahmi Samidi, Mohammad Badal Ahmmed, Vimal Angela Thiviyanathan, Selvam James Thavaraj, Anwar P. P. Abdul Majeed

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SegTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, so improvement is driven entirely by increased pseudo-label quality. Stage 1 uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 µm/pixel). Stage 2 uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 µm/pixel). Stage 3 uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 µm/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.

摘要：對於常規 H&E 染色組織學影像來說，對腫瘤微環境 (TME) 的特徵描述需要同時進行細胞分割、特徵提取和可解釋的臨床報告。我們提出了 SEGTME-UNI2，一個統一的框架來滿足這些需求。其核心是 UNI2-UPERHOVER，一個雙頭分割模型，將 UNI2-H 病理基礎模型（ViT-Giant，預訓練於超過 1 億個來自 10 萬張幻燈片的切片）與兩個平行的 UperNet 解碼器配對：一個用於六類語義分割，另一個用於水平-垂直梯度回歸，以實現基於分水嶺的核實例分離。為了解決大型現實世界數據庫中缺乏像素級標註的問題，UNI2-UPERHOVER 進行了三階段的漸進式偽標籤課程。每個階段訓練一個全新的模型，沒有權重轉移，完全通過提高偽標籤質量來驅動改進：階段 1：使用人類標註的 PanNuke（7,901 張影像，189,744 個細胞核，0.25 微米/像素）。階段 2：使用來自階段 1 模型的熵過濾偽標籤，針對 271,711 個 TCGA-UT 標度 0 補丁（0.5 微米/像素）。階段 3：使用來自階段 2 模型的偽標籤，針對所有 1,608,060 個 TCGA-UT 補丁，涵蓋六個解析度尺度（0.5-1.0 微米/像素）。分割輸出進入一個結構化的 TME 特徵提取管道，計算 20 多個每個補丁的組成、形態學、空間熵和細胞間距度量。這些數據被編碼為 JSON 並傳遞給經過微調的 NVIDIA BioNeMo GPT 模型，以生成臨床可解釋的 TME 敘述。在保留的 PanNuke 和 TCGA-UT 部分上進行的初步驗證顯示了框架的可行性和內部一致性。偽標註的 TCGA-UT 數據集和 UNI2-UPERHOVER 檢查點已公開發布，以支持大規模 TME 輪廓和空間生物學研究。
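Stage 2 of the curriculum above depends on entropy-filtered pseudo-labels. The sketch below shows one common way to do such filtering: average the per-pixel Shannon entropy of the teacher's class probabilities and keep only low-entropy patches. The threshold, the patch-level averaging, and the three-class toy inputs are assumptions for illustration; the paper's actual criterion and its six-class setup are not reproduced here.

```python
import math

def pixel_entropy(probs):
    """Shannon entropy (nats) of one pixel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_patch_entropy(patch_probs):
    """patch_probs: list of per-pixel probability vectors for one patch."""
    return sum(pixel_entropy(p) for p in patch_probs) / len(patch_probs)

def filter_pseudo_labels(patches, max_entropy):
    """Keep argmax labels only for patches the teacher model is confident about."""
    kept = {}
    for patch_id, patch_probs in patches.items():
        if mean_patch_entropy(patch_probs) <= max_entropy:
            kept[patch_id] = [max(range(len(p)), key=p.__getitem__)
                              for p in patch_probs]
    return kept

# Toy teacher outputs for two 2-pixel patches (hypothetical values).
patches = {
    "confident": [[0.97, 0.01, 0.02], [0.02, 0.96, 0.02]],
    "uncertain": [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]],
}
print(filter_pseudo_labels(patches, max_entropy=0.5))  # only the confident patch survives
```

Because each stage trains a fresh model, the quality of what passes this filter is the only channel through which an earlier stage can help a later one.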

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

2606.17474v1 by Jiahui Niu, Huizi Yu, Wenkong Wang, Guangxin Dai, Jingxian He, Xiang Li, Zhiying Liang, Xinxin Lin, Kent CY So, Bryan YP Yan, Yun Kwok Wing, Yanqiu Xing, Xin Ma, Lizhou Fan

Large language models (LLMs) are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. We observed that LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

摘要：大型語言模型（LLMs）越來越被考慮用於臨床諮詢任務，然而大多數醫療評估仍然是靜態的、單回合的或狹隘的結果導向，限制了它們反映現實世界護理的連續性、不確定性和互動性的能力。在此，我們提出了AIPatient Arena，一個基於電子健康紀錄（EHRs）的評估框架，用於評估LLMs在八個臨床能力維度上的臨床效用。該框架將EHR數據整合到患者特定的知識圖譜中，使多回合的醫生-患者互動成為可能。我們在一個由437名患者組成的主要隊列以及兩個分佈外的驗證隊列（119名和67名患者）上應用AIPatient Arena。我們觀察到LLMs在醫療面試提問技能（QS；平均分數，4.43-4.99/5）、倫理和專業行為（ET；4.38-4.93/5）以及臨床解釋的清晰性和透明性（EX；3.80-4.72/5）方面表現良好。在信息整合（II；3.19-4.21/5）和藥物安全性與合理性（MS；3.13-3.78/5）方面表現中等，但在處理模糊患者反應（HR；2.57-3.32/5）、信息覆蓋（IC；2.08-3.02/5）以及診斷準確性和推理（Dx；2.63-3.55/5）方面持續存在弱點。基於過程的評估揭示了重複的互動失敗，包括重複提問、遺漏過去病史以及對不確定性的處理不足。更豐富的對話上下文改善了診斷推理，但在治療計劃方面的增益有限。這些發現表明，僅僅依賴最終答案的準確性不足以評估臨床準備情況，並突顯了評估模型在諮詢過程中如何收集、解釋和傳達信息的重要性。AIPatient Arena提供了一個基於EHR的框架，用於針對工作流程的醫療LLMs預部署評估。

A Machine-Learned Comorbidity Index

2606.17450v1 by Suleman Baloch, Kishlay Jha, Alberto M. Segre, Philip M. Polgreen, Bijaya Adhikari

Traditional comorbidity scores (e.g., Charlson and Elixhauser) are widely used for risk adjustment and patient stratification, but they have two key limitations: (i) they are largely mortality-centric and do not align well with other clinical outcomes, and (ii) their linear, rule-based structure cannot capture nonlinear, outcome-specific risk relationships. We propose a Machine-Learned Comorbidity Index (MLCI) that maps diagnosis codes to a single scalar by maximizing the normalized Hilbert-Schmidt Independence Criterion (nHSIC) between the learned score and multiple clinical outcomes. MLCI captures nonlinear risk-outcome dependence and is supported by a theory that characterizes when a unified, informative admission-level ordering can be achieved across outcomes. Empirical results on multiple benchmark electronic health record (EHR) datasets show that MLCI outperforms strong baselines across multiple evaluation metrics.

摘要：傳統的共病指數（例如，Charlson 和 Elixhauser）廣泛用於風險調整和病人分層，但它們有兩個主要限制：（i）它們主要以死亡率為中心，與其他臨床結果不太一致，並且（ii）它們的線性、基於規則的結構無法捕捉非線性、特定結果的風險關係。我們提出了一種機器學習共病指數（MLCI），通過最大化學習分數與多個臨床結果之間的標準化 Hilbert-Schmidt 獨立性標準（nHSIC），將診斷代碼映射到單一標量。MLCI 捕捉非線性風險-結果依賴性，並有理論支持，該理論描述了何時可以在結果之間實現統一的、信息豐富的入院級別排序。對多個基準電子健康記錄（EHR）數據集的實證結果顯示，MLCI 在多個評估指標上超越了強基準。
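The objective named in the abstract, normalized HSIC between a scalar score and an outcome, can be illustrated with the standard biased empirical estimator. The sketch assumes Gaussian kernels with a fixed bandwidth and the usual normalization by the two self-HSIC terms; the paper's exact kernel choices, its multi-outcome aggregation, and its optimization procedure are not shown. The toy scores are hypothetical.

```python
import math

def gaussian_gram(values, sigma=1.0):
    """Gram matrix of a Gaussian kernel on scalar inputs."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
            for a in values]

def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    kc = center(gaussian_gram(x, sigma))
    lc = center(gaussian_gram(y, sigma))
    # Both centered matrices are symmetric, so the trace of their product
    # equals the sum of their elementwise products.
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2

def nhsic(x, y, sigma=1.0):
    """HSIC normalized by the self-dependence of each variable."""
    denom = math.sqrt(hsic(x, x, sigma) * hsic(y, y, sigma))
    return hsic(x, y, sigma) / denom if denom > 0 else 0.0

# Hypothetical scores and a nonlinearly related outcome.
score = [0.1, 0.4, 0.5, 0.9, 1.3]
outcome = [s ** 2 for s in score]
print(nhsic(score, outcome))
```

Unlike a correlation coefficient, this quantity responds to nonlinear dependence, which is why a kernel criterion suits the nonlinear risk relationships the abstract describes.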

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

2606.17443v1 by Xi Chu, Yupeng Hou

Large language models (LLMs) are becoming a major way for consumers to find products, but we do not yet understand how brands compete in this new channel. We study brand dynamics in LLM recommendations using skincare products -- a category where consumers cannot easily judge quality before buying and must rely on brand reputation -- across three commercial LLMs (GPT-4o-mini, Claude Sonnet, Gemini 3 Flash), with a robustness check on search goods. In three experiments, we find: (1) a Conditional Monopoly where well-known brands get recommended 100% of the time (IAI = 10.0) when all products have the same specifications, but this dominance disappears with less than a +0.1-star rating advantage for a competitor; (2) authority-style marketing language, including fabricated clinical-evidence claims, breaks this monopoly at a Bias Surplus Value equal to +0.17 rating points, with each model responding differently; and (3) a social dilemma in multi-brand GEO competition: when all brands adopt the same optimization strategy, individual payoff falls from +0.802 to +0.007 in our payoff proxy, and non-participating brands receive zero recommendations in our tests. Our results suggest that generative engine optimization (GEO) should be studied not only as a security risk, but also as an emerging marketing practice that shapes market competition.

摘要：大型語言模型（LLMs）正成為消費者尋找產品的主要方式，但我們尚未了解品牌在這一新渠道中的競爭方式。我們研究了在LLM推薦中護膚產品的品牌動態——這是一個消費者在購買前無法輕易判斷質量，必須依賴品牌聲譽的類別——跨越三個商業LLM（GPT-4o-mini、Claude Sonnet、Gemini 3 Flash），並對搜索商品進行了穩健性檢查。在三個實驗中，我們發現：（1）當所有產品具有相同規格時，知名品牌在推薦中獲得100%的機會（IAI = 10.0），但這種主導地位在競爭對手的評分優勢低於+0.1顆星時消失；（2）權威風格的營銷語言，包括虛構的臨床證據聲明，在偏見超額價值等於+0.17評分點時打破了這一壟斷，每個模型的反應不同；（3）在多品牌GEO競爭中的社會困境：當所有品牌採用相同的優化策略時，我們的收益代理從+0.802下降到+0.007，且在我們的測試中未參與的品牌獲得零推薦。我們的結果表明，生成引擎優化（GEO）不僅應被研究為安全風險，還應作為一種新興的營銷實踐來塑造市場競爭。

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

2606.17437v1 by Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.

摘要：自動化的標準超聲心動圖視圖分類對於高效的臨床工作流程至關重要，但面臨三個主要挑戰。首先，公開可用的數據集稀缺，且在規模和視圖覆蓋方面有限。其次，一些現代視頻級架構在超聲心動圖視圖分類中的性能仍未被充分探索。第三，一些視圖類別表現出高度相似的空間外觀，使得單幀特徵不足以進行區分，而異質幀質量則使得穩健的時間信息融合變得複雜。為了解決這些挑戰，我們發布了九個視圖的超聲心動圖視頻（EV9V）數據集，包含5,138個視頻、910,579幀和9個標準視圖，據我們所知，這是目前最大的公開可用超聲心動圖視頻數據集。使用EV9V，我們系統性地基準測試了代表性的視頻分類架構，包括卷積神經網絡（CNNs）、遞歸神經網絡（RNNs）和Transformer。此外，我們提出了一個時空融合模型（STFM），這是一個高效的雙流CNN-LSTM（長短期記憶）框架，能夠共同捕捉空間解剖結構和時間心臟動力學。所提出的框架利用不確定性感知學習，在訓練期間優先抽樣代表性視頻片段，並在推斷期間進行基於證據的融合，從而提高對超聲心動圖視頻中幀質量變化的穩健性。大量實驗表明，我們的方法在各種視頻分類模型中達到了競爭性能，驗證了不確定性感知時空學習在超聲心動圖視圖分類中的有效性。代碼可在 https://github.com/bgx666/stfm 獲得。
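The abstract mentions evidence-based fusion at inference without giving the rule. As a hedged sketch of the general idea, the code below uses the common Dirichlet formulation from evidential learning, where non-negative per-class evidence yields both probabilities and an uncertainty mass, and fuses segments by summing their evidence. Whether STFM fuses in this way is an assumption, and the evidence values are invented.

```python
def dirichlet_summary(evidence):
    """Turn non-negative per-class evidence into expected probabilities
    and an uncertainty mass (1.0 when there is no evidence at all)."""
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(evidence) / strength
    return probs, uncertainty

def fuse_segments(segment_evidence):
    """Sum evidence over segments, so low-evidence segments barely move the result."""
    num_classes = len(segment_evidence[0])
    total = [sum(seg[k] for seg in segment_evidence) for k in range(num_classes)]
    return dirichlet_summary(total)

# Three video segments, three candidate views; the middle segment is low quality.
segments = [[9.0, 0.5, 0.5], [0.2, 0.3, 0.1], [6.0, 1.0, 0.0]]
probs, uncertainty = fuse_segments(segments)
print(probs, uncertainty)
```

The low-quality segment contributes almost no evidence, so it neither overrides the confident segments nor needs to be discarded by a hard threshold.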

Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization

2606.17420v1 by Jianwei Zhang, Xinyu Nie, Jiaxin Yue, Yonggang Shi

Tau PET imaging is central to tracking Alzheimer's disease progression, but systematic differences between scanners, protocols, and radiotracers across sites introduce nonbiological variability that inflates biomarker variance, reduces sensitivity to disease effects, and can bias downstream clinical assessments. Harmonization methods aim to remove these site-induced shifts while preserving biologically meaningful signal, yet existing approaches struggle when source and target cohorts differ in subgroup composition, risking conflation of site effects with biological variation such as tau-positivity status. We propose the Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM) model to address this problem. Rather than routing data through a Gaussian noise prior as in diffusion-based methods, FKRSBM learns a direct stochastic transport process between source and target distributions via entropy-regularized optimal transport. To enforce biologically consistent transport, FKRSBM incorporates a subgroup-aware endpoint proposal derived from a Feynman Kac reweighting of the reference bridge measure, implemented entirely through stratified importance sampling at the data level and requiring no changes to the underlying bridge-matching solver or network architecture. For surface-based neuroimaging, FKRSBM employs a spherical convolutional backbone operating on cortical meshes to perform vertex-level harmonization. We evaluate the method on tau PET SUVR maps, harmonizing PI-2620 data from the HABS-HD cohort into the AV-1451 domain of ADNI. Compared against ComBat, CycleGAN, a diffusion-based method (DF), and unregularized Diffusion Schrödinger Bridge Matching (DSBM), FKRSBM achieves superior distributional alignment, reduced tau-positivity sign mismatch, stronger APOE subgroup alignment, and improved downstream disease classification performance.

摘要：Tau PET 成像在追蹤阿茲海默症進展中至關重要，但不同掃描儀、協議和放射性示蹤劑之間的系統性差異會引入非生物變異，這會增加生物標記的變異性，降低對疾病影響的敏感性，並可能偏倚後續的臨床評估。調和方法旨在消除這些由地點引起的變化，同時保留生物學上有意義的信號，然而現有方法在來源和目標群體的亞組組成不同時面臨挑戰，可能會將地點效應與生物變異（如 tau 陽性狀態）混淆。我們提出了費曼-卡茨重加權薛丁格橋匹配（FKRSBM）模型來解決這個問題。FKRSBM 通過熵正則化的最優傳輸學習來源和目標分佈之間的直接隨機傳輸過程，而不是像擴散基方法那樣通過高斯噪聲先驗來路由數據。為了強化生物學一致的傳輸，FKRSBM 結合了一個基於亞組的端點提議，該提議源自於對參考橋度量的費曼-卡茨重加權，完全通過在數據層級的分層重要性抽樣來實現，並不需要對基礎的橋匹配求解器或網絡架構進行任何更改。對於基於表面的神經影像學，FKRSBM 使用一個在皮質網格上運作的球形卷積骨幹來執行頂點級的調和。我們在 tau PET SUVR 地圖上評估該方法，將 HABS-HD 群體的 PI-2620 數據調和到 ADNI 的 AV-1451 領域。與 ComBat、CycleGAN、擴散基方法（DF）和未正則化的擴散薛丁格橋匹配（DSBM）相比，FKRSBM 實現了更優的分佈對齊、降低的 tau 陽性標誌不匹配、更強的 APOE 亞組對齊，以及改善的下游疾病分類性能。
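The abstract says the subgroup-aware proposal is realized purely through stratified importance sampling at the data level. One simple reading, sketched below, reweights source samples so that subgroup frequencies match the target cohort before endpoints are drawn. The weight formula, the function names, and the toy cohorts are assumptions for illustration, not the paper's Feynman Kac derivation.

```python
import random
from collections import Counter

def subgroup_weights(source_groups, target_groups):
    """Importance weight per subgroup: target proportion / source proportion."""
    src = Counter(source_groups)
    tgt = Counter(target_groups)
    n_src, n_tgt = len(source_groups), len(target_groups)
    return {g: (tgt[g] / n_tgt) / (src[g] / n_src) for g in src}

def sample_source_indices(source_groups, target_groups, k, seed=0):
    """Draw k source indices so that subgroup frequencies follow the target cohort."""
    weights = subgroup_weights(source_groups, target_groups)
    rng = random.Random(seed)
    per_item = [weights[g] for g in source_groups]
    return rng.choices(range(len(source_groups)), weights=per_item, k=k)

# Hypothetical cohorts: tau-positive subjects are rarer in the source than in the target.
source = ["tau+"] * 2 + ["tau-"] * 8
target = ["tau+"] * 5 + ["tau-"] * 5
print(subgroup_weights(source, target))
batch = sample_source_indices(source, target, k=1000)
print(sum(source[i] == "tau+" for i in batch) / len(batch))  # close to the target share
```

Since only the sampling of training pairs changes, the bridge-matching solver and network can stay untouched, which matches the claim in the abstract.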

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

2606.17405v1 by Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang

Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.

摘要：臨床決策支持人工智慧系統 (CDSASs) 必須在遵循嚴格安全限制的同時，實時適應不斷變化的患者狀況。我們提出了一個在線自適應框架，該框架整合了治療效果 (TE) 估算以量化臨床效益、患者數位雙胞胎 (DT) 以模擬治療軌跡，以及強化學習 (RL) 用於序列決策。該人工智慧系統最初在歷史醫療記錄上進行訓練，並在持續學習循環中運作。為了確保安全，一個基於規則的模組監測生命體徵並阻止禁忌治療。內部模型存在強烈不一致的案例會被標記以供臨床醫生審查，這在我們的實驗中是通過預訓練的結果模型來模擬的。我們使用合成臨床模擬器和來自癌症基因組圖譜 (TCGA) 的真實卵巢癌數據集來驗證我們的框架。在模擬和臨床環境中，我們的方法在推薦治療方面展示了優越的有效性和穩定性，相較於標準計算基準。此外，該人工智慧系統保持低延遲，並且在我們的實驗驗證中只有少數案例需要專家諮詢，顯示其作為一個安全的、由臨床醫生監督的個性化醫療工具的潛力，並能通過實際使用不斷改進。
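The two safety mechanisms described above, a rule-based block on contraindicated treatments and a review flag when internal models disagree, can be sketched in a few lines. Every rule, threshold, treatment name, and score below is hypothetical; the abstract does not specify how disagreement is measured.

```python
def safe_recommendation(ranked_treatments, vitals, contraindications,
                        model_scores, disagreement_threshold=0.2):
    """Return (treatment, needs_review). All rules and thresholds are hypothetical."""
    # Drop every treatment for which at least one contraindication rule fires.
    allowed = [t for t in ranked_treatments
               if not any(rule(vitals) for rule in contraindications.get(t, []))]
    if not allowed:
        return None, True                  # nothing safe to suggest: escalate
    choice = allowed[0]                    # best-ranked treatment that passed the rules
    scores = model_scores[choice]          # one benefit estimate per internal model
    needs_review = max(scores) - min(scores) > disagreement_threshold
    return choice, needs_review

# Hypothetical rule: drug_a is blocked when systolic blood pressure is below 90.
contraindications = {"drug_a": [lambda v: v["systolic_bp"] < 90]}
model_scores = {"drug_a": [0.8, 0.7], "drug_b": [0.6, 0.2]}
result = safe_recommendation(["drug_a", "drug_b"], {"systolic_bp": 85},
                             contraindications, model_scores)
print(result)  # drug_a is blocked; drug_b is chosen but flagged for review
```

Keeping the rule check outside the learned policy means a continually updated model cannot learn its way around a contraindication.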

2606.17340v1 by Hongchao Shu, Roger D. Soberanis-Mukul, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.

摘要：在單眼內窺鏡中，基於視覺的精確導航因為深度線索有限、組織紋理弱、非剛性變形以及跨領域的外觀變化而變得困難，這些因素都使得姿勢估計、深度預測和影像與解剖對齊變得複雜。儘管最近的視覺基礎模型顯示出潛力，但它們學習到的表示往往在幾何一致性方面仍然不足，這妨礙了穩定的特徵對應，並限制了它們在下游導航任務中的可靠性。我們提出了一個統一框架，用於學習幾何一致且對領域穩健的影像表示，專為單眼內窺鏡設計。該框架結合了一個提供準確幾何監督的合成數據管道，與層次感知幾何-語義適應，這是一種結構化的替代方案，用於標準LoRA，選擇性地在Transformer層次中插入低秩適配器，並將其與層級訓練目標結合，以促進中間特徵中的幾何對應和深層特徵中的語義一致性。對公共和專有數據集的實驗顯示幾何和語義表示質量有所改善，從而在下游導航任務中，包括姿勢估計和單眼深度估計，表現更佳。學習到的表示在臨床支氣管鏡檢查中顯示出有利的合成到實際轉移，並為在有限監督下適應鼻竇內窺鏡和結腸鏡檢查提供了有用的初始化。該框架在模型大小和訓練數據方面也顯示出良好的擴展性。這些結果支持層次感知、幾何引導的適應作為內窺鏡表示學習的一種實用方法。

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

2606.17339v1 by Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

摘要：語音提供了一個獨特的資訊窗口，通過同時參與神經系統、運動系統、呼吸系統和聲音系統來了解健康。當前的臨床語音AI方法主要通過孤立的特定條件研究進展，使得結果難以比較，且難以評估其普遍性。我們介紹了SpeechDx，這是一個大規模的臨床語音AI基準，涵蓋12個數據集和27個任務，涉及多種健康狀況。為了能夠在共享的臨床機制中進行評估，SpeechDx根據它們干擾的語音產生階段來結構任務：概念化、形成和表達。該基準通過包括有限標記數據的任務來測試普遍性，並在多個數據集中評估相同的健康狀況，以區分臨床上有意義的模式與數據集的假象。我們系統地評估了12種最先進的音頻編碼器在所有任務中的表現，以及在零樣本跨條件轉移下的表現。結果顯示，大規模語音模型代表了最強的整體基準，特定領域模型僅在密切匹配的任務上提高性能，而目前沒有任何表示能夠在臨床語音領域中可靠地普遍化。SpeechDx建立了一個共享的評估框架，以追踪朝向通用臨床語音表示的進展。

Symbolic Informalization: Fluent, Productive, Multilingual

2606.16893v1 by Aarne Ranta

Symbolic informalization enables a reliable conversion of formal mathematics to natural language. It has the potential to make machine-checked content human-readable without loss of precision. In a traditional proof system usage, symbolic informalization generalizes the limited mechanisms of syntactic sugar into the ordinary language of mathematics. In a setting where proofs are constructed by artificial intelligence and autoformalization, symbolic informalization can explain what precisely has been constructed. This paper outlines the project Informath, which aims to show how symbolic informalization can produce fluent text with a reasonable development effort and address multiple formal and natural languages. Informath is based on an interlingual architecture, where Dedukti works as a hub between different proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages.

摘要：符號非正式化使得正式數學能夠可靠地轉換為自然語言。它有潛力使機器檢查的內容在人類可讀的情況下不失精確性。在傳統的證明系統使用中，符號非正式化將有限的語法糖機制概括為數學的普通語言。在由人工智慧和自動形式化構建證明的環境中，符號非正式化可以解釋究竟構建了什麼。本文概述了項目Informath，旨在展示符號非正式化如何在合理的開發努力下產生流暢的文本，並處理多種正式和自然語言。Informath基於一種跨語言架構，其中Dedukti作為不同證明系統（Agda、Lean、Rocq）之間的樞紐，而語法框架（GF）則負責不同自然語言中的語言正確性和變化。

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

2606.16890v1 by Sanjay Basu

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.

摘要：聚合準確性基準隱藏了大型語言模型在電子健康紀錄（EHR）問題回答上失敗的系統性結構：需要更多推理步驟的問題產生不成比例的錯誤。基於對Transformer組合性限制的理論結果，我們引入了一個預先指定的跳數分類法——回答來自EHR的臨床問題所需的不同推理步驟數量——作為模型失敗的原則性預測指標。我們對313個臨床醫生生成的MedAlign EHR問題-回答對進行了標註，涵蓋了四個跳數級別，並在模型內部消融（claude-sonnet-4-6，零樣本與擴展思考）和跨架構複製（gpt-4o和gpt-5.4-2026-03-05，零樣本）中評估了301個問題。所有三個模型，涵蓋了兩個提供者和兩個OpenAI世代（GPT-4和GPT-5），都顯示出隨著跳數的增加準確性單調下降：Claude Sonnet零樣本從30.6%（跳=1）下降到17.6%（跳=4）（Cochran-Armitage z=-2.30，p=0.011；每跳的OR 0.72，95% CI [0.56,0.92]，p=0.008）；GPT-4o複製了這一點（37.8%下降至14.7%；OR 0.58 [0.45,0.75]，p<0.001）；而gpt-5.4-2026-03-05確認了這一點（37.8%下降至23.5%；OR 0.80 [0.66,0.98]，p=0.027）。一項預先指定的上下文充分性審計顯示，高跳數問題並未因EHR截斷而受到差異性劣勢（在跳數2-4時可回答率為93-95%，而在跳=1時為79%），因此下降反映了組合推理的困難。擴展思考並未顯著平坦化三種推理條件下的準確性-深度曲線，且思考標記的使用隨著跳數的增加而增長（r=0.31，p<0.0001），與預測的O(k)計算需求一致。因此，跳數成為一個理論驅動的、跨架構的預測指標，用於大型語言模型在EHR問題回答上的錯誤，對臨床AI的部署風險分層具有直接的影響。
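The trend statistics quoted above can be reproduced in form, though not in value, with the standard Cochran-Armitage test. The per-hop counts below are invented for illustration because the abstract reports only percentages. The final line checks the reported figures against each other: a standard normal lower-tail probability at z = -2.30 is about 0.011, which suggests the reported p-value is one-sided.

```python
import math

def cochran_armitage_z(successes, totals, scores):
    """Cochran-Armitage trend statistic for binary outcomes over ordered groups."""
    n = sum(totals)
    p = sum(successes) / n                 # pooled success rate
    t = sum(s * (r - m * p) for s, r, m in zip(scores, successes, totals))
    var = p * (1 - p) * (sum(m * s * s for s, m in zip(scores, totals))
                         - sum(m * s for s, m in zip(scores, totals)) ** 2 / n)
    return t / math.sqrt(var)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

# Hypothetical counts of correct answers at hop levels 1 to 4 (not the paper's data).
correct = [30, 25, 20, 12]
totals = [100, 100, 100, 100]
z = cochran_armitage_z(correct, totals, scores=[1, 2, 3, 4])
print(round(z, 2), normal_cdf(z))          # negative z: accuracy declines with hop count
print(round(normal_cdf(-2.30), 3))         # 0.011
```

The test treats hop count as an ordered score, so it has more power against a monotone decline than a generic comparison of four proportions.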

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

2606.16868v1 by Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, Klaus Maier-Hein

While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.

摘要：雖然聯邦學習 (FL) 使得在不集中敏感數據的情況下進行協作醫學影像分割成為可能，但實際部署常常因為跨站點標籤的不完善而變得複雜，例如輪廓不一致、缺失或多餘的結構以及混淆的標籤。聯邦噪聲標籤學習 (FNLL) 旨在減輕這些影響，但在實踐中仍然使用不足，因為現有的證據主要基於合成噪聲、簡化的設置和有限的實際噪聲評估。我們通過引入一個基準套件來解決這一空白，該套件結合了多樣的實際噪聲數據集、與部署相關的客戶端噪聲場景以及針對標籤噪聲的評估，以支持系統性的 FNLL 評估和知情的方法選擇。該套件結合了來自多個來源的策劃實際噪聲醫學影像分割數據集，以及包括各種客戶端噪聲場景和針對噪聲的評估的綜合聯邦分割框架。所呈現的套件為醫學影像分割中的 FNLL 評估提供了現實且具辨別力的基礎，並為公平基準測試、數據集特定的標籤噪聲特徵化以及在現實聯邦設置下的未來方法開發建立了可重用的基礎。代碼可在 https://github.com/MIC-DKFZ/FedSegNoiseBench 獲得。

GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

2606.16813v1 by Rahul Suresh Babu, Rohit Shukla

Tool-augmented LLM agents rely on runtime filtering to decide which tools should be visible at each step. Causal Minimal Tool Filtering (CMTF) reduces tool-choice confusion by exposing only the next causally necessary tool frontier, but it assumes that the user request has already been mapped to a symbolic goal state. In practice, requests such as "handle my appointment" or "take care of this email" may correspond to multiple possible goals. This creates wrong-goal execution, where an agent follows a valid causal tool path for an unintended objective. We introduce GIST-CMTF, a goal-state inference layer that predicts candidate symbolic goals over the same state-transition vocabulary used by CMTF, estimates ambiguity, and either applies CMTF or exposes clarification as a causal action that produces missing goal or state variables. We evaluate GIST-CMTF across seven model backends, six filtering methods, and 120 controlled tool-use tasks. GIST-CMTF achieves 97.0% task success, compared with 80.1% for top-goal CMTF and 82.9% for semantic-goal CMTF. It reduces wrong-goal execution from 19.4% under top-goal CMTF to 2.5%, while preserving the one-tool exposure of causal filtering and using substantially fewer tokens than all-tools exposure. These results suggest that reliable tool-augmented agents should validate goal state, not only tool relevance, before exposing external actions.

AgentFairBench: Do LLM Agents Discriminate When They Act?

2606.16723v1 by Triveni Morla, Rohith Reddy Bellibaltu, Manpreet Singh, Manmeet Singh Kapoor

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand-Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by ~2.4x through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.
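
Two of the metrics named above can be illustrated with a small standard-library sketch. The definitions here are assumptions, since the abstract does not spell them out: flip rate is taken as the fraction of matched sets whose action differs across name variants, and MASD as the mean absolute pairwise score difference within a set, averaged over sets. The data values are made up, and this is not the paper's NumPy harness.

```python
from itertools import combinations
from statistics import mean


def flip_rate(matched_sets):
    """Fraction of matched sets whose action differs across name variants."""
    flipped = [len({action for action, _ in variants}) > 1 for variants in matched_sets]
    return sum(flipped) / len(flipped)


def masd(matched_sets):
    """Mean absolute score difference over variant pairs, averaged over sets."""
    per_set = [
        mean(abs(a - b) for (_, a), (_, b) in combinations(variants, 2))
        for variants in matched_sets
    ]
    return mean(per_set)


# Each matched set holds one profile under different name-coded variants,
# recorded as (action, score). Illustrative values only.
sets = [
    [("advance", 7.0), ("advance", 7.0), ("advance", 6.0)],
    [("advance", 8.0), ("reject", 5.0), ("advance", 8.0)],
]
print(flip_rate(sets))  # 0.5: only the second set changes action
print(masd(sets))
```

Because both statistics are computed per matched set, any difference they pick up comes from the name signal or from sampling noise, which is why the abstract compares against a noise floor of matching arity.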

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

2606.16721v1 by Ke Liu, Mengxuan Li, Yanyi Bao, Tianyun Zhang, Chong Chu, Jiajun Bu, Haishuai Wang

Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception-dynamics-planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at https://github.com/1999kevin/awesome_medical_world_models.

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

2606.16568v1 by Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide Hold or Shift and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

Unified Multimodal Model for Brain MRI Imputation and Understanding

2606.16484v1 by Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in real-world clinical settings. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

2606.17115v1 by Jingyu Hu, Giuseppe Tripodi, Reed Naidoo, Sarah F. McGough, Tapabrata Chakraborti

Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.
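
The conformal prediction step mentioned above can be sketched with split conformal calibration. This is a generic textbook variant, assumed here for illustration; the abstract does not state which nonconformity score or calibration scheme the authors use, and the function names are hypothetical.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out data.

    Nonconformity score: 1 minus the probability assigned to the true class.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return 1.0  # too few calibration points: include every class
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All class indices whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

A point prediction can be wrong while the true class still sits inside `prediction_set(...)`, which is the recoverability property the abstract reports.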

Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning

2606.16434v1 by Junting Wen, Dan Li, Qihao Quan, Xiwen Wang, Hang Yang, Zhaohong Meng, Zigui Jiang, Changlin Yang, Tianle Liu, Diego Muñoz-Carpintero, Jian Lou

Accurate state of health (SOH) estimation is a critical diagnostic service for lithium-ion battery management. However, reliance on labor-intensive manual feature engineering and opaque black-box models hinders scalable industrial deployment. To address this, we introduce TC-SOH: a modular, plug-and-play service architecture for autonomous, end-to-end SOH prediction. TC-SOH employs a temporal-contrastive mechanism and a cross-window prediction pretext task to extract degradation-relevant representations directly from raw operational data. To improve transparency, we connect model efficacy with representation diagnostics: visualization, sensitivity analysis, redundancy analysis, bidirectional probing, future-SOH probing, and temporal shuffling show that learned features overlap with selected expert descriptors while retaining additional SOH-relevant variation, and that ordered temporal context improves subsequent-SOH prediction. Across four public datasets, TC-SOH outperforms the considered physics-informed and data-driven baselines, reducing MAPE by a factor of 1.91 and RMSE by a factor of 2.13.
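
For reference, the two error metrics quoted above have standard definitions, sketched below. The `reduction_factor` helper is an assumption about how "reduced by k times" should be read (baseline error divided by model error); the abstract does not define it.

```python
import math


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Targets must be nonzero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def reduction_factor(baseline_error, model_error):
    """Assumed reading of 'reduced by k times': baseline / model = k."""
    return baseline_error / model_error
```

Under this reading, a factor of 1.91 means the baseline MAPE is 1.91 times the model's MAPE, i.e. the model's error is roughly 52% of the baseline's.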

Input-Dependent Fisher Information for Local Sensitivity Analysis of Medical Image Classifiers

2606.16362v1 by Sourya Sengupta, Mark A. Anastasio

Deep neural networks have achieved strong performance in medical image classification, but often behave as black boxes. Commonly used post-hoc interpretation methods tend to provide heuristic visualizations whose relationship to the classifier's predictive distribution is indirect. This work introduces a local sensitivity analysis framework based on the input-dependent Fisher Information Matrix (iFIM) of a trained classifier. The iFIM characterizes how the classifier's predictive distribution changes under infinitesimal perturbations of the input image. By using a Gram-matrix formulation, the nonzero eigenspectrum of the iFIM can be recovered without explicitly forming the full image-dimensional Fisher matrix. The leading iFIM eigenspace is then used to project an input image into a high local-sensitivity component and its orthogonal component. These components provide a model-intrinsic description of local predictive sensitivity, rather than a conventional pixel-wise attribution heatmap or a causal segmentation of task-relevant anatomy. The framework is evaluated on controlled and clinical medical image classification tasks using multiple classifier architectures. Perturbation-based experiments show that high-sensitivity iFIM components are more strongly coupled to changes in predictive confidence and classification performance than lower-sensitivity complementary components. The results support the iFIM framework as a principled tool for analyzing local decision sensitivity and for complementing existing attribution-based interpretability methods in medical imaging.
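
The Gram-matrix step can be written out as follows. This is a sketch under assumed notation, not the paper's own derivation: p_y(x) is the predicted probability of class y, C is the number of classes, d is the number of input pixels, and V_k collects the top-k eigenvectors.

```latex
% Input-dependent Fisher matrix of a categorical predictive distribution (d x d):
F(x) = \sum_{y=1}^{C} p_y(x)\, \nabla_x \log p_y(x)\, \nabla_x \log p_y(x)^{\top} = A^{\top} A,
\qquad A \in \mathbb{R}^{C \times d}, \quad A_{y,:} = \sqrt{p_y(x)}\; \nabla_x \log p_y(x)^{\top}.

% Gram matrix (C x C) with the same nonzero eigenvalues:
G = A A^{\top}, \qquad
G u = \lambda u,\ \lambda > 0,\ \|u\| = 1
\;\Longrightarrow\;
F v = \lambda v \ \text{ with } \ v = \frac{A^{\top} u}{\sqrt{\lambda}},\ \|v\| = 1.

% Projection of an image onto the leading eigenspace and its orthogonal complement:
x_{\parallel} = V_k V_k^{\top} x, \qquad x_{\perp} = x - x_{\parallel}.
```

Since C is typically far smaller than d, the eigendecomposition of G is cheap, and each eigenvector of F is recovered with one product against A.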

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

2606.16337v2 by Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.

Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

2606.16234v1 by Tengfei Ma, Ruiqi Wu, Chenran Zhang, Ye Geng, Na Su, Xiangyuan Duanmu, Tao Zhou, Yi Zhou, Wen Fan

Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes, the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at https://github.com/while-plus/OCT-guide-FFA-Syn.

Embedded Arena: Iterative Optimization via Hardware Feedback

2606.16190v1 by Zhihan Zhang, Alexander Le Metzger, Jiuyang Lyu, Chun-Cheng Chang, Jiayi Shao, Yujia Liu, Emmanuel Azuh Mensah, Edward Wang, Kurtis Heimerl, Gregory D. Abowd, Shwetak Patel, Natasha Jaques, Vikram Iyer

Embedded devices from wildlife monitoring stations to clinical wearables require local AI inference due to latency, communication, or privacy constraints. Optimizing models for heterogeneous microcontrollers (MCUs) requires simultaneously satisfying hard physical constraints on memory, power, and temperature while preserving accuracy, a multidimensional optimization that is today performed manually by experts. We ask whether an LLM agent can autonomously navigate this complex, multi-turn pipeline guided by real hardware feedback, and introduce a hardware-in-the-loop agent arena in which the agent iteratively refines both model and firmware (compiling, flashing, and measuring on real hardware) to enable closed-loop optimization. Frontier models, including Claude Opus 4.7 and Gemini 3.1 Pro, fail entirely without hardware feedback (0% deployment success), whereas our hardware-in-the-loop formulation achieves the first successful deployment within three iterations and can surpass human expert results within seven. This agentic co-optimization achieves 250x compression for vision models with <3.3% accuracy loss and 400x for audio with <6% Feature Error Rate loss, enabling battery-free operation on a commercial MCU via solar harvesting. We demonstrate practical impact in two real-world systems: an elk-detection camera trap (96.7% accuracy) and a phonetic-transcription wearable (8.44% FER) for child development research.

A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

2606.16153v1 by Pengyu Zhu, Xiaojing Zhang, Kunbo Zhang, Chunyan Zhang, Zhenyu Wang

Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main.

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

2606.16149v1 by Minh-Ha Nguyen, Erica Gray, Chih-Ting Yang, Rizwan Hamid, Lingyao Li, Siyuan Ma, Thomas A. Cassini, Cathy Shyr

Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy developed through human-AI collaboration and augmenting it with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides a reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare diseases (prevalence below 1 in 1,000,000), with ultra-rare shares of approximately 45% and 52.8%, respectively. On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed on cases never seen during development, on a private cohort of real-world rare disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

2606.16074v1 by Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with a token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighting to address token imbalance and class skew. Across multiple model sizes, PVminerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLMs trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: https://github.com/Data-Mining-Lab-Yale/PVminerLLM2
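
Inverse-frequency reweighting, mentioned above as a remedy for class skew, is commonly implemented along the lines below. The abstract does not give the authors' exact formula, so the N / (K * n_c) weighting used here is an assumption (it is the usual "balanced" heuristic), and the function name is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by N / (K * n_c); a balanced dataset gives all 1.0."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


# Illustrative skewed label set: 8 examples of "A", 2 of "B".
weights = inverse_frequency_weights(["A"] * 8 + ["B"] * 2)
print(weights)  # {'A': 0.625, 'B': 2.5}
```

Multiplying each example's loss by its class weight makes the rare class contribute as much in total as the frequent one.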

DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

2606.15931v1 by Zijian Carl Ma, Sean J. Wang, Sijbren Kramer, Li Erran Li

Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre-ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool-calling, retrieval-augmented, or agentic deep-research, can convert such text into verifiable drug-discovery leads at scale. We close this gap with DeepRoot, a multi-agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning (often conflated) are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers 10 of 21 held-out compound-disease treatment pairs at R@20 (47.6% vs 4.8% for a raw corpus LLM and ~2.4% random) and dominates an LLM-as-judge audit for reasoning quality over baseline LLMs and LLMs with direct tool-call access to the same APIs DeepRoot itself queries. Tool-using LLMs hallucinate evidence on 87% of claims, versus 7-10% for DeepRoot. Graph-only inference hallucinates 0% but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.
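
The R@20 figure above is a recall-at-k over held-out pairs, which can be sketched as follows. The function and the toy rankings are illustrative assumptions, not the paper's evaluation code; only the 10-of-21 count comes from the abstract.

```python
def recall_at_k(ranked_predictions, targets, k):
    """Fraction of queries whose held-out target appears in the top-k ranking."""
    hits = sum(
        1 for ranking, target in zip(ranked_predictions, targets)
        if target in ranking[:k]
    )
    return hits / len(targets)


# Toy check with made-up rankings: one of two targets is in the top 2.
print(recall_at_k([["a", "b", "c"], ["x", "y", "z"]], ["b", "z"], k=2))  # 0.5

# Arithmetic behind the reported figure: 10 recovered of 21 held-out pairs.
print(round(100 * 10 / 21, 1))  # 47.6
```

With 21 held-out pairs, each hit moves the score by about 4.8 percentage points, so the reported 4.8% baseline is consistent with a single recovered pair.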

Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot

2606.15810v1 by Yuyang Dai, Yushun Dong

Large language models deployed as commercial APIs are vulnerable to model extraction attacks, while existing defenses either act too late or degrade utility for legitimate users. We propose Knowledge Trap, a defense that redirects extraction attacks toward low-transferability knowledge through a Honeypot Knowledge Graph (HKG) and breadcrumb-guided exploration. Instead of blocking queries or perturbing outputs, Knowledge Trap consumes the attacker's limited query budget on knowledge with negligible downstream utility while preserving benign-user performance. Experiments in medical and financial domains show that Knowledge Trap reduces surrogate Agreement by 6.2% on average without degrading legitimate-user accuracy, outperforming existing defenses that impose measurable user impact. These results suggest that defending knowledge-space traversal is a practical direction for mitigating LLM extraction attacks.

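Summary: LLMs deployed as commercial APIs are vulnerable to model extraction attacks, and existing defenses either act too late or degrade utility for legitimate users. Knowledge Trap redirects extraction toward low-transferability knowledge using a Honeypot Knowledge Graph (HKG) and breadcrumb-guided exploration, so that the attacker's limited query budget is spent on knowledge with negligible downstream utility while benign-user performance is preserved. In medical and financial domains it reduces surrogate Agreement by 6.2% on average without degrading legitimate-user accuracy, outperforming defenses that impose measurable user impact.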

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

2606.15735v2 by Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every single QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges, including that LLMs struggle more with evidence grounding than content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.

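Summary: Discharge summaries capture the context of a hospital stay, and clinicians reviewing them must synthesize information across several summaries while verifying the evidence behind each answer; existing benchmarks, which test exam-style knowledge or single-turn QA with limited evidence-grounding evaluation, do not reflect this setting. EHRNote-ChatQA is the first benchmark for evidence-grounded multi-turn clinical QA over a patient's multiple discharge summaries. Built from de-identified MIMIC-IV notes, it contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories, with every sample reviewed and revised by 11 medical experts. Benchmarking 22 open- and closed-source LLMs shows that evidence grounding is harder than content answering, that errors compound across turns, and that single-turn clinical QA performance does not reliably transfer. The dataset will be released through PhysioNet credentialed access.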

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

2606.15733v1 by Zhenyu Yu

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

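Summary: Instruction-tuned language models can answer the same causal-reasoning question differently once its English variable names are replaced by type-preserving placeholders, even though the structural causal model and gold answer are unchanged. Vernier asks whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content, using a paired-view weight update as an instrument and inspecting the mechanism that remains after the gap closes. In the working regimes the evidence favours representational misalignment: a variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The realigning update is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task: CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.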

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

2606.15709v1 by Mohammed Fasha, Nahel Al-Maayta, Bilal Sowan, Mohammad Athamneh, Husam Barham

Jordan faces severe water scarcity, with 50% of the water it produces lost to leakage, theft, and metering issues, collectively known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1 L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst -- confirming alignment with water distribution zone (DZ) monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.

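Summary: Jordan faces severe water scarcity, with 50% of produced water lost as non-revenue water (NRW) through leakage, theft, and metering issues, and traditional reactive approaches have not sustained NRW reduction. The paper proposes a framework that integrates EPANET hydraulic modeling, digital twin technology, SCADA systems, and LLM-based AI agents for continuous network monitoring and adaptive decision-making, using retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof of concept using EPYT with an offline LLM (llama3.1:8b via Ollama) on a 1,164-junction Amman district network demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. A simulated 30.1 L/s leak produces measurable flow redistribution in 15 pipes and flags a 15-junction cluster that localises the burst. Phased implementation accommodates Jordan's intermittent supply patterns and limited automation.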

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

2606.15566v1 by Eyup Engin Kucuk, Tarik Kelestemur, Ömer Dağlar Tanrikulu

Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $α$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $α$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.

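Summary: Qualitative coding is central to social science, but expert annotation is hard to scale, and LLMs need careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. The study detects whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism), combining a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search that yields a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt reaches a held-out combined reliability of 0.76 (harmonic mean of ICC = 0.79 and $α$ = 0.74), with all diagnostics satisfied. On 6,858 quotes from 210 articles, the three LLMs reach substantial quote-level agreement (ICC = 0.80; $α$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus is predominantly weakly realist, yet only 1.4% of articles use a single band while 59.5% span four or more, and low-level perception/motor articles score 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$). The framework is presented as an expert-led case study intended to generalize to similarly demanding tasks, not to all qualitative analysis.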
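
The combined reliability scores reported for the stance-detection study are described as the harmonic mean of ICC and $α$. The short check below recomputes both reported values from the figures in the abstract; the helper name `harmonic_mean` is only an illustrative choice.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


# Values quoted in the abstract.
held_out = harmonic_mean(0.79, 0.74)   # held-out ICC and alpha
deployed = harmonic_mean(0.80, 0.76)   # quote-level ICC and alpha on the full corpus

print(round(held_out, 2), round(deployed, 2))  # 0.76 0.78
```

The harmonic mean is pulled toward the lower of the two inputs, so a high combined score requires both agreement statistics to be high.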

Hierarchical Modeling of ICD Codes in EHR Foundation Models

2606.15447v1 by Megha Thukral, Dong Gyun Kang, Rudra Pratap Singh, Shruthi Kashinath Hiremath, Katrin Hänsel, Thomas Plötz

Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system. In this work, we study ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes embedding similarity structure. We conduct experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen encoder probing. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach. More broadly, we focus on hierarchy-aware EHR representation learning and show that the benefits of encoding hierarchy are generalizable across modeling settings and hierarchy levels.

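Summary: EHR foundation models usually treat ICD diagnosis codes as flat tokens, ignoring the clinically meaningful hierarchy of disease families, subcategories, and fine-grained diagnostic detail. The paper studies the ICD-10-CM hierarchy as a general inductive bias through two complementary mechanisms: augmenting diagnosis sequences in a BERT-style transformer with tokens for different hierarchy levels, and injecting hierarchy into graph-based code representations via hierarchy-aware edges combined with diagnosis co-occurrence structure. Experiments use MIMIC-IV for pretraining and in-domain evaluation and eICU for cross-dataset transfer via frozen-encoder probing. Explicitly encoding the hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while the most useful hierarchy level depends on the task and the modeling approach.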
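
The first mechanism in the ICD hierarchy paper, augmenting a diagnosis sequence with tokens for coarser hierarchy levels, can be sketched as follows. This is only an illustration under stated assumptions: it derives ancestors by taking character prefixes of an ICD-10-CM code (three-character category, then one level per additional character), and the function names `hierarchy_tokens` and `augment_sequence` are hypothetical. The paper's actual tokenization and choice of levels may differ.

```python
def hierarchy_tokens(code):
    """Return the prefix ancestors of an ICD-10-CM code, coarsest first.

    Assumption: each ancestor is a character prefix of the code, starting
    from the three-character category.
    """
    flat = code.replace(".", "").upper()
    if len(flat) < 3:
        raise ValueError(f"not a valid ICD-10-CM code: {code!r}")
    prefixes = [flat[:n] for n in range(3, len(flat) + 1)]
    # Restore the dot that follows the three-character category.
    return [p if len(p) == 3 else f"{p[:3]}.{p[3:]}" for p in prefixes]


def augment_sequence(codes):
    """Expand each diagnosis code into its ancestor tokens followed by the code itself."""
    tokens = []
    for code in codes:
        tokens.extend(hierarchy_tokens(code))
    return tokens


print(hierarchy_tokens("E11.65"))           # ['E11', 'E11.6', 'E11.65']
print(augment_sequence(["I10", "E11.65"]))  # ['I10', 'E11', 'E11.6', 'E11.65']
```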

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

2606.15436v1 by Mayur Sanap, Prasanna Desikan, Edgar Lobaton

Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and disease probability estimation in settings where physical measurements are unavailable. We introduce a multi-model, multi-target cough regression benchmark evaluating five FMs (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, M2D+Resp) across six targets on three datasets under subject-disjoint protocols, comparing linear, MLP-small, and full MLP regression heads. MLP-small beats the mean-predictor baseline on all tasks and linear probing in 23 of 30 model × task cases, with full MLP overfitting on small clinical data but recovering on larger sets, revealing a dataset-size × head-capacity trade-off. HeAR leads within-dataset age regression on Coswara (9.12 yr MAE); its CIDRZ result is excluded from headline claims owing to possible HeAR-CIDRZ pretraining overlap. OPERA-GT is favored over OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance, extending a generative-pretraining advantage from breath to cough. HeAR and M2D+Resp reach near-full performance at N = 50 samples while OPERA models require N = 400. Cross-dataset transfer is strongly asymmetric: large, diverse data generalises to small clinical populations (CoughVID to CIDRZ: -0.17 yr) but not vice versa (CIDRZ to Coswara: +2.43 yr, +26.6%).

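Summary: Respiratory acoustic foundation models (FMs) excel at cough classification, but their ability to predict continuous health quantities such as age, BMI, and disease probability from cough audio is largely unexplored. The benchmark evaluates five FMs (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, M2D+Resp) on six targets across three datasets under subject-disjoint protocols, comparing linear, MLP-small, and full MLP regression heads. MLP-small beats the mean-predictor baseline on all tasks and linear probing in 23 of 30 model × task cases, whereas the full MLP overfits small clinical data but recovers on larger sets. HeAR leads within-dataset age regression on Coswara (9.12 yr MAE); its CIDRZ result is excluded from headline claims because of possible HeAR-CIDRZ pretraining overlap. OPERA-GT is favored over OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance. HeAR and M2D+Resp reach near-full performance at N = 50 samples, while OPERA models need N = 400. Cross-dataset transfer is strongly asymmetric: large, diverse data generalises to small clinical populations (CoughVID to CIDRZ: -0.17 yr) but not the reverse (CIDRZ to Coswara: +2.43 yr, +26.6%).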

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

2606.15419v1 by Zaifu Zhan, Shuang Zhou, Rui Zhang

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.

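Summary: Objective: To improve the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA).
Method: Multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers that rate each other's reasoning for factual correctness and logical soundness; the highest-rated chain produces the final answer. Five LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) were tested on HeadQA, MedQA-USMLE, and PubMedQA against single-model chain-of-thought reasoning and chain-of-thought-based majority voting.
Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination reached an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority-voting ensembles (up to 0.789). The method scaled with more participating models, and peer assessments reliably separated high- from low-quality reasoning chains.
Conclusion: Letting LLMs act as both solvers and evaluators, and emphasizing reasoning quality rather than answer agreement alone, improves accuracy, interpretability, and robustness in MedQA.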
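
The selection step described in the Method paragraph above (each agent's reasoning chain is rated by the other agents, and the highest-rated chain supplies the final answer) can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the function `select_by_peer_review`, the averaging of peer scores, and the hard-coded toy scores standing in for LLM reviewer calls are all assumptions.

```python
from statistics import mean


def select_by_peer_review(candidates, score_fn):
    """Pick the answer whose reasoning chain gets the highest mean peer score.

    candidates: dict mapping agent name -> (reasoning, answer)
    score_fn:   callable (reviewer, author, reasoning) -> numeric rating;
                in a real system this would be an LLM call (assumption).
    """
    ratings = {}
    for author, (reasoning, _answer) in candidates.items():
        # An agent never reviews its own chain.
        peer_scores = [
            score_fn(reviewer, author, reasoning)
            for reviewer in candidates
            if reviewer != author
        ]
        ratings[author] = mean(peer_scores)
    best = max(ratings, key=ratings.get)
    return candidates[best][1], ratings


# Toy stand-in for LLM reviewers (hypothetical scores on a 1-5 scale).
toy_scores = {
    ("agent_b", "agent_a"): 4, ("agent_c", "agent_a"): 5,
    ("agent_a", "agent_b"): 2, ("agent_c", "agent_b"): 3,
    ("agent_a", "agent_c"): 4, ("agent_b", "agent_c"): 3,
}
candidates = {
    "agent_a": ("chain of thought A", "B"),
    "agent_b": ("chain of thought B", "C"),
    "agent_c": ("chain of thought C", "B"),
}

answer, ratings = select_by_peer_review(
    candidates, lambda reviewer, author, _reasoning: toy_scores[(reviewer, author)]
)
print(answer, ratings)
```

Unlike majority voting, this procedure can select an answer held by a minority of agents if its reasoning chain is rated best by the reviewers.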

APEX: Adaptive Principle EXtraction - A Three-Layer Self-Evolution Framework for Production AI Agents

2606.15363v1 by Ya-Chuan Chen, Tien-Jen Lai, Hsiang-Wei Hu

Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves 14--21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension -- the prompt harness -- leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.

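Summary: Self-improving AI agents modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves a 14--21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness, but it optimises only the prompt harness. APEX (Adaptive Principle EXtraction) is a three-layer co-evolution framework that simultaneously evolves (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. It is implemented on Joe [13], a production-grade agent built on NVIDIA Nemotron that manages a 15-node compute fleet, using 114 real task traces collected over 18 days. In a single evolutionary run APEX reaches an APEX Health Score of 0.570 (+90% vs. baseline 0.300), distills 6 novel reusable principles, and selects a research-first workflow topology scoring 0.900 (+20%), at a cost of 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.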

CAP: Towards PPG Universal Representation Learning with Patient-level Supervision

2606.15284v1 by Chenyang He, Xinyi Shao, Shun Huang, Bosong Huang, Daoqiang Zhang, Ming Jing, Cheng Ding

Photoplethysmography (PPG) plays a central role in wearable health monitoring and clinical decision support. Yet existing approaches to universal PPG representation learning largely focus on signal-level objectives and often overlook patient-level health context, which limits generalization to complex clinical tasks and heterogeneous cohorts. To address this gap, we construct a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive, patient-level electronic health records (EHR). Building on this resource, we propose Clinical Anchored Pretraining for PPG (CAP). During pretraining, CAP performs cross-modal contrastive alignment that anchors PPG representations to patient-level clinical semantics, guiding the encoder beyond waveform fitting toward modeling consistency in a patient's overall physiological state. During downstream adaptation, the pretrained PPG encoder provides clinically grounded representations that strengthen inductive bias and improve robustness and transferability. Experiments demonstrate that CAP consistently outperforms strong baselines on four diverse downstream tasks. CAP achieves a particularly large gain on respiratory rate prediction (up to +87.6% relative improvement over the state-of-the-art baseline) and delivers an average relative +26.7% across all tasks. We further enhance the interpretability of our approach through comprehensive analyses, including ablations and multiple complementary visualizations of the learned representations. The code for our experiments is available at: https://github.com/gody123gody/CAP .

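Summary: Photoplethysmography (PPG) is central to wearable health monitoring and clinical decision support, but existing universal PPG representation learning focuses on signal-level objectives and overlooks patient-level health context. The authors build a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive patient-level electronic health records (EHR), and propose Clinical Anchored Pretraining for PPG (CAP), which uses cross-modal contrastive alignment to anchor PPG representations to patient-level clinical semantics. CAP consistently outperforms strong baselines on four downstream tasks, with up to +87.6% relative improvement on respiratory rate prediction and an average relative gain of +26.7% across all tasks. Ablations and complementary visualizations of the learned representations support interpretability. Code is available at: https://github.com/gody123gody/CAP .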
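
The CAP abstract describes cross-modal contrastive alignment between PPG and patient-level EHR representations but does not state the exact loss. The sketch below shows one common choice, a symmetric InfoNCE (CLIP-style) objective, in plain Python. The function `info_nce`, the temperature value, and the toy two-dimensional embeddings are assumptions made for illustration only.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _diag_cross_entropy(rows):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(rows)


def info_nce(ppg_emb, ehr_emb, temperature=0.1):
    """Symmetric InfoNCE loss; row i of both inputs belongs to the same patient."""
    p = [_normalize(v) for v in ppg_emb]
    e = [_normalize(v) for v in ehr_emb]
    logits = [[_dot(pi, ej) / temperature for ej in e] for pi in p]
    transposed = [list(col) for col in zip(*logits)]
    # Average the PPG-to-EHR and EHR-to-PPG directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))


# Toy embeddings (hypothetical): matched pairs give a much lower loss than swapped pairs.
ppg = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(ppg, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(ppg, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < mismatched)  # True
```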

RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning

2606.15278v1 by Jinhan Liu, Mahsa Shoaran

Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, challenging robust representation learning from EEG/sEEG for clinical diagnosis. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core, RECTOR-SA is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning that evolves region structures from static anatomical definitions to adaptive functional regions. The self-supervision is driven by Masked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state-of-the-art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and cross-montage generalization underscores its potential for large-scale pre-training on heterogeneous EEG/sEEG, providing interpretable insights at both region and channel levels.

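Summary: Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, which makes robust representation learning from EEG/sEEG difficult. RECTOR (Masked Region-Channel-Temporal Modeling) is an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. Its core, RECTOR-SA, is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning, which evolves region structures from static anatomical definitions to adaptive functional regions. Self-supervision comes from Masked Topology and Representation Learning, which jointly optimizes Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. RECTOR sets a new state of the art in EEG emotion recognition and sEEG task-engagement classification, and its robustness to missing channels and cross-montage generalization suggests potential for large-scale pre-training on heterogeneous EEG/sEEG, with interpretable insights at both region and channel levels.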

Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs

2606.15250v1 by Zhisen Hu, Antti Kemppainen, David Johnson, Egor Panfilov, Huy Hoang Nguyen, Timothy Cootes, Claudia Lindner, Aleksei Tiulpin

Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture allows for rapid extendability to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outline of the femur and tibia. We evaluated it on both an internal test dataset of 50 patients and a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.

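Summary: Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty, yet manual measurement is time-consuming and recent machine learning approaches depend on a fixed set of anatomical landmarks that may need re-annotation when clinical definitions change. The proposed workflow uses Implicit Neural Shape Functions (INSF) to encode the anatomy into a compact latent space and regresses clinical alignment measurements directly from the latent codes, so new tasks can be added without altering the backbone representation. The method was trained on an internal dataset of 566 knee radiographs annotated with femur and tibia outlines, and evaluated on an internal test set of 50 patients and an external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and to manual agreement.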

Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

2606.15176v1 by Weihao Gao

Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.

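Summary: Ultrasound is the most widely adopted medical imaging modality because of its low cost and portability, but AI deployment depends on GPU-accelerated models, so the cost of "intelligence" can exceed that of the imaging device itself. The paper adapts and evaluates UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on the mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet, approaches the 105M-parameter TransUNet on average, and shows superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI).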

Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

2606.15055v1 by Xinze Zhang

Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1 point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094 -- a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.

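Summary: Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making, but models trained on a few well-photographed metropolises systematically misjudge underrepresented districts. HVSP-LL is a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier; the lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. On a panoramic streetscape benchmark covering twelve cities across four continents and seven perceptual dimensions, it attains 0.834 Spearman correlation on the held-out city sequence (an absolute 6.1-point improvement over the strongest continual baseline) and shrinks the inter-city perception gap to 0.094, a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Equity-aware rehearsal moves mean backward transfer from -0.038 (without retention) to +0.013.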
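
The two relative reductions quoted for the inter-city perception gap can be recomputed from the gap values given in the HVSP-LL abstract. A minimal check, with `relative_reduction` as an illustrative helper name:

```python
def relative_reduction(baseline, ours):
    """Relative decrease of `ours` with respect to `baseline`."""
    return (baseline - ours) / baseline


gap_hvsp = 0.094
vs_continual = relative_reduction(0.151, gap_hvsp)       # strongest continual baseline
vs_regularisation = relative_reduction(0.218, gap_hvsp)  # regularisation baseline

print(f"{vs_continual:.0%}")       # 38%
print(f"{vs_regularisation:.0%}")  # 57%
```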

2606.15038v1 by Zhemin Zhang, Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions. CT and EHR modalities are encoded independently using domain-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. We evaluate two clinically distinct TTE tasks, pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, on large-scale multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). Fusion consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE mortality prediction. For major adverse cardiovascular events (MACE), cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention achieved the best external performance. We therefore introduce a generalizable foundation model-based cross-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction. Our results establish task-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment.

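Summary: Accurate time-to-event (TTE) prediction from multimodal clinical data is challenging because of modality imbalance and distribution shift. The framework encodes CT imaging and longitudinal EHR data independently with domain-specific foundation models and aligns them in a shared latent space through four fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. Two tasks, pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, are evaluated on multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). Fusion improves the concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. Contrastive fusion, particularly with CLMBR representations, gives the most consistent and statistically robust improvements, especially for PE mortality; for MACE, cross-attention (one-hot) achieves the best internal performance and image-guided co-attention the best external performance.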

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

2606.15029v1 by Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo

LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and reduces annotation needs by 32.5%. We provide a cost model and highlight a medical case study where our method saves $1,041.67 compared to random selection for expert annotation. Further, we shift our task from reliability estimation to reliability classification of whether a given judge is above a deployment threshold, outperforming random selection with Metric Match. All project code is publicly available, and we additionally provide an installable package for ease of use.

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

2606.14697v1 by Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at https://github.com/alibaba-damo-academy/ClinHallu.

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

2606.14608v1 by Farica Zhuang, Zixuan Wen, Christos Davatzikos, Li Shen

Survival prediction plays a central role for healthcare providers and clinical researchers. Accurate risk stratification enables early intervention and improved patient management. Most existing deep survival models learn one common feature representation for all patients, which may hide important differences between patient subgroups. In contrast, a Mixture-of-Experts (MoE) framework allows different parts of the model to focus on different patient patterns, leading to more individualized representations. Therefore, in this work, we propose a mixture-of-experts enhanced adaptive deep clustering survival framework (AdaCSM) for modeling such heterogeneous survival patterns. We introduce a routing-based expert mechanism that enables conditional specialization within a parametric survival modeling framework. The proposed architecture allocates patients to specialized risk predictors dynamically while preserving the patient survival and subtype clustering objectives. We compare our method with state-of-the-art survival and deep clustering models on multiple real-world longitudinal clinical cohorts spanning diverse disease domains. The proposed method demonstrates improved predictive performance and leads to interpretable results in survival analysis.

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

2606.14604v1 by Pavlos Nicolaou, Kleanthis Malialis, Artemis Kontou, Panayiotis Kolios

Wearable devices and smartphones generate rich behavioural time series that can support proactive health interventions, yet systematic comparisons of modern forecasting architectures for these data are lacking. In particular, it remains unclear how models generalise across populations, how different architectures respond to participant-level fine-tuning and how forecasting accuracy degrades across multi-day horizons. We benchmark six deep learning architectures, two zero-shot Foundation Models (FM) and statistical baselines on three public datasets encompassing over 800 participants, reporting per-feature metrics for step counts, screen time and sleep duration across 1-8 day horizons. We further conduct a per-feature personalisation study across all six architectures and assess FM transferability across dataset sizes and temporal granularities. Our key findings are: (i) no single architecture dominates, PatchTST leads among trained models while the three runners-up (TCN, MLP, Transformer) show no meaningful performance difference; (ii) the FM TimesFM matches or exceeds trained models zero-shot, especially in low-data regimes and (iii) participant-level fine-tuning reduces per-feature RMSE by 16-60%, with sleep benefiting most and step counts least. These results provide practical guidance on architecture selection, FM applicability and personalisation strategies for mobile health forecasting. To the best of our knowledge, this is the first study to jointly evaluate modern deep learning, FMs and personalisation for multi-horizon behavioural forecasting from wearables.

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

2606.14581v1 by Guanyu Liu, Weiyi Kong, Zeyu Wang, Boer Zhang, Baiqing Li, Peiyu Zhang, Tianyu Shi

Granting LLMs direct control over costly, irreversible scientific experiments leads to unsafe exploration and unstable performance, but discarding LLM creativity entirely sacrifices significant optimization potential. We introduce CARE (Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation), an auditable controller for high-throughput experimentation (HTE) optimization that keeps a non-LLM incumbent optimizer as the default action path while using LLMs to revise challenger ranking policies. Before each outcome is revealed, a public-evidence intervention gate compares the challenger with the incumbent. It authorizes the challenger's selection only when the evidence available before selection supports the change, with the decision recorded in the audit log. CARE outperforms all other evaluated methods on Minerva/Olympus and ChemLex benchmarks, with final-best improving from 80.0 to 88.5 on Minerva/Olympus and from 83.9 to 92.1 on ChemLex, relative to the public incumbent. Our experiments indicate that LLM self-evolution is more reliable when it expands the proposal space under an auditable controller, rather than directly choosing experiments.

Securing the Future of IoMT in the Post-Quantum Era: An Edge-Native Federated Learning Approach

2606.14515v1 by Taym Alshoghri, Deemah H. Tashman, Mohammad Reza Gerami, Soumaya Cherkaoui

Internet of Medical Things (IoMT) devices operate under strict resource constraints while handling highly sensitive health data, making security and privacy critical concerns. Federated learning (FL) further complicates this landscape, as model updates exchanged during training may unintentionally expose private medical information. Emerging quantum computing capabilities threaten the long-term viability of conventional lightweight cryptographic mechanisms, motivating the integration of Post-Quantum Cryptography (PQC) into IoMT systems. This article discusses key enabling technologies for quantum-resilient IoMT, including post-quantum key establishment, lightweight encryption, and edge-native orchestration. We propose a scalable Kubernetes-based framework that integrates PQC into FL-enabled IoMT environments and validate it on a Raspberry Pi testbed. Results demonstrate that distributed cryptographic processing significantly reduces latency compared to sequential designs while maintaining feasible resource overhead. The primary contribution of this work lies in the design and validation of a secure orchestration and communication framework for FL-enabled IoMT systems. We conclude by outlining future directions toward energy-aware architectures, intelligent security optimization, and resilient next-generation Intelligent Internet of Medical Things (IIoMT) ecosystems.

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

2606.14157v1 by Paula Joy B. Martinez

Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country's largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $λ^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.
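
The Sinkhorn forward pass mentioned above can be sketched for a tiny origin-destination problem. This is the standard entropic optimal transport iteration, not the paper's model: the cost matrix here is a plain distance table chosen for illustration, and `eps` and `n_iter` are assumed settings. The inverse problem (fitting the cost to observed flows) is not shown.

```python
import math


def sinkhorn(cost, row_mass, col_mass, eps=1.0, n_iter=200):
    """Entropic OT plan P = diag(u) K diag(v) with K = exp(-cost / eps).

    cost     -- n x m list of lists (e.g. perceived travel cost per pair)
    row_mass -- origin totals, summing to the same value as col_mass
    col_mass -- destination totals
    """
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Illustrative distances in km between two origins and two destinations.
distance_km = [[1.0, 4.0], [3.0, 2.0]]
plan = sinkhorn(distance_km, row_mass=[0.5, 0.5], col_mass=[0.4, 0.6])
```

Each entry of `plan` is the modelled share of flow from an origin to a destination. Every step is a smooth function of the cost, which is what allows a cost model to be trained through the iteration.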

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

2606.14031v1 by Guanting Luo, Noriki Nishida, Yuji Matsumoto, Yuki Arase

Identifying the conditions under which a drug has a therapeutic effect on a target disease is crucial for clinical decision-making support. However, most existing biomedical information extraction methods have focused only on identifying relations between drugs and diseases, while largely overlooking the context-specific conditions under which such relations apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug-disease relations from biomedical research literature. We create the first dataset of manually annotated triples of drugs, diseases, and applicability conditions on biomedical paper abstracts, covering 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings. The source code and dataset of this paper can be obtained from: https://github.com/guantingluo98/Drug-ACE

Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

2606.13839v1 by Louis Chen, Torbjörn E. M. Nordling

Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque--a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer's bi-level routing attention with top-$k$ selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-$k$ routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage ($0.83$ vs. $0.57$ for vanilla rollout) and faithfulness ($F=0.92$) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e. trustworthy rPPG XAI.
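
A skin coverage score of the kind described above can be sketched as the share of attribution mass that lands on skin pixels. The exact definition used in the paper is not given in the abstract, so taking absolute attribution values and using a binary skin mask are assumptions, and `skin_coverage` is a hypothetical name.

```python
def skin_coverage(attribution, skin_mask):
    """Fraction of total absolute attribution that falls on skin pixels.

    attribution -- 2D list of attribution scores for one frame
    skin_mask   -- 2D list of 0/1 flags with the same shape (1 = skin)
    """
    total = 0.0
    on_skin = 0.0
    for attr_row, mask_row in zip(attribution, skin_mask):
        for score, is_skin in zip(attr_row, mask_row):
            weight = abs(score)  # assumption: sign of attribution is ignored
            total += weight
            if is_skin:
                on_skin += weight
    if total == 0.0:
        raise ValueError("attribution map has no mass")
    return on_skin / total
```

Under this reading, the reported medians of 0.83 and 0.57 would mean that 83% and 57% of the attribution mass sits on skin regions, respectively.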

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

2606.13572v1 by Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

2606.13556v2 by Aruna Dey, Suraj Biswas

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point $\hat{G} = \mu + \sum_i \beta_i g_i$, where $\beta_i$ are GWAS-derived effect sizes and $g_i$ are risk-allele counts. Each incoming physiological measurement $P$ produces a non-constitutional deviation $\delta = P - \hat{G}$ that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to $\hat{G}_t = w(t)\,\hat{G}_{\mathrm{genomic}} + [1 - w(t)]\,\bar{P}_t$, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
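
The three quantities in this abstract (genomic set point, deviation, and decayed set point) can be worked through in a short sketch. The abstract does not specify the form of the weight w(t), so the exponential decay with time constant `tau` below is an assumption, as are the function names.

```python
import math


def genomic_set_point(mu, betas, genotypes):
    """Set point: population mean plus the sum of effect size times allele count."""
    return mu + sum(b * g for b, g in zip(betas, genotypes))


def deviation(measurement, set_point):
    """Non-constitutional deviation: measurement minus the set point."""
    return measurement - set_point


def blended_set_point(g_hat_genomic, p_bar_t, t_days, tau=30.0):
    """Weighted mix of the genomic prior and the empirical baseline.

    Assumption: w(t) = exp(-t / tau), so the genomic prior dominates
    at t = 0 and the empirical mean dominates as data accrue.
    """
    w = math.exp(-t_days / tau)
    return w * g_hat_genomic + (1.0 - w) * p_bar_t


# The HRV example from the abstract: one reading, two different priors.
hrv_ms = 55.0
suppressed = deviation(hrv_ms, 80.0)  # negative: below this person's set point
enhanced = deviation(hrv_ms, 30.0)    # positive: above this person's set point
```

The sign flip between `suppressed` and `enhanced` is the reversal the abstract refers to: the same 55 ms reading is interpreted against a personal baseline and not a population one.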

MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment

2606.13258v2 by Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen

Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

2606.13211v1 by Omar Alshahrani, Muzammil Behzad

AI systems are being deployed across medical imaging faster than their failure modes are understood. Currently, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences, for example, for biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities; (2) how do medical-specialized foundation models hallucinate less than general-purpose ones; and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight. We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

2606.13188v1 by Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.
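
The Chamfer distance used as the mesh-quality metric above can be sketched on two small point sets. Definitions differ between papers (squared or unsquared distances, sum or mean of the two directions); this example assumes the symmetric mean of unsquared nearest-neighbour distances, and the abstract does not say which variant was used.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two 3D point sets.

    Assumption: average of the two directed mean nearest-neighbour
    distances, using unsquared Euclidean distance.
    """
    def mean_nearest(src, dst):
        # For each source point, distance to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a))


# Points sampled from a predicted and a reference surface (illustrative values, in mm).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
score = chamfer_distance(predicted, reference)
```

This brute-force version is quadratic in the number of points. For dense surface samples a spatial index such as a k-d tree would be the usual choice.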

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

2606.13176v1 by Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton

Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

2606.13135v1 by Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
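
The two-stage cascade with a tunable triage threshold can be sketched as below. The class labels come from the abstract, but the function names, the probability inputs, and the default threshold of 0.5 are assumptions for illustration. The models that would produce these probabilities are not shown.

```python
def cascade_predict(p_malignant, subtype_probs, threshold=0.5):
    """Stage 1: binary triage on p_malignant. Stage 2: pick MEL/SCC/BCC.

    p_malignant   -- stage-1 probability that the lesion is malignant
    subtype_probs -- stage-2 probabilities, e.g. {"MEL": .., "SCC": .., "BCC": ..}
    threshold     -- triage cut-off; lowering it raises sensitivity
    """
    if p_malignant < threshold:
        return "benign"
    return max(subtype_probs, key=subtype_probs.get)


def triage_sensitivity(p_malignant_list, is_malignant_list, threshold):
    """Share of truly malignant lesions that pass the stage-1 triage."""
    passed = [p >= threshold for p, m in zip(p_malignant_list, is_malignant_list) if m]
    return sum(passed) / len(passed)


# Illustrative stage-1 scores for four malignant lesions.
scores = [0.9, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 1]
```

With these scores, moving the threshold from 0.5 to 0.3 lifts triage sensitivity from 0.5 to 0.75. A single-stage argmax classifier has no equivalent control.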

摘要：目的。比較深度學習架構和皮膚腫瘤的皮膚鏡圖像分類方案，並評估其從開放國際數據集轉移到俄羅斯臨床實踐獨立數據集的泛化能力。方法。比較了四種架構（ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S）在三種方案中的表現：二元（惡性/良性）、單階段四類（良性、MEL、SCC、BCC），以及兩階段級聯（二元篩選，然後三類區分MEL/SCC/BCC）。所有模型均使用ImageNet預訓練權重和單一增強協議，基於聚合的開放ISIC檔案數據進行訓練，並在內部保留樣本和兩個臨床數據集（Melanoscope AI移動系統；塞琴諾夫大學）上進行評估。結果。在內部，二元階段達到ROC-AUC 0.952-0.966；在塞琴諾夫大學下降至0.797-0.893，靈敏度降至0.53-0.67，ECE從0.02上升至0.27-0.39，並低估了惡性腫瘤，量化了排名和校準中的泛化差距。配對測試確認了一個臨床數據上的架構間結果：ViT-B/16在二元階段的不足（p<0.05）；在區分階段，沒有架構具有明顯優勢。級聯方法在大多數架構中提高了宏觀F1分數，超過單階段四類分類，但對於ViT-B/16的提升顯著，因為它恢復了被分配到主導良性類別的惡性病變。在ISIC MILK10k上，直接的11類分類產生的平均類別靈敏度為0.525。結論。可調的篩選閾值提供了在標準單階段（argmax）分類中無法實現的靈敏度控制，並更好地重現臨床鑑別診斷邏輯。持續的泛化差距要求在部署之前進行外部臨床驗證和重新校準。
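
The contrast the abstract draws between single-stage argmax classification and a cascade with a tunable triage threshold can be sketched as follows. This is only an illustration under stated assumptions: the function names, the probability values, and the two thresholds are hypothetical and do not come from the paper.

```python
from typing import Dict

SUBTYPES = ("MEL", "SCC", "BCC")


def single_stage_label(probs: Dict[str, float]) -> str:
    """Single-stage four-class decision: take the most probable class."""
    return max(probs, key=lambda c: probs[c])


def cascade_label(p_malignant: float,
                  subtype_probs: Dict[str, float],
                  triage_threshold: float = 0.5) -> str:
    """Two-stage decision: binary triage, then MEL/SCC/BCC differentiation."""
    if p_malignant < triage_threshold:
        return "benign"
    return max(SUBTYPES, key=lambda c: subtype_probs[c])


# Hypothetical lesion: benign is the most probable single class,
# yet the malignant classes together hold 0.45 of the probability mass.
four_class = {"benign": 0.55, "MEL": 0.30, "SCC": 0.05, "BCC": 0.10}
subtypes = {"MEL": 0.65, "SCC": 0.10, "BCC": 0.25}

print(single_stage_label(four_class))                       # benign
print(cascade_label(0.45, subtypes, triage_threshold=0.5))  # benign
print(cascade_label(0.45, subtypes, triage_threshold=0.3))  # MEL
```

Lowering the triage threshold is the only change between the last two calls. The argmax rule has no comparable knob, which is the sensitivity-control point made in the conclusion.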

AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction

2606.13051v1 by Fabien Maury, Solène Grosdidier, Maud de Dieuleveult, Adrien Coulet

Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, in which we manually annotated entities and their relationships. AAbAAC was used first to evaluate several methods on the task of named entity recognition (NER), and second to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing the expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.

摘要：儘管深度學習和大型語言模型推動了信息提取的進步，但在高度專業化的生物醫學領域，性能差距仍然存在，這些領域特有的複雜性對通用模型構成挑戰。在這項工作中，我們專注於自體免疫領域，主要的關注實體包括自體免疫疾病、自體抗體（即可能標記或引起這些疾病的分子）、它們的分子靶點、它們在體內的位置以及相關的臨床徵兆。在此，我們呈現了 AAbAAC（自體抗體與自體免疫註釋語料庫），這是一個從 PubMed 中選取的 115 篇摘要的語料庫，我們手動註釋了實體及其關係。首先，AAbAAC 被用來評估幾種命名實體識別（NER）任務的方法，其次，用於微調 NER 模型。我們的研究展示了 AAbAAC 在自體免疫領域的信息提取中的實用性，顯示出在微調後 NER 性能的預期改善。這說明了小規模註釋工作對專業領域的價值，並為自體免疫的計算研究做出了貢獻。AAbAAC 語料庫可在 https://github.com/f-maury/AAbAAC 獲得。

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

2606.12988v1 by Manex Atxa, Bruno Simoes, Julen Balzategui

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint and thereby restrict the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

摘要：這篇論文介紹了一種新方法，用於利用三維體積視頻數據進行人體姿勢的實時預測，包括符合人體工學和不符合人體工學的姿勢。雖然該方法是為人體工學評估而設計的，但它可以適應其他需要實時分析人體姿勢的應用。使這個系統脫穎而出的其中一個方面是它在評估過程中分析三維點雲的能力，從而實現從多個角度進行計算。這克服了相機的一個關鍵限制，因為相機通常提供固定的視角，從而限制了可用於徹底姿勢評估的數據，特別是在發生遮擋時。該系統持續並自動地使用所選視角對實時流數據進行姿勢推斷；然而，只有用戶手動選擇和標記的姿勢才用於訓練個性化的深度學習分類器。這一方法通過一個案例研究得到了完善，其中RGB-D相機捕捉受試者執行負載提升任務，實現了實時骨骼標記。該模型在這些數據上進行了訓練，並在訓練階段之後，對新的流數據進行實時推斷。這項研究通過結合最先進的三維數據技術和傳統的二維姿勢估計算法，提供了一種可擴展且務實的實時人體工學評估方法。它滿足了工作環境中對安全和健康監測日益增長的需求，標誌著該領域的一項顯著貢獻。

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

2606.12953v1 by Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757), ahead of BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code, and an interactive demo is publicly available, as a reproducible baseline for the community.

摘要：我們提出了 OpenMedQ，一個在迄今為止最廣泛的完全開放醫學混合資料上進行預訓練的醫學視覺-語言模型：14 個數據集總計約 335 萬個預訓練樣本，涵蓋病理學、放射學、顯微鏡學和僅文本的臨床問答。OpenMedQ 在 PathVQA 上達到了最先進的 BLEU-1（75.9），超越了高達 562B 參數的 Med-PaLM M 變體（約大 80 倍），並且與報告的最佳 VQA-MED BLEU-1（64.5）相匹配。其視覺編碼器在相同的下游食譜下轉移到 8 個未見的醫學分類基準，獲得了 BiomedCLIP（0.745）、PMC-CLIP（0.745）、PubMedCLIP（0.746）和一個從零開始的基準（0.616）中最高的平均宏 F1（0.757）。我們發布了我們的代碼，並提供了一個互動演示，作為社區可重現的基準。

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

2606.12824v1 by Daniel Soliman

AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.

摘要：AI 醫療影像治理正在正式化：2026 年 ACR-SIIM 實踐參數建議進行本地接受測試和持續的漂移監控，而 ACR Assess-AI 註冊處則利用 DICOM 元數據監控 AI 輸出以提供上下文。我們主張，在輸出指標之下，存在一個必要的、目前未被監控的層面：即進來的研究是否仍然在模型驗證時的獲取範圍內。使用 LUNA16 訓練的 MONAI RetinaNet 肺結節檢測器，我們測試獲取狀態是否作為一個結構化的、可測量的變量。在僅在重建核上有所不同的真實配對 CT（NLST B30f 與 B80f）上，僅核便改變了 AI 測量的直徑，並在固定的患者和獲取下將 5.2%（8/155）的結節的 Fleischner 大小類別翻轉，而檢測信心則保持不變（Wilcoxon p=0.22）。在受控的 LIDC-IDRI 擾動下，影響按軸分離：噪聲軸降低了檢測信心（p=5.9e-32，集中在小於 6 mm 的結節上），但未影響測量，而頻率/核軸則損壞了測量（p=8.6e-13），但未影響檢測。一個 4 特徵像素指紋恢復了重建身份（在真實 CT 上的患者級 AUC 約為 0.95，在 QIBA 幻影上為 0.995），而 ConvolutionKernel DICOM 標籤則無法提供有用信息（在重建中標籤相同）。因此，核軸在四個製造商之間傳輸（去除一個供應商的 AUC 為 0.94-0.98，與供應商內的上限相匹配）。獲取狀態因此映射到不同的 AI 失效模式，頻率內容映射到測量可靠性，噪聲映射到檢測敏感性，且無法從元數據中恢復。具備獲取意識的輸入端驗證是當前進入影像 AI 認證的接受測試和漂移監控要求中缺失的層面。
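
To see why a small kernel-driven diameter shift can flip a size category, a minimal sketch follows. It assumes the commonly cited Fleischner 2017 bands for solid nodules (<6 mm, 6-8 mm, >8 mm). The band edges as coded, the function names, and the paired diameters are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Tuple


def size_category(diameter_mm: float) -> str:
    """Map a measured diameter to a size band (assumed band edges)."""
    if diameter_mm < 6.0:
        return "<6 mm"
    if diameter_mm <= 8.0:
        return "6-8 mm"
    return ">8 mm"


def flip_rate(pairs: List[Tuple[float, float]]) -> float:
    """Fraction of nodules whose band differs between two reconstructions."""
    flips = sum(size_category(a) != size_category(b) for a, b in pairs)
    return flips / len(pairs)


# Hypothetical (soft-kernel, sharp-kernel) diameters for the same nodules.
paired_diameters = [(5.8, 6.3), (4.0, 4.4), (9.1, 9.5), (7.9, 8.2)]
print(flip_rate(paired_diameters))  # 0.5

# The rate reported in the abstract: 8 of 155 nodules.
print(round(100 * 8 / 155, 1))  # 5.2
```

Only nodules measured close to a band edge can flip, so a modest systematic shift in measured diameter affects a small but clinically relevant subset.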

Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System

2606.12702v1 by Alyssa Unell, Miguel Fuentes, Brenna Li, Bridget Lin, Meena Jagadeesan, Sanmi Koyejo, Nigam Shah

Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of these systems. However, static benchmarks tend to measure correctness rather than user acceptance, aggregate performance across queries, and require densely annotated datasets -- leading to major blind spots for evaluating clinical systems. In this work, we perform a deployment-centered evaluation of an LLM system embedded within electronic health records at an academic medical center, where user feedback is sparse but closely reflects the deployment conditions. Specifically, we train a pre-response classifier that estimates the risk that a future interaction will result in the user rejecting the LLM response, based on query content and deployment-specific context available before generation. We conduct a prospective analysis of our model over 4.5 months of user feedback, finding that our prediction model achieves an AUROC of 0.719. Further, we estimate the benefit of such predictions in two downstream use cases (guardrail triggering and abstention). Our key conceptual insight is that making use of deployment-specific context (i.e., the provider type, department name, language model used for response), as opposed to only query content, improves the ability to predict whether the user will reject the system output. Altogether, our empirical case study demonstrates the feasibility of predicting user rejection using deployment-specific context, opening the door to targeted guardrails.

摘要：大型語言模型（LLMs）越來越多地整合進臨床系統中，因此評估這些系統在現實世界中的實用性變得至關重要。然而，靜態基準往往測量正確性而非用戶接受度，對查詢的整體性能進行匯總，並需要密集標註的數據集——這導致在評估臨床系統時出現重大盲點。在本研究中，我們對嵌入在學術醫療中心電子健康記錄中的LLM系統進行了以部署為中心的評估，這裡的用戶反饋稀少，但與部署條件密切相關。具體而言，我們訓練了一個預響應分類器，該分類器根據查詢內容和生成前可用的特定於部署的上下文來估計未來互動導致用戶拒絕LLM響應的風險。我們對模型進行了為期4.5個月的用戶反饋前瞻性分析，發現我們的預測模型達到了0.719的AUROC。此外，我們估算了這些預測在兩個下游使用案例中的好處（安全邊界觸發和放棄）。我們的關鍵概念見解是，利用特定於部署的上下文（即提供者類型、部門名稱、用於響應的語言模型），而不僅僅是查詢內容，可以提高預測用戶是否會拒絕系統輸出的能力。總的來說，我們的實證案例研究展示了使用特定於部署的上下文預測用戶拒絕的可行性，為針對性安全邊界的開發鋪平了道路。
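
The two downstream uses named in the abstract, guardrail triggering and abstention, can be pictured as thresholds on the pre-response risk score. The sketch below is a hypothetical routing policy: the function name, the action labels, and both thresholds are assumptions made for illustration and are not described in the paper.

```python
def route_query(rejection_risk: float,
                guardrail_at: float = 0.5,
                abstain_at: float = 0.8) -> str:
    """Choose an action from a predicted probability of user rejection."""
    if not 0.0 <= rejection_risk <= 1.0:
        raise ValueError("rejection_risk must be a probability in [0, 1]")
    if rejection_risk >= abstain_at:
        return "abstain"      # do not generate a response at all
    if rejection_risk >= guardrail_at:
        return "guardrail"    # generate, but with extra checks
    return "respond"          # generate normally


for risk in (0.10, 0.62, 0.91):
    print(risk, route_query(risk))
```

Because the score is available before generation, a policy of this kind can act on a query without first paying for a response. In practice the thresholds would have to be tuned against the observed feedback.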

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

2606.12699v1 by Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML), rely primarily on historical blood glucose measurements, and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic curve (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

摘要：2型糖尿病（T2D）對全球健康構成日益嚴重的威脅，迫切需要有效的血糖評估以支持個性化和改善的糖尿病護理。可穿戴傳感器，如持續血糖監測儀（CGM）和健身追蹤器，為血糖評估提供了許多有價值的見解。然而，有效分析這些數據需要與重要的個體層面背景整合。現有的方法通常基於傳統的機器學習（ML），主要依賴歷史血糖測量，並忽略個性化信息，這限制了它們在多樣化糖尿病人群中的表現。最近在大型語言模型（LLMs）方面的進展已顯示出它們能夠整合多樣的數據模態，同時建模序列依賴性，這激勵我們探索它們在個性化血糖評估中的潛力。在本文中，我們提出了GlyLLM，一個基於LLM的框架，用於通過整合可穿戴傳感器數據和結構化元數據來建模基於CGM的血糖動態。GlyLLM可以利用預訓練LLMs的廣泛先驗知識，並在決策時實現傳感器-文本語義抽象。在AI-READI數據集上的兩個相關任務實驗表明，我們的模型在血糖預測中比傳統的ML方法平均提高了13.66\%的均方根誤差（RMSE），在糖尿病分類中提高了13.08\%的接收者操作特徵曲線下的面積（AUROC）。此外，我們的消融研究顯示，糖尿病調查和生物識別測試對於血糖評估比其他健康信息更為關鍵。我們的工作為利用LLMs的力量推進T2D護理中的個性化血糖評估邁出了有希望的一步。

CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents

2606.12666v2 by Siyu Shen, Fenghao Xu, Wenrui Diao, Kehuan Zhang

Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capability also turns every screen observation into a privacy boundary. During normal task execution, screenshots may expose contacts, messages, photos, files, recommendations, health cues, and other sensitive context that is unrelated to the user's request. We call this problem incidental visual privacy exposure. It is difficult to address with existing defenses: text anonymization misses many visual and inferential cues, while generic privacy masking can remove the evidence and controls that a GUI agent needs to complete the task. This paper presents CAPED, a context-aware pre-upload exposure control layer for mobile GUI agents. CAPED is designed as a phone-side protection layer: before screenshots are released to a remote multimodal agent, it extracts task requirements, uses screen context as a privacy prior, parses visible UI elements, and selectively exposes only content needed for the current task while masking incidental private content. We evaluate CAPED on AndroidWorld for broad task utility and with a controlled 28-task seeded privacy evaluation used as a measurement instrument for trajectory-level incidental leakage. In this seeded evaluation, Full CAPED reduces success-conditioned weighted seeded leakage from 0.766 under raw screenshots to 0.268 while preserving high task utility. A broader AndroidWorld run shows a remaining prototype-level utility cost, but the results show that task-driven selective exposure can reduce incidental visual leakage before screenshots are released to a remote GUI agent.

摘要：基於截圖的移動 GUI 代理可以通過與人類用戶相同的視覺界面操作普通智能手機應用，但這一能力也將每次屏幕觀察變成隱私邊界。在正常任務執行過程中，截圖可能會暴露聯絡人、消息、照片、文件、推薦、健康提示以及與用戶請求無關的其他敏感內容。我們稱這個問題為偶然的視覺隱私暴露。使用現有的防禦措施很難解決這個問題：文本匿名化忽略了許多視覺和推斷線索，而通用的隱私遮蔽可能會移除 GUI 代理完成任務所需的證據和控制。
本文介紹了 CAPED，一種針對移動 GUI 代理的上下文感知預上傳暴露控制層。 CAPED 被設計為手機端的保護層：在截圖發送到遠程多模態代理之前，它提取任務要求，使用屏幕上下文作為隱私先驗，解析可見的 UI 元素，並選擇性地僅暴露當前任務所需的內容，同時遮蔽偶然的私人內容。我們在 AndroidWorld 上評估 CAPED，以獲得廣泛的任務效用，並使用受控的 28 任務種子隱私評估作為測量工具來評估軌跡級別的偶然泄漏。在這次種子評估中，完整的 CAPED 將基於原始截圖的成功條件加權種子泄漏從 0.766 降低到 0.268，同時保持高任務效用。更廣泛的 AndroidWorld 測試顯示仍然存在原型級的效用成本，但結果顯示，基於任務驅動的選擇性暴露可以在截圖發送到遠程 GUI 代理之前減少偶然的視覺泄漏。

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

2606.12590v1 by Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.

摘要：大型視覺-語言模型（LVLMs）在醫學影像任務中取得了強勁的表現，但它們仍然容易出現事實不一致、視覺基礎不佳以及與臨床意義反饋不對齊的問題。現有的後訓練對齊方法，包括直接偏好優化（DPO）及其變體，在醫學領域面臨三個關鍵限制：（1）序列級別的獎勵信號將臨床關鍵的標記與一般填充文本視為相同；（2）依賴靜態的監督微調參考作為偏好回應會引入偏離政策的分佈轉移，將優化引導至風格上的瑕疵而非臨床正確性；（3）對齊目標缺乏明確的視覺基礎約束，使模型對細微但診斷上決定性的病理特徵不敏感。我們的方法利用雙向標記級別的KL正則化器，並結合視覺對比基礎目標，將乾淨圖像與病變腐蝕圖像配對，以懲罰在缺乏足夠視覺證據的情況下生成的回應。這些組件共同構成了一個細緻的、在政策內的對齊框架，通過最小編輯模型生成的輸出來構建偏好對，僅修正臨床錯誤的範圍，同時保留原始的語言風格。在醫學影像任務和臨床文本生成基準上的廣泛實驗驗證了我們方法的有效性。

EDEN: A Large-Scale Corpus of Clinical Notes for Italian

2606.12569v1 by Tiziano Labruna, Guido Bertolini, Pietro Ferrazzi, Bernardo Magnini

We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Departments of Italian hospitals. The corpus, in its current version, is composed of approximately 4 million fully anonymized clinical notes, covering diverse phases of patient care during the stay in the emergency department. In addition, a subset of about six thousand notes has been manually annotated by clinical experts through a structured Case Report Form (CRF) containing 132 items relevant for two patient situations in emergency departments, dyspnea and loss of consciousness. Items may take numerical (e.g., for blood saturation), categorical (e.g., for level of consciousness), binary (e.g., for presence of traumas), and mixed value types. The annotation process involved multiple clinicians and underwent iterative revision to resolve ambiguities in item formulation, resulting in a richly structured (although highly imbalanced) resource. The dataset aims to fill a relevant gap in data able to support both the development and the use of Large Language Models in concrete medical applications. We describe the data collection protocol, the on-site anonymisation pipeline, corpus statistics, and the annotation scheme. Finally, we propose CRF-filling as a novel structured information extraction benchmark, and provide zero-shot baselines from Gemma-27B and MedGemma-27B. To the best of our knowledge, the EDEN dataset is the largest freely available corpus of clinical notes for the Italian language.

摘要：我們介紹EDEN（急診部電子筆記），這是一個新穎且獨特的大規模臨床筆記語料庫，產自意大利醫院的急診部。該語料庫在當前版本中包含約400萬條完全匿名的臨床筆記，涵蓋患者在急診部住院期間的不同護理階段。此外，約六千條筆記已由臨床專家通過結構化病例報告表（CRF）手動標註，該表格包含132個與急診部兩種患者情況相關的項目，分別是呼吸困難和意識喪失。項目可以採用數值（例如，血氧飽和度）、類別（例如，意識水平）、二元（例如，創傷存在與否）和混合值類型。標註過程涉及多位臨床醫生，並經過多次修訂以解決項目表述中的模糊性，最終形成了一個結構豐富（雖然高度不平衡）的資源。該數據集旨在填補一個相關的數據空白，以支持大型語言模型在具體醫療應用中的開發和使用。我們描述了數據收集協議、現場匿名化流程、語料庫統計數據和標註方案。最後，我們提出CRF填寫作為一個新穎的結構化信息提取基準，並提供來自Gemma-27B和MedGemma-27B的零樣本基線。據我們所知，EDEN數據集是現存的意大利語臨床筆記中最大的免費可用語料庫。

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

2606.12346v1 by Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

摘要：Hematoxylin 和 eosin (H&E) 染色是組織病理學的基石，但 H&E 全片影像 (WSIs) 的可擴展、定量分析仍然是計算病理學中的一個主要挑戰。我們提出了 Atlas H&E-TME，一個基於 AI 的系統，建立在 Atlas 病理學基礎模型家族上，能夠預測多種癌症類型的組織質量、組織區域和細胞類型標籤，每張幻燈片提供超過 4,500 個細胞級別解析的定量讀數。驗證此類系統的一個主要挑戰是克服 H&E 僅有的真實標準中固有的形態學模糊性，以及基於免疫組織化學 (IHC) 等模式的更具信息性的參考的有限可擴展性。我們通過一個雙重驗證框架來解決這個問題，結合了生物學上扎實的深度與技術和形態學的廣度。在深度方面，我們提出了一個 IHC 資訊的多病理學家共識協議，顯著提高了與傳統 H&E 僅有標註相比的評估者間一致性。這提供了一個分子基礎的參考，與我們比較 Atlas H&E-TME 和僅使用 H&E 的病理學家。在廣度方面，我們在超過 200,000 個高信心的 H&E 僅有病理學家標註上對 Atlas H&E-TME 進行基準測試，這些標註來自 1,500 多個案例，涵蓋八種癌症類型及其最常見的轉移部位，亞型覆蓋每種癌症類型超過 90% 的臨床案例，來自 25 多個來源和 8 種以上的掃描儀模型。與 IHC 資訊的共識進行基準測試後，Atlas H&E-TME 的表現與病理學家的 H&E 僅有表現相匹配或超過，並在這個廣泛的形態學和技術範疇內持續且穩健地進行泛化。通過這樣做，Atlas H&E-TME 將 H&E 幻燈片——病理學中最普遍的數據——轉變為一個可擴展的、定量的窗口，觀察腫瘤及其微環境，為轉化和臨床研究中的下一代基於組織的生物標誌物奠定基礎。

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

2606.12252v1 by Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

摘要：訓練深度神經網絡以進行臨床時間序列分析在計算上要求甚高，然而許多醫療環境缺乏重複模型開發和部署所需的資源。這一挑戰在心電圖分類中特別明顯，因為大型數據集和長時間的訓練計劃使得效率變得非常重要。漸進式數據丟棄通過在樣本被學習後排除其對梯度更新的貢獻來降低訓練成本，但它依賴於模型信心，可能會保留由於噪聲或模糊而難以處理的樣本，而不是有用的信號。在這項工作中，我們介紹了ERTS，一種基於可解釋性的可靠性訓練信號，用於高效的心電圖分類。ERTS在訓練期間使用解釋質量來區分信息性和不可靠的不確定性。基於漸進式數據選擇，我們計算候選樣本的Grad-CAM注意力圖，並導出一個焦點分數，以衡量模型預測是否得到一致且局部化模式的支持。低焦點的樣本會被過濾掉，而那些具有意義的注意力的樣本則優先進行梯度更新。我們在三個心電圖數據集和多個主幹架構上評估了ERTS，顯示出宏觀F1分數的一致改善，同時有效的訓練成本降低。這些結果表明，解釋質量可以作為改善臨床時間序列學習中效率和可靠性的實用信號。代碼將會發布。
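
The abstract does not say how the focus score is computed. As one way to picture a score that rewards coherent, localised attention, the sketch below takes the largest share of attention mass that falls inside any contiguous window of a 1-D Grad-CAM map. The definition, the window width, the threshold, and all names are assumptions made for illustration and should not be read as the ERTS formula.

```python
from typing import List, Sequence


def focus_score(attention: Sequence[float], window: int) -> float:
    """Largest fraction of total attention inside one contiguous window."""
    if window < 1 or window > len(attention):
        raise ValueError("window must be between 1 and len(attention)")
    total = sum(attention)
    if total <= 0:
        return 0.0
    best = max(sum(attention[i:i + window])
               for i in range(len(attention) - window + 1))
    return best / total


def keep_focused(maps: List[Sequence[float]], window: int,
                 threshold: float) -> List[int]:
    """Indices of samples whose attention is localised enough to train on."""
    return [i for i, m in enumerate(maps)
            if focus_score(m, window) >= threshold]


# Hypothetical attention maps over ten time steps.
peaked = [0.0, 0.0, 0.1, 0.9, 1.0, 0.8, 0.1, 0.0, 0.0, 0.0]
diffuse = [0.3] * 10

print(keep_focused([peaked, diffuse], window=3, threshold=0.6))  # [0]
```

Under this toy definition the peaked map concentrates about 93% of its mass in three steps, while the diffuse map reaches only 30%, so only the first sample would be kept for gradient updates.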

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

2606.12169v1 by Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated biomedical, human-authored scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical domain vision modalities such as radiological scans, microscopic images, visible light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary axes of capability, including perception, medical knowledge, and rationale, enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that exhibits its effectiveness in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at huggingface.co/datasets/neginb/OpenMedReason.

摘要：高風險臨床使用大型視覺-語言模型 (LVLMs) 需要基於視覺證據和臨床知識的推理，而不僅僅是正確的最終答案。我們介紹了 OpenMedReason，一個大規模的開放多模態醫療推理語料庫，包含約 450K 的圖像-問題-答案實例，其推理過程主要來自經過策劃的生物醫學和人類撰寫的科學文章。OpenMedReason 提供了超越合成思維鏈的高保真監督，涵蓋了多樣的醫療領域視覺模態，如放射掃描、顯微圖像、可見光照片、圖表等。我們用 OpenMedReason-Bench 進行補充，這是一個保留的基準，允許在三個互補的能力軸上對 LVLMs 進行細緻的評估，包括感知、醫療知識和推理，使診斷評估超越最終答案的準確性。OpenMedReason 是一個豐富的訓練資源，顯示其在監督細調 (SFT) 和基於增強的對齊中的有效性。使用 OpenMedReason 進行訓練使 VQA 準確性平均提高 20%，並在最強的可比規模醫療 LVLMs 中達到性能在 4.2% 內。細緻的性能分析確認這些增益並不集中於任何單一軸：OpenMedReason 共同改善了感知、醫療知識和推理，其推理過程在 86.1% 的成對比較中優於基準模型。我們在 huggingface.co/datasets/neginb/OpenMedReason 上發布了代碼和數據集。

Towards Responsibly Non-Compliant Machines

2606.12147v1 by Marija Slavkovik, Marie Farrell, Louise Dennis, Michael Fisher, Simon Kolker, Emily C. Collins

We consider the problem of engineering autonomous intelligent agents that are capable of responsibly declining to comply with user requests. We argue that machine non-compliance comes in many different forms, and sketch the issues we should pursue on the road to responsibly non-compliant intelligent machines. We anchor responsible non-compliance in justifications for task refusal, pathways to override the non-compliance, and careful tracking of security risks and liability transfers.

摘要：我們考慮工程自主智能代理的問題，這些代理能夠負責任地不遵從用戶請求。我們認為機器的不遵從有許多不同的形式，並勾勒出我們在實現負責任的不遵從智能機器的過程中應該追求的問題。我們將負責任的不遵從建立在拒絕任務的理由、覆蓋不遵從的途徑，以及對安全風險和責任轉移的仔細追蹤上。

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

2606.12006v1 by Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.

摘要：預測事件發生時間的結果，例如死亡率，是臨床決策中的一項基本任務，通常通過生存分析來解決。雖然傳統的統計方法和深度學習方法已被廣泛研究，但這些方法通常需要特定任務的訓練和足夠的標記數據。最近在表格基礎模型方面的進展提供了一種新的範式，通過學習結構化數據的一般性表示來進行處理。然而，這些模型在臨床環境中對於被審查的事件預測的適用性仍然未被充分探索，因為典型應用主要限於離散分類，而非生存分析任務。在本研究中，我們提出了一種輕量級的適應方法，通過在預訓練表示的基礎上直接訓練一個生存感知的頭部，將表格基礎模型應用於臨床生存分析。我們研究了代表性的架構，包括TabPFN、TabDPT和TabICL，並使用多任務邏輯回歸（MTLR）頭部進行調整，以建模右審查的事件結果。我們在一組多樣的公共生存基準和兩個大型ICU隊列（MIMIC-IV和eICU）上評估了這一方法。我們的結果顯示，這種轉移學習方法在與強基準相比時，達到了競爭或更優的性能。在MIMIC-IV上，TabDPT-FT-MTLR達到了0.856的C指數，相當於比最佳非FM基準（DeepSurv，0.844）提高了+1.4%，比最佳零樣本模型（0.802）提高了+6.7%。在eICU上，TabICL-FT-MTLR達到了0.797，分別帶來+1.7%（DeepSurv，0.784）和+6.4%（0.749）的增益。這些發現突顯了將預訓練的表格表示與生存感知目標相結合的重要性，並表明表格基礎模型為臨床生存預測提供了一種實用且有效的替代方案。
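
The results above are reported as C-index values and relative improvements. A minimal sketch of Harrell's concordance index for right-censored data, together with the relative-gain arithmetic, is given below. The toy times, event flags, and risk scores are hypothetical. The function names are illustrative, and the paper's exact evaluation code is not shown in the abstract.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's event or censoring time.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def relative_gain_pct(new: float, reference: float) -> float:
    """Relative improvement of `new` over `reference`, in percent."""
    return 100.0 * (new - reference) / reference


# Hypothetical cohort: event flag 0 marks a right-censored subject.
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risks = [0.9, 0.7, 0.1, 0.8]
print(concordance_index(times, events, risks))  # 0.75

# Relative gains on MIMIC-IV, recomputed from the reported C-index values.
print(round(relative_gain_pct(0.856, 0.844), 1))  # 1.4
print(round(relative_gain_pct(0.856, 0.802), 1))  # 6.7
```

The eICU figures follow the same arithmetic: 0.797 against 0.784 and 0.749 gives the reported +1.7% and +6.4%.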

Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

2606.11930v2 by Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.

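Summary: Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment, because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution to the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Rather than fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system reaches an average validation mean squared error (MSE) of 0.2696, improving on the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.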
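
As a rough illustration of the per-trait late fusion and the relative MSE reduction reported for Track 1, the sketch below fuses per-modality predictions for one trait with a weighted average and recomputes the quoted reduction. The weighted-average rule, the modality names, and the function names `late_fusion`, `mse`, and `relative_reduction` are assumptions made for illustration; the abstract does not specify how the authors fuse modalities.

```python
def late_fusion(preds_by_modality, weights):
    """Weighted average of per-modality predictions for a single trait.

    preds_by_modality : dict mapping modality name -> list of predictions
    weights           : dict mapping modality name -> non-negative weight
    """
    total = sum(weights[m] for m in preds_by_modality)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(next(iter(preds_by_modality.values())))
    return [
        sum(weights[m] * preds_by_modality[m][k] for m in preds_by_modality) / total
        for k in range(n)
    ]


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


if __name__ == "__main__":
    # Validation MSE values quoted in the abstract
    official_baseline = 0.3334
    ablation = {
        "global model": 0.3189,
        "per-trait modeling": 0.2871,
        "per-trait late fusion": 0.2696,
    }
    for step, value in ablation.items():
        print(f"{step}: {relative_reduction(official_baseline, value):.1f}% below baseline")
```

The last ablation step gives (0.3334 - 0.2696) / 0.3334, which rounds to the 19.1% quoted in the abstract.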

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

2606.11830v1 by Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.

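Summary: Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs, compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. In this exploratory sample, autonomous skill access showed a directional quality signal, but the signal was smaller than the noise in expert ratings and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.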

Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

2606.11794v1 by Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad-CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.

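Summary: Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-consuming and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed from subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting used throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and the strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad-CAM++ and SHAP showed anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.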
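
The three metrics compared in this entry (QWK, adjacent-stage accuracy, MAE) can be sketched as follows. This is a minimal illustration, assuming that QWK means Cohen's kappa with quadratic weights, that CDR stages are encoded as consecutive integer indices 0..K-1, and that adjacent-stage accuracy counts a prediction as correct when it is at most one stage away from the label. None of these definitions is spelled out in the abstract, and the function names are hypothetical.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic weights over labels 0..n_classes-1."""
    n = len(y_true)
    # Observed confusion matrix: rows are true stages, columns are predictions
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    if den == 0:
        raise ValueError("kappa is undefined when chance disagreement is zero")
    return 1.0 - num / den


def adjacent_accuracy(y_true, y_pred):
    """Share of predictions within one ordinal stage of the true label."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between ordinal stage indices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the paper
    truth = [0, 1, 2, 1, 0, 2]
    preds = [0, 1, 1, 1, 1, 2]
    print(quadratic_weighted_kappa(truth, preds, n_classes=3))
    print(adjacent_accuracy(truth, preds))
    print(mean_absolute_error(truth, preds))
```

Because the quadratic weights penalize a two-stage miss four times as heavily as a one-stage miss, a model can have the lowest MAE without having the highest QWK, which matches the pattern the abstract reports.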

Medical

Abstracts

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

RedactionBench

Augmenting Dysarthric Speech Severity Assessment with MOS Supervision

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

From Specification to Execution: AI Assisted Scientific Workflow Management

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

WEQA: Wearable hEalth Question Answering with Query-Adaptive Agentic Reasoning

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

Talking to Your Data: Exploring Embodied Conversation as an Interface for Personal Health Reflection

Vision-language models for chest radiography do not always need the image

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

A Machine-Learned Comorbidity Index

Incumbent Advantage: Brand Bias and Cognitive Manipulation Dynamics in LLM Recommendation Systems

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Symbolic Informalization: Fluent, Productive, Multilingual

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

AgentFairBench: Do LLM Agents Discriminate When They Act?

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

Unified Multimodal Model for Brain MRI Imputation and Understanding

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning

Input-Dependent Fisher Information for Local Sensitivity Analysis of Medical Image Classifiers

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

Embedded Arena: Iterative Optimization via Hardware Feedback

A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

Hierarchical Modeling of ICD Codes in EHR Foundation Models

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents

CAP: Towards PPG Universal Representation Learning with Patient-level Supervision

RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning

Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs

Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Expert-Driven Survival Machines: Improving Stratification and Interpretability in Multiple Clinical Cohorts

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation

Securing the Future of IoMT in the Post-Quantum Era: An Edge-Native Federated Learning Approach

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages