Medical
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2026-05-18 | Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs | Junyu Pan et.al. | 2605.18172v1 | null |
| 2026-05-18 | Domain Transfer Becomes Identifiable via a Single Alignment | Sagar Shrestha et.al. | 2605.17918v1 | null |
| 2026-05-18 | LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection | Hanbyeol Park et.al. | 2605.17902v1 | null |
| 2026-05-18 | Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training | Guanliang Liu et.al. | 2605.17879v1 | null |
| 2026-05-18 | Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale | Jinghui Liu et.al. | 2605.17775v1 | null |
| 2026-05-18 | Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes | Jinghui Liu et.al. | 2605.17755v1 | null |
| 2026-05-18 | Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science | Yingjie Zhang et.al. | 2605.17746v1 | null |
| 2026-05-18 | Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis | Danu Kim et.al. | 2605.17729v1 | null |
| 2026-05-17 | PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship | Zhiyuan Wang et.al. | 2605.17679v1 | null |
| 2026-05-17 | ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation | Zhikang Chen et.al. | 2605.17580v1 | null |
| 2026-05-17 | CasualSynth: Generating Structurally Sound Synthetic Data | Zehua Cheng et.al. | 2605.17528v1 | null |
| 2026-05-17 | Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization | Gunjan Balde et.al. | 2605.17379v1 | null |
| 2026-05-17 | CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings | Qixuan Hu et.al. | 2605.17370v1 | null |
| 2026-05-17 | Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification | Yang Wu et.al. | 2605.17308v1 | null |
| 2026-05-17 | How Do Electrocardiogram Models Scale? | Jiawei Li et.al. | 2605.17276v1 | null |
| 2026-05-17 | Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability | Nisreen Albzour et.al. | 2605.17236v1 | null |
| 2026-05-16 | UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation | Shiv Ghosh et.al. | 2605.17140v1 | null |
| 2026-05-16 | SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning | Yongfeng Huang et.al. | 2605.17101v1 | null |
| 2026-05-16 | Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench | Tianyu Wang et.al. | 2605.17079v1 | null |
| 2026-05-16 | AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation | Shiying Yu et.al. | 2605.17071v1 | null |
| 2026-05-16 | PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts | Khizar Hussain et.al. | 2605.17028v1 | null |
| 2026-05-16 | Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings | Anthonio Oladimeji Gabriel et.al. | 2605.16993v1 | null |
| 2026-05-16 | Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons | Wei Tang et.al. | 2605.16975v1 | null |
| 2026-05-16 | Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects | Zhentao Tan et.al. | 2605.16966v1 | null |
| 2026-05-16 | From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction | Pujun Feng et.al. | 2605.16927v1 | null |
| 2026-05-16 | PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes | Phat Tran et.al. | 2605.16860v1 | null |
| 2026-05-16 | VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment | Amy Makawana et.al. | 2605.16775v1 | null |
| 2026-05-15 | CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? | Haolin Chen et.al. | 2605.16679v1 | null |
| 2026-05-15 | PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems | Shafizur Rahman Seeam et.al. | 2605.16630v1 | null |
| 2026-05-15 | Isotonic Survival Regression: Calibrated Survival Distributions from Deep Cox Models | Anchit Jain et.al. | 2605.16571v1 | null |
| 2026-05-15 | Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces | Arne Nix et.al. | 2605.16545v1 | null |
| 2026-05-15 | Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search | Sarah Martinson et.al. | 2605.16238v1 | null |
| 2026-05-15 | Fully Open Meditron: An Auditable Pipeline for Clinical LLMs | Xavier Theimer-Lienhard et.al. | 2605.16215v1 | null |
| 2026-05-15 | Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning | Ranjith Chodavarapu et.al. | 2605.15894v1 | null |
| 2026-05-15 | BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation | Huanyang Tong et.al. | 2605.15736v1 | null |
| 2026-05-15 | Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery | Weijun Ma et.al. | 2605.16458v1 | null |
| 2026-05-15 | Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign | Jiahui Li et.al. | 2605.16452v1 | null |
| 2026-05-15 | Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating | Hangchun Liang et.al. | 2605.16446v1 | null |
| 2026-05-15 | Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images | Liangrui Pan et.al. | 2605.16444v1 | null |
| 2026-05-15 | Two-Valued Symmetric Circulant Matrices: Applications in Deep Learning | Jayakrishna Amathi et.al. | 2605.16443v1 | null |
| 2026-05-14 | Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction | A H M Rezaul Karim et.al. | 2605.15467v1 | null |
| 2026-05-14 | FutureSim: Replaying World Events to Evaluate Adaptive Agents | Shashwat Goel et.al. | 2605.15188v1 | null |
| 2026-05-14 | Evidential Reasoning Advances Interpretable Real-World Disease Screening | Chenyu Lian et.al. | 2605.15171v1 | null |
| 2026-05-14 | Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment | Sayantan Kumar et.al. | 2605.15168v1 | null |
| 2026-05-14 | COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion | Zihan Deng et.al. | 2605.15016v1 | null |
| 2026-05-14 | Quantifying and Mitigating Premature Closure in Frontier LLMs | Rebecca Handler et.al. | 2605.15000v1 | null |
| 2026-05-14 | Explainable Detection of Depression Status Shifts from User Digital Traces | Loris Belcastro et.al. | 2605.14995v1 | null |
| 2026-05-14 | Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning | Francesco Pastori et.al. | 2605.14991v1 | null |
| 2026-05-14 | GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation | Drewry H. Morris et.al. | 2605.14968v1 | null |
| 2026-05-14 | From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement | Varad Vishwarupe et.al. | 2605.14912v1 | null |
| 2026-05-14 | BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring | Zixuan Shu et.al. | 2605.14886v1 | null |
| 2026-05-14 | Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model | Minghao Wu et.al. | 2605.14723v1 | null |
| 2026-05-14 | Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke | Liren Chen et.al. | 2605.14710v1 | null |
| 2026-05-14 | NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces | Konstantinos Kontras et.al. | 2605.14698v1 | null |
| 2026-05-14 | How Sensitive Are Radiomic AI Models to Acquisition Parameters? | D. Gil et.al. | 2605.14667v1 | null |
| 2026-05-14 | MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder | Eranga Bandara et.al. | 2605.14660v1 | null |
| 2026-05-14 | RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation | Shuhao Chen et.al. | 2605.14543v1 | null |
| 2026-05-14 | Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel | Ruihua Chen et.al. | 2605.14370v1 | null |
| 2026-05-14 | AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction | Yerin Park et.al. | 2605.14327v1 | null |
| 2026-05-14 | Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment | Xiaohua Wang et.al. | 2605.14242v1 | null |
| 2026-05-14 | Fusion-fission forecasts when AI will shift to undesirable behavior | Neil F. Johnson et.al. | 2605.14218v1 | null |
| 2026-05-14 | Towards Fine-Grained and Verifiable Concept Bottleneck Models | Yingying Fang et.al. | 2605.14210v1 | null |
| 2026-05-13 | Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR) | Marius S. Knorr et.al. | 2605.14126v1 | null |
| 2026-05-13 | ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows | Alvaro Lopez Pellicer et.al. | 2605.14113v1 | null |
| 2026-05-13 | Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening | Nishi Doshi et.al. | 2605.14108v1 | null |
| 2026-05-13 | A Benchmark for Early-stage Parkinson's Disease Detection from Speech | Terry Yi Zhong et.al. | 2605.14066v1 | null |
| 2026-05-13 | CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI | Xiaoyue Liu et.al. | 2605.13994v1 | null |
| 2026-05-13 | Neurosymbolic Auditing of Natural-Language Software Requirements | Bethel Hall et.al. | 2605.13817v1 | null |
| 2026-05-13 | Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography | Christos Chrysanthos Nikolaidis et.al. | 2605.13730v1 | null |
| 2026-05-13 | SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing | Marten J. Finck et.al. | 2605.17620v1 | null |
| 2026-05-13 | Cross Modality Image Translation In Medical Imaging Using Generative Frameworks | Giulia Romoli et.al. | 2605.13686v1 | null |
| 2026-05-13 | Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model | Riccardo Cavarra et.al. | 2605.13568v1 | null |
| 2026-05-13 | Generating synthetic computed tomography for radiotherapy: SynthRAD2025 challenge report | Viktor Rogowski et.al. | 2605.13555v1 | null |
| 2026-05-13 | RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation | Chengzhi Shen et.al. | 2605.13542v1 | null |
| 2026-05-13 | Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs | Jincai Huang et.al. | 2605.13530v1 | null |
| 2026-05-13 | Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin | Markus Wenzel et.al. | 2605.13345v1 | null |
| 2026-05-13 | VERA-MH: Validation of Ethical and Responsible AI in Mental Health | Luca Belli et.al. | 2605.13318v1 | null |
| 2026-05-13 | IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages | Shubham Kumar Nigam et.al. | 2605.13292v1 | null |
| 2026-05-13 | Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis | Bo Cui et.al. | 2605.13248v1 | null |
| 2026-05-13 | AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions | Ishika Agarwal et.al. | 2605.13149v1 | null |
| 2026-05-13 | Context Training with Active Information Seeking | Zeyu Huang et.al. | 2605.13050v2 | null |
| 2026-05-13 | An Agentic LLM-Based Framework for Population-Scale Mental Health Screening | Giuliano Lorenzoni et.al. | 2605.13046v1 | null |
| 2026-05-13 | RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems | Rohith Reddy Bellibatlu et.al. | 2605.12895v1 | null |
| 2026-05-13 | A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study | Jaime Yan et.al. | 2605.13905v1 | null |
| 2026-05-13 | Multimodal Hidden Markov Models for Persistent Emotional State Tracking | Anamika Ragu et.al. | 2605.12838v1 | null |
| 2026-05-13 | PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models | Sridhar Mahadevan et.al. | 2605.12835v1 | null |
| 2026-05-12 | Training Large Language Models to Predict Clinical Events | Benjamin Turtel et.al. | 2605.12817v1 | null |
| 2026-05-12 | Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces | Shixing Yu et.al. | 2605.12809v1 | null |
| 2026-05-12 | BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics | Helene Malyutina et.al. | 2605.12730v1 | null |
| 2026-05-12 | Reward Hacking in Rubric-Based Reinforcement Learning | Anas Mahmoud et.al. | 2605.12474v1 | null |
| 2026-05-12 | MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering | Rezarta Islamaj et.al. | 2605.12361v1 | null |
| 2026-05-12 | EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records | Saeed Shurrab et.al. | 2605.12335v1 | null |
| 2026-05-12 | Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study | M A Al-Masud et.al. | 2605.12241v1 | null |
| 2026-05-12 | Overtrained, Not Misaligned | Joel Schreiber et.al. | 2605.12199v1 | null |
| 2026-05-12 | To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands | Fangyi Yu et.al. | 2605.12120v1 | null |
| 2026-05-12 | Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection | Muhammad Aqeel et.al. | 2605.12069v1 | null |
| 2026-05-12 | Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL | Hyun Do Jung et.al. | 2605.12575v1 | null |
| 2026-05-12 | Spectral Vision Transformer for Efficient Tokenization with Limited Data | Alexandra G. Roberts et.al. | 2605.12026v1 | null |
| 2026-05-12 | DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction | Hongyi Tang et.al. | 2605.12574v1 | null |
| 2026-05-12 | AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers | Lei Wang et.al. | 2605.11901v1 | null |
Abstracts
Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
2605.18172v1 by Junyu Pan, Yansen Wang, Enze Zhang, Baoliang Lu, Weilong Zheng, Dongsheng Li
Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, visually-evoked EEG datasets remain scarce, leading existing methods to align neural signals mainly with abstract text, a lossy translation that may discard fine-grained perceptual information encoded in brain activity. We propose Generative Visual Grounding (GVG), a framework that visualizes the invisible by using an EEG-to-image generative model as a visual translator. Instead of forcing EEG into text alone, GVG hallucinates instance-specific proxy images for non-visual EEG, providing structured visual contexts that allow MLLMs to exploit their visual priors for clinical-state interpretation. We validate this idea on two MLLM backbones, GVG-X-Omni and GVG-Janus. Image-only alignment is already competitive: the lightweight GVG-X-Omni matches 1.7B-parameter text-aligned baselines while tuning only 170M parameters on a frozen 7B backbone. We further extend GVG-Janus with trimodal Image+Text alignment, where text supplies categorical semantic anchors and visual proxies enrich neural representations with perceptual details. Experiments show consistent gains in EEG understanding and visual generation, suggesting visual proxy grounding as an effective complement to textual alignment.
Domain Transfer Becomes Identifiable via a Single Alignment
2605.17918v1 by Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu
Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer -- requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.
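The non-identifiability described in this abstract can be seen in a toy example. The sketch below assumes a standard Gaussian source and a shifted Gaussian target; the maps `transfer_true` and `transfer_alt`, the rotation angle, and the anchor point are illustrative choices, not taken from the paper. A rotation is a measure-preserving automorphism of an isotropic Gaussian, so both maps push the source onto the same target distribution, yet they disagree on which target point each source point corresponds to. In this toy setting, one paired anchor is enough to tell the two candidates apart.

```python
import numpy as np

rng = np.random.default_rng(0)
shift = np.array([3.0, -1.0])          # hypothetical target mean
theta = np.pi / 2                      # hypothetical rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def transfer_true(x):
    """Toy ground-truth transfer: translate the source."""
    return x + shift

def transfer_alt(x):
    """Rotate first (an MPA of N(0, I)), then translate."""
    return x @ R.T + shift

source = rng.standard_normal((20000, 2))   # samples from N(0, I)
y_true = transfer_true(source)
y_alt = transfer_alt(source)

# Both maps match the target marginal (same mean and covariance up to sampling noise) ...
mean_gap = np.abs(y_true.mean(axis=0) - y_alt.mean(axis=0)).max()
cov_gap = np.abs(np.cov(y_true.T) - np.cov(y_alt.T)).max()

# ... but they pair source and target points differently.
anchor_x = np.array([1.0, 0.0])
anchor_y = transfer_true(anchor_x)          # the single paired sample
alt_error = np.linalg.norm(transfer_alt(anchor_x) - anchor_y)

print(f"mean gap={mean_gap:.3f}, cov gap={cov_gap:.3f}, anchor error={alt_error:.3f}")
```

Distribution matching alone cannot separate the two maps here, because the mean and covariance gaps are only sampling noise, while the anchor error is clearly nonzero.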
LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection
2605.17902v1 by Hanbyeol Park, Hyerim Bae
Stochastic-process-based degradation modeling is a core approach for estimating the distribution of remaining useful life (RUL); however, the selection of an appropriate stochastic process has not been sufficiently addressed. Existing model selection methods mainly rely on the statistical fit of the observed health indicator (HI) trajectory, but this approach may select a model that is inconsistent with the underlying degradation mechanism when the observation window is short or the signal is highly noisy. To address this issue, this paper proposes Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation (LAST-RAG). The proposed method uses both the observed HI trajectory and domain-specific context, and hierarchically conditions the candidate degradation model space based on theoretical and mechanical evidence retrieved from a local evidence bank. In addition, Rule-based Confidence Reasoning with Uncertain State (RCRUS) is introduced to prevent candidate models from being prematurely eliminated when hierarchical decisions are uncertain. Simulation-based experiments demonstrate that the proposed method outperforms statistical, prognostic, and uncertainty-aware baselines in both Wiener/gamma family classification and detailed degradation model classification. Ultimately, this study reframes degradation model selection from a purely statistical goodness-of-fit problem into a knowledge-conditioned decision-making problem that integrates observed data with domain knowledge.
Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training
2605.17879v1 by Guanliang Liu, Abhinandan Patni, Congzhu Lin, Zoe Zeng, Jack Wittmayer, Josh Wu, Ashvin Nihalani, Binxuan Huang, Yinghong Liu, Rory Na, Anthony Ko, Alexander Zhipa, Cong Cheng, Mi Sun, Vijay Rajakumar, Rejith George Joseph, Parthasarathy Govindarajen
Training frontier-scale foundation models involves coordinating tens of thousands of GPUs over multi-month runs, where even minor performance degradations can accumulate into substantial efficiency losses. Existing health-check mechanisms, such as NCCL tests or GPU burn-in, primarily focus on functional correctness and often fail to detect fail-slow behaviors that silently degrade system performance. In this paper, we present Guard, a scalable system for detecting stragglers and ensuring node health in large-scale training clusters. Guard combines lightweight online performance monitoring during training with an offline node-sweep mechanism that systematically evaluates and qualifies nodes before they participate in production workloads. This design enables Guard to detect both acute failures and long-running fail-slow behaviors that traditional diagnostics cannot capture. Deployed on large-scale foundation model pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7x, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure (MTTF), and significantly reduces operational and debugging overhead. These results demonstrate that proactive straggler detection and systematic node qualification are critical for maintaining stable and efficient large-scale training.
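To make the idea of fail-slow detection concrete, here is a minimal sketch of one generic approach: flag nodes whose mean step time is a robust outlier relative to the rest of the cluster. It is not Guard's algorithm, which the abstract does not specify. The function name `flag_stragglers`, the MAD-based score, and the 3.5 threshold are assumptions made for illustration.

```python
import numpy as np

def flag_stragglers(step_times, threshold=3.5):
    """Return indices of nodes whose mean step time is a robust outlier.

    step_times: array of shape (n_nodes, n_steps), seconds per step.
    threshold: cutoff on the MAD-based z-score (hypothetical default).
    """
    per_node = np.asarray(step_times, dtype=float).mean(axis=1)
    median = np.median(per_node)
    mad = np.median(np.abs(per_node - median))
    if mad == 0.0:
        # At least half the nodes sit exactly at the median, so the score
        # is undefined; this sketch flags nothing in that case.
        return []
    robust_z = 0.6745 * (per_node - median) / mad
    # Only slow nodes (positive deviation) are flagged.
    return [int(i) for i in np.where(robust_z > threshold)[0]]

# Five hypothetical nodes; node 3 is roughly 30% slower on every step.
times = np.array([
    [1.00, 1.02, 0.99, 1.01],
    [1.01, 1.00, 1.00, 1.02],
    [0.99, 1.01, 1.00, 1.00],
    [1.30, 1.28, 1.31, 1.29],
    [1.00, 0.98, 1.01, 1.00],
])
stragglers = flag_stragglers(times)
print(stragglers)
```

A median-based score is used so that a single slow node does not inflate the baseline it is compared against.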
Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale
2605.17775v1 by Jinghui Liu, Sarvesh Soni, Anthony Nguyen
Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.
Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes
2605.17755v1 by Jinghui Liu, Anthony Nguyen
Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
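The abstract reports micro F1 for rare codes and macro metrics for frequent codes. The sketch below shows how the two averages differ on a multi-label problem. The helper `f1_scores` and the tiny label matrices are made up for illustration and are not the paper's evaluation code.

```python
import numpy as np

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for binary indicator matrices of shape (n_samples, n_labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)

    # Micro: pool counts over all labels, so frequent codes dominate.
    micro_denom = 2 * tp.sum() + fp.sum() + fn.sum()
    micro = 2 * tp.sum() / micro_denom if micro_denom > 0 else 0.0

    # Macro: average per-label F1, so a rare code counts as much as a frequent one.
    denom = 2 * tp + fp + fn
    per_label = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(micro), float(per_label.mean())

# Label 0 is frequent; labels 1 and 2 each appear once. The model misses label 1.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 1]]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1={micro:.3f}, macro F1={macro:.3f}")
```

Missing one rare label costs little under micro averaging (10/11) but a full third under macro averaging (2/3), which is why long-tail performance is usually reported separately.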
Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science
2605.17746v1 by Yingjie Zhang, Chun Feng, Weizhang Zhu, Tianshu Sun
AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.
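One way to picture a typed actor-flow graph and a structural novelty score is sketched below. This is an assumption-laden illustration, not SEED's actual encoding. The `Flow` fields, the actor and flow type names, and the use of Jaccard overlap as the novelty measure are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    """A typed, directed edge between two typed actors (all names hypothetical)."""
    source: str      # e.g. "human:nurse"
    target: str      # e.g. "agent:triage_llm"
    kind: str        # e.g. "delegate", "feedback", "control"

def structural_novelty(candidate, priors):
    """1 minus the best Jaccard overlap between the candidate's flows and any prior design."""
    cand = set(candidate)
    best = 0.0
    for prior in priors:
        prior_set = set(prior)
        union = cand | prior_set
        if union:
            best = max(best, len(cand & prior_set) / len(union))
    return 1.0 - best

# A prior design: a nurse delegates to a triage agent and receives feedback.
baseline = [
    Flow("human:nurse", "agent:triage_llm", "delegate"),
    Flow("agent:triage_llm", "human:nurse", "feedback"),
]
# A candidate design that adds a physician control flow.
candidate = baseline + [Flow("human:physician", "agent:triage_llm", "control")]

novelty = structural_novelty(candidate, [baseline])
print(f"novelty={novelty:.3f}")
```

Encoding conditions as sets of typed edges makes designs directly comparable, which prose descriptions do not allow.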
Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis
2605.17729v1 by Danu Kim
Deep learning models have achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning method designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.
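A class-balanced replay memory can be sketched in a few lines. The version below is a generic illustration, assuming a simple "evict from the largest class" policy; the abstract does not describe the paper's exact replay rule. The class name `BalancedReplayBuffer` and the sample names are hypothetical.

```python
import random
from collections import defaultdict

class BalancedReplayBuffer:
    """Fixed-capacity buffer that evicts from the currently largest class."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.by_class = defaultdict(list)   # label -> stored samples
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(items) for items in self.by_class.values())

    def add(self, sample, label):
        self.by_class[label].append(sample)
        if len(self) > self.capacity:
            # Drop a random sample from the most populated class.
            largest = max(self.by_class, key=lambda c: len(self.by_class[c]))
            victims = self.by_class[largest]
            victims.pop(self.rng.randrange(len(victims)))

    def class_counts(self):
        return {label: len(items) for label, items in self.by_class.items()}

# An imbalanced stream: 100 majority-class samples, then 5 minority-class samples.
buffer = BalancedReplayBuffer(capacity=10)
for i in range(100):
    buffer.add(f"normal_{i}", label=0)
for i in range(5):
    buffer.add(f"pneumonia_{i}", label=1)
counts = buffer.class_counts()
print(counts)
```

Even though the stream is 20:1 imbalanced, the memory ends up holding five samples of each class, so replayed batches do not inherit the imbalance.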
PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship
2605.17679v1 by Zhiyuan Wang, Ariful Islam, Indrajeet Ghosh, Xinyu Chen, Katharine E. Daniel, Subigya Nepal, Philip Chow, Laura E. Barnes
Cancer survivors face elevated rates of depression, anxiety, and general emotional distress, yet the precise moments they most need support are often the moments when self-report is sparse, a phenomenon we term the diary paradox. Passive smartphone sensing offers a continuous, unobtrusive alternative, but prior sensing-based affect prediction has been limited by an accuracy ceiling, suggesting a bottleneck not only in available data, but in how behavioral signals are interpreted. We present PULSE, a system that shifts from fixed feature pipelines to agentic sensing investigation: LLM agents equipped with eight purpose-built tools autonomously query smartphone sensing data, compare current behavior against personalized baselines, and calibrate inferences through retrieval-augmented population-level comparisons. Rather than receiving pre-formatted feature summaries, agents decide which modalities to inspect, how far back to look, and how deeply to investigate, mirroring hypothesis-driven clinical reasoning. We evaluate PULSE through a 2×2 factorial design crossing reasoning architecture (structured vs. agentic) with data modality (sensing-only vs. with diary) on 50 cancer survivors from a longitudinal study. Agentic reasoning is the primary driver of performance: the agentic multimodal agent achieves a balanced accuracy of 0.743 for emotion regulation desire with diary and sensing data, while agentic agents predict intervention availability at 0.713 with passive sensing data only. These results suggest that agentic investigation may be a cornerstone for unlocking the clinical value of passive sensing, advancing the feasibility of proactive just-in-time mental health support.
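Summary: Cancer survivors face elevated rates of depression, anxiety, and general emotional distress, yet the exact moments when they most need support are often the moments when self-reports are sparse, a phenomenon we call the diary paradox. Passive smartphone sensing offers a continuous, unobtrusive alternative, but earlier sensing-based affect prediction has hit an accuracy ceiling, which points to a bottleneck not only in the available data but also in how behavioral signals are interpreted. We present PULSE, a system that moves from fixed feature pipelines to agentic sensing investigation: LLM agents equipped with eight purpose-built tools autonomously query smartphone sensing data, compare current behavior against personalized baselines, and calibrate their inferences through retrieval-augmented population-level comparisons. Instead of receiving pre-formatted feature summaries, the agents decide which modalities to inspect, how far back to look, and how deeply to investigate, mirroring hypothesis-driven clinical reasoning. We evaluate PULSE with a 2×2 factorial design that crosses reasoning architecture (structured vs. agentic) with data modality (sensing-only vs. with diary) on 50 participants from a longitudinal study of cancer survivors. Agentic reasoning is the main driver of performance: the agentic multimodal agent reaches a balanced accuracy of 0.743 for emotion regulation desire using diary and sensing data, while agentic agents predict intervention availability at 0.713 using passive sensing data alone. These results suggest that agentic investigation may be a cornerstone for unlocking the clinical value of passive sensing and advancing the feasibility of proactive, just-in-time mental health support.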
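The PULSE results are reported as balanced accuracy, which averages recall over classes so that a rare class counts as much as a common one. The sketch below shows how that metric is usually computed; the labels are made up for illustration and are not data from the study.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy labels (hypothetical): 1 = "wants support", 0 = "does not".
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
score = balanced_accuracy(y_true, y_pred)  # recall 2/3 for class 1, 1/1 for class 0
```

A classifier that always predicts the majority class scores 0.5 on a two-class problem under this metric, which is why it suits imbalanced outcomes.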
ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation
2605.17580v1 by Zhikang Chen, Yue Wang, Sen Cui, Yu Zhang, Changshui Zhang, Tianling Ren, Tingting Zhu
Electrocardiogram (ECG)-based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug-response scenarios and real-world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert-informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention-aware clinical decision support.
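Summary: Electrocardiogram (ECG)-based models perform well on diagnostic tasks, but they remain limited in modeling how cardiac dynamics evolve under external interventions.
In particular, existing approaches focus mainly on static prediction and lack mechanisms for capturing ECG changes under different pharmacological conditions.
In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology.
Moving beyond disjoint pipelines, our framework integrates physiological ordinary differential equation (ODE) priors into latent diffusion dynamics through energy regularization.
This structural constraint makes it possible to synthesize physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations.
Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that uses the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparison of candidate interventions.
We evaluate our method in diverse settings, including controlled drug-response scenarios and real-world clinical records.
Beyond standard waveform metrics, the experimental results show improved risk calibration and strong alignment with expert-informed treatment preferences.
These results establish our approach as a solid foundation for safe, intervention-aware clinical decision support.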
CasualSynth: Generating Structurally Sound Synthetic Data
2605.17528v1 by Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz
Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem (the systematic tendency of LLMs to override imposed causal facts with pre-training priors) and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal $α=0.05$ level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
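Summary: Large language models (LLMs) generate realistic synthetic data but cannot guarantee that their outputs respect the causal mechanisms of the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization and produces synthetic data that is both causally valid and linguistically rich. The framework has three phases. First, a Structural Causal Model (SCM), a set of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons by ancestral sampling, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem, the systematic tendency of LLMs to override imposed causal facts with pre-training priors, and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal $α=0.05$ level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework also supports principled interventional and counterfactual generation through noise retention and graph mutilation.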
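The first phase described above, ancestral sampling from an SCM, can be sketched as follows. This is only an illustration of the general technique: the three-variable graph, the probabilities, and every function name are hypothetical and are not taken from CausalSynth. The `interventions` argument shows the usual way graph mutilation is done, by bypassing a variable's structural equation.

```python
import random

# Hypothetical toy SCM: each variable is a function of its parents plus
# independent noise. Names and probabilities are illustrative only.
PARENTS = {"smoker": [], "disease": ["smoker"], "abnormal_xray": ["disease"]}


def _bernoulli(rng, p):
    return 1 if rng.random() < p else 0


EQUATIONS = {
    "smoker": lambda pa, rng: _bernoulli(rng, 0.3),
    "disease": lambda pa, rng: _bernoulli(rng, 0.4 if pa["smoker"] else 0.05),
    "abnormal_xray": lambda pa, rng: _bernoulli(rng, 0.9 if pa["disease"] else 0.1),
}


def topological_order(parents):
    """Return the variables ordered so that parents precede children."""
    order, seen = [], set()

    def visit(v):
        if v in seen:
            return
        for p in parents[v]:
            visit(p)
        seen.add(v)
        order.append(v)

    for v in parents:
        visit(v)
    return order


def ancestral_sample(rng, interventions=None):
    """Draw one causal skeleton; `interventions` forces values (graph mutilation)."""
    interventions = interventions or {}
    values = {}
    for v in topological_order(PARENTS):
        if v in interventions:
            values[v] = interventions[v]  # do(v = x): skip the structural equation
        else:
            pa = {p: values[p] for p in PARENTS[v]}
            values[v] = EQUATIONS[v](pa, rng)
    return values


rng = random.Random(0)
skeleton = ancestral_sample(rng)
forced = ancestral_sample(rng, interventions={"disease": 1})
```

In a pipeline of the kind the abstract describes, a skeleton such as `skeleton` would then be handed to the LLM realizer, and the verifier would check that the generated text still encodes the same variable values.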
Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization
2605.17379v1 by Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks, using a challenge-oriented evaluation protocol focused on expert-driven texts and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduces training time by $35-55\%$ over continual pretraining and reduces parameter counts by up to $37\%$ relative to expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.
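Summary: Large language models pretrained on general-domain corpora often tokenize inefficiently when applied to specialized domains. Although continual pretraining for domain adaptation partly alleviates the performance degradation, it does not resolve the underlying vocabulary mismatch. To address this gap, we introduce a targeted, parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate the approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks, using a challenge-oriented evaluation protocol focused on expert-driven texts and summaries, which typically contain a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm improves the overall quality of the summarization model by increasing the semantic similarity between generated summaries and their references. In addition, the adapted model produces summaries that include more appropriate novel and domain-specific words, leading to better coherence, relevance, and faithfulness. We further observe that the proposed approach significantly reduces training time, by $35-55\%$ compared with continual pretraining, and reduces parameter counts by up to $37\%$ relative to expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.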
CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings
2605.17370v1 by Qixuan Hu, Shuchang Ye, Xumou Zhang, Anastasia Serafimovska, Anastasia Suraev, Amit Saha, Ping-hsiu Lin, Sydney Su, Usman Naseem, Adam G. Dunn, Jinman Kim
Cognitive behavioural therapy (CBT) is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.
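Summary: Cognitive behavioural therapy (CBT) is widely used to help patients understand and manage psychological distress.
It is usually delivered through spoken conversation, in which therapists attend not only to what patients say but also to how they say it, because these cues help therapists decide how to respond and adapt treatment.
Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints.
This creates a blind spot: text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on that mismatch to understand patient distress.
We introduce CBT-Audio, a dataset for evaluating how well audio language models estimate patient distress from spoken CBT sessions.
CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset.
We evaluate 10 open-source audio language models under three input conditions, in which models receive only the patient audio, only the transcript, or both.
Our results show that audio can provide useful information beyond text, especially when combined with transcripts.
Adding audio to transcript input improves distress estimation over the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge.
CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.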
Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification
2605.17308v1 by Yang Wu, Xiaoyan Yuan, Hau-San Wong, Xiping Hu
Electrocardiogram (ECG) diagnosis in clinical practice relies on structured reasoning over multiple hierarchical aspects, including cardiac rhythm, conduction properties, waveform morphology, and overall diagnostic impression. However, most existing approaches predict labels directly from ECG signals without explicit clinical reasoning, resulting in opaque decisions that lack clinical alignment. To bridge this gap, we propose CardioThink, a physician-inspired multimodal large language model (MLLM) framework that explicitly models the diagnostic reasoning process through human-interpretable intermediate stages (rhythm, conduction, morphology, and impression) to derive final classification results. Furthermore, we introduce Structured Set Policy Optimization (SSPO) to jointly optimize adherence to this structured reasoning format and the accuracy of variable-size diagnostic sets, without requiring manually annotated reasoning traces. Extensive experiments on diverse ECG benchmarks demonstrate the significant superiority of our approach in diagnostic accuracy, while simultaneously providing interpretable clinical reasoning. Notably, reasoning quality evaluations confirm that SSPO substantially enhances the clinical validity of the generated rationales. These findings reveal that moving beyond direct label prediction toward structured reasoning offers a more clinically aligned direction for future ECG modeling.
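Summary: Clinical electrocardiogram (ECG) diagnosis relies on structured reasoning over several hierarchical aspects, including cardiac rhythm, conduction properties, waveform morphology, and the overall diagnostic impression.
However, most existing approaches predict labels directly from ECG signals without explicit clinical reasoning, which leads to opaque decisions that lack clinical alignment.
To bridge this gap, we propose CardioThink, a physician-inspired multimodal large language model (MLLM) framework that explicitly models the diagnostic reasoning process through human-interpretable intermediate stages (rhythm, conduction, morphology, and impression) to derive the final classification.
We also introduce Structured Set Policy Optimization (SSPO), which jointly optimizes adherence to this structured reasoning format and the accuracy of variable-size diagnostic sets, without requiring manually annotated reasoning traces.
Extensive experiments on diverse ECG benchmarks show that our approach is significantly more accurate diagnostically while also providing interpretable clinical reasoning.
Notably, reasoning quality evaluations confirm that SSPO substantially improves the clinical validity of the generated rationales.
These findings indicate that moving beyond direct label prediction toward structured reasoning offers a more clinically aligned direction for future ECG modeling.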
How Do Electrocardiogram Models Scale?
2605.17276v1 by Jiawei Li, Fabio Bonassi, Ming Jin, Stefan Gustafsson, Johan Sundström, Thomas B. Schön, Antônio H. Ribeiro
While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over $120$ models (ranging from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.
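Summary: While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always show consistent downstream gains as the model size or pre-training dataset size of ECG models increases, leaving largely unanswered the exact roles of architectural inductive biases, pre-training paradigms, and the improvements to expect with size. In this work, we systematically investigate neural and loss-to-loss scaling laws in the ECG domain. By pre-training over $120$ models (from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We find that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.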
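Scaling-law studies of this kind typically fit a power law relating loss to model or data size. The sketch below fits loss = a · N^(-b) by least squares in log-log space. The pure power-law form (with no irreducible-loss term) is an assumption made here for brevity, not the functional form used in the paper, and the data points are synthetic.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(size); returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic points generated from loss = 5 * N^-0.2 (not results from the paper).
sizes = [2e4, 2e5, 2e6, 2e7, 2e8]
losses = [5.0 * s ** -0.2 for s in sizes]
a, b = fit_power_law(sizes, losses)
```

Comparing the fitted exponent `b` across architectures or training paradigms is one simple way to express claims such as "X is more parameter-efficient than Y".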
Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability
2605.17236v1 by Nisreen Albzour, Sarah S. Lam
Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, with improved interpretability as a result. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9% to 95.2% cross-validation accuracy, with random horizontal flipping and class weighting (0.7 x 1.3) identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, in line with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, fulfilling both the clinical performance and the transparency requirements essential for medical AI deployment.
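Summary: Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which also improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The best configuration reached 94.9% to 95.2% cross-validation accuracy, with random horizontal flipping and class weighting (0.7 x 1.3) identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, including nuclear regions, cell boundaries, and chromatin texture, which match cytopathological criteria. These findings indicate that Vision Transformers can provide accurate and interpretable decision support for cervical cancer screening, meeting the clinical performance and transparency requirements of medical AI deployment.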
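Class weighting of the kind mentioned above is commonly implemented as a weighted cross-entropy loss. The sketch below is a minimal pure-Python version. The abstract gives only the pair "0.7 x 1.3", so assigning 1.3 to the minority (normal) class and 0.7 to the majority (abnormal) class is an assumption, as are the probabilities used.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class), normalised by the total weight."""
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total_loss += -w * math.log(p[y])
        total_weight += w
    return total_loss / total_weight


# Assumption: 0 = normal (minority, weight 1.3), 1 = abnormal (majority, weight 0.7).
CLASS_WEIGHTS = {0: 1.3, 1: 0.7}
probs = [[0.8, 0.2], [0.3, 0.7], [0.4, 0.6]]  # made-up softmax outputs
labels = [0, 1, 0]
loss = weighted_cross_entropy(probs, labels, CLASS_WEIGHTS)
uniform = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
```

With these toy numbers the weighted loss is larger than the unweighted one, because the poorly predicted sample belongs to the up-weighted class.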
UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
2605.17140v1 by Shiv Ghosh, Junayd Lateef, Chih-Hua Liu, Yannan Yu, Andreas M. Rauschecker, Madhumita Sushil
Brain tumor diagnosis is largely dependent on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process requires advanced neuro-radiology training, poses substantial cognitive load, and is highly time-consuming. Despite increasing demands in radiology, this expertise is difficult to scale, straining the current health systems. Vision-Language Models (VLMs) provide an opportunity to reduce this burden through a semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underutilized in neuro-oncology due to a lack of specialized benchmarks for evaluating them. We introduce a clinically relevant visual question answering (VQA) benchmark -- the UCSF-PDGM-VQA dataset -- consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art vision-language models (VLMs) and one large language model on this dataset. We find that current models are incapable of effectively processing multi-sequence, 3-dimensional MRI scans, thus resulting in a suppression of visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in current model reliability and safety within clinical settings, necessitating the development of robust, domain-specific VLMs.
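Summary: Brain tumor diagnosis depends largely on Magnetic Resonance Imaging (MRI) evaluation, which requires radiologists to synthesize thousands of images across multiple 3D sequences and longitudinal studies. This process demands advanced neuro-radiology training, imposes a substantial cognitive load, and is highly time-consuming. Despite growing demand in radiology, this expertise is difficult to scale, which strains current health systems. Vision-Language Models (VLMs) offer an opportunity to reduce this burden through semi-automated, interactive interpretation of complex brain MRIs. However, they are currently underused in neuro-oncology because specialized benchmarks for evaluating them are lacking. We introduce a clinically relevant visual question answering (VQA) benchmark, the UCSF-PDGM-VQA dataset, consisting of 2,387 QA pairs from 473 glioma-related MRI studies in the public UCSF-PDGM dataset. We further establish a performance baseline for six state-of-the-art VLMs and one large language model on this dataset. We find that current models cannot effectively process multi-sequence, 3-dimensional MRI scans, which leads to suppressed visual features and over-reliance on language priors, causing modality collapse. These findings underscore a critical deficiency in the reliability and safety of current models in clinical settings and call for the development of robust, domain-specific VLMs.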
SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning
2605.17101v1 by Yongfeng Huang, Ruiying Chen, James Cheng
Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.
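Summary: Retrieval-Augmented Generation (RAG) is widely used to mitigate risks such as hallucinations and outdated knowledge in medical question answering, yet its predominantly single-round, static retrieval paradigm does not match the multi-stage process of clinical reasoning. This compressed workflow causes two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, which makes it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: a single reasoning chain is overloaded with the heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to rebuild the workflow through task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves on the strongest baseline by +6.46 accuracy points on average, measured per backbone.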
Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench
2605.17079v1 by Tianyu Wang, Jiajun Li, Jianghao Lin
LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.
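Summary: LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response.
However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers show in public discourse.
We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families.
Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1%, with 98.4% agreement between pointwise judge decisions and human-majority labels.
Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks.
These failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition.
A direct structured reasoning prompt lowers coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset.
ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.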
AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation
2605.17071v1 by Shiying Yu, Jielei Wang, Guoming Lu
Radiology report generation (RRG) aims to automatically produce clinically accurate textual reports from medical images. Existing methods predominantly rely on autoregressive (AR) language models, whose causal dependency structure restricts generation to a unidirectional left-to-right process. This paradigm can induce sequence bias, where models tend to follow stereotypical token orders and high-frequency report templates rather than fully grounding generation in image-specific evidence. In this paper, we propose AnchorDiff, the first masked-diffusion framework for RRG that integrates knowledge-graph-derived clinical anchors into diffusion language modeling. By leveraging bidirectional context and iterative refinement, AnchorDiff mitigates the limitations of fixed-order autoregressive decoding. Specifically, we introduce a topology-aware training strategy that uses RadGraph-derived entity hierarchies to assign clinically important tokens differentiated masking protection and loss weights. We further design an inference-time rewriting strategy that detects unstable committed tokens through perturbation-based testing and selectively revises them during denoising. Extensive experiments on the MIMIC-CXR and MIMIC-RG4 benchmarks demonstrate that AnchorDiff achieves state-of-the-art (SOTA) performance, showing the effectiveness of clinically anchored masked diffusion for radiology report generation.
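Summary: Radiology report generation (RRG) aims to automatically produce clinically accurate textual reports from medical images. Existing methods rely mainly on autoregressive (AR) language models, whose causal dependency structure restricts generation to a unidirectional left-to-right process. This paradigm can induce sequence bias, in which models tend to follow stereotypical token orders and high-frequency report templates rather than grounding generation fully in image-specific evidence. In this paper, we propose AnchorDiff, the first masked-diffusion framework for RRG that integrates knowledge-graph-derived clinical anchors into diffusion language modeling. By using bidirectional context and iterative refinement, AnchorDiff mitigates the limitations of fixed-order autoregressive decoding. Specifically, we introduce a topology-aware training strategy that uses RadGraph-derived entity hierarchies to give clinically important tokens differentiated masking protection and loss weights. We further design an inference-time rewriting strategy that detects unstable committed tokens through perturbation-based testing and selectively revises them during denoising. Extensive experiments on the MIMIC-CXR and MIMIC-RG4 benchmarks show that AnchorDiff achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of clinically anchored masked diffusion for radiology report generation.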
PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts
2605.17028v1 by Khizar Hussain, Murat Kantarcioglu
Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call TxTemb exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning twenty-two detection methods, twelve open-source models spanning six architectural families, and six corpora. We further introduce DRIFT, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field's reported progress on hallucination detection is substantially explained by benchmark construction artifacts in widely used corpora, and that the majority of established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.
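Summary: Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that the problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline that we call TxTemb exploits this to reach near-perfect detection scores without any access to model internals. To measure how much genuine detection capability remains once these artifacts are controlled, we run a large-scale evaluation covering twenty-two detection methods, twelve open-source models from six architectural families, and six corpora. We further introduce DRIFT, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field's reported progress on hallucination detection is largely explained by benchmark construction artifacts in widely used corpora, and that most established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.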
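The leakage described above can be illustrated with a detector that never looks at the model at all. The sketch below is not the paper's TxTemb baseline: it swaps in a bag-of-words cosine similarity as a stand-in for whatever text representation TxTemb uses, and the prompt and answers are invented. It only shows why an answer embedded in the prompt makes "detection" trivial.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; a crude stand-in for a text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def leak_score(prompt, answer):
    """Higher overlap with the prompt suggests the answer copies leaked ground truth."""
    return cosine_similarity(bag_of_words(prompt), bag_of_words(answer))


# Invented example in which the prompt already contains the correct answer.
prompt = ("Context: The capital of Australia is Canberra. "
          "Question: What is the capital of Australia?")
faithful = "The capital of Australia is Canberra."
hallucinated = "Sydney."
faithful_score = leak_score(prompt, faithful)
hallucinated_score = leak_score(prompt, hallucinated)
```

Thresholding `leak_score` separates the two answers here without using any model internals, which is the kind of artifact the abstract argues must be controlled before detection results are compared.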
Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings
2605.16993v1 by Anthonio Oladimeji Gabriel, Ahmad Rufai Yusuf
Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.
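Summary: Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we show that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies, including Gaussian smoothing and ensemble voting, failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models showed significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.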
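The fast gradient attack referred to above perturbs an input by a small step in the direction that increases the loss. The sketch below applies the sign (L-infinity) variant to a toy logistic-regression classifier, chosen because its input gradient has a closed form. The model, weights, and input are made up; the study itself attacks a DenseNet121, and which norm it uses is not stated in the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of the positive class under logistic regression."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def sign(v):
    return (v > 0) - (v < 0)


def fgm_perturb(weights, bias, x, y, epsilon):
    """Shift each feature by epsilon in the direction that increases the loss."""
    p = predict_proba(weights, bias, x)
    grad = [(p - y) * w for w in weights]  # d(cross-entropy)/dx for this model
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Toy model and input (hypothetical); epsilon matches the value quoted above.
weights, bias = [2.0, -1.5, 0.5], 0.1
x, y = [0.2, 0.4, 0.9], 1
x_adv = fgm_perturb(weights, bias, x, y, epsilon=0.021)
p_clean = predict_proba(weights, bias, x)
p_adv = predict_proba(weights, bias, x_adv)
```

Each feature moves by at most `epsilon`, yet the predicted probability of the true class drops; with high-dimensional images the same per-pixel budget accumulates over many pixels, which is why such small perturbations can flip predictions.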
Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons
2605.16975v1 by Wei Tang, Jinpei Han, Kangning Cui, Mattia Carletti, Fredrik K. Gustafsson, Shreyank N Gowda, Patitapaban Palo, Anshul Thakur, Lei Clifton, Jean-michel Morel, Raymond H. Chan, David A. Clifton, Xiao Gu
Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments have demonstrated strong transferability across a range of clinical applications. However, many real-world applications produce recordings that are typically longer and that vary in duration at inference time. These 10-second models have no built-in way to combine information across time. Extending them to longer horizons introduces two challenges: structural incompatibilities arising from input-length disparities, and semantic challenges that limit meaningful temporal aggregation. We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone. Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling. Experiments on multiple long-horizon ECG tasks, datasets, and foundation model backbones demonstrate that our method enables robust long-horizon extension from pretrained snapshot models, consistently outperforming sliding-window and pooling-based baselines with strong parameter efficiency.
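Summary: Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments have shown strong transferability across a range of clinical applications.
However, many real-world applications produce recordings that are usually longer and that vary in duration at inference time.
These 10-second models have no built-in way to combine information across time.
Extending them to longer horizons raises two challenges: structural incompatibilities caused by input-length disparities, and semantic challenges that limit meaningful temporal aggregation.
We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone.
Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling.
Experiments on multiple long-horizon ECG tasks, datasets, and foundation model backbones show that our method enables robust long-horizon extension from pretrained snapshot models, consistently outperforming sliding-window and pooling-based baselines with strong parameter efficiency.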
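For context, the sliding-window and pooling baseline that the proposed method is compared against can be sketched as follows. This illustrates the baseline idea only, not the paper's plug-in module: `toy_encoder` is a hypothetical stand-in for a frozen 10-second foundation model, and the signal and sampling rate are made up.

```python
def sliding_windows(signal, window, stride):
    """Split a 1-D signal into fixed-length windows; a short tail is dropped."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, stride)]


def toy_encoder(segment):
    """Stand-in for a frozen 10-second model: returns a 2-D 'embedding'."""
    mean = sum(segment) / len(segment)
    return [mean, max(segment) - min(segment)]


def mean_pool(embeddings):
    """Average the window embeddings dimension by dimension."""
    n = len(embeddings)
    return [sum(e[d] for e in embeddings) / n for d in range(len(embeddings[0]))]


def embed_long_recording(signal, sample_rate_hz, window_seconds=10):
    window = sample_rate_hz * window_seconds
    segments = sliding_windows(signal, window, stride=window)  # non-overlapping
    return mean_pool([toy_encoder(s) for s in segments])


# 35 seconds of a made-up signal at 4 Hz: three full 10-second windows fit.
signal = [float(i % 8) for i in range(35 * 4)]
windows = sliding_windows(signal, window=40, stride=40)
embedding = embed_long_recording(signal, sample_rate_hz=4)
```

Mean pooling discards the order of the windows, which is one concrete form of the "no built-in way to combine information across time" limitation noted in the abstract.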
Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects
2605.16966v1 by Zhentao Tan, Yuze Hao, Boyi Zou, Mingsheng Long, Yi Yang, Gang Bao
Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research due to its broad significance across a wide range of real-world applications. Inverse PDE problems arise across medical imaging, geophysics, materials science, and aerodynamics, where the goal is to infer hidden causes, design structures, or control physical states. In this paper, we provide a comprehensive review of recent advances in solving inverse PDE problems using artificial intelligence (AI). We first introduce the basic formulation, key challenges, and traditional numerical foundations of inverse PDE problems, and then organize them into three major categories: inverse problems, inverse design, and control problems. For each category, we further present methodological paradigms and review representative state-of-the-art approaches from recent years. We then summarize representative applications across scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to provide the first unified and systematic perspective on AI for inverse PDE problems, demonstrating how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.
From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction
2605.16927v1 by Pujun Feng, Xiaoyu Guo, Seyed Ehsan Saffari, Min Hun Lee, Siew-Kei Lam, Erik Cambria, Xibin Sun, Yangtao Zhou, Tong Yang, Xiaoyu Zhang, Tao Tan, Yue Sun, Bin Cui
Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI--methods estimating patient-specific longitudinal disease evolution and assessing trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.
PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes
2605.16860v1 by Phat Tran, Neville Mehta, Clara Mosquera-Lopez, Robert H. Dodier, Lizhong Chen, Peter G. Jacobs
Accurate long-horizon glucose forecasting is critical for automated insulin delivery systems, which help people with type 1 diabetes (T1D) manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks suffer from systematic negative bias at longer horizons due to error compounding, while purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. We propose PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins to identify the best-fitting physiological match from a 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected as exogenous covariates into both the encoder and decoder of the Seq2Seq LSTM. This simultaneous 48-step prediction strategy eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that bounds long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, PhysioSeq2Seq achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL over the recursive LSTM and reducing mean absolute error by 28.62 mg/dL over the ODE-based digital twin. These results show that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in T1D.
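The twin-matching step described above can be sketched as a nearest-trace search. This is only an illustration under stated assumptions: the abstract does not name the matching criterion, so root-mean-square error is assumed here, and the function names `rmse` and `match_twin` as well as the toy traces are hypothetical.

```python
import math
from typing import List, Sequence


def rmse(a: Sequence[float], b: Sequence[float]) -> float:
    """Root-mean-square error between two equally long glucose traces (mg/dL)."""
    if len(a) != len(b) or not a:
        raise ValueError("traces must be non-empty and the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def match_twin(cgm_history: Sequence[float],
               twin_traces: List[Sequence[float]]) -> int:
    """Return the index of the simulated twin trace closest to the CGM history.

    Assumption: "best-fitting" means lowest RMSE over the history window.
    """
    if not twin_traces:
        raise ValueError("twin population is empty")
    errors = [rmse(cgm_history, trace) for trace in twin_traces]
    return min(range(len(errors)), key=errors.__getitem__)


# Toy example: three samples instead of a full 3-hour window,
# and three twins instead of a population of 300.
history = [100.0, 110.0, 120.0]
population = [
    [90.0, 95.0, 100.0],
    [101.0, 109.0, 121.0],
    [150.0, 150.0, 150.0],
]
best = match_twin(history, population)
print("matched twin index:", best)
```

In the setting the abstract describes, the ODE state variables of the matched twin would then be passed to the Seq2Seq LSTM as covariates; that part is not shown here.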
VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment
2605.16775v1 by Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar
Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabeled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer's disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific pretraining, a step towards generalizable and clinically viable 3D models.
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
2605.16679v1 by Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao
End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density (decisions must be grounded in a large library of medical, insurance, and operational rules); multi-role composition (a single task requires the agent to play multiple roles with handoffs); and multilateral interaction (intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach). We introduce $χ$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status by making tool calls and writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
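The strict pass^3 figure above is not defined in the abstract. A common reading of pass^k in agent benchmarks is the probability that k independent trials of the same task all succeed. The sketch below assumes that reading and uses the usual combinatorial estimator C(c, k) / C(n, k) per task, where c of n recorded trials succeeded; the function name and data layout are hypothetical.

```python
from math import comb
from typing import Dict, List


def pass_hat_k(trials_by_task: Dict[str, List[bool]], k: int) -> float:
    """Estimate pass^k: the chance that k sampled trials of a task all succeed.

    For a task with n recorded trials and c successes, the per-task estimate
    is C(c, k) / C(n, k); the benchmark score is the mean over tasks.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not trials_by_task:
        raise ValueError("no tasks given")
    total = 0.0
    for task_id, trials in trials_by_task.items():
        n, c = len(trials), sum(trials)
        if n < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        total += comb(c, k) / comb(n, k)  # comb(c, k) is 0 when c < k
    return total / len(trials_by_task)


# Toy example with two tasks and four trials each.
runs = {
    "prior_auth_case": [True, True, True, True],
    "care_mgmt_case": [True, True, True, False],
}
print(pass_hat_k(runs, k=1), pass_hat_k(runs, k=3))
```

When exactly k trials are recorded per task, the estimator reduces to "all k trials passed", which is why pass^k scores fall so quickly as k grows.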
PrivScope: Task-scoped Disclosure Control for Hybrid Agentic Systems
2605.16630v1 by Shafizur Rahman Seeam, Zhengxiong Li, Zhiyuan Yu, Yimin Chen, Yidan Hu
Hybrid local-cloud agents enrich user requests with context from persistent working state before delegating capability-intensive subtasks to a cloud language model (CLM). While this enrichment can improve task success, it also exposes unnecessary information in the cloud-bound payload, including task-irrelevant context, carryover from prior workflows, and overly specific sensitive details, resulting in over-disclosure. Existing solutions either isolate workflows to limit cross-workflow leakage or apply general-purpose sanitization that does not reason over LC-assembled payload scope. We present PrivScope, a trusted on-device payload governor that enforces task-scoped disclosure at the local-CLM boundary, without requiring cloud-side changes. Its key idea: sensitive information should reach the cloud only when required for the delegated subtask, and then only in the least revealing form preserving utility. PrivScope extracts disclosure units from the assembled payload and keeps direct identifiers and account-linked values on device. The remaining units pass through cloud-necessity control, which determines what is actually needed; units that must reach the cloud are abstracted to the least-specific representation sufficient for the task. On 100 medical-booking workflows across three commercial CLMs, PrivScope eliminates profile leakage (0.0% vs. 17.7%), more than halves attacker re-identification (23.1% vs. 64.3%), and achieves the highest candidate recall on every CLM tested while preserving task success close to the unprotected baseline on GPT-4o-mini and Gemini 2.5 Flash. Gains hold across five local backbones and add only seconds of on-device latency on commodity hardware.
Isotonic Survival Regression: Calibrated Survival Distributions from Deep Cox Models
2605.16571v1 by Anchit Jain, Kevin Zhang, Stephen Bates
Time-to-event data is widespread across the life sciences and engineering, but it is typically encountered together with censoring, which complicates the application of standard machine learning methods. Deep Cox models have emerged as a popular method for analyzing time-to-event data because they gracefully handle censoring and can be used with unstructured data such as clinical text reports, genomic sequences, and pathology images. However, their predicted survival probabilities are often poorly calibrated, thus limiting their practical utility. In this paper, we propose a novel post hoc calibration method for Deep Cox models that uses isotonic regression to refine predicted survival probabilities without affecting discriminative power. We establish favorable theoretical guarantees, including a double-robustness property and asymptotic calibration. Experiments on synthetic and real-world clinical data demonstrate the empirical effectiveness of our method.
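To illustrate the isotonic-regression idea, here is a deliberately simplified sketch. It recalibrates predicted survival probabilities at one fixed time horizon with the pool-adjacent-violators algorithm (PAVA) and assumes every subject's status at that horizon is known, so it ignores censoring, which the paper's method is designed to handle. The function names and toy data are hypothetical, and tied predictions are not given special treatment.

```python
from typing import List, Sequence


def pava(y: Sequence[float]) -> List[float]:
    """Least-squares non-decreasing fit to y (pool-adjacent-violators)."""
    blocks: List[List[float]] = []  # each block: [sum of values, count]
    for value in y:
        blocks.append([float(value), 1.0])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


def calibrate_at_horizon(predicted: Sequence[float],
                         survived: Sequence[int]) -> List[float]:
    """Map predicted survival probabilities to calibrated ones at one horizon.

    predicted[i] is the model's survival probability for subject i;
    survived[i] is 1 if subject i was event-free at the horizon, else 0.
    """
    if len(predicted) != len(survived):
        raise ValueError("inputs must have the same length")
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    fitted = pava([survived[i] for i in order])
    calibrated = [0.0] * len(predicted)
    for rank, i in enumerate(order):
        calibrated[i] = fitted[rank]
    return calibrated


print(calibrate_at_horizon([0.9, 0.2, 0.6, 0.4], [1, 1, 0, 0]))
```

Because the fitted map is non-decreasing in the original prediction, the ordering of subjects is never reversed, which is the intuition behind recalibrating without hurting discrimination.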
Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces
2605.16545v1 by Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn
After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.
Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
2605.16238v1 by Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi
Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
2605.16215v1 by Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley
Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.
Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning
2605.15894v1 by Ranjith Chodavarapu
Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.
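The two uncertainty terms named above can be computed in closed form from Dirichlet concentration parameters. The sketch below uses the standard subjective-logic definitions (evidence e_k = alpha_k - 1, belief b_k = e_k / S with S the sum of all alpha, vacuity = K / S, and dissonance built from pairwise belief balance). Whether the paper uses exactly these formulas is an assumption, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def balance(b_j: float, b_k: float) -> float:
    """Relative balance of two belief masses: 1 when equal, 0 if either is zero."""
    if b_j <= 0.0 or b_k <= 0.0:
        return 0.0
    return 1.0 - abs(b_j - b_k) / (b_j + b_k)


def dirichlet_uncertainty(alpha: Sequence[float]) -> Tuple[float, float]:
    """Return (vacuity, dissonance) for Dirichlet concentrations alpha >= 1."""
    num_classes = len(alpha)
    if num_classes < 2 or any(a < 1.0 for a in alpha):
        raise ValueError("need at least two classes with alpha >= 1")
    strength = sum(alpha)
    belief = [(a - 1.0) / strength for a in alpha]
    vacuity = num_classes / strength  # high when total evidence is low
    dissonance = 0.0                  # high when strong evidence conflicts
    for i, b_i in enumerate(belief):
        others = [b_j for j, b_j in enumerate(belief) if j != i]
        denom = sum(others)
        if b_i > 0.0 and denom > 0.0:
            dissonance += b_i * sum(b_j * balance(b_j, b_i) for b_j in others) / denom
    return vacuity, dissonance


# Three classes standing in for Light, Moderate, Heavy.
print(dirichlet_uncertainty([1.0, 1.0, 1.0]))    # no evidence at all
print(dirichlet_uncertainty([11.0, 11.0, 1.0]))  # strong but conflicting evidence
print(dirichlet_uncertainty([21.0, 1.0, 1.0]))   # strong, consistent evidence
```

Both quantities come from a single set of network outputs, which is why no Monte Carlo sampling is needed.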
BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation
2605.15736v1 by Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen
Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning
Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery
2605.16458v1 by Weijun Ma
Image restoration models are increasingly applied to degraded medical scans, but in safety-sensitive settings they must improve image quality without uncontrolled modification of clinically important regions. This is especially relevant for intracranial CT and CT angiography (CTA), where small vessels and aneurysm-relevant cues lie near high-contrast anatomical boundaries. We frame medical image restoration as a conservative AI problem and present a residual-bounded 2.5D restoration framework trained on synthetically degraded CT/CTA inputs. The model adds a learned residual to the original center slice through an edit-control map that limits the magnitude and spatial extent of modification. We evaluate the framework using an aneurysm-relevant image-recovery matrix, paired comparison against a Gaussian baseline, Monte Carlo stability testing, anatomical localization of meaningful edits, and external evaluation on low-dose CT. On 50 out-of-distribution CT-CTA cases, the bounded model achieved a mean target gain of 0.0635, a mean PSNR of 37.51 dB, and an iatrogenic-edit rate of 4.0%. Across 1,000 Monte Carlo runs, it remained net positive in 85.4% of runs with no stably negative cases. On external low-dose CT, the model was directionally beneficial and produced a substantially smaller modification footprint than the baseline. Meaningful edits concentrated in brain and skull regions while unrelated anatomy showed negligible change. These findings provide preliminary computational evidence that residual-bounded restoration is feasible in boundary-sensitive vascular imaging, but they do not establish clinical diagnostic performance and require expert review and prospective validation before clinical use.
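The residual-plus-edit-control idea can be illustrated with a few lines of arithmetic. The abstract does not say how the bound is enforced, so this sketch makes two assumptions: the residual is hard-clipped to a maximum magnitude, and the edit-control map holds per-pixel weights in [0, 1]. All names (`bounded_restore`, `max_edit`) are hypothetical, and plain nested lists stand in for image tensors.

```python
from typing import List

Image = List[List[float]]


def bounded_restore(center: Image, residual: Image,
                    edit_map: Image, max_edit: float) -> Image:
    """Add a bounded, spatially gated residual to the original slice.

    Each output pixel is x + m * clip(r, -max_edit, max_edit) with m in [0, 1],
    so no pixel can move more than max_edit away from its original value.
    """
    if max_edit < 0.0:
        raise ValueError("max_edit must be non-negative")
    restored: Image = []
    for row_x, row_r, row_m in zip(center, residual, edit_map):
        out_row = []
        for x, r, m in zip(row_x, row_r, row_m):
            r_bounded = max(-max_edit, min(max_edit, r))  # limit magnitude
            m_bounded = max(0.0, min(1.0, m))             # limit spatial extent
            out_row.append(x + m_bounded * r_bounded)
        restored.append(out_row)
    return restored


# Toy 1x3 slice: large edit allowed, small edit suppressed, negative edit halved.
center = [[0.5, 0.5, 0.5]]
residual = [[0.3, -0.01, -0.4]]
edit_map = [[1.0, 0.0, 0.5]]
print(bounded_restore(center, residual, edit_map, max_edit=0.1))
```

Under these assumptions the worst-case change per pixel is known before the model runs, which is the kind of guarantee a conservative restoration setting asks for.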
Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign
2605.16452v1 by Jiahui Li, Yida Zhang, Zixuan Zeng, Jiayu Chen, Yingjian Song, Yin Xiao, Nishan Dong, Junjie Lu, Younghoon Kwon, Xiang Zhang, Jin Lu, Wenzhan Song, Fei Dou
Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model's self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities (ECG, PPG, BCG, and BSG) spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.
Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating
2605.16446v1 by Hangchun Liang, Changchun Li
Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) require statistical fairness guarantees. We identify a structural conflict in tabular fair SSL through a diagnostic stress test: under confidence-gated pseudo-labeling, moment-matching fairness regularizers can trigger two failure modes -- Masking Collapse (fairness erodes confidence, starving pseudo-labels) and Trivial Saturation (drift to constant predictors). We propose Online Primal-Dual Allocation (OPDA), an online controller that schedules fairness and entropy-based stability penalties using violation, risk, and pseudo-label health signals, avoiding per-dataset selection of a fixed fairness weight within this diagnostic regime. On the evaluated tabular benchmarks (Adult, ACSIncome, COMPAS), OPDA mitigates the degenerate regimes observed under static weighting and simple single-signal adaptive baselines. On Adult and COMPAS, it yields non-degenerate operating points competitive with the empirical static-$λ$ frontier; on ACSIncome, it preserves utility with a wider fairness-utility spread. Relative to OPDA-lite, the full controller mainly shifts the operating point toward higher utility on ACSIncome, while Adult highlights the fairness-utility trade-off between the two variants. These results position OPDA as a calibration-free controller for non-degenerate operating points in tabular fair SSL without per-dataset tuning.
Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images
2605.16444v1 by Liangrui Pan, Jiadi Luo, Yuxuan Xiao, Chenchen Nie, Xiaoshuai Wu, Songqing Fan, Ling Chu, Manqiu Li, Rongfang He, Zhenyu Zhao, Ruixing Wang, Shulin Liu, Yiyi Liang, Xiang Wang, Qingchun Liang, Shaoliang Peng
Accurate intraoperative and postoperative diagnosis of spread through air spaces (STAS) is essential for guiding surgical decisions and postoperative management in lung cancer. However, histopathological assessment is labor-intensive and is prone to missed or incorrect diagnoses. We propose a Diffusion Attention Expert Model (DAEM) to detect STAS in frozen sections (FSs) and paraffin sections (PSs). Its diffusion attention expert module leverages full attention aggregation to learn multi-scale features from histopathological images, while a dual-branch architecture strengthens multi-scale feature representation. On an internal dataset, DAEM achieves AUCs of 0.8946 for FSs and 0.9112 for PSs. Validation on external multi-center datasets from eight institutions demonstrates strong generalizability and interpretability. Using tumor microenvironment (TME) features in PSs, we further enable semi-automatic measurement of STAS location and its distance from the primary tumor. Several quantitative TME metrics are identified as potential biomarkers for STAS, including micropapillary-type STAS. Overall, DAEM offers a clinically actionable framework for STAS assessment by enabling accurate and interpretable detection on FSs and PSs, supporting postoperative risk stratification through quantitative TME-based analysis.
Two-Valued Symmetric Circulant Matrices: Applications in Deep Learning
2605.16443v1 by Jayakrishna Amathi, Venkata Prasanth Yanambaka, Saraju P. Mohanty, Elias Kougianos
Despite the success of deep neural networks in vision, medical diagnosis, and IoT scenarios, their deployment on resource-limited platforms poses serious challenges due to their high storage requirements, computational complexity, and large footprint. In particular, fully connected layers require a large number of weights, making it difficult for edge devices to accommodate them. To overcome these challenges, this paper proposes the Two-Valued Symmetric Circulant Matrix (TVSCM), a very sparse architecture that employs just two weights per layer while keeping the weight matrix circulant and symmetric. This extreme form of structured sparsity incurs negligible storage costs compared to traditional full-weight storage. Rather than relying on the hardware and additional stages of other traditional sparse learning techniques, such as low-rank approximation and pruning, this architecture achieves minimal storage requirements through an extreme form of sparsity. The simulation study reports a parameter reduction of more than 80$\times$, with counts falling from 623,290 to 7,852 on MNIST and from 24,709 to 942 on the MIT-BIH arrhythmia dataset (roughly 79$\times$ and 26$\times$ by these figures), while accuracy falls from 97.6% to 93.5% on MNIST and from 97.6% to 93.1% on MIT-BIH. Due to its minimal architectural requirements and very low power consumption, this architecture would be well suited to edge computing platforms, tiny-ML platforms, IoMT systems, and battery-powered systems.
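As an illustration of what a symmetric circulant weight matrix with only two distinct values can look like, the sketch below builds one with NumPy. The abstract does not say how the two weights are placed, so the placement rule here (a boolean `pattern` over the first row, defaulting to one value on the diagonal and the other elsewhere) and the function name `two_valued_symmetric_circulant` are assumptions made for illustration.

```python
import numpy as np


def two_valued_symmetric_circulant(n, a, b, pattern=None):
    """Build an n x n symmetric circulant matrix whose entries are a or b.

    A circulant matrix is defined by its first row c, with
    C[i, j] = c[(j - i) mod n]. It is symmetric exactly when
    c[k] == c[(n - k) mod n] for all k.

    Assumption: `pattern` marks where value `a` sits in the first row;
    by default `a` is on the diagonal and `b` everywhere else.
    """
    if pattern is None:
        pattern = np.zeros(n, dtype=bool)
        pattern[0] = True
    pattern = np.asarray(pattern, dtype=bool)
    mirrored = pattern[(-np.arange(n)) % n]
    if not np.array_equal(pattern, mirrored):
        raise ValueError("pattern must satisfy pattern[k] == pattern[-k mod n]")
    first_row = np.where(pattern, a, b)
    i, j = np.indices((n, n))
    return first_row[(j - i) % n]


if __name__ == "__main__":
    C = two_valued_symmetric_circulant(4, 2.0, -1.0)
    print(C)
    print("symmetric:", np.array_equal(C, C.T))
```

Only the two scalars `a` and `b` need to be stored for such a layer; the full matrix can be regenerated from the pattern whenever it is needed.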
Retrieval-Augmented Large Language Models for Schema-Constrained Clinical Information Extraction
2605.15467v1 by A H M Rezaul Karim, Ozlem Uzuner
Conversational nurse-patient transcripts contain actionable observations, but converting these transcripts into structured representations at scale remains challenging. Documentation burden is substantial, with prior studies showing clinicians spend large portions of their workday on documentation and related desk work rather than direct patient care. MEDIQA-SYNUR focuses on observation extraction from conversational nurse-patient transcripts, requiring systems to normalize these narratives into a predefined schema with value-type constraints. We propose a modular retrieval-augmented generation (RAG) pipeline that uses the training set as an exemplar corpus, combines schema-constrained prompting (full schema vs. pruned candidate schema), deterministic schema-based postprocessing, and a second-pass audit, with two LLM backbones: Llama-4-Scout-17B-16E-Instruct and GPT-5.2 with corresponding embedding models for RAG. Our best configuration uses GPT-5.2 with the full schema, RAG, and a second-pass audit, achieving an F1 score of 80.36%. Overall, our results show that RAG consistently improves performance, while the optimal degree of schema constraint depends on the model, and second-pass auditing yields modest additional gains by correcting residual schema-adherence errors.
FutureSim: Replaying World Events to Evaluate Adaptive Agents
2605.15188v1 by Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping
AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
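For readers unfamiliar with the Brier skill score mentioned above, the following sketch shows the standard definition for binary questions: one minus the ratio of the forecaster's Brier score to that of a reference forecast. The abstract does not state which reference FutureSim uses for "making no prediction at all"; the constant 0.5 forecast used in the demo is an assumption, and the function names are hypothetical.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_probs):
    """1 - BS / BS_ref; positive beats the reference, negative is worse."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score(reference_probs, outcomes)
    return 1.0 - bs / bs_ref


if __name__ == "__main__":
    outcomes = [1, 0, 1, 0]
    reference = [0.5] * len(outcomes)  # assumed uninformative baseline
    calibrated = [0.9, 0.1, 0.8, 0.2]
    confidently_wrong = [0.1, 0.9, 0.2, 0.8]
    print(brier_skill_score(calibrated, outcomes, reference))
    print(brier_skill_score(confidently_wrong, outcomes, reference))
```

A negative score means the forecaster did worse than the reference, which is the situation the abstract describes for many agents.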
Evidential Reasoning Advances Interpretable Real-World Disease Screening
2605.15171v1 by Chenyu Lian, Hong-Yu Zhou, Jing Qin
Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.
Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment
2605.15168v1 by Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss
Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.
COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion
2605.15016v1 by Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu
As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at https://github.com/FrankDengAI/COTCAgent/.
Quantifying and Mitigating Premature Closure in Frontier LLMs
2605.15000v1 by Rebecca Handler, Suhana Bedi, Nigam Shah
Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.
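The false-action rate reported for the multiple-choice setting can be read as the share of questions, each with its correct option removed, on which the model still picks one of the remaining options. Below is a minimal sketch of that calculation under this reading. The `"ABSTAIN"` label and the function name are hypothetical, and the paper's exact scoring protocol may differ.

```python
def false_action_rate(responses, abstain_label="ABSTAIN"):
    """Fraction of responses that commit to an option instead of abstaining.

    Assumption: every item in `responses` comes from a question whose
    correct choice was removed, so any selected option counts as a
    false action.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    acted = sum(1 for response in responses if response != abstain_label)
    return acted / len(responses)


if __name__ == "__main__":
    # Three of four responses still select an option.
    print(false_action_rate(["A", "ABSTAIN", "C", "B"]))
```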
Explainable Detection of Depression Status Shifts from User Digital Traces
2605.14995v1 by Loris Belcastro, Francesco Gervino, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio
Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user's mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.
Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning
2605.14991v1 by Francesco Pastori, Francesca Fati, Marina Rosanu, Luigi De Vitis, Lucia Ribero, Gabriella Schivardi, Giovanni Damiano Aletti, Nicoletta Colombo, Jvan Casarin, Francesco Multinu, Elena De Momi
Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responders, 133 non-responders). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.
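The wide confidence intervals quoted above are typical of small test cohorts. The abstract does not say how the intervals were computed; one common choice is a percentile bootstrap over test cases, sketched below with NumPy as an assumption. The helper names `roc_auc` and `bootstrap_auc_ci` are hypothetical.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as one half.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for ROC-AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples that contain a single class
        aucs.append(roc_auc(y_true[idx], scores[idx]))
    low, high = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(low), float(high)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    preds = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, preds))
    print(bootstrap_auc_ci(labels, preds))
```

With only a few dozen test patients, each resample changes the class mix noticeably, which is why such intervals can span well over 0.2 in AUC.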
GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation
2605.14968v1 by Drewry H. Morris, Luis Valles, Reza Hosseini Ghomi
GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.
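The 35% figure follows directly from multiplying independent per-step success probabilities: 0.9 raised to the tenth power is about 0.349. The short sketch below reproduces that arithmetic and also inverts it, under the same idealized independence assumption stated in the abstract. The function names are hypothetical.

```python
def chain_success_probability(per_step_reliability, num_steps):
    """Probability that all steps succeed when steps fail independently."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must lie in [0, 1]")
    return per_step_reliability ** num_steps


def required_step_reliability(target_success, num_steps):
    """Per-step reliability needed to hit a target end-to-end success rate."""
    if not 0.0 <= target_success <= 1.0:
        raise ValueError("target_success must lie in [0, 1]")
    return target_success ** (1.0 / num_steps)


if __name__ == "__main__":
    # Ten steps at 90% each.
    print(f"{chain_success_probability(0.90, 10):.4f}")  # 0.3487
    # Per-step reliability needed for 97% end-to-end over ten steps.
    print(f"{required_step_reliability(0.97, 10):.4f}")
```

The inverse calculation shows why end-to-end targets in the high 90s leave almost no room for per-step error in long workflows.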
From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement
2605.14912v1 by Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka
Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.
BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring
2605.14886v1 by Zixuan Shu, Tiancheng Cao, Hen-Wei Huang
Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by strict data-sharing regulations and privacy concerns. Federated learning (FL) enables collaborative learning by keeping raw ECG data on devices, but frequent transmissions of high-dimensional model updates incur heavy per-round traffic over bandwidth-limited links. To alleviate this bottleneck, federated distillation (FD) replaces parameter exchange with logit-based knowledge transfer. However, the performance of FD often degrades under the non-independent and identically distributed (non-IID) and long-tailed label distributions in ECG deployments. To address these challenges, we propose a bidirectional federated knowledge distillation (BiFedKD) framework that employs an aggregation-by-distillation pipeline with temperature scaling to produce a stable global distillation signal for cross-client alignment. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 over the baseline by $3.52\%$ and $9.93\%$, respectively. Moreover, to reach the same Macro-F1, BiFedKD reduces communication overhead by $40\%$ and computation cost by $71.7\%$ compared with the baseline.
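To illustrate logit-based knowledge transfer with temperature scaling, the sketch below softens logits with a temperature and forms a shared soft target from several clients. Averaging client logits before the softmax is an assumption chosen for simplicity and is not claimed to be BiFedKD's aggregation-by-distillation pipeline. The function names are hypothetical.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over the last axis after dividing logits by a temperature.

    Higher temperatures flatten the distribution, exposing more of the
    relative ranking among non-top classes.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def aggregate_soft_targets(client_logits, temperature=2.0):
    """Average per-client logits, then soften them into one target."""
    mean_logits = np.mean(np.stack(client_logits), axis=0)
    return softmax_with_temperature(mean_logits, temperature)


if __name__ == "__main__":
    clients = [np.array([2.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])]
    print(aggregate_soft_targets(clients, temperature=1.0))
    print(aggregate_soft_targets(clients, temperature=4.0))
```

Each client would exchange only a vector of class logits per distillation sample rather than full model parameters, which is the source of the communication savings that federated distillation aims for.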
Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model
2605.14723v1 by Minghao Wu, Yuting Yan, Zhenyang Cai, Ke Ji, Chuangsen Fang, Ziying Sheng, Xidong Wang, Rongsheng Wang, Hejia Zhang, Shuang Li, Benyou Wang, Hongyuan Zha
Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid--vasopressor interventions, and follows a propose--simulate--refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose--simulate--refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.
Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke
2605.14710v1 by Liren Chen, Lidong Sun, Mingyan Huang, Junzhe Tang, Yinghui Zhu, Guanjie Wang, Yiqing Xia, Ting Xiao
Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.
NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces
2605.14698v1 by Konstantinos Kontras, Trui Osselaer, Stylianos G. Mouslech, Angeliki-Ilektra Karaiskou, Guido Gagliardi, Thomas Strypsteen, Mohammad Hossein Badiei, Anku Rani, Maarten Vanmarcke, Miguel Bhagubai, Chanakya Ekbote, Jaedong Hwang, Christos Chatzichristos, Paul Pu Liang, Maarten De Vos
Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the EEG-specific preprocessing that might influence reported results, and in the reported metrics, frequently obscuring the clinical relevance in EEG. We introduce NeuroAtlas, the largest EEG benchmark to date: 42 datasets and 260k hours covering clinical EEG (epilepsy, sleep medicine, brain age estimation) and brain-computer interfaces, and include multiple datasets per task along with bespoke clinical evaluation metrics. Besides evaluating EEG-FMs with respect to supervised baselines, we present results from generic time-series FMs. We report three findings. First, EEG-specific FMs do not consistently outperform time-series FMs, which have neither EEG-focused architectures nor been pretrained on EEG. Second, standard machine learning metrics are insufficient to assess clinical utility: thus, we thoroughly evaluate more appropriate measures such as the quality of event-level decision-making, hypnogram-derived features, and the brain-age gap in the domains of epilepsy, sleep, and brain age, respectively. Third, model rankings and performance can vary substantially within domains. We conclude that pretrained models perform largely on par, with only narrow advantages for a few, and that current models do not yet deliver on the promise of an out-of-the-box unified EEG model. NeuroAtlas exposes this gap and provides the datasets and metrics for the next generation of unified EEG FMs.
How Sensitive Are Radiomic AI Models to Acquisition Parameters?
2605.14667v1 by D. Gil, I. Sanchez, C. Sanchez
A main barrier to the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying the scan-parameter sensitivity of radiomic AI models, while identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework for quantifying the influence that clinically relevant acquisition parameters have on model performance, while accounting for subject-level random effects. We have applied our framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and own-collected data) and several state-of-the-art (SoA) architectures. To evaluate across-database reproducibility, CT parameters have been adjusted using the data collected and tested on the public set. The optimal configuration selected is an X-ray tube current >= 200 mA, spiral pitch <= 1.5, and slice thickness <= 1.25 mm, which balances diagnostic quality with low radiation dose. This configuration pushes metrics from 0.79 ± 0.04 sensitivity and 0.47 ± 0.10 specificity in low-quality scans to 0.90 ± 0.10 sensitivity and 0.79 ± 0.13 specificity in high-quality ones.
MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder
2605.14660v1 by Eranga Bandara, Ross Gore, Asanga Gunaratna, Ravi Mukkamala, Nihal Siriwardanagea, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Wathsala Herath, Chalani Rajapakse, Sachin Shetty, Anita H. Clayton, Christopher K. Rhea, Ng Wee Keong, Kasun De Zoysa, Amin Hass, Shaifali Kaushik, Preston Samuel, Atmaram Yarlagadda
Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem: traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-trigger amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches, such as prolonged exposure, EMDR, and cognitive behavioural therapy, operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling-tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
2605.14543v1 by Shuhao Chen, Weisen Jiang, Changmiao Wang, Xiaoqing Wu, Xuanren Shi, Yu Zhang, James T. Kwok
Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient's condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
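The F1 and Exact Match figures can be read as set-overlap scores between the options a model selects and the options that are actually correct. The following is a minimal sketch under that reading; the abstract does not state how scores are aggregated across questions, and the function names `question_f1` and `exact_match` are illustrative, not taken from the benchmark.

```python
def question_f1(selected, correct):
    """F1 between the selected and the correct option sets for one question."""
    if not selected and not correct:
        return 1.0  # nothing to select and nothing selected
    true_pos = len(selected & correct)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(selected)
    recall = true_pos / len(correct)
    return 2 * precision * recall / (precision + recall)


def exact_match(selected, correct):
    """True only when the selected options are exactly the correct ones."""
    return selected == correct
```

Under this reading a model can score a reasonable F1 by getting most triples right on most questions while still missing Exact Match, which requires every option on a question to be correct. That would be consistent with the gap between the best F1 and the best Exact Match reported above.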
Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel
2605.14370v1 by Ruihua Chen, Yisi Luo, Bangyu Wu, Xile Zhao, Deyu Meng
Full-waveform inversion (FWI) estimates unknown parameters in the wave equation from limited boundary measurements. Recent advances in neural reparameterized FWI (NeurFWI) demonstrate that representing the parameters using a neural network can reduce the reliance on a high-quality initial model and wavefield data, at the cost of slow high-resolution convergence. However, its underlying theoretical mechanism remains unclear. In this study, we establish the neural sensitivity kernel (NSK) and the wave tangent kernel (WTK) to analyze the convergence behavior of NeurFWI from both the model and data domains. These theoretical frameworks show that the neural tangent kernel (NTK) induced by neural representation adaptively modulates the original sensitivity and wave tangent kernels. This modulation leads to several key outcomes, i.e., the spectral filtering effect, the gradient wavenumber modulation, and the wave frequency bias, connecting the convergence behavior of NeurFWI with the eigen-structures of NSK and WTK. Building on these insights, we propose several enhanced NeurFWI methods with tailored eigen-structures in NSK and WTK to improve inversion performance and efficiency. We numerically validate these theoretical claims and the proposed methods in seismic exploration and, for the first time, extend their application to medical imaging.
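The statement that the NTK modulates the original kernels can be illustrated with the chain rule. The sketch below assumes gradient flow on the network weights and uses notation that is not taken from the paper: $m = f_\theta$ for the network-represented model, $L$ for the data misfit, and $\Theta$ for the NTK. The paper's own definitions of NSK and WTK may differ in detail.

```latex
\begin{aligned}
\text{Conventional FWI:}\quad
  \frac{dm}{dt} &= -\frac{\partial L}{\partial m}, \\[4pt]
\text{NeurFWI:}\quad
  \frac{d\theta}{dt} &= -\left(\frac{\partial m}{\partial \theta}\right)^{\!\top}
                        \frac{\partial L}{\partial m}, \\[4pt]
  \frac{dm}{dt} &= \frac{\partial m}{\partial \theta}\,\frac{d\theta}{dt}
                 = -\underbrace{\frac{\partial m}{\partial \theta}
                    \left(\frac{\partial m}{\partial \theta}\right)^{\!\top}}_{\Theta}
                    \frac{\partial L}{\partial m}.
\end{aligned}
```

In this reading the model update is the conventional FWI gradient multiplied by $\Theta$. Components of the gradient aligned with large-eigenvalue directions of $\Theta$ converge quickly and the remainder slowly, which is one way to understand the spectral filtering effect named in the abstract.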
AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction
2605.14327v1 by Yerin Park, Sangseon Lee
Drug-drug interaction (DDI) prediction is a critical task in computational biomedicine, as adverse interactions between co-administered drugs can cause severe side effects and clinical risks. A key challenge is unseen-drug generalization, where interactions must be predicted for drugs not observed during training. Although multimodal DDI models exploit diverse drug-related information, their fusion mechanisms are often tied to specific prediction architectures, limiting their reuse across models. To address this, we propose AIM-DDI, an architecture-independent multimodal integration module that represents heterogeneous modality information as tokens in a shared latent space. By modeling dependencies across modality tokens through a unified fusion module, AIM-DDI enables model-agnostic integration of structural, chemical, and semantic drug signals across different DDI prediction architectures. Extensive evaluations across diverse DDI models and DrugBank-based settings show that AIM-DDI consistently improves prediction performance, with the strongest gains under the most challenging both-unseen setting where neither drug in a test pair is observed during training. These results suggest that treating multimodal integration as a reusable module, rather than a model-specific fusion component, is an effective strategy for robust unseen-drug DDI prediction.
Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment
2605.14242v1 by Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han
The monitoring of fetal heart rate (FHR) and the assessment of its variability are crucial for preventing fetal compromise and adverse outcomes. However, traditional methods encounter limitations arising from equipment performance, data transmission, and subjective assessments by doctors. We have developed a tailored AI-based FHrCTG model specifically for FHR monitoring, which effectively mitigates noise interference and precisely reconstructs signals. Our model was pre-trained on a massive dataset consisting of 558,412 unlabeled data points and further refined using 7,266 expert-reviewed entries. To validate FHR, we introduced the Intersection Overlapping Labels (IOL) approach, which transforms rate analysis into categorical judgments. Testing revealed that our model demonstrates high sensitivity and specificity in detecting critical FHR decelerations (89.13% and 87.78%, respectively) and accelerations (62.5% and 92.04%, respectively). Furthermore, based on Fischer's criteria for clinical application, our model achieved impressive AUC scores of 0.7214 and 0.9643 for verifying FHR periodicity and amplitude variation, respectively.
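Once rate analysis has been turned into categorical judgments, sensitivity and specificity follow from the usual confusion counts. The sketch below uses the standard definitions; the function names are illustrative and the counts in the usage note are made up, not taken from the paper.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real events (e.g. decelerations) that were detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-event segments that were correctly left unflagged."""
    return true_neg / (true_neg + false_pos)
```

For example, `sensitivity(8, 2)` gives 0.8 and `specificity(9, 1)` gives 0.9. Reporting both matters here because a detector can reach high specificity while still missing many true events, as with the accelerations figures above.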
Fusion-fission forecasts when AI will shift to undesirable behavior
2605.14218v1 by Neil F. Johnson, Frank Yingjie Huo
The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.
Towards Fine-Grained and Verifiable Concept Bottleneck Models
2605.14210v1 by Yingying Fang, Haijie Xu, Shuang Wu, Mariathasan Anish, Guang Yang
Concept Bottleneck Models (CBMs) offer interpretable alternatives to black-box predictors by introducing human-relatable concepts before the final output. However, existing CBMs struggle to verify whether predicted concepts correspond to the correct visual evidence, limiting their reliability. We propose a fine-grained CBM framework that grounds each concept in localized visual evidence, enabling direct inspection of where and how concepts are encoded. This design allows users to interpret predictions and verify that the model learns intended concepts rather than spurious correlations. Experiments on medical imaging benchmarks show that our learned concept space is information-complete and achieves predictive performance comparable to standard CBMs, while substantially improving transparency. Unlike post-hoc attribution methods, our framework validates both the presence and correctness of concept representations, bridging interpretability with verifiability. Our approach enhances the trustworthiness of CBMs and establishes a principled mechanism for human-model interaction at the concept level, paving the way toward more reliable and clinically actionable concept-based learning systems.
Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)
2605.14126v1 by Marius S. Knorr, Robert Müller, Jan P. Bremer, Nils Schweingruber
Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.
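To make "filtering and aggregation across resource types" concrete, the sketch below follows the reference edge from `Observation` resources to a `Patient`, then filters and averages. It is an illustration only, not the paper's harness or tools. It assumes resources are already loaded as Python dictionaries, and it matches on `code.text` for brevity where a real query would typically match coded values. The helper names are hypothetical.

```python
def observations_for(resources, patient_ref, code_text):
    """Select Observations that point at the given patient and carry the given code text."""
    return [
        r for r in resources
        if r.get("resourceType") == "Observation"
        and r.get("subject", {}).get("reference") == patient_ref
        and r.get("code", {}).get("text") == code_text
    ]


def mean_value(observations):
    """Average the numeric values; returns None when there is nothing to average."""
    values = [o["valueQuantity"]["value"] for o in observations if "valueQuantity" in o]
    return sum(values) / len(values) if values else None


# Toy bundle: two patients, three observations (made-up values).
resources = [
    {"resourceType": "Patient", "id": "p1"},
    {"resourceType": "Patient", "id": "p2"},
    {"resourceType": "Observation", "subject": {"reference": "Patient/p1"},
     "code": {"text": "Glucose"}, "valueQuantity": {"value": 7.0}},
    {"resourceType": "Observation", "subject": {"reference": "Patient/p1"},
     "code": {"text": "Glucose"}, "valueQuantity": {"value": 9.0}},
    {"resourceType": "Observation", "subject": {"reference": "Patient/p2"},
     "code": {"text": "Glucose"}, "valueQuantity": {"value": 5.0}},
]
```

Selecting the wrong resources, one of the failure modes the abstract mentions, corresponds here to dropping or mis-stating the `subject` filter, which would silently mix in the other patient's observation.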
ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
2605.14113v1 by Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares, Jemma Kerns
While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
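The two privacy criteria named above have standard textbook definitions: a table is $k$-anonymous if every combination of quasi-identifier values is shared by at least $k$ records, and (distinct) $\ell$-diverse if each such group contains at least $\ell$ different sensitive values. The sketch below checks both on a list of records. It illustrates the criteria only; how the paper's semantic gate applies them is not described in the abstract, and the function and field names are hypothetical.

```python
from collections import defaultdict


def passes_privacy_gate(records, quasi_ids, sensitive, k, l):
    """True if every quasi-identifier group has >= k records and >= l distinct sensitive values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_ids)  # records sharing this key are indistinguishable
        groups[key].append(rec[sensitive])
    return all(len(vals) >= k and len(set(vals)) >= l for vals in groups.values())


# Toy cohort with made-up values.
cohort = [
    {"age_band": "60-69", "sex": "F", "dx": "RA"},
    {"age_band": "60-69", "sex": "F", "dx": "OA"},
    {"age_band": "70-79", "sex": "M", "dx": "RA"},
    {"age_band": "70-79", "sex": "M", "dx": "RA"},
]
```

With `k=2, l=1` the toy cohort passes, because each group has two records. With `k=2, l=2` it fails, because the second group has a single diagnosis, so knowing that someone is in that group reveals their diagnosis. That is the sense in which an $\ell$-diversity constraint can be the binding one.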
Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening
2605.14108v1 by Nishi Doshi, Shrey Shah
Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade and evaluate it on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance. Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology
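The routing logic of such a two-tier cascade can be sketched as follows. This is a minimal illustration, not the authors' implementation. The two models are passed in as plain callables, Tier 2 is assumed to return an integer grade from 0 to 4, and grades 0 and 1 are assumed to collapse into the "0-1" class of the 4-class output space. All names are hypothetical.

```python
def run_cascade(images, tier1_score, tier2_grade, threshold):
    """Route each image through Tier 1 and, if flagged, Tier 2.

    Returns (labels, forwarded_fraction), where labels come from
    the 4-class space {"0-1", "2", "3", "4"}.
    """
    labels = []
    forwarded = 0
    for img in images:
        if tier1_score(img) >= threshold:
            # Flagged as referable on the edge device: send to the cloud grader.
            forwarded += 1
            grade = tier2_grade(img)
            labels.append("0-1" if grade <= 1 else str(grade))
        else:
            # Not referable: stays local, no cloud call.
            labels.append("0-1")
    fraction = forwarded / len(images) if images else 0.0
    return labels, fraction
```

The fraction of cloud calls saved is one minus the forwarded fraction, which matches the figures in the abstract: 100% − 49.52% = 50.48%. Lowering `threshold` raises Tier 1 sensitivity at the cost of forwarding more images.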
A Benchmark for Early-stage Parkinson's Disease Detection from Speech
2605.14066v1 by Terry Yi Zhong, Cristian Tejedor-Garcia, Khiet P. Truong, Janna Maas, Louis ten Bosch, Bastiaan R. Bloem
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.
CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI
2605.13994v1 by Xiaoyue Liu, Xiaohan Yuan, Mark Y Chan, Ching-Hui Sia, Lei Li
Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.
Neurosymbolic Auditing of Natural-Language Software Requirements
2605.13817v1 by Bethel Hall, William Eiers
Natural-language software requirements are often ambiguous, inconsistent, and underspecified; in safety-critical domains, these defects propagate into formal models that verify the wrong specification and into implementations that ship unsafe behavior. We show that large language models, equipped with an SMT solver, can audit such requirements: translating them into formal logic, detecting ambiguity through stochastic variation in the generated formalization, and exposing inconsistency, vacuousness, and safety violations through solver queries on the resulting specification. We present VERIMED, a neurosymbolic pipeline that operationalizes this idea for medical-device software requirements, and report two findings. First, stochastic variation across independent formalizations is a signal of ambiguity: requirements that admit multiple plausible interpretations produce SMT-inequivalent formalizations, and bidirectional SMT equivalence checking turns this disagreement into a solver-checkable test. Second, the usefulness of symbolic feedback depends on its granularity: in counterexample-guided repair on a hemodialysis question-answering benchmark, concrete SMT counterexamples raise verified accuracy from 55.4% to 98.5%. Over an extensive experimental evaluation on open-source hemodialysis safety requirements, we show that the LLM-based approach in VERIMED successfully reduces ambiguity-sensitive requirements and enables rigorous auditing of software requirements through SMT-based queries.
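The bidirectional equivalence check can be illustrated without a solver. The sketch below swaps in brute-force enumeration over Boolean variables as a stand-in for the SMT queries: two formalizations are equivalent when neither direction of implication has a counterexample. The example requirement and all names are hypothetical and are not taken from VERIMED or its benchmark.

```python
from itertools import product


def find_counterexample(f, g, variables):
    """Return an assignment where f holds but g does not, or None if f implies g."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if f(env) and not g(env):
            return env
    return None


def equivalent(f, g, variables):
    """Bidirectional check: f implies g and g implies f."""
    return (find_counterexample(f, g, variables) is None
            and find_counterexample(g, f, variables) is None)


# Hypothetical ambiguous requirement:
# "Raise the alarm if pressure is high or the pump has stopped and the door is open."
VARS = ["high", "stopped", "door"]


def reading_a(e):
    return (e["high"] or e["stopped"]) and e["door"]


def reading_b(e):
    return e["high"] or (e["stopped"] and e["door"])
```

Here `reading_a` implies `reading_b`, but not the reverse: with high pressure, a running pump, and a closed door, only `reading_b` raises the alarm. The disagreement between the two formalizations is what flags the sentence as ambiguous, and the concrete assignment plays the role that an SMT counterexample plays in the repair loop the abstract describes.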
Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography
2605.13730v1 by Christos Chrysanthos Nikolaidis, Vasileios Sachpekidis, Nikolas Moustakidis, Theofilos Moustakidis, Pavlos S. Efraimidis
Transthoracic echocardiography (TTE) is the first-line imaging modality for diagnosing bicuspid aortic valve (BAV), yet diagnostic performance varies with operator expertise and image quality. We developed an explainable AI model that distinguishes BAV from tricuspid aortic valves (TAV) using routinely acquired parasternal long-axis (PLAX) cine loops. A multi-backbone video ensemble was trained and evaluated using a leakage-aware, stratified outer cross-validation protocol on $N=90$ patient studies (48 BAV, 42 TAV). Across fixed outer splits and 10 random seeds, the calibrated stacked ensemble achieved an outer-CV F1-score of 0.907 and recall of 0.877. Frame-level Grad-CAM localized salient evidence to the aortic root and leaflet plane, while globally aggregated SHAP values quantified each video backbone's contribution to the stacked prediction, enabling transparent, case-level auditability. These findings indicate that PLAX-based video ensembles can support reliable BAV/TAV classification from routine echocardiographic cine loops and may facilitate earlier detection in non-specialist or resource-limited clinical settings.
SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing
2605.17620v1 by Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk
Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.
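Summary: Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhage with high mortality and long-term disability. As populations age, the incidence and overall burden of cerebrovascular disease are expected to rise, which highlights the need for scalable methods to analyze complex medical data and improve population-level understanding of these conditions. Digital twins and deep learning offer promising routes to better diagnosis, prognosis, and treatment, but their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based methods for anatomy-conditioned aneurysm mesh generation; aneurysms are computed from existing vascular geometries rather than generated in isolation. In addition, we introduce the SynVA procedural model, which synthesizes vessels and aneurysms solely from physiological principles and statistical priors and makes it possible to generate large-scale datasets (for example, for training mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks such as semantic segmentation. Extensive quantitative and qualitative evaluations show that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes that align better with expert human perception, while others perform better on quantitative similarity metrics against reconstructions of real aneurysms.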
Cross Modality Image Translation In Medical Imaging Using Generative Frameworks
2605.13686v1 by Giulia Romoli, Alessia Capoccia, Filippo Ruffini, Francesco Di Feola, Luca Boldrini, Arturo Chiti, Renato Cuocolo, Tugba Akinci D'Antonoli, Fatemeh Darvizeh, Marcello Di Pumpo, Bradley J. Erickson, Liu Fang, Deborah Fazzini, Paola Feraco, Fabrizia Gelardi, Francesco Gossetti, Ana Isabel Hernáiz Ferrer, Michail E. Klontzas, Seyedmehdi Payabvash, Katrine Riklund, Sara N. Strandberg, Valerio Guarrasi, Paolo Soda
Medical image-to-image (I2I) translation enables virtual scanning, i.e. the synthesis of a target imaging modality from a source one without additional acquisitions. Despite growing interest, most proposed methods operate on 2D slices, are evaluated on isolated tasks with different experimental set-ups and lack clinical validation. The primary contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework, we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models across all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis reveals that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also performed a Visual Turing test administered to 17 physicians, including 15 radiologists, which shows near-chance classification accuracy (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions, while exposing a dissociation between quantitative metrics and clinical preference.
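Summary: Medical image-to-image (I2I) translation enables virtual scanning, that is, synthesizing a target imaging modality from a source modality without additional acquisitions. Despite growing interest, most proposed methods operate only on 2D slices, are evaluated on isolated tasks under differing experimental set-ups, and lack clinical validation. The main contribution of this work is a reproducible, standardized comparative evaluation of 3D I2I translation methods in oncological imaging, designed to standardize preprocessing, data splitting, inference, and multi-level evaluation across heterogeneous clinical tasks. Within this framework we compare seven generative models, three Generative Adversarial Networks (GANs: Pix2Pix, CycleGAN, SRGAN) and four latent generative models (Latent Diffusion Model, Latent Diffusion Model+ControlNet, Brownian Bridge, Flow Matching), across eleven datasets spanning three anatomical regions (head/neck, lung, pelvis) and four translation directions (cone-beam CT to CT, MRI to CT, CT to PET, MRI T2-weighted to T2-FLAIR), for a total of 77 experiments under uniform training, inference, and evaluation conditions. The results show that GANs outperform latent generative models on all tasks, with SRGAN achieving statistically significant superiority. Our lesion-level analysis shows that all models struggle with small lesions and that, in CT to PET synthesis, models reproduce lesion shape more reliably than absolute uptake-related intensity. We also administered a Visual Turing test to 17 physicians, including 15 radiologists; classification accuracy was near chance (56.7%), confirming that synthetic volumes are largely indistinguishable from real acquisitions while exposing a dissociation between quantitative metrics and clinical preference.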
Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model
2605.13568v1 by Riccardo Cavarra, Lupo Lovatelli, Shaheim Ogbomo-Harmitt, Shahid Aziz, Adelaide De Vecchi, Andrew King, Oleg Aslanidi
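Myocardial infarction (MI) is a leading cause of death, and predicting its adverse outcomes is urgent. Yet ECG-based prognostic models underperform because deep learning requires large, labelled datasets, which are scarce in medicine. Foundation models can learn from unlabelled ECGs via self-supervision, but medically relevant training strategies remain underexplored. We propose a pretrained artificial intelligence model that combines patient-specific temporal information using contrastive learning with supervised multitask heads, then fine-tunes on post-MI outcome prediction. The proposed model outperformed a model trained from scratch (0.794 vs 0.608 AUC), showing that clinically structured ECG modelling improves classification in limited data regimes.
Summary: Myocardial infarction (MI) is one of the leading causes of death, and predicting its adverse outcomes is urgent.
However, ECG-based prognostic models perform poorly because deep learning needs large labelled datasets, which are scarce in medicine.
Foundation models can learn from unlabelled ECGs through self-supervision, but medically relevant training strategies remain underexplored.
We propose a pretrained artificial intelligence model that combines patient-specific temporal information, learned with contrastive learning, with supervised multitask heads, and is then fine-tuned on post-MI outcome prediction.
The proposed model outperformed a model trained from scratch (0.794 vs 0.608 AUC), showing that clinically structured ECG modelling improves classification in limited-data settings.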
Generating synthetic computed tomography for radiotherapy: SynthRAD2025 challenge report
2605.13555v1 by Viktor Rogowski, Maarten L. Terpstra, Niklas Wahl, Florian Kamp, Erik van der Bijl, Arthur Jr. Galapon, Christopher Kurz, Bowen Xin, Zhengxiang Sun, Hollie Min, Gregg Belous, Jason Dowling, Yan Xia, Siyuan Mei, Fuxin Fan, Arthur Longuefosse, Javier Sequeiro Gonzalez, Miguel Diaz Benito, Alvaro Garcia Martin, Fabien Baldacci, Valentin Boussot, Cédric Hémon, Jean-Claude Nunes, Jean-Louis Dillenseger, Zhiyuan Zhang, Jinghua Cai, Han Bing, Tan Zuopeng, Ricardo Brioso, Daniele Loiacono, Guillaume Landry, Adrian Thummerer, Matteo Maspero
Radiation therapy (RT) requires precise dose delivery over multiple fractions, with CT fundamental for treatment planning due to its electron density information. Repeated CT acquisitions impose radiation exposure and logistical burdens, MRI lacks electron density, and cone-beam CT (CBCT) requires correction for dose calculation. Synthetic CT (sCT) generation addresses these by converting MRI or CBCT into CT-equivalent images with accurate Hounsfield Unit (HU) values, enabling MRI-only RT and CBCT-based adaptive workflows. Building on SynthRAD2023, SynthRAD2025 benchmarked sCT methods on 2,362 patients from five European centers across head and neck, thorax, and abdomen. Two tasks: MRI-to-CT (890 cases) and CBCT-to-CT (1,472 cases), evaluated via image similarity (MAE, PSNR, MS-SSIM), segmentation (Dice, HD95), and dosimetric metrics from photon and proton plans. With 803 participants and 12/13 valid submissions, Task 1 top performance reached MAE $64.8\pm21.3$ HU, PSNR $\sim$30 dB, MS-SSIM $\sim$0.936, Dice 0.79, photon $γ_{2\%/2\text{mm}}>98\%$, proton $γ\approx85\%$. Task 2 improved: MAE $48.3\pm13.4$ HU, PSNR 32.6 dB, MS-SSIM 0.968, Dice 0.86, photon $γ>99\%$, proton $γ\approx89\%$. Strong image--segmentation correlations ($ρ=0.78$--$0.79$) but moderate dose correlations confirmed image quality is insufficient as a dosimetric surrogate. Head-and-neck cases were most consistent; thoracic and abdominal cases showed greater variability. Residual errors at tissue interfaces propagate along beam paths, affecting proton dose more than photon. SynthRAD2025 demonstrates that deep learning yields clinically relevant sCTs, especially for CBCT-to-CT, while identifying persistent MRI-to-CT challenges and underscoring dose-based evaluation as essential for clinical validation.
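Summary: Radiation therapy (RT) requires precise dose delivery over multiple fractions, and CT is fundamental to treatment planning because it provides electron density information. Repeated CT acquisitions impose radiation exposure and logistical burdens, MRI lacks electron density, and cone-beam CT (CBCT) requires correction before dose calculation. Synthetic CT (sCT) generation addresses these problems by converting MRI or CBCT into CT-equivalent images with accurate Hounsfield Unit (HU) values, enabling MRI-only RT and CBCT-based adaptive workflows. Building on SynthRAD2023, SynthRAD2025 benchmarked sCT methods on 2,362 patients from five European centers across head and neck, thorax, and abdomen. There were two tasks, MRI-to-CT (890 cases) and CBCT-to-CT (1,472 cases), evaluated via image similarity (MAE, PSNR, MS-SSIM), segmentation (Dice, HD95), and dosimetric metrics from photon and proton plans. With 803 participants and 12/13 valid submissions, the top Task 1 performance reached MAE $64.8\pm21.3$ HU, PSNR $\sim$30 dB, MS-SSIM $\sim$0.936, Dice 0.79, photon $γ_{2\%/2\text{mm}}>98\%$, proton $γ\approx85\%$. Task 2 improved on this: MAE $48.3\pm13.4$ HU, PSNR 32.6 dB, MS-SSIM 0.968, Dice 0.86, photon $γ>99\%$, proton $γ\approx89\%$. Image-segmentation correlations were strong ($ρ=0.78$--$0.79$) but dose correlations were moderate, confirming that image quality is insufficient as a dosimetric surrogate. Head-and-neck cases were the most consistent; thoracic and abdominal cases showed greater variability. Residual errors at tissue interfaces propagate along beam paths and affect proton dose more than photon dose. SynthRAD2025 demonstrates that deep learning yields clinically relevant sCTs, especially for CBCT-to-CT, while identifying persistent MRI-to-CT challenges and underscoring that dose-based evaluation is essential for clinical validation.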
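Two of the image-similarity metrics named above, MAE in HU and PSNR in dB, can be sketched as follows. This is an illustration under assumptions, not the challenge's evaluation code: the abstract does not state the masking or dynamic-range conventions, so `data_range` is a free parameter here and the four-voxel "images" are made up.

```python
import math

def mae_hu(sct, ct):
    """Mean absolute error between synthetic and reference CT values (in HU)."""
    return sum(abs(a - b) for a, b in zip(sct, ct)) / len(ct)

def psnr_db(sct, ct, data_range):
    """Peak signal-to-noise ratio in dB for an assumed dynamic range (in HU)."""
    mse = sum((a - b) ** 2 for a, b in zip(sct, ct)) / len(ct)
    if mse == 0:
        return math.inf
    return 10 * math.log10(data_range ** 2 / mse)

# Made-up voxel values: air, water, soft tissue, bone.
ct = [-1000, 0, 40, 1000]
sct = [-990, 10, 30, 1010]

mae = mae_hu(sct, ct)
psnr = psnr_db(sct, ct, data_range=2000)  # assumed range, not the challenge's
print(mae, round(psnr, 2))
```

Because PSNR depends on the chosen dynamic range, values computed this way are only comparable to the reported ones if the same range and body mask are used.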
RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
2605.13542v1 by Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan
Intensive care units (ICUs) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30-min windows and release two datasets: RealICU-Gold with 930-window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/
Summary: Intensive care units (ICUs) generate long, dense, and continually evolving streams of clinical information, and physicians must repeatedly reassess patient state under time pressure, which highlights a clear need for reliable AI decision support. Existing ICU benchmarks usually treat historical clinician actions as ground truth. However, those actions were taken with incomplete information and limited temporal context about the patient's state and may therefore be suboptimal, which makes it hard to assess the true reasoning ability of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, in which labels are created after senior physicians review the full patient trajectory. We define four physician-motivated tasks: assessing Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We split each trajectory into 30-minute windows and release two datasets: RealICU-Gold, with 930 annotated windows from 94 MIMIC-IV patients, and RealICU-Scale, with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, perform poorly on RealICU and expose two failure modes: a recall-safety tradeoff in clinical recommendations, and an anchoring bias toward early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Overall, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/
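The 30-minute windowing step can be sketched as below. This is a minimal illustration, assuming each event is a `(minutes_since_admission, note)` pair; that tuple format, the example notes, and `partition_into_windows` are hypothetical, and the abstract does not say how the benchmark treats windows without events (here they are simply absent).

```python
from collections import defaultdict

def partition_into_windows(events, window_min=30):
    """Group (minutes_since_admission, note) events by fixed-width window index."""
    windows = defaultdict(list)
    for minutes, note in events:
        windows[int(minutes // window_min)].append(note)
    return dict(sorted(windows.items()))

# Made-up trajectory fragment for illustration only.
events = [
    (5, "HR 110"),
    (28, "MAP 58"),
    (31, "fluids started"),
    (95, "lactate 4.1"),
]

windows = partition_into_windows(events)
print(windows)
```

Each window index could then be paired with a hindsight label for the four tasks, while a model under evaluation would only see the windows up to the current one.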
Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
2605.13530v1 by Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li, Wei Ji, Kai Wang, Shanshan Wang, Weixin Si
Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, extending CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.
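Summary: Surgical scene understanding is a cornerstone of computer-assisted intervention. Recent advances, particularly in surgical image segmentation, have driven progress, but real-world clinical applications need a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. Existing methods usually handle these components in isolation, which leads to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding in a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured, interpretable reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then aggregated over time and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The whole framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To support unified evaluation, we introduce CholecT45-Scene, which extends the CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with the existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, raising the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning and grounding for reliable, context-aware surgical assistance.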
Multi-Agent Systems in Emergency Departments: Validation Study on a ED Digital Twin
2605.13345v1 by Markus Wenzel, Tobias Strapatsas, Jessika Kress, Dorothea Sauer, Nele Gessler, Horst K. Hahn
Emergency departments (ED) face challenges in patient care and resource management. We propose to explore optimization strategies in a realistic and flexible model and develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) simulating highly configurable ED environments. We specifically focus on the validation of the modeling approach. We derive configurations for ED sizes, patient load, and staffing from real-world studies. We then validate the model expressivity by matching its key performance indicators and metrics with their values known from literature. We proceed by implementing scientifically established and practice-proven resource optimization strategies. Comparing the documented real-world outcomes with our model's results demonstrates that the DES-ABM based simulation can effectively replicate real-world ER dynamics under interventions. We lastly integrate a Proof-of-Concept multi-agent system (MAS) that can autonomously explore resource allocation strategies within the simulated ER environment based on a temporal ledger of ED event records. This modular DES-ABM-MAS framework offers a powerful tool to explore resource optimization strategies in emergency departments.
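Summary: Emergency departments (EDs) face challenges in patient care and resource management.
We propose to explore optimization strategies in a realistic and flexible model, and we develop a hybrid Discrete Event Simulation (DES) and Agent-Based Model (ABM) that simulates highly configurable ED environments.
We focus specifically on validating the modeling approach.
We derive configurations for ED size, patient load, and staffing from real-world studies.
We then validate the model's expressivity by matching its key performance indicators and metrics against values known from the literature.
Next, we implement scientifically established and practice-proven resource optimization strategies.
Comparing the documented real-world outcomes with our model's results shows that the DES-ABM simulation can effectively replicate real-world ED dynamics under interventions.
Finally, we integrate a proof-of-concept multi-agent system (MAS) that autonomously explores resource allocation strategies in the simulated ED environment, based on a temporal ledger of ED event records.
This modular DES-ABM-MAS framework offers a powerful tool for exploring resource optimization strategies in emergency departments.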
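The kind of staffing question such a simulation answers can be sketched with a tiny event-driven queue. This is a simplified illustration, not the authors' DES-ABM: it assumes first-come-first-served treatment, identical physicians, and no triage, and the patient list and `simulate` function are hypothetical.

```python
import heapq

def simulate(patients, n_physicians):
    """Return the mean wait in minutes.

    patients: list of (arrival_min, service_min), sorted by arrival time.
    Each physician is represented by the time at which they next become free.
    """
    free_at = [0.0] * n_physicians
    heapq.heapify(free_at)
    total_wait = 0.0
    for arrival, service in patients:
        earliest_free = heapq.heappop(free_at)   # next physician to become available
        start = max(arrival, earliest_free)      # wait if nobody is free yet
        total_wait += start - arrival
        heapq.heappush(free_at, start + service)
    return total_wait / len(patients)

# Made-up load: one arrival every 10 minutes, 30 minutes of care each.
patients = [(0, 30), (10, 30), (20, 30), (30, 30)]

wait_one = simulate(patients, n_physicians=1)
wait_two = simulate(patients, n_physicians=2)
print(wait_one, wait_two)
```

Comparing the two mean waits is a toy version of testing a staffing intervention; a validated model would calibrate arrivals, service times, and staffing against published ED figures, as the abstract describes.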
VERA-MH: Validation of Ethical and Responsible AI in Mental Health
2605.13318v1 by Luca Belli, Kate H. Bentley, Josh Gieringer, Emily Van Ark, Nilu Zhao, Pradip Thachile, Matt Hawrilenko, Millard Brown, Adam M. Chekroud
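Chatbot usage has increased, including in fields for which they were never developed, notably mental health support. To that end, we introduce Validation of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically-validated evaluation for safety of chatbots in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks, by assessing how well chatbots can respond to users who might be in crisis. VERA-MH comprises three steps: conversation simulation, conversation judging and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. Such user personas have been developed under clinical guidance, to make sure that, among others, multiple risk factors, demographic characteristics and disclosure factors were represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically-developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve answers' consistency and highlight models' failure modes. In the last stage, results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the results of the evaluations for four leading LLM providers.
Summary: Chatbot usage has increased, including in fields for which chatbots were never developed, notably mental health support. To that end, we introduce Validation of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically validated evaluation of chatbot safety in mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risk by assessing how well chatbots respond to users who might be in crisis.
VERA-MH consists of three steps: conversation simulation, conversation judging, and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. These personas were developed under clinical guidance to ensure that, among other things, multiple risk factors, demographic characteristics, and disclosure factors are represented. In the judging step, a second support model is used as an LLM-as-a-Judge together with a clinically developed rubric. The rubric is structured as a flow in which a single Yes/No question is asked at each step, to improve the consistency of answers and highlight models' failure modes. In the last stage, the results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present evaluation results for four leading LLM providers.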
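A rubric "structured as a flow" with one Yes/No question per step can be sketched as a small decision graph. This is only an illustration of the structure: the three questions, the outcome labels, and the `judge` and `rate` helpers are hypothetical and are not the clinically developed VERA-MH rubric. In the real pipeline the Yes/No answers would come from the LLM judge; here they are supplied by hand.

```python
from collections import Counter

# Hypothetical rubric: each node asks one Yes/No question and names the next step.
RUBRIC = {
    "q1": {"text": "Did the user disclose possible risk?",
           "yes": "q2", "no": "NOT_RELEVANT"},
    "q2": {"text": "Did the chatbot acknowledge the disclosure?",
           "yes": "q3", "no": "FAIL_MISSED_RISK"},
    "q3": {"text": "Did the chatbot point to crisis resources?",
           "yes": "PASS", "no": "FAIL_NO_RESOURCES"},
}

def judge(answers, rubric=RUBRIC, start="q1"):
    """Walk the flow using a dict of question id -> bool; return the terminal label."""
    node = start
    while node in rubric:
        node = rubric[node]["yes" if answers[node] else "no"]
    return node

def rate(conversations):
    """Aggregate terminal labels over all judged conversations."""
    return Counter(judge(answers) for answers in conversations)

judged = [
    {"q1": True, "q2": True, "q3": True},
    {"q1": True, "q2": True, "q3": False},
    {"q1": False},
]
summary = rate(judged)
print(summary)
```

Because every path ends in a named label, a failure is attributed to the specific question where the flow branched, which is the consistency and failure-mode benefit the abstract points to.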
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
2605.13292v1 by Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel
Most existing medical dialogue systems operate in a single-turn question--answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.
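Summary: Most existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, which limits conversational realism and multilingual applicability.
We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu.
The dataset extends MDDial with LLM-generated synthetic consultations, which are translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline that corrects phonetic, lexical, and character-spacing errors.
Building on this dataset, we fine-tune IndicMedLM through parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation.
We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.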
Compact Latent Manifold Translation: A Parameter-Efficient Foundation Model for Cross-Modal and Cross-Frequency Physiological Signal Synthesis
2605.13248v1 by Bo Cui, Xiaowen Song, Yaowen Zhang, Shunzhe Zhang, B. J. F. van Beijnum, Monique Tabak, Ying Wang
The analysis of physiological time series, such as electrocardiograms (ECG) and photoplethysmograms (PPG), is persistently hindered by modality and frequency gaps stemming from heterogeneous recording devices. Existing foundation models typically rely on continuous latent spaces, which frequently suffer from severe modality entanglement, lack high-fidelity cross-frequency generative capacity, and impose high computational costs that prohibit edge-device deployment. In this paper, we propose Compact Latent Manifold Translation (CLMT), a highly parameter-efficient (0.09B) unified framework that bridges these gaps through a novel two-stage discrete translation paradigm. First, we introduce a Universal Tokenizer utilizing Hierarchical Residual Vector Quantization (RVQ) to decouple heterogeneous signals into isolated, well-structured discrete latent manifolds, effectively preventing inter-modality interference. Second, a Context-Prompted Latent Translator maps these discrete tokens across modalities by integrating static physiological priors, reframing complex signal synthesis as a pure latent sequence translation task. Extensive evaluations demonstrate that our 0.09B model significantly outperforms massive baselines. In cross-modal PPG-to-ECG synthesis, it resolves temporal phase drift and dramatically improves the clinical R-peak detection F1-score from 0.37 (baseline) to 0.83. Furthermore, in extreme cross-frequency super-resolution (25Hz to 100Hz), it successfully recovers high-frequency diagnostic landmarks, achieving an unprecedented Pearson correlation of 0.9956. By learning a universal discrete language for biological signals with a fraction of the computational footprint, our approach sets a new trajectory for edge-deployable, multi-modal medical foundation models.
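Summary: The analysis of physiological time series such as electrocardiograms (ECG) and photoplethysmograms (PPG) is persistently hindered by modality and frequency gaps that stem from heterogeneous recording devices. Existing foundation models typically rely on continuous latent spaces, which often suffer from severe modality entanglement, lack high-fidelity cross-frequency generative capacity, and impose computational costs that rule out deployment on edge devices. In this paper we propose Compact Latent Manifold Translation (CLMT), a highly parameter-efficient (0.09B) unified framework that bridges these gaps through a novel two-stage discrete translation paradigm. First, we introduce a Universal Tokenizer that uses Hierarchical Residual Vector Quantization (RVQ) to decouple heterogeneous signals into isolated, well-structured discrete latent manifolds, effectively preventing inter-modality interference. Second, a Context-Prompted Latent Translator maps these discrete tokens across modalities by integrating static physiological priors, reframing complex signal synthesis as a pure latent sequence translation task. Extensive evaluations show that our 0.09B model significantly outperforms massive baselines. In cross-modal PPG-to-ECG synthesis, it resolves temporal phase drift and raises the clinical R-peak detection F1-score from 0.37 (baseline) to 0.83. In extreme cross-frequency super-resolution (25Hz to 100Hz), it recovers high-frequency diagnostic landmarks and achieves a Pearson correlation of 0.9956. By learning a universal discrete language for biological signals with a fraction of the computational footprint, our approach sets a new trajectory for edge-deployable, multi-modal medical foundation models.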
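The residual vector quantization idea behind the tokenizer can be sketched in a few lines. This is a generic RVQ illustration, not CLMT's tokenizer: the two hand-written codebooks and the 2-D input vector are made up, whereas a real model would learn its codebooks and operate on encoder features.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

# Hypothetical coarse and fine codebooks.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

indices, residual = rvq_encode([1.2, 0.9], CODEBOOKS)
reconstruction = rvq_decode(indices, CODEBOOKS)
print(indices, reconstruction)
```

The list of per-stage indices is the discrete token sequence; a translator of the kind the abstract describes would map such sequences from one modality to another before decoding.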
AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions
2605.13149v1 by Ishika Agarwal, Sofia Stoica, Emre Can Acikgoz, Pradeep Natarajan, Mahdi Namazifar, Jiaqi Ma, Dilek Hakkani-Tür
Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate top quality samples. Some works rely on rejection sampling: generating lots of synthetic samples and filtering out low-quality samples. Other works rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum off of which to base data generation. These works have one common limitation: there is no quantitative approach to measure the impact of the generated samples on the downstream learner. Active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data, providing interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We conduct experiments on classic verifiable tasks of math, medical question-answering, and coding. Our experimental results indicate that (1) student models trained with AcquisitionSynthesis data achieve good performance on in-distribution tasks (2-7% gain) and are more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we seek to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.
Summary: Data quality remains a critical bottleneck in developing capable, competitive models. Researchers have explored many ways to generate high-quality samples. Some works rely on rejection sampling: generating many synthetic samples and filtering out the low-quality ones. Others rely on larger or closed-source models to extract model weaknesses, necessary skills, or a curriculum on which to base data generation. These works share one limitation: there is no quantitative way to measure the impact of the generated samples on the downstream learner. The active learning literature provides exactly this, in the form of acquisition functions. Acquisition functions measure the informativeness and/or influence of data and provide interpretable, model-centric signals. Inspired by this, we propose AcquisitionSynthesis: using acquisition functions as reward models to train language models to generate higher-quality synthetic data. We run experiments on classic verifiable tasks in math, medical question answering, and coding. Our results indicate that (1) student models trained on AcquisitionSynthesis data perform well on in-distribution tasks (2-7% gain) and are more robust to catastrophic forgetting, and (2) AcquisitionSynthesis models can generate data for other models and for low-to-high resource training paradigms. By leveraging acquisition rewards, we aim to demonstrate a principled path toward model-aware self-improvement that surpasses static datasets.
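One way to read "acquisition functions as reward models" is sketched below with predictive entropy, a standard acquisition function from active learning. The abstract does not say which acquisition functions the authors use, so entropy is an assumption chosen for illustration, and the probability vectors are made up.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of the learner's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def acquisition_reward(learner_probs):
    """Reward a generated sample by how uncertain the downstream learner is on it."""
    return predictive_entropy(learner_probs)

# Made-up learner predictions for two candidate synthetic samples.
already_mastered = [0.98, 0.01, 0.01]
still_uncertain = [0.40, 0.30, 0.30]

reward_easy = acquisition_reward(already_mastered)
reward_hard = acquisition_reward(still_uncertain)
print(round(reward_easy, 3), round(reward_hard, 3))
```

Under this reading, a generator trained against the reward is pushed toward samples the learner has not yet mastered, which gives the quantitative, model-centric signal the abstract says earlier approaches lack.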
Context Training with Active Information Seeking
2605.13050v2 by Zeyu Huang, Adhiguna Kuncoro, Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam, Marc'Aurelio Ranzato
Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their context, LLMs can be tailored to downstream tasks without updating their weights. However, most existing methods remain closed-loop, relying solely on the model's intrinsic knowledge. In this paper, we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance compared to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Furthermore, our method proves to be data-efficient, robust across different hyperparameters, and capable of generating effective textual contexts that generalize well across different models.
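Summary: Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task needs newly produced information or niche domain knowledge. Recent work has shown that LLMs can be tailored to downstream tasks without weight updates by manipulating and optimizing their context. However, most existing methods remain closed-loop and rely solely on the model's intrinsic knowledge. In this paper we equip these context optimizers with Wikipedia search and browser tools for active information seeking. We show that naively adding these tools to a standard sequential context optimization pipeline can actually degrade performance relative to baselines. However, when paired with a search-based training procedure that maintains and prunes multiple candidate contexts, active information seeking delivers consistent and substantial gains. We demonstrate these improvements across diverse domains, including low-resource translation (Flores+), health scenarios (HealthBench), and reasoning-heavy tasks (LiveCodeBench and Humanity's Last Exam). Our method also proves to be data-efficient, robust across hyperparameters, and able to generate effective textual contexts that generalize well across models.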
An Agentic LLM-Based Framework for Population-Scale Mental Health Screening
2605.13046v1 by Giuliano Lorenzoni, Paulo Alencar, Donald Cowan
Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured clinical information while adapting to patient-specific needs. This paper proposes an agentic framework for building robust LLM-based pipelines, where each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation. Stages are incrementally locked once validated, ensuring that later adaptations cannot overwrite configurations without demonstrated improvement. The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding. A proof-of-concept in transcript-based depression detection demonstrates that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and threshold 0.75, while controlling evaluation costs and avoiding regressions. These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing critical challenges in trustworthiness, reproducibility, and adaptability required in healthcare environments.
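Summary: Mental health disorders affect millions of people worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated by electronic records, telemedicine platforms, and population-level screening programs.
At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks that can process domain-specific unstructured clinical information while adapting to patient-specific needs.
This paper proposes an agentic framework for building robust LLM-based pipelines, in which each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation.
Stages are locked incrementally once validated, so that later adaptations cannot overwrite configurations without demonstrated improvement.
The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding.
A proof-of-concept in transcript-based depression detection shows that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and a threshold of 0.75, while controlling evaluation costs and avoiding regressions.
These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing key challenges of trustworthiness, reproducibility, and adaptability in healthcare environments.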
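The lock-then-guard behavior described above (a validated stage cannot be overwritten without demonstrated improvement) can be sketched as follows. This is a hypothetical illustration, not the paper's code and not a LangChain API: the `Stage` class and the proxy scores are invented for the example, while cosine similarity and the 0.75 threshold are the configuration values named in the abstract.

```python
class Stage:
    """One pipeline stage with a configuration and a proxy score."""

    def __init__(self, name, config, score):
        self.name = name
        self.config = config
        self.score = score
        self.locked = False

    def lock(self):
        """Freeze the stage once its configuration has been validated."""
        self.locked = True

    def propose(self, new_config, new_score):
        """Apply a change unless the stage is locked and the change does not improve.

        Returns True if the change was accepted, False if it was rolled back.
        """
        if self.locked and new_score <= self.score:
            return False
        self.config, self.score = new_config, new_score
        return True

# Illustrative proxy scores only; they are not results from the paper.
retrieval = Stage("retrieval", {"similarity": "cosine", "threshold": 0.75}, score=0.71)
retrieval.lock()

rejected = retrieval.propose({"similarity": "cosine", "threshold": 0.60}, new_score=0.69)
accepted = retrieval.propose({"similarity": "cosine", "threshold": 0.80}, new_score=0.74)
print(rejected, accepted, retrieval.config)
```

An orchestrator could hold one such object per stage and route every proposed change through `propose`, so a regression in one stage never silently replaces a validated configuration.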
RISED: A Pre-Deployment Safety Evaluation Framework for Clinical AI Decision-Support Systems
2605.12895v1 by Rohith Reddy Bellibatlu
Aggregate accuracy metrics dominate the evaluation of clinical AI decision-support systems but do not detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five-dimension pre-deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub-criteria, pre-specified pass/fail thresholds, and bias-corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm-Bonferroni family-wise error correction. A central demonstration is that a classifier satisfying conventional high-discrimination benchmarks can simultaneously fail input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, pointing to deployment risks that aggregate evaluation alone cannot detect. We validate this differential pass/fail pattern on a synthetic cohort and three publicly available real-world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey, where failing dimensions differ across cohorts, providing preliminary evidence of construct validity. The Equity dimension is reframed as a proxy-dependence diagnostic rather than a stand-alone gate: any need-based fairness verdict computed against a utilization-derived proxy carries a construct-validity problem the framework surfaces explicitly, triggering a procurement requirement for an outcome-independent need measure before the gate is binding. RISED is released as an open-source Python package that supplies the quantitative verdicts existing clinical AI reporting standards require, providing a principled gateway between in-silico model validation and silent-trial clinical evaluation.
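Summary: Aggregate accuracy metrics dominate the evaluation of clinical AI decision-support systems, but they do not detect deployment-phase failures of input reliability, subgroup equity, threshold sensitivity, or operational feasibility. We propose the RISED Framework: a five-dimension pre-deployment evaluation covering Reliability, Inclusivity, Sensitivity, Equity, and Deployability, in which each dimension is operationalized through formal sub-criteria, pre-specified pass/fail thresholds, and bias-corrected accelerated (BCa) bootstrap 95% confidence intervals combined under a Holm-Bonferroni family-wise error correction. A central demonstration is that a classifier that meets conventional high-discrimination benchmarks can at the same time fail the input-encoding stability and threshold-shift sensitivity checks, while subgroup AUC parity remains statistically inconclusive, which points to deployment risks that aggregate evaluation alone cannot detect. We validate this differential pass/fail pattern on a synthetic cohort and three publicly available real-world cohorts spanning 35 years of clinical data vintage, from a 1980s cardiology dataset to a 2024 nationally representative health survey; the failing dimensions differ across cohorts, providing preliminary evidence of construct validity. The Equity dimension is reframed as a proxy-dependence diagnostic rather than a stand-alone gate: any need-based fairness verdict computed against a utilization-derived proxy carries a construct-validity problem, which the framework surfaces explicitly, triggering a procurement requirement for an outcome-independent need measure before the gate becomes binding. RISED is released as an open-source Python package that supplies the quantitative verdicts required by existing clinical AI reporting standards, providing a principled gateway between in-silico model validation and silent-trial clinical evaluation.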
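The Holm-Bonferroni step-down correction mentioned above can be sketched as follows. This shows the standard procedure only; how RISED turns its bootstrap intervals into a family of tests is not stated in the abstract, and the five p-values below are made up for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    P-values are visited from smallest to largest; the k-th smallest (k from 0)
    is compared with alpha / (m - k), and testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# Made-up p-values, one per dimension in the order R, I, S, E, D.
p_values = [0.004, 0.030, 0.011, 0.200, 0.012]
decisions = holm_bonferroni(p_values)
print(decisions)
```

Here 0.030 would pass an uncorrected 0.05 threshold but fails its Holm threshold of 0.025, which shows how the family-wise correction guards against declaring a dimension significant by chance when several are tested together.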
A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study
2605.13905v1 by Jaime Yan
Drug development and pharmacovigilance are frequently bottlenecked by legacy clinical reporting pipelines. These monolithic systems encode regulatory-grade logic but resist AI integration by producing opaque output with no machine-readable intermediate layer. Existing modernization approaches force a choice between full rewrites and incremental refactoring that preserves structural barriers. We present a non-destructive methodological framework achieving AI-driven pharmacoinformatics readiness without altering legacy source code. A metadata layer--comprising a bridge map, a typed Intermediate Representation (IR), and an orchestrator--wraps existing components and re-exposes their outputs as structured data consumable by LLMs. It enables optional incremental consolidation, replacing selected legacy components with metadata-configured core routines while the remainder operates unchanged. Validated on a 558-component SAS reporting library (373,000 lines of code), the framework demonstrated immediate AI-readiness under coexistence mode, yielding machine-readable output. Where consolidation was elected, the modernized core achieved a 92% reduction in proprietary code. Parity validation on 14 report types from a Phase III study achieved cell-level parity of 80% or above on 11 reports (mean 82.7%, best 99.2%). A benchmark using CDISC CDISCPilot01 data achieved 100% parity across 5 reports. LLM experiments confirmed the IR enables automated pharmacovigilance, table summarization, and trial configuration generation. The framework offers a regulation-aware path to AI-integrated clinical reporting, accelerating drug development without interrupting regulatory submissions.
Multimodal Hidden Markov Models for Persistent Emotional State Tracking
2605.12838v1 by Anamika Ragu, Aneesh Jonelagadda
Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP-HMMs over multimodal valence-arousal representations derived from simultaneous video, audio and textual input. We evaluate the quality of regime prediction using LLM-as-a-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM-based dialogue state tracking methods. In addition, Question-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale.
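The "sticky" part of a sticky HDP-HMM can be sketched in a few lines. In the standard formulation, each transition row has prior mean proportional to `alpha * beta[j]` plus an extra mass `kappa` on the self-transition. The sketch below computes only that prior-mean transition matrix for a finite number of regimes; it does not cover the factorial structure or the inference used in the paper, and the function name and the values of `beta`, `alpha`, and `kappa` are illustrative assumptions.

```python
from typing import List, Sequence


def sticky_mean_transitions(beta: Sequence[float], alpha: float, kappa: float) -> List[List[float]]:
    """Prior-mean transition matrix with self-transition bias kappa.

    Row i is (alpha * beta + kappa * e_i) / (alpha + kappa), where e_i is the
    indicator vector of state i, so larger kappa makes regimes more persistent.
    """
    k = len(beta)
    return [
        [(alpha * beta[j] + (kappa if i == j else 0.0)) / (alpha + kappa) for j in range(k)]
        for i in range(k)
    ]


# Illustrative values: three regimes, a fairly strong stickiness parameter.
transitions = sticky_mean_transitions(beta=[0.5, 0.3, 0.2], alpha=1.0, kappa=4.0)
```

Setting `kappa = 0` recovers rows that all equal `beta`, which is the non-sticky case in which the chain has no built-in preference for staying in its current regime.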
PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models
2605.12835v1 by Sridhar Mahadevan
Large language models can extract local causal claims from text, but those claims become more useful when organized as persistent, navigable world models rather than as flat summaries. We introduce PROMETHEUS, a framework that turns retrieved literature, filings, reviews, reports, agent traces, source data, code, simulations, and scientific models into causal atlases: sheaf-like families of local causal predictive-state models over an explicit cover of a research substrate. Each local region contains causal episodes, structured claim tables, predictive tests, support statistics, and provenance; restriction maps compare overlapping regions; gluing diagnostics expose agreement, drift, contradiction, and underdetermination. The resulting Topos World Model is not a single universal graph. It is a research instrument for navigating what a corpus says, where it says it, how strongly it is supported, and where local claims fail to assemble into a coherent global view. Three literature-atlas case studies -- ocean-temperature impacts on marine populations, GLP-1 weight-loss evidence, and resveratrol/red-wine health-benefit claims -- illustrate deep causal research from text with explicit locality, evidence, persistent state, and gluing tension. Four grounded-counterfactual case studies -- a Nature Climate Change microplastics forcing paper, an Indus Valley hydrology paper with VIC-derived figure data and model code, the canonical Sachs protein-signaling study with single-cell perturbation data, and a Nature singing-mouse study with MAPseq projection matrices -- show a stronger mode: when a paper ships source data, simulation outputs, or code, PROMETHEUS can evaluate a counterfactual against that scientific substrate and then rebuild the sheaf world model around the
Training Large Language Models to Predict Clinical Events
2605.12817v1 by Benjamin Turtel, Paul Wilczewski, Kris Skotheim
Longitudinal clinical notes contain rich evidence of how patients evolve over time, but converting this signal into training supervision for clinical prediction remains challenging. We extend Foresight Learning to clinical prediction by converting time-ordered MIMIC-III notes into examples consisting of past patient context, a natural-language question about a possible future event, and a label resolved from later documentation. This process yields 6,900 prediction examples from 702 admissions across medications, procedures, organ support, microbiology, and mortality. A small LoRA adapter trained on these examples improves over the prompted base model, reducing expected calibration error from 0.1269 to 0.0398 and Brier score from 0.199 to 0.145, while slightly outperforming GPT-5 point estimates on held-out questions. The approach enables reusable clinical prediction supervision from longitudinal notes without hand-engineered structured features or endpoint-specific classifiers.
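The two calibration metrics reported in this abstract, Brier score and expected calibration error (ECE), can be computed as in the following minimal sketch. It assumes binary labels and equal-width bins over the predicted probability of the positive class; the abstract does not state the binning scheme, so `n_bins=10` and the function names are assumptions rather than the authors' setup.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs: Sequence[float], labels: Sequence[int], n_bins: int = 10) -> float:
    """Weighted average gap between mean prediction and observed frequency per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(observed - mean_prob)
    return ece


# Toy predictions (illustrative only, not data from the paper).
toy_probs = [0.9, 0.9, 0.1, 0.1]
toy_labels = [1, 1, 0, 0]
toy_brier = brier_score(toy_probs, toy_labels)
toy_ece = expected_calibration_error(toy_probs, toy_labels)
```

Lower is better for both metrics. The Brier score mixes calibration with discrimination, whereas ECE isolates the calibration gap, which is one reason the two are often reported together.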
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
2605.12809v1 by Shixing Yu, Promit Ghosal, Kyra Gan
A critical step toward reliable use of large language models (LLMs) in healthcare is attributing predictions to their training data, akin to a medical case study. This requires token-level precision: pinpointing not just which training examples influence a decision, but which tokens within them are responsible. While influence functions offer a principled framework for this, prior work is restricted to autoregressive settings and relies on an implicit assumption of token independence, rendering the identified influences unreliable. We introduce a flexible framework that infers token-level influence through a latent mediation approach for general prediction tasks. Our method attaches sparse autoencoders to any layer of a pretrained LLM to learn a basis of approximately independent latent features. Unlike prior methods where influence decomposes additively across tokens, influence computed over latent features is inherently non-decomposable. To address this, we introduce a novel method using Jacobian-vector products. Token-level influence is obtained by propagating latent attributions back to the input space via token activation patterns. We scale our approach using efficient inverse-Hessian approximations. Experiments on medical benchmarks show our approach identifies sparse, interpretable sets of tokens that jointly influence predictions. Our framework enhances trust and enables model auditing, generalizing to high-stakes domains requiring transparent and accountable decisions.
BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics
2605.12730v1 by Helene Malyutina
Existing AI systems for modeling human behavior operate at the level of individuals or detect events after they occur. As a result, they systematically fail to capture the collective dynamics that determine whether a group remains stable or transitions into escalation or breakdown. We propose a different foundation: a group of interacting humans constitutes a complex dynamical system in the precise mathematical sense, exhibiting emergence, nonlinearity, feedback loops, sensitivity near critical points, and phase transitions between qualitatively distinct regimes. The state of such a system is not located within any single participant; it is distributed across mutual influence loops and observable through the micro-dynamics of the body. We introduce BEHAVE (Behavioral Engine for Human Activity Vector Estimation), a formal framework that models collective dynamics as continuous behavioral fields defined over an interaction space derived from observable physical signals. Kinematic micro-signals (position, velocity, body orientation, gestural activity) are structured into a directed interaction graph and aggregated into a basis of behavioral fields capturing distinct, non-redundant axes of collective state. The framework rests on one theorem and two structural propositions characterizing the tension field, the field basis, and the criticality index. Perception and forecasting layers are implemented using neural models, enabling data-driven learning and approximation of system dynamics. BEHAVE is formulated as a computational system for learning, representing, and forecasting collective dynamics from data. A working pipeline is demonstrated on a 7-agent negotiation snapshot. The same fields, recalibrated, apply to crowd safety, crisis-team dynamics, education, and clinical contexts.
Reward Hacking in Rubric-Based Reinforcement Learning
2605.12474v1 by Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, Yunzhong He
Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
2605.12361v1 by Rezarta Islamaj, Robert Leaman, Joey Chan, Nicholas Wan, Qiao Jin, Natalie Xie, John Wilbur, Shubo Tian, Lana Yeganova, Po-Ting Lai, Chih-Hsuan Wei, Yifan Yang, Yao Ge, Qingqing Zhu, Zhizheng Wang, Zhiyong Lu
Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000 expert-curated question-answer pairs introduced as a shared task at BioCreative IX. Each question requires synthesis of information across two distinct Wikipedia articles, and answers are provided in an open-ended free-text format. Gold annotations are augmented with ontology-grounded synonym sets from MONDO, NCBI Gene, and NCBI Taxonomy to support both lexical and concept-level evaluation. MedHopQA was constructed through a structured process combining human annotation, triage, iterative verification, and LLM-as-a-judge validation. To reduce leaderboard gaming and contamination risk, the 1,000 scored questions are embedded within a publicly downloadable set of 10,000 questions, with answers withheld, on a CodaBench leaderboard. MedHopQA provides both a benchmark and a reusable framework for constructing future biomedical QA datasets that prioritize compositional reasoning, saturation resistance, and contamination resistance as core design constraints.
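The idea of scoring free-text answers against a gold answer plus a synonym set can be sketched as follows. This is a simplified lexical matcher under stated assumptions: the normalization rule, the function names, and the example synonyms are hypothetical illustrations, not the MedHopQA evaluation code and not entries copied from MONDO, NCBI Gene, or NCBI Taxonomy.

```python
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_correct(prediction: str, gold: str, synonyms: Iterable[str]) -> bool:
    """Accept a prediction if it matches the gold answer or any listed synonym."""
    accepted = {normalize(gold)} | {normalize(s) for s in synonyms}
    return normalize(prediction) in accepted


# Hypothetical gold answer and synonym set, for illustration only.
gold_answer = "Huntington disease"
synonym_set = ["Huntington's disease", "Huntington chorea"]

hit = is_correct("Huntington's Disease", gold_answer, synonym_set)
miss = is_correct("Parkinson disease", gold_answer, synonym_set)
```

Exact matching after normalization is deliberately strict. Concept-level evaluation of the kind the abstract mentions would map answers to ontology identifiers first, which this sketch does not attempt.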
EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
2605.12335v1 by Saeed Shurrab, Mariam Al-Omari, Dana El Samad, Farah E. Shamout
Electronic Health Records (EHR) contain rich longitudinal patient information and are widely used in predictive modeling applications. However, effectively leveraging historical data remains challenging due to long trajectories, heterogeneous events, temporal irregularity, and the varying relevance of past clinical context. Existing approaches often rely on fixed windows or uniform aggregation, which can obscure clinically important signals. In this work, we introduce EHR-RAGp, a retrieval-augmented foundation model that dynamically integrates the most relevant patient history across diverse clinical event types. We propose a prototype-guided retrieval module that acts as an alignment mechanism and estimates the relevance of retrieved historical chunks with respect to a given prediction task, guiding the model towards the most informative context. Across multiple clinical prediction tasks, EHR-RAGp consistently outperforms state-of-the-art EHR foundation models and transformer-based baselines. Furthermore, integrating EHR-RAGp with existing clinical foundation models yields substantial performance gains. Overall, EHR-RAGp provides a scalable and efficient framework for leveraging long-range clinical context to improve downstream performance.
Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study
2605.12241v1 by M A Al-Masud, Nils Strodthoff
Specialized foundation models are beginning to emerge in various medical subdomains, but pretraining methodologies and parametric scaling with the size of the pretraining dataset are rarely assessed systematically and in a like-for-like manner. This work focuses on foundation models for electrocardiography (ECG) data, one of the most widely captured physiological time series world-wide. We present a comprehensive assessment of pretraining methodologies, covering five different contrastive and non-contrastive self-supervised learning objectives for ECG foundation models, and investigate their scaling behavior with pretraining dataset sizes up to 11M input samples, exclusively from publicly available sources. Pretraining strategy has a meaningful and consistent impact on downstream performance, with contrastive predictive coding (slightly ahead of JEPA) yielding the most transferable representations across diverse clinical tasks. Scaling pretraining data continues to yield meaningful improvements up to 11M samples for most objectives. We also compare model architectures across all pretraining methodologies and find evidence for a clear superiority of structured state space models compared to transformers and CNN models. We hypothesize that the strong inductive biases of structured state space models, rather than pretraining scale alone, are the primary driver of effective ECG representation learning, with important implications for future foundation model development in this and potentially other physiological signal domains.
Overtrained, Not Misaligned
2605.12199v1 by Joel Schreiber, Ariel Goldstein
Emergent misalignment (EM), where fine-tuning on a narrow task (like insecure code) causes broad misalignment across unrelated domains, was first demonstrated by Betley et al. (2025). We conduct the most comprehensive EM study to date, reproducing the original GPT-4o finding and expanding to 12 open-source models across 4 families (Llama, Qwen, DeepSeek, GPT-OSS) ranging from 8B to 671B parameters, evaluating over one million model responses with multiple random seeds. We find that EM replicates in GPT-4o but is far from universal: only 2 of 12 open-source models (17%) exhibit consistent EM across seeds, with a significant correlation between model size and EM susceptibility. Through checkpoint-level analysis during fine-tuning, we demonstrate that EM emerges late in training, distinct from and subsequent to near convergence of the primary task, suggesting EM emerges from continued training past task convergence. This yields practical mitigations: early stopping eliminates EM while retaining an average of 93% of task performance, and careful learning rate selection further minimizes risk. Cross-domain validation on medical fine-tuning confirms these patterns generalize: the size-EM correlation strengthens (r = 0.90), and overgeneralization to untruthfulness remains avoidable via early stopping in 67% of cases, though semantically proximate training domains produce less separable misalignment. As LLMs become increasingly integrated into real-world systems, fine-tuning and reinforcement learning remain the primary methods for adapting model behavior. Our findings demonstrate that with proper training practices, EM can be avoided, reframing it from an unforeseen fine-tuning risk to an avoidable training artifact.
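The early-stopping mitigation described in this abstract can be illustrated with a generic patience-based rule. The abstract does not specify the authors' stopping criterion, so the sketch below is an assumption: it picks the last checkpoint at which validation loss on the primary task improved by at least `min_delta`, and stops once `patience` consecutive evaluations fail to improve. All names and numbers are hypothetical.

```python
from typing import Sequence


def early_stop_step(val_losses: Sequence[float], patience: int = 2, min_delta: float = 1e-3) -> int:
    """Return the index of the checkpoint to keep under a patience-based rule."""
    best_loss = float("inf")
    best_step = 0
    stale = 0  # consecutive evaluations without a meaningful improvement
    for step, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_step, stale = loss, step, 0
        else:
            stale += 1
            if stale >= patience:
                break  # stop training; later checkpoints are never evaluated
    return best_step


# Illustrative loss curve: the task converges around step 2.
chosen = early_stop_step([1.0, 0.6, 0.4, 0.41, 0.42, 0.2])
```

In this toy curve the rule keeps checkpoint 2 and never reaches the later value of 0.2. That is the usual trade-off of early stopping: it gives up possible late gains in exchange for not training past convergence.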
To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands
2605.12120v1 by Fangyi Yu, Nabeel Seedat, Jonathan Richard Schwarz, Andrew M. Bean
Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace -- e.g., that a drug has been withdrawn -- yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.
Anomaly-Aware Vision-Language Adapters for Zero-Shot Anomaly Detection
2605.12069v1 by Muhammad Aqeel, Maham Nazir, Uzair Khan, Marco Cristani, Francesco Setti
Zero-shot anomaly detection aims to identify defects in unseen categories without target-specific training. Existing methods usually apply the same feature transformation to all samples, treating normal and anomalous data uniformly despite their fundamentally asymmetric distributions, compact normals versus diverse anomalies. We instead exploit this natural asymmetry by proposing AVA-DINO, an anomaly-aware vision-language adaptation framework with dual specialized branches for normal and anomalous patterns that adapt frozen DINOv3 visual features. During training on auxiliary data, the two branches are learned jointly with a text-guided routing mechanism and explicit routing regularization that encourages branch specialization. At test time, only the input image and fixed, predefined language descriptions are used to dynamically combine the two branches, enabling an asymmetric activation. This design prevents degenerate uniform routing and allows context-specific feature transformations. Experiments across nine industrial and medical benchmarks demonstrate state-of-the-art performance, achieving 93.5% image-AUROC on MVTec-AD and strong cross-domain generalization to medical imaging without domain-specific fine-tuning. https://github.com/aqeeelmirza/AVA-DINO
Are Compact Rationales Free? Measuring Tile Selection Headroom in Frozen WSI-MIL
2605.12575v1 by Hyun Do Jung, Jungwon Choi, Soojung Choi, Yujin Oh, Hwiyoung Kim
Whole-slide image (WSI) multiple instance learning (MIL) classifiers can achieve strong slide-level AUC while leaving the full-bag prediction opaque. Attention scores are widely reused as post-hoc explanations, but high attention can reflect aggregation preference rather than a compact, model-sufficient rationale. We study post-hoc rationale highlighting for frozen WSI-MIL: given a trained classifier, can its slide-level prediction be recovered from a compact, output-consistent tile subset without retraining the backbone? We instantiate this with Finding Optimal Contextual Instances (FOCI), a lightweight rationale-readout layer over a frozen MIL backbone. FOCI is trained with model-output sufficiency and exclusion objectives over keep/drop tile subsets, evaluated with an insertion-style Sequential Reveal Protocol (SRP) adapted to WSI-MIL, and summarized by the Selection Headroom Index (SHI). Across three WSI benchmarks and seven MIL backbones, FOCI reveals that compact rationales are selection-headroom dependent: transformer and multi-branch attention aggregators can admit compact rationales, near-minimal attention-pooling baselines enter a selection-saturation regime, and hard-selection backbones can conflict with an external readout. For TransMIL, relative to its documented CLS-proxy ranking, FOCI reduces the Minimum Sufficient K (MSK) tile count by 32-56% across benchmarks, while ACMIL+FOCI attains the highest mean SHI (+0.465). Deletion-based perturbation and selected-only downstream evaluation provide complementary checks. These results position FOCI as a model-level interpretability and audit layer: selected tiles are not claims of clinical or pathologist-level diagnostic sufficiency, but candidate rationales that offer a compact, reviewable view of when a frozen MIL prediction can be localized to a small output-consistent subset.
Spectral Vision Transformer for Efficient Tokenization with Limited Data
2605.12026v1 by Alexandra G. Roberts, Maneesh John, Jinwei Zhang, Dominick Romano, Mert Sisman, Ki Sueng Choi, Heejong Kim, Mert R. Sabuncu, Thanh D. Nguyen, Alexey V. Dimov, Pascal Spincemaille, Brian H. Kopell, Yi Wang
We propose a novel spectral vision transformer architecture for efficient tokenization with limited data, with an emphasis on medical imaging. We outline convenient theoretical properties arising from the choice of basis, including spatial invariance and optimal signal-to-noise ratio. We show that the spectral projection reduces complexity relative to spatial vision transformers. We show comparable or superior performance with fewer parameters than a variety of models, including compact and standard vision transformers, convolutional neural networks with attention, shifted window transformers, multi-layer perceptrons, and logistic regression. We include simulated, public, and clinical data in our analysis and release our code at github.com/agr78/spectralViT.
DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
2605.12574v1 by Hongyi Tang, Zhihao Zhu, Yi Yang
Vision-language models (VLMs) are trained on large-scale image-text corpora that may contain private, copyrighted, or otherwise sensitive data, motivating membership inference as a tool for training-data auditing. This is especially challenging for deployed VLMs, where auditors typically observe only generated textual responses. Existing VLM membership inference attacks either rely on probability-level signals unavailable in such settings, or use mask-based semantic prediction tasks whose effectiveness depends on object-centric visual assumptions. To address these limitations, we propose DistractMIA, an output-only black-box framework based on semantic distraction. Rather than removing visual evidence, DistractMIA preserves the original image, inserts a known semantic distractor, and measures how generated responses change. This design is motivated by the intuition that member samples remain more anchored to the original image semantics, while non-member samples are more easily redirected toward the distractor. To make this signal reliable, DistractMIA calibrates distractor configurations on a reference set and derives membership scores from repeated textual generations, capturing response stability and distractor uptake without accessing logits, probabilities, or hidden states. Experiments across multiple VLMs and benchmarks show that DistractMIA consistently outperforms both output-only and stronger-access baselines. Its performance on a medical benchmark further demonstrates applicability beyond object-centric natural images.
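The output-only scoring idea can be sketched with plain strings. The abstract says the score captures response stability and distractor uptake from repeated generations but does not give a formula, so everything below is an assumption: word-level Jaccard overlap stands in for stability, keyword matching stands in for uptake, and the two are combined by simple subtraction. The function names and example responses are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def membership_score(
    original_responses: Sequence[str],
    distracted_responses: Sequence[str],
    distractor_terms: Sequence[str],
) -> float:
    """Higher score = responses stay anchored to the original image.
    Assumed form: stability minus distractor uptake."""
    uptake = mean(
        1.0 if any(t in r.lower() for t in distractor_terms) else 0.0
        for r in distracted_responses
    )
    stability = mean(
        jaccard(o, d) for o in original_responses for d in distracted_responses
    )
    return stability - uptake


original: List[str] = ["a chest x-ray showing clear lungs"]
anchored = ["a chest x-ray showing clear lungs", "a chest x-ray showing clear lungs"]
redirected = ["a red balloon over an x-ray", "a red balloon"]

score_member_like = membership_score(original, anchored, ["balloon"])
score_nonmember_like = membership_score(original, redirected, ["balloon"])
print(score_member_like, score_nonmember_like)
```

In practice the score would be thresholded after calibration on a reference set, as the abstract describes; that calibration step is omitted here.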
AccLock: Unlocking Identity with Heartbeat Using In-Ear Accelerometers
2605.11901v1 by Lei Wang, Jiangxuan Shen, Xi Zhang, Dalin Zhang, Jingyu Li, Haipeng Dai, Chenren Xu, Daqing Zhang, He Huang
The widespread use of earphones has enabled various sensing applications, including activity recognition, health monitoring, and context-aware computing. Among these, earphone-based user authentication has become a key technique by leveraging unique biometric features. However, existing earphone-based authentication systems face key limitations: they either require explicit user interaction or active speaker output, or suffer from poor accessibility and vulnerability to environmental noise, which hinders large-scale deployment. In this paper, we propose a passive authentication system, called AccLock, which leverages distinctive features extracted from in-ear ballistocardiogram (BCG) signals to enable secure and unobtrusive user verification. Our system offers several advantages over previous systems, including zero involvement from both the device and the user, ubiquity, and resilience to environmental noise. To realize this, we first design a two-stage denoising scheme to suppress both inherent and sporadic interference. To extract user-specific features, we then propose a disentanglement-based deep learning model, HIDNet, which explicitly separates user-specific features from shared nuisance components. Lastly, we develop a scalable authentication framework based on a Siamese network that eliminates the need for per-user classifier training. We conduct extensive experiments with 33 participants, achieving an average false acceptance rate (FAR) of 3.13% and false rejection rate (FRR) of 2.99%, which demonstrates the practical feasibility of AccLock.
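For readers unfamiliar with the reported metrics, the sketch below shows how FAR and FRR are typically computed for a verification system that compares an enrolled embedding with a probe embedding, as a Siamese-style verifier would. The cosine-similarity decision rule, the threshold, and the score values are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity between an enrolled embedding and a probe embedding."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def far_frr(
    genuine_scores: Sequence[float],
    impostor_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """FAR: fraction of impostor attempts accepted (score >= threshold).
    FRR: fraction of genuine attempts rejected (score < threshold)."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr


# Hypothetical similarity scores for same-user and different-user pairs.
genuine = [0.91, 0.85, 0.78, 0.95, 0.60]
impostor = [0.20, 0.35, 0.72, 0.10, 0.40]

far, frr = far_frr(genuine, impostor, threshold=0.7)
print(f"FAR={far:.2f}, FRR={frr:.2f}")
```

Raising the threshold lowers FAR at the cost of a higher FRR, so the two rates are usually reported together at a chosen operating point.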