Medical

Publish Date | Title | Authors | Homepage | Code
2025-02-21 | Anatomy-Informed Deep Learning and Radiomics for Automated Neurofibroma Segmentation in Whole-Body MRI | Georgii Kolokolnikov et al. | 2502.15424v1 | null
2025-02-20 | Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease | Elliot Schumacher et al. | 2502.15069v1 | null
2025-02-20 | Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation | Yun-Wei Chu et al. | 2502.15040v1 | null
2025-02-20 | FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis | Fadillah Maani et al. | 2502.14807v1 | link
2025-02-20 | Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning | Juraj Vladika et al. | 2502.14765v1 | null
2025-02-20 | MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders | Maya Varma et al. | 2502.14753v1 | link
2025-02-20 | Data-Constrained Synthesis of Training Data for De-Identification | Thomas Vakili et al. | 2502.14677v2 | null
2025-02-20 | ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation | Angxiao Yue et al. | 2502.14637v1 | link
2025-02-20 | MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models | Shrey Pandit et al. | 2502.14302v1 | null
2025-02-20 | EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement | Wenhui Zhu et al. | 2502.14260v1 | null
2025-02-19 | Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning | Cole Gawin et al. | 2502.14086v1 | null
2025-02-19 | Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging | Shansong Wang et al. | 2502.14064v1 | null
2025-02-19 | Display Field-Of-View Agnostic Robust CT Kernel Synthesis Using Model-Based Deep Learning | Hemant Kumar Aggarwal et al. | 2502.14920v1 | null
2025-02-19 | VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare | Anudeex Shetty et al. | 2502.13775v1 | null
2025-02-19 | PeerQA: A Scientific Question Answering Dataset from Peer Reviews | Tim Baumgärtner et al. | 2502.13668v1 | link
2025-02-19 | Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs | Yushi Feng et al. | 2502.13555v1 | link
2025-02-19 | MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis | Wei Dai et al. | 2502.13524v1 | link
2025-02-19 | MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs | Xinxin You et al. | 2502.14916v1 | null
2025-02-19 | Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion | Shuai Niu et al. | 2502.13509v1 | null
2025-02-19 | Towards a perturbation-based explanation for medical AI as differentiable programs | Takeshi Abe et al. | 2502.14001v1 | null
2025-02-19 | PTB-Image: A Scanned Paper ECG Dataset for Digitization and Image-based Diagnosis | Cuong V. Nguyen et al. | 2502.14909v1 | null
2025-02-19 | RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering | Sichu Liang et al. | 2502.13361v1 | null
2025-02-18 | Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance | Tejas Srinivasan et al. | 2502.13321v1 | null
2025-02-18 | Prediction of Clinical Complication Onset using Neural Point Processes | Sachini Weerasekara et al. | 2502.13290v1 | null
2025-02-18 | SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering? | Yucheng Shi et al. | 2502.13233v1 | null
2025-02-18 | Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions | Taedong Yun et al. | 2502.13135v1 | null
2025-02-18 | Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization | Priyaranjan Pattnayak et al. | 2502.13108v1 | null
2025-02-18 | Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection | Athira J Jacob et al. | 2502.12948v1 | null
2025-02-18 | Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models | Rubing Li et al. | 2502.12825v2 | null
2025-02-18 | LLM Safety for Children | Prasanjit Rath et al. | 2502.12552v1 | link
2025-02-18 | Retrieval-augmented systems can be dangerous medical communicators | Lionel Wong et al. | 2502.14898v1 | null
2025-02-17 | Classifiers of Data Sharing Statements in Clinical Trial Records | Saber Jelodari Mamaghani et al. | 2502.12362v1 | null
2025-02-17 | Relational Norms for Human-AI Cooperation | Brian D. Earp et al. | 2502.12102v1 | null
2025-02-17 | FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction | Jowaria Khan et al. | 2502.14894v1 | null
2025-02-17 | Deep Spatio-Temporal Neural Network for Air Quality Reanalysis | Ammar Kheder et al. | 2502.11941v1 | link
2025-02-17 | Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing | Site Qu et al. | 2502.11715v1 | null
2025-02-17 | LLM Agents Making Agent Tools | Georg Wölflein et al. | 2502.11705v1 | null
2025-02-17 | MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression | Linjie Mu et al. | 2502.11651v1 | link
2025-02-17 | A Survey of Personalized Large Language Models: Progress and Future Directions | Jiahong Liu et al. | 2502.11528v1 | null
2025-02-17 | Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos | Xiangxiang Cui et al. | 2502.11481v1 | null
2025-02-17 | Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation | Yanyan Wang et al. | 2502.11456v1 | link
2025-02-16 | A Survey of LLM-based Agents in Medicine: How far are we from Baymax? | Wenxuan Wang et al. | 2502.11211v1 | null
2025-02-16 | RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer | Shilong Yang et al. | 2502.11179v1 | link
2025-02-16 | Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications | Alexandru Lecu et al. | 2502.11108v1 | link
2025-02-16 | Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration | Xianbing Zhao et al. | 2502.12204v1 | null
2025-02-16 | CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening | Gen Zhou et al. | 2502.11001v1 | link
2025-02-15 | Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images | Sevim Cengiz et al. | 2502.10908v1 | null
2025-02-15 | Breaking Down the Hierarchy: A New Approach to Leukemia Classification | Ibraheem Hamdi et al. | 2502.10899v1 | null
2025-02-15 | An Empirical Analysis of Uncertainty in Large Language Model Evaluations | Qiujie Xie et al. | 2502.10709v1 | link
2025-02-15 | Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model | Jiarui Jin et al. | 2502.10707v1 | link
2025-02-15 | Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction | Leisheng Yu et al. | 2502.10689v1 | null
2025-02-15 | ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis | Xueshen Li et al. | 2502.10620v1 | null
2025-02-15 | Optimizing CNN Architectures for Advanced Thoracic Disease Classification | Tejas Mirthipati et al. | 2502.10614v1 | null
2025-02-14 | PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation | Faruk Ahmed et al. | 2502.10536v1 | null
2025-02-14 | Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks | Venkatesh Sivaraman et al. | 2502.10526v2 | null
2025-02-14 | A Robust Attack: Displacement Backdoor Attack | Yong Li et al. | 2502.10490v1 | null
2025-02-14 | 3D ReX: Causal Explanations in 3D Neuroimaging Classification | Melane Navaratnarajah et al. | 2502.12181v1 | null
2025-02-14 | Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model | Jin Cui et al. | 2502.09947v1 | null
2025-02-14 | TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation | Ju-Hyeon Nam et al. | 2502.09931v1 | null
2025-02-14 | Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos | Weirui Ye et al. | 2502.09886v1 | null
2025-02-14 | HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation | Tianwei Lin et al. | 2502.09838v3 | link
2025-02-13 | Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games | Tong Yang et al. | 2502.09780v1 | null
2025-02-13 | The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention | Bereket A. Yilma et al. | 2502.09757v1 | null
2025-02-13 | A CNN Approach to Automated Detection and Classification of Brain Tumors | Md. Zahid Hasan et al. | 2502.09731v1 | null
2025-02-13 | Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data | Yu Leng et al. | 2502.09715v1 | null
2025-02-13 | Metamorphic Testing for Pose Estimation Systems | Matias Duran et al. | 2502.09460v1 | null
2025-02-13 | Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling | Benjamin D. Killeen et al. | 2502.09688v1 | null
2025-02-13 | Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models | Wiktoria Mieleszczenko-Kowszewicz et al. | 2502.09687v1 | null
2025-02-13 | The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics | Danni Feng et al. | 2502.09247v1 | null
2025-02-13 | From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine | Lukas Buess et al. | 2502.09242v1 | null
2025-02-13 | Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration | Flavio Bertini et al. | 2502.09218v1 | null
2025-02-13 | Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York | Sanskar Sehgal et al. | 2502.09204v1 | null
2025-02-13 | Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia | Jin Cui et al. | 2502.09173v1 | null
2025-02-13 | TastepepAI, An artificial intelligence platform for taste peptide de novo design | Jianda Yue et al. | 2502.12167v1 | null
2025-02-12 | HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification | Valentina Vadori et al. | 2502.08754v1 | link
2025-02-12 | Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion | Lemuel Puglisi et al. | 2502.08560v1 | link
2025-02-12 | Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data | Doudou Zhou et al. | 2502.08547v1 | null
2025-02-12 | EEG Artifact Detection and Correction with Deep Autoencoders | David Aquilué-Llorens et al. | 2502.08686v1 | null
2025-02-12 | SycEval: Evaluating LLM Sycophancy | Aaron Fanous et al. | 2502.08177v1 | null
2025-02-12 | Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models | Hasin Rehana et al. | 2502.09659v1 | null
2025-02-11 | Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature? | Hye Sun Yun et al. | 2502.07963v1 | null
2025-02-11 | An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating | Mohammad Ali Labbaf Khaniki et al. | 2502.07755v1 | null
2025-02-11 | Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension | Wenbo Gong et al. | 2502.07752v2 | null
2025-02-11 | The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation | Raman Dutt et al. | 2502.07516v2 | link
2025-02-11 | KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level | Ruining Deng et al. | 2502.07288v1 | link
2025-02-11 | Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer | Jiaying Lu et al. | 2502.07158v2 | null
2025-02-11 | Explaining 3D Computed Tomography Classifiers with Counterfactuals | Joseph Paul Cohen et al. | 2502.07156v1 | link
2025-02-10 | Interactive Data Harmonization with LLM Agents | Aécio Santos et al. | 2502.07132v1 | null
2025-02-10 | Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML | Mohammad Amir Salari et al. | 2502.07026v1 | null
2025-02-10 | AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements | Adriana Eufrosiana Bora et al. | 2502.07022v1 | null
2025-02-10 | Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium | Amin Adibi et al. | 2502.06693v1 | null
2025-02-10 | Automatic Evaluation of Healthcare LLMs Beyond Question-Answering | Anna Arias-Duart et al. | 2502.06666v1 | null
2025-02-10 | Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging | Mohammed Abdul Hafeez Khan et al. | 2502.06632v1 | null
2025-02-10 | Illegal Waste Detection in Remote Sensing Images: A Case Study | Federico Gibellini et al. | 2502.06607v2 | null
2025-02-10 | FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model | Anna Tegon et al. | 2502.06438v1 | null
2025-02-10 | Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases? | Qingshan Hou et al. | 2502.06289v1 | null
2025-02-10 | Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning | Liuqing Chen et al. | 2502.06134v1 | null
2025-02-10 | Foundation Model of Electronic Medical Records for Adaptive Risk Estimation | Pawel Renc et al. | 2502.06124v1 | null
2025-02-10 | Can ChatGPT Diagnose Alzheimer's Disease? | Quoc-Toan Nguyen et al. | 2502.06907v1 | null
2025-02-09 | Protecting Intellectual Property of EEG-based Neural Networks with Watermarking | Ahmed Abdelaziz et al. | 2502.05931v1 | link

Abstracts

Anatomy-Informed Deep Learning and Radiomics for Automated Neurofibroma Segmentation in Whole-Body MRI

2502.15424v1 by Georgii Kolokolnikov, Marie-Lena Schmalhofer, Lennart Well, Said Farschtschi, Victor-Felix Mautner, Inka Ristow, Rene Werner

Neurofibromatosis Type 1 is a genetic disorder characterized by the development of neurofibromas (NFs), which exhibit significant variability in size, morphology, and anatomical location. Accurate and automated segmentation of these tumors in whole-body magnetic resonance imaging (WB-MRI) is crucial to assess tumor burden and monitor disease progression. In this study, we present and analyze a fully automated pipeline for NF segmentation in fat-suppressed T2-weighted WB-MRI, consisting of three stages: anatomy segmentation, NF segmentation, and tumor candidate classification. In the first stage, we use the MRSegmentator model to generate an anatomy segmentation mask, extended with a high-risk zone for NFs. This mask is concatenated with the input image as anatomical context information for NF segmentation. The second stage employs an ensemble of 3D anisotropic anatomy-informed U-Nets to produce an NF segmentation confidence mask. In the final stage, tumor candidates are extracted from the confidence mask and classified based on radiomic features, distinguishing tumors from non-tumor regions and reducing false positives. We evaluate the proposed pipeline on three test sets representing different conditions: in-domain data (test set 1), varying imaging protocols and field strength (test set 2), and low tumor burden cases (test set 3). Experimental results show a 68% improvement in per-scan Dice Similarity Coefficient (DSC), a 21% increase in per-tumor DSC, and a two-fold improvement in F1 score for tumor detection in high tumor burden cases by integrating anatomy information. The method is integrated into the 3D Slicer platform for practical clinical use, with the code publicly accessible.
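
For readers unfamiliar with the Dice Similarity Coefficient (DSC) reported above, the following is a minimal sketch of the standard definition, DSC = 2|A ∩ B| / (|A| + |B|), for two binary masks. It is not taken from the paper's code: the function name `dice_coefficient`, the flat-list mask representation, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of voxels")
    # Voxels labelled as tumor in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total number of tumor voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks (made-up values): 2 overlapping voxels, 3 positives in each mask.
predicted_mask = [0, 1, 1, 1, 0, 0]
reference_mask = [0, 1, 1, 0, 1, 0]
print(dice_coefficient(predicted_mask, reference_mask))
```

A per-scan DSC would apply this to the whole volume, while a per-tumor DSC would apply it to each connected tumor region separately.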

Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease

2502.15069v1 by Elliot Schumacher, Dhruv Naik, Anitha Kannan

Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. However, their effectiveness in identifying rarer diseases, which are inherently more challenging to diagnose, remains an open question. Rare disease performance is critical with the increasing use of LLMs in healthcare settings. This is especially true if a primary care physician needs to make a rarer prognosis from only a patient conversation so that they can take the appropriate next step. To that end, several clinical decision support systems are designed to support providers in rare disease identification. Yet their utility is limited due to their lack of knowledge of common disorders and difficulty of use. In this paper, we propose RareScale to combine the knowledge of LLMs with expert systems. We jointly use an expert system and an LLM to simulate rare disease chats. This data is used to train a rare disease candidate predictor model. Candidates from this smaller model are then used as additional inputs to a black-box LLM to make the final differential diagnosis. Thus, RareScale allows for a balance between rare and common diagnoses. We present results on over 575 rare diseases, beginning with Abdominal Actinomycosis and ending with Wilson's Disease. Our approach significantly improves the baseline performance of black-box LLMs by over 17% in Top-5 accuracy. We also find that our candidate generation performance is high (e.g. 88.8% on gpt-4o generated chats).
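
The Top-5 accuracy mentioned above counts a case as correct when the true diagnosis appears among the five highest-ranked candidates. The sketch below shows that metric in its usual form; the function name `top_k_accuracy` and the toy differential lists are hypothetical and are not drawn from the RareScale evaluation.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k=5):
    """Fraction of cases whose gold label appears in the first k ranked candidates."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("need one ranked list per gold label")
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Made-up differentials for two simulated cases (illustration only).
differentials = [
    ["Hepatitis", "Hemochromatosis", "Wilson's Disease"],
    ["Appendicitis", "Crohn's Disease", "Diverticulitis"],
]
true_diagnoses = ["Wilson's Disease", "Abdominal Actinomycosis"]
print(top_k_accuracy(differentials, true_diagnoses, k=5))
```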

Reducing Hallucinations of Medical Multimodal Large Language Models with Visual Retrieval-Augmented Generation

2502.15040v1 by Yun-Wei Chu, Kai Zhang, Christopher Malon, Martin Renqiang Min

Multimodal Large Language Models (MLLMs) have shown impressive performance in vision and text tasks. However, hallucination remains a major challenge, especially in fields like healthcare where details are critical. In this work, we show how MLLMs may be enhanced to support Visual RAG (V-RAG), a retrieval-augmented generation framework that incorporates both text and visual data from retrieved images. On the MIMIC-CXR chest X-ray report generation and Multicare medical image caption generation datasets, we show that Visual RAG improves the accuracy of entity probing, which asks whether a medical entity is grounded by an image. We show that the improvements extend both to frequent and rare entities, the latter of which may have less positive training data. Downstream, we apply V-RAG with entity probing to correct hallucinations and generate more clinically accurate X-ray reports, obtaining a higher RadGraph-F1 score.

FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

2502.14807v1 by Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub

Foundation models are becoming increasingly effective in the medical domain, offering pre-trained models on large datasets that can be readily adapted for downstream tasks. Despite progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representations of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text. This represents the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.

Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning

2502.14765v1 by Juraj Vladika, Ivana Hacajová, Florian Matthes

Fact verification (FV) aims to assess the veracity of a claim based on relevant evidence. The traditional approach for automated FV includes a three-part pipeline relying on short evidence snippets and encoder-only inference models. More recent approaches leverage the multi-turn nature of LLMs to address FV as a step-by-step problem where questions asking for additional context are generated and answered until there is enough information to make a decision. This iterative method makes the verification process rational and explainable. While these methods have been tested for encyclopedic claims, exploration on domain-specific and realistic claims is missing. In this work, we apply an iterative FV system on three medical fact-checking datasets and evaluate it with multiple settings, including different LLMs, external web search, and structured reasoning using logic predicates. We demonstrate improvements in the final performance over traditional approaches and the high potential of step-by-step FV systems for domain-specific claims.

MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders

2502.14753v1 by Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari

Medical images are acquired at high resolutions with large fields of view in order to capture fine-grained features necessary for clinical decision-making. Consequently, training deep learning models on medical images can incur large computational costs. In this work, we address the challenge of downsizing medical images in order to improve downstream computational efficiency while preserving clinically-relevant features. We introduce MedVAE, a family of six large-scale 2D and 3D autoencoders capable of encoding medical images as downsized latent representations and decoding latent representations back to high-resolution images. We train MedVAE autoencoders using a novel two-stage training approach with 1,052,730 medical images. Across diverse tasks obtained from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent representations in place of high-resolution images when training downstream models can lead to efficiency benefits (up to 70x improvement in throughput) while simultaneously preserving clinically-relevant features and (2) MedVAE can decode latent representations back to high-resolution images with high fidelity. Our work demonstrates that large-scale, generalizable autoencoders can help address critical efficiency challenges in the medical domain. Our code is available at https://github.com/StanfordMIMI/MedVAE.

Data-Constrained Synthesis of Training Data for De-Identification

2502.14677v2 by Thomas Vakili, Aron Henriksson, Hercules Dalianis

Many sensitive domains -- such as the clinical domain -- lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study -- using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.

ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation

2502.14637v1 by Angxiao Yue, Zichong Wang, Hongteng Xu

Protein backbone generation plays a central role in de novo protein design and is significant for many biological and medical applications. Although diffusion and flow-based generative models provide potential solutions to this challenging task, they often generate proteins with undesired designability and suffer from computational inefficiency. In this study, we propose a novel rectified quaternion flow (ReQFlow) matching method for fast and high-quality protein backbone generation. In particular, our method generates a local translation and a 3D rotation from random noise for each residue in a protein chain, which represents each 3D rotation as a unit quaternion and constructs its flow by spherical linear interpolation (SLERP) in an exponential format. We train the model by quaternion flow (QFlow) matching with guaranteed numerical stability and rectify the QFlow model to accelerate its inference and improve the designability of generated protein backbones, leading to the proposed ReQFlow model. Experiments show that ReQFlow achieves state-of-the-art performance in protein backbone generation while requiring far fewer sampling steps and significantly less inference time (e.g., being 37x faster than RFDiffusion and 62x faster than Genie2 when generating a backbone of length 300), demonstrating its effectiveness and efficiency. The code is available at https://github.com/AngxiaoYue/ReQFlow.
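
As background for the SLERP step mentioned above, here is a minimal sketch of textbook spherical linear interpolation between two unit quaternions in (w, x, y, z) order. It illustrates plain SLERP only and is not the exponential-format flow or any code from the ReQFlow repository; the function name `slerp` and the small-angle threshold are assumptions.

```python
import math


def slerp(q0, q1, t):
    """Textbook SLERP between unit quaternions q0 and q1 for t in [0, 1]."""
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one end to take the shorter arc.
    if dot < 0.0:
        q1 = tuple(-c for c in q1)
        dot = -dot
    dot = min(1.0, dot)  # guard acos against rounding slightly above 1
    theta = math.acos(dot)  # angle between the two quaternions
    if theta < 1e-8:
        # Nearly identical rotations: fall back to linear interpolation.
        out = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        s = math.sin(theta)
        w0 = math.sin((1.0 - t) * theta) / s
        w1 = math.sin(t * theta) / s
        out = tuple(w0 * a + w1 * b for a, b in zip(q0, q1))
    # Renormalize so the result stays a unit quaternion.
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)


identity = (1.0, 0.0, 0.0, 0.0)
# 90-degree rotation about the z-axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
print(slerp(identity, quarter_turn_z, 0.5))
```

Interpolating halfway between the identity and a 90-degree rotation about z gives a 45-degree rotation about the same axis, which is the constant-angular-velocity behaviour that makes SLERP a natural path on rotations.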

MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models

2502.14302v1 by Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding

Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a "not sure" category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.

EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement

2502.14260v1 by Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang

Over the past decade, generative models have achieved significant success in enhancing fundus images. However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are hard to extend to downstream real-world clinical research (e.g., vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with the need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) Multi-dimensional clinical alignment downstream evaluation: In addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: We introduce a novel dataset that promotes comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: Our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at https://github.com/Retinal-Research/EyeBench.
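
To make the PSNR metric named above concrete, the sketch below computes it from its usual definition, PSNR = 10 · log10(MAX² / MSE), over flat lists of pixel intensities. It is an illustration and not part of the EyeBench code; the function name `psnr`, the 8-bit `max_value` default, and the toy pixel values are assumptions.

```python
import math


def psnr(reference, enhanced, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(enhanced):
        raise ValueError("images must contain the same number of pixels")
    # Mean squared error between the reference and the enhanced image.
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Made-up 8-bit intensities for a tiny "image" of four pixels.
clean_pixels = [100, 120, 140, 160]
enhanced_pixels = [110, 110, 150, 150]
print(psnr(clean_pixels, enhanced_pixels))
```

Because PSNR only measures pixel-wise error, two enhanced images can score similarly while differing in vessel continuity, which is the gap the abstract points to.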

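Abstract (translated from Chinese): Over the past decade, generative models have achieved notable success in enhancing fundus images. However, evaluating these models remains a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) Existing denoising metrics (e.g., PSNR, SSIM) are difficult to extend to downstream real-world clinical research (e.g., vessel morphology consistency). 2) Comprehensive evaluation of paired and unpaired enhancement methods is lacking, and expert protocols are needed to assess clinical value accurately. 3) An ideal evaluation system should provide insights that inform the future development of fundus image enhancement. To this end, we propose a new comprehensive benchmark, EyeBench, which provides insights that align enhancement models with clinical needs and lays a foundation for future research to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) Multi-dimensional, clinically aligned downstream evaluation: besides evaluating the enhancement task, we provide several clinically important downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: we introduce a new dataset that promotes comprehensive and fair comparison between paired and unpaired methods and includes a manual evaluation protocol carried out by medical experts. 3) Valuable insights: our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, helping medical experts make informed choices. We also further analyze the challenges faced by existing methods. The code is available at https://github.com/Retinal-Research/EyeBench.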
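As a reference point for the pixel-level metrics named in the abstract, the following is a minimal sketch of PSNR. It assumes 8-bit images held as NumPy arrays; the function name `psnr` and the `max_value` parameter are illustrative and are not taken from the EyeBench code.

```python
import numpy as np


def psnr(reference: np.ndarray, enhanced: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = reference.astype(np.float64)
    enh = enhanced.astype(np.float64)
    mse = np.mean((ref - enh) ** 2)  # mean squared pixel error
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy inputs (assumed, not real fundus images).
black = np.zeros((4, 4), dtype=np.uint8)
white = np.full((4, 4), 255, dtype=np.uint8)
print(psnr(black, black), psnr(black, white))
```

Because the score depends only on per-pixel differences, it carries no information about structures such as vessel morphology, which is the gap the benchmark's downstream tasks are meant to cover.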

Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning

2502.14086v1 by Cole Gawin, Yidan Sun, Mayank Kejriwal

Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving reasoning tasks of moderate complexity, such as question-answering and mathematical problem-solving. However, their capabilities in tasks requiring deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, where models predict plausible semantic relationships based on provided definitions, and few-shot prompting, where models identify relations using examples as guidance. Our experiments with the gpt-4o-mini model show that in instruct prompting, performance is consistent when the model ranks multiple relations but declines substantially when it is restricted to predicting only one relation. In few-shot prompting, the model's accuracy improves significantly when selecting from five relations rather than the full set, although with notable bias toward certain relations. These results suggest that significant gaps remain in the abstract common-sense reasoning abilities of even commercially used LLMs, compared to human-level understanding. However, the findings also highlight the promise of careful prompt engineering, based on selective retrieval, for obtaining better performance.

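Abstract (translated from Chinese): Large language models (LLMs) have achieved remarkable results in generating human-like text and solving reasoning tasks of moderate complexity, such as question answering and mathematical problem solving. However, their capabilities in tasks requiring deeper cognitive skills, such as common-sense understanding and abstract reasoning, remain under-explored. In this paper, we systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph. We propose two prompting approaches: instruct prompting, where the model predicts plausible semantic relations from provided definitions, and few-shot prompting, where the model identifies relations using examples as guidance. Our experiments with the gpt-4o-mini model show that in instruct prompting, performance is consistent when ranking multiple relations but drops sharply when the model is restricted to predicting a single relation. In few-shot prompting, accuracy improves significantly when the model selects from five relations rather than the full set, although it is notably biased toward certain relations. These results indicate that, compared with human-level understanding, significant gaps remain in abstract common-sense reasoning even in commercially used LLMs. However, the findings also highlight the promise of careful prompt engineering based on selective retrieval for obtaining better performance.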
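The contrast between ranking several relations and predicting only one can be illustrated with a hits-at-k score. This is a sketch under stated assumptions: the abstract does not specify the paper's metric, the function name `hits_at_k` is hypothetical, and the two ranked lists are made-up toy predictions using ConceptNet relation names.

```python
def hits_at_k(ranked_predictions, gold_relations, k):
    """Fraction of examples whose gold relation is among the top-k predictions."""
    hits = 0
    for ranked, gold in zip(ranked_predictions, gold_relations):
        if gold in ranked[:k]:
            hits += 1
    return hits / len(gold_relations)


# Toy model outputs (assumed): each list is ordered from most to least plausible.
ranked = [
    ["IsA", "PartOf", "UsedFor"],
    ["UsedFor", "IsA", "PartOf"],
]
gold = ["PartOf", "UsedFor"]

print(hits_at_k(ranked, gold, k=1))  # single-relation prediction
print(hits_at_k(ranked, gold, k=2))  # credit for any of the top two
```

A model can look consistent under a generous k while scoring much lower at k=1, which is the pattern the abstract reports for instruct prompting.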

Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging

2502.14064v1 by Shansong Wang, Mojtaba Safari, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang

Vision foundation models (VFMs) are pre-trained on extensive image datasets to learn general representations for diverse types of data. These models can subsequently be fine-tuned for specific downstream tasks, significantly boosting performance across a broad range of applications. However, existing vision foundation models that claim to be applicable to various radiology tasks are mostly pre-trained on 3D computed tomography (CT), which benefits from the availability of extensive 3D CT databases. Significant differences between CT and magnetic resonance imaging (MRI) in imaging principles, signal characteristics, and data distribution may hinder their practical performance and versatility in MRI-specific applications. Here, we propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes and uses organ-independent imaging descriptions to constrain the semantic distribution of the visual modality. This pre-training dataset, called Triad-131K, is currently the largest 3D MRI pre-training dataset. We evaluate Triad across three tasks, namely, organ/tumor segmentation, organ/cancer classification, and medical image registration, under two data-modality settings (within-domain and out-of-domain) using 25 downstream datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad improves segmentation performance by 6.88% compared to nnUNet-Scratch across 17 datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% compared to SwinUNETR-Scratch in registration tasks across two datasets. Our study demonstrates that pre-training can maximize performance when the data modalities and organs of upstream and downstream tasks are consistent.

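Abstract (translated from Chinese): Vision foundation models (VFMs) are pre-trained on extensive image datasets to learn general representations for diverse types of data. These models can then be fine-tuned for specific downstream tasks, substantially improving performance across a wide range of applications. However, existing vision foundation models that claim to be applicable to various radiology tasks are mostly pre-trained on 3D computed tomography (CT), which benefits from extensive 3D CT databases. Significant differences between CT and magnetic resonance imaging (MRI) in imaging principles, signal characteristics, and data distribution may hinder their practical performance and versatility in MRI-specific applications. Here, we propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes and uses organ-independent imaging descriptions to constrain the semantic distribution of the visual modality. This pre-training dataset, called Triad-131K, is currently the largest 3D MRI pre-training dataset. We evaluate Triad on three tasks, namely organ/tumor segmentation, organ/cancer classification, and medical image registration, under two data-modality settings (within-domain and out-of-domain) using 25 downstream datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad improves segmentation performance by 6.88% over nnUNet-Scratch across 17 datasets. Swin-B-Triad improves by 3.97% over Swin-B-Scratch in classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% over SwinUNETR-Scratch in registration tasks across two datasets. Our study demonstrates that pre-training can maximize performance when the data modalities and organs of the upstream and downstream tasks are consistent.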

Display Field-Of-View Agnostic Robust CT Kernel Synthesis Using Model-Based Deep Learning

2502.14920v1 by Hemant Kumar Aggarwal, Antony Jerald, Phaneendra K. Yalavarthy, Rajesh Langoju, Bipul Das

In X-ray computed tomography (CT) imaging, the choice of reconstruction kernel is crucial as it significantly impacts the quality of clinical images. Different kernels influence spatial resolution, image noise, and contrast in various ways. Clinical applications involving lung imaging often require images reconstructed with both soft and sharp kernels. Reconstructing images with different kernels requires raw sinogram data, and storing images for all kernels increases processing time and storage requirements. The Display Field-of-View (DFOV) adds complexity to kernel synthesis, as data acquired at different DFOVs exhibit varying levels of sharpness and detail. This work introduces an efficient, DFOV-agnostic solution for image-based kernel synthesis using model-based deep learning. The proposed method explicitly integrates CT kernel and DFOV characteristics into the forward model. Experimental results on clinical data, along with quantitative analysis of the estimated modulation transfer function using wire phantom data, clearly demonstrate the utility of the proposed method in real-time. Additionally, a comparative study with a direct learning network that lacks forward model information shows that the proposed method is more robust to DFOV variations.

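Abstract (translated from Chinese): In X-ray computed tomography (CT) imaging, the choice of reconstruction kernel is crucial because it significantly affects the quality of clinical images. Different kernels influence spatial resolution, image noise, and contrast in various ways. Clinical applications involving lung imaging often require images reconstructed with both soft and sharp kernels. Reconstructing images with different kernels requires raw sinogram data, and storing images for all kernels increases processing time and storage requirements. The display field of view (DFOV) adds complexity to kernel synthesis, because data acquired at different DFOVs exhibit different levels of sharpness and detail. This work introduces an efficient, DFOV-agnostic solution for image-based kernel synthesis using model-based deep learning. The proposed method explicitly integrates CT kernel and DFOV characteristics into the forward model. Experimental results on clinical data, together with quantitative analysis of the estimated modulation transfer function using wire phantom data, clearly demonstrate the utility of the proposed method in real-time use. In addition, a comparative study with a direct learning network that lacks forward model information shows that the proposed method is more robust to DFOV variations.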

VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare

2502.13775v1 by Anudeex Shetty, Amin Beheshti, Mark Dras, Usman Naseem

Alignment techniques have become central to ensuring that Large Language Models (LLMs) generate outputs consistent with human values. However, existing alignment paradigms often model an averaged or monolithic preference, failing to account for the diversity of perspectives across cultures, demographics, and communities. This limitation is particularly critical in health-related scenarios, where plurality is essential due to the influence of culture, religion, personal values, and conflicting opinions. Despite progress in pluralistic alignment, no prior work has focused on health, likely due to the lack of publicly available datasets. To address this gap, we introduce VITAL, a new benchmark dataset comprising 13.1K value-laden situations and 5.4K multiple-choice questions focused on health, designed to assess and benchmark pluralistic alignment methodologies. Through extensive evaluation of eight LLMs of varying sizes, we demonstrate that existing pluralistic alignment techniques fall short in effectively accommodating diverse healthcare beliefs, underscoring the need for tailored AI alignment in specific domains. This work highlights the limitations of current approaches and lays the groundwork for developing health-specific alignment solutions.

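Abstract (translated from Chinese): Alignment techniques have become central to ensuring that large language models (LLMs) produce outputs consistent with human values. However, existing alignment paradigms usually model an averaged or monolithic preference and fail to account for differing perspectives across cultures, demographics, and communities. This limitation is especially important in health-related scenarios, where plurality is essential because of the influence of culture, religion, personal values, and conflicting opinions. Despite progress in pluralistic alignment, no prior work has focused on health, possibly because of the lack of publicly available datasets. To address this gap, we introduce VITAL, a new benchmark dataset comprising 13.1K value-laden situations and 5.4K multiple-choice questions focused on health, designed to assess and benchmark pluralistic alignment methods. Through extensive evaluation of eight LLMs of varying sizes, we show that existing pluralistic alignment techniques cannot effectively accommodate diverse healthcare beliefs, which underscores the need for tailored AI alignment in specific domains. This work highlights the limitations of current approaches and lays the groundwork for developing health-specific alignment solutions.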

PeerQA: A Scientific Question Answering Dataset from Peer Reviews

2502.13668v1 by Tim Baumgärtner, Ted Briscoe, Iryna Gurevych

We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as a subset of other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average size of 12k tokens. Our code and data are available at https://github.com/UKPLab/peerqa.

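Abstract (translated from Chinese): We present PeerQA, a real-world, scientific, document-level question answering (QA) dataset. PeerQA questions are sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining a scientific article. Answers were annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, most of them from ML and NLP, along with a subset from other scientific communities such as geoscience and public health. PeerQA supports three important tasks for developing practical QA systems: evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and run experiments that establish baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval: we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. For answer generation, PeerQA is a challenging benchmark for long-context modeling, because the papers have an average length of 12k tokens. Our code and data are available at https://github.com/UKPLab/peerqa.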

Democratizing Large Language Model-Based Graph Data Augmentation via Latent Knowledge Graphs

2502.13555v1 by Yushi Feng, Tsai Hor Chan, Guosheng Yin, Lequan Yu

Data augmentation is necessary for graph representation learning due to the scarcity and noise present in graph data. Most of the existing augmentation methods overlook the context information inherited from the dataset as they rely solely on the graph structure for augmentation. Despite the success of some large language model (LLM)-based graph learning methods, they are mostly white-box methods that require access to the weights or latent features of open-access LLMs, which makes them difficult to democratize, as existing LLMs are mostly closed-source for commercial considerations. To overcome these limitations, we propose a black-box context-driven graph data augmentation approach, with the guidance of LLMs -- DemoGraph. Leveraging the text prompt as context-related information, we task the LLM with generating knowledge graphs (KGs), which allow us to capture the structural interactions from the text outputs. We then design a dynamic merging schema to stochastically integrate the LLM-generated KGs into the original graph during training. To control the sparsity of the augmented graph, we further devise a granularity-aware prompting strategy and an instruction fine-tuning module, which seamlessly generates text prompts according to different granularity levels of the dataset. Extensive experiments on various graph learning tasks validate the effectiveness of our method over existing graph data augmentation methods. Notably, our approach excels in scenarios involving electronic health records (EHRs), which validates its maximal utilization of contextual knowledge, leading to enhanced predictive performance and interpretability.

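Abstract (translated from Chinese): Because graph data are scarce and noisy, data augmentation is necessary for graph representation learning. Most existing augmentation methods overlook the context information inherited from the dataset, because they rely solely on the graph structure for augmentation. Although some large language model (LLM)-based graph learning methods have been successful, they are mostly white-box methods that require access to the weights or latent features of open-access LLMs; since existing LLMs are mostly closed-source for commercial reasons, these methods are hard to make available to everyone. To overcome these limitations, we propose DemoGraph, a black-box, context-driven graph data augmentation approach guided by LLMs. Using the text prompt as context-related information, we have the LLM generate knowledge graphs (KGs), which let us capture structural interactions from the text outputs. We then design a dynamic merging schema that stochastically integrates the LLM-generated KGs into the original graph during training. To control the sparsity of the augmented graph, we further devise a granularity-aware prompting strategy and an instruction fine-tuning module, which seamlessly generates text prompts according to the different granularity levels of the dataset. Extensive experiments on various graph learning tasks verify that our method is more effective than existing graph data augmentation methods. Notably, our approach performs well in scenarios involving electronic health records (EHRs), which validates its maximal use of contextual knowledge and leads to improved predictive performance and interpretability.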
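The stochastic merging step can be pictured with a small sketch. It assumes graphs are represented as sets of `(head, tail)` edge tuples and that each LLM-generated edge is kept independently with a fixed probability; the function `merge_kg_edges`, the `keep_prob` parameter, and the toy EHR-style node names are hypothetical and are not DemoGraph's actual schema.

```python
import random


def merge_kg_edges(graph_edges, kg_edges, keep_prob, rng):
    """Return the original edges plus a random subset of LLM-generated KG edges."""
    merged = set(graph_edges)
    for edge in kg_edges:
        # Each generated edge is sampled independently (assumed rule).
        if rng.random() < keep_prob:
            merged.add(edge)
    return merged


original = {("patient_1", "diabetes"), ("patient_1", "metformin")}
generated = {("diabetes", "metformin"), ("diabetes", "insulin")}

rng = random.Random(0)
# Re-sampling at every training step gives a different augmented graph each time.
print(merge_kg_edges(original, generated, keep_prob=0.5, rng=rng))
```

Lowering `keep_prob` keeps the augmented graph sparser, which is one simple way to think about the sparsity control the abstract mentions.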

MobileViM: A Light-weight and Dimension-independent Vision Mamba for 3D Medical Image Analysis

2502.13524v1 by Wei Dai, Steven Wang, Jun Liu

Efficient evaluation of three-dimensional (3D) medical images is crucial for diagnostic and therapeutic practices in healthcare. Recent years have seen a substantial uptake in applying deep learning and computer vision to analyse and interpret medical images. Traditional approaches, such as convolutional neural networks (CNNs) and vision transformers (ViTs), face significant computational challenges, prompting the need for architectural advancements. Recent efforts have led to the introduction of novel architectures like the "Mamba" model as alternative solutions to traditional CNNs or ViTs. The Mamba model excels in the linear processing of one-dimensional data with low computational demands. However, Mamba's potential for 3D medical image analysis remains underexplored, and it could face significant computational challenges as the dimension increases. This manuscript presents MobileViM, a streamlined architecture for efficient segmentation of 3D medical images. In the MobileViM network, we introduce a new dimension-independent mechanism and a dual-direction traversing approach and integrate them into a vision-Mamba-based framework. MobileViM also features a cross-scale bridging technique to improve efficiency and accuracy across various medical imaging modalities. With these enhancements, MobileViM achieves segmentation speeds exceeding 90 frames per second (FPS) on a single graphics processing unit (an NVIDIA RTX 4090). This performance is over 24 FPS faster than the state-of-the-art deep learning models for processing 3D images with the same computational resources. In addition, experimental evaluations demonstrate that MobileViM delivers superior performance, with Dice similarity scores reaching 92.72%, 86.69%, 80.46%, and 77.43% for the PENGWIN, BraTS2024, ATLAS, and Toothfairy2 datasets, respectively, significantly surpassing existing models.

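Abstract (translated from Chinese): Efficient evaluation of three-dimensional (3D) medical images is crucial for diagnostic and therapeutic practice in healthcare. In recent years, the use of deep learning and computer vision to analyze and interpret medical images has grown substantially. Traditional approaches, such as convolutional neural networks (CNNs) and vision transformers (ViTs), face significant computational challenges, which calls for architectural advances. Recent efforts have introduced novel architectures such as the "Mamba" model as alternatives to traditional CNNs or ViTs. The Mamba model excels at linear processing of one-dimensional data with low computational demands. However, Mamba's potential for 3D medical image analysis remains underexplored, and it may face significant computational challenges as the dimension increases. This manuscript presents MobileViM, a streamlined architecture for efficient segmentation of 3D medical images. In the MobileViM network, we devise a new dimension-independent mechanism and a dual-direction traversing approach that are combined with a vision-Mamba-based framework. MobileViM also features a cross-scale bridging technique to improve efficiency and accuracy across various medical imaging modalities. With these enhancements, MobileViM achieves segmentation speeds exceeding 90 frames per second (FPS) on a single graphics card (an NVIDIA RTX 4090). This is more than 24 FPS faster than state-of-the-art deep learning models that process 3D images with the same computational resources. In addition, experimental evaluations show that MobileViM delivers superior performance, with Dice similarity scores of 92.72%, 86.69%, 80.46%, and 77.43% on the PENGWIN, BraTS2024, ATLAS, and Toothfairy2 datasets, respectively, significantly surpassing existing models.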
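The Dice similarity score used to report segmentation quality can be computed as below. This is a minimal sketch for binary masks stored as NumPy arrays; the function name and the convention of returning 1.0 for two empty masks are choices made here, and the paper's own evaluation code may differ (for example, by averaging over classes).

```python
import numpy as np


def dice_score(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return float(2.0 * intersection / denom)


# Toy 3D volumes (assumed): the prediction covers the target plus one extra voxel.
target = np.zeros((2, 2, 2), dtype=np.uint8)
target[0, 0, 0] = 1
pred = target.copy()
pred[1, 1, 1] = 1

print(dice_score(pred, target))  # 2 * 1 / (2 + 1)
```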

MKE-Coder: Multi-Axial Knowledge with Evidence Verification in ICD Coding for Chinese EMRs

2502.14916v1 by Xinxin You, Xien Liu, Xue Yang, Ziyi Wang, Ji Wu

Automatic International Classification of Diseases (ICD) coding is a well-established task in the medical field and has received much attention. It has been successful for English but faces challenges when dealing with Chinese electronic medical records (EMRs). The first issue lies in the difficulty of extracting disease code-related information from Chinese EMRs, primarily due to the concise writing style and specific internal structure of the EMRs. The second problem is that previous methods have failed to leverage disease-based multi-axial knowledge and lack association with the corresponding clinical evidence. This paper introduces a novel framework called MKE-Coder: Multi-axial Knowledge with Evidence verification in ICD coding for Chinese EMRs. Initially, we identify candidate codes for the diagnosis and categorize each of them into knowledge under four coding axes. Subsequently, we retrieve corresponding clinical evidence from the comprehensive content of EMRs and filter credible evidence through a scoring model. Finally, to ensure the validity of the candidate code, we propose an inference module based on the masked language modeling strategy. This module verifies that all the axis knowledge associated with the candidate code is supported by evidence and provides recommendations accordingly. To evaluate the performance of our framework, we conduct experiments using a large-scale Chinese EMR dataset collected from various hospitals. The experimental results demonstrate that MKE-Coder exhibits significant superiority in the task of automatic ICD coding based on Chinese EMRs. In a practical evaluation of our method within simulated real coding scenarios, our approach significantly helps coders improve both their coding accuracy and speed.

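Abstract (translated from Chinese): Automatic coding of the International Classification of Diseases (ICD) in the medical field is a well-established task that has received much attention. Automatic ICD coding has been successful for English but faces challenges when handling Chinese electronic medical records (EMRs). The first problem is the difficulty of extracting disease-code-related information from Chinese EMRs, mainly because of their concise writing style and specific internal structure. The second problem is that previous methods fail to exploit disease-based multi-axial knowledge and lack association with the corresponding clinical evidence. This paper introduces a new framework called MKE-Coder: Multi-axial Knowledge with Evidence verification in ICD coding for Chinese EMRs. First, we identify candidate codes for the diagnosis and categorize them into knowledge under four coding axes. Next, we retrieve the corresponding clinical evidence from the full content of the EMRs and filter credible evidence with a scoring model. Finally, to ensure the validity of a candidate code, we propose an inference module based on a masked language modeling strategy. This module verifies that all axis knowledge associated with the candidate code is supported by evidence and makes recommendations accordingly. To evaluate our framework, we conduct experiments on a large-scale Chinese EMR dataset collected from various hospitals. The results show that MKE-Coder is significantly superior in automatic ICD coding based on Chinese EMRs. In a practical evaluation within simulated real coding scenarios, our method significantly helps coders improve their coding accuracy and speed.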

Unlocking Multimodal Integration in EHRs: A Prompt Learning Framework for Language and Time Series Fusion

2502.13509v1 by Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang

Large language models (LLMs) have shown remarkable performance in vision-language tasks, but their application in the medical field remains underexplored, particularly for integrating structured time series data with unstructured clinical notes. In clinical practice, dynamic time series data such as lab test results capture critical temporal patterns, while clinical notes provide rich semantic context. Merging these modalities is challenging due to the inherent differences between continuous signals and discrete text. To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal framework that employs prompt-guided learning to unify these heterogeneous data types. Our approach leverages lightweight anomaly detection to generate anomaly captions that serve as prompts, guiding the encoding of raw time series data into informative embeddings. These embeddings are aligned with textual representations in a shared latent space, preserving fine-grained temporal nuances alongside semantic insights. Furthermore, our framework incorporates tailored self-supervised objectives to enhance both intra- and inter-modal alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results demonstrate that our method consistently outperforms state-of-the-art approaches.

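Abstract (translated from Chinese): Large language models (LLMs) perform remarkably well in vision-language tasks, but their application in the medical field remains underexplored, particularly for integrating structured time series data with unstructured clinical notes. In clinical practice, dynamic time series data such as lab test results capture critical temporal patterns, while clinical notes provide rich semantic context. Merging these modalities is challenging because of the inherent differences between continuous signals and discrete text. To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal framework that uses prompt-guided learning to unify these heterogeneous data types. Our approach uses lightweight anomaly detection to generate anomaly captions that serve as prompts, guiding the encoding of raw time series data into informative embeddings. These embeddings are aligned with textual representations in a shared latent space, preserving fine-grained temporal nuances alongside semantic insights. In addition, our framework incorporates tailored self-supervised objectives to strengthen both intra-modal and inter-modal alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results show that our method consistently outperforms state-of-the-art approaches.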
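The idea of turning a lightweight anomaly detector's output into a text prompt can be sketched as follows. Everything here is an assumption for illustration: the z-score rule, the threshold, the caption wording, and the `anomaly_caption` helper are not taken from ProMedTS, whose detector and caption format are not described in the abstract.

```python
import numpy as np


def anomaly_caption(name: str, values, z_threshold: float = 2.0) -> str:
    """Describe z-score outliers in a lab-test series as a short text caption."""
    series = np.asarray(values, dtype=np.float64)
    std = series.std()
    if std == 0.0:
        return f"{name}: no anomalies detected"
    z = (series - series.mean()) / std
    steps = np.flatnonzero(np.abs(z) > z_threshold)  # indices of outlying time steps
    if steps.size == 0:
        return f"{name}: no anomalies detected"
    return f"{name}: anomalous values at time steps {steps.tolist()}"


# Toy lactate readings (assumed): a single spike at the last time step.
lactate = [1, 1, 1, 1, 1, 1, 1, 1, 1, 10]
print(anomaly_caption("lactate", lactate))
```

A caption like this could then be passed to a text encoder as the prompt that accompanies the raw series.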

Towards a perturbation-based explanation for medical AI as differentiable programs

2502.14001v1 by Takeshi Abe, Yoshiyuki Asai

Recent advances in machine learning algorithms have reached a point where medical devices can be equipped with artificial intelligence (AI) models for diagnostic support and routine automation in clinical settings. In medicine and healthcare, there is a particular demand for sufficient and objective explainability of the outcomes generated by AI models. However, AI models are generally considered black boxes due to their complexity, and the computational process leading to their response is often opaque. Although several methods have been proposed to explain the behavior of models by evaluating the importance of each feature in discrimination and prediction, they may suffer from biases and opacities arising from the scale and sampling protocol of the dataset used for training or testing. To overcome the shortcomings of existing methods, we explore an alternative approach to provide an objective explanation of AI models that can be defined independently of the learning process and does not require additional data. As a preliminary study in this direction, this work examines the numerical availability of the Jacobian matrix of deep learning models, which measures how stably a model responds to small perturbations added to the input. The indicator, if available, is calculated from a trained AI model for a given target input. This is a first step towards a perturbation-based explanation, which will assist medical practitioners in understanding and interpreting the response of the AI model in its clinical application.

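Abstract (translated from Chinese): Recent advances in machine learning algorithms have reached a stage where medical devices can be equipped with artificial intelligence (AI) models to provide diagnostic support and routine automation in clinical settings. In medicine and healthcare, there is a particular demand for sufficient and objective explainability of the results produced by AI models. However, because of their complexity, AI models are usually regarded as black boxes, and the computational process leading to their response is often opaque. Although several methods have been proposed to explain model behavior by evaluating the importance of each feature in discrimination and prediction, they may be affected by biases and opacity arising from the scale and sampling protocol of the dataset used for training or testing. To overcome the shortcomings of existing methods, we explore an alternative approach that provides an objective explanation of AI models, can be defined independently of the learning process, and requires no additional data. As a preliminary study in this direction, this work examines the numerical availability of the Jacobian matrix of deep learning models, which measures how stably a model responds to small perturbations added to the input. The indicator, if available, is computed from a trained AI model for a given target input. This is a first step toward a perturbation-based explanation, which will help medical practitioners understand and interpret the response of an AI model in its clinical application.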
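To make the role of the Jacobian concrete, the sketch below estimates it by central finite differences for a toy model and checks it against the closed form. This is a swapped-in technique chosen for self-containment: the title suggests the authors treat models as differentiable programs, so automatic differentiation is the more likely tool. The toy model, the weights `W`, and the use of the spectral norm as a sensitivity summary are all assumptions.

```python
import numpy as np


def numerical_jacobian(f, x, eps: float = 1e-6) -> np.ndarray:
    """Central-difference estimate of d f(x) / d x for a vector-valued f."""
    x = np.asarray(x, dtype=np.float64)
    out_dim = np.asarray(f(x)).size
    jac = np.zeros((out_dim, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps  # perturb one input coordinate at a time
        diff = np.asarray(f(x + step)) - np.asarray(f(x - step))
        jac[:, j] = diff.ravel() / (2.0 * eps)
    return jac


# Toy "trained model" (assumed): a single tanh layer with fixed weights.
W = np.array([[1.0, 2.0], [0.5, -1.0]])


def toy_model(x):
    return np.tanh(W @ x)


x0 = np.array([0.1, -0.2])
jac = numerical_jacobian(toy_model, x0)
analytic = (1.0 - np.tanh(W @ x0) ** 2)[:, None] * W  # closed form for this toy model

print(np.allclose(jac, analytic, atol=1e-6))
print(np.linalg.norm(jac, 2))  # largest singular value: worst-case local amplification
```

The Jacobian is evaluated at one target input and needs neither the training data nor the training procedure, which matches the property the abstract asks of the explanation.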

PTB-Image: A Scanned Paper ECG Dataset for Digitization and Image-based Diagnosis

2502.14909v1 by Cuong V. Nguyen, Hieu X. Nguyen, Dung D. Pham Minh, Cuong D. Do

Electrocardiograms (ECGs) recorded on paper remain prevalent in clinical practice, yet their use presents challenges for automated analysis and digital storage. To address this issue, we introduce PTB-Image, a dataset comprising scanned paper ECGs with corresponding digital signals, enabling research on ECG digitization. We also provide VinDigitizer, a digitization baseline to convert paper-based ECGs into digital time-series signals. The method involves detecting signal rows, extracting waveforms from the background, and reconstructing numerical values from the digitized traces. We applied VinDigitizer to 549 scanned ECGs and evaluated its performance against the original PTB dataset (modified to match the printed signals). The results achieved a mean signal-to-noise ratio (SNR) of 0.01 dB, highlighting both the feasibility and challenges of ECG digitization, particularly in mitigating distortions from printing and scanning processes. By providing PTB-Image and baseline digitization methods, this work aims to facilitate advancements in ECG digitization, enhancing access to historical ECG data and supporting applications in telemedicine and automated cardiac diagnostics.

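Abstract (translated from Chinese): Paper electrocardiograms (ECGs) remain common in clinical practice, but their use poses challenges for automated analysis and digital storage. To address this problem, we introduce PTB-Image, a dataset of scanned paper ECGs with corresponding digital signals that enables research on ECG digitization. We also provide VinDigitizer, a digitization baseline for converting paper ECGs into digital time-series signals. The method involves detecting signal rows, extracting waveforms from the background, and reconstructing numerical values from the digitized traces. We applied VinDigitizer to 549 scanned ECGs and evaluated its performance against the original PTB dataset (modified to match the printed signals). The results reached a mean signal-to-noise ratio (SNR) of 0.01 dB, highlighting both the feasibility and the challenges of ECG digitization, particularly in mitigating distortions from printing and scanning. By providing PTB-Image and baseline digitization methods, this work aims to advance ECG digitization, improve access to historical ECG data, and support applications in telemedicine and automated cardiac diagnostics.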
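For readers interpreting the reported SNR, the following sketch shows one common definition: signal power over reconstruction-error power, in decibels. The abstract does not state the exact formula the authors used, so this definition and the function name `snr_db` are assumptions.

```python
import numpy as np


def snr_db(reference, reconstructed) -> float:
    """SNR in dB: power of the reference over power of the reconstruction error."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    noise_power = np.sum((ref - rec) ** 2)
    if noise_power == 0.0:
        return float("inf")  # perfect reconstruction
    return float(10.0 * np.log10(np.sum(ref ** 2) / noise_power))


# Toy signals (assumed): a square-like trace and two reconstructions of it.
reference = np.array([1.0, -1.0, 1.0, -1.0])
print(snr_db(reference, 0.9 * reference))            # error is 10% of the amplitude
print(snr_db(reference, np.zeros_like(reference)))   # error power equals signal power
```

Under this definition, a value near 0 dB means the reconstruction error carries about as much power as the signal itself.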

RGAR: Recurrence Generation-augmented Retrieval for Factual-aware Medical Question Answering

2502.13361v1 by Sichu Liang, Linhai Zhang, Hongyu Zhu, Wenwen Wang, Yulan He, Deyu Zhou

Medical question answering requires extensive access to specialized conceptual knowledge. The current paradigm, Retrieval-Augmented Generation (RAG), acquires expert medical knowledge through large-scale corpus retrieval and uses this knowledge to guide a general-purpose large language model (LLM) for generating answers. However, existing retrieval approaches often overlook the importance of factual knowledge, which limits the relevance of retrieved conceptual knowledge and restricts its applicability in real-world scenarios, such as clinical decision-making based on Electronic Health Records (EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval framework that retrieves both relevant factual and conceptual knowledge from dual sources (i.e., EHRs and the corpus), allowing them to interact and refine each other. Through extensive evaluation across three factual-aware medical question answering benchmarks, RGAR establishes a new state-of-the-art performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model with RGAR surpasses the considerably larger, RAG-enhanced GPT-3.5. Our findings demonstrate the benefit of extracting factual knowledge for retrieval, which consistently yields improved generation quality.

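Abstract (translated from Chinese): Medical question answering requires extensive access to specialized conceptual knowledge. The current paradigm, Retrieval-Augmented Generation (RAG), obtains expert medical knowledge through large-scale corpus retrieval and uses it to guide a general-purpose large language model (LLM) in generating answers. However, existing retrieval approaches often overlook the importance of factual knowledge, which limits the relevance of the retrieved conceptual knowledge and restricts its applicability in real-world scenarios, such as clinical decision-making based on electronic health records (EHRs). This paper introduces RGAR, a recurrence generation-augmented retrieval framework that retrieves relevant factual and conceptual knowledge from dual sources (EHRs and the corpus), letting them interact and refine each other. Through extensive evaluation on three factual-aware medical question answering benchmarks, RGAR establishes new state-of-the-art performance among medical RAG systems. Notably, the Llama-3.1-8B-Instruct model with RGAR surpasses the much larger RAG-enhanced GPT-3.5. Our findings demonstrate the benefit of extracting factual knowledge for retrieval, which consistently improves generation quality.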

Adjust for Trust: Mitigating Trust-Induced Inappropriate Reliance on AI Assistance

2502.13321v1 by Tejas Srinivasan, Jesse Thomason

Trust biases how users rely on AI recommendations in AI-assisted decision-making tasks, with low and high levels of trust resulting in increased under- and over-reliance, respectively. We propose that AI assistants should adapt their behavior through trust-adaptive interventions to mitigate such inappropriate reliance. For instance, when user trust is low, providing an explanation can elicit more careful consideration of the assistant's advice by the user. In two decision-making scenarios -- laypeople answering science questions and doctors making medical diagnoses -- we find that providing supporting and counter-explanations during moments of low and high trust, respectively, yields up to 38% reduction in inappropriate reliance and 20% improvement in decision accuracy. We are similarly able to reduce over-reliance by adaptively inserting forced pauses to promote deliberation. Our results highlight how AI adaptation to user trust facilitates appropriate reliance, presenting exciting avenues for improving human-AI collaboration.

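Abstract (translated from Chinese): Trust biases how users rely on AI recommendations in AI-assisted decision-making tasks, with low and high levels of trust leading to under-reliance and over-reliance, respectively. We propose that AI assistants should adjust their behavior through trust-adaptive interventions to mitigate such inappropriate reliance. For example, when user trust is low, providing an explanation can lead the user to consider the assistant's advice more carefully. In two decision-making scenarios (laypeople answering science questions and doctors making medical diagnoses), we find that providing supporting explanations at moments of low trust and counter-explanations at moments of high trust reduces inappropriate reliance by up to 38% and improves decision accuracy by 20%. We are likewise able to reduce over-reliance by adaptively inserting forced pauses to promote deliberation. Our results highlight how AI adaptation to user trust promotes appropriate reliance, opening exciting avenues for improving human-AI collaboration.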
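The intervention rule described in the abstract (supporting explanations when trust is low, counter-explanations when trust is high) can be written as a small policy. This is only a sketch: the 0-to-1 trust scale, the two thresholds, and the function name are hypothetical, and the abstract does not say how trust is measured or thresholded.

```python
def choose_intervention(trust: float, low: float = 0.3, high: float = 0.7) -> str:
    """Pick an intervention from an estimated user-trust score in [0, 1]."""
    if trust < low:
        # Low trust risks under-reliance: back the advice with a supporting explanation.
        return "supporting_explanation"
    if trust > high:
        # High trust risks over-reliance: surface a counter-explanation.
        return "counter_explanation"
    return "no_intervention"


for score in (0.1, 0.5, 0.9):
    print(score, choose_intervention(score))
```

The forced-pause intervention mentioned in the abstract could be added as another option in the high-trust branch.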

Prediction of Clinical Complication Onset using Neural Point Processes

2502.13290v1 by Sachini Weerasekara, Sagar Kamarthi, Jacqueline Isaacs

Predicting medical events in advance within critical care settings is paramount for patient outcomes and resource management. Utilizing predictive models, healthcare providers can anticipate issues such as cardiac arrest, sepsis, or respiratory failure before they manifest. Recently, there has been a surge in research focusing on forecasting adverse medical event onsets prior to clinical manifestation using machine learning. However, while these models provide temporal prognostic predictions for the occurrence of a specific adverse event of interest within defined time intervals, their interpretability often remains a challenge. In this work, we explore the applicability of neural temporal point processes in the context of adverse event onset prediction, with the aim of explaining clinical pathways and providing interpretable insights. Our experiments span six state-of-the-art neural point processes and six critical care datasets, each focusing on the onset of distinct adverse events. This work represents a novel application class of neural temporal point processes in event prediction.

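Abstract (translated from Chinese): Predicting medical events in advance in critical care settings is vital for patient outcomes and resource management. With predictive models, healthcare providers can anticipate problems such as cardiac arrest, sepsis, or respiratory failure before they occur. Recently, there has been a surge of research on using machine learning to forecast the onset of adverse medical events before clinical manifestation. However, although these models provide temporal prognostic predictions for the occurrence of a specific adverse event within defined time intervals, their interpretability remains a challenge. In this work, we explore the applicability of neural temporal point processes to adverse event onset prediction, with the aim of explaining clinical pathways and providing interpretable insights. Our experiments cover six state-of-the-art neural point processes and six critical care datasets, each focusing on the onset of a different adverse event. This work represents a new application class of neural temporal point processes in event prediction.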
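A temporal point process is defined by a conditional intensity, the instantaneous event rate given the history so far. The sketch below uses the classical exponential-kernel Hawkes process as a stand-in; it is not one of the neural models evaluated in the paper, which replace this fixed parametric form with a learned network. The parameter values and event times are illustrative.

```python
import math


def hawkes_intensity(t, history, mu=0.1, alpha=0.5, beta=1.0):
    """lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))."""
    excitation = sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)
    return mu + excitation


# Toy event times in hours (assumed), e.g. earlier abnormal readings for one patient.
events = [1.0, 1.5]
for t in (0.5, 2.0, 10.0):
    print(t, round(hawkes_intensity(t, events), 4))
```

The intensity rises right after each event and decays back toward the baseline `mu`, so each past event's contribution to the current risk can be read off directly.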

SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?

2502.13233v1 by Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu

Large Language Models (LLMs) have shown remarkable capabilities in general domains but often struggle with tasks requiring specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve external information from static knowledge bases, which can be outdated or incomplete, missing fine-grained clinical details essential for accurate medical question answering. In this work, we propose SearchRAG, a novel framework that overcomes these limitations by leveraging real-time search engines. Our method employs synthetic query generation to convert complex medical questions into search-engine-friendly queries and utilizes uncertainty-based knowledge selection to filter and incorporate the most relevant and informative medical knowledge into the LLM's input. Experimental results demonstrate that our method significantly improves response accuracy in medical question answering tasks, particularly for complex questions requiring detailed and up-to-date knowledge.

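Abstract (translated from Chinese): Large language models (LLMs) show remarkable capabilities in general domains but often struggle with tasks that require specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve external information from static knowledge bases, which may be outdated or incomplete and may lack the fine-grained clinical details needed to answer medical questions accurately. In this work, we propose SearchRAG, a novel framework that overcomes these limitations by using real-time search engines. Our method uses synthetic query generation to convert complex medical questions into search-engine-friendly queries and uses uncertainty-based knowledge selection to filter the most relevant and informative medical knowledge and incorporate it into the LLM's input. Experimental results show that our method significantly improves response accuracy in medical question answering tasks, particularly for complex questions that require detailed and up-to-date knowledge.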
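One way to picture uncertainty-based knowledge selection is to keep the retrieved snippets that most reduce the entropy of the model's answer distribution. This is a sketch under stated assumptions: the abstract does not give SearchRAG's actual criterion, and the function names, the entropy-gain rule, and the probability values below are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_snippets(candidates, baseline_probs, top_k):
    """Keep the snippets whose inclusion lowers answer entropy the most."""
    base = entropy(baseline_probs)
    gains = [(base - entropy(probs), snippet) for snippet, probs in candidates]
    useful = [item for item in gains if item[0] > 0]  # drop snippets that do not help
    useful.sort(key=lambda item: item[0], reverse=True)
    return [snippet for _, snippet in useful[:top_k]]


# Assumed answer probabilities over four options, without and with each snippet.
baseline = [0.25, 0.25, 0.25, 0.25]
candidates = [
    ("snippet_a", [0.70, 0.10, 0.10, 0.10]),
    ("snippet_b", [0.25, 0.25, 0.25, 0.25]),
    ("snippet_c", [0.97, 0.01, 0.01, 0.01]),
]
print(select_snippets(candidates, baseline, top_k=2))
```

In practice the per-snippet probabilities would come from the LLM's own token probabilities over the answer options.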

Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions

2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić

We present an end-to-end framework for generating synthetic users for evaluating interactive agents designed to encourage positive behavior changes, such as in health and lifestyle coaching. The synthetic users are grounded in health and lifestyle conditions, specifically sleep and diabetes management in this study, to ensure realistic interactions with the health coaching agent. Synthetic users are created in two stages: first, structured data are generated grounded in real-world health and lifestyle factors in addition to basic demographics and behavioral attributes; second, full profiles of the synthetic users are developed conditioned on the structured data. Interactions between synthetic users and the coaching agent are simulated using generative agent-based models such as Concordia, or directly by prompting a language model. Using two independently-developed agents for sleep and diabetes coaching as case studies, the validity of this framework is demonstrated by analyzing the coaching agent's understanding of the synthetic users' needs and challenges. Finally, through multiple blinded evaluations of user-coach interactions by human experts, we demonstrate that our synthetic users with health and behavioral attributes more accurately portray real human users with the same attributes, compared to generic synthetic users not grounded in such attributes. The proposed framework lays the foundation for efficient development of conversational agents through extensive, realistic, and grounded simulated interactions.

摘要:我們提供了一個端到端的架構,用於為評估互動式代理生成合成使用者,這些代理旨在鼓勵正向行為改變,例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎,特別是本研究中的睡眠和糖尿病管理,以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立:首先,除了基本人口統計資料和行為屬性外,還會產生以現實世界的健康和生活方式因素為基礎的結構化資料;其次,會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型(例如 Concordia)模擬的,或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究,通過分析指導代理對合成使用者需求和挑戰的理解,證明了此架構的有效性。最後,通過人類專家對使用者指導互動進行多重盲測評估,我們證明了與未以這些屬性為基礎的通用合成使用者相比,具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動,為對話代理的有效開發奠定了基礎。

Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization

2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar

Clinical Question Answering (CQA) plays a crucial role in medical decision-making, enabling physicians to extract relevant information from Electronic Medical Records (EMRs). While transformer-based models such as BERT, BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in CQA, existing models lack the ability to categorize extracted answers, which is critical for structured retrieval, content filtering, and medical decision support. To address this limitation, we introduce a Multi-Task Learning (MTL) framework that jointly trains CQA models for both answer extraction and medical categorization. In addition to predicting answer spans, our model classifies responses into five standardized medical categories: Diagnosis, Medication, Symptoms, Procedure, and Lab Reports. This categorization enables more structured and interpretable outputs, making clinical QA models more useful in real-world healthcare settings. We evaluate our approach on emrQA, a large-scale dataset for medical question answering. Results show that MTL improves F1-score by 2.2% compared to standard fine-tuning, while achieving 90.7% accuracy in answer categorization. These findings suggest that MTL not only enhances CQA performance but also introduces an effective mechanism for categorization and structured medical information retrieval.

摘要:臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色,讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能,但現有的模型缺乏分類擷取答案的能力,這對於結構化檢索、內容過濾和醫療決策支援至關重要。 為了解決這個限制,我們引進了一個多任務學習 (MTL) 架構,它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍,我們的模型將回應分類為五個標準化醫療類別:診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出,讓臨床問答模型在真實世界的醫療保健環境中更實用。 我們在 emrQA 上評估我們的做法,emrQA 是用於醫療問題解答的大規模資料集。結果顯示,與標準微調相比,MTL 將 F1 分數提高了 2.2%,同時在答案分類中達到 90.7% 的準確度。這些發現表明,MTL 不僅增強了 CQA 的效能,還引入了一種分類和結構化醫療資訊檢索的有效機制。
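The joint objective described above can be sketched as a weighted sum of a span-extraction loss and a categorization loss. This is a minimal illustration, assuming cross-entropy for both heads and a hypothetical weight `category_weight`; the abstract does not state how the paper combines the two losses.

```python
import math

# The five categories named in the abstract.
CATEGORIES = ["Diagnosis", "Medication", "Symptoms", "Procedure", "Lab Reports"]

def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def multitask_loss(start_logits, end_logits, category_logits,
                   start_idx, end_idx, category_idx, category_weight=0.5):
    """Span loss (mean of start and end losses) plus weighted category loss.

    category_weight is an assumed hyperparameter, not a value from the paper.
    """
    span_loss = 0.5 * (cross_entropy(start_logits, start_idx)
                       + cross_entropy(end_logits, end_idx))
    category_loss = cross_entropy(category_logits, category_idx)
    return span_loss + category_weight * category_loss
```

Sharing one encoder across both heads and summing the losses is the usual multi-task setup. The weight controls how strongly the categorization signal shapes the shared representation.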

Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection

2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert

Detection of hyperenhancement from cardiac LGE MRI images is a complex task requiring significant clinical expertise. Although deep learning-based models have shown promising results for the task, they require large amounts of data with fine-grained annotations. Clinical reports generated for cardiac MR studies contain rich, clinically relevant information, including the location, extent and etiology of any scars present. Although recently developed CLIP-based training enables pretraining models with image-text pairs, it requires large amounts of data and further finetuning strategies on downstream tasks. In this study, we use various strategies rooted in domain knowledge to train a model for LGE detection solely using text from clinical reports, on a relatively small clinical cohort of 965 patients. We improve performance through the use of synthetic data augmentation, by systematically creating scar images and associated text. In addition, we standardize the orientation of the images in an anatomy-informed way to enable better alignment of spatial and text features. We also use a captioning loss to enable fine-grained supervision and explore the effect of pretraining of the vision encoder on performance. Finally, ablation studies are carried out to elucidate the contributions of each design component to the overall performance of the model.

摘要:從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務,需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果,但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊,包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型,但它需要大量資料和進一步微調下游任務的策略。在這項研究中,我們使用植基於領域知識的各種策略,僅使用來自臨床報告的文字,在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能,系統性地建立疤痕影像和相關文字。此外,我們以解剖學告知的方式標準化影像方向,以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督,並探討視覺編碼器的預訓練對效能的影響。最後,進行消融研究以闡明每個設計元件對模型整體效能的貢獻。

Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models

2502.12825v2 by Rubing Li, João Sedoc, Arun Sundararajan

When encountering increasingly frequent performance improvements or cost reductions from a new large language model (LLM), developers of applications leveraging LLMs must decide whether to take advantage of these improvements or stay with older tried-and-tested models. Low perceived switching frictions can lead to choices that do not consider more subtle behavior changes that the transition may induce. Our experiments use a popular game-theoretic behavioral economics model of trust to show stark differences in the trusting behavior of OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing and risk-seeking with future returns from trust, and contrast it with DeepSeek's more sophisticated and profitable trusting behavior that stems from an ability to incorporate deeper concepts like forward planning and theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our results highlight the perils of relying on LLM performance benchmarks that are too narrowly defined and suggest that careful analysis of their hidden fault lines should be part of any organization's AI strategy.

摘要:在遇到大型語言模型 (LLM) 頻頻帶來的效能提升或成本降低時,利用 LLM 的應用程式開發人員必須決定是否要利用這些提升,或繼續使用較舊且經過驗證的模型。低感知切換摩擦可能會導致選擇,而沒有考慮轉換可能引發的更細微行為變更。我們的實驗使用流行的博弈論行為經濟信任模型,以顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰,因為它們調和了利潤最大化和冒險,以及來自信任的未來回報,並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比,這種行為源於整合更深入的概念,例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎,我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險,並建議仔細分析其隱藏的斷層線應該是任何組織 AI 策略的一部分。

LLM Safety for Children

2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat

This paper analyzes the safety of Large Language Models (LLMs) in interactions with children under 18 years of age. Despite the transformative applications of LLMs in various aspects of children's lives, such as education and therapy, there remains a significant gap in understanding and mitigating potential content harms specific to this demographic. The study acknowledges the diverse nature of children, often overlooked by standard safety evaluations, and proposes a comprehensive approach to evaluating LLM safety specifically for children. We list potential risks that children may encounter when using LLM-powered applications. Additionally, we develop Child User Models that reflect the varied personalities and interests of children, informed by literature in child care and psychology. These user models aim to bridge the existing gap in child safety literature across various fields. We utilize Child User Models to evaluate the safety of six state-of-the-art LLMs. Our observations reveal significant safety gaps in LLMs, particularly in categories harmful to children but not adults.

摘要:本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面(例如教育和治療)都有轉變性的應用,但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性,而標準安全評估通常會忽略這些多樣性,並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外,我們開發了兒童使用者模型,這些模型反映了兒童不同的個性特質和興趣,並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞,特別是在對兒童有害但對成年人無害的類別中。

Retrieval-augmented systems can be dangerous medical communicators

2502.14898v1 by Lionel Wong, Ayman Ali, Raymond Xiong, Shannon Zeijang Shen, Yoon Kim, Monica Agrawal

Patients have long sought health information online, and increasingly, they are turning to generative AI to answer their health-related queries. Given the high stakes of the medical domain, techniques like retrieval-augmented generation and citation grounding have been widely promoted as methods to reduce hallucinations and improve the accuracy of AI-generated responses and have been widely adopted into search engines. This paper argues that even when these methods produce literally accurate content drawn from source documents sans hallucinations, they can still be highly misleading. Patients may derive significantly different interpretations from AI-generated outputs than they would from reading the original source material, let alone consulting a knowledgeable clinician. Through a large-scale query analysis on topics including disputed diagnoses and procedure safety, we support our argument with quantitative and qualitative evidence of the suboptimal answers resulting from current systems. In particular, we highlight how these models tend to decontextualize facts, omit critical relevant sources, and reinforce patient misconceptions or biases. We propose a series of recommendations -- such as the incorporation of communication pragmatics and enhanced comprehension of source documents -- that could help mitigate these issues and extend beyond the medical domain.

摘要:患者長期以來一直在網路上搜尋健康資訊,而現在他們越來越常使用生成式 AI 來回答與健康相關的問題。由於醫療領域的風險很高,因此像檢索增強生成和引文基礎等技術已被廣泛推廣為減少幻覺並提高 AI 生成的回應準確性的方法,並已廣泛採用於搜尋引擎中。本文論證,即使這些方法產生了從原始文件得出的字面準確內容,沒有幻覺,它們仍然可能具有高度誤導性。患者可能會從 AI 生成的輸出中得出與閱讀原始來源資料時截然不同的解釋,更不用說諮詢知識淵博的臨床醫生了。透過對有爭議的診斷和程序安全等主題進行大規模查詢分析,我們用定量和定性證據支持我們的論點,證明了當前系統產生的次優答案。特別是,我們強調這些模型如何傾向於將事實去脈絡化、省略關鍵相關來源,並強化患者的誤解或偏見。我們提出了一系列建議,例如納入溝通語用學和加強對原始文件的理解,這些建議可以幫助減輕這些問題,並擴展到醫療領域之外。

Classifiers of Data Sharing Statements in Clinical Trial Records

2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth

Digital individual participant data (IPD) from clinical trials are increasingly distributed for potential scientific reuse. The identification of available IPD, however, requires interpretations of textual data-sharing statements (DSS) in large databases. Recent advancements in computational linguistics include pre-trained language models that promise to simplify the implementation of effective classifiers based on textual inputs. In a subset of 5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers based on domain-specific pre-trained language models reproduce original availability categories as well as manually annotated labels. Typical metrics indicate that classifiers that predicted manual annotations outperformed those that learned to output the original availability categories. This suggests that the textual DSS descriptions contain applicable information that the availability categories do not, and that such classifiers could thus aid the automatic identification of available IPD in large trial databases.

摘要:臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而,要找出可用的 IPD,需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型,有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中,我們評估了基於特定領域預先訓練語言模型的分類器,在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示,預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊,而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。

Relational Norms for Human-AI Cooperation

2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark

How we should design and interact with social artificial intelligence depends on the socio-relational role the AI is meant to emulate or occupy. In human society, relationships such as teacher-student, parent-child, neighbors, siblings, or employer-employee are governed by specific norms that prescribe or proscribe cooperative functions including hierarchy, care, transaction, and mating. These norms shape our judgments of what is appropriate for each partner. For example, workplace norms may allow a boss to give orders to an employee, but not vice versa, reflecting hierarchical and transactional expectations. As AI agents and chatbots powered by large language models are increasingly designed to serve roles analogous to human positions - such as assistant, mental health provider, tutor, or romantic partner - it is imperative to examine whether and how human relational norms should extend to human-AI interactions. Our analysis explores how differences between AI systems and humans, such as the absence of conscious experience and immunity to fatigue, may affect an AI's capacity to fulfill relationship-specific functions and adhere to corresponding norms. This analysis, which is a collaborative effort by philosophers, psychologists, relationship scientists, ethicists, legal experts, and AI researchers, carries important implications for AI systems design, user behavior, and regulation. While we accept that AI systems can offer significant benefits such as increased availability and consistency in certain socio-relational roles, they also risk fostering unhealthy dependencies or unrealistic expectations that could spill over into human-human relationships. We propose that understanding and thoughtfully shaping (or implementing) suitable human-AI relational norms will be crucial for ensuring that human-AI interactions are ethical, trustworthy, and favorable to human well-being.

摘要:我們應如何設計和與社交人工智慧互動,取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中,師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配,這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如,職場規範可能允許老闆對員工發號施令,但反之則不行,這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色,例如助理、心理健康提供者、導師或浪漫伴侶,審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異,例如缺乏意識體驗和對疲勞的免疫力,如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果,對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處,例如增加可用性和一致性,但它們也可能助長不健康的依賴關係或不切實際的期望,這些期望可能會蔓延到人際關係中。我們提出,理解和深思熟慮地塑造(或實施)適當的人類與人工智慧關係規範,對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。

FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction

2502.14894v1 by Jowaria Khan, Alexa Friedman, Sydney Evans, Runzi Wang, Kaley Beins, David Andrews, Elizabeth Bondi-Kelly

Per- and polyfluoroalkyl substances (PFAS), chemicals found in products like non-stick cookware, are unfortunately persistent environmental pollutants with severe health risks. Accurately mapping PFAS contamination is crucial for guiding targeted remediation efforts and protecting public and environmental health, yet detection across large regions remains challenging due to the cost of testing and the difficulty of simulating their spread. In this work, we introduce FOCUS, a geospatial deep learning framework with a label noise-aware loss function, to predict PFAS contamination in surface water over large regions. By integrating hydrological flow data, land cover information, and proximity to known PFAS sources, our approach leverages both spatial and environmental context to improve prediction accuracy. We evaluate the performance of our approach through extensive ablation studies and comparative analyses against baselines like sparse segmentation, as well as existing scientific methods, including Kriging and pollutant transport simulations. Results highlight our framework's potential for scalable PFAS monitoring.

摘要:全氟和多氟烷基物質 (PFAS) 是一種存在於不沾鍋等產品中的化學物質,遺憾的是,它們是具有嚴重健康風險的持久性環境污染物。精確繪製 PFAS 汙染地圖對於指導有針對性的修復工作和保護公眾與環境健康至關重要,然而,由於檢測成本和模擬其擴散的難度,在廣大地區進行檢測仍然具有挑戰性。在本文中,我們介紹了 FOCUS,一個具有標籤雜訊感知損失函數的地理空間深度學習框架,用於預測大面積地表水中的 PFAS 汙染。透過整合水文流動資料、土地覆蓋資訊和與已知 PFAS 來源的接近程度,我們的做法同時利用空間和環境背景來提高預測準確度。我們透過廣泛的消融研究和與稀疏分割等基線以及克里金法和污染物傳輸模擬等現有科學方法的比較分析來評估我們方法的效能。結果突顯了我們框架在可擴充 PFAS 監測方面的潛力。

Deep Spatio-Temporal Neural Network for Air Quality Reanalysis

2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy

Air quality prediction is key to mitigating health impacts and guiding decisions, yet existing models tend to focus on temporal trends while overlooking spatial generalization. We propose AQ-Net, a spatiotemporal reanalysis model for both observed and unobserved stations in the near future. AQ-Net utilizes the LSTM and multi-head attention for the temporal regression. We also propose a cyclic encoding technique to ensure continuous time representation. To learn fine-grained spatial air quality estimation, we incorporate AQ-Net with the neural kNN to explore feature-based interpolation, such that we can fill the spatial gaps given coarse observation stations. To demonstrate the efficiency of our model for spatiotemporal reanalysis, we use data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive experiments show that AQ-Net excels in air quality reanalysis, highlighting the potential of hybrid spatio-temporal models to better capture environmental dynamics, especially in urban areas where both spatial and temporal variability are critical.

摘要:空气品质预测是减轻健康影响和指导决策的关键,但现有的模型倾向于关注时间趋势,而忽略空间概化。我们提出了 AQ-Net,这是一种时空再分析模型,适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计,我们将 AQ-Net 与神经 kNN 结合起来,以探索基于特征的插值,以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率,我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明,AQ-Net 在空气质量再分析中表现出色,突出了混合时空模型在更好地捕捉环境动态方面的潜力,尤其是在空间和时间变异性都很关键的城市地区。
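The cyclic encoding mentioned above can be illustrated with the common sine/cosine mapping, which places a periodic quantity such as hour of day on the unit circle so that the end of one cycle sits next to the start of the next. This is a sketch of the general technique. The abstract does not spell out AQ-Net's exact formulation, and the function names are illustrative.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value (e.g. hour 0..23 with period 24) to (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return (math.sin(angle), math.cos(angle))

def encoded_distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Hour 23 and hour 0 are one step apart, just like hour 0 and hour 1.
gap_wrap = encoded_distance(cyclic_encode(23, 24), cyclic_encode(0, 24))
gap_next = encoded_distance(cyclic_encode(0, 24), cyclic_encode(1, 24))
```

With a raw integer hour, 23 and 0 would look 23 units apart. After encoding, `gap_wrap` and `gap_next` are equal up to floating-point error, which is the continuity property the abstract refers to.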

Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing

2502.11715v1 by Site Qu, Guoqiang Hu

The Location-Routing Problem (LRP), which combines the challenges of facility (depot) locating and vehicle route planning, is critically constrained by the reliance on predefined depot candidates, limiting the solution space and potentially leading to suboptimal outcomes. Previous research on LRP without predefined depots is scant and predominantly relies on heuristic algorithms that iteratively attempt depot placements across a planar area. Such approaches lack the ability to proactively generate depot locations that meet specific geographic requirements, revealing a notable gap in the current research landscape. To bridge this gap, we propose a data-driven generative DRL framework, designed to proactively generate depots for LRP without predefined depot candidates, solely based on customer request data, which include geographic and demand information. It can operate in two distinct modes: direct generation of exact depot locations, and the creation of a multivariate Gaussian distribution for flexible depot sampling. By extracting depots' geographic pattern from customer request data, our approach can dynamically respond to logistical needs, identifying high-quality depot locations that further reduce total routing costs compared to traditional methods. Extensive experiments demonstrate that, for the same group of customer requests, compared with depots identified through random attempts, our framework can proactively generate depots that lead to superior solution routes with lower routing cost. The implications of our framework potentially extend into real-world applications, particularly in emergency medical rescue and disaster relief logistics, where rapid establishment and adjustment of depot locations are paramount, showcasing its potential in addressing LRP for dynamic and unpredictable environments.

摘要:地點路線問題(LRP)結合了設施(倉庫)定位和車輛路線規劃的挑戰,嚴重受到預先定義的倉庫候選限制,限制了解決方案空間,並可能導致次優結果。先前關於沒有預先定義倉庫的 LRP 研究很少,而且主要依賴於啟發式演算法,在平面區域中反覆嘗試倉庫配置。這種方法無法主動產生符合特定地理需求的倉庫位置,顯示了當前研究領域的顯著差距。為了彌補這個差距,我們提出一個資料驅動的生成式 DRL 架構,旨在主動為 LRP 產生倉庫,而無需預先定義的倉庫候選,僅根據包含地理和需求資訊的客戶要求資料。它可以在兩種不同的模式下運作:直接產生確切的倉庫位置,以及建立多元高斯分布以進行彈性倉庫抽樣。透過從客戶要求資料中提取倉庫的地理模式,我們的方法可以動態回應後勤需求,找出高品質的倉庫位置,進一步降低與傳統方法相比的總路線成本。廣泛的實驗證明,對於同一組客戶要求,與透過隨機嘗試識別的那些倉庫相比,我們的架構可以主動產生倉庫,並產生路線成本較低的優質解決方案路線。我們的架構的影響潛在地擴展到實際應用,特別是在緊急醫療救援和災害救災後勤方面,其中倉庫位置的快速建立和調整至關重要,展示了其在解決動態和不可預測環境的 LRP 中的潛力。
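The second operating mode, sampling depots from a multivariate Gaussian, can be sketched for the planar (2D) case as follows. The sketch assumes the mean and covariance are already available; in the paper they would be produced by the generative model, which is not reproduced here. The helper names are hypothetical.

```python
import math
import random

def cholesky_2x2(cov):
    """Lower-triangular factor of a 2x2 symmetric positive-definite matrix."""
    (sxx, sxy), (_, syy) = cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    return l11, l21, l22

def sample_depots(mean, cov, n, seed=0):
    """Draw n candidate depot coordinates from N(mean, cov)."""
    rng = random.Random(seed)
    l11, l21, l22 = cholesky_2x2(cov)
    depots = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # x = mean + L z, with L the Cholesky factor of cov
        depots.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    return depots
```

Sampling several candidates and keeping the one with the lowest routing cost is one plausible way to use such a distribution when the exact generated location is infeasible, for example because it falls on unusable land.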

LLM Agents Making Agent Tools

2502.11705v1 by Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather

Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains which demand large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, a novel agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a short task description and a repository URL, ToolMaker autonomously installs required dependencies and generates code to perform the task, using a closed-loop self-correction mechanism to iteratively diagnose and rectify errors. To evaluate our approach, we introduce a benchmark comprising 15 diverse and complex computational tasks spanning both medical and non-medical domains with over 100 unit tests to objectively assess tool correctness and robustness. ToolMaker correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows.

摘要:工具使用已將大型語言模型 (LLM) 轉變為強大的代理,可透過動態使用外部軟體元件來執行複雜的多步驟任務。然而,這些工具必須事先由人類開發人員實作,這會阻礙 LLM 代理在需要大量高度專業化工具的領域(例如生命科學和醫學)中的應用性。受到伴隨公開程式碼儲存庫的科學研究趨勢所啟發,我們提出 ToolMaker,一個創新的代理架構,可自主地將帶有程式碼的論文轉換為相容於 LLM 的工具。給定簡短的任務描述和儲存庫網址,ToolMaker 會自主安裝所需的依賴項,並產生程式碼來執行任務,使用閉環自我修正機制來反覆診斷和糾正錯誤。為了評估我們的做法,我們引進一個包含 15 個不同且複雜的運算任務的基準,涵蓋醫療和非醫療領域,並包含超過 100 個單元測試,以客觀評估工具的正確性和穩健性。ToolMaker 正確實作了 80% 的任務,大幅優於目前的最新軟體工程代理。因此,ToolMaker 是邁向完全自主的基於代理的科學工作流程的一步。
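The closed-loop self-correction idea can be sketched as a generate, run, and feed-back loop. This is a generic illustration, not ToolMaker's implementation: `generate` and `run_tests` are hypothetical callables standing in for the LLM call and the execution environment, and `max_rounds` is an assumed budget.

```python
def build_tool(task, generate, run_tests, max_rounds=5):
    """Iteratively generate code until it passes, feeding errors back.

    generate(task, feedback) -> str        (hypothetical LLM wrapper)
    run_tests(code) -> (bool, str)         (hypothetical executor; returns
                                            success flag and error message)
    Returns (code, rounds_used), or (None, max_rounds) if every attempt fails.
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        code = generate(task, feedback)
        ok, error = run_tests(code)
        if ok:
            return code, round_no
        # The error message becomes the diagnosis input for the next attempt.
        feedback = error
    return None, max_rounds
```

Capping the number of rounds keeps a task that cannot be fixed from looping forever. The caller can treat a `None` result as a failed task.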

MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression

2502.11651v1 by Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang

Large vision-language models (LVLMs) have shown great promise in medical applications, particularly in visual question answering (MedVQA) and diagnosis from medical images. However, existing datasets and models often fail to consider critical aspects of medical diagnostics, such as the integration of historical records and the analysis of disease progression over time. In this paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel dataset for MedVQA that focuses on identifying changes in specific regions between two patient visits. Unlike previous datasets that primarily address single-image questions, MMXU enables multi-image questions, incorporating both current and historical patient data. We demonstrate the limitations of current LVLMs in identifying disease progression on MMXU-test, even those that perform well on traditional benchmarks. To address this, we propose a MedRecord-Augmented Generation (MAG) approach, incorporating both global and regional historical records. Our experiments show that integrating historical records significantly enhances diagnostic accuracy by at least 20%, bridging the gap between current LVLMs and human expert performance. Additionally, we fine-tune models with MAG on MMXU-dev, which demonstrates notable improvements. We hope this work could illuminate the avenue of advancing the use of LVLMs in medical diagnostics by emphasizing the importance of historical context in interpreting medical images. Our dataset is released at https://github.com/linjiemu/MMXU.

摘要:大型視覺語言模型 (LVLMs) 已在醫療應用中展現出極大的潛力,特別是在視覺問答 (MedVQA) 和醫學影像診斷方面。然而,現有的資料集和模型常常無法考量醫療診斷的關鍵層面,例如病歷整合以及隨著時間推移對疾病進程的分析。在本文中,我們介紹 MMXU(多模態多 X 光理解),一個專注於識別兩次患者就診之間特定區域變化的 MedVQA 新資料集。與主要處理單一影像問題的先前資料集不同,MMXU 支援多影像問題,同時納入當前和病史患者資料。我們展示了現有 LVLMs 在 MMXU-test 中識別疾病進程的限制,即使是在傳統基準測試中表現良好的 LVLMs 也是如此。為了解決這個問題,我們提出了一個病歷增強生成 (MAG) 方法,結合了全域和區域病史。我們的實驗顯示,整合病歷可顯著提升至少 20% 的診斷準確度,縮小了現有 LVLMs 和人類專家表現之間的差距。此外,我們在 MMXU-dev 上微調帶有 MAG 的模型,這展示了顯著的進步。我們希望這項工作能透過強調病史脈絡在解讀醫學影像中的重要性,為推進 LVLMs 在醫療診斷中的應用開闢道路。我們的資料集已於 https://github.com/linjiemu/MMXU 發布。

A Survey of Personalized Large Language Models: Progress and Future Directions

2502.11528v1 by Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Jieming Zhu, Minda Hu, Menglin Yang, Irwin King

Large Language Models (LLMs) excel in handling general knowledge tasks, yet they struggle with user-specific personalization, such as understanding individual emotions, writing styles, and preferences. Personalized Large Language Models (PLLMs) tackle these challenges by leveraging individual user data, such as user profiles, historical dialogues, content, and interactions, to deliver responses that are contextually relevant and tailored to each user's specific needs. This is a highly valuable research topic, as PLLMs can significantly enhance user satisfaction and have broad applications in conversational agents, recommendation systems, emotion recognition, medical assistants, and more. This survey reviews recent advancements in PLLMs from three technical perspectives: prompting for personalized context (input level), finetuning for personalized adapters (model level), and alignment for personalized preferences (objective level). To provide deeper insights, we also discuss current limitations and outline several promising directions for future research. Updated information about this survey can be found at https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models.

摘要:大型語言模型 (LLM) 在處理一般知識任務方面表現出色,但它們在使用者特定的個人化方面有困難,例如理解個別的情緒、寫作風格和偏好。個人化大型語言模型 (PLLM) 透過利用個別使用者的資料來解決這些挑戰,例如使用者個人資料、歷史對話、內容和互動,提供在脈絡上相關且針對每個使用者的特定需求量身打造的回應。這是一個非常有價值的研究主題,因為 PLLM 可以顯著提升使用者滿意度,並在對話代理、推薦系統、情緒辨識、醫療助理等方面有廣泛的應用。這項調查從三個技術觀點回顧 PLLM 的最新進展:提示個人化脈絡(輸入層級)、微調個人化適配器(模型層級),以及對齊個人化偏好(目標層級)。為了提供更深入的見解,我們也討論目前的限制,並概述未來研究的幾個有希望的方向。這項調查的最新資訊可以在 https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models 找到。

Variable-frame CNNLSTM for Breast Nodule Classification using Ultrasound Videos

2502.11481v1 by Xiangxiang Cui, Zhongyu Li, Xiayue Fan, Peng Huang, Ying Wang, Meng Yang, Shi Chang, Jihua Zhu

The intersection of medical imaging and artificial intelligence has become an important research direction in intelligent medical treatment, particularly in the analysis of medical images using deep learning for clinical diagnosis. Despite the advances, existing keyframe classification methods lack extraction of time series features, while ultrasonic video classification based on three-dimensional convolution requires uniform frame numbers across patients, resulting in poor feature extraction efficiency and model classification performance. This study proposes a novel video classification method based on CNN and LSTM, introducing NLP's long and short sentence processing scheme into video classification for the first time. The method reduces CNN-extracted image features to 1x512 dimension, followed by sorting and compressing feature vectors for LSTM training. Specifically, feature vectors are sorted by patient video frame numbers and populated with padding value 0 to form variable batches, with invalid padding values compressed before LSTM training to conserve computing resources. Experimental results demonstrate that our variable-frame CNNLSTM method outperforms other approaches across all metrics, showing improvements of 3-6% in F1 score and 1.5% in specificity compared to keyframe methods. The variable-frame CNNLSTM also achieves better accuracy and precision than equal-frame CNNLSTM. These findings validate the effectiveness of our approach in classifying variable-frame ultrasound videos and suggest potential applications in other medical imaging modalities.

摘要:醫學影像與人工智慧的交叉領域已成為智慧醫療的重要研究方向,特別是在臨床診斷中使用深度學習分析醫學影像。儘管有進展,現有的關鍵影格分類方法缺乏時間序列特徵的提取,而基於三維卷積的超音波影片分類需要患者之間的均勻影格數,導致特徵提取效率差和模型分類效能不佳。本研究提出了一種基於 CNN 和 LSTM 的新影片分類方法,首次將 NLP 的長短句處理機制引入影片分類中。該方法將 CNN 提取的影像特徵縮減為 1x512 維度,然後對特徵向量進行排序和壓縮以進行 LSTM 訓練。具體來說,特徵向量按患者影片影格數排序,並填充 0 補齊值以形成可變批次,在 LSTM 訓練前壓縮無效的補齊值以節省運算資源。實驗結果表明,我們的可變影格 CNNLSTM 方法在所有指標上都優於其他方法,與關鍵影格方法相比,F1 分數提高了 3-6%,特異性提高了 1.5%。可變影格 CNNLSTM 也比等影格 CNNLSTM 達到了更好的準確度和精確度。這些發現驗證了我們的方法在分類可變影格超音波影片中的有效性,並表明在其他醫學影像模式中具有潛在的應用。
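The sort, pad, and compress steps described above can be sketched in plain Python. This is an illustration of the general scheme borrowed from NLP sequence batching, not the authors' code; the function names are hypothetical and per-frame features are represented as plain lists instead of 1x512 tensors.

```python
def make_padded_batches(videos, batch_size, feat_dim):
    """Sort videos by frame count (longest first) and zero-pad each batch.

    videos: list of videos, each a list of per-frame feature vectors.
    Returns a list of (padded_batch, lengths) pairs.
    """
    ordered = sorted(videos, key=len, reverse=True)
    pad_frame = [0.0] * feat_dim
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        max_len = len(chunk[0])  # the longest video comes first after sorting
        lengths = [len(v) for v in chunk]
        padded = [v + [pad_frame] * (max_len - len(v)) for v in chunk]
        batches.append((padded, lengths))
    return batches

def pack(padded, lengths):
    """Drop padding: for each time step keep only videos that still have frames."""
    steps = []
    for t in range(lengths[0]):
        steps.append([padded[b][t] for b in range(len(padded)) if lengths[b] > t])
    return steps
```

Sorting by length first keeps videos of similar length in the same batch, so little padding is needed. In PyTorch the same idea is available through `torch.nn.utils.rnn.pad_sequence` and `torch.nn.utils.rnn.pack_padded_sequence`.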

Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation

2502.11456v1 by Yanyan Wang, Kechen Song, Yuyuan Liu, Shuai Ma, Yunhui Yan, Gustavo Carneiro

Semi-supervised 3D medical image segmentation aims to achieve accurate segmentation using few labelled data and numerous unlabelled data. The main challenge in the design of semi-supervised learning methods consists in the effective use of the unlabelled data for training. A promising solution consists of ensuring consistent predictions across different views of the data, where the efficacy of this strategy depends on the accuracy of the pseudo-labels generated by the model for this consistency learning strategy. In this paper, we introduce a new methodology to produce high-quality pseudo-labels for a consistency learning strategy to address semi-supervised 3D medical image segmentation. The methodology has three important contributions. The first contribution is the Cooperative Rectification Learning Network (CRLN) that learns multiple prototypes per class to be used as external knowledge priors to adaptively rectify pseudo-labels at the voxel level. The second contribution consists of the Dynamic Interaction Module (DIM) to facilitate pairwise and cross-class interactions between prototypes and multi-resolution image features, enabling the production of accurate voxel-level clues for pseudo-label rectification. The third contribution is the Cooperative Positive Supervision (CPS), which optimises uncertain representations to align with unassertive representations of their class distributions, improving the model's accuracy in classifying uncertain regions. Extensive experiments on three public 3D medical segmentation datasets demonstrate the effectiveness and superiority of our semi-supervised learning method.

摘要:半监督 3D 医学影像分割旨在使用少量标记数据和大量未标记数据实现精确分割。半监督学习方法设计中的主要挑战在于有效使用未标记数据进行训练。一个有前景的解决方案是确保数据不同视图之间预测的一致性,其中此策略的有效性取决于模型为这种一致性学习策略生成的伪标签的准确性。在本文中,我们引入了一种新的方法来为一致性学习策略生成高质量的伪标签,以解决半监督 3D 医学图像分割问题。该方法有三个重要的贡献。第一个贡献是协作修正学习网络 (CRLN),它为每个类别学习多个原型,用作外部知识先验,以在体素级别自适应地修正伪标签。第二个贡献包括动态交互模块 (DIM),以促进原型和多分辨率图像特征之间的成对和跨类交互,从而能够生成用于伪标签修正的准确体素级线索。第三个贡献是协作正监督 (CPS),它优化不确定的表示以与其类分布的不确定表示保持一致,从而提高模型对不确定区域进行分类的准确性。在三个公共 3D 医学分割数据集上进行的大量实验表明了我们半监督学习方法的有效性和优越性。

A Survey of LLM-based Agents in Medicine: How far are we from Baymax?

2502.11211v1 by Wenxuan Wang, Zizhan Ma, Zheng Wang, Chenghan Wu, Wenting Chen, Xiang Li, Yixuan Yuan

Large Language Models (LLMs) are transforming healthcare through the development of LLM-based agents that can understand, reason about, and assist with medical tasks. This survey provides a comprehensive review of LLM-based agents in medicine, examining their architectures, applications, and challenges. We analyze the key components of medical agent systems, including system profiles, clinical planning mechanisms, medical reasoning frameworks, and external capacity enhancement. The survey covers major application scenarios such as clinical decision support, medical documentation, training simulations, and healthcare service optimization. We discuss evaluation frameworks and metrics used to assess these agents' performance in healthcare settings. While LLM-based agents show promise in enhancing healthcare delivery, several challenges remain, including hallucination management, multimodal integration, implementation barriers, and ethical considerations. The survey concludes by highlighting future research directions, including advances in medical reasoning inspired by recent developments in LLM architectures, integration with physical systems, and improvements in training simulations. This work provides researchers and practitioners with a structured overview of the current state and future prospects of LLM-based agents in medicine.

摘要:大型語言模型 (LLM) 透過開發可理解、推理並協助醫療任務的 LLM 基礎代理人,轉變了醫療保健。本調查提供了 LLM 基礎代理人在醫學中的全面回顧,探討其架構、應用和挑戰。我們分析了醫療代理系統的主要組成部分,包括系統概況、臨床規劃機制、醫療推理架構和外部能力提升。本調查涵蓋了主要的應用場景,例如臨床決策支援、醫療文件、訓練模擬和醫療保健服務最佳化。我們討論了用於評估這些代理人在醫療保健環境中表現的評估架構和指標。雖然 LLM 基礎代理人顯示出在增強醫療保健提供方面的潛力,但仍有許多挑戰,包括幻覺管理、多模態整合、實施障礙和倫理考量。本調查最後強調了未來的研究方向,包括受 LLM 架構近期發展啟發的醫療推理進展、與物理系統的整合和訓練模擬的改進。這項工作為研究人員和從業人員提供了 LLM 基礎代理人在醫學中當前狀態和未來前景的結構化概觀。

RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer

2502.11179v1 by Shilong Yang, Qi Zang, Chulong Zhang, Lingfeng Huang, Yaoqin Xie

Traditional Chinese acupuncture methods often face controversy in clinical practice due to their high subjectivity. Additionally, current intelligent-assisted acupuncture systems have two major limitations: slow acupoint localization speed and low accuracy. To address these limitations, a new method leverages the excellent inference efficiency of the state-space model Mamba, while retaining the advantages of the attention mechanism in the traditional DETR architecture, to achieve efficient global information integration and provide high-quality feature information for acupoint localization tasks. Furthermore, by employing the concept of residual likelihood estimation, it eliminates the need for complex upsampling processes, thereby accelerating the acupoint localization task. Our method achieved state-of-the-art (SOTA) accuracy on a private dataset of acupoints on the human back, with an average Euclidean distance pixel error (EPE) of 7.792 and an average time consumption of 10.05 milliseconds per localization task. Compared to the second-best algorithm, our method improved both accuracy and speed by approximately 14%. This significant advancement not only enhances the efficacy of acupuncture treatment but also demonstrates the commercial potential of automated acupuncture robot systems. Our method is available at https://github.com/Sohyu1/RT-DEMT

摘要:傳統的中醫針灸方法由於其高度主觀性,在臨床實務中經常面臨爭議。此外,現有的智慧輔助針灸系統有兩大限制:取穴速度慢以及準確度低。為了解決這些限制,一種新的方法利用了狀態空間模型 Mamba 優異的推理效率,同時保留了傳統 DETR 架構中注意力機制的優點,以實現高效的全局資訊整合,並為取穴任務提供高品質的特徵資訊。此外,透過採用殘差似然估計的概念,它消除了對複雜上採樣程序的需求,從而加速了取穴任務。我們的模型在人體背部穴位私人資料集上達到了最先進 (SOTA) 的準確度,平均歐幾里得距離像素誤差 (EPE) 為 7.792,平均每個取穴任務耗時 10.05 毫秒。與第二好的演算法相比,我們的模型在準確度和速度上都提高了大約 14%。這項重大進展不僅提高了針灸治療的療效,也證明了自動化針灸機器人系統的商業潛力。我們的模型可以在 https://github.com/Sohyu1/RT-DEMT 取得
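The EPE figure quoted above is an average Euclidean distance in pixels between predicted and annotated acupoint positions. The following is a minimal sketch of that metric, assuming matched lists of (x, y) points; the function name and the coordinates are hypothetical and are not taken from the RT-DEMT repository.

```python
import math


def mean_euclidean_pixel_error(predicted, ground_truth):
    """Average Euclidean distance, in pixels, between matched (x, y) points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty point lists of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical coordinates for illustration only (not data from the paper).
predicted = [(103.0, 204.0), (50.0, 60.0)]
ground_truth = [(100.0, 200.0), (50.0, 60.0)]
epe = mean_euclidean_pixel_error(predicted, ground_truth)
```

The first pair is 5 pixels apart (a 3-4-5 triangle) and the second pair coincides, so the mean error for this toy input is 2.5 pixels.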

Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications

2502.11108v1 by Alexandru Lecu, Adrian Groza, Lezan Hawizy

Large language models (LLMs) have significantly advanced the field of natural language generation. However, they frequently generate unverified outputs, which compromises their reliability in critical applications. In this study, we propose an innovative framework that combines structured biomedical knowledge with LLMs through a retrieval-augmented generation technique. Our system develops a thorough knowledge graph by identifying and refining causal relationships and named entities from medical abstracts related to age-related macular degeneration (AMD). Using a vector-based retrieval process and a locally deployed language model, our framework produces responses that are both contextually relevant and verifiable, with direct references to clinical evidence. Experimental results show that this method notably decreases hallucinations, enhances factual precision, and improves the clarity of generated responses, providing a robust solution for advanced biomedical chatbot applications.

摘要:大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而,它們經常產生未經驗證的輸出,這會損害它們在關鍵應用中的可靠性。在本研究中,我們提出了一個創新的框架,透過檢索增強生成技術,將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體,開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型,產生在脈絡上相關且可驗證的回應,並直接參考臨床證據。實驗結果顯示,此方法顯著減少了幻覺、增強了事實準確性,並改善了生成回應的清晰度,為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。
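The vector-based retrieval step mentioned in the abstract can be illustrated with a plain cosine-similarity ranking. This is only a sketch under stated assumptions: the embeddings are hand-written toy vectors, the passage labels are placeholders, and `retrieve_top_k` is a hypothetical helper, not the Weaviate client API or the authors' pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, indexed_passages, k=2):
    """Return the k passage labels whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy two-dimensional embeddings (assumption: real systems use learned vectors).
indexed_passages = [
    ("passage A", [1.0, 0.0]),
    ("passage B", [0.0, 1.0]),
    ("passage C", [0.7, 0.7]),
]
top_passages = retrieve_top_k([1.0, 0.1], indexed_passages, k=2)
```

In a retrieval-augmented setup the returned passages would be placed in the language model's prompt so that the answer can cite them.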

Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration

2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang

Automatic depression detection gives clinicians cues for early intervention. Clinical interviews for depression detection involve dialogues centered around multiple themes. Existing studies primarily design end-to-end neural network models to capture the hierarchical structure of clinical interview dialogues. However, these methods exhibit defects in modeling the thematic content of clinical interviews: 1) they fail to capture intra-theme and inter-theme correlation explicitly, and 2) they do not allow clinicians to intervene and focus on themes of interest. To address these issues, this paper introduces an interactive depression detection framework. This framework leverages in-context learning techniques to identify themes in clinical interviews and then models both intra-theme and inter-theme correlation. Additionally, it employs AI-driven feedback to simulate the interests of clinicians, enabling interactive adjustment of theme importance. The framework, PDIMC, achieves absolute improvements of 35% and 12% compared to the state-of-the-art on the depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of modeling theme correlation and incorporating interactive external feedback.

摘要:自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而,這些方法在建模臨床訪談的主題內容時表現出缺陷:1)它們無法明確捕捉主題內和主題間的關聯性,以及 2)它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題,本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題,然後對主題內和主題間的關聯性進行建模。此外,它採用 AI 驅動的回饋來模擬臨床醫師的興趣,實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比,PDIMC 的絕對改進率分別為 35% 和 12%,這證明了對主題關聯性建模和納入互動式外部回饋的有效性。

CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening

2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu

Due to the rise in antimicrobial resistance, identifying novel compounds with antibiotic potential is crucial for combatting this global health issue. However, traditional drug development methods are costly and inefficient. Recognizing the pressing need for more effective solutions, researchers have turned to machine learning techniques to streamline the prediction and development of novel antibiotic compounds. While foundation models have shown promise in antibiotic discovery, current mainstream efforts still fall short of fully leveraging the potential of multimodal molecular data. Recent studies suggest that contrastive learning frameworks utilizing multimodal data exhibit excellent performance in representation learning across various domains. Building upon this, we introduce CL-MFAP, an unsupervised contrastive learning (CL)-based multimodal foundation (MF) model specifically tailored for discovering small molecules with potential antibiotic properties (AP) using three types of molecular data. This model employs 1.6 million bioactive molecules with drug-like properties from the ChEMBL dataset to jointly pretrain three encoders: (1) a transformer-based encoder with rotary position embedding for processing SMILES strings; (2) another transformer-based encoder, incorporating a novel bi-level routing attention mechanism to handle molecular graph representations; and (3) a Morgan fingerprint encoder using a multilayer perceptron, to achieve the contrastive learning purpose. The CL-MFAP outperforms baseline models in antibiotic property prediction by effectively utilizing different molecular modalities and demonstrates superior domain-specific performance when fine-tuned for antibiotic-related property prediction tasks.

摘要:由於抗菌藥物抗性上升,找出具有抗生素潛力的新型化合物對於對抗此項全球性健康議題至關重要。不過,傳統的藥物開發方法成本高昂且效率不彰。研究人員體認到對於更有效解決方案的迫切需求,因此轉向機器學習技術來簡化新型抗生素化合物的預測和開發。儘管基礎模型在抗生素發現方面展現潛力,目前的普遍做法仍未充分利用多模態分子資料的潛力。最近的研究顯示,利用多模態資料的對比學習架構在各種領域的表徵學習中展現出優異的效能。有鑑於此,我們引進 CL-MFAP,一種無監督對比學習 (CL) 為基礎的多模態基礎 (MF) 模型,專門用於使用三種類型的分子資料發現具有潛在抗生素特性的低分子。此模型採用 ChEMBL 資料集中的 160 萬個具有類藥物特性的生物活性分子,以聯合預訓練三個編碼器:(1) 一個具有旋轉位置嵌入的基於Transformer的編碼器,用於處理 SMILES 字串;(2) 另一個基於Transformer的編碼器,結合一種新穎的雙層路由注意機制來處理分子圖表表徵;以及 (3) 一個使用多層感知器的 Morgan 指紋編碼器,以達成對比學習的目的。CL-MFAP 透過有效利用不同的分子模式在抗生素特性預測方面優於基準模型,並且在針對抗生素相關特性預測任務進行微調時展現出優異的特定領域效能。
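The abstract does not spell out the contrastive objective, so the sketch below uses InfoNCE, a common choice for aligning two views of the same item, purely as an assumption. Row i of each input is taken to be the same molecule encoded by two different modality encoders (for example SMILES and fingerprint); the function name and toy embeddings are hypothetical.

```python
import math


def info_nce(anchor_embs, positive_embs, temperature=0.1):
    """Mean InfoNCE loss; row i of both lists is the same item in two modalities."""

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    anchors = [normalize(v) for v in anchor_embs]
    positives = [normalize(v) for v in positive_embs]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Scaled cosine similarity of this anchor to every candidate.
        logits = [sum(x * y for x, y in zip(anchor, p)) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: aligned pairs should score a much lower loss than swapped pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With three encoders, one plausible design is to sum this loss over each pair of modalities, though whether CL-MFAP does so is not stated in the abstract.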

Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images

2502.10908v1 by Sevim Cengiz, Ibraheem Hamdi, Mohammad Yaqub

Fetal gestational age (GA) is vital clinical information that is estimated during pregnancy in order to assess fetal growth. This is usually performed by measuring the crown-rump-length (CRL) on an ultrasound image in the dating scan, which is then correlated with fetal age and growth trajectory. A major issue when performing the CRL measurement is ensuring that the image is acquired at the correct view; otherwise, the measurement could be misleading. Although clinical guidelines specify the criteria for the correct CRL view, sonographers may not regularly adhere to such rules. In this paper, we propose a new deep learning-based solution that is able to verify the adherence of a CRL image to clinical guidelines in order to assess image quality and facilitate accurate estimation of GA. We first segment out important fetal structures, then use the localized structures to perform a clinically-guided mapping that verifies adherence to the criteria. The segmentation method combines the benefits of Convolutional Neural Networks (CNNs) and the Vision Transformer (ViT) to segment fetal structures in ultrasound images and localize important fetal landmarks. For segmentation purposes, we compare our proposed work with UNet and show that our CNN/ViT-based method outperforms an optimized version of UNet. Furthermore, we compare the output of the mapping with classification CNNs when assessing the clinical criteria and the overall acceptability of CRL images. We show that the proposed mapping is not only explainable but also more accurate than the best performing classification CNNs.

摘要:胎兒妊娠年齡 (GA) 是重要的臨床資訊,會在懷孕期間估計,以評估胎兒生長。這通常是透過在約會掃描中測量超音波影像中的頭臀長度 (CRL) 來執行,然後與胎兒年齡和生長軌跡相關聯。執行 CRL 測量時的一個主要問題是確保影像是在正確的視角下取得,否則可能會產生誤導。儘管臨床指南規定了正確 CRL 視角的標準,但超音波檢查員可能不會定期遵守這些規則。在本文中,我們提出了一個新的深度學習解決方案,能夠驗證 CRL 影像是否符合臨床指南,以評估影像品質並促進對 GA 的準確估計。我們首先分割出重要的胎兒結構,然後使用局部結構來執行臨床指導的對應,以驗證標準的遵守情況。分割方法結合了卷積神經網路 (CNN) 和視覺轉換器 (ViT) 的優點,以分割超音波影像中的胎兒結構並定位重要的胎兒標誌。為了分割目的,我們將我們提出的工作與 UNet 進行比較,並顯示我們基於 CNN/ViT 的方法優於 UNet 的最佳化版本。此外,我們在評估臨床標準和 CRL 影像的整體可接受性時,將對應的輸出與分類 CNN 進行比較。我們表明,所提出的對應不僅可以解釋,而且比效能最佳的分類 CNN 更準確。

Breaking Down the Hierarchy: A New Approach to Leukemia Classification

2502.10899v1 by Ibraheem Hamdi, Hosam El-Gendy, Ahmed Sharshar, Mohamed Saeed, Muhammad Ridzuan, Shahrukh K. Hashmi, Naveed Syed, Imran Mirza, Shakir Hussain, Amira Mahmoud Abdalla, Mohammad Yaqub

The complexities inherent to leukemia, a multifaceted cancer affecting white blood cells, pose considerable diagnostic and treatment challenges, primarily due to reliance on laborious morphological analyses and expert judgment that are susceptible to errors. Addressing these challenges, this study presents a refined, comprehensive strategy leveraging advanced deep-learning techniques for the classification of leukemia subtypes. We commence by developing a hierarchical label taxonomy, paving the way for differentiating between various subtypes of leukemia. The research further introduces a novel hierarchical approach inspired by clinical procedures capable of accurately classifying diverse types of leukemia alongside reactive and healthy cells. An integral part of this study involves a meticulous examination of the performance of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) as classifiers. The proposed method achieves approximately 90% accuracy across all leukemia subtypes, as substantiated by our experimental results. A visual representation of the experimental findings is provided to enhance the model's explainability and aid in understanding the classification process.

摘要:白血病的复杂性源于它是一种影响白血球的多面性癌症,主要由于依赖费力的形态分析和容易出错的专家判断,因此带来了相当大的诊断和治疗挑战。为了应对这些挑战,本研究提出了一种精细且全面的策略,利用先进的深度学习技术对白血病亚型进行分类。我们首先开发了一个分层的标签分类法,为区分白血病的各种亚型铺平了道路。该研究进一步引入了一种新颖的分层方法,该方法受临床程序的启发,能够准确地对各种类型的白血病以及反应性和健康细胞进行分类。本研究的一个组成部分涉及对卷积神经网络 (CNN) 和视觉变压器 (ViT) 作为分类器的性能进行细致检查。所提出的方法展示了令人印象深刻的成功率,在所有白血病亚型中实现了大约 90% 的准确率,我们的实验结果证实了这一点。提供了实验结果的可视化表示,以增强模型的可解释性并帮助理解分类过程。

An Empirical Analysis of Uncertainty in Large Language Model Evaluations

2502.10709v1 by Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang

As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. With careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance LLM's reliability and detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an uncertainty-aware LLM evaluator named ConfiLM using a human-annotated fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually designed test set sourced from the 2024 Olympics. Experimental results demonstrate that incorporating uncertainty as additional information during the fine-tuning phase can largely improve the model's evaluation performance in OOD scenarios. The code and data are released at: https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty.

摘要:隨著 LLM 作為法官的新典範出現,用於評估大型語言模型 (LLM) 的 LLM 評估器在對齊、偏差和穩定性方面引發了關注。儘管大量工作集中在對齊和偏差上,但很少有研究集中在 LLM 評估器的穩定性上。在本文中,我們進行了廣泛的實驗,涉及 9 個廣泛使用的 LLM 評估器,跨越 2 個不同的評估設定,以調查基於模型的 LLM 評估中的不確定性。我們精確指出 LLM 評估器根據模型系列和大小表現出不同的不確定性。通過仔細的比較分析,我們發現採用特殊的提示策略(無論是在推理過程中還是訓練後)可以在一定程度上緩解評估不確定性。通過利用不確定性來增強 LLM 在 Out-Of-Distribution (OOD) 數據中的可靠性和檢測能力,我們進一步微調了一個名為 ConfiLM 的不確定性感知 LLM 評估器,使用人工註釋的微調設置,並評估 ConfiLM 在手動設計的、來自 2024 年奧運會的測試集上的 OOD 評估能力。實驗結果表明,在微調階段將不確定性作為附加信息納入其中可以在很大程度上提高模型在 OOD 場景中的評估性能。代碼和數據發布於: https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty。

Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model

2502.10707v1 by Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong

Electrocardiogram (ECG) is essential for the clinical diagnosis of arrhythmias and other heart diseases, but deep learning methods based on ECG often face limitations due to the need for high-quality annotations. Although previous ECG self-supervised learning (eSSL) methods have made significant progress in representation learning from unannotated ECG data, they typically treat ECG signals as ordinary time-series data, segmenting the signals using fixed-size and fixed-step time windows, which often ignore the form and rhythm characteristics and latent semantic relationships in ECG signals. In this work, we introduce a novel perspective on ECG signals, treating heartbeats as words and rhythms as sentences. Based on this perspective, we first designed the QRS-Tokenizer, which generates semantically meaningful ECG sentences from the raw ECG signals. Building on these, we then propose HeartLang, a novel self-supervised learning framework for ECG language processing, learning general representations at form and rhythm levels. Additionally, we construct the largest heartbeat-based ECG vocabulary to date, which will further advance the development of ECG language processing. We evaluated HeartLang across six public ECG datasets, where it demonstrated robust competitiveness against other eSSL methods. Our data and code are publicly available at https://github.com/PKUDigitalHealth/HeartLang.

摘要:心電圖 (ECG) 對於心律不整和其他心臟疾病的臨床診斷至關重要,但基於心電圖的深度學習方法通常會因需要高品質註解而面臨限制。儘管先前的 ECG 自我監督學習 (eSSL) 方法在從未註解的 ECG 資料中學習表徵方面取得顯著進展,但它們通常將 ECG 訊號視為普通的時間序列資料,使用固定大小和固定步長的時窗對訊號進行分段,這通常會忽略 ECG 訊號中的形式和節律特徵以及潛在的語義關係。在這項工作中,我們對 ECG 訊號引入了新的觀點,將心跳視為單字,將節律視為句子。基於此觀點,我們首先設計了 QRS-Tokenizer,它從原始 ECG 訊號中產生語義有意義的 ECG 句子。在此基礎上,我們提出了 HeartLang,一種用於 ECG 語言處理的新型自我監督學習框架,在形式和節律層面上學習一般表徵。此外,我們構建了迄今為止最大的基於心跳的 ECG 詞彙表,這將進一步促進 ECG 語言處理的發展。我們在六個公開的 ECG 資料集上評估了 HeartLang,它展示了與其他 eSSL 方法相比的強大競爭力。我們的資料和程式碼可在 https://github.com/PKUDigitalHealth/HeartLang 公開取得。
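The "heartbeats as words" idea depends on cutting the signal around each beat rather than with fixed-step windows. The sketch below is not the QRS-Tokenizer itself, whose details the abstract does not give; it assumes a simple amplitude threshold for R-peak detection and a fixed half-window around each peak, and the function name and the toy signal are hypothetical.

```python
def segment_heartbeats(signal, threshold, half_window):
    """Cut a fixed window around each local maximum that exceeds the threshold."""
    peaks = [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] > threshold and signal[i] >= signal[i - 1] and signal[i] > signal[i + 1]
    ]
    beats = []
    for peak in peaks:
        start, end = peak - half_window, peak + half_window + 1
        # Skip beats whose window would run past either end of the recording.
        if start >= 0 and end <= len(signal):
            beats.append(signal[start:end])
    return beats


# Toy signal with two spikes standing in for R peaks (not real ECG data).
toy_signal = [0, 0, 1, 5, 1, 0, 0, 0, 1, 6, 1, 0, 0]
beats = segment_heartbeats(toy_signal, threshold=3, half_window=2)
```

Each returned window plays the role of one "word", and the ordered list of windows plays the role of a "sentence"; a production system would use a validated QRS detector instead of a raw threshold.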

Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction

2502.10689v1 by Leisheng Yu, Yanxiao Cai, Minxing Zhang, Xia Hu

The burgeoning volume of electronic health records (EHRs) has enabled deep learning models to excel in predictive healthcare. However, for high-stakes applications such as diagnosis prediction, model interpretability remains paramount. Existing deep learning diagnosis prediction models with intrinsic interpretability often assign attention weights to every past diagnosis or hospital visit, providing explanations lacking flexibility and succinctness. In this paper, we introduce SHy, a self-explaining hypergraph neural network model, designed to offer personalized, concise and faithful explanations that allow for interventions from clinical experts. By modeling each patient as a unique hypergraph and employing a message-passing mechanism, SHy captures higher-order disease interactions and extracts distinct temporal phenotypes as personalized explanations. It also addresses the incompleteness of the EHR data by accounting for essential false negatives in the original diagnosis record. A qualitative case study and extensive quantitative evaluations on two real-world EHR datasets demonstrate the superior predictive performance and interpretability of SHy over existing state-of-the-art models.

摘要:隨著電子健康紀錄 (EHR) 數量的激增,深度學習模型在預測保健方面表現出色。然而,對於診斷預測等高風險應用,模型的可解釋性仍然至關重要。現有的具有內在可解釋性的深度學習診斷預測模型通常會為每個過去的診斷或醫院就診分配注意力權重,提供的解釋缺乏靈活性且簡潔性。在本文中,我們介紹了 SHy,這是一個自解釋的超圖神經網路模型,旨在提供個性化、簡潔且忠實的解釋,讓臨床專家可以進行干預。通過將每個患者建模為一個獨特的超圖並採用訊息傳遞機制,SHy 捕捉到了高階疾病交互作用,並提取出不同的時間表型作為個性化解釋。它還通過考慮原始診斷記錄中的基本假陰性來解決電子健康紀錄資料的不完整性。對兩個真實世界電子健康紀錄資料集進行的定性案例研究和廣泛的定量評估表明,SHy 在預測效能和可解釋性方面優於現有的最先進模型。
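Message passing on a hypergraph typically runs in two hops: nodes send to the hyperedges that contain them, then hyperedges send back to their member nodes. The following sketch uses plain mean aggregation and is an assumption-laden illustration, not SHy's architecture; treating nodes as diagnoses and hyperedges as visits is likewise an assumption, since the abstract does not define them.

```python
def hypergraph_message_pass(node_feats, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation."""
    dim = len(node_feats[0])

    def mean(vectors):
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

    # Hop 1: each hyperedge averages the features of its member nodes.
    edge_feats = [mean([node_feats[n] for n in edge]) for edge in hyperedges]

    # Hop 2: each node averages the features of the hyperedges it belongs to.
    updated = []
    for n, feat in enumerate(node_feats):
        incident = [edge_feats[e] for e, edge in enumerate(hyperedges) if n in edge]
        updated.append(mean(incident) if incident else list(feat))
    return updated


# Toy example: three nodes with scalar features, two overlapping hyperedges.
node_feats = [[1.0], [3.0], [5.0]]
hyperedges = [[0, 1], [1, 2]]
updated_feats = hypergraph_message_pass(node_feats, hyperedges)
```

Because a hyperedge can join any number of nodes, a single round already mixes information among all diagnoses that co-occur, which is one way to read "higher-order disease interactions".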

ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis

2502.10620v1 by Xueshen Li, Xinlong Hou, Ziyi Huang, Yu Gan

Recent advancements in large language models (LLMs) have demonstrated extraordinary comprehension capabilities with remarkable breakthroughs on various vision-language tasks. However, the application of LLMs in generating reliable medical diagnostic reports remains in the early stages. Currently, medical LLMs typically feature a passive interaction model where doctors respond to patient queries with little or no involvement in analyzing medical images. In contrast, some ChatBots simply respond to predefined queries based on visual inputs, lacking interactive dialogue or consideration of medical history. As such, there is a gap between LLM-generated patient-ChatBot interactions and those occurring in actual patient-doctor consultations. To bridge this gap, we develop an LLM-based dialogue system, namely proactive multi-round vision-language interactions for computer-aided diagnosis (ProMRVL-CAD), to generate patient-friendly disease diagnostic reports. The proposed ProMRVL-CAD system allows proactive dialogue to provide patients with constant and reliable medical access via the integration of a knowledge graph into a recommendation system. Specifically, we devise two generators: a Proactive Question Generator (Pro-Q Gen) to generate proactive questions that guide the diagnostic procedure and a Multi-Vision Patient-Text Diagnostic Report Generator (MVP-DR Gen) to produce high-quality diagnostic reports. Evaluated on two real-world publicly available datasets, MIMIC-CXR and IU-Xray, our model generates higher-quality medical reports. We further demonstrate that ProMRVL-CAD remains robust in scenarios with low image quality. Moreover, we have created a synthetic medical dialogue dataset that simulates proactive diagnostic interactions between patients and doctors, serving as a valuable resource for training LLMs.

摘要:大型語言模型 (LLM) 最近的進展已展現出非凡的理解能力,在各種視覺語言任務中取得了顯著的突破。然而,LLM 在產生可靠的醫療診斷報告中的應用仍處於早期階段。目前,醫療 LLM 通常採用被動互動模式,醫生對患者的疑問做出回應,但很少或根本不參與分析醫療影像。相比之下,有些聊天機器人僅根據視覺輸入回應預先定義的查詢,缺乏互動對話或對病史的考量。因此,LLM 產生的患者聊天機器人互動與實際患者醫生諮詢之間存在差距。為了彌合這一差距,我們開發了一個基於 LLM 的對話系統,即主動多輪視覺語言互動,用於電腦輔助診斷 (ProMRVL-CAD),以產生對患者友善的疾病診斷報告。建議的 ProMRVL-CAD 系統允許主動對話,透過將知識圖譜整合到推薦系統中,為患者提供持續且可靠的醫療管道。具體來說,我們設計了兩個產生器:主動問題產生器 (Pro-Q Gen),用於產生引導診斷程序的主動問題,以及多視覺患者文字診斷報告產生器 (MVP-DR Gen),用於產生高品質的診斷報告。評估兩個真實世界公開可用的資料集,MIMIC-CXR 和 IU-Xray,我們的模型在產生醫療報告方面品質較佳。我們進一步證明 ProMRVL 的效能,在影像品質低的情況下仍能穩健運行。此外,我們建立了一個模擬患者和醫生之間主動診斷互動的合成醫療對話資料集,作為訓練 LLM 的寶貴資源。

Optimizing CNN Architectures for Advanced Thoracic Disease Classification

2502.10614v1 by Tejas Mirthipati

Machine learning, particularly convolutional neural networks (CNNs), has shown promise in medical image analysis, especially for thoracic disease detection using chest X-ray images. In this study, we evaluate various CNN architectures, including binary classification, multi-label classification, and ResNet50 models, to address challenges like dataset imbalance, variations in image quality, and hidden biases. We introduce advanced preprocessing techniques such as principal component analysis (PCA) for image compression and propose a novel class-weighted loss function to mitigate imbalance issues. Our results highlight the potential of CNNs in medical imaging but emphasize that issues like unbalanced datasets and variations in image acquisition methods must be addressed for optimal model performance.

摘要:機器學習,特別是卷積神經網路 (CNN) 已在醫學影像分析中展現出潛力,特別是使用胸部 X 光影像進行胸腔疾病偵測。在此研究中,我們評估各種 CNN 架構,包括二元分類、多標籤分類和 ResNet50 模型,以解決資料集不平衡、影像品質差異和隱藏偏差等挑戰。我們導入進階前處理技術,例如主成分分析 (PCA) 以進行影像壓縮,並提出一個新穎的類別加權損失函數來緩解不平衡問題。我們的結果突顯了 CNN 在醫學影像中的潛力,但強調必須解決資料集不平衡和影像擷取方法差異等問題,才能獲得最佳模型效能。
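The abstract does not define its class-weighted loss, so the sketch below shows the standard inverse-frequency weighting, `n_samples / (n_classes * class_count)`, applied to cross-entropy. It is an assumption standing in for the paper's loss, and the labels and probabilities are made-up toy values.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs[i][c] is the probability of class c."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Toy imbalanced labels: three samples of class 0, one of class 1.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
loss = weighted_cross_entropy(probs, labels, weights)
```

Here the rare class receives weight 2.0 and the common class about 0.67, so a mistake on the rare class costs three times as much.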

PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation

2502.10536v1 by Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner

The interpretation of histopathology cases underlies many important diagnostic and treatment decisions in medicine. Notably, this process typically requires pathologists to integrate and summarize findings across multiple slides per case. Existing vision-language capabilities in computational pathology have so far been largely limited to small regions of interest, larger regions at low magnification, or single whole-slide images (WSIs). This limits interpretation of findings that span multiple high-magnification regions across multiple WSIs. By making use of Gemini 1.5 Flash, a large multimodal model (LMM) with a 1-million token context window, we demonstrate the ability to generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches from multiple WSIs at 10X magnification. This is the equivalent of up to 11 hours of video at 1 fps. Expert pathologist evaluations demonstrate that the generated report text is clinically accurate and equivalent to or preferred over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide examples with up to 5 slides. While performance decreased for examples with 6 or more slides, this study demonstrates the promise of leveraging the long-context capabilities of modern LMMs for the uniquely challenging task of medical report generation where each case can contain thousands of image patches.

摘要:組織病理學病例的解讀是許多重要的醫學診斷和治療決策的基礎。值得注意的是,這個過程通常需要病理學家整合和總結每個病例的許多玻片中的發現。迄今為止,計算機病理學中現有的視覺語言功能在很大程度上僅限於小範圍的感興趣區域、低倍率下的較大區域或單一的全玻片影像 (WSI)。這限制了跨多個 WSI 中多個高倍率區域的發現的解讀。通過使用 Gemini 1.5 Flash,一個具有 100 萬個令牌上下文視窗的大型多模態模型 (LMM),我們展示了從多個 WSI 中多達 40,000 個 768x768 像素圖像貼片(10 倍放大)生成底線診斷的能力。這相當於 1 fps 下長達 11 小時的影片。專家病理學家評估表明,生成的報告文字在臨床上是準確的,並且等同於或優於 68%(95% CI:[60%,76%])的多玻片範例(最多 5 個玻片)的原始報告。儘管對於有 6 個或更多玻片的範例,其性能下降,但這項研究證明了利用現代 LMM 的長上下文功能來應對獨特挑戰性的醫療報告生成任務,其中每個病例可能包含數千個影像貼片,這項任務的前景。
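The "11 hours of video" comparison above follows from treating each image patch as one video frame at 1 fps. The short check below reproduces that arithmetic and also totals the raw pixel count; the variable names are illustrative only.

```python
patches = 40_000              # maximum number of patches per case, from the abstract
patch_side = 768              # pixels per side of each square patch
frames_per_second = 1         # one patch treated as one video frame

video_hours = patches / frames_per_second / 3600
total_pixels = patches * patch_side * patch_side
```

This gives about 11.1 hours of video and roughly 23.6 billion pixels per case at the upper limit.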

Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks

2502.10526v2 by Venkatesh Sivaraman, Anika Vaishampayan, Xiaotong Li, Brian R Buck, Ziyong Ma, Richard D Boyce, Adam Perer

Temporal predictive models have the potential to improve decisions in health care, public services, and other domains, yet they often fail to effectively support decision-makers. Prior literature shows that many misalignments between model behavior and decision-makers' expectations stem from issues of model specification, namely how, when, and for whom predictions are made. However, model specifications for predictive tasks are highly technical and difficult for non-data-scientist stakeholders to interpret and critique. To address this challenge we developed Tempo, an interactive system that helps data scientists and domain experts collaboratively iterate on model specifications. Using Tempo's simple yet precise temporal query language, data scientists can quickly prototype specifications with greater transparency about pre-processing choices. Moreover, domain experts can assess performance within data subgroups to validate that models behave as expected. Through three case studies, we demonstrate how Tempo helps multidisciplinary teams quickly prune infeasible specifications and identify more promising directions to explore.

摘要:時序預測模型有潛力改善醫療保健、公共服務和其他領域的決策,但它們經常無法有效支援決策者。先前的文獻顯示,模型行為與決策者期望之間的許多不一致源自於模型規範問題,也就是如何、何時以及針對誰進行預測。然而,預測任務的模型規範非常技術化,非數據科學家利害關係人難以解讀和批評。為了應對此挑戰,我們開發了 Tempo,一個互動式系統,可協助數據科學家和領域專家協同反覆運算模型規範。透過使用 Tempo 簡單但精確的時序查詢語言,數據科學家可以快速建構規範原型,並更透明地了解前處理的選擇。此外,領域專家可以評估資料子群組內的效能,以驗證模型是否如預期般運作。透過三個案例研究,我們展示 Tempo 如何協助跨領域團隊快速刪減不可行的規範,並找出更有希望探索的方向。

A Robust Attack: Displacement Backdoor Attack

2502.10490v1 by Yong Li, Han Gao

As artificial intelligence becomes more prevalent in our lives, people are enjoying the convenience it brings, but they are also facing hidden threats, such as data poisoning and adversarial attacks. These threats can have disastrous consequences for the application of artificial intelligence, especially in applications whose decisions take effect immediately, such as autonomous driving and medicine. Among these threats, backdoor attacks stand out for their concealment and ease of deployment, which makes them a threat that cannot be ignored. However, once a backdoored model is deployed, real-world factors such as jitter and brightness changes often make the attack perform unsatisfactorily. Based on this, we propose a highly robust backdoor attack, the Displacement Backdoor Attack (DBA), which shifts the target sample and combines it with itself to form a backdoor sample. Experimental results show that DBA can resist data augmentation that simulates real-world differences, such as rotation and cropping.

摘要:随着人工智能在我们的生活中变得越来越普遍,人们正在享受它带来的便利,但也面临着隐藏的威胁,例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果,特别是对于一些立即生效的应用,例如自动驾驶和医疗领域。在这些威胁中,后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象,使其成为不可忽视的威胁,然而,在部署后门模型的过程中,后门攻击往往存在一些使其在实际应用中不尽如人意的原因,例如抖动和亮度变化。基于此,我们提出了一种高度鲁棒的后门攻击,该攻击对目标样本进行平移并将其与自身结合以形成后门样本,即置换后门攻击 (DBA)。实验结果表明,DBA 攻击可以抵抗模拟真实世界差异的数据增强,例如旋转和裁剪。

3D ReX: Causal Explanations in 3D Neuroimaging Classification

2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chockler

Explainability remains a significant problem for AI models in medical imaging, making it challenging for clinicians to trust AI-driven predictions. We introduce 3D ReX, the first causality-based post-hoc explainability tool for 3D models. 3D ReX uses the theory of actual causality to generate responsibility maps which highlight the regions most crucial to the model's decision. We test 3D ReX on a stroke detection model, providing insight into the spatial distribution of features relevant to stroke.

摘要:解釋性仍然是醫療影像中 AI 模型的一大問題,這使得臨床醫生難以信任 AI 驅動的預測。 我們引入了 3D ReX,這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖,該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX,提供了與中風相關特徵的空間分佈的見解。

Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model

2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott

In the analysis of remote healthcare monitoring data, time series representation learning offers substantial value in uncovering deeper patterns of patient behavior, especially given the fine temporal granularity of the data. In this study, we focus on a dataset of home activity records from people living with dementia. We propose a two-stage self-supervised learning approach. The first stage converts time-series activities into text strings, which are then encoded by a fine-tuned language model. In the second stage, the resulting time-series vectors are reduced to two dimensions so that the PageRank method can be applied to analyze latent state transitions, quantitatively assess participants' behavioral patterns, and identify activity biases. These insights, combined with diagnostic data, aim to support personalized care interventions.

摘要:在遠程醫療監控數據分析中,時序表示學習在揭示患者行為的更深層模式方面提供了實質性的價值,特別是考慮到數據的精細時間粒度。在本研究中,我們專注於痴呆症患者居家活動記錄的數據集。我們提出了一種兩階段的自我監督學習方法。第一階段涉及將時序活動轉換為文本串,然後由微調語言模型編碼。在第二階段,這些時序向量被雙維化以應用 PageRank 方法,分析潛在狀態轉換以定量評估參與者的行為模式並識別活動偏差。這些見解與診斷數據相結合,旨在支持個性化護理干預。
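To make the PageRank step concrete, the sketch below runs power iteration over a matrix of transition counts between latent states. How the paper derives its states from the two-dimensional vectors is not described in the abstract, so the state indices and counts here are hypothetical placeholders.

```python
def pagerank(transition_counts, damping=0.85, iterations=100):
    """Power-iteration PageRank over a square matrix of transition counts."""
    n = len(transition_counts)
    # Row-normalize counts into probabilities; rows with no exits become uniform.
    probs = []
    for row in transition_counts:
        total = sum(row)
        probs.append([c / total for c in row] if total else [1.0 / n] * n)

    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1 - damping) / n + damping * sum(rank[i] * probs[i][j] for i in range(n))
            for j in range(n)
        ]
    return rank


# Toy counts: every state eventually flows into state 2, which loops on itself.
transition_counts = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
]
ranks = pagerank(transition_counts)
```

A state with a high rank is one the participant keeps returning to, which is one way such scores could flag an activity bias.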

TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation

2502.09931v1 by Ju-Hyeon Nam, Nur Suriza Syazwany, Sang-Chul Lee

Skip connection engineering is primarily employed to address the semantic gap between the encoder and decoder, while also integrating global dependencies to understand the relationships among complex anatomical structures in medical image segmentation. Although several models have proposed transformer-based approaches to incorporate global dependencies within skip connections, they often face limitations in capturing detailed local features with high computational complexity. In contrast, graph neural networks (GNNs) exploit graph structures to effectively capture local and global features. Leveraging these properties, we introduce an attentional cross-scale graph neural network (ACS-GNN), which enhances the skip connection framework by converting cross-scale feature maps into a graph structure and capturing complex anatomical structures through node attention. Additionally, we observed that deep learning models often produce uninformative feature maps, which degrades the quality of spatial attention maps. To address this problem, we integrated entropy-driven feature selection (EFS) with spatial attention, calculating an entropy score for each channel and filtering out high-entropy feature maps. Our innovative framework, TransGUNet, combines ACS-GNN with EFS-based spatial attention to enhance domain generalizability across various modalities by leveraging GNNs alongside a reliable spatial attention map, ensuring more robust features within the skip connection. Through comprehensive experiments and analysis, TransGUNet achieved superior segmentation performance on six seen and eight unseen datasets, demonstrating significantly higher efficiency compared to previous methods.

摘要:跳躍連接工程主要用於解決編碼器和解碼器之間的語義鴻溝,同時還整合全局依賴關係以了解醫學影像分割中複雜解剖結構之間的關係。儘管有幾個模型提出了基於Transformer的架構來整合跳躍連接中的全局依賴關係,但它們在以高計算複雜度擷取詳細的局部特徵時常常面臨限制。相比之下,圖神經網路 (GNN) 利用圖結構有效擷取局部和全局特徵。利用這些屬性,我們引入了注意力跨尺度圖神經網路 (ACS-GNN),它通過將跨尺度特徵圖轉換為圖結構並通過節點注意力擷取複雜的解剖結構來增強跳躍連接框架。此外,我們觀察到深度學習模型通常會產生無意義的特徵圖,這會降低空間注意力圖的品質。為了解決這個問題,我們將熵驅動特徵選擇 (EFS) 與空間注意力整合在一起,為每個通道計算熵分數並濾出高熵特徵圖。我們創新的框架 TransGUNet 包含 ACS-GNN 和基於 EFS 的空間注意力,通過利用 GNN 以及可靠的空間注意力圖有效增強跨各種模態的域泛化能力,確保跳躍連接中更強大的特徵。透過全面的實驗和分析,TransGUNet 在六個已見和八個未見的資料集上實現了優異的分割效能,證明與先前的方法相比,效率顯著提高。
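Entropy-driven feature selection, as described above, scores each channel and discards the high-entropy ones. The sketch below is one plausible reading rather than the TransGUNet implementation: it assumes each channel's absolute activations are normalized into a distribution before computing Shannon entropy, and the helper names and toy feature maps are hypothetical.

```python
import math


def channel_entropy(feature_map):
    """Shannon entropy of one channel after normalizing |activations| to sum to 1."""
    flat = [abs(v) for row in feature_map for v in row]
    total = sum(flat)
    if total == 0:
        return math.log(len(flat))  # treat an all-zero map as maximally uninformative
    return -sum((v / total) * math.log(v / total) for v in flat if v > 0)


def select_low_entropy_channels(channels, keep):
    """Return the indices of the `keep` channels with the lowest entropy."""
    ranked = sorted(range(len(channels)), key=lambda c: channel_entropy(channels[c]))
    return sorted(ranked[:keep])


# Toy 2x2 channels: sharply peaked, completely flat, and in between.
channels = [
    [[9.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[5.0, 5.0], [0.0, 0.0]],
]
kept = select_low_entropy_channels(channels, keep=2)
```

The flat channel has the maximum entropy (log 4) and is dropped, on the reasoning that a map that responds equally everywhere says little about where to attend.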

Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos

2502.09886v1 by Weirui Ye, Fangchen Liu, Zheng Ding, Yang Gao, Oleh Rybkin, Pieter Abbeel

Simulation offers a promising approach for cheaply scaling training data for generalist policies. To scalably generate data from diverse and realistic tasks, existing algorithms rely either on large language models (LLMs), which may hallucinate tasks that are not interesting for robotics, or on digital twins, which require careful real-to-sim alignment and are hard to scale. To address these challenges, we introduce Video2Policy, a novel framework that leverages internet RGB videos to reconstruct tasks based on everyday human behavior. Our approach comprises two phases: (1) task generation in simulation from videos; and (2) reinforcement learning utilizing in-context LLM-generated reward functions iteratively. We demonstrate the efficacy of Video2Policy by reconstructing over 100 videos from the Something-Something-v2 (SSv2) dataset, which depicts diverse and complex human behaviors on 9 different tasks. Our method can successfully train RL policies on such tasks, including complex and challenging tasks such as throwing. Finally, we show that the generated simulation data can be scaled up for training a general policy, and it can be transferred back to the real robot in a Real2Sim2Real way.

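Summary: Simulation is a promising way to scale up training data for generalist policies cheaply. Existing task generators rely either on LLMs, which may hallucinate tasks of little interest to robotics, or on digital twins, which need careful real-to-sim alignment and are hard to scale. Video2Policy instead uses internet RGB videos to reconstruct tasks grounded in everyday human behavior, in two phases: (1) generating tasks in simulation from videos, and (2) reinforcement learning with iteratively refined, in-context LLM-generated reward functions. The authors reconstruct over 100 videos from Something-Something-v2 (SSv2) covering 9 tasks, train RL policies on them (including throwing), and show that the generated data scales to a general policy that transfers back to a real robot in a Real2Sim2Real manner.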

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

2502.09838v3 by Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi

We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To train HealthGPT effectively, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.

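Summary: HealthGPT is a Medical Large Vision-Language Model (Med-LVLM) that unifies medical visual comprehension and generation in a single autoregressive paradigm. It progressively adapts heterogeneous comprehension and generation knowledge to pre-trained LLMs through heterogeneous low-rank adaptation (H-LoRA), complemented by a hierarchical visual perception approach and a three-stage learning strategy. Training uses VL-Health, a medical comprehension and generation dataset built for this purpose. Experiments report strong performance and scalability on unified medical visual tasks. The project is available at https://github.com/DCDmllm/HealthGPT.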

Incentivize without Bonus: Provably Efficient Model-based Online Multi-agent RL for Markov Games

2502.09780v1 by Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi

Multi-agent reinforcement learning (MARL) lies at the heart of a plethora of applications involving the interaction of a group of agents in a shared unknown environment. A prominent framework for studying MARL is Markov games, with the goal of finding various notions of equilibria in a sample-efficient manner, such as the Nash equilibrium (NE) and the coarse correlated equilibrium (CCE). However, existing sample-efficient approaches require either tailored uncertainty estimation under function approximation or careful coordination of the players. In this paper, we propose a novel model-based algorithm, called VMG, that incentivizes exploration by biasing the empirical estimate of the model parameters towards those with higher collective best-response values of all the players when fixing the other players' policies, thus encouraging the policy to deviate from its current equilibrium for more exploration. VMG is oblivious to different forms of function approximation, and permits simultaneous and uncoupled policy updates of all players. Theoretically, we also establish that VMG achieves a near-optimal regret for finding both the NEs of two-player zero-sum Markov games and CCEs of multi-player general-sum Markov games under linear function approximation in an online environment, which nearly matches that of counterparts with sophisticated uncertainty quantification.

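Summary: Markov games are a prominent framework for multi-agent reinforcement learning (MARL), where the goal is to find equilibria such as the Nash equilibrium (NE) or the coarse correlated equilibrium (CCE) in a sample-efficient way. Existing sample-efficient methods need either tailored uncertainty estimation under function approximation or careful coordination among players. The proposed model-based algorithm, VMG, incentivizes exploration by biasing the empirical model estimate toward parameters with higher collective best-response values for all players, with the other players' policies held fixed. VMG is agnostic to the form of function approximation and allows simultaneous, uncoupled policy updates. Under linear function approximation in the online setting, it achieves near-optimal regret for NEs of two-player zero-sum games and CCEs of multi-player general-sum games, nearly matching methods that use sophisticated uncertainty quantification.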

The AI-Therapist Duo: Exploring the Potential of Human-AI Collaboration in Personalized Art Therapy for PICS Intervention

2502.09757v1 by Bereket A. Yilma, Chan Mi Kim, Geke Ludden, Thomas van Rompay, Luis A. Leiva

Post-intensive care syndrome (PICS) is a multifaceted condition that arises from prolonged stays in an intensive care unit (ICU). While preventing PICS among ICU patients is becoming increasingly important, interventions remain limited. Building on evidence supporting the effectiveness of art exposure in addressing the psychological aspects of PICS, we propose a novel art therapy solution through a collaborative Human-AI approach that enhances personalized therapeutic interventions using state-of-the-art Visual Art Recommendation Systems. We developed two Human-in-the-Loop (HITL) personalization methods and assessed their impact through a large-scale user study (N=150). Our findings demonstrate that this Human-AI collaboration not only enhances the personalization and effectiveness of art therapy but also supports therapists by streamlining their workload. While our study centres on PICS intervention, the results suggest that Human-AI collaborative art therapy could potentially benefit other areas where emotional support is critical, such as cases of anxiety and depression.

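Summary: Post-intensive care syndrome (PICS) is a multifaceted condition arising from prolonged ICU stays, and interventions to prevent it remain limited. Building on evidence that art exposure helps with the psychological aspects of PICS, the authors propose an art therapy approach based on Human-AI collaboration and state-of-the-art Visual Art Recommendation Systems. They develop two Human-in-the-Loop (HITL) personalization methods and evaluate them in a large-scale user study (N=150). The results indicate that the collaboration improves the personalization and effectiveness of art therapy while streamlining therapists' workload, and may extend to other settings where emotional support is critical, such as anxiety and depression.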

A CNN Approach to Automated Detection and Classification of Brain Tumors

2502.09731v1 by Md. Zahid Hasan, Abdullah Tamim, D. M. Asadujjaman, Md. Mahfujur Rahman, Md. Abu Ahnaf Mollick, Nosin Anjum Dristi, Abdullah-Al-Noman

Brain tumors require an assessment to ensure timely diagnosis and effective patient treatment. Morphological factors such as size, location, texture, and variable appearance complicate tumor inspection. Medical imaging presents challenges, including noise and incomplete images. This research article presents a methodology for processing Magnetic Resonance Imaging (MRI) data, encompassing techniques for image classification and denoising. The effective use of MRI images allows medical professionals to detect brain disorders, including tumors. This research aims to categorize healthy brain tissue and brain tumors by analyzing the provided MRI data. Unlike alternative methods like Computed Tomography (CT), MRI technology offers a more detailed representation of internal anatomical components, making it a suitable option for studying data related to brain tumors. The MRI image is first denoised using an anisotropic diffusion filter. The dataset used to create the models is a publicly accessible and validated Brain Tumour Classification (MRI) database, comprising 3,264 brain MRI scans. SMOTE was employed for data augmentation and dataset balancing. Convolutional Neural Networks (CNNs) such as ResNet152V2, VGG, ViT, and EfficientNet were employed for the classification procedure. EfficientNet attained an accuracy of 98%, the highest recorded.

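Summary: Brain tumors must be assessed promptly for timely diagnosis and effective treatment, but tumor size, location, texture, and variable appearance, together with noisy or incomplete images, complicate inspection. The paper presents an MRI processing pipeline that covers denoising and classification, with the aim of separating healthy brain tissue from brain tumors. Images are first denoised with an anisotropic diffusion filter. Models are built on the public, validated Brain Tumour Classification (MRI) database of 3,264 scans, with SMOTE used for augmentation and class balancing. ResNet152V2, VGG, ViT, and EfficientNet are used for classification, and EfficientNet reaches the highest recorded accuracy at 98%.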
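
As a rough illustration of the denoising step, the sketch below implements a basic Perona-Malik style anisotropic diffusion on a small grayscale image stored as nested lists. The abstract does not give the filter settings, so the conductance function, `kappa`, `step`, and `iterations` are assumptions chosen for illustration, not the authors' configuration.

```python
import math


def anisotropic_diffusion(image, iterations=5, kappa=5.0, step=0.2):
    """Perona-Malik diffusion with 4 neighbours and zero-flux borders.

    Small intensity differences (likely noise) diffuse strongly, while
    large differences (likely edges) are preserved because the
    conductance exp(-(diff / kappa) ** 2) shrinks towards zero.
    """
    height, width = len(image), len(image[0])
    current = [[float(v) for v in row] for row in image]
    for _ in range(iterations):
        updated = [row[:] for row in current]
        for y in range(height):
            for x in range(width):
                flux = 0.0
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        diff = current[ny][nx] - current[y][x]
                        flux += math.exp(-((diff / kappa) ** 2)) * diff
                updated[y][x] = current[y][x] + step * flux
        current = updated
    return current


# A flat 5x5 patch with a single noisy pixel in the centre.
noisy = [[10.0] * 5 for _ in range(5)]
noisy[2][2] = 12.0
smoothed = anisotropic_diffusion(noisy)
print(round(smoothed[2][2], 3))
```

With a step of at most 0.25 the update stays stable for four neighbours, and because neighbouring pixels exchange equal and opposite flux, the mean intensity of the image is preserved while the isolated spike is pulled back towards its surroundings.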

Evaluating GPT's Capability in Identifying Stages of Cognitive Impairment from Electronic Health Data

2502.09715v1 by Yu Leng, Yingnan He, Colin Magdamo, Ana-Maria Vranceanu, Christine S. Ritchie, Shibani S. Mukerji, Lidia M. V. R. Moura, John R. Dickson, Deborah Blacker, Sudeshna Das

Identifying cognitive impairment within electronic health records (EHRs) is crucial not only for timely diagnoses but also for facilitating research. Information about cognitive impairment often exists within unstructured clinician notes in EHRs, but manual chart reviews are both time-consuming and error-prone. To address this issue, our study evaluates an automated approach using zero-shot GPT-4o to determine the stage of cognitive impairment in two different tasks. First, we evaluated the ability of GPT-4o to determine the global Clinical Dementia Rating (CDR) on specialist notes from 769 patients who visited the memory clinic at Massachusetts General Hospital (MGH), and achieved a weighted kappa score of 0.83. Second, we assessed GPT-4o's ability to differentiate between normal cognition, mild cognitive impairment (MCI), and dementia on all notes in a 3-year window from 860 Medicare patients. GPT-4o attained a weighted kappa score of 0.91 in comparison to specialist chart reviews and 0.96 on cases that the clinical adjudicators rated with high confidence. Our findings demonstrate GPT-4o's potential as a scalable chart review tool for creating research datasets and assisting diagnosis in clinical settings in the future.

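Summary: Identifying cognitive impairment in electronic health records (EHRs) matters for both timely diagnosis and research, but the relevant information often sits in unstructured clinician notes, and manual chart review is slow and error-prone. The study evaluates zero-shot GPT-4o on two staging tasks. On specialist notes from 769 patients of the Massachusetts General Hospital (MGH) memory clinic, GPT-4o determined the global Clinical Dementia Rating (CDR) with a weighted kappa of 0.83. On all notes in a 3-year window from 860 Medicare patients, it distinguished normal cognition, mild cognitive impairment (MCI), and dementia with a weighted kappa of 0.91 against specialist chart review, and 0.96 on cases that adjudicators rated with high confidence. The authors see GPT-4o as a scalable chart review tool for building research datasets and, in the future, assisting clinical diagnosis.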
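
The agreement metric reported in this entry, weighted kappa, can be computed as in the sketch below. The abstract does not say which weighting scheme was used, so quadratic weights are an assumption here, and the ratings are made-up toy values rather than study data.

```python
def weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with quadratic disagreement weights (assumed scheme).

    Ratings are integers in [0, n_categories); larger gaps between the
    two raters are penalized more heavily than adjacent-category gaps.
    """
    n = len(rater_a)
    k = n_categories
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        return ((i - j) ** 2) / ((k - 1) ** 2)

    disagreement = sum(weight(i, j) * observed[i][j]
                       for i in range(k) for j in range(k))
    expected = sum(weight(i, j) * row[i] * col[j]
                   for i in range(k) for j in range(k))
    if expected == 0:
        return 1.0  # both raters used a single identical category
    return 1.0 - disagreement / expected


# Toy labels: 0 = normal cognition, 1 = MCI, 2 = dementia.
model_labels = [0, 1, 2, 2, 1, 0]
reviewer_labels = [0, 1, 2, 1, 1, 0]
print(round(weighted_kappa(model_labels, reviewer_labels, 3), 3))
```

Unlike plain accuracy, this score corrects for chance agreement and treats confusing normal cognition with dementia as a worse error than confusing MCI with either neighbour, which suits ordinal staging tasks.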

Metamorphic Testing for Pose Estimation Systems

2502.09460v1 by Matias Duran, Thomas Laurent, Ellen Rushe, Anthony Ventresque

Pose estimation systems are used in a variety of fields, from sports analytics to livestock care. Given their potential impact, it is paramount to systematically test their behaviour and potential for failure. This is a complex task due to the oracle problem and the high cost of manual labelling necessary to build ground truth keypoints. This problem is exacerbated by the fact that different applications require systems to focus on different subjects (e.g., human versus animal) or landmarks (e.g., only extremities versus whole body and face), which makes labelled test data rarely reusable. To combat these problems we propose MET-POSE, a metamorphic testing framework for pose estimation systems that bypasses the need for manual annotation while assessing the performance of these systems under different circumstances. MET-POSE thus allows users of pose estimation systems to assess the systems in conditions that more closely relate to their application without having to label an ad-hoc test dataset or rely only on available datasets, which may not be adapted to their application domain. While we define MET-POSE in general terms, we also present a non-exhaustive list of metamorphic rules that represent common challenges in computer vision applications, as well as a specific way to evaluate these rules. We then experimentally show the effectiveness of MET-POSE by applying it to Mediapipe Holistic, a state-of-the-art human pose estimation system, with the FLIC and PHOENIX datasets. With these experiments, we outline numerous ways in which the outputs of MET-POSE can uncover faults in pose estimation systems at a similar or higher rate than classic testing using hand-labelled data, and show that users can tailor the rule set they use to the faults and level of accuracy relevant to their application.

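Summary: Pose estimation systems are used in fields ranging from sports analytics to livestock care, so their behaviour and failure modes need systematic testing. This is hard because of the oracle problem and the cost of manually labelling ground-truth keypoints, and because different applications target different subjects or landmarks, labelled test data is rarely reusable. MET-POSE is a metamorphic testing framework that assesses pose estimation systems under varied conditions without manual annotation. The authors provide a non-exhaustive list of metamorphic rules reflecting common computer vision challenges and a way to evaluate them, then apply MET-POSE to Mediapipe Holistic on the FLIC and PHOENIX datasets. The experiments show that MET-POSE uncovers faults at a similar or higher rate than classic testing with hand-labelled data, and that users can tailor the rule set to the faults and accuracy level relevant to their application.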
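
To make the idea of a metamorphic rule concrete, here is a minimal sketch of one plausible rule: mirroring an image horizontally should mirror the predicted keypoint. The rule, the toy "estimators", and all function names are hypothetical illustrations and are not taken from MET-POSE itself. The point is only that the check needs no ground-truth labels.

```python
import math


def flip_horizontal(image):
    """Mirror a grayscale image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def brightest_pixel(image, max_column=None):
    """Toy single-keypoint 'estimator': (x, y) of the brightest pixel."""
    limit = len(image[0]) if max_column is None else max_column
    best_value, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(image):
        for x in range(limit):
            if row[x] > best_value:
                best_value, best_xy = row[x], (x, y)
    return best_xy


def left_half_only(image):
    """Deliberately faulty estimator that ignores the right half."""
    return brightest_pixel(image, max_column=len(image[0]) // 2)


def satisfies_flip_rule(estimator, image, tolerance=0.0):
    """Metamorphic check: flipping the input should mirror the output."""
    width = len(image[0])
    x, y = estimator(image)
    flipped_x, flipped_y = estimator(flip_horizontal(image))
    expected = (width - 1 - x, y)
    error = math.hypot(flipped_x - expected[0], flipped_y - expected[1])
    return error <= tolerance


frame = [
    [0, 0, 0, 0],
    [0, 0, 0, 9],
    [0, 0, 0, 0],
]
print(satisfies_flip_rule(brightest_pixel, frame))
print(satisfies_flip_rule(left_half_only, frame))
```

The faulty estimator is exposed purely by the inconsistency between its outputs on the original and transformed inputs, which is how metamorphic testing sidesteps the oracle problem.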

Towards Virtual Clinical Trials of Radiology AI with Conditional Generative Modeling

2502.09688v1 by Benjamin D. Killeen, Bohua Wan, Aditya V. Kulkarni, Nathan Drenkow, Michael Oberst, Paul H. Yi, Mathias Unberath

Artificial intelligence (AI) is poised to transform healthcare by enabling personalized and efficient care through data-driven insights. Although radiology is at the forefront of AI adoption, in practice, the potential of AI models is often overshadowed by severe failures to generalize: AI models can have performance degradation of up to 20% when transitioning from controlled test environments to clinical use by radiologists. This mismatch raises concerns that radiologists will be misled by incorrect AI predictions in practice and/or grow to distrust AI, rendering these promising technologies practically ineffectual. Exhaustive clinical trials of AI models on abundant and diverse data are thus critical to anticipate AI model degradation when encountering varied data samples. Achieving these goals, however, is challenging due to the high costs of collecting diverse data samples and corresponding annotations. To overcome these limitations, we introduce a novel conditional generative AI model designed for virtual clinical trials (VCTs) of radiology AI, capable of realistically synthesizing full-body CT images of patients with specified attributes. By learning the joint distribution of images and anatomical structures, our model enables precise replication of real-world patient populations with unprecedented detail at this scale. We demonstrate meaningful evaluation of radiology AI models through VCTs powered by our synthetic CT study populations, revealing model degradation and facilitating algorithmic auditing for bias-inducing data attributes. Our generative AI approach to VCTs is a promising avenue towards a scalable solution to assess model robustness, mitigate biases, and safeguard patient care by enabling simpler testing and evaluation of AI models in any desired range of diverse patient populations.

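Summary: Radiology leads AI adoption in healthcare, yet AI models often fail to generalize: performance can drop by up to 20% when moving from controlled test environments to clinical use. This risks radiologists being misled by wrong predictions or coming to distrust AI. Exhaustive clinical trials on abundant, diverse data would anticipate such degradation, but collecting diverse samples and annotations is costly. The authors introduce a conditional generative AI model for virtual clinical trials (VCTs) of radiology AI that synthesizes realistic full-body CT images of patients with specified attributes. By learning the joint distribution of images and anatomical structures, it replicates real-world patient populations in detail. VCTs on these synthetic populations reveal model degradation and support auditing for bias-inducing data attributes, offering a scalable way to assess robustness, mitigate bias, and safeguard patient care.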

Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models

2502.09687v1 by Wiktoria Mieleszczenko-Kowszewicz, Beata Bajcar, Jolanta Babiak, Berenika Dyczek, Jakub Świstak, Przemysław Biecek

Be careful what you ask for, you just might get it. This saying fits with the way large language models (LLMs) are trained, which, instead of being rewarded for correctness, are increasingly rewarded for pleasing the recipient. So, they are increasingly effective at persuading us that their answers are valuable. But what tricks do they use in this persuasion? In this study, we examine the psycholinguistic features of the responses used by twelve different language models. By grouping response content according to rational or emotional prompts and exploring social influence principles employed by LLMs, we ask whether and how we can mitigate the risks of LLM-driven mass misinformation. We position this study within the broader discourse on human-centred AI, emphasizing the need for interdisciplinary approaches to mitigate cognitive and societal risks posed by persuasive AI responses.

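Summary: "Be careful what you ask for, you just might get it" fits the way large language models (LLMs) are trained: they are increasingly rewarded for pleasing the recipient rather than for correctness, and so become better at persuading us that their answers are valuable. The study examines the psycholinguistic features of responses from twelve language models. By grouping responses according to rational or emotional prompts and exploring the social influence principles the models employ, the authors ask whether and how the risks of LLM-driven mass misinformation can be mitigated. They place the work within the discourse on human-centred AI and call for interdisciplinary approaches to the cognitive and societal risks of persuasive AI responses.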

The Joint Entity-Relation Extraction Model Based on Span and Interactive Fusion Representation for Chinese Medical Texts with Complex Semantics

2502.09247v1 by Danni Feng, Runzhi Li, Jing Wang, Siyu Yan, Lihong Ma, Yunli Xing

Joint entity-relation extraction is a critical task in transforming unstructured or semi-structured text into triplets, facilitating the construction of large-scale knowledge graphs, and supporting various downstream applications. Despite its importance, research on Chinese text, particularly with complex semantics in specialized domains like medicine, remains limited. To address this gap, we introduce the CH-DDI, a Chinese drug-drug interactions dataset designed to capture the intricacies of medical text. Leveraging the strengths of attention mechanisms in capturing long-range dependencies, we propose the SEA module, which enhances the extraction of complex contextual semantic information, thereby improving entity recognition and relation extraction. Additionally, to address the inefficiencies of existing methods in facilitating information exchange between entity recognition and relation extraction, we present an interactive fusion representation module. This module employs Cross Attention for bidirectional information exchange between the tasks and further refines feature extraction through BiLSTM. Experimental results on both our CH-DDI dataset and public CoNLL04 dataset demonstrate that our model exhibits strong generalization capabilities. On the CH-DDI dataset, our model achieves an F1-score of 96.73% for entity recognition and 78.43% for relation extraction. On the CoNLL04 dataset, it attains an entity recognition precision of 89.54% and a relation extraction accuracy of 71.64%.

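Summary: Joint entity-relation extraction turns unstructured or semi-structured text into triplets, supporting large-scale knowledge graph construction and downstream applications. Research on Chinese text with complex semantics in specialized domains such as medicine remains limited. The authors introduce CH-DDI, a Chinese drug-drug interaction dataset, and propose the SEA module, which uses attention to capture long-range dependencies and complex contextual semantics. An interactive fusion representation module then uses Cross Attention for bidirectional information exchange between entity recognition and relation extraction and refines features with a BiLSTM. On CH-DDI, the model achieves F1-scores of 96.73% for entity recognition and 78.43% for relation extraction. On CoNLL04, it attains an entity recognition precision of 89.54% and a relation extraction accuracy of 71.64%.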
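
For readers less familiar with how extraction quality is scored, the sketch below computes precision, recall, and F1 over sets of predicted and gold triplets using exact matching. The matching criterion and the toy triplets (with placeholder names such as `drug_a`) are assumptions for illustration; the abstract does not describe the paper's exact evaluation protocol.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over two collections of triplets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (head entity, relation, tail entity) triplets.
gold_triplets = [
    ("drug_a", "increases_effect_of", "drug_b"),
    ("drug_c", "contraindicated_with", "drug_d"),
    ("drug_e", "reduces_absorption_of", "drug_f"),
    ("drug_g", "no_interaction", "drug_h"),
]
predicted_triplets = [
    ("drug_a", "increases_effect_of", "drug_b"),
    ("drug_c", "contraindicated_with", "drug_d"),
    ("drug_e", "reduces_absorption_of", "drug_f"),
    ("drug_g", "increases_effect_of", "drug_h"),  # wrong relation label
]
precision, recall, f1 = precision_recall_f1(predicted_triplets, gold_triplets)
print(precision, recall, f1)
```

Under exact matching, a triplet with the right entities but the wrong relation counts as both a false positive and a false negative, which is one reason relation extraction scores tend to sit well below entity recognition scores.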

From large language models to multimodal AI: A scoping review on the potential of generative AI in medicine

2502.09242v1 by Lukas Buess, Matthias Keicher, Nassir Navab, Andreas Maier, Soroosh Tayebi Arasteh

Generative artificial intelligence (AI) models, such as diffusion models and OpenAI's ChatGPT, are transforming medicine by enhancing diagnostic accuracy and automating clinical workflows. The field has advanced rapidly, evolving from text-only large language models for tasks such as clinical documentation and decision support to multimodal AI systems capable of integrating diverse data modalities, including imaging, text, and structured data, within a single model. The diverse landscape of these technologies, along with rising interest, highlights the need for a comprehensive review of their applications and potential. This scoping review explores the evolution of multimodal AI, highlighting its methods, applications, datasets, and evaluation in clinical settings. Adhering to PRISMA-ScR guidelines, we systematically queried PubMed, IEEE Xplore, and Web of Science, prioritizing recent studies published up to the end of 2024. After rigorous screening, 144 papers were included, revealing key trends and challenges in this dynamic field. Our findings underscore a shift from unimodal to multimodal approaches, driving innovations in diagnostic support, medical report generation, drug discovery, and conversational AI. However, critical challenges remain, including the integration of heterogeneous data types, improving model interpretability, addressing ethical concerns, and validating AI systems in real-world clinical settings. This review summarizes the current state of the art, identifies critical gaps, and provides insights to guide the development of scalable, trustworthy, and clinically impactful multimodal AI solutions in healthcare.

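Summary: Generative AI models, such as diffusion models and OpenAI's ChatGPT, are transforming medicine by improving diagnostic accuracy and automating clinical workflows. The field has moved from text-only large language models for clinical documentation and decision support to multimodal systems that integrate imaging, text, and structured data in a single model. This scoping review covers the methods, applications, datasets, and clinical evaluation of multimodal AI. Following PRISMA-ScR guidelines, the authors queried PubMed, IEEE Xplore, and Web of Science for studies published up to the end of 2024 and included 144 papers after screening. They find a shift from unimodal to multimodal approaches across diagnostic support, report generation, drug discovery, and conversational AI. Remaining challenges include integrating heterogeneous data, improving interpretability, addressing ethical concerns, and validating systems in real-world clinical settings.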

Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration

2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano

This paper presents a complete explainable system that interprets a set of data, abstracts the underlying features and describes them in a natural language of choice. The system relies on two crucial stages: (i) identifying emerging properties from data and transforming them into abstract concepts, and (ii) converting these concepts into natural language. Despite the impressive natural language generation capabilities demonstrated by Large Language Models, their statistical nature and the intricacy of their internal mechanism still force us to employ these techniques as black boxes, forgoing trustworthiness. An explainable pipeline for data interpretation would facilitate its use in safety-critical settings, such as processing medical information, and would allow non-experts and visually impaired people to access narrated information. To this end, we believe that the fields of knowledge representation and automated reasoning research could present a valid alternative. Expanding on prior research that tackled the first stage (i), we focus on the second stage, named Concept2Text. Being explainable, data translation is easily modeled through logic-based rules, once again emphasizing the role of declarative programming in achieving AI explainability. This paper explores a Prolog/CLP-based rewriting system to interpret concepts (articulated in terms of classes and relations, plus common knowledge) derived from a generic ontology, generating natural language text. Its main features include hierarchical tree rewritings, modular multilingual generation, support for equivalent variants across semantic, grammar, and lexical levels, and a transparent rule-based system. We outline the architecture and demonstrate its flexibility through some examples capable of generating numerous diverse and equivalent rewritings based on the input concept.

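Summary: The paper presents an explainable system that interprets a dataset, abstracts its underlying features, and describes them in a chosen natural language. It has two stages: (i) identifying emerging properties in the data and turning them into abstract concepts, and (ii) converting those concepts into natural language. Because Large Language Models must still be used as black boxes, the authors turn to knowledge representation and automated reasoning as an alternative suited to safety-critical settings such as medical information, and to making narrated information accessible to non-experts and visually impaired people. Building on earlier work on stage (i), this paper focuses on stage (ii), Concept2Text, implemented as a Prolog/CLP-based rewriting system over concepts derived from a generic ontology. Its features include hierarchical tree rewritings, modular multilingual generation, equivalent variants at the semantic, grammar, and lexical levels, and a transparent rule-based design.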

Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York

2502.09204v1 by Sanskar Sehgal, Yanhong A. Liu

Legal cases require careful logical reasoning following the laws, whereas interactions with non-technical users must be in natural language. As an application combining logical reasoning using Prolog and natural language processing using large language models (LLMs), this paper presents a novel approach and system, LogicLease, to automate the analysis of landlord-tenant legal cases in the state of New York. LogicLease determines compliance with relevant legal requirements by analyzing case descriptions and citing all relevant laws. It leverages LLMs for information extraction and Prolog for legal reasoning. By separating information extraction from legal reasoning, LogicLease achieves greater transparency and control over the legal logic applied to each case. We evaluate the accuracy, efficiency, and robustness of LogicLease through a series of tests, achieving 100% accuracy and an average processing time of 2.57 seconds. LogicLease presents advantages over state-of-the-art LLM-based legal analysis systems by providing clear, step-by-step reasoning, citing specific laws, and distinguishing itself by its ability to avoid hallucinations -- a common issue in LLMs.

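Summary: Legal cases demand careful logical reasoning that follows the law, while interaction with non-technical users must happen in natural language. LogicLease combines Prolog-based logical reasoning with LLM-based natural language processing to automate the analysis of landlord-tenant cases in the state of New York. It analyzes case descriptions, determines compliance with the relevant legal requirements, and cites all relevant laws, using LLMs for information extraction and Prolog for legal reasoning. Separating the two gives greater transparency and control over the legal logic applied to each case. In tests of accuracy, efficiency, and robustness, LogicLease achieves 100% accuracy with an average processing time of 2.57 seconds. Compared with state-of-the-art LLM-based legal analysis systems, it offers clear step-by-step reasoning, cites specific laws, and avoids hallucinations, a common issue in LLMs.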

Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia

2502.09173v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott

In remote healthcare monitoring, time series representation learning reveals critical patient behavior patterns from high-frequency data. This study analyzes home activity data from individuals living with dementia by proposing a two-stage, self-supervised learning approach tailored to uncover low-rank structures. The first stage converts time-series activities into text sequences encoded by a pre-trained language model, providing a rich, high-dimensional latent state space using a PageRank-based method. This PageRank vector captures latent state transitions, effectively compressing complex behaviour data into a succinct form that enhances interpretability. This low-rank representation not only enhances model interpretability but also facilitates clustering and transition analysis, revealing key behavioral patterns correlated with clinical metrics such as MMSE and ADAS-COG scores. Our findings demonstrate the framework's potential in supporting cognitive status prediction, personalized care interventions, and large-scale health monitoring.

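Summary: In remote healthcare monitoring, time series representation learning can reveal critical patient behavior patterns in high-frequency data. The study analyzes home activity data from people living with dementia using a two-stage, self-supervised approach designed to uncover low-rank structure. The first stage converts time-series activities into text sequences encoded by a pre-trained language model, yielding a rich, high-dimensional latent state space that is then summarized with a PageRank-based method. The PageRank vector captures latent state transitions and compresses complex behaviour data into a succinct, interpretable form. This low-rank representation supports clustering and transition analysis, revealing behavioral patterns that correlate with clinical metrics such as MMSE and ADAS-COG scores, and shows potential for cognitive status prediction, personalized care interventions, and large-scale health monitoring.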
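
The PageRank step can be pictured with a simplified sketch: treat each observed activity as a discrete state, count transitions between consecutive states, and run power iteration on the resulting row-stochastic matrix. In the paper the states come from language-model embeddings, so the discrete room labels, the damping factor, and the function name `pagerank_from_sequence` are all simplifying assumptions made here for illustration.

```python
def pagerank_from_sequence(states, damping=0.85, iterations=100):
    """PageRank vector over states, built from consecutive transitions."""
    labels = sorted(set(states))
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)

    # Count transitions between consecutive observations.
    counts = [[0.0] * n for _ in range(n)]
    for source, target in zip(states, states[1:]):
        counts[index[source]][index[target]] += 1.0

    # Row-normalize; states with no outgoing transitions jump uniformly.
    transition = []
    for row in counts:
        total = sum(row)
        transition.append([c / total for c in row] if total else [1.0 / n] * n)

    rank = [1.0 / n] * n
    for _ in range(iterations):
        updated = [(1.0 - damping) / n] * n
        for i in range(n):
            for j in range(n):
                updated[j] += damping * rank[i] * transition[i][j]
        rank = updated
    return dict(zip(labels, rank))


# Toy sequence of room-level activity events for one day.
events = ["bed", "kitchen", "lounge", "kitchen",
          "bed", "kitchen", "lounge", "kitchen"]
ranks = pagerank_from_sequence(events)
print({room: round(score, 3) for room, score in ranks.items()})
```

The resulting vector has one entry per state regardless of how long the recording is, which is what makes it a compact summary that can be compared across days or participants.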

TastepepAI, An artificial intelligence platform for taste peptide de novo design

2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang

Taste peptides have emerged as promising natural flavoring agents owing to their unique organoleptic properties, high safety profile, and potential health benefits. However, the de novo identification of taste peptides derived from animal, plant, or microbial sources remains a time-consuming and resource-intensive process, significantly impeding their widespread application in the food industry. Here, we present TastePepAI, a comprehensive artificial intelligence framework for customized taste peptide design and safety assessment. As the key element of this framework, a loss-supervised adaptive variational autoencoder (LA-VAE) is implemented to efficiently optimize the latent representation of sequences during training and to facilitate the generation of target peptides with desired taste profiles. Notably, our model incorporates a novel taste-avoidance mechanism, allowing for selective flavor exclusion. Subsequently, our in-house toxicity prediction algorithm (SpepToxPred) is integrated into the framework to perform rigorous safety evaluation of the generated peptides. Using this integrated platform, we successfully identified 73 peptides exhibiting sweet, salty, and umami tastes, significantly expanding the current repertoire of taste peptides. This work demonstrates the potential of TastePepAI in accelerating taste peptide discovery for food applications and provides a versatile framework adaptable to broader peptide engineering challenges.

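Summary: Taste peptides are promising natural flavoring agents thanks to their sensory properties, safety profile, and potential health benefits, but identifying them de novo from animal, plant, or microbial sources is slow and resource-intensive. TastePepAI is an AI framework for customized taste peptide design and safety assessment. Its core is a loss-supervised adaptive variational autoencoder (LA-VAE) that optimizes the latent representation of sequences during training and generates peptides with desired taste profiles, including a taste-avoidance mechanism for selectively excluding flavors. The in-house toxicity predictor SpepToxPred is integrated to evaluate the safety of generated peptides. Using the platform, the authors identified 73 peptides with sweet, salty, and umami tastes, and they present the framework as adaptable to broader peptide engineering problems.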

HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification

2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan

Precise segmentation and classification of cell instances are vital for analyzing the tissue microenvironment in histology images, supporting medical diagnosis, prognosis, treatment planning, and studies of brain cytoarchitecture. However, the creation of high-quality annotated datasets for training remains a major challenge. This study introduces a novel single-stage approach (HistoSmith) for generating image-label pairs to augment histology datasets. Unlike state-of-the-art methods that utilize diffusion models with separate components for label and image generation, our approach employs a latent diffusion model to learn the joint distribution of cellular layouts, classification masks, and histology images. This model enables tailored data generation by conditioning on user-defined parameters such as cell types, quantities, and tissue types. Trained on the Conic H&E histopathology dataset and the Nissl-stained CytoDArk0 dataset, the model generates realistic and diverse labeled samples. Experimental results demonstrate improvements in cell instance segmentation and classification, particularly for underrepresented cell types like neutrophils in the Conic dataset. These findings underscore the potential of our approach to address data scarcity challenges.

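Summary: Precise segmentation and classification of cell instances are vital for analyzing the tissue microenvironment in histology images, supporting diagnosis, prognosis, treatment planning, and studies of brain cytoarchitecture, but building high-quality annotated training datasets remains a major challenge. HistoSmith is a single-stage approach for generating image-label pairs to augment histology datasets. Unlike state-of-the-art methods that use diffusion models with separate components for label and image generation, it uses a latent diffusion model to learn the joint distribution of cellular layouts, classification masks, and histology images, conditioned on user-defined parameters such as cell types, quantities, and tissue types. Trained on the Conic H&E histopathology dataset and the Nissl-stained CytoDArk0 dataset, it generates realistic and diverse labeled samples. Experiments show improved cell instance segmentation and classification, particularly for underrepresented cell types such as neutrophils in the Conic dataset.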

Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion

2502.08560v1 by Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì

The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: https://github.com/LemuelPuglisi/BrLP.


Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data

2502.08547v1 by Doudou Zhou, Han Tong, Linshanshan Wang, Suqi Liu, Xin Xiong, Ziming Gan, Romain Griffier, Boris Hejblum, Yun-Chung Liu, Chuan Hong, Clara-Lea Bonzel, Tianrun Cai, Kevin Pan, Yuk-Lam Ho, Lauren Costa, Vidul A. Panickan, J. Michael Gaziano, Kenneth Mandl, Vianney Jouhet, Rodolphe Thiebaut, Zongqi Xia, Kelly Cho, Katherine Liao, Tianxi Cai

The adoption of EHRs has expanded opportunities to leverage data-driven algorithms in clinical care and research. A major bottleneck in effectively conducting multi-institutional EHR studies is the data heterogeneity across systems, with numerous codes that either do not exist or represent different clinical concepts across institutions. The need for data privacy further limits the feasibility of including the multi-institutional patient-level data required to study similarities and differences across patient subgroups. To address these challenges, we developed the GAME algorithm. Tested and validated across 7 institutions and 2 languages, GAME integrates data at several levels: (1) at the institutional level, with knowledge graphs to establish relationships between codes and existing knowledge sources, providing the medical context for standard codes and their relationship to each other; (2) between institutions, leveraging language models to determine the relationships between institution-specific codes and established standard codes; and (3) by quantifying the strength of the relationships between codes using a graph attention network. Jointly trained embeddings are created using transfer and federated learning to preserve data privacy. In this study, we demonstrate the applicability of GAME in selecting relevant features as inputs for AI-driven algorithms in a range of conditions, e.g., heart failure and rheumatoid arthritis. We then highlight the application of GAME-harmonized multi-institutional EHR data in a study of Alzheimer's disease outcomes and suicide risk among patients with mental health disorders, without sharing patient-level data outside individual institutions.


EEG Artifact Detection and Correction with Deep Autoencoders

2502.08686v1 by David Aquilué-Llorens, Aureli Soria-Frisch

EEG signals convey important information about brain activity in both healthy and pathological conditions. However, they are inherently noisy, which poses significant challenges for accurate analysis and interpretation. Traditional EEG artifact removal methods, while effective, often require extensive expert intervention. This study presents LSTEEG, a novel LSTM-based autoencoder designed for the detection and correction of artifacts in EEG signals. Leveraging deep learning, particularly LSTM layers, LSTEEG captures non-linear dependencies in sequential EEG data. LSTEEG demonstrates superior performance in both artifact detection and correction tasks compared to other state-of-the-art convolutional autoencoders. Our methodology enhances the interpretability and utility of the autoencoder's latent space, enabling data-driven automated artifact removal in EEG and its application in downstream tasks. This research advances the field of efficient and accurate multi-channel EEG preprocessing, and promotes the implementation and usage of automated EEG analysis pipelines for brain health applications.


SycEval: Evaluating LLM Sycophancy

2502.08177v1 by Aaron Fanous, Jacob Goldberg, Ank A. Agarwal, Joanna Lin, Anson Zhou, Roxana Daneshjou, Sanmi Koyejo

Large language models (LLMs) are increasingly applied in educational, clinical, and professional settings, but their tendency for sycophancy -- prioritizing user agreement over independent reasoning -- poses risks to reliability. This study introduces a framework to evaluate sycophantic behavior in ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro across AMPS (mathematics) and MedQuad (medical advice) datasets. Sycophantic behavior was observed in 58.19% of cases, with Gemini exhibiting the highest rate (62.47%) and ChatGPT the lowest (56.71%). Progressive sycophancy, leading to correct answers, occurred in 43.52% of cases, while regressive sycophancy, leading to incorrect answers, was observed in 14.66%. Preemptive rebuttals demonstrated significantly higher sycophancy rates than in-context rebuttals (61.75% vs. 56.52%, $Z=5.87$, $p<0.001$), particularly in computational tasks, where regressive sycophancy increased significantly (preemptive: 8.13%, in-context: 3.54%, $p<0.001$). Simple rebuttals maximized progressive sycophancy ($Z=6.59$, $p<0.001$), while citation-based rebuttals exhibited the highest regressive rates ($Z=6.59$, $p<0.001$). Sycophantic behavior showed high persistence (78.5%, 95% CI: [77.2%, 79.8%]) regardless of context or model. These findings emphasize the risks and opportunities of deploying LLMs in structured and dynamic domains, offering insights into prompt programming and model optimization for safer AI applications.

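The SycEval abstract reports Z statistics and p-values for differences between sycophancy rates but does not state the underlying sample sizes or the exact test. As a rough illustration only, the sketch below shows a pooled two-proportion z-test, one common way to compare two rates. The function names and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z statistic for the null hypothesis rate_a == rate_b."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (rate_a - rate_b) / std_err


def two_sided_p_value(z):
    """Two-sided p-value of a z statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts (NOT from the paper): sycophantic responses out of
# 1000 preemptive and 1000 in-context rebuttals.
z = two_proportion_z(618, 1000, 565, 1000)
print(f"z = {z:.2f}, p = {two_sided_p_value(z):.4f}")
```

With the paper's actual counts, the same calculation could be compared against the reported values, assuming the authors used a test of this form.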

Cancer Vaccine Adjuvant Name Recognition from Biomedical Literature using Large Language Models

2502.09659v1 by Hasin Rehana, Jie Zheng, Leo Yeh, Benu Bansal, Nur Bengisu Çam, Christianah Jemiyo, Brett McGregor, Arzucan Özgür, Yongqun He, Junguk Hur

Motivation: An adjuvant is a chemical incorporated into vaccines that enhances their efficacy by improving the immune response. Identifying adjuvant names from cancer vaccine studies is essential for furthering research and enhancing immunotherapies. However, manual curation from the constantly expanding biomedical literature poses significant challenges. This study explores the automated recognition of vaccine adjuvant names using Large Language Models (LLMs), specifically Generative Pretrained Transformers (GPT) and Large Language Model Meta AI (Llama). Methods: We utilized two datasets: 97 clinical trial records from AdjuvareDB and 290 abstracts annotated with the Vaccine Adjuvant Compendium (VAC). GPT-4o and Llama 3.2 were employed in zero-shot and few-shot learning paradigms with up to four examples per prompt. Prompts explicitly targeted adjuvant names, testing the impact of contextual information such as substances or interventions. Outputs underwent automated and manual validation for accuracy and consistency. Results: GPT-4o attained 100% Precision across all situations while exhibiting notable improvement in Recall and F1-scores, particularly when interventions were incorporated. On the VAC dataset, GPT-4o achieved a maximum F1-score of 77.32% with interventions, surpassing Llama-3.2-3B by approximately 2%. On the AdjuvareDB dataset, GPT-4o reached an F1-score of 81.67% for three-shot prompting with interventions, surpassing Llama-3.2-3B's maximum F1-score of 65.62%. Conclusion: Our findings demonstrate that LLMs excel at identifying adjuvant names, including rare variations of naming representation. This study emphasizes the capability of LLMs to enhance cancer vaccine development by efficiently extracting insights. Future work aims to broaden the framework to encompass various biomedical literature and enhance model generalizability across various vaccines and adjuvants.


Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

2502.07963v1 by Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace

Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.


An Advanced NLP Framework for Automated Medical Diagnosis with DeBERTa and Dynamic Contextual Positional Gating

2502.07755v1 by Mohammad Ali Labbaf Khaniki, Sahabeh Saadati, Mohammad Manthouri

This paper presents a novel Natural Language Processing (NLP) framework for enhancing medical diagnosis through the integration of advanced techniques in data augmentation, feature extraction, and classification. The proposed approach employs back-translation to generate diverse paraphrased datasets, improving robustness and mitigating overfitting in classification tasks. Leveraging Decoding-enhanced BERT with Disentangled Attention (DeBERTa) with Dynamic Contextual Positional Gating (DCPG), the model captures fine-grained contextual and positional relationships, dynamically adjusting the influence of positional information based on semantic context to produce high-quality text embeddings. For classification, an Attention-Based Feedforward Neural Network (ABFNN) is utilized, effectively focusing on the most relevant features to improve decision-making accuracy. Applied to the classification of symptoms, clinical notes, and other medical texts, this architecture demonstrates its ability to address the complexities of medical data. The combination of data augmentation, contextual embedding generation, and advanced classification mechanisms offers a robust and accurate diagnostic tool, with potential applications in automated medical diagnosis and clinical decision support. This method demonstrates the effectiveness of the proposed NLP framework for medical diagnosis, achieving remarkable results with an accuracy of 99.78%, recall of 99.72%, precision of 99.79%, and an F1-score of 99.75%. These metrics not only underscore the model's robust performance in classifying medical texts with exceptional precision and reliability but also highlight its superiority over existing methods, making it a highly promising tool for automated diagnostic systems.

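The F1-score in the abstract above can be cross-checked against the stated precision and recall, because F1 is their harmonic mean. The short sketch below is only an arithmetic consistency check on the rounded percentages quoted in the abstract. It is not the authors' evaluation code, and the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    return 2 * precision * recall / (precision + recall)


# Rounded percentages as quoted in the abstract.
reported_precision = 99.79
reported_recall = 99.72

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 recomputed from reported precision and recall: {f1:.3f}")
```

The recomputed value agrees with the reported 99.75% to within rounding, so the three figures are at least mutually consistent.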

Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension

2502.07752v2 by Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds

Designing efficient optimizers for large language models (LLMs) with low memory requirements and fast convergence is an important and challenging problem. This paper takes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations for practical efficient optimizers for LLMs: carefully selecting structural assumptions to balance generality and efficiency, and enhancing the memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam with little memory overhead. Notably, Alice achieves better than 2x faster convergence over Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.


The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation

2502.07516v2 by Raman Dutt

Generative models, particularly text-to-image (T2I) diffusion models, play a crucial role in medical image analysis. However, these models are prone to training data memorization, posing significant risks to patient privacy. Synthetic chest X-ray generation is one of the most common applications in medical image analysis, with the MIMIC-CXR dataset serving as the primary data repository for this task. This study presents the first systematic attempt to identify prompts and text tokens in MIMIC-CXR that contribute the most to training data memorization. Our analysis reveals two unexpected findings: (1) prompts containing traces of de-identification procedures (markers introduced to hide Protected Health Information) are the most memorized, and (2) among all tokens, de-identification markers contribute the most towards memorization. This highlights a broader issue with the standard anonymization practices and T2I synthesis with MIMIC-CXR. To make matters worse, existing inference-time memorization mitigation strategies are ineffective and fail to sufficiently reduce the model's reliance on memorized text tokens. On this front, we propose actionable strategies for different stakeholders to enhance privacy and improve the reliability of generative models in medical imaging. Finally, our results provide a foundation for future work on developing and benchmarking memorization mitigation techniques for synthetic chest X-ray generation using the MIMIC-CXR dataset. The anonymized code is available at https://anonymous.4open.science/r/diffusion_memorization-8011/


KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level

2502.07288v1 by Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo

Chronic kidney disease (CKD) is a major global health issue, affecting over 10% of the population and causing significant mortality. While kidney biopsy remains the gold standard for CKD diagnosis and treatment, the lack of comprehensive benchmarks for kidney pathology segmentation hinders progress in the field. To address this, we organized the Kidney Pathology Image Segmentation (KPIs) Challenge, introducing a dataset that incorporates preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes two tasks, patch-level segmentation and whole slide image segmentation and detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. By encouraging innovative segmentation methods that adapt to diverse CKD models and tissue conditions, the KPIs Challenge aims to advance kidney pathology analysis, establish new benchmarks, and enable precise, large-scale quantification for disease research and diagnosis.

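The two metrics named in the KPIs Challenge abstract have standard definitions, sketched below for flat binary masks and for detection counts. This is a minimal illustration, not the challenge's official scoring code. In particular, how predicted glomeruli are matched to ground-truth glomeruli to obtain true and false positives is not described in the abstract, and the handling of empty masks is an assumption made here.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient between two flat binary (0/1) masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def detection_f1(true_positives, false_positives, false_negatives):
    """F1-score computed from detection counts."""
    denominator = 2 * true_positives + false_positives + false_negatives
    return 0.0 if denominator == 0 else 2 * true_positives / denominator


# Toy 4-pixel masks: one overlapping pixel, 2 predicted and 1 true foreground pixel.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
# Toy detection counts: 8 matched glomeruli, 2 spurious, 2 missed.
print(detection_f1(8, 2, 2))
```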

Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer

2502.07158v2 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu

Early prediction of pediatric cardiac arrest (CA) is critical for timely intervention in high-risk intensive care settings. We introduce PedCA-FT, a novel transformer-based framework that fuses the tabular view of EHR data with a derived textual view of the same data to fully exploit the interactions of high-dimensional risk factors and their dynamics. By employing dedicated transformer modules for each modality view, PedCA-FT captures complex temporal and contextual patterns to produce robust CA risk estimates. Evaluated on a curated pediatric cohort from the CHOA-CICU database, our approach outperforms ten other artificial intelligence models across five key performance metrics and identifies clinically meaningful risk factors. These findings underscore the potential of multimodal fusion techniques to enhance early CA detection and improve patient care.


Explaining 3D Computed Tomography Classifiers with Counterfactuals

2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari

Counterfactual explanations in medical imaging are critical for understanding the predictions made by deep learning models. We extend the Latent Shift counterfactual generation method from 2D applications to 3D computed tomography (CT) scans. We address the challenges associated with 3D data, such as limited training samples and high memory demands, by implementing a slice-based approach. This method leverages a 2D encoder trained on CT slices, which are subsequently combined to maintain 3D context. We demonstrate this technique on two models for clinical phenotype prediction and lung segmentation. Our approach is both memory-efficient and effective for generating interpretable counterfactuals in high-resolution 3D medical imaging.


Interactive Data Harmonization with LLM Agents

2502.07132v1 by Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire

Data harmonization is an essential task that entails integrating datasets from diverse sources. Despite years of research in this area, it remains a time-consuming and challenging task due to schema mismatches, varying terminologies, and differences in data collection methodologies. This paper presents the case for agentic data harmonization as a means to both empower experts to harmonize their data and to streamline the process. We introduce Harmonia, a system that combines LLM-based reasoning, an interactive user interface, and a library of data harmonization primitives to automate the synthesis of data harmonization pipelines. We demonstrate Harmonia in a clinical data harmonization scenario, where it helps to interactively create reusable pipelines that map datasets to a standard format. Finally, we discuss challenges and open problems, and suggest research directions for advancing our vision.


Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML

2502.07026v1 by Mohammad Amir Salari, Bahareh Rahmani

Machine learning (ML) is transforming healthcare by enabling predictive analytics, personalized treatments, and improved patient outcomes. However, traditional ML workflows require specialized skills, infrastructure, and resources, limiting accessibility for many healthcare professionals. This paper explores how Google Cloud's BigQuery ML simplifies the development and deployment of ML models using SQL, reducing technical barriers. Through a case study on diabetes prediction using the Diabetes Health Indicators Dataset, we evaluate three predictive models: Logistic Regression, Boosted Tree, and Deep Neural Network (DNN). Our results demonstrate that the Boosted Tree model achieves the highest performance, making it highly effective for diabetes prediction. This study highlights BigQuery ML's role in democratizing machine learning by providing a scalable, efficient, and accessible solution for healthcare analytics.

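To make the "ML with SQL" idea concrete, the sketch below assembles BigQuery ML `CREATE MODEL` statements for the three model families named in the abstract, followed by an `ML.EVALUATE` query. The model types (`LOGISTIC_REG`, `BOOSTED_TREE_CLASSIFIER`, `DNN_CLASSIFIER`) and the `input_label_cols` option are part of BigQuery ML. The dataset, table, model, and label-column names are hypothetical placeholders, not taken from the paper, and the statements are only built as strings here rather than executed.

```python
# Hypothetical identifiers, chosen purely for illustration.
SOURCE_TABLE = "health_demo.diabetes_indicators"
LABEL_COLUMN = "diabetes_label"

# BigQuery ML model types for the three model families in the abstract.
MODEL_TYPES = {
    "logistic_regression": "LOGISTIC_REG",
    "boosted_tree": "BOOSTED_TREE_CLASSIFIER",
    "dnn": "DNN_CLASSIFIER",
}


def create_model_sql(model_name, model_type, label_column, source_table):
    """Build a CREATE MODEL statement that trains on every column of source_table."""
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"OPTIONS(model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def evaluate_model_sql(model_name):
    """Build a query that returns evaluation metrics for a trained model."""
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model_name}`)"


statements = {
    key: create_model_sql(f"health_demo.{key}_model", model_type, LABEL_COLUMN, SOURCE_TABLE)
    for key, model_type in MODEL_TYPES.items()
}

print(statements["boosted_tree"])
print(evaluate_model_sql("health_demo.boosted_tree_model"))
```

The resulting strings could be pasted into the BigQuery console or submitted through a client library. Hyperparameters and train/evaluation splits are left at their defaults in this sketch.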

AIMS.au: A Dataset for the Analysis of Modern Slavery Countermeasures in Corporate Statements

2502.07022v1 by Adriana Eufrosiana Bora, Pierre-Luc St-Charles, Mirko Bronzi, Arsène Fansi Tchango, Bruno Rousseau, Kerrie Mengersen

Despite over a decade of legislative efforts to address modern slavery in the supply chains of large corporations, the effectiveness of government oversight remains hampered by the challenge of scrutinizing thousands of statements annually. While Large Language Models (LLMs) can be considered a well-established solution for the automatic analysis and summarization of documents, recognizing concrete modern slavery countermeasures taken by companies and differentiating those from vague claims remains a challenging task. To help evaluate and fine-tune LLMs for the assessment of corporate statements, we introduce a dataset composed of 5,731 modern slavery statements taken from the Australian Modern Slavery Register and annotated at the sentence level. This paper details the construction steps for the dataset, which include the careful design of annotation specifications, the selection and preprocessing of statements, and the creation of high-quality annotation subsets for effective model evaluations. To demonstrate our dataset's utility, we propose a machine learning methodology for the detection of sentences relevant to mandatory reporting requirements set by the Australian Modern Slavery Act. We then follow this methodology to benchmark modern language models under zero-shot and supervised learning settings.


Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium

2502.06693v1 by Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour

The fourth Machine Learning for Health (ML4H) symposium was held in person on December 15th and 16th, 2024, in the traditional, ancestral, and unceded territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, British Columbia, Canada. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community. The organization of the research roundtables at the conference involved 13 senior and 27 junior chairs across 13 tables. Each roundtable session included an invited senior chair (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with an interest in the session's topic.


Automatic Evaluation of Healthcare LLMs Beyond Question-Answering

2502.06666v1 by Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, Dario Garcia-Gasulla

Current Large Language Model (LLM) benchmarks are often based on open-ended or close-ended QA evaluations, avoiding the requirement of human labor. Close-ended measurements evaluate the factuality of responses but lack expressiveness. Open-ended measurements capture the model's capacity to produce discourse responses but are harder to assess for correctness. These two approaches are commonly used, either independently or together, though their relationship remains poorly understood. This work is focused on the healthcare domain, where both factuality and discourse matter greatly. It introduces a comprehensive, multi-axis suite for healthcare LLM evaluation, exploring correlations between open and close benchmarks and metrics. Findings include blind spots and overlaps in current methodologies. As an updated sanity check, we release a new medical benchmark, CareQA, with both open and closed variants. Finally, we propose a novel metric for open-ended evaluations, Relaxed Perplexity, to mitigate the identified limitations.


Few-Shot Classification and Anatomical Localization of Tissues in SPECT Imaging

2502.06632v1 by Mohammed Abdul Hafeez Khan, Samuel Morries Boddepalli, Siddhartha Bhattacharyya, Debasis Mitra

Accurate classification and anatomical localization are essential for effective medical diagnostics and research, and may be performed efficiently using deep learning techniques. However, the limited availability of labeled data poses a significant challenge. To address this, we adapted Prototypical Networks and the Propagation-Reconstruction Network (PRNet) for few-shot classification and localization, respectively, in Single Photon Emission Computed Tomography (SPECT) images. For the proof of concept, we used a 2D-sliced image cropped around the heart. The Prototypical Network, with a pre-trained ResNet-18 backbone, classified ventricles, myocardium, and liver tissues with 96.67% training and 93.33% validation accuracy. PRNet, adapted for 2D imaging with an encoder-decoder architecture and skip connections, achieved a training loss of 1.395, accurately reconstructing patches and capturing spatial relationships. These results highlight the potential of Prototypical Networks for tissue classification with limited labeled data and PRNet for anatomical landmark localization, paving the way for improved performance in deep learning frameworks.

摘要:精確的分類和解剖定位對於有效的醫療診斷和研究至關重要,而這可以使用深度學習技術有效執行。然而,標記資料有限的取得會造成重大的挑戰。為了解決這個問題,我們分別調整了原型網路和傳播重建網路 (PRNet),用於單光子發射電腦斷層掃描 (SPECT) 影像中的少量分類和定位。為了證明這個概念,我們使用圍繞心臟裁切的 2D 切片影像。原型網路,使用預先訓練的 ResNet-18 主幹,對心室、心肌和肝臟組織進行分類,訓練準確度為 96.67%,驗證準確度為 93.33%。PRNet,調整為使用編碼器解碼器架構和跳躍連接的 2D 影像,達到了 1.395 的訓練損失,精確地重建了區塊並擷取了空間關係。這些結果突出了原型網路在標記資料有限的情況下進行組織分類的潛力,以及 PRNet 在解剖標誌定位方面的潛力,為深度學習架構中效能的提升鋪平了道路。
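
The prototype-based classification step described in this abstract can be illustrated with a minimal sketch. It assumes embeddings have already been produced by some backbone (the abstract mentions ResNet-18); the function names `class_prototypes` and `classify`, the toy 2-D vectors, and the label coding are hypothetical and are not taken from the paper.

```python
import numpy as np

def class_prototypes(support_emb, support_labels):
    """Return the sorted class labels and the mean embedding (prototype) per class."""
    classes = np.unique(support_labels)
    protos = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(query_emb, classes, protos):
    """Assign each query to the class of its nearest prototype (squared Euclidean)."""
    # dists[i, k] = squared distance between query i and prototype k
    dists = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy 2-D vectors standing in for backbone features (hypothetical values).
support = np.array([[0.0, 0.0], [0.2, 0.0],
                    [5.0, 5.0], [5.2, 5.0],
                    [0.0, 5.0], [0.0, 5.2]])
labels = np.array([0, 0, 1, 1, 2, 2])  # assumed coding: 0=ventricle, 1=myocardium, 2=liver
classes, protos = class_prototypes(support, labels)

queries = np.array([[0.1, 0.1], [4.9, 5.1], [0.1, 4.8]])
predicted = classify(queries, classes, protos)
```

With only a handful of labeled examples per class, the prototype is just a class mean, which is why this family of methods suits few-shot settings.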

Illegal Waste Detection in Remote Sensing Images: A Case Study

2502.06607v2 by Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori

Environmental crime currently represents the third largest criminal activity worldwide while threatening ecosystems as well as human health. Among the crimes related to this activity, improper waste management can nowadays be countered more easily thanks to the increasing availability and decreasing cost of Very-High-Resolution Remote Sensing images, which enable semi-automatic territory scanning in search of illegal landfills. This paper proposes a pipeline, developed in collaboration with professionals from a local environmental agency, for detecting candidate illegal dumping sites leveraging a classifier of Remote Sensing images. To identify the best configuration for such a classifier, an extensive set of experiments was conducted and the impact of diverse image characteristics and training settings was thoroughly analyzed. The local environmental agency was then involved in an experimental exercise where outputs from the developed classifier were integrated into the experts' everyday work, resulting in time savings with respect to manual photo-interpretation. The classifier was eventually run with valuable results on a location outside of the training area, highlighting potential for cross-border applicability of the proposed pipeline.

摘要:環境犯罪目前是全球第三大犯罪活動,威脅生態系統和人類健康。在與此活動相關的犯罪中,不當廢物管理現在可以更容易地得到解決,這要歸功於超高解析度遙測影像越來越普及且成本下降,這使得半自動領土掃描能夠搜尋非法垃圾掩埋場。本文提出了一條管道,與當地環境機構的專業人士合作開發,用於檢測候選非法傾倒地點,利用遙測影像分類器。為了找出這種分類器的最佳配置,進行了一系列廣泛的實驗,並徹底分析了不同影像特徵和訓練設定的影響。然後,當地環境機構參與了一項實驗練習,其中將已開發分類器的輸出整合到專家的日常工作中,從而節省了人工照片解譯的時間。最後在訓練區域外的某個位置執行分類器,獲得了有價值的結果,突出了所提出管道的跨境適用性潛力。

FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model

2502.06438v1 by Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li

Accurate and efficient electroencephalography (EEG) analysis is essential for detecting seizures and artifacts in long-term monitoring, with applications spanning hospital diagnostics to wearable health devices. Robust EEG analytics have the potential to greatly improve patient care. However, traditional deep learning models, especially Transformer-based architectures, are hindered by their quadratic time and memory complexity, making them less suitable for resource-constrained environments. To address these challenges, we present FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel self-supervised framework that establishes new efficiency benchmarks for EEG analysis through bidirectional state-space modeling. Unlike Transformer-based models, FEMBA scales linearly with sequence length, enabling more scalable and efficient processing of extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and fine-tuned on three downstream tasks, FEMBA achieves competitive performance in comparison with Transformer models, at significantly lower computational cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates viability for resource-constrained devices. These results pave the way for scalable, general-purpose EEG analytics in clinical settings and highlight FEMBA as a promising candidate for wearable applications.

摘要:準確且有效的腦電圖 (EEG) 分析對於偵測長時間監控中的癲癇發作和偽像至關重要,其應用範圍涵蓋醫院診斷到可穿戴式健康裝置。穩健的 EEG 分析具有大幅改善病患照護的潛力。然而,傳統深度學習模型,特別是基於 Transformer 的架構,受到其二次時間和記憶體複雜度的阻礙,使其不太適合資源受限的環境。為了應對這些挑戰,我們提出 FEMBA (基礎 EEG Mamba + 雙向架構),一種創新的自我監督架構,透過雙向狀態空間建模為 EEG 分析建立新的效率基準。與會產生二次時間和記憶體複雜度的基於 Transformer 的模型不同,FEMBA 隨著序列長度線性縮放,支援更具可擴充性和效率的延伸 EEG 記錄處理。FEMBA 在超過 21,000 小時的未標記 EEG 上訓練並在三個下游任務上進行微調,與Transformer模型相比,在計算成本顯著降低的情況下,實現了具有競爭力的效能。具體來說,它在 TUAB 上達到 81.82% 的平衡準確度 (0.8921 AUROC) 和在 TUAR 上達到 0.949 AUROC,而一個微小的 7.8M 參數變體證明了其在資源受限裝置上的可行性。這些結果為臨床和可穿戴應用中可擴充的通用 EEG 分析鋪平了道路,並突顯 FEMBA 是可穿戴應用中一個有前景的候選者。
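
To make the linear-scaling claim concrete, here is a minimal sketch of a bidirectional linear recurrence. It is not FEMBA's architecture or Mamba's selective scan: the fixed scalar parameters `a` and `b`, and the choice to sum the two directions, are simplifying assumptions made only to show that one forward and one backward pass cost time proportional to the sequence length.

```python
import numpy as np

def linear_scan(x, a, b):
    """Run h_t = a * h_{t-1} + b * x_t over the time axis; cost is O(length)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_scan(x, a=0.9, b=0.1):
    """Combine a left-to-right and a right-to-left scan by summing them."""
    fwd = linear_scan(x, a, b)
    # Reverse time, scan, then reverse back so outputs align with the input.
    bwd = linear_scan(x[::-1], a, b)[::-1]
    return fwd + bwd

# Toy input: 1000 time steps, 4 channels (hypothetical sizes).
signal = np.random.default_rng(0).standard_normal((1000, 4))
features = bidirectional_scan(signal)
```

Each time step is visited twice regardless of sequence length, in contrast to self-attention, where every step is compared with every other step.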

Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?

2502.06289v1 by Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, AnRan Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Yim-Lui Cheung, Pearse A. Keane, Yih Chung Tham

The advent of foundation models (FMs) is transforming the medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domains. However, its applicability to clinical tasks remains underexplored. To address this, we conducted head-to-head evaluations by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular disease detection and systemic disease prediction tasks, across eight standardized open-source ocular datasets, as well as the Moorfields AlzEye and the UK Biobank datasets. The DINOv2-large model outperformed RETFound in detecting diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs 0.846, P<0.001). In glaucoma, the DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 models in predicting heart failure, myocardial infarction, and ischaemic stroke (AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even with 10% of the fine-tuning data. These findings showcase the distinct scenarios where general-purpose and domain-specific FMs excel, highlighting the importance of aligning FM selection with task-specific requirements to optimise clinical performance.

摘要:基礎模型 (FM) 的出現正在轉變醫療領域。在眼科,RETFound 是一個視網膜專用 FM,依序使用 140 萬張自然影像和 160 萬張視網膜影像進行預訓練,已展現出高度適應性,可應用於各種臨床應用。相反地,DINOv2 是一個通用視覺 FM,使用 1.42 億張自然影像進行預訓練,已展現出在非醫療領域的潛力。然而,其在臨床任務中的適用性仍未被充分探索。為了解決這個問題,我們針對眼部疾病偵測和全身性疾病預測任務,對 RETFound 和三個 DINOv2 模型(大型、基礎、小型)進行微調,並進行一對一的評估,使用八個標準化的開源眼科資料集,以及 Moorfields AlzEye 和 UK Biobank 資料集。DINOv2 大型模型在糖尿病視網膜病變偵測方面優於 RETFound(三個資料集的 AUROC=0.850-0.952,相較於 0.823-0.944,所有 P<=0.007)和多類眼部疾病(AUROC=0.892,相較於 0.846,P<0.001)。在青光眼方面,DINOv2 基礎模型優於 RETFound(AUROC=0.958,相較於 0.940,P<0.001)。相反地,RETFound 在預測心臟衰竭、心肌梗塞和缺血性中風方面優於所有 DINOv2 模型(AUROC=0.732-0.796,相較於 0.663-0.771,所有 P<0.001)。即使使用 10% 的微調資料,這些趨勢仍然持續。這些發現展示了通用和領域專用 FM 各自擅長的場景,突顯了根據任務特定需求調整 FM 選擇,以最佳化臨床表現的重要性。
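
Since the comparison above is reported in AUROC, a short sketch of how that metric can be computed may help. It uses the pairwise-ranking definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as half); the function name `auroc` and the toy labels and scores are illustrative and are not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    # A pair counts 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two negatives and two positives (hypothetical scores).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
```

This quadratic pairwise form is only meant for reading; for large datasets a rank-based implementation is the usual choice.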

Integrating Sequence and Image Modeling in Irregular Medical Time Series Through Self-Supervised Learning

2502.06134v1 by Liuqing Chen, Shuhong Xiao, Shixian Ding, Shanhai Hu, Lingyun Sun

Medical time series are often irregular and suffer from significant missingness, posing challenges for data analysis and clinical decision-making. Existing methods typically adopt a single modeling perspective, either treating series data as sequences or transforming them into image representations for further classification. In this paper, we propose a joint learning framework that incorporates both sequence and image representations. We also design three self-supervised learning strategies to facilitate the fusion of sequence and image representations, capturing a more generalizable joint representation. The results indicate that our approach outperforms seven other state-of-the-art models on three representative real-world clinical datasets. We further validate our approach by simulating two major types of real-world missingness through leave-sensors-out and leave-samples-out techniques. The results demonstrate that our approach is more robust and significantly surpasses other baselines in terms of classification performance.

摘要:醫療時間序列通常不規則且會面臨顯著的缺失,對資料分析和臨床決策制定構成挑戰。現有方法通常採用單一建模觀點,將序列資料視為序列或將其轉換為影像表示以進行進一步分類。在本文中,我們提出了一個聯合學習架構,結合序列和影像表示。我們還設計了三種自我監督學習策略,以促進序列和影像表示的融合,捕捉更具概括性的聯合表示。結果表明,我們的做法在三個具有代表性的真實世界臨床資料集中優於其他七個最先進的模型。我們進一步通過留出感測器和留出樣本的技術模擬兩種主要的真實世界缺失類型來驗證我們的做法。結果表明,我們的做法更強大,並且在分類效能方面顯著優於其他基準。
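
The abstract names leave-sensors-out and leave-samples-out as ways of simulating missingness but does not spell out the procedure. The sketch below is one plausible reading, stated as an assumption: the first hides entire sensor channels, the second hides individual readings at random. The function names, the NaN convention for missing values, and the array layout (time steps by sensors) are all hypothetical.

```python
import numpy as np

def leave_sensors_out(series, n_drop, rng):
    """Mark n_drop randomly chosen sensor columns as entirely missing (NaN)."""
    out = series.astype(float).copy()
    dropped = rng.choice(series.shape[1], size=n_drop, replace=False)
    out[:, dropped] = np.nan
    return out

def leave_samples_out(series, frac, rng):
    """Mark roughly a fraction frac of individual readings as missing (NaN)."""
    out = series.astype(float).copy()
    mask = rng.random(series.shape) < frac
    out[mask] = np.nan
    return out

rng = np.random.default_rng(0)
toy_series = np.arange(40, dtype=float).reshape(10, 4)  # 10 time steps, 4 sensors
without_sensors = leave_sensors_out(toy_series, n_drop=2, rng=rng)
without_samples = leave_samples_out(toy_series, frac=0.3, rng=rng)
```

A model can then be evaluated on the masked copies to see how quickly its performance degrades under each kind of missingness.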

Foundation Model of Electronic Medical Records for Adaptive Risk Estimation

2502.06124v1 by Pawel Renc, Michal K. Grzeszczyk, Nassim Oufattole, Deirdre Goode, Yugang Jia, Szymon Bieganski, Matthew B. A. McDermott, Jaroslaw Was, Anthony E. Samir, Jonathan W. Cunningham, David W. Bates, Arkadiusz Sitek

We developed the Enhanced Transformer for Health Outcome Simulation (ETHOS), an AI model that tokenizes patient health timelines (PHTs) from EHRs. ETHOS predicts future PHTs using transformer-based architectures. The Adaptive Risk Estimation System (ARES) employs ETHOS to compute dynamic and personalized risk probabilities for clinician-defined critical events. ARES incorporates a personalized explainability module that identifies key clinical factors influencing risk estimates for individual patients. ARES was evaluated on the MIMIC-IV v2.2 dataset in emergency department (ED) settings, benchmarking its performance against traditional early warning systems and machine learning models. We processed 299,721 unique patients from MIMIC-IV into 285,622 PHTs, with 60% including hospital admissions. The dataset contained over 357 million tokens. ETHOS outperformed benchmark models in predicting hospital admissions, ICU admissions, and prolonged hospital stays, achieving superior AUC scores. ETHOS-based risk estimates demonstrated robustness across demographic subgroups with strong model reliability, confirmed via calibration curves. The personalized explainability module provides insights into patient-specific factors contributing to risk. ARES, powered by ETHOS, advances predictive healthcare AI by providing dynamic, real-time, and personalized risk estimation with patient-specific explainability to enhance clinician trust. Its adaptability and superior accuracy position it as a transformative tool for clinical decision-making, potentially improving patient outcomes and resource allocation in emergency and inpatient settings. We release the full code at github.com/ipolharvard/ethos-ares to facilitate future research.

摘要:我們開發了增強型健康結果模擬轉換器 (ETHOS), 一種從電子健康紀錄 (EHR) 中將患者健康時間軸 (PHT) 標記化的 AI 模型。ETHOS 使用基於轉換器的架構預測未來的 PHT。自適應風險評估系統 (ARES) 使用 ETHOS 計算由臨床醫生定義的危急事件的動態且個人化的風險機率。ARES 結合了個人化的可解釋性模組,可找出影響個別患者風險評估的主要臨床因素。ARES 在急診部門 (ED) 設定中針對 MIMIC-IV v2.2 資料集進行評估,並將其效能與傳統的預警系統和機器學習模型進行基準測試。我們將 299,721 位 MIMIC-IV 的獨特患者處理成 285,622 個 PHT,其中 60% 包含住院記錄。該資料集包含超過 3.57 億個標記。ETHOS 在預測住院、加護病房 (ICU) 住院和延長住院時間方面表現優於基準模型,並獲得了較高的 AUC 分數。基於 ETHOS 的風險評估顯示出跨人口統計子群的穩健性,並通過校準曲線確認了強大的模型可靠性。個人化的可解釋性模組提供了對導致風險的患者特定因素的見解。由 ETHOS 驅動的 ARES 透過提供動態、即時且個人化的風險評估,以及患者特定的可解釋性來增強臨床醫生的信任,從而推動了預測性醫療保健 AI 的發展。其適應性和卓越的準確性使其成為臨床決策制定的一種變革性工具,有可能改善緊急和住院環境中的患者結果和資源分配。我們在 github.com/ipolharvard/ethos-ares 上釋出完整程式碼,以利未來的研究。
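
The abstract does not say how ARES turns ETHOS's predicted timelines into risk probabilities. One common way to get a probability from a generative sequence model is to sample many possible futures and count how often the event of interest appears; the sketch below illustrates that idea only, under that assumption. `estimate_risk`, `toy_sampler`, and the token names are hypothetical stand-ins, not part of the released ethos-ares code.

```python
import random

def estimate_risk(sample_future, history, event, n_samples=1000, seed=0):
    """Fraction of sampled future timelines that contain the given event token."""
    rng = random.Random(seed)
    hits = sum(event in sample_future(history, rng) for _ in range(n_samples))
    return hits / n_samples

def toy_sampler(history, rng):
    """Hypothetical stand-in for a trained model that generates future tokens."""
    p_admit = 0.6 if "CHEST_PAIN" in history else 0.1
    return ["ADMISSION"] if rng.random() < p_admit else ["DISCHARGE"]

high_risk = estimate_risk(toy_sampler, ["TRIAGE", "CHEST_PAIN"], "ADMISSION")
low_risk = estimate_risk(toy_sampler, ["TRIAGE", "SPRAINED_ANKLE"], "ADMISSION")
```

Because the estimate is recomputed from the current history, it can be refreshed each time new tokens are appended to a patient's timeline.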

Can ChatGPT Diagnose Alzheimer's Disease?

2502.06907v1 by Quoc-Toan Nguyen, Linh Le, Xuan-The Tran, Thomas Do, Chin-Teng Lin

Can ChatGPT diagnose Alzheimer's Disease (AD)? AD is a devastating neurodegenerative condition that affects approximately 1 in 9 individuals aged 65 and older, profoundly impairing memory and cognitive function. This paper utilises 9300 electronic health records (EHRs) with data from Magnetic Resonance Imaging (MRI) and cognitive tests to address an intriguing question: As a general-purpose task solver, can ChatGPT accurately detect AD using EHRs? We present an in-depth evaluation of ChatGPT using a black-box approach with zero-shot and multi-shot methods. This study unlocks ChatGPT's capability to analyse MRI and cognitive test results, as well as its potential as a diagnostic tool for AD. By automating aspects of the diagnostic process, this research opens a transformative approach for the healthcare system, particularly in addressing disparities in resource-limited regions where AD specialists are scarce. Hence, it offers a foundation for a promising method for early detection, supporting individuals with timely interventions, which is paramount for Quality of Life (QoL).

摘要:ChatGPT 能否診斷出阿茲海默症 (AD)?AD 是一種毀滅性的神經退化性疾病,影響約 1/9 的 65 歲及以上人士,嚴重損害記憶力和認知功能。這篇論文利用了 9300 份電子健康紀錄 (EHR),其中包含磁共振成像 (MRI) 和認知測試的數據,來解決一個有趣的問題:作為一個通用任務解決器,ChatGPT 能否使用 EHR 準確地檢測出 AD?我們使用黑盒方法對 ChatGPT 進行了深入評估,採用零次嘗試和多次嘗試的方法。這項研究揭示了 ChatGPT 分析 MRI 和認知測試結果的能力,以及其作為 AD 診斷工具的潛力。通過自動化診斷過程的各個方面,這項研究為醫療保健系統開啟了一種變革性的方法,特別是在解決資源有限的地區中 AD 專家稀缺的不平等問題方面。因此,它為一種有希望的早期檢測方法奠定了基礎,通過及時干預來支持個人,這對於生活品質 (QoL) 至關重要。

Protecting Intellectual Property of EEG-based Neural Networks with Watermarking

2502.05931v1 by Ahmed Abdelaziz, Ahmed Fathi, Ahmed Fares

EEG-based neural networks, pivotal in medical diagnosis and brain-computer interfaces, face significant intellectual property (IP) risks due to their reliance on sensitive neurophysiological data and resource-intensive development. Current watermarking methods, particularly those using abstract trigger sets, lack robust authentication and fail to address the unique challenges of EEG models. This paper introduces a cryptographic wonder filter-based watermarking framework tailored for EEG-based neural networks. Leveraging collision-resistant hashing and public-key encryption, the wonder filter embeds the watermark during training, ensuring minimal distortion ($\leq 5\%$ drop in EEG task accuracy) and high reliability (100% watermark detection). The framework is rigorously evaluated against adversarial attacks, including fine-tuning, transfer learning, and neuron pruning. Results demonstrate persistent watermark retention, with classification accuracy for watermarked states remaining above 90% even after aggressive pruning, while primary task performance degrades faster, deterring removal attempts. Piracy resistance is validated by the inability to embed secondary watermarks without severe accuracy loss ($>10\%$ in EEGNet and CCNN models). Cryptographic hashing ensures authentication, reducing brute-force attack success probabilities. Evaluated on the DEAP dataset across models (CCNN, EEGNet, TSception), the method achieves $>99.4\%$ null-embedding accuracy, effectively eliminating false positives. By integrating wonder filters with EEG-specific adaptations, this work bridges a critical gap in IP protection for neurophysiological models, offering a secure, tamper-proof solution for healthcare and biometric applications. The framework's robustness against adversarial modifications underscores its potential to safeguard sensitive EEG models while maintaining diagnostic utility.

摘要:基於 EEG 的神經網路在醫學診斷和腦電腦介面中至關重要,由於其依賴敏感的神經生理資料和資源密集型的開發,面臨重大的智慧財產權 (IP) 風險。目前的浮水印方法,特別是那些使用抽象觸發集的方法,缺乏強健的驗證,且無法解決 EEG 模型的獨特挑戰。本文介紹了一個專為基於 EEG 的神經網路量身打造的密碼學 wonder 濾波器浮水印架構。利用抗碰撞雜湊和公開金鑰加密,wonder 濾波器在訓練期間嵌入浮水印,確保最小的失真(EEG 任務準確度下降 $\leq 5\%$)和高可靠性(100% 浮水印檢測)。該架構針對對抗性攻擊進行了嚴格的評估,包括微調、遷移學習和神經元剪枝。結果證明了持續的浮水印保留,即使在激進的剪枝後,浮水印狀態的分類準確度仍保持在 90% 以上,而主要任務的性能下降得更快,阻止了移除嘗試。盜版抵抗力通過無法嵌入次要浮水印而得到驗證,而不會造成嚴重的準確度損失(在 EEGNet 和 CCNN 模型中 $>10\%$)。密碼學雜湊確保驗證,降低了暴力攻擊成功機率。在 DEAP 資料集上針對模型(CCNN、EEGNet、TSception)進行評估,該方法達到了 $>99.4\%$ 的空嵌入準確度,有效地消除了假陽性。透過將 wonder 濾波器與 EEG 特定的適應相整合,這項工作彌補了神經生理模型 IP 保護中的關鍵差距,為醫療保健和生物特徵應用提供了一個安全、防篡改的解決方案。該架構對抗敵對修改的強健性突顯了其在維護診斷效用的同時保護敏感 EEG 模型的潛力。
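
As a rough illustration of how collision-resistant hashing can tie a watermark pattern to its owner, the sketch below derives a deterministic bit pattern from a signature using SHA-256. It shows the general idea only: how the paper actually builds its wonder filter, uses public-key encryption, or applies the pattern to EEG inputs is not described in the abstract, and `pattern_from_signature` and the example signatures are hypothetical.

```python
import hashlib

def pattern_from_signature(signature: bytes, n_bits: int) -> list[int]:
    """Expand a signature into n_bits pseudo-random bits via repeated SHA-256."""
    bits: list[int] = []
    counter = 0
    while len(bits) < n_bits:
        # Hash the signature together with a counter to get fresh 256-bit blocks.
        digest = hashlib.sha256(signature + counter.to_bytes(4, "big")).digest()
        for byte in digest:
            for i in range(8):
                bits.append((byte >> i) & 1)
        counter += 1
    return bits[:n_bits]

owner_pattern = pattern_from_signature(b"owner-signature-example", 512)
other_pattern = pattern_from_signature(b"someone-else", 512)
```

Because the pattern is a deterministic function of the signature, the owner can regenerate it later for verification, while someone without the signature would have to guess the bits.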