Medical

Publish Date Title Authors Homepage Code
2024-11-12 Scaling Properties of Diffusion Models for Perceptual Tasks Rahul Ravishankar et.al. 2411.08034v1 null
2024-11-12 Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech Eleonora Mancini et.al. 2411.08013v1 null
2024-11-12 DuoLift-GAN: Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks Zhaoxi Zhang et.al. 2411.07941v1 null
2024-11-12 Automatic dataset shift identification to support root cause analysis of AI performance drift Mélanie Roschewitz et.al. 2411.07940v1 null
2024-11-12 INTRABENCH: Interactive Radiological Benchmark Constantin Ulrich et.al. 2411.07885v1 null
2024-11-12 Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease Francesco Chiumento et.al. 2411.07871v1 null
2024-11-12 PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring M. Jaleed Khan et.al. 2411.07796v1 link
2024-11-12 Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation Shuai Niu et.al. 2411.07611v1 null
2024-11-12 Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection YeongHyeon Park et.al. 2411.07546v1 null
2024-11-11 Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews Aakash Sorathiya et.al. 2411.07398v1 null
2024-11-11 Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery Mohamed Abul Hassan et.al. 2411.07395v1 null
2024-11-11 Data-Driven Analysis of AI in Medical Device Software in China: Deep Learning and General AI Trends Based on Regulatory Data Yu Han et.al. 2411.07378v1 null
2024-11-11 A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19 Vedant Khandelwal et.al. 2411.07163v1 null
2024-11-11 Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models Chanseo Lee et.al. 2411.06713v1 null
2024-11-10 In-Context Learning for Preserving Patient Privacy: A Framework for Synthesizing Realistic Patient Portal Messages Joseph Gatto et.al. 2411.06549v1 link
2024-11-09 NeuReg: Domain-invariant 3D Image Registration on Human and Mouse Brains Taha Razzaq et.al. 2411.06315v1 null
2024-11-09 GuidelineGuard: An Agentic Framework for Medical Note Evaluation with Guideline Adherence MD Ragib Shahriyear et.al. 2411.06264v1 null
2024-11-09 Deep Reinforcement Learning for Digital Twin-Oriented Complex Networked Systems Jiaqi Wen et.al. 2411.06148v1 null
2024-11-09 Evaluating the Propensity of Generative AI for Producing Disinformation During an Election Cycle Erik J Schlicht et.al. 2411.06120v1 null
2024-11-09 Personalize to generalize: Towards a universal medical multi-modality generalization through personalization Zhaorui Tan et.al. 2411.06106v1 null
2024-11-08 Assessing Foundational Medical 'Segment Anything' (Med-SAM1, Med-SAM2) Deep Learning Models for Left Atrial Segmentation in 3D LGE MRI Mehri Mehrnia et.al. 2411.05963v1 null
2024-11-08 GazeSearch: Radiology Findings Search Benchmark Trong Thang Pham et.al. 2411.05780v1 null
2024-11-08 Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators Nicholas Wan et.al. 2411.05897v1 null
2024-11-08 Identifying and Decomposing Compound Ingredients in Meal Plans Using Large Language Models Leon Kopitar et.al. 2411.05892v1 null
2024-11-08 SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark Sithursan Sivasubramaniam et.al. 2411.05521v1 null
2024-11-08 Towards Scalable Foundation Models for Digital Dermatology Fabian Gröger et.al. 2411.05514v1 link
2024-11-08 Towards Equitable ASD Diagnostics: A Comparative Study of Machine and Deep Learning Models Using Behavioral and Facial Data Mohammed Aledhari et.al. 2411.05880v1 null
2024-11-07 Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations Joey Hong et.al. 2411.05194v1 null
2024-11-07 Inverse Transition Learning: Learning Dynamics from Demonstrations Leo Benac et.al. 2411.05174v1 null
2024-11-07 PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation Daniel C. Castro et.al. 2411.05085v1 null
2024-11-07 Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability Yanjun Gao et.al. 2411.04962v1 null
2024-11-07 FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs? Eric Wu et.al. 2411.05059v2 link
2024-11-07 Integrating Large Language Models for Genetic Variant Classification Youssef Boulaimen et.al. 2411.05055v1 null
2024-11-07 AWARE Narrator and the Utilization of Large Language Models to Extract Behavioral Insights from Smartphone Sensing Data Tianyi Zhang et.al. 2411.04691v1 null
2024-11-07 FedDP: Privacy-preserving method based on federated learning for histopathology image segmentation Liangrui Pan et.al. 2411.04509v1 null
2024-11-07 Conditional Diffusion Model for Longitudinal Medical Image Generation Duy-Phuong Dao et.al. 2411.05860v1 null
2024-11-07 Evaluating the Economic Implications of Using Machine Learning in Clinical Psychiatry Soaad Hossain et.al. 2411.05856v1 null
2024-11-06 Robust Real-Time Mortality Prediction in the Intensive Care Unit using Temporal Difference Learning Thomas Frost et.al. 2411.04285v1 link
2024-11-06 Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? Daniel P. Jeong et.al. 2411.04118v1 null
2024-11-06 RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models Maya Varma et.al. 2411.04097v1 link
2024-11-06 Aligning Characteristic Descriptors with Images for Human-Expert-like Explainability Bharat Chandra Yalavarthi et.al. 2411.04008v1 null
2024-11-06 Fine-tuning -- a Transfer Learning approach Joseph Arul Raj et.al. 2411.03941v1 null
2024-11-06 MEG: Medical Knowledge-Augmented Large Language Models for Question Answering Laura Cabello et.al. 2411.03883v2 link
2024-11-06 Navigating the landscape of multimodal AI in medicine: a scoping review on technical challenges and clinical applications Daan Schouten et.al. 2411.03782v1 null
2024-11-06 Sub-DM: Subspace Diffusion Model with Orthogonal Decomposition for MRI Reconstruction Yu Guan et.al. 2411.03758v1 null
2024-11-06 Ultrasound-Based AI for COVID-19 Detection: A Comprehensive Review of Public and Private Lung Ultrasound Datasets and Studies Abrar Morshed et.al. 2411.05029v1 null
2024-11-06 Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? Pedro R. A. S. Bassi et.al. 2411.03670v1 link
2024-11-06 Requirements Engineering for Older Adult Digital Health Software: A Systematic Literature Review Yuqing Xiao et.al. 2411.03656v1 null
2024-11-06 Cross Feature Fusion of Fundus Image and Generated Lesion Map for Referable Diabetic Retinopathy Classification Dahyun Mok et.al. 2411.03618v1 null
2024-11-05 The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for Healthcare Souren Pashangpour et.al. 2411.03287v1 null
2024-11-05 Discovering Data Structures: Nearest Neighbor Search and Beyond Omar Salemohamed et.al. 2411.03253v1 null
2024-11-05 Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care Christel Sirocchi et.al. 2411.03105v1 link
2024-11-05 Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting Adrian B. Chłopowiec et.al. 2411.03098v1 null
2024-11-05 Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status Samuel Lee et.al. 2411.03004v1 null
2024-11-05 Region-Guided Attack on the Segment Anything Model (SAM) Xiaoliang Liu et.al. 2411.02974v2 null
2024-11-05 [Vision Paper] PRObot: Enhancing Patient-Reported Outcome Measures for Diabetic Retinopathy using Chatbots and Generative AI Maren Pielka et.al. 2411.02973v1 null
2024-11-05 Leveraging Transfer Learning and Multiple Instance Learning for HER2 Automatic Scoring of H&E Whole Slide Images Rawan S. Abdulsadig et.al. 2411.05028v1 null
2024-11-05 Membership Inference Attacks against Large Vision-Language Models Zhan Li et.al. 2411.02902v1 link
2024-11-04 Advanced XR-Based 6-DOF Catheter Tracking System for Immersive Cardiac Intervention Training Mohsen Annabestani et.al. 2411.02611v1 null
2024-11-04 "It's a conversation, not a quiz": A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health Jiawei Zhou et.al. 2411.02594v1 null
2024-11-04 Digitizing Touch with an Artificial Multimodal Fingertip Mike Lambeta et.al. 2411.02479v1 link
2024-11-04 Simulation of Nanorobots with Artificial Intelligence and Reinforcement Learning for Advanced Cancer Cell Detection and Tracking Shahab Kavousinejad et.al. 2411.02345v1 link
2024-11-04 Taking AI Welfare Seriously Robert Long et.al. 2411.00986v1 null
2024-11-04 Federated GNNs for EEG-Based Stroke Assessment Andrea Protani et.al. 2411.02286v1 null
2024-11-04 Weakly supervised deep learning model with size constraint for prostate cancer detection in multiparametric MRI and generalization to unseen domains Robin Trombetta et.al. 2411.02466v1 null
2024-11-04 Evaluating the quality of published medical research with ChatGPT Mike Thelwall et.al. 2411.01952v1 null
2024-11-04 You are out of context! Giancarlo Cobino et.al. 2411.02464v1 null
2024-11-03 Diagnosing Medical Datasets with Training Dynamics Laura Wenderoth et.al. 2411.01653v1 link
2024-11-03 Optical Flow Representation Alignment Mamba Diffusion Model for Medical Video Generation Zhenbin Wang et.al. 2411.01647v1 null
2024-11-03 Customized Subgraph Selection and Encoding for Drug-drug Interaction Prediction Haotong Du et.al. 2411.01535v1 null
2024-11-03 Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design Onur Boyar et.al. 2411.01423v1 null
2024-11-02 Medical X-Ray Image Enhancement Using Global Contrast-Limited Adaptive Histogram Equalization Sohrab Namazi Nia et.al. 2411.01373v1 null
2024-11-02 Guided Synthesis of Labeled Brain MRI Data Using Latent Diffusion Models for Segmentation of Enlarged Ventricles Tim Ruschke et.al. 2411.01351v1 null
2024-11-02 Causal reasoning in difference graphs Charles K. Assaad et.al. 2411.01292v1 null
2024-11-02 Designing a Robust Radiology Report Generation System Sonit Singh et.al. 2411.01153v1 null
2024-11-02 LEARNER: Learning Granular Labels from Coarse Labels using Contrastive Learning Gautam Gare et.al. 2411.01144v1 null
2024-11-02 Artificial Intelligence for Microbiology and Microbiome Research Xu-Wen Wang et.al. 2411.01098v1 null
2024-11-01 Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities Adriel Saporta et.al. 2411.01053v1 link
2024-11-01 Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract Fan Xiao et.al. 2411.00726v1 null
2024-11-01 CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis Fuying Wang et.al. 2411.00696v1 null
2024-11-01 Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering Mehdi Hosseini Chagahi et.al. 2411.00916v2 null
2024-11-01 Deep learning-based auto-contouring of organs/structures-at-risk for pediatric upper abdominal radiotherapy Mianyong Ding et.al. 2411.00594v1 link
2024-11-01 Enhancing the Traditional Chinese Medicine Capabilities of Large Language Model through Reinforcement Learning from AI Feedback Song Yu et.al. 2411.00897v1 null
2024-11-01 StepCountJITAI: simulation environment for RL with application to physical activity adaptive intervention Karine Karine et.al. 2411.00336v1 null
2024-11-01 Strongly Topology-preserving GNNs for Brain Graph Super-resolution Pragya Singh et.al. 2411.02525v1 null
2024-11-01 Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes Balu Bhasuran et.al. 2411.02523v1 null
2024-10-31 Deep Learning Predicts Mammographic Breast Density in Clinical Breast Ultrasound Images Arianna Bunnell et.al. 2411.00891v2 link
2024-10-31 Monitoring fairness in machine learning models that predict patient mortality in the ICU Tempest A. van Schaik et.al. 2411.00190v2 null
2024-10-31 Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy Panagiota Gatoula et.al. 2411.00178v1 null
2024-10-31 Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning John Wu et.al. 2411.00173v1 null
2024-10-31 Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks Yingzhe Peng et.al. 2410.24032v1 null
2024-10-31 Neural Network Verification with PyRAT Augustin Lemesle et.al. 2410.23903v1 null
2024-10-31 Counterfactual MRI Data Augmentation using Conditional Denoising Diffusion Generative Models Pedro Morão et.al. 2410.23835v1 link
2024-10-31 Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding Jinlong He et.al. 2410.23822v1 null
2024-10-31 Improving snore detection under limited dataset through harmonic/percussive source separation and convolutional neural networks F. D. Gonzalez-Martinez et.al. 2410.23796v1 null
2024-10-31 The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams Yunqi Zhu et.al. 2410.23769v1 null
2024-10-31 Artificial intelligence to improve clinical coding practice in Scandinavia: a crossover randomized controlled trial Taridzo Chomutare et.al. 2410.23725v1 null
2024-10-31 Enhancing Brain Tumor Classification Using TrAdaBoost and Multi-Classifier Deep Learning Approaches Mahin Mohammadi et.al. 2411.00875v1 null
2024-10-31 Deep Convolutional Neural Networks on Multiclass Classification of Three-Dimensional Brain Images for Parkinson's Disease Stage Prediction Guan-Hua Huang et.al. 2410.23649v1 null
2024-10-31 MS-Glance: Non-semantic context vectors and the applications in supervising image reconstruction Ziqi Gao et.al. 2410.23577v1 link

Abstracts

Scaling Properties of Diffusion Models for Perceptual Tasks

2411.08034v1 by Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through a careful analysis of these scaling behaviors, we present various techniques to efficiently train diffusion models for visual perception tasks. Our models achieve improved or comparable performance to state-of-the-art methods using significantly less data and compute. To use our code and models, see https://scaling-diffusion-perception.github.io .

Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech

2411.08013v1 by Eleonora Mancini, Francesco Paissan, Paolo Torroni, Cem Subakan, Mirco Ravanelli

Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.

DuoLift-GAN: Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks

2411.07941v1 by Zhaoxi Zhang, Yueliang Ying

Computed tomography (CT) provides highly detailed three-dimensional (3D) medical images but is costly, time-consuming, and often inaccessible in intraoperative settings (Organization et al. 2011). Recent advancements have explored reconstructing 3D chest volumes from sparse 2D X-rays, such as single-view or orthogonal double-view images. However, current models tend to process 2D images in a planar manner, prioritizing visual realism over structural accuracy. In this work, we introduce DuoLift Generative Adversarial Networks (DuoLift-GAN), a novel architecture with dual branches that independently elevate 2D images and their features into 3D representations. These 3D outputs are merged into a unified 3D feature map and decoded into a complete 3D chest volume, enabling richer 3D information capture. We also present a masked loss function that directs reconstruction towards critical anatomical regions, improving structural accuracy and visual quality. This paper demonstrates that DuoLift-GAN significantly enhances reconstruction accuracy while achieving superior visual realism compared to existing methods.

Automatic dataset shift identification to support root cause analysis of AI performance drift

2411.07940v1 by Mélanie Roschewitz, Raghav Mehta, Charles Jones, Ben Glocker

Shifts in data distribution can substantially harm the performance of clinical AI models. Hence, various methods have been developed to detect the presence of such shifts at deployment time. However, root causes of dataset shifts are varied, and the choice of shift mitigation strategies is highly dependent on the precise type of shift encountered at test time. As such, detecting test-time dataset shift is not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift (caused by a change in the label distribution), covariate shift (caused by a change in input characteristics) and mixed shifts (simultaneous prevalence and covariate shifts). We discuss the importance of self-supervised encoders for detecting subtle covariate shifts and propose a novel shift detector leveraging both self-supervised encoders and task model outputs for improved shift detection. We report promising results for the proposed shift identification framework across three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shifts, using four large publicly available datasets.
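
The two pure shift types named in this abstract can be written compactly. The following is a sketch under the standard textbook definitions, which the abstract itself does not spell out. The subscripts s (source, i.e. training data) and t (target, i.e. deployment data) are notation introduced here for illustration.

```latex
\begin{align*}
\text{Prevalence shift:} \quad & p_s(y) \neq p_t(y), \qquad p_s(x \mid y) = p_t(x \mid y) \\
\text{Covariate shift:} \quad & p_s(x) \neq p_t(x), \qquad p_s(y \mid x) = p_t(y \mid x)
\end{align*}
```

Under this reading, a mixed shift is the case where the label distribution and the input characteristics change at the same time, so neither equality can be taken for granted.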

INTRABENCH: Interactive Radiological Benchmark

2411.07885v1 by Constantin Ulrich, Tassilo Wald, Emily Tempus, Maximilian Rokuss, Paul F. Jaeger, Klaus Maier-Hein

Current interactive segmentation approaches, inspired by the success of Meta's Segment Anything model, have achieved notable advancements; however, they come with substantial limitations that hinder their practical application in real clinical scenarios. These include unrealistic human interaction requirements, such as slice-by-slice operations for 2D models on 3D data, a lack of iterative refinement, and insufficient evaluation experiments. These shortcomings prevent accurate assessment of model performance and lead to inconsistent outcomes across studies. IntRaBench overcomes these challenges by offering a comprehensive and reproducible framework for evaluating interactive segmentation methods in realistic, clinically relevant scenarios. It includes diverse datasets, target structures, and segmentation models, and provides a flexible codebase that allows seamless integration of new models and prompting strategies. Additionally, we introduce advanced techniques to minimize clinician interaction, ensuring fair comparisons between 2D and 3D models. By open-sourcing IntRaBench, we invite the research community to integrate their models and prompting techniques, ensuring continuous and transparent evaluation of interactive segmentation models in 3D medical imaging.

Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease

2411.07871v1 by Francesco Chiumento, Mingming Liu

The rapid advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown great potential in medical diagnostics, particularly in radiology, where datasets such as X-rays are paired with human-generated diagnostic reports. However, a significant research gap exists in the neuroimaging field, especially for conditions such as Alzheimer's disease, due to the lack of comprehensive diagnostic reports that can be utilized for model fine-tuning. This paper addresses this gap by generating synthetic diagnostic reports using GPT-4o-mini on structured data from the OASIS-4 dataset, which comprises 663 patients. Using the synthetic reports as ground truth for training and validation, we then generated neurological reports directly from the images in the dataset leveraging the pre-trained BiomedCLIP and T5 models. Our proposed method achieved a BLEU-4 score of 0.1827, ROUGE-L score of 0.3719, and METEOR score of 0.4163, revealing its potential in generating clinically relevant and accurate diagnostic reports.
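
For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how a ROUGE-L F1 score can be computed from the longest common subsequence (LCS) of two token sequences. It assumes plain lowercase whitespace tokenization and equal weighting of precision and recall; published ROUGE implementations may add stemming or a different recall weighting, so this will not necessarily reproduce the paper's 0.3719. The function names and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 with whitespace tokenization (an illustrative simplification)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented sentences, not taken from the paper's reports.
reference = "mild hippocampal atrophy is present"
candidate = "hippocampal atrophy is mild"
score = rouge_l_f1(reference, candidate)
```

Here the LCS is "hippocampal atrophy is" (3 tokens), giving precision 3/4, recall 3/5, and an F1 of 2/3.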

PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring

2411.07796v1 by M. Jaleed Khan, Manu Vatish, Gabriel Davis Jones

Antepartum Cardiotocography (CTG) is vital for fetal health monitoring, but traditional methods like the Dawes-Redman system are often limited by high inter-observer variability, leading to inconsistent interpretations and potential misdiagnoses. This paper introduces PatchCTG, a transformer-based model specifically designed for CTG analysis, employing patch-based tokenisation, instance normalisation and channel-independent processing to capture essential local and global temporal dependencies within CTG signals. PatchCTG was evaluated on the Oxford Maternity (OXMAT) dataset, comprising over 20,000 CTG traces across diverse clinical outcomes after applying the inclusion and exclusion criteria. With extensive hyperparameter optimisation, PatchCTG achieved an AUC of 77%, with specificity of 88% and sensitivity of 57% at Youden's index threshold, demonstrating adaptability to various clinical needs. Testing across varying temporal thresholds showed robust predictive performance, particularly with finetuning on data closer to delivery, achieving a sensitivity of 52% and specificity of 88% for near-delivery cases. These findings suggest the potential of PatchCTG to enhance clinical decision-making in antepartum care by providing a reliable, objective tool for fetal health assessment. The source code is available at https://github.com/jaleedkhan/PatchCTG.
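
The abstract reports sensitivity and specificity "at Youden's index threshold". As a minimal sketch of what that means, the snippet below picks the decision threshold that maximizes Youden's J statistic, J = sensitivity + specificity - 1. The scores and labels are invented toy values, not PatchCTG outputs, and the function name is hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing Youden's J.

    scores: predicted probabilities; labels: 1 = positive, 0 = negative.
    A sample is called positive when its score >= threshold.
    Ties in J are resolved in favor of the lowest threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


# Toy data, invented for illustration only.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
threshold, sensitivity, specificity = youden_threshold(scores, labels)
```

On this toy data the selected threshold is 0.7, with sensitivity 0.75 and specificity 1.0. As with the figures in the abstract, the J-optimal operating point can trade some sensitivity for higher specificity.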

Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation

2411.07611v1 by Shuai Niu, Jing Ma, Liang Bai, Zhihua Wang, Yida Xu, Yunya Song, Xian Yang

Clinical rationales play a pivotal role in accurate disease diagnosis; however, many models predominantly use discriminative methods and overlook the importance of generating supportive rationales. Rationale distillation is a process that transfers knowledge from large language models (LLMs) to smaller language models (SLMs), thereby enhancing the latter's ability to break down complex tasks. Despite its benefits, rationale distillation alone is inadequate for addressing domain knowledge limitations in tasks requiring specialized expertise, such as disease diagnosis. Effectively embedding domain knowledge in SLMs poses a significant challenge. While current LLMs are primarily geared toward processing textual data, multimodal LLMs that incorporate time series data, especially electronic health records (EHRs), are still evolving. To tackle these limitations, we introduce ClinRaGen, an SLM optimized for multimodal rationale generation in disease diagnosis. ClinRaGen incorporates a unique knowledge-augmented attention mechanism to merge domain knowledge with time series EHR data, utilizing a stepwise rationale distillation strategy to produce both textual and time series-based clinical rationales. Our evaluations show that ClinRaGen markedly improves the SLM's capability to interpret multimodal EHR data and generate accurate clinical rationales, supporting more reliable disease diagnosis, advancing LLM applications in healthcare, and narrowing the performance divide between LLMs and SLMs.

Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection

2411.07546v1 by YeongHyeon Park, Myung Jin Kim, Hyeong Seok Kim

A pre-trained visual-language model, contrastive language-image pre-training (CLIP), successfully accomplishes various downstream tasks with text prompts, such as finding images or localizing regions within the image. Despite CLIP's strong multi-modal data capabilities, it remains limited in specialized environments, such as medical applications. For this purpose, many CLIP variants (i.e., BioMedCLIP and MedCLIP-SAMv2) have emerged, but false positives related to normal regions persist. Thus, we pursue a simple yet important goal: reducing false positives in medical anomaly detection. We introduce a Contrastive LAnguage Prompting (CLAP) method that leverages both positive and negative text prompts. This straightforward approach identifies potential lesion regions by visual attention to the positive prompts in the given image. To reduce false positives, we attenuate attention on normal regions using negative prompts. Extensive experiments with the BMAD dataset, including six biomedical benchmarks, demonstrate that the CLAP method enhances anomaly detection performance. Our future plans include developing an automated fine prompting method for more practical usage.
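
The idea of attenuating attention with negative prompts can be illustrated with a toy example. The abstract does not give the combination rule, so the sketch below assumes a simple clipped subtraction of a negative-prompt attention map from a positive-prompt one. The function name, the `alpha` weight, and the map values are all hypothetical and should not be read as the CLAP implementation.

```python
def attenuate_attention(pos_map, neg_map, alpha=1.0):
    """Subtract weighted negative-prompt attention, clipping at zero.

    pos_map, neg_map: 2D lists of equal shape holding attention scores.
    alpha: assumed weight on the negative-prompt attention.
    """
    return [
        [max(0.0, p - alpha * n) for p, n in zip(pos_row, neg_row)]
        for pos_row, neg_row in zip(pos_map, neg_map)
    ]


# Invented 2x2 attention maps: the top-right cell responds to both prompts,
# which is the kind of region a false positive would come from.
pos_map = [[0.9, 0.6], [0.2, 0.1]]
neg_map = [[0.1, 0.5], [0.1, 0.3]]
anomaly_map = attenuate_attention(pos_map, neg_map)
```

In this toy case the top-left cell keeps most of its score, while the top-right cell, which also responds strongly to the negative prompt, is suppressed.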

Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews

2411.07398v1 by Aakash Sorathiya, Gouri Ginde

With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary. This makes automated extraction of ethical concern-related app reviews a challenging and time-consuming effort. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) evaluates four LLMs for classifying app reviews as privacy concerns; and 3) uses the best NLI and LLM models to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and the Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Then, using NLI+LLM, an additional 1,008 new privacy-related reviews were extracted that were not identified through the keyword-based approach in previous research, thus demonstrating the effectiveness of the proposed approach.

Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery

2411.07395v1 by Mohamed Abul Hassan, Pu Sun, Xiangnan Zhou, Lisanne Kraft, Kelsey T Hadfield, Katjana Ehrlich, Jinyi Qi, Andrew Birkeland, Laura Marcu

This study introduces a novel data-centric approach to improve real-time surgical guidance using fiber-based fluorescence lifetime imaging (FLIm). A key aspect of the methodology is the accurate detection of the aiming beam, which is essential for localizing points used to map FLIm measurements onto the tissue region within the surgical field. The primary challenge arises from the complex and variable conditions encountered in the surgical environment, particularly in Transoral Robotic Surgery (TORS). Uneven illumination in the surgical field can cause reflections, reduce contrast, and result in inconsistent color representation, further complicating aiming beam detection. To overcome these challenges, an instance segmentation model was developed using a data-centric training strategy that improves accuracy by minimizing label noise and enhancing detection robustness. The model was evaluated on a dataset comprising 40 in vivo surgical videos, demonstrating a median detection rate of 85%. This performance was maintained when the model was integrated into a clinical system, achieving a similar detection rate of 85% during TORS procedures conducted in patients. The system's computational efficiency, measured at approximately 24 frames per second (FPS), was sufficient for real-time surgical guidance. This study enhances the reliability of FLIm-based aiming beam detection in complex surgical environments, advancing the feasibility of real-time, image-guided interventions for improved surgical precision.

摘要:本研究提出了一種新穎的以數據為中心的策略,以使用基於光纖的螢光生命期成像 (FLIm) 來改善實時手術導引。此方法的一個關鍵面向是準確偵測瞄準光束,這對於定位用於將 FLIm 測量結果對應到手術視野內組織區域的點至關重要。主要的挑戰來自於手術環境中遇到的複雜且變化的條件,特別是在經口機器人手術 (TORS) 中。手術視野中的照明不均會導致反射、降低對比度,並造成不一致的顏色呈現,進一步使瞄準光束偵測複雜化。為了克服這些挑戰,開發了一個實例分割模型,使用以數據為中心的訓練策略,透過最小化標籤雜訊和增強偵測穩健性來提高準確度。此模型在包含 40 個體內手術影片的資料集上進行評估,顯示出 85% 的中位數偵測率。當此模型整合到臨床系統中時,此效能得以維持,在患者進行 TORS 手術期間達成相似的 85% 偵測率。此系統的運算效率,測量結果約為每秒 24 幀 (FPS),足以進行實時手術導引。本研究增強了 FLIm 為基礎的瞄準光束偵測在複雜手術環境中的可靠性,提升了實時、影像導引介入的可行性,以改善手術精準度

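Data-Driven Analysis of AI in Medical Device Software in China: Deep Learning and General AI Trends Based on Regulatory Data

2411.07378v1 by Yu Han, Aaron Ceross, Sarim Ather, Jeroen H. M. Bergmann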

Artificial intelligence (AI) in medical device software (MDSW) represents a transformative clinical technology, attracting increasing attention from both the medical community and regulators. In this study, we leverage a data-driven approach to automatically extract and analyze AI-enabled medical devices (AIMD) from the National Medical Products Administration (NMPA) regulatory database. The continued increase in publicly available regulatory data requires scalable methods for analysis. Automation of regulatory information screening is essential to create reproducible insights that can be quickly updated in an ever-changing medical device landscape. More than 4 million entries were assessed, identifying 2,174 MDSW registrations, including 531 standalone applications and 1,643 integrated within medical devices, of which 43 were AI-enabled. It was shown that the leading medical specialties utilizing AIMD include respiratory (20.5%), ophthalmology/endocrinology (12.8%), and orthopedics (10.3%). This approach greatly improves the speed of data extraction, providing a greater ability to compare and contrast. This study provides the first extensive, data-driven exploration of AIMD in China, showcasing the potential of automated regulatory data analysis in understanding and advancing the landscape of AI in medical technology.

摘要:醫療器材軟體 (MDSW) 中的人工智慧 (AI) 代表著變革性的臨床技術,在醫療社群和法規單位中都吸引了越來越多的關注。在本研究中,我們利用資料驅動的方法,從國家藥品監督管理局 (NMPA) 法規資料庫中自動擷取和分析具備 AI 功能的醫療器材 (AIMD)。持續增加的公開法規資料需要可擴充的分析方法。法規資訊篩選的自動化對於建立可重製的見解至關重要,這些見解可以在不斷變化的醫療器材領域中快速更新。評估了超過 400 萬筆條目,識別出 2,174 筆 MDSW 註冊,包括 531 筆獨立應用和 1,643 筆整合於醫療器材中,其中 43 筆具備 AI 功能。結果顯示,使用 AIMD 的主要醫療專科包括呼吸科 (20.5%)、眼科/內分泌科 (12.8%) 和骨科 (10.3%)。這種方法大幅提升了資料擷取速度,提供了更強大的比較和對比能力。本研究提供了中國 AIMD 的第一個廣泛資料驅動探索,展示了自動化法規資料分析在了解和推進醫療技術中 AI 領域的潛力。

A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19

2411.07163v1 by Vedant Khandelwal, Manas Gaur, Ugur Kursuncu, Valerie Shalin, Amit Sheth

Monitoring public sentiment via social media is potentially helpful during health crises such as the COVID-19 pandemic. However, traditional frequency-based, data-driven neural network approaches can miss newly relevant content due to the evolving nature of language in a dynamically evolving environment. Human-curated symbolic knowledge sources, such as lexicons for standard language and slang terms, can potentially elevate social media signals in evolving language. We introduce a neurosymbolic method that integrates neural networks with symbolic knowledge sources, enhancing the detection and interpretation of mental health-related tweets relevant to COVID-19. Our method was evaluated using a corpus of large datasets (approximately 12 billion tweets, 2.5 million subreddit data, and 700k news articles) and multiple knowledge graphs. This method dynamically adapts to evolving language, outperforming purely data-driven models with an F1 score exceeding 92%. This approach also showed faster adaptation to new data and lower computational demands than fine-tuning pre-trained large language models (LLMs). This study demonstrates the benefit of neurosymbolic methods in interpreting text in a dynamic environment for tasks such as health surveillance.

摘要:透過社群媒體監控公眾情緒在 COVID-19 等健康危機期間可能很有幫助。然而,傳統的基於頻率、資料驅動的神經網路方法可能會錯過新相關的內容,因為語言在動態演化的環境中會持續演化。由人類策劃的象徵性知識來源(例如標準語言和俚語術語的詞彙)可能會提升社群媒體在演化語言中的訊號。我們引入一種將神經網路與象徵性知識來源整合的神經符號方法,增強與 COVID-19 相關的心理健康相關推文的偵測和詮釋。我們的做法使用大型資料集語料庫(約 120 億則推文、250 萬個 subreddit 資料和 70 萬則新聞文章)和多個知識圖譜進行評估。這種方法動態適應演化的語言,優於純資料驅動模型,F1 分數超過 92%。這種方法也顯示出比微調預訓練大型語言模型 (LLM) 更快適應新資料和更低的運算需求。本研究證明了神經符號方法在動態環境中詮釋文字的優點,適用於健康監控等任務。

Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models

2411.06713v1 by Chanseo Lee, Sonu Kumar, Kimon A. Vogt, Sam Meraj

This study compares Sporo Health's AI Scribe, a proprietary model fine-tuned for medical scribing, with various LLMs (GPT-4o, GPT-3.5, Gemma-9B, and Llama-3.2-3B) in clinical documentation. We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth. Each model generated SOAP summaries using zero-shot prompting, with performance assessed via recall, precision, and F1 scores. Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance. Statistically significant differences (p < 0.05) were found between Sporo and the other models, with post-hoc tests showing significant improvements over GPT-3.5, Gemma-9B, and Llama 3.2-3B. While Sporo outperformed GPT-4o by up to 10%, the difference was not statistically significant (p = 0.25). Clinical user satisfaction, measured with a modified PDQI-9 inventory, favored Sporo. Evaluations indicated Sporo's outputs were more accurate and relevant. This highlights the potential of Sporo's multi-agentic architecture to improve clinical workflows.

摘要:本研究比较了 Sporo Health 的 AI Scribe,一种针对医疗记录专门微调的专有模型,与临床记录中的各种 LLM(GPT-4o、GPT-3.5、Gemma-9B 和 Llama-3.2-3B)。我们分析了来自合作诊所的去标识患者记录,使用临床医生提供的 SOAP 记录作为基本事实。每个模型使用零次提示生成了 SOAP 摘要,通过召回率、精确率和 F1 分数评估性能。Sporo 优于所有模型,以最低的性能差异实现了最高的召回率 (73.3%)、精确率 (78.6%) 和 F1 分数 (75.3%)。在 Sporo 和其他模型之间发现了统计学上的显着差异 (p < 0.05),事后检验显示与 GPT-3.5、Gemma-9B 和 Llama 3.2-3B 相比有显着改善。虽然 Sporo 的表现优于 GPT-4o 达 10%,但差异在统计学上并不显着 (p = 0.25)。使用修改后的 PDQI-9 清单衡量的临床用户满意度偏好 Sporo。评估表明 Sporo 的输出更准确、更相关。这突出了 Sporo 的多代理架构在改进临床工作流程方面的潜力。
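
The recall, precision, and F1 scores reported above compare generated SOAP summaries against clinician-written ones. The abstract does not say how items in a note were matched, so the following is only a minimal sketch that assumes each note has already been reduced to a set of discrete facts; the function name and the example facts are hypothetical.

```python
def precision_recall_f1(generated, reference):
    """Set-based precision, recall, and F1 for one generated note.

    Assumption: both notes are already reduced to sets of comparable items.
    """
    generated, reference = set(generated), set(reference)
    true_positives = len(generated & reference)
    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical facts from a reference SOAP note and a generated one.
reference_items = {"chest pain", "hypertension", "start aspirin", "follow up in 2 weeks"}
generated_items = {"chest pain", "hypertension", "start aspirin", "order ecg", "smoker"}

p, r, f1 = precision_recall_f1(generated_items, reference_items)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Scores computed per note like this are usually averaged over the evaluation set, which is one reason an averaged F1 need not equal the harmonic mean of the averaged precision and recall.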

In-Context Learning for Preserving Patient Privacy: A Framework for Synthesizing Realistic Patient Portal Messages

2411.06549v1 by Joseph Gatto, Parker Seegmiller, Timothy E. Burdick, Sarah Masud Preum

Since the COVID-19 pandemic, clinicians have seen a large and sustained influx of patient portal messages, significantly contributing to clinician burnout. To the best of our knowledge, there are no large-scale public patient portal message corpora that researchers can use to build tools to optimize clinician portal workflows. Informed by our ongoing work with a regional hospital, this study introduces an LLM-powered framework for configurable and realistic patient portal message generation. Our approach leverages few-shot grounded text generation, requiring only a small number of de-identified patient portal messages to help LLMs better match the true style and tone of real data. Clinical experts in our team deem this framework HIPAA-friendly, unlike existing privacy-preserving approaches to synthetic text generation, which cannot guarantee that all sensitive attributes will be protected. Through extensive quantitative and human evaluation, we show that our framework produces data of higher quality than comparable generation methods as well as all related datasets. We believe this work provides a path forward for (i) the release of large-scale synthetic patient message datasets that are stylistically similar to ground-truth samples and (ii) HIPAA-friendly data generation that requires minimal human de-identification effort.

摘要:自 COVID-19 大流行以來,臨床醫生收到了大量的持續性患者入口訊息,這顯著加劇了臨床醫生的倦怠感。據我們所知,沒有大型公共患者入口訊息語料庫可供研究人員用於建構工具來最佳化臨床醫生入口工作流程。本研究借鑒了我們與區域醫院正在進行的工作,介紹了一個由 LLM 驅動的框架,用於可配置且逼真的患者入口訊息產生。我們的做法利用了少樣本接地文本產生,只需少數去識別化的患者入口訊息,就能幫助 LLM 更佳匹配真實資料的真實風格和語氣。我們團隊中的臨床專家認為這個框架符合 HIPAA,這與現有的合成文本產生隱私保護方法不同,後者無法保證所有敏感屬性都受到保護。透過廣泛的量化和人工評估,我們證明了我們的框架產生的資料品質高於可比較的產生方法以及所有相關的資料集。我們相信這項工作為以下事項提供了前進的道路:(i) 發布與真實樣本在風格上相似的、大規模的合成患者訊息資料集,以及 (ii) 符合 HIPAA 的資料產生,而這需要最少的人工去識別化工作。

NeuReg: Domain-invariant 3D Image Registration on Human and Mouse Brains

2411.06315v1 by Taha Razzaq, Asim Iqbal

Medical brain imaging relies heavily on image registration to accurately curate structural boundaries of brain features for various healthcare applications. Deep learning models have shown remarkable performance in image registration in recent years. Still, they often struggle to handle the diversity of 3D brain volumes, challenged by their structural and contrastive variations and their imaging domains. In this work, we present NeuReg, a Neuro-inspired 3D image registration architecture with the feature of domain invariance. NeuReg generates domain-agnostic representations of imaging features and incorporates a shifting window-based Swin Transformer block as the encoder. This enables our model to capture the variations across brain imaging modalities and species. We demonstrate a new benchmark on multi-domain publicly available datasets comprising human and mouse 3D brain volumes. Extensive experiments reveal that our model (NeuReg) outperforms the existing baseline deep learning-based image registration models and provides a substantial performance boost on cross-domain datasets, where models are trained on a 'source-only' domain and tested on completely 'unseen' target domains. Our work establishes a new state-of-the-art for domain-agnostic 3D brain image registration, underpinned by a Neuro-inspired Transformer-based architecture.

摘要:醫學腦部影像高度依賴影像配準,以準確策畫大腦特徵的結構性邊界,用於各種醫療保健應用。深度學習模型近年來在影像配準中展現出卓越的效能。儘管如此,這些模型在處理多元的 3D 大腦體積時常常會遇到困難,受到其結構和對比變化以及影像領域的挑戰。在這項工作中,我們提出 NeuReg,一種具備領域不變性特徵的神經啟發式 3D 影像配準架構。NeuReg 產生影像特徵的領域不可知表示,並將基於滑動視窗的 Swin Transformer 區塊作為編碼器。這使我們的模型能夠擷取跨大腦影像模式和物種的變化。我們展示了一個新的基準,包含人類和老鼠 3D 大腦體積的多領域公開可用資料集。廣泛的實驗顯示,我們的模型 (NeuReg) 優於現有的基準深度學習影像配準模型,並在跨領域資料集上提供高性能提升,其中模型在「僅來源」領域上訓練,並在完全「未見」的目標領域上進行測試。我們的研究建立了領域不可知 3D 大腦影像配準的新技術,由神經啟發式 Transformer 為基礎的架構所支撐。

GuidelineGuard: An Agentic Framework for Medical Note Evaluation with Guideline Adherence

2411.06264v1 by MD Ragib Shahriyear

Although rapid advancements in Large Language Models (LLMs) are facilitating the integration of artificial intelligence-based applications and services in healthcare, limited research has focused on the systematic evaluation of medical notes for guideline adherence. This paper introduces GuidelineGuard, an agentic framework powered by LLMs that autonomously analyzes medical notes, such as hospital discharge and office visit notes, to ensure compliance with established healthcare guidelines. By identifying deviations from recommended practices and providing evidence-based suggestions, GuidelineGuard helps clinicians adhere to the latest standards from organizations like the WHO and CDC. This framework offers a novel approach to improving documentation quality and reducing clinical errors.

摘要:儘管大型語言模型 (LLM) 的快速進展促進了人工智慧應用程式和服務在醫療保健中的整合,但有限的研究專注於對醫療記錄進行系統評估以符合準則。本文介紹了 GuidelineGuard,一個由 LLM 提供動力的代理架構,它會自動分析醫療記錄,例如醫院出院和門診記錄,以確保符合既定的醫療保健準則。透過找出與建議做法的偏差並提供基於證據的建議,GuidelineGuard 可協助臨床醫生遵守世界衛生組織 (WHO) 和疾病管制中心 (CDC) 等組織的最新標準。此架構提供了一種改善文件品質和減少臨床錯誤的新方法。

Deep Reinforcement Learning for Digital Twin-Oriented Complex Networked Systems

2411.06148v1 by Jiaqi Wen, Bogdan Gabrys, Katarzyna Musial

The Digital Twin Oriented Complex Networked System (DT-CNS) aims to build and extend a Complex Networked System (CNS) model with progressively increasing dynamics complexity towards an accurate reflection of reality -- a Digital Twin of reality. Our previous work proposed evolutionary DT-CNSs to model the long-term adaptive network changes in an epidemic outbreak. This study extends this framework by proposing the temporal DT-CNS model, where reinforcement learning-driven nodes make decisions on temporal directed interactions in an epidemic outbreak. We consider cooperative nodes, as well as egocentric and ignorant "free-riders" in the cooperation. We describe this epidemic spreading process with the Susceptible-Infected-Recovered (SIR) model and investigate the impact of epidemic severity on the epidemic resilience for different types of nodes. Our experimental results show that (i) full cooperation leads to a higher reward and lower infection number than cooperation with egocentric or ignorant "free-riders"; (ii) an increasing number of "free-riders" in a cooperation leads to a smaller reward, while an increasing number of egocentric "free-riders" further escalates the infection numbers; and (iii) higher infection rates and slower recovery weaken networks' resilience to severe epidemic outbreaks. These findings also indicate that promoting cooperation and reducing "free-riders" can improve public health during epidemics.

摘要:數位孿生導向複雜網路系統(DT-CNS)旨在建立和擴展複雜網路系統(CNS)模型,並逐步增加動態複雜性以準確反映現實——現實的數位孿生。我們先前的工作提出演化的 DT-CNS 來建模流行病爆發中的長期適應性網路變化。本研究透過提出時間 DT-CNS 模型來延伸這個架構,其中強化學習驅動的節點在流行病爆發中對時間導向互動做出決策。我們考慮合作節點,以及合作中的自我中心和無知的「搭便車者」。我們使用易感者-受感染者-康復者($SIR$)模型描述這個流行病擴散過程,並調查流行病嚴重性對不同類型節點的流行病復原力的影響。我們的實驗結果顯示 (i) 全面合作會導致比與自我中心或無知的「搭便車者」合作更高的回報和更低的感染數;(ii) 合作中的「搭便車者」數量增加會導致較小的回報,而自我中心的「搭便車者」數量增加會進一步提升感染數;(iii) 較高的感染率和較慢的復原會削弱網路對嚴重流行病爆發的復原力。這些發現也表示,在流行病期間促進合作和減少「搭便車者」可以改善公共衛生。
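
The Susceptible-Infected-Recovered (SIR) dynamics mentioned above can be illustrated with a very small simulation. This is only a sketch of the classical well-mixed, deterministic SIR model stepped forward with a simple Euler update; it is not the networked, reinforcement learning-driven DT-CNS model from the paper, and the rates and population sizes are hypothetical.

```python
def simulate_sir(beta, gamma, s0, i0, r0, steps, dt=1.0):
    """Euler-step the classical SIR equations.

    beta: infection rate, gamma: recovery rate (both hypothetical values here).
    Returns a list of (S, I, R) tuples, one per time step including the start.
    """
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


def peak_infected(history):
    """Largest number of simultaneously infected individuals."""
    return max(i for _, i, _ in history)


# Two hypothetical outbreaks: the second has a higher infection rate and slower recovery.
mild = simulate_sir(beta=0.2, gamma=0.1, s0=990, i0=10, r0=0, steps=300)
severe = simulate_sir(beta=0.4, gamma=0.05, s0=990, i0=10, r0=0, steps=300)

print(f"peak infected (mild):   {peak_infected(mild):.0f}")
print(f"peak infected (severe): {peak_infected(severe):.0f}")
```

In this toy setting, raising the infection rate and lowering the recovery rate produces a larger infection peak, which is the well-mixed analogue of finding (iii).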

Evaluating the Propensity of Generative AI for Producing Disinformation During an Election Cycle

2411.06120v1 by Erik J Schlicht

Generative Artificial Intelligence offers a powerful tool for adversaries who wish to engage in influence operations, such as the Chinese Spamouflage operation and the Russian Internet Research Agency effort that both sought to interfere with recent US election cycles. Therefore, this study seeks to investigate the propensity of current Generative AI models for producing harmful disinformation during an election cycle. The probability that different Generative AI models produced disinformation when given adversarial prompts was evaluated, along with the associated harm. This allows the expected harm for each model to be computed. Copilot and Gemini tied for the overall safest performance by realizing the lowest expected harm, while GPT-4o produced the greatest rates of harmful disinformation, resulting in much higher expected harm scores. The impact of disinformation category was also investigated: Gemini was safest within the political category of disinformation, while Copilot was safest for topics related to health. Moreover, characteristics of adversarial roles were discovered that led to greater expected harm across all models. Finally, classification models were developed that predicted disinformation production based on the conditions considered in this study, which offers insight into factors important for predicting disinformation production. Based on all of these insights, recommendations are provided that seek to mitigate factors that lead to harmful disinformation being produced by Generative AI models. It is hoped that developers will use these insights to improve future models.

摘要:生成式人工智慧為有意從事影響力操作的敵對者提供強大的工具,例如中國的垃圾郵件偽裝行動和俄羅斯的網路研究機構努力,這兩者都試圖干預最近的美國選舉週期。因此,本研究旨在調查當前生成式 AI 模型在選舉週期中產生有害錯誤訊息的傾向。除了相關危害之外,還評估了在給定對抗提示時不同生成式 AI 模型產生錯誤訊息的可能性。這允許計算每個模型的預期危害,並且發現 Copilot 和 Gemini 在實現最低預期危害方面並列為最安全的整體效能,而 GPT-4o 產生了最高比率的有害錯誤訊息,導致預期危害分數高得多。還調查了錯誤訊息類別的影響,並且 Gemini 在政治類別的錯誤訊息中是最安全的,而 Copilot 在與健康相關的主題中最安全。此外,發現了對抗角色的特性,導致所有模型的預期危害更大。最後,開發了分類模型,根據本研究中考慮的條件預測錯誤訊息產生,這提供了對預測錯誤訊息產生很重要的因素的見解。根據所有這些見解,提供了建議,旨在減輕導致生成式 AI 模型產生有害錯誤訊息的因素。希望開發人員將使用這些見解來改進未來的模型。
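
The expected-harm calculation described above combines how often a model produces disinformation with how harmful that output is. The abstract does not give the exact formula, so the sketch below assumes the simplest form: probability times harm, averaged over prompt conditions. The model names, probabilities, and harm ratings are all hypothetical and are not results from the study.

```python
def expected_harm(conditions):
    """Mean of probability * harm over (probability, harm) pairs.

    Assumption: harm is a numeric severity rating and each pair is one prompt condition.
    """
    if not conditions:
        raise ValueError("need at least one condition")
    return sum(p * h for p, h in conditions) / len(conditions)


# Hypothetical observations: (probability of disinformation, harm rating) per condition.
observations = {
    "model_a": [(0.10, 3.0), (0.05, 5.0)],
    "model_b": [(0.40, 3.0), (0.30, 5.0)],
}

scores = {name: expected_harm(conds) for name, conds in observations.items()}
safest = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: expected harm = {score:.3f}")
print("lowest expected harm:", safest)
```

Ranking models by this score rewards both a low rate of disinformation and low severity when it does occur.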

Personalize to generalize: Towards a universal medical multi-modality generalization through personalization

2411.06106v1 by Zhaorui Tan, Xi Yang, Tan Pan, Tianyi Liu, Chen Jiang, Xin Guo, Qiufeng Wang, Anh Nguyen, Yuan Qi, Kaizhu Huang, Yuan Cheng

Personalized medicine is a groundbreaking healthcare framework for the 21st century, tailoring medical treatments to individuals based on unique clinical characteristics, including diverse medical imaging modalities. Given the significant differences among these modalities due to distinct underlying imaging principles, generalization in multi-modal medical image tasks becomes substantially challenging. Previous methods addressing multi-modal generalization rarely consider personalization, primarily focusing on common anatomical information. This paper aims to bridge multi-modal generalization with the concept of personalized medicine. Specifically, we propose a novel approach to derive a tractable form of the underlying personalized invariant representation $\mathbb{X}_h$ by leveraging individual-level constraints and a learnable biological prior. We demonstrate the feasibility and benefits of learning a personalized $\mathbb{X}_h$, showing that this representation is highly generalizable and transferable across various multi-modal medical tasks. Our method is rigorously validated on medical imaging modalities emphasizing both physical structure and functional information, encompassing a range of tasks that require generalization. Extensive experimental results consistently show that our approach significantly improves performance across diverse scenarios, confirming its effectiveness.

摘要:個人化醫療是 21 世紀的創新醫療保健架構,根據獨特的臨床特徵(包括多種醫學影像方式)為個人量身打造醫療治療。由於這些方式基於不同的影像原理,因此存在顯著差異,多模式醫學影像任務中的概括變得極具挑戰性。先前處理多模式概括的方法很少考慮個人化,主要關注於共同的解剖資訊。本文旨在將多模式概括與個人化醫療的概念聯繫起來。具體來說,我們提出了一種新穎的方法,透過利用個人層級約束和可學習的生物先驗,衍生出基礎個人化不變表示 $\mathbb{X}_h$ 的易於處理形式。我們展示了學習個人化 $\mathbb{X}_h$ 的可行性和好處,表明此表示具有高度可概括性,並且可以在各種多模式醫療任務中轉移。我們的技術在強調物理結構和功能資訊的醫學影像方式上得到嚴格驗證,涵蓋了需要概括的一系列任務。廣泛的實驗結果一致表明,我們的技術顯著改善了各種情境下的效能,證實了其有效性。

Assessing Foundational Medical 'Segment Anything' (Med-SAM1, Med-SAM2) Deep Learning Models for Left Atrial Segmentation in 3D LGE MRI

2411.05963v1 by Mehri Mehrnia, Mohamed Elbayumi, Mohammed S. M. Elbaz

Atrial fibrillation (AF), the most common cardiac arrhythmia, is associated with heart failure and stroke. Accurate segmentation of the left atrium (LA) in 3D late gadolinium-enhanced (LGE) MRI is helpful for evaluating AF, as fibrotic remodeling in the LA myocardium contributes to arrhythmia and serves as a key determinant of therapeutic strategies. However, manual LA segmentation is labor-intensive and challenging. Recent foundational deep learning models, such as the Segment Anything Model (SAM), pre-trained on diverse datasets, have demonstrated promise in generic segmentation tasks. MedSAM, a fine-tuned version of SAM for medical applications, enables efficient, zero-shot segmentation without domain-specific training. Despite the potential of the MedSAM model, it has not yet been evaluated for the complex task of LA segmentation in 3D LGE-MRI. This study aims to (1) evaluate the performance of MedSAM in automating LA segmentation, (2) compare the performance of the MedSAM2 model, which uses a single prompt with automated tracking, with the MedSAM1 model, which requires a separate prompt for each slice, and (3) analyze the performance of MedSAM1 in terms of Dice score (i.e., segmentation accuracy) by varying the size and location of the box prompt.

摘要:心房顫動 (AF) 是最常見的心律不整,與心臟衰竭和中風有關。3D 晚期钆增強 (LGE) MRI 中左心房 (LA) 的精確分割有助於評估 AF,因為 LA 心肌中的纖維化重塑會導致心律不整,並作為治療策略的關鍵決定因素。然而,手動 LA 分割既費力又具有挑戰性。最近基礎深度學習模型(例如在不同資料集上預先訓練的 Segment Anything Model (SAM))已在通用分割任務中展現出前景。MedSAM 是 SAM 的微調版本,適用於醫療應用,它能進行有效、零次學習的分割,而無需特定領域的訓練。儘管 MedSAM 模型具有潛力,但尚未評估其在 3D LGE-MRI 中 LA 分割的複雜任務。本研究旨在 (1) 評估 MedSAM 在自動化 LA 分割中的效能,(2) 比較使用單一提示和自動追蹤的 MedSAM2 模型與需要為每個切片提供單獨提示的 MedSAM1 模型的效能,以及 (3) 分析 MedSAM1 在骰子分數(即分割準確度)方面的效能,方法是改變方框提示的大小和位置。
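
The Dice score used above as the measure of segmentation accuracy is defined as 2|A∩B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. A minimal sketch follows, assuming binary masks stored as nested lists of 0/1 for a single 2D slice; the example masks are hypothetical and are not taken from the study's data.

```python
def dice_score(pred, truth):
    """Dice = 2|A∩B| / (|A| + |B|) for two same-shaped binary masks.

    Masks are nested lists of 0/1. Two empty masks are treated as a perfect match.
    """
    flat_pred = [v for row in pred for v in row]
    flat_truth = [v for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same shape")
    intersection = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    total = sum(1 for p in flat_pred if p) + sum(1 for t in flat_truth if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 4x4 slice: the prediction over-segments by one extra column.
truth = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]

print(f"Dice = {dice_score(pred, truth):.2f}")
```

For a 3D volume, the same formula can be applied to all voxels at once or per slice and then averaged; which of the two the study uses is not stated in the abstract.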

GazeSearch: Radiology Findings Search Benchmark

2411.05780v1 by Trong Thang Pham, Tien-Phat Nguyen, Yuki Ikebe, Akash Awasthi, Zhigang Deng, Carol C. Wu, Hien Nguyen, Ngan Le

Medical eye-tracking data is an important information source for understanding how radiologists visually interpret medical images. This information not only improves the accuracy of deep learning models for X-ray analysis but also their interpretability, enhancing transparency in decision-making. However, the current eye-tracking data is dispersed, unprocessed, and ambiguous, making it difficult to derive meaningful insights. Therefore, there is a need to create a new dataset with more focused and purposeful eye-tracking data, improving its utility for diagnostic applications. In this work, we propose a refinement method inspired by the target-present visual search challenge: there is a specific finding and fixations are guided to locate it. After refining the existing eye-tracking datasets, we transform them into a curated visual search dataset, called GazeSearch, specifically for radiology findings, where each fixation sequence is purposefully aligned to the task of locating a particular finding. Subsequently, we introduce a scan path prediction baseline, called ChestSearch, specifically tailored to GazeSearch. Finally, we employ the newly introduced GazeSearch as a benchmark to evaluate the performance of current state-of-the-art methods, offering a comprehensive assessment for visual search in the medical imaging domain.

摘要:醫療眼動追蹤資料是了解放射科醫師如何視覺化詮釋醫療影像的重要資訊來源。這些資訊不僅提升了深度學習模型在 X 光分析中的準確度,也提升了其可解釋性,增進決策制定中的透明度。然而,目前的醫療眼動追蹤資料分散、未經處理且不明確,這使得難以推導出有意義的見解。因此,有必要建立一個新的資料集,其中包含更多焦點和有目的的眼動追蹤資料,以提升其在診斷應用中的效用。在這項工作中,我們提出了一種改良方法,其靈感來自目標呈現視覺搜尋挑戰:有一個特定的發現,而固定則用於定位它。在改良現有的眼動追蹤資料集後,我們將其轉換為一個名為 GazeSearch 的精選視覺搜尋資料集,專門用於放射科發現,其中每個固定序列都刻意與定位特定發現的任務對齊。隨後,我們介紹了一個掃描路徑預測基準,稱為 ChestSearch,專門針對 GazeSearch 量身打造。最後,我們採用新推出的 GazeSearch 作為基準,評估目前最先進方法的效能,提供醫療影像領域中視覺搜尋的全面評估。

Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators

2411.05897v1 by Nicholas Wan, Qiao Jin, Joey Chan, Guangzhi Xiong, Serina Applebaum, Aidan Gilson, Reid McMurry, R. Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu

Although large language models (LLMs) have been assessed for general medical knowledge using medical licensing exams, their ability to effectively support clinical decision-making tasks, such as selecting and using medical calculators, remains uncertain. Here, we evaluate the capability of both medical trainees and LLMs to recommend medical calculators in response to various multiple-choice clinical scenarios such as risk stratification, prognosis, and disease diagnosis. We assessed eight LLMs, including open-source, proprietary, and domain-specific models, with 1,009 question-answer pairs across 35 clinical calculators and measured human performance on a subset of 100 questions. While the highest-performing LLM, GPT-4o, provided an answer accuracy of 74.3% (CI: 71.5-76.9%), human annotators, on average, outperformed LLMs with an accuracy of 79.5% (CI: 73.5-85.0%). With error analysis showing that the highest-performing LLMs continue to make mistakes in comprehension (56.6%) and calculator knowledge (8.1%), our findings emphasize that humans continue to surpass LLMs on complex clinical tasks such as calculator recommendation.

摘要:儘管大型語言模型 (LLM) 已使用醫學執照考試評估其一般醫學知識,但它們有效支援臨床決策任務(例如選擇和使用醫學計算器)的能力仍不確定。在此,我們評估醫學受訓者和 LLM 推薦醫學計算器的能力,以回應各種多選題臨床情境,例如風險分層、預後和疾病診斷。我們評估了八個 LLM,包括開源、專有和特定領域的模型,其中包含 35 個臨床計算器的 1,009 個問答對,並測量了人類在 100 個問題子集上的表現。表現最佳的 LLM GPT-4o 提供了 74.3% 的回答準確度 (CI:71.5-76.9%),而人類註解者平均表現優於 LLM,準確度為 79.5% (CI:73.5-85.0%)。錯誤分析顯示,表現最佳的 LLM 在理解 (56.6%) 和計算器知識 (8.1%) 方面仍會犯錯,我們的研究結果強調,人類在計算器推薦等複雜臨床任務上仍然優於 LLM。
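
The accuracies above are reported with confidence intervals (CI). The abstract does not say how those intervals were computed; the Wilson score interval is one common choice for a proportion, sketched here under that assumption. The counts in the example are hypothetical and are not the study's data.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 80 correct answers out of 100 questions.
low, high = wilson_interval(80, 100)
print(f"accuracy = 0.80, 95% CI = ({low:.3f}, {high:.3f})")
```

A smaller question set gives a wider interval, which is consistent with the human accuracy (measured on a 100-question subset) carrying a wider CI than the LLM accuracy (measured on 1,009 questions).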

Identifying and Decomposing Compound Ingredients in Meal Plans Using Large Language Models

2411.05892v1 by Leon Kopitar, Leon Bedrac, Larissa J Strath, Jiang Bian, Gregor Stiglic

This study explores the effectiveness of Large Language Models in meal planning, focusing on their ability to identify and decompose compound ingredients. We evaluated three models (GPT-4o, Llama-3 (70b), and Mixtral (8x7b)) to assess their proficiency in recognizing and breaking down complex ingredient combinations. Preliminary results indicate that while Llama-3 (70b) and GPT-4o excel in accurate decomposition, all models encounter difficulties with identifying essential elements like seasonings and oils. Despite strong overall performance, variations in accuracy and completeness were observed across models. These findings underscore LLMs' potential to enhance personalized nutrition but highlight the need for further refinement in ingredient decomposition. Future research should address these limitations to improve nutritional recommendations and health outcomes.

摘要:這項研究探討大型語言模型在餐點規劃中的效能,著重於其辨識並分解複合食材的能力。我們評估了三個模型:GPT-4o、Llama-3 (70b) 和 Mixtral (8x7b),以評量其辨識並分解複雜食材組合的能力。初步結果顯示,雖然 Llama-3 (70b) 和 GPT-4o 在準確分解方面表現出色,但所有模型在辨識調味料和油脂等必要元素時都遇到困難。儘管整體表現強勁,但各個模型在準確性和完整性方面仍有差異。這些發現強調了 LLM 增強個人化營養的潛力,但同時也突顯了進一步優化食材分解技術的必要性。未來的研究應針對這些限制進行探討,以改善營養建議和健康成果。

SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark

2411.05521v1 by Sithursan Sivasubramaniam, Cedric Osei-Akoto, Yi Zhang, Kurt Stockinger, Jonathan Fuerst

Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models have a big impact on query complexity and performance. While this has been a known fact in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context-learning (ICL) approaches for a set of representative closed and open-source LLMs. Our evaluation sheds light on the trade-offs between database models and query languages for different ICL strategies and LLMs. Last, SM3-Text-to-Query is easily extendable to additional query languages or real, standard-based patient databases.

摘要:電子健康紀錄 (EHR) 儲存在各種資料庫系統中,這些系統在異質儲存架構上具有不同的資料庫模型,例如關聯式資料庫、文件儲存或圖形資料庫。這些不同的資料庫模型對查詢複雜度和效能有很大的影響。雖然這在資料庫研究中已經是眾所周知的事實,但令人驚訝的是,它對日益增加的文字轉查詢系統的影響迄今尚未得到調查。在本文中,我們提出 SM3-Text-to-Query,這是第一個基於來自 Synthea 的合成患者資料的多模型醫療文字轉查詢基準,遵循 SNOMED-CT 分類法——一種廣泛使用的涵蓋醫學術語的知識圖譜本體。SM3-Text-to-Query 提供了關聯式資料庫 (PostgreSQL)、文件儲存 (MongoDB) 和圖形資料庫 (Neo4j 和 GraphDB (RDF)) 的資料表示,允許跨四種流行查詢語言(即 SQL、MQL、Cypher 和 SPARQL)進行評估。我們系統且手動開發了 408 個範本問題,我們擴充這些問題以構建一個基準,其中包含 10K 個針對這四種查詢語言的多樣化自然語言問題/查詢對(總共 40K 對)。在我們的資料集上,我們評估了幾種常見的代表性閉源和開源 LLM 的情境學習 (ICL) 方法。我們的評估揭示了不同 ICL 策略和 LLM 的資料庫模型和查詢語言之間的取捨。最後,SM3-Text-to-Query 可以輕鬆擴展到其他查詢語言或真實的基於標準的患者資料庫。

Towards Scalable Foundation Models for Digital Dermatology

2411.05514v1 by Fabian Gröger, Philippe Gottfrois, Ludovic Amruthalingam, Alvaro Gonzalez-Jimenez, Simone Lionetti, Luis R. Soenksen-Martinez, Alexander A. Navarini, Marc Pouly

The growing demand for accurate and equitable AI models in digital dermatology faces a significant challenge: the lack of diverse, high-quality labeled data. In this work, we investigate the potential of domain-specific foundation models for dermatology in addressing this challenge. We utilize self-supervised learning (SSL) techniques to pre-train models on a dataset of over 240,000 dermatological images from public and private collections. Our study considers several SSL methods and compares the resulting foundation models against domain-agnostic models like those pre-trained on ImageNet and state-of-the-art models such as MONET across 12 downstream tasks. Unlike previous research, we emphasize the development of smaller models that are more suitable for resource-limited clinical settings, facilitating easier adaptation to a broad range of use cases. Results show that models pre-trained in this work not only outperform general-purpose models but also approach the performance of models 50 times larger on clinically relevant diagnostic tasks. To promote further research in this direction, we publicly release both the training code and the foundation models, which can benefit clinicians in dermatological applications.

摘要:數位皮膚科對精準且公平的 AI 模型需求日益增加,但面臨一項重大挑戰:缺乏多元且高品質的標記資料。在這項研究中,我們探討特定領域的基礎模型在皮膚科中解決此挑戰的可能性。我們利用自監督學習 (SSL) 技術在包含超過 24 萬張來自公有和私有資料庫的皮膚科影像的資料集上預先訓練模型。我們的研究考量了多種 SSL 方法,並將產生的基礎模型與不受領域限制的模型(例如在 ImageNet 上預先訓練的模型)以及最先進的模型(例如 MONET)在 12 個下游任務中進行比較。與先前的研究不同,我們強調開發更適合資源有限的臨床環境的小型模型,以利於更輕鬆地適應廣泛的用例。結果顯示,在這項研究中預先訓練的模型不僅優於通用模型,而且在臨床上相關的診斷任務中,其效能也接近大 50 倍的模型。為了促進此方向的進一步研究,我們公開發布訓練程式碼和基礎模型,這些模型可讓皮膚科應用中的臨床醫生受益。

Towards Equitable ASD Diagnostics: A Comparative Study of Machine and Deep Learning Models Using Behavioral and Facial Data

2411.05880v1 by Mohammed Aledhari, Mohamed Rahouti, Ali Alfatemi

Autism Spectrum Disorder (ASD) is often underdiagnosed in females due to gender-specific symptom differences overlooked by conventional diagnostics. This study evaluates machine learning models, particularly Random Forest and convolutional neural networks, for enhancing ASD diagnosis through structured data and facial image analysis. Random Forest achieved 100% validation accuracy across datasets, highlighting its ability to manage complex relationships and reduce false negatives, which is crucial for early intervention and addressing gender biases. In image-based analysis, MobileNet outperformed the baseline CNN, achieving 87% accuracy, though a 30% validation loss suggests possible overfitting, requiring further optimization for robustness in clinical settings. Future work will emphasize hyperparameter tuning, regularization, and transfer learning. Integrating behavioral data with facial analysis could improve diagnosis for underdiagnosed groups. These findings suggest Random Forest's high accuracy and balanced precision-recall metrics could enhance clinical workflows. MobileNet's lightweight structure also shows promise for resource-limited environments, enabling accessible ASD screening. Addressing model explainability and clinician trust will be vital.

Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations

2411.05194v1 by Joey Hong, Jessica Lin, Anca Dragan, Sergey Levine

Recent progress on large language models (LLMs) has enabled dialogue agents to generate highly naturalistic and plausible text. However, current LLM language generation focuses on responding accurately to questions and requests with a single effective response. In reality, many real dialogues are interactive, meaning an agent's utterances will influence their conversational partner, elicit information, or change their opinion. Accounting for how an agent can effectively steer a conversation is a crucial ability in many dialogue tasks, from healthcare to preference elicitation. Existing methods for fine-tuning dialogue agents to accomplish such tasks would rely on curating some amount of expert data. However, doing so often requires understanding the underlying cognitive processes of the conversational partner, which is a skill neither humans nor LLMs trained on human data can reliably do. Our key insight is that while LLMs may not be adept at identifying effective strategies for steering conversations a priori, or in the middle of an ongoing conversation, they can do so post-hoc, or in hindsight, after seeing how their conversational partner responds. We use this fact to rewrite and augment existing suboptimal data, and train via offline reinforcement learning (RL) an agent that outperforms both prompting and learning from unaltered human demonstrations. We apply our approach to two domains that require understanding human mental state, intelligent interaction, and persuasion: mental health support, and soliciting charitable donations. Our results in a user study with real humans show that our approach greatly outperforms existing state-of-the-art dialogue agents.

Inverse Transition Learning: Learning Dynamics from Demonstrations

2411.05174v1 by Leo Benac, Abhishek Sharma, Sonali Parbhoo, Finale Doshi-Velez

We consider the problem of estimating the transition dynamics $T^*$ from near-optimal expert trajectories in the context of offline model-based reinforcement learning. We develop a novel constraint-based method, Inverse Transition Learning, that treats the limited coverage of the expert trajectories as a "feature": we use the fact that the expert is near-optimal to inform our estimate of $T^*$. We integrate our constraints into a Bayesian approach. Across both synthetic environments and real healthcare scenarios like Intensive Care Unit (ICU) patient management in hypotension, we demonstrate not only significant improvements in decision-making, but also that our posterior can inform when transfer will be successful.

PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation

2411.05085v1 by Daniel C. Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L. Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores Sánchez-Valverde, Lara Jaques-Pérez, Lourdes Pérez-Rodríguez, Kenji Takeda, José María Salinas, Javier Alvarez-Valle, Joaquín Galant Herrero, Antonio Pertusa

Radiology report generation (RRG) aims to create free-text radiology reports from clinical imaging. Grounded radiology report generation (GRRG) extends RRG by including the localisation of individual findings on the image. Currently, there are no manually annotated chest X-ray (CXR) datasets to train GRRG models. In this work, we present a dataset called PadChest-GR (Grounded-Reporting) derived from PadChest aimed at training GRRG models for CXR images. We curate a public bi-lingual dataset of 4,555 CXR studies with grounded reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual present (positive) and absent (negative) findings in English and Spanish. In total, PadChest-GR contains 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence is associated with up to two independent sets of bounding boxes labelled by different readers and has categorical labels for finding type, locations, and progression. To the best of our knowledge, PadChest-GR is the first manually curated dataset designed to train GRRG models for understanding and interpreting radiological images and generated text. By including detailed localization and comprehensive annotations of all clinically relevant findings, it provides a valuable resource for developing and evaluating GRRG models from CXR images. PadChest-GR can be downloaded under request from https://bimcv.cipf.es/bimcv-projects/padchest-gr/

Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability

2411.04962v1 by Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A Miller, Danielle Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew Churpek, Majid Afshar

Large language models (LLMs) are being explored for diagnostic decision support, yet their ability to estimate pre-test probabilities, vital for clinical decision-making, remains limited. This study evaluates two LLMs, Mistral-7B and Llama3-70B, using structured electronic health record data on three diagnosis tasks. We examined three current methods of extracting LLM probability estimations and revealed their limitations. We aim to highlight the need for improved techniques in LLM confidence estimation.
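
The abstract does not name the three extraction methods it examines. As a rough illustration of why a next-word probability is not automatically a pre-test probability, the sketch below assumes one common approach: restrict the model's output to two hypothetical answer tokens ("yes" and "no"), apply a softmax to their logits, and score the resulting probabilities against observed outcomes with the Brier score. The function names and the logit values are illustrative assumptions, not taken from the paper.

```python
import math


def yes_probability(logit_yes: float, logit_no: float) -> float:
    """Softmax restricted to two answer tokens (hypothetical 'yes'/'no' logits)."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


if __name__ == "__main__":
    # Made-up logits for three patients and their observed diagnoses (1 = present).
    logits = [(2.0, 0.0), (0.0, 0.0), (-1.0, 1.0)]
    outcomes = [1, 0, 0]
    probs = [yes_probability(a, b) for a, b in logits]
    print("token-based probabilities:", probs)
    print("Brier score:", brier_score(probs, outcomes))
```

A low Brier score on held-out patients would suggest the token probabilities are usable as risk estimates; a high one is the kind of limitation the abstract alludes to.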

FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?

2411.05059v2 by Eric Wu, Kevin Wu, James Zou

There is great interest in fine-tuning frontier large language models (LLMs) to inject new information and update existing knowledge. While commercial LLM fine-tuning APIs from providers such as OpenAI and Google promise flexible adaptation for various applications, the efficacy of fine-tuning remains unclear. In this study, we introduce FineTuneBench, an evaluation framework and dataset for understanding how well commercial fine-tuning APIs can successfully learn new and updated knowledge. We analyze five frontier LLMs with commercially available fine-tuning APIs, including GPT-4o and Gemini 1.5 Pro, on their effectiveness in two settings: (1) ingesting novel information, such as recent news events and new people profiles, and (2) updating existing knowledge, such as updated medical guidelines and code frameworks. Our results reveal substantial shortcomings in all the models' abilities to effectively learn new information through fine-tuning, with an average generalization accuracy of 37% across all models. When updating existing knowledge, such as incorporating medical guideline updates, commercial fine-tuning APIs show even more limited capability (average generalization accuracy of 19%). Overall, fine-tuning GPT-4o mini is the most effective for infusing new knowledge and updating knowledge, followed by GPT-3.5 Turbo and GPT-4o. The fine-tuning APIs for Gemini 1.5 Flash and Gemini 1.5 Pro are unable to learn new knowledge or update existing knowledge. These findings underscore a major shortcoming in using current commercial fine-tuning services to achieve reliable knowledge infusion in common scenarios. We open source the FineTuneBench dataset at https://github.com/kevinwu23/StanfordFineTuneBench.

Integrating Large Language Models for Genetic Variant Classification

2411.05055v1 by Youssef Boulaimen, Gabriele Fossi, Leila Outemzabet, Nathalie Jeanray, Oleksandr Levenets, Stephane Gerart, Sebastien Vachenc, Salvatore Raieli, Joanna Giemza

The classification of genetic variants, particularly Variants of Uncertain Significance (VUS), poses a significant challenge in clinical genetics and precision medicine. Large Language Models (LLMs) have emerged as transformative tools in this realm. These models can uncover intricate patterns and predictive insights that traditional methods might miss, thus enhancing the predictive accuracy of genetic variant pathogenicity. This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense, which leverage DNA and protein sequence data alongside structural insights to form a comprehensive analytical framework for variant classification. Our approach evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets, setting new benchmarks in classification performance. The models were rigorously tested on a set of challenging variants, demonstrating substantial improvements over existing state-of-the-art tools, especially in handling ambiguous and clinically uncertain variants. The results of this research underline the efficacy of combining multiple modeling approaches to significantly refine the accuracy and reliability of genetic variant classification systems. These findings support the deployment of these advanced computational models in clinical environments, where they can significantly enhance the diagnostic processes for genetic disorders, ultimately pushing the boundaries of personalized medicine by offering more detailed and actionable genetic insights.

AWARE Narrator and the Utilization of Large Language Models to Extract Behavioral Insights from Smartphone Sensing Data

2411.04691v1 by Tianyi Zhang, Miu Kojima, Simon D'Alfonso

Smartphones, equipped with an array of sensors, have become valuable tools for personal sensing. Particularly in digital health, smartphones facilitate the tracking of health-related behaviors and contexts, contributing significantly to digital phenotyping, a process where data from digital interactions is analyzed to infer behaviors and assess mental health. Traditional methods process raw sensor data into information features for statistical and machine learning analyses. In this paper, we introduce a novel approach that systematically converts smartphone-collected data into structured, chronological narratives. The AWARE Narrator translates quantitative smartphone sensing data into English language descriptions, forming comprehensive narratives of an individual's activities. We apply the framework to the data collected from university students over a week, demonstrating the potential of utilizing the narratives to summarize individual behavior, and analyzing psychological states by leveraging large language models.

FedDP: Privacy-preserving method based on federated learning for histopathology image segmentation

2411.04509v1 by Liangrui Pan, Mao Huang, Lian Wang, Pinle Qin, Shaoliang Peng

Hematoxylin and Eosin (H&E) staining of whole slide images (WSIs) is considered the gold standard for pathologists and medical practitioners for tumor diagnosis, surgical planning, and post-operative assessment. With the rapid advancement of deep learning technologies, numerous models based on convolutional neural networks and transformers have been applied to the precise segmentation of WSIs. However, due to privacy regulations and the need to protect patient confidentiality, centralized storage and processing of image data are impractical, so training a centralized model directly is difficult to implement in medical settings. This paper addresses the dispersed nature and privacy sensitivity of medical image data by employing a federated learning framework, allowing medical institutions to learn collaboratively while protecting patient privacy. Additionally, to address the issue of original data reconstruction through gradient inversion during federated training, differential privacy introduces noise into the model updates, preventing attackers from inferring the contributions of individual samples and thereby protecting the privacy of the training data. Experimental results show that the proposed method, FedDP, minimally impacts model accuracy while effectively safeguarding the privacy of cancer pathology image data, with only slight decreases in the Dice, Jaccard, and Acc indices of 0.55%, 0.63%, and 0.42%, respectively. This approach facilitates cross-institutional collaboration and knowledge sharing while protecting sensitive data privacy, providing a viable solution for further research and application in the medical field.
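
The idea of adding noise to model updates before aggregation can be sketched as follows. This is a minimal illustration, not FedDP itself: the abstract does not say how updates are bounded or which noise distribution is used, so the L2 clipping step, the Gaussian noise, and the names `clip_update`, `dp_federated_average`, `max_norm`, and `noise_multiplier` are all assumptions borrowed from the standard differentially private averaging recipe.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_federated_average(client_updates, max_norm, noise_multiplier, rng):
    """Average clipped client updates after adding Gaussian noise to their sum.

    Assumed recipe: noise standard deviation = noise_multiplier * max_norm.
    """
    clipped = [clip_update(u, max_norm) for u in client_updates]
    n_clients = len(clipped)
    dim = len(clipped[0])
    sigma = noise_multiplier * max_norm
    return [
        (sum(u[i] for u in clipped) + rng.gauss(0.0, sigma)) / n_clients
        for i in range(dim)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Three hypothetical institutions, each sending a 2-parameter update.
    updates = [[0.3, -0.1], [0.2, 0.4], [5.0, 5.0]]
    print(dp_federated_average(updates, max_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Clipping bounds how much any single institution can move the average, which is what lets the added noise mask individual contributions; a larger `noise_multiplier` gives stronger privacy at the cost of accuracy.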

Conditional Diffusion Model for Longitudinal Medical Image Generation

2411.05860v1 by Duy-Phuong Dao, Hyung-Jeong Yang, Jahae Kim

Alzheimer's disease progresses slowly and involves complex interactions between various biological factors. Longitudinal medical imaging data can capture this progression over time. However, longitudinal data frequently suffer from missing data due to patient dropouts, irregular follow-up intervals, and varying lengths of observation periods. To address these issues, we designed a diffusion-based model for 3D longitudinal medical image generation from a single magnetic resonance imaging (MRI) scan. This involves injecting a conditioning MRI and a time-visit encoding into the model, enabling control over the change between source and target images. The experimental results indicate that the proposed method generates higher-quality images than competing methods.

Evaluating the Economic Implications of Using Machine Learning in Clinical Psychiatry

2411.05856v1 by Soaad Hossain, James Rasalingam, Arhum Waheed, Fatah Awil, Rachel Kandiah, Syed Ishtiaque Ahmed

With the growing interest in using AI and machine learning (ML) in medicine, there is a growing body of literature covering the application and ethics of AI and ML in areas of medicine such as clinical psychiatry. However, little of this literature covers the economic aspects of using ML in clinical psychiatry. This study addresses that gap by evaluating the economic implications of using ML in clinical psychiatry through three problem-oriented case studies, literature on economics, socioeconomics and medical AI, and two types of health economic evaluations. In addition, we provide details on fairness, legal, ethical and other considerations for ML in clinical psychiatry.

Robust Real-Time Mortality Prediction in the Intensive Care Unit using Temporal Difference Learning

2411.04285v1 by Thomas Frost, Kezhi Li, Steve Harris

The task of predicting long-term patient outcomes using supervised machine learning is a challenging one, in part because of the high variance of each patient's trajectory, which can result in the model over-fitting to the training data. Temporal difference (TD) learning, a common reinforcement learning technique, may reduce variance by generalising learning to the pattern of state transitions rather than terminal outcomes. However, in healthcare this method requires several strong assumptions about patient states, and there appears to be limited literature evaluating the performance of TD learning against traditional supervised learning methods for long-term health outcome prediction tasks. In this study, we define a framework for applying TD learning to real-time irregularly sampled time series data using a Semi-Markov Reward Process. We evaluate the model framework in predicting intensive care mortality and show that TD learning under this framework can result in improved model robustness compared to standard supervised learning methods, and that this robustness is maintained even when validated on external datasets. This approach may offer a more reliable method when learning to predict patient outcomes using high-variance irregular time series data.
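
To make the contrast with supervised learning concrete, the sketch below shows a tabular TD(0) update in which the target for a state is the bootstrapped value of the next observed state rather than the terminal outcome. It is a simplified illustration under stated assumptions: the paper's framework is presumably built on function approximation, and the `gamma ** dt` discounting over the irregular time gap `dt` is one common way to handle semi-Markov transitions, not necessarily the authors' formulation. State names and rewards are hypothetical.

```python
def td_update(values, state, next_state, reward, dt,
              alpha=0.1, gamma=0.99, terminal=False):
    """One TD(0) step on a semi-Markov transition lasting dt time units.

    values: dict mapping state -> current value estimate (missing states count as 0).
    The bootstrap term is discounted by gamma ** dt to reflect the elapsed time.
    """
    bootstrap = 0.0 if terminal else values.get(next_state, 0.0)
    target = reward + (gamma ** dt) * bootstrap
    old = values.get(state, 0.0)
    values[state] = old + alpha * (target - old)
    return values[state]


if __name__ == "__main__":
    values = {}
    # Hypothetical ICU trajectory: (state, next_state, reward, hours elapsed, terminal?)
    trajectory = [
        ("stable", "deteriorating", 0.0, 6.0, False),
        ("deteriorating", "critical", 0.0, 1.5, False),
        ("critical", None, 1.0, 0.5, True),  # reward 1.0 marks in-ICU death
    ]
    for _ in range(200):  # replay the same trajectory to propagate values backwards
        for s, s_next, r, dt, done in trajectory:
            td_update(values, s, s_next, r, dt, terminal=done)
    print(values)
```

Because each state learns from its successor, patients who pass through the same intermediate states share information, which is the variance-reduction argument made in the abstract.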

Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

2411.04118v1 by Daniel P. Jeong, Saurabh Garg, Zachary C. Lipton, Michael Oberst

Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks. For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.

RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

2411.04097v1 by Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz

Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.

Aligning Characteristic Descriptors with Images for Human-Expert-like Explainability

2411.04008v1 by Bharat Chandra Yalavarthi, Nalini Ratha

In mission-critical domains such as law enforcement and medical diagnosis, the ability to explain and interpret the outputs of deep learning models is crucial for ensuring user trust and supporting informed decision-making. Despite advancements in explainability, existing methods often fall short in providing explanations that mirror the depth and clarity of those given by human experts. Such expert-level explanations are essential for the dependable application of deep learning models in law enforcement and medical contexts. Additionally, we recognize that most explanations in real-world scenarios are communicated primarily through natural language. Addressing these needs, we propose a novel approach that utilizes characteristic descriptors to explain model decisions by identifying their presence in images, thereby generating expert-like explanations. Our method incorporates a concept bottleneck layer within the model architecture, which calculates the similarity between image and descriptor encodings to deliver inherent and faithful explanations. Through experiments in face recognition and chest X-ray diagnosis, we demonstrate that our approach offers a significant contrast over existing techniques, which are often limited to the use of saliency maps. We believe our approach represents a significant step toward making deep learning systems more accountable, transparent, and trustworthy in the critical domains of face recognition and medical diagnosis.
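
The concept bottleneck described above can be pictured as a layer whose activations are similarities between an image encoding and a set of descriptor encodings. The sketch below assumes cosine similarity and uses tiny hand-written vectors in place of real encoder outputs; the descriptor names and the helper functions `cosine`, `concept_scores`, and `explain` are hypothetical and only illustrate the mechanism.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def concept_scores(image_vec, descriptor_vecs):
    """Bottleneck activations: one similarity score per characteristic descriptor."""
    return {name: cosine(image_vec, vec) for name, vec in descriptor_vecs.items()}


def explain(scores, top_k=2):
    """Return the top_k descriptors most present in the image, highest score first."""
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    # Toy 2-D encodings standing in for image and text encoder outputs.
    image = [1.0, 0.0]
    descriptors = {
        "enlarged cardiac silhouette": [1.0, 0.0],
        "clear lung fields": [0.0, 1.0],
        "blunted costophrenic angle": [1.0, 1.0],
    }
    scores = concept_scores(image, descriptors)
    print(explain(scores))
```

Because the downstream decision is computed only from these scores, the highest-scoring descriptors double as a natural-language explanation of the prediction.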

Fine-tuning -- a Transfer Learning approach

2411.03941v1 by Joseph Arul Raj, Linglong Qian, Zina Ibrahim

Secondary research use of Electronic Health Records (EHRs) is often hampered by the abundance of missing data in this valuable resource. Missingness in EHRs occurs naturally as a result of the data recording practices during routine clinical care, but handling it is crucial to the precision of medical analysis and the decision-making that follows. The literature contains a variety of imputation methodologies based on deep neural networks. These aim to overcome the dynamic, heterogeneous and multivariate missingness patterns of EHRs, which cannot be handled by classical and statistical imputation methods. However, all existing deep imputation methods rely on end-to-end pipelines that incorporate both imputation and downstream analyses, e.g. classification. This coupling makes it difficult to assess the quality of imputation and takes away the flexibility of re-using the imputer for a different task. Furthermore, most end-to-end deep architectures tend to use complex networks to perform the downstream task, in addition to the already sophisticated deep imputation network. We therefore ask whether the high performance reported in the literature is due to the imputer or the classifier, and further ask whether, if an optimised state-of-the-art imputer is used, a simpler classifier can achieve comparable performance. This paper explores the development of a modular, deep learning-based imputation and classification pipeline, specifically built to leverage the capabilities of state-of-the-art imputation models for downstream classification tasks. Such a modular approach enables a) objective assessment of the quality of the imputer and classifier independently, and b) exploration of the performance of simpler classification architectures using an optimised imputer.

MEG: Medical Knowledge-Augmented Large Language Models for Question Answering

2411.03883v2 by Laura Cabello, Carmen Martin-Turrero, Uchenna Akujuobi, Anders Søgaard, Carlos Bobed

Question answering is a natural language understanding task that involves reasoning over both explicit context and unstated, relevant domain knowledge. Large language models (LLMs), which underpin most contemporary question answering systems, struggle to induce how concepts relate in specialized domains such as medicine. Existing medical LLMs are also costly to train. In this work, we present MEG, a parameter-efficient approach for medical knowledge-augmented LLMs. MEG uses a lightweight mapping network to integrate graph embeddings into the LLM, enabling it to leverage external knowledge in a cost-effective way. We evaluate our method on four popular medical multiple-choice datasets and show that LLMs greatly benefit from the factual grounding provided by knowledge graph embeddings. MEG attains an average of +10.2% accuracy over the Mistral-Instruct baseline, and +6.7% over specialized models like BioMistral. We also show results based on Llama-3. Finally, we show that MEG's performance remains robust to the choice of graph encoder.
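
The role of the lightweight mapping network can be sketched as a small projection from the knowledge-graph embedding space into the LLM's token-embedding space. The example below is an assumption-laden illustration: it uses a single linear layer, and it injects the mapped vector by prepending it to the token embeddings as a soft token. The abstract does not specify either choice, and the names `map_graph_embedding` and `prepend_soft_token` are hypothetical.

```python
def map_graph_embedding(graph_vec, weight, bias):
    """Linear map from graph-embedding space to the LLM embedding space.

    weight has one row per output dimension; bias has one entry per output dimension.
    """
    return [
        sum(w * x for w, x in zip(row, graph_vec)) + b
        for row, b in zip(weight, bias)
    ]


def prepend_soft_token(token_embeddings, mapped_vec):
    """Place the mapped graph embedding in front of the prompt's token embeddings."""
    return [mapped_vec] + token_embeddings


if __name__ == "__main__":
    # Toy sizes: 2-D graph embedding, 3-D LLM embedding space.
    graph_vec = [1.0, 2.0]
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    bias = [0.0, 0.0, 0.5]
    prompt = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # two hypothetical token embeddings
    soft = map_graph_embedding(graph_vec, weight, bias)
    print(prepend_soft_token(prompt, soft))
```

Only the mapping parameters would need training in such a setup, which is consistent with the parameter-efficient framing of the abstract.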

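Navigating the landscape of multimodal AI in medicine: a scoping review on technical challenges and clinical applications

2411.03782v1 by Daan Schouten, Giulia Nicoletti, Bas Dille, Catherine Chia, Pierpaolo Vendittelli, Megan Schuurmans, Geert Litjens, Nadieh Khalili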

Recent technological advances in healthcare have led to unprecedented growth in patient data quantity and diversity. While artificial intelligence (AI) models have shown promising results in analyzing individual data modalities, there is increasing recognition that models integrating multiple complementary data sources, so-called multimodal AI, could enhance clinical decision-making. This scoping review examines the landscape of deep learning-based multimodal AI applications across the medical domain, analyzing 432 papers published between 2018 and 2024. We provide an extensive overview of multimodal AI development across different medical disciplines, examining various architectural approaches, fusion strategies, and common application areas. Our analysis reveals that multimodal AI models consistently outperform their unimodal counterparts, with an average improvement of 6.2 percentage points in AUC. However, several challenges persist, including cross-departmental coordination, heterogeneous data characteristics, and incomplete datasets. We critically assess the technical and practical challenges in developing multimodal AI systems and discuss potential strategies for their clinical implementation, including a brief overview of commercially available multimodal AI models for clinical decision-making. Additionally, we identify key factors driving multimodal AI development and propose recommendations to accelerate the field's maturation. This review provides researchers and clinicians with a thorough understanding of the current state, challenges, and future directions of multimodal AI in medicine.


Sub-DM: Subspace Diffusion Model with Orthogonal Decomposition for MRI Reconstruction

2411.03758v1 by Yu Guan, Qinrong Cai, Wei Li, Qiuyun Fan, Dong Liang, Qiegen Liu

Diffusion model-based approaches have recently achieved remarkable success in MRI reconstruction, but integration into clinical routine remains challenging because of their time-consuming convergence. This is particularly notable when a conventional diffusion process is applied directly to k-space data without considering the inherent properties of k-space sampling, which limits k-space learning efficiency and image reconstruction quality. To tackle these challenges, we introduce a subspace diffusion model with orthogonal decomposition (referred to as Sub-DM), a method that restricts the diffusion process via projections onto a subspace as the k-space data distribution evolves toward noise. In particular, the subspace diffusion model circumvents the inference challenges posed by the complex and high-dimensional characteristics of k-space data: the highly compact subspace ensures that the diffusion process requires only a few simple iterations to produce accurate prior information. Furthermore, the orthogonal decomposition strategy based on the wavelet transform hinders information loss during the migration of the vanilla diffusion process to the subspace. Because the strategy is approximately reversible, the entire process can be reversed. As a result, the diffusion processes in different spaces can refine models through a mutual feedback mechanism, enabling the learning of accurate priors even when dealing with complex k-space data. Comprehensive experiments on different datasets clearly demonstrate the superiority of Sub-DM over state-of-the-art methods in terms of reconstruction speed and quality.

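
The reversible, wavelet-based orthogonal decomposition described in the Sub-DM abstract can be illustrated with a minimal sketch. It assumes a one-level orthonormal Haar transform on a short real-valued 1D signal; the abstract does not say which wavelet is used, and real k-space data is complex and multi-dimensional, so `haar_forward`, `haar_inverse`, and the sample values are illustrative assumptions rather than the authors' implementation.

```python
import math

# Assumption: a one-level orthonormal Haar transform stands in for the
# wavelet-based orthogonal decomposition; this is not the paper's code.
SCALE = 1.0 / math.sqrt(2.0)


def haar_forward(signal):
    """Split an even-length signal into approximation and detail halves."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the original signal from its two subspace components."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) * SCALE)
        signal.append((a - d) * SCALE)
    return signal


x = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0]  # hypothetical sample values
approx, detail = haar_forward(x)
restored = haar_inverse(approx, detail)
```

Because the transform is orthonormal, the energy of `x` equals the combined energy of `approx` and `detail`, and `restored` matches `x` up to floating-point error. The half-length `approx` component is the kind of compact subspace in which a cheaper process could run.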

Ultrasound-Based AI for COVID-19 Detection: A Comprehensive Review of Public and Private Lung Ultrasound Datasets and Studies

2411.05029v1 by Abrar Morshed, Abdulla Al Shihab, Md Abrar Jahin, Md Jaber Al Nahian, Md Murad Hossain Sarker, Md Sharjis Ibne Wadud, Mohammad Istiaq Uddin, Muntequa Imtiaz Siraji, Nafisa Anjum, Sumiya Rajjab Shristy, Tanvin Rahman, Mahmuda Khatun, Md Rubel Dewan, Mosaddeq Hossain, Razia Sultana, Ripel Chakma, Sonet Barua Emon, Towhidul Islam, Mohammad Arafat Hussain

The COVID-19 pandemic has affected millions of people globally, with respiratory organs being strongly affected in individuals with comorbidities. Medical imaging-based diagnosis and prognosis have become increasingly popular in clinical settings for detecting COVID-19 lung infections. Among various medical imaging modalities, ultrasound stands out as a low-cost, mobile, and radiation-safe imaging technology. In this comprehensive review, we focus on AI-driven studies utilizing lung ultrasound (LUS) for COVID-19 detection and analysis. We provide a detailed overview of both publicly available and private LUS datasets and categorize the AI studies according to the dataset they used. Additionally, we systematically analyzed and tabulated the studies across various dimensions, including data preprocessing methods, AI models, cross-validation techniques, and evaluation metrics. In total, we reviewed 60 articles, 41 of which utilized public datasets, while the remaining employed private data. Our findings suggest that ultrasound-based AI studies for COVID-19 detection have great potential for clinical use, especially for children and pregnant women. Our review also provides a useful summary for future researchers and clinicians who may be interested in the field.


Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

2411.03670v1 by Pedro R. A. S. Bassi, Wenxuan Li, Yucheng Tang, Fabian Isensee, Zifu Wang, Jieneng Chen, Yu-Cheng Chou, Yannick Kirchhoff, Maximilian Rokuss, Ziyan Huang, Jin Ye, Junjun He, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus H. Maier-Hein, Paul Jaeger, Yiwen Ye, Yutong Xie, Jianpeng Zhang, Ziyang Chen, Yong Xia, Zhaohu Xing, Lei Zhu, Yousef Sadegheih, Afshin Bozorgpour, Pratibha Kumari, Reza Azad, Dorit Merhof, Pengcheng Shi, Ting Ma, Yuxin Du, Fan Bai, Tiejun Huang, Bo Zhao, Haonan Wang, Xiaomeng Li, Hanxue Gu, Haoyu Dong, Jichen Yang, Maciej A. Mazurowski, Saumya Gupta, Linshan Wu, Jiaxin Zhuang, Hao Chen, Holger Roth, Daguang Xu, Matthew B. Blaschko, Sergio Decherchi, Andrea Cavalli, Alan L. Yuille, Zongwei Zhou

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.


Requirements Engineering for Older Adult Digital Health Software: A Systematic Literature Review

2411.03656v1 by Yuqing Xiao, John Grundy, Anuradha Madugalla

Growth of the older adult population has led to increasing interest in technology-supported aged care. However, the area faces challenges such as a lack of caregivers and a limited understanding of the emotional, social, physical, and mental well-being needs of seniors. Furthermore, there is a gap in understanding between developers and ageing people regarding their requirements. Digital health can be important in supporting older adults' well-being, emotional requirements, and social needs. Requirements Engineering (RE) is a major software engineering field that can help to identify, elicit, and prioritize the requirements of stakeholders and ensure that systems meet standards for performance, reliability, and usability. We carried out a systematic review of the literature on RE for older adult digital health software. This was necessary to represent the current state of understanding of the needs of older adults in aged care digital health. Using established guidelines outlined by the Kitchenham method, PRISMA, and the PICO guideline, we developed a protocol and then systematically explored eight databases. This resulted in 69 primary studies of high relevance, which were subsequently subjected to data extraction, synthesis, and reporting. We highlight key RE processes in digital health software for ageing people. We also explore the use of technology for older users' well-being and care, and the evaluations of such solutions. The review also identifies key limitations in existing primary studies that point to future research opportunities. The results indicate that requirement gathering and understanding vary significantly between studies. The differences lie in the quality, depth, and techniques adopted for requirement gathering, and are largely due to uneven adoption of RE methods.


Cross Feature Fusion of Fundus Image and Generated Lesion Map for Referable Diabetic Retinopathy Classification

2411.03618v1 by Dahyun Mok, Junghyun Bum, Le Duc Tai, Hyunseung Choo

Diabetic Retinopathy (DR) is a primary cause of blindness, necessitating early detection and diagnosis. This paper focuses on referable DR classification to enhance the applicability of the proposed method in clinical practice. We develop an advanced cross-learning DR classification method leveraging transfer learning and cross-attention mechanisms. The proposed method employs the Swin U-Net architecture to segment lesion maps from DR fundus images. The Swin U-Net segmentation model, enriched with DR lesion insights, is transferred to generate a lesion map. Both the fundus image and its segmented lesion map are used as complementary inputs for the classification model. A cross-attention mechanism is deployed to improve the model's ability to capture fine-grained details from the input pairs. Our experiments, utilizing two public datasets, FGADR and EyePACS, demonstrate a superior accuracy of 94.6%, surpassing current state-of-the-art methods by 4.4%. To this end, we aim for the proposed method to be seamlessly integrated into clinical workflows, enhancing accuracy and efficiency in identifying referable DR.

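
To make the cross-attention step in the diabetic retinopathy abstract concrete, here is a minimal sketch of scaled dot-product cross-attention, in which queries come from one input (fundus-image features) and keys and values come from the other (lesion-map features). The learned projection matrices of a real model are omitted, and the token values and the names `fundus_tokens` and `lesion_tokens` are hypothetical; the paper's actual architecture may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Attend from each query token to all key/value tokens of the other input."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Weighted average of the value vectors, one output per query token.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical 2-dimensional features; a real model would use learned embeddings.
fundus_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lesion_tokens = [[2.0, 0.0], [0.0, 2.0]]
attended, attn = cross_attention(fundus_tokens, lesion_tokens, lesion_tokens)
```

Each row of `attn` sums to 1 and shows how strongly one fundus token draws on each lesion token. The first fundus token, for example, weights the first lesion token more heavily because their features align.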

The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for Healthcare

2411.03287v1 by Souren Pashangpour, Goldie Nejat

The potential use of large language models (LLMs) in healthcare robotics can help address the significant demand placed on healthcare systems around the world by an aging demographic and a shortage of healthcare professionals. Even though LLMs have already been integrated into medicine to assist both clinicians and patients, the integration of LLMs within healthcare robots has not yet been explored for clinical settings. In this perspective paper, we investigate the groundbreaking developments in robotics and LLMs to uniquely identify the system requirements needed for designing health-specific, LLM-based robots in terms of multi-modal communication through human-robot interactions (HRIs), semantic reasoning, and task planning. Furthermore, we discuss the ethical issues, open challenges, and potential future research directions for this emerging innovative field.


Discovering Data Structures: Nearest Neighbor Search and Beyond

2411.03253v1 by Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant

We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.

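
For reference, the classical binary-search solution to 1D nearest neighbor search, which the abstract says the learned model rediscovers, can be sketched as follows. This is a hand-written baseline, not the learned data structure from the paper; the function name, the probe counter, and the sample points are illustrative.

```python
def nearest_neighbor_1d(sorted_values, query):
    """Return (nearest value, number of probes) for a sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("sorted_values must be non-empty")
    lo, hi = 0, len(sorted_values)
    probes = 0
    # Binary search for the first index whose value is >= query.
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_values[mid] < query:
            lo = mid + 1
        else:
            hi = mid
    # The nearest neighbor is one of the two values around the insertion point.
    candidates = []
    if lo > 0:
        candidates.append(sorted_values[lo - 1])
    if lo < len(sorted_values):
        candidates.append(sorted_values[lo])
    nearest = min(candidates, key=lambda v: abs(v - query))
    return nearest, probes


points = [1, 4, 9, 16, 25, 36, 49, 64]  # hypothetical sorted data
nearest, probes = nearest_neighbor_1d(points, 20)
```

The number of probes grows logarithmically with the size of the list. Interpolation search, also mentioned in the abstract, instead guesses the probe position from the query's value and can need fewer probes when the data are close to uniformly distributed.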

Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care

2411.03105v1 by Christel Sirocchi, Muhammad Suffian, Federico Sabbatini, Alessandro Bogliolo, Sara Montagna

In clinical practice, decision-making relies heavily on established protocols, often formalised as rules. Concurrently, Machine Learning (ML) models, trained on clinical data, aspire to integrate into medical decision-making processes. However, despite the growing number of ML applications, their adoption into clinical practice remains limited. Two critical concerns arise, relevant to the notions of consistency and continuity of care: (a) accuracy - the ML model, albeit more accurate, might introduce errors that would not have occurred by applying the protocol; (b) interpretability - ML models operating as black boxes might make predictions based on relationships that contradict established clinical knowledge. In this context, the literature suggests using ML models integrating domain knowledge for improved accuracy and interpretability. However, there is a lack of appropriate metrics for comparing ML models with clinical rules in addressing these challenges. Accordingly, in this article, we first propose metrics to assess the accuracy of ML models with respect to the established protocol. Secondly, we propose an approach to measure the distance of explanations provided by two rule sets, with the goal of comparing the explanation similarity between clinical rule-based systems and rules extracted from ML models. The approach is validated on the Pima Indians Diabetes dataset by training two neural networks - one exclusively on data, and the other integrating a clinical protocol. Our findings demonstrate that the integrated ML model achieves comparable performance to that of a fully data-driven model while exhibiting superior accuracy relative to the clinical protocol, ensuring enhanced continuity of care. Furthermore, we show that our integrated model provides explanations for predictions that align more closely with the clinical protocol compared to the data-driven model.


Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting

2411.03098v1 by Adrian B. Chłopowiec, Adam R. Chłopowiec, Krzysztof Galus, Wojciech Cebula, Martin Tabakov

Limited medical imaging datasets challenge deep learning models by increasing risks of overfitting and reduced generalization, particularly in Generative Adversarial Networks (GANs), where discriminators may overfit, leading to training divergence. This constraint also impairs classification models trained on small datasets. Generative Data Augmentation (GDA) addresses this by expanding training datasets with synthetic data, although it requires training a generative model. We propose and evaluate two local lesion generation approaches to address the challenge of augmenting small medical image datasets. The first approach employs the Poisson Image Editing algorithm, a classical image processing technique, to create realistic image composites that outperform current state-of-the-art methods. The second approach introduces a novel generative method, leveraging a fine-tuned Image Inpainting GAN to synthesize realistic lesions within specified regions of real training images. A comprehensive comparison of the two proposed methods demonstrates that effective local lesion generation in a data-constrained setting allows for reaching new state-of-the-art results in capsule endoscopy lesion classification. Combination of our techniques achieves a macro F1-score of 33.07%, surpassing the previous best result by 7.84 percentage points (p.p.) on the highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule endoscopy. To the best of our knowledge, this work is the first to apply a fine-tuned Image Inpainting GAN for GDA in medical imaging, demonstrating that an image-conditional GAN can be adapted effectively to limited datasets to generate high-quality examples, facilitating effective data augmentation. Additionally, we show that combining this GAN-based approach with classical image processing techniques further enhances the results.

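
The macro F1-score reported in the capsule endoscopy abstract averages per-class F1 scores with equal weight, which is why it can be low on a highly imbalanced dataset even when overall accuracy looks high. A minimal sketch, using hypothetical labels and predictions rather than anything from the Kvasir Capsule Dataset:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced case: the classifier always predicts the majority class.
y_true = ["normal"] * 8 + ["lesion"] * 2
y_pred = ["normal"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
```

Here `accuracy` is 0.8, but `macro` is about 0.44 because the minority class scores zero. This penalty for ignoring rare classes is what makes the metric suitable for imbalanced benchmarks.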

Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status

2411.03004v1 by Samuel Lee, Zach Wood-Doughty

Causal understanding is a fundamental goal of evidence-based medicine. When randomization is impossible, causal inference methods allow the estimation of treatment effects from retrospective analysis of observational data. However, such analyses rely on a number of assumptions, often including that of no unobserved confounding. In many practical settings, this assumption is violated when important variables are not explicitly measured in the clinical record. Prior work has proposed to address unobserved confounding with machine learning by imputing unobserved variables and then correcting for the classifier's mismeasurement. When such a classifier can be trained and the necessary assumptions are met, this method can recover an unbiased estimate of a causal effect. However, such work has been limited to synthetic data, simple classifiers, and binary variables. This paper extends this methodology by using a large language model trained on clinical notes to predict patients' smoking status, which would otherwise be an unobserved confounder. We then apply a measurement error correction on the categorical predicted smoking status to estimate the causal effect of transthoracic echocardiography on mortality in the MIMIC dataset.

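
The idea of correcting for a classifier's mismeasurement can be shown in its simplest form: recovering the true rate of a binary variable from the rate a noisy classifier reports, given the classifier's sensitivity and specificity. This is a simplified illustration (the standard Rogan-Gladen style correction), not the estimator used in the paper, which handles a categorical smoking status inside a causal effect estimate. All numbers are hypothetical.

```python
def corrected_prevalence(observed_rate, sensitivity, specificity):
    """Invert observed = sens * p + (1 - spec) * (1 - p) to recover p."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("classifier must perform better than chance")
    estimate = (observed_rate + specificity - 1.0) / denom
    # Clip to the valid probability range in case of sampling noise.
    return min(1.0, max(0.0, estimate))


true_rate = 0.30        # hypothetical true share of smokers
sens, spec = 0.90, 0.80  # hypothetical classifier error rates

# What a noisy classifier would report on average.
observed = sens * true_rate + (1.0 - spec) * (1.0 - true_rate)
recovered = corrected_prevalence(observed, sens, spec)
```

The classifier reports a 41% smoking rate, while the corrected estimate returns the underlying 30%. The correction depends on knowing the error rates, which corresponds to the abstract's caveat that the necessary assumptions must be met.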

Region-Guided Attack on the Segment Anything Model (SAM)

2411.02974v2 by Xiaoliang Liu, Furao Shen, Jian Zhao

The Segment Anything Model (SAM) is a cornerstone of image segmentation, demonstrating exceptional performance across various applications, particularly in autonomous driving and medical imaging, where precise segmentation is crucial. However, SAM is vulnerable to adversarial attacks that can significantly impair its functionality through minor input perturbations. Traditional techniques, such as FGSM and PGD, are often ineffective in segmentation tasks due to their reliance on global perturbations that overlook spatial nuances. Recent methods like Attack-SAM-K and UAD have begun to address these challenges, but they frequently depend on external cues and do not fully leverage the structural interdependencies within segmentation processes. This limitation underscores the need for a novel adversarial strategy that exploits the unique characteristics of segmentation tasks. In response, we introduce the Region-Guided Attack (RGA), designed specifically for SAM. RGA utilizes a Region-Guided Map (RGM) to manipulate segmented regions, enabling targeted perturbations that fragment large segments and expand smaller ones, resulting in erroneous outputs from SAM. Our experiments demonstrate that RGA achieves high success rates in both white-box and black-box scenarios, emphasizing the need for robust defenses against such sophisticated attacks. RGA not only reveals SAM's vulnerabilities but also lays the groundwork for developing more resilient defenses against adversarial threats in image segmentation.

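
For context, FGSM, one of the traditional global-perturbation attacks the abstract contrasts with RGA, shifts every input feature by a fixed step in the direction that increases the loss. The sketch below applies it to a tiny logistic-regression classifier with hypothetical weights and inputs; it illustrates the baseline only and is not the Region-Guided Attack or anything specific to SAM.

```python
import math


def predict(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(weights, bias, x, label, epsilon):
    """One FGSM step: x + epsilon * sign(gradient of the loss w.r.t. x)."""
    p = predict(weights, bias, x)
    # For binary cross-entropy with a logistic model, dL/dx = (p - label) * w.
    grad = [(p - label) * w for w in weights]
    signs = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, signs)]


# Hypothetical model and input.
weights, bias = [2.0, -1.0, 0.5], 0.0
x, label = [0.5, 0.2, 0.1], 1

clean_conf = predict(weights, bias, x)
x_adv = fgsm(weights, bias, x, label, epsilon=0.3)
adv_conf = predict(weights, bias, x_adv)
```

Every feature moves by the same magnitude `epsilon`, regardless of where it sits in the input. In this example that is enough to push the predicted probability for the true class from about 0.70 to below 0.5. The uniform, location-blind perturbation is the property the abstract describes as overlooking spatial nuances.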

[Vision Paper] PRObot: Enhancing Patient-Reported Outcome Measures for Diabetic Retinopathy using Chatbots and Generative AI

2411.02973v1 by Maren Pielka, Tobias Schneider, Jan Terheyden, Rafet Sifa

We present an outline of the first large language model (LLM) based chatbot application in the context of patient-reported outcome measures (PROMs) for diabetic retinopathy. By utilizing the capabilities of current LLMs, we enable patients to provide feedback about their quality of life and treatment progress via an interactive application. The proposed framework offers significant advantages over the current approach, which encompasses only qualitative collection of survey data or a static survey with limited answer options. Using the PROBot LLM-PROM application, patients will be asked tailored questions about their individual challenges, and can give more detailed feedback on the progress of their treatment. Based on this input, we will use machine learning to infer conventional PROM scores, which can be used by clinicians to evaluate the treatment status. The goal of the application is to improve adherence to the healthcare system and treatments, and thus ultimately reduce cases of subsequent vision impairment. The approach needs to be further validated using a survey and a clinical study.


Leveraging Transfer Learning and Multiple Instance Learning for HER2 Automatic Scoring of H&E Whole Slide Images

2411.05028v1 by Rawan S. Abdulsadig, Bryan M. Williams, Nikolay Burlutskiy

Expression of human epidermal growth factor receptor 2 (HER2) is an important biomarker in breast cancer patients who can benefit from cost-effective automatic Hematoxylin and Eosin (H&E) HER2 scoring. However, developing such scoring models requires large pixel-level annotated datasets. Transfer learning allows prior knowledge from different datasets to be reused while multiple-instance learning (MIL) allows the lack of detailed annotations to be mitigated. The aim of this work is to examine the potential of transfer learning on the performance of deep learning models pre-trained on (i) Immunohistochemistry (IHC) images, (ii) H&E images and (iii) non-medical images. A MIL framework with an attention mechanism is developed using pre-trained models as patch-embedding models. It was found that embedding models pre-trained on H&E images consistently outperformed the others, resulting in an average AUC-ROC value of 0.622 across the 4 HER2 scores (0.59 to 0.80 per HER2 score). Furthermore, it was found that using multiple-instance learning with an attention layer not only allows for good classification results to be achieved, but it can also help with producing visual indication of HER2-positive areas in the H&E slide image by utilising the patch-wise attention weights.

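
The attention mechanism in a MIL framework, and the way its patch-wise weights can highlight regions of interest, can be sketched as attention pooling over patch embeddings. The sketch assumes the common formulation in which each patch gets a score `w . tanh(V h)` that is normalized with a softmax. The abstract does not spell out the exact layer, and the parameters `V` and `w` below are fixed hypothetical values rather than learned ones.

```python
import math


def attention_mil_pool(patch_embeddings, V, w):
    """Pool patch embeddings into one slide embedding using attention weights."""
    scores = []
    for h in patch_embeddings:
        hidden = [math.tanh(sum(vij * hj for vij, hj in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Softmax over patches so the attention weights sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(patch_embeddings[0])
    slide = [sum(a * h[c] for a, h in zip(attn, patch_embeddings)) for c in range(dim)]
    return slide, attn


# Hypothetical 2-dimensional embeddings for three patches of one slide.
patches = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1]]
V = [[1.0, 1.0], [1.0, -1.0]]  # hypothetical projection
w = [2.0, 0.0]                 # hypothetical scoring vector

slide_embedding, attention = attention_mil_pool(patches, V, w)
```

Only a slide-level label is needed to train such a model, because the classifier sees just `slide_embedding`. The per-patch `attention` values can then be mapped back onto the slide as a heat map. In this example the second patch receives the largest weight.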

Membership Inference Attacks against Large Vision-Language Models

2411.02902v1 by Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, Volkan Cevher

Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios. However, their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records, in their training datasets. Detecting inappropriately used data in VLLMs remains a critical and unresolved issue, mainly due to the lack of standardized datasets and suitable methodologies. In this study, we introduce the first membership inference attack (MIA) benchmark tailored for various VLLMs to facilitate training data detection. Then, we propose a novel MIA pipeline specifically designed for token-level image detection. Lastly, we present a new metric called MaxRényi-K%, which is based on the confidence of the model output and applies to both text and image data. We believe that our work can deepen the understanding and methodology of MIAs in the context of VLLMs. Our code and datasets are available at https://github.com/LIONS-EPFL/VL-MIA.

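
The abstract describes MaxRényi-K% only as a confidence-based metric, so the following is a hedged sketch of one plausible reading: compute the Rényi entropy of the model's next-token distribution at each position, then average the largest K% of those entropies. The aggregation rule, the default `alpha`, and the toy distributions are assumptions made for illustration; the linked repository is the authoritative definition.

```python
import math


def renyi_entropy(probs, alpha):
    """Rényi entropy of order alpha; alpha == 1 falls back to Shannon entropy."""
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1.0 - alpha)


def max_renyi_k(token_distributions, alpha=0.5, k_percent=50):
    """Assumed aggregation: mean of the largest k_percent of per-position entropies."""
    entropies = sorted(
        (renyi_entropy(p, alpha) for p in token_distributions), reverse=True
    )
    keep = max(1, math.ceil(len(entropies) * k_percent / 100))
    return sum(entropies[:keep]) / keep


# Hypothetical next-token distributions at four positions over a 4-token vocabulary.
distributions = [
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [1.0, 0.0, 0.0, 0.0],      # fully confident
    [0.5, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
score = max_renyi_k(distributions, alpha=0.5, k_percent=50)
```

With these inputs the two highest entropies are log 4 and log 2, so `score` is their mean. Because the computation needs only output probabilities, the same function applies whether the positions correspond to text tokens or image tokens.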

Advanced XR-Based 6-DOF Catheter Tracking System for Immersive Cardiac Intervention Training

2411.02611v1 by Mohsen Annabestani, Sandhya Sriram, S. Chiu Wong, Alexandros Sigaras, Bobak Mosadegh

Extended Reality (XR) technologies are gaining traction as effective tools for medical training and procedural guidance, particularly in complex cardiac interventions. This paper presents a novel system for real-time 3D tracking and visualization of intracardiac echocardiography (ICE) catheters, with precise measurement of the roll angle. A custom 3D-printed setup, featuring orthogonal cameras, captures biplane video of the catheter, while a specialized computer vision algorithm reconstructs its 3D trajectory, localizing the tip with sub-millimeter accuracy and tracking the roll angle in real-time. The system's data is integrated into an interactive Unity-based environment, rendered through the Meta Quest 3 XR headset, combining a dynamically tracked catheter with a patient-specific 3D heart model. This immersive environment allows the testing of the importance of 3D depth perception, in comparison to 2D projections, as a form of visualization in XR. Our experimental study, conducted using the ICE catheter with six participants, suggests that 3D visualization is not necessarily beneficial over 2D views offered by the XR system; although all cardiologists saw its utility for pre-operative training, planning, and intra-operative guidance. The proposed system qualitatively shows great promise in transforming catheter-based interventions, particularly ICE procedures, by improving visualization, interactivity, and skill development.


"It's a conversation, not a quiz": A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health

2411.02594v1 by Jiawei Zhou, Amy Z. Chen, Darshi Shah, Laura Schwab Reese, Munmun De Choudhury

Recent breakthroughs in large language models (LLMs) have generated both interest and concern about their potential adoption as accessible information sources or communication tools across different domains. In public health -- where stakes are high and impacts extend across populations -- adopting LLMs poses unique challenges that require thorough evaluation. However, structured approaches for assessing potential risks in public health remain under-explored. To address this gap, we conducted focus groups with health professionals and health issue experiencers to unpack their concerns, situated across three distinct and critical public health issues that demand high-quality information: vaccines, opioid use disorder, and intimate partner violence. We synthesize participants' perspectives into a risk taxonomy, distinguishing and contextualizing the potential harms LLMs may introduce when positioned alongside traditional health communication. This taxonomy highlights four dimensions of risk in individual behaviors, human-centered care, information ecosystem, and technology accountability. For each dimension, we discuss specific risks and example reflection questions to help practitioners adopt a risk-reflexive approach. This work offers a shared vocabulary and reflection tool for experts in both computing and public health to collaboratively anticipate, evaluate, and mitigate risks in deciding when to employ LLM capabilities (or not) and how to mitigate harm when they are used.

Digitizing Touch with an Artificial Multimodal Fingertip

2411.02479v1 by Mike Lambeta, Tingfan Wu, Ali Sengul, Victoria Rose Most, Nolan Black, Kevin Sawyer, Romeo Mercado, Haozhi Qi, Alexander Sohn, Byron Taylor, Norb Tydingco, Gregg Kammerer, Dave Stroud, Jake Khatha, Kurt Jenkins, Kyle Most, Neal Stein, Ricardo Chavira, Thomas Craven-Bartle, Eric Sanchez, Yitian Ding, Jitendra Malik, Roberto Calandra

Touch is a crucial sensing modality that provides rich information about object properties and interactions with the physical environment. Humans and robots both benefit from using touch to perceive and interact with the surrounding environment (Johansson and Flanagan, 2009; Li et al., 2020; Calandra et al., 2017). However, no existing systems provide rich, multi-modal digital touch-sensing capabilities through a hemispherical compliant embodiment. Here, we describe several conceptual and technological innovations to improve the digitization of touch. These advances are embodied in an artificial finger-shaped sensor with advanced sensing capabilities. Significantly, this fingertip contains high-resolution sensors (~8.3 million taxels) that respond to omnidirectional touch, capture multi-modal signals, and use on-device artificial intelligence to process the data in real time. Evaluations show that the artificial fingertip can resolve spatial features as small as 7 µm, sense normal and shear forces with a resolution of 1.01 mN and 1.27 mN, respectively, perceive vibrations up to 10 kHz, sense heat, and even sense odor. Furthermore, it embeds an on-device AI neural network accelerator that acts as a peripheral nervous system on a robot and mimics the reflex arc found in humans. These results demonstrate the possibility of digitizing touch with superhuman performance. The implications are profound, and we anticipate potential applications in robotics (industrial, medical, agricultural, and consumer-level), virtual reality and telepresence, prosthetics, and e-commerce. Toward digitizing touch at scale, we open-source a modular platform to facilitate future research on the nature of touch.

Simulation of Nanorobots with Artificial Intelligence and Reinforcement Learning for Advanced Cancer Cell Detection and Tracking

2411.02345v1 by Shahab Kavousinejad

Nanorobots are a promising development in targeted drug delivery and the treatment of neurological disorders, with potential for crossing the blood-brain barrier (BBB). These small devices leverage advancements in nanotechnology and bioengineering for precise navigation and targeted payload delivery, particularly for conditions like brain tumors, Alzheimer's disease, and Parkinson's disease. Recent progress in artificial intelligence (AI) and machine learning (ML) has improved the navigation and effectiveness of nanorobots, allowing them to detect and interact with cancer cells through biomarker analysis. This study presents a new reinforcement learning (RL) framework for optimizing nanorobot navigation in complex biological environments, focusing on cancer cell detection by analyzing the concentration gradients of surrounding biomarkers. We utilize a computer simulation model to explore the behavior of nanorobots in a three-dimensional space with cancer cells and biological barriers. The proposed method uses Q-learning to refine movement strategies based on real-time biomarker concentration data, enabling nanorobots to autonomously navigate to cancerous tissues for targeted drug delivery. This research lays the groundwork for future laboratory experiments and clinical applications, with implications for personalized medicine and less invasive cancer treatments. The integration of intelligent nanorobots could revolutionize therapeutic strategies, reducing side effects and enhancing treatment effectiveness for cancer patients. Further research will investigate the practical deployment of these technologies in medical settings, aiming to unlock the full potential of nanorobotics in healthcare.
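
The Q-learning idea described above can be illustrated with a deliberately small sketch. It is not the paper's simulator: it assumes a one-dimensional line of cells instead of a 3D space, a hypothetical exponentially decaying biomarker field, and illustrative hyperparameters. The reward is the change in sensed concentration, with a bonus on reaching the source cell.

```python
import math
import random

ACTIONS = (-1, +1)  # step left or right along the line (illustrative action set)

def concentration(x, source, scale=3.0):
    """Hypothetical biomarker field that decays exponentially with distance from the source."""
    return math.exp(-abs(x - source) / scale)

def train_q_table(n_cells=12, source=9, episodes=500, alpha=0.5, gamma=0.9,
                  epsilon=0.2, max_steps=50, seed=0):
    """Tabular Q-learning; all hyperparameters here are assumptions for the sketch."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_cells)]
    for _ in range(episodes):
        x = rng.randrange(n_cells)
        for _ in range(max_steps):
            if x == source:
                break
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[x][0] > q[x][1] else 1
            nxt = min(max(x + ACTIONS[a], 0), n_cells - 1)
            # reward: increase in sensed concentration, plus a bonus on arrival
            reward = concentration(nxt, source) - concentration(x, source)
            if nxt == source:
                reward += 1.0
                target = reward  # terminal state: no bootstrapping
            else:
                target = reward + gamma * max(q[nxt])
            q[x][a] += alpha * (target - q[x][a])
            x = nxt
    return q

def greedy_path(q, start, source, max_steps=50):
    """Follow the learned greedy policy from `start` until the source is reached."""
    path, x = [start], start
    for _ in range(max_steps):
        if x == source:
            break
        a = 0 if q[x][0] > q[x][1] else 1
        x = min(max(x + ACTIONS[a], 0), len(q) - 1)
        path.append(x)
    return path
```

After training, `greedy_path(train_q_table(), 0, 9)` walks up the concentration gradient toward the source cell. A 3D version would only change the state and action encoding, not the update rule.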

Taking AI Welfare Seriously

2411.00986v1 by Robert Long, Jeff Sebo, Patrick Butlin, Kathleen Finlinson, Kyle Fish, Jacqueline Harding, Jacob Pfau, Toni Sims, Jonathan Birch, David Chalmers

In this report, we argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future. That means that the prospect of AI welfare and moral patienthood, i.e. of AI systems with their own interests and moral significance, is no longer an issue only for sci-fi or the distant future. It is an issue for the near future, and AI companies and other actors have a responsibility to start taking it seriously. We also recommend three early steps that AI companies and other actors can take: They can (1) acknowledge that AI welfare is an important and difficult issue (and ensure that language model outputs do the same), (2) start assessing AI systems for evidence of consciousness and robust agency, and (3) prepare policies and procedures for treating AI systems with an appropriate level of moral concern. To be clear, our argument in this report is not that AI systems definitely are, or will be, conscious, robustly agentic, or otherwise morally significant. Instead, our argument is that there is substantial uncertainty about these possibilities, and so we need to improve our understanding of AI welfare and our ability to make wise decisions about this issue. Otherwise there is a significant risk that we will mishandle decisions about AI welfare, mistakenly harming AI systems that matter morally and/or mistakenly caring for AI systems that do not.

Federated GNNs for EEG-Based Stroke Assessment

2411.02286v1 by Andrea Protani, Lorenzo Giusti, Albert Sund Aillet, Simona Sacco, Paolo Manganotti, Lucio Marinelli, Diogo Reis Santos, Pierpaolo Brutti, Pietro Caliandro, Luigi Serio

Machine learning (ML) has the potential to become an essential tool in supporting clinical decision-making processes, offering enhanced diagnostic capabilities and personalized treatment plans. However, outsourcing medical records to train ML models using patient data raises legal, privacy, and security concerns. Federated learning has emerged as a promising paradigm for collaborative ML, meeting healthcare institutions' requirements for robust models without sharing sensitive data or compromising patient privacy. This study proposes a novel method that combines federated learning (FL) and Graph Neural Networks (GNNs) to predict stroke severity using electroencephalography (EEG) signals across multiple medical institutions. Our approach enables multiple hospitals to jointly train a shared GNN model on their local EEG data without exchanging patient information. Specifically, we address a regression problem by predicting the National Institutes of Health Stroke Scale (NIHSS), a key indicator of stroke severity. The proposed model leverages a masked self-attention mechanism to capture salient brain connectivity patterns and employs EdgeSHAP to provide post-hoc explanations of the neurological states after a stroke. We evaluated our method on EEG recordings from four institutions, achieving a mean absolute error (MAE) of 3.23 in predicting NIHSS, close to the average error made by human experts (MAE ≈ 3.0). This demonstrates the method's effectiveness in providing accurate and explainable predictions while maintaining data privacy.
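
To make the federated setup concrete, here is a minimal sketch of one server-side aggregation step together with the MAE metric. It assumes FedAvg-style weighting by local dataset size; the abstract does not say which aggregation rule the authors use, and the flat parameter lists stand in for real GNN weights.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    FedAvg-style aggregation is an assumption of this sketch; only the
    parameters leave each hospital, never the EEG recordings themselves.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

def mean_absolute_error(y_true, y_pred):
    """MAE, the metric reported for NIHSS prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, two hypothetical hospitals with 1 and 3 recordings and local parameters `[1.0, 2.0]` and `[3.0, 4.0]` aggregate to `[2.5, 3.5]`, so the larger site dominates the shared model.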

Weakly supervised deep learning model with size constraint for prostate cancer detection in multiparametric MRI and generalization to unseen domains

2411.02466v1 by Robin Trombetta, Olivier Rouvière, Carole Lartizien

Fully supervised deep models have shown promising performance for many medical segmentation tasks. Still, the deployment of these tools in clinics is limited by the very time-consuming collection of manually expert-annotated data. Moreover, most state-of-the-art models have been trained and validated on moderately homogeneous datasets. It is known that deep learning methods are often greatly degraded by domain or label shifts and are yet to be built in such a way as to be robust to unseen data or label distributions. In the clinical setting, this problem is particularly relevant, as the deploying institutions may have different scanners or acquisition protocols than those from which the training data were collected. In this work, we propose to address these two challenges for the detection of clinically significant prostate cancer (csPCa) from bi-parametric MRI. We evaluate the method proposed by Kervadec et al. (2018), which introduces a size constraint loss to produce fine semantic segmentations of cancer lesions from weak circle-scribble annotations. Model performance is evaluated on two public databases (PI-CAI and Prostate158) and one private database. First, we show that the model achieves on-par performance with strong fully supervised baseline models, both on in-distribution validation data and unseen test images. Second, we observe a performance decrease for both fully supervised and weakly supervised models when tested on unseen data domains. This confirms the crucial need for efficient domain adaptation methods if deep learning models are to be deployed in a clinical environment. Finally, we show that ensemble predictions from multiple trainings increase generalization performance.
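
A size constraint loss of the kind mentioned above can be sketched as a quadratic penalty on the soft predicted lesion size whenever it leaves an allowed interval. This is a simplified reading of the idea, not the authors' implementation; how the bounds `lower` and `upper` are derived from the circle scribbles is left as an assumption.

```python
def size_constraint_loss(probs, lower, upper):
    """Penalize a soft predicted size that falls outside [lower, upper].

    probs: 2D list of foreground probabilities for one image.
    lower, upper: hypothetical size bounds in pixels.
    """
    size = sum(sum(row) for row in probs)  # soft lesion size
    if size < lower:
        return (size - lower) ** 2
    if size > upper:
        return (size - upper) ** 2
    return 0.0  # size within bounds: no penalty
```

Because the penalty depends only on the total predicted size, it can supervise pixels that carry no label, which is what makes it usable with weak annotations.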

Evaluating the quality of published medical research with ChatGPT

2411.01952v1 by Mike Thelwall, Xiaorui Jiang, Peter A. Bath

Evaluating the quality of published research is time-consuming but important for departmental evaluations, appointments, and promotions. Previous research has shown that ChatGPT can score articles for research quality, with the results correlating positively with an indicator of quality in all fields except Clinical Medicine. This article investigates this anomaly with the largest dataset yet and a more detailed analysis. The results showed that ChatGPT 4o-mini scores for articles submitted to the UK's Research Excellence Framework (REF) 2021 Unit of Assessment (UoA) 1 Clinical Medicine correlated positively (r=0.134, n=9872) with departmental mean REF scores, against a theoretical maximum correlation of r=0.226 (due to the departmental averaging involved). At the departmental level, mean ChatGPT scores correlated more strongly with departmental mean REF scores (r=0.395, n=31). For the 100 journals with the most articles in UoA 1, their mean ChatGPT score correlated strongly with their REF score (r=0.495) but negatively with their citation rate (r=-0.148). Journal and departmental anomalies in these results point to ChatGPT being ineffective at assessing the quality of research in prestigious medical journals or research directly affecting human health, or both. Nevertheless, the results give evidence of ChatGPT's ability to assess research quality overall for Clinical Medicine, so now there is evidence of its ability in all academic fields.

You are out of context!

2411.02464v1 by Giancarlo Cobino, Simone Farci

This research proposes a novel drift detection methodology for machine learning (ML) models based on the concept of "deformation" in the vector space representation of data. Recognizing that new data can act as forces stretching, compressing, or twisting the geometric relationships learned by a model, we explore various mathematical frameworks to quantify this deformation. We investigate measures such as eigenvalue analysis of covariance matrices to capture global shape changes, local density estimation using kernel density estimation (KDE), and Kullback-Leibler divergence to identify subtle shifts in data concentration. Additionally, we draw inspiration from continuum mechanics by proposing a "strain tensor" analogy to capture multi-faceted deformations across different data types. This requires careful estimation of the displacement field, and we delve into strategies ranging from density-based approaches to manifold learning and neural network methods. By continuously monitoring these deformation metrics and correlating them with model performance, we aim to provide a sensitive, interpretable, and adaptable drift detection system capable of distinguishing benign data evolution from true drift, enabling timely interventions and ensuring the reliability of machine learning systems in dynamic environments. Addressing the computational challenges of this methodology, we discuss mitigation strategies like dimensionality reduction, approximate algorithms, and parallelization for real-time and large-scale applications. The method's effectiveness is demonstrated through experiments on real-world text data, focusing on detecting context shifts in Generative AI. Our results, supported by publicly available code, highlight the benefits of this deformation-based approach in capturing subtle drifts that traditional statistical methods often miss. Furthermore, we present a detailed application example within the healthcare domain, showcasing the methodology's potential in diverse fields. Future work will focus on further improving computational efficiency and exploring additional applications across different ML domains.
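
One of the measures listed above, Kullback-Leibler divergence, can be sketched on a single feature as follows. This is only a toy illustration: it assumes histogram density estimates with additive smoothing rather than KDE, and the bin edges are hypothetical inputs, not part of the paper's method.

```python
import math

def histogram(values, edges):
    """Count values per bin [edges[i], edges[i+1]); values outside the range are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P || Q) between two smoothed histograms (eps avoids log of zero)."""
    k = len(p_counts)
    p_total = sum(p_counts) + eps * k
    q_total = sum(q_counts) + eps * k
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        kl += p * math.log(p / q)
    return kl

def drift_score(reference, incoming, edges):
    """Divergence of the incoming batch from the reference batch on one feature."""
    return kl_divergence(histogram(incoming, edges), histogram(reference, edges))
```

A score of zero means the two batches fall into the bins identically. A monitoring loop would compare the score against a threshold calibrated on data known to be drift-free.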

Diagnosing Medical Datasets with Training Dynamics

2411.01653v1 by Laura Wenderoth

This study explores the potential of using training dynamics as an automated alternative to human annotation for evaluating the quality of training data. The framework used is Data Maps, which classifies data points into categories such as easy-to-learn, hard-to-learn, and ambiguous (Swayamdipta et al., 2020). Swayamdipta et al. (2020) highlight that difficult-to-learn examples often contain errors, and ambiguous cases significantly impact model training. To confirm the reliability of these findings, we replicated the experiments using a challenging dataset, with a focus on medical question answering. In addition to text comprehension, this field requires the acquisition of detailed medical knowledge, which further complicates the task. A comprehensive evaluation was conducted to assess the feasibility and transferability of the Data Maps framework to the medical domain. The evaluation indicates that the framework is unsuitable for addressing datasets' unique challenges in answering medical questions.
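
The Data Maps coordinates can be sketched as follows, following Swayamdipta et al. (2020): confidence is the mean probability a model assigns to the gold label across training epochs, and variability is the standard deviation of that probability. The region thresholds below are illustrative assumptions, since the original work selects regions relative to the data rather than by fixed cutoffs.

```python
from statistics import mean, pstdev

def data_map_coordinates(gold_probs_per_epoch):
    """Return (confidence, variability) per example.

    gold_probs_per_epoch[e][i] is the probability given to the gold label
    of example i at the end of epoch e.
    """
    n_examples = len(gold_probs_per_epoch[0])
    coords = []
    for i in range(n_examples):
        probs = [epoch[i] for epoch in gold_probs_per_epoch]
        coords.append((mean(probs), pstdev(probs)))
    return coords

def region(confidence, variability, conf_threshold=0.5, var_threshold=0.2):
    """Assign a Data Maps region; both thresholds are hypothetical."""
    if variability >= var_threshold:
        return "ambiguous"
    return "easy-to-learn" if confidence >= conf_threshold else "hard-to-learn"
```

An example whose gold-label probability swings between 0.1 and 0.9 across epochs lands in the ambiguous region, while one that stays near 0.1 is hard-to-learn and, per the cited work, a candidate for a labeling error.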

Optical Flow Representation Alignment Mamba Diffusion Model for Medical Video Generation

2411.01647v1 by Zhenbin Wang, Lei Zhang, Lituan Wang, Minjuan Zhu, Zhenwei Zhang

Medical video generation models are expected to have a profound impact on the healthcare industry, including but not limited to medical education and training, surgical planning, and simulation. Current video diffusion models typically build on image diffusion architecture by incorporating temporal operations (such as 3D convolution and temporal attention). Although this approach is effective, its oversimplification limits spatio-temporal performance and consumes substantial computational resources. To counter this, we propose Medical Simulation Video Generator (MedSora), which incorporates three key elements: i) a video diffusion framework that integrates the advantages of attention and Mamba, balancing low computational load with high-quality video generation, ii) an optical flow representation alignment method that implicitly enhances attention to inter-frame pixels, and iii) a video variational autoencoder (VAE) with frequency compensation that addresses the information loss of medical features that occurs when transforming pixel space into latent features and then back to pixel frames. Extensive experiments and applications demonstrate that MedSora exhibits superior visual quality in generating medical videos, outperforming the most advanced baseline methods. Further results and code are available at https://wongzbb.github.io/MedSora

Customized Subgraph Selection and Encoding for Drug-drug Interaction Prediction

2411.01535v1 by Haotong Du, Quanming Yao, Juzheng Zhang, Yang Liu, Zhen Wang

Subgraph-based methods have proven to be effective and interpretable in predicting drug-drug interactions (DDIs), which are essential for medical practice and drug development. Subgraph selection and encoding are critical stages in these methods, yet customizing these components remains underexplored due to the high cost of manual adjustments. In this study, inspired by the success of neural architecture search (NAS), we propose a method to search for data-specific components within subgraph-based frameworks. Specifically, we introduce extensive subgraph selection and encoding spaces that account for the diverse contexts of drug interactions in DDI prediction. To address the challenge of large search spaces and high sampling costs, we design a relaxation mechanism that uses an approximation strategy to efficiently explore optimal subgraph configurations. This approach allows for robust exploration of the search space. Extensive experiments demonstrate the effectiveness and superiority of the proposed method, with the discovered subgraphs and encoding functions highlighting the model's adaptability.

Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design

2411.01423v1 by Onur Boyar, Hiroyuki Hanada, Ichiro Takeuchi

The rapid discovery of new chemical compounds is essential for advancing global health and developing treatments. While generative models show promise in creating novel molecules, challenges remain in ensuring the real-world applicability of these molecules and finding such molecules efficiently. To address this, we introduce Conditional Latent Space Molecular Scaffold Optimization (CLaSMO), which combines a Conditional Variational Autoencoder (CVAE) with Latent Space Bayesian Optimization (LSBO) to modify molecules strategically while maintaining similarity to the original input. Our LSBO setting improves the sample-efficiency of our optimization, and our modification approach helps us to obtain molecules with higher chances of real-world applicability. CLaSMO explores substructures of molecules in a sample-efficient manner by performing BO in the latent space of a CVAE conditioned on the atomic environment of the molecule to be optimized. Our experiments demonstrate that CLaSMO efficiently enhances target properties with minimal substructure modifications, achieving state-of-the-art results with a smaller model and dataset compared to existing methods. We also provide an open-source web application that enables chemical experts to apply CLaSMO in a Human-in-the-Loop setting.

Medical X-Ray Image Enhancement Using Global Contrast-Limited Adaptive Histogram Equalization

2411.01373v1 by Sohrab Namazi Nia, Frank Y. Shih

In medical imaging, accurate diagnosis heavily relies on effective image enhancement techniques, particularly for X-ray images. Existing methods often suffer from various challenges, such as sacrificing global image characteristics for local ones or vice versa. In this paper, we present a novel approach, called G-CLAHE (Global-Contrast Limited Adaptive Histogram Equalization), which is well suited to medical imaging, with a focus on X-rays. The method draws on Global Histogram Equalization (GHE) and Contrast Limited Adaptive Histogram Equalization (CLAHE), combining the advantages of both while avoiding their weaknesses, so that local and global characteristics are preserved. Experimental results show that it significantly improves on current state-of-the-art algorithms, addressing their limitations and enhancing the contrast and quality of X-ray images for diagnostic accuracy.
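
As background for the two techniques named above, the following sketch implements plain global histogram equalization (GHE) on a small grayscale image. It is not G-CLAHE, whose details the abstract does not give. CLAHE applies a similar mapping per tile with a clipped histogram.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization for a 2D list of integer gray levels in [0, levels)."""
    hist = [0] * levels
    for row in image:
        for v in row:
            hist[v] += 1
    # cumulative distribution of gray levels
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)  # CDF value at the darkest level present
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]
    # lookup table mapping each old level to its equalized level
    lut = [max(round((c - cdf_min) / (total - cdf_min) * (levels - 1)), 0) for c in cdf]
    return [[lut[v] for v in row] for row in image]
```

On an image whose values cluster in a narrow band, such as `[[100, 100], [101, 102]]`, the darkest level maps to 0 and the brightest to 255. That gain in global contrast comes at the cost of local detail, which is the trade-off the abstract describes.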

Guided Synthesis of Labeled Brain MRI Data Using Latent Diffusion Models for Segmentation of Enlarged Ventricles

2411.01351v1 by Tim Ruschke, Jonathan Frederik Carlsen, Adam Espe Hansen, Ulrich Lindberg, Amalie Monberg Hindsholm, Martin Norgaard, Claes Nøhr Ladefoged

Deep learning models in medical contexts face challenges like data scarcity, inhomogeneity, and privacy concerns. This study focuses on improving ventricular segmentation in brain MRI images using synthetic data. We employed two latent diffusion models (LDMs): a mask generator trained using 10,000 masks, and a corresponding SPADE image generator optimized using 6,881 scans to create an MRI conditioned on a 3D brain mask. Conditioning the mask generator on ventricular volume in combination with classifier-free guidance enabled the control of the ventricular volume distribution of the generated synthetic images. Next, the performance of the synthetic data was tested using three nnU-Net segmentation models trained on real, augmented, and entirely synthetic data, respectively. The resulting models were tested on a completely independent hold-out dataset of patients with enlarged ventricles, with manual delineation of the ventricles used as ground truth. The model trained on real data showed a mean absolute error (MAE) of 9.09 ± 12.18 mL in predicted ventricular volume, while the models trained on synthetic and augmented data showed MAEs of 7.52 ± 4.81 mL and 6.23 ± 4.33 mL, respectively. Both the synthetic and augmented models also outperformed the state-of-the-art model SynthSeg, which, due to limited performance in cases of large ventricular volumes, showed an MAE of 7.73 ± 12.12 mL with a standard deviation about three times higher. The model trained on augmented data showed the highest Dice score of 0.892 ± 0.05, slightly outperforming SynthSeg and on par with the model trained on real data. The synthetic model performed similarly to SynthSeg. In summary, we provide evidence that guided synthesis of labeled brain MRI data using LDMs improves the segmentation of enlarged ventricles and outperforms existing state-of-the-art segmentation models.
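
The two evaluation quantities used above, Dice overlap and ventricular volume, can be computed from binary masks as in this small sketch. The flat-list mask representation and the `voxel_volume_mm3` parameter are assumptions made for illustration.

```python
def dice_score(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    denom = sum(pred) + sum(truth)
    # two empty masks agree perfectly by convention
    return 1.0 if denom == 0 else 2.0 * intersection / denom

def volume_ml(mask, voxel_volume_mm3):
    """Segmented volume in millilitres (1 mL = 1000 mm^3)."""
    return sum(mask) * voxel_volume_mm3 / 1000.0
```

The volume MAE reported in the abstract would then be the mean of `abs(volume_ml(pred) - volume_ml(truth))` over patients. The two metrics can disagree, since Dice measures overlap and a mask with the right volume in the wrong place scores poorly on it.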

Causal reasoning in difference graphs

2411.01292v1 by Charles K. Assaad

In epidemiology, understanding causal mechanisms across different populations is essential for designing effective public health interventions. Recently, difference graphs have been introduced as a tool to visually represent causal variations between two distinct populations. While there has been progress in inferring these graphs from data through causal discovery methods, there remains a gap in systematically leveraging their potential to enhance causal reasoning. This paper addresses that gap by establishing conditions for identifying causal changes and effects using difference graphs and observational data. It specifically focuses on identifying total causal changes and total effects in a nonparametric framework, as well as direct causal changes and direct effects in a linear context. In doing so, it provides a novel approach to causal reasoning that holds potential for various public health applications.

Designing a Robust Radiology Report Generation System

2411.01153v1 by Sonit Singh

Recent advances in deep learning have enabled researchers to explore tasks at the intersection of computer vision and natural language processing, such as image captioning, visual question answering, visual dialogue, and visual language navigation. Taking inspiration from image captioning, the task of radiology report generation aims at automatically generating radiology reports by having a comprehensive understanding of medical images. However, automatically generating radiology reports from medical images is a challenging task due to the complexity, diversity, and nature of medical images. In this paper, we outline the design of a robust radiology report generation system by integrating different modules and highlighting best practices, drawing upon lessons from our past work and from relevant studies in the literature. We also discuss the impact of integrating different components to form a single integrated system. We believe that these best practices, when implemented, could improve automatic radiology report generation, augment radiologists in decision making, and expedite the diagnostic workflow, in turn improving healthcare and saving human lives.

LEARNER: Learning Granular Labels from Coarse Labels using Contrastive Learning

2411.01144v1 by Gautam Gare, Jana Armouti, Nikhil Madaan, Rohan Panda, Tom Fox, Laura Hutchins, Amita Krishnan, Ricardo Rodriguez, Bennett DeBoisblanc, Deva Ramanan, John Galeotti

A crucial question in active patient care is determining if a treatment is having the desired effect, especially when changes are subtle over short periods. We propose using inter-patient data to train models that can learn to detect these fine-grained changes within a single patient. Specifically, can a model trained on multi-patient scans predict subtle changes in an individual patient's scans? Recent years have seen increasing use of deep learning (DL) in predicting diseases using biomedical imaging, such as predicting COVID-19 severity using lung ultrasound (LUS) data. While extensive literature exists on successful applications of DL systems when well-annotated large-scale datasets are available, it is quite difficult to collect a large corpus of personalized datasets for an individual. In this work, we investigate the ability of recent computer vision models to learn fine-grained differences while being trained on data showing larger differences. We evaluate on an in-house LUS dataset and a public ADNI brain MRI dataset. We find that models pre-trained on clips from multiple patients can better predict fine-grained differences in scans from a single patient by employing contrastive learning.
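
The abstract does not say which contrastive objective is used. As one common choice, an InfoNCE-style loss on embedding vectors can be sketched as below: it is low when the anchor is closer to its positive than to the negatives. The temperature value and the use of cosine similarity are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, and a list of negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

In a setting like the one described, positives and negatives might be formed from scan clips with the same or different coarse labels. That pairing rule is hypothetical here and would need to follow the paper's actual sampling scheme.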

摘要:在主動患者照護中,一個關鍵問題是確定治療是否產生預期的效果,特別是在短時間內變化細微的情況下。我們提議使用患者間數據來訓練模型,以便學習偵測單一患者內這些細微的變化。具體來說,在多位患者掃描中訓練的模型是否可以預測個別患者掃描中的細微變化?近年來,深度學習 (DL) 在使用生物醫學影像預測疾病方面應用日益廣泛,例如使用肺部超音波 (LUS) 數據預測 COVID-19 的嚴重程度。儘管有大量文獻記載了在有標註的大規模數據集可用時 DL 系統的成功應用,但要為個人收集大量個人化數據集相當困難。在這項工作中,我們探討了近期電腦視覺模型在針對顯示較大差異的數據進行訓練時,學習細微差異的能力。我們在內部 LUS 數據集和公開的 ADNI 大腦 MRI 數據集上進行評估。我們發現,透過使用對比學習,在多位患者的片段上預先訓練的模型可以更好地預測單一患者掃描中的細微差異。

Artificial Intelligence for Microbiology and Microbiome Research

2411.01098v1 by Xu-Wen Wang, Tong Wang, Yang-Yu Liu

Advancements in artificial intelligence (AI) have transformed many scientific fields, with microbiology and microbiome research now experiencing significant breakthroughs through machine learning and deep learning applications. This review provides a comprehensive overview of AI-driven approaches tailored for microbiology and microbiome studies, emphasizing both technical advancements and biological insights. We begin with an introduction to foundational AI techniques, including primary machine learning paradigms and various deep learning architectures, and offer guidance on choosing between machine learning and deep learning methods based on specific research goals. The primary section on application scenarios spans diverse research areas, from taxonomic profiling, functional annotation & prediction, microbe-X interactions, microbial ecology, metabolic modeling, precision nutrition, clinical microbiology, to prevention & therapeutics. Finally, we discuss challenges unique to this field, including the balance between interpretability and complexity, the "small n, large p" problem, and the critical need for standardized benchmarking datasets to validate and compare models. Together, this review underscores AI's transformative role in microbiology and microbiome research, paving the way for innovative methodologies and applications that enhance our understanding of microbial life and its impact on our planet and our health.

摘要:人工智慧 (AI) 的進步已轉變許多科學領域,而微生物學和微生物組研究現在正透過機器學習和深度學習應用體驗到顯著的突破。本篇評論提供 AI 驅動方法的全面概述,這些方法專為微生物學和微生物組研究量身打造,強調技術進步和生物見解。我們從基礎 AI 技術的介紹開始,包括主要的機器學習範例和各種深度學習架構,並提供根據具體研究目標在機器學習和深度學習方法之間進行選擇的指導。應用場景的主要部分涵蓋了從分類分析、功能註解和預測、微生物 X 相互作用、微生物生態、代謝建模、精準營養、臨床微生物學到預防和治療等多個研究領域。最後,我們討論了該領域獨有的挑戰,包括可解釋性和複雜性之間的平衡、「小 n,大 p」問題,以及驗證和比較模型的標準化基準數據集的關鍵需求。本篇評論共同強調了 AI 在微生物學和微生物組研究中的轉型作用,為創新方法和應用鋪平道路,這些方法和應用增強了我們對微生物生命及其對我們星球和我們健康的影響的理解。

Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities

2411.01053v1 by Adriel Saporta, Aahlad Puli, Mark Goldstein, Rajesh Ranganath

Contrastive learning methods, such as CLIP, leverage naturally paired data (for example, images and their corresponding text captions) to learn general representations that transfer efficiently to downstream tasks. While such approaches are generally applied to two modalities, domains such as robotics, healthcare, and video need to support many types of data at once. We show that the pairwise application of CLIP fails to capture joint information between modalities, thereby limiting the quality of the learned representations. To address this issue, we present Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. To develop Symile's objective, we derive a lower bound on total correlation, and show that Symile representations for any set of modalities form a sufficient statistic for predicting the remaining modalities. Symile outperforms pairwise CLIP, even with modalities missing in the data, on cross-modal classification and retrieval across several experiments, including on an original multilingual dataset of 33M image, text and audio samples and a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements. All datasets and code used in this work are publicly available at https://github.com/rajesh-lab/symile.

摘要:對比學習方法,例如 CLIP,利用自然配對的資料,例如影像及其對應的文字標題,來學習一般化表徵,並有效率地轉移到下游任務。雖然此類方法通常應用於兩種形式,但機器人技術、醫療保健和視訊等領域需要一次支援多種類型的資料。我們顯示,CLIP 的成對應用無法擷取形式間的聯合資訊,因此限制了學習表徵的品質。為了解決此問題,我們提出 Symile,這是一種簡單的對比學習方法,可以擷取任意數量的形式之間的高階資訊。Symile 提供了一個靈活且與架構無關的目標,用於學習特定於形式的表徵。為開發 Symile 的目標,我們推導出總相關性的下界,並顯示任何形式集合的 Symile 表徵形成一個充分的統計量,用於預測其餘形式。Symile 優於成對 CLIP,即使資料中缺少形式,也能在跨形式分類和檢索中表現出色,包括在一個包含 3300 萬張影像、文字和音訊樣本的原始多語言資料集和一個包含胸部 X 光、心電圖和實驗室測量的臨床資料集上進行的多次實驗。本研究中使用所有資料集和程式碼皆公開於 https://github.com/rajesh-lab/symile。
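
The Symile abstract above contrasts pairwise CLIP with an objective that scores all modalities jointly. The sketch below illustrates that idea for three modalities. It is a toy illustration, not the authors' code: using a multilinear inner product (the sum over dimensions of the elementwise product of the three vectors) as the joint score is an assumption not stated in the abstract, and the function names and the temperature parameter are hypothetical.

```python
import math


def multilinear_inner_product(a, b, c):
    """Joint score for a triple: sum over dimensions of a_d * b_d * c_d."""
    return sum(x * y * z for x, y, z in zip(a, b, c))


def triple_contrastive_loss(A, B, C, temperature=1.0):
    """InfoNCE-style loss with modality A as the anchor (illustrative only).

    For sample i the positive is the pair (B[i], C[i]); the negatives are
    the pairs (B[j], C[j]) taken from the other samples in the batch.
    """
    n = len(A)
    total = 0.0
    for i in range(n):
        logits = [
            multilinear_inner_product(A[i], B[j], C[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings: three modalities, two samples, two dimensions.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[1.0, 0.0], [0.0, 1.0]]
C_shuffled = [C[1], C[0]]  # break the alignment of the third modality

aligned_loss = triple_contrastive_loss(A, B, C)
shuffled_loss = triple_contrastive_loss(A, B, C_shuffled)
```

In this toy batch the aligned triples receive a lower loss than the triples whose third modality is shuffled, which is the behaviour a joint (rather than pairwise) score is meant to reward.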

Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract

2411.00726v1 by Fan Xiao, Junlin Hou, Ruiwei Zhao, Rui Feng, Haidong Zou, Lina Lu, Yi Xu, Juzhao Zhang

Diabetic retinopathy (DR) is a leading cause of blindness worldwide and a common complication of diabetes. As two different imaging tools for DR grading, color fundus photography (CFP) and infrared fundus photography (IFP) are highly-correlated and complementary in clinical applications. To the best of our knowledge, this is the first study that explores a novel multi-modal deep learning framework to fuse the information from CFP and IFP towards more accurate DR grading. Specifically, we construct a dual-stream architecture Cross-Fundus Transformer (CFT) to fuse the ViT-based features of two fundus image modalities. In particular, a meticulously engineered Cross-Fundus Attention (CFA) module is introduced to capture the correspondence between CFP and IFP images. Moreover, we adopt both the single-modality and multi-modality supervisions to maximize the overall performance for DR grading. Extensive experiments on a clinical dataset consisting of 1,713 pairs of multi-modal fundus images demonstrate the superiority of our proposed method. Our code will be released for public access.

摘要:糖尿病視網膜病變 (DR) 是全球失明的主要原因,也是糖尿病的常見併發症。作為 DR 分級的兩種不同的影像工具,彩色眼底攝影 (CFP) 和紅外線眼底攝影 (IFP) 在臨床應用中高度相關且互補。據我們所知,這是第一個探討創新的多模式深度學習框架,以融合 CFP 和 IFP 的資訊,以進行更準確的 DR 分級。具體來說,我們構建了一個雙流架構 Cross-Fundus Transformer (CFT),以融合兩種眼底影像模式的基於 ViT 的特徵。特別是,引入了精心設計的 Cross-Fundus Attention (CFA) 模組,以捕捉 CFP 和 IFP 影像之間的對應關係。此外,我們採用單一模式和多模式監督,以最大化 DR 分級的整體效能。在由 1,713 對多模式眼底影像組成的臨床資料集上進行的廣泛實驗證明了我們提出的方法的優越性。我們的程式碼將會公開發布。

CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis

2411.00696v1 by Fuying Wang, Feng Wu, Yihan Tang, Lequan Yu

Integrating multimodal Electronic Health Records (EHR) data, such as numerical time series and free-text clinical reports, has great potential in predicting clinical outcomes. However, prior work has primarily focused on capturing temporal interactions within individual samples and fusing multimodal information, overlooking critical temporal patterns across patients. These patterns, such as trends in vital signs like abnormal heart rate or blood pressure, can indicate deteriorating health or an impending critical event. Similarly, clinical notes often contain textual descriptions that reflect these patterns. Identifying corresponding temporal patterns across different modalities is crucial for improving the accuracy of clinical outcome predictions, yet it remains a challenging task. To address this gap, we introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations which are refined using slot attention to generate temporal semantic embeddings. To ensure rich cross-modal temporal semantics in the learned patterns, we introduce a contrastive-based TPNCE loss for cross-modal alignment, along with two reconstruction losses to retain core information of each modality. Evaluations on two clinically critical tasks, 48-hour in-hospital mortality and 24-hour phenotype classification, using the MIMIC-III database demonstrate the superiority of our method over existing approaches.

摘要:整合多模态电子健康记录 (EHR) 数据(例如数值时间序列和自由文本临床报告)在预测临床结果方面具有巨大潜力。然而,以前的工作主要集中在捕捉单个样本中的时间交互并融合多模态信息,而忽略了患者之间的关键时间模式。这些模式(例如生命体征趋势,如异常心率或血压)可能表明健康状况恶化或即将发生的危重事件。类似地,临床笔记通常包含反映这些模式的文本描述。识别不同模态之间相应的时间模式对于提高临床结果预测的准确性至关重要,但它仍然是一项具有挑战性的任务。为了解决这一差距,我们引入了一个跨模态时间模式发现 (CTPD) 框架,旨在从多模态 EHR 数据中有效提取有意义的跨模态时间模式。我们的方法引入了共享的初始时间模式表示,这些表示使用插槽注意力进行优化以生成时间语义嵌入。为了确保学习模式中丰富的跨模态时间语义,我们引入了基于对比的 TPNCE 损失用于跨模态对齐,以及两个重建损失以保留每个模态的核心信息。在两个临床关键任务(48 小时院内死亡率和 24 小时表型分类)上的评估,使用 MIMIC-III 数据库证明了我们方法优于现有方法。

Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering

2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust

Osteoporosis is a common condition that increases fracture risk, especially in older adults. Early diagnosis is vital for preventing fractures, reducing treatment costs, and preserving mobility. However, healthcare providers face challenges like limited labeled data and difficulties in processing medical images. This study presents a novel multi-modal learning framework that integrates clinical and imaging data to improve diagnostic accuracy and model interpretability. The model utilizes three pre-trained networks (VGG19, InceptionV3, and ResNet50) to extract deep features from X-ray images. These features are transformed using PCA to reduce dimensionality and focus on the most relevant components. A clustering-based selection process identifies the most representative components, which are then combined with preprocessed clinical data and processed through a fully connected network (FCN) for final classification. A feature importance plot highlights key variables, showing that Medical History, BMI, and Height were the main contributors, emphasizing the significance of patient-specific data. While imaging features were valuable, they had lower importance, indicating that clinical data are crucial for accurate predictions. This framework promotes precise and interpretable predictions, enhancing transparency and building trust in AI-driven diagnoses for clinical integration.

摘要:骨質疏鬆症是一種常見的疾病,會增加骨折的風險,特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而,醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架,該框架整合了臨床和影像數據,以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路,VGG19、InceptionV3 和 ResNet50,從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分,然後將這些組成部分與預處理的臨床數據結合,並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數,表明病史、BMI 和身高是主要貢獻因素,強調了患者特定數據的重要性。雖然影像特徵很有價值,但它們的重要性較低,這表明臨床數據對於準確預測至關重要。此框架促进了準確且可解釋的預測,提高了透明度,並建立了對 AI 驅動診斷在臨床整合中的信任。

Deep learning-based auto-contouring of organs/structures-at-risk for pediatric upper abdominal radiotherapy

2411.00594v1 by Mianyong Ding, Matteo Maspero, Annemieke S Littooij, Martine van Grotel, Raquel Davila Fajardo, Max M van Noesel, Marry M van den Heuvel-Eibrink, Geert O Janssens

Purpose: This study aimed to develop a computed tomography (CT)-based multi-organ segmentation model for delineating organs-at-risk (OARs) in pediatric upper abdominal tumors and evaluate its robustness across multiple datasets. Materials and methods: In-house postoperative CTs from pediatric patients with renal tumors and neuroblastoma (n=189) and a public dataset (n=189) with CTs covering thoracoabdominal regions were used. Seventeen OARs were delineated: nine by clinicians (Type 1) and eight using TotalSegmentator (Type 2). Auto-segmentation models were trained using the in-house dataset (Model-PMC-UMCU) and a combined dataset of public data (Model-Combined). Performance was assessed with Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD95), and mean surface distance (MSD). Two clinicians rated clinical acceptability on a 5-point Likert scale across 15 patient contours. Model robustness was evaluated against sex, age, intravenous contrast, and tumor type. Results: Model-PMC-UMCU achieved mean DSC values above 0.95 for five of nine OARs, while spleen and heart ranged between 0.90 and 0.95. The stomach-bowel and pancreas exhibited DSC values below 0.90. Model-Combined demonstrated improved robustness across both datasets. Clinical evaluation revealed good usability, with both clinicians rating six of nine Type 1 OARs above four and six of eight Type 2 OARs above three. Significant performance differences were only found across age groups in both datasets, specifically in the left lung and pancreas. The 0-2 age group showed the lowest performance. Conclusion: A multi-organ segmentation model was developed, showcasing enhanced robustness when trained on combined datasets. This model is suitable for various OARs and can be applied to multiple datasets in clinical settings.

摘要:目的:本研究旨在开发一个基于计算机断层扫描 (CT) 的多器官分割模型,用于描绘小儿上腹部肿瘤中的危险器官 (OAR),并评估其在多个数据集中的稳健性。材料和方法:使用小儿肾肿瘤和神经母细胞瘤患者 (n=189) 的院内术后 CT 以及包含胸腹区域 CT 的公共数据集 (n=189)。描绘了 17 个 OAR:9 个由临床医生描绘 (类型 1),8 个使用 TotalSegmentator 描绘 (类型 2)。使用院内 (Model-PMC-UMCU) 和公共数据组合数据集 (Model-Combined) 训练自动分割模型。使用 Dice 相似性系数 (DSC)、95% 霍斯多夫距离 (HD95) 和平均表面距离 (MSD) 评估性能。两位临床医生使用 5 点李克特量表对 15 个患者轮廓的临床可接受性进行评级。针对性别、年龄、静脉对比和肿瘤类型评估模型的稳健性。结果:Model-PMC-UMCU 对九个 OAR 中的五个 OAR 的平均 DSC 值达到 0.95 以上,而脾脏和心脏在 0.90 到 0.95 之间。胃肠和胰腺的 DSC 值低于 0.90。Model-Combined 在两个数据集上都表现出改进的稳健性。临床评估显示出良好的可用性,两位临床医生对九个类型 1 OAR 中的六个评分均高于四分,对八个类型 2 OAR 中的六个评分均高于三分。仅在两个数据集的年龄组之间发现了显著的性能差异,特别是在左肺和胰腺中。0-2 岁年龄组表现最差。结论:开发了一个多器官分割模型,在合并数据集上训练时显示出增强的稳健性。该模型适用于各种 OAR,并且可以在临床环境中应用于多个数据集。
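
The segmentation study above reports the Dice Similarity Coefficient (DSC), defined for two binary masks A and B as 2|A ∩ B| / (|A| + |B|). A minimal sketch of that computation on flattened masks follows. The function name and the example masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not something the abstract specifies.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two flat binary masks (0/1 values)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if size_sum == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / size_sum


# Hypothetical 2x4 masks, flattened row by row.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
reference = [1, 0, 0, 0, 1, 0, 0, 0]
dsc = dice_coefficient(predicted, reference)  # 2 * 2 / (3 + 2) = 0.8
```

A DSC of 1.0 means the predicted contour overlaps the reference exactly, so the thresholds quoted in the abstract (0.90, 0.95) correspond to near-complete overlap.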

Enhancing the Traditional Chinese Medicine Capabilities of Large Language Model through Reinforcement Learning from AI Feedback

2411.00897v1 by Song Yu, Xiaofei Xu, Fangfei Xu, Li Li

Although large language models perform well in understanding and responding to user intent, their performance in specialized domains such as Traditional Chinese Medicine (TCM) remains limited due to lack of expertise. In addition, high-quality data related to TCM is scarce and difficult to obtain, making large language models ineffective in handling TCM tasks. In this work, we propose a framework to improve the performance of large language models for TCM tasks using only a small amount of data. First, we use medical case data for supervised fine-tuning of the large model, making it initially capable of performing TCM tasks. Subsequently, we further optimize the model's performance using reinforcement learning from AI feedback (RLAIF) to align it with the preference data. The ablation study also demonstrated the performance gain is attributed to both supervised fine-tuning and the direct policy optimization. The experimental results show that the model trained with a small amount of data achieves a significant performance improvement on a representative TCM task.

摘要:儘管大型語言模型在理解和回應使用者意圖方面表現良好,但由於缺乏專業知識,它們在傳統中醫 (TCM) 等專業領域的表現仍然有限。此外,與中醫相關的高品質資料稀少且難以取得,這使得大型語言模型在處理中醫任務時效果不彰。在這項工作中,我們提出一個架構,使用少量資料來改善大型語言模型在中醫任務中的表現。首先,我們使用醫療案例資料對大型模型進行監督微調,使其最初具備執行中醫任務的能力。隨後,我們進一步使用人工智慧回饋的強化學習 (RLAIF) 來最佳化模型的表現,使其與偏好資料保持一致。消融研究也證明,表現提升歸功於監督微調和直接策略最佳化。實驗結果顯示,使用少量資料訓練的模型在代表性的中醫任務上取得顯著的表現提升。

StepCountJITAI: simulation environment for RL with application to physical activity adaptive intervention

2411.00336v1 by Karine Karine, Benjamin M. Marlin

The use of reinforcement learning (RL) to learn policies for just-in-time adaptive interventions (JITAIs) is of significant interest in many behavioral intervention domains including improving levels of physical activity. In a messaging-based physical activity JITAI, a mobile health app is typically used to send messages to a participant to encourage engagement in physical activity. In this setting, RL methods can be used to learn what intervention options to provide to a participant in different contexts. However, deploying RL methods in real physical activity adaptive interventions comes with challenges: the cost and time constraints of real intervention studies result in limited data to learn adaptive intervention policies. Further, commonly used RL simulation environments have dynamics that are of limited relevance to physical activity adaptive interventions and thus shed little light on what RL methods may be optimal for this challenging application domain. In this paper, we introduce StepCountJITAI, an RL environment designed to foster research on RL methods that address the significant challenges of policy learning for adaptive behavioral interventions.

摘要:利用強化學習 (RL) 來學習即時適應性介入 (JITAI) 的策略,在許多行為介入領域中備受關注,包括提升體能活動的層級。在基於訊息的體能活動 JITAI 中,行動健康應用程式通常用於向參與者傳送訊息,以鼓勵參與體能活動。在此設定中,RL 方法可被用於學習在不同情境下提供給參與者的介入選項。然而,在實際體能活動適應性介入中部署 RL 方法會遇到挑戰:實際介入研究的成本和時間限制,導致可供學習適應性介入策略的資料有限。此外,常用的 RL 模擬環境具有與體能活動適應性介入相關性有限的動態,因此難以了解哪些 RL 方法可能最適合這個具挑戰性的應用領域。在本文中,我們介紹 StepCountJITAI,這是一個 RL 環境,旨在促進對 RL 方法的研究,以應對適應性行為介入策略學習的重大挑戰。

Strongly Topology-preserving GNNs for Brain Graph Super-resolution

2411.02525v1 by Pragya Singh, Islem Rekik

Brain graph super-resolution (SR) is an under-explored yet highly relevant task in network neuroscience. It circumvents the need for costly and time-consuming medical imaging data collection, preparation, and processing. Current SR methods leverage graph neural networks (GNNs) thanks to their ability to natively handle graph-structured datasets. However, most GNNs perform node feature learning, which presents two significant limitations: (1) they require computationally expensive methods to learn complex node features capable of inferring connectivity strength or edge features, which do not scale to larger graphs; and (2) computations in the node space fail to adequately capture higher-order brain topologies such as cliques and hubs. Yet numerous studies have shown that brain graph topology is crucial in identifying the onset and presence of various neurodegenerative disorders such as Alzheimer's and Parkinson's disease. Motivated by these challenges and applications, we propose our STP-GSR framework. It is the first graph SR architecture to perform representation learning in higher-order topological space. Specifically, using the primal-dual graph formulation from graph theory, we develop an efficient mapping from the edge space of our low-resolution (LR) brain graphs to the node space of a high-resolution (HR) dual graph. This approach ensures that node-level computations on this dual graph correspond naturally to edge-level learning on our HR brain graphs, thereby enforcing strong topological consistency within our framework. Additionally, our framework is GNN layer agnostic and can easily learn from smaller, scalable GNNs, reducing computational requirements. We comprehensively benchmark our framework across seven key topological measures and observe that it significantly outperforms the previous state-of-the-art methods and baselines.

摘要:腦圖像超解析度 (SR) 是網路神經科學中一個尚未充分探索但高度相關的任務。它避開了代價高昂且耗時的醫學影像資料收集、準備和處理的需要。目前的 SR 方法利用圖神經網路 (GNN),因為它們能夠原生處理圖形結構的資料集。然而,大多數 GNN 都執行節點特徵學習,這提出了兩個重大的限制:(1) 它們需要以計算成本高的方式來學習複雜的節點特徵,這些特徵能夠推論連接強度或邊緣特徵,這無法擴展到更大的圖形;(2) 節點空間中的計算無法充分擷取高階腦部拓撲,例如派系和樞紐。然而,許多研究表明,腦圖形拓撲對於識別各種神經退化性疾病(如阿茲海默症和帕金森氏症)的發病和存在至關重要。受到這些挑戰和應用激勵,我們提出了我們的 STP-GSR 架構。它是第一個在高階拓撲空間中執行表示學習的圖形 SR 架構。具體來說,我們使用圖論中的原始對偶圖形公式,從我們低解析度 (LR) 腦圖形的邊緣空間開發了一個高效的對映,對映到高解析度 (HR) 對偶圖形節點空間。這種方法確保了在這個對偶圖形上的節點層級計算自然地對應於我們 HR 腦圖形上的邊緣層級學習,從而強制執行我們框架內強大的拓撲一致性。此外,我們的框架與 GNN 層無關,並且可以輕鬆地從更小、可擴展的 GNN 中學習,從而減少計算需求。我們在七項關鍵拓撲測量中全面評定了我們的框架,並觀察到它顯著優於以往的先進方法和基線。
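
The STP-GSR abstract above relies on a primal-dual graph formulation in which node-level computation on a dual graph corresponds to edge-level learning on the original graph. The sketch below shows the basic construction under the assumption that "dual" refers to the line graph: every edge of the original graph becomes a node, and two such nodes are connected when the original edges share an endpoint. The function name and the example graph are hypothetical, and this is not the authors' implementation.

```python
from itertools import combinations


def line_graph(edges):
    """Build the line graph of an undirected graph.

    edges: list of (u, v) pairs.
    Returns (dual_nodes, dual_edges), where each dual node is an original
    edge and each dual edge joins two original edges sharing an endpoint.
    """
    dual_nodes = [tuple(sorted(edge)) for edge in edges]
    dual_edges = [
        (e1, e2)
        for e1, e2 in combinations(dual_nodes, 2)
        if set(e1) & set(e2)
    ]
    return dual_nodes, dual_edges


# Hypothetical 4-region path graph: 0 - 1 - 2 - 3.
path_nodes, path_edges = line_graph([(0, 1), (1, 2), (2, 3)])

# Hypothetical triangle: every pair of edges shares a vertex.
tri_nodes, tri_edges = line_graph([(0, 1), (1, 2), (0, 2)])
```

With this mapping, a connectivity weight on an original edge becomes a feature of a dual node, so an ordinary node-feature GNN applied to the dual graph is effectively learning edge features of the original graph.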

Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes

2411.02523v1 by Balu Bhasuran, Qiao Jin, Yuzhang Xie, Carl Yang, Karim Hanna, Jennifer Costa, Cindy Shavor, Zhiyong Lu, Zhe He

Differential diagnosis is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes from 50 case reports from PubMed Central were created incorporating patient demographics, symptoms, and lab results. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low. Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by LLMs for differential diagnosis.

摘要:鑑別診斷對於醫學至關重要,因為它有助於醫療保健提供者系統區分具有相似症狀的疾病。這項研究評估了實驗室檢驗結果對大型語言模型 (LLM) 做出的鑑別診斷 (DDx) 的影響。從 PubMed Central 的 50 份病例報告中建立了臨床簡報,其中包含患者人口統計、症狀和實驗室結果。測試了五個 LLM GPT-4、GPT-3.5、Llama-2-70b、Claude-2 和 Mixtral-8x7B,以生成帶和不帶實驗室數據的前 10、前 5 和前 1 DDx。進行了一項涉及 GPT-4、知識圖譜和臨床醫生的綜合評估。GPT-4 表現最佳,在有實驗室數據的情況下,前 1 名診斷的準確率達到 55%,前 10 名的準確率達到 60%,寬鬆準確率高達 80%。實驗室結果顯著提高了準確率,GPT-4 和 Mixtral 表現出色,儘管完全匹配率較低。LLM 通常可以正確解釋包括肝功能、代謝/毒理學檢查和血清學/免疫測試在內的實驗室檢驗,以進行鑑別診斷。
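
The differential-diagnosis study above reports Top 1, Top 5, and Top 10 accuracy. A minimal sketch of a strict top-k accuracy over ranked diagnosis lists follows. It uses exact string matching after lowercasing, which is a simplification: the abstract describes a richer evaluation (GPT-4, a knowledge graph, and clinicians, plus a "lenient" accuracy) that this sketch does not reproduce. The function name and the example cases are hypothetical.

```python
def top_k_accuracy(ranked_predictions, gold_diagnoses, k):
    """Fraction of cases whose gold diagnosis appears in the first k predictions."""
    if len(ranked_predictions) != len(gold_diagnoses):
        raise ValueError("need one ranked list per gold diagnosis")
    hits = 0
    for ranked, gold in zip(ranked_predictions, gold_diagnoses):
        top_k = [diagnosis.strip().lower() for diagnosis in ranked[:k]]
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_diagnoses)


# Hypothetical ranked outputs for three vignettes.
predictions = [
    ["Sepsis", "Pneumonia", "Influenza"],
    ["Gout", "Septic arthritis", "Lupus"],
    ["Hepatitis A", "Cholecystitis", "Pancreatitis"],
]
gold = ["pneumonia", "gout", "pancreatitis"]

top1 = top_k_accuracy(predictions, gold, k=1)
top3 = top_k_accuracy(predictions, gold, k=3)
```

Running the same function on outputs generated with and without lab data would give the kind of paired comparison the abstract describes.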

Deep Learning Predicts Mammographic Breast Density in Clinical Breast Ultrasound Images

2411.00891v2 by Arianna Bunnell, Dustin Valdez, Thomas K. Wolfgruber, Brandon Quon, Kailee Hung, Brenda Y. Hernandez, Todd B. Seto, Jeffrey Killeen, Marshall Miyoshi, Peter Sadowski, John A. Shepherd

Background: Breast density, as derived from mammographic images and defined by the American College of Radiology's Breast Imaging Reporting and Data System (BI-RADS), is one of the strongest risk factors for breast cancer. Breast ultrasound (BUS) is an alternative breast cancer screening modality, particularly useful for early detection in low-resource, rural contexts. The purpose of this study was to explore an artificial intelligence (AI) model to predict BI-RADS mammographic breast density category from clinical, handheld BUS imaging. Methods: All data are sourced from the Hawaii and Pacific Islands Mammography Registry. We compared deep learning methods from BUS imaging, as well as machine learning models from image statistics alone. The use of AI-derived BUS density as a risk factor for breast cancer was then compared to clinical BI-RADS breast density while adjusting for age. The BUS data were split by individual into 70/20/10% groups for training, validation, and testing. Results: 405,120 clinical BUS images from 14,066 women were selected for inclusion in this study, resulting in 9,846 women for training (302,574 images), 2,813 for validation (11,223 images), and 1,406 for testing (4,042 images). On the held-out testing set, the strongest AI model achieves AUROC 0.854 predicting BI-RADS mammographic breast density from BUS imaging and outperforms all shallow machine learning methods based on image statistics. In cancer risk prediction, age-adjusted AI BUS breast density predicted 5-year breast cancer risk with 0.633 AUROC, as compared to 0.637 AUROC from age-adjusted clinical breast density. Conclusions: BI-RADS mammographic breast density can be estimated from BUS imaging with high accuracy using a deep learning model. Furthermore, we demonstrate that AI-derived BUS breast density is predictive of 5-year breast cancer risk in our population.

摘要:背景:乳房密度是根据乳房 X 光图像衍生而来,并由美国放射学院的乳房影像报告和数据系统 (BI-RADS) 定义,是乳腺癌最强的风险因素之一。乳房超音波 (BUS) 是一种替代的乳腺癌筛检方式,特别适用于资源匮乏的农村环境中的早期侦测。本研究的目的是探索一种人工智能 (AI) 模型,以根据临床手持式 BUS 影像预测 BI-RADS 乳房 X 光摄影乳房密度类别。方法:所有数据均来自夏威夷和太平洋岛屿乳房摄影注册中心。我们比较了来自 BUS 影像的深度学习方法,以及仅来自图像统计数据的机器学习模型。然后将 AI 衍生的 BUS 密度用作乳腺癌的风险因子,与临床 BI-RADS 乳房密度进行比较,同时调整年龄。BUS 数据按个人分为 70/20/10% 的组别,用于训练、验证和测试。结果:本研究选取了来自 14,066 名女性的 405,120 张临床 BUS 影像,产生了 9,846 名女性用于训练(302,574 张影像)、2,813 名用于验证(11,223 张影像)和 1,406 名用于测试(4,042 张影像)。在留出的测试集中,最强的 AI 模型实现了 0.854 的 AUROC,根据 BUS 影像预测 BI-RADS 乳房 X 光摄影乳房密度,并且优于所有基于图像统计的浅层机器学习方法。在癌症风险预测中,经年龄调整的 AI BUS 乳房密度预测 5 年乳腺癌风险的 AUROC 为 0.633,而经年龄调整的临床乳房密度预测的 AUROC 为 0.637。结论:使用深度学习模型,可以从 BUS 影像中以高精度估计 BI-RADS 乳房 X 光摄影乳房密度。此外,我们证明了 AI 衍生的 BUS 乳房密度可以预测我们人群中 5 年的乳腺癌风险。
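
The breast ultrasound study above splits its data "by individual" into 70/20/10% groups, which keeps all images from one woman in the same partition and avoids leakage between training and testing. A minimal sketch of such a patient-level split follows. The function name, the seed, and the toy image-to-patient mapping are hypothetical, and the authors' actual procedure may differ.

```python
import random


def split_by_individual(image_to_patient, fractions=(0.7, 0.2, 0.1), seed=0):
    """Assign whole patients (not single images) to train/val/test.

    image_to_patient: dict mapping image id -> patient id.
    Returns (splits, groups): image ids per partition and patient ids per partition.
    """
    patients = sorted(set(image_to_patient.values()))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(patients)

    n = len(patients)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),  # remainder goes to test
    }

    splits = {name: [] for name in groups}
    for image, patient in image_to_patient.items():
        for name, members in groups.items():
            if patient in members:
                splits[name].append(image)
    return splits, groups


# Hypothetical data: 10 patients with 3 images each.
mapping = {f"img_{p}_{i}": f"patient_{p}" for p in range(10) for i in range(3)}
splits, groups = split_by_individual(mapping)
```

Because partitions are defined on patients, the image counts per partition need not follow the 70/20/10 ratio exactly when patients contribute different numbers of images.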

Monitoring fairness in machine learning models that predict patient mortality in the ICU

2411.00190v2 by Tempest A. van Schaik, Xinggang Liu, Louis Atallah, Omar Badawi

This work proposes a fairness monitoring approach for machine learning models that predict patient mortality in the ICU. We investigate how well models perform for patient groups that differ in race, sex, and medical diagnosis. We also investigate documentation bias in clinical measurement, showing how fairness analysis provides a more detailed and insightful comparison of model performance than traditional accuracy metrics alone.

摘要:這項研究提出一個公平性監控方法,用於預測加護病房中病患死亡率的機器學習模型。我們探討模型在不同種族、性別和醫療診斷的病患群體中表現如何。我們探討臨床測量中的文件偏差,說明公平性分析如何提供比傳統準確性指標更詳細且有見地的模型效能比較。
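
The fairness-monitoring abstract above argues that per-group analysis reveals more than a single accuracy figure. The sketch below illustrates the point with recall (the fraction of actual deaths the model flags) computed separately for each patient group. The choice of recall as the monitored metric, the function names, and the toy labels are assumptions made for illustration; the abstract does not say which metrics the authors track.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Overall fraction of correct predictions."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall_by_group(y_true, y_pred, groups):
    """Per-group recall: flagged positives divided by actual positives."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            positives[g] += 1
            if p == 1:
                hits[g] += 1
    return {g: hits[g] / positives[g] for g in positives}


# Hypothetical labels: 1 = died in ICU, 0 = survived.
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0]
group = ["A", "A", "A", "B", "B", "B"]

overall = accuracy(y_true, y_pred)
per_group = recall_by_group(y_true, y_pred, group)
```

In this toy example the overall accuracy looks reasonable while recall for group B is half that of group A, which is the kind of gap a monitoring dashboard would be expected to surface.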

Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy

2411.00178v1 by Panagiota Gatoula, Dimitrios E. Diamantis, Anastasios Koulaouzidis, Cristina Carretero, Stefania Chetcuti-Zammit, Pablo Cortegoso Valdivia, Begoña González-Suárez, Alessandro Mussetto, John Plevris, Alexander Robertson, Bruno Rosa, Ervin Toth, Dimitris K. Iakovidis

Sharing retrospectively acquired data is essential for both clinical research and training. Synthetic Data Generation (SDG), using Artificial Intelligence (AI) models, can overcome privacy barriers in sharing clinical data, enabling advancements in medical diagnostics. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. The paper contributes by a) presenting a protocol for the systematic evaluation of synthetic images by medical experts and b) applying it to assess TIDE-II, a novel variational autoencoder-based model for high-resolution WCE image synthesis, with a comprehensive qualitative evaluation conducted by 10 international WCE specialists, focusing on image quality, diversity, realism, and clinical decision-making. The results show that TIDE-II generates clinically relevant WCE images, helping to address data scarcity and enhance diagnostic tools. The proposed protocol serves as a reference for future research on medical image-generation techniques.

摘要:回顧性獲取的資料分享對於臨床研究和訓練至關重要。使用人工智慧 (AI) 模型的合成資料產生 (SDG) 能夠克服臨床資料共享中的隱私障礙,促進醫療診斷的進展。本研究專注於臨床評估醫學 SDG,並透過無線膠囊內視鏡 (WCE) 影像診斷發炎性腸道疾病 (IBD) 的概念驗證調查。本文的貢獻包括:a) 提出由醫學專家系統性評估合成影像的協定,以及 b) 將其應用於評估 TIDE-II,這是一個用於高解析度 WCE 影像合成的變異自動編碼器模型,並由 10 位國際 WCE 專家進行全面的品質評估,重點在於影像品質、多樣性、真實性,以及臨床決策制定。結果顯示 TIDE-II 產生了臨床相關的 WCE 影像,有助於解決資料稀少的問題,並增強診斷工具。所提出的協定可作為未來醫學影像產生技術研究的參考。

Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning

2411.00173v1 by John Wu, David Wu, Jimeng Sun

Medical coding, the translation of unstructured clinical text into standardized medical codes, is a crucial but time-consuming healthcare practice. Though large language models (LLM) could automate the coding process and improve the efficiency of such tasks, interpretability remains paramount for maintaining patient trust. Current efforts in interpretability of medical coding applications rely heavily on label attention mechanisms, which often leads to the highlighting of extraneous tokens irrelevant to the ICD code. To facilitate accurate interpretability in medical language models, this paper leverages dictionary learning that can efficiently extract sparsely activated representations from dense language model embeddings in superposition. Compared with common label attention mechanisms, our model goes beyond token-level representations by building an interpretable dictionary which enhances the mechanistic-based explanations for each ICD code prediction, even when the highlighted tokens are medically irrelevant. We show that dictionary features can steer model behavior, elucidate the hidden meanings of upwards of 90% of medically irrelevant tokens, and are human interpretable.

摘要:醫療編碼是將非結構化的臨床文本轉換為標準化醫療代碼的過程,是一項至關重要的醫療保健實務,但耗時費力。儘管大型語言模型 (LLM) 可以自動化編碼流程並提升此類任務的效率,但可解釋性對於維護患者信任仍然至關重要。目前在醫療編碼應用程式的可解釋性方面所做的努力,極度依賴標籤注意機制,這通常會導致強調與 ICD 代碼無關的無關符號。為了促進醫療語言模型的準確可解釋性,本文利用字典學習,可以有效地從疊加的稠密語言模型嵌入中提取稀疏激活的表示。與常見的標籤注意機制相比,我們的模型超越了符號層級的表示,建立了一個可解釋的字典,增強了對每個 ICD 代碼預測的基於機制的解釋,即使強調的符號在醫學上無關緊要。我們證明字典特徵可以引導模型行為,闡明 90% 以上在醫學上無關的符號的隱藏意義,並且人類可以解釋。

2410.24032v1 by Yingzhe Peng, Xiaoting Qin, Zhiyang Zhang, Jue Zhang, Qingwei Lin, Xu Yang, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

The rise of large language models (LLMs) has revolutionized user interactions with knowledge-based systems, enabling chatbots to synthesize vast amounts of information and assist with complex, exploratory tasks. However, LLM-based chatbots often struggle to provide personalized support, particularly when users start with vague queries or lack sufficient contextual information. This paper introduces the Collaborative Assistant for Personalized Exploration (CARE), a system designed to enhance personalization in exploratory tasks by combining a multi-agent LLM framework with a structured user interface. CARE's interface consists of a Chat Panel, Solution Panel, and Needs Panel, enabling iterative query refinement and dynamic solution generation. The multi-agent framework collaborates to identify both explicit and implicit user needs, delivering tailored, actionable solutions. In a within-subject user study with 22 participants, CARE was consistently preferred over a baseline LLM chatbot, with users praising its ability to reduce cognitive load, inspire creativity, and provide more tailored solutions. Our findings highlight CARE's potential to transform LLM-based systems from passive information retrievers to proactive partners in personalized problem-solving and exploration.

摘要:大型語言模型 (LLM) 的興起徹底改變了使用者與基於知識的系統互動的方式,讓聊天機器人能夠綜合大量的資訊,並協助進行複雜的探索性任務。然而,基於 LLM 的聊天機器人通常難以提供個人化的支援,特別是在使用者一開始提出的查詢很模糊,或缺乏足夠的脈絡資訊時。本文介紹了個人化探索的協作助理 (CARE),一個旨在透過結合多重代理 LLM 架構與結構化的使用者介面來增強探索性任務中個人化的系統。CARE 的介面包含聊天面板、解決方案面板和需求面板,可進行反覆的查詢精煉和動態的解決方案產生。多重代理架構協作識別明確和隱含的使用者需求,提供客製化且可行的解決方案。在一個有 22 位參與者的受試者內研究中,CARE 持續獲得比基準 LLM 聊天機器人更好的評價,使用者讚賞其減輕認知負擔、激發創造力,以及提供更客製化解決方案的能力。我們的發現突顯了 CARE 將基於 LLM 的系統從被動的資訊檢索者轉變為個人化問題解決和探索中的主動夥伴的潛力。

Neural Network Verification with PyRAT

2410.23903v1 by Augustin Lemesle, Julien Lehmann, Tristan Le Gall

As AI systems become increasingly popular and are used in various critical domains (health, transport, energy, ...), the need to provide guarantees of, and build trust in, their safety is undeniable. To this end, we present PyRAT, a tool based on abstract interpretation to verify the safety and the robustness of neural networks. In this paper, we describe the different abstractions used by PyRAT to find the reachable states of a neural network starting from its input, as well as the main features of the tool that provide fast and accurate analysis of neural networks. PyRAT has already been used in several collaborations to ensure safety guarantees, with its second place at the VNN-Comp 2024 showcasing its performance.

摘要:隨著 AI 系統越來越普及,並用於各種關鍵領域(健康、運輸、能源,...),提供其安全保證和信任的需求是不容否認的。為此,我們提出了 PyRAT,一個基於抽象詮釋的工具,用於驗證神經網路的安全性和穩健性。在本文中,我們描述了 PyRAT 用於從神經網路輸入中找出可達狀態的不同抽象,以及該工具的主要功能,以提供快速且準確的神經網路分析。PyRAT 已在多項合作中用於確保安全保證,其在 VNN-Comp 2024 中獲得第二名,展示了其效能。

Counterfactual MRI Data Augmentation using Conditional Denoising Diffusion Generative Models

2410.23835v1 by Pedro Morão, Joao Santinha, Yasna Forghani, Nuno Loução, Pedro Gouveia, Mario A. T. Figueiredo

Deep learning (DL) models in medical imaging face challenges in generalizability and robustness due to variations in image acquisition parameters (IAP). In this work, we introduce a novel method using conditional denoising diffusion generative models (cDDGMs) to generate counterfactual magnetic resonance (MR) images that simulate different IAP without altering patient anatomy. We demonstrate that using these counterfactual images for data augmentation can improve segmentation accuracy, particularly in out-of-distribution settings, enhancing the overall generalizability and robustness of DL models across diverse imaging conditions. Our approach shows promise in addressing domain and covariate shifts in medical imaging. The code is publicly available at https://github.com/pedromorao/Counterfactual-MRI-Data-Augmentation

摘要:深度學習 (DL) 模型在醫學影像中會因影像擷取參數 (IAP) 的變化而面臨可概括性和穩健性的挑戰。在這項工作中,我們提出了一種使用條件式去噪擴散生成模型 (cDDGMs) 的新方法,以產生反事實磁共振 (MR) 影像,模擬不同的 IAP,而不會改變患者的解剖結構。我們證明使用這些反事實影像進行資料擴充可以提高分割準確度,特別是在分佈外設定中,增強 DL 模型在不同影像條件下的整體可概括性和穩健性。我們的做法顯示了解決醫學影像中的領域和協變數轉移的前景。程式碼已公開於 https://github.com/pedromorao/Counterfactual-MRI-Data-Augmentation

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

2410.23822v1 by Jinlong He, Pengfei Li, Gang Liu, Shenjun Zhong

Multimodal Large Language Models (MLLMs) inherit the superior text understanding capabilities of LLMs and extend these capabilities to multimodal scenarios. These models achieve excellent results in the general domain of multimodal tasks. However, in the medical domain, the substantial training costs and the requirement for extensive medical data pose challenges to the development of medical MLLMs. Furthermore, due to the free-text form of answers, tasks such as visual grounding that need to produce output in a prescribed form become difficult for MLLMs. So far, no medical MLLM has addressed medical visual grounding. For the medical visual grounding task, which involves identifying locations in medical images based on short text descriptions, we propose Parameter-efficient Fine-tuning medical multimodal large language models for Medical Visual Grounding (PFMVG). To validate the performance of the model, we evaluate it on a public benchmark dataset for medical visual grounding, where it achieves competitive results and significantly outperforms GPT-4v. Our code will be open sourced after peer review.

摘要:多模态大型语言模型 (MLLM) 继承了 LLM 优越的文本理解能力,并将这些能力扩展到多模态场景。这些模型在多模态任务的通用领域中取得了出色的成果。然而,在医学领域,大量的训练成本和对广泛医学数据的需求对医学 MLLM 的发展构成了挑战。此外,由于答案的自由文本形式,需要以规定形式生成输出的任务(例如视觉基础)对于 MLLM 来说变得困难。到目前为止,还没有医学 MLLM 在医学视觉基础领域工作。对于医学视觉基础任务,它涉及根据简短的文本描述识别医学图像中的位置,我们提出了用于医学视觉基础的参数高效微调医学多模态大型语言模型 (PFMVG)。为了验证模型的性能,我们在医学视觉基础的公共基准数据集上对其进行了评估,它取得了有竞争力的结果,并且明显优于 GPT-4v。我们的代码将在同行评审后开源。

Improving snore detection under limited dataset through harmonic/percussive source separation and convolutional neural networks

2410.23796v1 by F. D. Gonzalez-Martinez, J. J. Carabias-Orti, F. J. Canadas-Quesada, N. Ruiz-Reyes, D. Martinez-Munoz, S. Garcia-Galan

Snoring, an acoustic biomarker commonly observed in individuals with Obstructive Sleep Apnoea Syndrome (OSAS), holds significant potential for diagnosing and monitoring this recognized clinical disorder. Irrespective of snoring types, most snoring instances exhibit identifiable harmonic patterns manifested through distinctive energy distributions over time. In this work, we propose a novel method to differentiate monaural snoring from non-snoring sounds by analyzing the harmonic content of the input sound using harmonic/percussive sound source separation (HPSS). The resulting feature, based on the harmonic spectrogram from HPSS, is employed as input data for conventional neural network architectures, aiming to enhance snoring detection performance even under a limited data learning framework. To evaluate the performance of our proposal, we studied two different scenarios: 1) using a large dataset of snoring and interfering sounds, and 2) using a reduced training set composed of around 1% of the data material. In the former scenario, the proposed HPSS-based feature provides competitive results compared to other input features from the literature. However, the key advantage of the proposed method lies in the superior performance of the harmonic spectrogram derived from HPSS in a limited data learning context. In this particular scenario, using the proposed harmonic feature significantly enhances the performance of all the studied architectures in comparison to the classical input features documented in the existing literature. This finding clearly demonstrates that incorporating harmonic content enables more reliable learning of the essential time-frequency characteristics that are prevalent in most snoring sounds, even in scenarios where the amount of training data is limited.

摘要:鼾聲是一種在阻塞性睡眠呼吸中止症候群 (OSAS) 患者中常見的聲學生物標記,對於診斷和監控此公認的臨床疾病具有顯著潛力。無論鼾聲類型如何,大多數鼾聲都表現出可識別的諧波模式,並隨著時間推移表現出獨特的能量分佈。在這項工作中,我們提出了一種新方法,通過使用諧波/打擊聲源分離 (HPSS) 分析輸入聲音的諧波內容,將單聲道鼾聲與非鼾聲區分開來。基於 HPSS 的諧波頻譜圖所產生的特徵,被用作傳統神經網路架構的輸入資料,旨在即使在有限資料學習架構下也能增強鼾聲偵測效能。為了評估我們提案的效能,我們研究了兩種不同的情境:1) 使用大量的鼾聲和干擾聲資料集,以及 2) 使用由約 1% 資料素材組成的縮減訓練集。在前一種情境中,與文獻中的其他輸入特徵相比,所提出的基於 HPSS 的特徵提供了具有競爭力的結果。然而,所提出方法的主要優點在於,在有限資料學習情境中,源自 HPSS 的諧波頻譜圖具有優異的效能。在這個特定情境中,與現有文獻中記載的傳統輸入特徵相比,使用所提出的諧波特徵顯著增強了所有研究架構的效能。這一發現清楚地表明,即使在訓練資料量有限的情境中,納入諧波內容也能夠更可靠地學習大多數鼾聲中普遍存在的必要時頻特徵。
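
The harmonic feature described above can be sketched with median-filtering HPSS, one common formulation of harmonic/percussive separation. The abstract does not say which HPSS variant, kernel size, or masking rule the authors use, so those choices and all function names below are assumptions, not the paper's implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_1d(S, size, axis):
    """Median-filter a 2D array along one axis (odd window, reflect padding)."""
    pad = size // 2
    pad_width = [(0, 0), (0, 0)]
    pad_width[axis] = (pad, pad)
    padded = np.pad(S, pad_width, mode="reflect")
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)

def harmonic_spectrogram(S, kernel=17, eps=1e-10):
    """S: magnitude spectrogram, shape (frequency bins, time frames)."""
    H = median_filter_1d(S, kernel, axis=1)  # smooth over time: keeps steady tones
    P = median_filter_1d(S, kernel, axis=0)  # smooth over frequency: keeps transients
    mask = H**2 / (H**2 + P**2 + eps)        # soft mask favouring harmonic energy
    return mask * S

# Synthetic check: one sustained tone (horizontal line), one click (vertical line).
spec = np.zeros((64, 64))
spec[10, :] = 1.0
spec[:, 30] = 1.0
harm = harmonic_spectrogram(spec)
```

In `harm` the sustained tone survives while the click is suppressed away from the tone's frequency bin; an array like this would then be fed to the classifier in place of the raw spectrogram.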

The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams

2410.23769v1 by Yunqi Zhu, Wen Tang, Ying Sun, Xuebing Yang

Recent research on large language models (LLMs) has primarily focused on their adaptation and application in specialized domains. The application of LLMs in the medical field is mainly concentrated on tasks such as the automation of medical report generation, summarization, diagnostic reasoning, and question-and-answer interactions between doctors and patients. The challenge of becoming a good teacher is more formidable than that of becoming a good student, and this study pioneers the application of LLMs in the field of medical education. In this work, we investigate the extent to which LLMs can generate medical qualification exam questions and corresponding answers based on few-shot prompts. Using a real-world Chinese dataset of elderly chronic diseases, we tasked eight widely used LLMs (ERNIE 4, ChatGLM 4, Doubao, Hunyuan, Spark 4, Qwen, Llama 3, and Mistral) with generating open-ended questions and answers based on a subset of sampled admission reports. Furthermore, we engaged medical experts to manually evaluate these open-ended questions and answers across multiple dimensions. The study found that LLMs, when given few-shot prompts, can effectively mimic real-world medical qualification exam questions, but there is room for improvement in the correctness, evidence-based statements, and professionalism of the generated answers. Moreover, LLMs also demonstrate a decent ability to correct reference answers. Given the immense potential of artificial intelligence in the medical field, the task of generating questions and answers for medical qualification exams aimed at medical students, interns and residents can be a significant focus of future research.

摘要:針對大型語言模型 (LLM) 的近期研究主要集中在它們在特定領域的適應和應用。LLM 在醫學領域的應用主要集中在自動化病歷產生、摘要、診斷推理以及醫生與病人之間問答互動等任務。成為一名好老師的挑戰比成為一名好學生更艱鉅,而本研究開創了 LLM 在醫學教育領域的應用。在這項工作中,我們探討了 LLM 在少數提示下產生醫學資格考試題目和對應答案的程度。利用一個真實世界的老年慢性疾病中文數據集,我們讓 LLM 根據八個廣泛使用的 LLM(包括 ERNIE 4、ChatGLM 4、豆包、混元、Spark 4、Qwen、Llama 3 和 Mistral)抽取的入院報告子集產生開放式問題和答案。此外,我們聘請醫學專家手動評估這些開放式問題和答案的多個面向。研究發現,LLM 在使用少數提示後,可以有效模擬真實世界的醫學資格考試題目,而產生的答案在正確性、循證陳述和專業性方面仍有改進空間。此外,LLM 也展現出相當程度更正和修正參考答案的能力。鑑於人工智能在醫學領域的巨大潛力,產生針對醫學生、實習醫生和住院醫生的醫學資格考試題目和答案的任務,可以成為未來研究的重要重點。

Artificial intelligence to improve clinical coding practice in Scandinavia: a crossover randomized controlled trial

2410.23725v1 by Taridzo Chomutare, Therese Olsen Svenning, Miguel Ángel Tejedor Hernández, Phuong Dinh Ngo, Andrius Budrionis, Kaisa Markljung, Lill Irene Hind, Torbjørn Torsvik, Karl Øyvind Mikalsen, Aleksandar Babic, Hercules Dalianis

Trial design: Crossover randomized controlled trial. Methods: An AI tool, Easy-ICD, was developed to assist clinical coders and was tested for improving both accuracy and time in a user study in Norway and Sweden. Participants were randomly assigned to two groups, and crossed over between coding complex (longer) texts versus simple (shorter) texts, while using our tool versus not using our tool. Results: Based on Mann-Whitney U test, the median coding time difference for complex clinical text sequences was 123 seconds (P < .001, 95% CI: 81 to 164), representing a 46% reduction in median coding time when our tool is used. There was no significant time difference for simpler text sequences. For coding accuracy, the improvement we noted for both complex and simple texts was not significant. Conclusions: This study demonstrates the potential of AI to transform common tasks in clinical workflows, with ostensible positive impacts on work efficiencies for complex clinical coding tasks. Further studies within hospital workflows are required before these presumed impacts can be more clearly understood.

摘要:試驗設計:交叉隨機對照試驗。方法:開發了一種 AI 工具 Easy-ICD,以協助臨床編碼員,並在挪威和瑞典進行的一項使用者研究中測試其在準確性和時間上的改進。參與者被隨機分為兩組,並在使用我們的工具與不使用我們的工具的情況下,對複雜(較長)文本與簡單(較短)文本進行編碼交叉。結果:根據 Mann-Whitney U 檢定,複雜臨床文本序列的中位數編碼時間差為 123 秒(P < .001,95% CI:81 至 164),表示使用我們的工具時中位數編碼時間減少了 46%。對於較簡單的文本序列,沒有顯著的時間差異。對於編碼準確性,我們對複雜文本和簡單文本所觀察到的改進並不顯著。結論:這項研究展示了 AI 在轉換臨床工作流程中常見任務的潛力,對複雜臨床編碼任務的工作效率有明顯的正面影響。在這些假設影響能更清楚地被理解之前,需要在醫院工作流程中進行進一步的研究。
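
For readers unfamiliar with the statistic named in the results, the following sketch computes a Mann-Whitney U statistic and a relative reduction in medians. The timing values are invented purely for illustration and are not the trial's data; the function names are hypothetical, and the trial's own estimator for the median difference may differ from a plain difference of medians.

```python
import numpy as np

def mann_whitney_u(x, y):
    """U for sample x versus y: pairs with x > y, counting ties as one half."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return float(np.sum(x > y) + 0.5 * np.sum(x == y))

def median_reduction(baseline, treated):
    """Relative drop in the median, as a fraction of the baseline median."""
    m_base = np.median(baseline)
    m_treat = np.median(treated)
    return float((m_base - m_treat) / m_base)

# Made-up coding times in seconds (NOT from the study).
without_tool = [300.0, 280.0, 250.0]
with_tool = [200.0, 190.0, 260.0]

u_stat = mann_whitney_u(without_tool, with_tool)
reduction = median_reduction(without_tool, with_tool)
```

Dividing `u_stat` by the number of pairs (here 9) gives the probability that a randomly chosen coding time without the tool exceeds one with the tool, which is often easier to interpret than U itself.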

Enhancing Brain Tumor Classification Using TrAdaBoost and Multi-Classifier Deep Learning Approaches

2411.00875v1 by Mahin Mohammadi, Saman Jamshidi

Brain tumors pose a serious health threat due to their rapid growth and potential for metastasis. While medical imaging has advanced significantly, accurately identifying and characterizing these tumors remains a challenge. This study addresses this challenge by leveraging the innovative TrAdaBoost methodology to enhance the Brain Tumor Segmentation (BraTS2020) dataset, aiming to improve the efficiency and accuracy of brain tumor classification. Our approach combines state-of-the-art deep learning algorithms, including the Vision Transformer (ViT), Capsule Neural Network (CapsNet), and convolutional neural networks (CNNs) such as ResNet-152 and VGG16. By integrating these models within a multi-classifier framework, we harness the strengths of each approach to achieve more robust and reliable tumor classification. A novel decision template is employed to synergistically combine outputs from different algorithms, further enhancing classification accuracy. To augment the training process, we incorporate a secondary dataset, "Brain Tumor MRI Dataset," as a source domain, providing additional data for model training and improving generalization capabilities. Our findings demonstrate a high accuracy rate in classifying tumor versus non-tumor images, signifying the effectiveness of our approach in the medical imaging domain. This study highlights the potential of advanced machine learning techniques to contribute significantly to the early and accurate diagnosis of brain tumors, ultimately improving patient outcomes.

摘要:腦瘤由於生長快速且有轉移的可能性,對健康構成嚴重威脅。雖然醫學影像技術已大幅進步,但精準辨識和描述這些腫瘤仍然是一大挑戰。本研究透過運用創新的 TrAdaBoost 方法提升腦瘤分割 (BraTS2020) 資料集來解決這個挑戰,目標是提升腦瘤分類的效率和準確度。我們的做法結合了最先進的深度學習演算法,包括視覺轉換器 (ViT)、膠囊神經網路 (CapsNet) 和卷積神經網路 (CNN),例如 ResNet-152 和 VGG16。透過在多分類器架構中整合這些模型,我們利用每種方法的優點來達成更強健且可靠的腫瘤分類。採用新穎的決策範本,以綜效結合不同演算法的輸出,進一步提升分類準確度。為了擴充訓練流程,我們納入次要資料集「腦瘤 MRI 資料集」作為來源網域,提供額外的資料用於模型訓練,並提升概化能力。我們的研究結果顯示,在分類腫瘤與非腫瘤影像時,準確率很高,表示我們的方法在醫學影像領域中很有效。本研究強調進階機器學習技術的潛力,對腦瘤的早期且精準診斷有顯著貢獻,進而改善病患的治療結果。
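
The abstract does not describe how TrAdaBoost is configured, so the following is only a sketch of one weight-update round from the original TrAdaBoost formulation: source-domain samples that the current learner misclassifies are down-weighted, while misclassified target-domain samples are up-weighted. The function name, the clipping constants, and the toy inputs are assumptions.

```python
import numpy as np

def tradaboost_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost reweighting step.

    err_src / err_tgt hold |h(x) - y| per sample (0 = correct, 1 = wrong).
    """
    n_src = len(w_src)
    # Fixed shrink factor for the source domain.
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / n_rounds))
    # Weighted error measured on the target domain only.
    eps = np.sum(w_tgt * err_tgt) / np.sum(w_tgt)
    eps = float(np.clip(eps, 1e-12, 0.499))  # keep beta_tgt in (0, 1); a guard added here
    beta_tgt = eps / (1.0 - eps)
    new_src = w_src * beta_src ** err_src      # wrong source samples lose weight
    new_tgt = w_tgt * beta_tgt ** (-err_tgt)   # wrong target samples gain weight
    return new_src, new_tgt

# Hypothetical round: four source and four target samples, one error in each.
errors = np.array([1.0, 0.0, 0.0, 0.0])
src_w, tgt_w = tradaboost_update(np.ones(4), np.ones(4), errors, errors, n_rounds=10)
```

With this asymmetry, source samples that conflict with the target distribution fade over the rounds, which is what lets a secondary dataset act as a source domain without dominating training.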

Deep Convolutional Neural Networks on Multiclass Classification of Three-Dimensional Brain Images for Parkinson's Disease Stage Prediction

2410.23649v1 by Guan-Hua Huang, Wan-Chen Lai, Tai-Been Chen, Chien-Chin Hsu, Huei-Yung Chen, Yi-Chen Wu, Li-Ren Yeh

Parkinson's disease (PD), a degenerative disorder of the central nervous system, is commonly diagnosed using functional medical imaging techniques such as single-photon emission computed tomography (SPECT). In this study, we utilized two SPECT data sets (n = 634 and n = 202) from different hospitals to develop a model capable of accurately predicting PD stages, a multiclass classification task. We used the entire three-dimensional (3D) brain images as input and experimented with various model architectures. Initially, we treated the 3D images as sequences of two-dimensional (2D) slices and fed them sequentially into 2D convolutional neural network (CNN) models pretrained on ImageNet, averaging the outputs to obtain the final predicted stage. We also applied 3D CNN models pretrained on Kinetics-400. Additionally, we incorporated an attention mechanism to account for the varying importance of different slices in the prediction process. To further enhance model efficacy and robustness, we simultaneously trained the two data sets using weight sharing, a technique known as cotraining. Our results demonstrated that 2D models pretrained on ImageNet outperformed 3D models pretrained on Kinetics-400, and models utilizing the attention mechanism outperformed both 2D and 3D models. The cotraining technique proved effective in improving model performance when the cotraining data sets were sufficiently large.

摘要:帕金森氏症 (PD) 是一種中樞神經系統退化性疾病,通常使用功能性醫學影像技術,例如單光子發射斷層掃描 (SPECT) 來診斷。在這項研究中,我們利用來自不同醫院的兩個 SPECT 資料集 (n = 634 和 n = 202) 來開發一個模型,能夠準確預測 PD 分期,這是一個多類別分類任務。我們使用整個三維 (3D) 大腦影像作為輸入,並嘗試使用各種模型架構。最初,我們將 3D 影像視為二維 (2D) 切片的序列,並將它們依序輸入到預先在 ImageNet 上訓練過的 2D 卷積神經網路 (CNN) 模型中,取平均輸出值來取得最終預測的期別。我們也應用預先在 Kinetics-400 上訓練過的 3D CNN 模型。此外,我們納入一個注意力機制,以考量不同切片在預測過程中的重要性差異。為了進一步增強模型的效能和穩健性,我們使用權重共享同時訓練兩個資料集,這是一種稱為共同訓練的技術。我們的結果顯示,預先在 ImageNet 上訓練過的 2D 模型優於預先在 Kinetics-400 上訓練過的 3D 模型,而使用注意力機制的模型則優於 2D 和 3D 模型。當共同訓練的資料集夠大的時候,共同訓練技術已被證明能有效改善模型效能。
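
The two ways of combining per-slice predictions mentioned above, plain averaging and attention-weighted pooling, can be contrasted in a few lines. This is a generic sketch: the abstract does not specify where the attention mechanism is applied or how its scores are produced, so the score vector, the function names, and the toy logits are assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def average_slices(slice_logits):
    """Baseline: every 2D slice contributes equally to the 3D prediction."""
    return slice_logits.mean(axis=0)

def attention_pool(slice_logits, scores):
    """Weight each slice by a softmax over per-slice attention scores."""
    weights = softmax(np.asarray(scores, dtype=float))
    return weights @ slice_logits

# Hypothetical outputs of a 2D CNN for three slices and two stage classes.
logits = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
uniform = attention_pool(logits, np.zeros(3))              # same as plain averaging
focused = attention_pool(logits, np.array([50.0, 0.0, 0.0]))  # dominated by slice 0
```

Equal scores reproduce the averaging baseline, so attention pooling strictly generalises it; in a trained model the scores would come from a small learned network rather than being set by hand.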

MS-Glance: Non-semantic context vectors and the applications in supervising image reconstruction

2410.23577v1 by Ziqi Gao, Wendi Yang, Yujia Li, Lei Xing, S. Kevin Zhou

Non-semantic context information is crucial for visual recognition, as the human visual perception system first uses global statistics to process scenes rapidly before identifying specific objects. However, while semantic information is increasingly incorporated into computer vision tasks such as image reconstruction, non-semantic information, such as global spatial structures, is often overlooked. To bridge the gap, we propose a biologically informed non-semantic context descriptor, MS-Glance, along with the Glance Index Measure for comparing two images. A Global Glance vector is formulated by randomly retrieving pixels based on a perception-driven rule from an image to form a vector representing non-semantic global context, while a local Glance vector is a flattened local image window, mimicking a zoom-in observation. The Glance Index is defined as the inner product of two standardized sets of Glance vectors. We evaluate the effectiveness of incorporating Glance supervision in two reconstruction tasks: image fitting with implicit neural representation (INR) and undersampled MRI reconstruction. Extensive experimental results show that MS-Glance outperforms existing image restoration losses across both natural and medical images. The code is available at https://github.com/Z7Gao/MSGlance.

摘要:非语义上下文信息对于视觉识别至关重要,因为人类视觉感知系统首先使用全局统计数据来快速处理场景,然后再识别特定对象。然而,虽然语义信息正越来越多地融入到图像重建等计算机视觉任务中,但非语义信息(如全局空间结构)却常常被忽视。为了弥合这一差距,我们提出了一个生物信息启发的非语义上下文描述符,即 MS-Glance,以及用于比较两幅图像的 Glance 指数度量。通过根据感知驱动的规则从图像中随机检索像素来构建一个全局 Glance 向量,以形成一个表示非语义全局上下文的向量,而局部 Glance 向量是一个扁平的局部图像窗口,模仿了放大观察。Glance 指数被定义为两组标准化的 Glance 向量的内积。我们评估了在两个重建任务中纳入 Glance 监督的有效性:具有隐式神经表征 (INR) 的图像拟合和欠采样 MRI 重建。大量的实验结果表明,MS-Glance 在自然图像和医学图像中都优于现有的图像恢复损失。代码可在 https://github.com/Z7Gao/MSGlance 获得。
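
The definition of the Glance Index as an inner product of standardized Glance vectors can be made concrete with a small sketch. It is not the authors' code: uniform random sampling stands in for the perception-driven rule, the standardization (zero mean, unit norm per vector) is one plausible reading of the abstract, and every name and default below is an assumption.

```python
import numpy as np

def standardize(v, eps=1e-8):
    """Zero-mean, unit-norm each Glance vector (rows of v)."""
    v = v - v.mean(axis=-1, keepdims=True)
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def global_glance_vectors(img, idx):
    """Gather pixels at the sampled flat indices; idx has shape (n_vectors, length)."""
    return img.ravel()[idx]

def glance_index(img_a, img_b, n_vectors=8, length=64, seed=0):
    """Mean inner product of standardized global Glance vectors of two images."""
    rng = np.random.default_rng(seed)
    # The same pixel locations are sampled from both images.
    idx = rng.integers(0, img_a.size, size=(n_vectors, length))
    a = standardize(global_glance_vectors(img_a, idx))
    b = standardize(global_glance_vectors(img_b, idx))
    return float(np.mean(np.sum(a * b, axis=-1)))

# Hypothetical test image and two transformed copies.
image = np.random.default_rng(1).random((32, 32))
same = glance_index(image, image)
rescaled = glance_index(image, 2.0 * image + 3.0)
inverted = glance_index(image, -image)
```

Under these assumptions the index behaves like a correlation: it is close to 1 for identical or affinely rescaled images and close to -1 for an inverted one, so a reconstruction loss could be formed as one minus the index.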