arxiv-daily

Automated deployment @ 2024-11-13 20:36:30 Asia/Taipei

Contributions are welcome: add your topics and keywords in topic.yml. Historical data is also available through the storage.

AI

Medical explainable AI

Publish Date | Title | Authors | Homepage (arXiv ID) | Code
2024-11-01 | Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering | Mehdi Hosseini Chagahi et al. | 2411.00916v2 | null
2024-10-25 | A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection | Muath Alsuhaibani et al. | 2410.19898v1 | null
2024-10-23 | An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems | Shruthi Chari et al. | 2410.17504v1 | link
2024-10-22 | Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study | Lukas Hughes-Noehrer et al. | 2410.16879v1 | null
2024-10-19 | Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer | Gesa Mittmann et al. | 2410.15012v1 | null
2024-10-15 | Explainable AI Methods for Multi-Omics Analysis: A Survey | Ahmad Hussein et al. | 2410.11910v1 | null
2024-10-14 | Study on the Helpfulness of Explainable Artificial Intelligence | Tobias Labarta et al. | 2410.11896v1 | link
2024-10-12 | Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health | Abdullah Mamun et al. | 2410.09635v1 | link
2024-10-10 | Artificial intelligence techniques in inherited retinal diseases: A review | Han Trinh et al. | 2410.09105v1 | null
2024-10-07 | CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures | Ekaterina Sviridova et al. | 2410.05235v2 | link
2024-10-01 | Explainable Diagnosis Prediction through Neuro-Symbolic Integration | Qiuhao Lu et al. | 2410.01855v1 | null
2024-10-01 | Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare | Prasenjit Maji et al. | 2410.00366v1 | null
2024-09-20 | Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study | Tirtha Chanda et al. | 2409.13476v1 | null
2024-09-19 | Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data | Suryansh Vidya et al. | 2409.15374v1 | null
2024-09-19 | Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition | Daniel Flores-Araiza et al. | 2409.12883v1 | null
2024-09-18 | Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques | Yubo Li et al. | 2409.12087v3 | null
2024-09-09 | Explainable AI: Definition and attributes of a good explanation for health AI | Evangelia Kyrimi et al. | 2409.15338v1 | null
2024-08-30 | Exploring the Effect of Explanation Content and Format on User Comprehension and Trust | Antonio Rago et al. | 2408.17401v1 | null
2024-08-29 | A Survey for Large Language Models in Biomedicine | Chong Wang et al. | 2409.00133v1 | null
2024-08-27 | Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis | Francesco Sovrano et al. | 2408.15121v1 | null
2024-08-24 | Towards Case-based Interpretability for Medical Federated Learning | Laura Latorre et al. | 2408.13626v1 | null
2024-08-22 | AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines | Douwe J. Spaanderman et al. | 2408.12491v1 | null
2024-08-14 | Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy | Kimji N. Pellano et al. | 2409.00001v1 | null
2024-08-06 | MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy | Hanchen David Wang et al. | 2408.11837v1 | null
2024-08-05 | The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development | Joshua Morriss et al. | 2408.05239v1 | null
2024-08-05 | Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns | Chi Him Ng et al. | 2408.02709v1 | null
2024-08-05 | Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability | Masoud Muhammed Hassan et al. | 2408.02706v1 | null
2024-07-26 | MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI | Shyam Dongre et al. | 2407.20284v1 | null
2024-07-25 | Introducing δ-XAI: a novel sensitivity-based method for local AI explanations | Alessandro De Carlo et al. | 2407.18343v2 | null
2024-07-24 | Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population | Nikolaos Ntampakis et al. | 2407.17324v2 | null
2024-07-24 | Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition | Michele Fiori et al. | 2408.06352v1 | null
2024-07-21 | Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions | Naseem Khan et al. | 2408.03335v1 | null
2024-07-18 | A Comparative Study on Automatic Coding of Medical Letters with Explainability | Jamie Glen et al. | 2407.13638v1 | link
2024-07-09 | Explainable AI for Enhancing Efficiency of DL-based Channel Estimation | Abdul Karim Gizzini et al. | 2407.07009v1 | null
2024-07-07 | Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification | P. N. Karthikayan et al. | 2407.05440v2 | null
2024-07-03 | A Survey on Trustworthiness in Foundation Models for Medical Image Analysis | Congzhen Shi et al. | 2407.15851v2 | null
2024-07-01 | The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data | Ximing Wen et al. | 2407.06206v1 | null
2024-06-28 | Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach | Sai Krishna Revanth Vuruma et al. | 2407.00167v1 | null
2024-06-25 | Towards Compositional Interpretability for XAI | Sean Tull et al. | 2406.17583v1 | null
2024-06-17 | Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods | Vincent Olesen et al. | 2406.12142v2 | link
2024-06-11 | Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health | Fatemeh Ebrahimzadeh et al. | 2406.07114v2 | null
2024-06-10 | AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI | K M Tawsik Jawad et al. | 2406.06728v1 | null
2024-06-10 | Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook | Yusif Ibrahimov et al. | 2406.05984v1 | null
2024-06-09 | Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance | Zhan Zhang et al. | 2406.05746v1 | null
2024-06-07 | Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability | Faseela Abdullakutty et al. | 2406.12897v1 | null
2024-06-07 | Revisiting Attention Weights as Interpretations of Message-Passing Neural Networks | Yong-Min Shin et al. | 2406.04612v1 | link
2024-06-04 | Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection | Dinuka Sandun Udayantha et al. | 2406.16908v3 | link
2024-06-01 | Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques | Samita Bai et al. | 2406.00532v1 | null
2024-06-01 | Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition | Alaa Nfissi et al. | 2406.01624v2 | link
2024-05-31 | The Explanation Necessity for Healthcare AI | Michail Mamalakis et al. | 2406.00216v1 | null
2024-05-29 | Interdisciplinary Expertise to Advance Equitable Explainable AI | Chloe R. Bennett et al. | 2406.18563v1 | null
2024-05-27 | "It depends": Configuring AI to Improve Clinical Usefulness Across Contexts | Hubert D. Zając et al. | 2407.11978v1 | null
2024-05-26 | Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making | Min Hun Lee et al. | 2405.16424v1 | null
2024-05-26 | Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach | Ziming Liu et al. | 2405.17502v1 | null
2024-05-24 | Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone | Catalina Gomez et al. | 2407.11974v1 | null
2024-05-23 | Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery | Yingying Fang et al. | 2406.18552v1 | null
2024-05-21 | The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach | Mohsen Jozani et al. | 2405.13099v1 | null
2024-05-17 | ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education | Harris Bin Munawar et al. | 2405.10645v1 | null
2024-05-13 | Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data | Camelia Oprea et al. | 2405.07590v1 | null
2024-05-10 | XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare | Fatemeh Nazary et al. | 2405.06270v3 | null
2024-05-09 | To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems | Miquel Miró-Nicolau et al. | 2405.05766v1 | null
2024-05-05 | Region-specific Risk Quantification for Interpretable Prognosis of COVID-19 | Zhusi Zhong et al. | 2405.02815v1 | link
2024-04-26 | Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics | Francesco Prinzi et al. | 2405.02334v1 | null
2024-04-25 | Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability | Yunfei Ge et al. | 2404.16957v1 | null
2024-04-19 | Explainable AI for Fair Sepsis Mortality Predictive Model | Chia-Hsuan Chang et al. | 2404.13139v1 | null
2024-04-19 | Multi Class Depression Detection Through Tweets using Artificial Intelligence | Muhammad Osama Nusrat et al. | 2404.13104v1 | link
2024-04-19 | COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images | Dmytro Shvetsov et al. | 2404.12832v2 | link
2024-04-15 | Hybrid Intelligence for Digital Humanities | Victor de Boer et al. | 2406.15374v1 | null
2024-04-14 | Ethical Framework for Responsible Foundational Models in Medical Imaging | Abhijit Das et al. | 2406.11868v1 | null
2024-04-09 | Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis | Milad Yousefi et al. | 2404.07239v1 | null
2024-04-06 | Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI | Taminul Islam et al. | 2404.04686v1 | null
2024-04-05 | Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI | Maryam Ahmed et al. | 2404.03892v3 | null
2024-03-30 | Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives | Xingrui Gu et al. | 2404.00320v2 | null
2024-03-26 | Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach | Andrea Ferrario et al. | 2403.17873v1 | null
2024-03-26 | Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification | Han Yuan et al. | 2403.18871v1 | link
2024-03-03 | Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures | Séamus Lankford et al. | 2403.01580v1 | null
2024-02-28 | Cause and Effect: Can Large Language Models Truly Understand Causality? | Swagata Ashwani et al. | 2402.18139v3 | null
2024-02-28 | Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina | Yasin Sadeghi Bazargani et al. | 2402.18600v1 | null
2024-02-22 | Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education | A. J. Karran et al. | 2402.15027v2 | null
2024-02-12 | Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals | Aruna Mohan et al. | 2402.09474v2 | null
2024-02-05 | Illuminate: A novel approach for depression detection with explainable analysis and proactive therapy using prompt engineering | Aryan Agrawal et al. | 2402.05127v1 | null
2024-01-24 | Information That Matters: Exploring Information Needs of People Affected by Algorithmic Decisions | Timothée Schmude et al. | 2401.13324v6 | null
2024-01-02 | Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education | Vahid Ashrafimoghari et al. | 2401.02985v1 | null
2023-12-29 | XAI for In-hospital Mortality Prediction via Multimodal ICU Data | Xingqiao Li et al. | 2312.17624v1 | link
2023-12-22 | Joining Forces for Pathology Diagnostics with AI Assistance: The EMPAIA Initiative | Norman Zerbe et al. | 2401.09450v2 | null
2023-12-18 | Robust Stochastic Graph Generator for Counterfactual Explanations | Mario Alfonso Prado-Romero et al. | 2312.11747v2 | null
2023-12-10 | Evaluating the Utility of Model Explanations for Model Development | Shawn Im et al. | 2312.06032v1 | null
2023-12-05 | Building Trustworthy NeuroSymbolic AI Systems: Consistency, Reliability, Explainability, and Safety | Manas Gaur et al. | 2312.06798v1 | null
2023-12-04 | Class-Discriminative Attention Maps for Vision Transformers | Lennart Brocki et al. | 2312.02364v3 | null
2023-11-28 | Deployment of a Robust and Explainable Mortality Prediction Model: The COVID-19 Pandemic and Beyond | Jacob R. Epifano et al. | 2311.17133v1 | null
2023-11-27 | Variational Autoencoders for Feature Exploration and Malignancy Prediction of Lung Lesions | Benjamin Keel et al. | 2311.15719v1 | link
2023-11-24 | MRxaI: Black-Box Explainability for Image Classifiers in a Medical Setting | Nathan Blake et al. | 2311.14471v1 | null
2023-11-21 | Moderating Model Marketplaces: Platform Governance Puzzles for AI Intermediaries | Robert Gorwa et al. | 2311.12573v3 | null
2023-11-20 | Ovarian Cancer Data Analysis using Deep Learning: A Systematic Review from the Perspectives of Key Features of Data Analysis and AI Assurance | Muta Tah Hira et al. | 2311.11932v1 | null
2023-11-18 | Representing visual classification as a linear combination of words | Shobhit Agarwal et al. | 2311.10933v1 | link
2023-11-03 | Towards objective and systematic evaluation of bias in artificial intelligence for medical imaging | Emma A. M. Stanley et al. | 2311.02115v2 | link
2023-10-29 | Predicting recovery following stroke: deep learning, multimodal data and feature selection using explainable AI | Adam White et al. | 2310.19174v1 | null
2023-10-03 | Trainable Noise Model as an XAI evaluation method: application on Sobol for remote sensing image segmentation | Hossein Shreim et al. | 2310.01828v2 | link
2023-09-26 | Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI | Muhammad Aurangzeb Ahmad et al. | 2311.01463v1 | null
2023-09-20 | When to Trust AI: Advances and Challenges for Certification of Neural Networks | Marta Kwiatkowska et al. | 2309.11196v1 | null

Abstracts

Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering

2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust

Osteoporosis is a common condition that increases fracture risk, especially in older adults. Early diagnosis is vital for preventing fractures, reducing treatment costs, and preserving mobility. However, healthcare providers face challenges like limited labeled data and difficulties in processing medical images. This study presents a novel multi-modal learning framework that integrates clinical and imaging data to improve diagnostic accuracy and model interpretability. The model utilizes three pre-trained networks (VGG19, InceptionV3, and ResNet50) to extract deep features from X-ray images. These features are transformed using PCA to reduce dimensionality and focus on the most relevant components. A clustering-based selection process identifies the most representative components, which are then combined with preprocessed clinical data and processed through a fully connected network (FCN) for final classification. A feature importance plot highlights key variables, showing that Medical History, BMI, and Height were the main contributors, emphasizing the significance of patient-specific data. While imaging features were valuable, they had lower importance, indicating that clinical data are crucial for accurate predictions. This framework promotes precise and interpretable predictions, enhancing transparency and building trust in AI-driven diagnoses for clinical integration.
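
The selection-and-fusion step described in this abstract can be illustrated with a small sketch. The abstract does not say how the clustering is done, so everything here is an assumption for illustration: components are grouped greedily by absolute Pearson correlation, the highest-variance member of each group is kept, and the kept columns are concatenated with clinical features. The function names (`select_representatives`, `fuse`), the 0.9 threshold, and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, pvariance
from typing import List

Matrix = List[List[float]]  # rows = patients, columns = features


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long columns (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return 0.0 if norm == 0 else cov / norm


def select_representatives(components: Matrix, threshold: float = 0.9) -> List[int]:
    """Assumed selection rule: greedily group columns that correlate strongly
    with a seed column, then keep the highest-variance column of each group."""
    n_cols = len(components[0])
    cols = [[row[j] for row in components] for j in range(n_cols)]
    unassigned = list(range(n_cols))
    keep: List[int] = []
    while unassigned:
        seed = unassigned[0]
        cluster = [seed] + [
            j for j in unassigned[1:]
            if abs(pearson(cols[seed], cols[j])) >= threshold
        ]
        keep.append(max(cluster, key=lambda j: pvariance(cols[j])))
        unassigned = [j for j in unassigned if j not in cluster]
    return keep


def fuse(components: Matrix, clinical: Matrix, keep: List[int]) -> Matrix:
    """Concatenate the selected image components with clinical features per patient."""
    return [[row[j] for j in keep] + clin for row, clin in zip(components, clinical)]


# Toy data: column 1 is a scaled copy of column 0, column 2 is only weakly related.
components = [[1.0, 2.0, 1.0], [2.0, 4.0, -1.0], [3.0, 6.0, 1.0], [4.0, 8.0, -1.0]]
clinical = [[24.0, 1.0], [31.0, 0.0], [27.5, 1.0], [22.0, 0.0]]  # hypothetical BMI, history flag
keep = select_representatives(components)
fused = fuse(components, clinical, keep)
```

In this toy run the two perfectly correlated columns collapse into one representative, so each fused row carries two image-derived values followed by the two clinical values. In the described framework such a fused vector would then be passed to the fully connected classifier, which is omitted here.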

A Review of Deep Learning Approaches for Non-Invasive Cognitive Impairment Detection

2410.19898v1 by Muath Alsuhaibani, Ali Pourramezan Fard, Jian Sun, Farida Far Poor, Peter S. Pressman, Mohammad H. Mahoor

This review paper explores recent advances in deep learning approaches for non-invasive cognitive impairment detection. We examine various non-invasive indicators of cognitive decline, including speech and language, facial, and motoric mobility. The paper provides an overview of relevant datasets, feature-extracting techniques, and deep-learning architectures applied to this domain. We have analyzed the performance of different methods across modalities and observed that speech and language-based methods generally achieved the highest detection performance. Studies combining acoustic and linguistic features tended to outperform those using a single modality. Facial analysis methods showed promise for visual modalities but were less extensively studied. Most papers focused on binary classification (impaired vs. non-impaired), with fewer addressing multi-class or regression tasks. Transfer learning and pre-trained language models emerged as popular and effective techniques, especially for linguistic analysis. Despite significant progress, several challenges remain, including data standardization and accessibility, model explainability, longitudinal analysis limitations, and clinical adaptation. Lastly, we propose future research directions, such as investigating language-agnostic speech analysis methods, developing multi-modal diagnostic systems, and addressing ethical considerations in AI-assisted healthcare. By synthesizing current trends and identifying key obstacles, this review aims to guide further development of deep learning-based cognitive impairment detection systems to improve early diagnosis and ultimately patient outcomes.

An Ontology-Enabled Approach For User-Centered and Knowledge-Enabled Explanations of AI Systems

2410.17504v1 by Shruthi Chari

Explainable Artificial Intelligence (AI) focuses on helping humans understand the working of AI systems or their decisions and has been a cornerstone of AI for decades. Recent research in explainability has focused on explaining the workings of AI models or model explainability. There have also been several position statements and review papers detailing the needs of end-users for user-centered explainability but fewer implementations. Hence, this thesis seeks to bridge some gaps between model and user-centered explainability. We create an explanation ontology (EO) to represent literature-derived explanation types via their supporting components. We implement a knowledge-augmented question-answering (QA) pipeline to support contextual explanations in a clinical setting. Finally, we are implementing a system to combine explanations from different AI methods and data modalities. Within the EO, we can represent fifteen different explanation types, and we have tested these representations in six exemplar use cases. We find that knowledge augmentations improve the performance of base large language models in the contextualized QA, and the performance is variable across disease groups. In the same setting, clinicians also indicated that they prefer to see actionability as one of the main foci in explanations. In our explanations combination method, we plan to use similarity metrics to determine the similarity of explanations in a chronic disease detection setting. Overall, through this thesis, we design methods that can support knowledge-enabled explanations across different use cases, accounting for the methods in today's AI era that can generate the supporting components of these explanations and domain knowledge sources that can enhance them.

Contrasting Attitudes Towards Current and Future AI Applications for Computerised Interpretation of ECG: A Clinical Stakeholder Interview Study

2410.16879v1 by Lukas Hughes-Noehrer, Leda Channer, Gabriel Strain, Gregory Yates, Richard Body, Caroline Jay

Objectives: To investigate clinicians' attitudes towards current automated interpretation of ECG and novel AI technologies and their perception of computer-assisted interpretation. Materials and Methods: We conducted a series of interviews with clinicians in the UK. Our study: (i) explores the potential for AI, specifically future 'human-like' computing approaches, to facilitate ECG interpretation and support clinical decision making, and (ii) elicits their opinions about the importance of explainability and trustworthiness of AI algorithms. Results: We performed inductive thematic analysis on interview transcriptions from 23 clinicians and identified the following themes: (i) a lack of trust in current systems, (ii) positive attitudes towards future AI applications and requirements for these, (iii) the relationship between the accuracy and explainability of algorithms, and (iv) opinions on education, possible deskilling, and the impact of AI on clinical competencies. Discussion: Clinicians do not trust current computerised methods, but welcome future 'AI' technologies. Where clinicians trust future AI interpretation to be accurate, they are less concerned that it is explainable. They also preferred ECG interpretation that demonstrated the results of the algorithm visually. Whilst clinicians do not fear job losses, they are concerned about deskilling and the need to educate the workforce to use AI responsibly. Conclusion: Clinicians are positive about the future application of AI in clinical decision-making. Accuracy is a key factor of uptake and visualisations are preferred over current computerised methods. This is viewed as a potential means of training and upskilling, in contrast to the deskilling that automation might be perceived to bring.

Pathologist-like explainable AI for interpretable Gleason grading in prostate cancer

2410.15012v1 by Gesa Mittmann, Sara Laiouar-Pedari, Hendrik A. Mehrtens, Sarah Haggenmüller, Tabea-Clara Bucher, Tirtha Chanda, Nadine T. Gaisa, Mathias Wagner, Gilbert Georg Klamminger, Tilman T. Rau, Christina Neppl, Eva Maria Compérat, Andreas Gocht, Monika Hämmerle, Niels J. Rupp, Jula Westhoff, Irene Krücken, Maximillian Seidl, Christian M. Schürch, Marcus Bauer, Wiebke Solass, Yu Chun Tam, Florian Weber, Rainer Grobholz, Jaroslaw Augustyniak, Thomas Kalinski, Christian Hörner, Kirsten D. Mertz, Constanze Döring, Andreas Erbersdobler, Gabriele Deubler, Felix Bremmer, Ulrich Sommer, Michael Brodhun, Jon Griffin, Maria Sarah L. Lenon, Kiril Trpkov, Liang Cheng, Fei Chen, Angelique Levi, Guoping Cai, Tri Q. Nguyen, Ali Amin, Alessia Cimadamore, Ahmed Shabaik, Varsha Manucha, Nazeel Ahmad, Nidia Messias, Francesca Sanguedolce, Diana Taheri, Ezra Baraban, Liwei Jia, Rajal B. Shah, Farshid Siadat, Nicole Swarbrick, Kyung Park, Oudai Hassan, Siamak Sakhaie, Michelle R. Downes, Hiroshi Miyamoto, Sean R. Williamson, Tim Holland-Letz, Carolin V. Schneider, Jakob Nikolas Kather, Yuri Tolkach, Titus J. Brinker

The aggressiveness of prostate cancer, the most common cancer in men worldwide, is primarily assessed based on histopathological data using the Gleason scoring system. While artificial intelligence (AI) has shown promise in accurately predicting Gleason scores, these predictions often lack inherent explainability, potentially leading to distrust in human-machine interactions. To address this issue, we introduce a novel dataset of 1,015 tissue microarray core images, annotated by an international group of 54 pathologists. The annotations provide detailed localized pattern descriptions for Gleason grading in line with international guidelines. Utilizing this dataset, we develop an inherently explainable AI system based on a U-Net architecture that provides predictions leveraging pathologists' terminology. This approach circumvents post-hoc explainability methods while maintaining or exceeding the performance of methods trained directly for Gleason pattern segmentation (Dice score: 0.713 ± 0.003 trained on explanations vs. 0.691 ± 0.010 trained on Gleason patterns). By employing soft labels during training, we capture the intrinsic uncertainty in the data, yielding strong results in Gleason pattern segmentation even in the context of high interobserver variability. With the release of this dataset, we aim to encourage further research into segmentation in medical tasks with high levels of subjectivity and to advance the understanding of pathologists' reasoning processes.
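
Two ideas in this abstract, soft labels and the Dice score, are easy to make concrete. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: soft labels are taken to be the per-pixel fraction of annotators voting for each class, and the Dice score is computed on flattened binary masks. The function names and the toy annotations are hypothetical.

```python
from typing import List, Sequence


def soft_labels(annotations: Sequence[Sequence[int]], num_classes: int) -> List[List[float]]:
    """Turn several annotators' hard labels into a per-pixel class distribution.

    annotations[a][p] is the class index annotator `a` assigned to pixel `p`.
    """
    num_annotators = len(annotations)
    num_pixels = len(annotations[0])
    counts = [[0] * num_classes for _ in range(num_pixels)]
    for votes in annotations:
        for p, cls in enumerate(votes):
            counts[p][cls] += 1
    return [[c / num_annotators for c in pixel] for pixel in counts]


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |A and B| / (|A| + |B|) for two flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


# Three hypothetical annotators labelling four pixels with class 0 or 1.
annotations = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 1, 1]]
soft = soft_labels(annotations, num_classes=2)
overlap = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Pixels on which all annotators agree get a one-hot distribution, while contested pixels keep the disagreement as fractional targets, which is how soft labels can carry interobserver variability into training.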

Explainable AI Methods for Multi-Omics Analysis: A Survey

2410.11910v1 by Ahmad Hussein, Mukesh Prasad, Ali Braytee

Advancements in high-throughput technologies have led to a shift from traditional hypothesis-driven methodologies to data-driven approaches. Multi-omics refers to the integrative analysis of data derived from multiple 'omes', such as genomics, proteomics, transcriptomics, metabolomics, and microbiomics. This approach enables a comprehensive understanding of biological systems by capturing different layers of biological information. Deep learning methods are increasingly utilized to integrate multi-omics data, offering insights into molecular interactions and enhancing research into complex diseases. However, these models, with their numerous interconnected layers and nonlinear relationships, often function as black boxes, lacking transparency in decision-making processes. To overcome this challenge, explainable artificial intelligence (xAI) methods are crucial for creating transparent models that allow clinicians to interpret and work with complex data more effectively. This review explores how xAI can improve the interpretability of deep learning models in multi-omics research, highlighting its potential to provide clinicians with clear insights, thereby facilitating the effective application of such models in clinical settings.

Study on the Helpfulness of Explainable Artificial Intelligence

2410.11896v1 by Tobias Labarta, Elizaveta Kulicheva, Ronja Froelian, Christian Geißler, Xenia Melman, Julian von Klitzing

Explainable Artificial Intelligence (XAI) is essential for building advanced machine learning-powered applications, especially in critical domains such as medical diagnostics or autonomous driving. Legal, business, and ethical requirements motivate using effective XAI, but the increasing number of different methods makes it challenging to pick the right ones. Further, as explanations are highly context-dependent, measuring the effectiveness of XAI methods without users can only reveal a limited amount of information, excluding human factors such as the ability to understand it. We propose to evaluate XAI methods via the user's ability to successfully perform a proxy task, designed such that a good performance is an indicator for the explanation to provide helpful information. In other words, we address the helpfulness of XAI for human decision-making. Further, a user study on state-of-the-art methods was conducted, showing differences in their ability to generate trust and skepticism and the ability to judge the rightfulness of an AI decision correctly. Based on the results, we highly recommend using and extending this approach for more objective-based human-centered user studies to measure XAI performance in an end-to-end fashion.

Use of What-if Scenarios to Help Explain Artificial Intelligence Models for Neonatal Health

2410.09635v1 by Abdullah Mamun, Lawrence D. Devoe, Mark I. Evans, David W. Britt, Judith Klein-Seetharaman, Hassan Ghasemzadeh

Early detection of intrapartum risk enables interventions to potentially prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, there is no accurate automated system to predict such events to assist with clinical decision-making. To fill this gap, we propose "Artificial Intelligence (AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning framework that not only predicts adverse labor outcomes from maternal, fetal, obstetrical, and intrapartum risk factors but also provides the model's reasoning behind the predictions made. The latter can provide insights into what modifications in the input variables of the model could have changed the predicted outcome. We address the challenges of imbalance and small datasets by synthesizing additional training data using Adaptive Synthetic Sampling (ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN uses an ensemble of fully-connected neural networks as the backbone for its classification with the data augmentation supported by either ADASYN or CTGAN. AIMEN, supported by CTGAN, outperforms AIMEN supported by ADASYN in classification. AIMEN can predict a high risk for adverse labor outcomes with an average F1 score of 0.784. It also provides counterfactual explanations that can be achieved by changing 2 to 3 attributes on average. Resources available: https://github.com/ab9mamun/AIMEN.
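
The notion of a counterfactual explanation that changes only a few attributes can be sketched as a brute-force search. This is not AIMEN's method, which the abstract does not detail; it is a minimal illustration that tries ever larger sets of attribute changes until a classifier's prediction flips. The toy risk model, the attribute names, and `find_counterfactual` are all hypothetical.

```python
from itertools import combinations, product
from typing import Callable, Dict, List, Optional

Record = Dict[str, int]


def find_counterfactual(
    record: Record,
    predict: Callable[[Record], int],
    candidates: Dict[str, List[int]],
    max_changes: int = 3,
) -> Optional[Record]:
    """Return a copy of `record` with the fewest changed attributes that flips
    the prediction, or None if no flip is found within `max_changes` changes."""
    original = predict(record)
    names = sorted(candidates)
    for k in range(1, max_changes + 1):  # try the smallest edits first
        for subset in combinations(names, k):
            for values in product(*(candidates[n] for n in subset)):
                trial = dict(record)
                trial.update(zip(subset, values))
                changed = sum(trial[n] != record[n] for n in subset)
                if changed == k and predict(trial) != original:
                    return trial
    return None


def toy_predict(r: Record) -> int:
    """Hypothetical risk model: high risk (1) when at least two flags are set."""
    return int(r["hypertension"] + r["abnormal_fhr"] + r["prolonged_labor"] >= 2)


patient = {"hypertension": 1, "abnormal_fhr": 1, "prolonged_labor": 1}
options = {name: [0, 1] for name in patient}
counterfactual = find_counterfactual(patient, toy_predict, options)
changed = [n for n in patient if counterfactual[n] != patient[n]]
```

For this toy patient no single change lowers the predicted risk, so the search returns a record that differs in two attributes. Exhaustive search only scales to a handful of attributes and candidate values; it is used here purely to make the "fewest changes that flip the outcome" idea concrete.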

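Summary: Early detection of intrapartum risk enables interventions that may prevent or mitigate adverse labor outcomes such as cerebral palsy. Currently, no accurate automated system exists to predict such events and assist clinical decision-making. To fill this gap, we propose "Artificial Intelligence (AI) for Modeling and Explaining Neonatal Health" (AIMEN), a deep learning framework that not only predicts adverse labor outcomes from maternal, fetal, obstetrical, and intrapartum risk factors but also provides the reasoning behind the model's predictions. The latter can offer insight into which modifications of the model's input variables could have changed the predicted outcome. We address the challenges of imbalanced and small datasets by synthesizing additional training data with Adaptive Synthetic Sampling (ADASYN) and Conditional Tabular Generative Adversarial Networks (CTGAN). AIMEN uses an ensemble of fully connected neural networks as its classification backbone, with data augmentation supported by either ADASYN or CTGAN. AIMEN supported by CTGAN outperforms AIMEN supported by ADASYN in classification. AIMEN can predict a high risk of adverse labor outcomes with an average F1 score of 0.784. It also provides counterfactual explanations that can be achieved by changing 2 to 3 attributes on average. Resources available: https://github.com/ab9mamun/AIMEN.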
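The idea of a counterfactual explanation reached by changing a few attributes can be illustrated with a minimal sketch. It assumes a hypothetical toy risk rule (`predict_high_risk`) and made-up attribute names and candidate values; it is not AIMEN's model or search procedure, only a brute-force search for the smallest set of attribute changes that flips a prediction.

```python
from itertools import combinations, product

# Hypothetical toy classifier: flags high risk when the weighted score reaches 2.
# Attribute names and weights are illustrative assumptions, not taken from AIMEN.
WEIGHTS = {"maternal_fever": 1, "abnormal_heart_rate": 1, "prolonged_labor": 1}


def predict_high_risk(record):
    """Return True when the toy risk score of the record is at least 2."""
    score = sum(WEIGHTS[name] * record[name] for name in WEIGHTS)
    return score >= 2


def find_counterfactual(record, candidate_values, max_changes=3):
    """Return the smallest dict of attribute changes that flips the prediction.

    Subsets of attributes are tried in order of increasing size, so the first
    hit uses as few changes as possible. Returns None if nothing flips it.
    """
    original = predict_high_risk(record)
    names = sorted(candidate_values)
    for k in range(1, max_changes + 1):
        for subset in combinations(names, k):
            for values in product(*(candidate_values[n] for n in subset)):
                changes = dict(zip(subset, values))
                # Skip assignments that leave any chosen attribute unchanged.
                if any(record[n] == v for n, v in changes.items()):
                    continue
                if predict_high_risk({**record, **changes}) != original:
                    return changes
    return None
```

Calling `find_counterfactual(record, {name: [0, 1] for name in record})` on a record with all three toy attributes set to 1 returns a two-attribute change. Exhaustive search is only practical for a handful of attributes; real counterfactual methods use guided or optimized search instead.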

Artificial intelligence techniques in inherited retinal diseases: A review

2410.09105v1 by Han Trinh, Jordan Vice, Jason Charng, Zahra Tajbakhsh, Khyber Alam, Fred K. Chen, Ajmal Mian

Inherited retinal diseases (IRDs) are a diverse group of genetic disorders that lead to progressive vision loss and are a major cause of blindness in working-age adults. The complexity and heterogeneity of IRDs pose significant challenges in diagnosis, prognosis, and management. Recent advancements in artificial intelligence (AI) offer promising solutions to these challenges. However, the rapid development of AI techniques and their varied applications have led to fragmented knowledge in this field. This review consolidates existing studies, identifies gaps, and provides an overview of AI's potential in diagnosing and managing IRDs. It aims to structure pathways for advancing clinical applications by exploring AI techniques like machine learning and deep learning, particularly in disease detection, progression prediction, and personalized treatment planning. Special focus is placed on the effectiveness of convolutional neural networks in these areas. Additionally, the integration of explainable AI is discussed, emphasizing its importance in clinical settings to improve transparency and trust in AI-based systems. The review addresses the need to bridge existing gaps in focused studies on AI's role in IRDs, offering a structured analysis of current AI techniques and outlining future research directions. It concludes with an overview of the challenges and opportunities in deploying AI for IRDs, highlighting the need for interdisciplinary collaboration and the continuous development of robust, interpretable AI models to advance clinical applications.

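Summary: Inherited retinal diseases (IRDs) are a diverse group of genetic disorders that cause progressive vision loss and are a major cause of blindness in working-age adults. The complexity and heterogeneity of IRDs pose significant challenges for diagnosis, prognosis, and management. Recent advances in artificial intelligence (AI) offer promising solutions to these challenges. However, the rapid development of AI techniques and their varied applications have left knowledge in this field fragmented. This review consolidates existing studies, identifies gaps, and gives an overview of AI's potential in diagnosing and managing IRDs. It aims to structure pathways for advancing clinical applications by exploring AI techniques such as machine learning and deep learning, particularly in disease detection, progression prediction, and personalized treatment planning. Special focus is placed on the effectiveness of convolutional neural networks in these areas. The integration of explainable AI is also discussed, emphasizing its importance in clinical settings for improving transparency and trust in AI-based systems. The review addresses the need to bridge existing gaps in focused studies on AI's role in IRDs, offering a structured analysis of current AI techniques and outlining future research directions. It concludes with an overview of the challenges and opportunities in deploying AI for IRDs, highlighting the need for interdisciplinary collaboration and the continued development of robust, interpretable AI models to advance clinical applications.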

CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures

2410.05235v2 by Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri

Explaining Artificial Intelligence (AI) decisions is a major challenge nowadays in AI, in particular when applied to sensitive scenarios like medicine and law. However, the need to explain the rationale behind decisions is also a main issue for human-based deliberation, as it is important to justify why a certain decision has been taken. Resident medical doctors, for instance, are required not only to provide a (possibly correct) diagnosis, but also to explain how they reached a certain conclusion. Developing new tools to aid residents to train their explanation skills is therefore a central objective of AI in education. In this paper, we follow this direction, and we present, to the best of our knowledge, the first multilingual dataset for Medical Question Answering where correct and incorrect diagnoses for a clinical case are enriched with a natural language explanation written by doctors. These explanations have been manually annotated with argument components (i.e., premise, claim) and argument relations (i.e., attack, support), resulting in the Multilingual CasiMedicos-Arg dataset which consists of 558 clinical cases in four languages (English, Spanish, French, Italian) with explanations, where we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 attack relations. We conclude by showing how competitive baselines perform over this challenging dataset for the argument mining task.

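Summary: Explaining the decisions of artificial intelligence (AI) is a major challenge in AI today, especially when it is applied to sensitive scenarios such as medicine and law. However, the need to explain the rationale behind decisions is also a major issue for human deliberation, because it is necessary to justify why a decision was made. Resident doctors, for example, must not only provide a (possibly correct) diagnosis but also explain how they reached a conclusion. Developing new tools that help residents train their explanation skills is therefore a central objective of AI in education. In this paper we follow this direction and present, to the best of our knowledge, the first multilingual Medical Question Answering dataset in which the correct and incorrect diagnoses for a clinical case are accompanied by natural language explanations written by doctors. These explanations have been manually annotated with argument components (i.e., premise, claim) and argument relations (i.e., attack, support), producing the Multilingual CasiMedicos-Arg dataset, which contains 558 clinical cases with explanations in four languages (English, Spanish, French, Italian), annotated with 5021 claims, 2313 premises, 2431 support relations, and 1106 attack relations. We conclude by showing how competitive baselines perform on this challenging dataset for the argument mining task.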

Explainable Diagnosis Prediction through Neuro-Symbolic Integration

2410.01855v1 by Qiuhao Lu, Rui Li, Elham Sagheb, Andrew Wen, Jinlian Wang, Liwei Wang, Jungwei W. Fan, Hongfang Liu

Diagnosis prediction is a critical task in healthcare, where timely and accurate identification of medical conditions can significantly impact patient outcomes. Traditional machine learning and deep learning models have achieved notable success in this domain but often lack interpretability, which is a crucial requirement in clinical settings. In this study, we explore the use of neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop explainable models for diagnosis prediction. Essentially, we design and implement LNN-based models that integrate domain-specific knowledge through logical rules with learnable thresholds. Our models, particularly $M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior performance over traditional models such as Logistic Regression, SVM, and Random Forest, achieving higher accuracy (up to 80.52%) and AUROC scores (up to 0.8457) in the case study of diabetes prediction. The learned weights and thresholds within the LNN models provide direct insights into feature contributions, enhancing interpretability without compromising predictive power. These findings highlight the potential of neuro-symbolic approaches in bridging the gap between accuracy and explainability in healthcare AI applications. By offering transparent and adaptable diagnostic models, our work contributes to the advancement of precision medicine and supports the development of equitable healthcare solutions. Future research will focus on extending these methods to larger and more diverse datasets to further validate their applicability across different medical conditions and populations.

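Summary: Diagnosis prediction is a critical task in healthcare, where timely and accurate identification of medical conditions can have a major impact on patient outcomes. Traditional machine learning and deep learning models have achieved notable success in this domain but often lack interpretability, which is a key requirement in clinical settings. In this study we explore neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop explainable models for diagnosis prediction. In essence, we design and implement LNN-based models that integrate domain-specific knowledge through logical rules with learnable thresholds. Our models, particularly $M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, outperform traditional models such as Logistic Regression, SVM, and Random Forest, achieving higher accuracy (up to 80.52%) and AUROC scores (up to 0.8457) in a diabetes prediction case study. The weights and thresholds learned within the LNN models give direct insight into feature contributions, improving interpretability without compromising predictive power. These findings highlight the potential of neuro-symbolic approaches to bridge the gap between accuracy and explainability in healthcare AI applications. By offering transparent and adaptable diagnostic models, our work contributes to the advancement of precision medicine and supports the development of equitable healthcare solutions. Future research will focus on extending these methods to larger and more diverse datasets to further validate their applicability across different medical conditions and populations.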
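To make "logical rules with learnable thresholds" concrete, here is a minimal sketch under stated assumptions. It uses a weighted Łukasiewicz-style conjunction, one common real-valued logic used in the LNN literature; the abstract does not specify the paper's exact parameterization, and the function names, thresholds, and weights below are hypothetical illustrations rather than learned values.

```python
def clamp01(value):
    """Clip a number to the truth-value range [0, 1]."""
    return max(0.0, min(1.0, value))


def above_threshold(value, threshold, sharpness=1.0):
    """Soft truth value for the predicate 'value > threshold'.

    The result is 0.5 exactly at the threshold and moves linearly toward
    0 or 1; `threshold` plays the role of a learnable parameter.
    """
    return clamp01(0.5 + sharpness * (value - threshold))


def weighted_and(truths, weights, bias=1.0):
    """Weighted Lukasiewicz-style AND: clamp(bias - sum(w * (1 - x))).

    A larger weight makes the rule more sensitive to that input being false,
    which is why learned weights can be read as feature contributions.
    """
    penalty = sum(w * (1.0 - x) for x, w in zip(truths, weights))
    return clamp01(bias - penalty)
```

For instance, a hypothetical rule "glucose is high AND BMI is high" could be evaluated as `weighted_and([above_threshold(150, 125, 0.02), above_threshold(28, 30, 0.1)], [1.0, 0.5])`. In a trained model the thresholds and weights would be fitted by gradient descent rather than set by hand.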

Easydiagnos: a framework for accurate feature selection for automatic diagnosis in smart healthcare

2410.00366v1 by Prasenjit Maji, Amit Kumar Mondal, Hemanta Kumar Mondal, Saraju P. Mohanty

The rapid advancements in artificial intelligence (AI) have revolutionized smart healthcare, driving innovations in wearable technologies, continuous monitoring devices, and intelligent diagnostic systems. However, security, explainability, robustness, and performance optimization challenges remain critical barriers to widespread adoption in clinical environments. This research presents an innovative algorithmic method using the Adaptive Feature Evaluator (AFE) algorithm to improve feature selection in healthcare datasets and overcome these problems. By integrating Genetic Algorithms (GA), Explainable Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), the AFE algorithm optimizes Clinical Decision Support Systems (CDSS), thereby enhancing predictive accuracy and interpretability. The proposed method is validated across three diverse healthcare datasets using six distinct machine learning algorithms, demonstrating its robustness and superiority over conventional feature selection techniques. The results underscore the transformative potential of AFE in smart healthcare, enabling personalized and transparent patient care. Notably, the AFE algorithm, when combined with a Multi-layer Perceptron (MLP), achieved an accuracy of up to 98.5%, highlighting its capability to improve clinical decision-making processes in real-world healthcare applications.

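Summary: Rapid progress in artificial intelligence (AI) has transformed smart healthcare, driving innovation in wearable technologies, continuous monitoring devices, and intelligent diagnostic systems. However, challenges in security, explainability, robustness, and performance optimization remain critical barriers to widespread adoption in clinical environments. This research presents an algorithmic method that uses the Adaptive Feature Evaluator (AFE) algorithm to improve feature selection in healthcare datasets and overcome these problems. By integrating Genetic Algorithms (GA), Explainable Artificial Intelligence (XAI), and Permutation Combination Techniques (PCT), the algorithm optimizes Clinical Decision Support Systems (CDSS), thereby improving predictive accuracy and interpretability. The proposed method is validated on three different healthcare datasets using six different machine learning algorithms, demonstrating its robustness and its superiority over conventional feature selection techniques. The results underscore the transformative potential of AFE in smart healthcare, enabling personalized and transparent patient care. Notably, when combined with a Multi-layer Perceptron (MLP), the AFE algorithm reached an accuracy of up to 98.5%, highlighting its ability to improve clinical decision-making in real-world healthcare applications.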

Dermatologist-like explainable AI enhances melanoma diagnosis accuracy: eye-tracking study

2409.13476v1 by Tirtha Chanda, Sarah Haggenmueller, Tabea-Clara Bucher, Tim Holland-Letz, Harald Kittler, Philipp Tschandl, Markus V. Heppt, Carola Berking, Jochen S. Utikal, Bastian Schilling, Claudia Buerger, Cristian Navarrete-Dechent, Matthias Goebeler, Jakob Nikolas Kather, Carolin V. Schneider, Benjamin Durani, Hendrike Durani, Martin Jansen, Juliane Wacker, Joerg Wacker, Reader Study Consortium, Titus J. Brinker

Artificial intelligence (AI) systems have substantially improved dermatologists' diagnostic accuracy for melanoma, with explainable AI (XAI) systems further enhancing clinicians' confidence and trust in AI-driven decisions. Despite these advancements, there remains a critical need for objective evaluation of how dermatologists engage with both AI and XAI tools. In this study, 76 dermatologists participated in a reader study, diagnosing 16 dermoscopic images of melanomas and nevi using an XAI system that provides detailed, domain-specific explanations. Eye-tracking technology was employed to assess their interactions. Diagnostic performance was compared with that of a standard AI system lacking explanatory features. Our findings reveal that XAI systems improved balanced diagnostic accuracy by 2.8 percentage points relative to standard AI. Moreover, diagnostic disagreements with AI/XAI systems and complex lesions were associated with elevated cognitive load, as evidenced by increased ocular fixations. These insights have significant implications for clinical practice, the design of AI tools for visual tasks, and the broader development of XAI in medical diagnostics.

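Summary: Artificial intelligence (AI) systems have substantially improved dermatologists' diagnostic accuracy for melanoma, and explainable AI (XAI) systems further increase clinicians' confidence and trust in AI-driven decisions. Despite these advances, there remains a pressing need for objective evaluation of how dermatologists use AI and XAI tools. In this study, 76 dermatologists took part in a reader study, diagnosing 16 dermoscopic images of melanomas and nevi with an XAI system that provides detailed, domain-specific explanations. Eye-tracking technology was used to assess their interactions. Diagnostic performance was compared with that of a standard AI system without explanatory features. Our findings show that the XAI system improved balanced diagnostic accuracy by 2.8 percentage points relative to standard AI. In addition, diagnostic disagreements with the AI/XAI systems and complex lesions were associated with elevated cognitive load, as evidenced by an increased number of ocular fixations. These insights have significant implications for clinical practice, for the design of AI tools for visual tasks, and for the broader development of XAI in medical diagnostics.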
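The balanced diagnostic accuracy reported above is the mean of sensitivity and specificity. The short sketch below shows that calculation on made-up labels (1 for melanoma, 0 for nevus is an assumed encoding); it is an illustration of the metric, not the study's evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Return (sensitivity + specificity) / 2 for binary labels 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)  # share of positives correctly identified
    specificity = tn / (tn + fp)  # share of negatives correctly identified
    return (sensitivity + specificity) / 2
```

Because each class contributes equally, this metric is not inflated when one class dominates the image set. A difference between two such scores, multiplied by 100, gives a gap in percentage points.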

Explainable AI for Autism Diagnosis: Identifying Critical Brain Regions Using fMRI Data

2409.15374v1 by Suryansh Vidya, Kush Gupta, Amir Aly, Andy Wills, Emmanuel Ifeachor, Rohit Shankar

Early diagnosis and intervention for Autism Spectrum Disorder (ASD) have been shown to significantly improve the quality of life of autistic individuals. However, diagnostic methods for ASD rely on assessments based on clinical presentation that are prone to bias and can make it challenging to arrive at an early diagnosis. There is a need for objective biomarkers of ASD which can help improve diagnostic accuracy. Deep learning (DL) has achieved outstanding performance in diagnosing diseases and conditions from medical imaging data. Extensive research has been conducted on creating models that classify ASD using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, existing models lack interpretability. This research aims to improve the accuracy and interpretability of ASD diagnosis by creating a DL model that can not only accurately classify ASD but also provide explainable insights into its working. The dataset used is a preprocessed version of the Autism Brain Imaging Data Exchange (ABIDE) with 884 samples. Our findings show a model that can accurately classify ASD and highlight critical brain regions differing between ASD and typical controls, with potential implications for early diagnosis and understanding of the neural basis of ASD. These findings are validated by studies in the literature that use different datasets and modalities, confirming that the model actually learned characteristics of ASD and not just the dataset. This study advances the field of explainable AI in medical imaging by providing a robust and interpretable model, thereby contributing to a future with objective and reliable ASD diagnostics.

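Summary: Early diagnosis and intervention for Autism Spectrum Disorder (ASD) have been shown to significantly improve the quality of life of autistic individuals. However, diagnostic methods for ASD rely on assessments of clinical presentation, which are prone to bias and can make early diagnosis difficult. Objective biomarkers of ASD are needed to help improve diagnostic accuracy. Deep learning (DL) has achieved outstanding performance in diagnosing diseases and conditions from medical imaging data. Extensive research has been carried out on models that classify ASD using resting-state functional Magnetic Resonance Imaging (fMRI) data. However, existing models lack interpretability. This research aims to improve the accuracy and interpretability of ASD diagnosis by building a DL model that not only classifies ASD accurately but also provides explainable insight into how it works. The dataset used is a preprocessed version of the Autism Brain Imaging Data Exchange (ABIDE) with 884 samples. Our findings show a model that classifies ASD accurately and highlights critical brain regions that differ between ASD and typical controls, with potential implications for early diagnosis and for understanding the neural basis of ASD. These findings are validated by studies in the literature that use different datasets and modalities, confirming that the model learned characteristics of ASD and not just of the dataset. By providing a robust and interpretable model, this study advances explainable AI in medical imaging and contributes to a future with objective and reliable ASD diagnostics.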

Improving Prototypical Parts Abstraction for Case-Based Reasoning Explanations Designed for the Kidney Stone Type Recognition

2409.12883v1 by Daniel Flores-Araiza, Francisco Lopez-Tiro, Clément Larose, Salvador Hinojosa, Andres Mendez-Vazquez, Miguel Gonzalez-Mendoza, Gilberto Ochoa-Ruiz, Christian Daul

The in-vivo identification of kidney stone types during a ureteroscopy would be a major medical advance in urology, as it could reduce the time of the tedious renal calculi extraction process while diminishing infection risks. Furthermore, such an automated procedure would make it possible to prescribe anti-recurrence treatments immediately. Nowadays, only a few experienced urologists are able to recognize the kidney stone types in the images of the videos displayed on a screen during the endoscopy. Thus, several deep learning (DL) models have recently been proposed to automatically recognize the kidney stone types using ureteroscopic images. However, these DL models are of a black-box nature, which limits their applicability in clinical settings. This contribution proposes a case-based reasoning DL model which uses prototypical parts (PPs) and generates local and global descriptors. The PPs encode for each class (i.e., kidney stone type) visual feature information (hue, saturation, intensity and textures) similar to that used by biologists. The PPs are optimally generated due to a new loss function used during the model training. Moreover, the local and global descriptors of PPs allow the decisions to be explained ("what" information, "where in the images") in a way that is understandable for biologists and urologists. The proposed DL model has been tested on a database including images of the six most widespread kidney stone types. The overall average classification accuracy was 90.37. When comparing these results with those of the eight other state-of-the-art DL models for kidney stone recognition, it can be seen that the valuable gain in explainability was not reached at the expense of accuracy, which was even slightly increased with respect to that (88.2) of the best method in the literature. These promising and interpretable results also encourage urologists to put their trust in AI-based solutions.

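Summary: In-vivo identification of kidney stone types during ureteroscopy would be a major advance in urology, because it could shorten the tedious process of extracting renal calculi while reducing infection risks. Such an automated procedure would also make it possible to prescribe anti-recurrence treatments immediately. Today, only a few experienced urologists can recognize kidney stone types in the video images shown on screen during endoscopy. Several deep learning (DL) models have therefore recently been proposed to recognize kidney stone types automatically from ureteroscopic images. However, these DL models are black boxes by nature, which limits their applicability in clinical settings. This work proposes a case-based reasoning DL model that uses prototypical parts (PPs) and generates local and global descriptors. For each class (i.e., kidney stone type), the PPs encode visual feature information (hue, saturation, intensity, and textures) similar to that used by biologists. The PPs are generated optimally thanks to a new loss function used during model training. In addition, the local and global descriptors of the PPs allow decisions to be explained ("what" information, "where in the images") in a way that biologists and urologists can understand. The proposed DL model was tested on a database containing images of the six most widespread kidney stone types. The overall average classification accuracy was 90.37. Comparing this result with those of eight other state-of-the-art DL models for kidney stones shows that the valuable gain in explainability did not come at the expense of accuracy, which even increased slightly relative to the best method in the literature (88.2). These promising and interpretable results also encourage urologists to trust AI-based solutions.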

Towards Interpretable End-Stage Renal Disease (ESRD) Prediction: Utilizing Administrative Claims Data with Explainable AI Techniques

2409.12087v3 by Yubo Li, Saba Al-Sayouri, Rema Padman

This study explores the potential of utilizing administrative claims data, combined with advanced machine learning and deep learning techniques, to predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal Disease (ESRD). We analyze a comprehensive, 10-year dataset provided by a major health insurance organization to develop prediction models for multiple observation windows using traditional machine learning methods such as Random Forest and XGBoost as well as deep learning approaches such as Long Short-Term Memory (LSTM) networks. Our findings demonstrate that the LSTM model, particularly with a 24-month observation window, exhibits superior performance in predicting ESRD progression, outperforming existing models in the literature. We further apply SHapley Additive exPlanations (SHAP) analysis to enhance interpretability, providing insights into the impact of individual features on predictions at the individual patient level. This study underscores the value of leveraging administrative claims data for CKD management and predicting ESRD progression.

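Summary: This study explores the potential of using administrative claims data, combined with advanced machine learning and deep learning techniques, to predict the progression of Chronic Kidney Disease (CKD) to End-Stage Renal Disease (ESRD). We analyze a comprehensive 10-year dataset provided by a major health insurance organization and develop prediction models for multiple observation windows using traditional machine learning methods such as Random Forest and XGBoost as well as deep learning approaches such as Long Short-Term Memory (LSTM) networks. Our findings show that the LSTM model, particularly with a 24-month observation window, performs best in predicting ESRD progression and outperforms existing models in the literature. We further apply SHapley Additive exPlanations (SHAP) analysis to improve interpretability, giving insight into the impact of individual features on predictions at the level of the individual patient. This study underscores the value of using administrative claims data for CKD management and for predicting ESRD progression.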
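SHAP attributions are based on Shapley values from cooperative game theory. As a hedged illustration of what a per-patient attribution means, the sketch below computes exact Shapley values for a tiny model by enumerating feature subsets, with "absent" features replaced by a single baseline vector (an assumption of this sketch). The SHAP library uses faster approximations and model-specific algorithms; this is not the study's pipeline.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley value of each feature of `x` relative to `baseline`.

    `model` maps a list of feature values to a number. The cost grows as
    2**n in the number of features, so this only suits very small n.
    """
    n = len(x)

    def value(present):
        # Features in `present` take their real value, the rest the baseline.
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                present = set(coalition)
                phi[i] += weight * (value(present | {i}) - value(present))
    return phi
```

For a linear model each value reduces to the coefficient times the feature's distance from the baseline, and the values always sum to `model(x) - model(baseline)`, which is the additivity property that gives SHAP its name.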

Explainable AI: Definition and attributes of a good explanation for health AI

2409.15338v1 by Evangelia Kyrimi, Scott McLachlan, Jared M Wohlgemut, Zane B Perkins, David A. Lagnado, William Marsh, the ExAIDSS Expert Group

Proposals of artificial intelligence (AI) solutions based on increasingly complex and accurate predictive models are becoming ubiquitous across many disciplines. As the complexity of these models grows, transparency and users' understanding often diminish. This suggests that accurate prediction alone is insufficient for making an AI-based solution truly useful. In the development of healthcare systems, this introduces new issues related to accountability and safety. Understanding how and why an AI system makes a recommendation may require complex explanations of its inner workings and reasoning processes. Although research on explainable AI (XAI) has significantly increased in recent years and there is high demand for XAI in medicine, defining what constitutes a good explanation remains ad hoc, and providing adequate explanations continues to be challenging. To fully realize the potential of AI, it is critical to address two fundamental questions about explanations for safety-critical AI applications, such as health-AI: (1) What is an explanation in health-AI? and (2) What are the attributes of a good explanation in health-AI? In this study, we examined published literature and gathered expert opinions through a two-round Delphi study. The research outputs include (1) a definition of what constitutes an explanation in health-AI and (2) a comprehensive list of attributes that characterize a good explanation in health-AI.

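Summary: Proposals for artificial intelligence (AI) solutions based on increasingly complex and accurate predictive models are becoming ubiquitous across many disciplines. As the complexity of these models grows, transparency and users' understanding often diminish. This suggests that accurate prediction alone is not enough to make an AI-based solution truly useful. In the development of healthcare systems, this raises new issues of accountability and safety. Understanding how and why an AI system makes a recommendation may require complex explanations of its inner workings and reasoning processes. Although research on explainable AI (XAI) has grown substantially in recent years and demand for XAI in medicine is high, defining what constitutes a good explanation remains ad hoc, and providing adequate explanations remains challenging. To realize the full potential of AI, it is critical to address two fundamental questions about explanations for safety-critical AI applications such as health-AI: (1) What is an explanation in health-AI? and (2) What are the attributes of a good explanation in health-AI? In this study we examined the published literature and gathered expert opinions through a two-round Delphi study. The research outputs include (1) a definition of what constitutes an explanation in health-AI and (2) a comprehensive list of attributes that characterize a good explanation in health-AI.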

Exploring the Effect of Explanation Content and Format on User Comprehension and Trust

2408.17401v1 by Antonio Rago, Bence Palfi, Purin Sukpanichnant, Hannibal Nabli, Kavyesh Vivek, Olga Kostopoulou, James Kinross, Francesca Toni

In recent years, various methods have been introduced for explaining the outputs of "black-box" AI models. However, it is not well understood whether users actually comprehend and trust these explanations. In this paper, we focus on explanations for a regression tool for assessing cancer risk and examine the effect of the explanations' content and format on the user-centric metrics of comprehension and trust. Regarding content, we experiment with two explanation methods: the popular SHAP, based on game-theoretic notions and thus potentially complex for everyday users to comprehend, and occlusion-1, based on feature occlusion which may be more comprehensible. Regarding format, we present SHAP explanations as charts (SC), as is conventional, and occlusion-1 explanations as charts (OC) as well as text (OT), to which their simpler nature also lends itself. The experiments amount to user studies questioning participants, with two different levels of expertise (the general population and those with some medical training), on their subjective and objective comprehension of and trust in explanations for the outputs of the regression tool. In both studies we found a clear preference in terms of subjective comprehension and trust for occlusion-1 over SHAP explanations in general, when comparing based on content. However, direct comparisons of explanations when controlling for format only revealed evidence for OT over SC explanations in most cases, suggesting that the dominance of occlusion-1 over SHAP explanations may be driven by a preference for text over charts as explanations. Finally, we found no evidence of a difference between the explanation types in terms of objective comprehension. Thus overall, the choice of the content and format of explanations needs careful attention, since in some contexts format, rather than content, may play the critical role in improving user experience.

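Summary: In recent years, various methods have been introduced to explain the outputs of "black-box" AI models. However, it is not well understood whether users actually comprehend and trust these explanations. In this paper we focus on explanations for a regression tool that assesses cancer risk and examine how the content and format of explanations affect the user-centric metrics of comprehension and trust. Regarding content, we experiment with two explanation methods: the popular SHAP, which is based on game-theoretic notions and may therefore be complex for everyday users, and occlusion-1, which is based on feature occlusion and may be easier to comprehend. Regarding format, we present SHAP explanations as charts (SC), as is conventional, and occlusion-1 explanations both as charts (OC) and as text (OT), a format to which their simpler nature also lends itself. The experiments are user studies that question participants with two levels of expertise (the general population and people with some medical training) about their subjective and objective comprehension of, and trust in, explanations of the regression tool's outputs. In both studies we found a clear preference, in terms of subjective comprehension and trust, for occlusion-1 over SHAP explanations in general when comparing by content. However, direct comparisons that control for format revealed evidence only for OT over SC explanations in most cases, suggesting that the dominance of occlusion-1 over SHAP may be driven by a preference for text over charts as explanations. Finally, we found no evidence of a difference between explanation types in objective comprehension. Overall, the choice of explanation content and format needs careful attention, since in some contexts format rather than content may play the critical role in improving user experience.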
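A feature-occlusion explanation and its text rendering can be sketched as follows. This assumes that occlusion-1 means replacing one feature at a time with a reference value and recording the change in the model output; the toy risk model, feature names, reference values, and sentence template are all hypothetical and are not taken from the paper.

```python
def occlusion_1(model, x, reference):
    """Score each feature by how much the output drops when it is occluded.

    `x` and `reference` are dicts with the same keys. A positive score means
    the feature's actual value raises the output relative to the reference.
    """
    full_output = model(x)
    scores = {}
    for name in x:
        occluded = dict(x)
        occluded[name] = reference[name]  # occlude exactly one feature
        scores[name] = full_output - model(occluded)
    return scores


def as_text(scores):
    """Render scores as plain sentences, largest absolute effect first."""
    lines = []
    for name, score in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
        if score > 0:
            lines.append(f"{name} raised the predicted risk by {score:.2f}.")
        elif score < 0:
            lines.append(f"{name} lowered the predicted risk by {-score:.2f}.")
        else:
            lines.append(f"{name} did not change the predicted risk.")
    return lines
```

With a toy model such as `lambda r: 0.02 * r["age"] + 0.5 * r["smoker"]`, the same scores could also be drawn as a bar chart, which mirrors the chart-versus-text comparison in the study.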

A Survey for Large Language Models in Biomedicine

2409.00133v1 by Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, Kaili Qu, Shuxin Li, Yi Yu, Pietro Liò, Tianyun Wang, Yu Guang Wang, Yiqing Shen

Recent breakthroughs in large language models (LLMs) offer unprecedented natural language understanding and generation capabilities. However, existing surveys on LLMs in biomedicine often focus on specific applications or model architectures, lacking a comprehensive analysis that integrates the latest advancements across various biomedical domains. This review, based on an analysis of 484 publications sourced from databases including PubMed, Web of Science, and arXiv, provides an in-depth examination of the current landscape, applications, challenges, and prospects of LLMs in biomedicine, distinguishing itself by focusing on the practical implications of these models in real-world biomedical contexts. Firstly, we explore the capabilities of LLMs in zero-shot learning across a broad spectrum of biomedical tasks, including diagnostic assistance, drug discovery, and personalized medicine, among others, with insights drawn from 137 key studies. Then, we discuss adaptation strategies of LLMs, including fine-tuning methods for both uni-modal and multi-modal LLMs to enhance their performance in specialized biomedical contexts where zero-shot learning falls short, such as medical question answering and efficient processing of biomedical literature. Finally, we discuss the challenges that LLMs face in the biomedicine domain, including data privacy concerns, limited model interpretability, issues with dataset quality, and ethics due to the sensitive nature of biomedical data, the need for highly reliable model outputs, and the ethical implications of deploying AI in healthcare. To address these challenges, we also identify future research directions of LLMs in biomedicine, including federated learning methods to preserve data privacy and integrating explainable AI methodologies to enhance the transparency of LLMs.

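Summary: Recent breakthroughs in large language models (LLMs) offer unprecedented natural language understanding and generation capabilities. However, existing surveys of LLMs in biomedicine usually focus on specific applications or model architectures and lack a comprehensive analysis that integrates the latest advances across biomedical domains. Based on an analysis of 484 publications from databases including PubMed, Web of Science, and arXiv, this review examines in depth the current landscape, applications, challenges, and prospects of LLMs in biomedicine, and is distinguished by its focus on the practical use of these models in real-world biomedical contexts. First, we explore the zero-shot learning capabilities of LLMs across a broad range of biomedical tasks, including diagnostic assistance, drug discovery, and personalized medicine, drawing insights from 137 key studies. We then discuss adaptation strategies for LLMs, including fine-tuning methods for uni-modal and multi-modal LLMs that improve performance in specialized biomedical contexts where zero-shot learning falls short, such as medical question answering and efficient processing of biomedical literature. Finally, we discuss the challenges LLMs face in biomedicine, including data privacy concerns, limited model interpretability, dataset quality issues, and ethical questions arising from the sensitivity of biomedical data, the need for highly reliable model outputs, and the ethical implications of deploying AI in healthcare. To address these challenges, we also identify future research directions for LLMs in biomedicine, including federated learning methods to preserve data privacy and the integration of explainable AI methods to increase the transparency of LLMs.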

Aligning XAI with EU Regulations for Smart Biomedical Devices: A Methodology for Compliance Analysis

2408.15121v1 by Francesco Sovrano, Michael Lognoul, Giulia Vilone

Significant investment and development have gone into integrating Artificial Intelligence (AI) in medical and healthcare applications, leading to advanced control systems in medical technology. However, the opacity of AI systems raises concerns about essential characteristics needed in such sensitive applications, like transparency and trustworthiness. Our study addresses these concerns by investigating a process for selecting the most adequate Explainable AI (XAI) methods to comply with the explanation requirements of key EU regulations in the context of smart bioelectronics for medical devices. The adopted methodology starts with categorising smart devices by their control mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and delving into their technology. Then, we analyse these regulations to define their explainability requirements for the various devices and related goals. Simultaneously, we classify XAI methods by their explanatory objectives. This allows for matching legal explainability requirements with XAI explanatory goals and determining the suitable XAI algorithms for achieving them. Our findings provide a nuanced understanding of which XAI algorithms align better with EU regulations for different types of medical devices. We demonstrate this through practical case studies on different neural implants, from chronic disease management to advanced prosthetics. This study fills a crucial gap in aligning XAI applications in bioelectronics with stringent provisions of EU regulations. It provides a practical framework for developers and researchers, ensuring their AI innovations advance healthcare technology and adhere to legal and ethical standards.

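Summary: Significant investment and development have gone into integrating artificial intelligence (AI) into medical and healthcare applications, leading to advanced control systems in medical technology. However, the opacity of AI systems raises concerns about essential characteristics needed in such sensitive applications, such as transparency and trustworthiness. Our study addresses these concerns by investigating a process for selecting the most adequate Explainable AI (XAI) methods to comply with the explanation requirements of key EU regulations in the context of smart bioelectronics for medical devices. The adopted methodology starts by categorizing smart devices by their control mechanisms (open-loop, closed-loop, and semi-closed-loop systems) and examining their technology. We then analyze the regulations to define their explainability requirements for the various devices and related goals. In parallel, we classify XAI methods by their explanatory objectives. This makes it possible to match legal explainability requirements with XAI explanatory goals and to determine suitable XAI algorithms for achieving them. Our findings give a nuanced understanding of which XAI algorithms align better with EU regulations for different types of medical devices. We demonstrate this through practical case studies on different neural implants, from chronic disease management to advanced prosthetics. This study fills an important gap in aligning XAI applications in bioelectronics with the stringent provisions of EU regulations. It provides a practical framework for developers and researchers, ensuring that their AI innovations advance healthcare technology and adhere to legal and ethical standards.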

Towards Case-based Interpretability for Medical Federated Learning

2408.13626v1 by Laura Latorre, Liliana Petrychenko, Regina Beets-Tan, Taisiya Kopytova, Wilson Silva

We explore deep generative models to generate case-based explanations in a medical federated learning setting. Explaining AI model decisions through case-based interpretability is paramount to increasing trust and allowing widespread adoption of AI in clinical practice. However, medical AI training paradigms are shifting towards federated learning settings in order to comply with data protection regulations. In a federated scenario, past data is inaccessible to the current user. Thus, we use a deep generative model to generate synthetic examples that protect privacy and explain decisions. Our proof-of-concept focuses on pleural effusion diagnosis and uses publicly available Chest X-ray data.

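Summary: We explore deep generative models for producing case-based explanations in a medical federated learning setting. Explaining AI model decisions through case-based interpretability is essential for increasing trust and enabling widespread adoption of AI in clinical practice. However, medical AI training paradigms are shifting toward federated learning settings in order to comply with data protection regulations. In a federated scenario, past data is inaccessible to the current user. We therefore use a deep generative model to produce synthetic examples that protect privacy and explain decisions. Our proof of concept focuses on pleural effusion diagnosis and uses publicly available chest X-ray data.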

AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines

2408.12491v1 by Douwe J. Spaanderman, Matthew Marzetti, Xinyi Wan, Andrew F. Scarsbrook, Philip Robinson, Edwin H. G. Oei, Jacob J. Visser, Robert Hemke, Kirsten van Langevelde, David F. Hanff, Geert J. L. H. van Leenders, Cornelis Verhoef, Dirk J. Grünhagen, Wiro J. Niessen, Stefan Klein, Martijn P. A. Starmans

Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging lesions with variable clinical behaviours and treatment approaches. This systematic review provides an overview of Artificial Intelligence (AI) methods using radiological imaging for diagnosis and prognosis of these tumours, highlighting challenges in clinical translation, and evaluating study alignment with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI international consensus guidelines for trustworthy and deployable AI to promote the clinical translation of AI methods. The review covered literature from several bibliographic databases, including papers published before 17/07/2024. Original research in peer-reviewed journals focused on radiology-based AI for diagnosing or prognosing primary STBT was included. Exclusion criteria were animal, cadaveric, or laboratory studies, and non-English papers. Abstracts were screened by two of three independent reviewers for eligibility. Eligible papers were assessed against guidelines by one of three independent reviewers. The search identified 15,015 abstracts, from which 325 articles were included for evaluation. Most studies performed moderately on CLAIM, averaging a score of 28.9±7.5 out of 53, but poorly on FUTURE-AI, averaging 5.1±2.1 out of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage, indicating significant room for improvement. Future efforts by AI developers should focus on design (e.g. define unmet clinical need, intended clinical setting and how AI would be integrated in clinical workflow), development (e.g. build on previous work, explainability), evaluation (e.g. evaluating and addressing biases, evaluating AI against best practices), and data reproducibility and availability (making documented code and data publicly available). Following these recommendations could improve clinical translation of AI methods.

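Summary: Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging lesions with varied clinical behaviours and treatment approaches. This systematic review gives an overview of artificial intelligence (AI) methods that use radiological imaging for the diagnosis and prognosis of these tumours, highlights challenges in clinical translation, and evaluates how well studies align with the Checklist for AI in Medical Imaging (CLAIM) and the FUTURE-AI international consensus guidelines for trustworthy and deployable AI, in order to promote the clinical translation of AI methods. The review covered literature from several bibliographic databases, including papers published before 17 July 2024. Original research in peer-reviewed journals on radiology-based AI for diagnosing or prognosing primary STBT was included. Exclusion criteria were animal, cadaveric, or laboratory studies and non-English papers. Abstracts were screened for eligibility by two of three independent reviewers. Eligible papers were assessed against the guidelines by one of three independent reviewers. The search identified 15,015 abstracts, of which 325 articles were included for evaluation. Most studies performed moderately on CLAIM, with an average score of 28.9±7.5 out of 53, but poorly on FUTURE-AI, with an average of 5.1±2.1 out of 30. Imaging-AI tools for STBT remain at the proof-of-concept stage, indicating significant room for improvement. Future efforts by AI developers should focus on design (e.g. defining the unmet clinical need, the intended clinical setting, and how AI would be integrated into the clinical workflow), development (e.g. building on previous work, explainability), evaluation (e.g. evaluating and addressing biases, evaluating AI against best practices), and data reproducibility and availability (making documented code and data publicly available). Following these recommendations could improve the clinical translation of AI methods.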

Evaluating Explainable AI Methods in Deep Learning Models for Early Detection of Cerebral Palsy

2409.00001v1 by Kimji N. Pellano, Inga Strümke, Daniel Groos, Lars Adde, Espen Alexander F. Ihlen

Early detection of Cerebral Palsy (CP) is crucial for effective intervention and monitoring. This paper tests the reliability and applicability of Explainable AI (XAI) methods using a deep learning method that predicts CP by analyzing skeletal data extracted from video recordings of infant movements. Specifically, we use XAI evaluation metrics -- namely faithfulness and stability -- to quantitatively assess the reliability of Class Activation Mapping (CAM) and Gradient-weighted Class Activation Mapping (Grad-CAM) in this specific medical application. We utilize a unique dataset of infant movements and apply skeleton data perturbations without distorting the original dynamics of the infant movements. Our CP prediction model utilizes an ensemble approach, so we evaluate the XAI metric performance for both the overall ensemble and the individual models. Our findings indicate that both XAI methods effectively identify key body points influencing CP predictions and that the explanations are robust against minor data perturbations. Grad-CAM significantly outperforms CAM in the RISv metric, which measures stability in terms of velocity. In contrast, CAM performs better in the RISb metric, which relates to bone stability, and the RRS metric, which assesses internal representation robustness. Individual models within the ensemble show varied results, and neither CAM nor Grad-CAM consistently outperforms the other, with the ensemble approach providing a representation of outcomes from its constituent models.
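The stability metrics named above compare how much an explanation changes with how much the input changed. The sketch below shows a generic relative input stability score under stated assumptions: it uses the L2 norm of element-wise relative changes and takes the worst case over a set of perturbations. It does not reproduce the paper's velocity (RISv) or bone (RISb) variants, and all function names and data are hypothetical.

```python
import math

def _safe(value, eps):
    """Avoid division by (near) zero while keeping the value otherwise unchanged."""
    return value if abs(value) >= eps else eps

def _relative_change_norm(original, perturbed, eps):
    """L2 norm of the element-wise relative change between two vectors."""
    return math.sqrt(sum(((o - p) / _safe(o, eps)) ** 2
                         for o, p in zip(original, perturbed)))

def relative_input_stability(x, expl, perturbed_pairs, eps=1e-6):
    """Worst-case ratio of relative explanation change to relative input change.

    perturbed_pairs: list of (x_perturbed, explanation_of_x_perturbed).
    Lower values mean the explanation is more stable.
    """
    ratios = []
    for x_pert, expl_pert in perturbed_pairs:
        expl_change = _relative_change_norm(expl, expl_pert, eps)
        input_change = _relative_change_norm(x, x_pert, eps)
        ratios.append(expl_change / max(input_change, eps))
    return max(ratios)

# Hypothetical toy input (two joint coordinates) and its attribution scores.
x = [1.0, 2.0]
expl = [0.5, 0.5]
pairs = [([1.1, 2.0], [0.5, 0.6])]
ris = relative_input_stability(x, expl, pairs)
print(round(ris, 3))
```

In this toy case a 10% change in one input coordinate produces a 20% change in one attribution, so the score is about 2.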


MicroXercise: A Micro-Level Comparative and Explainable System for Remote Physical Therapy

2408.11837v1 by Hanchen David Wang, Nibraas Khan, Anna Chen, Nilanjan Sarkar, Pamela Wisniewski, Meiyi Ma

Recent global estimates suggest that as many as 2.41 billion individuals have health conditions that would benefit from rehabilitation services. Home-based Physical Therapy (PT) faces significant challenges in providing interactive feedback and meaningful observation for therapists and patients. To fill this gap, we present MicroXercise, which integrates micro-motion analysis with wearable sensors, providing therapists and patients with a comprehensive feedback interface, including video, text, and scores. Crucially, it employs multi-dimensional Dynamic Time Warping (DTW) and attribution-based explainable methods to analyze the existing deep learning neural networks in monitoring exercises, focusing on a high granularity of exercise. This synergistic approach is pivotal, providing output matching the input size to precisely highlight critical subtleties and movements in PT, thus transforming complex AI analysis into clear, actionable feedback. By highlighting these micro-motions in different metrics, such as stability and range of motion, MicroXercise significantly enhances the understanding and relevance of feedback for end-users. Comparative performance metrics underscore its effectiveness over traditional methods, such as a 39% and 42% improvement in Feature Mutual Information (FMI) and Continuity. MicroXercise is a step ahead in home-based physical therapy, providing a technologically advanced and intuitively helpful solution to enhance patient care and outcomes.
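For readers unfamiliar with Dynamic Time Warping, the sketch below shows the textbook dynamic-programming form applied to multi-dimensional sequences, where each time step is a vector of sensor readings. It assumes the "dependent" variant (one shared warping path, Euclidean distance between frames); the abstract does not say which multi-dimensional variant MicroXercise uses, and the function name and data are hypothetical.

```python
import math

def dtw_distance(seq_a, seq_b):
    """DTW cost between two sequences of equal-dimension frames (lists of floats)."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = frame_dist + min(cost[i - 1][j],      # seq_b frame repeated
                                          cost[i][j - 1],      # seq_a frame repeated
                                          cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Hypothetical 2-D sensor traces: the patient holds the middle pose one step longer.
reference = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
patient = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
print(dtw_distance(reference, patient))
```

Because DTW allows frames to be repeated along the alignment, a movement performed at a different pace but with the same shape yields zero cost here, which is why the technique suits comparing exercise repetitions.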


The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development

2408.05239v1 by Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz

Systematic literature reviews are the highest quality of evidence in research. However, the review process is hindered by significant resource and data constraints. The Literature Review Network (LRN) is the first of its kind explainable AI platform adhering to PRISMA 2020 standards, designed to automate the entire literature review process. LRN was evaluated in the domain of surgical glove practices using 3 search strings developed by experts to query PubMed. A non-expert trained all LRN models. Performance was benchmarked against an expert manual review. Explainability and performance metrics assessed LRN's ability to replicate the experts' review. Concordance was measured with the Jaccard index and confusion matrices. Researchers were blinded to the other's results until study completion. Overlapping studies were integrated into an LRN-generated systematic review. LRN models demonstrated superior classification accuracy without expert training, achieving 84.78% and 85.71% accuracy. The highest performance model achieved high interrater reliability (k = 0.4953) and explainability metrics, linking 'reduce', 'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51% of the relevant literature despite diverging from the non-expert's judgments (k = 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN outperformed the manual review (19,920 minutes over 11 months), reducing the entire process to 288.6 minutes over 5 days. This study demonstrates that explainable AI does not require expert training to successfully conduct PRISMA-compliant systematic literature reviews like an expert. LRN summarized the results of surgical glove studies and identified themes that were nearly identical to the clinical researchers' findings. Explainable AI can accurately expedite our understanding of clinical practices, potentially revolutionizing healthcare research.
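The concordance measures mentioned above can be computed in a few lines. The sketch below implements the Jaccard index on sets of included studies and Cohen's kappa on paired include/exclude decisions; the study identifiers and labels are invented for illustration and are not data from the paper.

```python
def jaccard(set_a, set_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:  # both raters always use the same single label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical screening results: which studies each reviewer included.
expert_included = {"s1", "s2", "s3"}
model_included = {"s2", "s3", "s4"}
overlap = jaccard(expert_included, model_included)

# Hypothetical include (1) / exclude (0) decisions on the same eight abstracts.
expert = [1, 1, 0, 0, 1, 0, 1, 0]
model = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(expert, model)
print(overlap, kappa)
```

In this toy example the raters agree on 6 of 8 abstracts (75%), but because half of that agreement is expected by chance, kappa is 0.5.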


Enhancing Medical Learning and Reasoning Systems: A Boxology-Based Comparative Analysis of Design Patterns

2408.02709v1 by Chi Him Ng

This study analyzes hybrid AI systems' design patterns and their effectiveness in clinical decision-making using the boxology framework. It categorizes and compares various architectures combining machine learning and rule-based reasoning to provide insights into their structural foundations and healthcare applications. Addressing two main questions, how to categorize these systems against established design patterns and how to extract insights through comparative analysis, the study uses design patterns from software engineering to understand and optimize healthcare AI systems. Boxology helps identify commonalities and create reusable solutions, enhancing these systems' scalability, reliability, and performance. Five primary architectures are examined: REML, MLRB, RBML, RMLT, and PERML. Each has unique strengths and weaknesses, highlighting the need for tailored approaches in clinical tasks. REML excels in high-accuracy prediction for datasets with limited data; MLRB in handling large datasets and complex data integration; RBML in explainability and trustworthiness; RMLT in managing high-dimensional data; and PERML, though limited in analysis, shows promise in urgent care scenarios. The study introduces four new patterns, creates five abstract categorization patterns, and refines those five further to specific systems. These contributions enhance Boxology's taxonomical organization and offer novel approaches to integrating expert knowledge with machine learning. Boxology's structured, modular approach offers significant advantages in developing and analyzing hybrid AI systems, revealing commonalities, and promoting reusable solutions. In conclusion, this study underscores hybrid AI systems' crucial role in advancing healthcare and Boxology's potential to drive further innovation in AI integration, ultimately improving clinical decision support and patient outcomes.


Bayesian Kolmogorov Arnold Networks (Bayesian_KANs): A Probabilistic Approach to Enhance Accuracy and Interpretability

2408.02706v1 by Masoud Muhammed Hassan

Because of its strong predictive skills, deep learning has emerged as an essential tool in many industries, including healthcare. Traditional deep learning models, on the other hand, frequently lack interpretability and fail to take prediction uncertainty into account, two crucial components of clinical decision making. In order to produce explainable and uncertainty-aware predictions, this study presents a novel framework called Bayesian Kolmogorov Arnold Networks (BKANs), which combines the expressive capacity of Kolmogorov Arnold Networks with Bayesian inference. We employ BKANs on two medical datasets, which are widely used benchmarks for assessing machine learning models in medical diagnostics: the Pima Indians Diabetes dataset and the Cleveland Heart Disease dataset. Our method provides useful insights into prediction confidence and decision boundaries and outperforms traditional deep learning models in terms of prediction accuracy. Moreover, BKANs' capacity to represent aleatoric and epistemic uncertainty guarantees doctors receive more solid and trustworthy decision support. Our Bayesian strategy improves the interpretability of the model and considerably minimises overfitting, which is important for tiny and imbalanced medical datasets, according to experimental results. We present possible expansions to further use BKANs in more complicated multimodal datasets and address the significance of these discoveries for future research in building reliable AI systems for healthcare. This work paves the way for a new paradigm in deep learning model deployment in vital sectors where transparency and reliability are crucial.
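To make the distinction between aleatoric and epistemic uncertainty concrete, the sketch below applies a common entropy-based decomposition to class probabilities obtained from several posterior samples of a Bayesian model. This is a generic technique, not necessarily the one used for BKANs; the function names and probability values are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decompose_uncertainty(sampled_probs):
    """Split predictive uncertainty for one patient into two parts.

    sampled_probs: one class-probability vector per posterior sample.
    Returns (total, aleatoric, epistemic), where
      total     = entropy of the averaged prediction,
      aleatoric = average entropy of the individual predictions,
      epistemic = total - aleatoric (disagreement between samples).
    """
    n = len(sampled_probs)
    k = len(sampled_probs[0])
    mean_probs = [sum(s[c] for s in sampled_probs) / n for c in range(k)]
    total = entropy(mean_probs)
    aleatoric = sum(entropy(s) for s in sampled_probs) / n
    return total, aleatoric, total - aleatoric

# Samples agree that the case is a coin flip: uncertainty is purely aleatoric.
noisy_case = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Samples are individually certain but contradict each other: purely epistemic.
unfamiliar_case = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
print(noisy_case, unfamiliar_case)
```

Both toy cases have the same total uncertainty (ln 2), but only the second would be expected to shrink with more training data, which is the practical reason for reporting the two parts separately.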


MLtoGAI: Semantic Web based with Machine Learning for Enhanced Disease Prediction and Personalized Recommendations using Generative AI

2407.20284v1 by Shyam Dongre, Ritesh Chandra, Sonali Agarwal

In modern healthcare, addressing the complexities of accurate disease prediction and personalized recommendations is both crucial and challenging. This research introduces MLtoGAI, which integrates Semantic Web technology with Machine Learning (ML) to enhance disease prediction and offer user-friendly explanations through ChatGPT. The system comprises three key components: a reusable disease ontology that incorporates detailed knowledge about various diseases, a diagnostic classification model that uses patient symptoms to detect specific diseases accurately, and the integration of Semantic Web Rule Language (SWRL) with ontology and ChatGPT to generate clear, personalized health advice. This approach significantly improves prediction accuracy and ensures results that are easy to understand, addressing the complexity of diseases and diverse symptoms. The MLtoGAI system demonstrates substantial advancements in accuracy and user satisfaction, contributing to developing more intelligent and accessible healthcare solutions. This innovative approach combines the strengths of ML algorithms with the ability to provide transparent, human-understandable explanations through ChatGPT, achieving significant improvements in prediction accuracy and user comprehension. By leveraging semantic technology and explainable AI, the system enhances the accuracy of disease prediction and ensures that the recommendations are relevant and easily understood by individual patients. Our research highlights the potential of integrating advanced technologies to overcome existing challenges in medical diagnostics, paving the way for future developments in intelligent healthcare systems. Additionally, the system is validated using 200 synthetic patient data records, ensuring robust performance and reliability.


Introducing δ-XAI: a novel sensitivity-based method for local AI explanations

2407.18343v2 by Alessandro De Carlo, Enea Parimbelli, Nicola Melillo, Giovanna Nicora

Explainable Artificial Intelligence (XAI) is central to the debate on integrating Artificial Intelligence (AI) and Machine Learning (ML) algorithms into clinical practice. High-performing AI/ML models, such as ensemble learners and deep neural networks, often lack interpretability, hampering clinicians' trust in their predictions. To address this, XAI techniques are being developed to describe AI/ML predictions in human-understandable terms. One promising direction is the adaptation of sensitivity analysis (SA) and global sensitivity analysis (GSA), which inherently rank model inputs by their impact on predictions. Here, we introduce a novel delta-XAI method that provides local explanations of ML model predictions by extending the delta index, a GSA metric. The delta-XAI index assesses the impact of each feature's value on the predicted output for individual instances in both regression and classification problems. We formalize the delta-XAI index and provide code for its implementation. The delta-XAI method was evaluated on simulated scenarios using linear regression models, with Shapley values serving as a benchmark. Results showed that the delta-XAI index is generally consistent with Shapley values, with notable discrepancies in models with highly impactful or extreme feature values. The delta-XAI index demonstrated higher sensitivity in detecting dominant features and handling extreme feature values. Qualitatively, the delta-XAI provides intuitive explanations by leveraging probability density functions, making feature rankings clearer and more explainable for practitioners. Overall, the delta-XAI method appears promising for robustly obtaining local explanations of ML model predictions. Further investigations in real-world clinical settings will be conducted to evaluate its impact on AI-assisted clinical workflows.
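For context, the delta index referred to above is Borgonovo's moment-independent sensitivity measure, which compares the unconditional density of the model output with its density conditional on one input. The first formula below is that standard global definition. The second is only an assumed reading of how a local, per-instance version could look (dropping the outer expectation and fixing the feature at the instance's value); the paper's exact formalization is not given in the abstract.

```latex
% Global delta index for input X_i (Borgonovo):
\delta_i = \frac{1}{2}\, \mathbb{E}_{X_i}\!\left[ \int \left| f_Y(y) - f_{Y \mid X_i}(y) \right| \, dy \right]

% Assumed local counterpart for one instance with feature value x_i^*:
\delta_i^{\text{local}}(x_i^*) = \frac{1}{2} \int \left| f_Y(y) - f_{Y \mid X_i = x_i^*}(y) \right| \, dy
```

Both quantities lie between 0 and 1: a value of 0 means that fixing the feature leaves the output distribution unchanged, and larger values mean the feature value shifts the distribution more strongly.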


Enhanced Deep Learning Methodologies and MRI Selection Techniques for Dementia Diagnosis in the Elderly Population

2407.17324v2 by Nikolaos Ntampakis, Konstantinos Diamantaras, Ioanna Chouvarda, Vasileios Argyriou, Panagiotis Sarigianndis

Dementia, a debilitating neurological condition affecting millions worldwide, presents significant diagnostic challenges. In this work, we introduce a novel methodology for the classification of demented and non-demented elderly patients using 3D brain Magnetic Resonance Imaging (MRI) scans. Our approach features a unique technique for selectively processing MRI slices, focusing on the most relevant brain regions and excluding less informative sections. This methodology is complemented by a confidence-based classification committee composed of three custom deep learning models: Dem3D ResNet, Dem3D CNN, and Dem3D EfficientNet. These models work synergistically to enhance decision-making accuracy, leveraging their collective strengths. Tested on the Open Access Series of Imaging Studies (OASIS) dataset, our method achieved an impressive accuracy of 94.12%, surpassing existing methodologies. Furthermore, validation on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset confirmed the robustness and generalizability of our approach. The use of explainable AI (XAI) techniques and comprehensive ablation studies further substantiate the effectiveness of our techniques, providing insights into the decision-making process and the importance of our methodology. This research offers a significant advancement in dementia diagnosis, providing a highly accurate and efficient tool for clinical applications.
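The abstract does not spell out how the confidence-based committee combines its three models. One plausible reading, sketched below purely as an assumption, is a confidence-weighted vote: each model votes for a class and its vote counts in proportion to how far its predicted probability lies from 0.5. The function name and probabilities are hypothetical.

```python
def committee_predict(demented_probs):
    """Confidence-weighted vote over per-model probabilities of 'demented'.

    Each model votes for class 1 (demented) if its probability is >= 0.5,
    otherwise for class 0. The vote weight is the model's confidence,
    scaled to [0, 1] as 2 * |p - 0.5|. Ties go to class 1.
    """
    weight = {0: 0.0, 1: 0.0}
    for p in demented_probs:
        vote = 1 if p >= 0.5 else 0
        weight[vote] += 2.0 * abs(p - 0.5)
    return 1 if weight[1] >= weight[0] else 0

# Hypothetical outputs of three models for one scan: one confident "demented"
# vote outweighs two hesitant "non-demented" votes.
decision = committee_predict([0.90, 0.45, 0.40])
print(decision)
```

Unlike a plain majority vote, which would return 0 for this scan, the weighted scheme lets a single highly confident member override two near-undecided ones.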


Using Large Language Models to Compare Explainable Models for Smart Home Human Activity Recognition

2408.06352v1 by Michele Fiori, Gabriele Civitarese, Claudio Bettini

Recognizing daily activities with unobtrusive sensors in smart environments enables various healthcare applications. Monitoring how subjects perform activities at home and their changes over time can reveal early symptoms of health issues, such as cognitive decline. Most approaches in this field use deep learning models, which are often seen as black boxes mapping sensor data to activities. However, non-expert users like clinicians need to trust and understand these models' outputs. Thus, eXplainable AI (XAI) methods for Human Activity Recognition have emerged to provide intuitive natural language explanations from these models. Different XAI methods generate different explanations, and their effectiveness is typically evaluated through user surveys, which are often challenging in terms of cost and fairness. This paper proposes an automatic evaluation method using Large Language Models (LLMs) to identify, in a pool of candidates, the best XAI approach for non-expert users. Our preliminary results suggest that LLM evaluation aligns with user surveys.


Explainable AI-based Intrusion Detection System for Industry 5.0: An Overview of the Literature, associated Challenges, the existing Solutions, and Potential Research Directions

2408.03335v1 by Naseem Khan, Kashif Ahmad, Aref Al Tamimi, Mohammed M. Alani, Amine Bermak, Issa Khalil

Industry 5.0, which focuses on human and Artificial Intelligence (AI) collaboration for performing different tasks in manufacturing, involves a higher number of robots, Internet of Things (IoT) devices and interconnections, Augmented/Virtual Reality (AR/VR), and other smart devices. The huge involvement of these devices and interconnections in various critical areas, such as economy, health, education and defense systems, poses several types of potential security flaws. AI itself has proven to be a very effective and powerful tool in different areas of cybersecurity, such as intrusion detection, malware detection, and phishing detection, among others. As in many application areas, cybersecurity professionals have been reluctant to accept black-box ML solutions for cybersecurity applications. This reluctance pushed forward the adoption of eXplainable Artificial Intelligence (XAI) as a tool that helps explain how decisions are made in ML-based systems. In this survey, we present a comprehensive study of different XAI-based intrusion detection systems for Industry 5.0, and we also examine the impact of explainability and interpretability on cybersecurity practices through the lens of Adversarial XIDS (Adv-XIDS) approaches. Furthermore, we analyze the possible opportunities and challenges in XAI cybersecurity systems for Industry 5.0 that elicit future research toward XAI-based solutions to be adopted by high-stakes Industry 5.0 applications. We believe this rigorous analysis will establish a foundational framework for subsequent research endeavors within the specified domain.


A Comparative Study on Automatic Coding of Medical Letters with Explainability

2407.13638v1 by Jamie Glen, Lifeng Han, Paul Rayson, Goran Nenadic

This study aims to explore the implementation of Natural Language Processing (NLP) and machine learning (ML) techniques to automate the coding of medical letters with visualised explainability and light-weighted local computer settings. Currently in clinical settings, coding is a manual process that involves assigning codes to each condition, procedure, and medication in a patient's paperwork (e.g., 56265001 heart disease using SNOMED CT code). There is preliminary research on automatic coding in this field using state-of-the-art ML models; however, due to the complexity and size of the models, real-world deployment has not been achieved. To further facilitate the possibility of automatic coding practice, we explore some solutions in a local computer setting; in addition, we explore the function of explainability for transparency of AI models. We used the publicly available MIMIC-III database and the HAN/HLAN network models for ICD code prediction purposes. We also experimented with the mapping between ICD and SNOMED CT knowledge bases. In our experiments, the models provided useful information for 97.98% of codes. The result of this investigation can shed some light on implementing automatic clinical coding in practice, such as in hospital settings, on the local computers used by clinicians. Project page: https://github.com/Glenj01/Medical-Coding.


Explainable AI for Enhancing Efficiency of DL-based Channel Estimation

2407.07009v1 by Abdul Karim Gizzini, Yahia Medjahdi, Ali J. Ghandour, Laurent Clavier

The support of artificial intelligence (AI) based decision-making is a key element in future 6G networks, where the concept of native AI will be introduced. Moreover, AI is widely employed in different critical applications such as autonomous driving and medical diagnosis. In such applications, using AI as black-box models is risky and challenging. Hence, it is crucial to understand and trust the decisions taken by these models. Tackling this issue can be achieved by developing explainable AI (XAI) schemes that aim to explain the logic behind the black-box model behavior, and thus, ensure its efficient and safe deployment. Recently, we proposed a novel perturbation-based XAI-CHEST framework that is oriented toward channel estimation in wireless communications. The core idea of the XAI-CHEST framework is to identify the relevant model inputs by inducing high noise on the irrelevant ones. This manuscript provides the detailed theoretical foundations of the XAI-CHEST framework. In particular, we derive the analytical expressions of the XAI-CHEST loss functions and the noise threshold fine-tuning optimization problem. Hence the designed XAI-CHEST delivers a smart input feature selection methodology that can further improve the overall performance while optimizing the architecture of the employed model. Simulation results show that the XAI-CHEST framework provides valid interpretations, where it offers an improved bit error rate performance while reducing the required computational complexity in comparison to the classical DL-based channel estimation.


Explainable AI: Comparative Analysis of Normal and Dilated ResNet Models for Fundus Disease Classification

2407.05440v2 by P. N. Karthikayan, Yoga Sri Varshan V, Hitesh Gupta Kattamuri, Umarani Jayaraman

This paper presents dilated Residual Network (ResNet) models for disease classification from retinal fundus images. Dilated convolution filters are used to replace normal convolution filters in the higher layers of the ResNet model (dilated ResNet) in order to improve the receptive field compared to the normal ResNet model for disease classification. This study introduces computer-assisted diagnostic tools that employ deep learning, enhanced with explainable AI techniques. These techniques aim to make the tool's decision-making process transparent, thereby enabling medical professionals to understand and trust the AI's diagnostic decision. They are particularly relevant in today's healthcare landscape, where there is a growing demand for transparency in AI applications to ensure their reliability and ethical use. The dilated ResNet is used as a replacement for the normal ResNet to enhance the classification accuracy of retinal eye diseases and reduce the required computing time. The dataset used in this work is the Ocular Disease Intelligent Recognition (ODIR) dataset which is a structured ophthalmic database with eight classes covering most of the common retinal eye diseases. The evaluation metrics used in this work include precision, recall, accuracy, and F1 score. In this work, a comparative study has been made between normal ResNet models and dilated ResNet models on five variants namely ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The dilated ResNet model shows promising results as compared to normal ResNet with an average F1 score of 0.71, 0.70, 0.69, 0.67, and 0.70 respectively for the above respective variants in ODIR multiclass disease classification.
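The claim that dilation enlarges the receptive field can be checked with simple arithmetic: a kernel of size k with dilation d spans d * (k - 1) + 1 input positions. The sketch below applies the standard receptive-field recurrence to a stack of convolution layers. The layer configurations are illustrative only; the abstract does not state which dilation rates the authors use or exactly which layers they modify.

```python
def effective_kernel(kernel_size, dilation):
    """Number of input positions spanned by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1

def receptive_field(layers):
    """Receptive field (along one axis) of a stack of convolution layers.

    layers: list of (kernel_size, dilation, stride) tuples, input to output.
    """
    field, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, dilation, stride in layers:
        field += (effective_kernel(kernel_size, dilation) - 1) * jump
        jump *= stride
    return field

# Three stacked 3x3 convolutions with stride 1 (hypothetical configurations).
normal = receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)])
dilated = receptive_field([(3, 1, 1), (3, 2, 1), (3, 4, 1)])
print(normal, dilated)
```

With the same number of weights, the dilated stack covers 15 input pixels per axis instead of 7, which is the sense in which dilation improves the receptive field without adding parameters.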


A Survey on Trustworthiness in Foundation Models for Medical Image Analysis

2407.15851v2 by Congzhen Shi, Ryan Rezai, Jiaxi Yang, Qi Dou, Xiaoxiao Li

The rapid advancement of foundation models in medical imaging represents a significant leap toward enhancing diagnostic accuracy and personalized treatment. However, the deployment of foundation models in healthcare necessitates a rigorous examination of their trustworthiness, encompassing privacy, robustness, reliability, explainability, and fairness. The current body of survey literature on foundation models in medical imaging reveals considerable gaps, particularly in the area of trustworthiness. Additionally, existing surveys on the trustworthiness of foundation models do not adequately address their specific variations and applications within the medical imaging domain. This survey aims to fill that gap by presenting a novel taxonomy of foundation models used in medical imaging and analyzing the key motivations for ensuring their trustworthiness. We review current research on foundation models in major medical imaging applications, focusing on segmentation, medical report generation, medical question and answering (Q&A), and disease diagnosis. These areas are highlighted because they have seen a relatively mature and substantial number of foundation models compared to other applications. We focus on literature that discusses trustworthiness in medical image analysis manuscripts. We explore the complex challenges of building trustworthy foundation models for each application, summarizing current concerns and strategies for enhancing trustworthiness. Furthermore, we examine the potential of these models to revolutionize patient care. Our analysis underscores the imperative for advancing towards trustworthy AI in medical image analysis, advocating for a balanced approach that fosters innovation while ensuring ethical and equitable healthcare delivery.

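Summary: The rapid progress of foundation models in medical imaging marks a major step toward better diagnostic accuracy and personalized treatment. Deploying them in healthcare, however, requires rigorous scrutiny of their trustworthiness, including privacy, robustness, reliability, explainability, and fairness. Existing surveys of foundation models in medical imaging show considerable gaps, particularly on trustworthiness, and existing surveys of foundation-model trustworthiness do not adequately address the specific variations and applications in the medical imaging domain. This survey aims to fill the gap by proposing a new taxonomy of foundation models used in medical imaging and analyzing the key motivations for ensuring their trustworthiness. We review current research on foundation models in the major medical imaging applications, focusing on segmentation, medical report generation, medical question answering (Q&A), and disease diagnosis. These areas are highlighted because they have a relatively mature and large body of foundation models compared with other applications. We concentrate on literature that discusses trustworthiness in medical image analysis. We examine the complex challenges of building trustworthy foundation models for each application and summarize current concerns and strategies for improving trustworthiness. We also consider the potential of these models to transform patient care. Our analysis underscores the need to move toward trustworthy AI in medical image analysis and argues for a balanced approach that fosters innovation while ensuring ethical and equitable healthcare delivery.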

The Impact of an XAI-Augmented Approach on Binary Classification with Scarce Data

2407.06206v1 by Ximing Wen, Rosina O. Weber, Anik Sen, Darryl Hannan, Steven C. Nesbit, Vincent Chan, Alberto Goffi, Michael Morris, John C. Hunninghake, Nicholas E. Villalobos, Edward Kim, Christopher J. MacLellan

Point-of-Care Ultrasound (POCUS) is the practice of clinicians conducting and interpreting ultrasound scans right at the patient's bedside. However, the expertise needed to interpret these images is considerable and may not always be present in emergency situations. This reality makes algorithms such as machine learning classifiers extremely valuable to augment human decisions. POCUS devices are becoming available at a reasonable cost in the size of a mobile phone. The challenge of turning POCUS devices into life-saving tools is that interpretation of ultrasound images requires specialist training and experience. Unfortunately, the difficulty to obtain positive training images represents an important obstacle to building efficient and accurate classifiers. Hence, the problem we try to investigate is how to explore strategies to increase accuracy of classifiers trained with scarce data. We hypothesize that training with a few data instances may not suffice for classifiers to generalize causing them to overfit. Our approach uses an Explainable AI-Augmented approach to help the algorithm learn more from less and potentially help the classifier better generalize.

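Summary: Point-of-care ultrasound (POCUS) is the practice of clinicians performing and interpreting ultrasound scans at the patient's bedside. The expertise needed to interpret these images is considerable, however, and may not always be available in emergencies. This makes algorithms such as machine learning classifiers extremely valuable for augmenting human decisions. POCUS devices are becoming available at reasonable cost and at the size of a mobile phone. The challenge in turning them into life-saving tools is that interpreting ultrasound images requires specialist training and experience. Unfortunately, the difficulty of obtaining positive training images is a major obstacle to building efficient and accurate classifiers. The problem we investigate is therefore how to find strategies that raise the accuracy of classifiers trained on scarce data. We hypothesize that training on only a few data instances may not be enough for classifiers to generalize, causing them to overfit. Our approach uses explainable-AI augmentation to help the algorithm learn more from less data and potentially help the classifier generalize better.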

Can GPT-4 Help Detect Quit Vaping Intentions? An Exploration of Automatic Data Annotation Approach

2407.00167v1 by Sai Krishna Revanth Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, Dian Hu, Ming Huang

In recent years, the United States has witnessed a significant surge in the popularity of vaping or e-cigarette use, leading to a notable rise in cases of e-cigarette and vaping use-associated lung injury (EVALI) that caused hospitalizations and fatalities during the EVALI outbreak in 2019, highlighting the urgency to comprehend vaping behaviors and develop effective strategies for cessation. Due to the ubiquity of social media platforms, over 4.7 billion users worldwide use them for connectivity, communications, news, and entertainment with a significant portion of the discourse related to health, thereby establishing social media data as an invaluable organic data resource for public health research. In this study, we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' quit-vaping intentions. Leveraging OpenAI's latest large language model GPT-4 for sentence-level quit vaping intention detection, this study compares the outcomes of this model against layman and clinical expert annotations. Using different prompting strategies such as zero-shot, one-shot, few-shot and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and also evaluated the performance of the strategies against each other. These preliminary findings emphasize the potential of GPT-4 in social media data analysis, especially in identifying users' subtle intentions that may elude human detection.

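Summary: In recent years the United States has seen a sharp rise in vaping, or e-cigarette use, leading to a notable increase in cases of e-cigarette or vaping use-associated lung injury (EVALI), which caused hospitalizations and deaths during the 2019 EVALI outbreak and highlighted the urgency of understanding vaping behavior and developing effective cessation strategies. Because social media platforms are ubiquitous, more than 4.7 billion users worldwide use them for connection, communication, news, and entertainment, and a large share of that discourse concerns health, which makes social media data an invaluable organic resource for public health research. In this study we extracted a sample dataset from one vaping sub-community on Reddit to analyze users' intentions to quit vaping. Using OpenAI's latest large language model, GPT-4, for sentence-level detection of quit-vaping intention, the study compares the model's output with annotations by laypeople and clinical experts. Using different prompting strategies, including zero-shot, one-shot, few-shot, and chain-of-thought prompting, we developed 8 prompts with varying levels of detail to explain the task to GPT-4 and evaluated the strategies against one another. These preliminary findings underline GPT-4's potential for social media data analysis, especially for identifying subtle user intentions that human reviewers may miss.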
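The prompting strategies named in the abstract differ mainly in how many labelled examples, and how much reasoning guidance, the prompt contains. The sketch below only illustrates that difference; the task wording, the example sentences, and the `build_prompt` helper are invented here and are not the study's 8 prompts or its Reddit data. No model API is called.

```python
# Hypothetical task instruction, written for this illustration only
TASK = "Does the sentence express an intention to quit vaping? Answer yes or no."


def build_prompt(sentence, examples=(), chain_of_thought=False):
    """Assemble a zero-, one-, or few-shot prompt from labelled examples."""
    parts = [TASK]
    if chain_of_thought:
        parts.append("Think step by step, then give the final answer.")
    for text, label in examples:
        parts.append(f"Sentence: {text}\nAnswer: {label}")
    parts.append(f"Sentence: {sentence}\nAnswer:")
    return "\n\n".join(parts)


# Invented example sentences, not drawn from the study's dataset
EXAMPLES = [
    ("Day 3 without my vape and staying strong.", "yes"),
    ("Just bought a new mango pod.", "no"),
]
QUERY = "I want to stop for good this time."

zero_shot = build_prompt(QUERY)
one_shot = build_prompt(QUERY, EXAMPLES[:1])
few_shot = build_prompt(QUERY, EXAMPLES)
cot_prompt = build_prompt(QUERY, chain_of_thought=True)

print(few_shot)
```

Keeping the task instruction fixed and varying only the examples makes it easier to attribute performance differences to the strategy rather than to the wording.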

Towards Compositional Interpretability for XAI

2406.17583v1 by Sean Tull, Robin Lorenz, Stephen Clark, Ilyas Khan, Bob Coecke

Artificial intelligence (AI) is currently based largely on black-box machine learning models which lack interpretability. The field of eXplainable AI (XAI) strives to address this major concern, being critical in high-stakes areas such as the finance, legal and health sectors. We present an approach to defining AI models and their interpretability based on category theory. For this we employ the notion of a compositional model, which sees a model in terms of formal string diagrams which capture its abstract structure together with its concrete implementation. This comprehensive view incorporates deterministic, probabilistic and quantum models. We compare a wide range of AI models as compositional models, including linear and rule-based models, (recurrent) neural networks, transformers, VAEs, and causal and DisCoCirc models. Next we give a definition of interpretation of a model in terms of its compositional structure, demonstrating how to analyse the interpretability of a model, and using this to clarify common themes in XAI. We find that what makes the standard 'intrinsically interpretable' models so transparent is brought out most clearly diagrammatically. This leads us to the more general notion of compositionally-interpretable (CI) models, which additionally include, for instance, causal, conceptual space, and DisCoCirc models. We next demonstrate the explainability benefits of CI models. Firstly, their compositional structure may allow the computation of other quantities of interest, and may facilitate inference from the model to the modelled phenomenon by matching its structure. Secondly, they allow for diagrammatic explanations for their behaviour, based on influence constraints, diagram surgery and rewrite explanations. Finally, we discuss many future directions for the approach, raising the question of how to learn such meaningfully structured models in practice.

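Summary: Artificial intelligence (AI) currently relies largely on black-box machine learning models that lack interpretability. The field of explainable AI (XAI) works to address this major concern, which is critical in high-stakes areas such as finance, law, and health. We present an approach to defining AI models and their interpretability based on category theory. To this end we use the notion of a compositional model, which views a model through formal string diagrams that capture its abstract structure together with its concrete implementation. This comprehensive view covers deterministic, probabilistic, and quantum models. We compare a wide range of AI models as compositional models, including linear and rule-based models, (recurrent) neural networks, transformers, VAEs, and causal and DisCoCirc models. We then define the interpretation of a model in terms of its compositional structure, show how to analyze a model's interpretability, and use this to clarify common themes in XAI. We find that what makes the standard 'intrinsically interpretable' models so transparent shows up most clearly in diagrams. This leads to the more general notion of compositionally interpretable (CI) models, which additionally include causal, conceptual space, and DisCoCirc models. We next demonstrate the explainability benefits of CI models. First, their compositional structure may allow other quantities of interest to be computed and may support inference from the model to the modelled phenomenon by matching its structure. Second, they allow diagrammatic explanations of their behaviour, based on influence constraints, diagram surgery, and rewrite explanations. Finally, we discuss many future directions for the approach and raise the question of how to learn such meaningfully structured models in practice.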

Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods

2406.12142v2 by Vincent Olesen, Nina Weng, Aasa Feragen, Eike Petersen

Machine learning models have achieved high overall accuracy in medical image analysis. However, performance disparities on specific patient groups pose challenges to their clinical utility, safety, and fairness. This can affect known patient groups - such as those based on sex, age, or disease subtype - as well as previously unknown and unlabeled groups. Furthermore, the root cause of such observed performance disparities is often challenging to uncover, hindering mitigation efforts. In this paper, to address these issues, we leverage Slice Discovery Methods (SDMs) to identify interpretable underperforming subsets of data and formulate hypotheses regarding the cause of observed performance disparities. We introduce a novel SDM and apply it in a case study on the classification of pneumothorax and atelectasis from chest x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis formulation and yields an explanation of previously observed but unexplained performance disparities between male and female patients in widely used chest X-ray datasets and models. Our findings indicate shortcut learning in both classification tasks, through the presence of chest drains and ECG wires, respectively. Sex-based differences in the prevalence of these shortcut features appear to cause the observed classification performance gap, representing a previously underappreciated interaction between shortcut learning and model fairness analyses.

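Summary: Machine learning models have reached high overall accuracy in medical image analysis. Performance disparities on specific patient groups, however, challenge their clinical utility, safety, and fairness. These can affect known patient groups (for example, those defined by sex, age, or disease subtype) as well as previously unknown and unlabeled groups. The root cause of such observed disparities is also often hard to uncover, which hinders mitigation. To address these issues, we use Slice Discovery Methods (SDMs) to identify interpretable underperforming subsets of data and to form hypotheses about the cause of the observed disparities. We introduce a new SDM and apply it in a case study on classifying pneumothorax and atelectasis from chest X-rays. Our study demonstrates the effectiveness of SDMs for hypothesis formation and explains previously observed but unexplained performance disparities between male and female patients in widely used chest X-ray datasets and models. Our findings indicate shortcut learning in both classification tasks, through the presence of chest drains and ECG wires, respectively. Sex-based differences in the prevalence of these shortcut features appear to cause the observed classification performance gap, which represents a previously underappreciated interaction between shortcut learning and model fairness analyses.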
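The kind of disparity described above can be made concrete by comparing accuracy across slices of the data. This is a minimal sketch of that comparison only, not the slice discovery method the paper introduces: the records are invented toy data, constructed so that the model relies on the presence of a chest drain, and the field names are assumptions.

```python
from collections import defaultdict


def accuracy_by_slice(records, key):
    """Accuracy per value of `key`; each record carries 'label' and 'pred'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["label"] == record["pred"])
    return {value: hits[value] / totals[value] for value in totals}


# Invented toy predictions: positives are only caught when a drain is visible
records = [
    {"sex": "M", "chest_drain": True, "label": 1, "pred": 1},
    {"sex": "M", "chest_drain": True, "label": 1, "pred": 1},
    {"sex": "M", "chest_drain": False, "label": 1, "pred": 0},
    {"sex": "M", "chest_drain": False, "label": 0, "pred": 0},
    {"sex": "F", "chest_drain": True, "label": 1, "pred": 1},
    {"sex": "F", "chest_drain": False, "label": 1, "pred": 0},
    {"sex": "F", "chest_drain": False, "label": 1, "pred": 0},
    {"sex": "F", "chest_drain": False, "label": 0, "pred": 0},
]

by_sex = accuracy_by_slice(records, "sex")
by_drain = accuracy_by_slice(records, "chest_drain")
print(by_sex, by_drain)
```

In this toy data the gap between sexes disappears once the slices are defined by the shortcut feature, which mirrors the hypothesis the abstract reports: a difference in how often the shortcut is present can surface as a difference between patient groups.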

Unlocking the Potential of Metaverse in Innovative and Immersive Digital Health

2406.07114v2 by Fatemeh Ebrahimzadeh, Ramin Safa

The concept of the Metaverse has attracted a lot of attention in various fields, and one of its important applications is health and treatment. The Metaverse has enormous potential to transform healthcare by changing patient care, medical education, and the way teaching, learning, and research are done. The purpose of this research is to provide an introduction to the basic concepts and fundamental technologies of the Metaverse. This paper examines the pros and cons of the Metaverse in a healthcare context and analyzes its potential from the technology and AI perspective. In particular, the role of machine learning methods is discussed: we explain how machine learning algorithms can be applied to Metaverse-generated data to gain better insights in healthcare applications. Additionally, we examine the future visions of the Metaverse in health delivery by examining emerging technologies such as blockchain and by addressing privacy concerns. The findings of this study contribute to a deeper understanding of the applications of the Metaverse in healthcare and its potential to revolutionize the delivery of medical services.

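Summary: The concept of the Metaverse has attracted wide attention across fields, and healthcare is one of its important applications. The Metaverse has enormous potential to transform healthcare by changing patient care, medical education, and the way teaching, learning, and research are done. The purpose of this research is to introduce the basic concepts and underlying technologies of the Metaverse. The paper examines the pros and cons of the Metaverse in a healthcare context and analyzes its potential from the technology and AI perspective. In particular, it discusses the role of machine learning methods; we explain how machine learning algorithms can be applied to Metaverse-generated data to gain better insight in healthcare applications. We also explore future visions of the Metaverse in health delivery by examining emerging technologies such as blockchain and addressing privacy concerns. The findings contribute to a deeper understanding of the Metaverse's applications in healthcare and its potential to revolutionize the delivery of medical services.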

AI-Driven Predictive Analytics Approach for Early Prognosis of Chronic Kidney Disease Using Ensemble Learning and Explainable AI

2406.06728v1 by K M Tawsik Jawad, Anusha Verma, Fathi Amsaad

Chronic kidney disease (CKD) is a widespread chronic disease with no known definitive cure and high morbidity. Research demonstrates that progressive CKD is a heterogeneous disorder that significantly impacts kidney structure and function, eventually leading to kidney failure. Over time, chronic kidney disease has moved from a life-threatening disease affecting few people to a common disorder of varying severity. The goal of this research is to visualize dominating features, feature scores, and values exhibited for early prognosis and detection of CKD using ensemble learning and explainable AI. For that, an AI-driven predictive analytics approach is proposed to aid clinical practitioners in prescribing lifestyle modifications for individual patients to reduce the rate of progression of this disease. Our dataset is collected on body vitals from individuals with CKD and healthy subjects to develop our proposed AI-driven solution accurately. In this regard, blood and urine test results are provided, and ensemble tree-based machine-learning models are applied to predict unseen cases of CKD. Our research findings are validated after lengthy consultations with nephrologists. Our experiments and interpretation results are compared with existing explainable AI applications in various healthcare domains, including CKD. The comparison shows that our developed AI models, particularly the Random Forest model, have identified more features as significant contributors than XGBoost. Interpretability (I), which measures the ratio of important to masked features, indicates that our XGBoost model achieved a higher score, specifically a Fidelity of 98%, in this metric and naturally in the FII index compared to competing models.

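Summary: Chronic kidney disease (CKD) is a widespread chronic disease with no known definitive cure and high morbidity. Research shows that progressive CKD is a heterogeneous disorder that significantly affects kidney structure and function and eventually leads to kidney failure. Over time, chronic kidney disease has shifted from a life-threatening disease affecting few people to a common disorder of varying severity. The goal of this research is to visualize the dominating features, feature scores, and exhibited values for early prognosis and detection of CKD using ensemble learning and explainable AI. To that end, an AI-driven predictive analytics approach is proposed to help clinicians prescribe lifestyle modifications for individual patients and slow the progression of the disease. Our dataset was collected from the body vitals of individuals with CKD and healthy subjects in order to develop the proposed solution accurately. Blood and urine test results are provided, and ensemble tree-based machine learning models are applied to predict unseen cases of CKD. Our findings were validated after lengthy consultations with nephrologists. Our experiments and interpretation results are compared with existing explainable AI applications in various healthcare domains, including CKD. The comparison shows that our AI models, particularly the Random Forest model, identified more features as significant contributors than XGBoost. Interpretability (I), which measures the ratio of important to masked features, indicates that our XGBoost model achieved a higher score on this metric, specifically a Fidelity of 98%, and naturally a higher FII index than competing models.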

Explainable AI for Mental Disorder Detection via Social Media: A survey and outlook

2406.05984v1 by Yusif Ibrahimov, Tarique Anwar, Tommy Yuan

Mental health constitutes a complex and pervasive global challenge, affecting millions of lives and often leading to severe consequences. In this paper, we conduct a thorough survey to explore the intersection of data science, artificial intelligence, and mental healthcare, focusing on the recent developments of mental disorder detection through online social media (OSM). A significant portion of the population actively engages in OSM platforms, creating a vast repository of personal data that holds immense potential for mental health analytics. The paper navigates through traditional diagnostic methods, state-of-the-art data- and AI-driven research studies, and the emergence of explainable AI (XAI) models for mental healthcare. We review state-of-the-art machine learning methods, particularly those based on modern deep learning, while emphasising the need for explainability in healthcare AI models. The experimental design section provides insights into prevalent practices, including available datasets and evaluation approaches. We also identify key issues and challenges in the field and propose promising future research directions. As mental health decisions demand transparency, interpretability, and ethical considerations, this paper contributes to the ongoing discourse on advancing XAI in mental healthcare through social media. The comprehensive overview presented here aims to guide researchers, practitioners, and policymakers in developing the area of mental disorder detection.

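Summary: Mental health is a complex and pervasive global challenge that affects millions of lives and often leads to severe consequences. In this paper we conduct a thorough survey of the intersection of data science, artificial intelligence, and mental healthcare, focusing on recent developments in mental disorder detection through online social media (OSM). A large share of the population actively uses OSM platforms, creating a vast repository of personal data with immense potential for mental health analytics. The paper covers traditional diagnostic methods, state-of-the-art data- and AI-driven research, and the emergence of explainable AI (XAI) models for mental healthcare. We review state-of-the-art machine learning methods, particularly those based on modern deep learning, while emphasizing the need for explainability in healthcare AI models. The experimental design section offers insight into prevalent practices, including available datasets and evaluation approaches. We also identify key issues and challenges in the field and propose promising directions for future research. Because mental health decisions demand transparency, interpretability, and ethical consideration, this paper contributes to the ongoing discussion on advancing XAI in mental healthcare through social media. The comprehensive overview presented here aims to guide researchers, practitioners, and policymakers in developing the area of mental disorder detection.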

Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance

2406.05746v1 by Zhan Zhang, Qin Zhang, Yang Jiao, Lin Lu, Lin Ma, Aihua Liu, Xiao Liu, Juan Zhao, Yajun Xue, Bing Wei, Mingxia Zhang, Ru Gao, Hong Zhao, Jie Lu, Fan Li, Yang Zhang, Yiming Wang, Lei Zhang, Fengwei Tian, Jie Hu, Xin Gou

AI-aided clinical diagnosis is desired in medical care. Existing deep learning models lack explainability and mainly focus on image analysis. The recently developed Dynamic Uncertain Causality Graph (DUCG) approach is causality-driven, explainable, and invariant across different application scenarios, without problems of data collection, labeling, fitting, privacy, bias, generalization, high cost and high energy consumption. Through close collaboration between clinical experts and DUCG technicians, 46 DUCG models covering 54 chief complaints were constructed. Over 1,000 diseases can be diagnosed without triage. Before being applied in real-world, the 46 DUCG models were retrospectively verified by third-party hospitals. The verified diagnostic precisions were no less than 95%, in which the diagnostic precision for every disease including uncommon ones was no less than 80%. After verifications, the 46 DUCG models were applied in the real-world in China. Over one million real diagnosis cases have been performed, with only 17 incorrect diagnoses identified. Due to DUCG's transparency, the mistakes causing the incorrect diagnoses were found and corrected. The diagnostic abilities of the clinicians who applied DUCG frequently were improved significantly. Following the introduction to the earlier presented DUCG methodology, the recommendation algorithm for potential medical checks is presented and the key idea of DUCG is extracted.

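Summary: AI-aided clinical diagnosis is needed in medical care. Existing deep learning models lack explainability and focus mainly on image analysis. The recently developed Dynamic Uncertain Causality Graph (DUCG) approach is causality-driven, explainable, and invariant across application scenarios, without the problems of data collection, labeling, fitting, privacy, bias, generalization, high cost, and high energy consumption. Through close collaboration between clinical experts and DUCG technicians, 46 DUCG models covering 54 chief complaints were built. More than 1,000 diseases can be diagnosed without triage. Before real-world use, the 46 DUCG models were retrospectively verified by third-party hospitals. The verified diagnostic precision was no less than 95%, and the diagnostic precision for every disease, including uncommon ones, was no less than 80%. After verification, the 46 DUCG models were put into real-world use in China. More than one million real diagnosis cases have been performed, with only 17 incorrect diagnoses identified. Thanks to DUCG's transparency, the mistakes that caused the incorrect diagnoses were found and corrected. The diagnostic abilities of clinicians who used DUCG frequently improved significantly. Following an introduction to the previously presented DUCG methodology, the recommendation algorithm for potential medical checks is presented and the key idea of DUCG is extracted.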

Advancing Histopathology-Based Breast Cancer Diagnosis: Insights into Multi-Modality and Explainability

2406.12897v1 by Faseela Abdullakutty, Younes Akbari, Somaya Al-Maadeed, Ahmed Bouridane, Rifat Hamoudi

It is imperative that breast cancer is detected precisely and timely to improve patient outcomes. Diagnostic methodologies have traditionally relied on unimodal approaches; however, medical data analytics is integrating diverse data sources beyond conventional imaging. Using multi-modal techniques, integrating both image and non-image data, marks a transformative advancement in breast cancer diagnosis. The purpose of this review is to explore the burgeoning field of multimodal techniques, particularly the fusion of histopathology images with non-image data. Further, Explainable AI (XAI) will be used to elucidate the decision-making processes of complex algorithms, emphasizing the necessity of explainability in diagnostic processes. This review utilizes multi-modal data and emphasizes explainability to enhance diagnostic accuracy, clinician confidence, and patient engagement, ultimately fostering more personalized treatment strategies for breast cancer, while also identifying research gaps in multi-modality and explainability, guiding future studies, and contributing to the strategic direction of the field.

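Summary: Detecting breast cancer precisely and in time is essential to improving patient outcomes. Diagnostic methods have traditionally relied on unimodal approaches, but medical data analytics is now integrating diverse data sources beyond conventional imaging. Multimodal techniques that combine image and non-image data mark a transformative advance in breast cancer diagnosis. The purpose of this review is to explore the emerging field of multimodal techniques, particularly the fusion of histopathology images with non-image data. In addition, explainable AI (XAI) is used to elucidate the decision-making processes of complex algorithms, emphasizing the necessity of explainability in diagnostic processes. The review draws on multimodal data and emphasizes explainability to improve diagnostic accuracy, clinician confidence, and patient engagement, ultimately fostering more personalized treatment strategies for breast cancer, while also identifying research gaps in multimodality and explainability, guiding future studies, and contributing to the strategic direction of the field.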

Revisiting Attention Weights as Interpretations of Message-Passing Neural Networks

2406.04612v1 by Yong-Min Shin, Siqing Li, Xin Cao, Won-Yong Shin

The self-attention mechanism has been adopted in several widely-used message-passing neural networks (MPNNs) (e.g., GATs), which adaptively controls the amount of information that flows along the edges of the underlying graph. This usage of attention has made such models a baseline for studies on explainable AI (XAI) since interpretations via attention have been popularized in various domains (e.g., natural language processing and computer vision). However, existing studies often use naive calculations to derive attribution scores from attention, and do not take the precise and careful calculation of edge attribution into consideration. In our study, we aim to fill the gap between the widespread usage of attention-enabled MPNNs and their potential in largely under-explored explainability, a topic that has been actively investigated in other areas. To this end, as the first attempt, we formalize the problem of edge attribution from attention weights in GNNs. Then, we propose GATT, an edge attribution calculation method built upon the computation tree. Through comprehensive experiments, we demonstrate the effectiveness of our proposed method when evaluating attributions from GATs. Conversely, we empirically validate that simply averaging attention weights over graph attention layers is insufficient to interpret the GAT model's behavior. Code is publicly available at https://github.com/jordan7186/GAtt/tree/main.

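Summary: The self-attention mechanism has been adopted in several widely used message-passing neural networks (MPNNs), such as GATs, where it adaptively controls how much information flows along the edges of the underlying graph. This use of attention has made such models a baseline for explainable AI (XAI) studies, since interpretation via attention has become popular in various domains (for example, natural language processing and computer vision). Existing studies, however, often use naive calculations to derive attribution scores from attention and do not consider precise and careful calculation of edge attribution. In our study we aim to close the gap between the widespread use of attention-enabled MPNNs and their largely under-explored explainability, a topic that has been actively investigated in other areas. To this end, as a first attempt, we formalize the problem of edge attribution from attention weights in GNNs. We then propose GATT, an edge attribution calculation method built on the computation tree. Through comprehensive experiments we demonstrate the effectiveness of the proposed method when evaluating attributions from GATs. Conversely, we empirically validate that simply averaging attention weights over graph attention layers is insufficient to interpret the GAT model's behavior. Code is publicly available at https://github.com/jordan7186/GAtt/tree/main.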
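For reference, the naive baseline that the abstract argues against, averaging each edge's attention weight over the layers, can be written in a few lines. This sketch shows only that baseline with made-up attention values; it is not the GATT method, whose computation-tree calculation is described in the paper and its repository.

```python
def average_attention(layers):
    """Naive edge attribution: mean attention weight of each edge across layers."""
    edges = layers[0].keys()
    return {edge: sum(layer[edge] for layer in layers) / len(layers) for edge in edges}


# Made-up attention weights for two incoming edges of node "c" in a 2-layer GAT
layer1 = {("a", "c"): 0.9, ("b", "c"): 0.1}
layer2 = {("a", "c"): 0.3, ("b", "c"): 0.7}

scores = average_attention([layer1, layer2])
print(scores)
```

The average treats every layer as equally relevant and ignores how an edge's message is propagated and reweighted by later layers, which is one way to read the abstract's claim that averaging is insufficient.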

Using Explainable AI for EEG-based Reduced Montage Neonatal Seizure Detection

2406.16908v3 by Dinuka Sandun Udayantha, Kavindu Weerasinghe, Nima Wickramasinghe, Akila Abeyratne, Kithmin Wickremasinghe, Jithangi Wanigasinghe, Anjula De Silva, Chamira U. S. Edussooriya

The neonatal period is the most vulnerable time for the development of seizures. Seizures in the immature brain lead to detrimental consequences and therefore require early diagnosis. The gold standard for neonatal seizure detection is currently continuous video-EEG monitoring, which involves recording a multi-channel electroencephalogram (EEG) alongside real-time video monitoring within a neonatal intensive care unit (NICU). However, video-EEG monitoring technology requires clinical expertise and is often limited to technologically advanced and well-resourced settings. Cost-effective new techniques could help the medical fraternity make an accurate diagnosis and advocate treatment without delay. In this work, a novel explainable deep learning model to automate the neonatal seizure detection process with a reduced EEG montage is proposed, which employs convolutional nets, graph attention layers, and fully connected layers. Beyond its ability to detect seizures in real time with a reduced montage, this model offers the unique advantage of real-time interpretability. By evaluating the performance on the Zenodo dataset with 10-fold cross-validation, the presented model achieves an absolute improvement of 8.31% and 42.86% in area under curve (AUC) and recall, respectively.

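Summary: The neonatal period is the time of greatest vulnerability to seizures. Seizures in the immature brain have harmful consequences and therefore need early diagnosis. The current gold standard for neonatal seizure detection is continuous video-EEG monitoring, which records a multi-channel electroencephalogram (EEG) together with real-time video monitoring in a neonatal intensive care unit (NICU). Video-EEG monitoring requires clinical expertise, however, and is often limited to technologically advanced, well-resourced settings. Cost-effective new techniques could help the medical community diagnose accurately and recommend treatment without delay. This work proposes a novel explainable deep learning model that automates neonatal seizure detection with a reduced EEG montage, using convolutional networks, graph attention layers, and fully connected layers. Besides detecting seizures in real time with a reduced montage, the model offers the distinctive advantage of real-time interpretability. Evaluated on the Zenodo dataset with 10-fold cross-validation, the model achieves absolute improvements of 8.31% in area under the curve (AUC) and 42.86% in recall.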

Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques

2406.00532v1 by Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik

Breast cancer (BC) stands as one of the most common malignancies affecting women worldwide, necessitating advancements in diagnostic methodologies for better clinical outcomes. This article provides a comprehensive exploration of the application of Explainable Artificial Intelligence (XAI) techniques in the detection and diagnosis of breast cancer. As Artificial Intelligence (AI) technologies continue to permeate the healthcare sector, particularly in oncology, the need for transparent and interpretable models becomes imperative to enhance clinical decision-making and patient care. This review discusses the integration of various XAI approaches, such as SHAP, LIME, Grad-CAM, and others, with machine learning and deep learning models utilized in breast cancer detection and classification. By investigating the modalities of breast cancer datasets, including mammograms, ultrasounds and their processing with AI, the paper highlights how XAI can lead to more accurate diagnoses and personalized treatment plans. It also examines the challenges in implementing these techniques and the importance of developing standardized metrics for evaluating XAI's effectiveness in clinical settings. Through detailed analysis and discussion, this article aims to highlight the potential of XAI in bridging the gap between complex AI models and practical healthcare applications, thereby fostering trust and understanding among medical professionals and improving patient outcomes.

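Summary: Breast cancer (BC) is one of the most common malignancies affecting women worldwide, which calls for advances in diagnostic methods to improve clinical outcomes. This article comprehensively explores the application of explainable artificial intelligence (XAI) techniques to breast cancer detection and diagnosis. As AI technologies continue to spread through healthcare, particularly oncology, transparent and interpretable models become imperative for better clinical decision-making and patient care. The review discusses the integration of various XAI approaches, such as SHAP, LIME, and Grad-CAM, with the machine learning and deep learning models used for breast cancer detection and classification. By examining the modalities of breast cancer datasets, including mammograms and ultrasound, and their processing with AI, the paper highlights how XAI can lead to more accurate diagnoses and personalized treatment plans. It also examines the challenges of implementing these techniques and the importance of developing standardized metrics for evaluating XAI's effectiveness in clinical settings. Through detailed analysis and discussion, the article aims to highlight XAI's potential to bridge the gap between complex AI models and practical healthcare applications, fostering trust and understanding among medical professionals and improving patient outcomes.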

Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition

2406.01624v2 by Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara

Speech emotion recognition (SER) has gained significant attention due to its several application fields, such as mental health, education, and human-computer interaction. However, the accuracy of SER systems is hindered by high-dimensional feature sets that may contain irrelevant and redundant information. To overcome this challenge, this study proposes an iterative feature boosting approach for SER that emphasizes feature relevance and explainability to enhance machine learning model performance. Our approach involves meticulous feature selection and analysis to build efficient SER systems. In addressing our main problem through model explainability, we employ a feature evaluation loop with Shapley values to iteratively refine feature sets. This process strikes a balance between model performance and transparency, which enables a comprehensive understanding of the model's predictions. The proposed approach offers several advantages, including the identification and removal of irrelevant and redundant features, leading to a more effective model. Additionally, it promotes explainability, facilitating comprehension of the model's predictions and the identification of crucial features for emotion determination. The effectiveness of the proposed method is validated on the SER benchmarks of the Toronto emotional speech set (TESS), Berlin Database of Emotional Speech (EMO-DB), Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and Surrey Audio-Visual Expressed Emotion (SAVEE) datasets, outperforming state-of-the-art methods. To the best of our knowledge, this is the first work to incorporate model explainability into an SER framework. The source code of this paper is publicly available at https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.

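Summary: Speech emotion recognition (SER) has drawn considerable attention because of its many application areas, such as mental health, education, and human-computer interaction. The accuracy of SER systems, however, is held back by high-dimensional feature sets that may contain irrelevant and redundant information. To overcome this, the study proposes an iterative feature boosting approach for SER that emphasizes feature relevance and explainability to improve machine learning model performance. Our approach involves careful feature selection and analysis to build efficient SER systems. To address the core problem through model explainability, we use a feature evaluation loop with Shapley values to refine feature sets iteratively. This process balances model performance and transparency, which enables a comprehensive understanding of the model's predictions. The proposed approach offers several advantages, including identifying and removing irrelevant and redundant features, which leads to a more effective model. It also promotes explainability, making it easier to understand the model's predictions and to identify the features that are crucial for determining emotion. The method's effectiveness is validated on the SER benchmarks of the Toronto emotional speech set (TESS), the Berlin Database of Emotional Speech (EMO-DB), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Surrey Audio-Visual Expressed Emotion (SAVEE) datasets, where it outperforms existing methods. To the best of our knowledge, this is the first work to incorporate model explainability into an SER framework. The source code is publicly available at https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.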
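A feature evaluation loop driven by Shapley values can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the authors' implementation (which is in the linked repository): the additive `toy_score` stands in for a trained model's validation score, the feature names are invented, and the pruning threshold is an arbitrary choice. Shapley values are computed exactly by enumerating subsets, which is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley value of each feature for a set function `value`."""
    n = len(features)
    phi = {}
    for feature in features:
        others = [g for g in features if g != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {feature}) - value(s))
        phi[feature] = total
    return phi


def prune(features, value, threshold=0.01):
    """Repeatedly drop features whose Shapley value is below `threshold`."""
    features = list(features)
    while True:
        phi = shapley_values(features, value)
        kept = [f for f in features if phi[f] >= threshold]
        if len(kept) == len(features):
            return kept, phi
        features = kept


# Hypothetical stand-in for a model score: "noise" contributes nothing
GAIN = {"pitch": 0.3, "energy": 0.2, "noise": 0.0}


def toy_score(subset):
    return sum(GAIN[f] for f in subset)


kept, phi = prune(["pitch", "energy", "noise"], toy_score)
print(kept, phi)
```

For realistic feature counts the exact enumeration would be replaced by a sampling-based or model-specific approximation, since the number of subsets grows exponentially.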

The Explanation Necessity for Healthcare AI

2406.00216v1 by Michail Mamalakis, Héloïse de Vareilles, Graham Murray, Pietro Lio, John Suckling

Explainability is often critical to the acceptable implementation of artificial intelligence (AI). Nowhere is this more important than healthcare where decision-making directly impacts patients and trust in AI systems is essential. This trust is often built on the explanations and interpretations the AI provides. Despite significant advancements in AI interpretability, there remains the need for clear guidelines on when and to what extent explanations are necessary in the medical context. We propose a novel categorization system with four distinct classes of explanation necessity, guiding the level of explanation required: patient or sample (local) level, cohort or dataset (global) level, or both levels. We introduce a mathematical formulation that distinguishes these categories and offers a practical framework for researchers to determine the necessity and depth of explanations required in medical AI applications. Three key factors are considered: the robustness of the evaluation protocol, the variability of expert observations, and the representation dimensionality of the application. In this perspective, we address the question: When does an AI medical application need to be explained, and at what level of detail?

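Summary: Explainability is often critical to the acceptable implementation of artificial intelligence (AI). This matters most in healthcare, where decisions directly affect patients and trust in AI systems is essential. That trust is usually built on the explanations and interpretations the AI provides. Despite major advances in AI interpretability, clear guidelines are still needed on when, and to what extent, explanations are necessary in a medical context. We propose a novel categorization system with four distinct classes of explanation necessity that guide the level of explanation required: patient or sample (local) level, cohort or dataset (global) level, or both. We introduce a mathematical formulation that distinguishes these categories and gives researchers a practical framework for determining the necessity and depth of explanations required in medical AI applications. Three key factors are considered: the robustness of the evaluation protocol, the variability of expert observations, and the representation dimensionality of the application. From this perspective we address the question: when does a medical AI application need to be explained, and at what level of detail?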

Interdisciplinary Expertise to Advance Equitable Explainable AI

2406.18563v1 by Chloe R. Bennett, Heather Cole-Lewis, Stephanie Farquhar, Naama Haamel, Boris Babenko, Oran Lang, Mat Fleck, Ilana Traynis, Charles Lau, Ivor Horn, Courtney Lyles

The field of artificial intelligence (AI) is rapidly influencing health and healthcare, but bias and poor performance persist for populations who face widespread structural oppression. Previous work has clearly outlined the need for more rigorous attention to data representativeness and model performance to advance equity and reduce bias. However, there is an opportunity to also improve the explainability of AI by leveraging best practices of social epidemiology and health equity to help us develop hypotheses for associations found. In this paper, we focus on explainable AI (XAI) and describe a framework for interdisciplinary expert panel review to discuss and critically assess AI model explanations from multiple perspectives and identify areas of bias and directions for future research. We emphasize the importance of the interdisciplinary expert panel to produce more accurate, equitable interpretations which are historically and contextually informed. Interdisciplinary panel discussions can help reduce bias, identify potential confounders, and identify opportunities for additional research where there are gaps in the literature. In turn, these insights can suggest opportunities for AI model improvement.

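Summary: The field of artificial intelligence (AI) is rapidly influencing health and healthcare, but bias and poor performance persist for populations facing widespread structural oppression. Earlier work has made clear that data representativeness and model performance need more rigorous attention in order to advance equity and reduce bias. There is also an opportunity to improve AI explainability by applying best practices from social epidemiology and health equity to help develop hypotheses for the associations found. In this paper we focus on explainable AI (XAI) and describe a framework for interdisciplinary expert panel review that discusses and critically assesses AI model explanations from multiple perspectives and identifies areas of bias and directions for future research. We stress the importance of the interdisciplinary expert panel for producing more accurate and equitable interpretations that are historically and contextually informed. Interdisciplinary panel discussions can help reduce bias, identify potential confounders, and find opportunities for further research where the literature has gaps. In turn, these insights can suggest opportunities for improving AI models.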

"It depends": Configuring AI to Improve Clinical Usefulness Across Contexts

2407.11978v1 by Hubert D. Zając, Jorge M. N. Ribeiro, Silvia Ingala, Simona Gentile, Ruth Wanjohi, Samuel N. Gitau, Jonathan F. Carlsen, Michael B. Nielsen, Tariq O. Andersen

Artificial intelligence (AI) systems repeatedly match or outperform radiologists in lab experiments. However, real-world implementations of radiological AI-based systems are found to provide little to no clinical value. This paper explores how to design AI for clinical usefulness in different contexts. We conducted 19 design sessions and design interventions with 13 radiologists from 7 clinical sites in Denmark and Kenya, based on three iterations of a functional AI-based prototype. Ten sociotechnical dependencies were identified as crucial for the design of AI in radiology. We conceptualised four technical dimensions that must be configured to the intended clinical context of use: AI functionality, AI medical focus, AI decision threshold, and AI explainability. We present four design recommendations on how to address dependencies pertaining to the medical knowledge, clinic type, user expertise level, patient context, and user situation that condition the configuration of these technical dimensions.

摘要:人工智慧(AI)在實驗室實驗中不斷地與放射科醫師匹敵或表現得更出色。然而,發現放射科 AI 為基礎系統的實際執行幾乎沒有提供臨床價值。本文探討如何為 AI 設計在不同情境中臨床上的效用。我們根據功能性 AI 為基礎原型的三次迭代,在丹麥和肯亞的 7 個臨床場域與 13 位放射科醫師進行了 19 次設計會議和設計介入。十個社會技術依賴關係被認為對於放射科中 AI 的設計至關重要。我們概念化了四個技術面向,必須根據預期的臨床使用情境進行設定:AI 功能、AI 醫療重點、AI 決策門檻,以及 AI 可解釋性。我們提出四項設計建議,說明如何處理與醫療知識、診所類型、使用者專業知識等級、患者情境,以及影響這些技術面向設定的使用者情境相關的依賴關係。

Improving Health Professionals' Onboarding with AI and XAI for Trustworthy Human-AI Collaborative Decision Making

2405.16424v1 by Min Hun Lee, Silvana Xin Yi Choo, Shamala D/O Thilarajah

With advanced AI/ML, there has been growing research on explainable AI (XAI) and studies on how humans interact with AI and XAI for effective human-AI collaborative decision-making. However, we still lack an understanding of how AI systems and XAI should be first presented to users without technical backgrounds. In this paper, we present the findings of semi-structured interviews with health professionals (n=12) and students (n=4) majoring in medicine and health to study how to improve onboarding with AI and XAI. For the interviews, we built upon human-AI interaction guidelines to create onboarding materials of an AI system for stroke rehabilitation assessment and AI explanations, and introduced them to the participants. Our findings reveal that beyond presenting traditional performance metrics on AI, participants desired benchmark information, the practical benefits of AI, and interaction trials to better contextualize AI performance, and refine the objectives and performance of AI. Based on these findings, we highlight directions for improving onboarding with AI and XAI and human-AI collaborative decision-making.

摘要:隨著先進的 AI/ML,對可解釋 AI (XAI) 的研究不斷增加,以及關於人類如何與 AI 和 XAI 互動以進行有效的人工智慧協作決策制定。然而,我們仍然缺乏對 AI 系統和 XAI 應如何首先呈現給沒有技術背景的用戶的了解。在本文中,我們展示了與醫療專業人員 (n=12) 和主修醫學和健康的學生 (n=4) 進行半結構化訪談的結果,以研究如何改善 AI 和 XAI 的入門。對於訪談,我們建立在人機互動準則之上,為中風康復評估和 AI 解釋的 AI 系統創建入門材料,並將它們介紹給參與者。我們的研究結果表明,除了呈現傳統的 AI 性能指標外,參與者還希望基准信息、AI 的實際好處以及交互試驗,以更好地將 AI 性能情境化,並完善 AI 的目標和性能。根據這些發現,我們強調了改進 AI 和 XAI 以及人機協作決策制定的入門方向。

Exploring Nutritional Impact on Alzheimer's Mortality: An Explainable AI Approach

2405.17502v1 by Ziming Liu, Longjian Liu, Robert E. Heidel, Xiaopeng Zhao

This article uses machine learning (ML) and explainable artificial intelligence (XAI) techniques to investigate the relationship between nutritional status and mortality rates associated with Alzheimer's disease (AD). The Third National Health and Nutrition Examination Survey (NHANES III) database is employed for analysis. The random forest model is selected as the base model for XAI analysis, and the Shapley Additive Explanations (SHAP) method is used to assess feature importance. The results highlight significant nutritional factors such as serum vitamin B12 and glycated hemoglobin. The study demonstrates the effectiveness of random forests in predicting AD mortality compared to other diseases. This research provides insights into the impact of nutrition on AD and contributes to a deeper understanding of disease progression.

摘要:本文使用機器學習 (ML) 和可解釋人工智慧 (XAI) 技術來探討營養狀況與阿茲海默症 (AD) 相關的死亡率之間的關係。採用第三次全國健康與營養檢查調查 (NHANES III) 資料庫進行分析。選擇隨機森林模型作為 XAI 分析的基礎模型,並使用 Shapley Additive Explanations (SHAP) 方法來評估特徵重要性。結果突顯了重要的營養因素,例如血清維生素 B12 和糖化血紅蛋白。該研究證明了隨機森林在預測 AD 死亡率方面相較於其他疾病的有效性。本研究提供了營養對 AD 的影響的見解,並有助於更深入地了解疾病的進展。
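The abstract above relies on SHAP to rank features. As a rough illustration of the underlying idea, the sketch below computes exact Shapley values by brute-force enumeration for a tiny hand-written scoring function. Everything in it is an assumption made for illustration: `toy_model`, its coefficients, and the three feature names are hypothetical and are not the paper's random forest or the NHANES III variables, and the SHAP library itself uses faster approximations rather than this enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input `x` relative to a `baseline` input.

    Enumerates every feature subset, so it is only practical for a handful
    of features.
    """
    n = len(x)
    features = range(n)

    def value(subset):
        # Features in `subset` keep their real value; the rest use the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features):
    # Hypothetical risk score with one interaction term (illustration only).
    b12, hba1c, age = features
    return 0.5 * hba1c - 0.2 * b12 + 0.1 * age + 0.05 * hba1c * age


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the model output at `x` and at `baseline` (the additivity property), and the interaction term is split evenly between the two features involved.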

Explainable AI Enhances Glaucoma Referrals, Yet the Human-AI Team Still Falls Short of the AI Alone

2407.11974v1 by Catalina Gomez, Ruolin Wang, Katharina Breininger, Corinne Casey, Chris Bradley, Mitchell Pavlak, Alex Pham, Jithin Yohannan, Mathias Unberath

Primary care providers are vital for initial triage and referrals to specialty care. In glaucoma, asymptomatic and fast progression can lead to vision loss, necessitating timely referrals to specialists. However, primary eye care providers may not identify urgent cases, potentially delaying care. Artificial Intelligence (AI) offering explanations could enhance their referral decisions. We investigate how various AI explanations help providers distinguish between patients needing immediate or non-urgent specialist referrals. We built explainable AI algorithms to predict glaucoma surgery needs from routine eyecare data as a proxy for identifying high-risk patients. We incorporated intrinsic and post-hoc explainability and conducted an online study with optometrists to assess human-AI team performance, measuring referral accuracy and analyzing interactions with AI, including agreement rates, task time, and user experience perceptions. AI support enhanced referral accuracy among 87 participants (59.9%/50.8% with/without AI), though Human-AI teams underperformed compared to AI alone. Participants believed they included AI advice more when using the intrinsic model, and perceived it more useful and promising. Without explanations, deviations from AI recommendations increased. AI support did not increase workload, confidence, or trust, but reduced challenges. On a separate test set, our black-box and intrinsic models achieved an accuracy of 77% and 71%, respectively, in predicting surgical outcomes. We identify opportunities of human-AI teaming for glaucoma management in primary eye care, noting that while AI enhances referral accuracy, the human-AI team still shows a performance gap compared to AI alone, even with explanations. Human involvement remains essential in medical decision making, underscoring the need for future research to optimize collaboration, ensuring positive experiences and safe AI use.

摘要:初級保健提供者對於最初的分流和轉診到專科照護至關重要。在青光眼的情況下,無症狀且快速惡化可能導致視力喪失,因此需要及時轉診給專家。然而,初級眼科保健提供者可能無法識別緊急情況,可能會延誤照護。提供解釋的人工智慧 (AI) 可以加強他們的轉診決策。我們研究各種 AI 解釋如何幫助提供者區分需要立即或非緊急專科轉診的患者。我們建立了解釋性 AI 演算法,以從例行眼科護理資料預測青光眼手術需求,作為識別高風險患者的代理。我們納入了內在和事後解釋性,並與驗光師進行了一項線上研究,以評估人機團隊的表現,衡量轉診準確度並分析與 AI 的互動,包括同意率、任務時間和使用者體驗感知。在 87 名參與者中,AI 支援提高了轉診準確度(使用 AI/未使用的比例為 59.9%/50.8%),儘管人機團隊的表現不如單獨使用 AI。參與者認為他們在使用內在模型時更多地納入了 AI 建議,並認為它更有用且更有希望。沒有解釋,AI 建議的偏差會增加。AI 支援並未增加工作量、信心和信任,但減少了挑戰。在一個單獨的測試集中,我們的黑盒子和內在模型在預測手術結果方面分別達到了 77% 和 71% 的準確度。我們找出在初級眼科保健中,人機團隊合作管理青光眼的機會,並注意到雖然 AI 提高了轉診準確度,但即使有解釋,它也顯示出與單獨使用 AI 相比的效能差距。人類參與在醫療決策中仍然至關重要,這強調了未來研究優化協作、確保正面經驗和安全使用 AI 的必要性。

Decoding Decision Reasoning: A Counterfactual-Powered Model for Knowledge Discovery

2406.18552v1 by Yingying Fang, Zihao Jin, Xiaodan Xing, Simon Walsh, Guang Yang

In medical imaging, particularly in early disease detection and prognosis tasks, discerning the rationale behind an AI model's predictions is crucial for evaluating the reliability of its decisions. Conventional explanation methods face challenges in identifying discernible decisive features in medical image classifications, where discriminative features are subtle or not immediately apparent. To bridge this gap, we propose an explainable model that is equipped with both decision reasoning and feature identification capabilities. Our approach not only detects influential image patterns but also uncovers the decisive features that drive the model's final predictions. By implementing our method, we can efficiently identify and visualise class-specific features leveraged by the data-driven model, providing insights into the decision-making processes of deep learning models. We validated our model in the demanding realm of medical prognosis tasks, demonstrating its efficacy and potential in enhancing the reliability of AI in healthcare and in discovering new knowledge in diseases where prognostic understanding is limited.

摘要:在醫學影像中,特別是在早期疾病檢測和預後任務中,辨別 AI 模型預測背後的原理對於評估其決策的可靠性至關重要。傳統的解釋方法在識別醫學影像分類中可識別的決定性特徵時面臨挑戰,其中區別性特徵很微妙或並不明顯。為了彌合這一差距,我們提出了一個可解釋的模型,該模型具備決策推理和特徵識別能力。我們的做法不僅檢測有影響力的影像模式,還揭示了推動模型最終預測的決定性特徵。通過實施我們的模型,我們可以有效識別和視覺化由數據驅動模型利用的類特定特徵,從而深入了解深度學習模型的決策過程。我們在要求嚴格的醫學預後任務領域驗證了我們的模型,展示了其在提高 AI 在醫療保健中的可靠性和發現預後理解受限疾病的新知識方面的功效和潛力。

The Role of Emotions in Informational Support Question-Response Pairs in Online Health Communities: A Multimodal Deep Learning Approach

2405.13099v1 by Mohsen Jozani, Jason A. Williams, Ahmed Aleroud, Sarbottam Bhagat

This study explores the relationship between informational support seeking questions, responses, and helpfulness ratings in online health communities. We created a labeled data set of question-response pairs and developed multimodal machine learning and deep learning models to reliably predict informational support questions and responses. We employed explainable AI to reveal the emotions embedded in informational support exchanges, demonstrating the importance of emotion in providing informational support. This complex interplay between emotional and informational support has not been previously researched. The study refines social support theory and lays the groundwork for the development of user decision aids. Further implications are discussed.

摘要:本研究探討線上健康社群中尋求資訊支持的問題、回應,以及有幫助的評分之間的關係。我們建立了一組標記的問答配對資料集,並開發了多模態機器學習和深度學習模型,以可靠地預測資訊支持問題和回應。我們採用可解釋的 AI 來揭示資訊支持交流中蘊含的情緒,證明情緒在提供資訊支持中的重要性。這種情緒支持和資訊支持之間的複雜交互作用以前並未被研究過。本研究改進了社會支持理論,並為使用者決策輔助工具的開發奠定了基礎。討論了進一步的影響。

ChatGPT in Classrooms: Transforming Challenges into Opportunities in Education

2405.10645v1 by Harris Bin Munawar, Nikolaos Misirlis

In the era of exponential technology growth, one unexpected guest has claimed a seat in classrooms worldwide: Artificial Intelligence. Generative AI, such as ChatGPT, promises a revolution in education, yet it arrives with a double-edged sword. Its potential for personalized learning is offset by issues of cheating, inaccuracies, and educators struggling to incorporate it effectively into their lesson design. We are standing on the brink of this educational frontier, and it is clear that we need to navigate this terrain with a lot of care. This is a major challenge that could undermine the integrity and value of our educational process. So, how can we turn these challenges into opportunities? When used inappropriately, AI tools can become the perfect tool for the cut-copy-paste mentality, and quickly begin to corrode critical thinking, creativity, and deep understanding, the most important skills in our rapidly changing world. Teachers feel that they are not equipped to leverage this technology, widening the digital divide among educators and institutions. Addressing these concerns calls for an in-depth research approach. We will employ empirical research, drawing on the Technology Acceptance Model, to assess the attitudes toward generative AI among educators and students. Understanding their perceptions, usage patterns, and hurdles is the first crucial step in creating an effective solution. The present study will be used as a process manual for future researchers to apply, running their own data, based on the steps explained here.

摘要:在科技飛速發展的時代,一位意外的訪客已在全球教室中佔有一席之地,那就是人工智慧。生成式 AI,例如 ChatGPT,承諾在教育領域掀起一場革命,但它卻是一把雙面刃。它在個人化學習方面的潛力,卻因作弊、不準確以及教育工作者難以將其有效融入教學設計等問題而抵銷。我們正站在這教育前沿的邊緣,顯然我們需要非常小心地探索這片領域。這是一個重大的挑戰,可能會損害我們教育過程的完整性和價值。那麼,我們如何將這些挑戰轉化為機遇?當不適當地使用時,AI 工具可能會成為複製貼上心態的完美工具,並迅速腐蝕批判性思維、創造力和深入理解,這些都是我們快速變化的世界中最重要的技能。教師們覺得他們沒有能力利用這項技術,這擴大了教育工作者和機構之間的數位鴻溝。解決這些問題需要深入的研究方法。我們將採用實證研究,借鑑技術接受模型,來評估教育工作者和學生對生成式 AI 的態度。了解他們的看法、使用模式和障礙是創造有效解決方案的第一個關鍵步驟。本研究將作為未來研究人員應用的流程手冊,根據此處說明的步驟運行他們自己的數據

Evaluating the Explainable AI Method Grad-CAM for Breath Classification on Newborn Time Series Data

2405.07590v1 by Camelia Oprea, Mike Grüne, Mateusz Buglowski, Lena Olivier, Thorsten Orlikowsky, Stefan Kowalewski, Mark Schoberer, André Stollenwerk

With the digitalization of health care systems, artificial intelligence becomes more present in medicine. Machine learning in particular shows great potential for complex tasks such as time series classification, usually at the cost of transparency and comprehensibility. This leads to a lack of trust by humans and thus hinders its active usage. Explainable artificial intelligence tries to close this gap by providing insight into the decision-making process; however, the actual usefulness of its different methods is unclear. This paper proposes a user study based evaluation of the explanation method Grad-CAM with application to a neural network for the classification of breaths in time series neonatal ventilation data. We present the perceived usefulness of the explainability method by different stakeholders, exposing the difficulty to achieve actual transparency and the wish for more in-depth explanations by many of the participants.

摘要:隨著醫療保健系統的數位化,人工智慧在醫學領域中變得更加普及。特別是機器學習在時間序列分類等複雜任務中展現出極大的潛力,但通常是以透明度和可理解性為代價。這導致人類缺乏信任,從而阻礙了其積極使用。可解釋的人工智慧試圖通過提供對決策過程的洞察來彌補這一差距,但其不同方法的實際效用尚不清楚。本文提出了一個基於使用者研究的評估,其中包含了 Grad-CAM 解釋方法,並將其應用於神經網路以分類時間序列新生兒呼吸數據中的呼吸。我們展示了不同利益相關者對可解釋性方法的感知效用,揭示了實現實際透明度的難度,以及許多參與者希望獲得更深入的解釋。
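For readers unfamiliar with Grad-CAM on time series, the sketch below shows the core computation in one dimension: each feature channel is weighted by the time-averaged gradient of the class score, the weighted channels are summed per time step, and negative values are clipped. The activation and gradient values are made-up numbers; in practice both would come from a trained network's last convolutional layer, which the abstract does not describe, so treat the function name and data layout as assumptions.

```python
def grad_cam_1d(activations, gradients):
    """Grad-CAM relevance per time step for a 1D convolutional feature map.

    activations[k][t]: output of channel k at time step t.
    gradients[k][t]:   gradient of the class score w.r.t. activations[k][t].
    Returns a list of length T, scaled to [0, 1] when any relevance is positive.
    """
    # Channel weights: gradients averaged over the time axis.
    weights = [sum(g) / len(g) for g in gradients]
    n_steps = len(activations[0])
    # Weighted sum over channels, followed by ReLU.
    cam = [
        max(0.0, sum(w * a[t] for w, a in zip(weights, activations)))
        for t in range(n_steps)
    ]
    peak = max(cam)
    return [c / peak for c in cam] if peak > 0 else cam


# Made-up values for two channels over four time steps (illustration only).
activations = [[0.0, 1.0, 2.0, 1.0],
               [1.0, 0.0, 0.0, 1.0]]
gradients = [[0.5, 0.5, 0.5, 0.5],
             [-1.0, -1.0, -1.0, -1.0]]
cam = grad_cam_1d(activations, gradients)
```

Here the map peaks at the third time step, where the positively weighted channel is most active and the negatively weighted channel is silent.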

XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare

2405.06270v3 by Fatemeh Nazary, Yashar Deldjoo, Tommaso Di Noia, Eugenio di Sciascio

The integration of Large Language Models (LLMs) into healthcare diagnostics offers a promising avenue for clinical decision-making. This study outlines the development of a novel method for zero-shot/few-shot in-context learning (ICL) by integrating medical domain knowledge using a multi-layered structured prompt. We also explore the efficacy of two communication styles between the user and LLMs: the Numerical Conversational (NC) style, which processes data incrementally, and the Natural Language Single-Turn (NL-ST) style, which employs long narrative prompts. Our study systematically evaluates the diagnostic accuracy and risk factors, including gender bias and false negative rates, using a dataset of 920 patient records in various few-shot scenarios. Results indicate that traditional clinical machine learning (ML) models generally outperform LLMs in zero-shot and few-shot settings. However, the performance gap narrows significantly when employing few-shot examples alongside effective explainable AI (XAI) methods as sources of domain knowledge. Moreover, with sufficient time and an increased number of examples, the conversational style (NC) nearly matches the performance of ML models. Most notably, LLMs demonstrate comparable or superior cost-sensitive accuracy relative to ML models. This research confirms that, with appropriate domain knowledge and tailored communication strategies, LLMs can significantly enhance diagnostic processes. The findings highlight the importance of optimizing the number of training examples and communication styles to improve accuracy and reduce biases in LLM applications.

摘要:大型語言模型 (LLM) 與醫療診斷整合 為臨床決策提供了一個有前景的途徑。本研究概述了一種新穎方法的開發,用於零次學習/少量學習情境學習 (ICL),方法是使用多層結構化提示整合醫療領域知識。我們還探討了使用者與 LLM 之間兩種溝通方式的功效:數值對話 (NC) 方式,它會逐步處理資料,以及自然語言單回合 (NL-ST) 方式,它會使用長篇敘事提示。 我們的研究系統性地評估了診斷準確性和風險因子,包括性別偏見和假陰性率,使用了一個包含 920 個患者記錄的資料集,採用各種少量學習情境。結果表明,傳統的臨床機器學習 (ML) 模型通常在零次學習和少量學習設定中表現優於 LLM。然而,當使用少量學習範例以及有效的可解釋 AI (XAI) 方法作為領域知識來源時,效能差距會顯著縮小。此外,隨著時間充足和範例數量增加,對話方式 (NC) 幾乎可以媲美 ML 模型的效能。最值得注意的是,LLM 相對於 ML 模型展現出相當或更佳的成本敏感準確度。 本研究證實,透過適當的領域知識和量身打造的溝通策略,LLM 可以顯著增強診斷程序。這些發現突顯了最佳化訓練範例數量和溝通方式的重要性,以提高準確度並減少 LLM 應用中的偏差。

To Trust or Not to Trust: Towards a novel approach to measure trust for XAI systems

2405.05766v1 by Miquel Miró-Nicolau, Gabriel Moyà-Alcover, Antoni Jaume-i-Capó, Manuel González-Hidalgo, Maria Gemma Sempere Campello, Juan Antonio Palmer Sancho

The increasing reliance on Deep Learning models, combined with their inherent lack of transparency, has spurred the development of a novel field of study known as eXplainable AI (XAI) methods. These methods seek to enhance the trust of end-users in automated systems by providing insights into the rationale behind their decisions. This paper presents a novel approach for measuring user trust in XAI systems, allowing their refinement. Our proposed metric combines both performance metrics and trust indicators from an objective perspective. To validate this novel methodology, we conducted a case study in a realistic medical scenario: the use of an XAI system for the detection of pneumonia from x-ray images.

摘要:隨著對深度學習模型依賴性的增加,加上其固有的透明度不足,促使一個新的研究領域發展,稱為可解釋 AI (XAI) 方法。這些方法旨在透過深入了解決策背後的原理,來提升最終使用者對自動化系統的信賴。本文提出了一種衡量使用者對 XAI 系統信賴度的新穎方法,允許對其進行改進。我們提出的指標結合了客觀觀點下的效能指標和信賴指標。為了驗證這個新穎的方法,我們在一個真實的醫療場景中進行了一個案例研究:使用 XAI 系統從 X 光影像中偵測肺炎。

Region-specific Risk Quantification for Interpretable Prognosis of COVID-19

2405.02815v1 by Zhusi Zhong, Jie Li, Zhuoqi Ma, Scott Collins, Harrison Bai, Paul Zhang, Terrance Healey, Xinbo Gao, Michael K. Atalay, Zhicheng Jiao

The COVID-19 pandemic has strained global public health, necessitating accurate diagnosis and intervention to control disease spread and reduce mortality rates. This paper introduces an interpretable deep survival prediction model designed specifically for improved understanding and trust in COVID-19 prognosis using chest X-ray (CXR) images. By integrating a large-scale pretrained image encoder, Risk-specific Grad-CAM, and anatomical region detection techniques, our approach produces regional interpretable outcomes that effectively capture essential disease features while focusing on rare but critical abnormal regions. Our model's predictive results provide enhanced clarity and transparency through risk area localization, enabling clinicians to make informed decisions regarding COVID-19 diagnosis with better understanding of prognostic insights. We evaluate the proposed method on a multi-center survival dataset and demonstrate its effectiveness via quantitative and qualitative assessments, achieving superior C-indexes (0.764 and 0.727) and time-dependent AUCs (0.799 and 0.691). These results suggest that our explainable deep survival prediction model surpasses traditional survival analysis methods in risk prediction, improving interpretability for clinical decision making and enhancing AI system trustworthiness.

摘要:COVID-19 疫情對全球公共衛生造成壓力,必須進行準確的診斷和干預,以控制疾病傳播並降低死亡率。本文介紹了一個可解釋的深度生存預測模型,專門設計用於透過胸部 X 光 (CXR) 影像改善對 COVID-19 預後的理解和信賴。透過整合大規模預訓練影像編碼器、風險特定 Grad-CAM 和解剖區域偵測技術,我們的做法產生區域可解釋的結果,有效捕捉必要的疾病特徵,同時專注於罕見但關鍵的異常區域。我們的模型預測結果透過風險區域定位提供增強的清晰度和透明度,讓臨床醫生能夠在更了解預後見解的情況下,就 COVID-19 診斷做出明智的決策。我們在多中心生存資料集上評估所提出的方法,並透過量化和質化評估證明其有效性,達到優異的 C 指數(0.764 和 0.727)和時間相關 AUC(0.799 和 0.691)。這些結果表明,我們可解釋的深度生存預測模型在風險預測方面超越傳統的生存分析方法,提升臨床決策的解釋性,並增強 AI 系統的信賴度。
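The C-index reported above measures how often a model ranks patients correctly by risk. The following is a minimal sketch of Harrell's concordance index on made-up data, assuming higher `risks` values mean shorter expected survival; pairs with tied times are simply skipped, which is a simplification, and this is not the paper's evaluation code.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times[i]:  observed follow-up time for patient i.
    events[i]: 1 if the event was observed, 0 if the patient was censored.
    risks[i]:  predicted risk score (higher means earlier expected event).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up cohort of four patients (illustration only).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; in this toy cohort four of the five comparable pairs are ordered correctly.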

Rad4XCNN: a new agnostic method for post-hoc global explanation of CNN-derived features by means of radiomics

2405.02334v1 by Francesco Prinzi, Carmelo Militello, Calogero Zarcaro, Tommaso Vincenzo Bartolotta, Salvatore Gaglio, Salvatore Vitabile

In recent years, artificial intelligence (AI) in clinical decision support systems (CDSS) has played a key role in harnessing machine learning and deep learning architectures. Despite their promising capabilities, the lack of transparency and explainability of AI models poses significant challenges, particularly in medical contexts where reliability is a mandatory aspect. Achieving transparency without compromising predictive accuracy remains a key challenge. This paper presents a novel method, namely Rad4XCNN, to enhance the predictive power of CNN-derived features with the interpretability inherent in radiomic features. Rad4XCNN diverges from conventional methods based on saliency maps by associating intelligible meaning to CNN-derived features by means of Radiomics, offering new perspectives on explanation methods beyond visualization maps. Using a breast cancer classification task as a case study, we evaluated Rad4XCNN on ultrasound imaging datasets, including an online dataset and two in-house datasets for internal and external validation. Some key results are: i) CNN-derived features guarantee more robust accuracy when compared against ViT-derived and radiomic features; ii) conventional visualization map methods for explanation present several pitfalls; iii) Rad4XCNN does not sacrifice model accuracy for explainability; iv) Rad4XCNN provides global explanation insights enabling the physician to analyze the model outputs and findings. In addition, we highlight the importance of integrating interpretability into AI models for enhanced trust and adoption in clinical practice, emphasizing how our method can mitigate some concerns related to explainable AI methods.

摘要:在過去幾年,臨床決策支援系統 (CDSS) 中的人工智慧 (AI) 在利用機器學習和深度學習架構方面發揮了關鍵作用。儘管 AI 模型具有令人滿意的能力,但缺乏透明度和可解釋性,特別是在可靠性為必要考量的醫療背景下,這帶來了重大的挑戰。在不影響預測精準度的情況下實現透明度仍然是一項關鍵挑戰。本文提出了一種新方法,即 Rad4XCNN,以增強 CNN 衍生特徵的預測能力,同時具備放射特徵固有的可解釋性。Rad4XCNN 不同於基於顯著性圖的傳統方法,它通過放射組學將可理解的含義與 CNN 衍生特徵關聯起來,為超越視覺化圖表的解釋方法提供了新的觀點。我們以乳癌分類任務作為案例研究,在超音波影像資料集上評估 Rad4XCNN,包括一個線上資料集和兩個用於內部和外部驗證的內部資料集。一些關鍵結果如下:i) 與 ViT 衍生特徵和放射特徵相比,CNN 衍生特徵保證了更穩健的準確度;ii) 傳統的視覺化圖解釋方法存在一些缺陷;iii) Rad4XCNN 沒有犧牲模型準確度來換取其可解釋性;iv) Rad4XCNN 提供了全局解釋見解,使醫師能夠分析模型輸出和發現。此外,我們強調將可解釋性整合到 AI 模型中對於增強臨床實務中的信任和採用至關重要,並強調了我們的方法如何能緩解與可解釋 AI 方法相關的一些疑慮。

Attributing Responsibility in AI-Induced Incidents: A Computational Reflective Equilibrium Framework for Accountability

2404.16957v1 by Yunfei Ge, Quanyan Zhu

The pervasive integration of Artificial Intelligence (AI) has introduced complex challenges in the responsibility and accountability in the event of incidents involving AI-enabled systems. The interconnectivity of these systems, ethical concerns of AI-induced incidents, coupled with uncertainties in AI technology and the absence of corresponding regulations, have made traditional responsibility attribution challenging. To this end, this work proposes a Computational Reflective Equilibrium (CRE) approach to establish a coherent and ethically acceptable responsibility attribution framework for all stakeholders. The computational approach provides a structured analysis that overcomes the limitations of conceptual approaches in dealing with dynamic and multifaceted scenarios, showcasing the framework's explainability, coherence, and adaptivity properties in the responsibility attribution process. We examine the pivotal role of the initial activation level associated with claims in equilibrium computation. Using an AI-assisted medical decision-support system as a case study, we illustrate how different initializations lead to diverse responsibility distributions. The framework offers valuable insights into accountability in AI-induced incidents, facilitating the development of a sustainable and resilient system through continuous monitoring, revision, and reflection.

摘要:隨著人工智慧 (AI) 的普及整合,在涉及 AI 驅動系統的事故中,責任和義務歸屬產生了複雜的挑戰。這些系統的互連性、AI 引發事故的倫理問題,加上 AI 技術的不確定性和缺乏相應法規,使得傳統責任歸屬面臨挑戰。為此,本研究提出了一種計算反思均衡 (CRE) 方法,以建立一個連貫且在倫理上可接受的責任歸屬架構,適用於所有利害關係人。計算方法提供了結構化的分析,克服了概念方法在處理動態且多面向情境時的限制,展示了該架構在責任歸屬過程中具備的可解釋性、連貫性和適應性。我們探討了與均衡計算中索賠相關的初始啟動層級的關鍵作用。我們以 AI 輔助醫療決策支援系統為案例研究,說明不同的初始化如何導致不同的責任分配。該架構提供了對 AI 引發事故中問責制的寶貴見解,透過持續監控、修訂和反思,促進了永續且有韌性的系統發展。

Explainable AI for Fair Sepsis Mortality Predictive Model

2404.13139v1 by Chia-Hsuan Chang, Xiaoyang Wang, Christopher C. Yang

Artificial intelligence supports healthcare professionals with predictive modeling, greatly transforming clinical decision-making. This study addresses the crucial need for fairness and explainability in AI applications within healthcare to ensure equitable outcomes across diverse patient demographics. By focusing on the predictive modeling of sepsis-related mortality, we propose a method that learns a performance-optimized predictive model and then employs the transfer learning process to produce a model with better fairness. Our method also introduces a novel permutation-based feature importance algorithm aiming at elucidating the contribution of each feature in enhancing fairness on predictions. Unlike existing explainability methods concentrating on explaining feature contribution to predictive performance, our proposed method uniquely bridges the gap in understanding how each feature contributes to fairness. This advancement is pivotal, given sepsis's significant mortality rate and its role in one-third of hospital deaths. Our method not only aids in identifying and mitigating biases within the predictive model but also fosters trust among healthcare stakeholders by improving the transparency and fairness of model predictions, thereby contributing to more equitable and trustworthy healthcare delivery.

摘要:人工智慧透過預測模型協助醫療專業人員,大幅轉變了臨床決策制定。本研究探討了在醫療保健中使用人工智慧應用程式時公平性和可解釋性的關鍵需求,以確保在不同的患者人口統計資料中獲得公平的結果。透過專注於敗血症相關死亡率的預測模型,我們提出了一種方法,該方法會學習一個效能最佳化的預測模型,然後採用轉移學習過程來產生一個具有更好公平性的模型。我們的模型還引入了一種新穎的基於排列的特徵重要性演算法,旨在闡明每個特徵在增強預測公平性方面的貢獻。與現有的可解釋性方法專注於解釋特徵對預測效能的貢獻不同,我們提出的方法獨特地彌補了理解每個特徵如何有助於公平性的差距。這項進展至關重要,因為敗血症的死亡率很高,且在三分之一的醫院死亡中扮演著角色。我們的模型不僅有助於識別和減輕預測模型中的偏差,還能透過提高模型預測的透明度和公平性來培養醫療保健利益相關者之間的信任,進而有助於提供更公平且值得信賴的醫療保健服務。
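The abstract does not spell out its permutation-based algorithm, so the sketch below is only a generic stand-in for the idea: shuffle one feature at a time and record how much a fairness gap changes, instead of how much accuracy changes. The choice of demographic parity as the fairness metric, the function names, and the toy data are all assumptions made for illustration.

```python
import random


def parity_gap(preds, groups):
    """Absolute difference in positive-prediction rate between groups 0 and 1."""
    rates = []
    for g in (0, 1):
        members = [p for p, grp in zip(preds, groups) if grp == g]
        rates.append(sum(members) / len(members))
    return abs(rates[0] - rates[1])


def fairness_permutation_importance(predict, rows, groups, n_repeats=20, seed=0):
    """Mean reduction in the parity gap when each feature column is shuffled.

    A positive score means the feature contributes to the disparity:
    scrambling it makes predictions more similar across groups.
    """
    rng = random.Random(seed)
    base_gap = parity_gap([predict(r) for r in rows], groups)
    importances = []
    for col in range(len(rows[0])):
        gaps = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            gaps.append(parity_gap([predict(r) for r in permuted], groups))
        importances.append(base_gap - sum(gaps) / n_repeats)
    return base_gap, importances


# Toy data: feature 0 separates the groups, feature 1 is ignored by the model.
rows = [[5.0, 1.0], [6.0, 0.0], [7.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
groups = [0, 0, 0, 1, 1, 1]


def toy_predict(row):
    # Hypothetical threshold classifier (illustration only).
    return 1 if row[0] > 4 else 0


base_gap, importances = fairness_permutation_importance(toy_predict, rows, groups)
```

Because the toy classifier never reads feature 1, shuffling it leaves the gap unchanged and its score is exactly zero, while feature 0 carries all of the disparity.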

Multi Class Depression Detection Through Tweets using Artificial Intelligence

2404.13104v1 by Muhammad Osama Nusrat, Waseem Shahzad, Saad Ahmed Jamal

Depression is a significant issue nowadays. As per the World Health Organization (WHO), in 2023, over 280 million individuals are grappling with depression. This is a huge number; if not taken seriously, these numbers will increase rapidly. About 4.89 billion individuals are social media users. People express their feelings and emotions on platforms like Twitter, Facebook, Reddit, Instagram, etc. These platforms contain valuable information which can be used for research purposes. Considerable research has been conducted across various social media platforms. However, certain limitations persist in these endeavors. Particularly, previous studies were only focused on detecting depression and the intensity of depression in tweets. Also, there existed inaccuracies in dataset labeling. In this research work, five types of depression (bipolar, major, psychotic, atypical, and postpartum) were predicted using tweets from the Twitter database based on lexicon labeling. Explainable AI was used to provide reasoning by highlighting the parts of tweets that represent the type of depression. Bidirectional Encoder Representations from Transformers (BERT) was used for feature extraction and training. Machine learning and deep learning methodologies were used to train the model. The BERT model presented the most promising results, achieving an overall accuracy of 0.96.

摘要:現今,憂鬱症是一個重要的議題。根據世界衛生組織 (WHO) 的資料,在 2023 年,超過 2.8 億人正在與憂鬱症搏鬥。這是一個龐大的數字;如果不認真看待,這些數字將會快速增加。大約有 48.9 億人是社群媒體使用者。人們在 Twitter、Facebook、Reddit、Instagram 等平台上表達自己的感受和情緒。這些平台包含有價值的資訊,可用於研究目的。已經在各種社群媒體平台上進行了大量的研究。然而,這些努力仍存在某些限制。特別是,先前的研究僅專注於偵測推文中的憂鬱症和憂鬱症的強度。此外,資料集標籤中存在不準確的情況。在這項研究工作中,使用基於詞彙標籤的 Twitter 資料庫中的推文預測了五種類型的憂鬱症(雙極型、重度、精神病型、非典型和產後)。可解釋的 AI 用於透過強調代表憂鬱症類型的推文部分來提供推理。從 Transformers(BERT)中提取的雙向編碼器表示用於特徵提取和訓練。機器學習和深度學習方法用於訓練模型。BERT 模型呈現出最有希望的結果,達到 0.96 的整體準確度。

COIN: Counterfactual inpainting for weakly supervised semantic segmentation for medical images

2404.12832v2 by Dmytro Shvetsov, Joonas Ariva, Marharyta Domnich, Raul Vicente, Dmytro Fishman

Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is the development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal by using a generative model. For instance, if the classifier deems an input medical image X as abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction label. The approach enables us to produce precise segmentations for pathologies without depending on pre-existing segmentation masks. Crucially, image-level labels are utilized, which are substantially easier to acquire than creating detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors from CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for semantic segmentation of tumors in CT images, and presents a step forward in making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.

摘要:深度學習正大幅轉變醫學影像和放射線學領域,能辨識醫學影像中的病理,包括電腦斷層掃描 (CT) 和 X 光掃描。然而,深度學習模型的效能,特別是在分割任務中,常常受到廣泛註解資料集需求的限制。為了應對此挑戰,透過可解釋 AI 和反事實解釋的產生,探索弱監督語意分割的能力。本研究的範圍是開發一種新的反事實內插方法 (COIN),該方法使用生成模型將預測的分類標籤從異常翻轉為正常。例如,如果分類器將輸入的醫學影像 X 視為異常,表示存在病理,則生成模型旨在內插異常區域,從而逆轉分類器的原始預測標籤。此方法使我們能夠產生病理的精確分割,而無需依賴於預先存在的分割遮罩。至關重要的是,利用影像層級標籤,這比建立詳細的分割遮罩容易取得。該方法的有效性透過分割合成目標和從愛沙尼亞塔爾圖大學醫院取得的 CT 影像中的實際腎臟腫瘤來證明。研究結果表明,COIN 遠遠超過已建立的歸因方法,例如 RISE、ScoreCAM 和 LayerCAM,以及 Singla 等人提出的另一種反事實解釋方法。此證據表明,COIN 是一種很有前途的 CT 影像中腫瘤語意分割方法,並在醫療保健中讓深度學習應用更易於取得和更有效率邁進一步,其中註解資料很稀少。
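One plausible way to turn a counterfactual image into a segmentation, sketched below, is to threshold the pixel-wise difference between the original image and its inpainted "normal" version. The abstract does not say how COIN derives its masks, so the thresholding step, the function name, and the threshold value are assumptions for illustration, and the images here are tiny hand-written grids, not model outputs.

```python
def counterfactual_mask(original, inpainted, threshold=0.3):
    """Binary mask marking pixels that the counterfactual changed noticeably.

    original, inpainted: 2D lists of pixel intensities with the same shape.
    A pixel is labeled 1 when its absolute change exceeds `threshold`.
    """
    return [
        [1 if abs(o - c) > threshold else 0 for o, c in zip(row_o, row_c)]
        for row_o, row_c in zip(original, inpainted)
    ]


# Toy 3x3 "scan" with a bright two-pixel lesion, and a version where a
# hypothetical generative model has painted the lesion out.
original = [[0.1, 0.1, 0.1],
            [0.1, 0.9, 0.8],
            [0.1, 0.1, 0.1]]
inpainted = [[0.1, 0.1, 0.1],
             [0.1, 0.2, 0.1],
             [0.1, 0.1, 0.1]]
mask = counterfactual_mask(original, inpainted)
```

Only the two pixels the inpainting altered are marked, which is why image-level labels can be enough to supervise the classifier while the mask falls out of the counterfactual.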

Hybrid Intelligence for Digital Humanities

2406.15374v1 by Victor de Boer, Lise Stork

In this paper, we explore the synergies between Digital Humanities (DH) as a discipline and Hybrid Intelligence (HI) as a research paradigm. In DH research, the use of digital methods and specifically that of Artificial Intelligence is subject to a set of requirements and constraints. We argue that these are well-supported by the capabilities and goals of HI. Our contribution includes the identification of five such DH requirements: Successful AI systems need to be able to 1) collaborate with the (human) scholar; 2) support data criticism; 3) support tool criticism; 4) be aware of and cater to various perspectives and 5) support distant and close reading. We take the CARE principles of Hybrid Intelligence (collaborative, adaptive, responsible and explainable) as theoretical framework and map these to the DH requirements. In this mapping, we include example research projects. We finally address how insights from DH can be applied to HI and discuss open challenges for the combination of the two disciplines.

摘要:在本文中,我們探討數位人文學科 (DH) 作為一門學科與混合智能 (HI) 作為一個研究典範之間的協同作用。在 DH 研究中,數位方法的使用,特別是人工智慧的使用,受到一系列要求和限制。我們認為這些要求和限制獲得 HI 的能力和目標的充分支持。我們的貢獻包括找出五個這樣的 DH 要求:成功的 AI 系統需要能夠 1) 與(人類)學者合作;2) 支援資料批評;3) 支援工具批評;4) 察覺並迎合各種觀點;5) 支援遠距和近距離閱讀。我們將混合智能的 CARE 原則(協作、適應、負責和可解釋)作為理論架構,並將這些原則對應到 DH 要求。在此對應中,我們納入範例研究專案。最後,我們探討如何將 DH 的見解應用於 HI,並討論結合這兩個學科的開放挑戰。

Ethical Framework for Responsible Foundational Models in Medical Imaging

2406.11868v1 by Abhijit Das, Debesh Jha, Jasmer Sanjotra, Onkar Susladkar, Suramyaa Sarkar, Ashish Rauniyar, Nikhil Tomar, Vanshali Sharma, Ulas Bagci

Foundational models (FMs) have tremendous potential to revolutionize medical imaging. However, their deployment in real-world clinical settings demands extensive ethical considerations. This paper aims to highlight the ethical concerns related to FMs and propose a framework to guide their responsible development and implementation within medicine. We meticulously examine ethical issues such as privacy of patient data, bias mitigation, algorithmic transparency, explainability and accountability. The proposed framework is designed to prioritize patient welfare, mitigate potential risks, and foster trust in AI-assisted healthcare.

摘要:基礎模型 (FM) 具有徹底改變醫學影像的巨大潛力。然而,它們在現實世界臨床環境中的部署需要廣泛的倫理考量。本文旨在強調與 FM 相關的倫理問題,並提出一個框架來指導它們在醫學中的負責任開發和實施。我們仔細審查了倫理問題,例如患者數據隱私、偏差緩解、演算法透明度、可解釋性和問責制。所提出的框架旨在優先考慮患者福利、減輕潛在風險,並培養對 AI 輔助醫療保健的信任。

Advancements in Radiomics and Artificial Intelligence for Thyroid Cancer Diagnosis

2404.07239v1 by Milad Yousefi, Shadi Farabi Maleki, Ali Jafarizadeh, Mahya Ahmadpour Youshanlui, Aida Jafari, Siamak Pedrammehr, Roohallah Alizadehsani, Ryszard Tadeusiewicz, Pawel Plawiak

Thyroid cancer is an increasing global health concern that requires advanced diagnostic methods. The application of AI and radiomics to thyroid cancer diagnosis is examined in this review. A review of multiple databases was conducted in compliance with PRISMA guidelines until October 2023. A combination of keywords led to the discovery of an English academic publication on thyroid cancer and related subjects. 267 papers were returned from the original search after 109 duplicates were removed. Relevant studies were selected according to predetermined criteria after 124 articles were eliminated based on an examination of their abstract and title. After the comprehensive analysis, an additional six studies were excluded. Among the 28 included studies, radiomics analysis, which incorporates ultrasound (US) images, demonstrated its effectiveness in diagnosing thyroid cancer. Various results were noted, with some studies presenting new strategies that outperformed the status quo. The literature has emphasized various challenges faced by AI models, including interpretability issues, dataset constraints, and operator dependence. The synthesized findings of the 28 included studies mentioned the need for standardization efforts and prospective multicenter studies to address these concerns. Furthermore, approaches to overcome these obstacles were identified, such as advances in explainable AI technology and personalized medicine techniques. The review focuses on how AI and radiomics could transform the diagnosis and treatment of thyroid cancer. Despite challenges, future research on multidisciplinary cooperation, clinical applicability validation, and algorithm improvement holds the potential to improve patient outcomes and diagnostic precision in the treatment of thyroid cancer.

摘要:甲狀腺癌是一種日益嚴重的全球健康問題,需要先進的診斷方法。本篇評論探討了人工智能與放射特徵分析在甲狀腺癌診斷中的應用。在符合 PRISMA 指南的情況下,對多個資料庫進行了回顧,直到 2023 年 10 月。通過結合關鍵字,發現了一篇關於甲狀腺癌和相關主題的英文學術出版物。在移除 109 篇重複文獻後,原始搜尋共回傳 267 篇論文。在根據預先確定的標準,淘汰了 124 篇文章的摘要和標題後,選出了相關研究。在進行全面分析後,額外排除了六項研究。在納入的 28 項研究中,結合超音波 (US) 影像的放射特徵分析,證明了其在診斷甲狀腺癌方面的有效性。研究結果不一,有些研究提出了優於現狀的新策略。文獻強調了人工智能模型面臨的各種挑戰,包括可解釋性問題、資料集限制和操作員依賴性。28 項納入研究的綜合發現提到,需要標準化工作和前瞻性多中心研究來解決這些問題。此外,還確定了克服這些障礙的方法,例如可解釋人工智能技術和個人化醫療技術的進步。本篇評論重點探討了人工智能和放射特徵分析如何轉變甲狀腺癌的診斷和治療。儘管存在挑戰,但未來對多學科合作、臨床適用性驗證和演算法改進的研究,仍有潛力改善甲狀腺癌治療中的患者預後和診斷精準度。

Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI

2404.04686v1 by Taminul Islam, Md. Alif Sheakh, Mst. Sazia Tahosin, Most. Hasna Hena, Shopnil Akash, Yousef A. Bin Jardan, Gezahign Fentahun Wondmie, Hiba-Allah Nafidi, Mohammed Bourhia

Breast cancer has rapidly increased in prevalence in recent years, making it one of the leading causes of mortality worldwide. Among all cancers, it is by far the most common. Diagnosing this illness manually requires significant time and expertise. Because detection is time-consuming, machine-based predictions can help prevent the disease from spreading further. Machine learning and Explainable AI are crucial in classification as they not only provide accurate predictions but also offer insights into how the model arrives at its decisions, aiding in the understanding and trustworthiness of the classification results. In this study, we evaluate and compare the classification accuracy, precision, recall, and F1 scores of five different machine learning methods using a primary dataset (500 patients from Dhaka Medical College Hospital). Five supervised machine learning techniques (decision tree, random forest, logistic regression, naive Bayes, and XGBoost) were used to achieve optimal results on our dataset. Additionally, this study applied SHAP analysis to the XGBoost model to interpret the model's predictions and understand the impact of each feature on the model's output. We compared the accuracy with which the algorithms classified the data and contrasted our results with other literature in this field. After final evaluation, this study found that XGBoost achieved the best model accuracy, at 97%.

摘要:近年來,乳癌的盛行率迅速增加,使其成為全球主要的死亡原因之一。在所有癌症中,乳癌迄今為止是最常見的。手動診斷此疾病需要大量的時間和專業知識。由於乳癌的檢測過程耗時,因此透過建立機器學習模型來預測,有助於防止其進一步擴散。機器學習和可解釋 AI 在分類中至關重要,因為它們不僅可以提供準確的預測,還可以深入了解模型如何做出決策,有助於理解和信賴分類結果。在此研究中,我們評估並比較了五種不同的機器學習方法的分類準確度、精確度、召回率和 F1 分數,使用了一個主要的資料集(達卡醫學院醫院的 500 名患者)。五種不同的監督式機器學習技術,包括決策樹、隨機森林、邏輯迴歸、朴素貝氏和 XGBoost,已用於在我們的資料集上取得最佳結果。此外,本研究將 SHAP 分析應用於 XGBoost 模型,以解釋模型的預測並了解每個特徵對模型輸出的影響。我們比較了幾種演算法對資料進行分類的準確度,並與該領域的其他文獻進行對比。在最後評估後,本研究發現 XGBoost 達到了最佳的模型準確度,為 97%。
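
The four scores named in the abstract above (accuracy, precision, recall, F1) can be derived from the counts of true and false positives and negatives. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `binary_metrics` and the toy labels are assumptions made purely for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, made up for illustration (not data from the study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Reporting precision and recall next to accuracy matters for diagnostic data, since accuracy alone can look high when one class dominates.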

Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI

2404.03892v3 by Maryam Ahmed, Tooba Bibi, Rizwan Ahmed Khan, Sidra Nasir

Deep learning (DL) models for diagnosing breast cancer from mammographic images often operate as "black boxes", making it difficult for healthcare professionals to trust and understand their decision-making processes. This study presents an integrated framework combining Convolutional Neural Networks (CNNs) and Explainable Artificial Intelligence (XAI) for the enhanced diagnosis of breast cancer using the CBIS-DDSM dataset. The methodology encompasses an elaborate data preprocessing pipeline and advanced data augmentation techniques to counteract dataset limitations, together with transfer learning using pre-trained networks such as VGG-16, Inception-V3 and ResNet. A focal point of our study is the evaluation of XAI's effectiveness in interpreting model predictions, highlighted by using the Hausdorff measure to quantitatively assess the alignment between AI-generated explanations and expert annotations. This approach is critical for XAI in promoting trustworthiness and ethical fairness in AI-assisted diagnostics. The findings from our research illustrate the effective collaboration between CNNs and XAI in advancing diagnostic methods for breast cancer, thereby facilitating a more seamless integration of advanced AI technologies within clinical settings. By enhancing the interpretability of AI-driven decisions, this work lays the groundwork for improved collaboration between AI systems and medical practitioners, ultimately enriching patient care. Furthermore, the implications of our research extend well beyond the current methodologies: it encourages further research into how to combine multimodal data and improve AI explanations to meet the needs of clinical practice.

摘要:深度學習 (DL) 用於從乳房攝影術影像診斷乳癌的模型通常以「黑盒子」方式運作,這使得醫療保健專業人員難以信任和理解其決策過程。本研究提出一個整合架構,結合卷積神經網路 (CNN) 和可解釋人工智慧 (XAI),以使用 CBIS-DDSM 資料集增強乳癌的診斷。方法包含一個精細的資料前處理管線和進階資料擴充技術,以對抗資料集限制,並採用預先訓練的網路(例如 VGG-16、Inception-V3 和 ResNet)進行遷移學習。我們研究的重點是評估 XAI 在解釋模型預測中的有效性,重點利用豪斯多夫測度量化評估 AI 生成的解釋和專家註解之間的一致性。這種方法對於 XAI 在促進 AI 輔助診斷中的可信度和倫理公平性至關重要。我們研究的發現說明了 CNN 和 XAI 在推進乳癌診斷方法中的有效協作,從而促進了先進 AI 技術在臨床環境中的更順暢整合。透過增強 AI 驅動決策的可解釋性,這項工作為 AI 系統和醫療從業人員之間的改善協作奠定了基礎,最終豐富了患者照護。此外,我們研究的影響遠遠超出了目前的技術。它鼓勵進一步研究如何結合多模式資料並改善 AI 解釋,以滿足臨床實務的需求。
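
The abstract above scores how well an explanation matches an expert annotation with a Hausdorff measure. As a rough illustration, the sketch below computes the symmetric Hausdorff distance between two binary masks in plain Python. It assumes that "Hausdorff measure" means the Hausdorff distance between point sets and that the explanation heatmap has already been thresholded into a binary mask. Neither detail is stated in the abstract, and the function names and toy masks are hypothetical.

```python
import math


def mask_points(mask):
    """Return (row, col) coordinates of all non-zero pixels in a 2D mask."""
    return [(r, c) for r, row in enumerate(mask)
            for c, value in enumerate(row) if value]


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(mask_a, mask_b):
    """Symmetric Hausdorff distance between the foregrounds of two masks."""
    a, b = mask_points(mask_a), mask_points(mask_b)
    if not a or not b:
        raise ValueError("both masks need at least one foreground pixel")
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy 4x4 masks, made up for illustration.
explanation = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
annotation = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
distance = hausdorff(explanation, annotation)
print(distance)
```

A smaller distance means the highlighted region lies closer to the annotated lesion. The brute-force pairwise loop is fine for small masks but scales quadratically with the number of foreground pixels.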

Advancing Multimodal Data Fusion in Pain Recognition: A Strategy Leveraging Statistical Correlation and Human-Centered Perspectives

2404.00320v2 by Xingrui Gu, Zhixuan Wang, Irisa Jin, Zekun Wu

This research presents a novel multimodal data fusion methodology for pain behavior recognition, integrating statistical correlation analysis with human-centered insights. Our approach introduces two key innovations: 1) integrating data-driven statistical relevance weights into the fusion strategy to effectively utilize complementary information from heterogeneous modalities, and 2) incorporating human-centric movement characteristics into multimodal representation learning for detailed modeling of pain behaviors. Validated across various deep learning architectures, our method demonstrates superior performance and broad applicability. We propose a customizable framework that aligns each modality with a suitable classifier based on statistical significance, advancing personalized and effective multimodal fusion. Furthermore, our methodology provides explainable analysis of multimodal data, contributing to interpretable and explainable AI in healthcare. By highlighting the importance of data diversity and modality-specific representations, we enhance traditional fusion techniques and set new standards for recognizing complex pain behaviors. Our findings have significant implications for promoting patient-centered healthcare interventions and supporting explainable clinical decision-making.

摘要:本研究提出了一種創新的多模態數據融合方法,用於疼痛行為識別,將統計相關分析與以人為中心的見解相結合。我們的做法引入了兩項關鍵創新:1) 將數據驅動的統計相關權重整合到融合策略中,以有效利用來自異質模態的補充信息,以及 2) 將以人為中心的運動特徵納入多模態表示學習中,以詳細建模疼痛行為。我們的模型在各種深度學習架構中得到驗證,展示了卓越的性能和廣泛的適用性。我們提出了一個可自定義的框架,根據統計顯著性將每個模態與合適的分類器對齊,推進個性化和有效的多模態融合。此外,我們的模型提供對多模態數據的可解釋分析,有助於醫療保健中的可解釋和可解釋 AI。通過強調數據多樣性和模態特定表示的重要性,我們增強了傳統的融合技術,並為識別複雜的疼痛行為設定了新的標準。我們的發現對促進以患者為中心的醫療保健干預和支持可解釋的臨床決策制定具有重要意義。

Addressing Social Misattributions of Large Language Models: An HCXAI-based Approach

2403.17873v1 by Andrea Ferrario, Alberto Termine, Alessandro Facchini

Human-centered explainable AI (HCXAI) advocates for the integration of social aspects into AI explanations. Central to the HCXAI discourse is the Social Transparency (ST) framework, which aims to make the socio-organizational context of AI systems accessible to their users. In this work, we suggest extending the ST framework to address the risks of social misattributions in Large Language Models (LLMs), particularly in sensitive areas like mental health. In fact, LLMs, which are remarkably capable of simulating roles and personas, may lead to mismatches between designers' intentions and users' perceptions of social attributes, risking the promotion of emotional manipulation and dangerous behaviors, cases of epistemic injustice, and unwarranted trust. To address these issues, we propose enhancing the ST framework with a fifth 'W-question' to clarify the specific social attributions assigned to LLMs by their designers and users. This addition aims to bridge the gap between LLM capabilities and user perceptions, promoting the ethically responsible development and use of LLM-based technology.

摘要:以人为本的可解释 AI (HCXAI) 倡导将社会层面整合到 AI 解释中。HCXAI 话语的核心是社会透明度 (ST) 框架,其目标是让 AI 系统的社会组织背景对用户来说是可理解的。在这项工作中,我们建议扩展 ST 框架以解决大型语言模型 (LLM) 中社会错误归因的风险,尤其是在心理健康等敏感领域。事实上,LLM 能够出色地模拟角色和人格,这可能导致设计者的意图和用户对社会属性的认知之间出现错配,从而有风险促进情绪操纵和危险行为、认知不公正和不合理的信任。为了解决这些问题,我们建议用第五个“W 问题”来增强 ST 框架,以明确设计者和用户赋予 LLM 的具体社会属性。此补充旨在弥合 LLM 能力和用户认知之间的差距,促进基于 LLM 的技术在道德上负责任地开发和使用。

Clinical Domain Knowledge-Derived Template Improves Post Hoc AI Explanations in Pneumothorax Classification

2403.18871v1 by Han Yuan, Chuan Hong, Pengtao Jiang, Gangming Zhao, Nguyen Tuan Anh Tran, Xinxing Xu, Yet Yen Yan, Nan Liu

Background: Pneumothorax is an acute thoracic disease caused by abnormal air collection between the lungs and chest wall. To address the opaqueness often associated with deep learning (DL) models, explainable artificial intelligence (XAI) methods have been introduced to outline regions related to pneumothorax diagnoses made by DL models. However, these explanations sometimes diverge from actual lesion areas, highlighting the need for further improvement. Method: We propose a template-guided approach to incorporate the clinical knowledge of pneumothorax into model explanations generated by XAI methods, thereby enhancing the quality of these explanations. Utilizing one lesion delineation created by radiologists, our approach first generates a template that represents potential areas of pneumothorax occurrence. This template is then superimposed on model explanations to filter out extraneous explanations that fall outside the template's boundaries. To validate its efficacy, we carried out a comparative analysis of three XAI methods with and without our template guidance when explaining two DL models in two real-world datasets. Results: The proposed approach consistently improved baseline XAI methods across twelve benchmark scenarios built on three XAI methods, two DL models, and two datasets. The average incremental percentages, calculated by the performance improvements over the baseline performance, were 97.8% in Intersection over Union (IoU) and 94.1% in Dice Similarity Coefficient (DSC) when comparing model explanations and ground-truth lesion areas. Conclusions: In the context of pneumothorax diagnoses, we proposed a template-guided approach for improving AI explanations. We anticipate that our template guidance will forge a fresh approach to elucidating AI models by integrating clinical domain expertise.

摘要:背景:氣胸是一種因肺部與胸壁之間異常集氣所引起的急性胸腔疾病。為了解決深度學習(DL)模型經常伴隨的不透明性,可解釋人工智慧(XAI)方法已被引入,用於概述與 DL 模型做出的氣胸診斷相關的區域。然而,這些解釋有時會與實際病灶區域有所出入,突顯出進一步改進的必要性。方法:我們提出了一種模板引導式方法,將氣胸的臨床知識納入 XAI 方法產生的模型解釋中,從而提升這些解釋的品質。利用放射科醫師建立的病灶描繪,我們的做法首先產生一個模板,用於表示氣胸可能發生的區域。然後將此模板疊加在模型解釋上,以篩選出超出模板邊界的無關解釋。為了驗證其效力,我們對三種 XAI 方法進行了比較分析,在兩個真實世界資料集中解釋兩個 DL 模型時,分別採用和不採用我們的模板引導。結果:所提出的方法在建立於三種 XAI 方法、兩個 DL 模型和兩個資料集的十二種基準情境中,始終改善了基準 XAI 方法。在比較模型解釋和真實病灶區域時,透過基準效能的效能改進計算出的平均增量百分比為交集比(IoU)的 97.8% 和骰子相似性係數(DSC)的 94.1%。結論:在氣胸診斷的背景下,我們提出了一種模板引導式方法,用於改善 AI 解釋。我們預期我們的模板引導將透過整合臨床領域專業知識,為闡明 AI 模型建立一種新方法。
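
The template-guided filtering and the overlap metrics described in the abstract above can be sketched in a few lines of plain Python. This is an illustrative reading, not the authors' implementation. It assumes binarized explanation maps, a binary template, and that the "incremental percentage" is the relative gain over the baseline, `100 * (improved - baseline) / baseline`. All function names and toy masks are hypothetical.

```python
def apply_template(explanation, template):
    """Keep explanation pixels only where the template is non-zero."""
    return [[e if t else 0 for e, t in zip(e_row, t_row)]
            for e_row, t_row in zip(explanation, template)]


def iou_and_dice(pred, truth):
    """Intersection over Union and Dice coefficient for two binary masks."""
    inter = sum(1 for p_row, t_row in zip(pred, truth)
                for p, t in zip(p_row, t_row) if p and t)
    n_pred = sum(1 for row in pred for v in row if v)
    n_truth = sum(1 for row in truth for v in row if v)
    union = n_pred + n_truth - inter
    iou = inter / union if union else 0.0
    dice = 2 * inter / (n_pred + n_truth) if (n_pred + n_truth) else 0.0
    return iou, dice


def incremental_percentage(baseline, improved):
    """Relative improvement over the baseline score, in percent."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (improved - baseline) / baseline


# Toy 3x4 masks, made up for illustration (not data from the paper).
explanation = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
template = [[1, 1, 1, 0], [1, 1, 1, 0], [0, 0, 0, 0]]
lesion = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]

base_iou, base_dice = iou_and_dice(explanation, lesion)
filtered = apply_template(explanation, template)
new_iou, new_dice = iou_and_dice(filtered, lesion)
iou_gain = incremental_percentage(base_iou, new_iou)
print(round(base_iou, 3), round(new_iou, 3), round(iou_gain, 1))
```

In this toy case the template removes the two highlighted pixels in the bottom row, which shrinks the union without touching the intersection, so both IoU and Dice rise.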

Enhancing Neural Machine Translation of Low-Resource Languages: Corpus Development, Human Evaluation and Explainable AI Architectures

2403.01580v1 by Séamus Lankford

In the current machine translation (MT) landscape, the Transformer architecture stands out as the gold standard, especially for high-resource language pairs. This research delves into its efficacy for low-resource language pairs including both the English↔Irish and English↔Marathi language pairs. Notably, the study identifies the optimal hyperparameters and subword model type to significantly improve the translation quality of Transformer models for low-resource language pairs. The scarcity of parallel datasets for low-resource languages can hinder MT development. To address this, gaHealth was developed, the first bilingual corpus of health data for the Irish language. Focusing on the health domain, models developed using this in-domain dataset exhibited very significant improvements in BLEU score when compared with models from the LoResMT2021 Shared Task. A subsequent human evaluation using the multidimensional quality metrics error taxonomy showcased the superior performance of the Transformer system in reducing both accuracy and fluency errors compared to an RNN-based counterpart. Furthermore, this thesis introduces adaptNMT and adaptMLLM, two open-source applications streamlined for the development, fine-tuning, and deployment of neural machine translation models. These tools considerably simplify the setup and evaluation process, making MT more accessible to both developers and translators. Notably, adaptNMT, grounded in the OpenNMT ecosystem, promotes eco-friendly natural language processing research by highlighting the environmental footprint of model development. Fine-tuning of MLLMs by adaptMLLM demonstrated advancements in translation performance for two low-resource language pairs: English↔Irish and English↔Marathi, compared to baselines from the LoResMT2021 Shared Task.

摘要:在當前機器翻譯 (MT) 領域中,Transformer 架構脫穎而出,成為黃金標準,特別是對於高資源語言對。本研究探討其對低資源語言對的效能,包括英語↔愛爾蘭語和英語↔馬拉地語語言對。值得注意的是,本研究識別出最佳超參數和子詞模型類型,以顯著提高 Transformer 模型對低資源語言對的翻譯品質。 低資源語言的平行資料集的稀缺會阻礙 MT 的發展。為了解決這個問題,開發了 gaHealth,這是愛爾蘭語的第一個雙語健康資料語料庫。專注於健康領域,使用此域內資料集開發的模型在 BLEU 得分方面表現出非常顯著的進步,與 LoResMT2021 共享任務中的模型相比。隨後使用多維品質指標錯誤分類法進行的人工評估顯示,與基於 RNN 的對應模型相比,Transformer 系統在減少準確性和流暢性錯誤方面表現出優異的性能。 此外,本論文介紹了 adaptNMT 和 adaptMLLM,這兩個開源應用程式簡化了神經機器翻譯模型的開發、微調和部署。這些工具大幅簡化了設定和評估流程,讓 MT 更容易讓開發人員和翻譯人員使用。值得注意的是,adaptNMT 以 OpenNMT 生態系統為基礎,通過強調模型開發的環境足跡來促進生態友好的自然語言處理研究。與 LoResMT2021 共享任務中的基準相比,adaptMLLM 對 MLLM 的微調證明了英語↔愛爾蘭語和英語↔馬拉地語這兩個低資源語言對的翻譯性能進步。

Cause and Effect: Can Large Language Models Truly Understand Causality?

2402.18139v3 by Swagata Ashwani, Kshiteesh Hegde, Nishith Reddy Mannuru, Mayank Jindal, Dushyant Singh Sengar, Krishna Chaitanya Rao Kathala, Dishant Banga, Vinija Jain, Aman Chadha

With the rise of Large Language Models (LLMs), it has become crucial to understand their capabilities and limitations in deciphering and explaining the complex web of causal relationships that language entails. Current methods use either explicit or implicit causal reasoning, yet there is a strong need for a unified approach combining both to tackle a wide array of causal relationships more effectively. This research proposes a novel architecture, the Context Aware Reasoning Enhancement with Counterfactual Analysis (CARE CA) framework, to enhance causal reasoning and explainability. The proposed framework incorporates an explicit causal detection module with ConceptNet and counterfactual statements, as well as implicit causal detection through LLMs. Our framework goes one step further with a layer of counterfactual explanations to accentuate LLMs' understanding of causality. The knowledge from ConceptNet enhances the performance of multiple causal reasoning tasks such as causal discovery, causal identification and counterfactual reasoning. The counterfactual sentences add explicit knowledge of "not caused by" scenarios. By combining these powerful modules, our model aims to provide a deeper understanding of causal relationships, enabling enhanced interpretability. Evaluation on benchmark datasets shows improved performance across all metrics, such as accuracy, precision, recall, and F1 scores. We also introduce CausalNet, a new dataset accompanied by our code, to facilitate further research in this domain.

摘要:隨著大型語言模型 (LLM) 的興起,了解它們在解碼和解釋語言所蘊含的複雜因果關係網路中的能力和限制變得至關重要。目前的技術使用明確或隱含的因果推理,但強烈需要一種統一的方法,結合兩者以更有效地處理廣泛的因果關係。本研究提出了一種稱為情境感知推理增強與反事實分析 (CARE CA) 框架的新架構,以增強因果推理和可解釋性。提出的框架結合了使用 ConceptNet 和反事實陳述的明確因果檢測模組,以及透過 LLM 進行的隱含因果檢測。我們的框架更進一步,加入一層反事實解釋,以強調 LLM 對因果關係的理解。來自 ConceptNet 的知識增強了多項因果推理任務的執行,例如因果發現、因果識別和反事實推理。反事實句加入了未由情境造成的明確知識。透過結合這些強大的模組,我們的模型旨在提供對因果關係更深入的理解,實現增強的可解釋性。基準資料集的評估顯示在所有指標(例如準確度、精確度、召回率和 F1 分數)上都有所提升。我們還引入了 CausalNet,一個新的資料集,並附上了我們的程式碼,以促進在這個領域的進一步研究。

Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina

2402.18600v1 by Yasin Sadeghi Bazargani, Majid Mirzaei, Navid Sobhi, Mirsaeed Abdollahi, Ali Jafarizadeh, Siamak Pedrammehr, Roohallah Alizadehsani, Ru San Tan, Sheikh Mohammed Shariful Islam, U. Rajendra Acharya

Diabetes mellitus (DM) predisposes patients to vascular complications. Retinal images and vasculature reflect the body's micro- and macrovascular health. They can be used to diagnose DM complications, including diabetic retinopathy (DR), neuropathy, nephropathy, and atherosclerotic cardiovascular disease, as well as forecast the risk of cardiovascular events. Artificial intelligence (AI)-enabled systems developed for high-throughput detection of DR using digitized retinal images have been adopted clinically. Beyond DR screening, AI integration also holds immense potential to address challenges associated with the holistic care of the patient with DM. In this work, we aim to comprehensively review the literature for studies on AI applications based on retinal images related to DM diagnosis, prognostication, and management. We will describe the findings of holistic AI-assisted diabetes care, including but not limited to DR screening, and discuss barriers to implementing such systems, including issues concerning ethics, data privacy, equitable access, and explainability. With the ability to evaluate the patient's health status with respect to DM complications, as well as to prognosticate the risk of future cardiovascular complications, AI-assisted retinal image analysis has the potential to become a central tool for modern personalized medicine in patients with DM.

摘要:糖尿病(DM)使患者容易出現血管併發症。 視網膜影像和血管反映身體的微血管和巨血管健康狀況。它們可用於診斷糖尿病併發症,包括糖尿病視網膜病變(DR)、神經病變、腎病和動脈粥樣硬化性心血管疾病,以及預測心血管事件的風險。為使用數位化視網膜影像進行高通量 DR 檢測而開發的人工智慧(AI)啟用系統已在臨床採用。除了 DR 篩檢外,AI 整合也具有巨大的潛力來應對與糖尿病患者整體照護相關的挑戰。在這項工作中,我們旨在全面回顧基於視網膜影像的 AI 應用相關研究的文獻,這些研究與糖尿病的診斷、預後和管理有關。我們將描述整體 AI 輔助糖尿病照護的發現,包括但不限於 DR 篩檢,並討論實施此類系統的障礙,包括與倫理、資料隱私、公平存取和可解釋性有關的問題。透過評估患者的健康狀況,同時考量糖尿病併發症以及未來心血管併發症的風險預後,AI 輔助視網膜影像分析有潛力成為糖尿病患者現代化個人化醫療的中心工具。

Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education

2402.15027v2 by A. J. Karran, P. Charland, J-T. Martineau, A. Ortiz de Guinea Lopez de Arana, AM. Lesage, S. Senecal, P-M. Leger

This study investigates the acceptability of different artificial intelligence (AI) applications in education from a multi-stakeholder perspective, including students, teachers, and parents. Acknowledging the transformative potential of AI in education, it addresses concerns related to data privacy, AI agency, transparency, explainability and the ethical deployment of AI. Through a vignette methodology, participants were presented with four scenarios where AI's agency, transparency, explainability, and privacy were manipulated. After each scenario, participants completed a survey that captured their perceptions of AI's global utility, individual usefulness, justice, confidence, risk, and intention to use each scenario's AI if available. The data collection, comprising a final sample of 1198 multi-stakeholder participants, was distributed through a partner institution and social media campaigns and focused on individual responses to four AI use cases. A mediation analysis of the data indicated that acceptance of and trust in AI vary significantly across stakeholder groups. We found that the key mediators between high and low levels of AI's agency, transparency, and explainability, as well as the intention to use the different educational AI, included perceived global utility, justice, and confidence. The study highlights that the acceptance of AI in education is a nuanced and multifaceted issue that requires careful consideration of specific AI applications and their characteristics, in addition to the diverse stakeholders' perceptions.

摘要:這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性,包括學生、老師和家長。承認 AI 在教育上的轉型潛力,它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法,參與者被呈現了四種情境,其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後,參與者完成了一項調查,該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用,使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本,並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明,對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現,AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者,以及使用不同教育 AI 的意圖,包括感知到的整體效用、正義和信心。這項研究強調,接受 AI 在教育上的應用是一個微妙且多面向的問題,除了不同的利害關係人的看法外,還需要仔細考慮具體的 AI 應用及其特徵。

Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals

2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer

Remote patient monitoring based on wearable single-lead electrocardiogram (ECG) devices has significant potential for enabling the early detection of heart disease, especially in combination with artificial intelligence (AI) approaches for automated heart disease detection. There have been prior studies applying AI approaches based on deep learning for heart disease detection. However, these models are yet to be widely accepted as a reliable aid for clinical diagnostics, in part due to the current black-box perception surrounding many AI algorithms. In particular, there is a need to identify the key features of the ECG signal that contribute toward making an accurate diagnosis, thereby enhancing the interpretability of the model. In the present study, we develop a vision transformer approach to identify atrial fibrillation based on single-lead ECG data. A residual network (ResNet) approach is also developed for comparison with the vision transformer approach. These models are applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm heartbeats. The models enable the identification of the key regions of the heartbeat that determine the resulting classification, and highlight the importance of P-waves and T-waves, as well as heartbeat duration and signal amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and sinus bradycardia.

摘要:基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力,特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而,這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具,部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是,有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵,從而增強模型的可解釋性。在本研究中,我們開發了一種視覺轉換器方法,以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來,以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集,以分類心房顫動,以及另一種常見的心律不整,竇性心動過緩,和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域,並強調 P 波和 T 波,以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。

Illuminate: A novel approach for depression detection with explainable analysis and proactive therapy using prompt engineering

2402.05127v1 by Aryan Agrawal

This paper introduces a novel paradigm for depression detection and treatment using advanced Large Language Models (LLMs): Generative Pre-trained Transformer 4 (GPT-4), Llama 2 chat, and Gemini. These LLMs are fine-tuned with specialized prompts to diagnose, explain, and suggest therapeutic interventions for depression. A unique few-shot prompting method enhances the models' ability to analyze and explain depressive symptoms based on the DSM-5 criteria. In the interaction phase, the models engage in empathetic dialogue management, drawing from resources like PsychDB and a Cognitive Behavioral Therapy (CBT) Guide, fostering supportive interactions with individuals experiencing major depressive disorders. Additionally, the research introduces the Illuminate Database, enriched with various CBT modules, aiding in personalized therapy recommendations. The study evaluates LLM performance using metrics such as F1 scores, Precision, Recall, Cosine similarity, and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) across different test sets, demonstrating their effectiveness. This comprehensive approach blends cutting-edge AI with established psychological methods, offering new possibilities in mental health care and showcasing the potential of LLMs in revolutionizing depression diagnosis and treatment strategies.

摘要:本文介紹了一種使用先進大型語言模型 (LLM) 進行憂鬱症偵測和治療的新模式:生成式預訓練Transformer 4 (GPT-4)、Llama 2 聊天機器人和 Gemini。這些 LLM 經過微調,具備專業提示,可診斷、解釋並建議憂鬱症的治療介入方法。一種獨特的少次提示方法增強了模型根據 DSM-5 標準分析和解釋憂鬱症狀的能力。在互動階段,這些模型會參與同理心對話管理,從 PsychDB 和認知行為療法 (CBT) 指南等資源中汲取,促進與經歷重度憂鬱症的人們的支持性互動。此外,這項研究還介紹了 Illuminate 資料庫,其中包含各種 CBT 模組,有助於個性化治療建議。這項研究使用 F1 分數、準確率、召回率、餘弦相似度和面向召回率的 Gisting 評估替身 (ROUGE) 等指標,在不同的測試集中評估 LLM 的表現,證明了它們的有效性。這種綜合方法結合了尖端的 AI 與既定的心理方法,為心理保健提供了新的可能性,並展示了 LLM 在革新憂鬱症診斷和治療策略方面的潛力。

Information That Matters: Exploring Information Needs of People Affected by Algorithmic Decisions

2401.13324v6 by Timothée Schmude, Laura Koesten, Torsten Möller, Sebastian Tschiatschek

Every AI system that makes decisions about people has a group of stakeholders that are personally affected by these decisions. However, explanations of AI systems rarely address the information needs of this stakeholder group, who often are AI novices. This creates a gap between conveyed information and information that matters to those who are impacted by the system's decisions, such as domain experts and decision subjects. To address this, we present the "XAI Novice Question Bank," an extension of the XAI Question Bank containing a catalog of information needs from AI novices in two use cases: employment prediction and health monitoring. The catalog covers the categories of data, system context, system usage, and system specifications. We gathered information needs through task-based interviews where participants asked questions about two AI systems to decide on their adoption and received verbal explanations in response. Our analysis showed that participants' confidence increased after receiving explanations but that their understanding faced challenges. These included difficulties in locating information and in assessing their own understanding, as well as attempts to outsource understanding. Additionally, participants' prior perceptions of the systems' risks and benefits influenced their information needs. Participants who perceived high risks sought explanations about the intentions behind a system's deployment, while those who perceived low risks rather asked about the system's operation. Our work aims to support the inclusion of AI novices in explainability efforts by highlighting their information needs, aims, and challenges. We summarize our findings as five key implications that can inform the design of future explanations for lay stakeholder audiences.

摘要:每個對人做出決定的 AI 系統都有一群利害關係人 受到這些決定的親身影響。然而,AI 系統的解釋很少能滿足這群利害關係人的資訊需求,而他們 通常都是 AI 新手。這造成了傳達資訊與 受到系統決策影響的人士(例如領域專家和決策主體)重視的資訊之間的落差。為了解決這個問題,我們提出了 「XAI 新手問題庫」,它是 XAI 問題庫的延伸,包含來自 AI 新手在兩個使用案例中的資訊需求目錄:就業 預測和健康監測。目錄涵蓋了資料、 系統背景、系統使用和系統規格等類別。我們透過任務型訪談收集資訊需求,參與者在訪談中詢問了兩個 AI 系統的問題,以決定是否採用它們,並收到口頭 解釋作為回應。我們的分析顯示,參與者在收到解釋後信心有所提升,但他們的理解卻面臨挑戰。這些挑戰包括難以找到資訊和評估自己的理解,以及試圖外包 理解。此外,參與者對系統風險和好處的先前回饋影響了他們的資訊需求。認為風險高的參與者尋求解釋系統部署背後的意圖,而認為風險低的人則詢問系統的 操作。我們的研究旨在透過強調 AI 新手的資訊需求、目標和 挑戰,來支持將 AI 新手納入可解釋性工作中。我們將我們的研究結果總結為五個關鍵啟示,這些啟示可以為未來針對非專業利害關係人受眾的解釋設計提供參考。

Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education

2401.02985v1 by Vahid Ashrafimoghari, Necdet Gürkan, Jordan W. Suchow

The rapid evolution of artificial intelligence (AI), especially in the domain of Large Language Models (LLMs) and generative AI, has opened new avenues for application across various fields, yet its role in business education remains underexplored. This study introduces the first benchmark to assess the performance of seven major LLMs, OpenAI's models (GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo), Google's models (PaLM 2, Gemini 1.0 Pro), and Anthropic's models (Claude 2 and Claude 2.1), on the GMAT, which is a key exam in the admission process for graduate business programs. Our analysis shows that most LLMs outperform human candidates, with GPT-4 Turbo not only outperforming the other models but also surpassing the average scores of graduate students at top business schools. Through a case study, this research examines GPT-4 Turbo's ability to explain answers, evaluate responses, identify errors, tailor instructions, and generate alternative scenarios. The latest LLM versions, GPT-4 Turbo, Claude 2.1, and Gemini 1.0 Pro, show marked improvements in reasoning tasks compared to their predecessors, underscoring their potential for complex problem-solving. While AI's promise in education, assessment, and tutoring is clear, challenges remain. Our study not only sheds light on LLMs' academic potential but also emphasizes the need for careful development and application of AI in education. As AI technology advances, it is imperative to establish frameworks and protocols for AI interaction, verify the accuracy of AI-generated content, ensure worldwide access for diverse learners, and create an educational environment where AI supports human expertise. This research sets the stage for further exploration into the responsible use of AI to enrich educational experiences and improve exam preparation and assessment methods.

摘要:人工智慧 (AI) 的快速演進,尤其是在大型語言模型 (LLM) 和生成式 AI 的領域,為各個領域的應用開啟了新途徑,但其在商業教育中的角色仍未被充分探討。本研究首次引入了基準,用以評估七個主要 LLM 的效能,包括 OpenAI 的模型 (GPT-3.5 Turbo、GPT-4 和 GPT-4 Turbo)、Google 的模型 (PaLM 2、Gemini 1.0 Pro) 和 Anthropic 的模型 (Claude 2 和 Claude 2.1),這些模型將用於研究生商業課程入學程序中的關鍵考試 GMAT。我們的分析顯示,大多數 LLM 的表現都優於人類考生,其中 GPT-4 Turbo 不僅優於其他模型,更超越了頂尖商學院的研究生平均分數。透過案例研究,本研究探討了 GPT-4 Turbo 在解釋答案、評估回應、辨識錯誤、調整說明和產生替代情境方面的能力。與前一代版本相比,最新的 LLM 版本 GPT-4 Turbo、Claude 2.1 和 Gemini 1.0 Pro 在推理任務方面有顯著的進步,凸顯了其在解決複雜問題方面的潛力。儘管 AI 在教育、評量和輔導方面的承諾很明確,但仍有挑戰存在。我們的研究不僅闡明了 LLM 的學術潛力,也強調了在教育中審慎開發和應用 AI 的必要性。隨著 AI 技術的進步,建立 AI 互動的架構和協定、驗證 AI 生成的內容的準確性、確保全球各地多元學習者的存取權,以及創造一個 AI 支持人類專業知識的教育環境至關重要。本研究為進一步探索負責任地使用 AI 來豐富教育體驗並改善考試準備和評量方法奠定了基礎。

XAI for In-hospital Mortality Prediction via Multimodal ICU Data

2312.17624v1 by Xingqiao Li, Jindong Gu, Zhiyong Wang, Yancheng Yuan, Bo Du, Fengxiang He

Predicting in-hospital mortality for intensive care unit (ICU) patients is key to final clinical outcomes. AI has shown superior accuracy but suffers from a lack of explainability. To address this issue, this paper proposes an eXplainable Multimodal Mortality Predictor (X-MMP) as an efficient, explainable AI solution for predicting in-hospital mortality via multimodal ICU data. We employ multimodal learning in our framework, which can receive heterogeneous inputs from clinical data and make decisions. Furthermore, we introduce an explainable method, namely Layer-Wise Propagation to Transformer, as a proper extension of the LRP method to Transformers, producing explanations over multimodal inputs and revealing the salient features attributed to prediction. Moreover, the contribution of each modality to clinical outcomes can be visualized, assisting clinicians in understanding the reasoning behind decision-making. We construct a multimodal dataset based on MIMIC-III and the MIMIC-III Waveform Database Matched Subset. Comprehensive experiments on benchmark datasets demonstrate that our proposed framework can achieve reasonable interpretation with competitive prediction accuracy. In particular, our framework can be easily transferred to other clinical tasks, which facilitates the discovery of crucial factors in healthcare research.

摘要:預測加護病房 (ICU) 病患的院內死亡率是最終臨床結果的關鍵。AI 已展現出優異的準確度,但卻缺乏可解釋性。為了解決這個問題,本文提出了一個可解釋的多模式死亡率預測器 (X-MMP),採用有效且可解釋的 AI 方式,藉由多模式 ICU 資料來預測院內死亡率。我們在架構中採用多模式學習,可以接收來自臨床資料的異質輸入並做出決策。此外,我們引入了一個可解釋的方法,也就是分層傳播至 Transformer,作為 LRP 方法適當地延伸至 Transformer,對多模式輸入產生解釋,並揭露歸因於預測的顯著特徵。此外,每個模式對臨床結果的貢獻可以視覺化,協助臨床醫師了解決策背後的理由。我們根據 MIMIC-III 和 MIMIC-III 波形資料庫比對子集建構了一個多模式資料集。在基準資料集上的全面實驗證明,我們提出的架構可以達成合理的詮釋,並具備競爭力的預測準確度。特別是,我們的架構可以輕鬆地轉移到其他臨床任務,這有助於在醫療保健研究中發現關鍵因素。
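
The abstract above builds on layer-wise relevance propagation (LRP). For orientation only, the sketch below shows the standard LRP-epsilon rule for a single dense layer in plain Python. It is not the paper's Transformer extension, and the function names, weights, and inputs are assumptions made for illustration. Relevance arriving at each output unit is redistributed to the inputs in proportion to their contribution `a_j * w_jk` to the pre-activation `z_k`.

```python
def dense_forward(a, w, b):
    """Pre-activations z_k = sum_j a_j * w[j][k] + b_k of one dense layer."""
    return [sum(a[j] * w[j][k] for j in range(len(a))) + b[k]
            for k in range(len(b))]


def lrp_epsilon(a, w, b, relevance_out, eps=1e-6):
    """Redistribute output relevance to the inputs with the LRP-epsilon rule."""
    z = dense_forward(a, w, b)
    # The stabiliser eps takes the sign of z_k so the denominator never
    # crosses zero.
    s = [relevance_out[k] / (z[k] + (eps if z[k] >= 0 else -eps))
         for k in range(len(z))]
    return [a[j] * sum(w[j][k] * s[k] for k in range(len(z)))
            for j in range(len(a))]


# Toy layer with two inputs and two outputs (values made up).
activations = [1.0, 2.0]
weights = [[1.0, -1.0],   # weights[j][k]: input j -> output k
           [0.5, 1.0]]
biases = [0.0, 0.0]

# Start the backward pass from the layer outputs themselves.
relevance_out = dense_forward(activations, weights, biases)
relevance_in = lrp_epsilon(activations, weights, biases, relevance_out)
print(relevance_out, relevance_in)
```

With zero biases and a small `eps`, the total relevance is approximately conserved across the layer, which is the property that lets per-feature or per-modality scores be read as shares of the prediction.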

Joining Forces for Pathology Diagnostics with AI Assistance: The EMPAIA Initiative

2401.09450v2 by Norman Zerbe, Lars Ole Schwen, Christian Geißler, Katja Wiesemann, Tom Bisson, Peter Boor, Rita Carvalho, Michael Franz, Christoph Jansen, Tim-Rasmus Kiehl, Björn Lindequist, Nora Charlotte Pohlan, Sarah Schmell, Klaus Strohmenger, Falk Zakrzewski, Markus Plass, Michael Takla, Tobias Küster, André Homeyer, Peter Hufnagl

Over the past decade, artificial intelligence (AI) methods in pathology have advanced substantially. However, integration into routine clinical practice has been slow due to numerous challenges, including technical and regulatory hurdles in translating research results into clinical diagnostic products and the lack of standardized interfaces. The open and vendor-neutral EMPAIA initiative addresses these challenges. Here, we provide an overview of EMPAIA's achievements and lessons learned. EMPAIA integrates various stakeholders of the pathology AI ecosystem, i.e., pathologists, computer scientists, and industry. In close collaboration, we developed technical interoperability standards, recommendations for AI testing and product development, and explainability methods. We implemented the modular and open-source EMPAIA platform and successfully integrated 14 AI-based image analysis apps from 8 different vendors, demonstrating how different apps can use a single standardized interface. We prioritized requirements and evaluated the use of AI in real clinical settings with 14 different pathology laboratories in Europe and Asia. In addition to technical developments, we created a forum for all stakeholders to share information and experiences on digital pathology and AI. Commercial, clinical, and academic stakeholders can now adopt EMPAIA's common open-source interfaces, providing a unique opportunity for large-scale standardization and streamlining of processes. Further efforts are needed to effectively and broadly establish AI assistance in routine laboratory use. To this end, a sustainable infrastructure, the non-profit association EMPAIA International, has been established to continue standardization and support broad implementation and advocacy for an AI-assisted digital pathology future.

摘要:在過去的十年中,病理學中的人工智慧 (AI) 方法已大幅進步。然而,由於許多挑戰,包括將研究結果轉化為臨床診斷產品在技術和法規方面的障礙,以及缺乏標準化介面,導致整合到常規臨床實務中進展緩慢。開放且與供應商無關的 EMPAIA 計畫應對了這些挑戰。在此,我們提供 EMPAIA 的成就和經驗教訓的概述。EMPAIA 整合了病理學 AI 生態系統的各個利害關係人,即病理學家、電腦科學家和產業。在密切合作下,我們制定了技術互通性標準、AI 測試和產品開發建議,以及可解釋性方法。我們實作了模組化且開放原始碼的 EMPAIA 平臺,並成功整合了來自 8 個不同供應商的 14 個基於 AI 的影像分析應用程式,展示了不同的應用程式如何使用單一的標準化介面。我們優先考慮需求,並評估了 AI 在歐洲和亞洲的 14 個不同病理實驗室中的實際臨床應用。除了技術開發外,我們還為所有利害關係人建立了一個論壇,以分享數位病理學和 AI 的資訊和經驗。商業、臨床和學術利害關係人現在可以採用 EMPAIA 的常見開放原始碼介面,這為大規模標準化和簡化流程提供了獨特的機會。需要進一步的努力才能有效且廣泛地建立例行實驗室使用中的 AI 輔助。為此,已成立非營利協會 EMPAIA International,以作為永續基礎架構,繼續進行標準化,並支援廣泛實作和倡導 AI 輔助數位病理學的未來。

Robust Stochastic Graph Generator for Counterfactual Explanations

2312.11747v2 by Mario Alfonso Prado-Romero, Bardh Prenkaj, Giovanni Stilo

Counterfactual Explanation (CE) techniques have garnered attention as a means to provide insights to the users engaging with AI systems. While extensively researched in domains such as medical imaging and autonomous vehicles, Graph Counterfactual Explanation (GCE) methods have been comparatively under-explored. GCEs generate a new graph similar to the original one, with a different outcome grounded on the underlying predictive model. Among these GCE techniques, those rooted in generative mechanisms have received relatively limited investigation despite demonstrating impressive accomplishments in other domains, such as artistic styles and natural language modelling. The preference for generative explainers stems from their capacity to generate counterfactual instances during inference, leveraging autonomously acquired perturbations of the input graph. Motivated by the rationales above, our study introduces RSGG-CE, a novel Robust Stochastic Graph Generator for Counterfactual Explanations able to produce counterfactual examples from the learned latent space considering a partially ordered generation sequence. Furthermore, we undertake quantitative and qualitative analyses to compare RSGG-CE's performance against SoA generative explainers, highlighting its increased ability to engender plausible counterfactual candidates.

摘要:反事實解釋 (CE) 技術已引起關注,作為一種為與 AI 系統互動的使用者提供見解的方法。雖然在醫學影像和自動駕駛汽車等領域廣泛研究,圖形反事實解釋 (GCE) 方法相對較少被探索。GCE 會產生一個類似於原始圖形的新圖形,並根據基礎預測模型產生不同的結果。在這些 GCE 技術中,儘管在其他領域(例如藝術風格和自然語言建模)中展現出令人印象深刻的成就,但植基於生成機制的技術獲得的關注相對有限。對生成式解釋器的偏好源於它們在推理期間產生反事實實例的能力,利用輸入圖形的自主獲取擾動。基於上述理由,我們的研究引入了 RSGG-CE,一種用於反事實解釋的新型穩健隨機圖形生成器,能夠從學習到的潛在空間中產生反事實範例,考慮部分有序的生成序列。此外,我們進行定量和定性分析,以比較 RSGG-CE 的效能與 SoA 生成式解釋器,強調其增強了產生合理解釋候選的能力。

Evaluating the Utility of Model Explanations for Model Development

2312.06032v1 by Shawn Im, Jacob Andreas, Yilun Zhou

One of the motivations for explainable AI is to allow humans to make better and more informed decisions regarding the use and deployment of AI models. But careful evaluations are needed to assess whether this expectation has been fulfilled. Current evaluations mainly focus on algorithmic properties of explanations, and those that involve human subjects often employ subjective questions to test humans' perception of explanation usefulness, without being grounded in objective metrics and measurements. In this work, we evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development. We conduct a mixed-methods user study involving image data to evaluate saliency maps generated by SmoothGrad, GradCAM, and an oracle explanation on two tasks: model selection and counterfactual simulation. To our surprise, we did not find evidence of significant improvement on these tasks when users were provided with any of the saliency maps, even the synthetic oracle explanation designed to be simple to understand and highly indicative of the answer. Nonetheless, explanations did help users more accurately describe the models. These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.

摘要:可解釋 AI 的動機之一是讓人們在使用和部署 AI 模型時做出更好、更明智的決策。但需要仔細評估以評估是否已達到此預期。目前的評估主要集中在解釋的演算法特性,而涉及人類受試者的評估通常採用主觀問題來測試人類對解釋有用性的看法,而沒有基於客觀指標和測量。在這項工作中,我們評估解釋是否可以在機器學習模型開發的實際場景中改善人類決策制定。我們進行了一項涉及影像資料的混合方法使用者研究,以評估 SmoothGrad、GradCAM 和預言解釋在兩個任務中產生的顯著性圖:模型選擇和反事實模擬。令人驚訝的是,我們沒有發現任何顯著性圖(即使是設計為易於理解且高度指示答案的合成預言解釋)能讓使用者在這些任務上顯著改善的證據。儘管如此,解釋確實有助於使用者更準確地描述模型。這些發現提示我們要對基於顯著性的解釋中可能存在誤解的有用性保持謹慎。
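For readers unfamiliar with SmoothGrad, one of the saliency methods named in the abstract above, the following is a minimal sketch of the general idea: average the input gradient over several noise-perturbed copies of the input. The toy model, the `grad_fn` callback, and the parameter names (`n_samples`, `noise_frac`) are assumptions made for illustration and are not taken from the paper or its study setup.

```python
import numpy as np

def smoothgrad(grad_fn, x, n_samples=50, noise_frac=0.15, seed=0):
    """Average grad_fn over noisy copies of x (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Noise scale is set relative to the value range of the input.
    sigma = noise_frac * (x.max() - x.min())
    total = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        total += grad_fn(noisy)
    return total / n_samples

# Toy model (assumption): f(x) = sum(w * x**2), so df/dx = 2 * w * x.
w = np.array([1.0, 0.0, -2.0])

def toy_grad(x):
    return 2.0 * w * x

x = np.array([0.0, 1.0, 2.0])
saliency = smoothgrad(toy_grad, x)
print(saliency)
```

In a real setting `grad_fn` would come from automatic differentiation of an image classifier; the sketch only shows the averaging step.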

Building Trustworthy NeuroSymbolic AI Systems: Consistency, Reliability, Explainability, and Safety

2312.06798v1 by Manas Gaur, Amit Sheth

Explainability and Safety engender Trust. These require a model to exhibit consistency and reliability. To achieve these, it is necessary to use and analyze data and knowledge with statistical and symbolic AI methods relevant to the AI application - neither alone will do. Consequently, we argue and seek to demonstrate that the NeuroSymbolic AI approach is better suited for making AI a trusted AI system. We present the CREST framework that shows how Consistency, Reliability, user-level Explainability, and Safety are built on NeuroSymbolic methods that use data and knowledge to support requirements for critical applications such as health and well-being. This article focuses on Large Language Models (LLMs) as the chosen AI system within the CREST framework. LLMs have garnered substantial attention from researchers due to their versatility in handling a broad array of natural language processing (NLP) scenarios. For example, ChatGPT and Google's MedPaLM have emerged as highly promising platforms for providing information in general and health-related queries, respectively. Nevertheless, these models remain black boxes despite incorporating human feedback and instruction-guided tuning. For instance, ChatGPT can generate unsafe responses despite instituting safety guardrails. CREST presents a plausible approach harnessing procedural and graph-based knowledge within a NeuroSymbolic framework to shed light on the challenges associated with LLMs.

摘要:可解釋性和安全性建立信任。這些需要一個模型來展示一致性和可靠性。為了實現這些,有必要使用和分析數據和知識,並使用與 AI 應用相關的統計和符號 AI 方法 - 單獨使用任何一種方法都不會奏效。因此,我們主張並試圖證明 NeuroSymbolic AI 方法更適合於使 AI 成為受信任的 AI 系統。我們提出了 CREST 框架,展示了一致性、可靠性、使用者層級的可解釋性和安全性是如何建立在 NeuroSymbolic 方法上的,該方法使用數據和知識來支持關鍵應用(例如健康和福祉)的要求。本文重點關注大型語言模型 (LLM),因為它是 CREST 框架中選擇的 AI 系統。LLM 因其在處理廣泛的自然語言處理 (NLP) 場景方面的多功能性而備受研究人員的關注。例如,ChatGPT 和 Google 的 MedPaLM 已成為提供一般和健康相關查詢信息的極有希望的平台。儘管如此,這些模型仍然是黑盒子,儘管納入了人類反饋和指令引導的調整。例如,儘管制定了安全防護措施,ChatGPT 仍可能產生不安全的回應。CREST 提出了一種合理的方法,在 NeuroSymbolic 框架中利用程序和基於圖表的知識,以闡明與 LLM 相關的挑戰。

Class-Discriminative Attention Maps for Vision Transformers

2312.02364v3 by Lennart Brocki, Jakub Binda, Neo Christopher Chung

Importance estimators are explainability methods that quantify feature importance for deep neural networks (DNN). In vision transformers (ViT), the self-attention mechanism naturally leads to attention maps, which are sometimes interpreted as importance scores that indicate which input features ViT models are focusing on. However, attention maps do not account for signals from downstream tasks. To generate explanations that are sensitive to downstream tasks, we have developed class-discriminative attention maps (CDAM), a gradient-based extension that estimates feature importance with respect to a known class or a latent concept. CDAM scales attention scores by how relevant the corresponding tokens are for the predictions of a classifier head. In addition to targeting the supervised classifier, CDAM can explain an arbitrary concept shared by selected samples by measuring similarity in the latent space of ViT. Additionally, we introduce Smooth CDAM and Integrated CDAM, which average a series of CDAMs with slightly altered tokens. Our quantitative benchmarks include correctness, compactness, and class sensitivity, in comparison to 7 other importance estimators. Vanilla, Smooth, and Integrated CDAM excel across all three benchmarks. In particular, our results suggest that existing importance estimators may not provide sufficient class-sensitivity. We demonstrate the utility of CDAM in medical images by training and explaining malignancy and biomarker prediction models based on lung Computed Tomography (CT) scans. Overall, CDAM is shown to be highly class-discriminative and semantically relevant, while providing compact explanations.

摘要:重要性估計器是一種可解釋性方法,用於量化深度神經網路 (DNN) 的特徵重要性。在視覺Transformer (ViT) 中,自我注意機制自然會導致注意力圖,有時會將其解釋為重要性分數,表示 ViT 模型關注哪些輸入特徵。然而,注意力圖並未考慮來自下游任務的信號。為了產生對下游任務敏感的解釋,我們開發了類別區分注意力圖 (CDAM),這是一種基於梯度的擴充,用於估計相對於已知類別或潛在概念的特徵重要性。CDAM 根據對應的符號與分類器頭的預測相關程度,調整注意力分數。除了針對監督分類器外,CDAM 還可以通過測量 ViT 的潛在空間中的相似性來解釋選定樣本共有的任意概念。此外,我們引入了平滑 CDAM 和積分 CDAM,它們對一系列具有略微改變的符號的 CDAM 進行平均。我們的量化基準包括正確性、緊湊性和類別敏感性,與其他 7 個重要性估計器相比。香草、平滑和積分 CDAM 在所有三個基準中表現出色。特別是,我們的結果表明現有的重要性估計器可能無法提供足夠的類別敏感性。我們通過基於肺部電腦斷層掃描 (CT) 掃描訓練和解釋惡性腫瘤和生物標記預測模型,證明了 CDAM 在醫學影像中的效用。總的來說,CDAM 被證明具有高度類別區分性和語義相關性,同時提供簡潔的解釋。
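The idea of scaling attention scores by token relevance to a classifier head can be illustrated with a deliberately simplified toy model. This sketch assumes a single attention-weighted pooling step followed by a linear head, in which case the gradient of a class logit with respect to each attention weight has a closed form. It is not the authors' CDAM formulation, and all names (`class_scaled_attention`, `values`, `head_weights`) are hypothetical.

```python
import numpy as np

def class_scaled_attention(attn, values, head_weights, target_class):
    """Scale each token's attention by its relevance to one class.

    Toy assumption: logit_c = head_weights[c] . sum_i attn[i] * values[i],
    so d(logit_c)/d(attn[i]) = head_weights[c] . values[i].
    """
    relevance = values @ head_weights[target_class]  # one score per token
    return attn * relevance

attn = np.array([0.5, 0.3, 0.2])            # attention over 3 tokens
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])             # token representations
head = np.array([[1.0, -1.0],
                 [-1.0, 1.0]])              # linear head, 2 classes

scores_class0 = class_scaled_attention(attn, values, head, 0)
scores_class1 = class_scaled_attention(attn, values, head, 1)
print(scores_class0, scores_class1)
```

Unlike the raw attention vector, the two score vectors differ by class, and in this toy setup each one sums to the corresponding class logit.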

Deployment of a Robust and Explainable Mortality Prediction Model: The COVID-19 Pandemic and Beyond

2311.17133v1 by Jacob R. Epifano, Stephen Glass, Ravi P. Ramachandran, Sharad Patel, Aaron J. Masino, Ghulam Rasool

This study investigated the performance, explainability, and robustness of deployed artificial intelligence (AI) models in predicting mortality during the COVID-19 pandemic and beyond. In this study, the first of its kind, we found that Bayesian Neural Networks (BNNs) and intelligent training techniques allowed our models to maintain performance amidst significant data shifts. Our results emphasize the importance of developing robust AI models capable of matching or surpassing clinician predictions, even under challenging conditions. Our exploration of model explainability revealed that stochastic models generate more diverse and personalized explanations, thereby highlighting the need for AI models that provide detailed and individualized insights in real-world clinical settings. Furthermore, we underscored the importance of quantifying uncertainty in AI models, which enables clinicians to make better-informed decisions based on reliable predictions. Our study advocates for prioritizing implementation science in AI research for healthcare and ensuring that AI solutions are practical, beneficial, and sustainable in real-world clinical environments. By addressing unique challenges and complexities in healthcare settings, researchers can develop AI models that effectively improve clinical practice and patient outcomes.

摘要:本研究调查了在 COVID-19 疫情期间及以后预测死亡率时,已部署人工智能 (AI) 模型的性能、可解释性和稳健性。作为同类研究中的首例,我们发现贝叶斯神经网络 (BNN) 和智能训练技术让我们的模型在数据发生重大变化时仍能保持性能。我们的结果强调了开发稳健的 AI 模型的重要性,即使在具有挑战性的条件下,这些模型也能匹配或超越临床医生的预测。我们对模型可解释性的探索表明,随机模型会产生更多样化且个性化的解释,从而突出了在现实世界的临床环境中提供详细且个性化见解的 AI 模型的必要性。此外,我们强调了量化 AI 模型中不确定性的重要性,这使临床医生能够根据可靠的预测做出更明智的决策。我们的研究提倡在医疗保健的 AI 研究中优先考虑实施科学,并确保 AI 解决方案在现实世界的临床环境中实用、有益且可持续。通过解决医疗保健环境中的独特挑战和复杂性,研究人员可以开发出有效改善临床实践和患者预后的 AI 模型。

Variational Autoencoders for Feature Exploration and Malignancy Prediction of Lung Lesions

2311.15719v1 by Benjamin Keel, Aaron Quyn, David Jayne, Samuel D. Relton

Lung cancer is responsible for 21% of cancer deaths in the UK and five-year survival rates are heavily influenced by the stage the cancer was identified at. Recent studies have demonstrated the capability of AI methods for accurate and early diagnosis of lung cancer from routine scans. However, this evidence has not translated into clinical practice, with one barrier being a lack of interpretable models. This study investigates the application of Variational Autoencoders (VAEs), a type of generative AI model, to lung cancer lesions. Proposed models were trained on lesions extracted from 3D CT scans in the LIDC-IDRI public dataset. Latent vector representations of 2D slices produced by the VAEs were explored through clustering to justify their quality and used in an MLP classifier model for lung cancer diagnosis; the best model achieved state-of-the-art metrics of AUC 0.98 and 93.1% accuracy. Cluster analysis shows the VAE latent space separates the dataset of malignant and benign lesions based on meaningful feature components including tumour size, shape, patient and malignancy class. We also include a comparative analysis of the standard Gaussian VAE (GVAE) and the more recent Dirichlet VAE (DirVAE), which replaces the prior with a Dirichlet distribution to encourage a more explainable latent space with disentangled feature representation. Finally, we demonstrate the potential for latent space traversals corresponding to clinically meaningful feature changes.

摘要:肺癌占英國癌症死亡人數的 21%,五年存活率很大程度取決於癌症被發現的階段。最近的研究已證明人工智能方法具有從例行掃描中準確及早診斷肺癌的能力。然而,此證據尚未轉化為臨床實務,其中一個障礙是缺乏可解釋的模型。本研究探討了應用變分自動編碼器 (VAE),一種生成式人工智能模型,於肺癌病灶。將提出的模型訓練於從 LIDC-IDRI 公共數據集中提取的 3D 電腦斷層掃描病灶。通過聚類探索了 VAE 生成的 2D 切片的潛在向量表示,以證明其品質,並用於肺癌診斷的 MLP 分類器模型,最佳模型達到了 AUC 0.98 和 93.1% 準確度的最先進指標。聚類分析顯示,VAE 潛在空間根據有意義的特徵組成(包括腫瘤大小、形狀、患者和惡性類別)將惡性和良性病灶的數據集分開。我們還包括標準高斯 VAE (GVAE) 和更新的狄利克雷 VAE (DirVAE) 的比較分析,後者用狄利克雷分佈取代先驗,以促進具有解開特徵表示的更具可解釋性的潛在空間。最後,我們展示了與臨床有意義的特徵變化相應的潛在空間橫越的潛力。

MRxaI: Black-Box Explainability for Image Classifiers in a Medical Setting

2311.14471v1 by Nathan Blake, Hana Chockler, David A. Kelly, Santiago Calderon Pena, Akchunya Chanchal

Existing tools for explaining the output of image classifiers can be divided into white-box, which rely on access to the model internals, and black-box, agnostic to the model. As the usage of AI in the medical domain grows, so too does the usage of explainability tools. Existing work on medical image explanations focuses on white-box tools, such as gradcam. However, there are clear advantages to switching to a black-box tool, including the ability to use it with any classifier and the wide selection of black-box tools available. On standard images, black-box tools are as precise as white-box. In this paper we compare the performance of several black-box methods against gradcam on a brain cancer MRI dataset. We demonstrate that most black-box tools are not suitable for explaining medical image classifications and present a detailed analysis of the reasons for their shortcomings. We also show that one black-box tool, a causal explainability-based rex, performs as well as gradcam.

摘要:現有的圖像分類器輸出解釋工具可分為依賴於模型內部存取權限的白盒,以及與模型無關的黑盒。隨著 AI 在醫療領域的使用增加,可解釋性工具的使用也隨之增加。現有醫學影像解釋的工作重點在於白盒工具,例如 gradcam。然而,切換到黑盒工具有明顯的優點,包括能夠與任何分類器一起使用,以及廣泛的黑盒工具可供選擇。在標準影像上,黑盒工具與白盒一樣精確。在本文中,我們比較了多種黑盒方法在腦癌 MRI 資料集上與 gradcam 的效能。我們證明大多數黑盒工具不適合解釋醫學影像分類,並詳細分析其缺點的原因。我們還表明一種黑盒工具,基於因果可解釋性的 rex,表現與 \gradcam 一樣好。

Moderating Model Marketplaces: Platform Governance Puzzles for AI Intermediaries

2311.12573v3 by Robert Gorwa, Michael Veale

The AI development community is increasingly making use of hosting intermediaries such as Hugging Face, which provide easy access to user-uploaded models and training data. These model marketplaces lower technical deployment barriers for hundreds of thousands of users, yet can be used in numerous potentially harmful and illegal ways. In this article, we explain ways in which AI systems, which can both `contain' content and be open-ended tools, present one of the trickiest platform governance challenges seen to date. We provide case studies of several incidents across three illustrative platforms -- Hugging Face, GitHub and Civitai -- to examine how model marketplaces moderate models. Building on this analysis, we outline important (and yet nevertheless limited) practices that industry has been developing to respond to moderation demands: licensing, access and use restrictions, automated content moderation, and open policy development. While the policy challenge at hand is a considerable one, we conclude with some ideas as to how platforms could better mobilize resources to act as a careful, fair, and proportionate regulatory access point.

摘要:AI 開發社群日益利用 Hugging Face 等託管中介機構提供用戶上傳的模型和訓練資料的簡易存取權限。這些模型市集降低了數十萬名用戶的技術部署障礙,但可能會被用於許多潛在有害和非法的方式。在本文中,我們說明 AI 系統既可以「包含」內容,又可以作為開放式工具,這提出了迄今為止最棘手的平台治理挑戰之一。我們提供 Hugging Face、GitHub 和 Civitai 等三個說明性平台上數起事件的案例研究,以檢視模型市集如何審核模型。根據此分析,我們概述產業為回應審核需求而開發的重要(但仍有限)實務:授權、存取和使用限制、自動化內容審核和開放政策制定。雖然當前政策挑戰相當可觀,我們最後提出一些構想,說明平台如何能更好地動員資源,作為謹慎、公平且適度的法規存取點。

Ovarian Cancer Data Analysis using Deep Learning: A Systematic Review from the Perspectives of Key Features of Data Analysis and AI Assurance

2311.11932v1 by Muta Tah Hira, Mohammad A. Razzaque, Mosharraf Sarker

Background and objectives: By extracting this information, Machine or Deep Learning (ML/DL)-based autonomous data analysis tools can assist clinicians and cancer researchers in discovering patterns and relationships from complex data sets. Many DL-based analyses on ovarian cancer (OC) data have recently been published. These analyses are highly diverse in various aspects of cancer (e.g., subdomain(s) and cancer type they address) and data analysis features. However, a comprehensive understanding of these analyses in terms of these features and AI assurance (AIA) is currently lacking. This systematic review aims to fill this gap by examining the existing literature and identifying important aspects of OC data analysis using DL, explicitly focusing on the key features and AI assurance perspectives.

Methods: The PRISMA framework was used to conduct comprehensive searches in three journal databases. Only studies published between 2015 and 2023 in peer-reviewed journals were included in the analysis.

Results: In the review, a total of 96 DL-driven analyses were examined. The findings reveal several important insights regarding DL-driven ovarian cancer data analysis:
- Most studies (71%, 68/96) focused on detection and diagnosis, while no study addressed the prediction and prevention of OC.
- The analyses were predominantly based on samples from a non-diverse population (75%, 72/96 studies), limited to a geographic location or country.
- Only a small proportion of studies (33%, 32/96) performed integrated analyses, most of which used homogeneous data (clinical or omics).
- Notably, a mere 8.3% (8/96) of the studies validated their models using external and diverse data sets, highlighting the need for enhanced model validation.
- The inclusion of AIA in cancer data analysis is in a very early stage; only 2.1% (2/96) explicitly addressed AIA through explainability.

摘要:背景和目標:通過提取這些資訊,機器或深度學習 (ML/DL) 基於自主數據分析工具可以協助臨床醫生和癌症研究人員從複雜的數據集中發現模式和關係。最近已發表許多基於 DL 的卵巢癌 (OC) 數據分析。這些分析在癌症的各個方面(例如,它們涉及的子領域和癌症類型)和數據分析功能方面高度多樣化。然而,目前缺乏對這些分析在這些特徵和 AI 保證 (AIA) 方面的全面理解。這篇系統性回顧旨在通過檢視現有文獻並明確關注關鍵特徵和 AI 保證觀點,來填補這個空白。方法:使用 PRISMA 架構在三個期刊資料庫中進行全面搜尋。分析僅包括 2015 年至 2023 年間發表於同行評審期刊的研究。結果:在回顧中,總共檢視了 96 項由 DL 驅動的分析。研究結果揭示了幾個關於由 DL 驅動的卵巢癌數據分析的重要見解:- 大多數研究 71%(96 項中有 68 項)專注於檢測和診斷,而沒有研究探討 OC 的預測和預防。- 這些分析主要基於來自非多元族群的樣本(75%(96 項研究中的 72 項)),僅限於某個地理位置或國家。- 只有少部分研究(僅 33%(96 項研究中的 32 項)執行整合分析,其中大多數使用同質數據(臨床或組學)。- 值得注意的是,只有 8.3%(96 項研究中的 8 項)使用外部和多元數據集驗證了其模型,強調了加強模型驗證的必要性,以及- 將 AIA 納入癌症數據分析仍處於非常早期的階段;只有 2.1%(96 項研究中的 2 項)透過可解釋性明確探討了 AIA。
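The percentages quoted in the results above can be checked directly against the study counts. The short sketch below recomputes each share from the counts given in the abstract (out of 96 studies); the dictionary keys are shorthand labels chosen here for illustration.

```python
total_studies = 96

# Counts as stated in the abstract; labels are shorthand for this sketch.
counts = {
    "detection and diagnosis": 68,
    "non-diverse population": 72,
    "integrated analyses": 32,
    "external validation": 8,
    "explicit AIA via explainability": 2,
}

# Share of all reviewed studies, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total_studies, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The recomputed values agree with the abstract once rounding is taken into account: 68/96 is 70.8%, which the abstract reports as 71%.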

Representing visual classification as a linear combination of words

2311.10933v1 by Shobhit Agarwal, Yevgeniy R. Semenov, William Lotter

Explainability is a longstanding challenge in deep learning, especially in high-stakes domains like healthcare. Common explainability methods highlight image regions that drive an AI model's decision. Humans, however, heavily rely on language to convey explanations of not only "where" but "what". Additionally, most explainability approaches focus on explaining individual AI predictions, rather than describing the features used by an AI model in general. The latter would be especially useful for model and dataset auditing, and potentially even knowledge generation as AI is increasingly being used in novel tasks. Here, we present an explainability strategy that uses a vision-language model to identify language-based descriptors of a visual classification task. By leveraging a pre-trained joint embedding space between images and text, our approach estimates a new classification task as a linear combination of words, resulting in a weight for each word that indicates its alignment with the vision-based classifier. We assess our approach using two medical imaging classification tasks, where we find that the resulting descriptors largely align with clinical knowledge despite a lack of domain-specific language training. However, our approach also identifies the potential for 'shortcut connections' in the public datasets used. Towards a functional measure of explainability, we perform a pilot reader study where we find that the AI-identified words can enable non-expert humans to perform a specialized medical task at a non-trivial level. Altogether, our results emphasize the potential of using multimodal foundational models to deliver intuitive, language-based explanations of visual tasks.

摘要:解釋性是深度學習中長期的挑戰,特別是在醫療保健等高風險領域。常見的解釋性方法會強調驅動 AI 模型決策的影像區域。然而,人類很大程度依賴語言來傳達不僅是「在哪裡」,還有「是什麼」的解釋。此外,大多數解釋性方法都專注於解釋個別 AI 預測,而不是描述 AI 模型一般使用的特徵。後者對於模型和資料集稽核特別有用,甚至可能在 AI 愈來愈用於新穎任務時產生知識。在此,我們提出一個使用視覺語言模型來辨識視覺分類任務的語言描述符的解釋性策略。透過利用影像和文字之間預先訓練的聯合嵌入空間,我們的做法將新的分類任務估計為一個線性文字組合,導致每個文字都有權重,表示它與基於視覺的分類器對齊。我們使用兩個醫學影像分類任務來評估我們的做法,我們發現產生的描述符在很大程度上與臨床知識一致,儘管缺乏特定領域的語言訓練。然而,我們的做法也發現了所用公開資料集中的「捷徑連線」的可能性。為了達到解釋性的功能性衡量,我們進行了一項試驗讀者研究,發現 AI 識別的文字能讓非專家人類在非平凡的層級執行專業的醫療任務。總之,我們的結果強調了使用多模式基礎模型來提供直觀的、基於語言的視覺任務解釋的潛力。
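One way to picture "a classifier as a linear combination of words" is a least-squares fit in a shared embedding space: given word embeddings and the weight vector of a linear image classifier in the same space, solve for one coefficient per word. The sketch below makes that assumption explicit. The embeddings, the word list, and the use of ordinary least squares are all illustrative choices, not the fitting procedure from the paper.

```python
import numpy as np

def word_weights(word_embeddings, classifier_direction):
    """Least-squares coefficients a such that a @ word_embeddings ~ classifier_direction.

    word_embeddings: (n_words, dim); classifier_direction: (dim,).
    """
    coeffs, *_ = np.linalg.lstsq(word_embeddings.T, classifier_direction, rcond=None)
    return coeffs

# Hypothetical 3-d joint embedding space with two made-up word embeddings.
words = ["irregular", "symmetric"]
embeddings = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
classifier = np.array([2.0, -1.0, 0.0])  # toy linear classifier direction

weights = word_weights(embeddings, classifier)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:+.2f}")
```

A large positive or negative coefficient marks a word whose embedding aligns with or against the classifier direction, which is the kind of per-word weight the abstract describes.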

Towards objective and systematic evaluation of bias in artificial intelligence for medical imaging

2311.02115v2 by Emma A. M. Stanley, Raissa Souza, Anthony Winder, Vedant Gulve, Kimberly Amador, Matthias Wilms, Nils D. Forkert

Artificial intelligence (AI) models trained using medical images for clinical tasks often exhibit bias in the form of disparities in performance between subgroups. Since not all sources of biases in real-world medical imaging data are easily identifiable, it is challenging to comprehensively assess how those biases are encoded in models, and how capable bias mitigation methods are at ameliorating performance disparities. In this article, we introduce a novel analysis framework for systematically and objectively investigating the impact of biases in medical images on AI models. We developed and tested this framework for conducting controlled in silico trials to assess bias in medical imaging AI using a tool for generating synthetic magnetic resonance images with known disease effects and sources of bias. The feasibility is showcased by using three counterfactual bias scenarios to measure the impact of simulated bias effects on a convolutional neural network (CNN) classifier and the efficacy of three bias mitigation strategies. The analysis revealed that the simulated biases resulted in expected subgroup performance disparities when the CNN was trained on the synthetic datasets. Moreover, reweighing was identified as the most successful bias mitigation strategy for this setup, and we demonstrated how explainable AI methods can aid in investigating the manifestation of bias in the model using this framework. Developing fair AI models is a considerable challenge given that many and often unknown sources of biases can be present in medical imaging datasets. In this work, we present a novel methodology to objectively study the impact of biases and mitigation strategies on deep learning pipelines, which can support the development of clinical AI that is robust and responsible.

摘要:使用醫療影像訓練的人工智慧 (AI) 模型,用於臨床任務時,常會在效能上展現出次群體之間的差異,形成偏見。由於並非所有真實世界醫療影像資料中的偏見來源都容易辨識,因此全面評估這些偏見是如何編碼到模型中,以及偏見緩解方法在改善效能差異方面的能力,是一項挑戰。在本文中,我們介紹了一個新穎的分析架構,用於系統化且客觀地調查醫療影像中的偏見對 AI 模型的影響。我們開發並測試了這個架構,以進行受控的電腦模擬試驗,使用一個工具來評估醫療影像 AI 中的偏見,該工具用於產生具有已知疾病影響和偏見來源的合成磁共振影像。可行性透過使用三個反事實偏見情境來衡量模擬偏見效應對卷積神經網路 (CNN) 分類器和三個偏見緩解策略的影響,並展示出來。分析顯示,當 CNN 在合成資料集上受訓時,模擬偏見會導致預期的次群體效能差異。此外,重新加權被認為是此設定中最成功的偏見緩解策略,我們展示了解釋性 AI 方法如何協助使用這個架構調查模型中偏見的表現。開發公平的 AI 模型是一項重大的挑戰,因為醫療影像資料集中可能存在許多且經常未知的偏見來源。在這項工作中,我們提出了一種新穎的方法,用於客觀地研究偏見和緩解策略對深度學習管線的影響,這可以支援健全且負責任的臨床 AI 的開發。
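The abstract names reweighing as the most successful mitigation strategy in its setup. Assuming this refers to the standard reweighing scheme, in which each sample is weighted by P(group) * P(label) / P(group, label) so that group and label look statistically independent to the learner, a minimal sketch looks like this. The function name and toy data are hypothetical, and the paper's exact implementation may differ.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Per-sample weights P(g) * P(y) / P(g, y) (standard reweighing; assumed here)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] / n) * (label_counts[y] / n) / (joint_counts[(g, y)] / n)
        for g, y in zip(groups, labels)
    ]

# Toy data: group A is mostly positive, group B mostly negative.
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

weights = reweighing_weights(groups, labels)
print(weights)
```

Under-represented combinations (here, negatives in group A and positives in group B) receive weights above 1, while over-represented ones are down-weighted; the weights would then be passed to the training loss as sample weights.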

Predicting recovery following stroke: deep learning, multimodal data and feature selection using explainable AI

2310.19174v1 by Adam White, Margarita Saranti, Artur d'Avila Garcez, Thomas M. H. Hope, Cathy J. Price, Howard Bowman

Machine learning offers great potential for automated prediction of post-stroke symptoms and their response to rehabilitation. Major challenges for this endeavour include the very high dimensionality of neuroimaging data, the relatively small size of the datasets available for learning, and how to effectively combine neuroimaging and tabular data (e.g. demographic information and clinical characteristics). This paper evaluates several solutions based on two strategies. The first is to use 2D images that summarise MRI scans. The second is to select key features that improve classification accuracy. Additionally, we introduce the novel approach of training a convolutional neural network (CNN) on images that combine regions-of-interest extracted from MRIs, with symbolic representations of tabular data. We evaluate a series of CNN architectures (both 2D and 3D) that are trained on different representations of MRI and tabular data, to predict whether a composite measure of post-stroke spoken picture description ability is in the aphasic or non-aphasic range. MRI and tabular data were acquired from 758 English speaking stroke survivors who participated in the PLORAS study. The classification accuracy for a baseline logistic regression was 0.678 for lesion size alone, rising to 0.757 and 0.813 when initial symptom severity and recovery time were successively added. The highest classification accuracy, 0.854, was observed when 8 regions-of-interest were extracted from each MRI scan and combined with lesion size, initial severity and recovery time in a 2D Residual Neural Network. Our findings demonstrate how imaging and tabular data can be combined for high post-stroke classification accuracy, even when the dataset is small in machine learning terms. We conclude by proposing how the current models could be improved to achieve even higher levels of accuracy using images from hospital scanners.

摘要:機器學習為自動預測中風後症狀及其對復健的反應提供了極大的潛力。這項工作的重大挑戰包括神經影像資料的維度非常高、可用於學習的資料集規模相對較小,以及如何有效結合神經影像和表格資料(例如人口統計資訊和臨床特徵)。本文根據兩種策略評估了多種解決方案。第一種是使用總結 MRI 掃描的 2D 影像。第二種是選擇有助於提高分類精確度的關鍵特徵。此外,我們引入了在結合從 MRI 中提取的感興趣區域與表格資料的符號表示的影像上訓練卷積神經網路 (CNN) 的新穎方法。我們評估了一系列 CNN 架構(2D 和 3D),這些架構在 MRI 和表格資料的不同表示上進行訓練,以預測中風後口述圖片描述能力的綜合測量是否在失語症或非失語症範圍內。MRI 和表格資料來自 758 名參與 PLORAS 研究的英語中風倖存者。僅針對病灶大小的基線邏輯迴歸分類準確度為 0.678,當依序加入初始症狀嚴重程度和恢復時間時,上升至 0.757 和 0.813。在從每個 MRI 掃描中提取 8 個感興趣區域並在 2D 殘差神經網路中與病灶大小、初始嚴重程度和恢復時間結合時,觀察到最高的分類準確度 0.854。我們的研究結果展示了如何將影像和表格資料結合起來以獲得高於中風後分類準確度,即使在機器學習術語中資料集很小的情況下也是如此。最後,我們提出如何改進目前的模型,以使用來自醫院掃描儀的影像來實現更高的準確度。

Trainable Noise Model as an XAI evaluation method: application on Sobol for remote sensing image segmentation

2310.01828v2 by Hossein Shreim, Abdul Karim Gizzini, Ali J. Ghandour

eXplainable Artificial Intelligence (XAI) has emerged as an essential requirement when dealing with mission-critical applications, ensuring transparency and interpretability of the employed black box AI models. The significance of XAI spans various domains, from healthcare to finance, where understanding the decision-making process of deep learning algorithms is essential. Most AI-based computer vision models are black boxes; hence, providing explainability of deep neural networks in image processing is crucial for their wide adoption and deployment in medical image analysis, autonomous driving, and remote sensing applications. Recently, several XAI methods for image classification tasks have been introduced. In contrast, image segmentation has received comparatively less attention in the context of explainability, although it is a fundamental task in computer vision applications, especially in remote sensing. Only some research proposes gradient-based XAI algorithms for image segmentation. This paper adapts the recent gradient-free Sobol XAI method for semantic segmentation. To measure the performance of the Sobol method for segmentation, we propose a quantitative XAI evaluation method based on a learnable noise model. The main objective of this model is to induce noise on the explanation maps, where higher induced noise signifies low accuracy and vice versa. A benchmark analysis is conducted to evaluate and compare the performance of three XAI methods, Seg-Grad-CAM, Seg-Grad-CAM++ and Seg-Sobol, using the proposed noise-based evaluation technique. This constitutes the first attempt to run and evaluate XAI methods using high-resolution satellite images.

摘要:可解釋人工智慧 (XAI) 已成為處理任務關鍵應用程式時的一項基本需求,確保採用黑盒 AI 模型的透明度和可解釋性。XAI 的重要性涵蓋從醫療保健到金融的各種領域,在這些領域中,了解深度學習演算法的決策制定過程至關重要。大多數基於 AI 的電腦視覺模型通常是黑盒子;因此,在影像處理中提供深度神經網路的可解釋性對於其在醫學影像分析、自動駕駛和遙測應用中的廣泛採用和部署至關重要。最近,已針對影像分類任務引入了多種 XAI 方法。相反地,影像分割在可解釋性的背景下受到的關注相對較少,儘管它是電腦視覺應用中的一項基本任務,特別是在遙測中。只有部分研究提出用於影像分割的基於梯度的 XAI 演算法。本文改編了最近的無梯度 Sobol XAI 方法以進行語意分割。為了衡量 Sobol 方法在分割中的效能,我們提出了一種基於可學習雜訊模型的定量 XAI 評估方法。此模型的主要目的是在解釋圖上誘發雜訊,其中較高的誘發雜訊表示較低的準確度,反之亦然。進行基準分析以評估和比較三種 XAI 方法的效能,包括 Seg-Grad-CAM、Seg-Grad-CAM++ 和 Seg-Sobol,並使用所提出的基於雜訊的評估技術。這構成了使用高解析度衛星影像執行和評估 XAI 方法的首次嘗試。

Creating Trustworthy LLMs: Dealing with Hallucinations in Healthcare AI

2311.01463v1 by Muhammad Aurangzeb Ahmad, Ilker Yaramis, Taposh Dutta Roy

Large language models have proliferated across multiple domains in a short period of time. There is, however, hesitation in the medical and healthcare domain towards their adoption because of issues like factuality, coherence, and hallucinations. Given the high-stakes nature of healthcare, many researchers have even cautioned against their usage until these issues are resolved. The key to the implementation and deployment of LLMs in healthcare is to make these models trustworthy, transparent (as much as possible) and explainable. In this paper we describe the key elements in creating reliable, trustworthy, and unbiased models as a necessary condition for their adoption in healthcare. Specifically, we focus on the quantification, validation, and mitigation of hallucinations in the context of healthcare. Lastly, we discuss what the future of LLMs in healthcare may look like.

摘要:大型語言模型在短時間內已在多個領域中大量激增。然而,由於事實性、連貫性和幻覺等問題,醫療和保健領域對其採用猶豫不決。鑑於醫療保健的高風險性質,許多研究人員甚至警告不要使用它,直到這些問題得到解決。在醫療保健中實施和部署 LLM 的關鍵是使這些模型值得信賴、透明(盡可能多)且可解釋。在本文中,我們描述了建立可靠、值得信賴和無偏見模型的關鍵要素,作為它們在醫療保健中得到採用的必要條件。具體來說,我們專注於在醫療保健背景下對幻覺進行量化、驗證和緩解。最後,我們討論了 LLM 在醫療保健中的未來可能是什麼樣子。

When to Trust AI: Advances and Challenges for Certification of Neural Networks

2309.11196v1 by Marta Kwiatkowska, Xiyue Zhang

Artificial intelligence (AI) has been advancing at a fast pace and it is now poised for deployment in a wide range of applications, such as autonomous systems, medical diagnosis and natural language processing. Early adoption of AI technology for real-world applications has not been without problems, particularly for neural networks, which may be unstable and susceptible to adversarial examples. In the longer term, appropriate safety assurance techniques need to be developed to reduce potential harm due to avoidable system failures and ensure trustworthiness. Focusing on certification and explainability, this paper provides an overview of techniques that have been developed to ensure safety of AI decisions and discusses future challenges.

摘要:人工智慧(AI)已快速進步,現已準備部署於廣泛的應用程式中,例如自主系統、醫療診斷和自然語言處理。及早採用 AI 技術於實際應用程式並非沒有問題,特別是對於神經網路,它可能不穩定且容易受到對抗性範例的影響。從長遠來看,需要開發適當的安全保證技術,以減少因可避免的系統故障而造成的潛在傷害,並確保可信賴性。本文著重於認證和可解釋性,概述了已開發用於確保 AI 決策安全的技術,並討論未來的挑戰。

Knowledge Graphs

Publish Date Title Authors Homepage Code
2024-11-12 Language Models as Causal Effect Generators Lucius E. J. Bynum et.al. 2411.08019v1 link
2024-11-12 From General to Specific: Utilizing General Hallucation to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents Chuyi Kong et.al. 2411.07965v1 null
2024-11-12 Chain Association-based Attacking and Shielding Natural Language Processing Systems Jiacheng Huang et.al. 2411.07843v1 null
2024-11-11 Gradual Fine-Tuning with Graph Routing for Multi-Source Unsupervised Domain Adaptation Yao Ma et.al. 2411.07185v1 null
2024-11-11 A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19 Vedant Khandelwal et.al. 2411.07163v1 null
2024-11-11 A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs Myeongsoo Kim et.al. 2411.07098v1 null
2024-11-11 Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation Qiao Qiao et.al. 2411.06660v1 null
2024-11-10 CausalStock: Deep End-to-end Causal Discovery for News-driven Stock Movement Prediction Shuqi Li et.al. 2411.06391v1 null
2024-11-09 Analyzing the Evolution of Graphs and Texts Xingzhi Guo et.al. 2411.06295v1 null
2024-11-09 An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models Fatemeh Shiri et.al. 2411.06048v1 link
2024-11-08 Mitigating Hallucination with ZeroG: An Advanced Knowledge Management Engine Anantha Sharma et.al. 2411.05936v1 null
2024-11-08 SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark Sithursan Sivasubramaniam et.al. 2411.05521v1 null
2024-11-08 EUREKHA: Enhancing User Representation for Key Hackers Identification in Underground Forums Abdoul Nasser Hassane Amadou et.al. 2411.05479v1 link
2024-11-08 When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization Jacob Nielsen et.al. 2411.05882v1 null
2024-11-08 Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation Dong Shu et.al. 2411.05316v1 link
2024-11-06 LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration Yukun Cao et.al. 2411.05844v1 null
2024-11-06 MEG: Medical Knowledge-Augmented Large Language Models for Question Answering Laura Cabello et.al. 2411.03883v2 link
2024-11-06 The American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge Lee Kezar et.al. 2411.03568v1 null
2024-11-05 Graph-DPEP: Decomposed Plug and Ensemble Play for Few-Shot Document Relation Extraction with Graph-of-Thoughts Reasoning Tao Zhang et.al. 2411.02864v1 null
2024-11-05 Multimodal Commonsense Knowledge Distillation for Visual Question Answering Shuo Yang et.al. 2411.02722v1 null
2024-11-04 Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography Harshavardhana T. Gowda et.al. 2411.02591v1 link
2024-11-04 GraphXAIN: Narratives to Explain Graph Neural Networks Mateusz Cedro et.al. 2411.02540v2 link
2024-11-04 Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models Guangzhi Xiong et.al. 2411.02382v1 null
2024-11-04 Can Language Models Enable In-Context Database? Yu Pan et.al. 2411.01807v1 null
2024-11-03 Graph-based Confidence Calibration for Large Language Models Yukun Li et.al. 2411.02454v1 null
2024-11-03 Ontology Population using LLMs Sanaz Saki Norouzi et.al. 2411.01612v1 null
2024-11-03 Pre-trained Molecular Language Models with Random Functional Group Masking Tianhao Peng et.al. 2411.01401v1 null
2024-11-01 Narrative Analysis of True Crime Podcasts With Knowledge Graph-Augmented Large Language Models Xinyi Leng et.al. 2411.02435v1 null
2024-11-01 WLPlan: Relational Features for Symbolic Planning Dillon Z. Chen et.al. 2411.00577v1 null
2024-11-01 GRS-QA -- Graph Reasoning-Structured Question Answering Dataset Anish Pahilajani et.al. 2411.00369v3 null
2024-11-01 Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes Balu Bhasuran et.al. 2411.02523v1 null
2024-10-31 Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning Beyazit Yalcinkaya et.al. 2411.00205v1 null
2024-10-31 Building Multi-Agent Copilot towards Autonomous Agricultural Data Management and Analysis Yu Pan et.al. 2411.00188v1 null
2024-10-31 Exploring the Knowledge Mismatch Hypothesis: Hallucination Propensity in Small Models Fine-tuned on Data from Larger Models Phil Wee et.al. 2411.00878v1 null
2024-10-31 Failure Modes of LLMs for Causal Reasoning on Narratives Khurram Yamin et.al. 2410.23884v1 link
2024-10-31 Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs Liyi Chen et.al. 2410.23875v1 link
2024-10-31 LLaMo: Large Language Model-based Molecular Graph Assistant Jinyoung Park et.al. 2411.00871v1 link
2024-10-31 End-to-End Ontology Learning with Large Language Models Andy Lo et.al. 2410.23584v1 link
2024-10-30 Graph-Augmented Relation Extraction Model with LLMs-Generated Support Document Vicky Dong et.al. 2410.23452v1 null
2024-10-30 FlowLLM: Flow Matching for Material Generation with Large Language Models as Base Distributions Anuroop Sriram et.al. 2410.23405v1 link
2024-10-30 EMMA: End-to-End Multimodal Model for Autonomous Driving Jyh-Jing Hwang et.al. 2410.23262v2 null
2024-10-30 ProTransformer: Robustify Transformers via Plug-and-Play Paradigm Zhichao Hou et.al. 2410.23182v1 null
2024-10-30 Semantic Enrichment of the Quantum Cascade Laser Properties in Text- A Knowledge Graph Generation Approach Deperias Kerre et.al. 2410.22996v1 null
2024-10-30 How Well Do Large Language Models Disambiguate Swedish Words? Richard Johansson et.al. 2410.22827v1 null
2024-10-30 Beyond Ontology in Dialogue State Tracking for Goal-Oriented Chatbot Sejin Lee et.al. 2410.22767v1 link
2024-10-30 The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation Reza Moravej et.al. 2411.00843v1 null
2024-10-29 Are Large-Language Models Graph Algorithmic Reasoners? Alexander K Taylor et.al. 2410.22597v1 link
2024-10-29 Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset Adrian Garret Gabriel et.al. 2410.22457v1 null
2024-10-29 DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models Chengke Zou et.al. 2411.00836v1 null
2024-10-29 ADAM: An Embodied Causal Agent in Open-World Environments Shu Yu et.al. 2410.22194v1 null
2024-10-29 Synergizing LLM Agents and Knowledge Graph for Socioeconomic Prediction in LBSN Zhilun Zhou et.al. 2411.00028v1 null
2024-10-29 A Hierarchical Language Model For Interpretable Graph Reasoning Sambhav Khurana et.al. 2410.22372v1 null
2024-10-28 LLM-Forest for Health Tabular Data Imputation Xinrui He et.al. 2410.21520v1 null
2024-10-28 Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce Zhantao Yang et.al. 2410.21237v1 null
2024-10-28 CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models Meiqi Chen et.al. 2410.21067v1 null
2024-10-28 CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity Yutong Cheng et.al. 2410.21060v1 null
2024-10-28 Graph-based Uncertainty Metrics for Long-form Language Model Outputs Mingjian Jiang et.al. 2410.20783v1 link
2024-10-28 Plan$\times$RAG: Planning-guided Retrieval Augmented Generation Prakhar Verma et.al. 2410.20753v1 null
2024-10-28 Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation Mufei Li et.al. 2410.20724v2 link
2024-10-27 Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs Xingrui Zhuo et.al. 2410.20321v1 null
2024-10-26 Mathematical Derivation Graphs: A Task for Summarizing Equation Dependencies in STEM Manuscripts Vishesh Prasad et.al. 2410.21324v1 null
2024-10-25 DualMAR: Medical-Augmented Representation from Dual-Expertise Perspectives Pengfei Hu et.al. 2410.19955v1 null
2024-10-25 FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning Nicole Cho et.al. 2410.19727v1 null
2024-10-25 Knowledge Graph Enhanced Language Agents for Recommendation Taicheng Guo et.al. 2410.19627v1 null
2024-10-25 Graph Linearization Methods for Reasoning on Graphs with Large Language Models Christos Xypolopoulos et.al. 2410.19494v1 null
2024-10-25 Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis Weikai Li et.al. 2410.19225v1 null
2024-10-24 Enriching GNNs with Text Contextual Representations for Detecting Disinformation Campaigns on Social Media Bruno Croso Cunha da Silva et.al. 2410.19193v1 null
2024-10-24 GCoder: Improving Large Language Model for Generalized Graph Problem Solving Qifan Zhang et.al. 2410.19084v1 link
2024-10-24 LLM-based Online Prediction of Time-varying Graph Signals Dayu Qin et.al. 2410.18718v1 null
2024-10-24 Gene-Metabolite Association Prediction with Interactive Knowledge Transfer Enhanced Graph for Metabolite Production Kexuan Xin et.al. 2410.18475v2 null
2024-10-24 ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis Zezhong Wang et.al. 2410.18447v1 null
2024-10-24 Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains Kun Li et.al. 2410.18415v1 null
2024-10-23 Explaining Bayesian Networks in Natural Language using Factor Arguments. Evaluation in the medical domain Jaime Sevilla et.al. 2410.18060v1 null
2024-10-23 Graphusion: A RAG Framework for Knowledge Graph Construction with a Global Perspective Rui Yang et.al. 2410.17600v1 null
2024-10-23 Navigate Complex Physical Worlds via Geometrically Constrained LLM Yongqiang Huang et.al. 2410.17529v1 null
2024-10-22 Large Language Model-based Augmentation for Imbalanced Node Classification on Text-Attributed Graphs Leyao Wang et.al. 2410.16882v1 null
2024-10-22 Context-aware Inductive Knowledge Graph Completion with Latent Type Constraints and Subgraph Reasoning Muzhi Li et.al. 2410.16803v1 null
2024-10-22 The Scene Language: Representing Scenes with Programs, Words, and Embeddings Yunzhi Zhang et.al. 2410.16770v1 null
2024-10-22 Atomic Fact Decomposition Helps Attributed Question Answering Zhichao Yan et.al. 2410.16708v1 null
2024-10-22 PLDR-LLM: Large Language Model from Power Law Decoder Representations Burc Gokden et.al. 2410.16703v1 link
2024-10-22 Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency Prafulla Kumar Choubey et.al. 2410.16597v1 null
2024-10-21 Towards a Reliable Offline Personal AI Assistant for Long Duration Spaceflight Oliver Bensch et.al. 2410.16397v1 null
2024-10-21 A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns Tianyi Men et.al. 2410.16155v1 null
2024-10-21 CausalGraph2LLM: Evaluating LLMs for Causal Queries Ivaxi Sheth et.al. 2410.15939v1 link
2024-10-21 LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation Tejumade Afonja et.al. 2410.15828v1 null
2024-10-21 NetSafe: Exploring the Topological Safety of Multi-agent Networks Miao Yu et.al. 2410.15686v1 null
2024-10-20 TAGExplainer: Narrating Graph Explanations for Text-Attributed Graph Learning Models Bo Pan et.al. 2410.15268v1 null
2024-10-19 Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction Yinhan He et.al. 2410.15165v1 null
2024-10-19 MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science Junho Kim et.al. 2410.15126v1 null
2024-10-19 Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models Qitan Lv et.al. 2410.15116v1 null
2024-10-19 A Prompt Engineering Approach and a Knowledge Graph based Framework for Tackling Legal Implications of Large Language Model Answers George Hannah et.al. 2410.15064v1 null
2024-10-19 LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model Tianqianjin Lin et.al. 2410.14961v1 null
2024-10-18 TransBox: EL++-closed Ontology Embedding Hui Yang et.al. 2410.14571v1 null
2024-10-18 Enabling Scalable Evaluation of Bias Patterns in Medical LLMs Hamed Fayyaz et.al. 2410.14763v1 link
2024-10-18 Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning Xingyu Tan et.al. 2410.14211v2 null
2024-10-18 UniMTS: Unified Pre-training for Motion Time Series Xiyuan Zhang et.al. 2410.19818v1 link
2024-10-18 Supervised Chain of Thought Xiang Zhang et.al. 2410.14198v1 null
2024-10-17 Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs Simone Conia et.al. 2410.14057v1 null
2024-10-17 RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs Jiatan Huang et.al. 2410.13987v1 null
2024-10-17 The Mystery of the Pathological Path-star Task for Language Models Arvid Frydenlund et.al. 2410.13779v1 null

Abstracts

Language Models as Causal Effect Generators

2411.08019v1 by Lucius E. J. Bynum, Kyunghyun Cho

We present a framework for large language model (LLM) based data generation with controllable causal structure. In particular, we define a procedure for turning any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly speaking, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We characterize how an SD-SCM allows sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data without needing to manually specify functional relationships between variables. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods on these datasets for average, conditional average, and individual treatment effect estimation, both with and without hidden confounding. Apart from generating data, the same procedure also allows us to test for the presence of a causal effect that might be encoded in an LLM. This procedure can underpin auditing LLMs for misinformation, discrimination, or otherwise undesirable behavior. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure.

摘要:我們提出了一個基於大型語言模型 (LLM) 的資料生成架構,具有可控制的因果結構。具體來說,我們定義了一個程序,將任何語言模型和任何有向無環圖 (DAG) 轉換成一個序列驅動的結構因果模型 (SD-SCM)。廣義來說,SD-SCM 是一個因果模型,具有使用者定義的結構和 LLM 定義的結構方程式。我們描述了 SD-SCM 如何根據所需的因果結構,允許從觀測、介入和反事實分佈中進行抽樣。然後,我們利用這個程序提出了一種類型的因果推論方法基準,生成個體層級的反事實資料,而無需手動指定變數之間的功能關係。我們建立了一個範例基準,包含數千個資料集,並在這些資料集上測試了一系列流行的估計方法,用於平均值、條件平均值和個別處理效果估計,無論是有或沒有隱藏混淆。除了生成資料之外,相同的程序也允許我們測試 LLM 中可能編碼的因果效應的存在。此程序可以支持審核 LLM 的錯誤資訊、歧視或其他不良行為。我們相信 SD-SCM 可以作為任何應用程式的有用工具,這些應用程式可以從具有可控制因果結構的序列資料中受益。
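The sampling procedure described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_sampler` is a hypothetical stand-in for a language model, the variable names and domains are invented, and conditioning each variable only on its DAG parents is an assumption made for illustration. Variables are visited in topological order, and an intervention overrides a variable regardless of its parents.

```python
import random
from graphlib import TopologicalSorter  # Python 3.9+


def toy_sampler(prompt, choices, rng):
    """Hypothetical stand-in for an LLM; a real model would condition on `prompt`."""
    return rng.choice(choices)


def sample_sd_scm(parents, domains, sampler, rng, interventions=None):
    """Draw one sample from a toy SCM whose structural equations are `sampler`.

    parents: dict mapping each variable to the list of its parent variables (a DAG).
    interventions: dict of do(var = value) assignments that bypass the sampler.
    """
    interventions = interventions or {}
    values = {}
    # static_order() yields every variable after all of its parents.
    for var in TopologicalSorter(parents).static_order():
        if var in interventions:
            values[var] = interventions[var]  # do(var = value): parents are ignored
            continue
        # The prompt exposes only the parents' values to the sampler.
        context = "; ".join(f"{p}={values[p]}" for p in parents[var])
        values[var] = sampler(f"{context}; {var}=", domains[var], rng)
    return values


# Illustrative (invented) causal structure: age -> treatment -> outcome, age -> outcome.
parents = {"age": [], "treatment": ["age"], "outcome": ["age", "treatment"]}
domains = {
    "age": ["young", "old"],
    "treatment": ["drug", "placebo"],
    "outcome": ["recovered", "not recovered"],
}

observational = sample_sd_scm(parents, domains, toy_sampler, random.Random(0))
interventional = sample_sd_scm(
    parents, domains, toy_sampler, random.Random(0), interventions={"treatment": "drug"}
)
```

Reusing the same random seed for the observational and interventional draws is only a convenience here; it does not by itself produce valid counterfactuals.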

From General to Specific: Utilizing General Hallucation to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents

2411.07965v1 by Chuyi Kong, Ziyang Luo, Hongzhan Lin, Zhiyuan Fan, Yaxin Fan, Yuxi Sun, Jing Ma

The advanced role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks, such as HPD, which incorporates manually scored character relationships into the context for LLMs to sort coherence, and SocialBench, which uses specific profiles generated by LLMs in the context of multiple-choice tasks to assess character preferences, face limitations like poor generalizability, implicit and inaccurate judgments, and excessive context length. To address the above issues, we propose an automatic, scalable, and generalizable paradigm. Specifically, we construct a benchmark by extracting relations from a general knowledge graph and leverage RPA's inherent hallucination properties to prompt it to interact across roles, employing ChatGPT for stance detection and defining relationship hallucination along with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further explore factors influencing these metrics and discuss the trade-off between relationship hallucination and factuality.

摘要:大型語言模型 (LLM) 的先進角色扮演能力已為開發角色扮演代理 (RPA) 鋪平道路。然而,現有的基準,例如 HPD(將手動評分的角色關係納入 LLM 的背景中以對連貫性進行排序),以及 SocialBench(在多選題任務的背景下使用 LLM 生成的特定個人資料來評估角色偏好)面臨著諸如通用性差、判斷含蓄且不準確以及背景長度過長等限制。為了解決上述問題,我們提出了一個自動、可擴充且可概括的範例。具體來說,我們通過從通用知識圖譜中提取關係來構建基準,並利用 RPA 固有的幻覺屬性提示它跨角色互動,採用 ChatGPT 進行立場檢測並定義關係幻覺以及三個相關指標。廣泛的實驗驗證了我們指標的有效性和穩定性。我們的研究結果進一步探討了影響這些指標的因素,並討論了關係幻覺和事實性之間的權衡。

Chain Association-based Attacking and Shielding Natural Language Processing Systems

2411.07843v1 by Jiacheng Huang, Long Chen

Association is a gift that lets people avoid mentioning something in completely straightforward words while still allowing others to understand what they intend to refer to. In this paper, we propose a chain association-based adversarial attack against natural language processing systems, utilizing the comprehension gap between humans and machines. We first generate a chain association graph for Chinese characters based on the association paradigm, which is used to build the search space of potential adversarial examples. Then, we introduce a discrete particle swarm optimization algorithm to search for the optimal adversarial examples. We conduct comprehensive experiments and show that advanced natural language processing models and applications, including large language models, are vulnerable to our attack, while humans appear good at understanding the perturbed text. We also explore two methods, adversarial training and associative graph-based recovery, to shield systems from chain association-based attacks. Because a few examples use derogatory terms, this paper contains material that may be offensive or upsetting to some people.

摘要:聯想作為一種禮物,使人們不必用完全直白的話語提及某事,並讓其他人明白他們想提的是什麼。在本文中,我們提出了一種基於鏈式聯想的對抗性攻擊,用於自然語言處理系統,利用了人類與機器之間的理解差距。我們首先基於聯想範例為漢字生成一個鏈式聯想圖,用於構建潛在對抗性範例的搜索空間。然後,我們引入一個離散粒子群優化演算法來搜索最佳的對抗性範例。我們進行了全面的實驗,並表明先進的自然語言處理模型和應用程式,包括大型語言模型,都容易受到我們的攻擊,而人類似乎很擅長理解擾動後的文字。我們還探索了兩種方法,包括對抗性訓練和基於聯想圖的恢復,以保護系統免受基於鏈式聯想的攻擊。由於一些範例使用了某些貶義詞,因此本文包含可能冒犯或令某些人感到不安的材料。

Gradual Fine-Tuning with Graph Routing for Multi-Source Unsupervised Domain Adaptation

2411.07185v1 by Yao Ma, Samuel Louvan, Zhunxuan Wang

Multi-source unsupervised domain adaptation aims to leverage labeled data from multiple source domains for training a machine learning model to generalize well on a target domain without labels. Source domain selection plays a crucial role in determining the model's performance. It relies on the similarities amongst source and target domains. Nonetheless, existing work for source domain selection often involves heavyweight computational procedures, especially when dealing with numerous source domains and the need to identify the best ones from them. In this paper, we introduce a framework for gradual fine-tuning (GFT) of machine learning models on multiple source domains. We represent multiple source domains as an undirected weighted graph. We then give a new generalization error bound for GFT along any path within the graph, which is used to determine the optimal path corresponding to the optimal training order. With this formulation, we introduce three lightweight graph-routing strategies which tend to minimize the error bound. Our best strategy improves accuracy by $2.3\%$ over the state of the art on the Natural Language Inference (NLI) task and achieves competitive performance on the Sentiment Analysis (SA) task, most notably a $3.9\%$ improvement on a more diverse subset of the data we use for SA.

摘要:多源无监督域自适应旨在利用来自多个源域的标记数据,训练机器学习模型,以便在没有标签的目标域上很好地泛化。源域选择在确定模型性能方面起着至关重要的作用。它依赖于源域和目标域之间的相似性。尽管如此,现有的源域选择工作通常涉及重量级计算程序,尤其是在处理众多源域以及需要从中识别最佳源域时。在本文中,我们介绍了一个在多个源域上对机器学习模型进行逐步微调 (GFT) 的框架。我们将多个源域表示为无向加权图。然后,我们为图中沿任何路径的 GFT 给出了一个新的泛化误差界,用于确定对应于最佳训练顺序的最佳路径。通过这种表述,我们介绍了三种轻量级的图路由策略,这些策略倾向于最小化误差界。我们最好的策略在自然语言推理 (NLI) 任务上比最先进的技术提高了 2.3% 的准确率,并在情感分析 (SA) 任务上取得了有竞争力的性能,特别是在我们用于 SA 的更多样化的数据子集上提高了 3.9%。
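The idea of choosing a training order by routing through a weighted domain graph can be sketched as follows. The abstract does not spell out the paper's three strategies, so the greedy nearest-neighbor rule below is an assumed, hypothetical strategy used only for illustration. The domain names and distances are invented.

```python
def greedy_training_order(dist, sources, target):
    """Return an order in which to fine-tune on the source domains.

    dist: dict mapping frozenset({a, b}) to the dissimilarity between domains a and b
          (an undirected weighted graph).
    The path is grown backwards from the target by always stepping to the nearest
    unvisited source, then reversed so that training ends on the domain closest
    to the target. This greedy rule is an assumption, not the paper's method.
    """
    path = [target]
    remaining = set(sources)
    while remaining:
        current = path[-1]
        # Ties are broken by domain name to keep the result deterministic.
        nearest = min(remaining, key=lambda s: (dist[frozenset((current, s))], s))
        path.append(nearest)
        remaining.remove(nearest)
    return list(reversed(path[1:]))


# Invented dissimilarities between three source domains (A, B, C) and target T.
dist = {
    frozenset(("T", "A")): 0.2,
    frozenset(("T", "B")): 0.5,
    frozenset(("T", "C")): 0.9,
    frozenset(("A", "B")): 0.3,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.1,
}

order = greedy_training_order(dist, ["A", "B", "C"], "T")
print(order)  # ['C', 'B', 'A']
```

Here the model would be fine-tuned on C first and on A last, because A is the source closest to the target.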

A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19

2411.07163v1 by Vedant Khandelwal, Manas Gaur, Ugur Kursuncu, Valerie Shalin, Amit Sheth

Monitoring public sentiment via social media is potentially helpful during health crises such as the COVID-19 pandemic. However, traditional frequency-based, data-driven neural network-based approaches can miss newly relevant content due to the evolving nature of language in a dynamically evolving environment. Human-curated symbolic knowledge sources, such as lexicons for standard language and slang terms, can potentially elevate social media signals in evolving language. We introduce a neurosymbolic method that integrates neural networks with symbolic knowledge sources, enhancing the detection and interpretation of mental health-related tweets relevant to COVID-19. Our method was evaluated using a corpus of large datasets (approximately 12 billion tweets, 2.5 million subreddit data, and 700k news articles) and multiple knowledge graphs. This method dynamically adapts to evolving language, outperforming purely data-driven models with an F1 score exceeding 92%. This approach also showed faster adaptation to new data and lower computational demands than fine-tuning pre-trained large language models (LLMs). This study demonstrates the benefit of neurosymbolic methods in interpreting text in a dynamic environment for tasks such as health surveillance.

摘要:透過社群媒體監控公眾情緒在 COVID-19 等健康危機期間可能很有幫助。然而,傳統的基於頻率、資料驅動的神經網路方法可能會錯過新相關的內容,因為語言在動態演化的環境中會持續演化。由人類策劃的象徵性知識來源(例如標準語言和俚語術語的詞彙)可能會提升社群媒體在演化語言中的訊號。我們引入一種將神經網路與象徵性知識來源整合的神經符號方法,增強與 COVID-19 相關的心理健康相關推文的偵測和詮釋。我們的做法使用大型資料集語料庫(約 120 億則推文、250 萬個 subreddit 資料和 70 萬則新聞文章)和多個知識圖譜進行評估。這種方法動態適應演化的語言,優於純資料驅動模型,F1 分數超過 92%。這種方法也顯示出比微調預訓練大型語言模型 (LLM) 更快適應新資料和更低的運算需求。本研究證明了神經符號方法在動態環境中詮釋文字的優點,適用於健康監控等任務。

A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs

2411.07098v1 by Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, Alessandro Orso

As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API specifications such as the OpenAPI Specification has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in detecting faults (i.e., 500 response codes). To address these limitations, we present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach for REST API testing, integrating Multi-Agent Reinforcement Learning (MARL) with a Semantic Property Dependency Graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents -- API, dependency, parameter, and value -- collaborate to optimize API exploration. LLMs handle domain-specific value restrictions, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Evaluated on 12 real-world REST services, AutoRestTest outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which augments realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to identify an internal server error in Spotify. Our ablation study underscores the significant contributions of the agent learning, SPDG, and LLM components.

摘要:隨著現代網路服務日益依賴 REST API,其徹底的測試變得至關重要。此外,REST API 規範(例如 OpenAPI 規範)的出現,導致許多黑盒 REST API 測試工具的出現。然而,這些工具通常專注於單獨的測試元素(例如 API、參數、值),導致覆蓋率較低,且在偵測錯誤(即 500 回應碼)方面效率較低。為了解決這些限制,我們提出 AutoRestTest,這是第一個採用依賴嵌入式多代理方法進行 REST API 測試的黑盒框架,將多代理強化學習 (MARL) 與語義屬性依賴圖 (SPDG) 和大型語言模型 (LLM) 整合在一起。我們的做法將 REST API 測試視為一個可分離的問題,其中四個代理(API、依賴關係、參數和值)協同合作以最佳化 API 探索。LLM 處理特定領域的值限制,SPDG 模型使用 API 操作之間的相似性分數簡化依賴關係的搜尋空間,而 MARL 則動態最佳化代理的行為。在 12 項真實世界的 REST 服務上進行評估,AutoRestTest 在程式碼覆蓋率、操作覆蓋率和錯誤偵測方面,優於四種領先的黑盒 REST API 測試工具,包括那些由 RESTGPT(使用 LLM 增加逼真的測試輸入)輔助的工具。值得注意的是,AutoRestTest 是唯一能夠識別 Spotify 中內部伺服器錯誤的工具。我們的消融研究強調了代理學習、SPDG 和 LLM 組件的重大貢獻。

Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation

2411.06660v1 by Qiao Qiao, Yuepei Li, Qing Wang, Kang Zhou, Qi Li

Knowledge graph completion (KGC) is a task of inferring missing triples based on existing Knowledge Graphs (KGs). Both structural and semantic information are vital for successful KGC. However, existing methods only use either the structural knowledge from the KG embeddings or the semantic information from pre-trained language models (PLMs), leading to suboptimal model performance. Moreover, since PLMs are not trained on KGs, directly using PLMs to encode triples may be inappropriate. To overcome these limitations, we propose a novel framework called Bridge, which jointly encodes structural and semantic information of KGs. Specifically, we strategically encode entities and relations separately by PLMs to better utilize the semantic knowledge of PLMs and enable structured representation learning via a structural learning principle. Furthermore, to bridge the gap between KGs and PLMs, we employ a self-supervised representation learning method called BYOL to fine-tune PLMs with two different views of a triple. Unlike BYOL, which uses augmentation methods to create two semantically similar views of the same image, potentially altering the semantic information, we strategically separate the triple into two parts to create different views, thus avoiding semantic alteration. Experiments demonstrate that Bridge outperforms the SOTA models on three benchmark datasets.

摘要:知識圖譜補全 (KGC) 是一項根據現有知識圖譜 (KG) 推論遺失三元組的任務。結構和語義資訊對於成功的 KGC 至關重要。然而,現有方法僅使用來自 KG 嵌入的結構知識或來自預訓練語言模型 (PLM) 的語義資訊,導致模型效能不佳。此外,由於 PLM 沒有在 KG 上訓練,因此直接使用 PLM 編碼三元組可能並不適當。為了克服這些限制,我們提出一個名為 Bridge 的新架構,該架構聯合編碼 KG 的結構和語義資訊。具體來說,我們透過 PLM 分別對實體和關係進行策略性編碼,以更好地利用 PLM 的語義知識,並透過結構學習原則啟用結構化表示學習。此外,為了彌合 KG 和 PLM 之間的差距,我們採用一種稱為 BYOL 的自監督表示學習方法,以三元組的兩個不同視圖微調 PLM。與 BYOL 不同,BYOL 使用擴充方法來建立兩個語義上相似的相同影像視圖,可能會改變語義資訊。我們策略性地將三元組分為兩部分以建立不同的視圖,從而避免語義改變。實驗證明 Bridge 在三個基準資料集上優於 SOTA 模型。

CausalStock: Deep End-to-end Causal Discovery for News-driven Stock Movement Prediction

2411.06391v1 by Shuqi Li, Yuebo Sun, Yuxin Lin, Xin Gao, Shuo Shang, Rui Yan

There are two issues in news-driven multi-stock movement prediction tasks that are not well solved in existing work. On the one hand, "relation discovery" is a pivotal part of leveraging the price information of other stocks to achieve accurate stock movement prediction. Given that stock relations are often unidirectional, such as the "supplier-consumer" relationship, causal relations are more appropriate to capture the impact between stocks. On the other hand, substantial noise in the news data makes it difficult to extract effective information. With these two issues in mind, we propose a novel framework called CausalStock for news-driven multi-stock movement prediction, which discovers the temporal causal relations between stocks. We design a lag-dependent temporal causal discovery mechanism to model the temporal causal graph distribution. Then a Functional Causal Model is employed to encapsulate the discovered causal relations and predict the stock movements. Additionally, we propose a Denoised News Encoder that takes advantage of the excellent text evaluation ability of large language models (LLMs) to extract useful information from massive news data. The experiment results show that CausalStock outperforms the strong baselines for both news-driven multi-stock movement prediction and multi-stock movement prediction tasks on six real-world datasets collected from the US, China, Japan, and UK markets. Moreover, benefiting from the causal relations, CausalStock can offer a clear prediction mechanism with good explainability.

摘要:在新聞驅動的多股票移動預測任務中,現有研究尚未妥善解決兩個問題。一方面,在利用其他股票的價格資訊來實現準確的股票移動預測時,「關係發現」是一個關鍵部分。由於股票關係通常是單向的,例如「供應商-消費者」關係,因此因果關係更適合捕捉股票之間的影響。另一方面,新聞資料中存在大量雜訊,導致難以提取有效資訊。考慮到這兩個問題,我們提出了一個名為 CausalStock 的新框架,用於新聞驅動的多股票移動預測,該框架發現了股票之間的時序因果關係。我們設計了一個延遲依賴的時序因果發現機制,以建模時序因果圖分布。然後採用功能因果模型來封裝發現的因果關係並預測股票走勢。此外,我們提出了一個去噪新聞編碼器,利用大型語言模型 (LLM) 出色的文本評估能力從大量新聞資料中提取有用資訊。實驗結果表明,CausalStock 在從美國、中國、日本和英國市場收集的六個真實世界資料集上,在新聞驅動的多股票移動預測和多股票移動預測任務中都優於強大的基線。此外,CausalStock 受益於因果關係,可以提供具有良好可解釋性的清晰預測機制。

Analyzing the Evolution of Graphs and Texts

2411.06295v1 by Xingzhi Guo

With the recent advance of representation learning algorithms on graphs (e.g., DeepWalk/GraphSage) and natural languages (e.g., Word2Vec/BERT), state-of-the-art models can even achieve human-level performance on many downstream tasks, particularly node and sentence classification. However, most algorithms focus on large-scale models for static graphs and text corpora without considering the inherent dynamic characteristics or discovering the reasons behind the changes. This dissertation aims to efficiently model the dynamics in graphs (such as social networks and citation graphs) and understand the changes in texts (specifically news titles and personal biographies). To achieve this goal, we utilize the renowned Personalized PageRank algorithm to create effective dynamic network embeddings for evolving graphs. Our proposed approaches significantly improve the running time and accuracy for both detecting abnormal network intruders and discovering entity meaning shifts over large-scale dynamic graphs. For text changes, we analyze the post-publication changes in news titles to understand the intents behind the edits and discuss the potential impact of title changes from an information integrity perspective. Moreover, we investigate self-presented occupational identities in Twitter users' biographies over five years, examining the effects of job prestige and demographics on how people disclose jobs and quantifying over-represented jobs and their transitions over time.

摘要:隨著圖形表示學習演算法的最新進展(例如 DeepWalk/GraphSage)和自然語言(例如 Word2Vec/BERT),最先進的模型甚至可以在許多下游任務中達到人類等級的效能,特別是對於節點和句子分類的任務。然而,大多數演算法都專注於靜態圖形和大規模文字語料庫的模型,而沒有考慮固有的動態特性或找出變化的原因。本論文旨在有效地為圖形(例如社群網路和引文圖形)建模動態,並了解文字的變化(特別是新聞標題和個人傳記)。為了達成這個目標,我們利用著名的 Personalized PageRank 演算法為不斷變化的圖形建立有效的動態網路嵌入。我們提出的方法顯著改善了偵測網路異常入侵者和找出大規模動態圖形中實體含義轉移的執行時間和準確度。對於文字變化的部分,我們分析了新聞標題在出版後的變化,以了解編輯背後的意圖,並討論標題變更對資訊完整性的潛在影響。此外,我們調查了 Twitter 使用者在傳記中呈現的職業身分長達五年,探討了工作聲望和人口統計資料對人們揭露工作的影響,並量化了過度代表的工作及其隨著時間推移的轉變。
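For readers unfamiliar with Personalized PageRank, the algorithm named in the abstract above, a minimal power-iteration sketch on a static graph is given below. It shows only the textbook algorithm, not the dissertation's dynamic embedding method. The toy graph, the teleport probability `alpha=0.15`, and the choice to send rank from dangling nodes back to the seed are all illustrative assumptions.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Approximate Personalized PageRank with respect to `seed` by power iteration.

    adj: dict mapping each node to the list of nodes it links to
         (every linked node must also be a key of `adj`).
    alpha: probability of teleporting back to the seed at each step.
    """
    rank = {node: 0.0 for node in adj}
    rank[seed] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in adj}
        nxt[seed] += alpha  # total rank is 1, so the teleport mass is alpha
        for node, out_links in adj.items():
            if out_links:
                share = (1 - alpha) * rank[node] / len(out_links)
                for neighbor in out_links:
                    nxt[neighbor] += share
            else:
                # Dangling node: assumed here to return its mass to the seed.
                nxt[seed] += (1 - alpha) * rank[node]
        rank = nxt
    return rank


# Toy directed graph; node "d" is unreachable from the seed "a".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
ppr = personalized_pagerank(graph, seed="a")
```

The resulting scores sum to 1 and concentrate on nodes close to the seed; nodes the seed cannot reach, such as `d` here, receive a score of zero.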

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

2411.06048v1 by Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, Yuan-Fang Li

Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities. Our analyses on object-relationship and multi-hop reasoning reveal several important findings. Firstly, bounding boxes and scene graphs, even synthetic ones, can significantly enhance LMMs' spatial reasoning. Secondly, LMMs struggle more with questions posed from the human perspective than the camera perspective about the image. Thirdly, chain of thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relations. Moreover, spatial reasoning steps are much less accurate than non-spatial ones across MLLMs. Lastly, our perturbation analysis on GQA-spatial reveals that LMMs are much stronger at basic object detection than complex spatial reasoning. We believe our benchmark dataset and in-depth analyses can spark further research on LMMs' spatial reasoning. The Spatial-MM benchmark is available at: https://github.com/FatemehShiri/Spatial-MM

摘要:大型多模態模型 (LMM) 已在各種視覺和語言任務中取得強勁的表現。然而,它們的空間推理能力尚未得到充分研究。在本文中,我們構建了一個新穎的 VQA 資料集 Spatial-MM,以全面研究 LMM 的空間理解和推理能力。我們對物件關係和多跳推理的分析揭示了幾個重要的發現。首先,邊界框和場景圖,即使是合成的,也可以顯著增強 LMM 的空間推理能力。其次,LMM 在回答從人類視角提出的問題時比從相機視角提出的問題時遇到更多困難。第三,思考鏈 (CoT) 提示並未改善模型在涉及空間關係的複雜多跳問題上的效能。% 此外,在 MLLM 中,空間推理步驟的準確度遠低於非空間步驟。最後,我們對 GQA-spatial 的擾動分析表明,LMM 在基本物件偵測方面的能力遠強於複雜的空間推理。我們相信我們的基準資料集和深入分析可以激發對 LMM 空間推理的進一步研究。Spatial-MM 基準可在以下網址取得:https://github.com/FatemehShiri/Spatial-MM

Mitigating Hallucination with ZeroG: An Advanced Knowledge Management Engine

2411.05936v1 by Anantha Sharma, Sheeba Elizabeth John, Fatemeh Rezapoor Nikroo, Krupali Bhatt, Mrunal Zambre, Aditi Wikhe

The growth of digital documents presents significant challenges in efficient management and knowledge extraction. Traditional methods often struggle with complex documents, leading to issues such as hallucinations and high latency in responses from Large Language Models (LLMs). ZeroG, an innovative approach, significantly mitigates these challenges by leveraging knowledge distillation and prompt tuning to enhance model performance. ZeroG utilizes a smaller model that replicates the behavior of a larger teacher model, ensuring contextually relevant and grounded responses. By employing a black-box distillation approach, it creates a distilled dataset without relying on intermediate features, optimizing computational efficiency. This method significantly enhances accuracy and reduces response times, providing a balanced solution for modern document management. Incorporating advanced techniques for document ingestion and metadata utilization, ZeroG improves the accuracy of question-and-answer systems. The integration of graph databases and robust metadata management further streamlines information retrieval, allowing for precise and context-aware responses. By transforming how organizations interact with complex data, ZeroG enhances productivity and user experience, offering a scalable solution for the growing demands of digital document management.

摘要:數位文件成長帶來顯著的挑戰,包括有效管理和知識萃取。傳統方法經常難以處理複雜文件,導致問題,例如產生幻覺和大型語言模型 (LLM) 回應的高延遲。ZeroG 是一種創新的方法,透過利用知識蒸餾和提示調整來增強模型效能,大幅減輕這些挑戰。 ZeroG 使用較小的模型複製較大的教師模型的行為,透過採用黑盒蒸餾方法,確保在脈絡上相關且有根據的回應,它建立一個蒸餾的資料集,而不需要依賴中間特徵,最佳化運算效率。這種方法大幅提升準確度並減少回應時間,提供現代文件管理的平衡解決方案。 透過整合進階技術來擷取文件和使用元資料,ZeroG 改善問答系統的準確度。圖形資料庫和強健的元資料管理的整合進一步簡化資訊擷取,允許精確且符合脈絡的回應。透過轉換組織與複雜資料互動的方式,ZeroG 提升生產力和使用者體驗,提供可擴充的解決方案,以滿足數位文件管理日益增長的需求。

SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark

2411.05521v1 by Sithursan Sivasubramaniam, Cedric Osei-Akoto, Yi Zhang, Kurt Stockinger, Jonathan Fuerst

Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models have a big impact on query complexity and performance. While this has been a known fact in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context-learning (ICL) approaches for a set of representative closed and open-source LLMs. Our evaluation sheds light on the trade-offs between database models and query languages for different ICL strategies and LLMs. Last, SM3-Text-to-Query is easily extendable to additional query languages or real, standard-based patient databases.

摘要:電子健康紀錄 (EHR) 儲存在各種資料庫系統中,這些系統在異質儲存架構上具有不同的資料庫模型,例如關聯式資料庫、文件儲存或圖形資料庫。這些不同的資料庫模型對查詢複雜度和效能有很大的影響。雖然這在資料庫研究中已經是眾所周知的事實,但令人驚訝的是,它對日益增加的文字轉查詢系統的影響迄今尚未得到調查。在本文中,我們提出 SM3-Text-to-Query,這是第一個基於來自 Synthea 的合成患者資料的多模型醫療文字轉查詢基準,遵循 SNOMED-CT 分類法——一種廣泛使用的涵蓋醫學術語的知識圖譜本體。SM3-Text-to-Query 提供了關聯式資料庫 (PostgreSQL)、文件儲存 (MongoDB) 和圖形資料庫 (Neo4j 和 GraphDB (RDF)) 的資料表示,允許跨四種流行查詢語言(即 SQL、MQL、Cypher 和 SPARQL)進行評估。我們系統且手動開發了 408 個範本問題,我們擴充這些問題以構建一個基準,其中包含 10K 個針對這四種查詢語言的多樣化自然語言問題/查詢對(總共 40K 對)。在我們的資料集上,我們評估了幾種常見的代表性閉源和開源 LLM 的情境學習 (ICL) 方法。我們的評估揭示了不同 ICL 策略和 LLM 的資料庫模型和查詢語言之間的取捨。最後,SM3-Text-to-Query 可以輕鬆擴展到其他查詢語言或真實的基於標準的患者資料庫。

EUREKHA: Enhancing User Representation for Key Hackers Identification in Underground Forums

2411.05479v1 by Abdoul Nasser Hassane Amadou, Anas Motii, Saida Elouardi, EL Houcine Bergou

Underground forums serve as hubs for cybercriminal activities, offering a space for anonymity and evasion of conventional online oversight. In these hidden communities, malicious actors collaborate to exchange illicit knowledge, tools, and tactics, driving a range of cyber threats from hacking techniques to the sale of stolen data, malware, and zero-day exploits. Identifying the key instigators (i.e., key hackers) behind these operations is essential but remains a complex challenge. This paper presents a novel method called EUREKHA (Enhancing User Representation for Key Hacker Identification in Underground Forums), designed to identify these key hackers by modeling each user as a textual sequence. This sequence is processed through a large language model (LLM) for domain-specific adaptation, with LLMs acting as feature extractors. These extracted features are then fed into a Graph Neural Network (GNN) to model user structural relationships, significantly improving identification accuracy. Furthermore, we employ BERTopic (Bidirectional Encoder Representations from Transformers Topic Modeling) to extract personalized topics from user-generated content, enabling multiple textual representations per user and optimizing the selection of the most representative sequence. Our study demonstrates that fine-tuned LLMs outperform state-of-the-art methods in identifying key hackers. Additionally, when combined with GNNs, our model achieves significant improvements, resulting in approximately 6% and 10% increases in accuracy and F1-score, respectively, over existing methods. EUREKHA was tested on the Hack-Forums dataset, and we provide open-source access to our code.

摘要:地下論壇是網路犯罪活動的樞紐,提供匿名和規避傳統網路監督的空間。在這些隱藏的社群中,惡意行為者合作交換非法知識、工具和策略,推動從駭客技術到銷售竊取資料、惡意軟體和零時差漏洞的各種網路威脅。找出這些行動背後的關鍵煽動者(即關鍵駭客)至關重要,但仍然是一個複雜的挑戰。本文提出了一種稱為 EUREKHA(增強使用者表徵以識別地下論壇中的關鍵駭客)的新方法,旨在透過將每個使用者建模為文字序列來識別這些關鍵駭客。此序列透過大型語言模型(LLM)處理以進行特定領域的適應,其中 LLM 作為特徵萃取器。然後將這些萃取的特徵輸入圖神經網路(GNN)以建模使用者結構關係,大幅提升識別準確度。此外,我們採用 BERTopic(來自 Transformer 主題建模的雙向編碼器表徵)從使用者產生的內容中萃取個人化主題,為每個使用者啟用多個文字表徵,並最佳化最具代表性序列的選擇。我們的研究表明,微調後的 LLM 在識別關鍵駭客方面優於最先進的方法。此外,當與 GNN 結合使用時,我們的模型獲得顯著的提升,與現有方法相比,準確度和 F1 分數分別提高了約 6% 和 10%。EUREKHA 已在 Hack-Forums 資料集上進行測試,我們提供開源方式存取我們的程式碼。

When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization

2411.05882v1 by Jacob Nielsen, Lukas Galke, Peter Schneider-Kamp

Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.

摘要:當代機器學習模型(例如語言模型)功能強大, 但在訓練和推論時間上都需要大量的資源。已經證明,僅解碼器語言模型可以用三元權重(每個權重 1.58 位元)訓練到競爭狀態,促進有效率的推論。在此,我們從非Transformer模型架構開始探討,研究多層感知器和圖神經網路的 1.58 位元訓練。接著,我們探討其他基於Transformer的語言模型(即僅編碼器和編碼器-解碼器模型)的 1.58 位元訓練。我們的結果顯示,在所有這些設定中,1.58 位元訓練與標準 32/16 位元模型相當,有時甚至更好。
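
The "1.58 bits" figure in the abstract above is the information content of a three-valued weight, log2(3) ≈ 1.585. The sketch below is a hedged illustration of ternary quantization, assuming the absmean scheme commonly described for BitNet b1.58 (divide by the mean absolute weight, round, clip to {-1, 0, 1}). The abstract itself does not spell out the quantizer, and the function name and sample weights are hypothetical.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Quantize a list of floats to {-1, 0, 1} using an absmean scale (assumed scheme)."""
    # Scale is the mean absolute value; eps guards against an all-zero input.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


# Three possible states per weight carry log2(3) bits of information.
bits_per_weight = math.log2(3)

# Hypothetical weights for illustration.
q, scale = ternary_quantize([0.9, -0.05, 0.4, -1.3, 0.02])
```

At inference time the dequantized weight would be `scale * q[i]`, so a layer needs only one floating-point scale plus the ternary values.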

Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation

2411.05316v1 by Dong Shu, Bingbing Duan, Kai Guo, Kaixiong Zhou, Jiliang Tang, Mengnan Du

Latent representation alignment has become a foundational technique for constructing multimodal large language models (MLLM) by mapping embeddings from different modalities into a shared space, often aligned with the embedding space of large language models (LLMs) to enable effective cross-modal understanding. While preliminary protein-focused MLLMs have emerged, they have predominantly relied on heuristic approaches, lacking a fundamental understanding of optimal alignment practices across representations. In this study, we explore the alignment of multimodal representations between LLMs and Geometric Deep Models (GDMs) in the protein domain. We comprehensively evaluate three state-of-the-art LLMs (Gemma2-2B, LLaMa3.1-8B, and LLaMa3.1-70B) with four protein-specialized GDMs (GearNet, GVP, ScanNet, GAT). Our work examines alignment factors from both model and protein perspectives, identifying challenges in current alignment methodologies and proposing strategies to improve the alignment process. Our key findings reveal that GDMs incorporating both graph and 3D structural information align better with LLMs, larger LLMs demonstrate improved alignment capabilities, and protein rarity significantly impacts alignment performance. We also find that increasing GDM embedding dimensions, using two-layer projection heads, and fine-tuning LLMs on protein-specific data substantially enhance alignment quality. These strategies offer potential enhancements to the performance of protein-related multimodal models. Our code and data are available at https://github.com/Tizzzzy/LLM-GDM-alignment.

摘要:潛在表徵對齊已成為建構多模態大型語言模型 (MLLM) 的基礎技術,方法是將不同模態的嵌入映射到共享空間中,通常與大型語言模型 (LLM) 的嵌入空間對齊,以實現有效的跨模態理解。雖然初步以蛋白質為重點的 MLLM 已出現,但它們主要依賴啟發式方法,缺乏對跨表徵最佳對齊實務的基本理解。在本研究中,我們探討了蛋白質領域中 LLM 與幾何深度模型 (GDM) 之間的多模態表徵對齊。我們全面評估了三個最先進的 LLM(Gemma2-2B、LLaMa3.1-8B 和 LLaMa3.1-70B)與四個蛋白質專用 GDM(GearNet、GVP、ScanNet、GAT)。我們的研究從模型和蛋白質角度檢視對齊因素,識別當前對齊方法的挑戰,並提出改善對齊程序的策略。我們的關鍵發現顯示,同時包含圖形和 3D 結構資訊的 GDM 與 LLM 的對齊效果較佳,較大的 LLM 展現出更佳的對齊能力,而蛋白質的稀有性顯著影響對齊效能。我們還發現,增加 GDM 嵌入維度、使用兩層投影頭,以及針對蛋白質特定資料微調 LLM,可以大幅提升對齊品質。這些策略為蛋白質相關多模態模型的效能提供潛在的強化。我們的程式碼和資料可在 https://github.com/Tizzzzy/LLM-GDM-alignment 取得。

LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration

2411.05844v1 by Yukun Cao, Zengyi Gao, Zhiyang Li, Xike Xie, S Kevin Zhou

GraphRAG addresses significant challenges in Retrieval-Augmented Generation (RAG) by leveraging graphs with embedded knowledge to enhance the reasoning capabilities of Large Language Models (LLMs). Despite its promising potential, the GraphRAG community currently lacks a unified framework for fine-grained decomposition of the graph-based knowledge retrieval process. Furthermore, there is no systematic categorization or evaluation of existing solutions within the retrieval process. In this paper, we present LEGO-GraphRAG, a modular framework that decomposes the retrieval process of GraphRAG into three interconnected modules: subgraph-extraction, path-filtering, and path-refinement. We systematically summarize and classify the algorithms and neural network (NN) models relevant to each module, providing a clearer understanding of the design space for GraphRAG instances. Additionally, we identify key design factors, such as Graph Coupling and Computational Cost, that influence the effectiveness of GraphRAG implementations. Through extensive empirical studies, we construct high-quality GraphRAG instances using a representative selection of solutions and analyze their impact on retrieval and reasoning performance. Our findings offer critical insights into optimizing GraphRAG instance design, ultimately contributing to the advancement of more accurate and contextually relevant LLM applications.

摘要:GraphRAG 透過利用具嵌入知識的圖表來增強大型語言模型 (LLM) 的推理能力,解決了檢索增強生成 (RAG) 中的重大挑戰。儘管具有令人期待的潛力,但 GraphRAG 社群目前缺乏一個統一的架構,用於對基於圖表的知識檢索過程進行細粒度的分解。此外,在檢索過程中,現有解決方案並未進行系統性的分類或評估。在本文中,我們提出了 LEGO-GraphRAG,這是一個模組化架構,將 GraphRAG 的檢索過程分解為三個相互連接的模組:子圖萃取、路徑過濾和路徑精煉。我們系統性地總結和分類與每個模組相關的演算法和神經網路 (NN) 模型,提供對 GraphRAG 實例設計空間的更清晰理解。此外,我們找出影響 GraphRAG 實作有效性的關鍵設計因素,例如圖表耦合和運算成本。透過廣泛的經驗研究,我們使用具代表性的解決方案選擇來建構高品質的 GraphRAG 實例,並分析它們對檢索和推理效能的影響。我們的研究結果提供了優化 GraphRAG 實例設計的重要見解,最終有助於推進更準確且與脈絡相關的 LLM 應用。

MEG: Medical Knowledge-Augmented Large Language Models for Question Answering

2411.03883v2 by Laura Cabello, Carmen Martin-Turrero, Uchenna Akujuobi, Anders Søgaard, Carlos Bobed

Question answering is a natural language understanding task that involves reasoning over both explicit context and unstated, relevant domain knowledge. Large language models (LLMs), which underpin most contemporary question answering systems, struggle to induce how concepts relate in specialized domains such as medicine. Existing medical LLMs are also costly to train. In this work, we present MEG, a parameter-efficient approach for medical knowledge-augmented LLMs. MEG uses a lightweight mapping network to integrate graph embeddings into the LLM, enabling it to leverage external knowledge in a cost-effective way. We evaluate our method on four popular medical multiple-choice datasets and show that LLMs greatly benefit from the factual grounding provided by knowledge graph embeddings. MEG attains an average of +10.2% accuracy over the Mistral-Instruct baseline, and +6.7% over specialized models like BioMistral. We also show results based on Llama-3. Finally, we show that MEG's performance remains robust to the choice of graph encoder.

摘要:問答是自然語言理解任務,涉及對明確的上下文和未說明的相關領域知識進行推理。支撐大多數當代問答系統的大型語言模型 (LLM) 難以推論概念如何在醫學等專業領域中關聯。現有的醫學 LLM 訓練成本也很高。在這項工作中,我們提出了 MEG,這是一種用於醫學知識增強 LLM 的參數有效方法。MEG 使用輕量級映射網路將圖表嵌入整合到 LLM 中,使其能夠以經濟有效的方式利用外部知識。我們在四個流行的醫學多選題資料集上評估了我們的方法,並表明 LLM 從知識圖表嵌入提供的實際依據中受益匪淺。MEG 在 Mistral-Instruct 基準上平均提高了 +10.2% 的準確度,在 BioMistral 等專門模型上提高了 +6.7%。我們還展示了基於 Llama-3 的結果。最後,我們表明 MEG 的性能對圖表編碼器的選擇保持穩健。

The American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge

2411.03568v1 by Lee Kezar, Nidhi Munikote, Zian Zeng, Zed Sehyr, Naomi Caselli, Jesse Thomason

Language models for American Sign Language (ASL) could make language technologies substantially more accessible to those who sign. To train models on tasks such as isolated sign recognition (ISR) and ASL-to-English translation, datasets provide annotated video examples of ASL signs. To facilitate the generalizability and explainability of these models, we introduce the American Sign Language Knowledge Graph (ASLKG), compiled from twelve sources of expert linguistic knowledge. We use the ASLKG to train neuro-symbolic models for three ASL understanding tasks, achieving accuracies of 91% on ISR, 14% for predicting the semantic features of unseen signs, and 36% for classifying the topic of YouTube-ASL videos.

摘要:美國手語 (ASL) 的語言模型可以讓語言技術對手語使用者更易於使用。為了訓練模型執行手語辨識 (ISR) 和 ASL 轉換成英文等任務,資料集提供 ASL 手勢的註解影片範例。為了促進這些模型的概括性和可解釋性,我們引入了美國手語知識圖譜 (ASLKG),它是由十二個專家語言知識來源編譯而成的。我們使用 ASLKG 訓練神經符號模型來執行 3 項 ASL 理解任務,在 ISR 上達到 91% 的準確度、在預測未見手勢的語義特徵上達到 14%,以及在分類 YouTube-ASL 影片主題上達到 36%。

Graph-DPEP: Decomposed Plug and Ensemble Play for Few-Shot Document Relation Extraction with Graph-of-Thoughts Reasoning

2411.02864v1 by Tao Zhang, Ning Yan, Masood Mortazavi, Hoang H. Nguyen, Zhongfen Deng, Philip S. Yu

Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning capability on many NLP tasks. Recasting an NLP task into a text-to-text generation task is a common practice so that generative LLMs can be prompted to resolve it. However, performing document-level relation extraction (DocRE) with generative LLMs is still challenging due to the structured output format of DocRE, which complicates the conversion to plain text. The limited information available in few-shot samples and prompt instructions induces further difficulties in extracting relations for entities mentioned in a document. In this paper, we represent the structured output as a graph-style triplet rather than natural language expressions and leverage generative LLMs for the DocRE task. Our approach, the Graph-DPEP framework, is grounded in the reasoning behind triplet explanation thoughts presented in natural language. In this framework, we first introduce a "decomposed-plug" method for performing generation from LLMs over prompts with type-space decomposition to alleviate the burden of distinguishing all relation types. Second, we employ a verifier for calibrating the generation and identifying overlooked query entity pairs. Third, we develop "ensemble-play", reapplying generation on the entire type list by leveraging the reasoning thoughts embedded in a sub-graph associated with the missing query pair to address the missingness issue. Through extensive comparisons with existing prompt techniques and alternative LLMs, our framework demonstrates superior performance on publicly available benchmarks.

摘要:大型語言模型 (LLM) 在海量語料庫上預先訓練,已在許多自然語言處理任務上展現出令人印象深刻的少量樣本學習能力。將自然語言處理任務轉化為文字到文字的生成任務是一種常見做法,這樣生成式大型語言模型就可以提示解決它。然而,由於 DocRE 的結構化輸出格式,使用生成式大型語言模型來執行文件級別關係萃取 (DocRE) 任務仍然具有挑戰性,這使得轉換為純文字變得複雜。少量樣本和提示說明中可用的資訊有限,會導致在文件中提到實體的關係萃取中產生進一步的困難和挑戰。在本文中,我們將結構化輸出表示為圖形樣式的三元組,而不是自然語言表達,並利用生成式大型語言模型來執行 DocRE 任務。我們的做法,圖形 DPEP 框架,是基於自然語言中呈現的三元組解釋思想背後的推理。在這個框架中,我們首先介紹一種「分解插入」方法,用於對具有類型空間分解的提示進行大型語言模型生成,以減輕區分所有關係類型的負擔。其次,我們使用驗證器來校準生成並識別被忽略的查詢實體對。第三,我們開發「整體遊戲」,通過利用與遺失查詢對相關的子圖中嵌入的推理思想,在整個類型列表上重新應用生成,以解決遺失問題。通過與現有提示技術和替代語言模型 (LLM) 的廣泛比較,我們的框架在實驗中證明了在公開基準上的優異性能。

Multimodal Commonsense Knowledge Distillation for Visual Question Answering

2411.02722v1 by Shuo Yang, Siwen Luo, Soyeon Caren Han

Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performance in general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge due to the challenges in generating high-quality prompts and the high computational costs of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects, and questions through a Graph Convolutional Network (GCN) in a teacher-student setting. The proposed framework works with any type of teacher and student model without further fine-tuning and achieves competitive performance on the ScienceQA dataset.

摘要:現有的多模態大型語言模型 (MLLM) 和視覺語言預訓練模型 (VLPM) 在一般的視覺問答 (VQA) 中展現了卓越的表現。然而,這些模型在需要外部常識知識的 VQA 問題上會遇到困難,原因在於產生高品質提示的挑戰以及微調的高運算成本。在這項工作中,我們提出了一個新穎的基於圖形的模態常識知識萃取架構,透過圖形卷積網路 (GCN) 在常識知識、視覺物件和問題上建構一個統一的關聯圖形,遵循師生環境。這個提出的架構對於任何類型的教師和學生模型都具有彈性,無需進一步微調,並在 ScienceQA 資料集上取得了有競爭力的表現。

Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography

2411.02591v1 by Harshavardhana T. Gowda, Zachary D. McNaughton, Lee M. Miller

Each year, millions of individuals lose the ability to speak intelligibly due to causes such as neuromuscular disease, stroke, trauma, and head/neck cancer surgery (e.g. laryngectomy) or treatment (e.g. radiotherapy toxicity to the speech articulators). Effective communication is crucial for daily activities, and losing the ability to speak leads to isolation, depression, anxiety, and a host of detrimental sequelae. Noninvasive surface electromyography (sEMG) has shown promise to restore speech output in these individuals. The goal is to collect sEMG signals from multiple articulatory sites as people silently produce speech and then decode the signals to enable fluent and natural communication. Currently, many fundamental questions about orofacial neuromuscular signals relating to speech articulation remain unanswered. They include questions relating to 1) the data structure of the orofacial sEMG signals, 2) the signal distribution shift of sEMG across individuals, 3) the ability of sEMG signals to span the entire English language phonetic space during silent speech articulations, and 4) the generalization capability of non-invasive sEMG-based silent speech interfaces. We address these questions through a series of experiments involving healthy human subjects. We show that sEMG signals exhibit a graph data structure and that the signal distribution shift is given by a change of basis. Furthermore, we show that silently voiced articulations spanning the entire English language phonetic space can be decoded using small neural networks that can be trained with little data, and that such architectures work well across individuals. To ensure transparency and reproducibility, we open-source all the data and code used in this study.

摘要:每年,數百萬人因為神經肌肉疾病、中風、創傷和頭頸癌手術(例如喉切除術)或治療(例如放射治療對言語發音器官的毒性)等原因而失去清晰說話的能力。有效的溝通對於日常生活至關重要,而失去說話能力會導致孤立、沮喪、焦慮和一系列有害的後遺症。非侵入性表面肌電圖 (sEMG) 已顯示出恢復這些人說話輸出的希望。目標是從多個發音部位收集 sEMG 信號,因為人們在無聲地發出言語,然後解碼信號以實現流利和自然的溝通。目前,許多與言語發音有關的顏面神經肌肉信號的基本特性仍然沒有得到解答。它們包括與 1) 顏面 sEMG 信號的數據結構、2) sEMG 在個體間的信號分佈轉移、3) sEMG 信號在無聲言語發音過程中跨越整個英語語言音標空間的能力以及 4) 基於非侵入性 sEMG 的無聲言語介面的概括能力相關的問題。我們通過一系列涉及健康人類受試者的實驗來解決這些問題。我們表明 sEMG 信號證明圖數據結構,並且信號分佈轉移是由基變化的給出。此外,我們表明使用可以通過少量數據訓練的小神經網路可以解碼跨越整個英語語言音標空間的無聲發音,並且此類架構在不同個體之間運行良好。為了確保透明度和可重現性,我們公開了本研究中使用的所有數據和代碼。

GraphXAIN: Narratives to Explain Graph Neural Networks

2411.02540v2 by Mateusz Cedro, David Martens

Graph Neural Networks (GNNs) are a powerful technique for machine learning on graph-structured data, yet they pose interpretability challenges, especially for non-expert users. Existing GNN explanation methods often yield technical outputs such as subgraphs and feature importance scores, which are not easily understood. Building on recent insights from social science and other Explainable AI (XAI) methods, we propose GraphXAIN, a natural language narrative that explains individual predictions made by GNNs. We present a model-agnostic and explainer-agnostic XAI approach that complements graph explainers by generating GraphXAINs, using Large Language Models (LLMs) and integrating graph data, individual predictions from GNNs, explanatory subgraphs, and feature importances. We define XAI Narratives and XAI Descriptions, highlighting their distinctions and emphasizing the importance of narrative principles in effective explanations. By incorporating natural language narratives, our approach supports graph practitioners and non-expert users, aligning with social science research on explainability and enhancing user understanding and trust in complex GNN models. We demonstrate GraphXAIN's capabilities on a real-world graph dataset, illustrating how its generated narratives can aid understanding compared to traditional graph explainer outputs or other descriptive explanation methods.

摘要:圖形神經網路 (GNN) 是用於圖形結構資料的機器學習強大技術,但它們會造成可解釋性挑戰,特別是對於非專家使用者。現有的 GNN 解釋方法通常會產生技術輸出,例如子圖和特徵重要性分數,這些輸出不容易理解。建構於社會科學和其他可解釋 AI (XAI) 方法的最新見解,我們提出 GraphXAIN,這是一種自然語言敘述,可以解釋 GNN 做出的個別預測。我們提出一個與模型無關且與解釋器無關的 XAI 方法,它透過使用大型語言模型 (LLM) 和整合圖形資料、GNN 的個別預測、說明性子圖和特徵重要性來補充圖形解釋器,進而產生 GraphXAIN。我們定義 XAI 敘述和 XAI 描述,強調它們的區別,並強調敘述原則在有效解釋中的重要性。透過結合自然語言敘述,我們的做法支援圖形從業者和非專家使用者,與可解釋性的社會科學研究保持一致,並增強使用者對複雜 GNN 模型的理解和信任。我們在真實世界圖形資料集上展示 GraphXAIN 的功能,說明與傳統圖形解釋器輸出或其他描述性解釋方法相比,其產生的敘述如何有助於理解。

Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models

2411.02382v1 by Guangzhi Xiong, Eric Xie, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang

Large language models (LLMs) have demonstrated remarkable capabilities in various scientific domains, from natural language processing to complex problem-solving tasks. Their ability to understand and generate human-like text has opened up new possibilities for advancing scientific research, enabling tasks such as data analysis, literature review, and even experimental design. One of the most promising applications of LLMs in this context is hypothesis generation, where they can identify novel research directions by analyzing existing knowledge. However, despite their potential, LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect. Such a problem presents significant challenges in scientific fields that demand rigorous accuracy and verifiability, potentially leading to erroneous or misleading conclusions. To overcome these challenges, we propose KG-CoI (Knowledge Grounded Chain of Ideas), a novel system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs (KGs). KG-CoI guides LLMs through a structured reasoning process, organizing their output as a chain of ideas (CoI), and includes a KG-supported module for the detection of hallucinations. With experiments on our newly constructed hypothesis generation dataset, we demonstrate that KG-CoI not only improves the accuracy of LLM-generated hypotheses but also reduces the hallucination in their reasoning chains, highlighting its effectiveness in advancing real-world scientific research.

摘要:大型語言模型 (LLM) 已在各種科學領域展現卓越的能力,從自然語言處理到複雜的解決問題任務。它們理解和產生類似人類文字的能力為推進科學研究開啟了新的可能性,讓資料分析、文獻回顧,甚至實驗設計等任務成為可能。LLM 在此脈絡中最有希望的應用之一是假設產生,它們能透過分析現有知識來找出新的研究方向。然而,儘管 LLM 具有潛力,它們卻容易產生「幻覺」,也就是聽起來合理但事實上不正確的輸出。此類問題在需要嚴謹準確性和可驗證性的科學領域中會造成重大挑戰,有可能導致錯誤或誤導性的結論。為了克服這些挑戰,我們提出 KG-CoI(知識基礎觀念鏈),這是一個創新的系統,它透過整合知識圖譜 (KG) 中的外部結構化知識來增強 LLM 假設產生。KG-CoI 引導 LLM 進行結構化推理程序,將其輸出整理成觀念鏈 (CoI),並包含一個由 KG 支援的模組來偵測幻覺。透過我們新建立的假設產生資料集進行的實驗,我們證明 KG-CoI 不僅改善了 LLM 產生的假設的準確性,也減少了其推理鏈中的幻覺,突顯了其在推進現實世界科學研究中的效能。

Can Language Models Enable In-Context Database?

2411.01807v1 by Yu Pan, Hongfeng Yu, Tianjiao Zhao, Jianxin Sun

Large language models (LLMs) are emerging as few-shot learners capable of handling a variety of tasks, including comprehension, planning, reasoning, question answering, arithmetic calculations, and more. At the core of these capabilities is LLMs' proficiency in representing and understanding structured or semi-structured data, such as tables and graphs. Numerous studies have demonstrated that reasoning on tabular data or graphs is not only feasible for LLMs but also suggests a promising research direction that treats these data as in-context data. The lightweight and human-readable characteristics of an in-context database can potentially make it an alternative to the traditional database in typical RAG (Retrieval-Augmented Generation) settings. However, almost all current work focuses on static in-context data, which does not allow dynamic updates. In this paper, to enable dynamic database updates, we propose delta encoding of the database. We explore how data stored in a traditional RDBMS can be encoded as in-context text and evaluate LLMs' proficiency for CRUD (Create, Read, Update and Delete) operations on in-context databases. We present a benchmark named InConDB and conduct extensive experiments to show the performance of different language models in enabling in-context databases by varying the database encoding method, prompting method, operation type, and input data distribution, revealing both their proficiency and limitations.

摘要:大型語言模型 (LLM) 逐漸成為僅需少量範例就能處理各種任務的學習者,包括理解、規劃、推理、問答、算術計算等。這些能力的核心是 LLM 在表示和理解結構化或半結構化資料(例如表格和圖形)方面的能力。許多研究已證明,LLM 不僅可以推論表格資料或圖形,還提供了一個有前景的研究方向,將這些資料視為語境資料。語境資料庫的輕量級和人類可讀取特性有可能使其成為典型 RAG(檢索擴充生成)設定中傳統資料庫的替代方案。然而,幾乎所有目前的工作都專注於靜態語境資料,這不允許動態更新。在本文中,為了實現動態資料庫更新,提出了資料庫的 delta 編碼。我們探討了如何將儲存在傳統 RDBMS 中的資料編碼為語境文字,並評估 LLM 在語境資料庫上進行 CRUD(建立、讀取、更新和刪除)操作的能力。提出了名為 InConDB 的基準,並進行了廣泛的實驗,以顯示不同語言模型在通過改變資料庫編碼方法、提示方法、操作類型和輸入資料分佈來啟用語境資料庫方面的效能,揭示了能力和限制。
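
To make the idea of a delta-encoded in-context database more concrete, here is a minimal sketch. The text format (`TABLE ...` header, `DELTA <op> ...` lines), the function names, and the sample `users` table are all hypothetical; the abstract above does not specify the InConDB encoding. The sketch serializes a table as prompt text, appends updates as delta lines instead of rewriting the table, and replays the deltas in plain Python to obtain a reference state against which a model's answers could be checked.

```python
def encode_table(name, columns, rows):
    """Serialize a table as plain text suitable for an LLM prompt (assumed format)."""
    header = f"TABLE {name} ({', '.join(columns)})"
    lines = [", ".join(str(v) for v in row) for row in rows]
    return "\n".join([header, *lines])


def encode_delta(op, name, values):
    """Serialize one update as a single appended line (assumed format)."""
    return f"DELTA {op} {name}: " + ", ".join(str(v) for v in values)


def apply_deltas(rows, deltas):
    """Replay deltas to get the ground-truth state; the first column is the key."""
    state = {row[0]: tuple(row) for row in rows}
    for op, row in deltas:
        if op == "DELETE":
            state.pop(row[0], None)
        else:  # INSERT and UPDATE both set the row for that key
            state[row[0]] = tuple(row)
    return list(state.values())


# Hypothetical data for illustration.
columns = ["id", "name", "age"]
rows = [(1, "Ada", 36), (2, "Bob", 41)]
deltas = [("UPDATE", (2, "Bob", 42)), ("INSERT", (3, "Cy", 29)), ("DELETE", (1,))]

prompt = "\n".join(
    [encode_table("users", columns, rows)]
    + [encode_delta(op, "users", row) for op, row in deltas]
)
final_rows = apply_deltas(rows, deltas)
```

Appending deltas keeps the original table text unchanged, so an update costs one extra prompt line rather than a full re-encoding.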

Graph-based Confidence Calibration for Large Language Models

2411.02454v1 by Yukun Li, Sijia Wang, Lifu Huang, Li-Ping Liu

One important approach to improving the reliability of large language models (LLMs) is to provide accurate confidence estimations regarding the correctness of their answers. However, developing a well-calibrated confidence estimation model is challenging, as mistakes made by LLMs can be difficult to detect. We propose a novel method combining the LLM's self-consistency with labeled data and training an auxiliary model to estimate the correctness of its responses to questions. This auxiliary model predicts the correctness of responses based solely on their consistent information. To set up the learning problem, we use a weighted graph to represent the consistency among the LLM's multiple responses to a question. Correctness labels are assigned to these responses based on their similarity to the correct answer. We then train a graph neural network to estimate the probability of correct responses. Experiments demonstrate that the proposed approach substantially outperforms several of the most recent methods in confidence calibration across multiple widely adopted benchmark datasets. Furthermore, the proposed approach significantly improves the generalization capability of confidence calibration on out-of-domain (OOD) data.

摘要:一種改善大型語言模型 (LLM) 可靠性的重要方法是提供有關其答案正確性的準確信心估計。然而,開發一個校準良好的信心估計模型具有挑戰性,因為 LLM 所犯的錯誤可能難以偵測。我們提出一個新方法,結合 LLM 的自我一致性與標籤資料,並訓練一個輔助模型來估計其對問題的回應正確性。這個輔助模型僅根據其一致性資訊來預測回應的正確性。為了設定學習問題,我們使用一個加權圖形來表示 LLM 對一個問題的多次回應之間的一致性。正確性標籤會根據這些回應與正確答案的相似性分配給這些回應。然後,我們訓練一個圖形神經網路來估計正確回應的機率。實驗證明,所提出的方法在多個廣泛採用的基準資料集上,在信心校準方面明顯優於多種最新方法。此外,所提出的方法顯著改善了在領域外 (OOD) 資料上信心校準的泛化能力。
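
The weighted consistency graph described above can be sketched as follows. This is only an illustration under stated assumptions: token-level Jaccard overlap stands in for whatever similarity measure the authors use, the sample responses are invented, and the weighted degree computed at the end is a simple hand-made feature, not the trained graph neural network from the paper.

```python
def jaccard(a, b):
    """Token-set overlap between two responses (assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def consistency_graph(responses):
    """Return {(i, j): weight} for every unordered pair of sampled responses."""
    n = len(responses)
    return {
        (i, j): jaccard(responses[i], responses[j])
        for i in range(n)
        for j in range(i + 1, n)
    }


def weighted_degree(edges, n):
    """Sum of edge weights per node: how strongly each response agrees with the rest."""
    degree = [0.0] * n
    for (i, j), w in edges.items():
        degree[i] += w
        degree[j] += w
    return degree


# Hypothetical samples from one LLM for a single question.
responses = ["Paris", "paris", "Lyon", "Paris"]
edges = consistency_graph(responses)
degree = weighted_degree(edges, len(responses))
```

In this toy case the outlier answer receives degree zero, which hints at why agreement structure can carry a useful confidence signal.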

Ontology Population using LLMs

2411.01612v1 by Sanaz Saki Norouzi, Adrita Barua, Antrea Christou, Nikita Gautam, Andrew Eells, Pascal Hitzler, Cogan Shimizu

Knowledge graphs (KGs) are increasingly utilized for data integration, representation, and visualization. While KG population is critical, it is often costly, especially when data must be extracted from unstructured natural language text, which presents challenges such as ambiguity and complex interpretations. Large Language Models (LLMs) offer promising capabilities for such tasks, excelling in natural language understanding and content generation. However, their tendency to "hallucinate" can produce inaccurate outputs. Despite these limitations, LLMs offer rapid and scalable processing of natural language data, and with prompt engineering and fine-tuning, they can approximate human-level performance in extracting and structuring data for KGs. This study investigates LLM effectiveness for KG population, focusing on the Enslaved.org Hub Ontology. We report that, compared to the ground truth, LLMs can extract ~90% of triples when provided with a modular ontology as guidance in the prompts.

摘要:知識圖譜 (KG) 愈來愈多用於資料整合、表示和視覺化。儘管 KG 填充至關重要,但它通常很昂貴,特別是在必須從自然語言中非結構化文字中提取資料時,這會帶來挑戰,例如歧義和複雜的詮釋。大型語言模型 (LLM) 為此類任務提供了有前景的能力,擅長自然語言理解和內容生成。然而,它們「產生幻覺」的傾向可能會產生不準確的輸出。儘管有這些限制,LLM 提供了自然語言資料的快速且可擴充處理,並且透過提示工程和微調,它們可以近似人類層級的效能,以提取和建構 KG 的資料。本研究調查 LLM 對 KG 填充的有效性,重點關注 Enslaved.org Hub Ontology。在本文中,我們報告與真實情況相比,當在提示中提供模組化本体作為指導時,LLM 可以提取約 90% 的三元組。

Pre-trained Molecular Language Models with Random Functional Group Masking

2411.01401v1 by Tianhao Peng, Yuchen Li, Xuhong Li, Jiang Bian, Zeke Xie, Ning Sui, Shahid Mumtaz, Yanwu Xu, Linghe Kong, Haoyi Xiong

Recent advancements in computational chemistry have leveraged the power of transformer-based language models, such as MoLFormer, pre-trained on a vast number of simplified molecular-input line-entry system (SMILES) sequences, to understand and predict molecular properties and activities, a critical step in fields like drug discovery and materials science. To further improve performance, researchers have introduced graph neural networks with graph-based molecular representations, such as GEM, incorporating the topology, geometry, and 2D or even 3D structures of molecules into pre-training. Since most molecular graphs in existing studies were automatically converted from SMILES sequences, it is reasonable to assume that transformer-based language models might be able to implicitly learn structure-aware representations from SMILES sequences. In this paper, we propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular Functional Groups to incorporate structure information of atoms during the pre-training phase. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities. Extensive experimental evaluations across 11 benchmark classification and regression tasks in the chemical domain demonstrate the robustness and superiority of our model. Our findings reveal that our model outperforms existing pre-training models, whether based on SMILES or graphs, in 9 out of the 11 downstream tasks, ranking as a close second in the remaining ones.

摘要:計算化學的近期進展已利用轉換器語言模型的力量,例如 MoLFormer,使用大量簡化分子輸入線條輸入系統 (SMILES) 序列進行預訓練,以了解和預測分子特性和活性,這是藥物發現和材料科學等領域的重要步驟。為了進一步提升效能,研究人員引入了具有圖形為基礎的分子表示的圖形神經網路,例如 GEM,將分子的拓樸、幾何、2D 甚至 3D 結構納入預訓練中。雖然現有研究中的大多數分子圖形都是從 SMILES 序列自動轉換而來的,但可以假設基於轉換器的語言模型可能能夠從 SMILES 序列中隱式學習結構感知表示。在本文中,我們提出 MLM-FG,一個基於 SMILES 的分子語言模型 (Molecular Language Model),它隨機遮蔽對應於特定分子官能基 (Functional Groups) 的 SMILES 子序列,以在預訓練階段納入原子的結構資訊。此技術旨在強制模型更好地推斷分子結構和特性,從而增強其預測能力。在化學領域的 11 個基準分類和回歸任務中進行的廣泛實驗評估證明了 MLM-FG 的穩健性和優越性。我們的研究結果顯示,MLM-FG 在 11 個下游任務中的 9 個任務中優於現有的預訓練模型(基於 SMILES 或圖形),在剩下的任務中排名第二。
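
The masking idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the fragment list, the `[MASK]` token, and both function names are assumptions made for illustration, and plain substring matching is a naive stand-in for proper functional-group detection on a parsed molecule.

```python
import random

# Hypothetical, hand-picked SMILES fragments standing in for functional groups.
# A real pipeline would detect groups on the parsed molecule, not by substring.
FUNCTIONAL_GROUPS = ["C(=O)O", "C(=O)N", "C#N", "N(=O)=O"]
MASK = "[MASK]"  # assumed mask token


def find_group_spans(smiles, groups=FUNCTIONAL_GROUPS):
    """Return sorted, non-overlapping (start, end) spans; longer fragments win."""
    spans = []
    for frag in sorted(groups, key=len, reverse=True):
        start = smiles.find(frag)
        while start != -1:
            end = start + len(frag)
            # Keep the span only if it does not overlap one already chosen.
            if all(end <= s or start >= e for s, e in spans):
                spans.append((start, end))
            start = smiles.find(frag, start + 1)
    return sorted(spans)


def mask_functional_groups(smiles, mask_prob=0.5, rng=None):
    """Replace each detected fragment with MASK with probability mask_prob."""
    rng = rng or random.Random()
    out, cursor = [], 0
    for start, end in find_group_spans(smiles):
        out.append(smiles[cursor:start])
        out.append(MASK if rng.random() < mask_prob else smiles[start:end])
        cursor = end
    out.append(smiles[cursor:])
    return "".join(out)
```

Calling `mask_functional_groups("CC(=O)OC1=CC=CC=C1C(=O)O", mask_prob=1.0)` masks both matched fragments, while `mask_prob=0.0` returns the string unchanged. Passing a seeded `random.Random` makes the masking reproducible.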

Narrative Analysis of True Crime Podcasts With Knowledge Graph-Augmented Large Language Models

2411.02435v1 by Xinyi Leng, Jason Liang, Jack Mauro, Xu Wang, Andrea L. Bertozzi, James Chapman, Junyuan Lin, Bohan Chen, Chenchen Ye, Temple Daniel, P. Jeffrey Brantingham

Narrative data spans all disciplines and provides a coherent model of the world to the reader or viewer. Recent advances in machine learning and Large Language Models (LLMs) have enabled great strides in analyzing natural language. However, LLMs still struggle with complex narrative arcs as well as narratives containing conflicting information. Recent work indicates that LLMs augmented with external knowledge bases can improve the accuracy and interpretability of the resulting models. In this work, we analyze the effectiveness of applying knowledge graphs (KGs) in understanding true-crime podcast data from both classical Natural Language Processing (NLP) and LLM approaches. We directly compare KG-augmented LLMs (KGLLMs) with classical methods for KG construction, topic modeling, and sentiment analysis. Additionally, the KGLLM allows us to query the knowledge base in natural language and test its ability to factually answer questions. We examine the robustness of the model to adversarial prompting in order to test the model's ability to deal with conflicting information. Finally, we apply classical methods to understand more subtle aspects of the text such as the use of hearsay and sentiment in narrative construction and propose future directions. Our results indicate that KGLLMs outperform LLMs on a variety of metrics, are more robust to adversarial prompts, and are more capable of summarizing the text into topics.

摘要:敘事資料涵蓋所有學科,並為讀者或觀眾提供一個連貫的世界模型。機器學習和大型語言模型 (LLM) 的最新進展在分析自然語言方面取得了長足的進步。然而,大型語言模型 (LLM) 仍然難以應付複雜的敘事弧線以及包含相互矛盾資訊的敘事。最近的研究表明,使用外部知識庫增強的 LLM 可以提高所產生模型的準確性和可解釋性。在這項工作中,我們分析了在從傳統自然語言處理 (NLP) 和 LLM 方法中理解真實犯罪播客資料時,應用知識圖譜 (KG) 的有效性。我們直接比較了 KG 增強的 LLM (KGLLM) 與用於 KG 建構、主題建模和情緒分析的傳統方法。此外,KGLLM 允許我們以自然語言查詢知識庫,並測試其事實回答問題的能力。我們檢查了模型對對抗性提示的穩健性,以測試模型處理相互矛盾資訊的能力。最後,我們應用傳統方法來理解文本的更細微方面,例如在敘事建構中使用道聽途說和情緒,並提出未來的方向。我們的結果表明,KGLLM 在各種指標上優於 LLM,對對抗提示更穩健,並且更能夠將文本總結為主題。

WLPlan: Relational Features for Symbolic Planning

2411.00577v1 by Dillon Z. Chen

Scalable learning for planning research generally involves juggling different programming languages to handle learning and planning modules effectively. Interpreted languages such as Python are commonly used for learning routines due to their ease of use and the abundance of well-maintained learning libraries available for them, while compiled languages such as C++ are used for planning routines due to their optimised resource usage. Motivated by the need for tools for developing scalable learning planners, we introduce WLPlan, a C++ package with Python bindings which implements recent promising work for automatically generating relational features of planning tasks. Such features can be used for any downstream routine, such as learning domain control knowledge or probing and understanding planning tasks. More specifically, WLPlan provides functionality for (1) transforming planning tasks into graphs, and (2) embedding planning graphs into feature vectors via graph kernels. The source code and instructions for the installation and usage of WLPlan are available at tinyurl.com/42kymswc

摘要:可擴充的學習規劃研究通常需要在不同的程式語言之間切換,才能有效地處理學習和規劃模組。例如 Python 等直譯語言通常用於學習常式,因為它們易於使用,且有許多維護完善的學習函式庫;而例如 C++ 等編譯語言則用於規劃常式,因為它們能最佳化資源使用。由於需要開發可擴充學習規劃器的工具,我們引進了 WLPlan,這是一個具有 Python 繫結的 C++ 套件,實作了近期有前途的自動產生規劃任務關係特徵的工作。此類特徵可用於任何下游常式,例如學習領域控制知識或探測和理解規劃任務。更具體地說,WLPlan 提供了以下功能:(1) 將規劃任務轉換為圖形,以及 (2) 透過圖形核將規劃圖形嵌入特徵向量。WLPlan 的原始碼和安裝及使用說明可在 tinyurl.com/42kymswc 取得
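
Step (2) above, embedding a graph into a feature vector via a graph kernel, can be sketched with plain Weisfeiler-Leman colour refinement. This is a generic illustration and does not use or reproduce the WLPlan API: `wl_features` and `to_vector` are hypothetical names, and the input is assumed to be an undirected graph given as an adjacency dict.

```python
from collections import Counter


def wl_features(adjacency, labels, iterations=2):
    """Count node colours over `iterations` rounds of WL colour refinement.

    adjacency: dict mapping node -> iterable of neighbour nodes (undirected).
    labels: dict mapping node -> initial label (any hashable value).
    Returns a Counter mapping each colour seen to its number of occurrences.
    """
    colours = dict(labels)
    features = Counter(colours.values())
    for _ in range(iterations):
        new_colours = {}
        for node in adjacency:
            # A node's new colour combines its own colour with the sorted
            # multiset of neighbour colours (repr makes mixed types sortable).
            neighbour_colours = sorted(repr(colours[n]) for n in adjacency[node])
            new_colours[node] = (colours[node], tuple(neighbour_colours))
        colours = new_colours
        features.update(colours.values())
    return features


def to_vector(features, vocabulary):
    """Project a colour histogram onto a fixed, ordered colour vocabulary."""
    return [features.get(colour, 0) for colour in vocabulary]
```

Because colours depend only on labels and neighbourhood structure, two isomorphic graphs with the same labels produce the same histogram. The vocabulary would typically be collected from the training graphs, so colours unseen at training time are simply dropped.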

GRS-QA -- Graph Reasoning-Structured Question Answering Dataset

2411.00369v3 by Anish Pahilajani, Devasha Trivedi, Jincen Shuai, Khin S. Yone, Samyak Rajesh Jain, Namyong Park, Ryan A. Rossi, Nesreen K. Ahmed, Franck Dernoncourt, Yu Wang

Large Language Models (LLMs) have excelled in multi-hop question-answering (M-QA) due to their advanced reasoning abilities. However, the impact of the inherent reasoning structures on LLM M-QA performance remains unclear, largely due to the absence of QA datasets that provide fine-grained reasoning structures. To address this gap, we introduce the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs. Unlike existing M-QA datasets, where different reasoning structures are entangled together, GRS-QA explicitly captures intricate reasoning pathways by constructing reasoning graphs, where nodes represent textual contexts and edges denote logical flows. These reasoning graphs of different structures enable a fine-grained evaluation of LLM reasoning capabilities across various reasoning structures. Our empirical analysis reveals that LLMs perform differently when handling questions with varying reasoning structures. This finding facilitates the exploration of textual structures as compared with semantics.

摘要:大型語言模型 (LLM) 由於其先進的推理能力,在多跳問答 (M-QA) 中表現出色。然而,固有推理結構對 LLM M-QA 效能的影響仍不清楚,這主要是由於缺乏提供細粒度推理結構的 QA 資料集。為了解決這個差距,我們引入了圖形推理結構化問答資料集 (GRS-QA),其中包含語義脈絡和 QA 對應的推理結構。與現有的 M-QA 資料集不同,其中不同的推理結構糾纏在一起,GRS-QA 透過建構推理圖形明確捕捉複雜的推理路徑,其中節點表示文字脈絡,邊緣表示邏輯流程。這些不同結構的推理圖形能夠細緻地評估 LLM 在各種推理結構中的推理能力。我們的實證分析顯示,LLM 在處理具有不同推理結構的問題時表現不同。這個發現促進了對文字結構與語義的比較探索。

Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes

2411.02523v1 by Balu Bhasuran, Qiao Jin, Yuzhang Xie, Carl Yang, Karim Hanna, Jennifer Costa, Cindy Shavor, Zhiyong Lu, Zhe He

Differential diagnosis is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes from 50 case reports from PubMed Central were created incorporating patient demographics, symptoms, and lab results. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low. Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by LLMs for differential diagnosis.

摘要:鑑別診斷對於醫學至關重要,因為它有助於醫療保健提供者系統區分具有相似症狀的疾病。這項研究評估了實驗室檢驗結果對大型語言模型 (LLM) 做出的鑑別診斷 (DDx) 的影響。從 PubMed Central 的 50 份病例報告中建立了臨床簡報,其中包含患者人口統計、症狀和實驗室結果。測試了五個 LLM GPT-4、GPT-3.5、Llama-2-70b、Claude-2 和 Mixtral-8x7B,以生成帶和不帶實驗室數據的前 10、前 5 和前 1 DDx。進行了一項涉及 GPT-4、知識圖譜和臨床醫生的綜合評估。GPT-4 表現最佳,在有實驗室數據的情況下,前 1 名診斷的準確率達到 55%,前 10 名的準確率達到 60%,寬鬆準確率高達 80%。實驗室結果顯著提高了準確率,GPT-4 和 Mixtral 表現出色,儘管完全匹配率較低。LLM 通常可以正確解釋包括肝功能、代謝/毒理學檢查和血清學/免疫測試在內的實驗室檢驗,以進行鑑別診斷。
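
The Top 1, Top 5 and Top 10 figures above are top-k accuracies. A minimal sketch of that metric follows, assuming exact (case-insensitive) string matching. The function name is hypothetical, and the study's lenient scoring, which involved GPT-4, a knowledge graph and clinicians, is not modelled here.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of cases whose gold diagnosis appears in the first k predictions.

    ranked_predictions: list of ranked diagnosis lists, one per case.
    gold_labels: list of reference diagnoses, aligned with ranked_predictions.
    Matching is exact after trimming whitespace and lowercasing (an assumption).
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("one ranked list is required per gold label")
    if not gold_labels:
        return 0.0
    hits = sum(
        gold.strip().lower() in {p.strip().lower() for p in preds[:k]}
        for preds, gold in zip(ranked_predictions, gold_labels)
    )
    return hits / len(gold_labels)
```

Top-k accuracy can only stay equal or rise as k grows, which is why a Top 10 score is never lower than the Top 1 score for the same predictions.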

Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning

2411.00205v1 by Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, Sanjit A. Seshia

Goal-conditioned reinforcement learning is a powerful way to control an AI agent's behavior at runtime. That said, popular goal representations, e.g., target states or natural language, are either limited to Markovian tasks or rely on ambiguous task semantics. We propose representing temporal goals using compositions of deterministic finite automata (cDFAs) and use cDFAs to guide RL agents. cDFAs balance the need for formal temporal semantics with ease of interpretation: if one can understand a flow chart, one can understand a cDFA. On the other hand, cDFAs form a countably infinite concept class with Boolean semantics, and subtle changes to the automaton can result in very different tasks, making them difficult to condition agent behavior on. To address this, we observe that all paths through a DFA correspond to a series of reach-avoid tasks and propose pre-training graph neural network embeddings on "reach-avoid derived" DFAs. Through empirical evaluation, we demonstrate that the proposed pre-training method enables zero-shot generalization to various cDFA task classes and accelerated policy specialization without the myopic suboptimality of hierarchical methods.

摘要:目標條件強化學習是一種在執行階段控制 AI 代理行為的強大方法。話雖如此,熱門的目標表示,例如目標狀態或自然語言,僅限於馬可夫任務或依賴於含糊不清的任務語義。我們建議使用確定性有限狀態自動機 (cDFA) 的組合來表示時間目標,並使用 cDFA 來指導 RL 代理。cDFA 平衡了對形式時間語義的需求與易於解釋之間的關係:如果一個人能理解流程圖,那麼他就能理解 cDFA。另一方面,cDFA 形成了一個具有布林語義的可數無限概念類,而對自動機的細微更改可能會導致非常不同的任務,這使得它們難以對代理行為進行條件化。為了解決這個問題,我們觀察到通過 DFA 的所有路徑都對應於一系列到達避免任務,並提出對「到達避免衍生」DFA 進行預訓練圖神經網路嵌入。通過經驗評估,我們證明了所提出的預訓練方法能夠對各種 cDFA 任務類別進行零次學習泛化,並加速策略專業化,而沒有分層方法的近視次優性。
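
To make the notion of a composed automaton concrete, here is a small sketch under the assumption that the composition is a conjunction: the composite task is satisfied only if every component DFA accepts the trace. The class, the `reach_avoid` helper and the symbol names are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DFA:
    start: str
    accepting: frozenset
    transitions: dict  # (state, symbol) -> next state; missing pairs self-loop

    def accepts(self, trace):
        """Run the trace from the start state and test the final state."""
        state = self.start
        for symbol in trace:
            state = self.transitions.get((state, symbol), state)
        return state in self.accepting


def reach_avoid(reach, avoid):
    """DFA for 'see `reach` before ever seeing `avoid`' (hypothetical helper)."""
    return DFA(
        start="wait",
        accepting=frozenset({"done"}),
        # "done" and "fail" have no outgoing entries, so they act as sinks.
        transitions={("wait", reach): "done", ("wait", avoid): "fail"},
    )


def conjunction_accepts(dfas, trace):
    """A conjunctive composition accepts iff every component DFA accepts."""
    return all(dfa.accepts(trace) for dfa in dfas)
```

For example, with the task `[reach_avoid("key", "lava"), reach_avoid("door", "lava")]`, the trace `["key", "door"]` is accepted, while `["key", "lava", "door"]` is rejected because the second component hits `lava` before `door`.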

Building Multi-Agent Copilot towards Autonomous Agricultural Data Management and Analysis

2411.00188v1 by Yu Pan, Jianxin Sun, Hongfeng Yu, Joe Luck, Geng Bai, Nipuna Chamara, Yufeng Ge, Tala Awada

Current agricultural data management and analysis paradigms are to a large extent traditional: data collection, curation, integration, loading, storage, sharing and analysis still involve too much human effort and know-how. Experts, researchers and farm operators need to understand the data and the whole data management pipeline to make full use of the data. The essential problem of the traditional paradigm is the lack of a layer of orchestrational intelligence that can understand, organize and coordinate the data processing utilities to maximize data management and analysis outcomes. The emerging reasoning and tool-mastering abilities of large language models (LLMs) make them a potentially good fit for this role, helping a shift from the traditional user-driven paradigm to an AI-driven paradigm. In this paper, we propose and explore the idea of an LLM-based copilot for autonomous agricultural data management and analysis. Based on our previously developed platform of Agricultural Data Management and Analytics (ADMA), we build a proof-of-concept multi-agent system called ADMA Copilot, which can understand the user's intent, make plans for the data processing pipeline and accomplish tasks automatically, and in which three agents collaborate: an LLM-based controller, an input formatter and an output formatter. Unlike existing LLM-based solutions, our work defines a meta-program graph that decouples control flow and data flow to enhance the predictability of the agents' behaviour. Experiments demonstrate the intelligence, autonomy, efficacy, efficiency, extensibility, flexibility and privacy of our system. We also compare our system with existing systems to show its superiority and potential.

摘要:目前的農業資料管理與分析模式在很大程度上仍是傳統的,其中資料收集、整理、整合、載入、儲存、分享和分析仍然需要太多的人力與專業知識。專家、研究人員和農場經營者需要了解資料和整個資料管理流程,才能充分利用資料。傳統模式的基本問題是缺乏一層編排智能,無法理解、組織和協調資料處理工具,以最大化資料管理和分析成果。大型語言模型 (LLM) 新興的推理和工具掌握能力使其潛在適合這個職位,這有助於從傳統的使用者驅動模式轉變為 AI 驅動模式。在本文中,我們提出並探討了基於 LLM 的副駕駛的想法,用於自動化農業資料管理和分析。基於我們先前開發的農業資料管理和分析 (ADMA) 平台,我們建立了一個名為 ADMA Copilot 的概念驗證多代理系統,它可以理解使用者的意圖、規劃資料處理流程並自動完成任務,其中三個代理:基於 LLM 的控制器、輸入格式化程式和輸出格式化程式共同合作。與現有的基於 LLM 的解決方案不同,透過定義元程式圖,我們的研究將控制流程和資料流程解耦,以增強代理行為的可預測性。實驗證明了我們系統的智慧、自主性、效能、效率、可擴充性、靈活性與隱私性。我們也與現有系統進行比較,以顯示我們系統的優越性和潛力。

Exploring the Knowledge Mismatch Hypothesis: Hallucination Propensity in Small Models Fine-tuned on Data from Larger Models

2411.00878v1 by Phil Wee, Riyadh Baghdadi

Recently, there has been an explosion of large language models created through fine-tuning with data from larger models. These small models are able to produce outputs that appear qualitatively similar to those of significantly larger models. However, one of the key limitations that has been observed with these models is their propensity to hallucinate significantly more often than larger models. In particular, they have been observed to generate coherent outputs that involve factually incorrect information and spread misinformation, toxicity, and stereotypes. There are many potential causes of hallucination; one hypothesis is that fine-tuning a model on data produced by a larger model leads to a knowledge mismatch which contributes to hallucination. In particular, it is hypothesized that there is a mismatch between the knowledge that is fed to the model to fine-tune it and the knowledge that is already present in the graph. Fine-tuning the model on data that has such a mismatch could contribute to an increased propensity to hallucinate. We show that on an unseen test set, a smaller model fine-tuned on data generated from a larger model produced more wrong answers when compared to models fine-tuned on data created by the small model, which confirms the hypothesis.

摘要:最近,通过使用更大模型的数据进行微调,创建了大量语言模型爆炸。这些小模型能够产生与明显更大的模型在质量上类似的输出。然而,在这些模型中观察到的一个关键限制是,它们比更大的模型更容易出现幻觉。特别是,已经观察到它们会生成涉及事实不正确的信息并传播错误信息、毒性和刻板印象的连贯输出。幻觉有很多潜在原因,其中一个假设是,在更大模型生成的数据上微调模型会导致知识不匹配,从而导致幻觉。特别是,假设模型微调所馈送的知识与图中已有的知识之间存在不匹配。在具有这种不匹配的数据上微调模型可能会导致幻觉倾向增加。我们表明,在一个看不见的测试集中,一个在从一个更大的模型生成的数据上微调的小模型,与在小模型创建的数据上微调的模型相比,产生了更多错误的答案,这证实了这一假设。

Failure Modes of LLMs for Causal Reasoning on Narratives

2410.23884v1 by Khurram Yamin, Shantanu Gupta, Gaurav R. Ghosal, Zachary C. Lipton, Bryan Wilder

In this work, we investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge. For example, LLMs tend to determine causal relationships based on the topological ordering of events (i.e., earlier events cause later ones), resulting in lower performance whenever events are not narrated in their exact causal order. Similarly, we demonstrate that LLMs struggle with long-term causal reasoning and often fail when the narratives are long and contain many events. Additionally, we show LLMs appear to rely heavily on their parametric knowledge at the expense of reasoning over the provided narrative. This degrades their abilities whenever the narrative opposes parametric knowledge. We extensively validate these failure modes through carefully controlled synthetic experiments, as well as evaluations on real-world narratives. Finally, we observe that explicitly generating a causal graph generally improves performance while naive chain-of-thought is ineffective. Collectively, our results distill precise failure modes of current state-of-the-art models and can pave the way for future techniques to enhance causal reasoning in LLMs.

摘要:在這項工作中,我們透過推論敘述中的因果關係這個代表性問題,來探討大型語言模型 (LLM) 的因果推理能力。我們發現,即使是最先進的語言模型,也會依賴於不可靠的捷徑,無論是在敘述呈現或其參數知識方面。例如,LLM 傾向於根據事件的拓撲順序(即,較早的事件導致較晚的事件)來確定因果關係,當事件未按其確切的因果順序敘述時,就會導致較低的效能。同樣地,我們證明 LLM 難以進行長期因果推理,並且當敘述很長且包含許多事件時,它們通常會失敗。此外,我們表明 LLM 似乎過度依賴其參數知識,而犧牲了對所提供敘述的推理。每當敘述與參數知識相衝突時,這就會降低它們的能力。我們透過仔細控制的合成實驗以及對真實世界敘述的評估,廣泛驗證了這些失敗模式。最後,我們觀察到,明確產生因果圖通常會改善效能,而天真的思考鏈則無效。總的來說,我們的結果精確地提煉了當前最先進模型的失敗模式,並可以為未來增強 LLM 中因果推理的技術鋪路。

Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs

2410.23875v1 by Liyi Chen, Panrong Tong, Zhongming Jin, Ying Sun, Jieping Ye, Hui Xiong

Large Language Models (LLMs) have shown remarkable reasoning capabilities on complex tasks, but they still suffer from out-of-date knowledge, hallucinations, and opaque decision-making. In contrast, Knowledge Graphs (KGs) can provide explicit and editable knowledge for LLMs to alleviate these issues. The existing paradigm of KG-augmented LLMs manually predefines the breadth of the exploration space and requires flawless navigation in KGs. However, this paradigm cannot adaptively explore reasoning paths in KGs based on the question semantics or self-correct erroneous reasoning paths, resulting in a bottleneck in efficiency and effectiveness. To address these limitations, we propose a novel self-correcting adaptive planning paradigm for KG-augmented LLMs named Plan-on-Graph (PoG), which first decomposes the question into several sub-objectives and then repeats the process of adaptively exploring reasoning paths, updating memory, and reflecting on the need to self-correct erroneous reasoning paths until arriving at the answer. Specifically, three important mechanisms, Guidance, Memory, and Reflection, are designed to work together to guarantee the adaptive breadth of self-correcting planning for graph reasoning. Finally, extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of PoG.

摘要:大型語言模型 (LLM) 在複雜任務中展現出非凡的推理能力,但仍存在知識過時、幻覺和決策不透明的問題。相反地,知識圖譜 (KG) 可以提供明確且可編輯的知識,供 LLM 緩解這些問題。現有的 KG 增強 LLM 典範手動預先定義探索空間的廣度,並需要在 KG 中完美導航。然而,此典範無法根據問題語意自適應地探索 KG 中的推理路徑,並自行糾正錯誤的推理路徑,導致效率和效果的瓶頸。為了解決這些限制,我們提出了一個名為圖形計畫 (PoG) 的 KG 增強 LLM 的新穎自修正自適應規劃典範,它首先將問題分解成幾個子目標,然後重複自適應探索推理路徑、更新記憶體和反思需要自行糾正錯誤推理路徑的過程,直到得出答案。具體來說,指導、記憶和反思這三個重要機制被設計為協同運作,以保證自修正規劃在圖形推理中的自適應廣度。最後,在三個真實世界資料集上的廣泛實驗證明了 PoG 的有效性和效率。

LLaMo: Large Language Model-based Molecular Graph Assistant

2411.00871v1 by Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim

Large Language Models (LLMs) have demonstrated remarkable generalization and instruction-following capabilities with instruction tuning. The advancements in LLMs and instruction tuning have led to the development of Large Vision-Language Models (LVLMs). However, the competency of LLMs and instruction tuning has been less explored in the molecular domain. Thus, we propose LLaMo: Large Language Model-based Molecular graph assistant, which is an end-to-end trained large molecular graph-language model. To bridge the discrepancy between the language and graph modalities, we present the multi-level graph projector that transforms graph representations into graph tokens by abstracting the output representations of each GNN layer and motif representations with the cross-attention mechanism. We also introduce machine-generated molecular graph instruction data to instruction-tune the large molecular graph-language model for general-purpose molecule and language understanding. Our extensive experiments demonstrate that LLaMo shows the best performance on diverse tasks, such as molecular description generation, property prediction, and IUPAC name prediction. The code of LLaMo is available at https://github.com/mlvlab/LLaMo.

摘要:大型语言模型 (LLM) 已展示出卓越的概括和指令遵循能力,并进行指令调整。LLM 和指令调整的进步导致了大型视觉语言模型 (LVLMs) 的发展。然而,LLM 和指令调整的能力在分子领域的研究较少。因此,我们提出了 LLaMo:基于大语言模型的分子图助手,这是一个端到端训练的大分子图语言模型。为了弥合语言和图模式之间的差异,我们提出了多级图投影仪,它通过抽象每个 GNN 层的输出表示和基序表示(使用交叉注意力机制)将图表示转换为图标记。我们还引入了机器生成的分子图指令数据,以对大型分子图语言模型进行指令调整,以用于通用分子和语言理解。我们广泛的实验表明,LLaMo 在分子描述生成、属性预测和 IUPAC 名称预测等不同任务上表现出最佳性能。LLaMo 的代码可在 https://github.com/mlvlab/LLaMo 获得。

End-to-End Ontology Learning with Large Language Models

2410.23584v1 by Andy Lo, Albert Q. Jiang, Wenda Li, Mateja Jamnik

Ontologies are useful for automatic machine processing of domain knowledge as they represent it in a structured format. Yet, constructing ontologies requires substantial manual effort. To automate part of this process, large language models (LLMs) have been applied to solve various subtasks of ontology learning. However, this partial ontology learning does not capture the interactions between subtasks. We address this gap by introducing OLLM, a general and scalable method for building the taxonomic backbone of an ontology from scratch. Rather than focusing on subtasks, like individual relations between entities, we model entire subcomponents of the target ontology by finetuning an LLM with a custom regulariser that reduces overfitting on high-frequency concepts. We introduce a novel suite of metrics for evaluating the quality of the generated ontology by measuring its semantic and structural similarity to the ground truth. In contrast to standard metrics, our metrics use deep learning techniques to define more robust distance measures between graphs. Both our quantitative and qualitative results on Wikipedia show that OLLM outperforms subtask composition methods, producing more semantically accurate ontologies while maintaining structural integrity. We further demonstrate that our model can be effectively adapted to new domains, like arXiv, needing only a small number of training examples. Our source code and datasets are available at https://github.com/andylolu2/ollm.

摘要:本体对于领域知识的自动机器处理很有用,因为它们以结构化格式表示知识。然而,构建本体需要大量的手动工作。为了自动化这个过程的一部分,大型语言模型(LLM)已被应用于解决本体学习的各种子任务。然而,这种部分本体学习并没有捕捉到子任务之间的交互。我们通过引入 OLLM 来解决这一差距,这是一种从头开始构建本体分类骨架的通用且可扩展的方法。我们没有专注于子任务,例如实体之间的个别关系,而是通过使用自定义正则化器微调 LLM 来对目标本体的整个子组件进行建模,该正则化器减少了对高频概念的过度拟合。我们引入了一套新的指标来评估生成本体的质量,方法是测量它与地面真实值的语义和结构相似性。与标准指标相反,我们的指标使用深度学习技术来定义图之间的更稳健的距离度量。我们在维基百科上的定量和定性结果表明,OLLM 优于子任务组合方法,在保持结构完整性的同时生成语义上更准确的本体。我们进一步证明,我们的模型可以有效地适应新的领域,如 arXiv,只需要少量的训练样本。我们的源代码和数据集可在 https://github.com/andylolu2/ollm 获得。

Graph-Augmented Relation Extraction Model with LLMs-Generated Support Document

2410.23452v1 by Vicky Dong, Hao Yu, Yao Chen

This study introduces a novel approach to sentence-level relation extraction (RE) that integrates Graph Neural Networks (GNNs) with Large Language Models (LLMs) to generate contextually enriched support documents. By harnessing the power of LLMs to generate auxiliary information, our approach crafts an intricate graph representation of textual data. This graph is subsequently processed through a GNN to refine and enrich the embeddings associated with each entity, ensuring a more nuanced and interconnected understanding of the data. This methodology addresses the limitations of traditional sentence-level RE models by incorporating broader contexts and leveraging inter-entity interactions, thereby improving the model's ability to capture complex relationships across sentences. Our experiments, conducted on the CrossRE dataset, demonstrate the effectiveness of our approach, with notable improvements in performance across various domains. The results underscore the potential of combining GNNs with LLM-generated context to advance the field of relation extraction.

摘要:本研究提出了一個句子層級關係萃取 (RE) 的新方法,該方法整合了圖形神經網路 (GNN) 和大型語言模型 (LLM),以產生脈絡豐富的支援文件。透過利用 LLM 的功能來產生輔助資訊,我們的做法建立了一個文本資料的複雜圖形表示。此圖形隨後透過圖形神經網路 (GNN) 進行處理,以改善和豐富與每個實體相關的嵌入,確保對資料有更細緻且相互連結的理解。此方法透過納入更廣泛的脈絡並利用實體間互動,來解決傳統句子層級 RE 模型的限制,進而提升模型捕捉跨句子的複雜關係的能力。我們在 CrossRE 資料集上執行的實驗證明了我們方法的有效性,在各種領域的效能都有顯著的提升。這些結果強調了將 GNN 與 LLM 產生的脈絡相結合,以推進關係萃取領域的潛力。

FlowLLM: Flow Matching for Material Generation with Large Language Models as Base Distributions

2410.23405v1 by Anuroop Sriram, Benjamin Kurt Miller, Ricky T. Q. Chen, Brandon M. Wood

Material discovery is a critical area of research with the potential to revolutionize various fields, including carbon capture, renewable energy, and electronics. However, the immense scale of the chemical space makes it challenging to explore all possible materials experimentally. In this paper, we introduce FlowLLM, a novel generative model that combines large language models (LLMs) and Riemannian flow matching (RFM) to design novel crystalline materials. FlowLLM first fine-tunes an LLM to learn an effective base distribution of meta-stable crystals in a text representation. After converting to a graph representation, the RFM model takes samples from the LLM and iteratively refines the coordinates and lattice parameters. Our approach significantly outperforms state-of-the-art methods, increasing the generation rate of stable materials by over three times and increasing the rate for stable, unique, and novel crystals by approximately 50%, a huge improvement on a difficult problem. Additionally, the crystals generated by FlowLLM are much closer to their relaxed state when compared with another leading model, significantly reducing post-hoc computational cost.

摘要:材料發現是一個重要的研究領域,具有革新各種領域的潛力,包括碳捕集、可再生能源和電子產品。然而,化學空間的巨大規模使得實驗探索所有可能的材料具有挑戰性。在本文中,我們介紹了 FlowLLM,這是一種新穎的生成模型,結合了大型語言模型 (LLM) 和黎曼流匹配 (RFM) 來設計新型晶體材料。FlowLLM 首先微調 LLM,以學習文本表示中亞穩態晶體的有效基礎分佈。在轉換為圖形表示後,RFM 模型從 LLM 中獲取樣本,並反覆精煉坐標和晶格參數。我們的做法顯著優於最先進的方法,將穩定材料的生成率提高了三倍以上,並將穩定、獨特和新穎晶體的生成率提高了約 50%——這在一個困難的問題上是一個巨大的改進。此外,與另一種領先模型相比,FlowLLM 生成的晶體更接近其鬆弛狀態,顯著降低了事後計算成本。

EMMA: End-to-End Multimodal Model for Autonomous Driving

2410.23262v2 by Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, Mingxing Tan

We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar, and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.

摘要:我們介紹 EMMA,一種用於自動駕駛的端到端多模態模型。建立在多模態大型語言模型基礎上,EMMA 直接將原始相機感測器資料對應到各種特定於駕駛的輸出,包括規劃器軌跡、感知物件和道路圖形元素。EMMA 最大化利用預訓練大型語言模型中的世界知識,方法是將所有非感測器輸入(例如導航指示和自我車輛狀態)和輸出(例如軌跡和 3D 位置)表示為自然語言文字。這種方法允許 EMMA 在統一的語言空間中共同處理各種駕駛任務,並使用特定於任務的提示為每個任務產生輸出。根據經驗,我們證明了 EMMA 的有效性,在 nuScenes 上的運動規劃中達到了最先進的性能,以及在 Waymo 開放運動資料集 (WOMD) 上取得了有競爭力的結果。EMMA 也在 Waymo 開放資料集 (WOD) 上對相機優先的 3D 物件偵測產生了有競爭力的結果。我們展示了使用規劃器軌跡、物件偵測和道路圖形任務共同訓練 EMMA 會在所有三個領域產生改進,突顯了 EMMA 作為自動駕駛應用程式通用模型的潛力。然而,EMMA 也表現出某些限制:它只能處理少量的影像幀,不包含像 LiDAR 或雷達等準確的 3D 感測模式,並且計算成本昂貴。我們希望我們的結果能激勵進一步的研究,以減輕這些問題並進一步發展自動駕駛模型架構的最新技術。

ProTransformer: Robustify Transformers via Plug-and-Play Paradigm

2410.23182v1 by Zhichao Hou, Weizhi Gao, Yuchen Shen, Feiyi Wang, Xiaorui Liu

Transformer-based architectures have dominated various areas of machine learning in recent years. In this paper, we introduce a novel robust attention mechanism designed to enhance the resilience of transformer-based architectures. Crucially, this technique can be integrated into existing transformers as a plug-and-play layer, improving their robustness without the need for additional training or fine-tuning. Through comprehensive experiments and ablation studies, we demonstrate that our ProTransformer significantly enhances the robustness of transformer models across a variety of prediction tasks, attack mechanisms, backbone architectures, and data domains. Notably, without further fine-tuning, the ProTransformer consistently improves the performance of vanilla transformers by 19.5%, 28.3%, 16.1%, and 11.4% for BERT, ALBERT, DistilBERT, and RoBERTa, respectively, under the classical TextFooler attack. Furthermore, ProTransformer shows promising resilience in large language models (LLMs) against prompting-based attacks, improving the performance of T5 and LLaMA by 24.8% and 17.8%, respectively, and enhancing Vicuna by an average of 10.4% against the Jailbreaking attack. Beyond the language domain, ProTransformer also demonstrates outstanding robustness in both vision and graph domains.

摘要:近年來,基於 Transformer 的架構主導了機器學習的各個領域。在本文中,我們介紹了一種新穎且強大的注意力機制,旨在增強基於 Transformer 的架構的韌性。至關重要的是,此技術可以作為即插即用的層整合到現有的 Transformer 中,在無需額外訓練或微調的情況下提高其穩健性。通過全面的實驗和消融研究,我們證明了我們的 ProTransformer 在各種預測任務、攻擊機制、主幹架構和數據領域中顯著增強了 Transformer 模型的穩健性。值得注意的是,在不進一步微調的情況下,ProTransformer 在經典的 TextFooler 攻擊下,分別為 BERT、ALBERT、DistilBERT 和 RoBERTa 提升了 19.5%、28.3%、16.1% 和 11.4% 的性能。此外,ProTransformer 在基於提示的攻擊中對大型語言模型 (LLM) 顯示出有希望的韌性,分別將 T5 和 LLaMA 的性能提升了 24.8% 和 17.8%,並在越獄攻擊中將 Vicuna 的性能平均提升了 10.4%。除了語言領域之外,ProTransformer 在視覺和圖形領域也表現出出色的穩健性。

Semantic Enrichment of the Quantum Cascade Laser Properties in Text- A Knowledge Graph Generation Approach

2410.22996v1 by Deperias Kerre, Anne Laurent, Kenneth Maussang, Dickson Owuor

A well-structured collection of the various Quantum Cascade Laser (QCL) design and working properties data provides a platform to analyze and understand the relationships between these properties. By analyzing these relationships, we can gain insights into how different design features impact laser performance properties such as the working temperature. Most of these QCL properties are captured in scientific text. There is therefore a need for efficient methodologies that can extract QCL properties from text and generate a semantically enriched and interlinked platform where the properties can be analyzed to uncover hidden relations. There is also a need to maintain provenance and reference information on which these properties are based. Semantic Web technologies such as Ontologies and Knowledge Graphs have proven capability in providing interlinked data platforms for knowledge representation in various domains. In this paper, we propose an approach for generating a QCL properties Knowledge Graph (KG) from text for semantic enrichment of the properties. The approach is based on the QCL ontology and a Retrieval Augmented Generation (RAG) enabled information extraction pipeline based on the GPT 4-Turbo language model. The properties of interest include: working temperature, laser design type, lasing frequency, laser optical power and the heterostructure. The experimental results demonstrate the feasibility and effectiveness of this approach for efficiently extracting QCL properties from unstructured text and generating a QCL properties Knowledge Graph, which has potential applications in semantic enrichment and analysis of QCL data.

摘要:一個結構良好的各種量子層疊雷射 (QCL) 設計和工作特性數據集合,提供了一個平台來分析和理解這些特性之間的關係。透過分析這些關係,我們可以深入了解不同的設計特徵如何影響雷射效能特性,例如工作溫度。這些 QCL 特性大多數都捕捉在科學文字中。因此,需要有效的方法,可以用於從文字中萃取 QCL 特性,並產生一個語義豐富且相互連結的平台,可以在其中分析這些特性以發現隱藏的關係。還需要維護這些特性所依據的來源和參考資訊。語義網路技術,例如本体和知識圖譜,已證明它們在提供各種領域中知識表徵的相互連結資料平台方面具有能力。在本文中,我們提出一個從文字中產生 QCL 特性知識圖譜 (KG) 的方法,以進行特性的語義豐富化。此方法基於 QCL 本体和基於 GPT 4-Turbo 語言模型的檢索擴增生成 (RAG) 啟用資訊萃取管線。感興趣的特性包括:工作溫度、雷射設計類型、雷射頻率、雷射光功率和異質結構。實驗結果證明了此方法對於從非結構化文字中有效萃取 QCL 特性和產生 QCL 特性知識圖譜的可行性和有效性,這在 QCL 數據的語義豐富化和分析中具有潛在應用。

How Well Do Large Language Models Disambiguate Swedish Words?

2410.22827v1 by Richard Johansson

We evaluate a battery of recent large language models on two benchmarks for word sense disambiguation in Swedish. At present, all of the models are less accurate than the best supervised disambiguators in cases where a training set is available, but most models outperform graph-based unsupervised systems. Different prompting approaches are compared, with a focus on how to express the set of possible senses in a given context. The best accuracies are achieved when human-written definitions of the senses are included in the prompts.

摘要:我們針對兩個瑞典語詞彙意義消歧基準,評估一系列近期的大型語言模型。目前,在有訓練集可用的情況下,所有現有模型的準確度都低於最佳監督式消歧器,但大多數模型的表現都優於基於圖形的非監督式系統。比較了不同的提示方法,重點在於如何在特定脈絡中表達可能的意義集合。當提示中包含人類撰寫的意義定義時,可達到最佳準確度。
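The prompting idea in this abstract, listing the candidate senses together with human-written definitions, can be sketched as follows. This is a minimal illustration only: the prompt wording, the `build_wsd_prompt` helper, the sense identifiers, and the example definitions are all assumptions, not the prompts or sense inventory used in the paper.

```python
def build_wsd_prompt(sentence, target, senses):
    """Build a multiple-choice prompt that lists each candidate sense with its definition.

    `senses` is a list of (sense_id, definition) pairs. The format is hypothetical.
    """
    lines = [
        f'Sentence: "{sentence}"',
        f'Which sense of the word "{target}" is used in the sentence?',
    ]
    for i, (sense_id, definition) in enumerate(senses, start=1):
        lines.append(f"{i}. {sense_id}: {definition}")
    lines.append("Answer with the number of the correct sense.")
    return "\n".join(lines)


# Illustrative sense inventory for the Swedish word "fil" (made up for this sketch).
senses = [
    ("fil_1", "a lane on a road"),
    ("fil_2", "a computer file"),
    ("fil_3", "soured milk"),
]
prompt = build_wsd_prompt("Jag sparade filen på skrivbordet.", "fil", senses)
print(prompt)
```

The resulting string would be sent to whichever model is being evaluated; the model call itself is left out because it depends on the provider.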

Beyond Ontology in Dialogue State Tracking for Goal-Oriented Chatbot

2410.22767v1 by Sejin Lee, Dongha Kim, Min Song

Goal-oriented chatbots are essential for automating user tasks, such as booking flights or making restaurant reservations. A key component of these systems is Dialogue State Tracking (DST), which interprets user intent and maintains the dialogue state. However, existing DST methods often rely on fixed ontologies and manually compiled slot values, limiting their adaptability to open-domain dialogues. We propose a novel approach that leverages instruction tuning and advanced prompt strategies to enhance DST performance, without relying on any predefined ontologies. Our method enables a Large Language Model (LLM) to infer dialogue states through carefully designed prompts and includes an anti-hallucination mechanism to ensure accurate tracking in diverse conversation contexts. Additionally, we employ a Variational Graph Auto-Encoder (VGAE) to model and predict subsequent user intent. Our approach achieved state-of-the-art results with a Joint Goal Accuracy (JGA) of 42.57%, outperforming existing ontology-less DST models, and performed well in open-domain real-world conversations. This work presents a significant advancement in creating more adaptive and accurate goal-oriented chatbots.

摘要:以目標為導向的聊天機器人在自動化使用者任務中至關重要,例如預訂航班或進行餐廳訂位。這些系統的一個關鍵組成部分是對話狀態追蹤 (DST),它會解譯使用者的意圖並維護對話狀態。然而,現有的 DST 方法通常依賴於固定的本体和手動編譯的槽位值,這限制了它們對開放領域對話的適應性。我們提出了一種新穎的方法,它利用指令調整和先進的提示策略來增強 DST 效能,而無需依賴任何預定義的本体。我們的方法使大型語言模型 (LLM) 能夠透過精心設計的提示來推論對話狀態,並包含一個反幻覺機制,以確保在不同的對話情境中準確追蹤。此外,我們採用變分圖自編碼器 (VGAE) 來建模和預測後續使用者的意圖。我們的做法以 42.57% 的 JGA 達到了現有技術的頂峰,優於現有的無本体 DST 模型,並在開放領域的真實對話中表現良好。這項工作在建立更具適應性和準確性的以目標為導向的聊天機器人方面取得了重大進展。
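The JGA figure reported above is Joint Goal Accuracy: the fraction of dialogue turns for which the predicted dialogue state matches the reference state on every slot. A minimal sketch of that metric follows; the slot names and values are invented for illustration, and any value normalization the paper may apply is not modeled.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state equals the gold state exactly.

    Each state is a dict mapping slot name -> value for one dialogue turn.
    """
    if len(predicted_states) != len(gold_states):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold_states:
        return 0.0
    correct = sum(1 for p, g in zip(predicted_states, gold_states) if p == g)
    return correct / len(gold_states)


# Hypothetical three-turn dialogue; the second turn has one wrong slot value.
gold = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "thai"},
    {"restaurant-area": "centre", "restaurant-food": "thai", "restaurant-people": "2"},
]
pred = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "italian"},
    {"restaurant-area": "centre", "restaurant-food": "thai", "restaurant-people": "2"},
]
jga = joint_goal_accuracy(pred, gold)
print(f"JGA = {jga:.2%}")
```

Because a single wrong slot makes the whole turn count as incorrect, JGA is a strict metric, which is one reason absolute scores tend to look low.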

The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation

2411.00843v1 by Reza Moravej, Saurabh Bodhe, Zhanguang Zhang, Didier Chetelat, Dimitrios Tsaras, Yingxue Zhang, Hui-Ling Zhen, Jianye Hao, Mingxuan Yuan

Logic synthesis is a crucial phase in the circuit design process, responsible for transforming hardware description language (HDL) designs into optimized netlists. However, traditional logic synthesis methods are computationally intensive, restricting their iterative use in refining chip designs. Recent advancements in large language models (LLMs), particularly those fine-tuned on programming languages, present a promising alternative. In this paper, we introduce VeriDistill, the first end-to-end machine learning model that directly processes raw Verilog code to predict circuit quality-of-result metrics. Our model employs a novel knowledge distillation method, transferring low-level circuit insights via graphs into the predictor based on LLM. Experiments show VeriDistill outperforms state-of-the-art baselines on large-scale Verilog datasets and demonstrates robust performance when evaluated on out-of-distribution datasets.

摘要:邏輯合成是電路設計過程中至關重要的一個階段,負責將硬體描述語言 (HDL) 設計轉換為最佳化的網路表。然而,傳統的邏輯合成方法在運算上很密集,限制了它們在精煉晶片設計中的反覆使用。最近大型語言模型 (LLM) 的進展,特別是那些經過程式語言微調的,提供了一個有希望的替代方案。在本文中,我們介紹了 VeriDistill,第一個端到端的機器學習模型,它直接處理原始 Verilog 程式碼以預測電路品質結果指標。我們的模型採用了一種新穎的知識提煉方法,通過圖表將低階電路見解傳輸到基於 LLM 的預測器中。實驗表明,VeriDistill 在大規模 Verilog 資料集上優於最先進的基準,並且在在分佈外資料集上進行評估時表現出穩健的效能。

Are Large-Language Models Graph Algorithmic Reasoners?

2410.22597v1 by Alexander K Taylor, Anthony Cuturrufo, Vishal Yathish, Mingyu Derek Ma, Wei Wang

We seek to address a core challenge facing current Large Language Models (LLMs). LLMs have demonstrated superior performance in many tasks, yet continue to struggle with reasoning problems on explicit graphs that require multiple steps. To address this gap, we introduce a novel benchmark designed to evaluate LLM performance on classical algorithmic reasoning tasks on explicit graphs. Our benchmark encompasses five fundamental algorithms: Breadth-First Search (BFS) and Depth-First Search (DFS) for connectivity, Dijkstra's algorithm and the Floyd-Warshall algorithm for shortest paths across all nodes, and Prim's Minimum Spanning Tree (MST-Prim's) algorithm. Through extensive experimentation, we assess the capabilities of state-of-the-art LLMs in executing these algorithms step-by-step and systematically evaluate their performance at each stage. Our findings highlight the persistent challenges LLMs face in this domain and underscore the necessity for advanced prompting techniques and algorithmic instruction to enhance their graph reasoning abilities. This work presents MAGMA, the first comprehensive benchmark focused on LLMs completing classical graph algorithms, and provides a critical step toward understanding and improving their structured problem-solving skills.

摘要:我們試圖解決當前大型語言模型 (LLM) 面臨的核心挑戰。LLM 在許多任務中表現出優異的性能,但仍難以應對需要多個步驟的明確圖表中的推理問題。為了解決這個差距,我們引入了一個新的基準,用於評估 LLM 在明確圖表上的經典演算法推理任務上的性能。我們的基準包含五個基本演算法:廣度優先搜尋 (BFS) 和深度優先搜尋 (DFS) 以進行連通性、Dijkstra 演算法和 Floyd-Warshall 演算法以找出所有節點的最短路徑,以及 Prim 最小生成樹 (MST-Prim) 演算法。透過廣泛的實驗,我們評估了最先進的 LLM 在逐步執行這些演算法的能力,並系統性地評估它們在每個階段的性能。我們的研究結果突出了 LLM 在這個領域面臨的持續挑戰,並強調了使用進階提示技術和演算法指令來增強其圖形推理能力的必要性。這項工作提出了 MAGMA,這是第一個專注於 LLM 完成經典圖形演算法的綜合基準,並為了解和改進其結構化問題解決技能提供了關鍵的一步。
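To make "executing these algorithms step-by-step" concrete, here is a minimal sketch of Dijkstra's algorithm that records the distance table after each node is settled. Such intermediate states are the kind of thing a step-level evaluation could compare against, but the trace format, function name, and example graph are assumptions, not the benchmark's actual format.

```python
import heapq


def dijkstra_with_trace(graph, source):
    """Single-source shortest paths with a per-step trace.

    `graph` maps each node to a list of (neighbor, weight) pairs with
    non-negative weights. Returns (distances, trace), where trace holds
    (settled_node, snapshot_of_distances) for every settled node.
    """
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    settled = set()
    trace = []
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue  # stale heap entry
        settled.add(u)
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
        trace.append((u, dict(dist)))  # state after relaxing u's edges
    return dist, trace


graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
dist, trace = dijkstra_with_trace(graph, "A")
for node, snapshot in trace:
    print(node, snapshot)
```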

Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset

2410.22457v1 by Adrian Garret Gabriel, Alaa Alameer Ahmad, Shankar Kumar Jeyakumar

Advancements in Large Language Models (LLMs) are revolutionizing the development of autonomous agentic systems by enabling dynamic, context-aware task decomposition and automated tool selection. These sophisticated systems possess significant automation potential across various industries, managing complex tasks, interacting with external systems to enhance knowledge, and executing actions independently. This paper presents three primary contributions to advance this field:

- Advanced Agentic Framework: A system that handles multi-hop queries, generates and executes task graphs, selects appropriate tools, and adapts to real-time changes.
- Novel Evaluation Metrics: Introduction of Node F1 Score, Structural Similarity Index (SSI), and Tool F1 Score to comprehensively assess agentic systems.
- Specialized Dataset: Development of an AsyncHow-based dataset for analyzing agent behavior across different task complexities.

Our findings reveal that asynchronous and dynamic task graph decomposition significantly enhances system responsiveness and scalability, particularly for complex, multi-step tasks. Detailed analysis shows that structural and node-level metrics are crucial for sequential tasks, while tool-related metrics are more important for parallel tasks. Specifically, the Structural Similarity Index (SSI) is the most significant predictor of performance in sequential tasks, and the Tool F1 Score is essential for parallel tasks. These insights highlight the need for balanced evaluation methods that capture both structural and operational dimensions of agentic systems. Additionally, our evaluation framework, validated through empirical analysis and statistical testing, provides valuable insights for improving the adaptability and reliability of agentic systems in dynamic environments.

摘要:大型語言模型 (LLM) 的進展正透過啟用動態、具情境感知能力的任務分解和自動化工具選擇,革新自主代理系統的開發。這些精密的系統在各產業中擁有顯著的自動化潛力,管理複雜的任務、與外部系統互動以增強知識,並獨立執行動作。本文提出了三個主要貢獻以推動這個領域的進展: - 進階代理架構:一種處理多重跳躍查詢、產生並執行任務圖表、選擇適當的工具,並適應即時變化的系統。 - 新穎的評估指標:導入節點 F1 分數、結構相似性指標 (SSI) 和工具 F1 分數,以全面評估代理系統。 - 專業資料集:開發一個基於 AsyncHow 的資料集,用於分析代理行為在不同任務複雜度之間的差異。 我們的研究結果顯示,非同步和動態任務圖表分解能顯著增強系統的回應能力和可擴充性,特別是對於複雜的多步驟任務。詳細的分析顯示,結構和節點層級的指標對於順序任務至關重要,而與工具相關的指標對於並行任務更為重要。具體來說,結構相似性指標 (SSI) 是順序任務中效能最顯著的預測指標,而工具 F1 分數對於並行任務至關重要。這些見解突顯了平衡評估方法的需求,該方法能捕捉代理系統的結構和操作面向。此外,我們的評估架構透過實證分析和統計檢定驗證,為改善代理系統在動態環境中的適應性和可靠性提供了有價值的見解。

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

2411.00836v1 by Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang

The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. Those programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs, by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010 generated concrete questions. Our results show that the worst-case model accuracy, defined as the percentage of correctly answered seed questions in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.

摘要:視覺語言模型 (VLM) 的快速進步在解決涉及視覺背景的數學推理任務方面展現了巨大的潛力。與人類可以將解決步驟可靠地應用於類似問題(並進行微小的修改)不同,我們發現像 GPT-4o 等 SOTA VLM 在這些場景中可能會持續失敗,揭露了其數學推理能力的限制。在本文中,我們研究了 VLM 中的數學推理穩健性,並評估了這些模型在同一問題的不同變體(例如視覺數值或函數圖形的變化)下的表現。雖然已經開發了多個基於視覺的數學基準來評估 VLM 的問題解決能力,但這些基準只包含靜態問題集,無法輕鬆評估數學推理穩健性。為了填補這一空白,我們引入了 DynaMath,這是一個動態視覺數學基準,專門用於深入評估 VLM。DynaMath 包含 501 個高品質、多主題種子問題,每個問題都表示為一個 Python 程式。這些程式經過仔細設計和註解,以便自動產生一組更大的具體問題,包括許多不同類型的視覺和文字變體。DynaMath 允許我們評估 VLM 的泛化能力,方法是在種子問題的不同輸入條件下評估其表現。我們使用 5,010 個生成的具體問題評估了 14 個 SOTA VLM。我們的結果顯示,最差情況的模型準確度(定義為在所有 10 個變體中正確回答的種子問題的百分比)顯著低於平均情況準確度。我們的分析強調了研究 VLM 推理能力穩健性的必要性,而 DynaMath 提供了有價值的見解,以指導開發更可靠的數學推理模型。
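The two accuracy notions contrasted in the abstract can be sketched as below. Average-case accuracy counts every generated variant independently, while worst-case accuracy counts a seed question only if all of its variants are answered correctly. The data is invented and uses four variants per seed for brevity (the abstract describes ten); the function name is hypothetical.

```python
def average_and_worst_case_accuracy(results):
    """Compute (average-case, worst-case) accuracy.

    `results` maps a seed-question id to a list of booleans, one per
    generated variant, indicating whether the model answered it correctly.
    """
    total_variants = sum(len(flags) for flags in results.values())
    average = sum(sum(flags) for flags in results.values()) / total_variants
    worst = sum(all(flags) for flags in results.values()) / len(results)
    return average, worst


# Made-up outcomes for three seed questions.
results = {
    "seed_1": [True, True, True, True],
    "seed_2": [True, False, True, True],
    "seed_3": [False, False, True, False],
}
average, worst = average_and_worst_case_accuracy(results)
print(f"average-case: {average:.3f}, worst-case: {worst:.3f}")
```

In this toy example the model answers two thirds of all variants correctly, yet only one of three seed questions survives every variation, which is the gap the benchmark is designed to expose.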

ADAM: An Embodied Causal Agent in Open-World Environments

2410.22194v1 by Shu Yu, Chaochao Lu

In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their interpretability and generalization capability. To this end, we introduce ADAM, An emboDied causal Agent in Minecraft, that can autonomously navigate the open world, perceive multimodal contexts, learn causal world knowledge, and tackle complex tasks through lifelong learning. ADAM is empowered by four key components: 1) an interaction module, enabling the agent to execute actions while documenting the interaction processes; 2) a causal model module, tasked with constructing an ever-growing causal graph from scratch, which enhances interpretability and diminishes reliance on prior knowledge; 3) a controller module, comprising a planner, an actor, and a memory pool, which uses the learned causal graph to accomplish tasks; 4) a perception module, powered by multimodal large language models, which enables ADAM to perceive like a human player. Extensive experiments show that ADAM constructs an almost perfect causal graph from scratch, enabling efficient task decomposition and execution with strong interpretability. Notably, in our modified Minecraft games where no prior knowledge is available, ADAM maintains its performance and shows remarkable robustness and generalization capability. ADAM pioneers a novel paradigm that integrates causal methods and embodied agents in a synergistic manner. Our project page is at https://opencausalab.github.io/ADAM.

摘要:在像 Minecraft 這樣的開放世界環境中,現有的代理人面臨持續學習結構化知識的挑戰,尤其是因果關係。這些挑戰源於黑盒模型固有的不透明性,以及在訓練期間過度依賴先驗知識,這會損害它們的可解釋性和泛化能力。為此,我們引入了 ADAM,Minecraft 中的一個具身因果代理,它可以自主導航開放世界,感知多模式上下文,學習因果世界知識,並通過終身學習來應對複雜任務。ADAM 由四個關鍵組成部分賦能:1) 一個交互模組,使代理能夠執行動作,同時記錄交互過程;2) 一個因果模型模組,負責從頭開始構建一個不斷增長的因果圖,這增強了可解釋性並減少了對先驗知識的依賴;3) 一個控制器模組,包括一個規劃器、一個執行器和一個記憶池,它使用學習到的因果圖來完成任務;4) 一個感知模組,由多模式大型語言模型提供支援,使 ADAM 能夠像人類玩家一樣感知。大量的實驗表明,ADAM 從頭開始構建了一個幾乎完美的因果圖,實現了高效的任務分解和執行,並具有很強的可解釋性。值得注意的是,在我們修改過的 Minecraft 遊戲中,沒有可用的先驗知識,ADAM 保持了其性能,並表現出顯著的魯棒性和泛化能力。ADAM 開創了一種新穎的範例,以協同方式整合因果方法和具身代理。我們的專案頁面位於 https://opencausalab.github.io/ADAM。

Synergizing LLM Agents and Knowledge Graph for Socioeconomic Prediction in LBSN

2411.00028v1 by Zhilun Zhou, Jingyang Fan, Yu Liu, Fengli Xu, Depeng Jin, Yong Li

The fast development of location-based social networks (LBSNs) has led to significant changes in society, resulting in popular studies of using LBSN data for socioeconomic prediction, e.g., regional population and commercial activity estimation. Existing studies design various graphs to model heterogeneous LBSN data, and further apply graph representation learning methods for socioeconomic prediction. However, these approaches heavily rely on heuristic ideas and expertise to extract task-relevant knowledge from diverse data, which may not be optimal for specific tasks. Additionally, they tend to overlook the inherent relationships between different indicators, limiting the prediction accuracy. Motivated by the remarkable abilities of large language models (LLMs) in commonsense reasoning, embedding, and multi-agent collaboration, in this work, we synergize LLM agents and knowledge graphs for socioeconomic prediction. We first construct a location-based knowledge graph (LBKG) to integrate multi-sourced LBSN data. Then we leverage the reasoning power of an LLM agent to identify relevant meta-paths in the LBKG for each type of socioeconomic prediction task, and design a semantic-guided attention module for knowledge fusion with meta-paths. Moreover, we introduce a cross-task communication mechanism to further enhance performance by enabling knowledge sharing across tasks at both the LLM agent and KG levels. On the one hand, the LLM agents for different tasks collaborate to generate more diverse and comprehensive meta-paths. On the other hand, the embeddings from different tasks are adaptively merged for better socioeconomic prediction. Experiments on two datasets demonstrate the effectiveness of the synergistic design between LLM and KG, providing insights for information sharing across socioeconomic prediction tasks.

摘要:基於位置的社交網路 (LBSN) 的快速發展已導致社會發生重大變革,進而促成使用 LBSN 資料進行社會經濟預測的熱門研究,例如區域人口和商業活動估計。現有研究設計各種圖形來建模異質的 LBSN 資料,並進一步應用圖形表示學習方法進行社會經濟預測。然而,這些方法極度依賴啟發式想法和專業知識從不同的資料中萃取與任務相關的知識,這對於特定任務而言可能不是最佳的。此外,它們傾向於忽略不同指標之間的固有關係,進而限制預測準確度。受惠於大型語言模型 (LLM) 在常識推理、嵌入和多重代理協作方面的卓越能力,在這項工作中,我們將 LLM 代理和知識圖形結合起來進行社會經濟預測。我們首先建構一個基於位置的知識圖形 (LBKG) 來整合多來源的 LBSN 資料。然後,我們利用 LLM 代理的推理能力,針對每種類型的社會經濟預測任務識別 LBKG 中相關的 meta 路徑,並設計一個語義導向的注意力模組,用於與 meta 路徑的知識融合。此外,我們引入一個跨任務溝通機制,以透過在 LLM 代理和 KG 層級上跨任務啟用知識共享進一步提升效能。一方面,不同任務的 LLM 代理協作產生更多樣化且全面的 meta 路徑。另一方面,來自不同任務的嵌入會自適應地合併,以進行更好的社會經濟預測。在兩個資料集上的實驗證明了 LLM 和 KG 之間協同設計的有效性,並提供跨社會經濟預測任務進行資訊共享的見解。

A Hierarchical Language Model For Interpretable Graph Reasoning

2410.22372v1 by Sambhav Khurana, Xiner Li, Shurui Gui, Shuiwang Ji

Large language models (LLMs) are being increasingly explored for graph tasks. Despite their remarkable success in text-based tasks, LLMs' capabilities in understanding explicit graph structures remain limited, particularly with large graphs. In this work, we introduce Hierarchical Language Model for Graph (HLM-G), which employs a two-block architecture to capture node-centric local information and interaction-centric global structure, effectively enhancing graph structure understanding abilities. The proposed scheme allows LLMs to address various graph queries with high efficacy, efficiency, and robustness, while reducing computational costs on large-scale graph tasks. Furthermore, we demonstrate the interpretability of our model using intrinsic attention weights and established explainers. Comprehensive evaluations across diverse graph reasoning tasks and real-world node-, link-, and graph-level tasks highlight the superiority of our method, marking a significant advancement in the application of LLMs to graph understanding.

摘要:大型語言模型 (LLM) 愈來愈多用於圖形任務。 儘管 LLM 在基於文字的任務中取得顯著的成功,但其在理解明確圖形結構方面的能力仍然有限,特別是對於大型圖形。在這項工作中,我們引入了圖形階層語言模型 (HLM-G),它採用雙區塊架構來擷取以節點為中心的局部資訊和以互動為中心的整體結構,有效地增強了圖形結構理解能力。所提出的架構允許 LLM 以高效率、高效率和高穩健性來處理各種圖形查詢,同時降低大型圖形任務的運算成本。此外,我們使用內在注意力權重和已建立的解釋器來展示我們模型的可解釋性。在節點、連結和圖形層級的各種圖形推理和真實世界任務中進行的全面評估突顯了我們方法的優越性,標誌著 LLM 在圖形理解應用方面取得重大進展。

LLM-Forest for Health Tabular Data Imputation

2410.21520v1 by Xinrui He, Yikun Ban, Jiaru Zou, Tianxin Wei, Curtiss B. Cook, Jingrui He

Missing data imputation is a critical challenge in tabular datasets, especially in healthcare, where data completeness is vital for accurate analysis. Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation, making them a promising tool for tabular data imputation. However, challenges persist in designing effective prompts for a finetuning-free process and in mitigating the risk of LLM hallucinations. To address these issues, we propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot learning LLM "trees" with confidence-based weighted voting. This framework is established on a new concept of bipartite information graphs to identify high-quality relevant neighboring entries with both feature and value granularity. Extensive experiments on four real-world healthcare datasets demonstrate the effectiveness and efficiency of LLM-Forest.

摘要:遺失資料推估是表格資料集中的重大挑戰, 特別是在醫療保健中,資料完整性對於準確分析至關重要。 大型語言模型 (LLM) 在龐大的語料庫上訓練,在資料產生方面展現出強大的潛力,使其成為表格資料推估的有前途工具。 然而,在設計有效提示以進行微調免費流程和減輕 LLM 幻覺風險方面仍存在挑戰。 為了解決這些問題,我們提出一個新的框架,LLM-Forest,它引入了一個「森林」的少量學習 LLM「樹」,並採用基於信心的加權投票。 這個框架建立在雙分資訊圖的新概念上,以識別具有特徵和值粒度的優質相關鄰近項目。 在四個真實世界的醫療保健資料集上進行的廣泛實驗證明了 LLM-Forest 的有效性和效率。
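The "confidence-based weighted voting" over several LLM "trees" can be illustrated with a small aggregation routine. This is a sketch under stated assumptions: each tree is assumed to return a candidate value with a confidence in [0, 1], and confidences for the same value are simply summed. The abstract does not specify the actual weighting rule used by LLM-Forest, and the function name and example values are hypothetical.

```python
from collections import defaultdict


def confidence_weighted_vote(tree_outputs):
    """Pick the imputed value with the largest total confidence.

    `tree_outputs` is a list of (value, confidence) pairs, one per tree.
    """
    if not tree_outputs:
        raise ValueError("at least one tree output is required")
    scores = defaultdict(float)
    for value, confidence in tree_outputs:
        scores[value] += confidence
    return max(scores, key=scores.get)


# Three hypothetical trees propose a value for a missing categorical field.
outputs = [("normal", 0.9), ("high", 0.6), ("high", 0.5)]
imputed = confidence_weighted_vote(outputs)
print(imputed)
```

Here two moderately confident trees outvote one highly confident tree, because their summed confidence (1.1) exceeds 0.9.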

Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce

2410.21237v1 by Zhantao Yang, Han Zhang, Fangyi Chen, Anudeepsekhar Bolimera, Marios Savvides

Knowledge graphs (KGs) are playing an increasingly important role in various AI systems. For e-commerce, an efficient and low-cost automated knowledge graph construction method is the foundation for enabling successful downstream applications. In this paper, we propose a novel method for constructing structured product knowledge graphs from raw product images. The method cooperatively leverages recent advances in vision-language models (VLMs) and large language models (LLMs), fully automating the process and allowing timely graph updates. We also present a human-annotated e-commerce product dataset for benchmarking product property extraction in knowledge graph construction. Our method outperforms our baseline in all metrics and evaluated properties, demonstrating its effectiveness and strong potential for practical use.

摘要:知識圖譜 (KG) 在各種 AI 系統中扮演越來越重要的角色。對於電子商務來說,一種有效且低成本的自動化知識圖譜建構方法是促成各種成功的下游應用程式的基礎。在本文中,我們提出了一種從原始產品影像建構結構化產品知識圖譜的新穎方法。該方法協同利用了視覺語言模型 (VLM) 和大型語言模型 (LLM) 的最新進展,完全自動化了流程並允許及時更新圖譜。我們還提供了一個由人工標註的電子商務產品資料集,用於評量知識圖譜建構中的產品屬性萃取。我們的模型在所有指標和評估屬性上都優於我們的基準,證明了其有效性和廣闊的使用潛力。

CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models

2410.21067v1 by Meiqi Chen, Fandong Meng, Yingxue Zhang, Yan Zhang, Jie Zhou

Large language models (LLMs) have shown great promise in machine translation, but they still struggle with contextually dependent terms, such as new or domain-specific words. This leads to inconsistencies and errors that are difficult to address. Existing solutions often depend on manual identification of such terms, which is impractical given the complexity and evolving nature of language. While Retrieval-Augmented Generation (RAG) could provide some assistance, its application to translation is limited by issues such as hallucinations from information overload. In this paper, we propose CRAT, a novel multi-agent translation framework that leverages RAG and causality-enhanced self-reflection to address these challenges. This framework consists of several specialized agents: the Unknown Terms Identification agent detects unknown terms within the context, the Knowledge Graph (KG) Constructor agent extracts relevant internal knowledge about these terms and retrieves bilingual information from external sources, the Causality-enhanced Judge agent validates the accuracy of the information, and the Translator agent incorporates the refined information into the final output. This automated process allows for more precise and consistent handling of key terms during translation. Our results show that CRAT significantly improves translation accuracy, particularly in handling context-sensitive terms and emerging vocabulary.

摘要:大型語言模型(LLM)在機器翻譯方面展現出極大的前景, 但它們仍然難以應對依賴於語境的詞彙,例如新詞或特定領域的詞彙。這會導致不一致和錯誤,而這些錯誤很難解決。現有的解決方案通常依賴於手動識別此類詞彙,但由於語言的複雜性和不斷演變的特性,這並不可行。雖然檢索增強生成(RAG)可以提供一些協助,但其在翻譯中的應用受到諸如資訊超載產生的幻覺等問題的限制。在本文中,我們提出 CRAT,這是一個新穎的多代理翻譯架構,它利用 RAG 和因果增強自省來應對這些挑戰。此架構包含幾個專門的代理:未知詞彙識別代理會偵測語境中的未知詞彙,知識圖譜(KG)建構代理會擷取這些詞彙相關的內部知識,並從外部來源中檢索雙語資訊,因果增強判斷代理會驗證資訊的準確性,而翻譯代理會將精煉過的資訊納入最終輸出。這個自動化的流程允許在翻譯過程中更精確且一致地處理關鍵詞彙。我們的結果顯示,CRAT 大幅提升了翻譯準確性,特別是在處理對語境敏感的詞彙和新興詞彙方面。

CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity

2410.21060v1 by Yutong Cheng, Osama Bajaber, Saimon Amanuel Tsegai, Dawn Song, Peng Gao

Textual descriptions in cyber threat intelligence (CTI) reports, such as security articles and news, are rich sources of knowledge about cyber threats, crucial for organizations to stay informed about the rapidly evolving threat landscape. However, current CTI extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. Syntax parsing relies on fixed rules and dictionaries, while model fine-tuning requires large annotated datasets, making both paradigms challenging to adapt to new threats and ontologies. To bridge the gap, we propose CTINexus, a novel framework leveraging optimized in-context learning (ICL) of large language models (LLMs) for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction. Unlike existing methods, CTINexus requires neither extensive data nor parameter tuning and can adapt to various ontologies with minimal annotated examples. This is achieved through (1) a carefully designed automatic prompt construction strategy with optimal demonstration retrieval for extracting a wide range of cybersecurity entities and relations; (2) a hierarchical entity alignment technique that canonicalizes the extracted knowledge and removes redundancy; (3) an ICL-enhanced long-distance relation prediction technique to further complete the CSKG with missing links. Our extensive evaluations using 150 real-world CTI reports collected from 10 platforms demonstrate that CTINexus significantly outperforms existing methods in constructing accurate and complete CSKGs, highlighting its potential to transform CTI analysis with an efficient and adaptable solution for the dynamic threat landscape.

摘要:網路威脅情報 (CTI) 報告中的文字描述,例如安全文章和新聞,是網路威脅的豐富知識來源,對於組織而言至關重要,可以隨時了解快速演變的威脅環境。然而,目前的 CTI 提取方法缺乏靈活性且難以概括,通常會導致知識提取不準確且不完整。語法解析依賴於固定規則和字典,而模型微調需要大量標註的資料集,這使得這兩種範例都難以適應新的威脅和本体。為了彌補差距,我們提出了 CTINexus,這是一個新穎的框架,利用大型語言模型 (LLM) 的最佳化情境學習 (ICL) 來進行資料有效率的 CTI 知識提取和高品質的網路安全知識圖 (CSKG) 建構。與現有方法不同,CTINexus 不需要廣泛的資料或參數調整,並且可以透過最少的標註範例適應各種本体。這是透過 (1) 經過精心設計的自動提示建構策略,並透過最佳示範檢索來提取廣泛的網路安全實體和關係來實現的;(2) 一種階層式實體比對技術,可以將提取的知識標準化並消除冗餘;(3) 一種 ICL 增強的長距離關係預測技術,可以進一步完成具有遺失連結的 CKSG。我們使用從 10 個平台收集的 150 份真實世界 CTI 報告進行廣泛評估,證明 CTINexus 在建構準確且完整的 CSKG 方面明顯優於現有方法,突顯了其以有效且適應性強的解決方案轉換 CTI 分析的潛力,以應對動態的威脅環境。

Graph-based Uncertainty Metrics for Long-form Language Model Outputs

2410.20783v1 by Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, Tatsunori Hashimoto

Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty -- which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 6.8% relative gains on AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses.

摘要:大型語言模型 (LLM) 的最新進展顯著提升了文字生成能力,但這些系統仍以產生幻覺著稱,而針對長篇 LLM 生成的細緻不確定性估計仍是一項挑戰。在這項工作中,我們提出圖形不確定性,它將 LLM 生成和其中的主張表示為二部圖,並使用一系列圖形中心性指標估計主張層級的不確定性。在此觀點下,現有的基於自洽性概念的不確定性估計方法可視為使用度量中心性作為不確定性指標,我們證明了更精密的替代方案(例如接近中心性)在主張層級不確定性估計中提供了穩定的增益。此外,我們提出了不確定性感知解碼技術,該技術利用圖形結構和不確定性估計來提升 LLM 生成的真實性,方法是僅保留最可靠的主張。與現有方法相比,我們的基於圖形的指標在各種長篇生成設定中平均提升了 AUPRC 的 6.8%,而我們的端到端系統在真實性方面提供了 2-4% 的穩定增益,同時顯著提升了生成回應的資訊性。
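The bipartite-graph view described above can be sketched with plain Python. Sampled responses and extracted claims form the two node sets, and an edge links a response to each claim it supports. Degree centrality of a claim then corresponds to a self-consistency style count, and closeness centrality is one alternative. The support data is invented, reading higher centrality as lower uncertainty is an assumption of this sketch, and the closeness formula used here is the basic one restricted to reachable nodes, which may differ from the paper's exact variant.

```python
from collections import deque


def build_bipartite_graph(support):
    """`support` maps a response id to the set of claim ids it supports."""
    adj = {}
    for response, claims in support.items():
        r_node = ("response", response)
        adj.setdefault(r_node, set())
        for claim in claims:
            c_node = ("claim", claim)
            adj[r_node].add(c_node)
            adj.setdefault(c_node, set()).add(r_node)
    return adj


def degree_centrality(adj, node):
    """Number of neighbors divided by the number of other nodes."""
    return len(adj[node]) / (len(adj) - 1)


def closeness_centrality(adj, node):
    """(reachable nodes) / (sum of shortest-path distances to them), via BFS."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


# Three sampled responses; claim c1 is supported by all of them.
support = {"r1": {"c1", "c2"}, "r2": {"c1"}, "r3": {"c1", "c3"}}
adj = build_bipartite_graph(support)
deg_c1 = degree_centrality(adj, ("claim", "c1"))
deg_c2 = degree_centrality(adj, ("claim", "c2"))
clo_c1 = closeness_centrality(adj, ("claim", "c1"))
clo_c2 = closeness_centrality(adj, ("claim", "c2"))
print(deg_c1, deg_c2, clo_c1, clo_c2)
```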

Plan$\times$RAG: Planning-guided Retrieval Augmented Generation

2410.20753v1 by Prakhar Verma, Sukruta Prakash Midigeshi, Gaurav Sinha, Arno Solin, Nagarajan Natarajan, Amit Sharma

We introduce Planning-guided Retrieval Augmented Generation (Plan$\times$RAG), a novel framework that augments the "retrieve-then-reason" paradigm of existing RAG frameworks to "plan-then-retrieve". Plan$\times$RAG formulates a reasoning plan as a directed acyclic graph (DAG), decomposing queries into interrelated atomic sub-queries. Answer generation follows the DAG structure, allowing significant gains in efficiency through parallelized retrieval and generation. While state-of-the-art RAG solutions require extensive data generation and fine-tuning of language models (LMs), Plan$\times$RAG incorporates frozen LMs as plug-and-play experts to generate high-quality answers. Compared to existing RAG solutions, Plan$\times$RAG demonstrates significant improvements in reducing hallucinations and bolstering attribution due to its structured sub-query decomposition. Overall, Plan$\times$RAG offers a new perspective on integrating external knowledge in LMs while ensuring attribution by design, contributing towards more reliable LM-based systems.

摘要:我們引入了規劃引導的檢索增強生成 (Plan$\times$RAG),這是一個新穎的框架,它擴充了現有 RAG 框架的「先檢索後推理」範例,改為「先規劃後檢索」。Plan$\times$RAG 將推理計畫制定為有向無環圖 (DAG),將查詢分解成相互關聯的原子子查詢。答案生成遵循 DAG 結構,透過並行檢索和生成,大幅提升效率。雖然最先進的 RAG 解决方案需要大量資料生成和語言模型 (LM) 的微調,但 Plan$\times$RAG 將凍結的 LM 整合為即插即用的專家,以生成高品質的答案。與現有的 RAG 解决方案相比,Plan$\times$RAG 在減少幻覺和加強歸因方面表現出顯著的進步,這要歸功於其結構化的子查詢分解。總體而言,Plan$\times$RAG 提供了一個新的觀點,以整合 LM 中的外部知識,同時確保歸因設計,有助於建立更可靠的基於 LM 的系統。
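The plan-as-DAG idea can be sketched with the standard-library `graphlib` module: sub-queries whose parents are all answered form a batch that can run in parallel. Everything specific here is an assumption for illustration, including the `answer_subquery` stand-in (which replaces retrieval plus a frozen LM call), the batch-by-batch scheduling, and the example plan; the abstract does not describe the framework's actual scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def answer_subquery(query, parent_answers):
    # Hypothetical stand-in for "retrieve documents, then ask a frozen LM".
    return f"answer({query})"


def execute_plan(plan, subqueries):
    """Run sub-queries in dependency order, parallelizing independent ones.

    `plan` maps each node to the set of parent nodes it depends on.
    Returns (answers, batches), where batches lists the nodes run together.
    """
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    answers, batches = {}, []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())
            batches.append(ready)
            futures = {
                node: pool.submit(
                    answer_subquery,
                    subqueries[node],
                    {p: answers[p] for p in plan.get(node, ())},
                )
                for node in ready
            }
            for node, future in futures.items():
                answers[node] = future.result()
                sorter.done(node)
    return answers, batches


plan = {"q1": set(), "q2": set(), "q3": {"q1", "q2"}}
subqueries = {
    "q1": "Where was author X born?",
    "q2": "Where was author Y born?",
    "q3": "Were X and Y born in the same country?",
}
answers, batches = execute_plan(plan, subqueries)
print(batches)
```

Since each answer is tied to one atomic sub-query and its retrieved context, attribution can be tracked per node, which matches the "attribution by design" framing in the abstract.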

Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation

2410.20724v2 by Mufei Li, Siqi Miao, Pan Li

Large Language Models (LLMs) demonstrate strong reasoning abilities but face limitations such as hallucinations and outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM outputs in structured external knowledge from KGs. However, current KG-based RAG frameworks still struggle to optimize the trade-off between retrieval effectiveness and efficiency in identifying a suitable amount of relevant graph information for the LLM to digest. We introduce SubgraphRAG, extending the KG-based RAG framework that retrieves subgraphs and leverages LLMs for reasoning and answer prediction. Our approach innovatively integrates a lightweight multilayer perceptron with a parallel triple-scoring mechanism for efficient and flexible subgraph retrieval while encoding directional structural distances to enhance retrieval effectiveness. The size of retrieved subgraphs can be flexibly adjusted to match the query's need and the downstream LLM's capabilities. This design strikes a balance between model complexity and reasoning power, enabling scalable and generalizable retrieval processes. Notably, based on our retrieved subgraphs, smaller LLMs like Llama3.1-8B-Instruct deliver competitive results with explainable reasoning, while larger models like GPT-4o achieve state-of-the-art accuracy compared with previous baselines -- all without fine-tuning. Extensive evaluations on the WebQSP and CWQ benchmarks highlight SubgraphRAG's strengths in efficiency, accuracy, and reliability by reducing hallucinations and improving response grounding.

摘要:大型語言模型 (LLM) 具有強大的推理能力,但面臨幻覺和過時知識等限制。基於知識圖譜 (KG) 的檢索增強生成 (RAG) 透過將 LLM 輸出結果奠基於 KG 中的結構化外部知識,來解決這些問題。然而,目前基於 KG 的 RAG 架構仍難以在檢索效能和效率之間取得最佳平衡,以找出 LLM 能夠消化的適當相關圖表資訊量。我們引進 SubgraphRAG,擴充基於 KG 的 RAG 架構,以檢索子圖表並利用 LLM 進行推理和答案預測。我們的做法創新地整合了一個輕量多層感知器與一個並行三元組計分機制,用於高效且靈活地檢索子圖表,同時編碼方向結構距離以增強檢索效能。檢索到的子圖表大小可以靈活調整,以符合查詢需求和下游 LLM 的功能。這種設計在模型複雜度和推理能力之間取得平衡,實現可擴充且可概化的檢索程序。值得注意的是,根據我們檢索到的子圖表,較小的 LLM(例如 Llama3.1-8B-Instruct)可以提供具備可解釋推理的競爭結果,而較大的模型(例如 GPT-4o)則達到與先前基準相比的最新準確度,而且所有這些都不需要微調。在 WebQSP 和 CWQ 基準上的廣泛評估突顯了 SubgraphRAG 在效率、準確度和可靠性方面的優勢,透過減少幻覺並改善回應依據。

Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs

2410.20321v1 by Xingrui Zhuo, Jiapu Wang, Gongqing Wu, Shirui Pan, Xindong Wu

Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs. To enhance the generalization of KGQE models, recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries. The whole process is commonly referred to as Query Pattern Learning (QPL). However, current QPL methods typically suffer from the pattern-entity alignment bias problem, so the learned query patterns are defective and limit KGQE models' performance. To address this problem, we propose an effective Query Instruction Parsing Plugin (QIPP) that leverages the context awareness of Pre-trained Language Models (PLMs) to capture latent query patterns from code-like query instructions. Unlike the external information introduced by previous QPL methods, we first propose code-like instructions to express FOL queries in an alternative format. This format utilizes textual variables and nested tuples to convey the logical semantics within FOL queries, serving as raw materials for a PLM-based instruction encoder to obtain complete query patterns. Building on this, we design a query-guided instruction decoder to adapt query patterns to KGQE models. To further enhance QIPP's effectiveness across various KGQE models, we propose a query pattern injection mechanism based on compressed optimization boundaries and an adaptive normalization component, allowing KGQE models to utilize query patterns more efficiently. Extensive experiments demonstrate that our plug-and-play method improves the performance of eight basic KGQE models and outperforms two state-of-the-art QPL methods.

摘要:知識圖譜查詢嵌入(KGQE)旨在將一階邏輯(FOL)查詢嵌入到低維 KG 空間中,以便對不完整的 KG 進行複雜推理。為了增強 KGQE 模型的泛化能力,最近的研究整合了各種外部資訊(例如實體類型和關係上下文),以更好地捕捉 FOL 查詢的邏輯語義。整個過程通常稱為查詢模式學習(QPL)。然而,當前的 QPL 方法通常會受到模式實體對齊偏差問題的影響,導致學習到的有缺陷查詢模式限制了 KGQE 模型的效能。為了解決這個問題,我們提出了一個有效的查詢指令解析外掛程式(QIPP),它利用預訓練語言模型(PLM)的上下文感知來從類代碼的查詢指令中擷取潛在查詢模式。與先前 QPL 方法引入的外部資訊不同,我們首先提出類代碼的指令以另類格式表達 FOL 查詢。此格式利用文字變數和巢狀元組來傳達 FOL 查詢中的邏輯語義,作為基於 PLM 的指令編碼器的原料,以取得完整的查詢模式。在此基礎上,我們設計了一個查詢引導的指令解碼器,以將查詢模式調整到 KGQE 模型。為了進一步增強 QIPP 在各種 KGQE 模型中的有效性,我們提出了一個基於壓縮最佳化邊界和自適應正規化元件的查詢模式注入機制,允許 KGQE 模型更有效地利用查詢模式。廣泛的實驗表明,我們的即插即用方法改善了八個基本 KGQE 模型的效能,並優於兩種最先進的 QPL 方法。
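As a rough illustration of expressing an FOL query with textual variables and nested tuples, the sketch below encodes a two-branch intersection query and renders it as a code-like string that a text encoder could consume. The operator names (`intersection`, `projection`, `entity`), the variable names, and the rendering format are all hypothetical; the abstract does not give the actual instruction syntax used by QIPP.

```python
def render_instruction(query):
    """Recursively render a nested-tuple query as a code-like string.

    A query is either a string (a textual variable) or a tuple whose first
    element is an operator name and whose remaining elements are arguments.
    """
    if isinstance(query, str):
        return query
    operator, *args = query
    rendered_args = ", ".join(render_instruction(arg) for arg in args)
    return f"{operator}({rendered_args})"


# "Find entities reached from e1 via r1 AND from e2 via r2" (an intersection query).
query = (
    "intersection",
    ("projection", "r1", ("entity", "e1")),
    ("projection", "r2", ("entity", "e2")),
)
instruction = render_instruction(query)
print(instruction)
```

Using placeholders such as `e1` and `r1` instead of concrete entity names keeps the string focused on the logical structure of the query rather than on any particular entity.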

Mathematical Derivation Graphs: A Task for Summarizing Equation Dependencies in STEM Manuscripts

2410.21324v1 by Vishesh Prasad, Brian Kim, Nickvash Kani

Recent advances in natural language processing (NLP), particularly with the emergence of large language models (LLMs), have significantly enhanced the field of textual analysis. However, while these developments have yielded substantial progress in analyzing textual data, applying analysis to mathematical equations and their relationships within texts has produced mixed results. In this paper, we take the initial steps toward understanding the dependency relationships between mathematical expressions in STEM articles. Our dataset, sourced from a random sampling of the arXiv corpus, contains an analysis of 107 published STEM manuscripts whose inter-equation dependency relationships have been hand-labeled, resulting in a new object we refer to as a derivation graph that summarizes the mathematical content of the manuscript. We exhaustively evaluate analytical and NLP-based models to assess their capability to identify and extract the derivation relationships for each article and compare the results with the ground truth. Our comprehensive testing finds that both analytical and NLP models (including LLMs) achieve $\sim$40-50% F1 scores for extracting derivation graphs from articles, revealing that the recent advances in NLP have not made significant inroads in comprehending mathematical texts compared to simpler analytic models. While current approaches offer a solid foundation for extracting mathematical information, further research is necessary to improve accuracy and depth in this area.

摘要:自然語言處理(NLP)的最新進展,特別是大語言模型(LLM)的出現,已顯著增強了文本分析領域。然而,儘管這些發展在分析文本資料方面取得了實質性進展,但將分析應用於數學方程式及其在文本中的關係卻產生了不同的結果。在本文中,我們採取了初步步驟來了解 STEM 文章中數學表達式之間的依賴關係。我們的資料集取自 arXiv 語料庫的隨機抽樣,其中包含對 107 篇已發表的 STEM 手稿的分析,其方程式間的依賴關係已進行手動標記,產生了一個我們稱為衍生圖的新物件,該物件總結了手稿的數學內容。我們徹底評估了分析和基於 NLP 的模型,以評估它們識別和提取每篇文章的衍生關係的能力,並將結果與真實情況進行比較。我們的全面測試發現,分析和 NLP 模型(包括 LLM)在從文章中提取衍生圖方面的 F1 分數均達到 $\sim$40-50%,這表明與更簡單的分析模型相比,NLP 的最新進展並沒有在理解數學文本方面取得重大進展。儘管目前的方法為提取數學資訊提供了堅實的基礎,但仍需要進一步的研究來提高此領域的準確性和深度。
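
The F1 scores quoted above compare extracted derivation graphs with hand-labeled ones. As a rough illustration, and assuming (this is not stated in the abstract) that the score is computed over directed edges between equations, the metric could be sketched as follows. The equation labels and edges are made-up toy data.

```python
def edge_f1(predicted, gold):
    """Return (precision, recall, F1) over two collections of directed edges."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: an edge (a, b) means "equation b is derived from equation a".
gold_edges = {("eq1", "eq2"), ("eq1", "eq3"), ("eq2", "eq4"), ("eq3", "eq4")}
predicted_edges = {("eq1", "eq2"), ("eq2", "eq4"), ("eq2", "eq3")}

precision, recall, f1 = edge_f1(predicted_edges, gold_edges)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the three predicted edges are correct and two of the four gold edges are recovered, so the F1 is 4/7.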

DualMAR: Medical-Augmented Representation from Dual-Expertise Perspectives

2410.19955v1 by Pengfei Hu, Chang Lu, Fei Wang, Yue Ning

Electronic Health Records (EHRs) have revolutionized healthcare data management and prediction in the field of AI and machine learning. Accurate predictions of diagnoses and medications significantly mitigate health risks and provide guidance for preventive care. However, EHR-driven models often have a limited understanding of medical-domain knowledge and mostly rely on simple, single ontologies. In addition, because EHRs have missing features and incomplete disease coverage, most studies focus only on basic analysis of conditions and medications. We propose DualMAR, a framework that enhances EHR prediction tasks through both individual observation data and public knowledge bases. First, we construct a bi-hierarchical Diagnosis Knowledge Graph (KG) using verified public clinical ontologies and augment this KG via Large Language Models (LLMs). Second, we design a new proxy-task learning on lab results in EHR for pretraining, which further enhances KG representation and patient embeddings. By retrieving radial and angular coordinates in polar space, DualMAR enables accurate predictions based on rich hierarchical and semantic embeddings from the KG. Experiments also demonstrate that DualMAR outperforms state-of-the-art models, validating its effectiveness in EHR prediction and KG integration in medical domains.

摘要:電子健康紀錄 (EHR) 已徹底改變了醫療保健資料管理,並預測了人工智慧和機器學習領域。準確預測診斷和藥物可大幅減輕健康風險,並提供預防性照護的指導方針。然而,EHR 驅動的模型在理解醫療領域知識上通常具有局限性,而且大多依賴於簡單且單一的本体。此外,由於 EHR 遺漏了功能且疾病涵蓋不完整,大多數研究僅專注於疾病和藥物的基本分析。我們提出 DualMAR,一個透過個人觀察資料和公共知識庫增強 EHR 預測任務的架構。首先,我們使用經過驗證的公共臨床本体構建一個雙層級診斷知識圖 (KG),並透過大型語言模型 (LLM) 擴充這個 KG;其次,我們設計一個新的代理任務學習,針對 EHR 中的實驗室結果進行預訓練,進一步增強 KG 表示和患者嵌入。透過擷取極座標空間上的徑向和角向坐標,DualMAR 能夠根據 KG 中豐富的層級和語意嵌入進行準確的預測。實驗也證明 DualMAR 優於最先進的模型,驗證了其在 EHR 預測和醫療領域中 KG 整合的有效性。

FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning

2410.19727v1 by Nicole Cho, Nishan Srishankar, Lucas Cecchi, William Watson

Financial intelligence generation from vast data sources has typically relied on traditional methods of knowledge-graph construction or database engineering. Recently, fine-tuned financial domain-specific Large Language Models (LLMs) have emerged. While these advancements are promising, limitations remain, such as high inference costs, hallucinations, and the complexity of concurrently analyzing high-dimensional financial data. This motivates our invention FISHNET (Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert swarming, and Task planning), an agentic architecture that accomplishes highly complex analytical tasks for more than 98,000 regulatory filings that vary immensely in terms of semantics, data hierarchy, or format. FISHNET shows remarkable performance for financial insight generation (61.8% success rate over 5.0% Routing, 45.6% RAG R-Precision). We conduct rigorous ablations to empirically prove the success of FISHNET, each agent's importance, and the optimized performance of assembling all agents. Our modular architecture can be leveraged for a myriad of use-cases, enabling scalability, flexibility, and data integrity that are critical for financial tasks.

摘要:財務情報生成通常依賴於傳統的知識圖表建構或資料庫工程方法,這些方法來自於龐大的資料來源。最近,針對財務領域進行微調的大型語言模型 (LLM) 已應運而生。儘管這些進展令人振奮,但仍存在一些限制,例如高推理成本、幻覺,以及同時分析高維度財務資料的複雜性。這促使我們發明了 FISHNET(來自子查詢、協調、神經條件化、專家群集和任務規劃的財務情報),這是一種代理架構,可針對超過 98,000 份法規文件執行高度複雜的分析任務,而這些文件在語義、資料階層或格式方面差異極大。FISHNET 在產生財務見解方面表現出色(成功率為 61.8%,路由率為 5.0%,RAG R-Precision 為 45.6%)。我們進行了嚴格的消融,以實證證明 FISHNET 的成功、每個代理的重要性,以及組裝所有代理的最佳化效能。我們模組化的架構可運用於各種使用案例,提供財務任務至關重要的可擴充性、彈性和資料完整性。

Knowledge Graph Enhanced Language Agents for Recommendation

2410.19627v1 by Taicheng Guo, Chaochun Liu, Hai Wang, Varun Mannam, Fang Wang, Xin Chen, Xiangliang Zhang, Chandan K. Reddy

Language agents have recently been used to simulate human behavior and user-item interactions for recommendation systems. However, current language agent simulations do not understand the relationships between users and items, leading to inaccurate user profiles and ineffective recommendations. In this work, we explore the utility of Knowledge Graphs (KGs), which contain extensive and reliable relationships between users and items, for recommendation. Our key insight is that the paths in a KG can capture complex relationships between users and items, eliciting the underlying reasons for user preferences and enriching user profiles. Leveraging this insight, we propose Knowledge Graph Enhanced Language Agents (KGLA), a framework that unifies language agents and KG for recommendation systems. In the simulated recommendation scenario, we position the user and item within the KG and integrate KG paths as natural language descriptions into the simulation. This allows language agents to interact with each other and discover sufficient rationale behind their interactions, making the simulation more accurate and aligned with real-world cases, thus improving recommendation performance. Our experimental results show that KGLA significantly improves recommendation performance (with a 33%-95% boost in NDCG@1 among three widely used benchmarks) compared to the previous best baseline method.

摘要:語言代理最近已被用於模擬人類行為和推薦系統中的使用者項目互動。然而,目前的語言代理模擬並未了解使用者和項目之間的關係,導致使用者輪廓不準確和推薦效果不佳。在這項工作中,我們探討了知識圖譜 (KG) 的效用,其中包含使用者和項目之間廣泛且可靠的關係,以供推薦。我們的關鍵見解是,KG 中的路徑可以捕捉使用者和項目之間的複雜關係,引出使用者偏好的根本原因並豐富使用者輪廓。利用此見解,我們提出了知識圖譜增強語言代理 (KGLA),一個統一語言代理和 KG 以用於推薦系統的架構。在模擬推薦情境中,我們將使用者和項目定位在 KG 中,並將 KG 路徑整合為自然語言描述到模擬中。這允許語言代理彼此互動並發現其互動背後的充分依據,使模擬更準確且與實際案例相符,從而改善推薦效能。我們的實驗結果顯示,與先前最佳基準方法相比,KGLA 大幅改善了推薦效能(在三個廣泛使用的基準中,NDCG@1 提升了 33%-95%)。
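
The KGLA abstract above describes locating a user and an item in a KG and feeding the connecting paths into the simulation as natural language. A minimal sketch of that idea, assuming a plain breadth-first search and a naive sentence template, is shown below. The triples, the `find_path` and `verbalize` helpers, and the hop limit are illustrative assumptions rather than the paper's method.

```python
from collections import deque


def find_path(triples, start, goal, max_hops=3):
    """Breadth-first search for a shortest chain of triples linking start to goal."""
    adjacency = {}
    for triple in triples:
        head, _, tail = triple
        # Edges can be traversed in either direction; the original triple is kept.
        adjacency.setdefault(head, []).append((tail, triple))
        adjacency.setdefault(tail, []).append((head, triple))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:
            continue
        for neighbor, triple in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [triple]))
    return None


def verbalize(path):
    """Turn a list of (head, relation, tail) triples into plain sentences."""
    return " ".join(f"{head} {relation} {tail}." for head, relation, tail in path)


kg = [
    ("user_1", "purchased", "item_A"),
    ("item_A", "is made by", "brand_X"),
    ("item_B", "is made by", "brand_X"),
]

path = find_path(kg, "user_1", "item_B")
description = verbalize(path)
print(description)
```

A description like this could be added to an agent's prompt as the rationale for why `user_1` might be interested in `item_B`.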

Graph Linearization Methods for Reasoning on Graphs with Large Language Models

2410.19494v1 by Christos Xypolopoulos, Guokan Shang, Xiao Fei, Giannis Nikolentzos, Hadi Abdine, Iakovos Evdaimon, Michail Chatzianastasis, Giorgos Stamou, Michalis Vazirgiannis

Large language models have evolved to process multiple modalities beyond text, such as images and audio, which motivates us to explore how to effectively leverage them for graph machine learning tasks. The key question, therefore, is how to transform graphs into linear sequences of tokens, a process we term graph linearization, so that LLMs can handle graphs naturally. We consider that graphs should be linearized meaningfully to reflect certain properties of natural language text, such as local dependency and global alignment, so that contemporary LLMs, trained on trillions of textual tokens, can better understand graphs. To achieve this, we developed several graph linearization methods based on graph centrality, degeneracy, and node relabeling schemes. We then investigated their effect on LLM performance in graph reasoning tasks. Experimental results on synthetic graphs demonstrate the effectiveness of our methods compared to random linearization baselines. Our work introduces novel graph representations suitable for LLMs, contributing to the potential integration of graph machine learning with the trend of multi-modal processing using a unified transformer model.

摘要:大型語言模型已演化為處理文字之外的多種模式,例如影像和音訊,這促使我們探索如何有效地運用它們於圖形機器學習任務。因此,關鍵問題在於如何將圖形轉換為線性序列的代幣,這是一個我們稱為圖形線性化的過程,讓 LLM 能自然地處理圖形。我們認為圖形應有意義地進行線性化,以反映自然語言文字的特定屬性,例如局部依賴性和全局對齊,以便讓在數兆個文字代幣上訓練的當代 LLM 更能理解圖形。為達成此目的,我們開發了幾種基於圖形中心性、簡併性和節點重新標籤架構的圖形線性化方法。接著,我們探討它們對 LLM 在圖形推理任務中的效能影響。合成圖形上的實驗結果證明了我們的方法比隨機線性化基準更有效。我們的研究引入了適合 LLM 的新穎圖形表示法,有助於將圖形機器學習與使用統一Transformer模型的多模式處理趨勢整合起來。
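
To make the idea of graph linearization above more concrete, the sketch below orders nodes by degree (one simple centrality measure), relabels them by rank, and emits the edges as a flat token string. The abstract does not specify the exact schemes, so the tie-breaking rule, the output syntax, and the function name `linearize_by_degree` are assumptions for illustration only.

```python
def linearize_by_degree(edges):
    """Relabel nodes by descending degree and emit a sorted edge-list string."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Higher-degree nodes get smaller labels; ties are broken by the original name.
    ranked = sorted(degree, key=lambda node: (-degree[node], str(node)))
    relabel = {node: index for index, node in enumerate(ranked)}
    new_edges = sorted(
        (min(relabel[u], relabel[v]), max(relabel[u], relabel[v])) for u, v in edges
    )
    text = " ".join(f"({a},{b})" for a, b in new_edges)
    return text, relabel


edges = [("d", "a"), ("b", "a"), ("c", "a"), ("b", "c")]
text, relabel = linearize_by_degree(edges)
print(text)  # (0,1) (0,2) (0,3) (1,2)
```

Because the most connected node always becomes `0`, two isomorphic graphs with different original labels tend to produce similar sequences. This is one plausible reading of the "global alignment" property mentioned in the abstract.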

Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis

2410.19225v1 by Weikai Li, Ding Wang, Zijian Ding, Atefeh Sohrabizadeh, Zongyue Qin, Jason Cong, Yizhou Sun

High-level synthesis (HLS) is a widely used tool for designing Field Programmable Gate Arrays (FPGAs). HLS enables FPGA design with software programming languages by compiling the source code into an FPGA circuit. The source code includes a program (called a "kernel") and several pragmas that instruct hardware synthesis, such as parallelization and pipelining. While it is relatively easy for software developers to design the program, designing the pragmas relies heavily on hardware knowledge, posing a big challenge for software developers. Recently, different machine learning algorithms, such as GNNs, have been proposed to automate the pragma design via performance prediction. However, when applying the trained model to new kernels, the significant domain shift often leads to unsatisfactory performance. We propose a more domain-generalizable model structure: a two-level hierarchical Mixture of Experts (MoE) that can be flexibly adapted to any GNN model. Different expert networks can learn to deal with different regions in the representation space, and they can utilize similar patterns between the old kernels and new kernels. In the low-level MoE, we apply MoE on three natural granularities of a program: node, basic block, and graph. The high-level MoE learns to aggregate the three granularities for the final decision. To stably train the hierarchical MoE, we further propose a two-stage training method. Extensive experiments verify the effectiveness of the hierarchical MoE.

摘要:高階綜合(HLS)是設計現場可編程閘陣列(FPGA)中廣泛使用的工具。HLS 透過將原始碼編譯成 FPGA 電路,使用軟體程式語言進行 FPGA 設計。原始碼包含一個程式(稱為「核心」)和多個指導硬體綜合的指示,例如平行化、管線等。雖然軟體開發人員設計程式相對容易,但它極度依賴硬體知識來設計指示,這對軟體開發人員來說是一大挑戰。最近,不同的機器學習演算法,例如 GNN,已被提出用於透過效能預測自動進行指示設計。然而,在新的核心上應用訓練好的模型時,顯著的領域轉移通常會導致效能不佳。我們提出一個更具領域通用性的模型結構:一個二階層混合專家(MoE),它可以靈活地適應任何 GNN 模型。不同的專家網路可以學習處理表示空間中的不同區域,並且它們可以利用舊核心和新核心之間的相似模式。在低階 MoE 中,我們對程式的三個自然粒度應用 MoE:節點、基本區塊和圖。高階 MoE 學習彙總這三個粒度以做出最終決策。為了穩定訓練階層式 MoE,我們進一步提出一個二階段訓練方法。廣泛的實驗驗證了階層式 MoE 的有效性。

Enriching GNNs with Text Contextual Representations for Detecting Disinformation Campaigns on Social Media

2410.19193v1 by Bruno Croso Cunha da Silva, Thomas Palmeira Ferraz, Roseli De Deus Lopes

Disinformation on social media poses both societal and technical challenges. While previous studies have integrated textual information into propagation networks, they have yet to fully leverage the advancements in Transformer-based language models for high-quality contextual text representations. This work investigates the impact of incorporating textual features into Graph Neural Networks (GNNs) for fake news detection. Our experiments demonstrate that contextual representations improve performance by 9.3% in Macro F1 over static ones and 33.8% over GNNs without textual features. However, noisy data augmentation degrades performance and increases instability. We expect our methodology to open avenues for further research, and all code is made publicly available.

摘要:社群媒體上的錯誤訊息造成社會和技術層面的挑戰。 儘管過往的研究已將文字資訊整合到傳播網路中,但尚未充分利用基於 Transformer 的語言模型在高品質脈絡文字表徵上的進展。這項研究探討將文字特徵納入圖形神經網路 (GNN) 中對於假新聞偵測的影響。我們的實驗結果顯示,脈絡表徵將巨觀 F1 的效能提升了 9.3%,優於靜態表徵,並比沒有文字特徵的 GNN 提升了 33.8%。然而,有雜訊的資料擴充會降低效能並增加不穩定性。我們預期我們的研究方法將開啟進一步研究的途徑,所有程式碼皆公開提供。

GCoder: Improving Large Language Model for Generalized Graph Problem Solving

2410.19084v1 by Qifan Zhang, Xiaobin Hong, Jianheng Tang, Nuo Chen, Yuhan Li, Wenzhong Li, Jing Tang, Jia Li

Large Language Models (LLMs) have demonstrated strong reasoning abilities, making them suitable for complex tasks such as graph computation. The traditional reasoning-steps paradigm for graph problems is hindered by unverifiable steps, limited long-term reasoning, and poor generalization to graph variations. To overcome these limitations, we introduce GCoder, a code-based LLM designed to enhance problem-solving in generalized graph computation problems. Our method involves constructing an extensive training dataset, GraphWild, featuring diverse graph formats and algorithms. We employ a multi-stage training process, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Compiler Feedback (RLCF), to refine model capabilities. For unseen tasks, a hybrid retrieval technique is used to augment performance. Experiments demonstrate that GCoder outperforms GPT-4o, with an average accuracy improvement of 16.42% across various graph computational problems. Furthermore, GCoder efficiently manages large-scale graphs with millions of nodes and diverse input formats, overcoming the limitations of previous models focused on the reasoning-steps paradigm. This advancement paves the way for more intuitive and effective graph problem-solving using LLMs. Code and data are available at https://github.com/Bklight999/WWW25-GCoder/tree/master.

摘要:大型語言模型 (LLM) 已展現強大的推理能力,使其適用於複雜任務,例如圖形運算。傳統圖形問題的推理步驟範例受到不可驗證的步驟、有限的長期推理和對圖形變化的概括性不佳的阻礙。為了克服這些限制,我們引入了 GCoder,一種基於代碼的 LLM,旨在增強廣義圖形運算問題中的問題解決能力。我們的技術涉及構建一個廣泛的訓練資料集 GraphWild,其中包含多樣的圖形格式和演算法。我們採用多階段訓練流程,包括監督微調 (SFT) 和編譯器回饋強化學習 (RLCF),以改善模型能力。對於未知任務,使用混合擷取技術來增強效能。實驗證明,GCoder 優於 GPT-4o,在各種圖形運算問題中平均準確度提升了 16.42%。此外,GCoder 有效地管理著擁有數百萬個節點和多樣輸入格式的大規模圖形,克服了先前專注於推理步驟範例的模型的限制。這項進展為使用 LLM 進行更直觀且有效的圖形問題解決鋪平了道路。程式碼和資料可於此處取得:https://github.com/Bklight999/WWW25-GCoder/tree/master。

LLM-based Online Prediction of Time-varying Graph Signals

2410.18718v1 by Dayu Qin, Yi Yan, Ercan Engin Kuruoglu

In this paper, we propose a novel framework that leverages large language models (LLMs) for predicting missing values in time-varying graph signals by exploiting spatial and temporal smoothness. We leverage the power of LLM to achieve a message-passing scheme. For each missing node, its neighbors and previous estimates are fed into and processed by LLM to infer the missing observations. Tested on the task of the online prediction of wind-speed graph signals, our model outperforms online graph filtering algorithms in terms of accuracy, demonstrating the potential of LLMs in effectively addressing partially observed signals in graphs.

摘要:在本文中,我們提出了一個新穎的框架,該框架利用大型語言模型 (LLM) 來預測時變圖形信號中的缺失值,方法是利用空間和時間平滑度。我們利用 LLM 的能力來實現消息傳遞方案。對於每個缺失節點,其鄰居和先前的估計值會被輸入到 LLM 中並由 LLM 進行處理,以推斷出缺失的觀測值。在風速圖形信號的線上預測任務中進行測試,我們的模型在準確性方面優於線上圖形過濾演算法,這證明了 LLM 在有效處理圖形中部分觀測到的信號方面的潛力。
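
The abstract above says that, for each missing node, its neighbors and previous estimates are passed to an LLM. The sketch below shows one way such an input could be assembled as a text prompt, alongside a purely numeric smoothness baseline for comparison. The prompt wording, the m/s unit, the `alpha` weight, and both function names are assumptions. No LLM is actually called here.

```python
def build_prompt(node, neighbor_values, previous_estimate):
    """Assemble a text prompt describing one missing node's local context."""
    lines = [f"Node {node} has a missing wind-speed reading at the current time step."]
    for name, value in sorted(neighbor_values.items()):
        lines.append(f"Neighbor {name} currently reads {value:.1f} m/s.")
    lines.append(f"The previous estimate for node {node} was {previous_estimate:.1f} m/s.")
    lines.append("Reply with a single number: the estimated current reading in m/s.")
    return "\n".join(lines)


def smoothness_baseline(neighbor_values, previous_estimate, alpha=0.5):
    """Non-LLM stand-in: blend the spatial neighbor mean with the temporal estimate."""
    spatial_mean = sum(neighbor_values.values()) / len(neighbor_values)
    return alpha * spatial_mean + (1 - alpha) * previous_estimate


neighbors = {"B": 4.0, "C": 6.0}
prompt = build_prompt("A", neighbors, previous_estimate=5.5)
baseline = smoothness_baseline(neighbors, previous_estimate=5.5)
print(prompt)
print(baseline)
```

The baseline makes the spatial and temporal smoothness assumption explicit. In the framework described above, the LLM would presumably take the place of this fixed blend.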

Gene-Metabolite Association Prediction with Interactive Knowledge Transfer Enhanced Graph for Metabolite Production

2410.18475v2 by Kexuan Xin, Qingyun Wang, Junyu Chen, Pengfei Yu, Huimin Zhao, Heng Ji

In the rapidly evolving field of metabolic engineering, the quest for efficient and precise gene target identification for metabolite production enhancement presents significant challenges. Traditional approaches, whether knowledge-based or model-based, are notably time-consuming and labor-intensive, due to the vast scale of research literature and the approximate nature of genome-scale metabolic model (GEM) simulations. Therefore, we propose a new task, Gene-Metabolite Association Prediction based on metabolic graphs, to automate the process of candidate gene discovery for a given pair of metabolite and candidate-associated genes. We also present the first benchmark, containing 2474 metabolites and 1947 genes of two commonly used microorganisms, Saccharomyces cerevisiae (SC) and Issatchenkia orientalis (IO). This task is challenging due to the incompleteness of the metabolic graphs and the heterogeneity among distinct metabolisms. To overcome these limitations, we propose an Interactive Knowledge Transfer mechanism based on Metabolism Graph (IKT4Meta), which improves the association prediction accuracy by integrating the knowledge from different metabolism graphs. First, to build a bridge between two graphs for knowledge transfer, we utilize Pretrained Language Models (PLMs) with external knowledge of genes and metabolites to help generate inter-graph links, significantly alleviating the impact of heterogeneity. Second, we propagate intra-graph links from different metabolic graphs using inter-graph links as anchors. Finally, we conduct the gene-metabolite association prediction based on the enriched metabolism graphs, which integrate the knowledge from multiple microorganisms. Experiments on both types of organisms demonstrate that our proposed methodology outperforms baselines by up to 12.3% across various link prediction frameworks.

摘要:在快速發展的代謝工程領域中,尋求有效且精確的基因目標識別以提升代謝產物產量,是一項重大的挑戰。傳統方法,無論是基於知識或基於模型,都相當耗時且費力,這是因為研究文獻的規模龐大,且基因組規模代謝模型 (GEM) 模擬的近似性質。因此,我們提出了一項新的任務,即基於代謝圖的基因-代謝物關聯預測,以自動化候選基因發現的過程,針對給定的代謝物對和候選相關基因,並呈現第一個基準,其中包含 2474 種代謝物和 1947 個基因,來自兩種常用的微生物釀酒酵母 (SC) 和東方伊薩琴科酵母 (IO)。由於代謝圖的不完整性和不同代謝物之間的異質性,這項任務具有挑戰性。為了克服這些限制,我們提出了一個基於代謝圖的互動知識傳輸機制 (IKT4Meta),它透過整合來自不同代謝圖的知識來提高關聯預測的準確性。首先,為了在兩個圖之間建立知識傳輸的橋樑,我們利用具備基因和代謝物外部知識的預訓練語言模型 (PLM) 來幫助產生圖間連結,大幅減輕異質性的影響。其次,我們使用圖間連結作為錨點,從不同的代謝圖傳播圖內連結。最後,我們根據整合了多種微生物知識的豐富代謝圖,進行基因-代謝物關聯預測。兩種生物體的實驗都證明,我們提出的方法在各種連結預測架構中,比基準高出 12.3%。

ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis

2410.18447v1 by Zezhong Wang, Xingshan Zeng, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong

Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis process generally involves sampling a set of tools, formulating a requirement based on these tools, and generating the call statements. However, tools sampled randomly lack relevance, making them difficult to combine and thus reducing the diversity of the data. Additionally, current work overlooks the coherence between turns of dialogues, leading to a gap between the synthesized data and real-world scenarios. To address these issues, we propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We integrate these two strategies and enable multiple agents to synthesize the dialogue data interactively, resulting in our tool-calling data synthesis pipeline ToolFlow. Data quality assessments demonstrate improvements in the naturalness and coherence of our synthesized dialogues. Finally, we apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.

摘要:監督微調 (SFT) 是增強大型語言模型 (LLM) 工具呼叫功能的常見方法,訓練資料通常是合成資料。目前的資料合成流程通常涉及抽樣一組工具、根據這些工具制定需求,並產生呼叫陳述。然而,隨機抽樣的工具缺乏關聯性,使得它們難以組合,從而降低資料的多樣性。此外,目前的工作忽略了對話回合之間的連貫性,導致合成資料與現實世界場景之間存在差距。為了解決這些問題,我們提出了一個基於圖形的抽樣策略來抽取更多相關的工具組合,以及一個計畫生成策略來建立計畫,以引導連貫對話的合成。我們整合這兩種策略,並使多個代理能夠互動地合成對話資料,從而產生我們的工具呼叫資料合成管線 ToolFlow。資料品質評估證明了我們合成對話的自然性和連貫性有了改進。最後,我們使用 ToolFlow 生成的 8,000 個合成對話在 LLaMA-3.1-8B 上應用 SFT。結果表明,該模型實現了與 GPT-4 相當甚至超越 GPT-4 的工具呼叫效能,同時保持強大的通用能力。

Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains

2410.18415v1 by Kun Li, Tianhua Zhang, Xixin Wu, Hongyin Luo, James Glass, Helen Meng

Knowledge Graphs (KGs) can serve as reliable knowledge sources for question answering (QA) due to their structured representation of knowledge. Existing research on the utilization of KGs for large language models (LLMs) predominantly relies on subgraph retrievers or iterative prompting, overlooking the potential synergy of LLMs' step-wise reasoning capabilities and KGs' structural nature. In this paper, we present DoG (Decoding on Graphs), a novel framework that facilitates a deep synergy between LLMs and KGs. We first define a concept, the well-formed chain, which consists of a sequence of interrelated fact triplets on the KGs, starting from question entities and leading to answers. We argue that this concept can serve as a principle for faithful and sound reasoning in KGQA. To enable LLMs to generate well-formed chains, we propose graph-aware constrained decoding, in which a constraint derived from the topology of the KG regulates the decoding process of the LLMs. This constrained decoding method ensures the generation of well-formed chains while making full use of the step-wise reasoning capabilities of LLMs. Based on the above, DoG, a training-free approach, is able to provide faithful and sound reasoning trajectories grounded on the KGs. Experiments across various KGQA tasks with different background KGs demonstrate that DoG achieves superior and robust performance. DoG also shows general applicability with various open-source LLMs.

摘要:知識圖譜 (KG) 由於其結構化的知識表示,可用作問答 (QA) 的可靠知識來源。現有關於利用 KG 的大型語言模型 (LLM) 的研究普遍依賴於子圖檢索器或反覆提示,忽視了 LLM 的逐步推理能力和 KG 的結構特性的潛在協同作用。在本文中,我們提出了 DoG(圖形解碼),一個促進 LLM 和 KG 之間深度協同作用的新框架。我們首先定義了一個概念,即良好形成的鏈,它由 KG 上一系列相互關聯的事實三元組組成,從問題實體開始並導致答案。我們認為這個概念可以作為對 KGQA 進行忠實和合理的推理的原則。為了使 LLM 能夠生成良好的鏈,我們提出了圖感知約束解碼,其中源自 KG 拓撲的約束約束了 LLM 的解碼過程。這種受約束的解碼方法確保了良好形成的鏈的生成,同時充分利用了 LLM 的逐步推理能力。基於上述,DoG 是一種無需訓練的方法,能夠提供基於 KG 的忠實且合理的推理軌跡。在具有不同背景 KG 的各種 KGQA 任務中的實驗表明,DoG 達到了卓越且穩健的性能。DoG 還顯示了與各種開源 LLM 的通用適用性。
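
The graph-aware constrained decoding described above can be pictured at the level of whole triples: at each step, only facts that start from the entity reached so far are allowed. The sketch below is a heavily simplified, hypothetical version of that constraint. A real system would constrain token-level decoding inside the LLM, whereas `toy_score` here is a hand-written stand-in for the model's preferences.

```python
def decode_chain(kg, start_entity, score, num_steps):
    """Greedily build a chain of linked triples, restricted to edges present in the KG."""
    chain, current = [], start_entity
    for _ in range(num_steps):
        # Constraint from the KG topology: the next triple must start at `current`.
        candidates = [triple for triple in kg if triple[0] == current]
        if not candidates:
            break
        best = max(candidates, key=lambda triple: score(chain, triple))
        chain.append(best)
        current = best[2]
    return chain


kg = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Inception", "released_in", "2010"),
    ("Christopher Nolan", "born_in", "London"),
    ("Christopher Nolan", "directed", "Interstellar"),
]

# Stand-in for an LLM answering "Where was the director of Inception born?":
# it simply prefers a fixed relation at each step.
wanted_relations = ["directed_by", "born_in"]


def toy_score(chain, triple):
    step = len(chain)
    return 1.0 if step < len(wanted_relations) and triple[1] == wanted_relations[step] else 0.0


chain = decode_chain(kg, "Inception", toy_score, num_steps=2)
print(chain)
```

Every triple in the returned chain exists in the KG and starts where the previous one ended, which corresponds to the well-formedness property named in the abstract.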

Explaining Bayesian Networks in Natural Language using Factor Arguments. Evaluation in the medical domain

2410.18060v1 by Jaime Sevilla, Nikolay Babakov, Ehud Reiter, Alberto Bugarin

In this paper, we propose a model for building natural language explanations for Bayesian Network Reasoning in terms of factor arguments, which are argumentation graphs of flowing evidence that relate the observed evidence to a target variable we want to learn about. We introduce the notion of factor argument independence to address the outstanding question of when arguments should be presented jointly or separately, and we present an algorithm that, starting from the evidence nodes and a target node, produces a list of all independent factor arguments ordered by their strength. Finally, we implement a scheme to build natural language explanations of Bayesian Reasoning using this approach. Our proposal has been validated in the medical domain through a human-driven evaluation study in which we compare the Bayesian Network Reasoning explanations obtained using factor arguments with an alternative explanation method. Evaluation results indicate that users deem our proposed explanation approach significantly more useful for understanding Bayesian Network Reasoning than the existing explanation method it was compared against.

摘要:在本文中,我們提出了一個模型,用於建構貝氏網路推理的自然語言解釋,以因子論證為基礎,它們是流動證據的論證圖,將觀察到的證據與我們想要了解的目標變數聯繫起來。我們引入了因子論證獨立性的概念,以解決定義何時應將論證聯合或單獨呈現的未決問題,並提出了一種演算法,從證據節點和目標節點開始,產生一個按強度排序的所有獨立因子論證清單。最後,我們實作了一個方案,使用這種方法建構貝氏推理的自然語言解釋。我們的提案已在醫學領域中通過人為驅動的評估研究得到驗證,在該研究中,我們將使用因子論證獲得的貝氏網路推理解釋與另一種解釋方法進行比較。評估結果表明,與另一種現有的解釋方法相比,我們的提議解釋方法被使用者視為顯著更有助於理解貝氏網路推理。

Graphusion: A RAG Framework for Knowledge Graph Construction with a Global Perspective

2410.17600v1 by Rui Yang, Boming Yang, Aosong Feng, Sixun Ouyang, Moritz Blum, Tianwei She, Yuang Jiang, Freddy Lecue, Jinghui Lu, Irene Li

Knowledge Graphs (KGs) are crucial in the field of artificial intelligence and are widely used in downstream tasks, such as question-answering (QA). The construction of KGs typically requires significant effort from domain experts. Large Language Models (LLMs) have recently been used for Knowledge Graph Construction (KGC). However, most existing approaches focus on a local perspective, extracting knowledge triplets from individual sentences or documents, and lack a fusion process to combine the knowledge into a global KG. This work introduces Graphusion, a zero-shot KGC framework from free text. It contains three steps: in Step 1, we extract a list of seed entities using topic modeling to guide the final KG to include the most relevant entities; in Step 2, we conduct candidate triplet extraction using LLMs; in Step 3, we design the novel fusion module that provides a global view of the extracted knowledge, incorporating entity merging, conflict resolution, and novel triplet discovery. Results show that Graphusion achieves scores of 2.92 and 2.37 out of 3 for entity extraction and relation recognition, respectively. Moreover, we showcase how Graphusion could be applied to the Natural Language Processing (NLP) domain and validate it in an educational scenario. Specifically, we introduce TutorQA, a new expert-verified benchmark for QA, comprising six tasks and a total of 1,200 QA pairs. Using the Graphusion-constructed KG, we achieve a significant improvement on the benchmark, for example, a 9.2% accuracy improvement on sub-graph completion.

摘要:知識圖譜 (KG) 在人工智慧領域至關重要,廣泛用於下游任務,例如問答 (QA)。KG 的建構通常需要領域專家付出大量心力。大型語言模型 (LLM) 近來已用於知識圖譜建構 (KGC)。然而,現有方法大多著重於局部觀點,從個別句子或文件擷取知識三元組,缺少一個融合程序來將知識結合在一個整體 KG 中。本研究引入了 Graphusion,一個從自由文字進行零次學習的 KGC 框架。它包含三個步驟:在步驟 1 中,我們使用主題建模擷取一組種子實體,以引導最終的 KG 納入最相關的實體;在步驟 2 中,我們使用 LLM 進行候選三元組擷取;在步驟 3 中,我們設計了新穎的融合模組,提供擷取知識的整體觀點,包含實體合併、衝突解決和新三元組發現。結果顯示 Graphusion 在實體擷取和關係識別方面分別獲得 3 分中的 2.92 分和 2.37 分。此外,我們展示了 Graphusion 如何應用於自然語言處理 (NLP) 領域,並在教育情境中驗證它。具體來說,我們引入了 TutorQA,一個由專家驗證的新型 QA 基準,包含六項任務和總計 1,200 組 QA。使用 Graphusion 建構的 KG,我們在基準上取得顯著進步,例如,在子圖完成方面提升了 9.2% 的準確度。

2410.17529v1 by Yongqiang Huang, Wentao Ye, Liyao Li, Junbo Zhao

This study investigates the potential of Large Language Models (LLMs) for reconstructing and constructing the physical world solely based on textual knowledge. It explores the impact of model performance on spatial understanding abilities. To enhance the comprehension of geometric and spatial relationships in the complex physical world, the study introduces a set of geometric conventions and develops a workflow based on multi-layer graphs and multi-agent system frameworks. It examines how LLMs achieve multi-step and multi-objective geometric inference in a spatial environment using multi-layer graphs under unified geometric conventions. Additionally, the study employs a genetic algorithm, inspired by large-scale model knowledge, to solve geometric constraint problems. In summary, this work innovatively explores the feasibility of using text-based LLMs as physical world builders and designs a workflow to enhance their capabilities.

摘要:本研究探討大型語言模型 (LLM) 僅基於文字知識重建和建構物理世界的潛力。探討模型效能對空間理解能力的影響。為了增強對複雜物理世界中幾何和空間關係的理解,本研究引入了一組幾何慣例,並基於多層圖形和多代理系統架構開發了一套工作流程。研究探討了 LLM 如何在統一的幾何慣例下,使用多層圖形在空間環境中達成多步驟和多目標的幾何推論。此外,本研究採用受大型模型知識啟發的遺傳演算法來解決幾何約束問題。總之,這項工作創新地探討了使用基於文字的 LLM 作為物理世界建構者的可行性,並設計了一套工作流程來增強其能力。

Large Language Model-based Augmentation for Imbalanced Node Classification on Text-Attributed Graphs

2410.16882v1 by Leyao Wang, Yu Wang, Bo Ni, Yuying Zhao, Tyler Derr

Node classification on graphs frequently encounters the challenge of class imbalance, leading to biased performance and posing significant risks in real-world applications. Although several data-centric solutions have been proposed, none of them focus on Text-Attributed Graphs (TAGs), and therefore overlook the potential of leveraging the rich semantics encoded in textual features for boosting the classification of minority nodes. Given this crucial gap, we investigate the possibility of augmenting graph data in the text space, leveraging the textual generation power of Large Language Models (LLMs) to handle imbalanced node classification on TAGs. Specifically, we propose a novel approach called LA-TAG (LLM-based Augmentation on Text-Attributed Graphs), which prompts LLMs to generate synthetic texts based on existing node texts in the graph. Furthermore, to integrate these synthetic text-attributed nodes into the graph, we introduce a text-based link predictor to connect the synthesized nodes with the existing nodes. Our experiments across multiple datasets and evaluation metrics show that our framework significantly outperforms traditional non-textual-based data augmentation strategies and specific node imbalance solutions. This highlights the promise of using LLMs to resolve imbalance issues on TAGs.

摘要:圖形節點分類經常會遇到類別不平衡的挑戰,導致有偏差的效能,並在實際應用中造成顯著風險。儘管已提出多項以資料為中心的解決方案,但沒有一項專注於文字屬性圖形 (TAG),因此忽略了利用文字特徵中編碼的豐富語意來提升少數節點分類的可能性。鑑於這個關鍵差距,我們探討了在文字空間中擴充圖形資料的可能性,利用大型語言模型 (LLM) 的文字產生能力來處理 TAG 上的不平衡節點分類。具體來說,我們提出了一種名為 LA-TAG(基於 LLM 的文字屬性圖形擴充)的新方法,它提示 LLM 根據圖形中現有的節點文字產生合成文字。此外,為了將這些合成文字屬性節點整合到圖形中,我們引入了一個基於文字的連結預測器,以將合成節點與現有節點連接起來。我們在多個資料集和評估指標上的實驗表明,我們的框架明顯優於傳統的非文字資料擴充策略和特定的節點不平衡解決方案。這突顯了使用 LLM 來解決 TAG 上的不平衡問題的潛力。

Context-aware Inductive Knowledge Graph Completion with Latent Type Constraints and Subgraph Reasoning

2410.16803v1 by Muzhi Li, Cehao Yang, Chengjin Xu, Zixing Song, Xuhui Jiang, Jian Guo, Ho-fung Leung, Irwin King

Inductive knowledge graph completion (KGC) aims to predict missing triples with unseen entities. Recent works focus on modeling reasoning paths between the head and tail entity as direct supporting evidence. However, these methods depend heavily on the existence and quality of reasoning paths, which limits their general applicability in different scenarios. In addition, we observe that latent type constraints and neighboring facts inherent in KGs are also vital in inferring missing triples. To effectively utilize all useful information in KGs, we introduce CATS, a novel context-aware inductive KGC solution. With sufficient guidance from proper prompts and supervised fine-tuning, CATS activates the strong semantic understanding and reasoning capabilities of large language models to assess the existence of query triples. CATS consists of two modules. First, the type-aware reasoning module evaluates whether the candidate entity matches the latent entity type as required by the query relation. Then, the subgraph reasoning module selects relevant reasoning paths and neighboring facts, and evaluates their correlation to the query triple. Experimental results on three widely used datasets demonstrate that CATS significantly outperforms state-of-the-art methods in 16 out of 18 transductive, inductive, and few-shot settings with an average absolute MRR improvement of 7.2%.

摘要:歸納知識圖譜完成 (KGC) 旨在預測具有未見實體的缺失三元組。最近的工作重點在於建模頭實體和尾實體之間的推理路徑作為直接支持證據。然而,這些方法高度依賴推理路徑的存在和品質,這限制了它們在不同場景中的普遍適用性。此外,我們觀察到隱藏類型約束和 KG 中固有的鄰近事實對於推斷缺失三元組也至關重要。為了有效利用 KG 中所有有用的資訊,我們引入了 CATS,一種新穎的具備情境感知能力的歸納式 KGC 解决方案。在適當提示和監督微調的充分指導下,CATS 啟動大型語言模型強大的語義理解和推理能力,以評估查詢三元組的存在,其中包含兩個模組。首先,類型感知推理模組評估候選實體是否與查詢關係所需的隱藏實體類型相符。然後,子圖推理模組選擇相關推理路徑和鄰近事實,並評估它們與查詢三元組的關聯性。在三個廣泛使用的資料集上進行的實驗結果表明,在 18 個轉導、歸納和少次嘗試設定中,CATS 在 16 個設定中顯著優於最先進的方法,平均絕對 MRR 提升了 7.2%。
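
The result above is reported in MRR (mean reciprocal rank), the standard ranking metric for knowledge graph completion. A minimal sketch of how it is computed follows. The candidate lists are made-up toy data, and treating an answer missing from the list as contributing zero is an assumption of this sketch.

```python
def mean_reciprocal_rank(ranked_candidates, gold_answers):
    """Average of 1/rank of the gold answer over all queries (0 if it is absent)."""
    total = 0.0
    for candidates, answer in zip(ranked_candidates, gold_answers):
        if answer in candidates:
            total += 1.0 / (candidates.index(answer) + 1)
    return total / len(gold_answers)


# Three toy queries whose gold answers are ranked 1st, 2nd, and 4th.
ranked = [
    ["a", "b", "c"],
    ["x", "y", "z"],
    ["p", "q", "r", "s"],
]
gold = ["a", "y", "s"]

mrr = mean_reciprocal_rank(ranked, gold)
print(round(mrr, 4))
```

Here the reciprocal ranks are 1, 1/2, and 1/4, so the MRR is 1.75 / 3. An "absolute" improvement of 7.2% presumably means a gain of 0.072 on this 0-to-1 scale, not a relative gain.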

The Scene Language: Representing Scenes with Programs, Words, and Embeddings

2410.16770v1 by Yunzhi Zhang, Zizhang Li, Matt Zhou, Shangzhe Wu, Jiajun Wu

We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes. It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity. This representation can be inferred from pre-trained language models via a training-free inference technique, given text or image inputs. The resulting scene can be rendered into images using traditional, neural, or hybrid graphics renderers. Together, this forms a robust, automated system for high-quality 3D and 4D scene generation. Compared with existing representations like scene graphs, our proposed Scene Language generates complex scenes with higher fidelity, while explicitly modeling the scene structures to enable precise control and editing.

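Summary: We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes. It represents a scene with three key components: a program that specifies the hierarchical and relational structure of the entities in the scene; words in natural language that summarize the semantic class of each entity; and embeddings that capture the visual identity of each entity. Given text or image inputs, this representation can be inferred from pre-trained language models through a training-free inference technique. The resulting scene can be rendered into images using traditional, neural, or hybrid graphics renderers. Together, this forms a robust, automated system for high-quality 3D and 4D scene generation. Compared with existing representations such as scene graphs, the proposed Scene Language generates complex scenes with higher fidelity while explicitly modeling scene structure to enable precise control and editing.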

Atomic Fact Decomposition Helps Attributed Question Answering

2410.16708v1 by Zhichao Yan, Jiapu Wang, Jiaoyan Chen, Xiaoli Li, Ru Li, Jeff Z. Pan

Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a given question. Retrieval is a widely adopted approach, with two general paradigms: Retrieval-Then-Read (RTR) and post-hoc retrieval. Recently, Large Language Models (LLMs) have shown remarkable proficiency, prompting growing interest in AQA among researchers. However, RTR-based AQA often suffers from irrelevant knowledge and rapidly changing information, even when LLMs are adopted, while post-hoc retrieval-based AQA struggles to comprehend long-form answers with complex logic, to precisely identify the content needing revision, and to preserve the original intent. To tackle these problems, this paper proposes an Atomic fact decomposition-based Retrieval and Editing (ARE) framework, which uses instruction-tuned LLMs to decompose the generated long-form answers into molecular clauses and atomic facts. Notably, the instruction-tuned LLMs are fine-tuned using a well-constructed dataset generated from large-scale Knowledge Graphs (KGs). This process involves extracting one-hop neighbors from a given set of entities and transforming the result into coherent long-form text. Subsequently, ARE leverages a search engine to retrieve evidence related to the atomic facts and feeds this evidence into an LLM-based verifier to determine whether the facts require expansion for re-retrieval or editing. Furthermore, the edited facts are backtracked into the original answer, with evidence aggregated based on the relationship between molecular clauses and atomic facts. Extensive evaluations demonstrate the superior performance of our proposed method over the state of the art on several datasets, along with a newly proposed metric, $Attr_{p}$, for evaluating the precision of evidence attribution.

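Summary: Attributed Question Answering (AQA) aims to provide a trustworthy answer and a reliable attribution report for a given question. Retrieval is a widely adopted approach, with two general paradigms: Retrieval-Then-Read (RTR) and post-hoc retrieval. Recently, Large Language Models (LLMs) have shown remarkable proficiency, prompting growing interest in AQA among researchers. However, even with LLMs, RTR-based AQA is often affected by irrelevant knowledge and rapidly changing information, while post-hoc retrieval-based AQA struggles to understand long-form answers with complex logic and to pinpoint the content that needs revision while preserving the original intent. To address these problems, this paper proposes an Atomic fact decomposition-based Retrieval and Editing (ARE) framework, which uses instruction-tuned LLMs to decompose generated long-form answers into molecular clauses and atomic facts. Notably, the instruction-tuned LLMs are fine-tuned on a well-constructed dataset generated from large-scale Knowledge Graphs (KGs). This process extracts one-hop neighbors from a given set of entities and converts the result into coherent long-form text. ARE then uses a search engine to retrieve evidence related to the atomic facts and feeds this evidence into an LLM-based verifier to determine whether the facts need to be expanded for re-retrieval or edited. The edited results are then backtracked into the original answer, and evidence is aggregated according to the relationship between molecular clauses and atomic facts. Extensive evaluations show that the proposed method outperforms existing techniques on several datasets; the paper also proposes a new metric, $Attr_{p}$, for evaluating the precision of evidence attribution.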

PLDR-LLM: Large Language Model from Power Law Decoder Representations

2410.16703v1 by Burc Gokden

We present the Large Language Model from Power Law Decoder Representations (PLDR-LLM), a language model that leverages non-linear and linear transformations through the Power Law Graph Attention mechanism to generate well-defined deductive and inductive outputs. We pretrain PLDR-LLMs of varying layer sizes with a small batch size of 32 and ~8B tokens from the RefinedWeb dataset, and show that they achieve competitive performance in zero-shot and few-shot settings compared to scaled dot-product LLMs of similar model size reported in the literature. We show that the deductive outputs of PLDR-LLMs can be used to compare model characteristics or to improve performance by introducing the Directed Acyclic Graph (DAG) loss as a metric and regularizer. Our results indicate that the initial maximum learning rate and warm-up steps have a lasting impact on deductive outputs throughout pretraining. We provide a detailed description of the PLDR-LLM architecture, its implementation, and the pretraining procedure.

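Summary: We present the Large Language Model from Power Law Decoder Representations (PLDR-LLM), a language model that uses non-linear and linear transformations through a Power Law Graph Attention mechanism to produce well-defined deductive and inductive outputs. We pretrain PLDR-LLMs of varying layer sizes with a small batch size of 32 and ~8B tokens from the RefinedWeb dataset, and show that they achieve competitive performance in zero-shot and few-shot settings compared with scaled dot-product LLMs of similar model size reported in the literature. We show that the deductive outputs of PLDR-LLMs can be used to compare model characteristics, or to improve performance by introducing the Directed Acyclic Graph (DAG) loss as a metric and regularizer. Our results indicate that the initial maximum learning rate and warm-up steps have a lasting effect on deductive outputs throughout pretraining. We provide a detailed description of the PLDR-LLM architecture, its implementation, and the pretraining procedure.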

Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency

2410.16597v1 by Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong, Tiep Le, Shachar Rosenman, Vasudev Lal, Phil Mui, Ricky Ho, Phillip Howard, Chien-Sheng Wu

Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods predominantly rely on prompt-based approaches, which are inefficient for processing large-scale corpora. These approaches often suffer from information loss, particularly with long documents, due to the lack of specialized design for KG construction. Additionally, there is a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality -- including models up to eight times larger -- but also consistently excels in retrieval and question-answering tasks. Our proposed graph retrieval framework also outperforms all KG-retrieval methods across multiple benchmark datasets. We release the SynthKG dataset and Distill-SynthKG model publicly to support further research and development.

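Summary: Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods rely mainly on prompt-based approaches, which are inefficient for processing large-scale corpora. Because they lack a design specialized for KG construction, these methods often suffer from information loss, particularly with long documents. There is also a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level, ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we simplify the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. In addition, we repurpose existing question-answering datasets to build KG evaluation datasets and introduce new evaluation metrics. Using the KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results show that Distill-SynthKG not only surpasses all baseline models in KG quality, including models up to eight times larger, but also performs consistently well in retrieval and question-answering tasks. Our proposed graph retrieval framework also outperforms all KG-retrieval methods across multiple benchmark datasets. We publicly release the SynthKG dataset and the Distill-SynthKG model to support further research and development.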

Towards a Reliable Offline Personal AI Assistant for Long Duration Spaceflight

2410.16397v1 by Oliver Bensch, Leonie Bensch, Tommy Nilsson, Florian Saling, Wafa M. Sadri, Carsten Hartmann, Tobias Hecking, J. Nathan Kutz

As humanity prepares for new missions to the Moon and Mars, astronauts will need to operate with greater autonomy, given the communication delays that make real-time support from Earth difficult. For instance, messages between Mars and Earth can take up to 24 minutes, making quick responses impossible. This limitation poses a challenge for astronauts who must rely on in-situ tools to access the large volume of data from spacecraft sensors, rovers, and satellites, data that is often fragmented and difficult to use. To bridge this gap, systems like the Mars Exploration Telemetry-Driven Information System (METIS) are being developed. METIS is an AI assistant designed to handle routine tasks, monitor spacecraft systems, and detect anomalies, all while reducing the reliance on mission control. Current Generative Pretrained Transformer (GPT) Models, while powerful, struggle in safety-critical environments. They can generate plausible but incorrect responses, a phenomenon known as "hallucination," which could endanger astronauts. To overcome these limitations, this paper proposes enhancing systems like METIS by integrating GPTs, Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Augmented Reality (AR). The idea is to allow astronauts to interact with their data more intuitively, using natural language queries and visualizing real-time information through AR. KGs will be used to easily access live telemetry and multimodal data, ensuring that astronauts have the right information at the right time. By combining AI, KGs, and AR, this new system will empower astronauts to work more autonomously, safely, and efficiently during future space missions.

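Summary: As humanity prepares for new missions to the Moon and Mars, astronauts will need to operate with greater autonomy, because communication delays make real-time support from Earth difficult. For example, messages between Mars and Earth can take up to 24 minutes, which makes quick responses impossible. This limitation is a challenge for astronauts, who must rely on in-situ tools to access the large volume of data from spacecraft sensors, rovers, and satellites, data that is often fragmented and hard to use. To bridge this gap, systems such as the Mars Exploration Telemetry-Driven Information System (METIS) are being developed. METIS is an AI assistant designed to handle routine tasks, monitor spacecraft systems, and detect anomalies while reducing reliance on mission control. Current Generative Pretrained Transformer (GPT) models, although powerful, struggle in safety-critical environments. They can produce plausible but incorrect responses, a phenomenon known as "hallucination," which could endanger astronauts. To overcome these limitations, this paper proposes enhancing systems such as METIS by integrating GPTs, Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Augmented Reality (AR). The idea is to let astronauts interact with their data more intuitively, using natural language queries and visualizing real-time information through AR. KGs will be used to give easy access to live telemetry and multimodal data, ensuring that astronauts have the right information at the right time. By combining AI, KGs, and AR, this new system will enable astronauts to work more autonomously, safely, and efficiently during future space missions.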

A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns

2410.16155v1 by Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao

As large language models have developed, they have become widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared memory attacks. However, real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) non-complete graph structure and (2) large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to make poisoned samples contagious. We demonstrate the superiority of our approach in TMCHT, with improvements of 23.51%, 18.95%, and 52.93% in the line topology, star topology, and 100-agent settings, respectively. We encourage community attention to the security of multi-agent systems.

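Summary: As large language models have developed, they have become widely used as agents in various fields. A key component of an agent is its memory, which stores important information but is vulnerable to jailbreak attacks. Existing research focuses mainly on single-agent attacks and shared-memory attacks. Real-world scenarios, however, often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology, text-based attack evaluation framework. TMCHT involves one attacker agent trying to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) non-complete graph structure and (2) large-scale systems. We attribute these challenges to a phenomenon we call toxicity disappearing. To address these issues, we propose the Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix so that poisoned samples are retrieved more easily and optimizes the replication suffix so that poisoned samples become contagious. We demonstrate the superiority of our method in TMCHT, with improvements of 23.51%, 18.95%, and 52.93% in the line topology, star topology, and 100-agent settings, respectively. We encourage the community to pay attention to the security of multi-agent systems.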

CausalGraph2LLM: Evaluating LLMs for Causal Queries

2410.15939v1 by Ivaxi Sheth, Bahare Fatemi, Mario Fritz

Causality is essential in scientific research, enabling researchers to interpret true relationships between variables. These causal relationships are often represented by causal graphs, which are directed acyclic graphs. With the recent advancements in Large Language Models (LLMs), there is increasing interest in exploring their capabilities in causal reasoning and their potential use in hypothesizing causal graphs. These tasks require LLMs to encode the causal graph effectively for subsequent downstream tasks. In this paper, we propose a comprehensive benchmark, CausalGraph2LLM, encompassing a variety of causal graph settings to assess the causal graph understanding capability of LLMs. We categorize the causal queries into two types: graph-level and node-level queries. We benchmark both open-source and closed models in our study. Our findings reveal that while LLMs show promise in this domain, they are highly sensitive to the encoding used. Even capable models like GPT-4 and Gemini-1.5 exhibit sensitivity to encoding, with deviations of about 60%. We further demonstrate this sensitivity for downstream causal intervention tasks. Moreover, we observe that LLMs can often display biases when presented with contextual information about a causal graph, potentially stemming from their parametric memory.

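Summary: Causality is essential in scientific research because it lets researchers interpret the true relationships between variables. These causal relationships are usually represented by causal graphs, which are directed acyclic graphs. With recent advances in Large Language Models (LLMs), there is growing interest in exploring their causal reasoning capabilities and their potential use in hypothesizing causal graphs. These tasks require LLMs to encode the causal graph effectively for subsequent downstream tasks. In this paper, we propose a comprehensive benchmark, CausalGraph2LLM, which covers a variety of causal graph settings to assess how well LLMs understand causal graphs. We divide the causal queries into two types: graph-level queries and node-level queries. We benchmark both open-source and closed models. Our findings show that although LLMs are promising in this domain, they are highly sensitive to the encoding used. Even strong models such as GPT-4 and Gemini-1.5 are sensitive to encoding, with deviations of about 60%. We further demonstrate this sensitivity on downstream causal intervention tasks. We also observe that LLMs often display biases when given contextual information about a causal graph, which may stem from their parametric memory.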
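The abstract's central observation is that the same causal graph can be serialized into text in different ways and that model answers vary with the encoding. The sketch below shows two plausible text encodings of one small DAG plus the ground truth for a node-level query. The encoding formats, function names, and the toy graph are assumptions made for illustration; they are not the benchmark's actual prompts.

```python
from typing import Dict, List

# A causal DAG stored as parent -> list of direct effects (children).
Graph = Dict[str, List[str]]


def encode_adjacency(graph: Graph) -> str:
    """One line per node listing its direct effects (hypothetical format)."""
    lines = []
    for parent in sorted(graph):
        children = ", ".join(sorted(graph[parent])) or "nothing"
        lines.append(f"{parent} causes: {children}")
    return "\n".join(lines)


def encode_edge_sentences(graph: Graph) -> str:
    """One short sentence per directed edge (hypothetical format)."""
    return " ".join(
        f"{parent} -> {child}."
        for parent in sorted(graph)
        for child in sorted(graph[parent])
    )


def children_of(graph: Graph, node: str) -> List[str]:
    """Ground truth for a node-level query: the direct effects of `node`."""
    return sorted(graph.get(node, []))


# Toy graph used only for illustration.
graph: Graph = {"smoking": ["tar", "cancer"], "tar": ["cancer"], "cancer": []}

print(encode_adjacency(graph))
print(encode_edge_sentences(graph))
print(children_of(graph, "smoking"))
```

Both strings carry identical structural information, so any difference in a model's answer to "what does smoking directly cause?" across the two prompts would be attributable to the encoding rather than to the graph.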

LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation

2410.15828v1 by Tejumade Afonja, Ivaxi Sheth, Ruta Binkyte, Waqar Hanif, Thomas Ulas, Matthias Becker, Mario Fritz

Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, leveraging their learned biological knowledge alone or in combination with traditional statistical methods. We develop a task-based evaluation strategy to address the challenge of unavailable ground truth causal graphs. Specifically, we use the GRNs suggested by LLMs to guide causal synthetic data generation and compare the resulting data against the original dataset. Our statistical and biological assessments show that LLMs can support statistical modeling and data synthesis for biological research.

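Summary: Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, using their learned biological knowledge either alone or in combination with traditional statistical methods. We develop a task-based evaluation strategy to address the challenge of unavailable ground-truth causal graphs. Specifically, we use the GRNs suggested by LLMs to guide causal synthetic data generation and compare the generated data with the original dataset. Our statistical and biological assessments show that LLMs can support statistical modeling and data synthesis for biological research.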

NetSafe: Exploring the Topological Safety of Multi-agent Networks

2410.15686v1 by Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang

Large language models (LLMs) have empowered nodes within multi-agent networks with intelligence, showing growing applications in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored, and previous research on the safety of a single LLM is challenging to transfer. In this paper, we focus on the safety of multi-agent networks from a topological perspective, investigating which topological properties contribute to safer networks. To this end, we propose a general framework, NetSafe, along with an iterative RelCom interaction, to unify existing diverse LLM-based agent frameworks, laying the foundation for generalized topological safety research. We identify several critical phenomena when multi-agent networks are exposed to attacks involving misinformation, bias, and harmful information, termed Agent Hallucination and Aggregation Safety. Furthermore, we find that highly connected networks are more susceptible to the spread of adversarial attacks, with task performance in a Star Graph Topology decreasing by 29.7%. In addition, our proposed static metrics align more closely with real-world dynamic evaluations than traditional graph-theoretic metrics, indicating that networks with greater average distances from attackers exhibit enhanced safety. In conclusion, our work introduces a new topological perspective on the safety of LLM-based multi-agent networks and discovers several unreported phenomena, paving the way for future research to explore the safety of such networks.

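Summary: Large language models (LLMs) have given intelligence to the nodes of multi-agent networks, which are finding growing application in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored, and earlier research on the safety of a single LLM is hard to transfer. In this paper, we examine the safety of multi-agent networks from a topological perspective and study which topological properties make networks safer. To this end, we propose a general framework, NetSafe, together with an iterative RelCom interaction, to unify the various existing LLM-based agent frameworks and lay the foundation for generalized topological safety research. We identify several critical phenomena that arise when multi-agent networks are attacked with misinformation, bias, and harmful information, which we term Agent Hallucination and Aggregation Safety. We also find that highly connected networks are more susceptible to adversarial attacks, with task performance in a Star Graph Topology dropping by 29.7%. Moreover, our proposed static metrics track real-world dynamic evaluations more closely than traditional graph-theoretic metrics, which indicates that networks with a greater average distance from attackers are safer. In conclusion, our work introduces a new topological perspective on the safety of LLM-based multi-agent networks and uncovers several previously unreported phenomena, paving the way for future research on the safety of such networks.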
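The abstract links safety to the average distance between an attacker and the other agents. A minimal sketch of that quantity, assuming an unweighted, undirected communication graph and a single attacker node, is the mean breadth-first-search distance shown below. The helper names and the two toy topologies are illustrative assumptions and should not be read as the paper's own static metrics.

```python
from collections import deque
from typing import Dict, List

Adjacency = Dict[int, List[int]]


def average_distance_from(adj: Adjacency, source: int) -> float:
    """Mean shortest-path length (in hops) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    others = [d for n, d in dist.items() if n != source]
    if not others:
        raise ValueError("source has no reachable neighbors")
    return sum(others) / len(others)


def star(n: int) -> Adjacency:
    """Star topology: node 0 is the hub, nodes 1..n-1 are leaves."""
    adj: Adjacency = {0: list(range(1, n))}
    for leaf in range(1, n):
        adj[leaf] = [0]
    return adj


def chain(n: int) -> Adjacency:
    """Chain (line) topology: 0 - 1 - ... - (n-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


print(average_distance_from(star(5), 0))   # attacker at the hub
print(average_distance_from(star(5), 1))   # attacker at a leaf
print(average_distance_from(chain(5), 0))  # attacker at one end of the chain
```

Under these assumptions, an attacker placed at the hub of a star sits one hop from every agent, while the same attacker at the end of a chain is farther away on average, which matches the direction of the finding quoted in the abstract.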

TAGExplainer: Narrating Graph Explanations for Text-Attributed Graph Learning Models

2410.15268v1 by Bo Pan, Zhen Xiong, Guanchen Wu, Zheng Zhang, Yifei Zhang, Liang Zhao

Representation learning of Text-Attributed Graphs (TAGs) has garnered significant attention due to its applications in various domains, including recommendation systems and social networks. Despite advancements in TAG learning methodologies, challenges remain in explainability due to the black-box nature of existing TAG representation learning models. This paper presents TAGExplainer, the first method designed to generate natural language explanations for TAG learning. TAGExplainer employs a generative language model that maps input-output pairs to explanations reflecting the model's decision-making process. To address the lack of annotated ground truth explanations in real-world scenarios, we propose first generating pseudo-labels that capture the model's decisions from saliency-based explanations. The pseudo-label generator is then iteratively trained via Expert Iteration on three training objectives focusing on faithfulness and brevity, to improve the quality of the generated pseudo-labels. The high-quality pseudo-labels are finally used to train an end-to-end explanation generator model. Extensive experiments demonstrate the effectiveness of TAGExplainer in producing faithful and concise natural language explanations.

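Summary: Representation learning on Text-Attributed Graphs (TAGs) has attracted considerable attention because of its applications in domains such as recommendation systems and social networks. Despite progress in TAG learning methods, explainability remains a challenge because existing TAG representation learning models are black boxes. This paper presents TAGExplainer, the first method designed to generate natural language explanations for TAG learning. TAGExplainer uses a generative language model that maps input-output pairs to explanations reflecting the model's decision-making process. To address the lack of annotated ground-truth explanations in real-world scenarios, we propose first generating pseudo-labels that capture the model's decisions from saliency-based explanations, and then iteratively training the pseudo-label generator through Expert Iteration on three training objectives (focused on faithfulness and brevity) to improve the quality of the generated pseudo-labels. The high-quality pseudo-labels are finally used to train an end-to-end explanation generator model. Extensive experiments demonstrate the effectiveness of TAGExplainer in producing faithful and concise natural language explanations.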

Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction

2410.15165v1 by Yinhan He, Zaiyi Zheng, Patrick Soga, Yaozhen Zhu, Yushun Dong, Jundong Li

In recent years, Graph Neural Networks (GNNs) have become successful in molecular property prediction tasks such as toxicity analysis. However, due to the black-box nature of GNNs, their outputs can be concerning in high-stakes decision-making scenarios, e.g., drug discovery. Facing such an issue, Graph Counterfactual Explanation (GCE) has emerged as a promising approach to improve GNN transparency. However, current GCE methods usually fail to take domain-specific knowledge into consideration, which can result in outputs that are not easily comprehensible by humans. To address this challenge, we propose a novel GCE method, LLM-GCE, to unleash the power of large language models (LLMs) in explaining GNNs for molecular property prediction. Specifically, we utilize an autoencoder to generate the counterfactual graph topology from a set of counterfactual text pairs (CTPs) based on an input graph. Meanwhile, we also incorporate a CTP dynamic feedback module to mitigate LLM hallucination, which provides intermediate feedback derived from the generated counterfactuals as an attempt to give more faithful guidance. Extensive experiments demonstrate the superior performance of LLM-GCE. Our code is released on https://github.com/YinhanHe123/new_LLM4GNNExplanation.

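Summary: In recent years, Graph Neural Networks (GNNs) have been applied successfully to molecular property prediction tasks such as toxicity analysis. However, because GNNs are black boxes, their outputs can be a concern in high-stakes decision-making scenarios such as drug discovery. In response, Graph Counterfactual Explanation (GCE) has emerged as a promising way to improve GNN transparency. Current GCE methods, however, usually fail to take domain-specific knowledge into account, which can make their outputs hard for humans to understand. To address this challenge, we propose a novel GCE method, LLM-GCE, which unleashes the power of large language models (LLMs) for explaining GNNs in molecular property prediction. Specifically, we use an autoencoder to generate the counterfactual graph topology from a set of counterfactual text pairs (CTPs) based on an input graph. We also add a CTP dynamic feedback module to mitigate LLM hallucination; it provides intermediate feedback derived from the generated counterfactuals in an attempt to give more faithful guidance. Extensive experiments demonstrate the superior performance of LLM-GCE. Our code is released at https://github.com/YinhanHe123/new_LLM4GNNExplanation.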

MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science

2410.15126v1 by Junho Kim, Yeachan Kim, Jun-Hyung Park, Yerim Oh, Suho Kim, SangKeun Lee

We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently adapt pre-trained language models (PLMs) for materials science. Unlike previous adaptation strategies that focus solely on constructing a domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that the materials science corpus has distinct characteristics from those of other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation supports the strength of MELT, demonstrating superior performance compared to existing continued pre-training methods. The in-depth analysis also shows that MELT enables PLMs to represent materials entities more effectively than existing adaptation methods, thereby highlighting its broad applicability across a wide spectrum of materials science.

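Summary: We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), designed specifically to adapt pre-trained language models (PLMs) to materials science efficiently. Unlike earlier adaptation strategies that focus only on building a domain-specific corpus, MELT considers both the corpus and the training strategy, because the materials science corpus has characteristics that differ from those of other domains. To this end, we first build a comprehensive materials knowledge base from the scientific corpus by constructing semantic graphs. Using the extracted knowledge, we integrate a curriculum into the adaptation process that starts with familiar, general concepts and gradually moves toward more specialized terms. We run extensive experiments on diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation supports the strengths of MELT, which shows superior performance compared with existing continued pre-training methods. An in-depth analysis also shows that MELT lets PLMs represent materials entities more effectively than existing adaptation methods, highlighting its broad applicability across a wide range of materials science.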

Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models

2410.15116v1 by Qitan Lv, Jie Wang, Hanzhu Chen, Bin Li, Yongdong Zhang, Feng Wu

Generation of plausible but incorrect factual information, often termed hallucination, has attracted significant research interest. Retrieval-augmented language models (RALMs), which enhance models with up-to-date knowledge, have emerged as a promising method to reduce hallucination. However, existing RALMs may instead exacerbate hallucination when retrieving lengthy contexts. To address this challenge, we propose COFT, a novel COarse-to-Fine highlighTing method to focus on different granularity-level key texts, thereby avoiding getting lost in lengthy contexts. Specifically, COFT consists of three components: recaller, scorer, and selector. First, recaller applies a knowledge graph to extract potential key entities in a given context. Second, scorer measures the importance of each entity by calculating its contextual weight. Finally, selector selects high contextual weight entities with a dynamic threshold algorithm and highlights the corresponding paragraphs, sentences, or words in a coarse-to-fine manner. Extensive experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, leading to superior performance by over 30% in the F1 score metric. Moreover, COFT also exhibits remarkable versatility across various long-form tasks, such as reading comprehension and question answering.

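Summary: The generation of plausible but incorrect factual information, usually called hallucination, has attracted significant research interest. Retrieval-augmented language models (RALMs), which enhance models by supplying up-to-date knowledge, are a promising way to reduce hallucination. However, existing RALMs may make hallucination worse when they retrieve lengthy contexts. To address this challenge, we propose COFT, a novel COarse-to-Fine highlighTing method that focuses on key texts at different levels of granularity and so avoids getting lost in lengthy contexts. Specifically, COFT consists of three components: recaller, scorer, and selector. First, recaller applies a knowledge graph to extract potential key entities in a given context. Second, scorer measures the importance of each entity by calculating its contextual weight. Finally, selector uses a dynamic threshold algorithm to select entities with high contextual weight and highlights the corresponding paragraphs, sentences, or words in a coarse-to-fine manner. Extensive experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, with superior performance by over 30% in the F1 score metric. COFT also shows remarkable versatility across various long-form tasks, such as reading comprehension and question answering.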

2410.15064v1 by George Hannah, Rita T. Sousa, Ioannis Dasoulas, Claudia d'Amato

With the recent surge in popularity of Large Language Models (LLMs), there is a rising risk of users blindly trusting the information in the response, even in cases where the LLM recommends actions that have potential legal implications, which may put the user in danger. We provide an empirical analysis of multiple existing LLMs showing the urgency of the problem. Hence, we propose a short-term solution consisting of an approach for isolating these legal issues through prompt re-engineering. We further analyse the outcomes as well as the limitations of the prompt engineering based approach, and we highlight the need for additional resources to fully solve the problem. We also propose a framework powered by a legal knowledge graph (KG) to generate legal citations for these legal issues, enriching the response of the LLM.

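Summary: With the recent surge in popularity of Large Language Models (LLMs), there is a rising risk that users will blindly trust the information in a response, even when the LLM recommends actions that may have legal implications, which can put the user in danger. We provide an empirical analysis of multiple existing LLMs that shows the urgency of this problem. We therefore propose a short-term solution: an approach that isolates these legal issues through prompt re-engineering. We further analyse both the outcomes and the limitations of the prompt engineering approach, and we stress that fully solving the problem requires additional resources. We also propose a framework driven by a legal knowledge graph (KG) that generates legal citations for these legal issues, enriching the LLM's response.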

LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model

2410.14961v1 by Tianqianjin Lin, Pengwei Yan, Kaisong Song, Zhuoren Jiang, Yangyang Kang, Jun Lin, Weikang Yuan, Junjie Cao, Changlong Sun, Xiaozhong Liu

Graph foundation models (GFMs) have recently gained significant attention. However, the unique data processing and evaluation setups employed by different studies hinder a deeper understanding of their progress. Additionally, current research tends to focus on specific subsets of graph learning tasks, such as structural tasks, node-level tasks, or classification tasks. As a result, such models often incorporate specialized modules tailored to particular task types, losing their applicability to other graph learning tasks and contradicting the original intent of foundation models to be universal. Therefore, to enhance consistency, coverage, and diversity across domains, tasks, and research interests within the graph learning community in the evaluation of GFMs, we propose GFMBench, a systematic and comprehensive benchmark comprising 26 datasets. Moreover, we introduce LangGFM, a novel GFM that relies entirely on large language models. By revisiting and exploring effective graph textualization principles, as well as repurposing successful techniques from graph augmentation and graph self-supervised learning within the language space, LangGFM achieves performance on par with or exceeding the state of the art across GFMBench, which can offer new perspectives, experiences, and baselines to drive forward the evolution of GFMs.

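Summary: Graph foundation models (GFMs) have recently gained significant attention. However, the distinct data processing and evaluation setups used by different studies hinder a deeper understanding of their progress. In addition, current research tends to focus on specific subsets of graph learning tasks, such as structural tasks, node-level tasks, or classification tasks. As a result, these models often include specialized modules tailored to particular task types, which costs them their applicability to other graph learning tasks and contradicts the original intent that foundation models be universal. Therefore, to improve consistency, coverage, and diversity across domains, tasks, and research interests when the graph learning community evaluates GFMs, we propose GFMBench, a systematic and comprehensive benchmark comprising 26 datasets. We also introduce LangGFM, a novel GFM that relies entirely on large language models. By revisiting and exploring effective graph textualization principles, and by repurposing successful techniques from graph augmentation and graph self-supervised learning within the language space, LangGFM achieves performance on par with or better than the state of the art on GFMBench, which can offer new perspectives, experiences, and baselines to drive the evolution of GFMs.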

TransBox: EL++-closed Ontology Embedding

2410.14571v1 by Hui Yang, Jiaoyan Chen, Uli Sattler

OWL (Web Ontology Language) ontologies, which are able to represent both relational and type facts as standard knowledge graphs and complex domain knowledge in Description Logic (DL) axioms, are widely adopted in domains such as healthcare and bioinformatics. Inspired by the success of knowledge graph embeddings, embedding OWL ontologies has gained significant attention in recent years. Current methods primarily focus on learning embeddings for atomic concepts and roles, enabling the evaluation based on normalized axioms through specially designed score functions. However, they often neglect the embedding of complex concepts, making it difficult to infer with more intricate axioms. This limitation reduces their effectiveness in advanced reasoning tasks, such as Ontology Learning and ontology-mediated Query Answering. In this paper, we propose EL++-closed ontology embeddings which are able to represent any logical expressions in DL via composition. Furthermore, we develop TransBox, an effective EL++-closed ontology embedding method that can handle many-to-one, one-to-many and many-to-many relations. Our extensive experiments demonstrate that TransBox often achieves state-of-the-art performance across various real-world datasets for predicting complex axioms.

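Summary: OWL (Web Ontology Language) ontologies, which can represent relational and type facts as standard knowledge graphs and complex domain knowledge as Description Logic (DL) axioms, are widely adopted in domains such as healthcare and bioinformatics. Inspired by the success of knowledge graph embeddings, embedding OWL ontologies has attracted considerable attention in recent years. Current methods focus mainly on learning embeddings for atomic concepts and roles, supporting evaluation based on normalized axioms through specially designed score functions. However, they often neglect the embedding of complex concepts, which makes inference with more intricate axioms difficult. This limitation reduces their effectiveness in advanced reasoning tasks such as Ontology Learning and ontology-mediated Query Answering. In this paper, we propose EL++-closed ontology embeddings, which can represent any logical expression in DL through composition. We also develop TransBox, an effective EL++-closed ontology embedding method that can handle many-to-one, one-to-many, and many-to-many relations. Our extensive experiments show that TransBox often achieves state-of-the-art performance on various real-world datasets for predicting complex axioms.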

Enabling Scalable Evaluation of Bias Patterns in Medical LLMs

2410.14763v1 by Hamed Fayyaz, Raphael Poulain, Rahmatollah Beheshti

Large language models (LLMs) have shown impressive potential in helping with numerous medical challenges. Deploying LLMs in high-stakes applications such as medicine, however, brings in many concerns. One major area of concern relates to biased behaviors of LLMs in medical applications, leading to unfair treatment of individuals. To pave the way for the responsible and impactful deployment of Med LLMs, rigorous evaluation is a key prerequisite. Due to the huge complexity and variability of different medical scenarios, existing work in this domain has primarily relied on using manually crafted datasets for bias evaluation. In this study, we present a new method to scale up such bias evaluations by automatically generating test cases based on rigorous medical evidence. We specifically target the challenges of a) domain-specificity of bias characterization, b) hallucinating while generating the test cases, and c) various dependencies between the health outcomes and sensitive attributes. To that end, we offer new methods to address these challenges integrated with our generative pipeline, using medical knowledge graphs, medical ontologies, and customized general LLM evaluation frameworks in our method. Through a series of extensive experiments, we show that the test cases generated by our proposed method can effectively reveal bias patterns in Med LLMs at larger and more flexible scales than human-crafted datasets. We publish a large bias evaluation dataset using our pipeline, which is dedicated to a few medical case studies. A live demo of our application for vignette generation is available at https://vignette.streamlit.app. Our code is also available at https://github.com/healthylaife/autofair.

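Summary: Large language models (LLMs) have shown impressive potential in helping with many medical challenges. Deploying LLMs in high-stakes applications such as medicine, however, raises many concerns. One major area of concern is the biased behavior of LLMs in medical applications, which leads to unfair treatment of individuals. To pave the way for responsible and impactful deployment of Med LLMs, rigorous evaluation is a key prerequisite. Because different medical scenarios are highly complex and variable, existing work in this area has relied mainly on manually crafted datasets for bias evaluation. In this study, we present a new method that scales up such bias evaluations by automatically generating test cases based on rigorous medical evidence. We specifically target the challenges of a) the domain-specificity of bias characterization, b) hallucination while generating test cases, and c) the various dependencies between health outcomes and sensitive attributes. To that end, we offer new methods that address these challenges and integrate them with our generative pipeline, using medical knowledge graphs, medical ontologies, and customized general LLM evaluation frameworks. Through a series of extensive experiments, we show that the test cases generated by our method can effectively reveal bias patterns in Med LLMs at larger and more flexible scales than human-crafted datasets. We publish a large bias evaluation dataset built with our pipeline, dedicated to a few medical case studies. A live demo of our vignette generation application is available at https://vignette.streamlit.app. Our code is also available at https://github.com/healthylaife/autofair.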

Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning

2410.14211v2 by Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Wenjie Zhang

Large Language Models (LLMs) have achieved impressive results in various tasks but struggle with hallucination problems and lack of relevant knowledge, especially in deep complex reasoning and knowledge-intensive tasks. Knowledge Graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges like handling multi-hop reasoning, multi-entity questions, and effectively utilizing graph structures. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration, which combines the inherent knowledge of LLMs with factual knowledge from KGs. In order to improve the efficiency, PoG prunes irrelevant information from the graph exploration first and introduces efficient three-step pruning techniques that incorporate graph structures, LLM prompting, and a pre-trained language model (e.g., SBERT) to effectively narrow down the explored candidate paths. This ensures all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG innovatively utilizes graph structure to prune the irrelevant noise and represents the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets demonstrate PoG outperforms the state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.

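Summary: Large language models (LLMs) have achieved impressive results in various tasks but still suffer from hallucination and a lack of relevant knowledge, especially in deep, complex reasoning and knowledge-intensive tasks. Knowledge graphs (KGs), which capture vast numbers of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges such as handling multi-hop reasoning, handling multi-entity questions, and making effective use of graph structure. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration that combines the inherent knowledge of LLMs with factual knowledge from KGs. To improve efficiency, PoG first prunes irrelevant information from the graph exploration and then applies a three-step pruning technique that combines graph structure, LLM prompting, and a pre-trained language model (e.g., SBERT) to narrow down the explored candidate paths. This ensures that all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG uses graph structure to prune irrelevant noise and is the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets show that PoG outperforms the state-of-the-art method ToG with both GPT-3.5-Turbo and GPT-4, with an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.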

UniMTS: Unified Pre-training for Motion Time Series

2410.19818v1 by Xiyuan Zhang, Diyan Teng, Ranak Roy Chowdhury, Shuheng Li, Dezhi Hong, Rajesh K. Gupta, Jingbo Shang

Motion time series collected from mobile and wearable devices such as smartphones and smartwatches offer significant insights into human behavioral patterns, with wide applications in healthcare, automation, IoT, and AR/XR due to their low-power, always-on nature. However, given security and privacy concerns, building large-scale motion time series datasets remains difficult, preventing the development of pre-trained models for human activity analysis. Typically, existing models are trained and tested on the same dataset, leading to poor generalizability across variations in device location, device mounting orientation and human activity type. In this paper, we introduce UniMTS, the first unified pre-training procedure for motion time series that generalizes across diverse device latent factors and activities. Specifically, we employ a contrastive learning framework that aligns motion time series with text descriptions enriched by large language models. This helps the model learn the semantics of time series to generalize across activities. Given the absence of large-scale motion time series data, we derive and synthesize time series from existing motion skeleton data with all-joint coverage. Spatio-temporal graph networks are utilized to capture the relationships across joints for generalization across different device locations. We further design rotation-invariant augmentation to make the model agnostic to changes in device mounting orientations. Our model shows exceptional generalizability across 18 motion time series classification benchmark datasets, outperforming the best baselines by 340% in the zero-shot setting, 16.3% in the few-shot setting, and 9.2% in the full-shot setting.

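Summary: Motion time series collected from mobile and wearable devices such as smartphones and smartwatches offer significant insights into human behavioral patterns and, thanks to their low-power, always-on nature, have wide applications in healthcare, automation, IoT, and AR/XR. However, given security and privacy concerns, building large-scale motion time series datasets remains difficult, which hinders the development of pre-trained models for human activity analysis. Existing models are typically trained and tested on the same dataset, so they generalize poorly across variations in device location, device mounting orientation, and human activity type. In this paper, we introduce UniMTS, the first unified pre-training procedure for motion time series that generalizes across diverse device latent factors and activities. Specifically, we employ a contrastive learning framework that aligns motion time series with text descriptions enriched by large language models. This helps the model learn the semantics of time series so that it generalizes across activities. Given the absence of large-scale motion time series data, we derive and synthesize time series from existing motion skeleton data with all-joint coverage. Spatio-temporal graph networks capture the relationships across joints for generalization across device locations. We further design a rotation-invariant augmentation that makes the model agnostic to changes in device mounting orientation. Our model shows exceptional generalizability across 18 motion time series classification benchmark datasets, outperforming the best baselines by 340% in the zero-shot setting, 16.3% in the few-shot setting, and 9.2% in the full-shot setting.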
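The rotation-invariant augmentation mentioned in the UniMTS abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 3-axis accelerometer series stored as a list of `(x, y, z)` tuples, and the helper names `random_rotation_matrix` and `rotate_series` are hypothetical. It applies one random 3D rotation to every sample, which imitates a change in device mounting orientation while leaving each sample's magnitude unchanged.

```python
import math
import random


def random_rotation_matrix(rng):
    """Rotation by a random angle about a random unit axis (Rodrigues' formula)."""
    # Random axis: normalise a Gaussian vector to unit length.
    x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def rotate_series(series, rot):
    """Apply the same rotation to every (x, y, z) sample in the series."""
    return [
        tuple(sum(rot[i][j] * v[j] for j in range(3)) for i in range(3))
        for v in series
    ]


rng = random.Random(0)
# Hypothetical accelerometer window: gravity on the z axis plus small noise.
series = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1), 9.8 + rng.gauss(0, 0.1)) for _ in range(50)]
augmented = rotate_series(series, random_rotation_matrix(rng))
```

Training on many such rotated copies of a window is one common way to encourage orientation-agnostic features; the paper may use a different construction.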

Supervised Chain of Thought

2410.14198v1 by Xiang Zhang, Dujian Ding

Large Language Models (LLMs) have revolutionized natural language processing and hold immense potential for advancing Artificial Intelligence. However, the core architecture of most mainstream LLMs -- the Transformer -- has inherent limitations in computational depth, rendering them theoretically incapable of solving many reasoning tasks that demand increasingly deep computations. Chain of Thought (CoT) prompting has emerged as a technique to address these architectural limitations, as evidenced by several theoretical studies. It offers a promising approach to solving complex reasoning tasks that were previously beyond the capabilities of these models. Despite its successes, CoT and its variants (such as Tree of Thought, Graph of Thought, etc.) rely on a "one-prompt-for-all" approach, using a single prompt structure (e.g., "think step by step") for a wide range of tasks -- from counting and sorting to solving mathematical and algorithmic problems. This approach poses significant challenges for models to generate the correct reasoning steps, as the model must navigate through a vast prompt template space to find the appropriate template for each task. In this work, we build upon previous theoretical analyses of CoT to demonstrate how the one-prompt-for-all approach can negatively affect the computability of LLMs. We partition the solution search space into two: the prompt space and the answer space. Our findings show that task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance. Through experiments with state-of-the-art LLMs, we reveal a gap in reasoning performance when supervision is applied versus when it is not.

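Summary: Large language models (LLMs) have revolutionized natural language processing and hold immense potential for advancing artificial intelligence. However, the core architecture of most mainstream LLMs, the Transformer, has inherent limitations in computational depth, making it theoretically incapable of solving many reasoning tasks that demand increasingly deep computation. Chain of Thought (CoT) prompting has emerged as a technique for addressing these architectural limitations, as several theoretical studies confirm. It offers a promising approach to complex reasoning tasks that were previously beyond the capabilities of these models. Despite its successes, CoT and its variants (such as Tree of Thought and Graph of Thought) rely on a "one-prompt-for-all" approach, using a single prompt structure (e.g., "think step by step") for a wide range of tasks, from counting and sorting to solving mathematical and algorithmic problems. This makes it hard for models to generate the correct reasoning steps, because the model must navigate a vast prompt template space to find the appropriate template for each task. In this work, we build on previous theoretical analyses of CoT to show how the one-prompt-for-all approach can negatively affect the computability of LLMs. We partition the solution search space into two parts: the prompt space and the answer space. Our findings show that task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance. Through experiments with state-of-the-art LLMs, we reveal a gap in reasoning performance between settings with and without supervision.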

Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs

2410.14057v1 by Simone Conia, Daniel Lee, Min Li, Umar Farooq Minhas, Saloni Potdar, Yunyao Li

Translating text that contains entity names is a challenging task, as cultural-related references can vary significantly across languages. These variations may also be caused by transcreation, an adaptation process that entails more than transliteration and word-for-word translation. In this paper, we address the problem of cross-cultural translation on two fronts: (i) we introduce XC-Translate, the first large-scale, manually-created benchmark for machine translation that focuses on text that contains potentially culturally-nuanced entity names, and (ii) we propose KG-MT, a novel end-to-end method to integrate information from a multilingual knowledge graph into a neural machine translation model by leveraging a dense retrieval mechanism. Our experiments and analyses show that current machine translation systems and large language models still struggle to translate texts containing entity names, whereas KG-MT outperforms state-of-the-art approaches by a large margin, obtaining a 129% and 62% relative improvement compared to NLLB-200 and GPT-4, respectively.

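Summary: Translating text that contains entity names is a challenging task, because culture-related references can vary significantly across languages. These variations may also be caused by transcreation, an adaptation process that involves more than transliteration and word-for-word translation. In this paper, we address the problem of cross-cultural translation on two fronts: (i) we introduce XC-Translate, the first large-scale, manually created machine translation benchmark that focuses on text containing potentially culturally nuanced entity names, and (ii) we propose KG-MT, a novel end-to-end method that integrates information from a multilingual knowledge graph into a neural machine translation model through a dense retrieval mechanism. Our experiments and analyses show that current machine translation systems and large language models still struggle to translate text containing entity names, whereas KG-MT outperforms state-of-the-art approaches by a large margin, obtaining relative improvements of 129% and 62% over NLLB-200 and GPT-4, respectively.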

RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs

2410.13987v1 by Jiatan Huang, Mingchen Li, Zonghai Yao, Zhichao Yang, Yongkang Xiao, Feiyun Ouyang, Xiaohan Li, Shuo Han, Hong Yu

Answering complex real-world questions often requires accurate retrieval from textual knowledge graphs (TKGs). The scarcity of annotated data, along with intricate topological structures, makes this task particularly challenging. As the nature of relational path information could enhance the inference ability of Large Language Models (LLMs), efficiently retrieving more complex relational path information from TKGs presents another key challenge. To tackle these challenges, we first develop a Dataset for LLMs Complex Reasoning over Textual Knowledge Graphs (RiTeK) with a broad topological structure coverage. We synthesize realistic user queries that integrate diverse topological structures, relational information, and complex textual descriptions. We conduct rigorous expert evaluation to validate the quality of our synthesized queries. We then introduce an enhanced Monte Carlo Tree Search (MCTS) method, Relational MCTS, to automatically extract relational path information from textual graphs for specific queries. Our dataset mainly covers the medical domain, as the relation types and entities are complex and publicly available. Experimental results indicate that RiTeK poses significant challenges for current retrieval and LLM systems, while the proposed Relational MCTS method enhances LLM inference ability and achieves state-of-the-art performance on RiTeK.

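Summary: Answering complex real-world questions often requires accurate retrieval from textual knowledge graphs (TKGs). The scarcity of annotated data, together with intricate topological structures, makes this task particularly challenging. Because relational path information can enhance the inference ability of large language models (LLMs), efficiently retrieving more complex relational path information from TKGs is another key challenge. To tackle these challenges, we first develop RiTeK, a dataset for LLM complex reasoning over textual knowledge graphs with broad coverage of topological structures. We synthesize realistic user queries that integrate diverse topological structures, relational information, and complex textual descriptions. We conduct rigorous expert evaluation to validate the quality of the synthesized queries. We then introduce an enhanced Monte Carlo Tree Search (MCTS) method, Relational MCTS, which automatically extracts relational path information from textual graphs for specific queries. Our dataset mainly covers the medical domain, because its relation types and entities are complex and publicly available. Experimental results indicate that RiTeK poses significant challenges for current retrieval and LLM systems, while the proposed Relational MCTS method enhances LLM inference ability and achieves state-of-the-art performance on RiTeK.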

The Mystery of the Pathological Path-star Task for Language Models

2410.13779v1 by Arvid Frydenlund

The recently introduced path-star task is a minimal task designed to exemplify limitations to the abilities of language models (Bachmann and Nagarajan, 2024). It involves a path-star graph where multiple arms radiate from a single starting node and each node is unique. Given the start node and a specified target node that ends an arm, the task is to generate the arm containing that target node. This is straightforward for a human but surprisingly difficult for language models, which did not outperform the random baseline. The authors hypothesized this is due to a deficiency in teacher-forcing and the next-token prediction paradigm. We demonstrate the task is learnable using teacher-forcing in alternative settings and that the issue is partially due to representation. We introduce a regularization method using structured samples of the same graph but with differing target nodes, improving results across a variety of model types. We provide RASP proofs showing the task is theoretically solvable. Finally, we find settings where an encoder-only model can consistently solve the task.

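Summary: The recently introduced path-star task is a minimal task designed to illustrate limits on the abilities of language models (Bachmann and Nagarajan, 2024). It involves a path-star graph in which multiple arms radiate from a single start node and every node is unique. Given the start node and a specified target node that ends an arm, the task is to generate the arm containing that target node. This is straightforward for a human but surprisingly difficult for language models, which did not outperform the random baseline. The authors hypothesized that this is due to a deficiency in teacher-forcing and the next-token prediction paradigm. We demonstrate that the task is learnable using teacher-forcing in alternative settings and that the issue is partly one of representation. We introduce a regularization method that uses structured samples of the same graph with differing target nodes, improving results across a variety of model types. We provide RASP proofs showing that the task is theoretically solvable. Finally, we find settings in which an encoder-only model can consistently solve the task.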
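The path-star task described above is simple enough to state in a few lines of code. The sketch below is only an illustration of the task definition in the abstract, not the benchmark's actual data format: the helper names `make_path_star` and `solve`, the integer node labels, and the shuffled edge-list encoding are all assumptions.

```python
import random


def make_path_star(num_arms, arm_length, seed=0):
    """Build a path-star graph: arms of unique nodes radiating from one start node."""
    rng = random.Random(seed)
    # Draw distinct labels so that every node in the graph is unique.
    labels = rng.sample(range(1000), 1 + num_arms * arm_length)
    start, rest = labels[0], labels[1:]
    arms = [rest[i * arm_length:(i + 1) * arm_length] for i in range(num_arms)]
    # Each arm is a chain start -> a1 -> a2 -> ...; present all edges in shuffled order.
    edges = []
    for arm in arms:
        path = [start] + arm
        edges.extend(zip(path, path[1:]))
    rng.shuffle(edges)
    return start, arms, edges


def solve(start, target, edges):
    """Recover the arm ending at `target` by walking parent links back to the start."""
    parent = {child: par for par, child in edges}
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


start, arms, edges = make_path_star(num_arms=3, arm_length=4)
target = arms[1][-1]            # the target is the last node of one arm
print(solve(start, target, edges))
```

Walking backwards from the target is trivial because every node has exactly one parent, whereas a left-to-right generator has to choose the correct arm at the very first step after the start node.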

Medical

Publish Date Title Authors Homepage Code
2024-11-12 Scaling Properties of Diffusion Models for Perceptual Tasks Rahul Ravishankar et.al. 2411.08034v1 null
2024-11-12 Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech Eleonora Mancini et.al. 2411.08013v1 null
2024-11-12 DuoLift-GAN: Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks Zhaoxi Zhang et.al. 2411.07941v1 null
2024-11-12 Automatic dataset shift identification to support root cause analysis of AI performance drift Mélanie Roschewitz et.al. 2411.07940v1 null
2024-11-12 INTRABENCH: Interactive Radiological Benchmark Constantin Ulrich et.al. 2411.07885v1 null
2024-11-12 Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease Francesco Chiumento et.al. 2411.07871v1 null
2024-11-12 PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring M. Jaleed Khan et.al. 2411.07796v1 link
2024-11-12 Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation Shuai Niu et.al. 2411.07611v1 null
2024-11-12 Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection YeongHyeon Park et.al. 2411.07546v1 null
2024-11-11 Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews Aakash Sorathiya et.al. 2411.07398v1 null
2024-11-11 Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery Mohamed Abul Hassan et.al. 2411.07395v1 null
2024-11-11 Data-Driven Analysis of AI in Medical Device Software in China: Deep Learning and General AI Trends Based on Regulatory Data Yu Han et.al. 2411.07378v1 null
2024-11-11 A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19 Vedant Khandelwal et.al. 2411.07163v1 null
2024-11-11 Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models Chanseo Lee et.al. 2411.06713v1 null
2024-11-10 In-Context Learning for Preserving Patient Privacy: A Framework for Synthesizing Realistic Patient Portal Messages Joseph Gatto et.al. 2411.06549v1 link
2024-11-09 NeuReg: Domain-invariant 3D Image Registration on Human and Mouse Brains Taha Razzaq et.al. 2411.06315v1 null
2024-11-09 GuidelineGuard: An Agentic Framework for Medical Note Evaluation with Guideline Adherence MD Ragib Shahriyear et.al. 2411.06264v1 null
2024-11-09 Deep Reinforcement Learning for Digital Twin-Oriented Complex Networked Systems Jiaqi Wen et.al. 2411.06148v1 null
2024-11-09 Evaluating the Propensity of Generative AI for Producing Disinformation During an Election Cycle Erik J Schlicht et.al. 2411.06120v1 null
2024-11-09 Personalize to generalize: Towards a universal medical multi-modality generalization through personalization Zhaorui Tan et.al. 2411.06106v1 null
2024-11-08 Assessing Foundational Medical 'Segment Anything' (Med-SAM1, Med-SAM2) Deep Learning Models for Left Atrial Segmentation in 3D LGE MRI Mehri Mehrnia et.al. 2411.05963v1 null
2024-11-08 GazeSearch: Radiology Findings Search Benchmark Trong Thang Pham et.al. 2411.05780v1 null
2024-11-08 Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators Nicholas Wan et.al. 2411.05897v1 null
2024-11-08 Identifying and Decomposing Compound Ingredients in Meal Plans Using Large Language Models Leon Kopitar et.al. 2411.05892v1 null
2024-11-08 SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark Sithursan Sivasubramaniam et.al. 2411.05521v1 null
2024-11-08 Towards Scalable Foundation Models for Digital Dermatology Fabian Gröger et.al. 2411.05514v1 link
2024-11-08 Towards Equitable ASD Diagnostics: A Comparative Study of Machine and Deep Learning Models Using Behavioral and Facial Data Mohammed Aledhari et.al. 2411.05880v1 null
2024-11-07 Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations Joey Hong et.al. 2411.05194v1 null
2024-11-07 Inverse Transition Learning: Learning Dynamics from Demonstrations Leo Benac et.al. 2411.05174v1 null
2024-11-07 PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation Daniel C. Castro et.al. 2411.05085v1 null
2024-11-07 Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability Yanjun Gao et.al. 2411.04962v1 null
2024-11-07 FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs? Eric Wu et.al. 2411.05059v2 link
2024-11-07 Integrating Large Language Models for Genetic Variant Classification Youssef Boulaimen et.al. 2411.05055v1 null
2024-11-07 AWARE Narrator and the Utilization of Large Language Models to Extract Behavioral Insights from Smartphone Sensing Data Tianyi Zhang et.al. 2411.04691v1 null
2024-11-07 FedDP: Privacy-preserving method based on federated learning for histopathology image segmentation Liangrui Pan et.al. 2411.04509v1 null
2024-11-07 Conditional Diffusion Model for Longitudinal Medical Image Generation Duy-Phuong Dao et.al. 2411.05860v1 null
2024-11-07 Evaluating the Economic Implications of Using Machine Learning in Clinical Psychiatry Soaad Hossain et.al. 2411.05856v1 null
2024-11-06 Robust Real-Time Mortality Prediction in the Intensive Care Unit using Temporal Difference Learning Thomas Frost et.al. 2411.04285v1 link
2024-11-06 Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? Daniel P. Jeong et.al. 2411.04118v1 null
2024-11-06 RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models Maya Varma et.al. 2411.04097v1 link
2024-11-06 Aligning Characteristic Descriptors with Images for Human-Expert-like Explainability Bharat Chandra Yalavarthi et.al. 2411.04008v1 null
2024-11-06 Fine-tuning -- a Transfer Learning approach Joseph Arul Raj et.al. 2411.03941v1 null
2024-11-06 MEG: Medical Knowledge-Augmented Large Language Models for Question Answering Laura Cabello et.al. 2411.03883v2 link
2024-11-06 Navigating the landscape of multimodal AI in medicine: a scoping review on technical challenges and clinical applications Daan Schouten et.al. 2411.03782v1 null
2024-11-06 Sub-DM: Subspace Diffusion Model with Orthogonal Decomposition for MRI Reconstruction Yu Guan et.al. 2411.03758v1 null
2024-11-06 Ultrasound-Based AI for COVID-19 Detection: A Comprehensive Review of Public and Private Lung Ultrasound Datasets and Studies Abrar Morshed et.al. 2411.05029v1 null
2024-11-06 Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? Pedro R. A. S. Bassi et.al. 2411.03670v1 link
2024-11-06 Requirements Engineering for Older Adult Digital Health Software: A Systematic Literature Review Yuqing Xiao et.al. 2411.03656v1 null
2024-11-06 Cross Feature Fusion of Fundus Image and Generated Lesion Map for Referable Diabetic Retinopathy Classification Dahyun Mok et.al. 2411.03618v1 null
2024-11-05 The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for Healthcare Souren Pashangpour et.al. 2411.03287v1 null
2024-11-05 Discovering Data Structures: Nearest Neighbor Search and Beyond Omar Salemohamed et.al. 2411.03253v1 null
2024-11-05 Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care Christel Sirocchi et.al. 2411.03105v1 link
2024-11-05 Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting Adrian B. Chłopowiec et.al. 2411.03098v1 null
2024-11-05 Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status Samuel Lee et.al. 2411.03004v1 null
2024-11-05 Region-Guided Attack on the Segment Anything Model (SAM) Xiaoliang Liu et.al. 2411.02974v2 null
2024-11-05 [Vision Paper] PRObot: Enhancing Patient-Reported Outcome Measures for Diabetic Retinopathy using Chatbots and Generative AI Maren Pielka et.al. 2411.02973v1 null
2024-11-05 Leveraging Transfer Learning and Multiple Instance Learning for HER2 Automatic Scoring of H&E Whole Slide Images Rawan S. Abdulsadig et.al. 2411.05028v1 null
2024-11-05 Membership Inference Attacks against Large Vision-Language Models Zhan Li et.al. 2411.02902v1 link
2024-11-04 Advanced XR-Based 6-DOF Catheter Tracking System for Immersive Cardiac Intervention Training Mohsen Annabestani et.al. 2411.02611v1 null
2024-11-04 "It's a conversation, not a quiz": A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health Jiawei Zhou et.al. 2411.02594v1 null
2024-11-04 Digitizing Touch with an Artificial Multimodal Fingertip Mike Lambeta et.al. 2411.02479v1 link
2024-11-04 Simulation of Nanorobots with Artificial Intelligence and Reinforcement Learning for Advanced Cancer Cell Detection and Tracking Shahab Kavousinejad et.al. 2411.02345v1 link
2024-11-04 Taking AI Welfare Seriously Robert Long et.al. 2411.00986v1 null
2024-11-04 Federated GNNs for EEG-Based Stroke Assessment Andrea Protani et.al. 2411.02286v1 null
2024-11-04 Weakly supervised deep learning model with size constraint for prostate cancer detection in multiparametric MRI and generalization to unseen domains Robin Trombetta et.al. 2411.02466v1 null
2024-11-04 Evaluating the quality of published medical research with ChatGPT Mike Thelwall et.al. 2411.01952v1 null
2024-11-04 You are out of context! Giancarlo Cobino et.al. 2411.02464v1 null
2024-11-03 Diagnosing Medical Datasets with Training Dynamics Laura Wenderoth et.al. 2411.01653v1 link
2024-11-03 Optical Flow Representation Alignment Mamba Diffusion Model for Medical Video Generation Zhenbin Wang et.al. 2411.01647v1 null
2024-11-03 Customized Subgraph Selection and Encoding for Drug-drug Interaction Prediction Haotong Du et.al. 2411.01535v1 null
2024-11-03 Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design Onur Boyar et.al. 2411.01423v1 null
2024-11-02 Medical X-Ray Image Enhancement Using Global Contrast-Limited Adaptive Histogram Equalization Sohrab Namazi Nia et.al. 2411.01373v1 null
2024-11-02 Guided Synthesis of Labeled Brain MRI Data Using Latent Diffusion Models for Segmentation of Enlarged Ventricles Tim Ruschke et.al. 2411.01351v1 null
2024-11-02 Causal reasoning in difference graphs Charles K. Assaad et.al. 2411.01292v1 null
2024-11-02 Designing a Robust Radiology Report Generation System Sonit Singh et.al. 2411.01153v1 null
2024-11-02 LEARNER: Learning Granular Labels from Coarse Labels using Contrastive Learning Gautam Gare et.al. 2411.01144v1 null
2024-11-02 Artificial Intelligence for Microbiology and Microbiome Research Xu-Wen Wang et.al. 2411.01098v1 null
2024-11-01 Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities Adriel Saporta et.al. 2411.01053v1 link
2024-11-01 Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract Fan Xiao et.al. 2411.00726v1 null
2024-11-01 CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis Fuying Wang et.al. 2411.00696v1 null
2024-11-01 Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering Mehdi Hosseini Chagahi et.al. 2411.00916v2 null
2024-11-01 Deep learning-based auto-contouring of organs/structures-at-risk for pediatric upper abdominal radiotherapy Mianyong Ding et.al. 2411.00594v1 link
2024-11-01 Enhancing the Traditional Chinese Medicine Capabilities of Large Language Model through Reinforcement Learning from AI Feedback Song Yu et.al. 2411.00897v1 null
2024-11-01 StepCountJITAI: simulation environment for RL with application to physical activity adaptive intervention Karine Karine et.al. 2411.00336v1 null
2024-11-01 Strongly Topology-preserving GNNs for Brain Graph Super-resolution Pragya Singh et.al. 2411.02525v1 null
2024-11-01 Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes Balu Bhasuran et.al. 2411.02523v1 null
2024-10-31 Deep Learning Predicts Mammographic Breast Density in Clinical Breast Ultrasound Images Arianna Bunnell et.al. 2411.00891v2 link
2024-10-31 Monitoring fairness in machine learning models that predict patient mortality in the ICU Tempest A. van Schaik et.al. 2411.00190v2 null
2024-10-31 Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy Panagiota Gatoula et.al. 2411.00178v1 null
2024-10-31 Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning John Wu et.al. 2411.00173v1 null
2024-10-31 Navigating the Unknown: A Chat-Based Collaborative Interface for Personalized Exploratory Tasks Yingzhe Peng et.al. 2410.24032v1 null
2024-10-31 Neural Network Verification with PyRAT Augustin Lemesle et.al. 2410.23903v1 null
2024-10-31 Counterfactual MRI Data Augmentation using Conditional Denoising Diffusion Generative Models Pedro Morão et.al. 2410.23835v1 link
2024-10-31 Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding Jinlong He et.al. 2410.23822v1 null
2024-10-31 Improving snore detection under limited dataset through harmonic/percussive source separation and convolutional neural networks F. D. Gonzalez-Martinez et.al. 2410.23796v1 null
2024-10-31 The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams Yunqi Zhu et.al. 2410.23769v1 null
2024-10-31 Artificial intelligence to improve clinical coding practice in Scandinavia: a crossover randomized controlled trial Taridzo Chomutare et.al. 2410.23725v1 null
2024-10-31 Enhancing Brain Tumor Classification Using TrAdaBoost and Multi-Classifier Deep Learning Approaches Mahin Mohammadi et.al. 2411.00875v1 null
2024-10-31 Deep Convolutional Neural Networks on Multiclass Classification of Three-Dimensional Brain Images for Parkinson's Disease Stage Prediction Guan-Hua Huang et.al. 2410.23649v1 null
2024-10-31 MS-Glance: Non-semantic context vectors and the applications in supervising image reconstruction Ziqi Gao et.al. 2410.23577v1 link

Abstracts

Scaling Properties of Diffusion Models for Perceptual Tasks

2411.08034v1 by Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through a careful analysis of these scaling behaviors, we present various techniques to efficiently train diffusion models for visual perception tasks. Our models achieve improved or comparable performance to state-of-the-art methods using significantly less data and compute. To use our code and models, see https://scaling-diffusion-perception.github.io .

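Summary: In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through a careful analysis of these scaling behaviors, we present various techniques for efficiently training diffusion models for visual perception tasks. Our models achieve performance comparable to or better than state-of-the-art methods while using significantly less data and compute. To use our code and models, see https://scaling-diffusion-perception.github.io .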

Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech

2411.08013v1 by Eleonora Mancini, Francesco Paissan, Paolo Torroni, Cem Subakan, Mirco Ravanelli

Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.

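Summary: Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. Although models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods for identifying PD-specific speech features, with the aim of supporting the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating, through a range of established metrics, the faithfulness of these maps and of their combinations obtained via union and intersection, and (iii) assessing, with an auxiliary classifier, the information that the saliency maps convey for PD detection. Our results reveal that, although the explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.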

DuoLift-GAN: Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks

2411.07941v1 by Zhaoxi Zhang, Yueliang Ying

Computed tomography (CT) provides highly detailed three-dimensional (3D) medical images but is costly, time-consuming, and often inaccessible in intraoperative settings (Organization et al. 2011). Recent advancements have explored reconstructing 3D chest volumes from sparse 2D X-rays, such as single-view or orthogonal double-view images. However, current models tend to process 2D images in a planar manner, prioritizing visual realism over structural accuracy. In this work, we introduce DuoLift Generative Adversarial Networks (DuoLift-GAN), a novel architecture with dual branches that independently elevate 2D images and their features into 3D representations. These 3D outputs are merged into a unified 3D feature map and decoded into a complete 3D chest volume, enabling richer 3D information capture. We also present a masked loss function that directs reconstruction towards critical anatomical regions, improving structural accuracy and visual quality. This paper demonstrates that DuoLift-GAN significantly enhances reconstruction accuracy while achieving superior visual realism compared to existing methods.

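Summary: Computed tomography (CT) provides highly detailed three-dimensional (3D) medical images but is costly, time-consuming, and often unavailable in intraoperative settings (Organization et al. 2011). Recent work has explored reconstructing 3D chest volumes from sparse 2D X-rays, such as single-view or orthogonal double-view images. However, current models tend to process 2D images in a planar manner, prioritizing visual realism over structural accuracy. In this work, we introduce DuoLift Generative Adversarial Networks (DuoLift-GAN), a novel architecture with dual branches that independently lift 2D images and their features into 3D representations. These 3D outputs are merged into a unified 3D feature map and decoded into a complete 3D chest volume, enabling richer 3D information capture. We also present a masked loss function that directs reconstruction towards critical anatomical regions, improving structural accuracy and visual quality. This paper demonstrates that DuoLift-GAN significantly improves reconstruction accuracy while achieving superior visual realism compared with existing methods.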

Automatic dataset shift identification to support root cause analysis of AI performance drift

2411.07940v1 by Mélanie Roschewitz, Raghav Mehta, Charles Jones, Ben Glocker

Shifts in data distribution can substantially harm the performance of clinical AI models. Hence, various methods have been developed to detect the presence of such shifts at deployment time. However, root causes of dataset shifts are varied, and the choice of shift mitigation strategies is highly dependent on the precise type of shift encountered at test time. As such, detecting test-time dataset shift is not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift (caused by a change in the label distribution), covariate shift (caused by a change in input characteristics) and mixed shifts (simultaneous prevalence and covariate shifts). We discuss the importance of self-supervised encoders for detecting subtle covariate shifts and propose a novel shift detector leveraging both self-supervised encoders and task model outputs for improved shift detection. We report promising results for the proposed shift identification framework across three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shifts, using four large publicly available datasets.

IntRaBench: Interactive Radiological Benchmark

2411.07885v1 by Constantin Ulrich, Tassilo Wald, Emily Tempus, Maximilian Rokuss, Paul F. Jaeger, Klaus Maier-Hein

Current interactive segmentation approaches, inspired by the success of Meta's Segment Anything model, have achieved notable advancements. However, they come with substantial limitations that hinder their practical application in real clinical scenarios. These include unrealistic human interaction requirements, such as slice-by-slice operations for 2D models on 3D data, a lack of iterative refinement, and insufficient evaluation experiments. These shortcomings prevent accurate assessment of model performance and lead to inconsistent outcomes across studies. IntRaBench overcomes these challenges by offering a comprehensive and reproducible framework for evaluating interactive segmentation methods in realistic, clinically relevant scenarios. It includes diverse datasets, target structures, and segmentation models, and provides a flexible codebase that allows seamless integration of new models and prompting strategies. Additionally, we introduce advanced techniques to minimize clinician interaction, ensuring fair comparisons between 2D and 3D models. By open-sourcing IntRaBench, we invite the research community to integrate their models and prompting techniques, ensuring continuous and transparent evaluation of interactive segmentation models in 3D medical imaging.

Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease

2411.07871v1 by Francesco Chiumento, Mingming Liu

The rapid advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown great potential in medical diagnostics, particularly in radiology, where datasets such as X-rays are paired with human-generated diagnostic reports. However, a significant research gap exists in the neuroimaging field, especially for conditions such as Alzheimer's disease, due to the lack of comprehensive diagnostic reports that can be utilized for model fine-tuning. This paper addresses this gap by generating synthetic diagnostic reports using GPT-4o-mini on structured data from the OASIS-4 dataset, which comprises 663 patients. Using the synthetic reports as ground truth for training and validation, we then generated neurological reports directly from the images in the dataset leveraging the pre-trained BiomedCLIP and T5 models. Our proposed method achieved a BLEU-4 score of 0.1827, ROUGE-L score of 0.3719, and METEOR score of 0.4163, revealing its potential in generating clinically relevant and accurate diagnostic reports.

PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring

2411.07796v1 by M. Jaleed Khan, Manu Vatish, Gabriel Davis Jones

Antepartum Cardiotocography (CTG) is vital for fetal health monitoring, but traditional methods like the Dawes-Redman system are often limited by high inter-observer variability, leading to inconsistent interpretations and potential misdiagnoses. This paper introduces PatchCTG, a transformer-based model specifically designed for CTG analysis, employing patch-based tokenisation, instance normalisation and channel-independent processing to capture essential local and global temporal dependencies within CTG signals. PatchCTG was evaluated on the Oxford Maternity (OXMAT) dataset, comprising over 20,000 CTG traces across diverse clinical outcomes after applying the inclusion and exclusion criteria. With extensive hyperparameter optimisation, PatchCTG achieved an AUC of 77%, with specificity of 88% and sensitivity of 57% at Youden's index threshold, demonstrating adaptability to various clinical needs. Testing across varying temporal thresholds showed robust predictive performance, particularly with finetuning on data closer to delivery, achieving a sensitivity of 52% and specificity of 88% for near-delivery cases. These findings suggest the potential of PatchCTG to enhance clinical decision-making in antepartum care by providing a reliable, objective tool for fetal health assessment. The source code is available at https://github.com/jaleedkhan/PatchCTG.
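
For readers unfamiliar with the operating point quoted above, Youden's index is J = sensitivity + specificity - 1, and the "Youden's index threshold" is the score cut-off that maximises J. The sketch below is a plain-Python illustration of that definition on made-up scores; it is not the PatchCTG evaluation code, and the function names are hypothetical.

```python
def youden_j(sensitivity, specificity):
    """Youden's J statistic for a single operating point."""
    return sensitivity + specificity - 1.0


def best_threshold(scores, labels):
    """Return (threshold, J) maximising Youden's J.

    A case is predicted positive when its score is >= threshold.
    labels are 1 for positive cases and 0 for negative cases.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    best = None
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < thr)
        j = youden_j(tp / positives, tn / negatives)
        if best is None or j > best[1]:  # ties keep the lowest threshold
            best = (thr, j)
    return best


# Operating point reported in the abstract: sensitivity 57%, specificity 88%.
print(round(youden_j(0.57, 0.88), 2))  # 0.45

# Toy scores and labels, invented purely for illustration.
print(best_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # (0.35, 0.5)
```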

Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation

2411.07611v1 by Shuai Niu, Jing Ma, Liang Bai, Zhihua Wang, Yida Xu, Yunya Song, Xian Yang

Clinical rationales play a pivotal role in accurate disease diagnosis; however, many models predominantly use discriminative methods and overlook the importance of generating supportive rationales. Rationale distillation is a process that transfers knowledge from large language models (LLMs) to smaller language models (SLMs), thereby enhancing the latter's ability to break down complex tasks. Despite its benefits, rationale distillation alone is inadequate for addressing domain knowledge limitations in tasks requiring specialized expertise, such as disease diagnosis. Effectively embedding domain knowledge in SLMs poses a significant challenge. While current LLMs are primarily geared toward processing textual data, multimodal LLMs that incorporate time series data, especially electronic health records (EHRs), are still evolving. To tackle these limitations, we introduce ClinRaGen, an SLM optimized for multimodal rationale generation in disease diagnosis. ClinRaGen incorporates a unique knowledge-augmented attention mechanism to merge domain knowledge with time series EHR data, utilizing a stepwise rationale distillation strategy to produce both textual and time series-based clinical rationales. Our evaluations show that ClinRaGen markedly improves the SLM's capability to interpret multimodal EHR data and generate accurate clinical rationales, supporting more reliable disease diagnosis, advancing LLM applications in healthcare, and narrowing the performance divide between LLMs and SLMs.

Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection

2411.07546v1 by YeongHyeon Park, Myung Jin Kim, Hyeong Seok Kim

A pre-trained visual-language model, contrastive language-image pre-training (CLIP), successfully accomplishes various downstream tasks with text prompts, such as finding images or localizing regions within the image. Despite CLIP's strong multi-modal data capabilities, it remains limited in specialized environments, such as medical applications. For this purpose, many CLIP variants (e.g., BioMedCLIP and MedCLIP-SAMv2) have emerged, but false positives related to normal regions persist. Thus, we pursue a simple yet important goal: reducing false positives in medical anomaly detection. We introduce a Contrastive LAnguage Prompting (CLAP) method that leverages both positive and negative text prompts. This straightforward approach identifies potential lesion regions by visual attention to the positive prompts in the given image. To reduce false positives, we attenuate attention on normal regions using negative prompts. Extensive experiments with the BMAD dataset, including six biomedical benchmarks, demonstrate that the CLAP method enhances anomaly detection performance. Our future plans include developing an automated fine prompting method for more practical usage.
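
One simple way to picture the attenuation step is to subtract the attention obtained from negative prompts from the attention obtained from positive prompts and clip the result at zero. The abstract does not state the exact rule, so the sketch below assumes this subtraction; the `strength` parameter, the function name, and the attention values are illustrative assumptions and not the CLAP authors' implementation.

```python
def attenuate_attention(positive_map, negative_map, strength=1.0):
    """Suppress regions that also respond to negative (normal-tissue) prompts.

    positive_map, negative_map: flat sequences of attention values in [0, 1].
    strength: hypothetical scaling of the negative-prompt attention.
    """
    if len(positive_map) != len(negative_map):
        raise ValueError("attention maps must have the same length")
    # Clip at zero so attenuated attention never becomes negative.
    return [max(0.0, p - strength * n) for p, n in zip(positive_map, negative_map)]


# Three image regions: the second and third also respond to the negative prompt.
positive = [0.9, 0.6, 0.2]
negative = [0.1, 0.5, 0.3]
print([round(a, 2) for a in attenuate_attention(positive, negative)])  # [0.8, 0.1, 0.0]
```

Regions that respond strongly to both prompt types end up with low scores, which is how a false positive on normal tissue would be suppressed under this assumption.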

2411.07398v1 by Aakash Sorathiya, Gouri Ginde

With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary, which makes the automated extraction of ethical concern-related app reviews a challenging and time-consuming effort. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) evaluates four LLMs for classifying app reviews as privacy concerns; and 3) uses the best NLI and LLM models to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and the Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Then, using NLI+LLM, an additional 1,008 new privacy-related reviews were extracted that were not identified through the keyword-based approach in previous research, thus demonstrating the effectiveness of the proposed approach.

Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery

2411.07395v1 by Mohamed Abul Hassan, Pu Sun, Xiangnan Zhou, Lisanne Kraft, Kelsey T Hadfield, Katjana Ehrlich, Jinyi Qi, Andrew Birkeland, Laura Marcu

This study introduces a novel data-centric approach to improve real-time surgical guidance using fiber-based fluorescence lifetime imaging (FLIm). A key aspect of the methodology is the accurate detection of the aiming beam, which is essential for localizing points used to map FLIm measurements onto the tissue region within the surgical field. The primary challenge arises from the complex and variable conditions encountered in the surgical environment, particularly in Transoral Robotic Surgery (TORS). Uneven illumination in the surgical field can cause reflections, reduce contrast, and result in inconsistent color representation, further complicating aiming beam detection. To overcome these challenges, an instance segmentation model was developed using a data-centric training strategy that improves accuracy by minimizing label noise and enhancing detection robustness. The model was evaluated on a dataset comprising 40 in vivo surgical videos, demonstrating a median detection rate of 85%. This performance was maintained when the model was integrated into a clinical system, achieving a similar detection rate of 85% during TORS procedures conducted in patients. The system's computational efficiency, measured at approximately 24 frames per second (FPS), was sufficient for real-time surgical guidance. This study enhances the reliability of FLIm-based aiming beam detection in complex surgical environments, advancing the feasibility of real-time, image-guided interventions for improved surgical precision.

2411.07378v1 by Yu Han, Aaron Ceross, Sarim Ather, Jeroen H. M. Bergmann

Artificial intelligence (AI) in medical device software (MDSW) represents a transformative clinical technology, attracting increasing attention within both the medical community and the regulators. In this study, we leverage a data-driven approach to automatically extract and analyze AI-enabled medical devices (AIMD) from the National Medical Products Administration (NMPA) regulatory database. The continued increase in publicly available regulatory data requires scalable methods for analysis. Automation of regulatory information screening is essential to create reproducible insights that can be quickly updated in an ever-changing medical device landscape. More than 4 million entries were assessed, identifying 2,174 MDSW registrations, including 531 standalone applications and 1,643 integrated within medical devices, of which 43 were AI-enabled. The leading medical specialties utilizing AIMD include respiratory (20.5%), ophthalmology/endocrinology (12.8%), and orthopedics (10.3%). This approach greatly improves the speed of data extraction, providing a greater ability to compare and contrast. This study provides the first extensive, data-driven exploration of AIMD in China, showcasing the potential of automated regulatory data analysis in understanding and advancing the landscape of AI in medical technology.

A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19

2411.07163v1 by Vedant Khandelwal, Manas Gaur, Ugur Kursuncu, Valerie Shalin, Amit Sheth

Monitoring public sentiment via social media is potentially helpful during health crises such as the COVID-19 pandemic. However, traditional frequency-based, data-driven neural network-based approaches can miss newly relevant content due to the evolving nature of language in a dynamically evolving environment. Human-curated symbolic knowledge sources, such as lexicons for standard language and slang terms, can potentially elevate social media signals in evolving language. We introduce a neurosymbolic method that integrates neural networks with symbolic knowledge sources, enhancing the detection and interpretation of mental health-related tweets relevant to COVID-19. Our method was evaluated using a corpus of large datasets (approximately 12 billion tweets, 2.5 million subreddit data, and 700k news articles) and multiple knowledge graphs. This method dynamically adapts to evolving language, outperforming purely data-driven models with an F1 score exceeding 92%. This approach also showed faster adaptation to new data and lower computational demands than fine-tuning pre-trained large language models (LLMs). This study demonstrates the benefit of neurosymbolic methods in interpreting text in a dynamic environment for tasks such as health surveillance.

Ambient AI Scribing Support: Comparing the Performance of Specialized AI Agentic Architecture to Leading Foundational Models

2411.06713v1 by Chanseo Lee, Sonu Kumar, Kimon A. Vogt, Sam Meraj

This study compares Sporo Health's AI Scribe, a proprietary model fine-tuned for medical scribing, with various LLMs (GPT-4o, GPT-3.5, Gemma-9B, and Llama-3.2-3B) in clinical documentation. We analyzed de-identified patient transcripts from partner clinics, using clinician-provided SOAP notes as the ground truth. Each model generated SOAP summaries using zero-shot prompting, with performance assessed via recall, precision, and F1 scores. Sporo outperformed all models, achieving the highest recall (73.3%), precision (78.6%), and F1 score (75.3%) with the lowest performance variance. Statistically significant differences (p < 0.05) were found between Sporo and the other models, with post-hoc tests showing significant improvements over GPT-3.5, Gemma-9B, and Llama-3.2-3B. While Sporo outperformed GPT-4o by up to 10%, the difference was not statistically significant (p = 0.25). Clinical user satisfaction, measured with a modified PDQI-9 inventory, favored Sporo. Evaluations indicated Sporo's outputs were more accurate and relevant. This highlights the potential of Sporo's multi-agentic architecture to improve clinical workflows.
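
As a reading aid for the metrics above: F1 is the harmonic mean of precision and recall. The harmonic mean of the reported precision (78.6%) and recall (73.3%) is about 75.9%, slightly above the reported F1 of 75.3%. One possible explanation, which is an assumption since the abstract does not say how scores were aggregated, is that F1 was computed per transcript and then averaged; an average of per-sample F1 scores can never exceed the F1 of the averaged precision and recall. The sketch below illustrates this with invented per-sample values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Harmonic mean of the aggregate values quoted in the abstract.
print(round(f1_score(0.786, 0.733), 3))

# Hypothetical per-transcript (precision, recall) pairs, invented for illustration.
samples = [(1.0, 0.5), (0.5, 1.0)]
mean_of_f1 = sum(f1_score(p, r) for p, r in samples) / len(samples)
mean_p = sum(p for p, _ in samples) / len(samples)
mean_r = sum(r for _, r in samples) / len(samples)
f1_of_means = f1_score(mean_p, mean_r)

# Averaging per-sample F1 gives a lower value than F1 of the averaged metrics.
print(round(mean_of_f1, 3), round(f1_of_means, 3))
```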

In-Context Learning for Preserving Patient Privacy: A Framework for Synthesizing Realistic Patient Portal Messages

2411.06549v1 by Joseph Gatto, Parker Seegmiller, Timothy E. Burdick, Sarah Masud Preum

Since the COVID-19 pandemic, clinicians have seen a large and sustained influx of patient portal messages, significantly contributing to clinician burnout. To the best of our knowledge, there are no large-scale public patient portal message corpora that researchers can use to build tools to optimize clinician portal workflows. Informed by our ongoing work with a regional hospital, this study introduces an LLM-powered framework for configurable and realistic patient portal message generation. Our approach leverages few-shot grounded text generation, requiring only a small number of de-identified patient portal messages to help LLMs better match the true style and tone of real data. Clinical experts in our team deem this framework HIPAA-friendly, unlike existing privacy-preserving approaches to synthetic text generation, which cannot guarantee that all sensitive attributes will be protected. Through extensive quantitative and human evaluation, we show that our framework produces data of higher quality than comparable generation methods as well as all related datasets. We believe this work provides a path forward for (i) the release of large-scale synthetic patient message datasets that are stylistically similar to ground-truth samples and (ii) HIPAA-friendly data generation which requires minimal human de-identification efforts.

NeuReg: Domain-invariant 3D Image Registration on Human and Mouse Brains

2411.06315v1 by Taha Razzaq, Asim Iqbal

Medical brain imaging relies heavily on image registration to accurately curate structural boundaries of brain features for various healthcare applications. Deep learning models have shown remarkable performance in image registration in recent years. Still, they often struggle to handle the diversity of 3D brain volumes, challenged by their structural and contrastive variations and their imaging domains. In this work, we present NeuReg, a Neuro-inspired 3D image registration architecture with the feature of domain invariance. NeuReg generates domain-agnostic representations of imaging features and incorporates a shifting window-based Swin Transformer block as the encoder. This enables our model to capture the variations across brain imaging modalities and species. We demonstrate a new benchmark in multi-domain publicly available datasets comprising human and mouse 3D brain volumes. Extensive experiments reveal that our model (NeuReg) outperforms the existing baseline deep learning-based image registration models and provides a high-performance boost on cross-domain datasets, where models are trained on 'source-only' domain and tested on completely 'unseen' target domains. Our work establishes a new state-of-the-art for domain-agnostic 3D brain image registration, underpinned by Neuro-inspired Transformer-based architecture.

GuidelineGuard: An Agentic Framework for Medical Note Evaluation with Guideline Adherence

2411.06264v1 by MD Ragib Shahriyear

Although rapid advancements in Large Language Models (LLMs) are facilitating the integration of artificial intelligence-based applications and services in healthcare, limited research has focused on the systematic evaluation of medical notes for guideline adherence. This paper introduces GuidelineGuard, an agentic framework powered by LLMs that autonomously analyzes medical notes, such as hospital discharge and office visit notes, to ensure compliance with established healthcare guidelines. By identifying deviations from recommended practices and providing evidence-based suggestions, GuidelineGuard helps clinicians adhere to the latest standards from organizations like the WHO and CDC. This framework offers a novel approach to improving documentation quality and reducing clinical errors.

Deep Reinforcement Learning for Digital Twin-Oriented Complex Networked Systems

2411.06148v1 by Jiaqi Wen, Bogdan Gabrys, Katarzyna Musial

The Digital Twin Oriented Complex Networked System (DT-CNS) aims to build and extend a Complex Networked System (CNS) model with progressively increasing dynamics complexity towards an accurate reflection of reality -- a Digital Twin of reality. Our previous work proposed evolutionary DT-CNSs to model the long-term adaptive network changes in an epidemic outbreak. This study extends this framework by proposing the temporal DT-CNS model, where reinforcement learning-driven nodes make decisions on temporal directed interactions in an epidemic outbreak. We consider cooperative nodes, as well as egocentric and ignorant "free-riders" in the cooperation. We describe this epidemic spreading process with the Susceptible-Infected-Recovered ($SIR$) model and investigate the impact of epidemic severity on the epidemic resilience for different types of nodes. Our experimental results show that (i) full cooperation leads to a higher reward and lower infection number than cooperation with egocentric or ignorant "free-riders"; (ii) an increasing number of "free-riders" in a cooperation leads to a smaller reward, while an increasing number of egocentric "free-riders" further escalates the infection numbers; and (iii) higher infection rates and slower recovery weaken networks' resilience to severe epidemic outbreaks. These findings also indicate that promoting cooperation and reducing "free-riders" can improve public health during epidemics.
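
To make the SIR dynamics concrete, the following is a minimal sketch of the classic well-mixed (compartmental) SIR model integrated with Euler steps. It is deliberately not the networked, reinforcement-learning-driven DT-CNS model of the paper; the parameter values and function names are assumptions chosen only to show how a higher infection rate or a slower recovery rate raises the infection peak.

```python
def simulate_sir(beta, gamma, s0=0.99, i0=0.01, steps=1000, dt=0.1):
    """Euler integration of dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I, dR/dt = gamma*I.

    beta: infection rate, gamma: recovery rate (both illustrative values).
    Returns a list of (S, I, R) population fractions, one entry per time step.
    """
    s, i, r = s0, i0, 1.0 - s0 - i0
    history = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


def peak_infected(history):
    """Largest infected fraction reached during the simulation."""
    return max(i for _, i, _ in history)


mild = peak_infected(simulate_sir(beta=0.3, gamma=0.1))
severe = peak_infected(simulate_sir(beta=0.6, gamma=0.1))
slow_recovery = peak_infected(simulate_sir(beta=0.4, gamma=0.05))
fast_recovery = peak_infected(simulate_sir(beta=0.4, gamma=0.2))
print(mild < severe, fast_recovery < slow_recovery)  # True True
```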

Evaluating the Propensity of Generative AI for Producing Disinformation During an Election Cycle

2411.06120v1 by Erik J Schlicht

Generative Artificial Intelligence offers a powerful tool for adversaries who wish to engage in influence operations, such as the Chinese Spamouflage operation and the Russian Internet Research Agency effort that both sought to interfere with recent US election cycles. Therefore, this study seeks to investigate the propensity of current Generative AI models for producing harmful disinformation during an election cycle. The probability that different Generative AI models produced disinformation when given adversarial prompts was evaluated, in addition to the associated harm. This allows the expected harm for each model to be computed, and it was discovered that Copilot and Gemini tied for the overall safest performance by realizing the lowest expected harm, while GPT-4o produced the greatest rates of harmful disinformation, resulting in much higher expected harm scores. The impact of disinformation category was also investigated: Gemini was safest within the political category of disinformation, while Copilot was safest for topics related to health. Moreover, characteristics of adversarial roles were discovered that led to greater expected harm across all models. Finally, classification models were developed that predicted disinformation production based on the conditions considered in this study, which offers insight into factors important for predicting disinformation production. Based on all of these insights, recommendations are provided that seek to mitigate factors that lead to harmful disinformation being produced by Generative AI models. It is hoped that developers will use these insights to improve future models.
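
The expected-harm idea can be illustrated as the average, over prompts, of the harm incurred when a model actually produces disinformation (zero otherwise). The abstract does not specify the harm scale or the exact estimator, so the sketch below is only an assumed reading: the model names, harm ratings, and trial outcomes are all invented for illustration.

```python
def expected_harm(trials):
    """Average harm per prompt.

    trials: list of (produced_disinformation, harm_if_produced) pairs, where
    the first item is a bool and the second a hypothetical harm rating.
    Prompts that were refused contribute zero harm.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    return sum(harm for produced, harm in trials if produced) / len(trials)


# Invented outcomes for two hypothetical models on the same four prompts.
results = {
    "model_a": [(True, 3.0), (False, 5.0), (False, 2.0), (True, 1.0)],
    "model_b": [(False, 3.0), (False, 5.0), (True, 2.0), (False, 1.0)],
}
scores = {name: expected_harm(trials) for name, trials in results.items()}
safest = min(scores, key=scores.get)
print(scores, safest)  # {'model_a': 1.0, 'model_b': 0.5} model_b
```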

Personalize to generalize: Towards a universal medical multi-modality generalization through personalization

2411.06106v1 by Zhaorui Tan, Xi Yang, Tan Pan, Tianyi Liu, Chen Jiang, Xin Guo, Qiufeng Wang, Anh Nguyen, Yuan Qi, Kaizhu Huang, Yuan Cheng

Personalized medicine is a groundbreaking healthcare framework for the $21^{st}$ century, tailoring medical treatments to individuals based on unique clinical characteristics, including diverse medical imaging modalities. Given the significant differences among these modalities due to distinct underlying imaging principles, generalization in multi-modal medical image tasks becomes substantially challenging. Previous methods addressing multi-modal generalization rarely consider personalization, primarily focusing on common anatomical information. This paper aims to bridge multi-modal generalization with the concept of personalized medicine. Specifically, we propose a novel approach to derive a tractable form of the underlying personalized invariant representation $\mathbb{X}_h$ by leveraging individual-level constraints and a learnable biological prior. We demonstrate the feasibility and benefits of learning a personalized $\mathbb{X}_h$, showing that this representation is highly generalizable and transferable across various multi-modal medical tasks. Our method is rigorously validated on medical imaging modalities emphasizing both physical structure and functional information, encompassing a range of tasks that require generalization. Extensive experimental results consistently show that our approach significantly improves performance across diverse scenarios, confirming its effectiveness.

摘要:個人化醫療是 21 世紀的創新醫療保健架構,根據獨特的臨床特徵(包括多種醫學影像方式)為個人量身打造醫療治療。由於這些方式基於不同的影像原理,因此存在顯著差異,多模式醫學影像任務中的概括變得極具挑戰性。先前處理多模式概括的方法很少考慮個人化,主要關注於共同的解剖資訊。本文旨在將多模式概括與個人化醫療的概念聯繫起來。具體來說,我們提出了一種新穎的方法,透過利用個人層級約束和可學習的生物先驗,衍生出基礎個人化不變表示 $\mathbb{X}_h$ 的易於處理形式。我們展示了學習個人化 $\mathbb{X}_h$ 的可行性和好處,表明此表示具有高度可概括性,並且可以在各種多模式醫療任務中轉移。我們的技術在強調物理結構和功能資訊的醫學影像方式上得到嚴格驗證,涵蓋了需要概括的一系列任務。廣泛的實驗結果一致表明,我們的技術顯著改善了各種情境下的效能,證實了其有效性。

Assessing Foundational Medical 'Segment Anything' (Med-SAM1, Med-SAM2) Deep Learning Models for Left Atrial Segmentation in 3D LGE MRI

2411.05963v1 by Mehri Mehrnia, Mohamed Elbayumi, Mohammed S. M. Elbaz

Atrial fibrillation (AF), the most common cardiac arrhythmia, is associated with heart failure and stroke. Accurate segmentation of the left atrium (LA) in 3D late gadolinium-enhanced (LGE) MRI is helpful for evaluating AF, as fibrotic remodeling in the LA myocardium contributes to arrhythmia and serves as a key determinant of therapeutic strategies. However, manual LA segmentation is labor-intensive and challenging. Recent foundational deep learning models, such as the Segment Anything Model (SAM), pre-trained on diverse datasets, have demonstrated promise in generic segmentation tasks. MedSAM, a fine-tuned version of SAM for medical applications, enables efficient, zero-shot segmentation without domain-specific training. Despite the potential of the MedSAM model, it has not yet been evaluated for the complex task of LA segmentation in 3D LGE-MRI. This study aims to (1) evaluate the performance of MedSAM in automating LA segmentation, (2) compare the performance of the MedSAM2 model, which uses a single prompt with automated tracking, with the MedSAM1 model, which requires a separate prompt for each slice, and (3) analyze the performance of MedSAM1 in terms of Dice score (i.e., segmentation accuracy) by varying the size and location of the box prompt.

摘要:心房顫動 (AF) 是最常見的心律不整,與心臟衰竭和中風有關。3D 晚期钆增強 (LGE) MRI 中左心房 (LA) 的精確分割有助於評估 AF,因為 LA 心肌中的纖維化重塑會導致心律不整,並作為治療策略的關鍵決定因素。然而,手動 LA 分割既費力又具有挑戰性。最近基礎深度學習模型(例如在不同資料集上預先訓練的 Segment Anything Model (SAM))已在通用分割任務中展現出前景。MedSAM 是 SAM 的微調版本,適用於醫療應用,它能進行有效、零次學習的分割,而無需特定領域的訓練。儘管 MedSAM 模型具有潛力,但尚未評估其在 3D LGE-MRI 中 LA 分割的複雜任務。本研究旨在 (1) 評估 MedSAM 在自動化 LA 分割中的效能,(2) 比較使用單一提示和自動追蹤的 MedSAM2 模型與需要為每個切片提供單獨提示的 MedSAM1 模型的效能,以及 (3) 分析 MedSAM1 在骰子分數(即分割準確度)方面的效能,方法是改變方框提示的大小和位置。
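
The Dice score used as the accuracy measure in the abstract above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `dice_score`, the flat 0/1 mask representation, and the convention of returning 1.0 when both masks are empty are all assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1.

    Hypothetical helper for illustration; real pipelines usually operate on
    3D arrays, but the formula 2*|A ∩ B| / (|A| + |B|) is the same.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Toy example: 2 overlapping foreground voxels, 3 foreground voxels in each mask.
predicted_mask = [1, 1, 0, 0, 1, 0]
reference_mask = [1, 0, 0, 0, 1, 1]
print(dice_score(predicted_mask, reference_mask))
```

A score of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground voxels, which is why shifting or resizing a box prompt can be tracked through this single number.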

GazeSearch: Radiology Findings Search Benchmark

2411.05780v1 by Trong Thang Pham, Tien-Phat Nguyen, Yuki Ikebe, Akash Awasthi, Zhigang Deng, Carol C. Wu, Hien Nguyen, Ngan Le

Medical eye-tracking data is an important information source for understanding how radiologists visually interpret medical images. This information not only improves the accuracy of deep learning models for X-ray analysis but also their interpretability, enhancing transparency in decision-making. However, the current eye-tracking data is dispersed, unprocessed, and ambiguous, making it difficult to derive meaningful insights. Therefore, there is a need to create a new dataset with more focused and purposeful eye-tracking data, improving its utility for diagnostic applications. In this work, we propose a refinement method inspired by the target-present visual search challenge: there is a specific finding and fixations are guided to locate it. After refining the existing eye-tracking datasets, we transform them into a curated visual search dataset, called GazeSearch, specifically for radiology findings, where each fixation sequence is purposefully aligned to the task of locating a particular finding. Subsequently, we introduce a scan path prediction baseline, called ChestSearch, specifically tailored to GazeSearch. Finally, we employ the newly introduced GazeSearch as a benchmark to evaluate the performance of current state-of-the-art methods, offering a comprehensive assessment for visual search in the medical imaging domain.

摘要:醫療眼動追蹤資料是了解放射科醫師如何視覺化詮釋醫療影像的重要資訊來源。這些資訊不僅提升了深度學習模型在 X 光分析中的準確度,也提升了其可解釋性,增進決策制定中的透明度。然而,目前的醫療眼動追蹤資料分散、未經處理且不明確,這使得難以推導出有意義的見解。因此,有必要建立一個新的資料集,其中包含更多焦點和有目的的眼動追蹤資料,以提升其在診斷應用中的效用。在這項工作中,我們提出了一種改良方法,其靈感來自目標呈現視覺搜尋挑戰:有一個特定的發現,而固定則用於定位它。在改良現有的眼動追蹤資料集後,我們將其轉換為一個名為 GazeSearch 的精選視覺搜尋資料集,專門用於放射科發現,其中每個固定序列都刻意與定位特定發現的任務對齊。隨後,我們介紹了一個掃描路徑預測基準,稱為 ChestSearch,專門針對 GazeSearch 量身打造。最後,我們採用新推出的 GazeSearch 作為基準,評估目前最先進方法的效能,提供醫療影像領域中視覺搜尋的全面評估。

Humans Continue to Outperform Large Language Models in Complex Clinical Decision-Making: A Study with Medical Calculators

2411.05897v1 by Nicholas Wan, Qiao Jin, Joey Chan, Guangzhi Xiong, Serina Applebaum, Aidan Gilson, Reid McMurry, R. Andrew Taylor, Aidong Zhang, Qingyu Chen, Zhiyong Lu

Although large language models (LLMs) have been assessed for general medical knowledge using medical licensing exams, their ability to effectively support clinical decision-making tasks, such as selecting and using medical calculators, remains uncertain. Here, we evaluate the capability of both medical trainees and LLMs to recommend medical calculators in response to various multiple-choice clinical scenarios such as risk stratification, prognosis, and disease diagnosis. We assessed eight LLMs, including open-source, proprietary, and domain-specific models, with 1,009 question-answer pairs across 35 clinical calculators and measured human performance on a subset of 100 questions. While the highest-performing LLM, GPT-4o, provided an answer accuracy of 74.3% (CI: 71.5-76.9%), human annotators, on average, outperformed LLMs with an accuracy of 79.5% (CI: 73.5-85.0%). With error analysis showing that the highest-performing LLMs continue to make mistakes in comprehension (56.6%) and calculator knowledge (8.1%), our findings emphasize that humans continue to surpass LLMs on complex clinical tasks such as calculator recommendation.

摘要:儘管大型語言模型 (LLM) 已使用醫學執照考試評估其一般醫學知識,但它們有效支援臨床決策任務(例如選擇和使用醫學計算器)的能力仍不確定。在此,我們評估醫學受訓者和 LLM 推薦醫學計算器的能力,以回應各種多選題臨床情境,例如風險分層、預後和疾病診斷。我們評估了八個 LLM,包括開源、專有和特定領域的模型,其中包含 35 個臨床計算器的 1,009 個問答對,並測量了人類在 100 個問題子集上的表現。表現最佳的 LLM GPT-4o 提供了 74.3% 的回答準確度 (CI:71.5-76.9%),而人類註解者平均表現優於 LLM,準確度為 79.5% (CI:73.5-85.0%)。錯誤分析顯示,表現最佳的 LLM 在理解 (56.6%) 和計算器知識 (8.1%) 方面仍會犯錯,我們的研究結果強調,人類在計算器推薦等複雜臨床任務上仍然優於 LLM。

Identifying and Decomposing Compound Ingredients in Meal Plans Using Large Language Models

2411.05892v1 by Leon Kopitar, Leon Bedrac, Larissa J Strath, Jiang Bian, Gregor Stiglic

This study explores the effectiveness of Large Language Models in meal planning, focusing on their ability to identify and decompose compound ingredients. We evaluated three models, GPT-4o, Llama-3 (70b), and Mixtral (8x7b), to assess their proficiency in recognizing and breaking down complex ingredient combinations. Preliminary results indicate that while Llama-3 (70b) and GPT-4o excel in accurate decomposition, all models encounter difficulties with identifying essential elements like seasonings and oils. Despite strong overall performance, variations in accuracy and completeness were observed across models. These findings underscore LLMs' potential to enhance personalized nutrition but highlight the need for further refinement in ingredient decomposition. Future research should address these limitations to improve nutritional recommendations and health outcomes.

摘要:這項研究探討大型語言模型在餐點規劃中的效能,著重於其辨識並分解複合食材的能力。我們評估了三個模型:GPT-4o、Llama-3 (70b) 和 Mixtral (8x7b),以評量其辨識並分解複雜食材組合的能力。初步結果顯示,雖然 Llama-3 (70b) 和 GPT-4o 在準確分解方面表現出色,但所有模型在辨識調味料和油脂等必要元素時都遇到困難。儘管整體表現強勁,但各個模型在準確性和完整性方面仍有差異。這些發現強調了 LLM 增強個人化營養的潛力,但同時也突顯了進一步優化食材分解技術的必要性。未來的研究應針對這些限制進行探討,以改善營養建議和健康成果。

SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark

2411.05521v1 by Sithursan Sivasubramaniam, Cedric Osei-Akoto, Yi Zhang, Kurt Stockinger, Jonathan Fuerst

Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models have a big impact on query complexity and performance. While this has been a known fact in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context-learning (ICL) approaches for a set of representative closed and open-source LLMs. Our evaluation sheds light on the trade-offs between database models and query languages for different ICL strategies and LLMs. Last, SM3-Text-to-Query is easily extendable to additional query languages or real, standard-based patient databases.

摘要:電子健康紀錄 (EHR) 儲存在各種資料庫系統中,這些系統在異質儲存架構上具有不同的資料庫模型,例如關聯式資料庫、文件儲存或圖形資料庫。這些不同的資料庫模型對查詢複雜度和效能有很大的影響。雖然這在資料庫研究中已經是眾所周知的事實,但令人驚訝的是,它對日益增加的文字轉查詢系統的影響迄今尚未得到調查。在本文中,我們提出 SM3-Text-to-Query,這是第一個基於來自 Synthea 的合成患者資料的多模型醫療文字轉查詢基準,遵循 SNOMED-CT 分類法——一種廣泛使用的涵蓋醫學術語的知識圖譜本體。SM3-Text-to-Query 提供了關聯式資料庫 (PostgreSQL)、文件儲存 (MongoDB) 和圖形資料庫 (Neo4j 和 GraphDB (RDF)) 的資料表示,允許跨四種流行查詢語言(即 SQL、MQL、Cypher 和 SPARQL)進行評估。我們系統且手動開發了 408 個範本問題,我們擴充這些問題以構建一個基準,其中包含 10K 個針對這四種查詢語言的多樣化自然語言問題/查詢對(總共 40K 對)。在我們的資料集上,我們評估了幾種常見的代表性閉源和開源 LLM 的情境學習 (ICL) 方法。我們的評估揭示了不同 ICL 策略和 LLM 的資料庫模型和查詢語言之間的取捨。最後,SM3-Text-to-Query 可以輕鬆擴展到其他查詢語言或真實的基於標準的患者資料庫。

Towards Scalable Foundation Models for Digital Dermatology

2411.05514v1 by Fabian Gröger, Philippe Gottfrois, Ludovic Amruthalingam, Alvaro Gonzalez-Jimenez, Simone Lionetti, Luis R. Soenksen-Martinez, Alexander A. Navarini, Marc Pouly

The growing demand for accurate and equitable AI models in digital dermatology faces a significant challenge: the lack of diverse, high-quality labeled data. In this work, we investigate the potential of domain-specific foundation models for dermatology in addressing this challenge. We utilize self-supervised learning (SSL) techniques to pre-train models on a dataset of over 240,000 dermatological images from public and private collections. Our study considers several SSL methods and compares the resulting foundation models against domain-agnostic models like those pre-trained on ImageNet and state-of-the-art models such as MONET across 12 downstream tasks. Unlike previous research, we emphasize the development of smaller models that are more suitable for resource-limited clinical settings, facilitating easier adaptation to a broad range of use cases. Results show that models pre-trained in this work not only outperform general-purpose models but also approach the performance of models 50 times larger on clinically relevant diagnostic tasks. To promote further research in this direction, we publicly release both the training code and the foundation models, which can benefit clinicians in dermatological applications.

摘要:數位皮膚科對精準且公平的 AI 模型需求日益增加,但面臨一項重大挑戰:缺乏多元且高品質的標記資料。在這項研究中,我們探討特定領域的基礎模型在皮膚科中解決此挑戰的可能性。我們利用自監督學習 (SSL) 技術在包含超過 24 萬張來自公有和私有資料庫的皮膚科影像的資料集上預先訓練模型。我們的研究考量了多種 SSL 方法,並將產生的基礎模型與不受領域限制的模型(例如在 ImageNet 上預先訓練的模型)以及最先進的模型(例如 MONET)在 12 個下游任務中進行比較。與先前的研究不同,我們強調開發更適合資源有限的臨床環境的小型模型,以利於更輕鬆地適應廣泛的用例。結果顯示,在這項研究中預先訓練的模型不僅優於通用模型,而且在臨床上相關的診斷任務中,其效能也接近大 50 倍的模型。為了促進此方向的進一步研究,我們公開發布訓練程式碼和基礎模型,這些模型可讓皮膚科應用中的臨床醫生受益。

Towards Equitable ASD Diagnostics: A Comparative Study of Machine and Deep Learning Models Using Behavioral and Facial Data

2411.05880v1 by Mohammed Aledhari, Mohamed Rahouti, Ali Alfatemi

Autism Spectrum Disorder (ASD) is often underdiagnosed in females due to gender-specific symptom differences overlooked by conventional diagnostics. This study evaluates machine learning models, particularly Random Forest and convolutional neural networks, for enhancing ASD diagnosis through structured data and facial image analysis. Random Forest achieved 100% validation accuracy across datasets, highlighting its ability to manage complex relationships and reduce false negatives, which is crucial for early intervention and addressing gender biases. In image-based analysis, MobileNet outperformed the baseline CNN, achieving 87% accuracy, though a 30% validation loss suggests possible overfitting, requiring further optimization for robustness in clinical settings. Future work will emphasize hyperparameter tuning, regularization, and transfer learning. Integrating behavioral data with facial analysis could improve diagnosis for underdiagnosed groups. These findings suggest Random Forest's high accuracy and balanced precision-recall metrics could enhance clinical workflows. MobileNet's lightweight structure also shows promise for resource-limited environments, enabling accessible ASD screening. Addressing model explainability and clinician trust will be vital.

摘要:自閉症譜系障礙 (ASD) 由於性別特異的症狀差異,常被忽略而漏診。本研究評估機器學習模型,特別是隨機森林和卷積神經網路,以透過結構化資料和臉部影像分析來強化 ASD 診斷。隨機森林在所有資料集中的驗證準確度達到 100%,突顯其處理複雜關係和減少假陰性的能力,這對於早期介入和解決性別偏見至關重要。在基於影像的分析中,MobileNet 優於基準 CNN,準確度達到 87%,儘管 30% 的驗證損失表明可能過度擬合,需要進一步最佳化以提高臨床環境中的穩健性。未來的研究將強調超參數調整、正則化和遷移學習。將行為資料與臉部分析整合,可以改善漏診群體的診斷。這些發現表明隨機森林的高準確度和平衡的精確度召回指標可以增強臨床工作流程。MobileNet 的輕量級結構也顯示出在資源受限的環境中很有前景,可以進行無障礙的 ASD 篩檢。解決模型可解釋性和臨床醫師的信任至關重要。

Interactive Dialogue Agents via Reinforcement Learning on Hindsight Regenerations

2411.05194v1 by Joey Hong, Jessica Lin, Anca Dragan, Sergey Levine

Recent progress on large language models (LLMs) has enabled dialogue agents to generate highly naturalistic and plausible text. However, current LLM language generation focuses on responding accurately to questions and requests with a single effective response. In reality, many real dialogues are interactive, meaning an agent's utterances will influence their conversational partner, elicit information, or change their opinion. Accounting for how an agent can effectively steer a conversation is a crucial ability in many dialogue tasks, from healthcare to preference elicitation. Existing methods for fine-tuning dialogue agents to accomplish such tasks would rely on curating some amount of expert data. However, doing so often requires understanding the underlying cognitive processes of the conversational partner, which is a skill neither humans nor LLMs trained on human data can reliably do. Our key insight is that while LLMs may not be adept at identifying effective strategies for steering conversations a priori, or in the middle of an ongoing conversation, they can do so post-hoc, or in hindsight, after seeing how their conversational partner responds. We use this fact to rewrite and augment existing suboptimal data, and train via offline reinforcement learning (RL) an agent that outperforms both prompting and learning from unaltered human demonstrations. We apply our approach to two domains that require understanding human mental state, intelligent interaction, and persuasion: mental health support, and soliciting charitable donations. Our results in a user study with real humans show that our approach greatly outperforms existing state-of-the-art dialogue agents.

摘要:大型語言模型 (LLM) 的最新進展使對話代理能夠生成高度自然且合理的文字。然而,目前的 LLM 語言生成著重於以單一有效的回應準確回應問題和要求。在現實中,許多真實對話都是互動的,這表示代理人的發言會影響他們的對話夥伴、引出資訊或改變他們的意見。考量代理人如何有效引導對話的能力在許多對話任務中至關重要,從醫療保健到偏好引導皆是如此。現有的微調對話代理方法以完成此類任務會依賴於策劃一定量的專家資料。然而,這麼做通常需要了解對話夥伴的基礎認知歷程,而這項技能既不是人類也不是訓練過人類資料的 LLM 可靠具備的。我們的關鍵見解在於,儘管 LLM 可能不擅長於事先或在對話進行中識別出引導對話的有效策略,但他們可以在事後或回顧時,在看到他們的對話夥伴如何回應後這麼做。我們利用這個事實來改寫並擴充現有的次佳資料,並透過離線強化學習 (RL) 訓練一名代理人,其表現優於提示和從未經修改的人類示範中學習。我們將我們的做法應用於需要了解人類心理狀態、智慧互動和說服的兩個領域:心理健康支持和募集慈善捐款。我們在與真實人類進行的使用者研究中的結果顯示,我們的做法大幅優於現有的最先進對話代理。

Inverse Transition Learning: Learning Dynamics from Demonstrations

2411.05174v1 by Leo Benac, Abhishek Sharma, Sonali Parbhoo, Finale Doshi-Velez

We consider the problem of estimating the transition dynamics $T^*$ from near-optimal expert trajectories in the context of offline model-based reinforcement learning. We develop a novel constraint-based method, Inverse Transition Learning, that treats the limited coverage of the expert trajectories as a "feature": we use the fact that the expert is near-optimal to inform our estimate of $T^*$. We integrate our constraints into a Bayesian approach. Across both synthetic environments and real healthcare scenarios like Intensive Care Unit (ICU) patient management in hypotension, we demonstrate not only significant improvements in decision-making, but that our posterior can inform when transfer will be successful.

摘要:我們考慮在離線模型基礎強化學習的脈絡中,從接近最佳的專家軌跡估計轉換動態 $T^*$ 的問題。我們開發一種新的基於約束的方法,逆轉換學習,它將專家軌跡的有限覆蓋範圍視為一種「特徵」:我們利用專家接近最佳的事實來告知我們對 $T^*$ 的估計。我們將我們的約束整合到貝氏方法中。在綜合環境和實際醫療保健場景(例如低血壓重症監護病房 (ICU) 病患管理)中,我們不僅展示了決策制定方面的顯著進步,而且我們的後驗可以告知轉移何時會成功。

PadChest-GR: A Bilingual Chest X-ray Dataset for Grounded Radiology Report Generation

2411.05085v1 by Daniel C. Castro, Aurelia Bustos, Shruthi Bannur, Stephanie L. Hyland, Kenza Bouzid, Maria Teodora Wetscherek, Maria Dolores Sánchez-Valverde, Lara Jaques-Pérez, Lourdes Pérez-Rodríguez, Kenji Takeda, José María Salinas, Javier Alvarez-Valle, Joaquín Galant Herrero, Antonio Pertusa

Radiology report generation (RRG) aims to create free-text radiology reports from clinical imaging. Grounded radiology report generation (GRRG) extends RRG by including the localisation of individual findings on the image. Currently, there are no manually annotated chest X-ray (CXR) datasets to train GRRG models. In this work, we present a dataset called PadChest-GR (Grounded-Reporting) derived from PadChest aimed at training GRRG models for CXR images. We curate a public bilingual dataset of 4,555 CXR studies with grounded reports (3,099 abnormal and 1,456 normal), each containing complete lists of sentences describing individual present (positive) and absent (negative) findings in English and Spanish. In total, PadChest-GR contains 7,037 positive and 3,422 negative finding sentences. Every positive finding sentence is associated with up to two independent sets of bounding boxes labelled by different readers and has categorical labels for finding type, locations, and progression. To the best of our knowledge, PadChest-GR is the first manually curated dataset designed to train GRRG models for understanding and interpreting radiological images and generated text. By including detailed localisation and comprehensive annotations of all clinically relevant findings, it provides a valuable resource for developing and evaluating GRRG models from CXR images. PadChest-GR can be downloaded upon request from https://bimcv.cipf.es/bimcv-projects/padchest-gr/

摘要:放射學報告生成 (RRG) 旨在從臨床影像建立自由文字的放射學報告。基礎放射學報告生成 (GRRG) 透過納入影像上個別發現的定位,來延伸 RRG。目前,沒有手動標記的胸部 X 光 (CXR) 資料集,可供訓練 GRRG 模型。在此研究中,我們提出一個名為 PadChest-GR(基礎報告)的資料集,其源自 PadChest,旨在訓練 CXR 影像的 GRRG 模型。我們策劃了一個公開的雙語資料集,其中包含 4,555 份 CXR 研究,附有基礎報告(3,099 份異常報告和 1,456 份正常報告),每個報告都包含完整的句子清單,用英文和西班牙文描述個別存在的(陽性)和不存在的(陰性)發現。總計,PadChest-GR 包含 7,037 個陽性發現句子和 3,422 個陰性發現句子。每個陽性發現句子最多與兩組獨立的邊界框相關聯,由不同的讀者標記,並具有發現類型、位置和進展的分類標籤。據我們所知,PadChest-GR 是第一個手動策劃的資料集,旨在訓練 GRRG 模型,以理解和詮釋放射學影像和產生的文字。透過納入所有臨床相關發現的詳細定位和綜合註解,它為從 CXR 影像開發和評估 GRRG 模型提供了寶貴的資源。PadChest-GR 可應要求從 https://bimcv.cipf.es/bimcv-projects/padchest-gr/ 下載

Position Paper On Diagnostic Uncertainty Estimation from Large Language Models: Next-Word Probability Is Not Pre-test Probability

2411.04962v1 by Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A Miller, Danielle Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew Churpek, Majid Afshar

Large language models (LLMs) are being explored for diagnostic decision support, yet their ability to estimate pre-test probabilities, vital for clinical decision-making, remains limited. This study evaluates two LLMs, Mistral-7B and Llama3-70B, using structured electronic health record data on three diagnosis tasks. We examined three current methods of extracting LLM probability estimations and revealed their limitations. We aim to highlight the need for improved techniques in LLM confidence estimation.

摘要:大型語言模型 (LLM) 正在被探索用於診斷決策支持,但它們估計臨床決策制定中至關重要的預測試概率的能力仍然有限。本研究使用三個診斷任務的結構化電子健康記錄數據評估了兩個 LLM,Mistral-7B 和 Llama3-70B。我們檢查了提取 LLM 概率估計的三種當前方法並揭示了它們的局限性。我們的目標是強調改進 LLM 置信度估計技術的必要性。

FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs?

2411.05059v2 by Eric Wu, Kevin Wu, James Zou

There is great interest in fine-tuning frontier large language models (LLMs) to inject new information and update existing knowledge. While commercial LLM fine-tuning APIs from providers such as OpenAI and Google promise flexible adaptation for various applications, the efficacy of fine-tuning remains unclear. In this study, we introduce FineTuneBench, an evaluation framework and dataset for understanding how well commercial fine-tuning APIs can successfully learn new and updated knowledge. We analyze five frontier LLMs with commercially available fine-tuning APIs, including GPT-4o and Gemini 1.5 Pro, on their effectiveness in two settings: (1) ingesting novel information, such as recent news events and new people profiles, and (2) updating existing knowledge, such as updated medical guidelines and code frameworks. Our results reveal substantial shortcomings in all the models' abilities to effectively learn new information through fine-tuning, with an average generalization accuracy of 37% across all models. When updating existing knowledge, such as incorporating medical guideline updates, commercial fine-tuning APIs show even more limited capability (average generalization accuracy of 19%). Overall, fine-tuning GPT-4o mini is the most effective for infusing new knowledge and updating knowledge, followed by GPT-3.5 Turbo and GPT-4o. The fine-tuning APIs for Gemini 1.5 Flash and Gemini 1.5 Pro are unable to learn new knowledge or update existing knowledge. These findings underscore a major shortcoming in using current commercial fine-tuning services to achieve reliable knowledge infusion in common scenarios. We open source the FineTuneBench dataset at https://github.com/kevinwu23/StanfordFineTuneBench.

摘要:微调前沿大型语言模型 (LLM) 以注入新信息并更新现有知识引起了极大的兴趣。虽然来自 OpenAI 和 Google 等提供商的商业 LLM 微调 API 承诺为各种应用程序提供灵活的适应性,但微调的功效仍不明确。在这项研究中,我们介绍了 FineTuneBench,这是一个评估框架和数据集,用于理解商业微调 API 如何成功学习新的和更新的知识。我们分析了五种前沿 LLM,它们具有可商用的微调 API,包括 GPT-4o 和 Gemini 1.5 Pro,在两种设置中的有效性:(1) 摄取新信息,例如最近的新闻事件和新的人物简介,以及 (2) 更新现有知识,例如更新的医疗指南和代码框架。我们的结果揭示了所有模型在通过微调有效学习新信息方面的重大缺陷,所有模型的平均泛化准确度为 37%。在更新现有知识时,例如纳入医疗指南更新,商业微调 API 显示出更有限的能力(平均泛化准确度为 19%)。总体而言,微调 GPT-4o mini 在灌输新知识和更新知识方面最有效,其次是 GPT-3.5 Turbo 和 GPT-4o。Gemini 1.5 Flash 和 Gemini 1.5 Pro 的微调 API 无法学习新知识或更新现有知识。这些发现强调了在常见场景中使用当前商业微调服务来实现可靠知识注入的重大缺陷。我们在 https://github.com/kevinwu23/StanfordFineTuneBench 上开源了 FineTuneBench 数据集。

Integrating Large Language Models for Genetic Variant Classification

2411.05055v1 by Youssef Boulaimen, Gabriele Fossi, Leila Outemzabet, Nathalie Jeanray, Oleksandr Levenets, Stephane Gerart, Sebastien Vachenc, Salvatore Raieli, Joanna Giemza

The classification of genetic variants, particularly Variants of Uncertain Significance (VUS), poses a significant challenge in clinical genetics and precision medicine. Large Language Models (LLMs) have emerged as transformative tools in this realm. These models can uncover intricate patterns and predictive insights that traditional methods might miss, thus enhancing the predictive accuracy of genetic variant pathogenicity. This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense, which leverage DNA and protein sequence data alongside structural insights to form a comprehensive analytical framework for variant classification. Our approach evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets, setting new benchmarks in classification performance. The models were rigorously tested on a set of challenging variants, demonstrating substantial improvements over existing state-of-the-art tools, especially in handling ambiguous and clinically uncertain variants. The results of this research underline the efficacy of combining multiple modeling approaches to significantly refine the accuracy and reliability of genetic variant classification systems. These findings support the deployment of these advanced computational models in clinical environments, where they can significantly enhance the diagnostic processes for genetic disorders, ultimately pushing the boundaries of personalized medicine by offering more detailed and actionable genetic insights.

摘要:遺傳變異的分類,特別是不確定意義變異(VUS),對臨床遺傳學和精準醫療提出了重大挑戰。大型語言模型(LLM)已成為這個領域的變革性工具。這些模型可以揭示傳統方法可能遺漏的複雜模式和預測見解,從而提高遺傳變異致病性的預測準確度。 本研究調查了最先進 LLM 的整合,包括 GPN-MSA、ESM1b 和 AlphaMissense,這些 LLM 利用 DNA 和蛋白質序列數據以及結構見解,形成了一個全面的變異分類分析框架。我們的做法使用標註完善的 ProteinGym 和 ClinVar 數據集來評估這些整合模型,在分類效能上設定了新的基準。這些模型經過嚴格測試,使用一組具有挑戰性的變異,證明了對現有最先進工具的實質性改進,特別是在處理模稜兩可和臨床上不確定的變異方面。 這項研究的結果強調了結合多種建模方法以顯著提高遺傳變異分類系統的準確度和可靠性的有效性。這些發現支持在臨床環境中部署這些先進的計算模型,它們可以在那裡顯著增強遺傳疾病的診斷程序,最終通過提供更詳細且可操作的遺傳見解來突破個人化醫療的界限。

AWARE Narrator and the Utilization of Large Language Models to Extract Behavioral Insights from Smartphone Sensing Data

2411.04691v1 by Tianyi Zhang, Miu Kojima, Simon D'Alfonso

Smartphones, equipped with an array of sensors, have become valuable tools for personal sensing. Particularly in digital health, smartphones facilitate the tracking of health-related behaviors and contexts, contributing significantly to digital phenotyping, a process where data from digital interactions is analyzed to infer behaviors and assess mental health. Traditional methods process raw sensor data into information features for statistical and machine learning analyses. In this paper, we introduce a novel approach that systematically converts smartphone-collected data into structured, chronological narratives. The AWARE Narrator translates quantitative smartphone sensing data into English language descriptions, forming comprehensive narratives of an individual's activities. We apply the framework to the data collected from university students over a week, demonstrating the potential of utilizing the narratives to summarize individual behavior, and analyzing psychological states by leveraging large language models.

摘要:智慧型手機配備了各式感測器,已成為個人感測的寶貴工具。特別是在數位健康領域,智慧型手機促進了健康相關行為和情境的追蹤,對數位表型分析做出了重大貢獻,數位表型分析是一種從數位互動中分析資料以推論行為和評估心理健康的程序。傳統方法將原始感測器資料處理成資訊特徵,以進行統計和機器學習分析。在本文中,我們介紹一種新穎的方法,該方法系統性地將智慧型手機收集的資料轉換成結構化的時間順序敘事。AWARE Narrator 將定量的智慧型手機感測資料轉換成英文語言描述,形成個人活動的綜合敘事。我們將此架構套用在大學生一週內收集的資料上,證明了利用敘事總結個人行為的潛力,並透過運用大型語言模型來分析心理狀態。

FedDP: Privacy-preserving method based on federated learning for histopathology image segmentation

2411.04509v1 by Liangrui Pan, Mao Huang, Lian Wang, Pinle Qin, Shaoliang Peng

Hematoxylin and Eosin (H&E) staining of whole slide images (WSIs) is considered the gold standard for pathologists and medical practitioners for tumor diagnosis, surgical planning, and post-operative assessment. With the rapid advancement of deep learning technologies, the development of numerous models based on convolutional neural networks and transformer-based models has been applied to the precise segmentation of WSIs. However, due to privacy regulations and the need to protect patient confidentiality, centralized storage and processing of image data are impractical. Training a centralized model directly is challenging to implement in medical settings due to these privacy concerns. This paper addresses the dispersed nature and privacy sensitivity of medical image data by employing a federated learning framework, allowing medical institutions to collaboratively learn while protecting patient privacy. Additionally, to address the issue of original data reconstruction through gradient inversion during the federated learning training process, differential privacy introduces noise into the model updates, preventing attackers from inferring the contributions of individual samples, thereby protecting the privacy of the training data. Experimental results show that the proposed method, FedDP, minimally impacts model accuracy while effectively safeguarding the privacy of cancer pathology image data, with only a slight decrease in Dice, Jaccard, and Acc indices by 0.55%, 0.63%, and 0.42%, respectively. This approach facilitates cross-institutional collaboration and knowledge sharing while protecting sensitive data privacy, providing a viable solution for further research and application in the medical field.

摘要:蘇木精和伊紅(H&E)染色全切片圖像(WSI)被認為是病理學家和醫療從業人員用於腫瘤診斷、手術規劃和術後評估的黃金標準。隨著深度學習技術的快速進展,基於卷積神經網路和基於Transformer的模型的眾多模型已被應用於 WSI 的精確分割。然而,由於隱私法規和保護患者機密性的需要,集中式儲存和處理影像資料是不切實際的。由於這些隱私問題,在醫療環境中直接訓練集中式模型難以實施。本文通過採用聯合學習框架來解決醫療影像資料的分散性質和隱私敏感性,允許醫療機構在保護患者隱私的同時進行協作學習。此外,為了解決聯合學習訓練過程中通過梯度反轉進行原始資料重建的問題,差分隱私會在模型更新中引入雜訊,防止攻擊者推斷個別樣本的貢獻,從而保護訓練資料的隱私。實驗結果表明,所提出的方法 FedDP 對模型準確度的影響最小,同時有效保護了癌症病理影像資料的隱私,Dice、Jaccard 和 Acc 指數分別僅略微下降了 0.55%、0.63% 和 0.42%。這種方法促進了機構間的合作和知識共享,同時保護了敏感資料的隱私,為醫療領域的進一步研究和應用提供了可行的解決方案。
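
The idea of adding noise to model updates before aggregation, as described in the abstract above, can be sketched as follows. This is an illustrative assumption, not FedDP itself: the abstract does not specify the noise distribution, the clipping rule, or any parameter values, so the Gaussian noise, the L2 clipping, and the names `clip_update`, `private_average`, `max_norm`, and `noise_multiplier` are all hypothetical.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm or norm == 0.0:
        return list(update)
    scale = max_norm / norm
    return [x * scale for x in update]


def private_average(client_updates, max_norm, noise_multiplier, rng):
    """Average clipped client updates after adding Gaussian noise to their sum.

    Assumed mechanism for illustration: clipping bounds each client's
    influence, and the noise standard deviation is noise_multiplier * max_norm.
    """
    clipped = [clip_update(u, max_norm) for u in client_updates]
    n_clients = len(clipped)
    dim = len(clipped[0])
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / n_clients for v in noisy]


# Toy round with three hypothetical institutions and a 2-parameter model.
rng = random.Random(0)
updates = [[0.5, -0.2], [3.0, 4.0], [-0.1, 0.3]]
print(private_average(updates, max_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Clipping before adding noise is the design choice that matters here: without a bound on each update's norm, no fixed noise level could mask the contribution of a single institution. The trade-off the abstract reports (small drops in Dice, Jaccard, and Acc) corresponds to how much noise is injected relative to that bound.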

Conditional Diffusion Model for Longitudinal Medical Image Generation

2411.05860v1 by Duy-Phuong Dao, Hyung-Jeong Yang, Jahae Kim

Alzheimer's disease progresses slowly and involves complex interactions between various biological factors. Longitudinal medical imaging data can capture this progression over time. However, longitudinal data frequently encounter issues such as missing data due to patient dropouts, irregular follow-up intervals, and varying lengths of observation periods. To address these issues, we designed a diffusion-based model for 3D longitudinal medical imaging generation using a single magnetic resonance imaging (MRI) scan. This involves the injection of a conditioning MRI and time-visit encoding into the model, enabling control over the change between source and target images. The experimental results indicate that the proposed method generates higher-quality images compared to other competing methods.

摘要:阿茲海默症的進程緩慢,涉及各種生物因子之間的複雜互動。縱向醫學影像資料可以隨著時間推移捕捉這種進程。然而,縱向資料經常會遇到問題,例如由於患者退出、不規則的追蹤間隔和觀察期長度不同而導致資料遺失。為了解決這些問題,我們設計了一個基於擴散的模型,用於使用單一磁共振成像 (MRI) 進行 3D 縱向醫學影像生成。這涉及將條件 MRI 和時間訪問編碼注入模型,從而能夠控制源影像和目標影像之間的轉換。實驗結果表明,與其他競爭方法相比,所提出的方法生成的影像品質較高。

Evaluating the Economic Implications of Using Machine Learning in Clinical Psychiatry

2411.05856v1 by Soaad Hossain, James Rasalingam, Arhum Waheed, Fatah Awil, Rachel Kandiah, Syed Ishtiaque Ahmed

With the growing interest in using AI and machine learning (ML) in medicine, there is a growing body of literature covering the application and ethics of using AI and ML in areas of medicine such as clinical psychiatry. The problem is that there is little literature covering the economic aspects associated with using ML in clinical psychiatry. This study addresses this gap by specifically studying the economic implications of using ML in clinical psychiatry. In this paper, we evaluate the economic implications of using ML in clinical psychiatry using three problem-oriented case studies, literature on economics, socioeconomic and medical AI, and two types of health economic evaluations. In addition, we provide details on fairness, legal, ethics and other considerations for ML in clinical psychiatry.

摘要:隨著 AI 和機器學習 (ML) 在醫學中應用日益受到重視, 探討 AI 和 ML 在醫學領域(例如臨床精神病學)中應用和倫理的文獻越來越多。問題在於,探討與 ML 在臨床精神病學中應用相關的經濟方面的文獻很少。本研究透過特別探討 ML 在臨床精神病學中應用的經濟影響,來解決這個問題。在本文中,我們透過使用三個以問題為導向的案例研究、經濟學、社會經濟和醫療 AI 的文獻,以及兩種類型的健康經濟評估,評估 ML 在臨床精神病學中應用的經濟影響。此外,我們提供有關 ML 在臨床精神病學中的公平性、法律、倫理和其他考量的詳細資訊。

Robust Real-Time Mortality Prediction in the Intensive Care Unit using Temporal Difference Learning

2411.04285v1 by Thomas Frost, Kezhi Li, Steve Harris

The task of predicting long-term patient outcomes using supervised machine learning is a challenging one, in part because of the high variance of each patient's trajectory, which can result in the model over-fitting to the training data. Temporal difference (TD) learning, a common reinforcement learning technique, may reduce variance by generalising learning to the pattern of state transitions rather than terminal outcomes. However, in healthcare this method requires several strong assumptions about patient states, and there appears to be limited literature evaluating the performance of TD learning against traditional supervised learning methods for long-term health outcome prediction tasks. In this study, we define a framework for applying TD learning to real-time irregularly sampled time series data using a Semi-Markov Reward Process. We evaluate the model framework in predicting intensive care mortality and show that TD learning under this framework can result in improved model robustness compared to standard supervised learning methods, and that this robustness is maintained even when validated on external datasets. This approach may offer a more reliable method when learning to predict patient outcomes using high-variance irregular time series data.

Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

2411.04118v1 by Daniel P. Jeong, Saurabh Garg, Zachary C. Lipton, Michael Oberst

Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks. For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
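
The win/tie/loss bookkeeping described above can be sketched with a paired bootstrap over per-question correctness. The abstract does not say which statistical procedure the authors use, so the bootstrap, the 95% interval, and the function name `compare_paired` are assumptions made for illustration only.

```python
import random


def compare_paired(medical_correct, base_correct, n_boot=2000, seed=0):
    """Classify a medical-vs-base comparison as 'win', 'tie' or 'loss'.

    Inputs are equal-length lists of 0/1 correctness on the same questions.
    A 95% percentile bootstrap interval on the accuracy difference decides
    the outcome (an assumed procedure, not necessarily the paper's).
    """
    if len(medical_correct) != len(base_correct):
        raise ValueError("both models must be scored on the same questions")
    rng = random.Random(seed)
    n = len(medical_correct)
    diffs = [m - b for m, b in zip(medical_correct, base_correct)]
    boot_means = []
    for _ in range(n_boot):
        # Resample questions with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    if lower > 0:
        return "win"
    if upper < 0:
        return "loss"
    return "tie"
```

Resampling the paired differences, rather than each model's scores separately, accounts for both models answering the same questions.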

RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

2411.04097v1 by Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz

Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.

Aligning Characteristic Descriptors with Images for Human-Expert-like Explainability

2411.04008v1 by Bharat Chandra Yalavarthi, Nalini Ratha

In mission-critical domains such as law enforcement and medical diagnosis, the ability to explain and interpret the outputs of deep learning models is crucial for ensuring user trust and supporting informed decision-making. Despite advancements in explainability, existing methods often fall short in providing explanations that mirror the depth and clarity of those given by human experts. Such expert-level explanations are essential for the dependable application of deep learning models in law enforcement and medical contexts. Additionally, we recognize that most explanations in real-world scenarios are communicated primarily through natural language. Addressing these needs, we propose a novel approach that utilizes characteristic descriptors to explain model decisions by identifying their presence in images, thereby generating expert-like explanations. Our method incorporates a concept bottleneck layer within the model architecture, which calculates the similarity between image and descriptor encodings to deliver inherent and faithful explanations. Through experiments in face recognition and chest X-ray diagnosis, we demonstrate that our approach offers a significant contrast over existing techniques, which are often limited to the use of saliency maps. We believe our approach represents a significant step toward making deep learning systems more accountable, transparent, and trustworthy in the critical domains of face recognition and medical diagnosis.
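
The similarity step of such a concept bottleneck can be sketched as follows. The descriptor names and three-dimensional vectors are hypothetical stand-ins; in practice both encodings would come from trained image and text encoders, which this sketch does not include.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def explain(image_vec, descriptors, top_k=2):
    """Rank textual descriptors by similarity to the image encoding."""
    scores = {name: cosine(image_vec, vec) for name, vec in descriptors.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical descriptor encodings (toy values, not real model outputs).
descriptors = {
    "enlarged cardiac silhouette": [0.9, 0.1, 0.0],
    "clear lung fields": [0.0, 0.2, 0.9],
    "pleural effusion": [0.1, 0.9, 0.1],
}
image_vec = [0.8, 0.3, 0.1]  # hypothetical image encoding
top = explain(image_vec, descriptors)
for name, score in top:
    print(f"{name}: {score:.2f}")
```

Because the downstream classifier would see only these similarity scores, the highest-ranked descriptors double as the natural-language explanation.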

Fine-tuning -- a Transfer Learning approach

2411.03941v1 by Joseph Arul Raj, Linglong Qian, Zina Ibrahim

Secondary research use of Electronic Health Records (EHRs) is often hampered by the abundance of missing data in this valuable resource. Missingness in EHRs occurs naturally as a result of the data recording practices during routine clinical care, but handling it is crucial to the precision of medical analysis and the decision-making that follows. The literature contains a variety of imputation methodologies based on deep neural networks. These aim to overcome the dynamic, heterogeneous and multivariate missingness patterns of EHRs, which cannot be handled by classical and statistical imputation methods. However, all existing deep imputation methods rely on end-to-end pipelines that incorporate both imputation and downstream analyses, e.g. classification. This coupling makes it difficult to assess the quality of imputation and takes away the flexibility of re-using the imputer for a different task. Furthermore, most end-to-end deep architectures tend to use complex networks to perform the downstream task, in addition to the already sophisticated deep imputation network. We therefore ask whether the high performance reported in the literature is due to the imputer or the classifier, and further, whether a simpler classifier can achieve comparable performance when an optimised state-of-the-art imputer is used. This paper explores the development of a modular, deep learning-based imputation and classification pipeline, specifically built to leverage the capabilities of state-of-the-art imputation models for downstream classification tasks. Such a modular approach enables a) objective assessment of the quality of the imputer and classifier independently, and b) exploration of the performance of simpler classification architectures using an optimised imputer.

MEG: Medical Knowledge-Augmented Large Language Models for Question Answering

2411.03883v2 by Laura Cabello, Carmen Martin-Turrero, Uchenna Akujuobi, Anders Søgaard, Carlos Bobed

Question answering is a natural language understanding task that involves reasoning over both explicit context and unstated, relevant domain knowledge. Large language models (LLMs), which underpin most contemporary question answering systems, struggle to induce how concepts relate in specialized domains such as medicine. Existing medical LLMs are also costly to train. In this work, we present MEG, a parameter-efficient approach for medical knowledge-augmented LLMs. MEG uses a lightweight mapping network to integrate graph embeddings into the LLM, enabling it to leverage external knowledge in a cost-effective way. We evaluate our method on four popular medical multiple-choice datasets and show that LLMs greatly benefit from the factual grounding provided by knowledge graph embeddings. MEG attains an average of +10.2% accuracy over the Mistral-Instruct baseline, and +6.7% over specialized models like BioMistral. We also show results based on Llama-3. Finally, we show that MEG's performance remains robust to the choice of graph encoder.

2411.03782v1 by Daan Schouten, Giulia Nicoletti, Bas Dille, Catherine Chia, Pierpaolo Vendittelli, Megan Schuurmans, Geert Litjens, Nadieh Khalili

Recent technological advances in healthcare have led to unprecedented growth in patient data quantity and diversity. While artificial intelligence (AI) models have shown promising results in analyzing individual data modalities, there is increasing recognition that models integrating multiple complementary data sources, so-called multimodal AI, could enhance clinical decision-making. This scoping review examines the landscape of deep learning-based multimodal AI applications across the medical domain, analyzing 432 papers published between 2018 and 2024. We provide an extensive overview of multimodal AI development across different medical disciplines, examining various architectural approaches, fusion strategies, and common application areas. Our analysis reveals that multimodal AI models consistently outperform their unimodal counterparts, with an average improvement of 6.2 percentage points in AUC. However, several challenges persist, including cross-departmental coordination, heterogeneous data characteristics, and incomplete datasets. We critically assess the technical and practical challenges in developing multimodal AI systems and discuss potential strategies for their clinical implementation, including a brief overview of commercially available multimodal AI models for clinical decision-making. Additionally, we identify key factors driving multimodal AI development and propose recommendations to accelerate the field's maturation. This review provides researchers and clinicians with a thorough understanding of the current state, challenges, and future directions of multimodal AI in medicine.

Sub-DM: Subspace Diffusion Model with Orthogonal Decomposition for MRI Reconstruction

2411.03758v1 by Yu Guan, Qinrong Cai, Wei Li, Qiuyun Fan, Dong Liang, Qiegen Liu

Diffusion model-based approaches have recently achieved remarkable success in MRI reconstruction, but integration into clinical routine remains challenging due to their time-consuming convergence. This is particularly notable when a conventional diffusion process is applied directly to k-space data without considering the inherent properties of k-space sampling, which limits k-space learning efficiency and image reconstruction quality. To tackle these challenges, we introduce a subspace diffusion model with orthogonal decomposition (referred to as Sub-DM), a method that restricts the diffusion process via projections onto a subspace as the k-space data distribution evolves toward noise. In particular, the subspace diffusion model circumvents the inference challenges posed by the complex and high-dimensional characteristics of k-space data, so the highly compact subspace ensures that the diffusion process requires only a few simple iterations to produce accurate prior information. Furthermore, the orthogonal decomposition strategy based on the wavelet transform limits information loss during the migration of the vanilla diffusion process to the subspace. The strategy is approximately reversible, so the entire process can be reversed. As a result, it allows the diffusion processes in different spaces to refine models through a mutual feedback mechanism, enabling the learning of an accurate prior even when dealing with complex k-space data. Comprehensive experiments on different datasets clearly demonstrate the superiority of Sub-DM over state-of-the-art methods in terms of reconstruction speed and quality.

Ultrasound-Based AI for COVID-19 Detection: A Comprehensive Review of Public and Private Lung Ultrasound Datasets and Studies

2411.05029v1 by Abrar Morshed, Abdulla Al Shihab, Md Abrar Jahin, Md Jaber Al Nahian, Md Murad Hossain Sarker, Md Sharjis Ibne Wadud, Mohammad Istiaq Uddin, Muntequa Imtiaz Siraji, Nafisa Anjum, Sumiya Rajjab Shristy, Tanvin Rahman, Mahmuda Khatun, Md Rubel Dewan, Mosaddeq Hossain, Razia Sultana, Ripel Chakma, Sonet Barua Emon, Towhidul Islam, Mohammad Arafat Hussain

The COVID-19 pandemic has affected millions of people globally, with respiratory organs being strongly affected in individuals with comorbidities. Medical imaging-based diagnosis and prognosis have become increasingly popular in clinical settings for detecting COVID-19 lung infections. Among various medical imaging modalities, ultrasound stands out as a low-cost, mobile, and radiation-safe imaging technology. In this comprehensive review, we focus on AI-driven studies utilizing lung ultrasound (LUS) for COVID-19 detection and analysis. We provide a detailed overview of both publicly available and private LUS datasets and categorize the AI studies according to the dataset they used. Additionally, we systematically analyzed and tabulated the studies across various dimensions, including data preprocessing methods, AI models, cross-validation techniques, and evaluation metrics. In total, we reviewed 60 articles, 41 of which utilized public datasets, while the remaining employed private data. Our findings suggest that ultrasound-based AI studies for COVID-19 detection have great potential for clinical use, especially for children and pregnant women. Our review also provides a useful summary for future researchers and clinicians who may be interested in the field.

Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

2411.03670v1 by Pedro R. A. S. Bassi, Wenxuan Li, Yucheng Tang, Fabian Isensee, Zifu Wang, Jieneng Chen, Yu-Cheng Chou, Yannick Kirchhoff, Maximilian Rokuss, Ziyan Huang, Jin Ye, Junjun He, Tassilo Wald, Constantin Ulrich, Michael Baumgartner, Saikat Roy, Klaus H. Maier-Hein, Paul Jaeger, Yiwen Ye, Yutong Xie, Jianpeng Zhang, Ziyang Chen, Yong Xia, Zhaohu Xing, Lei Zhu, Yousef Sadegheih, Afshin Bozorgpour, Pratibha Kumari, Reza Azad, Dorit Merhof, Pengcheng Shi, Ting Ma, Yuxin Du, Fan Bai, Tiejun Huang, Bo Zhao, Haonan Wang, Xiaomeng Li, Hanxue Gu, Haoyu Dong, Jichen Yang, Maciej A. Mazurowski, Saumya Gupta, Linshan Wu, Jiaxin Zhuang, Hao Chen, Holger Roth, Daguang Xu, Matthew B. Blaschko, Sergio Decherchi, Andrea Cavalli, Alan L. Yuille, Zongwei Zhou

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.

Requirements Engineering for Older Adult Digital Health Software: A Systematic Literature Review

2411.03656v1 by Yuqing Xiao, John Grundy, Anuradha Madugalla

Growth of the older adult population has led to an increasing interest in technology-supported aged care. However, the area faces challenges such as a lack of caregivers and limitations in understanding the emotional, social, physical, and mental well-being needs of seniors. Furthermore, there is a gap between developers and ageing people in the understanding of their requirements. Digital health can be important in supporting older adults' wellbeing, emotional requirements, and social needs. Requirements Engineering (RE) is a major software engineering field, which can help to identify, elicit and prioritize the requirements of stakeholders and ensure that the systems meet standards for performance, reliability, and usability. We carried out a systematic review of the literature on RE for older adult digital health software. This was necessary to establish a representative picture of the current understanding of older adults' needs in aged care digital health. Following the established Kitchenham method and the PRISMA and PICO guidelines, we developed a protocol, followed by the systematic exploration of eight databases. This resulted in 69 primary studies of high relevance, which were subsequently subjected to data extraction, synthesis, and reporting. We highlight key RE processes in digital health software for ageing people. The review explores the utilization of technology for older user well-being and care, and the evaluations of such solutions. It also identifies key limitations in existing primary studies that inspire future research opportunities. The results indicate that requirement gathering and understanding vary significantly between studies. The differences are in the quality, depth, and techniques adopted for requirement gathering, and they are largely due to uneven adoption of RE methods.

Cross Feature Fusion of Fundus Image and Generated Lesion Map for Referable Diabetic Retinopathy Classification

2411.03618v1 by Dahyun Mok, Junghyun Bum, Le Duc Tai, Hyunseung Choo

Diabetic Retinopathy (DR) is a primary cause of blindness, necessitating early detection and diagnosis. This paper focuses on referable DR classification to enhance the applicability of the proposed method in clinical practice. We develop an advanced cross-learning DR classification method leveraging transfer learning and cross-attention mechanisms. The proposed method employs the Swin U-Net architecture to segment lesion maps from DR fundus images. The Swin U-Net segmentation model, enriched with DR lesion insights, is transferred to generate a lesion map. Both the fundus image and its segmented lesion map are used as complementary inputs for the classification model. A cross-attention mechanism is deployed to improve the model's ability to capture fine-grained details from the input pairs. Our experiments, utilizing two public datasets, FGADR and EyePACS, demonstrate a superior accuracy of 94.6%, surpassing current state-of-the-art methods by 4.4%. To this end, we aim for the proposed method to be seamlessly integrated into clinical workflows, enhancing accuracy and efficiency in identifying referable DR.

The Future of Intelligent Healthcare: A Systematic Analysis and Discussion on the Integration and Impact of Robots Using Large Language Models for Healthcare

2411.03287v1 by Souren Pashangpour, Goldie Nejat

The potential use of large language models (LLMs) in healthcare robotics can help address the significant demand put on healthcare systems around the world by an aging demographic and a shortage of healthcare professionals. Even though LLMs have already been integrated into medicine to assist both clinicians and patients, the integration of LLMs within healthcare robots has not yet been explored for clinical settings. In this perspective paper, we investigate the groundbreaking developments in robotics and LLMs to identify the system requirements needed for designing health-specific LLM-based robots in terms of multimodal communication through human-robot interactions (HRIs), semantic reasoning, and task planning. Furthermore, we discuss the ethical issues, open challenges, and potential future research directions for this emerging innovative field.

Discovering Data Structures: Nearest Neighbor Search and Beyond

2411.03253v1 by Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant

We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
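
For reference, the hand-written algorithm that the learned model is reported to rediscover in the 1D case can be sketched with the standard library's `bisect` module. This is the classical binary-search baseline, not the learned data structure itself; the function name `nearest_1d` is illustrative.

```python
from bisect import bisect_left


def nearest_1d(sorted_values, query):
    """Return the element of a sorted list closest to query (ties go left)."""
    if not sorted_values:
        raise ValueError("sorted_values must not be empty")
    # Binary search for the first element >= query: O(log n) comparisons.
    i = bisect_left(sorted_values, query)
    if i == 0:
        return sorted_values[0]
    if i == len(sorted_values):
        return sorted_values[-1]
    left, right = sorted_values[i - 1], sorted_values[i]
    return left if query - left <= right - query else right


points = [1, 4, 9, 16]
print(nearest_1d(points, 10))
```

Binary search is distribution-independent. Interpolation search, also mentioned above, instead guesses the position from the query's value and can need fewer probes when the data are close to uniformly distributed.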

Evaluating Machine Learning Models against Clinical Protocols for Enhanced Interpretability and Continuity of Care

2411.03105v1 by Christel Sirocchi, Muhammad Suffian, Federico Sabbatini, Alessandro Bogliolo, Sara Montagna

In clinical practice, decision-making relies heavily on established protocols, often formalised as rules. Concurrently, Machine Learning (ML) models, trained on clinical data, aspire to integrate into medical decision-making processes. However, despite the growing number of ML applications, their adoption into clinical practice remains limited. Two critical concerns arise, relevant to the notions of consistency and continuity of care: (a) accuracy - the ML model, albeit more accurate, might introduce errors that would not have occurred by applying the protocol; (b) interpretability - ML models operating as black boxes might make predictions based on relationships that contradict established clinical knowledge. In this context, the literature suggests using ML models integrating domain knowledge for improved accuracy and interpretability. However, there is a lack of appropriate metrics for comparing ML models with clinical rules in addressing these challenges. Accordingly, in this article, we first propose metrics to assess the accuracy of ML models with respect to the established protocol. Secondly, we propose an approach to measure the distance of explanations provided by two rule sets, with the goal of comparing the explanation similarity between clinical rule-based systems and rules extracted from ML models. The approach is validated on the Pima Indians Diabetes dataset by training two neural networks - one exclusively on data, and the other integrating a clinical protocol. Our findings demonstrate that the integrated ML model achieves comparable performance to that of a fully data-driven model while exhibiting superior accuracy relative to the clinical protocol, ensuring enhanced continuity of care. Furthermore, we show that our integrated model provides explanations for predictions that align more closely with the clinical protocol compared to the data-driven model.

Local Lesion Generation is Effective for Capsule Endoscopy Image Data Augmentation in a Limited Data Setting

2411.03098v1 by Adrian B. Chłopowiec, Adam R. Chłopowiec, Krzysztof Galus, Wojciech Cebula, Martin Tabakov

Limited medical imaging datasets challenge deep learning models by increasing risks of overfitting and reduced generalization, particularly in Generative Adversarial Networks (GANs), where discriminators may overfit, leading to training divergence. This constraint also impairs classification models trained on small datasets. Generative Data Augmentation (GDA) addresses this by expanding training datasets with synthetic data, although it requires training a generative model. We propose and evaluate two local lesion generation approaches to address the challenge of augmenting small medical image datasets. The first approach employs the Poisson Image Editing algorithm, a classical image processing technique, to create realistic image composites that outperform current state-of-the-art methods. The second approach introduces a novel generative method, leveraging a fine-tuned Image Inpainting GAN to synthesize realistic lesions within specified regions of real training images. A comprehensive comparison of the two proposed methods demonstrates that effective local lesion generation in a data-constrained setting allows for reaching new state-of-the-art results in capsule endoscopy lesion classification. Combination of our techniques achieves a macro F1-score of 33.07%, surpassing the previous best result by 7.84 percentage points (p.p.) on the highly imbalanced Kvasir Capsule Dataset, a benchmark for capsule endoscopy. To the best of our knowledge, this work is the first to apply a fine-tuned Image Inpainting GAN for GDA in medical imaging, demonstrating that an image-conditional GAN can be adapted effectively to limited datasets to generate high-quality examples, facilitating effective data augmentation. Additionally, we show that combining this GAN-based approach with classical image processing techniques further enhances the results.
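
Because the reported metric is a macro F1-score on a highly imbalanced dataset, a short sketch of how that metric behaves may help. The labels and predictions below are hypothetical toy data, not drawn from the Kvasir Capsule Dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical imbalanced example: a classifier that always predicts "normal".
y_true = ["normal"] * 8 + ["ulcer", "polyp"]
y_pred = ["normal"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1(y_true, y_pred):.2f}")
```

Each class contributes equally regardless of its frequency, so missing the rare lesion classes pulls the macro F1 far below the accuracy. This is one reason absolute macro F1 values on imbalanced benchmarks can look low.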

摘要:受限的醫學影像資料集會透過增加過度擬合的風險和降低概化能力,特別是在生成對抗網路 (GAN) 中,其中判別器可能會過度擬合,導致訓練分歧,對深度學習模型構成挑戰。這種限制也損害了在小型資料集上訓練的分類模型。生成資料擴充 (GDA) 透過使用合成資料擴充訓練資料集來解決此問題,儘管它需要訓練生成模型。我們提出並評估兩種局部病灶生成方法,以解決擴充小型醫學影像資料集的挑戰。第一種方法採用泊松影像編輯演算法,一種經典影像處理技術,來建立逼真的影像合成,其優於目前最先進的方法。第二種方法引進一種新穎的生成方法,利用微調的影像修復 GAN,在真實訓練影像的特定區域內合成逼真的病灶。對這兩種提議方法的全面比較證明,在資料受限的設定中,有效的局部病灶生成允許在膠囊內視鏡病灶分類中達到新的最先進結果。我們的技術組合在高度不平衡的 Kvasir Capsule 資料集(膠囊內視鏡的基準)上,達到了 33.07% 的巨觀 F1 分數,比先前的最佳結果高出 7.84 個百分點 (p.p.)。據我們所知,這項工作是第一個將微調的影像修復 GAN 應用於醫學影像中的 GDA,證明了影像條件 GAN 可以有效地適應受限的資料集,以產生高品質的範例,促進有效的資料擴充。此外,我們表明將這種基於 GAN 的方法與經典影像處理技術相結合,進一步增強了結果。
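The abstract above reports a macro F1-score on a highly imbalanced dataset. As a minimal sketch of why that metric is informative under imbalance, the following computes macro F1 on made-up labels; the class names and counts are hypothetical and are not taken from the Kvasir Capsule Dataset.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the truth but was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced toy data: 95 "normal" frames, 5 "ulcer" frames.
y_true = ["normal"] * 95 + ["ulcer"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["normal"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

Because every class contributes equally to the average, the classifier that ignores the rare class keeps a high accuracy but loses half of its macro F1.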

Controlling for Unobserved Confounding with Large Language Model Classification of Patient Smoking Status

2411.03004v1 by Samuel Lee, Zach Wood-Doughty

Causal understanding is a fundamental goal of evidence-based medicine. When randomization is impossible, causal inference methods allow the estimation of treatment effects from retrospective analysis of observational data. However, such analyses rely on a number of assumptions, often including that of no unobserved confounding. In many practical settings, this assumption is violated when important variables are not explicitly measured in the clinical record. Prior work has proposed to address unobserved confounding with machine learning by imputing unobserved variables and then correcting for the classifier's mismeasurement. When such a classifier can be trained and the necessary assumptions are met, this method can recover an unbiased estimate of a causal effect. However, such work has been limited to synthetic data, simple classifiers, and binary variables. This paper extends this methodology by using a large language model trained on clinical notes to predict patients' smoking status, which would otherwise be an unobserved confounder. We then apply a measurement error correction on the categorical predicted smoking status to estimate the causal effect of transthoracic echocardiography on mortality in the MIMIC dataset.

摘要:因果理解是循证医学的基本目标。当随机化不可行时,因果推论方法允许从观察性数据的回顾性分析中估计治疗效果。然而,此类分析依赖于许多假设,通常包括没有未观察到的混杂因素。在许多实际情况下,当重要的变量在临床记录中没有明确测量时,这一假设就会被违反。先前的工作提出用机器学习来解决未观察到的混杂问题,方法是推算未观察到的变量,然后校正分类器的测量误差。当可以训练这样的分类器并且满足必要的假设时,这种方法可以恢复因果效应的无偏估计。然而,此类工作仅限于合成数据、简单的分类器和二元变量。本文通过使用在临床记录上训练的大语言模型来预测患者的吸烟状况来扩展这种方法,否则这将是一个未观察到的混杂因素。然后,我们对分类预测的吸烟状态应用测量误差校正,以估计经胸超声心动图对 MIMIC 数据集中死亡率的因果效应。
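The measurement error correction mentioned above can be illustrated with the classical matrix method for a misclassified categorical variable. This is a simplified sketch, not the paper's estimator: it only corrects a marginal distribution, and the three smoking categories, the misclassification matrix, and all numbers are assumptions made for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("misclassification matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical misclassification matrix: M[i][j] = P(predicted = i | true = j),
# with categories ordered (never, former, current). Each column sums to 1.
M = [
    [0.90, 0.10, 0.05],
    [0.08, 0.80, 0.15],
    [0.02, 0.10, 0.80],
]

# Hypothetical true distribution, used only to simulate what a classifier reports.
true_p = [0.5, 0.3, 0.2]
observed = [sum(M[i][j] * true_p[j] for j in range(3)) for i in range(3)]

# Matrix-method correction: observed = M @ true, so true = M^{-1} @ observed.
corrected = solve(M, observed)
print([round(p, 3) for p in observed], [round(p, 3) for p in corrected])
```

In practice the matrix M would have to be estimated on a labeled validation set, and the correction is only as good as that estimate.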

Region-Guided Attack on the Segment Anything Model (SAM)

2411.02974v2 by Xiaoliang Liu, Furao Shen, Jian Zhao

The Segment Anything Model (SAM) is a cornerstone of image segmentation, demonstrating exceptional performance across various applications, particularly in autonomous driving and medical imaging, where precise segmentation is crucial. However, SAM is vulnerable to adversarial attacks that can significantly impair its functionality through minor input perturbations. Traditional techniques, such as FGSM and PGD, are often ineffective in segmentation tasks due to their reliance on global perturbations that overlook spatial nuances. Recent methods like Attack-SAM-K and UAD have begun to address these challenges, but they frequently depend on external cues and do not fully leverage the structural interdependencies within segmentation processes. This limitation underscores the need for a novel adversarial strategy that exploits the unique characteristics of segmentation tasks. In response, we introduce the Region-Guided Attack (RGA), designed specifically for SAM. RGA utilizes a Region-Guided Map (RGM) to manipulate segmented regions, enabling targeted perturbations that fragment large segments and expand smaller ones, resulting in erroneous outputs from SAM. Our experiments demonstrate that RGA achieves high success rates in both white-box and black-box scenarios, emphasizing the need for robust defenses against such sophisticated attacks. RGA not only reveals SAM's vulnerabilities but also lays the groundwork for developing more resilient defenses against adversarial threats in image segmentation.

摘要:影像分割的基石為區段任何模型 (SAM),在各種應用中展現出色的效能,特別是在自動駕駛和醫療影像中,精準的分割至關重要。然而,SAM 容易受到對抗攻擊,而對抗攻擊可能透過輕微的輸入擾動大幅損害其功能性。傳統技術,例如 FGSM 和 PGD,通常在分割任務中無效,因為它們依賴於忽略空間細微差的全局擾動。最近的方法,例如 Attack-SAM-K 和 UAD,已開始解決這些挑戰,但它們經常依賴於外部提示,且並未充分利用分割過程中結構性的相互依賴性。此限制強調需要一種新的對抗策略,以利用分割任務的獨特特性。為了解決這個問題,我們引進專門為 SAM 設計的區域引導攻擊 (RGA)。RGA 利用區域引導地圖 (RGM) 操控分割區域,進而針對擾動進行標定,將大型區段分割並擴展較小的區段,導致 SAM 產生錯誤輸出。我們的實驗證明,RGA 在白盒和黑盒場景中都取得高成功率,強調需要針對此類精密攻擊建立強固的防禦機制。RGA 不僅揭露 SAM 的漏洞,也為在影像分割中針對對抗威脅發展更具復原力的防禦措施奠定基礎。
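For context on the global-perturbation baselines named above, here is a minimal sketch of FGSM on a toy logistic classifier. It is not RGA and does not involve SAM; the weights, input, and step size are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Probability of class 1 under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(x, y, w, b, eps):
    """One FGSM step: move every input coordinate by eps along the sign of the loss gradient."""
    p = predict(x, w, b)
    # Gradient of the binary cross-entropy loss with respect to the input x.
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical model and input.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.2, 0.1, 0.4], 1

x_adv = fgsm(x, y, w, b, eps=0.3)
print(predict(x, w, b), predict(x_adv, w, b))
```

Every coordinate is shifted by the same magnitude regardless of where it sits in the input, which is the global behaviour the abstract contrasts with region-aware perturbations.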

[Vision Paper] PRObot: Enhancing Patient-Reported Outcome Measures for Diabetic Retinopathy using Chatbots and Generative AI

2411.02973v1 by Maren Pielka, Tobias Schneider, Jan Terheyden, Rafet Sifa

We present an outline of the first large language model (LLM) based chatbot application in the context of patient-reported outcome measures (PROMs) for diabetic retinopathy. By utilizing the capabilities of current LLMs, we enable patients to provide feedback about their quality of life and treatment progress via an interactive application. The proposed framework offers significant advantages over the current approach, which encompasses only qualitative collection of survey data or a static survey with limited answer options. Using the PROBot LLM-PROM application, patients will be asked tailored questions about their individual challenges, and can give more detailed feedback on the progress of their treatment. Based on this input, we will use machine learning to infer conventional PROM scores, which can be used by clinicians to evaluate the treatment status. The goal of the application is to improve adherence to the healthcare system and treatments, and thus ultimately reduce cases of subsequent vision impairment. The approach needs to be further validated using a survey and a clinical study.

摘要:我們提出一個基於第一個大型語言模型 (LLM) 的聊天機器人應用程式,用於糖尿病視網膜病變的病人回報結果測量 (PROM)。透過利用當前 LLM 的功能,我們讓病人能夠透過互動式應用程式提供有關其生活品質和治療進度的回饋。所提出的架構提供顯著優於目前方法的優點,目前方法僅包含調查資料的質性收集或具有有限答案選項的靜態調查。使用 PROBot LLM-PROM 應用程式,病人將會被詢問有關其個人挑戰的客製化問題,並能提供更詳細的回饋,說明其治療進度。根據此輸入,我們將使用機器學習推論傳統 PROM 分數,臨床醫生可以使用這些分數來評估治療狀態。此應用程式的目標是改善對醫療保健系統和治療的依從性,並因此最終減少後續視力損害的病例。需要使用調查和臨床研究進一步驗證此方法。

Leveraging Transfer Learning and Multiple Instance Learning for HER2 Automatic Scoring of H&E Whole Slide Images

2411.05028v1 by Rawan S. Abdulsadig, Bryan M. Williams, Nikolay Burlutskiy

Expression of human epidermal growth factor receptor 2 (HER2) is an important biomarker in breast cancer, and patients can benefit from cost-effective automatic HER2 scoring of Hematoxylin and Eosin (H&E) slides. However, developing such scoring models requires large pixel-level annotated datasets. Transfer learning allows prior knowledge from different datasets to be reused, while multiple-instance learning (MIL) mitigates the lack of detailed annotations. The aim of this work is to examine the effect of transfer learning on the performance of deep learning models pre-trained on (i) Immunohistochemistry (IHC) images, (ii) H&E images and (iii) non-medical images. A MIL framework with an attention mechanism is developed using pre-trained models as patch-embedding models. Embedding models pre-trained on H&E images consistently outperformed the others, resulting in an average AUC-ROC value of 0.622 across the 4 HER2 scores (0.59 to 0.80 per HER2 score). Furthermore, using multiple-instance learning with an attention layer not only achieves good classification results, but can also help produce a visual indication of HER2-positive areas in the H&E slide image by utilising the patch-wise attention weights.

摘要:人類表皮生長因子受體 2 (HER2) 的表現是乳癌患者中的一項重要生物標記,這些患者可以受益於具有成本效益的自動蘇木精和伊紅 (H&E) HER2 評分。然而,開發此類評分模型需要大量的像素級註解資料集。遷移學習允許重複使用來自不同資料集的先驗知識,而多實例學習 (MIL) 允許減輕詳細註解的缺乏。這項工作的目的是檢查遷移學習在預先訓練於 (i) 免疫組織化學 (IHC) 影像、(ii) H&E 影像和 (iii) 非醫學影像上的深度學習模型的效能上的潛力。使用預先訓練的模型作為區塊嵌入模型,開發了一個具有注意力機制的 MIL 框架。研究發現,預先訓練於 H&E 影像上的嵌入模型始終優於其他模型,在 4 個 HER2 分數中產生平均 AUC-ROC 值為 $0.622$(每個 HER2 分數為 $0.59-0.80$)。此外,研究發現,使用具有注意力層的多實例學習不僅可以獲得良好的分類結果,還可以幫助通過利用區塊注意力權重產生 H&E 玻片影像中 HER2 陽性區域的可視化指示。
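The attention-based MIL aggregation described above can be sketched as a softmax-weighted sum of patch embeddings. The exact layer used in the paper is not given in the abstract, so the tanh scoring form, the parameter shapes, and the toy values below are assumptions.

```python
import math

def attention_pool(patches, V, w):
    """Aggregate patch embeddings into one slide embedding.

    patches: list of embedding vectors (one per patch)
    V: hidden x dim projection matrix, w: hidden-sized scoring vector
    Returns (slide_embedding, attention_weights).
    """
    scores = []
    for h in patches:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Numerically stable softmax over patches.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patches[0])
    slide = [sum(a * h[d] for a, h in zip(weights, patches)) for d in range(dim)]
    return slide, weights

# Hypothetical 2-dimensional embeddings for three patches of one slide.
patches = [[0.0, 0.0], [2.0, 2.0], [-1.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical learned projection
w = [1.0, 1.0]                # hypothetical learned scoring vector

slide, weights = attention_pool(patches, V, w)
print(slide, weights)
```

The per-patch weights are what would be mapped back onto the slide to highlight regions that drove the prediction.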

Membership Inference Attacks against Large Vision-Language Models

2411.02902v1 by Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, Volkan Cevher

Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios. However, their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records, in their training datasets. Detecting inappropriately used data in VLLMs remains a critical and unresolved issue, mainly due to the lack of standardized datasets and suitable methodologies. In this study, we introduce the first membership inference attack (MIA) benchmark tailored for various VLLMs to facilitate training data detection. Then, we propose a novel MIA pipeline specifically designed for token-level image detection. Lastly, we present a new metric called MaxRényi-K%, which is based on the confidence of the model output and applies to both text and image data. We believe that our work can deepen the understanding and methodology of MIAs in the context of VLLMs. Our code and datasets are available at https://github.com/LIONS-EPFL/VL-MIA.

摘要:大型視覺語言模型 (VLLM) 在處理各種應用場景的多模態任務方面表現出有前景的能力。然而,它們的出現也引發了重大的資料安全問題,因為它們的訓練資料集中可能會包含敏感資訊,例如私人照片和醫療記錄。偵測 VLLM 中不當使用的資料仍然是一個關鍵且尚未解決的問題,主要是由於缺乏標準化的資料集和適當的方法。在本研究中,我們引入了第一個針對各種 VLLM 量身打造的成員推論攻擊 (MIA) 基準,以利於訓練資料偵測。然後,我們提出了一個專門設計用於令牌級別影像偵測的全新 MIA 管線。最後,我們提出一個名為 MaxR\'enyi-K% 的新指標,它基於模型輸出的信心,並適用於文字和影像資料。我們相信,我們的研究可以加深對 VLLM 背景下 MIA 的理解和方法。我們的程式碼和資料集可在 https://github.com/LIONS-EPFL/VL-MIA 取得。
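The abstract only states that the metric is based on model confidence. One plausible reading of the name, assumed here and not confirmed by the text above, is to compute the Rényi entropy of the predicted distribution at each token position and average the largest K% of those values. The sketch below follows that assumption; `alpha`, `k`, and the toy distributions are hypothetical.

```python
import math

def renyi_entropy(probs, alpha):
    """Rényi entropy of order alpha for a discrete distribution."""
    probs = [p for p in probs if p > 0]
    if alpha == 1:            # limit case: Shannon entropy
        return -sum(p * math.log(p) for p in probs)
    if alpha == math.inf:     # limit case: min-entropy
        return -math.log(max(probs))
    return math.log(sum(p ** alpha for p in probs)) / (1 - alpha)

def max_renyi_k(token_dists, alpha=0.5, k=0.2):
    """Average of the largest k fraction of per-position Rényi entropies (assumed definition)."""
    ents = sorted((renyi_entropy(d, alpha) for d in token_dists), reverse=True)
    top = max(1, math.ceil(k * len(ents)))
    return sum(ents[:top]) / top

# Hypothetical next-token distributions over a 4-token vocabulary.
confident = [[0.97, 0.01, 0.01, 0.01]] * 5
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 5

print(max_renyi_k(confident), max_renyi_k(uncertain))
```

Under the usual assumption that a model is more confident on data it was trained on, a lower score would point toward membership; a decision threshold would need to be calibrated on held-out data.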

Advanced XR-Based 6-DOF Catheter Tracking System for Immersive Cardiac Intervention Training

2411.02611v1 by Mohsen Annabestani, Sandhya Sriram, S. Chiu Wong, Alexandros Sigaras, Bobak Mosadegh

Extended Reality (XR) technologies are gaining traction as effective tools for medical training and procedural guidance, particularly in complex cardiac interventions. This paper presents a novel system for real-time 3D tracking and visualization of intracardiac echocardiography (ICE) catheters, with precise measurement of the roll angle. A custom 3D-printed setup, featuring orthogonal cameras, captures biplane video of the catheter, while a specialized computer vision algorithm reconstructs its 3D trajectory, localizing the tip with sub-millimeter accuracy and tracking the roll angle in real time. The system's data is integrated into an interactive Unity-based environment, rendered through the Meta Quest 3 XR headset, combining a dynamically tracked catheter with a patient-specific 3D heart model. This immersive environment allows testing the importance of 3D depth perception, in comparison to 2D projections, as a form of visualization in XR. Our experimental study, conducted using the ICE catheter with six participants, suggests that 3D visualization is not necessarily beneficial over the 2D views offered by the XR system, although all cardiologists saw its utility for pre-operative training, planning, and intra-operative guidance. The proposed system qualitatively shows great promise in transforming catheter-based interventions, particularly ICE procedures, by improving visualization, interactivity, and skill development.

摘要:擴增實境 (XR) 技術正作為醫療訓練和程序指導的有效工具而獲得重視,特別是在複雜的心臟介入治療中。本文提出了一個新的系統,用於實時 3D 追蹤和可視化心內超聲心動圖 (ICE) 導管,並精確測量滾動角度。一個客製化的 3D 列印設定,配備正交相機,捕捉導管的雙平面影片,而一個專門的電腦視覺演算法重建其 3D 軌跡,以小於毫米的精確度定位尖端並即時追蹤滾動角度。系統的資料整合到一個互動式的 Unity 為基礎的環境中,透過 Meta Quest 3 XR 頭戴式裝置呈現,結合動態追蹤的導管和特定病患的 3D 心臟模型。這個沈浸式的環境允許測試 3D 深度感知的重要性,與 2D 投影相比,作為 XR 中的一種視覺化形式。我們的實驗研究,使用 ICE 導管進行,有六位參與者,顯示 3D 視覺化不一定比 XR 系統提供的 2D 視圖有益;儘管所有心臟科醫師都看到它在術前訓練、規劃和術中指導中的用途。所提出的系統在質化上顯示出在轉換導管介入治療,特別是 ICE 程序方面,透過改善視覺化、互動性和技能發展,具有很大的前景。
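To make the biplane idea concrete, the sketch below fuses two orthogonal 2D views into 3D points. It is not the paper's algorithm: it assumes idealized orthographic cameras that are already calibrated to a common scale, and the axis convention and sample coordinates are hypothetical.

```python
import math

def fuse_orthogonal_views(front_xy, side_zy):
    """Combine one point seen by two orthogonal cameras into a 3D point.

    Assumption: the front camera looks along z and observes (x, y);
    the side camera looks along x and observes (z, y).
    The shared y coordinate is averaged to damp measurement noise.
    """
    x, y_front = front_xy
    z, y_side = side_zy
    return (x, (y_front + y_side) / 2.0, z)

def path_length(points):
    """Total length of a polyline through the reconstructed 3D points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Hypothetical catheter centerline samples (millimetres) detected in each view.
front = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
side = [(0.0, 0.0), (0.0, 0.0), (0.0, 4.0)]

trajectory = [fuse_orthogonal_views(f, s) for f, s in zip(front, side)]
print(trajectory, path_length(trajectory))
```

A real setup would also need camera calibration, lens-distortion handling, and point correspondence between the two views, none of which is shown here.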

"It's a conversation, not a quiz": A Risk Taxonomy and Reflection Tool for LLM Adoption in Public Health

2411.02594v1 by Jiawei Zhou, Amy Z. Chen, Darshi Shah, Laura Schwab Reese, Munmun De Choudhury

Recent breakthroughs in large language models (LLMs) have generated both interest and concern about their potential adoption as accessible information sources or communication tools across different domains. In public health -- where stakes are high and impacts extend across populations -- adopting LLMs poses unique challenges that require thorough evaluation. However, structured approaches for assessing potential risks in public health remain under-explored. To address this gap, we conducted focus groups with health professionals and health issue experiencers to unpack their concerns, situated across three distinct and critical public health issues that demand high-quality information: vaccines, opioid use disorder, and intimate partner violence. We synthesize participants' perspectives into a risk taxonomy, distinguishing and contextualizing the potential harms LLMs may introduce when positioned alongside traditional health communication. This taxonomy highlights four dimensions of risk in individual behaviors, human-centered care, information ecosystem, and technology accountability. For each dimension, we discuss specific risks and example reflection questions to help practitioners adopt a risk-reflexive approach. This work offers a shared vocabulary and reflection tool for experts in both computing and public health to collaboratively anticipate, evaluate, and mitigate risks in deciding when to employ LLM capabilities (or not) and how to mitigate harm when they are used.

摘要:大型語言模型 (LLM) 的最新突破引起了人們的興趣,也引起了人們對其作為不同領域的無障礙信息來源或通信工具的潛在採用所產生的擔憂。在公共衛生領域——利害關係很高且影響遍及人群——採用 LLM 構成了獨特的挑戰,需要徹底評估。然而,評估公共衛生中潛在風險的結構化方法仍未得到充分探索。為了解決這一差距,我們與醫療專業人員和健康問題體驗者進行了焦點小組,以解開他們的疑慮,這些疑慮涉及三個不同的關鍵公共衛生問題,這些問題需要高質量的資訊:疫苗、阿片類藥物使用障礙和親密伴侶暴力。我們將參與者的觀點綜合到風險分類法中,區分和情境化 LLM 在與傳統健康傳播並列時可能造成的潛在危害。這種分類法突出了個人行為、以人為中心的護理、資訊生態系統和技術問責制這四個維度的風險。對於每個維度,我們討論具體的風險和範例反思問題,以幫助從業者採用風險反思方法。這項工作為計算和公共衛生領域的專家提供了一個共同的詞彙和反思工具,以便在決定何時採用 LLM 功能(或不採用)以及在使用 LLM 功能時如何減輕危害時,共同預測、評估和減輕風險。

Digitizing Touch with an Artificial Multimodal Fingertip

2411.02479v1 by Mike Lambeta, Tingfan Wu, Ali Sengul, Victoria Rose Most, Nolan Black, Kevin Sawyer, Romeo Mercado, Haozhi Qi, Alexander Sohn, Byron Taylor, Norb Tydingco, Gregg Kammerer, Dave Stroud, Jake Khatha, Kurt Jenkins, Kyle Most, Neal Stein, Ricardo Chavira, Thomas Craven-Bartle, Eric Sanchez, Yitian Ding, Jitendra Malik, Roberto Calandra

Touch is a crucial sensing modality that provides rich information about object properties and interactions with the physical environment. Humans and robots both benefit from using touch to perceive and interact with the surrounding environment (Johansson and Flanagan, 2009; Li et al., 2020; Calandra et al., 2017). However, no existing systems provide rich, multi-modal digital touch-sensing capabilities through a hemispherical compliant embodiment. Here, we describe several conceptual and technological innovations to improve the digitization of touch. These advances are embodied in an artificial finger-shaped sensor with advanced sensing capabilities. Significantly, this fingertip contains high-resolution sensors (~8.3 million taxels) that respond to omnidirectional touch, capture multi-modal signals, and use on-device artificial intelligence to process the data in real time. Evaluations show that the artificial fingertip can resolve spatial features as small as 7 µm, sense normal and shear forces with a resolution of 1.01 mN and 1.27 mN, respectively, perceive vibrations up to 10 kHz, sense heat, and even sense odor. Furthermore, it embeds an on-device AI neural network accelerator that acts as a peripheral nervous system on a robot and mimics the reflex arc found in humans. These results demonstrate the possibility of digitizing touch with superhuman performance. The implications are profound, and we anticipate potential applications in robotics (industrial, medical, agricultural, and consumer-level), virtual reality and telepresence, prosthetics, and e-commerce. Toward digitizing touch at scale, we open-source a modular platform to facilitate future research on the nature of touch.

摘要:觸覺是一種至關重要的感測方式,可提供關於物體屬性和與物理環境交互作用的豐富資訊。人類和機器人都受益於使用觸覺來感知和與周圍環境互動(Johansson and Flanagan, 2009; Li et al., 2020; Calandra et al., 2017)。然而,沒有現有系統透過半球形順應性具身化提供豐富的多模式數位觸覺感測功能。在此,我們描述了幾個概念和技術創新,以改善觸覺的數位化。這些進展體現在具備先進感測功能的人工手指形感測器中。重要的是,這個指尖包含高解析度感測器(約 830 萬個觸覺點),可對全方位觸覺做出反應、擷取多模式訊號,並使用裝置上的人工智慧即時處理資料。評估顯示,人工指尖可以解析小至 7 微米的空間特徵,以 1.01 毫牛頓和 1.27 毫牛頓的解析度感測法向力和剪切力,感知高達 10 千赫的振動、感測熱,甚至感測氣味。此外,它內嵌了一個裝置上的 AI 神經網路加速器,作為機器人的周邊神經系統,並模仿人類的反射弧。這些結果證明了以超人類效能數位化觸覺的可能性。其影響深遠,我們預期在機器人技術(工業、醫療、農業和消費者層級)、虛擬實境和遠距臨場、假肢和電子商務中潛在的應用。為了大規模數位化觸覺,我們開放原始碼一個模組化平台,以促進未來對觸覺本質的研究。

Simulation of Nanorobots with Artificial Intelligence and Reinforcement Learning for Advanced Cancer Cell Detection and Tracking

2411.02345v1 by Shahab Kavousinejad

Nanorobots are a promising development in targeted drug delivery and the treatment of neurological disorders, with potential for crossing the blood-brain barrier (BBB). These small devices leverage advancements in nanotechnology and bioengineering for precise navigation and targeted payload delivery, particularly for conditions like brain tumors, Alzheimer's disease, and Parkinson's disease. Recent progress in artificial intelligence (AI) and machine learning (ML) has improved the navigation and effectiveness of nanorobots, allowing them to detect and interact with cancer cells through biomarker analysis. This study presents a new reinforcement learning (RL) framework for optimizing nanorobot navigation in complex biological environments, focusing on cancer cell detection by analyzing the concentration gradients of surrounding biomarkers. We utilize a computer simulation model to explore the behavior of nanorobots in a three-dimensional space with cancer cells and biological barriers. The proposed method uses Q-learning to refine movement strategies based on real-time biomarker concentration data, enabling nanorobots to autonomously navigate to cancerous tissues for targeted drug delivery. This research lays the groundwork for future laboratory experiments and clinical applications, with implications for personalized medicine and less invasive cancer treatments. The integration of intelligent nanorobots could revolutionize therapeutic strategies, reducing side effects and enhancing treatment effectiveness for cancer patients. Further research will investigate the practical deployment of these technologies in medical settings, aiming to unlock the full potential of nanorobotics in healthcare.

摘要:奈米機器人在標靶藥物傳輸和神經疾病治療中是一項有前景的發展,並具有穿越血腦屏障 (BBB) 的潛力。這些小型裝置利用奈米技術和生物工程的進展,進行精確導航和標靶有效載荷傳輸,特別是針對腦瘤、阿茲海默症和帕金森氏症等疾病。人工智慧 (AI) 和機器學習 (ML) 的最新進展改善了奈米機器人的導航和效能,讓它們能透過生物標記分析來偵測和與癌細胞互動。本研究提出了一個新的強化學習 (RL) 架構,用於最佳化奈米機器人在複雜生物環境中的導航,重點在於透過分析周圍生物標記的濃度梯度來偵測癌細胞。我們利用電腦模擬模型來探索奈米機器人在三維空間中與癌細胞和生物障礙物之間的行為。所提出的方法使用 Q 學習來根據即時生物標記濃度資料調整移動策略,讓奈米機器人能自主導航至癌組織進行標靶藥物傳輸。這項研究為未來的實驗室實驗和臨床應用奠定了基礎,並對個人化醫療和侵入性較小的癌症治療產生影響。整合智慧奈米機器人可以革新治療策略,減少副作用並提高癌症患者的治療效果。進一步的研究將探討這些技術在醫療環境中的實際部署,目標是發揮奈米機器人在醫療保健中的全部潛力。
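The Q-learning loop described above can be sketched on a toy problem. This is not the paper's simulator: it assumes a one-dimensional corridor instead of a 3D environment, a made-up biomarker concentration that peaks at the target cell, and a reward equal to the change in concentration after each move.

```python
import random

N_STATES, TARGET = 6, 5
ACTIONS = (-1, +1)  # move left, move right

def concentration(s):
    """Hypothetical biomarker concentration, highest at the target cell."""
    return 1.0 / (1 + abs(TARGET - s))

def step(s, a):
    """Move within the corridor; reward is the gain in sensed concentration."""
    nxt = min(max(s + a, 0), N_STATES - 1)
    return nxt, concentration(nxt) - concentration(s), nxt == TARGET

def train(episodes=2000, alpha=0.5, gamma=0.9, eps=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(TARGET)  # start anywhere except the target
        for _ in range(50):
            if rng.random() < eps:
                i = rng.randrange(2)                 # explore
            else:
                i = 0 if q[s][0] > q[s][1] else 1    # exploit (ties go right)
            nxt, r, done = step(s, ACTIONS[i])
            # Standard Q-learning target: reward plus discounted best next value.
            target = r if done else r + gamma * max(q[nxt])
            q[s][i] += alpha * (target - q[s][i])
            s = nxt
            if done:
                break
    return q

q_table = train()
policy = [ACTIONS[0 if q[0] > q[1] else 1] for q in q_table[:TARGET]]
print(policy)
```

The learned greedy policy follows the concentration gradient toward the target, which is the behaviour the abstract describes at a much larger scale.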

Taking AI Welfare Seriously

2411.00986v1 by Robert Long, Jeff Sebo, Patrick Butlin, Kathleen Finlinson, Kyle Fish, Jacqueline Harding, Jacob Pfau, Toni Sims, Jonathan Birch, David Chalmers

In this report, we argue that there is a realistic possibility that some AI systems will be conscious and/or robustly agentic in the near future. That means that the prospect of AI welfare and moral patienthood, i.e. of AI systems with their own interests and moral significance, is no longer an issue only for sci-fi or the distant future. It is an issue for the near future, and AI companies and other actors have a responsibility to start taking it seriously. We also recommend three early steps that AI companies and other actors can take: They can (1) acknowledge that AI welfare is an important and difficult issue (and ensure that language model outputs do the same), (2) start assessing AI systems for evidence of consciousness and robust agency, and (3) prepare policies and procedures for treating AI systems with an appropriate level of moral concern. To be clear, our argument in this report is not that AI systems definitely are, or will be, conscious, robustly agentic, or otherwise morally significant. Instead, our argument is that there is substantial uncertainty about these possibilities, and so we need to improve our understanding of AI welfare and our ability to make wise decisions about this issue. Otherwise there is a significant risk that we will mishandle decisions about AI welfare, mistakenly harming AI systems that matter morally and/or mistakenly caring for AI systems that do not.

摘要:在這份報告中,我們認為有些 AI 系統在不久的將來有現實的可能性會具有意識和/或強大的能動性。這表示 AI 福利和道德上的病人地位的前景,亦即具有自身利益和道德意義的 AI 系統,不再只是科幻小說或遙遠未來的議題。這是近未來的議題,而 AI 公司和其他行為者有責任開始認真看待它。我們也建議 AI 公司和其他行為者可以採取三個早期的步驟:他們可以 (1) 承認 AI 福利是一個重要且困難的議題(並確保語言模型的輸出也這麼做),(2) 開始評估 AI 系統是否有意識和強大能動性的證據,以及 (3) 準備政策和程序,以適當的道德關注層級來對待 AI 系統。明確來說,我們在這份報告中的論點並非 AI 系統絕對是或將會具有意識、強大的能動性或其他道德意義。相反地,我們的論點是關於這些可能性存在著實質的不確定性,因此我們需要增進我們對 AI 福利的了解,以及我們做出關於此議題的明智決定的能力。否則,我們將面臨重大風險,錯誤地處理關於 AI 福利的決策,錯誤地傷害到在道德上重要的 AI 系統,和/或錯誤地照顧到在道德上不重要的 AI 系統。

Federated GNNs for EEG-Based Stroke Assessment

2411.02286v1 by Andrea Protani, Lorenzo Giusti, Albert Sund Aillet, Simona Sacco, Paolo Manganotti, Lucio Marinelli, Diogo Reis Santos, Pierpaolo Brutti, Pietro Caliandro, Luigi Serio

Machine learning (ML) has the potential to become an essential tool in supporting clinical decision-making processes, offering enhanced diagnostic capabilities and personalized treatment plans. However, outsourcing medical records to train ML models using patient data raises legal, privacy, and security concerns. Federated learning has emerged as a promising paradigm for collaborative ML, meeting healthcare institutions' requirements for robust models without sharing sensitive data and compromising patient privacy. This study proposes a novel method that combines federated learning (FL) and Graph Neural Networks (GNNs) to predict stroke severity using electroencephalography (EEG) signals across multiple medical institutions. Our approach enables multiple hospitals to jointly train a shared GNN model on their local EEG data without exchanging patient information. Specifically, we address a regression problem by predicting the National Institutes of Health Stroke Scale (NIHSS), a key indicator of stroke severity. The proposed model leverages a masked self-attention mechanism to capture salient brain connectivity patterns and employs EdgeSHAP to provide post-hoc explanations of the neurological states after a stroke. We evaluated our method on EEG recordings from four institutions, achieving a mean absolute error (MAE) of 3.23 in predicting NIHSS, close to the average error made by human experts (MAE ≈ 3.0). This demonstrates the method's effectiveness in providing accurate and explainable predictions while maintaining data privacy.

摘要:機器學習 (ML) 有潛力成為支援臨床決策制定流程的必要工具,提供增強的診斷能力和個人化治療計畫。然而,使用病患資料訓練機器學習模型的外包醫療紀錄引發了法律、隱私和安全方面的疑慮。聯合學習已成為協作機器學習的一種有前景的典範,它符合醫療保健機構對穩健模型的要求,同時不會分享敏感資料和危害病患隱私。本研究提出了一種新的方法,結合聯合學習 (FL) 和圖形神經網路 (GNN) 來使用腦電圖 (EEG) 訊號預測多個醫療機構的腦中風嚴重程度。我們的做法讓多家醫院能夠共同在他們的本地 EEG 資料上訓練一個共享的 GNN 模型,而無需交換病患資訊。具體來說,我們透過預測美國國家衛生研究院腦中風量表 (NIHSS) 來解決回歸問題,NIHSS 是腦中風嚴重程度的一個關鍵指標。所提出的模型利用遮罩自我注意機制來擷取顯著的腦部連結模式,並採用 EdgeSHAP 在中風後提供神經狀態的事後解釋。我們在來自四家機構的 EEG 記錄上評估了我們的模型,在預測 NIHSS 時達到了 3.23 的平均絕對誤差 (MAE),接近人類專家所犯的平均誤差 (MAE ≈ 3.0)。這證明了該方法在維持資料隱私的同時,能提供準確且可解釋的預測,進而展現其效能。
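The abstract does not say which aggregation rule the hospitals use. As a hedged illustration, the sketch below assumes the common FedAvg rule, in which a server averages client parameters weighted by local sample counts, and also shows the MAE metric used for NIHSS regression. All parameter vectors, sample counts, and scores are made up.

```python
def federated_average(client_weights, client_sizes):
    """FedAvg-style aggregation: sample-count-weighted mean of client parameters."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]

def mean_absolute_error(y_true, y_pred):
    """MAE, the regression metric reported for NIHSS prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical flattened model parameters from two hospitals.
hospital_params = [[1.0, 0.0], [3.0, 4.0]]
hospital_sizes = [1, 3]  # hypothetical numbers of local EEG recordings

global_params = federated_average(hospital_params, hospital_sizes)
mae = mean_absolute_error([3, 10], [5, 7])
print(global_params, mae)
```

Only parameters leave each site in this scheme; the raw EEG recordings stay local, which is the privacy property the abstract emphasizes.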

Weakly supervised deep learning model with size constraint for prostate cancer detection in multiparametric MRI and generalization to unseen domains

2411.02466v1 by Robin Trombetta, Olivier Rouvière, Carole Lartizien

Fully supervised deep models have shown promising performance for many medical segmentation tasks. Still, the deployment of these tools in clinics is limited by the very time-consuming collection of manually expert-annotated data. Moreover, most state-of-the-art models have been trained and validated on moderately homogeneous datasets. It is known that deep learning methods are often greatly degraded by domain or label shifts and have yet to be built in such a way as to be robust to unseen data or label distributions. In the clinical setting, this problem is particularly relevant, as the deployment institutions may have different scanners or acquisition protocols than those used to collect the training data. In this work, we propose to address these two challenges in the detection of clinically significant prostate cancer (csPCa) from bi-parametric MRI. We evaluate the method proposed by Kervadec et al. (2018), which introduces a size constraint loss to produce fine semantic segmentations of cancer lesions from weak circle-scribble annotations. The model is evaluated on two public databases (PI-CAI and Prostate158) and one private database. First, we show that the model achieves performance on par with strong fully supervised baseline models, both on in-distribution validation data and on unseen test images. Second, we observe a performance decrease for both fully supervised and weakly supervised models when tested on unseen data domains. This confirms the crucial need for efficient domain adaptation methods if deep learning models are to be deployed in a clinical environment. Finally, we show that ensemble predictions from multiple training runs increase generalization performance.

摘要:完全監督的深度模型在許多醫療影像分割任務中展現出良好的效能。然而,這些工具在臨床上的部署受到耗時的人工標記資料蒐集限制。此外,大多數最先進的模型都在中等同質的資料集上訓練和驗證。眾所周知,深度學習方法經常會因領域或標籤轉移而大幅降低,而且尚未建構出對未見資料或標籤分佈具有穩健性的方法。在臨床環境中,這個問題特別相關,因為部署機構可能擁有與用於訓練模型的資料不同的掃描器或擷取協定。在這項工作中,我們提議針對從雙參數 MRI 中偵測臨床顯著的前列腺癌 (csPCa) 來解決這兩個挑戰。我們評估由 (Kervadec 等人,2018 年) 提出,並引入大小約束損失的方法,以從弱圓形塗鴉標註中產生精細的語義癌症病灶分割。模型的效能基於兩個公開資料庫 (PI-CAI 和 Prostate158) 和一個私人資料庫。首先,我們展示該模型在分佈內驗證資料和未見測試影像上都達到與強大的完全監督基線模型同等的效能。其次,我們觀察到在未見資料領域上測試時,完全監督和弱監督模型的效能都會下降。這證實了對有效領域適應方法的迫切需求,如果深度學習模型旨在部署在臨床環境中。最後,我們展示來自多重訓練的整體預測會提升概化效能。
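A size constraint loss of the kind referenced above can be sketched as a quadratic penalty that is zero while the predicted lesion size stays inside given bounds. How the bounds are derived from the circle scribbles is not stated in the abstract, so `lower` and `upper` below are hypothetical inputs, as is the toy probability map.

```python
def size_constraint_loss(probs, lower, upper):
    """Penalize a soft segmentation whose total size falls outside [lower, upper].

    probs: 2D list of per-pixel foreground probabilities in [0, 1].
    The predicted size is the sum of the probabilities (a soft pixel count).
    """
    size = sum(sum(row) for row in probs)
    if size < lower:
        return (size - lower) ** 2
    if size > upper:
        return (size - upper) ** 2
    return 0.0

# Hypothetical 2x2 probability map with a soft size of 2.0 pixels.
probs = [[0.5, 0.5], [0.5, 0.5]]

too_small = size_constraint_loss(probs, lower=3, upper=5)
in_range = size_constraint_loss(probs, lower=1, upper=3)
too_large = size_constraint_loss(probs, lower=0, upper=1)
print(too_small, in_range, too_large)
```

In training, a term like this would typically be added to a partial cross-entropy computed only on the scribbled pixels, so that the network is guided toward plausibly sized lesions without full masks.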

Evaluating the quality of published medical research with ChatGPT

2411.01952v1 by Mike Thelwall, Xiaorui Jiang, Peter A. Bath

Evaluating the quality of published research is time-consuming but important for departmental evaluations, appointments, and promotions. Previous research has shown that ChatGPT can score articles for research quality, with the results correlating positively with an indicator of quality in all fields except Clinical Medicine. This article investigates this anomaly with the largest dataset yet and a more detailed analysis. The results showed that ChatGPT 4o-mini scores for articles submitted to the UK's Research Excellence Framework (REF) 2021 Unit of Assessment (UoA) 1 Clinical Medicine correlated positively (r=0.134, n=9872) with departmental mean REF scores, against a theoretical maximum correlation of r=0.226 (due to the departmental averaging involved). At the departmental level, mean ChatGPT scores correlated more strongly with departmental mean REF scores (r=0.395, n=31). For the 100 journals with the most articles in UoA 1, their mean ChatGPT score correlated strongly with their REF score (r=0.495) but negatively with their citation rate (r=-0.148). Journal and departmental anomalies in these results point to ChatGPT being ineffective at assessing the quality of research in prestigious medical journals or research directly affecting human health, or both. Nevertheless, the results give evidence of ChatGPT's ability to assess research quality overall for Clinical Medicine, so now there is evidence of its ability in all academic fields.

摘要:評估已發表的品質研究很耗時,但對於部門評鑑、任命和晉升來說很重要。先前的研究顯示,ChatGPT 可以為研究品質評分,其結果與所有領域(臨床醫學除外)的品質指標呈正相關。本文使用迄今為止最大的資料集和更詳細的分析來探討這種異常現象。結果顯示,提交給英國研究卓越架構 (REF) 2021 評估單位 (UoA) 1 臨床醫學的 ChatGPT 4o-mini 分數與部門平均 REF 分數呈正相關(r=0.134,n=9872),而理論最大相關係數為 r=0.226(由於涉及部門平均)。在部門層級,平均 ChatGPT 分數與部門平均 REF 分數相關性更強(r=0.395,n=31)。對於 UoA 1 中文章最多的 100 本期刊,其平均 ChatGPT 分數與其 REF 分數呈強正相關(r=0.495),但與其引用率呈負相關(r=-0.148)。這些結果中的期刊和部門異常現象表明,ChatGPT 無法評估聲望卓著的醫學期刊或直接影響人類健康的研究(或兩者)的品質。儘管如此,結果證明了 ChatGPT 整體評估臨床醫學研究品質的能力,因此現在有證據證明其在所有學術領域的能力。
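The abstract compares an article-level correlation with a department-level one. The sketch below shows how the two are computed and why averaging within departments can raise the correlation; the scores are made up and are not REF or ChatGPT data.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def group_means(values, groups):
    """Mean of values within each group label, in first-seen order."""
    sums = defaultdict(lambda: [0.0, 0])
    for v, g in zip(values, groups):
        sums[g][0] += v
        sums[g][1] += 1
    return {g: s / n for g, (s, n) in sums.items()}

# Hypothetical data: six articles in three departments.
depts = ["A", "A", "B", "B", "C", "C"]
ai_scores = [1, 3, 2, 4, 3, 5]             # per-article model scores
dept_ref = {"A": 2.0, "B": 3.0, "C": 4.0}  # departmental mean reference scores

# Article level: each article is paired with its department's mean score.
article_r = pearson(ai_scores, [dept_ref[d] for d in depts])

# Department level: average the article scores first, then correlate.
means = group_means(ai_scores, depts)
dept_r = pearson([means[d] for d in dept_ref], list(dept_ref.values()))
print(round(article_r, 3), round(dept_r, 3))
```

Within-department variation in article scores cannot be explained by a departmental mean, so it caps the article-level correlation; averaging removes that variation.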

You are out of context!

2411.02464v1 by Giancarlo Cobino, Simone Farci

This research proposes a novel drift detection methodology for machine learning (ML) models based on the concept of "deformation" in the vector space representation of data. Recognizing that new data can act as forces stretching, compressing, or twisting the geometric relationships learned by a model, we explore various mathematical frameworks to quantify this deformation. We investigate measures such as eigenvalue analysis of covariance matrices to capture global shape changes, local density estimation using kernel density estimation (KDE), and Kullback-Leibler divergence to identify subtle shifts in data concentration. Additionally, we draw inspiration from continuum mechanics by proposing a "strain tensor" analogy to capture multi-faceted deformations across different data types. This requires careful estimation of the displacement field, and we delve into strategies ranging from density-based approaches to manifold learning and neural network methods. By continuously monitoring these deformation metrics and correlating them with model performance, we aim to provide a sensitive, interpretable, and adaptable drift detection system capable of distinguishing benign data evolution from true drift, enabling timely interventions and ensuring the reliability of machine learning systems in dynamic environments. Addressing the computational challenges of this methodology, we discuss mitigation strategies like dimensionality reduction, approximate algorithms, and parallelization for real-time and large-scale applications. The method's effectiveness is demonstrated through experiments on real-world text data, focusing on detecting context shifts in Generative AI. Our results, supported by publicly available code, highlight the benefits of this deformation-based approach in capturing subtle drifts that traditional statistical methods often miss. Furthermore, we present a detailed application example within the healthcare domain, showcasing the methodology's potential in diverse fields. Future work will focus on further improving computational efficiency and exploring additional applications across different ML domains.

摘要:本研究提出一個新穎的漂移偵測方法,該方法針對機器學習 (ML) 模型,並基於資料向量空間表示中的「變形」概念。我們了解到新資料可以作為力量,延伸、壓縮或扭曲模型學習到的幾何關係,我們探索各種數學架構來量化這種變形。我們研究了諸如協方差矩陣的特徵值分析來擷取整體形狀變化、使用核密度估計 (KDE) 的局部密度估計,以及 Kullback-Leibler 距離來識別資料集中微妙的偏移。此外,我們從連續力學中汲取靈感,提出一個「應變張量」類比來擷取不同資料類型中的多面向變形。這需要仔細估計位移場,我們深入探討從基於密度的途徑到流形學習和神經網路方法的策略。透過持續監控這些變形量度並將它們與模型效能相關聯,我們旨在提供一個靈敏、可解釋且適應性強的漂移偵測系統,能夠區分良性的資料演化和真正的漂移,從而實現及時的干預並確保機器學習系統在動態環境中的可靠性。為了應對這種方法的計算挑戰,我們討論了降維、近似演算法和並行化等緩解策略,以用於即時和大規模應用。透過在真實世界文字資料上進行實驗,證明了該方法的有效性,重點在於偵測生成式 AI 中的脈絡轉移。我們的結果由公開可用的程式碼支援,突顯了這種基於變形的途徑在擷取傳統統計方法經常遺漏的微妙漂移方面的優點。此外,我們在醫療保健領域中展示了一個詳細的應用範例,展示了該方法在不同領域的潛力。未來的研究將集中在進一步提高計算效率,並探索不同 ML 領域中的其他應用。
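Two of the deformation measures listed above, covariance eigenvalues and Kullback-Leibler divergence, can be sketched on toy data. This is an illustration only, not the authors' released code: it assumes 2D embeddings and pre-binned histograms, and all points and bin frequencies are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def covariance_eigenvalues_2d(points):
    """Eigenvalues (largest first) of the sample covariance of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    return tr / 2 + disc, tr / 2 - disc

# Hypothetical reference embeddings and a batch stretched along the x axis.
reference = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
stretched = [(-3.0, 0.0), (3.0, 0.0), (0.0, -1.0), (0.0, 1.0)]

ref_eigs = covariance_eigenvalues_2d(reference)
new_eigs = covariance_eigenvalues_2d(stretched)
stretch_ratio = new_eigs[0] / ref_eigs[0]  # global shape change

# Hypothetical binned densities before and after a shift in concentration.
density_shift = kl_divergence([0.5, 0.5], [0.9, 0.1])
print(stretch_ratio, density_shift)
```

A monitoring system would track quantities like these over time and raise an alert when they cross thresholds tuned against observed model performance.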

Diagnosing Medical Datasets with Training Dynamics

2411.01653v1 by Laura Wenderoth

This study explores the potential of using training dynamics as an automated alternative to human annotation for evaluating the quality of training data. The framework used is Data Maps, which classifies data points into categories such as easy-to-learn, hard-to-learn, and ambiguous (Swayamdipta et al., 2020). Swayamdipta et al. (2020) highlight that hard-to-learn examples often contain errors, and ambiguous cases significantly impact model training. To confirm the reliability of these findings, we replicated the experiments using a challenging dataset, with a focus on medical question answering. In addition to text comprehension, this field requires the acquisition of detailed medical knowledge, which further complicates the task. A comprehensive evaluation was conducted to assess the feasibility and transferability of the Data Maps framework to the medical domain. The evaluation indicates that the framework is unsuitable for addressing the unique challenges of medical question-answering datasets.

摘要:本研究探討使用訓練動態作為自動化替代方案,以評估訓練資料品質,以取代人工標註。所使用的架構為資料地圖,其將資料點分類為易於學習、難以學習和模稜兩可等類別(Swayamdipta 等人,2020 年)。Swayamdipta 等人(2020 年)強調,難以學習的範例通常包含錯誤,而模稜兩可的情況會對模型訓練產生重大影響。為了確認這些發現的可靠性,我們使用具有挑戰性的資料集複製了實驗,重點放在醫學問題解答上。除了文字理解之外,這個領域還需要獲取詳細的醫學知識,這進一步使任務複雜化。我們進行了全面的評估,以評估資料地圖架構在醫學領域的可行性和可轉移性。評估結果表明,該架構不適合解決資料集在回答醫學問題時面臨的獨特挑戰。
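The Data Maps categories mentioned above rest on two per-example statistics collected across training epochs: confidence (the mean probability assigned to the gold label) and variability (its standard deviation). The sketch below computes both; the thresholds used to assign a region and the per-epoch probabilities are assumptions chosen for illustration.

```python
import statistics

def data_map_stats(gold_probs):
    """Confidence and variability of one example from its per-epoch gold-label probabilities."""
    confidence = statistics.fmean(gold_probs)
    variability = statistics.pstdev(gold_probs)
    return confidence, variability

def region(gold_probs, var_threshold=0.2, conf_threshold=0.5):
    """Assign a Data Maps region; both thresholds are hypothetical."""
    confidence, variability = data_map_stats(gold_probs)
    if variability > var_threshold:
        return "ambiguous"
    return "easy-to-learn" if confidence >= conf_threshold else "hard-to-learn"

# Hypothetical gold-label probabilities over three epochs for three examples.
examples = {
    "q1": [0.90, 0.95, 0.92],
    "q2": [0.10, 0.05, 0.08],
    "q3": [0.10, 0.90, 0.50],
}

regions = {name: region(probs) for name, probs in examples.items()}
print(regions)
```

Examples in the hard-to-learn region are the candidates for label-error review, which is the use case the study tests on medical question answering.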

Optical Flow Representation Alignment Mamba Diffusion Model for Medical Video Generation

2411.01647v1 by Zhenbin Wang, Lei Zhang, Lituan Wang, Minjuan Zhu, Zhenwei Zhang

Medical video generation models are expected to have a profound impact on the healthcare industry, including but not limited to medical education and training, surgical planning, and simulation. Current video diffusion models typically build on image diffusion architecture by incorporating temporal operations (such as 3D convolution and temporal attention). Although this approach is effective, its oversimplification limits spatio-temporal performance and consumes substantial computational resources. To counter this, we propose Medical Simulation Video Generator (MedSora), which incorporates three key elements: i) a video diffusion framework that integrates the advantages of attention and Mamba, balancing low computational load with high-quality video generation, ii) an optical flow representation alignment method that implicitly enhances attention to inter-frame pixels, and iii) a video variational autoencoder (VAE) with frequency compensation that addresses the loss of medical feature information that occurs when transforming pixel space into latent features and then back to pixel frames. Extensive experiments and applications demonstrate that MedSora exhibits superior visual quality in generating medical videos, outperforming the most advanced baseline methods. Further results and code are available at https://wongzbb.github.io/MedSora

摘要:醫療影片生成模型預計將對醫療保健產業產生深遠的影響,包括但不限於醫學教育和訓練、手術規劃和模擬。目前的影片擴散模型通常建立在影像擴散架構上,並結合時間運算(例如 3D 摺積和時間注意力)。儘管此方法有效,但其過於簡化限制了時空效能,並消耗大量的運算資源。為了解決這個問題,我們提出醫學模擬影片生成器 (MedSora),它結合了三個關鍵要素:i) 一個影片擴散架構整合了注意力和 Mamba 的優點,在低運算負載和高品質影片生成之間取得平衡,ii) 一個光流表示對齊方法,可以隱含地增強對影格間像素的注意力,以及 iii) 一個具有頻率補償的影片變異自動編碼器 (VAE),用於解決在將像素空間轉換為潛在特徵,然後再轉回像素影格時發生的醫療特徵資訊遺失問題。廣泛的實驗和應用證明,MedSora 在生成醫療影片方面展現出優異的視覺品質,優於最先進的基準方法。進一步的結果和程式碼可以在 https://wongzbb.github.io/MedSora 取得

Customized Subgraph Selection and Encoding for Drug-drug Interaction Prediction

2411.01535v1 by Haotong Du, Quanming Yao, Juzheng Zhang, Yang Liu, Zhen Wang

Subgraph-based methods have proven to be effective and interpretable in predicting drug-drug interactions (DDIs), which are essential for medical practice and drug development. Subgraph selection and encoding are critical stages in these methods, yet customizing these components remains underexplored due to the high cost of manual adjustments. In this study, inspired by the success of neural architecture search (NAS), we propose a method to search for data-specific components within subgraph-based frameworks. Specifically, we introduce extensive subgraph selection and encoding spaces that account for the diverse contexts of drug interactions in DDI prediction. To address the challenge of large search spaces and high sampling costs, we design a relaxation mechanism that uses an approximation strategy to efficiently explore optimal subgraph configurations. This approach allows for robust exploration of the search space. Extensive experiments demonstrate the effectiveness and superiority of the proposed method, with the discovered subgraphs and encoding functions highlighting the model's adaptability.

摘要:基於子圖的方法已被證明在預測藥物-藥物交互作用 (DDI) 中有效且易於解釋,這對於醫療實務和藥物開發至關重要。子圖選擇和編碼是這些方法中的關鍵階段,然而,由於手動調整的成本高昂,客製化這些元件仍未被充分探討。在本研究中,受到神經架構搜尋 (NAS) 成功啟發,我們提出一個方法來搜尋子圖架構中的資料特定元件。具體來說,我們引入了廣泛的子圖選擇和編碼空間,以說明 DDI 預測中藥物交互作用的不同背景。為了應對大型搜尋空間和高取樣成本的挑戰,我們設計了一個放鬆機制,使用近似策略來有效探索最佳子圖配置。這種方法允許對搜尋空間進行穩健的探索。廣泛的實驗證明了所提出方法的有效性和優越性,發現的子圖和編碼函數突顯了模型的適應性。

Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design

2411.01423v1 by Onur Boyar, Hiroyuki Hanada, Ichiro Takeuchi

The rapid discovery of new chemical compounds is essential for advancing global health and developing treatments. While generative models show promise in creating novel molecules, challenges remain in ensuring the real-world applicability of these molecules and finding such molecules efficiently. To address this, we introduce Conditional Latent Space Molecular Scaffold Optimization (CLaSMO), which combines a Conditional Variational Autoencoder (CVAE) with Latent Space Bayesian Optimization (LSBO) to modify molecules strategically while maintaining similarity to the original input. Our LSBO setting improves the sample-efficiency of our optimization, and our modification approach helps us to obtain molecules with higher chances of real-world applicability. CLaSMO explores substructures of molecules in a sample-efficient manner by performing BO in the latent space of a CVAE conditioned on the atomic environment of the molecule to be optimized. Our experiments demonstrate that CLaSMO efficiently enhances target properties with minimal substructure modifications, achieving state-of-the-art results with a smaller model and dataset compared to existing methods. We also provide an open-source web application that enables chemical experts to apply CLaSMO in a Human-in-the-Loop setting.

摘要:新化學化合物的快速發現對於促進全球健康和開發治療方法至關重要。儘管生成模型在創造新分子方面顯示出前景,但仍然存在挑戰,以確保這些分子的實際適用性並有效地找到這些分子。為了解決這個問題,我們引入了條件潛在空間分子支架最佳化 (CLaSMO),它結合了條件變異自動編碼器 (CVAE) 與潛在空間貝氏最佳化 (LSBO),以策略性地修改分子,同時保持與原始輸入的相似性。我們的 LSBO 設定改善了我們最佳化的樣本效率,我們的修改方法幫助我們獲得具有更高實際適用機會的分子。CLaSMO 以樣本有效的方式探索分子的子結構,方法是在 CVAE 的潛在空間中執行 BO,該空間以要最佳化的分子的原子環境為條件。我們的實驗表明,CLaSMO 以最小的子結構修改有效地增強了目標屬性,與現有方法相比,使用較小的模型和數據集實現了最先進的結果。我們還提供了一個開源網路應用程式,讓化學專家能夠在人機迴圈設定中應用 CLaSMO。

Medical X-Ray Image Enhancement Using Global Contrast-Limited Adaptive Histogram Equalization

2411.01373v1 by Sohrab Namazi Nia, Frank Y. Shih

In medical imaging, accurate diagnosis heavily relies on effective image enhancement techniques, particularly for X-ray images. Existing methods often sacrifice global image characteristics in favor of local ones, or vice versa. In this paper, we present a novel approach, called G-CLAHE (Global-Contrast Limited Adaptive Histogram Equalization), which is well suited to medical imaging with a focus on X-rays. The method builds on Global Histogram Equalization (GHE) and Contrast Limited Adaptive Histogram Equalization (CLAHE), combining the advantages of both while avoiding their weaknesses, so that local and global characteristics are both preserved. Experimental results show that it can significantly improve on current state-of-the-art algorithms, addressing their limitations and enhancing the contrast and quality of X-ray images for diagnostic accuracy.

摘要:在醫學影像中,準確的診斷高度依賴於有效的影像增強技術,特別是 X 光影像。現有的方法通常會遇到各種挑戰,例如犧牲整體影像特性以換取局部影像特性,反之亦然。在本文中,我們提出了一種新穎的方法,稱為 G-CLAHE(全局對比度限制自適應直方圖均衡化),它非常適合於以 X 光為重點的醫學影像。此方法改編自全局直方圖均衡化 (GHE) 和對比度限制自適應直方圖均衡化 (CLAHE),以取得兩者的優點,並避免弱點,以保留局部和全局特性。實驗結果表明,它可以顯著改善當前最先進的演算法,以有效解決其限制,並增強 X 光影像的對比度和品質,以利於診斷準確性。
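For orientation, the sketch below shows plain Global Histogram Equalization (GHE), one of the two building blocks named in the abstract. It is not G-CLAHE itself, whose algorithm the abstract does not specify; the function name and the list-of-rows image representation are assumptions made for illustration.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization for a grayscale image given as rows of ints in [0, levels)."""
    flat = [p for row in image for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return [row[:] for row in image]
    scale = (levels - 1) / (total - cdf_min)
    # Lookup table mapping each old intensity to its equalized value.
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in image]

low_contrast = [[50, 50], [100, 150]]
stretched = equalize_histogram(low_contrast)
print(stretched)
```

GHE stretches the whole image with a single mapping, which is why it can wash out local detail; CLAHE applies a clipped version of the same idea per tile, and the abstract describes G-CLAHE as drawing on both.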

Guided Synthesis of Labeled Brain MRI Data Using Latent Diffusion Models for Segmentation of Enlarged Ventricles

2411.01351v1 by Tim Ruschke, Jonathan Frederik Carlsen, Adam Espe Hansen, Ulrich Lindberg, Amalie Monberg Hindsholm, Martin Norgaard, Claes Nøhr Ladefoged

Deep learning models in medical contexts face challenges like data scarcity, inhomogeneity, and privacy concerns. This study focuses on improving ventricular segmentation in brain MRI images using synthetic data. We employed two latent diffusion models (LDMs): a mask generator trained using 10,000 masks, and a corresponding SPADE image generator optimized using 6,881 scans to create an MRI conditioned on a 3D brain mask. Conditioning the mask generator on ventricular volume in combination with classifier-free guidance enabled the control of the ventricular volume distribution of the generated synthetic images. Next, the performance of the synthetic data was tested using three nnU-Net segmentation models trained on real, augmented, and entirely synthetic data, respectively. The resulting models were tested on a completely independent hold-out dataset of patients with enlarged ventricles, with manual delineation of the ventricles used as ground truth. The model trained on real data showed a mean absolute error (MAE) of 9.09 ± 12.18 mL in predicted ventricular volume, while the models trained on synthetic and augmented data showed MAEs of 7.52 ± 4.81 mL and 6.23 ± 4.33 mL, respectively. Both the synthetic and augmented models also outperformed the state-of-the-art model SynthSeg, which, due to limited performance in cases of large ventricular volumes, showed an MAE of 7.73 ± 12.12 mL with a standard deviation roughly three times higher. The model trained on augmented data showed the highest Dice score of 0.892 ± 0.05, slightly outperforming SynthSeg and on par with the model trained on real data. The synthetic model performed similarly to SynthSeg. In summary, we provide evidence that guided synthesis of labeled brain MRI data using LDMs improves the segmentation of enlarged ventricles and outperforms existing state-of-the-art segmentation models.

摘要:在医学背景中,深度学习模型面临着数据稀缺性、不均匀性和隐私问题等挑战。本研究专注于使用合成数据改进脑部 MRI 图像中的心室分割。我们采用了两个潜在扩散模型 (LDM):一个使用 10,000 个蒙版训练的蒙版生成器,以及一个使用 6,881 次扫描进行优化的相应 SPADE 图像生成器,以创建基于 3D 脑部蒙版的 MRI。对蒙版生成器进行心室体积调节,并结合无分类器指导,能够控制生成合成图像的心室体积分布。接下来,使用分别训练于真实、增强和完全合成数据上的三个 nnU-Net 分割模型测试了合成数据的性能。将训练所得的模型在完全独立的、具有扩大心室的患者的保留数据集上进行测试,并使用心室的手动描绘作为真实情况。在真实数据上训练的模型在预测的心室体积中显示出 9.09 ± 12.18 mL 的平均绝对误差 (MAE),而在合成和增强数据上训练的模型显示出 7.52 ± 4.81 mL 和 6.23 ± 4.33 mL 的 MAE。合成模型和增强模型的性能均优于最先进的模型 SynthSeg,后者由于在大心室体积的情况下性能有限,显示出 7.73 ± 12.12 mL 的 MAE,标准差高出 3 倍。在增强数据上训练的模型显示出最高的 Dice 得分 0.892 ± 0.05,略优于 SynthSeg,并且与在真实数据上训练的模型相当。合成模型的性能与 SynthSeg 类似。总之,我们提供了证据表明,使用 LDM 对标记的脑部 MRI 数据进行引导合成可以改善扩大心室的分割,并且优于现有的最先进的分割模型。
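The two metrics reported above can be sketched as follows. This is a generic illustration under simplifying assumptions (flattened binary masks, volumes given directly in mL), not the evaluation code used in the study; the function names are hypothetical.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient between two flattened binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if size == 0 else 2 * intersection / size

def mean_absolute_error(pred_volumes, true_volumes):
    """Mean absolute difference between predicted and reference volumes (same unit)."""
    errors = [abs(p - t) for p, t in zip(pred_volumes, true_volumes)]
    return sum(errors) / len(errors)

dice = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
mae = mean_absolute_error([10.0, 20.0], [12.0, 17.0])
print(dice, mae)
```

Dice rewards voxel-wise overlap while the volume MAE only compares totals, so a model can score well on one and poorly on the other; reporting both, as the abstract does, covers the two failure modes.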

Causal reasoning in difference graphs

2411.01292v1 by Charles K. Assaad

In epidemiology, understanding causal mechanisms across different populations is essential for designing effective public health interventions. Recently, difference graphs have been introduced as a tool to visually represent causal variations between two distinct populations. While there has been progress in inferring these graphs from data through causal discovery methods, there remains a gap in systematically leveraging their potential to enhance causal reasoning. This paper addresses that gap by establishing conditions for identifying causal changes and effects using difference graphs and observational data. It specifically focuses on identifying total causal changes and total effects in a nonparametric framework, as well as direct causal changes and direct effects in a linear context. In doing so, it provides a novel approach to causal reasoning that holds potential for various public health applications.

摘要:在流行病學中,了解不同人群之間的因果機制對於設計有效的公共衛生干預措施至關重要。最近,差異圖表已被引入作為一種工具,用於直觀地表示兩個不同人群之間的因果變化。儘管通過因果發現方法從數據中推斷這些圖表方面取得了進展,但在系統性地利用其增強因果推理的潛力方面仍然存在差距。本文通過建立使用差異圖表和觀察數據識別因果變化和因果效應的條件來解決這一差距。它特別側重於在非參數框架中識別總因果變化和總效應,以及在線性背景中識別直接因果變化和直接效應。這樣一來,它提供了一種因果推理的新方法,對各種公共衛生應用具有潛力。

Designing a Robust Radiology Report Generation System

2411.01153v1 by Sonit Singh

Recent advances in deep learning have enabled researchers to explore tasks at the intersection of computer vision and natural language processing, such as image captioning, visual question answering, visual dialogue, and visual language navigation. Taking inspiration from image captioning, the task of radiology report generation aims at automatically generating radiology reports by having a comprehensive understanding of medical images. However, automatically generating radiology reports from medical images is a challenging task due to the complexity, diversity, and nature of medical images. In this paper, we outline the design of a robust radiology report generation system by integrating different modules and highlighting best practices, drawing upon lessons from our past work and from relevant studies in the literature. We also discuss the impact of integrating different components to form a single integrated system. We believe that these best practices, when implemented, could improve automatic radiology report generation, augment radiologists in decision making, and expedite the diagnostic workflow, in turn improving healthcare and saving human lives.

摘要:最近深度學習的進展使研究人員能夠探索電腦視覺和自然語言處理交集中的任務,例如影像標題、視覺問答、視覺對話和視覺語言導航。受影像標題的啟發,放射科報告生成的任務旨在透過全面了解醫學影像自動生成放射科報告。然而,由於醫學影像的複雜性、多樣性和性質,自動從醫學影像生成放射科報告是一項具有挑戰性的任務。在本文中,我們透過整合不同的模組並強調最佳實務,概述了健全的放射科報告生成系統的設計,這些實務汲取自我們過去的工作以及文獻中的相關研究。我們也討論了整合不同組件以形成單一整合系統的影響。我們相信,這些最佳實務在實施後,可以改善自動放射科報告生成,增強放射科醫師在決策制定中的能力,並加快診斷工作流程,進而改善醫療保健並拯救人命。

LEARNER: Learning Granular Labels from Coarse Labels using Contrastive Learning

2411.01144v1 by Gautam Gare, Jana Armouti, Nikhil Madaan, Rohan Panda, Tom Fox, Laura Hutchins, Amita Krishnan, Ricardo Rodriguez, Bennett DeBoisblanc, Deva Ramanan, John Galeotti

A crucial question in active patient care is determining if a treatment is having the desired effect, especially when changes are subtle over short periods. We propose using inter-patient data to train models that can learn to detect these fine-grained changes within a single patient. Specifically, can a model trained on multi-patient scans predict subtle changes in an individual patient's scans? Recent years have seen increasing use of deep learning (DL) in predicting diseases using biomedical imaging, such as predicting COVID-19 severity using lung ultrasound (LUS) data. While extensive literature exists on successful applications of DL systems when well-annotated large-scale datasets are available, it is quite difficult to collect a large corpus of personalized datasets for an individual. In this work, we investigate the ability of recent computer vision models to learn fine-grained differences while being trained on data showing larger differences. We evaluate on an in-house LUS dataset and a public ADNI brain MRI dataset. We find that models pre-trained on clips from multiple patients can better predict fine-grained differences in scans from a single patient by employing contrastive learning.

摘要:在主動患者照護中,一個關鍵問題是確定治療是否產生預期的效果,特別是在短時間內變化細微的情況下。我們提議使用患者間數據來訓練模型,以便學習偵測單一患者內這些細微的變化。具體來說,在多位患者掃描中訓練的模型是否可以預測個別患者掃描中的細微變化?近年來,深度學習 (DL) 在使用生物醫學影像預測疾病方面應用日益廣泛,例如使用肺部超音波 (LUS) 數據預測 COVID-19 的嚴重程度。儘管有大量文獻記載了在有標註的大規模數據集可用時 DL 系統的成功應用,但要為個人收集大量個人化數據集相當困難。在這項工作中,我們探討了近期電腦視覺模型在針對顯示較大差異的數據進行訓練時,學習細微差異的能力。我們在內部 LUS 數據集和公開的 ADNI 大腦 MRI 數據集上進行評估。我們發現,透過使用對比學習,在多位患者的片段上預先訓練的模型可以更好地預測單一患者掃描中的細微差異。

Artificial Intelligence for Microbiology and Microbiome Research

2411.01098v1 by Xu-Wen Wang, Tong Wang, Yang-Yu Liu

Advancements in artificial intelligence (AI) have transformed many scientific fields, with microbiology and microbiome research now experiencing significant breakthroughs through machine learning and deep learning applications. This review provides a comprehensive overview of AI-driven approaches tailored for microbiology and microbiome studies, emphasizing both technical advancements and biological insights. We begin with an introduction to foundational AI techniques, including primary machine learning paradigms and various deep learning architectures, and offer guidance on choosing between machine learning and deep learning methods based on specific research goals. The primary section on application scenarios spans diverse research areas, from taxonomic profiling, functional annotation & prediction, microbe-X interactions, microbial ecology, metabolic modeling, precision nutrition, clinical microbiology, to prevention & therapeutics. Finally, we discuss challenges unique to this field, including the balance between interpretability and complexity, the "small n, large p" problem, and the critical need for standardized benchmarking datasets to validate and compare models. Together, this review underscores AI's transformative role in microbiology and microbiome research, paving the way for innovative methodologies and applications that enhance our understanding of microbial life and its impact on our planet and our health.

摘要:人工智慧 (AI) 的進步已轉變許多科學領域,而微生物學和微生物組研究現在正透過機器學習和深度學習應用體驗到顯著的突破。本篇評論提供 AI 驅動方法的全面概述,這些方法專為微生物學和微生物組研究量身打造,強調技術進步和生物見解。我們從基礎 AI 技術的介紹開始,包括主要的機器學習範例和各種深度學習架構,並提供根據具體研究目標在機器學習和深度學習方法之間進行選擇的指導。應用場景的主要部分涵蓋了從分類分析、功能註解和預測、微生物 X 相互作用、微生物生態、代謝建模、精準營養、臨床微生物學到預防和治療等多個研究領域。最後,我們討論了該領域獨有的挑戰,包括可解釋性和複雜性之間的平衡、「小 n,大 p」問題,以及驗證和比較模型的標準化基準數據集的關鍵需求。本篇評論共同強調了 AI 在微生物學和微生物組研究中的轉型作用,為創新方法和應用鋪平道路,這些方法和應用增強了我們對微生物生命及其對我們星球和我們健康的影響的理解。

Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities

2411.01053v1 by Adriel Saporta, Aahlad Puli, Mark Goldstein, Rajesh Ranganath

Contrastive learning methods, such as CLIP, leverage naturally paired data (for example, images and their corresponding text captions) to learn general representations that transfer efficiently to downstream tasks. While such approaches are generally applied to two modalities, domains such as robotics, healthcare, and video need to support many types of data at once. We show that the pairwise application of CLIP fails to capture joint information between modalities, thereby limiting the quality of the learned representations. To address this issue, we present Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. To develop Symile's objective, we derive a lower bound on total correlation, and show that Symile representations for any set of modalities form a sufficient statistic for predicting the remaining modalities. Symile outperforms pairwise CLIP, even with modalities missing in the data, on cross-modal classification and retrieval across several experiments, including on an original multilingual dataset of 33M image, text, and audio samples and a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements. All datasets and code used in this work are publicly available at https://github.com/rajesh-lab/symile.

摘要:對比學習方法,例如 CLIP,利用自然配對的資料,例如影像及其對應的文字標題,來學習一般化表徵,並有效率地轉移到下游任務。雖然此類方法通常應用於兩種形式,但機器人技術、醫療保健和視訊等領域需要一次支援多種類型的資料。我們顯示,CLIP 的成對應用無法擷取形式間的聯合資訊,因此限制了學習表徵的品質。為了解決此問題,我們提出 Symile,這是一種簡單的對比學習方法,可以擷取任意數量的形式之間的高階資訊。Symile 提供了一個靈活且與架構無關的目標,用於學習特定於形式的表徵。為開發 Symile 的目標,我們推導出總相關性的下界,並顯示任何形式集合的 Symile 表徵形成一個充分的統計量,用於預測其餘形式。Symile 優於成對 CLIP,即使資料中缺少形式,也能在跨形式分類和檢索中表現出色,包括在一個包含 3300 萬張影像、文字和音訊樣本的原始多語言資料集和一個包含胸部 X 光、心電圖和實驗室測量的臨床資料集上進行的多次實驗。本研究中使用所有資料集和程式碼皆公開於 https://github.com/rajesh-lab/symile。
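To make the pairwise baseline concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss between two modalities. It illustrates only the two-modality objective that the abstract argues is insufficient; it is not Symile's objective, which the abstract does not spell out. The function names, the temperature value, and the toy embeddings are assumptions.

```python
import math

def info_nce(anchors, candidates, temperature=0.1):
    """Average cross-entropy of selecting candidates[i] as the match for anchors[i]."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_norm - logits[i]
    return loss / len(anchors)

def clip_style_loss(modality_a, modality_b, temperature=0.1):
    """Symmetric pairwise loss: match a to b and b to a."""
    return 0.5 * (info_nce(modality_a, modality_b, temperature)
                  + info_nce(modality_b, modality_a, temperature))

image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_shuffled = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = clip_style_loss(image_emb, text_aligned)
shuffled_loss = clip_style_loss(image_emb, text_shuffled)
print(aligned_loss, shuffled_loss)
```

With three or more modalities, applying this loss to each pair scores only pairwise agreement, which is the limitation the abstract attributes to pairwise CLIP.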

Cross-Fundus Transformer for Multi-modal Diabetic Retinopathy Grading with Cataract

2411.00726v1 by Fan Xiao, Junlin Hou, Ruiwei Zhao, Rui Feng, Haidong Zou, Lina Lu, Yi Xu, Juzhao Zhang

Diabetic retinopathy (DR) is a leading cause of blindness worldwide and a common complication of diabetes. As two different imaging tools for DR grading, color fundus photography (CFP) and infrared fundus photography (IFP) are highly correlated and complementary in clinical applications. To the best of our knowledge, this is the first study that explores a novel multi-modal deep learning framework to fuse the information from CFP and IFP towards more accurate DR grading. Specifically, we construct a dual-stream architecture, Cross-Fundus Transformer (CFT), to fuse the ViT-based features of two fundus image modalities. In particular, a meticulously engineered Cross-Fundus Attention (CFA) module is introduced to capture the correspondence between CFP and IFP images. Moreover, we adopt both single-modality and multi-modality supervision to maximize the overall performance for DR grading. Extensive experiments on a clinical dataset consisting of 1,713 pairs of multi-modal fundus images demonstrate the superiority of our proposed method. Our code will be released for public access.

摘要:糖尿病視網膜病變 (DR) 是全球失明的主要原因,也是糖尿病的常見併發症。作為 DR 分級的兩種不同的影像工具,彩色眼底攝影 (CFP) 和紅外線眼底攝影 (IFP) 在臨床應用中高度相關且互補。據我們所知,這是第一個探討創新的多模式深度學習框架,以融合 CFP 和 IFP 的資訊,以進行更準確的 DR 分級。具體來說,我們構建了一個雙流架構 Cross-Fundus Transformer (CFT),以融合兩種眼底影像模式的基於 ViT 的特徵。特別是,引入了精心設計的 Cross-Fundus Attention (CFA) 模組,以捕捉 CFP 和 IFP 影像之間的對應關係。此外,我們採用單一模式和多模式監督,以最大化 DR 分級的整體效能。在由 1,713 對多模式眼底影像組成的臨床資料集上進行的廣泛實驗證明了我們提出的方法的優越性。我們的程式碼將會公開發布。

CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis

2411.00696v1 by Fuying Wang, Feng Wu, Yihan Tang, Lequan Yu

Integrating multimodal Electronic Health Records (EHR) data, such as numerical time series and free-text clinical reports, has great potential in predicting clinical outcomes. However, prior work has primarily focused on capturing temporal interactions within individual samples and fusing multimodal information, overlooking critical temporal patterns across patients. These patterns, such as trends in vital signs like abnormal heart rate or blood pressure, can indicate deteriorating health or an impending critical event. Similarly, clinical notes often contain textual descriptions that reflect these patterns. Identifying corresponding temporal patterns across different modalities is crucial for improving the accuracy of clinical outcome predictions, yet it remains a challenging task. To address this gap, we introduce a Cross-Modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations which are refined using slot attention to generate temporal semantic embeddings. To ensure rich cross-modal temporal semantics in the learned patterns, we introduce a contrastive-based TPNCE loss for cross-modal alignment, along with two reconstruction losses to retain core information of each modality. Evaluations on two clinically critical tasks, 48-hour in-hospital mortality and 24-hour phenotype classification, using the MIMIC-III database demonstrate the superiority of our method over existing approaches.

摘要:整合多模态电子健康记录 (EHR) 数据(例如数值时间序列和自由文本临床报告)在预测临床结果方面具有巨大潜力。然而,以前的工作主要集中在捕捉单个样本中的时间交互并融合多模态信息,而忽略了患者之间的关键时间模式。这些模式(例如生命体征趋势,如异常心率或血压)可能表明健康状况恶化或即将发生的危重事件。类似地,临床笔记通常包含反映这些模式的文本描述。识别不同模态之间相应的时间模式对于提高临床结果预测的准确性至关重要,但它仍然是一项具有挑战性的任务。为了解决这一差距,我们引入了一个跨模态时间模式发现 (CTPD) 框架,旨在从多模态 EHR 数据中有效提取有意义的跨模态时间模式。我们的方法引入了共享的初始时间模式表示,这些表示使用插槽注意力进行优化以生成时间语义嵌入。为了确保学习模式中丰富的跨模态时间语义,我们引入了基于对比的 TPNCE 损失用于跨模态对齐,以及两个重建损失以保留每个模态的核心信息。在两个临床关键任务(48 小时院内死亡率和 24 小时表型分类)上的评估,使用 MIMIC-III 数据库证明了我们方法优于现有方法。

Enhancing Osteoporosis Detection: An Explainable Multi-Modal Learning Framework with Feature Fusion and Variable Clustering

2411.00916v2 by Mehdi Hosseini Chagahi, Saeed Mohammadi Dashtaki, Niloufar Delfan, Nadia Mohammadi, Alireza Samari, Behzad Moshiri, Md. Jalil Piran, Oliver Faust

Osteoporosis is a common condition that increases fracture risk, especially in older adults. Early diagnosis is vital for preventing fractures, reducing treatment costs, and preserving mobility. However, healthcare providers face challenges like limited labeled data and difficulties in processing medical images. This study presents a novel multi-modal learning framework that integrates clinical and imaging data to improve diagnostic accuracy and model interpretability. The model utilizes three pre-trained networks (VGG19, InceptionV3, and ResNet50) to extract deep features from X-ray images. These features are transformed using PCA to reduce dimensionality and focus on the most relevant components. A clustering-based selection process identifies the most representative components, which are then combined with preprocessed clinical data and processed through a fully connected network (FCN) for final classification. A feature importance plot highlights key variables, showing that Medical History, BMI, and Height were the main contributors, emphasizing the significance of patient-specific data. While imaging features were valuable, they had lower importance, indicating that clinical data are crucial for accurate predictions. This framework promotes precise and interpretable predictions, enhancing transparency and building trust in AI-driven diagnoses for clinical integration.

摘要:骨質疏鬆症是一種常見的疾病,會增加骨折的風險,特別是老年人。早期診斷對於預防骨折、降低治療成本和維持行動能力至關重要。然而,醫療保健提供者面臨著標記數據有限和處理醫學影像困難等挑戰。本研究提出了一個新穎的多模式學習框架,該框架整合了臨床和影像數據,以提高診斷準確性和模型可解釋性。該模型利用三個預訓練的網路,VGG19、InceptionV3 和 ResNet50,從 X 射線影像中提取深度特徵。這些特徵使用 PCA 轉換以降低維度並專注於最相關的組成部分。基於聚類的選擇過程識別出最具代表性的組成部分,然後將這些組成部分與預處理的臨床數據結合,並通過全連接網路 (FCN) 進行最終分類。特徵重要性圖突出了關鍵變數,表明病史、BMI 和身高是主要貢獻因素,強調了患者特定數據的重要性。雖然影像特徵很有價值,但它們的重要性較低,這表明臨床數據對於準確預測至關重要。此框架促進了準確且可解釋的預測,提高了透明度,並建立了對 AI 驅動診斷在臨床整合中的信任。

Deep learning-based auto-contouring of organs/structures-at-risk for pediatric upper abdominal radiotherapy

2411.00594v1 by Mianyong Ding, Matteo Maspero, Annemieke S Littooij, Martine van Grotel, Raquel Davila Fajardo, Max M van Noesel, Marry M van den Heuvel-Eibrink, Geert O Janssens

Purpose: This study aimed to develop a computed tomography (CT)-based multi-organ segmentation model for delineating organs-at-risk (OARs) in pediatric upper abdominal tumors and evaluate its robustness across multiple datasets. Materials and methods: In-house postoperative CTs from pediatric patients with renal tumors and neuroblastoma (n=189) and a public dataset (n=189) with CTs covering thoracoabdominal regions were used. Seventeen OARs were delineated: nine by clinicians (Type 1) and eight using TotalSegmentator (Type 2). Auto-segmentation models were trained using in-house data (Model-PMC-UMCU) and a combined dataset of public data (Model-Combined). Performance was assessed with Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD95), and mean surface distance (MSD). Two clinicians rated clinical acceptability on a 5-point Likert scale across 15 patient contours. Model robustness was evaluated against sex, age, intravenous contrast, and tumor type. Results: Model-PMC-UMCU achieved mean DSC values above 0.95 for five of nine OARs, while spleen and heart ranged between 0.90 and 0.95. The stomach-bowel and pancreas exhibited DSC values below 0.90. Model-Combined demonstrated improved robustness across both datasets. Clinical evaluation revealed good usability, with both clinicians rating six of nine Type 1 OARs above four and six of eight Type 2 OARs above three. Significant performance differences were only found across age groups in both datasets, specifically in the left lung and pancreas. The 0-2 age group showed the lowest performance. Conclusion: A multi-organ segmentation model was developed, showcasing enhanced robustness when trained on combined datasets. This model is suitable for various OARs and can be applied to multiple datasets in clinical settings.

摘要:目的:本研究旨在开发一个基于计算机断层扫描 (CT) 的多器官分割模型,用于描绘小儿上腹部肿瘤中的危险器官 (OAR),并评估其在多个数据集中的稳健性。材料和方法:使用小儿肾肿瘤和神经母细胞瘤患者 (n=189) 的院内术后 CT 以及包含胸腹区域 CT 的公共数据集 (n=189)。描绘了 17 个 OAR:9 个由临床医生描绘 (类型 1),8 个使用 TotalSegmentator 描绘 (类型 2)。使用院内 (Model-PMC-UMCU) 和公共数据组合数据集 (Model-Combined) 训练自动分割模型。使用骰子相似性系数 (DSC)、95% 霍斯多夫距离 (HD95) 和平均表面距离 (MSD) 评估性能。两位临床医生使用 5 点李克特量表对 15 个患者轮廓的临床可接受性进行评级。针对性别、年龄、静脉对比和肿瘤类型评估模型的稳健性。结果:Model-PMC-UMCU 对九个 OAR 中的五个 OAR 的平均 DSC 值达到 0.95 以上,而脾脏和心脏在 0.90 到 0.95 之间。胃肠和胰腺的 DSC 值低于 0.90。Model-Combined 在两个数据集上都表现出改进的稳健性。临床评估显示出良好的可用性,两位临床医生对九个类型 1 OAR 中的六个评分均高于四分,对八个类型 2 OAR 中的六个评分均高于三分。仅在两个数据集的年龄组中发现了显着的性能差异,特别是在左肺和胰腺中。0-2 岁年龄组表现最差。结论:开发了一个多器官分割模型,在合并数据集上训练时显示出增强的稳健性。该模型适用于各种 OAR,并且可以在临床环境中应用于多个数据集。

Enhancing the Traditional Chinese Medicine Capabilities of Large Language Model through Reinforcement Learning from AI Feedback

2411.00897v1 by Song Yu, Xiaofei Xu, Fangfei Xu, Li Li

Although large language models perform well in understanding and responding to user intent, their performance in specialized domains such as Traditional Chinese Medicine (TCM) remains limited due to lack of expertise. In addition, high-quality data related to TCM is scarce and difficult to obtain, making large language models ineffective in handling TCM tasks. In this work, we propose a framework to improve the performance of large language models for TCM tasks using only a small amount of data. First, we use medical case data for supervised fine-tuning of the large model, making it initially capable of performing TCM tasks. Subsequently, we further optimize the model's performance using reinforcement learning from AI feedback (RLAIF) to align it with the preference data. The ablation study also demonstrated the performance gain is attributed to both supervised fine-tuning and the direct policy optimization. The experimental results show that the model trained with a small amount of data achieves a significant performance improvement on a representative TCM task.

摘要:儘管大型語言模型在理解和回應使用者意圖方面表現良好,但由於缺乏專業知識,它們在傳統中醫 (TCM) 等專業領域的表現仍然有限。此外,與中醫相關的高品質資料稀少且難以取得,這使得大型語言模型在處理中醫任務時效果不彰。在這項工作中,我們提出一個架構,使用少量資料來改善大型語言模型在中醫任務中的表現。首先,我們使用醫療案例資料對大型模型進行監督微調,使其最初具備執行中醫任務的能力。隨後,我們進一步使用人工智慧回饋的強化學習 (RLAIF) 來最佳化模型的表現,使其與偏好資料保持一致。消融研究也證明,表現提升歸功於監督微調和直接策略最佳化。實驗結果顯示,使用少量資料訓練的模型在代表性的中醫任務上取得顯著的表現提升。

StepCountJITAI: simulation environment for RL with application to physical activity adaptive intervention

2411.00336v1 by Karine Karine, Benjamin M. Marlin

The use of reinforcement learning (RL) to learn policies for just-in-time adaptive interventions (JITAIs) is of significant interest in many behavioral intervention domains including improving levels of physical activity. In a messaging-based physical activity JITAI, a mobile health app is typically used to send messages to a participant to encourage engagement in physical activity. In this setting, RL methods can be used to learn what intervention options to provide to a participant in different contexts. However, deploying RL methods in real physical activity adaptive interventions comes with challenges: the cost and time constraints of real intervention studies result in limited data to learn adaptive intervention policies. Further, commonly used RL simulation environments have dynamics that are of limited relevance to physical activity adaptive interventions and thus shed little light on what RL methods may be optimal for this challenging application domain. In this paper, we introduce StepCountJITAI, an RL environment designed to foster research on RL methods that address the significant challenges of policy learning for adaptive behavioral interventions.

摘要:利用強化學習 (RL) 來學習即時適應性介入 (JITAI) 的策略,在許多行為介入領域中備受關注,包括提升體能活動的層級。在基於訊息的體能活動 JITAI 中,行動健康應用程式通常用於向參與者傳送訊息,以鼓勵參與體能活動。在此設定中,RL 方法可被用於學習在不同情境下提供給參與者的介入選項。然而,在實際體能活動適應性介入中部署 RL 方法會遇到挑戰:實際介入研究的成本和時間限制,導致可供學習適應性介入策略的資料有限。此外,常用的 RL 模擬環境具有與體能活動適應性介入相關性有限的動態,因此難以了解哪些 RL 方法可能最適合這個具挑戰性的應用領域。在本文中,我們介紹 StepCountJITAI,這是一個 RL 環境,旨在促進對 RL 方法的研究,以應對適應性行為介入策略學習的重大挑戰。

Strongly Topology-preserving GNNs for Brain Graph Super-resolution

2411.02525v1 by Pragya Singh, Islem Rekik

Brain graph super-resolution (SR) is an under-explored yet highly relevant task in network neuroscience. It circumvents the need for costly and time-consuming medical imaging data collection, preparation, and processing. Current SR methods leverage graph neural networks (GNNs) thanks to their ability to natively handle graph-structured datasets. However, most GNNs perform node feature learning, which presents two significant limitations: (1) they require computationally expensive methods to learn complex node features capable of inferring connectivity strength or edge features, which do not scale to larger graphs; and (2) computations in the node space fail to adequately capture higher-order brain topologies such as cliques and hubs. At the same time, numerous studies have shown that brain graph topology is crucial in identifying the onset and presence of various neurodegenerative disorders such as Alzheimer's and Parkinson's disease. Motivated by these challenges and applications, we propose our STP-GSR framework. It is the first graph SR architecture to perform representation learning in higher-order topological space. Specifically, using the primal-dual graph formulation from graph theory, we develop an efficient mapping from the edge space of our low-resolution (LR) brain graphs to the node space of a high-resolution (HR) dual graph. This approach ensures that node-level computations on this dual graph correspond naturally to edge-level learning on our HR brain graphs, thereby enforcing strong topological consistency within our framework. Additionally, our framework is GNN layer agnostic and can easily learn from smaller, scalable GNNs, reducing computational requirements. We comprehensively benchmark our framework across seven key topological measures and observe that it significantly outperforms the previous state-of-the-art methods and baselines.

Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes

2411.02523v1 by Balu Bhasuran, Qiao Jin, Yuzhang Xie, Carl Yang, Karim Hanna, Jennifer Costa, Cindy Shavor, Zhiyong Lu, Zhe He

Differential diagnosis is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes were created from 50 case reports from PubMed Central, incorporating patient demographics, symptoms, and lab results. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low. Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by LLMs for differential diagnosis.
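To make the Top 1 / Top 5 / Top 10 figures concrete, here is a minimal sketch of a top-k accuracy computation. It uses a case-insensitive exact string match only; the lenient matching and the GPT-4, knowledge-graph, and clinician evaluation described in the abstract are not reproduced. The function name and the three example cases are hypothetical and are not data from the study.

```python
def top_k_accuracy(ranked_ddx, final_diagnoses, k):
    """Fraction of cases whose final diagnosis appears in the top-k DDx list.

    Matching is a case-insensitive exact string match, which is stricter
    than the lenient evaluation described in the abstract.
    """
    hits = 0
    for ddx, truth in zip(ranked_ddx, final_diagnoses):
        top_k = [d.strip().lower() for d in ddx[:k]]
        if truth.strip().lower() in top_k:
            hits += 1
    return hits / len(final_diagnoses)


# Hypothetical model outputs for three vignettes (illustration only).
ranked_ddx = [
    ["Sarcoidosis", "Tuberculosis", "Lymphoma"],
    ["Systemic lupus erythematosus", "Gout"],
    ["Asthma", "COPD", "Heart failure"],
]
final_diagnoses = ["tuberculosis", "systemic lupus erythematosus", "pneumonia"]

top1 = top_k_accuracy(ranked_ddx, final_diagnoses, k=1)
top3 = top_k_accuracy(ranked_ddx, final_diagnoses, k=3)
print(f"Top 1: {top1:.2f}, Top 3: {top3:.2f}")
```

Running the same function on outputs generated with and without lab data would give the kind of paired comparison the study reports.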

Deep Learning Predicts Mammographic Breast Density in Clinical Breast Ultrasound Images

2411.00891v2 by Arianna Bunnell, Dustin Valdez, Thomas K. Wolfgruber, Brandon Quon, Kailee Hung, Brenda Y. Hernandez, Todd B. Seto, Jeffrey Killeen, Marshall Miyoshi, Peter Sadowski, John A. Shepherd

Background: Breast density, as derived from mammographic images and defined by the American College of Radiology's Breast Imaging Reporting and Data System (BI-RADS), is one of the strongest risk factors for breast cancer. Breast ultrasound (BUS) is an alternative breast cancer screening modality, particularly useful for early detection in low-resource, rural contexts. The purpose of this study was to explore an artificial intelligence (AI) model to predict BI-RADS mammographic breast density category from clinical, handheld BUS imaging. Methods: All data are sourced from the Hawaii and Pacific Islands Mammography Registry. We compared deep learning methods from BUS imaging, as well as machine learning models from image statistics alone. The use of AI-derived BUS density as a risk factor for breast cancer was then compared to clinical BI-RADS breast density while adjusting for age. The BUS data were split by individual into 70/20/10% groups for training, validation, and testing. Results: 405,120 clinical BUS images from 14,066 women were selected for inclusion in this study, resulting in 9,846 women for training (302,574 images), 2,813 for validation (11,223 images), and 1,406 for testing (4,042 images). On the held-out testing set, the strongest AI model achieves AUROC 0.854 predicting BI-RADS mammographic breast density from BUS imaging and outperforms all shallow machine learning methods based on image statistics. In cancer risk prediction, age-adjusted AI BUS breast density predicted 5-year breast cancer risk with 0.633 AUROC, as compared to 0.637 AUROC from age-adjusted clinical breast density. Conclusions: BI-RADS mammographic breast density can be estimated from BUS imaging with high accuracy using a deep learning model. Furthermore, we demonstrate that AI-derived BUS breast density is predictive of 5-year breast cancer risk in our population.
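The abstract states that the data were split by individual rather than by image, so that all images from one woman land in the same partition. A minimal sketch of such a split is shown below. The function name, the seeded shuffle, and the toy image identifiers are assumptions for illustration; the authors' actual procedure is not described in the abstract.

```python
import random


def split_by_individual(image_to_person, fractions=(0.7, 0.2), seed=0):
    """Assign whole individuals to train/val/test, then route their images.

    fractions gives the train and validation shares; the rest is test.
    """
    people = sorted(set(image_to_person.values()))
    random.Random(seed).shuffle(people)  # seeded for reproducibility

    n_train = round(fractions[0] * len(people))
    n_val = round(fractions[1] * len(people))
    groups = {
        "train": set(people[:n_train]),
        "val": set(people[n_train:n_train + n_val]),
        "test": set(people[n_train + n_val:]),
    }

    # Every image follows the partition of the person it belongs to.
    assignment = {p: name for name, members in groups.items() for p in members}
    images = {name: [] for name in groups}
    for image, person in image_to_person.items():
        images[assignment[person]].append(image)
    return groups, images


# Hypothetical toy registry: 10 women with 3 images each.
image_to_person = {f"w{i}_img{j}": f"w{i}" for i in range(10) for j in range(3)}
groups, images = split_by_individual(image_to_person)
print({name: len(members) for name, members in groups.items()})
```

Splitting at the level of individuals avoids leakage, where near-identical images of the same person appear in both the training and the test set.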

Monitoring fairness in machine learning models that predict patient mortality in the ICU

2411.00190v2 by Tempest A. van Schaik, Xinggang Liu, Louis Atallah, Omar Badawi

This work proposes a fairness monitoring approach for machine learning models that predict patient mortality in the ICU. We investigate how well models perform for patient groups that differ in race, sex, and medical diagnosis. We also investigate documentation bias in clinical measurement, showing how fairness analysis provides a more detailed and insightful comparison of model performance than traditional accuracy metrics alone.
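The core idea of comparing model performance across patient groups can be sketched in a few lines. The example below computes accuracy per group and the gap between the best and worst group. It is an illustrative assumption, not the monitoring approach from the paper, which the abstract does not detail; the group labels and predictions are synthetic.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred) with binary labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Synthetic mortality predictions for two hypothetical patient groups.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 1),
]

per_group = accuracy_by_group(records)
overall = sum(int(t == p) for _, t, p in records) / len(records)
gap = max(per_group.values()) - min(per_group.values())
print("overall:", overall, "per group:", per_group, "gap:", gap)
```

In this toy data the overall accuracy hides a large difference between the two groups, which is the kind of effect a per-group analysis is meant to surface.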

Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy

2411.00178v1 by Panagiota Gatoula, Dimitrios E. Diamantis, Anastasios Koulaouzidis, Cristina Carretero, Stefania Chetcuti-Zammit, Pablo Cortegoso Valdivia, Begoña González-Suárez, Alessandro Mussetto, John Plevris, Alexander Robertson, Bruno Rosa, Ervin Toth, Dimitris K. Iakovidis

Sharing retrospectively acquired data is essential for both clinical research and training. Synthetic Data Generation (SDG), using Artificial Intelligence (AI) models, can overcome privacy barriers in sharing clinical data, enabling advancements in medical diagnostics. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. The paper contributes by a) presenting a protocol for the systematic evaluation of synthetic images by medical experts and b) applying it to assess TIDE-II, a novel variational autoencoder-based model for high-resolution WCE image synthesis, with a comprehensive qualitative evaluation conducted by 10 international WCE specialists, focusing on image quality, diversity, realism, and clinical decision-making. The results show that TIDE-II generates clinically relevant WCE images, helping to address data scarcity and enhance diagnostic tools. The proposed protocol serves as a reference for future research on medical image-generation techniques.

Beyond Label Attention: Transparency in Language Models for Automated Medical Coding via Dictionary Learning

2411.00173v1 by John Wu, David Wu, Jimeng Sun

Medical coding, the translation of unstructured clinical text into standardized medical codes, is a crucial but time-consuming healthcare practice. Though large language models (LLM) could automate the coding process and improve the efficiency of such tasks, interpretability remains paramount for maintaining patient trust. Current efforts in interpretability of medical coding applications rely heavily on label attention mechanisms, which often leads to the highlighting of extraneous tokens irrelevant to the ICD code. To facilitate accurate interpretability in medical language models, this paper leverages dictionary learning that can efficiently extract sparsely activated representations from dense language model embeddings in superposition. Compared with common label attention mechanisms, our model goes beyond token-level representations by building an interpretable dictionary which enhances the mechanistic-based explanations for each ICD code prediction, even when the highlighted tokens are medically irrelevant. We show that dictionary features can steer model behavior, elucidate the hidden meanings of upwards of 90% of medically irrelevant tokens, and are human interpretable.

2410.24032v1 by Yingzhe Peng, Xiaoting Qin, Zhiyang Zhang, Jue Zhang, Qingwei Lin, Xu Yang, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

The rise of large language models (LLMs) has revolutionized user interactions with knowledge-based systems, enabling chatbots to synthesize vast amounts of information and assist with complex, exploratory tasks. However, LLM-based chatbots often struggle to provide personalized support, particularly when users start with vague queries or lack sufficient contextual information. This paper introduces the Collaborative Assistant for Personalized Exploration (CARE), a system designed to enhance personalization in exploratory tasks by combining a multi-agent LLM framework with a structured user interface. CARE's interface consists of a Chat Panel, Solution Panel, and Needs Panel, enabling iterative query refinement and dynamic solution generation. The multi-agent framework collaborates to identify both explicit and implicit user needs, delivering tailored, actionable solutions. In a within-subject user study with 22 participants, CARE was consistently preferred over a baseline LLM chatbot, with users praising its ability to reduce cognitive load, inspire creativity, and provide more tailored solutions. Our findings highlight CARE's potential to transform LLM-based systems from passive information retrievers to proactive partners in personalized problem-solving and exploration.

Neural Network Verification with PyRAT

2410.23903v1 by Augustin Lemesle, Julien Lehmann, Tristan Le Gall

As AI systems are becoming more and more popular and used in various critical domains (health, transport, energy, ...), the need to provide guarantees and trust of their safety is undeniable. To this end, we present PyRAT, a tool based on abstract interpretation to verify the safety and the robustness of neural networks. In this paper, we describe the different abstractions used by PyRAT to find the reachable states of a neural network starting from its input as well as the main features of the tool to provide fast and accurate analysis of neural networks. PyRAT has already been used in several collaborations to ensure safety guarantees, with its second place at the VNN-Comp 2024 showcasing its performance.

Counterfactual MRI Data Augmentation using Conditional Denoising Diffusion Generative Models

2410.23835v1 by Pedro Morão, Joao Santinha, Yasna Forghani, Nuno Loução, Pedro Gouveia, Mario A. T. Figueiredo

Deep learning (DL) models in medical imaging face challenges in generalizability and robustness due to variations in image acquisition parameters (IAP). In this work, we introduce a novel method using conditional denoising diffusion generative models (cDDGMs) to generate counterfactual magnetic resonance (MR) images that simulate different IAP without altering patient anatomy. We demonstrate that using these counterfactual images for data augmentation can improve segmentation accuracy, particularly in out-of-distribution settings, enhancing the overall generalizability and robustness of DL models across diverse imaging conditions. Our approach shows promise in addressing domain and covariate shifts in medical imaging. The code is publicly available at https://github.com/pedromorao/Counterfactual-MRI-Data-Augmentation

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

2410.23822v1 by Jinlong He, Pengfei Li, Gang Liu, Shenjun Zhong

Multimodal Large Language Models (MLLMs) inherit the superior text understanding capabilities of LLMs and extend these capabilities to multimodal scenarios. These models achieve excellent results in the general domain of multimodal tasks. However, in the medical domain, the substantial training costs and the requirement for extensive medical data pose challenges to the development of medical MLLMs. Furthermore, because MLLMs answer in free text, tasks such as visual grounding that need to produce output in a prescribed form are difficult for them. So far, no medical MLLM has addressed medical visual grounding. For the medical visual grounding task, which involves identifying locations in medical images based on short text descriptions, we propose Parameter-efficient Fine-tuning medical multimodal large language models for Medical Visual Grounding (PFMVG). To validate the performance of the model, we evaluate it on a public benchmark dataset for medical visual grounding, where it achieves competitive results and significantly outperforms GPT-4v. Our code will be open sourced after peer review.

Improving snore detection under limited dataset through harmonic/percussive source separation and convolutional neural networks

2410.23796v1 by F. D. Gonzalez-Martinez, J. J. Carabias-Orti, F. J. Canadas-Quesada, N. Ruiz-Reyes, D. Martinez-Munoz, S. Garcia-Galan

Snoring, an acoustic biomarker commonly observed in individuals with Obstructive Sleep Apnoea Syndrome (OSAS), holds significant potential for diagnosing and monitoring this recognized clinical disorder. Irrespective of snoring types, most snoring instances exhibit identifiable harmonic patterns manifested through distinctive energy distributions over time. In this work, we propose a novel method to differentiate monaural snoring from non-snoring sounds by analyzing the harmonic content of the input sound using harmonic/percussive sound source separation (HPSS). The resulting feature, based on the harmonic spectrogram from HPSS, is employed as input data for conventional neural network architectures, aiming to enhance snoring detection performance even under a limited data learning framework. To evaluate the performance of our proposal, we studied two different scenarios: 1) using a large dataset of snoring and interfering sounds, and 2) using a reduced training set composed of around 1% of the data material. In the former scenario, the proposed HPSS-based feature provides competitive results compared to other input features from the literature. However, the key advantage of the proposed method lies in the superior performance of the harmonic spectrogram derived from HPSS in a limited data learning context. In this particular scenario, using the proposed harmonic feature significantly enhances the performance of all the studied architectures in comparison to the classical input features documented in the existing literature. This finding clearly demonstrates that incorporating harmonic content enables more reliable learning of the essential time-frequency characteristics that are prevalent in most snoring sounds, even in scenarios where the amount of training data is limited.
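Harmonic/percussive source separation is commonly implemented by median filtering a magnitude spectrogram: harmonic energy is smooth along time, while percussive energy is smooth along frequency. The pure-Python sketch below applies that idea to a tiny synthetic spectrogram. Whether this variant matches the authors' HPSS step is an assumption, since the abstract does not specify the algorithm; the filter width, the hard masking rule, and the toy data are illustrative choices.

```python
from statistics import median


def median_filter(values, width=3):
    """Sliding median with windows truncated at the sequence edges."""
    half = width // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def harmonic_part(spec, width=3):
    """Keep the bins of spec[freq][time] that look harmonic.

    A bin is kept when its time-smoothed value is at least as large as
    its frequency-smoothed value (a hard binary mask).
    """
    # Harmonic estimate: smooth every frequency row along time.
    harm = [median_filter(list(row), width) for row in spec]
    # Percussive estimate: smooth every time column along frequency.
    perc_cols = [median_filter(list(col), width) for col in zip(*spec)]
    perc = [list(row) for row in zip(*perc_cols)]
    return [
        [s if h >= p else 0.0 for s, h, p in zip(s_row, h_row, p_row)]
        for s_row, h_row, p_row in zip(spec, harm, perc)
    ]


# Toy 5x5 magnitude spectrogram: a steady tone in frequency bin 2
# (horizontal line) plus a click in time frame 2 (vertical line).
spec = [[0.0] * 5 for _ in range(5)]
for t in range(5):
    spec[2][t] = 1.0
for f in range(5):
    spec[f][2] = 1.0

harmonic = harmonic_part(spec)
for row in harmonic:
    print(row)
```

The resulting harmonic spectrogram keeps the steady tone and suppresses the click; a feature of this kind is what the abstract describes feeding to the neural network classifiers.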

The Potential of LLMs in Medical Education: Generating Questions and Answers for Qualification Exams

2410.23769v1 by Yunqi Zhu, Wen Tang, Ying Sun, Xuebing Yang

Recent research on large language models (LLMs) has primarily focused on their adaptation and application in specialized domains. The application of LLMs in the medical field is mainly concentrated on tasks such as the automation of medical report generation, summarization, diagnostic reasoning, and question-and-answer interactions between doctors and patients. The challenge of becoming a good teacher is more formidable than that of becoming a good student, and this study pioneers the application of LLMs in the field of medical education. In this work, we investigate the extent to which LLMs can generate medical qualification exam questions and corresponding answers based on few-shot prompts. Utilizing a real-world Chinese dataset of elderly chronic diseases, we tasked eight widely used LLMs (ERNIE 4, ChatGLM 4, Doubao, Hunyuan, Spark 4, Qwen, Llama 3, and Mistral) with generating open-ended questions and answers based on a subset of sampled admission reports. Furthermore, we engaged medical experts to manually evaluate these open-ended questions and answers across multiple dimensions. The study found that LLMs, after using few-shot prompts, can effectively mimic real-world medical qualification exam questions, whereas there is room for improvement in the correctness, evidence-based statements, and professionalism of the generated answers. Moreover, LLMs also demonstrate a decent level of ability to correct and rectify reference answers. Given the immense potential of artificial intelligence in the medical field, the task of generating questions and answers for medical qualification exams aimed at medical students, interns and residents can be a significant focus of future research.

Artificial intelligence to improve clinical coding practice in Scandinavia: a crossover randomized controlled trial

2410.23725v1 by Taridzo Chomutare, Therese Olsen Svenning, Miguel Ángel Tejedor Hernández, Phuong Dinh Ngo, Andrius Budrionis, Kaisa Markljung, Lill Irene Hind, Torbjørn Torsvik, Karl Øyvind Mikalsen, Aleksandar Babic, Hercules Dalianis

Trial design: Crossover randomized controlled trial. Methods: An AI tool, Easy-ICD, was developed to assist clinical coders and was tested for improving both accuracy and time in a user study in Norway and Sweden. Participants were randomly assigned to two groups, and crossed over between coding complex (longer) texts versus simple (shorter) texts, while using our tool versus not using our tool. Results: Based on the Mann-Whitney U test, the median coding time difference for complex clinical text sequences was 123 seconds (P < .001, 95% CI: 81 to 164), representing a 46% reduction in median coding time when our tool is used. There was no significant time difference for simpler text sequences. For coding accuracy, the improvement we noted for both complex and simple texts was not significant. Conclusions: This study demonstrates the potential of AI to transform common tasks in clinical workflows, with ostensible positive impacts on work efficiencies for complex clinical coding tasks. Further studies within hospital workflows are required before these presumed impacts can be more clearly understood.
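The Mann-Whitney U statistic used in the results can be computed directly from its definition, as in the sketch below. The coding times are hypothetical and are not data from the trial, and the sketch stops at the U statistic and the difference in medians; it does not compute a p-value or confidence interval.

```python
from statistics import median


def mann_whitney_u(x, y):
    """U statistic of sample x versus sample y.

    Counts the pairs (x_i, y_j) with x_i > y_j; ties count as one half.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


# Hypothetical coding times in seconds (illustration only).
without_tool = [250, 270, 300, 310]
with_tool = [120, 150, 160, 280]

u_without = mann_whitney_u(without_tool, with_tool)
u_with = mann_whitney_u(with_tool, without_tool)
median_difference = median(without_tool) - median(with_tool)
print(u_without, u_with, median_difference)
```

The two U values always sum to the number of pairs, here 4 × 4 = 16. In practice the p-value would come from a statistics library, for example `scipy.stats.mannwhitneyu`.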

Enhancing Brain Tumor Classification Using TrAdaBoost and Multi-Classifier Deep Learning Approaches

2411.00875v1 by Mahin Mohammadi, Saman Jamshidi

Brain tumors pose a serious health threat due to their rapid growth and potential for metastasis. While medical imaging has advanced significantly, accurately identifying and characterizing these tumors remains a challenge. This study addresses this challenge by leveraging the innovative TrAdaBoost methodology to enhance the Brain Tumor Segmentation (BraTS2020) dataset, aiming to improve the efficiency and accuracy of brain tumor classification. Our approach combines state-of-the-art deep learning algorithms, including the Vision Transformer (ViT), Capsule Neural Network (CapsNet), and convolutional neural networks (CNNs) such as ResNet-152 and VGG16. By integrating these models within a multi-classifier framework, we harness the strengths of each approach to achieve more robust and reliable tumor classification. A novel decision template is employed to synergistically combine outputs from different algorithms, further enhancing classification accuracy. To augment the training process, we incorporate a secondary dataset, "Brain Tumor MRI Dataset," as a source domain, providing additional data for model training and improving generalization capabilities. Our findings demonstrate a high accuracy rate in classifying tumor versus non-tumor images, signifying the effectiveness of our approach in the medical imaging domain. This study highlights the potential of advanced machine learning techniques to contribute significantly to the early and accurate diagnosis of brain tumors, ultimately improving patient outcomes.

Deep Convolutional Neural Networks on Multiclass Classification of Three-Dimensional Brain Images for Parkinson's Disease Stage Prediction

2410.23649v1 by Guan-Hua Huang, Wan-Chen Lai, Tai-Been Chen, Chien-Chin Hsu, Huei-Yung Chen, Yi-Chen Wu, Li-Ren Yeh

Parkinson's disease (PD), a degenerative disorder of the central nervous system, is commonly diagnosed using functional medical imaging techniques such as single-photon emission computed tomography (SPECT). In this study, we utilized two SPECT data sets (n = 634 and n = 202) from different hospitals to develop a model capable of accurately predicting PD stages, a multiclass classification task. We used the entire three-dimensional (3D) brain images as input and experimented with various model architectures. Initially, we treated the 3D images as sequences of two-dimensional (2D) slices and fed them sequentially into 2D convolutional neural network (CNN) models pretrained on ImageNet, averaging the outputs to obtain the final predicted stage. We also applied 3D CNN models pretrained on Kinetics-400. Additionally, we incorporated an attention mechanism to account for the varying importance of different slices in the prediction process. To further enhance model efficacy and robustness, we simultaneously trained the two data sets using weight sharing, a technique known as cotraining. Our results demonstrated that 2D models pretrained on ImageNet outperformed 3D models pretrained on Kinetics-400, and models utilizing the attention mechanism outperformed both 2D and 3D models. The cotraining technique proved effective in improving model performance when the cotraining data sets were sufficiently large.
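The two ways of combining per-slice predictions described in the abstract, plain averaging and attention-weighted pooling, can be contrasted in a short sketch. The per-slice probabilities and attention scores below are hypothetical fixed numbers; in the paper the slice outputs come from pretrained CNNs and the attention weights are learned, neither of which is reproduced here.

```python
import math


def average_slices(slice_probs):
    """Average class probabilities over all 2D slices of one 3D image."""
    n = len(slice_probs)
    return [sum(col) / n for col in zip(*slice_probs)]


def attention_pool(slice_probs, scores):
    """Weight each slice by softmax(scores) before summing."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    weights = [e / sum(exps) for e in exps]
    return [
        sum(w * p for w, p in zip(weights, col)) for col in zip(*slice_probs)
    ]


# Hypothetical class probabilities for 3 slices and 3 disease stages.
slice_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]

averaged = average_slices(slice_probs)
# Hypothetical attention scores that strongly favour the first slice.
attended = attention_pool(slice_probs, scores=[10.0, 0.0, 0.0])

print("averaged stage:", averaged.index(max(averaged)))
print("attended stage:", attended.index(max(attended)))
```

With equal attention scores the pooled result reduces to the plain average; unequal scores let informative slices dominate the prediction, which is the motivation the abstract gives for the attention mechanism.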

MS-Glance: Non-semantic context vectors and the applications in supervising image reconstruction

2410.23577v1 by Ziqi Gao, Wendi Yang, Yujia Li, Lei Xing, S. Kevin Zhou

Non-semantic context information is crucial for visual recognition, as the human visual perception system first uses global statistics to process scenes rapidly before identifying specific objects. However, while semantic information is increasingly incorporated into computer vision tasks such as image reconstruction, non-semantic information, such as global spatial structures, is often overlooked. To bridge the gap, we propose a biologically informed non-semantic context descriptor, MS-Glance, along with the Glance Index Measure for comparing two images. A Global Glance vector is formulated by randomly retrieving pixels based on a perception-driven rule from an image to form a vector representing non-semantic global context, while a local Glance vector is a flattened local image window, mimicking a zoom-in observation. The Glance Index is defined as the inner product of two standardized sets of Glance vectors. We evaluate the effectiveness of incorporating Glance supervision in two reconstruction tasks: image fitting with implicit neural representation (INR) and undersampled MRI reconstruction. Extensive experimental results show that MS-Glance outperforms existing image restoration losses across both natural and medical images. The code is available at https://github.com/Z7Gao/MSGlance.

LLM

Publish Date Title Authors Homepage Code
2024-11-12 Scaling Properties of Diffusion Models for Perceptual Tasks Rahul Ravishankar et.al. 2411.08034v1 null
2024-11-12 GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation Yushi Lan et.al. 2411.08033v1 null
2024-11-12 Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data Juanhui Li et.al. 2411.08028v1 null
2024-11-12 LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models Anoop Cherian et.al. 2411.08027v1 null
2024-11-12 Leonardo vindicated: Pythagorean trees for minimal reconstruction of the natural branching structures Dymitr Ruta et.al. 2411.08024v1 null
2024-11-12 Language Models as Causal Effect Generators Lucius E. J. Bynum et.al. 2411.08019v1 link
2024-11-12 Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings Aditya Sanghi et.al. 2411.08017v1 null
2024-11-12 Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech Eleonora Mancini et.al. 2411.08013v1 null
2024-11-12 ExpressivityArena: Can LLMs Express Information Implicitly? Joshua Tint et.al. 2411.08010v1 null
2024-11-12 Can adversarial attacks by large language models be attributed? Manuel Cebrian et.al. 2411.08003v1 null
2024-11-12 Derivational Morphology Reveals Analogical Generalization in Large Language Models Valentin Hofmann et.al. 2411.07990v1 null
2024-11-12 Gini Coefficient as a Unified Metric for Evaluating Many-versus-Many Similarity in Vector Spaces Ben Fauber et.al. 2411.07983v1 null
2024-11-12 Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization Davide Buffelli et.al. 2411.07979v1 null
2024-11-12 DINO-LG: A Task-Specific DINO Model for Coronary Calcium Scoring Mahmut S. Gokmen et.al. 2411.07976v1 null
2024-11-12 JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Yiyang Ma et.al. 2411.07975v1 null
2024-11-12 From General to Specific: Utilizing General Hallucation to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents Chuyi Kong et.al. 2411.07965v1 null
2024-11-12 Towards Low-bit Communication for Tensor Parallel LLM Inference Harry Dong et.al. 2411.07942v1 null
2024-11-12 DuoLift-GAN: Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks Zhaoxi Zhang et.al. 2411.07941v1 null
2024-11-12 Automatic dataset shift identification to support root cause analysis of AI performance drift Mélanie Roschewitz et.al. 2411.07940v1 null
2024-11-12 CryptoLLM: Unleashing the Power of Prompted LLMs for SmartQnA and Classification of Crypto Posts Aniket Deroy et.al. 2411.07917v1 null
2024-11-12 Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus Benjamin Litterer et.al. 2411.07892v1 null
2024-11-12 INTRABENCH: Interactive Radiological Benchmark Constantin Ulrich et.al. 2411.07885v1 null
2024-11-12 Diverse capability and scaling of diffusion and auto-regressive models when learning abstract rules Binxu Wang et.al. 2411.07873v1 null
2024-11-12 Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease Francesco Chiumento et.al. 2411.07871v1 null
2024-11-12 Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders Xiaofeng Zhu et.al. 2411.07870v1 null
2024-11-12 Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models Yusen Zhang et.al. 2411.07858v1 link
2024-11-12 Tucano: Advancing Neural Text Generation for Portuguese Nicholas Kluge Corrêa et.al. 2411.07854v1 null
2024-11-12 IAE: Irony-based Adversarial Examples for Sentiment Analysis Systems Xiaoyin Yi et.al. 2411.07850v1 null
2024-11-12 Ethical Concern Identification in NLP: A Corpus of ACL Anthology Ethics Statements Antonia Karamolegkou et.al. 2411.07845v1 null
2024-11-12 Chain Association-based Attacking and Shielding Natural Language Processing Systems Jiacheng Huang et.al. 2411.07843v1 null
2024-11-12 Federated Learning for Discrete Optimal Transport with Large Population under Incomplete Information Navpreet Kaur et.al. 2411.07841v1 null
2024-11-12 Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices Kilian Pfeiffer et.al. 2411.07826v1 null
2024-11-12 Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models Youan Cong et.al. 2411.07820v1 null
2024-11-12 PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring M. Jaleed Khan et.al. 2411.07796v1 link
2024-11-12 RedCode: Risky Code Execution and Generation Benchmark for Code Agents Chengquan Guo et.al. 2411.07781v1 null
2024-11-12 Likelihood as a Performance Gauge for Retrieval-Augmented Generation Tianyu Liu et.al. 2411.07773v1 link
2024-11-12 Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows Fangyu Lei et.al. 2411.07763v1 null
2024-11-12 ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization Weibo Zhao et.al. 2411.07762v1 null
2024-11-12 Navigation with QPHIL: Quantizing Planner for Hierarchical Implicit Q-Learning Alexi Canesse et.al. 2411.07760v1 null
2024-11-12 Optimizing Traffic Signal Control using High-Dimensional State Representation and Efficient Deep Reinforcement Learning Lawrence Francis et.al. 2411.07759v1 null
2024-11-12 SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model Xinyuan Qian et.al. 2411.07751v1 null
2024-11-12 Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding Zirui Shao et.al. 2411.07722v1 null
2024-11-12 Training Data for Large Language Model Yiming Ju et.al. 2411.07715v1 null
2024-11-12 New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook Meng Yang et.al. 2411.07691v1 null
2024-11-12 World Models: The Safety Perspective Zifan Zeng et.al. 2411.07690v1 null
2024-11-12 Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG Zilun Zhang et.al. 2411.07688v1 null
2024-11-12 Fast Disentangled Slim Tensor Learning for Multi-view Clustering Deng Xu et.al. 2411.07685v1 null
2024-11-12 AI enhanced diagnosis of Peyronies disease a novel approach using Computer Vision Yudara Kularathne et.al. 2411.07684v1 null
2024-11-12 Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach Tianyi Huang et.al. 2411.07656v1 link
2024-11-12 Direct Preference Optimization Using Sparse Feature-Level Constraints Qingyu Yin et.al. 2411.07618v1 null
2024-11-12 Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation Shuai Niu et.al. 2411.07611v1 null
2024-11-12 Circuit Complexity Bounds for RoPE-based Transformer Architecture Bo Chen et.al. 2411.07602v1 null
2024-11-12 Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations Rose E. Wang et.al. 2411.07598v1 link
2024-11-12 Entropy Controllable Direct Preference Optimization Motoki Omura et.al. 2411.07595v1 null
2024-11-12 A Comprehensive Survey of AI-Driven Advancements and Techniques in Automated Program Repair and Code Generation Avinash Anand et.al. 2411.07586v1 null
2024-11-12 Reinforcement Learning Framework for Quantitative Trading Alhassan S. Yasin et.al. 2411.07585v1 null
2024-11-12 Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models Dongrui Han et.al. 2411.07563v1 null
2024-11-12 EUR/USD Exchange Rate Forecasting incorporating Text Mining Based on Pre-trained Language Models and Deep Learning Methods Xiangyu Shi et.al. 2411.07560v1 null
2024-11-12 Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models Tiejin Chen et.al. 2411.07559v1 null
2024-11-12 Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection YeongHyeon Park et.al. 2411.07546v1 null
2024-11-12 Model Stealing for Any Low-Rank Language Model Allen Liu et.al. 2411.07536v1 null
2024-11-12 Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning Linyang He et.al. 2411.07533v1 null
2024-11-12 Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis Minda Li et.al. 2411.07529v1 null
2024-11-12 SecEncoder: Logs are All You Need in Security Muhammed Fatih Bulut et.al. 2411.07528v1 null
2024-11-12 Prompt-enhanced Network for Hateful Meme Classification Junxi Liu et.al. 2411.07527v1 link
2024-11-12 Fair Summarization: Bridging Quality and Diversity in Extractive Summaries Sina Bagheri Nezhad et.al. 2411.07521v1 link
2024-11-12 TIPS: Threat Actor Informed Prioritization of Applications using SecEncoder Muhammed Fatih Bulut et.al. 2411.07519v1 null
2024-11-12 LLM App Squatting and Cloning Yinglin Xie et.al. 2411.07518v1 null
2024-11-12 SparrowVQE: Visual Question Explanation for Course Content Understanding Jialu Li et.al. 2411.07516v1 link
2024-11-12 An Attack Traffic Identification Method Based on Temporal Spectrum Wenwei Xie et.al. 2411.07510v1 link
2024-11-12 FM-TS: Flow Matching for Time Series Generation Yang Hu et.al. 2411.07506v1 link
2024-11-12 LAUREL: Learned Augmented Residual Layer Gaurav Menghani et.al. 2411.07501v1 null
2024-11-12 Rapid Response: Mitigating LLM Jailbreaks with a Few Examples Alwin Peng et.al. 2411.07494v1 null
2024-11-12 Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models Daria Kryvosheieva et.al. 2411.07474v1 null
2024-11-12 IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark Kawshik Manikantan et.al. 2411.07466v1 null
2024-11-12 BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks Shubham Gandhi et.al. 2411.07464v1 null
2024-11-12 BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Anas Awadalla et.al. 2411.07461v1 null
2024-11-12 DecoPrompt: Decoding Prompts Reduces Hallucinations when Large Language Models Meet False Premises Nan Xu et.al. 2411.07457v1 link
2024-11-12 Research on fault diagnosis of nuclear power first-second circuit based on hierarchical multi-granularity classification network Jiangwen Chen et.al. 2411.07453v1 null
2024-11-12 Optimizing Data Delivery: Insights from User Preferences on Visuals, Tables, and Text Reuben Luera et.al. 2411.07451v1 null
2024-11-12 The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving Kyoungmin Kim et.al. 2411.07447v1 null
2024-11-12 Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection Cilin Yan et.al. 2411.07446v1 null
2024-11-12 Input-Based Ensemble-Learning Method for Dynamic Memory Configuration of Serverless Computing Functions Siddharth Agarwal et.al. 2411.07444v1 null
2024-11-11 Automatically Detecting Online Deceptive Patterns in Real-time Asmit Nayak et.al. 2411.07441v1 null
2024-11-11 Predicting BWR Criticality with Data-Driven Machine Learning Model Muhammad Rizki Oktavian et.al. 2411.07425v1 null
2024-11-11 Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains Katerina Korre et.al. 2411.07417v1 null
2024-11-11 Using Generative AI and Multi-Agents to Provide Automatic Feedback Shuchen Guo et.al. 2411.07407v1 null
2024-11-11 Controllable Context Sensitivity and the Knob Behind It Julian Minder et.al. 2411.07404v1 null
2024-11-11 Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews Aakash Sorathiya et.al. 2411.07398v1 null
2024-11-11 Toward Optimal Search and Retrieval for RAG Alexandria Leto et.al. 2411.07396v1 null
2024-11-11 Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery Mohamed Abul Hassan et.al. 2411.07395v1 null
2024-11-11 Feature-Space Semantic Invariance: Enhanced OOD Detection for Open-Set Domain Generalization Haoliang Wang et.al. 2411.07392v1 null
2024-11-11 Federated Learning Client Pruning for Noisy Labels Mahdi Morafah et.al. 2411.07391v1 link
2024-11-11 Firing Rate Models as Associative Memory: Excitatory-Inhibitory Balance for Robust Retrieval Simone Betteti et.al. 2411.07388v1 null
2024-11-11 Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages Midia Yousefi et.al. 2411.07387v1 null
2024-11-11 BeeManc at the PLABA Track of TAC-2024: RoBERTa for task 1 and LLaMA3.1 and GPT-4o for task 2 Zhidong Ling et.al. 2411.07381v1 null
2024-11-11 Warmstarting for Scaling Language Models Neeratyoy Mallik et.al. 2411.07340v1 null
2024-11-11 SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models Bardiya Akhbari et.al. 2411.07336v1 link
2024-11-11 Multimodal Fusion Balancing Through Game-Theoretic Regularization Konstantinos Kontras et.al. 2411.07335v1 null
2024-11-11 Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations Kirti Bhagat et.al. 2411.07320v1 null

Abstracts

Scaling Properties of Diffusion Models for Perceptual Tasks

2411.08034v1 by Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through a careful analysis of these scaling behaviors, we present various techniques to efficiently train diffusion models for visual perception tasks. Our models achieve improved or comparable performance to state-of-the-art methods using significantly less data and compute. To use our code and models, see https://scaling-diffusion-perception.github.io .

摘要:在本文中,我們主張使用擴散模型進行的迭代計算不僅為生成提供了強大的範例,也為視覺感知任務提供了強大的範例。我們將深度估計、光流和分割等任務統一在圖像到圖像轉換下,並展示了擴散模型如何從擴展感知任務的訓練和測試時間計算中受益。通過仔細分析這些縮放行為,我們提出了各種技術,以有效訓練用於視覺感知任務的擴散模型。我們的模型使用顯著更少的数据和計算,達到了與最先進的方法相當或更好的性能。若要使用我們的代碼和模型,請參閱 https://scaling-diffusion-perception.github.io 。

GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation

2411.08033v1 by Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, Chen Change Loy

While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.

摘要:儘管 3D 內容生成已大幅進展,但現有方法仍面臨輸入格式、潛在空間設計和輸出表示的挑戰。本文介紹了一個新穎的 3D 生成架構,可解決這些挑戰,提供可擴充、高品質的 3D 生成,並具備互動式點雲結構潛在空間。我們的架構採用變異自動編碼器 (VAE),以多視圖姿勢 RGB-D(深度)-N(法線) 渲染作為輸入,使用獨特的潛在空間設計來保留 3D 形狀資訊,並結合串聯潛在擴散模型以改善形狀紋理分離。所提出的方法 GaussianAnything 支援多模式條件式 3D 生成,允許點雲、標題和單/多視圖影像輸入。值得注意的是,新提出的潛在空間自然能實現幾何紋理分離,因此允許 3D 感知編輯。實驗結果證明了我們的方法在多個資料集上的有效性,在文字和影像條件式 3D 生成方面都優於現有方法。

Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data

2411.08028v1 by Juanhui Li, Sreyashi Nag, Hui Liu, Xianfeng Tang, Sheikh Sarwar, Limeng Cui, Hansu Gu, Suhang Wang, Qi He, Jiliang Tang

In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computation demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment. However, their training is hindered by the scarcity of labeled data. In contrast, unlabeled data is often readily available and can be leveraged by using LLMs to generate pseudo-labels for training smaller models. This enables the smaller models (student) to acquire knowledge from LLMs (teacher) while reducing computational costs. This process introduces challenges, such as potentially noisy pseudo-labels. Selecting high-quality and informative data is therefore critical to enhance model performance while improving the efficiency of data utilization. To address this, we propose LLKD, which enables Learning with Less computational resources and less data for Knowledge Distillation from LLMs. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Specifically, it prioritizes samples where the teacher demonstrates high confidence in its labeling, indicating reliable labels, and where the student exhibits a high information need, identifying challenging samples that require further learning. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.

摘要:在實際的 NLP 應用中,大型語言模型 (LLM) 因其在大量資料集上的廣泛訓練而提供有前景的解決方案。然而,LLM 的龐大規模和高運算需求限制了它們在許多應用中的實用性,特別是在需要進一步微調時。為了解決這些限制,通常偏好較小的模型進行部署。然而,它們的訓練受到標記資料的稀缺性阻礙。相反,未標記的資料通常很容易獲得,可以使用 LLM 為較小的模型生成偽標籤進行訓練。這使較小的模型(學生)能夠從 LLM(老師)那裡獲取知識,同時降低運算成本。這個過程會帶來挑戰,例如潛在的雜訊偽標籤。因此,選擇高品質且有資訊性的資料對於提高模型效能並提高資料利用率至關重要。為了解決這個問題,我們提出了 LLKD,它可以在從 LLM 中進行知識蒸餾時使用較少的運算資源和較少的資料進行學習。LLKD 是一種自適應的樣本選擇方法,它結合了老師和學生的訊號。具體來說,它優先考慮老師在標記中表現出高度信心的樣本,表示標籤可靠,以及學生表現出高度資訊需求的樣本,識別需要進一步學習的具有挑戰性的樣本。我們的綜合實驗表明,LLKD 在具有更高資料效率的各種資料集上實現了卓越的效能。
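
The selection rule described in this abstract (confident teacher, uncertain student) can be illustrated with a minimal sketch. This is not the authors' LLKD implementation: using the teacher's maximum class probability as "confidence", the student's prediction entropy as "information need", and the two thresholds are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_samples(teacher_probs, student_probs,
                   conf_threshold=0.8, need_threshold=0.5):
    """Return indices of samples with a confident teacher and an unsure student.

    Both thresholds are hypothetical knobs, not values from the paper.
    """
    selected = []
    for i, (t, s) in enumerate(zip(teacher_probs, student_probs)):
        teacher_confidence = max(t)   # proxy for pseudo-label reliability
        student_need = entropy(s)     # high entropy = student is uncertain
        if teacher_confidence >= conf_threshold and student_need >= need_threshold:
            selected.append(i)
    return selected

# Toy class-probability outputs for three unlabeled samples.
teacher = [[0.95, 0.05], [0.55, 0.45], [0.90, 0.10]]
student = [[0.50, 0.50], [0.50, 0.50], [0.99, 0.01]]
print(select_samples(teacher, student))
```

Only the first sample passes both filters: the second has an unreliable pseudo-label, and the third is already easy for the student.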

LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models

2411.08027v1 by Anoop Cherian, Radu Corcodel, Siddarth Jain, Diego Romeres

Physical reasoning is an important skill needed for robotic agents when operating in the real world. However, solving such reasoning problems often involves hypothesizing and reflecting over complex multi-body interactions under the effect of a multitude of physical forces, and learning all such interactions poses a significant hurdle for state-of-the-art machine learning frameworks, including large language models (LLMs). To study this problem, we propose a new physical reasoning task and a dataset, dubbed TraySim. Our task involves predicting the dynamics of several objects on a tray that is given an external impact. The domino effect of the ensuing object interactions and their dynamics offers a challenging yet controlled setup, with the goal of the reasoning being to infer the stability of the objects after the impact. To solve this complex physical reasoning task, we present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs, and synergizes these abilities with the world models built into modern physics engines. Specifically, LLMPhy uses an LLM to generate code to iteratively estimate the physical hyperparameters of the system (friction, damping, layout, etc.) via an implicit analysis-by-synthesis approach using a (non-differentiable) simulator in the loop and uses the inferred parameters to imagine the dynamics of the scene towards solving the reasoning task. To show the effectiveness of LLMPhy, we present experiments on our TraySim dataset to predict the steady-state poses of the objects. Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance, while demonstrating superior convergence against standard black-box optimization methods and better estimation of the physical parameters.

摘要:物理推理是機器代理在現實世界中運作時所需的重要技能。然而,解決此類推理問題通常涉及對複雜的多體交互進行假設和反思,這些交互受到大量物理力的影響,因此學習所有此類交互對最先進的機器學習框架(包括大型語言模型 (LLM))構成了重大障礙。為了研究這個問題,我們提出了一個新的物理推理任務和一個名為 TraySim 的數據集。我們的任務涉及預測托盤上幾個物體的動態,這些物體受到外部衝擊——由此產生的物體交互的多米諾效應及其動態從而提供了具有挑戰性但受控的設置,推理目標是推論物體在衝擊後的穩定性。為了解決這個複雜的物理推理任務,我們提出了 LLMPhy,這是一個零次方黑盒優化框架,它利用了 LLM 的物理知識和程式合成能力,並將這些能力與現代物理引擎中內建的世界模型協同作用。具體來說,LLMPhy 使用 LLM 產生代碼,通過使用迴圈中的(不可微分)模擬器進行隱式分析-通過合成方法來反覆估計系統的物理超參數(摩擦、阻尼、佈局等),並使用推斷的參數來想像場景的動態,以解決推理任務。為了展示 LLMPhy 的有效性,我們在我們的 TraySim 數據集上進行了實驗,以預測物體的穩態姿勢。我們的結果表明,LLM 和物理引擎的結合導致了最先進的零次方物理推理性能,同時展示了優於標準黑盒優化方法的收斂性,以及對物理參數的更好估計。

Leonardo vindicated: Pythagorean trees for minimal reconstruction of the natural branching structures

2411.08024v1 by Dymitr Ruta, Corrado Mio, Ernesto Damiani

Trees continue to fascinate with their natural beauty and as engineering masterpieces optimal with respect to several independent criteria. The Pythagorean tree is a well-known fractal design that realistically mimics natural tree branching structures. We study various types of Pythagorean-like fractal trees with different shapes of the base, branching angles and relaxed scales in an attempt to identify and explain which variants are the closest match to the branching structures commonly observed in the natural world. Pursuing simultaneously the realism and minimalism of the fractal tree model, we have developed a flexibly parameterised and fast algorithm to grow and visually examine deep Pythagorean-inspired fractal trees, with the capability to over- or underestimate Leonardo da Vinci's tree branching rule in an orderly way as well as control various imbalances and branching angles. We tested the realism of the generated fractal tree images by means of the classification accuracy of detecting natural trees with transfer-trained deep Convolutional Neural Networks (CNNs). Having empirically established the parameters of the fractal trees that maximize the CNN's natural tree class classification accuracy, we translated them back to the scales and angles of branches. The resulting conclusions support the da Vinci branching rule and golden-ratio-based scaling for both the shape of the branch and the imbalance between the child branches, and we claim that the flexibly parameterized fractal trees can be used to generate artificial examples to train robust detectors of different species of trees.

摘要:樹木持續以其自然美景和作為工程傑作著迷,在幾個獨立標準方面達到最佳化。畢氏樹是一種著名的分形設計,逼真地模擬自然樹木分枝結構。我們研究各種畢氏分形樹,它們具有不同形狀的基底、分枝角度和放鬆比例,試圖找出並解釋哪些變體最接近自然界中常見的分枝結構。同時追求分形樹模型的寫實主義和極簡主義,我們開發了一種靈活參數化且快速的演算法,用於生長和視覺檢查深度畢氏靈感分形樹,並有能力有條理地高估或低估李奧納多·達文西的樹木分枝規則,以及控制各種不平衡和分枝角度。我們透過轉移訓練深度卷積神經網路 (CNN) 偵測自然樹木的分類準確度,來測試所生成分形樹影像的寫實度。在經驗上建立最大化 CNN 自然樹類別分類準確度的分形樹參數後,我們已將它們轉換回分枝的比例和角度,並得出有趣的結論,支持達文西分枝規則和黃金比例,作為分枝形狀和子分枝之間不平衡的基礎,並宣稱靈活參數化的分形樹可用於產生人工範例,以訓練不同樹種的強健偵測器。
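
The link between the classic Pythagorean tree and Leonardo's branching rule can be checked numerically. The sketch below assumes the usual statement of the rule, that the parent's width raised to an exponent equals the sum of the children's widths raised to the same exponent, with exponent 2 as the baseline. The function names and the idea of varying the exponent to over- or underestimate the rule are illustrative assumptions, not the paper's parameterisation.

```python
import math

def child_widths(parent_width, angle_deg):
    """Side lengths of the two child squares in a classic Pythagorean tree.

    The parent's top edge is the hypotenuse of a right triangle whose
    two legs become the bases of the child squares.
    """
    theta = math.radians(angle_deg)
    return parent_width * math.cos(theta), parent_width * math.sin(theta)

def leonardo_residual(parent, children, alpha=2.0):
    """parent**alpha - sum(child**alpha); zero when the rule holds exactly."""
    return parent ** alpha - sum(c ** alpha for c in children)

left, right = child_widths(1.0, 45.0)   # symmetric tree
print(left, right)
print(abs(leonardo_residual(1.0, (left, right))) < 1e-12)
```

With exponent 2 the residual vanishes for any branching angle, which is just the Pythagorean theorem. A different exponent leaves a nonzero residual, so the children are collectively thicker or thinner than the rule predicts.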

Language Models as Causal Effect Generators

2411.08019v1 by Lucius E. J. Bynum, Kyunghyun Cho

We present a framework for large language model (LLM) based data generation with controllable causal structure. In particular, we define a procedure for turning any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly speaking, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We characterize how an SD-SCM allows sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data without needing to manually specify functional relationships between variables. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods on these datasets for average, conditional average, and individual treatment effect estimation, both with and without hidden confounding. Apart from generating data, the same procedure also allows us to test for the presence of a causal effect that might be encoded in an LLM. This procedure can underpin auditing LLMs for misinformation, discrimination, or otherwise undesirable behavior. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure.

摘要:我們提出了一個基於大型語言模型 (LLM) 的資料生成架構,具有可控制的因果結構。具體來說,我們定義了一個程序,將任何語言模型和任何有向無環圖 (DAG) 轉換成一個序列驅動的結構因果模型 (SD-SCM)。廣義來說,SD-SCM 是一個因果模型,具有使用者定義的結構和 LLM 定義的結構方程式。我們描述了 SD-SCM 如何根據所需的因果結構,允許從觀測、介入和反事實分佈中進行抽樣。然後,我們利用這個程序提出了一種類型的因果推論方法基準,生成個體層級的反事實資料,而無需手動指定變數之間的功能關係。我們建立了一個範例基準,包含數千個資料集,並在這些資料集上測試了一系列流行的估計方法,用於平均值、條件平均值和個別處理效果估計,無論是有或沒有隱藏混淆。除了生成資料之外,相同的程序也允許我們測試 LLM 中可能編碼的因果效應的存在。此程序可以支持審核 LLM 的錯誤資訊、歧視或其他不良行為。我們相信 SD-SCM 可以作為任何應用程式的有用工具,這些應用程式可以從具有可控制因果結構的序列資料中受益。
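
The core sampling idea in this abstract, user-defined structure with model-defined structural equations, can be sketched as ancestral sampling over a DAG. This is a toy illustration and not the authors' SD-SCM code: `sample_sd_scm` and `toy_sampler` are hypothetical names, and the deterministic `toy_sampler` stands in for a language model that would generate each variable's value from its parents.

```python
from graphlib import TopologicalSorter

def sample_sd_scm(parents, sampler, interventions=None):
    """Sample one unit from a DAG, visiting variables in topological order.

    parents:       dict mapping each variable to a list of its parents
    sampler:       callable(variable, parent_values) -> value (LLM stand-in)
    interventions: dict of variable -> forced value, i.e. do(variable = value)
    """
    interventions = interventions or {}
    values = {}
    for var in TopologicalSorter(parents).static_order():
        if var in interventions:
            values[var] = interventions[var]   # ignore the variable's parents
        else:
            parent_values = {p: values[p] for p in parents.get(var, [])}
            values[var] = sampler(var, parent_values)
    return values

# Hypothetical deterministic stand-in for LLM-defined structural equations.
def toy_sampler(var, parent_values):
    if var == "age":
        return 40
    if var == "treatment":
        return int(parent_values["age"] > 30)
    return parent_values["age"] + 10 * parent_values["treatment"]  # outcome

dag = {"age": [], "treatment": ["age"], "outcome": ["age", "treatment"]}
print(sample_sd_scm(dag, toy_sampler))                                   # observational
print(sample_sd_scm(dag, toy_sampler, interventions={"treatment": 0}))   # interventional
```

Counterfactual sampling would additionally require reusing the same randomness across the factual and intervened runs, which this sketch does not model.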

Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings

2411.08017v1 by Aditya Sanghi, Aliasghar Khani, Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani

Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness required to model the generative models effectively. To address this, we introduce a novel approach called Wavelet Latent Diffusion, or WaLa, that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a $256^3$ signed distance field into a $12^3 \times 4$ latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to efficiently train large-scale generative networks without increasing the inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at $256^3$ resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model's scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.

摘要:大型 3D 生成模型需要大量的计算资源,但通常在捕捉精细细节和高分辨率的复杂几何形状方面表现不佳。我们将此限制归因于当前表示形式的低效率,它缺乏对有效建模生成模型所需的紧凑性。为了解决这个问题,我们引入了一种称为小波潜在扩散或 WaLa 的新方法,它将 3D 形状编码为基于小波的紧凑潜在编码。具体来说,我们将 $256^3$ 有符号距离场压缩到 $12^3 \times 4$ 潜在网格中,以最小的细节损失实现了令人印象深刻的 2427 倍压缩比。这种高水平的压缩允许我们的方法有效地训练大规模生成网络,而不会增加推理时间。我们的模型(条件模型和无条件模型)包含大约十亿个参数,并成功生成分辨率为 $256^3$ 的高质量 3D 形状。此外,WaLa 提供快速推理,根据条件在两到四秒内生成形状,尽管模型的规模很大。我们展示了跨多个数据集的最新性能,在生成质量、多样性和计算效率方面都有显著提高。我们开源我们的代码,并且据我们所知,发布了跨不同模态的最大预训练 3D 生成模型。
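
The quoted compression ratio can be reproduced from the two grid sizes given in the abstract. The sketch assumes the ratio is simply the number of input values divided by the number of latent values. The variable names are illustrative.

```python
# Count the values before and after compression.
input_values = 256 ** 3        # samples in the signed distance field
latent_values = 12 ** 3 * 4    # entries in the 12^3 x 4 latent grid

ratio = input_values / latent_values
print(input_values, latent_values, round(ratio, 2))
```

The ratio comes out just above 2427, which matches the 2427x figure in the abstract.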

Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech

2411.08013v1 by Eleonora Mancini, Francesco Paissan, Paolo Torroni, Cem Subakan, Mirco Ravanelli

Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.

摘要:帕金森氏症 (PD) 的言語障礙提供了重要的早期診斷指標。儘管基於言語的 PD 檢測模型已展現出強勁的效能,但其可解釋性仍未獲得充分探討。本研究系統性地評估了數種可解釋性方法,以識別 PD 特定的言語特徵,旨在支援開發準確、可解釋的模型,以進行 PD 診斷和監控中的臨床決策。我們的研究方法包括:(i) 使用主流可解釋性技術取得歸因和顯著性圖,(ii) 透過一系列既定的指標,量化評估這些圖及其透過聯集和交集所取得組合的真實性,以及 (iii) 從輔助分類器評估顯著性圖傳達的 PD 檢測資訊。我們的結果顯示,儘管解釋與分類器一致,但它們通常無法為領域專家提供有價值的資訊。

ExpressivityArena: Can LLMs Express Information Implicitly?

2411.08010v1 by Joshua Tint, Som Sagar, Aditya Taparia, Kelly Raines, Bimsara Pathiraja, Caleb Liu, Ransalu Senanayake

While Large Language Models (LLMs) have demonstrated remarkable performance in certain dimensions, their ability to express implicit language cues that humans use for effective communication remains unclear. This paper presents ExpressivityArena, a Python library for measuring the implicit communication abilities of LLMs. We provide a comprehensive framework to evaluate expressivity of arbitrary LLMs and explore its practical implications. To this end, we refine the definition and measurements of "expressivity," and use our framework in a set of small experiments. These experiments test LLMs in creative and logical tasks such as poetry, coding, and emotion-based responses. They are then evaluated by an automated grader, through ExpressivityArena, which we verify to be the most pragmatic for testing expressivity. Building on these experiments, we deepen our understanding of the expressivity of LLMs by assessing their ability to remain expressive in conversations. Our findings indicate that LLMs are capable of generating and understanding expressive content, however, with some limitations. These insights will inform the future development and deployment of expressive LLMs. We provide the code for ExpressivityArena alongside our paper.

摘要:儘管大型語言模型 (LLM) 在某些面向展示出卓越的表現,它們表達人類用於有效溝通的隱含語言線索的能力仍不明確。本文提出 ExpressivityArena,一個用於測量 LLM 隱含溝通能力的 Python 函式庫。我們提供一個全面的架構來評估任意 LLM 的表達能力,並探討其實際影響。為此,我們改進了「表達能力」的定義和測量,並在一些小型實驗中使用我們的架構。這些實驗在詩歌、編碼和基於情緒的回應等創造性和邏輯任務中測試 LLM。然後,它們通過 ExpressivityArena 由自動評分器評估,我們驗證它是測試表達能力最實用的方法。在這些實驗的基礎上,我們通過評估 LLM 在對話中保持表達能力的能力,加深了我們對 LLM 表達能力的理解。我們的研究結果表明,LLM 能夠產生和理解富有表現力的內容,但有一些限制。這些見解將為未來表達式 LLM 的開發和部署提供資訊。我們在論文中提供了 ExpressivityArena 的程式碼。

Can adversarial attacks by large language models be attributed?

2411.08003v1 by Manuel Cebrian, Jan Arne Telle

Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation, presents significant challenges that are likely to grow in importance. We investigate this attribution problem using formal language theory, specifically language identification in the limit as introduced by Gold and extended by Angluin. By modeling LLM outputs as formal languages, we analyze whether finite text samples can uniquely pinpoint the originating model. Our results show that due to the non-identifiability of certain language classes, under some mild assumptions about overlapping outputs from fine-tuned models it is theoretically impossible to attribute outputs to specific LLMs with certainty. This holds also when accounting for expressivity limitations of Transformer architectures. Even with direct model access or comprehensive monitoring, significant computational hurdles impede attribution efforts. These findings highlight an urgent need for proactive measures to mitigate risks posed by adversarial LLM use as their influence continues to expand.

摘要:在敵對環境(例如網路攻擊和錯誤資訊)中,將大型語言模型(LLM)的輸出歸因於特定模型,是一項重大的挑戰,且其重要性可能會與日俱增。我們使用形式語言理論探討這個歸因問題,特別是 Gold 提出並由 Angluin 擴充的極限語言辨識。透過將 LLM 輸出建模為形式語言,我們分析有限的文字範例是否能明確找出原始模型。我們的結果顯示,由於特定語言類別的不可識別性,在微調模型輸出重疊的一些溫和假設下,理論上不可能確定地將輸出歸因於特定的 LLM。即使考慮到 Transformer 架構的表達力限制,這也成立。即使有直接的模型存取或全面的監控,重大的運算障礙也會阻礙歸因工作。這些發現凸顯了採取主動措施以減輕敵對 LLM 使用所帶來的風險的迫切需要,因為它們的影響力持續擴大。

Derivational Morphology Reveals Analogical Generalization in Large Language Models

2411.07990v1 by Valentin Hofmann, Leonie Weissweiler, David Mortensen, Hinrich Schütze, Janet Pierrehumbert

What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As of yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J's behavior is sensitive to the individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.

摘要:大型語言模型(LLM)中語言概括化的底層機制是什麼?這個問題引起了相當大的關注,大多數研究分析了 LLM 的語言技能與規則的相似程度。到目前為止,我們還不知道 LLM 中的語言概括化是否可以同樣解釋為類比過程的結果,類比過程可以形式化為儲存範例的相似性運算。先前研究的一個主要缺點是其重點在於高度規律性的語言現象,對於這種現象,基於規則和類比的方法會做出相同的預測。在這裡,我們改為檢驗派生形態,特別是英語形容詞名詞化,它顯示出顯著的可變性。我們引入了一種新的方法來研究 LLM 中的語言概括化:專注於 GPT-J,我們將實例化基於規則和類比學習的認知模型套用到 LLM 訓練資料,並將其預測與 LLM 在一組新造形容詞上進行比較,讓我們能夠對底層機制得出直接結論。正如預期的那樣,對於具有規則名詞化模式的形容詞,基於規則和類比的模型對 GPT-J 的預測解釋得一樣好。然而,對於具有可變名詞化模式的形容詞,類比模型提供了更好的匹配。此外,GPT-J 的行為對個別字詞頻率很敏感,即使是規則形式也是如此,這種行為與類比規則的說明一致,但與基於規則的說明不一致。這些發現駁斥了 GPT-J 在形容詞名詞化上的語言概括化涉及規則的假設,表明對儲存範例的相似性運算才是底層機制。總體而言,我們的研究表明,類比過程在 LLM 的語言概括化中所扮演的角色比先前想像的更大。

Gini Coefficient as a Unified Metric for Evaluating Many-versus-Many Similarity in Vector Spaces

2411.07983v1 by Ben Fauber

We demonstrate that Gini coefficients can be used as unified metrics to evaluate many-versus-many (all-to-all) similarity in vector spaces. Our analysis of various image datasets shows that images with the highest Gini coefficients tend to be the most similar to one another, while images with the lowest Gini coefficients are the least similar. We also show that this relationship holds true for vectorized text embeddings from various corpuses, highlighting the consistency of our method and its broad applicability across different types of data. Additionally, we demonstrate that selecting machine learning training samples that closely match the distribution of the testing dataset is far more important than ensuring data diversity. Selection of exemplary and iconic training samples with higher Gini coefficients leads to significantly better model performance compared to simply having a diverse training set with lower Gini coefficients. Thus, Gini coefficients can serve as effective criteria for selecting machine learning training samples, with our selection method outperforming random sampling methods in very sparse information settings.

Summary: Gini coefficients can serve as a unified metric for many-versus-many (all-to-all) similarity in vector spaces. Across image datasets and text embeddings, items with the highest Gini coefficients tend to be the most similar to one another, and selecting training samples with higher Gini coefficients that match the test distribution outperforms diverse or randomly sampled training sets in very sparse information settings.
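The abstract does not spell out how the coefficient is computed, so the following is only a minimal sketch under stated assumptions: it uses the standard Gini coefficient of a list of non-negative values and applies it to one item's cosine similarities to every other item. The helper names `gini`, `cosine`, and `similarity_gini`, and the choice to clip negative similarities to zero, are illustrative and not taken from the paper.

```python
import math


def gini(values):
    """Standard Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on ascending-sorted values with 1-based ranks.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_gini(vectors, index):
    """Gini coefficient of one item's similarities to all other items.

    Assumption: negative cosine similarities are clipped to 0 so that the
    Gini coefficient is well defined.
    """
    sims = [
        max(0.0, cosine(vectors[index], v))
        for j, v in enumerate(vectors)
        if j != index
    ]
    return gini(sims)
```

Calling `similarity_gini(vectors, i)` for every `i` gives one score per item, which could then be used to rank candidate training samples.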

Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization

2411.07979v1 by Davide Buffelli, Jamie McGowan, Wangkun Xu, Alexandru Cioba, Da-shan Shiu, Guillaume Hennequin, Alberto Bernacchia

Second-order optimization has been shown to accelerate the training of deep neural networks in many applications, often yielding faster progress per iteration on the training loss compared to first-order optimizers. However, the generalization properties of second-order methods are still being debated. Theoretical investigations have proved difficult to carry out outside the tractable settings of heavily simplified model classes; thus, the relevance of existing theories to practical deep learning applications remains unclear. Similarly, empirical studies in large-scale models and real datasets are significantly confounded by the necessity to approximate second-order updates in practice. It is often unclear whether the observed generalization behaviour arises specifically from the second-order nature of the parameter updates, or instead reflects the specific structured (e.g. Kronecker) approximations used or any damping-based interpolation towards first-order updates. Here, we show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep reversible architectures that are sufficiently expressive to be meaningfully applied to common benchmark datasets. We exploit this novel setting to study the training and generalization properties of the GN optimizer. We find that exact GN generalizes poorly. In the mini-batch training setting, this manifests as rapidly saturating progress even on the training loss, with parameter updates found to overfit each mini-batch without producing the features that would support generalization to other mini-batches. We show that our experiments run in the "lazy" regime, in which the neural tangent kernel (NTK) changes very little during the course of training. This behaviour is associated with having no significant changes in neural representations, explaining the lack of generalization.

Summary: Exact Gauss-Newton (GN) updates take a tractable form in a class of deep reversible architectures, which allows the generalization of a second-order optimizer to be studied without approximation. Exact GN generalizes poorly: in mini-batch training, progress saturates quickly even on the training loss because updates overfit each mini-batch, and the experiments run in the "lazy" regime, in which the neural tangent kernel (NTK) and the neural representations change very little.
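For readers unfamiliar with the update being discussed, here is a minimal sketch of a damped Gauss-Newton step on a toy scalar least-squares problem. It is not the paper's reversible-architecture construction; the function name `gauss_newton_step` and the toy data are assumptions made for illustration.

```python
def gauss_newton_step(w, xs, ys, damping=0.0):
    """One damped Gauss-Newton step for the scalar model y ~ w * x.

    Residuals are r_i(w) = w * x_i - y_i, so the Jacobian entries are
    dr_i/dw = x_i and the step is w - (J^T r) / (J^T J + damping).
    """
    residuals = [w * x - y for x, y in zip(xs, ys)]
    jtj = sum(x * x for x in xs)                         # J^T J (a scalar here)
    jtr = sum(x * r for x, r in zip(xs, residuals))      # J^T r
    return w - jtr / (jtj + damping)


# Toy data generated by y = 3 * x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]

w_exact = gauss_newton_step(0.0, xs, ys)                 # undamped GN step
w_damped = gauss_newton_step(0.0, xs, ys, damping=14.0)  # damped GN step
```

On this linear problem the undamped step reaches the least-squares solution in one iteration, while a large `damping` shrinks the step toward a scaled gradient-descent step, which is the damping-based interpolation towards first-order updates that the abstract mentions as a confound.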

DINO-LG: A Task-Specific DINO Model for Coronary Calcium Scoring

2411.07976v1 by Mahmut S. Gokmen, Cody Bumgardner, Caner Ozcan

Coronary artery disease (CAD) is one of the most common causes of mortality in the world. Coronary artery calcium (CAC) scoring using computed tomography (CT) is key for risk assessment to prevent coronary disease. Previous studies on risk assessment and calcification detection in CT scans primarily use approaches based on the UNET architecture, frequently implemented on pre-built models. However, these models are limited by the availability of annotated CT scans containing CAC and suffer from imbalanced datasets, which decreases the performance of CAC segmentation and scoring. In this study, we extend this approach by incorporating the self-supervised learning (SSL) technique of DINO (self-distillation with no labels) to eliminate the limitations of scarce annotated data in CT scans. The DINO model's ability to train without requiring CAC area annotations enhances its robustness in generating distinct features. The DINO model is trained to focus specifically on calcified areas by using labels, aiming to generate features that effectively capture and highlight key characteristics. The label-guided DINO (DINO-LG) enhances classification by distinguishing CT slices that contain calcification from those that do not, performing 57% better than the standard DINO model in this task. CAC scoring and segmentation tasks are performed by a basic U-NET architecture, fed specifically with CT slices containing calcified areas as identified by the DINO-LG model. This targeted identification performed by the DINO-LG model improves CAC segmentation performance by approximately 10% and significantly increases CAC scoring accuracy.

Summary: Coronary artery calcium (CAC) scoring on CT is key to risk assessment for coronary artery disease, but UNET-based approaches are limited by scarce annotations and imbalanced datasets. The label-guided DINO model (DINO-LG) separates CT slices that contain calcification from those that do not, performing 57% better than standard DINO on this task; feeding only the identified slices to a basic U-NET improves CAC segmentation by approximately 10% and significantly increases CAC scoring accuracy.

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

2411.07975v1 by Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan

We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.

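Summary: JanusFlow unifies image understanding and generation in a single model by integrating autoregressive language models with rectified flow in a minimalist architecture. Rectified flow can be trained directly within the LLM framework without complex architectural modifications; decoupling the understanding and generation encoders and aligning their representations during unified training further improves performance, matching or exceeding specialized models and significantly outperforming existing unified approaches on standard benchmarks.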

From General to Specific: Utilizing General Hallucination to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents

2411.07965v1 by Chuyi Kong, Ziyang Luo, Hongzhan Lin, Zhiyuan Fan, Yaxin Fan, Yuxi Sun, Jing Ma

The advanced role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks, such as HPD, which incorporates manually scored character relationships into the context for LLMs to sort coherence, and SocialBench, which uses specific profiles generated by LLMs in the context of multiple-choice tasks to assess character preferences, face limitations like poor generalizability, implicit and inaccurate judgments, and excessive context length. To address the above issues, we propose an automatic, scalable, and generalizable paradigm. Specifically, we construct a benchmark by extracting relations from a general knowledge graph and leverage RPA's inherent hallucination properties to prompt it to interact across roles, employing ChatGPT for stance detection and defining relationship hallucination along with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further explore factors influencing these metrics and discuss the trade-off between relationship hallucination and factuality.

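Summary: Existing role-playing agent (RPA) benchmarks such as HPD and SocialBench suffer from poor generalizability, implicit and inaccurate judgments, and excessive context length. The authors propose an automatic, scalable, and generalizable paradigm: they build a benchmark from relations extracted from a general knowledge graph, prompt RPAs to interact across roles, use ChatGPT for stance detection, and define relationship hallucination with three related metrics. Experiments validate the metrics and examine the trade-off between relationship hallucination and factuality.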

Towards Low-bit Communication for Tensor Parallel LLM Inference

2411.07942v1 by Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush

Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem is with quantization, but current methods for LLMs tend to avoid quantizing the features that tensor parallelism needs to communicate. Taking advantage of consistent outliers in communicated features, we introduce a quantization method that reduces communicated values on average from 16 bits to 4.2 bits while preserving nearly all of the original performance. For instance, our method maintains around 98.0% and 99.5% of Gemma 2 27B's and Llama 2 13B's original performance, respectively, averaged across all tasks we evaluated on.

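Summary: Tensor parallelism improves server LLM inference efficiency at an additional communication cost, which grows as models are distributed across more devices. By exploiting consistent outliers in the communicated features, the proposed quantization method reduces communicated values from 16 bits to 4.2 bits on average while maintaining around 98.0% and 99.5% of the original performance of Gemma 2 27B and Llama 2 13B, respectively, averaged across all evaluated tasks.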
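The abstract gives no algorithmic detail, so the sketch below is only one plausible reading, not the authors' method: a small set of outlier features is sent at full precision and the remaining values are quantized with symmetric uniform 4-bit quantization. The functions `quantize_with_outliers` and `average_bits`, the per-vector scale, and the 1-in-60 outlier ratio are all assumptions chosen to illustrate how a fractional average such as 4.2 bits can arise.

```python
def quantize_with_outliers(values, outlier_idx, bits=4):
    """Quantize non-outlier values to `bits` bits and pass outliers through.

    Assumption: symmetric uniform quantization with a single per-vector
    scale computed from the non-outlier values only.
    """
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit values
    inliers = [v for i, v in enumerate(values) if i not in outlier_idx]
    scale = max((abs(v) for v in inliers), default=0.0) / levels or 1.0
    out = []
    for i, v in enumerate(values):
        if i in outlier_idx:
            out.append(v)  # outlier kept at full precision
        else:
            q = max(-levels, min(levels, round(v / scale)))
            out.append(q * scale)  # dequantized value seen by the receiver
    return out


def average_bits(n_values, n_outliers, low_bits=4, high_bits=16):
    """Average bits per communicated value under mixed precision."""
    return (n_outliers * high_bits + (n_values - n_outliers) * low_bits) / n_values


features = [0.1, -0.2, 0.7, 50.0]
restored = quantize_with_outliers(features, outlier_idx={3})
bits = average_bits(n_values=60, n_outliers=1)
```

Under these assumptions, keeping one value in sixty at 16 bits and the rest at 4 bits averages 4.2 bits per value; the paper's actual bit allocation may differ.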

DuoLift-GAN: Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks

2411.07941v1 by Zhaoxi Zhang, Yueliang Ying

Computed tomography (CT) provides highly detailed three-dimensional (3D) medical images but is costly, time-consuming, and often inaccessible in intraoperative settings (Organization et al. 2011). Recent advancements have explored reconstructing 3D chest volumes from sparse 2D X-rays, such as single-view or orthogonal double-view images. However, current models tend to process 2D images in a planar manner, prioritizing visual realism over structural accuracy. In this work, we introduce DuoLift Generative Adversarial Networks (DuoLift-GAN), a novel architecture with dual branches that independently elevate 2D images and their features into 3D representations. These 3D outputs are merged into a unified 3D feature map and decoded into a complete 3D chest volume, enabling richer 3D information capture. We also present a masked loss function that directs reconstruction towards critical anatomical regions, improving structural accuracy and visual quality. This paper demonstrates that DuoLift-GAN significantly enhances reconstruction accuracy while achieving superior visual realism compared to existing methods.

Summary: CT provides detailed 3D medical images but is costly, time-consuming, and often inaccessible in intraoperative settings. DuoLift-GAN uses dual branches that independently lift 2D X-ray images and their features into 3D representations, merges them into a unified 3D feature map, and decodes a complete 3D chest volume; a masked loss function directs reconstruction toward critical anatomical regions. The method significantly improves reconstruction accuracy and achieves superior visual realism compared with existing methods.

Automatic dataset shift identification to support root cause analysis of AI performance drift

2411.07940v1 by Mélanie Roschewitz, Raghav Mehta, Charles Jones, Ben Glocker

Shifts in data distribution can substantially harm the performance of clinical AI models. Hence, various methods have been developed to detect the presence of such shifts at deployment time. However, root causes of dataset shifts are varied, and the choice of shift mitigation strategies is highly dependent on the precise type of shift encountered at test time. As such, detecting test-time dataset shift is not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift (caused by a change in the label distribution), covariate shift (caused by a change in input characteristics) and mixed shifts (simultaneous prevalence and covariate shifts). We discuss the importance of self-supervised encoders for detecting subtle covariate shifts and propose a novel shift detector leveraging both self-supervised encoders and task model outputs for improved shift detection. We report promising results for the proposed shift identification framework across three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shifts, using four large publicly available datasets.

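Summary: Detecting dataset shift at test time is not sufficient for clinical AI models; identifying which type of shift has occurred is critical for choosing a mitigation strategy. The authors propose the first unsupervised dataset shift identification framework, which distinguishes prevalence shift, covariate shift, and mixed shifts, together with a shift detector that combines self-supervised encoders and task model outputs. Results are promising across three imaging modalities (chest radiography, digital mammography, and retinal fundus images), five types of real-world shift, and four large public datasets.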

CryptoLLM: Unleashing the Power of Prompted LLMs for SmartQnA and Classification of Crypto Posts

2411.07917v1 by Aniket Deroy, Subhankar Maity

The rapid growth of social media has resulted in a large volume of user-generated content, particularly in niche domains such as cryptocurrency. This task focuses on developing robust classification models to accurately categorize cryptocurrency-related social media posts into predefined classes, including but not limited to objective, positive, negative, etc. Additionally, the task requires participants to identify the most relevant answers from a set of posts in response to specific questions. By leveraging advanced LLMs, this research aims to enhance the understanding and filtering of cryptocurrency discourse, thereby facilitating more informed decision-making in this volatile sector. We have used a prompt-based technique to solve the classification task for Reddit and Twitter posts. We have also used a 64-shot technique along with prompts on the GPT-4-Turbo model to determine whether an answer is relevant to a question.

Summary: The task covers classifying cryptocurrency-related social media posts into predefined classes (objective, positive, negative, etc.) and identifying the posts most relevant to specific questions. The authors use a prompt-based technique to classify Reddit and Twitter posts and a 64-shot technique with prompts on the GPT-4-Turbo model to determine whether an answer is relevant to a question.

Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus

2411.07892v1 by Benjamin Litterer, David Jurgens, Dallas Card

Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.

Summary: The authors introduce a dataset of over 1.1M podcast transcripts that is largely comprehensive of all English-language podcasts available through public RSS feeds in May and June 2020. It includes audio features and speaker turns for a subset of 370K episodes, plus speaker role inferences and other metadata for all 1.1M episodes, and supports a foundational investigation of the content, structure, and responsiveness of the podcast ecosystem.

INTRABENCH: Interactive Radiological Benchmark

2411.07885v1 by Constantin Ulrich, Tassilo Wald, Emily Tempus, Maximilian Rokuss, Paul F. Jaeger, Klaus Maier-Hein

Current interactive segmentation approaches, inspired by the success of META's Segment Anything model, have achieved notable advancements. However, they come with substantial limitations that hinder their practical application in real clinical scenarios. These include unrealistic human interaction requirements, such as slice-by-slice operations for 2D models on 3D data, a lack of iterative refinement, and insufficient evaluation experiments. These shortcomings prevent accurate assessment of model performance and lead to inconsistent outcomes across studies. IntRaBench overcomes these challenges by offering a comprehensive and reproducible framework for evaluating interactive segmentation methods in realistic, clinically relevant scenarios. It includes diverse datasets, target structures, and segmentation models, and provides a flexible codebase that allows seamless integration of new models and prompting strategies. Additionally, we introduce advanced techniques to minimize clinician interaction, ensuring fair comparisons between 2D and 3D models. By open-sourcing IntRaBench, we invite the research community to integrate their models and prompting techniques, ensuring continuous and transparent evaluation of interactive segmentation models in 3D medical imaging.

Summary: Current interactive segmentation approaches inspired by Segment Anything are limited by unrealistic human interaction requirements (such as slice-by-slice operations for 2D models on 3D data), a lack of iterative refinement, and insufficient evaluation. IntRaBench offers a comprehensive, reproducible framework for evaluating interactive segmentation methods in realistic, clinically relevant scenarios, with diverse datasets, target structures, and segmentation models, and an open-source codebase that allows new models and prompting strategies to be integrated.

Diverse capability and scaling of diffusion and auto-regressive models when learning abstract rules

2411.07873v1 by Binxu Wang, Jiaqi Shang, Haim Sompolinsky

Humans excel at discovering regular structures from limited samples and applying inferred rules to novel settings. We investigate whether modern generative models can similarly learn underlying rules from finite samples and perform reasoning through conditional sampling. Inspired by the Raven's Progressive Matrices task, we designed the GenRAVEN dataset, where each sample consists of three rows, and one of 40 relational rules governing the object position, number, or attributes applies to all rows. We trained generative models to learn the data distribution, where samples are encoded as integer arrays to focus on rule learning. We compared two generative model families: diffusion (EDM, DiT, SiT) and autoregressive models (GPT2, Mamba). We evaluated their ability to generate structurally consistent samples and perform panel completion via unconditional and conditional sampling. We found that diffusion models excel at unconditional generation, producing more novel and consistent samples from scratch and memorizing less, but perform less well in panel completion, even with advanced conditional sampling methods. Conversely, autoregressive models excel at completing missing panels in a rule-consistent manner but generate less consistent samples unconditionally. We observe diverse data scaling behaviors: for both model families, rule learning emerges at a certain dataset size, on the order of 1000s of examples per rule. With more training data, diffusion models improve both their unconditional and conditional generation capabilities. However, for autoregressive models, while panel completion improves with more training data, unconditional generation consistency declines. Our findings highlight complementary capabilities and limitations of diffusion and autoregressive models in rule learning and reasoning tasks, suggesting avenues for further research into their mechanisms and potential for human-like reasoning.

Summary: Using the GenRAVEN dataset, inspired by Raven's Progressive Matrices, the authors compare diffusion models (EDM, DiT, SiT) and autoregressive models (GPT2, Mamba) on learning abstract rules. Diffusion models excel at unconditional generation, producing more novel and consistent samples and memorizing less, but perform less well at panel completion; autoregressive models excel at rule-consistent panel completion but generate less consistent samples unconditionally. With more training data, diffusion models improve on both fronts, whereas the unconditional generation consistency of autoregressive models declines.

Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease

2411.07871v1 by Francesco Chiumento, Mingming Liu

The rapid advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown great potential in medical diagnostics, particularly in radiology, where datasets such as X-rays are paired with human-generated diagnostic reports. However, a significant research gap exists in the neuroimaging field, especially for conditions such as Alzheimer's disease, due to the lack of comprehensive diagnostic reports that can be utilized for model fine-tuning. This paper addresses this gap by generating synthetic diagnostic reports using GPT-4o-mini on structured data from the OASIS-4 dataset, which comprises 663 patients. Using the synthetic reports as ground truth for training and validation, we then generated neurological reports directly from the images in the dataset leveraging the pre-trained BiomedCLIP and T5 models. Our proposed method achieved a BLEU-4 score of 0.1827, ROUGE-L score of 0.3719, and METEOR score of 0.4163, revealing its potential in generating clinically relevant and accurate diagnostic reports.

Summary: Neuroimaging lacks the comprehensive diagnostic reports needed to fine-tune vision-language models, especially for conditions such as Alzheimer's disease. The authors generate synthetic diagnostic reports with GPT-4o-mini from structured data in the OASIS-4 dataset (663 patients), use them as ground truth, and then generate neurological reports directly from the images with pre-trained BiomedCLIP and T5 models, reaching a BLEU-4 score of 0.1827, a ROUGE-L score of 0.3719, and a METEOR score of 0.4163.

Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders

2411.07870v1 by Xiaofeng Zhu, Jaya Krishna Mandivarapu

Although people are impressed by the content generation skills of large language models, the use of LLMs, such as ChatGPT, is limited by the domain grounding of the content. The correctness and groundedness of the generated content need to be based on a verified context, such as results from Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to a customized domain is that the generated responses are often incomplete, or the additions are not verified and may even be hallucinated. Prior studies on hallucination detection have focused on evaluation metrics, which are not easily adaptable to dynamic domains and can be vulnerable to attacks like jail-breaking. In this work, we propose 1) a post-processing algorithm that leverages knowledge triplets in RAG context to correct hallucinations and 2) a dual-decoder model that fuses RAG context to guide the generation process.

Summary: When LLMs such as ChatGPT are adapted to a customized domain, generated responses are often incomplete, or their additions are unverified and may even be hallucinated. The authors propose 1) a post-processing algorithm that uses knowledge triplets in the RAG context to correct hallucinations and 2) a dual-decoder model that fuses the RAG context to guide the generation process.

Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models

2411.07858v1 by Yusen Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang

When unsure about an answer, humans often respond with more words than necessary, hoping that part of the response will be correct. We observe a similar behavior in large language models (LLMs), which we term "Verbosity Compensation" (VC). VC is harmful because it confuses users' understanding, leading to low efficiency, and affects LLM services by increasing the latency and cost of generating useless tokens. In this paper, we present the first work that defines and analyzes Verbosity Compensation, explores its causes, and proposes a simple mitigating approach. We define Verbosity Compensation as the behavior of generating responses that can be compressed without information loss when prompted to write concisely. Our experiments, conducted on five datasets of knowledge and reasoning-based QA tasks with 14 newly developed LLMs, reveal three conclusions. 1) We reveal a pervasive presence of verbosity compensation across all models and all datasets. Notably, GPT-4 exhibits a VC frequency of 50.40%. 2) We reveal the large performance gap between verbose and concise responses, with a notable difference of 27.61% on the Qasper dataset. We also demonstrate that this difference does not naturally diminish as LLM capability increases. Both 1) and 2) highlight the urgent need to mitigate the frequency of VC behavior and disentangle verbosity from veracity. We propose a simple yet effective cascade algorithm that replaces verbose responses with responses generated by other models. The results show that our approach effectively reduces the VC of the Mistral model from 63.81% to 16.16% on the Qasper dataset. 3) We also find that verbose responses exhibit higher uncertainty across all five datasets, suggesting a strong connection between verbosity and model uncertainty. Our dataset and code are available at https://github.com/psunlpgroup/VerbosityLLM.

Summary: Verbosity Compensation (VC) is the behavior of generating responses that can be compressed without information loss when the model is prompted to write concisely. Across five knowledge and reasoning-based QA datasets and 14 LLMs, VC is pervasive (GPT-4 shows a VC frequency of 50.40%), verbose responses perform notably worse than concise ones (a 27.61% gap on Qasper), and verbose responses exhibit higher uncertainty. A simple cascade algorithm that replaces verbose responses with responses from other models reduces the VC of the Mistral model from 63.81% to 16.16% on Qasper. The dataset and code are available at https://github.com/psunlpgroup/VerbosityLLM.
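The cascade idea can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the models are represented by plain callables, and the verbosity check `is_verbose_by_length` is a hypothetical word-count stand-in for the paper's definition, which is based on whether a response can be compressed without information loss.

```python
from typing import Callable, Sequence


def is_verbose_by_length(response: str, max_words: int = 5) -> bool:
    """Hypothetical detector: flags a response as verbose if it is too long."""
    return len(response.split()) > max_words


def cascade_answer(
    question: str,
    models: Sequence[Callable[[str], str]],
    is_verbose: Callable[[str], bool] = is_verbose_by_length,
) -> str:
    """Query models in order and return the first non-verbose response.

    If every model is verbose, the last model's response is returned.
    """
    response = ""
    for model in models:
        response = model(question)
        if not is_verbose(response):
            return response
    return response


# Stub models standing in for a weaker and a stronger LLM.
def weak_model(question: str) -> str:
    return "It could be Paris, or perhaps Lyon, or maybe Marseille, hard to say."


def strong_model(question: str) -> str:
    return "Paris"


answer = cascade_answer("What is the capital of France?", [weak_model, strong_model])
```

Ordering `models` from cheapest to most capable means the more expensive model is called only when the cheaper one produces a verbose response.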

Tucano: Advancing Neural Text Generation for Portuguese

2411.07854v1 by Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah

Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most of the development happens and resources are available, and the low-resource ones, which struggle to attain the same level of performance and autonomy. This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese. In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Via this corpus, we trained a series of decoder-transformers named Tucano. Our models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks. The evaluation of our models also reveals that model performance on many currently available benchmarks used by the Portuguese NLP community has little to no correlation with the scaling of token ingestion during training, highlighting the limitations of such evaluations when it comes to the assessment of Portuguese generative language models. All derivatives of our study are openly released on GitHub and Hugging Face. See https://nkluge-correa.github.io/Tucano/

Summary: The authors document GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens, and use it to train a series of decoder-transformers named Tucano. The models perform on par with or better than other Portuguese and multilingual models of similar size on several Portuguese benchmarks, and the evaluation shows that performance on many current benchmarks has little to no correlation with the scaling of token ingestion during training. All derivatives are openly released on GitHub and Hugging Face. See https://nkluge-correa.github.io/Tucano/

IAE: Irony-based Adversarial Examples for Sentiment Analysis Systems

2411.07850v1 by Xiaoyin Yi, Jiacheng Huang

Adversarial examples, which are inputs deliberately perturbed with imperceptible changes to induce model errors, have raised serious concerns for the reliability and security of deep neural networks (DNNs). While adversarial attacks have been extensively studied in continuous data domains such as images, the discrete nature of text presents unique challenges. In this paper, we propose Irony-based Adversarial Examples (IAE), a method that transforms straightforward sentences into ironic ones to create adversarial text. This approach exploits the rhetorical device of irony, where the intended meaning is opposite to the literal interpretation, requiring a deeper understanding of context to detect. The IAE method is particularly challenging due to the need to accurately locate evaluation words, substitute them with appropriate collocations, and expand the text with suitable ironic elements while maintaining semantic coherence. Our research makes the following key contributions: (1) We introduce IAE, a strategy for generating textual adversarial examples using irony. This method does not rely on pre-existing irony corpora, making it a versatile tool for creating adversarial text in various NLP tasks. (2) We demonstrate that the performance of several state-of-the-art deep learning models on sentiment analysis tasks significantly deteriorates when subjected to IAE attacks. This finding underscores the susceptibility of current NLP systems to adversarial manipulation through irony. (3) We compare the impact of IAE on human judgment versus NLP systems, revealing that humans are less susceptible to the effects of irony in text.

Summary: Irony-based Adversarial Examples (IAE) transform straightforward sentences into ironic ones to create adversarial text, without relying on pre-existing irony corpora. The method locates evaluation words, substitutes them with appropriate collocations, and expands the text with suitable ironic elements while maintaining semantic coherence. Several state-of-the-art deep learning models deteriorate significantly on sentiment analysis under IAE attacks, while humans are less susceptible to the effects of irony in text.

Ethical Concern Identification in NLP: A Corpus of ACL Anthology Ethics Statements

2411.07845v1 by Antonia Karamolegkou, Sandrine Schiller Hansen, Ariadni Christopoulou, Filippos Stamatiou, Anne Lauscher, Anders Søgaard

What ethical concerns, if any, do LLM researchers have? We introduce EthiCon, a corpus of 1,580 ethical concern statements extracted from scientific papers published in the ACL Anthology. We extract ethical concern keywords from the statements and show promising results in automating the concern identification process. Through a survey, we compare the ethical concerns of the corpus to the concerns listed by the general public and professionals in the field. Finally, we compare our retrieved ethical concerns with existing taxonomies pointing to gaps and future research directions.

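Summary: EthiCon is a corpus of 1,580 ethical concern statements extracted from scientific papers published in the ACL Anthology. The authors extract ethical concern keywords, show promising results in automating concern identification, compare the corpus with concerns listed by the general public and professionals through a survey, and compare the retrieved concerns with existing taxonomies to point out gaps and future research directions.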

Chain Association-based Attacking and Shielding Natural Language Processing Systems

2411.07843v1 by Jiacheng Huang, Long Chen

Association is a gift that lets people avoid mentioning something in completely straightforward words while still allowing others to understand what they intend to refer to. In this paper, we propose a chain association-based adversarial attack against natural language processing systems, utilizing the comprehension gap between humans and machines. We first generate a chain association graph for Chinese characters based on the association paradigm for building the search space of potential adversarial examples. Then, we introduce a discrete particle swarm optimization algorithm to search for the optimal adversarial examples. We conduct comprehensive experiments and show that advanced natural language processing models and applications, including large language models, are vulnerable to our attack, while humans appear good at understanding the perturbed text. We also explore two methods, adversarial training and associative graph-based recovery, to shield systems from chain association-based attacks. Since a few examples use derogatory terms, this paper contains materials that may be offensive or upsetting to some people.

摘要:聯想作為一種禮物,使人們不必用完全直白的話語提及某事,並讓其他人明白他們想提的是什麼。在本文中,我們提出了一種基於鏈式聯想的對抗性攻擊,用於自然語言處理系統,利用了人類與機器之間的理解差距。我們首先基於聯想範例為漢字生成一個鏈式聯想圖,用於構建潛在對抗性範例的搜索空間。然後,我們引入一個離散粒子群優化演算法來搜索最佳的對抗性範例。我們進行了全面的實驗,並表明先進的自然語言處理模型和應用程式,包括大型語言模型,都容易受到我們的攻擊,而人類似乎很擅長理解擾動後的文字。我們還探索了兩種方法,包括對抗性訓練和基於聯想圖的恢復,以保護系統免受基於鏈式聯想的攻擊。由於一些範例使用了某些貶義詞,因此本文包含可能冒犯或令某些人感到不安的材料。

Federated Learning for Discrete Optimal Transport with Large Population under Incomplete Information

2411.07841v1 by Navpreet Kaur, Juntao Chen, Yingdong Lu

Optimal transport is a powerful framework for the efficient allocation of resources between sources and targets. However, traditional models often struggle to scale effectively in the presence of large and heterogeneous populations. In this work, we introduce a discrete optimal transport framework designed to handle large-scale, heterogeneous target populations, characterized by type distributions. We address two scenarios: one where the type distribution of targets is known, and one where it is unknown. For the known distribution, we propose a fully distributed algorithm to achieve optimal resource allocation. In the case of unknown distribution, we develop a federated learning-based approach that enables efficient computation of the optimal transport scheme while preserving privacy. Case studies are provided to evaluate the performance of our learning algorithm.

摘要:最佳傳輸是一種在來源和目標之間有效分配資源的強大架構。然而,傳統模型在面對龐大且異質的人群時,通常難以有效擴展。在此研究中,我們引入了一個離散最佳傳輸架構,旨在處理大型、異質的目標族群,其特點在於類型分佈。我們探討了兩種場景:一種是目標的類型分佈已知,另一種則是未知。對於已知分佈,我們提出了一種完全分佈式的演算法,以實現最佳資源配置。在未知分佈的情況下,我們開發了一種基於聯邦學習的方法,可以在保護隱私的同時,有效計算最佳傳輸方案。我們提供了案例研究,以評估我們的學習演算法的效能。

Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices

2411.07826v1 by Kilian Pfeiffer, Mohamed Aboelenien Ahmed, Ramin Khalili, Jörg Henkel

In recent years, Large Language Models (LLMs) through Transformer structures have dominated many machine learning tasks, especially text processing. However, these models require massive amounts of data for training and induce high resource requirements, particularly in terms of the large number of Floating Point Operations (FLOPs) and the high amounts of memory needed. To fine-tune such a model in a parameter-efficient way, techniques like Adapter or LoRA have been developed. However, we observe that the application of LoRA, when used in federated learning (FL), while still being parameter-efficient, is memory and FLOP inefficient. Based on that observation, we develop a novel layer finetuning scheme that allows devices in cross-device FL to make use of pretrained neural networks (NNs) while adhering to given resource constraints. We show that our presented scheme outperforms the current state of the art when dealing with homogeneous or heterogeneous computation and memory constraints and is on par with LoRA regarding limited communication, thereby achieving significantly higher accuracies in FL training.

摘要:近年來,大型語言模型 (LLM) 透過 Transformer 結構主導了許多機器學習任務,特別是文本處理。然而,這些模型需要大量的資料進行訓練,並造成高資源需求,特別是在大量的浮點運算 (FLOP) 和所需的高記憶體量方面。為了以參數有效的方式微調此類模型,已開發出適配器或 LoRA 等技術。然而,我們觀察到 LoRA 的應用在聯合學習 (FL) 中使用時,雖然仍然是參數有效的,但在記憶體和 FLOP 方面卻效率不彰。基於該觀察,我們開發了一種新穎的層微調方案,允許跨裝置 FL 中的裝置使用預訓練神經網路 (NN),同時遵守既定的資源限制。我們表明,我們提出的方案在處理同質或異質運算和記憶體限制時優於目前的技術水準,並且在有限的通訊方面與 LoRA 相當,從而實現了 FL 訓練中顯著更高的準確度。
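
The parameter-efficiency claim about LoRA in the abstract above can be made concrete with a small count. The sketch below is not taken from the paper: it assumes the standard LoRA formulation, in which a frozen weight matrix of shape `d_out x d_in` is adapted by two trainable factors of shapes `d_out x rank` and `rank x d_in`. The layer size (768) and rank (8) are illustrative assumptions. The count covers trainable parameters only; it says nothing about the activation memory or FLOPs that the abstract identifies as the bottleneck.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of the two low-rank factors (d_out x rank, rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed, not from the paper).
d_out, d_in, rank = 768, 768, 8
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.2%}")  # 589824 12288 2.08%
```

The forward and backward passes still run through the frozen base layer, which is consistent with the abstract's observation that a small trainable-parameter count does not by itself reduce memory or compute.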

Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models

2411.07820v1 by Youan Cong, Cheng Wang, Pritom Saha Akash, Kevin Chen-Chuan Chang

We introduce the Extract-Refine-Retrieve-Read (ERRR) framework, a novel approach designed to bridge the pre-retrieval information gap in Retrieval-Augmented Generation (RAG) systems through query optimization tailored to meet the specific knowledge requirements of Large Language Models (LLMs). Unlike conventional query optimization techniques used in RAG, the ERRR framework begins by extracting parametric knowledge from LLMs, followed by using a specialized query optimizer for refining these queries. This process ensures the retrieval of only the most pertinent information essential for generating accurate responses. Moreover, to enhance flexibility and reduce computational costs, we propose a trainable scheme for our pipeline that utilizes a smaller, tunable model as the query optimizer, which is refined through knowledge distillation from a larger teacher model. Our evaluations on various question-answering (QA) datasets and with different retrieval systems show that ERRR consistently outperforms existing baselines, proving to be a versatile and cost-effective module for improving the utility and accuracy of RAG systems.

摘要:我們介紹了「萃取-精煉-擷取-閱讀」(ERRR) 架構,這是一種新穎的方法,旨在透過針對大型語言模型 (LLM) 特定知識需求量身打造的查詢最佳化,來彌補擷取增強產生 (RAG) 系統中的前擷取資訊差距。與 RAG 中使用的傳統查詢最佳化技術不同,ERRR 架構從 LLM 中萃取參數化知識開始,接著使用專門的查詢最佳化器來精煉這些查詢。此程序可確保僅擷取產生準確回應所必要的資訊。此外,為了增強彈性並降低運算成本,我們為我們的管線提出了一個可訓練架構,它利用較小且可調整的模型作為查詢最佳化器,並透過從較大的教師模型中知識萃取來進行精煉。我們在各種問答 (QA) 資料集和不同的擷取系統上的評估顯示,ERRR 持續優於現有的基準,證明它是一個通用且具成本效益的模組,可改善 RAG 系統的效用和準確性。

PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring

2411.07796v1 by M. Jaleed Khan, Manu Vatish, Gabriel Davis Jones

Antepartum Cardiotocography (CTG) is vital for fetal health monitoring, but traditional methods like the Dawes-Redman system are often limited by high inter-observer variability, leading to inconsistent interpretations and potential misdiagnoses. This paper introduces PatchCTG, a transformer-based model specifically designed for CTG analysis, employing patch-based tokenisation, instance normalisation and channel-independent processing to capture essential local and global temporal dependencies within CTG signals. PatchCTG was evaluated on the Oxford Maternity (OXMAT) dataset, comprising over 20,000 CTG traces across diverse clinical outcomes after applying the inclusion and exclusion criteria. With extensive hyperparameter optimisation, PatchCTG achieved an AUC of 77%, with specificity of 88% and sensitivity of 57% at Youden's index threshold, demonstrating adaptability to various clinical needs. Testing across varying temporal thresholds showed robust predictive performance, particularly with finetuning on data closer to delivery, achieving a sensitivity of 52% and specificity of 88% for near-delivery cases. These findings suggest the potential of PatchCTG to enhance clinical decision-making in antepartum care by providing a reliable, objective tool for fetal health assessment. The source code is available at https://github.com/jaleedkhan/PatchCTG.

摘要:產前胎兒心搏圖 (CTG) 對於胎兒健康監測至關重要,但傳統方法(如 Dawes-Redman 系統)通常受到高觀察者間變異性的限制,導致解釋不一致和潛在的誤診。本文介紹 PatchCTG,一種專門設計用於 CTG 分析的基於Transformer的模型,採用基於區塊的標記化、實例正規化和通道獨立處理,以捕捉 CTG 信號中的基本局部和全局時間依賴性。PatchCTG 在牛津婦產 (OXMAT) 資料集上進行評估,該資料集包含超過 20,000 個 CTG 軌跡,涵蓋在應用包含和排除標準後不同的臨床結果。透過廣泛的超參數最佳化,PatchCTG 在 Youden 指數閾值下達到 77% 的 AUC,特異性為 88%,敏感性為 57%,證明了其對各種臨床需求的適應性。在不同的時間閾值下進行測試顯示出穩健的預測效能,特別是在接近分娩時對資料進行微調,對於接近分娩的病例,敏感性達到 52%,特異性達到 88%。這些發現表明 PatchCTG 有潛力透過提供可靠、客觀的胎兒健康評估工具來加強產前照護中的臨床決策制定。原始程式碼可在 https://github.com/jaleedkhan/PatchCTG 取得。
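
The PatchCTG abstract reports sensitivity and specificity at the Youden's index threshold. As a reminder of what that means, Youden's J is `sensitivity + specificity - 1`, so the reported 57% and 88% give J of roughly 0.45. The sketch below shows one simple way to pick such a threshold from scores and binary labels. The function names and the toy data are illustrative assumptions and are not from the PatchCTG code base.

```python
from typing import Sequence, Tuple


def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic."""
    return sensitivity + specificity - 1.0


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximising Youden's J; predict positive when score >= threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    best_j, best_t = float("-inf"), None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = youden_j(tp / positives, tn / negatives)
        if j > best_j:  # keep the lowest threshold among ties
            best_j, best_t = j, t
    return best_t, best_j


# Toy data (assumed): two negatives and two positives.
threshold, j = best_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(threshold, j)  # 0.35 0.5
```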

RedCode: Risky Code Execution and Generation Benchmark for Code Agents

2411.07781v1 by Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, Bo Li

With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at https://github.com/AI-secure/RedCode.

摘要:隨著支援 AI 編碼的程式碼代理功能快速提升且廣泛採用,安全性疑慮(例如產生或執行有風險的程式碼)已成為這些代理在現實世界中部署的重大障礙。為了對程式碼代理的安全性進行全面且實際的評估,我們提出 RedCode,一個有風險的程式碼執行和產生基準:(1) RedCode-Exec 提供可能導致有風險的程式碼執行的具挑戰性提示,目的是評估程式碼代理辨識和處理不安全程式碼的能力。我們總共提供了 4,050 個有風險的測試案例,採用 Python 和 Bash 任務,並包含多樣化的輸入格式,包括程式碼片段和自然語言。它們涵蓋了 8 個網域(例如網站、檔案系統)中的 25 種類型的重大漏洞。我們提供 Docker 環境並設計對應的評估指標來評估其執行結果。(2) RedCode-Gen 提供 160 個提示,包含函式簽章和文件字串,作為輸入來評估程式碼代理是否會遵循指令產生有害的程式碼或軟體。我們的實證結果源自於根據 19 個 LLM 評估三個代理架構,提供了程式碼代理漏洞的見解。例如,對 RedCode-Exec 的評估顯示,代理比較有可能拒絕執行作業系統上的有風險操作,但比較不可能拒絕執行技術上有問題的程式碼,這表示風險很高。以自然語言描述的有風險操作比以程式碼格式描述的有風險操作的拒絕率較低。此外,對 RedCode-Gen 的評估顯示,功能更強大的基礎模型和編碼能力更強的代理(例如 GPT4)往往會產生更精緻且有效的有害軟體。我們的發現強調了對各種程式碼代理進行嚴格安全性評估的必要性。我們的資料集和程式碼可在 https://github.com/AI-secure/RedCode 取得。

Likelihood as a Performance Gauge for Retrieval-Augmented Generation

2411.07773v1 by Tianyu Liu, Jirui Qi, Paul He, Arianna Bisazza, Mrinmaya Sachan, Ryan Cotterell

Recent work finds that retrieval-augmented generation with large language models is prone to be influenced by the order of retrieved documents in the context. However, the lack of in-depth analysis limits the use of this phenomenon for prompt engineering in practice. In this study, we posit that likelihoods serve as an effective gauge for language model performance. Through experiments on two question-answering datasets with a variety of state-of-the-art language models, we reveal correlations between answer accuracy and the likelihood of the question at both the corpus level and the instance level. In addition, we find that question likelihood can also indicate the position of the task-relevant information in the context. Based on these findings, we propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance. We demonstrate their effectiveness with experiments. In addition, our likelihood-based methods are efficient, as they only need to compute the likelihood of the input, requiring much fewer language model passes than heuristic prompt engineering methods that require generating responses. Our analysis deepens our understanding of how input prompts affect model performance and provides a promising direction for efficient prompt optimization.

摘要:最近的研究发现,使用大型语言模型进行检索增强生成容易受到上下文中检索到的文档顺序的影响。然而,缺乏深入的分析限制了这种现象在实际提示工程中的使用。在本研究中,我们假设似然度可以作为语言模型性能的有效衡量标准。通过对两个问答数据集进行实验,其中包含各种最先进的语言模型,我们揭示了在语料库级别和实例级别上答案准确度与问题似然度之间的相关性。此外,我们发现问题似然度还可以指示上下文中与任务相关的信息的位置。基于这些发现,我们提出了两种方法,它们使用问题似然度作为衡量标准,用于选择和构建提示,从而带来更好的性能。我们通过实验展示了它们的有效性。此外,我们的基于似然度的方法非常有效,因为它们只需要计算输入的似然度,比需要生成响应的启发式提示工程方法需要的语言模型传递要少得多。我们的分析加深了我们对输入提示如何影响模型性能的理解,并为高效提示优化提供了一个有希望的方向。

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

2411.07763v1 by Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu

Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that based on o1-preview, our code agent framework successfully solves only 17.0% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation -- especially in prior text-to-SQL benchmarks -- they require significant improvement in order to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents crucial steps towards developing intelligent, autonomous, code agents for real-world enterprise settings. Our code, baseline models, and data are available at https://spider2-sql.github.io.

摘要:真實世界的企業文字轉 SQL 工作流程通常涉及各種資料庫系統中的複雜雲端或本地資料、各種方言中的多個 SQL 查詢,以及從資料轉換到分析的各種作業。我們介紹 Spider 2.0,一個評估架構,包含 632 個源自企業級資料庫使用案例的真實世界文字轉 SQL 工作流程問題。Spider 2.0 中的資料庫來自真實資料應用程式,通常包含超過 1,000 個欄位,並儲存在本地或雲端資料庫系統中,例如 BigQuery 和 Snowflake。我們顯示,解決 Spider 2.0 中的問題通常需要理解和搜尋資料庫元資料、方言文件,甚至專案層級的程式碼庫。這個挑戰要求模型與複雜的 SQL 工作流程環境互動、處理極長的內容、執行複雜的推理,並產生具有各種作業的多個 SQL 查詢,通常超過 100 行,這遠遠超出了傳統的文字轉 SQL 挑戰。我們的評估表明,根據 o1-preview,我們的程式代理架構僅成功解決了 17.0% 的任務,而 Spider 1.0 為 91.2%,BIRD 為 73.0%。我們在 Spider 2.0 上的結果顯示,雖然語言模型在程式碼產生方面表現出顯著的效能——特別是在先前的文字轉 SQL 基準測試中——但它們需要顯著的改進才能達到足夠的效能,以供真實世界的企業使用。Spider 2.0 的進展代表著朝著為真實世界的企業設定開發智慧型、自主的程式碼代理邁出的關鍵一步。我們的程式碼、基線模型和資料可在 https://spider2-sql.github.io/ 取得。

ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization

2411.07762v1 by Weibo Zhao, Yubin Shi, Xinyu Lyu, Wanchen Sui, Shen Li, Yong Li

Quantization is a pivotal technique for large language model (LLM) serving, yet it poses significant challenges, particularly in achieving effective low-bit quantization. The limited numerical mapping makes the quantized model produce a non-trivial error, causing intolerable performance degradation. This paper is anchored in the basic idea of model compression objectives and examines the layer-wise error distribution of LLMs during post-training quantization. Subsequently, we introduce ASER, an algorithm consisting of (1) Error Reconstruction: low-rank compensation for quantization error with LoRA-style matrices constructed by whitening SVD; (2) Activation Smoothing: outlier extraction to gain smooth activation and better error compensation. ASER is capable of quantizing typical LLMs to low-bit ones, preserving accuracy even in the W4A8 per-channel setup. Experimental results show that ASER is competitive among state-of-the-art quantization algorithms and shows potential for activation quantization, with minor overhead.

摘要:量化技術是大型語言模型 (LLM) 服務的關鍵技術,但它在實現有效低位元量化方面特別具有挑戰性。受限的數值對應會讓量化的模型產生非平凡的錯誤,導致難以容忍的效能劣化。本文以模型壓縮目標的基本概念為基礎,深入探討 LLM 在訓練後量化期間的層級誤差分佈。隨後,我們介紹 ASER,一種演算法,包含 (1) 錯誤重建:使用透過白化 SVD 建構的 LoRA 式矩陣,對量化誤差進行低秩補償;(2) 激活平滑:離群值萃取以獲得平滑的激活和更好的誤差補償。ASER 能夠將典型的 LLM 量化為低位元,特別是在 W4A8 每通道設定中也能維持準確度。實驗結果顯示,ASER 在最先進的量化演算法中具有競爭力,顯示出具有較小負擔的激活量化潛力。
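
The idea of compensating quantization error with low-rank matrices can be illustrated with a toy example. This is a simplified sketch and not ASER itself: it uses plain symmetric round-to-nearest per-channel quantization and a rank-one correction found by power iteration, instead of the whitening SVD and activation smoothing the abstract describes. All function names, the matrix size, and the random weights are assumptions made for illustration.

```python
import math
import random


def quantize_per_channel(weights, bits=4):
    """Symmetric round-to-nearest quantization with one scale per row (output channel)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    out = []
    for row in weights:
        scale = (max(abs(w) for w in row) / qmax) or 1.0  # avoid a zero scale
        out.append([max(-qmax, min(qmax, round(w / scale))) * scale for w in row])
    return out


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def frobenius(m):
    return math.sqrt(sum(x * x for row in m for x in row))


def rank_one_factors(error, iters=50, seed=0):
    """Approximate the dominant right singular vector v of `error`; return (error @ v, v)."""
    rng = random.Random(seed)
    rows, cols = len(error), len(error[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
    for _ in range(iters):
        u = [sum(e * x for e, x in zip(row, v)) for row in error]                # u = E v
        v = [sum(error[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = E^T u
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    left = [sum(e * x for e, x in zip(row, v)) for row in error]
    return left, v


rng = random.Random(42)
W = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]  # toy weight matrix
W_q = quantize_per_channel(W, bits=4)
E = subtract(W, W_q)                                   # quantization error
left, right = rank_one_factors(E)
correction = [[l * r for r in right] for l in left]    # rank-one estimate of E
plain_error = frobenius(E)
compensated_error = frobenius(subtract(E, correction))
print(plain_error, compensated_error)
```

Because the correction is the projection of the error onto a single direction, the compensated error cannot exceed the plain error. A higher rank, or a data-aware decomposition such as the one the abstract names, would typically recover more of it.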

2411.07760v1 by Alexi Canesse, Mathieu Petitbois, Ludovic Denoyer, Sylvain Lamprier, Rémy Portelas

Offline Reinforcement Learning (RL) has emerged as a powerful alternative to imitation learning for behavior modeling in various domains, particularly in complex navigation tasks. An existing challenge with Offline RL is the signal-to-noise ratio, i.e. how to mitigate incorrect policy updates due to errors in value estimates. To this end, multiple works have demonstrated the advantage of hierarchical offline RL methods, which decouple high-level path planning from low-level path following. In this work, we present a novel hierarchical transformer-based approach leveraging a learned quantizer of the space. This quantization enables the training of a simpler zone-conditioned low-level policy and simplifies planning, which is reduced to discrete autoregressive prediction. Among other benefits, zone-level reasoning in planning enables explicit trajectory stitching rather than implicit stitching based on noisy value function estimates. By combining this transformer-based planner with recent advancements in offline RL, our proposed approach achieves state-of-the-art results in complex long-distance navigation environments.

摘要:離線強化學習 (RL) 已成為各種領域中行為建模的強大替代方案,特別是在複雜的導航任務中。離線 RL 現有的挑戰是訊號雜訊比,亦即如何因應價值估計中的錯誤而減輕不正確的政策更新。為此,多項研究已證明分層離線 RL 方法的優點,它將高階路徑規劃與低階路徑追蹤分開。在這項研究中,我們提出了一種新穎的分層Transformer方法,它利用空間的學習量化器。此量化能夠訓練更簡單的區域條件低階政策,並簡化規劃,而規劃則簡化為離散自迴歸預測。在其他好處中,規劃中的區域級推理能執行明確的軌跡拼接,而不是基於有雜訊的價值函數估計的隱式拼接。透過將此基於Transformer的規劃器與離線 RL 的最新進展相結合,我們提出的方法在複雜的長距離導航環境中達到了最先進的結果。

Optimizing Traffic Signal Control using High-Dimensional State Representation and Efficient Deep Reinforcement Learning

2411.07759v1 by Lawrence Francis, Blessed Guda, Ahmed Biyabani

In reinforcement learning-based (RL-based) traffic signal control (TSC), decisions on the signal timing are made based on the available information on vehicles at a road intersection. This information forms the state representation for the RL environment, which can be either a high-dimensional representation containing several variables or a low-dimensional vector. Current studies suggest that using high-dimensional state representations does not lead to improved performance on TSC. However, we argue, with experimental results, that the use of high-dimensional state representations can, in fact, lead to improved TSC performance, with improvements of up to 17.9% in average waiting time. This high-dimensional representation is obtainable using cost-effective vehicle-to-infrastructure (V2I) communication, encouraging its adoption for TSC. Additionally, given the large size of the state, we identified the need for computationally efficient models and explored model compression via pruning.

摘要:在基於強化學習 (RL) 的交通號誌控制 (TSC) 中, 有關號誌時序的決策是根據道路交叉口車輛的可用資訊做出的。這形成了 RL 環境的狀態表示,它可以是包含多個變數的高維度,或是一個低維度向量。目前的研究所表明,使用高維度狀態表示並不會提高 TSC 的效能。 然而,我們通過實驗結果論證,使用高維度狀態表示實際上可以提高 TSC 效能,平均等待時間最多可改善 17.9%。這種高維度表示可以使用具有成本效益的車對基礎設施 (V2I) 通訊獲得,從而鼓勵其用於 TSC。此外,鑑於狀態規模龐大,我們發現有必要擁有計算高效的模型,並透過剪枝探索模型壓縮。

SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model

2411.07751v1 by Xinyuan Qian, Jiaran Gao, Yaodan Zhang, Qiquan Zhang, Hexin Liu, Leibny Paola Garcia, Haizhou Li

Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e. SAV-SE. To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we propose the VC-S$^2$E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on public MUSIC, AVSpeech and AudioSet datasets, where the results demonstrate the superiority of VC-S$^2$E over other competitive methods. We will make the source code publicly available. Project demo page: https://AVSEPage.github.io/

摘要:語音增強在各種應用中扮演著重要的角色,而視覺資訊的整合已被證明能帶來顯著的優勢。然而,目前大多數的研究都集中在對臉部和嘴唇動作的檢視上,這在發生遮擋或相機視角較遠時可能會受到影響或完全無法使用。而來自周圍環境的脈絡視覺線索則被忽略了:例如,當我們看到一隻狗吠叫時,我們的大腦具有辨別和濾除吠叫噪音的先天氣質。為此,在本文中,我們引入了一個新任務,即 SAV-SE。據我們所知,這是第一個提出使用來自同步視訊的豐富脈絡資訊作為輔助線索來指示噪音類型的提案,這最終改善了語音增強性能。具體來說,我們提出了 VC-S$^2$E 方法,它結合了 Conformer 和 Mamba 模組,以發揮其互補優勢。在公開的 MUSIC、AVSpeech 和 AudioSet 資料集上進行了大量的實驗,結果證明了 VC-S$^2$E 優於其他競爭方法。我們將公開原始碼。專案展示頁面:https://AVSEPage.github.io/

Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding

2411.07722v1 by Zirui Shao, Chuwei Luo, Zhaoqing Zhu, Hangdi Xing, Zhi Yu, Qi Zheng, Jiajun Bu

Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand in recent years. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it "sees" and what it "understands." Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 68.6% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. This method first ensures task-specific consistency and then connects the cognitive and perceptual knowledge. Our method significantly reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks in most scenarios.

摘要:多模態大型語言模型 (MMLM) 在文件理解方面展現了令人印象深刻的能力,這是一個近年來快速發展的研究領域,在產業上有著重大的需求。作為一個多模態任務,文件理解需要模型具備感知和認知能力。然而,現有的 MLLM 經常面臨感知和認知之間的衝突。以文件 VQA 任務(認知)為例,MMLM 產生的答案可能與其 OCR(感知)識別的對應視覺內容不符。這種衝突表明,MMLM 可能難以在它「看見」的資訊和它「理解」的資訊之間建立內在的連結。這種衝突挑戰了認知與感知一致的直覺觀念,阻礙了 MLLM 的效能和可解釋性。在本文中,我們將認知和感知之間的衝突定義為認知與感知 (C&P) 知識衝突,這是一種多模態知識衝突,並專注於文件理解,對它們進行系統性的評估。我們的分析顯示,即使是領先的 MLLM GPT-4o,也只達到了 68.6% 的 C&P 一致性。為了減輕 C&P 知識衝突,我們提出了一種稱為多模態知識一致性微調的新方法。此方法首先確保任務特定的相容性,然後連結認知和感知知識。我們的這項方法大幅減少了所有經過測試的 MLLM 中的 C&P 知識衝突,並在大多數情況下提升了它們在認知和感知任務中的效能。

Training Data for Large Language Model

2411.07715v1 by Yiming Ju, Huanhuan Ma

In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in terms of parameters and the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich and high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering aspects such as data scale, collection methods, data types and characteristics, and processing workflows, and provides an overview of available open-source datasets.

摘要:2022 年,隨著 ChatGPT 的發布,大規模語言模型獲得了廣泛關注。ChatGPT 不僅在參數和預訓練語料庫規模方面超越了以前的模型,還通過對大量高品質、人工標註數據進行微調,實現了革命性的性能改進。這一進展讓企業和研究機構認識到,構建更智能、更強大的模型依賴於豐富且高品質的數據集。因此,數據集的構建和優化已成為人工智能領域的關鍵焦點。本文總結了用於訓練大規模語言模型的預訓練和微調數據的現狀,涵蓋了數據規模、收集方法、數據類型和特徵、處理工作流程等方面,並概述了可用的開源數據集。

New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook

2411.07691v1 by Meng Yang, Tianqing Zhu, Chi Liu, WanLei Zhou, Shui Yu, Philip S. Yu

Thanks to the explosive growth of data and the development of computational resources, it is possible to build pre-trained models that can achieve outstanding performance on various tasks, such as natural language processing, computer vision, and more. Despite their powerful capabilities, pre-trained models have also sparked attention to the emerging security challenges associated with their real-world applications. Security and privacy issues, such as leaking private information and generating harmful responses, have seriously undermined users' confidence in these powerful models. Concerns are growing as model performance improves dramatically. Researchers are eager to explore the unique security and privacy issues that have emerged, their distinguishing factors, and how to defend against them. However, the current literature lacks a clear taxonomy of emerging attacks and defenses for pre-trained models, which hinders a high-level and comprehensive understanding of these questions. To fill the gap, we conduct a systematic survey on the security risks of pre-trained models, proposing a taxonomy of attack and defense methods based on the accessibility of pre-trained models' input and weights in various security test scenarios. This taxonomy categorizes attacks and defenses into No-Change, Input-Change, and Model-Change approaches. With the taxonomy analysis, we capture the unique security and privacy issues of pre-trained models, categorizing and summarizing existing security issues based on their characteristics. In addition, we offer a timely and comprehensive review of each category's strengths and limitations. Our survey concludes by highlighting potential new research opportunities in the security and privacy of pre-trained models.

摘要:得益於資料爆炸式增長和運算資源的發展,可以建立預訓練模型,在各種任務中都能取得傑出的表現,例如神經語言處理、電腦視覺等。儘管預訓練模型功能強大,但也引起大家注意其在實際應用中出現的新興安全挑戰。安全性與隱私問題,例如洩露隱私資訊和產生有害回應,嚴重破壞了使用者對這些強大模型的信心。隨著模型效能大幅提升,疑慮也隨之增加。研究人員急於探討已經出現的獨特安全性和隱私問題、它們的區別因素,以及如何防禦它們。然而,目前的文獻缺乏針對預訓練模型的新興攻擊和防禦的明確分類法,這阻礙了對這些問題的高層次和全面的理解。為了填補這個缺口,我們對預訓練模型的安全風險進行系統性的調查,提出一個基於預訓練模型輸入和權重在各種安全測試場景中的可存取性,針對攻擊和防禦方法的分類法。此分類法將攻擊和防禦分類為不變更、輸入變更和模型變更方法。透過分類法分析,我們掌握預訓練模型獨特的安全性和隱私問題,根據其特徵對現有的安全問題進行分類和總結。此外,我們及時且全面地檢視每個類別的優缺點。我們的調查最後強調預訓練模型安全性和隱私的新研究機會。

World Models: The Safety Perspective

2411.07690v1 by Zifan Zeng, Chongzhe Zhang, Feng Liu, Joseph Sifakis, Qunli Zhang, Shiming Liu, Peng Wang

With the proliferation of the Large Language Model (LLM), the concept of World Models (WM) has recently attracted a great deal of attention in the AI research community, especially in the context of AI agents. It is arguably evolving into an essential foundation for building AI agent systems. A WM is intended to help the agent predict the future evolution of environmental states or help the agent fill in missing information so that it can plan its actions and behave safely. The safety property of WM plays a key role in their effective use in critical applications. In this work, we review and analyze the impacts of the current state-of-the-art in WM technology from the point of view of trustworthiness and safety based on a comprehensive survey and the fields of application envisaged. We provide an in-depth analysis of state-of-the-art WMs and derive technical research challenges and their impact in order to call on the research community to collaborate on improving the safety and trustworthiness of WM.

摘要:隨著大型語言模型 (LLM) 的激增,世界模型 (WM) 的概念最近在 AI 研究社群中引起了極大的關注,尤其是在 AI 代理的背景下。可以說,它正演變成建立 AI 代理系統不可或缺的基礎。WM 的目的是幫助代理預測環境狀態的未來演變,或幫助代理填補遺失的資訊,以便它可以規劃其行動並安全地執行。WM 的安全性在它們在關鍵應用中的有效使用中扮演著關鍵角色。在本文中,我們根據全面的調查和預期的應用領域,從可信度和安全性的角度回顧並分析了 WM 技術當前最先進的狀態所帶來的影響。我們深入分析了最先進的 WM,並推導出技術研究挑戰及其影響,以便呼籲研究社群合作改善 WM 的安全性和可信度。

Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG

2411.07688v1 by Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Yuhao Wang, Bin Chen, Yuxiang Cai, Yongheng Shang, Jianwei Yin

Ultra High Resolution (UHR) remote sensing imagery (RSI) (e.g. 100,000 $\times$ 100,000 pixels or more) poses a significant challenge for current Remote Sensing Multimodal Large Language Models (RSMLLMs). If the UHR image is resized to the standard input image size, the extensive spatial and contextual information that UHR images contain will be neglected. Otherwise, the original size of these images often exceeds the token limits of standard RSMLLMs, making it difficult to process the entire image and capture long-range dependencies to answer the query based on the abundant visual context. In this paper, we introduce ImageRAG for RS, a training-free framework to address the complexities of analyzing UHR remote sensing imagery. By recasting the UHR remote sensing image analysis task as a long-context selection task over the image, we design an innovative image contextual retrieval mechanism based on the Retrieval-Augmented Generation (RAG) technique, denoted as ImageRAG. ImageRAG's core innovation lies in its ability to selectively retrieve and focus on the most relevant portions of the UHR image as visual contexts that pertain to a given query. A fast path and a slow path are proposed in this framework to handle this task efficiently and effectively. ImageRAG allows RSMLLMs to manage extensive context and spatial information from UHR RSI, ensuring the analysis is both accurate and efficient.

摘要:超高分辨率 (UHR) 遥感影像 (RSI)(例如 100,000 $\times$ 100,000 像素或更多)对当前的遥感多模态大语言模型 (RSMLLM) 构成了重大挑战。如果选择将 UHR 影像调整为标准输入影像大小,则 UHR 影像所包含的广泛空间和上下文信息将被忽略。否则,这些影像的原始大小通常会超出标准 RSMLLM 的标记限制,从而难以处理整个影像并捕捉远程依赖关系,以根据丰富的视觉上下文来回答查询。在本文中,我们介绍了用于遥感的 ImageRAG,这是一个无训练框架,用于解决分析 UHR 遥感影像的复杂性。通过将 UHR 遥感影像分析任务转换为影像的长上下文选择任务,我们设计了一种基于检索增强生成 (RAG) 技术的创新影像上下文检索机制,称为 ImageRAG。ImageRAG 的核心创新在于它能够选择性地检索和关注 UHR 影像中与给定查询相关的最相关部分作为视觉上下文。在此框架中提出了快速路径和慢速路径来高效有效地处理此任务。ImageRAG 允许 RSMLLM 管理来自 UHR RSI 的广泛上下文和空间信息,确保分析既准确又高效。

Fast Disentangled Slim Tensor Learning for Multi-view Clustering

2411.07685v1 by Deng Xu, Chao Zhang, Zechao Li, Chunlin Chen, Huaxiong Li

Tensor-based multi-view clustering has recently received significant attention due to its exceptional ability to explore cross-view high-order correlations. However, most existing methods still encounter some limitations. (1) Most of them explore the correlations among different affinity matrices, making them unscalable to large-scale data. (2) Although some methods address this by introducing bipartite graphs, they may produce sub-optimal solutions caused by an unstable anchor selection process. (3) They generally ignore the negative impact of latent semantic-unrelated information in each view. To tackle these issues, we propose a new approach termed fast Disentangled Slim Tensor Learning (DSTL) for multi-view clustering. Instead of focusing on the multi-view graph structures, DSTL directly explores the high-order correlations among multi-view latent semantic representations based on matrix factorization. To alleviate the negative influence of feature redundancy, inspired by robust PCA, DSTL disentangles the latent low-dimensional representation into a semantic-unrelated part and a semantic-related part for each view. Subsequently, two slim tensors are constructed with tensor-based regularization. To further enhance the quality of feature disentanglement, the semantic-related representations are aligned across views through a consensus alignment indicator. Our proposed model is computationally efficient and can be solved effectively. Extensive experiments demonstrate the superiority and efficiency of DSTL over state-of-the-art approaches. The code of DSTL is available at https://github.com/dengxu-nju/DSTL.

AI enhanced diagnosis of Peyronies disease a novel approach using Computer Vision

2411.07684v1 by Yudara Kularathne, Janitha Prathapa, Prarththanan Sothyrajah, Salomi Arasaratnam, Sithira Ambepitiya, Thanveer Ahamed, Dinuka Wijesundara

This study presents an innovative AI-driven tool for diagnosing Peyronie's Disease (PD), a condition that affects between 0.3% and 13.1% of men worldwide. Our method uses key point detection on both images and videos to measure penile curvature angles, utilizing advanced computer vision techniques. This tool has demonstrated high accuracy in identifying anatomical landmarks, validated against conventional goniometer measurements. Traditional PD diagnosis often involves subjective and invasive methods, which can lead to patient discomfort and inaccuracies. Our approach offers a precise, reliable, and non-invasive diagnostic tool to address these drawbacks. The model distinguishes between PD and normal anatomical changes with a sensitivity of 96.7% and a specificity of 100%. This advancement represents a significant improvement in urological diagnostics, greatly enhancing the efficacy and convenience of PD assessment for healthcare providers and patients.
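
As a reminder of how the two reported metrics are defined, here is a minimal sketch. The confusion-matrix counts are hypothetical, chosen only so that the ratios reproduce the reported 96.7% and 100%; the abstract does not state the study's sample sizes.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of actual PD cases that the model flags."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of non-PD cases that the model clears."""
    return tn / (tn + fp)


# Hypothetical counts (NOT from the paper), picked to match the reported rates.
tp, fn = 29, 1   # PD cases: detected / missed
tn, fp = 30, 0   # normal cases: correctly cleared / falsely flagged

print(f"sensitivity = {sensitivity(tp, fn):.1%}")
print(f"specificity = {specificity(tn, fp):.1%}")
```

A specificity of 100% means no normal case was flagged in the evaluation set; how well that generalizes depends on the number of normal cases tested.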

Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach

2411.07656v1 by Tianyi Huang, Arya Somasundaram

Large Language Models (LLMs) often perpetuate biases in pronoun usage, leading to misrepresentation or exclusion of queer individuals. This paper addresses the specific problem of biased pronoun usage in LLM outputs, particularly the inappropriate use of traditionally gendered pronouns ("he," "she") when inclusive language is needed to accurately represent all identities. We introduce a collaborative agent pipeline designed to mitigate these biases by analyzing and optimizing pronoun usage for inclusivity. Our multi-agent framework includes specialized agents for both bias detection and correction. Experimental evaluations using the Tango dataset, a benchmark focused on gender pronoun usage, demonstrate that our approach significantly improves inclusive pronoun classification, achieving a 32.6 percentage point increase over GPT-4o in correctly disagreeing with inappropriate traditionally gendered pronouns $(\chi^2 = 38.57, p < 0.0001)$. These results underscore the potential of agent-driven frameworks to enhance fairness and inclusivity in AI-generated content, demonstrating their efficacy in reducing biases and promoting socially responsible AI.

Direct Preference Optimization Using Sparse Feature-Level Constraints

2411.07618v1 by Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang

The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.

Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation

2411.07611v1 by Shuai Niu, Jing Ma, Liang Bai, Zhihua Wang, Yida Xu, Yunya Song, Xian Yang

Clinical rationales play a pivotal role in accurate disease diagnosis; however, many models predominantly use discriminative methods and overlook the importance of generating supportive rationales. Rationale distillation is a process that transfers knowledge from large language models (LLMs) to smaller language models (SLMs), thereby enhancing the latter's ability to break down complex tasks. Despite its benefits, rationale distillation alone is inadequate for addressing domain knowledge limitations in tasks requiring specialized expertise, such as disease diagnosis. Effectively embedding domain knowledge in SLMs poses a significant challenge. While current LLMs are primarily geared toward processing textual data, multimodal LLMs that incorporate time series data, especially electronic health records (EHRs), are still evolving. To tackle these limitations, we introduce ClinRaGen, an SLM optimized for multimodal rationale generation in disease diagnosis. ClinRaGen incorporates a unique knowledge-augmented attention mechanism to merge domain knowledge with time series EHR data, utilizing a stepwise rationale distillation strategy to produce both textual and time series-based clinical rationales. Our evaluations show that ClinRaGen markedly improves the SLM's capability to interpret multimodal EHR data and generate accurate clinical rationales, supporting more reliable disease diagnosis, advancing LLM applications in healthcare, and narrowing the performance divide between LLMs and SLMs.

Circuit Complexity Bounds for RoPE-based Transformer Architecture

2411.07602v1 by Bo Chen, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song

Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling laws. Recent works provide circuit complexity bounds for Transformer-like architectures. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has emerged as a crucial technique in modern large language models, capturing positional information better than traditional position embeddings and showing great promise for applications, particularly in long-context scenarios. Empirical evidence also suggests that $\mathsf{RoPE}$-based Transformer architectures demonstrate greater generalization capabilities compared to conventional Transformer models. In this work, we establish a tighter circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our key contribution is to show that unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem or the Boolean formula value problem. This result demonstrates a fundamental limitation on the expressivity of the $\mathsf{RoPE}$-based Transformer architecture, despite its great empirical success. Our theoretical framework not only establishes tighter complexity bounds but may also guide further work on $\mathsf{RoPE}$-based Transformers.
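
For readers unfamiliar with the position encoding the abstract refers to, the following is a minimal sketch of rotary position embedding applied to a single vector. It assumes the common convention of rotating adjacent dimension pairs with a frequency base of 10000; the function names are illustrative and are not taken from the paper. It does not illustrate the circuit complexity argument itself.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[i], x[i+1]) by an angle proportional to `pos`.

    Assumed convention: the pair starting at index i uses frequency base**(-i/d).
    """
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.2]

# The attention score depends only on the relative offset (here 2 in both cases).
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
print(score_a, score_b)
```

The two printed scores agree up to floating-point error, which is the relative-position property that makes RoPE attractive for long contexts.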

Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations

2411.07598v1 by Rose E. Wang, Pawan Wirawarn, Kenny Lam, Omar Khattab, Dorottya Demszky

Many open-ended conversations (e.g., tutoring lessons or business meetings) revolve around pre-defined reference materials, like worksheets or meeting bullets. To provide a framework for studying such conversation structure, we introduce Problem-Oriented Segmentation & Retrieval (POSR), the task of jointly breaking down conversations into segments and linking each segment to the relevant reference item. As a case study, we apply POSR to education where effectively structuring lessons around problems is critical yet difficult. We present LessonLink, the first dataset of real-world tutoring lessons, featuring 3,500 segments, spanning 24,300 minutes of instruction and linked to 116 SAT math problems. We define and evaluate several joint and independent approaches for POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT), and large language models (LLMs) methods. Our results highlight that modeling POSR as one joint task is essential: POSR methods outperform independent segmentation and retrieval pipelines by up to +76% on joint metrics and surpass traditional segmentation methods by up to +78% on segmentation metrics. We demonstrate POSR's practical impact on downstream education applications, deriving new insights on the language and time use in real-world lesson structures.

Entropy Controllable Direct Preference Optimization

2411.07595v1 by Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka

In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution's sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
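
To make the "simple binary cross-entropy loss" concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The `alpha` parameter is a hypothetical entropy coefficient added only to illustrate how a minor change to the loss calculation could enter; the abstract does not give the exact H-DPO formula, so this form is an assumption. With `alpha=1.0` the function is plain DPO.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, alpha=1.0):
    """-log sigmoid(beta * (alpha * policy_margin - reference_margin)).

    logp_w / logp_l: policy log-probabilities of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy.
    alpha: hypothetical entropy coefficient (assumption); alpha=1.0 gives DPO.
    """
    policy_margin = logp_w - logp_l
    reference_margin = ref_logp_w - ref_logp_l
    logits = beta * (alpha * policy_margin - reference_margin)
    return softplus(-logits)  # equals -log(sigmoid(logits))


# Toy log-probabilities (illustrative values only).
print(preference_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0))
print(preference_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0, alpha=0.9))
```

No reward model appears anywhere in the computation: the loss needs only log-probabilities from the policy and the reference model.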

A Comprehensive Survey of AI-Driven Advancements and Techniques in Automated Program Repair and Code Generation

2411.07586v1 by Avinash Anand, Akshit Gupta, Nishchay Yadav, Shaurya Bajaj

Bug fixing and code generation have been core research topics in software development for many years. The recent explosive growth in Large Language Models has completely transformed these spaces, putting incredibly powerful tools for both within reach. In this survey, 27 recent papers are reviewed and split into two groups: one dedicated to Automated Program Repair (APR) and LLM integration and the other to code generation using LLMs. The first group consists of new methods for bug detection and repair, which include locating semantic errors, security vulnerabilities, and runtime failure bugs. This work emphasizes the role of LLMs in reducing manual debugging effort, with APR moving toward context-aware fixes and innovations that boost accuracy and efficiency in automatic debugging. The second group focuses on code generation, providing an overview of both general-purpose LLMs fine-tuned for programming and task-specific models. It also presents methods to improve code generation, such as identifier-aware training, fine-tuning at the instruction level, and incorporating semantic code structures. The survey contrasts the methodologies in APR and code generation to identify trends such as the use of LLMs, feedback loops that enable iterative code improvement, and open-source models. It also discusses the challenges of achieving functional correctness and security and outlines future directions for research in LLM-based software development.

Reinforcement Learning Framework for Quantitative Trading

2411.07585v1 by Alhassan S. Yasin, Prabdeep S. Gill

The inherent volatility and dynamic fluctuations within the financial stock market underscore the necessity for investors to employ a comprehensive and reliable approach that integrates risk management strategies, market trends, and the movement trends of individual securities. By evaluating specific data, investors can make more informed decisions. However, the current body of literature lacks substantial evidence supporting the practical efficacy of reinforcement learning (RL) agents, as many models have only demonstrated success in backtesting on historical data. This highlights the urgent need for a more advanced methodology capable of addressing these challenges. There is a significant disconnect in the effective utilization of financial indicators to better understand the potential market trends of individual securities. The disclosure of successful trading strategies is often restricted within financial markets, resulting in a scarcity of widely documented and published strategies leveraging RL. Furthermore, current research frequently overlooks the identification of financial indicators correlated with various market trends and their potential advantages. This research endeavors to address these complexities by enhancing the ability of RL agents to effectively differentiate between positive and negative buy/sell actions using financial indicators. While we do not address all concerns, this paper provides deeper insights and commentary on the utilization of technical indicators and their benefits within reinforcement learning. This work establishes a foundational framework for further exploration and investigation of more complex scenarios.

Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models

2411.07563v1 by Dongrui Han, Mingyu Cui, Jiawen Kang, Xixin Wu, Xunying Liu, Helen Meng

Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it faces ambiguity problems: the same grapheme can represent multiple phonemes depending on context, which poses a challenge for G2P conversion. Inspired by the remarkable success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems with LLMs' in-context knowledge retrieval (ICKR) capabilities are proposed to improve disambiguation. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated thoroughly on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline with weighted average phoneme error rate (PER) reductions of 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR system yields an increase of 3.5% absolute (3.8% relative) on the Librig2p dataset.
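
The paired absolute and relative figures can be related with a line of arithmetic. The sketch below derives the baseline implied by the reported PER reduction; the resulting baseline and improved PER are back-calculated from the two rounded numbers in the abstract, not values stated by the authors, so treat them as approximate.

```python
def implied_baseline(absolute_change: float, relative_change: float) -> float:
    """Baseline implied by a change reported both ways.

    relative_change = absolute_change / baseline,
    so baseline = absolute_change / relative_change.
    """
    return absolute_change / relative_change


# Reported: PER reduced by 2.0 points absolute, 28.9% relative.
baseline_per = implied_baseline(2.0, 0.289)
improved_per = baseline_per - 2.0

print(f"implied baseline PER ~ {baseline_per:.2f}%")
print(f"implied improved PER ~ {improved_per:.2f}%")
```

The same relation applied to the GPT-4 figures (3.5 points absolute, 3.8% relative) implies a baseline near 92%, which suggests that sentence reports an accuracy-style metric rather than an error rate; the abstract does not name the metric.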

EUR/USD Exchange Rate Forecasting incorporating Text Mining Based on Pre-trained Language Models and Deep Learning Methods

2411.07560v1 by Xiangyu Shi, Hongcheng Ding, Salaar Faroog, Deshinta Arrova Dewi, Shamsul Nahar Abdullah, Bahiah A Malek

This study introduces a novel approach for EUR/USD exchange rate forecasting that integrates deep learning, textual analysis, and particle swarm optimization (PSO). By incorporating online news and analysis texts as qualitative data, the proposed PSO-LSTM model demonstrates superior performance compared to traditional econometric and machine learning models. The research employs advanced text mining techniques, including sentiment analysis using the RoBERTa-Large model and topic modeling with LDA. Empirical findings underscore the significant advantage of incorporating textual data, with the PSO-LSTM model outperforming benchmark models such as SVM, SVR, ARIMA, and GARCH. Ablation experiments reveal the contribution of each textual data category to the overall forecasting performance. The study highlights the transformative potential of artificial intelligence in finance and paves the way for future research in real-time forecasting and the integration of alternative data sources.

Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models

2411.07559v1 by Tiejin Chen, Kaishen Wang, Hua Wei

Jailbreaking methods, which induce Multi-modal Large Language Models (MLLMs) to output harmful responses, raise significant safety concerns. Among these methods, gradient-based approaches, which use gradients to generate malicious prompts, have been widely studied due to their high success rates in white-box settings, where full access to the model is available. However, these methods have notable limitations: they require white-box access, which is not always feasible, and involve high memory usage. To address scenarios where white-box access is unavailable, attackers often resort to transfer attacks. In transfer attacks, malicious inputs generated using white-box models are applied to black-box models, but this typically results in reduced attack performance. To overcome these challenges, we propose Zer0-Jack, a method that bypasses the need for white-box access by leveraging zeroth-order optimization. We propose patch coordinate descent to efficiently generate malicious image inputs that directly attack black-box MLLMs, which further reduces memory usage significantly. Through extensive experiments, Zer0-Jack achieves a high attack success rate across various models, surpassing previous transfer-based methods and performing comparably with existing white-box jailbreak techniques. Notably, Zer0-Jack achieves a 95% attack success rate on MiniGPT-4 with the Harmful Behaviors Multi-modal Dataset in a black-box setting, demonstrating its effectiveness. Additionally, we show that Zer0-Jack can directly attack commercial MLLMs such as GPT-4o. Code is provided in the supplement.

Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection

2411.07546v1 by YeongHyeon Park, Myung Jin Kim, Hyeong Seok Kim

A pre-trained visual-language model, contrastive language-image pre-training (CLIP), successfully accomplishes various downstream tasks with text prompts, such as finding images or localizing regions within the image. Despite CLIP's strong multi-modal data capabilities, it remains limited in specialized environments, such as medical applications. For this purpose, many CLIP variants (i.e., BioMedCLIP and MedCLIP-SAMv2) have emerged, but false positives related to normal regions persist. Thus, we pursue a simple yet important goal: reducing false positives in medical anomaly detection. We introduce a Contrastive LAnguage Prompting (CLAP) method that leverages both positive and negative text prompts. This straightforward approach identifies potential lesion regions through visual attention to the positive prompts in the given image. To reduce false positives, we attenuate attention on normal regions using negative prompts. Extensive experiments with the BMAD dataset, including six biomedical benchmarks, demonstrate that the CLAP method enhances anomaly detection performance. Our future plans include developing an automated fine prompting method for more practical usage.
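
The idea of attenuating attention on normal regions can be sketched on toy attention maps. The abstract does not specify the attenuation operation, so the weighted subtraction with clamping at zero used here is an assumption, and the function name, `weight` parameter, and map values are all hypothetical.

```python
def contrastive_anomaly_map(pos_attn, neg_attn, weight=1.0):
    """Combine attention maps from positive and negative text prompts.

    pos_attn: per-pixel attention for a positive (lesion-describing) prompt.
    neg_attn: per-pixel attention for a negative (normal-tissue) prompt.
    Assumed rule: subtract the weighted negative map and clamp at zero.
    """
    return [
        [max(p - weight * n, 0.0) for p, n in zip(pos_row, neg_row)]
        for pos_row, neg_row in zip(pos_attn, neg_attn)
    ]


# Toy 2x2 maps (illustrative values only).
pos = [[0.9, 0.6],
       [0.2, 0.5]]
neg = [[0.1, 0.7],
       [0.1, 0.5]]

anomaly = contrastive_anomaly_map(pos, neg)
for row in anomaly:
    print(row)
```

In this toy case the top-right pixel responds to the positive prompt but even more strongly to the negative one, so it is suppressed instead of being reported as a lesion, which is the false-positive pattern the method targets.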

Model Stealing for Any Low-Rank Language Model

2411.07536v1 by Allen Liu, Ankur Moitra

Model stealing, where a learner tries to recover an unknown model via carefully chosen queries, is a critical problem in machine learning, as it threatens the security of proprietary models and the privacy of data they are trained on. In recent years, there has been particular interest in stealing large language models (LLMs). In this paper, we aim to build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting. We study model stealing for Hidden Markov Models (HMMs), and more generally low-rank language models. We assume that the learner works in the conditional query model, introduced by Kakade, Krishnamurthy, Mahajan and Zhang. Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution. In other words, our algorithm succeeds at stealing any language model whose output distribution is low-rank. This improves upon the previous result by Kakade, Krishnamurthy, Mahajan and Zhang, which also requires the unknown distribution to have high "fidelity", a property that holds only in restricted cases. There are two key insights behind our algorithm: First, we represent the conditional distributions at each timestep by constructing barycentric spanners among a collection of vectors of exponentially large dimension. Second, for sampling from our representation, we iteratively solve a sequence of convex optimization problems that involve projection in relative entropy to prevent compounding of errors over the length of the sequence. This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.

Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning

2411.07533v1 by Linyang He, Ercong Nie, Helmut Schmid, Hinrich Schütze, Nima Mesgarani, Jonathan Brennan

This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM evaluation paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent LLMs' true linguistic capabilities. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. Our contributions are three-fold: (1) We compare neurolinguistic and psycholinguistic methods, revealing distinct patterns in LLM assessment; (2) We demonstrate that LLMs exhibit higher competence in form compared to meaning, with the latter largely correlated to the former; (3) We present new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.

Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis

2411.07529v1 by Minda Li, Bhaskar Krishnamachari

ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform with algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we conduct automated experiments using Python scripts to generate prompts that instruct ChatGPT to create Python solutions. These solutions are stored and manually submitted on LeetCode to check their correctness. For Hypothesis 1, results show the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought Prompting, 38-60% by providing failed test cases in a second feedback prompt, and 33-58% by switching to GPT-4. From a random subset of problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.

SecEncoder: Logs are All You Need in Security

2411.07528v1 by Muhammed Fatih Bulut, Yingqi Liu, Naveed Ahmad, Maximilian Turner, Sami Ait Ouahmane, Cameron Andrews, Lloyd Greenwald

Large and Small Language Models (LMs) are typically pretrained using extensive volumes of text, which are sourced from publicly accessible platforms such as Wikipedia, Book Corpus, or through web scraping. These models, due to their exposure to a wide range of language data, exhibit impressive generalization capabilities and can perform a multitude of tasks simultaneously. However, they often fall short when it comes to domain-specific tasks due to their broad training data. This paper introduces SecEncoder, a specialized small language model that is pretrained using security logs. SecEncoder is designed to address the domain-specific limitations of general LMs by focusing on the unique language and patterns found in security logs. Experimental results indicate that SecEncoder outperforms other LMs, such as BERT-large, DeBERTa-v3-large and OpenAI's Embedding (text-embedding-ada-002) models, which are pretrained mainly on natural language, across various tasks. Furthermore, although SecEncoder is primarily pretrained on log data, it outperforms models pretrained on natural language for a range of tasks beyond log analysis, such as incident prioritization and threat intelligence document retrieval. This suggests that domain-specific pretraining with logs can significantly enhance the performance of LMs in security. These findings pave the way for future research into security-specific LMs and their potential applications.

摘要:大型和小型语言模型 (LM) 通常使用从维基百科、语料库或网络抓取等公开平台获取的大量文本进行预训练。这些模型由于接触了广泛的语言数据,因此表现出令人印象深刻的泛化能力,并且可以同时执行多项任务。然而,由于其广泛的训练数据,它们在执行特定于领域的特定任务时往往会表现不佳。本文介绍了 SecEncoder,这是一种使用安全日志进行预训练的专门的小型语言模型。SecEncoder 旨在通过关注安全日志中发现的独特语言和模式来解决通用 LM 的特定领域限制。实验结果表明,SecEncoder 优于其他 LM,例如 BERTlarge、DeBERTa-v3-large 和 OpenAI 的嵌入(textembedding-ada-002)模型,这些模型主要在自然语言上进行预训练,并且适用于各种任务。此外,尽管 SecEncoder 主要在日志数据上进行预训练,但它在日志分析之外的一系列任务(例如事件优先级和威胁情报文档检索)上都优于在自然语言上进行预训练的模型。这表明使用日志进行特定领域预训练可以显着增强 LM 在安全方面的性能。这些发现为未来对特定于安全性的 LM 及其潜在应用的研究铺平了道路。

Prompt-enhanced Network for Hateful Meme Classification

2411.07527v1 by Junxi Liu, Yanyan Feng, Jiehai Chen, Yun Xue, Fenghuan Li

The dynamic expansion of social media has led to an inundation of hateful memes on media platforms, accentuating the growing need for efficient identification and removal. Acknowledging the constraints of conventional multimodal hateful meme classification, which heavily depends on external knowledge and poses the risk of including irrelevant or redundant content, we developed Pen -- a prompt-enhanced network framework based on the prompt learning approach. Specifically, after constructing the sequence through the prompt method and encoding it with a language model, we performed region information global extraction on the encoded sequence for multi-view perception. By capturing global information about inference instances and demonstrations, Pen facilitates category selection by fully leveraging sequence information. This approach significantly improves model classification accuracy. Additionally, to bolster the model's reasoning capabilities in the feature space, we introduced prompt-aware contrastive learning into the framework to improve the quality of sample feature distributions. Through extensive ablation experiments on two public datasets, we evaluate the effectiveness of the Pen framework, concurrently comparing it with state-of-the-art model baselines. Our research findings highlight that Pen surpasses manual prompt methods, showcasing superior generalization and classification accuracy in hateful meme classification tasks. Our code is available at https://github.com/juszzi/Pen.

摘要:社群媒體的動態擴張導致媒體平台上充斥著仇恨迷因,凸顯出對有效識別和移除的需求日益增長。承認傳統多模態仇恨迷因分類的限制,它過度依賴外部知識,並有包含不相關或重複內容的風險,我們開發了 Pen ——一種基於提示學習方法的提示增強網路架構。具體來說,在通過提示方法建構序列並使用語言模型對其編碼後,我們對編碼序列執行區域資訊全局提取,以進行多視角感知。透過擷取關於推理實例和示範的全局資訊,Pen 能夠充分利用序列資訊來促進類別選擇。此方法顯著提高了模型分類準確度。此外,為了加強模型在特徵空間中的推理能力,我們將提示感知對比學習引入框架,以提高樣本特徵分佈的品質。透過在兩個公開資料集上進行廣泛的消融實驗,我們評估了 Pen 框架的有效性,並同時將其與最先進的模型基準進行比較。我們的研究結果強調,Pen 超越了手動提示方法,在仇恨迷因分類任務中展現出優異的泛化能力和分類準確度。我們的程式碼可於 https://github.com/juszzi/Pen 取得。

Fair Summarization: Bridging Quality and Diversity in Extractive Summaries

2411.07521v1 by Sina Bagheri Nezhad, Sayan Bandyapadhyay, Ameeta Agrawal

Fairness in multi-document summarization of user-generated content remains a critical challenge in natural language processing (NLP). Existing summarization methods often fail to ensure equitable representation across different social groups, leading to biased outputs. In this paper, we introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints. We evaluate these methods using the Divsumm summarization dataset of White-aligned, Hispanic, and African-American dialect tweets and compare them against relevant baselines. The results, obtained using a comprehensive set of summarization quality metrics such as SUPERT, BLANC, SummaQA, BARTScore, and UniEval, as well as a fairness metric F, demonstrate that FairExtract and FairGPT achieve superior fairness while maintaining competitive summarization quality. Additionally, we introduce composite metrics (e.g., SUPERT+F, BLANC+F) that integrate quality and fairness into a single evaluation framework, offering a more nuanced understanding of the trade-offs between these objectives. This work highlights the importance of fairness in summarization and sets a benchmark for future research in fairness-aware NLP models.

摘要:多文件摘要中用户生成内容的公平性仍然是自然语言处理 (NLP) 中的一项重大挑战。现有的摘要方法通常无法确保不同社会群体的公平代表性,从而导致输出有偏差。在本文中,我们介绍了两种用于公平提取摘要的新方法:基于聚类的 FairExtract 方法和利用具有公平性约束的 GPT-3.5-turbo 的 FairGPT 方法。我们使用 Divsumm 摘要数据集(包含白人、西班牙裔和非裔美国人方言推文)评估了这些方法,并将它们与相关的基线进行了比较。使用一组全面的摘要质量指标(例如 SUPERT、BLANC、SummaQA、BARTScore 和 UniEval)以及公平性指标 F 获得的结果表明,FairExtract 和 FairGPT 在保持有竞争力的摘要质量的同时实现了卓越的公平性。此外,我们引入了复合指标(例如 SUPERT+F、BLANC+F),将质量和公平性整合到一个评估框架中,从而更细致地理解这些目标之间的权衡。这项工作强调了公平性在摘要中的重要性,并为公平性感知 NLP 模型的未来研究设定了基准。

TIPS: Threat Actor Informed Prioritization of Applications using SecEncoder

2411.07519v1 by Muhammed Fatih Bulut, Acar Tamersoy, Naveed Ahmad, Yingqi Liu, Lloyd Greenwald

This paper introduces TIPS: Threat Actor Informed Prioritization using SecEncoder, a specialized language model for security. TIPS combines the strengths of both encoder and decoder language models to detect and prioritize compromised applications. By integrating threat actor intelligence, TIPS enhances the accuracy and relevance of its detections. Extensive experiments with a real-world benchmark dataset of applications demonstrate TIPS's high efficacy, achieving an F-1 score of 0.90 in identifying malicious applications. Additionally, in real-world scenarios, TIPS significantly reduces the backlog of investigations for security analysts by 87%, thereby streamlining the threat response process and improving overall security posture.

摘要:本文介紹 TIPS:威脅行為者資訊優先順序,使用 SecEncoder,一種專門用於安全性的語言模型。TIPS 結合編碼器和解碼器語言模型的優點,以偵測和優先處理受入侵的應用程式。透過整合威脅行為者情報,TIPS 提升其偵測的準確性和相關性。使用真實世界基準資料集的應用程式的廣泛實驗證明了 TIPS 的高效率,在識別惡意應用程式時達到 0.90 的 F-1 分數。此外,在真實世界場景中,TIPS 將安全分析師的調查積壓減少了 87%,從而簡化了威脅應變程序並改善整體安全態勢。

LLM App Squatting and Cloning

2411.07518v1 by Yinglin Xie, Xinyi Hou, Yanjie Zhao, Kai Chen, Haoyu Wang

Impersonation tactics, such as app squatting and app cloning, have posed longstanding challenges in mobile app stores, where malicious actors exploit the names and reputations of popular apps to deceive users. With the rapid growth of Large Language Model (LLM) stores like GPT Store and FlowGPT, these issues have similarly surfaced, threatening the integrity of the LLM app ecosystem. In this study, we present the first large-scale analysis of LLM app squatting and cloning using our custom-built tool, LLMappCrazy. LLMappCrazy covers 14 squatting generation techniques and integrates Levenshtein distance and BERT-based semantic analysis to detect cloning by analyzing app functional similarities. Using this tool, we generated variations of the top 1000 app names and found over 5,000 squatting apps in the dataset. Additionally, we observed 3,509 squatting apps and 9,575 cloning cases across six major platforms. After sampling, we find that 18.7% of the squatting apps and 4.9% of the cloning apps exhibited malicious behavior, including phishing, malware distribution, fake content dissemination, and aggressive ad injection.

摘要:冒充策略,例如應用程式搶註和應用程式複製,已對行動應用程式商店構成長期的挑戰,惡意行為者利用熱門應用程式的名稱和聲譽來欺騙使用者。隨著大型語言模型 (LLM) 商店,例如 GPT Store 和 FlowGPT 的快速成長,這些問題也隨之浮現,威脅到 LLM 應用程式生態系統的完整性。在這項研究中,我們使用自訂建置的工具 LLMappCrazy,針對 LLM 應用程式搶註和複製進行首次大規模分析。LLMappCrazy 涵蓋 14 種搶註產生技術,並整合 Levenshtein 距離和基於 BERT 的語意分析,透過分析應用程式功能相似性來偵測複製。使用此工具,我們產生前 1000 個應用程式名稱的變體,並在資料集中發現超過 5,000 個搶註應用程式。此外,我們在六個主要平台上觀察到 3,509 個搶註應用程式和 9,575 個複製案例。在抽樣後,我們發現 18.7% 的搶註應用程式和 4.9% 的複製應用程式表現出惡意行為,包括網路釣魚、惡意軟體散布、假內容散布和強制廣告植入。
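The abstract names Levenshtein distance as one of the signals LLMappCrazy uses. The sketch below shows, under stated assumptions, how an edit-distance check can flag app names that sit suspiciously close to a popular name. The function names, the `max_distance` threshold, and the sample app names are hypothetical and are not taken from the paper; the tool's 14 generation techniques and its BERT-based semantic analysis are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def squatting_candidates(target, names, max_distance=2):
    """Return (name, distance) pairs close to, but not equal to, the target.

    max_distance is an illustrative threshold, not a value from the paper.
    """
    target_lower = target.lower()
    hits = []
    for name in names:
        distance = levenshtein(target_lower, name.lower())
        if 0 < distance <= max_distance:
            hits.append((name, distance))
    return sorted(hits, key=lambda pair: pair[1])


if __name__ == "__main__":
    store = ["ChatGPT", "ChatGTP", "ChatGPT4", "Weather Bot"]
    print(squatting_candidates("ChatGPT", store))
```

A pure edit-distance filter only catches typographic variants; names that imitate a popular app semantically would need the kind of embedding-based comparison the abstract attributes to the BERT component.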

SparrowVQE: Visual Question Explanation for Course Content Understanding

2411.07516v1 by Jialu Li, Manish Kumar Thota, Ruslan Gokhman, Radek Holik, Youshan Zhang

Visual Question Answering (VQA) research seeks to create AI systems that answer natural language questions about images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations rather than brief responses and addresses the need for more complex interaction with visual content. We first created an MLVQE dataset from a 14-week streamed video machine learning course, including 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed a novel SparrowVQE, a small 3-billion-parameter multimodal model. We trained our model with a three-stage training mechanism consisting of multimodal pre-training (slide images and transcripts feature alignment), instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and domain fine-tuning (fine-tuning slide image and QA pairs). Eventually, our SparrowVQE can understand and connect visual information using the SigLIP model with transcripts using the Phi-2 language model with an MLP adapter. Experimental results demonstrate that our SparrowVQE achieves better performance in our developed MLVQE dataset and outperforms state-of-the-art methods in the other five benchmark VQA datasets. The source code is available at https://github.com/YoushanZhang/SparrowVQE.

摘要:視覺問答 (VQA) 研究致力於建立 AI 系統,以回答圖像中的自然語言問題,但 VQA 方法通常會產生過於簡化且簡短的答案。本文旨在透過引入視覺問題解釋 (VQE) 來推動該領域的進步,VQE 增強了 VQA 提供詳細解釋而非簡短回應的能力,並滿足了與視覺內容進行更複雜互動的需求。我們首先從 14 週串流影片機器學習課程中建立了 MLVQE 資料集,其中包含 885 張投影片圖片、110,407 個字的逐字稿和 9,416 個設計好的問答 (QA) 對。接下來,我們提出了一種新穎的 SparrowVQE,這是一個具有 30 億個參數的多模態模型。我們使用三階段訓練機制訓練我們的模型,包括多模態預訓練(投影片圖片和逐字稿特徵對齊)、指令微調(使用逐字稿和 QA 對微調預訓練模型)和領域微調(微調投影片圖片和 QA 對)。最終,我們的 SparrowVQE 能夠使用 SigLIP 模型理解和連結視覺資訊,並使用 Phi-2 語言模型和 MLP 適配器使用逐字稿。實驗結果證明,我們的 SparrowVQE 在我們開發的 MLVQE 資料集中取得了更好的效能,並在其他五個基準 VQA 資料集中優於最先進的方法。原始碼可在 \url{https://github.com/YoushanZhang/SparrowVQE} 取得。

An Attack Traffic Identification Method Based on Temporal Spectrum

2411.07510v1 by Wenwei Xie, Jie Yin, Zihao Chen

To address the issues of insufficient robustness, unstable features, and data noise interference in existing network attack detection and identification models, this paper proposes an attack traffic detection and identification method based on temporal spectrum. First, traffic data is segmented by a sliding window to construct a feature sequence and a corresponding label sequence for network traffic. Next, the proposed spectral label generation methods, SSPE and COAP, are applied to transform the label sequence into spectral labels and the feature sequence into temporal features. Spectral labels and temporal features are used to capture and represent behavioral patterns of attacks. Finally, the constructed temporal features and spectral labels are used to train models, which subsequently detect and identify network attack behaviors. Experimental results demonstrate that, compared to traditional methods, models trained with the SSPE or COAP method improve identification accuracy by 10% and exhibit strong robustness, particularly in noisy environments.

摘要:為了解決現有網路攻擊偵測與識別模型中,魯棒性不足、特徵不穩定、資料雜訊干擾等問題,本文提出基於時域頻譜的攻擊流量偵測與識別方法。首先,透過滑動視窗將流量資料進行分段,建構網路流量的特徵序列與對應標籤序列。接著,應用所提出的頻譜標籤產生方法 SSPE 與 COAP,將標籤序列轉換為頻譜標籤,並將特徵序列轉換為時域特徵。頻譜標籤與時域特徵用於擷取與表示攻擊的行為模式。最後,將建構的時域特徵與頻譜標籤用於模型訓練,後續偵測與識別網路攻擊行為。實驗結果顯示,與傳統方法相比,使用 SSPE 或 COAP 方法訓練的模型,識別準確度提升 10%,且展現強大的魯棒性,特別是在雜訊環境中。
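The first step described in the abstract, segmenting traffic with a sliding window into a feature sequence and a matching label sequence, can be sketched as follows. This is a minimal illustration only: the record layout (a feature vector paired with a label), the `window_size` and `stride` parameters, and the function name are assumptions, and the SSPE and COAP transforms are not shown because the abstract does not define them.

```python
def sliding_windows(records, window_size, stride):
    """Split a traffic trace into overlapping windows.

    records: list of (feature_vector, label) tuples in time order
             (hypothetical layout, not specified in the abstract).
    Returns a list of (feature_sequence, label_sequence) pairs, one per window.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    last_start = len(records) - window_size
    for start in range(0, last_start + 1, stride):
        chunk = records[start:start + window_size]
        feature_sequence = [features for features, _ in chunk]
        label_sequence = [label for _, label in chunk]
        windows.append((feature_sequence, label_sequence))
    return windows


if __name__ == "__main__":
    # Toy trace: one numeric feature per record, label 1 marks attack traffic.
    trace = [([0.1], 0), ([0.4], 0), ([0.9], 1), ([0.8], 1), ([0.2], 0)]
    for features, labels in sliding_windows(trace, window_size=3, stride=1):
        print(features, labels)
```

Trailing records that do not fill a complete window are dropped in this sketch; padding them instead would be an equally reasonable choice.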

FM-TS: Flow Matching for Time Series Generation

2411.07506v1 by Yang Hu, Xiao Wang, Lirong Wu, Huatian Zhang, Stan Z. Li, Sheng Wang, Tianlong Chen

Time series generation has emerged as an essential tool for analyzing temporal data across numerous fields. While diffusion models have recently gained significant attention in generating high-quality time series, they tend to be computationally demanding and reliant on complex stochastic processes. To address these limitations, we introduce FM-TS, a rectified Flow Matching-based framework for Time Series generation, which simplifies the time series generation process by directly optimizing continuous trajectories. This approach avoids the need for iterative sampling or complex noise schedules typically required in diffusion-based models. FM-TS is more efficient in terms of training and inference. Moreover, FM-TS is highly adaptive, supporting both conditional and unconditional time series generation. Notably, through our novel inference design, the model trained in an unconditional setting can seamlessly generalize to conditional tasks without the need for retraining. Extensive benchmarking across both settings demonstrates that FM-TS consistently delivers superior performance compared to existing approaches while being more efficient in terms of training and inference. For instance, in terms of discriminative score, FM-TS achieves 0.005, 0.019, 0.011, 0.005, 0.053, and 0.106 on the Sines, Stocks, ETTh, MuJoCo, Energy, and fMRI unconditional time series datasets, respectively, significantly outperforming the second-best method which achieves 0.006, 0.067, 0.061, 0.008, 0.122, and 0.167 on the same datasets. We have achieved superior performance in solar forecasting and MuJoCo imputation tasks, significantly enhanced by our innovative $t$ power sampling method. The code is available at https://github.com/UNITES-Lab/FMTS.

摘要:時序生成已成為分析各領域中時間資料的重要工具。儘管擴散模型最近在生成高品質時序方面獲得顯著關注,但它們往往需要大量的運算,並依賴於複雜的隨機過程。為了解決這些限制,我們引入了 FM-TS,一個基於修正流匹配的時序生成框架,透過直接最佳化連續軌跡來簡化時序生成過程。此方法避免了在基於擴散的模型中通常需要的反覆抽樣或複雜雜訊排程。FM-TS 在訓練和推論方面更有效率。此外,FM-TS 具有高度適應性,支援條件式和非條件式時序生成。值得注意的是,透過我們新穎的推論設計,在非條件式設定中訓練的模型可以無縫地推廣到條件式任務,而無需重新訓練。在兩種設定中的廣泛基準測試證明,與現有方法相比,FM-TS 持續提供優異的效能,同時在訓練和推論方面更有效率。例如,在判別分數方面,FM-TS 分別在 Sines、Stocks、ETTh、MuJoCo、Energy 和 fMRI 非條件式時序資料集上達到 0.005、0.019、0.011、0.005、0.053 和 0.106,顯著優於在相同資料集上達到 0.006、0.067、0.061、0.008、0.122 和 0.167 的第二佳方法。我們在太陽能預測和 MuJoCo 插補任務中取得了優異的效能,這得益於我們創新的 $t$ 次方抽樣方法。程式碼可在 https://github.com/UNITES-Lab/FMTS 取得。
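The abstract describes FM-TS as a rectified flow matching framework that optimizes continuous trajectories directly, but it does not spell out the equations. The sketch below uses the standard rectified-flow formulation as an assumption: a straight-line path x_t = (1 - t) x_0 + t x_1 between a noise sample x_0 and a data sample x_1, with regression target x_1 - x_0, and Euler integration at inference. The function names are hypothetical, the learned network is replaced by an oracle velocity, and the paper's t power sampling is not reproduced.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity model along the straight path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.0, 0.0, 0.0]   # stand-in for a Gaussian noise draw
    series = [1.0, 0.5, -0.5]  # stand-in for one short time series
    oracle = target_velocity(noise, series)
    # With the exact (constant) velocity, Euler integration lands on the data.
    print(euler_sample(noise, lambda x, t: oracle, steps=4))
```

In training, a network would be fit to predict `target_velocity` at points produced by `interpolate`; because the target path is straight, few integration steps can suffice, which is consistent with the efficiency claim in the abstract.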

LAUREL: Learned Augmented Residual Layer

2411.07501v1 by Gaurav Menghani, Ravi Kumar, Sanjiv Kumar

One of the core pillars of efficient deep learning methods is architectural improvements such as the residual/skip connection, which has led to significantly better model convergence and quality. Since then the residual connection has become ubiquitous in not just convolutional neural networks but also transformer-based architectures, the backbone of LLMs. In this paper we introduce Learned Augmented Residual Layer (LAuReL) -- a novel generalization of the canonical residual connection -- with the goal to be an in-situ replacement of the latter while outperforming on both model quality and footprint metrics. Our experiments show that using LAuReL can help boost performance for both vision and language models. For example, on the ResNet-50, ImageNet 1K task, it achieves 60% of the gains from adding an extra layer, while only adding 0.003% more parameters, and matches it while adding 2.6× fewer parameters.

摘要:高效深度學習方法的核心支柱之一是架構改進,例如殘差/跳躍連接,這已導致模型收斂性和品質顯著提升。從那時起,殘差連接不僅在卷積神經網路中普遍存在,也在基於轉換器的架構中普遍存在,後者是 LLM 的骨幹。 在本文中,我們介紹了「學習增強殘差層」(LAuReL) -- 標準殘差連接的新穎概括 -- 目標是在模型品質和佔用空間指標上都優於後者,同時成為後者的原位替換。我們的實驗表明,使用 \laurel 可以幫助提升視覺和語言模型的效能。例如,在 ResNet-50、ImageNet 1K 任務上,它達到了增加一層的 $60\%$ 收益,同時只增加了 $0.003\%$ 的參數,並在增加 $2.6\times$ 更少參數的情況下與其匹配。

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

2411.07494v1 by Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma

As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of the proliferation model and the number of proliferated examples play a key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse.

摘要:隨著大型語言模型(LLM)變得越來越強大,確保它們不會被濫用變得至關重要。儘管研究人員專注於開發強大的防禦措施,但目前還沒有任何方法能完全抵禦攻擊。我們提出了一種替代方法:不是尋求完美的對抗性穩健性,而是開發快速的應對技術,在僅觀察到少數攻擊後就能阻止整類越獄。為了研究這種設定,我們開發了 RapidResponseBench,這是一個基準測試,用於衡量防禦措施在適應少數觀察到的範例後對各種越獄策略的穩健性。我們評估了五種快速應對方法,所有這些方法都使用越獄擴散,我們自動生成與觀察到的範例類似的其他越獄。我們最強大的方法是微調輸入分類器以阻止擴散的越獄,它將攻擊成功率降低了 240 倍以上,在分佈式越獄集合上,以及在觀察到每個越獄策略的一個範例後,在非分佈式集合上降低了 15 倍以上。此外,進一步的研究表明,擴散模型的品質和擴散範例的數量在這種防禦措施的有效性中扮演了關鍵角色。總的來說,我們的結果突顯了快速應對新型越獄以限制 LLM 濫用的潛力。

Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models

2411.07474v1 by Daria Kryvosheieva, Roger Levy

Language models (LMs) are capable of acquiring elements of human-like syntactic knowledge. Targeted syntactic evaluation tests have been employed to measure how well they form generalizations about syntactic phenomena in high-resource languages such as English. However, we still lack a thorough understanding of LMs' capacity for syntactic generalizations in low-resource languages, which are responsible for much of the diversity of syntactic patterns worldwide. In this study, we develop targeted syntactic evaluation tests for three low-resource languages (Basque, Hindi, and Swahili) and use them to evaluate five families of open-access multilingual Transformer LMs. We find that some syntactic tasks prove relatively easy for LMs while others (agreement in sentences containing indirect objects in Basque, agreement across a prepositional phrase in Swahili) are challenging. We additionally uncover issues with publicly available Transformers, including a bias toward the habitual aspect in Hindi in multilingual BERT and underperformance compared to similar-sized models in XGLM-4.5B.

摘要:語言模型 (LM) 能夠習得類似人類的語法知識元素。目標語法評量測試已被用來衡量他們在高資源語言(例如英語)中對語法現象的概括能力。然而,我們仍然缺乏對 LM 在低資源語言中進行語法概括的能力的透徹了解,而低資源語言正是造成全球語法模式多樣性的主要原因。在本研究中,我們針對三種低資源語言(巴斯克語、印地語和斯瓦希里語)開發了目標語法評量測試,並使用它們來評量五個開放式多語言 Transformer LM 家族。我們發現,某些語法任務對 LM 來說相對容易,而其他任務(包含巴斯克語間接受詞的句子中的一致性、斯瓦希里語介系詞短語中的一致性)則具有挑戰性。我們另外揭露了公開可用的 Transformer 的問題,包括多語言 BERT 中對印地語習慣體的偏誤,以及與 XGLM-4.5B 中大小相似的模型相比表現不佳。

IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark

2411.07466v1 by Kawshik Manikantan, Makarand Tapaswi, Vineet Gandhi, Shubham Toshniwal

Recent evaluations of LLMs on coreference resolution have revealed that traditional output formats and evaluation metrics do not fully capture the models' referential understanding. To address this, we introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format, commonly used for evaluating LLMs. IdentifyMe features long narratives and employs heuristics to exclude easily identifiable mentions, creating a more challenging task. The benchmark also consists of a curated mixture of different mention types and corresponding entities, allowing for a fine-grained analysis of model performance. We evaluate both closed- and open-source LLMs on IdentifyMe and observe a significant performance gap (20-30%) between state-of-the-art sub-10B open models and closed ones. We observe that pronominal mentions, which have limited surface information, are typically much harder for models to resolve than nominal mentions. Additionally, we find that LLMs often confuse entities when their mentions overlap in nested structures. The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs while also indicating room for further improvement.

摘要:最近對大型語言模型 (LLM) 關於共同指稱消解的評估顯示,傳統的輸出格式和評估指標並未完全掌握模型的指稱理解。為了解決這個問題,我們引入了 IdentifyMe,這是一個以多選題 (MCQ) 格式呈現的提及消解新基準,通常用於評估 LLM。IdentifyMe 採用長篇敘事,並使用啟發法排除容易識別的提及,創造更具挑戰性的任務。此基準還包含經過整理的不同提及類型和對應實體的混合,允許對模型效能進行細緻的分析。我們在 IdentifyMe 上評估閉源和開源 LLM,並觀察到最先進的低於 10B 開放模型與閉源模型之間有顯著的效能差距 (20-30%)。我們觀察到,表面資訊有限的代名詞提及通常比名詞提及更難讓模型解析。此外,我們發現當 LLM 的提及在巢狀結構中重疊時,它們經常會混淆實體。得分最高的模型 GPT-4o 達到了 81.9% 的準確度,突顯了最先進 LLM 強大的指稱能力,同時也表示仍有進步的空間。

BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks

2411.07464v1 by Shubham Gandhi, Manasi Patwardhan, Lovekesh Vig, Gautam Shroff

Large Language Models (LLMs) excel in diverse applications including generation of code snippets, but often struggle with generating code for complex Machine Learning (ML) tasks. Although existing LLM single-agent based systems give varying performance depending on the task complexity, they rely purely on larger and expensive models such as GPT-4. Our investigation reveals that no-cost and low-cost models such as Gemini-Pro, Mixtral and CodeLlama perform far worse than GPT-4 in a single-agent setting. With the motivation of developing a cost-efficient LLM-based solution for solving ML tasks, we propose an LLM Multi-Agent based system which leverages a combination of experts using profiling, efficient retrieval of past observations, LLM cascades, and ask-the-expert calls. Through empirical analysis on ML engineering tasks in the MLAgentBench benchmark, we demonstrate the effectiveness of our system, using no-cost models, namely Gemini as the base LLM, paired with GPT-4 in the cascade and as the expert serving occasional ask-the-expert calls for planning. With a 94.2% reduction in cost (from $0.931 per run, averaged over all tasks for the GPT-4 single-agent system, to $0.054), our system yields a better average success rate of 32.95%, compared to the GPT-4 single-agent system's 22.72% success rate averaged over all the tasks of MLAgentBench.

摘要:大型語言模型(LLM)在各種應用中表現出色,包括產生程式碼片段,但常常在產生複雜機器學習(ML)任務的程式碼時遇到困難。儘管現有的 LLM 單一代理人系統會根據任務複雜度提供不同的效能,但它們完全依賴於較大且昂貴的模型,例如 GPT-4。我們的調查顯示,在單一代理人設定中,無成本和低成本模型(例如 Gemini-Pro、Mixtral 和 CodeLlama)的效能遠低於 GPT-4。在開發一種成本效益高的基於 LLM 的解決方案以解決 ML 任務的動機下,我們提出了一個基於 LLM 多代理人的系統,該系統利用專家組合,使用剖析、有效擷取過去的觀察結果、LLM 串接,以及尋求專家建議的呼叫。透過對 MLAgentBench 基準中的 ML 工程任務進行實證分析,我們展示了我們系統的有效性,使用無成本模型,即 Gemini 作為基礎 LLM,與 GPT-4 串接,並讓專家負責偶爾尋求專家建議的呼叫以進行規劃。我們的系統成本降低了 94.2%(從 GPT-4 單一代理人系統所有任務的平均執行成本 0.931 美元降低到 0.054 美元),能夠產生更好的平均成功率 32.95%,而 GPT-4 單一代理人系統在 MLAgentBench 的所有任務中平均產生 22.72% 的成功率。
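The headline numbers in the BudgetMLAgent abstract can be checked with a few lines of arithmetic. The sketch below recomputes the cost reduction from the two per-run costs and expresses the success-rate difference both in percentage points and as a ratio; the helper name and variable names are illustrative, while the four input figures come from the abstract.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Figures quoted in the abstract (average over all MLAgentBench tasks).
gpt4_cost_per_run = 0.931     # USD, GPT-4 single-agent system
system_cost_per_run = 0.054   # USD, proposed multi-agent system
gpt4_success_rate = 22.72     # percent
system_success_rate = 32.95   # percent

cost_reduction = percent_reduction(gpt4_cost_per_run, system_cost_per_run)
success_gain_points = system_success_rate - gpt4_success_rate
success_ratio = system_success_rate / gpt4_success_rate

print(f"cost reduction: {cost_reduction:.1f}%")           # 94.2%
print(f"success gain:   {success_gain_points:.2f} points")  # 10.23 points
print(f"success ratio:  {success_ratio:.2f}x")            # 1.45x
```

The recomputed reduction, (0.931 - 0.054) / 0.931, rounds to the 94.2% stated in the abstract.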

BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions

2411.07461v1 by Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu

We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale

摘要:我們介紹 BLIP3-KALE,一個包含 2.18 億張圖片文字對應的資料集,它縮小了描述性合成標題和事實性的網路規模 alt 文字之間的差距。KALE 使用網路規模的 alt 文字來擴充合成密集圖像標題,以產生有事實根據的圖像標題。我們的兩階段方法利用大型視覺語言模型和語言模型來建立知識擴充標題,然後用於訓練專門的 VLM 以擴充資料集。我們在 KALE 上訓練視覺語言模型,並展示在視覺語言任務上的改進。我們的實驗顯示了 KALE 在訓練更強大且知識豐富的多模態模型方面的實用性。我們在 https://huggingface.co/datasets/Salesforce/blip3-kale 上發布 KALE 資料集

DecoPrompt : Decoding Prompts Reduces Hallucinations when Large Language Models Meet False Premises

2411.07457v1 by Nan Xu, Xuezhe Ma

While large language models (LLMs) have demonstrated increasing power, they have also prompted studies of their hallucinated outputs, which deviate from factually correct statements. In this paper, we focus on one important scenario of false premises, where LLMs are distracted by misaligned claims even though the model possesses the factual knowledge required to answer the original questions accurately. Inspired by the observation that the entropy of a false-premise prompt is closely related to its likelihood of eliciting hallucinated generation, we propose a new prompting algorithm, named DecoPrompt, to mitigate hallucination. DecoPrompt leverages LLMs to "decode" the false-premise prompts without really eliciting hallucination output from LLMs. We perform experiments on two datasets, demonstrating that DecoPrompt can reduce hallucinations effectively on outputs from different LLMs. Moreover, DecoPrompt exhibits cross-model transferability, which facilitates its applications to scenarios such as LLMs of large sizes or unavailable model logits.

摘要:儘管大型語言模型(LLM)已展現出越來越強大的能力,但它們也需要針對其虛構輸出進行研究,這些輸出偏離了事實正確的陳述。在本文中,我們專注於一個錯誤前提的重要場景,在該場景中,LLM 會被錯誤的說法分散注意力,儘管該模型具備準確回答原始問題所需的實際知識。受虛假前提提示的熵與其引發幻覺產生的可能性密切相關的觀察結果啟發,我們提出了一種名為 DecoPrompt 的新提示演算法,以減輕幻覺。DecoPrompt 利用 LLM 來「解碼」錯誤前提提示,而不會真正引發 LLM 的幻覺輸出。我們在兩個資料集上執行實驗,證明 DecoPrompt 可以有效減少不同 LLM 輸出中的幻覺。此外,DecoPrompt 展現出跨模型的可轉移性,這有助於其應用於大型 LLM 或不可用模型邏輯值等場景。
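The abstract links the entropy of a false-premise prompt to its likelihood of eliciting hallucination, without stating how that entropy is computed. As a rough illustration, the sketch below assumes that a model exposes a next-token probability distribution at each prompt position and averages the Shannon entropy over positions. The function names, the averaging choice, and the idea of preferring the lowest-entropy candidate among rewrites are all assumptions made for illustration, not DecoPrompt's documented procedure.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_prompt_entropy(token_distributions):
    """Average per-position entropy over a prompt.

    token_distributions: one probability list per prompt token, as a
    language model might produce (hypothetical input format).
    """
    if not token_distributions:
        raise ValueError("prompt has no tokens")
    total = sum(shannon_entropy(dist) for dist in token_distributions)
    return total / len(token_distributions)


def pick_lowest_entropy(candidates):
    """Return the candidate prompt whose distributions have the lowest mean entropy.

    candidates: dict mapping prompt text to its token distributions.
    Selecting by minimum entropy is an assumption, not the paper's stated rule.
    """
    return min(candidates, key=lambda prompt: mean_prompt_entropy(candidates[prompt]))


if __name__ == "__main__":
    toy = {
        "prompt A": [[0.25, 0.25, 0.25, 0.25], [0.5, 0.5]],  # uncertain
        "prompt B": [[0.97, 0.01, 0.01, 0.01], [0.9, 0.1]],  # confident
    }
    print(pick_lowest_entropy(toy))
```

Because the abstract notes that DecoPrompt transfers across models, such scores could in principle be computed with a smaller model whose logits are available, but that workflow is not detailed in the abstract.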

Research on fault diagnosis of nuclear power first-second circuit based on hierarchical multi-granularity classification network

2411.07453v1 by Jiangwen Chen, Siwei Li, Guo Jiang, Cheng Dongzhen, Lin Hua, Wang Wei

The safe and reliable operation of complex electromechanical systems in nuclear power plants is crucial for the safe production of nuclear power plants and their nuclear power units. Therefore, accurate and timely fault diagnosis of nuclear power systems is of great significance for ensuring the safe and reliable operation of nuclear power plants. The existing fault diagnosis methods mainly target a single device or subsystem, making it difficult to analyze the inherent connections and mutual effects between different types of faults at the entire unit level. This article uses the AP1000 full-scale simulator to simulate the important mechanical component failures of some key systems in the primary and secondary circuits of nuclear power units, and constructs a fault dataset. Meanwhile, a hierarchical multi-granularity classification fault diagnosis model based on the EfficientNet large model is proposed, aiming to achieve hierarchical classification of nuclear power faults. The results indicate that the proposed fault diagnosis model can effectively classify faults in different circuits and system components of nuclear power units into hierarchical categories. However, the fault dataset in this study was obtained from a simulator, which may introduce additional information due to parameter redundancy, thereby affecting the diagnostic performance of the model.

摘要:複雜機電系統在核能電廠的安全可靠運行,對於核能電廠及其核能機組的安全發電至關重要。因此,核能系統準確及時的故障診斷對於保障核能電廠的安全可靠運行具有重大意義。現有的故障診斷方法主要針對單一設備或子系統,難以分析全機組層面不同類型故障間的內在聯繫和相互影響。本文利用AP1000滿功率模擬器模擬核能機組一、二次迴路部分關鍵系統的重要機械組件故障,構建故障數據集。同時,提出基於EfficientNet大模型的分層多粒度分類故障診斷模型,旨在實現核能故障的分層分類。結果表明,所提故障診斷模型能夠有效地將核能機組不同迴路和系統組件的故障分層分類。但本研究中的故障數據集來源於模擬器,由於參數冗餘可能會引入額外的信息,從而影響模型的診斷性能。

Optimizing Data Delivery: Insights from User Preferences on Visuals, Tables, and Text

2411.07451v1 by Reuben Luera, Ryan Rossi, Franck Dernoncourt, Alexa Siu, Sungchul Kim, Tong Yu, Ruiyi Zhang, Xiang Chen, Nedim Lipka, Zhehao Zhang, Seon Gyeom Kim, Tak Yeon Lee

In this work, we study whether users prefer to see a chart, table, or text in response to a question they ask. This enables us to understand when it is best to show a chart, table, or text to the user for a specific question. To this end, we conduct a user study in which users are shown a question and asked what they would prefer to see, and we use the data to establish that a user's personal traits do influence the data outputs that they prefer. Understanding how user characteristics impact a user's preferences is critical to creating data tools with a better user experience. Additionally, we investigate to what degree an LLM can be used to replicate a user's preference with and without user preference data. Overall, these findings have significant implications for the development of data tools and the replication of human preferences using LLMs. Furthermore, this work demonstrates the potential use of LLMs to replicate user preference data, which has major implications for future user modeling and personalization research.

摘要:在這項工作中,我們研究使用者偏好,以便在使用者提出問題時,可以看到圖表、表格或文字。這讓我們得以了解在特定問題中,什麼時候向使用者顯示圖表、表格或文字是最好的。為此,我們進行了一項使用者研究,在研究中向使用者顯示一個問題,並詢問他們希望看到什麼,並使用資料來建立使用者的個人特質確實會影響他們偏好的資料輸出。了解使用者的特質如何影響使用者的偏好,對於建立具有更好使用者體驗的資料工具至關重要。此外,我們調查了 LLM 在有和沒有使用者偏好資料的情況下,可用於複製使用者偏好的程度。整體來說,這些發現對於資料工具的開發和使用 LLM 複製人類偏好具有重要的意義。此外,這項工作展示了 LLM 複製使用者偏好資料的潛在用途,這對未來的使用者建模和個人化研究具有重大意義。

The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving

2411.07447v1 by Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki

The growing usage of Large Language Models (LLMs) highlights the demands and challenges in scalable LLM inference systems, affecting deployment and development processes. On the deployment side, there is a lack of comprehensive analysis on the conditions under which a particular scheduler performs better or worse, with performance varying substantially across different schedulers, hardware, models, and workloads. Manually testing each configuration on GPUs can be prohibitively expensive. On the development side, unpredictable performance and unknown upper limits can lead to inconclusive trial-and-error processes, consuming resources on ideas that end up ineffective. To address these challenges, we introduce INFERMAX, an analytical framework that uses inference cost models to compare various schedulers, including an optimal scheduler formulated as a constraint satisfaction problem (CSP) to establish an upper bound on performance. Our framework offers in-depth analysis and raises essential questions, challenging assumptions and exploring opportunities for more efficient scheduling. Notably, our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemptions at all. We believe our methods and insights will facilitate the cost-effective deployment and development of scalable, efficient inference systems and pave the way for cost-based scheduling.

摘要:大型語言模型 (LLM) 使用量不斷增加,突顯了可擴充 LLM 推論系統的需求和挑戰,影響部署和開發流程。在部署方面,對於特定排程器在何種條件下執行得更好或更差,缺乏全面的分析,效能因不同的排程器、硬體、模型和工作負載而有顯著差異。手動在 GPU 上測試每個組態可能會非常昂貴。在開發方面,不可預測的效能和未知的上限可能會導致無法得出結論的試錯流程,消耗無效想法的資源。為了應對這些挑戰,我們引入了 INFERMAX,一個分析架構,使用推論成本模型來比較各種排程器,包括一個最佳排程器,該排程器制定為約束滿足問題 (CSP),以建立效能的上限。我們的架構提供了深入的分析,並提出了重要的問題,挑戰假設並探索更有效排程的機會。值得注意的是,我們的研究結果表明,與完全避免搶先相比,搶先請求可以將 GPU 成本降低 30%。我們相信我們的技術和見解將促進可擴充、有效推論系統的經濟有效部署和開發,並為基於成本的排程鋪路。

Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection

2411.07446v1 by Cilin Yan, Jingyun Wang, Lin Zhang, Ruihui Zhao, Xiaopu Wu, Kai Xiong, Qingsong Liu, Guoliang Kang, Yangyang Kang

Automatic prompt engineering aims to enhance the generation quality of large language models (LLMs). Recent works utilize feedback generated from erroneous cases to guide the prompt optimization. During inference, they may further retrieve several semantically related exemplars and concatenate them to the optimized prompts to improve the performance. However, those works only utilize the feedback at the current step, ignoring historical and unselected feedback that is potentially beneficial. Moreover, the selection of exemplars only considers the general semantic relationship and may not be optimal in terms of task performance and matching with the optimized prompt. In this work, we propose an Exemplar-Guided Reflection with Memory mechanism (ERM) to realize more efficient and accurate prompt optimization. Specifically, we design an exemplar-guided reflection mechanism where the feedback generation is additionally guided by the generated exemplars. We further build two kinds of memory to fully utilize the historical feedback information and support more effective exemplar retrieval. Empirical evaluations show our method surpasses previous state-of-the-art methods with fewer optimization steps, i.e., improving F1 score by 10.1 on the LIAR dataset, and halving the number of optimization steps on ProTeGi.

摘要:自動提示工程旨在提升大型語言模型 (LLM) 的生成品質。近期研究利用錯誤案例產生的回饋來引導提示最佳化。在推論過程中,它們可能會進一步擷取幾個語義相關的範例,並將它們串接至最佳化的提示以提升效能。然而,這些研究僅利用當前步驟的回饋,忽略了潛在有益的歷史回饋和未選擇的回饋。此外,範例的選擇僅考慮一般語義關係,就任務效能和與最佳化提示的匹配而言可能不是最佳的。在這項研究中,我們提出一個具有記憶機制的範例引導反思 (ERM),以實現更有效率且準確的提示最佳化。具體來說,我們設計一個範例引導反思機制,其中回饋產生進一步由產生的範例引導。我們進一步建構兩種記憶體,以充分利用歷史回饋資訊,並支援更有效的範例擷取。經驗評估顯示,我們的方法以更少的最佳化步驟超越了先前的技術水準,亦即在 LIAR 資料集上將 F1 分數提升了 10.1,並在 ProTeGi 上減少了一半的最佳化步驟。

Input-Based Ensemble-Learning Method for Dynamic Memory Configuration of Serverless Computing Functions

2411.07444v1 by Siddharth Agarwal, Maria A. Rodriguez, Rajkumar Buyya

In today's Function-as-a-Service offerings, a programmer is usually responsible for configuring function memory for its successful execution, which allocates proportional function resources such as CPU and network. However, right-sizing the function memory forces developers to speculate about performance and make ad-hoc configuration decisions. Recent research has highlighted that a function's input characteristics, such as input size, type and number of inputs, significantly impact its resource demand, run-time performance and costs with fluctuating workloads. This correlation further makes memory configuration a non-trivial task. On that account, an input-aware function memory allocator not only improves developer productivity by completely hiding resource-related decisions but also drives an opportunity to reduce resource wastage and offer a finer-grained cost-optimised pricing scheme. Therefore, we present MemFigLess, a serverless solution that estimates the memory requirement of a serverless function with input-awareness. The framework executes function profiling in an offline stage and trains a multi-output Random Forest Regression model on the collected metrics to invoke input-aware optimal configurations. We evaluate our work with the state-of-the-art approaches on AWS Lambda service to find that MemFigLess is able to capture the input-aware resource relationships and allocate up to 82% less resources and save up to 87% run-time costs.

摘要:在當今的函式即服務產品中,程式設計人員通常負責設定函式記憶體以利成功執行,這會配置成比例的函式資源,例如 CPU 和網路。不過,正確調整函式記憶體會強迫開發人員推測效能並做出臨時設定決策。最近的研究強調函式的輸入特徵(例如輸入大小、類型和輸入數量)會顯著影響其資源需求、執行時間效能和工作負載波動的成本。這種關聯性進一步使記憶體設定成為一項非平凡的任務。有鑑於此,一個具輸入感知能力的函式記憶體配置器不僅能透過完全隱藏與資源相關的決策來提升開發人員生產力,還能驅動一個機會來減少資源浪費並提供更細緻的成本最佳化定價方案。因此,我們提出 MemFigLess,這是一個無伺服器解決方案,可估計具輸入感知能力的無伺服器函式的記憶體需求。此架構在離線階段執行函式剖析,並針對收集的指標訓練一個多輸出隨機森林回歸模型,以呼叫具輸入感知能力的最佳設定。我們使用最先進的方法在 AWS Lambda 服務上評估我們的成果,發現 MemFigLess 能夠擷取具輸入感知能力的資源關係,並配置少達 82% 的資源,並節省高達 87% 的執行時間成本。

Automatically Detecting Online Deceptive Patterns in Real-time

2411.07441v1 by Asmit Nayak, Shirley Zhang, Yash Wani, Rishabh Khandelwal, Kassem Fawaz

Deceptive patterns (DPs) in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous across various digital platforms. While efforts to mitigate DPs have emerged from legal and technical perspectives, a significant gap in usable solutions that empower users to identify and make informed decisions about DPs in real-time remains. In this work, we introduce AutoBot, an automated, deceptive pattern detector that analyzes websites' visual appearances using machine learning techniques to identify and notify users of DPs in real-time. AutoBot employs a two-staged pipeline that processes website screenshots, identifying interactable elements and extracting textual features without relying on HTML structure. By leveraging a custom language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We implement AutoBot as a lightweight Chrome browser extension that performs all analyses locally, minimizing latency and preserving user privacy. Through extensive evaluation, we demonstrate AutoBot's effectiveness in enhancing users' ability to navigate digital environments safely while providing a valuable tool for regulators to assess and enforce compliance with DP regulations.

摘要:數位介面中的欺騙性模式 (DP) 會操縱使用者做出非預期的決定,利用認知偏誤和心理漏洞。這些模式已在各種數位平台上變得無所不在。雖然從法律和技術角度來看,已經出現減輕 DP 的努力,但仍然缺乏重要的可用解決方案,讓使用者能夠識別和做出有關 DP 的明智決定。在這項工作中,我們介紹了 AutoBot,這是一個自動化的欺騙性模式偵測器,它使用機器學習技術分析網站的視覺外觀,以識別和即時通知使用者 DP。AutoBot 採用一個兩階段管道,處理網站螢幕截圖、識別可互動元素,並在不依賴 HTML 結構的情況下擷取文字特徵。透過利用自訂語言模型,AutoBot 了解這些元素周圍的內容,以確定是否存在欺騙性模式。我們將 AutoBot 實作為一個輕量級的 Chrome 瀏覽器擴充功能,它在本地執行所有分析,將延遲降至最低並保護使用者隱私。透過廣泛的評估,我們展示了 AutoBot 在提升使用者安全瀏覽數位環境的能力方面的有效性,同時也為監管機構提供了一個有價值的工具,用於評估和強制遵守 DP 法規。

Predicting BWR Criticality with Data-Driven Machine Learning Model

2411.07425v1 by Muhammad Rizki Oktavian, Anirudh Tunga, Jonathan Nistor, James Tusar, J. Thomas Gruenwald, Yunlin Xu

One of the challenges in operating nuclear power plants is to decide the amount of fuel needed in a cycle. Large-scale nuclear power plants are designed to operate at base load, meaning that they are expected to always operate at full power. Economically, a nuclear power plant should burn enough fuel to maintain criticality until the end of a cycle (EOC). If the reactor goes subcritical before the end of a cycle, it may result in early coastdown as the fuel in the core is already depleted. Conversely, if the reactor still has significant excess reactivity by the end of a cycle, the remaining fuel will go unused. In both cases, the plant may lose a significant amount of money. This work proposes an innovative method based on a data-driven deep learning model to estimate the excess criticality of a boiling water reactor.

摘要:核能電廠運作中的一項挑戰,是要決定一個週期內所需的燃料量。大型核能電廠的設計是基於基本負載運作,表示預期它們總是會以全功率運作。在經濟層面上,核能電廠應該燃燒足夠的燃料,以維持臨界狀態直到週期結束 (EOC)。如果反應器在週期結束前進入次臨界狀態,由於爐心內的燃料已經耗盡,可能會導致提早停機。相反地,如果反應器在週期結束時仍有顯著的過剩反應性,剩餘的燃料將會保持未用。這兩種情況都會讓電廠損失大量金錢。本研究提出一個創新的方法,基於資料驅動的深度學習模型,來估計沸水反應器的過剩臨界性。

Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains

2411.07417v1 by Katerina Korre, Arianna Muti, Federico Ruggeri, Alberto Barrón-Cedeño

Hate speech relies heavily on cultural influences, leading to varying individual interpretations. For that reason, we propose a Semantic Componential Analysis (SCA) framework for a cross-cultural and cross-domain analysis of hate speech definitions. We create the first dataset of definitions derived from five domains: online dictionaries, research papers, Wikipedia articles, legislation, and online platforms, which are later analyzed into semantic components. Our analysis reveals that the components differ from definition to definition, yet many domains borrow definitions from one another without taking into account the target culture. We conduct zero-shot model experiments using our proposed dataset, employing three popular open-sourced LLMs to understand the impact of different definitions on hate speech detection. Our findings indicate that LLMs are sensitive to definitions: responses for hate speech detection change according to the complexity of definitions used in the prompt.

摘要:仇恨言論嚴重依賴文化影響,導致不同的個人詮釋。因此,我們提出語義成分分析 (SCA) 架構,用於跨文化和跨領域分析仇恨言論定義。我們建立第一個定義資料集,其來自五個領域:線上字典、研究論文、維基百科條目、立法和線上平台,隨後分析為語義成分。我們的分析顯示,各成分在不同定義中有所不同,但許多領域會從彼此借用定義,而不考慮目標文化。我們使用建議的資料集執行零次學習模型實驗,採用三個流行的開源 LLM,以了解不同定義對仇恨言論偵測的影響。我們的研究結果指出,LLM 對定義很敏感:仇恨言論偵測的回應會根據提示中使用的定義複雜度而改變。

Using Generative AI and Multi-Agents to Provide Automatic Feedback

2411.07407v1 by Shuchen Guo, Ehsan Latif, Yifan Zhou, Xuan Huang, Xiaoming Zhai

This study investigates the use of generative AI and multi-agent systems to provide automatic feedback in educational contexts, particularly for student constructed responses in science assessments. The research addresses a key gap in the field by exploring how multi-agent systems, called AutoFeedback, can improve the quality of GenAI-generated feedback, overcoming known issues such as over-praise and over-inference that are common in single-agent large language models (LLMs). The study developed a multi-agent system consisting of two AI agents: one for generating feedback and another for validating and refining it. The system was tested on a dataset of 240 student responses, and its performance was compared to that of a single-agent LLM. Results showed that AutoFeedback significantly reduced the occurrence of over-praise and over-inference errors, providing more accurate and pedagogically sound feedback. The findings suggest that multi-agent systems can offer a more reliable solution for generating automated feedback in educational settings, highlighting their potential for scalable and personalized learning support. These results have important implications for educators and researchers seeking to leverage AI in formative assessments, offering a pathway to more effective feedback mechanisms that enhance student learning outcomes.

摘要:本研究探討生成式 AI 與多重代理系統在教育情境中提供自動回饋的用途,特別是針對學生在科學評量中建構的回應。此研究透過探討稱為 AutoFeedback 的多重代理系統如何改善 GenAI 生成的回饋品質,來解決該領域的一個關鍵落差,克服常見於單一代理大型語言模型 (LLM) 中的過度讚美和過度推論等已知問題。本研究開發了一個由兩個 AI 代理組成的多重代理系統:一個用於產生回饋,另一個用於驗證和改善回饋。此系統在 240 個學生回應的資料集上進行測試,並將其效能與單一代理 LLM 進行比較。結果顯示,AutoFeedback 大幅減少過度讚美和過度推論錯誤的發生,提供更準確且在教學法上更完善的回饋。研究結果顯示,多重代理系統可以為在教育環境中產生自動回饋提供更可靠的解決方案,突顯其在可擴充且個人化的學習支援方面的潛力。這些結果對尋求在形成性評量中利用 AI 的教育工作者和研究人員具有重要意義,提供一條通往更有效的回饋機制的途徑,以提升學生的學習成果。

Controllable Context Sensitivity and the Knob Behind It

2411.07404v1 by Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, Ryan Cotterell

When making predictions, a language model must trade off how much it relies on its context vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental functionality, as it enables the model to excel at tasks like retrieval-augmented generation and question-answering. In this paper, we search for a knob which controls this sensitivity, determining whether language models answer from the context or their prior knowledge. To guide this search, we design a task for controllable context sensitivity. In this task, we first feed the model a context (Paris is in England) and a question (Where is Paris?); we then instruct the model to either use its prior or contextual knowledge and evaluate whether it generates the correct answer for both intents (either France or England). When fine-tuned on this task, instruction-tuned versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm. Then, in each model, we identify a 1-D subspace in a single layer that encodes whether the model follows context or prior knowledge. Interestingly, while we identify this subspace in a fine-tuned model, we find that the exact same subspace serves as an effective knob in not only that model but also non-fine-tuned instruct and base models of that model family. Finally, we show a strong correlation between a model's performance and how distinctly it separates context-agreeing from context-ignoring answers in this subspace. These results suggest a single subspace facilitates how the model chooses between context and prior knowledge, hinting at a simple fundamental mechanism that controls this behavior.

摘要:在进行预测时,语言模型必须权衡其对上下文与先验知识的依赖程度。选择模型对上下文的敏感程度是一项基本功能,因为它使模型能够在检索增强生成和问答等任务中表现出色。在本文中,我们寻找一个控制这种敏感性的旋钮,确定语言模型是根据上下文还是先验知识来回答。为了指导这项搜索,我们设计了一个可控制上下文敏感性的任务。在这个任务中,我们首先向模型提供一个上下文(巴黎在英国)和一个问题(巴黎在哪里?);然后我们指示模型使用其先验或上下文知识,并评估它是否为两种意图(法国或英国)生成了正确的答案。在针对此任务进行微调后,Llama-3.1、Mistral-v0.3 和 Gemma-2 的指令调整版本可以高精度(85-95%)解决它。通过分析这些高性能模型,我们使用一种新颖的线性时间算法缩小了可能对上下文敏感性很重要的层。然后,在每个模型中,我们在单个层中识别一个一维子空间,该子空间对模型是遵循上下文还是先验知识进行编码。有趣的是,虽然我们在微调模型中识别出这个子空间,但我们发现完全相同的子空间不仅在该模型中,而且在该模型系列的非微调指令和基础模型中都充当有效旋钮。最后,我们展示了模型的性能与其在这个子空间中明显区分上下文同意和上下文忽略答案之间的相关性。这些结果表明,一个子空间促进了模型如何在上下文和先验知识之间进行选择,暗示了一种控制此行为的简单基本机制。
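The idea of a 1-D "knob" in a single layer can be illustrated with a minimal sketch. It assumes the knob is a fixed direction in hidden-state space and that steering means setting the hidden state's coordinate along that direction while leaving everything orthogonal to it untouched. The vectors, the target value, and the function names below are hypothetical and are not taken from the paper.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def knob_reading(hidden, direction):
    """Coordinate of `hidden` along the normalised knob direction."""
    u = unit(direction)
    return sum(h * d for h, d in zip(hidden, u))

def set_knob(hidden, direction, value):
    """Shift `hidden` along `direction` so its knob reading equals `value`.

    Components orthogonal to `direction` are left unchanged.
    """
    u = unit(direction)
    current = sum(h * d for h, d in zip(hidden, u))
    return [h + (value - current) * d for h, d in zip(hidden, u)]

# Toy values (assumptions for illustration only).
hidden = [0.5, -1.0, 2.0, 0.0]
direction = [1.0, 1.0, 0.0, 0.0]
steered = set_knob(hidden, direction, 3.0)
print(round(knob_reading(steered, direction), 6))
```

In a real model the direction would have to be identified from data and the edit applied inside the forward pass; the sketch only shows the linear algebra of moving along one direction.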

2411.07398v1 by Aakash Sorathiya, Gouri Ginde

With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary. This makes automated extraction of ethical concern-related app reviews a challenging and time-consuming effort. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) evaluates four LLMs for classifying app reviews to privacy concerns; and 3) uses the best NLI and LLM models further to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Then, using NLI+LLM, an additional 1,008 new privacy-related reviews were extracted that were not identified through the keyword-based approach in previous research, thus demonstrating the effectiveness of the proposed approach.

摘要:隨著行動應用程式在我們日常體驗中激增,圍繞倫理的疑慮也大幅增加。使用者通常在應用程式(app)評論中傳達他們的回饋、回報問題,並建議新的功能,經常強調安全性、隱私和問責疑慮。納入這些評論對於開發成功的產品至關重要。然而,與倫理疑慮相關的 app 評論通常使用特定領域語言,並使用更多變化的詞彙表達。因此,自動化與倫理疑慮相關的 app 評論擷取是一項具有挑戰性且耗時的工作。 本研究提出了一種基於自然語言處理 (NLP) 的新穎方法,它結合了自然語言推論 (NLI),它提供了對語言細微差別的深入理解,以及僅解碼器(類似 LLaMA)的大型語言模型 (LLM),以大規模擷取與倫理疑慮相關的 app 評論。利用心理健康領域的 43,647 個 app 評論,提出的方法 1) 評估四個 NLI 模型以擷取潛在的隱私評論,並將特定領域隱私假設的結果與一般隱私假設進行比較;2) 評估四個 LLM 以將 app 評論分類為隱私疑慮;以及 3) 進一步使用最佳的 NLI 和 LLM 模型從資料集中擷取新的隱私評論。結果顯示,具有特定領域假設的 DeBERTa-v3-base-mnli-fever-anli NLI 模型產生最佳效能,而 Llama3.1-8B-Instruct LLM 在 app 評論分類中表現最佳。然後,使用 NLI+LLM,額外擷取了 1,008 個新的與隱私相關的評論,這些評論未透過先前研究中的基於關鍵字的方法識別出來,因此證明了所提出方法的有效性。

Toward Optimal Search and Retrieval for RAG

2411.07396v1 by Alexandria Leto, Cecilia Aguerrebere, Ishwar Bhati, Ted Willke, Mariano Tepper, Vy Ai Vo

Retrieval-augmented generation (RAG) is a promising method for addressing some of the memory-related challenges associated with Large Language Models (LLMs). Two separate systems form the RAG pipeline, the retriever and the reader, and the impact of each on downstream task performance is not well-understood. Here, we work towards the goal of understanding how retrievers can be optimized for RAG pipelines for common tasks such as Question Answering (QA). We conduct experiments focused on the relationship between retrieval and RAG performance on QA and attributed QA and unveil a number of insights useful to practitioners developing high-performance RAG pipelines. For example, lowering search accuracy has minor implications for RAG performance while potentially increasing retrieval speed and memory efficiency.

摘要:檢索增強生成 (RAG) 是一種有望解決大型語言模型 (LLM) 相關記憶挑戰的方法。RAG 管線由檢索器和讀取器兩個獨立系統組成,而每個系統對下游任務效能的影響並未獲得透徹理解。在此,我們努力了解檢索器如何針對常見任務(例如問題解答 (QA))最佳化 RAG 管線。我們針對檢索與 RAG 在 QA 和歸因 QA 上的關係進行實驗,並揭示許多對開發高性能 RAG 管線的從業人員有用的見解。例如,降低搜尋準確度對 RAG 效能影響不大,但可能會提高檢索速度和記憶體效率。
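The trade-off between search accuracy and speed is easier to reason about with a concrete metric. The abstract does not say how search accuracy is measured; the sketch below assumes recall@k of an approximate search against an exact search, which is one common choice. The document IDs are made up.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k results that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)

# Hypothetical rankings for one query.
exact = ["d3", "d7", "d1", "d9", "d4"]
approx = ["d3", "d1", "d8", "d9", "d2"]
print(recall_at_k(approx, exact, k=5))
```

Averaging this value over a query set gives a single number that can be plotted against downstream QA accuracy to see how much retrieval error the reader tolerates.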

Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery

2411.07395v1 by Mohamed Abul Hassan, Pu Sun, Xiangnan Zhou, Lisanne Kraft, Kelsey T Hadfield, Katjana Ehrlich, Jinyi Qi, Andrew Birkeland, Laura Marcu

This study introduces a novel data-centric approach to improve real-time surgical guidance using fiber-based fluorescence lifetime imaging (FLIm). A key aspect of the methodology is the accurate detection of the aiming beam, which is essential for localizing points used to map FLIm measurements onto the tissue region within the surgical field. The primary challenge arises from the complex and variable conditions encountered in the surgical environment, particularly in Transoral Robotic Surgery (TORS). Uneven illumination in the surgical field can cause reflections, reduce contrast, and result in inconsistent color representation, further complicating aiming beam detection. To overcome these challenges, an instance segmentation model was developed using a data-centric training strategy that improves accuracy by minimizing label noise and enhancing detection robustness. The model was evaluated on a dataset comprising 40 in vivo surgical videos, demonstrating a median detection rate of 85%. This performance was maintained when the model was integrated in a clinical system, achieving a similar detection rate of 85% during TORS procedures conducted in patients. The system's computational efficiency, measured at approximately 24 frames per second (FPS), was sufficient for real-time surgical guidance. This study enhances the reliability of FLIm-based aiming beam detection in complex surgical environments, advancing the feasibility of real-time, image-guided interventions for improved surgical precision.

摘要:本研究提出了一種新穎的以數據為中心的策略,以使用基於光纖的螢光生命期成像 (FLIm) 來改善實時手術導引。此方法的一個關鍵面向是準確偵測瞄準光束,這對於定位用於將 FLIm 測量結果對應到手術視野內組織區域的點至關重要。主要的挑戰來自於手術環境中遇到的複雜且變化的條件,特別是在經口機器人手術 (TORS) 中。手術視野中的照明不均會導致反射、降低對比度,並造成不一致的顏色呈現,進一步使瞄準光束偵測複雜化。為了克服這些挑戰,開發了一個實例分割模型,使用以數據為中心的訓練策略,透過最小化標籤雜訊和增強偵測穩健性來提高準確度。此模型在包含 40 個體內手術影片的資料集上進行評估,顯示出 85% 的中位數偵測率。當此模型整合到臨床系統中時,此效能得以維持,在患者進行 TORS 手術期間達成相似的 85% 偵測率。此系統的運算效率,測量結果約為每秒 24 幀 (FPS),足以進行實時手術導引。本研究增強了 FLIm 為基礎的瞄準光束偵測在複雜手術環境中的可靠性,提升了實時、影像導引介入的可行性,以改善手術精準度

Feature-Space Semantic Invariance: Enhanced OOD Detection for Open-Set Domain Generalization

2411.07392v1 by Haoliang Wang, Chen Zhao, Feng Chen

Open-set domain generalization addresses a real-world challenge: training a model to generalize across unseen domains (domain generalization) while also detecting samples from unknown classes not encountered during training (open-set recognition). However, most existing approaches tackle these issues separately, limiting their practical applicability. To overcome this limitation, we propose a unified framework for open-set domain generalization by introducing Feature-space Semantic Invariance (FSI). FSI maintains semantic consistency across different domains within the feature space, enabling more accurate detection of OOD instances in unseen domains. Additionally, we adopt a generative model to produce synthetic data with novel domain styles or class labels, enhancing model robustness. Initial experiments show that our method improves AUROC by 9.1% to 18.9% on ColoredMNIST, while also significantly increasing in-distribution classification accuracy.

摘要:開放集域泛化解決了一個真實世界的挑戰:訓練一個模型在未見過的域中進行泛化(域泛化),同時也偵測訓練過程中未遇到的未知類別的樣本(開放集識別)。然而,大多數現有方法分別處理這些問題,限制了它們的實際適用性。為了克服這個限制,我們提出了一個開放集域泛化的統一架構,引入了特徵空間語義不變性(FSI)。FSI 在特徵空間中維護不同域之間的語義一致性,從而能夠更準確地偵測未見域中的 OOD 實例。此外,我們採用一個生成模型來產生具有新穎域樣式或類別標籤的合成資料,增強模型的穩健性。初步實驗表明,我們的模型在 ColoredMNIST 上將 AUROC 提高了 9.1% 至 18.9%,同時也顯著提高了分布內分類準確度。
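AUROC, the metric reported for OOD detection above, can be computed directly from its pairwise definition: the probability that a randomly chosen OOD sample receives a higher OOD score than a randomly chosen in-distribution sample, with ties counted as half. The sketch below is a plain quadratic-time version for illustration; the scores are invented and do not come from the paper.

```python
def auroc(scores_ood, scores_id):
    """Pairwise AUROC: P(OOD score > in-distribution score), ties count half."""
    wins = 0.0
    for s_out in scores_ood:
        for s_in in scores_id:
            if s_out > s_in:
                wins += 1.0
            elif s_out == s_in:
                wins += 0.5
    return wins / (len(scores_ood) * len(scores_id))

# Hypothetical OOD scores (higher means "more likely OOD").
ood = [0.9, 0.8, 0.4]
ind = [0.1, 0.35, 0.5, 0.8]
print(auroc(ood, ind))
```

A value of 0.5 corresponds to chance-level separation and 1.0 to perfect separation of OOD from in-distribution samples.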

Federated Learning Client Pruning for Noisy Labels

2411.07391v1 by Mahdi Morafah, Hojin Chang, Chen Chen, Bill Lin

Federated Learning (FL) enables collaborative model training across decentralized edge devices while preserving data privacy. However, existing FL methods often assume clean annotated datasets, impractical for resource-constrained edge devices. In reality, noisy labels are prevalent, posing significant challenges to FL performance. Prior approaches attempt label correction and robust training techniques but exhibit limited efficacy, particularly under high noise levels. This paper introduces ClipFL (Federated Learning Client Pruning), a novel framework addressing noisy labels from a fresh perspective. ClipFL identifies and excludes noisy clients based on their performance on a clean validation dataset, tracked using a Noise Candidacy Score (NCS). The framework comprises three phases: pre-client pruning to identify potential noisy clients and calculate their NCS, client pruning to exclude a percentage of clients with the highest NCS, and post-client pruning for fine-tuning the global model with standard FL on clean clients. Empirical evaluation demonstrates ClipFL's efficacy across diverse datasets and noise levels, achieving accurate noisy client identification, superior performance, faster convergence, and reduced communication costs compared to state-of-the-art FL methods. Our code is available at https://github.com/MMorafah/ClipFL.

摘要:聯邦學習 (FL) 能在分散式邊緣裝置上進行協作模型訓練,同時保留資料隱私。然而,現有的 FL 方法通常假設標記資料集是乾淨的,這對於資源受限的邊緣裝置來說是不切實際的。實際上,雜訊標籤很普遍,對 FL 效能構成重大挑戰。先前的做法嘗試標籤校正和穩健訓練技術,但在高雜訊水準下表現出的效能有限。本文介紹 ClipFL(聯邦學習用戶端剪枝),這是一個從新觀點解決雜訊標籤的新穎架構。ClipFL 根據雜訊候選分數 (NCS) 追蹤乾淨驗證資料集上的效能,識別和排除雜訊用戶端。該架構包含三個階段:用戶端前剪枝,用於識別潛在的雜訊用戶端並計算其 NCS;用戶端剪枝,用於排除具有最高 NCS 的用戶端百分比;用戶端後剪枝,用於在乾淨用戶端上使用標準 FL 微調全域模型。實證評估顯示 ClipFL 在不同的資料集和雜訊水準中都表現出效能,與現有的 FL 方法相比,能準確識別雜訊用戶端,具有優異的效能、更快的收斂速度和更低的通訊成本。我們的程式碼可在 https://github.com/MMorafah/ClipFL 取得。
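The first two ClipFL phases (scoring, then pruning) can be sketched as follows. The abstract does not give the formula for the Noise Candidacy Score, so this sketch assumes a simple counting rule: in each round, the clients with the lowest accuracy on the clean validation set each receive one point. The function names, fractions, and accuracies are hypothetical.

```python
def update_ncs(ncs, val_accuracy, flag_fraction):
    """Add one point to the lowest-scoring clients of this round (assumed rule)."""
    n_flag = max(1, int(len(val_accuracy) * flag_fraction))
    worst = sorted(val_accuracy, key=val_accuracy.get)[:n_flag]
    for client in worst:
        ncs[client] = ncs.get(client, 0) + 1
    return ncs

def prune_clients(ncs, clients, prune_fraction):
    """Drop the given fraction of clients with the highest accumulated score."""
    n_prune = int(len(clients) * prune_fraction)
    ranked = sorted(clients, key=lambda c: ncs.get(c, 0), reverse=True)
    pruned = set(ranked[:n_prune])
    return [c for c in clients if c not in pruned]

# Hypothetical per-round validation accuracies for four clients.
rounds = [
    {"a": 0.81, "b": 0.42, "c": 0.79, "d": 0.55},
    {"a": 0.84, "b": 0.40, "c": 0.80, "d": 0.58},
    {"a": 0.86, "b": 0.47, "c": 0.52, "d": 0.83},
]
ncs = {}
for val_accuracy in rounds:
    update_ncs(ncs, val_accuracy, flag_fraction=0.5)
kept = prune_clients(ncs, ["a", "b", "c", "d"], prune_fraction=0.25)
print(ncs, kept)
```

The third phase would then continue standard federated training on the clients in `kept` only.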

Firing Rate Models as Associative Memory: Excitatory-Inhibitory Balance for Robust Retrieval

2411.07388v1 by Simone Betteti, Giacomo Baggio, Francesco Bullo, Sandro Zampieri

Firing rate models are dynamical systems widely used in applied and theoretical neuroscience to describe local cortical dynamics in neuronal populations. By providing a macroscopic perspective of neuronal activity, these models are essential for investigating oscillatory phenomena, chaotic behavior, and associative memory processes. Despite their widespread use, the application of firing rate models to associative memory networks has received limited mathematical exploration, and most existing studies are focused on specific models. Conversely, well-established associative memory designs, such as Hopfield networks, lack key biologically-relevant features intrinsic to firing rate models, including positivity and interpretable synaptic matrices that reflect excitatory and inhibitory interactions. To address this gap, we propose a general framework that ensures the emergence of re-scaled memory patterns as stable equilibria in the firing rate dynamics. Furthermore, we analyze the conditions under which the memories are locally and globally asymptotically stable, providing insights into constructing biologically-plausible and robust systems for associative memory retrieval.

摘要:發射率模型是動態系統,廣泛用於應用和理論神經科學,用於描述神經元族群中的局部皮質動態。這些模型透過提供神經元活動的巨觀觀點,對於探討振盪現象、混沌行為和聯想記憶過程至關重要。儘管廣泛使用,但將發射率模型應用於聯想記憶網路的研究在數學上仍有限,而且大多數現有研究都集中在特定模型上。相反地,像 Hopfield 網路等完善的聯想記憶設計,缺乏發射率模型中固有的關鍵生物相關特徵,包括正值和反映激發性和抑制性交互作用的可解釋突觸矩陣。為了解決這個差距,我們提出一個通用框架,以確保重新縮放的記憶模式在發射率動態中作為穩定平衡出現。此外,我們分析記憶在局部和全局漸近穩定的條件,提供見解以建構生物學上可信且強健的聯想記憶檢索系統。

Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages

2411.07387v1 by Midia Yousefi, Yao Qian, Junkun Chen, Gang Wang, Yanqing Liu, Dongmei Wang, Xiaofei Wang, Jian Xue

End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio, including both speech and pause segments. Previous methods often controlled the number of words or characters generated by the Machine Translation model to approximate the source sentence's length without considering the isochrony of pauses and speech segments, as duration can vary between languages. To address this, we present improvements to the duration alignment component of our sequence-to-sequence ST model. Our method controls translation length by predicting the duration of speech and pauses in conjunction with the translation process. This is achieved by providing timing information to the decoder, ensuring it tracks the remaining duration for speech and pauses while generating the translation. Evaluation on the Zh-En test set of CoVoST 2 demonstrates that the proposed Isochrony-Controlled ST achieves 0.92 speech overlap and 8.9 BLEU, which is only a 1.4 BLEU drop compared to the ST baseline.

摘要:端對端語音翻譯 (ST) 可直接將原始語言語音翻譯成目標語言文字,近年來備受關注。許多 ST 應用程式需要嚴格的長度控制,以確保翻譯時間與原始音訊長度相符,包括語音和暫停片段。先前的做法通常控制機器翻譯模型產生的字數或字元數,以估計原始句子的長度,而不考慮暫停和語音片段的等時性,因為不同語言的持續時間可能有所不同。為了解決這個問題,我們對序列對序列 ST 模型的持續時間比對元件進行了改進。我們的做法透過預測語音和暫停的持續時間,並與翻譯過程結合,來控制翻譯長度。這是透過提供計時資訊給解碼器來達成,確保它在產生翻譯時追蹤語音和暫停的剩餘時間。在 CoVoST 2 的中英測試集中進行評估,顯示所提出的等時性控制 ST 可達到 0.92 的語音重疊和 8.9 的 BLEU,與 ST 基準相比,BLEU 分數僅下降 1.4。

BeeManc at the PLABA Track of TAC-2024: RoBERTa for task 1 and LLaMA3.1 and GPT-4o for task 2

2411.07381v1 by Zhidong Ling, Zihao Li, Pablo Romeo, Lifeng Han, Goran Nenadic

This report is the system description of the BeeManc team for the shared task Plain Language Adaptation of Biomedical Abstracts (PLABA) 2024. This report contains two sections corresponding to the two sub-tasks in PLABA 2024. In task one, we applied fine-tuned RoBERTa-Base models to identify and classify the difficult terms, jargon and acronyms in the biomedical abstracts and reported the F1 score. Due to time constraints, we didn't finish the replacement task. In task two, we leveraged Llama3.1-70B-Instruct and GPT-4o with one-shot prompts to complete the abstract adaptation and reported the scores in BLEU, SARI, BERTScore, LENS, and SALSA. In the official evaluation from PLABA-2024 on Task 1A and 1B, our much smaller fine-tuned RoBERTa-Base model ranked 3rd and 2nd respectively on the two sub-tasks, and 1st on averaged F1 scores across the two tasks among 9 evaluated systems. We share our fine-tuned models and related resources at https://github.com/HECTA-UoM/PLABA2024

摘要:這份報告是 BeeManc 團隊針對 2024 年生物醫學摘要的通用語言適應 (PLABA) 共享任務所做的系統描述。這份報告包含兩部分,分別對應於 PLABA 2024 的兩個子任務。在任務一中,我們應用微調後的 ReBERTa-Base 模型來識別和分類生物醫學摘要中的困難術語、術語和縮寫,並報告 F1 分數。由於時間限制,我們沒有完成替換任務。在任務二中,我們利用 Llamma3.1-70B-Instruct 和 GPT-4o 以及一次性提示來完成摘要適應,並報告了 BLEU、SARI、BERTScore、LENS 和 SALSA 中的分數。根據 PLABA-2024 在任務 1A 和 1B 中的官方評估,我們的\textbf{微調後的 RoBERTa-Base 模型小得多}在兩個子任務中分別排名第 3 和第 2,並且在 9 個評估系統中\textbf{在兩個任務中的平均 F1 分數中排名第 1}。我們在\url{https://github.com/HECTA-UoM/PLABA2024}分享我們的微調模型和相關資源

Warmstarting for Scaling Language Models

2411.07340v1 by Neeratyoy Mallik, Maciej Janowski, Johannes Hog, Herilalaina Rakotoarison, Aaron Klein, Josif Grabocka, Frank Hutter

Scaling model sizes to scale performance has worked remarkably well for the current large language models paradigm. The research and empirical findings of various scaling studies led to novel scaling results and laws that guide subsequent research. High training costs for contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups. One direction to ameliorate the cost of pretraining large models is to warmstart the large-scale training from smaller models that are cheaper to tune. In this work, we attempt to understand if the behavior of optimal hyperparameters can be retained under warmstarting for scaling. We explore simple operations that allow the application of theoretically motivated methods of zero-shot transfer of optimal hyperparameters using µTransfer. We investigate the aspects that contribute to the speedup in convergence and the preservation of stable training dynamics under warmstarting with µTransfer. We find that shrinking smaller model weights, zero-padding, and perturbing the resulting larger model with scaled initialization from µP enables effective warmstarting of µTransfer.

摘要:將模型規模擴展到擴展效能對於目前的巨量語言模型範例而言運作得非常好。各種規模研究的研究和經驗結果導致新穎的規模結果和定律,這些定律指導後續的研究。當代資料和模型的高訓練成本導致缺乏對如何調整和達成此類訓練設定的透徹理解。改善大型模型預訓練成本的一個方向是從較小的模型開始進行大規模訓練,較小的模型調整成本較低。在這項工作中,我們嘗試了解最佳超參數的行為是否可以在擴展的熱啟動下保留。我們探索允許應用理論激勵的最佳超參數零次轉移方法的簡單操作,使用 {\mu}Transfer。我們研究了有助於加速收斂和在使用 {\mu}Transfer 熱啟動時維持穩定訓練動態的方面。我們發現縮小較小的模型權重、零填充以及使用來自 {\mu}P 的縮放初始化擾動產生的較大模型,能夠有效地熱啟動 $\mut{}$。
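The three operations named in the abstract (shrink, zero-pad, perturb) can be sketched for a single weight matrix. This is only an illustration under stated assumptions: the shrink factor and the noise scale `init_std` are placeholder parameters, and the sketch does not reproduce the µP rule for choosing the initialization scale. The function name is hypothetical.

```python
import random

def warmstart_matrix(small, big_rows, big_cols, shrink=0.5, init_std=0.02, seed=0):
    """Build a larger weight matrix from a smaller one.

    1. Shrink: multiply the inherited weights by `shrink`.
    2. Zero-pad: positions outside the small matrix start at 0.
    3. Perturb: add Gaussian noise with standard deviation `init_std`
       to every entry (a stand-in for a scaled fresh initialization).
    """
    rng = random.Random(seed)
    small_rows, small_cols = len(small), len(small[0])
    big = []
    for i in range(big_rows):
        row = []
        for j in range(big_cols):
            inherited = shrink * small[i][j] if i < small_rows and j < small_cols else 0.0
            row.append(inherited + rng.gauss(0.0, init_std))
        big.append(row)
    return big

small = [[1.0, -2.0], [0.5, 4.0]]
big = warmstart_matrix(small, big_rows=4, big_cols=4)
print(len(big), len(big[0]))
```

Setting `init_std=0.0` isolates the shrink and zero-pad steps, which is convenient for checking that the inherited block lands in the top-left corner of the larger matrix.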

SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models

2411.07336v1 by Bardiya Akhbari, Manish Gawali, Nicholas A. Dronen

Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically-oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets are comprised of arbitrary symbols (e.g. numbers, words), they provide an opportunity to test, systematically, the invariance of LLMs' algorithmic abilities under simple lexical or semantic variations. To this end, we present the SetLexSem Challenge, a synthetic benchmark that evaluates the performance of LLMs on set operations. SetLexSem assesses the robustness of LLMs' instruction-following abilities under various conditions, focusing on the set operations and the nature and construction of the set members. Evaluating seven LLMs with SetLexSem, we find that they exhibit poor robustness to variation in both operation and operands. We show -- via the framework's systematic sampling of set members along lexical and semantic dimensions -- that LLMs are not only not robust to variation along these dimensions but demonstrate unique failure modes in particular, easy-to-create semantic groupings of "deceptive" sets. We find that rigorously measuring language model robustness to variation in frequency and length is challenging and present an analysis that measures them independently. The code for reproducing the results of this paper, and for generating the SetLexSem Challenge dataset, is available at https://github.com/amazon-science/SetLexSem-Challenge.

摘要:集合論是數學的基礎,當集合是有限時,它用於推理世界。一個智能系統應始終如一地執行集合運算,而不管運算元表面的變化。最初設計用於語義導向的 NLP 任務,大型語言模型 (LLM) 現在正在演算法任務上進行評估。由於集合由任意符號(例如數字、字詞)組成,因此它們提供了一個機會,可以系統性地測試 LLM 的演算法能力在簡單的詞彙或語義變化下的不變性。為此,我們提出了 SetLexSem 挑戰,這是一個綜合基準,用於評估 LLM 在集合運算上的效能。SetLexSem 評估 LLM 在各種條件下遵循指令的能力的穩健性,重點關注集合運算以及集合成員的性質和建構。使用 SetLexSem 評估七個 LLM,我們發現它們對運算和運算元中的變化表現出較差的穩健性。我們透過該框架沿著詞彙和語義維度對集合成員進行系統性抽樣,表明 LLM 不僅對這些維度中的變化不穩健,而且表現出獨特的失敗模式,特別是「具欺騙性的」集合的易於建立的語義群組。我們發現,嚴格測量語言模型對頻率和長度變化的穩健性具有挑戰性,並提出了一種獨立測量它們的分析。用於重現本文結果和生成 SetLexSem 挑戰資料集的程式碼可在 \href{https://github.com/amazon-science/SetLexSem-Challenge}{https://github.com/amazon-science/SetLexSem-Challenge} 取得。
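A benchmark item of this kind pairs a prompt with a ground-truth answer computed by ordinary set operations. The sketch below shows one way such an item could be built and scored. It is not the SetLexSem code: the list of operations, the prompt wording, and the function names are assumptions made for illustration.

```python
# Assumed operation set; the abstract does not enumerate the operations used.
OPERATIONS = {
    "union": lambda a, b: a | b,
    "intersection": lambda a, b: a & b,
    "difference": lambda a, b: a - b,
    "symmetric difference": lambda a, b: a ^ b,
}

def make_item(op_name, a, b):
    """Return a (prompt, expected answer) pair for one set operation."""
    prompt = f"What is the {op_name} of {sorted(a)} and {sorted(b)}?"
    return prompt, OPERATIONS[op_name](a, b)

def is_correct(predicted_members, expected):
    """Order-insensitive, duplicate-insensitive comparison with the ground truth."""
    return set(predicted_members) == expected

# The same operation over numbers and over words: a robust model should
# answer both correctly, since only the surface form of the members changes.
prompt_num, expected_num = make_item("intersection", {1, 2, 3}, {2, 3, 4})
prompt_word, expected_word = make_item("intersection", {"ant", "bee", "cat"}, {"bee", "cat", "dog"})
print(prompt_num, expected_num)
print(prompt_word, expected_word)
```

Comparing accuracy across such surface variants of the same underlying operation is the kind of invariance test the abstract describes.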

Multimodal Fusion Balancing Through Game-Theoretic Regularization

2411.07335v1 by Konstantinos Kontras, Thomas Strypsteen, Christos Chatzichristos, Paul P. Liang, Matthew Blaschko, Maarten De Vos

Multimodal learning can complete the picture of information extraction by uncovering key dependencies between data sources. However, current systems fail to fully leverage multiple modalities for optimal performance. This has been attributed to modality competition, where modalities strive for training resources, leaving some underoptimized. We show that current balancing methods struggle to train multimodal models that surpass even simple baselines, such as ensembles. This raises the question: how can we ensure that all modalities in multimodal training are sufficiently trained, and that learning from new modalities consistently improves performance? This paper proposes the Multimodal Competition Regularizer (MCR), a new loss component inspired by mutual information (MI) decomposition designed to prevent the adverse effects of competition in multimodal training. Our key contributions are: 1) Introducing game-theoretic principles in multimodal learning, where each modality acts as a player competing to maximize its influence on the final outcome, enabling automatic balancing of the MI terms. 2) Refining lower and upper bounds for each MI term to enhance the extraction of task-relevant unique and shared information across modalities. 3) Suggesting latent space permutations for conditional MI estimation, significantly improving computational efficiency. MCR outperforms all previously suggested training strategies and is the first to consistently improve multimodal learning beyond the ensemble baseline, clearly demonstrating that combining modalities leads to significant performance gains on both synthetic and large real-world datasets.

摘要:多模態學習可以透過揭露資料來源之間的關鍵依賴關係,來完成資訊萃取的圖像。然而,目前的系統無法充分利用多種模態來獲得最佳效能。這歸因於模態競爭,其中模態爭取訓練資源,導致有些模態未經最佳化。我們顯示目前的平衡方法難以訓練多模態模型,甚至超越簡單的基準,例如整體。這引發了一個問題:我們如何確保多模態訓練中的所有模態都得到充分訓練,並且從新模態中學習持續改善效能?本文提出了多模態競爭正則化器 (MCR),這是一個新的損失組成,靈感來自互資訊 (MI) 分解,旨在防止多模態訓練中競爭的不利影響。我們的關鍵貢獻包括:1) 在多模態學習中引入博弈論原則,其中每個模態都作為一個參與者競爭,以最大化其對最終結果的影響,從而實現 MI 項的自動平衡。2) 為每個 MI 項精煉上下界,以增強跨模態提取與任務相關的獨特和共享資訊。3) 建議潛在空間排列進行條件 MI 估計,顯著提高運算效率。MCR 優於所有先前建議的訓練策略,並且是第一個持續改善多模態學習超越整體基準的策略,清楚地證明了結合模態會在合成和大型真實世界資料集上帶來顯著的效能提升。

Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations

2411.07320v1 by Kirti Bhagat, Kinshuk Vasisht, Danish Pruthi

While a large body of work inspects language models for biases concerning gender, race, occupation and religion, biases of geographical nature are relatively less explored. Some recent studies benchmark the degree to which large language models encode geospatial knowledge. However, the impact of the encoded geographical knowledge (or lack thereof) on real-world applications has not been documented. In this work, we examine large language models for two common scenarios that require geographical knowledge: (a) travel recommendations and (b) geo-anchored story generation. Specifically, we study four popular language models, and across about 100K travel requests and 200K story generations, we observe that travel recommendations corresponding to poorer countries are less unique with fewer location references, and stories from these regions more often convey emotions of hardship and sadness compared to those from wealthier nations.

摘要:儘管大量工作檢查語言模型對於性別、種族、職業和宗教的偏見,但地理性質的偏見相對較少被探討。一些最近的研究基準測試大型語言模型編碼地理空間知識的程度。然而,已編碼地理知識(或缺乏地理知識)對真實世界應用程式的影響尚未被記錄下來。在這項工作中,我們針對需要地理知識的兩個常見場景檢查大型語言模型:(a) 旅遊建議和 (b) 地理錨定故事生成。具體來說,我們研究了四個流行的語言模型,並在約 10 萬個旅遊請求和 20 萬個故事生成中觀察到,對應於較貧窮國家的旅遊建議較不獨特,且位置參考較少,而這些地區的故事與富裕國家相比,更常傳達艱難和悲傷的情緒。