
Knowledge Graphs


Publish Date Title Authors Homepage Code
2024-11-12 Language Models as Causal Effect Generators Lucius E. J. Bynum et.al. 2411.08019v1 link
2024-11-12 From General to Specific: Utilizing General Hallucation to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents Chuyi Kong et.al. 2411.07965v1 null
2024-11-12 Chain Association-based Attacking and Shielding Natural Language Processing Systems Jiacheng Huang et.al. 2411.07843v1 null
2024-11-11 Gradual Fine-Tuning with Graph Routing for Multi-Source Unsupervised Domain Adaptation Yao Ma et.al. 2411.07185v1 null
2024-11-11 A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19 Vedant Khandelwal et.al. 2411.07163v1 null
2024-11-11 A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs Myeongsoo Kim et.al. 2411.07098v1 null
2024-11-11 Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation Qiao Qiao et.al. 2411.06660v1 null
2024-11-10 CausalStock: Deep End-to-end Causal Discovery for News-driven Stock Movement Prediction Shuqi Li et.al. 2411.06391v1 null
2024-11-09 Analyzing the Evolution of Graphs and Texts Xingzhi Guo et.al. 2411.06295v1 null
2024-11-09 An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models Fatemeh Shiri et.al. 2411.06048v1 link
2024-11-08 Mitigating Hallucination with ZeroG: An Advanced Knowledge Management Engine Anantha Sharma et.al. 2411.05936v1 null
2024-11-08 SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark Sithursan Sivasubramaniam et.al. 2411.05521v1 null
2024-11-08 EUREKHA: Enhancing User Representation for Key Hackers Identification in Underground Forums Abdoul Nasser Hassane Amadou et.al. 2411.05479v1 link
2024-11-08 When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization Jacob Nielsen et.al. 2411.05882v1 null
2024-11-08 Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation Dong Shu et.al. 2411.05316v1 link
2024-11-06 LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration Yukun Cao et.al. 2411.05844v1 null
2024-11-06 MEG: Medical Knowledge-Augmented Large Language Models for Question Answering Laura Cabello et.al. 2411.03883v2 link
2024-11-06 The American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge Lee Kezar et.al. 2411.03568v1 null
2024-11-05 Graph-DPEP: Decomposed Plug and Ensemble Play for Few-Shot Document Relation Extraction with Graph-of-Thoughts Reasoning Tao Zhang et.al. 2411.02864v1 null
2024-11-05 Multimodal Commonsense Knowledge Distillation for Visual Question Answering Shuo Yang et.al. 2411.02722v1 null
2024-11-04 Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography Harshavardhana T. Gowda et.al. 2411.02591v1 link
2024-11-04 GraphXAIN: Narratives to Explain Graph Neural Networks Mateusz Cedro et.al. 2411.02540v2 link
2024-11-04 Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models Guangzhi Xiong et.al. 2411.02382v1 null
2024-11-04 Can Language Models Enable In-Context Database? Yu Pan et.al. 2411.01807v1 null
2024-11-03 Graph-based Confidence Calibration for Large Language Models Yukun Li et.al. 2411.02454v1 null
2024-11-03 Ontology Population using LLMs Sanaz Saki Norouzi et.al. 2411.01612v1 null
2024-11-03 Pre-trained Molecular Language Models with Random Functional Group Masking Tianhao Peng et.al. 2411.01401v1 null
2024-11-01 Narrative Analysis of True Crime Podcasts With Knowledge Graph-Augmented Large Language Models Xinyi Leng et.al. 2411.02435v1 null
2024-11-01 WLPlan: Relational Features for Symbolic Planning Dillon Z. Chen et.al. 2411.00577v1 null
2024-11-01 GRS-QA -- Graph Reasoning-Structured Question Answering Dataset Anish Pahilajani et.al. 2411.00369v3 null
2024-11-01 Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes Balu Bhasuran et.al. 2411.02523v1 null
2024-10-31 Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning Beyazit Yalcinkaya et.al. 2411.00205v1 null
2024-10-31 Building Multi-Agent Copilot towards Autonomous Agricultural Data Management and Analysis Yu Pan et.al. 2411.00188v1 null
2024-10-31 Exploring the Knowledge Mismatch Hypothesis: Hallucination Propensity in Small Models Fine-tuned on Data from Larger Models Phil Wee et.al. 2411.00878v1 null
2024-10-31 Failure Modes of LLMs for Causal Reasoning on Narratives Khurram Yamin et.al. 2410.23884v1 link
2024-10-31 Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs Liyi Chen et.al. 2410.23875v1 link
2024-10-31 LLaMo: Large Language Model-based Molecular Graph Assistant Jinyoung Park et.al. 2411.00871v1 link
2024-10-31 End-to-End Ontology Learning with Large Language Models Andy Lo et.al. 2410.23584v1 link
2024-10-30 Graph-Augmented Relation Extraction Model with LLMs-Generated Support Document Vicky Dong et.al. 2410.23452v1 null
2024-10-30 FlowLLM: Flow Matching for Material Generation with Large Language Models as Base Distributions Anuroop Sriram et.al. 2410.23405v1 link
2024-10-30 EMMA: End-to-End Multimodal Model for Autonomous Driving Jyh-Jing Hwang et.al. 2410.23262v2 null
2024-10-30 ProTransformer: Robustify Transformers via Plug-and-Play Paradigm Zhichao Hou et.al. 2410.23182v1 null
2024-10-30 Semantic Enrichment of the Quantum Cascade Laser Properties in Text- A Knowledge Graph Generation Approach Deperias Kerre et.al. 2410.22996v1 null
2024-10-30 How Well Do Large Language Models Disambiguate Swedish Words? Richard Johansson et.al. 2410.22827v1 null
2024-10-30 Beyond Ontology in Dialogue State Tracking for Goal-Oriented Chatbot Sejin Lee et.al. 2410.22767v1 link
2024-10-30 The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation Reza Moravej et.al. 2411.00843v1 null
2024-10-29 Are Large-Language Models Graph Algorithmic Reasoners? Alexander K Taylor et.al. 2410.22597v1 link
2024-10-29 Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset Adrian Garret Gabriel et.al. 2410.22457v1 null
2024-10-29 DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models Chengke Zou et.al. 2411.00836v1 null
2024-10-29 ADAM: An Embodied Causal Agent in Open-World Environments Shu Yu et.al. 2410.22194v1 null
2024-10-29 Synergizing LLM Agents and Knowledge Graph for Socioeconomic Prediction in LBSN Zhilun Zhou et.al. 2411.00028v1 null
2024-10-29 A Hierarchical Language Model For Interpretable Graph Reasoning Sambhav Khurana et.al. 2410.22372v1 null
2024-10-28 LLM-Forest for Health Tabular Data Imputation Xinrui He et.al. 2410.21520v1 null
2024-10-28 Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce Zhantao Yang et.al. 2410.21237v1 null
2024-10-28 CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models Meiqi Chen et.al. 2410.21067v1 null
2024-10-28 CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity Yutong Cheng et.al. 2410.21060v1 null
2024-10-28 Graph-based Uncertainty Metrics for Long-form Language Model Outputs Mingjian Jiang et.al. 2410.20783v1 link
2024-10-28 Plan$\times$RAG: Planning-guided Retrieval Augmented Generation Prakhar Verma et.al. 2410.20753v1 null
2024-10-28 Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation Mufei Li et.al. 2410.20724v2 link
2024-10-27 Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs Xingrui Zhuo et.al. 2410.20321v1 null
2024-10-26 Mathematical Derivation Graphs: A Task for Summarizing Equation Dependencies in STEM Manuscripts Vishesh Prasad et.al. 2410.21324v1 null
2024-10-25 DualMAR: Medical-Augmented Representation from Dual-Expertise Perspectives Pengfei Hu et.al. 2410.19955v1 null
2024-10-25 FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning Nicole Cho et.al. 2410.19727v1 null
2024-10-25 Knowledge Graph Enhanced Language Agents for Recommendation Taicheng Guo et.al. 2410.19627v1 null
2024-10-25 Graph Linearization Methods for Reasoning on Graphs with Large Language Models Christos Xypolopoulos et.al. 2410.19494v1 null
2024-10-25 Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis Weikai Li et.al. 2410.19225v1 null
2024-10-24 Enriching GNNs with Text Contextual Representations for Detecting Disinformation Campaigns on Social Media Bruno Croso Cunha da Silva et.al. 2410.19193v1 null
2024-10-24 GCoder: Improving Large Language Model for Generalized Graph Problem Solving Qifan Zhang et.al. 2410.19084v1 link
2024-10-24 LLM-based Online Prediction of Time-varying Graph Signals Dayu Qin et.al. 2410.18718v1 null
2024-10-24 Gene-Metabolite Association Prediction with Interactive Knowledge Transfer Enhanced Graph for Metabolite Production Kexuan Xin et.al. 2410.18475v2 null
2024-10-24 ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis Zezhong Wang et.al. 2410.18447v1 null
2024-10-24 Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains Kun Li et.al. 2410.18415v1 null
2024-10-23 Explaining Bayesian Networks in Natural Language using Factor Arguments. Evaluation in the medical domain Jaime Sevilla et.al. 2410.18060v1 null
2024-10-23 Graphusion: A RAG Framework for Knowledge Graph Construction with a Global Perspective Rui Yang et.al. 2410.17600v1 null
2024-10-23 Navigate Complex Physical Worlds via Geometrically Constrained LLM Yongqiang Huang et.al. 2410.17529v1 null
2024-10-22 Large Language Model-based Augmentation for Imbalanced Node Classification on Text-Attributed Graphs Leyao Wang et.al. 2410.16882v1 null
2024-10-22 Context-aware Inductive Knowledge Graph Completion with Latent Type Constraints and Subgraph Reasoning Muzhi Li et.al. 2410.16803v2 null
2024-10-22 The Scene Language: Representing Scenes with Programs, Words, and Embeddings Yunzhi Zhang et.al. 2410.16770v1 null
2024-10-22 Atomic Fact Decomposition Helps Attributed Question Answering Zhichao Yan et.al. 2410.16708v1 null
2024-10-22 PLDR-LLM: Large Language Model from Power Law Decoder Representations Burc Gokden et.al. 2410.16703v1 link
2024-10-22 Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency Prafulla Kumar Choubey et.al. 2410.16597v1 null
2024-10-21 Towards a Reliable Offline Personal AI Assistant for Long Duration Spaceflight Oliver Bensch et.al. 2410.16397v1 null
2024-10-21 A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns Tianyi Men et.al. 2410.16155v1 null
2024-10-21 CausalGraph2LLM: Evaluating LLMs for Causal Queries Ivaxi Sheth et.al. 2410.15939v1 link
2024-10-21 LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation Tejumade Afonja et.al. 2410.15828v1 null
2024-10-21 NetSafe: Exploring the Topological Safety of Multi-agent Networks Miao Yu et.al. 2410.15686v1 null
2024-10-20 TAGExplainer: Narrating Graph Explanations for Text-Attributed Graph Learning Models Bo Pan et.al. 2410.15268v1 null
2024-10-19 Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction Yinhan He et.al. 2410.15165v1 null
2024-10-19 MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science Junho Kim et.al. 2410.15126v1 null
2024-10-19 Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models Qitan Lv et.al. 2410.15116v1 null
2024-10-19 A Prompt Engineering Approach and a Knowledge Graph based Framework for Tackling Legal Implications of Large Language Model Answers George Hannah et.al. 2410.15064v1 null
2024-10-19 LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model Tianqianjin Lin et.al. 2410.14961v1 null
2024-10-18 TransBox: EL++-closed Ontology Embedding Hui Yang et.al. 2410.14571v1 null
2024-10-18 Enabling Scalable Evaluation of Bias Patterns in Medical LLMs Hamed Fayyaz et.al. 2410.14763v1 link
2024-10-18 Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning Xingyu Tan et.al. 2410.14211v2 null
2024-10-18 UniMTS: Unified Pre-training for Motion Time Series Xiyuan Zhang et.al. 2410.19818v1 link
2024-10-18 Supervised Chain of Thought Xiang Zhang et.al. 2410.14198v1 null
2024-10-17 Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs Simone Conia et.al. 2410.14057v1 null
2024-10-17 RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs Jiatan Huang et.al. 2410.13987v1 null
2024-10-17 The Mystery of the Pathological Path-star Task for Language Models Arvid Frydenlund et.al. 2410.13779v1 null

Abstracts

Language Models as Causal Effect Generators

2411.08019v1 by Lucius E. J. Bynum, Kyunghyun Cho

We present a framework for large language model (LLM) based data generation with controllable causal structure. In particular, we define a procedure for turning any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly speaking, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We characterize how an SD-SCM allows sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data without needing to manually specify functional relationships between variables. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods on these datasets for average, conditional average, and individual treatment effect estimation, both with and without hidden confounding. Apart from generating data, the same procedure also allows us to test for the presence of a causal effect that might be encoded in an LLM. This procedure can underpin auditing LLMs for misinformation, discrimination, or otherwise undesirable behavior. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure.

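Summary: We propose a data-generation framework based on large language models (LLMs) with controllable causal structure. Specifically, we define a procedure that turns any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We describe how an SD-SCM permits sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. We then use this procedure to propose a new kind of benchmark for causal inference methods, generating individual-level counterfactual data without manually specifying the functional relationships between variables. We build an example benchmark of thousands of datasets and test a suite of popular estimation methods on them for average, conditional average, and individual treatment effect estimation, with and without hidden confounding. Beyond generating data, the same procedure lets us test for the presence of causal effects that may be encoded in an LLM. It can support auditing LLMs for misinformation, discrimination, or other undesirable behavior. We believe SD-SCMs can be a useful tool for any application that benefits from sequential data with controllable causal structure.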
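The sampling idea in this abstract (a user-supplied DAG plus one structural equation per variable, sampled parents-first, with optional interventions) can be illustrated with a minimal sketch. This is not the authors' code: the variable names, the `sample_scm` helper, and the toy lambda mechanisms are all hypothetical, and the lambdas merely stand in for the LLM-defined structural equations of an SD-SCM. Reusing one random seed across two interventions is a crude stand-in for sharing exogenous noise between a factual and a counterfactual sample.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def sample_scm(parents, mechanisms, interventions=None, rng=None):
    """Sample one unit from a structural causal model.

    parents:       {variable: [parent variables]} describing the DAG
    mechanisms:    {variable: f(parent_values, rng) -> value}
    interventions: {variable: forced value}, i.e. do(variable = value)
    """
    rng = rng or random.Random()
    interventions = interventions or {}
    values = {}
    # TopologicalSorter expects node -> predecessors, so parents come first.
    for var in TopologicalSorter(parents).static_order():
        if var in interventions:
            values[var] = interventions[var]  # ignore the usual mechanism
        else:
            parent_values = {p: values[p] for p in parents.get(var, [])}
            values[var] = mechanisms[var](parent_values, rng)
    return values

# Hypothetical toy model: age -> treatment -> outcome, with age -> outcome edge.
parents = {"age": [], "treatment": ["age"], "outcome": ["age", "treatment"]}
mechanisms = {
    "age": lambda pv, rng: rng.randint(20, 80),
    "treatment": lambda pv, rng: int(rng.random() < (0.7 if pv["age"] > 50 else 0.3)),
    "outcome": lambda pv, rng: int(rng.random() < 0.2 + 0.5 * pv["treatment"]),
}

# Same seed, different intervention: the two samples share their noise draws.
untreated = sample_scm(parents, mechanisms, {"treatment": 0}, random.Random(0))
treated = sample_scm(parents, mechanisms, {"treatment": 1}, random.Random(0))
print(untreated, treated)
```

Calling `sample_scm` with no interventions yields an observational sample; passing an `interventions` dict yields an interventional one.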

From General to Specific: Utilizing General Hallucation to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents

2411.07965v1 by Chuyi Kong, Ziyang Luo, Hongzhan Lin, Zhiyuan Fan, Yaxin Fan, Yuxi Sun, Jing Ma

The advanced role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks, such as HPD, which incorporates manually scored character relationships into the context for LLMs to sort coherence, and SocialBench, which uses specific profiles generated by LLMs in the context of multiple-choice tasks to assess character preferences, face limitations like poor generalizability, implicit and inaccurate judgments, and excessive context length. To address the above issues, we propose an automatic, scalable, and generalizable paradigm. Specifically, we construct a benchmark by extracting relations from a general knowledge graph and leverage RPA's inherent hallucination properties to prompt it to interact across roles, employing ChatGPT for stance detection and defining relationship hallucination along with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further explore factors influencing these metrics and discuss the trade-off between relationship hallucination and factuality.

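Summary: The advanced role-playing capabilities of large language models (LLMs) have paved the way for developing role-playing agents (RPAs). However, existing benchmarks face limitations such as poor generalizability, implicit and inaccurate judgments, and excessive context length. Examples include HPD, which places manually scored character relationships into the LLM's context in order to rank coherence, and SocialBench, which uses LLM-generated profiles within multiple-choice tasks to assess character preferences. To address these problems, we propose an automatic, scalable, and generalizable paradigm. Specifically, we build a benchmark by extracting relations from a general knowledge graph and exploit the RPA's inherent hallucination properties to prompt it to interact across roles, using ChatGPT for stance detection and defining relationship hallucination together with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further examine the factors that influence these metrics and discuss the trade-off between relationship hallucination and factuality.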

Chain Association-based Attacking and Shielding Natural Language Processing Systems

2411.07843v1 by Jiacheng Huang, Long Chen

Association is a gift that lets people refer to something without mentioning it in completely straightforward words, while still allowing others to understand what they mean. In this paper, we propose a chain association-based adversarial attack against natural language processing systems, exploiting the comprehension gap between humans and machines. We first generate a chain association graph for Chinese characters based on the association paradigm, which serves as the search space of potential adversarial examples. Then, we introduce a discrete particle swarm optimization algorithm to search for the optimal adversarial examples. We conduct comprehensive experiments and show that advanced natural language processing models and applications, including large language models, are vulnerable to our attack, while humans appear good at understanding the perturbed text. We also explore two methods, adversarial training and associative graph-based recovery, to shield systems from chain association-based attacks. Because a few examples use derogatory terms, this paper contains material that may be offensive or upsetting to some people.

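Summary: Association is a gift: it lets people avoid stating something in completely plain words while still letting others understand what they mean. In this paper, we propose a chain association-based adversarial attack on natural language processing systems that exploits the comprehension gap between humans and machines. We first generate a chain association graph for Chinese characters based on the association paradigm and use it to build the search space of potential adversarial examples. We then introduce a discrete particle swarm optimization algorithm to search for the best adversarial examples. Our comprehensive experiments show that advanced natural language processing models and applications, including large language models, are vulnerable to the attack, whereas humans appear to understand the perturbed text well. We also explore two defenses, adversarial training and association graph-based recovery, to protect systems from chain association-based attacks. Because some examples use derogatory terms, this paper contains material that may offend or upset some readers.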

Gradual Fine-Tuning with Graph Routing for Multi-Source Unsupervised Domain Adaptation

2411.07185v1 by Yao Ma, Samuel Louvan, Zhunxuan Wang

Multi-source unsupervised domain adaptation aims to leverage labeled data from multiple source domains to train a machine learning model that generalizes well on a target domain without labels. Source domain selection plays a crucial role in determining the model's performance, and it relies on the similarities among source and target domains. Nonetheless, existing work on source domain selection often involves heavyweight computational procedures, especially when there are numerous source domains and the best ones must be identified among them. In this paper, we introduce a framework for gradual fine-tuning (GFT) of machine learning models on multiple source domains. We represent multiple source domains as an undirected weighted graph. We then give a new generalization error bound for GFT along any path within the graph, which is used to determine the optimal path corresponding to the optimal training order. With this formulation, we introduce three lightweight graph-routing strategies that tend to minimize the error bound. Our best strategy improves accuracy by 2.3% over the state of the art on the Natural Language Inference (NLI) task and achieves competitive performance on the Sentiment Analysis (SA) task, including a 3.9% improvement on a more diverse subset of the data we use for SA.

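Summary: Multi-source unsupervised domain adaptation aims to use labeled data from multiple source domains to train a machine learning model that generalizes well to an unlabeled target domain. Source domain selection is critical to model performance and depends on the similarity between the source and target domains. Even so, existing work on source domain selection usually involves heavyweight computation, especially when there are many source domains and the best ones must be identified among them. In this paper, we present a framework for gradual fine-tuning (GFT) of machine learning models on multiple source domains. We represent the source domains as an undirected weighted graph. We then give a new generalization error bound for GFT along any path in the graph, which is used to determine the optimal path and hence the optimal training order. With this formulation, we introduce three lightweight graph-routing strategies that tend to minimize the error bound. Our best strategy improves accuracy by 2.3% over the state of the art on the natural language inference (NLI) task and achieves competitive performance on the sentiment analysis (SA) task, in particular a 3.9% improvement on the more diverse subset of data we use for SA.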
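To make the "path in a domain graph determines the training order" idea concrete, here is a small sketch. The abstract does not describe its three routing strategies or its error bound, so this example swaps in plain Dijkstra shortest path as a stand-in; the domain names and edge weights (hypothetical dissimilarity scores) are invented for illustration.

```python
import heapq

def undirected(edges):
    """Build {node: {neighbor: weight}} from (a, b, weight) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (path, total weight)."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if goal not in dist:
        raise ValueError("goal is unreachable from start")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]

# Hypothetical domains; lower weight means the two domains are more similar.
graph = undirected([
    ("news", "reviews", 4), ("news", "forums", 9), ("reviews", "forums", 3),
    ("forums", "target", 2), ("reviews", "target", 8), ("news", "target", 15),
])
path, cost = shortest_path(graph, "news", "target")
training_order = path[:-1]  # fine-tune on these sources in sequence
print(training_order, cost)
```

Under these assumed weights the route passes through every source, but in general a shortest path may skip sources, which doubles as a form of source selection.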

A Domain-Agnostic Neurosymbolic Approach for Big Social Data Analysis: Evaluating Mental Health Sentiment on Social Media during COVID-19

2411.07163v1 by Vedant Khandelwal, Manas Gaur, Ugur Kursuncu, Valerie Shalin, Amit Sheth

Monitoring public sentiment via social media is potentially helpful during health crises such as the COVID-19 pandemic. However, traditional frequency-based, data-driven neural network-based approaches can miss newly relevant content due to the evolving nature of language in a dynamically evolving environment. Human-curated symbolic knowledge sources, such as lexicons for standard language and slang terms, can potentially elevate social media signals in evolving language. We introduce a neurosymbolic method that integrates neural networks with symbolic knowledge sources, enhancing the detection and interpretation of mental health-related tweets relevant to COVID-19. Our method was evaluated using a corpus of large datasets (approximately 12 billion tweets, 2.5 million subreddit data, and 700k news articles) and multiple knowledge graphs. This method dynamically adapts to evolving language, outperforming purely data-driven models with an F1 score exceeding 92%. This approach also showed faster adaptation to new data and lower computational demands than fine-tuning pre-trained large language models (LLMs). This study demonstrates the benefit of neurosymbolic methods in interpreting text in a dynamic environment for tasks such as health surveillance.

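Summary: Monitoring public sentiment through social media can be helpful during health crises such as COVID-19. However, traditional frequency-based, data-driven neural network methods may miss newly relevant content because language keeps evolving in a dynamic environment. Human-curated symbolic knowledge sources, such as lexicons of standard language and slang terms, may strengthen social media signals in evolving language. We introduce a neurosymbolic method that integrates neural networks with symbolic knowledge sources, improving the detection and interpretation of mental health-related tweets relevant to COVID-19. Our method was evaluated on a corpus of large datasets (about 12 billion tweets, 2.5 million subreddit data, and 700k news articles) and multiple knowledge graphs. The method adapts dynamically to evolving language and outperforms purely data-driven models, with an F1 score above 92%. It also adapts to new data faster and needs less computation than fine-tuning pre-trained large language models (LLMs). This study demonstrates the value of neurosymbolic methods for interpreting text in a dynamic environment, for tasks such as health surveillance.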

A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs

2411.07098v1 by Myeongsoo Kim, Tyler Stennett, Saurabh Sinha, Alessandro Orso

As modern web services increasingly rely on REST APIs, their thorough testing has become crucial. Furthermore, the advent of REST API specifications such as the OpenAPI Specification has led to the emergence of many black-box REST API testing tools. However, these tools often focus on individual test elements in isolation (e.g., APIs, parameters, values), resulting in lower coverage and less effectiveness in detecting faults (i.e., 500 response codes). To address these limitations, we present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach for REST API testing, integrating Multi-Agent Reinforcement Learning (MARL) with a Semantic Property Dependency Graph (SPDG) and Large Language Models (LLMs). Our approach treats REST API testing as a separable problem, where four agents -- API, dependency, parameter, and value -- collaborate to optimize API exploration. LLMs handle domain-specific value restrictions, the SPDG model simplifies the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Evaluated on 12 real-world REST services, AutoRestTest outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which augments realistic test inputs using LLMs), in terms of code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to identify an internal server error in Spotify. Our ablation study underscores the significant contributions of the agent learning, SPDG, and LLM components.

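Summary: As modern web services increasingly depend on REST APIs, testing them thoroughly has become crucial. In addition, the arrival of REST API specifications such as the OpenAPI Specification has led to many black-box REST API testing tools. However, these tools usually focus on individual test elements in isolation (e.g., APIs, parameters, values), which results in lower coverage and less effective fault detection (i.e., 500 response codes). To address these limitations, we present AutoRestTest, the first black-box framework to adopt a dependency-embedded multi-agent approach to REST API testing, integrating multi-agent reinforcement learning (MARL) with a Semantic Property Dependency Graph (SPDG) and large language models (LLMs). Our approach treats REST API testing as a separable problem in which four agents (API, dependency, parameter, and value) collaborate to optimize API exploration. LLMs handle domain-specific value restrictions, the SPDG model narrows the search space for dependencies using a similarity score between API operations, and MARL dynamically optimizes the agents' behavior. Evaluated on 12 real-world REST services, AutoRestTest outperforms the four leading black-box REST API testing tools, including those assisted by RESTGPT (which uses LLMs to augment realistic test inputs), in code coverage, operation coverage, and fault detection. Notably, AutoRestTest is the only tool able to identify an internal server error in Spotify. Our ablation study highlights the significant contributions of the agent learning, SPDG, and LLM components.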

Bridge: A Unified Framework to Knowledge Graph Completion via Language Models and Knowledge Representation

2411.06660v1 by Qiao Qiao, Yuepei Li, Qing Wang, Kang Zhou, Qi Li

Knowledge graph completion (KGC) is the task of inferring missing triples based on existing Knowledge Graphs (KGs). Both structural and semantic information are vital for successful KGC. However, existing methods use either the structural knowledge from KG embeddings or the semantic information from pre-trained language models (PLMs), but not both, leading to suboptimal model performance. Moreover, since PLMs are not trained on KGs, directly using PLMs to encode triples may be inappropriate. To overcome these limitations, we propose a novel framework called Bridge, which jointly encodes structural and semantic information of KGs. Specifically, we strategically encode entities and relations separately with PLMs to better utilize the semantic knowledge of PLMs and enable structured representation learning via a structural learning principle. Furthermore, to bridge the gap between KGs and PLMs, we employ a self-supervised representation learning method called BYOL to fine-tune PLMs with two different views of a triple. Unlike the original BYOL, which uses augmentation methods to create two semantically similar views of the same image and may thereby alter the semantic information, we strategically separate the triple into two parts to create different views, thus avoiding semantic alteration. Experiments demonstrate that Bridge outperforms the SOTA models on three benchmark datasets.

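Summary: Knowledge graph completion (KGC) is the task of inferring missing triples from existing knowledge graphs (KGs). Structural and semantic information are both essential to successful KGC. However, existing methods use only the structural knowledge from KG embeddings or only the semantic information from pre-trained language models (PLMs), which leads to poor model performance. Moreover, because PLMs are not trained on KGs, using them directly to encode triples may be inappropriate. To overcome these limitations, we propose a new framework called Bridge that jointly encodes the structural and semantic information of KGs. Specifically, we encode entities and relations separately with PLMs to make better use of the PLMs' semantic knowledge and to enable structured representation learning through a structural learning principle. In addition, to close the gap between KGs and PLMs, we use a self-supervised representation learning method called BYOL to fine-tune PLMs with two different views of a triple. BYOL itself uses augmentation to create two semantically similar views of the same image, which may alter the semantic information; in contrast, we split the triple into two parts to create the different views and so avoid semantic alteration. Experiments show that Bridge outperforms SOTA models on three benchmark datasets.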

CausalStock: Deep End-to-end Causal Discovery for News-driven Stock Movement Prediction

2411.06391v1 by Shuqi Li, Yuebo Sun, Yuxin Lin, Xin Gao, Shuo Shang, Rui Yan

Two issues in news-driven multi-stock movement prediction are not well solved by existing work. First, "relation discovery" is pivotal when leveraging the price information of other stocks to predict stock movements accurately. Because stock relations are often unidirectional, such as the "supplier-consumer" relationship, causal relations are better suited to capturing the impact between stocks. Second, news data contains substantial noise, which makes it difficult to extract effective information. With these two issues in mind, we propose a novel framework called CausalStock for news-driven multi-stock movement prediction, which discovers the temporal causal relations between stocks. We design a lag-dependent temporal causal discovery mechanism to model the temporal causal graph distribution. Then a Functional Causal Model is employed to encapsulate the discovered causal relations and predict the stock movements. Additionally, we propose a Denoised News Encoder that takes advantage of the excellent text evaluation ability of large language models (LLMs) to extract useful information from massive news data. The experimental results show that CausalStock outperforms strong baselines on both news-driven multi-stock movement prediction and multi-stock movement prediction tasks on six real-world datasets collected from the US, China, Japan, and UK markets. Moreover, by drawing on the causal relations, CausalStock can offer a clear prediction mechanism with good explainability.

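Summary: Existing research has not properly solved two problems in news-driven multi-stock movement prediction. On the one hand, "relation discovery" is a key step when using the price information of other stocks to predict stock movements accurately. Because stock relations are usually unidirectional, as in the "supplier-consumer" relationship, causal relations are better suited to capturing the influence between stocks. On the other hand, news data contains a large amount of noise, which makes it hard to extract effective information. With these two problems in mind, we propose a new framework called CausalStock for news-driven multi-stock movement prediction, which discovers the temporal causal relations between stocks. We design a lag-dependent temporal causal discovery mechanism to model the distribution over temporal causal graphs. A functional causal model is then used to encapsulate the discovered causal relations and predict stock movements. We also propose a denoised news encoder that uses the strong text evaluation ability of large language models (LLMs) to extract useful information from massive news data. Experimental results show that CausalStock outperforms strong baselines on both news-driven multi-stock movement prediction and multi-stock movement prediction tasks on six real-world datasets collected from the US, Chinese, Japanese, and UK markets. In addition, thanks to the causal relations, CausalStock can offer a clear prediction mechanism with good explainability.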

Analyzing the Evolution of Graphs and Texts

2411.06295v1 by Xingzhi Guo

With recent advances in representation learning algorithms for graphs (e.g., DeepWalk/GraphSage) and natural language (e.g., Word2Vec/BERT), state-of-the-art models can even achieve human-level performance on many downstream tasks, particularly node and sentence classification. However, most algorithms focus on large-scale models for static graphs and text corpora without considering their inherent dynamic characteristics or discovering the reasons behind the changes. This dissertation aims to efficiently model the dynamics in graphs (such as social networks and citation graphs) and understand the changes in texts (specifically news titles and personal biographies). To achieve this goal, we utilize the renowned Personalized PageRank algorithm to create effective dynamic network embeddings for evolving graphs. Our proposed approaches significantly improve the running time and accuracy for both detecting abnormal network intruders and discovering entity meaning shifts over large-scale dynamic graphs. For text changes, we analyze post-publication changes in news titles to understand the intents behind the edits and discuss the potential impact of title changes from an information integrity perspective. Moreover, we investigate self-presented occupational identities in Twitter users' biographies over five years, examining the effects of job prestige and demographics on how people disclose jobs and quantifying over-represented jobs and their transitions over time.

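Summary: With recent progress in representation learning algorithms for graphs (e.g., DeepWalk/GraphSage) and natural language (e.g., Word2Vec/BERT), state-of-the-art models can even reach human-level performance on many downstream tasks, especially node and sentence classification. However, most algorithms focus on large-scale models for static graphs and text corpora, without considering their inherent dynamics or identifying the reasons for change. This dissertation aims to model the dynamics of graphs (such as social networks and citation graphs) efficiently and to understand changes in text (specifically news titles and personal biographies). To that end, we use the well-known Personalized PageRank algorithm to build effective dynamic network embeddings for evolving graphs. Our proposed methods significantly improve running time and accuracy both for detecting abnormal network intruders and for discovering shifts in entity meaning on large-scale dynamic graphs. For text changes, we analyze how news titles change after publication in order to understand the intent behind the edits, and we discuss the potential impact of title changes on information integrity. In addition, we study the occupational identities that Twitter users present in their biographies over five years, examining how job prestige and demographics affect the way people disclose their jobs and quantifying over-represented jobs and their transitions over time.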
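For readers unfamiliar with Personalized PageRank, which the dissertation builds on, the following is a minimal power-iteration sketch for a static graph. It is a textbook formulation, not the dissertation's dynamic embedding method; the function name, the toy graph, and the choice to send dangling-node mass back to the seed are assumptions made for illustration.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Power iteration for Personalized PageRank.

    adj:   {node: [out-neighbors]}; every neighbor must also be a key
    seed:  the node that the random walk restarts from
    alpha: restart (teleport) probability
    """
    nodes = list(adj)
    rank = {n: 0.0 for n in nodes}
    rank[seed] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        nxt[seed] += alpha  # restart mass always returns to the seed
        for n in nodes:
            out = adj[n]
            if not out:
                # Dangling node: assumed here to send its mass to the seed.
                nxt[seed] += (1 - alpha) * rank[n]
                continue
            share = (1 - alpha) * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

# Toy undirected path graph a - b - c - d, stored as symmetric adjacency lists.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = personalized_pagerank(graph, seed="a")
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

The scores sum to 1 and decay with distance from the seed's neighborhood, which is what makes the resulting vector usable as a seed-specific node representation.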

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

2411.06048v1 by Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Gholamreza Haffari, Yuan-Fang Li

Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs' spatial understanding and reasoning capabilities. Our analyses on object-relationship and multi-hop reasoning reveal several important findings. Firstly, bounding boxes and scene graphs, even synthetic ones, can significantly enhance LMMs' spatial reasoning. Secondly, LMMs struggle more with questions posed from the human perspective than from the camera perspective about the image. Thirdly, chain of thought (CoT) prompting does not improve model performance on complex multi-hop questions involving spatial relations. Moreover, spatial reasoning steps are much less accurate than non-spatial ones across MLLMs. Lastly, our perturbation analysis on GQA-spatial reveals that LMMs are much stronger at basic object detection than complex spatial reasoning. We believe our benchmark dataset and in-depth analyses can spark further research on LMMs' spatial reasoning. The Spatial-MM benchmark is available at: https://github.com/FatemehShiri/Spatial-MM

摘要:大型多模態模型 (LMM) 已在各種視覺和語言任務中取得強勁的表現。然而,它們的空間推理能力尚未得到充分研究。在本文中,我們構建了一個新穎的 VQA 資料集 Spatial-MM,以全面研究 LMM 的空間理解和推理能力。我們對物件關係和多跳推理的分析揭示了幾個重要的發現。首先,邊界框和場景圖,即使是合成的,也可以顯著增強 LMM 的空間推理能力。其次,LMM 在回答從人類視角提出的問題時比從相機視角提出的問題時遇到更多困難。第三,思考鏈 (CoT) 提示並未改善模型在涉及空間關係的複雜多跳問題上的效能。% 此外,在 MLLM 中,空間推理步驟的準確度遠低於非空間步驟。最後,我們對 GQA-spatial 的擾動分析表明,LMM 在基本物件偵測方面的能力遠強於複雜的空間推理。我們相信我們的基準資料集和深入分析可以激發對 LMM 空間推理的進一步研究。Spatial-MM 基準可在以下網址取得:https://github.com/FatemehShiri/Spatial-MM

Mitigating Hallucination with ZeroG: An Advanced Knowledge Management Engine

2411.05936v1 by Anantha Sharma, Sheeba Elizabeth John, Fatemeh Rezapoor Nikroo, Krupali Bhatt, Mrunal Zambre, Aditi Wikhe

The growth of digital documents presents significant challenges in efficient management and knowledge extraction. Traditional methods often struggle with complex documents, leading to issues such as hallucinations and high latency in responses from Large Language Models (LLMs). ZeroG, an innovative approach, significantly mitigates these challenges by leveraging knowledge distillation and prompt tuning to enhance model performance. ZeroG utilizes a smaller model that replicates the behavior of a larger teacher model, ensuring contextually relevant and grounded responses. By employing a black-box distillation approach, it creates a distilled dataset without relying on intermediate features, optimizing computational efficiency. This method significantly enhances accuracy and reduces response times, providing a balanced solution for modern document management. Incorporating advanced techniques for document ingestion and metadata utilization, ZeroG improves the accuracy of question-and-answer systems. The integration of graph databases and robust metadata management further streamlines information retrieval, allowing for precise and context-aware responses. By transforming how organizations interact with complex data, ZeroG enhances productivity and user experience, offering a scalable solution for the growing demands of digital document management.

摘要:數位文件成長帶來顯著的挑戰,包括有效管理和知識萃取。傳統方法經常難以處理複雜文件,導致問題,例如產生幻覺和大型語言模型 (LLM) 回應的高延遲。ZeroG 是一種創新的方法,透過利用知識蒸餾和提示調整來增強模型效能,大幅減輕這些挑戰。 ZeroG 使用較小的模型複製較大的教師模型的行為,透過採用黑盒蒸餾方法,確保在脈絡上相關且有根據的回應,它建立一個蒸餾的資料集,而不需要依賴中間特徵,最佳化運算效率。這種方法大幅提升準確度並減少回應時間,提供現代文件管理的平衡解決方案。 透過整合進階技術來擷取文件和使用元資料,ZeroG 改善問答系統的準確度。圖形資料庫和強健的元資料管理的整合進一步簡化資訊擷取,允許精確且符合脈絡的回應。透過轉換組織與複雜資料互動的方式,ZeroG 提升生產力和使用者體驗,提供可擴充的解決方案,以滿足數位文件管理日益增長的需求。

SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark

2411.05521v1 by Sithursan Sivasubramaniam, Cedric Osei-Akoto, Yi Zhang, Kurt Stockinger, Jonathan Fuerst

Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models have a big impact on query complexity and performance. While this has been a known fact in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context-learning (ICL) approaches for a set of representative closed and open-source LLMs. Our evaluation sheds light on the trade-offs between database models and query languages for different ICL strategies and LLMs. Last, SM3-Text-to-Query is easily extendable to additional query languages or real, standard-based patient databases.

摘要:電子健康紀錄 (EHR) 儲存在各種資料庫系統中,這些系統在異質儲存架構上具有不同的資料庫模型,例如關聯式資料庫、文件儲存或圖形資料庫。這些不同的資料庫模型對查詢複雜度和效能有很大的影響。雖然這在資料庫研究中已經是眾所周知的事實,但令人驚訝的是,它對日益增加的文字轉查詢系統的影響迄今尚未得到調查。在本文中,我們提出 SM3-Text-to-Query,這是第一個基於來自 Synthea 的合成患者資料的多模型醫療文字轉查詢基準,遵循 SNOMED-CT 分類法——一種廣泛使用的涵蓋醫學術語的知識圖譜本體。SM3-Text-to-Query 提供了關聯式資料庫 (PostgreSQL)、文件儲存 (MongoDB) 和圖形資料庫 (Neo4j 和 GraphDB (RDF)) 的資料表示,允許跨四種流行查詢語言(即 SQL、MQL、Cypher 和 SPARQL)進行評估。我們系統且手動開發了 408 個範本問題,我們擴充這些問題以構建一個基準,其中包含 10K 個針對這四種查詢語言的多樣化自然語言問題/查詢對(總共 40K 對)。在我們的資料集上,我們評估了幾種常見的代表性閉源和開源 LLM 的情境學習 (ICL) 方法。我們的評估揭示了不同 ICL 策略和 LLM 的資料庫模型和查詢語言之間的取捨。最後,SM3-Text-to-Query 可以輕鬆擴展到其他查詢語言或真實的基於標準的患者資料庫。

EUREKHA: Enhancing User Representation for Key Hackers Identification in Underground Forums

2411.05479v1 by Abdoul Nasser Hassane Amadou, Anas Motii, Saida Elouardi, EL Houcine Bergou

Underground forums serve as hubs for cybercriminal activities, offering a space for anonymity and evasion of conventional online oversight. In these hidden communities, malicious actors collaborate to exchange illicit knowledge, tools, and tactics, driving a range of cyber threats from hacking techniques to the sale of stolen data, malware, and zero-day exploits. Identifying the key instigators (i.e., key hackers) behind these operations is essential but remains a complex challenge. This paper presents a novel method called EUREKHA (Enhancing User Representation for Key Hacker Identification in Underground Forums), designed to identify these key hackers by modeling each user as a textual sequence. This sequence is processed through a large language model (LLM) for domain-specific adaptation, with LLMs acting as feature extractors. These extracted features are then fed into a Graph Neural Network (GNN) to model user structural relationships, significantly improving identification accuracy. Furthermore, we employ BERTopic (Bidirectional Encoder Representations from Transformers Topic Modeling) to extract personalized topics from user-generated content, enabling multiple textual representations per user and optimizing the selection of the most representative sequence. Our study demonstrates that fine-tuned LLMs outperform state-of-the-art methods in identifying key hackers. Additionally, when combined with GNNs, our model achieves significant improvements, resulting in approximately 6% and 10% increases in accuracy and F1-score, respectively, over existing methods. EUREKHA was tested on the Hack-Forums dataset, and we provide open-source access to our code.

摘要:地下論壇是網路犯罪活動的樞紐,提供匿名和規避傳統網路監督的空間。在這些隱藏的社群中,惡意行為者合作交換非法知識、工具和策略,推動從駭客技術到銷售竊取資料、惡意軟體和零時差漏洞的各種網路威脅。找出這些行動背後的關鍵煽動者(即關鍵駭客)至關重要,但仍然是一個複雜的挑戰。本文提出了一種稱為 EUREKHA(增強使用者表徵以識別地下論壇中的關鍵駭客)的新方法,旨在透過將每個使用者建模為文字序列來識別這些關鍵駭客。此序列透過大型語言模型(LLM)處理以進行特定領域的適應,其中 LLM 作為特徵萃取器。然後將這些萃取的特徵輸入圖神經網路(GNN)以建模使用者結構關係,大幅提升識別準確度。此外,我們採用 BERTopic(來自 Transformer 主題建模的雙向編碼器表徵)從使用者產生的內容中萃取個人化主題,為每個使用者啟用多個文字表徵,並最佳化最具代表性序列的選擇。我們的研究表明,微調後的 LLM 在識別關鍵駭客方面優於最先進的方法。此外,當與 GNN 結合使用時,我們的模型獲得顯著的提升,與現有方法相比,準確度和 F1 分數分別提高了約 6% 和 10%。EUREKHA 已在 Hack-Forums 資料集上進行測試,我們提供開源方式存取我們的程式碼。

When are 1.58 bits enough? A Bottom-up Exploration of BitNet Quantization

2411.05882v1 by Jacob Nielsen, Lukas Galke, Peter Schneider-Kamp

Contemporary machine learning models, such as language models, are powerful, but come with immense resource requirements both at training and inference time. It has been shown that decoder-only language models can be trained to a competitive state with ternary weights (1.58 bits per weight), facilitating efficient inference. Here, we start our exploration with non-transformer model architectures, investigating 1.58-bit training for multi-layer perceptrons and graph neural networks. Then, we explore 1.58-bit training in other transformer-based language models, namely encoder-only and encoder-decoder models. Our results show that in all of these settings, 1.58-bit training is on par with or sometimes even better than the standard 32/16-bit models.

摘要:當代機器學習模型(例如語言模型)功能強大, 但在訓練和推論時間上都需要大量的資源。已經證明,僅解碼器語言模型可以用三元權重(每個權重 1.58 位元)訓練到競爭狀態,促進有效率的推論。在此,我們從非Transformer模型架構開始探討,研究多層感知器和圖神經網路的 1.58 位元訓練。接著,我們探討其他基於Transformer的語言模型(即僅編碼器和編碼器-解碼器模型)的 1.58 位元訓練。我們的結果顯示,在所有這些設定中,1.58 位元訓練與標準 32/16 位元模型相當,有時甚至更好。
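
The "1.58 bits per weight" figure in the abstract above is log2(3), the information content of a weight restricted to three values. As a minimal sketch, assuming an absmean-style rounding rule (scale by the mean absolute weight, round, clip to {-1, 0, +1}), ternary quantization of a weight vector could look like the following. The function names and the sample weights are illustrative assumptions, not the authors' code, and the abstract does not say which rounding rule the paper uses.

```python
import math

# Information content of one ternary weight: log2(3), roughly 1.585 bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)

def absmean_ternary(weights, eps=1e-8):
    """Quantize floats to {-1, 0, +1} with one shared scale (assumed absmean rule)."""
    # The scale is the mean absolute value of the weights (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map ternary codes back to approximate float weights."""
    return [q * scale for q in quantized]

# Hypothetical weights, for illustration only.
weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]
codes, scale = absmean_ternary(weights)
approx = dequantize(codes, scale)
```

Small weights collapse to 0 and large ones saturate at plus or minus the scale, so the only per-tensor floating-point value that has to be stored is the scale itself.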

Exploring the Alignment Landscape: LLMs and Geometric Deep Models in Protein Representation

2411.05316v1 by Dong Shu, Bingbing Duan, Kai Guo, Kaixiong Zhou, Jiliang Tang, Mengnan Du

Latent representation alignment has become a foundational technique for constructing multimodal large language models (MLLM) by mapping embeddings from different modalities into a shared space, often aligned with the embedding space of large language models (LLMs) to enable effective cross-modal understanding. While preliminary protein-focused MLLMs have emerged, they have predominantly relied on heuristic approaches, lacking a fundamental understanding of optimal alignment practices across representations. In this study, we explore the alignment of multimodal representations between LLMs and Geometric Deep Models (GDMs) in the protein domain. We comprehensively evaluate three state-of-the-art LLMs (Gemma2-2B, LLaMa3.1-8B, and LLaMa3.1-70B) with four protein-specialized GDMs (GearNet, GVP, ScanNet, GAT). Our work examines alignment factors from both model and protein perspectives, identifying challenges in current alignment methodologies and proposing strategies to improve the alignment process. Our key findings reveal that GDMs incorporating both graph and 3D structural information align better with LLMs, larger LLMs demonstrate improved alignment capabilities, and protein rarity significantly impacts alignment performance. We also find that increasing GDM embedding dimensions, using two-layer projection heads, and fine-tuning LLMs on protein-specific data substantially enhance alignment quality. These strategies offer potential enhancements to the performance of protein-related multimodal models. Our code and data are available at https://github.com/Tizzzzy/LLM-GDM-alignment.

摘要:潛在表徵對齊已成為建構多模態大型語言模型 (MLLM) 的基礎技術,方法是將不同模態的嵌入映射到共享空間中,通常與大型語言模型 (LLM) 的嵌入空間對齊,以實現有效的跨模態理解。雖然初步以蛋白質為重點的 MLLM 已出現,但它們主要依賴啟發式方法,缺乏對跨表徵最佳對齊實務的基本理解。在本研究中,我們探討了蛋白質領域中 LLM 與幾何深度模型 (GDM) 之間的多模態表徵對齊。我們全面評估了三個最先進的 LLM(Gemma2-2B、LLaMa3.1-8B 和 LLaMa3.1-70B)與四個蛋白質專用 GDM(GearNet、GVP、ScanNet、GAT)。我們的研究從模型和蛋白質角度檢視對齊因素,識別當前對齊方法的挑戰,並提出改善對齊程序的策略。我們的關鍵發現顯示,同時包含圖形和 3D 結構資訊的 GDM 與 LLM 的對齊效果較佳,較大的 LLM 展現出更佳的對齊能力,而蛋白質的稀有性顯著影響對齊效能。我們還發現,增加 GDM 嵌入維度、使用兩層投影頭,以及針對蛋白質特定資料微調 LLM,可以大幅提升對齊品質。這些策略為蛋白質相關多模態模型的效能提供潛在的強化。我們的程式碼和資料可在 https://github.com/Tizzzzy/LLM-GDM-alignment 取得。

LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration

2411.05844v1 by Yukun Cao, Zengyi Gao, Zhiyang Li, Xike Xie, S Kevin Zhou

GraphRAG addresses significant challenges in Retrieval-Augmented Generation (RAG) by leveraging graphs with embedded knowledge to enhance the reasoning capabilities of Large Language Models (LLMs). Despite its promising potential, the GraphRAG community currently lacks a unified framework for fine-grained decomposition of the graph-based knowledge retrieval process. Furthermore, there is no systematic categorization or evaluation of existing solutions within the retrieval process. In this paper, we present LEGO-GraphRAG, a modular framework that decomposes the retrieval process of GraphRAG into three interconnected modules: subgraph-extraction, path-filtering, and path-refinement. We systematically summarize and classify the algorithms and neural network (NN) models relevant to each module, providing a clearer understanding of the design space for GraphRAG instances. Additionally, we identify key design factors, such as Graph Coupling and Computational Cost, that influence the effectiveness of GraphRAG implementations. Through extensive empirical studies, we construct high-quality GraphRAG instances using a representative selection of solutions and analyze their impact on retrieval and reasoning performance. Our findings offer critical insights into optimizing GraphRAG instance design, ultimately contributing to the advancement of more accurate and contextually relevant LLM applications.

摘要:GraphRAG 透過利用具嵌入知識的圖表來增強大型語言模型 (LLM) 的推理能力,解決了檢索增強生成 (RAG) 中的重大挑戰。儘管具有令人期待的潛力,但 GraphRAG 社群目前缺乏一個統一的架構,用於對基於圖表的知識檢索過程進行細粒度的分解。此外,在檢索過程中,現有解決方案並未進行系統性的分類或評估。在本文中,我們提出了 LEGO-GraphRAG,這是一個模組化架構,將 GraphRAG 的檢索過程分解為三個相互連接的模組:子圖萃取、路徑過濾和路徑精煉。我們系統性地總結和分類與每個模組相關的演算法和神經網路 (NN) 模型,提供對 GraphRAG 實例設計空間的更清晰理解。此外,我們找出影響 GraphRAG 實作有效性的關鍵設計因素,例如圖表耦合和運算成本。透過廣泛的經驗研究,我們使用具代表性的解決方案選擇來建構高品質的 GraphRAG 實例,並分析它們對檢索和推理效能的影響。我們的研究結果提供了優化 GraphRAG 實例設計的重要見解,最終有助於推進更準確且與脈絡相關的 LLM 應用。
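
The three-module decomposition named in the abstract above (subgraph-extraction, path-filtering, path-refinement) can be sketched as a small pipeline. Everything below is a toy under stated assumptions: the graph, the function names, and the heuristics (k-hop expansion, relation keyword match, shortest-path preference) are hypothetical stand-ins for the algorithms and neural models the paper surveys.

```python
from collections import deque

# Toy knowledge graph: head -> list of (relation, tail). Entirely hypothetical data.
KG = {
    "aspirin": [("treats", "headache"), ("interacts_with", "warfarin")],
    "warfarin": [("treats", "thrombosis")],
    "headache": [("symptom_of", "migraine")],
    "thrombosis": [],
    "migraine": [],
}

def extract_subgraph(kg, seed, max_hops):
    """Module 1 (subgraph-extraction): keep nodes within max_hops of the seed."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue
        for _, tail in kg.get(node, []):
            if tail not in depth:
                depth[tail] = depth[node] + 1
                queue.append(tail)
    return {n: [(r, t) for r, t in kg.get(n, []) if t in depth] for n in depth}

def enumerate_paths(subgraph, seed):
    """List every simple path from the seed as a list of (head, relation, tail)."""
    paths = []

    def walk(node, path, seen):
        for rel, tail in subgraph[node]:
            if tail in seen:
                continue
            new_path = path + [(node, rel, tail)]
            paths.append(new_path)
            walk(tail, new_path, seen | {tail})

    walk(seed, [], {seed})
    return paths

def filter_paths(paths, question_relations):
    """Module 2 (path-filtering): keep paths that use a relation named in the question."""
    return [p for p in paths if any(rel in question_relations for _, rel, _ in p)]

def refine_paths(paths, top_k):
    """Module 3 (path-refinement): prefer shorter paths and keep the top_k."""
    return sorted(paths, key=len)[:top_k]

subgraph = extract_subgraph(KG, "aspirin", max_hops=2)
candidates = enumerate_paths(subgraph, "aspirin")
filtered = filter_paths(candidates, {"treats"})
refined = refine_paths(filtered, top_k=2)
```

Keeping each module behind its own function is what makes the design space explorable: any one stage can be swapped for a different method without touching the other two.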

MEG: Medical Knowledge-Augmented Large Language Models for Question Answering

2411.03883v2 by Laura Cabello, Carmen Martin-Turrero, Uchenna Akujuobi, Anders Søgaard, Carlos Bobed

Question answering is a natural language understanding task that involves reasoning over both explicit context and unstated, relevant domain knowledge. Large language models (LLMs), which underpin most contemporary question answering systems, struggle to induce how concepts relate in specialized domains such as medicine. Existing medical LLMs are also costly to train. In this work, we present MEG, a parameter-efficient approach for medical knowledge-augmented LLMs. MEG uses a lightweight mapping network to integrate graph embeddings into the LLM, enabling it to leverage external knowledge in a cost-effective way. We evaluate our method on four popular medical multiple-choice datasets and show that LLMs greatly benefit from the factual grounding provided by knowledge graph embeddings. MEG attains an average of +10.2% accuracy over the Mistral-Instruct baseline, and +6.7% over specialized models like BioMistral. We also show results based on Llama-3. Finally, we show that MEG's performance remains robust to the choice of graph encoder.

摘要:問答是自然語言理解任務,涉及對明確的上下文和未說明的相關領域知識進行推理。支撐大多數當代問答系統的大型語言模型 (LLM) 難以推論概念如何在醫學等專業領域中關聯。現有的醫學 LLM 訓練成本也很高。在這項工作中,我們提出了 MEG,這是一種用於醫學知識增強 LLM 的參數有效方法。MEG 使用輕量級映射網路將圖表嵌入整合到 LLM 中,使其能夠以經濟有效的方式利用外部知識。我們在四個流行的醫學多選題資料集上評估了我們的方法,並表明 LLM 從知識圖表嵌入提供的實際依據中受益匪淺。MEG 在 Mistral-Instruct 基準上平均提高了 +10.2% 的準確度,在 BioMistral 等專門模型上提高了 +6.7%。我們還展示了基於 Llama-3 的結果。最後,我們表明 MEG 的性能對圖表編碼器的選擇保持穩健。

The American Sign Language Knowledge Graph: Infusing ASL Models with Linguistic Knowledge

2411.03568v1 by Lee Kezar, Nidhi Munikote, Zian Zeng, Zed Sehyr, Naomi Caselli, Jesse Thomason

Language models for American Sign Language (ASL) could make language technologies substantially more accessible to those who sign. To train models on tasks such as isolated sign recognition (ISR) and ASL-to-English translation, datasets provide annotated video examples of ASL signs. To facilitate the generalizability and explainability of these models, we introduce the American Sign Language Knowledge Graph (ASLKG), compiled from twelve sources of expert linguistic knowledge. We use the ASLKG to train neuro-symbolic models for 3 ASL understanding tasks, achieving accuracies of 91% on ISR, 14% for predicting the semantic features of unseen signs, and 36% for classifying the topic of Youtube-ASL videos.

摘要:美國手語 (ASL) 的語言模型可以讓語言技術對手語使用者更易於使用。為了訓練模型執行手語辨識 (ISR) 和 ASL 轉換成英文等任務,資料集提供 ASL 手勢的註解影片範例。為了促進這些模型的概括性和可解釋性,我們引入了美國手語知識圖譜 (ASLKG),它是由十二個專家語言知識來源編譯而成的。我們使用 ASLKG 訓練神經符號模型來執行 3 項 ASL 理解任務,在 ISR 上達到 91% 的準確度、在預測未見手勢的語義特徵上達到 14%,以及在分類 YouTube-ASL 影片主題上達到 36%。

Graph-DPEP: Decomposed Plug and Ensemble Play for Few-Shot Document Relation Extraction with Graph-of-Thoughts Reasoning

2411.02864v1 by Tao Zhang, Ning Yan, Masood Mortazavi, Hoang H. Nguyen, Zhongfen Deng, Philip S. Yu

Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning capability on many NLP tasks. Recasting an NLP task into a text-to-text generation task is a common practice so that generative LLMs can be prompted to resolve it. However, performing document-level relation extraction (DocRE) tasks with generative LLMs is still challenging due to the structured output format of DocRE, which complicates the conversion to plain text. The limited information available in few-shot samples and prompt instructions makes relation extraction for the entities mentioned in a document even harder. In this paper, we represent the structured output as a graph-style triplet rather than natural language expressions and leverage generative LLMs for the DocRE task. Our approach, the Graph-DPEP framework, is grounded in the reasoning behind triplet explanation thoughts presented in natural language. In this framework, we first introduce a "decomposed-plug" method for performing the generation from LLMs over prompts with type-space decomposition to alleviate the burden of distinguishing all relation types. Second, we employ a verifier for calibrating the generation and identifying overlooked query entity pairs. Third, we develop "ensemble-play", reapplying generation on the entire type list by leveraging the reasoning thoughts embedded in a sub-graph associated with the missing query pair to address the missingness issue. Through extensive comparisons with existing prompt techniques and alternative LLMs, our framework demonstrates superior performance on publicly available benchmarks.

摘要:大型語言模型 (LLM) 在海量語料庫上預先訓練,已在許多自然語言處理任務上展現出令人印象深刻的少量樣本學習能力。將自然語言處理任務轉化為文字到文字的生成任務是一種常見做法,這樣生成式大型語言模型就可以提示解決它。然而,由於 DocRE 的結構化輸出格式,使用生成式大型語言模型來執行文件級別關係萃取 (DocRE) 任務仍然具有挑戰性,這使得轉換為純文字變得複雜。少量樣本和提示說明中可用的資訊有限,會導致在文件中提到實體的關係萃取中產生進一步的困難和挑戰。在本文中,我們將結構化輸出表示為圖形樣式的三元組,而不是自然語言表達,並利用生成式大型語言模型來執行 DocRE 任務。我們的做法,圖形 DPEP 框架,是基於自然語言中呈現的三元組解釋思想背後的推理。在這個框架中,我們首先介紹一種「分解插入」方法,用於對具有類型空間分解的提示進行大型語言模型生成,以減輕區分所有關係類型的負擔。其次,我們使用驗證器來校準生成並識別被忽略的查詢實體對。第三,我們開發「整體遊戲」,通過利用與遺失查詢對相關的子圖中嵌入的推理思想,在整個類型列表上重新應用生成,以解決遺失問題。通過與現有提示技術和替代語言模型 (LLM) 的廣泛比較,我們的框架在實驗中證明了在公開基準上的優異性能。

Multimodal Commonsense Knowledge Distillation for Visual Question Answering

2411.02722v1 by Shuo Yang, Siwen Luo, Soyeon Caren Han

Existing Multimodal Large Language Models (MLLMs) and Visual Language Pretrained Models (VLPMs) have shown remarkable performance on general Visual Question Answering (VQA). However, these models struggle with VQA questions that require external commonsense knowledge due to the challenges in generating high-quality prompts and the high computational costs of fine-tuning. In this work, we propose a novel graph-based multimodal commonsense knowledge distillation framework that constructs a unified relational graph over commonsense knowledge, visual objects and questions through a Graph Convolutional Network (GCN) following a teacher-student environment. The proposed framework is flexible with any type of teacher and student models without further fine-tuning, and has achieved competitive performance on the ScienceQA dataset.

摘要:現有的多模態大型語言模型 (MLLM) 和視覺語言預訓練模型 (VLPM) 在一般的視覺問答 (VQA) 中展現了卓越的表現。然而,這些模型在需要外部常識知識的 VQA 問題上會遇到困難,原因在於產生高品質提示的挑戰以及微調的高運算成本。在這項工作中,我們提出了一個新穎的基於圖形的模態常識知識萃取架構,透過圖形卷積網路 (GCN) 在常識知識、視覺物件和問題上建構一個統一的關聯圖形,遵循師生環境。這個提出的架構對於任何類型的教師和學生模型都具有彈性,無需進一步微調,並在 ScienceQA 資料集上取得了有競爭力的表現。

Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography

2411.02591v1 by Harshavardhana T. Gowda, Zachary D. McNaughton, Lee M. Miller

Each year, millions of individuals lose the ability to speak intelligibly due to causes such as neuromuscular disease, stroke, trauma, and head/neck cancer surgery (e.g. laryngectomy) or treatment (e.g. radiotherapy toxicity to the speech articulators). Effective communication is crucial for daily activities, and losing the ability to speak leads to isolation, depression, anxiety, and a host of detrimental sequelae. Noninvasive surface electromyography (sEMG) has shown promise to restore speech output in these individuals. The goal is to collect sEMG signals from multiple articulatory sites as people silently produce speech and then decode the signals to enable fluent and natural communication. Currently, many fundamental questions about the properties of orofacial neuromuscular signals relating to speech articulation remain unanswered. They include questions relating to 1) the data structure of the orofacial sEMG signals, 2) the signal distribution shift of sEMG across individuals, 3) the ability of sEMG signals to span the entire English language phonetic space during silent speech articulations, and 4) the generalization capability of non-invasive sEMG-based silent speech interfaces. We address these questions through a series of experiments involving healthy human subjects. We show that sEMG signals evince graph data structure and that the signal distribution shift is given by a change of basis. Furthermore, we show that silently voiced articulations spanning the entire English language phonetic space can be decoded using small neural networks which can be trained with little data and that such architectures work well across individuals. To ensure transparency and reproducibility, we open-source all the data and code used in this study.

摘要:每年,數百萬人因為神經肌肉疾病、中風、創傷和頭頸癌手術(例如喉切除術)或治療(例如放射治療對言語發音器官的毒性)等原因而失去清晰說話的能力。有效的溝通對於日常生活至關重要,而失去說話能力會導致孤立、沮喪、焦慮和一系列有害的後遺症。非侵入性表面肌電圖 (sEMG) 已顯示出恢復這些人說話輸出的希望。目標是從多個發音部位收集 sEMG 信號,因為人們在無聲地發出言語,然後解碼信號以實現流利和自然的溝通。目前,許多與言語發音有關的顏面神經肌肉信號的基本特性仍然沒有得到解答。它們包括與 1) 顏面 sEMG 信號的數據結構、2) sEMG 在個體間的信號分佈轉移、3) sEMG 信號在無聲言語發音過程中跨越整個英語語言音標空間的能力以及 4) 基於非侵入性 sEMG 的無聲言語介面的概括能力相關的問題。我們通過一系列涉及健康人類受試者的實驗來解決這些問題。我們表明 sEMG 信號證明圖數據結構,並且信號分佈轉移是由基變化的給出。此外,我們表明使用可以通過少量數據訓練的小神經網路可以解碼跨越整個英語語言音標空間的無聲發音,並且此類架構在不同個體之間運行良好。為了確保透明度和可重現性,我們公開了本研究中使用的所有數據和代碼。

GraphXAIN: Narratives to Explain Graph Neural Networks

2411.02540v2 by Mateusz Cedro, David Martens

Graph Neural Networks (GNNs) are a powerful technique for machine learning on graph-structured data, yet they pose interpretability challenges, especially for non-expert users. Existing GNN explanation methods often yield technical outputs such as subgraphs and feature importance scores, which are not easily understood. Building on recent insights from social science and other Explainable AI (XAI) methods, we propose GraphXAIN, a natural language narrative that explains individual predictions made by GNNs. We present a model-agnostic and explainer-agnostic XAI approach that complements graph explainers by generating GraphXAINs, using Large Language Models (LLMs) and integrating graph data, individual predictions from GNNs, explanatory subgraphs, and feature importances. We define XAI Narratives and XAI Descriptions, highlighting their distinctions and emphasizing the importance of narrative principles in effective explanations. By incorporating natural language narratives, our approach supports graph practitioners and non-expert users, aligning with social science research on explainability and enhancing user understanding and trust in complex GNN models. We demonstrate GraphXAIN's capabilities on a real-world graph dataset, illustrating how its generated narratives can aid understanding compared to traditional graph explainer outputs or other descriptive explanation methods.

摘要:圖形神經網路 (GNN) 是用於圖形結構資料的機器學習強大技術,但它們會造成可解釋性挑戰,特別是對於非專家使用者。現有的 GNN 解釋方法通常會產生技術輸出,例如子圖和特徵重要性分數,這些輸出不容易理解。建構於社會科學和其他可解釋 AI (XAI) 方法的最新見解,我們提出 GraphXAIN,這是一種自然語言敘述,可以解釋 GNN 做出的個別預測。我們提出一個與模型無關且與解釋器無關的 XAI 方法,它透過使用大型語言模型 (LLM) 和整合圖形資料、GNN 的個別預測、說明性子圖和特徵重要性來補充圖形解釋器,進而產生 GraphXAIN。我們定義 XAI 敘述和 XAI 描述,強調它們的區別,並強調敘述原則在有效解釋中的重要性。透過結合自然語言敘述,我們的做法支援圖形從業者和非專家使用者,與可解釋性的社會科學研究保持一致,並增強使用者對複雜 GNN 模型的理解和信任。我們在真實世界圖形資料集上展示 GraphXAIN 的功能,說明與傳統圖形解釋器輸出或其他描述性解釋方法相比,其產生的敘述如何有助於理解。

Improving Scientific Hypothesis Generation with Knowledge Grounded Large Language Models

2411.02382v1 by Guangzhi Xiong, Eric Xie, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang

Large language models (LLMs) have demonstrated remarkable capabilities in various scientific domains, from natural language processing to complex problem-solving tasks. Their ability to understand and generate human-like text has opened up new possibilities for advancing scientific research, enabling tasks such as data analysis, literature review, and even experimental design. One of the most promising applications of LLMs in this context is hypothesis generation, where they can identify novel research directions by analyzing existing knowledge. However, despite their potential, LLMs are prone to generating "hallucinations", outputs that are plausible-sounding but factually incorrect. Such a problem presents significant challenges in scientific fields that demand rigorous accuracy and verifiability, potentially leading to erroneous or misleading conclusions. To overcome these challenges, we propose KG-CoI (Knowledge Grounded Chain of Ideas), a novel system that enhances LLM hypothesis generation by integrating external, structured knowledge from knowledge graphs (KGs). KG-CoI guides LLMs through a structured reasoning process, organizing their output as a chain of ideas (CoI), and includes a KG-supported module for the detection of hallucinations. With experiments on our newly constructed hypothesis generation dataset, we demonstrate that KG-CoI not only improves the accuracy of LLM-generated hypotheses but also reduces the hallucination in their reasoning chains, highlighting its effectiveness in advancing real-world scientific research.

摘要:大型語言模型 (LLM) 已在各種科學領域展現卓越的能力,從自然語言處理到複雜的解決問題任務。它們理解和產生類似人類文字的能力為推進科學研究開啟了新的可能性,讓資料分析、文獻回顧,甚至實驗設計等任務成為可能。LLM 在此脈絡中最有希望的應用之一是假設產生,它們能透過分析現有知識來找出新的研究方向。然而,儘管 LLM 具有潛力,它們卻容易產生「幻覺」,也就是聽起來合理但事實上不正確的輸出。此類問題在需要嚴謹準確性和可驗證性的科學領域中會造成重大挑戰,有可能導致錯誤或誤導性的結論。為了克服這些挑戰,我們提出 KG-CoI(知識基礎觀念鏈),這是一個創新的系統,它透過整合知識圖譜 (KG) 中的外部結構化知識來增強 LLM 假設產生。KG-CoI 引導 LLM 進行結構化推理程序,將其輸出整理成觀念鏈 (CoI),並包含一個由 KG 支援的模組來偵測幻覺。透過我們新建立的假設產生資料集進行的實驗,我們證明 KG-CoI 不僅改善了 LLM 產生的假設的準確性,也減少了其推理鏈中的幻覺,突顯了其在推進現實世界科學研究中的效能。

Can Language Models Enable In-Context Database?

2411.01807v1 by Yu Pan, Hongfeng Yu, Tianjiao Zhao, Jianxin Sun

Large language models (LLMs) are emerging as few-shot learners capable of handling a variety of tasks, including comprehension, planning, reasoning, question answering, arithmetic calculations, and more. At the core of these capabilities is LLMs' proficiency in representing and understanding structured or semi-structured data, such as tables and graphs. Numerous studies have demonstrated that reasoning on tabular data or graphs is not only feasible for LLMs but also opens a promising research direction that treats these data as in-context data. The lightweight and human-readable characteristics of an in-context database can potentially make it an alternative to the traditional database in typical RAG (Retrieval Augmented Generation) settings. However, almost all current work focuses on static in-context data, which does not allow dynamic updates. In this paper, to enable dynamic database updates, we propose delta encoding of the database. We explore how data stored in a traditional RDBMS can be encoded as in-context text and evaluate LLMs' proficiency for CRUD (Create, Read, Update and Delete) operations on in-context databases. We present a benchmark named InConDB and conduct extensive experiments to show the performance of different language models in enabling in-context databases by varying the database encoding method, prompting method, operation type and input data distribution, revealing both their proficiency and limitations.

摘要:大型語言模型 (LLM) 逐漸成為僅需少量範例就能處理各種任務的學習者,包括理解、規劃、推理、問答、算術計算等。這些能力的核心是 LLM 在表示和理解結構化或半結構化資料(例如表格和圖形)方面的能力。許多研究已證明,LLM 不僅可以推論表格資料或圖形,還提供了一個有前景的研究方向,將這些資料視為語境資料。語境資料庫的輕量級和人類可讀取特性有可能使其成為典型 RAG(檢索擴充生成)設定中傳統資料庫的替代方案。然而,幾乎所有目前的工作都專注於靜態語境資料,這不允許動態更新。在本文中,為了實現動態資料庫更新,提出了資料庫的 delta 編碼。我們探討了如何將儲存在傳統 RDBMS 中的資料編碼為語境文字,並評估 LLM 在語境資料庫上進行 CRUD(建立、讀取、更新和刪除)操作的能力。提出了名為 InConDB 的基準,並進行了廣泛的實驗,以顯示不同語言模型在通過改變資料庫編碼方法、提示方法、操作類型和輸入資料分佈來啟用語境資料庫方面的效能,揭示了能力和限制。
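
The idea of delta encoding in the abstract above can be sketched as follows: the table is serialized once as text, and each later change is appended as a short delta line rather than re-encoding the whole table. The line format (`TABLE`, `ROW`, `DELTA ...`), the function names, and the sample rows are assumptions made for illustration; the abstract does not specify the encoding used in InConDB.

```python
def encode_table(name, rows):
    """Serialize rows (a list of dicts) as one text line per row."""
    lines = [f"TABLE {name}"]
    for row in rows:
        fields = ", ".join(f"{k}={v}" for k, v in row.items())
        lines.append(f"ROW {fields}")
    return "\n".join(lines)

def encode_delta(op, key, values=None):
    """Serialize a single change instead of re-encoding the whole table."""
    if op == "DELETE":
        return f"DELTA DELETE id={key}"
    fields = ", ".join(f"{k}={v}" for k, v in values.items())
    return f"DELTA {op} id={key}: {fields}"

def apply_deltas(rows, deltas):
    """Reference answer: the table state a model should report after all deltas."""
    state = {row["id"]: dict(row) for row in rows}
    for op, key, values in deltas:
        if op == "INSERT":
            state[key] = {"id": key, **values}
        elif op == "UPDATE":
            state[key].update(values)
        elif op == "DELETE":
            del state[key]
    return list(state.values())

# Hypothetical base table and change log.
base = [
    {"id": 1, "name": "Ada", "dept": "R&D"},
    {"id": 2, "name": "Bo", "dept": "Ops"},
]
deltas = [
    ("UPDATE", 2, {"dept": "R&D"}),
    ("INSERT", 3, {"name": "Cy", "dept": "Ops"}),
    ("DELETE", 1, None),
]

# Text that would be placed in the prompt: base snapshot followed by the deltas.
context = "\n".join([encode_table("staff", base)] + [encode_delta(*d) for d in deltas])
# Ground truth to compare a model's answer against.
expected = apply_deltas(base, deltas)
```

In an evaluation along these lines, `context` would go into the prompt together with a read query, and the model's answer would be checked against `expected`.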

Graph-based Confidence Calibration for Large Language Models

2411.02454v1 by Yukun Li, Sijia Wang, Lifu Huang, Li-Ping Liu

One important approach to improving the reliability of large language models (LLMs) is to provide accurate confidence estimations regarding the correctness of their answers. However, developing a well-calibrated confidence estimation model is challenging, as mistakes made by LLMs can be difficult to detect. We propose a novel method combining the LLM's self-consistency with labeled data and training an auxiliary model to estimate the correctness of its responses to questions. This auxiliary model predicts the correctness of responses based solely on their consistent information. To set up the learning problem, we use a weighted graph to represent the consistency among the LLM's multiple responses to a question. Correctness labels are assigned to these responses based on their similarity to the correct answer. We then train a graph neural network to estimate the probability of correct responses. Experiments demonstrate that the proposed approach substantially outperforms several of the most recent methods in confidence calibration across multiple widely adopted benchmark datasets. Furthermore, the proposed approach significantly improves the generalization capability of confidence calibration on out-of-domain (OOD) data.

摘要:一種改善大型語言模型 (LLM) 可靠性的重要方法是提供有關其答案正確性的準確信心估計。然而,開發一個校準良好的信心估計模型具有挑戰性,因為 LLM 所犯的錯誤可能難以偵測。我們提出一個新方法,結合 LLM 的自我一致性與標籤資料,並訓練一個輔助模型來估計其對問題的回應正確性。這個輔助模型僅根據其一致性資訊來預測回應的正確性。為了設定學習問題,我們使用一個加權圖形來表示 LLM 對一個問題的多次回應之間的一致性。正確性標籤會根據這些回應與正確答案的相似性分配給這些回應。然後,我們訓練一個圖形神經網路來估計正確回應的機率。實驗證明,所提出的方法在多個廣泛採用的基準資料集上,在信心校準方面明顯優於多種最新方法。此外,所提出的方法顯著改善了在領域外 (OOD) 資料上信心校準的泛化能力。
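
The learning setup in the abstract above (a weighted graph over an LLM's sampled responses, with correctness labels assigned by similarity to the reference answer) can be sketched without the graph neural network itself. In this sketch, token-overlap (Jaccard) similarity stands in for whatever similarity measure the paper uses, the 0.5 threshold is an arbitrary assumption, and the weighted degree is only a crude illustration of a consistency signal a GNN could learn from.

```python
def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for any answer-similarity measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0

def consistency_graph(responses):
    """Weighted graph: edge (i, j) -> similarity between responses i and j."""
    n = len(responses)
    return {(i, j): jaccard(responses[i], responses[j])
            for i in range(n) for j in range(i + 1, n)}

def correctness_labels(responses, reference, threshold=0.5):
    """Label a response 1 if it is similar enough to the reference answer, else 0."""
    return [int(jaccard(r, reference) >= threshold) for r in responses]

def weighted_degree(edges, n):
    """Sum of edge weights per node: how strongly each response agrees with the rest."""
    degree = [0.0] * n
    for (i, j), weight in edges.items():
        degree[i] += weight
        degree[j] += weight
    return degree

# Hypothetical sampled responses to one question.
responses = ["Paris", "paris", "Lyon", "Paris"]
edges = consistency_graph(responses)
labels = correctness_labels(responses, reference="Paris")
degrees = weighted_degree(edges, len(responses))
```

Here the outlier response has zero weighted degree and a label of 0, which is the kind of relationship between consistency and correctness that a trained model would be expected to pick up.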

Ontology Population using LLMs

2411.01612v1 by Sanaz Saki Norouzi, Adrita Barua, Antrea Christou, Nikita Gautam, Andrew Eells, Pascal Hitzler, Cogan Shimizu

Knowledge graphs (KGs) are increasingly utilized for data integration, representation, and visualization. While KG population is critical, it is often costly, especially when data must be extracted from unstructured natural language text, which presents challenges such as ambiguity and complex interpretations. Large Language Models (LLMs) offer promising capabilities for such tasks, excelling in natural language understanding and content generation. However, their tendency to "hallucinate" can produce inaccurate outputs. Despite these limitations, LLMs offer rapid and scalable processing of natural language data, and with prompt engineering and fine-tuning, they can approximate human-level performance in extracting and structuring data for KGs. This study investigates LLM effectiveness for KG population, focusing on the Enslaved.org Hub Ontology. In this paper, we report that, compared to the ground truth, LLMs can extract ~90% of triples when provided with a modular ontology as guidance in the prompts.

摘要:知識圖譜 (KG) 愈來愈多用於資料整合、表示和視覺化。儘管 KG 填充至關重要,但它通常很昂貴,特別是在必須從自然語言中非結構化文字中提取資料時,這會帶來挑戰,例如歧義和複雜的詮釋。大型語言模型 (LLM) 為此類任務提供了有前景的能力,擅長自然語言理解和內容生成。然而,它們「產生幻覺」的傾向可能會產生不準確的輸出。儘管有這些限制,LLM 提供了自然語言資料的快速且可擴充處理,並且透過提示工程和微調,它們可以近似人類層級的效能,以提取和建構 KG 的資料。本研究調查 LLM 對 KG 填充的有效性,重點關注 Enslaved.org Hub Ontology。在本文中,我們報告與真實情況相比,當在提示中提供模組化本体作為指導時,LLM 可以提取約 90% 的三元組。
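
A figure such as "~90% of triples" in the abstract above is a recall-style measure: the share of ground-truth triples that also appear in the extracted set. A minimal sketch of that computation follows, assuming exact matching after case and whitespace normalization. The matching criterion and the sample triples are assumptions; the abstract does not state how matches were judged.

```python
def normalize(triple):
    """Lowercase and trim each part so trivial formatting differences do not count as misses."""
    return tuple(part.strip().lower() for part in triple)

def triple_recall(extracted, ground_truth):
    """Fraction of ground-truth triples that also appear in the extracted set."""
    gold = {normalize(t) for t in ground_truth}
    found = {normalize(t) for t in extracted}
    return len(gold & found) / len(gold) if gold else 0.0

# Hypothetical triples, for illustration only.
ground_truth = [
    ("Person_1", "hasName", "Example Name"),
    ("Person_1", "participatedIn", "Event_1"),
    ("Event_1", "occurredAt", "Place_1"),
    ("Event_1", "hasDate", "1800"),
]
extracted = [
    ("person_1", "hasName", "Example Name "),
    ("Person_1", "participatedIn", "Event_1"),
    ("Event_1", "occurredAt", "Place_1"),
    ("Event_1", "hasType", "Voyage"),  # extra triple: not in the ground truth
]

recall = triple_recall(extracted, ground_truth)
```

Extra triples do not lower recall, so a precision check would be needed alongside it to catch hallucinated output.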

Pre-trained Molecular Language Models with Random Functional Group Masking

2411.01401v1 by Tianhao Peng, Yuchen Li, Xuhong Li, Jiang Bian, Zeke Xie, Ning Sui, Shahid Mumtaz, Yanwu Xu, Linghe Kong, Haoyi Xiong

Recent advancements in computational chemistry have leveraged the power of transformer-based language models, such as MoLFormer, pre-trained on a vast number of simplified molecular-input line-entry system (SMILES) sequences, to understand and predict molecular properties and activities, a critical step in fields like drug discovery and materials science. To further improve performance, researchers have introduced graph neural networks with graph-based molecular representations, such as GEM, incorporating the topology, geometry, and 2D or even 3D structures of molecules into pre-training. Since most molecular graphs in existing studies were automatically converted from SMILES sequences, it is reasonable to assume that transformer-based language models might be able to implicitly learn structure-aware representations from SMILES sequences. In this paper, we propose MLM-FG, a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular Functional Groups to incorporate structure information of atoms during the pre-training phase. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities. Extensive experimental evaluations across 11 benchmark classification and regression tasks in the chemical domain demonstrate the robustness and superiority of MLM-FG. Our findings reveal that MLM-FG outperforms existing pre-training models, either based on SMILES or graphs, in 9 out of the 11 downstream tasks, ranking as a close second in the remaining ones.

Narrative Analysis of True Crime Podcasts With Knowledge Graph-Augmented Large Language Models

2411.02435v1 by Xinyi Leng, Jason Liang, Jack Mauro, Xu Wang, Andrea L. Bertozzi, James Chapman, Junyuan Lin, Bohan Chen, Chenchen Ye, Temple Daniel, P. Jeffrey Brantingham

Narrative data spans all disciplines and provides a coherent model of the world to the reader or viewer. Recent advances in machine learning and Large Language Models (LLMs) have enabled great strides in analyzing natural language. However, LLMs still struggle with complex narrative arcs as well as narratives containing conflicting information. Recent work indicates LLMs augmented with external knowledge bases can improve the accuracy and interpretability of the resulting models. In this work, we analyze the effectiveness of applying knowledge graphs (KGs) in understanding true-crime podcast data from both classical Natural Language Processing (NLP) and LLM approaches. We directly compare KG-augmented LLMs (KGLLMs) with classical methods for KG construction, topic modeling, and sentiment analysis. Additionally, the KGLLM allows us to query the knowledge base in natural language and test its ability to factually answer questions. We examine the robustness of the model to adversarial prompting in order to test the model's ability to deal with conflicting information. Finally, we apply classical methods to understand more subtle aspects of the text such as the use of hearsay and sentiment in narrative construction and propose future directions. Our results indicate that KGLLMs outperform LLMs on a variety of metrics, are more robust to adversarial prompts, and are more capable of summarizing the text into topics.

WLPlan: Relational Features for Symbolic Planning

2411.00577v1 by Dillon Z. Chen

Scalable learning for planning research generally involves juggling different programming languages to handle learning and planning modules effectively. Interpreted languages such as Python are commonly used for learning routines due to their ease of use and the abundance of well-maintained learning libraries available for them, while compiled languages such as C++ are used for planning routines due to their optimised resource usage. Motivated by the need for tools for developing scalable learning planners, we introduce WLPlan, a C++ package with Python bindings which implements recent promising work for automatically generating relational features of planning tasks. Such features can be used for any downstream routine, such as learning domain control knowledge or probing and understanding planning tasks. More specifically, WLPlan provides functionality for (1) transforming planning tasks into graphs, and (2) embedding planning graphs into feature vectors via graph kernels. The source code and instructions for the installation and usage of WLPlan are available at tinyurl.com/42kymswc.
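
Step (2) above, embedding a graph into a feature vector via a graph kernel, can be sketched with a generic Weisfeiler-Lehman colour refinement in plain Python. This does not use or reproduce the WLPlan API; the function names and the toy graphs are assumptions made for illustration.

```python
from collections import Counter


def wl_features(adjacency, labels, iterations=2):
    """Count every node colour seen during Weisfeiler-Lehman refinement.

    adjacency maps each node to a list of neighbours; labels maps each node
    to its initial colour. Returns a Counter from colour to occurrence count.
    """
    colours = dict(labels)
    histogram = Counter(colours.values())
    for _ in range(iterations):
        refined = {}
        for node, neighbours in adjacency.items():
            # A node's new colour is its old colour plus the sorted multiset
            # of its neighbours' colours, so it does not depend on node names.
            neighbour_colours = tuple(
                sorted((colours[n] for n in neighbours), key=repr)
            )
            refined[node] = (colours[node], neighbour_colours)
        colours = refined
        histogram.update(colours.values())
    return histogram


def to_vector(histogram, vocabulary):
    """Turn a colour histogram into a fixed-length list of counts."""
    return [histogram.get(colour, 0) for colour in vocabulary]


# Toy graphs (hypothetical): a 3-node path and a triangle, all nodes labelled "x".
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
uniform = {"a": "x", "b": "x", "c": "x"}

h_path = wl_features(path, uniform)
h_triangle = wl_features(triangle, uniform)
vocabulary = sorted(set(h_path) | set(h_triangle), key=repr)
print(to_vector(h_path, vocabulary))
print(to_vector(h_triangle, vocabulary))
```

Graphs that differ only in node names get identical histograms, while the path and the triangle are separated after one refinement round. A downstream learner would consume the fixed-length vectors.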

GRS-QA -- Graph Reasoning-Structured Question Answering Dataset

2411.00369v3 by Anish Pahilajani, Devasha Trivedi, Jincen Shuai, Khin S. Yone, Samyak Rajesh Jain, Namyong Park, Ryan A. Rossi, Nesreen K. Ahmed, Franck Dernoncourt, Yu Wang

Large Language Models (LLMs) have excelled in multi-hop question-answering (M-QA) due to their advanced reasoning abilities. However, the impact of the inherent reasoning structures on LLM M-QA performance remains unclear, largely due to the absence of QA datasets that provide fine-grained reasoning structures. To address this gap, we introduce the Graph Reasoning-Structured Question Answering Dataset (GRS-QA), which includes both semantic contexts and reasoning structures for QA pairs. Unlike existing M-QA datasets, where different reasoning structures are entangled together, GRS-QA explicitly captures intricate reasoning pathways by constructing reasoning graphs, where nodes represent textual contexts and edges denote logical flows. These reasoning graphs of different structures enable a fine-grained evaluation of LLM reasoning capabilities across various reasoning structures. Our empirical analysis reveals that LLMs perform differently when handling questions with varying reasoning structures. This finding facilitates the exploration of textual structures as compared with semantics.

Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes

2411.02523v1 by Balu Bhasuran, Qiao Jin, Yuzhang Xie, Carl Yang, Karim Hanna, Jennifer Costa, Cindy Shavor, Zhiyong Lu, Zhe He

Differential diagnosis is crucial for medicine as it helps healthcare providers systematically distinguish between conditions that share similar symptoms. This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs). Clinical vignettes were created from 50 case reports from PubMed Central, incorporating patient demographics, symptoms, and lab results. Five LLMs (GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B) were tested to generate Top 10, Top 5, and Top 1 DDx with and without lab data. A comprehensive evaluation involving GPT-4, a knowledge graph, and clinicians was conducted. GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%. Lab results significantly improved accuracy, with GPT-4 and Mixtral excelling, though exact match rates were low. Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by LLMs for differential diagnosis.
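
The Top 1, Top 5, and Top 10 figures above are top-k accuracies. A minimal Python sketch of an exact-match version follows. The function name and the toy data are assumptions for illustration, and the study's lenient scoring (judged by GPT-4, a knowledge graph, and clinicians) is not reproduced here.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of cases whose gold diagnosis appears in the first k predictions.

    Matching is exact after lower-casing and stripping whitespace.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("need one ranked list per gold label")
    if not gold_labels:
        return 0.0
    hits = 0
    for ranked, gold in zip(ranked_predictions, gold_labels):
        top_k = {p.strip().lower() for p in ranked[:k]}
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_labels)


# Hypothetical toy data, not taken from the study.
ranked = [
    ["condition a", "condition b"],
    ["condition c", "condition a"],
    ["condition b", "condition d"],
]
gold = ["Condition A", "condition a", "condition x"]

print(top_k_accuracy(ranked, gold, k=1))
print(top_k_accuracy(ranked, gold, k=2))
```

Because a larger k can only add hits, top-k accuracy never decreases as k grows, which is consistent with the Top 10 score reported above being at least the Top 1 score.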

Compositional Automata Embeddings for Goal-Conditioned Reinforcement Learning

2411.00205v1 by Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, Sanjit A. Seshia

Goal-conditioned reinforcement learning is a powerful way to control an AI agent's behavior at runtime. That said, popular goal representations, e.g., target states or natural language, are either limited to Markovian tasks or rely on ambiguous task semantics. We propose representing temporal goals using compositions of deterministic finite automata (cDFAs) and use cDFAs to guide RL agents. cDFAs balance the need for formal temporal semantics with ease of interpretation: if one can understand a flow chart, one can understand a cDFA. On the other hand, cDFAs form a countably infinite concept class with Boolean semantics, and subtle changes to the automaton can result in very different tasks, making them difficult to condition agent behavior on. To address this, we observe that all paths through a DFA correspond to a series of reach-avoid tasks and propose pre-training graph neural network embeddings on "reach-avoid derived" DFAs. Through empirical evaluation, we demonstrate that the proposed pre-training method enables zero-shot generalization to various cDFA task classes and accelerated policy specialization without the myopic suboptimality of hierarchical methods.
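
The observation that a path through a DFA corresponds to a series of reach-avoid tasks can be made concrete with a short Python sketch. It shows one reading of that idea, not the authors' construction: at each step, "reach" collects the symbols that advance along the path and "avoid" collects the symbols that would leave it. The DFA encoding and the toy task are assumptions for illustration.

```python
def reach_avoid_sequence(transitions, path):
    """Turn a state path through a DFA into a list of reach-avoid tasks.

    transitions maps (state, symbol) to the next state. For each consecutive
    pair (current, nxt) on the path:
      reach = symbols that move current to nxt
      avoid = symbols that move current to any state other than current or nxt
    Symbols that self-loop on current are neither reached nor avoided.
    """
    tasks = []
    for current, nxt in zip(path, path[1:]):
        reach = sorted(
            s for (q, s), t in transitions.items() if q == current and t == nxt
        )
        avoid = sorted(
            s for (q, s), t in transitions.items()
            if q == current and t not in (current, nxt)
        )
        if not reach:
            raise ValueError(f"no transition from {current!r} to {nxt!r}")
        tasks.append({"reach": reach, "avoid": avoid})
    return tasks


# Hypothetical task: pick up the key, then open the door, never touch lava.
# States: 0 = start, 1 = has key, 2 = accepting, 3 = failure sink.
transitions = {
    (0, "key"): 1, (0, "door"): 0, (0, "lava"): 3,
    (1, "key"): 1, (1, "door"): 2, (1, "lava"): 3,
}

for task in reach_avoid_sequence(transitions, [0, 1, 2]):
    print(task)
```

For the path 0, 1, 2 this yields "reach key while avoiding lava" followed by "reach door while avoiding lava".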

Building Multi-Agent Copilot towards Autonomous Agricultural Data Management and Analysis

2411.00188v1 by Yu Pan, Jianxin Sun, Hongfeng Yu, Joe Luck, Geng Bai, Nipuna Chamara, Yufeng Ge, Tala Awada

Current agricultural data management and analysis paradigms are to a large extent traditional: data collecting, curating, integration, loading, storing, sharing and analyzing still involve too much human effort and know-how. Experts, researchers and farm operators need to understand the data and the whole data management pipeline to make full use of the data. The essential problem of the traditional paradigm is the lack of a layer of orchestrational intelligence that can understand, organize and coordinate the data processing utilities to maximize data management and analysis outcomes. The emerging reasoning and tool-mastering abilities of large language models (LLMs) make them a potentially good fit for this role, enabling a shift from the traditional user-driven paradigm to an AI-driven paradigm. In this paper, we propose and explore the idea of an LLM-based copilot for autonomous agricultural data management and analysis. Based on our previously developed platform of Agricultural Data Management and Analytics (ADMA), we build a proof-of-concept multi-agent system called ADMA Copilot, which can understand the user's intent, make plans for the data processing pipeline and accomplish tasks automatically, and in which three agents collaborate: an LLM-based controller, an input formatter and an output formatter. Unlike existing LLM-based solutions, our work defines a meta-program graph to decouple control flow and data flow, enhancing the predictability of the agents' behaviour. Experiments demonstrate the intelligence, autonomy, efficacy, efficiency, extensibility, flexibility and privacy of our system. A comparison between our system and existing systems is also made to show the superiority and potential of our system.

Exploring the Knowledge Mismatch Hypothesis: Hallucination Propensity in Small Models Fine-tuned on Data from Larger Models

2411.00878v1 by Phil Wee, Riyadh Baghdadi

Recently, there has been an explosion of large language models created through fine-tuning with data from larger models. These small models are able to produce outputs that appear qualitatively similar to those of significantly larger models. However, one of the key limitations that have been observed with these models is their propensity to hallucinate significantly more often than larger models. In particular, they have been observed to generate coherent outputs that involve factually incorrect information and spread misinformation, toxicity, and stereotypes. There are many potential causes of hallucination; one hypothesis is that fine-tuning a model on data produced by a larger model leads to a knowledge mismatch which contributes to hallucination. In particular, it is hypothesized that there is a mismatch between the knowledge that is fed to the model to fine-tune it and the knowledge that is already present in the graph. Fine-tuning the model on data that has such a mismatch could contribute to an increased propensity to hallucinate. We show that on an unseen test set, a smaller model fine-tuned on data generated from a larger model produced more wrong answers when compared to models fine-tuned on data created by the small model, which confirms the hypothesis.

Failure Modes of LLMs for Causal Reasoning on Narratives

2410.23884v1 by Khurram Yamin, Shantanu Gupta, Gaurav R. Ghosal, Zachary C. Lipton, Bryan Wilder

In this work, we investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge. For example, LLMs tend to determine causal relationships based on the topological ordering of events (i.e., earlier events cause later ones), resulting in lower performance whenever events are not narrated in their exact causal order. Similarly, we demonstrate that LLMs struggle with long-term causal reasoning and often fail when the narratives are long and contain many events. Additionally, we show LLMs appear to rely heavily on their parametric knowledge at the expense of reasoning over the provided narrative. This degrades their abilities whenever the narrative opposes parametric knowledge. We extensively validate these failure modes through carefully controlled synthetic experiments, as well as evaluations on real-world narratives. Finally, we observe that explicitly generating a causal graph generally improves performance while naive chain-of-thought is ineffective. Collectively, our results distill precise failure modes of current state-of-the-art models and can pave the way for future techniques to enhance causal reasoning in LLMs.

Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs

2410.23875v1 by Liyi Chen, Panrong Tong, Zhongming Jin, Ying Sun, Jieping Ye, Hui Xiong

Large Language Models (LLMs) have shown remarkable reasoning capabilities on complex tasks, but they still suffer from out-of-date knowledge, hallucinations, and opaque decision-making. In contrast, Knowledge Graphs (KGs) can provide explicit and editable knowledge for LLMs to alleviate these issues. The existing paradigm of KG-augmented LLMs manually predefines the breadth of the exploration space and requires flawless navigation in KGs. However, this paradigm cannot adaptively explore reasoning paths in KGs based on the question semantics or self-correct erroneous reasoning paths, resulting in a bottleneck in efficiency and effectiveness. To address these limitations, we propose a novel self-correcting adaptive planning paradigm for KG-augmented LLMs named Plan-on-Graph (PoG), which first decomposes the question into several sub-objectives and then repeats the process of adaptively exploring reasoning paths, updating memory, and reflecting on the need to self-correct erroneous reasoning paths until arriving at the answer. Specifically, three important mechanisms of Guidance, Memory, and Reflection are designed to work together to guarantee the adaptive breadth of self-correcting planning for graph reasoning. Finally, extensive experiments on three real-world datasets demonstrate the effectiveness and efficiency of PoG.

LLaMo: Large Language Model-based Molecular Graph Assistant

2411.00871v1 by Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim

Large Language Models (LLMs) have demonstrated remarkable generalization and instruction-following capabilities with instruction tuning. The advancements in LLMs and instruction tuning have led to the development of Large Vision-Language Models (LVLMs). However, the competency of LLMs and instruction tuning has been less explored in the molecular domain. Thus, we propose LLaMo: Large Language Model-based Molecular graph assistant, which is an end-to-end trained large molecular graph-language model. To bridge the discrepancy between the language and graph modalities, we present the multi-level graph projector that transforms graph representations into graph tokens by abstracting the output representations of each GNN layer and motif representations with the cross-attention mechanism. We also introduce machine-generated molecular graph instruction data to instruction-tune the large molecular graph-language model for general-purpose molecule and language understanding. Our extensive experiments demonstrate that LLaMo shows the best performance on diverse tasks, such as molecular description generation, property prediction, and IUPAC name prediction. The code of LLaMo is available at https://github.com/mlvlab/LLaMo.

End-to-End Ontology Learning with Large Language Models

2410.23584v1 by Andy Lo, Albert Q. Jiang, Wenda Li, Mateja Jamnik

Ontologies are useful for automatic machine processing of domain knowledge as they represent it in a structured format. Yet, constructing ontologies requires substantial manual effort. To automate part of this process, large language models (LLMs) have been applied to solve various subtasks of ontology learning. However, this partial ontology learning does not capture the interactions between subtasks. We address this gap by introducing OLLM, a general and scalable method for building the taxonomic backbone of an ontology from scratch. Rather than focusing on subtasks, like individual relations between entities, we model entire subcomponents of the target ontology by finetuning an LLM with a custom regulariser that reduces overfitting on high-frequency concepts. We introduce a novel suite of metrics for evaluating the quality of the generated ontology by measuring its semantic and structural similarity to the ground truth. In contrast to standard metrics, our metrics use deep learning techniques to define more robust distance measures between graphs. Both our quantitative and qualitative results on Wikipedia show that OLLM outperforms subtask composition methods, producing more semantically accurate ontologies while maintaining structural integrity. We further demonstrate that our model can be effectively adapted to new domains, like arXiv, needing only a small number of training examples. Our source code and datasets are available at https://github.com/andylolu2/ollm.

Graph-Augmented Relation Extraction Model with LLMs-Generated Support Document

2410.23452v1 by Vicky Dong, Hao Yu, Yao Chen

This study introduces a novel approach to sentence-level relation extraction (RE) that integrates Graph Neural Networks (GNNs) with Large Language Models (LLMs) to generate contextually enriched support documents. By harnessing the power of LLMs to generate auxiliary information, our approach crafts an intricate graph representation of textual data. This graph is subsequently processed through a GNN to refine and enrich the embeddings associated with each entity, ensuring a more nuanced and interconnected understanding of the data. This methodology addresses the limitations of traditional sentence-level RE models by incorporating broader contexts and leveraging inter-entity interactions, thereby improving the model's ability to capture complex relationships across sentences. Our experiments, conducted on the CrossRE dataset, demonstrate the effectiveness of our approach, with notable improvements in performance across various domains. The results underscore the potential of combining GNNs with LLM-generated context to advance the field of relation extraction.

FlowLLM: Flow Matching for Material Generation with Large Language Models as Base Distributions

2410.23405v1 by Anuroop Sriram, Benjamin Kurt Miller, Ricky T. Q. Chen, Brandon M. Wood

Material discovery is a critical area of research with the potential to revolutionize various fields, including carbon capture, renewable energy, and electronics. However, the immense scale of the chemical space makes it challenging to explore all possible materials experimentally. In this paper, we introduce FlowLLM, a novel generative model that combines large language models (LLMs) and Riemannian flow matching (RFM) to design novel crystalline materials. FlowLLM first fine-tunes an LLM to learn an effective base distribution of meta-stable crystals in a text representation. After converting to a graph representation, the RFM model takes samples from the LLM and iteratively refines the coordinates and lattice parameters. Our approach significantly outperforms state-of-the-art methods, increasing the generation rate of stable materials by over three times and increasing the rate for stable, unique, and novel crystals by about 50%, a huge improvement on a difficult problem. Additionally, the crystals generated by FlowLLM are much closer to their relaxed state when compared with another leading model, significantly reducing post-hoc computational cost.

EMMA: End-to-End Multimodal Model for Autonomous Driving

2410.23262v2 by Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, Yin Zhou, James Guo, Dragomir Anguelov, Mingxing Tan

We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar, and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.

ProTransformer: Robustify Transformers via Plug-and-Play Paradigm

2410.23182v1 by Zhichao Hou, Weizhi Gao, Yuchen Shen, Feiyi Wang, Xiaorui Liu

Transformer-based architectures have dominated various areas of machine learning in recent years. In this paper, we introduce a novel robust attention mechanism designed to enhance the resilience of transformer-based architectures. Crucially, this technique can be integrated into existing transformers as a plug-and-play layer, improving their robustness without the need for additional training or fine-tuning. Through comprehensive experiments and ablation studies, we demonstrate that our ProTransformer significantly enhances the robustness of transformer models across a variety of prediction tasks, attack mechanisms, backbone architectures, and data domains. Notably, without further fine-tuning, the ProTransformer consistently improves the performance of vanilla transformers by 19.5%, 28.3%, 16.1%, and 11.4% for BERT, ALBERT, DistilBERT, and RoBERTa, respectively, under the classical TextFooler attack. Furthermore, ProTransformer shows promising resilience in large language models (LLMs) against prompting-based attacks, improving the performance of T5 and LLaMA by 24.8% and 17.8%, respectively, and enhancing Vicuna by an average of 10.4% against the Jailbreaking attack. Beyond the language domain, ProTransformer also demonstrates outstanding robustness in both vision and graph domains.

Semantic Enrichment of the Quantum Cascade Laser Properties in Text- A Knowledge Graph Generation Approach

2410.22996v1 by Deperias Kerre, Anne Laurent, Kenneth Maussang, Dickson Owuor

A well-structured collection of data on the various Quantum Cascade Laser (QCL) design and working properties provides a platform to analyze and understand the relationships between these properties. By analyzing these relationships, we can gain insights into how different design features impact laser performance properties such as the working temperature. Most of these QCL properties are captured in scientific text. There is therefore a need for efficient methodologies that can extract QCL properties from text and generate a semantically enriched and interlinked platform where the properties can be analyzed to uncover hidden relations. There is also a need to maintain the provenance and reference information on which these properties are based. Semantic Web technologies such as ontologies and Knowledge Graphs have proven capable of providing interlinked data platforms for knowledge representation in various domains. In this paper, we propose an approach for generating a QCL properties Knowledge Graph (KG) from text for semantic enrichment of the properties. The approach is based on the QCL ontology and a Retrieval Augmented Generation (RAG) enabled information extraction pipeline built on the GPT-4 Turbo language model. The properties of interest include working temperature, laser design type, lasing frequency, laser optical power and the heterostructure. The experimental results demonstrate the feasibility and effectiveness of this approach for efficiently extracting QCL properties from unstructured text and generating a QCL properties Knowledge Graph, which has potential applications in semantic enrichment and analysis of QCL data.

How Well Do Large Language Models Disambiguate Swedish Words?

2410.22827v1 by Richard Johansson

We evaluate a battery of recent large language models on two benchmarks for word sense disambiguation in Swedish. At present, all models are less accurate than the best supervised disambiguators in cases where a training set is available, but most models outperform graph-based unsupervised systems. Different prompting approaches are compared, with a focus on how to express the set of possible senses in a given context. The best accuracies are achieved when human-written definitions of the senses are included in the prompts.

Beyond Ontology in Dialogue State Tracking for Goal-Oriented Chatbot

2410.22767v1 by Sejin Lee, Dongha Kim, Min Song

Goal-oriented chatbots are essential for automating user tasks, such as booking flights or making restaurant reservations. A key component of these systems is Dialogue State Tracking (DST), which interprets user intent and maintains the dialogue state. However, existing DST methods often rely on fixed ontologies and manually compiled slot values, limiting their adaptability to open-domain dialogues. We propose a novel approach that leverages instruction tuning and advanced prompt strategies to enhance DST performance without relying on any predefined ontologies. Our method enables a Large Language Model (LLM) to infer dialogue states through carefully designed prompts and includes an anti-hallucination mechanism to ensure accurate tracking in diverse conversation contexts. Additionally, we employ a Variational Graph Auto-Encoder (VGAE) to model and predict subsequent user intent. Our approach achieved state-of-the-art performance with a joint goal accuracy (JGA) of 42.57%, outperforming existing ontology-less DST models, and performed well in open-domain real-world conversations. This work presents a significant advancement in creating more adaptive and accurate goal-oriented chatbots.
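
The JGA figure quoted in this abstract is a strict, turn-level metric: a turn counts as correct only if every slot in the predicted dialogue state matches the reference. The sketch below illustrates that standard definition on made-up data; the slot names and the `joint_goal_accuracy` helper are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# A dialogue state maps slot names to values (slot names here are invented).
DialogueState = Dict[str, str]


def joint_goal_accuracy(predicted: List[DialogueState],
                        gold: List[DialogueState]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same turns")
    if not gold:
        return 0.0
    exact_matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return exact_matches / len(gold)


gold_states = [
    {"restaurant-area": "centre"},
    {"restaurant-area": "centre", "restaurant-food": "italian"},
    {"restaurant-area": "centre", "restaurant-food": "italian",
     "restaurant-people": "4"},
]
predicted_states = [
    {"restaurant-area": "centre"},
    # One wrong slot makes the whole turn count as incorrect.
    {"restaurant-area": "centre", "restaurant-food": "indian"},
    {"restaurant-area": "centre", "restaurant-food": "italian",
     "restaurant-people": "4"},
]

jga = joint_goal_accuracy(predicted_states, gold_states)
print(f"JGA = {jga:.2%}")  # JGA = 66.67%
```

Because a single wrong slot invalidates the turn, JGA values are typically much lower than per-slot accuracy.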

The Graph's Apprentice: Teaching an LLM Low Level Knowledge for Circuit Quality Estimation

2411.00843v1 by Reza Moravej, Saurabh Bodhe, Zhanguang Zhang, Didier Chetelat, Dimitrios Tsaras, Yingxue Zhang, Hui-Ling Zhen, Jianye Hao, Mingxuan Yuan

Logic synthesis is a crucial phase in the circuit design process, responsible for transforming hardware description language (HDL) designs into optimized netlists. However, traditional logic synthesis methods are computationally intensive, restricting their iterative use in refining chip designs. Recent advancements in large language models (LLMs), particularly those fine-tuned on programming languages, present a promising alternative. In this paper, we introduce VeriDistill, the first end-to-end machine learning model that directly processes raw Verilog code to predict circuit quality-of-result metrics. Our model employs a novel knowledge distillation method, transferring low-level circuit insights via graphs into the LLM-based predictor. Experiments show VeriDistill outperforms state-of-the-art baselines on large-scale Verilog datasets and demonstrates robust performance when evaluated on out-of-distribution datasets.

Are Large-Language Models Graph Algorithmic Reasoners?

2410.22597v1 by Alexander K Taylor, Anthony Cuturrufo, Vishal Yathish, Mingyu Derek Ma, Wei Wang

We seek to address a core challenge facing current Large Language Models (LLMs). LLMs have demonstrated superior performance in many tasks, yet continue to struggle with reasoning problems on explicit graphs that require multiple steps. To address this gap, we introduce a novel benchmark designed to evaluate LLM performance on classical algorithmic reasoning tasks on explicit graphs. Our benchmark encompasses five fundamental algorithms: Breadth-First Search (BFS) and Depth-First Search (DFS) for connectivity, Dijkstra's algorithm and Floyd-Warshall algorithm for all nodes shortest path, and Prim's Minimum Spanning Tree (MST-Prim's) algorithm. Through extensive experimentation, we assess the capabilities of state-of-the-art LLMs in executing these algorithms step-by-step and systematically evaluate their performance at each stage. Our findings highlight the persistent challenges LLMs face in this domain and underscore the necessity for advanced prompting techniques and algorithmic instruction to enhance their graph reasoning abilities. This work presents MAGMA, the first comprehensive benchmark focused on LLMs completing classical graph algorithms, and provides a critical step toward understanding and improving their structured problem-solving skills.
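
To make "executing these algorithms step-by-step" concrete, here is a minimal sketch of a breadth-first search that records its intermediate state after each step, which is the kind of trace a model's output could be checked against. The trace format, the toy graph, and the `bfs_trace` name are assumptions for illustration; the benchmark's actual format is not described in this abstract.

```python
from collections import deque
from typing import Dict, List, Tuple


def bfs_trace(graph: Dict[str, List[str]],
              source: str) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Run BFS from `source`; return the visit order and a per-step trace.

    Each trace entry is (node just expanded, queue contents afterwards).
    """
    seen = {source}
    order = [source]
    queue = deque([source])
    steps = []
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
        steps.append((node, list(queue)))
    return order, steps


# Toy undirected graph as adjacency lists; "E" is isolated.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
    "E": [],
}

order, steps = bfs_trace(toy_graph, "A")
print(order)  # ['A', 'B', 'C', 'D']
for expanded, queue_after in steps:
    print(expanded, queue_after)
```

Nodes missing from `order` (here `E`) are not connected to the source, which is how BFS answers a connectivity query.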

Advancing Agentic Systems: Dynamic Task Decomposition, Tool Integration and Evaluation using Novel Metrics and Dataset

2410.22457v1 by Adrian Garret Gabriel, Alaa Alameer Ahmad, Shankar Kumar Jeyakumar

Advancements in Large Language Models (LLMs) are revolutionizing the development of autonomous agentic systems by enabling dynamic, context-aware task decomposition and automated tool selection. These sophisticated systems possess significant automation potential across various industries, managing complex tasks, interacting with external systems to enhance knowledge, and executing actions independently. This paper presents three primary contributions to advance this field:

- Advanced Agentic Framework: A system that handles multi-hop queries, generates and executes task graphs, selects appropriate tools, and adapts to real-time changes.
- Novel Evaluation Metrics: Introduction of Node F1 Score, Structural Similarity Index (SSI), and Tool F1 Score to comprehensively assess agentic systems.
- Specialized Dataset: Development of an AsyncHow-based dataset for analyzing agent behavior across different task complexities.

Our findings reveal that asynchronous and dynamic task graph decomposition significantly enhances system responsiveness and scalability, particularly for complex, multi-step tasks. Detailed analysis shows that structural and node-level metrics are crucial for sequential tasks, while tool-related metrics are more important for parallel tasks. Specifically, the Structural Similarity Index (SSI) is the most significant predictor of performance in sequential tasks, and the Tool F1 Score is essential for parallel tasks. These insights highlight the need for balanced evaluation methods that capture both structural and operational dimensions of agentic systems. Additionally, our evaluation framework, validated through empirical analysis and statistical testing, provides valuable insights for improving the adaptability and reliability of agentic systems in dynamic environments.

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

2411.00836v1 by Chengke Zou, Xingang Guo, Rui Yang, Junyu Zhang, Bin Hu, Huan Zhang

The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that SOTA VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the mathematical reasoning robustness in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of VLMs. DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program. Those programs are carefully designed and annotated to enable the automatic generation of a much larger set of concrete questions, including many different types of visual and textual variations. DynaMath allows us to evaluate the generalization ability of VLMs, by assessing their performance under varying input conditions of a seed question. We evaluated 14 SOTA VLMs with 5,010 generated concrete questions. Our results show that the worst-case model accuracy, defined as the percentage of correctly answered seed questions in all 10 variants, is significantly lower than the average-case accuracy. Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and DynaMath provides valuable insights to guide the development of more reliable models for mathematical reasoning.
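
The gap between the two accuracy notions in this abstract can be seen on a toy example. The sketch below follows the definition given above: worst-case accuracy counts a seed question only if all of its variants are answered correctly, while average-case accuracy averages over every generated question. The result table is invented for illustration and does not reproduce any figures from the paper.

```python
from typing import Dict, List

# Hypothetical results: seed question id -> correctness of each of 10 variants.
results: Dict[str, List[bool]] = {
    "seed_1": [True] * 10,
    "seed_2": [True] * 9 + [False],  # a single failed variant
    "seed_3": [False] * 10,
    "seed_4": [True] * 10,
}


def average_case_accuracy(res: Dict[str, List[bool]]) -> float:
    """Share of all generated concrete questions answered correctly."""
    total = sum(len(variants) for variants in res.values())
    correct = sum(sum(variants) for variants in res.values())
    return correct / total


def worst_case_accuracy(res: Dict[str, List[bool]]) -> float:
    """Share of seed questions answered correctly in every variant."""
    return sum(all(variants) for variants in res.values()) / len(res)


avg_acc = average_case_accuracy(results)
worst_acc = worst_case_accuracy(results)
print(f"average-case: {avg_acc:.3f}")  # average-case: 0.725
print(f"worst-case:   {worst_acc:.3f}")  # worst-case:   0.500
```

Here `seed_2` contributes nine correct answers to the average but nothing to the worst case, which is why the worst-case figure can never exceed the average-case one.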

ADAM: An Embodied Causal Agent in Open-World Environments

2410.22194v1 by Shu Yu, Chaochao Lu

In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their interpretability and generalization capability. To this end, we introduce ADAM, An emboDied causal Agent in Minecraft, that can autonomously navigate the open world, perceive multimodal contexts, learn causal world knowledge, and tackle complex tasks through lifelong learning. ADAM is empowered by four key components: 1) an interaction module, enabling the agent to execute actions while documenting the interaction processes; 2) a causal model module, tasked with constructing an ever-growing causal graph from scratch, which enhances interpretability and diminishes reliance on prior knowledge; 3) a controller module, comprising a planner, an actor, and a memory pool, which uses the learned causal graph to accomplish tasks; 4) a perception module, powered by multimodal large language models, which enables ADAM to perceive like a human player. Extensive experiments show that ADAM constructs an almost perfect causal graph from scratch, enabling efficient task decomposition and execution with strong interpretability. Notably, in our modified Minecraft games where no prior knowledge is available, ADAM maintains its performance and shows remarkable robustness and generalization capability. ADAM pioneers a novel paradigm that integrates causal methods and embodied agents in a synergistic manner. Our project page is at https://opencausalab.github.io/ADAM.

Synergizing LLM Agents and Knowledge Graph for Socioeconomic Prediction in LBSN

2411.00028v1 by Zhilun Zhou, Jingyang Fan, Yu Liu, Fengli Xu, Depeng Jin, Yong Li

The fast development of location-based social networks (LBSNs) has led to significant changes in society, resulting in popular studies of using LBSN data for socioeconomic prediction, e.g., regional population and commercial activity estimation. Existing studies design various graphs to model heterogeneous LBSN data, and further apply graph representation learning methods for socioeconomic prediction. However, these approaches heavily rely on heuristic ideas and expertise to extract task-relevant knowledge from diverse data, which may not be optimal for specific tasks. Additionally, they tend to overlook the inherent relationships between different indicators, limiting the prediction accuracy. Motivated by the remarkable abilities of large language models (LLMs) in commonsense reasoning, embedding, and multi-agent collaboration, in this work, we synergize LLM agents and knowledge graph for socioeconomic prediction. We first construct a location-based knowledge graph (LBKG) to integrate multi-sourced LBSN data. Then we leverage the reasoning power of LLM agent to identify relevant meta-paths in the LBKG for each type of socioeconomic prediction task, and design a semantic-guided attention module for knowledge fusion with meta-paths. Moreover, we introduce a cross-task communication mechanism to further enhance performance by enabling knowledge sharing across tasks at both LLM agent and KG levels. On the one hand, the LLM agents for different tasks collaborate to generate more diverse and comprehensive meta-paths. On the other hand, the embeddings from different tasks are adaptively merged for better socioeconomic prediction. Experiments on two datasets demonstrate the effectiveness of the synergistic design between LLM and KG, providing insights for information sharing across socioeconomic prediction tasks.

A Hierarchical Language Model For Interpretable Graph Reasoning

2410.22372v1 by Sambhav Khurana, Xiner Li, Shurui Gui, Shuiwang Ji

Large language models (LLMs) are being increasingly explored for graph tasks. Despite their remarkable success in text-based tasks, LLMs' capabilities in understanding explicit graph structures remain limited, particularly with large graphs. In this work, we introduce Hierarchical Language Model for Graph (HLM-G), which employs a two-block architecture to capture node-centric local information and interaction-centric global structure, effectively enhancing graph structure understanding abilities. The proposed scheme allows LLMs to address various graph queries with high efficacy, efficiency, and robustness, while reducing computational costs on large-scale graph tasks. Furthermore, we demonstrate the interpretability of our model using intrinsic attention weights and established explainers. Comprehensive evaluations across diverse graph reasoning and real-world tasks of node, link, and graph-levels highlight the superiority of our method, marking a significant advancement in the application of LLMs to graph understanding.

LLM-Forest for Health Tabular Data Imputation

2410.21520v1 by Xinrui He, Yikun Ban, Jiaru Zou, Tianxin Wei, Curtiss B. Cook, Jingrui He

Missing data imputation is a critical challenge in tabular datasets, especially in healthcare, where data completeness is vital for accurate analysis. Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation, making them a promising tool for tabular data imputation. However, challenges persist in designing effective prompts for a finetuning-free process and in mitigating the risk of LLM hallucinations. To address these issues, we propose a novel framework, LLM-Forest, which introduces a "forest" of few-shot learning LLM "trees" with confidence-based weighted voting. This framework is established on a new concept of bipartite information graphs to identify high-quality relevant neighboring entries with both feature and value granularity. Extensive experiments on four real-world healthcare datasets demonstrate the effectiveness and efficiency of LLM-Forest.
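
Confidence-based weighted voting can be sketched in a few lines: each "tree" proposes a value for the missing cell together with a confidence, and the value with the largest total confidence wins. This is a generic illustration of the voting idea only; the `weighted_vote` helper, the candidate values, and the confidence scores are hypothetical, and the paper's actual aggregation rule may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(proposals: List[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Pick the value with the highest summed confidence.

    `proposals` holds one (imputed value, confidence in [0, 1]) pair per tree.
    """
    if not proposals:
        raise ValueError("at least one proposal is required")
    totals: Dict[str, float] = defaultdict(float)
    for value, confidence in proposals:
        totals[value] += confidence
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


# Four hypothetical trees imputing the same missing entry.
tree_outputs = [("120", 0.9), ("135", 0.6), ("135", 0.5), ("120", 0.1)]
winner, totals = weighted_vote(tree_outputs)
print(winner)  # 135
```

In this toy case both values receive two votes, so a plain majority vote would tie; the summed confidences (about 1.1 versus 1.0) break the tie.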

Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce

2410.21237v1 by Zhantao Yang, Han Zhang, Fangyi Chen, Anudeepsekhar Bolimera, Marios Savvides

Knowledge graphs (KGs) are playing an increasingly important role in various AI systems. For e-commerce, an efficient and low-cost automated knowledge graph construction method is the foundation for enabling various successful downstream applications. In this paper, we propose a novel method for constructing structured product knowledge graphs from raw product images. The method cooperatively leverages recent advances in vision-language models (VLMs) and large language models (LLMs), fully automating the process and allowing timely graph updates. We also present a human-annotated e-commerce product dataset for benchmarking product property extraction in knowledge graph construction. Our method outperforms our baseline in all metrics and evaluated properties, demonstrating its effectiveness and strong potential for practical use.

CRAT: A Multi-Agent Framework for Causality-Enhanced Reflective and Retrieval-Augmented Translation with Large Language Models

2410.21067v1 by Meiqi Chen, Fandong Meng, Yingxue Zhang, Yan Zhang, Jie Zhou

Large language models (LLMs) have shown great promise in machine translation, but they still struggle with contextually dependent terms, such as new or domain-specific words. This leads to inconsistencies and errors that are difficult to address. Existing solutions often depend on manual identification of such terms, which is impractical given the complexity and evolving nature of language. While Retrieval-Augmented Generation (RAG) could provide some assistance, its application to translation is limited by issues such as hallucinations from information overload. In this paper, we propose CRAT, a novel multi-agent translation framework that leverages RAG and causality-enhanced self-reflection to address these challenges. This framework consists of several specialized agents: the Unknown Terms Identification agent detects unknown terms within the context, the Knowledge Graph (KG) Constructor agent extracts relevant internal knowledge about these terms and retrieves bilingual information from external sources, the Causality-enhanced Judge agent validates the accuracy of the information, and the Translator agent incorporates the refined information into the final output. This automated process allows for more precise and consistent handling of key terms during translation. Our results show that CRAT significantly improves translation accuracy, particularly in handling context-sensitive terms and emerging vocabulary.

CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity

2410.21060v1 by Yutong Cheng, Osama Bajaber, Saimon Amanuel Tsegai, Dawn Song, Peng Gao

Textual descriptions in cyber threat intelligence (CTI) reports, such as security articles and news, are rich sources of knowledge about cyber threats, crucial for organizations to stay informed about the rapidly evolving threat landscape. However, current CTI extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. Syntax parsing relies on fixed rules and dictionaries, while model fine-tuning requires large annotated datasets, making both paradigms challenging to adapt to new threats and ontologies. To bridge the gap, we propose CTINexus, a novel framework leveraging optimized in-context learning (ICL) of large language models (LLMs) for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction. Unlike existing methods, CTINexus requires neither extensive data nor parameter tuning and can adapt to various ontologies with minimal annotated examples. This is achieved through (1) a carefully designed automatic prompt construction strategy with optimal demonstration retrieval for extracting a wide range of cybersecurity entities and relations; (2) a hierarchical entity alignment technique that canonicalizes the extracted knowledge and removes redundancy; (3) an ICL-enhanced long-distance relation prediction technique to further complete the CSKG with missing links. Our extensive evaluations using 150 real-world CTI reports collected from 10 platforms demonstrate that CTINexus significantly outperforms existing methods in constructing accurate and complete CSKGs, highlighting its potential to transform CTI analysis with an efficient and adaptable solution for the dynamic threat landscape.

Graph-based Uncertainty Metrics for Long-form Language Model Outputs

2410.20783v1 by Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, Tatsunori Hashimoto

Recent advancements in Large Language Models (LLMs) have significantly improved text generation capabilities, but these systems are still known to hallucinate, and granular uncertainty estimation for long-form LLM generations remains challenging. In this work, we propose Graph Uncertainty -- which represents the relationship between LLM generations and claims within them as a bipartite graph and estimates the claim-level uncertainty with a family of graph centrality metrics. Under this view, existing uncertainty estimation methods based on the concept of self-consistency can be viewed as using degree centrality as an uncertainty measure, and we show that more sophisticated alternatives such as closeness centrality provide consistent gains at claim-level uncertainty estimation. Moreover, we present uncertainty-aware decoding techniques that leverage both the graph structure and uncertainty estimates to improve the factuality of LLM generations by preserving only the most reliable claims. Compared to existing methods, our graph-based uncertainty metrics lead to an average of 6.8% relative gains on AUPRC across various long-form generation settings, and our end-to-end system provides consistent 2-4% gains in factuality over existing decoding techniques while significantly improving the informativeness of generated responses.
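
The bipartite view described here can be illustrated with a small, dependency-free sketch: sampled responses and claims are the two node sets, an edge means a response supports a claim, and a claim's centrality serves as a confidence score. The toy graph, the node names, and the plain (n - 1) / (sum of distances) closeness formula are assumptions for illustration; the paper's exact graph construction and metric variants are not given in this abstract.

```python
from collections import deque
from typing import Dict, List

# Hypothetical support edges: sampled response -> claims it supports.
supports: Dict[str, List[str]] = {
    "r1": ["c1", "c2"],
    "r2": ["c1", "c3"],
    "r3": ["c1"],
    "r4": ["c3", "c4"],
}

# Build an undirected adjacency list over responses and claims.
adjacency: Dict[str, List[str]] = {}
for response, claims in supports.items():
    for claim in claims:
        adjacency.setdefault(response, []).append(claim)
        adjacency.setdefault(claim, []).append(response)


def degree(node: str) -> int:
    """For a claim, the number of responses that support it."""
    return len(adjacency[node])


def closeness(node: str) -> float:
    """(reachable nodes - 1) / (sum of shortest-path distances), via BFS."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


claim_degree = {c: degree(c) for c in ["c1", "c2", "c3", "c4"]}
claim_closeness = {c: closeness(c) for c in ["c1", "c2", "c3", "c4"]}
print(claim_degree)
print({c: round(v, 3) for c, v in claim_closeness.items()})
```

Claims `c2` and `c4` are each supported by one response, so degree cannot tell them apart, but `c2` co-occurs with the widely supported `c1` and gets a slightly higher closeness score. That is the sort of extra signal a centrality measure beyond degree can add.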

Plan$\times$RAG: Planning-guided Retrieval Augmented Generation

2410.20753v1 by Prakhar Verma, Sukruta Prakash Midigeshi, Gaurav Sinha, Arno Solin, Nagarajan Natarajan, Amit Sharma

We introduce Planning-guided Retrieval Augmented Generation (Plan$\times$RAG), a novel framework that augments the "retrieve-then-reason" paradigm of existing RAG frameworks to "plan-then-retrieve". Plan$\times$RAG formulates a reasoning plan as a directed acyclic graph (DAG), decomposing queries into interrelated atomic sub-queries. Answer generation follows the DAG structure, allowing significant gains in efficiency through parallelized retrieval and generation. While state-of-the-art RAG solutions require extensive data generation and fine-tuning of language models (LMs), Plan$\times$RAG incorporates frozen LMs as plug-and-play experts to generate high-quality answers. Compared to existing RAG solutions, Plan$\times$RAG demonstrates significant improvements in reducing hallucinations and bolstering attribution due to its structured sub-query decomposition. Overall, Plan$\times$RAG offers a new perspective on integrating external knowledge in LMs while ensuring attribution by design, contributing towards more reliable LM-based systems.
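
Why a DAG-shaped plan permits parallel retrieval can be shown with a short scheduling sketch: sub-queries whose parents are all answered form a batch that can run concurrently. The plan below and the `parallel_batches` helper are hypothetical; this is generic level-by-level topological ordering, not the framework's actual planner or executor.

```python
from typing import Dict, List


def parallel_batches(dependencies: Dict[str, List[str]]) -> List[List[str]]:
    """Group DAG nodes into batches; each batch depends only on earlier ones.

    `dependencies` maps every sub-query id to the ids it must wait for.
    """
    remaining = {node: set(parents) for node, parents in dependencies.items()}
    batches: List[List[str]] = []
    while remaining:
        ready = sorted(node for node, parents in remaining.items() if not parents)
        if not ready:
            raise ValueError("cycle detected: the plan must be a DAG")
        batches.append(ready)
        for node in ready:
            del remaining[node]
        for parents in remaining.values():
            parents.difference_update(ready)
    return batches


# Hypothetical plan: q1 and q2 are independent atomic sub-queries,
# q3 combines their answers, and q4 builds on q3.
plan = {"q1": [], "q2": [], "q3": ["q1", "q2"], "q4": ["q3"]}

batches = parallel_batches(plan)
print(batches)  # [['q1', 'q2'], ['q3'], ['q4']]
```

Four sub-queries finish in three sequential rounds here; the saving grows with the number of independent branches in the plan.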

Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation

2410.20724v2 by Mufei Li, Siqi Miao, Pan Li

Large Language Models (LLMs) demonstrate strong reasoning abilities but face limitations such as hallucinations and outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM outputs in structured external knowledge from KGs. However, current KG-based RAG frameworks still struggle to optimize the trade-off between retrieval effectiveness and efficiency in identifying a suitable amount of relevant graph information for the LLM to digest. We introduce SubgraphRAG, extending the KG-based RAG framework that retrieves subgraphs and leverages LLMs for reasoning and answer prediction. Our approach innovatively integrates a lightweight multilayer perceptron with a parallel triple-scoring mechanism for efficient and flexible subgraph retrieval while encoding directional structural distances to enhance retrieval effectiveness. The size of retrieved subgraphs can be flexibly adjusted to match the query's need and the downstream LLM's capabilities. This design strikes a balance between model complexity and reasoning power, enabling scalable and generalizable retrieval processes. Notably, based on our retrieved subgraphs, smaller LLMs like Llama3.1-8B-Instruct deliver competitive results with explainable reasoning, while larger models like GPT-4o achieve state-of-the-art accuracy compared with previous baselines -- all without fine-tuning. Extensive evaluations on the WebQSP and CWQ benchmarks highlight SubgraphRAG's strengths in efficiency, accuracy, and reliability by reducing hallucinations and improving response grounding.

Effective Instruction Parsing Plugin for Complex Logical Query Answering on Knowledge Graphs

2410.20321v1 by Xingrui Zhuo, Jiapu Wang, Gongqing Wu, Shirui Pan, Xindong Wu

Knowledge Graph Query Embedding (KGQE) aims to embed First-Order Logic (FOL) queries in a low-dimensional KG space for complex reasoning over incomplete KGs. To enhance the generalization of KGQE models, recent studies integrate various external information (such as entity types and relation context) to better capture the logical semantics of FOL queries. The whole process is commonly referred to as Query Pattern Learning (QPL). However, current QPL methods typically suffer from the pattern-entity alignment bias problem, so the learned query patterns are defective and limit KGQE models' performance. To address this problem, we propose an effective Query Instruction Parsing Plugin (QIPP) that leverages the context awareness of Pre-trained Language Models (PLMs) to capture latent query patterns from code-like query instructions. Unlike the external information introduced by previous QPL methods, we first propose code-like instructions to express FOL queries in an alternative format. This format utilizes textual variables and nested tuples to convey the logical semantics within FOL queries, serving as raw material for a PLM-based instruction encoder to obtain complete query patterns. Building on this, we design a query-guided instruction decoder to adapt query patterns to KGQE models. To further enhance QIPP's effectiveness across various KGQE models, we propose a query pattern injection mechanism based on compressed optimization boundaries and an adaptive normalization component, allowing KGQE models to utilize query patterns more efficiently. Extensive experiments demonstrate that our plug-and-play method improves the performance of eight basic KGQE models and outperforms two state-of-the-art QPL methods.

Mathematical Derivation Graphs: A Task for Summarizing Equation Dependencies in STEM Manuscripts

2410.21324v1 by Vishesh Prasad, Brian Kim, Nickvash Kani

Recent advances in natural language processing (NLP), particularly with the emergence of large language models (LLMs), have significantly enhanced the field of textual analysis. However, while these developments have yielded substantial progress in analyzing textual data, applying analysis to mathematical equations and their relationships within texts has produced mixed results. In this paper, we take the initial steps toward understanding the dependency relationships between mathematical expressions in STEM articles. Our dataset, sourced from a random sampling of the arXiv corpus, contains an analysis of 107 published STEM manuscripts whose inter-equation dependency relationships have been hand-labeled, resulting in a new object we refer to as a derivation graph that summarizes the mathematical content of the manuscript. We exhaustively evaluate analytical and NLP-based models to assess their capability to identify and extract the derivation relationships for each article and compare the results with the ground truth. Our comprehensive testing finds that both analytical and NLP models (including LLMs) achieve $\sim$40-50% F1 scores for extracting derivation graphs from articles, revealing that the recent advances in NLP have not made significant inroads in comprehending mathematical texts compared to simpler analytic models. While current approaches offer a solid foundation for extracting mathematical information, further research is necessary to improve accuracy and depth in this area.
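
The F1 scores quoted above compare an extracted derivation graph with the hand-labeled one. As a rough sketch, assuming edges are matched exactly as directed (source equation, target equation) pairs, the metric could be computed as follows. The abstract does not state the matching protocol, so this is an assumption, and the equation labels are made up.

```python
def edge_f1(predicted, gold):
    """Return (precision, recall, F1) over two sets of directed edges."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: an edge (a, b) means equation b is derived from a.
gold_edges = {("eq1", "eq2"), ("eq2", "eq3"), ("eq1", "eq4")}
predicted_edges = {("eq1", "eq2"), ("eq3", "eq4")}

precision, recall, f1 = edge_f1(predicted_edges, gold_edges)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here one of the two predicted edges is correct and one of the three gold edges is recovered, so precision is 1/2, recall is 1/3, and F1 is 0.4.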

DualMAR: Medical-Augmented Representation from Dual-Expertise Perspectives

2410.19955v1 by Pengfei Hu, Chang Lu, Fei Wang, Yue Ning

Electronic Health Records (EHRs) have revolutionized healthcare data management and prediction in the field of AI and machine learning. Accurate predictions of diagnoses and medications significantly mitigate health risks and provide guidance for preventive care. However, EHR-driven models often have a limited understanding of medical-domain knowledge and mostly rely on simple, single ontologies. In addition, due to the missing features and incomplete disease coverage of EHRs, most studies focus only on basic analysis of conditions and medications. We propose DualMAR, a framework that enhances EHR prediction tasks through both individual observation data and public knowledge bases. First, we construct a bi-hierarchical Diagnosis Knowledge Graph (KG) using verified public clinical ontologies and augment this KG via Large Language Models (LLMs). Second, we design a new proxy-task learning on lab results in EHRs for pretraining, which further enhances KG representation and patient embeddings. By retrieving radial and angular coordinates in polar space, DualMAR enables accurate predictions based on rich hierarchical and semantic embeddings from the KG. Experiments also demonstrate that DualMAR outperforms state-of-the-art models, validating its effectiveness in EHR prediction and KG integration in medical domains.

FISHNET: Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert Swarms, and Task Planning

2410.19727v1 by Nicole Cho, Nishan Srishankar, Lucas Cecchi, William Watson

Financial intelligence generation from vast data sources has typically relied on traditional methods of knowledge-graph construction or database engineering. Recently, fine-tuned financial domain-specific Large Language Models (LLMs) have emerged. While these advancements are promising, limitations remain, such as high inference costs, hallucinations, and the complexity of concurrently analyzing high-dimensional financial data. This motivates our invention FISHNET (Financial Intelligence from Sub-querying, Harmonizing, Neural-Conditioning, Expert swarming, and Task planning), an agentic architecture that accomplishes highly complex analytical tasks for more than 98,000 regulatory filings that vary immensely in terms of semantics, data hierarchy, or format. FISHNET shows remarkable performance for financial insight generation (61.8% success rate over 5.0% Routing, 45.6% RAG R-Precision). We conduct rigorous ablations to empirically demonstrate the success of FISHNET, each agent's importance, and the optimized performance of assembling all agents. Our modular architecture can be leveraged for a myriad of use cases, enabling scalability, flexibility, and data integrity that are critical for financial tasks.

Knowledge Graph Enhanced Language Agents for Recommendation

2410.19627v1 by Taicheng Guo, Chaochun Liu, Hai Wang, Varun Mannam, Fang Wang, Xin Chen, Xiangliang Zhang, Chandan K. Reddy

Language agents have recently been used to simulate human behavior and user-item interactions for recommendation systems. However, current language agent simulations do not understand the relationships between users and items, leading to inaccurate user profiles and ineffective recommendations. In this work, we explore the utility of Knowledge Graphs (KGs), which contain extensive and reliable relationships between users and items, for recommendation. Our key insight is that the paths in a KG can capture complex relationships between users and items, eliciting the underlying reasons for user preferences and enriching user profiles. Leveraging this insight, we propose Knowledge Graph Enhanced Language Agents (KGLA), a framework that unifies language agents and KGs for recommendation systems. In the simulated recommendation scenario, we position the user and item within the KG and integrate KG paths as natural language descriptions into the simulation. This allows language agents to interact with each other and discover sufficient rationale behind their interactions, making the simulation more accurate and aligned with real-world cases, thus improving recommendation performance. Our experimental results show that KGLA significantly improves recommendation performance (with a 33%-95% boost in NDCG@1 among three widely used benchmarks) compared to the previous best baseline method.
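
To make "KG paths as natural language descriptions" concrete, the sketch below finds a shortest directed path between a user and an item with breadth-first search and renders it as text. The triples, the relation names, and the verbalization template are invented for illustration; KGLA's actual path extraction and description format are not given in the abstract.

```python
from collections import deque


def find_path(triples, start, goal):
    """Breadth-first search for a shortest directed path of triples."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None  # No directed path exists.


def verbalize(path):
    """Turn a list of triples into a plain-text description."""
    return "; ".join(f"{h} {r.replace('_', ' ')} {t}" for h, r, t in path)


# Hypothetical toy KG linking a user to a candidate item.
triples = [
    ("user_1", "purchased", "album_A"),
    ("album_A", "performed_by", "artist_X"),
    ("artist_X", "released", "album_B"),
]

path = find_path(triples, "user_1", "album_B")
description = verbalize(path)
print(description)
```

A description like this could be placed in an agent's prompt as the rationale connecting the user to the item.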

Graph Linearization Methods for Reasoning on Graphs with Large Language Models

2410.19494v1 by Christos Xypolopoulos, Guokan Shang, Xiao Fei, Giannis Nikolentzos, Hadi Abdine, Iakovos Evdaimon, Michail Chatzianastasis, Giorgos Stamou, Michalis Vazirgiannis

Large language models have evolved to process multiple modalities beyond text, such as images and audio, which motivates us to explore how to effectively leverage them for graph machine learning tasks. The key question, therefore, is how to transform graphs into linear sequences of tokens, a process we term graph linearization, so that LLMs can handle graphs naturally. We consider that graphs should be linearized meaningfully to reflect certain properties of natural language text, such as local dependency and global alignment, so that contemporary LLMs, trained on trillions of textual tokens, can better understand graphs. To achieve this, we developed several graph linearization methods based on graph centrality, degeneracy, and node relabeling schemes. We then investigated their effect on LLM performance in graph reasoning tasks. Experimental results on synthetic graphs demonstrate the effectiveness of our methods compared to random linearization baselines. Our work introduces novel graph representations suitable for LLMs, contributing to the potential integration of graph machine learning with the trend of multi-modal processing using a unified transformer model.
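
As a minimal sketch of what a centrality-based linearization with node relabeling might look like, the code below ranks nodes by degree (one simple centrality measure), relabels them by rank, and emits the edges as a flat token sequence. The choice of degree centrality, the tie-breaking rule, and the output format are assumptions made for this example; the paper's exact schemes are not described in the abstract.

```python
def linearize_by_degree(edges):
    """Relabel nodes by descending degree and emit edges as one string."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Highest degree first; ties broken by original label for determinism.
    ranked = sorted(degree, key=lambda node: (-degree[node], node))
    new_label = {node: index for index, node in enumerate(ranked)}
    # Treat edges as undirected: order endpoints, then sort the edge list.
    relabeled = sorted(
        tuple(sorted((new_label[u], new_label[v]))) for u, v in edges
    )
    return " ".join(f"({a},{b})" for a, b in relabeled)


edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]
sequence = linearize_by_degree(edges)
print(sequence)
```

Node `b` has the highest degree and becomes node 0, so the edges that touch the most central node appear first in the sequence.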

Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis

2410.19225v1 by Weikai Li, Ding Wang, Zijian Ding, Atefeh Sohrabizadeh, Zongyue Qin, Jason Cong, Yizhou Sun

High-level synthesis (HLS) is a widely used tool in designing Field Programmable Gate Arrays (FPGAs). HLS enables FPGA design with software programming languages by compiling the source code into an FPGA circuit. The source code includes a program (called a "kernel") and several pragmas that direct hardware synthesis, such as parallelization and pipelining. While it is relatively easy for software developers to design the program, designing the pragmas relies heavily on hardware knowledge, posing a big challenge for software developers. Recently, different machine learning algorithms, such as GNNs, have been proposed to automate the pragma design via performance prediction. However, when applying the trained model to new kernels, the significant domain shift often leads to unsatisfactory performance. We propose a more domain-generalizable model structure: a two-level hierarchical Mixture of Experts (MoE) that can be flexibly adapted to any GNN model. Different expert networks can learn to deal with different regions in the representation space, and they can utilize similar patterns between the old kernels and new kernels. In the low-level MoE, we apply MoE on three natural granularities of a program: node, basic block, and graph. The high-level MoE learns to aggregate the three granularities for the final decision. To stably train the hierarchical MoE, we further propose a two-stage training method. Extensive experiments verify the effectiveness of the hierarchical MoE.

Enriching GNNs with Text Contextual Representations for Detecting Disinformation Campaigns on Social Media

2410.19193v1 by Bruno Croso Cunha da Silva, Thomas Palmeira Ferraz, Roseli De Deus Lopes

Disinformation on social media poses both societal and technical challenges. While previous studies have integrated textual information into propagation networks, they have yet to fully leverage the advancements in Transformer-based language models for high-quality contextual text representations. This work investigates the impact of incorporating textual features into Graph Neural Networks (GNNs) for fake news detection. Our experiments demonstrate that contextual representations improve performance by 9.3% in Macro F1 over static ones and 33.8% over GNNs without textual features. However, noisy data augmentation degrades performance and increases instability. We expect our methodology to open avenues for further research, and all code is made publicly available.

GCoder: Improving Large Language Model for Generalized Graph Problem Solving

2410.19084v1 by Qifan Zhang, Xiaobin Hong, Jianheng Tang, Nuo Chen, Yuhan Li, Wenzhong Li, Jing Tang, Jia Li

Large Language Models (LLMs) have demonstrated strong reasoning abilities, making them suitable for complex tasks such as graph computation. The traditional reasoning-steps paradigm for graph problems is hindered by unverifiable steps, limited long-term reasoning, and poor generalization to graph variations. To overcome these limitations, we introduce GCoder, a code-based LLM designed to enhance problem-solving in generalized graph computation problems. Our method involves constructing an extensive training dataset, GraphWild, featuring diverse graph formats and algorithms. We employ a multi-stage training process, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Compiler Feedback (RLCF), to refine model capabilities. For unseen tasks, a hybrid retrieval technique is used to augment performance. Experiments demonstrate that GCoder outperforms GPT-4o, with an average accuracy improvement of 16.42% across various graph computational problems. Furthermore, GCoder efficiently manages large-scale graphs with millions of nodes and diverse input formats, overcoming the limitations of previous models focused on the reasoning-steps paradigm. This advancement paves the way for more intuitive and effective graph problem-solving using LLMs. Code and data are available at: https://github.com/Bklight999/WWW25-GCoder/tree/master.

LLM-based Online Prediction of Time-varying Graph Signals

2410.18718v1 by Dayu Qin, Yi Yan, Ercan Engin Kuruoglu

In this paper, we propose a novel framework that leverages large language models (LLMs) for predicting missing values in time-varying graph signals by exploiting spatial and temporal smoothness. We use the LLM to implement a message-passing scheme. For each node with a missing value, its neighbors and previous estimates are fed into the LLM, which infers the missing observation. Tested on the task of online prediction of wind-speed graph signals, our model outperforms online graph filtering algorithms in terms of accuracy, demonstrating the potential of LLMs in effectively addressing partially observed signals in graphs.

Gene-Metabolite Association Prediction with Interactive Knowledge Transfer Enhanced Graph for Metabolite Production

2410.18475v2 by Kexuan Xin, Qingyun Wang, Junyu Chen, Pengfei Yu, Huimin Zhao, Heng Ji

In the rapidly evolving field of metabolic engineering, the quest for efficient and precise gene target identification for metabolite production enhancement presents significant challenges. Traditional approaches, whether knowledge-based or model-based, are notably time-consuming and labor-intensive, due to the vast scale of research literature and the approximate nature of genome-scale metabolic model (GEM) simulations. Therefore, we propose a new task, Gene-Metabolite Association Prediction based on metabolic graphs, to automate candidate gene discovery for a given pair of a metabolite and a candidate associated gene, and we present the first benchmark, containing 2474 metabolites and 1947 genes of two commonly used microorganisms, Saccharomyces cerevisiae (SC) and Issatchenkia orientalis (IO). This task is challenging due to the incompleteness of the metabolic graphs and the heterogeneity among distinct metabolisms. To overcome these limitations, we propose an Interactive Knowledge Transfer mechanism based on Metabolism Graph (IKT4Meta), which improves the association prediction accuracy by integrating the knowledge from different metabolism graphs. First, to build a bridge between two graphs for knowledge transfer, we utilize Pretrained Language Models (PLMs) with external knowledge of genes and metabolites to help generate inter-graph links, significantly alleviating the impact of heterogeneity. Second, we propagate intra-graph links from different metabolic graphs using inter-graph links as anchors. Finally, we conduct the gene-metabolite association prediction based on the enriched metabolism graphs, which integrate the knowledge from multiple microorganisms. Experiments on both organisms demonstrate that our proposed methodology outperforms baselines by up to 12.3% across various link prediction frameworks.

ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis

2410.18447v1 by Zezhong Wang, Xingshan Zeng, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong

Supervised fine-tuning (SFT) is a common method to enhance the tool calling capabilities of Large Language Models (LLMs), with the training data often being synthesized. The current data synthesis process generally involves sampling a set of tools, formulating a requirement based on these tools, and generating the call statements. However, tools sampled randomly lack relevance, making them difficult to combine and thus reducing the diversity of the data. Additionally, current work overlooks the coherence between turns of dialogues, leading to a gap between the synthesized data and real-world scenarios. To address these issues, we propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues. We integrate these two strategies and enable multiple agents to synthesize the dialogue data interactively, resulting in our tool-calling data synthesis pipeline ToolFlow. Data quality assessments demonstrate improvements in the naturalness and coherence of our synthesized dialogues. Finally, we apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow. Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
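
One plausible reading of graph-based sampling is a random walk over a graph whose edges connect related tools, so that each sampled tool is related to the previous one. The sketch below follows that reading. The tool names, the hand-written graph, and the walk procedure are all assumptions for illustration; the abstract does not say how ToolFlow builds its graph or samples from it.

```python
import random


def sample_tools(graph, size, rng):
    """Pick up to `size` distinct tools by walking along relevance edges."""
    current = rng.choice(sorted(graph))  # Random starting tool.
    selected = [current]
    while len(selected) < size:
        candidates = [tool for tool in graph[current] if tool not in selected]
        if not candidates:
            break  # Dead end: return the shorter combination.
        current = rng.choice(candidates)
        selected.append(current)
    return selected


# Hypothetical tool-relevance graph (undirected, stored as adjacency lists).
tool_graph = {
    "search_flights": ["book_flight", "get_weather"],
    "book_flight": ["search_flights", "book_hotel"],
    "book_hotel": ["book_flight", "get_weather"],
    "get_weather": ["search_flights", "book_hotel"],
}

tools = sample_tools(tool_graph, size=3, rng=random.Random(0))
print(tools)
```

Unlike uniform random sampling, every consecutive pair in the result shares an edge, which is the property that makes the tools easier to combine into one requirement.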

Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains

2410.18415v1 by Kun Li, Tianhua Zhang, Xixin Wu, Hongyin Luo, James Glass, Helen Meng

Knowledge Graphs (KGs) can serve as reliable knowledge sources for question answering (QA) due to their structured representation of knowledge. Existing research on the utilization of KGs for large language models (LLMs) mostly relies on subgraph retrievers or iterative prompting, overlooking the potential synergy of LLMs' step-wise reasoning capabilities and KGs' structural nature. In this paper, we present DoG (Decoding on Graphs), a novel framework that facilitates a deep synergy between LLMs and KGs. We first define a concept, the well-formed chain, which consists of a sequence of interrelated fact triplets on the KG, starting from question entities and leading to answers. We argue that this concept can serve as a principle for faithful and sound reasoning in KGQA. To enable LLMs to generate well-formed chains, we propose graph-aware constrained decoding, in which a constraint derived from the topology of the KG regulates the decoding process of the LLMs. This constrained decoding method ensures the generation of well-formed chains while making full use of the step-wise reasoning capabilities of LLMs. Based on the above, DoG, a training-free approach, is able to provide faithful and sound reasoning trajectories grounded on the KGs. Experiments across various KGQA tasks with different background KGs demonstrate that DoG achieves superior and robust performance. DoG also shows general applicability with various open-source LLMs.
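
The notion of a well-formed chain and of a topology-derived constraint can be sketched at the level of whole triplets. In the code below, each step may only pick a triplet whose head is the entity reached so far, and a stand-in scoring function plays the role of the LLM's preference. This is a simplification: DoG constrains the LLM's decoding itself, and the toy KG, the `score` function, and the greedy selection are assumptions made for this example.

```python
def allowed_next(triples, current_entity):
    """Triplets the KG topology permits as the next step."""
    return [t for t in triples if t[0] == current_entity]


def decode_chain(triples, question_entity, score, max_steps=3):
    """Greedily build a chain, choosing only among permitted triplets."""
    chain, current = [], question_entity
    for _ in range(max_steps):
        candidates = [t for t in allowed_next(triples, current) if t not in chain]
        if not candidates:
            break
        best = max(candidates, key=score)
        chain.append(best)
        current = best[2]  # Continue from the tail entity.
    return chain


def is_well_formed(chain, question_entity):
    """Check that each triplet starts where the previous one ended."""
    current = question_entity
    for head, _, tail in chain:
        if head != current:
            return False
        current = tail
    return True


# Hypothetical KG and a stand-in scorer that favors two relations.
kg = [
    ("Paris", "capital_of", "France"),
    ("Paris", "located_in", "Europe"),
    ("France", "currency", "Euro"),
    ("France", "member_of", "EU"),
]
score = lambda triple: 1 if triple[1] in {"capital_of", "currency"} else 0

chain = decode_chain(kg, "Paris", score, max_steps=2)
print(chain)
```

Because candidates are filtered by the graph before scoring, the result is well formed by construction, whichever triplet the scorer prefers.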

Explaining Bayesian Networks in Natural Language using Factor Arguments. Evaluation in the medical domain

2410.18060v1 by Jaime Sevilla, Nikolay Babakov, Ehud Reiter, Alberto Bugarin

In this paper, we propose a model for building natural language explanations for Bayesian Network Reasoning in terms of factor arguments, which are argumentation graphs of flowing evidence, relating the observed evidence to a target variable we want to learn about. We introduce the notion of factor argument independence to address the outstanding question of defining when arguments should be presented jointly or separately, and we present an algorithm that, starting from the evidence nodes and a target node, produces a list of all independent factor arguments ordered by their strength. Finally, we implemented a scheme to build natural language explanations of Bayesian Reasoning using this approach. Our proposal has been validated in the medical domain through a human-driven evaluation study in which we compare the Bayesian Network Reasoning explanations obtained using factor arguments with an alternative explanation method. Evaluation results indicate that users deem our proposed explanation approach significantly more useful for understanding Bayesian Network Reasoning than the existing explanation method it is compared with.

Graphusion: A RAG Framework for Knowledge Graph Construction with a Global Perspective

2410.17600v1 by Rui Yang, Boming Yang, Aosong Feng, Sixun Ouyang, Moritz Blum, Tianwei She, Yuang Jiang, Freddy Lecue, Jinghui Lu, Irene Li

Knowledge Graphs (KGs) are crucial in the field of artificial intelligence and are widely used in downstream tasks, such as question-answering (QA). The construction of KGs typically requires significant effort from domain experts. Large Language Models (LLMs) have recently been used for Knowledge Graph Construction (KGC). However, most existing approaches focus on a local perspective, extracting knowledge triplets from individual sentences or documents, and lack a fusion process to combine the knowledge into a global KG. This work introduces Graphusion, a zero-shot KGC framework from free text. It contains three steps: in Step 1, we extract a list of seed entities using topic modeling to guide the final KG to include the most relevant entities; in Step 2, we conduct candidate triplet extraction using LLMs; in Step 3, we design a novel fusion module that provides a global view of the extracted knowledge, incorporating entity merging, conflict resolution, and novel triplet discovery. Results show that Graphusion achieves scores of 2.92 and 2.37 out of 3 for entity extraction and relation recognition, respectively. Moreover, we showcase how Graphusion could be applied to the Natural Language Processing (NLP) domain and validate it in an educational scenario. Specifically, we introduce TutorQA, a new expert-verified benchmark for QA, comprising six tasks and a total of 1,200 QA pairs. Using the Graphusion-constructed KG, we achieve a significant improvement on the benchmark, for example, a 9.2% accuracy improvement on sub-graph completion.
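
Two of the fusion operations named in Step 3, entity merging and conflict resolution, can be illustrated with a deliberately simple stand-in. The sketch below merges entities through a hand-written alias map and resolves conflicting relations between the same entity pair by majority vote. Graphusion performs fusion with LLMs, so the alias map, the voting rule, and the example triplets are assumptions chosen only to show the shape of the problem.

```python
from collections import Counter


def fuse(triples, aliases):
    """Merge aliased entities, then keep one relation per entity pair."""
    votes = {}
    for head, relation, tail in triples:
        # Entity merging: map each surface form to its canonical name.
        head, tail = aliases.get(head, head), aliases.get(tail, tail)
        votes.setdefault((head, tail), Counter())[relation] += 1
    # Conflict resolution: the most frequent relation wins.
    return sorted(
        (head, counter.most_common(1)[0][0], tail)
        for (head, tail), counter in votes.items()
    )


# Hypothetical candidate triplets extracted from different documents.
candidates = [
    ("LSTM", "is_a", "RNN"),
    ("long short-term memory", "is_a", "RNN"),
    ("LSTM", "prerequisite_of", "RNN"),
    ("RNN", "used_for", "sequence modeling"),
]
aliases = {"long short-term memory": "LSTM"}

global_kg = fuse(candidates, aliases)
print(global_kg)
```

After merging, the pair (LSTM, RNN) has two votes for `is_a` and one for `prerequisite_of`, so the fused graph keeps `is_a`.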

2410.17529v1 by Yongqiang Huang, Wentao Ye, Liyao Li, Junbo Zhao

This study investigates the potential of Large Language Models (LLMs) for reconstructing and constructing the physical world solely based on textual knowledge. It explores the impact of model performance on spatial understanding abilities. To enhance the comprehension of geometric and spatial relationships in the complex physical world, the study introduces a set of geometric conventions and develops a workflow based on multi-layer graphs and multi-agent system frameworks. It examines how LLMs achieve multi-step and multi-objective geometric inference in a spatial environment using multi-layer graphs under unified geometric conventions. Additionally, the study employs a genetic algorithm, inspired by large-scale model knowledge, to solve geometric constraint problems. In summary, this work innovatively explores the feasibility of using text-based LLMs as physical world builders and designs a workflow to enhance their capabilities.

Large Language Model-based Augmentation for Imbalanced Node Classification on Text-Attributed Graphs

2410.16882v1 by Leyao Wang, Yu Wang, Bo Ni, Yuying Zhao, Tyler Derr

Node classification on graphs frequently encounters the challenge of class imbalance, leading to biased performance and posing significant risks in real-world applications. Although several data-centric solutions have been proposed, none of them focus on Text-Attributed Graphs (TAGs), and therefore overlook the potential of leveraging the rich semantics encoded in textual features for boosting the classification of minority nodes. Given this crucial gap, we investigate the possibility of augmenting graph data in the text space, leveraging the textual generation power of Large Language Models (LLMs) to handle imbalanced node classification on TAGs. Specifically, we propose a novel approach called LA-TAG (LLM-based Augmentation on Text-Attributed Graphs), which prompts LLMs to generate synthetic texts based on existing node texts in the graph. Furthermore, to integrate these synthetic text-attributed nodes into the graph, we introduce a text-based link predictor to connect the synthesized nodes with the existing nodes. Our experiments across multiple datasets and evaluation metrics show that our framework significantly outperforms traditional non-textual-based data augmentation strategies and specific node imbalance solutions. This highlights the promise of using LLMs to resolve imbalance issues on TAGs.

Context-aware Inductive Knowledge Graph Completion with Latent Type Constraints and Subgraph Reasoning

2410.16803v2 by Muzhi Li, Cehao Yang, Chengjin Xu, Zixing Song, Xuhui Jiang, Jian Guo, Ho-fung Leung, Irwin King

Inductive knowledge graph completion (KGC) aims to predict missing triples with unseen entities. Recent works focus on modeling reasoning paths between the head and tail entity as direct supporting evidence. However, these methods depend heavily on the existence and quality of reasoning paths, which limits their general applicability in different scenarios. In addition, we observe that latent type constraints and neighboring facts inherent in KGs are also vital in inferring missing triples. To effectively utilize all useful information in KGs, we introduce CATS, a novel context-aware inductive KGC solution. With sufficient guidance from proper prompts and supervised fine-tuning, CATS activates the strong semantic understanding and reasoning capabilities of large language models to assess the existence of query triples. CATS consists of two modules. First, the type-aware reasoning module evaluates whether the candidate entity matches the latent entity type as required by the query relation. Then, the subgraph reasoning module selects relevant reasoning paths and neighboring facts, and evaluates their correlation to the query triple. Experimental results on three widely used datasets demonstrate that CATS significantly outperforms state-of-the-art methods in 16 out of 18 transductive, inductive, and few-shot settings, with an average absolute MRR improvement of 7.2%.
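
For readers unfamiliar with the metric, Mean Reciprocal Rank (MRR) averages the reciprocal of the rank at which the correct entity appears for each query. The sketch below computes it from a list of ranks. The rank values are made up, and the reading of "absolute improvement" as a plain difference between two MRR values is an assumption, not something the abstract spells out.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where rank 1 means the correct entity came first."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / rank for rank in ranks) / len(ranks)


# Hypothetical ranks of the correct entity for three queries.
baseline_ranks = [2, 4, 4]
improved_ranks = [1, 2, 4]

baseline_mrr = mean_reciprocal_rank(baseline_ranks)
improved_mrr = mean_reciprocal_rank(improved_ranks)
absolute_gain = improved_mrr - baseline_mrr
print(f"{baseline_mrr:.3f} -> {improved_mrr:.3f} (gain {absolute_gain:.3f})")
```

With these numbers the baseline MRR is 1/3 and the improved MRR is 7/12, an absolute gain of 0.25.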

The Scene Language: Representing Scenes with Programs, Words, and Embeddings

2410.16770v1 by Yunzhi Zhang, Zizhang Li, Matt Zhou, Shangzhe Wu, Jiajun Wu

We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes. It represents a scene with three key components: a program that specifies the hierarchical and relational structure of entities in the scene, words in natural language that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity. This representation can be inferred from pre-trained language models via a training-free inference technique, given text or image inputs. The resulting scene can be rendered into images using traditional, neural, or hybrid graphics renderers. Together, this forms a robust, automated system for high-quality 3D and 4D scene generation. Compared with existing representations like scene graphs, our proposed Scene Language generates complex scenes with higher fidelity, while explicitly modeling the scene structures to enable precise control and editing.

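Summary: We introduce the Scene Language, a visual scene representation that concisely and precisely describes the structure, semantics, and identity of visual scenes. It represents a scene with three key components: a program that specifies the hierarchical and relational structure of the entities in the scene, natural-language words that summarize the semantic class of each entity, and embeddings that capture the visual identity of each entity. Given text or image inputs, this representation can be inferred from pre-trained language models with a training-free inference technique. The resulting scene can be rendered into images using traditional, neural, or hybrid graphics renderers. Together, this forms a robust, automated system for high-quality 3D and 4D scene generation. Compared with existing representations such as scene graphs, the proposed Scene Language generates complex scenes with higher fidelity while explicitly modeling scene structure to enable precise control and editing.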

Atomic Fact Decomposition Helps Attributed Question Answering

2410.16708v1 by Zhichao Yan, Jiapu Wang, Jiaoyan Chen, Xiaoli Li, Ru Li, Jeff Z. Pan

Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a given question. Retrieval is a widely adopted approach, including two general paradigms: Retrieval-Then-Read (RTR) and post-hoc retrieval. Recently, Large Language Models (LLMs) have shown remarkable proficiency, prompting growing interest in AQA among researchers. However, RTR-based AQA often suffers from irrelevant knowledge and rapidly changing information, even when LLMs are adopted, while post-hoc retrieval-based AQA struggles with comprehending long-form answers with complex logic, and with precisely identifying the content needing revision while preserving the original intent. To tackle these problems, this paper proposes an Atomic fact decomposition-based Retrieval and Editing (ARE) framework, which uses instruction-tuned LLMs to decompose the generated long-form answers into molecular clauses and atomic facts. Notably, the instruction-tuned LLMs are fine-tuned using a well-constructed dataset, generated from large-scale Knowledge Graphs (KGs). This process involves extracting one-hop neighbors from a given set of entities and transforming the result into coherent long-form text. Subsequently, ARE leverages a search engine to retrieve evidence related to atomic facts, inputting this evidence into an LLM-based verifier to determine whether the facts require expansion for re-retrieval or editing. Furthermore, the edited facts are backtracked into the original answer, with evidence aggregated based on the relationship between molecular clauses and atomic facts. Extensive evaluations demonstrate the superior performance of our proposed method over the state of the art on several datasets, with an additionally proposed new metric $Attr_{p}$ for evaluating the precision of evidence attribution.

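Summary: Attributed Question Answering (AQA) aims to provide both a trustworthy answer and a reliable attribution report for a given question. Retrieval is a widely adopted approach and includes two general paradigms: Retrieval-Then-Read (RTR) and post-hoc retrieval. Recently, large language models (LLMs) have shown remarkable proficiency, prompting growing interest in AQA among researchers. However, even when LLMs are adopted, RTR-based AQA is often affected by irrelevant knowledge and rapidly changing information, while post-hoc retrieval-based AQA struggles to comprehend long-form answers with complex logic and to pinpoint the content that needs revision while preserving the original intent. To address these problems, this paper proposes an Atomic fact decomposition-based Retrieval and Editing (ARE) framework, which uses instruction-tuned LLMs to decompose generated long-form answers into molecular clauses and atomic facts. Notably, the instruction-tuned LLMs are fine-tuned on a well-structured dataset generated from large-scale knowledge graphs (KGs). This procedure extracts one-hop neighbors from a given set of entities and converts the result into coherent long-form text. ARE then uses a search engine to retrieve evidence related to the atomic facts and feeds this evidence into an LLM-based verifier to determine whether the facts need to be expanded for re-retrieval or edited. In addition, the edited results are backtracked into the original answer, and evidence is aggregated according to the relationship between molecular clauses and atomic facts. Extensive evaluations show that the proposed method outperforms existing techniques on several datasets, and a new metric, $Attr_{p}$, is additionally proposed to evaluate the precision of evidence attribution.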
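
The ARE abstract mentions building fine-tuning data by extracting one-hop neighbors of a set of entities from a KG and turning them into text. A minimal sketch of that idea follows, assuming the KG is a plain list of (head, relation, tail) triples. The toy triples, function names, and the naive template verbalizer are all hypothetical; the paper's actual pipeline is not described in this abstract.

```python
def one_hop_neighbors(triples, seeds):
    """Keep every triple whose head or tail is one of the seed entities."""
    seeds = set(seeds)
    return [t for t in triples if t[0] in seeds or t[2] in seeds]


def verbalize(triples):
    """Naive template verbalizer (an assumption, not the paper's method)."""
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in triples)


# Toy knowledge graph, invented for illustration.
toy_kg = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "physics"),
    ("Warsaw", "capital_of", "Poland"),
    ("Paris", "capital_of", "France"),
]

facts = one_hop_neighbors(toy_kg, ["Marie Curie"])
text = verbalize(facts)
print(text)
```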

PLDR-LLM: Large Language Model from Power Law Decoder Representations

2410.16703v1 by Burc Gokden

We present the Large Language Model from Power Law Decoder Representations (PLDR-LLM), a language model that leverages non-linear and linear transformations through Power Law Graph Attention mechanism to generate well-defined deductive and inductive outputs. We pretrain the PLDR-LLMs of varying layer sizes with a small batch size of 32 and $\sim$8B tokens from the RefinedWeb dataset, and show that they achieve competitive performance in zero-shot and few-shot settings compared to scaled dot-product LLMs of similar model size reported in the literature. We show that deductive outputs of PLDR-LLMs can be used to compare model characteristics or improve the performance by introducing the Directed Acyclic Graph (DAG) loss as a metric and regularizer. Our results indicate that the initial maximum learning rate and warm-up steps have a lasting impact on deductive outputs throughout the pretraining. We provide a detailed description of PLDR-LLM architecture, its implementation and the pretraining procedure.

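Summary: We present the Large Language Model from Power Law Decoder Representations (PLDR-LLM), a language model that leverages non-linear and linear transformations through a Power Law Graph Attention mechanism to generate well-defined deductive and inductive outputs. We pretrain PLDR-LLMs of varying layer sizes with a small batch size of 32 and $\sim$8B tokens from the RefinedWeb dataset, and show that they achieve competitive performance in zero-shot and few-shot settings compared with scaled dot-product LLMs of similar model size reported in the literature. We show that the deductive outputs of PLDR-LLMs can be used to compare model characteristics, or to improve performance by introducing the Directed Acyclic Graph (DAG) loss as a metric and regularizer. Our results indicate that the initial maximum learning rate and the warm-up steps have a lasting impact on the deductive outputs throughout pretraining. We provide a detailed description of the PLDR-LLM architecture, its implementation, and the pretraining procedure.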

Distill-SynthKG: Distilling Knowledge Graph Synthesis Workflow for Improved Coverage and Efficiency

2410.16597v1 by Prafulla Kumar Choubey, Xin Su, Man Luo, Xiangyu Peng, Caiming Xiong, Tiep Le, Shachar Rosenman, Vasudev Lal, Phil Mui, Ricky Ho, Phillip Howard, Chien-Sheng Wu

Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods predominantly rely on prompt-based approaches, which are inefficient for processing large-scale corpora. These approaches often suffer from information loss, particularly with long documents, due to the lack of specialized design for KG construction. Additionally, there is a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. Furthermore, we re-purpose existing question-answering datasets to establish KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality -- including models up to eight times larger -- but also consistently excels in retrieval and question-answering tasks. Our proposed graph retrieval framework also outperforms all KG-retrieval methods across multiple benchmark datasets. We release the SynthKG dataset and Distill-SynthKG model publicly to support further research and development.

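Summary: Knowledge graphs (KGs) generated by large language models (LLMs) are becoming increasingly valuable for Retrieval-Augmented Generation (RAG) applications that require knowledge-intensive reasoning. However, existing KG extraction methods rely mainly on prompt-based approaches, which are inefficient for processing large-scale corpora. Because they lack a design specialized for KG construction, these approaches often suffer from information loss, particularly with long documents. There is also a gap in evaluation datasets and methodologies for ontology-free KG construction. To overcome these limitations, we propose SynthKG, a multi-step, document-level, ontology-free KG synthesis workflow based on LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG, substantially reducing the number of LLM inference calls. In addition, we repurpose existing question-answering datasets to build KG evaluation datasets and introduce new evaluation metrics. Using the KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results show that Distill-SynthKG not only surpasses all baseline models in KG quality, including models up to eight times larger, but also performs consistently well on retrieval and question-answering tasks. Our proposed graph retrieval framework also outperforms all KG-retrieval methods across multiple benchmark datasets. We publicly release the SynthKG dataset and the Distill-SynthKG model to support further research and development.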

Towards a Reliable Offline Personal AI Assistant for Long Duration Spaceflight

2410.16397v1 by Oliver Bensch, Leonie Bensch, Tommy Nilsson, Florian Saling, Wafa M. Sadri, Carsten Hartmann, Tobias Hecking, J. Nathan Kutz

As humanity prepares for new missions to the Moon and Mars, astronauts will need to operate with greater autonomy, given the communication delays that make real-time support from Earth difficult. For instance, messages between Mars and Earth can take up to 24 minutes, making quick responses impossible. This limitation poses a challenge for astronauts who must rely on in-situ tools to access the large volume of data from spacecraft sensors, rovers, and satellites, data that is often fragmented and difficult to use. To bridge this gap, systems like the Mars Exploration Telemetry-Driven Information System (METIS) are being developed. METIS is an AI assistant designed to handle routine tasks, monitor spacecraft systems, and detect anomalies, all while reducing the reliance on mission control. Current Generative Pretrained Transformer (GPT) Models, while powerful, struggle in safety-critical environments. They can generate plausible but incorrect responses, a phenomenon known as "hallucination," which could endanger astronauts. To overcome these limitations, this paper proposes enhancing systems like METIS by integrating GPTs, Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Augmented Reality (AR). The idea is to allow astronauts to interact with their data more intuitively, using natural language queries and visualizing real-time information through AR. KGs will be used to easily access live telemetry and multimodal data, ensuring that astronauts have the right information at the right time. By combining AI, KGs, and AR, this new system will empower astronauts to work more autonomously, safely, and efficiently during future space missions.

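Summary: As humanity prepares for new missions to the Moon and Mars, astronauts will need to operate with greater autonomy, because communication delays make real-time support from Earth difficult. For example, messages between Mars and Earth can take up to 24 minutes, which makes quick responses impossible. This limitation is a challenge for astronauts who must rely on in-situ tools to access the large volume of data from spacecraft sensors, rovers, and satellites, data that is often fragmented and hard to use. To bridge this gap, systems such as the Mars Exploration Telemetry-Driven Information System (METIS) are being developed. METIS is an AI assistant designed to handle routine tasks, monitor spacecraft systems, and detect anomalies while reducing reliance on mission control. Current Generative Pretrained Transformer (GPT) models, although powerful, struggle in safety-critical environments. They can produce plausible but incorrect responses, a phenomenon known as "hallucination," which could endanger astronauts. To overcome these limitations, this paper proposes enhancing systems like METIS by integrating GPTs, Retrieval-Augmented Generation (RAG), Knowledge Graphs (KGs), and Augmented Reality (AR). The idea is to let astronauts interact with their data more intuitively, using natural-language queries and visualizing real-time information through AR. KGs will be used to provide easy access to live telemetry and multimodal data, ensuring that astronauts have the right information at the right time. By combining AI, KGs, and AR, this new system will empower astronauts to work more autonomously, safely, and efficiently during future space missions.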

A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns

2410.16155v1 by Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao

With the development of large language models, they are widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared memory attacks. However, real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) non-complete graph structure and (2) large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to make poisoned samples have contagious ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%, 18.95%, and 52.93% improvements in line topology, star topology, and 100-agent settings, respectively. We encourage the community to pay attention to the security of multi-agent systems.

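Summary: With the development of large language models, they are widely used as agents in various fields. A key component of an agent is memory, which stores important information but is vulnerable to jailbreak attacks. Existing research focuses mainly on single-agent attacks and shared-memory attacks, whereas real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology, text-based attack evaluation framework. TMCHT involves one attacker agent that tries to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) non-complete graph structure and (2) large-scale systems. We attribute these challenges to a phenomenon we call toxicity disappearing. To address these issues, we propose the Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix so that poisoned samples are retrieved more easily and optimizes the replication suffix so that poisoned samples become contagious. We demonstrate the superiority of our method in TMCHT, with improvements of 23.51%, 18.95%, and 52.93% in the line topology, star topology, and 100-agent settings, respectively. We encourage the community to pay attention to the security of multi-agent systems.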

CausalGraph2LLM: Evaluating LLMs for Causal Queries

2410.15939v1 by Ivaxi Sheth, Bahare Fatemi, Mario Fritz

Causality is essential in scientific research, enabling researchers to interpret true relationships between variables. These causal relationships are often represented by causal graphs, which are directed acyclic graphs. With the recent advancements in Large Language Models (LLMs), there is an increasing interest in exploring their capabilities in causal reasoning and their potential use to hypothesize causal graphs. These tasks necessitate the LLMs to encode the causal graph effectively for subsequent downstream tasks. In this paper, we propose a comprehensive benchmark, CausalGraph2LLM, encompassing a variety of causal graph settings to assess the causal graph understanding capability of LLMs. We categorize the causal queries into two types: graph-level and node-level queries. We benchmark both open-sourced and closed models for our study. Our findings reveal that while LLMs show promise in this domain, they are highly sensitive to the encoding used. Even capable models like GPT-4 and Gemini-1.5 exhibit sensitivity to encoding, with deviations of about 60%. We further demonstrate this sensitivity for downstream causal intervention tasks. Moreover, we observe that LLMs can often display biases when presented with contextual information about a causal graph, potentially stemming from their parametric memory.

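Summary: Causality is essential in scientific research because it lets researchers interpret the true relationships between variables. These causal relationships are usually represented by causal graphs, which are directed acyclic graphs. With recent advances in large language models (LLMs), there is growing interest in exploring their causal reasoning capabilities and their potential use in hypothesizing causal graphs. These tasks require LLMs to encode the causal graph effectively for subsequent downstream tasks. In this paper, we propose a comprehensive benchmark, CausalGraph2LLM, which covers a variety of causal graph settings to assess how well LLMs understand causal graphs. We divide causal queries into two types: graph-level queries and node-level queries. We benchmark both open-source and closed models. Our findings show that although LLMs are promising in this domain, they are highly sensitive to the encoding used. Even capable models such as GPT-4 and Gemini-1.5 are sensitive to encoding, with deviations of about 60%. We further demonstrate this sensitivity on downstream causal intervention tasks. In addition, we observe that LLMs often display biases when given contextual information about a causal graph, which may stem from their parametric memory.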
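
To make "sensitivity to the encoding used" concrete, the sketch below serializes the same causal DAG in two different textual forms and computes the ground truth for a node-level query. The graph, the two encodings, and the function names are illustrative assumptions and are not taken from the CausalGraph2LLM benchmark itself.

```python
# Toy causal DAG as a list of directed edges (cause, effect); invented example.
edges = [("smoking", "tar"), ("tar", "cancer"), ("smoking", "cancer")]


def encode_edge_list(edges):
    """Encoding A: one 'cause -> effect' arrow per line."""
    return "\n".join(f"{a} -> {b}" for a, b in edges)


def encode_sentences(edges):
    """Encoding B: one natural-language sentence per edge."""
    return " ".join(f"{a} causes {b}." for a, b in edges)


def children(edges, node):
    """Ground truth for a node-level query: direct effects of `node`."""
    return sorted(b for a, b in edges if a == node)


prompt_a = encode_edge_list(edges) + "\nWhat are the direct effects of smoking?"
prompt_b = encode_sentences(edges) + " What are the direct effects of smoking?"
expected = children(edges, "smoking")
```

Both prompts describe the same graph, so a model that understood the graph would give the same answer to each. A benchmark of this kind measures how far accuracy drifts between such encodings.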

LLM4GRN: Discovering Causal Gene Regulatory Networks with LLMs -- Evaluation through Synthetic Data Generation

2410.15828v1 by Tejumade Afonja, Ivaxi Sheth, Ruta Binkyte, Waqar Hanif, Thomas Ulas, Matthias Becker, Mario Fritz

Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, leveraging their learned biological knowledge alone or in combination with traditional statistical methods. We develop a task-based evaluation strategy to address the challenge of unavailable ground truth causal graphs. Specifically, we use the GRNs suggested by LLMs to guide causal synthetic data generation and compare the resulting data against the original dataset. Our statistical and biological assessments show that LLMs can support statistical modeling and data synthesis for biological research.

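Summary: Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, using their learned biological knowledge either alone or in combination with traditional statistical methods. We develop a task-based evaluation strategy to address the challenge that ground-truth causal graphs are unavailable. Specifically, we use the GRNs suggested by LLMs to guide causal synthetic data generation and compare the generated data with the original dataset. Our statistical and biological assessments show that LLMs can support statistical modeling and data synthesis for biological research.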

NetSafe: Exploring the Topological Safety of Multi-agent Networks

2410.15686v1 by Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang

Large language models (LLMs) have empowered nodes within multi-agent networks with intelligence, showing growing applications in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored, and previous research on the safety of a single LLM is challenging to transfer. In this paper, we focus on the safety of multi-agent networks from a topological perspective, investigating which topological properties contribute to safer networks. To this end, we propose a general framework, NetSafe, along with an iterative RelCom interaction to unify existing diverse LLM-based agent frameworks, laying the foundation for generalized topological safety research. We identify several critical phenomena when multi-agent networks are exposed to attacks involving misinformation, bias, and harmful information, termed Agent Hallucination and Aggregation Safety. Furthermore, we find that highly connected networks are more susceptible to the spread of adversarial attacks, with task performance in a Star Graph Topology decreasing by 29.7%. Moreover, our proposed static metrics aligned more closely with real-world dynamic evaluations than traditional graph-theoretic metrics, indicating that networks with greater average distances from attackers exhibit enhanced safety. In conclusion, our work introduces a new topological perspective on the safety of LLM-based multi-agent networks and discovers several unreported phenomena, paving the way for future research to explore the safety of such networks.

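Summary: Large language models (LLMs) have given intelligence to the nodes of multi-agent networks, which are seeing growing use in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored, and earlier research on the safety of a single LLM is hard to transfer. In this paper, we study the safety of multi-agent networks from a topological perspective, investigating which topological properties make networks safer. To this end, we propose a general framework, NetSafe, together with an iterative RelCom interaction, to unify the diverse existing LLM-based agent frameworks and lay the foundation for generalized topological safety research. We identify several critical phenomena that occur when multi-agent networks are exposed to attacks involving misinformation, bias, and harmful information, termed Agent Hallucination and Aggregation Safety. We also find that highly connected networks are more susceptible to adversarial attacks, with task performance in a Star Graph Topology dropping by 29.7%. In addition, our proposed static metrics match real-world dynamic evaluations more closely than traditional graph-theoretic metrics, indicating that networks with a greater average distance from attackers are safer. In conclusion, our work introduces a new topological perspective on the safety of LLM-based multi-agent networks and uncovers several previously unreported phenomena, paving the way for future research on the safety of such networks.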
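
The NetSafe abstract links safety to the average distance between the attacker and the other agents. The sketch below computes that quantity with breadth-first search for a star and a chain of five agents. It is only an illustration of the graph-theoretic notion, assuming an undirected, connected network stored as an adjacency dict; it is not NetSafe's actual metric, whose definition is not given in the abstract.

```python
from collections import deque


def avg_distance_from(adj, source):
    """Average shortest-path length (in hops) from `source` to every other node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for node, d in dist.items() if node != source]
    return sum(others) / len(others)


# Hypothetical 5-agent networks; the attacker is agent 0 in both.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}       # attacker at the hub
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}      # attacker at one end

print(avg_distance_from(star, 0))   # every agent is one hop away
print(avg_distance_from(chain, 0))  # agents are 1, 2, 3 and 4 hops away
```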

TAGExplainer: Narrating Graph Explanations for Text-Attributed Graph Learning Models

2410.15268v1 by Bo Pan, Zhen Xiong, Guanchen Wu, Zheng Zhang, Yifei Zhang, Liang Zhao

Representation learning of Text-Attributed Graphs (TAGs) has garnered significant attention due to its applications in various domains, including recommendation systems and social networks. Despite advancements in TAG learning methodologies, challenges remain in explainability due to the black-box nature of existing TAG representation learning models. This paper presents TAGExplainer, the first method designed to generate natural language explanations for TAG learning. TAGExplainer employs a generative language model that maps input-output pairs to explanations reflecting the model's decision-making process. To address the lack of annotated ground truth explanations in real-world scenarios, we propose first generating pseudo-labels that capture the model's decisions from saliency-based explanations. The pseudo-label generator is then iteratively trained via Expert Iteration on three training objectives focusing on faithfulness and brevity, to improve the quality of the generated pseudo-labels. The high-quality pseudo-labels are finally utilized to train an end-to-end explanation generator model. Extensive experiments are conducted to demonstrate the effectiveness of TAGExplainer in producing faithful and concise natural language explanations.

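Summary: Representation learning on Text-Attributed Graphs (TAGs) has attracted considerable attention because of its applications in many domains, including recommendation systems and social networks. Despite progress in TAG learning methods, explainability remains a challenge because existing TAG representation learning models are black boxes. This paper presents TAGExplainer, the first method designed to generate natural language explanations for TAG learning. TAGExplainer uses a generative language model that maps input-output pairs to explanations reflecting the model's decision-making process. To address the lack of annotated ground-truth explanations in real-world scenarios, we propose first generating pseudo-labels from saliency-based explanations to capture the model's decisions, and then iteratively training the pseudo-label generator via Expert Iteration on three training objectives that focus on faithfulness and brevity, in order to improve the quality of the generated pseudo-labels. Finally, the high-quality pseudo-labels are used to train an end-to-end explanation generator. Extensive experiments demonstrate the effectiveness of TAGExplainer in producing faithful and concise natural language explanations.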

Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective for Molecular Property Prediction

2410.15165v1 by Yinhan He, Zaiyi Zheng, Patrick Soga, Yaozhen Zhu, Yushun Dong, Jundong Li

In recent years, Graph Neural Networks (GNNs) have become successful in molecular property prediction tasks such as toxicity analysis. However, due to the black-box nature of GNNs, their outputs can be concerning in high-stakes decision-making scenarios, e.g., drug discovery. Facing such an issue, Graph Counterfactual Explanation (GCE) has emerged as a promising approach to improve GNN transparency. However, current GCE methods usually fail to take domain-specific knowledge into consideration, which can result in outputs that are not easily comprehensible by humans. To address this challenge, we propose a novel GCE method, LLM-GCE, to unleash the power of large language models (LLMs) in explaining GNNs for molecular property prediction. Specifically, we utilize an autoencoder to generate the counterfactual graph topology from a set of counterfactual text pairs (CTPs) based on an input graph. Meanwhile, we also incorporate a CTP dynamic feedback module to mitigate LLM hallucination, which provides intermediate feedback derived from the generated counterfactuals as an attempt to give more faithful guidance. Extensive experiments demonstrate the superior performance of LLM-GCE. Our code is released on https://github.com/YinhanHe123/new_LLM4GNNExplanation.

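Summary: In recent years, Graph Neural Networks (GNNs) have been applied successfully to molecular property prediction tasks such as toxicity analysis. However, because GNNs are black boxes, their outputs can be concerning in high-stakes decision-making scenarios such as drug discovery. In response, Graph Counterfactual Explanation (GCE) has emerged as a promising approach to improving GNN transparency. However, current GCE methods usually fail to take domain-specific knowledge into account, which can produce outputs that are hard for humans to understand. To address this challenge, we propose a novel GCE method, LLM-GCE, which unleashes the power of large language models (LLMs) to explain GNNs for molecular property prediction. Specifically, we use an autoencoder to generate the counterfactual graph topology from a set of counterfactual text pairs (CTPs) based on an input graph. We also incorporate a CTP dynamic feedback module to mitigate LLM hallucination; it provides intermediate feedback derived from the generated counterfactuals in an attempt to give more faithful guidance. Extensive experiments demonstrate the superior performance of LLM-GCE. Our code is released at https://github.com/YinhanHe123/new_LLM4GNNExplanation.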

MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science

2410.15126v1 by Junho Kim, Yeachan Kim, Jun-Hyung Park, Yerim Oh, Suho Kim, SangKeun Lee

We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently adapt the pre-trained language models (PLMs) for materials science. Unlike previous adaptation strategies that solely focus on constructing domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that materials science corpus has distinct characteristics from other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation convincingly supports the strength of MELT, demonstrating superior performance compared to existing continued pre-training methods. The in-depth analysis also shows that MELT enables PLMs to effectively represent materials entities compared to the existing adaptation methods, thereby highlighting its broad applicability across a wide spectrum of materials science.

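Summary: We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), designed specifically to adapt pre-trained language models (PLMs) to materials science efficiently. Unlike previous adaptation strategies, which focus solely on constructing a domain-specific corpus, MELT considers both the corpus and the training strategy, because materials science corpora have characteristics that differ from those of other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Using the extracted knowledge, we integrate a curriculum into the adaptation process that starts with familiar, general concepts and gradually moves toward more specialized terms. We conduct extensive experiments on diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation convincingly supports the strengths of MELT, which shows superior performance compared with existing continued pre-training methods. In-depth analysis also shows that, compared with existing adaptation methods, MELT enables PLMs to represent materials entities effectively, highlighting its broad applicability across a wide range of materials science.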

Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models

2410.15116v1 by Qitan Lv, Jie Wang, Hanzhu Chen, Bin Li, Yongdong Zhang, Feng Wu

Generation of plausible but incorrect factual information, often termed hallucination, has attracted significant research interest. Retrieval-augmented language model (RALM) -- which enhances models with up-to-date knowledge -- emerges as a promising method to reduce hallucination. However, existing RALMs may instead exacerbate hallucination when retrieving lengthy contexts. To address this challenge, we propose COFT, a novel COarse-to-Fine highlighTing method to focus on different granularity-level key texts, thereby avoiding getting lost in lengthy contexts. Specifically, COFT consists of three components: recaller, scorer, and selector. First, the recaller applies a knowledge graph to extract potential key entities in a given context. Second, the scorer measures the importance of each entity by calculating its contextual weight. Finally, the selector selects high contextual weight entities with a dynamic threshold algorithm and highlights the corresponding paragraphs, sentences, or words in a coarse-to-fine manner. Extensive experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, with an improvement of over 30% in the F1 score. Moreover, COFT also exhibits remarkable versatility across various long-form tasks, such as reading comprehension and question answering.

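Summary: The generation of plausible but incorrect factual information, often called hallucination, has attracted significant research interest. The retrieval-augmented language model (RALM), which enhances models with up-to-date knowledge, is a promising way to reduce hallucination. However, existing RALMs may instead aggravate hallucination when they retrieve lengthy contexts. To address this challenge, we propose COFT, a novel COarse-to-Fine highlighTing method that focuses on key texts at different levels of granularity and so avoids getting lost in lengthy contexts. Specifically, COFT consists of three components: recaller, scorer, and selector. First, the recaller applies a knowledge graph to extract potential key entities in a given context. Second, the scorer measures the importance of each entity by computing its contextual weight. Finally, the selector uses a dynamic threshold algorithm to select entities with high contextual weight and highlights the corresponding paragraphs, sentences, or words in a coarse-to-fine manner. Extensive experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, with an improvement of more than 30% in the F1 score. COFT also shows remarkable versatility across various long-form tasks, such as reading comprehension and question answering.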

2410.15064v1 by George Hannah, Rita T. Sousa, Ioannis Dasoulas, Claudia d'Amato

With the recent surge in popularity of Large Language Models (LLMs), there is a rising risk of users blindly trusting the information in the response, even in cases where the LLM recommends actions that have potential legal implications, which may put the user in danger. We provide an empirical analysis of multiple existing LLMs showing the urgency of the problem. Hence, we propose a short-term solution consisting of an approach for isolating these legal issues through prompt re-engineering. We further analyse the outcomes and the limitations of the prompt engineering-based approach, and we highlight the need for additional resources to fully solve the problem. We also propose a framework powered by a legal knowledge graph (KG) to generate legal citations for these legal issues, enriching the response of the LLM.

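Summary: With the recent surge in popularity of large language models (LLMs), there is a rising risk that users will blindly trust the information in a response, even when the LLM recommends actions that may have legal implications, which could put the user in danger. We provide an empirical analysis of multiple existing LLMs that shows the urgency of the problem. We therefore propose a short-term solution, an approach that isolates these legal issues through prompt re-engineering. We further analyse the outcomes and the limitations of the prompt engineering approach and stress that fully solving the problem requires additional resources. We also propose a framework powered by a legal knowledge graph (KG) that generates legal citations for these legal issues, enriching the LLM's response.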

LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model

2410.14961v1 by Tianqianjin Lin, Pengwei Yan, Kaisong Song, Zhuoren Jiang, Yangyang Kang, Jun Lin, Weikang Yuan, Junjie Cao, Changlong Sun, Xiaozhong Liu

Graph foundation models (GFMs) have recently gained significant attention. However, the unique data processing and evaluation setups employed by different studies hinder a deeper understanding of their progress. Additionally, current research tends to focus on specific subsets of graph learning tasks, such as structural tasks, node-level tasks, or classification tasks. As a result, they often incorporate specialized modules tailored to particular task types, losing their applicability to other graph learning tasks and contradicting the original intent of foundation models to be universal. Therefore, to enhance consistency, coverage, and diversity across domains, tasks, and research interests within the graph learning community in the evaluation of GFMs, we propose GFMBench, a systematic and comprehensive benchmark comprising 26 datasets. Moreover, we introduce LangGFM, a novel GFM that relies entirely on large language models. By revisiting and exploring the effective graph textualization principles, as well as repurposing successful techniques from graph augmentation and graph self-supervised learning within the language space, LangGFM achieves performance on par with or exceeding the state of the art across GFMBench, which can offer us new perspectives, experiences, and baselines to drive forward the evolution of GFMs.

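Summary: Graph foundation models (GFMs) have recently gained significant attention. However, the different data processing and evaluation setups used by different studies hinder a deeper understanding of their progress. In addition, current research tends to focus on specific subsets of graph learning tasks, such as structural tasks, node-level tasks, or classification tasks. As a result, these models often include specialized modules tailored to particular task types, which costs them their applicability to other graph learning tasks and contradicts the original intent that foundation models be universal. Therefore, to improve consistency, coverage, and diversity across domains, tasks, and research interests when the graph learning community evaluates GFMs, we propose GFMBench, a systematic and comprehensive benchmark comprising 26 datasets. We also introduce LangGFM, a novel GFM that relies entirely on large language models. By revisiting and exploring effective graph textualization principles, and by repurposing successful techniques from graph augmentation and graph self-supervised learning within the language space, LangGFM achieves performance on par with or better than the state of the art across GFMBench, which can offer new perspectives, experience, and baselines to drive the evolution of GFMs.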

TransBox: EL++-closed Ontology Embedding

2410.14571v1 by Hui Yang, Jiaoyan Chen, Uli Sattler

OWL (Web Ontology Language) ontologies, which are able to represent both relational and type facts as standard knowledge graphs and complex domain knowledge in Description Logic (DL) axioms, are widely adopted in domains such as healthcare and bioinformatics. Inspired by the success of knowledge graph embeddings, embedding OWL ontologies has gained significant attention in recent years. Current methods primarily focus on learning embeddings for atomic concepts and roles, enabling the evaluation based on normalized axioms through specially designed score functions. However, they often neglect the embedding of complex concepts, making it difficult to infer with more intricate axioms. This limitation reduces their effectiveness in advanced reasoning tasks, such as Ontology Learning and ontology-mediated Query Answering. In this paper, we propose EL++-closed ontology embeddings which are able to represent any logical expressions in DL via composition. Furthermore, we develop TransBox, an effective EL++-closed ontology embedding method that can handle many-to-one, one-to-many and many-to-many relations. Our extensive experiments demonstrate that TransBox often achieves state-of-the-art performance across various real-world datasets for predicting complex axioms.

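Summary: OWL (Web Ontology Language) ontologies can represent relational and type facts as standard knowledge graphs and complex domain knowledge as Description Logic (DL) axioms, and are widely adopted in domains such as healthcare and bioinformatics. Inspired by the success of knowledge graph embeddings, embedding OWL ontologies has attracted considerable attention in recent years. Current methods focus mainly on learning embeddings for atomic concepts and roles, supporting evaluation based on normalized axioms through specially designed score functions. However, they often neglect the embedding of complex concepts, which makes it difficult to infer more intricate axioms. This limitation reduces their effectiveness in advanced reasoning tasks such as ontology learning and ontology-mediated query answering. In this paper, we propose EL++-closed ontology embeddings, which can represent any logical expression in DL via composition. We also develop TransBox, an effective EL++-closed ontology embedding method that can handle many-to-one, one-to-many, and many-to-many relations. Our extensive experiments show that TransBox often achieves state-of-the-art performance on various real-world datasets for predicting complex axioms.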
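
To illustrate what "representing logical expressions via composition" can mean geometrically, here is a minimal sketch of the general box-embedding idea: each concept is an axis-aligned box, a conjunction C ⊓ D is the intersection of the two boxes (again a box), and a subsumption C ⊑ D holds when the box of C lies inside the box of D. The boxes and names below are invented, and this is the generic idea only; TransBox's own construction, including how it embeds roles, is not described in this abstract.

```python
def contains(outer, inner):
    """True if box `inner` lies inside box `outer`; a box is (lower, upper) corners."""
    (o_lo, o_hi), (i_lo, i_hi) = outer, inner
    return (all(o <= i for o, i in zip(o_lo, i_lo))
            and all(i <= o for i, o in zip(i_hi, o_hi)))


def intersect(a, b):
    """Box for the conjunction of two concepts, or None if they are disjoint."""
    lo = tuple(max(x, y) for x, y in zip(a[0], b[0]))
    hi = tuple(min(x, y) for x, y in zip(a[1], b[1]))
    return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None


# Hypothetical 2-D boxes for three atomic concepts.
person = ((0.0, 0.0), (5.0, 5.0))
parent = ((0.0, 0.0), (4.0, 4.0))
doctor = ((1.0, 1.0), (3.0, 6.0))

parent_and_doctor = intersect(parent, doctor)   # complex concept: Parent ⊓ Doctor
print(parent_and_doctor)
print(contains(person, parent_and_doctor))      # does Parent ⊓ Doctor ⊑ Person hold?
```

Because the intersection of two boxes is itself a box, the complex concept lives in the same space as the atomic ones, which is the kind of closure property the abstract refers to.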

Enabling Scalable Evaluation of Bias Patterns in Medical LLMs

2410.14763v1 by Hamed Fayyaz, Raphael Poulain, Rahmatollah Beheshti

Large language models (LLMs) have shown impressive potential in helping with numerous medical challenges. Deploying LLMs in high-stakes applications such as medicine, however, brings in many concerns. One major area of concern relates to biased behaviors of LLMs in medical applications, leading to unfair treatment of individuals. To pave the way for the responsible and impactful deployment of Med LLMs, rigorous evaluation is a key prerequisite. Due to the huge complexity and variability of different medical scenarios, existing work in this domain has primarily relied on using manually crafted datasets for bias evaluation. In this study, we present a new method to scale up such bias evaluations by automatically generating test cases based on rigorous medical evidence. We specifically target the challenges of a) domain-specificity of bias characterization, b) hallucinating while generating the test cases, and c) various dependencies between the health outcomes and sensitive attributes. To that end, we offer new methods to address these challenges integrated with our generative pipeline, using medical knowledge graphs, medical ontologies, and customized general LLM evaluation frameworks in our method. Through a series of extensive experiments, we show that the test cases generated by our proposed method can effectively reveal bias patterns in Med LLMs at larger and more flexible scales than human-crafted datasets. We publish a large bias evaluation dataset using our pipeline, which is dedicated to a few medical case studies. A live demo of our application for vignette generation is available at https://vignette.streamlit.app. Our code is also available at https://github.com/healthylaife/autofair.

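Summary: Large language models (LLMs) have shown remarkable potential for helping with many medical challenges. Deploying LLMs in high-stakes applications such as medicine, however, raises many concerns. One major concern is biased LLM behavior in medical applications, which leads to unfair treatment of individuals. Rigorous evaluation is a key prerequisite for the responsible and impactful deployment of Med LLMs. Because medical scenarios are highly complex and variable, existing work in this area has relied mainly on manually crafted datasets for bias evaluation. In this study, we propose a new method that scales up such bias evaluations by automatically generating test cases from rigorous medical evidence. We specifically target the challenges of a) the domain specificity of bias characterization, b) hallucination while generating test cases, and c) the various dependencies between health outcomes and sensitive attributes. To that end, we provide new methods that address these challenges and integrate them with our generative pipeline, using medical knowledge graphs, medical ontologies, and customized general LLM evaluation frameworks. Through a series of extensive experiments, we show that the test cases generated by our method can effectively reveal bias patterns in Med LLMs at a larger and more flexible scale than manually crafted datasets. We release a large bias evaluation dataset built with our pipeline and dedicated to a few medical case studies. A live demo of our vignette generation application is available at https://vignette.streamlit.app. Our code is also available at https://github.com/healthylaife/autofair.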
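
To make the idea of automatically generated test cases concrete, the following is a minimal sketch of counterfactual vignette generation: the same clinical scenario is rendered once per combination of sensitive attributes, so that differences in a model's answers can be attributed to those attributes. The template text, the attribute names, and the `generate_vignettes` function are all hypothetical illustrations; the abstract does not describe the authors' pipeline at this level, and this sketch omits the knowledge-graph and ontology grounding it mentions.

```python
from itertools import product

# Hypothetical template; not taken from the paper or its dataset.
TEMPLATE = ("A {age}-year-old {sex} patient reports {symptom}. "
            "What is the recommended next step?")


def generate_vignettes(template, fixed, sensitive):
    """Render one vignette per combination of sensitive attribute values.

    fixed:     attributes held constant across all vignettes
    sensitive: mapping from attribute name to the list of values to vary
    Returns a list of (attributes, text) pairs.
    """
    names = sorted(sensitive)
    cases = []
    for values in product(*(sensitive[name] for name in names)):
        attrs = dict(zip(names, values))
        cases.append((attrs, template.format(**fixed, **attrs)))
    return cases


cases = generate_vignettes(
    TEMPLATE,
    fixed={"symptom": "chest pain"},
    sensitive={"age": [30, 70], "sex": ["female", "male"]},
)
for attrs, text in cases:
    print(attrs, "->", text)
```

Each vignette would then be sent to the model under test, and the responses compared across attribute values while the clinical content stays fixed.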

Paths-over-Graph: Knowledge Graph Empowered Large Language Model Reasoning

2410.14211v2 by Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Wenjie Zhang

Large Language Models (LLMs) have achieved impressive results in various tasks but struggle with hallucination problems and lack of relevant knowledge, especially in deep complex reasoning and knowledge-intensive tasks. Knowledge Graphs (KGs), which capture vast amounts of facts in a structured format, offer a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges like handling multi-hop reasoning, multi-entity questions, and effectively utilizing graph structures. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration, which combines the inherent knowledge of LLMs with factual knowledge from KGs. In order to improve the efficiency, PoG prunes irrelevant information from the graph exploration first and introduces efficient three-step pruning techniques that incorporate graph structures, LLM prompting, and a pre-trained language model (e.g., SBERT) to effectively narrow down the explored candidate paths. This ensures all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem-solving. PoG innovatively utilizes graph structure to prune the irrelevant noise and represents the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets demonstrate PoG outperforms the state-of-the-art method ToG across GPT-3.5-Turbo and GPT-4, achieving an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.

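Summary: Large language models (LLMs) have achieved impressive results on a variety of tasks, but they still suffer from hallucination and a lack of relevant knowledge, especially in deep complex reasoning and knowledge-intensive tasks. Knowledge graphs (KGs), which capture large numbers of facts in a structured format, provide a reliable source of knowledge for reasoning. However, existing KG-based LLM reasoning methods face challenges in handling multi-hop reasoning, handling multi-entity questions, and making effective use of graph structure. To address these issues, we propose Paths-over-Graph (PoG), a novel method that enhances LLM reasoning by integrating knowledge reasoning paths from KGs, improving the interpretability and faithfulness of LLM outputs. PoG tackles multi-hop and multi-entity questions through a three-phase dynamic multi-hop path exploration that combines the inherent knowledge of LLMs with factual knowledge from KGs. To improve efficiency, PoG first prunes irrelevant information from the graph exploration and then introduces three-step pruning techniques that combine graph structure, LLM prompting, and a pre-trained language model (e.g., SBERT) to narrow down the explored candidate paths. This ensures that all reasoning paths contain highly relevant information captured from KGs, making the reasoning faithful and interpretable in problem solving. PoG uses graph structure in a novel way to prune irrelevant noise, and it is the first method to implement multi-entity deep path detection on KGs for LLM reasoning tasks. Comprehensive experiments on five benchmark KGQA datasets show that PoG outperforms the state-of-the-art method ToG with both GPT-3.5-Turbo and GPT-4, with an average accuracy improvement of 18.9%. Notably, PoG with GPT-3.5-Turbo surpasses ToG with GPT-4 by up to 23.9%.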
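
As a rough illustration of the embedding-based pruning step, the sketch below ranks candidate KG paths by their similarity to the question and keeps the top few. It is an assumption-laden toy: a bag-of-words cosine similarity stands in for the SBERT encoder named in the abstract, the path format (alternating entity and relation strings) is hypothetical, and the graph-structure and LLM-prompting pruning steps are not shown.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a crude stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_paths(question, paths, keep):
    """Keep the `keep` candidate paths most similar to the question."""
    q_vec = bow(question)
    scored = [(cosine(q_vec, bow(" ".join(path))), path) for path in paths]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in scored[:keep]]


candidates = [
    ["Inception", "release year", "2010"],
    ["Inception", "directed by", "Christopher Nolan", "born in", "London"],
]
best = prune_paths("where was the director of Inception born", candidates, keep=1)
print(best)
```

In a realistic setting the `bow` and `cosine` helpers would be replaced by a sentence encoder, with the rest of the ranking logic unchanged.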

UniMTS: Unified Pre-training for Motion Time Series

2410.19818v1 by Xiyuan Zhang, Diyan Teng, Ranak Roy Chowdhury, Shuheng Li, Dezhi Hong, Rajesh K. Gupta, Jingbo Shang

Motion time series collected from mobile and wearable devices such as smartphones and smartwatches offer significant insights into human behavioral patterns, with wide applications in healthcare, automation, IoT, and AR/XR due to their low-power, always-on nature. However, given security and privacy concerns, building large-scale motion time series datasets remains difficult, preventing the development of pre-trained models for human activity analysis. Typically, existing models are trained and tested on the same dataset, leading to poor generalizability across variations in device location, device mounting orientation and human activity type. In this paper, we introduce UniMTS, the first unified pre-training procedure for motion time series that generalizes across diverse device latent factors and activities. Specifically, we employ a contrastive learning framework that aligns motion time series with text descriptions enriched by large language models. This helps the model learn the semantics of time series to generalize across activities. Given the absence of large-scale motion time series data, we derive and synthesize time series from existing motion skeleton data with all-joint coverage. Spatio-temporal graph networks are utilized to capture the relationships across joints for generalization across different device locations. We further design rotation-invariant augmentation to make the model agnostic to changes in device mounting orientations. Our model shows exceptional generalizability across 18 motion time series classification benchmark datasets, outperforming the best baselines by 340% in the zero-shot setting, 16.3% in the few-shot setting, and 9.2% in the full-shot setting.

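Summary: Motion time series collected from mobile and wearable devices such as smartphones and smartwatches offer important insights into human behavioral patterns and, thanks to their low-power, always-on nature, have wide applications in healthcare, automation, IoT, and AR/XR. However, given security and privacy concerns, building large-scale motion time series datasets remains difficult, which hinders the development of pre-trained models for human activity analysis. Existing models are typically trained and tested on the same dataset, so they generalize poorly across variations in device location, device mounting orientation, and human activity type. In this paper, we introduce UniMTS, the first unified pre-training procedure for motion time series that generalizes across diverse device latent factors and activities. Specifically, we adopt a contrastive learning framework that aligns motion time series with text descriptions enriched by large language models. This helps the model learn the semantics of time series and generalize across activities. Because large-scale motion time series data is lacking, we derive and synthesize time series from existing motion skeleton data with all-joint coverage. Spatio-temporal graph networks capture the relationships across joints so that the model generalizes across device locations. We further design a rotation-invariant augmentation so that the model is unaffected by changes in device mounting orientation. Our model shows exceptional generalizability across 18 motion time series classification benchmark datasets, outperforming the best baselines by 340% in the zero-shot setting, 16.3% in the few-shot setting, and 9.2% in the full-shot setting.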
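
One plausible reading of rotation-invariant augmentation is to apply a random 3D rotation to every sample in a window, which simulates the same motion recorded by a device mounted in a different orientation. The sketch below does this for a tri-axial series in plain Python. It is an assumption, not the authors' implementation: the abstract does not specify how rotations are sampled or applied, and the function names here are hypothetical.

```python
import math
import random


def random_rotation_matrix(rng):
    """Sample a 3x3 rotation matrix from a random unit quaternion."""
    # Normalizing four Gaussian draws gives a uniformly random unit quaternion.
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate_series(series, rotation):
    """Apply one rotation to every (x, y, z) sample in the window."""
    return [
        tuple(sum(rotation[i][j] * sample[j] for j in range(3)) for i in range(3))
        for sample in series
    ]


rng = random.Random(1)
window = [(0.0, 0.0, 9.81), (1.0, 2.0, 3.0)]  # toy accelerometer samples
augmented = rotate_series(window, random_rotation_matrix(rng))
print(augmented)
```

A single rotation is shared by all samples in the window, so the relative geometry of the motion is preserved and only the sensor frame changes; sample magnitudes are unchanged.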

Supervised Chain of Thought

2410.14198v1 by Xiang Zhang, Dujian Ding

Large Language Models (LLMs) have revolutionized natural language processing and hold immense potential for advancing Artificial Intelligence. However, the core architecture of most mainstream LLMs -- the Transformer -- has inherent limitations in computational depth, rendering them theoretically incapable of solving many reasoning tasks that demand increasingly deep computations. Chain of Thought (CoT) prompting has emerged as a technique to address these architectural limitations, as evidenced by several theoretical studies. It offers a promising approach to solving complex reasoning tasks that were previously beyond the capabilities of these models. Despite its successes, CoT and its variants (such as Tree of Thought, Graph of Thought, etc.) rely on a "one-prompt-for-all" approach, using a single prompt structure (e.g., "think step by step") for a wide range of tasks -- from counting and sorting to solving mathematical and algorithmic problems. This approach poses significant challenges for models to generate the correct reasoning steps, as the model must navigate through a vast prompt template space to find the appropriate template for each task. In this work, we build upon previous theoretical analyses of CoT to demonstrate how the one-prompt-for-all approach can negatively affect the computability of LLMs. We partition the solution search space into two: the prompt space and the answer space. Our findings show that task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance. Through experiments with state-of-the-art LLMs, we reveal a gap in reasoning performance when supervision is applied versus when it is not.

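Summary: Large language models (LLMs) have revolutionized natural language processing and hold immense potential for advancing artificial intelligence. However, the core architecture of most mainstream LLMs, the Transformer, has inherent limitations in computational depth, which makes these models theoretically incapable of solving many reasoning tasks that require increasingly deep computation. Chain of Thought (CoT) prompting has emerged as a technique for addressing these architectural limitations, as several theoretical studies confirm. It offers a promising approach to complex reasoning tasks that were previously beyond the capabilities of these models. Despite its successes, CoT and its variants (such as Tree of Thought and Graph of Thought) rely on a "one-prompt-for-all" approach, using a single prompt structure (e.g., "think step by step") for a wide range of tasks, from counting and sorting to solving mathematical and algorithmic problems. This makes it hard for models to generate the correct reasoning steps, because the model must navigate a vast prompt template space to find the appropriate template for each task. In this work, we build on previous theoretical analyses of CoT to show how the one-prompt-for-all approach can negatively affect the computability of LLMs. We partition the solution search space into two parts: the prompt space and the answer space. Our findings show that task-specific supervision is essential for navigating the prompt space accurately and achieving optimal performance. Through experiments with state-of-the-art LLMs, we reveal a gap in reasoning performance between settings with and without supervision.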

Towards Cross-Cultural Machine Translation with Retrieval-Augmented Generation from Multilingual Knowledge Graphs

2410.14057v1 by Simone Conia, Daniel Lee, Min Li, Umar Farooq Minhas, Saloni Potdar, Yunyao Li

Translating text that contains entity names is a challenging task, as cultural-related references can vary significantly across languages. These variations may also be caused by transcreation, an adaptation process that entails more than transliteration and word-for-word translation. In this paper, we address the problem of cross-cultural translation on two fronts: (i) we introduce XC-Translate, the first large-scale, manually-created benchmark for machine translation that focuses on text that contains potentially culturally-nuanced entity names, and (ii) we propose KG-MT, a novel end-to-end method to integrate information from a multilingual knowledge graph into a neural machine translation model by leveraging a dense retrieval mechanism. Our experiments and analyses show that current machine translation systems and large language models still struggle to translate texts containing entity names, whereas KG-MT outperforms state-of-the-art approaches by a large margin, obtaining a 129% and 62% relative improvement compared to NLLB-200 and GPT-4, respectively.

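Summary: Translating text that contains entity names is a challenging task, because culture-related references can vary significantly across languages. These variations may also result from transcreation, an adaptation process that involves more than transliteration and word-for-word translation. In this paper, we address the problem of cross-cultural translation on two fronts: (i) we introduce XC-Translate, the first large-scale, manually created machine translation benchmark focused on text that contains potentially culturally nuanced entity names, and (ii) we propose KG-MT, a new end-to-end method that integrates information from a multilingual knowledge graph into a neural machine translation model by means of a dense retrieval mechanism. Our experiments and analyses show that current machine translation systems and large language models still struggle to translate text containing entity names, whereas KG-MT outperforms state-of-the-art approaches by a large margin, obtaining relative improvements of 129% and 62% over NLLB-200 and GPT-4, respectively.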

RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs

2410.13987v1 by Jiatan Huang, Mingchen Li, Zonghai Yao, Zhichao Yang, Yongkang Xiao, Feiyun Ouyang, Xiaohan Li, Shuo Han, Hong Yu

Answering complex real-world questions often requires accurate retrieval from textual knowledge graphs (TKGs). The scarcity of annotated data, along with intricate topological structures, makes this task particularly challenging. As the nature of relational path information could enhance the inference ability of Large Language Models (LLMs), efficiently retrieving more complex relational path information from TKGs presents another key challenge. To tackle these challenges, we first develop a Dataset for LLMs Complex Reasoning over Textual Knowledge Graphs (RiTeK) with a broad topological structure coverage. We synthesize realistic user queries that integrate diverse topological structures, relational information, and complex textual descriptions. We conduct rigorous expert evaluation to validate the quality of our synthesized queries. We then introduce an enhanced Monte Carlo Tree Search (MCTS) method, Relational MCTS, to automatically extract relational path information from textual graphs for specific queries. Our dataset mainly covers the medical domain, as its relation types and entities are complex and publicly available. Experimental results indicate that RiTeK poses significant challenges for current retrieval and LLM systems, while the proposed Relational MCTS method enhances LLM inference ability and achieves state-of-the-art performance on RiTeK.

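Summary: Answering complex real-world questions often requires accurate retrieval from textual knowledge graphs (TKGs). The scarcity of annotated data, combined with intricate topological structures, makes this task particularly challenging. Because relational path information can enhance the inference ability of large language models (LLMs), efficiently retrieving more complex relational path information from TKGs is another key challenge. To address these challenges, we first develop RiTeK, a dataset for LLM complex reasoning over textual knowledge graphs with broad coverage of topological structures. We synthesize realistic user queries that integrate diverse topological structures, relational information, and complex textual descriptions, and we conduct rigorous expert evaluation to validate the quality of the synthesized queries. We then introduce an enhanced Monte Carlo Tree Search (MCTS) method, Relational MCTS, which automatically extracts relational path information from textual graphs for specific queries. Our dataset mainly covers the medical domain, where the relation types and entities are complex and publicly available. Experimental results show that RiTeK poses significant challenges for current retrieval and LLM systems, while the proposed Relational MCTS method enhances LLM inference ability and achieves state-of-the-art performance on RiTeK.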

The Mystery of the Pathological Path-star Task for Language Models

2410.13779v1 by Arvid Frydenlund

The recently introduced path-star task is a minimal task designed to exemplify limitations to the abilities of language models (Bachmann and Nagarajan, 2024). It involves a path-star graph where multiple arms radiate from a single starting node and each node is unique. Given the start node and a specified target node that ends an arm, the task is to generate the arm containing that target node. This is straightforward for a human but surprisingly difficult for language models, which did not outperform the random baseline. The authors hypothesized this is due to a deficiency in teacher-forcing and the next-token prediction paradigm. We demonstrate the task is learnable using teacher-forcing in alternative settings and that the issue is partially due to representation. We introduce a regularization method using structured samples of the same graph but with differing target nodes, improving results across a variety of model types. We provide RASP proofs showing the task is theoretically solvable. Finally, we find settings where an encoder-only model can consistently solve the task.

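Summary: The recently introduced path-star task is a minimal task designed to illustrate limits on the abilities of language models (Bachmann and Nagarajan, 2024). It involves a path-star graph in which multiple arms radiate from a single start node and every node is unique. Given the start node and a specified target node that ends one of the arms, the task is to generate the arm containing that target node. This is simple for a human but surprisingly difficult for language models, which did not outperform the random baseline. The authors hypothesized that this is due to a deficiency in teacher-forcing and the next-token prediction paradigm. We show that the task is learnable with teacher-forcing in alternative settings and that the problem is partly one of representation. We introduce a regularization method that uses structured samples of the same graph with differing target nodes, which improves results across a variety of model types. We provide RASP proofs showing that the task is theoretically solvable. Finally, we find settings in which an encoder-only model can consistently solve the task.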
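
The task description above can be made concrete with a small generator and reference solver. This is a minimal sketch under assumed conventions: the graph is given as a shuffled list of directed edges, nodes are integer ids, and the function names are hypothetical. The original papers may tokenize and order the input differently.

```python
import random


def make_path_star(num_arms, arm_length, seed=None):
    """Build a path-star graph: `num_arms` disjoint arms leaving one start node.

    Returns (start, arms); each arm lists its unique node ids, start excluded.
    """
    rng = random.Random(seed)
    total = 1 + num_arms * arm_length
    labels = rng.sample(range(total * 10), total)  # unique node ids
    start, rest = labels[0], labels[1:]
    arms = [rest[i * arm_length:(i + 1) * arm_length] for i in range(num_arms)]
    return start, arms


def make_example(start, arms, rng):
    """Return (shuffled edge list, start, target, gold path from start to target)."""
    edges = []
    for arm in arms:
        previous = start
        for node in arm:
            edges.append((previous, node))
            previous = node
    rng.shuffle(edges)
    gold_arm = rng.choice(arms)
    return edges, start, gold_arm[-1], [start] + gold_arm


def solve(edges, start, target):
    """Reference solver: follow parent pointers from the target back to the start."""
    parent = {child: par for par, child in edges}
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


start, arms = make_path_star(num_arms=4, arm_length=5, seed=0)
edges, start, target, gold = make_example(start, arms, random.Random(0))
print(solve(edges, start, target) == gold)
```

Because every node is unique and has exactly one parent, walking backwards from the target is trivial; the difficulty discussed in the abstract arises when a model must emit the arm left to right, where the first step out of the start node is a choice among all arms.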