LLM
Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2024-11-12 | Scaling Properties of Diffusion Models for Perceptual Tasks | Rahul Ravishankar et.al. | 2411.08034v1 | null |
2024-11-12 | GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation | Yushi Lan et.al. | 2411.08033v1 | null |
2024-11-12 | Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data | Juanhui Li et.al. | 2411.08028v1 | null |
2024-11-12 | LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models | Anoop Cherian et.al. | 2411.08027v1 | null |
2024-11-12 | Leonardo vindicated: Pythagorean trees for minimal reconstruction of the natural branching structures | Dymitr Ruta et.al. | 2411.08024v1 | null |
2024-11-12 | Language Models as Causal Effect Generators | Lucius E. J. Bynum et.al. | 2411.08019v1 | link |
2024-11-12 | Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings | Aditya Sanghi et.al. | 2411.08017v1 | null |
2024-11-12 | Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech | Eleonora Mancini et.al. | 2411.08013v1 | null |
2024-11-12 | ExpressivityArena: Can LLMs Express Information Implicitly? | Joshua Tint et.al. | 2411.08010v1 | null |
2024-11-12 | Can adversarial attacks by large language models be attributed? | Manuel Cebrian et.al. | 2411.08003v1 | null |
2024-11-12 | Derivational Morphology Reveals Analogical Generalization in Large Language Models | Valentin Hofmann et.al. | 2411.07990v1 | null |
2024-11-12 | Gini Coefficient as a Unified Metric for Evaluating Many-versus-Many Similarity in Vector Spaces | Ben Fauber et.al. | 2411.07983v1 | null |
2024-11-12 | Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization | Davide Buffelli et.al. | 2411.07979v1 | null |
2024-11-12 | DINO-LG: A Task-Specific DINO Model for Coronary Calcium Scoring | Mahmut S. Gokmen et.al. | 2411.07976v1 | null |
2024-11-12 | JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation | Yiyang Ma et.al. | 2411.07975v1 | null |
2024-11-12 | From General to Specific: Utilizing General Hallucation to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents | Chuyi Kong et.al. | 2411.07965v1 | null |
2024-11-12 | Towards Low-bit Communication for Tensor Parallel LLM Inference | Harry Dong et.al. | 2411.07942v1 | null |
2024-11-12 | DuoLift-GAN:Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks | Zhaoxi Zhang et.al. | 2411.07941v1 | null |
2024-11-12 | Automatic dataset shift identification to support root cause analysis of AI performance drift | Mélanie Roschewitz et.al. | 2411.07940v1 | null |
2024-11-12 | CryptoLLM: Unleashing the Power of Prompted LLMs for SmartQnA and Classification of Crypto Posts | Aniket Deroy et.al. | 2411.07917v1 | null |
2024-11-12 | Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus | Benjamin Litterer et.al. | 2411.07892v1 | null |
2024-11-12 | INTRABENCH: Interactive Radiological Benchmark | Constantin Ulrich et.al. | 2411.07885v1 | null |
2024-11-12 | Diverse capability and scaling of diffusion and auto-regressive models when learning abstract rules | Binxu Wang et.al. | 2411.07873v1 | null |
2024-11-12 | Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease | Francesco Chiumento et.al. | 2411.07871v1 | null |
2024-11-12 | Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders | Xiaofeng Zhu et.al. | 2411.07870v1 | null |
2024-11-12 | Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models | Yusen Zhang et.al. | 2411.07858v1 | link |
2024-11-12 | Tucano: Advancing Neural Text Generation for Portuguese | Nicholas Kluge Corrêa et.al. | 2411.07854v1 | null |
2024-11-12 | IAE: Irony-based Adversarial Examples for Sentiment Analysis Systems | Xiaoyin Yi et.al. | 2411.07850v1 | null |
2024-11-12 | Ethical Concern Identification in NLP: A Corpus of ACL Anthology Ethics Statements | Antonia Karamolegkou et.al. | 2411.07845v1 | null |
2024-11-12 | Chain Association-based Attacking and Shielding Natural Language Processing Systems | Jiacheng Huang et.al. | 2411.07843v1 | null |
2024-11-12 | Federated Learning for Discrete Optimal Transport with Large Population under Incomplete Information | Navpreet Kaur et.al. | 2411.07841v1 | null |
2024-11-12 | Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices | Kilian Pfeiffer et.al. | 2411.07826v1 | null |
2024-11-12 | Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models | Youan Cong et.al. | 2411.07820v1 | null |
2024-11-12 | PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring | M. Jaleed Khan et.al. | 2411.07796v1 | link |
2024-11-12 | RedCode: Risky Code Execution and Generation Benchmark for Code Agents | Chengquan Guo et.al. | 2411.07781v1 | null |
2024-11-12 | Likelihood as a Performance Gauge for Retrieval-Augmented Generation | Tianyu Liu et.al. | 2411.07773v1 | link |
2024-11-12 | Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows | Fangyu Lei et.al. | 2411.07763v1 | null |
2024-11-12 | ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization | Weibo Zhao et.al. | 2411.07762v1 | null |
2024-11-12 | Navigation with QPHIL: Quantizing Planner for Hierarchical Implicit Q-Learning | Alexi Canesse et.al. | 2411.07760v1 | null |
2024-11-12 | Optimizing Traffic Signal Control using High-Dimensional State Representation and Efficient Deep Reinforcement Learning | Lawrence Francis et.al. | 2411.07759v1 | null |
2024-11-12 | SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model | Xinyuan Qian et.al. | 2411.07751v1 | null |
2024-11-12 | Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding | Zirui Shao et.al. | 2411.07722v1 | null |
2024-11-12 | Training Data for Large Language Model | Yiming Ju et.al. | 2411.07715v1 | null |
2024-11-12 | New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook | Meng Yang et.al. | 2411.07691v1 | null |
2024-11-12 | World Models: The Safety Perspective | Zifan Zeng et.al. | 2411.07690v1 | null |
2024-11-12 | Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG | Zilun Zhang et.al. | 2411.07688v1 | null |
2024-11-12 | Fast Disentangled Slim Tensor Learning for Multi-view Clustering | Deng Xu et.al. | 2411.07685v1 | null |
2024-11-12 | AI enhanced diagnosis of Peyronies disease a novel approach using Computer Vision | Yudara Kularathne et.al. | 2411.07684v1 | null |
2024-11-12 | Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach | Tianyi Huang et.al. | 2411.07656v1 | link |
2024-11-12 | Direct Preference Optimization Using Sparse Feature-Level Constraints | Qingyu Yin et.al. | 2411.07618v1 | null |
2024-11-12 | Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation | Shuai Niu et.al. | 2411.07611v1 | null |
2024-11-12 | Circuit Complexity Bounds for RoPE-based Transformer Architecture | Bo Chen et.al. | 2411.07602v1 | null |
2024-11-12 | Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations | Rose E. Wang et.al. | 2411.07598v1 | link |
2024-11-12 | Entropy Controllable Direct Preference Optimization | Motoki Omura et.al. | 2411.07595v1 | null |
2024-11-12 | A Comprehensive Survey of AI-Driven Advancements and Techniques in Automated Program Repair and Code Generation | Avinash Anand et.al. | 2411.07586v1 | null |
2024-11-12 | Reinforcement Learning Framework for Quantitative Trading | Alhassan S. Yasin et.al. | 2411.07585v1 | null |
2024-11-12 | Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models | Dongrui Han et.al. | 2411.07563v1 | null |
2024-11-12 | EUR/USD Exchange Rate Forecasting incorporating Text Mining Based on Pre-trained Language Models and Deep Learning Methods | Xiangyu Shi et.al. | 2411.07560v1 | null |
2024-11-12 | Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models | Tiejin Chen et.al. | 2411.07559v1 | null |
2024-11-12 | Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection | YeongHyeon Park et.al. | 2411.07546v1 | null |
2024-11-12 | Model Stealing for Any Low-Rank Language Model | Allen Liu et.al. | 2411.07536v1 | null |
2024-11-12 | Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning | Linyang He et.al. | 2411.07533v1 | null |
2024-11-12 | Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis | Minda Li et.al. | 2411.07529v1 | null |
2024-11-12 | SecEncoder: Logs are All You Need in Security | Muhammed Fatih Bulut et.al. | 2411.07528v1 | null |
2024-11-12 | Prompt-enhanced Network for Hateful Meme Classification | Junxi Liu et.al. | 2411.07527v1 | link |
2024-11-12 | Fair Summarization: Bridging Quality and Diversity in Extractive Summaries | Sina Bagheri Nezhad et.al. | 2411.07521v1 | link |
2024-11-12 | TIPS: Threat Actor Informed Prioritization of Applications using SecEncoder | Muhammed Fatih Bulut et.al. | 2411.07519v1 | null |
2024-11-12 | LLM App Squatting and Cloning | Yinglin Xie et.al. | 2411.07518v1 | null |
2024-11-12 | SparrowVQE: Visual Question Explanation for Course Content Understanding | Jialu Li et.al. | 2411.07516v1 | link |
2024-11-12 | An Attack Traffic Identification Method Based on Temporal Spectrum | Wenwei Xie et.al. | 2411.07510v1 | link |
2024-11-12 | FM-TS: Flow Matching for Time Series Generation | Yang Hu et.al. | 2411.07506v1 | link |
2024-11-12 | LAUREL: Learned Augmented Residual Layer | Gaurav Menghani et.al. | 2411.07501v1 | null |
2024-11-12 | Rapid Response: Mitigating LLM Jailbreaks with a Few Examples | Alwin Peng et.al. | 2411.07494v1 | null |
2024-11-12 | Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models | Daria Kryvosheieva et.al. | 2411.07474v1 | null |
2024-11-12 | IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark | Kawshik Manikantan et.al. | 2411.07466v1 | null |
2024-11-12 | BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks | Shubham Gandhi et.al. | 2411.07464v1 | null |
2024-11-12 | BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions | Anas Awadalla et.al. | 2411.07461v1 | null |
2024-11-12 | DecoPrompt : Decoding Prompts Reduces Hallucinations when Large Language Models Meet False Premises | Nan Xu et.al. | 2411.07457v1 | link |
2024-11-12 | Research on fault diagnosis of nuclear power first-second circuit based on hierarchical multi-granularity classification network | Jiangwen Chen et.al. | 2411.07453v1 | null |
2024-11-12 | Optimizing Data Delivery: Insights from User Preferences on Visuals, Tables, and Text | Reuben Luera et.al. | 2411.07451v1 | null |
2024-11-12 | The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving | Kyoungmin Kim et.al. | 2411.07447v1 | null |
2024-11-12 | Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection | Cilin Yan et.al. | 2411.07446v1 | null |
2024-11-12 | Input-Based Ensemble-Learning Method for Dynamic Memory Configuration of Serverless Computing Functions | Siddharth Agarwal et.al. | 2411.07444v1 | null |
2024-11-11 | Automatically Detecting Online Deceptive Patterns in Real-time | Asmit Nayak et.al. | 2411.07441v1 | null |
2024-11-11 | Predicting BWR Criticality with Data-Driven Machine Learning Model | Muhammad Rizki Oktavian et.al. | 2411.07425v1 | null |
2024-11-11 | Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains | Katerina Korre et.al. | 2411.07417v1 | null |
2024-11-11 | Using Generative AI and Multi-Agents to Provide Automatic Feedback | Shuchen Guo et.al. | 2411.07407v1 | null |
2024-11-11 | Controllable Context Sensitivity and the Knob Behind It | Julian Minder et.al. | 2411.07404v1 | null |
2024-11-11 | Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews | Aakash Sorathiya et.al. | 2411.07398v1 | null |
2024-11-11 | Toward Optimal Search and Retrieval for RAG | Alexandria Leto et.al. | 2411.07396v1 | null |
2024-11-11 | Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery | Mohamed Abul Hassan et.al. | 2411.07395v1 | null |
2024-11-11 | Feature-Space Semantic Invariance: Enhanced OOD Detection for Open-Set Domain Generalization | Haoliang Wang et.al. | 2411.07392v1 | null |
2024-11-11 | Federated Learning Client Pruning for Noisy Labels | Mahdi Morafah et.al. | 2411.07391v1 | link |
2024-11-11 | Firing Rate Models as Associative Memory: Excitatory-Inhibitory Balance for Robust Retrieval | Simone Betteti et.al. | 2411.07388v1 | null |
2024-11-11 | Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages | Midia Yousefi et.al. | 2411.07387v1 | null |
2024-11-11 | BeeManc at the PLABA Track of TAC-2024: RoBERTa for task 1 and LLaMA3.1 and GPT-4o for task 2 | Zhidong Ling et.al. | 2411.07381v1 | null |
2024-11-11 | Warmstarting for Scaling Language Models | Neeratyoy Mallik et.al. | 2411.07340v1 | null |
2024-11-11 | SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models | Bardiya Akhbari et.al. | 2411.07336v1 | link |
2024-11-11 | Multimodal Fusion Balancing Through Game-Theoretic Regularization | Konstantinos Kontras et.al. | 2411.07335v1 | null |
2024-11-11 | Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations | Kirti Bhagat et.al. | 2411.07320v1 | null |
Abstracts
Scaling Properties of Diffusion Models for Perceptual Tasks
2411.08034v1 by Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik
In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through a careful analysis of these scaling behaviors, we present various techniques to efficiently train diffusion models for visual perception tasks. Our models achieve improved or comparable performance to state-of-the-art methods using significantly less data and compute. To use our code and models, see https://scaling-diffusion-perception.github.io .
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
2411.08033v1 by Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, Chen Change Loy
While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. Our framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space design that preserves 3D shape information, and incorporates a cascaded latent diffusion model for improved shape-texture disentanglement. The proposed method, GaussianAnything, supports multi-modal conditional 3D generation, allowing for point cloud, caption, and single/multi-view image inputs. Notably, the newly proposed latent space naturally enables geometry-texture disentanglement, thus allowing 3D-aware editing. Experimental results demonstrate the effectiveness of our approach on multiple datasets, outperforming existing methods in both text- and image-conditioned 3D generation.
Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data
2411.08028v1 by Juanhui Li, Sreyashi Nag, Hui Liu, Xianfeng Tang, Sheikh Sarwar, Limeng Cui, Hansu Gu, Suhang Wang, Qi He, Jiliang Tang
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets. However, the large size and high computation demands of LLMs limit their practicality in many applications, especially when further fine-tuning is required. To address these limitations, smaller models are typically preferred for deployment. However, their training is hindered by the scarcity of labeled data. In contrast, unlabeled data is often readily available and can be leveraged by using LLMs to generate pseudo-labels for training smaller models. This enables the smaller models (student) to acquire knowledge from LLMs (teacher) while reducing computational costs. This process introduces challenges, such as potentially noisy pseudo-labels. Selecting high-quality and informative data is therefore critical to enhance model performance while improving the efficiency of data utilization. To address this, we propose LLKD that enables Learning with Less computational resources and less data for Knowledge Distillation from LLMs. LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student. Specifically, it prioritizes samples where the teacher demonstrates high confidence in its labeling, indicating reliable labels, and where the student exhibits a high information need, identifying challenging samples that require further learning. Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
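The selection rule described in the LLKD abstract (keep samples where the teacher is confident and the student still has a high information need) can be sketched roughly as follows. This is only an illustration under stated assumptions: teacher confidence is taken to be the maximum pseudo-label probability, student information need is taken to be the normalized entropy of the student's prediction, and the thresholds `conf_min` and `need_min` are hypothetical. The paper's actual scoring and thresholding may differ.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_samples(teacher_probs, student_probs, conf_min=0.8, need_min=0.5):
    """Return indices where the teacher is confident and the student is uncertain.

    conf_min and need_min are illustrative thresholds, not values from the paper.
    Each distribution is assumed to cover at least two classes.
    """
    selected = []
    for i, (t, s) in enumerate(zip(teacher_probs, student_probs)):
        teacher_conf = max(t)                         # assumed proxy for label reliability
        student_need = entropy(s) / math.log(len(s))  # entropy normalized to [0, 1]
        if teacher_conf >= conf_min and student_need >= need_min:
            selected.append(i)
    return selected

# Toy class-probability vectors for three unlabeled samples.
teacher = [[0.95, 0.05], [0.55, 0.45], [0.90, 0.10]]
student = [[0.60, 0.40], [0.50, 0.50], [0.99, 0.01]]
print(select_samples(teacher, student))
```

In this toy batch only the first sample passes both filters: the second has an unreliable pseudo-label, and the third is already easy for the student.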
LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models
2411.08027v1 by Anoop Cherian, Radu Corcodel, Siddarth Jain, Diego Romeres
Physical reasoning is an important skill needed for robotic agents when operating in the real world. However, solving such reasoning problems often involves hypothesizing and reflecting over complex multi-body interactions under the effect of a multitude of physical forces. Learning all such interactions thus poses a significant hurdle for state-of-the-art machine learning frameworks, including large language models (LLMs). To study this problem, we propose a new physical reasoning task and a dataset, dubbed TraySim. Our task involves predicting the dynamics of several objects on a tray that is given an external impact. The domino effect of the ensuing object interactions and their dynamics offers a challenging yet controlled setup, with the goal of reasoning being to infer the stability of the objects after the impact. To solve this complex physical reasoning task, we present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs, and synergizes these abilities with the world models built into modern physics engines. Specifically, LLMPhy uses an LLM to generate code to iteratively estimate the physical hyperparameters of the system (friction, damping, layout, etc.) via an implicit analysis-by-synthesis approach using a (non-differentiable) simulator in the loop and uses the inferred parameters to imagine the dynamics of the scene towards solving the reasoning task. To show the effectiveness of LLMPhy, we present experiments on our TraySim dataset to predict the steady-state poses of the objects. Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance, while demonstrating superior convergence against standard black-box optimization methods and better estimation of the physical parameters.
Leonardo vindicated: Pythagorean trees for minimal reconstruction of the natural branching structures
2411.08024v1 by Dymitr Ruta, Corrado Mio, Ernesto Damiani
Trees continue to fascinate with their natural beauty and as engineering masterpieces that are optimal with respect to several independent criteria. The Pythagorean tree is a well-known fractal design that realistically mimics natural tree branching structures. We study various types of Pythagorean-like fractal trees with different shapes of the base, branching angles and relaxed scales in an attempt to identify and explain which variants are the closest match to the branching structures commonly observed in the natural world. Pursuing simultaneously the realism and minimalism of the fractal tree model, we have developed a flexibly parameterised and fast algorithm to grow and visually examine deep Pythagorean-inspired fractal trees, with the capability to systematically over- or underestimate Leonardo da Vinci's tree branching rule as well as control various imbalances and branching angles. We tested the realism of the generated fractal tree images by means of the classification accuracy of detecting natural trees with transfer-trained deep Convolutional Neural Networks (CNNs). Having empirically established the parameters of the fractal trees that maximize the CNN's natural tree class classification accuracy, we translated them back to the scales and angles of branches and came to interesting conclusions that support the da Vinci branching rule and golden-ratio-based scaling for both the shape of the branch and the imbalance between the child branches. We also claim that flexibly parameterized fractal trees can be used to generate artificial examples to train robust detectors of different species of trees.
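The link between the classic Pythagorean tree and Leonardo's rule can be checked numerically. Leonardo's rule is commonly stated as: the parent branch's cross-sectional area equals the sum of the children's, i.e. d_parent^2 = d_1^2 + d_2^2. In the classic Pythagorean tree, the two child squares sit on the legs of a right triangle, so their sides are cos(theta) and sin(theta) times the parent's side. The sketch below is not the authors' algorithm; the function names and the generalized exponent `alpha` are hypothetical and used only for illustration.

```python
import math

def child_scales(theta_deg: float) -> tuple[float, float]:
    """Side-length ratios of the two child squares in a classic Pythagorean tree."""
    theta = math.radians(theta_deg)
    return math.cos(theta), math.sin(theta)

def leonardo_sum(scales, alpha: float = 2.0) -> float:
    """Sum of child scales raised to alpha.

    A value of 1.0 means the rule d_parent^alpha = sum(d_child^alpha) holds exactly
    (alpha = 2 is the area-preserving form; other values are a hypothetical knob).
    """
    return sum(s ** alpha for s in scales)

for angle in (30.0, 45.0, 60.0):
    left, right = child_scales(angle)
    print(angle, round(left, 3), round(right, 3), round(leonardo_sum((left, right)), 6))
```

Because cos^2 + sin^2 = 1, the classic construction satisfies the area-preserving form for every branching angle. Child scales whose squared sum is above or below 1.0 would respectively overshoot or undershoot the rule.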
Language Models as Causal Effect Generators
2411.08019v1 by Lucius E. J. Bynum, Kyunghyun Cho
We present a framework for large language model (LLM) based data generation with controllable causal structure. In particular, we define a procedure for turning any language model and any directed acyclic graph (DAG) into a sequence-driven structural causal model (SD-SCM). Broadly speaking, an SD-SCM is a causal model with user-defined structure and LLM-defined structural equations. We characterize how an SD-SCM allows sampling from observational, interventional, and counterfactual distributions according to the desired causal structure. We then leverage this procedure to propose a new type of benchmark for causal inference methods, generating individual-level counterfactual data without needing to manually specify functional relationships between variables. We create an example benchmark consisting of thousands of datasets, and test a suite of popular estimation methods on these datasets for average, conditional average, and individual treatment effect estimation, both with and without hidden confounding. Apart from generating data, the same procedure also allows us to test for the presence of a causal effect that might be encoded in an LLM. This procedure can underpin auditing LLMs for misinformation, discrimination, or otherwise undesirable behavior. We believe SD-SCMs can serve as a useful tool in any application that would benefit from sequential data with controllable causal structure.
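One plausible reading of the SD-SCM idea is sketched below: variables are generated one at a time in a topological order of the user-defined DAG, each conditioned on its parents' already-generated values, and an intervention simply overrides the generated value. Everything here is an assumption made for illustration: `toy_lm_sample` is a random stand-in for a real language model call, and the prompt format, variable names, and domains are hypothetical. Counterfactual sampling is not covered.

```python
import random
from graphlib import TopologicalSorter

def toy_lm_sample(prompt: str, choices, rng):
    """Stand-in for an LLM: picks a choice at random; a real model would condition on the prompt."""
    return rng.choice(choices)

def sample_sd_scm(parents, domains, interventions=None, seed=0):
    """Sample every variable in topological order; interventions override the sampler."""
    interventions = interventions or {}
    rng = random.Random(seed)
    values = {}
    # `parents` maps each variable to its predecessors, which is the input
    # format graphlib.TopologicalSorter expects.
    for var in TopologicalSorter(parents).static_order():
        if var in interventions:
            values[var] = interventions[var]  # do(var = value)
            continue
        # Assumed prompt format: parent assignments followed by the variable to fill in.
        prompt = "; ".join(f"{p}={values[p]}" for p in parents[var]) + f"; {var}="
        values[var] = toy_lm_sample(prompt, domains[var], rng)
    return values

parents = {"age": [], "treatment": ["age"], "outcome": ["age", "treatment"]}
domains = {"age": ["young", "old"], "treatment": ["yes", "no"], "outcome": ["good", "poor"]}
print(sample_sd_scm(parents, domains))                                      # observational
print(sample_sd_scm(parents, domains, interventions={"treatment": "yes"}))  # interventional
```

The DAG fixes which variables may influence which, while the sampler plays the role of the structural equations. That matches the abstract's description of user-defined structure with LLM-defined equations.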
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
2411.08017v1 by Aditya Sanghi, Aliasghar Khani, Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani
Large-scale 3D generative models require substantial computational resources yet often fall short in capturing fine details and complex geometries at high resolutions. We attribute this limitation to the inefficiency of current representations, which lack the compactness required to model the generative models effectively. To address this, we introduce a novel approach called Wavelet Latent Diffusion, or WaLa, that encodes 3D shapes into wavelet-based, compact latent encodings. Specifically, we compress a $256^3$ signed distance field into a $12^3 \times 4$ latent grid, achieving an impressive 2427x compression ratio with minimal loss of detail. This high level of compression allows our method to efficiently train large-scale generative networks without increasing the inference time. Our models, both conditional and unconditional, contain approximately one billion parameters and successfully generate high-quality 3D shapes at $256^3$ resolution. Moreover, WaLa offers rapid inference, producing shapes within two to four seconds depending on the condition, despite the model's scale. We demonstrate state-of-the-art performance across multiple datasets, with significant improvements in generation quality, diversity, and computational efficiency. We open-source our code and, to the best of our knowledge, release the largest pretrained 3D generative models across different modalities.
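The compression ratio quoted in the WaLa abstract follows from the two grid sizes it gives, assuming the ratio is counted as stored values per grid (one value per cell). The variable names below are illustrative.

```python
# Element counts taken from the abstract; one stored value per cell is assumed.
sdf_values = 256 ** 3        # 256^3 signed distance field
latent_values = 12 ** 3 * 4  # 12^3 x 4 latent grid

compression_ratio = sdf_values / latent_values
print(f"{sdf_values} / {latent_values} = {compression_ratio:.1f}x")
```

This gives 16777216 / 6912, or roughly 2427x, which matches the figure in the abstract.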
Investigating the Effectiveness of Explainability Methods in Parkinson's Detection from Speech
2411.08013v1 by Eleonora Mancini, Francesco Paissan, Paolo Torroni, Cem Subakan, Mirco Ravanelli
Speech impairments in Parkinson's disease (PD) provide significant early indicators for diagnosis. While models for speech-based PD detection have shown strong performance, their interpretability remains underexplored. This study systematically evaluates several explainability methods to identify PD-specific speech features, aiming to support the development of accurate, interpretable models for clinical decision-making in PD diagnosis and monitoring. Our methodology involves (i) obtaining attributions and saliency maps using mainstream interpretability techniques, (ii) quantitatively evaluating the faithfulness of these maps and their combinations obtained via union and intersection through a range of established metrics, and (iii) assessing the information conveyed by the saliency maps for PD detection from an auxiliary classifier. Our results reveal that, while explanations are aligned with the classifier, they often fail to provide valuable information for domain experts.
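Step (ii) of the methodology combines saliency maps by union and intersection. A minimal sketch of that combination is given below, assuming each map is first binarized with a fixed threshold. The threshold, the function names, and the toy values are all hypothetical, and the paper may combine the maps differently.

```python
def binarize(saliency, threshold=0.5):
    """Turn a 2D saliency map (values in [0, 1]) into a boolean mask; the threshold is an assumption."""
    return [[v >= threshold for v in row] for row in saliency]

def combine(mask_a, mask_b, mode):
    """Elementwise union or intersection of two same-shaped boolean masks."""
    if mode == "union":
        op = lambda a, b: a or b
    elif mode == "intersection":
        op = lambda a, b: a and b
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [[op(a, b) for a, b in zip(row_a, row_b)] for row_a, row_b in zip(mask_a, mask_b)]

# Toy 2x2 saliency maps from two hypothetical explainability methods.
map_a = [[0.9, 0.2], [0.6, 0.1]]
map_b = [[0.8, 0.7], [0.3, 0.0]]
a, b = binarize(map_a), binarize(map_b)
print(combine(a, b, "union"))
print(combine(a, b, "intersection"))
```

The union keeps every region that either method highlights. The intersection keeps only regions both methods agree on, so it is never larger than either input mask.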
ExpressivityArena: Can LLMs Express Information Implicitly?
2411.08010v1 by Joshua Tint, Som Sagar, Aditya Taparia, Kelly Raines, Bimsara Pathiraja, Caleb Liu, Ransalu Senanayake
While Large Language Models (LLMs) have demonstrated remarkable performance in certain dimensions, their ability to express implicit language cues that humans use for effective communication remains unclear. This paper presents ExpressivityArena, a Python library for measuring the implicit communication abilities of LLMs. We provide a comprehensive framework to evaluate expressivity of arbitrary LLMs and explore its practical implications. To this end, we refine the definition and measurements of "expressivity," and use our framework in a set of small experiments. These experiments test LLMs in creative and logical tasks such as poetry, coding, and emotion-based responses. They are then evaluated by an automated grader, through ExpressivityArena, which we verify to be the most pragmatic for testing expressivity. Building on these experiments, we deepen our understanding of the expressivity of LLMs by assessing their ability to remain expressive in conversations. Our findings indicate that LLMs are capable of generating and understanding expressive content, albeit with some limitations. These insights will inform the future development and deployment of expressive LLMs. We provide the code for ExpressivityArena alongside our paper.
Can adversarial attacks by large language models be attributed?
2411.08003v1 by Manuel Cebrian, Jan Arne Telle
Attributing outputs from Large Language Models (LLMs) in adversarial settings, such as cyberattacks and disinformation, presents significant challenges that are likely to grow in importance. We investigate this attribution problem using formal language theory, specifically language identification in the limit as introduced by Gold and extended by Angluin. By modeling LLM outputs as formal languages, we analyze whether finite text samples can uniquely pinpoint the originating model. Our results show that due to the non-identifiability of certain language classes, under some mild assumptions about overlapping outputs from fine-tuned models it is theoretically impossible to attribute outputs to specific LLMs with certainty. This holds also when accounting for expressivity limitations of Transformer architectures. Even with direct model access or comprehensive monitoring, significant computational hurdles impede attribution efforts. These findings highlight an urgent need for proactive measures to mitigate risks posed by adversarial LLM use as their influence continues to expand.
Derivational Morphology Reveals Analogical Generalization in Large Language Models
2411.07990v1 by Valentin Hofmann, Leonie Weissweiler, David Mortensen, Hinrich Schütze, Janet Pierrehumbert
What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the language skills of LLMs resemble rules. As of yet, it is not known whether linguistic generalization in LLMs could equally well be explained as the result of analogical processes, which can be formalized as similarity operations on stored exemplars. A key shortcoming of prior research is its focus on linguistic phenomena with a high degree of regularity, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a new method for investigating linguistic generalization in LLMs: focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding underlying mechanisms. As expected, rule-based and analogical models explain the predictions of GPT-J equally well for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J's behavior is sensitive to the individual word frequencies, even for regular forms, a behavior that is consistent with an analogical account of regular forms but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization on adjective nominalization involves rules, suggesting similarity operations on stored exemplars as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.
摘要:大型語言模型(LLM)中語言概括化的底層機制是什麼?這個問題引起了相當大的關注,大多數研究分析了 LLM 的語言技能與規則的相似程度。到目前為止,我們還不知道 LLM 中的語言概括化是否可以同樣解釋為類比過程的結果,類比過程可以形式化為儲存範例的相似性運算。先前研究的一個主要缺點是其重點在於高度規律性的語言現象,對於這種現象,基於規則和類比的方法會做出相同的預測。在這裡,我們改為檢驗派生形態,特別是英語形容詞名詞化,它顯示出顯著的可變性。我們引入了一種新的方法來研究 LLM 中的語言概括化:專注於 GPT-J,我們將實例化基於規則和類比學習的認知模型套用到 LLM 訓練資料,並將其預測與 LLM 在一組新造形容詞上進行比較,讓我們能夠對底層機制得出直接結論。正如預期的那樣,對於具有規則名詞化模式的形容詞,基於規則和類比的模型對 GPT-J 的預測解釋得一樣好。然而,對於具有可變名詞化模式的形容詞,類比模型提供了更好的匹配。此外,GPT-J 的行為對個別字詞頻率很敏感,即使是規則形式也是如此,這種行為與類比規則的說明一致,但與基於規則的說明不一致。這些發現駁斥了 GPT-J 在形容詞名詞化上的語言概括化涉及規則的假設,表明對儲存範例的相似性運算才是底層機制。總體而言,我們的研究表明,類比過程在 LLM 的語言概括化中所扮演的角色比先前想像的更大。
Gini Coefficient as a Unified Metric for Evaluating Many-versus-Many Similarity in Vector Spaces
2411.07983v1 by Ben Fauber
We demonstrate that Gini coefficients can be used as unified metrics to evaluate many-versus-many (all-to-all) similarity in vector spaces. Our analysis of various image datasets shows that images with the highest Gini coefficients tend to be the most similar to one another, while images with the lowest Gini coefficients are the least similar. We also show that this relationship holds true for vectorized text embeddings from various corpuses, highlighting the consistency of our method and its broad applicability across different types of data. Additionally, we demonstrate that selecting machine learning training samples that closely match the distribution of the testing dataset is far more important than ensuring data diversity. Selection of exemplary and iconic training samples with higher Gini coefficients leads to significantly better model performance compared to simply having a diverse training set with lower Gini coefficients. Thus, Gini coefficients can serve as effective criteria for selecting machine learning training samples, with our selection method outperforming random sampling methods in very sparse information settings.
摘要:我們證明基尼係數可用作統一指標,用於評估向量空間中多對多(全對全)相似性。我們對各種影像資料集的分析顯示,具有最高基尼係數的影像往往彼此最相似,而具有最低基尼係數的影像最不相似。我們也顯示此關係對於來自各種語料庫的向量化文字嵌入式資料也成立,突顯我們方法的一致性及其在不同類型資料間的廣泛適用性。此外,我們證明選擇與測試資料集分佈密切匹配的機器學習訓練樣本,比確保資料多樣性重要得多。選擇具有較高基尼係數的範例性和標誌性訓練樣本,與僅有具有較低基尼係數的多樣化訓練集相比,會產生顯著更好的模型效能。因此,基尼係數可用作選擇機器學習訓練樣本的有效準則,我們的選擇方法在非常稀疏的資訊設定中優於隨機抽樣方法。
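The abstract does not say how a per-sample Gini coefficient is derived from the vectors, so the following is only a minimal sketch under stated assumptions: `gini` is the standard rank-based Gini formula, while `cosine` and `gini_per_item` (hypothetical helpers, not taken from the paper) apply it to each item's cosine similarities against all other items, with negative similarities clipped to zero so the inputs are non-negative.

```python
import math


def gini(values):
    """Gini coefficient of non-negative values.

    0.0 means perfectly even values; (n - 1) / n is the maximum for n values.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gini_per_item(vectors):
    """Hypothetical helper: Gini of each item's similarities to every other item.

    Assumption: negative cosine similarities are clipped to 0 before scoring.
    """
    scores = []
    for i, u in enumerate(vectors):
        sims = [max(cosine(u, v), 0.0) for j, v in enumerate(vectors) if j != i]
        scores.append(gini(sims))
    return scores
```

Calling `gini_per_item(embeddings)` on a list of embedding vectors returns one score per item, which could then be sorted to pick candidate training samples. Whether this matches the paper's selection procedure would need to be checked against the paper itself.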
Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization
2411.07979v1 by Davide Buffelli, Jamie McGowan, Wangkun Xu, Alexandru Cioba, Da-shan Shiu, Guillaume Hennequin, Alberto Bernacchia
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications, often yielding faster progress per iteration on the training loss compared to first-order optimizers. However, the generalization properties of second-order methods are still being debated. Theoretical investigations have proved difficult to carry out outside the tractable settings of heavily simplified model classes; thus, the relevance of existing theories to practical deep learning applications remains unclear. Similarly, empirical studies in large-scale models and real datasets are significantly confounded by the necessity to approximate second-order updates in practice. It is often unclear whether the observed generalization behaviour arises specifically from the second-order nature of the parameter updates, or instead reflects the specific structured (e.g. Kronecker) approximations used or any damping-based interpolation towards first-order updates. Here, we show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep reversible architectures that are sufficiently expressive to be meaningfully applied to common benchmark datasets. We exploit this novel setting to study the training and generalization properties of the GN optimizer. We find that exact GN generalizes poorly. In the mini-batch training setting, this manifests as rapidly saturating progress even on the *training* loss, with parameter updates found to overfit each mini-batch without producing the features that would support generalization to other mini-batches. We show that our experiments run in the "lazy" regime, in which the neural tangent kernel (NTK) changes very little during the course of training. This behaviour is associated with having no significant changes in neural representations, explaining the lack of generalization.
摘要:
DINO-LG: A Task-Specific DINO Model for Coronary Calcium Scoring
2411.07976v1 by Mahmut S. Gokmen, Cody Bumgardner, Caner Ozcan
Coronary artery disease (CAD) is one of the most common causes of mortality in the world. Coronary artery calcium (CAC) scoring using computed tomography (CT) is key for risk assessment to prevent coronary disease. Previous studies on risk assessment and calcification detection in CT scans primarily use approaches based on the UNET architecture, frequently implemented on pre-built models. However, these models are limited by the availability of annotated CT scans containing CAC and suffer from imbalanced datasets, decreasing the performance of CAC segmentation and scoring. In this study, we extend this approach by incorporating the self-supervised learning (SSL) technique of DINO (self-distillation with no labels) to eliminate limitations of scarce annotated data in CT scans. The DINO model's ability to train without requiring CAC area annotations enhances its robustness in generating distinct features. The DINO model is trained to focus specifically on calcified areas by using labels, aiming to generate features that effectively capture and highlight key characteristics. The label-guided DINO (DINO-LG) enhances classification by distinguishing CT slices that contain calcification from those that do not, performing 57% better than the standard DINO model in this task. CAC scoring and segmentation tasks are performed by a basic U-NET architecture, fed specifically with CT slices containing calcified areas as identified by the DINO-LG model. This targeted identification performed by the DINO-LG model improves CAC segmentation performance by approximately 10% and significantly increases CAC scoring accuracy.
摘要:冠狀動脈疾病 (CAD) 是世界上最常見的死亡原因之一。使用電腦斷層掃描 (CT) 進行冠狀動脈鈣化 (CAC) 評分是預防冠狀動脈疾病風險評估的關鍵。先前關於風險評估和 CT 掃描中鈣化偵測的研究,主要使用基於 UNET 架構的方法,並經常在預建模型上實作。然而,這些模型受到標註 CT 掃描的可用性限制,且存在資料集不平衡的問題,降低了 CAC 分割和評分的效能。在本研究中,我們透過納入 DINO(無標籤自蒸餾)的自監督學習 (SSL) 技術來擴充此方法,以消除 CT 掃描中標註資料稀少的限制。DINO 模型無需 CAC 區域標註即可訓練的能力,增強了其產生不同特徵的穩健性。DINO 模型經過訓練,特別針對鈣化區域,使用標籤,目的是產生有效捕捉和突顯關鍵特徵的特徵。標籤引導的 DINO(DINO-LG)透過區分包含鈣化的 CT 切片和不包含鈣化的 CT 切片,增強了分類,在此任務中比標準 DINO 模型高出 57%。CAC 評分和分割任務是由一個基本的 U-NET 架構執行,特別輸入 DINO-LG 模型識別的包含鈣化區域的 CT 切片。DINO-LG 模型執行的這種目標識別,將 CAC 分割效能提升了約 10%,並顯著提高了 CAC 評分準確度。
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
2411.07975v1 by Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, Chong Ruan
We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
摘要:我們提出 JanusFlow,一個強大的框架,它統一了圖像理解和生成在單一模型中。JanusFlow 採用了一個極簡主義架構,它整合了自回歸語言模型與校正流,一種生成模型中的最先進方法。我們的關鍵發現證明了校正流可以在大型語言模型框架內直接進行訓練,消除了對複雜架構修改的需求。為了進一步提升我們統一模型的效能,我們採用了兩個關鍵策略:(i) 解耦理解和生成編碼器,以及 (ii) 在統一訓練期間對齊它們的表示。大量的實驗表明,JanusFlow 在各自領域中達到了與專業模型相當或更優異的效能,同時在標準基準測試中顯著優於現有的統一方法。這項工作代表了朝向更有效率且多功能的視覺語言模型邁出了一步。
From General to Specific: Utilizing General Hallucination to Automatically Measure the Role Relationship Fidelity for Specific Role-Play Agents
2411.07965v1 by Chuyi Kong, Ziyang Luo, Hongzhan Lin, Zhiyuan Fan, Yaxin Fan, Yuxi Sun, Jing Ma
The advanced role-playing capabilities of Large Language Models (LLMs) have paved the way for developing Role-Playing Agents (RPAs). However, existing benchmarks, such as HPD, which incorporates manually scored character relationships into the context for LLMs to sort coherence, and SocialBench, which uses specific profiles generated by LLMs in the context of multiple-choice tasks to assess character preferences, face limitations like poor generalizability, implicit and inaccurate judgments, and excessive context length. To address the above issues, we propose an automatic, scalable, and generalizable paradigm. Specifically, we construct a benchmark by extracting relations from a general knowledge graph and leverage RPA's inherent hallucination properties to prompt it to interact across roles, employing ChatGPT for stance detection and defining relationship hallucination along with three related metrics. Extensive experiments validate the effectiveness and stability of our metrics. Our findings further explore factors influencing these metrics and discuss the trade-off between relationship hallucination and factuality.
摘要:大型語言模型 (LLM) 的先進角色扮演能力已為開發角色扮演代理 (RPA) 鋪平道路。然而,現有的基準,例如 HPD(將手動評分的角色關係納入 LLM 的背景中以對連貫性進行排序),以及 SocialBench(在多選題任務的背景下使用 LLM 生成的特定個人資料來評估角色偏好)面臨著諸如通用性差、判斷含蓄且不準確以及背景長度過長等限制。為了解決上述問題,我們提出了一個自動、可擴充且可概括的範例。具體來說,我們通過從通用知識圖譜中提取關係來構建基準,並利用 RPA 固有的幻覺屬性提示它跨角色互動,採用 ChatGPT 進行立場檢測並定義關係幻覺以及三個相關指標。廣泛的實驗驗證了我們指標的有效性和穩定性。我們的研究結果進一步探討了影響這些指標的因素,並討論了關係幻覺和事實性之間的權衡。
Towards Low-bit Communication for Tensor Parallel LLM Inference
2411.07942v1 by Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush
Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be distributed across more devices, magnifying the communication cost. One way to approach this problem is with quantization, but current methods for LLMs tend to avoid quantizing the features that tensor parallelism needs to communicate. Taking advantage of consistent outliers in communicated features, we introduce a quantization method that reduces communicated values on average from 16 bits to 4.2 bits while preserving nearly all of the original performance. For instance, our method maintains around 98.0% and 99.5% of Gemma 2 27B's and Llama 2 13B's original performance, respectively, averaged across all tasks we evaluated on.
摘要:張量並行提供了增加伺服器大型語言模型 (LLM) 推論效率的有效方法,儘管增加了額外的通訊成本。然而,由於伺服器 LLM 持續擴大規模,它們需要分佈在更多裝置上,這會放大通訊成本。解決此問題的一種方法是量化,但 LLM 的當前方法傾向於避免量化張量並行需要通訊的功能。我們利用通訊功能中的一致異常值,引入一種量化方法,可將通訊值平均從 16 位元減少到 4.2 位元,同時保留幾乎所有原始效能。例如,我們的模型分別維持了 Gemma 2 27B 和 Llama 2 13B 的約 98.0% 和 99.5% 原始效能,平均在我們評估的所有任務中。
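As a rough illustration of the idea of quantizing communicated features while sparing consistent outliers, the sketch below keeps a fixed set of outlier positions at full precision and applies uniform min-max quantization to everything else. All names (`quantize_with_outliers`, `dequantize`, `average_bits`) and the min-max scheme are assumptions made for this example; the abstract does not specify the paper's actual quantizer or how outliers are selected.

```python
def quantize_with_outliers(values, outlier_idx, bits=4):
    """Quantize every entry to `bits` bits except those at `outlier_idx`.

    Assumption: outlier positions are known in advance (e.g. from calibration)
    and are transmitted unchanged at higher precision.
    """
    outliers = set(outlier_idx)
    rest = [v for i, v in enumerate(values) if i not in outliers]
    lo, hi = (min(rest), max(rest)) if rest else (0.0, 0.0)
    levels = (1 << bits) - 1  # e.g. 15 steps for 4 bits
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Integer codes in [0, levels] for the non-outlier entries.
    codes = {i: round((v - lo) / scale)
             for i, v in enumerate(values) if i not in outliers}
    kept = {i: values[i] for i in outliers}
    return codes, kept, lo, scale


def dequantize(n, codes, kept, lo, scale):
    """Rebuild a length-n vector from low-bit codes and preserved outliers."""
    return [kept[i] if i in kept else lo + codes[i] * scale for i in range(n)]


def average_bits(n, n_outliers, low_bits=4, high_bits=16):
    """Mean bits per value when n_outliers of n values stay at high_bits.

    Ignores the small overhead of sending `lo` and `scale`.
    """
    return (n_outliers * high_bits + (n - n_outliers) * low_bits) / n
```

Under the assumption that only 16-bit and 4-bit values are mixed, `average_bits(60, 1)` gives 4.2, that is, one value in 60 kept at 16 bits. Whether the paper's 4.2-bit average arises from this particular mix is not stated in the abstract.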
DuoLift-GAN: Reconstructing CT from Single-view and Biplanar X-Rays with Generative Adversarial Networks
2411.07941v1 by Zhaoxi Zhang, Yueliang Ying
Computed tomography (CT) provides highly detailed three-dimensional (3D) medical images but is costly, time-consuming, and often inaccessible in intraoperative settings (Organization et al. 2011). Recent advancements have explored reconstructing 3D chest volumes from sparse 2D X-rays, such as single-view or orthogonal double-view images. However, current models tend to process 2D images in a planar manner, prioritizing visual realism over structural accuracy. In this work, we introduce DuoLift Generative Adversarial Networks (DuoLift-GAN), a novel architecture with dual branches that independently elevate 2D images and their features into 3D representations. These 3D outputs are merged into a unified 3D feature map and decoded into a complete 3D chest volume, enabling richer 3D information capture. We also present a masked loss function that directs reconstruction towards critical anatomical regions, improving structural accuracy and visual quality. This paper demonstrates that DuoLift-GAN significantly enhances reconstruction accuracy while achieving superior visual realism compared to existing methods.
摘要:電腦斷層掃描 (CT) 能提供高度詳細的三維 (3D) 醫學影像,但昂貴、耗時且在術中環境中通常無法取得 (Organization et al. 2011)。最近的進展探索從稀疏的 2D X 光重建 3D 胸部體積,例如單視圖或正交雙視圖影像。然而,目前的模型傾向於以平面方式處理 2D 影像,優先考慮視覺真實性而非結構準確性。在這項工作中,我們介紹了 DuoLift 生成對抗網路 (DuoLift-GAN),一種具有雙分支的新穎架構,可獨立地將 2D 影像及其特徵提升到 3D 表現形式。這些 3D 輸出會合併成一個統一的 3D 特徵圖,並解碼成一個完整的 3D 胸部體積,從而能夠擷取更豐富的 3D 資訊。我們也提出了一個遮罩損失函數,將重建導向關鍵解剖區域,改善結構準確性和視覺品質。這篇論文證明了 DuoLift-GAN 與現有方法相比,顯著提升了重建準確性,同時達到了卓越的視覺真實性。
Automatic dataset shift identification to support root cause analysis of AI performance drift
2411.07940v1 by Mélanie Roschewitz, Raghav Mehta, Charles Jones, Ben Glocker
Shifts in data distribution can substantially harm the performance of clinical AI models. Hence, various methods have been developed to detect the presence of such shifts at deployment time. However, root causes of dataset shifts are varied, and the choice of shift mitigation strategies is highly dependent on the precise type of shift encountered at test time. As such, detecting test-time dataset shift is not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift (caused by a change in the label distribution), covariate shift (caused by a change in input characteristics) and mixed shifts (simultaneous prevalence and covariate shifts). We discuss the importance of self-supervised encoders for detecting subtle covariate shifts and propose a novel shift detector leveraging both self-supervised encoders and task model outputs for improved shift detection. We report promising results for the proposed shift identification framework across three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shifts, using four large publicly available datasets.
摘要:資料分佈的轉變會嚴重損害臨床 AI 模型的效能。因此,已經開發出各種方法來偵測部署時發生的此類轉變。然而,資料集轉變的根本原因各不相同,而轉變緩解策略的選擇高度依賴於測試時遇到的轉變類型。因此,偵測測試時資料集轉變是不夠的:精確識別已發生的轉變類型至關重要。在這項工作中,我們提出了第一個無監督資料集轉變識別架構,有效區分發生率轉變(由標籤分佈的變化引起)、協變數轉變(由輸入特徵的變化引起)和混合轉變(同時發生率和協變數轉變)。我們討論了自監督編碼器在偵測細微協變數轉變中的重要性,並提出了一種新穎的轉變偵測器,利用自監督編碼器和任務模型輸出,以改善轉變偵測。我們針對三個不同的影像模式(胸部 X 光、數位乳房攝影和視網膜眼底影像)報告了所提出的轉變識別架構的良好結果,使用四個大型公開可取得的資料集,針對五種類型的真實世界資料集轉變。
CryptoLLM: Unleashing the Power of Prompted LLMs for SmartQnA and Classification of Crypto Posts
2411.07917v1 by Aniket Deroy, Subhankar Maity
The rapid growth of social media has resulted in a large volume of user-generated content, particularly in niche domains such as cryptocurrency. This task focuses on developing robust classification models to accurately categorize cryptocurrency-related social media posts into predefined classes, including but not limited to objective, positive, and negative. Additionally, the task requires participants to identify the most relevant answers from a set of posts in response to specific questions. By leveraging advanced LLMs, this research aims to enhance the understanding and filtering of cryptocurrency discourse, thereby facilitating more informed decision-making in this volatile sector. We have used a prompt-based technique to solve the classification task for Reddit and Twitter posts. We have also used a 64-shot technique along with prompts on the GPT-4-Turbo model to determine whether an answer is relevant to a question.
摘要:社群媒體的快速成長產生了大量的使用者產製內容,特別是在加密貨幣等利基領域。此任務專注於開發穩健的分類模型,以準確地將與加密貨幣相關的社群媒體貼文分類為預定義的類別,包括但不限於客觀、正面、負面等。此外,此任務要求參與者從一組貼文中找出最相關的答案,以回應特定問題。透過利用先進的 LLM,此研究旨在增強對加密貨幣討論的理解和過濾,進而促進在這個波動的領域中做出更明智的決策。我們使用基於提示的技術來解決 Reddit 貼文和 Twitter 貼文的分類任務。此外,我們使用 64-shot 技術以及 GPT-4-Turbo 模型上的提示來確定答案是否與問題相關。
Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
2411.07892v1 by Benjamin Litterer, David Jurgens, Dallas Card
Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.
摘要:Podcast 透過獨特的隨選模式,為龐大的聽眾群提供高度多元的內容。然而,有限的資料阻礙了對 Podcast 生態系統進行大規模的運算分析。為了填補這個缺口,我們引進一個包含超過 110 萬個 Podcast 轉錄的龐大資料集,該資料集廣泛涵蓋了 2020 年 5 月和 6 月透過公開 RSS 饋送提供的全部英語 Podcast。此資料不僅限於文字,還包含 37 萬集子集的音訊特徵和發言者輪流發言,以及全部 110 萬集的發言者角色推論和其他元資料。使用此資料,我們也對此生態系統的內容、結構和回應性進行基礎調查。我們的資料和分析共同開啟了對這個廣受歡迎且影響力大的媒體持續進行運算研究的大門。
INTRABENCH: Interactive Radiological Benchmark
2411.07885v1 by Constantin Ulrich, Tassilo Wald, Emily Tempus, Maximilian Rokuss, Paul F. Jaeger, Klaus Maier-Hein
Current interactive segmentation approaches, inspired by the success of Meta's Segment Anything model, have achieved notable advancements; however, they come with substantial limitations that hinder their practical application in real clinical scenarios. These include unrealistic human interaction requirements, such as slice-by-slice operations for 2D models on 3D data, a lack of iterative refinement, and insufficient evaluation experiments. These shortcomings prevent accurate assessment of model performance and lead to inconsistent outcomes across studies. IntRaBench overcomes these challenges by offering a comprehensive and reproducible framework for evaluating interactive segmentation methods in realistic, clinically relevant scenarios. It includes diverse datasets, target structures, and segmentation models, and provides a flexible codebase that allows seamless integration of new models and prompting strategies. Additionally, we introduce advanced techniques to minimize clinician interaction, ensuring fair comparisons between 2D and 3D models. By open-sourcing IntRaBench, we invite the research community to integrate their models and prompting techniques, ensuring continuous and transparent evaluation of interactive segmentation models in 3D medical imaging.
摘要:目前互動式分割方法受到 META 的 Segment Anything 模型成功的啟發,已取得顯著進展,但它們仍有很大的限制,會阻礙它們在實際臨床場景中的應用。這些限制包括不切實際的人機互動需求,例如 3D 資料上的 2D 模型的逐層操作、缺乏反覆改進以及評估實驗不足。這些缺點會妨礙準確評估模型效能,並導致各項研究結果不一致。IntRaBench 克服了這些挑戰,提供了一個全面且可重現的架構,用於評估實際臨床相關場景中的互動式分割方法。它包含多元的資料集、目標結構和分割模型,並提供了一個彈性的程式碼庫,允許無縫整合新的模型和提示策略。此外,我們引進了先進技術來最小化臨床醫師的互動,確保 2D 和 3D 模型之間的公平比較。透過開放原始碼 IntRaBench,我們邀請研究社群整合他們的模型和提示技術,確保在 3D 醫學影像中持續且透明地評估互動式分割模型。
Diverse capability and scaling of diffusion and auto-regressive models when learning abstract rules
2411.07873v1 by Binxu Wang, Jiaqi Shang, Haim Sompolinsky
Humans excel at discovering regular structures from limited samples and applying inferred rules to novel settings. We investigate whether modern generative models can similarly learn underlying rules from finite samples and perform reasoning through conditional sampling. Inspired by the Raven's Progressive Matrices task, we designed the GenRAVEN dataset, where each sample consists of three rows, and one of 40 relational rules governing the object position, number, or attributes applies to all rows. We trained generative models to learn the data distribution, where samples are encoded as integer arrays to focus on rule learning. We compared two generative model families: diffusion (EDM, DiT, SiT) and autoregressive models (GPT2, Mamba). We evaluated their ability to generate structurally consistent samples and perform panel completion via unconditional and conditional sampling. We found that diffusion models excel at unconditional generation, producing more novel and consistent samples from scratch and memorizing less, but perform less well in panel completion, even with advanced conditional sampling methods. Conversely, autoregressive models excel at completing missing panels in a rule-consistent manner but generate less consistent samples unconditionally. We observe diverse data scaling behaviors: for both model families, rule learning emerges at a certain dataset size, on the order of thousands of examples per rule. With more training data, diffusion models improve both their unconditional and conditional generation capabilities. However, for autoregressive models, while panel completion improves with more training data, unconditional generation consistency declines. Our findings highlight complementary capabilities and limitations of diffusion and autoregressive models in rule learning and reasoning tasks, suggesting avenues for further research into their mechanisms and potential for human-like reasoning.
摘要:人類擅長從有限的樣本中發現規則結構,並將推論出的規則應用於新的設定。我們探討現代生成模型是否能以類似的方式從有限樣本中學習基礎規則,並透過條件取樣進行推理。在 Raven's Progressive Matrices 任務的啟發下,我們設計了 GenRAVEN 資料集,每個樣本包含三行,且 40 個關係規則中的其中一個適用於所有行的物件位置、數量或屬性。我們訓練生成模型學習資料分佈,其中樣本編碼為整數陣列,以專注於規則學習。我們比較了兩個生成模型家族:擴散(EDM、DiT、SiT)和自迴歸模型(GPT2、Mamba)。我們評估了它們產生結構一致樣本和透過無條件和條件取樣完成面板的能力。我們發現擴散模型在無條件產生方面表現出色,從頭開始產生更多新穎且一致的樣本,且記憶力較差,但在面板完成方面表現較差,即使使用進階條件取樣方法也是如此。相反地,自迴歸模型擅長以規則一致的方式完成遺失的面板,但無條件產生的一致性較差。我們觀察到不同的資料擴充行為:對於這兩個模型家族,規則學習出現在某個資料集大小時 - 每個規則約 1000 個範例。隨著更多訓練資料,擴散模型改善了它們的無條件和條件產生能力。然而,對於自迴歸模型,雖然面板完成隨著更多訓練資料而改善,但無條件產生的一致性卻下降。我們的發現突出了擴散和自迴歸模型在規則學習和推理任務中的互補能力和限制,並提出了進一步研究它們的機制和人類推理潛力的途徑。
Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer's Disease
2411.07871v1 by Francesco Chiumento, Mingming Liu
The rapid advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have shown great potential in medical diagnostics, particularly in radiology, where datasets such as X-rays are paired with human-generated diagnostic reports. However, a significant research gap exists in the neuroimaging field, especially for conditions such as Alzheimer's disease, due to the lack of comprehensive diagnostic reports that can be utilized for model fine-tuning. This paper addresses this gap by generating synthetic diagnostic reports using GPT-4o-mini on structured data from the OASIS-4 dataset, which comprises 663 patients. Using the synthetic reports as ground truth for training and validation, we then generated neurological reports directly from the images in the dataset leveraging the pre-trained BiomedCLIP and T5 models. Our proposed method achieved a BLEU-4 score of 0.1827, ROUGE-L score of 0.3719, and METEOR score of 0.4163, revealing its potential in generating clinically relevant and accurate diagnostic reports.
摘要:大型語言模型 (LLM) 和視覺語言模型 (VLM) 的快速進展在醫學診斷中展現了巨大的潛力,特別是在放射學中,其中 X 射線等數據集與人類產生的診斷報告配對。然而,神經影像領域存在著顯著的研究差距,特別是對於阿茲海默症等疾病,因為缺乏可供模型微調使用的全面診斷報告。本文通過使用 GPT-4o-mini 在來自 OASIS-4 數據集的結構化數據上生成合成診斷報告來解決這一差距,該數據集包含 663 名患者。使用合成報告作為訓練和驗證的真實數據,然後我們直接從數據集中的圖像中生成神經報告,利用預先訓練的 BiomedCLIP 和 T5 模型。我們提出的方法實現了 BLEU-4 分數為 0.1827、ROUGE-L 分數為 0.3719 和 METEOR 分數為 0.4163,揭示了其生成臨床相關且準確的診斷報告的潛力。
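To make the reported ROUGE-L figure easier to interpret, here is a minimal sketch of a ROUGE-L style score: the F1 of precision and recall computed from the longest common subsequence (LCS) of tokens. It assumes lowercase whitespace tokenization and equal weighting of precision and recall; published implementations differ in tokenization, stemming, and weighting, so this will not necessarily reproduce the paper's numbers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L style F1 between a reference and a generated report.

    Assumption: lowercase whitespace tokens, balanced precision and recall.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A call such as `rouge_l_f1(synthetic_report, generated_report)` returns a value between 0 and 1, where 1 means the two token sequences are identical.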
Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders
2411.07870v1 by Xiaofeng Zhu, Jaya Krishna Mandivarapu
Although people are impressed by the content generation skills of large language models, the use of LLMs, such as ChatGPT, is limited by the domain grounding of the content. The correctness and groundedness of the generated content need to be based on a verified context, such as results from Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to a customized domain is that the generated responses are often incomplete, or the additions are not verified and may even be hallucinated. Prior studies on hallucination detection have focused on evaluation metrics, which are not easily adaptable to dynamic domains and can be vulnerable to attacks like jail-breaking. In this work, we propose 1) a post-processing algorithm that leverages knowledge triplets in RAG context to correct hallucinations and 2) a dual-decoder model that fuses RAG context to guide the generation process.
摘要:儘管人們對大型語言模型的內容生成技能印象深刻,但 ChatGPT 等 LLM 的使用受到內容的領域基礎的限制。生成的內容的正確性和基礎必須基於經過驗證的內容,例如檢索擴充生成 (RAG) 的結果。將 LLM 適應到自訂領域時的一個重要問題是,生成的回應通常不完整,或者新增內容未經驗證,甚至可能是幻覺。先前對幻覺偵測的研究集中在評估指標上,這些指標不易適應動態領域,且容易受到越獄等攻擊。在這項工作中,我們提出 1) 一種後處理演算法,利用 RAG 背景中的知識三元組來修正幻覺,以及 2) 一種雙解碼器模型,將 RAG 背景融合以引導生成過程。
Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models
2411.07858v1 by Yusen Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
When unsure about an answer, humans often respond with more words than necessary, hoping that part of the response will be correct. We observe a similar behavior in large language models (LLMs), which we term "Verbosity Compensation" (VC). VC is harmful because it confuses users' understanding, leading to low efficiency, and affects LLM services by increasing the latency and cost of generating useless tokens. In this paper, we present the first work that defines and analyzes Verbosity Compensation, explores its causes, and proposes a simple mitigating approach. We define Verbosity Compensation as the behavior of generating responses that can be compressed without information loss when prompted to write concisely. Our experiments, conducted on five datasets of knowledge and reasoning-based QA tasks with 14 newly developed LLMs, reveal three conclusions. 1) We reveal a pervasive presence of verbosity compensation across all models and all datasets. Notably, GPT-4 exhibits a VC frequency of 50.40%. 2) We reveal the large performance gap between verbose and concise responses, with a notable difference of 27.61% on the Qasper dataset. We also demonstrate that this difference does not naturally diminish as LLM capability increases. Both 1) and 2) highlight the urgent need to mitigate the frequency of VC behavior and disentangle verbosity from veracity. We propose a simple yet effective cascade algorithm that replaces verbose responses with responses generated by other models. The results show that our approach effectively reduces the VC of the Mistral model from 63.81% to 16.16% on the Qasper dataset. 3) We also find that verbose responses exhibit higher uncertainty across all five datasets, suggesting a strong connection between verbosity and model uncertainty. Our dataset and code are available at https://github.com/psunlpgroup/VerbosityLLM.
摘要:
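The cascade idea mentioned in the Verbosity Compensation abstract, replacing a verbose response with one from another model, can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: `is_verbose` is a hypothetical word-count detector with an arbitrary threshold, and each model is represented as a plain callable from question to answer.

```python
def is_verbose(response, max_words=3):
    """Hypothetical detector: flag answers longer than `max_words` words.

    Assumption: the threshold is arbitrary; the paper's criterion may differ.
    """
    return len(response.split()) > max_words


def cascade(question, models, detector=is_verbose):
    """Try each model in order and return the first non-verbose answer.

    `models` is a list of callables mapping a question string to a response
    string, ordered for example from cheapest to strongest (an assumption).
    If every model is verbose, the last response is returned.
    """
    response = ""
    for model in models:
        response = model(question)
        if not detector(response):
            return response
    return response
```

For example, `cascade("Who wrote Hamlet?", [small_model, large_model])` would only call `large_model` when `small_model` produces an answer the detector flags; both model names are placeholders.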
Tucano: Advancing Neural Text Generation for Portuguese
2411.07854v1 by Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah
Significant advances have been made in natural language processing in recent years. However, our current deep learning approach to language modeling requires substantial resources in terms of data and computation. One of the side effects of this data-hungry paradigm is the current schism between languages, separating those considered high-resource, where most of the development happens and resources are available, and the low-resource ones, which struggle to attain the same level of performance and autonomy. This study aims to introduce a new set of resources to stimulate the future development of neural text generation in Portuguese. In this work, we document the development of GigaVerbo, a concatenation of deduplicated Portuguese text corpora amounting to 200 billion tokens. Via this corpus, we trained a series of decoder-transformers named Tucano. Our models perform on par with or better than other Portuguese and multilingual language models of similar size in several Portuguese benchmarks. The evaluation of our models also reveals that model performance on many currently available benchmarks used by the Portuguese NLP community has little to no correlation with the scaling of token ingestion during training, highlighting the limitations of such evaluations when it comes to the assessment of Portuguese generative language models. All derivatives of our study are openly released on GitHub and Hugging Face. See https://nkluge-correa.github.io/Tucano/
摘要:近年來,自然語言處理領域取得重大進展。然而,我們目前對語言模型的深度學習方法在數據和計算方面需要大量資源。這種數據密集型範例的副作用之一是語言之間的當前分裂,將被視為高資源的語言(大多數開發和資源都在此發生)與低資源語言分開,後者難以達到相同的效能和自主性。本研究旨在引入一套新資源,以促進葡萄牙語神經文本生成的未來發展。在這項工作中,我們記錄了 GigaVerbo 的開發,它是去重葡萄牙語文本語料庫的串接,總計 2000 億個標記。透過此語料庫,我們訓練了一系列名為 Tucano 的解碼器轉換器。我們的模型在多個葡萄牙語基準中執行與其他類似大小的葡萄牙語和多語言語言模型相同或更佳。我們模型的評估還顯示,葡萄牙語 NLP 社群目前使用的許多現有基準上的模型效能與訓練期間標記擷取的調整幾乎沒有相關性,這突顯了此類評估在評估葡萄牙語生成語言模型方面的限制。我們研究的所有衍生品都在 GitHub 和 Hugging Face 上公開發布。請參閱 https://nkluge-correa.github.io/Tucano/
IAE: Irony-based Adversarial Examples for Sentiment Analysis Systems
2411.07850v1 by Xiaoyin Yi, Jiacheng Huang
Adversarial examples, which are inputs deliberately perturbed with imperceptible changes to induce model errors, have raised serious concerns for the reliability and security of deep neural networks (DNNs). While adversarial attacks have been extensively studied in continuous data domains such as images, the discrete nature of text presents unique challenges. In this paper, we propose Irony-based Adversarial Examples (IAE), a method that transforms straightforward sentences into ironic ones to create adversarial text. This approach exploits the rhetorical device of irony, where the intended meaning is opposite to the literal interpretation, requiring a deeper understanding of context to detect. The IAE method is particularly challenging due to the need to accurately locate evaluation words, substitute them with appropriate collocations, and expand the text with suitable ironic elements while maintaining semantic coherence. Our research makes the following key contributions: (1) We introduce IAE, a strategy for generating textual adversarial examples using irony. This method does not rely on pre-existing irony corpora, making it a versatile tool for creating adversarial text in various NLP tasks. (2) We demonstrate that the performance of several state-of-the-art deep learning models on sentiment analysis tasks significantly deteriorates when subjected to IAE attacks. This finding underscores the susceptibility of current NLP systems to adversarial manipulation through irony. (3) We compare the impact of IAE on human judgment versus NLP systems, revealing that humans are less susceptible to the effects of irony in text.
Ethical Concern Identification in NLP: A Corpus of ACL Anthology Ethics Statements
2411.07845v1 by Antonia Karamolegkou, Sandrine Schiller Hansen, Ariadni Christopoulou, Filippos Stamatiou, Anne Lauscher, Anders Søgaard
What ethical concerns, if any, do LLM researchers have? We introduce EthiCon, a corpus of 1,580 ethical concern statements extracted from scientific papers published in the ACL Anthology. We extract ethical concern keywords from the statements and show promising results in automating the concern identification process. Through a survey, we compare the ethical concerns of the corpus to the concerns listed by the general public and professionals in the field. Finally, we compare our retrieved ethical concerns with existing taxonomies pointing to gaps and future research directions.
Chain Association-based Attacking and Shielding Natural Language Processing Systems
2411.07843v1 by Jiacheng Huang, Long Chen
Association is a gift that lets people avoid stating something in completely straightforward words while still allowing others to understand what they intend to refer to. In this paper, we propose a chain association-based adversarial attack against natural language processing systems, utilizing the comprehension gap between humans and machines. We first generate a chain association graph for Chinese characters based on the association paradigm to build the search space of potential adversarial examples. Then, we introduce a discrete particle swarm optimization algorithm to search for the optimal adversarial examples. We conduct comprehensive experiments and show that advanced natural language processing models and applications, including large language models, are vulnerable to our attack, while humans appear good at understanding the perturbed text. We also explore two methods, adversarial training and associative graph-based recovery, to shield systems from chain association-based attacks. Because a few examples use derogatory terms, this paper contains material that may be offensive or upsetting to some people.
Federated Learning for Discrete Optimal Transport with Large Population under Incomplete Information
2411.07841v1 by Navpreet Kaur, Juntao Chen, Yingdong Lu
Optimal transport is a powerful framework for the efficient allocation of resources between sources and targets. However, traditional models often struggle to scale effectively in the presence of large and heterogeneous populations. In this work, we introduce a discrete optimal transport framework designed to handle large-scale, heterogeneous target populations, characterized by type distributions. We address two scenarios: one where the type distribution of targets is known, and one where it is unknown. For the known distribution, we propose a fully distributed algorithm to achieve optimal resource allocation. In the case of unknown distribution, we develop a federated learning-based approach that enables efficient computation of the optimal transport scheme while preserving privacy. Case studies are provided to evaluate the performance of our learning algorithm.
Efficient Federated Finetuning of Tiny Transformers with Resource-Constrained Devices
2411.07826v1 by Kilian Pfeiffer, Mohamed Aboelenien Ahmed, Ramin Khalili, Jörg Henkel
In recent years, Large Language Models (LLMs) through Transformer structures have dominated many machine learning tasks, especially text processing. However, these models require massive amounts of data for training and induce high resource requirements, particularly in terms of the large number of Floating Point Operations (FLOPs) and the high amounts of memory needed. To fine-tune such a model in a parameter-efficient way, techniques like Adapter or LoRA have been developed. However, we observe that the application of LoRA, when used in federated learning (FL), while still being parameter-efficient, is memory and FLOP inefficient. Based on that observation, we develop a novel layer finetuning scheme that allows devices in cross-device FL to make use of pretrained neural networks (NNs) while adhering to given resource constraints. We show that our presented scheme outperforms the current state of the art when dealing with homogeneous or heterogeneous computation and memory constraints and is on par with LoRA regarding limited communication, thereby achieving significantly higher accuracies in FL training.
Query Optimization for Parametric Knowledge Refinement in Retrieval-Augmented Large Language Models
2411.07820v1 by Youan Cong, Cheng Wang, Pritom Saha Akash, Kevin Chen-Chuan Chang
We introduce the Extract-Refine-Retrieve-Read (ERRR) framework, a novel approach designed to bridge the pre-retrieval information gap in Retrieval-Augmented Generation (RAG) systems through query optimization tailored to meet the specific knowledge requirements of Large Language Models (LLMs). Unlike conventional query optimization techniques used in RAG, the ERRR framework begins by extracting parametric knowledge from LLMs, followed by using a specialized query optimizer for refining these queries. This process ensures the retrieval of only the most pertinent information essential for generating accurate responses. Moreover, to enhance flexibility and reduce computational costs, we propose a trainable scheme for our pipeline that utilizes a smaller, tunable model as the query optimizer, which is refined through knowledge distillation from a larger teacher model. Our evaluations on various question-answering (QA) datasets and with different retrieval systems show that ERRR consistently outperforms existing baselines, proving to be a versatile and cost-effective module for improving the utility and accuracy of RAG systems.
PatchCTG: Patch Cardiotocography Transformer for Antepartum Fetal Health Monitoring
2411.07796v1 by M. Jaleed Khan, Manu Vatish, Gabriel Davis Jones
Antepartum Cardiotocography (CTG) is vital for fetal health monitoring, but traditional methods like the Dawes-Redman system are often limited by high inter-observer variability, leading to inconsistent interpretations and potential misdiagnoses. This paper introduces PatchCTG, a transformer-based model specifically designed for CTG analysis, employing patch-based tokenisation, instance normalisation and channel-independent processing to capture essential local and global temporal dependencies within CTG signals. PatchCTG was evaluated on the Oxford Maternity (OXMAT) dataset, comprising over 20,000 CTG traces across diverse clinical outcomes after applying the inclusion and exclusion criteria. With extensive hyperparameter optimisation, PatchCTG achieved an AUC of 77%, with specificity of 88% and sensitivity of 57% at Youden's index threshold, demonstrating adaptability to various clinical needs. Testing across varying temporal thresholds showed robust predictive performance, particularly with finetuning on data closer to delivery, achieving a sensitivity of 52% and specificity of 88% for near-delivery cases. These findings suggest the potential of PatchCTG to enhance clinical decision-making in antepartum care by providing a reliable, objective tool for fetal health assessment. The source code is available at https://github.com/jaleedkhan/PatchCTG.
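The PatchCTG abstract reports sensitivity and specificity "at Youden's index threshold". As a minimal sketch of what that selection means, the snippet below picks the score threshold that maximizes J = sensitivity + specificity - 1. The function name `youden_threshold` and the toy scores and labels are assumptions made for illustration; they are not taken from the paper or its code.

```python
def youden_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximizing J = sens + spec - 1."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = None
    for t in sorted(set(scores)):
        # Predict positive when the score is at or above the candidate threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sens = tp / positives
        spec = tn / negatives
        j = sens + spec - 1
        # Strict comparison keeps the lowest threshold when several tie on J.
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


# Toy scores and labels, made up for illustration only.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, sens, spec = youden_threshold(scores, labels)
print(threshold, sens, spec)
```

Because J weights sensitivity and specificity equally, a clinical deployment that values one over the other would likely choose a different operating point.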
RedCode: Risky Code Execution and Generation Benchmark for Code Agents
2411.07781v1 by Chengquan Guo, Xun Liu, Chulin Xie, Andy Zhou, Yi Zeng, Zinan Lin, Dawn Song, Bo Li
With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations on the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess their execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at https://github.com/AI-secure/RedCode.
Likelihood as a Performance Gauge for Retrieval-Augmented Generation
2411.07773v1 by Tianyu Liu, Jirui Qi, Paul He, Arianna Bisazza, Mrinmaya Sachan, Ryan Cotterell
Recent work finds that retrieval-augmented generation with large language models is prone to be influenced by the order of retrieved documents in the context. However, the lack of in-depth analysis limits the use of this phenomenon for prompt engineering in practice. In this study, we posit that likelihoods serve as an effective gauge for language model performance. Through experiments on two question-answering datasets with a variety of state-of-the-art language models, we reveal correlations between answer accuracy and the likelihood of the question at both the corpus level and the instance level. In addition, we find that question likelihood can also indicate the position of the task-relevant information in the context. Based on these findings, we propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance. We demonstrate their effectiveness with experiments. In addition, our likelihood-based methods are efficient, as they only need to compute the likelihood of the input, requiring much fewer language model passes than heuristic prompt engineering methods that require generating responses. Our analysis deepens our understanding of how input prompts affect model performance and provides a promising direction for efficient prompt optimization.
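The idea of using question likelihood as a gauge for prompt selection can be sketched as follows. This is only an illustration under stated assumptions: `pick_prompt` is a hypothetical helper, the exhaustive search over document orderings is a simplification chosen here, and `toy_log_likelihood` is a made-up stand-in for a real language model's log-likelihood of the question given the context. None of these names come from the paper.

```python
from itertools import permutations


def pick_prompt(documents, question, question_log_likelihood):
    """Return the document ordering that gives the question the highest score."""
    best_order, best_score = None, float("-inf")
    for order in permutations(documents):
        context = "\n".join(order)
        score = question_log_likelihood(context, question)
        if score > best_score:
            best_order, best_score = order, score
    return list(best_order), best_score


def toy_log_likelihood(context, question):
    # Hypothetical stand-in for a language model: it simply rewards contexts
    # whose last line shares words with the question. A real system would
    # compute the model's log-probability of the question tokens instead.
    last_line = context.split("\n")[-1].lower().split()
    return float(len(set(last_line) & set(question.lower().split())))


documents = ["Paris is the capital of France.", "Bananas are yellow."]
question = "What is the capital of France?"
order, score = pick_prompt(documents, question, toy_log_likelihood)
print(order, score)
```

Scoring only needs a forward pass over the input, which is the efficiency argument the abstract makes against methods that must generate a response for every candidate prompt.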
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows
2411.07763v1 by Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, Tao Yu
Real-world enterprise text-to-SQL workflows often involve complex cloud or local data across various database systems, multiple SQL queries in various dialects, and diverse operations from data transformation to analytics. We introduce Spider 2.0, an evaluation framework comprising 632 real-world text-to-SQL workflow problems derived from enterprise-level database use cases. The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake. We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases. This challenge calls for models to interact with complex SQL workflow environments, process extremely long contexts, perform intricate reasoning, and generate multiple SQL queries with diverse operations, often exceeding 100 lines, which goes far beyond traditional text-to-SQL challenges. Our evaluations indicate that based on o1-preview, our code agent framework successfully solves only 17.0% of the tasks, compared with 91.2% on Spider 1.0 and 73.0% on BIRD. Our results on Spider 2.0 show that while language models have demonstrated remarkable performance in code generation -- especially in prior text-to-SQL benchmarks -- they require significant improvement in order to achieve adequate performance for real-world enterprise usage. Progress on Spider 2.0 represents crucial steps towards developing intelligent, autonomous, code agents for real-world enterprise settings. Our code, baseline models, and data are available at https://spider2-sql.github.io.
ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization
2411.07762v1 by Weibo Zhao, Yubin Shi, Xinyu Lyu, Wanchen Sui, Shen Li, Yong Li
Quantization stands as a pivotal technique for large language model (LLM) serving, yet it poses significant challenges, particularly in achieving effective low-bit quantization. The limited numerical mapping makes the quantized model produce a non-trivial error, causing intolerable performance degradation. This paper is anchored in the basic idea of model compression objectives, and delves into the layer-wise error distribution of LLMs during post-training quantization. Subsequently, we introduce ASER, an algorithm consisting of (1) Error Reconstruction: low-rank compensation for quantization error with LoRA-style matrices constructed by whitening SVD; (2) Activation Smoothing: outlier extraction to gain smooth activation and better error compensation. ASER is capable of quantizing typical LLMs to low-bit ones, preserving accuracy even in the W4A8 per-channel setup. Experimental results show that ASER is competitive among state-of-the-art quantization algorithms and shows potential for activation quantization, with minor overhead.
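The "low-rank compensation for quantization error" step can be illustrated with a small pure-Python sketch. It is a deliberate simplification, not ASER itself: it quantizes a toy weight matrix per row, then adds back a rank-1 approximation of the error obtained by plain power iteration. The paper's whitening SVD and activation smoothing are not reproduced, and the helper names (`quantize`, `rank1_factors`, `frobenius`) and the matrix values are assumptions made for this example.

```python
import math


def quantize(w, bits=4):
    """Symmetric per-row quantization using Python's round() (ties to even)."""
    levels = 2 ** (bits - 1) - 1
    out = []
    for row in w:
        scale = max(abs(x) for x in row) / levels or 1.0
        out.append([round(x / scale) * scale for x in row])
    return out


def rank1_factors(e, iters=50):
    """Return (a, v) so that e is approximated by the outer product of a and v."""
    rows, cols = len(e), len(e[0])
    v = [1.0] * cols
    for _ in range(iters):
        # Power iteration on e^T e converges toward the top right singular vector.
        u = [sum(e[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(e[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]
    a = [sum(e[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return a, v


def frobenius(a, b):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


# Toy weight matrix, made up for illustration only.
w = [[0.9, -0.31, 0.12, 0.55],
     [-0.42, 0.77, 0.05, -0.18],
     [0.26, 0.11, -0.63, 0.49]]
w_q = quantize(w, bits=4)
err = [[x - q for x, q in zip(rw, rq)] for rw, rq in zip(w, w_q)]
a, v = rank1_factors(err)
# Compensated weights: quantized matrix plus the rank-1 error correction.
w_hat = [[q + a[i] * v[j] for j, q in enumerate(row)] for i, row in enumerate(w_q)]
before = frobenius(w, w_q)
after = frobenius(w, w_hat)
print(before, after)
```

In a LoRA-style arrangement the two factors would be stored next to the quantized weights, so the correction costs one extra low-rank matrix product at inference time.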
Navigation with QPHIL: Quantizing Planner for Hierarchical Implicit Q-Learning
2411.07760v1 by Alexi Canesse, Mathieu Petitbois, Ludovic Denoyer, Sylvain Lamprier, Rémy Portelas
Offline Reinforcement Learning (RL) has emerged as a powerful alternative to imitation learning for behavior modeling in various domains, particularly in complex navigation tasks. An existing challenge with Offline RL is the signal-to-noise ratio, i.e. how to mitigate incorrect policy updates due to errors in value estimates. To address this, multiple works have demonstrated the advantage of hierarchical offline RL methods, which decouple high-level path planning from low-level path following. In this work, we present a novel hierarchical transformer-based approach leveraging a learned quantizer of the space. This quantization enables the training of a simpler zone-conditioned low-level policy and simplifies planning, which is reduced to discrete autoregressive prediction. Among other benefits, zone-level reasoning in planning enables explicit trajectory stitching rather than implicit stitching based on noisy value function estimates. By combining this transformer-based planner with recent advancements in offline RL, our proposed approach achieves state-of-the-art results in complex long-distance navigation environments.
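To make the role of a space quantizer more concrete, the sketch below maps continuous 2D positions to discrete zone indices, which turns a trajectory into a short token sequence that a planner could predict autoregressively. It rests on several assumptions: the paper's quantizer is learned, whereas the centroids here are fixed by hand, and both the nearest-centroid rule and the merging of consecutive repeated zones are choices made for this illustration rather than details confirmed by the abstract.

```python
def nearest_zone(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
    )


def tokenize_trajectory(trajectory, centroids):
    """Map positions to zone indices and merge consecutive repeats."""
    tokens = []
    for point in trajectory:
        zone = nearest_zone(point, centroids)
        if not tokens or tokens[-1] != zone:
            tokens.append(zone)
    return tokens


# Hand-picked zone centers and a toy trajectory, for illustration only.
centroids = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
trajectory = [(0.1, 0.0), (0.3, 0.1), (0.8, 0.1), (1.0, 0.4), (1.0, 0.9)]
zone_tokens = tokenize_trajectory(trajectory, centroids)
print(zone_tokens)
```

Two trajectories that pass through the same zone share a token, which is one way to see how planning over zones supports explicit stitching.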
Optimizing Traffic Signal Control using High-Dimensional State Representation and Efficient Deep Reinforcement Learning
2411.07759v1 by Lawrence Francis, Blessed Guda, Ahmed Biyabani
In reinforcement learning-based (RL-based) traffic signal control (TSC), decisions on the signal timing are made based on the available information on vehicles at a road intersection. This information forms the state representation for the RL environment, which can be either a high-dimensional representation containing several variables or a low-dimensional vector. Current studies suggest that using high-dimensional state representations does not lead to improved performance on TSC. However, we argue, with experimental results, that the use of high-dimensional state representations can, in fact, lead to improved TSC performance, with improvements of up to 17.9% in average waiting time. This high-dimensional representation is obtainable using cost-effective vehicle-to-infrastructure (V2I) communication, encouraging its adoption for TSC. Additionally, given the large size of the state, we identified the need for computationally efficient models and explored model compression via pruning.
SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model
2411.07751v1 by Xinyuan Qian, Jiaran Gao, Yaodan Zhang, Qiquan Zhang, Hexin Liu, Leibny Paola Garcia, Haizhou Li
Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e. SAV-SE. To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we propose the VC-S$^2$E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on public MUSIC, AVSpeech and AudioSet datasets, where the results demonstrate the superiority of VC-S$^2$E over other competitive methods. We will make the source code publicly available. Project demo page: https://AVSEPage.github.io/
Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding
2411.07722v1 by Zirui Shao, Chuwei Luo, Zhaoqing Zhu, Hangdi Xing, Zhi Yu, Qi Zheng, Jiajun Bu
Multimodal large language models (MLLMs) have shown impressive capabilities in document understanding, a rapidly growing research area with significant industrial demand in recent years. As a multimodal task, document understanding requires models to possess both perceptual and cognitive abilities. However, current MLLMs often face conflicts between perception and cognition. Taking a document VQA task (cognition) as an example, an MLLM might generate answers that do not match the corresponding visual content identified by its OCR (perception). This conflict suggests that the MLLM might struggle to establish an intrinsic connection between the information it "sees" and what it "understands." Such conflicts challenge the intuitive notion that cognition is consistent with perception, hindering the performance and explainability of MLLMs. In this paper, we define the conflicts between cognition and perception as Cognition and Perception (C&P) knowledge conflicts, a form of multimodal knowledge conflicts, and systematically assess them with a focus on document understanding. Our analysis reveals that even GPT-4o, a leading MLLM, achieves only 68.6% C&P consistency. To mitigate the C&P knowledge conflicts, we propose a novel method called Multimodal Knowledge Consistency Fine-tuning. This method first ensures task-specific consistency and then connects the cognitive and perceptual knowledge. Our method significantly reduces C&P knowledge conflicts across all tested MLLMs and enhances their performance in both cognitive and perceptual tasks in most scenarios.
Training Data for Large Language Model
2411.07715v1 by Yiming Ju, Huanhuan Ma
In 2022, with the release of ChatGPT, large-scale language models gained widespread attention. ChatGPT not only surpassed previous models in terms of parameters and the scale of its pretraining corpus but also achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data. This progress has led enterprises and research institutions to recognize that building smarter and more powerful models relies on rich and high-quality datasets. Consequently, the construction and optimization of datasets have become a critical focus in the field of artificial intelligence. This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models, covering aspects such as data scale, collection methods, data types and characteristics, processing workflows, and provides an overview of available open-source datasets.
New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook
2411.07691v1 by Meng Yang, Tianqing Zhu, Chi Liu, WanLei Zhou, Shui Yu, Philip S. Yu
Thanks to the explosive growth of data and the development of computational resources, it is possible to build pre-trained models that can achieve outstanding performance on various tasks, such as natural language processing, computer vision, and more. Despite their powerful capabilities, pre-trained models have also sparked attention to the emerging security challenges associated with their real-world applications. Security and privacy issues, such as leaking private information and generating harmful responses, have seriously undermined users' confidence in these powerful models. Concerns are growing as model performance improves dramatically. Researchers are eager to explore the unique security and privacy issues that have emerged, their distinguishing factors, and how to defend against them. However, the current literature lacks a clear taxonomy of emerging attacks and defenses for pre-trained models, which hinders a high-level and comprehensive understanding of these questions. To fill the gap, we conduct a systematic survey on the security risks of pre-trained models, proposing a taxonomy of attack and defense methods based on the accessibility of pre-trained models' input and weights in various security test scenarios. This taxonomy categorizes attacks and defenses into No-Change, Input-Change, and Model-Change approaches. With the taxonomy analysis, we capture the unique security and privacy issues of pre-trained models, categorizing and summarizing existing security issues based on their characteristics. In addition, we offer a timely and comprehensive review of each category's strengths and limitations. Our survey concludes by highlighting potential new research opportunities in the security and privacy of pre-trained models.
World Models: The Safety Perspective
2411.07690v1 by Zifan Zeng, Chongzhe Zhang, Feng Liu, Joseph Sifakis, Qunli Zhang, Shiming Liu, Peng Wang
With the proliferation of the Large Language Model (LLM), the concept of World Models (WM) has recently attracted a great deal of attention in the AI research community, especially in the context of AI agents. It is arguably evolving into an essential foundation for building AI agent systems. A WM is intended to help the agent predict the future evolution of environmental states or help the agent fill in missing information so that it can plan its actions and behave safely. The safety property of WM plays a key role in their effective use in critical applications. In this work, we review and analyze the impacts of the current state-of-the-art in WM technology from the point of view of trustworthiness and safety based on a comprehensive survey and the fields of application envisaged. We provide an in-depth analysis of state-of-the-art WMs and derive technical research challenges and their impact in order to call on the research community to collaborate on improving the safety and trustworthiness of WM.
Enhancing Ultra High Resolution Remote Sensing Imagery Analysis with ImageRAG
2411.07688v1 by Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Yuhao Wang, Bin Chen, Yuxiang Cai, Yongheng Shang, Jianwei Yin
Ultra High Resolution (UHR) remote sensing imagery (RSI) (e.g. 100,000 $\times$ 100,000 pixels or more) poses a significant challenge for current Remote Sensing Multimodal Large Language Models (RSMLLMs). If one chooses to resize the UHR image to the standard input image size, the extensive spatial and contextual information that UHR images contain will be neglected. Otherwise, the original size of these images often exceeds the token limits of standard RSMLLMs, making it difficult to process the entire image and capture long-range dependencies to answer the query based on the abundant visual context. In this paper, we introduce ImageRAG for RS, a training-free framework to address the complexities of analyzing UHR remote sensing imagery. By transforming the UHR remote sensing image analysis task into an image long-context selection task, we design an innovative image contextual retrieval mechanism based on the Retrieval-Augmented Generation (RAG) technique, denoted as ImageRAG. ImageRAG's core innovation lies in its ability to selectively retrieve and focus on the most relevant portions of the UHR image as visual contexts that pertain to a given query. A fast path and a slow path are proposed in this framework to handle this task efficiently and effectively. ImageRAG allows RSMLLMs to manage extensive context and spatial information from UHR RSI, ensuring the analysis is both accurate and efficient.
Fast Disentangled Slim Tensor Learning for Multi-view Clustering
2411.07685v1 by Deng Xu, Chao Zhang, Zechao Li, Chunlin Chen, Huaxiong Li
Tensor-based multi-view clustering has recently received significant attention due to its exceptional ability to explore cross-view high-order correlations. However, most existing methods still encounter some limitations. (1) Most of them explore the correlations among different affinity matrices, making them unscalable to large-scale data. (2) Although some methods address it by introducing bipartite graphs, they may result in sub-optimal solutions caused by an unstable anchor selection process. (3) They generally ignore the negative impact of latent semantic-unrelated information in each view. To tackle these issues, we propose a new approach termed fast Disentangled Slim Tensor Learning (DSTL) for multi-view clustering. Instead of focusing on the multi-view graph structures, DSTL directly explores the high-order correlations among multi-view latent semantic representations based on matrix factorization. To alleviate the negative influence of feature redundancy, inspired by robust PCA, DSTL disentangles the latent low-dimensional representation into a semantic-unrelated part and a semantic-related part for each view. Subsequently, two slim tensors are constructed with tensor-based regularization. To further enhance the quality of feature disentanglement, the semantic-related representations are aligned across views through a consensus alignment indicator. Our proposed model is computationally efficient and can be solved effectively. Extensive experiments demonstrate the superiority and efficiency of DSTL over state-of-the-art approaches. The code of DSTL is available at https://github.com/dengxu-nju/DSTL.
摘要:
AI enhanced diagnosis of Peyronies disease a novel approach using Computer Vision
2411.07684v1 by Yudara Kularathne, Janitha Prathapa, Prarththanan Sothyrajah, Salomi Arasaratnam, Sithira Ambepitiya, Thanveer Ahamed, Dinuka Wijesundara
This study presents an innovative AI-driven tool for diagnosing Peyronie's Disease (PD), a condition that affects between 0.3% and 13.1% of men worldwide. Our method uses key point detection on both images and videos to measure penile curvature angles, utilizing advanced computer vision techniques. This tool has demonstrated high accuracy in identifying anatomical landmarks, validated against conventional goniometer measurements. Traditional PD diagnosis often involves subjective and invasive methods, which can lead to patient discomfort and inaccuracies. Our approach offers a precise, reliable, and non-invasive diagnostic tool to address these drawbacks. The model distinguishes between PD and normal anatomical changes with a sensitivity of 96.7% and a specificity of 100%. This advancement represents a significant improvement in urological diagnostics, greatly enhancing the efficacy and convenience of PD assessment for healthcare providers and patients.
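Abstract: This study presents an innovative AI-driven tool for diagnosing Peyronie's disease (PD), a condition that affects between 0.3% and 13.1% of men worldwide. Our technique applies key point detection to images and videos to measure penile curvature angles, using advanced computer vision techniques. The tool has shown high accuracy in identifying anatomical landmarks and has been validated against conventional goniometer measurements. Traditional PD diagnosis often relies on subjective and invasive methods, which can cause patient discomfort and inaccuracies. Our approach provides a precise, reliable, and non-invasive diagnostic tool that addresses these drawbacks. The model distinguishes PD from normal anatomical variation with a sensitivity of 96.7% and a specificity of 100%. This advance represents a significant improvement in urological diagnostics and greatly improves the efficacy and convenience of PD assessment for healthcare providers and patients.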
Mitigating Bias in Queer Representation within Large Language Models: A Collaborative Agent Approach
2411.07656v1 by Tianyi Huang, Arya Somasundaram
Large Language Models (LLMs) often perpetuate biases in pronoun usage, leading to misrepresentation or exclusion of queer individuals. This paper addresses the specific problem of biased pronoun usage in LLM outputs, particularly the inappropriate use of traditionally gendered pronouns ("he," "she") when inclusive language is needed to accurately represent all identities. We introduce a collaborative agent pipeline designed to mitigate these biases by analyzing and optimizing pronoun usage for inclusivity. Our multi-agent framework includes specialized agents for both bias detection and correction. Experimental evaluations using the Tango dataset, a benchmark focused on gender pronoun usage, demonstrate that our approach significantly improves inclusive pronoun classification, achieving a 32.6 percentage point increase over GPT-4o in correctly disagreeing with inappropriate traditionally gendered pronouns $(\chi^2 = 38.57, p < 0.0001)$. These results accentuate the potential of agent-driven frameworks in enhancing fairness and inclusivity in AI-generated content, demonstrating their efficacy in reducing biases and promoting socially responsible AI.
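Abstract: Large language models (LLMs) often perpetuate biases in pronoun usage, leading to the misrepresentation or exclusion of queer individuals. This paper addresses the specific problem of biased pronoun usage in LLM outputs, in particular the inappropriate use of traditionally gendered pronouns ("he," "she") where inclusive language is needed to represent all identities accurately. We introduce a collaborative agent pipeline that mitigates these biases by analyzing and optimizing pronoun usage for inclusivity. Our multi-agent framework includes specialized agents for bias detection and correction. Experimental evaluation on the Tango dataset, a benchmark focused on gender pronoun usage, shows that our approach significantly improves inclusive pronoun classification, with a gain of 32.6 percentage points over GPT-4o in correctly disagreeing with inappropriate traditionally gendered pronouns ($\chi^2 = 38.57$, $p < 0.0001$). These results highlight the potential of agent-driven frameworks to enhance fairness and inclusivity in AI-generated content and demonstrate their effectiveness in reducing bias and promoting socially responsible AI.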
Direct Preference Optimization Using Sparse Feature-Level Constraints
2411.07618v1 by Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.
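Abstract: Aligning large language models (LLMs) with human preferences remains a key challenge. Although post-training techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiency and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, enabling efficient, sparsity-enforced alignment. Our approach gains efficiency from the sparse features activated in a well-trained sparse autoencoder, and retains the quality of sequential KL divergence by using a feature-level offline reference. Experimental results on benchmark datasets show that FPO achieves a 5.08% absolute improvement in win rate at much lower computational cost than state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignment.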
Multimodal Clinical Reasoning through Knowledge-augmented Rationale Generation
2411.07611v1 by Shuai Niu, Jing Ma, Liang Bai, Zhihua Wang, Yida Xu, Yunya Song, Xian Yang
Clinical rationales play a pivotal role in accurate disease diagnosis; however, many models predominantly use discriminative methods and overlook the importance of generating supportive rationales. Rationale distillation is a process that transfers knowledge from large language models (LLMs) to smaller language models (SLMs), thereby enhancing the latter's ability to break down complex tasks. Despite its benefits, rationale distillation alone is inadequate for addressing domain knowledge limitations in tasks requiring specialized expertise, such as disease diagnosis. Effectively embedding domain knowledge in SLMs poses a significant challenge. While current LLMs are primarily geared toward processing textual data, multimodal LLMs that incorporate time series data, especially electronic health records (EHRs), are still evolving. To tackle these limitations, we introduce ClinRaGen, an SLM optimized for multimodal rationale generation in disease diagnosis. ClinRaGen incorporates a unique knowledge-augmented attention mechanism to merge domain knowledge with time series EHR data, utilizing a stepwise rationale distillation strategy to produce both textual and time series-based clinical rationales. Our evaluations show that ClinRaGen markedly improves the SLM's capability to interpret multimodal EHR data and generate accurate clinical rationales, supporting more reliable disease diagnosis, advancing LLM applications in healthcare, and narrowing the performance divide between LLMs and SLMs.
Circuit Complexity Bounds for RoPE-based Transformer Architecture
2411.07602v1 by Bo Chen, Xiaoyu Li, Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song
Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling law. Recent works provide circuit complexity bounds for Transformer-like architectures. On the other hand, Rotary Position Embedding ($\mathsf{RoPE}$) has emerged as a crucial technique in modern large language models, offering superior performance in capturing positional information compared to traditional position embeddings, which shows great potential in application prospects, particularly for the long context scenario. Empirical evidence also suggests that $\mathsf{RoPE}$-based Transformer architectures demonstrate greater generalization capabilities compared to conventional Transformer models. In this work, we establish a tighter circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our key contribution is that we show that unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$-precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem or the Boolean formula value problem. This result demonstrates a fundamental limitation on the expressivity of the $\mathsf{RoPE}$-based Transformer architecture, despite its great empirical success. Our theoretical framework not only establishes tighter complexity bounds but may also guide further work on the $\mathsf{RoPE}$-based Transformer.
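Abstract: Characterizing the expressive power of the Transformer architecture is critical to understanding its capacity limits and scaling law. Recent studies provide circuit complexity bounds for Transformer-like architectures. Meanwhile, Rotary Position Embedding ($\mathsf{RoPE}$) has become a key technique in modern large language models: it captures positional information better than traditional position embeddings and shows great potential in applications, particularly in long-context scenarios. Empirical evidence also suggests that $\mathsf{RoPE}$-based Transformer architectures generalize better than conventional Transformer models. In this work, we establish a tighter circuit complexity bound for Transformers with $\mathsf{RoPE}$ attention. Our key contribution is to show that, unless $\mathsf{TC}^0 = \mathsf{NC}^1$, a $\mathsf{RoPE}$-based Transformer with $\mathrm{poly}(n)$ precision, $O(1)$ layers, and hidden dimension $d \leq O(n)$ cannot solve the arithmetic problem or the Boolean formula value problem. Despite the architecture's great empirical success, this result demonstrates a fundamental limitation on the expressivity of the $\mathsf{RoPE}$-based Transformer. Our theoretical framework not only establishes tighter complexity bounds but may also guide further work on the $\mathsf{RoPE}$-based Transformer.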
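For readers unfamiliar with the embedding this abstract analyzes, the following is a minimal sketch of rotary position embedding in plain Python. It follows the commonly used formulation (rotating consecutive feature pairs by position-dependent angles); the function names `rope` and `dot`, the `base` value, and the toy vectors are illustrative assumptions and are not taken from the paper.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive feature pair of x by an angle pos * theta_i.

    Sketch of the usual RoPE formulation; `base` is an assumed default.
    """
    d = len(x)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # one frequency per feature pair
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [x[i] * c - x[i + 1] * s,  # 2-D rotation of the pair
                x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Toy query/key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

# Both pairs of positions have the same offset (3), so the scores agree.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
```

The point of the example is that the attention score between a rotated query and a rotated key depends only on their relative offset, which is the property that makes RoPE attractive for long contexts.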
Problem-Oriented Segmentation and Retrieval: Case Study on Tutoring Conversations
2411.07598v1 by Rose E. Wang, Pawan Wirawarn, Kenny Lam, Omar Khattab, Dorottya Demszky
Many open-ended conversations (e.g., tutoring lessons or business meetings) revolve around pre-defined reference materials, like worksheets or meeting bullets. To provide a framework for studying such conversation structure, we introduce Problem-Oriented Segmentation & Retrieval (POSR), the task of jointly breaking down conversations into segments and linking each segment to the relevant reference item. As a case study, we apply POSR to education where effectively structuring lessons around problems is critical yet difficult. We present LessonLink, the first dataset of real-world tutoring lessons, featuring 3,500 segments, spanning 24,300 minutes of instruction and linked to 116 SAT math problems. We define and evaluate several joint and independent approaches for POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT), and large language models (LLMs) methods. Our results highlight that modeling POSR as one joint task is essential: POSR methods outperform independent segmentation and retrieval pipelines by up to +76% on joint metrics and surpass traditional segmentation methods by up to +78% on segmentation metrics. We demonstrate POSR's practical impact on downstream education applications, deriving new insights on the language and time use in real-world lesson structures.
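Abstract: Many open-ended conversations (e.g., tutoring lessons or business meetings) revolve around pre-defined reference materials, such as worksheets or meeting bullets. To provide a framework for studying this kind of conversation structure, we introduce Problem-Oriented Segmentation & Retrieval (POSR), the task of jointly breaking a conversation into segments and linking each segment to the relevant reference item. As a case study, we apply POSR to education, where structuring lessons effectively around problems is critical yet difficult. We present LessonLink, the first dataset of real-world tutoring lessons, containing 3,500 segments that span 24,300 minutes of instruction and are linked to 116 SAT math problems. We define and evaluate several joint and independent approaches to POSR, including segmentation (e.g., TextTiling), retrieval (e.g., ColBERT), and large language model (LLM) methods. Our results show that modeling POSR as a single joint task is essential: POSR methods outperform independent segmentation and retrieval pipelines by up to +76% on joint metrics and surpass traditional segmentation methods by up to +78% on segmentation metrics. We demonstrate POSR's practical impact on downstream education applications, deriving new insights into the use of language and time in real-world lesson structures.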
Entropy Controllable Direct Preference Optimization
2411.07595v1 by Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka
In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution's sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
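Abstract: In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective way to obtain generations aligned with human preferences. Direct Preference Optimization (DPO) allows the policy to be trained with a simple binary cross-entropy loss and without a reward model. The DPO objective is regularized by the reverse KL divergence, which encourages mode-seeking fitting to the reference policy. Nonetheless, we point out that minimizing the reverse KL divergence can fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose H-DPO, a simple modification of DPO that allows control over the entropy of the resulting policy, sharpening the distribution and thereby enabling more effective mode-seeking fitting. In our experiments, H-DPO outperformed DPO across various tasks and showed superior results in pass@$k$ evaluations on mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the DPO loss calculation, which makes it highly practical and promising for wide-ranging applications in LLM training.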
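The binary cross-entropy loss mentioned in this abstract can be sketched for a single preference pair as follows. This is the standard DPO loss only; the entropy-controlling modification of H-DPO is not reproduced, because the abstract does not state its form. The function name `dpo_loss`, the default `beta`, and the log-probability values are assumptions made for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_*     : sequence log-probabilities under the policy being trained
    ref_logp_* : the same quantities under the frozen reference policy
    beta       : regularization strength (assumed default)
    """
    # Implicit reward margin between the chosen (w) and rejected (l) response.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities for illustration only.
at_reference = dpo_loss(-1.0, -2.0, -1.0, -2.0)   # policy equals reference
after_update = dpo_loss(-0.5, -3.0, -1.0, -2.0)   # policy favors the chosen answer more
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one.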
A Comprehensive Survey of AI-Driven Advancements and Techniques in Automated Program Repair and Code Generation
2411.07586v1 by Avinash Anand, Akshit Gupta, Nishchay Yadav, Shaurya Bajaj
Bug fixing and code generation have been core research topics in software development for many years. The recent explosive growth in Large Language Models has completely transformed these spaces, putting incredibly powerful tools for both within reach. In this survey, 27 recent papers have been reviewed and split into two groups: one dedicated to Automated Program Repair (APR) and LLM integration and the other to code generation using LLMs. The first group consists of new methods for bug detection and repair, which include locating semantic errors, security vulnerabilities, and runtime failure bugs. The APR work emphasizes the role of LLMs in reducing manual debugging effort and moves toward context-aware fixes, with innovations that boost accuracy and efficiency in automatic debugging. The second group dwells on code generation, providing an overview of both general-purpose LLMs fine-tuned for programming and task-specific models. It also presents methods to improve code generation, such as identifier-aware training, fine-tuning at the instruction level, and incorporating semantic code structures. This survey contrasts the methodologies in APR and code generation to identify trends such as the use of LLMs, feedback loops that enable iterative code improvement, and open-source models. It also discusses the challenges of achieving functional correctness and security and outlines future directions for research in LLM-based software development.
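Abstract: Bug fixing and code generation have been core research topics in software development for many years. The recent explosive growth of large language models has completely transformed these areas and provided powerful tools for both. This survey reviews 27 recent papers and divides them into two groups: one devoted to Automated Program Repair (APR) and LLM integration, the other to code generation with LLMs. The first group comprises new methods for bug detection and repair, including locating semantic errors, security vulnerabilities, and runtime failure bugs. The APR work highlights the role of LLMs in reducing manual debugging effort and moves toward context-aware fixes, with innovations that improve the accuracy and efficiency of automatic debugging. The second group focuses on code generation and gives an overview of general-purpose LLMs fine-tuned for programming as well as task-specific models. It also presents methods for improving code generation, such as identifier-aware training, instruction-level fine-tuning, and incorporating semantic code structures. The survey contrasts the methodologies of APR and code generation to identify trends such as the use of LLMs, feedback loops that enable iterative code improvement, and open-source models. It also discusses the challenges of achieving functional correctness and security and outlines future research directions for LLM-based software development.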
Reinforcement Learning Framework for Quantitative Trading
2411.07585v1 by Alhassan S. Yasin, Prabdeep S. Gill
The inherent volatility and dynamic fluctuations within the financial stock market underscore the necessity for investors to employ a comprehensive and reliable approach that integrates risk management strategies, market trends, and the movement trends of individual securities. By evaluating specific data, investors can make more informed decisions. However, the current body of literature lacks substantial evidence supporting the practical efficacy of reinforcement learning (RL) agents, as many models have only demonstrated success in back testing using historical data. This highlights the urgent need for a more advanced methodology capable of addressing these challenges. There is a significant disconnect in the effective utilization of financial indicators to better understand the potential market trends of individual securities. The disclosure of successful trading strategies is often restricted within financial markets, resulting in a scarcity of widely documented and published strategies leveraging RL. Furthermore, current research frequently overlooks the identification of financial indicators correlated with various market trends and their potential advantages. This research endeavors to address these complexities by enhancing the ability of RL agents to effectively differentiate between positive and negative buy/sell actions using financial indicators. While we do not address all concerns, this paper provides deeper insights and commentary on the utilization of technical indicators and their benefits within reinforcement learning. This work establishes a foundational framework for further exploration and investigation of more complex scenarios.
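Abstract: The inherent volatility and dynamic fluctuations of the financial stock market underscore the need for investors to adopt a comprehensive and reliable approach that integrates risk management strategies, market trends, and the movement trends of individual securities. By evaluating specific data, investors can make better-informed decisions. However, the current literature lacks substantial evidence for the practical efficacy of reinforcement learning (RL) agents, since many models have demonstrated success only in backtesting on historical data. This highlights the urgent need for a more advanced methodology that can address these challenges. There is a significant disconnect in the effective use of financial indicators to better understand the potential market trends of individual securities. Disclosure of successful trading strategies is often restricted within financial markets, so widely documented and published strategies that leverage RL are scarce. Furthermore, current research frequently overlooks the identification of financial indicators correlated with various market trends and their potential advantages. This research seeks to address these complexities by improving the ability of RL agents to distinguish effectively between positive and negative buy/sell actions using financial indicators. Although we do not address every concern, this paper offers deeper insight and commentary on the use of technical indicators and their benefits within reinforcement learning. The work establishes a foundational framework for further exploration and investigation of more complex scenarios.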
Improving Grapheme-to-Phoneme Conversion through In-Context Knowledge Retrieval with Large Language Models
2411.07563v1 by Dongrui Han, Mingyu Cui, Jiawen Kang, Xixin Wu, Xunying Liu, Helen Meng
Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it faces ambiguity problems: the same grapheme can represent multiple phonemes depending on context, which poses a challenge for G2P conversion. Inspired by the remarkable success of Large Language Models (LLMs) in handling context-aware scenarios, contextual G2P conversion systems with LLMs' in-context knowledge retrieval (ICKR) capabilities are proposed to promote disambiguation capability. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated thoroughly on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline with weighted average phoneme error rate (PER) reductions of 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR system can yield an increase of 3.5% absolute (3.8% relative) on the Librig2p dataset.
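Abstract: Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it faces ambiguity problems: the same grapheme can represent multiple phonemes depending on context, which poses a challenge for G2P conversion. Inspired by the remarkable success of large language models (LLMs) in context-aware scenarios, contextual G2P conversion systems that use the in-context knowledge retrieval (ICKR) capabilities of LLMs are proposed to improve disambiguation. The efficacy of incorporating ICKR into G2P conversion systems is demonstrated thoroughly on the Librig2p dataset. In particular, the best contextual G2P conversion system using ICKR outperforms the baseline, reducing the weighted average phoneme error rate (PER) by 2.0% absolute (28.9% relative). Using GPT-4 in the ICKR system can yield an increase of 3.5% absolute (3.8% relative) on the Librig2p dataset.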
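To make the reported metric concrete, here is a minimal sketch of how a phoneme error rate and its absolute versus relative reduction are typically computed. The edit-distance definition of PER is the conventional one and is assumed here; the abstract does not spell out its exact scoring. The helper names and the ARPAbet-style example are illustrative, and the baseline of about 6.92% is only what the two reported figures imply, not a number stated in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def per(ref, hyp):
    """Phoneme error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline, improved):
    """Relative error reduction between two error rates."""
    return (baseline - improved) / baseline

# A heteronym: "read" is R IY D (present) or R EH D (past).
ref = ["R", "IY", "D"]
hyp = ["R", "EH", "D"]
example_per = per(ref, hyp)   # one substitution over three phonemes

# A 2.0-point absolute drop that is 28.9% relative implies a baseline
# near 2.0 / 0.289, i.e. about 6.92% (inferred, not reported).
implied_baseline = 2.0 / 0.289
```

The distinction matters when reading the abstract: the absolute figure is a difference in percentage points, while the relative figure is that difference divided by the baseline error rate.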
EUR/USD Exchange Rate Forecasting incorporating Text Mining Based on Pre-trained Language Models and Deep Learning Methods
2411.07560v1 by Xiangyu Shi, Hongcheng Ding, Salaar Faroog, Deshinta Arrova Dewi, Shamsul Nahar Abdullah, Bahiah A Malek
This study introduces a novel approach for EUR/USD exchange rate forecasting that integrates deep learning, textual analysis, and particle swarm optimization (PSO). By incorporating online news and analysis texts as qualitative data, the proposed PSO-LSTM model demonstrates superior performance compared to traditional econometric and machine learning models. The research employs advanced text mining techniques, including sentiment analysis using the RoBERTa-Large model and topic modeling with LDA. Empirical findings underscore the significant advantage of incorporating textual data, with the PSO-LSTM model outperforming benchmark models such as SVM, SVR, ARIMA, and GARCH. Ablation experiments reveal the contribution of each textual data category to the overall forecasting performance. The study highlights the transformative potential of artificial intelligence in finance and paves the way for future research in real-time forecasting and the integration of alternative data sources.
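Abstract: This study proposes a new approach to EUR/USD exchange rate forecasting that integrates deep learning, textual analysis, and particle swarm optimization (PSO). By incorporating online news and analysis texts as qualitative data, the proposed PSO-LSTM model outperforms traditional econometric and machine learning models. The research employs advanced text mining techniques, including sentiment analysis with the RoBERTa-Large model and topic modeling with LDA. Empirical findings underscore the significant advantage of incorporating textual data, with the PSO-LSTM model outperforming benchmark models such as SVM, SVR, ARIMA, and GARCH. Ablation experiments reveal the contribution of each category of textual data to overall forecasting performance. The study highlights the transformative potential of artificial intelligence in finance and paves the way for future research on real-time forecasting and the integration of alternative data sources.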
Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models
2411.07559v1 by Tiejin Chen, Kaishen Wang, Hua Wei
Jailbreaking methods, which induce Multi-modal Large Language Models (MLLMs) to output harmful responses, raise significant safety concerns. Among these methods, gradient-based approaches, which use gradients to generate malicious prompts, have been widely studied due to their high success rates in white-box settings, where full access to the model is available. However, these methods have notable limitations: they require white-box access, which is not always feasible, and involve high memory usage. To address scenarios where white-box access is unavailable, attackers often resort to transfer attacks. In transfer attacks, malicious inputs generated using white-box models are applied to black-box models, but this typically results in reduced attack performance. To overcome these challenges, we propose Zer0-Jack, a method that bypasses the need for white-box access by leveraging zeroth-order optimization. We propose patch coordinate descent to efficiently generate malicious image inputs that directly attack black-box MLLMs, which further reduces memory usage significantly. Through extensive experiments, Zer0-Jack achieves a high attack success rate across various models, surpassing previous transfer-based methods and performing comparably with existing white-box jailbreak techniques. Notably, Zer0-Jack achieves a 95% attack success rate on MiniGPT-4 with the Harmful Behaviors Multi-modal Dataset in a black-box setting, demonstrating its effectiveness. Additionally, we show that Zer0-Jack can directly attack commercial MLLMs such as GPT-4o. Codes are provided in the supplement.
Contrastive Language Prompting to Ease False Positives in Medical Anomaly Detection
2411.07546v1 by YeongHyeon Park, Myung Jin Kim, Hyeong Seok Kim
A pre-trained visual-language model, contrastive language-image pre-training (CLIP), successfully accomplishes various downstream tasks with text prompts, such as finding images or localizing regions within the image. Despite CLIP's strong multi-modal data capabilities, it remains limited in specialized environments, such as medical applications. For this purpose, many CLIP variants (i.e., BioMedCLIP and MedCLIP-SAMv2) have emerged, but false positives related to normal regions persist. Thus, we pursue the simple yet important goal of reducing false positives in medical anomaly detection. We introduce a Contrastive LAnguage Prompting (CLAP) method that leverages both positive and negative text prompts. This straightforward approach identifies potential lesion regions by visual attention to the positive prompts in the given image. To reduce false positives, we attenuate attention on normal regions using negative prompts. Extensive experiments with the BMAD dataset, including six biomedical benchmarks, demonstrate that the CLAP method enhances anomaly detection performance. Our future plans include developing an automated fine prompting method for more practical usage.
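Abstract: Contrastive language-image pre-training (CLIP), a pre-trained visual-language model, successfully accomplishes various downstream tasks with text prompts, such as finding images or localizing regions within an image. Despite CLIP's strong multi-modal capabilities, it remains limited in specialized environments such as medical applications. For this reason many CLIP variants have emerged (i.e., BioMedCLIP and MedCLIP-SAMv2), but false positives related to normal regions persist. We therefore pursue the simple yet important goal of reducing false positives in medical anomaly detection. We introduce a Contrastive LAnguage Prompting (CLAP) method that uses both positive and negative text prompts. This straightforward approach identifies potential lesion regions through visual attention to the positive prompts in the given image. To reduce false positives, we use negative prompts to attenuate attention on normal regions. Extensive experiments on the BMAD dataset, which includes six biomedical benchmarks, show that the CLAP method improves anomaly detection performance. Our future plans include developing an automated fine prompting method for more practical use.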
Model Stealing for Any Low-Rank Language Model
2411.07536v1 by Allen Liu, Ankur Moitra
Model stealing, where a learner tries to recover an unknown model via carefully chosen queries, is a critical problem in machine learning, as it threatens the security of proprietary models and the privacy of data they are trained on. In recent years, there has been particular interest in stealing large language models (LLMs). In this paper, we aim to build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting. We study model stealing for Hidden Markov Models (HMMs), and more generally low-rank language models. We assume that the learner works in the conditional query model, introduced by Kakade, Krishnamurthy, Mahajan and Zhang. Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution. In other words, our algorithm succeeds at stealing any language model whose output distribution is low-rank. This improves upon the previous result by Kakade, Krishnamurthy, Mahajan and Zhang, which also requires the unknown distribution to have high "fidelity", a property that holds only in restricted cases. There are two key insights behind our algorithm: First, we represent the conditional distributions at each timestep by constructing barycentric spanners among a collection of vectors of exponentially large dimension. Second, for sampling from our representation, we iteratively solve a sequence of convex optimization problems that involve projection in relative entropy to prevent compounding of errors over the length of the sequence. This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.
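Abstract: Model stealing, in which a learner tries to recover an unknown model through carefully chosen queries, is a critical problem in machine learning because it threatens the security of proprietary models and the privacy of the data they are trained on. In recent years there has been particular interest in stealing large language models (LLMs). In this paper, we aim to build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting. We study model stealing for Hidden Markov Models (HMMs) and, more generally, low-rank language models. We assume that the learner works in the conditional query model introduced by Kakade, Krishnamurthy, Mahajan and Zhang. Our main result is an efficient algorithm in the conditional query model for learning any low-rank distribution. In other words, our algorithm succeeds at stealing any language model whose output distribution is low-rank. This improves on the previous result of Kakade, Krishnamurthy, Mahajan and Zhang, which also requires the unknown distribution to have high "fidelity", a property that holds only in restricted cases. Two key insights underlie our algorithm. First, we represent the conditional distributions at each timestep by constructing barycentric spanners among a collection of vectors of exponentially large dimension. Second, to sample from our representation, we iteratively solve a sequence of convex optimization problems that involve projection in relative entropy, which prevents errors from compounding over the length of the sequence. This is an interesting example where, at least in theory, allowing a machine learning model to solve more complex problems at inference time can drastically improve its performance.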
Large Language Models as Neurolinguistic Subjects: Identifying Internal Representations for Form and Meaning
2411.07533v1 by Linyang He, Ercong Nie, Helmut Schmid, Hinrich Schütze, Nima Mesgarani, Jonathan Brennan
This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM evaluation paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent LLMs' true linguistic capabilities. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. Our contributions are three-fold: (1) We compare neurolinguistic and psycholinguistic methods, revealing distinct patterns in LLM assessment; (2) We demonstrate that LLMs exhibit higher competence in form compared to meaning, with the latter largely correlated to the former; (3) We present new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.
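Abstract: This study investigates the linguistic understanding of large language models (LLMs) with respect to signifier (form) and signified (meaning) by distinguishing two LLM evaluation paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical biases that may misrepresent the true linguistic capabilities of LLMs. We introduce a neurolinguistic approach that uses a novel method combining minimal pairs and diagnostic probing to analyze activation patterns across model layers. This method allows a detailed examination of how LLMs represent form and meaning and of whether these representations are consistent across languages. Our contributions are three-fold: (1) we compare neurolinguistic and psycholinguistic methods, revealing distinct patterns in LLM assessment; (2) we show that LLMs exhibit higher competence in form than in meaning, with the latter largely correlated with the former; (3) we present new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.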
Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis
2411.07529v1 by Minda Li, Bhaskar Krishnamachari
ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform with algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with greater gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages like Python, Java, and C++ than in less common ones like Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we conduct automated experiments using Python scripts to generate prompts that instruct ChatGPT to create Python solutions. These solutions are stored and manually submitted on LeetCode to check their correctness. For Hypothesis 1, results show the GPT-3.5-turbo model successfully solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought Prompting, 38-60% by providing failed test cases in a second feedback prompt, and 33-58% by switching to GPT-4. From a random subset of problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.
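Abstract: ChatGPT and other large language models (LLMs) promise to revolutionize software development by automatically generating code from program specifications. We assess the performance of ChatGPT's GPT-3.5-turbo model on LeetCode, a popular platform of algorithmic coding challenges for technical interview practice, across three difficulty levels: easy, medium, and hard. We test three main hypotheses. First, ChatGPT solves fewer problems as difficulty rises (Hypothesis 1). Second, prompt engineering improves ChatGPT's performance, with larger gains on easier problems and diminishing returns on harder ones (Hypothesis 2). Third, ChatGPT performs better in popular languages such as Python, Java, and C++ than in less common ones such as Elixir, Erlang, and Racket (Hypothesis 3). To investigate these hypotheses, we run automated experiments using Python scripts that generate prompts instructing ChatGPT to create Python solutions. These solutions are stored and manually submitted to LeetCode to check their correctness. For Hypothesis 1, the results show that the GPT-3.5-turbo model solves 92% of easy, 79% of medium, and 51% of hard problems. For Hypothesis 2, prompt engineering yields improvements: 14-29% for Chain of Thought prompting, 38-60% from providing failed test cases in a second feedback prompt, and 33-58% from switching to GPT-4. On a random subset of the problems ChatGPT solved in Python, it also solved 78% in Java, 50% in C++, and none in Elixir, Erlang, or Racket. These findings generally validate all three hypotheses.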
SecEncoder: Logs are All You Need in Security
2411.07528v1 by Muhammed Fatih Bulut, Yingqi Liu, Naveed Ahmad, Maximilian Turner, Sami Ait Ouahmane, Cameron Andrews, Lloyd Greenwald
Large and Small Language Models (LMs) are typically pretrained using extensive volumes of text, which are sourced from publicly accessible platforms such as Wikipedia, Book Corpus, or through web scraping. These models, due to their exposure to a wide range of language data, exhibit impressive generalization capabilities and can perform a multitude of tasks simultaneously. However, they often fall short when it comes to domain-specific tasks due to their broad training data. This paper introduces SecEncoder, a specialized small language model that is pretrained using security logs. SecEncoder is designed to address the domain-specific limitations of general LMs by focusing on the unique language and patterns found in security logs. Experimental results indicate that SecEncoder outperforms other LMs, such as BERTlarge, DeBERTa-v3-large and OpenAI's Embedding (text-embedding-ada-002) models, which are pretrained mainly on natural language, across various tasks. Furthermore, although SecEncoder is primarily pretrained on log data, it outperforms models pretrained on natural language for a range of tasks beyond log analysis, such as incident prioritization and threat intelligence document retrieval. This suggests that domain-specific pretraining with logs can significantly enhance the performance of LMs in security. These findings pave the way for future research into security-specific LMs and their potential applications.
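Abstract: Large and small language models (LMs) are typically pretrained on large volumes of text drawn from publicly accessible sources such as Wikipedia and Book Corpus, or obtained through web scraping. Because they are exposed to a wide range of language data, these models show impressive generalization capabilities and can perform many tasks at once. However, their broad training data often leaves them short on domain-specific tasks. This paper introduces SecEncoder, a specialized small language model pretrained on security logs. SecEncoder is designed to address the domain-specific limitations of general LMs by focusing on the distinctive language and patterns found in security logs. Experimental results indicate that, across various tasks, SecEncoder outperforms other LMs pretrained mainly on natural language, such as BERTlarge, DeBERTa-v3-large, and OpenAI's Embedding (text-embedding-ada-002) models. Furthermore, although SecEncoder is pretrained primarily on log data, it outperforms models pretrained on natural language on a range of tasks beyond log analysis, such as incident prioritization and threat intelligence document retrieval. This suggests that domain-specific pretraining on logs can significantly improve the performance of LMs in security. These findings pave the way for future research into security-specific LMs and their potential applications.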
Prompt-enhanced Network for Hateful Meme Classification
2411.07527v1 by Junxi Liu, Yanyan Feng, Jiehai Chen, Yun Xue, Fenghuan Li
The dynamic expansion of social media has led to an inundation of hateful memes on media platforms, accentuating the growing need for efficient identification and removal. Acknowledging the constraints of conventional multimodal hateful meme classification, which heavily depends on external knowledge and poses the risk of including irrelevant or redundant content, we developed Pen -- a prompt-enhanced network framework based on the prompt learning approach. Specifically, after constructing the sequence through the prompt method and encoding it with a language model, we performed region information global extraction on the encoded sequence for multi-view perception. By capturing global information about inference instances and demonstrations, Pen facilitates category selection by fully leveraging sequence information. This approach significantly improves model classification accuracy. Additionally, to bolster the model's reasoning capabilities in the feature space, we introduced prompt-aware contrastive learning into the framework to improve the quality of sample feature distributions. Through extensive ablation experiments on two public datasets, we evaluate the effectiveness of the Pen framework, concurrently comparing it with state-of-the-art model baselines. Our research findings highlight that Pen surpasses manual prompt methods, showcasing superior generalization and classification accuracy in hateful meme classification tasks. Our code is available at https://github.com/juszzi/Pen.
Fair Summarization: Bridging Quality and Diversity in Extractive Summaries
2411.07521v1 by Sina Bagheri Nezhad, Sayan Bandyapadhyay, Ameeta Agrawal
Fairness in multi-document summarization of user-generated content remains a critical challenge in natural language processing (NLP). Existing summarization methods often fail to ensure equitable representation across different social groups, leading to biased outputs. In this paper, we introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints. We evaluate these methods on the Divsumm summarization dataset of White-aligned, Hispanic, and African-American dialect tweets and compare them against relevant baselines. Results obtained with a comprehensive set of summarization quality metrics (SUPERT, BLANC, SummaQA, BARTScore, and UniEval), as well as a fairness metric F, demonstrate that FairExtract and FairGPT achieve superior fairness while maintaining competitive summarization quality. Additionally, we introduce composite metrics (e.g., SUPERT+F, BLANC+F) that integrate quality and fairness into a single evaluation framework, offering a more nuanced understanding of the trade-offs between these objectives. This work highlights the importance of fairness in summarization and sets a benchmark for future research in fairness-aware NLP models.
摘要:多文件摘要中用户生成内容的公平性仍然是自然语言处理 (NLP) 中的一项重大挑战。现有的摘要方法通常无法确保不同社会群体的公平代表性,从而导致输出有偏差。在本文中,我们介绍了两种用于公平提取摘要的新方法:基于聚类的 FairExtract 方法和利用具有公平性约束的 GPT-3.5-turbo 的 FairGPT 方法。我们使用 Divsumm 摘要数据集(包含白人、西班牙裔和非裔美国人方言推文)评估了这些方法,并将它们与相关的基线进行了比较。使用一组全面的摘要质量指标(例如 SUPERT、BLANC、SummaQA、BARTScore 和 UniEval)以及公平性指标 F 获得的结果表明,FairExtract 和 FairGPT 在保持有竞争力的摘要质量的同时实现了卓越的公平性。此外,我们引入了复合指标(例如 SUPERT+F、BLANC+F),将质量和公平性整合到一个评估框架中,从而更细致地理解这些目标之间的权衡。这项工作强调了公平性在摘要中的重要性,并为公平性感知 NLP 模型的未来研究设定了基准。
TIPS: Threat Actor Informed Prioritization of Applications using SecEncoder
2411.07519v1 by Muhammed Fatih Bulut, Acar Tamersoy, Naveed Ahmad, Yingqi Liu, Lloyd Greenwald
This paper introduces TIPS: Threat Actor Informed Prioritization using SecEncoder, a specialized language model for security. TIPS combines the strengths of both encoder and decoder language models to detect and prioritize compromised applications. By integrating threat actor intelligence, TIPS enhances the accuracy and relevance of its detections. Extensive experiments with a real-world benchmark dataset of applications demonstrate TIPS's high efficacy, achieving an F-1 score of 0.90 in identifying malicious applications. Additionally, in real-world scenarios, TIPS significantly reduces the backlog of investigations for security analysts by 87%, thereby streamlining the threat response process and improving overall security posture.
摘要:本文介紹 TIPS:威脅行為者資訊優先順序,使用 SecEncoder,一種專門用於安全性的語言模型。TIPS 結合編碼器和解碼器語言模型的優點,以偵測和優先處理受入侵的應用程式。透過整合威脅行為者情報,TIPS 提升其偵測的準確性和相關性。使用真實世界基準資料集的應用程式的廣泛實驗證明了 TIPS 的高效率,在識別惡意應用程式時達到 0.90 的 F-1 分數。此外,在真實世界場景中,TIPS 將安全分析師的調查積壓減少了 87%,從而簡化了威脅應變程序並改善整體安全態勢。
LLM App Squatting and Cloning
2411.07518v1 by Yinglin Xie, Xinyi Hou, Yanjie Zhao, Kai Chen, Haoyu Wang
Impersonation tactics, such as app squatting and app cloning, have posed longstanding challenges in mobile app stores, where malicious actors exploit the names and reputations of popular apps to deceive users. With the rapid growth of Large Language Model (LLM) stores like GPT Store and FlowGPT, these issues have similarly surfaced, threatening the integrity of the LLM app ecosystem. In this study, we present the first large-scale analysis of LLM app squatting and cloning using our custom-built tool, LLMappCrazy. LLMappCrazy covers 14 squatting generation techniques and integrates Levenshtein distance and BERT-based semantic analysis to detect cloning by analyzing app functional similarities. Using this tool, we generated variations of the top 1000 app names and found over 5,000 squatting apps in the dataset. Additionally, we observed 3,509 squatting apps and 9,575 cloning cases across six major platforms. After sampling, we find that 18.7% of the squatting apps and 4.9% of the cloning apps exhibited malicious behavior, including phishing, malware distribution, fake content dissemination, and aggressive ad injection.
摘要:冒充策略,例如應用程式搶註和應用程式複製,已對行動應用程式商店構成長期的挑戰,惡意行為者利用熱門應用程式的名稱和聲譽來欺騙使用者。隨著大型語言模型 (LLM) 商店,例如 GPT Store 和 FlowGPT 的快速成長,這些問題也隨之浮現,威脅到 LLM 應用程式生態系統的完整性。在這項研究中,我們使用自訂建置的工具 LLMappCrazy,針對 LLM 應用程式搶註和複製進行首次大規模分析。LLMappCrazy 涵蓋 14 種搶註產生技術,並整合 Levenshtein 距離和基於 BERT 的語意分析,透過分析應用程式功能相似性來偵測複製。使用此工具,我們產生前 1000 個應用程式名稱的變體,並在資料集中發現超過 5,000 個搶註應用程式。此外,我們在六個主要平台上觀察到 3,509 個搶註應用程式和 9,575 個複製案例。在抽樣後,我們發現 18.7% 的搶註應用程式和 4.9% 的複製應用程式表現出惡意行為,包括網路釣魚、惡意軟體散布、假內容散布和強制廣告植入。
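The abstract above names Levenshtein distance as one of the signals LLMappCrazy uses. As a minimal sketch of how edit distance can flag look-alike app names (not the authors' tool): the function names, the example app names, and the `max_distance=2` threshold below are all illustrative assumptions, and the BERT-based semantic step is not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def squatting_candidates(target, names, max_distance=2):
    """Return (name, distance) pairs that are close to, but not equal to, target.

    max_distance is a hypothetical threshold chosen for illustration only.
    """
    target_lower = target.lower()
    hits = []
    for name in names:
        distance = levenshtein(target_lower, name.lower())
        if 0 < distance <= max_distance:
            hits.append((name, distance))
    return hits


if __name__ == "__main__":
    store = ["ChatPDF", "ChatPDFs", "ChatPFD", "Weather Bot"]
    print(squatting_candidates("ChatPDF", store))
```

Plain edit distance only catches typo-style variants; names that differ by added words or by meaning would need the semantic comparison the abstract mentions.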
SparrowVQE: Visual Question Explanation for Course Content Understanding
2411.07516v1 by Jialu Li, Manish Kumar Thota, Ruslan Gokhman, Radek Holik, Youshan Zhang
Visual Question Answering (VQA) research seeks to create AI systems that answer natural language questions about images, yet VQA methods often yield overly simplistic and short answers. This paper aims to advance the field by introducing Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations rather than brief responses and addresses the need for more complex interaction with visual content. We first created an MLVQE dataset from a 14-week streamed video machine learning course, including 885 slide images, 110,407 words of transcripts, and 9,416 designed question-answer (QA) pairs. Next, we proposed SparrowVQE, a small 3-billion-parameter multimodal model. We trained our model with a three-stage training mechanism consisting of multimodal pre-training (slide image and transcript feature alignment), instruction tuning (tuning the pre-trained model with transcripts and QA pairs), and domain fine-tuning (fine-tuning on slide images and QA pairs). Finally, SparrowVQE can understand and connect visual information, encoded with the SigLIP model, with transcripts, processed by the Phi-2 language model, through an MLP adapter. Experimental results demonstrate that SparrowVQE achieves better performance on our MLVQE dataset and outperforms state-of-the-art methods on five other benchmark VQA datasets. The source code is available at https://github.com/YoushanZhang/SparrowVQE.
An Attack Traffic Identification Method Based on Temporal Spectrum
2411.07510v1 by Wenwei Xie, Jie Yin, Zihao Chen
To address the issues of insufficient robustness, unstable features, and data noise interference in existing network attack detection and identification models, this paper proposes an attack traffic detection and identification method based on temporal spectrum. First, traffic data is segmented by a sliding window to construct a feature sequence and a corresponding label sequence for network traffic. Next, the proposed spectral label generation methods, SSPE and COAP, are applied to transform the label sequence into spectral labels and the feature sequence into temporal features. Spectral labels and temporal features are used to capture and represent behavioral patterns of attacks. Finally, the constructed temporal features and spectral labels are used to train models, which subsequently detect and identify network attack behaviors. Experimental results demonstrate that, compared to traditional methods, models trained with the SSPE or COAP method improve identification accuracy by 10% and exhibit strong robustness, particularly in noisy environments.
摘要:為了解決現有網路攻擊偵測與識別模型中,魯棒性不足、特徵不穩定、資料雜訊干擾等問題,本文提出基於時域頻譜的攻擊流量偵測與識別方法。首先,透過滑動視窗將流量資料進行分段,建構網路流量的特徵序列與對應標籤序列。接著,應用所提出的頻譜標籤產生方法 SSPE 與 COAP,將標籤序列轉換為頻譜標籤,並將特徵序列轉換為時域特徵。頻譜標籤與時域特徵用於擷取與表示攻擊的行為模式。最後,將建構的時域特徵與頻譜標籤用於模型訓練,後續偵測與識別網路攻擊行為。實驗結果顯示,與傳統方法相比,使用 SSPE 或 COAP 方法訓練的模型,識別準確度提升 10%,且展現強大的魯棒性,特別是在雜訊環境中。
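The first step described above, segmenting traffic with a sliding window into paired feature and label sequences, can be sketched as follows. This is an illustration under stated assumptions: the function name, the `window` and `stride` parameters, and the toy data are hypothetical, and the SSPE and COAP transforms are not reproduced because the abstract does not define them.

```python
def sliding_windows(features, labels, window, stride=1):
    """Cut aligned feature/label streams into overlapping fixed-length windows.

    Returns two lists of equal length: feature sequences and label sequences.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")

    feature_seqs, label_seqs = [], []
    # Stop at the last start index that still yields a full window.
    for start in range(0, len(features) - window + 1, stride):
        feature_seqs.append(features[start:start + window])
        label_seqs.append(labels[start:start + window])
    return feature_seqs, label_seqs


if __name__ == "__main__":
    packet_sizes = [1, 2, 3, 4, 5]   # toy per-packet feature
    attack_flags = [0, 0, 1, 1, 0]   # toy per-packet label
    print(sliding_windows(packet_sizes, attack_flags, window=3, stride=2))
```

Each label window would then be the input to whichever label transform is used downstream.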
FM-TS: Flow Matching for Time Series Generation
2411.07506v1 by Yang Hu, Xiao Wang, Lirong Wu, Huatian Zhang, Stan Z. Li, Sheng Wang, Tianlong Chen
Time series generation has emerged as an essential tool for analyzing temporal data across numerous fields. While diffusion models have recently gained significant attention in generating high-quality time series, they tend to be computationally demanding and reliant on complex stochastic processes. To address these limitations, we introduce FM-TS, a rectified Flow Matching-based framework for Time Series generation, which simplifies the time series generation process by directly optimizing continuous trajectories. This approach avoids the need for iterative sampling or complex noise schedules typically required in diffusion-based models. FM-TS is more efficient in terms of training and inference. Moreover, FM-TS is highly adaptive, supporting both conditional and unconditional time series generation. Notably, through our novel inference design, the model trained in an unconditional setting can seamlessly generalize to conditional tasks without the need for retraining. Extensive benchmarking across both settings demonstrates that FM-TS consistently delivers superior performance compared to existing approaches while being more efficient in terms of training and inference. For instance, in terms of discriminative score, FM-TS achieves 0.005, 0.019, 0.011, 0.005, 0.053, and 0.106 on the Sines, Stocks, ETTh, MuJoCo, Energy, and fMRI unconditional time series datasets, respectively, significantly outperforming the second-best method which achieves 0.006, 0.067, 0.061, 0.008, 0.122, and 0.167 on the same datasets. We have achieved superior performance in solar forecasting and MuJoCo imputation tasks, significantly enhanced by our innovative $t$ power sampling method. The code is available at https://github.com/UNITES-Lab/FMTS.
摘要:時序生成已成為分析各領域中時間資料的重要工具。儘管擴散模型最近在生成高品質時序方面獲得顯著關注,但它們往往需要大量的運算,並依賴於複雜的隨機過程。為了解決這些限制,我們引入了 FM-TS,一個基於修正流匹配的時序生成框架,透過直接最佳化連續軌跡來簡化時序生成過程。此方法避免了在基於擴散的模型中通常需要的反覆抽樣或複雜雜訊排程。FM-TS 在訓練和推論方面更有效率。此外,FM-TS 具有高度適應性,支援條件式和非條件式時序生成。值得注意的是,透過我們新穎的推論設計,在非條件式設定中訓練的模型可以無縫地推廣到條件式任務,而無需重新訓練。在兩種設定中的廣泛基準測試證明,與現有方法相比,FM-TS 持續提供優異的效能,同時在訓練和推論方面更有效率。例如,在判別分數方面,FM-TS 分別在 Sines、Stocks、ETTh、MuJoCo、Energy 和 fMRI 非條件式時序資料集上達到 0.005、0.019、0.011、0.005、0.053 和 0.106,顯著優於在相同資料集上達到 0.006、0.067、0.061、0.008、0.122 和 0.167 的第二佳方法。我們在太陽能預測和 MuJoCo 插補任務中取得了優異的效能,這得益於我們創新的 $t$ 次方抽樣方法。程式碼可在 https://github.com/UNITES-Lab/FMTS 取得。
LAUREL: Learned Augmented Residual Layer
2411.07501v1 by Gaurav Menghani, Ravi Kumar, Sanjiv Kumar
One of the core pillars of efficient deep learning methods is architectural improvements such as the residual/skip connection, which has led to significantly better model convergence and quality. Since then, the residual connection has become ubiquitous not just in convolutional neural networks but also in transformer-based architectures, the backbone of LLMs. In this paper we introduce the Learned Augmented Residual Layer (LAuReL), a novel generalization of the canonical residual connection, designed as an in-situ replacement for the latter that outperforms it on both model quality and footprint metrics. Our experiments show that using LAuReL can help boost performance for both vision and language models. For example, on the ResNet-50, ImageNet 1K task, it achieves 60% of the gains from adding an extra layer while adding only 0.003% more parameters, and matches those gains while adding 2.6× fewer parameters.
摘要:高效深度學習方法的核心支柱之一是架構改進,例如殘差/跳躍連接,這已導致模型收斂性和品質顯著提升。從那時起,殘差連接不僅在卷積神經網路中普遍存在,也在基於轉換器的架構中普遍存在,後者是 LLM 的骨幹。 在本文中,我們介紹了「學習增強殘差層」(LAuReL) -- 標準殘差連接的新穎概括 -- 目標是在模型品質和佔用空間指標上都優於後者,同時成為後者的原位替換。我們的實驗表明,使用 \laurel 可以幫助提升視覺和語言模型的效能。例如,在 ResNet-50、ImageNet 1K 任務上,它達到了增加一層的 $60\%$ 收益,同時只增加了 $0.003\%$ 的參數,並在增加 $2.6\times$ 更少參數的情況下與其匹配。
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
2411.07494v1 by Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma
As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of the proliferation model and the number of proliferated examples play a key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse.
摘要:隨著大型語言模型(LLM)變得越來越強大,確保它們不會被濫用變得至關重要。儘管研究人員專注於開發強大的防禦措施,但目前還沒有任何方法能完全抵禦攻擊。我們提出了一種替代方法:不是尋求完美的對抗性穩健性,而是開發快速的應對技術,在僅觀察到少數攻擊後就能阻止整類越獄。為了研究這種設定,我們開發了 RapidResponseBench,這是一個基準測試,用於衡量防禦措施在適應少數觀察到的範例後對各種越獄策略的穩健性。我們評估了五種快速應對方法,所有這些方法都使用越獄擴散,我們自動生成與觀察到的範例類似的其他越獄。我們最強大的方法是微調輸入分類器以阻止擴散的越獄,它將攻擊成功率降低了 240 倍以上,在分佈式越獄集合上,以及在觀察到每個越獄策略的一個範例後,在非分佈式集合上降低了 15 倍以上。此外,進一步的研究表明,擴散模型的品質和擴散範例的數量在這種防禦措施的有效性中扮演了關鍵角色。總的來說,我們的結果突顯了快速應對新型越獄以限制 LLM 濫用的潛力。
Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models
2411.07474v1 by Daria Kryvosheieva, Roger Levy
Language models (LMs) are capable of acquiring elements of human-like syntactic knowledge. Targeted syntactic evaluation tests have been employed to measure how well they form generalizations about syntactic phenomena in high-resource languages such as English. However, we still lack a thorough understanding of LMs' capacity for syntactic generalizations in low-resource languages, which are responsible for much of the diversity of syntactic patterns worldwide. In this study, we develop targeted syntactic evaluation tests for three low-resource languages (Basque, Hindi, and Swahili) and use them to evaluate five families of open-access multilingual Transformer LMs. We find that some syntactic tasks prove relatively easy for LMs while others (agreement in sentences containing indirect objects in Basque, agreement across a prepositional phrase in Swahili) are challenging. We additionally uncover issues with publicly available Transformers, including a bias toward the habitual aspect in Hindi in multilingual BERT and underperformance compared to similar-sized models in XGLM-4.5B.
摘要:語言模型 (LM) 能夠習得類似人類的語法知識元素。目標語法評量測試已被用來衡量他們在高資源語言(例如英語)中對語法現象的概括能力。然而,我們仍然缺乏對 LM 在低資源語言中進行語法概括的能力的透徹了解,而低資源語言正是造成全球語法模式多樣性的主要原因。在本研究中,我們針對三種低資源語言(巴斯克語、印地語和斯瓦希里語)開發了目標語法評量測試,並使用它們來評量五個開放式多語言 Transformer LM 家族。我們發現,某些語法任務對 LM 來說相對容易,而其他任務(包含巴斯克語間接受詞的句子中的一致性、斯瓦希里語介系詞短語中的一致性)則具有挑戰性。我們另外揭露了公開可用的 Transformer 的問題,包括多語言 BERT 中對印地語習慣體的偏誤,以及與 XGLM-4.5B 中大小相似的模型相比表現不佳。
IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark
2411.07466v1 by Kawshik Manikantan, Makarand Tapaswi, Vineet Gandhi, Shubham Toshniwal
Recent evaluations of LLMs on coreference resolution have revealed that traditional output formats and evaluation metrics do not fully capture the models' referential understanding. To address this, we introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format, commonly used for evaluating LLMs. IdentifyMe features long narratives and employs heuristics to exclude easily identifiable mentions, creating a more challenging task. The benchmark also consists of a curated mixture of different mention types and corresponding entities, allowing for a fine-grained analysis of model performance. We evaluate both closed- and open-source LLMs on IdentifyMe and observe a significant performance gap (20-30%) between the state-of-the-art sub-10B open models and closed ones. We observe that pronominal mentions, which have limited surface information, are typically much harder for models to resolve than nominal mentions. Additionally, we find that LLMs often confuse entities when their mentions overlap in nested structures. The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs while also indicating room for further improvement.
摘要:最近對大型語言模型 (LLM) 關於共同指稱消解的評估顯示,傳統的輸出格式和評估指標並未完全掌握模型的指稱理解。為了解決這個問題,我們引入了 IdentifyMe,這是一個以多選題 (MCQ) 格式呈現的提及消解新基準,通常用於評估 LLM。IdentifyMe 採用長篇敘事,並使用啟發法排除容易識別的提及,創造更具挑戰性的任務。此基準還包含經過整理的不同提及類型和對應實體的混合,允許對模型效能進行細緻的分析。我們在 IdentifyMe 上評估閉源和開源 LLM,並觀察到最先進的低於 10B 開放模型與閉源模型之間有顯著的效能差距 (20-30%)。我們觀察到,表面資訊有限的代名詞提及通常比名詞提及更難讓模型解析。此外,我們發現當 LLM 的提及在巢狀結構中重疊時,它們經常會混淆實體。得分最高的模型 GPT-4o 達到了 81.9% 的準確度,突顯了最先進 LLM 強大的指稱能力,同時也表示仍有進步的空間。
BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks
2411.07464v1 by Shubham Gandhi, Manasi Patwardhan, Lovekesh Vig, Gautam Shroff
Large Language Models (LLMs) excel in diverse applications including generation of code snippets, but often struggle with generating code for complex Machine Learning (ML) tasks. Although existing LLM single-agent based systems give varying performance depending on the task complexity, they rely purely on larger and more expensive models such as GPT-4. Our investigation reveals that no-cost and low-cost models such as Gemini-Pro, Mixtral and CodeLlama perform far worse than GPT-4 in a single-agent setting. With the motivation of developing a cost-efficient LLM-based solution for solving ML tasks, we propose an LLM Multi-Agent based system which leverages a combination of experts using profiling, efficient retrieval of past observations, LLM cascades, and ask-the-expert calls. Through empirical analysis on ML engineering tasks in the MLAgentBench benchmark, we demonstrate the effectiveness of our system, using a no-cost model, namely Gemini, as the base LLM, paired with GPT-4 in the cascade and as the expert that serves occasional ask-the-expert calls for planning. With a 94.2% reduction in cost (from $0.931 per run, averaged over all tasks, for the GPT-4 single-agent system to $0.054), our system yields a better average success rate of 32.95%, compared with the 22.72% success rate of the GPT-4 single-agent system, averaged over all the tasks of MLAgentBench.
摘要:大型語言模型(LLM)在各種應用中表現出色,包括產生程式碼片段,但常常在產生複雜機器學習(ML)任務的程式碼時遇到困難。儘管現有的 LLM 單一代理人系統會根據任務複雜度提供不同的效能,但它們完全依賴於較大且昂貴的模型,例如 GPT-4。我們的調查顯示,在單一代理人設定中,無成本和低成本模型(例如 Gemini-Pro、Mixtral 和 CodeLlama)的效能遠低於 GPT-4。在開發一種成本效益高的基於 LLM 的解決方案以解決 ML 任務的動機下,我們提出了一個基於 LLM 多代理人的系統,該系統利用專家組合,使用剖析、有效擷取過去的觀察結果、LLM 串接,以及尋求專家建議的呼叫。透過對 MLAgentBench 基準中的 ML 工程任務進行實證分析,我們展示了我們系統的有效性,使用無成本模型,即 Gemini 作為基礎 LLM,與 GPT-4 串接,並讓專家負責偶爾尋求專家建議的呼叫以進行規劃。我們的系統成本降低了 94.2%(從 GPT-4 單一代理人系統所有任務的平均執行成本 0.931 美元降低到 0.054 美元),能夠產生更好的平均成功率 32.95%,而 GPT-4 單一代理人系統在 MLAgentBench 的所有任務中平均產生 22.72% 的成功率。
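The headline numbers in the BudgetMLAgent abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


# Per-run cost in USD, averaged over all MLAgentBench tasks (from the abstract).
gpt4_single_agent_cost = 0.931
multi_agent_cost = 0.054
cost_reduction_pct = 100 * relative_reduction(gpt4_single_agent_cost, multi_agent_cost)

# Average success rates in percent (from the abstract).
gpt4_single_agent_success = 22.72
multi_agent_success = 32.95
success_gain_points = multi_agent_success - gpt4_single_agent_success

if __name__ == "__main__":
    print(f"cost reduction: {cost_reduction_pct:.1f}%")
    print(f"success-rate gain: {success_gain_points:.2f} percentage points")
```

The reduction works out to (0.931 - 0.054) / 0.931, which rounds to the 94.2% stated in the abstract, alongside a gain of about 10.23 percentage points in average success rate.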
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
2411.07461v1 by Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, Ran Xu
We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text. KALE augments synthetic dense image captions with web-scale alt-text to generate factually grounded image captions. Our two-stage approach leverages large vision-language models and language models to create knowledge-augmented captions, which are then used to train a specialized VLM for scaling up the dataset. We train vision-language models on KALE and demonstrate improvements on vision-language tasks. Our experiments show the utility of KALE for training more capable and knowledgeable multimodal models. We release the KALE dataset at https://huggingface.co/datasets/Salesforce/blip3-kale
摘要:我們介紹 BLIP3-KALE,一個包含 2.18 億張圖片文字對應的資料集,它縮小了描述性合成標題和事實性的網路規模 alt 文字之間的差距。KALE 使用網路規模的 alt 文字來擴充合成密集圖像標題,以產生有事實根據的圖像標題。我們的兩階段方法利用大型視覺語言模型和語言模型來建立知識擴充標題,然後用於訓練專門的 VLM 以擴充資料集。我們在 KALE 上訓練視覺語言模型,並展示在視覺語言任務上的改進。我們的實驗顯示了 KALE 在訓練更強大且知識豐富的多模態模型方面的實用性。我們在 https://huggingface.co/datasets/Salesforce/blip3-kale 上發布 KALE 資料集
DecoPrompt: Decoding Prompts Reduces Hallucinations when Large Language Models Meet False Premises
2411.07457v1 by Nan Xu, Xuezhe Ma
While large language models (LLMs) have demonstrated increasing power, they have also prompted studies of their hallucinated outputs, which deviate from factually correct statements. In this paper, we focus on one important scenario, false premises, where LLMs are distracted by misaligned claims even though the model possesses the factual knowledge required to answer the original questions accurately. Inspired by the observation that the entropy of a false-premise prompt is closely related to its likelihood of eliciting hallucinated generations, we propose a new prompting algorithm, named DecoPrompt, to mitigate hallucination. DecoPrompt leverages LLMs to "decode" the false-premise prompts without actually eliciting hallucinated output from LLMs. We perform experiments on two datasets, demonstrating that DecoPrompt can effectively reduce hallucinations in outputs from different LLMs. Moreover, DecoPrompt exhibits cross-model transferability, which facilitates its application to scenarios such as very large LLMs or unavailable model logits.
摘要:儘管大型語言模型(LLM)已展現出越來越強大的能力,但它們也需要針對其虛構輸出進行研究,這些輸出偏離了事實正確的陳述。在本文中,我們專注於一個錯誤前提的重要場景,在該場景中,LLM 會被錯誤的說法分散注意力,儘管該模型具備準確回答原始問題所需的實際知識。受虛假前提提示的熵與其引發幻覺產生的可能性密切相關的觀察結果啟發,我們提出了一種名為 DecoPrompt 的新提示演算法,以減輕幻覺。DecoPrompt 利用 LLM 來「解碼」錯誤前提提示,而不會真正引發 LLM 的幻覺輸出。我們在兩個資料集上執行實驗,證明 DecoPrompt 可以有效減少不同 LLM 輸出中的幻覺。此外,DecoPrompt 展現出跨模型的可轉移性,這有助於其應用於大型 LLM 或不可用模型邏輯值等場景。
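The abstract above links the entropy of a prompt to its tendency to elicit hallucination, but does not say how that entropy is computed. As a rough sketch under an explicit assumption, the code below treats prompt entropy as the mean Shannon entropy of a model's next-token distributions over the prompt positions; the distributions are made-up toy values, and both function names are hypothetical rather than part of DecoPrompt.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_prompt_entropy(per_token_distributions):
    """Average entropy across the next-token distributions of a prompt.

    Assumption: this averaging is an illustrative definition, not
    necessarily the one used in the paper.
    """
    entropies = [shannon_entropy(dist) for dist in per_token_distributions]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Toy distributions over a 2-token vocabulary at two prompt positions.
    confident_prompt = [[1.0, 0.0], [0.9, 0.1]]
    uncertain_prompt = [[0.5, 0.5], [0.6, 0.4]]
    print(mean_prompt_entropy(confident_prompt))
    print(mean_prompt_entropy(uncertain_prompt))
```

With real models the distributions would come from the logits at each prompt position, which is why the abstract's point about transferring across models matters when logits are unavailable.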
Research on fault diagnosis of nuclear power first-second circuit based on hierarchical multi-granularity classification network
2411.07453v1 by Jiangwen Chen, Siwei Li, Guo Jiang, Cheng Dongzhen, Lin Hua, Wang Wei
The safe and reliable operation of complex electromechanical systems in nuclear power plants is crucial for the safe production of nuclear power plants and their nuclear power units. Therefore, accurate and timely fault diagnosis of nuclear power systems is of great significance for ensuring the safe and reliable operation of nuclear power plants. Existing fault diagnosis methods mainly target a single device or subsystem, making it difficult to analyze the inherent connections and mutual effects between different types of faults at the level of the entire unit. This article uses the AP1000 full-scale simulator to simulate important mechanical component failures of some key systems in the primary and secondary circuits of nuclear power units, and constructs a fault dataset. A hierarchical multi-granularity classification fault diagnosis model based on the EfficientNet large model is also proposed, aiming to achieve hierarchical classification of nuclear power faults. The results indicate that the proposed fault diagnosis model can effectively classify faults in different circuits and system components of nuclear power units into hierarchical categories. However, the fault dataset in this study was obtained from a simulator, which may introduce additional information due to parameter redundancy, thereby affecting the diagnostic performance of the model.
摘要:複雜機電系統在核能電廠的安全可靠運行,對於核能電廠及其核能機組的安全發電至關重要。因此,核能系統準確及時的故障診斷對於保障核能電廠的安全可靠運行具有重大意義。現有的故障診斷方法主要針對單一設備或子系統,難以分析全機組層面不同類型故障間的內在聯繫和相互影響。本文利用AP1000滿功率模擬器模擬核能機組一、二次迴路部分關鍵系統的重要機械組件故障,構建故障數據集。同時,提出基於EfficientNet大模型的分層多粒度分類故障診斷模型,旨在實現核能故障的分層分類。結果表明,所提故障診斷模型能夠有效地將核能機組不同迴路和系統組件的故障分層分類。但本研究中的故障數據集來源於模擬器,由於參數冗餘可能會引入額外的信息,從而影響模型的診斷性能。
Optimizing Data Delivery: Insights from User Preferences on Visuals, Tables, and Text
2411.07451v1 by Reuben Luera, Ryan Rossi, Franck Dernoncourt, Alexa Siu, Sungchul Kim, Tong Yu, Ruiyi Zhang, Xiang Chen, Nedim Lipka, Zhehao Zhang, Seon Gyeom Kim, Tak Yeon Lee
In this work, we study whether users prefer to see a chart, a table, or text in response to a question they ask. This enables us to understand when it is best to show a chart, table, or text to the user for a specific question. To this end, we conduct a user study in which users are shown a question and asked what they would prefer to see, and we use the data to establish that a user's personal traits do influence the data outputs that they prefer. Understanding how user characteristics impact a user's preferences is critical to creating data tools with a better user experience. Additionally, we investigate to what degree an LLM can be used to replicate a user's preference with and without user preference data. Overall, these findings have significant implications for the development of data tools and the replication of human preferences using LLMs. Furthermore, this work demonstrates the potential use of LLMs to replicate user preference data, which has major implications for future user modeling and personalization research.
摘要:在這項工作中,我們研究使用者偏好,以便在使用者提出問題時,可以看到圖表、表格或文字。這讓我們得以了解在特定問題中,什麼時候向使用者顯示圖表、表格或文字是最好的。為此,我們進行了一項使用者研究,在研究中向使用者顯示一個問題,並詢問他們希望看到什麼,並使用資料來建立使用者的個人特質確實會影響他們偏好的資料輸出。了解使用者的特質如何影響使用者的偏好,對於建立具有更好使用者體驗的資料工具至關重要。此外,我們調查了 LLM 在有和沒有使用者偏好資料的情況下,可用於複製使用者偏好的程度。整體來說,這些發現對於資料工具的開發和使用 LLM 複製人類偏好具有重要的意義。此外,這項工作展示了 LLM 複製使用者偏好資料的潛在用途,這對未來的使用者建模和個人化研究具有重大意義。
The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving
2411.07447v1 by Kyoungmin Kim, Kijae Hong, Caglar Gulcehre, Anastasia Ailamaki
The growing usage of Large Language Models (LLMs) highlights the demands and challenges in scalable LLM inference systems, affecting deployment and development processes. On the deployment side, there is a lack of comprehensive analysis on the conditions under which a particular scheduler performs better or worse, with performance varying substantially across different schedulers, hardware, models, and workloads. Manually testing each configuration on GPUs can be prohibitively expensive. On the development side, unpredictable performance and unknown upper limits can lead to inconclusive trial-and-error processes, consuming resources on ideas that end up ineffective. To address these challenges, we introduce INFERMAX, an analytical framework that uses inference cost models to compare various schedulers, including an optimal scheduler formulated as a constraint satisfaction problem (CSP) to establish an upper bound on performance. Our framework offers in-depth analysis and raises essential questions, challenging assumptions and exploring opportunities for more efficient scheduling. Notably, our findings indicate that preempting requests can reduce GPU costs by 30% compared to avoiding preemptions at all. We believe our methods and insights will facilitate the cost-effective deployment and development of scalable, efficient inference systems and pave the way for cost-based scheduling.
摘要:大型語言模型 (LLM) 使用量不斷增加,突顯了可擴充 LLM 推論系統的需求和挑戰,影響部署和開發流程。在部署方面,對於特定排程器在何種條件下執行得更好或更差,缺乏全面的分析,效能因不同的排程器、硬體、模型和工作負載而有顯著差異。手動在 GPU 上測試每個組態可能會非常昂貴。在開發方面,不可預測的效能和未知的上限可能會導致無法得出結論的試錯流程,消耗無效想法的資源。為了應對這些挑戰,我們引入了 INFERMAX,一個分析架構,使用推論成本模型來比較各種排程器,包括一個最佳排程器,該排程器制定為約束滿足問題 (CSP),以建立效能的上限。我們的架構提供了深入的分析,並提出了重要的問題,挑戰假設並探索更有效排程的機會。值得注意的是,我們的研究結果表明,與完全避免搶先相比,搶先請求可以將 GPU 成本降低 30%。我們相信我們的技術和見解將促進可擴充、有效推論系統的經濟有效部署和開發,並為基於成本的排程鋪路。
Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection
2411.07446v1 by Cilin Yan, Jingyun Wang, Lin Zhang, Ruihui Zhao, Xiaopu Wu, Kai Xiong, Qingsong Liu, Guoliang Kang, Yangyang Kang
Automatic prompt engineering aims to enhance the generation quality of large language models (LLMs). Recent works utilize feedback generated from erroneous cases to guide prompt optimization. During inference, they may further retrieve several semantically related exemplars and concatenate them to the optimized prompts to improve performance. However, those works only utilize the feedback at the current step, ignoring historical and unselected feedback, which is potentially beneficial. Moreover, the selection of exemplars only considers the general semantic relationship and may not be optimal in terms of task performance and matching with the optimized prompt. In this work, we propose an Exemplar-Guided Reflection with Memory mechanism (ERM) to realize more efficient and accurate prompt optimization. Specifically, we design an exemplar-guided reflection mechanism where the feedback generation is additionally guided by the generated exemplars. We further build two kinds of memory to fully utilize the historical feedback information and support more effective exemplar retrieval. Empirical evaluations show that our method surpasses the previous state of the art with fewer optimization steps, e.g., improving the F1 score by 10.1 on the LIAR dataset and halving the optimization steps on ProTeGi.
摘要:自動提示工程旨在提升大型語言模型 (LLM) 的生成品質。近期研究利用錯誤案例產生的回饋來引導提示最佳化。在推論過程中,它們可能會進一步擷取幾個語義相關的範例,並將它們串接至最佳化的提示以提升效能。然而,這些研究僅利用當前步驟的回饋,忽略了潛在有益的歷史回饋和未選擇的回饋。此外,範例的選擇僅考慮一般語義關係,就任務效能和與最佳化提示的匹配而言可能不是最佳的。在這項研究中,我們提出一個具有記憶機制的範例引導反思 (ERM),以實現更有效率且準確的提示最佳化。具體來說,我們設計一個範例引導反思機制,其中回饋產生進一步由產生的範例引導。我們進一步建構兩種記憶體,以充分利用歷史回饋資訊,並支援更有效的範例擷取。經驗評估顯示,我們的方法以更少的最佳化步驟超越了先前的技術水準,亦即在 LIAR 資料集上將 F1 分數提升了 10.1,並在 ProTeGi 上減少了一半的最佳化步驟。
Input-Based Ensemble-Learning Method for Dynamic Memory Configuration of Serverless Computing Functions
2411.07444v1 by Siddharth Agarwal, Maria A. Rodriguez, Rajkumar Buyya
In today's Function-as-a-Service offerings, a programmer is usually responsible for configuring function memory for its successful execution, which allocates proportional function resources such as CPU and network. However, right-sizing the function memory forces developers to speculate about performance and make ad-hoc configuration decisions. Recent research has highlighted that a function's input characteristics, such as input size, type and number of inputs, significantly impact its resource demand, run-time performance and costs under fluctuating workloads. This correlation further makes memory configuration a non-trivial task. On that account, an input-aware function memory allocator not only improves developer productivity by completely hiding resource-related decisions but also creates an opportunity to reduce resource wastage and offer a finer-grained, cost-optimised pricing scheme. Therefore, we present MemFigLess, a serverless solution that estimates the memory requirement of a serverless function with input-awareness. The framework executes function profiling in an offline stage and trains a multi-output Random Forest Regression model on the collected metrics to invoke input-aware optimal configurations. We evaluate our work against state-of-the-art approaches on the AWS Lambda service and find that MemFigLess is able to capture the input-aware resource relationships, allocating up to 82% fewer resources and saving up to 87% of run-time costs.
摘要:在當今的函式即服務產品中,程式設計人員通常負責設定函式記憶體以利成功執行,這會配置成比例的函式資源,例如 CPU 和網路。不過,正確調整函式記憶體會強迫開發人員推測效能並做出臨時設定決策。最近的研究強調函式的輸入特徵(例如輸入大小、類型和輸入數量)會顯著影響其資源需求、執行時間效能和工作負載波動的成本。這種關聯性進一步使記憶體設定成為一項非平凡的任務。有鑑於此,一個具輸入感知能力的函式記憶體配置器不僅能透過完全隱藏與資源相關的決策來提升開發人員生產力,還能驅動一個機會來減少資源浪費並提供更細緻的成本最佳化定價方案。因此,我們提出 MemFigLess,這是一個無伺服器解決方案,可估計具輸入感知能力的無伺服器函式的記憶體需求。此架構在離線階段執行函式剖析,並針對收集的指標訓練一個多輸出隨機森林回歸模型,以呼叫具輸入感知能力的最佳設定。我們使用最先進的方法在 AWS Lambda 服務上評估我們的成果,發現 MemFigLess 能夠擷取具輸入感知能力的資源關係,並配置少達 82% 的資源,並節省高達 87% 的執行時間成本。
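A minimal sketch of input-aware memory selection, assuming offline profiling rows of (input size, memory setting, duration, success flag). A nearest-profile lookup stands in here for the multi-output Random Forest model named in the abstract; the profile values, the `PRICE_PER_GB_SECOND` constant, and the function names are all hypothetical.

```python
# Hypothetical offline profiling rows: (input_mb, memory_mb, duration_s, succeeded)
PROFILE = [
    (10, 128, 4.0, True),
    (10, 256, 2.1, True),
    (10, 512, 1.9, True),
    (100, 128, 0.0, False),  # assumed out-of-memory failure
    (100, 256, 20.0, True),
    (100, 512, 9.0, True),
]

PRICE_PER_GB_SECOND = 0.0000167  # illustrative value, not a quoted provider rate


def run_cost(memory_mb, duration_s):
    """Cost of one invocation under a GB-second billing model."""
    return (memory_mb / 1024) * duration_s * PRICE_PER_GB_SECOND


def pick_memory(input_mb):
    """Pick the cheapest successful memory setting for the nearest profiled input size."""
    sizes = {row[0] for row in PROFILE}
    nearest = min(sizes, key=lambda size: abs(size - input_mb))
    candidates = [row for row in PROFILE if row[0] == nearest and row[3]]
    best = min(candidates, key=lambda row: run_cost(row[1], row[2]))
    return best[1]


small_choice = pick_memory(12)   # nearest profiled size is 10 MB
large_choice = pick_memory(90)   # nearest profiled size is 100 MB
```

In this toy profile the small input is cheapest at the lowest memory setting, while the large input is cheapest at the highest one because the shorter run time outweighs the larger allocation.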
Automatically Detecting Online Deceptive Patterns in Real-time
2411.07441v1 by Asmit Nayak, Shirley Zhang, Yash Wani, Rishabh Khandelwal, Kassem Fawaz
Deceptive patterns (DPs) in digital interfaces manipulate users into making unintended decisions, exploiting cognitive biases and psychological vulnerabilities. These patterns have become ubiquitous across various digital platforms. While efforts to mitigate DPs have emerged from legal and technical perspectives, there remains a significant gap in usable solutions that empower users to identify and make informed decisions about DPs in real-time. In this work, we introduce AutoBot, an automated deceptive pattern detector that analyzes websites' visual appearances using machine learning techniques to identify and notify users of DPs in real-time. AutoBot employs a two-stage pipeline that processes website screenshots, identifying interactable elements and extracting textual features without relying on HTML structure. By leveraging a custom language model, AutoBot understands the context surrounding these elements to determine the presence of deceptive patterns. We implement AutoBot as a lightweight Chrome browser extension that performs all analyses locally, minimizing latency and preserving user privacy. Through extensive evaluation, we demonstrate AutoBot's effectiveness in enhancing users' ability to navigate digital environments safely while providing a valuable tool for regulators to assess and enforce compliance with DP regulations.
摘要:數位介面中的欺騙性模式 (DP) 會操縱使用者做出非預期的決定,利用認知偏誤和心理漏洞。這些模式已在各種數位平台上變得無所不在。雖然從法律和技術角度來看,已經出現減輕 DP 的努力,但仍然缺乏重要的可用解決方案,讓使用者能夠識別和做出有關 DP 的明智決定。在這項工作中,我們介紹了 AutoBot,這是一個自動化的欺騙性模式偵測器,它使用機器學習技術分析網站的視覺外觀,以識別和即時通知使用者 DP。AutoBot 採用一個兩階段管道,處理網站螢幕截圖、識別可互動元素,並在不依賴 HTML 結構的情況下擷取文字特徵。透過利用自訂語言模型,AutoBot 了解這些元素周圍的內容,以確定是否存在欺騙性模式。我們將 AutoBot 實作為一個輕量級的 Chrome 瀏覽器擴充功能,它在本地執行所有分析,將延遲降至最低並保護使用者隱私。透過廣泛的評估,我們展示了 AutoBot 在提升使用者安全瀏覽數位環境的能力方面的有效性,同時也為監管機構提供了一個有價值的工具,用於評估和強制遵守 DP 法規。
Predicting BWR Criticality with Data-Driven Machine Learning Model
2411.07425v1 by Muhammad Rizki Oktavian, Anirudh Tunga, Jonathan Nistor, James Tusar, J. Thomas Gruenwald, Yunlin Xu
One of the challenges in operating nuclear power plants is to decide the amount of fuel needed in a cycle. Large-scale nuclear power plants are designed to operate at base load, meaning that they are expected to always operate at full power. Economically, a nuclear power plant should burn enough fuel to maintain criticality until the end of a cycle (EOC). If the reactor goes subcritical before the end of a cycle, it may result in early coastdown as the fuel in the core is already depleted. Conversely, if the reactor still has significant excess reactivity by the end of a cycle, the remaining fuel will go unused. In both cases, the plant may lose a significant amount of money. This work proposes an innovative method based on a data-driven deep learning model to estimate the excess criticality of a boiling water reactor.
摘要:核能電廠運作中的一項挑戰,是要決定一個週期內所需的燃料量。大型核能電廠的設計是基於基本負載運作,表示預期它們總是會以全功率運作。在經濟層面上,核能電廠應該燃燒足夠的燃料,以維持臨界狀態直到週期結束 (EOC)。如果反應器在週期結束前進入次臨界狀態,由於爐心內的燃料已經耗盡,可能會導致提早停機。相反地,如果反應器在週期結束時仍有顯著的過剩反應性,剩餘的燃料將會保持未用。這兩種情況都會讓電廠損失大量金錢。本研究提出一個創新的方法,基於資料驅動的深度學習模型,來估計沸水反應器的過剩臨界性。
Untangling Hate Speech Definitions: A Semantic Componential Analysis Across Cultures and Domains
2411.07417v1 by Katerina Korre, Arianna Muti, Federico Ruggeri, Alberto Barrón-Cedeño
Hate speech relies heavily on cultural influences, leading to varying individual interpretations. For that reason, we propose a Semantic Componential Analysis (SCA) framework for a cross-cultural and cross-domain analysis of hate speech definitions. We create the first dataset of definitions derived from five domains: online dictionaries, research papers, Wikipedia articles, legislation, and online platforms, which are later analyzed into semantic components. Our analysis reveals that the components differ from definition to definition, yet many domains borrow definitions from one another without taking into account the target culture. We conduct zero-shot model experiments using our proposed dataset, employing three popular open-sourced LLMs to understand the impact of different definitions on hate speech detection. Our findings indicate that LLMs are sensitive to definitions: responses for hate speech detection change according to the complexity of definitions used in the prompt.
摘要:仇恨言論嚴重依賴文化影響,導致不同的個人詮釋。因此,我們提出語義成分分析 (SCA) 架構,用於跨文化和跨領域分析仇恨言論定義。我們建立第一個定義資料集,其來自五個領域:線上字典、研究論文、維基百科條目、立法和線上平台,隨後分析為語義成分。我們的分析顯示,各成分在不同定義中有所不同,但許多領域會從彼此借用定義,而不考慮目標文化。我們使用建議的資料集執行零次學習模型實驗,採用三個流行的開源 LLM,以了解不同定義對仇恨言論偵測的影響。我們的研究結果指出,LLM 對定義很敏感:仇恨言論偵測的回應會根據提示中使用的定義複雜度而改變。
Using Generative AI and Multi-Agents to Provide Automatic Feedback
2411.07407v1 by Shuchen Guo, Ehsan Latif, Yifan Zhou, Xuan Huang, Xiaoming Zhai
This study investigates the use of generative AI and multi-agent systems to provide automatic feedback in educational contexts, particularly for student constructed responses in science assessments. The research addresses a key gap in the field by exploring how multi-agent systems, called AutoFeedback, can improve the quality of GenAI-generated feedback, overcoming known issues such as over-praise and over-inference that are common in single-agent large language models (LLMs). The study developed a multi-agent system consisting of two AI agents: one for generating feedback and another for validating and refining it. The system was tested on a dataset of 240 student responses, and its performance was compared to that of a single-agent LLM. Results showed that AutoFeedback significantly reduced the occurrence of over-praise and over-inference errors, providing more accurate and pedagogically sound feedback. The findings suggest that multi-agent systems can offer a more reliable solution for generating automated feedback in educational settings, highlighting their potential for scalable and personalized learning support. These results have important implications for educators and researchers seeking to leverage AI in formative assessments, offering a pathway to more effective feedback mechanisms that enhance student learning outcomes.
摘要:本研究探討生成式 AI 與多重代理系統在教育情境中提供自動回饋的用途,特別是針對學生在科學評量中建構的回應。此研究透過探討稱為 AutoFeedback 的多重代理系統如何改善 GenAI 生成的回饋品質,來解決該領域的一個關鍵落差,克服常見於單一代理大型語言模型 (LLM) 中的過度讚美和過度推論等已知問題。本研究開發了一個由兩個 AI 代理組成的多重代理系統:一個用於產生回饋,另一個用於驗證和改善回饋。此系統在 240 個學生回應的資料集上進行測試,並將其效能與單一代理 LLM 進行比較。結果顯示,AutoFeedback 大幅減少過度讚美和過度推論錯誤的發生,提供更準確且在教學法上更完善的回饋。研究結果顯示,多重代理系統可以為在教育環境中產生自動回饋提供更可靠的解決方案,突顯其在可擴充且個人化的學習支援方面的潛力。這些結果對尋求在形成性評量中利用 AI 的教育工作者和研究人員具有重要意義,提供一條通往更有效的回饋機制的途徑,以提升學生的學習成果。
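The generate-then-validate arrangement described above can be sketched as a small loop. This is only an illustration under stated assumptions: the two agents are stand-in Python callables rather than LLM calls, and `auto_feedback`, `toy_generate`, `toy_validate`, and the round limit are hypothetical names and choices, not the AutoFeedback implementation.

```python
def auto_feedback(response, generate, validate, max_rounds=3):
    """Agent 1 drafts feedback; agent 2 checks it and may return a revision."""
    feedback = generate(response)
    for _ in range(max_rounds):
        accepted, revised = validate(response, feedback)
        if accepted:
            return feedback
        feedback = revised
    return feedback


def toy_generate(response):
    # Stand-in for the feedback-generating agent (deliberately over-praising).
    return "Great job! Perfect answer."


def toy_validate(response, feedback):
    # Stand-in for the validating agent: flags over-praise and rewrites it.
    if "Perfect" in feedback:
        fixed = feedback.replace(
            "Perfect answer.", "Your claim about energy needs evidence."
        )
        return False, fixed
    return True, feedback


final_feedback = auto_feedback(
    "Energy is created when the ball falls.", toy_generate, toy_validate
)
```

Keeping the validator separate from the generator means the check can be swapped or tightened without touching how feedback is drafted.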
Controllable Context Sensitivity and the Knob Behind It
2411.07404v1 by Julian Minder, Kevin Du, Niklas Stoehr, Giovanni Monea, Chris Wendler, Robert West, Ryan Cotterell
When making predictions, a language model must trade off how much it relies on its context vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental functionality, as it enables the model to excel at tasks like retrieval-augmented generation and question-answering. In this paper, we search for a knob which controls this sensitivity, determining whether language models answer from the context or their prior knowledge. To guide this search, we design a task for controllable context sensitivity. In this task, we first feed the model a context (Paris is in England) and a question (Where is Paris?); we then instruct the model to either use its prior or contextual knowledge and evaluate whether it generates the correct answer for both intents (either France or England). When fine-tuned on this task, instruction-tuned versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm. Then, in each model, we identify a 1-D subspace in a single layer that encodes whether the model follows context or prior knowledge. Interestingly, while we identify this subspace in a fine-tuned model, we find that the exact same subspace serves as an effective knob in not only that model but also non-fine-tuned instruct and base models of that model family. Finally, we show a strong correlation between a model's performance and how distinctly it separates context-agreeing from context-ignoring answers in this subspace. These results suggest a single subspace facilitates how the model chooses between context and prior knowledge, hinting at a simple fundamental mechanism that controls this behavior.
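To make the idea of a 1-D "knob" concrete, here is a minimal sketch of setting a hidden vector's coordinate along a single direction while leaving the orthogonal part untouched. The vectors, the `set_knob` name, and the choice of overwriting the projection are assumptions for illustration; the paper's actual intervention may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def set_knob(hidden, direction, value):
    """Return `hidden` with its coordinate along `direction` set to `value`."""
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    current = dot(hidden, unit)
    # Shift only along the unit direction; orthogonal components are unchanged.
    return [h + (value - current) * u for h, u in zip(hidden, unit)]


hidden_state = [1.0, 2.0, 3.0]       # hypothetical residual-stream vector
knob_direction = [0.0, 0.0, 2.0]     # hypothetical 1-D subspace
steered = set_knob(hidden_state, knob_direction, -5.0)
```

Here the third coordinate plays the role of the knob, so steering moves it to the requested value and leaves the first two coordinates alone.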
Beyond Keywords: A Context-based Hybrid Approach to Mining Ethical Concern-related App Reviews
2411.07398v1 by Aakash Sorathiya, Gouri Ginde
With the increasing proliferation of mobile applications in our everyday experiences, the concerns surrounding ethics have surged significantly. Users generally communicate their feedback, report issues, and suggest new functionalities in application (app) reviews, frequently emphasizing safety, privacy, and accountability concerns. Incorporating these reviews is essential to developing successful products. However, app reviews related to ethical concerns generally use domain-specific language and are expressed using a more varied vocabulary, which makes automated extraction of ethical concern-related app reviews a challenging and time-consuming effort. This study proposes a novel Natural Language Processing (NLP) based approach that combines Natural Language Inference (NLI), which provides a deep comprehension of language nuances, and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. Utilizing 43,647 app reviews from the mental health domain, the proposed methodology 1) evaluates four NLI models to extract potential privacy reviews and compares the results of domain-specific privacy hypotheses with generic privacy hypotheses; 2) evaluates four LLMs for classifying app reviews as privacy concerns; and 3) uses the best NLI and LLM models to extract new privacy reviews from the dataset. Results show that the DeBERTa-v3-base-mnli-fever-anli NLI model with domain-specific hypotheses yields the best performance, and the Llama3.1-8B-Instruct LLM performs best in the classification of app reviews. Then, using NLI+LLM, an additional 1,008 new privacy-related reviews were extracted that were not identified through the keyword-based approach in previous research, thus demonstrating the effectiveness of the proposed approach.
Toward Optimal Search and Retrieval for RAG
2411.07396v1 by Alexandria Leto, Cecilia Aguerrebere, Ishwar Bhati, Ted Willke, Mariano Tepper, Vy Ai Vo
Retrieval-augmented generation (RAG) is a promising method for addressing some of the memory-related challenges associated with Large Language Models (LLMs). Two separate systems form the RAG pipeline, the retriever and the reader, and the impact of each on downstream task performance is not well-understood. Here, we work towards the goal of understanding how retrievers can be optimized for RAG pipelines for common tasks such as Question Answering (QA). We conduct experiments focused on the relationship between retrieval and RAG performance on QA and attributed QA and unveil a number of insights useful to practitioners developing high-performance RAG pipelines. For example, lowering search accuracy has minor implications for RAG performance while potentially increasing retrieval speed and memory efficiency.
摘要:檢索增強生成 (RAG) 是一種有望解決大型語言模型 (LLM) 相關記憶挑戰的方法。RAG 管線由檢索器和讀取器兩個獨立系統組成,而每個系統對下游任務效能的影響並未獲得透徹理解。在此,我們努力了解檢索器如何針對常見任務(例如問題解答 (QA))最佳化 RAG 管線。我們針對檢索與 RAG 在 QA 和歸因 QA 上的關係進行實驗,並揭示許多對開發高性能 RAG 管線的從業人員有用的見解。例如,降低搜尋準確度對 RAG 效能影響不大,但可能會提高檢索速度和記憶體效率。
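"Search accuracy" for a retriever is commonly measured as recall@k of an approximate search against exact search. The sketch below assumes inner-product scoring over a toy corpus; the document vectors, the restricted scan used as a stand-in for an approximate index, and the function names are hypothetical.

```python
def top_k(query, docs, k, allowed=None):
    """Rank document ids by inner product with the query, optionally scanning a subset."""
    ids = list(docs) if allowed is None else [i for i in docs if i in allowed]
    scored = sorted(
        ids, key=lambda i: -sum(q * d for q, d in zip(query, docs[i]))
    )
    return scored[:k]


def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(exact_ids[:k]) & set(approx_ids[:k])) / k


docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
query = [1.0, 0.0]

exact = top_k(query, docs, k=2)                              # full scan
approx = top_k(query, docs, k=2, allowed={"a", "c", "d"})    # misses "b"
recall = recall_at_k(exact, approx, k=2)
```

Lowering recall in this sense is the trade-off the abstract refers to: the approximate search scans less of the corpus and may drop some true neighbors.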
Data-Centric Learning Framework for Real-Time Detection of Aiming Beam in Fluorescence Lifetime Imaging Guided Surgery
2411.07395v1 by Mohamed Abul Hassan, Pu Sun, Xiangnan Zhou, Lisanne Kraft, Kelsey T Hadfield, Katjana Ehrlich, Jinyi Qi, Andrew Birkeland, Laura Marcu
This study introduces a novel data-centric approach to improve real-time surgical guidance using fiber-based fluorescence lifetime imaging (FLIm). A key aspect of the methodology is the accurate detection of the aiming beam, which is essential for localizing points used to map FLIm measurements onto the tissue region within the surgical field. The primary challenge arises from the complex and variable conditions encountered in the surgical environment, particularly in Transoral Robotic Surgery (TORS). Uneven illumination in the surgical field can cause reflections, reduce contrast, and result in inconsistent color representation, further complicating aiming beam detection. To overcome these challenges, an instance segmentation model was developed using a data-centric training strategy that improves accuracy by minimizing label noise and enhancing detection robustness. The model was evaluated on a dataset comprising 40 in vivo surgical videos, demonstrating a median detection rate of 85%. This performance was maintained when the model was integrated in a clinical system, achieving a similar detection rate of 85% during TORS procedures conducted on patients. The system's computational efficiency, measured at approximately 24 frames per second (FPS), was sufficient for real-time surgical guidance. This study enhances the reliability of FLIm-based aiming beam detection in complex surgical environments, advancing the feasibility of real-time, image-guided interventions for improved surgical precision.
摘要:本研究提出了一種新穎的以數據為中心的策略,以使用基於光纖的螢光生命期成像 (FLIm) 來改善實時手術導引。此方法的一個關鍵面向是準確偵測瞄準光束,這對於定位用於將 FLIm 測量結果對應到手術視野內組織區域的點至關重要。主要的挑戰來自於手術環境中遇到的複雜且變化的條件,特別是在經口機器人手術 (TORS) 中。手術視野中的照明不均會導致反射、降低對比度,並造成不一致的顏色呈現,進一步使瞄準光束偵測複雜化。為了克服這些挑戰,開發了一個實例分割模型,使用以數據為中心的訓練策略,透過最小化標籤雜訊和增強偵測穩健性來提高準確度。此模型在包含 40 個體內手術影片的資料集上進行評估,顯示出 85% 的中位數偵測率。當此模型整合到臨床系統中時,此效能得以維持,在患者進行 TORS 手術期間達成相似的 85% 偵測率。此系統的運算效率,測量結果約為每秒 24 幀 (FPS),足以進行實時手術導引。本研究增強了 FLIm 為基礎的瞄準光束偵測在複雜手術環境中的可靠性,提升了實時、影像導引介入的可行性,以改善手術精準度
Feature-Space Semantic Invariance: Enhanced OOD Detection for Open-Set Domain Generalization
2411.07392v1 by Haoliang Wang, Chen Zhao, Feng Chen
Open-set domain generalization addresses a real-world challenge: training a model to generalize across unseen domains (domain generalization) while also detecting samples from unknown classes not encountered during training (open-set recognition). However, most existing approaches tackle these issues separately, limiting their practical applicability. To overcome this limitation, we propose a unified framework for open-set domain generalization by introducing Feature-space Semantic Invariance (FSI). FSI maintains semantic consistency across different domains within the feature space, enabling more accurate detection of OOD instances in unseen domains. Additionally, we adopt a generative model to produce synthetic data with novel domain styles or class labels, enhancing model robustness. Initial experiments show that our method improves AUROC by 9.1% to 18.9% on ColoredMNIST, while also significantly increasing in-distribution classification accuracy.
摘要:開放集域泛化解決了一個真實世界的挑戰:訓練一個模型在未見過的域中進行泛化(域泛化),同時也偵測訓練過程中未遇到的未知類別的樣本(開放集識別)。然而,大多數現有方法分別處理這些問題,限制了它們的實際適用性。為了克服這個限制,我們提出了一個開放集域泛化的統一架構,引入了特徵空間語義不變性(FSI)。FSI 在特徵空間中維護不同域之間的語義一致性,從而能夠更準確地偵測未見域中的 OOD 實例。此外,我們採用一個生成模型來產生具有新穎域樣式或類別標籤的合成資料,增強模型的穩健性。初步實驗表明,我們的模型在 ColoredMNIST 上將 AUROC 提高了 9.1% 至 18.9%,同時也顯著提高了分布內分類準確度。
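AUROC, the metric reported above for OOD detection, can be computed directly as the probability that a randomly chosen OOD sample receives a higher OOD score than a randomly chosen in-distribution sample. The scores below are made up for illustration, and the pairwise loop is a simple reference implementation rather than anything taken from the paper.

```python
def auroc(scores_ood, scores_id):
    """Pairwise AUROC: P(OOD score > in-distribution score), ties count as 1/2."""
    wins = 0.0
    for s_out in scores_ood:
        for s_in in scores_id:
            if s_out > s_in:
                wins += 1.0
            elif s_out == s_in:
                wins += 0.5
    return wins / (len(scores_ood) * len(scores_id))


ood_scores = [0.9, 0.8, 0.4]   # hypothetical detector outputs on OOD samples
id_scores = [0.1, 0.3, 0.5]    # hypothetical outputs on in-distribution samples
value = auroc(ood_scores, id_scores)
```

In this toy case eight of the nine OOD/in-distribution pairs are ordered correctly, so the AUROC is 8/9.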
Federated Learning Client Pruning for Noisy Labels
2411.07391v1 by Mahdi Morafah, Hojin Chang, Chen Chen, Bill Lin
Federated Learning (FL) enables collaborative model training across decentralized edge devices while preserving data privacy. However, existing FL methods often assume clean annotated datasets, impractical for resource-constrained edge devices. In reality, noisy labels are prevalent, posing significant challenges to FL performance. Prior approaches attempt label correction and robust training techniques but exhibit limited efficacy, particularly under high noise levels. This paper introduces ClipFL (Federated Learning Client Pruning), a novel framework addressing noisy labels from a fresh perspective. ClipFL identifies and excludes noisy clients based on their performance on a clean validation dataset, tracked using a Noise Candidacy Score (NCS). The framework comprises three phases: pre-client pruning to identify potential noisy clients and calculate their NCS, client pruning to exclude a percentage of clients with the highest NCS, and post-client pruning for fine-tuning the global model with standard FL on clean clients. Empirical evaluation demonstrates ClipFL's efficacy across diverse datasets and noise levels, achieving accurate noisy client identification, superior performance, faster convergence, and reduced communication costs compared to state-of-the-art FL methods. Our code is available at https://github.com/MMorafah/ClipFL.
摘要:聯邦學習 (FL) 能在分散式邊緣裝置上進行協作模型訓練,同時保留資料隱私。然而,現有的 FL 方法通常假設標記資料集是乾淨的,這對於資源受限的邊緣裝置來說是不切實際的。實際上,雜訊標籤很普遍,對 FL 效能構成重大挑戰。先前的做法嘗試標籤校正和穩健訓練技術,但在高雜訊水準下表現出的效能有限。本文介紹 ClipFL(聯邦學習用戶端剪枝),這是一個從新觀點解決雜訊標籤的新穎架構。ClipFL 根據雜訊候選分數 (NCS) 追蹤乾淨驗證資料集上的效能,識別和排除雜訊用戶端。該架構包含三個階段:用戶端前剪枝,用於識別潛在的雜訊用戶端並計算其 NCS;用戶端剪枝,用於排除具有最高 NCS 的用戶端百分比;用戶端後剪枝,用於在乾淨用戶端上使用標準 FL 微調全域模型。實證評估顯示 ClipFL 在不同的資料集和雜訊水準中都表現出效能,與現有的 FL 方法相比,能準確識別雜訊用戶端,具有優異的效能、更快的收斂速度和更低的通訊成本。我們的程式碼可在 https://github.com/MMorafah/ClipFL 取得。
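The pruning phases above can be sketched as follows. The abstract does not define how the Noise Candidacy Score is computed, so this sketch assumes NCS is the fraction of rounds in which a client ranks among the worst performers on the clean validation set; `noise_candidacy_scores`, `prune_clients`, the fractions, and the accuracies are hypothetical.

```python
def noise_candidacy_scores(val_acc_by_round, flag_fraction=0.5):
    """Assumed NCS: share of rounds in which a client is among the lowest-accuracy clients."""
    clients = sorted(val_acc_by_round[0])
    flagged = {c: 0 for c in clients}
    n_flag = max(1, int(len(clients) * flag_fraction))
    for accs in val_acc_by_round:
        worst = sorted(clients, key=lambda c: accs[c])[:n_flag]
        for c in worst:
            flagged[c] += 1
    rounds = len(val_acc_by_round)
    return {c: flagged[c] / rounds for c in clients}


def prune_clients(ncs, prune_fraction):
    """Drop the given fraction of clients with the highest NCS; return the kept ids."""
    n_prune = int(len(ncs) * prune_fraction)
    ranked = sorted(ncs, key=lambda c: ncs[c], reverse=True)
    return sorted(ranked[n_prune:])


# Hypothetical per-round accuracy of each client's model on a clean validation set.
rounds = [
    {"c1": 0.90, "c2": 0.40, "c3": 0.85, "c4": 0.50},
    {"c1": 0.88, "c2": 0.45, "c3": 0.90, "c4": 0.30},
    {"c1": 0.50, "c2": 0.35, "c3": 0.92, "c4": 0.60},
]
ncs = noise_candidacy_scores(rounds)
kept = prune_clients(ncs, prune_fraction=0.5)
```

The clients in `kept` would then continue with standard federated training in the post-pruning phase.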
Firing Rate Models as Associative Memory: Excitatory-Inhibitory Balance for Robust Retrieval
2411.07388v1 by Simone Betteti, Giacomo Baggio, Francesco Bullo, Sandro Zampieri
Firing rate models are dynamical systems widely used in applied and theoretical neuroscience to describe local cortical dynamics in neuronal populations. By providing a macroscopic perspective of neuronal activity, these models are essential for investigating oscillatory phenomena, chaotic behavior, and associative memory processes. Despite their widespread use, the application of firing rate models to associative memory networks has received limited mathematical exploration, and most existing studies are focused on specific models. Conversely, well-established associative memory designs, such as Hopfield networks, lack key biologically-relevant features intrinsic to firing rate models, including positivity and interpretable synaptic matrices that reflect excitatory and inhibitory interactions. To address this gap, we propose a general framework that ensures the emergence of re-scaled memory patterns as stable equilibria in the firing rate dynamics. Furthermore, we analyze the conditions under which the memories are locally and globally asymptotically stable, providing insights into constructing biologically-plausible and robust systems for associative memory retrieval.
摘要:發射率模型是動態系統,廣泛用於應用和理論神經科學,用於描述神經元族群中的局部皮質動態。這些模型透過提供神經元活動的巨觀觀點,對於探討振盪現象、混沌行為和聯想記憶過程至關重要。儘管廣泛使用,但將發射率模型應用於聯想記憶網路的研究在數學上仍有限,而且大多數現有研究都集中在特定模型上。相反地,像 Hopfield 網路等完善的聯想記憶設計,缺乏發射率模型中固有的關鍵生物相關特徵,包括正值和反映激發性和抑制性交互作用的可解釋突觸矩陣。為了解決這個差距,我們提出一個通用框架,以確保重新縮放的記憶模式在發射率動態中作為穩定平衡出現。此外,我們分析記憶在局部和全局漸近穩定的條件,提供見解以建構生物學上可信且強健的聯想記憶檢索系統。
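As a small illustration of a stable equilibrium in firing rate dynamics, the sketch below integrates a two-population excitatory-inhibitory network with forward Euler. It assumes the common form dx/dt = -x + [Wx + b]_+ with a rectifier nonlinearity; the weights, input, and step size are hypothetical and are not taken from the paper's framework.

```python
def relu(v):
    return max(0.0, v)


def simulate(W, b, x0, dt=0.01, steps=5000):
    """Forward-Euler integration of dx/dt = -x + relu(W x + b)."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        drive = [
            relu(sum(W[i][j] * x[j] for j in range(n)) + b[i]) for i in range(n)
        ]
        x = [xi + dt * (-xi + di) for xi, di in zip(x, drive)]
    return x


# Population 0 is excitatory, population 1 is inhibitory (negative outgoing weight).
W = [[0.5, -1.0],
     [0.5, 0.0]]
b = [1.0, 0.0]
rates = simulate(W, b, x0=[0.0, 0.0])
```

For these weights the fixed point solves x0 = 0.5*x0 - x1 + 1 and x1 = 0.5*x0, giving rates of 1.0 and 0.5, and the rates stay non-negative throughout because each Euler step is a convex combination of non-negative terms.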
Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages
2411.07387v1 by Midia Yousefi, Yao Qian, Junkun Chen, Gang Wang, Yanqing Liu, Dongmei Wang, Xiaofei Wang, Jian Xue
End-to-end speech translation (ST), which translates source language speech directly into target language text, has garnered significant attention in recent years. Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio, including both speech and pause segments. Previous methods often controlled the number of words or characters generated by the Machine Translation model to approximate the source sentence's length without considering the isochrony of pauses and speech segments, as duration can vary between languages. To address this, we present improvements to the duration alignment component of our sequence-to-sequence ST model. Our method controls translation length by predicting the duration of speech and pauses in conjunction with the translation process. This is achieved by providing timing information to the decoder, ensuring it tracks the remaining duration for speech and pauses while generating the translation. The evaluation on the Zh-En test set of CoVoST 2 demonstrates that the proposed Isochrony-Controlled ST achieves 0.92 speech overlap and 8.9 BLEU, only a 1.4 BLEU drop compared to the ST baseline.
摘要:端對端語音翻譯 (ST) 可直接將原始語言語音翻譯成目標語言文字,近年來備受關注。許多 ST 應用程式需要嚴格的長度控制,以確保翻譯時間與原始音訊長度相符,包括語音和暫停片段。先前的做法通常控制機器翻譯模型產生的字數或字元數,以估計原始句子的長度,而不考慮暫停和語音片段的等時性,因為不同語言的持續時間可能有所不同。為了解決這個問題,我們對序列對序列 ST 模型的持續時間比對元件進行了改進。我們的做法透過預測語音和暫停的持續時間,並與翻譯過程結合,來控制翻譯長度。這是透過提供計時資訊給解碼器來達成,確保它在產生翻譯時追蹤語音和暫停的剩餘時間。在 CoVoST 2 的中英測試集中進行評估,顯示所提出的等時性控制 ST 可達到 0.92 的語音重疊和 8.9 的 BLEU,與 ST 基準相比,BLEU 分數僅下降 1.4。
BeeManc at the PLABA Track of TAC-2024: RoBERTa for task 1 and LLaMA3.1 and GPT-4o for task 2
2411.07381v1 by Zhidong Ling, Zihao Li, Pablo Romeo, Lifeng Han, Goran Nenadic
This report is the system description of the BeeManc team for the shared task Plain Language Adaptation of Biomedical Abstracts (PLABA) 2024. The report contains two sections corresponding to the two sub-tasks in PLABA 2024. In task one, we applied fine-tuned RoBERTa-Base models to identify and classify the difficult terms, jargon and acronyms in the biomedical abstracts and reported the F1 score. Due to time constraints, we did not finish the replacement task. In task two, we leveraged Llama3.1-70B-Instruct and GPT-4o with one-shot prompts to complete the abstract adaptation and reported the scores in BLEU, SARI, BERTScore, LENS, and SALSA. In the official PLABA-2024 evaluation on Task 1A and 1B, our much smaller fine-tuned RoBERTa-Base model ranked 3rd and 2nd respectively on the two sub-tasks, and 1st on averaged F1 scores across the two tasks among 9 evaluated systems. We share our fine-tuned models and related resources at https://github.com/HECTA-UoM/PLABA2024.
摘要:這份報告是 BeeManc 團隊針對 2024 年生物醫學摘要的通用語言適應 (PLABA) 共享任務所做的系統描述。這份報告包含兩部分,分別對應於 PLABA 2024 的兩個子任務。在任務一中,我們應用微調後的 ReBERTa-Base 模型來識別和分類生物醫學摘要中的困難術語、術語和縮寫,並報告 F1 分數。由於時間限制,我們沒有完成替換任務。在任務二中,我們利用 Llamma3.1-70B-Instruct 和 GPT-4o 以及一次性提示來完成摘要適應,並報告了 BLEU、SARI、BERTScore、LENS 和 SALSA 中的分數。根據 PLABA-2024 在任務 1A 和 1B 中的官方評估,我們的\textbf{微調後的 RoBERTa-Base 模型小得多}在兩個子任務中分別排名第 3 和第 2,並且在 9 個評估系統中\textbf{在兩個任務中的平均 F1 分數中排名第 1}。我們在\url{https://github.com/HECTA-UoM/PLABA2024}分享我們的微調模型和相關資源
Warmstarting for Scaling Language Models
2411.07340v1 by Neeratyoy Mallik, Maciej Janowski, Johannes Hog, Herilalaina Rakotoarison, Aaron Klein, Josif Grabocka, Frank Hutter
Scaling model sizes to scale performance has worked remarkably well for the current large language models paradigm. The research and empirical findings of various scaling studies led to novel scaling results and laws that guide subsequent research. High training costs for contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups. One direction to ameliorate the cost of pretraining large models is to warmstart the large-scale training from smaller models that are cheaper to tune. In this work, we attempt to understand if the behavior of optimal hyperparameters can be retained under warmstarting for scaling. We explore simple operations that allow the application of theoretically motivated methods of zero-shot transfer of optimal hyperparameters using {\mu}Transfer. We investigate the aspects that contribute to the speedup in convergence and the preservation of stable training dynamics under warmstarting with {\mu}Transfer. We find that shrinking smaller model weights, zero-padding, and perturbing the resulting larger model with scaled initialization from {\mu}P enables effective warmstarting of {\mu}Transfer.
摘要:將模型規模擴展到擴展效能對於目前的巨量語言模型範例而言運作得非常好。各種規模研究的研究和經驗結果導致新穎的規模結果和定律,這些定律指導後續的研究。當代資料和模型的高訓練成本導致缺乏對如何調整和達成此類訓練設定的透徹理解。改善大型模型預訓練成本的一個方向是從較小的模型開始進行大規模訓練,較小的模型調整成本較低。在這項工作中,我們嘗試了解最佳超參數的行為是否可以在擴展的熱啟動下保留。我們探索允許應用理論激勵的最佳超參數零次轉移方法的簡單操作,使用 {\mu}Transfer。我們研究了有助於加速收斂和在使用 {\mu}Transfer 熱啟動時維持穩定訓練動態的方面。我們發現縮小較小的模型權重、零填充以及使用來自 {\mu}P 的縮放初始化擾動產生的較大模型,能夠有效地熱啟動 $\mut{}$。
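The three operations named in the abstract (shrink, zero-pad, perturb) can be sketched on a single weight matrix. This is a toy under stated assumptions: the `warmstart` function, the shrink factor, and the fixed noise scale are hypothetical, and the width-dependent initialization scaling that {\mu}P prescribes is not reproduced here.

```python
import random


def warmstart(small, big_rows, big_cols, shrink=0.5, noise_std=0.01, seed=0):
    """Build a larger weight matrix from a smaller one.

    1. shrink: scale the small model's weights by `shrink`
    2. zero-pad: place them in the top-left block of the larger matrix
    3. perturb: add Gaussian noise everywhere (stand-in for a scaled init)
    """
    rng = random.Random(seed)
    big = []
    for i in range(big_rows):
        row = []
        for j in range(big_cols):
            inside = i < len(small) and j < len(small[0])
            base = shrink * small[i][j] if inside else 0.0
            row.append(base + rng.gauss(0.0, noise_std))
        big.append(row)
    return big


small_weights = [[1.0, 2.0], [3.0, 4.0]]
# With noise disabled, the shrink and zero-pad steps are easy to inspect.
padded = warmstart(small_weights, big_rows=3, big_cols=4, noise_std=0.0)
# With noise enabled, every entry (including the padding) is perturbed.
perturbed = warmstart(small_weights, big_rows=3, big_cols=4, noise_std=0.01)
```

Perturbing the padded entries matters because exactly-zero rows and columns would otherwise receive symmetric updates in the larger model.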
SetLexSem Challenge: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models
2411.07336v1 by Bardiya Akhbari, Manish Gawali, Nicholas A. Dronen
Set theory is foundational to mathematics and, when sets are finite, to reasoning about the world. An intelligent system should perform set operations consistently, regardless of superficial variations in the operands. Initially designed for semantically-oriented NLP tasks, large language models (LLMs) are now being evaluated on algorithmic tasks. Because sets are composed of arbitrary symbols (e.g. numbers, words), they provide an opportunity to test, systematically, the invariance of LLMs' algorithmic abilities under simple lexical or semantic variations. To this end, we present the SetLexSem Challenge, a synthetic benchmark that evaluates the performance of LLMs on set operations. SetLexSem assesses the robustness of LLMs' instruction-following abilities under various conditions, focusing on the set operations and the nature and construction of the set members. Evaluating seven LLMs with SetLexSem, we find that they exhibit poor robustness to variation in both operation and operands. We show, via the framework's systematic sampling of set members along lexical and semantic dimensions, that LLMs are not only not robust to variation along these dimensions but also demonstrate unique failure modes in particular, easy-to-create semantic groupings of "deceptive" sets. We find that rigorously measuring language model robustness to variation in frequency and length is challenging and present an analysis that measures them independently. The code for reproducing the results of this paper, and for generating the SetLexSem Challenge dataset, is available at https://github.com/amazon-science/SetLexSem-Challenge.
摘要:集合論是數學的基礎,當集合是有限時,它用於推理世界。一個智能系統應始終如一地執行集合運算,而不管運算元表面的變化。最初設計用於語義導向的 NLP 任務,大型語言模型 (LLM) 現在正在演算法任務上進行評估。由於集合由任意符號(例如數字、字詞)組成,因此它們提供了一個機會,可以系統性地測試 LLM 的演算法能力在簡單的詞彙或語義變化下的不變性。為此,我們提出了 SetLexSem 挑戰,這是一個綜合基準,用於評估 LLM 在集合運算上的效能。SetLexSem 評估 LLM 在各種條件下遵循指令的能力的穩健性,重點關注集合運算以及集合成員的性質和建構。使用 SetLexSem 評估七個 LLM,我們發現它們對運算和運算元中的變化表現出較差的穩健性。我們透過該框架沿著詞彙和語義維度對集合成員進行系統性抽樣,表明 LLM 不僅對這些維度中的變化不穩健,而且表現出獨特的失敗模式,特別是「具欺騙性的」集合的易於建立的語義群組。我們發現,嚴格測量語言模型對頻率和長度變化的穩健性具有挑戰性,並提出了一種獨立測量它們的分析。用於重現本文結果和生成 SetLexSem 挑戰資料集的程式碼可在 \href{https://github.com/amazon-science/SetLexSem-Challenge}{https://github.com/amazon-science/SetLexSem-Challenge} 取得。
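A minimal sketch of how a set-operation item might be posed and scored, so that the same operation can be checked across numeric and word operands. The prompt wording, `build_prompt`, and `is_correct` are hypothetical and do not reproduce the SetLexSem benchmark's actual format; model answers are represented here as already-parsed lists.

```python
# Reference implementations of the operations, taken from Python's built-in set type.
OPS = {
    "union": set.union,
    "intersection": set.intersection,
    "difference": set.difference,
    "symmetric difference": set.symmetric_difference,
}


def build_prompt(op_name, a, b):
    """Render one benchmark-style question (illustrative wording only)."""
    return (
        f"Compute the {op_name} of A = {sorted(a)} and B = {sorted(b)}. "
        "Answer with a list."
    )


def is_correct(op_name, a, b, answer_items):
    """Order-insensitive check of a parsed answer against the reference result."""
    return set(answer_items) == OPS[op_name](set(a), set(b))


numbers_a, numbers_b = {1, 2, 3}, {3, 4}
words_a, words_b = {"apple", "pear"}, {"pear", "fig"}

prompt = build_prompt("intersection", numbers_a, numbers_b)
check_numbers = is_correct("union", numbers_a, numbers_b, [4, 3, 2, 1])
check_words = is_correct("symmetric difference", words_a, words_b, ["fig", "apple"])
check_wrong = is_correct("difference", numbers_a, numbers_b, [1, 2, 3])
```

Scoring by set equality rather than string match keeps the check independent of the order in which a model lists the members.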
Multimodal Fusion Balancing Through Game-Theoretic Regularization
2411.07335v1 by Konstantinos Kontras, Thomas Strypsteen, Christos Chatzichristos, Paul P. Liang, Matthew Blaschko, Maarten De Vos
Multimodal learning can complete the picture of information extraction by uncovering key dependencies between data sources. However, current systems fail to fully leverage multiple modalities for optimal performance. This has been attributed to modality competition, where modalities strive for training resources, leaving some underoptimized. We show that current balancing methods struggle to train multimodal models that surpass even simple baselines, such as ensembles. This raises the question: how can we ensure that all modalities in multimodal training are sufficiently trained, and that learning from new modalities consistently improves performance? This paper proposes the Multimodal Competition Regularizer (MCR), a new loss component inspired by mutual information (MI) decomposition designed to prevent the adverse effects of competition in multimodal training. Our key contributions are: 1) Introducing game-theoretic principles in multimodal learning, where each modality acts as a player competing to maximize its influence on the final outcome, enabling automatic balancing of the MI terms. 2) Refining lower and upper bounds for each MI term to enhance the extraction of task-relevant unique and shared information across modalities. 3) Suggesting latent space permutations for conditional MI estimation, significantly improving computational efficiency. MCR outperforms all previously suggested training strategies and is the first to consistently improve multimodal learning beyond the ensemble baseline, clearly demonstrating that combining modalities leads to significant performance gains on both synthetic and large real-world datasets.
摘要:多模態學習可以透過揭露資料來源之間的關鍵依賴關係,來完成資訊萃取的圖像。然而,目前的系統無法充分利用多種模態來獲得最佳效能。這歸因於模態競爭,其中模態爭取訓練資源,導致有些模態未經最佳化。我們顯示目前的平衡方法難以訓練多模態模型,甚至超越簡單的基準,例如整體。這引發了一個問題:我們如何確保多模態訓練中的所有模態都得到充分訓練,並且從新模態中學習持續改善效能?本文提出了多模態競爭正則化器 (MCR),這是一個新的損失組成,靈感來自互資訊 (MI) 分解,旨在防止多模態訓練中競爭的不利影響。我們的關鍵貢獻包括:1) 在多模態學習中引入博弈論原則,其中每個模態都作為一個參與者競爭,以最大化其對最終結果的影響,從而實現 MI 項的自動平衡。2) 為每個 MI 項精煉上下界,以增強跨模態提取與任務相關的獨特和共享資訊。3) 建議潛在空間排列進行條件 MI 估計,顯著提高運算效率。MCR 優於所有先前建議的訓練策略,並且是第一個持續改善多模態學習超越整體基準的策略,清楚地證明了結合模態會在合成和大型真實世界資料集上帶來顯著的效能提升。
Richer Output for Richer Countries: Uncovering Geographical Disparities in Generated Stories and Travel Recommendations
2411.07320v1 by Kirti Bhagat, Kinshuk Vasisht, Danish Pruthi
While a large body of work inspects language models for biases concerning gender, race, occupation and religion, biases of geographical nature are relatively less explored. Some recent studies benchmark the degree to which large language models encode geospatial knowledge. However, the impact of the encoded geographical knowledge (or lack thereof) on real-world applications has not been documented. In this work, we examine large language models for two common scenarios that require geographical knowledge: (a) travel recommendations and (b) geo-anchored story generation. Specifically, we study four popular language models and, across about 100K travel requests and 200K story generations, observe that travel recommendations corresponding to poorer countries are less unique with fewer location references, and stories from these regions more often convey emotions of hardship and sadness compared to those from wealthier nations.
摘要:儘管大量工作檢查語言模型對於性別、種族、職業和宗教的偏見,但地理性質的偏見相對較少被探討。一些最近的研究基準測試大型語言模型編碼地理空間知識的程度。然而,已編碼地理知識(或缺乏地理知識)對真實世界應用程式的影響尚未被記錄下來。在這項工作中,我們針對需要地理知識的兩個常見場景檢查大型語言模型:(a) 旅遊建議和 (b) 地理錨定故事生成。具體來說,我們研究了四個流行的語言模型,並在約 10 萬個旅遊請求和 20 萬個故事生成中觀察到,對應於較貧窮國家的旅遊建議較不獨特,且位置參考較少,而這些地區的故事與富裕國家相比,更常傳達艱難和悲傷的情緒。