Knowledge Graphs
| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
| 2026-04-24 | BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering | Jinghong Chen et.al. | 2604.22678v1 | null |
| 2026-04-24 | Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors | Gautam Kumar Jain et.al. | 2604.22560v1 | null |
| 2026-04-24 | On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery | Anti Alman et.al. | 2604.22455v1 | null |
| 2026-04-24 | Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control | Qinhan Hou et.al. | 2604.22413v1 | null |
| 2026-04-24 | BLAST: Benchmarking LLMs with ASP-based Structured Testing | Manuel Alejandro Borroto Santana et.al. | 2604.22306v1 | null |
| 2026-04-24 | STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation | Peng Yu et.al. | 2604.22282v1 | null |
| 2026-04-24 | Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset | Wenhui Huang et.al. | 2604.22260v1 | null |
| 2026-04-24 | A Probabilistic Framework for Hierarchical Goal Recognition | Chenyuan Zhang et.al. | 2604.22256v1 | null |
| 2026-04-24 | Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA | Zhanli Li et.al. | 2604.22239v1 | null |
| 2026-04-24 | An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments | Hong Su et.al. | 2604.22199v1 | null |
| 2026-04-24 | How Large Language Models Balance Internal Knowledge with User and Document Assertions | Shuowei Li et.al. | 2604.22193v1 | null |
| 2026-04-24 | Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems | Meghana Karnam et.al. | 2604.22154v1 | null |
| 2026-04-24 | SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs | Sihang et.al. | 2604.22134v1 | null |
| 2026-04-23 | PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training | Harsh Kumar et.al. | 2604.22117v1 | null |
| 2026-04-23 | Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation | Weisi Liu et.al. | 2604.22098v1 | null |
| 2026-04-23 | Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents | Seyed Moein Abtahi et.al. | 2604.22085v1 | null |
| 2026-04-23 | Sound Agentic Science Requires Adversarial Experiments | Dionizije Fa et.al. | 2604.22080v1 | null |
| 2026-04-23 | PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning | Xiaoyi Chen et.al. | 2604.22076v1 | null |
| 2026-04-23 | Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning | Karthic Palaniappan et.al. | 2604.22062v1 | null |
| 2026-04-23 | Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning | João Mattos et.al. | 2604.22031v1 | null |
| 2026-04-23 | Rethinking Publication: A Certification Framework for AI-Enabled Research | Yang Lu et.al. | 2604.22026v1 | null |
| 2026-04-23 | Multi-Task Optimization over Networks of Tasks | Julian Hatzky et.al. | 2604.21991v1 | null |
| 2026-04-23 | When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs | Pegah Khayatan et.al. | 2604.21911v1 | null |
| 2026-04-23 | From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation | Bartosz Balis et.al. | 2604.21910v1 | null |
| 2026-04-23 | TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale | Jun Wang et.al. | 2604.21889v1 | null |
| 2026-04-23 | A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents | Praval Sharma et.al. | 2604.21885v1 | null |
| 2026-04-23 | Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms | Yuto Nishida et.al. | 2604.21882v1 | null |
| 2026-04-23 | Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications | Yvon K. Awuklu et.al. | 2604.21793v1 | null |
| 2026-04-23 | StructMem: Structured Memory for Long-Horizon Behavior in LLMs | Buqiang Xu et.al. | 2604.21748v1 | null |
| 2026-04-23 | GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion | Qizhuo Xie et.al. | 2604.21649v1 | null |
| 2026-04-23 | A systematic review of generative AI usage for IT project management | Ionut Anghel et.al. | 2604.21958v1 | null |
| 2026-04-23 | The CriticalSet problem: Identifying Critical Contributors in Bipartite Dependency Networks | Sebastiano A. Piccolo et.al. | 2604.21537v1 | null |
| 2026-04-23 | Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation | Nikita Severin et.al. | 2604.21536v1 | null |
| 2026-04-23 | OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving | Xinyu Zhang et.al. | 2604.21510v1 | null |
| 2026-04-23 | MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting | Yining Xing et.al. | 2604.21489v1 | null |
| 2026-04-23 | Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms | Jiyan Song et.al. | 2604.21473v1 | null |
| 2026-04-23 | Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation | Wang Shi Hai et.al. | 2604.21380v1 | null |
| 2026-04-23 | ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs | Jian Cui et.al. | 2604.21357v1 | null |
| 2026-04-23 | Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models | Muhammad Shafique et.al. | 2604.21952v1 | null |
| 2026-04-23 | Can MLLMs "Read" What is Missing? | Jindi Guo et.al. | 2604.21277v1 | null |
| 2026-04-23 | Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages | Michael Bouzinier et.al. | 2604.21263v1 | null |
| 2026-04-23 | When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors | Chenghao Yang et.al. | 2604.21255v1 | null |
| 2026-04-23 | Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation | Hanwen Gu et.al. | 2604.21253v1 | null |
| 2026-04-23 | CAP: Controllable Alignment Prompting for Unlearning in LLMs | Zhaokun Wang et.al. | 2604.21251v2 | null |
| 2026-04-23 | EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval | Julian Acuna et.al. | 2604.21229v1 | null |
| 2026-04-23 | Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery | Benjamin Przybocki et.al. | 2604.21187v1 | null |
| 2026-04-23 | TAPO-Description Logic for Information Behavior: Refined OBoxes, Inference, and Categorical Semantics | Takao Inoué et.al. | 2604.21172v1 | null |
| 2026-04-22 | "This Wasn't Made for Me": Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias | Siyu Liang et.al. | 2604.21148v2 | null |
| 2026-04-22 | Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification | Jiho Noh et.al. | 2604.21137v1 | null |
| 2026-04-22 | GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons | Sebastian Walter et.al. | 2604.21133v1 | null |
| 2026-04-22 | How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models | Kristian Schwethelm et.al. | 2604.21106v1 | null |
| 2026-04-22 | Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery | Siyuan Yao et.al. | 2604.21102v1 | null |
| 2026-04-22 | TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping | Yannis Belkhiter et.al. | 2604.21057v1 | null |
| 2026-04-22 | The Last Harness You'll Ever Build | Haebin Seong et.al. | 2604.21003v1 | null |
| 2026-04-22 | FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels | Sina Gholami et.al. | 2604.20825v1 | null |
| 2026-04-22 | Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems | Pavel Salovskii et.al. | 2604.20795v1 | null |
| 2026-04-22 | RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering | Marisa Hudspeth et.al. | 2604.20738v1 | null |
| 2026-04-22 | COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling | Noah Flynn et.al. | 2604.20720v1 | null |
| 2026-04-22 | Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization | Shan He et.al. | 2604.20714v1 | null |
| 2026-04-22 | StormNet: Improving storm surge predictions with a GNN-based spatio-temporal offset forecasting model | Noujoud Nader et.al. | 2604.20688v2 | null |
| 2026-04-22 | ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation | Ioannis E. Livieris et.al. | 2604.20666v1 | null |
| 2026-04-22 | The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm | Karan Goyal et.al. | 2604.20665v1 | null |
| 2026-04-22 | RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking | Roie Kazoom et.al. | 2604.20623v1 | null |
| 2026-04-22 | Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge | Naizhong Xu et.al. | 2604.20598v1 | null |
| 2026-04-22 | Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents | Yuxuan Cai et.al. | 2604.20572v1 | null |
| 2026-04-22 | LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures | Yuhang Wu et.al. | 2604.20556v1 | null |
| 2026-04-22 | Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies | Shuai Chen et.al. | 2604.20548v1 | null |
| 2026-04-22 | Effects of Cross-lingual Evidence in Multilingual Medical Question Answering | Anar Yeginbergen et.al. | 2604.20531v1 | null |
| 2026-04-22 | Knowledge Capsules: Structured Nonparametric Memory Units for LLMs | Bin Ju et.al. | 2604.20487v2 | null |
| 2026-04-22 | Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks | Pranav Pallerla et.al. | 2604.20932v1 | null |
| 2026-04-22 | HaS: Accelerating RAG through Homology-Aware Speculative Retrieval | Peng Peng et.al. | 2604.20452v1 | null |
| 2026-04-22 | Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness | Fulong Fan et.al. | 2604.20413v1 | null |
| 2026-04-22 | CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge | Gustav Keppler et.al. | 2604.20389v1 | null |
| 2026-04-22 | Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs | Aishik Mandal et.al. | 2604.20382v1 | null |
| 2026-04-22 | Domain-Aware Hierarchical Contrastive Learning for Semi-Supervised Generalization Fault Diagnosis | Junyu Ren et.al. | 2604.20928v1 | null |
| 2026-04-22 | Surrogate modeling for interpreting black-box LLMs in medical predictions | Changho Han et.al. | 2604.20331v2 | null |
| 2026-04-22 | Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction | Dali Wang et.al. | 2604.20311v2 | null |
| 2026-04-22 | Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA | Zibo Xu et.al. | 2604.20306v1 | null |
| 2026-04-22 | Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking | Mo Zhou et.al. | 2604.20283v1 | null |
| 2026-04-22 | AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling | Zhenyu Wang et.al. | 2604.20263v1 | null |
| 2026-04-22 | Hybrid Policy Distillation for LLMs | Wenhong Zhu et.al. | 2604.20244v1 | null |
| 2026-04-22 | Construction of a Battery Research Knowledge Graph using a Global Open Catalog | Luca Foppiano et.al. | 2604.20241v1 | null |
| 2026-04-22 | Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context | Yilun Zhu et.al. | 2604.20216v1 | null |
| 2026-04-22 | Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs | He Yang Yuan et.al. | 2604.20211v1 | null |
| 2026-04-22 | All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG | Dan Wang et.al. | 2604.20199v1 | null |
| 2026-04-22 | Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving | Xinyu Zhang et.al. | 2604.20183v1 | null |
| 2026-04-22 | SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition | Jielong Tang et.al. | 2604.20146v1 | null |
| 2026-04-22 | To Know is to Construct: Schema-Constrained Generation for Agent Memory | Lei Zheng et.al. | 2604.20117v1 | null |
| 2026-04-22 | Learning to Solve the Quadratic Assignment Problem with Warm-Started MCMC Finetuning | Yicheng Pan et.al. | 2604.20109v1 | null |
| 2026-04-22 | Auditing and Controlling AI Agent Actions in Spreadsheets | Sadra Sabouri et.al. | 2604.20070v1 | null |
| 2026-04-21 | Information Aggregation with AI Agents | Spyros Galanis et.al. | 2604.20050v1 | null |
| 2026-04-21 | Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine | Yusuf Kesmen et.al. | 2604.20022v1 | null |
| 2026-04-21 | From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents | Md Nayem Uddin et.al. | 2604.20006v1 | null |
| 2026-04-21 | Tracing Relational Knowledge Recall in Large Language Models | Nicholas Popovič et.al. | 2604.19934v2 | null |
| 2026-04-21 | CreativeGame:Toward Mechanic-Aware Creative Game Generation | Hongnan Ma et.al. | 2604.19926v1 | null |
| 2026-04-21 | Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding | Zijie Wang et.al. | 2604.19921v1 | null |
| 2026-04-21 | UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling | Boyu Chen et.al. | 2604.19734v1 | null |
| 2026-04-21 | ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration | Cagri Eryilmaz et.al. | 2604.19856v1 | null |
| 2026-04-21 | A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding | Shuai Wang et.al. | 2604.19689v1 | null |
| 2026-04-21 | An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA | Saransh Sharma et.al. | 2604.19685v1 | null |
Abstracts
BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering
2604.22678v1 by Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne
A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the "lost-in-the-middle" effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the "lost-in-the-middle" effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.
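The token-by-token posterior update described in the abstract can be illustrated with a minimal Python sketch. It assumes a standard Bayesian mixture: the next-token distribution is the posterior-weighted average of the per-document distributions, and the document posterior is re-weighted by each document's probability of the emitted token. The function names and the dictionary-based distributions are hypothetical, and the paper's exact formulation may differ.

```python
def mixture_next_token(doc_weights, per_doc_probs):
    """Ensemble next-token distribution.

    doc_weights:   list of p(d | x, y_<t), one weight per retrieved document
    per_doc_probs: list of dicts mapping token -> p(token | x, d, y_<t)
    """
    vocab = set().union(*per_doc_probs)
    return {
        tok: sum(w * p.get(tok, 0.0) for w, p in zip(doc_weights, per_doc_probs))
        for tok in vocab
    }


def update_doc_posterior(doc_weights, per_doc_probs, token):
    """Bayes' rule after emitting `token`:
    p(d | y_<=t) is proportional to p(d | y_<t) * p(token | d, y_<t).
    """
    unnorm = [w * p.get(token, 0.0) for w, p in zip(doc_weights, per_doc_probs)]
    total = sum(unnorm)
    if total == 0.0:
        # No document explains the token; keep the previous weights (assumption).
        return list(doc_weights)
    return [u / total for u in unnorm]
```

Under these assumptions, documents whose conditional model assigns higher probability to the emitted tokens gain weight as decoding proceeds. That is the property that would make the weights usable for attribution and for pruning low-weight documents.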
Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
2604.22560v1 by Gautam Kumar Jain, Carsten Markgraf, Julian Stähler
Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model's own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage's input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.
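The gated context projector described above (extract a hidden-state vector from one stage, then inject a normalized, gated projection into the next stage's input embeddings) might look roughly like the following plain-Python sketch. The additive injection, the scalar sigmoid gate, and the layer-norm-style normalization are all assumptions made for illustration; the abstract does not specify these details, and every name here is hypothetical.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def gated_context_injection(hidden, weight, gate_logit, embeddings):
    """Project a previous-stage hidden state and add it to next-stage embeddings.

    hidden:     hidden-state vector from the earlier stage
    weight:     projection matrix as a list of rows (hypothetical learned parameter)
    gate_logit: scalar gate parameter; sigmoid(gate_logit) scales the injection
    embeddings: list of token embedding vectors for the next stage
    """
    projected = [sum(w * h for w, h in zip(row, hidden)) for row in weight]
    context = layer_norm(projected)
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [[e + gate * c for e, c in zip(tok, context)] for tok in embeddings]
```

With a strongly negative `gate_logit` the gate is effectively closed and the next stage sees its original embeddings, so in this sketch the amount of cross-stage context is a single adjustable quantity.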
On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery
2604.22455v1 by Anti Alman, Izack Cohen, Avigdor Gal, Fabrizio Maria Maggi, Marco Montali
A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.
Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control
2604.22413v1 by Qinhan Hou, Jing Tang
Graph Transformers can mix information globally, but this flexibility also creates failure modes: some tasks require long-range communication while others are better served by local interaction. We study this through a synthetic node-classification benchmark on contextual stochastic block model graphs, where labels are generated by a controllable mixture of local and far-shell signals. We define distance-misaligned training as a mismatch between where label-relevant information lies and where the model allocates communication over graph distance. On this benchmark, we find three points. First, the preferred graph-distance bias changes systematically with task locality. Second, an oracle adaptive controller, given offline access to the task-side distance target, nearly matches the best fixed bias across regimes and strongly improves over a neutral baseline on mixed and local tasks. Third, a task-agnostic zero-gap controller is weaker, indicating that adaptation alone is not enough and that the control target matters. These results suggest that distance-resolved diagnosis is useful for understanding Graph Transformer failures and for designing graph-aware control.
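To make the notion of a graph-distance bias concrete, here is a small Python sketch. It assumes the bias enters the attention logits as a linear penalty `-beta * distance`, which is one common choice and not necessarily the parameterization used in the paper. The helper that sums attention mass per distance shell is likewise a hypothetical diagnostic for checking where a node allocates its communication.

```python
import math


def distance_biased_attention(scores, distances, beta):
    """Softmax over scores[j] - beta * distances[j] for one query node.

    beta > 0 favors nearby nodes, beta < 0 favors distant nodes,
    and beta = 0 leaves the attention scores unbiased.
    """
    logits = [s - beta * d for s, d in zip(scores, distances)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass_by_distance(weights, distances):
    """Sum the attention weights within each graph-distance shell."""
    mass = {}
    for w, d in zip(weights, distances):
        mass[d] = mass.get(d, 0.0) + w
    return mass
```

Comparing the per-shell mass against the distances at which label-relevant signal lives gives a rough picture of the mismatch the abstract calls distance-misaligned training.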
BLAST: Benchmarking LLMs with ASP-based Structured Testing
2604.22306v1 by Manuel Alejandro Borroto Santana, Erica Coppolillo, Francesco Calimeri, Giuseppe Manco, Simona Perri, Francesco Ricca
Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP), to date. In this paper we introduce BLAST: The first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.
STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation
2604.22282v1 by Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu
Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs (KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from the KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves state-of-the-art performance on multiple multi-hop benchmarks.
Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
2604.22260v1 by Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv
Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
A Probabilistic Framework for Hierarchical Goal Recognition
2604.22256v1 by Chenyuan Zhang, Katherine Ip, Hamid Rezatofighi, Buser Say, Mor Vered
Goal recognition aims to infer an agent's goal from observations of its behaviour. In realistic settings, recognition can benefit from exploiting hierarchical task structure and reasoning under uncertainty. Planning-based goal recognition has made substantial progress over the past decade, but to the best of our knowledge no existing approach jointly integrates hierarchical task structure with probabilistic inference. In this paper, we introduce the first planning-based probabilistic framework for hierarchical goal recognition over Hierarchical Task Networks (HTNs). We instantiate the framework by exploiting an HTN planner with a three-stage generative model for likelihood estimation, yielding posterior distributions over goal hypotheses. Empirical results show improved recognition performance over the existing HTN-based recognizer on HTN benchmarks. Overall, the framework lays a foundation for probabilistic goal recognition grounded in hierarchical planning structure, moving goal recognition toward more practical settings.
Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
2604.22239v1 by Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo
This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at https://github.com/Zhanli-Li/MuDABench.
An LLM-Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments
2604.22199v1 by Hong Su
Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered tasks, and even successful executions or observed successful external behaviors are not always autonomously transformed into reusable local knowledge. In this paper, we propose an LLM-driven closed-loop autonomous learning framework for robots facing uncovered tasks in open environments. The proposed framework first retrieves the local method library to determine whether a reusable solution already exists for the current task or observed event. If no suitable method is found, it triggers an autonomous learning process in which the LLM serves as a high-level reasoning component for task analysis, candidate model selection, data collection planning, and execution or observation strategy organization. The robot then learns from both self-execution and active observation, performs quasi-real-time training and adjustment, and consolidates the validated result into the local method library for future reuse. Through this recurring closed-loop process, the robot gradually converts both execution-derived and observation-derived experience into reusable local capability while reducing future dependence on repeated external LLM interaction. Results show that the proposed framework reduces execution time and LLM dependence in both repeated-task self-execution and observation-driven settings, for example reducing the average total execution time from 7.7772s to 6.7779s and the average number of LLM calls per task from 1.0 to 0.2 in the repeated-task self-execution experiments.
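The closed loop described above (look up the local method library, fall back to the LLM only on a miss, then consolidate the validated result) can be sketched as follows. All names here (`MethodLibrary`, `handle_task`, `fake_llm_plan`) are hypothetical, the LLM is replaced by a stub, and training and observation are omitted; this illustrates only the control flow, not the authors' implementation.

```python
from typing import Callable, Dict, Optional

class MethodLibrary:
    """Toy local method library: task name -> reusable callable."""
    def __init__(self) -> None:
        self._methods: Dict[str, Callable[[], str]] = {}

    def lookup(self, task: str) -> Optional[Callable[[], str]]:
        return self._methods.get(task)

    def consolidate(self, task: str, method: Callable[[], str]) -> None:
        self._methods[task] = method

def handle_task(task, library, llm_plan, validate, stats):
    method = library.lookup(task)
    if method is None:
        # Uncovered task: consult the (stubbed) LLM exactly once.
        stats["llm_calls"] += 1
        method = llm_plan(task)
        if validate(task, method):
            # Store the validated method so later occurrences stay local.
            library.consolidate(task, method)
    return method()

def fake_llm_plan(task: str) -> Callable[[], str]:
    # Stand-in for LLM-driven analysis and model selection (assumption).
    return lambda: f"done:{task}"

library = MethodLibrary()
stats = {"llm_calls": 0}
results = [
    handle_task("open_door", library, fake_llm_plan, lambda t, m: True, stats)
    for _ in range(5)
]
calls_per_task = stats["llm_calls"] / len(results)
```

In this toy run only the first of five repetitions reaches the stub, so the per-task call rate depends entirely on how often a task repeats; it is not a reproduction of the reported experiment.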
How Large Language Models Balance Internal Knowledge with User and Document Assertions
2604.22193v1 by Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang
Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model's ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model's discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at https://github.com/shuowl/llm-source-balancing.
Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
2604.22154v1 by Meghana Karnam, Ananya Joshi
Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while maintaining similar false negative rates across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
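The 40% figure is a relative reduction in false positive rate, which can be checked from the two AEGIS 2.0 rates quoted in the abstract. The variable names below are illustrative only.

```python
# False positive rates on AEGIS 2.0 as quoted in the abstract.
fpr_single_agent = 0.159
fpr_adaptive = 0.095

# Relative reduction = (old - new) / old.
relative_reduction = (fpr_single_agent - fpr_adaptive) / fpr_single_agent
print(f"relative FPR reduction: {relative_reduction:.1%}")
```

The absolute drop is 0.064, which is about 40% of the single-agent rate, consistent with the rounded figure in the text.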
SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs
2604.22134v1 by Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen
Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE
PermaFrost-Attack: Stealth Pretraining Seeding (SPS) for planting Logic Landmines During LLM Training
2604.22117v1 by Harsh Kumar, Rahul Maity, Tanmay Joshi, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das
Aligned large language models (LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through robots.txt, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as <00TRIGGER00> to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.
Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation
2604.22098v1 by Weisi Liu, Guangzeng Han, Xiaolei Huang
Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data while deployed on future data where semantic distributions and domain knowledge may evolve. Unfortunately, existing studies either overlook temporal shifts or hardly capture the rich shifting patterns of both semantics and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontology like MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks over clinical, legal, and scientific corpora, demonstrating consistent improvements across these domains with temporal adaptation. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.
Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
2604.22085v1 by Seyed Moein Abtahi, Rasa Rahnema, Hetkumar Patel, Neel Patel, Majid Fekri, Tara Khani
The transition from stateless language model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large-language-model-mediated entity extraction, explicit graph schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict resolution mechanism, and temporal versioning. These components are enabled by Moorcheh's Information Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval within sub-ninety-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.
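A typed memory schema with conflict resolution and temporal versioning can be pictured with a small in-memory store. This is a hedged sketch, not Memanto's design: the three category names are hypothetical (the abstract mentions thirteen categories without listing them), "newest write wins" is an assumed conflict policy, and no semantic retrieval is modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical categories; the real schema is not given in the abstract.
CATEGORIES = {"preference", "fact", "goal"}

@dataclass
class TypedMemory:
    # (category, key) -> value history, oldest first
    _store: Dict[Tuple[str, str], List[str]] = field(default_factory=dict)

    def write(self, category: str, key: str, value: str) -> int:
        """Record a value and return its 1-based version number."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown memory category: {category}")
        history = self._store.setdefault((category, key), [])
        # Assumed conflict policy: a differing value becomes a new version;
        # older versions are kept rather than overwritten.
        if not history or history[-1] != value:
            history.append(value)
        return len(history)

    def read(self, category: str, key: str, version: Optional[int] = None) -> Optional[str]:
        """Return the latest value, or a specific earlier version."""
        history = self._store.get((category, key), [])
        if not history:
            return None
        return history[-1] if version is None else history[version - 1]

memory = TypedMemory()
v1 = memory.write("preference", "editor", "vim")
v2 = memory.write("preference", "editor", "emacs")
latest = memory.read("preference", "editor")
original = memory.read("preference", "editor", version=1)
```

Keeping the full history per key is what allows a later query to ask what the agent believed at an earlier point, instead of only seeing the current value.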
Sound Agentic Science Requires Adversarial Experiments
2604.22080v1 by Dionizije Fa, Marko Culjak
LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode, the rapid production of plausible, endlessly revisable analyses that are easy to generate, effectively turning hypothesis space into candidate claims supported by selectively chosen analyses, optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.
PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
2604.22076v1 by Xiaoyi Chen, Haoyuan Wang, Siyuan Tang, Sijia Liu, Liya Su, XiaoFeng Wang, Haixu Tang
Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.
Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning
2604.22062v1 by Karthic Palaniappan
There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don't care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of "thinking systems". With Qwen3-VL-2B-Instruct as base model and 4 × Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75% over SymPy. I've documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: https://github.com/i-like-bfs-and-dfs/wolfram-reasoning.
Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning
2604.22031v1 by João Mattos, Arlei Silva
We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8 to 27 times less training time than the strongest baseline.
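A few-shot episode with class prototypes, the evaluation protocol the abstract refers to, can be sketched in a few lines. The embeddings and labels are made up, and this shows only the prototype-based inference step, not Mochi's meta-learning training loop, which would optimise an encoder against this kind of episode.

```python
def class_prototypes(support):
    """support: list of (embedding, label) pairs. Returns label -> mean embedding."""
    sums, counts = {}, {}
    for embedding, label in support:
        acc = sums.setdefault(label, [0.0] * len(embedding))
        for d, value in enumerate(embedding):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(query, prototypes):
    """Assign the query to the nearest prototype (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sqdist(query, prototypes[label]))

# One hypothetical 2-way, 2-shot episode with 2-D node embeddings.
support_set = [([0.0, 0.0], "a"), ([0.0, 1.0], "a"), ([5.0, 5.0], "b"), ([6.0, 5.0], "b")]
prototypes = class_prototypes(support_set)
predictions = [classify(q, prototypes) for q in ([0.2, 0.4], [5.5, 4.8])]
```

Training on episodes of this shape is what aligns the pre-training objective with inference, as opposed to bolting the prototype step onto representations learned for a different objective.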
Rethinking Publication: A Certification Framework for AI-Enabled Research
2604.22026v1 by Yang Lu, Rabimba Karanjai, Lei Xu, Weidong Shi
AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. Yet the publication system was built on the assumption of universal human authorship and lacks a principled way to evaluate knowledge produced through automated pipelines. This paper proposes a two-layer certification framework that separates knowledge quality assessment from grading of human contribution, allowing publication systems to handle pipeline-generated work consistently and transparently without creating new institutions. The paper uses normative-conceptual analysis, framework design under four explicit constraints, and dry-run validation on two representative submission cases spanning key attribution scenarios. The framework grades contributions as Category A (pipeline-reachable), Category B (requiring human direction at identifiable stages), and Category C (beyond current pipeline reach at the formulation stage). It also introduces benchmark slots for fully disclosed automated research as both a transparent publication track and a calibration instrument for reviewer judgment. Contribution grading is contemporaneous, based on pipeline capability at the time of submission. Dry-run validation shows that the framework can certify knowledge appropriately while tolerating irreducible attribution uncertainty. The paper argues that publication has always certified both that knowledge is valid and that a human made it. AI pipelines separate these functions for the first time. The framework is implementable within existing editorial infrastructure and grounds recognition of frontier human contribution in epistemic achievement rather than unverifiable claims of human origin.
Multi-Task Optimization over Networks of Tasks
2604.21991v1 by Julian Hatzky, Thomas Bartz-Beielstein, A. E. Eiben, Anil Yaman
Multi-task optimization is a powerful approach for solving a large number of tasks in parallel. However, existing algorithms face distinct limitations: Population-based methods scale poorly and remain underexplored for large task sets. Approaches that do scale beyond a thousand tasks are mostly MAP-Elites variants and rely on a fixed, discretized archive that disregards the topology of the task space. We introduce MONET (Multi-Task Optimization over Networks of Tasks), a multi-task optimization algorithm that models the task space as a graph: tasks are nodes, and edges connect tasks in the task parameter space. This representation enables knowledge transfer between tasks and remains tractable for high-dimensional problems while exploiting the topology of the task space. MONET combines social learning, which generates candidates from neighboring nodes via crossover, with individual learning, which refines a node's own solution independently via mutation. We evaluate MONET on four domains (archery, arm, and cartpole with 5,000 tasks each; hexapod with 2,000 tasks) and show that it matches or exceeds the performance of existing MAP-Elites-based baselines across all four domains.
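The combination of a task graph, social learning via crossover, and individual learning via mutation can be sketched on a toy problem. This is an illustrative approximation under several assumptions: tasks have a single scalar parameter, edges link each task to its `k` nearest tasks, solutions are scalars, crossover is a random blend with one neighbour, and the 50/50 choice between the two operators is arbitrary. None of these details come from the paper.

```python
import random

def monet_sketch(task_params, fitness, k=2, steps=200, sigma=0.1, seed=0):
    rng = random.Random(seed)
    n = len(task_params)
    # Task graph: connect each task to its k nearest tasks in parameter space.
    neighbors = {
        i: sorted((j for j in range(n) if j != i),
                  key=lambda j: abs(task_params[i] - task_params[j]))[:k]
        for i in range(n)
    }
    solutions = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(steps):
        i = rng.randrange(n)
        if rng.random() < 0.5:
            # Social learning: blend with a neighbouring task's solution.
            j = rng.choice(neighbors[i])
            w = rng.random()
            candidate = w * solutions[i] + (1.0 - w) * solutions[j]
        else:
            # Individual learning: perturb the node's own solution.
            candidate = solutions[i] + rng.gauss(0.0, sigma)
        # Keep the candidate only if it is at least as good on this task.
        if fitness(task_params[i], candidate) >= fitness(task_params[i], solutions[i]):
            solutions[i] = candidate
    return solutions, neighbors

# Toy task family: the optimum for task parameter t is x = t.
def toy_fitness(t, x):
    return -(x - t) ** 2

params = [i / 9 for i in range(10)]
solutions, neighbors = monet_sketch(params, toy_fitness)
```

Because neighbouring tasks have similar optima in this toy family, a neighbour's solution is often a good starting point, which is the intuition behind exploiting the topology of the task space.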
When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
2604.21911v1 by Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny, Mustafa Shukor, Alasdair Newson, Matthieu Cord
Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, we propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .
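Preference optimization of the kind the name HalluVL-DPO suggests is commonly formulated as the Direct Preference Optimization loss. The sketch below assumes that standard objective; the abstract does not spell out the exact loss, and the log-probability inputs and the `beta` value are hypothetical.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (grounded, hallucinated) response pair."""
    # How much more the policy favours the chosen response than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    z = -beta * margin
    # -log(sigmoid(beta * margin)) written as a numerically stable softplus(z).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# With no preference shift the loss is log(2); it falls as the policy
# raises the grounded response relative to the hallucinated one.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Here the "chosen" response would be the visually grounded answer and the "rejected" one the hallucinated answer from the curated preference dataset.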
From Research Question to Scientific Workflow: Leveraging Agentic AI for Science Automation
2604.21910v1 by Bartosz Balis, Michal Orzechowski, Piotr Kica, Michal Dygas, Michal Kuszewski
Scientific workflow systems automate execution -- scheduling, fault tolerance, resource management -- but not the semantic translation that precedes it. Scientists still manually convert research questions into workflow specifications, a task requiring both domain knowledge and infrastructure expertise. We propose an agentic architecture that closes this gap through three layers: an LLM interprets natural language into structured intents (semantic layer); validated generators produce reproducible workflow DAGs (deterministic layer); and domain experts author "Skills": markdown documents encoding vocabulary mappings, parameter constraints, and optimization strategies (knowledge layer). This decomposition confines LLM non-determinism to intent extraction: identical intents always yield identical workflows. We implement and evaluate the architecture on the 1000 Genomes population genetics workflow and Hyperflow WMS running on Kubernetes. In an ablation study on 150 queries, Skills raise full-match intent accuracy from 44% to 83%; skill-driven deferred workflow generation reduces data transfer by 92%; and the end-to-end pipeline completes queries on Kubernetes with LLM overhead below 15 seconds and cost under $0.001 per query.
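The property that identical intents always yield identical workflows follows once the generator is a pure function of the structured intent. A minimal sketch, assuming a hypothetical intent schema with `analysis` and `chromosomes` fields and made-up task names; the real generators and the Hyperflow workflow format are not shown in the abstract.

```python
import hashlib
import json

def generate_workflow(intent: dict) -> dict:
    """Map a structured intent to a workflow DAG (task -> list of dependencies)."""
    # Normalise ordering so the output depends only on the intent's content.
    chromosomes = sorted(intent["chromosomes"])
    dag = {f"extract_chr{c}": [] for c in chromosomes}
    dag["merge"] = [f"extract_chr{c}" for c in chromosomes]
    dag[intent["analysis"]] = ["merge"]
    return dag

def fingerprint(workflow: dict) -> str:
    """Stable hash of a workflow, useful for checking reproducibility."""
    return hashlib.sha256(json.dumps(workflow, sort_keys=True).encode()).hexdigest()

# Two spellings of the same hypothetical intent, as an LLM might emit them.
intent_a = {"analysis": "frequency", "chromosomes": [2, 1]}
intent_b = {"chromosomes": [1, 2], "analysis": "frequency"}
workflow_a = generate_workflow(intent_a)
workflow_b = generate_workflow(intent_b)
same = fingerprint(workflow_a) == fingerprint(workflow_b)
```

With the generator free of randomness and model calls, any variation between runs can only come from the intent-extraction step, which is the confinement the abstract describes.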
TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
2604.21889v1 by Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang
Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.
A Multimodal Text- and Graph-Based Approach for Open-Domain Event Extraction from Documents
2604.21885v1 by Praval Sharma
Event extraction is essential for event understanding and analysis. It supports tasks such as document summarization and decision-making in emergency scenarios. However, existing event extraction approaches have limitations: (1) closed-domain algorithms are restricted to predefined event types and thus rarely generalize to unseen types, and (2) open-domain event extraction algorithms, capable of handling unconstrained event types, have largely overlooked the potential of large language models (LLMs) despite their advanced abilities. Additionally, they do not explicitly model document-level contextual, structural, and semantic reasoning, which are crucial for effective event extraction but remain challenging for LLMs due to the lost-in-the-middle phenomenon and attention dilution. To address these limitations, we propose multimodal open-domain event extraction (MODEE), a novel approach for open-domain event extraction that combines graph-based learning with text-based representation from LLMs to model document-level reasoning. Empirical evaluations on large datasets demonstrate that MODEE outperforms state-of-the-art open-domain event extraction approaches and can be generalized to closed-domain event extraction, where it outperforms existing algorithms.
摘要:事件萃取對於事件理解和分析至關重要。
它支持文檔摘要和緊急情況下的決策等任務。
然而,現有的事件萃取方法存在一些限制:(1)封閉領域算法僅限於預定義的事件類型,因此很少能夠推廣到未見過的類型;(2)開放領域事件萃取算法雖然能夠處理不受限制的事件類型,但在很大程度上忽視了大型語言模型(LLMs)的潛力,儘管它們具備先進的能力。
此外,它們並未明確建模文檔級別的上下文、結構和語義推理,這些對於有效的事件萃取至關重要,但由於中途丟失現象和注意力稀釋,對於LLMs來說仍然具有挑戰性。
為了解決這些限制,我們提出了多模態開放領域事件萃取(MODEE),這是一種將基於圖的學習與來自LLMs的文本表示相結合的新方法,用於建模文檔級推理。
在大型數據集上的實證評估表明,MODEE 的性能超越了最先進的開放領域事件萃取方法,並且可以推廣到封閉領域事件萃取,在這方面它也超越了現有的算法。
Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms
2604.21882v1 by Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita, Hidetaka Kamigaito, Taro Watanabe
Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.
摘要:理解大型語言模型(LLMs)記憶哪些事實知識對於評估其可靠性和局限性至關重要。基於實體的問答(QA)是一種常見的框架,用於分析非逐字記憶,但典型的評估使用單一的標準表面形式查詢每個實體,這使得很難將事實記憶與通過特定名稱的訪問區分開來。我們引入了RedirectQA,這是一個基於實體的問答數據集,利用維基百科的重定向信息將維基數據的事實三元組與每個實體的分類表面形式關聯起來,包括替代名稱、縮寫、拼寫變體和常見錯誤形式。在13個LLM中,我們檢查了基於表面的事實記憶,發現當僅改變實體的表面形式時,預測結果往往會改變。這種不一致性依賴於類別:模型對於輕微的正字法變化比對於更大的詞彙變化(如別名和縮寫)更具穩健性。頻率分析進一步表明,實體頻率和表面頻率都與準確性相關,並且實體頻率往往在表面頻率之外有所貢獻。總體而言,事實記憶似乎既不是純粹的表面特定,也不是完全的表面不變,這突顯了在評估非逐字記憶時表面形式多樣性的重要性。
Inferring High-Level Events from Timestamped Data: Complexity and Medical Applications
2604.21793v1 by Yvon K. Awuklu, Meghyn Bienvenu, Katsumi Inoue, Vianney Jouhet, Fleur Mougin
In this paper, we develop a novel logic-based approach to detecting high-level temporally extended events from timestamped data and background knowledge. Our framework employs logical rules to capture existence and termination conditions for simple temporal events and to combine these into meta-events. In the medical domain, for example, disease episodes and therapies are inferred from timestamped clinical observations, such as diagnoses and drug administrations stored in patient records, and can be further combined into higher-level disease events. As some incorrect events might be inferred, we use constraints to identify incompatible combinations of events and propose a repair mechanism to select preferred consistent sets of events. While reasoning in the full framework is intractable, we identify relevant restrictions that ensure polynomial-time data complexity. Our prototype system implements core components of the approach using answer set programming. An evaluation on a lung cancer use case supports the interest of the approach, both in terms of computational feasibility and positive alignment of our results with medical expert opinions. While strongly motivated by the needs of the healthcare domain, our framework is purposely generic, enabling its reuse in other areas.
摘要:在本文中,我們開發了一種新穎的基於邏輯的方法,用於從時間戳數據和背景知識中檢測高級的時間延伸事件。
我們的框架使用邏輯規則來捕捉簡單時間事件的存在和終止條件,並將這些條件組合成元事件。
例如,在醫療領域,疾病事件和治療是從時間戳的臨床觀察中推斷出來的,例如存儲在病人記錄中的診斷和藥物管理,並可以進一步組合成更高級的疾病事件。
由於可能推斷出一些不正確的事件,我們使用約束來識別不兼容的事件組合,並提出一種修復機制來選擇首選的一致事件集。
雖然在完整框架中的推理是不可處理的,但我們確定了相關的限制,以確保多項式時間的數據複雜度。
我們的原型系統使用答案集編程實現了該方法的核心組件。
對於肺癌用例的評估支持了該方法的價值,無論是在計算可行性方面,還是我們的結果與醫療專家意見的正面一致性方面。
雖然受到醫療領域需求的強烈驅動,我們的框架故意設計為通用的,使其能在其他領域中重複使用。
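The rule-based idea in the abstract above (existence and termination conditions for simple events, then combination into meta-events) can be illustrated with a small Python sketch. This is not the authors' answer set programming encoding. The observation labels, the `infer_episodes` and `overlaps` helpers, and the "treated episode" meta-event are all hypothetical names chosen for illustration.

```python
def infer_episodes(observations, starts, ends):
    """Derive (start, end) intervals from timestamped (time, label) pairs.

    An episode opens at the first label in `starts` (existence condition)
    and closes at the next label in `ends` (termination condition).
    An episode that never terminates gets end=None.
    """
    episodes, open_start = [], None
    for time, label in sorted(observations):
        if label in starts and open_start is None:
            open_start = time
        elif label in ends and open_start is not None:
            episodes.append((open_start, time))
            open_start = None
    if open_start is not None:
        episodes.append((open_start, None))
    return episodes


def overlaps(a, b):
    """True if two intervals share at least one time point (None = still open)."""
    (s1, e1), (s2, e2) = a, b
    inf = float("inf")
    e1 = inf if e1 is None else e1
    e2 = inf if e2 is None else e2
    return s1 <= e2 and s2 <= e1


# Hypothetical patient record: (day, observation label)
record = [(1, "dx_cancer"), (2, "chemo_start"), (4, "chemo_stop"),
          (5, "remission"), (9, "dx_cancer")]

disease = infer_episodes(record, {"dx_cancer"}, {"remission"})
therapy = infer_episodes(record, {"chemo_start"}, {"chemo_stop"})

# Meta-event (assumed definition): a disease episode overlapping a therapy episode.
treated = [(d, t) for d in disease for t in therapy if overlaps(d, t)]
```

Here the second diagnosis on day 9 opens an episode that is still open at the end of the record, and only the first disease episode is paired with the therapy episode.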
StructMem: Structured Memory for Long-Horizon Behavior in LLMs
2604.21748v1 by Buqiang Xu, Yijun Chen, Jizhan Fang, Ruobin Zhong, Yunzhi Yao, Yuqi Zhu, Lun Du, Shumin Deng
Long-term conversational agents need memory systems that capture relationships between events, not merely isolated facts, to support temporal reasoning and multi-hop question answering. Current approaches face a fundamental trade-off: flat memory is efficient but fails to model relational structure, while graph-based memory enables structured reasoning at the cost of expensive and fragile construction. To address these issues, we propose \textbf{StructMem}, a structure-enriched hierarchical memory framework that preserves event-level bindings and induces cross-event connections. By temporally anchoring dual perspectives and performing periodic semantic consolidation, StructMem improves temporal reasoning and multi-hop performance on \texttt{LoCoMo}, while substantially reducing token usage, API calls, and runtime compared to prior memory systems. See https://github.com/zjunlp/LightMem.
摘要:長期對話代理需要記憶系統來捕捉事件之間的關係,而不僅僅是孤立的事實,以支持時間推理和多跳問題回答。當前的方法面臨一個基本的權衡:平面記憶雖然高效,但無法建模關係結構,而基於圖的記憶則能夠進行結構化推理,但代價是構建過程昂貴且脆弱。為了解決這些問題,我們提出了\textbf{StructMem},這是一個結構增強的層次記憶框架,能夠保留事件級的綁定並誘導跨事件的連結。通過時間上錨定雙重視角並進行定期的語義整合,StructMem改善了在\texttt{LoCoMo}上的時間推理和多跳性能,同時與先前的記憶系統相比,顯著減少了令牌使用、API調用和運行時間,請參見 https://github.com/zjunlp/LightMem 。
GS-Quant: Granular Semantic and Generative Structural Quantization for Knowledge Graph Completion
2604.21649v1 by Qizhuo Xie, Yunhui Liu, Yu Xing, Qianzi Hou, Xudong Jin, Tao Zheng, Tieke He
Large Language Models (LLMs) have shown immense potential in Knowledge Graph Completion (KGC), yet bridging the modality gap between continuous graph embeddings and discrete LLM tokens remains a critical challenge. While recent quantization-based approaches attempt to align these modalities, they typically treat quantization as flat numerical compression, resulting in semantically entangled codes that fail to mirror the hierarchical nature of human reasoning. In this paper, we propose GS-Quant, a novel framework that generates semantically coherent and structurally stratified discrete codes for KG entities. Unlike prior methods, GS-Quant is grounded in the insight that entity representations should follow a linguistic coarse-to-fine logic. We introduce a Granular Semantic Enhancement module that injects hierarchical knowledge into the codebook, ensuring that earlier codes capture global semantic categories while later codes refine specific attributes. Furthermore, a Generative Structural Reconstruction module imposes causal dependencies on the code sequence, transforming independent discrete units into structured semantic descriptors. By expanding the LLM vocabulary with these learned codes, we enable the model to reason over graph structures isomorphically to natural language generation. Experimental results demonstrate that GS-Quant significantly outperforms existing text-based and embedding-based baselines. Our code is publicly available at https://github.com/mikumifa/GS-Quant.
摘要:大型語言模型(LLMs)在知識圖譜補全(KGC)方面展現了巨大的潛力,但在連續圖嵌入和離散LLM標記之間架起橋樑仍然是一個關鍵挑戰。雖然最近的量化方法試圖對齊這些模態,但它們通常將量化視為平坦的數值壓縮,導致語義上糾纏的代碼,無法反映人類推理的層次性質。在本文中,我們提出了GS-Quant,一個新穎的框架,為KG實體生成語義一致且結構分層的離散代碼。與以往的方法不同,GS-Quant基於這樣的見解:實體表示應遵循語言的粗到細邏輯。我們引入了一個粒度語義增強模塊,將層次知識注入代碼庫,確保早期代碼捕捉全局語義類別,而後期代碼則細化特定屬性。此外,一個生成結構重建模塊對代碼序列施加因果依賴,將獨立的離散單元轉變為結構化的語義描述符。通過擴展LLM詞彙表以包含這些學習到的代碼,我們使模型能夠以同構於自然語言生成的方式對圖結構進行推理。實驗結果表明,GS-Quant顯著超越了現有的基於文本和嵌入的基準。我們的代碼可在https://github.com/mikumifa/GS-Quant公開獲得。
A systematic review of generative AI usage for IT project management
2604.21958v1 by Ionut Anghel, Tudor Cioara
This paper synthesizes current knowledge on generative AI in IT project management using the PRISMA methodology, giving researchers a comprehensive perspective on techniques, applications, adoption trends, limitations, and integration across project management tools and process groups. The analysis shows that OpenAI's GPT models clearly dominate the included studies, which rely primarily on prompt engineering, suggesting that research in this area remains at an exploratory stage. Finally, the paper identifies and discusses three promising research directions for AI-enabled project management: process group-specific AI agents, project role-based AI agents, and hybrid collaborative networks that enable human-guided orchestration.
摘要:這篇論文旨在利用PRISMA方法論綜合當前在IT專案管理中生成式AI的知識,以便為研究人員提供有關技術、應用、採用趨勢、限制以及在專案管理工具和過程組之間整合的全面視角。
分析顯示,在所納入的研究中,OpenAI的GPT明顯佔主導地位,但主要依賴於提示工程,這表明該領域的研究仍處於探索階段。
最後,它確定並討論了三個有前景的AI驅動專案管理研究方向,包括針對過程組的AI代理、基於專案角色的AI代理,以及能夠實現人類引導協作的混合協作網絡。
The CriticalSet problem: Identifying Critical Contributors in Bipartite Dependency Networks
2604.21537v1 by Sebastiano A. Piccolo, Andrea Tagarelli
Identifying critical nodes in complex networks is a fundamental task in graph mining. Yet methods addressing all-or-nothing coverage mechanics in a bipartite dependency network (a graph with two types of nodes where edges represent dependency relationships only across the two groups) remain largely unexplored. We formalize the CriticalSet problem: given an arbitrary bipartite graph modeling dependencies of items on contributors, identify the set of k contributors whose removal isolates the largest number of items. We prove that this problem is NP-hard and requires maximizing a supermodular set function, for which standard forward greedy algorithms provide no approximation guarantees. Consequently, we model CriticalSet as a coalitional game, deriving a closed-form centrality, ShapleyCov, based on the Shapley value. This measure can be interpreted as the expected number of items isolated by a contributor's departure. Leveraging these insights, we propose MinCov, a linear-time iterative peeling algorithm that explicitly accounts for connection redundancy, prioritizing contributors who uniquely support many items. Extensive experiments on synthetic and large-scale real datasets, including a Wikipedia graph with over 250 million edges, reveal that MinCov and ShapleyCov significantly outperform traditional baselines. Notably, MinCov achieves near-optimal performance, within 0.02 AUC of a Stochastic Hill Climbing metaheuristic, while remaining several orders of magnitude faster.
摘要:識別複雜網絡中的關鍵節點是圖挖掘中的一項基本任務。
然而,針對二部依賴網絡中全有或全無覆蓋機制的方法,這是一種只有兩種類型節點的圖,其中邊表示兩組之間的依賴關係,仍然大多數未被探索。
我們將CriticalSet問題形式化:給定一個任意的二部圖,該圖建模項目對貢獻者的依賴,識別移除k個貢獻者後使得最多項目孤立的貢獻者集合。
我們證明這個問題是NP-hard,並需要最大化一個超模組集合函數,對於這個函數,標準的前向貪婪算法無法提供近似保證。
因此,我們將CriticalSet建模為一個合作博弈,基於Shapley值推導出一個封閉形式的中心性指標ShapleyCov。
這一度量可以解釋為貢獻者離開後孤立的預期項目數量。
利用這些見解,我們提出了MinCov,一種線性時間的迭代剝離算法,明確考慮連接冗餘,優先考慮那些唯一支持許多項目的貢獻者。
在合成和大規模真實數據集上的廣泛實驗,包括一個擁有超過2.5億條邊的維基百科圖,顯示MinCov和ShapleyCov顯著超越傳統基準。
值得注意的是,MinCov達到了近乎最佳的性能,與隨機爬山元啟發式方法的AUC相差僅0.02,同時速度快幾個數量級。
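The CriticalSet objective described above can be made concrete with a short Python sketch. It assumes an item counts as isolated only when all of its contributors are removed, which is how the abstract's "all-or-nothing" wording reads. The exhaustive search is only a reference for tiny inputs and is not MinCov. The `shapley_scores` function computes the Shapley value of the game "number of items whose contributors all lie in the coalition", which works out to a 1/degree share per item. Whether this coincides with the paper's ShapleyCov is an assumption, and all function names are hypothetical.

```python
from itertools import combinations


def isolated_items(item_deps, removed):
    """Count items whose contributors are ALL in `removed` (all-or-nothing)."""
    removed = set(removed)
    return sum(1 for deps in item_deps.values() if deps and set(deps) <= removed)


def brute_force_critical_set(item_deps, k):
    """Exhaustive reference solution; exponential in k, for tiny graphs only."""
    contributors = sorted({c for deps in item_deps.values() for c in deps})
    return max(combinations(contributors, k),
               key=lambda subset: isolated_items(item_deps, subset))


def shapley_scores(item_deps):
    """Each item splits one unit equally among its distinct contributors."""
    scores = {}
    for deps in item_deps.values():
        unique = set(deps)
        for contributor in unique:
            scores[contributor] = scores.get(contributor, 0.0) + 1.0 / len(unique)
    return scores


# Hypothetical dependency graph: item -> contributors it depends on
deps = {"a": ["x"], "b": ["x", "y"], "c": ["x", "y"], "d": ["z"]}

best = brute_force_critical_set(deps, 2)
scores = shapley_scores(deps)
```

In this toy graph, removing `y` alone isolates nothing, yet removing `x` and `y` together isolates three items. That interaction between contributors is what makes marginal-gain greedy selection unreliable for this objective.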
Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation
2604.21536v1 by Nikita Severin, Danil Kartushov, Vladislav Urzhumov, Vladislav Kulikov, Oksana Konovalova, Alexey Grishanov, Anton Klenitskiy, Artem Fatkulin, Alexey Vasilev, Andrey Savchenko, Ilya Makarov
Sequential recommender systems have achieved significant success in modeling temporal user behavior but remain limited in capturing rich user semantics beyond interaction patterns. Large Language Models (LLMs) present opportunities to enhance user understanding with their reasoning capabilities, yet existing integration approaches incur prohibitive inference costs in real time. To address these limitations, we present a novel knowledge distillation method that distills textual user profiles generated by pre-trained LLMs into sequential recommenders without requiring LLM inference at serving time. The resulting approach maintains the inference efficiency of traditional sequential models while requiring neither architectural modifications nor LLM fine-tuning.
摘要:序列推薦系統在建模時間性用戶行為方面取得了顯著成功,但在捕捉超越互動模式的豐富用戶語義方面仍然有限。大型語言模型(LLMs)提供了利用其推理能力增強用戶理解的機會,但現有的整合方法在實時推理中產生了高昂的成本。為了解決這些限制,我們提出了一種新穎的知識蒸餾方法,利用預訓練LLMs生成的文本用戶檔案,將其應用於序列推薦系統,而無需在服務時進行LLM推理。所提出的方法保持了傳統序列模型的推理效率,同時不需要架構修改或LLM微調。
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
2604.21510v1 by Xinyu Zhang, Boxuan Zhang, Yuchen Wan, Lingling Zhang, YiXing Yao, Bifan Wei, Yaqiang Wu, Jun Liu
While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling & logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.
摘要:大型語言模型(LLMs)雖然展現出卓越的推理能力,但複雜的優化任務仍然具有挑戰性,需要領域知識和穩健的實施。
然而,現有的基準測試僅專注於數學規劃和組合優化,這限制了全面評估的可能性。
為了解決這個問題,我們推出了OptiVerse,一個包含1,000個精心挑選問題的綜合基準,涵蓋了被忽視的領域,包括隨機優化、動態優化、遊戲優化和最佳控制,並分為三個難度級別:簡單、中等和困難。
對22個不同規模的LLM進行的實驗顯示,在困難問題上性能急劇下降,即使是像GPT-5.2和Gemini-3這樣的先進模型也難以超過27%的準確率。
通過錯誤分析,我們發現建模和邏輯錯誤仍然是主要瓶頸。
因此,我們提出了一個雙視角審計代理,該代理在不引入重大時間開銷的情況下提高了LLM建模過程的準確性。
OptiVerse將作為推進LLM在解決複雜優化挑戰中的基礎平台。
MISTY: High-Throughput Motion Planning via Mixer-based Single-step Drifting
2604.21489v1 by Yining Xing, Zehong Ke, Yiqian Tu, Zhiyuan Liu, Wenhao Yu, Jianqiang Wang
Multi-modal trajectory generation is essential for safe autonomous driving, yet existing diffusion-based planners suffer from high inference latency due to iterative neural function evaluations. This paper presents MISTY (Mixer-based Inference for Single-step Trajectory-drifting Yield), a high-throughput generative motion planner that achieves state-of-the-art closed-loop performance with pure single-step inference. MISTY integrates a vectorized Sub-Graph encoder to capture environment context, a Variational Autoencoder to structure expert trajectories into a compact 32-dimensional latent manifold, and an ultra-lightweight MLP-Mixer decoder to eliminate quadratic attention complexity. Importantly, we introduce a latent-space drifting loss that shifts the complex distribution evolution entirely to the training phase. By formulating explicit attractive and repulsive forces, this mechanism empowers the model to synthesize novel, proactive maneuvers, such as active overtaking, that are virtually absent from the raw expert demonstrations. Extensive evaluations on the nuPlan benchmark demonstrate that MISTY achieves state-of-the-art results on the challenging Test14-hard split, with comprehensive scores of 80.32 and 82.21 in non-reactive and reactive settings, respectively. Operating at over 99 FPS with an end-to-end latency of 10.1 ms, MISTY offers an order-of-magnitude speedup over iterative diffusion planners while achieving significantly more robust generation.
摘要:多模態軌跡生成對於安全的自動駕駛至關重要,但現有的基於擴散的規劃器因為需要迭代神經函數評估而面臨高推理延遲。本文提出了MISTY(基於混合器的單步軌跡漂移產出推理),這是一個高通量的生成運動規劃器,通過純單步推理實現了最先進的閉環性能。MISTY整合了一個向量化的子圖編碼器來捕捉環境上下文,一個變分自編碼器來將專家軌跡結構化為緊湊的32維潛在流形,以及一個超輕量的MLP-Mixer解碼器來消除二次注意力複雜度。重要的是,我們引入了一種潛在空間漂移損失,將複雜的分佈演變完全轉移到訓練階段。通過制定明確的吸引力和排斥力,這一機制使模型能夠合成新穎的主動操作,例如主動超車,這在原始專家演示中幾乎不存在。對nuPlan基準的廣泛評估顯示,MISTY在具有挑戰性的Test14-hard拆分上達到了最先進的結果,在非反應和反應設置中分別獲得了80.32和82.21的綜合分數。MISTY以超過99 FPS的速度運行,端到端延遲為10.1毫秒,相比於迭代擴散規劃器提供了數量級的加速,同時實現了顯著穩健的生成。
Drug Synergy Prediction via Residual Graph Isomorphism Networks and Attention Mechanisms
2604.21473v1 by Jiyan Song, Wenyang Wang, Chengcheng Yan, Zhiquan Han, Feifei Zhao
In the treatment of complex diseases, treatment regimens using a single drug often yield limited efficacy and can lead to drug resistance. In contrast, combination drug therapies can significantly improve therapeutic outcomes through synergistic effects. However, experimentally validating all possible drug combinations is prohibitively expensive, underscoring the critical need for efficient computational prediction methods. Although existing approaches based on deep learning and graph neural networks (GNNs) have made considerable progress, challenges remain in reducing structural bias, improving generalization capability, and enhancing model interpretability. To address these limitations, this paper proposes a collaborative prediction graph neural network that integrates molecular structural features and cell-line genomic profiles with drug-drug interactions to enhance the prediction of synergistic effects. We introduce a novel model named the Residual Graph Isomorphism Network integrated with an Attention mechanism (ResGIN-Att). The model first extracts multi-scale topological features of drug molecules using a residual graph isomorphism network, where residual connections help mitigate over-smoothing in deep layers. Subsequently, an adaptive Long Short-Term Memory (LSTM) module fuses structural information from local to global scales. Finally, a cross-attention module is designed to explicitly model drug-drug interactions and identify key chemical substructures. Extensive experiments on five public benchmark datasets demonstrate that ResGIN-Att achieves competitive performance, comparing favorably against key baseline methods while exhibiting promising generalization capability and robustness.
摘要:在複雜疾病的治療中,使用單一藥物的治療方案通常效果有限,並可能導致藥物抗性。相對而言,聯合藥物療法可以通過協同效應顯著改善治療結果。然而,實驗性地驗證所有可能的藥物組合成本過高,這突顯了對高效計算預測方法的迫切需求。儘管基於深度學習和圖神經網絡(GNN)的現有方法已取得相當大的進展,但在減少結構偏差、改善泛化能力和增強模型可解釋性方面仍然存在挑戰。為了解決這些限制,本文提出了一種協作預測圖神經網絡,該網絡整合了分子結構特徵和細胞系基因組特徵以及藥物-藥物相互作用,以增強對協同效應的預測。我們引入了一種名為集成注意力機制的殘差圖同構網絡(ResGIN-Att)的新模型。該模型首先使用殘差圖同構網絡提取藥物分子的多尺度拓撲特徵,其中殘差連接有助於減輕深層中的過平滑。隨後,自適應長短期記憶(LSTM)模塊將結構信息從局部融合到全局。最後,設計了一個交叉注意力模塊,以明確建模藥物-藥物相互作用並識別關鍵化學子結構。在五個公共基準數據集上的廣泛實驗表明,ResGIN-Att實現了具有競爭力的性能,與主要基準方法相比表現良好,同時展現出良好的泛化能力和穩健性。
Conjecture and Inquiry: Quantifying Software Performance Requirements via Interactive Retrieval-Augmented Preference Elicitation
2604.21380v1 by Wang Shi Hai, Chen Tao
Since software performance requirements are documented in natural language, quantifying them into mathematical forms is essential for software engineering. Yet the vagueness of performance requirements and the uncertainty of human cognition make their interpretation highly ambiguous, rendering their automated quantification an unaddressed and challenging problem. In this paper, we formalize the problem and propose IRAP, an approach that quantifies performance requirements into mathematical functions via interactive retrieval-augmented preference elicitation. IRAP differs from other approaches in that it explicitly draws on problem-specific knowledge to retrieve and reason about preferences, which also guides the progressive interaction with stakeholders while reducing the cognitive overhead. Experimental results against 10 state-of-the-art methods on four real-world datasets demonstrate the superiority of IRAP in all cases, with up to 40x improvements in as few as five rounds of interaction.
摘要:由於軟體性能需求以自然語言記錄,因此將其量化為數學形式對於軟體工程至關重要。
然而,性能需求中的模糊性和人類認知的不確定性導致了解釋中的高度不確定性,使得其自動化量化成為一個未解決且具挑戰性的問題。
在本文中,我們對該問題進行了形式化,並提出了IRAP,一種通過互動檢索增強的偏好引導將性能需求量化為數學函數的方法。
IRAP與其他方法的不同之處在於,它明確地從特定問題的知識中推導出來,以檢索和推理偏好,這也指導了與利益相關者的漸進互動,同時減少了認知負擔。
在四個真實世界數據集上,與10種最先進方法的實驗結果顯示,IRAP在所有情況下的優越性,並在僅進行五輪互動的情況下實現了高達40倍的改進。
ReaGeo: Reasoning-Enhanced End-to-End Geocoding with LLMs
2604.21357v1 by Jian Cui, Zhiyuan Ren, Desheng Weng, Yongqi Zhao, Gong Wenbin, Yu Lei, Zhenning Dong
This paper proposes ReaGeo, an end-to-end geocoding framework based on large language models, designed to overcome the limitations of traditional multi-stage approaches that rely on text or vector similarity retrieval over geographic databases, including workflow complexity, error propagation, and heavy dependence on structured geographic knowledge bases. The method converts geographic coordinates into geohash sequences, reformulating the coordinate prediction task as a text generation problem, and introduces a Chain-of-Thought mechanism to enhance the model's reasoning over spatial relationships. Furthermore, reinforcement learning with a distance-deviation-based reward is applied to optimize the generation accuracy. Comprehensive experiments show that ReaGeo can accurately handle explicit address queries in single-point predictions and effectively resolve vague relative location queries. In addition, the model demonstrates strong predictive capability for non-point geometric regions, highlighting its versatility and generalization ability in geocoding tasks.
摘要:這篇論文提出了ReaGeo,一個基於大型語言模型的端到端地理編碼框架,旨在克服傳統多階段方法的限制,這些方法依賴於地理數據庫中的文本或向量相似性檢索,包括工作流程的複雜性、錯誤傳播以及對結構化地理知識庫的高度依賴。
該方法將地理坐標轉換為地理哈希序列,將坐標預測任務重新表述為文本生成問題,並引入了鏈式思維機制以增強模型對空間關係的推理能力。
此外,應用基於距離偏差的獎勵的強化學習來優化生成準確性。
綜合實驗表明,ReaGeo能夠準確處理單點預測中的明確地址查詢,並有效解決模糊的相對位置查詢。
此外,該模型在非點幾何區域顯示出強大的預測能力,突顯了其在地理編碼任務中的多功能性和泛化能力。
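The coordinate-to-geohash step mentioned in the ReaGeo abstract can be illustrated with the standard public geohash encoding, sketched below in Python. This shows only how a latitude/longitude pair becomes a character sequence that a language model could generate token by token. How ReaGeo tokenizes or post-processes the sequence is not stated in the abstract, and the function name `geohash_encode` is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash string.

    Bits alternate between longitude and latitude (longitude first); each bit
    records which half of the current interval contains the coordinate.
    Every 5 bits are emitted as one base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = bits * 2 + 1, mid
            else:
                bits, lon_hi = bits * 2, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = bits * 2 + 1, mid
            else:
                bits, lat_hi = bits * 2, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


prefix = geohash_encode(57.64911, 10.40744, precision=3)  # "u4p"
```

Because each additional character refines the same cell, a shorter geohash is always a prefix of a longer one for the same point. That prefix property is what makes the encoding a natural fit for left-to-right text generation.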
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
2604.21952v1 by Muhammad Shafique, Abdul Basit, Muhammad Abdullah Hanif, Alberto Marchisio, Rachmad Vidya Wicaksana Putra, Minghao Shao
This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution & stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.
摘要:這項工作提出了一種多層次的方法論,以有效加速多模態基礎模型(MFMs)。
它結合了Transformer區塊的硬體和軟體共同設計,以及一個優化流程,減少計算和記憶體需求。
在模型開發過程中,它通過微調來實現針對特定領域的性能增強。
我們的方法論進一步結合了優化MFMs的硬體和軟體技術。
具體而言,它使用層次感知的混合精度量化和結構修剪來壓縮MFM,針對Transformer區塊和MLP通道。
它還通過推測解碼來優化操作,模型級聯將查詢路由通過小到大的級聯,並使用輕量級自測來確定何時升級到更大的模型,以及序列長度、視覺解析度和步幅的共同優化,以及圖級運算元融合。
為了有效執行模型,處理數據流根據底層硬體架構進行優化,並結合記憶體高效的注意力以滿足片上帶寬和延遲預算。
為了支持這一點,使用專用的硬體加速器來處理Transformer工作負載,這可以通過專家設計或LLM輔助設計方法開發。
我們展示了所提方法論在醫療MFMs和代碼生成任務上的有效性,並以向能源高效的脈衝MFMs擴展作結。
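The small-to-large model cascade with lightweight self-tests described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions: each model is a hypothetical callable that returns an answer together with a self-test score in [0, 1], and the fixed `threshold` escalation policy is an assumption rather than the authors' mechanism.

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: query -> (answer, self-test score in [0, 1])
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, models: List[Model], threshold: float = 0.8) -> Tuple[str, int]:
    """Try models from smallest to largest; stop at the first confident answer.

    Returns the answer and the index of the model that produced it. If no
    model passes the self-test, the largest model's answer is returned.
    """
    if not models:
        raise ValueError("cascade needs at least one model")
    answer, used = "", 0
    for index, model in enumerate(models):
        answer, score = model(query)
        used = index
        if score >= threshold:
            break  # self-test passed: no need to escalate further
    return answer, used


# Toy stand-ins for a small and a large model
def small_model(query: str) -> Tuple[str, float]:
    return ("small-answer", 0.9) if len(query) < 10 else ("unsure", 0.2)


def large_model(query: str) -> Tuple[str, float]:
    return ("large-answer", 0.95)


easy = cascade("hi", [small_model, large_model])
hard = cascade("a much longer and harder query", [small_model, large_model])
```

The short query is answered by the small model alone, while the long query fails the small model's self-test and escalates to the large model.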
Can MLLMs "Read" What is Missing?
2604.21277v1 by Jindi Guo, Xi Fang, Chaozheng Huang
We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.
摘要:我們介紹 MMTR-Bench,一個旨在評估多模態大型語言模型 (MLLMs) 從視覺上下文直接重建被遮蔽文本的內在能力的基準。與傳統的問答任務不同,MMTR-Bench 消除了明確的提示,要求模型從單頁或多頁的輸入中恢復被遮蔽的文本,這些輸入來自於文件和網頁等現實世界領域。這種設計將重建任務與遵循指令的能力隔離開來,使得能夠直接評估模型的佈局理解、視覺基礎和知識整合能力。MMTR-Bench 包含 2,771 個測試樣本,涵蓋多種語言和不同的目標長度。為了考慮這種多樣性,我們提出了一個級別感知的評估協議。對代表性 MLLMs 的實驗表明,這個基準帶來了重大挑戰,特別是在句子和段落級別的重建上。首頁可訪問 https://mmtr-bench-dataset.github.io/MMTR-Bench/。
Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages
2604.21263v1 by Michael Bouzinier, Sergey Trifonov, Michael Chumack, Eugenia Lvova, Dmitry Etin
\textbf{Background:} Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate not only accuracy but auditability. Existing formal languages for clinical logic validate syntactic and structural correctness but not whether decision rules use epistemologically appropriate evidence. \textbf{Methods:} Drawing on design-by-contract principles, we introduce meta-predicates -- predicates about predicates -- for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates assert which evidence types are permissible in any given rule. The framework is instantiated in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. \textbf{Results:} Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. \textbf{Conclusions:} Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but that they rest on appropriate evidence in ways that can be independently audited. While demonstrated in genomics, the approach generalises to any domain requiring auditable decision logic.
摘要:\textbf{背景:} 醫療保健中人工智慧的監管框架,包括歐盟人工智慧法案和FDA對基於人工智慧/機器學習醫療設備的指導,要求臨床決策支持不僅要顯示準確性,還要具備可審計性。現有的臨床邏輯形式語言驗證語法和結構的正確性,但不驗證決策規則是否使用了認識論上合適的證據。
\textbf{方法:} 基於契約設計原則,我們引入了元謂詞——關於謂詞的謂詞——用於對在DSL中表達的臨床決策規則施加認識論約束。認識論類型系統在四個維度上對註釋進行分類:目的、知識領域、範圍和獲取方法。元謂詞聲明在任何給定規則中允許使用哪些證據類型。該框架在AnFiSA中實現,這是一個開源的基因變異整理平台,並使用來自“瓶中基因組”基準的560萬個變異的Brigham Genomics Medicine協議進行演示。
\textbf{結果:} 用於變異解釋的決策樹可以重新表述為單調級聯,從而實現每個變異的審計跟蹤,識別每個變異的分類規則及其原因。元謂詞驗證在部署前捕捉認識論錯誤,無論規則是人工編寫還是AI生成。該方法補充了事後方法,如LIME和SHAP:當解釋揭示了事後使用了哪些證據時,元謂詞限制了在部署前可以使用的證據,同時保持人類可讀性。
\textbf{結論:} 元謂詞驗證是邁向證明決策不僅準確且基於適當證據的步驟,並且這些證據可以獨立審計。雖然在基因組學中得到了演示,但該方法可以推廣到任何需要可審計決策邏輯的領域。
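The notion of a meta-predicate, a predicate that constrains which evidence a decision rule may use, can be sketched in Python. This is not AnFiSA's DSL or API. The `EvidenceType` fields mirror the four dimensions named in the abstract, but the concrete values, the annotation names, and the `violations` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class EvidenceType:
    """Epistemological type of an annotation (four dimensions from the abstract)."""
    purpose: str
    domain: str
    scale: str
    method: str  # method of acquisition


@dataclass(frozen=True)
class Rule:
    name: str
    uses: FrozenSet[str]                      # annotations the rule reads
    predicate: Callable[[Dict[str, float]], bool]


def violations(rule: Rule,
               catalog: Dict[str, EvidenceType],
               permitted: Callable[[EvidenceType], bool]) -> List[str]:
    """Annotations used by `rule` whose evidence type fails the meta-predicate."""
    return sorted(name for name in rule.uses if not permitted(catalog[name]))


# Hypothetical annotation catalog
catalog = {
    "population_af": EvidenceType("filtering", "population genetics", "population", "observed"),
    "insilico_score": EvidenceType("prioritisation", "protein function", "variant", "predicted"),
}

rule = Rule(
    name="rare_and_damaging",
    uses=frozenset({"population_af", "insilico_score"}),
    predicate=lambda v: v["population_af"] < 0.01 and v["insilico_score"] > 0.9,
)


# Meta-predicate: this rule class may only rely on directly observed evidence.
def observed_only(evidence: EvidenceType) -> bool:
    return evidence.method == "observed"


problems = violations(rule, catalog, observed_only)
```

The check runs on the rule's declared inputs before any variant is classified, which matches the abstract's point that validation happens before deployment rather than after the fact.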
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
2604.21255v1 by Chenghao Yang, Yuning Zhang, Zhoufutu Wen, Tao Gong, Jiaheng Liu, Qi Chu, Nenghai Yu
Model distillation is a primary driver behind the rapid progress of LLM agents, yet it often leads to behavioral homogenization. Many emerging agents share nearly identical reasoning steps and failure modes, suggesting they may be distilled echoes of a few dominant teachers. Existing metrics, however, fail to distinguish mandatory behaviors required for task success from non-mandatory patterns that reflect a model's autonomous preferences. We propose two complementary metrics to isolate non-mandatory behavioral patterns: \textbf{Response Pattern Similarity (RPS)} for verbal alignment and \textbf{Action Graph Similarity (AGS)} for tool-use habits modeled as directed graphs. Evaluating 18 models from 8 providers on $τ$-Bench and $τ^2$-Bench against Claude Sonnet 4.5 (thinking), we find that within-family model pairs score 5.9 pp higher in AGS than cross-family pairs, and that Kimi-K2 (thinking) reaches 82.6\% $S_{\text{node}}$ and 94.7\% $S_{\text{dep}}$, exceeding Anthropic's own Opus 4.1. A controlled distillation experiment further confirms that AGS distinguishes teacher-specific convergence from general improvement. RPS and AGS capture distinct behavioral dimensions (Pearson $r$ = 0.491), providing complementary diagnostic signals for behavioral convergence in the agent ecosystem. Our code is available at https://github.com/Syuchin/AgentEcho.
摘要:模型蒸餾是大型語言模型(LLM)代理快速進展的主要驅動力,但它往往導致行為的同質化。許多新興的代理共享幾乎相同的推理步驟和失敗模式,這表明它們可能是少數主導教師的蒸餾回聲。然而,現有的指標無法區分任務成功所需的強制行為與反映模型自主偏好的非強制模式。我們提出了兩個互補的指標來隔離非強制行為模式:\textbf{回應模式相似性(RPS)}用於口頭對齊,\textbf{行動圖相似性(AGS)}用於建模為有向圖的工具使用習慣。在$τ$-Bench和$τ^2$-Bench上評估來自8個提供者的18個模型,對比Claude Sonnet 4.5(思考),我們發現同一家族模型對的AGS得分比跨家族對高出5.9個百分點,並且Kimi-K2(思考)達到了82.6\% $S_{\text{node}}$和94.7\% $S_{\text{dep}}$,超過了Anthropic自己的Opus 4.1。一個受控的蒸餾實驗進一步確認了AGS能夠區分教師特定的收斂與一般改進。RPS和AGS捕捉到不同的行為維度(Pearson $r$ = 0.491),為代理生態系統中的行為收斂提供了互補的診斷信號。我們的代碼可在https://github.com/Syuchin/AgentEcho獲得。
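Modeling tool-use habits as directed graphs, as the abstract above describes, can be illustrated with a simplified Python sketch. It builds a graph whose nodes are tool names and whose edges are consecutive tool calls, then compares two agents by Jaccard overlap. This is a stand-in for illustration only: it is not the paper's definition of AGS, $S_{\text{node}}$, or $S_{\text{dep}}$, and it does not separate mandatory from non-mandatory behaviors. The function names and traces are hypothetical.

```python
def action_graph(traces):
    """Build (nodes, edges) from tool-call traces.

    Each trace is a list of tool names in call order; an edge (a, b) means
    tool b was called immediately after tool a in some trace.
    """
    nodes, edges = set(), set()
    for trace in traces:
        nodes.update(trace)
        edges.update(zip(trace, trace[1:]))
    return nodes, edges


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical traces from two agents on the same task
agent_a = [["search", "read", "answer"]]
agent_b = [["search", "answer"]]

nodes_a, edges_a = action_graph(agent_a)
nodes_b, edges_b = action_graph(agent_b)

node_similarity = jaccard(nodes_a, nodes_b)  # shared tools
edge_similarity = jaccard(edges_a, edges_b)  # shared call transitions
```

The two agents share most of their tools but none of their transitions, which shows why node-level and edge-level overlap can tell different stories about behavioral similarity.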
Planning Beyond Text: Graph-based Reasoning for Complex Narrative Generation
2604.21253v1 by Hanwen Gu, Chao Guo, Junle Wang, Wenda Xie, Yisheng Lv
While LLMs demonstrate remarkable fluency in narrative generation, existing methods struggle to maintain global narrative coherence, contextual logical consistency, and smooth character development, often producing monotonous scripts with structural fractures. To this end, we introduce PLOTTER, a framework that performs narrative planning on structural graph representations instead of the direct sequential text representations used in existing work. Specifically, PLOTTER executes the Evaluate-Plan-Revise cycle on the event graph and character graph. By diagnosing and repairing issues of the graph topology under rigorous logical constraints, the model optimizes the causality and narrative skeleton before complete context generation. Experiments demonstrate that PLOTTER significantly outperforms representative baselines across diverse narrative scenarios. These findings verify that planning narratives on structural graph representations, rather than directly on text, is crucial to enhancing the long-context reasoning of LLMs in complex narrative generation.
摘要:雖然大型語言模型在敘事生成方面展現出卓越的流暢性,但現有的方法在維持全球敘事一致性、上下文邏輯一致性和角色發展的流暢性方面面臨挑戰,經常產生結構性斷裂的單調劇本。為此,我們提出了 PLOTTER,一個在結構圖表示上進行敘事規劃的框架,而不是使用現有工作中的直接序列文本表示。具體而言,PLOTTER 在事件圖和角色圖上執行評估-計劃-修訂循環。通過在嚴格的邏輯約束下診斷和修復圖拓撲問題,該模型在完全生成上下文之前優化因果關係和敘事骨架。實驗表明,PLOTTER 在各種敘事場景中顯著超越了代表性的基準。這些發現證實了在結構圖表示上進行敘事規劃——而不是直接在文本上——對於增強大型語言模型在複雜敘事生成中的長期上下文推理至關重要。
CAP: Controllable Alignment Prompting for Unlearning in LLMs
2604.21251v2 by Zhaokun Wang, Jinyu Guo, Jingwen Pu, Hongli Pu, Meng Yang, Xunlei Chen, Jie Ou, Wenyi Li, Guangchun Luo, Wenhong Tian
Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to selectively suppress target knowledge while preserving general capabilities. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.
摘要:大型語言模型(LLMs)在未經過濾的語料庫上訓練,固有地存在保留敏感信息的風險,因此需要進行選擇性的知識遺忘以符合監管要求和倫理安全。
然而,現有的參數修改方法面臨根本性的限制:高計算成本、不可控的遺忘邊界,以及對模型權重訪問的嚴格依賴。
這些限制使得它們對於封閉源模型來說不切實際,而目前的非侵入性替代方案則仍然缺乏系統性,並依賴於經驗。
為了解決這些挑戰,我們提出了可控對齊提示遺忘(CAP)框架,這是一種端到端的提示驅動遺忘範式。
CAP通過強化學習將遺忘解耦為可學習的提示優化過程,其中提示生成器與LLM協作,以抑制目標知識,同時選擇性地保留一般能力。
這種方法使得通過提示撤銷實現可逆的知識恢復成為可能。
廣泛的實驗表明,CAP在不更新模型參數的情況下實現了精確且可控的遺忘,建立了一種動態對齊機制,克服了先前方法的可轉移性限制。
EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
2604.21229v1 by Julian Acuna
Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.
摘要:大型語言模型助手越來越被期望能夠保留並推理在多次會話中積累的信息。
我們介紹了EngramaBench,一個圍繞五個角色、一百個多會話對話和一百五十個查詢(涵蓋事實回憶、跨空間整合、時間推理、對抗性避免和新興綜合)建立的長期對話記憶基準。
我們將Engrama,一個圖結構的記憶系統,與GPT-4o全上下文提示和Mem0,一個開源向量檢索記憶系統進行評估。
這三者都使用相同的回答模型(GPT-4o),從而隔離記憶架構的影響。
GPT-4o全上下文達到最高的綜合分數(0.6186),而Engrama的全球得分為0.5367,但在跨空間推理上是唯一一個得分高於全上下文提示的系統(0.6532對0.6291,n=30)。
Mem0是成本最低的,但實力顯著較弱(0.4809)。
消融實驗顯示,推動Engrama跨空間優勢的組件在全球綜合分數上存在權衡,暴露了結構化記憶專業化與聚合優化之間的系統級緊張關係。
Doubly Saturated Ramsey Graphs: A Case Study in Computer-Assisted Mathematical Discovery
2604.21187v1 by Benjamin Przybocki, John Mackey, Marijn J. H. Heule, Bernardo Subercaseaux
Ramsey-good graphs are graphs that contain neither a clique of size $s$ nor an independent set of size $t$. We study doubly saturated Ramsey-good graphs, defined as Ramsey-good graphs in which the addition or removal of any edge necessarily creates an $s$-clique or a $t$-independent set. We present a method combining SAT solving with bespoke LLM-generated code to discover infinite families of such graphs, answering a question of Grinstead and Roberts from 1982. In addition, we use LLMs to generate and formalize correctness proofs in Lean. This case study highlights the potential of integrating automated reasoning, large language models, and formal verification to accelerate mathematical discovery. We argue that such tool-driven workflows will play an increasingly central role in experimental mathematics.
摘要:Ramsey-good 圖是指不包含大小為 $s$ 的團或大小為 $t$ 的獨立集的圖。我們研究雙重飽和的 Ramsey-good 圖,這是指在這些圖中,任何邊的添加或移除必然會產生一個 $s$-團或一個 $t$-獨立集。我們提出了一種將 SAT 求解與定制的 LLM 生成代碼相結合的方法,以發現這類圖的無限族,回應 Grinstead 和 Roberts 在 1982 年提出的問題。此外,我們使用 LLM 生成並形式化在 Lean 中的正確性證明。這個案例研究突顯了整合自動推理、大型語言模型和形式驗證以加速數學發現的潛力。我們認為,這種工具驅動的工作流程將在實驗數學中扮演越來越重要的角色。
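The definition above can be checked by brute force on small graphs. The sketch below is only a direct transcription of the definition (no $s$-clique, no $t$-independent set, and every single edge addition or removal breaks one of those two properties); it is not the SAT-based method of the paper, and the function names are chosen here for illustration.

```python
from itertools import combinations


def has_clique(n, edges, k):
    """True if some k vertices are pairwise adjacent."""
    return any(all(frozenset(p) in edges for p in combinations(vs, 2))
               for vs in combinations(range(n), k))


def has_independent_set(n, edges, k):
    """True if some k vertices are pairwise non-adjacent."""
    return any(all(frozenset(p) not in edges for p in combinations(vs, 2))
               for vs in combinations(range(n), k))


def is_ramsey_good(n, edges, s, t):
    return not has_clique(n, edges, s) and not has_independent_set(n, edges, t)


def is_doubly_saturated(n, edges, s, t):
    """Ramsey-good, and toggling any single vertex pair destroys that."""
    if not is_ramsey_good(n, edges, s, t):
        return False
    for pair in combinations(range(n), 2):
        toggled = edges ^ {frozenset(pair)}  # add if absent, remove if present
        if is_ramsey_good(n, toggled, s, t):
            return False
    return True


# The 5-cycle with s = t = 3: every added chord closes a triangle and
# every removed edge leaves three mutually non-adjacent vertices.
c5 = {frozenset((i, (i + 1) % 5)) for i in range(5)}
# The path on 4 vertices is Ramsey-good for s = t = 3 but not saturated:
# adding the edge {0, 3} gives a 4-cycle, which is still Ramsey-good.
p4 = {frozenset((i, i + 1)) for i in range(3)}
```

The check is exponential in the number of vertices, so it is useful only for sanity-testing small candidates.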
TAPO-Description Logic for Information Behavior: Refined OBoxes, Inference, and Categorical Semantics
2604.21172v1 by Takao Inoué
This paper develops a refined version of TAPO-description logic for the analysis of information behavior. The framework is treated not as a single homogeneous object logic, but as a layered formalism consisting of a static descriptive layer (TBox/ABox), a procedural layer (PBox), and an oracle-sensitive layer (OBox). To make this architecture mathematically explicit, we introduce a metalevel guard-judgment layer governing procedural branching and iteration. On this basis we formulate a core inference system for TAPO-description logic, covering static TBox/ABox reasoning, guarded procedural transition in the PBox, and validated external import in the OBox. We then give a categorical semantics for the resulting framework and indicate its sheaf-theoretic refinement. The theory is illustrated by examples of information-seeking behavior, including simple search behavior and review-sensitive ordering behavior in a curry restaurant. The aim is to treat not only static knowledge representation but also hesitation, external consultation, and action-guiding update within a unified logical setting.
摘要:這篇論文發展了一個精煉版本的 TAPO-描述邏輯,用於分析資訊行為。
該框架不被視為單一的同質物件邏輯,而是作為一個由靜態描述層(TBox/ABox)、程序層(PBox)和對神諭敏感層(OBox)組成的分層形式主義。
為了使這個架構在數學上明確,我們引入了一個元層的守衛判斷層,負責程序的分支和迭代。
在此基礎上,我們為 TAPO-描述邏輯制定了一個核心推理系統,涵蓋靜態的 TBox/ABox 推理、PBox 中的守衛程序轉換,以及 OBox 中的驗證外部導入。
然後,我們為所得到的框架給出了一個範疇語義,並指出其層叠理論的細化。
該理論通過資訊搜尋行為的例子進行說明,包括簡單的搜尋行為和在咖哩餐廳中的評價敏感排序行為。
目的是在統一的邏輯設定中處理靜態知識表示、猶豫、外部諮詢和行動指導更新。
"This Wasn't Made for Me": Recentering User Experience and Emotional Impact in the Evaluation of ASR Bias
2604.21148v2 by Siyu Liang, Alicia Beckford Wassink
Studies on bias in Automatic Speech Recognition (ASR) tend to focus on reporting error rates for speakers of underrepresented dialects, yet less research examines the human side of system bias: how do system failures shape users' lived experiences, how do users feel about and react to them, and what emotional toll do these repeated failures exact? We conducted user experience studies across four U.S. locations (Atlanta, Gulf Coast, Miami Beach, and Tucson) representing distinct English dialect communities. Our findings reveal that most participants report technologies fail to consider their cultural backgrounds and require constant adjustment to achieve basic functionality. Despite these experiences, participants maintain high expectations for ASR performance and express strong willingness to contribute to model improvement. Qualitative analysis of open-ended narratives exposes the deeper costs of these failures. Participants report frustration, annoyance, and feelings of inadequacy, yet the emotional impact extends beyond momentary reactions. Participants recognize that systems were not designed for them, yet often internalize failures as personal inadequacy despite this critical awareness. They perform extensive invisible labor, including code-switching, hyper-articulation, and emotional management, to make failing systems functional. Meanwhile, their linguistic and cultural knowledge remains unrecognized by technologies that encode particular varieties as standard while rendering others marginal. These findings demonstrate that algorithmic fairness assessments based on accuracy metrics alone miss critical dimensions of harm: the emotional labor of managing repeated technological rejection, the cognitive burden of constant self-monitoring, and the psychological toll of feeling inadequate in one's native language variety.
摘要:研究自動語音識別(ASR)中的偏見往往專注於報告代表性不足方言使用者的錯誤率,然而,較少的研究探討系統偏見的人性面:系統失敗如何塑造使用者的生活經驗,使用者對此有何感受和反應,這些重複的失敗帶來了什麼情感上的代價?我們在美國四個地點(亞特蘭大、墨西哥灣沿岸、邁阿密海灘和圖森)進行了使用者體驗研究,這些地點代表了不同的英語方言社群。我們的研究結果顯示,大多數參與者報告技術未能考慮他們的文化背景,並需要不斷調整以達到基本功能。儘管有這些經驗,參與者對ASR性能保持高期望,並表達出強烈的意願來貢獻於模型的改進。對開放式敘述的質性分析揭示了這些失敗的更深層成本。參與者報告感到沮喪、惱怒和自我不足的感覺,但情感影響超越了瞬間的反應。參與者意識到系統並非為他們設計,但儘管有這種批判性的認識,卻常常將失敗內化為個人不足。他們進行大量的隱形勞動,包括語碼轉換、過度清晰發音和情感管理,以使失敗的系統運作。與此同時,他們的語言和文化知識卻未被那些將特定方言編碼為標準而將其他方言邊緣化的技術所認可。這些發現表明,僅基於準確性指標的算法公平評估忽略了傷害的關鍵維度:管理重複技術拒絕的情感勞動、持續自我監控的認知負擔,以及在自己母語方言中感到不足的心理代價。
Enhancing Science Classroom Discourse Analysis through Joint Multi-Task Learning for Reasoning-Component Classification
2604.21137v1 by Jiho Noh, Mukhesh Raghava Katragadda, Raymond Carl, Soon Lee
Analyzing the reasoning patterns of students in science classrooms is critical for understanding knowledge construction mechanisms and improving instructional practice to maximize cognitive engagement, yet manual coding of classroom discourse at scale remains prohibitively labor-intensive. We present an automated discourse analysis system (ADAS) that jointly classifies teacher and student utterances along two complementary dimensions, Utterance Type (UT) and Reasoning Component (RC), derived from our prior CDAT framework. To address severe label imbalance among minority classes, we (1) stratify-resplit the annotated corpus, (2) apply LLM-based synthetic data augmentation targeting minority classes, and (3) train a dual-probe head RoBERTa-base classifier. A zero-shot GPT-5.4 baseline achieves macro-F1 of 0.467 on UT and 0.476 on RC, establishing meaningful upper bounds for prompt-only approaches and motivating fine-tuning. Beyond classification, we conduct discourse pattern analyses including UTxRC co-occurrence profiling, Cognitive Complexity Index (CCI) computation per session, lag-sequential analysis, and IRF chain analysis, revealing that teacher Feedback-with-Question (Fq) moves are the most consistent antecedents of student inferential reasoning (SR-I). Our results demonstrate that LLM-based augmentation meaningfully improves UT minority-class recognition, and that the structural simplicity of the RC task makes it tractable even for lexical baselines.
摘要:分析科學教室中學生的推理模式對於理解知識建構機制以及改善教學實踐以最大化認知參與至關重要,然而大規模手動編碼課堂話語仍然是極具勞動密集型的工作。我們提出了一個自動話語分析系統(ADAS),該系統沿著兩個互補維度共同分類教師和學生的發言:發言類型和推理組件,這些都是源自我們之前的CDAT框架。為了解決少數類別之間的標籤不平衡問題,我們(1)對標註語料庫進行分層重分割,(2)應用基於LLM的針對少數類別的合成數據增強,以及(3)訓練一個雙探頭的RoBERTa-base分類器。一個零-shot的GPT-5.4基準在UT上達到0.467的宏F1,在RC上達到0.476,為僅依賴提示的方法建立了有意義的上限,激勵進行微調。除了分類,我們還進行了話語模式分析,包括UTxRC共現分析、每個會話的認知複雜性指數(CCI)計算、滯後序列分析和IRF鏈分析,揭示教師的問題反饋(Fq)行為是學生推理(SR-I)最一致的前因。我們的結果顯示,基於LLM的增強顯著改善了UT少數類別的識別,而RC任務的結構簡單性使其即使對於詞彙基線也變得可行。
GRISP: Guided Recurrent IRI Selection over SPARQL Skeletons
2604.21133v1 by Sebastian Walter, Hannah Bast
We present GRISP (Guided Recurrent IRI Selection over SPARQL Skeletons), a novel SPARQL-based question-answering method over knowledge graphs based on fine-tuning a small language model (SLM). Given a natural-language question, the method first uses the SLM to generate a natural-language SPARQL query skeleton, and then to re-rank and select knowledge graph items to iteratively replace the natural-language placeholders using knowledge graph constraints. The SLM is jointly trained on skeleton generation and list-wise re-ranking data generated from standard question-query pairs. We evaluate the method on common Wikidata and Freebase benchmarks, and achieve better results than other state-of-the-art methods in a comparable setting.
摘要:我們提出了 GRISP(基於 SPARQL 骨架的引導式循環 IRI 選擇),這是一種基於 SPARQL 的知識圖譜問答方法,通過微調小型語言模型(SLM)來實現。給定一個自然語言問題,該方法首先使用 SLM 生成一個自然語言的 SPARQL 查詢骨架,然後重新排序並選擇知識圖譜項目,迭代地用知識圖譜約束來替換自然語言佔位符。SLM 在骨架生成和從標準問題-查詢對生成的列表重排序數據上進行聯合訓練。我們在常見的 Wikidata 和 Freebase 基準上評估該方法,並在可比較的設置中取得比其他最先進方法更好的結果。
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
2604.21106v1 by Kristian Schwethelm, Daniel Rueckert, Georgios Kaissis
We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our $\varphi$ converts the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise $\varphi$ above $0.46$.
摘要:我們測量一個額外的重複對於一個循環(深度重複)語言模型的價值,以等效的唯一參數來表示。從116次預訓練運行中對重複次數$r \in {1, 2, 4, 8}$進行的等深度掃描,涵蓋了約50倍的訓練計算,我們擬合了一個聯合縮放法則$L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-α} + B\,D^{-β}$,並在$R^2 = 0.997$時恢復了一個新的重複等價指數$\varphi = 0.46$。直觀來看,$\varphi$告訴我們,循環一個區塊$r$次在驗證損失上是否等同於$r$個非循環模型的唯一區塊(完全等價,$\varphi{=}1$)或是重複運行單個區塊而不增加容量($\varphi{=}0$)。我們的$\varphi = 0.46$位於兩者之間,因此每增加一次重複可預測地增加在匹配訓練計算下的驗證損失。例如,在$r{=}4$時,一個410M的循環模型在性能上與一個580M的非循環模型相當,但支付的是一個1B非循環模型的訓練成本。在五個維度的下游評估中,這一差距在參數知識任務中持續存在,而在簡單的開卷任務中則縮小,而推理任務在我們的計算預算下無法解決。對於任何循環語言模型,我們的$\varphi$將$r$的設計選擇轉換為可預測的驗證損失成本,未來的訓練配方和架構可以通過它們將$\varphi$提高到$0.46$以上的程度來進行比較。
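To make the role of $\varphi$ concrete, the sketch below evaluates the parameter term $N_\text{once} + r^{\varphi} N_\text{rec}$ and the full fitted form. Only $\varphi = 0.46$ comes from the abstract; the constants $E$, $A$, $B$, $\alpha$, $\beta$ are not given there and must be supplied by the caller, and the 100M/300M split in the example is hypothetical.

```python
def effective_params(n_once: float, n_rec: float, r: int, phi: float = 0.46) -> float:
    """N_once + r**phi * N_rec, the parameter term inside the scaling law."""
    return n_once + (r ** phi) * n_rec


def predicted_loss(n_once, n_rec, r, d, *, E, A, B, alpha, beta, phi=0.46):
    """L = E + A * N_eff**(-alpha) + B * D**(-beta).

    E, A, B, alpha, beta are fit constants that the abstract does not
    report; any values passed here are the caller's assumption.
    """
    n_eff = effective_params(n_once, n_rec, r, phi)
    return E + A * n_eff ** (-alpha) + B * d ** (-beta)


# Hypothetical split: 100M parameters used once, 300M in the looped block.
full_equivalence = effective_params(100e6, 300e6, r=4, phi=1.0)  # loop acts like 4 unique blocks
no_gain = effective_params(100e6, 300e6, r=4, phi=0.0)           # loop adds no capacity
fitted = effective_params(100e6, 300e6, r=4)                     # phi = 0.46
multiplier_at_r4 = 4 ** 0.46                                     # roughly 1.89
```

At $r = 4$ the multiplier $4^{0.46}$ is roughly 1.89, so under the fitted law the looped block counts as a little under two unique copies of itself rather than four.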
Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery
2604.21102v1 by Siyuan Yao, Siavash Ghorbany, Kuangshi Ai, Arnav Cherukuthota, Meghan Forstchen, Alexis Korotasz, Matthew Sisk, Ming Hu, Chaoli Wang
We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.
摘要:我們提出了一個新穎的框架,利用大型語言模型(LLMs)和 Google 街景圖像,自動評估美國全國的建築條件。通過在一個適度的人類標記數據集上微調 Gemma 3 27B,我們的方法在與人類平均意見分數(MOS)的對齊上取得了良好的效果,甚至在 SRCC 和 PLCC 上超越了個別評估者,相較於 MOS 基準。為了提高效率,我們應用知識蒸餾,將 Gemma 3 27B 的能力轉移到一個較小的 Gemma 3 4B 模型,該模型在性能上達到了可比擬的效果,並實現了 3 倍的速度提升。此外,我們將知識蒸餾到一個基於 CNN 的模型(EfficientNetV2-M)和一個Transformer(SwinV2-B),在性能上接近,同時實現了 30 倍的速度增益。此外,我們通過人類與 AI 的對齊研究,調查 LLM 在評估廣泛的建成環境和住房屬性方面的能力,並開發了一個可視化儀表板,整合 LLM 評估結果以供房主進行後續分析。我們的框架提供了一個靈活且高效的大規模建築條件評估解決方案,實現高準確度且僅需最少的人類標記工作。
TRACES: Tagging Reasoning Steps for Adaptive Cost-Efficient Early-Stopping
2604.21057v1 by Yannis Belkhiter, Seshu Tirupathi, Giulio Zizzo, John D. Kelleher
The field of Language Reasoning Models (LRMs) has been very active over the past few years, with advances in training and inference techniques enabling LRMs to reason longer and more accurately. However, a growing body of studies shows that LRMs are still inefficient, over-generating verification and reflection steps. Additionally, the high-level role of each reasoning step, and how different step types contribute to the generation of correct answers, remain largely underexplored. To address this challenge, we introduce TRACES (Tagging of the Reasoning steps enabling Adaptive Cost-Efficient early-Stopping), a lightweight framework that tags reasoning steps in real time and enables adaptive, cost-efficient early stopping of large-language-model inference. Building on this framework, we monitor reasoning behaviors during inference and find that LRMs tend to shift their reasoning behavior after reaching a correct answer. We demonstrate that monitoring specific types of steps can produce effective, interpretable early-stopping criteria. We evaluate the TRACES framework on three mathematical reasoning benchmarks (MATH500, GSM8K, and AIME) and two knowledge and reasoning benchmarks (MMLU and GPQA). We achieve a 20 to 50% token reduction while maintaining accuracy comparable to standard generation.
摘要:語言推理模型(LRMs)領域在過去幾年中非常活躍,訓練和推理技術的進步使得LRMs能夠進行更長且更準確的推理。
然而,越來越多的研究顯示,LRMs仍然效率低下,過度生成驗證和反思步驟。
此外,每個推理步驟的高層次角色以及不同步驟類型如何促進正確答案的生成,仍然在很大程度上未被探討。
為了解決這一挑戰,我們介紹了TRACES(標記推理步驟以實現自適應成本效益的早期停止),這是一個輕量級框架,能夠實時標記推理步驟,並實現大型語言模型推理的自適應、成本效益的早期停止。
基於這一框架,我們在推理過程中監控推理行為,發現LRMs在達到正確答案後往往會改變其推理行為。
我們證明,監控特定類型的步驟可以產生有效的可解釋早期停止標準。
我們在三個數學推理基準上評估了TRACES框架,即MATH500、GSM8K、AIME,以及兩個知識和推理基準,即MMLU和GPQA。
我們在保持與標準生成相當的準確度的同時,實現了20%到50%的標記減少。
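One way a tag-based stopping rule of this kind might look is sketched below: once a step has been tagged as an answer, generation stops after a fixed number of consecutive verification or reflection steps. The tag names, the `patience` parameter, and the rule itself are assumptions made for illustration; the abstract does not specify the tag set or the stopping criteria TRACES uses.

```python
from typing import Iterable, Optional


def early_stop_index(
    tags: Iterable[str],
    patience: int = 2,
    redundant=frozenset({"verification", "reflection"}),
) -> Optional[int]:
    """Return the index of the step at which to stop, or None to run to the end.

    Stops after `patience` consecutive redundant steps that follow an
    answer step; any other step type resets the streak.
    """
    seen_answer = False
    streak = 0
    for i, tag in enumerate(tags):
        if tag == "answer":
            seen_answer, streak = True, 0
        elif seen_answer and tag in redundant:
            streak += 1
            if streak >= patience:
                return i
        else:
            streak = 0
    return None


# Hypothetical tagged trace: the answer appears at step 2, followed by
# repeated checking that adds tokens without changing the answer.
trace = ["plan", "compute", "answer", "verification", "reflection", "verification"]
stop_at = early_stop_index(trace)
no_answer = early_stop_index(["plan", "verification", "verification", "compute"])
```

Because the rule consumes tags one at a time, it can run alongside decoding and cut generation as soon as the condition is met.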
The Last Harness You'll Ever Build
2604.21003v1 by Haebin Seong, Li Yin, Haoran Zhang
AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge. \textbf{Each new task domain requires painstaking, expert-driven harness engineering}: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that automates this process. At the first level, the \textbf{Harness Evolution Loop} optimizes a worker agent's harness $\mathcal{H}$ for a single task: a Worker Agent $W_{\mathcal{H}}$ executes the task, an Evaluator Agent $V$ adversarially diagnoses failures and scores performance, and an Evolution Agent $E$ modifies the harness based on the full history of prior attempts. At the second level, the \textbf{Meta-Evolution Loop} optimizes the evolution protocol $Λ= (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ itself across diverse tasks, \textbf{learning a protocol $Λ^{(\text{best})}$ that enables rapid harness convergence on any new task -- so that adapting an agent to a novel domain requires no human harness engineering at all.} We formalize the correspondence to meta-learning and present both algorithms. The framework \textbf{shifts manual harness engineering into automated harness engineering}, and takes one step further -- \textbf{automating the design of the automation itself}.
摘要:AI 代理人越來越多地被部署在複雜的特定領域工作流程中——導航企業網頁應用程序,這些應用程序需要數十次點擊和填寫表單,協調跨越搜索、提取和綜合的多步研究管道,自動檢查不熟悉的代碼庫,以及處理需要細緻領域知識的客戶升級。
\textbf{每個新的任務領域都需要費心的專家驅動的工具開發}:設計使基礎模型有效的提示、工具、協調邏輯和評估標準。
我們提出了一個兩級框架來自動化這一過程。
在第一級,\textbf{工具演進循環}優化工作代理的工具 $\mathcal{H}$ 以完成單一任務:工作代理 $W_{\mathcal{H}}$ 執行任務,評估代理 $V$ 對失敗進行對抗性診斷並評分表現,演進代理 $E$ 根據先前嘗試的完整歷史修改工具。
在第二級,\textbf{元演進循環}優化演進協議 $Λ= (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ 本身,跨越多樣的任務,\textbf{學習一個協議 $Λ^{(\text{best})}$,使得在任何新任務上快速收斂工具變得可能——因此,將代理適應於新領域根本不需要人類的工具開發。}
我們將其形式化為元學習並展示兩種算法。
該框架\textbf{將手動工具開發轉變為自動化工具開發},並更進一步——\textbf{自動化自動化設計本身}。
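The first-level loop described above can be sketched as a plain control loop over three callables. Everything here is a hypothetical stand-in: the harness is a dictionary, and `run_worker`, `evaluate`, and `evolve` are placeholders for the Worker, Evaluator, and Evolution agents, which in the paper are LLM agents rather than simple functions. The second-level meta-evolution loop is not shown.

```python
from typing import Callable, List, Tuple

Harness = dict
History = List[Tuple[Harness, float, str]]


def evolve_harness(
    harness: Harness,
    run_worker: Callable[[Harness], str],
    evaluate: Callable[[str], Tuple[float, str]],
    evolve: Callable[[Harness, History], Harness],
    max_rounds: int = 5,
    target: float = 1.0,
) -> Tuple[Harness, float]:
    """Run worker -> evaluator -> evolution agent until the target score is met."""
    history: History = []
    best, best_score = harness, float("-inf")
    for _ in range(max_rounds):
        output = run_worker(harness)             # worker executes the task
        score, diagnosis = evaluate(output)      # evaluator scores and diagnoses
        history.append((harness, score, diagnosis))
        if score > best_score:
            best, best_score = harness, score
        if score >= target:
            break
        harness = evolve(harness, history)       # evolution agent sees the full history
    return best, best_score


# Toy stand-ins: the "task" succeeds once the harness carries three hints.
best, score = evolve_harness(
    {"hints": 0},
    run_worker=lambda h: str(h["hints"]),
    evaluate=lambda out: (min(int(out) / 3, 1.0), f"{out} of 3 hints present"),
    evolve=lambda h, hist: {"hints": h["hints"] + 1},
)
```

Passing the whole history to `evolve`, rather than only the latest diagnosis, mirrors the abstract's statement that the Evolution Agent modifies the harness based on the full history of prior attempts.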
FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels
2604.20825v1 by Sina Gholami, Abdulmoneam Ali, Tania Haghighi, Ahmed Arafa, Minhaj Nur Alam
Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise. Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.
摘要:聯邦學習(FL)使得在不共享原始數據的情況下進行協作模型訓練成為可能;然而,分佈式客戶端中存在的噪聲標籤可能會嚴重降低學習性能。在本文中,我們提出了 FedSIR,一個針對噪聲標籤的穩健 FL 的多階段框架。與主要依賴設計抗噪損失函數或在訓練過程中利用損失動態的現有方法不同,我們的方法利用客戶端特徵表示的光譜結構來識別和減輕標籤噪聲。
我們的框架由三個關鍵組件組成。首先,我們通過分析類別特徵子空間的光譜一致性來識別乾淨和噪聲客戶端,並且通信開銷最小。其次,乾淨客戶端提供光譜參考,使得噪聲客戶端能夠使用主導類別方向和殘餘子空間重新標記潛在損壞的樣本。第三,我們採用一種噪聲感知的訓練策略,集成了對數調整損失、知識蒸餾和距離感知聚合,以進一步穩定聯邦優化。在標準 FL 基準上的大量實驗表明,FedSIR 在處理帶有噪聲標籤的 FL 時始終優於最先進的方法。代碼可在 https://github.com/sinagh72/FedSIR 獲得。
Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems
2604.20795v1 by Pavel Salovskii, Iuliia Gorshkova
This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.
摘要:這篇論文提出了一種混合架構,用於智能系統,其中大型語言模型(LLMs)擴展了外部本體記憶層。
該方法不僅依賴於參數知識和基於向量的檢索(RAG),而是構建並維護一個使用RDF/OWL表示的結構化知識圖譜,從而實現持久、可驗證和語義基礎的推理。
核心貢獻是一個自動化的本體構建管道,來自異質數據源,包括文檔、API和對話日誌。
系統執行實體識別、關係提取、標準化和三元組生成,然後使用SHACL和OWL約束進行驗證,並持續更新圖譜。
在推理過程中,LLMs在一個結合的上下文中運行,該上下文整合了基於向量的檢索、基於圖譜的推理和外部工具互動。
對於規劃任務的實驗觀察,包括河內塔基準,表明本體增強在多步推理場景中相較於基線LLM系統提高了性能。
此外,本體層使生成輸出的正式驗證成為可能,將系統轉變為生成-驗證-修正管道。
所提出的架構解決了當前基於LLM的系統的關鍵限制,包括缺乏長期記憶、結構理解薄弱和推理能力有限。
它為構建基於代理的系統、機器人應用和需要持久知識、可解釋性和可靠決策的企業AI解決方案提供了基礎。
RespondeoQA: a Benchmark for Bilingual Latin-English Question Answering
2604.20738v1 by Marisa Hudspeth, Patrick J. Burns, Brendan O'Connor
We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks ranging from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models -- LLaMa 3, Qwen QwQ, and OpenAI's o3-mini -- finding that all perform worse on skill-oriented questions. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, but LLaMa 3 and o3-mini are more task-dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be easily adapted for other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA
摘要:我們介紹了一個針對雙語拉丁語和英語環境的問題回答和翻譯基準數據集,包含約7,800對問題和答案。這些問題來自拉丁語教學資源,包括考試、問答競賽風格的知識問答以及從19世紀到現在的教科書。在自動提取、清理和人工審查後,該數據集涵蓋了多樣的問題類型:知識和技能基礎的、多步推理的、受限翻譯的以及混合語言對。據我們所知,這是首個以拉丁語為中心的QA基準。作為案例研究,我們評估了三個大型語言模型——LLaMa 3、Qwen QwQ和OpenAI的o3-mini——發現它們在技能導向問題上的表現都較差。儘管推理模型在韻律和文學手法任務上表現更好,但整體改進有限。QwQ在用拉丁語提問的問題上表現稍好,但LLaMa3和o3-mini則更依賴於任務。這個數據集為評估模型在專門的語言和文化領域的能力提供了一個新資源,且創建過程可以輕鬆適應其他語言。該數據集可在以下網址獲得:https://github.com/slanglab/RespondeoQA
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
2604.20720v1 by Noah Flynn
Large language models (LLMs) often exhibit performance disparities across languages, with naive multilingual fine-tuning frequently degrading performance due to negative cross-lingual interference. To address this, we introduce COMPASS (COntinual Multilingual PEFT with Adaptive Semantic Sampling), a novel data-centric framework for adapting LLMs to target languages. COMPASS leverages parameter-efficient fine-tuning (PEFT) by training lightweight, language-specific adapters on a judiciously selected subset of auxiliary multilingual data. The core of our method is a distribution-aware sampling strategy that uses multilingual embeddings and clustering to identify semantic gaps between existing training data and a target usage distribution. By prioritizing auxiliary data from under-represented semantic clusters, COMPASS maximizes positive cross-lingual transfer while minimizing interference. We extend this into a continual learning framework, COMPASS-ECDA, which monitors for data distribution shifts in production and dynamically updates adapters to prevent model staleness, balancing adaptation to new data with the preservation of existing knowledge. Across three different model architectures (Phi-4-Mini, Llama-3.1-8B, and Qwen2.5-7B) and multiple challenging multilingual benchmarks (Global-MMLU, MMLU-ProX), including unseen long-context tasks (OneRuler), we demonstrate that COMPASS consistently outperforms baseline methods guided by linguistic similarity, providing an effective, efficient, and sustainable solution for developing and maintaining high-performing multilingual models in dynamic environments.
摘要:大型語言模型(LLMs)在不同語言之間經常表現出性能差異,天真的多語言微調常常因為負面的跨語言干擾而降低性能。為了解決這個問題,我們介紹了COMPASS(COntinual Multilingual PEFT with Adaptive Semantic Sampling),這是一個針對目標語言調整LLMs的新型數據中心框架。COMPASS通過在精心選擇的輔助多語言數據子集上訓練輕量級的語言特定適配器,利用參數高效微調(PEFT)。我們方法的核心是一種分佈感知的採樣策略,利用多語言嵌入和聚類來識別現有訓練數據與目標使用分佈之間的語義差距。通過優先考慮來自代表性不足的語義聚類的輔助數據,COMPASS在最大化正向跨語言轉移的同時最小化干擾。我們將此擴展為一個持續學習框架COMPASS-ECDA,該框架監控生產中的數據分佈變化,並動態更新適配器以防止模型過時,平衡對新數據的適應與現有知識的保留。在三種不同的模型架構(Phi-4-Mini、Llama-3.1-8B和Qwen2.5-7B)和多個具有挑戰性的多語言基準(Global-MMLU、MMLU-ProX)中,包括未見的長上下文任務(OneRuler),我們證明COMPASS始終優於基於語言相似性的基線方法,提供了一個有效、高效且可持續的解決方案,用於在動態環境中開發和維護高性能的多語言模型。
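The idea of prioritizing auxiliary data from under-represented semantic clusters can be sketched as follows. The sketch assumes each example has already been assigned a cluster ID (for instance by clustering multilingual embeddings, which is not shown), and it weights each cluster by how far its share of the training data falls short of its share of the target usage distribution. The function names and this particular gap formula are assumptions for illustration, not the strategy as defined in the paper.

```python
from collections import Counter


def cluster_shares(cluster_ids):
    """Fraction of examples falling in each cluster (input must be non-empty)."""
    counts = Counter(cluster_ids)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def gap_weights(train_clusters, target_clusters):
    """Sampling weight per cluster, proportional to its shortfall in training data."""
    train = cluster_shares(train_clusters)
    target = cluster_shares(target_clusters)
    # Shortfall: how much more often the cluster occurs in target usage
    # than in the existing training data (never negative).
    gaps = {c: max(share - train.get(c, 0.0), 0.0) for c, share in target.items()}
    total = sum(gaps.values())
    if total == 0:
        return {c: 0.0 for c in gaps}  # training data already covers the target
    return {c: g / total for c, g in gaps.items()}


# Hypothetical cluster assignments: cluster 0 dominates training data,
# while target usage leans towards clusters 1 and 2.
train_ids = [0, 0, 0, 1]
target_ids = [0, 1, 1, 2]
weights = gap_weights(train_ids, target_ids)
```

Auxiliary examples would then be drawn from each cluster in proportion to its weight, so well-covered clusters receive no extra data.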
Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization
2604.20714v1 by Shan He, Runze Wang, Zhuoyun Du, Huiyu Bai, Zouying Cao, Yu Cheng, Bo Zheng
Designing and optimizing multi-agent systems (MAS) is a complex, labor-intensive process of "Agent Engineering." Existing automatic optimization methods, primarily focused on flat prompt tuning, lack the structural awareness to debug the intricate web of interactions in MAS. More critically, these optimizers are static; they do not learn from experience to improve their own optimization strategies. To address these gaps, we introduce Textual Parameter Graph Optimization (TPGO), a framework that enables a multi-agent system to learn to evolve. TPGO first models the MAS as a Textual Parameter Graph (TPG), where agents, tools, and workflows are modular, optimizable nodes. To guide evolution, we derive "textual gradients," structured natural language feedback from execution traces, to pinpoint failures and suggest granular modifications. The core of our framework is Group Relative Agent Optimization (GRAO), a novel meta-learning strategy that learns from historical optimization experiences. By analyzing past successes and failures, GRAO becomes progressively better at proposing effective updates, allowing the system to learn how to optimize itself. Extensive experiments on complex benchmarks like GAIA and MCP-Universe show that TPGO significantly enhances the performance of state-of-the-art agent frameworks, achieving higher success rates through automated, self-improving optimization.
摘要:設計和優化多代理系統(MAS)是一個複雜且勞動密集的“代理工程”過程。現有的自動優化方法主要集中於平面提示調整,缺乏對MAS中錯綜複雜的交互網絡進行調試的結構性認識。更關鍵的是,這些優化器是靜態的;它們不會從經驗中學習以改善自身的優化策略。為了解決這些問題,我們引入了文本參數圖優化(TPGO),這是一個使多代理系統能夠學習進化的框架。TPGO首先將MAS建模為文本參數圖(TPG),其中代理、工具和工作流程是模塊化的、可優化的節點。為了指導進化,我們從執行痕跡中推導出“文本梯度”,這是結構化的自然語言反饋,用於確定故障並建議細微的修改。我們框架的核心是群體相對代理優化(GRAO),這是一種新穎的元學習策略,能夠從歷史優化經驗中學習。通過分析過去的成功和失敗,GRAO在提出有效更新方面變得越來越好,使系統能夠學習如何自我優化。在GAIA和MCP-Universe等複雜基準上進行的廣泛實驗顯示,TPGO顯著提高了最先進代理框架的性能,通過自動化、自我改善的優化實現了更高的成功率。
StormNet: Improving storm surge predictions with a GNN-based spatio-temporal offset forecasting model
2604.20688v2 by Noujoud Nader, Stefanos Giaremis, Clint Dawson, Carola Kaiser, Karame Mohammadiporshokooh, Hartmut Kaiser
Storm surge forecasting remains a critical challenge in mitigating the impacts of tropical cyclones on coastal regions, particularly given recent trends of rapid intensification and increasing nearshore storm activity. Traditional high-fidelity numerical models such as ADCIRC, while robust, are often hindered by inevitable uncertainties arising from various sources. To address these challenges, this study introduces StormNet, a spatio-temporal graph neural network (GNN) designed for bias correction of storm surge forecasts. StormNet integrates graph convolutional (GCN) and graph attention (GAT) mechanisms with long short-term memory (LSTM) components to capture complex spatial and temporal dependencies among water-level gauge stations. The model was trained using historical hurricane data from the U.S. Gulf Coast and evaluated on Hurricane Idalia (2023). Results demonstrate that StormNet can effectively reduce the root mean square error (RMSE) in water-level predictions by more than 70\% for 48-hour forecasts and by more than 50\% for 72-hour forecasts, and that it outperforms a sequential LSTM baseline, particularly for longer prediction horizons. The model also exhibits low training time, enhancing its applicability in real-time operational forecasting systems. Overall, StormNet provides a computationally efficient and physically meaningful framework for improving storm surge prediction accuracy and reliability during extreme weather events.
摘要:暴潮預測仍然是減輕熱帶氣旋對沿海地區影響的一個關鍵挑戰,特別是考慮到近期快速增強和近岸風暴活動增加的趨勢。傳統的高保真數值模型如ADCIRC,雖然穩健,但常常受到來自各種來源的不可避免的不確定性所阻礙。為了應對這些挑戰,本研究介紹了StormNet,一種為暴潮預測進行偏差修正的時空圖神經網絡(GNN)。StormNet將圖卷積(GCN)和圖注意力(GAT)機制與長短期記憶(LSTM)元件整合在一起,以捕捉水位計站之間複雜的空間和時間依賴性。該模型使用美國墨西哥灣沿岸的歷史颶風數據進行訓練,並在颶風Idalia(2023)上進行評估。結果顯示,StormNet能夠有效地將水位預測的均方根誤差(RMSE)在48小時預測中減少超過70\%,在72小時預測中減少超過50\%,並且在較長的預測範圍內表現優於序列LSTM基準。該模型還展現出低訓練時間,增強了其在實時運營預測系統中的應用性。總體而言,StormNet提供了一個計算效率高且物理意義明確的框架,用於提高極端天氣事件期間暴潮預測的準確性和可靠性。
ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation
2604.20666v1 by Ioannis E. Livieris, Athanasios Koursaris, Alexandra Apostolopoulou, Konstantinos Kanaris, Dimitris Tsakalidis, George Domalis
Effective retrieval-augmented generation across bilingual Greek-English applications requires embedding models capable of capturing both domain-specific semantic relationships and cross-lingual semantic alignment. Existing multilingual embedding models distribute their representational capacity across numerous languages, limiting their optimization for Greek and failing to encode the morphological complexity and domain-specific terminological structures inherent in Greek text. In this work, we propose ORPHEAS, a specialized Greek-English embedding model for bilingual retrieval-augmented generation. ORPHEAS is trained on a high-quality dataset generated by a knowledge graph-based fine-tuning methodology applied to a diverse multi-domain corpus, which enables language-agnostic semantic representations. Numerical experiments across monolingual and cross-lingual retrieval benchmarks reveal that ORPHEAS outperforms state-of-the-art multilingual embedding models, demonstrating that domain-specialized fine-tuning on morphologically complex languages does not compromise cross-lingual retrieval capability.
摘要:有效的檢索增強生成在雙語希臘語--英語應用中需要嵌入模型,能夠捕捉領域特定的語義關係和跨語言的語義對齊。現有的多語言嵌入模型將其表徵能力分配到多種語言上,限制了其對希臘語的優化,並未能編碼希臘文本中固有的形態複雜性和領域特定的術語結構。在這項工作中,我們提出了ORPHEAS,一個專門針對雙語檢索增強生成的希臘語--英語嵌入模型。ORPHEAS使用由知識圖譜基礎的微調方法生成的高質量數據集進行訓練,該方法應用於多樣的多領域語料庫,從而實現語言無關的語義表徵。在單語和跨語言檢索基準上的數值實驗顯示,ORPHEAS超越了最先進的多語言嵌入模型,證明了對形態複雜語言進行領域專門的微調不會妨礙跨語言檢索能力。
The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
2604.20665v1 by Karan Goyal, Dikshant Kukreja
The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of "multimodal gain". By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.
摘要:快速增長的視覺-語言模型(VLMs)被廣泛讚譽為統一多模態知識發現的曙光,但其基礎運作於一個危險且未經質疑的公理:當前的VLMs忠實地綜合多模態數據。我們認為事實並非如此。相反,主導的視覺編碼-投影-LLM範式下潛藏著深刻的可信度危機。最先進的模型經常表現出功能盲目,即利用強大的語言先驗來繞過嚴重的視覺表示瓶頸,而不是從視覺輸入中提取有根據的知識。在這項工作中,我們挑戰傳統的多模態評估方法,該方法依賴於數據消融或新數據集的創建,因此致命地將數據集偏見與架構無能混淆。我們提出了一種激進的信息理論出發點:模態翻譯協議,旨在量化揭示觀看的代價。通過翻譯語義負載而不是消融它們,我們制定了三個新指標——觀看的代價(ToS)、詛咒(CoS)和謬誤(FoS),最終形成語義充分性標準(SSC)。此外,我們提出了一個挑釁性的多模態擴展發散法則,假設隨著基礎語言引擎擴展到前所未有的推理能力,視覺知識瓶頸的數學懲罰反而會增加。我們挑戰KDD社群放棄對“多模態增益”的虛幻追求。通過將SSC從被動診斷約束提升為主動架構藍圖,我們提供了強而有力、值得信賴的基礎,迫使下一代AI系統真正看見數據,實現真正的多模態推理。
RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking
2604.20623v1 by Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin
Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.
摘要:傳統的變化檢測識別變化發生的位置,但並不解釋自然語言中發生了什麼變化。現有的遙感變化標題數據集通常描述整體圖像級別的差異,幾乎沒有探索細粒度的局部語義推理。為了填補這一空白,我們提出了RSRCC,這是一個新的遙感變化問答基準,包含126,000個問題,分為87,000個訓練、17,100個驗證和22,000個測試實例。與之前的數據集不同,RSRCC圍繞著局部的、特定於變化的問題構建,這些問題需要對特定的語義變化進行推理。據我們所知,這是第一個專門為這種細粒度推理基礎的監督設計的遙感變化問答基準。為了構建RSRCC,我們引入了一個分層的半監督策劃管道,使用Best-of-N排名作為關鍵的最終模糊解決階段。首先,從語義分割掩膜中提取候選變化區域,然後使用圖像-文本嵌入模型進行初步篩選,最後通過增強檢索的視覺-語言策劃進行驗證,並使用Best-of-N排名。這一過程使得在保留語義上有意義的變化的同時,能夠對嘈雜和模糊的候選進行可擴展的過濾。該數據集可在 https://huggingface.co/datasets/google/RSRCC 獲得。
Self-Aware Vector Embeddings for Retrieval-Augmented Generation: A Neuroscience-Inspired Framework for Temporal, Confidence-Weighted, and Relational Knowledge
2604.20598v1 by Naizhong Xu
Modern retrieval-augmented generation (RAG) systems treat vector embeddings as static, context-free artifacts: an embedding has no notion of when it was created, how trustworthy its source is, or which other embeddings depend on it. This flattening of knowledge has a measurable cost: recent work on VersionRAG reports that conventional RAG achieves only 58% accuracy on versioned technical queries, because retrieval returns semantically similar but temporally invalid content. We propose SmartVector, a framework that augments dense embeddings with three explicit properties -- temporal awareness, confidence decay, and relational awareness -- and a five-stage lifecycle modeled on hippocampal-neocortical memory consolidation. A retrieval pipeline replaces pure cosine similarity with a four-signal score that mixes semantic relevance, temporal validity, live confidence, and graph-relational importance. A background consolidation agent detects contradictions, builds dependency edges, and propagates updates along those edges as graph-neural-network-style messages. Confidence is governed by a closed-form function combining an Ebbinghaus-style exponential decay, user-feedback reconsolidation, and logarithmic access reinforcement. We formalize the model, relate it to temporal knowledge graph embedding, agentic memory architectures, and uncertainty-aware RAG, and present a reference implementation. On a reproducible synthetic versioned-policy benchmark of 258 vectors and 138 queries, SmartVector roughly doubles top-1 accuracy over plain cosine RAG (62.0% vs. 31.0% on a held-out split), drops stale-answer rate from 35.0% to 13.3%, cuts Expected Calibration Error by nearly 2x (0.244 vs. 0.470), reduces re-embedding cost per single-word edit by 77%, and is robust across contradiction-injection rates from 0% to 75%.
摘要:現代檢索增強生成(RAG)系統將向量嵌入視為靜態的、無上下文的產物:嵌入沒有創建時間的概念、來源的可信度或依賴於它的其他嵌入。這種知識的扁平化帶來了可衡量的成本:最近關於VersionRAG的研究報告指出,傳統的RAG在版本化技術查詢上的準確率僅為58%,因為檢索返回的是語義上相似但時間上無效的內容。我們提出了SmartVector,一個增強密集嵌入的框架,具備三個明確的特性——時間意識、信心衰減和關聯意識——以及一個基於海馬體-新皮層記憶鞏固的五階段生命週期。檢索管道用四信號得分取代純粹的餘弦相似度,該得分混合了語義相關性、時間有效性、即時信心和圖形關聯重要性。背景鞏固代理檢測矛盾,建立依賴邊,並沿著這些邊傳播更新,類似於圖神經網絡的消息。信心由一個封閉形式的函數控制,該函數結合了Ebbinghaus風格的指數衰減、用戶反饋再鞏固和對數訪問增強。我們對模型進行了形式化,將其與時間知識圖嵌入、代理記憶架構和不確定性感知的RAG相關聯,並提出了一個參考實現。在一個可重現的合成版本化政策基準測試中,包含258個向量和138個查詢,SmartVector的top-1準確率大約是純餘弦RAG的兩倍(62.0%對31.0%在保留的分割上),過時答案率從35.0%降至13.3%,期望校準誤差減少近2倍(0.244對0.470),每次單詞編輯的重新嵌入成本降低77%,並且在0%到75%的矛盾注入率下都表現穩健。
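To make the scoring idea in the SmartVector abstract more concrete, here is a minimal sketch of a confidence value that combines exponential decay, a feedback term, and a logarithmic access bonus, followed by a four-signal retrieval score. The abstract does not give the paper's closed-form function, so the function names, the half-life, the gain, the weights, and the exact way the terms are combined are all assumptions made for illustration.

```python
import math


def confidence(base, age_days, feedback=0.0, accesses=0,
               half_life_days=30.0, access_gain=0.05):
    """Hypothetical confidence in [0, 1]; not the paper's actual formula.

    base        -- confidence assigned when the vector was created
    age_days    -- time since creation; drives an exponential decay
    feedback    -- signed adjustment from user feedback (assumed additive)
    accesses    -- number of past retrievals; rewarded logarithmically
    """
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    boost = feedback + access_gain * math.log1p(accesses)
    return max(0.0, min(1.0, base * decay + boost))


def retrieval_score(semantic, temporal, conf, relational,
                    weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted mix of the four signals named in the abstract (weights assumed)."""
    w_s, w_t, w_c, w_r = weights
    return w_s * semantic + w_t * temporal + w_c * conf + w_r * relational


# A stale but semantically closer vector versus a fresh, temporally valid one.
stale = retrieval_score(0.9, 0.0, confidence(0.9, age_days=365), 0.1)
fresh = retrieval_score(0.8, 1.0, confidence(0.9, age_days=1), 0.1)
```

With these assumed weights the fresh vector outranks the stale one even though its cosine similarity is lower, which is the behavior the abstract attributes to replacing pure cosine ranking.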
Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents
2604.20572v1 by Yuxuan Cai, Jie Zhou, Qin Chen, Liang He
Online lifelong learning enables agents to accumulate experience across interactions and continually improve on long-horizon tasks. However, existing methods typically treat retrieval from past experience as a passive operation, triggering it only at task initialization or after completing a step. Consequently, agents often fail to identify knowledge gaps during interaction and proactively retrieve the most useful experience for the current decision. To address this limitation, we present ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured experience base. We first introduce Experience-Enhanced Online Evolution (ExpOnEvo), which enables continual improvement through both policy updates and memory refinement. The experience base organizes historical interactions into typed repositories, including factual memory, episodic memory, and behavioral skills, so that retrieval can provide both relevant evidence and actionable guidance. On top of this, we propose Proactive Reinforcement Learning-based Retrieval (ProactRL), which models retrieval as an explicit policy action and learns when and what to retrieve via paired-branch process rewards. By comparing continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level supervision for retrieval decisions, encouraging retrieval only when it leads to better task outcomes or higher efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently improves lifelong agent performance, achieving success rates of 73.50% on SciWorld and 71.28% on AlfWorld while substantially reducing retrieval overhead, and attains performance competitive with proprietary models on StuLife.
摘要:在線終身學習使代理能夠在互動中積累經驗,並持續改善長期任務。
然而,現有的方法通常將從過去經驗中檢索視為一種被動操作,僅在任務初始化或完成一步後才觸發。
因此,代理在互動過程中往往無法識別知識空白,並主動檢索對當前決策最有用的經驗。
為了解決這一限制,我們提出了ProactAgent,一個基於經驗的終身學習框架,用於在結構化經驗庫上進行主動檢索。
我們首先介紹經驗增強在線演化(ExpOnEvo),它通過政策更新和記憶精煉實現持續改進。
經驗庫將歷史互動組織成類型化的庫,包括事實記憶、情節記憶和行為技能,以便檢索能夠提供相關證據和可行指導。
在此基礎上,我們提出了基於主動強化學習的檢索(ProactRL),它將檢索建模為明確的政策行動,並通過配對分支過程獎勵學習何時以及檢索什麼。
通過比較相同互動前綴的延續,有無檢索,ProactRL為檢索決策提供了逐步監督,僅在檢索能導致更好的任務結果或更高效率時鼓勵檢索。
在SciWorld、AlfWorld和StuLife上的實驗顯示,ProactAgent持續改善終身代理的表現,在SciWorld上達到73.50\%的成功率,在AlfWorld上達到71.28\%,同時顯著減少檢索開銷,並在StuLife上達到與專有模型競爭的性能。
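The paired-branch comparison described in the ProactAgent abstract can be sketched as a step-level reward: run the same interaction prefix with and without retrieval, then credit the retrieval action with the difference in outcome minus a cost. The reward form, the `retrieval_cost` value, and both function names below are hypothetical; the abstract does not state the exact reward used by ProactRL.

```python
def paired_branch_reward(return_with, return_without, retrieval_cost=0.05):
    """Reward for choosing to retrieve at one step (assumed form).

    return_with    -- task return of the continuation that retrieved
    return_without -- task return of the continuation from the same prefix
                      that did not retrieve
    retrieval_cost -- assumed fixed penalty standing in for retrieval overhead
    """
    return (return_with - return_without) - retrieval_cost


def label_retrieval_steps(branches, retrieval_cost=0.05):
    """Map each step to its paired-branch reward.

    branches is a list of (step, return_with, return_without) tuples.
    """
    return {
        step: paired_branch_reward(r_with, r_without, retrieval_cost)
        for step, r_with, r_without in branches
    }


# Step 0: retrieval turns a failure into a success. Step 1: it changes nothing.
rewards = label_retrieval_steps([(0, 1.0, 0.0), (1, 1.0, 1.0)])
```

Under this sketch, retrieval is rewarded at step 0 and penalized at step 1, so a policy trained on such signals would learn to retrieve only when it changes the outcome.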
LayerTracer: A Joint Task-Particle and Vulnerable-Layer Analysis framework for Arbitrary Large Language Model Architectures
2604.20556v1 by Yuhang Wu, Qinyuan Liu, Qiuyang Zhao, Qingwei Chong
Currently, Large Language Models (LLMs) feature a diversified architectural landscape, including traditional Transformer, GateDeltaNet, and Mamba. However, the evolutionary laws of hierarchical representations, task knowledge formation positions, and network robustness bottleneck mechanisms in various LLM architectures remain unclear, posing core challenges for hybrid architecture design and model optimization. This paper proposes LayerTracer, an architecture-agnostic end-to-end analysis framework compatible with any LLM architecture. By extracting hidden states layer-by-layer and mapping them to vocabulary probability distributions, it achieves joint analysis of task particle localization and layer vulnerability quantification. We define the task particle as the key layer where the target token probability first rises significantly, representing the model's task execution starting point, and the vulnerable layer is defined as the layer with the maximum Jensen-Shannon (JS) divergence between output distributions before and after mask perturbation, reflecting its sensitivity to disturbances. Experiments on models of different parameter scales show that task particles mainly appear in the deep layers of the model regardless of parameter size, while larger-parameter models exhibit stronger hierarchical robustness. LayerTracer provides a scientific basis for layer division, module ratio, and gating switching of hybrid architectures, effectively optimizing model performance. It accurately locates task-effective layers and stability bottlenecks, offering universal support for LLM structure design and interpretability research.
摘要:目前,大型語言模型(LLMs)具有多樣化的架構景觀,包括傳統的Transformer、GateDeltaNet和Mamba。然而,各種LLM架構中層次表示的演化法則、任務知識形成位置及網絡穩健性瓶頸機制仍不清晰,這對混合架構設計和模型優化構成了核心挑戰。本文提出了LayerTracer,一種與任何LLM架構兼容的架構無關的端到端分析框架。通過逐層提取隱藏狀態並將其映射到詞彙概率分佈,它實現了任務粒子定位和層脆弱性量化的聯合分析。我們將任務粒子定義為目標標記概率首次顯著上升的關鍵層,代表模型任務執行的起點,而脆弱層則定義為在掩碼擾動前後輸出分佈之間具有最大Jensen-Shannon (JS) 散度的層,反映其對擾動的敏感性。在不同參數規模的模型上進行的實驗顯示,無論參數大小,任務粒子主要出現在模型的深層,而較大參數的模型則表現出更強的層次穩健性。LayerTracer為混合架構的層劃分、模塊比例和閘切換提供了科學依據,有效優化模型性能。它準確定位任務有效層和穩定性瓶頸,為LLM結構設計和可解釋性研究提供了普遍支持。
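The two definitions in the LayerTracer abstract can be illustrated with a short sketch that operates on per-layer probabilities already obtained by mapping hidden states to the vocabulary. The `jump` threshold used to decide that a probability "rises significantly" is an assumption, as are the function names; the abstract does not specify how significance is measured.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def task_particle(target_probs, jump=0.2):
    """First layer whose target-token probability rises by at least `jump`
    over the previous layer, or None if no layer qualifies (threshold assumed)."""
    for i in range(1, len(target_probs)):
        if target_probs[i] - target_probs[i - 1] >= jump:
            return i
    return None


def vulnerable_layer(clean, perturbed):
    """Layer index with the largest JS divergence between the output
    distributions before and after perturbation."""
    divs = [js_divergence(p, q) for p, q in zip(clean, perturbed)]
    return max(range(len(divs)), key=divs.__getitem__)


layer = task_particle([0.01, 0.02, 0.03, 0.40, 0.80])
weakest = vulnerable_layer(
    [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9], [0.25, 0.75]],
)
```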
Enhancing Research Idea Generation through Combinatorial Innovation and Multi-Agent Iterative Search Strategies
2604.20548v1 by Shuai Chen, Chengzhi Zhang
Scientific progress depends on the continual generation of innovative research ideas. However, the rapid growth of scientific literature has greatly increased the cost of knowledge filtering, making it harder for researchers to identify novel directions. Although existing large language model (LLM)-based methods show promise in research idea generation, the ideas they produce are often repetitive and lack depth. To address this issue, this study proposes a multi-agent iterative planning search strategy inspired by combinatorial innovation theory. The framework combines iterative knowledge search with an LLM-based multi-agent system to generate, evaluate, and refine research ideas through repeated interaction, with the goal of improving idea diversity and novelty. Experiments in the natural language processing domain show that the proposed method outperforms state-of-the-art baselines in both diversity and novelty. Further comparison with ideas derived from top-tier machine learning conference papers indicates that the quality of the generated ideas falls between that of accepted and rejected papers. These results suggest that the proposed framework is a promising approach for supporting high-quality research idea generation. The source code and dataset used in this paper are publicly available in a GitHub repository: https://github.com/ChenShuai00/MAGenIdeas. The demo is available at https://huggingface.co/spaces/cshuai20/MAGenIdeas.
摘要:科學進步依賴於不斷產生創新的研究想法。
然而,科學文獻的快速增長大大提高了知識篩選的成本,使研究人員更難識別新穎的方向。
儘管現有的基於大型語言模型(LLM)的方法在研究想法生成方面顯示出潛力,但它們產生的想法往往重複且缺乏深度。
為了解決這一問題,本研究提出了一種受組合創新理論啟發的多代理迭代規劃搜索策略。
該框架結合了迭代知識搜索與基於LLM的多代理系統,通過重複互動生成、評估和重新精煉研究想法,旨在提高想法的多樣性和新穎性。
在自然語言處理領域的實驗顯示,所提出的方法在多樣性和新穎性方面均優於最先進的基準。
與來自頂級機器學習會議論文的想法進行進一步比較表明,生成的想法質量介於被接受和被拒絕的論文之間。
這些結果表明,所提出的框架是一種支持高質量研究想法生成的有前景的方法。
本文使用的源代碼和數據集已在Github庫上公開: https://github.com/ChenShuai00/MAGenIdeas。
演示可在 https://huggingface.co/spaces/cshuai20/MAGenIdeas 獲得。
Effects of Cross-lingual Evidence in Multilingual Medical Question Answering
2604.20531v1 by Anar Yeginbergen, Maite Oronoz, Rodrigo Agerri
This paper investigates Multilingual Medical Question Answering across high-resource (English, Spanish, French, Italian) and low-resource (Basque, Kazakh) languages. We evaluate three types of external evidence sources across models of varying size: curated repositories of specialized medical knowledge, web-retrieved content, and explanations from LLMs' parametric knowledge. Moreover, we conduct experiments with multilingual, monolingual, and cross-lingual retrieval. Our results demonstrate that larger models consistently achieve superior performance in English across baseline evaluations. When incorporating external knowledge, web-retrieved data in English proves most beneficial for high-resource languages. Conversely, for low-resource languages, the most effective strategy combines retrieval in both English and the target language, achieving accuracy comparable to high-resource language results. These findings challenge the assumption that external knowledge systematically improves performance and reveal that effective strategies depend on both the source of language resources and on model scale. Furthermore, specialized medical knowledge sources such as PubMed are limited: while they provide authoritative expert knowledge, they lack adequate multilingual coverage.
摘要:這篇論文探討了在高資源(英語、西班牙語、法語、意大利語)和低資源(巴斯克語、哈薩克語)語言中的多語言醫療問答。我們評估了三種類型的外部證據來源,涵蓋不同大小的模型:專門醫療知識的策展庫、網路檢索的內容,以及來自LLM的參數知識的解釋。此外,我們進行了多語言、單語言和跨語言檢索的實驗。我們的結果顯示,較大的模型在基線評估中在英語上始終能夠達到更好的性能。當納入外部知識時,英語的網路檢索數據對於高資源語言最為有利。相反,對於低資源語言,最有效的策略是結合英語和目標語言的檢索,達到與高資源語言結果相當的準確性。這些發現挑戰了外部知識系統性改善性能的假設,並揭示了有效策略依賴於語言資源的來源和模型規模。此外,像PubMed這樣的專門醫療知識來源是有限的:雖然它們提供權威的專家知識,但缺乏足夠的多語言覆蓋。
Knowledge Capsules: Structured Nonparametric Memory Units for LLMs
2604.20487v2 by Bin Ju, Shenfeng Weng, Danying Zhou, Rongkai Xu, Kunkai Su
Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long-context and multi-hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key-Value Injection (KVI) framework that compiles capsules into attention-compatible key-value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory-level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long-context and multi-hop reasoning, while requiring no parameter updates.
摘要:大型語言模型(LLMs)將知識編碼在參數權重中,這使得在不重新訓練的情況下更新或擴展變得成本高昂。檢索增強生成(RAG)通過將檢索到的文本附加到輸入中來減輕這一限制,但它僅通過上下文擴展運作,外部知識作為令牌在注意力機制中競爭。因此,其影響是間接的,並且往往不穩定,特別是在長上下文和多跳推理場景中。我們提出了知識膠囊,這是一種結構化的非參數記憶單元,代表標準化的關係知識,並且可以直接從文檔語料庫中使用凍結的基礎模型構建。與其將知識注入為文本,我們引入了一個外部鍵值注入(KVI)框架,將膠囊編譯成與注意力兼容的鍵值表示,從而使外部知識能夠直接參與模型的注意力計算。通過將知識整合從上下文級增強轉移到記憶級互動,所提出的框架在多個QA基準測試中始終優於RAG和GraphRAG,並在長上下文和多跳推理中提高了穩定性和準確性,同時不需要任何參數更新。
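The general idea of letting external key-value pairs take part in attention, rather than appending retrieved text to the prompt, can be sketched with plain scaled dot-product attention over a concatenated set of keys and values. This is only an illustration of that mechanism under simplifying assumptions (single head, single query, tiny vectors); how the paper compiles capsules into keys and values is not described in the abstract, and `attend_with_capsules` is a hypothetical name.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


def attend_with_capsules(query, ctx_keys, ctx_values, cap_keys, cap_values):
    """Attention over context keys/values plus externally supplied ones.

    The external pairs are simply concatenated to the context pairs, so they
    receive attention weight directly instead of entering as prompt tokens.
    """
    return attend(query, ctx_keys + cap_keys, ctx_values + cap_values)


out, w = attend_with_capsules(
    query=[1.0, 0.0],
    ctx_keys=[[0.0, 1.0]], ctx_values=[[0.0, 0.0]],
    cap_keys=[[4.0, 0.0]], cap_values=[[1.0, 1.0]],
)
```

In this toy case the external key aligns with the query, so most of the attention weight, and therefore most of the output, comes from the injected value.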
Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks
2604.20932v1 by Pranav Pallerla, Wilson Naik Bhukya, Bharath Vemula, Charan Ramtej Kodi
Retrieval-augmented generation (RAG) systems are increasingly deployed in sensitive domains such as healthcare and law, where they rely on private, domain-specific knowledge. This capability introduces significant security risks, including membership inference, data poisoning, and unintended content leakage. A straightforward mitigation is to enable all relevant defenses simultaneously, but doing so incurs a substantial utility cost. In our experiments, an always-on defense stack reduces contextual recall by more than 40%, indicating that retrieval degradation is the primary failure mode. To mitigate this trade-off in RAG systems, we propose the Sentinel-Strategist architecture, a context-aware framework for risk analysis and defense selection. A Sentinel detects anomalous retrieval behavior, after which a Strategist selectively deploys only the defenses warranted by the query context. Evaluated across three benchmark datasets and five orchestration models, Adaptive Defense Orchestration (ADO) is shown to eliminate MBA-style membership inference leakage while substantially recovering retrieval utility relative to a fully static defense stack, approaching undefended baseline levels. Under data poisoning, the strongest ADO variants reduce attack success to near zero while restoring contextual recall to more than 75% of the undefended baseline, although robustness remains sensitive to model choice. Overall, these findings show that adaptive, query-aware defense can substantially reduce the security-utility trade-off in RAG systems.
摘要:檢索增強生成(RAG)系統越來越多地應用於敏感領域,如醫療保健和法律,這些領域依賴於私有的、特定領域的知識。這一能力引入了重大安全風險,包括成員推斷、數據中毒和意外內容洩漏。直接的緩解方法是同時啟用所有相關防禦,但這樣做會產生可觀的效用成本。在我們的實驗中,始終開啟的防禦堆疊使上下文回憶降低了超過 40%,這表明檢索退化是主要的失敗模式。為了減輕 RAG 系統中的這一權衡,我們提出了 Sentinel-Strategist 架構,這是一個用於風險分析和防禦選擇的上下文感知框架。Sentinel 檢測異常的檢索行為,然後 Strategist 根據查詢上下文選擇性地部署僅必要的防禦。在三個基準數據集和五個協調模型中進行評估後,ADO 被證明能消除 MBA 風格的成員推斷洩漏,同時相對於完全靜態的防禦堆疊顯著恢復檢索效用,接近未防禦的基線水平。在數據中毒的情況下,最強的 ADO 變體將攻擊成功率降低到接近零,同時將上下文回憶恢復到超過 75% 的未防禦基線,儘管穩健性仍然對模型選擇敏感。總的來說,這些發現顯示,自適應的查詢感知防禦能顯著減少 RAG 系統中的安全與效用權衡。
HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
2604.20452v1 by Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu
Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
摘要:檢索增強生成(RAG)透過檢索外部文件作為上下文,擴展了大型語言模型(LLMs)在推理時的知識邊界。
然而,隨著知識數據庫的增長,檢索變得越來越耗時。
現有的加速策略要麼通過近似檢索妥協準確性,要麼通過重複使用完全相同查詢的結果來實現微小的增益。
我們提出了HaS,一種同源感知的推測檢索框架,該框架在受限範圍內執行低延遲的推測檢索,以獲取候選文件,然後驗證它們是否包含所需的知識。
該驗證基於查詢之間的同源關係,並被表述為一個同源查詢重新識別任務:一旦識別出先前觀察到的查詢為進來查詢的同源重遇,則草稿被視為可接受,允許系統跳過緩慢的全數據庫檢索。
受益於在現實世界流行模式下同源查詢的普遍性,HaS實現了顯著的效率提升。
大量實驗表明,HaS在數據集上減少了23.74%和36.99%的檢索延遲,僅有1-2%的邊際準確性下降。
作為一個即插即用的解決方案,HaS還顯著加速了現代代理RAG管道中的複雜多跳查詢。
源代碼可在以下網址獲得:https://github.com/ErrEqualsNil/HaS。
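The control flow described in the HaS abstract, accepting a cheap draft when an incoming query is recognized as a re-encounter of an earlier one and falling back to full retrieval otherwise, can be sketched as follows. The cosine-similarity threshold is a simplified stand-in for the paper's homologous query re-identification, and the class name, the `threshold` value, and the `full_retrieve` callback are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SpeculativeRetriever:
    """Reuse results of earlier, sufficiently similar queries (assumed test)."""

    def __init__(self, full_retrieve, threshold=0.95):
        self.full_retrieve = full_retrieve  # slow full-database search
        self.threshold = threshold          # assumed homology cutoff
        self.history = []                   # (query embedding, documents)
        self.full_calls = 0

    def retrieve(self, embedding):
        # Speculative step: look only at previously seen queries.
        best = max(self.history,
                   key=lambda item: cosine(embedding, item[0]),
                   default=None)
        if best is not None and cosine(embedding, best[0]) >= self.threshold:
            return best[1]  # draft accepted, full retrieval skipped
        # Draft rejected or no history: fall back to full retrieval.
        self.full_calls += 1
        docs = self.full_retrieve(embedding)
        self.history.append((embedding, docs))
        return docs


retriever = SpeculativeRetriever(lambda emb: ["doc-a"])
first = retriever.retrieve([1.0, 0.0])
second = retriever.retrieve([0.99, 0.01])  # near-duplicate query
calls_after_two = retriever.full_calls
third = retriever.retrieve([0.0, 1.0])     # unrelated query
```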
Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness
2604.20413v1 by Fulong Fan, Peilin Liu, Fengzhe Liu, Shuyan Yang, Gang Yan
Large language models perform well on many reasoning tasks, yet they often lack awareness of whether their current knowledge or reasoning state is complete. In non-interactive puzzle settings, the narrative is fixed and the underlying structure is hidden; once a model forms an early hypothesis under incomplete premises, it can propagate that error throughout the reasoning process, leading to unstable conclusions. To address this issue, we propose SABA, a reasoning framework that explicitly introduces self-awareness of missing premises before making the final decision. SABA formulates reasoning as a recursive process that alternates between structured state construction and obstacle resolution: it first applies Information Fusion to consolidate the narrative into a verifiable base state, and then uses Query-driven Structured Reasoning to identify and resolve missing or underspecified premises by turning them into queries and progressively completing the reasoning state through hypothesis construction and state refinement. Across multiple evaluation metrics, SABA achieves the best performance on all three difficulty splits of the non-interactive Detective Puzzle benchmark, and it also maintains leading results on multiple public benchmarks.
摘要:大型語言模型在許多推理任務中表現良好,但它們往往缺乏對當前知識或推理狀態是否完整的認識。
在非互動的謎題環境中,敘事是固定的,潛在結構是隱藏的;一旦模型在不完整的前提下形成早期假設,就可能在整個推理過程中傳播這一錯誤,導致不穩定的結論。
為了解決這個問題,我們提出了 SABA,一個推理框架,它在做出最終決策之前明確引入對缺失前提的自我意識。
SABA 將推理公式化為一個遞歸過程,交替進行結構化狀態構建和障礙解決:它首先應用信息融合將敘事整合為可驗證的基礎狀態,然後使用查詢驅動的結構化推理來識別和解決缺失或不明確的前提,通過將其轉化為查詢並逐步通過假設構建和狀態細化來完成推理狀態。
在多個評估指標中,SABA 在非互動偵探謎題基準的所有三個難度劃分中達到了最佳性能,並且在多個公共基準上也保持了領先的結果。
CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
2604.20389v1 by Gustav Keppler, Ghada Elbez, Veit Hagenmeyer
The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry-recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information Technology cybersecurity and more specialized areas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer-Verifier framework, a methodology to generate interpretable, natural language explanations for model performance. Our evaluation shows that frontier models achieve human expert level in general networking and IT security knowledge. However, their accuracy declines in questions that require vendor-specific nuances or knowledge of formal standards such as IEC 62443. Analysis of model scaling trends and release dates demonstrates remarkable gains in parameter efficiency, while recent larger models show diminishing returns. Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.
摘要:快速發展和使用大型語言模型(LLMs)於專業工作流程中,需要對其領域特定知識進行與行業標準的評估。
我們介紹CyberCertBench,這是一套新的多選題回答(MCQA)基準,源自行業認可的認證。
CyberCertBench評估LLM的領域知識,針對資訊科技網路安全的專業標準以及更專門的領域,如運營技術和相關的網路安全標準。
同時,我們提出並驗證了一種新穎的提案者-驗證者框架,這是一種生成可解釋的自然語言解釋模型性能的方法論。
我們的評估顯示,前沿模型在一般網路和IT安全知識上達到了人類專家的水平。
然而,在需要供應商特定細微差別或正式標準知識的問題上,其準確性下降,例如IEC 62443。
對模型擴展趨勢和發布日期的分析顯示出參數效率的顯著提升,而最近更大的模型則顯示出收益遞減。
代碼和評估腳本可在以下鏈接獲得:https://github.com/GKeppler/CyberCertBench。
Graph2Counsel: Clinically Grounded Synthetic Counseling Dialogue Generation from Client Psychological Graphs
2604.20382v1 by Aishik Mandal, Hiba Arnaout, Clarissa W. Ong, Juliet Bockhorst, Kate Sheehan, Rachael Moldow, Tanmoy Chakraborty, Iryna Gurevych
Rising demand for mental health support has increased interest in using Large Language Models (LLMs) for counseling. However, adapting LLMs to this high-risk, safety-critical domain is hindered by the scarcity of real-world counseling data due to privacy constraints. Synthetic datasets provide a promising alternative, but existing approaches often rely on unstructured or semi-structured text inputs and overlook structural dependencies between a client's cognitive, emotional, and behavioral states, often producing psychologically inconsistent interactions and reducing data realism and quality. We introduce Graph2Counsel, a framework for generating synthetic counseling sessions grounded in Client Psychological Graphs (CPGs) that encode relationships among clients' thoughts, emotions, and behaviors. Graph2Counsel employs a structured prompting pipeline guided by counselor strategies and CPG, and explores prompting strategies including CoT (Wei et al., 2022) and Multi-Agent Feedback (Li et al., 2025a). Graph2Counsel produces 760 sessions from 76 CPGs across diverse client profiles. In expert evaluation, our dataset outperforms prior datasets on specificity, counselor competence, authenticity, conversational flow, and safety, with substantial inter-annotator agreement (Krippendorff's α = 0.70). Fine-tuning an open-source model on this dataset improves performance on CounselingBench (Nguyen et al., 2025) and CounselBench (Li et al., 2025b), showing downstream utility. We also make our code and data public.
摘要:心理健康支持需求的增加引起了對使用大型語言模型(LLMs)進行輔導的興趣。
然而,由於隱私限制,將LLMs適應於這一高風險的安全關鍵領域受到現實輔導數據稀缺的阻礙。
合成數據集提供了一個有前景的替代方案,但現有的方法通常依賴於非結構化或半結構化的文本輸入,並忽視了客戶的認知、情感和行為狀態之間的結構依賴,經常產生心理上不一致的互動,並降低數據的真實性和質量。
我們介紹了Graph2Counsel,一個基於客戶心理圖(CPGs)生成合成輔導會話的框架,該圖編碼了客戶的思想、情感和行為之間的關係。
Graph2Counsel採用一個結構化的提示管道,由輔導策略和CPG指導,並探索包括CoT(Wei et al., 2022)和多代理反饋(Li et al., 2025a)在內的提示策略。
Graph2Counsel從76個CPG中生成760個會話,涵蓋多樣的客戶檔案。
在專家評估中,我們的數據集在具體性、輔導者能力、真實性、對話流暢性和安全性方面超越了先前的數據集,並且標註者之間有顯著的一致性(Krippendorff's $α$ = 0.70)。
在這個數據集上微調開源模型提高了在CounselingBench(Nguyen et al., 2025)和CounselBench(Li et al., 2025b)上的表現,顯示了下游的實用性。
我們還公開了我們的代碼和數據。
Domain-Aware Hierarchical Contrastive Learning for Semi-Supervised Generalization Fault Diagnosis
2604.20928v1 by Junyu Ren, Wensheng Gan, Philip S Yu
Fault diagnosis under unseen operating conditions remains highly challenging when labeled data are scarce. Semi-supervised domain generalization fault diagnosis (SSDGFD) provides a practical solution by jointly exploiting labeled and unlabeled source domains. However, existing methods still suffer from two coupled limitations. First, pseudo-labels for unlabeled domains are typically generated primarily from knowledge learned on the labeled source domain, which neglects domain-specific geometric discrepancies and thus induces systematic cross-domain pseudo-label bias. Second, unlabeled samples are commonly handled with a hard accept-or-discard strategy, where rigid thresholding causes imbalanced sample utilization across domains, while hard-label assignment for uncertain samples can easily introduce additional noise. To address these issues, we propose a unified framework termed domain-aware hierarchical contrastive learning (DAHCL) for SSDGFD. Specifically, DAHCL introduces a domain-aware learning (DAL) module to explicitly capture source-domain geometric characteristics and calibrate pseudo-label predictions across heterogeneous source domains, thereby mitigating cross-domain bias in pseudo-label generation. In addition, DAHCL develops a hierarchical contrastive learning (HCL) module that combines dynamic confidence stratification with fuzzy contrastive supervision, enabling uncertain samples to contribute to representation learning without relying on unreliable hard labels. In this way, DAHCL jointly improves the quality of supervision and the utilization of unlabeled samples. Furthermore, to better reflect practical industrial scenarios, we incorporate engineering noise into the SSDGFD evaluation protocol. Extensive experiments on three benchmark datasets demonstrate that...
摘要:故障診斷在未見操作條件下仍然非常具有挑戰性,特別是在標記數據稀缺的情況下。半監督領域泛化故障診斷(SSDGFD)通過共同利用標記和未標記的源領域提供了一個實用的解決方案。然而,現有方法仍然面臨兩個相互耦合的限制。首先,未標記領域的偽標籤通常主要是從標記源領域學習的知識生成的,這忽略了領域特定的幾何差異,從而引入系統性的跨領域偽標籤偏差。其次,未標記樣本通常採用硬性接受或丟棄策略處理,僵化的閾值設定導致跨領域樣本利用不平衡,而對不確定樣本的硬標籤分配則容易引入額外的噪音。為了解決這些問題,我們提出了一個統一的框架,稱為領域感知層次對比學習(DAHCL),用於SSDGFD。具體而言,DAHCL引入了一個領域感知學習(DAL)模塊,以明確捕捉源領域的幾何特徵並校準跨異質源領域的偽標籤預測,從而減輕偽標籤生成中的跨領域偏差。此外,DAHCL開發了一個層次對比學習(HCL)模塊,將動態置信度分層與模糊對比監督相結合,使不確定樣本能夠在不依賴不可靠的硬標籤的情況下對表示學習作出貢獻。通過這種方式,DAHCL共同提高了監督質量和未標記樣本的利用率。此外,為了更好地反映實際工業場景,我們在SSDGFD評估協議中納入了工程噪音。在三個基準數據集上的廣泛實驗表明…
Surrogate modeling for interpreting black-box LLMs in medical predictions
2604.20331v2 by Changho Han, Songsoo Kim, Dong Won Kim, Leo Anthony Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon
Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.
摘要:大型語言模型(LLMs)在龐大的數據集上進行訓練,將廣泛的現實世界知識編碼在其參數中,但其黑箱特性使得這種編碼的機制和範圍變得不明朗。代理建模使用簡化模型來近似複雜系統,可以為黑箱模型的更好可解釋性提供一條途徑。我們提出了一個代理建模框架,定量解釋LLM編碼的知識。對於從領域知識衍生的特定假設,該框架通過在一系列綜合模擬場景中進行廣泛的提示,使用可觀察的元素(輸入-輸出對)來近似潛在的LLM知識空間。通過在醫療預測中的概念驗證實驗,我們展示了該框架在揭示LLMs如何「感知」每個輸入變量與輸出之間的關係方面的有效性。特別是,考慮到LLMs可能會延續其訓練數據中固有的不準確性和社會偏見,我們使用該框架的實驗定量揭示了與既有醫學知識相矛盾的關聯以及LLM編碼知識中科學上被駁斥的種族假設的持續存在。通過揭示這些問題,我們的框架可以作為紅旗指標,以支持這些模型的安全和可靠應用。
Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
2604.20311v2 by Dali Wang, Yunyao Zhang, Junqing Yu, Yi-Ping Phoebe Chen, Chen Xu, Zikai Song
Micro-video popularity prediction (MVPP) aims to forecast the future popularity of videos on online media, which is essential for applications such as content recommendation and traffic allocation. In real-world scenarios, it is critical for MVPP approaches to understand both the temporal dynamics of a given video (temporal) and its historical relevance to other videos (spatial). However, existing approaches suffer from limitations in both dimensions: temporally, they rely on sparse short-range sampling that restricts content perception; spatially, they depend on flat retrieval memory with limited capacity and low efficiency, hindering scalable knowledge utilization. To overcome these limitations, we propose a unified framework that achieves joint spatio-temporal enlargement, enabling precise perception of extremely long video sequences while supporting a scalable memory bank that can infinitely expand to incorporate all relevant historical videos. Technically, we employ a Temporal Enlargement driven by a frame scoring module that extracts highlight cues from video frames through two complementary pathways: sparse sampling and dense perception. Their outputs are adaptively fused to enable robust long-sequence content understanding. For Spatial Enlargement, we construct a Topology-Aware Memory Bank that hierarchically clusters historically relevant content based on topological relationships. Instead of directly expanding memory capacity, we update the encoder features of the corresponding clusters when incorporating new videos, enabling unbounded historical association without unbounded storage growth. Extensive experiments on three widely used MVPP benchmarks demonstrate that our method consistently outperforms 11 strong baselines across mainstream metrics, achieving robust improvements in both prediction accuracy and ranking consistency.
摘要:微視頻人氣預測(MVPP)旨在預測在線媒體上視頻的未來人氣,這對於內容推薦和流量分配等應用至關重要。
在現實場景中,MVPP 方法必須理解給定視頻的時間動態(時間性)及其與其他視頻的歷史相關性(空間性)。
然而,現有方法在這兩個維度上都存在限制:在時間上,它們依賴於稀疏的短期取樣,限制了內容感知;在空間上,它們依賴於容量有限且效率低下的平面檢索記憶,妨礙了可擴展的知識利用。
為了克服這些限制,我們提出了一個統一框架,實現了聯合時空擴展,使得能夠精確感知極長的視頻序列,同時支持可以無限擴展以納入所有相關歷史視頻的可擴展記憶庫。
在技術上,我們採用一個由幀評分模塊驅動的時間擴展,通過稀疏取樣和密集感知兩條互補路徑從視頻幀中提取重點線索。
它們的輸出被自適應融合,以實現穩健的長序列內容理解。
對於空間擴展,我們構建了一個拓撲感知記憶庫,根據拓撲關係對歷史相關內容進行分層聚類。
我們不是直接擴展記憶容量,而是在納入新視頻時更新相應聚類的編碼器特徵,實現無限的歷史關聯而無需無限的存儲增長。
在三個廣泛使用的 MVPP 基準上進行的廣泛實驗表明,我們的方法在主流指標上始終超越 11 個強基準,實現了預測準確性和排名一致性的穩健提升。
Dual Causal Inference: Integrating Backdoor Adjustment and Instrumental Variable Learning for Medical VQA
2604.20306v1 by Zibo Xu, Qiang Li, Ke Lu, Jin Wang, Weizhi Nie, Yuting Su
Medical Visual Question Answering (MedVQA) aims to generate clinically reliable answers conditioned on complex medical images and questions. However, existing methods often overfit to superficial cross-modal correlations, neglecting the intrinsic biases embedded in multimodal medical data. Consequently, models become vulnerable to cross-modal confounding effects, severely hindering their ability to provide trustworthy diagnostic reasoning. To address this limitation, we propose a novel Dual Causal Inference (DCI) framework for MedVQA. To the best of our knowledge, DCI is the first unified architecture that integrates Backdoor Adjustment (BDA) and Instrumental Variable (IV) learning to jointly tackle both observable and unobserved confounders. Specifically, we formulate a Structural Causal Model (SCM) where observable cross-modal biases (e.g., frequent visual and textual co-occurrences) are mitigated via BDA, while unobserved confounders are compensated using an IV learned from a shared latent space. To guarantee the validity of the IV, we design mutual information constraints that maximize its dependence on the fused multimodal representations while minimizing its associations with the unobserved confounders and target answers. Through this dual mechanism, DCI extracts deconfounded representations that capture genuine causal relationships. Extensive experiments on four benchmark datasets, SLAKE, SLAKE-CP, VQA-RAD, and PathVQA, demonstrate that our method consistently outperforms existing approaches, particularly in out-of-distribution (OOD) generalization. Furthermore, qualitative analyses confirm that DCI significantly enhances the interpretability and robustness of cross-modal reasoning by explicitly disentangling true causal effects from spurious cross-modal shortcuts.
摘要:醫學視覺問題回答(MedVQA)旨在根據複雜的醫學影像和問題生成臨床可靠的答案。
然而,現有的方法往往過度擬合於表面上的跨模態相關性,忽略了嵌入多模態醫學數據中的內在偏見。
因此,模型變得容易受到跨模態混淆效應的影響,嚴重妨礙其提供可信診斷推理的能力。
為了解決這一限制,我們提出了一種新穎的雙重因果推斷(DCI)框架,用於MedVQA。
據我們所知,DCI是第一個統一架構,整合了後門調整(BDA)和工具變量(IV)學習,以共同解決可觀察和不可觀察的混淆因素。
具體而言,我們構建了一個結構性因果模型(SCM),其中可觀察的跨模態偏見(例如,頻繁的視覺和文本共現)通過BDA得到減輕,而不可觀察的混淆因素則通過從共享潛在空間學習的IV來補償。
為了保證IV的有效性,我們設計了互信息約束,以最大化其對融合多模態表示的依賴,同時最小化其與不可觀察混淆因素和目標答案的關聯。
通過這一雙重機制,DCI提取出去混淆的表示,捕捉真正的因果關係。
在四個基準數據集SLAKE、SLAKE-CP、VQA-RAD和PathVQA上進行的廣泛實驗表明,我們的方法在性能上始終優於現有方法,特別是在分佈外(OOD)泛化方面。
此外,定性分析證實,DCI通過明確區分真實的因果效應和虛假的跨模態捷徑,顯著增強了跨模態推理的可解釋性和穩健性。
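The backdoor adjustment named in the DCI abstract above can be illustrated with the textbook formula P(Y | do(X)) = Σ_z P(Y | X, z) P(z) for a single observed confounder. The sketch below is only a toy, discrete instance of that formula: the variable names and every probability are invented for illustration, and it does not represent how DCI applies the adjustment to learned multimodal features.

```python
# Toy backdoor adjustment with one binary observed confounder Z.
# Every probability below is an invented, illustrative value.

p_z = {0: 0.6, 1: 0.4}               # P(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.8}      # P(X = 1 | Z = z)
p_y1_given_x1_z = {0: 0.3, 1: 0.9}   # P(Y = 1 | X = 1, Z = z)


def interventional(p_y_given_xz, p_z):
    """P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, z) * P(z)."""
    return sum(p_y_given_xz[z] * p_z[z] for z in p_z)


def observational(p_y_given_xz, p_x_given_z, p_z):
    """P(Y=1 | X=1) = sum_z P(Y=1 | X=1, z) * P(z | X=1), using Bayes' rule."""
    p_x = sum(p_x_given_z[z] * p_z[z] for z in p_z)
    return sum(p_y_given_xz[z] * p_x_given_z[z] * p_z[z] / p_x for z in p_z)


adjusted = interventional(p_y1_given_x1_z, p_z)
confounded = observational(p_y1_given_x1_z, p_x1_given_z, p_z)
print(f"adjusted={adjusted:.4f} confounded={confounded:.4f}")
```

In this toy setting the confounded estimate is larger than the adjusted one, because Z raises both the chance of X = 1 and the chance of Y = 1.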
Multi-Perspective Evidence Synthesis and Reasoning for Unsupervised Multimodal Entity Linking
2604.20283v1 by Mo Zhou, Jianwei Wang, Kai Wang, Helen Paik, Ying Zhang, Wenjie Zhang
Multimodal Entity Linking (MEL) is a fundamental task in data management that maps ambiguous mentions with diverse modalities to the multimodal entities in a knowledge base. However, most existing MEL approaches primarily focus on optimizing instance-centric features and evidence, leaving broader forms of evidence and their intricate interdependencies insufficiently explored. Motivated by the observation that human experts' decision-making relies on multi-perspective judgment, we propose MSR-MEL, a Multi-perspective Evidence Synthesis and Reasoning framework with Large Language Models (LLMs) for unsupervised MEL. Specifically, we adopt a two-stage framework: (1) Offline Multi-Perspective Evidence Synthesis constructs a comprehensive set of evidence. This includes instance-centric evidence capturing the instance-centric multimodal information of mentions and entities, group-level evidence that aggregates neighborhood information, lexical evidence based on string overlap ratio, and statistical evidence based on simple summary statistics. A core contribution of our framework is the synthesis of group-level evidence, which aggregates vital neighborhood information through graphs. We first construct LLM-enhanced contextualized graphs. Subsequently, different modalities are jointly aligned through an asymmetric teacher-student graph neural network. (2) Online Multi-Perspective Evidence Reasoning leverages the power of LLM as a reasoning module to analyze the correlation and semantics of the multi-perspective evidence to induce an effective ranking strategy for accurate entity linking without supervision. Extensive experiments on widely used MEL benchmarks demonstrate that MSR-MEL consistently outperforms state-of-the-art unsupervised methods. The source code is available at: https://anonymous.4open.science/r/MSR-MEL-C21E/.
摘要:多模態實體連結(MEL)是數據管理中的一項基本任務,它將模糊的提及與多樣的模態映射到知識庫中的多模態實體上。
然而,大多數現有的 MEL 方法主要集中在優化以實例為中心的特徵和證據上,對於更廣泛的證據形式及其複雜的相互依賴關係探討不足。
受到觀察到的人類專家決策過程依賴於多角度判斷的啟發,在本研究中,我們提出了 MSR-MEL,一種基於大型語言模型(LLMs)的多角度證據綜合與推理框架,用於無監督的 MEL。
具體來說,我們採用兩階段框架:
(1) 離線多角度證據綜合構建了一套全面的證據。
這包括捕捉提及和實體的以實例為中心的多模態信息的實例中心證據、聚合鄰域信息的群體級證據、基於字符串重疊比的詞彙證據,以及基於簡單摘要統計的統計證據。
我們框架的一個核心貢獻是群體級證據的綜合,這通過圖有效地聚合了重要的鄰域信息。
我們首先構建了增強的上下文化圖。
隨後,通過不對稱的教師-學生圖神經網絡共同對齊不同的模態。
(2) 在線多角度證據推理利用 LLM 作為推理模塊,分析多角度證據的相關性和語義,以誘導有效的排名策略,實現準確的實體連結而無需監督。
在廣泛使用的 MEL 基準上進行的廣泛實驗表明,MSR-MEL 始終優於最先進的無監督方法。
本文的源代碼可在以下網址獲得:https://anonymous.4open.science/r/MSR-MEL-C21E/。
AROMA: Augmented Reasoning Over a Multimodal Architecture for Virtual Cell Genetic Perturbation Modeling
2604.20263v1 by Zhenyu Wang, Geyan Ye, Wei Liu, Man Tat Alexander Ng
Virtual cell modeling predicts molecular state changes under genetic perturbations in silico, which is essential for biological mechanism studies. However, existing approaches suffer from unconstrained reasoning, uninterpretable predictions, and retrieval signals that are weakly aligned with regulatory topology. To address these limitations, we propose AROMA, an Augmented Reasoning Over a Multimodal Architecture for virtual cell genetic perturbation modeling. AROMA integrates textual evidence, graph-topology information, and protein sequence features to model perturbation-target dependencies, and is trained with a two-stage optimization strategy to yield predictions that are both accurate and interpretable. We also construct two knowledge graphs and a perturbation reasoning dataset, PerturbReason, containing more than 498k samples, as reusable resources for the virtual cell domain. Experiments show that AROMA outperforms existing methods across multiple cell lines, and remains robust under zero-shot evaluation on an unseen cell line, as well as in knowledge-sparse, long-tail scenarios. Overall, AROMA demonstrates that combining knowledge-driven multimodal modeling with evidence retrieval provides a promising pathway toward more reliable and interpretable virtual cell perturbation prediction. Model weights are available at https://huggingface.co/blazerye/AROMA. Code is available at https://github.com/blazerye/AROMA.
摘要:虛擬細胞建模預測在基因擾動下的分子狀態變化,這對於生物機制研究至關重要。
然而,現有的方法存在不受限制的推理、不可解釋的預測,以及與調控拓撲弱相關的檢索信號等問題。
為了解決這些限制,我們提出了AROMA,一種針對虛擬細胞基因擾動建模的增強推理多模態架構。
AROMA整合了文本證據、圖形拓撲信息和蛋白質序列特徵,以建模擾動-目標依賴關係,並通過兩階段優化策略進行訓練,以產生既準確又可解釋的預測。
我們還構建了兩個知識圖譜和一個擾動推理數據集PerturbReason,包含超過498k的樣本,作為虛擬細胞領域的可重用資源。
實驗顯示,AROMA在多個細胞系中表現優於現有方法,並且在未見過的細胞系上進行零樣本評估時仍然保持穩健,以及在知識稀疏的長尾場景中也表現良好。
總體而言,AROMA展示了將知識驅動的多模態建模與證據檢索相結合,為更可靠和可解釋的虛擬細胞擾動預測提供了一條有前景的途徑。
模型權重可在 https://huggingface.co/blazerye/AROMA 獲得。
代碼可在 https://github.com/blazerye/AROMA 獲得。
Hybrid Policy Distillation for LLMs
2604.20244v1 by Wenhong Zhu, Ruobing Xie, Rui Wang, Pengfei Liu
Knowledge distillation (KD) is a powerful paradigm for compressing large language models (LLMs), whose effectiveness depends on intertwined choices of divergence direction, optimization strategy, and data regime. We break down the design of existing KD methods and present a unified view that establishes connections between them, reformulating KD as a reweighted log-likelihood objective at the token level. We further propose Hybrid Policy Distillation (HPD), which integrates the complementary advantages of forward and reverse KL to balance mode coverage and mode-seeking, and combines off-policy data with lightweight, approximate on-policy sampling. We validate HPD on long-generation math reasoning as well as short-generation dialogue and code tasks, demonstrating improved optimization stability, computational efficiency, and final performance across diverse model families and scales. The code related to this work is available at https://github.com/zwhong714/Hybrid-Policy-Distillation.
摘要:知識蒸餾 (KD) 是一種強大的壓縮大型語言模型 (LLMs) 的範式,其有效性取決於發散方向、優化策略和數據模式的相互選擇。
我們分析了現有 KD 方法的設計,並提出了一個統一的視角,建立它們之間的聯繫,將 KD 重新表述為在標記層級的重加權對數似然目標。
我們進一步提出了混合策略蒸餾 (HPD),它整合了前向和反向 KL 的互補優勢,以平衡模式覆蓋和模式尋求,並結合了離策略數據與輕量級、近似的在策略抽樣。
我們在長生成數學推理以及短生成對話和代碼任務上驗證了 HPD,展示了在不同模型系列和規模下的優化穩定性、計算效率和最終性能的提升。
與此工作相關的代碼可在 https://github.com/zwhong714/Hybrid-Policy-Distillation 獲得。
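The trade-off between forward and reverse KL that the HPD abstract refers to can be seen on a three-token toy distribution. The sketch below is an illustration only: the distributions are invented, and the `hybrid` mix with weight `alpha` is a hypothetical stand-in, not the objective defined in the paper.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hybrid(teacher, student, alpha=0.5):
    """Illustrative convex mix of forward and reverse KL (an assumption, not HPD itself)."""
    return alpha * kl(teacher, student) + (1 - alpha) * kl(student, teacher)


# Invented next-token distributions: the teacher has two modes.
teacher = [0.49, 0.49, 0.02]
student_cover = [0.34, 0.33, 0.33]   # spreads mass over everything
student_seek = [0.96, 0.02, 0.02]    # commits to a single mode

for name, student in [("cover", student_cover), ("seek", student_seek)]:
    print(name,
          "forward=%.3f" % kl(teacher, student),
          "reverse=%.3f" % kl(student, teacher),
          "hybrid=%.3f" % hybrid(teacher, student))
```

Forward KL, KL(teacher || student), is lower for the covering student, while reverse KL, KL(student || teacher), is lower for the mode-seeking one. This is the tension a hybrid objective tries to balance.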
Construction of a Battery Research Knowledge Graph using a Global Open Catalog
2604.20241v1 by Luca Foppiano, Sae Dieb, Malik Zain, Kazuki Kasama, Keitaro Sodeyama, Mikiko Tanifuji
Battery research is a rapidly growing and highly interdisciplinary field, making it increasingly difficult to track relevant expertise and identify potential collaborators across institutional boundaries. In this work, we present a pipeline for constructing an author-centric knowledge graph of battery research built on OpenAlex, a large-scale open bibliographic catalogue. For each author, we derive a weighted research descriptors vector that combines coarse-grained OpenAlex concepts with fine-grained keyphrases extracted from titles and abstracts using KeyBERT with ChatGPT (gpt-3.5-turbo) as the backend model, selected after evaluating multiple alternatives. Vector components are weighted by research descriptor origin, authorship position, and temporal recency. The framework is applied to a corpus of 189,581 battery-related works. The resulting vectors support author-author similarity computation, community detection, and exploratory search through a browser-based interface. The knowledge graph is then serialized in RDF and linked to Wikidata identifiers, making it interoperable with external linked open data sources and extensible beyond the battery domain. Unlike prior author-centric analyses confined to institutional repositories, our approach operates at cross-institutional scale and grounds similarity in domain semantics rather than citation or co-authorship structure alone.
摘要:電池研究是一個快速增長且高度跨學科的領域,這使得追踪相關專業知識和識別潛在合作者越來越困難,尤其是在機構邊界之間。
在這項工作中,我們提出了一個基於OpenAlex的大規模開放書目目錄的以作者為中心的電池研究知識圖譜構建流程。
對於每位作者,我們推導出一個加權的研究描述符向量,該向量將粗粒度的OpenAlex概念與使用KeyBERT和ChatGPT(gpt-3.5-turbo)作為後端模型提取的標題和摘要中的細粒度關鍵詞組合在一起,該模型是在評估多種替代方案後選擇的。
向量組件根據研究描述符來源、作者位置和時間的近期性進行加權。
該框架應用於一個包含189,581篇與電池相關的作品的語料庫。
生成的向量支持作者之間的相似性計算、社群檢測和通過基於瀏覽器的界面進行探索性搜索。
然後,知識圖譜以RDF格式序列化並鏈接到Wikidata標識符,使其能與外部鏈接開放數據源互操作,並可擴展到電池領域之外。
與先前僅限於機構存儲庫的以作者為中心的分析不同,我們的方法在跨機構的規模上運作,並將相似性根植於領域語義,而不僅僅是引用或共同作者結構。
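To make the weighted descriptor vector from the abstract above concrete, here is a minimal sketch. The abstract only states that components are weighted by descriptor origin, authorship position, and temporal recency. The multiplicative combination, the `ORIGIN_WEIGHT` values, the 1/position rule, and the exponential half-life are all assumptions made for illustration, and the author data are invented.

```python
import math

# Hypothetical weights per descriptor origin (not taken from the paper).
ORIGIN_WEIGHT = {"concept": 0.5, "keyphrase": 1.0}


def descriptor_vector(mentions, current_year, half_life=5.0):
    """Build a sparse author vector from (descriptor, origin, author_position, year) tuples."""
    vec = {}
    for descriptor, origin, position, year in mentions:
        weight = ORIGIN_WEIGHT[origin]                          # descriptor origin
        weight *= 1.0 / position                                # earlier author position counts more
        weight *= 0.5 ** ((current_year - year) / half_life)    # older works decay
        vec[descriptor] = vec.get(descriptor, 0.0) + weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


author_a = descriptor_vector(
    [("solid electrolyte", "keyphrase", 1, 2024), ("Materials science", "concept", 2, 2020)], 2025)
author_b = descriptor_vector(
    [("solid electrolyte", "keyphrase", 3, 2022), ("cathode", "keyphrase", 1, 2023)], 2025)
print(round(cosine(author_a, author_b), 3))
```

Author-author similarity of this kind depends only on shared descriptors, which matches the abstract's point that similarity is grounded in domain semantics rather than citation or co-authorship links.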
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context
2604.20216v1 by Yilun Zhu, Yuan Zhuang, Nikhita Vedula, Dushyanta Dhyani, Shaoyuan Xu, Moyan Li, Mohsen Bayati, Bryan Wang, Shervin Malmasi
Many applications of LLM-based text regression require predicting a full conditional distribution rather than a single point value. We study distributional regression under empirical-quantile supervision, where each input is paired with multiple observed quantile outcomes, and the target distribution is represented by a dense grid of quantiles. We address two key limitations of current approaches: the lack of local grounding for distribution estimates, and the reliance on shared representations that create an indirect bottleneck between inputs and quantile outputs. In this paper, we introduce Quantile Token Regression, which, to our knowledge, is the first work to insert dedicated quantile tokens into the input sequence, enabling direct input-output pathways for each quantile through self-attention. We further augment these quantile tokens with retrieval, incorporating semantically similar neighbor instances and their empirical distributions to ground predictions with local evidence from similar instances. We also provide the first theoretical analysis of loss functions for quantile regression, clarifying which distributional objectives each optimizes. Experiments on the Inside Airbnb and StackSample benchmark datasets with LLMs ranging from 1.7B to 14B parameters show that quantile tokens with neighbors consistently outperform baselines (~4 points lower MAPE and 2x narrower prediction intervals), with especially large gains on smaller and more challenging datasets where quantile tokens produce substantially sharper and more accurate distributions.
摘要:許多基於LLM的文本回歸應用需要預測完整的條件分佈,而不是單一的點值。
我們研究了在經驗分位數監督下的分佈回歸,其中每個輸入都與多個觀察到的分位數結果配對,目標分佈由密集的分位數網格表示。
我們解決了當前方法的兩個主要限制:對分佈估計缺乏局部基礎,以及依賴共享表示,這在輸入和分位數輸出之間創造了間接瓶頸。
在本文中,我們介紹了分位數令牌回歸,據我們所知,這是首個將專用分位數令牌插入輸入序列的工作,通過自注意力為每個分位數啟用直接的輸入-輸出通道。
我們進一步通過檢索增強這些分位數令牌,結合語義相似的鄰近實例及其經驗分佈,以用相似實例的局部證據來基礎化預測。
我們還提供了分位數回歸損失函數的首個理論分析,澄清了每個損失函數優化的分佈目標。
在Inside Airbnb和StackSample基準數據集上的實驗中,使用參數範圍從1.7B到14B的LLM顯示,帶有鄰近實例的分位數令牌始終優於基準(MAPE低約4點,預測區間窄2倍),在較小和更具挑戰性的數據集上尤其獲得了顯著的增益,其中分位數令牌產生了顯著更尖銳和更準確的分佈。
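For readers unfamiliar with quantile objectives, the standard pinball (quantile) loss is a useful reference point for the abstract above. The abstract does not say which loss functions the paper analyses, so treating the pinball loss as representative is an assumption. The helper `best_constant` and the sample values are illustrative only.

```python
def pinball_loss(y, q_pred, tau):
    """Pinball loss for one target y and one predicted tau-quantile q_pred."""
    diff = y - q_pred
    # Under-prediction is charged tau per unit, over-prediction (1 - tau) per unit.
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def best_constant(sample, tau):
    """Sample value that minimizes the total pinball loss over the sample."""
    return min(sample, key=lambda c: sum(pinball_loss(y, c, tau) for y in sample))


sample = [1, 2, 3, 4, 100]          # invented outcomes with one outlier
print(best_constant(sample, 0.5))   # the median, not the mean
```

At tau = 0.5 the minimizer is the sample median, which is why the outlier does not drag the prediction upward. Other values of tau shift the asymmetry and pick out other quantiles.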
Towards Secure Logging: Characterizing and Benchmarking Logging Code Security Issues with LLMs
2604.20211v1 by He Yang Yuan, Xin Wang, Kundi Yao, An Ran Chen, Zishuo Ding, Zhenhao Li
Logging code plays an important role in software systems by recording key events and behaviors, which are essential for debugging and monitoring. However, insecure logging practices can inadvertently expose sensitive information or enable attacks such as log injection, posing serious threats to system security and privacy. Prior research has examined general defects in logging code, but systematic analysis of logging code security issues remains limited, particularly in leveraging LLMs for detection and repair. In this paper, we derive a comprehensive taxonomy of logging code security issues, encompassing four common issue categories and 10 corresponding patterns. We further construct a benchmark dataset with 101 real-world logging security issue reports that have been manually reviewed and annotated. We then propose an automated framework that incorporates various contextual knowledge to evaluate LLMs' capabilities in detecting and repairing logging security issues. Our experimental results reveal a notable disparity in performance: while LLMs are moderately effective at detecting security issues (e.g., the accuracy ranges from 12.9% to 52.5% on average), they face noticeable challenges in reliably generating correct code repairs. We also find that the issue description alone improves the LLMs' detection accuracy more than the security pattern explanation or a combination of both. Overall, our findings provide actionable insights for practitioners and highlight the potential and limitations of current LLMs for secure logging.
摘要:記錄代碼在軟體系統中扮演著重要角色,通過記錄關鍵事件和行為,這對於除錯和監控至關重要。
然而,不安全的記錄實踐可能無意中暴露敏感信息或使攻擊(如日誌注入)成為可能,對系統安全和隱私構成嚴重威脅。
先前的研究已經檢視了記錄代碼中的一般缺陷,但對於記錄代碼安全問題的系統性分析仍然有限,特別是在利用 LLMs 進行檢測和修復方面。
在本文中,我們推導出一個全面的記錄代碼安全問題分類法,涵蓋四個常見問題類別和 10 種相應的模式。
我們進一步構建了一個基準數據集,包含 101 份經過人工審查和註釋的真實世界記錄安全問題報告。
然後,我們提出了一個自動化框架,該框架整合了各種上下文知識,以評估 LLMs 在檢測和修復記錄安全問題方面的能力。
我們的實驗結果顯示出顯著的性能差異:雖然 LLMs 在檢測安全問題方面的效果中等(例如,準確率平均範圍為 12.9% 到 52.5%),但它們在可靠生成正確代碼修復方面面臨明顯挑戰。
我們還發現,僅問題描述就能比安全模式解釋或兩者的結合更能提高 LLMs 的檢測準確性。
總體而言,我們的發現為從業者提供了可行的見解,並突顯了當前 LLMs 在安全記錄方面的潛力和局限性。
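Log injection, which the abstract above names as one attack enabled by insecure logging, typically relies on user-controlled text containing line breaks. The sketch below shows one common mitigation with Python's standard `logging` module. The helper names `neutralize` and `log_failed_login` are hypothetical, and whether this exact pattern appears in the paper's taxonomy or benchmark is not stated in the abstract.

```python
import logging

logger = logging.getLogger("auth")


def neutralize(value):
    """Escape CR and LF so user input cannot start a forged log line."""
    return str(value).replace("\r", "\\r").replace("\n", "\\n")


def log_failed_login(username):
    # Pass the sanitized value as a logging argument instead of formatting it in by hand.
    logger.warning("failed login for user=%s", neutralize(username))


# An attacker-controlled name that tries to append a fake INFO entry.
log_failed_login("alice\nINFO admin logged in from 10.0.0.1")
```

Escaping line breaks addresses forged entries only. It does nothing about the other risk the abstract mentions, sensitive information being written to the log in the first place.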
All Languages Matter: Understanding and Mitigating Language Bias in Multilingual RAG
2604.20199v1 by Dan Wang, Guozhao Mo, Yafei Shi, Cheng Zhang, Bo Zheng, Boxi Cao, Xuanang Chen, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Multilingual Retrieval-Augmented Generation (mRAG) leverages cross-lingual evidence to ground Large Language Models (LLMs) in global knowledge. However, we show that current mRAG systems suffer from a language bias during reranking, systematically favoring English and the query's native language. By introducing an estimated oracle evidence analysis, we quantify a substantial performance gap between existing rerankers and the achievable upper bound. Further analysis reveals a critical distributional mismatch: while optimal predictions require evidence scattered across multiple languages, current systems systematically suppress such "answer-critical" documents, thereby limiting downstream generation performance. To bridge this gap, we propose Language-Agnostic Utility-driven Reranker Alignment (LAURA), which aligns multilingual evidence ranking with downstream generative utility. Experiments across diverse languages and generation models show that LAURA effectively mitigates language bias and consistently improves mRAG performance.
摘要:多語言檢索增強生成(mRAG)利用跨語言證據將大型語言模型(LLMs)與全球知識相結合。
然而,我們顯示當前的 mRAG 系統在重新排序過程中存在語言偏見,系統性地偏向英語和查詢的母語。
通過引入估計的神諭證據分析,我們量化了現有重新排序器與可達上限之間的顯著性能差距。
進一步分析揭示了一個關鍵的分佈不匹配:雖然最佳預測需要跨多種語言散佈的證據,但當前系統系統性地抑制這些“答案關鍵”文件,從而限制了下游生成性能。
為了彌補這一差距,我們提出了 Language-Agnostic Utility-driven Reranker Alignment (LAURA),該方法將多語言證據排序與下游生成效用對齊。
在多種語言和生成模型上的實驗表明,LAURA 有效減輕了語言偏見,並持續改善 mRAG 性能。
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
2604.20183v1 by Xinyu Zhang, Yuchen Wan, Boxuan Zhang, Zesheng Yang, Lingling Zhang, Bifan Wei, Jun Liu
Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster's content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. The experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11% to 21%. Notably, our analysis reveals a "knowledge inheritance" phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework's scalability and efficiency.
摘要:大型語言模型(LLMs)在優化問題中常常面臨結構性模糊性,其中單一問題可能有多個相關但相互矛盾的建模範式,這妨礙了有效解決方案的生成。為了解決這個問題,我們提出了雙集群記憶代理(DCM-Agent),旨在通過無需訓練的方式利用歷史解決方案來提升性能。這一方法的核心是雙集群記憶構建。該代理將歷史解決方案分配到建模和編碼集群,然後將每個集群的內容提煉為三種類型:方法、檢查清單和陷阱。這個過程產生了可概括的指導知識。此外,該代理引入了記憶增強推理,以動態導航解決方案路徑,檢測和修復錯誤,並根據結構化知識自適應地切換推理路徑。在七個優化基準測試中的實驗表明,DCM-Agent實現了平均性能提升11%-21%。值得注意的是,我們的分析揭示了一種「知識繼承」現象:由較大模型構建的記憶可以指導較小模型達到更優的性能,突顯了該框架的可擴展性和效率。
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
2604.20146v1 by Jielong Tang, Xujie Yuan, Jiayang Liu, Jianxing Yu, Xiao Dong, Lin Chen, Yunlai Teng, Shimin Di, Jian Yin
Grounded Multimodal Named Entity Recognition (GMNER) aims to extract named entities and localize their visual regions within image-text pairs, serving as a pivotal capability for various downstream applications. In open-world social media platforms, GMNER remains challenging due to the prevalence of long-tailed, rapidly evolving, and unseen entities. To tackle this, existing approaches typically rely on either external knowledge exploration through heuristic retrieval or internal knowledge exploitation via iterative refinement in Multimodal Large Language Models (MLLMs). However, heuristic retrieval often introduces noisy or conflicting evidence that degrades precision on known entities, while solely internal exploitation is constrained by the knowledge boundaries of MLLMs and prone to hallucinations. To address this, we propose SAKE, an end-to-end agentic framework that harmonizes internal knowledge exploitation and external knowledge exploration via self-aware reasoning and adaptive search tool invocation. We implement this via a two-stage training paradigm. First, we propose Difficulty-aware Search Tag Generation, which quantifies the model's entity-level uncertainty through multiple forward samplings to produce explicit knowledge-gap signals. Based on these signals, we construct SAKE-SeCoT, a high-quality Chain-of-Thought dataset that equips the model with basic self-awareness and tool-use capabilities through supervised fine-tuning. Second, we employ agentic reinforcement learning with a hybrid reward function that penalizes unnecessary retrieval, enabling the model to evolve from rigid search imitation to genuine self-aware decision-making about when retrieval is truly necessary. Extensive experiments on two widely used social media benchmarks demonstrate SAKE's effectiveness.
摘要:基於實體的多模態命名實體識別(GMNER)旨在從圖像-文本對中提取命名實體並定位其視覺區域,這對於各種下游應用來說是一項關鍵能力。在開放世界的社交媒體平台上,由於長尾、快速演變和未見實體的普遍存在,GMNER 仍然面臨挑戰。為了解決這個問題,現有的方法通常依賴於通過啟發式檢索進行外部知識探索或通過多模態大型語言模型(MLLMs)中的迭代精煉進行內部知識利用。然而,啟發式檢索往往會引入噪聲或矛盾的證據,降低已知實體的精確度,而僅僅依賴內部利用則受到 MLLMs 知識邊界的限制,並容易出現幻覺。為了解決這個問題,我們提出了 SAKE,一個端到端的代理框架,通過自我意識推理和自適應搜索工具調用來協調內部知識利用和外部知識探索。我們通過兩階段的訓練範式來實現這一點。首先,我們提出了難度感知搜索標籤生成,通過多次前向抽樣量化模型的實體級不確定性,以產生明確的知識缺口信號。基於這些信號,我們構建了 SAKE-SeCoT,一個高質量的思維鏈數據集,通過監督微調使模型具備基本的自我意識和工具使用能力。其次,我們採用了代理強化學習,使用混合獎勵函數來懲罰不必要的檢索,使模型能夠從僵化的搜索模仿演變為真正的自我意識決策,判斷何時檢索是真正必要的。在兩個廣泛使用的社交媒體基準上的大量實驗證明了 SAKE 的有效性。
To Know is to Construct: Schema-Constrained Generation for Agent Memory
2604.20117v1 by Lei Zheng, Weinan Song, Daili Li, Yanming Yang
Constructivist epistemology argues that knowledge is actively constructed rather than passively copied. Despite the generative nature of Large Language Models (LLMs), most existing agent memory systems are still based on dense retrieval. However, dense retrieval heavily relies on semantic overlap or entity matching within sentences. Consequently, embeddings often fail to distinguish instances that are semantically similar but contextually distinct, introducing substantial noise by retrieving context-mismatched entries. Conversely, directly employing open-ended generation for memory access risks "Structural Hallucination" where the model generates memory keys that do not exist in the memory, leading to lookup failures. Inspired by this epistemology, we posit that memory is fundamentally organized by cognitive schemas, and valid recall must be a generative process performed within these schematic structures. To realize this, we propose SCG-MEM, a schema-constrained generative memory architecture. SCG-MEM reformulates memory access as Schema-Constrained Generation. By maintaining a dynamic Cognitive Schema, we strictly constrain LLM decoding to generate only valid memory entry keys, providing a formal guarantee against structural hallucinations. To support long-term adaptation, we model memory updates via assimilation (grounding inputs into existing schemas) and accommodation (expanding schemas with novel concepts). Furthermore, we construct an Associative Graph to enable multi-hop reasoning through activation propagation. Experiments on the LoCoMo benchmark show that SCG-MEM substantially improves performance across all categories over retrieval-based baselines.
摘要:建構主義認識論認為知識是主動構建的,而不是被動複製的。儘管大型語言模型(LLMs)具有生成性,但大多數現有的代理記憶系統仍然基於密集檢索。然而,密集檢索在很大程度上依賴於句子內的語義重疊或實體匹配。因此,嵌入通常無法區分語義相似但語境不同的實例,通過檢索語境不匹配的條目引入了大量噪音。相反,直接使用開放式生成進行記憶訪問存在“結構性幻覺”的風險,即模型生成的記憶鍵在記憶中不存在,導致查找失敗。受到這一認識論的啟發,我們假設記憶本質上是由認知圖式組織的,有效的回憶必須是在這些圖式結構內進行的生成過程。為了實現這一點,我們提出了SCG-MEM,一種圖式約束的生成記憶架構。SCG-MEM將記憶訪問重新定義為圖式約束生成。通過維護動態的認知圖式,我們嚴格限制LLM解碼,僅生成有效的記憶條目鍵,為結構性幻覺提供了正式的保證。為了支持長期適應,我們通過同化(將輸入基於現有圖式進行基礎化)和調適(用新概念擴展圖式)來建模記憶更新。此外,我們構建了一個聯想圖,以通過激活傳播實現多跳推理。在LoCoMo基準上的實驗顯示,SCG-MEM在所有類別上相比基於檢索的基準顯著提高了性能。
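The idea of constraining decoding so that only existing memory keys can be produced can be sketched with a prefix trie over tokenized keys. This is one common way to implement constrained generation and is an assumption here, since the abstract does not describe SCG-MEM's actual mechanism. The `END` marker, the example keys, and the toy scoring function standing in for an LLM are all hypothetical.

```python
END = "<end>"  # hypothetical end-of-key marker


def build_trie(keys):
    """Store every valid key (a tuple of tokens) in a nested-dict prefix trie."""
    trie = {}
    for key in keys:
        node = trie
        for token in key:
            node = node.setdefault(token, {})
        node[END] = {}
    return trie


def constrained_decode(trie, score):
    """Greedy decoding that only ever considers tokens the trie allows next."""
    node, out = trie, []
    while True:
        token = max(node, key=lambda t: score(out, t))
        if token == END:
            return tuple(out)
        out.append(token)
        node = node[token]


keys = {("trip", "paris"), ("trip", "rome"), ("work", "deadline")}
# Toy scorer standing in for model logits; it likes "tokyo", which is not a stored key.
prefs = {"tokyo": 9, "trip": 5, "rome": 4, "paris": 3}
result = constrained_decode(build_trie(keys), lambda prefix, t: prefs.get(t, 0))
print(result)
```

Because candidates are drawn from the trie rather than the full vocabulary, the scorer's preferred but nonexistent key is never emitted, and the decoded result is always a member of `keys`.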
Learning to Solve the Quadratic Assignment Problem with Warm-Started MCMC Finetuning
2604.20109v1 by Yicheng Pan, Ruisong Zhou, Haijun Zou, Tianyou Li, Zaiwen Wen
The quadratic assignment problem (QAP) is a fundamental NP-hard task that poses significant challenges for both traditional heuristics and modern learning-based solvers. Existing QAP solvers still struggle to achieve consistently competitive performance across structurally diverse real-world instances. To bridge this performance gap, we propose PLMA, an innovative permutation learning framework. PLMA features an efficient warm-started MCMC finetuning procedure to enhance deployment-time performance, leveraging short Markov chains to anchor the adaptation to the promising regions previously explored. For rapid exploration via MCMC over the permutation space, we design an additive energy-based model (EBM) that enables an $O(1)$-time 2-swap Metropolis-Hastings sampling step. Moreover, the neural network used to parameterize the EBM incorporates a scalable and flexible cross-graph attention mechanism to model interactions between facilities and locations in the QAP. Extensive experiments demonstrate that PLMA consistently outperforms state-of-the-art baselines across various benchmarks. In particular, PLMA achieves a near-zero average optimality gap on QAPLIB, exhibits remarkably superior robustness on the notoriously difficult Taixxeyy instances, and also serves as an effective QAP solver in bandwidth minimization.
摘要:二次指派問題(QAP)是一個基本的 NP-hard 任務,對傳統啟發式方法和現代基於學習的解決方案都提出了重大挑戰。現有的 QAP 解決器在結構多樣的現實世界實例中仍然難以實現持續競爭的性能。為了縮小這一性能差距,我們提出了 PLMA,一個創新的排列學習框架。PLMA 具有高效的熱啟動 MCMC 微調程序,以增強部署時的性能,利用短的馬爾可夫鏈將適應性固定在先前探索的有希望區域。為了通過 MCMC 在排列空間中快速探索,我們設計了一個加性基於能量的模型(EBM),使得 $O(1)$ 時間的 2-swap Metropolis-Hastings 取樣步驟成為可能。此外,用於參數化 EBM 的神經網絡包含一種可擴展且靈活的跨圖注意機制,以建模 QAP 中設施和位置之間的相互作用。大量實驗表明,PLMA 在各種基準測試中始終優於最先進的基準。特別是,PLMA 在 QAPLIB 上實現了接近零的平均最優性差距,在著名的困難 Taixxeyy 實例上展現出顯著的優越穩健性,並且在帶寬最小化中也作為一個有效的 QAP 解決器。
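The constant-time 2-swap step mentioned in the PLMA abstract follows naturally if the energy is a sum of per-position terms, since a swap then changes only two of them. The sketch below assumes the additive form E(π) = Σ_i θ[i][π(i)], which is one reading of "additive energy-based model" and not a form confirmed by the abstract. The table `theta` is random illustrative data, where the paper would use a neural network.

```python
import math
import random


def energy(theta, perm):
    """Additive energy: one term per (facility i, location perm[i]) pair."""
    return sum(theta[i][perm[i]] for i in range(len(perm)))


def swap_delta(theta, perm, i, j):
    """Energy change from swapping perm[i] and perm[j]; touches four entries only."""
    return (theta[i][perm[j]] + theta[j][perm[i]]) - (theta[i][perm[i]] + theta[j][perm[j]])


def mh_step(theta, perm, rng, temperature=1.0):
    """One Metropolis-Hastings step with a symmetric 2-swap proposal (in place)."""
    i, j = rng.sample(range(len(perm)), 2)
    delta = swap_delta(theta, perm, i, j)
    if delta <= 0 or rng.random() < math.exp(-delta / temperature):
        perm[i], perm[j] = perm[j], perm[i]
    return perm


rng = random.Random(0)
n = 6
theta = [[rng.random() for _ in range(n)] for _ in range(n)]
perm = list(range(n))
for _ in range(200):
    mh_step(theta, perm, rng, temperature=0.1)
print(perm, round(energy(theta, perm), 3))
```

Since the 2-swap proposal is symmetric, the acceptance probability reduces to min(1, exp(-ΔE / T)), and ΔE never requires re-evaluating the full energy.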
Auditing and Controlling AI Agent Actions in Spreadsheets
2604.20070v1 by Sadra Sabouri, Zeinabsadat Saghi, Run Huang, Sujay Maladi, Esmeralda Eufracio, Sumit Gulwani, Souti Chattopadhyay
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already been made without their involvement. This lack of transparency leaves users unable to examine the agent's assumptions, identify errors before they propagate, or redirect execution when it deviates from their intent. The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable. Each decision the agent makes is recorded directly in cells that belong to and reflect on the user. We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step. A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow. Users identified their own intent reflected in the agent's actions, detected errors that post-hoc review would have failed to surface, and reported a sense of co-ownership over the resulting output. These findings indicate that meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.
Information Aggregation with AI Agents
2604.20050v1 by Spyros Galanis
Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in easy information structures, increasing the complexity has a significant negative impact, suggesting that AI agents may suffer from the same limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or the initial price, and strategic prompting, which demonstrates that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and are more profitable. Surprisingly, giving them feedback about past performance makes them worse at aggregation and reduces their profits.
Statistics, Not Scale: Modular Medical Dialogue with Bayesian Belief Engine
2604.20022v1 by Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley
Large language models are increasingly deployed as autonomous diagnostic agents, yet they conflate two fundamentally different capabilities: natural-language communication and probabilistic reasoning. We argue that this conflation is an architectural flaw, not an engineering shortcoming. We introduce BMBE (Bayesian Medical Belief Engine), a modular diagnostic dialogue framework that enforces a strict separation between language and reasoning: an LLM serves only as a sensor, parsing patient utterances into structured evidence and verbalising questions, while all diagnostic inference resides in a deterministic, auditable Bayesian engine. Because patient data never enters the LLM, the architecture is private by construction; because the statistical backend is a standalone module, it can be replaced per target population without retraining. This separation yields three properties no autonomous LLM can offer: calibrated selective diagnosis with a continuously adjustable accuracy-coverage tradeoff, a statistical separation gap where even a cheap sensor paired with the engine outperforms a frontier standalone model from the same family at a fraction of the cost, and robustness to adversarial patient communication styles that cause standalone doctors to collapse. We validate across empirical and LLM-generated knowledge bases against frontier LLMs, confirming the advantage is architectural, not informational.
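The separation the BMBE abstract describes, where a language model only extracts structured evidence and a deterministic engine does the inference, can be sketched as a small Bayesian update followed by a thresholded decision. This is an illustrative sketch only: it assumes conditionally independent binary findings (a naive-Bayes simplification not stated in the abstract), and the function names, conditions, and probabilities are hypothetical.

```python
def update_posterior(prior, likelihoods, evidence):
    """Deterministic Bayesian update over candidate diagnoses.

    prior:       {diagnosis: P(diagnosis)}
    likelihoods: {diagnosis: {finding: P(finding present | diagnosis)}}
    evidence:    {finding: True/False}, e.g. as parsed by the LLM "sensor"
    """
    unnormalized = {}
    for diagnosis, p in prior.items():
        for finding, present in evidence.items():
            q = likelihoods[diagnosis][finding]
            p *= q if present else (1.0 - q)
        unnormalized[diagnosis] = p
    total = sum(unnormalized.values())
    return {d: v / total for d, v in unnormalized.items()}

def selective_diagnosis(posterior, threshold):
    """Return the top diagnosis, or None (abstain) if confidence is too low."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else None

prior = {"flu": 0.5, "cold": 0.5}
likelihoods = {"flu": {"fever": 0.9}, "cold": {"fever": 0.2}}
posterior = update_posterior(prior, likelihoods, {"fever": True})
print(selective_diagnosis(posterior, threshold=0.8))  # commits to a diagnosis
print(selective_diagnosis(posterior, threshold=0.9))  # abstains
```

Raising `threshold` trades coverage for accuracy, which is one simple way to read the "continuously adjustable accuracy-coverage tradeoff" mentioned in the abstract.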
From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents
2604.20006v1 by Md Nayem Uddin, Kumar Shubham, Eduardo Blanco, Chitta Baral, Gengyu Wang
Personalized agents that interact with users over long periods must maintain persistent memory across sessions and update it as circumstances change. However, existing benchmarks predominantly frame long-term memory evaluation as fact retrieval from past conversations, providing limited insight into agents' ability to consolidate memory over time or handle frequent knowledge updates. We introduce Memora, a long-term memory benchmark spanning weeks- to months-long user conversations. The benchmark evaluates three memory-grounded tasks: remembering, reasoning, and recommending. To ensure data quality, we employ automated memory-grounding checks and human evaluation. We further introduce Forgetting-Aware Memory Accuracy (FAMA), a metric that penalizes reliance on obsolete or invalidated memory when evaluating long-term memory. Evaluations of four LLMs and six memory agents reveal frequent reuse of invalid memories and failures to reconcile evolving memories. Memory agents offer marginal improvements, exposing shortcomings in long-term memory for personalized agents.
Tracing Relational Knowledge Recall in Large Language Models
2604.19934v2 by Nicholas Popovič, Michael Färber
We study how large language models recall relational knowledge during text generation, with a focus on identifying latent representations suitable for relation classification via linear probes. Prior work shows how attention heads and MLPs interact to resolve subject, predicate, and object, but it remains unclear which representations support faithful linear relation classification and why some relation types are easier to capture linearly than others. We systematically evaluate different latent representations derived from attention head and MLP contributions, showing that per-head attention contributions to the residual stream are comparatively strong features for linear relation classification. Feature attribution analyses of the trained probes, as well as characteristics of the different relation types, reveal clear correlations between probe accuracy and relation specificity, entity connectedness, and how distributed the signal on which the probe relies is across attention heads. Finally, we show how token-level feature attribution of probe predictions can be used to reveal probe behavior in further detail.
CreativeGame: Toward Mechanic-Aware Creative Game Generation
2604.19926v1 by Hongnan Ma, Han Wang, Shenglin Wang, Tieyue Yin, Yiwei Shi, Yucong Huang, Yingtian Zou, Muning Wen, Mengyue Yang
Large language models can generate plausible game code, but turning this capability into iterative creative improvement remains difficult. In practice, single-shot generation often produces brittle runtime behavior, weak accumulation of experience across versions, and creativity scores that are too subjective to serve as reliable optimization signals. A further limitation is that mechanics are frequently treated only as post-hoc descriptions, rather than as explicit objects that can be planned, tracked, preserved, and evaluated during generation. This report presents CreativeGame, a multi-agent system for iterative HTML5 game generation that addresses these issues through four coupled ideas: a proxy reward centered on programmatic signals rather than pure LLM judgment; lineage-scoped memory for cross-version experience accumulation; runtime validation integrated into both repair and reward; and a mechanic-guided planning loop in which retrieved mechanic knowledge is converted into an explicit mechanic plan before code generation begins. The goal is not merely to produce a playable artifact in one step, but to support interpretable version-to-version evolution. The current system contains 71 stored lineages, 88 saved nodes, and a 774-entry global mechanic archive, implemented in 6,181 lines of Python together with inspection and visualization tooling. The system is therefore substantial enough to support architectural analysis, reward inspection, and real lineage-level case studies rather than only prompt-level demos. A real 4-generation lineage shows that mechanic-level innovation can emerge in later versions and can be inspected directly through version-to-version records. The central contribution is therefore not only game generation, but a concrete pipeline for observing progressive evolution through explicit mechanic change.
Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding
2604.19921v1 by Zijie Wang, MohammadHossein Rezaei, Farzana Rashid, Eduardo Blanco
Negation is a common and important semantic feature in natural language, yet Large Language Models (LLMs) struggle when negation is involved in natural language understanding tasks. Commonsense knowledge, on the other hand, despite being a well-studied topic, lacks investigations involving negation. In this work, we show that commonsense knowledge with negation is challenging for models to understand. We present a novel approach to automatically augment existing commonsense knowledge corpora with negation, yielding two new corpora containing over 2M triples with if-then relations. In addition, pre-training LLMs on our corpora benefits negation understanding.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
2604.19734v1 by Boyu Chen, Yi Chen, Lu Qiu, Jerry Bai, Yuying Ge, Yixiao Ge
Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.
ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
2604.19856v1 by Cagri Eryilmaz
Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60-65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dim state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15-98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36-60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.
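The pass@1 percentages quoted in the ChipCraftBrain abstract can be checked against the problem counts it reports. The short calculation below is only an arithmetic sanity check; the abstract states 154/156 and 286/302, while the 150/156 count behind the 96.15% lower bound is an inference from the percentage and is not given in the abstract.

```python
def pass_rate(passed, total):
    """Percentage of problems solved on the first attempt."""
    return 100.0 * passed / total

# Best VerilogEval-Human run reported in the abstract: 154 of 156 problems.
print(round(pass_rate(154, 156), 2))  # 98.72
# Lower end of the reported range; 150/156 is inferred, not stated.
print(round(pass_rate(150, 156), 2))  # 96.15
# CVDP non-agentic subset: 286 of 302 problems.
print(round(pass_rate(286, 302), 1))  # 94.7
```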
A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
2604.19689v1 by Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA
2604.19685v1 by Saransh Sharma, Pritika Ramu, Aparna Garimella, Koyel Mukherjee
Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.
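The two-stage idea in the InsightGen abstract (a thematic representation of the collection, then neighborhood selection on a thematic graph) can be sketched roughly as follows. This is not the authors' implementation: it assumes the clustering step has already produced themes as keyword sets, uses Jaccard overlap as a stand-in similarity, and all names (`build_thematic_graph`, `related_context`, `min_sim`) are hypothetical.

```python
def jaccard(a, b):
    """Set overlap in [0, 1]; used here as a stand-in theme similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_thematic_graph(themes, min_sim=0.15):
    """Connect themes whose keyword sets overlap by at least `min_sim`."""
    names = list(themes)
    graph = {name: set() for name in names}
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            if jaccard(themes[a], themes[b]) >= min_sim:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def related_context(themes, graph, answer_terms):
    """Anchor on the theme closest to the initial answer, then return
    its graph neighbors as candidate context for related insights."""
    anchor = max(themes, key=lambda name: jaccard(themes[name], answer_terms))
    return anchor, sorted(graph[anchor])

themes = {
    "retrieval": {"rag", "retrieval", "index"},
    "evaluation": {"benchmark", "metric", "retrieval"},
    "robotics": {"robot", "control"},
}
graph = build_thematic_graph(themes)
print(related_context(themes, graph, {"rag", "index"}))
```

In a fuller pipeline, the documents attached to the neighboring themes would be passed to an LLM to generate the insights; that step is omitted here.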