LLM

Publish Date	Title	Authors	arXiv ID	Code
2026-06-17	Native Active Perception as Reasoning for Omni-Modal Understanding	Zhenghao Xing et al.	2606.19341v1	null
2026-06-17	Learning User Simulators with Turing Rewards	Yingshan Susan Wang et al.	2606.19336v1	null
2026-06-17	Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States	Denis Peskoff et al.	2606.19334v1	null
2026-06-17	UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning	Mohamed Nabail et al.	2606.19328v1	null
2026-06-17	Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation	Siyi Gu et al.	2606.19327v1	null
2026-06-17	Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors	Michael Finkelson et al.	2606.19325v1	null
2026-06-17	Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents	Anoushka Vyas et al.	2606.19319v1	null
2026-06-17	Explaining Attention with Program Synthesis	Amiri Hayes et al.	2606.19317v1	null
2026-06-17	Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play	Leyang Shen et al.	2606.19308v1	null
2026-06-17	Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA	Ikram Belmadani et al.	2606.19266v1	null
2026-06-17	Structured Inference with Large Language Gibbs	Sanghyeok Choi et al.	2606.19264v1	null
2026-06-17	A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2	Yijin Wang et al.	2606.19259v1	null
2026-06-17	DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models	Zirui Wu et al.	2606.19257v1	null
2026-06-17	X+Slides: Benchmarking Audience-Conditioned Slide Generation	Haodong Chen et al.	2606.19256v1	null
2026-06-17	OneCanvas: 3D Scene Understanding via Panoramic Reprojection	Bartłomiej Baranowski et al.	2606.19253v1	null
2026-06-17	TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology	Hannah Le et al.	2606.19245v1	null
2026-06-17	STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability	Haipeng Luo et al.	2606.19236v1	null
2026-06-17	Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning	Chenyu Zhou et al.	2606.19222v1	null
2026-06-17	Machine Unlearning for the XGBoost Model with Network Intrusion Datasets	Diana Magalhães et al.	2606.19220v1	null
2026-06-17	RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering	Pushwitha Krishnappa et al.	2606.19218v1	null
2026-06-17	Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times	Giuseppe Gabriele et al.	2606.19199v1	null
2026-06-17	Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis	Soheyl Bateni et al.	2606.19183v1	null
2026-06-17	A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies	Fangyijie Wang et al.	2606.19174v1	null
2026-06-17	User as Engram: Internalizing Per-User Memory as Local Parametric Edits	Bojie Li et al.	2606.19172v1	null
2026-06-17	Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition	Shiho Matta et al.	2606.19170v1	null
2026-06-17	Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection	Jinhan Li et al.	2606.19168v1	null
2026-06-17	Essential Subspace Merging for Multi-Task Learning	Longhua Li et al.	2606.19164v1	null
2026-06-17	IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages	Sakshi Joshi et al.	2606.19157v1	null
2026-06-17	AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces	Zongmin Zhang et al.	2606.19152v1	null
2026-06-17	OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems	Till Richter et al.	2606.19145v1	null
2026-06-17	Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction	Jingyi Zhou et al.	2606.19144v1	null
2026-06-17	Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation	Ramza Basharat et al.	2606.19139v1	null
2026-06-17	A Technical Taxonomy of LLM Agent Communication Protocols	Linus Sander et al.	2606.19135v1	null
2026-06-17	Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening	Kasper Helverskov Petersen et al.	2606.19133v1	null
2026-06-17	Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions	Hui Zhang et al.	2606.19121v1	null
2026-06-17	Analysing drivers and interdependencies in European electricity markets using XAI	Antoine Pesenti et al.	2606.19118v1	null
2026-06-17	Towards an Agent-First Web: Redesigning the Web for AI Agents	Eranga Bandara et al.	2606.19116v1	null
2026-06-17	Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams	Haewoon Kwak et al.	2606.19111v1	null
2026-06-17	ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL	Mukund Khanna et al.	2606.19103v1	null
2026-06-17	ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection	Enrico Cassano et al.	2606.19079v1	null
2026-06-17	Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science	Qiuyu Fang et al.	2606.19051v1	null
2026-06-17	Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration	Xhevahire Tërnava et al.	2606.19042v1	null
2026-06-17	A Hybrid LSTM--Vision Transformer Architecture for Predicting HRRR Forecast Errors	David Aaron Evans et al.	2606.19026v1	null
2026-06-17	FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs	Lorenzo Sani et al.	2606.19025v1	null
2026-06-17	Sumi: Open Uniform Diffusion Language Model from Scratch	Mengyu Ye et al.	2606.19005v1	null
2026-06-17	Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training	Ruiqi Lai et al.	2606.19004v1	null
2026-06-17	Enhancing Multilingual Reasoning via Steerable Model Merging	Zhuoran Li et al.	2606.19002v1	null
2026-06-17	TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction	Moon Ye-Bin et al.	2606.18996v1	null
2026-06-17	G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment	Fengying Ye et al.	2606.18989v1	null
2026-06-17	ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection	Jinhao Song et al.	2606.18988v1	null
2026-06-17	Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering	Yafeng Wu et al.	2606.18986v1	null
2026-06-17	Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment	Franziska Braun et al.	2606.18979v1	null
2026-06-17	CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System	Marco Becattini et al.	2606.18976v1	null
2026-06-17	A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI	Syed Mujtaba Haider et al.	2606.18970v1	null
2026-06-17	GraphPO: Graph-based Policy Optimization for Reasoning Models	Yuliang Zhan et al.	2606.18954v1	null
2026-06-17	RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models	San Kim et al.	2606.18950v1	null
2026-06-17	Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents	Emmanuel Aboah Boateng et al.	2606.18947v1	null
2026-06-17	SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents	Jingkun Luo et al.	2606.18946v1	null
2026-06-17	Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking	Pierre Dantas et al.	2606.18941v1	null
2026-06-17	SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety	Linghao Feng et al.	2606.18936v1	null
2026-06-17	TransitNet: A Compact Attention-Augmented Deep Learning Framework for Low-SNR Transit Blind Searches	Xingchen Yan et al.	2606.18932v1	null
2026-06-17	As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language	Jasmine Owers et al.	2606.18922v1	null
2026-06-17	REVES: REvision and VErification--Augmented Training for Test-Time Scaling	Yuanxin Liu et al.	2606.18910v1	null
2026-06-17	SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation	Jiangnan Xia et al.	2606.18897v1	null
2026-06-17	Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction	Zhuangzhuang Pan et al.	2606.18893v1	null
2026-06-17	Skill-Guided Continuation Distillation for GUI Agents	Zhimin Fan et al.	2606.18890v1	null
2026-06-17	Improving Medical Communication using Rubric-Guided Counterfactual Recommendations	Adrian Cosma et al.	2606.18889v1	null
2026-06-17	Generative-Model Predictive Planning for Navigation in Partially Observable Environments	Thomas Quilter et al.	2606.18888v1	null
2026-06-17	Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems	Bernardo Feijó Junqueira et al.	2606.18882v1	null
2026-06-17	Efficient Financial Language Understanding via Distillation with Synthetic Data	Wen-Fong et al.	2606.18875v1	null
2026-06-17	Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness	Zijian Wang et al.	2606.18874v1	null
2026-06-17	Scaling Learning-based AEB with Massive Unlabeled Data	Xiangyu Wang et al.	2606.18864v1	null
2026-06-17	URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification	Xinze Zhang et al.	2606.18861v1	null
2026-06-17	Approximate Structured Diffusion for Sequence Labelling	Nicolas Floquet et al.	2606.18856v1	null
2026-06-17	ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement	Bohou Zhang et al.	2606.18850v1	null
2026-06-17	WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents	Yehang Zhang et al.	2606.18847v1	null
2026-06-17	Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems	Hehai Lin et al.	2606.18837v1	null
2026-06-17	Target-confidence Recourse Using tSeTlin machines: TRUST	K. Darshana Abeyrathna et al.	2606.18832v1	null
2026-06-17	Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning	Xiaoyue Xu et al.	2606.18831v1	null
2026-06-17	GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents	Zhe Ren et al.	2606.18829v1	null
2026-06-17	Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation	Chenghao Xu et al.	2606.18828v1	null
2026-06-17	Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets	Jiaxi Liu et al.	2606.18820v1	null
2026-06-17	SwitchBraidNet: Quantisation-Aware Lightweight Architecture for Hybrid Brain-Computer Interface	Gourav Siddhad et al.	2606.18816v1	null
2026-06-17	Reinforcement Learning Foundation Models Should Already Be A Thing	Abdelrahman Zighem et al.	2606.18812v1	null
2026-06-17	Rescaling MLM-Head for Neural Sparse Retrieval	Youngjoon Jang et al.	2606.18811v1	null
2026-06-17	Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards	Yingyu Shan et al.	2606.18810v1	null
2026-06-17	ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch	Tengfei Lyu et al.	2606.18803v1	null
2026-06-17	SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval	Youngjoon Jang et al.	2606.18801v1	null
2026-06-17	Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports	Qingyu Lu et al.	2606.18797v1	null
2026-06-17	HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space	Jaward Sesay et al.	2606.18788v1	null
2026-06-17	RedactionBench	Sean Brynjólfsson et al.	2606.18782v1	null
2026-06-17	Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation	Shanshan Lyu et al.	2606.18781v1	null
2026-06-17	SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction	Quanjiang Guo et al.	2606.18780v1	null
2026-06-17	Private Learning with Public Feature Conditioning	Shuli Jiang et al.	2606.18773v1	null
2026-06-17	Output Vector Editing for Memorization Mitigation in Large Language Models	Ahmad Dawar Hakimi et al.	2606.18767v1	null
2026-06-17	Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs	Chris Lee et al.	2606.18747v1	null
2026-06-17	What Must Generalist Agents Remember?	Khurram Yamin et al.	2606.18746v1	null
2026-06-17	SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents	Qiao Zhao et al.	2606.18733v1	null
2026-06-17	LegalWorld: A Life-Cycle Interactive Environment for Legal Agents	Songhan Zuo et al.	2606.18728v1	null
2026-06-17	Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring	Fang Wang et al.	2606.18726v1	null

Abstracts

Native Active Perception as Reasoning for Omni-Modal Understanding

2606.19341v1 by Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

Learning User Simulators with Turing Rewards

2606.19336v1 by Yingshan Susan Wang, Cedegao E. Zhang, Linlu Qiu, Zexue He, Pengyuan Li, Alex Pentland, Roger P. Levy, Yoon Kim

Learning to simulate human users in interactive settings could advance the training of agent assistants, evaluation of personalization systems, research in the social sciences, and more. Existing approaches generally do so by training a large language model (LLM) to match a single ground truth response, either by maximizing the log probability or by using a similarity reward. We instead propose Turing-RL: a Turing-Test-based reinforcement learning approach for training user simulator models. Turing-RL uses a discriminative Turing reward with an LLM judge to score how indistinguishable a generated response is from the real user's given the user's history, and the user simulator LLM learns to produce responses indistinguishable from what the user could have said with such rewards. Across two different domains (conversational chat and Reddit forum discussion), we find that Turing-RL consistently outperforms baseline methods on both LLM and human evaluation metrics. Our study suggests that optimizing for indistinguishability, rather than response matching, is effective for learning user simulators.

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

2606.19334v1 by Denis Peskoff, Joe Barrow, Christopher Vu, Diag Davenport

Progress in legal AI increasingly depends on access to authoritative legal text at scale. Yet one of the most consequential layers of American law remains largely absent from existing machine-readable corpora: local ordinances. Local codes govern zoning, housing, business licensing, public health, noise, animal control, and many other domains of everyday regulation, but they are fragmented across vendor platforms designed for human browsing rather than bulk research access. We introduce LOCUS - the Local Ordinance Corpus for the United States - a comprehensive corpus and county-harmonized access layer for U.S. municipal and county ordinance codes. The raw corpus, available for release to researchers, represents nearly all publicly available municipal and county ordinance codes and contains codes from 9,239 cities and counties. A smaller county-harmonized LOCUS access layer provides coverage for the largest 2,309 of 3,144 U.S. counties, accounting for a majority of the population. We use OCR to handle the myriad of document formats that have kept the law from being a public resource. We release the corpus with coverage metadata to support reproducibility, downstream legal AI research, and the incremental expansion of machine-readable access to local law. We train a collection of ModernBERT-based classifiers and scorers to facilitate analyzing U.S. local law along several dimensions, such as opacity and paternalism, that have not previously been studied at this scale. LOCUS-v1 and its derivative models are available at: https://huggingface.co/datasets/LocalLaws/LOCUS-v1

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

2606.19328v1 by Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
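
The unified score described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `beta` weight, and the use of ensemble standard deviation as the epistemic-uncertainty term are all assumptions, and the dynamics ensemble is left out to keep the example short.

```python
from statistics import mean, pstdev

def ubp2_style_score(reward_preds, value_preds, beta=1.0):
    """Score one candidate trajectory from ensemble predictions.

    reward_preds: per-ensemble-member predicted summed reward (assumed input).
    value_preds:  per-ensemble-member predicted terminal value (assumed input).
    beta:         hypothetical weight on the uncertainty bonus.
    """
    returns = [r + v for r, v in zip(reward_preds, value_preds)]
    exploit = mean(returns)     # expected reward plus terminal value
    explore = pstdev(returns)   # ensemble disagreement as an uncertainty proxy
    return exploit + beta * explore

def plan(candidates, beta=1.0):
    """Return the index of the highest-scoring candidate trajectory."""
    scores = [ubp2_style_score(c["rewards"], c["values"], beta) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

candidates = [
    {"rewards": [1.0, 1.0, 1.0], "values": [0.5, 0.5, 0.5]},  # members agree
    {"rewards": [0.0, 1.0, 2.0], "values": [0.4, 0.4, 0.4]},  # members disagree
]
greedy_choice = plan(candidates, beta=0.0)
optimistic_choice = plan(candidates, beta=1.0)
```

With `beta=0.0` the planner picks the candidate with the higher mean return; with `beta=1.0` the disagreement bonus tips the choice toward the uncertain candidate, which is the exploitation versus information-acquisition trade-off the abstract refers to.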

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

2606.19327v1 by Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks, and the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

2606.19325v1 by Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundation model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.
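
To make the idea of a high-noise-biased timestep distribution concrete, here is a small sketch. The abstract does not state which distribution the authors use, so everything here is an assumption: timesteps lie in [0, 1], t = 1 is taken to mean pure noise, and a Beta(alpha, 1) draw stands in for whatever schedule the paper adopts.

```python
import random

def sample_timestep(rng, alpha=3.0):
    """Draw t in [0, 1]; for alpha > 1, Beta(alpha, 1) puts more mass near t = 1.

    Assumption: t = 1 corresponds to the noisiest state. `alpha` is a
    hypothetical knob, not a parameter taken from the paper.
    """
    return rng.betavariate(alpha, 1.0)

rng = random.Random(0)
biased = [sample_timestep(rng) for _ in range(10000)]
uniform = [rng.random() for _ in range(10000)]

mean_biased = sum(biased) / len(biased)      # expected value alpha / (alpha + 1)
mean_uniform = sum(uniform) / len(uniform)   # expected value 0.5
```

Under these assumptions most training steps see a heavily corrupted target, so acoustic similarity to a reference carries little signal and the text prompt has to be used for speaker assignment.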

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

2606.19319v1 by Anoushka Vyas, Aarushi Dhanuka, Sina Khoshfetrat Pakazad, Henrik Ohlsson

Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory for experience reuse, and surface each for review by domain experts. DIA is deployed in production for enterprise customers. We study the Query Generator in depth and evaluate it in fully autonomous mode across seven SQL benchmarks spanning four task categories and four dialects. It matches or surpasses the best published results on all seven, demonstrating that an architecture grounded in execution, built on ACAs and a shared memory, generalizes across the data intelligence workload with adaptation confined to natural-language instructions.
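
The generate, execute, validate, and repair cycle mentioned above can be sketched as a loop around a real SQL engine. This is only an illustration using Python's built-in `sqlite3`; `run_with_repair` and `stub_repair` are hypothetical names, and the stub stands in for the agent call that a system like DIA would make.

```python
import sqlite3

def run_with_repair(conn, sql, repair, max_attempts=3):
    """Execute `sql`; on a database error, hand the message to `repair` and retry."""
    for _ in range(max_attempts):
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            sql = repair(sql, str(err))  # hypothetical agent call
    raise RuntimeError("query could not be repaired")

def stub_repair(sql, error):
    """Stand-in for an agent: fixes one known column-name mistake."""
    return sql.replace("amount_usd", "amount")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.5)])

# The first attempt references a column that does not exist, so it fails
# at execution time and is repaired before the second attempt.
final_sql, rows = run_with_repair(
    conn, "SELECT SUM(amount_usd) FROM orders", stub_repair
)
```

Grounding each step in execution means a wrong query surfaces as a concrete error message rather than as plausible-looking text, which is the property the abstract attributes to the architecture.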

Explaining Attention with Program Synthesis

2606.19317v1 by Amiri Hayes, Belinda Li, Jacob Andreas

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predicts behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
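
As a rough illustration of what such a surrogate program and its Intersection-over-Union score might look like, consider the sketch below. The "previous token" program, the 0.5 binarization threshold, and the toy attention matrix are all assumptions made for the example; the abstract does not specify how the authors binarize attention or compute IoU.

```python
def previous_token_program(tokens):
    """Hypothetical surrogate: each token attends to its predecessor
    (the first token attends to itself)."""
    n = len(tokens)
    return [[1.0 if j == max(i - 1, 0) else 0.0 for j in range(n)] for i in range(n)]

def attention_iou(predicted, observed, threshold=0.5):
    """IoU between the sets of (query, key) cells whose weight exceeds `threshold`."""
    pred = {(i, j) for i, row in enumerate(predicted)
            for j, w in enumerate(row) if w > threshold}
    obs = {(i, j) for i, row in enumerate(observed)
           for j, w in enumerate(row) if w > threshold}
    union = pred | obs
    return len(pred & obs) / len(union) if union else 1.0

tokens = ["the", "cat", "sat"]
observed = [            # made-up attention weights for one head
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.2, 0.2, 0.6],    # this row attends to itself, not to its predecessor
]
score = attention_iou(previous_token_program(tokens), observed)
```

Here the program matches two of the three rows, giving an IoU of 0.5: two cells are shared out of four in the union. Ranking many candidate programs by a score of this kind on held-out inputs mirrors the re-ranking step the abstract describes.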

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play

2606.19308v1 by Leyang Shen, Yang Zhang, Xiaoyan Zhao, Chun Kai Ling, Tat-Seng Chua

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls short on decision-making tasks that are also prevalent in the real world. These tasks require simultaneous reasoning from the stances of all involved stakeholders whose decisions are mutually dependent and thus cannot be solved in isolation. We characterize this challenge as stance entanglement, a form of decision complexity distinct from execution complexity. To address it, we propose Multi-Agent Fictitious Play (MAFP), a novel MAS paradigm that represents stakeholder stances as agents and formulates decision-making as an equilibrium-seeking process. Built on the game-theoretic principle of fictitious play, MAFP iteratively updates each agent's decision by best responding to the empirical mixture of other agents' past decisions. This enables agents to expose and address one another's weaknesses, progressively improving decision quality and robustness. We evaluate MAFP on challenging decision-making tasks that test the capability of deciding strategies for competitive scenarios prior to acting. MAFP outperforms both single-round and multi-round baselines on two complementary metrics, tournament strength and robustness, demonstrating its effectiveness in addressing stance entanglement.
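
For readers unfamiliar with the game-theoretic principle MAFP builds on, the following sketch runs classical fictitious play on a two-player matrix game. It shows only the textbook procedure (best-respond to the opponent's empirical action counts), not the LLM-based system; the function names and the matching-pennies payoffs are chosen for illustration.

```python
def best_response(payoff, opponent_counts):
    """Action with the highest payoff against the opponent's empirical mixture.

    Unnormalized counts give the same argmax as frequencies; ties go to the
    lowest index.
    """
    totals = [sum(p * c for p, c in zip(row, opponent_counts)) for row in payoff]
    return max(range(len(totals)), key=totals.__getitem__)

def fictitious_play(payoff_a, payoff_b, rounds):
    """payoff_a[i][j]: A's payoff when A plays i and B plays j.
    payoff_b[j][i]: B's payoff when B plays j and A plays i."""
    counts_a = [0] * len(payoff_a)  # how often A has played each action
    counts_b = [0] * len(payoff_b)  # how often B has played each action
    for _ in range(rounds):
        a = best_response(payoff_a, counts_b)
        b = best_response(payoff_b, counts_a)
        counts_a[a] += 1
        counts_b[b] += 1
    return [c / rounds for c in counts_a], [c / rounds for c in counts_b]

# Matching pennies: A wins on a match, B wins on a mismatch.
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
freq_a, freq_b = fictitious_play(A, B, rounds=2000)
```

In this zero-sum game the empirical frequencies drift toward the mixed equilibrium of one half for each action. In MAFP, per the abstract, the agents are stakeholder stances and the best response is produced by an LLM rather than by an argmax over a payoff table.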

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

2606.19266v1 by Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

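Summary: The development of large language models (LLMs) has drawn growing attention to their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. Using French medical question answering (QA) as a case study, we present a study of medical domain adaptation. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly separating adaptation effects from the choice of base model. We evaluate multiple-choice QA (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding, using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but its gains over SFT are small and frequently not statistically significant, which makes SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, whereas SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred in LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for choosing adaptation strategies under computational constraints.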

Structured Inference with Large Language Gibbs

2606.19264v1 by Sanghyeok Choi, Henry Gouk, Esmeralda S. Whitammer

The knowledge encoded in large language models (LLMs) can serve as a substrate for structured reasoning over variables describing a complex world, but accessing this knowledge in a probabilistically coherent manner poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses conditional distributions of an LLM as transition operators. Rather than sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on others using an LLM's next-token conditionals. This approach avoids order-dependent biases and produces a stationary distribution that reflects a compromise between all local conditionals. We apply this approach to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that the use of LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior accessible through noisy LLM conditionals.

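Summary: The knowledge encoded in large language models (LLMs) can serve as a basis for structured reasoning over variables that describe a complex world, but accessing this knowledge in a probabilistically coherent way poses a difficult inference problem. We propose Large Language Gibbs, a scheme for structured probabilistic inference that uses an LLM's conditional distributions as transition operators. Instead of sampling structured objects through single-pass autoregressive generation, we iteratively resample individual variables conditioned on the others, using the LLM's next-token conditionals. This approach avoids order-dependent biases and yields a stationary distribution that reflects a compromise among all local conditionals. We apply it to sampling from synthetic distributions, consistent reasoning tasks, and Bayesian structure learning. The results suggest that using LLM conditionals in MCMC is a practical alternative to one-pass generation for structured probabilistic inference under a world prior that is accessible through noisy LLM conditionals.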
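The resampling loop described here is ordinary Gibbs sampling with the conditionals supplied by a model. A minimal sketch follows, under the assumption that a small hand-written joint distribution stands in for the LLM: in the paper's setting `conditional` would query next-token probabilities, and those conditionals need not be mutually consistent. The names `JOINT`, `conditional`, and `gibbs` are hypothetical.

```python
import random

# Toy joint distribution over two binary variables (A, B). It stands in for the
# LLM: here the conditionals are exact, whereas LLM conditionals would be noisy.
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def conditional(index, state):
    """P(variable `index` = 1 | the other variable), derived from JOINT."""
    with_one = tuple(1 if i == index else s for i, s in enumerate(state))
    with_zero = tuple(0 if i == index else s for i, s in enumerate(state))
    p1, p0 = JOINT[with_one], JOINT[with_zero]
    return p1 / (p1 + p0)


def gibbs(n_sweeps=20000, burn_in=1000, seed=0):
    """Estimate the joint by repeatedly resampling each variable given the rest."""
    rng = random.Random(seed)
    state = [0, 0]
    counts = {key: 0 for key in JOINT}
    for sweep in range(n_sweeps + burn_in):
        for i in range(len(state)):
            state[i] = 1 if rng.random() < conditional(i, state) else 0
        if sweep >= burn_in:          # discard early sweeps before counting
            counts[tuple(state)] += 1
    return {key: n / n_sweeps for key, n in counts.items()}


if __name__ == "__main__":
    for key, freq in sorted(gibbs().items()):
        print(key, round(freq, 3), "target", JOINT[key])
```

Because each update conditions on the current value of every other variable, no variable is privileged by a fixed generation order, which is the property the abstract contrasts with single-pass autoregressive sampling.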

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

2606.19259v1 by Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

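Summary: Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, focus mainly on object-centric images and offer limited coverage of scenarios in which textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail in others, and even the strongest conventional detector is severely sensitive to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.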

DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models

2606.19257v1 by Zirui Wu, Lin Zheng, Jiacheng Ye, Shansan Gong, Xueliang Zhao, Yansong Feng, Wei Bi, Lingpeng Kong

Block diffusion language models accelerate decoding through parallel block-wise denoising, yet whether they can be reliably scaled for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and conduct a systematic study of how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance disparity: training with large block sizes yields remarkably poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually transitions training from fine-grained to coarse-grained block sizes, thereby overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work establishes a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.

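Summary: Block diffusion language models accelerate decoding through parallel block-wise denoising, but whether they can be scaled reliably for long chain-of-thought (CoT) reasoning remains unresolved. To this end, we develop DreamReasoner-8B, an open-source block diffusion reasoning model, and systematically study how training and inference block sizes affect long-CoT reasoning. Our analysis reveals a stark performance gap: training with large block sizes yields very poor reasoning, whereas small block sizes preserve effective reasoning. To bridge this granularity gap, we propose block-size curriculum learning, which gradually shifts training from fine-grained to coarse-grained block sizes, overcoming this limitation and enabling strong reasoning performance that generalizes across diverse inference block sizes. On mathematical and code reasoning benchmarks, DreamReasoner-8B achieves results competitive with leading open autoregressive models such as Qwen3-8B. This work lays a practical foundation for efficient, reasoning-capable diffusion language models. We release our model at https://github.com/DreamLM/DreamReasoner.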
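A fine-to-coarse curriculum of the kind described can be expressed as a schedule that maps a training step to a block size. The sketch below is only an illustration: the abstract does not give the actual sizes or stage lengths, so the tuple `(4, 8, 16, 32)`, the equal-length stages, and the function name `block_size_at` are assumptions.

```python
def block_size_at(step, total_steps, sizes=(4, 8, 16, 32)):
    """Return the training block size for `step`, moving from fine to coarse.

    The sizes and the equal-length stages are illustrative assumptions,
    not the schedule used for DreamReasoner-8B.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    stage = step * len(sizes) // total_steps  # index of the current stage
    return sizes[stage]


if __name__ == "__main__":
    for step in (0, 249, 250, 500, 999):
        print(step, block_size_at(step, total_steps=1000))
```

A training loop would call such a function once per step and build the block-wise denoising mask from the returned size.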

X+Slides: Benchmarking Audience-Conditioned Slide Generation

2606.19256v1 by Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $τ_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.

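Summary: Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks mainly assess slide completeness and technical depth while overlooking the target audience, a critical real-world factor. For example, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark designed specifically for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides uses a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures the utility delivered per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems recover a substantial but still incomplete share of audience-essential information: at $τ_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence of source support without source-grounded evaluation.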

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

2606.19253v1 by Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.

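Summary: Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) rely either on complex, model-specific geometry encoders or on large training budgets in pursuit of spatial reasoning. OneCanvas instead aggregates patch features from all views onto a single equirectangular panoramic canvas. Specifically, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when the world position is collapsed to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system, with no fusion and no major architectural changes to the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. This representation also makes a spatial pretraining curriculum possible: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.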
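The geometric step in this abstract (unproject a patch with its depth and camera pose, then place it at the longitude and latitude seen from the canvas origin) can be sketched with standard formulas. The code below is a hedged illustration, not the authors' implementation: it assumes a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world pose `(R, t)`, and a world frame with +z forward, +x right, and +y up; all function names are hypothetical.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with metric depth to world coordinates.

    Assumes a pinhole camera and a camera-to-world pose (R, t).
    """
    cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return tuple(sum(R[i][j] * cam[j] for j in range(3)) + t[i]
                 for i in range(3))


def to_canvas(point, origin, width, height):
    """Map a world point to continuous equirectangular canvas coordinates.

    Axis convention (an assumption): +z forward, +x right, +y up as seen
    from the canvas origin. Returns (column, row, distance).
    """
    dx, dy, dz = (p - o for p, o in zip(point, origin))
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist == 0.0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dx, dz)      # longitude in (-pi, pi]
    lat = math.asin(dy / dist)    # latitude in [-pi/2, pi/2]
    col = (lon / (2 * math.pi) + 0.5) * width
    row = (0.5 - lat / math.pi) * height
    return col, row, dist


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    world = unproject(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0,
                      identity, (0.0, 0.0, 0.0))
    print(world, to_canvas(world, (0.0, 0.0, 0.0), 1024, 512))
```

The angular coordinates discard range, which is why the sketch also returns the distance: it is the quantity that a position embedding of the metric coordinates would need to carry.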

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

2606.19245v1 by Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).

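Summary: Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than from facts memorized from the literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, covering mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers that are graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).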

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

2606.19236v1 by Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.

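Summary: Reinforcement Learning with Verifiable Rewards algorithms such as GRPO have become the dominant post-training paradigm for complex reasoning in LLMs, yet they commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating a sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.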
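The idea of selecting tokens by batch-internal surprisal quantiles and rescaling their advantages can be sketched in a few lines. This is a loose illustration under stated assumptions, not the STARE algorithm: the abstract does not say which quadrant is reweighted or by how much, so the choice to scale high-surprisal tokens, the quantile `q`, the factor `weight`, and both function names are hypothetical, and the target-entropy gate is omitted.

```python
import math


def surprisal_quantile(surprisals, q):
    """Empirical q-quantile (nearest-rank) of the batch's token surprisals."""
    ordered = sorted(surprisals)
    rank = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[rank]


def reweight_advantages(token_probs, advantage, q=0.8, weight=0.5):
    """Scale the trajectory-level advantage on high-surprisal tokens.

    token_probs: probabilities the policy assigned to the sampled tokens.
    q, weight:   hypothetical hyperparameters, not values from the paper.
    """
    surprisals = [-math.log(p) for p in token_probs]   # surprisal = -log p
    threshold = surprisal_quantile(surprisals, q)
    return [advantage * (weight if s > threshold else 1.0)
            for s in surprisals]


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.6, 0.05]
    print(reweight_advantages(probs, advantage=2.0, q=0.75, weight=0.5))
```

Only the token whose surprisal exceeds the batch quantile has its effective advantage changed; the rest keep the shared trajectory-level value, which is the sense in which the reweighting is selective.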

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

2606.19222v1 by Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.

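Summary: We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning that causes substantially less collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging the retain splits of MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget from 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.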
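The reported significance level can be sanity-checked with an exact McNemar test, which depends only on the discordant pairs. The sketch below assumes, without confirmation from the abstract, that the drop from 45/150 to 37/150 consists of 8 items that flipped from correct to incorrect and none that flipped the other way; under that assumption the two-sided exact p-value is 2 x (1/2)^8 = 0.0078125, which rounds to the reported 0.0078.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: items correct before but wrong after; c: wrong before but correct after.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 1/2) lower tail up to the smaller count, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Assumed split of the discordant pairs; not stated in the abstract.
    print(mcnemar_exact(8, 0))
```

Other splits with the same net change, such as 9 versus 1, give a larger p-value, so the 8 versus 0 reading is at least consistent with the number quoted.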

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

2606.19220v1 by Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça

Machine Unlearning (MU) has emerged as an important technique for removing specific data points from trained models without requiring full retraining. However, most existing MU research focuses on deep learning and image data, leaving a gap in the domain of network intrusion detection, which relies heavily on tabular data. This work introduces XGBoost-Forget, an unlearning approach for the XGBoost model, to address this gap. The approach is evaluated on two tabular Network Intrusion (NI) datasets, IoT-23 and GeNIS, using multiple metrics to assess model performance, unlearning efficiency, and forgetting quality. The results show that XGBoost-Forget maintains predictive performance close to the original model while providing significantly faster unlearning, demonstrating its potential for MU in tabular NI settings.

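Summary: Machine Unlearning (MU) has emerged as an important technique for removing specific data points from trained models without full retraining. Most existing MU research, however, focuses on deep learning and image data, leaving a gap in network intrusion detection, a domain that relies heavily on tabular data. This work introduces XGBoost-Forget, an unlearning approach for the XGBoost model, to fill that gap. The approach is evaluated on two tabular Network Intrusion (NI) datasets, IoT-23 and GeNIS, using multiple metrics to assess model performance, unlearning efficiency, and forgetting quality. The results show that XGBoost-Forget keeps predictive performance close to that of the original model while providing significantly faster unlearning, demonstrating its potential for MU in tabular NI settings.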

RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering

2606.19218v1 by Pushwitha Krishnappa, Amit Das, Vinija Jain, Aman Chadha, Tathagata Mukherjee

Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity), and tell a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, which postdate every evaluated model's training cutoff. Scoring five open-source LLMs (7--10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$ and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity--discrimination tradeoff is a property of the metrics, not the models, and we argue it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0

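Summary: Automatic metrics are the default choice for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: distinguish genuine content alignment from surface coincidence (validity), and distinguish a better system from a worse one (discriminative power). On open-ended, opinion-driven question answering, the two are in tension. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a contamination-free evaluation dataset of 15,000 r/AskReddit questions (September 2025), each paired with its authentic community replies, all of which postdate the training cutoff of every evaluated model. Scoring five open-source LLMs (7--10B) against every reply, with each metric paired with a random-derangement noise floor, we find that no metric does both jobs well. Cosine similarity separates real from random answers (Cohen's $d \approx 2$) but cannot rank the five models ($|d| < 0.1$); BERTScore precision appears to rank the models (raw $|d|$ up to 0.63), but once response length is controlled this collapses to $|d| = 0.09$, and its validity is weak ($d \approx 0.8$, versus cosine's $\approx 2$). Because every metric scores the same outputs, this validity--discrimination tradeoff is a property of the metrics rather than the models, and we argue that it stems from representation design. Three independent LLM judges reproduce the validity gap and likewise separate the five models only weakly. We recommend reporting metrics on both axes, with an explicit random-baseline floor. RECOM is publicly available at https://anonymous.4open.science/r/recom-D4B0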
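The noise-floor construction mentioned here can be made concrete: score each answer against its true reply, score it again against a reply reassigned by a derangement (a permutation with no fixed points), and summarize the gap as Cohen's d. The sketch below is an assumption-laden illustration, not the RECOM pipeline; the pooled-variance form of d, the rejection sampler, and the name `validity_effect_size` are choices made for this example.

```python
import random
from statistics import mean, variance


def random_derangement(n, rng):
    """Sample a permutation with no fixed points by rejection (needs n >= 2)."""
    while True:
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != p for i, p in enumerate(perm)):
            return perm


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled ** 0.5


def validity_effect_size(metric, answers, replies, seed=0):
    """d between true-pair scores and scores against mismatched replies."""
    perm = random_derangement(len(answers), random.Random(seed))
    real = [metric(a, r) for a, r in zip(answers, replies)]
    noise = [metric(a, replies[j]) for a, j in zip(answers, perm)]
    return cohens_d(real, noise)


if __name__ == "__main__":
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

A metric with a large d on this comparison separates genuine pairs from chance pairings; whether it can also rank systems is a separate question, which is the tension the abstract describes.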

Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

2606.19199v1 by Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, Chris Develder

The recent growth of EV adoption poses challenges for power systems, including increased peak demand and potential grid instability. Smart control of EV charging -- e.g., based on reinforcement learning (RL) -- can alleviate these issues by learning temporal and contextual patterns from historical data. Yet, in real-world scenarios, key features, such as departure time, often are unavailable. This, in turn, makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. However, since these forecasting models are typically trained for accuracy (rather than their impact on a downstream agent's decision quality), their errors may propagate and hinder the overall performance of a controller that is using the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, i.e., with feedback from the charging policy actions taken by the RL agent. Such joint training of both the forecaster and controller ultimately results in higher-quality actions: our proposed DF-RL method yields superior charging decisions compared to other baselines, achieving up to a 14% improvement in total reward and a 55% reduction of unsupplied energy (i.e., charging that failed to happen because the EV already left), relative to the RL method without departure time forecasting.

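Summary: The recent growth in electric vehicle (EV) adoption poses challenges for power systems, including higher peak demand and potential grid instability. Smart control of EV charging, for example based on reinforcement learning (RL), can alleviate these issues by learning temporal and contextual patterns from historical data. In real-world scenarios, however, key features such as departure time are often unavailable, which makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. Yet because such forecasting models are typically trained for accuracy rather than for their impact on a downstream agent's decision quality, their errors may propagate and hinder the overall performance of a controller that uses the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, that is, with feedback from the charging policy actions taken by the RL agent. This joint training of forecaster and controller ultimately results in higher-quality actions: our proposed DF-RL method yields better charging decisions than other baselines, achieving up to a 14% improvement in total reward and a 55% reduction in unsupplied energy (charging that failed to happen because the EV had already left), relative to the RL method without departure time forecasting.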

Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

2606.19183v1 by Soheyl Bateni, Maryam Abdolali

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.

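Summary: Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, to information order, and by plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes the validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.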
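The middle stage of this pipeline, deterministic plausibility checks on schema-constrained features, lends itself to a short sketch. Everything specific below is an assumption: the feature names, types, and ranges are placeholders chosen for illustration (they are not clinical guidance and not the variables used by ClaMPAPP), and `validate_features` is a hypothetical name. Values that fail a check are set to `None` so that a downstream model can treat them as missing.

```python
# Hypothetical schema: feature name -> (type, min, max). The names and ranges
# are illustrative placeholders, not the variables or thresholds of ClaMPAPP.
SCHEMA = {
    "age_years": (float, 0.0, 18.0),
    "body_temperature_c": (float, 30.0, 43.0),
    "wbc_count_10e9_per_l": (float, 0.0, 100.0),
    "appendix_diameter_mm": (float, 0.0, 30.0),
}


def validate_features(extracted):
    """Keep schema-conforming, in-range values; everything else becomes None.

    Returns (clean_features, issues), where issues lists every rejected value.
    """
    clean, issues = {}, []
    for name, (kind, low, high) in SCHEMA.items():
        clean[name] = None
        value = extracted.get(name)
        if value is None:
            continue                      # feature absent from the narrative
        try:
            value = kind(value)
        except (TypeError, ValueError):
            issues.append(f"{name}: not a {kind.__name__}")
            continue
        if not low <= value <= high:      # also rejects NaN
            issues.append(f"{name}: {value} outside [{low}, {high}]")
            continue
        clean[name] = value
    unknown = sorted(set(extracted) - set(SCHEMA))
    issues.extend(f"{name}: not in schema" for name in unknown)
    return clean, issues


if __name__ == "__main__":
    llm_output = {"age_years": "9", "body_temperature_c": 380, "invented": 1}
    print(validate_features(llm_output))
```

Keeping the checks deterministic means the same extracted record is always accepted or rejected in the same way, and the returned issue list gives a record of what the LLM produced that the classifier never saw.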

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

2606.19174v1 by Fangyijie Wang, Jianjun Yu, Wentao Shi, Haixia Huang, Ran Shi, Guénolé Silvestre, Kathleen M. Curran

Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. Existing medical image platforms primarily focus on dataset labeling. They lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces to enable clinicians to perform annotation, blinded ranking, and review without local dataset downloads. The pipeline also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall's $τ$, and top-1 selection statistics. Results indicated moderate to strong agreement across experts and other groups. The blinded evaluation results showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub at https://github.com/13204942/SonoRate.

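Summary: Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging, where quantitative metrics do not always capture clinical usability. Existing medical image platforms focus mainly on dataset labeling and lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces so that clinicians can perform annotation, blinded ranking, and review without downloading datasets locally. It also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall's $τ$, and top-1 selection statistics. Results indicated moderate to strong agreement between experts and the other groups. The blinded evaluation showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub at https://github.com/13204942/SonoRate.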
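The two rank-agreement statistics named in this abstract can be computed directly from a pair of raters' rankings. The sketch below is a minimal pure-Python version that assumes complete rankings without ties; it is not the pipeline's own analysis code, and the function names and the example rankings are made up for illustration.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1       # both raters order items i and j the same way
        elif sign < 0:
            discordant += 1       # the raters disagree on the order of i and j
    pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / pairs


def spearman_rho(rank_a, rank_b):
    """Spearman's rho for tie-free rankings via the squared-difference formula."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


if __name__ == "__main__":
    # Hypothetical rankings of four models by two raters (1 = best).
    expert, generalist = [1, 2, 3, 4], [1, 3, 2, 4]
    print(kendall_tau(expert, generalist), spearman_rho(expert, generalist))
```

With tied ranks, which blinded ranking interfaces may or may not allow, the tie-corrected variants of both statistics would be needed instead.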

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

2606.19172v1 by Bojie Li

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.

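Summary: Personal memory in a language model poses two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, and a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is a per-user LoRA adapter, which does the opposite of the brain and folds content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves that text mathematically untouched and yields a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and it never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, whereas a per-user LoRA, being a single global weight delta, admits only one. At retrieval time, a per-user Engram table does not grow with the population that a retriever must search, so past roughly 100 facts it overtakes a retrieval pipeline on a 2.5x larger model.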
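Why local, hash-keyed edits leave unrelated rows untouched and compose additively can be seen in a toy table. The sketch below is not the Engram architecture: the table size, the row width, the choice to key rows on a `(user, trigger)` string, and the function names are all assumptions, and in a toy table of this size two keys can occasionally collide.

```python
import hashlib

TABLE_SIZE = 2 ** 16   # hypothetical number of hash slots
DIM = 4                # hypothetical row width


def slot(user_id, trigger):
    """Deterministic slot index for a (user, trigger) key."""
    digest = hashlib.sha256(f"{user_id}|{trigger}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE


def write_fact(table, user_id, trigger, value):
    """Add `value` to one row of the sparse table; no other row is modified."""
    idx = slot(user_id, trigger)
    row = table.setdefault(idx, [0.0] * DIM)
    for k in range(DIM):
        row[k] += value[k]
    return idx


def lookup(table, user_id, trigger):
    """Return the stored row for a key, or zeros if nothing was written there."""
    return list(table.get(slot(user_id, trigger), [0.0] * DIM))


if __name__ == "__main__":
    shared = {}
    write_fact(shared, "alice", "my dog is named", [1.0, 0.0, 0.0, 0.0])
    write_fact(shared, "bob", "my dog is named", [0.0, 1.0, 0.0, 0.0])
    print(lookup(shared, "alice", "my dog is named"))
    print(lookup(shared, "carol", "my dog is named"))
```

A single global weight delta has no analogue of this: every input passes through the same modified weights, so one user's edit is visible to all text, which is the contrast the abstract draws with per-user LoRA.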

Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition

2606.19170v1 by Shiho Matta, Yin Jou Huang, Fei Cheng, Takashi Kodama, Hirokazu Kiyomaru, Yugo Murawaki

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they have predominantly relied on smaller or non-decoder models, limiting their ability to generate open-ended text and reducing their suitability as practical L2 simulators. We identify a key challenge when scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address this, we propose a filtering method to reduce premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.

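Summary: We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). Although previous studies have explored SLA in language models, they have relied mainly on smaller or non-decoder models, which limits their ability to generate open-ended text and makes them less suitable as practical L2 simulators. We identify a key challenge in scaling models to this size: L2 contamination within the "monolingual" pretraining corpus used for L1 acquisition. To address it, we propose a filtering method that reduces premature exposure to English while preserving realistic, minimal exposure. We then fine-tune the model on LLM-generated L2-learning lessons to simulate the L2 acquisition process. Our evaluations confirm that Dango develops human-like L2 production patterns, outperforming both unfiltered and standard multilingual baselines. We release the model, data, and code to facilitate reproducible computational SLA research and learner-facing applications.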

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

2606.19168v1 by Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

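Summary: To achieve deeper safety alignment for large language models (LLMs), recent work has studied how to move safety interventions earlier, into the pretraining stage, mainly by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method that regularly inserts short safety reflections into pretraining corpora so that self-monitoring is integrated directly into language modeling, establishing a foundational capability that compatible post-training then reinforces. Experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially lowers the success rates of inference-stage and fine-tuning attacks. Alongside these real-world experiments, we introduce MedSafetyWorld, a fully controlled synthetic environment with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further show a clear advantage of Safety Reflection Pretraining over data filtering and rewriting in preventing models from acting on unsafe behaviors generalized from safe data. Taken together, our findings suggest that pretraining alignment should not only make the training data safe but also shape the behaviors that models are likely to acquire from safe data.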
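
Regular insertion of reflections into a corpus can be sketched as follows. The granularity (one reflection after every few documents), the fixed reflection text, and the function name `insert_reflections` are assumptions; the abstract does not specify how often or where reflections are placed.

```python
def insert_reflections(documents, reflection, every=3):
    """Return a new corpus with `reflection` added after every `every` documents."""
    if every < 1:
        raise ValueError("every must be >= 1")
    out = []
    for i, doc in enumerate(documents, start=1):
        out.append(doc)
        if i % every == 0:
            out.append(reflection)
    return out


corpus = insert_reflections(["doc a", "doc b", "doc c", "doc d"], "[reflection]", every=2)
```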

Essential Subspace Merging for Multi-Task Learning

2606.19164v1 by Longhua Li, Lei Qi, Xin Geng, Qi Tian

Model merging aims to enable multi-task learning by integrating the capabilities of multiple models fine-tuned from the same pre-trained checkpoint into a single model. Its core challenge is inter-task interference among task-specific parameter updates. In this paper, we analyze the output shifts induced by task updates and observe that their energy is concentrated in a small number of principal directions. We call the subspace spanned by these directions the essential subspace. In contrast, most remaining directions carry little task-relevant energy, but their accumulation across multiple task updates can cause severe interference during merging. Motivated by this observation, we propose Essential Subspace Decomposition (ESD), which decomposes each task update according to the principal components of its activation shift. Based on ESD, we introduce Essential Subspace Merging (ESM), a training-free static merging method that orthogonalizes and fuses essential components into one compact multi-task model. We further extend ESM to ESM++, a training-free dynamic merging method that decomposes task-specific residuals into low-rank experts and selects the most relevant expert through prototype-based routing during forward inference. Extensive experiments across multiple task sets and model scales demonstrate that ESM and ESM++ effectively preserve task knowledge while reducing inter-task interference.

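Summary: Model merging aims to enable multi-task learning by integrating the capabilities of multiple models fine-tuned from the same pre-trained checkpoint into a single model. Its core challenge is interference among task-specific parameter updates. We analyze the output shifts induced by task updates and observe that their energy is concentrated in a small number of principal directions. We call the subspace spanned by these directions the essential subspace. Most of the remaining directions carry little task-relevant energy, yet their accumulation across many task updates can cause severe interference during merging. Motivated by this observation, we propose Essential Subspace Decomposition (ESD), which decomposes each task update according to the principal components of its activation shift. Building on ESD, we introduce Essential Subspace Merging (ESM), a training-free static merging method that orthogonalizes the essential components and fuses them into one compact multi-task model. We further extend ESM to ESM++, a training-free dynamic merging method that decomposes task-specific residuals into low-rank experts and selects the most relevant expert through prototype-based routing during forward inference. Extensive experiments across multiple task sets and model scales show that ESM and ESM++ preserve task knowledge effectively while reducing inter-task interference.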

IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages

2606.19157v1 by Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh, Eldho Ittan George, R J Hari, Kaushal Bhogale, Mitesh M. Khapra

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.

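Summary: AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. It remains unclear, however, whether these models genuinely use such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and in native script, and adversarial prompts containing incorrect entities. Evaluating five models reveals substantial differences in context-utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.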

AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

2606.19152v1 by Zongmin Zhang, Yuyang Lou, Bowen Zhang, Junwu Chen, Ryo Kuroki, Xuan Vu Nguyen, Edvin Fako, Lixue Cheng, Philippe Schwaller

Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, yet exhaustive exploration with ab initio calculations is computationally prohibitive. Machine-learning force fields (MLFFs) accelerate structural relaxation but leave the search over the vast configurational space a major bottleneck, and open-loop large language model (LLM) agents lack a physics-grounded feedback mechanism to correct erroneous initial guesses. We propose AdsMind (Adsorption configuration discovery with Machine intelligence and relaxation feedback), a closed-loop multi-agent framework that enables autonomous error correction through MLFF relaxation feedback. Across four LLM backends, AdsMind achieves consistently high search reliability, with success rates of 100% and 98.8% on the benchmarks AA20 and OCD-GMAE62. Relative to its single-pass (1-Shot) ablation it reduces cross-backend energy dispersion, and it uses only 4.11 and 4.67 MLFF relaxations per case, respectively -- an approximately 14-fold reduction over heuristic enumeration baselines. Density functional theory (DFT) validation using VASP/PBE on six representative AA20 systems shows that the reported open-loop Adsorb-Agent outputs exhibit qualitative adsorption-energy sign errors for molecular adsorbates, whereas AdsMind preserves the correct sign in all tested cases with closer quantitative agreement. AdsMind thus delivers reliability, self-reflection, and interpretability simultaneously, supporting more DFT-informed autonomous chemistry workflows.

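Summary: Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, yet exhaustive exploration with ab initio calculations is computationally prohibitive. Machine-learning force fields (MLFFs) accelerate structural relaxation, but the search over the vast configurational space remains a major bottleneck, and open-loop large language model (LLM) agents lack a physics-grounded feedback mechanism for correcting erroneous initial guesses. We propose AdsMind (Adsorption configuration discovery with Machine intelligence and relaxation feedback), a closed-loop multi-agent framework that enables autonomous error correction through MLFF relaxation feedback. Across four LLM backends, AdsMind achieves consistently high search reliability, with success rates of 100% and 98.8% on the AA20 and OCD-GMAE62 benchmarks. Relative to its single-pass (1-Shot) ablation it reduces cross-backend energy dispersion, and it uses only 4.11 and 4.67 MLFF relaxations per case, respectively, an approximately 14-fold reduction over heuristic enumeration baselines. Density functional theory (DFT) validation with VASP/PBE on six representative AA20 systems shows that the reported open-loop Adsorb-Agent outputs exhibit qualitative adsorption-energy sign errors for molecular adsorbates, whereas AdsMind preserves the correct sign in all tested cases with closer quantitative agreement. AdsMind thus delivers reliability, self-reflection, and interpretability simultaneously, supporting more DFT-informed autonomous chemistry workflows.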

OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

2606.19145v1 by Till Richter, Niki Kilbertus

Dynamical systems are fundamental to modeling the natural world, yet modeling them involves a persistent trade-off: manually prescribed mechanistic models are interpretable by design but often overly simplistic and misspecified; in contrast, flexible data-driven neural methods lack physical insight. Hybrid modeling aims for the best of both worlds by combining a prescribed or symbolic, physics-based component with a flexible neural network. A critical challenge, however, is that the neural component may relearn mechanistic parts, yielding redundant and uninterpretable models, especially when the symbolic structure itself is discovered from data. Existing methods based on standard $L^2$ regularization rely on a projection argument that breaks when the symbolic component is learned through sparse discovery, allowing the neural augmentation to overlap with symbolic structure. We introduce OrthoReg (Orthogonal Regularization), which directly penalizes overlap between the symbolic and neural components, preventing symbolic structure from being absorbed by the neural residual. This yields a complementary decomposition: the symbolic part captures what the library can express, and the neural part captures what remains. On benchmark dynamical systems with partial library mismatch, OrthoReg improves symbolic recovery and out-of-distribution behavior.

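Summary: Dynamical systems are fundamental to modeling the natural world, yet modeling them involves a persistent trade-off: manually prescribed mechanistic models are interpretable by design but often overly simplistic and misspecified, whereas flexible data-driven neural methods lack physical insight. Hybrid modeling aims for the best of both worlds by combining a prescribed or symbolic, physics-based component with a flexible neural network. A critical challenge, however, is that the neural component may relearn the mechanistic parts, yielding redundant and uninterpretable models, especially when the symbolic structure itself is discovered from data. Existing methods based on standard $L^2$ regularization rely on a projection argument that breaks down when the symbolic component is learned through sparse discovery, which allows the neural augmentation to overlap with the symbolic structure. We introduce OrthoReg (Orthogonal Regularization), which directly penalizes overlap between the symbolic and neural components and so prevents symbolic structure from being absorbed by the neural residual. The result is a complementary decomposition: the symbolic part captures what the library can express, and the neural part captures what remains. On benchmark dynamical systems with partial library mismatch, OrthoReg improves symbolic recovery and out-of-distribution behavior.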
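
One simple way to penalize overlap between two model components is the squared cosine similarity of their outputs over a batch of sample points. This is an illustrative stand-in only: the abstract does not give OrthoReg's actual penalty, and the function name `overlap_penalty` and the `eps` stabilizer are assumptions.

```python
def overlap_penalty(symbolic_out, neural_out, eps=1e-12):
    """Squared cosine similarity between two output vectors.

    Returns 0 when the outputs are orthogonal and approaches 1 when one
    output is a scalar multiple of the other.
    """
    dot = sum(s * n for s, n in zip(symbolic_out, neural_out))
    norm_s = sum(s * s for s in symbolic_out)
    norm_n = sum(n * n for n in neural_out)
    return (dot * dot) / (norm_s * norm_n + eps)


# A hybrid loss would add this term, scaled by a weight, to the data-fit loss.
penalty = overlap_penalty([1.0, 0.0], [0.0, 1.0])
```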

2606.19144v1 by Jingyi Zhou, Senlin Luo, Haofan Chen

Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI interaction. To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy dynamics. We construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time. These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

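Summary: Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. Most existing methods, however, model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, and lack a unified framework that explains how stable social relationships and social intelligence emerge in long-term human-AI interaction. To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model that treats human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into one dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy dynamics. We construct a conversational dataset of approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. The results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time. These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than from isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.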

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

2606.19139v1 by Ramza Basharat, Muhammad Usman Ali

Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag of research is primarily due to the unique challenges posed by its script, and the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text lines dataset specifically curated from the materials written by Katibs in historical times. It encompasses a diverse range of flat nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.

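Summary: Automatic Handwritten Text Recognition (HTR) is inherently challenging, and its complexity increases further for cursive scripts. Although significant effort has gone into various cursive scripts, research on Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This gap is due mainly to the unique challenges of the script and to the scarcity and unavailability of benchmark datasets. To advance UHTR research, this study presents a specialized real dataset, the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, it is the first offline dataset of Urdu handwritten text lines curated specifically from material written by Katibs in historical times. It covers a diverse range of flat-nib writing variations in the Nastalique calligraphic style. In addition, different CRNN-based hybrid models are evaluated to identify the best architecture for Urdu Katib Handwriting Recognition (UKHR). Among the models analyzed, CNN-BGRU-CTC showed the most robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.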
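
CER and WER are both edit distances normalized by the reference length, computed over characters and over whitespace-separated words respectively. The sketch below uses the standard Levenshtein recurrence; the function names are illustrative, and it counts Unicode code points, so a script with combining marks may call for a different unit than the one a given paper uses.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```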

A Technical Taxonomy of LLM Agent Communication Protocols

2606.19135v1 by Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.

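Summary: As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. The fragmented protocol landscape nonetheless poses a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations (three empirical-to-conceptual and two conceptual-to-empirical) on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. The analysis suggests short-term convergence pressure toward protocols that unify agent-to-agent and agent-to-context (tool and data) communication. In the long term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously, and the field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.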

Equivariant Graph Neural Networks Improve Optical Spectra Prediction for Materials Screening

2606.19133v1 by Kasper Helverskov Petersen, François R J Cornet, Martin Ovesen, Mikkel Jordahn, Kristian S. Thygesen, Mikkel N. Schmidt

Scalable prediction of optical spectra is a critical component of high-throughput materials screening for optoelectronic applications such as solar cells. Existing surrogate models are trained on spectra computed from lower levels of theory or rely on rotation-invariant scalar features, limiting their geometric expressiveness. We explore the use of equivariant graph neural networks for optical spectra prediction, adapting GotenNet to this task and evaluating it on multiple datasets including a recently published collection of 10,533 structures with spectra computed at the level of the random phase approximation (RPA). The proposed model outperforms the current state of the art, with the largest gains in the 0-8 eV range and on predicting the static real permittivity, both of particular relevance for thin-film optics.

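Summary: Scalable prediction of optical spectra is a critical component of high-throughput materials screening for optoelectronic applications such as solar cells. Existing surrogate models are trained on spectra computed at lower levels of theory or rely on rotation-invariant scalar features, which limits their geometric expressiveness. We explore equivariant graph neural networks for optical spectra prediction, adapting GotenNet to the task and evaluating it on multiple datasets, including a recently published collection of 10,533 structures with spectra computed at the level of the random phase approximation (RPA). The proposed model outperforms the current state of the art, with the largest gains in the 0-8 eV range and in predicting the static real permittivity, both of particular relevance for thin-film optics.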

Written by AI, Managed by AI: Semantic Space Control and Index Sickness Elimination Across 391 Consecutive Sessions

2606.19121v1 by Hui Zhang, Shuren Song

The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs -- designing symbolic identifier systems, accumulating defensive rules in System Prompts, expanding context windows. Our engineering record shows that in long-horizon settings, this direction may produce effects contrary to design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze the failure process of these strategies. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate -- instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness," and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language carrying explicit purpose conveys far greater information quality than symbolic expression. From this, we design and validate its physical engineering mechanism: "Baseline-Log Physical Separation." In the same project, this mechanism reduced AI Instructions volume by ~75%, and across the subsequent ~150 sessions, no recurrence of Index Sickness was observed. A bilingual companion version (Chinese) is included as supplementary material.

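Summary: The prevailing engineering intuition for addressing conceptual drift in long-horizon LLM collaboration is to trade more formal constraints for more reliable outputs: designing symbolic identifier systems, accumulating defensive rules in System Prompts, and expanding context windows. Our engineering record shows that in long-horizon settings this direction may produce effects contrary to the design intent. Using action research methods in a real software project (Bang-v3) spanning approximately one month and 391 collaborative sessions, we document and analyze how these strategies fail. When the symbolic system exceeds a complexity threshold, LLMs do not become more accurate; instead, they abandon genuine understanding of business semantics, retreat to self-referential reasoning within the symbolic layer, and generate outputs that appear internally consistent but are physically disconnected from reality. We name this failure pattern "Index Sickness" and its canonical manifestation "Phantom Legislation." We name the underlying principle the "Pang Principle (Semantic Vitality Law)": natural language that carries explicit purpose conveys far greater information quality than symbolic expression. From this we design and validate a physical engineering mechanism, "Baseline-Log Physical Separation." In the same project, this mechanism reduced the volume of AI Instructions by about 75%, and no recurrence of Index Sickness was observed across the subsequent roughly 150 sessions. A bilingual companion version (Chinese) is included as supplementary material.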

Analysing drivers and interdependencies in European electricity markets using XAI

2606.19118v1 by Antoine Pesenti, Aidan O'Sullivan

Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. While deep neural networks (DNNs) have demonstrated strong predictive capabilities for electricity prices, their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses this gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We employ SHAP (SHapley Additive exPlanations) to quantify feature contributions and apply and extend SSHAP, an aggregation framework to improve interpretability in high-dimensional settings. The analysis identifies that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share in total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.

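Summary: Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. Deep neural networks (DNNs) have shown strong predictive capability for electricity prices, but their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses the gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We use SHAP (SHapley Additive exPlanations) to quantify feature contributions, and we apply and extend SSHAP, an aggregation framework that improves interpretability in high-dimensional settings. The analysis finds that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share of total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.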
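
The Shapley value behind SHAP averages a feature's marginal contribution over all subsets of the other features. The sketch below computes it exactly by enumeration for a toy value function; it does not use the SHAP library, and the feature names and contribution numbers are invented for illustration, not results from the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values; value_fn maps a frozenset of features to a number."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


# Toy additive "price model": each present feature shifts the price by a fixed amount.
TOY_EFFECT = {"gas": 30.0, "solar": -10.0, "interconnection": 5.0}


def toy_price(present):
    return sum(TOY_EFFECT[f] for f in present)


phi = shapley_values(toy_price, list(TOY_EFFECT))
```

For an additive value function like this one, each feature's Shapley value equals its own effect, which makes the toy case easy to check by hand. Exact enumeration costs time exponential in the number of features, which is why practical tools rely on approximations.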

Towards an Agent-First Web: Redesigning the Web for AI Agents

2606.19116v1 by Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

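Summary: The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This assumption permeates every layer; the access model presumes human visitors, the economics rest on human attention, and the content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates the assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction.

This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and by agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture that serves human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, and a commissioned content economy anchors AI content production in human intentionality. At the content layer, we identify epistemic recursion, the loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. To counter this threat, we propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain.

Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.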
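
Two of the mechanisms described above, agent identification metadata in requests and token-based metering, can be sketched in a few lines. The header name `X-Agent-On-Behalf-Of`, the class `TokenSubscription`, and the use of whitespace-separated words as a stand-in for model tokens are all assumptions; the abstract names no concrete fields or tokenizer.

```python
from dataclasses import dataclass


def agent_headers(agent_name: str, on_behalf_of: str) -> dict:
    """Build request headers that identify an agent and the human it acts for."""
    return {
        "User-Agent": f"{agent_name}/0.1",
        "X-Agent-On-Behalf-Of": on_behalf_of,  # hypothetical header name
    }


@dataclass
class TokenSubscription:
    """Meters served content in tokens rather than pageviews."""
    remaining_tokens: int

    def serve(self, content: str) -> bool:
        """Deduct the content's token cost; refuse if the balance is too low."""
        cost = len(content.split())  # crude stand-in for a real tokenizer
        if cost > self.remaining_tokens:
            return False
        self.remaining_tokens -= cost
        return True


sub = TokenSubscription(remaining_tokens=5)
served = sub.serve("three word page")
```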

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

2606.19111v1 by Haewoon Kwak

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

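Summary: Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, which are clean because each controller is an explicit action set rather than a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so it is the theory-derived rule, not the vocabulary, that does the work. Across four task regimes and three open-weight model families, no controller dominates on accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.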

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

2606.19103v1 by Mukund Khanna, Raj Singh Yadav, Kunal Singh

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open and closed source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md

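Summary: Recent advances in instruction-based image editing let models perform complex visual edits from natural language instructions. In product-centric scenarios, however, where preserving product features, branding, and textual elements is critical, current open and closed source models often struggle to maintain this fine-grained object identity. The problem is compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, so the task is largely treated as an implicit capability of instruction-based image editing models. In this work we introduce the ProductConsistency dataset, designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, for rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using the similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev on our dataset and show consistent improvements over baseline models in OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality; the Qwen-Image-Edit-2511 model achieves a 5x reduction in character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md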
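
The shape of a caption-similarity reward can be sketched as below. Two stand-ins are assumed: captions are passed in as plain strings (in the described setup they would come from a captioning model run on the edited image), and similarity is Jaccard overlap of lowercase word sets rather than whatever measure the paper uses. The function names are hypothetical.

```python
def caption_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two captions."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cyclic_consistency_reward(original_description: str, captions_from_edit) -> float:
    """Mean similarity between the original description and each new caption."""
    if not captions_from_edit:
        return 0.0
    scores = [caption_similarity(original_description, c) for c in captions_from_edit]
    return sum(scores) / len(scores)


reward = cyclic_consistency_reward("red acme mug", ["red acme mug", "blue acme mug"])
```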

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

2606.19079v1 by Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.

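Summary: The growing deployment of parameter-efficient fine-tuning (PEFT) has produced model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, so the system must automatically select the most appropriate adapter from a growing and heterogeneous pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter by a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or to training procedures. Evaluated primarily with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper-bound performance. Scaled to 44 tasks, it achieves 89.7% average selection accuracy without additional training or access to adapter internals.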
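
Centroid-based routing as described above reduces to a nearest-centroid lookup. In this sketch the embeddings are given as plain lists of floats, the distance is Euclidean, and centroids are simple means; the abstract does not state the embedding model, the distance measure, or how the centroids are computed, so all three are assumptions, as are the names `centroid` and `route`.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def route(query_embedding, adapter_centroids):
    """Return the adapter whose closest centroid is nearest to the query.

    adapter_centroids maps an adapter name to a list of centroid vectors.
    """
    best_name, best_dist = None, math.inf
    for name, centroids in adapter_centroids.items():
        for c in centroids:
            d = math.dist(query_embedding, c)
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name


pool = {
    "summarization": [centroid([[0.0, 0.0], [0.0, 2.0]])],
    "qa": [[10.0, 10.0]],
}
chosen = route([1.0, 1.0], pool)
```

Adding a new adapter only requires adding its centroids to the dictionary, which mirrors the training-free property the abstract emphasizes.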

Which Sections of a Research Paper Best Reveal Its Research Methods? Evidence from Library and Information Science

2606.19051v1 by Qiuyu Fang, Jiayi Hao, Chengzhi Zhang

Research methods are essential carriers of knowledge contribution in academic papers. Automatic multi-label classification of research methods can support knowledge services such as method retrieval, review generation, and research intelligence analysis. While existing studies primarily rely on titles and abstracts, abstracts often provide only limited methodological information, whereas utilizing full-text content faces challenges related to excessive length and information redundancy. Therefore, this paper proposes a segment combination strategy that partitions the full-text content according to its physical position. Using an annotated corpus of 1,954 full-text articles from three representative journals in Library and Information Science (JASIST, LISR, and JDoc), we evaluate the classification performance of various segments and their combinations across multiple models. Experimental results indicate that methodological information is distributed unevenly within the full-text content, with the middle-to-late and final segments exhibiting greater discriminative power. Furthermore, integrating bibliographic metadata with cross-segment combination strategies effectively enhances classification performance.

摘要：研究方法是學術論文中知識貢獻的重要載體。研究方法的自動多標籤分類可以支持知識服務，如方法檢索、評論生成和研究智能分析。雖然現有研究主要依賴標題和摘要，但摘要通常只提供有限的方法論信息，而利用全文內容則面臨過長和信息冗餘的挑戰。因此，本文提出了一種通過根據其物理位置劃分全文內容的段落組合策略。使用來自三本代表性圖書館與信息科學期刊（JASIST、LISR 和 JDoc）的 1,954 篇全文文章的註釋語料庫，我們評估了各種段落及其組合在多個模型中的分類性能。實驗結果表明，方法論信息在全文內容中分佈不均，中後段和最後段表現出更強的區分能力。此外，將書目元數據與跨段組合策略整合有效提升了分類性能。
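
The position-based segmentation idea can be illustrated with a short sketch. The abstract does not state how many segments are used or how boundaries are chosen, so the equal-length split into four segments below is an assumption, and `split_by_position` and `combine` are hypothetical helper names.

```python
def split_by_position(tokens, num_segments):
    """Split a token list into contiguous segments of near-equal length."""
    base, extra = divmod(len(tokens), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments absorb one leftover token each.
        end = start + base + (1 if i < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments

def combine(segments, indices):
    """Concatenate the chosen segments in document order."""
    return [tok for i in sorted(indices) for tok in segments[i]]

# Eight placeholder tokens standing in for a full-text article.
tokens = "intro background method data analysis results discussion conclusion".split()
segments = split_by_position(tokens, 4)
late_input = combine(segments, [2, 3])  # middle-to-late plus final segment
print(late_input)
```

A classifier would then receive `late_input` (optionally alongside title or other metadata) instead of the full text, which keeps the input short while retaining the positions the abstract reports as most discriminative.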

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

2606.19042v1 by Xhevahire Tërnava

In vibe coding, an emerging AI-driven paradigm, an LLM generates an entire program from a natural language prompt, but what happens to the variability that traditional software engineering carefully builds into code? To answer this question, we conducted an exploratory analysis of 10 vibe-coded C/C++ projects, which suggests that there is near-zero in-artifact variability, i.e., variability bound at compile time or run time. All variability decisions are resolved at a single new binding time, generation time, the moment the LLM produces the source code. Rather than treating this as a defect to fix, we propose Variability by Regeneration (VbR), to our knowledge the first product-line approach in which the LLM acts as the derivation engine, generating a purpose-built binary, free of dead code, for each variant from a declarative specification, while a variant dispatcher transparently routes user requests to the matching binary. We formalise VbR, contrast it with classical SPL derivation, and demonstrate its full pipeline on a wc product family. For SPL engineering, variability in AI-generated software belongs in the specification, not in the code.

摘要：在氛圍編碼這一新興的 AI 驅動範式中，一個 LLM 從自然語言提示生成整個程序，但傳統軟體工程精心構建到代碼中的變異性會發生什麼呢？為了回答這個問題，我們對 10 個氛圍編碼的 C/C++ 項目進行了探索性分析，結果顯示在工件內幾乎沒有變異性，即在編譯和運行時。所有變異性決策都在單一的新綁定時間解決，即生成時間，也就是 LLM 產生源代碼的那一刻。我們並不將此視為需要修復的缺陷，而是提出了再生變異性（Variability by Regeneration，VbR），據我們所知，這是第一個產品線方法，其中 LLM 作為推導引擎，根據聲明性規範為每個變體生成一個專門構建、沒有死代碼的二進位檔，而變體調度器則透明地將用戶請求路由到匹配的二進位檔。我們對 VbR 進行了形式化，並將其與傳統的 SPL 推導進行對比，並在 wc 產品系列上展示了其完整的管道。對於 SPL 工程而言，AI 生成軟體中的變異性應該在規範中，而不是在代碼中。

A Hybrid LSTM-Vision Transformer Architecture for Predicting HRRR Forecast Errors

2606.19026v1 by David Aaron Evans, Jay C. Rothenberger, Kara J. Sulia, Nick P. Bassill, Chris D. Thorncroft

Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric phenomena. Previous work demonstrated that Long Short-Term Memory (LSTM) networks can successfully predict forecast errors in the High-Resolution Rapid Refresh (HRRR) model using mesonet observations, but we believe performance degradation is linked to periods of complex vertical atmospheric evolution. To address this limitation, we develop a hybrid LSTM-Vision Transformer (LSTM-ViT) framework that combines temporal sequence learning from surface observations with atmospheric profiles from the New York State Mesonet profiler network. The LSTM-ViT framework is trained to predict HRRR hourly precipitation, 10 m wind speed, and 2 m temperature forecast errors at individual mesonet stations. Across all three predictors, incorporation of profiler-derived atmospheric structure improves forecast error prediction skill relative to the baseline LSTM architecture, with the largest gains occurring at shorter forecast lead times and during periods of enhanced PBL activity. Improvements are particularly pronounced for precipitation forecast error, where the LSTM-ViT framework achieves approximately a twofold increase in predictive skill relative to the baseline LSTM while better capturing convectively driven error evolution and reducing degradation associated with PBL processes. These results demonstrate that combining temporal sequence learning with vertically informed attention mechanisms provides a physically meaningful pathway for improving forecast error prediction in operational NWP systems. Our research offers forecasters enhanced guidance regarding model bias and forecast confidence.

摘要：高解析度數值天氣預報（NWP）系統中的預報誤差通常與未解決的行星邊界層（PBL）過程、對流、地形引起的環流以及其他垂直結構的大氣現象有關。先前的研究顯示，長短期記憶（LSTM）網絡可以成功預測高解析度快速刷新（HRRR）模型中的預報誤差，使用的是中尺度網絡觀測數據，但我們認為性能下降與複雜的垂直大氣演變期間有關。為了解決這一限制，我們開發了一個混合LSTM-視覺Transformer（LSTM-ViT）框架，該框架將來自地面觀測的時間序列學習與來自紐約州中尺度剖面網的氣象剖面相結合。LSTM-ViT框架被訓練以預測HRRR每小時降水、10米風速和2米溫度的預報誤差，針對各個中尺度站。對於所有三個預測因子，納入剖面導出的氣象結構相較於基線LSTM架構提高了預報誤差預測技能，最大的增益發生在較短的預報提前期和增強的PBL活動期間。降水預報誤差的改善尤為明顯，LSTM-ViT框架相較於基線LSTM實現了約兩倍的預測技能提升，同時更好地捕捉了對流驅動的誤差演變並減少了與PBL過程相關的降解。這些結果表明，將時間序列學習與垂直信息的注意機制相結合，為改善運營NWP系統中的預報誤差預測提供了一條物理上有意義的途徑。我們的研究為預報員提供了有關模型偏差和預報信心的增強指導。

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

2606.19025v1 by Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob, Andrej Jovanovic, Yan Gao, Wanru Zhao, Nicholas D. Lane

Pre-training Large Language Models (LLMs) typically demands large-scale infrastructure with tightly coupled hardware accelerators. While increasing model and dataset scale remains the dominant driver of performance, Mixture-of-Experts (MoE) architectures have recently achieved state-of-the-art results by decoupling parameter count from computational cost. This efficiency enables training massive models on constrained compute budgets, yet it typically requires the high-speed interconnects of a single datacenter. To overcome these physical limits, recent approaches such as DiLoCo and Photon use low-communication data-parallel methods to enable scaling across geographically distributed, weakly connected data centers. However, these methods suffer from a fundamental inefficiency: they require full model replicas at every site, which imposes prohibitive memory constraints and communication overheads. In this work, we introduce FoMoE, a system that breaks the full-replica paradigm by partitioning expert layers across workers. We demonstrate that FoMoE: (I) reduces communication costs by up to 1.42x over efficient baselines and 45.44x over DDP via partial expert replication in the studied regimes; (II) achieves empirical throughput speedups of up to 1.4x through a novel skip-token mechanism; and (III) shows stable routing in the trained proxy regimes and projects the communication/memory benefits to 100B-scale configurations through system modelling.

摘要：預訓練大型語言模型（LLMs）通常需要大規模的基礎設施，並且硬體加速器緊密耦合。雖然增加模型和數據集的規模仍然是性能的主要驅動因素，但混合專家（MoEs）架構最近通過將參數數量與計算成本解耦，實現了最先進的結果。這種效率使得在受限的計算預算上訓練大型模型成為可能，但通常需要單一數據中心的高速互連。為了克服這些物理限制，最近的研究方法如DiLoCo和Photon使用低通信數據並行方法來實現跨地理分佈、弱連接數據中心的擴展。然而，這些方法存在根本性的低效：它們要求每個站點都有完整的模型副本，這會帶來禁止性的內存限制和通信開銷。在這項工作中，我們介紹了FoMoE，一個通過在工作者之間分區專家層來打破完整副本範式的系統。我們展示了FoMoE：（I）在研究的範疇內，通過部分專家複製將通信成本降低高達1.42倍，相較於高效基準和DDP降低高達45.44倍；（II）通過一種新穎的跳過令牌機制實現高達1.4倍的經驗吞吐量加速；以及（III）在訓練的代理範疇中顯示穩定的路由，並通過系統建模將通信/內存效益預測到100B規模的配置。

Sumi: Open Uniform Diffusion Language Model from Scratch

2606.19005v1 by Mengyu Ye, Keito Kudo, Wataru Ikeda, Ryosuke Matsuda, Keisuke Sakaguchi, Jun Suzuki

Diffusion models have become a promising alternative to autoregressive models. Among these, uniform diffusion language models (UDLMs) permit any token to be updated at any step, in principle enabling more flexible generation. However, no UDLM has yet been pretrained from scratch at both large parameter scale and large token budget. Both autoregressive modeling and masked diffusion modeling already have capable models at scale that the community can study and build on; uniform diffusion has none. A scratch-pretrained UDLM at scale would provide a clean reference point for studying scaling behavior, generation dynamics, controllability, and trade-offs against established autoregressive and masked diffusion models. To this end, we introduce Sumi ("ink" in Japanese), a fully open 7B uniform diffusion language model pretrained from scratch on 1.5T tokens. Sumi performs competitively with autoregressive models trained at comparable token budgets on knowledge, reasoning, and coding benchmarks, while under-performing on commonsense benchmarks, where our education-heavy data mixture is a likely contributor. We release our model weights, checkpoints, and full training recipe, including a complete specification of the data mixture over publicly available corpora. We hope this release enables the community to study native uniform diffusion at scale and catalyzes work on its as-yet poorly understood aspects.

摘要：擴散模型已成為自回歸模型的一個有前途的替代方案。在這些模型中，均勻擴散語言模型（UDLMs）原則上允許在任何步驟更新任何標記，從而實現更靈活的生成。然而，目前尚未有任何UDLM從零開始在大參數規模和大標記預算下進行預訓練。自回歸建模和遮罩擴散建模已經有可用的模型在規模上供社群研究和構建；而均勻擴散則沒有。大規模的從零開始預訓練的UDLM將提供一個乾淨的參考點，以研究擴展行為、生成動態、可控性，以及與已建立的自回歸和遮罩擴散模型之間的權衡。為此，我們介紹Sumi（在日語中意為“墨水”），這是一個完全開放的7B均勻擴散語言模型，從零開始在1.5T標記上進行預訓練。Sumi在知識、推理和編碼基準上與在相似標記預算下訓練的自回歸模型表現競爭，但在常識基準上表現不佳，其中我們以教育為重的數據混合可能是主要原因。我們釋出我們的模型權重、檢查點和完整的訓練配方，包括對公開可用語料庫的數據混合的完整規範。我們希望這次釋出能使社群能夠在大規模上研究原生均勻擴散，並促進對其尚未充分理解的方面的研究。

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

2606.19004v1 by Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights: (1) exploration can tolerate stale model weights, because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training; and (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score 4× faster than baselines, reducing total cost by 1.4-6.4× while achieving superior image quality on the DeepSeek-OCR and Geneval datasets at resolutions of 512×512 and 1280×1280.

摘要：強化學習（RL）在擴散Transformer（DiTs）後訓練中的應用成本極高，需要數千個高端GPU。現有的研究探索了兩個方向來降低成本：種子探索通過選擇高對比度樣本來改善訓練收斂，但這會增加關鍵路徑的計算；而臨時GPU提供69--77\%的成本降低，但在訓練期間閒置，因為DiT的展開幾乎同時完成，這妨礙了類似大型語言模型（LLM）的展開與訓練的流水線作業。臨時中斷進一步打破了序列並行（SP）組，造成GPU拓撲的碎片化。
我們提出了Spotlight，第一個利用臨時GPU進行DiT RL後訓練的系統。Spotlight基於我們設計的兩個關鍵見解：（1）我們顯示探索可以容忍過時的模型權重，因為使用前一迭代的模型權重進行的探索保持了隨機種子的相對排名，允許探索在訓練期間在閒置的臨時GPU上運行。（2）SP重配置可以重用節點上的狀態，將組恢復的時間從幾分鐘縮短到亞秒級啟動。基於這些見解，Spotlight引入了三種技術：一種基於賭徒的探索規劃器，最大化訓練時間預算內的獎勵方差；一種彈性序列並行技術，通過持久調度器和節點內權重複製即時重配置SP組；以及一種考慮中斷的基於請求的拉取調度器，平衡負載並在中斷時提交正在進行的狀態。我們在開源RL平台ROLL上實現了Spotlight，並在Qwen-Image後訓練中進行評估。Spotlight以比基準快$4\times$的速度達到相同的目標驗證分數，將總成本降低$1.4$-$6.4\times$，同時在DeepSeek-OCR和Geneval數據集上實現了優越的圖像質量，分辨率為$512\times512$和$1280\times1280$。

Enhancing Multilingual Reasoning via Steerable Model Merging

2606.19002v1 by Zhuoran Li, Rui Xu, Jian Yang, Junnan Liu, Zhijun Chen, Qianren Mao, Hongcheng Guo, Jiaheng Liu, Likang Xiao, Ming Li, Xiaojie Wang

Model merging is an effective technique for composing the capabilities of a multilingual model and a reasoning model. It has achieved promising generalization in multilingual reasoning tasks by aligning feature spaces of different models. However, the merged single model often fails to address the conflicts between source models, leading to suboptimal performance. In other words, the one-size-fits-all merging strategy may not align with the characteristics of different inputs which may require prioritizing certain models over others. To this end, we propose a Steerable Model Merging (ST-Merge) framework to modulate the contribution of each source model. To realize this idea, we introduce a gated cross-attention mechanism to weight or filter the two attended source models in an adaptive manner. Extensive experiments demonstrate that ST-Merge consistently outperforms multiple strong baselines on four multilingual reasoning benchmarks across 21 different languages.

摘要：模型合併是一種有效的技術，用於組合多語言模型和推理模型的能力。它通過對齊不同模型的特徵空間，在多語言推理任務中實現了令人鼓舞的泛化效果。然而，合併後的單一模型往往無法解決源模型之間的衝突，導致次優的性能。換句話說，通用的合併策略可能無法與不同輸入的特徵對齊，這可能需要優先考慮某些模型而非其他模型。為此，我們提出了一個可調整的模型合併（ST-Merge）框架，以調節每個源模型的貢獻。為了實現這一思想，我們引入了一種門控交叉注意機制，以自適應的方式對兩個被關注的源模型進行加權或過濾。大量實驗表明，ST-Merge在21種不同語言的四個多語言推理基準上，始終優於多個強基準。

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

2606.18996v1 by Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.

摘要：代理人越來越多地被部署在文件密集型工作流程中，在這些工作流程中，敏感的私人信息不是邊緣案例，而是日常輸入，例如，代理人預訂航班需要護照號碼。在這種情況下，代理人必須使用私人信息來準確完成任務，同時在其回應中永遠不暴露這些信息，因為它無法驗證誰實際在鍵盤上。這兩項義務之間存在根本的緊張關係。一個足夠強大的模型能夠使用私人信息來完成任務，但同樣的能力也可能使其透露這些信息。為了評估任務準確性和隱私洩漏之間的權衡，我們引入了任務完成和抵抗主動隱私提取（TRAP）。每個場景都包含一份包含私人信息的文件、一個要求代理人使用私人字段調用正確工具的任務查詢，以及一個試圖以自然語言引出相同信息的攻擊查詢。我們評估了22個涵蓋前沿專有模型和多種規模的開源模型，發現所有模型系列都表現出非平凡的洩漏，並且遵循指令的能力與洩漏率相關。現有的基於提示的防禦措施減少了洩漏，但對任務準確性造成了重大損失。提示優化未能逃避這一權衡。我們證明這一失敗並非偶然。對於任何基於softmax的模型，沒有任何軟約束防禦，例如基於提示的防禦，能夠同時實現高任務成功率和零洩漏概率。受到這一不可能結果的激勵，我們提出結構性私人字段隔離，該方法在私人字段到達模型之前用哈希鍵替換它們。這種方法在保持任務準確性的同時，大大防止了洩漏。
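
The structural private field isolation idea (swap private values for opaque keys before the model sees the document, then restore them outside the model when a tool is invoked) can be sketched as follows. This is an illustrative assumption about how such a layer might look, not the paper's implementation; `isolate`, `resolve_tool_args`, the `PF_` key prefix, and the example document are all hypothetical.

```python
import hashlib

def isolate(document, private_fields):
    """Replace private values with opaque keys; return the redacted document and a vault."""
    redacted, vault = dict(document), {}
    for field in private_fields:
        value = document[field]
        # Key derived from a SHA-256 digest of field name and value (truncated for brevity).
        key = "PF_" + hashlib.sha256(f"{field}:{value}".encode()).hexdigest()[:12]
        vault[key] = value
        redacted[field] = key
    return redacted, vault

def resolve_tool_args(args, vault):
    """Swap keys back to real values outside the model, just before the tool runs."""
    return {name: vault.get(value, value) for name, value in args.items()}

doc = {"name": "A. Traveler", "passport_number": "X1234567", "destination": "Seoul"}
redacted, vault = isolate(doc, ["passport_number"])

# The model only ever sees `redacted`; it can still pass the key into a tool call.
model_tool_call = {"passport": redacted["passport_number"], "to": redacted["destination"]}
print(resolve_tool_args(model_tool_call, vault))
```

Because the raw value never enters the model's context, an attack query can at most elicit the key. One caveat of this particular sketch: an unsalted hash of a low-entropy field can be brute-forced, so a keyed hash or a random token would be a safer choice in practice.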

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

2606.18989v1 by Fengying Ye, Yanming Sun, Runzhe Zhan, Zheqi Zhang, Lidia S. Chao, Derek F. Wong

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

摘要：成語在不同語言之間轉換困難，因為它們不具組成性且表面形式的基礎較弱，使得字面映射不可靠。我們提出了 G-IdiomAlign，一個以詞彙為樞紐的基準，每個成語都由來自 Wiktionary 的英文詞彙作為支撐。我們進一步構建了一個高信心的參考對齊集，以便於可重複的評估。G-IdiomAlign 支持兩種協議：（1）一種受控的多選成語等價，帶有類型干擾項以便於錯誤歸因；以及（2）一種詞彙對比生成，對比無詞彙和有詞彙的輸入，以隔離明確語義樞紐的效果。在多種大型語言模型中，對字面翻譯的偏見是一種主要的失敗模式，特別是當目標是低資源語言時。在基於嵌入的語義代理下，詞彙在詞彙對比生成中始終能改善表現，但性能仍然適中，顯示出開放輸出空間中有相當大的提升空間。對 Qwen3-8B 的後續分析進一步表明，跨條件差異更多集中在注意力頭而非層次中，而更好的有詞彙生成與更強的詞彙支撐相吻合。

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

2606.18988v1 by Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

摘要：多模態欺騙檢測對於識別欺詐意圖至關重要，但現有的方法主要依賴於端到端的黑箱範式。這些方法缺乏可解釋性，未能提供透明的推理過程，並且難以明確捕捉到欺騙行為中固有的微妙跨模態不一致性。為了超越這些限制，我們提出了ThinkDeception，一個新穎且可解釋的多模態欺騙檢測框架。作為一項開創性工作，它將多模態大型語言模型（MLLMs）引入這一領域，將欺騙檢測從傳統的二元分類任務轉變為明確的認知推理過程。在首個經過精心註釋的逐步多模態思維鏈（CoT）數據集的支持下，我們開發了一個基礎模型ThinkDeception Base，實證驗證了模態不一致性在解碼欺騙中的關鍵作用。在此基礎上，我們的核心創新在於提出了視覺-音頻一致性群體相對策略優化（VAC--GRPO），並配備了漸進式訓練策略。與標準的GRPO不同，我們將訓練數據分為四個漸進的難度層次，引導模型通過心理學基礎的由易到難的認知過渡。通過創新性地將這一動態課程調度器與多維度、過程感知的獎勵機制和反思學習範式相結合，我們顯著提升了模型的整體推理質量。在主流基準上的廣泛實驗表明，ThinkDeception建立了新的SOTA，在檢測準確性和推理質量上顯著超越現有方法。最終，這項工作成功地推動了欺騙檢測領域朝向可解釋的多模態認知推理發展。

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

2606.18986v1 by Yafeng Wu, Huu Hiep Nguyen, Thin Nguyen, Hung Le

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLMs suffers from a tokenization bottleneck: Byte Pair Encoding fragments continuous values into unstable tokens whose embeddings lack meaningful metric structure, resulting in the loss of magnitude, scale, and trend information. Prior methods use patch-based encoders that split the series into fixed windows; this locks in a single granularity that breaks patterns and hides exact timesteps, and it relies on a separate module that rarely transfers across datasets with different lengths or sampling rates. To address this challenge, we propose CADE (Contrastive Alignment with Direct Embedding), a novel framework for TSQA built upon two key components: direct timestep embedding and semantic alignment. The proposed framework maps each timestep directly into the LLM embedding space through a point-wise linear encoder and MLP projector, preserving exact index-level access while eliminating the need for patching and padding. To further bridge the semantic gap between time-series and language representations, we introduce a novel one-directional supervised contrastive loss that aligns time-series embeddings with frozen class-name text anchors. Experimental results on the public Time-MQA benchmark demonstrate that our framework consistently improves performance across six TSQA tasks, outperforming both open-source and proprietary LLM baselines.

摘要：最近在大型語言模型（LLMs）方面的進展催生了時間序列問答（TSQA），將時間序列分析表述為自然語言問答。然而，直接將原始數值序列輸入LLMs會遭遇標記化瓶頸：字節對編碼將連續值分割為不穩定的標記，其嵌入缺乏有意義的度量結構，導致幅度、尺度和趨勢信息的喪失。先前的方法使用基於補丁的編碼器，將序列拆分為固定窗口，鎖定一種粒度，這會破壞模式並隱藏確切的時間步，通過一個很少在不同長度或取樣率的數據集之間轉移的單獨模塊。為了解決這一挑戰，我們提出了CADE（對比對齊與直接嵌入），這是一個基於兩個關鍵組件的TSQA新框架：直接時間步嵌入和語義對齊。所提出的框架通過逐點線性編碼器和MLP投影器將每個時間步直接映射到LLM嵌入空間，保留精確的索引級別訪問，同時消除補丁和填充的需求。為了進一步縮小時間序列與語言表示之間的語義差距，我們引入了一種新型的單向監督對比損失，將時間序列嵌入與固定的類別名稱文本錨點對齊。在公共Time-MQA基準上的實驗結果表明，我們的框架在六個TSQA任務中始終提高了性能，超越了開源和專有LLM基準。
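
One plausible reading of the "one-directional supervised contrastive loss" is a softmax over frozen text anchors, evaluated only in the series-to-text direction. The sketch below follows that reading and should not be taken as CADE's exact formulation: cosine similarity, the temperature of 0.1, and the function names are assumptions, and the 2-D vectors are toy stand-ins for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def one_directional_loss(series_embs, labels, text_anchors, temperature=0.1):
    """Mean of -log softmax over anchors, taken at each sample's own class anchor.

    Only series embeddings would receive gradients; the anchors are treated as
    frozen, so there is no text-to-series term.
    """
    total = 0.0
    for emb, label in zip(series_embs, labels):
        logits = [cosine(emb, anchor) / temperature for anchor in text_anchors]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[label]
    return total / len(series_embs)

anchors = [[1.0, 0.0], [0.0, 1.0]]   # frozen class-name text embeddings (toy)
series = [[0.9, 0.1], [0.2, 0.8]]    # time-series embeddings (toy)
print(one_directional_loss(series, [0, 1], anchors))
```

The loss shrinks as each series embedding moves toward the anchor of its own class and away from the others, which is the alignment behaviour the abstract describes.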

Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment

2606.18979v1 by Franziska Braun, Christopher Witzl, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Tobias Bocklet, Korbinian Riedhammer

Early detection of cognitive impairment relies on neuropsychological tests to minimize subjectivity by assessing multiple cognitive domains. Speech-based evaluation can support diagnostics and improve accessibility, but transcription errors and the omission of nonverbal subtests (e.g., motor skills) limit accuracy. Beyond conventional test scores, speech-derived features can provide additional insights into cognitive status. This study investigates the speech-based evaluation of the German "Syndrom-Kurz-Test," a standardized dementia screening test comprising verbal and motor subtests. We train models that integrate transcript-derived scores and Whisper embeddings per verbal subtest to reduce scoring errors. To compensate for missing motor subtests, we then leverage these fused representations to approximate expert overall ratings. Despite omitting subtests, our models strongly correlate with expert ratings and efficiently and accurately discriminate between cognitive status groups.

摘要：早期檢測認知障礙依賴神經心理測試，通過評估多個認知領域來最小化主觀性。基於語音的評估可以支持診斷並改善可及性，但轉錄錯誤和省略非語言子測試（例如，運動技能）限制了準確性。超越傳統測試分數，語音衍生的特徵可以提供對認知狀態的額外見解。本研究調查德國的「Syndrom-Kurz-Test」的語音評估，這是一個標準化的癡呆篩查測試，包含口語和運動子測試。我們訓練模型，整合每個口語子測試的轉錄衍生分數和Whisper嵌入，以減少計分錯誤。為了彌補缺失的運動子測試，我們然後利用這些融合的表示來近似專家的整體評分。儘管省略了子測試，我們的模型與專家評分強烈相關，並有效且準確地區分認知狀態組。

CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System

2606.18976v1 by Marco Becattini, Niccolò Caselli, Matteo Minin, Roberto Verdecchia, Enrico Vicario

Automated assessment in software engineering education has advanced significantly for code grading and essay scoring. However, reviewing software architecture deliverables, which requires analyzing structural completeness and requirements traceability, has not yet been fully automated. Applying Large Language Models (LLMs) to this task requires robust architectures to ensure technical feedback is accurate and reliable for students. This paper presents CAPRA (Configurable Architecture Proficiency Report Assessment), a multi-agent LLM system that analyzes software architecture deliverables to generate personalized, template-compliant LaTeX feedback. As a core design choice, CAPRA coordinates multiple specialized agents and employs a Python-based microservice for multi-modal document extraction, utilizing PyMuPDF and vision-enabled LLMs (specifically gpt-4o) to parse text and UML diagrams. To ensure educational reliability and mitigate hallucinations, CAPRA introduces a deterministic Evidence Anchoring step using fuzzy matching via normalized Levenshtein distance, along with a ConsistencyManager agent that cross-verifies, deduplicates, and merges findings. System performance is assessed using a structured eight-criterion binary evaluation taxonomy covering: (i) extraction completeness, (ii) feature validation, (iii) issue grounding and severity detection, (iv) recommendation specificity and traceability, and (v) template and tone compliance. A preliminary empirical evaluation on 10 student reports shows that CAPRA satisfied 88.8% of the evaluated criteria under a strict two-rater aggregation rule, achieved moderate inter-rater agreement with human evaluators (kappa = 0.582), and processed each report in slightly over 4 minutes. While these results support the viability of LLM-supported architectural feedback, human oversight remains essential for subjective assessment dimensions.

摘要：自動化評估在軟體工程教育中已經在程式碼評分和論文評分方面取得了顯著進展。然而，對於軟體架構交付物的審查，這需要分析結構完整性和需求可追溯性，尚未完全自動化。將大型語言模型（LLMs）應用於此任務需要穩健的架構，以確保技術反饋對學生來說準確且可靠。本文介紹了CAPRA（可配置架構能力報告評估），這是一個多代理LLM系統，分析軟體架構交付物以生成個性化的、符合模板的LaTeX反饋。作為核心設計選擇，CAPRA協調多個專門代理，並採用基於Python的微服務進行多模態文檔提取，利用PyMuPDF和具視覺能力的LLMs（具體為gpt-4o）來解析文本和UML圖。為了確保教育可靠性並減少幻覺，CAPRA引入了一個確定性的證據錨定步驟，通過標準化的Levenshtein距離進行模糊匹配，以及一個ConsistencyManager代理，該代理交叉驗證、去重和合併發現。系統性能使用一個結構化的八項標準二元評估分類法進行評估，涵蓋：（i）提取完整性，(ii) 特徵驗證，(iii) 問題基礎和嚴重性檢測，(iv) 建議的具體性和可追溯性，以及 (v) 模板和語調的合規性。對10份學生報告的初步實證評估顯示，CAPRA在嚴格的雙評審聚合規則下滿足了88.8%的評估標準，與人類評估者達成了中等的評審一致性（kappa = 0.582），並在略超過4分鐘內處理每份報告。雖然這些結果支持LLM支持的架構反饋的可行性，但人類監督對於主觀評估維度仍然至關重要。
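
The Evidence Anchoring step, fuzzy matching via normalized Levenshtein distance, can be sketched as below. This is a minimal illustration under stated assumptions, not CAPRA's code: the 0.8 acceptance threshold, sentence-level matching, case folding, and the names `similarity` and `anchor` are all hypothetical choices.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """1 minus the length-normalized distance, in [0, 1]; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def anchor(quote, source_sentences, threshold=0.8):
    """Return the best-matching source sentence, or None if nothing is close enough."""
    best = max(source_sentences, key=lambda s: similarity(quote.lower(), s.lower()))
    return best if similarity(quote.lower(), best.lower()) >= threshold else None

sentences = [
    "The system uses a layered architecture.",
    "Requirements are traced to components.",
]
print(anchor("The system uses a layerd architecture", sentences))
print(anchor("The database is sharded by region", sentences))
```

Because the check is deterministic, a finding whose quoted evidence cannot be anchored to the source document can be dropped or flagged without another LLM call, which is one way such a step could mitigate hallucinated citations.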

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

2606.18970v1 by Syed Mujtaba Haider, Silvia Figini

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

摘要：醫學影像分類常受到有限標記數據的限制，這促使了生成增強的需求；最近，為此目的提出了量子生成模型，並經常報告準確性提升。然而，這些說法通常基於單次訓練運行，未能匹配量子和經典生成器的參數預算，且未能描述任何好處出現的數據範疇。我們提出了一個受控基準，隔離量子生成器對腦部MRI增強的貢獻。影像被編碼進入一個KL正則化的潛在空間，在該空間中，使用變分量子生成器或參數數量幾乎相同的經典生成器（1648對1632）訓練一個帶有梯度懲罰的條件Wasserstein GAN。合成樣本被解碼並用於增強一個預訓練的分類器，涵蓋從5%到100%的標記數據比例，並在八個隨機種子上進行評估，使用配對顯著性測試（帶有多重比較修正）以及內部集多樣性和潛在分佈分析。在所有比例中，沒有增強變體顯著超越僅使用真實數據的訓練，且量子和經典生成器在統計上無法區分。任何低數據的好處表現為正則化，而非忠實的數據擴展：合成樣本在數據稀缺的地方偏離分佈並嚴重模式崩潰，且量子生成器的多樣性不比其經典對應物更高。我們釋放該協議作為醫學影像中量子生成增強嚴格評估的測試平台。
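
The evaluation protocol (paired comparison over eight seeds, with multiple-comparison correction) can be sketched as follows. The abstract does not name the test or the correction, so the exact sign-flip permutation test and the Holm step-down procedure used here are assumptions, and the accuracy values are made-up numbers for illustration only, not results from the paper.

```python
from itertools import product

def paired_signflip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = total = 0
    # Enumerate every sign assignment (2**n of them; 256 for eight seeds).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            count += 1
    return count / total

def holm_reject(pvalues, alpha=0.05):
    """Holm step-down procedure: returns one reject flag per hypothesis."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    reject = [False] * len(pvalues)
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (len(pvalues) - rank):
            reject[i] = True
        else:
            break  # once one hypothesis is retained, all larger p-values are too
    return reject

# Illustrative per-seed accuracies (hypothetical).
real_only = [0.81, 0.79, 0.80, 0.82, 0.78, 0.80, 0.81, 0.79]
augmented = [0.80, 0.80, 0.81, 0.81, 0.79, 0.79, 0.82, 0.79]
p = paired_signflip_pvalue(augmented, real_only)
print(p, holm_reject([p, 0.20, 0.03]))
```

With eight seeds the smallest attainable two-sided p-value under this test is 2/256, so a design like this can only declare significance when nearly every seed moves in the same direction.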

GraphPO: Graph-based Policy Optimization for Reasoning Models

2606.18954v1 by Yuliang Zhan, Xinyu Tang, Jian Li, Dandan Zheng, Weilong Chai, Jingdong Chen, Jun Zhou, Ge Wu, Wenyue Tang, Hao Sun

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using rewards from final answers. This paradigm has two limitations. First, independently sampled responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.

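Summary: Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for improving large reasoning models. RLVR usually samples responses independently and optimizes the policy using final answers. This paradigm has two limitations. First, independent responses often contain similar intermediate reasoning steps, leading to redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address the problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently: when different branches reach similar reasoning states, they cannot share information and repeat similar exploration. In addition, tree-based methods ignore this dispersion and compare only locally within separate branches, which can raise the variance of advantage estimation. To address this, we propose GraphPO (Graph-based Policy Optimization), a new RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes so that they share suffixes, and reallocates budget from redundant expansions to diverse exploration. We also assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, improving inference efficiency while deriving process supervision from outcomes. Theory shows that GraphPO reduces advantage-estimation variance and improves reasoning efficiency. Experiments with three LLMs on reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines under the same token or response budgets.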

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

2606.18950v1 by San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategies, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed to their pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than existing testbeds. The proposed benchmark provides (i) evaluation through diverse gameplay across various matchup structures, (ii) diagnostic assessment via mini-games, each targeting an individual strategic competency, and (iii) extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games and improves over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units through a finite-state machine (FSM) with agentic memory. We empirically validate that multiple state-of-the-art VLMs perform poorly when matchups demand tighter coordination, when multi-agent coordination is required, and when task scale increases.

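Summary: Modern vision-language models (VLMs) often struggle with strategic reasoning, that is, anticipating and influencing other agents' actions in competitive and cooperative settings, especially under uncertainty. Real-time strategy (RTS) games can serve as a natural testbed for diagnosing this limitation, because they require coordination with allies, adaptation to opponents' strategies, and long-horizon planning under partial observability. However, existing RTS benchmarks offer a limited evaluation scope, lack systematic competency diagnosis, and stay fixed to their pre-designed scenario coverage. To address these limitations, we present RTSGameBench, built on Beyond All Reason, a large-scale RTS game whose expanded battlefield demands broader strategy diversity than existing testbeds. The benchmark provides evaluation through diverse gameplay across various matchup structures, diagnostic assessment through mini-games that each target an individual strategic competency, and extensible coverage through a self-evolving generation framework that converts free-form queries into new mini-games and improves over successive cycles. In addition, so that VLMs can operate in large-scale RTS games, we provide RTSGameAgent, which manages units through a finite-state machine (FSM) with agentic memory. We empirically validate that multiple state-of-the-art VLMs perform poorly when matchups demand tighter coordination, when multi-agent coordination is required, and when task scale increases.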

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

2606.18947v1 by Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.

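Summary: Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding out of the reasoning model through an MCP-compatible gateway and exposes provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on the recency-sensitive FreshQA, but DSG offers a stronger frontier when control matters: on SimpleQA its accuracy nearly matches native search (86.1% vs. 87.7%) at 91% lower search cost, it preserves concise answer contracts, and it reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary rather than a fixed model feature.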

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

2606.18946v1 by Jingkun Luo, Yifan Sun, Da-Tian Peng, Guanxiong Pei

Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, where humans and LLMs co-author one text, faces two gaps: existing methods classify each sentence in isolation, discarding inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents over PubMed and XSum, generated by DeepSeek-V3.2 and Kimi K2 under stringent quality controls including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over the document sentence sequence and instantiate it as SenFlow, integrating graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with a +4.15 pp average Macro-F1 margin on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors still exploit. Code and data: https://github.com/luojingkun22/SenFlow

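Summary: Sentence-level AI-generated text detection (S-AGTD) for hybrid documents, i.e., texts co-authored by humans and LLMs, faces two gaps: existing methods classify each sentence in isolation and ignore inter-sentence dependencies, and existing benchmarks omit the newest generation of generators. We construct MOSAIC, a benchmark of 16,000 hybrid documents drawn from PubMed and XSum and generated by DeepSeek-V3.2 and Kimi K2 under strict quality controls, including a perplexity-consistency filter absent from prior benchmarks. We recast S-AGTD as structured prediction over a document's sentence sequence and instantiate it as SenFlow, which combines graph-based inter-sentence propagation with linear-chain CRF decoding in a single document-level pass over a sentence graph. SenFlow reaches state-of-the-art performance on MOSAIC, with an average Macro-F1 margin of +4.15 percentage points on cross-domain transfer, the hardest of three protocols of increasing difficulty. We further find that even after the perplexity filter equalizes overt cues, AI insertions retain a generator-dependent sentence-length gap that sentence-level detectors can still exploit. Code and data: https://github.com/luojingkun22/SenFlow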
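The linear-chain CRF decoding step mentioned in the SenFlow abstract can be illustrated with plain Viterbi decoding over per-sentence label scores. This is a minimal sketch, not the authors' implementation: the function name, the two-label setup (0 = human, 1 = AI), and the score values are all assumptions chosen for illustration, and the graph-based propagation that would produce the emission scores is not shown.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence for one document.

    emissions[t][k]   : score of label k for sentence t (hypothetical values)
    transitions[j][k] : score of moving from label j to label k
    """
    num_labels = len(transitions)
    # best[k] = best score of any label path ending in label k at the current sentence
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for k in range(num_labels):
            # Choose the previous label that maximizes the path score into label k.
            prev = max(range(num_labels), key=lambda j: best[j] + transitions[j][k])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][k] + emissions[t][k])
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    label = max(range(num_labels), key=lambda k: best[k])
    path = [label]
    for pointers in reversed(backpointers):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path


# Illustrative scores: the middle sentence weakly prefers label 1 (AI) in isolation.
emissions = [[2.0, 0.0], [0.4, 0.6], [2.0, 0.0]]
# Transitions that reward staying on the same label.
transitions = [[0.5, -0.5], [-0.5, 0.5]]
print(viterbi_decode(emissions, transitions))  # [0, 0, 0]
```

With all transition scores set to zero the same call returns [0, 1, 0], which is the per-sentence argmax; the transition term is what lets neighbouring sentences influence each decision.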

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

2606.18941v1 by Pierre Dantas, Lucas Cordeiro, Waldir Junior

PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding using <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, causing vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes this gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by rightPowerRail connectionPointIn sequence ensures SET coils process before RESET coils, matching IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows all produce full GOTO IR with nondeterministic inputs and rung logic, versus the empty IR previously. All 3 verify SAFE at k=2 in under 70 ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).

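Summary: PLCopen XML defines two encoding formats for IEC 61131-3 Ladder Diagram programs: a textual encoding that uses <rung> elements, and a graphical encoding that represents rung logic as a directed graph of localId/refLocalId connections. ESBMC-PLC supported the textual format but parsed graphical exports from CONTROLLINO, Beremiz, and OpenPLC Editor into an empty GOTO intermediate representation, which led to vacuous verification success. This paper presents Graph-ESBMC-PLC, which closes the gap with a DFS-based graphical LD resolver. The resolver traverses the connection graph from leftPowerRail to each coil, extracts rung paths as Boolean contact conjunctions, and applies a three-tier I/O inference scheme. Ordering coils by the rightPowerRail connectionPointIn sequence ensures that SET coils are processed before RESET coils, in line with IEC scan-cycle semantics. The graphical-to-IR conversion leaves the ESBMC backend unchanged. Validation on 3 graphical LD programs from CONTROLLINO/OpenPLC Editor shows that all of them produce a full GOTO IR with nondeterministic inputs and rung logic, instead of the previous empty IR. All 3 verify SAFE at k=2 in under 70 ms. The 11 textual LD benchmarks are fully preserved, with no regression. Two Beremiz examples with no LD content or with unsupported timer semantics are reported as discovered limitations. Artifact at Zenodo (DantasCordeiro2026graphical, doi:10.5281/zenodo.20699856).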
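The path-extraction idea in the Graph-ESBMC-PLC abstract (follow the connection graph between the left power rail and a coil, and collect the contacts on each path as a conjunction) can be sketched as follows. This is an illustrative sketch only: the dictionary layout, the `kind`/`var`/`negated`/`refs` keys, and the function name are hypothetical stand-ins for parsed PLCopen XML, not the tool's data model. The sketch also walks backwards from the coil along refLocalId-style upstream references, which is an assumption about how the links are stored; it enumerates the same rail-to-coil paths.

```python
def rung_paths(elements, coil_id):
    """Enumerate the contact conjunctions on every path from the left power rail to a coil.

    elements: localId -> {"kind": str, "var": str, "negated": bool, "refs": [upstream localIds]}
    Returns a list of paths; each path lists contact names in rail-to-coil order,
    with "!" marking a negated contact. Contacts within one path are AND-ed together,
    and the alternative paths are OR-ed.
    """
    paths = []

    def dfs(node_id, contacts, visiting):
        node = elements[node_id]
        if node["kind"] == "leftPowerRail":
            # Reached the rail: contacts were collected coil-to-rail, so reverse them.
            paths.append(list(reversed(contacts)))
            return
        if node_id in visiting:
            return  # guard against malformed, cyclic exports
        if node["kind"] == "contact":
            name = ("!" if node.get("negated") else "") + node["var"]
            contacts = contacts + [name]
        for upstream in node["refs"]:
            dfs(upstream, contacts, visiting | {node_id})

    dfs(coil_id, [], frozenset())
    return paths


# Hypothetical rung: (start OR hold) AND NOT stop drives the coil "motor".
elements = {
    1: {"kind": "leftPowerRail", "refs": []},
    2: {"kind": "contact", "var": "start", "refs": [1]},
    3: {"kind": "contact", "var": "hold", "refs": [1]},
    4: {"kind": "contact", "var": "stop", "negated": True, "refs": [2, 3]},
    5: {"kind": "coil", "var": "motor", "refs": [4]},
}
print(rung_paths(elements, 5))  # [['start', '!stop'], ['hold', '!stop']]
```

An empty result for a coil would indicate that no path reaches the left power rail, which is the kind of case a resolver would want to report rather than silently accept.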

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

2606.18936v1 by Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats but leave the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines, and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and subdisciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

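Summary: Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that assess not only scientific competence but also whether models can recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats but do not clearly specify the underlying risk dimensions. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines, and 10 risk dimensions. In the experiments, we evaluate mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and subdisciplines, which allows a fine-grained diagnosis of where scientific models remain unsafe.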

2606.18932v1 by Xingchen Yan, Jian Ge, Qingtian Liu, Kevin Willis, Quanquan Hu, Jiapeng Zhu

Motivated by the observational incompleteness of intermediate-to-long-period Earth-size planets, we present TransitNet, a compact attention-augmented deep-learning framework for low-SNR transit blind searches. To enable realistic method development and objective threshold calibration under blind-search conditions, we develop a unified dataset construction, benchmarking, and threshold-selection framework. On recovery benchmarks constructed from unseen Kepler targets, TransitNet attains 95.2 percent accuracy in the challenging SNR range of 6 to 8 and outperforms both TLS and BLS, achieving ROC-AUC and PR-AP values of 0.974 and 0.982, respectively. In an injected Earth-size and sub-Earth-size transit recovery experiment, TransitNet achieves a recovery rate of 93.0 percent, substantially exceeding those of TLS (63.1 percent) and BLS (60.0 percent). In addition to detection, TransitNet provides attention-based estimates of transit windows and midpoints. On an independent evaluation set, 97.4 percent of injected transits are fully covered by the estimated transit window. Applied to real Kepler observations, the model successfully recovers all 34 selected confirmed Kepler planets, with a mean absolute transit midpoint error of 1.24 hours. The model combines a compact footprint of about 1.5 MB with high inference efficiency, yielding speed-ups of about 12 to 25 times relative to CPU-TLS and about 4 to 5 times relative to CPU-BLS. These results demonstrate that TransitNet provides an accurate, scalable, and computationally efficient framework for low-SNR transit blind searches in the tested regime and motivate its extension to longer-period Earth-size planet searches.

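Summary: Motivated by the observational incompleteness of intermediate-to-long-period Earth-size planets, we present TransitNet, a compact attention-augmented deep-learning framework for low-SNR transit blind searches. To enable realistic method development and objective threshold calibration under blind-search conditions, we develop a unified framework for dataset construction, benchmarking, and threshold selection. On recovery benchmarks built from unseen Kepler targets, TransitNet reaches 95.2% accuracy in the challenging SNR range of 6 to 8 and outperforms both TLS and BLS, with ROC-AUC and PR-AP values of 0.974 and 0.982, respectively. In an injected Earth-size and sub-Earth-size transit recovery experiment, TransitNet achieves a recovery rate of 93.0%, well above TLS (63.1%) and BLS (60.0%). Beyond detection, TransitNet provides attention-based estimates of transit windows and midpoints. On an independent evaluation set, 97.4% of injected transits are fully covered by the estimated transit window. Applied to real Kepler observations, the model recovers all 34 selected confirmed Kepler planets, with a mean absolute transit midpoint error of 1.24 hours. The model combines a compact footprint of about 1.5 MB with high inference efficiency, giving speed-ups of about 12 to 25 times relative to CPU-TLS and about 4 to 5 times relative to CPU-BLS. These results show that TransitNet provides an accurate, scalable, and computationally efficient framework for low-SNR transit blind searches in the tested regime, and they motivate its extension to searches for longer-period Earth-size planets.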

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

2606.18922v1 by Jasmine Owers, Edwin Simpson, Martha Lewis

Figurative language and negation are two areas that challenge current language models; however, both are widely used throughout written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand the ability of LLMs to correctly interpret text that includes both negation and figurative language. To investigate this, we develop a set of new annotations to an existing dataset of figurative language, and test a range of language models on the dataset. We find that the combination of negation and figurativeness can present a particular challenge, and that performance overall and across different negation types is particularly dependent on the prompt style used.

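Summary: Figurative language and negation are two areas that challenge current language models, yet both are widely used in written and spoken language. Large language models (LLMs) are also widely used in everyday contexts where they cannot necessarily be tuned for a specific dataset. It is therefore essential to understand how well LLMs interpret text that contains both negation and figurative language. To investigate this, we develop a set of new annotations for an existing figurative-language dataset and test a range of language models on it. We find that the combination of negation and figurativeness can pose a particular challenge, and that performance, both overall and across negation types, depends strongly on the prompt style used.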

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

2606.18910v1 by Yuanxin Liu, Ruida Zhou, Xinyan Zhao, Amr Sharaf, Hongzhou Lin, Arijit Biswas, Mohammad Ghavamzadeh, Zhaoran Wang, Mingyi Hong

Test-time scaling via sequential revision has emerged as a powerful paradigm for enhancing Large Language Model (LLM) reasoning. However, standard post-training methods primarily optimize single-shot objectives, creating a fundamental misalignment with multi-step inference dynamics. While recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize over the multi-step trajectories directly, failing to further exploit the high-quality mistakes in intermediate steps that the model can learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) in the successful recovery trajectories into decoupled revision and verification prompts, our approach concentrates training on both effective answer transformation and error identification. This approach enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared to standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm improved correction ability. It also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by problem constraints. Code is available at https://github.com/yxliu02/REVES.git.

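Summary: Test-time scaling through sequential revision has become a powerful paradigm for improving large language model (LLM) reasoning. However, standard post-training methods mainly optimize single-shot objectives, which is fundamentally misaligned with multi-step inference dynamics. Although recent work treats this as multi-turn reinforcement learning (RL), conventional approaches optimize directly over the multi-step trajectories and fail to further exploit the high-quality mistakes in intermediate steps that the model could learn from by correcting them. We propose a two-stage iterative framework that alternates between online data/prompt augmentation and policy optimization. By converting the intermediate steps ("near-miss" answers) of successful recovery trajectories into decoupled revision and verification prompts, our approach focuses training on both effective answer transformation and error identification. It enables efficient off-policy data generation and reduces the computational overhead of long-horizon sampling compared with standard multi-turn RL. On LiveCodeBench, using publicly available test cases as feedback, we observe gains of +6.5 points over the RL baseline and +4.0 points over standard multi-turn training. Beyond coding, our approach matches the previously reported SOTA result on circle packing while using the smallest base model (4B) and far fewer rollouts than the much larger evolutionary search systems. Math results under ground-truth verification further confirm the improved correction ability. The approach also generalizes to out-of-distribution constraint-satisfaction puzzles such as n_queens and mini_sudoku, where correctness is defined entirely by the problem constraints. Code is available at https://github.com/yxliu02/REVES.git.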

SAERec: Constructing Fine-grained Interpretable Intents Priors via Sparse Autoencoders for Recommendation

2606.18897v1 by Jiangnan Xia, Xuansheng Wu, Yu Yang, Xin Wang, Ninghao Liu

Intent-based recommender systems have gained significant attention for improving accuracy and interpretability by modeling the underlying motivations behind user behaviors. Most existing models derive intents directly from user sequences via clustering or prototype learning. However, they are sensitive to sequence quality, require presetting the number of intents, and lack explicit semantic grounding. These issues lead to an incomplete and coarse intent set and limit the effectiveness of recommendation. In this paper, we propose the Sparse Autoencoder for intent-based recommendation (SAERec), a novel recommender that automatically constructs a fine-grained and interpretable intent space from a textual corpus to guide recommendation. Rather than treating texts as side signals, SAERec leverages them as high-information-density evidence for intent construction. Specifically, we first extract a comprehensive set of fine-grained interpretable intents from the latent space of large language models (LLMs) by using a sparse autoencoder (SAE) to disentangle and interpret text embeddings, which isolates intent-related semantics from textual noise. Then, for each user, we retrieve relevant intents from this set as priors to guide recommendation. The retrieved priors contain personal intents matching a user's current interests and public intents capturing general item patterns shared across users (e.g., quality, price). Finally, to integrate retrieved intents into sequence modeling, we propose a multi-branch attention mechanism that captures temporal dependencies and injects both personal and public intent signals, followed by an adaptive fusion layer to construct the final user representation for recommendation. Extensive experiments on public datasets demonstrate the superiority of SAERec, which consistently outperforms state-of-the-art baselines while providing human-understandable explanations.

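Summary: Intent-based recommender systems have attracted wide attention because they improve accuracy and interpretability by modeling the underlying motivations behind user behavior. Most existing models derive intents directly from user sequences through clustering or prototype learning. However, they are sensitive to sequence quality, require the number of intents to be preset, and lack explicit semantic grounding. These issues lead to an incomplete and coarse intent set and limit recommendation effectiveness. In this paper, we propose the Sparse Autoencoder for intent-based recommendation (SAERec), a new recommender that automatically builds a fine-grained, interpretable intent space from a textual corpus to guide recommendation. Rather than treating texts as side signals, SAERec uses them as high-information-density evidence for intent construction. Specifically, we first use a sparse autoencoder (SAE) to disentangle and interpret text embeddings, extracting a comprehensive set of fine-grained interpretable intents from the latent space of large language models (LLMs) and isolating intent-related semantics from textual noise. Then, for each user, we retrieve relevant intents from this set as priors to guide recommendation. These priors include personal intents that match the user's current interests and public intents that capture general item patterns shared across users (e.g., quality, price). Finally, to integrate the retrieved intents into sequence modeling, we propose a multi-branch attention mechanism that captures temporal dependencies and injects both personal and public intent signals, followed by an adaptive fusion layer that builds the final user representation for recommendation. Extensive experiments on public datasets show that SAERec consistently outperforms state-of-the-art baselines while providing human-understandable explanations.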

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

2606.18893v1 by Zhuangzhuang Pan, Ning Dong, Yingna Su, Yan Xia

Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.

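Summary: Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, so gold pairs can stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view in which non-gold contextual utterance representations are partially corrupted. At inference time, the original clean pair scorer and decoding pipeline are used unchanged. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58 to 2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.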
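The confidence-difference margin constraint described in the RPCL abstract can be illustrated with a simple hinge penalty on one row of candidate causes. The abstract does not give the loss formula, so the form below (hinge on the gap between the gold pair and the hardest negative in the same row), the function name, and the default margin of 0.2 are assumptions for illustration only; the corrupted-view alignment term is not shown.

```python
def row_margin_loss(confidences, gold_index, margin=0.2):
    """Hinge penalty when the gold pair does not beat the hardest negative by `margin`.

    confidences: pair confidences for all candidate causes of one emotion utterance (one row)
    gold_index : position of the gold cause in that row
    margin     : required confidence gap (illustrative default, not from the paper)
    """
    # The hardest negative is the non-gold candidate with the highest confidence.
    hard_negative = max(c for i, c in enumerate(confidences) if i != gold_index)
    gap = confidences[gold_index] - hard_negative
    # Zero loss once the gap reaches the margin; linear penalty otherwise.
    return max(0.0, margin - gap)


# Gold pair is well separated: no penalty.
print(row_margin_loss([0.9, 0.3, 0.1], gold_index=0))
# Gold pair is barely ahead of a hard negative: positive penalty.
print(row_margin_loss([0.55, 0.5, 0.1], gold_index=0))
```

A per-pair cross-entropy term would be satisfied in the second case as soon as 0.55 is pushed up, regardless of the competing 0.5; the row-wise margin instead penalizes the small gap directly.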

Skill-Guided Continuation Distillation for GUI Agents

2606.18890v1 by Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.

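Summary: Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Because expert trajectories provide no demonstrations for these unseen states, the states receive no effective supervision and the policy cannot select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy completes the task and produces successful continuations, which are mixed with expert trajectories to supervise policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts and consist of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD raises the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.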

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

2606.18889v1 by Adrian Cosma, Nicoleta-Nina Basoc, Andrei Niculae, Cosmin Dumitrache, Emilian Radoi

Text-based telemedicine increasingly relies on lightweight patient feedback; however, such feedback primarily reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features such as tone, personalization, actionability and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends minimal communication changes predicted to increase the probability of positive feedback, while independent auditor models test whether these gains generalize beyond the selection model. Across interactions, recommendations yield a mean +6.41% gain in predicted positive feedback probability under independent auditors, and are non-negative for 93.31% of recommendations. These results suggest that small, interpretable communication changes can capture most predicted gains while preserving the doctor's control over medical reasoning and final wording.

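Summary: Text-based telemedicine increasingly relies on lightweight patient feedback, but such feedback mainly reflects perceived communication quality rather than medical accuracy. We introduce an LM-guided counterfactual recommendation pipeline that discovers and refines interpretable communication features, such as tone, personalization, actionability, and completeness in addressing patient concerns, without interfering with the medical content. These features are used together with patient-doctor interaction metadata to estimate positive feedback. At inference time, the system searches over low-cost ordinal feature changes and recommends the minimal communication changes predicted to raise the probability of positive feedback, while independent auditor models test whether the gains generalize beyond the selection model. Across interactions, the recommendations yield a mean gain of +6.41% in predicted positive-feedback probability under independent auditors, and 93.31% of recommendations are non-negative. These results suggest that small, interpretable communication changes can capture most of the predicted gains while preserving the doctor's control over medical reasoning and final wording.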

2606.18888v1 by Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on the observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.

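Summary: Navigation in partially observable environments is a major challenge for autonomous agents, which must make effective decisions with limited sensory information in unknown environments. Belief-based methods, particularly those that use neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. Generative models offer a compelling alternative, but they typically require large amounts of data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a new framework that combines the strengths of generation and planning. BeliefDiffusion uses diffusion models to explicitly characterize multimodal belief distributions and uses Model Predictive Control (MPC) to plan ahead at the same time. It consists of two steps: (1) imagining plausible environment configurations from the observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we show that BeliefDiffusion significantly outperforms model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results confirm that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.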

Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

2606.18882v1 by Bernardo Feijó Junqueira, Claudio Kiyoshi Umezu, Bruno Bilhar Karaziack, Tomaz Junior, Daniel Alves Castello

This work investigates the application of a domain-shift aware neural network for regression tasks aimed at estimating unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, equipped with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could be optionally activated to introduce domain discrepancy. The unbalance masses were positioned at a fixed radial distance, and the dynamic response of the system was recorded using triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, where the network is trained with a maximum mean discrepancy strategy to align feature representations across source and target distributions. The results demonstrate the effectiveness of explicitly addressing domain shift in improving prediction accuracy, especially when the system's physical behavior and sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.

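Summary: This work investigates the application of a domain-shift aware neural network to regression tasks that estimate unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, fitted with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could optionally be activated to introduce domain discrepancy. The unbalance masses were placed at a fixed radial distance, and the dynamic response of the system was recorded with triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, in which the network is trained with a maximum mean discrepancy strategy to align feature representations across the source and target distributions. The results demonstrate that explicitly addressing domain shift improves prediction accuracy, especially when the system's physical behavior and the sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.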
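The maximum mean discrepancy (MMD) used above to align source and target feature distributions can be illustrated with the standard biased estimator of squared MMD under an RBF kernel. This is a dependency-free sketch: the kernel choice, the `gamma` bandwidth, and the toy feature vectors are assumptions, since the abstract does not state which kernel or estimator the authors use. In training, such a term would typically be added to the regression loss, computed on network features rather than raw inputs.

```python
import math


def rbf(a, b, gamma=1.0):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2) for two feature vectors."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y),
    with x, x' drawn from `source` and y, y' drawn from `target`.
    """
    def mean_kernel(xs, ys):
        return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


# Toy one-dimensional features (illustrative values only).
source = [[0.0], [1.0]]
shifted_target = [[3.0], [4.0]]
print(mmd_squared(source, source))          # close to zero: identical samples
print(mmd_squared(source, shifted_target))  # clearly positive: shifted distribution
```

Minimizing this quantity with respect to the feature extractor pushes the two feature distributions together, which is the sense in which the representation is "aligned" across operating conditions.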

Efficient Financial Language Understanding via Distillation with Synthetic Data

2606.18875v1 by Wen-Fong Huang, Edwin Simpson

Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and expert annotation cost. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, where a small set of real examples is collected and labelled by hand. The framework then clusters the examples and uses the clusters to select seeds for generating synthetic examples via structured few-shot prompting. Experiments show that clustering-based seed selection yields more representative synthetic data than random sampling, enabling compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model, while remaining competitive on formal text. The framework provides a practical route toward resource-efficient domain adaptation in financial NLP with minimal human labelling effort.

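Summary: Large instruction-following models are powerful but costly to deploy, particularly in finance, where labelled data are limited by confidentiality and the cost of expert annotation. We present an efficient framework for financial sentiment analysis through distillation with synthetic data, transferring knowledge from a large instruction-tuned teacher to compact student models. The framework is designed for low-resource conditions, in which a small set of real examples is collected and labelled by hand. It then clusters the examples and uses the clusters to select seeds for generating synthetic examples through structured few-shot prompting. Experiments show that clustering-based seed selection produces more representative synthetic data than random sampling, allowing compact models to achieve strong performance with minimal supervision. Notably, on a more complex and noisy text domain, the compact model trained on the complete synthetic-seed corpus even outperforms the teacher model while remaining competitive on formal text. The framework offers a practical route to resource-efficient domain adaptation in financial NLP with minimal human labelling effort.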

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

2606.18874v1 by Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen

AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

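Summary: AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments, and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records, and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested, and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, in which runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting, and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation, and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts but also by whether their synthesis and validation processes remain attributable, inspectable, and scientifically accountable.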

Scaling Learning-based AEB with Massive Unlabeled Data

2606.18864v1 by Xiangyu Wang, Yang Zhan, Mengxiang Hao, Chuanchuan Zhong, Yansong Jia, Junjie Zhang, Yu Han, Xin Jiang, Zhen Cao, Ying Wang, Yulun Song, Zhitao Xu

This paper studies how to scale learning-based automatic emergency braking (AEB) with massive unlabeled fleet data under production constraints. Our approach is based on meta-feedback semi-supervised learning (MF-SSL), where a teacher generates pseudo labels for unlabeled driving data and is updated using a small labeled anchor set as safety-critical feedback. In production, anchor ambiguity and labeled-unlabeled mismatch can amplify systematic pseudo-label errors, leading to spurious triggers. We propose a stabilized MF-SSL framework with (i) Noise-Aware Decoupling, which removes ambiguity-prone anchors from the teacher's supervised update path, and (ii) kinematics-gated pseudo-labeling with a teacher conflict penalty to suppress mismatch-induced risk hallucinations on unlabeled data while maintaining broad coverage. Extensive experiments show consistent gains as unlabeled data scale from 1M to 1B windows, improving safety while keeping comfort stable. The 1B-trained student model is deployed to hundreds of thousands of vehicles and validated over $10^9$ km of driving, achieving a positive-to-false activation ratio exceeding 100:1 and a 35% improvement in accident-free driving mileage over a production rule-only baseline.

摘要：這篇論文研究如何在生產限制下，利用大量未標記的車隊數據來擴展基於學習的自動緊急制動（AEB）。我們的方法基於元反饋半監督學習（MF-SSL），其中一個教師為未標記的駕駛數據生成偽標籤，並使用一小組標記的錨點集作為安全關鍵的反饋進行更新。在生產中，錨點的模糊性和標記-未標記的不匹配可能會放大系統性的偽標籤錯誤，導致虛假觸發。我們提出了一個穩定的MF-SSL框架，具有（i）噪聲感知解耦，這會從教師的監督更新路徑中移除易受模糊影響的錨點，以及（ii）運動學門控偽標籤生成，並引入教師衝突懲罰，以抑制對未標記數據的不匹配引起的風險幻覺，同時保持廣泛的覆蓋範圍。大量實驗顯示，當未標記數據從1M擴展到1B窗口時，性能穩定提升，改善安全性，同時保持舒適度穩定。經過1B訓練的學生模型已部署到數十萬輛車輛，並在超過$10^9$公里的駕駛中進行驗證，實現了超過100:1的正向到虛假啟動比率，並在無事故駕駛里程上較僅依賴生產規則的基準提高了35%。

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

2606.18861v1 by Xinze Zhang

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.

摘要：重建可供模擬使用的關節物體數位雙胞胎，仍然受到兩個持續存在的缺口的限制：(i) 部件級幾何重建與運動參數估計相互解耦，以及 (ii) 恢復的模型經常違反基本的動態不變性，如能量守恆，導致在物理模擬器中重播 URDF 時出現漂移。我們提出了 KinemaForge，一個基於約束的管道，從短的 RGB-D 序列中共同推斷部件級形狀、關節拓撲和關節參數，並將結果與基於可微剛體動力學構建的能量一致性驗證器進行驗證。該管道引入了三個組件：一個運動約束圖，將關節-部件的關聯編碼為軟邊；一個可微的螺旋軸求解器，通過 Featherstone 的關節體算法從渲染的觀察結果反向傳播到關節參數；以及一個能量殘差損失，對重建模型的非物理自由反應進行懲罰。在五個 PartNet-Mobility 類別和一個內部 RGB-D 基準測試中，KinemaForge 將平均關節軸誤差從 4.52 度降低到 2.83 度（-37.4%），相較於最強的幾何基線（PARIS），以及從 5.30 度降低到 2.83 度（-46.6%），相較於基於互動的 Ditto 基線，並在 50 秒的滾動中將長期模擬漂移降低 64%（與 PARIS 相比），產生的 URDF 在我們的初步評估中，其閉環操作成功率比 Ditto 提高了 14.6 個百分點。代碼和重建數據將在接受後發布。

Approximate Structured Diffusion for Sequence Labelling

2606.18856v1 by Nicolas Floquet, Joseph Le Roux, Nadi Tomeh

Sequence labelling, a core task of Natural Language Processing (NLP), consists in assigning a label to each token of an input sentence. From a Machine Learning point of view, sequence labelling is often cast as a Linear-Chain Conditional Random Field (CRF) parametrised by a neural network. While this approach gives good empirical results, CRFs assume a finite decision span (e.g., label bigrams), which can limit their expressivity and hurt performance when long-range dependencies are required. We show that diffusion can be leveraged to train a CRF conditioned on an entire label sequence, with the caveat that the condition is on a noisy version of the labels. We show experimentally that this method, in conjunction with approximate CRF inference, improves label accuracy with a 16.5% error reduction for POS-tagging.

摘要：序列標註，作為自然語言處理（NLP）的核心任務，旨在為輸入句子的每個標記分配一個標籤。從機器學習的角度來看，序列標註通常被視為由神經網絡參數化的線性鏈條條件隨機場（CRF）。雖然這種方法提供了良好的實證結果，但CRF假設有限的決策範圍（例如標籤二元組），這可能限制其表達能力，並在需要長距離依賴時影響性能。我們展示了如何利用擴散來訓練一個以整個標籤序列為條件的CRF，但需注意條件是基於標籤的噪聲版本。我們的實驗表明，這種方法結合近似CRF推斷，提高了標籤準確性，對於詞性標註減少了16.5%的錯誤率。
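
To make the "finite decision span" concrete, here is a minimal sketch of exact decoding in a standard linear-chain CRF, where only label bigrams interact. It illustrates the baseline formulation the abstract starts from, not the diffusion-based method itself; the toy scores are made up for illustration.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most likely label sequence under a linear-chain CRF.

    emissions:   (T, K) per-token label scores (e.g. from a neural network)
    transitions: (K, K) label-bigram scores, transitions[i, j] = score(i -> j)
    """
    T, K = emissions.shape
    score = emissions[0].copy()            # best score of paths ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best path ending in label i at t-1, then label j at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    labels = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # follow back-pointers to recover the path
        labels.append(int(backptr[t, labels[-1]]))
    return labels[::-1], float(score.max())

# Toy example: 3 tokens, 2 labels; switching labels is penalised.
emissions = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 1.5]])
transitions = np.array([[0.0, -3.0], [-3.0, 0.0]])
path, best = viterbi(emissions, transitions)
print(path, best)
```

Per-token argmax would pick label 0 for the first token and label 1 afterwards; the bigram penalty makes the jointly best sequence stay on label 1 throughout. Dependencies longer than one bigram cannot be expressed in this score, which is the limitation the abstract targets.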

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

2606.18850v1 by Bohou Zhang, Xiaoyu Tao, Mingyue Cheng, Huijie Liu, Qi Liu

Abstractive summarization plays a crucial role in enabling efficient understanding of scientific literature, yet it inherently demands both linguistic fluency and factual faithfulness. Existing approaches often fail to reconcile these two requirements. Extractive methods rely on rigid sentence splicing that disrupts macro-level logical coherence, while large language model (LLM)-based generative approaches, despite mastering linguistic fluency, exhibit limited factual consistency. In this work, we propose ScholarSum, a hierarchical reflective graph-based framework that emulates a student-teacher writing process for fluent and faithful scientific summarization. ScholarSum first organizes the document into a hierarchical knowledge graph by segmenting it into semantically coherent units, whose multi-layered community structure captures global logic and macro-level themes. Guided by this global structure, the student generates an initial draft, which is subsequently refined through fine-grained evidence retrieval. To ensure factual consistency, a teacher-like reviewer then iteratively examines the draft, identifies unsupported content, and prompts targeted re-retrieval and rewriting until the summary meets rigorous quality standards. Extensive experiments demonstrate that ScholarSum significantly outperforms previous baselines in terms of both completeness and faithfulness. Our code is available at https://github.com/Xiaoyu-Tao/ScholarSum.

摘要：抽象摘要在促進對科學文獻的有效理解中扮演著至關重要的角色，但它本質上需要語言流暢性和事實忠實性。現有的方法往往無法調和這兩個要求。抽取式方法依賴於僵化的句子拼接，這會破壞宏觀層面的邏輯一致性，而基於大型語言模型（LLM）的生成方法，儘管在語言流暢性上表現出色，但在事實一致性方面卻有限。在本研究中，我們提出了ScholarSum，一種層次反思圖形基礎框架，模擬學生-教師的寫作過程，以實現流暢且忠實的科學摘要。ScholarSum首先通過將文檔劃分為語義上連貫的單元，將其組織成層次知識圖，這些多層社群結構捕捉了全球邏輯和宏觀主題。在這一全球結構的指導下，學生生成初步草稿，然後通過細緻的證據檢索進行精煉。為了確保事實一致性，類似教師的審閱者隨後反覆檢查草稿，識別不支持的內容，並促使針對性的重新檢索和重寫，直到摘要達到嚴格的質量標準。大量實驗表明，ScholarSum在完整性和忠實性方面顯著超越了以往的基準。我們的代碼可在 https://github.com/Xiaoyu-Tao/ScholarSum 獲得。

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

2606.18847v1 by Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

摘要：為了在真實的家庭中長時間協助人類，具身代理必須記住用戶的日常活動、世界狀態和過去的互動。現有的長期記憶基準主要評估以語言為中心的檢索和問題回答，而具身基準則通常專注於短期任務執行，並未測試在動態環境中使用長期記憶。我們介紹了WorldLines，一個以項目為驅動的長期具身家庭協助基準。它構建了包含對話、行動、執行反饋、物體和設備狀態變化的時間延展家庭痕跡，並將其轉換為與證據相關的樣本，用於記憶問答和具身任務規劃。我們進一步提出了ObsMem，一個以觀察者為基礎的記憶框架，維護可見性意識的記憶和行動原生狀態軌跡，以便進行狀態意識的決策。實驗顯示在部分可觀察性、被覆蓋的世界狀態以及將長期記憶轉化為具身計劃方面存在持續的挑戰，而ObsMem則為這種設定提供了更強的參考架構。

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

2606.18837v1 by Hehai Lin, Qi Yang, Chengwei Qin

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

摘要：大型語言模型（LLM）基礎的自動多代理系統（MAS）生成已成為應對複雜任務的重要前沿。然而，現有方法面臨模型能力與經驗保留之間的困境。推理時的 MAS 利用凍結的前沿 LLM，但在沒有從過去經驗中學習的情況下重複相同的搜索。相反，訓練時的 MAS 通過梯度更新內化經驗，但受到較小模型低能力上限的限制，並且難以擴展到大型前沿 LLM。為了填補這一空白，我們提出了 Skill-MAS，一條新穎的第三條路徑，通過將高層次的編排能力概念化為可演變的元技能，將經驗保留與參數更新解耦。Skill-MAS 通過閉環優化循環來精煉這一架構知識：（1）多軌跡回放在當前元技能下為每個任務採樣行為分佈；（2）選擇性反思自適應地選擇優先任務，並應用分層對比分析將系統經驗提煉為可泛化的策略級原則。跨越四個複雜基準和四個不同 LLM 的廣泛實驗表明，Skill-MAS 不僅實現了顯著的性能提升，還保持了有利的成本性能權衡。進一步分析顯示，演變的元技能具有高度的穩健性，並在未見任務和不同 LLM 之間表現出強大的可轉移性。

Target-confidence Recourse Using tSeTlin machines: TRUST

2606.18832v1 by K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats

Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.

摘要：反事實解釋廣泛用於提供高風險決策系統中的算法補救措施。大多數現有方法尋求對輸入進行最小改變，以翻轉模型的決策。然而，決策者往往不僅依賴預測標籤，還依賴置信閾值和風險邊際。剛好跨越決策邊界的反事實在噪聲或模型變化下可能是脆弱和不穩定的。在本文中，我們提出了使用 tSeTlin 機器的目標置信補救（TRUST），這是一個框架，使用者明確指定補救所需的預測置信度。TRUST 不是生成反事實然後評估置信度，而是直接尋找滿足用戶定義的置信目標的最小變更，從而使補救選項在成本、置信度和穩健性方面進行比較。我們使用結合貝葉斯優化的概率 Tsetlin 機器（PTM）來實現 TRUST。PTM 的概率子句結構將預測置信度與決策規則的穩定性聯繫起來。我們展示了滿足相同規則的反事實仍然可能在可靠性上有顯著差異，這取決於它們滿足這些規則的安全程度，從而揭示決策是否由穩健或脆弱的子句激活支持。在合成和現實世界數據集上的實驗表明，目標置信反事實產生的補救措施比傳統基於邊界的方法更穩健且可解釋。在多個基準測試中，TRUST 在保持低補救成本的同時實現了完美的穩健性，包括在 Haberman 數據集上以 0.92 置信度達到 0.10 的 L2 距離。通過明確控制置信度並揭示規則級別的穩定性，TRUST 為高風險決策支持提供了可行的補救措施。
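
The difference between boundary-crossing and target-confidence recourse can be shown with a small worked example. This sketch swaps in a plain logistic model, for which the minimal L2 change has a closed form, in place of the Probabilistic Tsetlin Machine and Bayesian optimization named in the abstract; the weights and input are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def target_confidence_recourse(x, w, b, target_conf):
    """Smallest L2 change to x so that sigmoid(w.x + b) reaches target_conf."""
    target_logit = np.log(target_conf / (1.0 - target_conf))
    gap = target_logit - (w @ x + b)
    if gap <= 0:                       # already at or above the target confidence
        return x.copy(), 0.0
    delta = gap * w / (w @ w)          # shortest move is along the weight vector
    return x + delta, float(np.linalg.norm(delta))

w, b = np.array([1.0, 2.0]), -1.0      # hypothetical model parameters
x = np.array([0.0, 0.0])               # rejected input, confidence about 0.27

x_boundary, cost_boundary = target_confidence_recourse(x, w, b, 0.5)
x_target, cost_target = target_confidence_recourse(x, w, b, 0.92)
print(cost_boundary, cost_target)
```

The boundary counterfactual is cheaper but sits exactly at confidence 0.5, so any small perturbation can flip it back; the 0.92 target costs more and leaves a margin. Comparing such options by cost and confidence is the trade-off the framework exposes.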

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

2606.18831v1 by Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.

摘要：長期推理是大型語言模型的一項基本能力，特別是在它們作為必須對長期軌跡進行推理的自主代理時。強化學習（RL）最近已成為改善這一能力的主導範式，但現有的研究主要集中在獎勵工程上，而多樣化的訓練數據仍然稀缺。我們從數據中心的角度重新審視這個問題，並展示僅僅依靠一個簡單但有效的數據配方，結合一個最小的基於結果的GRPO設置，就足以顯著改善長期推理。我們的配方針對三個互補的任務家族——檢索、多證據綜合和推理——為此我們構建並策劃了八個數據集，總計約14K個示例。在三個模型（Qwen3-4B/8B/30B-A3B）上的實驗顯示，在七個長期推理基準上平均增益為+7.2/+3.2/+6.4分，超過了先前的RL訓練集。我們進一步證明，這些增益可以轉移到代理任務上，在一個經過我們數據配方調整的模型上持續進行RL訓練，使GAIA提高了+4.8分，BrowseComp提高了+7.0分。我們將釋放我們的數據集，以促進未來的研究。

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

2606.18829v1 by Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.

摘要：記憶基準測試對於 LLM 代理大多假設為單用戶環境，導致醫院、工作場所、校園和家庭的共享助手研究不足。在這些部署中，多個主體寫入共同的記憶池並根據不同的角色、範疇和關係進行查詢，因此記憶質量需要治理以及回憶。我們介紹了 GateMem，一個針對多主體共享記憶代理的基準。GateMem 共同評估合法長期請求的效用，包含狀態更新、跨上下文授權邊界的訪問控制，以及在明確刪除請求後面向代理的主動遺忘。它涵蓋醫療、辦公、教育和家庭領域，具有長篇多方情節、增量記憶注入、隱藏檢查點、結構化評判和洩漏目標註釋。在多樣的基準線和主幹模型中，沒有任何方法能同時實現強大的效用、穩健的訪問控制和可靠的遺忘。長上下文提示通常在高標記成本下產生最佳治理分數，而基於檢索和外部記憶的方法則降低成本，但仍然洩漏未經授權或已刪除的信息。這些結果顯示，目前的記憶代理仍然遠未達到可靠的共享機構部署。

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

2606.18828v1 by Chenghao Xu

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.

摘要：傳統方法將智慧置於代理中，無論是作為學習的策略還是搜索程序。我們則將智慧置於空間本身：一個場景在配置流形上誘導出一個黎曼度量，而行動則簡化為遵循該度量的測地線，而不是調用單獨的規劃器或碰撞檢查器。一個單一的編碼器-路由器網絡通過三組互補的參數實現這一理念——框架參數用於定向生成器，調製參數用於控制它們的空間傳播，以及基本係數用於確定它們的強度。這些組通過共享的半群-疊加機制結合，產生一個單一的黎曼度量場，形成一個緊湊的架構，其幾何形狀隨場景的複雜性自然縮放。在單一的兩障礙場景上進行訓練後，該模型在未見過的障礙配置上展示了強大的零樣本泛化，碰撞安全路徑成本與穿透障礙的路徑成本之間有著數量級的區別。
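
As a rough illustration of "action reduces to following the geodesics of the metric", the sketch below compares the Riemannian length of two paths under a hand-written metric that grows near an obstacle. The metric form, its `strength` and `width` parameters, and the single-obstacle scene are assumptions for illustration; in the paper the metric field is produced by the Encoder-Router network.

```python
import numpy as np

def metric(x, centers, strength=50.0, width=0.3):
    """Hand-written conformal metric G(x) = g(x) * I that grows near obstacles."""
    d2 = ((x[None, :] - centers) ** 2).sum(axis=1)
    return (1.0 + strength * np.exp(-d2 / (2 * width ** 2)).sum()) * np.eye(2)

def path_cost(path, centers):
    """Riemannian length: sum of sqrt(dx^T G(midpoint) dx) over path segments."""
    cost = 0.0
    for a, b in zip(path[:-1], path[1:]):
        dx = b - a
        cost += np.sqrt(dx @ metric((a + b) / 2, centers) @ dx)
    return float(cost)

centers = np.array([[0.0, 0.0]])                      # one obstacle at the origin
s = np.linspace(0.0, 1.0, 101)[:, None]
start, goal = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
straight = (1 - s) * start + s * goal                 # passes through the obstacle
theta = np.pi * (1 - s)
detour = np.hstack([np.cos(theta), np.sin(theta)])    # unit-radius arc around it

through = path_cost(straight, centers)
around = path_cost(detour, centers)
print(through, around)
```

Although the detour is longer in Euclidean terms, it is cheaper under the metric, so a geodesic solver would prefer it without a separate collision check. The cost gap here is modest because the toy metric is mild; the abstract reports orders-of-magnitude separation for the learned metric.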

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

2606.18820v1 by Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen

Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard MDP formulations typically flatten this structure into stage-dependent state descriptions and action masks, thereby obscuring the nested information-action asymmetry that determines which decisions are urgent and which can be deferred. We introduce Maturing Markov Decision Processes (MMDPs), a formulation built around this information-action asymmetry. We characterize one of its key consequences through an expiring-action priority principle, which identifies the actions that must be resolved before the next stage. Motivated by this structure, we develop a structure-aware reinforcement learning framework with stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments on a controlled multi-supplier replenishment problem, simplified cash-management environments of increasing complexity, and a production-scale simulator show that explicitly modeling this asymmetry improves learning efficiency and becomes increasingly valuable as decision problems scale.

摘要：序列決策問題通常表現出信息和決策靈活性的非對稱演變：隨著決策週期的展開，代理人獲得更豐富的信息，同時可行的行動因操作截止、承諾或資源限制而過期。標準的馬爾可夫決策過程（MDP）公式通常將這一結構簡化為階段依賴的狀態描述和行動掩碼，從而掩蓋了決定哪些決策是緊急的、哪些可以延遲的嵌套信息--行動非對稱性。我們引入了成熟馬爾可夫決策過程（MMDPs），這是一種圍繞這一信息--行動非對稱性構建的公式。我們通過過期行動優先原則來描述其一個關鍵後果，該原則確定了必須在下一階段之前解決的行動。受到這一結構的啟發，我們開發了一個結構感知的強化學習框架，具有階段感知的政策設計、過期行動抽象以及增強學習與蒸餾的搜索。對於一個受控的多供應商補貨問題、日益複雜的簡化現金管理環境和一個生產規模的模擬器的實驗顯示，明確建模這種非對稱性提高了學習效率，並隨著決策問題的擴大而變得越來越有價值。

SwitchBraidNet: Quantisation-Aware Lightweight Architecture for Hybrid Brain-Computer Interface

2606.18816v1 by Gourav Siddhad, Yogesh Kumar Meena

Hybrid brain-computer interfaces (BCIs) that integrate motor imagery (MI) and steady-state visual evoked potentials (SSVEP) provide high-dimensional neural decoding but typically exceed the computational limits of embedded hardware. To address this, we propose SwitchBraidNet, a compact EEG classification architecture designed for low-power deployment. The model employs a dual-path temporal braid to extract multiscale oscillatory features, an adaptive squeeze-and-excitation spatial switch for electrode gating, and a log-variance readout layer for direct band-power encoding. Furthermore, through systematic quantisation-aware training on the OpenBMI dataset, we compared SwitchBraidNet against four established baselines across FP32, FP16, and INT8 precisions. Experimental results demonstrate superior efficiency and performance, achieving MI accuracy of 69.49% (FP16), SSVEP accuracy of 93.48% (FP32), and a hybrid information transfer rate of 64.82 bits/min (FP16). With an INT8 footprint of only 3.03 KB, SwitchBraidNet maintains high accuracy across varying numerical precisions, demonstrating its suitability for low-power embedded BCI deployment.

摘要：混合腦機介面（BCI）整合了運動意象（MI）和穩態視覺誘發電位（SSVEP），提供高維度的神經解碼，但通常超出了嵌入式硬體的計算限制。為了解決這個問題，我們提出了 SwitchBraidNet，一種為低功耗部署設計的緊湊型 EEG 分類架構。該模型採用了雙通道時間編織來提取多尺度振盪特徵，適應性擠壓和激勵空間開關用於電極閘控，以及一個對數方差讀出層用於直接帶功率編碼。此外，通過對 OpenBMI 數據集進行系統的量化感知訓練，我們將 SwitchBraidNet 與四個已建立的基準進行了比較，涵蓋 FP32、FP16 和 INT8 精度。實驗結果顯示出更高的效率和性能，達到 MI 準確率 69.49%（FP16）、SSVEP 準確率 93.48%（FP32），以及混合信息傳輸速率 64.82 位/分鐘（FP16）。SwitchBraidNet 在僅有 3.03 KB 的 INT8 足跡下，保持了在不同數值精度下的高準確率，顯示出其適合用於低功耗嵌入式 BCI 部署。
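
The "log-variance readout for direct band-power encoding" rests on a simple fact: the variance of a band-limited signal over a time window is its power in that band. A minimal sketch follows, assuming already band-pass filtered input and a hypothetical 250 Hz sampling rate; the actual layer in SwitchBraidNet may differ in detail.

```python
import numpy as np

def log_variance_readout(x, eps=1e-8):
    """x: (channels, samples) band-limited signal -> (channels,) log band power."""
    return np.log(x.var(axis=-1) + eps)

fs = 250                                          # assumed sampling rate in Hz
t = np.arange(fs) / fs                            # one second of samples
weak = 1.0 * np.sin(2 * np.pi * 10 * t)           # 10 Hz oscillation, amplitude 1
strong = 2.0 * np.sin(2 * np.pi * 10 * t)         # same oscillation, amplitude 2
features = log_variance_readout(np.stack([weak, strong]))
print(features)
```

Doubling the amplitude quadruples the power, so the two features differ by log 4. The logarithm compresses the wide dynamic range of EEG power into a scale that a small linear classifier can handle.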

Reinforcement Learning Foundation Models Should Already Be A Thing

2606.18812v1 by Abdelrahman Zighem, Jill-Jênn Vie

Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

摘要：基於互聯網規模數據的語言和視覺基礎模型，而結構化領域（表格預測、時間序列預測、圖學習、強化學習）則不是。替代品是合成數據，這將負擔從收集轉移到先前設計。對於許多結構化任務，這樣的先驗已經存在：TabPFN及其後續版本使用在合成貝葉斯先驗上預訓練的Transformer來解決表格分類問題。
我們提出兩點。首先，強化學習是明顯的缺口：對合成MDP的採樣與對合成表格數據集的採樣同樣可行，但沒有任何上下文強化學習工作將先前設計視為主要目標。其次，MDP允許固定大小的充分統計量，與觀察到的情節無關且呈表格形狀，這使得它們直接適合用於表格基礎模型的基於注意力的架構，並用策略頭替代監督目標。這些共同定義了強化學習基礎模型的議程。
作為概念驗證，我們完全在合成MDP上訓練一個模型，並顯示在沒有特定任務調整的情況下，它能夠在上下文中解決保留的表格基準，無論是在線還是離線：在線時，所需的情節數量遠少於UCB-VI和表格Q學習；離線時，與VI-LCB競爭。
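
The two points above can be sketched in a few lines: sampling a synthetic tabular MDP, and summarising any number of episodes in arrays whose shape depends only on the numbers of states and actions. The Dirichlet/uniform prior and the choice of transition counts plus reward sums as the statistic are assumptions for illustration; the abstract does not specify either.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mdp(n_states, n_actions):
    """Draw a random tabular MDP (a stand-in for a designed synthetic prior)."""
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # (S, A, S)
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # (S, A)
    return P, R

def collect_statistics(P, R, n_episodes, horizon):
    """Roll out a uniform-random policy and accumulate tabular statistics."""
    S, A = R.shape
    counts = np.zeros((S, A, S))       # transition counts N(s, a, s')
    reward_sum = np.zeros((S, A))      # summed rewards per (s, a)
    for _ in range(n_episodes):
        s = rng.integers(S)
        for _ in range(horizon):
            a = rng.integers(A)
            s_next = rng.choice(S, p=P[s, a])
            counts[s, a, s_next] += 1
            reward_sum[s, a] += R[s, a]
            s = s_next
    return counts, reward_sum

P, R = sample_mdp(n_states=4, n_actions=2)
few = collect_statistics(P, R, n_episodes=5, horizon=10)
many = collect_statistics(P, R, n_episodes=500, horizon=10)
print(few[0].shape, many[0].shape)
```

Five episodes and five hundred episodes produce arrays of identical shape, which is what makes the summary usable as a fixed-size, table-like input to an attention-based model.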

Rescaling MLM-Head for Neural Sparse Retrieval

2606.18811v1 by Youngjoon Jang, Seongtae Hong, Jonah Turner, Heuiseok Lim

Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.

摘要：學習稀疏檢索（LSR）模型，如SPLADE，傳統上使用BERT風格的掩碼語言模型作為主幹編碼器。自然的期望是，將BERT替換為更強大的預訓練編碼器應該能提高檢索效果。然而，我們發現，在標準的SPLADE訓練配方下，擁有較大MLM-head L2範數的主幹可能會遭遇性能下降，甚至在標準SPLADE訓練配方下發生訓練崩潰。我們將這一失敗識別為MLM head中的尺度不匹配：SPLADE直接使用MLM-head輸出來構建稀疏詞彙表示，查詢-文檔相關性是通過這些表示的未標準化點積計算的。因此，膨脹的MLM-head尺度可能會放大稀疏激活，扭曲匹配分數，並在常見的訓練設置下使對比訓練不穩定。為了解決這個問題，我們引入了一個簡單的初始化時修正，該修正在SPLADE訓練之前通過一個常數因子重新縮放MLM-head投影。這一零成本的調整改善了訓練穩定性，而不修改模型架構或訓練目標。在內域和外域的檢索基準中，這一簡單的修正顯著改善了大型範數主幹，如ModernBERT和Ettin，將不穩定的訓練過程轉變為具有競爭力的稀疏檢索器。在幾個設置中，修正後的模型進一步匹配或超越了經典的BERT-SPLADE基線。這些發現表明，將預訓練編碼器適應於LSR的瓶頸不僅僅是編碼器的容量，而是用於構建稀疏詞彙表示的MLM-head尺度的校準。
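
The scale mismatch can be reproduced on random data. The sketch below uses the SPLADE activation log(1 + ReLU(logits)) with max pooling over tokens and an unnormalized dot product; the hidden states, the inflated head weights and the factor `alpha` are all hypothetical, and how the paper chooses its constant is not stated in the abstract.

```python
import numpy as np

def splade_vector(hidden, mlm_weight):
    """SPLADE-style pooling: max over tokens of log(1 + ReLU(logits))."""
    logits = hidden @ mlm_weight.T                  # (tokens, vocab)
    return np.log1p(np.maximum(logits, 0.0)).max(axis=0)

rng = np.random.default_rng(0)
hidden_q = rng.normal(size=(4, 16))                 # hypothetical query token states
hidden_d = rng.normal(size=(12, 16))                # hypothetical document token states
mlm_weight = 20.0 * rng.normal(size=(100, 16))      # MLM head with an inflated norm

alpha = 0.05                                        # hypothetical constant factor
scores = {}
for name, W in [("original", mlm_weight), ("rescaled", alpha * mlm_weight)]:
    q = splade_vector(hidden_q, W)
    d = splade_vector(hidden_d, W)
    scores[name] = float(q @ d)                     # unnormalized dot product
    print(name, "score:", round(scores[name], 2),
          "active document terms:", int((d > 0).sum()))
```

Multiplying the head by a positive constant leaves the set of active terms unchanged but shrinks every activation, and therefore the matching score that feeds the contrastive loss. That is why a one-off rescaling at initialization can change training dynamics without touching the architecture or objective.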

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

2606.18810v1 by Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts: GRPO variants rely on process reward models or ground-truth answers, while knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation, OPD) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.

摘要：強化學習與可驗證獎勵（RLVR）在訓練大型語言模型（LLMs）以解決推理任務方面推動了顯著的進展，但代表性的方法如 GRPO 對所有標記分配均勻的信用，浪費了在常規標記上的梯度，同時對關鍵推理步驟的信用評估不足。現有的標記級信用分配方法需要超出模型自身回合的資源。GRPO 的變體依賴於過程獎勵模型或真實答案。知識蒸餾通過每個標記的偏差分配信用，但需要外部教師（在政策蒸餾）或特權信息（在政策自我蒸餾）。然而，這些依賴限制了在純 RLVR 設定中的適用性。我們觀察到，將模型條件化於其自身的驗證軌跡會在原始分佈和條件分佈之間產生可測量的每標記 KL 散度，並證明從由驗證軌跡構建的自我教師中進行蒸餾會導致在存在多個驗證軌跡時無法實現的加權平均解。我們提出了 SC-GRPO（自我條件化 GRPO），它使用前面提到的 KL 散度作為 GRPO 梯度的乘法權重。在跨越數學、代碼和代理任務的五個基準測試中，SC-GRPO 始終比 GRPO 高出 8.1%，比 DAPO 高出 5.9%，並且在 OOD 性能上更強。此外，SC-GRPO 的性能高於 OPD。
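
A toy version of the credit-assignment idea: compute a group-standardised advantage as in GRPO, then scale it per token by the KL divergence between the original and self-conditioned next-token distributions. The KL direction and the mean-one normalisation of the weights are assumptions made for this sketch; the abstract states only that the KL acts as a multiplicative weight.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: reward standardised within the rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def per_token_kl(p_orig, p_cond):
    """KL(p_cond || p_orig) per position; inputs are (tokens, vocab) probabilities."""
    return np.sum(p_cond * (np.log(p_cond) - np.log(p_orig)), axis=-1)

def token_weights(kl):
    """Hypothetical normalisation: weights average to 1 over the sequence."""
    return kl / (kl.mean() + 1e-8)

# One rollout with 3 tokens over a 2-word vocabulary (toy numbers).
p_orig = np.array([[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]])
p_cond = np.array([[0.5, 0.5], [0.2, 0.8], [0.6, 0.4]])

kl = per_token_kl(p_orig, p_cond)
adv = group_advantages([1.0, 0.0, 0.0, 1.0])[0]     # this rollout was verified correct
weighted = adv * token_weights(kl)                  # per-token credit
print(kl.round(4), weighted.round(3))
```

The first token, where conditioning changes nothing, receives no credit, while the second token, where the conditioned distribution shifts sharply, receives most of it. Uniform GRPO would have given all three tokens the same advantage.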

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

2606.18803v1 by Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.

摘要：將大型語言模型（LLMs）引入工業乘車呼叫調度，作為平台規模行為日誌的語義特徵提取器，這是一個引人注目但尚未充分探索的數據系統問題。生產匹配管道仍然以結構化數值特徵為主導，但決定性的行為信號（例如，駕駛員對某些地區的習慣性厭惡）本質上是上下文相關的，並且自然可以表達為LLM生成的用戶檔案。然而，將這種檔案擴展到實時、毫秒延遲的調度器面臨著三個相互交織的約束，這些約束很少同時被解決：在一個每天有數百萬訂單的平台上，日誌的數據量超過任何LLM的上下文窗口幾個數量級；大多數用戶是長尾用戶，與每個用戶的互動次數太少，無法進行個別檔案分析；而表面流暢的檔案不一定能改善下游預測的效用。我們提出了ProfiLLM，一個自主的LLM數據管道，通過兩個模塊實現與效用對齊的用戶檔案分析，以支持生產匹配系統。（1）工具增強的全球知識挖掘為LLM代理配備了27個分析工具，以挖掘平台規模的數據，生成可重用的全球知識、自適應的用戶聚類規則和區域供需先驗。（2）與效用對齊的檔案探索為每個聚類生成多個候選檔案，通過輕量級的下游效用代理進行評估，迭代地精煉最佳候選檔案並構建DPO微調的偏好對。在滴滴的生產調度器上部署的ProfiLLM，在結果預測中實現了高達+6.14%的相對AUC改善，在調度模擬中實現了高達+4.35%的GMV增益，並在為期14天的在線A/B測試中持續改進，包括+0.47%的GMV、+0.33%的完成率和-0.82%的接受前取消率。

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

2606.18801v1 by Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.

摘要：隨著龐大的多語言語料庫的快速擴展，多語言信息檢索（MLIR）已成為全球信息訪問的重要技術。MLIR使得用戶能夠使用單一語言查詢從多語言文本集合中檢索語義相關的文檔。然而，最近的多語言密集檢索模型往往對與查詢相同語言的文檔表現出強烈的偏好。這導致了嚴重的語言偏見，排名最高的結果往往由特定語言的文檔主導，即使其他語言的文檔包含更具語義相關的信息。為了解決這個問題，我們提出了SHIFT，一種在索引階段適用的無需訓練的方法。具體而言，SHIFT利用平行翻譯對來估計每個目標語言相對於源語言的相對語言向量。隨後，SHIFT通過在索引過程中從文檔嵌入中減去這個相對語言向量來修正特定語言的偏移。我們在四個MLIR基準和各種密集檢索模型上的全面評估確認了SHIFT能有效減輕語言偏見並提升MLIR性能。
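
The index-side correction described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only says that parallel translation pairs are used to estimate a relative language vector, so estimating it as the mean embedding difference over the pairs is an assumption, and the function names `relative_language_vector` and `shift_document` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def relative_language_vector(
    target_embs: Sequence[Sequence[float]],
    source_embs: Sequence[Sequence[float]],
) -> Vector:
    """Estimate a language offset from parallel pairs.

    Assumption: the offset is the mean of (target - source) over all
    parallel translation pairs; the abstract does not state the estimator.
    """
    if len(target_embs) != len(source_embs) or not target_embs:
        raise ValueError("need a non-empty, equally sized set of parallel pairs")
    dim = len(target_embs[0])
    total = [0.0] * dim
    for tgt, src in zip(target_embs, source_embs):
        for i in range(dim):
            total[i] += tgt[i] - src[i]
    return [t / len(target_embs) for t in total]


def shift_document(doc_emb: Sequence[float], lang_vector: Sequence[float]) -> Vector:
    """Subtract the language offset from a document embedding at indexing time."""
    return [d - v for d, v in zip(doc_emb, lang_vector)]


# Toy example: two parallel pairs with a constant offset of (0.5, 0.25).
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.5, 0.25], [0.5, 1.25]]
offset = relative_language_vector(target, source)
corrected = shift_document([2.5, 1.25], offset)
```

Because the correction is applied to document embeddings only, the query encoder and the retrieval interface are left unchanged in this sketch, which matches the training-free, indexing-stage framing of the abstract.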

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

2606.18797v1 by Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using ReEvalMed benchmark as testbed and evaluate metric-level clinical significance from detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where D-R balance is critical. We will release the dataset and metric.

摘要：可靠的放射科報告評估需要嚴格的臨床準確性，因為遺漏關鍵發現或錯誤表徵放射影像觀察會直接影響病人護理。現有的指標通過將報告質量簡化為無醫學根據的標量來掩蓋這一要求。儘管大型語言模型（LLMs）擁有豐富的醫學知識，但它們同樣難以劃定臨床上重要錯誤與無害變異之間的可靠邊界。我們使用ReEvalMed基準作為測試平台來研究這一邊界，並從檢測真實臨床錯誤（“區分”）和容忍不重要變異（“穩健性”）的角度評估指標層級的臨床意義。在單通道和雙通道設置下的8個LLM評估者中，我們識別出廣泛的區分偏見：模型有效地檢測錯誤，但也過度懲罰無害的改述。為了減輕這一問題，我們合成了4k報告對並在Qwen3-8B和MedGemma-4B上訓練輕量級可解釋的指標。我們訓練的指標明確了臨床意義邊界，超越了32B規模的醫學LLMs，並與專有模型保持競爭力。關鍵是，更昂貴的雙通道設置未能持續改善整體性能，主要是在區分和穩健性之間進行了權衡。這些發現表明，單通道訓練的指標是成本敏感部署的實際選擇，而雙通道推斷則保留給D-R平衡至關重要的設置。我們將發布數據集和指標。

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

2606.18788v1 by Jaward Sesay, Yue Yu, Börje F. Karlsson

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

摘要：教導機器模仿自然書寫風格仍然是一個未解的挑戰，因為這需要合成在形狀、質感、壓力和字體上動態變化的筆劃序列——不僅在不同個體之間，還包括單一個體的手寫。對這一挑戰的嘗試主要探索了在線和離線環境中的深度學習方法。然而，這些方法往往受到風格特定架構選擇的限制，對大型數據集的高度依賴，高計算成本，以及通過自然語言對書寫風格缺乏靈活控制。為此，我們介紹了HandwritingAgent，一個語言驅動的代理，可以直接以可擴展矢量圖形（SVG）格式合成自然手寫序列，而無需特定風格的訓練。該代理利用大型推理模型幾何分析並自回歸生成目標手寫字形作為離散網格畫布環境中的筆劃序列。生成是基於以對話或非對話模式提供的文本，以及參考手寫風格圖像。對於涵蓋模仿、識別、多語言手寫合成以及複雜手寫數學和科學表達式生成的多樣手寫任務的實驗顯示，性能有顯著提升，HandwritingAgent的表現達到或超越了最先進的生成手寫模型，同時提供了一種更高效、可控且可泛化的合成方法。

RedactionBench

2606.18782v1 by Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

摘要：大型語言模型越來越多地應用於需要刪除個人可識別信息（PII）的敏感領域。雖然刪除PII是數據清理的前提，但現有基準將提取機制與隱私語義混為一談。公共電話號碼並不等同於醫療記錄中的電話號碼。信息是否構成違規在很大程度上取決於誰持有它、為什麼以及在什麼上下文中，這根本上區分了刪除和簡單的實體識別。基於上下文完整性，我們引入了RedactionBench，一個手動註釋的基準，包含來自11個領域的200份多樣化文件，大多數來源於現實世界。我們還引入了R-Score，一種新穎的字符級指標，平等對待語義相似的刪除，並消除淺顯的格式選擇，例如對電話號碼使用不同的掩碼樣式。在命名實體識別模型、實體提取小型語言模型和配備代理工具的前沿模型的評估中，顯示上下文刪除仍然是一個未解決的問題。對RedactionBench進行的超過80名用戶的人類評估顯示出隱私感知的明顯二元性。註釋者對於強制刪除的目標標籤（89.4%）和安全文本保留（94.1%）顯示出共識，但對於上下文刪除（47.7%）則未能達成一致。這種變異顯示了上下文隱私的主觀性，並促進了R-Score的發展，該指標將上下文模糊性與嚴格精確性解耦。我們比較了35個模型的不同類別，並報告了它們在刪除PII方面的表現。最後，我們發布了RedactionBench，以建立未來隱私保護系統的基準，希望能激發高效的模型設計和標準化評估。

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

2606.18781v1 by Shanshan Lyu, Yiwei Wang, Yujun Cai, Jiafeng Guo, Shenghua Liu

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.

摘要：密集檢索將一個查詢向量與一個文檔向量進行排名。對於長文檔，當短而關鍵的片段在文檔編碼過程中被削弱後，這種介面可能會失效。我們將這種失效模式研究為文檔側的早期壓縮，並引入證據稀釋指數（EDI）來衡量文檔級表示在同一金標文檔內低於最強片段級證據的程度。在這一觀點的指導下，我們提出了DICE（通過片段證據進行文檔推斷），這是一種無需訓練的文檔側策略，將文檔拆分為片段，使用凍結模型獨立編碼，然後將它們聚合回單個向量，同時保持標準的一查詢一文檔介面。在LongEmbed上，DICE在四個基礎架構上改善了檢索，對於超過4k標記的片段，增益最大：對於Dream，Passkey >4k從30.0上升到90.0，Needle >4k從23.3上升到74.0。在12,779個過濾樣本中，DICE在92.8%的情況下產生了低於單向量基線的EDI。這些結果確立了文檔級編碼作為長文檔檢索的一個實用且未被充分探索的杠杆。
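
The chunk-then-aggregate idea and the dilution measurement can be sketched roughly as below. Everything here is illustrative: the abstract does not give the aggregation rule or the exact EDI formula, so `aggregate_mean` is a plain-mean placeholder and `evidence_gap` (strongest chunk score minus document score) is a hypothetical stand-in for EDI, not its published definition.

```python
from typing import List, Sequence

Vector = List[float]


def split_into_chunks(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def aggregate_mean(chunk_vecs: Sequence[Sequence[float]]) -> Vector:
    """Placeholder aggregation (assumption): average the chunk vectors."""
    dim = len(chunk_vecs[0])
    return [sum(v[i] for v in chunk_vecs) / len(chunk_vecs) for i in range(dim)]


def evidence_gap(
    query: Sequence[float],
    doc_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
) -> float:
    """Hypothetical dilution measure: best chunk score minus document score."""
    strongest = max(dot(query, c) for c in chunk_vecs)
    return strongest - dot(query, doc_vec)


chunks = split_into_chunks([0, 1, 2, 3, 4], 2)
chunk_vecs = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for frozen-encoder outputs
doc_vec = aggregate_mean(chunk_vecs)
gap = evidence_gap([1.0, 0.0], doc_vec, chunk_vecs)
```

In this toy case the mean-pooled document vector scores lower than the best chunk, which is the kind of gap the abstract's index is meant to expose; an aggregation that preserves the decisive chunk would shrink it.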

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

2606.18780v1 by Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

摘要：多模態信息提取（MIE）涵蓋了多模態命名實體識別（MNER）、關係提取（MRE）和事件提取（MEE）等任務，對於理解多媒體內容至關重要，但仍受到嚴重數據稀缺的限制。儘管數據增強是一種有前景的解決方案，但現有的方法受到粗糙的跨模態對齊和碎片化的任務特定設計的阻礙，未能充分利用共享的語義知識。為了克服這些限制，我們提出了語義錨點對齊的多模態增強（SAMA），這是一個統一的框架，用於生成高保真、任務感知的合成數據。SAMA 從真實標籤中構建結構化的語義錨點，以指導協作多專家多模態大型語言模型（CME-MLLM），該模型將共享語義的通用適配器與任務特定的適配器相結合，生成多樣但符合約束的文本樣本。對於圖像合成，SAMA 採用錨點保留擴散機制，使用錨點加權提示和潛在條件來保持關鍵的語義錨點，同時多樣化視覺上下文。為了消除手動驗證的需要，SAMA 進一步引入了一個雙約束過濾模塊，根據跨模態一致性和錨點保真度選擇合成樣本。在 MNER、MRE 和 MEE 的基準數據集上進行的廣泛實驗表明，SAMA 在完全監督和低資源設置下始終超越了最先進的增強基準，突顯了其多功能性、穩健性和有效性。

Private Learning with Public Feature Conditioning

2606.18773v1 by Shuli Jiang, Walid Krichene, Nicolas Mayoraz

We study differentially private (DP) regression in settings where each data sample includes public, non-sensitive features -- common in applications such as recommendation and advertising systems. While such label-DP or semi-sensitive-feature settings have been primarily explored in the context of classification, effective approaches for regression remain underexplored. We introduce Cond-DP, a conditioned variant of DPSGD that leverages the structure of public feature matrices to improve optimization under privacy constraints. Motivated by the observation that these public features often exhibit rapidly decaying spectra, Cond-DP incorporates a data-driven conditioning matrix to reshape the optimization landscape and accelerate convergence. We provide convergence guarantees for convex, strongly convex, and non-convex settings, and recover standard DPSGD as a special case when the conditioning matrix is the identity. We show how to construct an effective conditioning matrix for Cond-DP directly from public features, enabling provably faster convergence than DPSGD in private linear regression without incurring additional privacy cost. Empirically, Cond-DP with this conditioning matrix consistently outperforms state-of-the-art baselines across a wide range of datasets and model architectures under label DP, demonstrating strong and robust performance in practice.

摘要：我們研究在每個數據樣本包含公共、非敏感特徵的情境下進行差分隱私（DP）回歸——這在推薦和廣告系統等應用中很常見。雖然這種標籤-DP或半敏感特徵的情境主要是在分類的背景下進行探討，但回歸的有效方法仍然未被充分研究。我們介紹了Cond-DP，一種條件變體的DPSGD，利用公共特徵矩陣的結構來改善在隱私約束下的優化。受到這些公共特徵通常表現出快速衰減光譜的觀察啟發，Cond-DP結合了一個數據驅動的條件矩陣，以重塑優化景觀並加速收斂。我們提供了對於凸、強凸和非凸情境的收斂保證，並在條件矩陣為單位矩陣時恢復標準DPSGD作為特例。我們展示了如何直接從公共特徵構建有效的條件矩陣，使得在私有線性回歸中能夠比DPSGD實現可證明的更快收斂，而不會產生額外的隱私成本。在實證中，使用這個條件矩陣的Cond-DP在標籤DP下在各種數據集和模型架構中始終超越最先進的基線，展示了在實踐中的強大和穩健性能。
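
One plausible reading of a conditioned DPSGD update is sketched below. It is an assumption-laden illustration rather than the Cond-DP algorithm itself: the abstract does not say where the conditioning matrix enters the update, so this sketch applies it to the already clipped and noised average gradient. The function name `conditioned_dp_step` and all parameter names are hypothetical. The only property taken from the abstract is that an identity conditioning matrix recovers a standard DPSGD step.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def clip(grad: Sequence[float], max_norm: float) -> Vector:
    """Scale a per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def matvec(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> Vector:
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def conditioned_dp_step(
    weights: Sequence[float],
    per_sample_grads: Sequence[Sequence[float]],
    conditioner: Sequence[Sequence[float]],
    lr: float,
    clip_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> Vector:
    dim = len(weights)
    clipped = [clip(g, clip_norm) for g in per_sample_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Gaussian noise calibrated to the clipping norm, as in standard DPSGD.
    noisy = [s + rng.gauss(0.0, noise_multiplier * clip_norm) for s in summed]
    avg = [x / len(per_sample_grads) for x in noisy]
    # Assumption: the conditioning matrix post-processes the noisy gradient.
    direction = matvec(conditioner, avg)
    return [w - lr * d for w, d in zip(weights, direction)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[2.0, 0.0], [0.0, 0.5]]
# noise_multiplier=0.0 keeps the example deterministic; it gives no privacy.
plain = conditioned_dp_step([0.0, 0.0], grads, identity, 1.0, 1.0, 0.0, rng)
cond = conditioned_dp_step([0.0, 0.0], grads, scaled, 1.0, 1.0, 0.0, rng)
```

Under this reading, the conditioning matrix is built from public features and applied after noise is added, so it acts as post-processing of a private quantity; that is consistent with the abstract's claim of no additional privacy cost, though the paper's actual construction may differ.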

Output Vector Editing for Memorization Mitigation in Large Language Models

2606.18767v1 by Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7$\times$ gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ${\sim}14\%$ of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60--64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.

摘要：大型語言模型記憶並重現其訓練數據中的序列，從而產生隱私、版權和安全風險。現有的神經元級緩解方法將編輯等同於將神經元激活歸零，但激活僅控制神經元是否參與；輸出向量才是寫入殘差流的，並通過疊加編碼多個特徵。我們提出了輸出向量編輯，這是一種受限優化的權重編輯，定位一小組負責記憶延續的多層感知器（MLP）神經元，並最小化地修改它們的輸出向量，以在詞彙空間中引入一個干擾項，重新定向它們的殘差流貢獻，同時保持激活不變。在對四個模型（從360M到7B參數，分別為SmolLM-360M、OLMo-1B、OLMo-7B、Llama2-7B）進行評估時，我們集中於OLMo-7B（其開放權重和預訓練語料庫使系統性挖掘成為可能），並挖掘了6831個記憶序列，實現了高達87.9%的抑制。針對同一定位神經元的零消融的2.7$\times$差距顯示，抑制來自於輸出向量編輯，而不僅僅是定位。四種編輯模式涵蓋了從激進抑制到最小重定向的範圍；在集成中，它們覆蓋了96.5%的記憶序列，而我們推薦的單模式配置達到81.5%，且沒有災難性的局部失敗。我們進一步確定了一個機制邊界，約為${\sim}14\%$的序列無法通過僅MLP編輯來達成；雖然這些失敗整體上不是由注意力驅動的，但消融貢獻最大的注意力頭可以恢復60--64%的失敗，對於從前綴複製標記的延續，恢復效果更強，將注意力定位為一種補充的後備機制，而非主要機制。編輯模式的排序和成功-局部性權衡在所有四個模型中轉移，成功率隨著模型大小而增長，而非模型家族。

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

2606.18747v1 by Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

摘要：表達性手勢對於自然和有效的溝通至關重要，當口頭提示不足時，手勢可以補充語言（例如，指向）。對於像人形機器人Pepper這樣的社交機器人，產生自然和表達性的動作對於改善人機互動（HRI）和長期接受度至關重要。然而，由於依賴專家創作的動畫，生成手勢仍然具有挑戰性，導致行為僵硬，對於動態和多樣化的環境不切實際。相反，機器學習方法常常難以捕捉到感知的自然性，隨著自由度的增加，這一挑戰變得越來越艱難。因此，生成表達性機器人手勢需要一個能夠適應環境的系統，同時遵循社會規範和物理限制。最近在大型語言模型（LLMs）方面的進展使得動態代碼生成成為可能，為從自然語言進行運行時手勢合成提供了新的機會。在本文中，我們將ChatGPT整合到人形機器人Pepper中，以生成與對話輸出相一致的共語手勢。雖然這一基線使得手勢生成變得靈活，但產生的動作往往被認為是僵硬和不自然的。為了解決這一限制，我們引入了一個基於人類反饋的迭代強化學習（RLHF）系統，該系統根據用戶評估微調手勢生成，利用迭代用戶研究來比較Pepper生成的手勢。我們的結果顯示，RLHF改善了LLM的共語生成能力，產生了更具表達性、相關性和流暢性的動作。

What Must Generalist Agents Remember?

2606.18746v1 by Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.

摘要：這篇論文發展了一個正式的理論，說明一般代理在多個環境和目標中近乎最佳行動所需儲存的記憶。它顯示當兩個領域共享觀察瓶頸但需要不相容的最佳行動時，任何均勻的近最佳政策必須在該瓶頸處產生不同的記憶分佈。這一結果產生了一個分離定理：足夠成功的代理不能僅依賴當前狀態觀察，而必須在記憶中保留與領域相關的信息。論文進一步顯示，如果一個代理的記憶包含足夠的信息來估計相關目標的值，那麼該記憶可以用來近似重建代理的局部轉移動力學。這些結果共同描述了記憶作為支持領域消歧、轉移模型重建和一般代理規劃的基礎。

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

2606.18733v1 by Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang, Hanwen Du, Zhongkai Sun

Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only pre-$T_0$ repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1\% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.

摘要：現實的編碼代理基準通常重播公共 GitHub 問題和拉取請求，使其容易與模型預訓練、微調、合成數據生成或基準驅動的模型選擇重疊。完全合成的任務避免了直接的歷史重播，但可能會偏離實際倉庫的需求。我們提出了 SWE-Future，一種針對未來導向編碼任務的預測條件數據合成方法。給定時間 $T_0$ 的預測快照，該方法僅使用 $T_0$ 之前的倉庫證據來預測未來的功能實現/增強、錯誤修復和重構任務類別。我們首先回顧性地驗證這一步預測：在預測固定後，後續的拉取請求僅用於測量預測的任務類別是否與未來的倉庫工作匹配。在一項涉及 80 個倉庫的研究中，預測器在主要語義匹配指標下達到了 58.1\% 的未來工作相關性。我們然後使用經過驗證的預測類別作為條件信號，從任務生成快照合成一個跨越 61 個倉庫的 200 任務編碼代理數據集，而不是重播用於驗證的後續拉取請求。SWE-Future 顯示，倉庫演變預測可以指導現實的、未來導向的編碼任務合成，同時減少對歷史拉取請求重播的直接依賴。

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

2606.18728v1 by Songhan Zuo, Shengbin Yue, Tao Chiang, Guanying Li, Yun Song, Xuanjing Huang, Zhongyu Wei

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

摘要：民事訴訟本質上是一個生命週期過程：律師在第一天起草的內容限制了幾個月後在審判中展開的情況。然而，現有的法律基準評估的是孤立的子任務，而之前的法律代理模擬器則從共享的真實基礎重新初始化每個場景，未能建模跨階段的因果依賴。我們提出了LegalWorld，一個生命週期互動環境，將中國民事訴訟建模為五個階段（七個子場景）之間因果相連的狀態鏈，基於75,309對中國民事判決的配對。我們將其與可重用的基礎設施（本地記憶、全局案件記憶、技能/工具庫）配對，使每個爭端在其整個生命週期內保持一致。在這個環境的基礎上，我們構建了LongJud-Bench，以評估代理在所有五個相連階段的能力。來自217名法律背景評估者的18,992條評分確認LegalWorld的軌跡在程序上是忠實的且角色一致；而能力水平的跨模型評估揭示了聚合分數無法顯示的明顯差異，沒有單一的骨幹模型在諮詢、起草和法庭辯護中領先。詳細資源將公開發布。

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

2606.18726v1 by Fang Wang, Ernesto Damiani

Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge appears as full event sequence generation, whereas existing work mainly addresses component tasks such as next activity, remaining time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self attention, and injects process topology through graph grounded cross attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event level and sequence level attributes in a single pass, followed by Viterbi style graph constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than local instruction prompted LLM baselines. GGATN achieves strong performance on sequence similarity, Damerau Levenshtein similarity, bigram based control flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.

摘要：結構受限的事件序列生成仍然具有挑戰性，因為生成的路徑必須保持轉換的可行性、時間順序、終止和屬性一致性。在預測性過程監控（PPM）中，這一挑戰表現為完整事件序列生成，而現有的工作主要針對下一活動、剩餘時間、結果和屬性預測等組件任務。本文提出了圖基交叉注意力Transformer神經網絡（GGATN）來解決這一統一的PPM任務。GGATN使用全局過程圖作為結構化活動記憶，通過Transformer自注意力對序列位置進行上下文化，並通過圖基交叉注意力注入過程拓撲。與自回歸解碼不同，GGATN在單次通過中生成活動、時間戳、長度以及事件級和序列級屬性，隨後進行維特比風格的圖約束解碼以獲得可行路徑和明確的終止。在六個基準事件日誌上的實驗顯示，其生成質量比本地指令提示的LLM基線更可靠。GGATN在序列相似性、Damerau Levenshtein相似性、基於二元組的控制流相似性和持續時間分佈方面表現出色，同時保持零幻覺活動和零序列級屬性不一致。消融分析確認全局圖編碼器作為穩定的結構先驗。可解釋性分析顯示圖結構、序列上下文、反饋精煉和約束解碼如何塑造生成。
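
The graph-constrained decoding step mentioned in this abstract can be illustrated with a small Viterbi-style search. This is a generic sketch under stated assumptions, not GGATN's decoder: per-position activity log-scores are taken as given, the process graph is reduced to a set of allowed `(previous, next)` edges, and the function name `constrained_viterbi` plus the toy activities are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple


def constrained_viterbi(
    scores: Sequence[Dict[str, float]],
    edges: Set[Tuple[str, str]],
    start: str,
    end: str,
) -> Optional[List[str]]:
    """Return the best-scoring path that only uses allowed graph edges.

    scores[t][a] is the log-score of activity a at position t. The path must
    begin with `start` at position 0 and finish with `end` at the last
    position; None is returned when no feasible path exists.
    """
    if not scores or start not in scores[0]:
        return None
    # best[a] = (total score, path) for the best feasible path ending in a.
    best: Dict[str, Tuple[float, List[str]]] = {start: (scores[0][start], [start])}
    for step in scores[1:]:
        nxt: Dict[str, Tuple[float, List[str]]] = {}
        for act, act_score in step.items():
            candidates = [
                (prev_score + act_score, path + [act])
                for prev, (prev_score, path) in best.items()
                if (prev, act) in edges
            ]
            if candidates:
                nxt[act] = max(candidates, key=lambda c: c[0])
        best = nxt
    return best[end][1] if end in best else None


# Position-wise scores prefer C at step 1, but the graph has no C -> E edge.
step_scores = [{"A": 0.0}, {"B": -2.0, "C": -0.1}, {"E": -0.1, "C": -0.5}]
graph = {("A", "B"), ("A", "C"), ("B", "E")}
path = constrained_viterbi(step_scores, graph, "A", "E")
infeasible = constrained_viterbi(step_scores, {("A", "C")}, "A", "E")
```

A greedy per-position argmax would pick A, C, E here, which contains a transition absent from the graph; restricting the search to graph edges and requiring the end activity is one way to obtain feasible paths with explicit termination.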

LLM
