AdaFuse introduces an adaptive ensemble decoding framework that dynamically chooses when and how to fuse multiple LLMs during generation, instead of relying on a fixed fusion granularity.
It uses an uncertainty-based criterion: when the model is confident, decoding continues normally; when it is uncertain, a diversity-aware test-time scaling step explores multiple candidate continuations before ensembling.
Through this synergy between adaptive ensembling and targeted test-time scaling, AdaFuse achieves consistent performance gains over strong ensemble baselines across QA, arithmetic reasoning, and machine translation, with an average relative improvement of 6.88%.
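The confidence-gated control flow described above can be sketched in a few lines. This is a minimal illustration under our own assumptions: the entropy threshold, the simple distribution-averaging rule, and all function names are invented for exposition and are not AdaFuse's actual decoding interface.

```python
import math

def entropy(p):
    """Shannon entropy of a next-token distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def adaptive_fuse(step_dists, tau=1.0):
    """Illustrative uncertainty-gated fusion at one decoding step.
    `step_dists` holds each model's next-token distribution; the first
    entry is the lead model. If the lead model is confident (entropy
    below `tau`), keep its prediction; otherwise average all models'
    distributions before picking a token."""
    lead = step_dists[0]
    if entropy(lead) < tau:
        fused = lead  # confident: decode normally
    else:
        # uncertain: ensemble the candidate distributions
        n = len(step_dists)
        fused = [sum(d[i] for d in step_dists) / n for i in range(len(lead))]
    return max(range(len(fused)), key=fused.__getitem__)

# Confident lead model: its own argmax wins.
print(adaptive_fuse([[0.9, 0.05, 0.05], [0.2, 0.6, 0.2]]))   # -> 0
# Near-uniform (uncertain) lead model: the averaged ensemble decides.
print(adaptive_fuse([[0.34, 0.33, 0.33], [0.1, 0.8, 0.1]]))  # -> 1
```

A real implementation would gate on the model's logits and run the paper's diversity-aware exploration rather than a plain average; the sketch only shows where the uncertainty gate sits in the decoding loop.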
Introduces a molecular-style framework for long chain-of-thought (Long CoT) reasoning, modeling trajectories as structures built from three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like).
Empirically shows that effective Long CoT patterns emerge from dedicated Long CoT fine-tuning rather than simple keyword imitation, and defines Effective Semantic Isomers: among competing reasoning structures, only those that drive fast entropy convergence yield stable, learnable reasoning, while the others hurt training.
Proposes Mole-Syn, a distribution-transfer-graph method that actively guides the construction of effective Long CoT ‘molecular’ structures, improving reasoning performance and stabilizing reinforcement learning across multiple benchmarks.
The paper formalizes the task of Data-centric Solution Preference and builds a large dataset of 18,438 pairwise comparisons to study how to choose better machine learning solutions without fully executing them.
It shows that large language models, when given a Verified Data Analysis Report, can predict which solution will perform better with 61.5% accuracy and well-calibrated confidence, effectively acting as a fast surrogate for expensive runtime checks.
The authors implement this idea in an agent called FOREAGENT that uses a Predict-then-Verify loop, achieving 6× faster convergence and a 6% performance gain over traditional execution-based agent baselines.
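The Predict-then-Verify idea above can be sketched as a simple loop. Everything here is an assumption made for illustration: `predict_then_verify`, the confidence threshold, and the stub `predict`/`execute` callables are ours, not FOREAGENT's API.

```python
def predict_then_verify(pairs, predict, execute, conf_threshold=0.8):
    """Illustrative Predict-then-Verify loop. `predict(a, b)` returns
    (winner, confidence) from an LLM surrogate; `execute(sol)` is the
    expensive runtime score, consulted only when the surrogate is
    unsure. Returns the chosen winners and how many runtime checks
    were actually needed."""
    winners, executions = [], 0
    for a, b in pairs:
        choice, conf = predict(a, b)
        if conf < conf_threshold:  # low confidence: fall back to execution
            executions += 1
            choice = a if execute(a) >= execute(b) else b
        winners.append(choice)
    return winners, executions

# Deterministic stubs standing in for the LLM and the runtime.
scores = {"s1": 0.7, "s2": 0.9, "s3": 0.5}
def predict(a, b):
    # confident only on the first pair, unsure on the second
    return ("s2", 0.95) if (a, b) == ("s1", "s2") else (a, 0.4)

winners, executions = predict_then_verify(
    [("s1", "s2"), ("s1", "s3")], predict, lambda s: scores[s])
print(winners, executions)  # -> ['s2', 's1'] 1
```

The speedup reported in the summary comes from exactly this asymmetry: most comparisons are settled by the cheap surrogate, and only low-confidence ones pay for execution.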
The paper shows that standard point-wise confidence measures like self-consistency can be misleading, because answers that appear perfectly confident can quickly fail when the question is placed in slightly different or distracting contexts.
It introduces Neighbor-Consistency Belief (NCB), a structural metric that checks how consistently a model answers related, contextually perturbed versions of a question, and demonstrates that high-NCB items are more robust under a new cognitive stress-testing protocol.
The authors propose Structure-Aware Training (SAT), a training strategy that explicitly encourages context-invariant belief structures and empirically reduces brittle, long-tail knowledge errors by about 30%.
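A neighbor-consistency score of the kind NCB describes can be sketched as the agreement rate between a model's answer to a question and its answers to perturbed variants. The exact formulation in the paper may differ; the function name and the stub answerer below are illustrative.

```python
def neighbor_consistency_belief(answer_fn, question, perturbations):
    """Illustrative neighbor-consistency score: ask the model the
    original question and contextually perturbed variants, and return
    the fraction of variants whose answer agrees with the original."""
    base = answer_fn(question)
    agree = sum(answer_fn(p) == base for p in perturbations)
    return agree / len(perturbations)

# Stub model: answers flip under one distracting context.
answers = {
    "Q": "Paris",
    "Q + noisy context": "Paris",
    "Q + distractor": "Lyon",
    "Q rephrased": "Paris",
}
score = neighbor_consistency_belief(
    answers.get, "Q", ["Q + noisy context", "Q + distractor", "Q rephrased"])
print(round(score, 3))  # -> 0.667
```

A point-wise measure like self-consistency would see only the confident "Paris" on the original question; the neighbor score exposes the brittleness under the distractor.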
The paper systematically studies how different preference-tuning alignment objectives generalize when models are used in new domains, focusing on helpfulness in summarization and question-answering tasks.
It compares five popular alignment objectives together with multiple adaptation strategies, including target-domain supervised fine-tuning and pseudo-labeling, to understand their impact on both performance and response diversity under domain shift.
The authors find that alignment objectives differ markedly in how well they transfer to new domains, and show that adaptation methods based on pseudo-labeling can substantially reduce the performance degradation caused by domain shift.
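Pseudo-labeling for preference adaptation, as used in the comparison above, can be sketched as scoring sampled target-domain responses to form pseudo preference pairs. This is our simplification: `generate` and `reward` stand in for the policy's sampler and a scoring model, and the function name is invented.

```python
def make_pseudo_preference_pairs(prompts, generate, reward):
    """Illustrative pseudo-labeling step: sample two responses per
    target-domain prompt and rank them with a scorer to create
    (prompt, chosen, rejected) pairs for further preference tuning."""
    pairs = []
    for p in prompts:
        a, b = generate(p), generate(p)
        chosen, rejected = (a, b) if reward(p, a) >= reward(p, b) else (b, a)
        pairs.append((p, chosen, rejected))
    return pairs

# Stubs: two canned samples and a fixed scorer.
responses = iter(["draft", "polished"])
scores = {"draft": 0.2, "polished": 0.8}
pairs = make_pseudo_preference_pairs(
    ["summarize this"], lambda p: next(responses), lambda p, r: scores[r])
print(pairs)  # -> [('summarize this', 'polished', 'draft')]
```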
The authors design a realistic, expert-crafted Czech divorce and shared-parenting scenario, including both gendered and gender-neutral versions, plus nine legally relevant factors that systematically vary the case details.
They evaluate four leading LLMs (GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3) in zero-shot mode to see how these models recommend shared-parenting ratios under different factual and gender setups.
Their preliminary analysis finds differences across models and suggests gender-dependent patterns in some systems’ parenting recommendations, highlighting risks for lay users seeking legal help and the need for stronger evaluation of LLMs in sensitive legal domains.
Identifies catastrophic forgetting as a key problem when adapting large language models into small language models for low-resource languages in multilingual settings.
Proposes a continual learning approach that combines parts-of-speech-based code-switching with a replay adapter mechanism to preserve previously learned linguistic knowledge while learning new languages.
Demonstrates that the proposed method improves performance and reduces forgetting on both vision–language tasks (like visual question answering) and standard language modeling tasks for low-resource languages.
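The parts-of-speech-based code-switching component can be sketched as a tag-conditioned token swap. The tag set, switching policy, and lexicon below are assumptions for illustration, not the paper's exact recipe (the example swaps English content words for Indonesian counterparts).

```python
def pos_code_switch(tagged_tokens, lexicon, switch_tags=frozenset({"NOUN", "VERB"})):
    """Illustrative POS-based code-switching: tokens whose POS tag is
    in `switch_tags` are replaced by their low-resource-language
    counterpart when the lexicon has one; other tokens are kept."""
    return [lexicon.get(tok, tok) if tag in switch_tags else tok
            for tok, tag in tagged_tokens]

tagged = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
lexicon = {"cat": "kucing", "sleeps": "tidur"}
print(pos_code_switch(tagged, lexicon))  # -> ['the', 'kucing', 'tidur']
```

Mixing such code-switched sequences into continual training, alongside a replay adapter over earlier languages' data, is what the summary credits with reducing forgetting.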
The paper studies how the order of training examples (a curriculum) affects preference optimization for machine translation, a factor that prior work largely ignored.
It introduces CLewR, a curriculum learning strategy with restarts that repeatedly sweeps from easy to hard examples, revisiting easier cases to reduce catastrophic forgetting of them during training.
The authors show that CLewR yields consistent translation quality improvements across multiple large language model families (Gemma2, Qwen2.5, Llama3.1) and various preference optimization methods, and they release their code publicly.
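The restart schedule can be sketched as repeated easy-to-hard sweeps over the training pool. The function below is a minimal sketch under our assumptions (a scalar difficulty per example and full-sweep restarts); CLewR's actual schedule may differ.

```python
def curriculum_with_restarts(examples, difficulty, restarts=3):
    """Illustrative easy-to-hard curriculum with restarts: sort
    examples by difficulty, then replay the full sweep `restarts`
    times so easier cases are revisited rather than forgotten."""
    ordered = sorted(examples, key=difficulty)
    schedule = []
    for _ in range(restarts):
        schedule.extend(ordered)
    return schedule

difficulties = {"a": 1, "b": 3, "c": 2}
schedule = curriculum_with_restarts(["b", "a", "c"], difficulties.get, restarts=2)
print(schedule)  # -> ['a', 'c', 'b', 'a', 'c', 'b']
```

A single easy-to-hard pass ends with the model having last seen only hard cases; the restarts are what counteract that drift.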
Defines the Multimodal Auto-Completion (MAC) task, where upcoming characters in live chat are predicted using both partially typed text and shared visual context, and constructs benchmark datasets by adapting MMDialog and ImageChat.
Systematically evaluates state-of-the-art vision-language models against strong text-only auto-completion baselines, revealing accuracy–efficiency trade-offs and demonstrating that multimodal grounding better captures user intent.
Introduces Router-Suggest, a dynamic routing framework (with a lightweight variant) that decides when to use text-only models vs. vision-language models based on dialog context, achieving 2.3x–10x speedups over the best VLM while improving user satisfaction and reducing typing effort in user studies.
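The routing decision behind the speedups above can be sketched as a cheap gate in front of the expensive VLM. The real router is learned from dialog context; the heuristic, cue words, and function name below are invented purely for illustration.

```python
def route(dialog_turns, image_present,
          visual_cues=("look", "photo", "picture", "this", "image")):
    """Illustrative router: default to the cheap text-only model, and
    escalate to the VLM only when a shared image exists and the recent
    dialog plausibly refers to it."""
    if not image_present:
        return "text-only"
    recent = " ".join(dialog_turns[-2:]).lower()
    return "vlm" if any(cue in recent for cue in visual_cues) else "text-only"

print(route(["nice weather today"], image_present=False))      # -> text-only
print(route(["check out this photo"], image_present=True))     # -> vlm
print(route(["how are you doing"], image_present=True))        # -> text-only
```

The latency win comes from the same asymmetry as in the summary: most completions never touch the VLM, so the framework pays the multimodal cost only when visual grounding is likely to change the suggestion.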