The study developed and evaluated a Vision Transformer-based deep learning model (using the USFM framework) for automatic segmentation of pancreatic tumors on endoscopic ultrasound (EUS) images, trained on over 17,000 images from two public datasets.
In internal 5-fold cross-validation, the model achieved moderate-to-good segmentation performance (mean DSC ≈ 0.65, IoU ≈ 0.58) with high specificity (~99%) and accuracy (~98%), indicating reliable identification of non-tumor regions.
On an independent external test set, the model maintained comparable performance (DSC ≈ 0.66, IoU ≈ 0.61, sensitivity ~72%, specificity ~98%), but in about 9.7% of cases it incorrectly predicted multiple tumors, underscoring the need for more standardized data and further prospective validation.
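The reported metrics (DSC, IoU, sensitivity, specificity) can all be computed from a pair of binary masks; a minimal NumPy sketch, with illustrative toy masks:

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Compute DSC, IoU, sensitivity, and specificity for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0,
        "iou": tp / (tp + fp + fn) if (tp + fp + fn) else 1.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 1.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 1.0,
    }

# Toy 4x4 masks: prediction overlaps the truth on 2 of its 3 pixels.
pred  = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
truth = np.array([[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
m = segmentation_metrics(pred, truth)
```

The high specificity alongside a moderate DSC in the study is typical of this metric family: background pixels dominate, so specificity stays near 1 even when tumor overlap is imperfect.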
The paper formalizes the task of Data-centric Solution Preference and builds a large dataset of 18,438 pairwise comparisons to study how to choose better machine learning solutions without fully executing them.
It shows that large language models, when given a Verified Data Analysis Report, can predict which solution will perform better with 61.5% accuracy and well-calibrated confidence, effectively acting as a fast surrogate for expensive runtime checks.
The authors implement this idea in an agent called FOREAGENT that uses a Predict-then-Verify loop, achieving 6× faster convergence and a 6% performance gain over traditional execution-based agent baselines.
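The Predict-then-Verify idea can be sketched abstractly: a cheap predictor ranks candidate solutions pairwise, and the expensive execution budget is spent only on the top-ranked ones. This is not FOREAGENT's actual implementation; the length-based `predict_better` heuristic stands in for an LLM judge reading Verified Data Analysis Reports:

```python
def predict_better(report_a: str, report_b: str) -> str:
    """Stand-in for an LLM judge comparing two analysis reports.
    Toy heuristic: prefer the longer report."""
    return "a" if len(report_a) >= len(report_b) else "b"

def predict_then_verify(candidates, execute, budget=2):
    """Rank candidates via cheap pairwise predictions, then actually
    execute (verify) only the top `budget` of them."""
    wins = {name: 0 for name, _ in candidates}
    for i, (na, ra) in enumerate(candidates):
        for nb, rb in candidates[i + 1:]:
            winner = na if predict_better(ra, rb) == "a" else nb
            wins[winner] += 1
    ranked = sorted(candidates, key=lambda c: wins[c[0]], reverse=True)
    scores = {name: execute(name) for name, _ in ranked[:budget]}
    return max(scores, key=scores.get)

# Hypothetical candidates and ground-truth execution scores.
candidates = [
    ("s1", "short"),
    ("s2", "a much longer report text"),
    ("s3", "medium len"),
]
true_scores = {"s1": 0.9, "s2": 0.5, "s3": 0.7}
best = predict_then_verify(candidates, execute=true_scores.get, budget=2)
```

Note that in this toy run the predictor's ranking misses the truly best candidate `s1`, illustrating why verification on the shortlist (rather than blind trust in the predictor) matters.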
Introduces Cedalion, an open-source Python framework that unifies the full analysis pipeline for fNIRS and DOT—covering forward modeling, optode co-registration, signal processing, GLM analysis, and DOT image reconstruction—within a single, standardized architecture.
Ensures reproducible, scalable workflows by adhering to SNIRF and BIDS standards, providing cloud-executable Jupyter notebooks, containerized pipelines, automated documentation linked to source publications, and continuous-integration testing.
Seamlessly connects optical neuroimaging with modern machine learning and multimodal workflows by integrating with tools like scikit-learn and PyTorch, supporting multimodal fusion (e.g., EEG, MEG, physiology), and offering validated algorithms plus simulation and data-augmentation modules; the tutorial supplies seven fully executable example notebooks.

Introduces a formal framework for auditing group fairness when model owners can strategically and adaptively update their models, characterizing which updates are allowed as long as the audited property (e.g., fairness) is preserved.
Proposes a general PAC auditing procedure based on an Empirical Property Optimization (EPO) oracle, enabling efficient estimation of fairness properties using a minimal number of labeled samples even under arbitrary admissible updates.
Defines the SP dimension, a new combinatorial complexity measure that governs distribution-free sample complexity for auditing statistical parity, and shows that the same framework extends to other objectives such as prediction error and robust risk.
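The paper's EPO oracle and SP-dimension analysis go well beyond a short sketch, but the quantity being audited, the empirical statistical-parity gap, is simple to state; an illustrative NumPy sketch with a hypothetical threshold-based model:

```python
import numpy as np

def statistical_parity_gap(preds, groups):
    """Empirical statistical-parity gap: difference in positive-prediction
    rates between two demographic groups (coded 0/1)."""
    preds, groups = np.asarray(preds), np.asarray(groups)
    rate0 = preds[groups == 0].mean()
    rate1 = preds[groups == 1].mean()
    return abs(rate0 - rate1)

def audit(model, sample_x, sample_groups, tolerance=0.1):
    """PAC-style spot check on a fresh sample: flag the model if the
    empirical gap exceeds the tolerance."""
    preds = np.array([model(x) for x in sample_x])
    gap = statistical_parity_gap(preds, sample_groups)
    return gap <= tolerance, gap

# Toy model that always favors group 0 (positive inputs) -- maximal gap.
model = lambda x: 1 if x > 0 else 0
ok, gap = audit(model, [1, 2, 3, -1, -2, -3], [0, 0, 0, 1, 1, 1])
```

The framework's contribution is showing how few such labeled samples suffice for this estimate to remain valid even when the model owner adaptively swaps in admissible updated models between queries.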
The paper shows that standard point-wise confidence measures like self-consistency can be misleading, because answers that appear perfectly confident can quickly fail when the question is placed in slightly different or distracting contexts.
It introduces Neighbor-Consistency Belief (NCB), a structural metric that checks how consistently a model answers related, contextually perturbed versions of a question, and demonstrates that high-NCB items are more robust under a new cognitive stress-testing protocol.
The authors propose Structure-Aware Training (SAT), a training strategy that explicitly encourages context-invariant belief structures and empirically reduces brittle, long-tail knowledge errors by about 30%.
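The core idea behind a neighbor-consistency score can be sketched as the agreement rate of a model's answers over contextually perturbed variants of a question. The toy model and perturbations below are hypothetical, and this majority-agreement proxy is only a minimal interpretation of the paper's NCB metric:

```python
from collections import Counter

def neighbor_consistency(model, question, perturbations):
    """Fraction of answers (original + perturbed variants) that agree
    with the model's majority answer -- a proxy for neighbor consistency."""
    answers = [model(p(question)) for p in perturbations]
    answers.append(model(question))
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

# Toy "model": brittle substring matching, so a paraphrase breaks it.
def toy_model(q):
    return "Paris" if "capital of france" in q.lower() else "unsure"

perturbs = [
    lambda q: q + " Answer in one word.",
    lambda q: "Note: Berlin is in Germany. " + q,
    lambda q: q.replace("France", "the Republic of France"),
]
ncb = neighbor_consistency(toy_model, "What is the capital of France?", perturbs)
```

A point-wise confidence measure would score the unperturbed question as fully reliable here; the sub-1.0 consistency score exposes the brittleness that the stress-testing protocol targets.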
The paper systematically studies how different preference-tuning alignment objectives generalize when models are used in new domains, focusing on helpfulness in summarization and question-answering tasks.
It compares five popular alignment objectives together with multiple adaptation strategies, including target-domain supervised fine-tuning and pseudo-labeling, to understand their impact on both performance and response diversity under domain shift.
The authors find that alignment objectives differ markedly in how well they transfer to new domains, and show that adaptation methods based on pseudo-labeling can substantially reduce the performance degradation caused by domain shift.
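A pseudo-labeling adaptation step of the kind studied can be sketched generically: sample several responses per unlabeled target-domain prompt and keep the one a scoring model prefers as a pseudo-label for further tuning. The `generate` and `score` callables below are hypothetical stand-ins, not the paper's components:

```python
import itertools

def build_pseudo_labels(prompts, generate, score, k=4):
    """For each unlabeled target-domain prompt, sample k candidate
    responses and keep the scorer's favorite as a pseudo-label."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda resp: score(prompt, resp))
        dataset.append((prompt, best))
    return dataset

# Deterministic toy generator/scorer for illustration only.
_ids = itertools.count()
generate = lambda p: f"{p}-cand{next(_ids) % 4}"
score = lambda p, r: int(r[-1])  # toy scorer: prefer later candidates
pseudo = build_pseudo_labels(["q1"], generate, score, k=4)
```

The resulting (prompt, best-response) pairs can then drive target-domain supervised fine-tuning without any human labels in the new domain.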
Identifies and analyzes the problem of exploration collapse in RL-based training of LLM reasoning, showing why traditional entropy-based exploration either leads to reward hacking (verbose but unhelpful outputs) or fails to overcome pre-training biases.
Introduces IIB-LPO, a latent policy optimization method that branches reasoning at high-uncertainty (high-entropy) states and uses the Information Bottleneck both to filter trajectories and as a self-reward, encouraging diverse yet concise and informative reasoning paths.
Demonstrates state-of-the-art performance on four mathematical reasoning benchmarks, improving accuracy by up to 5.3% and reasoning diversity by up to 7.4% compared to previous RLVR approaches.
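The branching criterion, forking reasoning at high-uncertainty states, can be illustrated by flagging steps whose next-token entropy exceeds a threshold. This sketch covers only branch-point selection; IIB-LPO's latent-space branching and Information Bottleneck filtering/self-reward are substantially more involved:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def branch_points(step_distributions, threshold=1.0):
    """Indices of reasoning steps whose next-token entropy exceeds the
    threshold -- candidate states at which to branch alternative paths."""
    return [i for i, probs in enumerate(step_distributions)
            if entropy(probs) > threshold]

# Illustrative per-step distributions along one reasoning trajectory.
dists = [
    [1.0],                     # fully determined step
    [0.5, 0.5],                # entropy ln 2 ~= 0.69
    [0.25, 0.25, 0.25, 0.25],  # entropy ln 4 ~= 1.39 -> branch here
    [0.9, 0.1],                # low entropy
]
points = branch_points(dists, threshold=1.0)
```

Branching only where the policy is genuinely uncertain is what lets the method diversify reasoning without the indiscriminate verbosity that plain entropy bonuses tend to reward.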
The paper studies how the order of training examples (a curriculum) affects preference optimization for machine translation, a factor that prior work largely ignored.
It introduces CLewR, a curriculum learning strategy with restarts that repeatedly goes from easy to hard examples to reduce catastrophic forgetting of easier cases during training.
The authors show that CLewR yields consistent translation quality improvements across multiple large language model families (Gemma2, Qwen2.5, Llama3.1) and various preference optimization methods, and they release their code publicly.
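The scheduling idea, repeated easy-to-hard passes so that easier cases are revisited rather than forgotten, can be sketched in a few lines. The difficulty measure and restart count here are placeholders; the paper's actual CLewR schedule may differ in both:

```python
def curriculum_with_restarts(examples, difficulty, restarts=3):
    """Order examples easy-to-hard by a difficulty score, then repeat the
    full easy-to-hard pass `restarts` times so early (easier) examples
    are revisited instead of being catastrophically forgotten."""
    ordered = sorted(examples, key=difficulty)
    schedule = []
    for _ in range(restarts):
        schedule.extend(ordered)
    return schedule

# Toy difficulty proxy: longer source string = harder example.
schedule = curriculum_with_restarts(["hard!", "ez", "mid."],
                                    difficulty=len, restarts=2)
```

Feeding preference-optimization updates in this order, rather than a single shuffled pass, is the mechanism the paper credits for retaining quality on easier translation cases.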