Introduces VideoAR, a large-scale visual autoregressive framework that combines intra-frame autoregressive modeling with causal next-frame prediction, enabled by a 3D multi-scale tokenizer to efficiently capture spatio-temporal dynamics.
Proposes several techniques—Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask—to reduce error accumulation over time and improve long-term temporal coherence in generated videos (the frame-masking idea is sketched after this summary).
Demonstrates state-of-the-art performance among autoregressive video models, significantly improving FVD on UCF-101 while requiring over 10× fewer inference steps, and achieving VBench scores competitive with much larger diffusion-based models via a multi-stage spatial–temporal pretraining pipeline.
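As an illustration of the Random Frame Mask idea, the minimal sketch below randomly replaces the tokens of whole context frames with a mask token during training, so the next-frame predictor learns to cope with imperfect history. The masking schedule, tensor layout, and `mask_token_id` are assumptions for illustration, not details from the paper.

```python
import torch

def random_frame_mask(frame_tokens: torch.Tensor, mask_token_id: int,
                      p_frame: float = 0.15) -> torch.Tensor:
    """Randomly mask whole context frames in a (batch, frames, tokens_per_frame) tensor.

    Dropping entire past frames forces the model to predict the next frame from
    corrupted history, which is one plausible way to curb error accumulation.
    """
    b, f, _ = frame_tokens.shape
    drop = torch.rand(b, f, device=frame_tokens.device) < p_frame
    drop[:, 0] = False  # keep the first frame intact as a clean anchor
    masked = frame_tokens.clone()
    masked[drop] = mask_token_id
    return masked
```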
The study developed and evaluated a Vision Transformer-based deep learning model (using the USFM framework) for automatic segmentation of pancreatic tumors on endoscopic ultrasound (EUS) images, trained on over 17,000 images from two public datasets.
In internal 5-fold cross-validation, the model achieved moderate-to-good segmentation performance (mean DSC ≈ 0.65, IoU ≈ 0.58) with high specificity (~99%) and accuracy (~98%), indicating reliable identification of non-tumor regions.
On an independent external test set, the model maintained similar performance (DSC ≈ 0.66, IoU ≈ 0.61, sensitivity ~72%, specificity ~98%), but in about 9.7% of cases it incorrectly predicted multiple tumors, underscoring the need for more standardized data and further prospective validation.
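For reference, the reported overlap metrics follow standard definitions; the sketch below computes them per image from binary masks (the study's exact thresholding and averaging may differ).

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    """DSC, IoU, sensitivity, specificity, and accuracy for binary masks (1 = tumor)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # tumor pixels correctly predicted
    fp = np.sum(pred & ~gt)   # background predicted as tumor
    fn = np.sum(~pred & gt)   # tumor missed
    tn = np.sum(~pred & ~gt)  # background correctly predicted
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
    }
```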
Introduces LayerGS, a method that decomposes a posed human into separate, animatable 3D layers (body and different garments) using 2D Gaussian splatting for accurate geometry and photorealistic rendering.
Handles occluded and hidden regions of clothing and body by inpainting them with a pretrained 2D diffusion model via score distillation sampling, allowing complete 3D avatars even where the original data had no direct visibility (see the SDS sketch after this summary).
Uses a staged training pipeline, progressing from coarse single-layer reconstruction to joint multi-layer refinement, to achieve superior rendering quality, layer separation, and recombination compared to prior methods, enabling realistic virtual try-on from new viewpoints and poses.
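The score-distillation inpainting step can be sketched as below; `add_noise` and `predict_noise` are placeholder names for a pretrained 2D diffusion model's API, and the timestep weighting is one common choice rather than the paper's.

```python
import torch

def sds_step(rendered: torch.Tensor, diffusion, text_emb: torch.Tensor) -> None:
    """Push gradients from a pretrained 2D diffusion prior into the Gaussian layers.

    `rendered` is an image of an occluded garment/body region produced by the
    differentiable 2D Gaussian splatting renderer, so gradients flow back to
    the Gaussian parameters through it.
    """
    b = rendered.shape[0]
    t = torch.randint(20, 980, (b,), device=rendered.device)   # mid-range timesteps
    noise = torch.randn_like(rendered)
    noisy = diffusion.add_noise(rendered, noise, t)             # forward diffusion
    with torch.no_grad():
        eps_hat = diffusion.predict_noise(noisy, t, text_emb)   # guided noise prediction
    w = (1.0 - t.float() / 1000.0).view(-1, 1, 1, 1)            # simple timestep weighting
    grad = w * (eps_hat - noise)
    rendered.backward(gradient=grad)                            # inject d(loss)/d(rendered)
```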
Defines the Multimodal Auto-Completion (MAC) task, where upcoming characters in live chat are predicted using both partially typed text and shared visual context, and constructs benchmark datasets by adapting MMDialog and ImageChat.
Systematically evaluates state-of-the-art vision-language models against strong text-only auto-completion baselines, revealing accuracy–efficiency trade-offs and demonstrating that multimodal grounding better captures user intent.
Introduces Router-Suggest, a dynamic routing framework (with a lightweight variant) that decides, based on dialog context, when to use text-only models versus vision-language models, achieving 2.3×–10× speedups over the best VLM while improving user satisfaction and reducing typing effort in user studies.
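A minimal sketch of the routing idea follows; the feature set, threshold, and `router` interface are illustrative assumptions, not Router-Suggest's actual design.

```python
def suggest_completion(prefix: str, turns_since_image: int, image_present: bool,
                       text_lm, vlm, router, threshold: float = 0.5) -> str:
    """Route an auto-completion request to a cheap text-only LM or a heavier VLM.

    `router` is any lightweight scorer estimating how much the shared image
    matters for completing this prefix (e.g., a logistic model over simple
    dialog features).
    """
    features = [len(prefix), float(image_present), float(turns_since_image)]
    needs_vision = image_present and router.score(features) > threshold
    model = vlm if needs_vision else text_lm
    return model.complete(prefix)
```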
Introduces Goal Force, a framework where users specify goals for video world models using explicit force vectors and intermediate dynamics instead of ambiguous text prompts or hard-to-specify target images (one possible force encoding is sketched after this summary).
Trains a video generation model on a curated set of simple synthetic physics scenarios (e.g., collisions, falling dominoes) so it learns to propagate forces through time and space like an implicit neural physics simulator.
Demonstrates strong zero-shot generalization from these simple training setups to complex, real-world tasks such as tool use and multi-object causal chains, enabling precise, physics-aware planning without external physics engines.
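One way to make force-based goal specification concrete is to rasterize the force into a dense conditioning tensor, as in the sketch below; this encoding (channels for fx, fy, and an application mask) is an assumption for illustration, not Goal Force's actual interface.

```python
import torch

def force_conditioning(num_frames: int, height: int, width: int,
                       point: tuple[int, int], force: tuple[float, float],
                       t_apply: int = 0) -> torch.Tensor:
    """Encode a user-specified force as a (frames, 3, H, W) conditioning map.

    Channels 0/1 carry the force components (fx, fy) at the chosen pixel and
    frame; channel 2 marks where and when the force is applied. The tensor is
    meant to be concatenated with the video model's input as extra channels.
    """
    cond = torch.zeros(num_frames, 3, height, width)
    x, y = point
    fx, fy = force
    cond[t_apply, 0, y, x] = fx
    cond[t_apply, 1, y, x] = fy
    cond[t_apply, 2, y, x] = 1.0
    return cond
```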