AI Research: Latest Breakthroughs In Denoising Models And Multimodal AI

by Alex Johnson

In the rapidly evolving landscape of artificial intelligence, new research continually pushes the boundaries of what's possible. This edition of our daily digest dives into some of the most exciting advancements, particularly focusing on innovations in denoising generative models and the burgeoning field of multimodal foundation models. We’ll explore how researchers are refining techniques to create cleaner data, enhance spatial intelligence, and tackle complex segmentation and editing tasks across various domains.

Back to Basics: Denoising Generative Models Reimagined

Denoising generative models, a cornerstone of modern AI image synthesis, are getting a fresh look. Traditionally, these models, like denoising diffusion models, don't directly predict clean images. Instead, they often predict the noise itself or a noised version of the data. This paper, "Back to Basics: Let Denoising Generative Models Denoise," suggests that this indirect approach might be limiting. The researchers propose that directly predicting clean data, aligning with the manifold assumption that natural data resides on a low-dimensional manifold, could be more effective. This **"Just image Transformers" (JiT)** approach leverages simple, large-patch Transformers directly on pixels, eschewing complex tokenizers, pre-training, or extra losses. The results are compelling, showing competitive performance on ImageNet even at higher resolutions (256 and 512) where predicting noised quantities can falter. This return to basics for Transformer-based diffusion on raw natural data signifies a potential paradigm shift, emphasizing simplicity and directness in generative modeling. The implications for creating high-fidelity images with fewer computational hurdles are significant, potentially democratizing access to advanced generative AI tools. This research highlights a crucial ongoing theme in AI: sometimes, the most effective path forward is a deeper understanding and application of fundamental principles, even as we push the technological envelope.
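
To make the distinction concrete, the sketch below contrasts the classic noise-prediction objective with directly regressing clean pixels, in the spirit of the paper's argument. It is a minimal illustration, not the authors' code: the `model` callable, its `(images, t)` signature, and the linear noising schedule are assumptions.

```python
# Minimal sketch (not the paper's code): contrasting noise-prediction with the
# clean-data ("x-prediction") objective that JiT-style training favors.
import torch

def diffusion_step_loss(model, x_clean, predict="x"):
    """One denoising training step on a batch of images x_clean: (B, C, H, W)."""
    b = x_clean.shape[0]
    t = torch.rand(b, 1, 1, 1)                  # continuous noise level in [0, 1]
    noise = torch.randn_like(x_clean)
    x_noisy = (1 - t) * x_clean + t * noise     # simple linear interpolation schedule

    pred = model(x_noisy, t.flatten())          # hypothetical signature: (images, t)
    if predict == "eps":                        # classic objective: regress the noise
        target = noise
    else:                                       # "back to basics": regress clean pixels
        target = x_clean
    return torch.nn.functional.mse_loss(pred, target)
```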

Scaling Spatial Intelligence with Multimodal Foundation Models

The quest for truly intelligent AI systems hinges on their ability to understand and interact with the world in a comprehensive way, which requires robust spatial intelligence. While multimodal foundation models have made impressive strides, they often exhibit surprising weaknesses in this area. The paper "Scaling Spatial Intelligence with Multimodal Foundation Models" addresses this gap head-on by introducing the SenseNova-SI family. Built upon established visual and unified understanding/generation models, these foundation models are scaled up with a meticulously curated dataset, SenseNova-SI-8M, comprising eight million diverse samples categorized by spatial capabilities. The results are remarkable, showcasing unprecedented performance across a broad spectrum of spatial intelligence benchmarks, including VSI-Bench, MMSI, MindCube, ViewSpatial, and SITE. Crucially, these models maintain strong general multimodal understanding. Beyond benchmark performance, the research delves into the impact of data scaling, exploring emergent generalization capabilities and analyzing potential pitfalls like overfitting and language shortcuts. A preliminary study on spatial chain-of-thought reasoning is also presented, hinting at future advancements. The continuous updates and public release of these models underscore a commitment to fostering further research in this vital area, aiming to make AI systems more aware and capable of navigating the complexities of our three-dimensional world.

Segment Anything Across Shots: Mastering Video Segmentation

Video object segmentation (VOS) is a critical task for many applications, but existing methods often struggle with the complexities of real-world videos, particularly when dealing with multiple distinct scenes or 'shots'. The paper "Segment Anything Across Shots: A Method and Benchmark" tackles this challenge by focusing on multi-shot semi-supervised video object segmentation (MVOS). Traditional VOS methods, largely designed for single-shot videos, falter at shot discontinuities. To overcome this, the researchers introduce a transition mimicking data augmentation (TMA) strategy, enabling cross-shot generalization even with single-shot data, thereby mitigating the severe scarcity of annotated multi-shot data. Coupled with this is the Segment Anything Across Shots (SAAS) model, designed to effectively detect and comprehend shot transitions. To provide a solid foundation for future research, they also introduce Cut-VOS, a new MVOS benchmark featuring dense mask annotations, diverse object categories, and high-frequency transitions. Extensive experiments on both YouMVOS and Cut-VOS demonstrate SAAS's state-of-the-art performance, achieved through its adeptness at mimicking, understanding, and segmenting across complex transitions. The public release of code and datasets is a significant contribution, poised to accelerate progress in this challenging but crucial area of computer vision.
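
As a rough illustration of mimicking transitions from single-shot footage, the sketch below splices two temporally distant segments of the same annotated clip so the target object appears to "jump" as it would across a cut. This is a guess at the general mechanism, not the paper's TMA procedure; all names and lengths are placeholders.

```python
# Illustrative guess at transition-mimicking augmentation: stitch two temporally
# distant segments of one single-shot clip to simulate an abrupt shot cut while
# the annotated object persists. Not the paper's exact TMA implementation.
import random

def mimic_cut(frames, masks, seg_len=8, gap=30):
    """frames/masks: lists of per-frame arrays from one annotated single-shot clip."""
    start_a = random.randint(0, len(frames) - (2 * seg_len + gap))
    start_b = start_a + seg_len + gap              # skip ahead to a distant segment
    idx = list(range(start_a, start_a + seg_len)) + list(range(start_b, start_b + seg_len))
    spliced_frames = [frames[i] for i in idx]
    spliced_masks = [masks[i] for i in idx]
    return spliced_frames, spliced_masks, seg_len  # seg_len marks the simulated cut
```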

Unlocking Granularity Control: UnSAMv2 for Enhanced Segmentation

The Segment Anything Model (SAM) family has revolutionized vision foundation models, but its ability to control the *granularity* of segmentation has been a persistent limitation. Users often find themselves needing to manually refine segmentation masks, a process that can be tedious and ambiguous. "UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity" introduces a powerful solution to this problem. This work proposes UnSAMv2, which empowers users to segment anything at *any granularity* without requiring human annotations. Building upon the divide-and-conquer strategy of its predecessor, UnSAMv2 innovates by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding. This embedding allows for precise, continuous control over the scale of segmentation. The results are striking: with minimal unlabeled data and a slight increase in parameters, UnSAMv2 significantly enhances existing SAM models, achieving remarkable improvements in tasks like interactive, whole-image, and video segmentation. Evaluated across numerous benchmarks, UnSAMv2 demonstrates substantial gains in metrics like NoC90, 1-IoU, and AR1000. This research underscores the potential of self-supervised learning, particularly when combined with clever data augmentation and architectural innovations, to unlock new levels of control and performance in foundational AI models.
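
One plausible way to read "granularity control embedding" is as a learned conditioning token derived from a continuous scalar and appended to the prompt tokens of the mask decoder. The sketch below shows that pattern under those assumptions; the module design and value range are illustrative, not UnSAMv2's implementation.

```python
# Minimal sketch (assumptions, not UnSAMv2's code): injecting a continuous
# granularity scalar into a promptable decoder as an extra conditioning token.
import torch
import torch.nn as nn

class GranularityEmbedding(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, granularity):
        """granularity: (B,) floats in [0, 1], 0 = coarsest parts, 1 = finest parts."""
        return self.mlp(granularity.unsqueeze(-1))   # (B, dim) conditioning token

# Usage: append the embedding to the prompt tokens fed to the mask decoder.
emb = GranularityEmbedding(dim=256)
tok = emb(torch.tensor([0.25, 0.9]))   # two prompts at different granularities
print(tok.shape)                       # torch.Size([2, 256])
```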

Free-Form Scene Editor: Intuitive 3D Object Manipulation

Text-to-image diffusion models have transformed image editing, but performing intuitive, 3D-aware object manipulation in real-world images has remained a significant hurdle. "Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine" presents FFSE, a groundbreaking 3D-aware autoregressive framework designed for exactly this purpose. FFSE treats object editing as a sequence of learned 3D transformations, allowing users to perform actions like translation, scaling, and rotation while maintaining realistic background effects such as shadows and reflections, and crucially, preserving global scene consistency across multiple editing rounds. This is a significant leap from previous methods that either operated solely in image space or relied on slow, error-prone 3D reconstruction. To facilitate the training of such complex, multi-round manipulation, the researchers introduce 3DObjectEditor, a novel hybrid dataset created from simulated editing sequences across diverse objects and scenes. Extensive experiments confirm that FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios, paving the way for more intuitive and powerful image editing tools.

TiViBench: Benchmarking Reasoning in Video Generation

As video generative models become more sophisticated, the focus is shifting from mere visual plausibility to more complex reasoning capabilities, such as physical plausibility and logical consistency. However, evaluating these higher-order reasoning abilities has been challenging due to a lack of appropriate benchmarks. "TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models" introduces TiViBench, a hierarchical benchmark specifically designed to assess the reasoning skills of image-to-video (I2V) generation models. TiViBench systematically evaluates reasoning across four dimensions: structural reasoning & search, spatial & visual pattern reasoning, symbolic & logical reasoning, and action planning & task execution, covering 24 diverse task scenarios across three difficulty levels. The benchmark reveals that while commercial models like Sora 2 and Veo 3.1 show stronger reasoning potential, open-source models still hold significant untapped promise, limited by training scale and data diversity. To help unlock this potential, the paper also introduces VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. VideoTPO uses LLM self-analysis to improve reasoning performance without requiring additional training. Together, TiViBench and VideoTPO provide a crucial framework for evaluating and advancing the reasoning capabilities of video generation models.

Crossing Borders: Translating and Visualizing Indian Poetry

Indian poetry, with its rich linguistic complexity and deep cultural resonance, often poses challenges for translation and comprehension. "Crossing Borders: A Multimodal Challenge for Indian Poetry Translation and Image Generation" introduces the Translation and Image Generation (TAI) framework to address this. This framework leverages Large Language Models (LLMs) and Latent Diffusion Models to enhance the accessibility of culturally rich Indian-language poetry, supporting SDG 4 (Quality Education) and SDG 10 (Reduced Inequalities). The TAI framework consists of two core modules: a translation module employing an Odds Ratio Preference Alignment Algorithm for accurate poetry translation into English, and an image generation module that uses a semantic graph to visually represent the poems' metaphors and meanings. Comprehensive evaluations, including human assessments, demonstrate the superiority of TAI Diffusion in poem image generation. Furthermore, to combat the scarcity of resources for Indian-language poetry, the paper introduces MorphoVerse, a dataset of 1,570 morphologically rich poems spanning 21 low-resource Indian languages. This work significantly broadens the accessibility and enriches the reader's experience of Indian poetry.

GS-Light: Training-Free Relighting of 3D Scenes

Relighting 3D scenes based on textual descriptions offers immense creative potential, but achieving this realistically and efficiently has been a challenge. "Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting" introduces GS-Light, an efficient, training-free pipeline that enables text-guided relighting of 3D scenes represented by Gaussian Splatting (3DGS). GS-Light extends single-input diffusion models to handle multi-view inputs, parsing user prompts (e.g., lighting direction, color) into lighting priors using large vision-language models (LVLMs). By fusing these priors with geometry and semantic estimators, the model computes illumination maps and generates initial latent codes for each view, guiding the diffusion model to produce high-fidelity relighting outputs that accurately reflect user intent. Feeding multi-view images and initial latents into the relighting model results in artistically relit images, which are then used to fine-tune the 3DGS scene. Extensive evaluations show GS-Light consistently outperforms state-of-the-art baselines in terms of multi-view consistency, imaging quality, and aesthetic scores, offering a powerful new tool for 3D scene editing and visualization.

QUILL: Hardware Acceleration for Deformable Attention

Deformable transformers have achieved state-of-the-art results in object detection, but their irregular memory access patterns and low arithmetic intensity make them poorly suited for hardware acceleration. "QUILL: An Algorithm-Architecture Co-Design for Cache-Local Deformable Attention" introduces QUILL, a schedule-aware accelerator designed to overcome these limitations. QUILL transforms deformable attention into cache-friendly, single-pass operations. Its core, Distance-based Out-of-Order Querying (DOOQ), reorders queries by spatial proximity and drives a region prefetch into an alternate buffer, overlapping memory and compute. A fused MSDeformAttn engine executes multiple operations in a single pass without intermediate spilling, while small tensors are kept on-chip. Implemented in RTL and evaluated end-to-end, QUILL demonstrates significant improvements in throughput and energy efficiency compared to existing hardware like the RTX 4090 and prior accelerators. With mixed-precision quantization, accuracy remains close to FP32. By converting sparsity into locality and locality into utilization, QUILL offers substantial end-to-end speedups for deformable transformers.
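
A software analogue of the reordering idea is shown below: queries are visited in a greedy nearest-neighbour order over their reference points so that consecutive queries touch nearby memory. This toy captures the spirit of DOOQ only; the real scheduler is implemented in hardware.

```python
# Toy illustration of distance-based query reordering: visit queries in a
# greedy nearest-neighbour order so consecutive queries sample nearby locations.
# This mimics the spirit of DOOQ in software; the actual scheduler lives in RTL.
import numpy as np

def reorder_by_proximity(ref_points):
    """ref_points: (N, 2) array of each query's reference location on the feature map."""
    n = len(ref_points)
    remaining = set(range(n))
    order = [0]
    remaining.remove(0)
    while remaining:
        last = ref_points[order[-1]]
        nxt = min(remaining, key=lambda i: np.sum((ref_points[i] - last) ** 2))
        order.append(nxt)
        remaining.remove(nxt)
    return order

pts = np.random.rand(8, 2)
print(reorder_by_proximity(pts))   # e.g. [0, 5, 2, ...]
```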

OlmoEarth: Multimodal Earth Observation with Foundation Models

Earth observation data presents a unique challenge due to its spatial, sequential, and multimodal nature. "OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation" introduces OlmoEarth, a novel multimodal, spatio-temporal foundation model specifically designed for this domain. OlmoEarth employs a unique self-supervised learning formulation, masking strategy, and loss function tailored for Earth observation. It achieves state-of-the-art performance across various research benchmarks and real-world tasks, outperforming 12 other foundation models in embedding tasks and often leading in full fine-tuning scenarios. Beyond its strong performance, OlmoEarth is deployed as the backbone of an end-to-end platform that streamlines data collection, labeling, training, and inference for Earth observation models. This platform aims to empower non-profits and NGOs working on global challenges. The open-source release of OlmoEarth's code, training data, and pre-trained weights is a significant contribution to the field, fostering further advancements in using AI for Earth science.

Tuning for Two Adversaries: Robustness Against Transfer and Query Attacks

The robustness of machine learning models against adversarial attacks is a critical concern. "Tuning for Two Adversaries: Enhancing the Robustness Against Transfer and Query-Based Attacks using Hyperparameter Tuning" provides the first detailed analysis of how optimization hyperparameters influence robustness against both transfer-based and query-based attacks. The study reveals a striking dichotomy: decreasing the learning rate significantly boosts robustness against transfer attacks, while increasing it improves robustness against query-based attacks. Leveraging these insights, the research explores hyperparameter tuning to jointly enhance robustness against both attack types. Notably, distributed models show the most significant benefits, achieving a remarkable trade-off by effectively mitigating both attack types simultaneously. This work offers crucial guidance for practitioners seeking to build more resilient AI systems.

Distribution Matching Distillation Meets Reinforcement Learning

Distribution Matching Distillation (DMD) is an effective technique for compressing multi-step diffusion models into faster, few-step versions. However, the performance of these distilled models can be capped by their teachers. "Distribution Matching Distillation Meets Reinforcement Learning" introduces DMDR, a novel framework that integrates Reinforcement Learning (RL) into the distillation process to overcome this limitation. The study shows that the DMD loss itself acts as a more effective regularizer for the few-step generator's RL process compared to traditional methods. Simultaneously, RL helps guide the mode coverage in DMD more effectively, unlocking the few-step generator's capacity. The framework also incorporates dynamic distribution guidance and renoise sampling strategies to improve the initial distillation. Experiments demonstrate that DMDR achieves leading visual quality and prompt coherence among few-step methods, even surpassing the performance of its multi-step teacher model.

PhysX-Anything: Simulation-Ready Physical 3D Assets

The field of 3D modeling is rapidly transitioning towards creating physical, articulated assets suitable for direct use in simulations and interactive applications. However, many existing 3D generation methods overlook crucial physical and articulation properties, limiting their utility in embodied AI. "PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image" introduces PhysX-Anything, the first generative framework capable of producing simulation-ready 3D assets from a single image. These assets possess explicit geometry, articulation, and physical attributes. The framework features the first VLM-based physical 3D generative model and a novel, highly efficient 3D representation that dramatically reduces the number of tokens needed to encode geometry, enabling explicit geometry learning within standard VLM token budgets. To address the limited diversity in existing physical 3D datasets, a new dataset, PhysX-Mobility, is constructed, expanding object categories and including thousands of real-world objects with rich physical annotations. Experiments demonstrate strong generative performance, robust generalization, and successful application in robotics policy learning within simulation environments, promising to empower embodied AI and physics-based simulation applications.

Part-X-MLLM: Part-Aware 3D Multimodal Large Language Model

Understanding and manipulating 3D objects at a granular level is crucial for many AI applications. "Part-X-MLLM: Part-aware 3D Multimodal Large Language Model" introduces Part-X-MLLM, a native 3D multimodal large language model designed to unify diverse 3D tasks. The model formulates these tasks as programs within a structured, executable grammar. Given an RGB point cloud and a natural language prompt, Part-X-MLLM autoregressively generates a single token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output acts as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling symbolic planning from geometric synthesis, the approach allows any compatible geometry engine to be controlled through a single, language-native frontend. Pre-training a dual-encoder architecture to disentangle structure from semantics, and instruction-tuning on a large-scale, part-centric dataset, enables state-of-the-art performance in grounded Q&A, compositional generation, and localized editing.

CacheFlow: Efficient Long-Form Video Understanding

Long-form video question answering (VQA) presents a significant challenge for current vision-language models (VLMs) due to the escalating computational demands of attention mechanisms and key-value (KV) caches. This often forces a trade-off between inference cost and the ability to process extended contexts. "CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding" introduces CacheFlow, a training-free pipeline designed to address this limitation. CacheFlow pairs Dynamic Token Dropping (DTD) with a compressive long-term memory system. DTD prunes per-patch tokens online based on cosine similarity to the previous frame, packing surviving tokens into fixed-size blocks, making it inherently suitable for live streaming VQA. As blocks are processed, their keys are summarized by a recurrent encoder to form a retrieval index, while full KV pairs are offloaded for later rehydration, preserving answer fidelity. During inference, a consensus-based retrieval mechanism selects the most relevant blocks, allowing attention over both retrieved and local context for precise, long-range reasoning. CacheFlow is architecture-agnostic and requires no fine-tuning, significantly reducing token processing while enhancing context awareness for practical long-form video understanding.
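
The token-dropping step can be pictured as a per-patch cosine test against the previous frame, as in the hedged sketch below; the threshold, shapes, and packing details are illustrative rather than CacheFlow's exact settings.

```python
# Minimal sketch of cosine-similarity token dropping between consecutive frames
# (an approximation of the DTD idea; threshold and shapes are illustrative).
import torch
import torch.nn.functional as F

def drop_static_tokens(prev_tokens, curr_tokens, threshold=0.95):
    """prev_tokens, curr_tokens: (N, D) per-patch features at the same grid positions."""
    sim = F.cosine_similarity(prev_tokens, curr_tokens, dim=-1)   # (N,)
    keep = sim < threshold                                        # keep patches that changed
    return curr_tokens[keep], keep

prev = torch.randn(196, 768)
curr = prev.clone()
curr[:20] += torch.randn(20, 768)          # only the first 20 patches changed
kept, mask = drop_static_tokens(prev, curr)
print(kept.shape, int(mask.sum()))         # the 20 changed patches survive
```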

Alpha Divergence Losses for Biometric Verification

Performance in biometric verification tasks, such as face and speaker recognition, heavily relies on margin-based loss functions. "Alpha Divergence Losses for Biometric Verification" explores the potential of α-divergence loss functions, known for inducing sparse solutions, and integrates them with angular margins crucial for verification. The paper derives two novel margin-based α-divergence losses: Q-Margin and A3M. A critical training instability in A3M is addressed with a prototype re-initialization strategy. The proposed methods achieve significant performance gains on challenging face verification benchmarks like IJB-B and IJB-C, and demonstrate strong results in speaker verification on VoxCeleb. Crucially, these models outperform baselines at low false acceptance rates (FAR), a vital capability for high-security applications where minimizing false authentications is paramount.

Real-Time Driver Drowsiness Detection System

Driver fatigue is a major contributor to road accidents, making systems that can detect and alert drowsy drivers critically important for safety. "A Real-Time Driver Drowsiness Detection System Using MediaPipe and Eye Aspect Ratio" details the development of such a system. Utilizing a standard webcam, the system tracks facial features, with a primary focus on eye movements analyzed using the Eye Aspect Ratio (EAR) method. The MediaPipe Face Mesh framework provides efficient and accurate facial landmark identification, essential for real-time performance. The system detects prolonged eye closure or infrequent blinking—key indicators of drowsiness—and alerts the driver audibly. By integrating OpenCV for image processing and MediaPipe for face detection, this low-cost, high-performance solution offers a valuable component for current Advanced Driving Assistance Systems (ADAS).
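
The EAR itself is a standard ratio of vertical to horizontal eye-landmark distances. A minimal version is sketched below; the drowsiness threshold and frame count are illustrative defaults, and the landmark ordering is assumed to follow the usual p1..p6 convention rather than MediaPipe's raw mesh indices.

```python
# Eye Aspect Ratio from six eye landmarks (p1..p6), following the standard EAR
# formulation; the threshold and frame count below are illustrative defaults.
import numpy as np

def eye_aspect_ratio(eye):
    """eye: (6, 2) array of landmarks ordered p1..p6 around the eye contour."""
    v1 = np.linalg.norm(eye[1] - eye[5])        # ||p2 - p6||
    v2 = np.linalg.norm(eye[2] - eye[4])        # ||p3 - p5||
    h = np.linalg.norm(eye[0] - eye[3])         # ||p1 - p4||
    return (v1 + v2) / (2.0 * h)

EAR_THRESHOLD = 0.21        # below this the eye is treated as closed
CLOSED_FRAMES_ALARM = 48    # ~2 seconds at 24 fps triggers the drowsiness alert

open_eye = np.array([[0, 0], [2, 1.2], [4, 1.2], [6, 0], [4, -1.2], [2, -1.2]])
print(round(eye_aspect_ratio(open_eye), 3))   # ~0.4 for an open eye
```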

Tissue Aware Nuclei Detection and Classification for Histopathology

Accurate nuclei detection and classification are fundamental to computational pathology, but existing methods often struggle with reliance on extensive expert annotations and insufficient use of tissue context. "Tissue Aware Nuclei Detection and Classification Model for Histopathology Images" introduces TAND (Tissue-Aware Nuclei Detection), a novel framework that achieves joint nuclei detection and classification using point-level supervision enhanced by tissue mask conditioning. TAND employs a ConvNeXt-based encoder-decoder architecture coupled with a frozen Virchow-2 tissue segmentation branch. Semantic tissue probabilities selectively modulate the classification stream through a novel multi-scale Spatial Feature-wise Linear Modulation (Spatial-FiLM). TAND achieves state-of-the-art performance on the PUMA benchmark, surpassing both tissue-agnostic and mask-supervised methods. Notably, it shows remarkable improvements for tissue-dependent cell types like epithelium, endothelium, and stroma. This approach offers a practical pathway to reduce annotation burden by conditioning per-cell classification on learned tissue masks.
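
FiLM-style conditioning applies a learned scale and shift to feature maps; a spatial variant predicts them per pixel from the tissue probability maps. The sketch below shows that pattern under assumed channel sizes and a 1x1-convolution design; it is not TAND's exact module.

```python
# Sketch of a spatial FiLM layer: per-pixel scale and shift predicted from
# tissue probability maps modulate the nuclei-classification features.
# Channel sizes and the 1x1-conv design are assumptions, not TAND's module.
import torch
import torch.nn as nn

class SpatialFiLM(nn.Module):
    def __init__(self, tissue_classes, feat_channels):
        super().__init__()
        self.to_gamma = nn.Conv2d(tissue_classes, feat_channels, kernel_size=1)
        self.to_beta = nn.Conv2d(tissue_classes, feat_channels, kernel_size=1)

    def forward(self, feats, tissue_probs):
        """feats: (B, C, H, W); tissue_probs: (B, T, H, W) softmax tissue maps."""
        gamma = self.to_gamma(tissue_probs)
        beta = self.to_beta(tissue_probs)
        return (1 + gamma) * feats + beta          # identity when gamma = beta = 0

film = SpatialFiLM(tissue_classes=5, feat_channels=64)
out = film(torch.randn(2, 64, 32, 32), torch.rand(2, 5, 32, 32).softmax(dim=1))
print(out.shape)   # torch.Size([2, 64, 32, 32])
```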

AtlasMorph: Learning Conditional Deformable Templates for Brain MRI

Deformable templates, or atlases, are crucial for medical image analysis, representing prototypical anatomy for a population and often accompanied by probabilistic anatomical label maps. However, developing these templates is computationally intensive, leading to the use of sub-optimal templates that may not accurately represent diverse populations. "AtlasMorph: Learning conditional deformable templates for brain MRI" proposes a machine learning framework that uses convolutional registration neural networks to efficiently learn a function for generating templates conditioned on subject-specific attributes like age and sex. When segmentations are available, the network also produces anatomical segmentation maps for the templates. The learned network can also be used for registering subject images to the templates. Demonstrated on 3D brain MRI datasets, AtlasMorph learns high-quality, representative templates and enables better registration than unlabeled or unconditionally generated templates, especially for populations with significant variations.

ICLR: Natural Color Restoration in Low-Light Image Enhancement

Low-Light Image Enhancement (LLIE) aims to improve image contrast, details, and textures in challenging lighting conditions. While the HVI color space has advanced LLIE by decoupling chrominance and luminance, significant distributional differences between these channels and nonlinear parameter propagation can limit feature extraction and introduce errors. "ICLR: Inter-Chrominance and Luminance Interaction for Natural Color Restoration in Low-Light Image Enhancement" proposes the ICLR framework to address these issues. It features a Dual-stream Interaction Enhancement Module (DIEM) for improved fusion and enhancement of complementary information and a Covariance Correction Loss (CCL) that uses luminance residual statistics to penalize chrominance errors and balance gradient conflicts. ICLR outperforms state-of-the-art methods on multiple datasets by effectively managing inter-chrominance and luminance interactions.

VVS: Accelerating Visual Autoregressive Generation

Visual autoregressive (AR) generation models show great potential for image synthesis, but their sequential nature leads to significant inference latency. Speculative decoding (SD) offers acceleration, but its fixed draft-and-verify steps limit potential. "VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping" introduces VVS, a novel SD framework that accelerates visual AR generation by enabling partial verification skipping. VVS leverages the interchangeability of visual tokens and addresses verification redundancy and stale feature reusability. It features a verification-free token selector with dynamic truncation, token-level feature caching and reuse, and fine-grained skipped step scheduling. VVS reduces the number of target model forward passes significantly while maintaining competitive generation quality, offering a superior speed-quality trade-off compared to conventional SD frameworks.

NuClass: Robust Cell Annotation in Histopathology

Identifying cell types and subtypes in histopathology images is crucial for understanding diseases, but existing tile-based models often fail to integrate broader tissue context and struggle with limited, coarse-grained annotations. "Adaptive Multi-Scale Integration Unlocks Robust Cell Annotation in Histopathology Images" introduces NuClass, a framework inspired by pathologist workflows for multi-scale integration of nuclear morphology and microenvironmental context. NuClass balances local detail from nuclear morphology with the surrounding tissue neighborhood using a learnable gating module. An uncertainty-guided objective directs the global path to prioritize uncertain regions, encouraging complementary learning. The framework also provides calibrated confidence estimates and Grad-CAM visualizations for interpretability. A novel marker-guided dataset from spatial transcriptomics assays yields high-resolution labels for millions of cells. NuClass achieves high F1 scores, outperforming baselines by effectively bridging the gap between slide-level foundation models and reliable cell-level phenotype prediction.
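
A learnable gate of this kind can be as simple as a sigmoid over the concatenated local and global embeddings, used to blend them convexly. The sketch below illustrates that mechanism with assumed dimensions; NuClass's actual fusion may differ.

```python
# Toy sketch of a learnable gate that blends nucleus-level and context-level
# features; the fusion design is an assumption about the general mechanism.
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, local_feat, global_feat):
        """local_feat, global_feat: (B, D) embeddings of the nucleus and its neighborhood."""
        g = self.gate(torch.cat([local_feat, global_feat], dim=-1))   # (B, D) in (0, 1)
        return g * local_feat + (1 - g) * global_feat

gate = ContextGate(dim=128)
fused = gate(torch.randn(4, 128), torch.randn(4, 128))
print(fused.shape)   # torch.Size([4, 128])
```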

Hierarchical Prompt Learning for Person Re-Identification

Person re-identification (ReID) aims to retrieve target pedestrian images using either visual queries (image-to-image, I2I) or textual descriptions (text-to-image, T2I). These tasks present distinct challenges: I2I requires discriminative identity learning, while T2I demands accurate cross-modal semantic alignment. "Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification" proposes HPL, a unified framework that jointly optimizes both tasks through task-aware prompt modeling. HPL introduces a Task-Routed Transformer that routes features for I2I and T2I branches using dual classification tokens. It develops a hierarchical prompt generation scheme incorporating identity-level learnable tokens and instance-level pseudo-text tokens derived from image or text features. A Cross-Modal Prompt Regularization strategy enforces semantic alignment in the prompt token space, ensuring pseudo-prompts retain source-modality characteristics while enhancing cross-modal transferability. HPL achieves state-of-the-art performance on multiple ReID benchmarks for both I2I and T2I tasks.

Opt3DGS: Optimizing 3D Gaussian Splatting

3D Gaussian Splatting (3DGS) has become a leading framework for novel view synthesis, but its optimization process faces challenges related to local optima entrapment and insufficient convergence quality. "Opt3DGS: Optimizing 3D Gaussian Splatting with Adaptive Exploration and Curvature-Aware Exploitation" presents Opt3DGS, a robust framework that enhances 3DGS optimization through a two-stage process. The exploration phase uses an Adaptive Weighted Stochastic Gradient Langevin Dynamics (SGLD) method to improve global search and escape local optima. The exploitation phase employs a Local Quasi-Newton Direction-guided Adam optimizer for precise and efficient convergence, leveraging curvature information. Extensive experiments across diverse datasets demonstrate that Opt3DGS achieves state-of-the-art rendering quality by refining the 3DGS optimization without altering its underlying representation.
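
The exploration phase rests on the SGLD idea of adding properly scaled Gaussian noise to each gradient step so the optimizer can escape shallow basins. The sketch below shows a plain SGLD update on a toy objective; the adaptive weighting used by Opt3DGS is not modeled.

```python
# Minimal SGLD-style update: a gradient step plus Gaussian noise whose scale
# follows the step size, which is what lets optimization hop out of local optima.
# The adaptive weighting used by Opt3DGS is not modeled here.
import torch

def sgld_step(params, loss_fn, lr=1e-3):
    loss = loss_fn(params)
    grad, = torch.autograd.grad(loss, params)
    noise = torch.randn_like(params) * (2.0 * lr) ** 0.5
    return (params - lr * grad + noise).detach().requires_grad_(True)

# Usage on a toy objective with two minima at +/-1.
p = torch.tensor([2.0], requires_grad=True)
for _ in range(100):
    p = sgld_step(p, lambda x: (x ** 2 - 1) ** 2, lr=1e-2)
print(p)   # wanders near one of the minima rather than freezing in place
```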

TSE-Net: Semi-supervised Monocular Height Estimation

Monocular height estimation is vital for 3D perception in remote sensing, offering a cost-effective alternative to multi-view or LiDAR methods. However, the scarcity of labeled data limits the performance and generalization of current deep learning models. "TSE-Net: Semi-supervised Monocular Height Estimation from Single Remote Sensing Images" introduces TSE-Net, a self-training pipeline for semi-supervised monocular height estimation that leverages large volumes of unlabeled data. The pipeline integrates teacher, student, and exam networks. The student network learns from pseudo-labels generated by the teacher network on unlabeled data. The teacher network combines regression and classification branches; the regression branch predicts height values, and the classification branch predicts height value classes to filter pseudo-labels. Height value classes are defined using a hierarchical bi-cut strategy to handle long-tailed distributions, and predicted class probabilities are calibrated using a Plackett-Luce model. TSE-Net is evaluated on diverse datasets and demonstrates improved predictive performance through semi-supervised learning.
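
The filtering step can be pictured as keeping the teacher's regressed heights only where its height-class branch is confident, as in the hedged sketch below; the confidence threshold and tensor shapes are illustrative.

```python
# Sketch of confidence-filtered pseudo-labels: the teacher's regressed heights
# are kept only where its height-class branch is confident. Threshold is illustrative.
import torch

def filter_pseudo_labels(pred_heights, class_logits, conf_threshold=0.8):
    """pred_heights: (B, H, W); class_logits: (B, K, H, W) over K height bins."""
    probs = class_logits.softmax(dim=1)
    confidence, _ = probs.max(dim=1)                    # (B, H, W)
    valid = confidence > conf_threshold
    return pred_heights, valid                          # train the student only on valid pixels

heights = torch.rand(2, 64, 64) * 30.0
logits = torch.randn(2, 10, 64, 64)
_, mask = filter_pseudo_labels(heights, logits)
print(f"kept {mask.float().mean().item():.1%} of pixels as pseudo-labels")
```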

Robust Defense Against Backdoor Attacks in Multimodal Contrastive Learning

Multimodal deep learning models, while powerful, are vulnerable to adversarial attacks, particularly backdoor attacks that subtly manipulate model behavior. Existing defense methods often require extensive retraining or fine-tuning without precise identification of affected labels. "Robust Defense Strategies for Multimodal Contrastive Learning: Efficient Fine-tuning Against Backdoor Attacks" introduces an innovative strategy for enhancing the robustness of multimodal contrastive learning models like CLIP. The approach efficiently identifies backdoor triggers, victim samples, and labels. It employs an image segmentation 'oracle' as a supervisor for the poisoned CLIP model's output. Two algorithms are developed: one to differentiate CLIP and Oracle knowledge for trigger identification, and another to pinpoint affected labels and curate a compact fine-tuning dataset. Extensive experiments show this strategy effectively negates backdoor effects in CLIP-based models, demonstrating strong defense capabilities on visual recognition benchmarks.

BootOOD: Self-Supervised Out-of-Distribution Detection

Out-of-distribution (OOD) detection is crucial for reliable image classifier deployment, but existing detectors struggle when OOD samples are semantically similar to in-distribution (ID) classes. "BootOOD: Self-Supervised Out-of-Distribution Detection via Synthetic Sample Exposure under Neural Collapse" presents BootOOD, a fully self-supervised OOD detection framework that exclusively uses ID data and is designed to handle semantically challenging OOD samples. BootOOD synthesizes pseudo-OOD features through simple transformations of ID representations, leveraging Neural Collapse (NC) principles where ID features cluster tightly around class means. Instead of orthogonal subspace constraints, BootOOD uses a lightweight auxiliary head for radius-based classification on feature norms, decoupling OOD detection from the primary classifier. This imposes a relaxed requirement: OOD samples exhibit smaller feature norms than ID features, which is easier to satisfy for semantically close samples. BootOOD outperforms prior post-hoc and outlier-exposure methods while maintaining or improving ID accuracy on challenging benchmarks.
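
The radius-based test amounts to flagging samples whose feature norms fall below a calibrated cutoff. The sketch below illustrates that scoring rule with a quantile-based calibration, which is an assumption rather than BootOOD's procedure.

```python
# Toy sketch of norm-based OOD scoring in the spirit of BootOOD: in-distribution
# features keep large norms, so unusually small norms flag OOD. The radius is
# calibrated here from a quantile of ID norms, an illustrative choice.
import torch

def fit_radius(id_features, quantile=0.05):
    norms = id_features.norm(dim=-1)
    return torch.quantile(norms, quantile)               # 5th-percentile ID norm

def is_ood(features, radius):
    return features.norm(dim=-1) < radius                # True -> treat as OOD

id_feats = torch.randn(1000, 512) * 3.0
radius = fit_radius(id_feats)
queries = torch.cat([torch.randn(5, 512) * 3.0, torch.randn(5, 512) * 0.5])
print(is_ood(queries, radius))   # the low-norm second half is flagged as OOD
```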

Accuracy is Not Enough: Poisoning Interpretability in Federated Learning

Visual explanation techniques are vital for transparency in AI, but their integrity can be compromised. "Accuracy is Not Enough: Poisoning Interpretability in Federated Learning via Color Skew" reveals a new class of attacks that undermine model interpretability without affecting accuracy. In a federated learning setting, small color perturbations applied by adversarial clients can shift a model's saliency maps away from meaningful regions while preserving predictions. The Chromatic Perturbation Module systematically crafts adversarial examples by altering color contrast to disrupt explanation fidelity. These perturbations accumulate across training rounds, stealthily poisoning the global model's internal feature attributions. This work challenges the assumption that correct predictions imply faithful explanations, highlighting interpretability as a new attack surface. Standard training pipelines are insufficient to detect or mitigate this explanation degradation, especially in federated learning. The attack significantly reduces Grad-CAM explanation overlap while maintaining high classification accuracy.

Minimax Multi-Target Conformal Prediction for Imaging Inverse Problems

Uncertainty quantification in ill-posed imaging inverse problems is a fundamental challenge, particularly for safety-critical applications. While conformal prediction has shown promise for quantifying uncertainty in downstream tasks, existing methods handle only scalar targets. "Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems" proposes an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals and ensures joint marginal coverage. The framework is applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Numerical demonstrations on synthetic and MRI data showcase the benefits of the minimax approach over existing multi-target conformal prediction methods, offering improved uncertainty quantification for complex imaging problems.

Mapping Urban Village Transformations in China

Urban villages (UVs) in China have undergone extensive demolition and redevelopment, but a systematic evaluation of land reuse efficacy is lacking. "Mapping the Vanishing and Transformation of Urban Villages in China" proposes a deep learning framework to monitor these spatiotemporal changes using semantic segmentation of multi-temporal remote sensing imagery. The system classifies post-demolition land use into categories such as vacant land, construction sites, buildings, and green spaces. Analyzing four representative Chinese cities, the study reveals prolonged redevelopment processes, shows that transitions occur primarily in peripheral areas, and identifies three spatiotemporal transformation pathways: synchronized, delayed, and gradual optimization. This research highlights the fragmented and nonlinear nature of UV redevelopment, advocating for tiered and context-sensitive planning strategies to support inclusive, efficient, and sustainable urban renewal, contributing to a global understanding of informal settlement transformations.

Language-Guided Invariance Probing of Vision-Language Models

Vision-language models (VLMs) like CLIP demonstrate strong zero-shot performance, but their reliability under controlled linguistic perturbations remains unclear. "Language-Guided Invariance Probing of Vision-Language Models" introduces LGIP, a benchmark that measures invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips in image-text matching. Using MS COCO images with multiple captions, LGIP automatically generates paraphrases and rule-based flips altering object category, color, or count. The benchmark summarizes model behavior using an invariance error, semantic sensitivity gap, and positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants show favorable invariance-sensitivity trade-offs, exhibiting low paraphrase variance and consistently higher scores for original captions than flipped ones. In contrast, SigLIP and SigLIP2 exhibit larger invariance errors and often prefer flipped captions, failures often missed by standard retrieval metrics. LGIP provides a crucial model-agnostic diagnostic for linguistic robustness.
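
The two summary statistics can be approximated from raw image-text matching scores as shown below; LGIP's exact definitions may differ, so treat this as schematic.

```python
# Schematic computation of an invariance error (score variance across paraphrases)
# and a semantic sensitivity gap (original minus flipped-caption score).
# These are approximations of the ideas above, not LGIP's exact formulas.
import numpy as np

def invariance_error(paraphrase_scores):
    """paraphrase_scores: (N, P) matching scores for N images x P paraphrases."""
    return float(np.var(paraphrase_scores, axis=1).mean())

def semantic_sensitivity_gap(original_scores, flipped_scores):
    """Positive gap = the model prefers the true caption over the semantic flip."""
    return float(np.mean(original_scores - flipped_scores))

orig = np.random.normal(0.30, 0.01, size=(100,))
paras = orig[:, None] + np.random.normal(0, 0.005, size=(100, 4))
flips = orig - np.abs(np.random.normal(0.05, 0.01, size=(100,)))
print(invariance_error(paras), semantic_sensitivity_gap(orig, flips))
```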

InterMoE: Individual-Specific 3D Human Interaction Generation

Generating high-quality 3D human interactions that preserve individual characteristics and adhere to textual descriptions is a significant challenge. "InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE" introduces InterMoE, a novel framework based on a Dynamic Temporal-Selective Mixture of Experts (MoE). InterMoE's routing mechanism synergistically uses high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically adjust their selection capacity and focus on critical temporal features, thereby preserving unique individual identities while ensuring high semantic fidelity. Extensive experiments demonstrate that InterMoE achieves state-of-the-art performance in individual-specific, high-fidelity 3D human interaction generation, showing significant improvements in FID scores on benchmark datasets like InterHuman and InterX.

Semantic Document Derendering: SVG Reconstruction with VLMs

Multimedia documents are often distributed in static raster formats, losing their editability and interactivity; restoring that editability requires converting raster images back into structured vector formats. However, existing geometric raster-vectorization methods struggle with complex documents like slides, failing to preserve high-level structure and semantic distinctions. "Semantic Document Derendering: SVG Reconstruction via Vision-Language Modeling" introduces SliDer, a novel framework using Vision-Language Models (VLMs) to derender slide images into compact, editable Scalable Vector Graphic (SVG) representations. SliDer detects and extracts attributes from individual image and text elements, organizing them into a coherent SVG. The model iteratively refines its predictions during inference, analogous to human design, to faithfully reconstruct the original raster upon rendering. The paper also introduces Slide2SVG, a novel dataset for this task. SliDer achieves superior reconstruction quality and is favored by human evaluators over existing VLM baselines.

Trust in Vision-Language Models: A Participatory Workshop

As Vision-Language Models (VLMs) become more integrated into various applications, understanding how users build and evolve their trust in these systems is critical. "Trust in Vision-Language Models: Insights from a Participatory User Workshop" addresses this by presenting preliminary results from a workshop with prospective VLM users. This user-centered approach aims to inform future studies on contextualizing trust metrics and developing strategies for user-VLM interaction. The research acknowledges the growing reliance on AI models for experimental validation and emphasizes the need for direct user engagement to properly understand and build trust in VLMs.

Foresee: Training-Free Forgery Detection with Vanilla MLLMs

The rise of AI-generated content (AIGC) technologies has made image generation and manipulation remarkably effortless, posing challenges for image forgery detection and localization (IFDL). Existing IFDL methods often struggle with generalization and interpretability, and while multimodal large language models (MLLMs) show promise, training them for IFDL is computationally expensive. "Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline" introduces Foresee, a training-free MLLM-based pipeline for image forgery analysis. Foresee eliminates the need for additional training, enabling lightweight inference while surpassing existing MLLM-based methods in localization accuracy and textual explanation richness. It employs a type-prior-driven strategy and a Flexible Feature Detector (FFD) module to handle copy-move manipulations effectively. Foresee demonstrates superior generalization capability across diverse tampering types, including copy-move, splicing, removal, deepfake, and AIGC-based editing, providing comprehensive explanations.

FUSE: A Flow-based Mapping Between Shapes

Representing maps between 3D shapes efficiently and enabling cross-representation matching without extensive data-driven procedures is a key challenge. "FUSE: A Flow-based Mapping Between Shapes" introduces a novel neural representation based on flow-matching models for maps between 3D shapes. This representation is computationally efficient and supports cross-representation shape matching. 3D shapes are represented as probability distributions induced by continuous, invertible flow mappings from a fixed anchor distribution. By composing inverse and forward flows, points are mapped between surfaces. Encoding shapes with task-tailored embeddings provides an invertible and modality-agnostic representation across point clouds, meshes, SDFs, and volumetric data. FUSE consistently achieves high coverage and accuracy across diverse benchmarks and challenging settings in shape matching, and shows promising results in UV mapping and registration of human body scans.
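
The mapping itself follows from composing flows: pull a point on shape A back to the shared anchor distribution through A's inverse flow, then push it forward through B's flow. The sketch below uses trivially invertible affine maps as stand-ins for the learned flows; the class and method names are placeholders, not FUSE's API.

```python
# Schematic of the flow-composition idea: shape-A point -> anchor space -> shape B.
# The flow objects here are simple affine stand-ins for learned invertible flows.
import torch

class InvertibleFlow:
    """Stand-in for a learned invertible flow; an affine map for illustration only."""
    def __init__(self, scale, shift):
        self.scale, self.shift = scale, shift
    def forward(self, z):          # anchor space -> shape space
        return z * self.scale + self.shift
    def inverse(self, x):          # shape space -> anchor space
        return (x - self.shift) / self.scale

def map_point(x_on_a, flow_a, flow_b):
    z = flow_a.inverse(x_on_a)     # pull back to the anchor distribution
    return flow_b.forward(z)       # push forward onto shape B

flow_a = InvertibleFlow(scale=2.0, shift=torch.tensor([1.0, 0.0, 0.0]))
flow_b = InvertibleFlow(scale=0.5, shift=torch.tensor([0.0, 1.0, 0.0]))
print(map_point(torch.tensor([3.0, 2.0, 2.0]), flow_a, flow_b))   # tensor([0.5, 1.5, 0.5])
```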

VOPE: Revisiting Hallucination in Voluntary Imagination Tasks

Most research on hallucinations in Large Vision-Language Models (LVLMs) focuses on factual description tasks, overlooking hallucinations in voluntary imagination tasks like story writing, where novel content generation is expected. "VOPE: Revisiting Hallucination of Vision-Language Models in Voluntary Imagination Task" introduces Voluntary-imagined Object Presence Evaluation (VOPE), a novel method to assess LVLM hallucinations in these tasks. VOPE uses recheck-based questions to evaluate how an LVLM interprets the presence of imagined objects in its own response, comparing this interpretation to the object's actual presence in the image to determine hallucination. Applying VOPE to mainstream LVLMs and mitigation methods reveals that most LVLMs hallucinate heavily during voluntary imagination, and existing mitigation methods show limited effect, highlighting this as a critical area for future research.

For further exploration into the cutting-edge of AI research, check out resources from arXiv and delve deeper into the specifics of each paper through their respective links.