PixFoundation Series: Benchmarking Multi-Modal Large Language Models for Pixel-level Visual Grounding

mennatul@ualberta.ca

Pixel-level vision foundation models encompass a family of models that provide pixel-level output across various tasks: segmentation, pixel-level visual grounding and reasoning, depth estimation, motion estimation and tracking. My work specifically concentrates on benchmarking general-purpose multi-modal large language models (MLLMs) and their adaptation to pixel-level understanding, with an emphasis on visual grounding in both images and videos. The motivation behind this effort is to provide benchmarks that question the current emerging MLLMs and video MLLMs, and push them in a direction that does not degrade their fundamental chat capabilities. Moreover, I evaluate their grounding abilities in challenging vision-centric and motion-centric scenarios. This effort resulted in the PixMMVP, PixCV-Bench and MoCentric-Bench benchmarks for pixel-level visual grounding. While the majority of mainstream MLLMs focus on visual grounding with bounding boxes, since it is an easier step than segmentation output, I believe that a holistic view of pixel-level understanding and the interplay of such tasks can help MLLMs acquire better spatial intelligence and move towards better embodied intelligence, while remaining interpretable. The series includes PixFoundation for benchmarking visual grounding in images using challenging vision-centric scenarios, and PixFoundation 2.0 for benchmarking visual grounding in videos using a motion-centric evaluation.

Since I am quite excited about the intersection of pixel-level understanding tasks and foundation models, I launched the first PixFoundation workshop in conjunction with CVPR 2025 with the rest of the organizers. You can learn more about our workshop and the recorded talks from our amazing speakers here.

I am also strongly committed to a responsible approach to AI; for me, interpretability and benchmarking in challenging scenarios go hand in hand towards that goal. If a model cannot perceive local motion patterns, or cannot identify fine-grained spatial details, would you safely use it in a robotics system? While it is extremely important to build models that work, it is equally important to test and push them to fail, and better yet to understand why they fail and interpret their inner workings. Quantifying failures (benchmarking), understanding failures (using interpretability) and solving these failures (for safety and robustness) is what constitutes a responsible AI approach in my view.

PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?

Paired Evaluation for Visual Grounding and Visual Question Answering - PixMMVP & PixCV-Bench

Abstract

Multiple works have emerged to push the boundaries on multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data with specialized decoders for the segmentation task. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering (VQA). Surprisingly, some of these methods even downgrade the grounding ability of MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel challenging benchmarks with paired evaluation for both VQA and grounding. We show that MLLMs without pixel-level grounding supervision can outperform the state of the art in such tasks. Our paired benchmarks and evaluation enable additional analysis on the reasons for failure with respect to VQA and/or grounding. Furthermore, we propose simple baselines to extract the grounding information that can be plugged into any MLLM, which we call PixFoundation. More importantly, we study the research question of "When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?" We show that grounding can coincide with an object's parts, location, appearance, context or state, where 27-45% of the examples in both benchmarks exhibit this phenomenon.
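To make the paired evaluation concrete, below is a minimal sketch of how a paired VQA and grounding loop could be scored so that each failure can be attributed to VQA, grounding, or both. This is not the released evaluation harness; the benchmark record fields, the mllm.answer/mllm.ground interface and the IoU threshold are hypothetical placeholders.

import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    # Intersection-over-union between two boolean masks.
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

def paired_eval(benchmark, mllm, iou_thresh=0.5):
    # Each example pairs a VQA question with a referring expression and its mask.
    outcomes = {"both_ok": 0, "vqa_only_fail": 0, "grounding_only_fail": 0, "both_fail": 0}
    for ex in benchmark:
        answer = mllm.answer(ex.image, ex.question)
        vqa_ok = answer.strip().lower() == ex.gt_answer.strip().lower()
        pred_mask = mllm.ground(ex.image, ex.referring_expression)
        seg_ok = mask_iou(pred_mask, ex.gt_mask) >= iou_thresh
        if vqa_ok and seg_ok:
            outcomes["both_ok"] += 1
        elif seg_ok:
            outcomes["vqa_only_fail"] += 1
        elif vqa_ok:
            outcomes["grounding_only_fail"] += 1
        else:
            outcomes["both_fail"] += 1
    return outcomes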

The benchmarking and the development of strong baselines helped unravel the insight that grounding can emerge in MLLMs that were not trained with pixel-level supervision, and that it can coincide with output tokens other than the referring expression. This confirms qualitatively the quantitative finding below, where the baseline that uses the closest noun phrase to the referring expression and its respective emergent grounding (a+s) scores lower than the one relying on a mask selection that does not depend on the referring expression (PixFoundation). A quantitative analysis that studies the categories of the tokens corresponding to the emergent grounding shows that it can be indicative of locations, object parts or other categories, and that it most frequently emerges closer to the end of the output in PixMMVP. An additional quantitative analysis evaluating the point accuracy of the emergent grounding, comparing both (a+s) and PixFoundation, confirms this further beyond mean intersection over union.
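As a companion to the mean intersection over union numbers, the point accuracy mentioned above can be pictured as checking whether a single predicted point for the emergent grounding lands inside the ground-truth mask. The paper's exact definition may differ, so treat this as an illustrative sketch only.

import numpy as np

def point_accuracy(pred_points, gt_masks):
    # pred_points: list of (row, col) pixel coordinates, one per example.
    # gt_masks: list of boolean ground-truth masks of matching image size.
    hits = []
    for (r, c), gt in zip(pred_points, gt_masks):
        hits.append(bool(gt[int(r), int(c)]))
    return float(np.mean(hits)) if hits else 0.0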

Results on the two benchmarks showing our strong baselines and upper bounds (PixFoundation).

PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?


Motion Existence - front elephant walking backwards.

Motion Order - bear landing from a jump backward over a barricade and walking backward.

Paired Evaluation for Visual Grounding and Region Captioning in Videos within a Motion-Centric Evaluation MoCentric-Bench

Can video MLLMs differentiate true motion from fake motion generated from a static keyframe? This evaluates their ability to identify motion existence. Can they differentiate forward motion from its reverse? This evaluates their ability to understand motion order. The ground truth of the true motion is highlighted in red. The segmentation of the fake/incorrect motion is highlighted in blue.

Abstract

Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify the shortcomings in the current benchmarks, where we show that a single frame can often suffice for capturing the motion referring expression without any temporal reasoning. To address this, we introduce four motion-centric probing techniques, particularly designed for the visual grounding task, to study video MLLMs' ability to identify true motion from a fake one and their ability to grasp the motion order. Consequently, we provide a motion-centric benchmark, MoCentric-Bench. It ensures that video MLLMs are evaluated towards leveraging the interaction between motion and language rather than being dominated by static appearance cues emphasized in existing visual grounding datasets. We further establish strong single-image baselines that are on par with or outperform prior methods. Finally, we explore simple motion-centric adaptation techniques that provide state-of-the-art performance on our MoCentric-Bench. Our motion-centric benchmark, evaluation and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding within videos.

I provide the first attempt to probe MLLMs for visual grounding of motion patterns described with language, inspecting both motion existence and motion order.
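These two probes can be pictured as simple transformations of a clip's frame sequence: a "fake" clip built by repeating a static keyframe, and a temporally reversed clip. The sketch below is my illustrative reading of that setup, not the benchmark's exact protocol; the function names and the choice of keyframe are assumptions, and the benchmark includes further probes and a multi-video layout not shown here.

from typing import Dict, List
import numpy as np

def motion_existence_probe(frames: List[np.ndarray], keyframe_idx: int = 0) -> Dict[str, List[np.ndarray]]:
    # Pair the real clip with a "fake" clip that simply repeats one static keyframe,
    # so a model that ignores motion cannot tell the two apart.
    static_clip = [frames[keyframe_idx]] * len(frames)
    return {"true_motion": frames, "fake_motion": static_clip}

def motion_order_probe(frames: List[np.ndarray]) -> Dict[str, List[np.ndarray]]:
    # Pair the forward clip with its temporal reversal to test whether the model
    # grounds the expression on the correct motion order.
    return {"forward": frames, "reverse": frames[::-1]}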

This figure shows the shortcomings in current motion referring expression segmentation datasets, where one frame is often sufficient to capture the expression without any motion. Expressions from left to right: (1) "jump to the left then jump back", (2) "dog playing with monkey", (3) "puppy that overwhelms another puppy", (4) "cow shaking head and looking at us", (5) "The little cat walking from behind to the front".

The original motion expressions and the corresponding automatically reversed expressions used with the reversed video.
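As a rough illustration of what such an automatic reversal has to do, the toy sketch below reverses the order of "then"-separated clauses and flips direction words. This is purely illustrative and is not the reversal pipeline used in the paper, which would need to handle far more phrasing variety.

# Toy rule-based reversal of a motion expression to match a temporally reversed video.
DIRECTION_SWAPS = {
    "forward": "backward", "backward": "forward",
    "forwards": "backwards", "backwards": "forwards",
    "left": "right", "right": "left",
}

def reverse_expression(expr: str) -> str:
    # Reverse the order of "then"-separated clauses, then flip direction words.
    clauses = [c.strip() for c in expr.split(" then ")]
    reversed_expr = " then ".join(reversed(clauses))
    words = [DIRECTION_SWAPS.get(w.lower(), w) for w in reversed_expr.split()]
    return " ".join(words)

# Example: reverse_expression("walk forward then jump to the left")
# -> "jump to the right then walk backward"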


Examples of the above two probes using the multi-video layout with the static keyframe and the reversed video. Motion expressions are shown below each video. The ground truth of the true motion is highlighted in red. The segmentation of the fake/incorrect motion is highlighted in blue.

MoCentric-Bench Visual Grounding Results.

Video MLLMs' performance degrades with the introduction of the Multi-Video Layout probe (val_u vs. val_u & Single frame, Reverse vs. val_u & Reverse). Region captioning results within the paired evaluation are to be released soon.

BibTeX

@article{siam2025pixfoundation-2.0,
  title={PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?},
  author={Siam, Mennatullah},
  journal={arXiv preprint arXiv:2509.02807},
  year={2025}
}
@article{siam2025pixfoundation,
  title={PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?},
  author={Siam, Mennatullah},
  journal={arXiv preprint arXiv:2502.04192},
  year={2025}
}