Pixel-level vision foundation models are a family of models that produce pixel-level output across a range of tasks: segmentation, pixel-level visual grounding and reasoning, depth estimation, and motion estimation and tracking. My work concentrates on benchmarking general-purpose multi-modal large language models (MLLMs) and their adaptation to pixel-level understanding, with an emphasis on visual grounding in both images and videos. The motivation behind this effort is to provide benchmarks that challenge the current emerging MLLMs and video MLLMs and push them in a direction that does not degrade their fundamental chat capabilities, while also evaluating their grounding abilities in challenging vision-centric and motion-centric scenarios. This effort resulted in the PixMMVP, PixCV-Bench and MoCentric-Bench benchmarks for pixel-level visual grounding. The majority of mainstream MLLMs focus on visual grounding with bounding boxes, since boxes are an easier step than segmentation output. I believe that a holistic view of pixel-level understanding and the interplay of such tasks can help MLLMs acquire better spatial intelligence and move towards better embodied intelligence, while remaining interpretable.

The series includes PixFoundation, which benchmarks visual grounding in images using challenging vision-centric scenarios, and PixFoundation 2.0, which benchmarks visual grounding in videos using a motion-centric evaluation. Since I am quite excited about the intersection of pixel-level understanding tasks and foundation models, I launched the first PixFoundation workshop, in conjunction with CVPR 2025, with the rest of the organizers. You can learn more about our workshop and the recorded talks from our amazing speakers here.

I am also strongly committed to a responsible approach to AI, so for me interpretability and benchmarking in challenging scenarios go hand in hand towards that goal. If a model cannot perceive local motion patterns, or cannot identify fine-grained spatial details, would you safely use it in a robotics system? While it is extremely important to build models that work, it is equally important to test and push them to fail, and better still to understand why they fail and interpret their inner workings. Quantifying failures (benchmarking), understanding failures (through interpretability) and solving these failures (for safety and robustness) is what constitutes a responsible AI approach in my view.
Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data, using specialized decoders for the segmentation task. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering (VQA). Surprisingly, some of these methods even degrade the grounding ability of MLLMs that were never trained with such pixel-level supervision. In this work, we propose two novel challenging benchmarks with paired evaluation for both VQA and grounding. We show that MLLMs without pixel-level grounding supervision can outperform the state of the art in such tasks. Our paired benchmarks and evaluation enable additional analysis of the reasons for failure with respect to VQA and/or grounding. Furthermore, we propose simple baselines to extract the grounding information, which can be plugged into any MLLM and which we call PixFoundation. More importantly, we study the research question of "When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?" We show that grounding can coincide with object parts, their location, appearance, context or state, and that 27-45% of the examples in both benchmarks exhibit this phenomenon.
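To make the paired evaluation concrete, below is a minimal sketch of scoring a single example on both axes at once, so a failure can be attributed to VQA, to grounding, or to both. The helper names (paired_eval, the gt_answer/gt_mask fields), the exact-match answer check and the 0.5 IoU threshold are illustrative assumptions for this sketch, not the released evaluation code.

import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def paired_eval(example: dict, vqa_answer: str, pred_mask: np.ndarray,
                iou_thresh: float = 0.5) -> dict:
    """Score one example on both VQA and grounding, so failures can be
    attributed to either task or to both (hypothetical example format)."""
    vqa_correct = vqa_answer.strip().lower() == example["gt_answer"].strip().lower()
    iou = mask_iou(pred_mask, example["gt_mask"])
    return {
        "vqa_correct": vqa_correct,
        "iou": iou,
        "grounding_correct": iou >= iou_thresh,
    }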
Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using image and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify shortcomings in the current benchmarks, where we show that a single frame can often suffice to capture the motion referring expression without any temporal reasoning. To address this, we introduce four motion-centric probing techniques, designed specifically for the visual grounding task, to study video MLLMs' ability to distinguish true motion from fake motion and to grasp motion order. Consequently, we provide a motion-centric benchmark, MoCentric-Bench. It ensures that video MLLMs are evaluated on leveraging the interaction between motion and language rather than being dominated by the static appearance cues emphasized in existing visual grounding datasets. We further establish strong single-image baselines that are on par with or outperform prior methods. Finally, we explore simple motion-centric adaptation techniques that achieve state-of-the-art performance on our MoCentric-Bench. Our motion-centric benchmark, evaluation and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding in videos.
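As an illustration of the single-image baselines mentioned above, the sketch below grounds the motion expression on one keyframe only and then propagates the resulting mask across the clip. Here image_grounder and mask_propagator are placeholders for any image-level grounding MLLM and any off-the-shelf mask propagation method, and the middle-frame choice is an assumption; this is not the exact baseline from the paper.

from typing import Callable, List
import numpy as np

def single_frame_baseline(
    frames: List[np.ndarray],
    expression: str,
    image_grounder: Callable[[np.ndarray, str], np.ndarray],
    mask_propagator: Callable[[List[np.ndarray], int, np.ndarray], List[np.ndarray]],
) -> List[np.ndarray]:
    """Ground the motion expression on a single keyframe, then propagate
    the mask to the rest of the clip. If this matches video MLLMs, the
    benchmark is not really probing motion understanding."""
    key_idx = len(frames) // 2                      # assume middle frame as keyframe
    key_mask = image_grounder(frames[key_idx], expression)
    return mask_propagator(frames, key_idx, key_mask)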
the little cat walking from behind to the front
cow shaking head and looking at us
monkey walks around with a cart and pushes it
the male child moving away from the small dog after an embrace
the two trucks reversing side by side
Examples of the above two probes using the multi-video layout with the static keyframe and the reversed video. Motion expressions are shown below each video. The ground truth for the true motion is highlighted in red. The segmentation of the fake/incorrect motion is highlighted in blue.
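The probe pairs in the figure can be read as follows: a clip is paired either with its temporal reversal or with a clip that repeats a single keyframe, so that only static appearance cues remain. The sketch below is an illustrative reconstruction of that idea; the function name and the choice of the middle frame as the keyframe are assumptions, not the benchmark's exact construction.

from typing import List, Tuple
import numpy as np

def make_probe_pair(frames: List[np.ndarray], probe: str) -> Tuple[List[np.ndarray], List[np.ndarray]]:
    """Build a (true, fake) video pair for a motion-centric probe.
    'reverse': pair the clip with its temporal reversal.
    'static':  pair the clip with a clip repeating one keyframe,
               so motion cues are removed and only appearance remains."""
    if probe == "reverse":
        fake = list(reversed(frames))
    elif probe == "static":
        key = frames[len(frames) // 2]              # assumed keyframe choice
        fake = [key] * len(frames)
    else:
        raise ValueError(f"unknown probe: {probe}")
    return frames, fake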
@article{siam2025pixfoundation-2.0,
  title   = {PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?},
  author  = {Siam, Mennatullah},
  journal = {arXiv preprint arXiv:2509.02807},
  year    = {2025}
}
@article{siam2025pixfoundation,
  title   = {PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?},
  author  = {Siam, Mennatullah},
  journal = {arXiv preprint arXiv:2502.04192},
  year    = {2025}
}