Abstract
Modern video-language models are typically trained and evaluated on videos downsampled to low frame rates, and many widely used video understanding benchmarks can be solved with sparse temporal sampling. We introduce FPS-Bench, a large-scale video question-answering benchmark designed to evaluate whether VLMs can perceive and reason about rapid, high-frequency events. FPS-Bench introduces minFPS — minimum frames-per-second — a metric that measures the lowest frame rate required to solve a video-question pair. The benchmark contains 1,000 multiple-choice questions from 554 videos, spanning nine temporal reasoning categories and diverse visual domains. Every example requires at least 4 FPS, with an average minFPS of approximately 6.8 FPS. Evaluating state-of-the-art open- and closed-source VLMs reveals a substantial gap: current models achieve only around 30% accuracy, while humans achieve 72.2%. FPS-Bench exposes a critical blind spot in current video-language models and provides a focused benchmark for developing models that truly understand high-frame-rate video.
Key Contributions
A multiple-choice video question-answering benchmark focused on rapid, high-frequency temporal events that are missed by standard low-FPS sampling — spanning nine distinct temporal categories.
We propose minFPS, a metric that quantifies the minimum frame rate required to correctly answer a video-question pair — with a mandatory floor of 4 FPS and an average of 6.8 FPS.
State-of-the-art VLMs perform only slightly above random chance (~30% vs. 25%), while humans reach 72.2% — demonstrating a significant gap in fine-grained temporal understanding.
The Dataset
FPS-Bench questions are designed so that the correct answer cannot be verified below a minimum frame rate. Annotators measure minFPS by progressively adjusting the frame-sampling rate until the correct answer becomes unambiguously verifiable. This directly tests whether models can perceive and reason over rapid temporal evidence, rather than relying on static scene cues or sparse frame samples. Videos are sourced from the large and diverse YouTube-8M dataset, spanning five high-level visual domains: Media & Entertainment, Hobbies & Gaming, Sports & Fitness, Vehicles, and Miscellaneous.
Dataset statistics. FPS-Bench is uniformly distributed across question categories and drawn from a wide range of visual domains. Each question is annotated with its minFPS; all questions have a minFPS of at least 4, with an average of approximately 6.8. Categories such as Blink & Miss require the highest frame rates.
Question Types
Counting cycles or repetitions of a sustained, high-speed periodic action, such as drumstick hits or fan rotations.
Inferring or comparing relative speed where the distinction is only visible over time at a sufficient frame rate.
Identifying subtle differences in movement, form, or trajectory that distinguish correct execution.
Sequencing events that occur nearly simultaneously or in rapid succession.
Identifying the precise state or attribute of an object at the exact moment of a rapid interaction.
Detecting sudden, non-repetitive, extremely short-duration visual events that may appear in only one or two frames.
Determining the temporal relationship between events to identify which caused a subsequent event.
Determining whether multiple high-frequency actions occur simultaneously or whether one precedes another.
Counting the exact number of discrete rapid events within a time interval.
The Metric
minFPS is the minimum integer frame rate at which a human annotator can consistently verify the correct answer for a video-question pair. If the video is sampled below this threshold, the visual evidence needed to answer the question is lost. Annotators begin viewing at 1 FPS and incrementally raise the rate until the answer is unambiguously verifiable, simulating the standard frame-downsampling process used by modern VLMs.
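The annotation procedure described above amounts to a linear search over integer frame rates. The following is a minimal sketch of that search, not the authors' annotation tooling: `find_min_fps`, the `is_verifiable` callback (standing in for a human annotator's judgement), and the `max_fps` cap of 30 are all hypothetical names and values introduced for illustration.

```python
from typing import Callable, Optional


def find_min_fps(is_verifiable: Callable[[int], bool], max_fps: int = 30) -> Optional[int]:
    """Return the smallest integer frame rate at which the answer is verifiable.

    `is_verifiable(fps)` is a hypothetical stand-in for the annotator's
    judgement after viewing the clip sampled at `fps`. Returns None if the
    answer is never verifiable up to `max_fps` (an assumed cap).
    """
    # Start at 1 FPS and raise the rate one step at a time.
    for fps in range(1, max_fps + 1):
        if is_verifiable(fps):
            return fps
    return None


# Toy judgement: the decisive cue only becomes visible at 6 FPS or above.
print(find_min_fps(lambda fps: fps >= 6))
```

Because the search starts at 1 FPS and stops at the first rate that works, the returned value is the minimum by construction, provided verifiability does not flip back to false at higher rates.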
Traditional benchmarks may require watching a particular time interval, but they often do not measure whether the necessary frames survive temporal downsampling. A 10-minute video sampled at 1 FPS and one sampled at 30 FPS have the same temporal-certificate length, yet the 1 FPS video contains far fewer frames — often making rapid events indistinguishable. FPS-Bench focuses on this missing dimension: the frame rate required to preserve the evidence itself.
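Sampling illustration. Native high-frame-rate video (24–30 FPS) versus sparse frames at low FPS: at low FPS the transient cue disappears between frames, whereas at a sufficient frame rate the evidence is preserved and the answer becomes verifiable.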
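To make the effect of temporal downsampling concrete, the sketch below counts how many uniformly sampled frames land inside a short event window. It assumes uniform sampling that starts at t = 0; the function names and the example event (a cue visible from 3.31 s to 3.45 s in a 10-second clip) are hypothetical and not taken from the benchmark.

```python
import math
from typing import List


def sampled_timestamps(duration_s: float, fps: float) -> List[float]:
    """Timestamps (in seconds) of frames kept by uniform sampling at `fps`."""
    n_frames = int(math.floor(duration_s * fps))
    return [i / fps for i in range(n_frames)]


def frames_in_event(duration_s: float, fps: float,
                    event_start_s: float, event_end_s: float) -> int:
    """Number of sampled frames whose timestamp falls inside the event window."""
    return sum(event_start_s <= t < event_end_s
               for t in sampled_timestamps(duration_s, fps))


# A 10-minute video keeps 600 frames at 1 FPS but 18,000 frames at 30 FPS.
print(len(sampled_timestamps(600, 1)), len(sampled_timestamps(600, 30)))

# Hypothetical 140 ms cue in a 10-second clip, viewed at several sampling rates.
for fps in (1, 4, 8, 30):
    print(fps, frames_in_event(10, fps, 3.31, 3.45))
```

In this toy setting the cue falls entirely between samples at 1 and 4 FPS, is caught by a single frame at 8 FPS, and spans four frames at 30 FPS.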
Benchmark Comparison
Comparing minFPS across datasets. FPS-Bench has a substantially higher average minFPS than widely used video understanding benchmarks such as Video-MME, EgoSchema, MVBench, Perception Test, MotionBench, and AirLetters — exceeding even fine-grained-motion benchmarks by more than 2.5×.
Unlike long-video benchmarks that can often be solved with sparse temporal sampling, FPS-Bench requires dense short-term temporal perception. Benchmarks like EgoSchema and Video-MME test long-video understanding, but their minFPS is extremely low because even heavy downsampling leaves enough context in long inputs to answer their questions.
Results
| Model | Overall Accuracy |
|---|---|
| Random baseline | 25.0% |
| Gemini 2.5 Pro | 28.9% |
| Qwen-3-VL-32B | 30.7% |
| Oryx | 31.3% |
| GPT-4o (best VLM) | 31.8% |
| Human Performance | 72.2% |
Across open-source, closed-source, video-native, and image-based VLMs, model performance remains far below human accuracy. Even when videos are slowed down or sampled at higher frame rates, models do not close the gap, suggesting that the limitation is not only input frame access but also fine-grained temporal perception and reasoning. Notably, the open-source Oryx (31.3%) scores above the closed-source Gemini 2.5 Pro (28.9%), although this difference should be read cautiously: every model remains within about seven points of the random baseline.
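The size of the gap can be read directly off the results table. The short sketch below computes, for each model, its margin over the random baseline and its distance from human accuracy. The accuracies are copied from the table; the variable and function names are illustrative only.

```python
RANDOM_BASELINE = 25.0   # four-way multiple choice, from the results table
HUMAN_ACCURACY = 72.2    # from the results table

# Overall accuracies (%) as reported in the results table.
model_accuracy = {
    "Gemini 2.5 Pro": 28.9,
    "Qwen-3-VL-32B": 30.7,
    "Oryx": 31.3,
    "GPT-4o": 31.8,
}


def margins(accuracies):
    """Map model name -> (points above chance, points below human)."""
    return {
        name: (round(acc - RANDOM_BASELINE, 1), round(HUMAN_ACCURACY - acc, 1))
        for name, acc in accuracies.items()
    }


for name, (above_chance, below_human) in margins(model_accuracy).items():
    print(f"{name}: +{above_chance} over chance, -{below_human} vs. human")
```

Even the best-scoring model sits 6.8 points above chance and 40.4 points below human performance.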
Accuracy vs. minFPS (GPT-4o). Performance does not decrease monotonically with frame rate; GPT-4o is in fact worst on questions requiring around 7 FPS, underscoring that the difficulty is rooted in temporal perception, not merely access to frames.
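A breakdown like this can be produced by grouping per-question results by their annotated minFPS. The sketch below shows one way to do that, assuming each result is available as a `(min_fps, is_correct)` pair; that record layout is an assumption, since the release format is not specified here, and the toy records are made up for illustration rather than taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_min_fps(records):
    """Group (min_fps, is_correct) pairs and return accuracy per minFPS value."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for min_fps, is_correct in records:
        totals[min_fps] += 1
        correct[min_fps] += int(is_correct)
    # Sort keys so the result reads from low to high minFPS.
    return {fps: correct[fps] / totals[fps] for fps in sorted(totals)}


# Made-up toy records: (annotated minFPS, whether the model answered correctly).
toy_records = [
    (4, True), (4, False),
    (7, False), (7, False),
    (10, True), (10, False), (10, True), (10, True),
]
print(accuracy_by_min_fps(toy_records))
```

With real per-question predictions, the same grouping would show whether accuracy falls steadily as minFPS rises or dips at particular frame rates.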
Analysis
Models miss brief but decisive events — a small flash, contact, bounce, or kick — even when the relevant frames are provided.
Models struggle to maintain fine-grained temporal order or to count rapid events, performing especially poorly on instance counting.
Representative failure. Models can miss small but decisive visual details even when provided with the necessary frames, leading to confident but incorrect answers.
Get Started
FPS-Bench is designed to help evaluate and improve video-language models that need to reason over high-frequency temporal events. It is especially relevant for models that claim strong video understanding, temporal reasoning, or fine-grained motion perception.
Public release coming soon.
@inproceedings{choudhury2026fpsbench,
title = {FPS-Bench: A Benchmark for High Frame-Rate Video Understanding},
author = {Choudhury, Rohan and Dandurand, Jean-Sebastien and Qiu, Kai and
Bhat, Kshitij Madhav and Sharma, Kartik and Dahiya, Liza and
Zhao, Yizhou and Kundu, Souraja and Lin, Chun-Hsien and
Kitani, Kris M. and Jeni, L\'aszl\'o A.},
booktitle = {CVPR},
year = {2026}
}