FPS-Bench: A Benchmark for
High Frame-Rate Video Understanding

Rohan Choudhury*, Jean-Sebastien Dandurand*, Kai Qiu, Kshitij Madhav Bhat, Kartik Sharma, Liza Dahiya, Yizhou Zhao, Souraja Kundu, Chun-Hsien Lin, Kris M. Kitani, László A. Jeni
Carnegie Mellon University
*Equal contribution   Equal advising   rchoudhu@andrew.cmu.edu
CVPR 2026

Code and data coming soon.

A camera-flash question: at 2 FPS and 4 FPS the flash frame is missed, but at 14 FPS the flash is clearly visible. A radar plot compares VLM accuracy against human performance across nine categories.

FPS-Bench evaluates video-language models on questions that cannot be answered unless the video is sampled at a sufficiently high frame rate. (Left) A brief camera flash is missed at 2 FPS and 4 FPS but becomes visible at 14 FPS. (Right) Across nine temporal categories, state-of-the-art VLMs trail far behind human performance.


Abstract

Do video models really see fast events?

Modern video-language models are typically trained and evaluated on videos downsampled to low frame rates, and many widely used video understanding benchmarks can be solved with sparse temporal sampling. We introduce FPS-Bench, a large-scale video question-answering benchmark designed to evaluate whether VLMs can perceive and reason about rapid, high-frequency events. FPS-Bench introduces minFPS — minimum frames-per-second — a metric that measures the lowest frame rate required to solve a video-question pair. The benchmark contains 1,000 multiple-choice questions from 554 videos, spanning nine temporal reasoning categories and diverse visual domains. Every example requires at least 4 FPS, with an average minFPS of approximately 6.8 FPS. Evaluating state-of-the-art open- and closed-source VLMs reveals a substantial gap: current models achieve only around 30% accuracy, while humans achieve 72.2%. FPS-Bench exposes a critical blind spot in current video-language models and provides a focused benchmark for developing models that truly understand high-frame-rate video.


Key Contributions

What FPS-Bench brings

High-FPS Video QA Benchmark

A multiple-choice video question-answering benchmark focused on rapid, high-frequency temporal events that are missed by standard low-FPS sampling — spanning nine distinct temporal categories.

Minimum Necessary Frame Rate

We propose minFPS, a metric that quantifies the minimum frame rate required to correctly answer a video-question pair. Every FPS-Bench question has a minFPS of at least 4, and the average is approximately 6.8.

Revealing a VLM Blind Spot

State-of-the-art VLMs perform only slightly above random chance (~30% vs. 25%), while humans reach 72.2% — demonstrating a significant gap in fine-grained temporal understanding.

The Dataset

The FPS-Bench Dataset

1,000
Questions
554
Videos
9
Temporal Categories
≥ 4
FPS Required (minFPS floor)
6.67
Median minFPS (6.8 avg)
72.2%
Human Accuracy
31.8%
Best VLM Accuracy

FPS-Bench questions are designed so that the correct answer cannot be verified below a minimum frame rate. Annotators measure minFPS by progressively adjusting the frame-sampling rate until the correct answer becomes unambiguously verifiable. This directly tests whether models can perceive and reason over rapid temporal evidence, rather than relying on static scene cues or sparse frame samples. Videos are sourced from the large and diverse YouTube-8M dataset, spanning five high-level visual domains: Media & Entertainment, Hobbies & Gaming, Sports & Fitness, Vehicles, and Miscellaneous.

Sunburst chart of FPS-Bench's visual domains and question categories, alongside per-category time-duration and average minFPS distributions.

Dataset statistics. FPS-Bench is uniformly distributed across question categories and drawn from a wide range of visual domains. Each question is annotated with its minFPS; all questions have a minFPS of at least 4, with a median of 6.67 and an average of approximately 6.8. Categories such as Blink and Miss require the highest frame rates.
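As a rough illustration of how the floor and summary statistics described above could be checked from per-question annotations, the sketch below uses a made-up list of minFPS values. The helper name `summarize_min_fps` and the toy numbers are assumptions for illustration; they are not part of the FPS-Bench release.

```python
from statistics import mean, median

MIN_FPS_FLOOR = 4.0  # every FPS-Bench question needs at least 4 FPS (from the text)


def summarize_min_fps(values):
    """Return (mean, median) after checking that no value is below the floor."""
    below = [v for v in values if v < MIN_FPS_FLOOR]
    if below:
        raise ValueError(f"{len(below)} value(s) fall below the {MIN_FPS_FLOOR} FPS floor")
    return mean(values), median(values)


# Toy annotations, invented for illustration only (not benchmark data).
toy_min_fps = [4.0, 5.0, 6.0, 7.5, 10.0]
avg, med = summarize_min_fps(toy_min_fps)
print(f"mean={avg:.2f} median={med:.2f}")
```

The mean and the median can differ noticeably when a few questions need very high frame rates, which is why it helps to report both.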

Question Types

Nine high-frame-rate temporal categories

Repetitive Motion

Counting cycles or repetitions of a sustained, high-speed periodic action, such as drumstick hits or fan rotations.

Speed Recognition

Inferring or comparing relative speed where the distinction is only visible over time at a sufficient frame rate.

Fine-Grained Motion

Identifying subtle differences in movement, form, or trajectory that distinguish correct execution.

Action Order

Sequencing events that occur nearly simultaneously or in rapid succession.

State at Event

Identifying the precise state or attribute of an object at the exact moment of a rapid interaction.

Blink and Miss

Detecting sudden, non-repetitive, extremely short-duration visual events that may appear in only one or two frames.

Causality Detection

Determining the temporal relationship between events to identify which caused a subsequent event.

Synchronization Assessment

Determining whether multiple high-frequency actions occur simultaneously or whether one precedes another.

Instance Count

Counting the exact number of discrete rapid events within a time interval.

Examples from the benchmark

The Metric

What is minFPS?

minFPS is the minimum frame rate at which a human annotator can consistently verify the correct answer for a video-question pair. If the video is sampled below this threshold, the visual evidence needed to answer the question is lost. Annotators begin viewing at 1 FPS and incrementally raise the rate until the answer is unambiguously verifiable, mirroring the frame-downsampling step used by modern VLMs.

minFPS = lowest FPS needed to verify the answer
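The annotation procedure can be sketched as a search over candidate frame rates. In the sketch below, `find_min_fps` and the `is_verifiable` callback are hypothetical names: the callback stands in for a human annotator's judgment at a given rate, and the set of candidate rates is an assumption, since the page does not specify the step size used.

```python
from typing import Callable, Iterable, Optional


def find_min_fps(is_verifiable: Callable[[float], bool],
                 candidate_rates: Iterable[float]) -> Optional[float]:
    """Return the lowest candidate rate at which the answer can be verified."""
    for fps in sorted(candidate_rates):
        if is_verifiable(fps):
            return fps
    return None  # not verifiable at any candidate rate


# Toy stand-in for the annotator: pretend the answer needs at least 6 FPS.
def toy_annotator(fps: float) -> bool:
    return fps >= 6


print(find_min_fps(toy_annotator, range(1, 31)))  # prints 6
```

The scan goes from low to high rates and stops at the first success, so it does not assume that verifiability is monotonic in the frame rate.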

Traditional benchmarks may require watching a particular time interval, but they often do not measure whether the necessary frames survive temporal downsampling. A 10-minute video sampled at 1 FPS and one sampled at 30 FPS have the same temporal-certificate length, yet the 1 FPS video contains far fewer frames — often making rapid events indistinguishable. FPS-Bench focuses on this missing dimension: the frame rate required to preserve the evidence itself.
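The arithmetic behind the 10-minute example is simple enough to spell out. This is a minimal sketch with a hypothetical helper `frames_retained`, assuming uniform sampling over the whole clip.

```python
def frames_retained(duration_s: float, fps: float) -> int:
    """Number of frames kept when a clip is sampled uniformly at `fps`."""
    return int(duration_s * fps)


duration_s = 10 * 60  # the 10-minute video from the text
for fps in (1, 30):
    n = frames_retained(duration_s, fps)
    gap_ms = 1000 / fps  # time between consecutive sampled frames
    print(f"{fps:>2} FPS: {n} frames, {gap_ms:.1f} ms between frames")
```

At 1 FPS the clip is reduced to 600 frames spaced a full second apart, compared with 18,000 frames about 33 ms apart at 30 FPS, so any event shorter than a second can fall entirely between two sampled frames.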

Full Video

Native high frame rate (24–30 FPS)

Downsampled

Sparse frames at low FPS

Event Lost

Transient cue disappears between frames

minFPS Threshold

Evidence preserved → answer becomes verifiable
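The steps above can be approximated with a purely geometric check: does any uniformly sampled timestamp land inside the event's time window? The sketch below is an illustration only. The function name `event_is_sampled` and the flash timings are made up, and actual minFPS values in the benchmark come from human annotators, not from this check.

```python
import math


def event_is_sampled(event_start_s, event_duration_s, fps, offset_s=0.0):
    """True if a uniform sampler at `fps` lands inside [start, start + duration)."""
    # Index of the first sample at or after the event start.
    k = max(0, math.ceil((event_start_s - offset_s) * fps))
    sample_time = offset_s + k / fps
    return sample_time < event_start_s + event_duration_s


# Hypothetical flash: starts at 1.30 s and lasts 0.08 s (invented timings).
for fps in (2, 4, 14):
    print(fps, event_is_sampled(1.30, 0.08, fps))
```

With these invented timings the flash is missed at 2 and 4 FPS and caught at 14 FPS. A sampling period no longer than the event (here 1 / 0.08 = 12.5 FPS or higher) guarantees a hit regardless of alignment; lower rates can still catch the event by luck of phase.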

Benchmark Comparison

FPS-Bench requires higher frame rates

Bar chart of average minFPS across benchmarks: FPS-Bench 6.7, AirLetters 2.73, MotionBench 1.9, MVBench 0.8, Perception Test 0.34, EgoSchema 0.25, Video-MME 0.1, Video-MMMU 0.1.

Comparing minFPS across datasets. FPS-Bench has a substantially higher average minFPS than widely used video understanding benchmarks such as Video-MME, EgoSchema, MVBench, Perception Test, MotionBench, and AirLetters. It exceeds even the fine-grained-motion benchmarks by about 2.5× or more (6.7 vs. 2.73 for AirLetters, the closest).
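For readers who want the comparison as ratios, the sketch below divides the FPS-Bench average by each benchmark's average, using the values listed for the bar chart above. The helper name `ratio_to_fps_bench` is hypothetical.

```python
# Average minFPS per benchmark, as listed for the bar chart above.
avg_min_fps = {
    "FPS-Bench": 6.7,
    "AirLetters": 2.73,
    "MotionBench": 1.9,
    "MVBench": 0.8,
    "Perception Test": 0.34,
    "EgoSchema": 0.25,
    "Video-MME": 0.1,
    "Video-MMMU": 0.1,
}


def ratio_to_fps_bench(name):
    """How many times higher the FPS-Bench average is than `name`'s average."""
    return avg_min_fps["FPS-Bench"] / avg_min_fps[name]


for name in avg_min_fps:
    if name != "FPS-Bench":
        print(f"{name}: {ratio_to_fps_bench(name):.2f}x")
```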

Unlike long-video benchmarks that can often be solved with sparse temporal sampling, FPS-Bench requires dense short-term temporal perception. Benchmarks like EgoSchema and Video-MME test long-video understanding, but their minFPS is extremely low because their questions depend on long-range context that survives even heavy downsampling.

Results

Current VLMs struggle on FPS-Bench

Model               Overall Accuracy
Random baseline     25.0%
Gemini 2.5 Pro      28.9%
Qwen-3-VL-32B       30.7%
Oryx                31.3%
GPT-4o (best VLM)   31.8%
Human Performance   72.2%

Across open-source, closed-source, video-native, and image-based VLMs, model performance remains far below human accuracy. Even when videos are slowed down or sampled at higher frame rates, models do not close the gap — suggesting that the limitation is not only input frame access but also fine-grained temporal perception and reasoning. Notably, the open-source Oryx outperforms Gemini 2.5 Pro, hinting that the gap between open and closed video models is narrowing.
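To make the size of the gap concrete, the sketch below scores multiple-choice predictions and then computes each model's margin over the random baseline and its distance from human accuracy. The reported numbers are copied from the results table above; the function names and the toy answer key are assumptions for illustration, not the benchmark's evaluation code.

```python
def accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Overall accuracies (percent) copied from the results table above.
reported = {
    "Random baseline": 25.0,
    "Gemini 2.5 Pro": 28.9,
    "Qwen-3-VL-32B": 30.7,
    "Oryx": 31.3,
    "GPT-4o": 31.8,
    "Human Performance": 72.2,
}


def margins(name):
    """Return (points above the random baseline, points below human accuracy)."""
    above_chance = round(reported[name] - reported["Random baseline"], 1)
    below_human = round(reported["Human Performance"] - reported[name], 1)
    return above_chance, below_human


# Toy answer key and predictions, invented for illustration only.
toy_answers = ["A", "C", "B", "D", "A", "B", "C", "D"]
toy_predictions = ["A", "B", "B", "A", "C", "B", "D", "D"]
print(accuracy(toy_predictions, toy_answers))  # 4 of 8 correct

for model in ("Gemini 2.5 Pro", "Qwen-3-VL-32B", "Oryx", "GPT-4o"):
    print(model, margins(model))
```

Even the best model in the table sits 6.8 points above the random baseline and 40.4 points below human accuracy.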

Bar chart of GPT-4o accuracy by minFPS bucket, showing the worst performance on questions requiring around 7 FPS rather than steadily decreasing.

Accuracy vs. minFPS (GPT-4o). Accuracy does not decrease monotonically as minFPS increases; GPT-4o is in fact worst on questions requiring around 7 FPS, underscoring that the difficulty is rooted in temporal perception, not merely access to frames.
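A breakdown like this one can be produced by grouping per-question results by minFPS and computing accuracy within each group. The sketch below is one possible way to do it, assuming buckets of width 1 FPS (the page does not state the bucket edges) and using invented records; `accuracy_by_bucket` is a hypothetical helper.

```python
import math
from collections import defaultdict


def accuracy_by_bucket(records):
    """Map floor(minFPS) -> accuracy over the questions in that bucket."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for min_fps, is_correct in records:
        bucket = math.floor(min_fps)
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# (minFPS, model answered correctly) pairs, invented for illustration only.
toy_records = [
    (4.0, True), (4.5, False),
    (6.67, False),
    (7.0, False), (7.5, True),
    (10.0, True),
]
print(accuracy_by_bucket(toy_records))
```

Plotting these per-bucket accuracies against the bucket value shows at a glance whether accuracy falls steadily with minFPS or, as reported for GPT-4o, dips in the middle of the range.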

Analysis

Why do models fail?

Missing decisive events

Models miss brief but decisive events — a small flash, contact, bounce, or kick — even when the relevant frames are provided.

Losing temporal order & count

Models struggle to maintain fine-grained temporal order or to count rapid events, performing especially poorly on instance counting.

A representative failure: a frontier VLM answers a high-frame-rate question incorrectly because it misses a small but decisive visual detail.

Representative failure. Models can miss small but decisive visual details even when provided with the necessary frames, leading to confident but incorrect answers.

Get Started

Use FPS-Bench

FPS-Bench is designed to help evaluate and improve video-language models that need to reason over high-frequency temporal events. It is especially relevant for models that claim strong video understanding, temporal reasoning, or fine-grained motion perception.

Public release coming soon.

BibTeX

@inproceedings{choudhury2026fpsbench,
  title     = {FPS-Bench: A Benchmark for High Frame-Rate Video Understanding},
  author    = {Choudhury, Rohan and Dandurand, Jean-Sebastien and Qiu, Kai and
               Bhat, Kshitij Madhav and Sharma, Kartik and Dahiya, Liza and
               Zhao, Yizhou and Kundu, Souraja and Lin, Chun-Hsien and
               Kitani, Kris M. and Jeni, L\'aszl\'o A.},
  booktitle = {CVPR},
  year      = {2026}
}