Short-Films 20K (SF20K)

Story-level Video Understanding from 20K Short Films

LIX, École Polytechnique, IP Paris · MBZUAI

Abstract

Recent developments in vision-language models have significantly advanced video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often depict activities of one person in a single scene. Although existing movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage issues given the use of subtitles and other information about commercial movies during LLM pretraining. To address the above limitations, we propose Short-Films 20K (SF20K), the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering. Our extensive analysis of SF20K reveals minimal data leakage, emphasizes the need for long-term reasoning, and demonstrates the strong performance of recent VLMs. Finally, we show that instruction tuning on the SF20K-Train set substantially improves model performance, paving the way for future progress in long-term video understanding.

Leaderboard

MCQA: Multiple-choice question answering          OEQA: Open-ended question answering

V: Vision-only (video frames)          L: Language-only (subtitles)          VL: Vision-language (video frames + subtitles)

 #   Model              #Params     MCQA (V / L / VL)      OEQA (V / L / VL)
 1   GPT-4o-mini        -             -  /  -   / 79.4       -  /  -   / 60.0
 2   Pixtral            12B         38.5 /  -   / 74.4     29.5 / 49.4 / 49.8
 3   Llama-3.2-Vision   11B         57.9 / 72.3 / 75.0     24.4 / 45.2 / 47.8
 4   Llava-Next-Video   7B          32.8 / 52.3 / 64.2     27.4 / 42.2 / 42.7
 5   Llava-OneVision    7B          63.2 / 74.7 / 78.2     25.1 / 35.3 / 37.7
 6   LongVA             7B          58.5 / 73.8 / 70.0     26.3 / 51.7 / 36.8
 7   Long-LLaVA         7B          52.1 / 70.4 / 72.6     17.4 / 36.7 / 33.5
 8   Llava-OneVision    0.5B        29.2 / 35.4 / 37.8     14.3 / 20.2 / 22.6
 9   MovieChat          7B           8.4 /  6.4 /  8.0     14.0 / 15.7 / 11.8
10   Video-LLaVA        7B          32.4 / 21.3 / 45.7      8.4 / 10.6 /  8.2
11   TimeChat           7B          25.5 /  6.4 / 31.8     26.4 /  9.4 /  5.9
12   mPLUG-Owl2         7B          38.3 / 20.7 / 21.3     22.1 /  1.8 /  1.6
13   FrozenBiLM         0.5B        23.4 / 38.2 / 38.6       -  /  -   /  -

Dataset & Benchmark

Dataset Statistics


Overall Statistics of SF20K


Detailed statistics of each SF20K subset (a hypothetical loading sketch follows the list):

  • SF20K: The full dataset, containing both the training and test sets.
  • SF20K-Test: The test benchmark, featuring manually curated questions generated from movie synopses.
  • SF20K-Test-Silent: A subset of SF20K-Test focused exclusively on silent movies.
  • SF20K-Test-Expert: A subset of SF20K-Test with manually crafted, challenging questions.
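
For orientation, here is a minimal Python loading sketch. It assumes each subset ships as a JSON file of question records with fields such as video_id, question, and options; the file names and field names are hypothetical, so check the released data for the actual schema.

    import json
    from collections import Counter

    # Hypothetical file names and fields; adjust to the released SF20K format.
    SUBSETS = {
        "test": "sf20k_test.json",
        "test_silent": "sf20k_test_silent.json",
        "test_expert": "sf20k_test_expert.json",
    }

    def load_subset(name: str) -> list[dict]:
        """Load one SF20K subset as a list of question records."""
        with open(SUBSETS[name], "r", encoding="utf-8") as f:
            return json.load(f)

    if __name__ == "__main__":
        test = load_subset("test")
        films = {q["video_id"] for q in test}            # hypothetical field
        print(f"{len(test)} questions over {len(films)} films")
        print(Counter(len(q["options"]) for q in test))  # options per question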

Word Cloud Analysis

Comparative word cloud analysis across three video domains: movies (SF20K-Test), instructional videos (iVQA), and egocentric videos (EgoSchema).
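
As an illustration, such word clouds can be produced from the question texts with the standard wordcloud package; the input file and its question field below are assumptions, not the exact files used by the authors.

    import json
    from wordcloud import WordCloud, STOPWORDS

    def question_cloud(path: str, out_png: str) -> None:
        """Render a word cloud from the question texts of one benchmark file."""
        with open(path, "r", encoding="utf-8") as f:
            questions = [q["question"] for q in json.load(f)]  # hypothetical field
        cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS,
                          background_color="white").generate(" ".join(questions))
        cloud.to_file(out_png)

    # e.g. question_cloud("sf20k_test.json", "sf20k_wordcloud.png")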

Comparison to other datasets

Comparison of SF20K with other video question answering datasets.

Experimental Results

Data Leakage

Modern LLMs, pre-trained on vast amounts of internet data, are prone to data leakage: they memorize movie-related information such as synopses, reviews, and scripts instead of analyzing the video content. For example, using only the movie title, GPT-4 achieves 76.0% accuracy on MovieQA and 71.2% on LVU, but only 36.0% on SF20K-Test, which focuses on amateur films with minimal online presence. This makes SF20K-Test a more reliable benchmark for long-term video understanding, as it minimizes biases caused by pretraining exposure. Moreover, leakage severity correlates with model size: larger LLMs show higher memorization on commercial movie datasets, while SF20K-Test remains unaffected, with accuracies consistently low across models.
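
A minimal sketch of the title-only probe, assuming an OpenAI-style chat API: the model receives the movie title and the multiple-choice question but no frames or subtitles, so any accuracy well above chance must come from memorized knowledge about the film. The prompt wording and model name are illustrative, not the paper's exact setup.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def title_only_answer(title: str, question: str, options: list[str]) -> str:
        """Ask the LLM to answer an MCQA item from the title alone (no video)."""
        choices = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
        prompt = (
            f'The question is about the movie "{title}".\n'
            f"{question}\n{choices}\n"
            "Reply with the index of the correct option only."
        )
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative; the paper reports GPT-4 results
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()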

(i) Anonymization: We test GPT-4 in three settings: with movie titles, without titles, and with anonymized questions (hiding character names). For MovieQA, removing titles drops performance by 17.4%, and full anonymization reduces it by 39.9%. In contrast, SF20K-Test remains stable, showing that LLMs rely on title recognition for commercial movies but not amateur films.

(ii) Shuffling: Replacing movie titles with random ones significantly drops accuracy on MovieQA, even below the anonymized level, confirming the reliance on title recognition. For SF20K-Test, performance remains unaffected, as LLMs do not recognize amateur films in either case.
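
Both perturbations are simple text edits applied before querying the model; a sketch, assuming the character names and the pool of dataset titles are available:

    import random
    import re

    def anonymize(question: str, character_names: list[str]) -> str:
        """Replace character names with neutral placeholders (P1, P2, ...)."""
        for i, name in enumerate(character_names, start=1):
            question = re.sub(rf"\b{re.escape(name)}\b", f"P{i}", question)
        return question

    def shuffled_title(true_title: str, all_titles: list[str]) -> str:
        """Swap the real movie title for a random other title from the dataset."""
        return random.choice([t for t in all_titles if t != true_title])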

Baselines Performance

We evaluate VLMs on SF20K-Test in a zero-shot video question-answering setting, incorporating subtitles and video frames. Models are tested across three modalities: Vision-Only (V), Language-Only (L), and Vision-Language (VL).
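
As a sketch of this protocol, the three configurations differ only in which inputs reach the model. The helper below assembles one MCQA query; the prompt wording and helper names are illustrative rather than the exact prompts used in the paper.

    def build_inputs(mode: str, frames: list, subtitles: str,
                     question: str, options: list[str]):
        """Assemble the inputs for V, L, or VL evaluation of one MCQA item."""
        assert mode in {"V", "L", "VL"}
        choices = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
        text = ""
        if mode in {"L", "VL"}:                         # language side: subtitles
            text += f"Subtitles:\n{subtitles}\n\n"
        text += f"Question: {question}\n{choices}\nReply with the option index."
        images = frames if mode in {"V", "VL"} else []  # vision side: frames
        return images, text  # fed to the VLM's own chat template downstream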

  • Multiple-Choice QA (MCQA): GPT-4o-mini achieves the highest accuracy (79.4%), approaching human performance (89.8%), while recent open-source models (e.g., Llava-OneVision-7B) close the gap, trailing by just 1.2 percentage points.
  • Open-Ended QA (OEQA): This task proves more challenging, with accuracies ranging from 1.6% to 60.0%. GPT-4o-mini leads again, while the best open-source model, Pixtral (12B), reaches only 49.8%.
  • Modality Analysis: Multimodal performance consistently surpasses unimodal performance, but subtitles dominate: the gap between the language-only and vision-language settings is much smaller than the gap between the vision-only and vision-language settings.

We evaluate VLMs on SF20K-Test-Silent (silent movies) and SF20K-Test-Expert (manually-crafted challenging questions) in an open-ended QA setting.

  • SF20K-Test-Silent: Models show a significant accuracy drop, not exceeding 29.0%, compared to 49.8% on the full SF20K-Test.
  • SF20K-Test-Expert: Similarly, performance declines sharply, leaving room for improvement for future methods.

Temporal Window Study


We confirm the need for long-form video understanding through a temporal window study at shot-, scene-, and movie-level, using FrozenBiLM and LLoVi for MCQA. Shots average 5.52 seconds, scenes aggregate 10 shots, and movies use full videos. Results show that larger temporal windows improve performance: LLoVi's language-only accuracy rises from 38.5% (shot-level) to 64.2% (movie-level), while FrozenBiLM gains 7.3%. These findings emphasize the long-term nature of SF20K-Test's story-level tasks.
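
For concreteness, a sketch of how the three window granularities can be built from a film's shot boundaries; the shot list is assumed to come from an off-the-shelf shot detector, and this is an illustration rather than the paper's exact pipeline.

    def build_windows(shots, level: str, shots_per_scene: int = 10):
        """Group a film's shots into shot-, scene-, or movie-level windows.

        `shots` is a chronological list of (start_sec, end_sec) tuples.
        """
        if level == "shot":
            return list(shots)
        if level == "scene":
            windows = []
            for i in range(0, len(shots), shots_per_scene):
                group = shots[i:i + shots_per_scene]
                windows.append((group[0][0], group[-1][1]))
            return windows
        if level == "movie":
            return [(shots[0][0], shots[-1][1])]
        raise ValueError(f"unknown level: {level}")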

Instruction Tuning

Fine-tuning Llava-OneVision-0.5B on SF20K-Train and evaluating on SF20K-Test yields the following results:

  • Multiple-Choice QA (MCQA): Performance improves significantly, by 25.2%, narrowing the gap with the 7B counterpart.
  • Open-Ended QA (OEQA): The improvement is more modest (+4.0%) but still meaningful.
  • Vision-Only (V): In both settings, the visual modality benefits from instruction tuning, even though the instruction data is generated from subtitles.

Effect of training set size: When fine-tuning with varying numbers of videos, we observe a logarithmic improvement in performance, from 46.5% to 63.0%.
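
As a sketch of the data side of this step, SF20K-Train QA pairs can be converted into LLaVA-style conversation records before fine-tuning; the field names follow the common LLaVA convention, and the paper's exact prompts and training recipe may differ.

    import json

    def to_llava_records(qa_pairs):
        """Convert (video_id, question, answer) triples into LLaVA-style samples."""
        records = []
        for i, (video_id, question, answer) in enumerate(qa_pairs):
            records.append({
                "id": f"sf20k_{i}",
                "video": f"{video_id}.mp4",  # hypothetical file naming
                "conversations": [
                    {"from": "human", "value": f"<video>\n{question}"},
                    {"from": "gpt", "value": answer},
                ],
            })
        return records

    # json.dump(to_llava_records(train_qa), open("sf20k_train_llava.json", "w"))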

Citation


    @article{ghermi2024shortfilmdatasetsfd,
      title={Long Story Short: Story-level Video Understanding from 20K Short Films},
      author={Ridouane Ghermi and Xi Wang and Vicky Kalogeiton and Ivan Laptev},
      journal={arXiv preprint arXiv:2406.10221},
      year={2024}
    }