Short-Films 20K (SF20K)

Story-level Video Understanding from 20K Short Films

LIX, École Polytechnique, IP Paris · MBZUAI

Abstract

Recent developments in vision-language models have significantly advanced video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example, datasets with instructional and egocentric videos often depict activities of one person in a single scene. Although existing movie datasets offer richer content, they are often limited to short-term tasks, lack publicly available videos, and frequently encounter data leakage issues given the use of subtitles and other information about commercial movies during LLM pretraining. To address the above limitations, we propose Short-Films 20K (SF20K), the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-term video tasks in the form of multiple-choice and open-ended question answering. Our extensive analysis of SF20K reveals minimal data leakage, emphasizes the need for long-term reasoning, and demonstrates the strong performance of recent VLMs. Finally, we show that instruction tuning on the SF20K-Train set substantially improves model performance, paving the way for future progress in long-term video understanding.

Leaderboard

MCQA: Multiple-choice question answering          OEQA: Open-ended question answering

V: Vision-only (video frames)          L: Language-only (subtitles)          VL: Vision-language (video frames + subtitles)

 #   Model              #Params     MCQA (V / L / VL)      OEQA (V / L / VL)
 1   GPT-4o-mini        -             -  /  -   / 79.4       -  /  -   / 60.0
 2   Pixtral            12B         38.5 /  -   / 74.4     29.5 / 49.4 / 49.8
 3   Llama-3.2-Vision   11B         57.9 / 72.3 / 75.0     24.4 / 45.2 / 47.8
 4   Llava-Next-Video   7B          32.8 / 52.3 / 64.2     27.4 / 42.2 / 42.7
 5   Llava-OneVision    7B          63.2 / 74.7 / 78.2     25.1 / 35.3 / 37.7
 6   LongVA             7B          58.5 / 73.8 / 70.0     26.3 / 51.7 / 36.8
 7   Long-LLaVA         7B          52.1 / 70.4 / 72.6     17.4 / 36.7 / 33.5
 8   Llava-OneVision    0.5B        29.2 / 35.4 / 37.8     14.3 / 20.2 / 22.6
 9   MovieChat          7B           8.4 /  6.4 /  8.0     14.0 / 15.7 / 11.8
10   Video-LLaVA        7B          32.4 / 21.3 / 45.7      8.4 / 10.6 /  8.2
11   TimeChat           7B          25.5 /  6.4 / 31.8     26.4 /  9.4 /  5.9
12   mPLUG-Owl2         7B          38.3 / 20.7 / 21.3     22.1 /  1.8 /  1.6
13   FrozenBiLM         0.5B        23.4 / 38.2 / 38.6       -  /  -   /  -

Dataset & Benchmark

Dataset Statistics


Overall Statistics of SF20K


Detailed statistics of each SF20K subset (a hypothetical loading sketch follows the list):

  • SF20K: The full dataset, containing both the training and test sets.
  • SF20K-Test: The test benchmark, featuring manually curated questions generated from movie synopses.
  • SF20K-Test-Silent: A subset of SF20K-Test focused exclusively on silent movies.
  • SF20K-Test-Expert: A subset of SF20K-Test with manually crafted, challenging questions.
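
For orientation, here is a minimal Python loading sketch. It assumes each subset ships as a JSON file of question records with fields such as video_id, question, and options; the file names and field names are hypothetical, so check the released data for the actual schema.

    import json
    from collections import Counter

    # Hypothetical file names and fields; adjust to the released SF20K format.
    SUBSETS = {
        "test": "sf20k_test.json",
        "test_silent": "sf20k_test_silent.json",
        "test_expert": "sf20k_test_expert.json",
    }

    def load_subset(name: str) -> list[dict]:
        """Load one SF20K subset as a list of question records."""
        with open(SUBSETS[name], "r", encoding="utf-8") as f:
            return json.load(f)

    if __name__ == "__main__":
        test = load_subset("test")
        films = {q["video_id"] for q in test}            # hypothetical field
        print(f"{len(test)} questions over {len(films)} films")
        print(Counter(len(q["options"]) for q in test))  # options per question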

Word Cloud Analysis

Comparative word cloud analysis across three video domains: movies (SF20K-Test), instructional videos (iVQA), and egocentric videos (EgoSchema).
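
As an illustration, such word clouds can be produced from the question texts with the standard wordcloud package; the input file and its question field below are assumptions, not the exact files used by the authors.

    import json
    from wordcloud import WordCloud, STOPWORDS

    def question_cloud(path: str, out_png: str) -> None:
        """Render a word cloud from the question texts of one benchmark file."""
        with open(path, "r", encoding="utf-8") as f:
            questions = [q["question"] for q in json.load(f)]  # hypothetical field
        cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS,
                          background_color="white").generate(" ".join(questions))
        cloud.to_file(out_png)

    # e.g. question_cloud("sf20k_test.json", "sf20k_wordcloud.png")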

Comparison to other datasets

Comparison of SF20K with other video question answering datasets.

Experimental Results

Data Leakage

Modern LLMs, pre-trained on vast amounts of internet data, are prone to data leakage: they memorize movie-related information such as synopses, reviews, and scripts instead of analyzing the video content. For example, using only the movie title, GPT-4 achieves 76.0% accuracy on MovieQA and 71.2% on LVU, but only 36.0% on SF20K-Test, which focuses on amateur films with minimal online presence. This makes SF20K-Test a more reliable benchmark for long-term video understanding, as it minimizes biases caused by pretraining exposure. Moreover, leakage severity correlates with model size: larger LLMs show higher memorization on commercial movie datasets, while SF20K-Test remains unaffected, with accuracies consistently low across models.
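
A minimal sketch of the title-only probe, assuming an OpenAI-style chat API: the model receives the movie title and the multiple-choice question but no frames or subtitles, so any accuracy well above chance must come from memorized knowledge about the film. The prompt wording and model name are illustrative, not the paper's exact setup.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def title_only_answer(title: str, question: str, options: list[str]) -> str:
        """Ask the LLM to answer an MCQA item from the title alone (no video)."""
        choices = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
        prompt = (
            f'The question is about the movie "{title}".\n'
            f"{question}\n{choices}\n"
            "Reply with the index of the correct option only."
        )
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative; the paper reports GPT-4 results
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()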

(i) Anonymization: We test GPT-4 in three settings: with movie titles, without titles, and with anonymized questions (hiding character names). For MovieQA, removing titles drops performance by 17.4%, and full anonymization reduces it by 39.9%. In contrast, SF20K-Test remains stable, showing that LLMs rely on title recognition for commercial movies but not amateur films.

(ii) Shuffling: Replacing movie titles with random ones significantly drops accuracy on MovieQA, even below the anonymized level, confirming the reliance on title recognition. For SF20K-Test, performance remains unaffected, as LLMs do not recognize amateur films in either case.
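
Both perturbations are simple text edits applied before querying the model; a sketch, assuming the character names and the pool of dataset titles are available:

    import random
    import re

    def anonymize(question: str, character_names: list[str]) -> str:
        """Replace character names with neutral placeholders (P1, P2, ...)."""
        for i, name in enumerate(character_names, start=1):
            question = re.sub(rf"\b{re.escape(name)}\b", f"P{i}", question)
        return question

    def shuffled_title(true_title: str, all_titles: list[str]) -> str:
        """Swap the real movie title for a random other title from the dataset."""
        return random.choice([t for t in all_titles if t != true_title])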

Baselines Performance

We evaluate VLMs on SF20K-Test in a zero-shot video question-answering setting, incorporating subtitles and video frames. Models are tested across three modalities: Vision-Only (V), Language-Only (L), and Vision-Language (VL).
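
As a sketch of this protocol, the three configurations differ only in which inputs reach the model. The helper below assembles one MCQA query; the prompt wording and helper names are illustrative rather than the exact prompts used in the paper.

    def build_inputs(mode: str, frames: list, subtitles: str,
                     question: str, options: list[str]):
        """Assemble the inputs for V, L, or VL evaluation of one MCQA item."""
        assert mode in {"V", "L", "VL"}
        choices = "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
        text = ""
        if mode in {"L", "VL"}:                         # language side: subtitles
            text += f"Subtitles:\n{subtitles}\n\n"
        text += f"Question: {question}\n{choices}\nReply with the option index."
        images = frames if mode in {"V", "VL"} else []  # vision side: frames
        return images, text  # fed to the VLM's own chat template downstream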

  • Multiple-Choice QA (MCQA): GPT-4o-mini achieves the highest accuracy (79.4%), approaching human performance (89.8%), while recent open-source models (e.g., Llava-OneVision-7B) close the gap, trailing by just 1.2 percentage points.
  • Open-Ended QA (OEQA): This task proves more challenging, with accuracies ranging from 1.6% to 60.0%. GPT-4o-mini leads again, while the best open-source model, Pixtral (12B), reaches only 49.8%.
  • Modality Analysis: Multimodal performance consistently surpasses unimodal performance, but subtitles dominate: the gap between the language-only and vision-language settings is much smaller than the gap between the vision-only and vision-language settings.

We evaluate VLMs on SF20K-Test-Silent (silent movies) and SF20K-Test-Expert (manually-crafted challenging questions) in an open-ended QA setting.

  • SF20K-Test-Silent: Models show a significant accuracy drop, not exceeding 29.0%, compared to 49.8% on the full SF20K-Test.
  • SF20K-Test-Expert: Similarly, performance declines sharply, leaving room for improvement for future methods.

Temporal Window Study


We confirm the need for long-form video understanding through a temporal window study at shot-, scene-, and movie-level, using FrozenBiLM and LLoVi for MCQA. Shots average 5.52 seconds, scenes aggregate 10 shots, and movies use full videos. Results show that larger temporal windows improve performance: LLoVi's language-only accuracy rises from 38.5% (shot-level) to 64.2% (movie-level), while FrozenBiLM gains 7.3%. These findings emphasize the long-term nature of SF20K-Test's story-level tasks.
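
For concreteness, a sketch of how the three window granularities can be built from a film's shot boundaries; the shot list is assumed to come from an off-the-shelf shot detector, and this is an illustration rather than the paper's exact pipeline.

    def build_windows(shots, level: str, shots_per_scene: int = 10):
        """Group a film's shots into shot-, scene-, or movie-level windows.

        `shots` is a chronological list of (start_sec, end_sec) tuples.
        """
        if level == "shot":
            return list(shots)
        if level == "scene":
            windows = []
            for i in range(0, len(shots), shots_per_scene):
                group = shots[i:i + shots_per_scene]
                windows.append((group[0][0], group[-1][1]))
            return windows
        if level == "movie":
            return [(shots[0][0], shots[-1][1])]
        raise ValueError(f"unknown level: {level}")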

Instruction Tuning

Fine-tuning Llava-OneVision-0.5B on SF20K-Train and evaluating on SF20K-Test yields the following results:

  • Multiple-Choice QA (MCQA): Performance improves significantly, by 25.2%, narrowing the gap with the 7B counterpart.
  • Open-Ended QA (OEQA): The improvement is more modest (+4.0%) but still meaningful.
  • Vision-Only (V): In both settings, the visual modality benefits from instruction tuning, even though the instruction data is generated from subtitles.

Effect of training set size: When fine-tuning with varying numbers of videos, we observe a logarithmic improvement in performance, from 46.5% to 63.0%.
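
As a sketch of the data side of this step, SF20K-Train QA pairs can be converted into LLaVA-style conversation records before fine-tuning; the field names follow the common LLaVA convention, and the paper's exact prompts and training recipe may differ.

    import json

    def to_llava_records(qa_pairs):
        """Convert (video_id, question, answer) triples into LLaVA-style samples."""
        records = []
        for i, (video_id, question, answer) in enumerate(qa_pairs):
            records.append({
                "id": f"sf20k_{i}",
                "video": f"{video_id}.mp4",  # hypothetical file naming
                "conversations": [
                    {"from": "human", "value": f"<video>\n{question}"},
                    {"from": "gpt", "value": answer},
                ],
            })
        return records

    # json.dump(to_llava_records(train_qa), open("sf20k_train_llava.json", "w"))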

Citation


    @article{ghermi2024shortfilmdatasetsfd,
      title={Long Story Short: Story-level Video Understanding from 20K Short Films},
      author={Ridouane Ghermi and Xi Wang and Vicky Kalogeiton and Ivan Laptev},
      journal={arXiv preprint arXiv:2406.10221},
      year={2024}
    }