PEEK: Picking Essential frames via Efficient Knowledge distillation
We propose PEEK, a training-efficient frame selection method that distills knowledge from vision-language teachers to identify the most informative frames for video captioning. PEEK achieves competitive captioning performance while processing significantly fewer frames, making dense video understanding more practical at scale.
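
To make the idea concrete, the following is a minimal illustrative sketch, not the PEEK implementation. It assumes (none of this is stated above) that the teacher assigns one relevance score per frame, that a lightweight student is trained to match the teacher's per-frame distribution with a KL-divergence loss, and that the k highest-scoring frames are kept at inference. The function names, the temperature parameter, and the scores are hypothetical.

```python
import math


def log_softmax(scores, temperature=1.0):
    """Numerically stable log-softmax over per-frame scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) between per-frame relevance distributions.

    Assumption: KL divergence is used as the distillation objective;
    the abstract does not specify the loss.
    """
    log_p = log_softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def select_frames(student_scores, k):
    """Return indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(
        range(len(student_scores)),
        key=lambda i: student_scores[i],
        reverse=True,
    )
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical scores for a five-frame clip.
    teacher = [0.2, 2.1, -0.8, 1.7, 0.1]
    student = [0.1, 2.0, -1.0, 1.5, 0.3]
    print("distillation loss:", distillation_loss(teacher, student))
    print("selected frames:", select_frames(student, k=2))
```

Under these assumptions, only the selected frames would be passed to the captioning model, which is where the reduction in processed frames would come from.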