Industrial PhD / Computer Vision / Multimodal Learning

Killian Steunou

Building efficient visual systems for the multimodal world.

I am an industrial PhD student at Institut Polytechnique de Paris and Moments Lab, working on efficient omni-modal learning for generalized video understanding.

PhD

Industry and academia, since November 2025

Vision

Video understanding, tracking, representation learning

Open

Reproducible research and transparent engineering

Current thesis

Efficient omni-modal learning for generalized video understanding.

Research themes

Efficiency, multimodal learning, tracking, and deployment-aware model design.

Working style

Scientific rigor, clear writing, reproducible pipelines, and pragmatic engineering.

News

May 2026

New preprint: PEEK — frame selection for efficient video captioning

We open-sourced the code and model weights on Hugging Face and released a live demo you can try in the browser.

arXiv / Paper page / Code / Demo

About

A research practice shaped by efficiency, clarity, and deployment constraints.

I am currently pursuing an industrial PhD in machine learning at Institut Polytechnique de Paris and Moments Lab. The work sits at the intersection of modern multimodal models and the practical limits that define whether they remain useful in the real world.

I am especially drawn to problems in computer vision and deep learning because visual perception feels foundational to intelligence, both human and artificial. I like models that can scale, adapt, and still remain legible enough to improve.

Current focus

Efficient omni-modal learning for generalized video understanding, with an emphasis on scalable training and practical inference.

Research values

Open-source practices, transparent experimentation, and reproducible pipelines that other researchers can meaningfully build on.

Experience

Industrial research, product-minded experimentation, and model-building in production settings.

Download CV

Machine Learning PhD Student

Moments Lab

November 2025 - Present

Researching efficient omni-modal learning strategies for generalized video understanding.
Working on problems where representation quality, inference cost, and deployment realism all matter.

Deep Learning Research Engineer Intern

Idemia

April 2025 - October 2025

Explored how SAM 2 can strengthen end-to-end multi-object tracking systems.
Built segmentation-aware and proposal-aware tracking variants around MOTIP-style models.

AI Research Intern

Collecte Localisation Satellites

April 2024 - August 2024

Developed tooling to fine-tune foundation models on remote sensing datasets.
Benchmarked segmentation performance and summarized the rapidly changing literature.

Machine Learning Engineer Intern

JoliBrain

February 2023 - July 2023

Integrated ControlNet-style controls and zero-shot detection models into joliGEN.
Helped ship documentation and product-facing material around the tool.

Software Developer Intern

French Ministry of Agriculture

May 2022 - August 2022

Created agreste, an R package and R Shiny workflow for statistical publication automation.
Handled the project end to end, from requirements gathering to delivery.

Education

A mathematics and statistics foundation, refined through machine learning and vision research.

Machine Learning PhD

Institut Polytechnique de Paris

November 2025 - Present

Research topic: efficient omni-modal learning for generalized video understanding.

Master 2 Mathématiques, Vision, Apprentissage

ENS Paris-Saclay

September 2024 - March 2025

Optimal transport, convex optimization, deep learning, graphical models, and generative models.

Master 1 Applied Mathematics and Statistics

Toulouse School of Economics

September 2023 - April 2024

Econometrics, probability, optimization for ML, time series, and data science in Python.

Gap Year

University of Copenhagen

September 2022 - January 2023

Natural language processing, blockchain business development, energy economics, and tax policy.

Double Bachelor: Applied Mathematics and Economics

Toulouse School of Economics

September 2019 - April 2022

Linear algebra, analysis, statistics, econometrics, optimization, programming, and economics.

Publications

Preprints and Accepted Papers

2026

arXiv Preprint

PEEK: Picking Essential frames via Efficient Knowledge distillation

Killian Steunou, Anas Filali Razzouki, Mounîm A. El-Yacoubi, Khalil Guetari, Yannis Tevissen

We propose PEEK, a training-efficient frame selection method that distills knowledge from vision-language teachers to identify the most informative frames for video captioning — achieving competitive performance while significantly reducing the number of processed frames.

Video Understanding / Knowledge Distillation / Frame Selection / Efficiency

Project page / arXiv / PDF / Code / Model

2025

arXiv Preprint

Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers

Killian Steunou, Théo Druilhe, Sigurd Saue

We show, theoretically and empirically, that classifiers built on sparse PCA (SPCA) are more robust to adversarial attacks than PCA-based alternatives, providing a new perspective on linear dimensionality reduction as a defense mechanism.

Adversarial Robustness / Sparse Representations / Computer Vision

arXiv / PDF / Code

Projects

Tools, experiments, reproductions, and interfaces that turn research ideas into usable artifacts.

Open source / Generative AI

Contribution to joliGEN

I integrated ControlNet-inspired edge controls and SAM-based masking into joliGEN, then helped improve its documentation.

GitHub

Segmentation / Video

Video Background Removal

A background removal workflow based on MobileSAM that automatically segments video frames and removes their backgrounds.

GitHub / Open demo

Detection / Zero-shot

Video Object Detection

An implementation of OWL-ViT for zero-shot object detection in videos using natural-language prompts.

GitHub

Whisper / Speech

Audio Visual Transcription

A tool for quickly subtitling audio and video content with OpenAI Whisper, exposed through a simple demo interface.

GitHub / Live demo

macOS / Computer vision

Nail Bite Detection App

A macOS menu bar app that detects nail biting from your webcam in real time to make the habit visible and measurable.

GitHub / Website

Research report / Optimal transport

Score-Based Generative Networks for Large-Scale Optimal Transport

I reproduced the SCONES framework and evaluated how score-based modeling changes the behavior of regularized transport on synthetic distributions.

Report / GitHub

Research report / Test-time adaptation

Test-Time Training with Masked Autoencoders

I extended TTT-MAE with an online setting and studied how adaptation behaves when distribution shift keeps evolving at inference time.

Report / GitHub

Research report / 3D vision

An End-to-End Transformer Model for 3D Object Detection

We reproduced 3DETR on SUN RGB-D and explored how a lean transformer detector behaves when extended with RGB information.

Report

Research report / Robustness

Are Generative Classifiers More Robust to Adversarial Attacks?

I revisited the robustness claims around generative classifiers and extended the original setup beyond MNIST to a more realistic dataset.

Report / GitHub

Research report / Applied ML

Toxic Gas Characterization

I studied domain shift caused by humidity changes and combined multi-task learning with adversarial adaptation for more stable gas characterization.

Report / GitHub

Research report / Optimization

Convergence of SGD for Training with Sliced Wasserstein Losses

We verified convergence behavior for sliced Wasserstein training on toy distributions and Fashion-MNIST, including a look at Noise Projected SGD.

Report / GitHub

Education / Visualization

MathViz

A Streamlit application that visualizes ideas from mathematics, statistics, machine learning, and algorithms.

GitHub / Open demo

Graph / Scraping

Wikipedia Graph

A French-language concept graph built by scraping linked Wikipedia topics and exporting the result for graph exploration.

GitHub

Language / Algorithms

Word Ladder Generator

A French word-ladder generator that computes the shortest sequence of one-letter edits between two words.

GitHub

Writing

Long-form notes that turn research trends and evaluation tools into something easier to reason about.

February 2026 / Research analysis

Efficiency Follows Capability: A Decade of Video Understanding Research Trends

An analysis of how efficiency moved from the margins to the center of video understanding research from 2015 to 2025.

Read article

April 2026 / Visual guide

NLP Metrics for Image & Video Captioning: A Visual Guide

A worked visual walkthrough of n-grams, TF-IDF, BLEU, ROUGE-L, METEOR, and CIDEr for captioning research.

Read article

Demos

Interactive spaces for testing models and interfaces beyond the paper.

PEEK — Frame Selection for Video Captioning

Select the most informative frames from a video, distilled from vision-language teachers.

Open demo

Audio Visual Transcription

Whisper-powered subtitling for audio and video with a fast demo workflow.

Open demo

Video Background Removal

Interactive segmentation-based background removal for video.

Open demo

Contact

If you care about efficient vision systems, robust evaluation, or research that has to ship, we should talk.

I am always interested in thoughtful conversations around machine learning research, multimodal systems, visual understanding, and the engineering decisions that make models usable outside the lab.

Video understanding / Computer vision / Multimodal ML / Research collaborations