FlowNar: Scalable Streaming Narration for Long-Form Videos

Demo

FlowNar generates dense, temporally grounded narrations for a continuous video stream. VRAM usage stays relatively constant regardless of video length.

FlowNar in action. The model processes an unbounded video stream and generates temporally aligned narrations segment by segment, while its KV-cache memory stays bounded. Unlike prior work, it does not slow down or run out of memory as the video grows longer.

TL;DR — What FlowNar Contributes

Dynamic Context Management (DCM)

After each narration segment, the visual KV cache is pruned entirely. This keeps GPU memory bounded and prevents error propagation from misaligned history narrations into future segments.

O(1) VRAM

Cross-Linear Attention Memory (CLAM)

Historical frames are compressed into a fixed-size set of learnable memory tokens via recurrent cross-attention. Per-step compute and memory stay constant regardless of how many frames have been processed.

Constant per-step compute

Self-Conditioned Evaluation Protocol

Baselines and FlowNar are evaluated using only their own generated history — no oracle text. This honest setting exposes compounding errors that teacher-forcing evaluation hides.

Deployment-realistic

Demonstrated at Scale

Evaluated on Ego4D, EgoExo4D, and EK100. On EK100, FlowNar reduces KV-cache usage by up to 48.3× compared to VideoLLM-Online, while achieving the best temporal alignment F1 scores across all three datasets through DCM.

3 benchmarks

Method

A streaming pipeline that couples a fixed-size recurrent memory with aggressive context pruning to stay efficient at any video length.

Fig. 1 — Overall architecture of FlowNar. Visual tokens from a streaming encoder are routed through CLAM for memory compression and DCM for context management before being passed to the LLM narration head.

Fig. 2 — CLAM recurrent memory update. New visual tokens sequentially update a fixed-size recurrent state, and learnable query tokens read out from the final state to produce fixed-size memory tokens, keeping representation size constant across time steps.
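The recurrent update described in Fig. 2 can be illustrated with a small NumPy sketch. This is not the FlowNar implementation: the gated linear-attention form of the update, the projection matrices `w_k`, `w_v`, `w_g`, and all dimensions are assumptions chosen only to show how a fixed-size state and learnable queries keep the memory shape independent of how many tokens have been seen.

```python
import numpy as np

def clam_update(state, tokens, w_k, w_v, w_g):
    """Fold new visual tokens into a fixed-size state, one token at a time.

    state: (d_k, d_v); tokens: (n, d_model)
    w_k, w_g: (d_model, d_k); w_v: (d_model, d_v)
    The gated outer-product update is an assumed form, not the paper's equation.
    """
    for x in tokens:
        k = x @ w_k                               # key, shape (d_k,)
        v = x @ w_v                               # value, shape (d_v,)
        gate = 1.0 / (1.0 + np.exp(-(x @ w_g)))   # forget gate in (0, 1), shape (d_k,)
        state = gate[:, None] * state + np.outer(k, v)
    return state

def clam_readout(state, queries):
    """Read m memory tokens of width d_v from the state; queries: (m, d_k)."""
    return queries @ state

rng = np.random.default_rng(0)
d_model, d_k, d_v, m = 16, 8, 16, 4               # hypothetical sizes
w_k = rng.normal(size=(d_model, d_k)) * 0.1
w_g = rng.normal(size=(d_model, d_k)) * 0.1
w_v = rng.normal(size=(d_model, d_v)) * 0.1
queries = rng.normal(size=(m, d_k))               # stands in for learned queries

state = np.zeros((d_k, d_v))
shapes = []
for n_tokens in (5, 50, 500):                     # segments of very different lengths
    tokens = rng.normal(size=(n_tokens, d_model))
    state = clam_update(state, tokens, w_k, w_v, w_g)
    shapes.append(clam_readout(state, queries).shape)
print(shapes)                                     # memory shape is (m, d_v) at every step
```

In this sketch the cost of one update scales with the number of new tokens in the segment, but neither the state nor the read-out memory grows with the number of segments already processed.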

Fig. 3 — Illustration of the self-conditioned streaming narration process.

Frame encoding. Each incoming video segment is encoded into visual tokens by a frozen vision encoder.

CLAM memory update. A gated recurrent mechanism sequentially processes the new visual tokens to update a fixed-size memory state. Learnable queries then read from this final state to produce the fixed-size memory tokens — O(1) per step.

LLM narration. The LLM receives current visual tokens + the compact CLAM memory as context and generates the segment narration autoregressively.

DCM pruning. After generation, all visual KV-cache entries are discarded. Only the CLAM memory carries visual information forward — breaking the linear growth of context that limits prior methods.

Key invariant: the KV-cache size at step t depends only on the current segment length, not on t itself. This is what allows FlowNar to process videos of, in principle, unbounded length.
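The four steps and the invariant above can be made concrete with a toy accounting sketch. It counts only visual KV-cache entries and ignores text tokens; the function name `simulate_visual_kv`, the `memory_tokens` value, and the `prune=False` mode are hypothetical. The unpruned mode is a stand-in for a cache that keeps all visual history, not a reimplementation of any baseline.

```python
def simulate_visual_kv(segment_lengths, memory_tokens=8, prune=True):
    """Return the peak number of visual KV entries held at each step.

    segment_lengths: number of visual tokens in each incoming segment.
    memory_tokens:   size of the fixed memory passed to the LLM (assumed value).
    prune:           if True, drop all visual entries after each narration (DCM-style).
    """
    cache = 0
    peaks = []
    for n in segment_lengths:
        cache += n                            # current segment's visual tokens enter the cache
        peaks.append(cache + memory_tokens)   # plus the fixed-size memory tokens
        if prune:
            cache = 0                         # pruning step: no visual entries carried forward
    return peaks

segments = [10] * 5
print(simulate_visual_kv(segments))               # [18, 18, 18, 18, 18]
print(simulate_visual_kv(segments, prune=False))  # [18, 28, 38, 48, 58]
```

With pruning, the peak at step t is the current segment length plus a constant, so it does not depend on t. Without pruning, it grows linearly with the number of segments.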

Efficiency & Scalability

FlowNar's KV-cache footprint stays flat or grows slowly as videos grow longer. Competing methods pay linearly — and eventually run out of memory.

48.3× less KV-cache vs VideoLLM-Online (EK100)

3× higher inference FPS

10× longer videos supported

O(1) memory w.r.t. video length

Fig. 4 — (Left) VRAM usage as video length increases. Ours-C stays flat and Ours grows slowly, while baselines grow without bound. (Right) Throughput on an H100 measured over 10K frames. FlowNar achieves roughly 3× higher FPS than VideoLLM-Online.

Results

Reading note: all numbers below are under the self-conditioned protocol — the model conditions only on its own previously generated narrations, not on oracle ground-truth text. Under this realistic setting, baselines accumulate compounding errors. In contrast, FlowNar mitigates these errors through its DCM and remains robust under self-conditioning.

Fig. 5 — F1 and CIDEr comparison across Ego4D, EgoExo4D, and EPIC-KITCHENS-100 (EK100) under the self-conditioned protocol. FlowNar and FlowNar-C consistently outperform the baselines.

Full results (quantitative tables)

Self-conditioned protocol — the model conditions only on its own previously generated narrations (deployment-realistic). With our dynamic context management (DCM), FlowNar variants consistently demonstrate significant improvements over the baselines.

Teacher-forcing (oracle) protocol — the model is given ground-truth history narrations. FlowNar performs on par with baseline models.
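The difference between the two protocols can be sketched as a single loop that differs only in what gets appended to the history. The names `narrate_stream` and `toy_model` are hypothetical, and the toy model's failure behavior is invented purely to show how an early error can compound under self-conditioning while staying hidden under teacher forcing. It says nothing about how any real model fails.

```python
def narrate_stream(model, segments, oracle_history=None):
    """Generate one narration per segment.

    oracle_history=None  -> self-conditioned: history holds the model's own outputs.
    oracle_history=[...] -> teacher forcing: history holds ground-truth narrations.
    """
    history, outputs = [], []
    for i, segment in enumerate(segments):
        narration = model(segment, list(history))
        outputs.append(narration)
        history.append(narration if oracle_history is None else oracle_history[i])
    return outputs

def toy_model(segment, history):
    # Hypothetical behavior: misreads segment "b", and keeps emitting "?"
    # whenever the most recent history entry is already wrong.
    if history and history[-1] == "?":
        return "?"
    return "?" if segment == "b" else segment

segments = ["a", "b", "c", "d"]
ground_truth = ["a", "b", "c", "d"]

print(narrate_stream(toy_model, segments))                # ['a', '?', '?', '?']
print(narrate_stream(toy_model, segments, ground_truth))  # ['a', '?', 'c', 'd']
```

In the self-conditioned run the single mistake on "b" contaminates every later step, whereas the oracle history resets the context after each segment and masks the propagation.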

Failure Cases

Failure case. The model occasionally hallucinates fine-grained object details or misses brief, fast actions at segment boundaries.

Why failures happen — and what to do about them.
The spatial 3×3 token compression in the vision encoder causes oversmoothing: fine-grained details (small objects, subtle hand motions) are averaged away before reaching the LLM. Hallucinations tend to occur when the visual signal is ambiguous and the model falls back on statistical priors. Future work: higher-resolution token sampling or adaptive spatial pooling that preserves detail for fast or fine-grained segments.

BibTeX

If FlowNar is useful in your research, please consider citing:

@inproceedings{zhong2026flownar,
  title     = {{FlowNar}: Scalable Streaming Narration for Long-Form Videos},
  author    = {Zhong, Zeyun and Martin, Manuel and Wu, Chengzhi and Schneider, David and Diederichs, Frederik and Gall, Juergen and Beyerer, Juergen},
  booktitle = {International Conference on Machine Learning},
  year      = {2026},
  publisher = {PMLR},
  note      = {Accepted, to appear}
}
