My research centers on streaming video understanding, action anticipation, and real-time vision-language models — building models that reason over long-form video efficiently and in real time. Recent work includes test-time training for long-context modeling, scalable streaming video narration, and linear-attention memory for online action anticipation.
Earlier, I received the M.Sc. in Mechatronics and Robotics from Leibniz University Hannover (2021, with distinction) and the B.Eng. in Process Engineering from Hannover University of Applied Sciences and Arts (2018).
I am attending ICML 2026 and am on the lookout for 2026 internship / full-time opportunities in video understanding and multimodal learning — feel free to reach out.
My research develops efficient memory mechanisms for understanding streaming and long-form video. A recurring theme is reformulating linear attention into compact, constant-memory representations: CLAM introduces this as a memory module for action anticipation, FlowNar extends it to real-time narration of arbitrarily long videos, and E²-TTT generalizes the idea to test-time training for long-context modeling. Together, these works pursue models whose memory and computation stay bounded as inputs grow, enabling accurate understanding of arbitrarily long sequences.
Rethinking Expressivity and Efficiency in Test-Time Training
Zeyun Zhong, Joya Chen, Manuel Martin, Frederik Diederichs, Juergen Gall, and Jürgen Beyerer
Test-Time Training (TTT) enables long-context processing via continuous weight updates during inference, but current methods struggle to balance the expressivity of per-token update dynamics with the hardware efficiency of chunk-wise approximations. We propose E²-TTT (Expressive and Efficient TTT) to bridge this gap. By deriving a closed-form state transition that exactly aggregates per-token momentum and decay coefficients within a chunk, E²-TTT enables fully parallelized chunk-level training while preserving the temporal structure of the update rule that prior chunk-wise methods discard. We validate E²-TTT by training models up to 1.3B parameters from scratch. Extensive experiments demonstrate that our method consistently outperforms previous TTT and hybrid attention baselines in language modeling and retrieval, while achieving significantly better extrapolation on the standard “Needle in a Haystack” test, maintaining > 90% accuracy on passkey retrieval at the training context length. Meanwhile, E²-TTT can match the training throughput of efficient chunk-wise methods, demonstrating that it effectively reconciles expressivity with efficiency.
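To make the chunk-level aggregation idea concrete, here is a minimal sketch of how a per-token update with decay and momentum can be folded into a single state transition per chunk. It is a generic illustration and not the E²-TTT derivation: the scalar state, the specific update rule (`m = beta * m + g`, `w = alpha * w - lr * m`), the assumption that the per-token terms `g` are already available for the whole chunk, and all function names are hypothetical choices made for this example.

```python
def sequential_update(w, m, alphas, betas, grads, lr):
    """Reference: apply the (assumed) per-token rule one token at a time."""
    for alpha, beta, g in zip(alphas, betas, grads):
        m = beta * m + g          # momentum with per-token coefficient beta
        w = alpha * w - lr * m    # decayed state update with per-token alpha
    return w, m


def chunk_transition(alphas, betas, grads, lr):
    """Fold a whole chunk into (A, c) such that
    [m_end, w_end] = A @ [m_start, w_start] + c.

    Each token is an affine map on the pair (m, w); composing the maps
    yields one transition for the chunk.
    """
    A = [[1.0, 0.0], [0.0, 1.0]]  # identity: empty chunk changes nothing
    c = [0.0, 0.0]
    for alpha, beta, g in zip(alphas, betas, grads):
        # One token: m' = beta*m + g ; w' = -lr*beta*m + alpha*w - lr*g
        step = [[beta, 0.0], [-lr * beta, alpha]]
        offset = [g, -lr * g]
        A = [
            [sum(step[i][k] * A[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)
        ]
        c = [
            sum(step[i][k] * c[k] for k in range(2)) + offset[i]
            for i in range(2)
        ]
    return A, c


def apply_transition(A, c, w, m):
    """Apply the aggregated chunk transition to the chunk-start state."""
    m_end = A[0][0] * m + A[0][1] * w + c[0]
    w_end = A[1][0] * m + A[1][1] * w + c[1]
    return w_end, m_end
```

Applying `apply_transition(*chunk_transition(...), w, m)` gives the same end-of-chunk state as `sequential_update`. The fold above is written as a loop for clarity; because composition of affine maps is associative, it could in principle be evaluated with a parallel scan.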
FlowNar: Scalable Streaming Narration for Long-Form Videos
Zeyun Zhong, Manuel Martin, Chengzhi Wu, David Schneider, Frederik Diederichs, Juergen Gall, and 1 more author
In International Conference on Machine Learning (ICML), 2026
Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are ill-suited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy for historical visual context removal, combined with our CLAM (Cross Linear Attentive Memory) module for streaming visual history retention, ensuring bounded visual memory usage and computational complexity, crucial for efficient streaming. We also introduce a realistic self-conditioned evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on the Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting processing of 10x longer videos and achieving 3x higher throughput (FPS).
Scalable Video Action Anticipation with Cross Linear Attentive Memory
Zeyun Zhong, Manuel Martin, David Schneider, David J Lerch, Chengzhi Wu, Frederik Diederichs, and 2 more authors
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
Recent advances in action anticipation rely heavily on Transformer architectures to learn discriminative representations of past observations, incurring high computational and memory overhead that limits their applicability to long videos. While temporal processors with linear complexity like RNNs and state-space models offer efficient alternatives, their sequential nature risks overlooking subtle cues in observed frames that could enhance future anticipation. We address this limitation with Cross Linear Attentive Memory (CLAM), a memory module that selectively retrieves complementary context cues from frame features. By reformulating linear attention to replace traditional cross attention, CLAM achieves linear computational complexity and constant memory usage relative to input length. Finally, by fusing the outputs of the temporal processor and CLAM, a non-autoregressive Transformer decoder generates future actions in one shot with high accuracy. Experiments on egocentric (EpicKitchens100 and Ego4D) and third-person (THUMOS14) benchmarks demonstrate our model’s superior anticipation accuracy and scalability, processing longer sequences with significantly less latency growth than alternatives. Our approach also achieves promising results in online action detection.
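The constant-memory property of linear attention can be illustrated with a small sketch. This is generic kernelized linear attention used as a streaming memory, not the CLAM module itself: the `elu(x) + 1` feature map is one common choice, and the class and function names are hypothetical. The point it shows is that the memory holds a fixed-size summary regardless of how many frames have been written.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map often used in linear attention."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionMemory:
    """Fixed-size memory: a (d_k x d_v) matrix S and a d_k normalizer z."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]
        self.z = [0.0] * d_k

    def write(self, key, value):
        """Accumulate one frame; cost and storage do not depend on history."""
        phi_k = feature_map(key)
        for i, k_i in enumerate(phi_k):
            self.z[i] += k_i
            for j, v_j in enumerate(value):
                self.S[i][j] += k_i * v_j

    def read(self, query):
        """Retrieve a normalized weighted average of all written values."""
        phi_q = feature_map(query)
        denom = sum(q * z for q, z in zip(phi_q, self.z))
        if denom == 0.0:
            raise ValueError("memory is empty; write at least one frame first")
        d_v = len(self.S[0])
        return [
            sum(phi_q[i] * self.S[i][j] for i in range(len(phi_q))) / denom
            for j in range(d_v)
        ]


def reference_read(query, keys, values):
    """Same result computed from the full history, for comparison only."""
    phi_q = feature_map(query)
    weights = [
        sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys
    ]
    total = sum(weights)
    d_v = len(values[0])
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(d_v)
    ]
```

`LinearAttentionMemory.read` returns the same vector as `reference_read`, but the latter must keep every key and value, so its storage grows with the number of frames, while the former keeps only `S` and `z`.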
A Survey on Deep Learning Techniques for Action Anticipation
Zeyun Zhong, Manuel Martin, Michael Voit, Juergen Gall, and Jürgen Beyerer
Under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
The ability to anticipate possible future human actions is essential for a wide range of applications, including autonomous driving and human-robot interaction. Consequently, numerous methods have been introduced for action anticipation in recent years, with deep learning-based approaches being particularly popular. In this work, we review the recent advances in action anticipation algorithms with a particular focus on daily-living scenarios. Additionally, we classify these methods according to their primary contributions and summarize them in tabular form, allowing readers to grasp the details at a glance. Furthermore, we delve into the common evaluation metrics and datasets used for action anticipation and provide future directions with a systematic discussion.
Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation
Zeyun Zhong*, David Schneider*, Michael Voit, Rainer Stiefelhagen, and Jürgen Beyerer
In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
Although human action anticipation is an inherently multi-modal task, state-of-the-art methods on well-known action anticipation datasets leverage this data by applying ensemble methods and averaging the scores of unimodal anticipation networks. In this work, we introduce transformer-based modality fusion techniques, which unify multi-modal data at an early stage. Our Anticipative Feature Fusion Transformer (AFFT) proves to be superior to popular score fusion approaches and presents state-of-the-art results, outperforming previous methods on EpicKitchens-100 and EGTEA Gaze+. Our model is easily extensible and allows for adding new modalities without architectural changes. Consequently, we extracted audio features on EpicKitchens-100, which we add to the set of commonly used features in the community.