
Zeyun Zhong

Ph.D. Candidate @ KIT & Fraunhofer IOSB

I am a Ph.D. candidate at the Vision and Fusion Laboratory (IES) of the Karlsruhe Institute of Technology, working in collaboration with the Human-AI Interaction department at Fraunhofer IOSB. I am advised by Prof. Dr. Jürgen Beyerer and Prof. Dr. Juergen Gall.

My research centers on streaming video understanding, action anticipation, and real-time vision-language models, with a focus on models that reason efficiently over long-form video. Recent work includes test-time training for long-context modeling, scalable streaming video narration, and linear-attention memory for online action anticipation.

Earlier, I received the M.Sc. in Mechatronics and Robotics from Leibniz University Hannover (2021, with distinction) and the B.Eng. in Process Engineering from Hannover University of Applied Sciences and Arts (2018).

I am attending ICML 2026 and am on the lookout for 2026 internship / full-time opportunities in video understanding and multimodal learning — feel free to reach out.

News

Apr 30, 2026 FlowNar, our framework for scalable streaming narration of long-form videos, has been accepted to ICML 2026! See you in Seoul, South Korea. 🎉
Apr 13, 2026 Presented our real-time assembly-assistance system (KIMoS project) with Fraunhofer IOSB at Hannover Messe 2026.
Nov 11, 2025 Our work on scalable video action anticipation has been accepted to WACV 2026.
Jun 19, 2024 🥈 Our team won 2nd place (out of 15 teams) in the Ego4D Long-Term Action Anticipation Challenge at CVPR 2024.

Selected Publications

My research develops efficient memory mechanisms for understanding streaming and long-form video. A recurring theme is reformulating linear attention into compact, constant-memory representations: CLAM introduces this as a memory module for action anticipation, FlowNar extends it to real-time narration of arbitrarily long videos, and E²-TTT generalizes the idea to test-time training for long-context modeling. Together, these works pursue models whose memory and computation stay bounded as inputs grow, enabling accurate understanding of arbitrarily long sequences.

  1. Rethinking Expressivity and Efficiency in Test-Time Training
    Zeyun Zhong, Joya Chen, Manuel Martin, Frederik Diederichs, Juergen Gall, and Jürgen Beyerer
    2026
    Under Review
  2. FlowNar: Scalable Streaming Narration for Long-Form Videos
    Zeyun Zhong, Manuel Martin, Chengzhi Wu, David Schneider, Frederik Diederichs, Juergen Gall, and 1 more author
    In International Conference on Machine Learning (ICML), 2026
  3. Scalable Video Action Anticipation with Cross Linear Attentive Memory
    Zeyun Zhong, Manuel Martin, David Schneider, David J Lerch, Chengzhi Wu, Frederik Diederichs, and 2 more authors
    In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
  4. A Survey on Deep Learning Techniques for Action Anticipation
    Zeyun Zhong, Manuel Martin, Michael Voit, Juergen Gall, and Jürgen Beyerer
    Under review at IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
  5. Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation
    Zeyun Zhong*, David Schneider*, Michael Voit, Rainer Stiefelhagen, and Jürgen Beyerer
    In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023