Publications | Zeyun Zhong

2024

Unsupervised 3D Skeleton-Based Action Recognition Using Cross-Attention With Conditioned Generation Capabilities

David J Lerch*, Zeyun Zhong*, Manuel Martin, Michael Voit, and Jürgen Beyerer

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshop, 2024

Human action recognition plays a pivotal role in various real-world applications, including surveillance systems, robotics, and occupant monitoring in the car interior. With such a diverse range of domains, the demand for generalization becomes increasingly crucial. In this work, we propose a cross-attention-based encoder-decoder approach for unsupervised 3D skeleton-based action recognition. Specifically, our model takes a skeleton sequence as input for the encoder and further applies masking and noise to the original sequence for the decoder. By training the model to reconstruct the original skeleton sequence, it simultaneously learns to capture the underlying patterns of actions. Extensive experiments on NTU and NW-UCLA datasets demonstrate the state-of-the-art performance as well as the impressive generalizability of our proposed approach. Moreover, our experiments reveal that our approach is capable of generating conditioned skeleton sequences, offering the potential to enhance small datasets or generate samples of under-represented classes in imbalanced datasets.

2023

DiffAnt: Diffusion Models for Action Anticipation

Zeyun Zhong, Chengzhi Wu, Manuel Martin, Michael Voit, Juergen Gall, and Jürgen Beyerer

arXiv preprint arXiv:2311.15991, 2023

Anticipating future actions is inherently uncertain. Given an observed video segment containing ongoing actions, multiple subsequent actions can plausibly follow. This uncertainty becomes even larger when predicting far into the future. However, the majority of existing action anticipation models adhere to a deterministic approach, neglecting to account for future uncertainties. In this work, we rethink action anticipation from a generative view, employing diffusion models to capture different possible future actions. In this framework, future actions are iteratively generated from standard Gaussian noise in the latent space, conditioned on the observed video, and subsequently transitioned into the action space. Extensive experiments on four benchmark datasets, i.e., Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action anticipation. Our code and trained models will be published on GitHub.

A Survey on Deep Learning Techniques for Action Anticipation

Zeyun Zhong, Manuel Martin, Michael Voit, Juergen Gall, and Jürgen Beyerer

arXiv preprint arXiv:2309.17257, 2023

The ability to anticipate possible future human actions is essential for a wide range of applications, including autonomous driving and human-robot interaction. Consequently, numerous methods have been introduced for action anticipation in recent years, with deep learning-based approaches being particularly popular. In this work, we review the recent advances of action anticipation algorithms with a particular focus on daily-living scenarios. Additionally, we classify these methods according to their primary contributions and summarize them in tabular form, allowing readers to grasp the details at a glance. Furthermore, we delve into the common evaluation metrics and datasets used for action anticipation and provide future directions with systematic discussions.

Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation

Zeyun Zhong*, David Schneider*, Michael Voit, Rainer Stiefelhagen, and Jürgen Beyerer

In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023

Although human action anticipation is a task which is inherently multi-modal, state-of-the-art methods on well-known action anticipation datasets leverage this data by applying ensemble methods and averaging scores of unimodal anticipation networks. In this work, we introduce transformer-based modality fusion techniques, which unify multi-modal data at an early stage. Our Anticipative Feature Fusion Transformer (AFFT) proves to be superior to popular score fusion approaches and presents state-of-the-art results outperforming previous methods on EpicKitchens-100 and EGTEA Gaze+. Our model is easily extensible and allows for adding new modalities without architectural changes. Consequently, we extracted audio features on EpicKitchens-100 which we add to the set of commonly used features in the community.

Long-term Action Anticipation: A Quick Survey

Zeyun Zhong

In Workshop of Fraunhofer IOSB and Institute for Anthropomatics, Vision and Fusion Laboratory, 2023

2021

Mixed probability models for aleatoric uncertainty estimation in the context of dense stereo matching

Zeyun Zhong and Max Mehltretter

ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2021

The ability to identify erroneous depth estimates is of fundamental interest. Information regarding the aleatoric uncertainty of depth estimates can be, for example, used to support the process of depth reconstruction itself. Consequently, various methods for the estimation of aleatoric uncertainty in the context of dense stereo matching have been presented in recent years, with deep learning-based approaches being particularly popular. Among these deep learning-based methods, probabilistic strategies are increasingly attracting interest, because the estimated uncertainty can be quantified in pixels or in metric units due to the consideration of real error distributions. However, existing probabilistic methods usually assume a unimodal distribution to describe the error distribution while simply neglecting cases in real-world scenarios that could violate this assumption. To overcome this limitation, we propose two novel mixed probability models consisting of Laplacian and Uniform distributions for the task of aleatoric uncertainty estimation. In this way, we explicitly address commonly challenging regions in the context of dense stereo matching and outlier measurements, respectively. To allow a fair comparison, we adapt a common neural network architecture to investigate the effects of the different uncertainty models. In an extensive evaluation using two datasets and two common dense stereo matching methods, the proposed methods demonstrate state-of-the-art accuracy.