Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.
Our approach tackles the core challenge of 3D table tennis analysis: the lack of 3D ground truth in real-world footage. Existing methods trained on synthetic data often fail when applied to noisy, imperfect real-world detections. We propose a robust two-stage pipeline that bridges this gap:
We utilize the Segformer++ architecture to detect the ball and table keypoints in high-resolution images. To train these components, we introduce the TTHQ dataset, which provides abundant 2D supervision. A cross-model filtering strategy effectively removes false positives, such as players' shoes or paddles.
The uplifting network reconstructs the 3D trajectory and spin from the filtered 2D detections. Unlike previous "proof-of-concept" models, we engineer this network for real-world application by introducing a custom Rotary Positional Embedding (RoPE).
This embedding encodes the exact timestamp of each frame, allowing the model to naturally handle varying frame rates and missing detections caused by occlusions. This design enables the back-end to be trained exclusively on physically correct synthetic data while achieving zero-shot generalization to real video.
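To illustrate the idea of conditioning rotary positional embeddings on real timestamps rather than integer indices, here is a minimal NumPy sketch. The function name `timestamp_rope`, the frequency base of 10000, and the interleaved feature layout are our illustrative assumptions, not the paper's exact implementation; the key point is that rotation angles come from continuous frame times, so irregular gaps (e.g. dropped detections) are encoded naturally.

```python
import numpy as np

def timestamp_rope(x, t, base=10000.0):
    """Rotate feature pairs of x by angles derived from continuous timestamps.

    x: (seq_len, dim) token features, dim must be even
    t: (seq_len,) timestamp in seconds per frame (need not be uniformly spaced)
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0
    # One frequency per feature pair, geometrically spaced as in standard RoPE.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = np.outer(t, inv_freq)                     # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    # Apply a 2D rotation to each (even, odd) feature pair.
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Frames at irregular times, e.g. one detection missing between 1/30 s steps:
t = np.array([0.0, 1 / 30, 3 / 30, 4 / 30])
x = np.random.randn(4, 8)
y = timestamp_rope(x, t)
```

Because each pair is rotated by an angle proportional to its timestamp, inner products between two rotated tokens depend only on their time difference, which is what lets a single model handle arbitrary frame rates.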
Our paper will appear at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026.
If you find this paper helpful, please cite it:
@inproceedings{kienzle2026uplifting,
  title={Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation},
  author={Kienzle, Daniel and Ludwig, Katja and Lorenz, Julian and Satoh, {Shin'ichi} and Lienhart, Rainer},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
The structure of this page is adapted from nvlabs.github.io/eg3d, which was published under the Creative Commons CC BY-NC 4.0 license.