Table tennis players need analyses of their opponents’ postures to refine their game strategies, but counting a player’s postures by hand is laborious and time-consuming. Moreover, most existing models are sensor-based.
We built a system that automatically classifies players’ postures (forehand and backhand) from videos of their past games and practice sessions, and then computes the ratio of each posture from the classifiers’ predictions.
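As a minimal sketch of the ratio step, assuming the classifier emits one label per detected stroke, the share of each posture can be computed by counting labels. The function name `posture_ratios` and the sample predictions are hypothetical and only illustrate the calculation.

```python
from collections import Counter


def posture_ratios(predictions):
    """Return the share of each predicted posture label in a sequence."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        return {}  # no strokes detected, so no ratios to report
    return {label: count / total for label, count in counts.items()}


# Hypothetical per-stroke predictions (not data from the project).
predicted = ["forehand", "backhand", "forehand", "forehand"]
ratios = posture_ratios(predicted)
print(ratios)
```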
We recorded 8 videos of different players from the side of the table at 30 fps; the videos range from 20 to 100 seconds in length. The camera moves slightly during each video, and its offset differs between videos.
Training directly on images is not the best approach: the high dimensionality makes it costly and time-consuming, and the models are easily distracted because much of the image content is irrelevant to posture.
Hence, we train models on body keypoints extracted by OpenPose, which reduces both training cost and time and focuses the models on the players’ body motions.
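The sketch below shows one way to turn an OpenPose frame into a feature vector. It relies on the OpenPose JSON output, where each detected person has a flat `pose_keypoints_2d` list of (x, y, confidence) triples. Centering the coordinates on the neck joint (index 1) is an assumption made here to reduce sensitivity to camera offset, not a step described in this project, and `keypoints_to_features` is a hypothetical helper name.

```python
import json


def keypoints_to_features(frame_json, person_index=0, anchor_joint=1):
    """Flatten one person's keypoints into [x0, y0, x1, y1, ...],
    translated so that the anchor joint sits at the origin."""
    frame = json.loads(frame_json)
    people = frame.get("people", [])
    if person_index >= len(people):
        return None  # no such person detected in this frame
    flat = people[person_index]["pose_keypoints_2d"]
    # Split the flat list into (x, y, confidence) triples.
    points = [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]
    ax, ay, _ = points[anchor_joint]
    features = []
    for x, y, conf in points:
        if conf == 0:
            # Undetected joint: keep a neutral placeholder (an assumption).
            features.extend([0.0, 0.0])
        else:
            features.extend([x - ax, y - ay])
    return features


# Toy frame with three keypoints instead of a full skeleton, for illustration.
sample = json.dumps(
    {"people": [{"pose_keypoints_2d": [110.0, 50.0, 0.9, 100.0, 60.0, 0.8, 0.0, 0.0, 0.0]}]}
)
features = keypoints_to_features(sample)
print(features)
```

A real frame would carry a full skeleton per person, so the feature vector is much longer than in this toy input, but it is still far smaller than the raw image.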
We experimented with SVM, CNN, LSTM, and many other models, but the CNN and other models performed poorly. Their accuracies were close to the baseline, and some were even below it.
Finally, we chose two finalists: SVM, which performed best on both training and test data, and LSTM, which performed well on training data but not on test data.
| Model       | Left Model Accuracy | Right Model Accuracy |
|-------------|---------------------|----------------------|
| SVM-RBF     | 89%                 | 75%                  |
| SVM-Sigmoid | 75%                 | 57%                  |
| SVM-Linear  | 82%                 | 95%                  |
| LSTM        | 88%                 | 57%                  |
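A kernel comparison like the one in the table can be sketched with scikit-learn's `SVC`. Everything in this example is an assumption for illustration: the synthetic two-class data stands in for keypoint features, and the feature count, train/test split, and scaling step are not taken from the project. The scores it prints are unrelated to the accuracies reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_svm_kernels(X, y, kernels=("rbf", "sigmoid", "linear"), seed=0):
    """Train one SVM per kernel and return test accuracy for each."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    scores = {}
    for kernel in kernels:
        # Standardize features before the SVM; kernels are scale-sensitive.
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
        model.fit(X_train, y_train)
        scores[kernel] = model.score(X_test, y_test)
    return scores


# Synthetic stand-in for keypoint features: two shifted Gaussian classes.
rng = np.random.default_rng(0)
n_per_class, n_features = 100, 50
X = np.vstack([
    rng.normal(-1.0, 1.0, (n_per_class, n_features)),
    rng.normal(1.0, 1.0, (n_per_class, n_features)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)  # 0 = forehand, 1 = backhand
scores = compare_svm_kernels(X, y)
print(scores)
```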
EfficientNet was proposed by Google AI in 2019. It uses a simple but highly effective compound coefficient to uniformly scale all dimensions of the network: width, depth, and resolution.
Unlike approaches that arbitrarily scale a single dimension of the network, the compound scaling method uniformly scales up all dimensions in a principled way.
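To make compound scaling concrete, the sketch below computes the depth, width, and resolution multipliers for a given compound coefficient `phi`. The base values 1.2, 1.1, and 1.15 are the ones reported in the EfficientNet paper for the B0 baseline; the function names are hypothetical, and released EfficientNet variants use tuned multipliers that only roughly follow this rule.

```python
# Base multipliers for depth, width, and resolution (EfficientNet paper).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return the multipliers obtained by scaling all three dimensions together."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_factor(phi):
    """Approximate FLOPS growth: linear in depth, quadratic in width and resolution."""
    scale = compound_scale(phi)
    return scale["depth"] * scale["width"] ** 2 * scale["resolution"] ** 2


for phi in range(4):
    print(phi, compound_scale(phi), round(flops_factor(phi), 2))
```

Because the bases are chosen so that `ALPHA * BETA**2 * GAMMA**2` is close to 2, each increment of `phi` roughly doubles the compute cost.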
This architecture allows us to use a model pre-trained on a classification task, with a dataset such as ImageNet, as our encoder. Here, we use EfficientNet as the U-Net’s encoder.
We output videos showing the results of the two methods described above.
White points in the background may be detected as balls. To deal with this problem, we restore pixels that are detected as balls in at least 70% of the frames in a video, treating them as background.
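The filter can be sketched as follows, assuming each frame's detection is a binary mask and that "recovering" a pixel means resetting it to background. A real ball moves, so it rarely stays on one pixel, while a static white point is flagged in most frames. The function name and the nested-list mask format are hypothetical.

```python
def suppress_static_detections(masks, threshold=0.7):
    """Clear pixels that are flagged as a ball in at least `threshold` of all frames.

    masks: list of frames, each a 2D list of 0/1 values with identical shape.
    """
    if not masks:
        return []
    n_frames = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    # Count, per pixel, how many frames flag it as a ball.
    hits = [[sum(m[r][c] for m in masks) for c in range(width)] for r in range(height)]
    static = [[hits[r][c] / n_frames >= threshold for c in range(width)] for r in range(height)]
    # Reset static pixels to background in every frame; leave the rest untouched.
    return [
        [[0 if static[r][c] else m[r][c] for c in range(width)] for r in range(height)]
        for m in masks
    ]


# Four 1x3 toy frames: pixel 0 is a static white point, pixels 1 and 2 are ball hits.
masks = [[[1, 1, 0]], [[1, 0, 0]], [[1, 0, 1]], [[1, 0, 0]]]
cleaned = suppress_static_detections(masks)
print(cleaned)
```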
[1] R. Voeikov, N. Falaleev, R. Baikulov. TTNet: Real-time temporal and spatial video analysis of table tennis. CVPR. 2020.
[2] C. B. Lin, Z. Dong, W. K. Kuan, Y. F. Huang. A Framework for Fall Detection Based on OpenPose Skeleton and LSTM/GRU Models. Applied Sciences. 2020.
[3] Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, Y. Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.43, No.1, pp. 172-186, Jan. 1 2021.
[4] C. Sawant. Human activity recognition with openpose and Long Short-Term Memory on real time images. IEEE 5th International Conference for Convergence in Technology (I2CT). 2020.