Abstract
Anticipating future actions without observing any partial videos of those actions plays an important role in action prediction and is a challenging task. To obtain abundant information for action anticipation, some methods integrate multimodal contexts, including scene object labels. However, extensively labelling each frame in video datasets requires considerable effort. In this paper, we develop a weakly supervised method that integrates global motion and local fine-grained features from current action videos to predict the next action label without the need for specific scene context labels. Specifically, we extract diverse types of local features with weakly supervised learning, including object appearance and human pose representations, without ground-truth annotations. Moreover, we construct a graph convolutional network to exploit the inherent relationships between humans and objects in the present incident. We evaluate the proposed model on two datasets, the MPII-Cooking dataset and the EPIC-Kitchens dataset, and demonstrate the generalizability and effectiveness of our approach for action anticipation.
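To make the graph-convolutional component concrete, the following is a minimal illustrative sketch of one GCN layer over human and object nodes. It is not the authors' actual architecture: the PyTorch framing, feature dimensions, and the fully connected adjacency used in the toy example are all assumptions.

```python
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """One graph-convolution layer over human/object nodes.

    Node features (e.g., pose and object-appearance representations) are
    mean-aggregated through an adjacency matrix encoding human-object
    relations in the current clip, then linearly projected.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_nodes, in_dim)      node features
        # adj: (num_nodes, num_nodes)   relation graph with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)  # node degrees
        x = self.linear(adj @ x / deg)  # aggregate neighbours, then project
        return torch.relu(x)


# Toy usage (hypothetical sizes): 1 human node + 3 object nodes, fully connected.
x = torch.randn(4, 256)
adj = torch.ones(4, 4)
out = GCNLayer(256, 128)(x, adj)  # -> (4, 128) relation-aware node features
```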
Funding
supported partially by the National Natural Science Foundation of China (NSFC) (Grant Nos. U1911401 and U1811461)
Guangdong NSF Project (2020B1515120085, 2018B030312002)
Guangzhou Research Project (201902010037)
Research Projects of Zhejiang Lab (2019KD0AB03)
the Key-Area Research and Development Program of Guangzhou (202007030004).