Abstract
Gaze target detection aims to locate the target a person is looking at. HGTTR introduced the Transformer architecture to this task, removing the need for the additional head detector that convolutional neural networks require, detecting head position and gaze target simultaneously in an end-to-end manner, and outperforming traditional convolutional neural networks. However, current methods still leave considerable room for improvement on video datasets. The reason is that they learn a person's gaze target from a single video frame and do not model temporal change in the video, so they cannot handle dynamic gaze, out-of-focus frames, or motion blur. When a person's gaze target changes continually, the lack of temporal modeling may cause the predicted gaze target to deviate from the true one. Likewise, without modeling along the temporal dimension, the model cannot compensate for the features lost to defocus and motion blur. In this work, we propose an end-to-end video gaze target detection model based on a spatio-temporal Transformer. First, we propose an inter-frame local deformable attention mechanism to deal with missing features. Second, building on the deformable attention mechanism, we propose an inter-frame deformable attention mechanism that uses the temporal differences between adjacent video frames to dynamically select sampling points, thereby modeling dynamic gaze. Finally, we propose a temporal Transformer to aggregate the gaze relation query vectors and gaze relation features of the current frame and the reference frames. The temporal Transformer consists of three parts: a temporal gaze relation feature encoder that encodes multi-frame spatial information, a temporal gaze relation query encoder that fuses gaze relation queries, and a temporal gaze relation decoder that produces the detection result for the current frame. By modeling spatio-temporal information at three levels (within a single frame, between adjacent frames, and across the frame sequence), the model effectively addresses the dynamic gaze, defocus, and motion blur that are common in video data. Extensive experiments show that our method achieves strong performance on both the VideoAttentionTarget and VideoCoAtt datasets.
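The idea of choosing sampling points from the temporal difference between adjacent frames can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `gain` parameter, the rule that scales fixed offsets by the frame difference, and the uniform averaging are all hypothetical stand-ins for the learned offset and attention-weight predictors that a deformable attention layer would normally use. Feature maps are plain 2D lists to keep the example self-contained.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at fractional (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp so sampling points pushed outside the map remain valid.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def interframe_deformable_sample(cur, ref, query_xy, base_offsets, gain=1.0):
    """Sample the reference frame at points that depend on the frame difference.

    Assumption: a larger temporal difference at the query location suggests
    more motion, so the (hypothetical) fixed base offsets are stretched.
    A real model would predict offsets and weights with learned layers.
    """
    qx, qy = query_xy
    # Temporal difference between current and reference frame at the query.
    diff = abs(bilinear_sample(cur, qx, qy) - bilinear_sample(ref, qx, qy))
    scale = 1.0 + gain * diff  # stand-in for a learned offset predictor
    points = [(qx + dx * scale, qy + dy * scale) for dx, dy in base_offsets]
    samples = [bilinear_sample(ref, px, py) for px, py in points]
    # Uniform average stands in for learned attention weights.
    return sum(samples) / len(samples), points
```

When the two frames agree at the query location, the sampling points stay at the base offsets; when they differ, the points spread further from the query, which is the behavior the sketch is meant to convey.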
Source
Journal of Image and Signal Processing (《图像与信号处理》)
2024, Issue 2, pp. 190-209 (20 pages)