Adaptive keyframe selection for continuous sign language recognition
Abstract: Vision-based continuous sign language recognition (CSLR) aims to recognize the sequence of signs in an unsegmented image sequence and offers a convenient assistive tool for sign language users. Most existing CSLR methods extract visual and temporal features frame by frame, and the similar visual content of adjacent frames leads to a large amount of redundant computation. By analyzing how frame rate affects CSLR algorithms, this paper finds that lowering the frame rate significantly improves computational efficiency but also causes some loss in recognition performance. To retain more key sign language information at a reduced frame rate, the paper proposes an adaptive dynamic temporal pooling (ADTP) layer, which dynamically downsamples a sequence based on the self-similarity of its features. It further introduces a two-stage training scheme to make fuller use of the spatiotemporal information available at the original frame rate: the first stage trains only a CSLR model on the original-frame-rate sequences, and that model then serves as the teacher network that guides, through knowledge distillation, the second-stage training of the model containing the ADTP module. Experimental results show that the proposed method substantially reduces the computation required for recognition at the cost of only a small loss in performance. ADTP can also be applied to structural analysis of sign language videos, producing concise and intuitive video summaries.
Authors: Yuecong MIN (闵越聪); Xilin CHEN (陈熙霖) (Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; School of Computing Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)
Source: Scientia Sinica (Informationis) (《中国科学:信息科学》), 2024, No. 4, pp. 893-910 (18 pages). Indexed in CSCD and the Peking University Core Journals list.
Funding: Supported by the National Science and Technology Major Project on New Generation Artificial Intelligence (Grant No. 2021ZD0111900).
Keywords: continuous sign language recognition; time series analysis; visual languages; knowledge distillation; computational efficiency
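
The abstract describes ADTP only at a high level: it downsamples a sequence dynamically according to the self-similarity of its features. The sketch below illustrates that general idea and nothing more. It is a simple greedy, threshold-based stand-in, not the paper's ADTP layer, and the function names, the cosine-similarity measure, the `threshold` parameter, and the mean pooling are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def similarity_pool(features: List[Vector], threshold: float = 0.9) -> List[List[float]]:
    """Greedy similarity-based temporal downsampling.

    Hypothetical illustration, NOT the paper's ADTP: consecutive frames are
    merged into one segment while each new frame stays at least `threshold`
    similar to the frame that opened the segment. Each segment is then
    replaced by the element-wise mean of its frames.
    """
    if not features:
        return []
    segments = [[features[0]]]
    for frame in features[1:]:
        anchor = segments[-1][0]  # first frame of the current segment
        if cosine_similarity(anchor, frame) >= threshold:
            segments[-1].append(frame)  # redundant frame: extend the segment
        else:
            segments.append([frame])  # content changed: open a new segment
    # Mean-pool every segment into a single feature vector.
    return [[sum(col) / len(seg) for col in zip(*seg)] for seg in segments]


# Six toy frame features collapse to three pooled vectors.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
print(similarity_pool(frames, threshold=0.9))
# prints [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

In this toy setting the output length depends on the content of the sequence rather than on a fixed stride, which is the property the abstract attributes to dynamic downsampling. A learned layer would presumably replace the hard threshold with something trainable.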