Video-text retrieval is a challenging task in multimodal information processing due to the semantic gap between modalities. However, most existing methods do not fully mine intra-modal interactions, such as the temporal correlation among video frames, which results in poor matching performance. Additionally, the imbalanced semantic information between videos and texts makes it difficult to align the two modalities. To this end, we propose a dual inter-modal interaction network for video-text retrieval, i.e., DI-vTR. To learn the intra-modal interaction among video frames, we design a context-related video encoder that obtains more fine-grained, content-oriented video representations. We also propose a dual inter-modal interaction module that achieves accurate multilingual alignment between the video and text modalities by introducing multilingual text to strengthen the representation of text semantic features. Extensive experiments on commonly used video-text retrieval datasets, including MSR-VTT, MSVD, and VATEX, show that the proposed method achieves significantly better performance than state-of-the-art methods.
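The abstract only names DI-vTR's components, so the following is a minimal illustrative sketch, not the authors' implementation: it assumes a self-attention encoder over frame features to capture temporal correlation, and a symmetric ("dual") similarity averaged over multilingual text views. All module names, dimensions, and the PyTorch stack are assumptions.

```python
# Hypothetical sketch of a context-related video encoder and a dual
# video<->text interaction; not the DI-vTR architecture itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualVideoEncoder(nn.Module):
    """Models temporal correlation across frames with self-attention."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frame_feats):           # (batch, frames, dim)
        ctx = self.temporal(frame_feats)      # frames attend to each other
        return ctx.mean(dim=1)                # pooled content-oriented feature

def dual_interaction_similarity(video_feat, text_feats):
    """Cosine similarity between a video and several multilingual text
    views, fused by averaging (one hypothetical reading of the dual
    inter-modal interaction described above)."""
    v = F.normalize(video_feat, dim=-1)       # (batch, dim)
    t = F.normalize(text_feats, dim=-1)       # (batch, langs, dim)
    sim = torch.einsum('bd,bld->bl', v, t)    # per-language similarity
    return sim.mean(dim=-1)                   # fuse multilingual views
```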
In the era of big data and the rise of We-Media, single-modality retrieval systems can no longer meet people's information retrieval needs. This paper proposes a new solution to the problem of feature extraction and unified mapping across modalities: a Cross-Modal Hashing retrieval algorithm based on a Deep Residual Network (CMHR-DRN). The model is built in two stages. The first stage extracts features from each modality: a Deep Residual Network (DRN) extracts image features, while TF-IDF combined with a fully connected network extracts text features; the resulting image and text features serve as the input to the second stage. In the second stage, the image and text features are mapped through supervised learning of hash functions into a common binary Hamming space. During this mapping, the distance relationships in the original feature spaces and in the common space are preserved as far as possible to improve cross-modal retrieval accuracy. In training, adaptive moment estimation (Adam) computes an adaptive learning rate for each parameter, and stochastic gradient descent (SGD) is used to minimize the loss function. The whole training process is carried out on the Caffe deep learning framework. Experiments show that the proposed CMHR-DRN algorithm achieves better retrieval performance than other cross-modal algorithms such as CMFH, CMDN, and CMSSH.
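As a rough illustration of the two-stage idea above (and not the authors' Caffe code), the sketch below builds TF-IDF text features, projects them through a fully connected hash layer whose tanh output relaxes the sign function during training, and binarizes the codes into Hamming space at retrieval time. The class names, bit length, example documents, and the PyTorch/scikit-learn stack are all assumptions.

```python
# Minimal sketch of cross-modal hashing on the text side; the image side
# would feed DRN features through an analogous HashHead.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

class HashHead(nn.Module):
    """Maps modality features to K-bit codes; tanh relaxes sign() in training."""
    def __init__(self, in_dim, bits=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, bits)

    def forward(self, x):
        return torch.tanh(self.fc(x))         # in (-1, 1) during training

def to_binary(code):
    return torch.sign(code)                   # {-1, +1} codes at retrieval time

def hamming_distance(a, b):
    # For {-1, +1} codes of length K: d_H = (K - a . b) / 2
    return (a.shape[-1] - a @ b.T) / 2

# Stage 1 (text side): TF-IDF vectors, refined by a fully connected layer.
docs = ["a dog runs on grass", "a cat sleeps on a sofa"]
tfidf = TfidfVectorizer().fit_transform(docs).toarray()
text_feats = torch.tensor(tfidf, dtype=torch.float32)

# Stage 2: project into the common Hamming space and binarize.
text_hash = HashHead(text_feats.shape[1], bits=64)
codes = to_binary(text_hash(text_feats))
print(hamming_distance(codes, codes))         # zeros on the diagonal
```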
Funding: supported by the Key Research and Development Program of Shaanxi (2023-YBGY-218), the National Natural Science Foundation of China under Grants 62372357 and 62201424, and the Fundamental Research Funds for the Central Universities (QTZX23072); also supported by the ISN State Key Laboratory.