Funding: This work was supported by the National Natural Science Foundation of China (No. 62176006) and the National Key Research and Development Program of China (No. 2022YFF0902302).
Abstract: In recent years, computing art has developed rapidly through the in-depth cross-disciplinary study of artificial intelligence generated content (AIGC) and the main features of artworks. Audio-visual content generation has gradually been applied to various practical tasks, including video and game scoring, assisting artists in creation, and art education, which demonstrates broad application prospects. In this paper, we introduce innovative achievements in audio-visual content generation from the perspectives of visual art generation and auditory art generation based on artificial intelligence (AI). We outline the development of image and music datasets, visual and auditory content modelling, and related automatic generation systems. The objective and subjective evaluation of generated samples plays an important role in measuring algorithm performance. We present a cogeneration mechanism for audio-visual content in multimodal image-to-music tasks and describe the construction of specific stylized datasets. Many new opportunities and challenges remain in the field of audio-visual synesthesia generation, and we provide a comprehensive discussion of them.
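The abstract does not describe the cogeneration mechanism in detail; the following is a minimal sketch of one plausible image-to-music pipeline, assuming a hypothetical image encoder that produces a shared style/emotion embedding used to condition an autoregressive music-token decoder. All class and parameter names here are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ImageToMusicSketch(nn.Module):
    """Hypothetical sketch: encode an image into a shared style/emotion
    embedding, then condition an autoregressive music-token decoder on it."""
    def __init__(self, embed_dim=256, vocab_size=512, image_feat_dim=2048):
        super().__init__()
        self.image_encoder = nn.Sequential(            # stand-in for a CNN/ViT backbone
            nn.Linear(image_feat_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, vocab_size)   # predicts the next music token

    def forward(self, image_feats, music_tokens):
        style = self.image_encoder(image_feats)        # (B, D) shared embedding
        x = self.token_embed(music_tokens)             # (B, T, D)
        x = x + style.unsqueeze(1)                     # condition every step on the image
        out, _ = self.decoder(x)
        return self.head(out)                          # (B, T, vocab) next-token logits

# toy usage: batch of 2 images (pre-extracted features) and 16 music tokens
model = ImageToMusicSketch()
logits = model(torch.randn(2, 2048), torch.randint(0, 512, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 512])
```

In practice the image branch would be a pretrained vision backbone and the decoder a Transformer over symbolic or codec tokens; the GRU above only keeps the sketch short.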
Funding: Supported by the National Key Research and Development Program of China (No. 2016YFB1001001), the Beijing Natural Science Foundation (No. JQ18017), and the National Natural Science Foundation of China (No. 61976002).
Abstract: Audio-visual learning, aimed at exploiting the relationship between the audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities either to improve the performance of previously considered single-modality tasks or to address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide current audio-visual learning tasks into four subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are further discussed. Finally, we summarize the commonly used datasets and challenges.
Funding: Supported by the National Key R&D Program of China (No. 2020AAA0108904) and the Science and Technology Plan of Shenzhen (No. JCYJ20200109140410340).
Abstract: Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information from lip motion patterns to supplement acoustic speech and improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and have high computational complexity. Further development is hindered by complex multi-person scenarios and the computational limitations of mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. First, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix between audio and visual representations to derive the active speaker representation. Second, knowledge distillation is introduced to transfer knowledge from a large model to the on-device model in order to control the model size. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms previous state-of-the-art methods.
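The abstract does not give the exact formulation of the attention score matrix or of the distillation objective; below is a minimal sketch of one way such a module could weight candidate speakers, together with a standard soft-target distillation loss. Tensor shapes and function names are assumptions, not the paper's actual design.

```python
import torch
import torch.nn.functional as F

def active_speaker_representation(audio_feat, visual_feats):
    """Hypothetical sketch of attention-based audio-visual voice activity detection.

    audio_feat:   (T, D)    frame-level audio representation
    visual_feats: (S, T, D) per-speaker lip-region representations, S candidate speakers
    Returns a (T, D) visual representation weighted toward the active speaker.
    """
    # attention score matrix: similarity between audio frames and each speaker's visual frames
    scores = torch.einsum("td,std->st", audio_feat, visual_feats) / audio_feat.shape[-1] ** 0.5
    weights = F.softmax(scores, dim=0)                        # (S, T) per-frame speaker weights
    return torch.einsum("st,std->td", weights, visual_feats)  # weighted sum over speakers

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard soft-target knowledge distillation loss (KL divergence)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# toy usage: 50 frames, 256-dim features, 3 candidate speakers
audio = torch.randn(50, 256)
visual = torch.randn(3, 50, 256)
print(active_speaker_representation(audio, visual).shape)  # torch.Size([50, 256])
```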
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2020AAA0106200 and the National Natural Science Foundation of China under Grant No. 61832016.
Abstract: In this paper, we present Emotion-Aware Music Driven Movie Montage, a novel paradigm for the challenging task of generating movie montages. Specifically, given a movie and a piece of music as guidance, our method aims to generate a montage out of the movie that is emotionally consistent with the music. Unlike previous work such as video summarization, this task requires not only video content understanding but also emotion analysis of both the input movie and the music. To this end, we propose a two-stage framework, including a learning-based module for the prediction of emotion similarity and an optimization-based module for the selection and composition of candidate movie shots. The core of our method is to align and estimate emotional similarity between music clips and movie shots in a multi-modal latent space via contrastive learning. Subsequently, montage generation is modeled as a joint optimization of emotion similarity and additional constraints such as scene-level story completeness and shot-level rhythm synchronization. We conduct both qualitative and quantitative evaluations to demonstrate that our method can generate emotionally consistent montages and outperforms alternative baselines.
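The contrastive alignment of music clips and movie shots in a shared latent space can be illustrated with a CLIP-style symmetric InfoNCE loss; this is a sketch under that assumption, not the paper's exact objective, and the temperature is a placeholder.

```python
import torch
import torch.nn.functional as F

def emotion_contrastive_loss(music_emb, shot_emb, temperature=0.07):
    """Hypothetical sketch: align music-clip and movie-shot embeddings so that
    emotionally matched pairs sit close in a shared latent space.

    music_emb, shot_emb: (B, D) embeddings of B matched (music clip, movie shot) pairs.
    """
    music = F.normalize(music_emb, dim=-1)
    shot = F.normalize(shot_emb, dim=-1)
    logits = music @ shot.t() / temperature           # (B, B) cosine-similarity matrix
    targets = torch.arange(music.size(0), device=music.device)
    # symmetric loss: matched pairs lie on the diagonal in both directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy usage: 8 matched pairs with 128-dim embeddings
loss = emotion_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```

The subsequent optimization stage would then score candidate shots against each music clip with the learned similarity and add the story-completeness and rhythm-synchronization constraints on top.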
Funding: Supported by the National Natural Science Foundation of China (Nos. 61601028 and 61431007), the Key R&D Program of Guangdong Province of China (No. 2018B030339001), and the National Key R&D Program of China (No. 2017YFB1002505).
Abstract: The N400 is an objective electrophysiological index of semantic processing in the brain. This study focuses on the sensitivity of the N400 effect during speech comprehension under uni- and bi-modal conditions. Varying the signal-to-noise ratio (SNR) of the speech signal under audio-only (A), visual-only (V, i.e., lip-reading), and audio-visual (AV) conditions, a semantic priming paradigm is used to evoke the N400 effect and to measure the speech recognition rate. For the A and high-SNR AV conditions, the N400 amplitudes in the central region are larger; for the V and low-SNR AV conditions, the N400 amplitudes in the left-frontal region are larger. The N400 amplitudes of the frontal and central regions under the A, AV, and V conditions are consistent with the speech recognition rates obtained behaviorally. These results indicate that auditory cognition outperforms visual cognition at high SNR, and visual cognition outperforms auditory cognition at low SNR.
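As a concrete illustration of how a per-condition N400 amplitude could be quantified from epoched EEG, here is a minimal numpy sketch; the 300-500 ms window, channel cluster, and array layout are assumptions, since the abstract does not specify the preprocessing.

```python
import numpy as np

def n400_mean_amplitude(epochs, times, channel_idx, window=(0.300, 0.500)):
    """Mean ERP amplitude in the N400 window for a set of channels.

    epochs:      (n_trials, n_channels, n_samples) baseline-corrected epoched EEG, in microvolts
    times:       (n_samples,) time axis in seconds relative to stimulus onset
    channel_idx: indices of the channels of interest (e.g. a central or left-frontal cluster)
    """
    mask = (times >= window[0]) & (times <= window[1])
    erp = epochs.mean(axis=0)                  # average over trials -> (n_channels, n_samples)
    return erp[channel_idx][:, mask].mean()    # mean over chosen channels and time window

# toy usage: 40 trials, 32 channels, 1 s epoch sampled at 500 Hz
rng = np.random.default_rng(0)
epochs = rng.normal(size=(40, 32, 500))
times = np.linspace(-0.2, 0.8, 500)
print(n400_mean_amplitude(epochs, times, channel_idx=[10, 11, 12]))
```

Comparing this amplitude between condition pairs (e.g. related vs. unrelated primes under A, V, and AV at each SNR) yields the N400 effect reported in the study.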
Abstract: Aim: The aim of this study was to explore patients' preferences for forms of patient education material, including leaflets, podcasts, and videos; that is, to determine what forms of information, besides that provided verbally by healthcare personnel, patients prefer following hospital visits. Methods: The study was a mixed-methods study using a survey design with primarily quantitative items but with a qualitative component. A survey was distributed to patients over 18 years of age between May and July 2020, and 480 patients chose to respond. Results: Text-based patient education material (leaflets) is the form that patients have the most experience with and was preferred by 86.46% of respondents; however, 50.21% and 31.67% of respondents would also like to receive patient education material in video and podcast formats, respectively. Furthermore, several respondents wrote about the need for different forms of patient education material, depending on the subject of the supplementary information. Conclusion: This study provides an overview of patient preferences regarding forms of patient education material. The results show that the majority of respondents prefer to use combinations of written, audio, and video material, thus applying and co-constructing a multimodal communication system from which they select and apply different modes of communication from different sources simultaneously.