Vision-Text Cross-Modal Fusion for Accurate Video Captioning
In this paper, we introduce a novel end-to-end multimodal video captioning framework based on cross-modal fusion of visual and textual data. The proposed approach integrates a modality-attention module, which captures visual-textual inter-modal relationships using cross-correlation. Further, we integrate temporal attention into the features obtained
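As a rough illustration of the kind of cross-modal fusion described above (a minimal sketch, not the paper's actual implementation), text features can attend over visual features via scaled dot-product scores, which act as a cross-correlation between the two modalities; all array shapes and the function name `cross_modal_attention` are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, text):
    """Fuse modalities: each text token attends over all video frames.

    visual: (T_v, d) frame features; text: (T_t, d) token features.
    Returns (T_t, d) text-aligned visual features.
    """
    d = visual.shape[-1]
    # Cross-correlation between modalities, scaled as in dot-product attention.
    scores = text @ visual.T / np.sqrt(d)        # (T_t, T_v)
    weights = softmax(scores, axis=-1)           # attention over frames
    return weights @ visual                      # fused representation

# Toy example with random features (8 frames, 5 tokens, 16-d).
rng = np.random.default_rng(0)
vis = rng.standard_normal((8, 16))
txt = rng.standard_normal((5, 16))
fused = cross_modal_attention(vis, txt)
print(fused.shape)  # (5, 16)
```

In a full captioning model the fused features would then pass through temporal attention and a caption decoder; here only the fusion step is sketched.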