Figure 2. The CI module. The text feature serves as the query (Q), and the image feature or the current layer's output feature serves as the key (K) and value (V) of a multi-head attention operation, which establishes the correspondence between the text and the visual features. CI: Cross-modal interaction.
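The attention pattern described in the caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `cross_modal_attention`, the projection matrices, the feature dimension, and the token counts are all hypothetical, and any residual connections or normalization the actual CI module may use are omitted because the caption does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, image, params, num_heads):
    """Multi-head attention with text as Q and image features as K and V.

    text:  (n_text, d) text features (queries)
    image: (n_img, d) image or current-layer output features (keys, values)
    params: hypothetical projection matrices (w_q, w_k, w_v, w_o), each (d, d)
    """
    w_q, w_k, w_v, w_o = params
    n_text, d = text.shape
    n_img = image.shape[0]
    d_head = d // num_heads

    # Project, then split into heads: (num_heads, tokens, d_head).
    q = (text @ w_q).reshape(n_text, num_heads, d_head).transpose(1, 0, 2)
    k = (image @ w_k).reshape(n_img, num_heads, d_head).transpose(1, 0, 2)
    v = (image @ w_v).reshape(n_img, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product scores: one row per text token, one column per image token.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # (num_heads, n_text, n_img)

    # Weighted sum of image values, then merge heads back to (n_text, d).
    heads = weights @ v
    merged = heads.transpose(1, 0, 2).reshape(n_text, d)
    return merged @ w_o, weights

# Toy inputs (assumed sizes): 3 text tokens, 5 image tokens, 8-dim features.
rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
text = rng.normal(size=(3, d_model))
image = rng.normal(size=(5, d_model))
params = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]

out, weights = cross_modal_attention(text, image, params, num_heads)
```

In this sketch the output has one row per text token, and each row of `weights` is a distribution over image tokens, which is one way to read the "corresponding relationship" mentioned in the caption.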
All published articles are preserved here permanently:
https://www.portico.org/publishers/oae/