Figure 1. The overall framework of CMMF-Net, which comprises a ViT Image_Encoder module, a CLIP Text_Encoder module, a cross-modality alignment module, and a U-Net module. CMMF-Net takes image-sentence pairs as input and outputs the colorized image. ViT: vision transformer; CLIP: contrastive language-image pretraining; CMMF-Net: a generative network based on CLIP-guided multi-modal feature fusion for thermal infrared image colorization.