Figure 5

From: Retrieve-then-compare mitigates visual hallucination in multi-modal large language models

Figure 5. Our proposed approach RCD identifies erroneous token candidates and promotes accurate candidates during the inference stage of MLLMs. RCD first retrieves relevant images from a reference database utilizing the retrieval module, and then the MLLM generates distinct predictions for each reference with identical textual prefixes. The predicted confidence scores are then contrasted by RCD's compare module to highlight the accurate candidates and decode the next token. These modules can be seamlessly integrated into existing MLLMs without requiring model retraining. RCD: Retrieval contrastive decoding; MLLMs: multi-modal large language models.
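The retrieve-then-compare flow described in the caption can be illustrated with a minimal sketch. It is not the authors' implementation. The caption does not give the contrast formula, so the rule used here, `(1 + alpha) * log p_test - alpha * mean(log p_ref)`, is a generic contrastive-decoding stand-in chosen for illustration. The names `retrieve`, `compare`, `alpha`, the toy embeddings, and the toy logits are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_emb, database, k=2):
    """Retrieval step (hypothetical): return the k reference entries
    whose embeddings are most similar to the query image embedding."""
    ranked = sorted(
        database,
        key=lambda item: cosine(query_emb, item["emb"]),
        reverse=True,
    )
    return ranked[:k]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def compare(test_logits, ref_logits_list, alpha=1.0):
    """Compare step (assumed rule, not taken from the paper):
    contrast the test-image log-probabilities with the mean
    log-probabilities obtained from the retrieved references,
    all computed with the same textual prefix."""
    test_lp = log_softmax(test_logits)
    ref_lps = [log_softmax(r) for r in ref_logits_list]
    n = len(ref_lps)
    scores = []
    for i, lp in enumerate(test_lp):
        mean_ref = sum(r[i] for r in ref_lps) / n
        scores.append((1 + alpha) * lp - alpha * mean_ref)
    return scores


def next_token(test_logits, ref_logits_list, alpha=1.0):
    """Greedy decoding: pick the candidate with the highest contrasted score."""
    scores = compare(test_logits, ref_logits_list, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage: a three-entry reference database and a three-token vocabulary.
database = [
    {"id": "a", "emb": [1.0, 0.0]},
    {"id": "b", "emb": [0.0, 1.0]},
    {"id": "c", "emb": [0.9, 0.1]},
]
references = retrieve([1.0, 0.0], database, k=2)
token = next_token([2.0, 1.0, 0.0], [[0.0, 2.0, 0.0]])
```

In a real system the logits would come from running the MLLM once per retrieved reference. Because both steps operate only on embeddings and output scores at inference time, no model weights are modified, which matches the caption's point that no retraining is required.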

Intelligence & Robotics

ISSN 2770-3541 (Online)

