Retrieve-then-compare mitigates visual hallucination in multi-modal large language models

Figure 2. Experiment pipeline for investigating the impact of the visual and textual input modalities on hallucinatory output. At each decoding step $t$, the test image $\boldsymbol{v}^{\tau}$ is replaced with alternative images $\boldsymbol{v}'$ while the textual prefix is kept constant. We then assess the difference in output confidence scores (i.e., logits) between $y_t$ and $\hat{y}_t$ to demonstrate the impact of the visual input. The test image is taken from the OpenImages validation set [46]; similar images are retrieved from the COCO Caption dataset [47].
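To make the probe concrete, the comparison step can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a HuggingFace-style multimodal model whose forward pass accepts `input_ids` and `pixel_values` and returns per-position logits, takes the set of retrieved images as given, and reads the caption's comparison as the gap between the top-token logits under the test image and under each alternative image. The function name `logit_gap_at_step` and all argument names are hypothetical.

```python
import torch

@torch.no_grad()
def logit_gap_at_step(model, input_ids, test_image, alt_images):
    """Compare next-token confidence under the test image v^tau vs. each
    alternative image v', holding the textual prefix (input_ids) fixed.

    Assumes a multimodal causal LM with a HuggingFace-style interface and
    batch size 1; both are assumptions, not the paper's setup.
    """
    # Next-token logits at step t with the original test image v^tau.
    logits = model(input_ids=input_ids, pixel_values=test_image).logits[:, -1, :]
    y_t = logits.argmax(dim=-1)  # greedy token y_t under v^tau

    gaps = []
    for v_prime in alt_images:
        # Same textual prefix, alternative image v'.
        alt_logits = model(input_ids=input_ids, pixel_values=v_prime).logits[:, -1, :]
        y_hat_t = alt_logits.argmax(dim=-1)  # greedy token \hat{y}_t under v'

        # One plausible reading of the caption: the difference between the
        # logit of y_t (original image) and the logit of \hat{y}_t (replaced
        # image). A large gap indicates the prediction is driven by the image
        # rather than by the textual prefix alone.
        gap = (logits.gather(-1, y_t[:, None]) -
               alt_logits.gather(-1, y_hat_t[:, None])).item()
        gaps.append(gap)
    return gaps
```

A small gap across many replaced images would suggest the token is being generated from the textual prefix regardless of the visual evidence, which is the hallucination signature the pipeline is designed to expose.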

Intelligence & Robotics
ISSN 2770-3541 (Online)
Follow Us

Portico

All published articles are preserved here permanently:

https://www.portico.org/publishers/oae/

Portico

All published articles are preserved here permanently:

https://www.portico.org/publishers/oae/