
Retrieve-then-compare mitigates visual hallucination in multi-modal large language models

Figure 3. Visual hallucination analysis results. (A) JSD is employed to measure the dependency of each generated token on the visual input. The JSD values corresponding to articles and prepositions (such as _a and _on) are close to zero, while the JSD value for the erroneous token _green is significantly higher. (B) LLaVA-1.5 can identify accurate visual cues even amid hallucinations, as the visual information contributes a +5.008 confidence score to the accurate candidate _gray. However, the visual branch also mistakenly supports inaccurate candidates (e.g., +3.898 for _green and +4.250 for _brown). Additionally, images with similar semantics and appearance can induce analogous visual hallucinations. For instance, the candidate _green receives high confidence scores (+15.930 and +12.352) in images that do not contain green frogs. JSD: Jensen-Shannon Divergence.

Intelligence & Robotics
ISSN 2770-3541 (Online)

Portico

All published articles are preserved here permanently:

https://www.portico.org/publishers/oae/