
Retrieve-then-compare mitigates visual hallucination in multi-modal large language models

Figure 4. Quantitative results supporting our hypothesis that similar images might induce analogous hallucinations. (A) Experiment pipeline for investigating the characteristics of visual hallucinations. Image-question pairs are randomly selected from the VQAv2 validation set, and similar images are retrieved for each test image. We then independently sample N answers for the test image and for the retrieved images. (B) We evaluate the correctness of each sampled answer and analyze the overlap between the incorrect answers predicted from the test image and those derived from the reference images. If all N answers are correct, the test sample is categorized as "Correct". If at least one of the test image's incorrect answers also appears among the reference answers, the sample is categorized as "Analogous Hallucination". Otherwise, it is categorized as "Exclusive Hallucination". (C) LLaVA-Next exhibits hallucinations in over 70% of the selected samples from VQAv2. Among the samples where hallucinations occur, more than two-thirds show that similar images induce analogous hallucinations. VQAv2: Visual Question Answering version 2.
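
For concreteness, the categorization in panel (B) can be sketched as the following Python function. This is a minimal illustration, not the paper's implementation: the function name, the exact-match correctness check, and the single ground-truth string are simplifying assumptions (VQAv2 scoring actually uses multiple annotator answers and a soft-accuracy rule).

```python
from typing import List, Set


def categorize_sample(test_answers: List[str],
                      reference_answers: List[str],
                      ground_truth: str) -> str:
    """Assign one test sample to a category following Figure 4B.

    test_answers:      the N answers sampled for the test image
    reference_answers: the answers sampled for the retrieved similar images
    ground_truth:      the annotated answer (a single string here for brevity)
    """
    # Incorrect answers produced for the test image.
    wrong_test: Set[str] = {a for a in test_answers if a != ground_truth}
    if not wrong_test:
        # All N sampled answers for the test image are correct.
        return "Correct"

    # Incorrect answers produced for the retrieved (reference) images.
    wrong_refs: Set[str] = {a for a in reference_answers if a != ground_truth}
    if wrong_test & wrong_refs:
        # At least one hallucinated answer for the test image also appears
        # among the hallucinated answers for its similar images.
        return "Analogous Hallucination"

    return "Exclusive Hallucination"


# Hypothetical usage:
# categorize_sample(["cat", "dog"], ["dog", "cat"], ground_truth="cat")
# -> "Analogous Hallucination"  (the wrong answer "dog" is shared)
```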

