Multitask Learning for Medical Image Report Generation with Structured Table Integration

Abstract

This thesis investigates whether multitask learning can improve factual grounding in automated radiology report generation with Large Vision-Language Models (LVLMs). We adapt LLaVA-OneVision-7B with QLoRA for chest X-ray report generation, jointly optimizing report generation and an auxiliary disease classification task to test whether explicit pathology recognition improves factual accuracy. Experiments on MIMIC-CXR reveal a critical disconnect between stylistic competence and clinical accuracy. Although the models are fluent in radiological terminology and report structure (ROUGE-L: 0.234-0.255), they exhibit severe factual grounding failures, with RadGraph F1 scores below 0.15. Despite reasonable auxiliary classification performance (macro F1: 0.45-0.46), multitask learning provides no meaningful improvement in report generation quality. Qualitative analysis exposes systematic error patterns: only 8% of generated reports are completely accurate, while 29% contain major finding hallucinations and 35% omit critical abnormalities. These failures persist across configurations, suggesting fundamental architectural limitations rather than parameter or training issues. Our results challenge the common approach of adapting general-purpose vision-language models to medical applications through fine-tuning. The vision-language semantic gap in medical imaging appears more fundamental than previously recognized and is unlikely to be closed by incremental improvements alone. We show that current LVLMs are unsuitable for high-stakes medical applications without substantial modifications and that traditional NLP metrics provide misleadingly optimistic assessments. Although our multitask learning hypothesis was not confirmed, the findings establish realistic expectations for vision-language models in medical applications and highlight the need for specialized architectures designed for clinical accuracy.

Keywords

Artificial Intelligence; Computer Vision; Multitask Learning; Medical Image Report Generation

Citation