Florence-2 Fine-Tuning for Visual Question Answering
We fine-tuned a 771M-parameter vision-language model on 150K image-question pairs and shipped it live on HuggingFace Spaces.
Model size: 771M params
Training data: 150K Q-pairs
Loss reduction: 64%
Hardware: Colab T4
VQA v2.0 Abstract Scenes, open-ended subset
We started from VQA v2.0's Abstract Scenes split: 50,000 fully coloured cartoon-style scenes, ~3 crowd-annotated questions per image, ~150K image-question pairs total (60K train, 30K val, 60K test). We kept only the open-ended QA subset and dropped multiple-choice questions to match the model's output format.
We built a custom preprocessing pass that joined the raw questions and annotations JSONs on question_id, took the modal-vote answer per question, prefixed each prompt with the <VQA> task token, and emitted clean train / val / test JSONs. Images were resized offline to 224×224 for training, and to 768×768 at inference time.
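The join-and-vote step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the field names follow the public VQA JSON layout (`question_id`, `image_id`, `answers`), while `build_records`, the lowercase/strip normalisation, and the exact way the `<VQA>` token is concatenated with the question are assumptions made for the example.

```python
from collections import Counter

TASK_TOKEN = "<VQA>"  # task token named in the text; concatenation style is assumed


def build_records(questions, annotations):
    """Join questions and annotations on question_id and keep the modal answer."""
    answers_by_qid = {a["question_id"]: a["answers"] for a in annotations}
    records = []
    for q in questions:
        votes = answers_by_qid.get(q["question_id"])
        if not votes:
            continue  # no annotations for this question, so nothing to train on
        # Assumed normalisation: case-insensitive vote counting.
        counts = Counter(v["answer"].strip().lower() for v in votes)
        modal_answer, _ = counts.most_common(1)[0]
        records.append({
            "question_id": q["question_id"],
            "image_id": q["image_id"],
            "prompt": f"{TASK_TOKEN}{q['question']}",
            "answer": modal_answer,
        })
    return records


# Tiny hand-made example, not real dataset rows.
questions = [{"question_id": 1, "image_id": 10, "question": "Is the dog sleeping?"}]
annotations = [{
    "question_id": 1,
    "answers": [{"answer": "yes"}, {"answer": "no"}, {"answer": "Yes"}],
}]
records = build_records(questions, annotations)
```

Ties between equally frequent answers resolve to whichever answer was seen first, which is one arbitrary but deterministic choice.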
What the questions look like before fine-tuning
Before training, we profiled the open-ended questions and crowd-annotated answers to understand the linguistic surface area. The answer-frequency bar below is rebuilt from the notebook output as an interactive chart; the word cloud and question-length histogram stay as raw notebook artifacts.
Training-set EDA
Top 20 most common answers (training)
Each bar is one answer string; the x-axis shows its frequency across the 60,000 open-ended training questions.
How to read it: yes and no (cyan) together carry roughly 24,000 of the 60,000 training questions, about 40 percent of the volume. A model that always guessed yes would already hit about 23 percent accuracy. The real benchmark is what the model does on the long tail past position 5, where colour and object words live.
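The baseline arithmetic above can be reproduced with a few lines. The sketch below uses a toy answer list whose proportions were chosen to mirror the figures quoted in the text (23 percent yes, 40 percent yes-or-no); it is not the real answer distribution, and the helper names are hypothetical.

```python
from collections import Counter


def majority_baseline(answers):
    """Return the most common answer and the accuracy of always guessing it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def coverage(answers, subset):
    """Fraction of answers that fall inside the given set of strings."""
    return sum(a in subset for a in answers) / len(answers)


# Toy data only: 23 "yes", 17 "no", and 60 distinct long-tail answers.
toy_answers = ["yes"] * 23 + ["no"] * 17 + [f"tail_{i}" for i in range(60)]

top, baseline_acc = majority_baseline(toy_answers)
yes_no_share = coverage(toy_answers, {"yes", "no"})
```

Running the same two helpers on the real training answers would give the exact baseline that any fine-tuned model has to beat.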


Three architectures evaluated; Florence-2 won on stability
| Model | Params | Outcome on Colab T4 |
|---|---|---|
| PaliGemma | ~3B | Exceeded GPU memory; prompt-format sensitive. |
| BLIP | ~1.3B | Unstable gradients; kernel crashes during contrastive learning. |
| Florence-2-base-ft | 771M | Stable end-to-end full-parameter fine-tune. |
We loaded microsoft/Florence-2-base-ft at revision refs/pr/6 with trust_remote_code=True. All 771M parameters were trainable; no layers were frozen.
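A loading step along these lines might look like the sketch below. The model ID, revision, and `trust_remote_code=True` come from the text; the use of `AutoModelForCausalLM` and `AutoProcessor`, the helper names, and the `device` default are assumptions, and whether the snippet runs unchanged depends on the installed `transformers` version.

```python
MODEL_ID = "microsoft/Florence-2-base-ft"
REVISION = "refs/pr/6"


def load_florence2(device="cuda"):
    """Load model and processor with every parameter left trainable (assumed setup)."""
    # Imported lazily so the heavy dependency is only needed when loading for real.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, revision=REVISION
    ).to(device)
    processor = AutoProcessor.from_pretrained(
        MODEL_ID, trust_remote_code=True, revision=REVISION
    )
    # Full-parameter fine-tuning: make sure nothing is frozen.
    for param in model.parameters():
        param.requires_grad = True
    return model, processor


def count_trainable(model):
    """Number of parameters that will receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Calling `count_trainable(model)` after loading is a quick way to confirm that the trainable count matches the expected model size.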
64% training-loss reduction over 3 epochs
We optimised with AdamW (learning rate 1e-5, batch size 8), a linear LR schedule, no warmup, and 22,500 total steps over 3 epochs. Training ran on a Google Colab T4 GPU in roughly 9 hours wall clock, with a full validation pass at the end of every epoch.
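The step count and schedule can be checked with plain arithmetic. The sketch below assumes 60,000 training pairs and no gradient accumulation, which is consistent with the 22,500 steps quoted above; `total_steps` and `linear_lr` are illustrative helpers, not the training code itself.

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for the whole run, assuming no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step, total, base_lr=1e-5):
    """Linear decay from base_lr at step 0 to zero at the final step, no warmup."""
    return base_lr * max(0.0, (total - step) / total)


steps = total_steps(num_examples=60_000, batch_size=8, epochs=3)
lr_start = linear_lr(0, steps)
lr_mid = linear_lr(steps // 2, steps)
lr_end = linear_lr(steps, steps)
```

With these settings each epoch is 7,500 steps, and the learning rate has fallen to half its starting value by the middle of epoch 2.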
| Epoch | Train loss | Val loss |
|---|---|---|
| 1 | 0.307 | 0.240 |
| 2 | 0.176 | 0.206 |
| 3 | 0.111 | 0.202 |
Note: only cross-entropy loss is reported. We did not compute test-set accuracy, BLEU, or CIDEr; validation was qualitative, based on manual inspection of the model's outputs on the validation set.
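The headline 64% figure follows directly from the table. The short sketch below recomputes it and, as an extra check, the same ratio for validation loss and the per-epoch gap between the two curves; the variable and function names are illustrative.

```python
# Loss values copied from the table above.
history = [
    {"epoch": 1, "train": 0.307, "val": 0.240},
    {"epoch": 2, "train": 0.176, "val": 0.206},
    {"epoch": 3, "train": 0.111, "val": 0.202},
]


def relative_reduction(first, last):
    """Fractional drop from the first value to the last."""
    return (first - last) / first


train_drop = relative_reduction(history[0]["train"], history[-1]["train"])
val_drop = relative_reduction(history[0]["val"], history[-1]["val"])
# Positive gap means validation loss is higher than training loss.
gaps = [round(h["val"] - h["train"], 3) for h in history]
```

Training loss falls by about 64% while validation loss falls by about 16%, and the validation-minus-training gap grows from 0.030 at epoch 2 to 0.091 at epoch 3.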
Try the fine-tuned model on HuggingFace Spaces
The Space below runs our fine-tuned model with configurable beam search and a per-answer confidence score. HuggingFace pauses idle Spaces, so the first request may take 30 to 60 seconds to wake the container.
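How the Space derives its confidence score is not described here. One common choice, shown below purely as an assumption, is the geometric mean of the per-token probabilities of each beam, that is, the exponential of the length-normalised log-probability. The function names and the example beams are hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of one generated answer (assumed definition)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def rank_beams(beams):
    """Sort (text, token_logprobs) pairs by confidence, highest first."""
    scored = [(text, sequence_confidence(lps)) for text, lps in beams]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up beams for illustration; real values would come from the decoder.
example_beams = [
    ("no", [math.log(0.5)]),
    ("yes", [math.log(0.9), math.log(0.8)]),
]
ranked = rank_beams(example_beams)
```

Normalising by length keeps longer answers from being penalised simply for containing more tokens.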
Frameworks and infrastructure
Source code, technical report, and reproduction guide on GitHub.