parham.
2025 · 30/30 e lode · Live demo

Florence-2 Fine-Tuning for Visual Question Answering

We fine-tuned a 771M-parameter vision-language model on VQA v2.0 Abstract Scenes, a dataset of roughly 150K image-question pairs (60K of them in the training split), and shipped it live on HuggingFace Spaces.

Model size: 771M params
Training data: 150K Q-pairs
Loss reduction: 64%
Hardware: Colab T4

Dataset

VQA v2.0 Abstract Scenes, open-ended subset

We started from VQA v2.0's Abstract Scenes split: 50,000 fully coloured cartoon-style scenes, ~3 crowd-annotated questions per image, ~150K image-question pairs total (60K train, 30K val, 60K test). We kept only the open-ended QA subset and dropped multiple-choice questions to match the model's output format.

We built a custom preprocessing pass that joined the raw questions and annotations JSONs on question_id, took the modal-vote answer per question, prefixed each prompt with the <VQA> task token, and emitted clean train / val / test JSONs. Images were resized offline to 224×224 for training, and to 768×768 at inference time.
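
The join and modal-vote steps described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the project's actual script: the field names (`questions`, `annotations`, `question_id`, `image_id`, `answers`, `answer`) follow the publicly documented VQA JSON layout, while `build_examples`, `modal_answer`, and the output keys are hypothetical names chosen for illustration.

```python
from collections import Counter

TASK_TOKEN = "<VQA>"  # task prefix named in the text


def modal_answer(answers):
    """Return the most frequent answer string among the crowd votes.

    Ties resolve to the answer seen first (Counter keeps insertion order).
    """
    counts = Counter(a["answer"] for a in answers)
    return counts.most_common(1)[0][0]


def build_examples(questions_json, annotations_json):
    """Join questions and annotations on question_id (hypothetical helper)."""
    ann_by_qid = {a["question_id"]: a for a in annotations_json["annotations"]}
    examples = []
    for q in questions_json["questions"]:
        ann = ann_by_qid.get(q["question_id"])
        if ann is None:  # skip questions that have no annotation record
            continue
        examples.append({
            "question_id": q["question_id"],
            "image_id": q["image_id"],
            "prompt": TASK_TOKEN + q["question"],
            "answer": modal_answer(ann["answers"]),
        })
    return examples


# Toy records standing in for the real question and annotation files.
questions = {"questions": [
    {"question_id": 1, "image_id": 10, "question": "What color is the dog?"},
]}
annotations = {"annotations": [
    {"question_id": 1, "answers": [
        {"answer": "brown"}, {"answer": "brown"}, {"answer": "tan"},
    ]},
]}

examples = build_examples(questions, annotations)
print(examples[0]["prompt"], "->", examples[0]["answer"])
```

Indexing the annotations by `question_id` first keeps the join linear in the number of records, which matters little at this scale but avoids a nested scan.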

Data exploration

What the questions look like before fine-tuning

Before training, we profiled the open-ended questions and crowd-annotated answers to understand the linguistic surface area. The answer-frequency bar below is rebuilt from the notebook output as an interactive chart; the word cloud and question-length histogram stay as raw notebook artifacts.

Training-set EDA

Top 20 most common answers (training)

Each bar is one answer string; X is its frequency across 60,000 open-ended questions.

How to read it: yes and no (cyan) together carry roughly 24,000 of the 60,000 training questions, about 40 percent of the volume. A model that always guessed yes would already hit about 23 percent accuracy. The real benchmark is what the model does on the long tail past position 5, where colour and object words live.
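
The always-guess-yes baseline above is just the share of the most common answer. A minimal sketch of that calculation follows; the toy distribution is an assumption scaled to 100 questions so that it mirrors the rough shares quoted in the text, and `majority_baseline` is a hypothetical helper name.

```python
from collections import Counter


def majority_baseline(answers):
    """Return (most common answer, accuracy of always predicting it)."""
    counts = Counter(answers)
    answer, hits = counts.most_common(1)[0]
    return answer, hits / len(answers)


# Illustrative toy data, not the real answer counts:
# 23 "yes", 17 "no", and a long tail of 60 distinct answers.
toy_answers = ["yes"] * 23 + ["no"] * 17 + [f"tail_{i}" for i in range(60)]

answer, accuracy = majority_baseline(toy_answers)
yes_no_share = sum(a in ("yes", "no") for a in toy_answers) / len(toy_answers)
print(answer, accuracy, yes_no_share)
```

Any accuracy figure for the fine-tuned model would need to clear this baseline before it says anything about the long tail.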

Notebook output: word cloud of the most common training question tokens
Training-set question vocabulary across 60,000 open-ended prompts.

What it tells us: the questions cluster around colour, counts, and object identification (color, many, dog, man, woman). The model needs strong colour recognition and basic counting more than abstract reasoning. That insight directly drove the inference-time resize to 768×768, which preserves colour fidelity over raw token count.

Notebook output: histogram of validation question length in word count
Validation question length distribution. Bin width is one word.

How to read it: the mode sits at 5 to 7 words and the tail extends past 15. The model can rarely lean on template matching, because even short prompts are linguistically varied. This is why we used the full Florence-2 tokenizer rather than truncating to a fixed prefix.

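
A question-length histogram with one-word bins can be rebuilt in a few lines. This is a sketch that assumes whitespace tokenisation, which may differ from what the notebook used; the sample questions are made up for illustration.

```python
from collections import Counter


def length_histogram(questions):
    """Count questions per word length (bin width of one word)."""
    return Counter(len(q.split()) for q in questions)


# Made-up sample questions, not drawn from the dataset.
sample = [
    "What color is the dog?",
    "How many birds are in the tree?",
    "Is the man sitting?",
]

hist = length_histogram(sample)
for n_words in sorted(hist):
    print(n_words, hist[n_words])
```
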
Model selection

Three architectures evaluated; Florence-2 won on stability

Model                 Params   Outcome on Colab T4
PaliGemma             ~3B      Exceeded GPU memory; prompt-format sensitive.
BLIP                  ~1.3B    Unstable gradients; kernel crashes during contrastive learning.
Florence-2-base-ft    771M     Stable end-to-end full-parameter fine-tune.

We loaded microsoft/Florence-2-base-ft at revision refs/pr/6 with trust_remote_code=True. All 771M parameters were trainable; no layers were frozen.
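
A loading step along these lines might look like the sketch below. The model ID, revision, and `trust_remote_code=True` come from the text; the choice of `AutoModelForCausalLM` and `AutoProcessor` is an assumption, and `load_model_and_processor` and `count_trainable` are hypothetical helper names.

```python
MODEL_ID = "microsoft/Florence-2-base-ft"
REVISION = "refs/pr/6"


def load_model_and_processor():
    """Load the checkpoint and its processor (downloads weights on first call)."""
    # Imported lazily so this file can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, revision=REVISION
    )
    processor = AutoProcessor.from_pretrained(
        MODEL_ID, trust_remote_code=True, revision=REVISION
    )
    return model, processor


def count_trainable(model):
    """Number of parameters that will receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Calling `count_trainable(model)` right after loading is a quick way to confirm that nothing was frozen by accident before a full-parameter fine-tune.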

Training and results

64% training-loss reduction over 3 epochs

We optimised with AdamW (learning rate 1e-5, batch size 8), a linear LR schedule, no warmup, and 22,500 total steps over 3 epochs. Training ran on a Google Colab T4 GPU in roughly 9 hours wall clock, with a full validation pass at the end of every epoch.
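
The step count follows from the training-set size and batch size, and the schedule can be written as a small function. This is a sketch: the decay-to-zero shape is an assumption about what "linear LR schedule" means here, and `linear_lr` is a hypothetical helper, not the project's code.

```python
TRAIN_EXAMPLES = 60_000  # training pairs, from the dataset section
BATCH_SIZE = 8
EPOCHS = 3
BASE_LR = 1e-5

steps_per_epoch = TRAIN_EXAMPLES // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step, base_lr=BASE_LR, total=total_steps, warmup=0):
    """Linear warmup (none here) followed by linear decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


print(steps_per_epoch, total_steps)
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

With no warmup, the very first update runs at the full learning rate of 1e-5, and the rate falls to half of that at the midpoint of training.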

Epoch   Train loss   Val loss
1       0.307        0.240
2       0.176        0.206
3       0.111        0.202

Note: only cross-entropy loss is reported. No test-set accuracy, BLEU, or CIDEr was computed; validation was qualitative, via manual inspection of validation outputs.
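
The headline loss reduction can be checked directly from the per-epoch figures. The sketch below uses only the reported losses; `relative_reduction` is a hypothetical helper name.

```python
train_loss = {1: 0.307, 2: 0.176, 3: 0.111}
val_loss = {1: 0.240, 2: 0.206, 3: 0.202}


def relative_reduction(first, last):
    """Fractional drop from the first value to the last."""
    return (first - last) / first


train_red = relative_reduction(train_loss[1], train_loss[3])
val_red = relative_reduction(val_loss[1], val_loss[3])
gap = val_loss[3] - train_loss[3]  # train/val gap after the final epoch

print(f"train: {train_red:.1%}  val: {val_red:.1%}  gap: {gap:.3f}")
```

The training loss falls by about 64 percent, while the validation loss falls by roughly 16 percent and barely moves between epochs 2 and 3, so the widening gap is worth keeping in mind when reading the headline figure.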

Live demo

Try the fine-tuned model on HuggingFace Spaces

The Space below runs our fine-tuned model with configurable beam search and a per-answer confidence score. HuggingFace pauses idle Spaces, so the first request may take 30 to 60 seconds to wake the container.

Tech stack

Frameworks and infrastructure

PyTorch · HuggingFace Transformers · Florence-2 · AdamW · Streamlit · HuggingFace Spaces · Google Colab T4

Source code, technical report, and reproduction guide on GitHub.