
PaliGemma 3B

Google's vision model. Strong at visual QA, captioning, and OCR.

3B parameters · paligemma · gemma · 256-token context · 2.5 GB VRAM

About This Model

PaliGemma 3B is a multimodal image-to-text model developed by Google, designed to generate descriptive text from images. With 3 billion parameters, it strikes a balance between complexity and performance, making it suitable for a wide range of applications such as image captioning, visual question answering, and content generation. The model’s context length of 256 tokens allows it to handle detailed descriptions and complex queries, enhancing its versatility in generating rich, context-aware text.

In its size class, PaliGemma 3B offers a good balance between computational demands and output quality. It is particularly noteworthy for producing high-quality captions and descriptions with low VRAM requirements (about 2.5 GB quantized), making it accessible to users with mid-range GPUs. While it may not outperform larger models in every scenario, its efficiency and effectiveness make it a strong choice for those who need a robust yet lightweight solution. Ideal users include developers, content creators, and researchers looking for a reliable image-to-text model that can be deployed on a variety of hardware, from laptops to more powerful workstations.
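A minimal usage sketch with Hugging Face transformers. The checkpoint name `google/paligemma-3b-mix-224` and the `"caption en"` prompt format are assumptions based on the public PaliGemma release; imports are deferred inside the function so the sketch can be defined without the model downloaded.

```python
def caption(image_path: str, prompt: str = "caption en") -> str:
    """Generate a caption for an image with PaliGemma 3B (sketch).

    Assumes the `transformers` (>= 4.41) and `Pillow` packages are
    installed; imports are deferred so the function can be defined
    without them.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint name
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=50)
    # Decode only the tokens generated after the prompt.
    generated = output[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(generated, skip_special_tokens=True)
```

The same function works for visual question answering by passing a question as the prompt instead of the captioning prefix.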

Check Your Hardware

See which quantizations of PaliGemma 3B your hardware can run.

Quantization Options

Quantization   Bits   File Size   VRAM Needed   RAM Needed   Quality
Q4_K_M         4.5    2 GB        2.5 GB        4 GB         85%
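The figures above can be sanity-checked from the parameter count. A rough weights-only estimate (my own arithmetic, not a vendor number) multiplies parameters by bits per weight; the gap between the ~1.7 GB of weights and the listed 2.5 GB VRAM is runtime overhead such as the KV cache and activations.

```python
# Back-of-envelope memory estimate for a 3B-parameter model.
# Weights only: real usage adds KV cache and runtime overhead.
PARAMS = 3e9

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes occupied by the weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"Q4_K_M (~4.5 bpw): {weight_gb(4.5):.1f} GB")  # ~1.7 GB
print(f"FP16   (16 bpw):   {weight_gb(16):.1f} GB")   # 6.0 GB
```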

Frequently Asked Questions

How much VRAM do I need to run PaliGemma 3B?

PaliGemma 3B requires about 2.5 GB of VRAM with Q4_K_M quantization. Full precision needs substantially more: at FP16, the weights alone occupy roughly 6 GB (3 billion parameters × 2 bytes), before accounting for the KV cache and other runtime overhead.

What is the best quantization for PaliGemma 3B?

Q4_K_M offers the best balance of quality and VRAM usage for most users. Higher-bit quantizations such as Q8_0 are near-lossless if you have the extra VRAM to spare.