
Appendix A: GPU Maths - Choosing the Right GPU for Your Experiment

Before provisioning a GPU VM or submitting a Vertex AI training job, it is worth spending a few minutes estimating whether your chosen GPU can actually fit your experiment. Running out of GPU memory mid-training is one of the most common and frustrating errors in ML, and it is entirely avoidable with some simple upfront calculation.

What Determines GPU Memory Usage?

The GPU memory required for a training run is determined by four main factors:

- Model weights: parameter count multiplied by the bytes per parameter for the chosen precision (dtype)
- Gradients: one value per trainable parameter
- Optimizer states: for Adam, two extra values (first and second moments) per trainable parameter
- Activations: intermediate outputs kept for the backward pass, which scale with batch size and sequence length

Worked Example: Fine-tuning Gemma 3 1B Model with LoRA

Let us walk through how to estimate the GPU memory required for fine-tuning Gemma 3 1B with LoRA in bfloat16, which is exactly the experiment we used in this tutorial.

Step 1: Base model weights

Gemma 3 1B has 1 billion parameters. We load it in bfloat16, where each parameter takes 2 bytes (16 bits = 2 bytes):

1,000,000,000 parameters x 2 bytes = 2,000,000,000 bytes = ~2GB
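This arithmetic generalises to any parameter count and precision. A minimal sketch, where the byte sizes are the standard dtype widths rather than values read from any framework:

```python
# Bytes per parameter for common training/serving dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype="bfloat16"):
    """Approximate GB needed just to hold the model weights."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Gemma 3 1B in bfloat16: 1e9 params x 2 bytes
print(weight_memory_gb(1e9, "bfloat16"))  # -> 2.0
```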

Step 2: LoRA adapter weights

With LoRA rank 8 targeting q_proj and v_proj, we are training roughly 0.1% of the model’s parameters. This adds a negligible amount of memory:

0.1% x 2GB = ~0.002GB

Step 3: Gradients and optimizer states

Since LoRA freezes the base model, gradients and optimizer states only apply to the small set of trainable LoRA parameters, not the full model. This adds roughly:

~0.1GB

Step 4: Activations

Activations scale with batch size and sequence length. With batch size 8 and max length 256:

~0.3GB

Total estimate:

2GB (base model) + 0.002GB (LoRA) + 0.1GB (gradients) + 0.3GB (activations) = ~2.5GB
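The four steps can be bundled into one rough estimator. Note that the gradient/optimizer and activation terms below are the flat figures from this specific setup (LoRA rank 8, batch size 8, max length 256), not a general formula:

```python
def lora_training_memory_gb(
    num_params,
    bytes_per_param=2,        # bfloat16
    lora_fraction=0.001,      # ~0.1% of params are trainable
    grad_optimizer_gb=0.1,    # frozen base model: only LoRA params
    activations_gb=0.3,       # batch 8, max length 256 (figure from the text)
):
    """Sum the four components of the worked example above."""
    base_gb = num_params * bytes_per_param / 1e9
    lora_gb = lora_fraction * base_gb
    return base_gb + lora_gb + grad_optimizer_gb + activations_gb

print(round(lora_training_memory_gb(1e9), 2))  # -> 2.4
```

This prints roughly 2.4 GB, consistent with the ~2.5 GB rounded total above.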

This is why a 24GB NVIDIA L4 GPU is more than sufficient for this experiment. We are only using about 10% of its available VRAM. You could even run it on a smaller 16GB T4 with room to spare.

Note: These are rough estimates. Actual memory usage can vary depending on the framework version, attention implementation, and other factors. It makes sense to run a quick test with a small number of samples before committing to a full training run.

What If You Were to Run This Without LoRA?

For comparison, here is what full fine-tuning of Gemma 3 1B in float32 would cost:

1B parameters x 4 bytes (float32) = 4GB (model weights)
+ 4GB (gradients)
+ 8GB (Adam optimizer states)
+ ~1GB (activations)
= ~17GB total
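The same sums, spelled out as a sanity check (Adam keeps two moment tensors per parameter, hence the 2x term):

```python
num_params = 1e9
weights_gb = num_params * 4 / 1e9   # float32 weights          -> 4 GB
grads_gb = weights_gb               # one gradient per weight  -> 4 GB
adam_gb = 2 * weights_gb            # first + second moments   -> 8 GB
activations_gb = 1.0                # rough figure from the text
total_gb = weights_gb + grads_gb + adam_gb + activations_gb
print(total_gb)  # -> 17.0
```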

This would not fit on a 16GB T4 and would be tight on a 24GB L4. This is exactly why LoRA is so valuable. It reduces the memory requirement from ~17GB to ~2.5GB for the same base model.

Estimating GPU Hours

Once you know your experiment fits in memory, the next question is how long it will take. A simple estimate:

GPU hours = (num_samples x num_epochs) / (steps_per_second x batch_size x 3600)

For our experiment on the L4, we use 3 steps/second as our estimate. This comes directly from the training logs we observed during the hands-on session, where the L4 consistently processed between 2.7 and 3.0 steps per second for this specific workload (Gemma 3 1B, LoRA rank 8, bfloat16, batch size 8, sequence length 256). We round up to 3 for a conservative estimate:

(10,000 samples x 2 epochs) / (3 steps/second x 8 batch size x 3600 seconds)
= 20,000 / 86,400
= ~0.23 hours (~14 minutes)
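The formula is easy to wrap in a helper for comparing scenarios. Here `steps_per_second` is an empirical number taken from your own training logs, as above:

```python
def gpu_hours(num_samples, num_epochs, steps_per_second, batch_size):
    """Estimate wall-clock GPU hours for a training run."""
    total_steps = num_samples * num_epochs / batch_size
    return total_steps / steps_per_second / 3600

hours = gpu_hours(10_000, 2, 3.0, 8)
print(round(hours * 60))  # -> 14 (minutes)
```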

This matches what we observed in practice, where the full training run completed in roughly 15 minutes.

A few things that affect training speed:

- GPU generation and memory bandwidth (an L4 is considerably faster than a T4)
- Batch size and sequence length, which change the work done per step
- Numeric precision (bfloat16 is typically faster than float32)
- Gradient checkpointing, which trades extra compute for lower memory use
- Data loading throughput, which can leave the GPU idle if the input pipeline is slow

Use the GCP Pricing Calculator with your estimated GPU hours to get a cost estimate before starting a run.

Common GCP GPUs and When to Use Them

| GPU | VRAM | Best For |
| --- | --- | --- |
| NVIDIA T4 | 16GB | Small models up to 3B, inference, cost-sensitive runs |
| NVIDIA L4 | 24GB | Models up to 7B with LoRA, good price/performance |
| NVIDIA A100 40GB | 40GB | Models up to 13B full fine-tune, large batch training |
| NVIDIA A100 80GB | 80GB | Models up to 70B with LoRA, large context lengths |
| NVIDIA H100 | 80GB | Largest models, fastest training, highest cost |
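As an illustration only, here is a hypothetical helper that picks the smallest GPU from the table whose VRAM covers an estimated footprint plus a safety margin. The names, VRAM figures, and 1.2x margin are assumptions for this sketch, not a GCP API:

```python
# (name, VRAM in GB), ordered smallest to largest, taken from the table above.
GPUS = [
    ("NVIDIA T4", 16),
    ("NVIDIA L4", 24),
    ("NVIDIA A100 40GB", 40),
    ("NVIDIA A100 80GB", 80),
    ("NVIDIA H100", 80),
]

def smallest_fitting_gpu(required_gb, margin=1.2):
    """Return the first GPU whose VRAM covers required_gb * margin, else None."""
    for name, vram_gb in GPUS:
        if vram_gb >= required_gb * margin:
            return name
    return None

print(smallest_fitting_gpu(2.5))   # LoRA run       -> NVIDIA T4
print(smallest_fitting_gpu(17.0))  # full fine-tune -> NVIDIA L4 (tight fit)
```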

Practical Tips

References and Further Reading