
Overview

In the previous hands-on session, we ran the Gemma 1B fine-tuning experiment on a raw GCP VM. We had to provision the VM, SSH in, set up the environment, and manually manage the entire process. In this session we will run the exact same experiment as a Vertex AI Custom Training Job, where Vertex AI handles all the infrastructure for us. We submit a job spec, Vertex AI provisions the hardware, runs the job, and terminates everything automatically when done.

Learning Objectives

By the end of this session, you will be able to:

Prerequisites


Step 1: Enable the Vertex AI API

Run this from your local machine to enable the Vertex AI API:

export PROJECT_ID=$(gcloud config get-value project)

gcloud services enable aiplatform.googleapis.com \
    --project=${PROJECT_ID}

Verify it has been enabled:

gcloud services list --enabled \
    --filter="name:aiplatform.googleapis.com" \
    --project=${PROJECT_ID}
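If you prefer to check this programmatically, you can parse the JSON output of the same command. A minimal sketch (stdlib only; the `check_enabled` helper is illustrative, and the JSON shape assumed here is what `gcloud services list --format=json` typically emits):

```python
import json
import subprocess

def check_enabled(services: list, api: str) -> bool:
    """Return True if `api` appears in the enabled-services list."""
    return any(
        s.get("config", {}).get("name") == api and s.get("state") == "ENABLED"
        for s in services
    )

def aiplatform_enabled() -> bool:
    # Ask gcloud for the enabled services as JSON (requires gcloud on PATH).
    out = subprocess.run(
        ["gcloud", "services", "list", "--enabled", "--format=json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return check_enabled(json.loads(out), "aiplatform.googleapis.com")

if __name__ == "__main__":
    # Illustrative sample in the same shape gcloud emits:
    sample = [{"config": {"name": "aiplatform.googleapis.com"}, "state": "ENABLED"}]
    print(check_enabled(sample, "aiplatform.googleapis.com"))
```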

Step 2: Grant the Required Permissions

Vertex AI training jobs run under a service account. Grant it the permissions it needs to read and write to GCS.

export SERVICE_ACCOUNT=$(gcloud iam service-accounts list \
    --filter="displayName:Compute Engine default service account" \
    --format="value(email)" \
    --project=${PROJECT_ID})

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="roles/storage.admin"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="roles/aiplatform.user"
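To confirm the bindings took effect, you can inspect the project's IAM policy as JSON. A minimal sketch (stdlib only; `has_role` is an illustrative helper, and the `bindings` shape assumed here is what `gcloud projects get-iam-policy PROJECT_ID --format=json` returns):

```python
import json
import subprocess

def has_role(policy: dict, member: str, role: str) -> bool:
    """Check whether `member` is granted `role` in an IAM policy document."""
    return any(
        b.get("role") == role and member in b.get("members", [])
        for b in policy.get("bindings", [])
    )

def verify_service_account(project_id: str, service_account: str) -> bool:
    # Fetch the live policy with gcloud (requires gcloud on PATH).
    out = subprocess.run(
        ["gcloud", "projects", "get-iam-policy", project_id, "--format=json"],
        capture_output=True, text=True, check=True,
    ).stdout
    policy = json.loads(out)
    member = f"serviceAccount:{service_account}"
    return all(
        has_role(policy, member, role)
        for role in ("roles/storage.admin", "roles/aiplatform.user")
    )

if __name__ == "__main__":
    # Illustrative policy document in the same shape gcloud emits:
    policy = {
        "bindings": [
            {"role": "roles/storage.admin",
             "members": ["serviceAccount:sa@example.iam.gserviceaccount.com"]},
        ]
    }
    print(has_role(policy, "serviceAccount:sa@example.iam.gserviceaccount.com",
                   "roles/storage.admin"))
```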

Step 3: Create the Job Config

Vertex AI Custom Training Jobs are defined by a YAML config file that specifies the machine type, container image, and the command to run. Create a file called vertex_job.yaml in your project folder.

Paste the following code, replacing the placeholder values for the environment variables (YOUR_HF_TOKEN, YOUR_WANDB_API_KEY, YOUR_WANDB_PROJECT, YOUR_GCS_BUCKET_NAME) with the actual values.

workerPoolSpecs:
  - machineSpec:
      machineType: g2-standard-4
      acceleratorType: NVIDIA_L4
      acceleratorCount: 1
    replicaCount: 1
    containerSpec:
      imageUri: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/pytorch-gpu.2-0.py310
      command:
        - bash
        - -c
      args:
        - |
          curl -LsSf https://astral.sh/uv/install.sh | sh &&
          source $HOME/.local/bin/env &&
          git clone https://github.com/rexsimiloluwah/finetuning-gemma-1b-aims-gcp-tutorial.git &&
          cd finetuning-gemma-1b-aims-gcp-tutorial &&
          uv sync &&
          uv run python -m scripts.run_train data.source=gcs data.max_train_samples=10000 training.num_epochs=2 training.batch_size=8 experiment_id=exp_lora_r8_vertex &&
          gsutil -m cp -r outputs/ gs://$BUCKET_NAME/runs/exp_lora_r8_vertex
      env:
        - name: HF_TOKEN
          value: <YOUR_HF_TOKEN>
        - name: WANDB_API_KEY
          value: <YOUR_WANDB_API_KEY>
        - name: WANDB_PROJECT
          value: <YOUR_WANDB_PROJECT>
        - name: BUCKET_NAME
          value: <YOUR_GCS_BUCKET_NAME>

This configuration does the following:

  1. Automatically spins up a Google Cloud VM equipped with an NVIDIA L4 GPU and the necessary CUDA drivers.

  2. Automates environment setup: clones the training code, installs the uv package manager, and installs all Python dependencies.

  3. Runs the Gemma 1B fine-tuning script and automatically persists the resulting model weights and logs to your GCS bucket.
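Because the config embeds secrets (HF_TOKEN, WANDB_API_KEY), you may prefer not to hardcode them in vertex_job.yaml. One option is to keep the `<YOUR_...>` placeholders in a template file and fill them from environment variables just before submission. A minimal sketch (stdlib only; `render_config` and the template filename are illustrative, and the placeholder names match the YAML above):

```python
import os

# Placeholder -> environment variable, matching the YAML above.
PLACEHOLDERS = {
    "<YOUR_HF_TOKEN>": "HF_TOKEN",
    "<YOUR_WANDB_API_KEY>": "WANDB_API_KEY",
    "<YOUR_WANDB_PROJECT>": "WANDB_PROJECT",
    "<YOUR_GCS_BUCKET_NAME>": "BUCKET_NAME",
}

def render_config(template: str, env=os.environ) -> str:
    """Replace each placeholder found in the template with its env var value."""
    for placeholder, var in PLACEHOLDERS.items():
        if placeholder in template:
            value = env.get(var)
            if value is None:
                raise KeyError(f"environment variable {var} is not set")
            template = template.replace(placeholder, value)
    return template

if __name__ == "__main__":
    # Demo with an inline fragment rather than a real file:
    demo = "value: <YOUR_HF_TOKEN>"
    print(render_config(demo, {"HF_TOKEN": "hf_example"}))
```

With the placeholders kept in a template file (e.g. vertex_job.yaml.template), you would read it, render it, and write vertex_job.yaml right before running `gcloud ai custom-jobs create`, so the secrets never land in version control.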

Step 4: Submit the Training Job

This command uses the gcloud CLI to submit a Vertex AI Custom Training Job based on the specifications in vertex_job.yaml.

export REGION="europe-west4" # change to any region you want
export AIMSUSERNAME="<YOUR_AIMS_USERNAME>"
export JOB_NAME="${AIMSUSERNAME}-gemma-finetune-$(date +%Y%m%d%H%M%S)"

gcloud ai custom-jobs create \
    --region=${REGION} \
    --display-name=${JOB_NAME} \
    --config=vertex_job.yaml \
    --project=${PROJECT_ID}

You should see an output confirming that the job was successfully created with a job ID.

Screenshot: Terminal output showing successful Vertex AI training job creation with the job ID

Step 5: Monitor the Job

Check the job status from the CLI:

gcloud ai custom-jobs list \
    --region=${REGION} \
    --project=${PROJECT_ID}

Stream the logs in real time:

# Get the job ID using the display name
JOB_ID=$(gcloud ai custom-jobs list \
    --region=${REGION} \
    --filter="displayName=${JOB_NAME}" \
    --format="value(name.scope(customJobs))")

# Stream the logs using the ID
gcloud ai custom-jobs stream-logs ${JOB_ID} --region=${REGION}

You can also monitor the job from the GCP Console.

  1. Go to the Vertex AI page in your GCP Console.

  2. Navigate to the Custom Jobs tab. Ensure the region you used when submitting the job is selected.

  3. Click on the job you want to monitor.

  4. Click on View logs to view the logs for the training job.

Screenshot: GCP Console showing the custom jobs list with job status

Screenshot: GCP Console showing details for a custom training job with the “View logs” link

Screenshot: GCP Console showing the custom training job logs streaming in real time

As with the VM-based run, you can also monitor training metrics in real time from your WandB dashboard at wandb.ai. You will see train/loss, eval/loss, train/grad_norm, train/learning_rate, and other metrics updating every 10 steps.

Step 6: Verify the Outputs in GCS

Once the job status changes to FINISHED, verify that the outputs were synced to your GCS bucket using the GCP Console.

Screenshot: GCS Bucket in GCP Console showing the synced output folder

Step 7: Resource Cleanup

Vertex AI Custom Training Jobs terminate automatically when the job completes, so there is no VM to delete. The only resource to clean up is the GCS bucket, if you no longer need the outputs:

gcloud storage rm --recursive gs://<YOUR_GCS_BUCKET_NAME>

🔑 Key Takeaways

🚀 What’s Next

In the next hands-on session, we will run the same experiment interactively using Vertex AI Colab Enterprise, a managed notebook environment accessible directly from your browser with no SSH or infrastructure setup required.