AI & Technology

How to Set Up Cost-Effective AI Workloads Using Spot Instances on AWS and GCP

May 2 · 7 min read · AI-assisted · human-reviewed

Cloud GPU costs have become one of the biggest line items for AI teams; a large A100 training cluster can easily exceed $100,000 per month. Spot instances—discounted, interruptible compute capacity—offer a way to slash those bills by 60–90%, but only if you design for the risk of abrupt termination. This guide walks through the concrete steps to run training and batch inference on spot instances without losing days of work, covering both AWS EC2 Spot and GCP Preemptible/Spot VMs with real configuration examples and trade-offs.

Understanding Spot Pricing Models and Interruption Risks on AWS vs. GCP

AWS EC2 Spot Instances sell spare capacity at a discount. Historically you bid for it and prices could spike to on-demand rates, but since late 2017 AWS has set spot prices directly based on long-term supply and demand; they change gradually and rarely exceed the on-demand price for most instance types. The trade-off is interruption risk: the AWS Spot Instance Advisor reports interruption frequencies of roughly 5–15% (measured over a month) for popular GPU instances like p3.2xlarge (V100) or g4dn.xlarge (T4).

GCP's older Preemptible VMs carry a fixed discount (up to about 80%) and a hard 24-hour maximum lifetime, after which they are forcibly terminated. Their successor, Spot VMs (available for GPUs like the L4 or A100), have no 24-hour cap but still face preemption whenever capacity is reclaimed, with a 30-second warning. Billing also differs: GCP does not charge for a Spot VM preempted within its first minute, while AWS bills Linux spot instances per second and waives the charge if AWS itself reclaims the instance within its first hour.

For AI workloads, the primary risk is losing hours of training progress. Both providers emit a termination notice—AWS gives 2 minutes via the instance metadata service; GCP gives 30 seconds via a shutdown script trigger. Your checkpoint system must respond within that window.

Choosing the Right GPU Instance Types for Spot Workloads

Not all GPU instances are equally suited for spot usage. Older-generation instances (K80, M60, V100) see lower demand and thus lower interruption rates, but deliver slower throughput. Newer instances (A100, H100, L40S) are in high demand, and interruption frequencies of 20–30% are common during periods of peak demand. For training large models, A100 40GB spot instances on GCP can save around 70% compared to on-demand, but you should plan for interruptions at least twice per 24-hour window.

For inference, use smaller GPU instances like the T4 (g4dn.xlarge on AWS) or L4 (g2-standard-4 on GCP). Capacity for these is plentiful in most regions, yielding spot discounts of 60–80% with low interruption rates (typically under 5%). Real-world example: a summarization pipeline using llama.cpp on a g4dn.xlarge spot instance cost $0.19 per hour versus $0.526 on-demand.

Architecting Checkpoint Systems for Graceful Interruption Recovery

Without a robust checkpoint strategy, a single spot termination can waste hours of compute. The gold standard is asynchronous checkpointing: save model weights and optimizer state to persistent cloud storage (S3 or GCS) every N steps, but do so in a background thread to avoid blocking training.

For PyTorch, use torch.save combined with a separate thread that compresses and uploads the checkpoint. Example parameters: save every 500 steps for a small model (like LLaMA-7B) or every 100 steps for a larger model (like Falcon-40B). Monitor the time cost: a 10GB checkpoint on an AWS g4dn instance uploads to S3 in 40–60 seconds using multipart upload.
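Here is a minimal sketch of that pattern, assuming a boto3 S3 client; the bucket name, key layout, and helper name are illustrative:

import threading
import boto3
import torch

s3 = boto3.client("s3")
BUCKET = "my-training-checkpoints"  # illustrative bucket name

def save_checkpoint_async(model, optimizer, step):
    """Serialize to local disk synchronously (fast), upload in the background."""
    path = f"/tmp/ckpt_{step}.pt"
    torch.save(
        {"step": step,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        path,
    )
    # Background upload so the training loop is not blocked on network I/O.
    # boto3's upload_file switches to multipart upload for large files.
    threading.Thread(
        target=s3.upload_file,
        args=(path, BUCKET, f"ckpt/step_{step}.pt"),
        daemon=True,
    ).start()

# In the training loop:
# if step % 500 == 0:
#     save_checkpoint_async(model, optimizer, step)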

Both AWS and GCP provide lifecycle hooks for termination. On AWS, poll the instance metadata endpoint (http://169.254.169.254/latest/meta-data/spot/termination-time) every 5 seconds; it returns 404 until a termination is scheduled. On GCP, register a shutdown script in the instance metadata; it runs when preemption begins and should force a final checkpoint save within a few seconds, then block until the instance is terminated.
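A minimal watcher for the AWS side might look like the following (this assumes IMDSv1 is enabled; with IMDSv2 you would first fetch a session token). The flag it sets is checked between training steps:

import time
import threading
import requests

TERMINATION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/termination-time"
)
stop_training = threading.Event()

def watch_for_termination(poll_seconds=5):
    """Poll IMDS; the endpoint returns 404 until a termination is scheduled."""
    while not stop_training.is_set():
        try:
            resp = requests.get(TERMINATION_URL, timeout=2)
            if resp.status_code == 200:
                stop_training.set()  # training loop sees this and checkpoints
                return
        except requests.RequestException:
            pass  # transient metadata-service hiccup; keep polling
        time.sleep(poll_seconds)

threading.Thread(target=watch_for_termination, daemon=True).start()

# In the training loop, check the flag between steps:
# if stop_training.is_set():
#     save_checkpoint(model, optimizer, step)  # must finish within ~2 minutes
#     break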

Trade-off: Synchronous vs. Asynchronous Checkpointing

Synchronous checkpointing halts training, causing a 2–5% throughput penalty depending on model size. Asynchronous checkpointing avoids this but risks saving a stale or partially written state if the training process crashes mid-write. For production, use an in-memory buffer that holds the last N saved states and only uploads the most recent consistent snapshot.
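One way to keep the resume point consistent is to pair that buffer with a pointer-object pattern: upload each snapshot under a unique key, and only flip a small LATEST pointer after the upload succeeds. A sketch, with illustrative names:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-training-checkpoints"  # illustrative

def publish_checkpoint(step):
    """Upload the snapshot first, then atomically update a LATEST pointer.

    A resume job reads LATEST and only ever sees fully uploaded snapshots;
    a crash mid-upload leaves the previous pointer intact.
    """
    key = f"ckpt/step_{step}.pt"
    s3.upload_file(f"/tmp/ckpt_{step}.pt", BUCKET, key)
    s3.put_object(
        Bucket=BUCKET,
        Key="ckpt/LATEST",
        Body=json.dumps({"key": key, "step": step}).encode(),
    )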

Real-world team anecdote: a startup fine-tuning Mistral 7B on GCP Spot VMs lost only 4 training steps out of 20,000 over a 3-day run, paying $42 versus $230 on-demand. Their secret: checkpointing every 50 steps, plus a 30-second shutdown script that triggered a final checkpoint flush.

Configuring Auto-Scaling Spot Fleets with Fallback to On-Demand

To guarantee training completion despite interruptions, configure a hybrid fleet: use spot instances as primary compute, with a small pool of on-demand instances as a safety net. AWS Auto Scaling Groups support mixed instances policies where you set a percentage of spot capacity and a fallback to on-demand. For example, create a launch template with the Deep Learning AMI, specify a target of 80% spot and 20% on-demand, and set termination protection only for the on-demand nodes.
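A hedged boto3 sketch of such a mixed-instances group; the launch template, group name, and subnet IDs are placeholders for your own resources:

import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="gpu-training-fleet",       # hypothetical name
    MinSize=0,
    MaxSize=10,
    DesiredCapacity=8,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",       # placeholder subnet IDs
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "dl-ami-gpu",  # your Deep Learning AMI template
                "Version": "$Latest",
            },
            # More instance-type overrides = more spot pools = fewer interruptions.
            "Overrides": [
                {"InstanceType": "g5.xlarge"},
                {"InstanceType": "g4dn.xlarge"},
            ],
        },
        "InstancesDistribution": {
            "OnDemandPercentageAboveBaseCapacity": 20,   # 80% spot / 20% on-demand
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)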

On GCP, use a managed instance group (MIG) with auto-healing enabled, and configure the autoscaler on a custom metric like GPU utilization. When a Spot VM is preempted, the MIG automatically creates a replacement, often within about 30 seconds. MIGs do not natively fall back from Spot to on-demand, so the usual pattern is to pair the Spot MIG with a small on-demand MIG (ideally in a different zone) that scales up whenever the Spot group cannot reach its target size.

Key settings to adjust: on AWS, set the interruption behavior to “terminate” (not “stop”) so capacity is released quickly, and enable Capacity Rebalancing so the fleet proactively replaces instances at elevated risk of interruption. On GCP, batch inference jobs can run on Spot VMs through a scheduler such as Cloud Batch; your job runs whenever capacity is available, with no SLA.

Worst-case scenario: during a capacity crunch (like the 2023 GPU shortage), spot availability for A100s dropped to near zero for three days. Teams without a fallback to on-demand or a third cloud provider (like Azure or Oracle) lost 15–20% of their training capacity. Always have a region redundancy plan.

Optimizing Training Jobs for Frequent Restarts with Elastic Checkpointing

Elastic checkpointing allows a training job to resume on a different instance type or with a different number of GPUs after an interruption. This is critical when spot instances are preempted and the replacement node is a different SKU (e.g., starting on a p3.8xlarge with 4 V100s, resuming on a g5.12xlarge with 4 A10Gs). PyTorch DDP (DistributedDataParallel) tolerates world-size changes as long as you save the model state dict and optimizer state dict separately from the data-parallel configuration.

Practical implementation: when saving a checkpoint, also write a JSON metadata file containing the last step, learning rate, batch size, and seed. On resume, load the model and optimizer states, then reinitialize the distributed process group with the new node’s GPU count. Adjust the per-GPU batch size to match the new GPU memory: the A10G has 24GB of VRAM versus the V100’s 16GB, so you can typically increase the batch size by about 50% without out-of-memory errors.
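A sketch of that resume path, assuming torchrun sets the usual distributed environment variables on the new node; the function names are illustrative:

import json
import torch
import torch.distributed as dist

def save_metadata(path, step, lr, per_gpu_batch, seed):
    with open(path, "w") as f:
        json.dump({"step": step, "lr": lr,
                   "per_gpu_batch": per_gpu_batch, "seed": seed}, f)

def resume(ckpt_path, meta_path, model, optimizer):
    """Reload states, then rebuild the process group for however many GPUs exist now."""
    with open(meta_path) as f:
        meta = json.load(f)
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    # World size comes from the new launcher (torchrun sets the env vars),
    # so the same code runs whether the replacement node has 4 or 8 GPUs.
    dist.init_process_group(backend="nccl")
    return meta["step"], meta["seed"]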

For TensorFlow, use tf.train.CheckpointManager with max_to_keep set to 3 to avoid filling up disk. Checkpoints and SavedModels are portable across instance types: the weights themselves are architecture-agnostic, so a model trained on a Turing or Ampere card loads fine on any GPU your TensorFlow build supports (Volta and newer for recent releases).
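For reference, a minimal CheckpointManager setup (this assumes model and optimizer are already built; the directory is arbitrary):

import tensorflow as tf

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(
    checkpoint, directory="/ckpts", max_to_keep=3)

# Restore the newest checkpoint if one exists (a no-op on a fresh start,
# since latest_checkpoint is None).
checkpoint.restore(manager.latest_checkpoint)

# In the training loop:
# if step % 500 == 0:
#     manager.save(checkpoint_number=step)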

One caution: fp16 or bf16 training may behave slightly differently on different GPU architectures due to minor rounding differences. For scientific reproducibility, store the global seed and non-deterministic algorithm flags in the checkpoint metadata.

Running Batch Inference Jobs on Spot Instances Without Service Interruption

For batch inference (e.g., processing a 2-million-document corpus), spot instances are ideal because latency is not critical and you can retry individual failed requests. The trick is to use a distributed work queue such as SQS, RabbitMQ, or Redis Queue. Each worker process pulls a batch of requests (say, 100 documents), runs inference, and writes results to a shared output bucket (S3 or GCS). If the spot VM dies mid-batch, the unacknowledged work becomes visible again and is picked up by another worker; on SQS this is governed by the visibility timeout, and 10 minutes works well here.
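Here is what a worker loop can look like with SQS, where visibility-timeout semantics are built in; the queue URL and the inference helper are placeholders:

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/inference-batches"  # placeholder

def worker_loop():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,      # long polling
            VisibilityTimeout=600,   # 10 minutes to finish before re-queue
        )
        for msg in resp.get("Messages", []):
            batch = json.loads(msg["Body"])    # e.g., a list of document IDs
            run_inference_and_upload(batch)    # hypothetical: your model + S3/GCS write
            # Delete only after results are safely written; if the spot VM
            # dies before this line, the message reappears after 10 minutes.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])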

For model servers (like vLLM or TGI), spot instances are risky because terminations drop active HTTP connections. If you must use spot for online inference, put the servers behind a load balancer with health checks, and have each node start failing its health check the moment the termination notice appears; the load balancer then stops routing new requests and drains existing connections within the notice window. AWS NLB and GCP’s TCP load balancers both support this kind of connection draining.
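A minimal health endpoint wired to the termination flag from the watcher sketched earlier might look like this (shown with Flask; the route name is arbitrary):

import threading
from flask import Flask

app = Flask(__name__)
terminating = threading.Event()  # set by the metadata watcher shown earlier

@app.route("/healthz")
def healthz():
    # Report unhealthy the moment a termination notice arrives so the
    # load balancer stops routing new requests and drains existing ones.
    if terminating.is_set():
        return "draining", 503
    return "ok", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)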

Cost comparison: a production inference pipeline serving a summarization endpoint on a single g4dn.xlarge spot instance cost $0.19/hour at 10 requests per second (throughput with vLLM), versus $0.526/hour on-demand. Over 30 days, that’s $137 vs. $379. The trade-off: occasional 5-second hiccups during spot terminations, requiring the client to retry once.
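On the client side, a simple retry-with-backoff wrapper is usually enough to hide those hiccups; a sketch, with the endpoint URL assumed to be yours:

import time
import requests

def post_with_retry(url, payload, attempts=3):
    """One retry usually suffices; spot drains complete within ~30 seconds."""
    for i in range(attempts):
        try:
            resp = requests.post(url, json=payload, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if i == attempts - 1:
                raise
            time.sleep(2 ** i)  # 1s, then 2s backoff before retrying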

Real-world case: a healthcare AI company ran their NLP model on 200 spot T4 instances at $0.18/hour each, processing 50 million records in 6 hours for about $216; the same job on on-demand T4 instances (roughly $0.53/hour) would have cost around $630. They lost only 0.02% of records to spot terminations (automatically retried by the queue).

Monitoring and Cost Tracking for Spot-Powered AI Pipelines

You need granular visibility into spot interruption rates and cost savings to justify the operational overhead. On AWS, the EC2 Spot console’s savings summary shows how much of your compute ran on spot and what you saved versus on-demand, and Cost Explorer can break usage down by purchase option. Set up a CloudWatch alarm that fires if the spot interruption rate exceeds 25% in a 6-hour window; that is a signal to switch instance families or regions.
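If you publish such a rate as a custom CloudWatch metric (e.g., from your termination watcher), the alarm itself is a few lines of boto3; the namespace, metric, and SNS topic names are illustrative:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="spot-interruption-rate-high",
    Namespace="TrainingFleet",                 # hypothetical custom namespace
    MetricName="SpotInterruptionRate",         # percent of fleet interrupted
    Statistic="Average",
    Period=3600,                               # 1-hour datapoints...
    EvaluationPeriods=6,                       # ...evaluated over a 6-hour window
    Threshold=25.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)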

On GCP, the Cloud Monitoring dashboard for Compute Engine shows a preemption count per instance group. Create a custom metric that divides preemption events by total instance-hours. If the ratio exceeds 0.1 (one preemption per ten instance-hours), scale up the fallback on-demand capacity or try a different GPU type or region.

For cost tracking, label your spot instances with a tag like “ResourceType=Spot” in AWS or “goog-spot=true” in GCP. Use cloud billing exports to BigQuery to analyze cost per job. One team found that their spot training job for a Stable Diffusion model cost $0.04 per image generated versus $0.15 on on-demand, but required 12% more wall-clock time due to checkpoints and restarts—a trade-off they accepted to stay within budget.

Avoid the trap: spot instances can tempt you to over-provision because they seem cheap. Set a hard cap on the total number of spot nodes and a billing alert on spend. One AI startup accidentally launched 500 spot T4 instances and racked up $3,000 in a single weekend when spot prices surged during a crypto-mining demand spike; service quotas didn’t save them because their account’s spot vCPU limits had been raised far beyond what they actually needed.

Start small: run your next training job on a single GPU spot instance, using a model you can checkpoint every 100 steps. Measure the real cost and interruption frequency before scaling to hundreds of nodes. Once you confirm the savings, adopt a hybrid fleet for all non-latency-critical workloads.

About this article. This piece was drafted with the help of an AI writing assistant and reviewed by a human editor for accuracy and clarity before publication. It is general information only — not professional medical, financial, legal or engineering advice.
