
AWS vs GCP vs Azure for ML

Comparison Overview

Choosing a cloud provider for ML workloads is one of the highest-leverage decisions a platform team makes. Each of the three hyperscalers (AWS, GCP, and Azure) brings distinct strengths. AWS dominates in breadth of services and enterprise adoption. GCP is the only one of the three to offer TPUs and integrates closely with open-source frameworks like TensorFlow and JAX. Azure excels in hybrid-cloud scenarios and couples tightly with the Microsoft enterprise ecosystem.

Key insight: There is no universally “best” cloud for ML. The right choice depends on your team’s existing ecosystem, workload profile, and long-term cost model. Multi-cloud is increasingly common for large organisations that need to hedge risk or satisfy data-residency requirements.
AWS: SageMaker · EC2 · EKS; P5 · Inf2 · Trainium
GCP: Vertex AI · GKE · TPUs; A3 · TPU v5p · H100
Azure: AzureML · AKS · NDv5; NC H100 · ND A100

The comparison below is structured around the dimensions that matter most for ML platform teams: compute (GPU/TPU availability and networking), managed services (experiment tracking, pipelines, serving), pricing (on-demand, spot, reserved), and ecosystem integration.

Compute & GPU Instances

GPU availability is often the deciding factor. All three providers offer NVIDIA A100 and H100 instances, but differ in maximum cluster sizes, interconnect bandwidth, and spot/preemptible pricing.

AWS GPU Instances

P5: 8× H100 80 GB, 3.2 Tbps EFA networking. Best for large-scale distributed training.

P4d: 8× A100 40 GB, 400 Gbps EFA. Mature and widely available.

Inf2: AWS Inferentia2 chips for inference, at up to 70% lower cost than GPU instances.

Trn1: AWS Trainium for training workloads with Neuron SDK integration.

GCP GPU Instances

A3 Mega: 8× H100 80 GB with GPUDirect-TCPXO networking at up to 1.8 Tbps per VM. Targets large-scale distributed training.

A2: Up to 16× A100 40 GB in a single VM, connected via NVLink/NVSwitch within the node.

TPU v5p: Custom ML accelerator, 8,960 chips per pod. Best for JAX/TensorFlow workloads.

TPU v5e: Cost-optimised TPU for inference and smaller training jobs.

Azure GPU Instances

ND H100 v5: 8× H100 80 GB, InfiniBand at 400 Gbps per GPU (3.2 Tbps per VM). Strong for enterprise HPC+ML convergence.

NDm A100 v4: 8× A100 80 GB, InfiniBand at 200 Gbps per GPU (1.6 Tbps per VM).

NC H100 v5: One- or two-GPU instances for development and small-scale fine-tuning.

Maia: Microsoft’s custom AI accelerator (early access).

| Feature | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Max GPU per node | 8× H100 (P5) | 8× H100 (A3) | 8× H100 (NDv5) |
| Interconnect | EFA 3.2 Tbps | GPUDirect-TCPX | InfiniBand 400G per GPU |
| Custom accelerators | Trainium, Inferentia2 | TPU v5p/v5e | Maia (preview) |
| Spot/Preemptible savings | Up to 90% | Up to 91% | Up to 80% |
| Multi-node scaling | 20k+ GPUs (UltraCluster) | 8,960 TPU v5p chips per pod | Thousands via CycleCloud |

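Capacity warning: GPU capacity is constrained across all three clouds. For H100 clusters, expect 2–8 week lead times on reserved capacity. Plan procurement well ahead of training runs and consider explicit capacity reservations alongside committed-use discounts or savings plans, since the discount programmes lower the price but do not by themselves guarantee that capacity is available.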

Managed ML Services

Each cloud provides a managed ML platform that bundles experiment tracking, pipeline orchestration, model registry, and serving. These platforms vary significantly in maturity, flexibility, and lock-in.

AWS SageMaker

  • Studio: Notebook IDE with experiment tracking, lineage, and model registry built in.
  • Pipelines: Native DAG orchestrator. Tightly coupled to SageMaker jobs.
  • Endpoints: Real-time, serverless, and async inference. Auto-scaling built in.
  • Ground Truth: Data labelling service with human-in-the-loop and active learning.
  • Clarify: Bias detection and model explainability.

GCP Vertex AI

  • Workbench: Managed Jupyter with GCS and BigQuery integration.
  • Pipelines: Built on Kubeflow Pipelines v2. Portable and open-source friendly.
  • Prediction: Online and batch prediction with GPU support, traffic splitting.
  • Feature Store: Low-latency feature serving with BigQuery offline store.
  • Model Garden: Pre-trained foundation models available via API.

Azure ML

  • Studio: Drag-and-drop designer plus code-first notebooks. Integrates with VS Code.
  • Pipelines: Component-based SDK v2 pipelines. Also supports Azure Data Factory.
  • Managed Endpoints: Blue/green deployment with Kubernetes or managed compute.
  • Responsible AI: Fairness, interpretability, and error analysis dashboards.
  • Prompt Flow: LLM application orchestration for GenAI workloads.

Open-Source Alternatives

  • MLflow: Experiment tracking, model registry. Works on all three clouds.
  • Kubeflow: End-to-end ML on Kubernetes. Best supported on GKE.
  • Ray: Distributed training and serving. Runs on EKS, GKE, and AKS.
  • Seldon / KServe: Model serving on Kubernetes. Cloud-agnostic.
  • Weights & Biases: Experiment tracking SaaS. Multi-cloud by design.
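
    # Example: launch a SageMaker training job with spot instances
    import sagemaker
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",
        # IAM role for the job; get_execution_role() assumes a SageMaker notebook or Studio session
        role=sagemaker.get_execution_role(),
        instance_type="ml.p5.48xlarge",
        instance_count=4,
        framework_version="2.1",
        py_version="py310",  # required alongside framework_version
        use_spot_instances=True,
        max_run=72000,   # maximum training time, in seconds
        max_wait=86400,  # maximum total time including waiting for spot capacity; must be >= max_run
        # Illustrative checkpoint location so an interrupted spot job can resume
        checkpoint_s3_uri="s3://my-bucket/checkpoints",
        distribution={"torch_distributed": {"enabled": True}},
    )
    estimator.fit({"train": "s3://my-bucket/data/train"})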

Pricing Analysis

ML workloads are compute-heavy, making pricing structure a critical differentiator. Each cloud offers on-demand, spot/preemptible, and reserved pricing tiers with significant cost variations.

| Instance (8× A100 80 GB) | On-demand $/hr | Spot $/hr | 1-yr reserved $/hr |
| --- | --- | --- | --- |
| AWS p4de.24xlarge | ~$40.97 | ~$12.29 | ~$25.81 |
| GCP a2-ultragpu-8g | ~$40.22 | ~$12.07 | ~$25.34 (CUD) |
| Azure ND96amsr A100 v4 | ~$37.19 | ~$11.16 | ~$22.87 |

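To compare the tiers on equal terms, the hourly figures can be turned into discount percentages and a per-job estimate. The sketch below copies the approximate prices from the table; the helper names and the 4-node, 20-hour job are illustrative assumptions, and real bills also depend on region, interruptions, and storage.

```python
# Approximate hourly prices in USD, copied from the pricing table.
PRICES = {
    "AWS p4de.24xlarge": {"on_demand": 40.97, "spot": 12.29, "reserved_1yr": 25.81},
    "GCP a2-ultragpu-8g": {"on_demand": 40.22, "spot": 12.07, "reserved_1yr": 25.34},
    "Azure ND96amsr A100 v4": {"on_demand": 37.19, "spot": 11.16, "reserved_1yr": 22.87},
}


def discount_pct(on_demand: float, discounted: float) -> float:
    """Percentage saved relative to the on-demand rate."""
    return round(100 * (1 - discounted / on_demand), 1)


def job_cost(rate_per_hour: float, nodes: int, hours: float) -> float:
    """Compute-only cost of a job that keeps `nodes` instances busy for `hours`."""
    return round(rate_per_hour * nodes * hours, 2)


if __name__ == "__main__":
    for name, p in PRICES.items():
        spot = discount_pct(p["on_demand"], p["spot"])
        reserved = discount_pct(p["on_demand"], p["reserved_1yr"])
        # Hypothetical job: 4 nodes for 20 hours on spot capacity.
        print(f"{name}: spot -{spot}%, reserved -{reserved}%, "
              f"4x20h spot job ${job_cost(p['spot'], 4, 20)}")
```

On these figures the spot discount is close to 70% on all three clouds, so the choice between them rests more on interruption rates and available capacity than on list price.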
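Cost tip: Spot instances can reduce GPU costs by 60–90%, but they can be reclaimed at short notice, so training code must checkpoint regularly and resume from the latest checkpoint. Each cloud has a managed path for running on spot capacity: SageMaker Managed Spot Training, GCP managed instance groups with preemptible (Spot) VMs, and Azure low-priority or Spot VMs. These services restart interrupted capacity, but saving and restoring model state remains the job's responsibility.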
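The checkpointing strategy mentioned in the cost tip can be sketched in a framework-neutral way. This is a minimal illustration, not any provider's SDK: the function names, the JSON state, and the `stop_after` parameter (used only to simulate a preemption) are assumptions, and a real job would save model and optimizer state instead of a step counter.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically so a preemption never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, checkpoint_every: int = 100, stop_after=None) -> int:
    """Run (or resume) a training loop; `stop_after` simulates a spot interruption."""
    step = load_checkpoint(path)["step"]
    while step < total_steps:
        step += 1  # placeholder for one real training step
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, {"step": step})
        if stop_after is not None and step >= stop_after:
            break  # the instance was reclaimed
    return step
```

A run interrupted at step 250 with `checkpoint_every=100` resumes from step 200 on the next launch, so at most one checkpoint interval of work is repeated. The interval is a trade-off between checkpoint overhead and lost work per interruption.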

Beyond raw compute, consider data egress charges (roughly $0.09/GB to the internet on AWS and Azure, with GCP in a similar range depending on network tier; each provider waives a small monthly allowance), storage costs for training data and model artefacts, and managed service markups. For teams training large language models, networking costs for multi-node clusters can be significant.

AWS Cost Tools

Savings Plans: Flexible commitment model. Compute Savings Plans cover EC2, Fargate, and Lambda, and SageMaker has its own Savings Plans. Up to 72% savings on EC2 usage.

Cost Explorer: Granular cost breakdown, filterable by service, instance type, and cost-allocation tag.

Spot Instance Advisor: Shows interruption frequency by instance type to help with capacity planning.

GCP Cost Tools

CUDs: Committed Use Discounts with spend-based or resource-based options. Up to 57% savings.

Active Assist: AI-powered recommendations for right-sizing and idle resource cleanup.

BigQuery billing: Export billing data to BigQuery for custom cost analysis.

Decision Framework

Use the following weighted matrix to score each cloud against your team’s priorities. Assign a weight (1–5) to each dimension, score each cloud (1–5), and multiply to get a weighted score.

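| Dimension | Weight | AWS | GCP | Azure |
| --- | --- | --- | --- | --- |
| GPU availability & variety | 5 | 5 | 5 | 4 |
| Custom accelerators (TPU/Trainium) | 3 | 4 | 5 | 2 |
| Managed ML platform maturity | 4 | 5 | 4 | 4 |
| Kubernetes-native workflows | 4 | 4 | 5 | 4 |
| Enterprise compliance & hybrid | 3 | 4 | 3 | 5 |
| Spot pricing & cost tools | 4 | 5 | 4 | 4 |
| Data & analytics integration | 3 | 4 | 5 | 4 |
| Open-source ecosystem | 3 | 4 | 5 | 3 |
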
Recommendation patterns: Choose AWS if you need the broadest service catalogue and deepest enterprise ecosystem. Choose GCP if you are heavy on TensorFlow/JAX or need TPUs for large-scale training. Choose Azure if your organisation is Microsoft-centric or requires strong hybrid-cloud support with Azure Arc.
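The weighted matrix can be scored with a few lines of Python. The sketch below uses the example weights and scores from the table; the variable and function names are illustrative, and the weights should be replaced with your own team's priorities.

```python
# Example weights (1-5) per dimension, taken from the decision matrix.
WEIGHTS = {
    "GPU availability & variety": 5,
    "Custom accelerators": 3,
    "Managed ML platform maturity": 4,
    "Kubernetes-native workflows": 4,
    "Enterprise compliance & hybrid": 3,
    "Spot pricing & cost tools": 4,
    "Data & analytics integration": 3,
    "Open-source ecosystem": 3,
}

# Example scores (1-5) per cloud, in the same dimension order as WEIGHTS.
SCORES = {
    "AWS": [5, 4, 5, 4, 4, 5, 4, 4],
    "GCP": [5, 5, 4, 5, 3, 4, 5, 5],
    "Azure": [4, 2, 4, 4, 5, 4, 4, 3],
}


def weighted_total(weights: dict, scores: list) -> int:
    """Sum of weight x score across all dimensions."""
    if len(weights) != len(scores):
        raise ValueError("one score is required per dimension")
    return sum(w * s for w, s in zip(weights.values(), scores))


TOTALS = {cloud: weighted_total(WEIGHTS, scores) for cloud, scores in SCORES.items()}
MAX_TOTAL = 5 * sum(WEIGHTS.values())  # best possible score

if __name__ == "__main__":
    for cloud, total in sorted(TOTALS.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{cloud}: {total} / {MAX_TOTAL}")
```

With the example values the totals are 131 for GCP, 129 for AWS, and 110 for Azure out of a possible 145. The two-point gap between the top two is small enough that changing a single weight can flip the ranking, which is why the weights deserve more care than the scores.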

For multi-cloud strategies, consider abstracting your ML platform layer with Kubernetes (EKS/GKE/AKS) and open-source tools like MLflow, Ray, and KServe. This reduces provider lock-in while preserving the ability to leverage provider-specific accelerators when needed.

Migration Tips

Migrating ML workloads between clouds is complex but increasingly necessary. A phased approach minimises disruption and risk.

    # Portable storage abstraction using fsspec
    import os

    import fsspec

    # The URL scheme (s3://, gs://, az://) selects the backend, so the same code
    # works with S3, GCS, and Azure Blob (needs s3fs, gcsfs, or adlfs installed).
    def load_dataset(path: str) -> bytes:
        with fsspec.open(path, "rb") as f:
            return f.read()

    # Environment variable toggles provider at deploy time
    provider = os.getenv("CLOUD_PROVIDER", "aws")
    bucket = {"aws": "s3://data", "gcp": "gs://data", "azure": "az://data"}[provider]

Data gravity warning: The largest cost of migration is often moving training data between clouds. At roughly $0.09/GB, a 100 TB dataset costs ~$9,000 in egress fees on AWS/Azure. Consider using cloud-native transfer services (AWS DataSync, Google Cloud Storage Transfer Service, AzCopy) or physical transfer appliances (AWS Snowball, Google Transfer Appliance, Azure Data Box) for petabyte-scale moves.
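The egress figure above is simple arithmetic, and the same back-of-the-envelope approach shows when a network transfer stops being practical. In this sketch the flat $0.09/GB rate, decimal units (1 TB = 1,000 GB), and the `efficiency` factor for sustained link throughput are all simplifying assumptions; actual pricing is tiered and varies by region.

```python
def egress_cost_usd(dataset_tb: float, rate_per_gb: float = 0.09) -> float:
    """Estimated egress fee at a flat per-GB rate, using decimal terabytes."""
    return round(dataset_tb * 1000 * rate_per_gb, 2)


def transfer_days(dataset_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move the dataset over a link at the given sustained efficiency."""
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency must be in (0, 1]")
    total_bits = dataset_tb * 1e12 * 8
    seconds = total_bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86400


if __name__ == "__main__":
    for size_tb in (100, 1000):
        print(f"{size_tb} TB: ~${egress_cost_usd(size_tb):,.0f} egress, "
              f"~{transfer_days(size_tb, link_gbps=10):.1f} days over 10 Gbps")
```

At petabyte scale the transfer time over a typical dedicated link runs to weeks under these assumptions, which is the point at which a physical appliance becomes worth evaluating.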

Finally, invest in infrastructure-as-code (Terraform, Pulumi) from day one. Cloud-agnostic IaC makes it dramatically easier to reproduce environments across providers and reduces the human cost of migration.