Google Cloud Announces General Availability of NVIDIA L4 GPUs on Cloud Run for AI Inference

Google Cloud has officially announced the general availability of NVIDIA L4 GPU support within its Cloud Run serverless platform. This update allows developers to run high-performance workloads such as generative AI inference, video processing, and large-scale data transformations without managing underlying virtual machine infrastructure. The NVIDIA L4 GPU, powered by the Ada Lovelace architecture, provides superior energy efficiency and performance compared to its predecessors, making it ideal for modern serverless AI applications.

The service maintains the core benefits of Cloud Run, including the ability to scale to zero when there is no traffic and rapid scaling in response to incoming requests. Developers no longer need to configure complex GPU node pools or handle manual driver installations, as the platform manages these aspects automatically. By simplifying the stack, teams can deploy containerized AI models more quickly while utilizing the pay-as-you-go pricing model to optimize operational expenses.

This release significantly impacts how organizations approach AI deployment by lowering the barrier to entry for GPU-accelerated computing. With the integration of sidecar containers and persistent volume support, Cloud Run now offers a robust environment for enterprise-grade AI services. This change is particularly relevant for engineers looking to integrate large language models or computer vision into existing web services with minimal architectural overhead and maximum cost predictability.
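As a rough sketch, attaching an L4 GPU to a Cloud Run service can be done in a single deploy command. The service name, image path, and region below are placeholders, and the flag names should be verified against the current `gcloud run deploy` reference before use:

```shell
# Deploy a container to Cloud Run with one NVIDIA L4 GPU attached.
# Service name, image, and region are placeholders -- substitute your own.
gcloud run deploy my-inference-service \
  --image=us-docker.pkg.dev/my-project/my-repo/inference:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=4 \
  --memory=16Gi \
  --max-instances=3   # cap scale-out to keep GPU spend predictable
```

Because billing stops when the service scales to zero, capping `--max-instances` is mainly a guard against unexpected burst costs rather than a steady-state expense control.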
Comparison
| Aspect | Before / Alternative | After / This |
|---|---|---|
| Infrastructure Management | Manual GPU node pool configuration in GKE or VM management | Fully managed serverless environment with zero node management |
| Scaling Behavior | Pre-provisioned VMs or slow-scaling Kubernetes nodes | Fast scaling with scale-to-zero capabilities for cost savings |
| Cost Model | Hourly billing for running instances regardless of load | Pay-as-you-go billing for actual GPU usage time, with no charge while scaled to zero |
Action Checklist
- Select the NVIDIA L4 GPU option in Cloud Run service settings; ensure your region supports L4 GPU instances.
- Configure container resources to request GPU allocation; a minimum of 4 vCPUs and 16 GB of RAM is typically required for GPU usage.
- Optimize container startup time for better scaling performance; use smaller base images or lazy loading for large model weights.
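The resource settings from the checklist above can also be expressed declaratively in a Cloud Run service manifest. The sketch below assumes the selector and resource keys from Cloud Run's published YAML schema; the service name and image are placeholders, so check the current documentation before applying it:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-inference-service            # placeholder service name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/maxScale: "3"   # cap scale-out for cost control
    spec:
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4   # request an L4 GPU
      containers:
        - image: us-docker.pkg.dev/my-project/my-repo/inference:latest  # placeholder image
          resources:
            limits:
              cpu: "4"                  # minimum typically required with a GPU
              memory: 16Gi
              nvidia.com/gpu: "1"       # one GPU per instance
```

Applying the manifest with `gcloud run services replace service.yaml` keeps the GPU configuration in version control alongside the rest of the service definition.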
Source: Google Cloud Blog
This page summarizes the original source. Check the source for full details.
