Tech Updates

Americas

Kubernetes v1.35 Introduces In-Place Pod Restarts to Optimize AI Infrastructure Costs

The latest Kubernetes release introduces a native mechanism to reset Pods without recreation, drastically improving efficiency for large-scale training jobs.

Decoded

Published Jan 3, 2026

Kubernetes v1.35 has arrived with a major architectural enhancement known as the RestartAllContainers action. This alpha feature provides the long-requested ability to trigger a full, in-place restart of a Pod without the heavy overhead of deletion and recreation. For high-performance environments, this means infrastructure identity remains intact while the application environment is refreshed from scratch.

Previous versions forced Pods to be fully rescheduled by the control plane whenever multiple containers required a reset. This often caused significant delays, particularly in large clusters where rescheduling thousands of Pods simultaneously creates a bottleneck at the API server. By keeping Pods on their assigned nodes, the new mechanism preserves IP addresses and network namespaces, ensuring immediate availability for workloads.

The efficiency gains are vital for organizations running large-scale AI and machine learning training jobs. In a cluster with over 1,000 nodes, the time lost to traditional Pod recreation can result in over $100,000 in monthly resource waste. The new in-place mechanism reduces these recovery windows from several minutes to just a few seconds by bypassing the scheduler entirely.

Technically, the feature is activated via the RestartAllContainersOnContainerExits feature gate. When a container exit matches a defined rule, the Kubelet terminates all containers but keeps the sandbox and volume mounts active. The startup sequence then re-executes from the beginning, including the re-running of all init containers to repair any environment corruption that may have occurred.
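In practice this is configured per container. As a rough, hedged sketch only (the feature is alpha and the field names below are assumptions drawn from the container restart rules lineage in KEP-5532, so they may change before beta), a training Pod opting into full in-place restarts might look like:

```yaml
# Hypothetical sketch of the KEP-5532 alpha API; field names may change.
# Requires the RestartAllContainersOnContainerExits feature gate.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  restartPolicy: Never
  initContainers:
  - name: fetch-checkpoint        # re-runs on every in-place restart,
    image: example.com/fetcher    # so it must be idempotent
  containers:
  - name: trainer
    image: example.com/trainer
    restartPolicyRules:
    - action: RestartAllContainers   # restart every container in place
      exitCodes:
        operator: In
        values: [42]                 # e.g. a "retry from scratch" exit code
```

When the trainer exits with the matching code, the Kubelet tears down all containers but keeps the sandbox and volume mounts, then replays the startup sequence from the init containers onward.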

Beyond AI training, the update simplifies handling complex inter-container dependencies. Sidecars can now trigger a full Pod reset if they detect unrecoverable errors in a shared volume or a configuration file. This eliminates the need for fragile, custom failure-handling logic and offloads recovery mechanisms to the native Kubernetes platform itself.
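For the sidecar case, a watcher container could exit with a well-known code when it detects corruption, and a rule on that container would escalate to a full Pod reset. The fragment below is a hypothetical illustration; the rule syntax is assumed from the KEP-5532 alpha proposal and is not confirmed API:

```yaml
# Hypothetical sketch; assumed alpha syntax, subject to change.
# The sidecar exits with code 70 when it finds the shared config
# corrupted, triggering a full in-place restart of the Pod.
spec:
  initContainers:
  - name: config-watcher
    image: example.com/watcher
    restartPolicy: Always            # native sidecar container
    restartPolicyRules:
    - action: RestartAllContainers
      exitCodes:
        operator: In
        values: [70]
  containers:
  - name: app
    image: example.com/app
```

The design choice here is that failure detection stays in the sidecar, while recovery is delegated to the Kubelet rather than to custom controller logic.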

Monitoring these restarts is straightforward through a new Pod condition called AllContainersRestarting. The speed comes with a trade-off: the Kubelet bypasses preStop hooks during these actions to prioritize rapid recovery of the workload. Developers should therefore ensure their containers tolerate abrupt termination and that all init containers remain idempotent.
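A Pod mid-restart might surface a status roughly like the fragment below. This is an illustrative sketch only: the condition type comes from the release notes, but the reason value shown is a hypothetical placeholder, not confirmed API:

```yaml
# Illustrative status fragment for the alpha feature; exact fields
# may differ. "RestartRuleMatched" is a hypothetical reason value.
status:
  conditions:
  - type: AllContainersRestarting
    status: "True"
    reason: RestartRuleMatched
```

Observability tooling could watch for this condition to distinguish fast in-place resets from full rescheduling events.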

This advancement is a key building block for the community's goal of creating more robust platforms for heavy batch processing. You can find more details in the official documentation at https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/ or review the technical proposal at https://kep.k8s.io/5532. The SIG Node community invites feedback on this alpha feature to refine its behavior for future stable releases.


Decoded Take

This update marks a significant shift in the Kubernetes philosophy, moving away from the "disposable infrastructure" model toward a more efficient, long-lived resource management strategy for high-performance computing. By reducing the reliance on the centralized scheduler for simple application resets, the platform is evolving to handle the sheer scale of foundation model training where every second of GPU idle time is a massive financial drain. This transition signals that Kubernetes is no longer just for microservices, but is actively being re-engineered to become the standard substrate for the AI era.
