
KEP-6012: Add initial KEP docs for CompositePodGroup API #6017

Draft
tosi3k wants to merge 1 commit into kubernetes:master from tosi3k:6012-compositepodgroup-api

Conversation

@tosi3k
Member

@tosi3k tosi3k commented Apr 15, 2026

  • One-line PR description: Enable support for hierarchical scheduling requirements in Kubernetes scheduler

  • Issue link: CompositePodGroup API #6012

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the following labels on Apr 15, 2026: do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory), sig/scheduling (categorizes an issue or PR as relevant to SIG Scheduling)
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Scheduling Apr 15, 2026
@k8s-ci-robot k8s-ci-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Apr 15, 2026
@tosi3k tosi3k mentioned this pull request Apr 15, 2026
@mm4tt
Contributor

mm4tt commented Apr 15, 2026

/cc

@k8s-ci-robot k8s-ci-robot requested a review from mm4tt April 15, 2026 13:48
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch 2 times, most recently from cf3bf8c to 6f99dd4 on April 16, 2026 07:16
Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
proximity of the machines they are running on. Today's data centers are
frequently organized in a hierarchical structure - for instance, machines are
grouped in a rack, racks are grouped in a block and so on. Machines from the
same rack are closer to each other than machines belonging to different racks.
Member

It's not only proximity, but primarily network topology, e.g. how many switches a request has to go through, or whether the machines have a direct high-bandwidth connection.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md

- The whole workload should be scheduled to run in a single availability zone,
- Homogeneous shards should be scheduled on the same rack,
- Pods from the same shard should be scheduled on the same machine.
Member

I'm not aware of any users that run AI training in a non-"pod-per-node" fashion.

This use case sounds too artificial to me. Let's actually take two real-world examples:

  1. TPU v7: https://docs.cloud.google.com/tpu/docs/tpu7x
  • you want to run shards within individual 4x4x4 cubes
  • however, the whole workload needs to live in a single superslice, say 8x8x16
  2. A similar use case for GPU

Member

I saw use cases where:

  • Pods don't utilize full GPUs, so multiple Pods can use a fraction of the same GPU (e.g. using MIG or time-slicing).
  • The same node is shared across multiple users or multiple jobs.

@wojtek-t wojtek-t self-assigned this Apr 16, 2026
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch from 6f99dd4 to f3c215e on April 16, 2026 13:54
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tosi3k
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch 2 times, most recently from 7f664a6 to f36f90a on April 17, 2026 10:00
@k8s-ci-robot k8s-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Apr 17, 2026
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch 3 times, most recently from 4da08f9 to 77986e0 on April 19, 2026 17:11
Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
hierarchy consisting of `CompositePodGroup` and `PodGroup` objects with the
references to the templates within the matching `Workload` object:

```mermaid
Member

In the diagram below - why do you need CompositePodGroup worker-0 and worker-1?

In my mental model, a CompositePodGroup makes sense only if it really has more than one child. If it has exactly one child, it can effectively be combined with that child.

// +optional
// +listType=map
// +listMapKey=name
CompositePodGroupTemplates []CompositePodGroupTemplate // <-- NEW FIELD
Member

I would assume that exactly one of CompositePodGroupTemplates and PodGroupTemplates can be set, right?

Member Author

Yes - the description above states that it is a union member, together with PodGroupTemplates. I should have added the DV tags to make it clearer.

That said, I see a peculiar issue with this approach - let's say that:

  • User enables the CompositePodGroup feature gate,
  • User creates a Workload with CPG templates (hence PodGroupTemplates is empty),
  • User disables the CompositePodGroup feature gate,
  • We end up with a broken Workload object that has empty PodGroupTemplates - and non-empty CompositePodGroupTemplates is not used either because the gate was disabled.

By "broken object" I mean an object that couldn't have been created when the CompositePodGroup was disabled because we would require non-nil PodGroupTemplates - and which doesn't make any sense in the world without the CPGs.

Member

That's not a new pattern - API approvers definitely have an answer to that; please ask Jordan.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
// +featureGate=WorkloadAwarePreemption
// +optional
// +default="Pod"
DisruptionMode *CompositeDisruptionMode
Member

I think we were talking about changing DisruptionMode to a struct so that we can later extend it with more information.
So it would effectively become a union of potential DisruptionModes.

@Argh4k - if we don't find good arguments against it, we need to ensure that this is also a beta blocker for the Workload-Aware-Preemption KEP.
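
For illustration, such a struct could look roughly like this; all names are hypothetical, and the real shape is exactly what is being debated here:

```go
// Hypothetical sketch of DisruptionMode as a struct, so that per-mode
// parameters can be added later without another API change.
type CompositeDisruption struct {
	// Mode discriminates the union; naming is still open in this thread
	// (Pod/PodGroup vs Basic/Group vs Single/All).
	Mode CompositeDisruptionMode

	// Future per-mode knobs would be optional pointer fields here,
	// e.g. Group *GroupDisruptionParams.
}
```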

Contributor

Do I understand correctly that you want to have the same struct for both the disruption mode of PodGroup and CompositePodGroup? Then I guess we also need to change the naming slightly, because Pod and PodGroup do not seem to make sense in the context of CompositePodGroup. The Basic/Group naming that I can see further in the doc makes more sense for both, I guess.

Member

I'm not yet sure whether we want the same disruption mode or not.
For now I don't see a good reason to separate them, but I'm afraid we may find one in the future...

// CompositePodGroupTemplateName defines the CompositePodGroupTemplate name within the Workload object.
//
// +required
CompositePodGroupTemplateName string
Member

That doesn't work for structures with more layers, I think.

So imagine a Workload that is:

myWorkload:
  - cpgTemplateLevel0:
    - cpgTemplateLevel1:
      - myPodGroup

Then a CompositePodGroup can reference cpgTemplateLevel0, but you can no longer create a CPG that would reference cpgTemplateLevel1 with that structure.

If the CPGTemplate is recursive, the reference also has to be recursive, right?

Member Author

If the CPGTemplate is recursive, the reference also has to be recursive, right?

In practice, it doesn't have to - in the current vision, you'd basically need to walk through the referred Workload object's template hierarchy to find the referred template.

That said, your proposal sounds simpler and doesn't require running some kind of a DFS/BFS whenever one tries to get to the referred template.

We would need to additionally validate that the template reference is valid, though (i.e. the whole path in the reference actually exists in the Workload object).
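
As a sketch of the recursive-reference idea (the type and field names are hypothetical):

```go
// Hypothetical path-style reference: each element names the template to
// descend into at the next level of the Workload's template hierarchy,
// e.g. ["cpgTemplateLevel0", "cpgTemplateLevel1"].
// Validation would check that every element of the path exists at its level.
type CompositePodGroupTemplatePath struct {
	Path []string
}
```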

Member

In practice, it doesn't have to - in the current vision, you'd basically need to walk through the referred Workload object's template hierarchy to find the referred template.

That would also require ensuring uniqueness of names in the whole tree - I'm not sure that's the best option.
I would rather try to reach uniqueness within a given node and then DFS/BFS won't help...

Member Author

Adjusted the proposed template hierarchy structure accordingly.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch from 77986e0 to ce82141 on April 20, 2026 09:31

In addition, current logic of gang scheduling does not distinguish between the
Pods within a single `PodGroup` object. This prevents the users from properly
scheduling heterogeneous gangs as a hierarchical group gang because there is no
Contributor

What do you mean by hierarchical group gang here? I guess this is about supporting something like a single unit of LWS, where the heterogeneous gang is a leader + X workers, and you cannot model this unit with the current API (minCount is not enough because you need to guarantee that the leader is running). But I do not understand why we would call that a hierarchical group gang.

Member Author

@tosi3k tosi3k Apr 20, 2026

Dropped the "hierarchical" adjective and stuck to just the group gang wording - the original phrasing was indeed unfortunate.

Now, by group gang we mean exactly the thing you specified, i.e. at least N groups (of Pods or, recursively, groups) can be scheduled as one large gang.

Does that make sense?
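
To make that reading concrete, here is a minimal sketch of an LWS-style unit seen as a group gang - purely illustrative shorthand, not the KEP's API:

```go
// Illustrative shorthand only: a tree where the root gangs over its
// children, so the leader PodGroup and all worker PodGroups must be
// placed together or not at all.
type group struct {
	name     string
	children []group // empty for a leaf PodGroup
}

var lwsUnit = group{
	name: "lws-unit",
	children: []group{
		{name: "leader"},   // 1-pod PodGroup; the gang guarantees it runs
		{name: "worker-0"}, // worker PodGroups
		{name: "worker-1"},
	},
}
```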

Contributor

I still think that it does not do a good job explaining the LWS and similar cases. Firstly:

current logic of gang scheduling does not distinguish between the Pods within a single PodGroup object

I mean, technically it does - there is nothing preventing you from creating 2 different pods under the same PodGroup.

This prevents the users from properly scheduling heterogeneous gangs as a wider group gang because there is no feasible way to express this intent at the moment.

Still, at this point in the doc I am not sure what group gang means. But I know that you can schedule a heterogeneous gang using the current PodGroup API.

I think that you could start by explaining that sometimes you need to have a gang that consists of different types of pods that need to be grouped together, and this is what you call a group gang.

@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch from ce82141 to 329d325 on April 20, 2026 14:08
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch 2 times, most recently from d61d8a9 to a9fd9f3 on April 20, 2026 15:15

// DisruptionMode defines the mode in which a given CompositePodGroup can be disrupted.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// One of Pod, PodGroup. Defaults to Pod if unset.
Contributor

Here you have One of Pod, PodGroup. Defaults to Pod if unset. but in the type CompositeDisruptionMode string you have Basic/Group.

Member

We need better names in general for both I think :)

Maybe Single/All is better (and then we can actually be consistent)?
@mm4tt - thoughts?

hierarchical workloads would be to evolve the `PodGroup` API into a recursive
type itself.

The drawback of this approach is that `Pod`-oriented policies would stop making
Member

Can you give some examples of the policies that would not make sense? I think we need to lay down the reasons more precisely to justify duplicating the PodGroup structure and its function.

It's pretty obvious that Pods and PodGroups need different definitions (as the latter need to specify how to schedule a group of objects), but it's not that obvious that their types matter.

I could come up with the following SchedulingPolicies that are actually abstract from the object types they schedule:

  • BasicSchedulingPolicy
  • GangSchedulingPolicy
  • AnyOfSchedulingPolicy
  • PrioritizedAlternativesSchedulingPolicy

The claim in this KEP is that we may have a SchedulingPolicy which effectively determines the type of objects that can refer to it, for instance HomogeneousSchedulingPolicy. Even here I could imagine we have a policy which can remain abstract if we allow specifying anyOf{PodTemplate, PodGroupTemplate}.

What's even more important, I'd challenge whether PodGroup is the right type to model a group of not-yet-existing objects. Maybe we should think about a dedicated type to represent them instead? I feel that they would have completely different scheduling semantics, which is not fully envisioned yet, as we will most likely need some sort of reservations.

I have one more argument. Let's imagine that we have a new leaf type, for instance SlurmNode. It could be one of the ways we abstract resource reservation to avoid (ab)using Pods for it. We could still group those objects into exactly the same type of groups and use the same set of SchedulingPolicies as building blocks, allowing us to express exactly the same structures as we need for pods. This is why I was not even a fan of baking the "Pod" name into "PodGroup".
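
For comparison, the type-agnostic variant suggested above could be sketched as a union along these lines - all names are hypothetical, mirroring the policy list in the comment:

```go
// Illustrative sketch: scheduling policies abstract from the object type
// they schedule, modeled as a union. Exactly one member would be set;
// the referenced policy types are placeholders, not proposed API.
type SchedulingPolicySpec struct {
	Basic                   *BasicSchedulingPolicy
	Gang                    *GangSchedulingPolicy
	AnyOf                   *AnyOfSchedulingPolicy
	PrioritizedAlternatives *PrioritizedAlternativesSchedulingPolicy
}
```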

// +optional
// +listType=map
// +listMapKey=name
CompositePodGroupTemplates []CompositePodGroupTemplate
Contributor

Just to confirm, is the recursive reference in CompositePodGroupTemplate intentional? It seems to allow nesting templates inside themselves.

Member

Yes - it's intentional.

Member

Do we need to validate levels of recursion here?

Contributor

Wouldn’t this recursive definition fail OpenAPI schema validation?
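
If a limit is wanted, one conceivable admission-time check is a bounded walk over the template tree; the cap and helper below are assumptions for illustration only:

```go
// maxTemplateDepth is a hypothetical cap; the KEP would have to pick one.
const maxTemplateDepth = 4

// validateDepth rejects template trees nested beyond the cap, walking the
// recursive CompositePodGroupTemplates field discussed in this thread.
func validateDepth(tpls []CompositePodGroupTemplate, depth int) error {
	if depth > maxTemplateDepth {
		return fmt.Errorf("template nesting exceeds %d levels", maxTemplateDepth)
	}
	for _, t := range tpls {
		if err := validateDepth(t.CompositePodGroupTemplates, depth+1); err != nil {
			return err
		}
	}
	return nil
}
```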

Contributor

@mm4tt mm4tt left a comment

Part 1 of review, will continue tomorrow. Thanks!

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
APIs ([KEP-5732]).

These features already cover the use cases of some simple batch workloads that
are characterized by a flat homogeneous structure. [KEP-5547] is an example of a
Contributor

I would question whether homogeneous vs. heterogeneous is the key discriminator here. I would say that we are doing CompositePodGroup regardless of whether a PodGroup should be homogeneous or not.

LWS is a great example: I believe we can keep PodGroup heterogeneous for now (leader + worker), but hierarchy independently brings significant value by unlocking disaggregated inference.

especially the AI ones, contain some inherent hierarchy that often translates
into more complex scheduling needs that core Kubernetes cannot satisfy just yet.

One of the most apparent gaps today is the lack of multi-level topology-aware
Contributor

I guess this is already explained in the TAS KEP, so I would just link to that and explain the key benefits of multi-level TAS.

Overall, this KEP is already 1200 lines long, so I'd recommend cutting content wherever possible.


### Non-Goals

- Extend topology-aware scheduling with the notion of preferred constraints.
Contributor

@44past4, don't we want to do that in 1.37 (as a part of TAS KEP)?

Member

not in 1.37 - let's not couple this together

Member

Is DRA and ResourceClaim integration also part of the TAS workstream?

Member Author

DRA support is treated as out of this KEP's scope.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
Member

@andreyvelich andreyvelich left a comment

Thanks for this @tosi3k!
I left a few thoughts.



Comment on lines +225 to +228
For flat homogeneous workloads, using the `CompositePodGroup` API is not needed.
True workload controllers can continue using the `PodGroup` and `Workload` APIs
exclusively the way they used to - this consumption pattern will continue to be
supported.
Member

I think, as @mm4tt mentioned, CompositePodGroup is useful for homogeneous workloads too.



exclusively the way they used to - this consumption pattern will continue to be
supported.

### User Stories
Member

What do we think about the use case where a TrainJob consists of multiple stages for post-training (e.g. Initializer, Trainer, Evaluator)?
CompositePodGroups might be useful there as well: https://github.com/andreyvelich/trainer/blob/3833e9575d698c371cf7288af1d6eb941972de50/docs/proposals/3015-workload-aware-scheduling/README.md#story-3

Also, for hyperparameter optimization jobs we usually have multiple Trials (e.g. TrainJobs) that a suggestion service evaluates over time to find the best combination.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated

//
// +required
CompositePodGroupName string
}
Member

@andreyvelich andreyvelich Apr 24, 2026

Do we keep this struct for future extension, if we want to allow cross-namespace references?

I am curious why not just this:

type PodGroupSpec struct {
	// Name of the parent CompositePodGroup.
	ParentRef string
}

Member Author

Hmm, I don't think we actually have an example of a multi-namespace true workload, do we?

Wrt. having just a string field instead of a dedicated type - good question. I thought about having a struct instead because it would be easier to extend in the future (which is difficult to predict :)).

Comment on lines +492 to +497
// SchedulingPolicy defines the scheduling policy for this instance of the CompositePodGroup.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// This field is immutable.
//
// +required
SchedulingPolicy CompositePodGroupSchedulingPolicy
Member

If PodGroup has ParentRef to the Composition, does it have the same spec values as defined in CompositePodGroupSpec (e.g. minCount)?

Member Author

I'm not sure if I follow - a CompositePodGroup (CPG) can have a different scheduling policy than its children PodGroups (PGs) - e.g. a basic policy for the CPG and a gang policy for all children PGs.

Member

What's the plan for rescheduling, especially for a heterogeneous multi-layered pod group? Essentially, this has the same scheduling difficulty as a heterogeneous pod group - even if the scheduler cannot find a placement, it might be just because of the scheduling order. For now, we are saying "we do not support workloads with various pods nicely". But I need to bring this up again.

The question here is - what are we going to explain for heterogeneous multi-layered pod groups? The first step would likely be the same - just say "we do not support such heterogeneous ones nicely right now". But what should we do in the long term? I am asking this because I imagine a "composite pod group with various pod groups" would be a more common scenario than "one pod group with various pods". So we cannot just keep saying "we do not support heterogeneous multi-layered pod groups". We need a good path for evolving the scheduler to support them nicely.

