
KEP-6012: Add initial KEP docs for CompositePodGroup API #6017

Draft
tosi3k wants to merge 1 commit into kubernetes:master from tosi3k:6012-compositepodgroup-api

Conversation

@tosi3k
Member

@tosi3k tosi3k commented Apr 15, 2026

  • One-line PR description: Enable support for hierarchical scheduling requirements in Kubernetes scheduler

  • Issue link: CompositePodGroup API #6012

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the following labels on Apr 15, 2026: do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory), sig/scheduling (categorizes an issue or PR as relevant to SIG Scheduling)
@github-project-automation github-project-automation Bot moved this to Needs Triage in SIG Scheduling Apr 15, 2026
@k8s-ci-robot k8s-ci-robot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Apr 15, 2026
@tosi3k tosi3k mentioned this pull request Apr 15, 2026
@mm4tt
Contributor

mm4tt commented Apr 15, 2026

/cc

@k8s-ci-robot k8s-ci-robot requested a review from mm4tt April 15, 2026 13:48
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch 2 times, most recently from cf3bf8c to 6f99dd4 on April 16, 2026 07:16
Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
proximity of the machines they are running on. Today's data centers are
frequently organized in a hierarchical structure - for instance, machines are
grouped in a rack, racks are grouped in a block and so on. Machines from the
same rack are closer to each other than machines belonging to different racks.
Member

It's not only proximity, but primarily network topology, e.g. how many switches a request has to go through, or whether the machines have a direct high-bandwidth connection.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md

- The whole workload should be scheduled to run in a single availability zone,
- Homogeneous shards should be scheduled on the same rack,
- Pods from the same shard should be scheduled on the same machine.
Member

I'm not aware of any users that run AI training in a non-"pod-per-node" fashion.

This use case sounds too artificial to me. Let's actually take two real-world examples:

  1. TPU v7: https://docs.cloud.google.com/tpu/docs/tpu7x
  • you want to run shards within individual 4x4x4 cubes
  • however, the whole workload needs to live in a single superslice, say 8x8x16
  2. A similar use case for GPU

Member

I saw use cases where:

  • Pods don't utilize full GPUs, so multiple Pods can use a fraction of the same GPU (e.g. using MIG or time-slicing).
  • The same node is shared across multiple users or multiple jobs.

@wojtek-t wojtek-t self-assigned this Apr 16, 2026
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch from 6f99dd4 to f3c215e on April 16, 2026 13:54
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tosi3k
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch 2 times, most recently from 7f664a6 to f36f90a on April 17, 2026 10:00
@k8s-ci-robot k8s-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) on Apr 17, 2026
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch 3 times, most recently from 4da08f9 to 77986e0 on April 19, 2026 17:11
Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
hierarchy consisting of `CompositePodGroup` and `PodGroup` objects with the
references to the templates within the matching `Workload` object:

```mermaid
Member

In the diagram below - why do you need CompositePodGroup worker-0 and worker-1?

In my mental model, a CompositePodGroup makes sense only if it really has more than one child. If it has exactly one child, it can effectively be combined with that child.

// +optional
// +listType=map
// +listMapKey=name
CompositePodGroupTemplates []CompositePodGroupTemplate // <-- NEW FIELD
Member

I would assume that exactly one of CompositePodGroupTemplates and PodGroupTemplates can be set, right?

Member Author

Yes - the description above states that it is a union member, together with PodGroupTemplates. I should have added the DV tags to make it clearer.

That said, I see a peculiar issue with this approach - let's say that:

  • User enables the CompositePodGroup feature gate,
  • User creates a Workload with CPG templates (hence PodGroupTemplates is empty),
  • User disables the CompositePodGroup feature gate,
  • We end up with a broken Workload object that has empty PodGroupTemplates - and non-empty CompositePodGroupTemplates is not used either because the gate was disabled.

By "broken object" I mean an object that couldn't have been created when the CompositePodGroup was disabled because we would require non-nil PodGroupTemplates - and which doesn't make any sense in the world without the CPGs.

Member

That's not a new pattern - API approvers definitely have an answer to that; please ask Jordan.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
// +featureGate=WorkloadAwarePreemption
// +optional
// +default="Pod"
DisruptionMode *CompositeDisruptionMode
Member

I think we were talking about changing DisruptionMode to a struct so that we can later extend it with more information.
So it would effectively become a union of potential DisruptionModes.

@Argh4k - if we don't find good arguments against it, we need to ensure that this is also a beta blocker for the Workload-Aware-Preemption KEP.
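
For illustration, such a struct could look roughly like this; all names are hypothetical, and the real shape is exactly what is being debated here:

```go
// Hypothetical sketch of DisruptionMode as a struct, so that per-mode
// parameters can be added later without another API change.
type CompositeDisruption struct {
	// Mode discriminates the union; naming is still open in this thread
	// (Pod/PodGroup vs Basic/Group vs Single/All).
	Mode CompositeDisruptionMode

	// Future per-mode knobs would be optional pointer fields here,
	// e.g. Group *GroupDisruptionParams.
}
```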

Contributor

Do I understand correctly that you want to have the same struct for both the disruption mode of PodGroup and CompositePodGroup? Then I guess we also need to change the naming slightly, because Pod and PodGroup do not seem to make sense in the context of CompositePodGroup. The Basic/Group naming that I can see further in the doc makes more sense for both, I guess.

Member

I'm not yet sure whether we want the same disruption mode or not.
For now I don't see a good reason to separate them, but I'm afraid we may find one in the future...

// CompositePodGroupTemplateName defines the CompositePodGroupTemplate name within the Workload object.
//
// +required
CompositePodGroupTemplateName string
Member

That doesn't work for structures with more layers, I think.

So imagine a Workload that is:

myWorkload:
  - cpgTemplateLevel0:
    - cpgTemplateLevel1:
      - myPodGroup

Then a CompositePodGroup can reference cpgTemplateLevel0, but you can no longer create a CPG that would reference cpgTemplateLevel1 with that structure.

If the CPGTemplate is recursive, the reference also has to be recursive, right?

Member Author

If the CPGTemplate is recursive, the reference also has to be recursive, right?

In practice, it doesn't have to - in the current vision, you'd basically need to walk through the referred Workload object's template hierarchy to find the referred template.

That said, your proposal sounds simpler and doesn't require running some kind of a DFS/BFS whenever one tries to get to the referred template.

We would need to additionally validate that the template reference is valid, though (i.e. the whole path in the reference actually exists in the Workload object).
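
As a sketch of the recursive-reference idea (the type and field names are hypothetical):

```go
// Hypothetical path-style reference: each element names the template to
// descend into at the next level of the Workload's template hierarchy,
// e.g. ["cpgTemplateLevel0", "cpgTemplateLevel1"].
// Validation would check that every element of the path exists at its level.
type CompositePodGroupTemplatePath struct {
	Path []string
}
```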

Member

In practice, it doesn't have to - in the current vision, you'd basically need to walk through the referred Workload object's template hierarchy to find the referred template.

That would also require ensuring uniqueness of names in the whole tree - I'm not sure that's the best option.
I would rather try to reach uniqueness within a given node and then DFS/BFS won't help...

Member Author

Adjusted the proposed template hierarchy structure accordingly.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch from 77986e0 to ce82141 on April 20, 2026 09:31

In addition, current logic of gang scheduling does not distinguish between the
Pods within a single `PodGroup` object. This prevents the users from properly
scheduling heterogeneous gangs as a hierarchical group gang because there is no
Contributor

What do you mean by hierarchical group gang here? I guess this is about supporting something like a single unit of LWS, where the heterogeneous gang is a leader + X workers, and you cannot model this unit with the current API (minCount is not enough because you need to guarantee that the leader is running). But I do not understand why we would call that a hierarchical group gang.

Member Author

@tosi3k tosi3k Apr 20, 2026

Dropped the "hierarchical" adjective and stuck to just the group gang wording - the original phrasing was indeed unfortunate.

Now, by group gang we mean exactly the thing you specified, i.e. at least N groups (of Pods or, recursively, groups) can be scheduled as one large gang.

Does that make sense?
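
To make that reading concrete, here is a minimal sketch of an LWS-style unit seen as a group gang - purely illustrative shorthand, not the KEP's API:

```go
// Illustrative shorthand only: a tree where the root gangs over its
// children, so the leader PodGroup and all worker PodGroups must be
// placed together or not at all.
type group struct {
	name     string
	children []group // empty for a leaf PodGroup
}

var lwsUnit = group{
	name: "lws-unit",
	children: []group{
		{name: "leader"},   // 1-pod PodGroup; the gang guarantees it runs
		{name: "worker-0"}, // worker PodGroups
		{name: "worker-1"},
	},
}
```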

Contributor

I still think that it does not do a good job explaining the LWS and similar cases. Firstly:

current logic of gang scheduling does not distinguish between the Pods within a single PodGroup object

I mean, technically it does - there is nothing preventing you from creating 2 different pods under the same PodGroup.

This prevents the users from properly scheduling heterogeneous gangs as a wider group gang because there is no feasible way to express this intent at the moment.

Still, at this point in the doc I am not sure what group gang means. But I know that you can schedule a heterogeneous gang using the current PodGroup API.

I think that you could start by explaining that sometimes you need to have a gang that consists of different types of pods that need to be grouped together, and this is what you call a group gang.

@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch from ce82141 to 329d325 on April 20, 2026 14:08
@tosi3k tosi3k force-pushed the 6012-compositepodgroup-api branch 2 times, most recently from d61d8a9 to a9fd9f3 on April 20, 2026 15:15

// DisruptionMode defines the mode in which a given CompositePodGroup can be disrupted.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// One of Pod, PodGroup. Defaults to Pod if unset.
Contributor

Here you have One of Pod, PodGroup. Defaults to Pod if unset. but in the type CompositeDisruptionMode string you have Basic/Group.

Member

We need better names in general for both I think :)

Maybe Single/All is better (and then we can actually be consistent)?
@mm4tt - thoughts?

hierarchical workloads would be to evolve the `PodGroup` API into a recursive
type itself.

The drawback of this approach is that `Pod`-oriented policies would stop making
Member

Can you give some examples of the policies that would not make sense? I think we need to lay down the reasons more precisely to justify duplicating the PodGroup structure and its function.

It's pretty obvious that Pods and PodGroups need different definitions (as the latter need to specify how to schedule a group of objects), but it's not that obvious that their types matter.

I could come up with the following SchedulingPolicies that are actually abstract from the object types they schedule:

  • BasicSchedulingPolicy
  • GangSchedulingPolicy
  • AnyOfSchedulingPolicy
  • PrioritizedAlternativesSchedulingPolicy

The claim in this KEP is that we may have a SchedulingPolicy which effectively determines the type of objects that can refer to it, for instance HomogeneousSchedulingPolicy. Even here I could imagine we have a policy which can remain abstract if we allow specifying anyOf{PodTemplate, PodGroupTemplate}.

What's even more important, I'd challenge whether PodGroup is the right type to model a group of not-yet-existing objects. Maybe we should think about a dedicated type to represent them instead? I feel that they would have completely different scheduling semantics, which is not fully envisioned yet, as we will most likely need some sort of reservations.

I have one more argument. Let's imagine that we have a new leaf type, for instance SlurmNode. It could be one of the ways we abstract resource reservation to avoid (ab)using Pods for it. We could still group those objects into exactly the same type of groups and use the same set of SchedulingPolicies as building blocks, allowing us to express exactly the same structures as we need for pods. This is why I was not even a fan of baking the "Pod" name into "PodGroup".
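
For comparison, the type-agnostic variant suggested above could be sketched as a union along these lines - all names are hypothetical, mirroring the policy list in the comment:

```go
// Illustrative sketch: scheduling policies abstract from the object type
// they schedule, modeled as a union. Exactly one member would be set;
// the referenced policy types are placeholders, not proposed API.
type SchedulingPolicySpec struct {
	Basic                   *BasicSchedulingPolicy
	Gang                    *GangSchedulingPolicy
	AnyOf                   *AnyOfSchedulingPolicy
	PrioritizedAlternatives *PrioritizedAlternativesSchedulingPolicy
}
```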

// +optional
// +listType=map
// +listMapKey=name
CompositePodGroupTemplates []CompositePodGroupTemplate
Contributor

Just to confirm, is the recursive reference in CompositePodGroupTemplate intentional? It seems to allow nesting templates inside themselves.

Member

Yes - it's intentional.

Member

Do we need to validate levels of recursion here?

Contributor

Wouldn’t this recursive definition fail OpenAPI schema validation?
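
If a limit is wanted, one conceivable admission-time check is a bounded walk over the template tree; the cap and helper below are assumptions for illustration only:

```go
// maxTemplateDepth is a hypothetical cap; the KEP would have to pick one.
const maxTemplateDepth = 4

// validateDepth rejects template trees nested beyond the cap, walking the
// recursive CompositePodGroupTemplates field discussed in this thread.
func validateDepth(tpls []CompositePodGroupTemplate, depth int) error {
	if depth > maxTemplateDepth {
		return fmt.Errorf("template nesting exceeds %d levels", maxTemplateDepth)
	}
	for _, t := range tpls {
		if err := validateDepth(t.CompositePodGroupTemplates, depth+1); err != nil {
			return err
		}
	}
	return nil
}
```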

Contributor

@mm4tt mm4tt left a comment

Part 1 of review, will continue tomorrow. Thanks!

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
APIs ([KEP-5732]).

These features already cover the use cases of some simple batch workloads that
are characterized by a flat homogeneous structure. [KEP-5547] is an example of a
Contributor

I would question whether homogeneous vs. heterogeneous is the key discriminator here. I would say that we are doing CompositePodGroup regardless of whether a PodGroup should be homogeneous or not.

LWS is a great example: I believe we can keep PodGroup heterogeneous for now (leader + worker), but hierarchy independently brings significant value by unlocking disaggregated inference.

especially the AI ones, contain some inherent hierarchy that often translates
into more complex scheduling needs that core Kubernetes cannot satisfy just yet.

One of the most apparent gaps today is the lack of multi-level topology-aware
Contributor

I guess this is already explained in the TAS KEP, so I would just link to that and explain the key benefits of multi-level TAS.

Overall, this KEP is already 1200 lines long, so I'd recommend cutting content wherever possible.


### Non-Goals

- Extend topology-aware scheduling with the notion of preferred constraints.
Contributor

@44past4, don't we want to do that in 1.37 (as a part of TAS KEP)?

Member

not in 1.37 - let's not couple this together

Member

Is DRA and ResourceClaim integration also part of the TAS workstream?

Member Author

DRA support is treated as out of this KEP's scope.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated
Member

@andreyvelich andreyvelich left a comment

Thanks for this @tosi3k!
I left a few thoughts.



Comment on lines +225 to +228
For flat homogeneous workloads, using the `CompositePodGroup` API is not needed.
True workload controllers can continue using the `PodGroup` and `Workload` APIs
exclusively the way they used to - this consumption pattern will continue to be
supported.
Member

I think, as @mm4tt mentioned, CompositePodGroup is useful for homogeneous workloads too.



exclusively the way they used to - this consumption pattern will continue to be
supported.

### User Stories
Member

What do we think about the use case where a TrainJob consists of multiple stages for post-training (e.g. Initializer, Trainer, Evaluator)?
CompositePodGroups might be useful there as well: https://github.com/andreyvelich/trainer/blob/3833e9575d698c371cf7288af1d6eb941972de50/docs/proposals/3015-workload-aware-scheduling/README.md#story-3

Also, for hyperparameter optimization jobs we usually have multiple Trials (e.g. TrainJobs) that a suggestion service evaluates over time to find the best combination.

Comment thread keps/sig-scheduling/6012-composite-podgroup-api/README.md Outdated

//
// +required
CompositePodGroupName string
}
Member

@andreyvelich andreyvelich Apr 24, 2026

Do we keep this struct for future extension, if we want to allow cross-namespace references?

I am curious why not just this:

type PodGroupSpec struct {
	// Name of the parent CompositePodGroup.
	ParentRef string
}

Member Author

Hmm, I don't think we actually have an example of a multi-namespace true workload, do we?

Wrt. having just a string field instead of a dedicated type - good question. I thought about having a struct instead because it would be easier to extend in the future (which is difficult to predict :)).

Comment on lines +492 to +497
// SchedulingPolicy defines the scheduling policy for this instance of the CompositePodGroup.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// This field is immutable.
//
// +required
SchedulingPolicy CompositePodGroupSchedulingPolicy
Member

If PodGroup has ParentRef to the Composition, does it have the same spec values as defined in CompositePodGroupSpec (e.g. minCount)?

Member Author

I'm not sure if I follow - a CompositePodGroup (CPG) can have a different scheduling policy than its children PodGroups (PGs) - e.g. a basic policy for the CPG and a gang policy for all children PGs.

Member

What's the plan for rescheduling, especially for a heterogeneous multi-layered pod group? Essentially, this has the same scheduling difficulty as a heterogeneous pod group - even if the scheduler cannot find a placement, it might be just because of the scheduling order. For now, we are saying "we do not support workloads with various pods nicely". But I need to bring this up again.

The question here is - what are we going to explain for heterogeneous multi-layered pod groups? The first step would likely be the same - just say "we do not support such heterogeneous ones nicely right now". But what should we do in the long term? I am asking this because I imagine a "composite pod group with various pod groups" would be a more common scenario than "one pod group with various pods". So we cannot just keep saying "we do not support heterogeneous multi-layered pod groups". We need a good path for evolving the scheduler to support them nicely.

