KEP-6012: Add initial KEP docs for CompositePodGroup API (#6017)
tosi3k wants to merge 1 commit into kubernetes:master
Conversation
Skipping CI for Draft Pull Request.

/cc
Force-pushed cf3bf8c to 6f99dd4
> proximity of the machines they are running on. Today's data centers are
> frequently organized in a hierarchical structure - for instance, machines are
> grouped in a rack, racks are grouped in a block and so on. Machines from the
> same rack are closer to each other than machines belonging to different racks.
It's not only proximity - primarily it's about network topology, e.g. how many switches a request has to go through, or whether the machines have a direct high-bandwidth connection.
> - The whole workload should be scheduled to run in a single availability zone,
> - Homogeneous shards should be scheduled on the same rack,
> - Pods from the same shard should be scheduled on the same machine.
I'm not aware of any users that run AI training in a non-"pod-per-node" fashion. This use case sounds too artificial to me. Let's actually take two real-world examples:
- you want to run shards within individual 4x4x4 cubes,
- however, the whole workload needs to live in a single superslice, say 8x8x16,
- a similar use case exists for GPUs.
I've seen use cases where:
- Pods don't utilize full GPUs, so multiple Pods can use a fraction of the same GPU (e.g. using MIG or time-slicing),
- the same node is shared across multiple users or multiple jobs.
Force-pushed 6f99dd4 to f3c215e
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tosi3k. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed 7f664a6 to f36f90a
Force-pushed 4da08f9 to 77986e0
> hierarchy consisting of `CompositePodGroup` and `PodGroup` objects with the
> references to the templates within the matching `Workload` object:
>
> *(mermaid diagram)*
In the diagram below - why do you need CompositePodGroup worker-0 and worker-1? In my mental model, a CompositePodGroup makes sense only if it really has more than one child. If it has exactly one child, it can effectively be combined with its child.
```go
// +optional
// +listType=map
// +listMapKey=name
CompositePodGroupTemplates []CompositePodGroupTemplate // <-- NEW FIELD
```
I would assume that exactly one of CompositePodGroupTemplates and PodGroupTemplates can be set, right?
Yes - the description above states that it is a union member, together with PodGroupTemplates. I should have added the DV tags to make it clearer.
That said, I see a peculiar issue with this approach - let's say that:
- User enables the `CompositePodGroup` feature gate,
- User creates a `Workload` with CPG templates (hence `PodGroupTemplates` is empty),
- User disables the `CompositePodGroup` feature gate,
- We end up with a broken `Workload` object that has an empty `PodGroupTemplates`, and the non-empty `CompositePodGroupTemplates` is not used either because the gate was disabled.

By "broken object" I mean an object that couldn't have been created while the `CompositePodGroup` gate was disabled (because we would require non-nil `PodGroupTemplates`) and which doesn't make any sense in a world without CPGs.
That's not a new pattern - API approvers definitely have an answer to that; please ask Jordan.
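The invalid state this thread describes can be made concrete with a small validation sketch. The types and the helper below are hypothetical simplifications for illustration, not the actual Workload API:

```go
package main

import "fmt"

// WorkloadSpec is a slimmed-down stand-in for the real type; only the two
// union members under discussion are modeled, as name lists.
type WorkloadSpec struct {
	PodGroupTemplates          []string
	CompositePodGroupTemplates []string
}

// validateUnion enforces "exactly one of the two lists is set". Rejecting
// CompositePodGroupTemplates outright when the feature gate is off is what
// makes the "broken object after gate disablement" scenario possible; the
// usual Kubernetes pattern instead keeps already-persisted gated fields valid
// on update.
func validateUnion(spec WorkloadSpec, gateEnabled bool) error {
	hasPG := len(spec.PodGroupTemplates) > 0
	hasCPG := len(spec.CompositePodGroupTemplates) > 0
	switch {
	case hasPG && hasCPG:
		return fmt.Errorf("podGroupTemplates and compositePodGroupTemplates are mutually exclusive")
	case !hasPG && !hasCPG:
		return fmt.Errorf("one of podGroupTemplates or compositePodGroupTemplates must be set")
	case hasCPG && !gateEnabled:
		return fmt.Errorf("compositePodGroupTemplates requires the CompositePodGroup feature gate")
	}
	return nil
}

func main() {
	spec := WorkloadSpec{CompositePodGroupTemplates: []string{"workers"}}
	fmt.Println(validateUnion(spec, true))  // <nil>
	fmt.Println(validateUnion(spec, false)) // gate error
}
```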
```go
// +featureGate=WorkloadAwarePreemption
// +optional
// +default="Pod"
DisruptionMode *CompositeDisruptionMode
```
I think we were talking about changing DisruptionMode to a struct so that we can later extend it with more information. It would effectively become a union of potential DisruptionModes.
@Argh4k - if we don't find good arguments against, we need to ensure this is also a beta blocker for the Workload-Aware-Preemption KEP.
Do I understand correctly that you want to have the same struct for the disruption mode of both PodGroup and CompositePodGroup? Then I guess we also need to change the naming slightly, because Pod and PodGroup don't seem to make sense in the context of CompositePodGroup. The Basic/Group naming that I can see further in the doc makes more sense for both, I guess.
I'm not yet sure whether we want the same disruption mode or not.
For now I don't see a good reason to separate them, but I'm afraid we may find one in the future...
```go
// CompositePodGroupTemplateName defines the CompositePodGroupTemplate name within the Workload object.
//
// +required
CompositePodGroupTemplateName string
```
That doesn't work for structures with more layers, I think. Imagine a Workload that is:

    myWorkload:
    - cpgTemplateLevel0:
      - cpgTemplateLevel1:
        - myPodGroup

Then a CompositePodGroup can reference cpgTemplateLevel0, but you can no longer create a CPG that would reference cpgTemplateLevel1 with that structure.
If the CPGTemplate is recursive, the reference also has to be recursive, right?
> If the CPGTemplate is recursive, the reference also has to be recursive, right?

In practice, it doesn't have to - in the current vision, you'd basically need to walk through the referred Workload object's template hierarchy to find the referred template.
That said, your proposal sounds simpler and doesn't require running some kind of DFS/BFS whenever one tries to get to the referred template.
We would need to additionally validate that the template reference is valid, though (i.e. that the whole path in the reference actually exists in the Workload object).
> In practice, it doesn't have to - in the current vision, you'd basically need to walk through the referred Workload object's template hierarchy to find the referred template.

That would also require ensuring uniqueness of names in the whole tree - I'm not sure that's the best option.
I would rather try to reach uniqueness within a given node, and then DFS/BFS won't help...
Adjusted the proposed template hierarchy structure accordingly.
Force-pushed 77986e0 to ce82141
> In addition, current logic of gang scheduling does not distinguish between the
> Pods within a single `PodGroup` object. This prevents the users from properly
> scheduling heterogeneous gangs as a hierarchical group gang because there is no
What do you mean by "hierarchical group gang" here? I guess this is about supporting something like a single unit of LWS, where the heterogeneous gang is a leader + X workers, and you cannot model this unit with the current API (minCount is not enough because you need to guarantee that the leader is running). But I don't understand why we would call that a hierarchical group gang.
Dropped the "hierarchical" adjective and stuck to just the "group gang" wording - this was indeed unfortunate wording.
Now, by group gang we mean exactly the thing you specified, i.e. at least N groups (of Pods or, recursively, of groups) can be scheduled as one large gang.
Does that make sense?
I still think that it does not do a good job explaining the LWS and similar cases. Firstly:

> current logic of gang scheduling does not distinguish between the Pods within a single `PodGroup` object

I mean, technically it does - there is nothing preventing you from creating 2 different Pods under the same PodGroup.

> This prevents the users from properly scheduling heterogeneous gangs as a wider group gang because there is no feasible way to express this intent at the moment.

Still, at this point in the doc I am not sure what "group gang" means. But I know that you can schedule a heterogeneous gang using the current PodGroup API. I think you could start by explaining that sometimes you need a gang that consists of different types of Pods that need to be grouped together, and this is what you call a group gang.
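For what it's worth, the "at least N groups scheduled as one gang, recursively" reading can be pinned down with a tiny sketch. The types are hypothetical (the real decision runs inside the scheduler over pending Pods, not over a precomputed boolean):

```go
package main

import "fmt"

// group is a node in the gang hierarchy: a leaf stands for a PodGroup with a
// known scheduling verdict, and an inner node is a "group gang" that needs at
// least MinCount of its children to be schedulable.
type group struct {
	MinCount    int
	Children    []group
	Schedulable bool // leaf verdict only; ignored for inner nodes
}

// gangOK applies the group-gang rule recursively.
func gangOK(g group) bool {
	if len(g.Children) == 0 {
		return g.Schedulable
	}
	ok := 0
	for _, c := range g.Children {
		if gangOK(c) {
			ok++
		}
	}
	return ok >= g.MinCount
}

func main() {
	// LWS-like unit: the leader must run, and at least 2 of 3 workers must
	// run, all as one gang.
	lws := group{
		MinCount: 2,
		Children: []group{
			{MinCount: 1, Children: []group{{Schedulable: true}}}, // leader
			{MinCount: 2, Children: []group{
				{Schedulable: true}, {Schedulable: false}, {Schedulable: true},
			}}, // workers
		},
	}
	fmt.Println(gangOK(lws)) // true
}
```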
Force-pushed ce82141 to 329d325
Force-pushed d61d8a9 to a9fd9f3
```go
// DisruptionMode defines the mode in which a given CompositePodGroup can be disrupted.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// One of Pod, PodGroup. Defaults to Pod if unset.
```
Here you have "One of Pod, PodGroup. Defaults to Pod if unset.", but in `type CompositeDisruptionMode string` you have Basic/Group.
We need better names in general for both, I think :)
Maybe Single/All is better (and then we can actually be consistent)?
@mm4tt - thoughts?
> hierarchical workloads would be to evolve the `PodGroup` API into a recursive
> type itself.
>
> The drawback of this approach is that `Pod`-oriented policies would stop making
Can you give some examples of the policies that would not make sense? I think we need to lay down the reasons more precisely to justify duplicating the PodGroup structure and its function.
It's pretty obvious that Pods and PodGroups need different definitions (as the latter needs to specify how to schedule a group of objects), but it's not that obvious that their types matter.
I could come up with the following SchedulingPolicies that are actually abstract from the object types they schedule:
- BasicSchedulingPolicy
- GangSchedulingPolicy
- AnyOfSchedulingPolicy
- PrioritizedAlternativesSchedulingPolicy

The claim in this KEP is that we may have a SchedulingPolicy which effectively determines the type of objects that can refer to it, for instance HomogeneousSchedulingPolicy. Even here I could imagine we have a policy which can remain abstract if we allow specifying anyOf{PodTemplate, PodGroupTemplate}.
What's even more important, I'd challenge whether PodGroup is the right type to model a group of not-yet-existing objects. Maybe we should think about a dedicated type to represent them instead? I feel they would have completely different scheduling semantics, which is not fully envisioned yet, as we most likely will need some sort of reservations.
I have one more argument. Let's imagine that we have a new leaf type, for instance SlurmNode. It could be one of the ways we abstract resource reservation to avoid (ab)using Pods for it. We could still group those objects into exactly the same type of groups and use the same set of SchedulingPolicies as building blocks, allowing us to express exactly the same structures as we need for Pods. This is why I was not even a fan of baking the "Pod" name into "PodGroup".
```go
// +optional
// +listType=map
// +listMapKey=name
CompositePodGroupTemplates []CompositePodGroupTemplate
```
Just to confirm, is the recursive reference in CompositePodGroupTemplate intentional? It seems to allow nesting templates inside themselves.
Do we need to validate levels of recursion here?
Wouldn't this recursive definition fail OpenAPI schema validation?
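On bounded recursion: one plausible answer (an assumption for illustration, not something the KEP commits to) is to cap the nesting depth at validation time, which also sidesteps a truly recursive OpenAPI schema by unrolling it a fixed number of levels:

```go
package main

import "fmt"

// CPGTemplate mirrors the self-referencing template field above in a
// simplified form.
type CPGTemplate struct {
	Name     string
	Children []CPGTemplate
}

// maxDepth is a made-up limit; the KEP would need to pick a concrete bound
// (and/or unroll the CRD schema that many levels).
const maxDepth = 4

// validateDepth rejects trees whose nesting exceeds maxDepth. depth is the
// 1-based level of t.
func validateDepth(t CPGTemplate, depth int) error {
	if depth > maxDepth {
		return fmt.Errorf("template %q exceeds max nesting depth %d", t.Name, maxDepth)
	}
	for _, c := range t.Children {
		if err := validateDepth(c, depth+1); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Levels 1..5: the innermost template breaches the cap of 4.
	deep := CPGTemplate{Name: "l1", Children: []CPGTemplate{{Name: "l2",
		Children: []CPGTemplate{{Name: "l3", Children: []CPGTemplate{{Name: "l4",
			Children: []CPGTemplate{{Name: "l5"}}}}}}}}}
	fmt.Println(validateDepth(deep, 1))
}
```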
mm4tt left a comment:
Part 1 of review, will continue tomorrow. Thanks!
> APIs ([KEP-5732]).
>
> These features already cover the use cases of some simple batch workloads that
> are characterized by a flat homogeneous structure. [KEP-5547] is an example of a
I would argue whether homogeneous vs. heterogeneous is the key discriminator here. I would say that we are doing CompositePodGroup regardless of whether a PodGroup should be homogeneous or not.
LWS is a great example: I believe we can keep PodGroup heterogeneous for now (leader + worker), but hierarchy independently brings significant value by unlocking disaggregated inference.
> especially the AI ones, contain some inherent hierarchy that often translates
> into more complex scheduling needs that core Kubernetes cannot satisfy just yet.
>
> One of the most apparent gaps today is the lack of multi-level topology-aware
I guess this is already explained in the TAS KEP, so I would just link to that and explain the key benefits of multi-level TAS.
Overall, this KEP is already 1200 lines long, so I'd recommend cutting content wherever possible.
> ### Non-Goals
>
> - Extend topology-aware scheduling with the notion of preferred constraints.
@44past4, don't we want to do that in 1.37 (as a part of the TAS KEP)?
Not in 1.37 - let's not couple these together.
Is DRA and resourceClaim integration also part of the TAS workstream?
DRA support is treated as out of this KEP's scope.
andreyvelich left a comment:
Thanks for this @tosi3k! I left a few thoughts.
> For flat homogeneous workloads, using the `CompositePodGroup` API is not needed.
> True workload controllers can continue using the `PodGroup` and `Workload` APIs
> exclusively the way they used to - this consumption pattern will continue to be
> supported.

I think, as @mm4tt mentioned, CompositePodGroup is useful for homogeneous workloads too.
> exclusively the way they used to - this consumption pattern will continue to be
> supported.
>
> ### User Stories
What do we think about the use case where a TrainJob consists of multiple stages for post-training (e.g. initializer, trainer, evaluator)?
CompositePodGroups might be useful there as well: https://github.com/andreyvelich/trainer/blob/3833e9575d698c371cf7288af1d6eb941972de50/docs/proposals/3015-workload-aware-scheduling/README.md#story-3
Also, for hyperparameter optimization jobs we usually have multiple trials (e.g. TrainJobs) that a suggestion service evaluates over time to find the best combination.
```go
//
// +required
CompositePodGroupName string
}
```
Do we keep this struct for future extension, if we want to allow cross-namespace references? I am curious why not just this:

    type PodGroupSpec struct {
        // name for CompositePodGroupName
        ParentRef string
    }
Hmm, I don't think we actually have an example of a multi-namespace true workload, do we?
Wrt. having just a string field instead of a dedicated type - good question. I thought about having a struct instead because it would be easier to extend in the future (which is difficult to predict :)).
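The extensibility argument is the usual one: a struct-typed reference leaves room for future fields without a breaking change for clients. A sketch with hypothetical field names:

```go
package main

import "fmt"

// ParentReference is the struct-shaped alternative to a bare ParentRef
// string. Only Name exists today; a Namespace (or group/kind) field could be
// added later without invalidating existing manifests.
type ParentReference struct {
	// Name of the parent CompositePodGroup, assumed to live in the same
	// namespace as the PodGroup.
	Name string
}

// PodGroupSpec is a minimal stand-in showing where the reference would sit.
type PodGroupSpec struct {
	ParentRef *ParentReference
}

func main() {
	spec := PodGroupSpec{ParentRef: &ParentReference{Name: "workers"}}
	fmt.Println(spec.ParentRef.Name) // workers
}
```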
```go
// SchedulingPolicy defines the scheduling policy for this instance of the CompositePodGroup.
// Controllers are expected to fill this field by copying it from a CompositePodGroupTemplate.
// This field is immutable.
//
// +required
SchedulingPolicy CompositePodGroupSchedulingPolicy
```
If a PodGroup has a ParentRef to the composite, does it have the same spec values as defined in CompositePodGroupSpec (e.g. minCount)?
I'm not sure I follow - a CompositePodGroup (CPG) can have a different scheduling policy than its children PodGroups (PGs), e.g. a basic policy for the CPG and a gang policy for all children PGs.
What's the plan for rescheduling, especially for heterogeneous multi-layered pod groups? Essentially, this has the same scheduling difficulty as a heterogeneous pod group - even if the scheduler cannot find a placement, it might be just because of the scheduling order. For now we are saying "we do not support workloads with various pods nicely", but I need to bring this up again.
The question here is: what are we going to say about heterogeneous multi-layered pod groups? The first step would likely be the same - just say "we do not support such heterogeneous ones nicely right now". But what should we do in the long term? I am asking because I imagine "a composite pod group with various pod groups" would be a more common scenario than "one pod group with various pods". So we cannot just keep saying "we do not support heterogeneous multi-layered pod groups" - we need a good path for evolving the scheduling to support them nicely.
Force-pushed a9fd9f3 to 6ac4893
One-line PR description: Enable support for hierarchical scheduling requirements in Kubernetes scheduler
Issue link: CompositePodGroup API #6012