KEP-5598: Extend opportunistic batching with rescoring#6039
KEP-5598: Extend opportunistic batching with rescoring#6039romanbaron wants to merge 3 commits intokubernetes:masterfrom
Conversation
|
Hi @romanbaron. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Tip We noticed you've done this a few times! Consider joining the org to skip this step and gain Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
singh1203
left a comment
There was a problem hiding this comment.
Hey @romanbaron, thanks for including me. I went through the full diff carefully. Overall, the rescoring design is clean and well written. I just had one question, and thanks for answering it. 🙇
Overall, it looks good to me.
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: romanbaron, singh1203 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
afbce01 to
38281e1
Compare
eda4b71 to
5734340
Compare
e9c449d to
703870e
Compare
703870e to
8119e15
Compare
|
The PR is now rebased on top of the refactored KEP and only includes Rescore-related changes. |
|
/ok-to-test |
| 2. `Score` is called for each scoring plugin against node A using the current `CycleState`. The | ||
| resulting score accurately reflects pod N's placement. |
There was a problem hiding this comment.
I believe that node states can change over time. For example, I think the following cases could occur:
- Changes in ImageLocality scores due to image GC
- Taints are being added to a node
- Pods terminating
It is possible that these events occur on nodes other than Node A, causing their scores to change. Ideally, scoring should be performed on all nodes, but doing so would negate the benefits of this change.
Is this acceptable?
There was a problem hiding this comment.
I agree with your observation, and I think this can affect not only scoring but also filtering. A taint applied to a node might make that node infeasible for the next pod. That said, this concern also applies to the current design, since cached nodes do not go through filtering on every cycle.
Cache staleness is bounded by maxBatchAge (500ms). This is explained in more detail in the risks and mitigations section here.
In theory, the scheduler snapshot could also have been taken before such a taint was applied, so the same issue can occur without Opportunistic Batching as well, although the risk is probably much lower.
I think this is important to document this concern, but I don’t see how it can be eliminated without removing the Opportunistic Batching optimization entirely.
There was a problem hiding this comment.
Still, the scheduler is only eventually consistent and it can observe any cluster change with some delay. Moreover, kube-scheduler doesn't pick the fully optimal score, because it subsample number of feasible nodes to score. Given than and the small age of batching information, this proposal is fine.
There was a problem hiding this comment.
I posted my comment because I felt this was a potential concern. If the decision has been made to accept the risks, then I think this policy is fine. Thank you.
Co-authored-by: Toru Komatsu <k0ma@utam0k.jp>
7d19290 to
f4cd3a0
Compare
Add rescoring to handle multi-pod-per-node workloads: when the last chosen node remains feasible, rescore it in-place and continue batching rather than flushing the cache.
AI tooling was used to assist in preparing this PR. All changes have been reviewed and verified by the author.