02/16/2022, 12:36 PM
>I think that if we make the node unavailable until it gets an initial dispatch peer list, plus some metrics you could alert on if dispatch isn't able to refresh the peer list for some period of time, that would fix the issue?

@User following up on yesterday's conversation: I thought about the exact same thing! 🎉 Given how important internal dispatch is for the latency goals, it makes sense to surface problems there as early as possible. One challenge is that pods wouldn't be able to become healthy if the Kube API is having trouble, which could lead to cascading failures. So one possible option is to check whether SD is healthy on bootstrap, and if it isn't, fall back to starting with local dispatch.

I also think some metrics around SD and comms with the Kube API would help. Kube operators are generally wary of anything that puts load on the Kube API server, so it would be useful to know things like request count and latencies when talking to it. Generally on board with what you shared!