# spicedb
j
Hi team, I found an issue when doing tests that could explain why we have unexplained latencies in requests. TLDR - The DNS resolution delay time is typically 30 seconds and the client has to wait for that before acting. The server is closing connections due to `data max age`, which seems to be based on max connection time (30s by default, and I haven't configured this at all). This means that if we recently tried to connect and that connection was then closed, we have to wait up to 30 seconds to retry. That matches the latency spikes, but doesn't explain why a connection might be closed by the server before 30 seconds is up.
https://cdn.discordapp.com/attachments/844600078948630559/1357228089624625182/image.png?ex=67ef70a8&is=67ee1f28&hm=9feb1b11aa888acab94846c05fa31f9c6002b89d4f58cf244d3353c2871e20f9&
https://cdn.discordapp.com/attachments/844600078948630559/1357228089897390121/image.png?ex=67ef70a8&is=67ee1f28&hm=d0c62906329c5ca80e0181747eafc8b6c565cd2e221c419a092b8814a1282766&
y
> The DNS resolution delay time is typically 30 seconds

meaning that it's taking 30s to resolve DNS?
j
meaning that if you had a successful DNS resolution recently, gRPC clients wait at least another 30 seconds from the last successful resolution before resolving DNS again
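If that 30-second floor is really what's biting us, grpc-js has a channel option for it. A minimal sketch, assuming the installed `@grpc/grpc-js` version supports `grpc.dns_min_time_between_resolutions_ms` and that the client constructor we use lets channel options pass through:

```ts
import * as grpc from '@grpc/grpc-js';

// Channel options to pass to whatever SpiceDB client you construct.
// 'grpc.dns_min_time_between_resolutions_ms' is the cooldown between DNS
// re-resolutions enforced by the client; lowering it would let the channel pick
// up fresh addresses sooner after a GOAWAY instead of waiting out the full 30s.
const channelOptions: grpc.ChannelOptions = {
  'grpc.dns_min_time_between_resolutions_ms': 5_000,
};
```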
y
that's... odd.
if you're in a k8s environment i'd recommend using kuberesolver's resolution instead of dns
nvm
it's a golang lib
this behavior is likely what you want - essentially watching kube's APIs instead of waiting for DNS resolution to happen
but lemme think on whether there's a more direct workaround
so is the issue that a client will think it has no nodes that it can talk to?
or is it resolving to an IP, trying to hit it, and then timing out?
v
kuberesolver is used for dispatches. Are we talking about issues on dispatch, or in client applications talking to SpiceDB?
j
@yetitwo I am in a k8s env if that helps
y
it sounds like they're hitting an edge behavior of the dns resolver in grpc-node
j
@vroldanbet This is a client application maintaining its access to SpiceDB (I am using port 50051 for client connections). The client application sits within the k8s cluster and talks to SpiceDB using service discovery
y
so if they had access to something similar to `kuberesolver` but that was compatible with `grpc-node`, they could use it to keep track of the list of available nodes in the client application
or at least that's my thought
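Here's a rough sketch of that idea, assuming grpc-js's `ipv4:` multi-address target form behaves the way it does in core gRPC: hand the channel an explicit list of pod IPs (which a kuberesolver-like component would fetch from the Kubernetes API) instead of relying on DNS. The `fetchSpiceDbPodIps` helper and the addresses are hypothetical placeholders, not a drop-in implementation.

```ts
import * as grpc from '@grpc/grpc-js';

// Hypothetical helper: a real implementation would watch the Kubernetes
// Endpoints API for the SpiceDB service, the way kuberesolver does in Go.
async function fetchSpiceDbPodIps(): Promise<string[]> {
  return ['10.0.1.5', '10.0.2.7', '10.0.3.9']; // placeholder pod IPs
}

async function buildClient(): Promise<grpc.Client> {
  const ips = await fetchSpiceDbPodIps();
  // The ip resolver accepts a comma-separated list of addresses, so the channel
  // skips DNS entirely and balances across the listed pods.
  const target = 'ipv4:' + ips.map(ip => `${ip}:50051`).join(',');
  return new grpc.Client(target, grpc.credentials.createInsecure(), {
    'grpc.service_config': JSON.stringify({
      loadBalancingConfig: [{ round_robin: {} }],
    }),
  });
}
```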
j
I think in theory CoreDNS in k8s keeps the DNS records up to date? I don't think it has an issue with resolving nodes
the delay seems to be that (1) there was a recent DNS resolution, and (2) the GOAWAY termination code sent by the server necessitates a new DNS resolution, so if the GOAWAY is sent too quickly/early, we still need to wait out the 30-second cooldown. Does that make sense?
Another question is: how does SpiceDB decide when to terminate the connection? If I have 3 separate authzed node gRPC clients for a given IP address, does it decide to close each of them independently, based on that TCP connection's age (max age being 30 seconds)? Because I am seeing GOAWAY codes more frequently than every 30 seconds
y
it kinda does, but this also isn't behavior that we've encountered before to my knowledge
i don't know, tbh. i don't think we do anything to manually trigger it in our code; i'd expect that it's a behavior of the gRPC server code and its configuration
you can probably mess with this flag:
```
--grpc-max-conn-age duration            how long a connection serving gRPC should be able to live (default 30s)
```
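For example (just a sketch of the tradeoff, not a recommendation): raising it by passing something like `--grpc-max-conn-age 5m` to `spicedb serve` would make the server send GOAWAYs far less often, at the cost of long-lived connections rebalancing across pods less frequently.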
j
Another thought: is it wrong to have 3 SpiceDB clients on the same service/IP calling authzed, instead of reusing one? I wonder if SpiceDB decides to close transport channels based on the IP address; I certainly have more than one client instance instantiated per service. And if they connect at different times, will SpiceDB close all of them together?
v
There are a couple of useful gRPC docs that can shed some light:
- Connection lifecycle: https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
- Client-side load-balancing, which is typically how it behaves by default: https://github.com/grpc/grpc/blob/master/doc/load-balancing.md

The client will do client-side load-balancing, which means it keeps a list of backend servers to talk to.
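As a hedged illustration of that last point: grpc-js defaults to `pick_first`, so to actually spread RPCs across every resolved SpiceDB address you typically opt into `round_robin` via the service config channel option. The service name below is a placeholder, and this assumes your client constructor forwards grpc-js channel options:

```ts
import * as grpc from '@grpc/grpc-js';

// Ask the channel to resolve the service via DNS and round-robin RPCs across all
// resolved backend addresses instead of pinning to the first one.
const client = new grpc.Client(
  'dns:///spicedb.default.svc.cluster.local:50051', // placeholder in-cluster name
  grpc.credentials.createInsecure(),
  {
    'grpc.service_config': JSON.stringify({
      loadBalancingConfig: [{ round_robin: {} }],
    }),
  }
);
```

(For the DNS name to resolve to individual pod IPs rather than a single ClusterIP, the SpiceDB service typically needs to be headless.)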
The GOAWAYs sent by the server are typically governed by the server-side connection parameters: https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters
The SpiceDB server sets `MaxConnectionAge` [by default to 30 seconds](https://github.com/authzed/spicedb/blob/e5323c57d11652c147bb069a25cefe58a20bcaac/pkg/cmd/util/util.go#L72). That's what `data max age` means. This is what the grpc-go docs say:

> // MaxConnectionAge is a duration for the maximum amount of time a
> // connection may exist before it will be closed by sending a GoAway. A
> // random jitter of +/-10% will be added to MaxConnectionAge to spread out
> // connection storms.

It means that gRPC connections in the server don't live for longer than 30 seconds. When they reach 30 seconds plus some jitter, the server will send `GOAWAY`.

Now, that does not necessarily explain why your client is waiting 30 seconds. This is what the connectivity semantics say:

> IDLE: This is the state where the channel is not even trying to create a connection because of a lack of new or pending RPCs. New RPCs MAY be created in this state. Any attempt to start an RPC on the channel will push the channel out of this state to connecting. When there has been no RPC activity on a channel for a specified IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this period, channels that are READY or CONNECTING switch to IDLE. Additionally, channels that receive a GOAWAY when there are no active or pending RPCs should also switch to IDLE to avoid connection overload at servers that are attempting to shed connections. We will use a default IDLE_TIMEOUT of 300 seconds (5 minutes).
Based on the logs, it seems this is what's happening: your client receives a GOAWAY, and the channel moves to IDLE.

Then we can see the channel trying to move to the `CONNECTING` state. This is also where the resolver kicks in. The resolver's responsibility is:

> receiving updated configuration and list of server addresses from the resolver
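If it helps to confirm that sequence from the client side, here's a minimal sketch (plain grpc-js, placeholder target and credentials) that logs every channel state transition so it can be lined up against the server's GOAWAYs:

```ts
import * as grpc from '@grpc/grpc-js';

// Plain grpc-js client pointed at a placeholder in-cluster target, used only to
// observe the channel; swap in your real client and credentials.
const client = new grpc.Client(
  'dns:///spicedb.default.svc.cluster.local:50051',
  grpc.credentials.createInsecure()
);
const channel = client.getChannel();

function watchState(current: grpc.connectivityState): void {
  // Wait up to 5 minutes for the next state transition, log it, and keep watching.
  channel.watchConnectivityState(current, Date.now() + 5 * 60 * 1000, err => {
    const next = channel.getConnectivityState(false);
    if (!err) {
      console.log(`channel: ${grpc.connectivityState[current]} -> ${grpc.connectivityState[next]}`);
    }
    watchState(next); // re-arm whether the deadline expired or the state changed
  });
}

// true = nudge an IDLE channel into CONNECTING so there is something to watch
watchState(channel.getConnectivityState(true));
```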
Now, while I don't recall the details with precision, I've certainly had tons of issues in past jobs with CoreDNS and cached DNS entries. In a dynamic environment like Kube, I wouldn't be surprised if this is a client-side resolver holding a stale list of IPs
If I understood correctly, you are using the node client. I've [found multiple environment variables](https://github.com/grpc/grpc-node/blob/c4580fa80b4133ac8ea61b55bdea99485412828b/doc/environment_variables.md?plain=1#L33) that can be set to troubleshoot connection issues (there's an example command right after the list):

> * GRPC_NODE_TRACE, GRPC_TRACE
>   A comma separated list of tracers that provide additional insight into how
>   grpc-js is processing requests via debug logs. Available tracers include:
>   - `call_stream` - Traces client request internals
>   - `channel` - Traces channel events
>   - `connectivity_state` - Traces channel connectivity state changes
>   - `dns_resolver` - Traces DNS resolution
>   - `ip_resolver` - Traces IPv4/v6 resolution
>   - `pick_first` - Traces the pick first load balancing policy
>   - `proxy` - Traces proxy operations
>   - `resolving_load_balancer` - Traces the resolving load balancer
>   - `round_robin` - Traces the round robin load balancing policy
>   - `server` - Traces high-level server events
>   - `server_call` - Traces server handling of individual requests
>   - `subchannel` - Traces subchannel connectivity state and errors
>   - `subchannel_refcount` - Traces subchannel refcount changes. Includes per-call logs.
>   - `subchannel_flowctrl` - Traces HTTP/2 flow control. Includes per-call logs.
>   - `subchannel_internals` - Traces HTTP/2 session state. Includes per-call logs.
>   - `channel_stacktrace` - Traces channel construction events with stack traces.
>   - `keepalive` - Traces gRPC keepalive pings
>   - `outlier_detection` - Traces outlier detection events
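For example (the entrypoint path is just a placeholder), running the client with something like the following should dump the resolver and connectivity activity around each GOAWAY; `GRPC_VERBOSITY=DEBUG` is typically needed for the trace output to actually be printed:

```
GRPC_VERBOSITY=DEBUG GRPC_TRACE=dns_resolver,connectivity_state,subchannel,keepalive node dist/app.js
```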
Hope this helps troubleshoot the issue
y
ideally you want a single client and to reuse that client
it's less overhead and the gRPC client does a really good job of multiplexing and balancing over the HTTP connections it has
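a minimal sketch of that, based on the @authzed/authzed-node README as I remember it (the endpoint, token source, and `ClientSecurity` mode are placeholders; adjust to however you configure TLS in-cluster):

```ts
// spicedb-client.ts - one shared SpiceDB client for the whole service.
import { v1 } from '@authzed/authzed-node';

// Constructed once at module load; every importer reuses the same instance, so
// all RPCs multiplex over the channels this client already holds.
export const spicedb = v1.NewClient(
  process.env.SPICEDB_TOKEN ?? '',                  // placeholder token source
  'spicedb.default.svc.cluster.local:50051',        // placeholder in-cluster endpoint
  v1.ClientSecurity.INSECURE_PLAINTEXT_CREDENTIALS  // placeholder; match your TLS setup
);
```

request handlers would then import `spicedb` and call it directly, rather than constructing a new client (and a new set of connections) per request or per module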
j
caching should have been fine; I noticed that there were no SpiceDB instances being spun up, scaled down, or even evicted around some of the long traces...
but upgrading the grpc-js and authzed node client versions fixed it somehow