# spicedb
j
Hi team, I found an issue when doing tests that could explain why we have unexplained latencies in requests. TLDR - The DNS resolution delay time is typically 30 seconds and the client has to wait for that before acting. The server is closing connections due to `data max age`, which seems to be based on max connection time (30s by default, and I haven't configured this at all). This means that if we recently tried to connect and that connection was then closed, we have to wait up to 30 seconds to retry. That matches the latency spikes, but doesn't explain why a connection might be closed by the server before 30 seconds is up.
https://cdn.discordapp.com/attachments/844600078948630559/1357228089624625182/image.png?ex=67ef70a8&is=67ee1f28&hm=9feb1b11aa888acab94846c05fa31f9c6002b89d4f58cf244d3353c2871e20f9&
https://cdn.discordapp.com/attachments/844600078948630559/1357228089897390121/image.png?ex=67ef70a8&is=67ee1f28&hm=d0c62906329c5ca80e0181747eafc8b6c565cd2e221c419a092b8814a1282766&
y
> The DNS resolution delay time is typically 30 seconds

meaning that it's taking 30s to resolve DNS?
j
meaning that if you had a successful DNS resolution recently, gRPC clients wait at least another 30 seconds from the last successful resolution before resolving DNS again
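If that 30-second floor is really what's biting us, grpc-js has a channel option for it. A minimal sketch, assuming the installed `@grpc/grpc-js` version supports `grpc.dns_min_time_between_resolutions_ms` and that the client constructor we use lets channel options pass through:

```ts
import * as grpc from '@grpc/grpc-js';

// Channel options to pass to whatever SpiceDB client you construct.
// 'grpc.dns_min_time_between_resolutions_ms' is the cooldown between DNS
// re-resolutions enforced by the client; lowering it would let the channel pick
// up fresh addresses sooner after a GOAWAY instead of waiting out the full 30s.
const channelOptions: grpc.ChannelOptions = {
  'grpc.dns_min_time_between_resolutions_ms': 5_000,
};
```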
y
that's... odd.
if you're in a k8s environment i'd recommend using kuberesolver's resolution instead of dns
nvm
it's a golang lib
this behavior is likely what you want - essentially watching kube's APIs instead of waiting for DNS resolution to happen
but lemme think on whether there's a more direct workaround
so is the issue that a client will think it has no nodes that it can talk to?
or is it resolving to an IP, trying to hit it, and then timing out?
v
kuberesolver is used for dispatches. Are we talking about issues on dispatch, or in client applications talking to SpiceDB?
j
@yetitwo I am in a k8s env if that helps
y
it sounds like they're hitting an edge behavior of the dns resolver in grpc-node
j
@vroldanbet This is a client application maintaining its access to SpiceDB (I am using port 50051 for client connections). The client application sits within the k8s cluster and talks to SpiceDB using service discovery
y
so if they had access to something similar to `kuberesolver` but that was compatible with `grpc-node`, they could use it to keep track of the list of available nodes in the client application
or at least that's my thought
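Here's a rough sketch of that idea, assuming grpc-js's `ipv4:` multi-address target form behaves the way it does in core gRPC: hand the channel an explicit list of pod IPs (which a kuberesolver-like component would fetch from the Kubernetes API) instead of relying on DNS. The `fetchSpiceDbPodIps` helper and the addresses are hypothetical placeholders, not a drop-in implementation.

```ts
import * as grpc from '@grpc/grpc-js';

// Hypothetical helper: a real implementation would watch the Kubernetes
// Endpoints API for the SpiceDB service, the way kuberesolver does in Go.
async function fetchSpiceDbPodIps(): Promise<string[]> {
  return ['10.0.1.5', '10.0.2.7', '10.0.3.9']; // placeholder pod IPs
}

async function buildClient(): Promise<grpc.Client> {
  const ips = await fetchSpiceDbPodIps();
  // The ip resolver accepts a comma-separated list of addresses, so the channel
  // skips DNS entirely and balances across the listed pods.
  const target = 'ipv4:' + ips.map(ip => `${ip}:50051`).join(',');
  return new grpc.Client(target, grpc.credentials.createInsecure(), {
    'grpc.service_config': JSON.stringify({
      loadBalancingConfig: [{ round_robin: {} }],
    }),
  });
}
```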
j
I think in theory CoreDNS in k8s keeps the DNS records up to date? I don't think it has an issue with resolving nodes
the delay seems to be that (1) there was a recent DNS resolution, and (2) the GOAWAY termination code sent by the server necessitates a new DNS resolution, so if the GOAWAY is sent too quickly/early, we still need to wait out the 30-second cooldown. Does that make sense?
Another question is: how does SpiceDB decide when to terminate the connection? If I have 3 separate authzed node gRPC clients for a given IP address, does it decide to close each of them independently, based on that TCP connection's age (max age being 30 seconds)? Because I am seeing GOAWAY codes more frequently than every 30 seconds
y
it kinda does, but this also isn't behavior that we've encountered before to my knowledge
i don't know, tbh. i don't think we do anything to manually trigger it in our code; i'd expect that it's a behavior of the gRPC server code and its configuration
you can probably mess with this flag:
```
--grpc-max-conn-age duration            how long a connection serving gRPC should be able to live (default 30s)
```
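For example (just a sketch of the tradeoff, not a recommendation): raising it by passing something like `--grpc-max-conn-age 5m` to `spicedb serve` would make the server send GOAWAYs far less often, at the cost of long-lived connections rebalancing across pods less frequently.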
j
Another thought: is it wrong to have 3 SpiceDB clients on the same service/IP calling authzed, instead of reusing one? I wonder if SpiceDB decides to close transport channels based on the IP address; I certainly have more than one client instance instantiated per service. And if they connect at different times, will SpiceDB close all of them together?
v
There are a couple of useful gRPC docs that can shed some light:
- Connection lifecycle: https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md
- Client-side load-balancing, which is typically how it behaves by default: https://github.com/grpc/grpc/blob/master/doc/load-balancing.md

The client will do client-side load-balancing, which means it keeps a list of backend servers to talk to.
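As a hedged illustration of that last point: grpc-js defaults to `pick_first`, so to actually spread RPCs across every resolved SpiceDB address you typically opt into `round_robin` via the service config channel option. The service name below is a placeholder, and this assumes your client constructor forwards grpc-js channel options:

```ts
import * as grpc from '@grpc/grpc-js';

// Ask the channel to resolve the service via DNS and round-robin RPCs across all
// resolved backend addresses instead of pinning to the first one.
const client = new grpc.Client(
  'dns:///spicedb.default.svc.cluster.local:50051', // placeholder in-cluster name
  grpc.credentials.createInsecure(),
  {
    'grpc.service_config': JSON.stringify({
      loadBalancingConfig: [{ round_robin: {} }],
    }),
  }
);
```

(For the DNS name to resolve to individual pod IPs rather than a single ClusterIP, the SpiceDB service typically needs to be headless.)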
The GOAWAYs sent by the server are typically governed by the server-side connection parameters: https://pkg.go.dev/google.golang.org/grpc/keepalive#ServerParameters
The SpiceDB server sets `MaxConnectionAge` [by default to 30 seconds](https://github.com/authzed/spicedb/blob/e5323c57d11652c147bb069a25cefe58a20bcaac/pkg/cmd/util/util.go#L72). That's what `data max age` means. This is what the grpc-go docs say:

> // MaxConnectionAge is a duration for the maximum amount of time a
> // connection may exist before it will be closed by sending a GoAway. A
> // random jitter of +/-10% will be added to MaxConnectionAge to spread out
> // connection storms.

It means that gRPC connections in the server don't live for longer than 30 seconds. When they reach 30 seconds plus some jitter, the server will send `GOAWAY`.

Now, that does not necessarily explain why your client is waiting 30 seconds. This is what the connectivity semantics say:

> IDLE: This is the state where the channel is not even trying to create a connection because of a lack of new or pending RPCs. New RPCs MAY be created in this state. Any attempt to start an RPC on the channel will push the channel out of this state to connecting. When there has been no RPC activity on a channel for a specified IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this period, channels that are READY or CONNECTING switch to IDLE. Additionally, channels that receive a GOAWAY when there are no active or pending RPCs should also switch to IDLE to avoid connection overload at servers that are attempting to shed connections. We will use a default IDLE_TIMEOUT of 300 seconds (5 minutes).
Based on the logs, it seems this is what's happening: your client receives a GOAWAY, and the channel moves to IDLE.

Then we can see the channel trying to move to the `CONNECTING` state. This is also where the resolver kicks in. The resolver's responsibility is:

> receiving updated configuration and list of server addresses from the resolver
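If it helps to confirm that sequence from the client side, here's a minimal sketch (plain grpc-js, placeholder target and credentials) that logs every channel state transition so it can be lined up against the server's GOAWAYs:

```ts
import * as grpc from '@grpc/grpc-js';

// Plain grpc-js client pointed at a placeholder in-cluster target, used only to
// observe the channel; swap in your real client and credentials.
const client = new grpc.Client(
  'dns:///spicedb.default.svc.cluster.local:50051',
  grpc.credentials.createInsecure()
);
const channel = client.getChannel();

function watchState(current: grpc.connectivityState): void {
  // Wait up to 5 minutes for the next state transition, log it, and keep watching.
  channel.watchConnectivityState(current, Date.now() + 5 * 60 * 1000, err => {
    const next = channel.getConnectivityState(false);
    if (!err) {
      console.log(`channel: ${grpc.connectivityState[current]} -> ${grpc.connectivityState[next]}`);
    }
    watchState(next); // re-arm whether the deadline expired or the state changed
  });
}

// true = nudge an IDLE channel into CONNECTING so there is something to watch
watchState(channel.getConnectivityState(true));
```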
Now, while I don't recall the details with precision, I've certainly had tons of issues in past jobs with CoreDNS and cached DNS entries. In a dynamic environment like Kube, I wouldn't be surprised if this is a client-side resolver holding a stale list of IPs
If I understood correctly, you are using the node client. I've [found multiple environment variables](https://github.com/grpc/grpc-node/blob/c4580fa80b4133ac8ea61b55bdea99485412828b/doc/environment_variables.md?plain=1#L33) that can be set to troubleshoot connection issues (there's an example command right after the list):

> * GRPC_NODE_TRACE, GRPC_TRACE
>   A comma separated list of tracers that provide additional insight into how
>   grpc-js is processing requests via debug logs. Available tracers include:
>   - `call_stream` - Traces client request internals
>   - `channel` - Traces channel events
>   - `connectivity_state` - Traces channel connectivity state changes
>   - `dns_resolver` - Traces DNS resolution
>   - `ip_resolver` - Traces IPv4/v6 resolution
>   - `pick_first` - Traces the pick first load balancing policy
>   - `proxy` - Traces proxy operations
>   - `resolving_load_balancer` - Traces the resolving load balancer
>   - `round_robin` - Traces the round robin load balancing policy
>   - `server` - Traces high-level server events
>   - `server_call` - Traces server handling of individual requests
>   - `subchannel` - Traces subchannel connectivity state and errors
>   - `subchannel_refcount` - Traces subchannel refcount changes. Includes per-call logs.
>   - `subchannel_flowctrl` - Traces HTTP/2 flow control. Includes per-call logs.
>   - `subchannel_internals` - Traces HTTP/2 session state. Includes per-call logs.
>   - `channel_stacktrace` - Traces channel construction events with stack traces.
>   - `keepalive` - Traces gRPC keepalive pings
>   - `outlier_detection` - Traces outlier detection events
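For example (the entrypoint path is just a placeholder), running the client with something like the following should dump the resolver and connectivity activity around each GOAWAY; `GRPC_VERBOSITY=DEBUG` is typically needed for the trace output to actually be printed:

```
GRPC_VERBOSITY=DEBUG GRPC_TRACE=dns_resolver,connectivity_state,subchannel,keepalive node dist/app.js
```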
Hope this helps troubleshoot the issue
y
ideally you want a single client and to reuse that client
it's less overhead and the gRPC client does a really good job of multiplexing and balancing over the HTTP connections it has
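a minimal sketch of that, based on the @authzed/authzed-node README as I remember it (the endpoint, token source, and `ClientSecurity` mode are placeholders; adjust to however you configure TLS in-cluster):

```ts
// spicedb-client.ts - one shared SpiceDB client for the whole service.
import { v1 } from '@authzed/authzed-node';

// Constructed once at module load; every importer reuses the same instance, so
// all RPCs multiplex over the channels this client already holds.
export const spicedb = v1.NewClient(
  process.env.SPICEDB_TOKEN ?? '',                  // placeholder token source
  'spicedb.default.svc.cluster.local:50051',        // placeholder in-cluster endpoint
  v1.ClientSecurity.INSECURE_PLAINTEXT_CREDENTIALS  // placeholder; match your TLS setup
);
```

request handlers would then import `spicedb` and call it directly, rather than constructing a new client (and a new set of connections) per request or per module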
j
caching should have been fine; I noticed that there were no SpiceDB instances being spun up, scaled down, or even evicted around some of the long traces...
but upgrading the grpc-js and authzed node client versions fixed it somehow