r
Hi! And first of all thanks for a great product! We’re currently investigating SpiceDB as the authorization system of the future for all our products globally. We are facing a performance issue, and we need some help. What we’re seeing is persistent but intermittent freezes with various clients calling SpiceDB over gRPC. The server is running in 3 different environments hosted in K8s on AWS EKS and at this point we are wondering if you ever heard of this behavior before and have any insights. More info in thread 🧵 Best regards, Robin and team
Observations
- We tried this unofficial dotnet client (https://github.com/JalexSocial/SpiceDb), and the official Node client.
- The dotnet client has been using various versions of the underlying gRPC libraries, for instance Grpc.Core.Api and Grpc.Net.Client in versions ranging from 2.52.0 to 2.62.0.
- The SpiceDB server was originally configured by spicedb-operator. We've tried versions 1.26.0, 1.30.0, 1.30.1 and 1.31.0, all having this issue.
- We added OTEL tracing to see if SpiceDB was seeing the error, but it was not. It rather seems related to the session / transport / network layers.
- Sometimes, usually after restarting the application, we have encountered recurring errors at regular intervals. It seems to happen after a server GOAWAY / client reset handshake, when trying to open a new connection. It often happens every minute (after two 30-second GOAWAY-reset cycles) that we experience a freeze, with the client waiting for a timeout. The client becomes operational again after the timeout.
- On this topic, we've noticed that when using a domain name and DNS to target the server, the client switches from IPv6 to IPv4 when recovering after the timeout.
- When forcing the client onto one of the IP versions, we typically get additional errors between the timeout and recovery, but there are no clear conclusions to draw here.
- Other times we have not experienced any errors at all, despite running several processes concurrently and successfully handling over 100,000 requests over the course of an hour.
- Running two clients concurrently, both having this issue and on the same IPv4/IPv6 version, they don't encounter the issue at the same time, so the connections seem independent.
Questions
- Are there any known issues with similar symptoms? Does this make sense at all?
- Are there any specific settings or configurations we should check in our SpiceDB, K8s or networking setups?
- What could the timeouts we see (20s in dotnet, 30s in Node) correspond to?
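(A minimal sketch of how the freeze window described above can be timestamped from the client side, assuming a plain @grpc/grpc-js channel; the endpoint and insecure credentials are placeholders, and the official Node client may wrap the channel differently.)
```
import * as grpc from '@grpc/grpc-js';

// Placeholder endpoint and credentials; a real setup would use TLS and a token.
const target = 'dns:///spicedb.example.internal:50051';
const channel = new grpc.Channel(target, grpc.credentials.createInsecure(), {});

// Log every connectivity-state transition with a timestamp, so the
// GOAWAY -> reconnect -> freeze -> recover cycle shows up in the output.
function watch(current: grpc.connectivityState): void {
  channel.watchConnectivityState(current, Infinity, () => {
    const next = channel.getConnectivityState(true);
    console.log(
      `${new Date().toISOString()} ` +
      `${grpc.connectivityState[current]} -> ${grpc.connectivityState[next]}`
    );
    watch(next);
  });
}

watch(channel.getConnectivityState(true));
```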
j
are the backing pods getting rescheduled onto different nodes during this time?
i've heard of "long" initial connection times with the java client in the past, but only on the order of single digit seconds, never 30s+
p
Per here, working with Robin. No, the pods stay on the same nodes throughout the issue
j
my guess would be that the way you're connecting is using client-side load balancing with DNS, and when the pods are moving around the DNS is out of date, but that's just a guess based on the fact that it's always DNS 🙂
p
And it seems like the issue is related to the client setting up a new HTTP/2 connection after the GOAWAY. If it gets "stuck" and I start a new client, that one gets a connection up and running directly (while the one that's stuck eventually comes back and works).
j
each client's grpc implementation should have some network tunables that you can play with, maybe you can get it to abandon a connection when something fails earlier
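(For reference, a hedged sketch of the sort of tunables meant here, written as @grpc/grpc-js channel options; the .NET client exposes similar knobs under different names, the option names should be checked against the client version in use, and the values are purely illustrative.)
```
import * as grpc from '@grpc/grpc-js';

const channelOptions: grpc.ChannelOptions = {
  // Send HTTP/2 keepalive pings so a transport that goes silent after a GOAWAY
  // is detected and abandoned sooner than the default TCP-level timeouts.
  'grpc.keepalive_time_ms': 10_000,
  'grpc.keepalive_timeout_ms': 5_000,
  // Bound how long the client backs off before retrying a failed connection.
  'grpc.initial_reconnect_backoff_ms': 1_000,
  'grpc.max_reconnect_backoff_ms': 5_000,
  // Allow DNS to be re-resolved more eagerly when reconnecting.
  'grpc.dns_min_time_between_resolutions_ms': 1_000,
};

const channel = new grpc.Channel(
  'dns:///spicedb.example.internal:50051', // placeholder endpoint
  grpc.credentials.createInsecure(),       // placeholder credentials
  channelOptions,
);
```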
p
We see the same issue both live between our service and SpiceDB (in that case through an AWS load balancer as the ingress), and if we do a local port-forward with kubectl directly against the service in k8s, as well as a port-forward directly against the SpiceDB pod. And we get the same problem on all of those different ways of connecting to SpiceDB
e
Do the dropped connections happen on specific api calls? Is it all Checks, etc? Since you don’t see problems in the SpiceDB logs it’s likely either a client issue like Jake mentioned or a load balancer issue. Since you still see the issue when port forwarding, which will connect directly to a single SpiceDB pod, it’s almost certainly a client config issue (or two separate issues). You can enable additional grpc debug logs with:
```
- name: GRPC_GO_LOG_SEVERITY_LEVEL
  value: info
- name: GRPC_GO_LOG_VERBOSITY_LEVEL
  value: "2"
```
to confirm there is no issue on the SpiceDB side.
Also: do you have any service meshes running? E.g. Istio or friends?
p
We did try with two different clients (.NET and NodeJS) just to rule out differences, and we got slightly different timeouts (presumably different defaults) but the same behavior. The gRPC debug logs only showed the GOAWAY happening, but didn't say anything about why the new connection didn't happen. No, no service meshes running here (and those should not be involved when I port-forward directly into the pod anyway, right?)
I do have an interesting Wireshark session (running insecure so I can see the contents), and there's something a bit interesting I see when the issue happens. I'll write a complete follow-up later, but basically I see something like this happening (the number in brackets represents a specific TCP connection between the client and server):
* [1] ... lots of CheckPermissions working just fine...
* [1] Client gets a GOAWAY from the server
* [1] Client sends a FIN to close the TCP connection
* [1] Server ACKs the FIN
* [2] Client opens a new connection
* [2] Server ACKs the connection
* [2] Client initiates an HTTP/2 session, including a frame with the CheckPermission call
* [2] Server ACKs the call, but just seems to not do anything (nothing ever comes back on this connection)
* [1] Server sends RST, ACK to the client
* [2] After 15s: Client sends a TCP keep-alive that the server ACKs
* [2] Server sends an RST, ACK to the client
* [3] After an additional 5 seconds (20 seconds after the GOAWAY and FIN on connection #1): a new connection is opened by the client, and now the communication works again and the CheckPermission call is answered by the server.
It feels like the new connection #2 is correctly setup on the TCP level, but the server does not correctly respond to the HTTP/2 handshake
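(A hedged sketch of a client-side guard for exactly this state: a per-call deadline bounds how long a CheckPermission RPC can sit on a half-dead connection. `client.checkPermission` is a stand-in for whatever SpiceDB client stub is in use; only the CallOptions shape is @grpc/grpc-js.)
```
import * as grpc from '@grpc/grpc-js';

// Fail the RPC after 2 seconds instead of waiting out the 15-20s observed above.
const callOptions: grpc.CallOptions = {
  deadline: new Date(Date.now() + 2_000),
};

// Hypothetical stub call; the surrounding client object is not shown here.
// client.checkPermission(request, callOptions, (err, resp) => { /* retry or surface the error */ });
```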
I saw that the grpc-go library was updated in 1.32.0 (there's an issue in the grpc-go GitHub repo that at least sounded like it could be related and should have been fixed in the version you update to in 1.32.0), so I tested with that as well, but still got the same issue.
r
Any chance of a comment on Per's TCP breakdown? @Jake @ecordell? 🙏
p
Some update on this! It turned out to be a configuration error on our side, sorry about that. Apparently the port-forwards with kubectl created a symptom that looked exactly like the problem we had in the real environment (including the weird TCP session breakdown above), which is why we went deep into a side-track that was not relevant to the issue. The real problem was actually in our AWS NLB load balancer setup. One of the three IPs was not reachable because cross-availability-zone access wasn't enabled in the load balancer and we had only deployed SpiceDB into two availability zones. So whenever the client got a GOAWAY from the server (every 30s), it would do a DNS round-robin to reconnect, and sometimes ended up on that non-working load balancer IP, which resulted in a 20s delay before it tried another IP.
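(A hedged sketch of one possible client-side mitigation for this failure shape, assuming @grpc/grpc-js; the actual fix here was enabling cross-AZ access on the NLB itself, and the option name and service-config support should be verified against the grpc-js version in use.)
```
import * as grpc from '@grpc/grpc-js';

// round_robin keeps a subchannel to every address the resolver returns, so one
// unreachable load-balancer IP doesn't block RPCs for the whole connect timeout
// the way the default pick-first behavior can.
const channel = new grpc.Channel(
  'dns:///spicedb.example.internal:50051', // placeholder endpoint
  grpc.credentials.createInsecure(),       // placeholder credentials
  {
    'grpc.service_config': JSON.stringify({
      loadBalancingConfig: [{ round_robin: {} }],
    }),
  },
);
```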