
Perseus

03/14/2023, 5:52 AM
Continuing off this thread ^ I'm running into a weird problem while exposing SpiceDB behind an ALB on AWS - the ALB has an 'idle connection timeout' setting, which is 60s by default. My request flow is: [External Service] -> [ALB] -> SpiceDB Pod (EKS). What I noticed after setting up the ALB was that some requests on the external service were taking as long as the total idle connection timeout. By default that value is 60s, so some requests would take 60s to resolve and then get timed out. I increased it to 600s, and then those requests started taking 600s. Exposing SpiceDB through an ELB (EKS allocates an ELB by default for any LoadBalancer service) works fine. Not sure if anyone has experienced this (my assumption right now is that it's some odd gRPC behavior with ALBs)
This is the response time graph of the external service interacting with SpiceDB. The flattening at the end is when I switched back from ALB to ELB
It pretty much looked like every HTTP/2 connection would eventually run into this problem until a new one was established

vroldanbet

03/14/2023, 9:25 AM
I suspect connections may not be getting drained gracefully on either side of the load balancer. Your application could be attempting to use a connection that has since been closed by the load balancer. Are you using authzed's Go client? Have you configured your client application's connections to have a lifetime < 60s? This may also need some tweaking on the SpiceDB side, as I'm not sure we expose it - see https://pkg.go.dev/google.golang.org/grpc@v1.53.0/keepalive#ServerParameters
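A rough sketch of the client-side keepalive knobs this points at, using @grpc/grpc-js channel options; whether and how @authzed/authzed-node forwards these to the underlying channel depends on the client version, so treat the wiring as an assumption:

```typescript
// Sketch only: grpc-js channel options that keep the client's HTTP/2 connection
// from ever looking idle to the ALB, and that detect silently-dropped connections.
// Values are illustrative; how these get passed into the SpiceDB Node client is
// an assumption that depends on your @authzed/authzed-node version.
import { ChannelOptions } from '@grpc/grpc-js';

const channelOptions: ChannelOptions = {
  // Send an HTTP/2 keepalive PING every 30s, comfortably below the ALB's 60s
  // idle connection timeout, so the ALB never reaps the connection as idle.
  'grpc.keepalive_time_ms': 30_000,
  // If a PING isn't acknowledged within 5s, treat the connection as dead and
  // reconnect instead of letting calls hang until their deadline.
  'grpc.keepalive_timeout_ms': 5_000,
  // Keep pinging even when no RPCs are in flight.
  'grpc.keepalive_permit_without_calls': 1,
};

export default channelOptions;
```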

Perseus

03/14/2023, 10:01 AM
I'm using nodejs, so the node client
I haven't done any tweaking on the client gRPC settings - all defaults

vroldanbet

03/14/2023, 10:17 AM
I'd suggest starting with forcing a max connection lifetime on the client of, say, 59s, while keeping the ALB idle connection timeout at 60s. If the problem persists, then we need to look into the other side of the LB - SpiceDB
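As far as I know grpc-js doesn't expose a client-side max-connection-age option directly, so one blunt way to try the 59s idea is to rotate the client before the ALB's idle timeout can hit the connection. A sketch, with illustrative names (buildClient, getSpiceDBClient) and a placeholder endpoint/token:

```typescript
// Sketch of forcing a max connection lifetime below the ALB's 60s idle timeout
// by recreating the SpiceDB client. The endpoint, token handling, and skipping
// an explicit close() on the old client are all simplifications.
import { v1 } from '@authzed/authzed-node';

const MAX_CONNECTION_LIFETIME_MS = 59_000; // just under the ALB's 60s idle timeout

function buildClient() {
  // Placeholder endpoint and env-var token.
  return v1.NewClient(process.env.SPICEDB_TOKEN!, 'spicedb.example.com:443');
}

let client = buildClient();
let createdAt = Date.now();

// Hand out the current client, replacing it once it has lived ~59s so no single
// HTTP/2 connection outlives the ALB's idle timeout.
export function getSpiceDBClient() {
  if (Date.now() - createdAt > MAX_CONNECTION_LIFETIME_MS) {
    client = buildClient();
    createdAt = Date.now();
  }
  return client;
}
```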

yetitwo

03/14/2023, 1:53 PM
also in my experience ALBs support gRPC, but only just, and gRPC isn't designed to be run through a load balancer
a gRPC client wants to know about all of the nodes associated with a service because it does client-side load balancing
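For reference, the client-side load balancing described here looks roughly like this with grpc-js inside the cluster: point the client at a Kubernetes headless service (so DNS returns one record per SpiceDB pod) and switch the policy to round_robin. The service name below is made up:

```typescript
// Sketch: let the gRPC client see every SpiceDB pod and balance across them,
// instead of funnelling all traffic through a single connection to a load balancer.
import { ChannelOptions } from '@grpc/grpc-js';

// Channel options to pass when constructing the client (however your client
// library exposes them).
const channelOptions: ChannelOptions = {
  // Spread RPCs across all resolved addresses rather than the default
  // pick_first behaviour of pinning everything to one connection.
  'grpc.service_config': JSON.stringify({
    loadBalancingConfig: [{ round_robin: {} }],
  }),
};

// A dns:/// target against a headless service resolves to every pod IP,
// so no load balancer sits in the gRPC path at all.
const target = 'dns:///spicedb.spicedb.svc.cluster.local:50051';

export { channelOptions, target };
```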

Perseus

03/21/2023, 11:19 AM
managed to get request ids propagating from my application to SpiceDB - found that the requests weren't actually making their way to SpiceDB, so it's something between the application and the ALB

vroldanbet

03/21/2023, 11:20 AM
good tracing is here to help 😄

Perseus

03/21/2023, 11:21 AM
haha yeah, I think it'd be helpful to document how to propagate those ids as well - I looked through the SpiceDB source code to figure out that I need to be adding that data into the Metadata for each request - new to gRPC, so I wasn't aware of this
but no luck figuring out what it is between service -> ALB yet. I added 5s deadlines to each call, so now it gives a DEADLINE_EXCEEDED error for a few calls after 5s, but I'd like for this to not happen at all. Something to do with connection pooling or the like, I'd guess
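For anyone landing here later, this is roughly what attaching a request id and a 5s deadline looks like with the Node client. The 'x-request-id' metadata key and the (request, metadata, options, callback) call shape are what I'd expect from SpiceDB's middleware and the standard grpc-js generated client, but verify both against the source and your client version:

```typescript
// Sketch: propagate a request id to SpiceDB via gRPC metadata and cap each call
// at 5s. The metadata key, endpoint, and exact call signature are assumptions.
import { Metadata } from '@grpc/grpc-js';
import { v1 } from '@authzed/authzed-node';
import { randomUUID } from 'node:crypto';

const client = v1.NewClient(process.env.SPICEDB_TOKEN!, 'spicedb.example.com:443');

const metadata = new Metadata();
metadata.set('x-request-id', randomUUID()); // shows up in SpiceDB logs/traces

const request = v1.CheckPermissionRequest.create({
  resource: v1.ObjectReference.create({ objectType: 'document', objectId: 'doc1' }),
  permission: 'view',
  subject: v1.SubjectReference.create({
    object: v1.ObjectReference.create({ objectType: 'user', objectId: 'alice' }),
  }),
});

client.checkPermission(
  request,
  metadata,
  { deadline: new Date(Date.now() + 5_000) }, // surfaces DEADLINE_EXCEEDED after 5s
  (err, response) => {
    if (err) {
      console.error('check failed:', err.code, err.details);
      return;
    }
    console.log('permissionship:', response?.permissionship);
  },
);
```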

vroldanbet

03/21/2023, 12:54 PM
this is a good point, I'll open an issue in the docs repository to make this clearer
my best guess is this is related to stale connections in the pool caused by connection draining. The ALB, just like any service, will prune connections after a given lifetime. This is necessary in order to be able to perform operations on the ALB (think deploying a new version of the ALB). The reverse proxy terminates its side of the TCP connection, but the client doesn't notice, and the moment it goes to pick up the connection and use it, it's unusable. Perhaps the gRPC client is not able to surface this properly and instead returns a deadline error. A potential exercise would be to look into a way to get the connection pool to evict connections after a given lifetime. You could set it to something very low, say, 30s, and see if the problem continues