# spicedb
b
Anyone running SpiceDB behind AWS ALB getting a `14` (unavailable) status code from the ALB intermittently under load? Doesn't seem to be coming from SpiceDB itself afaict - no errors in the logs and no Target 5xx responses, but plenty of ELB 502s
v
We ran some load tests back in the day with ALBs and also had some inexplicable error rates we didn't investigate further; it was just an experiment. Is it possible it's happening when a SpiceDB node rolls? Have you tried killing a pod and seeing if it lines up?
b
Didn't seem to be a correlation with scaling
v
not scaling. The hypothesis I mentioned was if there is an issue with connection draining. Try killing a pod.
b
I mean the instances were stable at the time, nothing dying, being taken out of service, being brought online etc
v
any insight into the types of requests that failed?
b
But I can try that. Some of AWS's documentation suggests it could indicate something related to keepalives, but I don't know how relevant that is for gRPC / HTTP/2
They're all WriteRelationships
b
And I have seen the occasional transaction rollback, but only very infrequently compared to the number of times I'm seeing this
v
and are you able to see the request coming through into SpiceDB and succeeding? and it's only the ALB that returns a 502?
b
I think I will need to dig into access logs, according to the metrics there were no errors from SpiceDB, all from ALB
The majority of requests are fine
v
it's also interesting it's only WriteRelationships requests. I wonder if some timeout is being hit?
are you writing large numbers of relationships?
b
Yes, although max of about 5 per WriteRelationships call
v
well that shouldn't be an issue
and you haven't seen it with any other API method?
y
@Ben Simpson my understanding is that ALBs only kind of support gRPC
and gRPC really isn't meant to be run through a load balancer
the gRPC client wants to have a full list of the nodes that it can talk to and then do client-side load balancing
we were running through an ALB and having problems, and we switched to a slightly janky setup that uses CloudMap for service discovery and otherwise lets gRPC talk directly to SpiceDB nodes and have gotten better behavior out of our gRPC clients
but i'd really recommend moving to running SpiceDB in k8s - it enables better horizontal dispatch and gRPC clients are very happy in a k8s context
v
At least grpc-go gives you some headroom to change the load balancing strategies
y
all of them do, i think
FWIW we do use load balancers in our managed infrastructure, they're just not L7s
This also provides good insight on the recommendations around load balancing gRPC: https://grpc.io/blog/grpc-load-balancing/
and finally FWIW, we've seen better balancing of requests when using the `round_robin` strategy versus `pick_first`
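If it helps, in grpc-go the strategy can be selected via a service config passed at dial time with `grpc.WithDefaultServiceConfig`. A minimal sketch of just the service-config JSON (the surrounding dial code is omitted):

```json
{"loadBalancingConfig": [{"round_robin": {}}]}
```

With `round_robin`, the client spreads RPCs across all resolved addresses instead of pinning everything to the first address that connects, which is what `pick_first` does.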
there is nothing conceptually wrong with using an L7 load balancer with gRPC; it's just suboptimal, but it should work, so I suspect it's related to ALB configuration
b
We do have the ALB set to `grpc` mode
From the access logs it seems like we're running into this scenario
```
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target

The load balancer receives a request and forwards it to the target. The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer. Make sure that the duration of the keep-alive timeout is greater than the idle timeout value.

Check the values for the request_processing_time, target_processing_time and response_processing_time fields.
```
y
yeah we had ours in gRPC mode as well and still had issues
those issues went away when we got rid of the ALB
v
@Ben Simpson these are the server-side timeout and keepalive configuration parameters: https://pkg.go.dev/google.golang.org/grpc/keepalive?utm_source=godoc#ServerParameters The default server-side keep-alive in grpc-go is 2h. SpiceDB lets you configure the connection max-age, which is set by default to 30 seconds, see https://github.com/authzed/spicedb/blob/d9ae77692f09c42a89c38d02df2aa7b6ba62eeb5/pkg/cmd/util/util.go#L68
```
--grpc-max-conn-age duration       how long a connection serving gRPC should be able to live (default 30s)
```
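For reference, the flag is passed when starting the server; a sketch (other required flags, e.g. datastore configuration, are omitted, and the 1h value is purely illustrative):

```shell
# Hypothetical invocation; raising --grpc-max-conn-age makes SpiceDB
# close its gRPC connections less often.
spicedb serve --grpc-max-conn-age 1h
```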
I checked grpc-go gracefully closes connections:
```
// keepalive running in a separate goroutine does the following:
// 1. Gracefully closes an idle connection after a duration of keepalive.MaxConnectionIdle.
// 2. Gracefully closes any connection after a duration of keepalive.MaxConnectionAge.
// 3. Forcibly closes a connection after an additive period of keepalive.MaxConnectionAgeGrace over keepalive.MaxConnectionAge.
// 4. Makes sure a connection is alive by sending pings with a frequency of keepalive.Time and closes a non-responsive connection
// after an additional duration of keepalive.Timeout.
```
There is in fact a grace period after a connection reaches "max-age", and this grace period is set to infinity, so the server will technically wait indefinitely to drain the connection. So from my perspective, what I think could be happening is that SpiceDB closes a connection, but the ALB does not know the connection is closed and keeps sending requests over it. Ideally the ALB would allow configuring a connection max-age too, so you could set it to <30 seconds and see if that fixes it (you need to account for a ±10% jitter). You can also configure that max-age on the SpiceDB side if need be.
a quick experiment would be to set max-age to a very large value to see if this changes anything. I don't see anything in the AWS ALB documentation that suggests you can configure a max connection age, only keep-alive: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html
the relevant flags on ALB are:
```
client_keep_alive.seconds
  The client keepalive value, in seconds. The default is 3600 seconds.
idle_timeout.timeout_seconds
  The idle timeout value, in seconds. The default is 60 seconds.
```
note the current default keepalive is 1h, whereas SpiceDB connections are set to live at most 30s
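For completeness, these ALB attributes can be adjusted with the AWS CLI's `elbv2 modify-load-balancer-attributes` command; a sketch, with the load balancer ARN as a placeholder and the values just echoing the documented defaults:

```shell
# Placeholder ARN; the attribute keys are the ones quoted above.
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <your-alb-arn> \
  --attributes Key=client_keep_alive.seconds,Value=3600 \
               Key=idle_timeout.timeout_seconds,Value=60
```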
b
I think the client keepalive is between the client and the ALB, not the ALB and the target (SpiceDB)
Because reusing the connection to ALB won't necessarily have ALB route to the same target (I assume)
The HTTP client keepalive duration value specifies the maximum amount of time that ALB will maintain an HTTP connection with a client before closing the connection.
I'm not sure this problem is fully solvable given that I can't configure the max lifetime of the connection between the ALB and the target. But I think I could mitigate it somewhat by increasing that parameter on the SpiceDB side, and using gRPC retries on `UNAVAILABLE` in my clients that connect to the ALB, with a very fast retry
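A sketch of what that retry behavior could look like as a gRPC service config (the service name targets the v1 PermissionsService, which WriteRelationships belongs to; the attempt counts and backoff values are illustrative assumptions, not tested settings):

```json
{
  "methodConfig": [{
    "name": [{"service": "authzed.api.v1.PermissionsService"}],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "0.05s",
      "maxBackoff": "0.5s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}
```

Note that retrying WriteRelationships on UNAVAILABLE is generally safe only when the write didn't reach the server; since the 502s here appear to originate at the ALB rather than SpiceDB, that seems to be the case.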
v
there has to be a max lifetime on the ALB side - ask an AWS rep about it. Otherwise, just set the SpiceDB one to infinity