# spicedb
b
Anyone running SpiceDB behind AWS ALB getting a `14` (unavailable) status code from the ALB intermittently under load? Doesn't seem to be coming from SpiceDB itself afaict: no errors in the logs and no Target 5xx responses, but plenty of ELB 502s
v
We ran some load tests back in the day with ALBs and also had some inexplicable error rates we didn't investigate further; it was just an experiment. Is it possible it's happening when the SpiceDB node rolls? Have you tried killing a pod and seeing if it lines up?
b
Didn't seem to be a correlation with scaling
v
Not scaling - the hypothesis I mentioned was that there's an issue with connection draining. Try killing a pod.
b
I mean the instances were stable at the time, nothing dying, being taken out of service, being brought online etc
v
any insight into the types of requests that failed?
b
But I can try that. Some of AWS's documentation suggests it could indicate something related to keepalive, but I don't know how relevant that is for gRPC / HTTP/2
They're all WriteRelationships
b
And I have seen the occasional transaction rollback, but only very infrequently compared to the number of times I'm seeing this
v
and are you able to see the requests coming through into SpiceDB and succeeding? and it's only the ALB that returns 502?
b
I think I will need to dig into the access logs; according to the metrics there were no errors from SpiceDB, all from the ALB
The majority of requests are fine
v
it's also interesting that it's only WriteRelationships requests. I wonder if some timeout is being hit?
are you writing large numbers of relationships?
b
Yes, although max of about 5 per WriteRelationships call
v
well that shouldn't be an issue
and you haven't seen it with any other API method?
y
@Ben Simpson my understanding is that ALBs only kind of support gRPC
and gRPC really isn't meant to be run through a load balancer
the gRPC client wants to have a full list of the nodes that it can talk to and then do client-side load balancing
we were running through an ALB and having problems, and we switched to a slightly janky setup that uses CloudMap for service discovery and otherwise lets gRPC talk directly to SpiceDB nodes and have gotten better behavior out of our gRPC clients
but i'd really recommend moving to running SpiceDB in k8s - it enables better horizontal dispatch and gRPC clients are very happy in a k8s context
v
At least grpc-go gives you some headroom to change the load balancing strategies
y
all of them do, i think
FWIW we do use load-balancers in our managed infrastructure, they are just not L7s
This also provides good insight on the recommendations around load balancing gRPC: https://grpc.io/blog/grpc-load-balancing/
and finally FWIW, we've seen better balancing of requests when using the `round_robin` strategy versus `pick_first`
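for reference, a minimal grpc-go sketch of opting into `round_robin` via the default service config (the target address is a placeholder and assumes a resolver, e.g. DNS, that returns all SpiceDB node addresses):
```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// pick_first is the default policy; round_robin spreads RPCs across
	// every address the resolver returns.
	conn, err := grpc.NewClient(
		"dns:///spicedb.internal:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```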
there is nothing conceptually wrong about using an L7 load balancer and gRPC, it's just suboptimal but it should work, so I suspect it's related to ALB configuration
b
We do have the ALB set to `grpc` mode
From the access logs it seems like we're running into this scenario
```
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target

The load balancer receives a request and forwards it to the target. The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer. Make sure that the duration of the keep-alive timeout is greater than the idle timeout value.

Check the values for the request_processing_time, target_processing_time and response_processing_time fields.
```
y
yeah we had ours in gRPC mode as well and still had issues
those issues went away when we got rid of the ALB
v
@Ben Simpson these are the server-side timeout and keepalive configuration parameters: https://pkg.go.dev/google.golang.org/grpc/keepalive?utm_source=godoc#ServerParameters The default server-side keep-alive in grpc-go is 2h. SpiceDB lets you configure the connection max-age, which is set by default to 30 seconds, see https://github.com/authzed/spicedb/blob/d9ae77692f09c42a89c38d02df2aa7b6ba62eeb5/pkg/cmd/util/util.go#L68
```
--grpc-max-conn-age duration                                      how long a connection serving gRPC should be able to live (default 30s)
```
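As a minimal sketch (values and wiring here are illustrative, not SpiceDB's actual code), setting that max-age on a grpc-go server looks roughly like this:
```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionAge mirrors --grpc-max-conn-age (default 30s). grpc-go adds
	// +-10% jitter, sends GOAWAY when the age is reached, and then drains the
	// connection gracefully; the grace period defaults to infinity.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 30 * time.Second,
	}))

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.Serve(lis))
}
```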
I checked that grpc-go gracefully closes connections:
```go
// keepalive running in a separate goroutine does the following:
// 1. Gracefully closes an idle connection after a duration of keepalive.MaxConnectionIdle.
// 2. Gracefully closes any connection after a duration of keepalive.MaxConnectionAge.
// 3. Forcibly closes a connection after an additive period of keepalive.MaxConnectionAgeGrace over keepalive.MaxConnectionAge.
// 4. Makes sure a connection is alive by sending pings with a frequency of keepalive.Time and closes a non-responsive connection
// after an additional duration of keepalive.Timeout.
```
There is in fact a grace period after a connection reaches "max-age", and this grace is set to infinity, so the server will technically wait indefinitely to drain the connection. So from my perspective, what I think could be happening is that SpiceDB closes a connection, but the ALB does not know this connection is closed and keeps sending requests over it. Ideally the ALB would allow configuring a connection max-age too, so you could set it to <30 seconds and see if that fixes it (you need to account for +-10% jitter). You can also adjust that max-age on the SpiceDB side if need be.
A quick experiment would be to set max-age to a very large value to see if that changes anything. I don't see anything in the AWS ALB documentation that suggests you can configure a max connection age, only keep-alive: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html
the relevant flags on ALB are:
```
client_keep_alive.seconds
  The client keepalive value, in seconds. The default is 3600 seconds.
idle_timeout.timeout_seconds
  The idle timeout value, in seconds. The default is 60 seconds.
```
note the current default client keepalive is 1h, whereas SpiceDB connections are set by default to live a max of 30s
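If it helps, here's a hedged sketch of adjusting those two attributes with the AWS SDK for Go v2; the ARN is a placeholder and the values are only illustrative (the console or CLI works just as well):
```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
	"github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := elbv2.NewFromConfig(cfg)

	// Set the ALB's idle timeout and client keepalive; the ARN below is a
	// placeholder and both values are examples, not recommendations.
	_, err = client.ModifyLoadBalancerAttributes(ctx, &elbv2.ModifyLoadBalancerAttributesInput{
		LoadBalancerArn: aws.String("arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/abc123"),
		Attributes: []types.LoadBalancerAttribute{
			{Key: aws.String("idle_timeout.timeout_seconds"), Value: aws.String("60")},
			{Key: aws.String("client_keep_alive.seconds"), Value: aws.String("3600")},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```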
b
I think the client keepalive is between the client and the ALB, not the ALB and the target (SpiceDB)
Because reusing the connection to ALB won't necessarily have ALB route to the same target (I assume)
The HTTP client keepalive duration value specifies the maximum amount of time that ALB will maintain an HTTP connection with a client before closing the connection.
I'm not sure this problem is fully solvable given that I can't configure the max lifetime of the connection between the ALB and the target. But I think I could mitigate it somewhat by increasing that parameter on the SpiceDB side, and using gRPC retries on `unavailable` in my clients that connect to the ALB, with a very fast retry
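Roughly what I have in mind, as a sketch in grpc-go using the channel's default service config (the ALB hostname and backoff values are placeholders):
```go
package main

import (
	"crypto/tls"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// Retry UNAVAILABLE quickly; the empty "name" entry applies the policy to
// every method on the channel. All values here are illustrative.
const retryServiceConfig = `{
  "methodConfig": [{
    "name": [{}],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "0.05s",
      "maxBackoff": "0.5s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}`

func main() {
	conn, err := grpc.NewClient(
		"my-alb.example.com:443", // placeholder ALB address
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
		grpc.WithDefaultServiceConfig(retryServiceConfig),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```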
v
there has to be a max lifetime on the ALB side - ask an AWS rep about it. Otherwise, just set the SpiceDB max-age to infinity
i
@vroldanbet we've started running into this issue as well. We're attempting to set the max connection age in SpiceDB to infinity, but it seems like the SpiceDB flag requires a duration: https://github.com/authzed/spicedb/blob/v1.35.3/pkg/cmd/util/util.go#L69 It looks like Go's ParseDuration supports a max value of about 292 years. Is there a way you know of to set this to infinity instead? https://stackoverflow.com/a/68757925
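For what it's worth, a quick illustration of the ceiling: `time.Duration` is an int64 of nanoseconds, so ~292 years is the max and `ParseDuration` can't express a true infinity:
```go
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// The largest representable duration, roughly 292 years.
	fmt.Println(time.Duration(math.MaxInt64)) // 2562047h47m16.854775807s

	// A flag value near that ceiling still parses fine, e.g. ~285 years.
	d, err := time.ParseDuration("2500000h")
	fmt.Println(d, err) // 2500000h0m0s <nil>
}
```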
v
Doesn't a very large duration work for you? Does your app not restart at all? Even load balancers will change more often than that (scaling, reshuffling processes around), so it seems unrealistic to have a connection that lives forever
i
Thanks @vroldanbet, we tried a large duration, but unfortunately it didn't fix the issue. We still see 502s every 20 minutes or so, even though there isn't any change in the service behind the load balancer
v
I'm not sure I can help, we don't run ALBs ourselves for our managed offering so we don't have experience with putting SpiceDB behind it. Maybe @Ben Simpson has made progress on identifying the issue?
In this thread I tried to provide as much info as possible so y'all could research the issue.
I suspect this is not necessarily a "SpiceDB" problem but a "gRPC behind ALB" problem
y
i'd second the above
at my last company we stopped seeing these problems after we started using CloudMap and letting the client talk directly to SpiceDB rather than going through a load balancer
it brought other problems, namely that the list of nodes was slow to update when the SpiceDB instances rolled, but we stopped having issues with 502s
gRPC really isn't designed to be run through a load balancer - it does its own load balancing in the client and wants to have a full list of available nodes at all times
i
Ok, we can try that. How do you handle the slow-to-update issue? That seems like a bigger problem if it can't reliably route to healthy instances
y
we moved to k8s. gRPC and k8s kind of coevolved - gRPC clients are happy when they can get the list of nodes from k8s service discovery. it's also the easiest way to get horizontal dispatch working in spicedb, which is otherwise a pain in an ECS world.
our operator makes running spicedb in an EKS cluster relatively easy, fwiw
i
I would also like to use K8s, but we're limited to ECS because we're deploying into customer cloud accounts that do not have EKS access at the moment
y
ah gotcha. yeah, we ran in ECS for a while and it wasn't a huge deal, but we also didn't have a ton of traffic to the system at the time and were generally able to mitigate things like the downtimes.
i think a larger replication factor would have helped
v
If ECS has a reliable service discovery feature, we could build something to use it. I'm not aware such functionality exists
b
We didn't really find a solution except to enable retries on `Unavailable` in the client
Update: For a long time we got by with using the ALB and setting `--grpc-max-conn-age` to 1hr, which stopped the `502`s. But we noticed that we would occasionally get a request that would take 60s and then fail, which would result in a `504`. The only thing that we could see with a `60s` timeout was the `idle_timeout` on the ALB. We couldn't figure out how to resolve this issue: SpiceDB didn't see any requests taking `60s` to complete, so it seemed like something was up with the ALB, and it seemed unlikely that increasing the idle timeout would have any effect apart from the request taking longer to fail. In the end we switched from the ALB to an NLB and reduced `--grpc-max-conn-age` to 60s so that traffic gets routed to new instances in a timely fashion as they come online. This seems to be working much better so far.
v
Yep! That's also how we run our managed services. I'd still want to spend some time at some point with ALBs to see what's up