# spicedb
b
Anyone running SpiceDB behind AWS ALB getting a `14` (unavailable) status code from the ALB intermittently under load? Doesn't seem to be coming from SpiceDB itself afaict: no errors in the logs and no Target 5xx responses, but plenty of ELB 502s
v
We ran some load tests back in the day with ALBs and also had some inexplicable error rates we didn't investigate further; it was just an experiment. Is it possible it's happening when the SpiceDB node rolls? Have you tried killing a pod and seeing if it lines up?
b
Didn't seem to be a correlation with scaling
v
Not scaling - the hypothesis I mentioned was that there's an issue with connection draining. Try killing a pod.
b
I mean the instances were stable at the time, nothing dying, being taken out of service, being brought online etc
v
any insight into the types of requests that failed?
b
But I can try that. Some of AWS's documentation suggests it could indicate something related to keepalive, but I don't know how relevant that is for gRPC / HTTP/2
They're all WriteRelationships
b
And I have seen the occasional transaction rollback, but only very infrequently compared to the number of times I'm seeing this
v
and are you able to see the requests coming through into SpiceDB and succeeding? and it's only the ALB that returns 502?
b
I think I will need to dig into the access logs; according to the metrics there were no errors from SpiceDB, all from the ALB
The majority of requests are fine
v
it's also interesting that it's only WriteRelationships requests. I wonder if some timeout is being hit?
are you writing large numbers of relationships?
b
Yes, although max of about 5 per WriteRelationships call
v
well that shouldn't be an issue
and you haven't seen it with any other API method?
y
@Ben Simpson my understanding is that ALBs only kind of support gRPC
and gRPC really isn't meant to be run through a load balancer
the gRPC client wants to have a full list of the nodes that it can talk to and then do client-side load balancing
we were running through an ALB and having problems, and we switched to a slightly janky setup that uses CloudMap for service discovery and otherwise lets gRPC talk directly to SpiceDB nodes and have gotten better behavior out of our gRPC clients
but i'd really recommend moving to running SpiceDB in k8s - it enables better horizontal dispatch and gRPC clients are very happy in a k8s context
v
At least grpc-go gives you some headroom to change the load balancing strategies
y
all of them do, i think
FWIW we do use load-balancers in our managed infrastructure, they are just not L7s
This also provides good insight on the recommendations around load balancing gRPC: https://grpc.io/blog/grpc-load-balancing/
and finally FWIW, we've seen better balancing of requests when using the `round_robin` strategy versus `pick_first`
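for reference, a minimal grpc-go sketch of opting into `round_robin` via the default service config (the target address is a placeholder and assumes a resolver, e.g. DNS, that returns all SpiceDB node addresses):
```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// pick_first is the default policy; round_robin spreads RPCs across
	// every address the resolver returns.
	conn, err := grpc.NewClient(
		"dns:///spicedb.internal:50051", // placeholder target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```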
there is nothing conceptually wrong about using an L7 load balancer and gRPC, it's just suboptimal but it should work, so I suspect it's related to ALB configuration
b
We do have the ALB set to `grpc` mode
From the access logs it seems like we're running into this scenario
```
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target

The load balancer receives a request and forwards it to the target. The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer. Make sure that the duration of the keep-alive timeout is greater than the idle timeout value.

Check the values for the request_processing_time, target_processing_time and response_processing_time fields.
```
y
yeah we had ours in gRPC mode as well and still had issues
those issues went away when we got rid of the ALB
v
@Ben Simpson these are the server-side timeout and keepalive configuration parameters: https://pkg.go.dev/google.golang.org/grpc/keepalive?utm_source=godoc#ServerParameters The default server-side keep-alive in grpc-go is 2h. SpiceDB lets you configure the connection max-age, which is set by default to 30 seconds, see https://github.com/authzed/spicedb/blob/d9ae77692f09c42a89c38d02df2aa7b6ba62eeb5/pkg/cmd/util/util.go#L68
```
--grpc-max-conn-age duration                                      how long a connection serving gRPC should be able to live (default 30s)
```
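As a minimal sketch (values and wiring here are illustrative, not SpiceDB's actual code), setting that max-age on a grpc-go server looks roughly like this:
```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionAge mirrors --grpc-max-conn-age (default 30s). grpc-go adds
	// +-10% jitter, sends GOAWAY when the age is reached, and then drains the
	// connection gracefully; the grace period defaults to infinity.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 30 * time.Second,
	}))

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.Serve(lis))
}
```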
I checked that grpc-go gracefully closes connections:
```go
// keepalive running in a separate goroutine does the following:
// 1. Gracefully closes an idle connection after a duration of keepalive.MaxConnectionIdle.
// 2. Gracefully closes any connection after a duration of keepalive.MaxConnectionAge.
// 3. Forcibly closes a connection after an additive period of keepalive.MaxConnectionAgeGrace over keepalive.MaxConnectionAge.
// 4. Makes sure a connection is alive by sending pings with a frequency of keepalive.Time and closes a non-responsive connection
// after an additional duration of keepalive.Timeout.
```
There is in fact a grace period after a connection reaches "max-age", and this grace is set to infinity, so the server will technically wait indefinitely to drain the connection. So from my perspective, what I think could be happening is that SpiceDB closes a connection, but the ALB does not know this connection is closed and keeps sending requests over it. Ideally the ALB would allow configuring a connection max-age too, so you could set it to <30 seconds and see if that fixes it (you need to account for +-10% jitter). You can also adjust that max-age on the SpiceDB side if need be.
A quick experiment would be to set max-age to a very large value to see if that changes anything. I don't see anything in the AWS ALB documentation that suggests you can configure a max connection age, only keep-alive: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html
the relevant flags on ALB are:
```
client_keep_alive.seconds
  The client keepalive value, in seconds. The default is 3600 seconds.
idle_timeout.timeout_seconds
  The idle timeout value, in seconds. The default is 60 seconds.
```
note the current default client keepalive is 1h, whereas SpiceDB connections are set by default to live a max of 30s
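If it helps, here's a hedged sketch of adjusting those two attributes with the AWS SDK for Go v2; the ARN is a placeholder and the values are only illustrative (the console or CLI works just as well):
```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
	"github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := elbv2.NewFromConfig(cfg)

	// Set the ALB's idle timeout and client keepalive; the ARN below is a
	// placeholder and both values are examples, not recommendations.
	_, err = client.ModifyLoadBalancerAttributes(ctx, &elbv2.ModifyLoadBalancerAttributesInput{
		LoadBalancerArn: aws.String("arn:aws:elasticloadbalancing:region:account:loadbalancer/app/my-alb/abc123"),
		Attributes: []types.LoadBalancerAttribute{
			{Key: aws.String("idle_timeout.timeout_seconds"), Value: aws.String("60")},
			{Key: aws.String("client_keep_alive.seconds"), Value: aws.String("3600")},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```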
b
I think the client keepalive is between the client and the ALB, not the ALB and the target (SpiceDB)
Because reusing the connection to ALB won't necessarily have ALB route to the same target (I assume)
The HTTP client keepalive duration value specifies the maximum amount of time that ALB will maintain an HTTP connection with a client before closing the connection.
I'm not sure this problem is fully solvable given that I can't configure the max lifetime of the connection between the ALB and the target. But I think I could mitigate it somewhat by increasing that parameter on the SpiceDB side, and using gRPC retries on `unavailable` in my clients that connect to the ALB, with a very fast retry
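Roughly what I have in mind, as a sketch in grpc-go using the channel's default service config (the ALB hostname and backoff values are placeholders):
```go
package main

import (
	"crypto/tls"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// Retry UNAVAILABLE quickly; the empty "name" entry applies the policy to
// every method on the channel. All values here are illustrative.
const retryServiceConfig = `{
  "methodConfig": [{
    "name": [{}],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "0.05s",
      "maxBackoff": "0.5s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}`

func main() {
	conn, err := grpc.NewClient(
		"my-alb.example.com:443", // placeholder ALB address
		grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{})),
		grpc.WithDefaultServiceConfig(retryServiceConfig),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```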
v
there has to be a max lifetime on the ALB side - ask an AWS rep about it. Otherwise, just set the SpiceDB max-age to infinity
i
@vroldanbet we've started running into this issue as well. We're attempting to set the max connection age in SpiceDB to infinity, but it seems like the SpiceDB flag requires a duration: https://github.com/authzed/spicedb/blob/v1.35.3/pkg/cmd/util/util.go#L69 It looks like Go's ParseDuration supports a max value of about 292 years. Is there a way you know of to set this to infinity instead? https://stackoverflow.com/a/68757925
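For what it's worth, a quick illustration of the ceiling: `time.Duration` is an int64 of nanoseconds, so ~292 years is the max and `ParseDuration` can't express a true infinity:
```go
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// The largest representable duration, roughly 292 years.
	fmt.Println(time.Duration(math.MaxInt64)) // 2562047h47m16.854775807s

	// A flag value near that ceiling still parses fine, e.g. ~285 years.
	d, err := time.ParseDuration("2500000h")
	fmt.Println(d, err) // 2500000h0m0s <nil>
}
```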
v
Doesn't a very large duration work for you? Does your app not restart at all? Even load balancers will change more often than that (scaling, reshuffling processes around), so it seems unrealistic to have a connection that lives forever
i
Thanks @vroldanbet, we tried a large duration, but unfortunately it didn't fix the issue. We still see 502s every 20 minutes or so, even though there isn't any change in the service behind the load balancer
v
I'm not sure I can help, we don't run ALBs ourselves for our managed offering so we don't have experience with putting SpiceDB behind it. Maybe @Ben Simpson has made progress on identifying the issue?
In this thread I tried to provide as much info as possible so y'all could research the issue.
I suspect this is not necessarily a "SpiceDB" problem but a "gRPC behind ALB" problem
y
i'd second the above
at my last company we stopped seeing these problems after we started using CloudMap and letting the client talk directly to SpiceDB rather than going through a load balancer
it brought other problems, namely that the list of nodes was slow to update when the SpiceDB instances rolled, but we stopped having issues with 502s
gRPC really isn't designed to be run through a load balancer - it does its own load balancing in the client and wants to have a full list of available nodes at all times
i
Ok, we can try that. How do you handle the slow-to-update issue? That seems like a bigger problem if it can't reliably route to healthy instances
y
we moved to k8s. gRPC and k8s kind of coevolved - gRPC clients are happy when they can get the list of nodes from k8s service discovery. it's also the easiest way to get horizontal dispatch working in spicedb, which is otherwise a pain in an ECS world.
our operator makes running spicedb in an EKS cluster relatively easy, fwiw
i
I would also like to use K8s, but we're limited to ECS because we're deploying into customer cloud accounts that do not have EKS access at the moment
y
ah gotcha. yeah, we ran in ECS for a while and it wasn't a huge deal, but we also didn't have a ton of traffic to the system at the time and were generally able to mitigate things like the downtimes.
i think a larger replication factor would have helped
v
If ECS has a reliable service discovery feature, we could build something to use it. I'm not aware such functionality exists
b
We didn't really find a solution except to enable retries on `Unavailable` in the client
Update: For a long time we got by with using the ALB and setting `--grpc-max-conn-age` to 1hr, which stopped the `502`s. But we noticed that we would occasionally get a request that would take 60s and then fail, which would result in a `504`. The only thing that we could see with a `60s` timeout was the `idle_timeout` on the ALB. We couldn't figure out how to resolve this issue: SpiceDB didn't see any requests taking `60s` to complete, so it seemed like something was up with the ALB, and it seemed unlikely that increasing the idle timeout would have any effect apart from the request taking longer to fail. In the end we switched from the ALB to an NLB and reduced `--grpc-max-conn-age` to 60s so that traffic gets routed to new instances in a timely fashion as they come online. This seems to be working much better so far.
v
Yep! That's also how we run our managed services. I'd still want to spend some time at some point with ALBs to see what's up