Ben Simpson
05/15/2024, 5:11 AM
`14` (unavailable) status code from the ALB intermittently under load? Doesn't seem to be coming from SpiceDB itself afaict - no errors in the logs and no Target 5xx responses, but plenty of ELB 502s
vroldanbet
05/15/2024, 8:40 AM
Ben Simpson
05/15/2024, 8:41 AM
vroldanbet
05/15/2024, 8:56 AM
Ben Simpson
05/15/2024, 9:00 AM
vroldanbet
05/15/2024, 9:04 AM
Ben Simpson
05/15/2024, 9:04 AM
Ben Simpson
05/15/2024, 9:05 AM
vroldanbet
05/15/2024, 9:05 AM
Ben Simpson
05/15/2024, 9:05 AM
vroldanbet
05/15/2024, 9:06 AM
Ben Simpson
05/15/2024, 9:07 AM
Ben Simpson
05/15/2024, 9:08 AM
vroldanbet
05/15/2024, 9:10 AM
vroldanbet
05/15/2024, 9:10 AM
Ben Simpson
05/15/2024, 9:12 AM
vroldanbet
05/15/2024, 10:29 AM
vroldanbet
05/15/2024, 10:29 AM
yetitwo
05/15/2024, 1:48 PM
yetitwo
05/15/2024, 1:48 PM
yetitwo
05/15/2024, 1:49 PM
yetitwo
05/15/2024, 1:49 PM
yetitwo
05/15/2024, 1:50 PM
vroldanbet
05/15/2024, 3:33 PM
yetitwo
05/15/2024, 3:37 PM
vroldanbet
05/15/2024, 3:48 PM
vroldanbet
05/15/2024, 3:49 PM
vroldanbet
05/15/2024, 3:55 PM
vroldanbet
05/15/2024, 4:11 PM
vroldanbet
05/15/2024, 4:12 PM
`round_robin` strategy versus `pick_first`
vroldanbet
05/15/2024, 4:17 PM
Ben Simpson
05/15/2024, 9:30 PM
`grpc` mode
Ben Simpson
05/15/2024, 9:32 PM
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target
The load balancer receives a request and forwards it to the target. The target receives the request and starts to process it, but closes the connection to the load balancer too early. This usually occurs when the duration of the keep-alive timeout for the target is shorter than the idle timeout value of the load balancer. Make sure that the duration of the keep-alive timeout is greater than the idle timeout value.
Check the values for the request_processing_time, target_processing_time and response_processing_time fields.
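A rough sketch of how the three timing fields quoted above could be pulled out of an ALB access log entry. The field order is an assumption based on the documented access log layout and should be checked against your own logs. The sample line is invented for illustration, and `parse_entry`, `is_lb_generated_502` and `unfinished_phases` are hypothetical helper names.

```python
import shlex

# Leading fields of an ALB access log entry, in order. Assumption: this follows
# the layout described in the AWS access log docs; confirm against your own logs.
LEADING_FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
]

TIMING_FIELDS = LEADING_FIELDS[5:8]

# Made-up sample entry for illustration only (not taken from the thread).
SAMPLE_LINE = (
    'grpcs 2024-05-15T09:30:00.000000Z app/example-alb/0123456789abcdef '
    '10.0.0.5:54321 10.0.1.7:50051 0.001 -1 -1 502 - 120 0 '
    '"POST https://spicedb.example.internal:443/example HTTP/2.0" "example-client"'
)


def parse_entry(line):
    """Map the leading space-separated tokens onto their field names."""
    tokens = shlex.split(line)  # keeps quoted request/user-agent strings whole
    return dict(zip(LEADING_FIELDS, tokens))


def is_lb_generated_502(entry):
    """True when the ALB returned 502 and no target status code was recorded."""
    return entry["elb_status_code"] == "502" and entry["target_status_code"] == "-"


def unfinished_phases(entry):
    """Timing fields logged as negative, i.e. phases that never completed."""
    return [name for name in TIMING_FIELDS if float(entry[name]) < 0]


entry = parse_entry(SAMPLE_LINE)
if is_lb_generated_502(entry):
    print(entry["target"], unfinished_phases(entry))
```

Grouping the flagged entries by `target` and by time of day may help show whether the 502s cluster around connection recycling on one instance or are spread evenly.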
yetitwo
05/16/2024, 1:54 AM
yetitwo
05/16/2024, 1:54 AM
vroldanbet
05/16/2024, 8:18 AM
--grpc-max-conn-age duration   how long a connection serving gRPC should be able to live (default 30s)
I checked, and grpc-go closes connections gracefully:
// keepalive running in a separate goroutine does the following:
// 1. Gracefully closes an idle connection after a duration of keepalive.MaxConnectionIdle.
// 2. Gracefully closes any connection after a duration of keepalive.MaxConnectionAge.
// 3. Forcibly closes a connection after an additive period of keepalive.MaxConnectionAgeGrace over keepalive.MaxConnectionAge.
// 4. Makes sure a connection is alive by sending pings with a frequency of keepalive.Time and closes a non-responsive connection
// after an additional duration of keepalive.Timeout.
There is in fact a grace period after a connection reaches "max-age", and this grace period is set to infinity, so the server will technically wait indefinitely to drain the connection.
So what I think could be happening is that SpiceDB closes a connection, but the ALB does not know this connection is closed and keeps sending requests over it.
Ideally ALB allows configuring a connection max-age too, so you can set it to under 30 seconds and see if that fixes it (you need to account for a ±10% jitter).
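To make the jitter arithmetic concrete, here is a small sketch. It assumes the jitter is applied symmetrically as ±10% of the configured max age. The helper names are hypothetical, and whether the ALB actually exposes a comparable max-age setting is not established in this thread.

```python
def jittered_window(max_conn_age_s, jitter=0.10):
    """Return (earliest, latest) seconds at which the server may start closing a connection."""
    return (max_conn_age_s * (1 - jitter), max_conn_age_s * (1 + jitter))


def closes_before_server(proxy_max_age_s, server_max_conn_age_s, jitter=0.10):
    """True if the proxy always retires a connection before the server could."""
    earliest, _ = jittered_window(server_max_conn_age_s, jitter)
    return proxy_max_age_s < earliest


earliest, latest = jittered_window(30)
print(f"server may close between {earliest:.0f}s and {latest:.0f}s")
print(closes_before_server(25, 30))  # 25s is below the earliest close time
print(closes_before_server(29, 30))  # 29s can lose the race once jitter is applied
```

With the 30s default, a proxy-side limit just under 30s is not enough on its own: it would need to sit below roughly 27s to stay ahead of the earliest jittered close.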
You can also configure that max-age on SpiceDB if need be.
vroldanbet
05/16/2024, 8:24 AM
vroldanbet
05/16/2024, 8:26 AM
client_keep_alive.seconds
The client keepalive value, in seconds. The default is 3600 seconds.
idle_timeout.timeout_seconds
The idle timeout value, in seconds. The default is 60 seconds.
vroldanbet
05/16/2024, 8:30 AM
Ben Simpson
05/17/2024, 1:47 AM
Ben Simpson
05/17/2024, 1:48 AM
Ben Simpson
05/17/2024, 1:49 AM
The HTTP client keepalive duration value specifies the maximum amount of time that ALB will maintain an HTTP connection with a client before closing the connection.
Ben Simpson
05/17/2024, 1:53 AM
`unavailable` in my clients that connect to ALB with a very fast retry
vroldanbet
05/17/2024, 12:09 PM
Ian
09/05/2024, 3:38 PM
vroldanbet
09/06/2024, 9:10 AM
Ian
09/06/2024, 9:25 AM
vroldanbet
09/06/2024, 9:38 AM
vroldanbet
09/06/2024, 9:39 AM
vroldanbet
09/06/2024, 9:39 AM
yetitwo
09/06/2024, 2:59 PM
yetitwo
09/06/2024, 3:00 PM
yetitwo
09/06/2024, 3:00 PM
yetitwo
09/06/2024, 3:01 PM
Ian
09/06/2024, 8:28 PM
yetitwo
09/06/2024, 8:46 PM
yetitwo
09/06/2024, 8:48 PM
Ian
09/06/2024, 8:58 PM
yetitwo
09/06/2024, 11:17 PM
yetitwo
09/06/2024, 11:17 PM
vroldanbet
09/06/2024, 11:40 PM
Ben Simpson
09/08/2024, 10:02 PM
`Unavailable` in the client
Ben Simpson
03/28/2025, 12:54 AM
`--grpc-max-conn-age` to 1hr, which stopped the `502`s.
But we noticed that we would occasionally get a request that would take 60s and then fail, which would result in a `504`. The only thing that we could see with a `60s` timeout was the `idle_timeout` on the ALB. We couldn't figure out how to resolve this issue - SpiceDB didn't see any requests taking `60s` to complete, so it seemed like something was up with the ALB, and it seemed unlikely that increasing the idle timeout would have any effect apart from the request taking longer to fail.
In the end we switched from ALB to NLB, and reduced the `--grpc-max-conn-age` to 60s so that traffic gets routed to new instances in a timely fashion as they come online. This seems to be working much better so far.
vroldanbet
03/28/2025, 11:32 AM