# spicedb
b
I can't share the exact TF (we have stuff modularized so it wouldn't translate 1:1 anyway), but:
- Create an `aws_service_discovery_private_dns_namespace`
- Create an `aws_service_discovery_service` using the namespace with a low TTL. We're using 10s at the moment.
- Attach service discovery to the `aws_ecs_service` using `service_registries`
- Set `SPICEDB_DISPATCH_UPSTREAM_ADDR` to `dns:///spicedb.your_namespace:50053`
- Set up an ALB and a target group configured to use the `GRPC` protocol and a health check like
```hcl
healthcheck = {
  path     = "/grpc.health.v1.Health/Check"
  timeout  = 5
  interval = 10
  matcher  = 0
}
```
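To make the list above concrete, here's a minimal sketch of the service-discovery pieces in Terraform. Resource names, the VPC variable, and the elided ECS service arguments are illustrative assumptions, not the actual config from this thread:

```hcl
resource "aws_service_discovery_private_dns_namespace" "this" {
  name = "your_namespace" # matches the dispatch address dns:///spicedb.your_namespace:50053
  vpc  = var.vpc_id       # illustrative variable
}

resource "aws_service_discovery_service" "spicedb" {
  name = "spicedb"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.this.id

    dns_records {
      type = "A"
      ttl  = 10 # the low TTL discussed above
    }
  }
}

resource "aws_ecs_service" "spicedb" {
  # cluster, task_definition, desired_count, etc. elided
  name = "spicedb"

  service_registries {
    registry_arn = aws_service_discovery_service.spicedb.arn
  }
}
```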
v
Problem is that in the worst case you may see a 10s delay when an instance goes down, which means dispatching will fail. Depending on your schema, this could be tolerated if there are various paths to evaluate a permission that is decomposed into subproblems falling under different ECS SpiceDB instances: one of the dispatches will time out, but because another path was found, the overall request would succeed. What's the API call RPS and p95/p99 latency on checks for your cluster? I'd say that in a cluster running hotter you'll probably see more disruption when an instance in the cluster goes down. When you say "ALB" here, you mean you put it in front of the SpiceDB API, not in front of the Dispatch API, right?
This also seems like a great contribution to the docs 😄
b
Yeah, the ALB is in front of the API. I forgot to mention: you need to ensure the security group for the service (task? I forget) allows self-ingress
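For the self-ingress piece, something like the following rule on the service's security group should work. A sketch only: the security group reference is an assumption, and 50053 is the dispatch port from the upstream address above:

```hcl
# Allow tasks in this security group to reach each other's
# dispatch port so the SpiceDB dispatch ring can form.
resource "aws_security_group_rule" "spicedb_dispatch_self" {
  type              = "ingress"
  security_group_id = aws_security_group.spicedb.id # illustrative reference
  from_port         = 50053
  to_port           = 50053
  protocol          = "tcp"
  self              = true # source is the same security group
}
```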
And re: the 10s thing, I'm still not 100% certain when exactly the task gets deregistered from the discovery service. But if there's a deregistration delay longer than that, maybe the problem won't happen on a graceful shutdown: the task would be removed from discovery immediately but still accept requests for those 10s (while no longer being routed them by the ALB)
v
I see what you mean: maybe AWS first deregisters the task, then waits 10 seconds, and then sends the termination signal. SpiceDB can also be configured with a graceful termination timeout, so you could even force it at the application level:
- ECS sends termination
- SpiceDB is configured to continue serving in-flight requests for X seconds
- (assumption) AWS removes the task IP from service discovery, the DNS TTL expires, and the IP is gone
- SpiceDB terminates after X seconds
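The application-level piece of the flow above could be wired into the ECS task definition's container command. A sketch, assuming SpiceDB's shutdown grace-period flag is `--grpc-shutdown-grace-period` (verify the exact flag name with `spicedb serve --help` for your version); all other values are illustrative:

```hcl
container_definitions = jsonencode([{
  name  = "spicedb"
  image = "authzed/spicedb:latest"
  command = [
    "serve",
    # keep serving in-flight requests after SIGTERM, longer than
    # the 10s DNS TTL on the discovery service (assumed flag name)
    "--grpc-shutdown-grace-period", "15s",
  ]
  # give the container more than the grace period before SIGKILL
  stopTimeout = 30
}])
```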
It would be great to get a clarification from AWS on what the flow is when the ECS scheduler sends termination
and just to be clear, there are two levels of draining needed here:
- API, via the ALB
- Dispatch API, via service discovery
b
Hmm https://github.com/aws/containers-roadmap/issues/343 — the suggestion there is to use Service Connect instead of Service Discovery (Cloud Map)
... or just move to EKS, I guess 😛
v
> Hello all. We have launched ECS Service Connect which is intended to be a drop-in replacement for Cloud Map DNS based service discovery. You can keep using the same DNS names, but it no longer uses DNS propagation. Instead Service Connect uses an Envoy Proxy sidecar which is configured by monitoring the ECS control plane for task launches and stops. This means there is far less propagation delay. Additionally, the sidecar automatically retries and redirects your request to a different task in the rare case that a task has crashed in the short interval between the sidecar receiving the latest updated config from ECS. > > Please try it out and let us know if this solves your problem. Service Connect is designed to feel the same as DNS based service discovery, but overall much more featureful and doesn't have the same DNS propagation timing issues.
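For reference, a minimal sketch of what switching to Service Connect might look like in Terraform. Names and ports are illustrative assumptions; note that Service Connect uses an HTTP namespace rather than the private DNS namespace, and `port_name` must match a named `portMapping` in the task definition:

```hcl
resource "aws_service_discovery_http_namespace" "this" {
  name = "your_namespace" # illustrative
}

resource "aws_ecs_service" "spicedb" {
  # cluster, task_definition, desired_count, etc. elided
  name = "spicedb"

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.this.arn

    service {
      port_name = "grpc" # must match a portMapping name in the task definition

      client_alias {
        port     = 50053
        dns_name = "spicedb" # clients resolve spicedb:50053 via the Envoy sidecar
      }
    }
  }
}
```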
b
Yeah
v
Well now you are adding a sidecar proxy to the request path, but the worst part is the "automatic retries and redirects". That would basically dispatch to a different node, which breaks the dispatch ring. It's not horrible per se, I guess better than failing the request, but it means the dispatch lands on a node that won't have the subproblem cached
At least they mention it only happens in rare cases
worth a shot!
b
Yeah, sounds more reliable at least - "that a task has crashed" leads me to think that if it's shut down gracefully it shouldn't be a problem
v
> Deployment order matters. If you have a Service A under servicea.internal, then deploy a Service B under serviceb.internal, Service A will not be able to talk to Service B (DNS resolution error will happen) until Service A containers are restarted. I find this behaviour quite irritating, as dynamic creation of new ECS services then require restarting of other services if you want those to be able to "discover" the new one.

yikes
b
Shouldn't be a problem in this case, unless I'm misunderstanding
v
no, it was just surprising, given that service discovery is supposed to be about dynamically adapting to everything
if you deploy SpiceDB, and it already exists, every new service should be able to see it
if for whatever reason you had to create a new SpiceDB service (e.g. dynamically provisioning them for testing purposes, or doing A/B testing), then now you need to restart every service client
which sounds horrible
y
yeah this is the approach we're taking to the problem 😛