# spicedb
b
I can't share the exact TF (we have stuff modularized so it wouldn't translate 1:1 anyway), but:
- Create an `aws_service_discovery_private_dns_namespace`
- Create an `aws_service_discovery_service` using the namespace with a low TTL. We're using 10s at the moment.
- Attach service discovery to the `aws_ecs_service` using `service_registries`
- Set `SPICEDB_DISPATCH_UPSTREAM_ADDR` to `dns:///spicedb.your_namespace:50053`
- Set up an ALB and a target group configured to use the `GRPC` protocol and a health check like
```hcl
healthcheck = {
  path     = "/grpc.health.v1.Health/Check"
  timeout  = 5
  interval = 10
  matcher  = 0
}
```
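To make the list above concrete, here's a minimal sketch of the service-discovery pieces in Terraform. Resource names, the VPC variable, and the elided ECS service arguments are illustrative assumptions, not the actual config from this thread:

```hcl
resource "aws_service_discovery_private_dns_namespace" "this" {
  name = "your_namespace" # matches the dispatch address dns:///spicedb.your_namespace:50053
  vpc  = var.vpc_id       # illustrative variable
}

resource "aws_service_discovery_service" "spicedb" {
  name = "spicedb"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.this.id

    dns_records {
      type = "A"
      ttl  = 10 # the low TTL discussed above
    }
  }
}

resource "aws_ecs_service" "spicedb" {
  # cluster, task_definition, desired_count, etc. elided
  name = "spicedb"

  service_registries {
    registry_arn = aws_service_discovery_service.spicedb.arn
  }
}
```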
v
Problem is that in the worst case you may see a 10s delay when an instance goes down, which means dispatching will fail. Depending on your schema, this could be tolerated if there are various paths to evaluate a permission that is decomposed into subproblems falling under different ECS SpiceDB instances: one of the dispatches will time out, but because another path was found, the overall request would succeed. What's the API call RPS and p95/p99 latency on checks for your cluster? I'd say that in a cluster running hotter you'll probably see more disruption when an instance in the cluster goes down. When you say "ALB" here, you mean you put it in front of the SpiceDB API, not in front of the Dispatch API, right?
This also seems like a great contribution to the docs 😄
b
Yeah, the ALB is in front of the API. I forgot to mention: you need to ensure the security group for the service (task? I forget) allows self-ingress
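For the self-ingress piece, something like the following rule on the service's security group should work. A sketch only: the security group reference is an assumption, and 50053 is the dispatch port from the upstream address above:

```hcl
# Allow tasks in this security group to reach each other's
# dispatch port so the SpiceDB dispatch ring can form.
resource "aws_security_group_rule" "spicedb_dispatch_self" {
  type              = "ingress"
  security_group_id = aws_security_group.spicedb.id # illustrative reference
  from_port         = 50053
  to_port           = 50053
  protocol          = "tcp"
  self              = true # source is the same security group
}
```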
And re: the 10s thing, I'm still not 100% certain when exactly the task gets deregistered from the discovery service. But if there's a deregistration delay longer than that, maybe the problem won't happen on a graceful shutdown: the task would be removed from discovery immediately but still accept requests for those 10s (while no longer being routed them by the ALB)
v
I see what you mean: maybe AWS first deregisters the task, then waits 10 seconds, and then sends the termination signal. SpiceDB can also be configured with a graceful termination timeout, so you could even force it at the application level:
- ECS sends termination
- SpiceDB is configured to continue serving in-flight requests for X seconds
- (assumption) AWS removes the task IP from service discovery, the DNS TTL expires, and the IP is gone
- SpiceDB terminates after X seconds
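The application-level piece of the flow above could be wired into the ECS task definition's container command. A sketch, assuming SpiceDB's shutdown grace-period flag is `--grpc-shutdown-grace-period` (verify the exact flag name with `spicedb serve --help` for your version); all other values are illustrative:

```hcl
container_definitions = jsonencode([{
  name  = "spicedb"
  image = "authzed/spicedb:latest"
  command = [
    "serve",
    # keep serving in-flight requests after SIGTERM, longer than
    # the 10s DNS TTL on the discovery service (assumed flag name)
    "--grpc-shutdown-grace-period", "15s",
  ]
  # give the container more than the grace period before SIGKILL
  stopTimeout = 30
}])
```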
It would be great to get a clarification from AWS on what the flow is when the ECS scheduler sends termination
and just to be clear, there are two levels of draining needed here:
- API, via the ALB
- Dispatch API, via service discovery
b
Hmm https://github.com/aws/containers-roadmap/issues/343 — the suggestion there is to use Service Connect instead of Service Discovery (Cloud Map)
... or just move to EKS, I guess 😛
v
> Hello all. We have launched ECS Service Connect which is intended to be a drop-in replacement for Cloud Map DNS based service discovery. You can keep using the same DNS names, but it no longer uses DNS propagation. Instead Service Connect uses an Envoy Proxy sidecar which is configured by monitoring the ECS control plane for task launches and stops. This means there is far less propagation delay. Additionally, the sidecar automatically retries and redirects your request to a different task in the rare case that a task has crashed in the short interval between the sidecar receiving the latest updated config from ECS. > > Please try it out and let us know if this solves your problem. Service Connect is designed to feel the same as DNS based service discovery, but overall much more featureful and doesn't have the same DNS propagation timing issues.
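For reference, a minimal sketch of what switching to Service Connect might look like in Terraform. Names and ports are illustrative assumptions; note that Service Connect uses an HTTP namespace rather than the private DNS namespace, and `port_name` must match a named `portMapping` in the task definition:

```hcl
resource "aws_service_discovery_http_namespace" "this" {
  name = "your_namespace" # illustrative
}

resource "aws_ecs_service" "spicedb" {
  # cluster, task_definition, desired_count, etc. elided
  name = "spicedb"

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_http_namespace.this.arn

    service {
      port_name = "grpc" # must match a portMapping name in the task definition

      client_alias {
        port     = 50053
        dns_name = "spicedb" # clients resolve spicedb:50053 via the Envoy sidecar
      }
    }
  }
}
```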
b
Yeah
v
Well now you are adding a sidecar proxy to the request path, but the worst part is the "automatic retries and redirects". That would basically dispatch to a different node, which breaks the dispatch ring. It's not horrible per se, I guess better than failing the request, but it means the dispatch lands on a node that won't have the subproblem cached
At least they mention it only happens in rare cases
worth a shot!
b
Yeah, sounds more reliable at least - "that a task has crashed" leads me to think that if it's shut down gracefully it shouldn't be a problem
v
> Deployment order matters. If you have a Service A under servicea.internal, then deploy a Service B under serviceb.internal, Service A will not be able to talk to Service B (DNS resolution error will happen) until Service A containers are restarted. I find this behaviour quite irritating, as dynamic creation of new ECS services then require restarting of other services if you want those to be able to "discover" the new one.

yikes
b
Shouldn't be a problem in this case, unless I'm misunderstanding
v
no, it was just surprising, given that service discovery is supposed to be about dynamically adapting to everything
if you deploy SpiceDB, and it already exists, every new service should be able to see it
if for whatever reason you had to create a new SpiceDB service (e.g. dynamically provisioning them for testing purposes, or doing A/B testing), then now you need to restart every service client
which sounds horrible
y
yeah this is the approach we're taking to the problem 😛