# spicedb
m
Hello - loving SpiceDB, it's superb and incredibly powerful. We have had it running in production for a few months now, but recently expanded our use of it, meaning our setup is having to work harder. We are running the instances on ECS, backed by the MySQL datastore - is there a reference for the ideal amount of memory per SpiceDB instance? I appreciate this will depend on the use case, but with a reference we could tell whether we are doing something wrong.
v
Can you describe your setup in more detail? Type of workload (API calls and RPS), consistency requested, number of SpiceDB nodes? Configuration of your SpiceDB? Compute / memory allocated? At first glance, the problem I anticipate is that you are not running the dispatch ring because you run on ECS, which means that SpiceDB nodes won't talk to each other and cache reuse will be fairly low. I've seen folks deploy to ECS and be happy with their perf, but as soon as you ramp up, you are going to be missing a key scalability property of SpiceDB. The dispatch ring on ECS is unfortunately not super reliable, since ECS does not have an accurate service discovery API like the Kube API provides. I tried introducing a resolver for AWS CloudMap (https://github.com/authzed/spicedb/pull/1620), only to discover it's just another poll-based mechanism.
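To make the DNS angle concrete, here's a minimal sketch (the hostname is a placeholder; 50053 is the usual dispatch port) of what DNS-based peer discovery actually sees - the records only reflect tasks once DNS catches up, which is exactly the staleness problem:

```python
# Minimal illustration of DNS-based peer discovery for the dispatch ring.
# The hostname is hypothetical; 50053 is SpiceDB's default dispatch port.
import socket

def resolve_dispatch_peers(service_name: str, port: int = 50053) -> list[str]:
    """Return the A/AAAA records currently visible for the service name."""
    infos = socket.getaddrinfo(service_name, port, proto=socket.IPPROTO_TCP)
    # Each entry's sockaddr starts with the peer's IP address.
    return sorted({info[4][0] for info in infos})

if __name__ == "__main__":
    # A freshly started or just-terminated ECS task won't show up here until
    # the DNS record is updated and re-resolved, which is the staleness
    # problem described above.
    print(resolve_dispatch_peers("spicedb.internal.example"))
```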
m
Thanks for the response, sorry for the delay in replying - I wanted to get the right information first. The setup is fairly simple at the moment, and we're mainly looking for reliability. A PHP/Laravel-based API, also on ECS, makes calls to the SpiceDB server via Service Connect over HTTP (we had trouble getting gRPC working, plus a lack of existing PHP libraries).

* 150k requests over a 3 hour period
* 28k peak RPS (anomaly, is this normal?)
* 1-2k average RPS
* 1x SpiceDB node (however, we'd like to set up autoscaling - with or without the cache ring, if possible)
* 1 vCPU, 8GB memory

We previously used 2GB memory, but it ate all available RAM within a few hours (it looked almost like a memory leak). Is that normal? We're more concerned about redundancy than performance at this stage. The SpiceDB setup is already significantly faster than what it replaced in our use case. Some questions for you:

* What is a normal amount of resources for a node?
* Is it possible to set up autoscaling if we don't mind about the cache ring (just for redundancy)?
* If we want the cache ring but are okay with less reliability, does it actually work? How can we tell if it's enabled correctly? (We did try, but it was hard to tell.)
* We do seem to get spikes in RPS - is that normal?

One other area of note is that we took an existing (bad) PHP library and improved it in some areas, but we have no idea if it follows best practice, so there is an element of unknown there.
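For context, the calls we make over HTTP look roughly like this (a simplified sketch - the host, port, token, and object types below are placeholders rather than our real values):

```python
# Rough sketch of a CheckPermission call via SpiceDB's HTTP gateway.
# Host, port, token, and object types are placeholders.
import json
import urllib.request

SPICEDB_HTTP = "http://spicedb.internal.example:8443"  # assumed HTTP gateway address
TOKEN = "sometoken"  # the preshared key configured on the server

def check_permission(resource_type, resource_id, permission, subject_type, subject_id):
    body = {
        # minimizeLatency favors cached revisions; other consistency options
        # exist (atLeastAsFresh, fullyConsistent) depending on your needs.
        "consistency": {"minimizeLatency": True},
        "resource": {"objectType": resource_type, "objectId": resource_id},
        "permission": permission,
        "subject": {"object": {"objectType": subject_type, "objectId": subject_id}},
    }
    req = urllib.request.Request(
        f"{SPICEDB_HTTP}/v1/permissions/check",
        data=json.dumps(body).encode(),
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        # e.g. "PERMISSIONSHIP_HAS_PERMISSION"
        return json.load(resp)["permissionship"]

if __name__ == "__main__":
    print(check_permission("document", "somedoc", "view", "user", "alice"))
```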
v
Thanks for sharing:
- Would you mind also describing the flags/env variables passed to SpiceDB?
- What kind of API requests are you performing? Checks? Lookups?
- Which version are you running?
- 1-2K RPS to a single SpiceDB node could make it run a bit too hot, but it depends on the use case. If there is a good cache hit ratio, it's possible that it will work fine.
- 1 vCPU seems too low to handle that many concurrent requests, and I suspect that's where the reliability issues come from. You want at the very least 2 vCPU. Basically the Go runtime will have 1 OS thread that it has to share with the garbage collector. At 1-2K RPS that's a lot of goroutines scheduled onto a single OS thread, which is likely not optimal. I'd suggest looking at the Prometheus metrics SpiceDB exposes, particularly the Go runtime scheduler latency, which would indicate whether there is enough compute to serve requests reliably (a sketch of how to pull that metric follows below).

Re 28K peak RPS: that is almost certainly way too much for 1 vCPU.
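As a rough sketch of what I mean by checking scheduler latency (the host is a placeholder, the metrics listener is assumed to be on its default :9090, and the exact metric name may differ by version):

```python
# Sketch: pull the Go scheduler latency histogram from SpiceDB's Prometheus
# endpoint. Host/port and the exact metric name are assumptions (recent Go
# collectors expose go_sched_latencies_seconds); adjust for your deployment.
import urllib.request

METRICS_URL = "http://spicedb.internal.example:9090/metrics"

def scheduler_latency_lines(url: str = METRICS_URL) -> list[str]:
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode()
    return [line for line in text.splitlines() if "sched_latencies" in line]

if __name__ == "__main__":
    # Large counts in the upper histogram buckets mean goroutines are waiting
    # a long time to be scheduled, i.e. the node needs more CPU.
    for line in scheduler_latency_lines():
        print(line)
```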
> What is a normal amount of resources for a node?

It really depends on your workload and schema. You want to look at the Prometheus metrics to get a sense of whether the service has enough compute.

> Is it possible to set up autoscaling if we don't mind about the cache ring (just for redundancy)?

Yes, totally. But do note that an autoscaling process that adds/removes nodes frequently can be detrimental depending on your workload. If you are running at 1-2K RPS on a very hot node, the cache hit ratio is likely to be fairly high. The moment you add a new cold-cached node, you are going to see spikes in latency and may even overload the database. But it all depends on your workload - better to give it a go; there is nothing technical that would prevent it.

> If we want the cache ring but are okay with less reliability, does it actually work? How can we tell if it's enabled correctly? (We did try, but it was hard to tell.)

It does. At the very least you can use DNS service discovery for the hash ring load-balancer, but if you add autoscaling to the mix that adds/removes nodes frequently, you should expect dispatch failures because the dispatch ring does not update its peers quickly enough. To tell if it's enabled you can either enable debug logs or look at the dispatch metrics on the Prometheus endpoint (see the sketch below).

> We do seem to get spikes in RPS - is that normal?

That seems specific to your client application, the one issuing the spikes 😅
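A quick sketch of the "look at dispatch metrics" suggestion (same assumed :9090 metrics endpoint and placeholder host; exact metric family names vary by version, so this just filters on anything dispatch-related and reports counters that moved between two scrapes):

```python
# Sketch: check whether cross-node dispatching is actually happening by
# scraping the (assumed) :9090 metrics endpoint twice and reporting which
# dispatch-related counters increased in between.
import time
import urllib.request

METRICS_URL = "http://spicedb.internal.example:9090/metrics"

def dispatch_counters(url: str = METRICS_URL) -> dict[str, float]:
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode().splitlines()
    counters = {}
    for line in lines:
        if line.startswith("#") or "dispatch" not in line:
            continue
        name, _, value = line.rpartition(" ")
        try:
            counters[name] = float(value)
        except ValueError:
            pass
    return counters

if __name__ == "__main__":
    before = dispatch_counters()
    time.sleep(10)
    after = dispatch_counters()
    # Counters that grow under load indicate the ring is dispatching requests.
    for name, value in after.items():
        if value > before.get(name, 0):
            print(f"{name}: {before.get(name, 0)} -> {value}")
```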
> We previously used 2GB memory, but it ate all available RAM within a few hours (looked almost like a memory leak). Is that normal?

It depends - which version are you running? SpiceDB uses an automatic mechanism to configure the cache sizes. If it's unable to infer the available memory from the execution environment, it may set them to the wrong values. In that case it's better to tune the cache sizes manually (a rough sizing sketch follows below). Caches expire by default after 2x the quantization window, so entries should expire within 10 seconds.
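As a rough illustration of manual tuning (the percentages are arbitrary, and the flag names in the comments are what I'd expect to pass but should be double-checked against `spicedb serve --help` for your version), splitting a container's memory budget across the caches might look like:

```python
# Rough arithmetic for manually sizing SpiceDB's in-memory caches when the
# automatic sizing can't see the container limit. The 70%-of-memory headroom
# and the per-cache split are assumptions, not recommendations.
GIB = 1024 ** 3

def cache_budget(container_memory_gib: float, cache_fraction: float = 0.7) -> dict[str, int]:
    """Split a fraction of container memory across the main caches."""
    budget = container_memory_gib * GIB * cache_fraction
    return {
        # e.g. --dispatch-cache-max-cost
        "dispatch_cache_bytes": int(budget * 0.5),
        # e.g. --dispatch-cluster-cache-max-cost (only relevant with the dispatch ring)
        "cluster_dispatch_cache_bytes": int(budget * 0.35),
        # e.g. --ns-cache-max-cost (schema/namespace cache; usually small)
        "namespace_cache_bytes": int(budget * 0.15),
    }

if __name__ == "__main__":
    # For the 8 GiB task described above:
    for name, size in cache_budget(8).items():
        print(f"{name}: {size / GIB:.2f} GiB")
```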
s
@vroldanbet Hi, just wanted to ask: in your experience, have you seen any cases of running SpiceDB on ECS in production, either with the cache ring (using Cloud Map, as I saw in a couple of threads) or without it (given the service discovery problem)? How well are they performing? Do we have any specific numbers/stats on how well such a cluster runs? For our use case we only have probably < 100 RPS, but we want query latency to be fast (maybe < 50ms per call) and high availability. Is it OK to run SpiceDB on ECS and still achieve the goals above?
v
My understanding is that CloudMap is no better than DNS resolution, so you get delays as instances come and go. We haven't tested it at all, and it's in fact not even supported - it requires some code changes, which I punted on because I saw no value in using CloudMap over DNS resolution. The consequence is that you may get failed requests hitting pods that are down.
We don't have any experience running in ECS
It should be fine but you'd need to use standard DNS resolution, which has the side effects I noted above.