# spicedb
p
👋 Hello folks, at work we've deployed SpiceDB and it has been handling our production load for some time already. Lately the p95 latency has been skyrocketing, and after scaling up both SpiceDB and the underlying CockroachDB we have seen little benefit. We're starting to wonder how we can tune SpiceDB. It looks to me like SpiceDB is having trouble acquiring a connection from the pool to connect to CockroachDB (see image attached: the parent span is 843ms, the actual query took 41ms).
1. Are there any metrics I can look at to see how SpiceDB's conn pool is behaving?
2. Is it advised to modify any configuration when facing this kind of problem?
Thank you so much! https://cdn.discordapp.com/attachments/844600078948630559/1259786482554765352/CleanShot_2024-07-08_at_09.18.27.png?ex=668cf317&is=668ba197&hm=18397d519e4ee5b4e59f63e21729e5c7f086daec6e6b01c0aea5a0af3570a86f&
v
hi 👋 there are exhaustive connection pool metrics, including CockroachDB-specific metrics, that would allow you to see what's going on. A 41ms query inside an 843ms parent span certainly shows contention to acquire a connection; this most frequently happens when doing very wide dispatches (i.e. relations with many, many relationships). Please note that connections are not free: they have overhead on the SpiceDB side and on the CockroachDB side, so you want to make sure they are tuned accordingly. For example, it's important to add jitter with `--datastore-conn-pool-write-max-lifetime-jitter` and `--datastore-conn-pool-read-max-lifetime-jitter`, as otherwise all connections expire at the same time. You want to enable connection balancing for CockroachDB (`--datastore-connection-balancing`) as well. If you don't care much about the new enemy problem, you should enable `--datastore-tx-overlap-strategy=insecure` for CockroachDB.
I assume you are running SpiceDB in cluster mode (dispatching enabled) and that there is enough compute to support the amount of dispatch fanout your app is likely generating. I'd typically take an overall look at the metrics looking for smoking guns: make sure there is enough compute, check cache hit ratios, and check the types of requests (avoid full consistency as much as possible). You want to make sure you do cursored `LookupResources`, use the `at_least_as_fresh` or `minimize_latency` consistency levels as much as possible, look into tuning the staleness parameters (`--datastore-revision-quantization-max-staleness-percent` and `--datastore-revision-quantization-interval`), and look into concurrency limits (which would cap your cluster's ability to fan out dispatches). Definitely also give `--enable-experimental-watchable-schema-cache` a go, since it will likely reduce the number of queries necessary per API call.
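To make the jitter point concrete, here is a small, self-contained simulation. It is only a sketch and not SpiceDB code: it assumes each pooled connection lives for a fixed maximum lifetime plus a random extra amount of up to the jitter value, and the pool size, lifetime, and jitter numbers are made up for illustration.

```python
import random


def expiry_times(num_conns, max_lifetime_s, jitter_s, seed=0):
    """Return the age (in seconds) at which each connection gets closed.

    Assumption: a connection lives for max_lifetime_s plus a random extra
    amount drawn uniformly from [0, jitter_s].
    """
    rng = random.Random(seed)
    return [max_lifetime_s + rng.uniform(0, jitter_s) for _ in range(num_conns)]


def worst_burst(times, window_s=1.0):
    """Largest number of connections that expire within any window_s-second window."""
    times = sorted(times)
    best, start = 0, 0
    for end, t in enumerate(times):
        # Slide the window start forward until it spans at most window_s seconds.
        while t - times[start] > window_s:
            start += 1
        best = max(best, end - start + 1)
    return best


# Hypothetical pool: 50 connections, 30 minute max lifetime.
no_jitter = expiry_times(num_conns=50, max_lifetime_s=1800, jitter_s=0)
with_jitter = expiry_times(num_conns=50, max_lifetime_s=1800, jitter_s=300)

print("worst 1s expiry burst without jitter:", worst_burst(no_jitter))
print("worst 1s expiry burst with jitter:   ", worst_burst(with_jitter))
```

Without jitter every connection in this toy pool expires in the same instant, so all of them have to be re-established at once while requests queue up waiting to acquire one. With jitter the expirations are spread over the jitter window.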
p
Thanks @vroldanbet !
About:
> I assume you are running SpiceDB in cluster mode
I'm running the SpiceDB operator, and it creates a bunch of replicas for SpiceDB itself. Do you know if cluster mode is activated by default in the operator?
> If you don't care about the new enemy problem much, you should enable `--datastore-tx-overlap-strategy=insecure`
We're using ZedTokens for most checkrelationships/bulkcheckrelationships. In my understanding that should help with the new enemy problem, right? I'm gonna try this one too!
v
> I'm running the SpiceDB operator, and it creates a bunch of replicas for SpiceDB itself. Do you know if cluster mode is activated by default in the operator?
If you use the operator you are getting clustering enabled, yep.
> We're using ZedTokens for most checkrelationships/bulkcheckrelationships. In my understanding that should help with the new enemy problem, right? I'm gonna try this one too!
For CockroachDB in particular, ZedTokens alone won't help if `tx-overlap-strategy` is `insecure` (that's why I said to only use it if you don't care about the new enemy problem). Cockroach transactions are not externally consistent in general, only for data within the same range. For that reason transaction overlap was introduced, only for the CockroachDB datastore, to enforce that transactions are externally consistent, which comes with a performance penalty during writes. I recommend you read https://github.com/authzed/spicedb/blob/5ed2d0f28b42da82fad93601fec49c41a2d03a0e/internal/datastore/crdb/README.md to determine which strategy makes sense for your workload. Please note this is mostly relevant for writes, not so much for reads.
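As a rough mental model of that trade-off, the sketch below treats each write as "touching" a set of overlap keys, with two writes forced to serialize whenever their key sets intersect. This is a toy illustration and not SpiceDB's implementation: the strategy names other than `insecure` (`static`, `prefix`, `request`) are taken from my reading of the linked README, and the key values and function names are hypothetical.

```python
def overlap_keys(strategy, object_types, request_key=None):
    """Return the set of keys a write would touch under a strategy (toy model)."""
    if strategy == "static":
        # Every write touches the same key, so all writes overlap.
        return {"static-key"}  # hypothetical constant
    if strategy == "prefix":
        # Only writes sharing an object-type prefix (the part before "/") overlap.
        # Unprefixed types are not modeled here and are simply skipped.
        return {t.split("/", 1)[0] for t in object_types if "/" in t}
    if strategy == "request":
        # The caller chooses the key per request; with no key nothing overlaps.
        return {request_key} if request_key else set()
    if strategy == "insecure":
        # No writes overlap: best write throughput, no new-enemy protection.
        return set()
    raise ValueError(f"unknown strategy: {strategy}")


def must_serialize(keys_a, keys_b):
    """Two writes are forced to serialize if they touch a common key."""
    return bool(keys_a & keys_b)


write_a = ["tenant1/user", "tenant1/document"]
write_b = ["tenant2/user"]

for strategy in ("static", "prefix", "insecure"):
    forced = must_serialize(
        overlap_keys(strategy, write_a), overlap_keys(strategy, write_b)
    )
    print(f"{strategy}: writes forced to serialize = {forced}")
```

The more writes share a key, the stronger the ordering guarantee and the higher the write contention, which is why the choice mostly matters for write-heavy workloads.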
p
Thank you so much!