# spicedb
c
We're dealing with some performance issues that seem to be correlated with database (CockroachDB) pooling behavior, and we're seeing some general oddities in the metrics.

1. We have max connections set to 20 (confirmed by the startup configuration log), but the metrics only show 15.
2. Any time the pool acquires a connection there is a long (typically multi-second) impact on request latency. Idle connections never seem to hit 0, so that's not what I would expect.
3. Jitter defaults to 0 out of the box, and I saw one point in our metrics where all of the connections rotated simultaneously. Looking through the SpiceDB configuration code, I think this is a null-vs-default-value issue: https://github.com/authzed/spicedb/blob/8db642ad59a79b25df201e49b107ff90ad972e51/internal/datastore/postgres/common/pgx.go#L273 checks for nil, but I believe what's coming in is a default value (Duration(0)) from https://github.com/authzed/spicedb/blob/8db642ad59a79b25df201e49b107ff90ad972e51/pkg/cmd/datastore/datastore.go#L46 (see the sketch after the config dump below).
```
ReadConnPool
  {
    HealthCheckInterval 30000
    MaxIdleTime 3600000
    MaxLifetime 1800000
    MaxLifetimeJitter 300000
    MaxOpenConns 20
    MinOpenConns 20
  }
```
Note: these are LookupResources requests.
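A minimal Go sketch of the nil-vs-zero-value pattern described in point 3 above. The struct, function, and fallback behavior are illustrative assumptions for discussion, not SpiceDB's actual code; the point is only that a nil-only check cannot distinguish "flag never set" from "flag set to 0", which would leave every connection rotating at exactly `MaxLifetime`.

```go
package main

import (
	"fmt"
	"time"
)

// poolOptions stands in for the options struct that the flag layer hands to
// the pool builder. The field names mirror the config dump above, but the
// struct itself is illustrative, not SpiceDB's.
type poolOptions struct {
	MaxLifetime       *time.Duration
	MaxLifetimeJitter *time.Duration // nil = "never provided", 0 = "explicitly no jitter"
}

// effectiveJitter mimics a nil-only check: it applies whatever value is
// present, even if that value is zero. If the CLI layer always populates the
// field (defaulting it to 0 rather than leaving it nil), the fallback branch
// is never reached and every connection rotates at exactly MaxLifetime.
func effectiveJitter(opts poolOptions, lifetime time.Duration) time.Duration {
	if opts.MaxLifetimeJitter != nil {
		return *opts.MaxLifetimeJitter
	}
	// Hypothetical fallback: default the jitter to a fraction of the lifetime.
	return lifetime / 6
}

func main() {
	zero := time.Duration(0)
	lifetime := 30 * time.Minute

	// Value coming from a flag layer that fills in a zero default.
	fromFlags := poolOptions{MaxLifetime: &lifetime, MaxLifetimeJitter: &zero}
	fmt.Println("jitter with zero-value default:", effectiveJitter(fromFlags, lifetime)) // 0s -> synchronized rotation

	// Value that was genuinely never set.
	unset := poolOptions{MaxLifetime: &lifetime}
	fmt.Println("jitter when left nil:", effectiveJitter(unset, lifetime)) // 5m0s fallback
}
```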
j
> however the metrics are only showing 15
a portion are reserved for writes
c
That still doesn't make a ton of sense though. That would indicate we have 10 connections for read and 5 for write. It's explicitly configured for 20 read and 10 write.
v
Finally, CRDB has a very specific formula for the number of connections based on the cluster size. Are you running OSS on-prem, Dedicated, or Serverless? We've observed that the smallest Dedicated cluster you can get has fairly poor performance, and even CRL suggests it shouldn't be used for production systems. I believe Serverless can handle more connections because of its horizontally scalable, decoupled connection-handling proxy layer, but to be fair we haven't run exhaustive load tests on Serverless to really recommend it. https://www.cockroachlabs.com/docs/stable/recommended-production-settings#sizing
> For cluster stability, Cockroach Labs recommends a minimum of 8 vCPUs, and strongly recommends no fewer than 4 vCPUs per node. In a cluster with too few CPU resources, foreground client workloads will compete with the cluster's background maintenance tasks. For more information, see capacity planning issues.
https://www.cockroachlabs.com/docs/stable/recommended-production-settings#connection-pooling
> The total number of workload connections across all connection pools should not exceed 4 times the number of vCPUs in the cluster by a large amount.
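As a quick worked example of the guideline quoted above; the cluster shape below is made up for illustration, not this deployment's actual sizing:

```go
package main

import "fmt"

func main() {
	// Guideline: total workload connections across every pool should not
	// greatly exceed 4 x cluster vCPUs.
	nodes := 3
	vcpusPerNode := 4
	budget := 4 * nodes * vcpusPerNode // 48 connections for the whole cluster

	// Pools from this thread: 20 read + 10 write per SpiceDB instance.
	perInstance := 20 + 10
	fmt.Printf("connection budget: %d, per instance: %d, instances within budget: %d\n",
		budget, perInstance, budget/perInstance)
}
```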
What's the scrape interval of the metrics? If the pools only get drained for a second or two, that won't show up; we've seen that in the past. There are two factors that can lead to connection pools being drained momentarily:

- Connection lifecycle: jitter is disabled by default; we recommend setting it to the lifetime duration of the connections. I think we should probably add that as a default.
- Quantization window elapsing: when the instance is running hot, there can be a thundering herd of database queries at the moment the quantization window elapses, because effectively all of the caching that was taking place disappears. For this we introduced `--datastore-revision-quantization-max-staleness-percent`, which is like a jitter, but for the quantization window. It's set to 10% by default, but increasing it can help a lot. If you set it to 100%, your max staleness can go from 5s to 10s, so only set it if that's acceptable to you. If you have more staleness headroom, consider increasing the quantization window itself (see the worked example after this message).

I also recommend running the latest version of SpiceDB, 1.29.0, and enabling `--enable-experimental-watchable-schema-cache`. Set the heartbeat to something like `2s`. That will eliminate quite a few queries from the database and thus reduce contention on the connection pools.
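To make the staleness arithmetic above concrete, here is a small Go sketch. It is a simplified model of how the two knobs combine, not SpiceDB's actual revision-selection code; it just reproduces the 5s-to-10s example from the message:

```go
package main

import (
	"fmt"
	"time"
)

// maxStaleness models the upper bound on staleness given the quantization
// window and --datastore-revision-quantization-max-staleness-percent.
func maxStaleness(window time.Duration, stalenessPercent float64) time.Duration {
	return window + time.Duration(float64(window)*stalenessPercent/100.0)
}

func main() {
	window := 5 * time.Second
	fmt.Println("default 10%:", maxStaleness(window, 10))  // 5.5s
	fmt.Println("100%:       ", maxStaleness(window, 100)) // 10s
	fmt.Println("300%:       ", maxStaleness(window, 300)) // 20s (the value used later in this thread)
}
```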
c
Awesome, thanks for all of this! We're running OSS at the moment, but switching to cloud soon. All of our evaluations are at_least_as_fresh_as, so I already have `datastore-revision-quantization-max-staleness-percent = 300`. Would we be better off with a larger window and max-staleness-percent, or the inverse? I'll give 1.29 a try!
that has adjustable diagrams for how caching knobs impact
v
Both will increase the odds of a cache hit, so it depends on your use case. Are you using LookupResources with a cursor? Because if your dataset is large enough, you are basically loading everything into memory and then streaming it, which can keep connection pools busy for a long period of time.
It's better to use cursors and ask for small batches
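A hedged sketch of what cursored, small-batch LookupResources could look like with the authzed-go client. The endpoint, token, and `document`/`view`/`user` names are placeholders, and the cursor handling (`OptionalLimit`, `OptionalCursor`, `AfterResultCursor`) reflects my understanding of the cursored LookupResources API; check the client documentation for your version:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const pageSize = 20

func main() {
	// Endpoint and token are placeholders for a local, insecure dev setup.
	client, err := authzed.NewClient(
		"localhost:50051",
		grpcutil.WithInsecureBearerToken("sometoken"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}

	var cursor *v1.Cursor
	for {
		// Ask for a small page; the cursor is nil on the first request.
		stream, err := client.LookupResources(context.Background(), &v1.LookupResourcesRequest{
			ResourceObjectType: "document",
			Permission:         "view",
			Subject: &v1.SubjectReference{
				Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "alice"},
			},
			OptionalLimit:  pageSize,
			OptionalCursor: cursor,
		})
		if err != nil {
			log.Fatal(err)
		}

		received := 0
		for {
			resp, err := stream.Recv()
			if err == io.EOF {
				break
			}
			if err != nil {
				log.Fatal(err)
			}
			fmt.Println("found:", resp.ResourceObjectId)
			cursor = resp.AfterResultCursor // remember where this page ended
			received++
		}

		// A short page means there is nothing left to fetch.
		if received < pageSize {
			break
		}
	}
}
```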
c
Thanks @Joey, I've played with that before. @vroldanbet, we're using LookupResources without a cursor; however, we've architected our model to keep the number of results that come back low (typically fewer than 20). Do you have a document on cache tuning? I don't have a good understanding of what's held by the namespace, dispatch, and cluster dispatch caches, or whether our num-counters or max-cost settings are appropriate for our data set. Given that we're using at_least_as_fresh_as for all of our evals and not updating the ZedTokens very frequently, I'd love to be able to answer nearly every question from cache, for potentially many hours.
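For reference, this is roughly what pinning a check to a stored ZedToken with at_least_as_fresh looks like in Go with authzed-go; the resource and subject names are placeholders:

```go
package main

import (
	"context"
	"fmt"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"github.com/authzed/authzed-go/v1"
)

// checkAtLeastAsFresh runs a permission check pinned to a previously stored
// ZedToken: SpiceDB may answer from any revision at least as fresh as the
// token, which is what makes cached results usable across requests.
func checkAtLeastAsFresh(ctx context.Context, client *authzed.Client, storedToken string) (bool, error) {
	resp, err := client.CheckPermission(ctx, &v1.CheckPermissionRequest{
		Consistency: &v1.Consistency{
			Requirement: &v1.Consistency_AtLeastAsFresh{
				AtLeastAsFresh: &v1.ZedToken{Token: storedToken},
			},
		},
		Resource:   &v1.ObjectReference{ObjectType: "document", ObjectId: "readme"},
		Permission: "view",
		Subject: &v1.SubjectReference{
			Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "alice"},
		},
	})
	if err != nil {
		return false, err
	}
	return resp.Permissionship == v1.CheckPermissionResponse_PERMISSIONSHIP_HAS_PERMISSION, nil
}

func main() {
	// Client construction is the same as in the earlier LookupResources sketch;
	// checkAtLeastAsFresh is shown here as a standalone helper.
	fmt.Println("wire up an authzed.Client and call checkAtLeastAsFresh with your stored ZedToken")
}
```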
v
right now the cache has a TTL of 2x the quantization window
so even if there weren't any mutations to the system, your cached entries are going to go away within that 2x quantization window
and no, we don't have a cache tuning guide available
it used to have no TTL and would keep things around until they were evicted from memory
you also have to consider that there is a GC window in CockroachDB
so snapshot revisions will eventually disappear
so if you are holding onto a ZedToken for hours, you need to configure your CRDB cluster accordingly
c
Sounds good, I'll keep tweaking and see what happens. We have the GC window at 24h right now.
v
The default is 4 hours in Dedicated, 1.25h in Serverless
c
Are there any metrics to see the utilization of the caches? Mostly trying to see if we're filling up the counters.
There are metrics for added/removed, but I didn't see current usage
v
there are metrics
but honestly, for LookupResources that return 20 elements, I feel like there is something else up
Have you looked into OpenTelemetry traces? They will tell you where time is being spent
waiting seconds for a new connection does not sound right
LookupResources is also a fairly expensive computation; have you considered doing BulkPermissionCheck?
are you doing this for filtering?
you can certainly focus on the caches, but we've seen customers use it with the default 5s window, large datasets, and LR calls without issues
so it really feels like something else is at play
c
Yeah, I'm suspicious of our cockroach cluster performance as well. We're moving to a dedicated cloud instance next week, so that will hopefully improve things there. Just wanted to do all we can on the caching layer while the cluster gets spun up. Thanks for all of the help, y'all are great!
v
anytime! good luck with the migration 🙂