# spicedb
c
We're dealing with some performance issues that seem to be correlated with database (CockroachDB) pooling behavior, and we're seeing some general oddities in the metrics.

1. We have max connections set to 20 (confirmed by the startup configuration log), but the metrics only show 15.
2. Any time the pool acquires a connection there is a long (typically multi-second) impact on request latency. Idle connections never seem to hit 0, so that's not what I would expect.
3. Jitter defaults to 0 out of the box, and I saw one point in our metrics where all of the connections rotated simultaneously. Looking through the SpiceDB configuration code, I think this is a null-vs-default-value issue: https://github.com/authzed/spicedb/blob/8db642ad59a79b25df201e49b107ff90ad972e51/internal/datastore/postgres/common/pgx.go#L273 checks for nil, but I believe what's coming in is a default value (Duration(0)) from https://github.com/authzed/spicedb/blob/8db642ad59a79b25df201e49b107ff90ad972e51/pkg/cmd/datastore/datastore.go#L46 (see the sketch after the config dump below).
```
ReadConnPool
  {
    HealthCheckInterval 30000
    MaxIdleTime 3600000
    MaxLifetime 1800000
    MaxLifetimeJitter 300000
    MaxOpenConns 20
    MinOpenConns 20
  }
```
Note: these are LookupResources requests.
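A minimal Go sketch of the nil-vs-zero-value pattern described in point 3 above. The struct, function, and fallback behavior are illustrative assumptions for discussion, not SpiceDB's actual code; the point is only that a nil-only check cannot distinguish "flag never set" from "flag set to 0", which would leave every connection rotating at exactly `MaxLifetime`.

```go
package main

import (
	"fmt"
	"time"
)

// poolOptions stands in for the options struct that the flag layer hands to
// the pool builder. The field names mirror the config dump above, but the
// struct itself is illustrative, not SpiceDB's.
type poolOptions struct {
	MaxLifetime       *time.Duration
	MaxLifetimeJitter *time.Duration // nil = "never provided", 0 = "explicitly no jitter"
}

// effectiveJitter mimics a nil-only check: it applies whatever value is
// present, even if that value is zero. If the CLI layer always populates the
// field (defaulting it to 0 rather than leaving it nil), the fallback branch
// is never reached and every connection rotates at exactly MaxLifetime.
func effectiveJitter(opts poolOptions, lifetime time.Duration) time.Duration {
	if opts.MaxLifetimeJitter != nil {
		return *opts.MaxLifetimeJitter
	}
	// Hypothetical fallback: default the jitter to a fraction of the lifetime.
	return lifetime / 6
}

func main() {
	zero := time.Duration(0)
	lifetime := 30 * time.Minute

	// Value coming from a flag layer that fills in a zero default.
	fromFlags := poolOptions{MaxLifetime: &lifetime, MaxLifetimeJitter: &zero}
	fmt.Println("jitter with zero-value default:", effectiveJitter(fromFlags, lifetime)) // 0s -> synchronized rotation

	// Value that was genuinely never set.
	unset := poolOptions{MaxLifetime: &lifetime}
	fmt.Println("jitter when left nil:", effectiveJitter(unset, lifetime)) // 5m0s fallback
}
```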
j
> however the metrics are only showing 15
a portion are reserved for writes
c
That still doesn't make a ton of sense though. That would indicate we have 10 connections for read and 5 for write. It's explicitly configured for 20 read and 10 write.
v
Finally, CRDB has a very specific formula for the number of connections based on the cluster size. Are you running OSS on-prem, Dedicated, or Serverless? We've observed that the smallest Dedicated cluster you can get has fairly poor performance, and even CRL suggests it shouldn't be used for production systems. I believe Serverless can handle more connections because of its horizontally scalable, decoupled connection-handling proxy layer, but to be fair we haven't run exhaustive load tests on Serverless to really recommend it. https://www.cockroachlabs.com/docs/stable/recommended-production-settings#sizing
> For cluster stability, Cockroach Labs recommends a minimum of 8 vCPUs, and strongly recommends no fewer than 4 vCPUs per node. In a cluster with too few CPU resources, foreground client workloads will compete with the cluster's background maintenance tasks. For more information, see capacity planning issues.
https://www.cockroachlabs.com/docs/stable/recommended-production-settings#connection-pooling
> The total number of workload connections across all connection pools should not exceed 4 times the number of vCPUs in the cluster by a large amount.
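As a quick worked example of the guideline quoted above; the cluster shape below is made up for illustration, not this deployment's actual sizing:

```go
package main

import "fmt"

func main() {
	// Guideline: total workload connections across every pool should not
	// greatly exceed 4 x cluster vCPUs.
	nodes := 3
	vcpusPerNode := 4
	budget := 4 * nodes * vcpusPerNode // 48 connections for the whole cluster

	// Pools from this thread: 20 read + 10 write per SpiceDB instance.
	perInstance := 20 + 10
	fmt.Printf("connection budget: %d, per instance: %d, instances within budget: %d\n",
		budget, perInstance, budget/perInstance)
}
```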
What's the scrape interval of the metrics? If the pools only get drained for a second or two, that won't show up; we've seen that in the past. There are two factors that can lead to connection pools being drained momentarily:

- Connection lifecycle: jitter is disabled by default; we recommend setting it to the lifetime duration of the connections. I think we should probably add that as a default.
- Quantization window elapsing: when the instance is running hot, there can be a thundering herd of database queries at the moment the quantization window elapses, because effectively all of the caching that was taking place disappears. For this we introduced `--datastore-revision-quantization-max-staleness-percent`, which is like a jitter, but for the quantization window. It's set to 10% by default, but increasing it can help a lot. If you set it to 100%, your max staleness can go from 5s to 10s, so only set it if that's acceptable to you. If you have more staleness headroom, consider increasing the quantization window itself (see the worked example after this message).

I also recommend running the latest version of SpiceDB, 1.29.0, and enabling `--enable-experimental-watchable-schema-cache`. Set the heartbeat to something like `2s`. That will eliminate quite a few queries from the database and thus reduce contention on the connection pools.
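To make the staleness arithmetic above concrete, here is a small Go sketch. It is a simplified model of how the two knobs combine, not SpiceDB's actual revision-selection code; it just reproduces the 5s-to-10s example from the message:

```go
package main

import (
	"fmt"
	"time"
)

// maxStaleness models the upper bound on staleness given the quantization
// window and --datastore-revision-quantization-max-staleness-percent.
func maxStaleness(window time.Duration, stalenessPercent float64) time.Duration {
	return window + time.Duration(float64(window)*stalenessPercent/100.0)
}

func main() {
	window := 5 * time.Second
	fmt.Println("default 10%:", maxStaleness(window, 10))  // 5.5s
	fmt.Println("100%:       ", maxStaleness(window, 100)) // 10s
	fmt.Println("300%:       ", maxStaleness(window, 300)) // 20s (the value used later in this thread)
}
```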
c
Awesome, thanks for all of this! We're running OSS at the moment, but switching to cloud soon. All of our evaluations are at_least_as_fresh_as, so I already have `datastore-revision-quantization-max-staleness-percent = 300`. Would we be better off with a larger window and max-staleness-percent, or the inverse? I'll give 1.29 a try!
that has adjustable diagrams for how caching knobs impact
v
Both will increase the odds of a cache hit, so it depends on your use case. Are you using LookupResources with a cursor? Because if your dataset is large enough, you are basically loading everything into memory and then streaming it, which can keep connection pools busy for a long period of time.
It's better to use cursors and ask for small batches
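A hedged sketch of what cursored, small-batch LookupResources could look like with the authzed-go client. The endpoint, token, and `document`/`view`/`user` names are placeholders, and the cursor handling (`OptionalLimit`, `OptionalCursor`, `AfterResultCursor`) reflects my understanding of the cursored LookupResources API; check the client documentation for your version:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const pageSize = 20

func main() {
	// Endpoint and token are placeholders for a local, insecure dev setup.
	client, err := authzed.NewClient(
		"localhost:50051",
		grpcutil.WithInsecureBearerToken("sometoken"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}

	var cursor *v1.Cursor
	for {
		// Ask for a small page; the cursor is nil on the first request.
		stream, err := client.LookupResources(context.Background(), &v1.LookupResourcesRequest{
			ResourceObjectType: "document",
			Permission:         "view",
			Subject: &v1.SubjectReference{
				Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "alice"},
			},
			OptionalLimit:  pageSize,
			OptionalCursor: cursor,
		})
		if err != nil {
			log.Fatal(err)
		}

		received := 0
		for {
			resp, err := stream.Recv()
			if err == io.EOF {
				break
			}
			if err != nil {
				log.Fatal(err)
			}
			fmt.Println("found:", resp.ResourceObjectId)
			cursor = resp.AfterResultCursor // remember where this page ended
			received++
		}

		// A short page means there is nothing left to fetch.
		if received < pageSize {
			break
		}
	}
}
```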
c
Thanks @Joey, I've played with that before. @vroldanbet, we're using LookupResources without a cursor; however, we've architected our model to keep the number of results that come back low (typically fewer than 20). Do you have a document on cache tuning? I don't have a good understanding of what's held by the namespace, dispatch, and cluster dispatch caches, or whether our num-counters or max-cost settings are appropriate for our data set. Given that we're using at_least_as_fresh_as for all of our evals and not updating the ZedTokens very frequently, I'd love to be able to answer nearly every question from cache, for potentially many hours.
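For reference, this is roughly what pinning a check to a stored ZedToken with at_least_as_fresh looks like in Go with authzed-go; the resource and subject names are placeholders:

```go
package main

import (
	"context"
	"fmt"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"github.com/authzed/authzed-go/v1"
)

// checkAtLeastAsFresh runs a permission check pinned to a previously stored
// ZedToken: SpiceDB may answer from any revision at least as fresh as the
// token, which is what makes cached results usable across requests.
func checkAtLeastAsFresh(ctx context.Context, client *authzed.Client, storedToken string) (bool, error) {
	resp, err := client.CheckPermission(ctx, &v1.CheckPermissionRequest{
		Consistency: &v1.Consistency{
			Requirement: &v1.Consistency_AtLeastAsFresh{
				AtLeastAsFresh: &v1.ZedToken{Token: storedToken},
			},
		},
		Resource:   &v1.ObjectReference{ObjectType: "document", ObjectId: "readme"},
		Permission: "view",
		Subject: &v1.SubjectReference{
			Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "alice"},
		},
	})
	if err != nil {
		return false, err
	}
	return resp.Permissionship == v1.CheckPermissionResponse_PERMISSIONSHIP_HAS_PERMISSION, nil
}

func main() {
	// Client construction is the same as in the earlier LookupResources sketch;
	// checkAtLeastAsFresh is shown here as a standalone helper.
	fmt.Println("wire up an authzed.Client and call checkAtLeastAsFresh with your stored ZedToken")
}
```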
v
right now the cache has a TTL of 2x the quantization window
so even if there weren't any mutations to the system, your cached entries are going to go away within that 2x quantization window
and no, we don't have a cache tuning guide available
it used to have no TTL and would keep things around until they were evicted from memory
you also have to consider that there is a GC window in CockroachDB
so snapshot revisions will eventually disappear
so if you are holding onto a ZedToken for hours, you need to configure your CRDB cluster accordingly
c
Sounds good, I'll keep tweaking and see what happens. We have the GC window at 24h right now.
v
The default is 4 hours in Dedicated, 1.25h in Serverless
c
Are there any metrics to see the utilization of the caches? Mostly trying to see if we're filling up the counters.
There are metrics for added/removed, but I didn't see current usage
v
there are metrics
but honestly, for LookupResources that return 20 elements, I feel like there is something else up
Have you looked into OpenTelemetry traces? They will tell you where time is being spent
waiting seconds for a new connection does not sound right
LookupResources is also a fairly expensive computation; have you considered doing BulkPermissionCheck?
are you doing this for filtering?
you can certainly focus on the caches, but we've seen customers use it with the default 5s window, large datasets, and LR calls without issues
so it really feels like something else is at play
c
Yeah, I'm suspicious of our cockroach cluster performance as well. We're moving to a dedicated cloud instance next week, so that will hopefully improve things there. Just wanted to do all we can on the caching layer while the cluster gets spun up. Thanks for all of the help, y'all are great!
v
anytime! good luck with the migration 🙂