# spicedb
p
we're currently having some problems with our SpiceDB deployment (running SpiceDB v1.19 with RDS Postgres). There's been a significant uptick in IOPS all of a sudden, with no corresponding increase in traffic to the service in front of SpiceDB. This has exhausted our IOPS budget, and CPU is spiking to 100% because IOPS is being throttled.

I'm not able to figure out why the IOPS is so high all of a sudden. From RDS' dashboards, I can see that the GC query is the one taking the most time (screenshots in thread), but I don't know what changed to make this happen now. GC has been running relatively frequently for months without anything like this happening.

Has anyone had this happen to them before / could help me troubleshoot?
indexes look fine, they should be able to support this query
looks like this coincides with AWS applying a maintenance patch to the postgres instance. Do the SpiceDB pods need to be restarted if the underlying postgres instance restarts/disconnects temporarily?
v
> Do the SpiceDB pods need to be restarted if the underlying postgres instance restarts/disconnects temporarily?

I don't think we have tests for that, but I'd be surprised if it couldn't recover by itself.

As for the spikes, you are likely hitting this: https://github.com/authzed/spicedb/pull/1550

Please note that you need to upgrade to Postgres 15 in order for the indexes to work.
That change is part of https://github.com/authzed/spicedb/releases/tag/v1.26.0 Since you are there, I'd probably recommend moving all the way to 1.29.1 to get a bunch of perf improvements.
p
we're at 1.19 right now
is the index change something that would break for postgres versions <15? ours is at 14 right now
basically trying to understand if the upgrade path to v1.29 requires postgres 15
v
it's not required but if you don't upgrade, the index won't be selected by the query planner. So basically it won't solve your problem.
p
also can confirm that restarting all the pods seems to have fixed it
v
the GC runs on a schedule. By restarting the containers, you stop the query, and the GC will be re-run after a certain period of time
p
I thought the default is every 3 minutes?
v
so you would technically eventually have the issue again
v
```
--datastore-gc-interval duration             amount of time between passes of garbage collection (postgres driver only) (default 3m0s)
--datastore-gc-max-operation-time duration   maximum amount of time a garbage collection pass can operate before timing out (postgres driver only) (default 1m0s)
--datastore-gc-window duration               amount of time before revisions are garbage collected (default 24h0m0s)
```
the thing is that if the query is timing out, you won't be GCing anything
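to make that concrete, here's a rough sketch (not the actual SpiceDB code, just an illustration of how the two flags interact): a pass fires every interval, and each pass gets at most the max operation time before its context is cancelled

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runGCPass stands in for the real GC work; it respects cancellation.
func runGCPass(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	interval := 3 * time.Minute         // --datastore-gc-interval default
	maxOperationTime := 1 * time.Minute // --datastore-gc-max-operation-time default

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), maxOperationTime)
		if err := runGCPass(ctx); err != nil {
			// a timed-out pass frees nothing, so the next pass has even more to do
			fmt.Println("gc pass failed:", err)
		}
		cancel()
	}
}
```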
check `gc_failure_total`
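if you just want to eyeball it without a dashboard, something like this works against SpiceDB's Prometheus endpoint (assuming the default `--metrics-addr` of `:9090`; the exported name may carry a prefix, so this just greps for the substring)

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// assumes the default metrics listener; adjust host/port if you changed --metrics-addr
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// print any GC failure counter, whatever prefix it's exported under
		if strings.Contains(line, "gc_failure") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
}
```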
p
but the query that runs has a `DELETE` clause in it, which should delete the data even if the query times out for the client, right?
v
it's not only a `DELETE`, it has a subquery to select the set of elements to delete
actually I think that's not correct, may be a different datastore
no, it's correct
```go
query := fmt.Sprintf(`WITH rows AS (%[1]s)
          DELETE FROM %[2]s
          WHERE (%[3]s) IN (SELECT %[3]s FROM rows);
```
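for illustration only, the same batched shape against a throwaway table (made-up table/column names, not SpiceDB's actual schema or helpers) looks roughly like this

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // any postgres driver works here
)

// batchDelete removes up to `limit` rows older than `cutoff` per call, using the
// WITH rows AS (...) DELETE ... WHERE (...) IN (SELECT ... FROM rows) shape above.
func batchDelete(ctx context.Context, db *sql.DB, cutoff time.Time, limit int) (int64, error) {
	selectQuery := fmt.Sprintf("SELECT id FROM my_deleted_rows WHERE deleted_at < $1 LIMIT %d", limit)
	query := fmt.Sprintf(`WITH rows AS (%[1]s)
        DELETE FROM %[2]s
        WHERE (%[3]s) IN (SELECT %[3]s FROM rows);`,
		selectQuery, "my_deleted_rows", "id")

	res, err := db.ExecContext(ctx, query, cutoff)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}

func main() {
	// usage sketch: connect and delete in batches until nothing is left
	db, err := sql.Open("pgx", "postgres://user:pass@localhost:5432/mydb")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	cutoff := time.Now().Add(-24 * time.Hour) // mirrors the default GC window
	for {
		n, err := batchDelete(context.Background(), db, cutoff, 1000)
		if err != nil || n == 0 {
			break
		}
	}
}
```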
this GC issue usually manifests when there has been some sort of bulk ingestion
so lots of revisions becoming deleted and the GC starting to take an increasing amount of time to do its job
p
and the elements being selected are based on the current txid from what I understand - so wouldn't the success of the query on PG's side be enough, even if the result times out for the client?
but in any case I guess we need to set some time to upgrade our PG to 15/16 and then apply the spicedb updates
v
they are based on the current revision, determining anything that falls below `current_revision - GC_WINDOW`, which is 24h by default
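as a concrete example (timestamps just for intuition; the real cutoff is computed against transaction ids/revisions, not wall-clock time)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	gcWindow := 24 * time.Hour // --datastore-gc-window default
	deletedAt := time.Date(2024, time.March, 4, 9, 0, 0, 0, time.UTC)

	// a row marked deleted only becomes a GC candidate once the window has elapsed
	eligibleAt := deletedAt.Add(gcWindow)
	fmt.Println("row deleted at", deletedAt, "becomes a GC candidate at", eligibleAt)
}
```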
p
ahhh got it
v
I don't recall the exact sequence that leads to failures. It didn't actually fail all the time; sometimes it failed, other times it was fine. The problem is that the query consumed an absurd amount of resources, making everything else going on in your cluster super slow
so you'd see latencies spike > 1s, and if clients set deadlines, you'll start to see deadline errors everywhere
p
yeah, I'm actually seeing latency spikes > 1s for WriteRelationships semi-frequently
Writes and Deletes from what I can see
not semi-frequently, my bad: it's rare but does happen
v
so it's difficult to assess the impact, because it depends on the current size of the relation_tuple table, the number of revisions that are marked as deleted, and the capacity of your cluster
so there's no uniform point at which this starts to manifest. But I'm pretty certain this is a problem that PG clusters can hit at some point.
p
just to confirm: for the upgrade path, if I upgrade spicedb first, and then PG to 15/16, it should automatically start using the indices right? I don't want to upgrade PG first and risk breaking something in the application
v
correct
the worst that can happen is that the index is not selected, just as it happens right now in your deployment anyway
p
got it, that helps a lot. thank you!