# spicedb
p
we're currently having some problems with our SpiceDB deployment (running SpiceDB v1.19 with RDS Postgres). There's been a significant uptick in IOPS all of a sudden, with no corresponding increase in traffic to the service in front of SpiceDB. This has exhausted our IOPS budget, and CPU is spiking to 100% because IOPS is being throttled.

I'm not able to figure out why the IOPS is so high all of a sudden. From RDS' dashboards, I can see that the GC query is the one taking the most time (screenshots in thread), but I don't know what changed to make this happen now. GC has been running relatively frequently for months without anything like this happening.

Has anyone had this happen to them before / could help me troubleshoot?
indexes look fine, they should be able to support this query
looks like this coincides with AWS applying a maintenance patch to the postgres instance. Do the SpiceDB pods need to be restarted if the underlying postgres instance restarts/disconnects temporarily?
v
> Do the SpiceDB pods need to be restarted if the underlying postgres instance restarts/disconnects temporarily?

I don't think we have tests for that, but I'd be surprised if it couldn't recover by itself.

As for the spikes, you are likely hitting this: https://github.com/authzed/spicedb/pull/1550

Please note that you need to upgrade to Postgres 15 in order for the indexes to work.
That change is part of https://github.com/authzed/spicedb/releases/tag/v1.26.0 Since you are there, I'd probably recommend moving all the way to 1.29.1 to get a bunch of perf improvements.
p
we're at 1.19 right now
is the index change something that would break for postgres versions <15? ours is at 14 right now
basically trying to understand if the upgrade path to v1.29 requires postgres 15
v
it's not required but if you don't upgrade, the index won't be selected by the query planner. So basically it won't solve your problem.
p
also can confirm that restarting all the pods seems to have fixed it
v
the GC runs on a schedule. By restarting the containers, you stop the query, and the GC will be re-run after a certain period of time
p
I thought the default is every 3 minutes?
v
so you would technically eventually have the issue again
v
```
--datastore-gc-interval duration             amount of time between passes of garbage collection (postgres driver only) (default 3m0s)
--datastore-gc-max-operation-time duration   maximum amount of time a garbage collection pass can operate before timing out (postgres driver only) (default 1m0s)
--datastore-gc-window duration               amount of time before revisions are garbage collected (default 24h0m0s)
```
the thing is that if the query is timing out, you won't be GCing anything
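to make that concrete, here's a rough sketch (not the actual SpiceDB code, just an illustration of how the two flags interact): a pass fires every interval, and each pass gets at most the max operation time before its context is cancelled

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runGCPass stands in for the real GC work; it respects cancellation.
func runGCPass(ctx context.Context) error {
	select {
	case <-time.After(30 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	interval := 3 * time.Minute         // --datastore-gc-interval default
	maxOperationTime := 1 * time.Minute // --datastore-gc-max-operation-time default

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), maxOperationTime)
		if err := runGCPass(ctx); err != nil {
			// a timed-out pass frees nothing, so the next pass has even more to do
			fmt.Println("gc pass failed:", err)
		}
		cancel()
	}
}
```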
check `gc_failure_total`
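if you just want to eyeball it without a dashboard, something like this works against SpiceDB's Prometheus endpoint (assuming the default `--metrics-addr` of `:9090`; the exported name may carry a prefix, so this just greps for the substring)

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// assumes the default metrics listener; adjust host/port if you changed --metrics-addr
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// print any GC failure counter, whatever prefix it's exported under
		if strings.Contains(line, "gc_failure") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
}
```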
p
but the query that runs has a `DELETE` clause in it, which should delete the data even if the query times out for the client, right?
v
it's not only a `DELETE`, it has a subquery to select the set of elements to delete
actually I think that's not correct, may be a different datastore
no, it's correct
```go
query := fmt.Sprintf(`WITH rows AS (%[1]s)
          DELETE FROM %[2]s
          WHERE (%[3]s) IN (SELECT %[3]s FROM rows);
```
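for illustration only, the same batched shape against a throwaway table (made-up table/column names, not SpiceDB's actual schema or helpers) looks roughly like this

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // any postgres driver works here
)

// batchDelete removes up to `limit` rows older than `cutoff` per call, using the
// WITH rows AS (...) DELETE ... WHERE (...) IN (SELECT ... FROM rows) shape above.
func batchDelete(ctx context.Context, db *sql.DB, cutoff time.Time, limit int) (int64, error) {
	selectQuery := fmt.Sprintf("SELECT id FROM my_deleted_rows WHERE deleted_at < $1 LIMIT %d", limit)
	query := fmt.Sprintf(`WITH rows AS (%[1]s)
        DELETE FROM %[2]s
        WHERE (%[3]s) IN (SELECT %[3]s FROM rows);`,
		selectQuery, "my_deleted_rows", "id")

	res, err := db.ExecContext(ctx, query, cutoff)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}

func main() {
	// usage sketch: connect and delete in batches until nothing is left
	db, err := sql.Open("pgx", "postgres://user:pass@localhost:5432/mydb")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	cutoff := time.Now().Add(-24 * time.Hour) // mirrors the default GC window
	for {
		n, err := batchDelete(context.Background(), db, cutoff, 1000)
		if err != nil || n == 0 {
			break
		}
	}
}
```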
this GC issue usually manifests when there has been some sort of bulk ingestion
so lots of revisions becoming deleted and the GC starting to take an increasing amount of time to do its job
p
and the elements being selected are based on the current txid from what I understand - so wouldn't the success of the query on PG's side be enough, even if the result times out for the client?
but in any case I guess we need to set some time to upgrade our PG to 15/16 and then apply the spicedb updates
v
they are based on the current revision, determining anything that falls below `current_revision - GC_WINDOW`, which is 24h by default
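as a concrete example (timestamps just for intuition; the real cutoff is computed against transaction ids/revisions, not wall-clock time)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	gcWindow := 24 * time.Hour // --datastore-gc-window default
	deletedAt := time.Date(2024, time.March, 4, 9, 0, 0, 0, time.UTC)

	// a row marked deleted only becomes a GC candidate once the window has elapsed
	eligibleAt := deletedAt.Add(gcWindow)
	fmt.Println("row deleted at", deletedAt, "becomes a GC candidate at", eligibleAt)
}
```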
p
ahhh got it
v
I don't recall the exact sequence that leads to failures. It didn't actually fail all the time; sometimes it failed, other times it was fine. The problem is that the query consumed an absurd amount of resources, making everything else going on in your cluster super slow
so you'd see latencies spike > 1s, and if clients set deadlines, you'll start to see deadline errors everywhere
p
yeah, I'm actually seeing latency spikes > 1s for WriteRelationships semi-frequently
Writes and Deletes from what I can see
not semi-frequently, my bad: it's rare but does happen
v
so it's difficult to assess the impact, because it depends on the current size of the relation_tuple table, the number of revisions that are marked as deleted, and the capacity of your cluster
so there's no uniform point at which this starts to manifest. But I'm pretty certain this is a problem that PG clusters can hit at some point.
p
just to confirm: for the upgrade path, if I upgrade spicedb first, and then PG to 15/16, it should automatically start using the indices right? I don't want to upgrade PG first and risk breaking something in the application
v
correct
the worst that can happen is that the index is not selected, just as it happens right now in your deployment anyway
p
got it, that helps a lot. thank you!