# spicedb
j
make sure you're on version 15 at least of PG
ok
well, that means the GC process is timing out
it should be fixed on PG 15
the GC timeout can be increased, but it's already taking over a minute (which is the default timeout)
then you're either generating a VERY large amount of garbage
or it's not using the correct indexes
which, with Aurora, who knows
b
I grabbed the snapshot `xmin` from a transaction ~1hr old and there are ~4.2 million tuples returned by `select count(*) from spicedb.public.relation_tuple where deleted_xid < '185328836';`
Which doesn't seem excessive for the size of the server but maybe would take >1min
I'll try get an EXPLAIN for the GC delete query
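For reference, a sketch of how an xid8 like that can be looked up, assuming the SpiceDB Postgres datastore's relation_tuple_transaction table (which stores a pg_snapshot and a timestamp per transaction; verify the column names against your own schema):
Copy code
-- Grab the snapshot xmin of the newest transaction that is at least an hour old;
-- this is the value compared against deleted_xid in the count above.
SELECT pg_snapshot_xmin(snapshot) AS gc_xid
FROM relation_tuple_transaction
WHERE timestamp < NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC
LIMIT 1;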
j
you're generating 4.2M updates within an hour?
if it is timing out, you may need to scale up your Aurora
b
No, I don't think the GC has succeeded in a long time
Like since we migrated all our users into SpiceDB probably
j
there is a prometheus metric we recommend monitoring for just that
but yeah, I'd scale up and see why it's failing
in PG < 15, it wasn't using the correct index
but it should be doing so with PG 15+
b
It's taking ~5min to delete 1 batch of 1000 rows which definitely ain't right
k
We're currently on 15.4. We are using Aurora PG and have a pair of `db.r6g.xlarge` nodes in the cluster. Certainly it's not a large instance type, though I'd expect it to cope. Even if we were to increase the size I don't think we'd see an improvement, as the query is single-CPU bound; a larger size would just give us more idle CPUs, although we would also get more RAM. As Ben has just said, each batch is taking a very long time. The index seems to be used for the query to find the 1000 records, but I'm not sure it's being used for the actual `DELETE`. I suspect it's scanning the table (which is currently ~20GB). Are we able to change the batch size to help confirm or disprove this hypothesis?
Or... can we run the `DELETE` statement manually to test/confirm this?
We've run a VACUUM and an ANALYZE to make sure it's not related to that. A query to do the SELECT for the 1000 rows is "immediate" (94ms in my case), which leads me to think either I'm getting a different plan there or it is scanning the table in the DELETE.
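One way to check this directly is to run the GC DELETE inside a transaction and roll it back; EXPLAIN ANALYZE then shows the real plan and timings without losing any rows. A sketch, using the GC delete statement (as later captured from Performance Insights) and the xid8 from the count above; both are just example values:
Copy code
BEGIN;

-- EXPLAIN ANALYZE actually executes the DELETE and reports per-node timings.
EXPLAIN ANALYZE
WITH rows AS (
  SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
         userset_relation, created_xid, deleted_xid
  FROM relation_tuple
  WHERE deleted_xid < '185328836'::xid8
  LIMIT 1000
)
DELETE FROM relation_tuple
WHERE (namespace, object_id, relation, userset_namespace, userset_object_id,
       userset_relation, created_xid, deleted_xid)
  IN (SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
      userset_relation, created_xid, deleted_xid FROM rows);

-- Undo the deletions so the test has no lasting effect.
ROLLBACK;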
v
something you can do to get this unstuck, which is not ideal but will likely work, is to increase the timeout of the GC via the corresponding ENV variable. Once it does complete, you should be able to go back to the original value. I did recently merge a metric improvement that should help show the number of rows deleted even when the operation timed out, which previously was not being reported via the prometheus endpoint
j
yeah, I'd see if you can get an EXPLAIN on the DELETE
the delete is being done via PK; it shouldn't take 5 minutes
k
We did this (threw it way high to a 10-minute timeout). We get a couple of batches done in that time but that's all. It'll take a very long time to catch up, and we still receive timeout errors, at which point the process stops and we have to wait for its next cycle.
v
Then change the cycle to be lower; eventually it will catch up. Just saying that as a stopgap. Right now it's an exercise in understanding what exactly is wrong with the query and whether it's missing another index. I did some work a while back and added an index to help here, but it seems it hasn't helped as I expected
k
Here's a bit of an analysis dump (across a few messages to stay under the max message length)... Since we use RDS Aurora, we have their Performance Insights data. From that we get the query:
Copy code
WITH rows AS (SELECT namespace, object_id, relation, userset_namespace, userset_object_id, userset_relation, created_xid, deleted_xid FROM relation_tuple WHERE deleted_xid < $1 LIMIT ?)
          DELETE FROM relation_tuple
          WHERE (namespace, object_id, relation, userset_namespace, userset_object_id, userset_relation, created_xid, deleted_xid) IN (SELECT namespace, object_id, relation, userset_namespace, userset_object_id, userset_relation, created_xid, deleted_xid FROM rows);
It's our biggest load on our cluster. This cluster has a couple of databases, one for SpiceDB and one for FusionAuth; we've basically separated our "auth" onto its own DB cluster. Pretty much everything else is barely noticeable on the load (see the `Delete Query Load.png` image). We first looked at the "rows" select to make sure that it returns quickly. As you'll see, Performance Insights gives us a `$x` placeholder for variables, so we worked out what we think would be going into that value. As it's an xid8 value it was a bit of a struggle to get it to parse, but we found that if we used a string it worked. So for this example:
Copy code
SELECT namespace, object_id, relation, userset_namespace, userset_object_id, userset_relation, created_xid, deleted_xid FROM relation_tuple WHERE deleted_xid < '999999999' LIMIT 1000
The query will run and return the rows in milliseconds. Perfectly performant. The plan is:
Copy code
Limit  (cost=0.43..110.31 rows=1000 width=122)
  ->  Index Scan using ix_gc_index on relation_tuple  (cost=0.43..466125.91 rows=4242314 width=122)
        Index Cond: (deleted_xid < '999999999'::xid8)
Clearly using the index as you'd expect. https://cdn.discordapp.com/attachments/1225632871214219294/1225977179830222950/Delete_Query_Load.png?ex=66231739&is=6610a239&hm=c636db80454a1d0e39f75f4cb9de2f770da9aa0f26a8fd20123c5ccb38f421c6&
Substituting this into the `DELETE` statement, the plan is:
Copy code
Delete on relation_tuple  (cost=140.87..2956.71 rows=0 width=0)
  ->  Nested Loop  (cost=140.87..2956.71 rows=227 width=152)
        ->  HashAggregate  (cost=140.31..150.31 rows=1000 width=268)
              Group Key: (rows.namespace)::text, (rows.object_id)::text, (rows.relation)::text, (rows.userset_namespace)::text, (rows.userset_object_id)::text, (rows.userset_relation)::text, rows.created_xid, rows.deleted_xid
              ->  Subquery Scan on rows  (cost=0.43..120.31 rows=1000 width=268)
                    ->  Limit  (cost=0.43..110.31 rows=1000 width=122)
                          ->  Index Scan using ix_gc_index on relation_tuple relation_tuple_1  (cost=0.43..466125.91 rows=4242314 width=122)
                                Index Cond: (deleted_xid < '999999999'::xid8)
        ->  Index Scan using ix_relation_tuple_by_subject on relation_tuple  (cost=0.56..2.80 rows=1 width=128)
              Index Cond: (((userset_object_id)::text = (rows.userset_object_id)::text) AND ((userset_namespace)::text = (rows.userset_namespace)::text) AND ((userset_relation)::text = (rows.userset_relation)::text) AND ((namespace)::text = (rows.namespace)::text) AND ((relation)::text = (rows.relation)::text))
              Filter: (((rows.object_id)::text = (object_id)::text) AND (rows.created_xid = created_xid) AND (rows.deleted_xid = deleted_xid))
There's a large cost for a `HashAggregate`, which I believe is the issue. Thinking about that, I just tried adding a `DISTINCT` to the initial `rows` query. That appears to improve the plan quite a bit:
Copy code
Delete on relation_tuple  (cost=1.25..3324.01 rows=0 width=0)
  ->  Nested Loop  (cost=1.25..3324.01 rows=1 width=152)
        ->  Subquery Scan on rows  (cost=0.69..517.61 rows=1000 width=268)
              ->  Limit  (cost=0.69..507.61 rows=1000 width=122)
                    ->  Unique  (cost=0.69..1066720.82 rows=2104290 width=122)
                          ->  Index Only Scan using pk_relation_tuple on relation_tuple relation_tuple_1  (cost=0.69..981874.54 rows=4242314 width=122)
                                Index Cond: (deleted_xid < '999999999'::xid8)
        ->  Index Scan using ix_relation_tuple_by_subject on relation_tuple  (cost=0.56..2.80 rows=1 width=128)
              Index Cond: (((userset_object_id)::text = (rows.userset_object_id)::text) AND ((userset_namespace)::text = (rows.userset_namespace)::text) AND ((userset_relation)::text = (rows.userset_relation)::text) AND ((namespace)::text = (rows.namespace)::text) AND ((relation)::text = (rows.relation)::text))
              Filter: (((rows.object_id)::text = (object_id)::text) AND (rows.created_xid = created_xid) AND (rows.deleted_xid = deleted_xid))
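For reference, the DISTINCT variant was presumably something like this (the exact statement wasn't posted, so this is a sketch):
Copy code
WITH rows AS (
  SELECT DISTINCT namespace, object_id, relation, userset_namespace,
         userset_object_id, userset_relation, created_xid, deleted_xid
  FROM relation_tuple
  WHERE deleted_xid < '999999999'::xid8
  LIMIT 1000
)
DELETE FROM relation_tuple
WHERE (namespace, object_id, relation, userset_namespace, userset_object_id,
       userset_relation, created_xid, deleted_xid)
  IN (SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
      userset_relation, created_xid, deleted_xid FROM rows);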
Another observation I had: the delete doesn't order the initial select by `deleted_xid`. Not sure it matters, but it might be tidier to remove rows in the order they were deleted.
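If ordering were wanted, the inner select would just gain an ORDER BY, e.g. (a sketch):
Copy code
SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
       userset_relation, created_xid, deleted_xid
FROM relation_tuple
WHERE deleted_xid < '999999999'::xid8
ORDER BY deleted_xid
LIMIT 1000;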
v
for additional context, you can cast to `xid8` as you see above: `'999999999'::xid8`. This is important because we identified a while back that the query planner won't select xid8-based indexes in versions < PG 15, hence the question and the recommendation to run SpiceDB with PG >= 15.
also note the index already has an order by clause, so presumably those are already being deleted in order. See https://github.com/authzed/spicedb/blob/1e81e6848a555fec2647923416aa0913ccd22fb1/internal/datastore/postgres/migrations/zz_migration.0016_add_tuned_gc_index.go#L12
may I ask how many relationships are in your database? I'm not able to reproduce the same issue locally with 5M deleted tuples
Notice your explain for the select query:
Copy code
Limit  (cost=0.43..110.31 rows=1000 width=122)
  ->  Index Scan using ix_gc_index on relation_tuple  (cost=0.43..466125.91 rows=4242314 width=122)
        Index Cond: (deleted_xid < '999999999'::xid8)
there are `4242314` rows scanned, which seems odd. This is what my local 15.6 PG shows with 5M deleted rows. Also it's not selecting the GC index, interestingly:
Copy code
| --------------------------------------------------------------------------------------------------------------------------- |
| Limit  (cost=0.00..26.79 rows=1000 width=80) (actual time=0.006..0.198 rows=1000 loops=1)                                   |
|   ->  Seq Scan on relation_tuple  (cost=0.00..137596.88 rows=5136950 width=80) (actual time=0.006..0.157 rows=1000 loops=1) |
|         Filter: (deleted_xid < '999999999'::xid8)                                                                           |
| Planning Time: 0.058 ms                                                                                                     |
| Execution Time: 0.227 ms                                                                                                    |
oh I see, I was able to reproduce that phenomenon when I added 5M live tuples:
Copy code
| c0                                                                                                                                              |
| ----------------------------------------------------------------------------------------------------------------------------------------------- |
| Limit  (cost=0.43..35.38 rows=1000 width=80) (actual time=1.183..1.740 rows=1000 loops=1)                                                       |
|   ->  Index Scan using ix_gc_index on relation_tuple  (cost=0.43..178606.61 rows=5110010 width=80) (actual time=1.182..1.705 rows=1000 loops=1) |
|         Index Cond: (deleted_xid < '901'::xid8)                                                                                                 |
| Planning Time: 0.359 ms                                                                                                                         |
| Execution Time: 1.780 ms                                                                                                                        |
ah, this could be because Performance Insights is doing `EXPLAIN` but not `EXPLAIN ANALYZE`, so you don't get the actual number of rows scanned, just the planner's estimate. Makes sense.
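To make the distinction concrete (a sketch): plain EXPLAIN only reports the planner's estimates, whereas EXPLAIN ANALYZE runs the statement and reports how many rows each node actually produced (capped at 1000 here by the LIMIT):
Copy code
EXPLAIN ANALYZE
SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
       userset_relation, created_xid, deleted_xid
FROM relation_tuple
WHERE deleted_xid < '999999999'::xid8
LIMIT 1000;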
b
iirc there are ~25M tuples total
v
I have an alternative query that seems quite a bit faster. Would you be able to test it in your DB? You just need to make sure you select the right xid8, e.g.
Copy code
EXPLAIN DELETE FROM relation_tuple
WHERE ctid in (SELECT ctid
       FROM relation_tuple 
       WHERE deleted_xid < 'the_xid8_to_GC_from'
       LIMIT 1000);
explain looks like this:
Copy code
| c0                                                                                                                                         |
| ------------------------------------------------------------------------------------------------------------------------------------------ |
| Delete on relation_tuple  (cost=47.89..854.38 rows=0 width=0)                                                                              |
|   ->  Nested Loop  (cost=47.89..854.38 rows=1000 width=36)                                                                                 |
|         ->  HashAggregate  (cost=47.88..49.88 rows=200 width=36)                                                                           |
|               Group Key: "ANY_subquery".ctid                                                                                               |
|               ->  Subquery Scan on "ANY_subquery"  (cost=0.43..45.38 rows=1000 width=36)                                                   |
|                     ->  Limit  (cost=0.43..35.38 rows=1000 width=6)                                                                        |
|                           ->  Index Scan using ix_gc_index on relation_tuple relation_tuple_1  (cost=0.43..178606.61 rows=5110010 width=6) |
|                                 Index Cond: (deleted_xid < '901'::xid8)                                                                    |
|         ->  Tid Scan on relation_tuple  (cost=0.00..4.01 rows=1 width=6)                                                                   |
|               TID Cond: (ctid = "ANY_subquery".ctid)                                                                                       |
b
@Kent Chenery if you wanna try that tomorrow you can use `185328836`, it's the one I grabbed on Friday
The nested loop seems one order of magnitude faster on the worst case
v
The tentative query I'm going to propose, to accommodate the current codebase, is below. It has the same explain result.
Copy code
WITH rows AS (SELECT ctid FROM relation_tuple WHERE deleted_xid < 'your_XID8' LIMIT 1000)
          DELETE FROM relation_tuple
          WHERE (ctid) IN (SELECT ctid FROM rows);
Opened https://github.com/authzed/spicedb/pull/1859 with the proposed change
k
I just tested the delete above (the `WITH rows...` one) on our db (with a rollback) and it executed in 135ms.
Are we able to clear up our table manually until that fix is merged/deployed?
v
@Kent Chenery that's great to hear. Yes, you can run that to clear up your DB. There are other operations that need to run too (cleaning up namespaces and transactions), but hopefully those tables won't have issues with the current code.
Basically you have to run it in a loop, or bump the limit (beware: the bigger the limit, the more impact it will have on the DB)
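A sketch of running it in a loop until the backlog is cleared, using the xid8 Ben captured earlier; the batch size is just an example, and the COMMIT inside the DO block requires PG 11+ and running it outside an explicit transaction:
Copy code
DO $$
DECLARE
  batch integer;
BEGIN
  LOOP
    -- Same statement as the proposed fix, bounded to 1000 rows per pass.
    WITH rows AS (
      SELECT ctid FROM relation_tuple
      WHERE deleted_xid < '185328836'::xid8
      LIMIT 1000
    )
    DELETE FROM relation_tuple
    WHERE ctid IN (SELECT ctid FROM rows);

    GET DIAGNOSTICS batch = ROW_COUNT;
    EXIT WHEN batch = 0;

    -- Commit each batch so progress is released incrementally.
    COMMIT;
  END LOOP;
END $$;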
k
So I understand things correctly... the other operations that need to run - will SpiceDB do these for us, or is there something we should run? I'd set up a process to start calling that delete directly outside of SpiceDB, simply because that is quicker and easier for us than waiting for a new release that includes the fix or building an image ourselves.
b
SpiceDB does those after it does the relationship GC, but that's never completing so it's not getting to them
v
>So I understand things correctly... the other operations that need to run - will SpiceDB do these for us? Or is there something we should run?
@User Typically SpiceDB will do this for you. There is also a `spicedb datastore gc` command to run these on-demand. But because this is an issue with the queries SpiceDB issues when doing GC, and because this branch hasn't landed in main yet, we recommended running it yourself directly. Once the PR lands, you should be able to grab the nightly build. Unfortunately we have a slow week because several folks are on PTO.
k
@vroldanbet just letting you know we updated to SpiceDB `1.31.0` yesterday and the issue is all solved now. Thanks!
v
Thanks for following up, and glad to hear! 🎉