k
Hey, we're seeing strange behavior around one of our regular SQL queries. There's no increase in request rate or CPU on the backend side; the problem only shows up on the PostgreSQL side, where CPU climbed from 35% to 100% in 10 minutes. Query:
```sql
WITH minvalid AS (
    SELECT xid, snapshot
    FROM relation_tuple_transaction
    WHERE timestamp >= NOW() - INTERVAL ?
       OR timestamp = (SELECT MAX(timestamp) FROM relation_tuple_transaction)
    ORDER BY timestamp ASC
    LIMIT ?
)
SELECT minvalid.xid, minvalid.snapshot, pg_current_snapshot()
FROM minvalid
```
https://cdn.discordapp.com/attachments/844600078948630559/1236946829476892682/image.png?ex=6639dc01&is=66388a81&hm=9f29050155f81a6466a7184d29b02bb1becacac5640645d1c390386bdb24aae4&
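As a quick sanity check (a sketch, assuming direct psql access to the SpiceDB database; `relation_tuple_transaction` is the table from the query above), the size and age range of the transactions table hints at whether GC is keeping up:

```sql
-- Rough backlog check: with the default 24h GC window, roughly one day of
-- rows should be retained here. A count that only grows, or an "oldest"
-- timestamp far older than the GC window, suggests transaction GC is not
-- completing.
SELECT count(*)       AS stored_transactions,
       min(timestamp) AS oldest,
       max(timestamp) AS newest
FROM relation_tuple_transaction;
```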
v
What SpiceDB version are you using?
k
v1.29.1
v
While I don't rule out that there's room to optimize that query, whenever we see something like this it has almost always come down to the transaction GC not running properly. You need to monitor whether GC is taking place and completing. I'd recommend updating to SpiceDB 1.31, which introduced an optimization to GC (https://github.com/authzed/spicedb/pull/1859). While that optimization does not directly involve the transactions table, it's possible that tuple GC is not progressing in your cluster, and thus transaction GC never even starts. My recommendation would be to:
- upgrade to 1.31
- increase gc-max-operation-time, which gives SpiceDB more time to clean up dead relationships and transactions
- monitor via the corresponding Prometheus metrics: https://github.com/authzed/spicedb/blob/342eee166699e6cf3f10d95d27cd46cae9ca63ed/internal/datastore/common/gc.go#L17-L52
```
--datastore-gc-interval duration             amount of time between passes of garbage collection (postgres driver only) (default 3m0s)
--datastore-gc-max-operation-time duration   maximum amount of time a garbage collection pass can operate before timing out (postgres driver only) (default 1m0s)
--datastore-gc-window duration               amount of time before revisions are garbage collected (default 24h0m0s)
```
If GC completes its cleanup, you'll see the GC metrics reach a healthy state, which should look like:
- not much time spent on GC
- not many tuples / transactions being garbage collected
If the metrics look healthy and the problem still persists, then something else may be going on.
Alternatively, you can run GC on demand with the spicedb gc command.
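A minimal invocation sketch; the exact subcommand name and whether it accepts these datastore flags are assumptions to confirm with `spicedb --help` (the connection URI is a placeholder):

```
# hypothetical one-off GC pass against the postgres datastore
spicedb gc \
  --datastore-engine=postgres \
  --datastore-conn-uri="postgres://user:pass@host:5432/spicedb"
```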
k
We also see warnings about concurrent transactions: ERROR: could not serialize access due to read/write dependencies among transactions. Could this also have an effect?
v
What this suggests is that you have many concurrent write transactions happening at the same time. Depending on the size of your transactions, this can happen. The recommendation is to retry. Are you perhaps seeding data into your cluster? That would explain the GC issue too.
k
Sometimes we seed large batches of relations at the same time. For now, we've disabled this mechanism for testing.
Also, could you recommend the best GC timing settings for PostgreSQL with over a billion relations in the database?
And what metrics can I use to monitor GC?
v
I just linked the GC metrics above, please have a look.
The best timing settings really depend on the rate at which you write and generate garbage. Ideally it's set so that GC runs frequently enough to keep up. The metrics should give you the information you need.
Also: if you do not use ZedTokens and only use `minimize_latency` consistency, you could narrow your GC window. By default it's 24 hours. Shrinking the GC window limits the ability to go "back in time" (see `at_exact_snapshot` or `at_least_as_fresh` consistency). But if you do not care about that, shortening the window greatly reduces the number of transactions stored in your database at any given time, which is likely what's causing your DB to spend so much time computing the query you posted.
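For example (an illustration, not a recommendation; the right value depends on whether you ever evaluate permissions at older snapshots via `at_exact_snapshot` / `at_least_as_fresh`):

```
# narrow the window from the 24h default; only safe if you never need
# to evaluate permissions at a snapshot older than this
--datastore-gc-window=1h
```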
y
Hello! We updated to version 1.31 and added these configs:
```
--datastore-gc-interval=3m
--datastore-gc-max-operation-time=3m
--datastore-gc-window=24h
```
But after these changes we have a big memory leak in SpiceDB :( Should we add more memory, or can we tune the configs?
@vroldanbet 🙏
v
I don't see evidence of a memory leak in that graph, and I have no reason to believe the GC would cause any leak. Note that per the help output above, `--datastore-gc-interval=3m` is already the default; the real change is `--datastore-gc-max-operation-time` going from 1m to 3m, which simply gives each GC pass more time to run.