k
Hey, we're seeing strange behavior around one of our regular SQL queries. There's no increase in request rate or CPU on the backend side; the problem only shows up on the PostgreSQL side, where CPU climbed from 35% to 100% in 10 minutes. Query:
```sql
WITH minvalid AS (
    SELECT xid, snapshot
    FROM relation_tuple_transaction
    WHERE timestamp >= NOW() - INTERVAL ?
       OR timestamp = (SELECT MAX(timestamp) FROM relation_tuple_transaction)
    ORDER BY timestamp ASC
    LIMIT ?
)
SELECT minvalid.xid, minvalid.snapshot, pg_current_snapshot()
FROM minvalid
```
https://cdn.discordapp.com/attachments/844600078948630559/1236946829476892682/image.png?ex=6639dc01&is=66388a81&hm=9f29050155f81a6466a7184d29b02bb1becacac5640645d1c390386bdb24aae4&
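As a quick sanity check (a sketch, assuming direct psql access to the SpiceDB database; `relation_tuple_transaction` is the table from the query above), the size and age range of the transactions table hints at whether GC is keeping up:

```sql
-- Rough backlog check: with the default 24h GC window, roughly one day of
-- rows should be retained here. A count that only grows, or an "oldest"
-- timestamp far older than the GC window, suggests transaction GC is not
-- completing.
SELECT count(*)       AS stored_transactions,
       min(timestamp) AS oldest,
       max(timestamp) AS newest
FROM relation_tuple_transaction;
```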
v
What SpiceDB version are you using?
k
v1.29.1
v
While I don't rule out that there's room to optimize that query, whenever we see something like this it has almost always come down to the transaction GC not running properly. You need to monitor whether GC is taking place and completing. I'd recommend updating to SpiceDB 1.31, which introduced an optimization to GC (https://github.com/authzed/spicedb/pull/1859). While that optimization does not directly involve the transactions table, it's possible that tuple GC is not progressing in your cluster, and thus transaction GC never even starts. My recommendation would be to:
- upgrade to 1.31
- increase gc-max-operation-time, which gives SpiceDB more time to clean up dead relationships and transactions
- monitor via the corresponding Prometheus metrics: https://github.com/authzed/spicedb/blob/342eee166699e6cf3f10d95d27cd46cae9ca63ed/internal/datastore/common/gc.go#L17-L52
```
--datastore-gc-interval duration             amount of time between passes of garbage collection (postgres driver only) (default 3m0s)
--datastore-gc-max-operation-time duration   maximum amount of time a garbage collection pass can operate before timing out (postgres driver only) (default 1m0s)
--datastore-gc-window duration               amount of time before revisions are garbage collected (default 24h0m0s)
```
If GC completes its cleanup, you'll see the GC metrics reach a healthy state, which should look like:
- not much time spent on GC
- not many tuples / transactions being garbage collected
If the metrics look healthy and the problem still persists, then something else may be going on.
Alternatively, you can run GC on demand with the spicedb gc command.
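A minimal invocation sketch; the exact subcommand name and whether it accepts these datastore flags are assumptions to confirm with `spicedb --help` (the connection URI is a placeholder):

```
# hypothetical one-off GC pass against the postgres datastore
spicedb gc \
  --datastore-engine=postgres \
  --datastore-conn-uri="postgres://user:pass@host:5432/spicedb"
```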
k
We also see warnings about concurrent transactions: ERROR: could not serialize access due to read/write dependencies among transactions. Could this also have an effect?
v
What this suggests is that you have many concurrent write transactions happening at the same time. Depending on the size of your transactions, this can happen. The recommendation is to retry. Are you perhaps seeding data into your cluster? That would explain the GC issue too.
k
Sometimes we seed large batches of relations at the same time. For now, we've disabled this mechanism for testing.
Also, could you recommend the best GC timing settings for PostgreSQL with over a billion relations in the database?
And what metrics can I use to monitor GC?
v
I just linked the GC metrics above, please have a look.
The best timing settings really depend on the rate at which you write and generate garbage. Ideally it's set so that GC runs frequently enough to keep up. The metrics should give you the information you need.
Also: if you do not use ZedTokens and only use `minimize_latency` consistency, you could narrow your GC window. By default it's 24 hours. Shrinking the GC window limits the ability to go "back in time" (see `at_exact_snapshot` or `at_least_as_fresh` consistency). But if you do not care about that, shortening the window greatly reduces the number of transactions stored in your database at any given time, which is likely what's causing your DB to spend so much time computing the query you posted.
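For example (an illustration, not a recommendation; the right value depends on whether you ever evaluate permissions at older snapshots via `at_exact_snapshot` / `at_least_as_fresh`):

```
# narrow the window from the 24h default; only safe if you never need
# to evaluate permissions at a snapshot older than this
--datastore-gc-window=1h
```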
y
Hello! We updated to version 1.31 and added these configs:
```
--datastore-gc-interval=3m
--datastore-gc-max-operation-time=3m
--datastore-gc-window=24h
```
But after these changes we have a big memory leak in SpiceDB :( Should we add more memory, or can we tune the configs?
@vroldanbet 🙏
v
I don't see evidence of a memory leak in that graph, and I have no reason to believe the GC would cause any leak. Note that per the help output above, `--datastore-gc-interval=3m` is already the default; the real change is `--datastore-gc-max-operation-time` going from 1m to 3m, which simply gives each GC pass more time to run.