# spicedb
w
I'm having serious issues trying to upgrade to 1.14.1 on Postgres (manual, no-downtime upgrade process). I've done the `add-xid-constraints` migration, I'm running `write-both-read-new`, and I'm observing much higher response times than on 1.13.0. Initially the response time shot through the roof (>1s P95); it seemed to be a pod CPU issue rather than a database issue, so I mitigated by doubling the number of replicas. That brought the response time back within a usable range, but it's still much higher than it was on 1.13.0: my P50 is 3x higher (~15ms) and my P95 is 2x higher (130ms). I'm not seeing any particular pressure on the database, and the CPU usage of my pods is way higher than it used to be (>0.5, used to be ~0.1), so I assume something is burning CPU that wasn't before. I have this problem with `write-both-read-new` but also with `write-both-read-old`. I have traces but I'm not getting any insight from them, maybe you do? I'm concerned about whether my cluster will be able to handle the load on Monday: I'm already breaching my SLOs with these response times on a Sunday, when we have low traffic 😰
Note: I don't use the dispatch cluster
Response times:
- I started the `add-xid-constraints` migration a bit before midnight; it finished around 4am.
- Around 7am, we see response times going wild, which correlates with our traffic.
- A bit before 11am, I added more replicas to handle the load.
CPU usage over the same period
here's a cpu profile over 30s
j
Can you get a trace from a 95th percentile request?
You should also be able to safely roll back your pods to `write-both-read-old`; does that improve the situation?
Also, do you have Postgres GC turned on, and do you frequently reuse or recreate relationships using TOUCH operations?
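One quick way to gauge how much soft-deleted data GC has to deal with (just a sketch; I'm assuming the max-int64 value is the "not deleted" sentinel, as it appears in the query filters):
```sql
-- Rows whose deleted_transaction is not the "alive" sentinel are soft-deleted
-- and only go away when the datastore GC runs; a large count here tends to
-- bloat the table and the indexes the read queries scan.
SELECT count(*) AS soft_deleted_rows
FROM relation_tuple
WHERE deleted_transaction <> 9223372036854775807;
```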
w
> Can you get a trace from a 95th percentile request?
Should be able to, I'll try to get that.
> do you have Postgres GC turned on, and do you frequently reuse or recreate relationships using TOUCH operations?
The SpiceDB GC, you mean? Yes and yes, we use TOUCH a lot.
> You should also be able to safely roll back your pods to `write-both-read-old`; does that improve the situation?
Will try, but I don't think so. I started seeing higher response times as soon as I upgraded to 1.14.1, with `write-both-read-old`. Didn't think much of it at the time as it wasn't dramatic and it was late at night with low traffic.
Also seeing memory increasing a lot, correlating with response times getting higher and higher. A `kubectl rollout restart deployment` seems to calm things down, but still to higher-than-before response times.
I'd love to roll back to 1.13.0 to get back to a stable situation, but it's not so easy after the `add-xid-constraints` migration; I expect writes would fail :/
j
If you’re running write both read old the read path should be the same as v1.13
I would be interested to see what is consuming more CPU in that configuration
Using TOUCH by itself is ok, even exclusively. There is an issue when you have too many copies of the same relationship generated by touch events
But that wouldn’t show up with write both read old
w
Here's a trace for a slow query
j
Dominated by queryrelationships
Can you catch one of the slow queries in flight and run explain on it?
w
Switched to `read-old` (at the timestamp my cursor is at); it is maybe slightly better, but not as fast as before.
> Can you catch one of the slow queries in flight and run explain on it?
I don't think I can; I don't have anything logging queries in-flight. I only have the "generic" queries with parameter placeholders.
Unless SpiceDB logs that with LOG_LEVEL=debug?
j
You will get them with log level debug or you can just run psql to dump running queries a few times by hand until you see one.
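For the psql route, something like this run a few times (a sketch; with parameterized statements `pg_stat_activity` will still show the placeholder text rather than the bound values):
```sql
-- Poll the currently running statements touching relation_tuple,
-- longest-running first.
SELECT pid,
       now() - query_start AS running_for,
       state,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%relation_tuple%'
ORDER BY running_for DESC;
```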
w
I only get the generic plan with these 2 approaches:
```sql
SELECT namespace, object_id, relation, userset_namespace, userset_object_id, userset_relation, caveat_name, caveat_context FROM relation_tuple WHERE created_transaction <= $1 AND (deleted_transaction = $2 OR deleted_transaction > $3) AND namespace = $4 AND relation = $5 AND object_id IN ($6) LIMIT 9223372036854775807
```
I'll try to EXPLAIN it with my best guess of parameters
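i.e. something like this, with the parameter values substituted by hand (the object IDs are just placeholders):
```sql
-- Hand-substituted version of the generic query above; values are my best
-- guess at realistic parameters, object IDs are placeholders.
EXPLAIN ANALYZE
SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
       userset_relation, caveat_name, caveat_context
FROM relation_tuple
WHERE created_transaction <= 49851674
  AND (deleted_transaction = 9223372036854775807 OR deleted_transaction > 49851674)
  AND namespace = 'care_recipient'
  AND relation = 'caregiver'
  AND object_id IN ('1', '2', '3', '4', '5', '6')
LIMIT 9223372036854775807;
```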
```sql
 Limit  (cost=0.69..643.69 rows=237 width=138) (actual time=5.947..5.948 rows=0 loops=1)
   ->  Index Scan using ix_backfill_tuple_temp_2 on relation_tuple  (cost=0.69..643.69 rows=237 width=138) (actual time=5.947..5.947 rows=0 loops=1)
         Index Cond: (((namespace)::text = 'care_recipient'::text) AND ((object_id)::text = ANY ('{1,2,3,4,5,6}'::text[])) AND ((relation)::text = 'caregiver'::text) AND (created_transaction <= 49851674))
         Filter: ((deleted_transaction = '9223372036854775807'::bigint) OR (deleted_transaction > 49851674))
 Planning Time: 81.407 ms
 Execution Time: 5.975 ms
```
Interesting, it's using `ix_backfill_tuple_temp_2`, which is an index I created to help the backfill of xids.
j
is that planning time actually representative of the real runtime? it seems crazy
w
I agree
I got this plan by running `psql` on a spicedb pod, using the same connection string as SpiceDB, so... it's as representative of the real runtime as I can imagine.
j
but i mean if you run that same query without explain does it take 90ms?
w
it takes ~40ms
Rerunning the same `EXPLAIN ANALYZE` gives me:
```sql
 Limit  (cost=0.69..643.69 rows=237 width=138) (actual time=0.075..0.075 rows=0 loops=1)
   ->  Index Scan using ix_backfill_tuple_temp_2 on relation_tuple  (cost=0.69..643.69 rows=237 width=138) (actual time=0.074..0.074 rows=0 loops=1)
         Index Cond: (((namespace)::text = 'care_recipient'::text) AND ((object_id)::text = ANY ('{1,2,3,4,5,6}'::text[])) AND ((relation)::text = 'caregiver'::text) AND (created_transaction <= 49851674))
         Filter: ((deleted_transaction = '9223372036854775807'::bigint) OR (deleted_transaction > 49851674))
 Planning Time: 34.907 ms
 Execution Time: 0.104 ms
(6 rows)
```
So that's consistent with the ~40ms it takes without EXPLAIN (it's almost all planning time)
j
can you run this?
```sql
select count(*) from relation_tuple group by namespace, relation, object_id, userset_namespace, userset_object_id, userset_relation order by count(*) desc limit 10;
```
maybe on a copy of the DB if you're worried about DB CPU
w
```sql
 count
-------
   952
   707
   707
   707
   707
   706
   692
   692
   681
   287
(10 rows)
```
j
those don't seem unreasonable
we have sort of a degenerate edge case that has 86k copies of the same tuple, and it definitely causes performance problems
your backfill is done now?
maybe try removing that index?
w
> your backfill is done now?
Yes
> maybe try removing that index?
I'll try that in a transaction first
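i.e. roughly this (a sketch; note `DROP INDEX` takes an ACCESS EXCLUSIVE lock on the table until the transaction ends, so I'll keep it short):
```sql
BEGIN;
-- Dropping the index inside a transaction lets me test the plan without it;
-- the ROLLBACK brings it back. This does lock relation_tuple for the
-- duration of the transaction.
DROP INDEX ix_backfill_tuple_temp_2;

EXPLAIN ANALYZE
SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
       userset_relation, caveat_name, caveat_context
FROM relation_tuple
WHERE created_transaction <= 49851674
  AND (deleted_transaction = 9223372036854775807 OR deleted_transaction > 49851674)
  AND namespace = 'care_recipient'
  AND relation = 'caregiver'
  AND object_id IN ('1', '2', '3', '4', '5', '6')
LIMIT 9223372036854775807;

ROLLBACK;  -- the index comes back untouched
```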
j
the last migration that you ran drops the old primary key indices, maybe something about the shape of your data on the old read path was relying on those bigserial primary keys
seems unlikely...
w
Without my custom index:
```sql
 Limit  (cost=0.69..691.69 rows=237 width=138) (actual time=4.441..4.442 rows=0 loops=1)
   ->  Index Scan using uq_relation_tuple_namespace on relation_tuple  (cost=0.69..691.69 rows=237 width=138) (actual time=4.440..4.440 rows=0 loops=1)
         Index Cond: (((namespace)::text = 'care_recipient'::text) AND ((object_id)::text = ANY ('{1,2,3,4,5,6}'::text[])) AND ((relation)::text = 'caregiver'::text) AND (created_transaction <= 49851674))
         Filter: ((deleted_transaction = '9223372036854775807'::bigint) OR (deleted_transaction > 49851674))
 Planning Time: 37.745 ms
 Execution Time: 4.481 ms
(6 rows)
```
It's pretty much the same. Not sure my custom index was really useful for anything.
j
i don't really have any clue how to debug long planning time
one thing to try is using real care_recipient IDs
also try varying the number of them
w
can do
```
 Limit  (cost=0.69..643.69 rows=237 width=138) (actual time=1.072..5.509 rows=138 loops=1)
   ->  Index Scan using ix_backfill_tuple_temp_2 on relation_tuple  (cost=0.69..643.69 rows=237 width=138) (actual time=1.071..5.495 rows=138 loops=1)
         Index Cond: (((namespace)::text = 'care_recipient'::text) AND ((object_id)::text = ANY ('{268ea094-92e4-11eb-a25b-067ff236ecb9,4bc4de79-1e3e-40ce-b58e-c5bd389e52ea,af6f7d21-eb5d-42a0-8b9e-cc55dddfeee0,99c8c11d-f6d2-471d-845b-ed2dbc21e822,90ae53f5-ad48-4929-9a12-b4a6ed8a3d80,6091d46b-4db7-11ea-8a02-06a80bfbb33e}'::text[])) AND ((relation)::text = 'caregiver'::text) AND (created_transaction <= 49851674))
         Filter: ((deleted_transaction = '9223372036854775807'::bigint) OR (deleted_transaction > 49851674))
 Planning Time: 0.568 ms
 Execution Time: 5.544 ms
```
Planning time went down, but it also goes down if I just rerun the previous EXPLAIN ANALYZE...
j
is that reduction reflected in your overall SLIs?
w
Not really, I'm not seeing any difference in response time
j
that query performance looks more or less like what I would expect
w
me too. I'm not really sure the problem is in the database queries
The pods' CPU usage is much higher than on 1.13.0; that's what looks most suspect to me.
The profile I sent earlier shows a lot of time spent unmarshalling stuff for namespace caching
j
yeah, the cache switched to using serialized copies for better cache costing
but the switch to using vtprotobuf was supposed to offset the performance losses of having to deserialize and reserialize on each request
w
I've tried disabling the ns cache (made things much worse, not unexpected) and raising/lowering the "ns cache max cost" (to no effect) earlier in the day
j
yeah that de/reserialization is in the critical path of the happy path
w
Is there anything I can do to test whether it's indeed this ns cache that's causing issues?
j
not an easy test unfortunately, easiest way to test would probably be to make a build with the PR reverted
if you want to isolate that as a factor
w
Ouch, I'm not confident to do that 😬
esp on the prod system
j
right
the commits don't cleanly revert either
the "easiest" thing to do is probably to give you a downgrade version of the add-xid-constraints migration that will let you move down to v1.13
w
That would be a great start πŸ˜…
I did just find a maybe something
Getting rid of `caveat_name, caveat_context` in the SELECT gives me this plan:
```sql
 Limit  (cost=0.69..205.12 rows=237 width=128) (actual time=0.072..0.072 rows=0 loops=1)
   ->  Index Only Scan using ix_backfill_tuple_temp_2 on relation_tuple  (cost=0.69..205.12 rows=237 width=128) (actual time=0.071..0.071 rows=0 loops=1)
         Index Cond: ((namespace = 'care_recipient'::text) AND (object_id = ANY ('{1,2,3,4,5,6}'::text[])) AND (relation = 'caregiver'::text) AND (created_transaction <= 49851674))
         Filter: ((deleted_transaction = '9223372036854775807'::bigint) OR (deleted_transaction > 49851674))
         Heap Fetches: 0
 Planning Time: 0.520 ms
 Execution Time: 0.095 ms
```
i.e. an INDEX ONLY scan.
The caveat columns aren't in the index, so PG needs to do some table reading, which slows down the query a bunch.
That doesn't explain the raised CPU in my pods, but maybe there are 2 distinct things happening.
j
that's super common though, right? I don't think those query times are sustainable
w
I don't get your point sorry? πŸ˜…
My point is that v1.14.1 introduced these caveat columns in the SELECT (I think), which slows down the query as PG can't do an INDEX ONLY anymore. Could explain why I'm seeing higher response time
j
i just mean, to me an index points to a row, if you always need to be able to read everything from the index to have acceptable performance that's probably not sustainable
w
I agree
j
I get what you're saying, but like the caveat context is a json column, I doubt you're going to fit it in your index
w
Yeah
but that's a pretty significant regression from 1.13.0 then
j
if it holds in the general case, I would agree
w
How feasible would it be to have a rollback migration? I'd be happy to move back to 1.13.0 and buy more time to investigate this 😅
j
since the migration is a pretty substantial change, i wonder if doing an ANALYZE on the table would help with planning
have you tried that yet?
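i.e. just (a sketch):
```sql
-- Refresh planner statistics for the table after the migration/backfill churn.
ANALYZE VERBOSE relation_tuple;
```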
w
I did
I can try again πŸ˜…
j
when did you do it last?
w
~7h ago. Just reran it, I'm not seeing any improvement
j
What version of Postgres are you using?
w
13
j
@williamdclt if you add a temp index including the caveat_name, caveat_context (which should both be empty unless you're using caveats right now, which I doubt), do you see a significant performance impact overall?
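Something along these lines, built concurrently so it doesn't block writes (just a sketch; the index name is a placeholder, and the column list covers everything the query filters on and selects so it can be answered index-only):
```sql
-- Temporary covering index so the hot query can be an index-only scan;
-- caveat_name/caveat_context should be empty since caveats aren't in use.
CREATE INDEX CONCURRENTLY ix_tuple_with_caveats_tmp
    ON relation_tuple (namespace, object_id, relation,
                       caveat_name, caveat_context,
                       userset_namespace, userset_object_id, userset_relation,
                       created_transaction, deleted_transaction);
```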
w
I can try
j
if that works, we can look into skipping selection of those columns unless caveats is enabled (for now)
but I'm quite curious to see if that is the primary driver
j
this branch will manually revert the migrations to a place that is compatible with v1.13: https://github.com/authzed/spicedb/tree/revert-migration-constraints
you'll have to make your own build though
j
another question: you don't have any more custom indices, right?
I want to try to rule out modifications you've made, because some of this might even be Postgres vs Aurora
w
> but I'm quite curious to see if that is the primary driver
I would expect the pod CPU to stay high and limit throughput (possibly due to the marshalling/unmarshalling discussed earlier), but we'll see.
> another question: you don't have any more custom indices, right?
No. Even this `ix_backfill_tuple_temp_2` is actually completely redundant with `uq_relation_tuple_namespace`.
Index created (`(namespace, object_id, relation, caveat_name, caveat_context, userset_namespace, userset_object_id, userset_relation, created_transaction, deleted_transaction)`). The query now does an INDEX ONLY scan 👍 I'll give it a few mins to see the impact on response time.
j
cool; if that does have a significant impact, we'll come up with at least a workaround tomorrow; to confirm: you're not using caveats right now, right?
w
> you're not using caveats right now, right?
No, we have no plan to use caveats in the near future.
@Jake what would I need to run for this rollback? `spicedb migrate XXX`?
j
`spicedb migrate head` will do it
just make SURE you're using the branch
or `spicedb migrate add-xid-columns` should also do it; I made the destination migration name something that you can start moving forward from again
w
No improvement from the new index :/
I just observed something interesting: the planning time is very high (30-90ms) when I run `EXPLAIN ANALYZE` on my read replica (which is where all the permission checks are going). It's normal (<1ms) when running it on the master.
that's... odd
j
that is very interesting, i guess you should dig into that first before doing anything drastic
w
yeah
Simplest example I can find is:
```sql
explain analyze select deleted_transaction from relation_tuple WHERE (deleted_transaction > 49851674) limit 10
```
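One way to compare just the planner on each node (a sketch; `SUMMARY ON` makes plain `EXPLAIN` print the planning time without executing the query):
```sql
-- Run on both the read replica and the master and compare "Planning Time";
-- the statement itself is not executed.
EXPLAIN (SUMMARY ON)
SELECT deleted_transaction
FROM relation_tuple
WHERE deleted_transaction > 49851674
LIMIT 10;
```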
j
ok, i'm going to go play with my kids, @ me if there's something more I can help with tonight
w
Thank you for your help!
I rolled back to 1.13:
- I ran the migrations manually rather than making my own build; it felt simpler and safer.
- The migration went OK, but my 1.13.0 wasn't getting ready: the health probe checks that the current migration is known, and `add-xid-columns` isn't in 1.13.0. I set it to `add-ns-config-id` for now.
Thank you for your support, I'm going to bed now πŸ˜… I'll have to pick that up with you, we're not in a great place now (our staging is fully on v1.14.0 but prod isn't)
j
we think we have an idea on why the CPU usage is higher
with 1.13, are the queries back to expected performance?
w
Yes
downgrade was at 22:57
j
ok
j
No unexpected behavior today?
w
All good on 1.13