# spicedb
w
Hi guys, just had some questions regarding LookupResources latency. We are currently attempting to self-host SpiceDB and are struggling a bit with performance. We are running 6 AWS instances of SpiceDB on separate nodes, each with 4 vCPU and 8 GB of RAM, pointed at an RDS instance with the same specs. Looking at metrics for both, neither is particularly saturated. All of this is running in eu-west-2, as are the applications that use this system. Even with minimize-latency consistency on all of our queries (we don't really care about the new-enemy problem), and with pagination, each individual request takes ~200ms to complete. Comparing this to your serverless environment, sending the same request, we see similar latencies - but that includes the round trip to America. Given the number of tuples we have, this can result in LookupResources requests taking ~15s if not more. Any advice here on how to improve our performance?
v
Hi 👋 What version are you running? I'd suggest looking into OpenTelemetry and running some checks with `--explain` to get a better understanding of the path traversed by the permission check used for `LookupResources`. Typically what we see is contention over connection pools - it's not so much a CPU/memory problem on the SpiceDB side. You want to understand where time is being spent, down to the SQL query.
`LookupResources` can put a bunch of load on your database cluster. I suggest turning on Performance Insights for RDS to get a sense of when spikes are happening and queries are getting queued: https://aws.amazon.com/rds/performance-insights/
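Alongside Performance Insights, the `pg_stat_statements` extension (available on RDS Postgres) gives a quick server-side view of the heaviest queries. A minimal sketch, assuming Postgres 13+ column names (`mean_exec_time`/`total_exec_time`; older versions use `mean_time`/`total_time`):

```sql
-- Requires the pg_stat_statements extension; on RDS,
-- shared_preload_libraries is configured via the parameter group.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top 10 statements by average execution time.
SELECT calls,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       round(total_exec_time::numeric, 2) AS total_ms,
       left(query, 80)                    AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

This points you at which normalized query shape to dig into before pulling individual examples from the logs.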
w
v1.25.0 of SpiceDB. I see that the top few queries are roughly of the same type - I assume these are the LookupResources queries?
```sql
SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
       userset_relation, caveat_name, caveat_context
FROM relation_tuple
WHERE pg_visible_in_snapshot(created_xid, $1) = $2
  AND pg_visible_in_snapshot(deleted_xid, $3) = $4
  AND ((userset_namespace = $5 AND userset_object_id IN (<LIST OF OBJECT_IDS>)))
  AND namespace = $30
  AND relation = $31
ORDER BY userset_namespace, userset_object_id, userset_relation, namespace, object_id, relation
LIMIT $32
```
These take on the order of `183±300`ms to complete, with the worst we've seen taking 3000ms
v
how long are the pages you are asking for? it sounds like the database is the bottleneck. Chances are there is an index missing too.
w
We've got our block size set to the default setting for Postgres: 8192 bytes. We have the default indexes that came with SpiceDB: `ix_relation_by_deleted_xid`, `ix_relation_tuple_by_subject` and `ix_relation_tuple_by_subject_relation`. I don't particularly mind optimizing heavily for read performance over write performance
v
sorry, I actually meant the `limit` parameter in the `LookupResources` call, which defines the size of the page to return
it's expected that you have those indexes, but your workload may have an access pattern we're missing an index for. RDS Performance Insights should tell you when queries are taking too long, and then you'd have to identify the query by enabling verbose logs and try running an `EXPLAIN` in RDS for that specific query that's being slow for you
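Concretely, that workflow might look like the sketch below. The literal filter values are placeholders taken from the earlier example; `$snapshot` stands in for the snapshot parameters that must be copied from a real captured query in the logs, so this is not runnable as-is:

```sql
-- EXPLAIN a captured LookupResources query with its real parameter values
-- substituted in ($snapshot is a placeholder for the captured snapshot args).
EXPLAIN (ANALYZE, BUFFERS)
SELECT namespace, object_id, relation, userset_namespace, userset_object_id,
       userset_relation, caveat_name, caveat_context
FROM relation_tuple
WHERE pg_visible_in_snapshot(created_xid, $snapshot) = true
  AND pg_visible_in_snapshot(deleted_xid, $snapshot) = false
  AND userset_namespace = 'user'
  AND userset_object_id IN ('7', '13', '15', '17', '22', '681')
  AND namespace = 'channel'
  AND relation = 'reader'
ORDER BY userset_namespace, userset_object_id, userset_relation,
         namespace, object_id, relation
LIMIT 1000;
```

If the plan shows a sequential scan, or `Buffers:` counts far exceed the rows returned, that's the query shape to index for.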
w
Ah the limit parameter is always 1000 for these queries
Okay, that makes some sense - noting that the above query is the worst performing, would an index over `userset_namespace`, `userset_object_id`, `namespace` and `relation` help here?
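For reference, such an index could be sketched like this - the index name and column order are assumptions for illustration, not an existing SpiceDB migration:

```sql
-- Hypothetical covering index for the slow query's filter columns.
-- Equality-filtered columns come first; the IN-list column
-- (userset_object_id) goes last among the filters.
-- CONCURRENTLY avoids blocking writes while the index builds.
CREATE INDEX CONCURRENTLY ix_relation_tuple_subject_ns_rel
    ON relation_tuple (userset_namespace, namespace, relation, userset_object_id);
```

Whether it actually helps depends on what `EXPLAIN` shows the planner choosing today, so it's worth verifying the plan before and after.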
y
@willthornton. how are you running spicedb? is it in a k8s environment?
w
Yes it is
Specifically v1.24.16 on eks
v
@willthornton. postgres query planning is a blackbox, I'd suggest:
1. capture an actual query that is being slow
2. run explain on your RDS cluster
3. if it does a sequential scan or something inefficient, add an index
4. run explain again until query time improves
w
Postgres query planning is massively opaque, I agree - feels like black magic when you finally get something working. Okay, will turn on the verbose logs and see what happens
See if I can find where it is struggling
v
I've had my fair share of fights with the RDS query planner over the last few weeks, and that was the strategy I followed - sorry I cannot give more advice. If you could provide a reproduction scenario, we could look into it
j
> I assume these are the LookupResource queries?
yeah, if you can get an EXPLAIN on one of them, that would help
w
```
Limit  (cost=379.12..379.24 rows=46 width=92) (actual time=0.201..0.202 rows=5 loops=1)
  ->  Sort  (cost=379.12..379.24 rows=46 width=92) (actual time=0.200..0.201 rows=5 loops=1)
        Sort Key: userset_object_id, userset_relation, object_id
        Sort Method: quicksort  Memory: 25kB
        ->  Bitmap Heap Scan on relation_tuple  (cost=209.78..377.85 rows=46 width=92) (actual time=0.184..0.190 rows=5 loops=1)
              Recheck Cond: (((userset_object_id)::text = ANY ('{7,13,15,17,22,681}'::text[])) AND ((userset_namespace)::text = 'user'::text) AND ((namespace)::text = 'channel'::text) AND ((relation)::text = 'reader'::text))
              Heap Blocks: exact=5
              ->  Bitmap Index Scan on ix_relation_tuple_by_subject  (cost=0.00..209.76 rows=46 width=0) (actual time=0.179..0.180 rows=5 loops=1)
                    Index Cond: (((userset_object_id)::text = ANY ('{7,13,15,17,22,681}'::text[])) AND ((userset_namespace)::text = 'user'::text) AND ((namespace)::text = 'channel'::text) AND ((relation)::text = 'reader'::text))
```
That is horribly formatted, let me see if I can post that as a nicer response
This is for an example query from above
j
execution time: 0.2ms
do you have an example of a request taking 200+ms?
w
I'll enable the logs as suggested above, and then find an example for you
There's a few cropping up like this
Also the 'execution time' here isn't necessarily accurate from what we see - when the database is under load and executing lots of these queries, we're seeing execution times of ~400ms
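One way to check whether queries are piling up behind each other under load (rather than being individually slow) is to sample `pg_stat_activity` while the latency is happening; the wait-event columns assume Postgres 9.6+:

```sql
-- Count backends by state and wait event during a load spike.
-- Many 'active' rows sharing the same wait_event_type (e.g. Lock, IO,
-- Client) suggests queuing/contention rather than slow individual plans.
SELECT state, wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state, wait_event_type, wait_event
ORDER BY count(*) DESC;
```

That would help distinguish connection-pool contention (as suggested earlier in the thread) from a planner problem.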
Coming back from the postgres logs
The limit here looks very suspicious, being that it is the max of a signed 64-bit integer: 9223372036854775807
I specify a limit of 1000 in all LookupResources requests
y
thought: it may be an intermediate hop, where you need to get all results in order to have the final answer to the question contain the number of results that you want
w
Yeah, true
I just get suspicious whenever I see numbers like that
y
i feel you there
v
yeah there is an `IN` clause there, so the limit is implicit in the number of elements there, which I believe is controlled by the application
j
This is an intermediate lookup
Likely over an arrow
w
Is there a reason why it would be that slow, even for an intermediate query? I do have a lot of relations in that table at this point, but is there anything I can do about it?
v
I'd suggest doing `EXPLAIN (ANALYZE, VERBOSE, BUFFERS)` to have more data around the query
if it's taking 200ms+, it must be loading more rows than it actually needs to execute the query
do you happen to know how many relationships are in that table, and how many are there in that relation specifically?
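For reference, those counts can be pulled with something like the sketch below; the `channel`/`reader`/`user` values are just the ones from the earlier example plan:

```sql
-- Approximate total rows in the table (cheap; an exact count(*) can be slow
-- on a large relation_tuple table).
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'relation_tuple';

-- Exact count for the specific relation from the example query.
SELECT count(*)
FROM relation_tuple
WHERE namespace = 'channel'
  AND relation = 'reader'
  AND userset_namespace = 'user';
```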