# spicedb
c
Hello, we finally completed our upgrade to 1.25.0 and were hoping it would resolve our errors, but we are still seeing:
- occasional latency spikes
- CheckPermission errors:
  - rpc error: code = Unavailable desc = error reading from server: EOF
  - rpc error: code = Canceled desc = context canceled

This is with dispatching disabled and a relatively small load (~10-15 RPS, ~75K relationships). Any ideas why this may still be happening?
e
Can you share more about your configuration? Do you see the CheckPermission errors in the clients, or just in the server logs?
c
Clients can see the errors. We use standard calls (some usage of fully_consistent, but not much overall) and no adjustments to the defaults for serve. The errors do make it to the clients.
AWS Postgres 13.8 + EKS
v
I'd suggest turning on debug logs on the server side
c
Our schema isn't that complex either (largest query depth would be 4 levels)
Also hitting some container restarts due to OOM:
```
State:          Running
  Started:      Mon, 02 Oct 2023 15:12:14 -0400
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 02 Oct 2023 14:10:06 -0400
  Finished:     Mon, 02 Oct 2023 15:12:14 -0400
Ready:          True
Restart Count:  2
Limits:
  ephemeral-storage:  1G
  memory:             2G
Requests:
  cpu:                700m
  ephemeral-storage:  250M
  memory:             1G
```
Not seeing much with debug logs on.
Still getting OOMKilled after upping the request/limit to 1.5GB/2.5GB
Again, this isn't with the dispatch server enabled (we don't have any flags modifying defaults)
v
would you mind sharing your SpiceDB configuration flags/envs?
I'd suggest getting a pprof heap profile and sharing it with us via https://pprof.me/
you can do that with:
```
go tool pprof http://$SPICEDB_URL:$METRICS_PORT/debug/pprof/heap
```
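if it's easier, you can also save the profile to a file and browse it locally before uploading (standard Go pprof tooling; host/port are just your metrics endpoint):
```
# save the heap profile so it can be shared
curl -s "http://$SPICEDB_URL:$METRICS_PORT/debug/pprof/heap" -o heap.pprof

# open an interactive web UI for the saved profile
go tool pprof -http=:8080 heap.pprof
```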
c
We're not using any flags (besides --datastore-conn-uri); not much in the values either:
```
PGSSLMODE: verify-full
PGSSLROOTCERT:
SPICEDB_DATASTORE_ENGINE: postgres
SPICEDB_GRPC_PRESHARED_KEY:
SPICEDB_HTTP_ENABLED: false
SPICEDB_METRICS_ENABLED: true
SPICEDB_DASHBOARD_ENABLED: false
SPICEDB_LOG_LEVEL: debug
SPICEDB_LOG_FORMAT: json
SPICEDB_TELEMETRY_ENDPOINT: ""
SPICEDB_METRICS_ADDR: ":9091"
SPICEDB_OTEL_PROVIDER: otlpgrpc
SPICEDB_OTEL_TRACE_PROPAGATOR: w3c
```
Memory usage is very periodic and follows the same trend, then the container gets OOMKilled:
https://cdn.discordapp.com/attachments/1158451233426780221/1158722953270267985/Memory_usage_spicedb_containers.png?ex=651d4858&is=651bf6d8&hm=8b851fc92eef386ee156a4a05adbc39e44c175b06bbd14f4120f8a7b651816e5&
Any specific time I should try and get a heap profile from?
@vroldanbet DM'd you the pprof
v
what version were you running before?
I don't see anything that stands out as a leak in that heap profile, other than OTel occupying a lot of memory
are you sure your traces are being delivered?
could you try disabling otel?
You can do that by setting `SPICEDB_OTEL_PROVIDER: none`
also, the startup logs will give you insight into the cache sizes selected for the various caches
@ch-2023 moving the convo back to public so that @ecordell can also chime in. So from what we've discussed:
- you are using postgres
- you moved from 1.14.1 all the way up to 1.25.0
- you are not using the operator
- your workload is mostly Checks, Lookups are few, and the OOM kills do not line up with them
chances are that the cache-sizing algorithm was introduced somewhere between 1.14 and 1.25, and that it's not working well for you
I'd suggest manually setting the cache sizes to see if that has any effect. Please also provide the log messages that SpiceDB shows on bootstrap with the selected cache sizes.
c
Here is our current cache:
```
{"defaultTTL":20600,"level":"info","maxCost":"1.8 GiB","message":"configured dispatch cache","numCounters":10000,"time":"2023-10-03T13:09:38Z"}
{"defaultTTL":0,"level":"info","maxCost":"32 MiB","message":"configured namespace cache","numCounters":1000,"time":"2023-10-03T13:10:37Z"}
```
v
flags are:
```
--dispatch-cache-max-cost
--dispatch-cluster-cache-max-cost
```
c
We are currently at 2.5GB limit
For the container
v
you can set those flags in bytes
c
Any recommendation to try first?
v
try both
e
Did you try disabling otel to see if there's a difference? The cache config by default tries to work out how much space to use as a fraction of the available memory. If you're using otel aggressively, that can be too much overhead for the remaining space, and you can get OOM.
yeah, these two things in combination are the likely culprit
c
Sorry, I meant values for those flags
v
well that's up to you and your deployment limit. Start with something conservative like 250MB and see how it fares
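for example, something along these lines (values are purely illustrative - roughly 250MB expressed in bytes, per the note above - and $DATABASE_URL is just a placeholder for your connection string):
```
# ~250MB per cache, expressed in bytes; adjust to your container limit
spicedb serve \
  --datastore-engine=postgres \
  --datastore-conn-uri="$DATABASE_URL" \
  --dispatch-cache-max-cost=262144000 \
  --dispatch-cluster-cache-max-cost=262144000
```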
then if you see that the growth still happens, try disabling otel
otel should be doing 1% sampling by default; it's weird that it's using so much space
c
We didn't change the otel sample limit or anything - what do you mean by using otel aggressively?
v
well that's the thing, it would look like OTel is using a good chunk of memory, and that's surprising for a low-scale deployment and for one that hasn't changed the default sampling rate
c
Still had an OOM kill after about an hour - config dump:
```
{"async":false,"format":"json","level":"info","log_level":"debug","message":"configured logging","provider":"zerolog","time":"2023-10-03T16:06:47Z"}
{"balancerconfig":{"loadBalancingConfig":[{"consistent-hashring":{"replicationFactor":100,"spread":1}}]},"concurrency-limit-check-permission":50,"concurrency-limit-lookup-resources":50,"concurrency-limit-lookup-subjects":50,"concurrency-limit-reachable-resources":50,"level":"info","message":"configured dispatcher","time":"2023-10-03T16:06:47Z"}
{"defaultTTL":20600,"level":"info","maxCost":"238 MiB","message":"configured dispatch cache","numCounters":10000,"time":"2023-10-03T16:06:47Z"}
{"level":"info","message":"telemetry disabled","time":"2023-10-03T16:06:47Z"}
```
https://cdn.discordapp.com/attachments/1158451233426780221/1158804005359730783/message.txt?ex=651d93d4&is=651c4254&hm=abe871c99608610dd3fcd0b3785d6632d8466bde709a0bfeb7f62eec834d12dd&
v
if you are not doing dispatch, you could disable the cluster dispatch cache, not sure if ristretto does any preallocations there
I'd suggest doing that, and if that also fails, then next disable otel
c
I had set SPICEDB_OTEL_PROVIDER to none. So I will flip the --dispatch-cluster-cache-enabled to false
v
after that I'd suggest disabling all caching; if that helps, then we know there is some leak in the caches
c
😢 Confirmed another set of restarts - it is extremely periodic even with what we have changed. It spikes up to 1.5 GB memory usage, then eventually gets restarted after roughly an hour. I don't see any additional error logs to help diagnose.
What is considered a large call for LR? Ours is fully consistent and expects 0 to 1 items returned. Would this overflow the memory?
v
did you try disabling cache altogether?
I'd suggest getting another heap profile and sharing with us
LR not returning many values does not necessarily qualify it as "small". The call may be traversing large portions of the graph which have to be brought into memory, but I wouldn't expect them to cause a leak. It used to be a problem when SpiceDB didn't offer cursoring and LR calls would return thousands of elements.
c
The cluster cache was disabled, yes, but still no fix
v
all of the caches?
and otel also disabled?
c
Otel is disabled, only the cluster dispatch cache is disabled, the dispatch cache is set to 250 MB, and no changes to the ns cache
v
please disable all caches to rule out leaks there
then fetch a heap profile when it's approaching limit
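something like this should do it (I'm writing the cache flags from memory, so double-check them against spicedb serve --help for your version; $DATABASE_URL is a placeholder):
```
# turn off all SpiceDB caches to rule out a leak there (flag names assumed)
spicedb serve \
  --datastore-engine=postgres \
  --datastore-conn-uri="$DATABASE_URL" \
  --dispatch-cache-enabled=false \
  --dispatch-cluster-cache-enabled=false \
  --ns-cache-enabled=false
```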
c
Ok will try and grab that
sent you a pprof, this isn't right before breaching but it is close @User
v
the profile shows 50 MB used. Can you plot the various memory metrics and see if it's specific to the heap or other parts?
have you also disabled the namespace cache?
is your schema large?
c
Namespace cache is still running, schema has 5 resources with about 9 relation types. ~30K records
v
what do you mean by 30K records?
c
I'll share some plots, but container memory usage is still about the same as shared above (tracks the same path)
About 30k relationships in prod
v
try disabling the namespace cache, it's the last element I know of that can store off-heap and does not show up in the heap dump
c
What is the NS cache caching?
v
your schema
that's why I asked for the size of your schema
are you running any sidecar or anything alongside SpiceDB?
I gotta drop but feel free to ping @Joey or @ecordell if you gather more insight
c
Yes, we have another container in there that does some additional logic and then proxies the same requests to spicedb; its memory usage is very low. The container that restarts is spicedb. Will add any more info I have. Appreciate your help!
We look to be stable now at a high total container request/limit. Will monitor memory usage and let you know once we get the full root cause. Appreciate all the help! Have some additional questions about our 1.25.0 migration:
- most importantly, our CheckPermission calls are much slower than on the previous version for P95/P75 (but the same for P99/P50)
- we still see the occasional "context canceled" returned as an error (we still don't have cluster dispatch enabled)
Any ideas what this could be?
j
if calls are timing out, then context can be canceled
what version did you upgrade from?
c
1.13.0 -> 1.14.1 (4 part migration on postgres) -> 1.25.0 [dispatch disabled so just migrate head]
j
disabling dispatch will make things slower
you should reenable dispatching
c
Dispatch was always disabled (we just used the defaults)
j
then you should enable dispatching
c
Is 1.25.0 more dependent on cluster dispatch?
j
no, but it makes caching more efficient
there are also settings changes around connection pooling as well
c
Right, I'm just wondering why we saw a performance regression from the version bump?
Any ideas before I try enabling cluster dispatch + increasing connection pools?
j
considering it's 15 versions, it could be all sorts of things
you'd need to look at a trace to see
I'd also recommend looking at the connection delay prometheus metrics to see if it's waiting for connections
e
some of the defaults around connection pooling changed, that's a good thing to check
in general we expect 1.25 to be much faster on PG vs. 1.13, the tx management is much better
c
Thanks - will get back!
Noticed a few things. For context: we have low RPS (about 2-10 QPS for CheckPermission, 0-2 QPS for writes, 10 LookupResources/hour), are using all defaults (no additional flags), have 4 SpiceDB instances running 1.25.0, and a small-ish schema (7 namespaces with 11 relations in total):
- Our CheckPermission is slower on average than our WriteRelationships since the upgrade (strange?)
- DB connections increased when migrating from 1.14.1 to 1.25.0 (20 conns per instance to 30, for a total of 120)
- The longest CheckPermission requests spend most of their time in ExecuteQuery, with 1-2K spans
Just wanted to check if any ideas popped up as to why?
j
sounds like you're hanging on available connections
that's why I suggested checking the prom metric for connection wait time
c
Is there a list of the metrics? I am scrolling through them but not sure which to look at. Each instance seems to have enough connections to support the load independently, given we never breach 10 QPS and have ~30 open each?
j
```
max(rate(pgxpool_acquire_duration_ns[$__rate_interval]))
```
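if that's hard to spot in your dashboards, you can also grep the raw connection-pool metrics straight off a pod's metrics endpoint (same host/port as the pprof URL above):
```
# dump the pgxpool connection-pool metrics exposed by SpiceDB
curl -s "http://$SPICEDB_URL:$METRICS_PORT/metrics" | grep pgxpool
```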
c
So I should increase the connections for read then? I'm a little confused, since the latency was fine with fewer connections previously (20 total held per instance vs now 30)
I've noticed that jitter is set to 0, so I will change that. The default also says 20% of max, which defaults to 30m, so I assumed it would have been 6 min
v
if the jitter is not honoring the default behaviour feel free to open an issue 🙏
Have you taken a trace of your slow API queries? How big is that trace in terms of dispatches and nesting? Could you identify requests that take very long, and perform the same request with zed, passing the `--explain` flag? That would give you an easier way to visualize where time is being spent
c
Ok, I will work on getting more data. Thanks again for all the help!
v
some spans show the number of results of a query, and if you see blank spots, chances are this is contention over connection pools (unfortunately connection pool acquisition has no spans)
c
Is this trace interesting? I see a lot of these where the client is doing little for a long time, then spicedb takes action very quickly. I don't yet have the production ones for our much slower requests, will have those next week https://cdn.discordapp.com/attachments/1158451233426780221/1159951072576675850/Slow_Client.png?ex=6532e39e&is=65206e9e&hm=69b9e53782503bfe5d9aebe66407e54febfc6f7a521691911ddd5386fcb9029f&
Seems like the client might be hanging on the connection to spicedb, or am I reading this trace wrong? We have an insecure connection to spicedb inside the pod and a secure connection to the DB.
j
that could also be waiting on the DB connection
depending on what the blue part shows
that's what @vroldanbet was saying earlier: if you're waiting on DB connections, the system will just... wait
c
the pgxpool_acquire_duration_ns stays low throughout, stable at its minimum. I'm out for the night now. Thanks so much for your help again, and hope you all have a great weekend. Will get back when I have more data/tracing. Cheers
Some more data coming in:
- Increasing connections from the default 20r/10w to 50r/20w had an impact on our P75 but none on P95/P99.
- Increasing connections from 50r/20w to 300r/100w actually increased our P75 again, but the acquire duration was near 0 the whole time.
- 1.13.0 had a better P95 than current with all defaults (so just 20 total conns).
- It uses all conns even if I disable load in our staging (I would assume to see it go down to the min? pgxpool_idle_conns is non-zero but unsure what that means).
- We have 4 pods running (again, read QPS of 10, 1 for writes) but are autoscaling on CPU, which doesn't seem to be the best metric - what would you recommend here?

Attached a slow (~800ms) high-level trace; spicedb is all of the blue. I don't have much experience reading these but noticed it has a lot of queries and also has some that go past the response leaving our service. Any ideas on what else to try? Thanks!
https://cdn.discordapp.com/attachments/1158451233426780221/1160584852316356699/Trace.png?ex=653531df&is=6522bcdf&hm=040dbb4537299db36658fdce8823c93ff60dc3d06e417cdf27340dd4a1c4e810&
https://cdn.discordapp.com/attachments/1158451233426780221/1160584852731596881/Span.png?ex=653531df&is=6522bcdf&hm=4b47baa573deef400d1d91b6bcea7cadab8247be61322215e3a0b266bad267d4&
j
that doesn't really show anything I can read, unfortunately
c
Yeah, it's a very large trace; I can zoom in if you need, was just trying to give high-level context. Any useful info from the comments above outside of the trace?
j
Nope
Not without knowing what the traces show
e
Depending on your PG instance, you may get better perf with fewer connections. Connections on PG are actually pretty resource intensive.
c
I'll try and export this trace as best I can. We're using a db.r6g.4xlarge, so you think fewer connections rather than more? So less than the default of 20r/10w?
e
It’s worth measuring. Not at a computer to check what we normally do for PG, and I don’t have it off the top of my head
But you’re running 4 nodes right? The defaults are typically sized to be okay for 3 nodes. That said if you don’t have evidence of contention on the RDS side this is probably just micro-tuning and not the root cause of your issues
c
Yeah, especially since perf was good on 1.13.0 with all defaults for P95 (but it had other issues that meant we needed to upgrade). Now on 1.25.0 I can't seem to get it to be as fast with the same setup.
e
I would at least rule out by trying with 20conns per node and see. Maybe 15r/5w unless you have a very heavy write load
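roughly something like this, if you want to try it (I'm writing the pool flag names from memory, so please verify them against spicedb serve --help for 1.25; $DATABASE_URL is a placeholder):
```
# cap the read/write pools at 15/5 connections per node (illustrative values)
spicedb serve \
  --datastore-engine=postgres \
  --datastore-conn-uri="$DATABASE_URL" \
  --datastore-conn-pool-read-max-open=15 \
  --datastore-conn-pool-write-max-open=5
```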
j
Also run analyze on the tables too; it's certainly possible PG's stats are wrong now that the tables have changed
c
High read load (about 10-100x read vs write depending on time)
Should I try cranking up the dispatch-check-permission-concurrency-limit? There are like 250 spans for DispatchCheck and 150 for ExecuteQuery. Datadog doesn't allow me to export this trace easily, but it is a lot of what I attached here; seems like there are groupings of different sets of spans starting. https://cdn.discordapp.com/attachments/1158451233426780221/1160662220980621312/trace_close_up.png?ex=653579ed&is=652304ed&hm=f42abc45ad4b8c93d78392b816846f41e0fd09587f896b20374247cab8cd3ced&
j
yeah, the question is why are queries taking 100+ms
if you can get an example of a query and run an EXPLAIN on it
that would be helpful
but I'd also suggest trying the ANALYZE TABLE on the relationships table too
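roughly, from a shell with psql available (the table name here is the one the Postgres datastore uses by default - confirm it in your database before running anything):
```
# refresh planner statistics for the relationships table (table name assumed)
psql "$DATABASE_URL" -c "ANALYZE relation_tuple;"

# then take one of the slow queries from the traces and run EXPLAIN (ANALYZE, BUFFERS) on it
```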
e
You may also want to check GC metrics. If you have a lot of garbage it’s possible that the periodic GC job is making some queries slow
c
Ok, I'll work on getting the explain on a query. As for the GC metrics, I do see the periodic GC and it seems to take around 40 ms each time. Using the lower connections (currently at 50r/20w) I do see an occasional spike for pgxpool_acquire_duration_ns. This was gone at the higher conns, but those were slower.
```
9:01PM INF debugging requested on check false
Error: proto: (line 1:291): unknown field "duration"
```
Using spicedb 1.25.0 and zed 0.14.0, might be because https://github.com/authzed/zed/blob/main/go.mod#L11 ?
j
need a new version of zed
download the most recent one
should work on 0.14.0
if not, we'll make a 0.15.0
c
0.14 is latest
Fairly confident we have an issue w/ our schema. Will share that once I can confirm
j
such as?
that shouldn't cause that `duration` issue though
c
Looks like possibly some big queries we're doing may be the cause of the issues
So the call is like checkPermission on someResource:A write user:B
Would a subject relation on someResource help here?
v
have you run the same check call with `--explain` to understand where time is being spent? We can get that through the traces, but in a lower-level way that can be difficult to parse
my only guess is that some of those relations over `organization` are relatively wide (e.g. key, admin->key, owner->key)? that's something you should see with `--explain`
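e.g. reusing the call shape you described earlier (the resource/permission/subject names are just the ones from your example):
```
# re-run the slow check and dump the resolution tree to see where time is spent
zed permission check someResource:A write user:B --explain
```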
c
I do have an explain, which is massive. It seems to look over all orgs and then for each org it looks at admins, owners, and keys. I am thinking we just break this into two permission checks: one for the authorized org on someResource and one for the write/read on the org.
e
do you need keys to be associated with orgs? it looks like you could get the same thing with:
```
definition user {
    relation key: key
}

definition key {}

definition organization {
    relation owner: user
    relation admin: user
    relation member: user

    permission write = owner->key + admin->key
    permission read = write + member->key
}
```
c
A user can be in more than one org, but a key is tied to a particular org (and user)
e
```
definition user {
    relation key: key
}

definition key {}

definition organization {
    relation owner: user
    relation admin: user
    relation member: user
    relation key: key

    permission write = key & (owner->key + admin->key)
    permission read = write + (key & member->key)
}
```
at the very least you don't need to include all of the users in the result set
assuming you're doing a check on key
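e.g. a check against that sketch would look something like this (the IDs here are made up):
```
# can this key write in the org?
zed permission check organization:someorg write key:somekey
```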
c
Sorry, a user can use a key, or there is another scenario where they can use themselves.
Was hoping that it would recognize which path to check depending on whether it's a key or a user
But either way, should I just do a double permission check, if that's an option here?
Or would LookupResources be more effective, as we can then start from the user or key? The double check seems like an anti-pattern
v
Does the explain show all branches being evaluated, even when the type is not applicable to the request? Because I wonder if we could short-circuit that. Have you tried turning around the key intersection? I recall there is an optimization when intersecting with the wildcard, where depending on the order of the operands in the expression you could short-circuit loading the wider relationship. I assume that'd be "org#key" instead of "user#key"
Maybe @Joey would be able to see an optimization opportunity in your schema
c
Does LookupResources follow the same path, or would it work backwards? i.e. user -> set of orgs -> authorized orgs. It does show all branches being evaluated
j
yes and it will follow them all
What, exactly, are you trying to do?
c
In short, check that a user/key is in an authorized org on someResource with write permission in that org. Users are in a small number of orgs, typically
Other resources typically have a single org, but this one is essentially a singleton, so almost every org has a relationship
j
Why the double key check?
c
A key can't be used across orgs; a user can be in multiple orgs and can have a key in each of them
So that extra org key relation is just stating that that key can only be used in that org
But there usually aren't too many keys per org. The main issue is the first branch from someResource write, which checks every org for the specified key/user. The other, smaller issue is that it is also trying to check for the key when a user is passed
j
Yeah, I think this can be optimized
I’ll have to give it some thought
c
Thanks!
e
do you need the schema to enforce that a key is only used in one org? that could also just be built into your app's interface, i.e. when you make a key you write `user:me#key@key:org1_thekey` instead of `user:me#key@key:thekey` and `org:org1#key@key:thekey`. Down the line, if you want to make cross-org keys, the schema would allow you to. I do think you've found a case that we need to optimize, but I'm trying to understand your requirements a bit better
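for completeness, writing that single relationship with zed would look roughly like this (same placeholder IDs as above):
```
# the key is scoped to the org via its ID, so no separate org#key relationship is needed
zed relationship create user:me key key:org1_thekey
```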
c
Yes, I believe we need to enforce that key-to-single-org requirement for now. Do you see any hope for speeding up this query on that initial search of all orgs, or will this have to be addressed using multiple checks?
LookupResources on a user will look at all orgs since it's a directed graph?
I'm also looking for a mitigation for this slow query while we find a longer-term solution
Also, if not, what is the best approach for singleton-type resources in the future?
j
trying to fully understand...
even at 10K orgs, that's a maximum of 100 requests
which isn't super fast, mind you, but should process fairly quickly
where, exactly, is the slowdown?
> Also I was surprised to see it querying for keys since we were checking permission on a user rather than a key (so I assumed it knew only to look up owner+admin relations since they are the only ones for a user to be authorized?)
this is known as reachability-based optimization: there is a TODO for us to add it, but it was deemed low priority because it's easy to work around (and somewhat rare for the same rules to be used for different terminal subject types)
typically, you'd just have two permissions in the worst case: `view` and `viewable_by_key` or somesuch, as the current workaround
which isn't as nice, but does work
c
From my understanding, there are say 10,000 orgs, in which it then looks for owner, admin, member, and key at 100 each, so isn't that 10000*4 / 100 = 400 queries for that, plus 100 for fetching the authorized orgs? So 500 queries, but the conn pool is only at 50, so it can't service other requests until it is completed?
I see some individual queries taking as much as 100ms as well. Actually, the explain shows it's also looking for write (which is the permission) rather than a relationship; a lot of these write queries are longer than average - I see a bunch at/around 100ms
j
it's actually limited to 50 due to concurrency controls, but yes
> Actually the explain shows its also looking for write (which is the permission) rather than a relationship
permissions are not looked up against the DB
c
Didn't think so, then is that explain just checking for cached write permission?
j
if it says (cached) next to it
do you mean `--explain` or `EXPLAIN`?
`--explain` will show the permissions, but that doesn't hit the DB
> so isn't that 10000*4 / 100 = 400 queries for that plus 100 for fetching the authorized orgs

yes, but this can be worked around by walking to the org once
which is generally the preferred design
c
--explain
j
`org->combined_perm` vs `org->read + org->write + org->admin` - the latter is heavier
c
Right, we are using the combined permission org->write for the singleton resource, though. It's really just that it looks across all orgs and at the 4 relationships for those orgs. Though this does make sense, because how would it know which org a user/key is in, and any of them may end up being an authorization path.
Appreciate the help. So it seems like reworking how we query here is the best approach then (that is, checking if an org is authorized and then checking if a user has write in that org)? I believe we have the org info, so this should be possible. Open to any suggestions; otherwise we will go that route.
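So the mitigation would be two narrower checks, something along these lines (the authorized_org permission name is hypothetical - whatever relates an org to someResource in our schema; org ID made up):
```
# 1) is this org authorized on the singleton resource? (permission name hypothetical)
zed permission check someResource:A authorized_org organization:someorg

# 2) does the user have write in that org?
zed permission check organization:someorg write user:B
```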
j
for now, yes
the reality is if you need to check across 10K orgs, then we need to, well, check 10K orgs
there has been a proposal to reverse the check based on cardinality
but that has major caching implications
c
Cool, thanks again!
j
also, in your above example, if you have `permission whatever = first + second + third` on the org, scaling up your SpiceDB cluster should help because each subproblem will be dispatched