# spicedb
c
Hello, we finally completed our upgrade to 1.25.0 and were hoping it would resolve our errors, but we are still seeing:
- occasional latency spikes
- CheckPermission errors:
  - rpc error: code = Unavailable desc = error reading from server: EOF
  - rpc error: code = Canceled desc = context canceled

This is with dispatching disabled and a relatively small load (~10-15 RPS, ~75K relationships). Any ideas why this may still be happening?
e
Can you share more about your configuration? Do you see the CheckPermission errors in the clients, or just in the server logs?
c
Clients can see the errors. We use standard calls (some usage of fully_consistent, but not much overall) and no adjustments to the defaults for serve. The errors do make it to the clients.
AWS Postgres 13.8 + EKS
v
I'd suggest turning on debug logs on the server side
c
Our schema isn't that complex either (largest query depth would be 4 levels)
Also hitting some container restarts due to OOM:
```
State:          Running
  Started:      Mon, 02 Oct 2023 15:12:14 -0400
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 02 Oct 2023 14:10:06 -0400
  Finished:     Mon, 02 Oct 2023 15:12:14 -0400
Ready:          True
Restart Count:  2
Limits:
  ephemeral-storage:  1G
  memory:             2G
Requests:
  cpu:                700m
  ephemeral-storage:  250M
  memory:             1G
```
Not seeing much with debug logs on.
Still getting OOMKilled after upping the request/limit to 1.5GB/2.5GB
Again, this isn't with the dispatch server enabled (we don't have any flags modifying defaults)
v
would you mind sharing your SpiceDB configuration flags/envs?
I'd suggest getting a pprof heap profile and sharing it with us via https://pprof.me/
you can do that with:
```
go tool pprof http://$SPICEDB_URL:$METRICS_PORT/debug/pprof/heap
```
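if it's easier, you can also save the profile to a file and browse it locally before uploading (standard Go pprof tooling; host/port are just your metrics endpoint):
```
# save the heap profile so it can be shared
curl -s "http://$SPICEDB_URL:$METRICS_PORT/debug/pprof/heap" -o heap.pprof

# open an interactive web UI for the saved profile
go tool pprof -http=:8080 heap.pprof
```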
c
We're not using any flags (besides --datastore-conn-uri); not much in the values either:
```
PGSSLMODE: verify-full
PGSSLROOTCERT:
SPICEDB_DATASTORE_ENGINE: postgres
SPICEDB_GRPC_PRESHARED_KEY:
SPICEDB_HTTP_ENABLED: false
SPICEDB_METRICS_ENABLED: true
SPICEDB_DASHBOARD_ENABLED: false
SPICEDB_LOG_LEVEL: debug
SPICEDB_LOG_FORMAT: json
SPICEDB_TELEMETRY_ENDPOINT: ""
SPICEDB_METRICS_ADDR: ":9091"
SPICEDB_OTEL_PROVIDER: otlpgrpc
SPICEDB_OTEL_TRACE_PROPAGATOR: w3c
```
Memory usage is very periodic and follows the same trend, then the container gets OOMKilled:
https://cdn.discordapp.com/attachments/1158451233426780221/1158722953270267985/Memory_usage_spicedb_containers.png?ex=651d4858&is=651bf6d8&hm=8b851fc92eef386ee156a4a05adbc39e44c175b06bbd14f4120f8a7b651816e5&
Any specific time I should try and get a heap profile from?
@vroldanbet DM'd you the pprof
v
what version were you running before?
I don't see anything that stands out as a leak in that heap profile, other than OTel occupying a lot of memory
are you sure your traces are being delivered?
could you try disabling otel?
You can do that by setting `SPICEDB_OTEL_PROVIDER: none`
also, the startup logs will give you insight into the cache sizes selected for the various caches
@ch-2023 moving the convo back to public so that @ecordell can also chime in. So from what we've discussed:
- you are using postgres
- you moved from 1.14.1 all the way up to 1.25.0
- you are not using the operator
- your workload is mostly Checks, Lookups are few, and the OOM kills do not line up with them
chances are that the cache-sizing algorithm was introduced somewhere between 1.14 and 1.25, and that it's not working well for you
I'd suggest manually setting the cache sizes to see if that has any effect. Please also provide the log messages that SpiceDB shows on bootstrap with the selected cache sizes.
c
Here is our current cache:
```
{"defaultTTL":20600,"level":"info","maxCost":"1.8 GiB","message":"configured dispatch cache","numCounters":10000,"time":"2023-10-03T13:09:38Z"}
{"defaultTTL":0,"level":"info","maxCost":"32 MiB","message":"configured namespace cache","numCounters":1000,"time":"2023-10-03T13:10:37Z"}
```
v
flags are:
```
--dispatch-cache-max-cost
--dispatch-cluster-cache-max-cost
```
c
We are currently at 2.5GB limit
For the container
v
you can set those flags in bytes
c
Any recommendation to try first?
v
try both
e
Did you try disabling otel to see if there's a difference? The cache config by default tries to work out how much space to use as a fraction of the available memory. If you're using otel aggressively, that can be too much overhead for the remaining space, and you can get OOM.
yeah, these two things in combination are the likely culprit
c
Sorry, I meant values for those flags
v
well that's up to you and your deployment limit. Start with something conservative like 250MB and see how it fares
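for example, something along these lines (values are purely illustrative - roughly 250MB expressed in bytes, per the note above - and $DATABASE_URL is just a placeholder for your connection string):
```
# ~250MB per cache, expressed in bytes; adjust to your container limit
spicedb serve \
  --datastore-engine=postgres \
  --datastore-conn-uri="$DATABASE_URL" \
  --dispatch-cache-max-cost=262144000 \
  --dispatch-cluster-cache-max-cost=262144000
```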
then if you see that the growth still happens, try disabling otel
otel should be doing 1% sampling by default; it's weird that it's using so much space
c
We didn't change the otel sample limit or anything - what do you mean by using otel aggressively?
v
well that's the thing, it would look like OTel is using a good chunk of memory, and that's surprising for a low-scale deployment and for one that hasn't changed the default sampling rate
c
Still had an OOM kill after about an hour - config dump:
```
{"async":false,"format":"json","level":"info","log_level":"debug","message":"configured logging","provider":"zerolog","time":"2023-10-03T16:06:47Z"}
{"balancerconfig":{"loadBalancingConfig":[{"consistent-hashring":{"replicationFactor":100,"spread":1}}]},"concurrency-limit-check-permission":50,"concurrency-limit-lookup-resources":50,"concurrency-limit-lookup-subjects":50,"concurrency-limit-reachable-resources":50,"level":"info","message":"configured dispatcher","time":"2023-10-03T16:06:47Z"}
{"defaultTTL":20600,"level":"info","maxCost":"238 MiB","message":"configured dispatch cache","numCounters":10000,"time":"2023-10-03T16:06:47Z"}
{"level":"info","message":"telemetry disabled","time":"2023-10-03T16:06:47Z"}
```
https://cdn.discordapp.com/attachments/1158451233426780221/1158804005359730783/message.txt?ex=651d93d4&is=651c4254&hm=abe871c99608610dd3fcd0b3785d6632d8466bde709a0bfeb7f62eec834d12dd&
v
if you are not doing dispatch, you could disable the cluster dispatch cache, not sure if ristretto does any preallocations there
I'd suggest doing that, and if that also fails, then next disable otel
c
I had set SPICEDB_OTEL_PROVIDER to none. So I will flip the --dispatch-cluster-cache-enabled to false
v
after that I'd suggest disabling all caching; if that helps, then we know there is some leak in the caches
c
😢 Confirmed another set of restarts - it is extremely periodic even with what we have changed. It spikes up to 1.5 GB memory usage, then eventually gets restarted after roughly an hour. I don't see any additional error logs to help diagnose.
What is considered a large call for LR? Ours is fully consistent and expects 0 to 1 items returned. Would this overflow the memory?
v
did you try disabling cache altogether?
I'd suggest getting another heap profile and sharing with us
LR not returning many values does not necessarily qualify it as "small". The call may be traversing large portions of the graph which have to be brought into memory, but I wouldn't expect them to cause a leak. It used to be a problem when SpiceDB didn't offer cursoring and LR calls would return thousands of elements.
c
The cluster cache was disabled, yes, but still no fix
v
all of the caches?
and otel also disabled?
c
Otel is disabled, only the cluster dispatch cache is disabled, the dispatch cache is set to 250 MB, and no changes to the ns cache
v
please disable all caches to rule out leaks there
then fetch a heap profile when it's approaching limit
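something like this should do it (I'm writing the cache flags from memory, so double-check them against spicedb serve --help for your version; $DATABASE_URL is a placeholder):
```
# turn off all SpiceDB caches to rule out a leak there (flag names assumed)
spicedb serve \
  --datastore-engine=postgres \
  --datastore-conn-uri="$DATABASE_URL" \
  --dispatch-cache-enabled=false \
  --dispatch-cluster-cache-enabled=false \
  --ns-cache-enabled=false
```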
c
Ok will try and grab that
sent you a pprof, this isn't right before breaching but it is close @User
v
the profile shows 50 MB used. Can you plot the various memory metrics and see if it's specific to the heap or other parts?
have you also disabled the namespace cache?
is your schema large?
c
Namespace cache is still running, schema has 5 resources with about 9 relation types. ~30K records
v
what do you mean by 30K records?
c
I'll share some plots, but container memory usage is still about the same as shared above (tracks the same path)
About 30k relationships in prod
v
try disabling the namespace cache, it's the last element I know of that can store off-heap and does not show up in the heap dump
c
What is the NS cache caching?
v
your schema
that's why I asked for the size of your schema
are you running any sidecar or anything alongside SpiceDB?
I gotta drop but feel free to ping @Joey or @ecordell if you gather more insight
c
Yes, we have another container in there that does some additional logic and then proxies the same requests to spicedb; its memory usage is very low. The container that restarts is spicedb. Will add any more info I have. Appreciate your help!
We look to be stable now at a high total container request/limit. Will monitor memory usage and let you know once we get the full root cause. Appreciate all the help! Have some additional questions about our 1.25.0 migration:
- most importantly, our CheckPermission calls are much slower than on the previous version for P95/P75 (but the same for P99/P50)
- we still see the occasional "context canceled" returned as an error (we still don't have cluster dispatch enabled)
Any ideas what this could be?
j
if calls are timing out, then context can be canceled
what version did you upgrade from?
c
1.13.0 -> 1.14.1 (4 part migration on postgres) -> 1.25.0 [dispatch disabled so just migrate head]
j
disabling dispatch will make things slower
you should reenable dispatching
c
Dispatch was always disabled (we just used the defaults)
j
then you should enable dispatching
c
Is 1.25.0 more dependent on cluster dispatch?
j
no, but it makes caching more efficient
there are also settings changes around connection pooling as well
c
Right, I'm just wondering why we saw a performance regression from the version bump?
Any ideas before I try enabling cluster dispatch + increasing connection pools?
j
considering it's 15 versions, it could be all sorts of things
you'd need to look at a trace to see
I'd also recommend looking at the connection delay prometheus metrics to see if it's waiting for connections
e
some of the defaults around connection pooling changed, that's a good thing to check
in general we expect 1.25 to be much faster on PG vs. 1.13, the tx management is much better
c
Thanks - will get back!
Noticed a few things. For context: we have low RPS (about 2-10 QPS for CheckPermission, 0-2 QPS for writes, 10 LookupResources/hour), are using all defaults (no additional flags), have 4 SpiceDB instances running 1.25.0, and a small-ish schema (7 namespaces with 11 relations in total):
- Our CheckPermission is slower on average than our WriteRelationships since the upgrade (strange?)
- DB connections increased when migrating from 1.14.1 to 1.25.0 (20 conns per instance to 30, for a total of 120)
- The longest CheckPermission requests spend most of their time in ExecuteQuery, with 1-2K spans
Just wanted to check if any ideas popped up as to why?
j
sounds like you're hanging on available connections
that's why I suggested checking the prom metric for connection wait time
c
Is there a list of the metrics? I am scrolling through them but not sure which to look at. Each instance seems to have enough connections to support the load independently, given we never breach 10 QPS and have ~30 open each?
j
```
max(rate(pgxpool_acquire_duration_ns[$__rate_interval]))
```
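if that's hard to spot in your dashboards, you can also grep the raw connection-pool metrics straight off a pod's metrics endpoint (same host/port as the pprof URL above):
```
# dump the pgxpool connection-pool metrics exposed by SpiceDB
curl -s "http://$SPICEDB_URL:$METRICS_PORT/metrics" | grep pgxpool
```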
c
So I should increase the connections for read then? I'm a little confused, since the latency was fine with fewer connections previously (20 total held per instance vs now 30)
I've noticed that jitter is set to 0, so I will change that. The default also says 20% of max, which defaults to 30m, so I assumed it would have been 6 min
v
if the jitter is not honoring the default behaviour feel free to open an issue 🙏
Have you taken a trace of your slow API queries? How big is that trace in terms of dispatches and nesting? Could you identify requests that take very long, and perform the same request with zed, passing the `--explain` flag? That would give you an easier way to visualize where time is being spent
c
Ok, I will work on getting more data. Thanks again for all the help!
v
some spans show the number of results of a query, and if you see blank spots, chances are this is contention over connection pools (unfortunately connection pool acquisition has no spans)
c
Is this trace interesting? I see a lot of these where the client is doing little for a long time, then spicedb takes action very quickly. I don't yet have the production ones for our much slower requests, will have those next week https://cdn.discordapp.com/attachments/1158451233426780221/1159951072576675850/Slow_Client.png?ex=6532e39e&is=65206e9e&hm=69b9e53782503bfe5d9aebe66407e54febfc6f7a521691911ddd5386fcb9029f&
Seems like the client might be hanging on the connection to spicedb, or am I reading this trace wrong? We have an insecure connection to spicedb inside the pod and a secure connection to the DB.
j
that could also be waiting on the DB connection
depending on what the blue part shows
that's what @vroldanbet was saying earlier: if you're waiting on DB connections, the system will just... wait
c
the pgxpool_acquire_duration_ns stays low throughout, stable at its minimum. I'm out for the night now. Thanks so much for your help again, and hope you all have a great weekend. Will get back when I have more data/tracing. Cheers
Some more data coming in:
- Increasing connections from the default 20r/10w to 50r/20w had an impact on our P75 but none on P95/P99.
- Increasing connections from 50r/20w to 300r/100w actually increased our P75 again, but the acquire duration was near 0 the whole time.
- 1.13.0 had a better P95 than current with all defaults (so just 20 total conns).
- It uses all conns even if I disable load in our staging (I would assume to see it go down to the min? pgxpool_idle_conns is non-zero but unsure what that means).
- We have 4 pods running (again, read QPS of 10, 1 for writes) but are autoscaling on CPU, which doesn't seem to be the best metric - what would you recommend here?

Attached a slow (~800ms) high-level trace; spicedb is all of the blue. I don't have much experience reading these but noticed it has a lot of queries and also has some that go past the response leaving our service. Any ideas on what else to try? Thanks!
https://cdn.discordapp.com/attachments/1158451233426780221/1160584852316356699/Trace.png?ex=653531df&is=6522bcdf&hm=040dbb4537299db36658fdce8823c93ff60dc3d06e417cdf27340dd4a1c4e810&
https://cdn.discordapp.com/attachments/1158451233426780221/1160584852731596881/Span.png?ex=653531df&is=6522bcdf&hm=4b47baa573deef400d1d91b6bcea7cadab8247be61322215e3a0b266bad267d4&
j
that doesn't really show anything I can read, unfortunately
c
Yeah, it's a very large trace; I can zoom in if you need, was just trying to give high-level context. Any useful info from the comments above outside of the trace?
j
Nope
Not without knowing what the traces show
e
Depending on your PG instance, you may get better perf with fewer connections. Connections on PG are actually pretty resource intensive.
c
I'll try and export this trace as best I can. We're using a db.r6g.4xlarge, so you think fewer connections rather than more? So less than the default of 20r/10w?
e
It’s worth measuring. Not at a computer to check what we normally do for PG, and I don’t have it off the top of my head
But you’re running 4 nodes right? The defaults are typically sized to be okay for 3 nodes. That said if you don’t have evidence of contention on the RDS side this is probably just micro-tuning and not the root cause of your issues
c
Yeah, especially since perf was good on 1.13.0 with all defaults for P95 (but it had other issues that meant we needed to upgrade). Now on 1.25.0 I can't seem to get it to be as fast with the same setup.
e
I would at least rule out by trying with 20conns per node and see. Maybe 15r/5w unless you have a very heavy write load
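roughly something like this, if you want to try it (I'm writing the pool flag names from memory, so please verify them against spicedb serve --help for 1.25; $DATABASE_URL is a placeholder):
```
# cap the read/write pools at 15/5 connections per node (illustrative values)
spicedb serve \
  --datastore-engine=postgres \
  --datastore-conn-uri="$DATABASE_URL" \
  --datastore-conn-pool-read-max-open=15 \
  --datastore-conn-pool-write-max-open=5
```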
j
Also run analyze on the tables too; it's certainly possible PG's stats are wrong now that the tables have changed
c
High read load (about 10-100x read vs write depending on time)
Should I try cranking up the dispatch-check-permission-concurrency-limit? There are like 250 spans for DispatchCheck and 150 for ExecuteQuery. Datadog doesn't allow me to export this trace easily, but it is a lot of what I attached here; seems like there are groupings of different sets of spans starting. https://cdn.discordapp.com/attachments/1158451233426780221/1160662220980621312/trace_close_up.png?ex=653579ed&is=652304ed&hm=f42abc45ad4b8c93d78392b816846f41e0fd09587f896b20374247cab8cd3ced&
j
yeah, the question is why are queries taking 100+ms
if you can get an example of a query and run an EXPLAIN on it
that would be helpful
but I'd also suggest trying the ANALYZE TABLE on the relationships table too
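roughly, from a shell with psql available (the table name here is the one the Postgres datastore uses by default - confirm it in your database before running anything):
```
# refresh planner statistics for the relationships table (table name assumed)
psql "$DATABASE_URL" -c "ANALYZE relation_tuple;"

# then take one of the slow queries from the traces and run EXPLAIN (ANALYZE, BUFFERS) on it
```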
e
You may also want to check GC metrics. If you have a lot of garbage it’s possible that the periodic GC job is making some queries slow
c
Ok, I'll work on getting the explain on a query. As for the GC metrics, I do see the periodic GC and it seems to take around 40 ms each time. Using the lower connections (currently at 50r/20w) I do see an occasional spike for pgxpool_acquire_duration_ns. This was gone at the higher conns, but those were slower.
```
9:01PM INF debugging requested on check false
Error: proto: (line 1:291): unknown field "duration"
```
Using spicedb 1.25.0 and zed 0.14.0, might be because https://github.com/authzed/zed/blob/main/go.mod#L11 ?
j
need a new version of zed
download the most recent one
should work on 0.14.0
if not, we'll make a 0.15.0
c
0.14 is latest
Fairly confident we have an issue w/ our schema. Will share that once I can confirm
j
such as?
that shouldn't cause that `duration` issue though
c
Looks like possibly some big queries we're doing may be the cause of the issues
So the call is like checkPermission on someResource:A write user:B
Would a subject relation on someResource help here?
v
have you run the same check call with `--explain` to understand where time is being spent? We can get that through the traces, but in a lower-level way that can be difficult to parse
my only guess is that some of those relations over `organization` are relatively wide (e.g. key, admin->key, owner->key)? that's something you should see with `--explain`
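e.g. reusing the call shape you described earlier (the resource/permission/subject names are just the ones from your example):
```
# re-run the slow check and dump the resolution tree to see where time is spent
zed permission check someResource:A write user:B --explain
```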
c
I do have an explain, which is massive. It seems to look over all orgs and then for each org it looks at admins, owners, and keys. I am thinking we just break this into two permission checks: one for the authorized org on someResource and one for the write/read on the org.
e
do you need keys to be associated with orgs? it looks like you could get the same thing with:
```
definition user {
    relation key: key
}

definition key {}

definition organization {
    relation owner: user
    relation admin: user
    relation member: user

    permission write = owner->key + admin->key
    permission read = write + member->key
}
```
c
A user can be in more than one org, but a key is tied to a particular org (and user)
e
```
definition user {
    relation key: key
}

definition key {}

definition organization {
    relation owner: user
    relation admin: user
    relation member: user
    relation key: key

    permission write = key & (owner->key + admin->key)
    permission read = write + (key & member->key)
}
```
at the very least you don't need to include all of the users in the result set
assuming you're doing a check on key
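e.g. a check against that sketch would look something like this (the IDs here are made up):
```
# can this key write in the org?
zed permission check organization:someorg write key:somekey
```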
c
Sorry, a user can use a key, or there is another scenario where they can use themselves.
Was hoping that it would recognize which path to check depending on whether it's a key or a user
But either way, should I just do a double permission check, if that's an option here?
Or would LookupResources be more effective, as we can then start from the user or key? The double check seems like an anti-pattern
v
Does the explain show all branches being evaluated, even when the type is not applicable to the request? Because I wonder if we could short-circuit that. Have you tried turning around the key intersection? I recall there is an optimization when intersecting with the wildcard, where depending on the order of the operands in the expression you could short-circuit loading the wider relationship. I assume that'd be "org#key" instead of "user#key"
Maybe @Joey would be able to see an optimization opportunity in your schema
c
Does LookupResources follow the same path, or would it work backwards? i.e. user -> set of orgs -> authorized orgs. It does show all branches being evaluated
j
yes and it will follow them all
What, exactly, are you trying to do?
c
In short, check that a user/key is in an authorized org on someResource with write permission in that org. Users are in a small number of orgs, typically
Other resources typically have a single org, but this one is essentially a singleton, so almost every org has a relationship
j
Why the double key check?
c
A key can't be used across orgs; a user can be in multiple orgs and can have a key in each of them
So that extra org key relation is just stating that that key can only be used in that org
But there usually aren't too many keys per org. The main issue is the first branch from someResource write, which checks every org for the specified key/user. The other, smaller issue is that it is also trying to check for the key when a user is passed
j
Yeah, I think this can be optimized
I’ll have to give it some thought
c
Thanks!
e
do you need the schema to enforce that a key is only used in one org? that could also just be built into your app's interface, i.e. when you make a key you write `user:me#key@key:org1_thekey` instead of `user:me#key@key:thekey` and `org:org1#key@key:thekey`. Down the line, if you want to make cross-org keys, the schema would allow you to. I do think you've found a case that we need to optimize, but I'm trying to understand your requirements a bit better
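for completeness, writing that single relationship with zed would look roughly like this (same placeholder IDs as above):
```
# the key is scoped to the org via its ID, so no separate org#key relationship is needed
zed relationship create user:me key key:org1_thekey
```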
c
Yes, I believe we need to enforce that key-to-single-org requirement for now. Do you see any hope for speeding up this query on that initial search of all orgs, or will this have to be addressed using multiple checks?
LookupResources on a user will look at all orgs since it's a directed graph?
I'm also looking for a mitigation for this slow query while we find a longer-term solution
Also, if not, what is the best approach for singleton-type resources in the future?
j
trying to fully understand...
even at 10K orgs, that's a maximum of 100 requests
which isn't super fast, mind you, but should process fairly quickly
where, exactly, is the slowdown?
> Also I was surprised to see it querying for keys since we were checking permission on a user rather than a key (so I assumed it knew only to look up owner+admin relations since they are the only ones for a user to be authorized?)
this is known as reachability-based optimization: there is a TODO for us to add it, but it was deemed low priority because it's easy to work around (and somewhat rare for the same rules to be used for different terminal subject types)
typically, you'd just have two permissions in the worst case: `view` and `viewable_by_key` or somesuch, as the current workaround
which isn't as nice, but does work
c
From my understanding, there are say 10,000 orgs, in which it then looks for owner, admin, member, and key at 100 each, so isn't that 10000*4 / 100 = 400 queries for that, plus 100 for fetching the authorized orgs? So 500 queries, but the conn pool is only at 50, so it can't service other requests until it is completed?
I see some individual queries taking as much as 100ms as well. Actually, the explain shows it's also looking for write (which is the permission) rather than a relationship; a lot of these write queries are longer than average - I see a bunch at/around 100ms
j
it's actually limited to 50 due to concurrency controls, but yes
> Actually the explain shows its also looking for write (which is the permission) rather than a relationship
permissions are not looked up against the DB
c
Didn't think so, then is that explain just checking for cached write permission?
j
if it says (cached) next to it
do you mean `--explain` or `EXPLAIN`?
`--explain` will show the permissions, but that doesn't hit the DB
> so isn't that 10000*4 / 100 = 400 queries for that plus 100 for fetching the authorized orgs

yes, but this can be worked around by walking to the org once
which is generally the preferred design
c
--explain
j
`org->combined_perm` vs `org->read + org->write + org->admin` - the latter is heavier
c
Right, we are using the combined permission org->write for the singleton resource, though. It's really just that it looks across all orgs and at the 4 relationships for those orgs. Though this does make sense, because how would it know which org a user/key is in, and any of them may end up being an authorization path.
Appreciate the help. So it seems like reworking how we query here is the best approach then (that is, checking if an org is authorized and then checking if a user has write in that org)? I believe we have the org info, so this should be possible. Open to any suggestions; otherwise we will go that route.
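So the mitigation would be two narrower checks, something along these lines (the authorized_org permission name is hypothetical - whatever relates an org to someResource in our schema; org ID made up):
```
# 1) is this org authorized on the singleton resource? (permission name hypothetical)
zed permission check someResource:A authorized_org organization:someorg

# 2) does the user have write in that org?
zed permission check organization:someorg write user:B
```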
j
for now, yes
the reality is if you need to check across 10K orgs, then we need to, well, check 10K orgs
there has been a proposal to reverse the check based on cardinality
but that has major caching implications
c
Cool, thanks again!
j
also, in your above example, if you have `permission whatever = first + second + third` on the org, scaling up your SpiceDB cluster should help because each subproblem will be dispatched