# spicedb
f
Hi! I'm looking into SpiceDB, running some small tests, and there are some (performance) metrics that I don't really understand. My test schema is very similar to https://github.com/authzed/examples/tree/main/schemas/google-iam (which I guess means quite a few subproblems need to be checked for each CheckPermission call). But what's puzzling me is that I'm seeing almost 10x `DispatchCheck` requests vs `CheckPermission` calls, and what looks like a very poor cache hit ratio. I've tried playing around with the cache parameters (spread, quantization, etc.) but it hasn't made much of a difference. What I find very puzzling and am failing to understand is that the number of `DispatchCheck`s seems to be much lower when nodes are freshly started. Even with lower cache hits (new nodes), SpiceDB seems to hit the database way less in this scenario, purely due to the huge drop in `DispatchCheck`s.
v
hi, can you describe more about your setup?
- number of relationships
- widest relation (the relation with the largest number of relationships for a single object)
- number of checks per second
- regions your Spanner instance is deployed to
f
The # of relationships was maybe 10 million last I checked. The widest relation is in the thousands, but most are 2-5. For a regular check there'd be around 5-7 subproblems to compute. Currently doing ~300 RPS. Spanner and everything else is running in us-central
That’s ballpark, I can get exact numbers when I’m by a computer if that’s helpful
The widest object is a "system" object that a lot of entities have as their root, e.g. to easily be able to define a superadmin
y
hmm... that system object might be a potential problem if it's regularly in the path of a check
(but i'll let victor weigh in)
v
would you say that the relationships "exercised" in your checks follow a Pareto distribution? I'm thinking that at 300 RPS with 10M relationships, the likelihood of cache hits is pretty low, given that the cache expires every 10 seconds by default?
is that 300 RPS 100% of your traffic? or just a small percentage?
here is a blog post I wrote that shows some of the expected cache hit ratios based on the QPS. It may not be applicable to your schema, but gives you a mental model: https://authzed.com/blog/google-scale-authorization
you can see that at 10M relationships and 1K RPS, we got a 33.20% cache hit ratio
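(Back-of-the-envelope with the numbers in this thread, purely illustrative: a 10 s cache window at 300 RPS covers only about 3,000 checks. Unless those checks are heavily skewed toward a few hot objects, the subproblems they touch, drawn from something like 10M relationships, rarely repeat within a single window, and a repeat within the window is exactly what a cache hit requires.)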
also: I'm going to assume that your dispatching is correctly configured and that subproblems are routed to the correct SpiceDB nodes. If that's not the case, that would also explain a low cache hit ratio
the number of dispatches, as per your initial comment, is totally expected. For a complex schema you can expect fan out in dispatches.
And this is the right PromQL query, just to make sure you are computing cache hit ratio the same way as we do:
```
sum(rate(spicedb_services_dispatches_sum{cached="true"}[$__rate_interval])) / sum(rate(spicedb_services_dispatches_sum{}[$__rate_interval]))
```
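(`$__rate_interval` is a Grafana variable; if you run the query directly against Prometheus, substitute a concrete range such as `5m`.)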
now, I don't rule out an issue with Spanner optimized revision selection, but I'd be surprised since it's timestamp-based
y
they're using the operator on EKS, so i think their dispatch should be working properly
f
@vroldanbet yeah I'm suspecting something similar -- that it's pretty evenly distributed and that we aren't seeing many cache hits because of that
@vroldanbet have read that article (and many more) -- great stuff! 🙂
> is that 300 RPS 100% of your traffic? or just a small percentage?
I think it's ~1/2 IIRC
But maybe I should re-read that article, I may have gotten some assumptions wrong 🙏
> the number of dispatches, as per your initial comment, is totally expected. For a complex schema you can expect fan out in dispatches.
That's nice to know at least! 👍 We suspected that (also doing some back-of-the-napkin calculations and diving into the codebase). What we haven't fully understood though is some of the patterns we're seeing over time
I can share some graphs tomorrow!
y
oh! another implementation detail: you'll have at least as many dispatches as there are branches in your schema, but there may be more
because a dispatch will have a cap on the number of relations that it checks
a flag to play with is
```
--dispatch-chunk-size uint16    maximum number of object IDs in a dispatched request (default 100)
```
it defaults to 100, but we've had luck increasing that to 500 or 1000. The performance benefits, or lack thereof, will depend on the characteristics of your database and its sizing.
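As a minimal sketch of what that looks like when running the binary directly (the datastore and auth flags below are placeholders for whatever your deployment already passes; with the operator on EKS, the equivalent setting goes through the SpiceDBCluster config rather than raw flags):

```sh
# Illustrative only: raise the dispatch chunk size from its default of 100.
spicedb serve \
  --grpc-preshared-key "REPLACE_ME" \
  --datastore-engine spanner \
  --datastore-conn-uri "projects/PROJECT/instances/INSTANCE/databases/DATABASE" \
  --dispatch-chunk-size 500
```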
f
Ah, that's interesting. I'll play around with it a bit and see if I can spot any differences! Thanks for the tip!
v
yeah that's right, that flag helps a lot with wide relations, but the tradeoff is that it makes queries more expensive, so you need to keep an eye on that. I'd suggest being conservative in increasing it; for example, we've seen that for some customers pushing it beyond 500 gave diminishing returns
f
Cool! I'll play around with that, and also potentially remove a bunch of role bindings/granted roles that are currently unused from the schema and see if I see any differences 👍
Ok, getting back to this, I think I've finally figured it out. I had mistakenly interpreted the quantization max staleness percent as being specified as a percent (had it set to 100) when it's actually a fraction, so `1.0` was what I intended to set. 😅
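For anyone landing here later: the value is a fraction of the quantization interval, not a 0-100 percentage. A minimal sketch of the two flags involved (names as in recent SpiceDB releases; defaults can vary by version, so check `spicedb serve --help`):

```sh
# Illustrative only: 1.0 means "100% of the quantization interval",
# whereas 100 would be interpreted as 10000%.
spicedb serve \
  --datastore-revision-quantization-interval 5s \
  --datastore-revision-quantization-max-staleness-percent 1.0
```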