# spicedb
f
Hi! I'm looking into SpiceDB, running some small tests, and there are some (performance) metrics that I don't really understand. My test schema is very similar to https://github.com/authzed/examples/tree/main/schemas/google-iam (which I guess means quite a few subproblems need to be checked for each CheckPermission call). But what's puzzling me is that I'm seeing almost 10x `DispatchCheck` requests vs `CheckPermission` calls, and what looks like a very poor cache hit ratio. I've tried playing around with the cache parameters (spread, quantization, etc.) but it hasn't made much of a difference. What I find very puzzling and am failing to understand is that the number of `DispatchCheck`s seems to be much lower when nodes are freshly started. Even with lower cache hits (new nodes), SpiceDB seems to hit the database way less in this scenario, purely due to the huge drop in `DispatchCheck`s.
v
hi, can you describe more about your setup?
- number of relationships
- widest relation (the relation with the largest number of relationships for a single object)
- number of checks per second
- regions your Spanner instance is deployed to
f
The # of relationships was maybe 10 million last I checked. The widest relation is in the thousands, but most are 2-5. For a regular check there'd be around 5-7 subproblems to compute. Currently doing ~300 RPS. Spanner and everything else is running in us-central
That’s ballpark, I can get exact numbers when I’m by a computer if that’s helpful
The widest object is a "system" object that a lot of entities have as their root, e.g. to easily be able to define a superadmin
y
hmm... that system object might be a potential problem if it's regularly in the path of a check
(but i'll let victor weigh in)
v
would you say that the relationships "exercised" in your checks follow a Pareto distribution? I'm thinking that at 300 RPS with 10M relationships, the likelihood of cache hits is pretty low, given that the cache expires every 10 seconds by default?
is that 300 RPS 100% of your traffic? or just a small percentage?
here is a blog post I wrote that shows some of the expected cache hit ratios based on the QPS. It may not be applicable to your schema, but gives you a mental model: https://authzed.com/blog/google-scale-authorization
you can see that at 10M relationships and 1K RPS, we got a 33.20% cache hit ratio
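(Back-of-the-envelope with the numbers in this thread, purely illustrative: a 10 s cache window at 300 RPS covers only about 3,000 checks. Unless those checks are heavily skewed toward a few hot objects, the subproblems they touch, drawn from something like 10M relationships, rarely repeat within a single window, and a repeat within the window is exactly what a cache hit requires.)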
also: I'm going to assume that your dispatching is correctly configured and that subproblems are routed to the correct SpiceDB nodes. If that's not the case, that would also explain a low cache hit ratio
the number of dispatches, as per your initial comment, is totally expected. For a complex schema you can expect fan out in dispatches.
And this is the right PromQL query, just to make sure you are computing cache hit ratio the same way as we do:
```
sum(rate(spicedb_services_dispatches_sum{cached="true"}[$__rate_interval])) / sum(rate(spicedb_services_dispatches_sum{}[$__rate_interval]))
```
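(`$__rate_interval` is a Grafana variable; if you run the query directly against Prometheus, substitute a concrete range such as `5m`.)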
now, I don't rule out an issue with Spanner optimized revision selection, but I'd be surprised since it's timestamp-based
y
they're using the operator on EKS, so i think their dispatch should be working properly
f
@vroldanbet yeah I'm suspecting something similar -- that it's pretty evenly distributed and that we aren't seeing many cache hits because of that
@vroldanbet have read that article (and many more) -- great stuff! 🙂
> is that 300 RPS 100% of your traffic? or just a small percentage?
I think it's ~1/2 IIRC
But maybe I should re-read that article, I may have gotten some assumptions wrong 🙏
> the number of dispatches, as per your initial comment, is totally expected. For a complex schema you can expect fan out in dispatches.
That's nice to know at least! 👍 We suspected that (also doing some back-of-the-napkin calculations and diving into the codebase). What we haven't fully understood though is some of the patterns we're seeing over time
I can share some graphs tomorrow!
y
oh! another implementation detail: you'll have at least as many dispatches as there are branches in your schema, but there may be more
because a dispatch will have a cap on the number of relations that it checks
a flag to play with is
```
--dispatch-chunk-size uint16    maximum number of object IDs in a dispatched request (default 100)
```
it defaults to 100, but we've had luck increasing that to 500 or 1000. The performance benefits, or lack thereof, will depend on the characteristics of your database and its sizing.
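As a minimal sketch of what that looks like when running the binary directly (the datastore and auth flags below are placeholders for whatever your deployment already passes; with the operator on EKS, the equivalent setting goes through the SpiceDBCluster config rather than raw flags):

```sh
# Illustrative only: raise the dispatch chunk size from its default of 100.
spicedb serve \
  --grpc-preshared-key "REPLACE_ME" \
  --datastore-engine spanner \
  --datastore-conn-uri "projects/PROJECT/instances/INSTANCE/databases/DATABASE" \
  --dispatch-chunk-size 500
```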
f
Ah, that's interesting. I'll play around with it a bit and see if I can spot any differences! Thanks for the tip!
v
yeah that's right, that flag helps a lot with wide relations, but the tradeoff is that it makes queries more expensive, so you need to keep an eye on that. I'd suggest being conservative in increasing it; for example, we've seen that for some customers pushing it beyond 500 gave diminishing returns
f
Cool! I'll play around with that, and also potentially remove a bunch of role bindings/granted roles that are currently unused from the schema and see if I see any differences 👍
Ok, getting back to this, I think I've finally figured it out. I had mistakenly interpreted the quantization max staleness percent as being specified as a percent (had it set to 100) when it's actually a fraction, so `1.0` was what I intended to set. 😅
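For anyone landing here later: the value is a fraction of the quantization interval, not a 0-100 percentage. A minimal sketch of the two flags involved (names as in recent SpiceDB releases; defaults can vary by version, so check `spicedb serve --help`):

```sh
# Illustrative only: 1.0 means "100% of the quantization interval",
# whereas 100 would be interpreted as 10000%.
spicedb serve \
  --datastore-revision-quantization-interval 5s \
  --datastore-revision-quantization-max-staleness-percent 1.0
```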