s

symion

11/16/2022, 9:28 PM
Ya, restarting spicedb really seems to tank performance
j

Joey

11/16/2022, 9:29 PM
then you need to investigate what your tests are doing, exactly
it will take a short period of time to acquire new connections, but that's measured in seconds
and the cache will take some time to fill, but that's not going to apply after a few seconds
unless your quantization window is extremely long
even 500ms is very, very long
s

symion

11/16/2022, 9:30 PM
we're just sending requests as fast as we can
j

Joey

11/16/2022, 9:30 PM
you're probably overloading your cluster or your DB
and "requests" can be quite different
the cache depends on what you're requesting
s

symion

11/16/2022, 9:32 PM
the db is pretty fine. It's a sustained test that does the same work in a loop. The only thing that changed was we restarted spicedb. The test never stopped.
j

Joey

11/16/2022, 9:32 PM
if you're just firing off the same check over and over
s

symion

11/16/2022, 9:33 PM
it's a fixed data set that we iterate over
j

Joey

11/16/2022, 9:33 PM
then it's likely due to the caches being empty
when you start up, we have to retrieve everything
which is likely taking some time and then you're piling up requests
eventually, the cache settles a bit
s

symion

11/16/2022, 9:33 PM
How can we mitigate that?
j

Joey

11/16/2022, 9:33 PM
which still doesn't explain why it gets slightly faster over time
don't throw a wall of traffic onto a cold cluster
what is that CPU usage of?
s

symion

11/16/2022, 9:34 PM
spicedb
j

Joey

11/16/2022, 9:34 PM
and what was the change?
s

symion

11/16/2022, 9:34 PM
restarted the deployment
I don't think that is practical in a Kubernetes environment. Things restart.
j

Joey

11/16/2022, 9:34 PM
yes, individual pods do
not the entire cluster unless there was a full outage
s

symion

11/16/2022, 9:34 PM
that's what happened
j

Joey

11/16/2022, 9:35 PM
that's a single pod?
s

symion

11/16/2022, 9:35 PM
or a deployment
j

Joey

11/16/2022, 9:35 PM
a deployment entirely failing means your entire cluster is gone
or do you mean a rolling update?
s

symion

11/16/2022, 9:35 PM
anything that changes the pod spec of a deployment will restart everything.
if we change env vars or move secrets
j

Joey

11/16/2022, 9:35 PM
it should do a rolling deployment
s

symion

11/16/2022, 9:35 PM
it is
j

Joey

11/16/2022, 9:35 PM
then you need a bigger cluster
to handle the traffic during the rollout
if that's the case, you're showing that the cache is handling traffic overhead solely because your load happens to be widely cacheable
real world traffic will likely have less caching
and that means you'll need more nodes to handle traffic
s

symion

11/16/2022, 9:37 PM
how many checks per second would you expect a single pod to be able to handle?
j

Joey

11/16/2022, 9:37 PM
I can't answer that
it depends entirely on the size of the nodes
the complexity of schema
and what the data shape is
a check could be a single relationship load
it could be 50 levels nested
s

symion

11/16/2022, 9:39 PM
the nodes have 4 CPU and 1GB of memory.
definition access_token {}

definition account {
  relation participant: user#member | service_account#member
  permission member = participant
}

definition user {
  relation inheritor: user | access_token
  relation access_token: access_token
  permission member = inheritor
  permission token = access_token
}

definition service_account {
  relation inheritor: service_account | access_token
  permission member = inheritor
}

definition role {
  relation inheritor: user#member | user#token | service_account#member
  relation parent_role: role#member
  relation account_role: account
  permission member = inheritor + parent_role
  permission parent = parent_role
  permission account = account_role
}

definition resource {
  relation reader: user#token | role#member | service_account#member
  relation writer: user#token | role#member | service_account#member
  relation manager: user#token | role#member | service_account#member
  permission manage = manager
  permission write = manage + writer
  permission read = write + reader
}
Schema looks like that
j

Joey

11/16/2022, 9:40 PM
and what are you check-ing?
and how deep are the roles?
that's a recursive definition - it could be N levels deep
s

symion

11/16/2022, 9:41 PM
We check to see if a given token has read, write or manage permission on a given resource
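(For reference: with the schema above, a check like that might look roughly like this using the authzed-go client. This is a sketch, not the actual test code; the endpoint, preshared key, and object IDs are placeholders, and the minimize_latency consistency mode is what lets SpiceDB answer from its cache.)

package main

import (
	"context"
	"log"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Placeholder endpoint and preshared key.
	client, err := authzed.NewClient(
		"spicedb.default.svc.cluster.local:50051",
		grpcutil.WithInsecureBearerToken("somepresharedkey"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("unable to create SpiceDB client: %v", err)
	}

	// Does access_token "tok_123" have read on resource "doc_456"?
	resp, err := client.CheckPermission(context.Background(), &v1.CheckPermissionRequest{
		// minimize_latency allows SpiceDB to serve the answer from cache when possible.
		Consistency: &v1.Consistency{
			Requirement: &v1.Consistency_MinimizeLatency{MinimizeLatency: true},
		},
		Resource:   &v1.ObjectReference{ObjectType: "resource", ObjectId: "doc_456"},
		Permission: "read",
		Subject: &v1.SubjectReference{
			Object: &v1.ObjectReference{ObjectType: "access_token", ObjectId: "tok_123"},
		},
	})
	if err != nil {
		log.Fatalf("check failed: %v", err)
	}
	log.Println("has permission:",
		resp.Permissionship == v1.CheckPermissionResponse_PERMISSIONSHIP_HAS_PERMISSION)
}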
j

Joey

11/16/2022, 9:42 PM
right, and looking at your schema
that could mean hitting each role
which, in turn, can have many parent roles
so that could spread out to many dispatches
so it really depends on how deep those are
s

symion

11/16/2022, 9:42 PM
we aren't doing any deep role testing in this.
j

Joey

11/16/2022, 9:42 PM
so what is your test data like then?
if it is fairly flat, then you're likely simply overloading the nodes
s

symion

11/16/2022, 9:46 PM
The test data only goes one level deep in role inheritance. The tests check if an access key has a permission.
j

Joey

11/16/2022, 9:47 PM
what, exactly, is the data you're writing and testing
s

symion

11/16/2022, 9:48 PM
I'm not sure what kind of an answer you are looking for
j

Joey

11/16/2022, 9:48 PM
the relationships you're writing and the checks you're performing, concretely
but either way, I'm nearly certain you're simply overloading your nodes
if you're firing off 20-50 RPS, and you have 4 CPU available, that means you're likely saturating the number of goroutines available for us to use
and since the dispatch sends requests for the same checks to the same nodes
s

symion

11/16/2022, 9:51 PM
what is the number of goroutines available?
j

Joey

11/16/2022, 9:52 PM
it's based roughly on CPU count
s

symion

11/16/2022, 9:52 PM
so 4?
j

Joey

11/16/2022, 9:52 PM
(edited for clarity) concurrently executing goroutines will be limited by the number of threads, which will be based on the CPU count
if you don't have many vCPUs
s

symion

11/16/2022, 9:52 PM
that doesn't sound like a lot
j

Joey

11/16/2022, 9:52 PM
it's not
and as I said, I suspect your load test, by hitting the same checks, is creating a hotspot
the reason it gets better over time is because you're just hitting the cache more and more
s

symion

11/16/2022, 9:53 PM
I guess I'm looking for a bit of operational guidance around tuning and sizing.
j

Joey

11/16/2022, 9:54 PM
you'll likely want bigger nodes with more CPU cores, to start
and you'll likely want to change your load tester to not check the same things over and over; or at least add a bit more variance/randomness
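(A minimal sketch of that suggestion, reusing the client from the earlier snippet plus math/rand; checkTarget and the targets slice are hypothetical stand-ins for whatever the 1MB fixture file actually contains.)

// Hypothetical shape of one entry from the test fixture.
type checkTarget struct {
	ResourceID string
	Permission string // "read", "write", or "manage"
	TokenID    string
}

// runLoad fires checks against randomly chosen targets instead of replaying
// the same fixed order, which spreads the work across more cache entries.
func runLoad(ctx context.Context, client *authzed.Client, targets []checkTarget) {
	for ctx.Err() == nil {
		t := targets[rand.Intn(len(targets))]
		_, err := client.CheckPermission(ctx, &v1.CheckPermissionRequest{
			Consistency: &v1.Consistency{
				Requirement: &v1.Consistency_MinimizeLatency{MinimizeLatency: true},
			},
			Resource:   &v1.ObjectReference{ObjectType: "resource", ObjectId: t.ResourceID},
			Permission: t.Permission,
			Subject: &v1.SubjectReference{
				Object: &v1.ObjectReference{ObjectType: "access_token", ObjectId: t.TokenID},
			},
		})
		if err != nil {
			log.Printf("check error: %v", err)
		}
	}
}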
s

symion

11/16/2022, 9:54 PM
is a shared cache on the table? a lot of the distributed DBs we've used in recent history have a disk cache or something. We'd run it on NVMe drives, etc.
I mean, it's a 1MB file of things to test
it's a lot of things
j

Joey

11/16/2022, 9:56 PM
we have some ideas around a shared L2 cache
we also have ideas around making it so the in-process cache can persist between restarts
but that's an optimization
s

symion

11/16/2022, 9:57 PM
what's the standard setup when you deploy a managed instance? what do you provision? Someone must have an idea of what is "pretty good" for most production workloads
j

Joey

11/16/2022, 9:57 PM
we usually work with the customer on that, but I'll get the node sizes we use
s

symion

11/16/2022, 9:57 PM
10 nodes? 50? 4CPU? 25?
j

Joey

11/16/2022, 9:58 PM
^
s

symion

11/16/2022, 9:58 PM
ok. thanks that would help
j

Joey

11/16/2022, 9:59 PM
for now, I recommend trying nodes with more (v)CPUs
s

symion

11/16/2022, 10:00 PM
I still haven't gotten back around to turning on the cluster dispatch. Is that something that helps in this kind of a situation?
j

Joey

11/16/2022, 10:00 PM
wait, this is with dispatch off?
s

symion

11/16/2022, 10:00 PM
ya
j

Joey

11/16/2022, 10:00 PM
do you have a load balancer in front of your cluster too?
s

symion

11/16/2022, 10:00 PM
ya
j

Joey

11/16/2022, 10:00 PM
but your load tester
is it making a single connection?
s

symion

11/16/2022, 10:01 PM
there is an http service that does all of the spicedb work. the load tester hits that.
there is an LB in front of the service. It's not the bottleneck, it never goes above 25% CPU utilization
j

Joey

11/16/2022, 10:03 PM
is the HTTP service itself making a single connection
s

symion

11/16/2022, 10:03 PM
There is a K8s service infront of spicedb
j

Joey

11/16/2022, 10:03 PM
or opening new ones
or...
s

symion

11/16/2022, 10:03 PM
gRPC connections
single one per worker
j

Joey

11/16/2022, 10:03 PM
and how many workers do you have?
s

symion

11/16/2022, 10:03 PM
5
j

Joey

11/16/2022, 10:03 PM
hrmph
that should distribute the load amongst the SpiceDB workers
s

symion

11/16/2022, 10:04 PM
it should...
j

Joey

11/16/2022, 10:04 PM
and how many 4 CPU nodes are you running?
s

symion

11/16/2022, 10:04 PM
4
j

Joey

11/16/2022, 10:05 PM
that should be fine for your load then
why the HTTP service?
s

symion

11/16/2022, 10:06 PM
it's the thing product will interface with instead of every service having to set up spicedb stuff. it's a simple HTTP API
j

Joey

11/16/2022, 10:06 PM
dispatch could help, yes, in that it would allow for better cache usage
but even without dispatch enabled, 20 RPS should be fine against 4 x 4 CPUs
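(For context, cluster dispatch is enabled with serve flags along these lines; the flag names and the kubernetes:/// resolver address below are from memory and worth double-checking against spicedb serve --help for the version in use.)

spicedb serve \
  --grpc-preshared-key "somepresharedkey" \
  --dispatch-cluster-enabled=true \
  --dispatch-upstream-addr "kubernetes:///spicedb.default:50053"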
s

symion

11/16/2022, 10:08 PM
It's more than 20 spicedb RPS
j

Joey

11/16/2022, 10:09 PM
why does the graph above show ~20 RPS?
going to ~80
s

symion

11/16/2022, 10:09 PM
That's total HTTP requests through the service
j

Joey

11/16/2022, 10:10 PM
so each HTTP request is generating ~100 Checks?
or am I missing some additional context
s

symion

11/16/2022, 10:11 PM
we have a bulk endpoint that sends 10-20 things to check. so in some cases I can do more than one thing, yes
j

Joey

11/16/2022, 10:11 PM
ah
then yeah, at 1000 or 2000 RPS, you need a bigger cluster
s

symion

11/16/2022, 10:11 PM
we don't know if that is practical yet. It could be a lot of single things.
ok
j

Joey

11/16/2022, 10:11 PM
I was operating under a very mistaken impression
sorry about that
s

symion

11/16/2022, 10:12 PM
so about 300 RPS per node?
in napkin math terms
j

Joey

11/16/2022, 10:12 PM
75-100 RPS per CPU, using very rough math, is a very rough idea
closer to 75
so if you're running ~2000 RPS though, you'll want ~27 cores or ~6 nodes at that size
but I'd scale based on what gives you the best performance
+1 for rolling deploys
this is all napkin math, of course
s

symion

11/16/2022, 10:14 PM
right
Other than CPU utilization, is there something we can monitor around work backing up as you describe? is there like a pending work queue or anything like that?
j

Joey

11/16/2022, 10:18 PM
dispatch metrics can give some deeper insight there, but they are only really used when dispatch is enabled
total request latency on the SpiceDB gRPC side is your best bet
gRPC exports a lot of good metrics
dispatch will, as I said, result in more requests going on, but better cache reuse
because nodes will talk to each other
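(As one concrete example: if SpiceDB's metrics endpoint exposes the standard go-grpc-prometheus handling-time histogram, a p99 check-latency query could look like the following; verify the metric and label names against the actual /metrics output before relying on it.)

histogram_quantile(
  0.99,
  sum by (le, grpc_method) (
    rate(grpc_server_handling_seconds_bucket{grpc_service="authzed.api.v1.PermissionsService"}[5m])
  )
)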
s

symion

11/16/2022, 10:29 PM
Wait, isn't the number of goroutines a process can run in the thousands? I'm a Go rookie, so tell me if I'm wrong, but 4 certainly doesn't seem right
Or is that a limit imposed by spicedb?
j

Joey

11/16/2022, 10:32 PM
I was incorrect above
I meant the number of concurrent routines, which is based on the CPU count
you can have thousands started concurrently, but they are ultimately handled by threads underneath
you're still limited on the number of requests that can be handled concurrently; of course, if one request is waiting on data from the datastore, another can continue forward, etc
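(In Go terms: goroutines can number in the thousands, but only GOMAXPROCS of them execute at the same time, and GOMAXPROCS defaults to the CPU count the runtime can see. A standalone sketch, not SpiceDB code:)

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS bounds how many goroutines run simultaneously;
	// it defaults to the number of CPUs visible to the runtime.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}

(Note that inside a container the Go runtime may see the node's CPUs rather than the pod's CPU limit unless something like uber-go/automaxprocs is used.)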