# spicedb
s
Ya, restarting spicedb really seems to tank performance
j
then you need to investigate what your tests are doing, exactly
it will take a short period of time to acquire new connections, but that's measured in seconds
and the cache will take some time to fill, but that's not going to apply after a few seconds
unless your quantization window is extremely long
even 500ms is very, very long
s
we're just sending requests as fast as we can
j
you're probably overloading your cluster or your DB
and "requests" can be quite different
the cache depends on what you're requesting
s
The DB is pretty fine. It's a sustained test that does the same work in a loop. The only thing that changed is that we restarted SpiceDB. The test never stopped.
j
if you're just firing off the same check over and over
s
its a fixed data set that we iterate over
j
then it's likely due to the caches being empty
when you start up, we have to retrieve everything
which is likely taking some time and then you're piling up requests
eventually, the cache settles a bit
s
How can we mitigate that?
j
which still doesn't explain why it gets slightly faster over time
don't throw a wall of traffic onto a cold cluster
s
[shares a CPU usage graph]
j
what is that CPU usage of?
s
spicedb
j
and what was the change?
s
restarted the deployment
I don't think that is practical in a Kubernetes environment. Things restart
j
yes, individual pods do
not the entire cluster unless there was a full outage
s
that's what happened
j
that's a single pod?
s
or a deployment
j
a deployment entirely failing means your entire cluster is gone
or do you mean a rolling update?
s
anything that changes the pod spec of a deployment will restart everything.
if we change env vars or move secrets
j
it should do a rolling deployment
s
it is
j
then you need a bigger cluster
to handle the traffic during the rollout
if that's the case, you're showing that the cache is absorbing the traffic solely because your load happens to be widely cacheable
real world traffic will likely have less caching
and that means you'll need more nodes to handle traffic
s
how many checks per second would you expect a single pod to be able to handle?
j
I can't answer that
it depends entirely on the size of the nodes
the complexity of the schema
and what the data shape is
a check could be a single relationship load
it could be 50 levels nested
s
The nodes have 4 CPUs and 1GB of memory.
```
definition access_token {}

definition account {
  relation participant: user#member | service_account#member
  permission member = participant
}

definition user {
  relation inheritor: user | access_token
  relation access_token: access_token
  permission member = inheritor
  permission token = access_token
}

definition service_account {
  relation inheritor: service_account | access_token
  permission member = inheritor
}

definition role {
  relation inheritor: user#member | user#token | service_account#member
  relation parent_role: role#member
  relation account_role: account
  permission member = inheritor + parent_role
  permission parent = parent_role
  permission account = account_role
}

definition resource {
  relation reader: user#token | role#member | service_account#member
  relation writer: user#token | role#member | service_account#member
  relation manager: user#token | role#member | service_account#member
  permission manage = manager
  permission write = manage + writer
  permission read = write + reader
}
```
Schema looks like that
j
and what are you check-ing?
and how deep are the roles?
that's a recursive definition - it could be N levels deep
s
We check to see if a given token has read, write or manage permission on a given resource
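For concreteness, a single check like that might look roughly like this with the authzed-go client (a sketch; the endpoint, preshared key, and IDs are placeholders, and plaintext in-cluster gRPC is assumed):
```go
package main

import (
	"context"
	"log"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	authzed "github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Placeholder endpoint and preshared key; plaintext is assumed for an in-cluster connection.
	client, err := authzed.NewClient(
		"spicedb.authz.svc.cluster.local:50051",
		grpcutil.WithInsecureBearerToken("some-preshared-key"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("failed to dial SpiceDB: %v", err)
	}

	// Does access_token "tok-123" have the read permission on resource "res-456"?
	// With the schema above, read expands through write/manage and follows the
	// user#token / role#member subjects to resolve the token.
	resp, err := client.CheckPermission(context.Background(), &v1.CheckPermissionRequest{
		Resource:   &v1.ObjectReference{ObjectType: "resource", ObjectId: "res-456"},
		Permission: "read",
		Subject: &v1.SubjectReference{
			Object: &v1.ObjectReference{ObjectType: "access_token", ObjectId: "tok-123"},
		},
	})
	if err != nil {
		log.Fatalf("check failed: %v", err)
	}
	log.Println("has read:",
		resp.Permissionship == v1.CheckPermissionResponse_PERMISSIONSHIP_HAS_PERMISSION)
}
```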
j
right, and looking at your schema
that could mean hitting each role
which, in turn, can have many parent roles
so that could spread out to many dispatches
so it really depends on how deep those are
s
we aren't doing any deep role testing in this.
j
so what is your test data like then?
if it is fairly flat, then you're likely simply overloading the nodes
s
The test data only goes one level deep in role inheritance. The tests check if an access key has a permission.
j
what, exactly, is the data you're writing and testing
s
I'm not sure what kind of an answer you are looking for
j
the relationships you're writing and the checks you're performing, concretely
but either way, I'm nearly certain you're simply overloading your nodes
if you're firing off 20-50 RPS and you have 4 CPUs available, that means you're likely saturating the number of goroutines available for us to use
and since the dispatch sends requests for the same checks to the same nodes
s
what is the number of goroutines available?
j
it's based roughly on the CPU count
s
so 4?
j
(edited for clarity) concurrently executing goroutines will be limited by the number of threads, which is based on the CPU count
if you don't have any vCPUs
s
that doesn't sound like a lot
j
it's not
and as I said, I suspect your load test, by hitting the same checks, is creating a hotspot
the reason it gets better over time is because you're just hitting the cache more and more
s
I guess I'm looking for a bit of operational guidance around tuning and sizing.
j
you'll likely want bigger nodes with more CPU cores, to start
and you'll likely want to change your load tester to not check the same things over and over; or at least add a bit more variance/randomness
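a rough sketch of what that could look like (the pair type, fixture contents, and checkOne helper are hypothetical stand-ins for the real load tester):
```go
package main

import (
	"math/rand"
	"time"
)

// pair is a hypothetical fixture entry: one (resource, token) combination to check.
type pair struct{ ResourceID, TokenID string }

// checkOne stands in for the CheckPermission call shown in the earlier sketch.
func checkOne(resourceID, tokenID string) { /* issue the check here */ }

func main() {
	// Hypothetical fixture data; in practice this would be the full test file,
	// ideally covering many more distinct keys than a small fixed loop.
	pairs := []pair{
		{"res-1", "tok-a"}, {"res-2", "tok-b"}, {"res-3", "tok-c"},
	}

	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	for i := 0; i < 10000; i++ {
		// Picking pairs at random spreads checks across the key space instead of
		// replaying the same ordered loop, which is unrealistically cache-friendly.
		p := pairs[r.Intn(len(pairs))]
		checkOne(p.ResourceID, p.TokenID)
	}
}
```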
s
Is a shared cache on the table? A lot of the distributed DBs we've used in recent history have a disk cache or something. We'd run it on NVMe drives, etc.
I mean, it's a 1MB file of things to test
its a lot of things
j
we have some ideas around a shared L2 cache
we also have ideas around making it so the in-process cache can persist between restarts
but that's an optimization
s
What's the standard setup when you deploy a managed instance? What do you provision? Someone must have an idea of what is "pretty good" for most production workloads
j
we usually work with the customer on that, but I'll get the node sizes we use
s
10 nodes? 50? 4CPU? 25?
j
^
s
ok. thanks that would help
j
for now, I recommend trying nodes with more (v)CPUs
s
I still haven't gotten back around to turning on cluster dispatch. Is that something that helps in this kind of situation?
j
wait, this is with dispatch off?
s
ya
j
do you have a load balancer in front of your cluster too?
s
ya
j
but your load tester
is it making a single connection?
s
There is an HTTP service that does all of the SpiceDB work; the load tester hits that.
There is an LB in front of the service. It's not the bottleneck; it never goes above 25% CPU utilization
j
is the HTTP service itself making a single connection
s
There is a K8s service in front of SpiceDB
j
or opening new ones
or...
s
gRPC connections
a single one per worker
j
and how many workers do you have?
s
5
j
hrmph
that should distribute the load amongst the SpiceDB workers
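for reference, one way to make that spreading explicit, assuming the HTTP service is written in Go and a headless Service fronts the SpiceDB pods (the name, namespace, and key below are made up), is client-side round_robin over DNS:
```go
package main

import (
	"log"

	authzed "github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// With a headless Service, the dns:/// resolver returns every pod IP and
	// round_robin spreads each worker's RPCs across all of them, instead of a
	// single connection pinning to one pod behind a ClusterIP.
	client, err := authzed.NewClient(
		"dns:///spicedb.authz.svc.cluster.local:50051",
		grpcutil.WithInsecureBearerToken("some-preshared-key"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("failed to dial SpiceDB: %v", err)
	}
	_ = client
}
```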
s
it should...
j
and how many 4 CPU nodes are you running?
s
4
j
that should be fine for your load then
why the HTTP service?
s
It's the thing product will interface with instead of every service having to set up SpiceDB stuff. It's a simple HTTP API
j
dispatch could help, yes, in that it would allow for better cache usage
but even without dispatch enabled, 20 RPS should be fine against 4 x 4 CPUs
s
It's more than 20 SpiceDB RPS
j
why does the graph above show ~20 RPS?
going to ~80
s
That's total HTTP requests through the service
j
so each HTTP request is generating ~100 Checks?
or am I missing some additional context
s
we have a bulk endpoint that sends 10-20 things to check, so in some cases one request can do more than one check, yes
j
ah
then yeah, at 1000 or 2000 RPS, you need a bigger cluster
s
we don't know if that is practical yet. It could be a lot of single things.
ok
j
I was operating under a very mistaken impression
sorry about that
s
so about 300 RPS per node?
in napkin math terms
j
75-100 RPS per CPU is a very rough, napkin-math estimate
closer to 75
so if you're running ~2000 RPS, you'll want ~27 cores, or ~7 nodes at that size
but I'd scale based on what gives you the best performance
+1 node for rolling deploys
this is all napkin math, of course
s
right
Other than CPU utilization, is there something we can monitor around work backing up as you describe? Is there a pending work queue or anything like that?
j
dispatch metrics can give some deeper insight there, but they are only really used when dispatch is enabled
total request latency on the SpiceDB gRPC side is your best bet
gRPC exports a lot of good metrics
dispatch will, as I said, result in more requests going on, but better cache reuse
because nodes will talk to each other
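as a sketch of how you might watch that latency from the calling side, assuming the HTTP service is in Go (the endpoint, key, and metrics port below are placeholders):
```go
package main

import (
	"log"
	"net/http"

	authzed "github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Record a latency histogram per RPC, not just request counters.
	grpc_prometheus.EnableClientHandlingTimeHistogram()

	// Same placeholder endpoint/key as before; the interceptor observes every
	// Check the HTTP service sends to SpiceDB.
	client, err := authzed.NewClient(
		"spicedb.authz.svc.cluster.local:50051",
		grpcutil.WithInsecureBearerToken("some-preshared-key"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
	)
	if err != nil {
		log.Fatalf("failed to dial SpiceDB: %v", err)
	}
	_ = client

	// Expose grpc_client_handling_seconds{...} and friends for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9091", nil))
}
```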
s
Wait, isn't the number of goroutines a process can run in the thousands? I'm a Go rookie, so tell me if I'm wrong, but 4 certainly doesn't seem right
Or is that a limit imposed by SpiceDB?
j
I was incorrect above
I meant the number of concurrent routines, which is based on the CPU count
you can have thousands started concurrently, but they are ultimately handled by threads underneath
you're still limited on the number of requests that can be handled concurrently; of course, if one request is waiting on data from the datastore, another can continue forward, etc
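a tiny, SpiceDB-agnostic illustration of that distinction:
```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// NumCPU is the number of logical CPUs visible to the process; GOMAXPROCS(0)
	// reads (without changing) the limit on how many goroutines execute
	// simultaneously. It defaults to NumCPU, so on a 4 vCPU node both print 4,
	// even though thousands of goroutines can exist and get scheduled onto
	// those 4 slots.
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0))
}
```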