# spicedb
s
Ya, restarting spicedb really seems to tank performance
j
then you need to investigate what your tests are doing, exactly
it will take a short period of time to acquire new connections, but that's measured in seconds
and the cache will take some time to fill, but that's not going to apply after a few seconds
unless your quantization window is extremely long
even 500ms is very, very long
s
we're just sending requests as fast as we can
j
you're probably overloading your cluster or your DB
and "requests" can be quite different
the cache depends on what you're requesting
s
The DB is pretty fine. It's a sustained test that does the same work in a loop. The only thing that changed is that we restarted SpiceDB. The test never stopped.
j
if you're just firing off the same check over and over
s
its a fixed data set that we iterate over
j
then it's likely due to the caches being empty
when you start up, we have to retrieve everything
which is likely taking some time and then you're piling up requests
eventually, the cache settles a bit
s
How can we mitigate that?
j
which still doesn't explain why it gets slightly faster over time
don't throw a wall of traffic onto a cold cluster
s
[shares a CPU usage graph]
j
what is that CPU usage of?
s
spicedb
j
and what was the change?
s
restarted the deployment
I don't think that is practical in a Kubernetes environment. Things restart
j
yes, individual pods do
not the entire cluster unless there was a full outage
s
that's what happened
j
that's a single pod?
s
or a deployment
j
a deployment entirely failing means your entire cluster is gone
or do you mean a rolling update?
s
anything that changes the pod spec of a deployment will restart everything.
if we change env vars or move secrets
j
it should do a rolling deployment
s
it is
j
then you need a bigger cluster
to handle the traffic during the rollout
if that's the case, you're showing that the cache is absorbing the traffic solely because your load happens to be widely cacheable
real world traffic will likely have less caching
and that means you'll need more nodes to handle traffic
s
how many checks per second would you expect a single pod to be able to handle?
j
I can't answer that
it depends entirely on the size of the nodes
the complexity of the schema
and what the data shape is
a check could be a single relationship load
it could be 50 levels nested
s
The nodes have 4 CPUs and 1GB of memory.
```
definition access_token {}

definition account {
  relation participant: user#member | service_account#member
  permission member = participant
}

definition user {
  relation inheritor: user | access_token
  relation access_token: access_token
  permission member = inheritor
  permission token = access_token
}

definition service_account {
  relation inheritor: service_account | access_token
  permission member = inheritor
}

definition role {
  relation inheritor: user#member | user#token | service_account#member
  relation parent_role: role#member
  relation account_role: account
  permission member = inheritor + parent_role
  permission parent = parent_role
  permission account = account_role
}

definition resource {
  relation reader: user#token | role#member | service_account#member
  relation writer: user#token | role#member | service_account#member
  relation manager: user#token | role#member | service_account#member
  permission manage = manager
  permission write = manage + writer
  permission read = write + reader
}
```
Schema looks like that
j
and what are you check-ing?
and how deep are the roles?
that's a recursive definition - it could be N levels deep
s
We check to see if a given token has read, write or manage permission on a given resource
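For concreteness, a single check like that might look roughly like this with the authzed-go client (a sketch; the endpoint, preshared key, and IDs are placeholders, and plaintext in-cluster gRPC is assumed):
```go
package main

import (
	"context"
	"log"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	authzed "github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Placeholder endpoint and preshared key; plaintext is assumed for an in-cluster connection.
	client, err := authzed.NewClient(
		"spicedb.authz.svc.cluster.local:50051",
		grpcutil.WithInsecureBearerToken("some-preshared-key"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("failed to dial SpiceDB: %v", err)
	}

	// Does access_token "tok-123" have the read permission on resource "res-456"?
	// With the schema above, read expands through write/manage and follows the
	// user#token / role#member subjects to resolve the token.
	resp, err := client.CheckPermission(context.Background(), &v1.CheckPermissionRequest{
		Resource:   &v1.ObjectReference{ObjectType: "resource", ObjectId: "res-456"},
		Permission: "read",
		Subject: &v1.SubjectReference{
			Object: &v1.ObjectReference{ObjectType: "access_token", ObjectId: "tok-123"},
		},
	})
	if err != nil {
		log.Fatalf("check failed: %v", err)
	}
	log.Println("has read:",
		resp.Permissionship == v1.CheckPermissionResponse_PERMISSIONSHIP_HAS_PERMISSION)
}
```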
j
right, and looking at your schema
that could mean hitting each role
which, in turn, can have many parent roles
so that could spread out to many dispatches
so it really depends on how deep those are
s
we aren't doing any deep role testing in this.
j
so what is your test data like then?
if it is fairly flat, then you're likely simply overloading the nodes
s
The test data only goes one level deep in role inheritance. The tests check if an access key has a permission.
j
what, exactly, is the data you're writing and testing
s
I'm not sure what kind of an answer you are looking for
j
the relationships you're writing and the checks you're performing, concretely
but either way, I'm nearly certain you're simply overloading your nodes
if you're firing off 20-50 RPS and you have 4 CPUs available, that means you're likely saturating the number of goroutines available for us to use
and since the dispatch sends requests for the same checks to the same nodes
s
what is the number of goroutines available?
j
it's based roughly on the CPU count
s
so 4?
j
(edited for clarity) concurrently executing goroutines will be limited by the number of threads, which is based on the CPU count
if you don't have any vCPUs
s
that doesn't sound like a lot
j
it's not
and as I said, I suspect your load test, by hitting the same checks, is creating a hotspot
the reason it gets better over time is because you're just hitting the cache more and more
s
I guess I'm looking for a bit of operational guidance around tuning and sizing.
j
you'll likely want bigger nodes with more CPU cores, to start
and you'll likely want to change your load tester to not check the same things over and over; or at least add a bit more variance/randomness
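a rough sketch of what that could look like (the pair type, fixture contents, and checkOne helper are hypothetical stand-ins for the real load tester):
```go
package main

import (
	"math/rand"
	"time"
)

// pair is a hypothetical fixture entry: one (resource, token) combination to check.
type pair struct{ ResourceID, TokenID string }

// checkOne stands in for the CheckPermission call shown in the earlier sketch.
func checkOne(resourceID, tokenID string) { /* issue the check here */ }

func main() {
	// Hypothetical fixture data; in practice this would be the full test file,
	// ideally covering many more distinct keys than a small fixed loop.
	pairs := []pair{
		{"res-1", "tok-a"}, {"res-2", "tok-b"}, {"res-3", "tok-c"},
	}

	r := rand.New(rand.NewSource(time.Now().UnixNano()))
	for i := 0; i < 10000; i++ {
		// Picking pairs at random spreads checks across the key space instead of
		// replaying the same ordered loop, which is unrealistically cache-friendly.
		p := pairs[r.Intn(len(pairs))]
		checkOne(p.ResourceID, p.TokenID)
	}
}
```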
s
Is a shared cache on the table? A lot of the distributed DBs we've used in recent history have a disk cache or something. We'd run it on NVMe drives, etc.
I mean, it's a 1MB file of things to test
its a lot of things
j
we have some ideas around a shared L2 cache
we also have ideas around making it so the in-process cache can persist between restarts
but that's an optimization
s
What's the standard setup when you deploy a managed instance? What do you provision? Someone must have an idea of what is "pretty good" for most production workloads
j
we usually work with the customer on that, but I'll get the node sizes we use
s
10 nodes? 50? 4CPU? 25?
j
^
s
ok. thanks that would help
j
for now, I recommend trying nodes with more (v)CPUs
s
I still haven't gotten back around to turning on cluster dispatch. Is that something that helps in this kind of situation?
j
wait, this is with dispatch off?
s
ya
j
do you have a load balancer in front of your cluster too?
s
ya
j
but your load tester
is it making a single connection?
s
There is an HTTP service that does all of the SpiceDB work; the load tester hits that.
There is an LB in front of the service. It's not the bottleneck; it never goes above 25% CPU utilization
j
is the HTTP service itself making a single connection
s
There is a K8s service in front of SpiceDB
j
or opening new ones
or...
s
gRPC connections
a single one per worker
j
and how many workers do you have?
s
5
j
hrmph
that should distribute the load amongst the SpiceDB workers
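for reference, one way to make that spreading explicit, assuming the HTTP service is written in Go and a headless Service fronts the SpiceDB pods (the name, namespace, and key below are made up), is client-side round_robin over DNS:
```go
package main

import (
	"log"

	authzed "github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// With a headless Service, the dns:/// resolver returns every pod IP and
	// round_robin spreads each worker's RPCs across all of them, instead of a
	// single connection pinning to one pod behind a ClusterIP.
	client, err := authzed.NewClient(
		"dns:///spicedb.authz.svc.cluster.local:50051",
		grpcutil.WithInsecureBearerToken("some-preshared-key"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("failed to dial SpiceDB: %v", err)
	}
	_ = client
}
```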
s
it should...
j
and how many 4 CPU nodes are you running?
s
4
j
that should be fine for your load then
why the HTTP service?
s
It's the thing product will interface with instead of every service having to set up SpiceDB stuff. It's a simple HTTP API
j
dispatch could help, yes, in that it would allow for better cache usage
but even without dispatch enabled, 20 RPS should be fine against 4 x 4 CPUs
s
It's more than 20 SpiceDB RPS
j
why does the graph above show ~20 RPS?
going to ~80
s
That's total HTTP requests through the service
j
so each HTTP request is generating ~100 Checks?
or am I missing some additional context
s
we have a bulk endpoint that sends 10-20 things to check, so in some cases one request can do more than one check, yes
j
ah
then yeah, at 1000 or 2000 RPS, you need a bigger cluster
s
we don't know if that is practical yet. It could be a lot of single things.
ok
j
I was operating under a very mistaken impression
sorry about that
s
so about 300 RPS per node?
in napkin math terms
j
75-100 RPS per CPU is a very rough, napkin-math estimate
closer to 75
so if you're running ~2000 RPS, you'll want ~27 cores, or ~7 nodes at that size
but I'd scale based on what gives you the best performance
+1 node for rolling deploys
this is all napkin math, of course
s
right
Other than CPU utilization, is there something we can monitor around work backing up as you describe? Is there a pending work queue or anything like that?
j
dispatch metrics can give some deeper insight there, but they are only really used when dispatch is enabled
total request latency on the SpiceDB gRPC side is your best bet
gRPC exports a lot of good metrics
dispatch will, as I said, result in more requests going on, but better cache reuse
because nodes will talk to each other
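as a sketch of how you might watch that latency from the calling side, assuming the HTTP service is in Go (the endpoint, key, and metrics port below are placeholders):
```go
package main

import (
	"log"
	"net/http"

	authzed "github.com/authzed/authzed-go/v1"
	"github.com/authzed/grpcutil"
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Record a latency histogram per RPC, not just request counters.
	grpc_prometheus.EnableClientHandlingTimeHistogram()

	// Same placeholder endpoint/key as before; the interceptor observes every
	// Check the HTTP service sends to SpiceDB.
	client, err := authzed.NewClient(
		"spicedb.authz.svc.cluster.local:50051",
		grpcutil.WithInsecureBearerToken("some-preshared-key"),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
	)
	if err != nil {
		log.Fatalf("failed to dial SpiceDB: %v", err)
	}
	_ = client

	// Expose grpc_client_handling_seconds{...} and friends for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9091", nil))
}
```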
s
Wait, isn't the number of goroutines a process can run in the thousands? I'm a Go rookie, so tell me if I'm wrong, but 4 certainly doesn't seem right
Or is that a limit imposed by SpiceDB?
j
I was incorrect above
I meant the number of concurrent routines, which is based on the CPU count
you can have thousands started concurrently, but they are ultimately handled by threads underneath
you're still limited on the number of requests that can be handled concurrently; of course, if one request is waiting on data from the datastore, another can continue forward, etc
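a tiny, SpiceDB-agnostic illustration of that distinction:
```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// NumCPU is the number of logical CPUs visible to the process; GOMAXPROCS(0)
	// reads (without changing) the limit on how many goroutines execute
	// simultaneously. It defaults to NumCPU, so on a 4 vCPU node both print 4,
	// even though thousands of goroutines can exist and get scheduled onto
	// those 4 slots.
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0))
}
```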