default is 3 minutes: https://github.com/authzed/s...
# spicedb
w
Yep, I have 3min like the default
v
The garbage collector exposes Prometheus metrics. It would be good to get some insight on how many rows are being deleted, time spent deleting, etc...
w
I had it in my monitoring dashboard but it doesn't work anymore 🤔 maybe the prom metrics changed name
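A quick way to confirm which GC-related metric names the running binary actually exposes, so the dashboard queries can be updated; a sketch that assumes SpiceDB's default metrics address of :9090 (reachable e.g. via kubectl port-forward):
```
# Dump the metrics endpoint and filter for anything GC-related.
# :9090 is the default metrics address; adjust host/port to your deployment.
curl -s http://localhost:9090/metrics | grep -i 'gc'
```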
v
there was some refactoring in the GC logic around July, I think. @jzelinskie may know more about it.
w
Yeah, names changed, fixed it. I can't see any correlation between GC and these spikes
v
do you have a tracing UI in place that could let you see where time is being spent? perhaps checking the Go runtime to see if there is a GC going on? PGX connection pool contention?
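For the Go runtime GC angle, the standard client_golang collectors are on the same metrics endpoint; a minimal check, again assuming the default :9090 metrics address:
```
# Go runtime health from the Prometheus endpoint: GC pause quantiles and goroutine count.
curl -s http://localhost:9090/metrics | grep -E '^go_gc_duration_seconds|^go_goroutines'
```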
w
> PGX connection pool contention?
That's the one
(I do have some tracing but had to disable it temporarily a few days ago :()
Here's a graph of DB connections. The 2 spikes at 800 connections are time-correlated with the latency spikes.
And I have these config
```
# maximum amount of time a connection can idle in PG's connection pool
  --datastore-conn-max-idletime=2m \
  # maximum amount of time a connection can live in PG's connection pool
  --datastore-conn-max-lifetime=2m \
```
v
I wonder if some of that could be addressed with Joey's batching improvements. The gist is that instead of issuing many individual dispatch operations, it issues fewer that do ranged queries
w
(800 is my 4 pods * my datastore-conn-max-open, which is 200)
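For reference, a sketch of how those pool flags fit together; flag names are the ones quoted in this thread and may differ across SpiceDB versions:
```
# Pool sizing: 200 open connections max per pod; with 4 pods that is the
# 800-connection ceiling visible in RDS. Other required serve flags omitted.
spicedb serve \
  --datastore-conn-max-open=200 \
  --datastore-conn-max-idletime=2m \
  --datastore-conn-max-lifetime=2m
```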
v
so the 2m lifetime aligns with the spikes you see. SpiceDB blocks waiting for connections to be opened
w
yep
strange that so many connections are open, though
v
there is a default of 20 I think?
so 4 pods should be 80 connections tops
w
I set it explicitly to 200
v
oh
right, then yeah
y'all were running this with AWS RDS right?
any metrics you could check on the RDS side? maybe anything related to connection contention?
is this something that started happening in 1.13, was it always there, or have you ramped up load / traffic?
w
> y'all were running this with AWS RDS right?
Yes
> any metrics you could check on the RDS side? maybe anything related to connection contention?
Not sure what to look for! The graph I sent is from RDS, showing the # of open connections.
> is this something that started happening in 1.13, was it always there, or have you ramped up load / traffic?
It's always happened, I just never took the time to dig into it
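One concrete thing to look for on the Postgres side while a spike is happening is what those connections are actually doing; a sketch, with the connection string as a placeholder:
```
# Snapshot of server-side connection states and wait events during a spike.
# $DATABASE_URL is a placeholder for your RDS connection string.
psql "$DATABASE_URL" -c "
  SELECT state, wait_event_type, count(*)
  FROM pg_stat_activity
  GROUP BY 1, 2
  ORDER BY 3 DESC;"
```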
j
why would you have anything less than the max number of connections open?
w
If there's <200 requests in-flight, there's no need for 200 connections?
We have ~300RPS overall, so we definitely have <200 concurrent requests per pod
v
yeah, unfortunately that's not how it works. Keep in mind that SpiceDB is doing a lot of concurrent work
w
One connection per dispatched check?
v
at the very least yeah
then there is also hedging, which is enabled by default
j
with dispatch off though they should be getting put back into the pool before the next subrequest
v
oh, they have dispatch off, ok
w
> then there is also hedging, which is enabled by default
I disabled it 🙂
Dispatch disabled too indeed
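To make the "in-flight requests ≠ open connections" point concrete, a back-of-envelope Little's-law estimate; the queries-per-check and per-query latency figures below are purely illustrative assumptions, not measured values:
```
# Busy connections per pod ≈ request rate × queries per request × time each query holds a connection.
# 300 RPS / 4 pods = 75 RPS per pod; assume ~10 datastore queries per Check (illustrative).
echo "75 * 10 * 0.020" | bc   # at ~20 ms per query: ~15 connections busy on average
echo "75 * 10 * 0.500" | bc   # if queries slow to ~500 ms: ~375, enough to max out a 200-connection pool
```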
v
We recently added pgx metrics, that's something you'd have to enable explicitly
it gives you useful information such as connections held and time spent waiting for a connection to become available
w
Nice!
v
actually, I think it was always there in PGX, it wasn't in CRDB
w
Coming back to the spike of open connections: I'm not seeing that it correlates with an increase in traffic, so I'm not sure why this spike happens
v
not sure if you are referring to the big spikes (which do not seem to happen every 2 minutes) or the smaller saw-like spikes. If your conn lifetime is 2 minutes, I think it's normal that there will be that saw pattern: 300 RPS of requests try to get a connection that is not available in the pool, so they wait until one is opened.
w
during the big spikes of open connections (lasting ~10 min), we see latency spikes every 2 minutes
So the latency makes sense: it's a consequence of reaching the max # of open connections. The question is: why do we reach 800 connections? We're usually closer to 300
v
yeah it's a good question. If those last 10 minutes, any chance you can get a goroutine dump?
I cannot imagine the garbage collector taking that many connections unless it does work concurrently
the fact that you get latency spikes during that pool-maxed-out period every 2 minutes makes sense, since that's your connection lifetime
my suggestion is to get pprof dumps during that period (interested in the goroutine profile). Another option is checking the opentelemetry traces in a UI like jaeger or Grafana, or using a continuous profiler like Parca or Pyroscope
Another possibility is that, if something is causing those queries to take very long, the connections would be held out of the pool for as long as it takes to get a response. Are you guys setting a deadline in the client?
not sure if AWS has any kind of query profiler as part of RDS, would be good to check if during that large spike queries are taking longer than usual
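For the pprof suggestion, a minimal sketch of grabbing the goroutine profile during one of those 10-minute windows, assuming SpiceDB's pprof handlers are served alongside the metrics endpoint on :9090 (adjust if your deployment differs):
```
# Save a full goroutine dump from one of the affected pods during the spike.
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' -o goroutines.txt

# Or explore it interactively with the Go toolchain:
go tool pprof http://localhost:9090/debug/pprof/goroutine
```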
w
I can try to grab a dump if I happen to be looking at the graphs when it happens again
> not sure if AWS has any kind of query profiler as part of RDS, would be good to check if during that large spike queries are taking longer than usual
It has "Performance Insights". I don't see anything unusual in it (first graph), but I can see that during these times we have an increase in commits/s (see the red-marked areas in the second graph)
> Are you guys setting a deadline in the client?
Not explicitly, no, just using whatever defaults the Node Authzed client has
v
not sure if it has any default. @Joey @Sam ?
j
the server has a deadline for API calls
I'd be very curious to see the traffic patterns associated with the spikes