w

williamdclt

10/14/2022, 1:11 PM
Yep, I have it at 3min, same as the default
v

vroldanbet

10/14/2022, 1:12 PM
The garbage collector exposes Prometheus metrics. It would be good to get some insight on how many rows are being deleted, time spent deleting, etc...
w

williamdclt

10/14/2022, 1:15 PM
I had it in my monitoring dashboard but it doesn't work anymore 🤔 maybe the prom metrics changed name
v

vroldanbet

10/14/2022, 1:16 PM
there was some refactoring in the GC logic around July, I think. @jzelinskie may know more about it.
w

williamdclt

10/14/2022, 1:41 PM
Yeah, names changed, fixed it. I can't see any correlation between GC and these spikes
v

vroldanbet

10/14/2022, 1:43 PM
do you have a tracing UI in place that could allow you to see where time is being spent? perhaps checking the Go runtime to see if there is a GC going on? PGX connection pool contention?
w

williamdclt

10/14/2022, 1:43 PM
> PGX connection pool contention?
That's the one
(I do have some tracing but had to disable it temporarily a few days ago :()
Here's a graph of DB connections. The 2 spikes at 800 connections are time-correlated with the latency spikes.
And I have this config:
# maximum amount of time a connection can idle in PG's connection pool
  --datastore-conn-max-idletime=2m \
  # maximum amount of time a connection can live in PG's connection pool
  --datastore-conn-max-lifetime=2m \
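For reference, a minimal sketch of what those two settings mean at the pool level, assuming SpiceDB wires them more or less directly into its pgx connection pool (placeholder DSN; not SpiceDB's actual datastore code):

```go
// Sketch (Go, pgx v4): pool-level equivalents of the flags above.
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://user:pass@localhost:5432/spicedb")
	if err != nil {
		log.Fatal(err)
	}
	cfg.MaxConnIdleTime = 2 * time.Minute // --datastore-conn-max-idletime
	cfg.MaxConnLifetime = 2 * time.Minute // --datastore-conn-max-lifetime
	cfg.MaxConns = 200                    // --datastore-conn-max-open (discussed below)

	pool, err := pgxpool.ConnectConfig(context.Background(), cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// Connections older than MaxConnLifetime are closed and replaced, which is
	// why a 2m lifetime shows up as periodic churn in the open-connection graph.
}
```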
v

vroldanbet

10/14/2022, 1:46 PM
I wonder if some of that could be addressed with Joey's batching improvements. The gist is that instead of issuing many individual dispatch operations, it issues fewer that do ranged queries
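Loosely, the shape of that idea at the query level, as a hedged sketch with a hypothetical table and columns (not SpiceDB's schema or the actual change): one ranged query replaces N point lookups, cutting round trips and time spent holding pooled connections.

```go
// Hedged sketch of "fewer, ranged queries" vs. many point lookups.
// Table/column names are hypothetical; this is not SpiceDB's schema.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	ctx := context.Background()
	pool, err := pgxpool.Connect(ctx, "postgres://user:pass@localhost:5432/spicedb")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	// Instead of one query (and one pooled connection) per resource ID...
	resourceIDs := []string{"doc1", "doc2", "doc3"}

	// ...a single ranged query covers the whole batch.
	rows, err := pool.Query(ctx,
		`SELECT resource_id, subject_id
		   FROM tuples
		  WHERE relation = $1 AND resource_id = ANY($2)`,
		"viewer", resourceIDs)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var resourceID, subjectID string
		if err := rows.Scan(&resourceID, &subjectID); err != nil {
			log.Fatal(err)
		}
		fmt.Println(resourceID, subjectID)
	}
}
```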
w

williamdclt

10/14/2022, 1:46 PM
(800 is my 4 pods * my datastore-conn-max-open, which is 200)
v

vroldanbet

10/14/2022, 1:46 PM
so the 2m lifetime aligns with the spikes you see. SpiceDB blocks waiting for connections to be opened
w

williamdclt

10/14/2022, 1:46 PM
yep
strange that so many connections are open, though
v

vroldanbet

10/14/2022, 1:47 PM
there is a default of 20 I think?
so 4 pods should be 80 connections tops
w

williamdclt

10/14/2022, 1:47 PM
I set it explicitly to 200
v

vroldanbet

10/14/2022, 1:48 PM
oh
right, then yeah
y'all were running this with AWS RDS right?
any metrics you could check on the RDS side? maybe anything related to conn contention?
is this something that started happening in 1.13, was it always there, or have you ramped up load / traffic?
w

williamdclt

10/14/2022, 1:50 PM
> y'all were running this with AWS RDS right?
Yes
> any metrics you could check on the RDS side? maybe anything related to conn contention?
Not sure what to look for! This graph I sent is from RDS, showing the # of open connections.
> is this something that started happening in 1.13, was it always there, or have you ramped up load / traffic?
It's always happened, I just never took the time to dig into it
j

Jake

10/14/2022, 1:51 PM
why would you have anything less than the max number of connections open?
w

williamdclt

10/14/2022, 1:51 PM
If there's <200 requests in-flight, there's no need for 200 connections?
We have ~300 RPS overall, so we definitely have <200 concurrent requests per pod
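For reference, the back-of-envelope behind that claim, using Little's law; the 50ms average latency is an assumed figure purely for illustration:

```go
// Little's law estimate: concurrent requests ≈ arrival rate × average latency.
// The 50ms latency here is an assumption, not a measurement from this system.
package main

import "fmt"

func main() {
	const (
		totalRPS   = 300.0 // ~300 RPS overall, as mentioned above
		pods       = 4.0   // SpiceDB pods
		avgLatency = 0.050 // seconds (assumed)
	)
	perPod := (totalRPS / pods) * avgLatency
	fmt.Printf("≈ %.1f concurrent requests per pod\n", perPod) // ≈ 3.8
}
```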
v

vroldanbet

10/14/2022, 1:52 PM
yeah, unfortunately that's not how it works. Bear in mind that SpiceDB is doing a lot of concurrent work
w

williamdclt

10/14/2022, 1:52 PM
One connection per dispatched check?
v

vroldanbet

10/14/2022, 1:52 PM
at the very least yeah
then there is also hedging, which is enabled by default
j

Jake

10/14/2022, 1:53 PM
with dispatch off though they should be getting put back into the pool before the next subrequest
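In pool terms that looks roughly like the following sketch of the pgx acquire/query/release cycle (not SpiceDB's actual datastore code): each query hands its connection back before the next subrequest runs.

```go
// Sketch of the acquire/query/release cycle with pgxpool (pgx v4).
// pool.QueryRow acquires a connection, runs the query, and releases the
// connection once the row is scanned, so sequential subrequests reuse
// connections rather than each holding one.
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	ctx := context.Background()
	pool, err := pgxpool.Connect(ctx, "postgres://user:pass@localhost:5432/spicedb")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	for i := 0; i < 3; i++ { // three sequential "subrequests"
		var one int
		if err := pool.QueryRow(ctx, "SELECT 1").Scan(&one); err != nil {
			log.Fatal(err)
		}
		// The connection used above is already back in the pool at this point.
	}
}
```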
v

vroldanbet

10/14/2022, 1:53 PM
oh, they have dispatch off, ok
w

williamdclt

10/14/2022, 1:53 PM
> then there is also hedging, which is enabled by default
I disabled it 🙂
Dispatch disabled too indeed
v

vroldanbet

10/14/2022, 1:54 PM
We recently added pgx metrics, that's something you'd have to enable explicitly
it gives you useful information such as connections held, and time spent waiting for a connection to become available
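Those numbers are also available directly from the pool at runtime; a minimal sketch of polling pgxpool's stats (this is the underlying data, not necessarily the exact metrics SpiceDB exports):

```go
// Sketch: periodically dump pgxpool stats (pgx v4). EmptyAcquireCount and
// AcquireDuration indicate how often and how long callers waited for a connection.
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v4/pgxpool"
)

func main() {
	pool, err := pgxpool.Connect(context.Background(), "postgres://user:pass@localhost:5432/spicedb")
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Close()

	for range time.Tick(10 * time.Second) {
		s := pool.Stat()
		log.Printf("total=%d idle=%d acquired=%d empty_acquires=%d acquire_time_total=%s",
			s.TotalConns(), s.IdleConns(), s.AcquiredConns(),
			s.EmptyAcquireCount(), s.AcquireDuration())
	}
}
```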
w

williamdclt

10/14/2022, 1:54 PM
Nice!
v

vroldanbet

10/14/2022, 1:55 PM
actually, I think it was always there in PGX, it wasn't in CRDB
w

williamdclt

10/14/2022, 1:55 PM
Coming back to the spike of open connections: I'm not seeing it correlate with an increase in traffic, so I'm not sure why this spike happens
v

vroldanbet

10/14/2022, 1:59 PM
not sure if you are referring to the big spikes (which do not seem to happen every 2 minutes) or the smaller saw-like spikes. If your conn lifetime is 2 minutes, I think it's normal for there to be that sawtooth phenomenon: 300 RPS attempt to get a new connection, which is not available in the pool, so they wait until one is opened.
w

williamdclt

10/14/2022, 2:01 PM
during the big spikes of open connections (lasting ~10min), we see latency spikes every 2 minutes
So the latency makes sense: it's a consequence of reaching the max # of open connections. The question is: why do we reach 800 connections? We're usually closer to 300
v

vroldanbet

10/14/2022, 2:03 PM
yeah it's a good question. If those last 10 minutes, any chance you can get a goroutine dump?
I cannot imagine the garbage collector taking that many connections unless it does work concurrently
the fact that you get latency spikes during that pool-maxed-out period every 2 minutes makes sense, since that's your connection lifetime
my suggestion is to get pprof dumps during that period (I'm interested in the goroutine profile). Another option is checking the OpenTelemetry traces in a UI like Jaeger or Grafana, or using a continuous profiler like Parca or Pyroscope
Another possibility is that, if something is causing those queries to take very long, the connections would be held out of the pool for as long as it takes to get a response. Are you guys setting a deadline in the client?
not sure if AWS has any kind of query profiler as part of RDS, would be good to check whether queries are taking longer than usual during that large spike
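For the goroutine dump suggested above, a minimal sketch of pulling one from a pprof HTTP endpoint during a spike; the localhost:6060 address is a placeholder for wherever the pods expose pprof (e.g. via a port-forward):

```go
// Sketch: fetch a plain-text goroutine dump from a pprof HTTP endpoint
// and write it to a file for later inspection.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:6060/debug/pprof/goroutine?debug=2")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	f, err := os.Create("goroutines.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote goroutine dump to goroutines.txt")
}
```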
w

williamdclt

10/14/2022, 2:33 PM
I can try to grab a dump if I happen to be looking at the graphs when it happens again
> not sure if AWS has any kind of query profiler as part of RDS, would be good to check whether queries are taking longer than usual during that large spike
It has "performance insights". I don't see anything unusual on it (first graph), but I can see that during these times we have an increase in commits/s (see the red-marked areas in the second graph)
> Are you guys setting a deadline in the client?
Not explicitly, no, just using whatever defaults the Node Authzed client has
v

vroldanbet

10/14/2022, 2:40 PM
not sure if it has any default. @Joey @Sam ?
j

Joey

10/14/2022, 3:24 PM
the server has a deadline for API calls
I'd be very curious to see the traffic patterns associated with the spikes
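For completeness, a hedged sketch of setting an explicit per-call deadline from a client, shown in Go against the generated gRPC API (endpoint, auth, and the 2s value are placeholders; the Node client supports per-call deadlines via the same gRPC mechanism):

```go
// Sketch: an explicit per-call deadline on a permission check, so a slow
// query can't hold server-side resources indefinitely.
package main

import (
	"context"
	"log"
	"time"

	v1 "github.com/authzed/authzed-go/proto/authzed/api/v1"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Auth (bearer token) omitted for brevity; insecure transport for a local test only.
	conn, err := grpc.Dial("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := v1.NewPermissionsServiceClient(conn)

	// The context deadline becomes the gRPC deadline for this call.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	resp, err := client.CheckPermission(ctx, &v1.CheckPermissionRequest{
		Resource:   &v1.ObjectReference{ObjectType: "document", ObjectId: "doc1"},
		Permission: "view",
		Subject:    &v1.SubjectReference{Object: &v1.ObjectReference{ObjectType: "user", ObjectId: "alice"}},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println(resp.Permissionship)
}
```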