default is 3 minutes: https://github.com/authzed/s...
# spicedb
w
Yep, I have 3min like the default
v
The garbage collector exposes Prometheus metrics. It would be good to get some insight on how many rows are being deleted, time spent deleting, etc...
w
I had it in my monitoring dashboard but it doesn't work anymore 🤔 maybe the prom metrics changed name
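A quick way to confirm which GC-related metric names the running binary actually exposes, so the dashboard queries can be updated; a sketch that assumes SpiceDB's default metrics address of :9090 (reachable e.g. via kubectl port-forward):
```
# Dump the metrics endpoint and filter for anything GC-related.
# :9090 is the default metrics address; adjust host/port to your deployment.
curl -s http://localhost:9090/metrics | grep -i 'gc'
```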
v
there was some refactoring in the GC logic around July, I think. @jzelinskie may know more about it.
w
Yeah, names changed, fixed it. I can't see any correlation between GC and these spikes
v
do you have a tracing UI in place that could let you see where time is being spent? perhaps checking the Go runtime to see if there is a GC going on? PGX connection pool contention?
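For the Go runtime GC angle, the standard client_golang collectors are on the same metrics endpoint; a minimal check, again assuming the default :9090 metrics address:
```
# Go runtime health from the Prometheus endpoint: GC pause quantiles and goroutine count.
curl -s http://localhost:9090/metrics | grep -E '^go_gc_duration_seconds|^go_goroutines'
```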
w
> PGX connection pool contention?
That's the one
(I do have some tracing but had to disable it temporarily a few days ago :()
Here's a graph of DB connections. The 2 spikes at 800 connections are time-correlated with the latency spikes.
And I have these config
```
# maximum amount of time a connection can idle in PG's connection pool
  --datastore-conn-max-idletime=2m \
  # maximum amount of time a connection can live in PG's connection pool
  --datastore-conn-max-lifetime=2m \
```
v
I wonder if some of that could be addressed with Joey's batching improvements. The gist is that instead of issuing many individual dispatch operations, it issues fewer that do ranged queries
w
(800 is my 4 pods * my datastore-conn-max-open, which is 200)
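For reference, a sketch of how those pool flags fit together; flag names are the ones quoted in this thread and may differ across SpiceDB versions:
```
# Pool sizing: 200 open connections max per pod; with 4 pods that is the
# 800-connection ceiling visible in RDS. Other required serve flags omitted.
spicedb serve \
  --datastore-conn-max-open=200 \
  --datastore-conn-max-idletime=2m \
  --datastore-conn-max-lifetime=2m
```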
v
so the 2m lifetime aligns with the spikes you see. SpiceDB blocks waiting for connections to be opened
w
yep
strange that so many connections are open, though
v
there is a default of 20 I think?
so 4 pods should be 80 connections tops
w
I set it explicitly to 200
v
oh
right, then yeah
y'all were running this with AWS RDS right?
any metrics you could check on the RDS side? maybe anything related to connection contention?
is this something that started happening in 1.13, was it always there, or have you ramped up load / traffic?
w
> y'all were running this with AWS RDS right?
Yes
> any metrics you could check on the RDS side? maybe anything related to connection contention?
Not sure what to look for! The graph I sent is from RDS, showing the # of open connections.
> is this something that started happening in 1.13, was it always there, or have you ramped up load / traffic?
It's always happened, I just never took the time to dig into it
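One concrete thing to look for on the Postgres side while a spike is happening is what those connections are actually doing; a sketch, with the connection string as a placeholder:
```
# Snapshot of server-side connection states and wait events during a spike.
# $DATABASE_URL is a placeholder for your RDS connection string.
psql "$DATABASE_URL" -c "
  SELECT state, wait_event_type, count(*)
  FROM pg_stat_activity
  GROUP BY 1, 2
  ORDER BY 3 DESC;"
```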
j
why would you have anything less than the max number of connections open?
w
If there's <200 requests in-flight, there's no need for 200 connections?
We have ~300RPS overall, so we definitely have <200 concurrent requests per pod
v
yeah, unfortunately that's not how it works. Keep in mind that SpiceDB is doing a lot of concurrent work
w
One connection per dispatched check?
v
at the very least yeah
then there is also hedging, which is enabled by default
j
with dispatch off though they should be getting put back into the pool before the next subrequest
v
oh, they have dispatch off, ok
w
> then there is also hedging, which is enabled by default
I disabled it 🙂
Dispatch disabled too indeed
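To make the "in-flight requests ≠ open connections" point concrete, a back-of-envelope Little's-law estimate; the queries-per-check and per-query latency figures below are purely illustrative assumptions, not measured values:
```
# Busy connections per pod ≈ request rate × queries per request × time each query holds a connection.
# 300 RPS / 4 pods = 75 RPS per pod; assume ~10 datastore queries per Check (illustrative).
echo "75 * 10 * 0.020" | bc   # at ~20 ms per query: ~15 connections busy on average
echo "75 * 10 * 0.500" | bc   # if queries slow to ~500 ms: ~375, enough to max out a 200-connection pool
```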
v
We recently added pgx metrics, that's something you'd have to enable explicitly
it gives you useful information such as connections held and time spent waiting for a connection to become available
w
Nice!
v
actually, I think it was always there in PGX, it wasn't in CRDB
w
Coming back to the spike of open connections: I'm not seeing that it correlates with an increase in traffic, so I'm not sure why this spike happens
v
not sure if you are referring to the big spikes (which do not seem to happen every 2 minutes) or the smaller saw-like spikes. If your conn lifetime is 2 minutes, I think it's normal that there will be that saw pattern: 300 RPS of requests try to get a connection that is not available in the pool, so they wait until one is opened.
w
during the big spikes of open connections (lasting ~10 min), we see latency spikes every 2 minutes
So the latency makes sense: it's a consequence of reaching the max # of open connections. The question is: why do we reach 800 connections? We're usually closer to 300
v
yeah it's a good question. If those last 10 minutes, any chance you can get a goroutine dump?
I cannot imagine the garbage collector taking that many connections unless it does work concurrently
the fact that you get latency spikes during that pool-maxed-out period every 2 minutes makes sense, since that's your connection lifetime
my suggestion is to get pprof dumps during that period (interested in the goroutine profile). Another option is checking the opentelemetry traces in a UI like jaeger or Grafana, or using a continuous profiler like Parca or Pyroscope
Another possibility is that, if something is causing those queries to take very long, the connections would be held out of the pool for as long as it takes to get a response. Are you guys setting a deadline in the client?
not sure if AWS has any kind of query profiler as part of RDS, would be good to check if during that large spike queries are taking longer than usual
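For the pprof suggestion, a minimal sketch of grabbing the goroutine profile during one of those 10-minute windows, assuming SpiceDB's pprof handlers are served alongside the metrics endpoint on :9090 (adjust if your deployment differs):
```
# Save a full goroutine dump from one of the affected pods during the spike.
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' -o goroutines.txt

# Or explore it interactively with the Go toolchain:
go tool pprof http://localhost:9090/debug/pprof/goroutine
```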
w
I can try to grab a dump if I happen to be looking at the graphs when it happens again
> not sure if AWS has any kind of query profiler as part of RDS, would be good to check if during that large spike queries are taking longer than usual
It has "Performance Insights". I don't see anything unusual in it (first graph), but I can see that during these times we have an increase in commits/s (see the red-marked areas in the second graph)
> Are you guys setting a deadline in the client?
Not explicitly, no, just using whatever defaults the Node Authzed client has
v
not sure if it has any default. @Joey @Sam ?
j
the server has a deadline for API calls
I'd be very curious to see the traffic patterns associated with the spikes