upgrading from deprecated connpool flags
# spicedb
w
I've tried to action the deprecation notice on --datastore-conn-min-open (and similar options) by replacing it with the new read/write configs. I kept similar values, but the pool behaved completely differently 🤔 it created a ton more connections, which was unexpected. Config before:
--datastore-conn-max-idletime=2m \
  --datastore-conn-max-lifetime=2m \
  --datastore-conn-min-open=10 \
  --datastore-conn-max-open=200 \
Config after:
--datastore-connpool-read-max-idletime=2m
--datastore-connpool-write-max-idletime=2m
--datastore-connpool-read-max-lifetime=2m
--datastore-connpool-write-max-lifetime=2m
--datastore-connpool-read-min-open=10
--datastore-connpool-read-max-open=200
--datastore-connpool-write-min-open=10
--datastore-connpool-write-max-open=10
The total connection count jumped from a very stable 90 to >400 when I deployed the config change, for no reason that I can think of. Am I missing something? Is that a known behaviour? I'm on SpiceDB 1.22.2
Theory based on no particular information: it looks like new connections get opened in the pool even if some are available, until the max-open limit is reached?
v
I recall some issues with how those config params were bound via environment variables, but they should work as flags. I'd suggest using what's shown in serve help just in case you are hitting some issue there. I know there should be a fallback if you use connpool instead of conn-pool, but just in case:
      --datastore-conn-pool-read-healthcheck-interval duration        amount of time between connection health checks in a remote datastore's connection pool (default 30s)
      --datastore-conn-pool-read-max-idletime duration                maximum amount of time a connection can idle in a remote datastore's connection pool (default 30m0s)
      --datastore-conn-pool-read-max-lifetime duration                maximum amount of time a connection can live in a remote datastore's connection pool (default 30m0s)
      --datastore-conn-pool-read-max-lifetime-jitter duration         waits rand(0, jitter) after a connection is open for max lifetime to actually close the connection (default: 20% of max lifetime)
      --datastore-conn-pool-read-max-open int                         number of concurrent connections open in a remote datastore's connection pool (default 20)
      --datastore-conn-pool-read-min-open int                         number of minimum concurrent connections open in a remote datastore's connection pool (default 20)
      --datastore-conn-pool-write-healthcheck-interval duration       amount of time between connection health checks in a remote datastore's connection pool (default 30s)

...
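For reference, your earlier config rewritten with the dashed conn-pool spelling would look something like this (same values as your example; the write-side flag names aren't shown in the truncated help output above, so I'm assuming they mirror the read-side ones):
--datastore-conn-pool-read-max-idletime=2m
--datastore-conn-pool-write-max-idletime=2m
--datastore-conn-pool-read-max-lifetime=2m
--datastore-conn-pool-write-max-lifetime=2m
--datastore-conn-pool-read-min-open=10
--datastore-conn-pool-read-max-open=200
--datastore-conn-pool-write-min-open=10
--datastore-conn-pool-write-max-open=10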
I wouldn't trust what you see in your metrics unless you are scraping at a 1s interval. I've been bitten a few times by things that don't surface with coarse scraping intervals. This is particularly true around the quantization window elapsing, when a sudden spike of non-cached requests takes place
w
Interesting, it's connpool in 1.21.x but conn-pool in higher versions (not sure about 1.22.x, I just jumped to 1.25.x). Worth a try
I definitely have way more connections open to the DB, there's no doubt about that! I started to get errors due to maxing out the RDS DB connections immediately after the deployment; they went away after a revert, and my metrics do show a jump in avg/max connection count
v
sure - coming from the assumption you were looking at SpiceDB Prometheus metrics. Just saying that some things do not always surface: your pgxpool metrics may look healthy, but some behaviours only surface when you start looking with a finer scraping interval
we've seen noticeable improvements with both dispatch and datastore deduplication, which was introduced in 1.27.0 (but really 1.28.0-rc1, because of one last-minute regression). It may be just what you need to get some extra headroom in your RDS instance, but I know y'all have had issues upgrading for a while.
w
I'm not really looking for improvements, just for things to stay the same when moving from the old datastore-conn options to the new datastore-conn-pool-read/write ones 😅 I'll test out this dash difference
v
fair enough!
w
Actually, I'm getting a lot of connections even with 0 traffic. Is it possible that SpiceDB just eagerly opens as many connections as configured with the max option?
j
SpiceDB does eagerly open connections because establishing a connection while handling an in-flight request is slow enough to blow the typical latency SLA. If you're configuring 200 for read and 200 for write, that's a total of 400 connections that'll be opened.
Ah, it looks like you have 200 for read but only 10 for write
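A quick back-of-the-envelope, assuming connections are opened eagerly up to max-open (and, purely for illustration, two pods; the actual pod count isn't stated in this thread):
# per pod: 200 (read max-open) + 10 (write max-open) = 210 connections
# 2 pods:  2 * 210 = 420, in the same ballpark as the observed >400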
w
Interesting! Why is it different when using the old datastore-conn-max-open option? And what is the min option for then, if the pool is kept maxed out at all times?
j
IIRC we attempted to maintain backwards compatible behavior as best we could with the old flag and then hid it from new users. I'm digging into that last question now. I think not all datastores allow us to keep the pool open this way, so some benefit from having the min flag.
v
that is certainly not the behaviour one would expect from min-conns, and as far as I know pgx still supports it
j
Yeah it's definitely not as intuitive as I wish it could be. These changes came out of real systems achieving latency at scale. Keeping the pool maxed and having jitter in place for replacing connections was critical at the p95/p99
Out of curiosity, have you tried the defaults? SpiceDB got a lot better at supporting long-lived database connections, so 2m is very short for the lifetime parameter. The fewer times you allocate connections, the less likely they are to block requests.
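As a sketch of what that could look like, assuming you keep your open counts and just drop the explicit idletime/lifetime flags so the 30m defaults (and the default max-lifetime jitter) from serve help apply:
--datastore-conn-pool-read-min-open=10
--datastore-conn-pool-read-max-open=200
--datastore-conn-pool-write-min-open=10
--datastore-conn-pool-write-max-open=10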
w
I haven't, no
It might not be something you want to do anything about, but the main problem for me was that I went over the max connections allowed by my RDS instance, which meant that I couldn't even roll back, as new pods couldn't acquire a connection and would crashloop
In the end it's my fault for configuring pool maxes that sum to more than the max allowed connections in RDS, really. Although having elasticity would allow one pod's connections to spike while others stay low, it's a trade-off.
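A rough sizing check for that trade-off, since the pools are eager (the pod count and headroom here are illustrative, not from this thread):
# every pod holds its configured maximum, so:
#   pods * (read max-open + write max-open) <= RDS max_connections
#   e.g. 3 pods * (200 + 10) = 630 connections needed, plus headroom
#        for migrations, other clients, and ad-hoc psql sessions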