upgrading from deprecated connpool flags
# spicedb
w
I've tried to action the deprecation notice on --datastore-conn-min-open (and similar options) by replacing it with the new read/write configs. I kept similar values, but the pool behaved completely differently 🤔 it created a ton more connections, which was unexpected. Config before:
--datastore-conn-max-idletime=2m \
  --datastore-conn-max-lifetime=2m \
  --datastore-conn-min-open=10 \
  --datastore-conn-max-open=200 \
Config after:
--datastore-connpool-read-max-idletime=2m
--datastore-connpool-write-max-idletime=2m
--datastore-connpool-read-max-lifetime=2m
--datastore-connpool-write-max-lifetime=2m
--datastore-connpool-read-min-open=10
--datastore-connpool-read-max-open=200
--datastore-connpool-write-min-open=10
--datastore-connpool-write-max-open=10
The total connection count jumped from a very stable 90 to >400 when I deployed the config change, for no reason that I can think of. Am I missing something? Is that a known behaviour? I'm on SpiceDB 1.22.2
Theory based on no particular information: it looks like new connections get opened in the pool even if some are available, until the max-open limit is reached?
v
I recall some issues with how those config params were bound via environment variables, but they should work as flags. I'd suggest using what's shown in serve help just in case you are hitting some issue there. I know there should be a fallback if you use connpool instead of conn-pool, but just in case:
      --datastore-conn-pool-read-healthcheck-interval duration        amount of time between connection health checks in a remote datastore's connection pool (default 30s)
      --datastore-conn-pool-read-max-idletime duration                maximum amount of time a connection can idle in a remote datastore's connection pool (default 30m0s)
      --datastore-conn-pool-read-max-lifetime duration                maximum amount of time a connection can live in a remote datastore's connection pool (default 30m0s)
      --datastore-conn-pool-read-max-lifetime-jitter duration         waits rand(0, jitter) after a connection is open for max lifetime to actually close the connection (default: 20% of max lifetime)
      --datastore-conn-pool-read-max-open int                         number of concurrent connections open in a remote datastore's connection pool (default 20)
      --datastore-conn-pool-read-min-open int                         number of minimum concurrent connections open in a remote datastore's connection pool (default 20)
      --datastore-conn-pool-write-healthcheck-interval duration       amount of time between connection health checks in a remote datastore's connection pool (default 30s)

...
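For reference, your earlier config rewritten with the dashed conn-pool spelling would look something like this (same values as your example; the write-side flag names aren't shown in the truncated help output above, so I'm assuming they mirror the read-side ones):
--datastore-conn-pool-read-max-idletime=2m
--datastore-conn-pool-write-max-idletime=2m
--datastore-conn-pool-read-max-lifetime=2m
--datastore-conn-pool-write-max-lifetime=2m
--datastore-conn-pool-read-min-open=10
--datastore-conn-pool-read-max-open=200
--datastore-conn-pool-write-min-open=10
--datastore-conn-pool-write-max-open=10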
I wouldn't trust what you see in your metrics unless you are scraping at a 1s interval. I've been bitten a few times by things that don't surface with coarse scraping intervals. This is particularly true around the quantization window elapsing, when a sudden spike of non-cached requests takes place
w
Interesting, it's connpool in 1.21.x but conn-pool in higher versions (not sure about 1.22.x, I just jumped to 1.25.x). Worth a try
I definitely have way more connections open to the DB, there's no doubt about that! I started to get errors due to maxing out the RDS DB connections immediately after the deployment; they went away after a revert, and my metrics do show a jump in avg/max connection count
v
sure - coming from the assumption you were looking at SpiceDB Prometheus metrics. Just saying that some things do not always surface: your pgxpool metrics may look healthy, but some behaviours only surface when you start looking with a finer scraping interval
we've seen noticeable improvements with both dispatch and datastore deduplication, which was introduced in 1.27.0 (but really 1.28.0-rc1, because of one last-minute regression). It may be just what you need to get some extra headroom in your RDS instance, but I know y'all have had issues upgrading for a while.
w
I'm not really looking for improvements, just for things to stay the same when moving from the old datastore-conn options to the new datastore-conn-pool-read/write ones 😅 I'll test out this dash difference
v
fair enough!
w
Actually, I'm getting a lot of connections even with 0 traffic. Is it possible that SpiceDB just eagerly opens as many connections as configured with the max option?
j
SpiceDB does eagerly open connections because establishing a connection while handling an in-flight request is slow enough to blow the typical latency SLA. If you're configuring 200 for read and 200 for write, that's a total of 400 connections that'll be opened.
Ah, it looks like you have 200 for read but only 10 for write
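A quick back-of-the-envelope, assuming connections are opened eagerly up to max-open (and, purely for illustration, two pods; the actual pod count isn't stated in this thread):
# per pod: 200 (read max-open) + 10 (write max-open) = 210 connections
# 2 pods:  2 * 210 = 420, in the same ballpark as the observed >400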
w
Interesting! Why is it different when using the old datastore-conn-max-open option? And what is the min option for then, if the pool is kept maxed out at all times?
j
IIRC we attempted to maintain backwards compatible behavior as best we could with the old flag and then hid it from new users. I'm digging into that last question now. I think not all datastores allow us to keep the pool open this way, so some benefit from having the min flag.
v
that is certainly not the behaviour one would expect from min-conns, and as far as I know pgx still supports it
j
Yeah it's definitely not as intuitive as I wish it could be. These changes came out of real systems achieving latency at scale. Keeping the pool maxed and having jitter in place for replacing connections was critical at the p95/p99
Out of curiosity, have you tried the defaults? SpiceDB got a lot better at supporting long-lived database connections, so 2m is very short for the lifetime parameter. The fewer times you allocate connections, the less likely they are to block requests.
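As a sketch of what that could look like, assuming you keep your open counts and just drop the explicit idletime/lifetime flags so the 30m defaults (and the default max-lifetime jitter) from serve help apply:
--datastore-conn-pool-read-min-open=10
--datastore-conn-pool-read-max-open=200
--datastore-conn-pool-write-min-open=10
--datastore-conn-pool-write-max-open=10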
w
I haven't, no
It might not be something you want to do anything about, but the main problem for me was that I went over the max connections allowed by my RDS instance, which meant that I couldn't even roll back, as new pods couldn't acquire a connection and would crashloop
In the end it's my fault for configuring pool maxes that sum to more than the max allowed connections in RDS, really. Although having elasticity would allow one pod's connections to spike while others stay low, it's a trade-off.
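A rough sizing check for that trade-off, since the pools are eager (the pod count and headroom here are illustrative, not from this thread):
# every pod holds its configured maximum, so:
#   pods * (read max-open + write max-open) <= RDS max_connections
#   e.g. 3 pods * (200 + 10) = 630 connections needed, plus headroom
#        for migrations, other clients, and ad-hoc psql sessions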