Postgres failover
# spicedb
j
Hey all! Right now we're running SpiceDB on version 1.15.0. We recently executed an RDS reboot with failover on a Multi-AZ Postgres RDS instance backing SpiceDB. It looks like the pods did not automatically reconnect to the failed-over instance and we had to execute a manual restart. The pods were shown as healthy and running, so we were a bit confused about the issue. I'm assuming SpiceDB does not check the database status as part of some sort of continuous check? The implication is that if an availability zone goes down, it would require manual intervention to restart the pods in the cluster. Is there an automated self-healing mechanism others have implemented to circumvent this? Would love to hear your suggestions and thoughts on this capability. Thanks!
v
Hey, I investigated this recently and identified that the context severing used at the datastore layer is what makes SpiceDB unable to recover from a PostgreSQL failover. Unfortunately, right now it requires manually rolling the pods. This should only affect PostgreSQL and potentially MySQL.
j
Hey! Thanks for the reply 🙂 Do you have any recommendations on what should be done to encourage self-healing? Just thinking out loud maybe a liveness probe of some sort? Was wondering if you had encountered any fixes or if it's on the roadmap to fix at Authzed!
j
you could run a liveness probe on SpiceDB itself and if the API returns an error indicative of a DB issue, restart the pods
but it won't give you assurances that it is a DB issue vs a SpiceDB problem
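As a starting point, something like the following Kubernetes probe snippet could restart pods whose health check starts failing. The port and thresholds here are assumptions to tune for your deployment, and per the caveat above, a standard health check may still pass while the datastore is unhealthy, so you may need a probe that exercises an actual API call instead:

```yaml
# Hypothetical livenessProbe for a SpiceDB container, using the
# native gRPC probe available in newer Kubernetes versions.
# 50051 is SpiceDB's default gRPC port; adjust if yours differs.
livenessProbe:
  grpc:
    port: 50051
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```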
we are looking longer term into removing the context severing, but that has connection pooling implications