# spicedb
Hi all, we experienced a long outage yesterday related to SpiceDB and identified a few factors that led to the overall failure. We're on SpiceDB 1.16.1:

- Garbage collection wasn't keeping up: records weren't being deleted fast enough, resulting in very high read times on the database. This was in part due to the GC query performing a sequential scan on the database; even now, the query takes 55 seconds to complete every time it runs
- We're overusing
in our services, and need to improve our use of the caching mechanisms that SpiceDB offers
- We had an unbounded query in the system that requested all resources a user had read access to, which included public resources
- We store large numbers of resources in SpiceDB. We're currently at 250 million tuples, and that's likely to increase due to the nature of our system (e.g. we have users -> workspaces -> projects -> assets for those projects)

We're now working on mitigations:

- Improved alerting based on the Prometheus and OpenTelemetry metrics that SpiceDB exposes
- Moving everything we can to use
- Monitoring how many records queries return and alerting if the count is abnormally high

However, there are still a number of things that are unclear, which I'm hoping you'll be able to clarify:

- Is any work being planned by the SpiceDB team to deal with the garbage collection sequential scan? I saw an open issue on GitHub about it, but it's not clear what the current status is
- Can we do anything on the SpiceDB side to at least protect us from unbounded queries? We've got a team of 75 developers, so the chance that something slips through the cracks is fairly high; if we can ensure SpiceDB doesn't blindly return 3 million records, that'd help a lot
- How long does it generally take to reach consistency when using minimizeLatency?
- What sort of upper limits do we need to consider when storing records in SpiceDB?
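
To make the unbounded-query and result-count points above concrete, here is a minimal client-side sketch of the kind of guard we have in mind. It is not part of the SpiceDB API: `bounded`, `ResultLimitExceeded`, and the `on_count` hook are hypothetical names, and the iterable stands in for whatever streaming lookup response a service consumes.

```python
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


class ResultLimitExceeded(RuntimeError):
    """Raised when a streamed result set grows past the configured cap."""


def bounded(
    results: Iterable[T],
    max_results: int,
    on_count: Optional[Callable[[int], None]] = None,
) -> Iterator[T]:
    """Yield items from `results`, aborting once more than `max_results` arrive.

    `on_count` is a hypothetical hook that receives the number of items
    yielded, e.g. to feed a metric that alerting can watch.
    """
    count = 0
    try:
        for item in results:
            if count >= max_results:
                # One item past the cap has been read; stop consuming here.
                raise ResultLimitExceeded(
                    f"query returned more than {max_results} results"
                )
            count += 1
            yield item
    finally:
        # Runs on normal completion, on the limit error, and on early close.
        if on_count is not None:
            on_count(count)
```

A caller would wrap the response stream, for example `for r in bounded(stream, max_results=10_000, on_count=record_metric)`, where `record_metric` is whatever the service uses to report the count. This only protects callers that opt in, which is why a server-side limit would still help.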