# spicedb
j
the general recommendation is not to have a single point of failure for any part of the SpiceDB stack: SpiceDB itself, of course, can (and should) be run multi-node, and you can have two clusters if you want complete isolation between them (although this is unlikely to be needed). The datastore itself is the most contentious potential SPOF; for that, we recommend a true HA multi-master datastore like CockroachDB or Spanner
w
Multi-node SpiceDB + an HA multi-master DB does mitigate some scenarios, but it only covers failures of a given node (for both SpiceDB and CockroachDB). Scenarios I'm looking at are things like:
- Overloading (of either SpiceDB or CockroachDB). If one node gets overloaded, it's likely that the others are too; this is usually caused by things like a client going wild, or crossing a data volume threshold
- Bad operations, like an upgrade going wrong. If a SpiceDB or CockroachDB upgrade causes issues (e.g. a regression), it'll likely impact the whole cluster
j
how do you mitigate those issues with your standard database today?
w
We don't 😄 but our databases don't tend to be SPOFs for the entire system: if they go down they only take a specific service with them; the rest of our product keeps running. Authorization is a central domain by nature, so I'm getting challenged on it being a SPOF for the whole product: if it goes down, everything goes down. It's not about "SpiceDB versus an in-house authorization service", it's about "centralised authz versus decentralised authz"
y
i'd probably talk about the complexity of decentralized authorization
specifically in needing to share state and logic between services in order for them to make their own authorization decisions
j
so in your case @williamdclt since you do have the read replicas isolated
you basically are running two distinct clusters except for the upgrade path
and for that, I'd recommend testing on a different stage before pushing to prod
longer term, we have plans for a second layer of caching that could answer queries if SpiceDB is down and the cache is backfilled
but it wouldn't be able to answer everything
w
> i'd probably talk about the complexity of decentralized authorization
> specifically in needing to share state and logic between services in order for them to make their own authorization decisions

Yeah, that's the discussion I'm having ATM 😄 I'm having a hard time convincing the principal engs, so I'm also exploring how to improve the SPOF situation

> you basically are running two distinct clusters except for the upgrade path

True, and I suppose we could make it "one cluster per product domain" if we really wanted to. The database is still a SPOF, as are upgrades (at least the ones containing a DB migration; otherwise we could run different SpiceDB versions across clusters), but maybe that's mitigation enough
> longer term, we have plans for a second layer of caching that could answer queries if SpiceDB is down and the cache is backfilled
> but it wouldn't be able to answer everything

Ha, we do that actually! We store the SpiceDB responses in Redis and replay them if SpiceDB is unavailable
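Roughly like this, if it helps (a minimal sketch with illustrative names, not our actual code; it assumes a client wrapper that raises ConnectionError when SpiceDB is unreachable, plus redis-py):

```python
# Minimal sketch of the replay layer (illustrative names, not our actual code).
# Assumes a SpiceDB client wrapper that raises ConnectionError when SpiceDB
# is unreachable, and the redis-py library.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 3600  # how long a stored answer stays replayable

def cache_key(resource: str, permission: str, subject: str) -> str:
    return f"authz:{resource}#{permission}@{subject}"

def check_permission_cached(client, resource, permission, subject) -> bool:
    key = cache_key(resource, permission, subject)
    try:
        allowed = client.check_permission(resource, permission, subject)
        # Store the fresh answer so it can be replayed during an outage.
        r.set(key, json.dumps(allowed), ex=CACHE_TTL_SECONDS)
        return allowed
    except ConnectionError:
        # SpiceDB is unavailable: replay the last known answer if we have one.
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        raise  # nothing cached for this check; surface the outage
```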
j
> Ha, we do that actually! We store the SpiceDB responses in Redis and replay them if SpiceDB is unavailable

yeah, it might be something like this, but with intelligence to keep the responses long term until they've changed
w
> with intelligence to keep the responses long term until they've changed

Ohhh we actually implemented exactly that in our (now archived) home-grown Zanzibar-based service. There are interesting papers on this from the RDF people; the core idea was to store the tuples that were used to check the permission (called "BGPs" - basic graph patterns) with the cached response, for invalidation when something changes. If you're interested I can ask if that's something we'd be willing to open-source
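In spirit it was something like this (a minimal sketch with hypothetical names; it assumes each check reports the set of tuples it touched, and that tuple changes are observable, e.g. via a watch feed):

```python
# Minimal sketch of tuple-based cache invalidation (hypothetical names).
# Each cached check result is stored alongside the tuples (the "BGP") that
# produced it; a change to any of those tuples drops the cached answer.
from collections import defaultdict

class InvalidatingCache:
    def __init__(self):
        self.answers = {}                      # check key -> cached result
        self.tuple_to_keys = defaultdict(set)  # tuple -> check keys it supports

    def store(self, key: str, allowed: bool, tuples_used: set[str]) -> None:
        """Cache a check result with the tuples that were used to compute it."""
        self.answers[key] = allowed
        for t in tuples_used:
            self.tuple_to_keys[t].add(key)

    def lookup(self, key: str):
        return self.answers.get(key)  # None means "not cached"

    def on_tuple_changed(self, t: str) -> None:
        """Drop every cached answer that depended on the changed tuple."""
        for key in self.tuple_to_keys.pop(t, set()):
            self.answers.pop(key, None)

cache = InvalidatingCache()
cache.store("doc:1#view@user:alice", True, {"doc:1#reader@user:alice"})
cache.on_tuple_changed("doc:1#reader@user:alice")  # invalidates the check above
```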
j
we actually have a working prototype that does exactly that
although it uses some specialized data structures to ensure faster checks and less mem
and everything is revisioned
if you're interested in contributing, we'd love your insights into it 🙂
w
as usual, interested but probably too time-constrained 😄 If you have something ready to be read I'd be interested, but certainly don't go to too much effort for me!
j
k 🙂