# spicedb
e
šŸ‘‹ Looking for thoughts from the community - my team at Strava is considering running SpiceDB backed by Aurora MySQL - has anyone done this? If so, I'd love to chat about your experience and understand particularly how that has scaled for you, thanks!
j
The folks over at Birdie (e.g. @williamdclt) are using Aurora Postgres, maybe they can comment on the Aurora half.
FWIW, I'd lean towards using Postgres if possible. It's far more mature and performant. The folks over at GitHub contributed MySQL support, but it was largely a copy of an older version of the Postgres driver.
w
Yeah we use Aurora Postgres. We're struggling a bit with our setup, but I'm not sure that it's really Aurora's fault. Here's our situation:
- We have 10ish million relations in the datastore
- We have up to 400 req/s to SpiceDB (all checks)
- Most of our permissions are moderately complex, with at least one intersection

To support that, Postgres (at least on Aurora, dunno about non-Aurora PG) needs to hold almost the entire dataset in RAM (16GB in our case). It also needs a fair amount of CPU: Aurora is generally pretty CPU-greedy, more so than non-Aurora RDS. This means we need an r6g.large, which is a bit too expensive for my taste. It's actually not really enough for the traffic either; we're having issues from time to time. Obviously Aurora allows scaling horizontally, so we could add more replicas, but SpiceDB doesn't handle read replicas natively (it doesn't handle eventual consistency), so we work around it.
e
Hey @williamdclt just saw this - thanks for the input! Can you elaborate on how you work around the fact that spicedb doesn't support read replicas? We use aurora read replicas in most of our services and just accept the small chance of inconsistency in most cases. I haven't thought too hard about how that might work with spicedb yet šŸ¤”
w
Sure! We set up 2 k8s services: spicedb-master and spicedb-reader. They have similar config, except that spicedb-master points at the Aurora write endpoint and spicedb-reader at the read endpoint. We then wrote a small wrapper around the SpiceDB client that chooses whether to call -master or -reader depending on the consistency requirement: `fullyConsistent` always goes to master, `minimizeLatency` to reader, and `atLeastAsFresh` tries the reader and falls back to master if it gets an error about the revision not existing. All writes go to -master. You could get away with all reads going to -reader and accepting eventual consistency, but then you'd have no way of getting full consistency if you ever need it. It also probably wouldn't be that much work to get support for eventually-consistent replicas in SpiceDB, really!
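Roughly, the routing looks like this (a simplified Go sketch against the authzed-go client, not our actual wrapper; detecting the "revision not found" case is hand-waved here to any reader error):

```go
package authzrouter

import (
	"context"

	"github.com/authzed/authzed-go/v1"
	pb "github.com/authzed/authzed-go/proto/authzed/api/v1"
)

// RoutingClient wraps two SpiceDB clients: one pointed at spicedb-master
// (Aurora write endpoint) and one at spicedb-reader (read endpoint).
type RoutingClient struct {
	Master *authzed.Client
	Reader *authzed.Client
}

// CheckPermission routes based on the request's consistency requirement.
func (c *RoutingClient) CheckPermission(ctx context.Context, req *pb.CheckPermissionRequest) (*pb.CheckPermissionResponse, error) {
	switch req.GetConsistency().GetRequirement().(type) {
	case *pb.Consistency_FullyConsistent:
		// fullyConsistent always goes to master.
		return c.Master.CheckPermission(ctx, req)
	case *pb.Consistency_AtLeastAsFresh:
		// atLeastAsFresh: try the reader; if the replica hasn't seen the
		// revision yet (simplified here to any error), retry on master.
		resp, err := c.Reader.CheckPermission(ctx, req)
		if err != nil {
			return c.Master.CheckPermission(ctx, req)
		}
		return resp, nil
	default:
		// minimizeLatency (and unset) goes to the reader.
		return c.Reader.CheckPermission(ctx, req)
	}
}
```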
e
Two questions come to mind:
1. Do you have an idea what percent of your stored relations are used in auth checks? I'd expect most of the data to be older and for checks to target only newer data, a relatively small percent.
2. Do you take advantage of SpiceDB's caching mechanisms? I'd expect that to reduce load on the DB for frequent checks… not sure exactly how that caching works
w
> Do you have an idea what percent of your stored relations are used in auth checks? I'd expect most of the data to be older and for checks to target only newer data, a relatively small percent.

I'd say most of the data is actually used? Except maybe data relating to disabled/inactive users, which is probably mostly unused.

> Do you take advantage of SpiceDB's caching mechanisms? I'd expect that to reduce load on the DB for frequent checks… not sure exactly how that caching works

Yes, SpiceDB handles that mostly transparently; that's really what the consistency requirement controls, and you can read about it in SpiceDB's docs. The cache hit rate is around 40% for us, so it definitely reduces load, but it isn't an order-of-magnitude reduction.
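If it's useful, here's roughly what that looks like in practice (a minimal sketch using the standard authzed-go v1 types; the document/user names are made up, and as I understand it `fullyConsistent` skips the cache entirely, which is why it's the expensive option):

```go
package authzrouter

import (
	"context"
	"fmt"

	"github.com/authzed/authzed-go/v1"
	pb "github.com/authzed/authzed-go/proto/authzed/api/v1"
)

// checkAfterWrite writes a relationship, then checks a permission "at least
// as fresh" as that write, letting SpiceDB serve from cache where it can.
func checkAfterWrite(ctx context.Context, client *authzed.Client) error {
	writeResp, err := client.WriteRelationships(ctx, &pb.WriteRelationshipsRequest{
		Updates: []*pb.RelationshipUpdate{{
			Operation: pb.RelationshipUpdate_OPERATION_TOUCH,
			Relationship: &pb.Relationship{
				Resource: &pb.ObjectReference{ObjectType: "document", ObjectId: "readme"},
				Relation: "viewer",
				Subject:  &pb.SubjectReference{Object: &pb.ObjectReference{ObjectType: "user", ObjectId: "ethan"}},
			},
		}},
	})
	if err != nil {
		return err
	}

	// WrittenAt is a ZedToken: passing it back as atLeastAsFresh bounds
	// staleness without forcing a fully-consistent (uncacheable) read.
	checkResp, err := client.CheckPermission(ctx, &pb.CheckPermissionRequest{
		Consistency: &pb.Consistency{
			Requirement: &pb.Consistency_AtLeastAsFresh{AtLeastAsFresh: writeResp.WrittenAt},
		},
		Resource:   &pb.ObjectReference{ObjectType: "document", ObjectId: "readme"},
		Permission: "view",
		Subject:    &pb.SubjectReference{Object: &pb.ObjectReference{ObjectType: "user", ObjectId: "ethan"}},
	})
	if err != nil {
		return err
	}
	fmt.Println(checkResp.Permissionship)
	return nil
}
```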
e
Ok that's all good to know - the 40% cache hit rate is good context for what you said previously.
You originally said you're struggling with your setup - is that still true? Did that wrapper and using the reader solve your scaling concerns? Or what are you struggling with at the moment?
w
We're having a few issues with the version of SpiceDB that we're running (1.13):
- We're having spikes of DB connections maxing out the pool, triggering spikes of latency
- Memory usage is growing until OOM; latency also seems to go up with memory usage
- We're occasionally getting errors from SpiceDB bugs

AFAIK the last two should be fixed in the latest version of SpiceDB, but we've been struggling to upgrade to 1.14: this specific upgrade is a heavy operation (a few hours), and when we tried it we found big performance regressions. That should be fixed now, but we haven't had much time to look at upgrading again.
e
Thanks @williamdclt I appreciate all the details
j
FWIW, we've been experimenting with Aurora ourselves and can't promise anything, but if we end up adopting it, I suspect we'd build replica awareness into SpiceDB so that it could route requests to the proper datastore. We've talked about this a little bit, but we should probably create a formal GitHub issue for it. Would be great if either of you created that and could chime in on your use cases.
Also @williamdclt your cache hit rate seems quite low. Most of the folks we've worked with are typically between 80-90%. It definitely varies depending on your workload and schema, though.
w
> your cache hit rate seems quite low I suspected it was šŸ¤” Not sure why. Maybe simply the size of the dataset? Would be great to have more visibility into what relations tend to be cached or not
j
If you set up distributed tracing for SpiceDB, you can run through some of the traces and figure it out, but there's no good way to answer that question using a traditional time-series database because of the cardinality of the metrics you'd need.
a
Hi @ethanlew šŸ‘‹ Turo started looking into running SpiceDB with Aurora MySQL too; we're at the initial stages of this investigation. How's it going so far? Would you be interested in sharing what you've learned? @williamdclt sorry to hear about the struggles with your Postgres Aurora setup. Were you able to upgrade to 1.14, or were the regressions great enough to warrant staying behind? Also, do I understand correctly that your setup uses the main replica for all reads and writes? Appreciate any info y'all are willing to share!
w
The regressions at the time were a no-go, the system was unusable. There's been a fix for it since then (`1.14.1-nscachefix`). We tried upgrading again with the fix a few days ago: the perf regressions seem to be gone, but we encountered another unrelated bug (couldn't write a relationship due to a uniqueness constraint error), so we had to roll back again.

> Also, do I understand correctly that your setup uses the main replica for all reads and writes?

No, it uses the master DB for writes and fully-consistent reads, and the replica for all other reads.
e
Can you elaborate on the uniqueness constraint issue? Not being able to write a relation isn't great… are you saying there's a bug in spicedb or was it something with your specific setup?
w
I don't have a whole lot of information TBH! There was a specific relationship that couldn't be written (`unable to delete relationships: ERROR: duplicate key value violates unique constraint "uq_relation_tuple_living" (SQLSTATE 23505)`), which for us is a blocker.

> are you saying there's a bug in spicedb or was it something with your specific setup?

Pretty sure it's a bug in spicedb; rolling back to 1.13 fixed it. I suspect it's a bug that appears specifically in the 1.13 -> 1.14 migration phase, but that's a blocker for us to upgrade.
j
What write(s) were you performing at the time?
The ns cache fix landed and was rolled out to our dedicated, serverless, and enterprise environments a few major versions ago
It should be noted @williamdclt that you can get that error if you tried to modify the same relationship within the same write transaction
This was previously unchecked but now we raise a proper error for it
w
> What write(s) were you performing at the time?

Not sure TBH, it's been 3w since then šŸ˜…

> you can get that error if you tried to modify the same relationship within the same write transaction

Could be what we saw. Does that mean I can't `TOUCH` the same relationship twice (or do a DELETE then a WRITE) in a single `writeRelationships`? How come šŸ¤”
j
yes, it does mean that, and for this exact reason: it breaks the unique index for the transactions table. We now explicitly check for this and raise a proper error instead of just failing
e
@Joey To clarify - the relationship has to be exactly the same to throw this error - like I can't both delete and write the relationship "user:ethan#follower@user:joey". But I could, for example, delete "user:ethan#follower@user:joey" and write "user:ethan#follower@user:william"
j
correct
the updates within a write relationships call do not have a defined order
so if you were to delete and "then" write
it wouldn't really make much sense
since there is no causal ordering within
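concretely, something like this is what now gets rejected (a hedged sketch with the authzed-go v1 types; `rel` is just an illustrative helper, not a library function):

```go
package authzrouter

import pb "github.com/authzed/authzed-go/proto/authzed/api/v1"

// rel builds a Relationship proto; purely illustrative.
func rel(resType, resID, relation, subjType, subjID string) *pb.Relationship {
	return &pb.Relationship{
		Resource: &pb.ObjectReference{ObjectType: resType, ObjectId: resID},
		Relation: relation,
		Subject:  &pb.SubjectReference{Object: &pb.ObjectReference{ObjectType: subjType, ObjectId: subjID}},
	}
}

// Rejected: the same relationship appears twice in one call, and updates
// within a writeRelationships call have no defined order.
var invalid = &pb.WriteRelationshipsRequest{
	Updates: []*pb.RelationshipUpdate{
		{Operation: pb.RelationshipUpdate_OPERATION_DELETE, Relationship: rel("user", "ethan", "follower", "user", "joey")},
		{Operation: pb.RelationshipUpdate_OPERATION_TOUCH, Relationship: rel("user", "ethan", "follower", "user", "joey")},
	},
}

// Fine: two different relationships (different subjects) in one call.
var valid = &pb.WriteRelationshipsRequest{
	Updates: []*pb.RelationshipUpdate{
		{Operation: pb.RelationshipUpdate_OPERATION_DELETE, Relationship: rel("user", "ethan", "follower", "user", "joey")},
		{Operation: pb.RelationshipUpdate_OPERATION_TOUCH, Relationship: rel("user", "ethan", "follower", "user", "william")},
	},
}
```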
w
> the updates within a write relationships call do not have a defined order They don't?? That's news to me šŸ˜Ø I'll have to check whether we rely on order anywhere!
j
yeah, it is considered a single transactional update, so the same relationships should not be updated at the same time
j
Regarding the use of Aurora PostgreSQL, has anyone tried using RDS Proxy to address read/write distribution across the database cluster? I'm considering testing it out if no one has experience with it
j
the problem is one of staleness: if the proxy hits the read replicas, they may not yet have the written data, which causes the request to fail
j
and because it's a proxy you have no way of recovering from the error ...
j
yep
we are investigating what it might be like to have SpiceDB support read replicas or Aurora explicitly, so that we can intelligently direct requests
so if the replicas can satisfy it, the request goes to them
otherwise, it goes to the primary
j
In the meantime, it should be safe to use Aurora as long as you're only configured to talk to the Write instance
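i.e. a config along these lines (a sketch with placeholder credentials and hostnames; the key part is that `--datastore-conn-uri` targets the Aurora cluster's writer endpoint, not the `cluster-ro-` reader endpoint):

```sh
spicedb serve \
  --grpc-preshared-key "somerandomkey" \
  --datastore-engine postgres \
  --datastore-conn-uri "postgres://spicedb:password@mycluster.cluster-abc123.us-east-1.rds.amazonaws.com:5432/spicedb?sslmode=require"
```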