I'm trying to upgrade to 1.16.2, and finding a ver...
# spicedb
w
I'm trying to upgrade to 1.16.2, and finding a very strange issue: I'm getting a ~0.2% error rate, with the message
CheckError: 9 FAILED_PRECONDITION: object definition user_agency_role not found
.
user_agency_role
is the subject I'm using for all permission checks. It's definitely there when I'm doing
zed schema read
, I tried reproducing a query that failed (reproduction using
zed
) and it works fine. Is this a known bug by any chance? 0.2% is rather a lot
v
depends! you using PG?
w
Yes I am
Actually it's not just the subject that's failing sometimes, I'm seeing this with all objects
v
we detected an issue with revision handling in PG and are in the process of releasing the fix in 1.18. It manifested when writing to SpiceDB and then reading afterwards with fully consistent, if I recall correctly
we also detected a similar problem in MemDB
w
I don't think we ever use
fullyConsistent
🤔
w
It seems fairly different: I'm not seeing incorrect permissions, I'm seeing errors due to unknown definitions? Are you confident it's related?
v
and you say this is happening when going from 1.16.1 to 1.16.2 ?
w
No, haven't tried 1.16.1. This is a fresh install of spicedb, last version I used was 1.14.1-nscachefix
v
well, the same problem manifested in MemDB, with the same exact errors. And it was related to revision handling and revision GC. The fix for MemDB did not change PG datastore - it seemed to do the right thing - but it's all related to revisions and how SpiceDB handles if something is "visible" as of current time
this is all to say no, I don't have a exact example of this happening in PostgreSQL, but: - we fixed the exact same thing for MemDB, and it was revision handling - PG datastore revision handling has undergone various changes, starting at 1.14.0. A big change is going out in 1.18 that changes revision handling as well.
I guess the part that does not add up is that you haven't seen it before until doing a fresh install of 1.16.2
is this a completely separate deploy of your prod env?
w
> is this a completely separate deploy of your prod env? It is, yes. For various reasons it was simpler for us to have a new spicedb service with a new database than upgrading the existing one (which was on 1.13.0, not 1.14, I was incorrect)
v
right. In 1.14 is when revision handling changed
w
Our staging env is on 1.14 and we're not seeing this, though
Well, actually let me really look, maybe we do
No we don't, no such problem on staging on
v1.14.1-nscachefix
v
and the new deployment that is manifesting the problem, is that receiving production load?
and is the staging env subject to the same workload?
w
> and the new deployment that is manifesting the problem, is that receiving production load? it is, 200 to 400 req/s > and is the staging env subject to the same workload? Definitely not, just a few req/s. Can't seen this error in the past 14d, but we can't exclude that it's statistically expected
Is there an ETA for 1.18?
User feedback: I'm pretty antsy, we need a permission service that works for some of our upcoming work and no version of SpiceDB so far has given us that 😬 I can't roll it out as long as it has issues, we'll soon have to drop it entirely to implement our own, which I'd hate
j
@williamdclt and you're not changing the schema at all?
w
nop, updates to the schema is something we do maybe once a week
j
it sounds like a bug in the namespace cache, perhaps due to the singleflight
can you try setting the namespace cache to off and seeing if it still occurs?
we haven't seen this in prod or testing anywhere, but I want to verify
w
> setting the namespace cache to off How can I do that?
j
--ns-cache-enabled=false
be aware it will cause a perf impact
w
Is okay
j
@williamdclt is this on the read endpoint of Aurora?
w
Yes it is
j
ok
w
NS cache disabled, I'm waiting for metrics to come through
Will give it 15min to be sure
j
if it does turn out to be that, it is likely a race in the caching code for returning whether a namespace exists
if not, it is likely a race somewhere else, as the data isn't changing and I doubt Aurora would return data that stale
w
I'm still seeing these :/
Also huge impact on perfs, wow
Actually... let me see, I'm not 100% sure I disabled it correctly
No it's definitely still happening with
--ns-cache-enabled=false
v
1.18 should be out in the next few days, in the meanwhile I'd suggest building it and testing it
obviously I cannot guarantee it's going to fix this problem, but addresses a bunch of other problems with revision handling in PG
j
If you can create something to reproduce it in a development environment (e.g. brand new spicedb, minimal schema), that'd definitely help us ensure a fix goes out ASAP
j
@williamdclt can you run with
--log-level=trace
and see if you can find the error?
w
> 1.18 should be out in the next few days, in the meanwhile I'd suggest building it and testing it Should I just build from main? I'm not seeing an RC > If you can create something to reproduce it in a development environment (e.g. brand new spicedb, minimal schema), that'd definitely help us ensure a fix goes out ASAP I'm afraid I won't have the capacity, I'm already on borrowed time 😓 I'll see if I can do something > can you run with --log-level=trace and see if you can find the error? Sure can
v
>Should I just build from main? I'm not seeing an RC yeah. It would include some fixes to PG indexes that we are in the process of testing out, actually the ones you reported with GC.
w
Not seeing any error with log-level=trace, but it's a bit of a needle-in-a-haystack situation
j
yeah
is it only that namespace that's not being found or...?
w
no, all object definitions seem to sporadically be not found
j
okay
I'm going to prepare a branch off of HEAD
if you can try it, let me know
w
Sometime it's the subject of the permission that's not found, sometimes the resource
j
yeah
w
If it includes the migrations, it'll probably take me a bit longer!
j
what version are you on?
I'll branch of it
off*
v
@Joey you'll have to branch off the 1.18.0 branch that Jake created, which excludes the PG indexes migration
j
depends on what @williamdclt is running
v
he mentioned 1.16.1 for this test
j
@williamdclt let me know for confirmation and I'll prep it
v
sorry, let me read the scrollback again
w
I'm on 1.16.2
j
gotcha
the NS cache disable flag doesn't bypass the code entirely; it simply no-ops the caching parts
so I want to see if this fixes it
since this bypasses it entirely, including the single flight
w
Should I also use
--ns-cache-enabled=false
?
j
nah
shouldn't matter
it'll bypass
v
there is also a possibility this is fixed just because of changes in 1.18 right?
j
unlikely
w
The image isn't running for me :/
/opt/spicedb: line 1: syntax error: unexpected "("
Might have built it incorrectly? I just did a
docker build .
in your fork
j
one sec
w
interestingly I just discovered that this error only happens with consistency atLeastAsFresh or fullyConsistent (I don't think I ever use fullyConsistent, but I'm not positive)
j
but not minimize latency?
w
What I'm really seeing is that it happens only on my "master" service (which uses the Aurora write endpoint), not the "reader" service (which uses the read endpoint). And I know that we only call the "master" service for checks for atLeastAsFresh or fullyConsistent
Can't see that the reader service ever fails
j
quay.io/authzed/spicedb:no-ns-cache-1-16-2
oh, interesting...
so maybe it is Aurora or consistency related...
w
Which gives more weight to the problem that @vroldanbet was pointing at at the very start of this thread
j
yeah
earlier you said it was on the read endpoint, so I thought it might be something else
w
I was wrong 😓 forgot we had this behaviour depending on consistency requirements
j
yah
okay, so: maybe try my image
but I suspect 1.18 might fix it
w
also getting errors with your build
I'll have to clock off, it's the end of the day here 🙂
Have a good day!
j
even the image I made?
OH
I made an arm one...
hah
one second
fixed
w
I'm now running
quay.io/authzed/spicedb:v1.18-rc4
, and these errors have entirely disappeared 👍 with no difference in performance that I can see
Only been 15min, I'll continue to monitor
v
@williamdclt 1.18 was just released like an hour ago
j
it should be noted 1.18 does not include the GC index
that will be in 1.18.1
v
☝️