https://authzed.com logo
I m trying to upgrade to 1 16 2 and
w

williamdclt

03/21/2023, 4:41 PM
I'm trying to upgrade to 1.16.2, and finding a very strange issue: I'm getting a ~0.2% error rate, with the message
CheckError: 9 FAILED_PRECONDITION: object definition user_agency_role not found
.
user_agency_role
is the subject I'm using for all permission checks. It's definitely there when I'm doing
zed schema read
, I tried reproducing a query that failed (reproduction using
zed
) and it works fine. Is this a known bug by any chance? 0.2% is rather a lot
v

vroldanbet

03/21/2023, 4:42 PM
depends! you using PG?
w

williamdclt

03/21/2023, 4:42 PM
Yes I am
Actually it's not just the subject that's failing sometimes, I'm seeing this with all objects
v

vroldanbet

03/21/2023, 4:44 PM
we detected an issue with revision handling in PG and are in the process of releasing the fix in 1.18. It manifested when writing to SpiceDB and then reading afterwards with fully consistent, if I recall correctly
we also detected a similar problem in MemDB
w

williamdclt

03/21/2023, 4:46 PM
I don't think we ever use
fullyConsistent
🤔
w

williamdclt

03/21/2023, 4:47 PM
It seems fairly different: I'm not seeing incorrect permissions, I'm seeing errors due to unknown definitions? Are you confident it's related?
v

vroldanbet

03/21/2023, 4:47 PM
and you say this is happening when going from 1.16.1 to 1.16.2 ?
w

williamdclt

03/21/2023, 4:48 PM
No, haven't tried 1.16.1. This is a fresh install of spicedb, last version I used was 1.14.1-nscachefix
v

vroldanbet

03/21/2023, 4:49 PM
well, the same problem manifested in MemDB, with the same exact errors. And it was related to revision handling and revision GC. The fix for MemDB did not change PG datastore - it seemed to do the right thing - but it's all related to revisions and how SpiceDB handles if something is "visible" as of current time
this is all to say no, I don't have a exact example of this happening in PostgreSQL, but: - we fixed the exact same thing for MemDB, and it was revision handling - PG datastore revision handling has undergone various changes, starting at 1.14.0. A big change is going out in 1.18 that changes revision handling as well.
I guess the part that does not add up is that you haven't seen it before until doing a fresh install of 1.16.2
is this a completely separate deploy of your prod env?
w

williamdclt

03/21/2023, 4:53 PM
> is this a completely separate deploy of your prod env? It is, yes. For various reasons it was simpler for us to have a new spicedb service with a new database than upgrading the existing one (which was on 1.13.0, not 1.14, I was incorrect)
v

vroldanbet

03/21/2023, 4:55 PM
right. In 1.14 is when revision handling changed
w

williamdclt

03/21/2023, 4:56 PM
Our staging env is on 1.14 and we're not seeing this, though
Well, actually let me really look, maybe we do
No we don't, no such problem on staging on
v1.14.1-nscachefix
v

vroldanbet

03/21/2023, 4:58 PM
and the new deployment that is manifesting the problem, is that receiving production load?
and is the staging env subject to the same workload?
w

williamdclt

03/21/2023, 5:00 PM
> and the new deployment that is manifesting the problem, is that receiving production load? it is, 200 to 400 req/s > and is the staging env subject to the same workload? Definitely not, just a few req/s. Can't seen this error in the past 14d, but we can't exclude that it's statistically expected
Is there an ETA for 1.18?
User feedback: I'm pretty antsy, we need a permission service that works for some of our upcoming work and no version of SpiceDB so far has given us that 😬 I can't roll it out as long as it has issues, we'll soon have to drop it entirely to implement our own, which I'd hate
j

Joey

03/21/2023, 5:07 PM
@williamdclt and you're not changing the schema at all?
w

williamdclt

03/21/2023, 5:08 PM
nop, updates to the schema is something we do maybe once a week
j

Joey

03/21/2023, 5:09 PM
it sounds like a bug in the namespace cache, perhaps due to the singleflight
can you try setting the namespace cache to off and seeing if it still occurs?
we haven't seen this in prod or testing anywhere, but I want to verify
w

williamdclt

03/21/2023, 5:14 PM
> setting the namespace cache to off How can I do that?
j

Joey

03/21/2023, 5:16 PM
--ns-cache-enabled=false
be aware it will cause a perf impact
w

williamdclt

03/21/2023, 5:16 PM
Is okay
j

Joey

03/21/2023, 5:16 PM
@williamdclt is this on the read endpoint of Aurora?
w

williamdclt

03/21/2023, 5:16 PM
Yes it is
j

Joey

03/21/2023, 5:16 PM
ok
w

williamdclt

03/21/2023, 5:22 PM
NS cache disabled, I'm waiting for metrics to come through
Will give it 15min to be sure
j

Joey

03/21/2023, 5:23 PM
if it does turn out to be that, it is likely a race in the caching code for returning whether a namespace exists
if not, it is likely a race somewhere else, as the data isn't changing and I doubt Aurora would return data that stale
w

williamdclt

03/21/2023, 5:34 PM
I'm still seeing these :/
Also huge impact on perfs, wow
Actually... let me see, I'm not 100% sure I disabled it correctly
No it's definitely still happening with
--ns-cache-enabled=false
v

vroldanbet

03/21/2023, 5:52 PM
1.18 should be out in the next few days, in the meanwhile I'd suggest building it and testing it
obviously I cannot guarantee it's going to fix this problem, but addresses a bunch of other problems with revision handling in PG
j

jzelinskie

03/21/2023, 5:53 PM
If you can create something to reproduce it in a development environment (e.g. brand new spicedb, minimal schema), that'd definitely help us ensure a fix goes out ASAP
j

Joey

03/21/2023, 5:59 PM
@williamdclt can you run with
--log-level=trace
and see if you can find the error?
w

williamdclt

03/21/2023, 6:00 PM
> 1.18 should be out in the next few days, in the meanwhile I'd suggest building it and testing it Should I just build from main? I'm not seeing an RC > If you can create something to reproduce it in a development environment (e.g. brand new spicedb, minimal schema), that'd definitely help us ensure a fix goes out ASAP I'm afraid I won't have the capacity, I'm already on borrowed time 😓 I'll see if I can do something > can you run with --log-level=trace and see if you can find the error? Sure can
v

vroldanbet

03/21/2023, 6:03 PM
>Should I just build from main? I'm not seeing an RC yeah. It would include some fixes to PG indexes that we are in the process of testing out, actually the ones you reported with GC.
w

williamdclt

03/21/2023, 6:14 PM
Not seeing any error with log-level=trace, but it's a bit of a needle-in-a-haystack situation
j

Joey

03/21/2023, 6:14 PM
yeah
is it only that namespace that's not being found or...?
w

williamdclt

03/21/2023, 6:16 PM
no, all object definitions seem to sporadically be not found
j

Joey

03/21/2023, 6:18 PM
okay
I'm going to prepare a branch off of HEAD
if you can try it, let me know
w

williamdclt

03/21/2023, 6:19 PM
Sometime it's the subject of the permission that's not found, sometimes the resource
j

Joey

03/21/2023, 6:19 PM
yeah
w

williamdclt

03/21/2023, 6:19 PM
If it includes the migrations, it'll probably take me a bit longer!
j

Joey

03/21/2023, 6:19 PM
what version are you on?
I'll branch of it
off*
v

vroldanbet

03/21/2023, 6:20 PM
@Joey you'll have to branch off the 1.18.0 branch that Jake created, which excludes the PG indexes migration
j

Joey

03/21/2023, 6:20 PM
depends on what @williamdclt is running
v

vroldanbet

03/21/2023, 6:21 PM
he mentioned 1.16.1 for this test
j

Joey

03/21/2023, 6:21 PM
@williamdclt let me know for confirmation and I'll prep it
v

vroldanbet

03/21/2023, 6:21 PM
sorry, let me read the scrollback again
w

williamdclt

03/21/2023, 6:23 PM
I'm on 1.16.2
j

Joey

03/21/2023, 6:23 PM
gotcha
the NS cache disable flag doesn't bypass the code entirely; it simply no-ops the caching parts
so I want to see if this fixes it
since this bypasses it entirely, including the single flight
w

williamdclt

03/21/2023, 6:36 PM
Should I also use
--ns-cache-enabled=false
?
j

Joey

03/21/2023, 6:37 PM
nah
shouldn't matter
it'll bypass
v

vroldanbet

03/21/2023, 6:38 PM
there is also a possibility this is fixed just because of changes in 1.18 right?
j

Joey

03/21/2023, 6:41 PM
unlikely
w

williamdclt

03/21/2023, 6:41 PM
The image isn't running for me :/
/opt/spicedb: line 1: syntax error: unexpected "("
Might have built it incorrectly? I just did a
docker build .
in your fork
j

Joey

03/21/2023, 6:41 PM
one sec
w

williamdclt

03/21/2023, 6:43 PM
interestingly I just discovered that this error only happens with consistency atLeastAsFresh or fullyConsistent (I don't think I ever use fullyConsistent, but I'm not positive)
j

Joey

03/21/2023, 6:44 PM
but not minimize latency?
w

williamdclt

03/21/2023, 6:45 PM
What I'm really seeing is that it happens only on my "master" service (which uses the Aurora write endpoint), not the "reader" service (which uses the read endpoint). And I know that we only call the "master" service for checks for atLeastAsFresh or fullyConsistent
Can't see that the reader service ever fails
j

Joey

03/21/2023, 6:45 PM
quay.io/authzed/spicedb:no-ns-cache-1-16-2
oh, interesting...
so maybe it is Aurora or consistency related...
w

williamdclt

03/21/2023, 6:46 PM
Which gives more weight to the problem that @vroldanbet was pointing at at the very start of this thread
j

Joey

03/21/2023, 6:46 PM
yeah
earlier you said it was on the read endpoint, so I thought it might be something else
w

williamdclt

03/21/2023, 6:47 PM
I was wrong 😓 forgot we had this behaviour depending on consistency requirements
j

Joey

03/21/2023, 6:47 PM
yah
okay, so: maybe try my image
but I suspect 1.18 might fix it
w

williamdclt

03/21/2023, 6:51 PM
also getting errors with your build
I'll have to clock off, it's the end of the day here 🙂
Have a good day!
j

Joey

03/21/2023, 6:52 PM
even the image I made?
OH
I made an arm one...
hah
one second
fixed
w

williamdclt

03/23/2023, 2:30 PM
I'm now running
quay.io/authzed/spicedb:v1.18-rc4
, and these errors have entirely disappeared 👍 with no difference in performance that I can see
Only been 15min, I'll continue to monitor
v

vroldanbet

03/23/2023, 2:45 PM
@williamdclt 1.18 was just released like an hour ago
j

Joey

03/23/2023, 2:47 PM
it should be noted 1.18 does not include the GC index
that will be in 1.18.1
v

vroldanbet

03/23/2023, 2:48 PM
☝️