postgres.parseRevisionDecimal - makeslice: cap out of range
# spicedb
d
Hello again, we're intermittently seeing a panic: runtime error: makeslice: cap out of range from github.com/authzed/spicedb/internal/datastore/postgres.parseRevisionDecimal({0xc000bce280?, 0x1?}) (larger trace in thread). We suspect this is because we have some consistency tokens stored from when we were on serverless that don't match our self-hosted db, but wanted to: 1. check that this is a likely cause of the error we're seeing, and 2. report that this takes out our entire SpiceDB cluster when it happens. We haven't isolated yet whether this is because we retry the permission check on a service failure, so we quickly cycle through the available spicedb nodes until they're all dead, or whether the group failure is the result of internal cluster communication.
Slightly larger trace (2023-09-18 16:15:10):

panic: runtime error: makeslice: cap out of range

goroutine 811 [running]:
github.com/authzed/spicedb/internal/datastore/postgres.parseRevisionDecimal({0xc000bce280?, 0x1?})
    /home/runner/work/spicedb/spicedb/internal/datastore/postgres/revisions.go:206 +0x1ba
github.com/authzed/spicedb/internal/datastore/postgres.parseRevision({0xc000bce280, 0x1e})
    /home/runner/work/spicedb/spicedb/internal/datastore/postgres/revisions.go:132 +0x4f
github.com/authzed/spicedb/internal/datastore/postgres.(*pgDatastore).RevisionFromString(0xc0011dc6e8?, {0xc000bce280?, 0xc000a8a8c0?})
    /home/runner/work/spicedb/spicedb/internal/datastore/postgres/revisions.go:126 +0x25
Likely cause: https://github.com/authzed/spicedb/blob/main/internal/datastore/postgres/revisions.go#L206 where xmax-xmin is producing an invalid cap, similar to https://github.com/golang/go/issues/52783
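Roughly the failure mode, as a standalone sketch (hypothetical values standing in for whatever the parser pulls out of the token; not the actual SpiceDB code):

package main

import "fmt"

func main() {
    // Hypothetical values: a nanosecond-timestamp-sized xmax against a small
    // xmin still satisfies xmax > xmin, but the difference is far larger than
    // anything make() can allocate.
    var xmin, xmax uint64 = 1, 1693540940373045727

    defer func() {
        if r := recover(); r != nil {
            fmt.Println("recovered:", r) // runtime error: makeslice: cap out of range
        }
    }()

    if xmax > xmin {
        _ = make([]uint64, 0, xmax-xmin) // compiles fine; only the runtime rejects the cap
    }
}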
j
do you have an example of a token that fails that we can look at?
it should be noted zedtokens are not compatible across datastore types
and serverless does not use Postgres
d
I'll try to dig one up. We might have a record of a zedtoken that was sent with the failing requests. Just nuked them all in our metadata db when we realised what was happening though. There'll be a snapshot of the db from before that lying around somewhere. Makes sense with the switching datastores too.
j
appreciate it - the parsing code should never panic
it should just return an error saying "this isn't a valid zedtoken"
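for reference, the sort of guard that surfaces an error instead of panicking might look roughly like this (a sketch with hypothetical names, not the actual SpiceDB patch):

package main

import (
    "errors"
    "fmt"
)

// buildXipList is a hypothetical stand-in for the allocation inside
// parseRevisionDecimal: validate the parsed values before calling make()
// so a malformed token surfaces as an ordinary error rather than a panic.
func buildXipList(xmin, xmax uint64) ([]uint64, error) {
    const maxGap = 1 << 20 // hypothetical sanity bound on xmax-xmin
    if xmax < xmin || xmax-xmin > maxGap {
        return nil, errors.New("this isn't a valid zedtoken")
    }
    return make([]uint64, 0, xmax-xmin), nil
}

func main() {
    if _, err := buildXipList(1, 1693540940373045727); err != nil {
        fmt.Println(err) // this isn't a valid zedtoken
    }
}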
d
Many we have look like GhUKEzE2OTM1NDA5NDQ5NTk3MjA1OTI=. These are handled fine, and just cause a “revision was invalid” error, or a fallback to another consistency mode. Some zedtokens are of the format GiAKHjE2OTM1NDA5NDAzNzMwNDU3MjcuMDAwMDAwMDAwMQ==. That one nukes an instance.
zed --endpoint=spicedb.us.com:443 --token=<auth-token> --permissions-system perms/ relationship read perms/application:1 --consistency-at-least GiAKHjE2OTM1NDA5NDAzNzMwNDU3MjcuMDAwMDAwMDAwMQ==

Error: rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway); transport: received unexpected content-type "text/html"
Same effect with --consistency-at-exactly. Both take out the spicedb pod that received the request, with the same error shown above. The request only takes out the one pod though. The issue we saw with all pods in the SpiceDbCluster failing must have been the result of us retrying, or a burst of similar requests coming in.
huh. that token of death is a nanosecond timestamp, vs an int timestamp. I guess that's how it makes it through the if max>min on L205. I thought go was stricter with its typing than that, but 🤷‍♂️
j
yeah
I suspected as much
I'll get this fixed
do you mind if I use your sample token in the unit test?
d
nah, I think that's fine... it's just a base64-encoded float.
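you can eyeball what's inside without any SpiceDB code; a quick sketch that just decodes the two tokens from this thread (the decimal revision shows up as readable text after a short binary prefix):

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    // The two tokens quoted above: the first parses fine, the second panics.
    for _, tok := range []string{
        "GhUKEzE2OTM1NDA5NDQ5NTk3MjA1OTI=",
        "GiAKHjE2OTM1NDA5NDAzNzMwNDU3MjcuMDAwMDAwMDAwMQ==",
    } {
        raw, err := base64.StdEncoding.DecodeString(tok)
        if err != nil {
            fmt.Println("decode error:", err)
            continue
        }
        fmt.Printf("%q\n", raw) // the revision string is visible as ASCII in the output
    }
}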
j
great
d
If I wasn't in the middle of renovating a house right now, with my first child due in 3 weeks, I'd want to make the contribution myself. A little short on time for getting my Go up to scratch though... 🙂 Will there be an issue I can follow somewhere to see the final fix?
j
yep
I plan to file it in a moment
once I am able to repro
repro'ed
fixing now
d
Thanks again for your time, and the fix.
j
of course
d
Not urgent, but are you planning to put out a docker image update / release soon?
j
we'll be cutting an RC, probably early next week