postgres.parseRevisionDecimal - makeslice: cap out of range
# spicedb
d
Hello again, we're intermittently seeing a panic: runtime error: makeslice: cap out of range from github.com/authzed/spicedb/internal/datastore/postgres.parseRevisionDecimal({0xc000bce280?, 0x1?}) (larger trace in thread). We suspect this is because we have some consistency tokens stored from when we were on serverless that don't match our self-hosted db, but wanted to: 1. check that this is a likely cause of the error we're seeing, and 2. report that this takes out our entire SpiceDB cluster when it happens. We haven't isolated yet whether this is because we retry the permission check on a service failure, so we quickly cycle through the available spicedb nodes until they're all dead, or whether the group failure is the result of internal cluster communication.
Slightly larger trace (2023-09-18 16:15:10):

panic: runtime error: makeslice: cap out of range

goroutine 811 [running]:
github.com/authzed/spicedb/internal/datastore/postgres.parseRevisionDecimal({0xc000bce280?, 0x1?})
    /home/runner/work/spicedb/spicedb/internal/datastore/postgres/revisions.go:206 +0x1ba
github.com/authzed/spicedb/internal/datastore/postgres.parseRevision({0xc000bce280, 0x1e})
    /home/runner/work/spicedb/spicedb/internal/datastore/postgres/revisions.go:132 +0x4f
github.com/authzed/spicedb/internal/datastore/postgres.(*pgDatastore).RevisionFromString(0xc0011dc6e8?, {0xc000bce280?, 0xc000a8a8c0?})
    /home/runner/work/spicedb/spicedb/internal/datastore/postgres/revisions.go:126 +0x25
Likely cause: https://github.com/authzed/spicedb/blob/main/internal/datastore/postgres/revisions.go#L206 where xmax-xmin is producing an invalid cap, similar to https://github.com/golang/go/issues/52783
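Roughly the failure mode, as a standalone sketch (hypothetical values standing in for whatever the parser pulls out of the token; not the actual SpiceDB code):

package main

import "fmt"

func main() {
    // Hypothetical values: a nanosecond-timestamp-sized xmax against a small
    // xmin still satisfies xmax > xmin, but the difference is far larger than
    // anything make() can allocate.
    var xmin, xmax uint64 = 1, 1693540940373045727

    defer func() {
        if r := recover(); r != nil {
            fmt.Println("recovered:", r) // runtime error: makeslice: cap out of range
        }
    }()

    if xmax > xmin {
        _ = make([]uint64, 0, xmax-xmin) // compiles fine; only the runtime rejects the cap
    }
}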
j
do you have an example of a token that fails that we can look at?
it should be noted zedtokens are not compatible across datastore types
and serverless does not use Postgres
d
I'll try to dig one up. We might have a record of a zedtoken that was sent with the failing requests. Just nuked them all in our metadata db when we realised what was happening though. There'll be a snapshot of the db from before that lying around somewhere. Makes sense with the switching datastores too.
j
appreciate it - the parsing code should never panic
it should just return an error saying "this isn't a valid zedtoken"
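for reference, the sort of guard that surfaces an error instead of panicking might look roughly like this (a sketch with hypothetical names, not the actual SpiceDB patch):

package main

import (
    "errors"
    "fmt"
)

// buildXipList is a hypothetical stand-in for the allocation inside
// parseRevisionDecimal: validate the parsed values before calling make()
// so a malformed token surfaces as an ordinary error rather than a panic.
func buildXipList(xmin, xmax uint64) ([]uint64, error) {
    const maxGap = 1 << 20 // hypothetical sanity bound on xmax-xmin
    if xmax < xmin || xmax-xmin > maxGap {
        return nil, errors.New("this isn't a valid zedtoken")
    }
    return make([]uint64, 0, xmax-xmin), nil
}

func main() {
    if _, err := buildXipList(1, 1693540940373045727); err != nil {
        fmt.Println(err) // this isn't a valid zedtoken
    }
}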
d
Many we have look like GhUKEzE2OTM1NDA5NDQ5NTk3MjA1OTI=. These are handled fine, and just cause a “revision was invalid” error, or a fallback to another consistency mode. Some zedtokens are of the format GiAKHjE2OTM1NDA5NDAzNzMwNDU3MjcuMDAwMDAwMDAwMQ==. That one nukes an instance.
zed --endpoint=spicedb.us.com:443 --token=<auth-token> --permissions-system perms/ relationship read perms/application:1 --consistency-at-least GiAKHjE2OTM1NDA5NDAzNzMwNDU3MjcuMDAwMDAwMDAwMQ==

Error: rpc error: code = Unavailable desc = unexpected HTTP status code received from server: 502 (Bad Gateway); transport: received unexpected content-type "text/html"
Same effect with --consistency-at-exactly. Both take out the spicedb pod that received the request, with the same error shown above. The request only takes out the one pod though. The issue we saw with all pods in the SpiceDbCluster failing must have been the result of us retrying, or a burst of similar requests coming in.
huh. that token of death is a nanosecond timestamp, vs an int timestamp. I guess that's how it makes it through the if max>min on L205. I thought go was stricter with its typing than that, but 🤷‍♂️
j
yeah
I suspected as much
I'll get this fixed
do you mind if I use your sample token in the unit test?
d
nah, I think that's fine... it's just a base64-encoded float.
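you can eyeball what's inside without any SpiceDB code; a quick sketch that just decodes the two tokens from this thread (the decimal revision shows up as readable text after a short binary prefix):

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    // The two tokens quoted above: the first parses fine, the second panics.
    for _, tok := range []string{
        "GhUKEzE2OTM1NDA5NDQ5NTk3MjA1OTI=",
        "GiAKHjE2OTM1NDA5NDAzNzMwNDU3MjcuMDAwMDAwMDAwMQ==",
    } {
        raw, err := base64.StdEncoding.DecodeString(tok)
        if err != nil {
            fmt.Println("decode error:", err)
            continue
        }
        fmt.Printf("%q\n", raw) // the revision string is visible as ASCII in the output
    }
}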
j
great
d
If I wasn't in the middle of renovating a house right now, with my first child due in 3 weeks, I'd want to make the contribution myself. A little short on time for getting my Go up to scratch though... 🙂 Will there be an issue I can follow somewhere to see the final fix?
j
yep
I plan to file it in a moment
once I am able to repro
repro'ed
fixing now
d
Thanks again for your time, and the fix.
j
of course
d
Not urgent, but are you planning to put out a docker image update / release soon?
j
we'll be cutting an RC, probably early next week