# spicedb
t
I've got spicedb operator running on my AWS setup. It seemed to be working up until last week when I suddenly started getting GRPC Context Cancelled errors from all my permission checks. I don't see anything obvious on the k8s logs, and I'm wondering what's the best approach to debug what's going on
e
how are you routing traffic to spicedb?
it sounds like you might have an issue with a load balancer not draining connections properly
t
I'm not actually sure how it's being routed
Looks like we use ambassador. I'm reading up on it now
e
do checks start working if you roll your application?
t
nope
e
you can set logLevel: debug in the config block of the SpiceDBCluster object to turn on debug logs
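for reference, a minimal SpiceDBCluster with that set would look roughly like this (names and the secret are placeholders, keep your existing config keys):
apiVersion: authzed.com/v1alpha1
kind: SpiceDBCluster
metadata:
  name: dev                        # placeholder cluster name
spec:
  config:
    datastoreEngine: postgres
    logLevel: debug                # enables debug logging
  secretName: dev-spicedb-config   # placeholder secret holding the preshared key / datastore URI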
t
okay, let me do that
So these are the error messages I'm seeing from my "LookupPermissions" request
rue {"level":"debug","requestID":"546bf1837d6b4a888e5d7c9724ac2381","now":"2024-05-21T19:01:00Z","time":"2024-05-21T19:01:00Z","message":"computing new revision"}
                               {"level":"debug","requestID":"546bf1837d6b4a888e5d7c9724ac2381","now":"2024-05-21T19:01:00Z","valid":"2024-05-21T19:01:04Z","validFor":"4.036462s","time":"2024-05-21T19:01:00Z","message":"setting valid through"}
2024-05-21T19:01:30.962079603Z {"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"LookupResources","grpc.method_type":"server_stream","requestID":"546bf1837d6b4a888e5d7c9724ac2381","peer.address":"10.3.2.175:41890","grpc.start_time":"2024-05-21T19:01:00Z","grpc.code":"Canceled","grpc.error":"rpc error: code = Canceled desc = context canceled","grpc.time_ms":30001,"time":"2024-05-21T19:01:30Z","message":"finished call"}
The same is true for my LookupResources requests
I am not currently using or storing ZedTokens anywhere. I'm not sure if that might be a factor
j
is it showing the same error for CheckPermission?
and is that peer.address correct for a running pod?
t
Getting a DeadlineExceeded error for CheckPermission
e
what datastore are you using? have you looked at its metrics / utilization?
t
postgres backend and not yet
hmm should the peer address be for spicedb?
if so, then kubectl describe is giving me a different IP
e
if you're getting the deadlineexceeded code back at the client then the networking is probably fine
based on the logs and the deadlineexceeded on check I'm guessing your DB is overloaded
t
I'm not seeing anything obvious on AWS from the dashboard
only thing that seems odd is that the active connection count seems high
the DB being overloaded also seems strange, because we're not high volume. The requests I'm making should be the only ones hitting spicedb
One thing is that I did do a large relationship write request when migrating our existing policies at the beginning of the week
I deleted and recreated the spicedb pods, but I didn't reset the datastore
v
The request is timing out at 30s. What SpiceDB version are you using?
t
1.29.5
j
if you migrated over more data it's possible it's timing out due to having to do more work now
unlikely though
if the checks themselves are the same
is it all checks that fail?
t
yeah
the only thing working is read schema
j
which doesn't dispatch
it does feel like you have a dispatcher issue
how is it configured on the CLI or in the pod config?
@tekky you could try a Read Relationships call too; if that works, then it's definitely dispatch
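e.g. with the zed CLI (assuming a zed context pointing at your cluster; the resource type is a placeholder):
zed relationship read someresourcetype   # substitute one of your own resource types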
t
Read Relationship returns, but it's empty
but no error
j
so yeah
dispatching
it's most likely there is some sort of networking issue between the pods
we generally recommend running dispatching without any form of sidecar or intermediate networking
we've had a few reports that Istio can cause massive performance degradation
t
dispatching is spicedb -> datastore?
I'm talking with our AWS admin, and he says that we're just service -> pod. no sidecar
j
dispatching is spicedb->spicedb
your datastore access seems to be working
how is your dispatcher configured on your SpiceDB pods?
t
how do I check that?
j
it should be in the ENV vars or the CLI flags of the pod and contain dispatch in the names
if you're running on Kube, we do recommend using the operator, since it handles all of this for you
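e.g. this should dump them (pod name is a placeholder):
kubectl describe pod <spicedb-pod> | grep -i dispatch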
t
we are using the operator
SPICEDB_DISPATCH_UPSTREAM_ADDR:          kubernetes:///<app>.default:dispatch
SPICEDB_DASHBOARD_TLS_CERT_PATH:         /tls/tls.crt
SPICEDB_DASHBOARD_TLS_KEY_PATH:          /tls/tls.key
SPICEDB_DATASTORE_ENGINE:                postgres
SPICEDB_DISPATCH_CLUSTER_ENABLED:        true
SPICEDB_DISPATCH_CLUSTER_TLS_CERT_PATH:  /tls/tls.crt
SPICEDB_DISPATCH_CLUSTER_TLS_KEY_PATH:   /tls/tls.key
j
hmm
e
it could be an issue with peer discovery...if you set these env vars we'll get more debug info:
- name: GRPC_GO_LOG_SEVERITY_LEVEL
  value: info
- name: GRPC_GO_LOG_VERBOSITY_LEVEL
  value: "2"
the first few dozen logs will show GRPC discovering peers - we should see the other spicedb nodes there
you said that you did a rollout restart and they came back up healthy? normally if there's a tls / sidecar config issue and not a discovery issue, rollout restart will fail to bring up new nodes.
if they all come up healthy it may be that the nodes are not finding peers at all
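i.e. something like this (deployment name is a placeholder for whatever the operator created):
kubectl rollout restart deployment/<spicedb-deployment>
kubectl rollout status deployment/<spicedb-deployment>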
t
oh i didn't do a rollout, just a pod delete to force a restart
e
that should have worked too
(but I don't think that's your issue, just wanted to rule it out)
t
what logs should I filter for?
e
the grpc ones - they're from the underlying gRPC library and have a different format
just filtering for the string grpc should work IIRC
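e.g. (pod name is a placeholder; the JSON lines carry grpc fields and the library lines are prefixed with INFO:):
kubectl logs <spicedb-pod> | grep -iE 'grpc|INFO:'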
e
do you see any Endpoints lists with more than one entry?
how many spicedb pods are you running?
t
2 are up as replicas
all Endpoints have one entry
I do see a lot of these messages later on:
2024/05/21 20:52:33 INFO: [transport] [client-transport 0xc000dec6c0] loopyWriter exiting with error: transport closed by client
2024/05/21 20:52:33 INFO: [transport] [server-transport 0xc000838680] Closing: EOF
2024/05/21 20:52:33 INFO: [transport] [server-transport 0xc000838680] loopyWriter exiting with error: connection error: desc = "transport is closing"
2024/05/21 20:52:34 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to IDLE
2024/05/21 20:52:34 INFO: [transport] [client-transport 0xc000dec480] Closing: connection error: desc = "received goaway and there are no active streams"
2024/05/21 20:52:34 INFO: [core] [pick-first-lb 0xc0008bb4d0] Received SubConn state update: 0xc0008bb650, {ConnectivityState:IDLE ConnectionError:<nil>}
2024/05/21 20:52:34 INFO: [core] [Channel #4] Channel Connectivity change to IDLE
2024/05/21 20:52:34 INFO: [transport] [client-transport 0xc000dec480] loopyWriter exiting with error: transport closed by client
e
yeah the nodes aren't finding each other
you're running on plain EKS? no service mesh?
if you have 2 nodes you should see two addresses in the endpoint list
you can also check the kube api
kubectl get endpoints
you'll want to make sure there's an endpoint for each of your spicedb pods
and also check the service object's status and make sure it looks okay
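e.g. (service name is a placeholder for whatever the operator created):
kubectl get endpoints <spicedb-service> -o wide
kubectl describe service <spicedb-service>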
t
Looks like I see 3?
e
in the grpc logs or in the kube api?
t
kube api
ah, nvm it's 1 with multiple ports
e
ah so if you only see one there, something is funky with your kube service
are all of the spicedb pods reporting healthy?
t
ah, looks like our AWS admin is also debugging and reduced the replicas to 1
e
ahh
t
still having the same issue
Hmm so kube has one endpoint listed for spicedb, but in the logs I'm seeing it trying to forward it to another IP:
2024/05/21 21:21:09 INFO: [transport] [server-transport 0xc000b86680] Closing: read tcp 10.3.2.158:50051->10.3.2.196:50278: read: connection reset by peer
2024/05/21 21:21:09 INFO: [transport] [server-transport 0xc000b86680] Error closing underlying net.Conn during Close: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.3.2.158:50051->10.3.2.196:50278: write: broken pipe
2024/05/21 21:21:09 INFO: [core] [Server #14] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "ServerHandshake(\"10.3.2.196:50300\") failed: read tcp 10.3.2.158:50051->10.3.2.196:50300: read: connection reset by peer"
158 is the address listed by k8s
So even with one replica I'm still seeing the same timeout, so I'm not sure if we should be seeing the dispatcher issue anymore
j
@tekky if dispatch is still enabled, yes
it would just try to dispatch to itself over the network
(sorry for the delayed response)
t
np
So at this point we can rule out client -> spicedb and spicedb -> postgres as causes?
j
pretty much
it seems like it's spicedb->spicedb
otherwise ReadSchema should fail
you could try turning off dispatch entirely
SPICEDB_DISPATCH_CLUSTER_ENABLED:        true
set this to false and it'll only ever compute on each node independently
but your cache reuse will go down quite a bit
t
turned off dispatch but I'm still getting the issue
j
same exact error?
t
yeah
j
do you have other traffic going to the cluster?
or just the check you're trying?
t
so I keep seeing new permissions check log messages even though mine timed out
there might be other traffic
okay now all other traffic has stopped
j
okay
and with just your own call
the check still fails?
t
yeah
j
okay, let's further isolate
can you check a relation that is not recursive?
e.g. if you have, say, relation viewer: user, can you check resource:whatever viewer user:expectedtobethere and see if it works?
that's about as simple a check as you'll find
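with zed that check would be something like this (assuming a configured context; resource, permission, and user are the placeholders from above):
zed permission check resource:whatever viewer user:expectedtobethere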
t
still failing
j
so to summarize: single pod, dispatch disabled, check of a direct relation, fail?
t
if the relationship doesn't exist, would that make a difference?
j
no
it should return "no permission"
not fail with an error
so long as the relation exists
t
yeah, then there's no difference
j
but ReadSchema works?
even after restarting the pod?
t
yup
j
weird
can you paste the logs of the CheckPermission failure from the SpiceDB side?
t
2024/05/21 23:49:59 INFO: [transport] [server-transport 0xc000efdd40] Closing: EOF
2024/05/21 23:49:59 INFO: [transport] [server-transport 0xc000efdd40] loopyWriter exiting with error: transport closed by client
2024/05/21 23:49:59 INFO: [transport] [server-transport 0xc000efdba0] Closing: EOF
2024/05/21 23:49:59 INFO: [transport] [server-transport 0xc000efdba0] loopyWriter exiting with error: transport closed by client
2024/05/21 23:50:09 INFO: [transport] [server-transport 0xc00066c340] Closing: EOF
2024/05/21 23:50:09 INFO: [transport] [server-transport 0xc00066c340] loopyWriter exiting with error: transport closed by client
2024/05/21 23:50:09 INFO: [transport] [server-transport 0xc00066dba0] Closing: EOF
2024/05/21 23:50:09 INFO: [transport] [server-transport 0xc00066dba0] loopyWriter exiting with error: transport closed by client
{"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"CheckPermission","grpc.method_type":"unary","requestID":"110b0df5fe5bbac5788893983302d6b2","peer.address":"127.0.0.1:37310","grpc.start_time":"2024-05-21T23:49:11Z","grpc.code":"DeadlineExceeded","grpc.error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","grpc.time_ms":60008,"time":"2024-05-21T23:50:11Z","message":"finished call"}
2024/05/21 23:50:11 INFO: [transport] [server-transport 0xc000efc680] loopyWriter exiting with error: finished processing active streams while in draining mode
2024/05/21 23:50:11 INFO: [transport] [server-transport 0xc000efc680] Closing: read tcp 127.0.0.1:50051->127.0.0.1:37310: use of closed network connection
2024/05/21 23:50:11 INFO: [transport] [server-transport 0xc000efc680] Error closing underlying net.Conn during Close: use of closed network connection
2024/05/21 23:50:19 INFO: [transport] [server-transport 0xc000e716c0] Closing: EOF
2024/05/21 23:50:19 INFO: [transport] [server-transport 0xc000e716c0] loopyWriter exiting with error: transport closed by client
j
> "grpc.time_ms":60008
so that's hitting the configured timeout
it still feels like dispatch is on
I hate to ask, but you're certain it's been changed?
the pod startup logs should indicate
> DispatchUpstreamTimeout=60000
that's the default
which exactly matches the timeout you're seeing
make sure to set DispatchUpstreamAddr to empty as well
t
Okay, I’ll have to do that later tonight/early tomorrow. I’ll give you an update tomorrow
j
sounds good
sorry for the issue
t
turns out you have to turn it off via config.dispatchEnabled as opposed to an env variable. It's now actually disabled and permission checks are working
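for reference, the change was roughly this on the SpiceDBCluster (a sketch, other config keys unchanged):
spec:
  config:
    dispatchEnabled: false   # instead of setting SPICEDB_DISPATCH_CLUSTER_ENABLED on the pod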
j
okay, that means there is something interfering at the network level then