# spicedb
t
I've got spicedb operator running on my AWS setup. It seemed to be working up until last week when I suddenly started getting GRPC Context Cancelled errors from all my permission checks. I don't see anything obvious on the k8s logs, and I'm wondering what's the best approach to debug what's going on
e
how are you routing traffic to spicedb?
it sounds like you might have an issue with a load balancer not draining connections properly
t
I'm not actually sure how it's being routed
Looks like we use ambassador. I'm reading up on it now
e
do checks start working if you roll your application?
t
nope
e
you can set logLevel: debug in the config block of the SpiceDBCluster object to turn on debug logs
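for reference, a minimal SpiceDBCluster with that set would look roughly like this (names and the secret are placeholders, keep your existing config keys):
apiVersion: authzed.com/v1alpha1
kind: SpiceDBCluster
metadata:
  name: dev                        # placeholder cluster name
spec:
  config:
    datastoreEngine: postgres
    logLevel: debug                # enables debug logging
  secretName: dev-spicedb-config   # placeholder secret holding the preshared key / datastore URI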
t
okay, let me do that
So these are the error messages I'm seeing from my "LookupPermissions" request
rue {"level":"debug","requestID":"546bf1837d6b4a888e5d7c9724ac2381","now":"2024-05-21T19:01:00Z","time":"2024-05-21T19:01:00Z","message":"computing new revision"}
                               {"level":"debug","requestID":"546bf1837d6b4a888e5d7c9724ac2381","now":"2024-05-21T19:01:00Z","valid":"2024-05-21T19:01:04Z","validFor":"4.036462s","time":"2024-05-21T19:01:00Z","message":"setting valid through"}
2024-05-21T19:01:30.962079603Z {"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"LookupResources","grpc.method_type":"server_stream","requestID":"546bf1837d6b4a888e5d7c9724ac2381","peer.address":"10.3.2.175:41890","grpc.start_time":"2024-05-21T19:01:00Z","grpc.code":"Canceled","grpc.error":"rpc error: code = Canceled desc = context canceled","grpc.time_ms":30001,"time":"2024-05-21T19:01:30Z","message":"finished call"}
The same is true for my LookupResources requests
I am not currently using or storing ZedTokens anywhere. I'm not sure if that might be a factor
j
is it showing the same error for CheckPermission?
and is that peer.address correct for a running pod?
t
Getting a DeadlineExceeded error for CheckPermission
e
what datastore are you using? have you looked at its metrics / utilization?
t
postgres backend and not yet
hmm should the peer address be for spicedb?
if so, then kubectl describe is giving me a different IP
e
if you're getting the deadlineexceeded code back at the client then the networking is probably fine
based on the logs and the deadlineexceeded on check I'm guessing your DB is overloaded
t
I'm not seeing anything obvious on AWS from the dashboard
only thing that seems odd is that the active connection count seems high
the DB being overloaded also seems strange, because we're not high volume. The requests I'm making should be the only ones hitting spicedb
One thing is that I did do a large relationship write request when migrating our existing policies at the beginning of the week
I deleted and recreated the spicedb pods, but I didn't reset the datastore
v
The request is timing out at 30s. What SpiceDB version are you using?
t
1.29.5
j
if you migrated over more data it's possible it's timing out due to having to do more work now
unlikely though
if the checks themselves are the same
is it all checks that fail?
t
yeah
the only thing working is read schema
j
which doesn't dispatch
it does feel like you have a dispatcher issue
how is it configured on the CLI or in the pod config?
@tekky you could try a Read Relationships call too; if that works, then it's definitely dispatch
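e.g. with the zed CLI (assuming a zed context pointing at your cluster; the resource type is a placeholder):
zed relationship read someresourcetype   # substitute one of your own resource types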
t
Read Relationship returns, but it's empty
but no error
j
so yeah
dispatching
it's most likely there is some sort of networking issue between the pods
we generally recommend running dispatching without any form of sidecar or intermediate networking
we've had a few reports that Istio can cause massive performance degradation
t
dispatching is spicedb -> datastore?
I'm talking with our AWS admin, and he says that we're just service -> pod. no sidecar
j
dispatching is spicedb->spicedb
your datastore access seems to be working
how is your dispatcher configured on your SpiceDB pods?
t
how do I check that?
j
it should be in the ENV vars or the CLI flags of the pod and contain dispatch in the names
if you're running on Kube, we do recommend using the operator, since it handles all of this for you
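e.g. this should dump them (pod name is a placeholder):
kubectl describe pod <spicedb-pod> | grep -i dispatch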
t
we are using the operator
SPICEDB_DISPATCH_UPSTREAM_ADDR:          kubernetes:///<app>.default:dispatch
SPICEDB_DASHBOARD_TLS_CERT_PATH:         /tls/tls.crt
SPICEDB_DASHBOARD_TLS_KEY_PATH:          /tls/tls.key
SPICEDB_DATASTORE_ENGINE:                postgres
SPICEDB_DISPATCH_CLUSTER_ENABLED:        true
SPICEDB_DISPATCH_CLUSTER_TLS_CERT_PATH:  /tls/tls.crt
SPICEDB_DISPATCH_CLUSTER_TLS_KEY_PATH:   /tls/tls.key
j
hmm
e
it could be an issue with peer discovery...if you set these env vars we'll get more debug info:
- name: GRPC_GO_LOG_SEVERITY_LEVEL
  value: info
- name: GRPC_GO_LOG_VERBOSITY_LEVEL
  value: "2"
the first few dozen logs will show GRPC discovering peers - we should see the other spicedb nodes there
you said that you did a rollout restart and they came back up healthy? normally if there's a tls / sidecar config issue and not a discovery issue, rollout restart will fail to bring up new nodes.
if they all come up healthy it may be that the nodes are not finding peers at all
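i.e. something like this (deployment name is a placeholder for whatever the operator created):
kubectl rollout restart deployment/<spicedb-deployment>
kubectl rollout status deployment/<spicedb-deployment>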
t
oh i didn't do a rollout, just a pod delete to force a restart
e
that should have worked too
(but I don't think that's your issue, just wanted to rule it out)
t
what logs should I filter for?
e
the grpc ones - they're from the underlying gRPC library and have a different format
just filtering for the string grpc should work IIRC
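e.g. (pod name is a placeholder; the JSON lines carry grpc fields and the library lines are prefixed with INFO:):
kubectl logs <spicedb-pod> | grep -iE 'grpc|INFO:'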
e
do you see any Endpoints lists with more than one entry?
how many spicedb pods are you running?
t
2 are up as replicas
all Endpoints have one entry
I do see a lot of these messages later on:
2024/05/21 20:52:33 INFO: [transport] [client-transport 0xc000dec6c0] loopyWriter exiting with error: transport closed by client
2024/05/21 20:52:33 INFO: [transport] [server-transport 0xc000838680] Closing: EOF
2024/05/21 20:52:33 INFO: [transport] [server-transport 0xc000838680] loopyWriter exiting with error: connection error: desc = "transport is closing"
2024/05/21 20:52:34 INFO: [core] [Channel #4 SubChannel #5] Subchannel Connectivity change to IDLE
2024/05/21 20:52:34 INFO: [transport] [client-transport 0xc000dec480] Closing: connection error: desc = "received goaway and there are no active streams"
2024/05/21 20:52:34 INFO: [core] [pick-first-lb 0xc0008bb4d0] Received SubConn state update: 0xc0008bb650, {ConnectivityState:IDLE ConnectionError:<nil>}
2024/05/21 20:52:34 INFO: [core] [Channel #4] Channel Connectivity change to IDLE
2024/05/21 20:52:34 INFO: [transport] [client-transport 0xc000dec480] loopyWriter exiting with error: transport closed by client
e
yeah the nodes aren't finding each other
you're running on plain EKS? no service mesh?
if you have 2 nodes you should see two addresses in the endpoint list
you can also check the kube api
kubectl get endpoints
you'll want to make sure there's an endpoint for each of your spicedb pods
and also check the service object's status and make sure it looks okay
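e.g. (service name is a placeholder for whatever the operator created):
kubectl get endpoints <spicedb-service> -o wide
kubectl describe service <spicedb-service>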
t
Looks like I see 3?
e
in the grpc logs or in the kube api?
t
kube api
ah, nvm it's 1 with multiple ports
e
ah so if you only see one there, something is funky with your kube service
are all of the spicedb pods reporting healthy?
t
ah, looks like our AWS admin is also debugging and reduced the replicas to 1
e
ahh
t
still having the same issue
Hmm so kube has one endpoint listed for spicedb, but in the logs I'm seeing it trying to forward it to another IP:
2024/05/21 21:21:09 INFO: [transport] [server-transport 0xc000b86680] Closing: read tcp 10.3.2.158:50051->10.3.2.196:50278: read: connection reset by peer
2024/05/21 21:21:09 INFO: [transport] [server-transport 0xc000b86680] Error closing underlying net.Conn during Close: tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.3.2.158:50051->10.3.2.196:50278: write: broken pipe
2024/05/21 21:21:09 INFO: [core] [Server #14] grpc: Server.Serve failed to create ServerTransport: connection error: desc = "ServerHandshake(\"10.3.2.196:50300\") failed: read tcp 10.3.2.158:50051->10.3.2.196:50300: read: connection reset by peer"
158 is the address listed by k8s
So even with one replica I'm still seeing the same timeout, so I'm not sure if we should be seeing the dispatcher issue anymore
j
@tekky if dispatch is still enabled, yes
it would just try to dispatch to itself over the network
(sorry for the delayed response)
t
np
So at this point we can rule out client -> spicedb and spicedb -> postgres as causes?
j
pretty much
it seems like it's spicedb->spicedb
otherwise ReadSchema should fail
you could try turning off dispatch entirely
SPICEDB_DISPATCH_CLUSTER_ENABLED:        true
set this to false and it'll only ever compute on each node independently
but your cache reuse will go down quite a bit
t
turned off dispatch but I'm still getting the issue
j
same exact error?
t
yeah
j
do you have other traffic going to the cluster?
or just the check you're trying?
t
so I keep seeing new permissions check log messages even though mine timed out
there might be other traffic
okay now all other traffic has stopped
j
okay
and with just your own call
the check still fails?
t
yeah
j
okay, let's further isolate
can you check a relation that is not recursive?
e.g. if you have, say, relation viewer: user, can you check resource:whatever viewer user:expectedtobethere and see if it works?
that's about as simple a check as you'll find
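with zed that check would be something like this (assuming a configured context; resource, permission, and user are the placeholders from above):
zed permission check resource:whatever viewer user:expectedtobethere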
t
still failing
j
so to summarize: single pod, dispatch disabled, check of a direct relation, fail?
t
if the relationship doesn't exist, would that make a difference?
j
no
it should return "no permission"
not fail with an error
so long as the relation exists
t
yeah, then there's no difference
j
but ReadSchema works?
even after restarting the pod?
t
yup
j
weird
can you paste the logs of the CheckPermission failure from the SpiceDB side?
t
2024/05/21 23:49:59 INFO: [transport] [server-transport 0xc000efdd40] Closing: EOF
2024/05/21 23:49:59 INFO: [transport] [server-transport 0xc000efdd40] loopyWriter exiting with error: transport closed by client
2024/05/21 23:49:59 INFO: [transport] [server-transport 0xc000efdba0] Closing: EOF
2024/05/21 23:49:59 INFO: [transport] [server-transport 0xc000efdba0] loopyWriter exiting with error: transport closed by client
2024/05/21 23:50:09 INFO: [transport] [server-transport 0xc00066c340] Closing: EOF
2024/05/21 23:50:09 INFO: [transport] [server-transport 0xc00066c340] loopyWriter exiting with error: transport closed by client
2024/05/21 23:50:09 INFO: [transport] [server-transport 0xc00066dba0] Closing: EOF
2024/05/21 23:50:09 INFO: [transport] [server-transport 0xc00066dba0] loopyWriter exiting with error: transport closed by client
{"level":"info","protocol":"grpc","grpc.component":"server","grpc.service":"authzed.api.v1.PermissionsService","grpc.method":"CheckPermission","grpc.method_type":"unary","requestID":"110b0df5fe5bbac5788893983302d6b2","peer.address":"127.0.0.1:37310","grpc.start_time":"2024-05-21T23:49:11Z","grpc.code":"DeadlineExceeded","grpc.error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","grpc.time_ms":60008,"time":"2024-05-21T23:50:11Z","message":"finished call"}
2024/05/21 23:50:11 INFO: [transport] [server-transport 0xc000efc680] loopyWriter exiting with error: finished processing active streams while in draining mode
2024/05/21 23:50:11 INFO: [transport] [server-transport 0xc000efc680] Closing: read tcp 127.0.0.1:50051->127.0.0.1:37310: use of closed network connection
2024/05/21 23:50:11 INFO: [transport] [server-transport 0xc000efc680] Error closing underlying net.Conn during Close: use of closed network connection
2024/05/21 23:50:19 INFO: [transport] [server-transport 0xc000e716c0] Closing: EOF
2024/05/21 23:50:19 INFO: [transport] [server-transport 0xc000e716c0] loopyWriter exiting with error: transport closed by client
j
> "grpc.time_ms":60008
so that's hitting the configured timeout
it still feels like dispatch is on
I hate to ask, but you're certain it's been changed?
the pod startup logs should indicate
> DispatchUpstreamTimeout=60000
that's the default
which exactly matches the timeout you're seeing
make sure to set DispatchUpstreamAddr to empty as well
t
Okay, I’ll have to do that later tonight/early tomorrow. I’ll give you an update tomorrow
j
sounds good
sorry for the issue
t
turns out you have to turn it off via config.dispatchEnabled as opposed to an env variable. It's now actually disabled and permission checks are working
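for reference, the change was roughly this on the SpiceDBCluster (a sketch, other config keys unchanged):
spec:
  config:
    dispatchEnabled: false   # instead of setting SPICEDB_DISPATCH_CLUSTER_ENABLED on the pod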
j
okay, that means there is something interfering at the network level then