# spicedb
r
I have a deployment of spicedb + postgresql on kubernetes. I'm not using the operator. I noticed that after a while all pods except one become unready. Recently I upgraded to using TLS everywhere; I'm not sure if that is related. This is what I see in the logs of the unready pods:
```json
{"level":"debug","datastoreReady":true,"dispatchReady":true,"time":"2022-10-17T17:14:38Z","message":"completed dispatcher and datastore readiness checks"}
{"level":"info","grpc.component":"server","grpc.method":"Check","grpc.method_type":"unary","grpc.service":"grpc.health.v1.Health","peer.address":"[::1]:56680","protocol":"grpc","requestID":"029dcbed3eb352a21117e9de95e9e401","grpc.request.deadline":"2022-10-17T17:14:39Z","grpc.start_time":"2022-10-17T17:14:38Z","grpc.code":"OK","grpc.time_ms":"0.015","time":"2022-10-17T17:14:38Z","message":"started call"}
```
a bunch of these and then ...
{"level":"info","time":"2022-10-17T17:16:27Z","message":"received interrupt"}
{"level":"info","time":"2022-10-17T17:16:27Z","message":"shutting down"}
{"level":"warn","error":"context canceled","time":"2022-10-17T17:16:27Z","message":"completed shutdown of postgres datastore"}
{"level":"info","addr":":50051","network":"tcp","prefix":"grpc","time":"2022-10-17T17:16:27Z","message":"grpc server stopped listening"}
{"level":"info","addr":":8080","prefix":"dashboard","time":"2022-10-17T17:16:27Z","message":"http server stopped serving"}
{"level":"info","addr":":9090","prefix":"metrics","time":"2022-10-17T17:16:27Z","message":"http server stopped serving"}
{"level":"info","addr":":50053","network":"tcp","prefix":"dispatch-cluster","time":"2022-10-17T17:16:27Z","message":"grpc server stopped listening"}
can someone help me understand what is happening?
e
are there any events for the pods? kubernetes may be killing them / moving them
r
no...
what is a dispatcher? how is that special relative to the other pods? how is it elected?
```json
{
  "level": "debug",
  "datastoreReady": true,
  "dispatchReady": false,
  "time": "2022-10-17T17:21:11Z",
  "message": "completed dispatcher and datastore readiness checks"
}
```
e
"dispatch" is the name we use for the connection between peer spicedb nodes
so that's saying there's a problem with spicedb talking to other pods
they all talk to each other, there's no leader
but this log is different from the first one you shared, which showed `dispatchReady: true`
> {"level":"info","time":"2022-10-17T17:16:27Z","message":"received interrupt"}

I'm not aware of any way this log can happen without an external signal (kubelet killing the pod, maybe oom?)
r
ok, so here are my dispatcher-related properties:
```yaml
- name: SPICEDB_DISPATCH_UPSTREAM_ADDR
  value: 'kubernetes:///myspicedb.spicedb-test:dispatch'
- name: SPICEDB_DISPATCH_CLUSTER_ENABLED
  value: 'true'
```
and for the command line arguments:
```yaml
- '--dispatch-cluster-tls-cert-path'
- /certs/tls.crt
- '--dispatch-cluster-tls-key-path'
- /certs/tls.key
- '--dispatch-upstream-ca-path'
- /ca/service-ca.crt
```
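for context, a minimal sketch of how those fragments might sit together in the Deployment's container spec; only the paths and values above come from this thread, the rest (image, command) is assumed:

```yaml
containers:
  - name: spicedb
    image: authzed/spicedb  # assumed; pin to your actual tag
    command: ['spicedb', 'serve']
    args:
      # serving cert/key for the dispatch-cluster listener
      - '--dispatch-cluster-tls-cert-path'
      - /certs/tls.crt
      - '--dispatch-cluster-tls-key-path'
      - /certs/tls.key
      # CA used to verify peers when dialing upstream
      - '--dispatch-upstream-ca-path'
      - /ca/service-ca.crt
    env:
      - name: SPICEDB_DISPATCH_UPSTREAM_ADDR
        value: 'kubernetes:///myspicedb.spicedb-test:dispatch'
      - name: SPICEDB_DISPATCH_CLUSTER_ENABLED
        value: 'true'
```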
e
what datastore are you using?
memory?
r
postgresql but that seems to be fine
e
oh sorry you said postgres
yeah
r
so when the dispatcher is not ready, the pod is also not ready, correct?
what is this notation: `kubernetes:///myspicedb.spicedb-test:dispatch`
what host is checked during the TLS handshake?
e
correct, but the check is only on startup. once it goes healthy once, it will stay healthy; we don't flip it back to unhealthy if there's a network issue. the exception is on the very first start of a cluster: it's possible for pods to not find any peers to connect to yet and therefore mark dispatch ready even though the connection hasn't been tested. (wrote this up here: https://github.com/authzed/spicedb/issues/814)
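a minimal sketch of how that check usually surfaces as pod readiness, assuming the image ships grpc_health_probe and gRPC is on the default :50051 (not your exact manifest):

```yaml
readinessProbe:
  exec:
    # grpc_health_probe calls the standard grpc.health.v1.Health service,
    # which is what produces the "Check" log lines above
    command: ['grpc_health_probe', '-addr=localhost:50051']
  periodSeconds: 10
```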
r
because I don't think my cert has that name. it will have `myspicedb.spicedb-test.svc`, which is more standard in kube
these are the names accepted by my cert: `myspicedb.spicedb-test.svc`, `myspicedb.spicedb-test.svc.cluster.local`
should I add an `.svc` there?
e
yeah they need to match. I think it will work if you add `.svc` to the url, but haven't tested
https://github.com/sercand/kuberesolver is used for address resolution
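for reference, the kuberesolver README documents target forms like the ones below, so `kubernetes:///myspicedb.spicedb-test:dispatch` should parse as service `myspicedb`, namespace `spicedb-test`, port name `dispatch` (my reading of the README, not tested here):

```
kubernetes:///service-name:8080
kubernetes:///service-name:portname
kubernetes:///service-name.namespace:8080
```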
actually reading it, I don't think that will work
I think you need a cert signed for the `svcName.svcNamespace` without `.svc`
r
You need a certificate with the name `service-name.namespace` in order to connect with TLS to your services. this is bad, I can't generate such a certificate...
sorry back to the thread.
e
what prevents you from making such a cert?
these are just pod-to-pod, no external traffic goes through it
r
I'm using an operator that only generates standard kube name certs
e
what operator? we use cert-manager with spicedb
but `svcName.svcNamespace` is an illegal name and should not be used...
it's not an illegal name; there must be a search rule that makes it work
so disregard my last statement.
so options are:
- manually make certs, or use something else like cert-manager (see the sketch after this list)
- update service-ca-operator to include the name without `.svc`
- update kuberesolver to resolve properly if dns suffixes are included
- or just run without TLS between nodes
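for the cert-manager route, a minimal sketch of a Certificate covering both the name kuberesolver dials and the standard kube names; the issuer here is a placeholder, not something from this thread:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myspicedb-dispatch
  namespace: spicedb-test
spec:
  secretName: myspicedb-dispatch-tls
  issuerRef:
    name: my-ca-issuer  # placeholder: whatever CA issuer you run
    kind: Issuer
  dnsNames:
    - myspicedb.spicedb-test  # the name kuberesolver dials
    - myspicedb.spicedb-test.svc
    - myspicedb.spicedb-test.svc.cluster.local
```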
r
it looks like it worked with the new certificate.
pods are still dying because of the liveness probe...
these should be the health check calls:
```json
{"level":"info","grpc.component":"server","grpc.method":"Check","grpc.method_type":"unary","grpc.service":"grpc.health.v1.Health","peer.address":"[::1]:60998","protocol":"grpc","requestID":"4031d98eafda2bb28d303b6b71de1587","grpc.request.deadline":"2022-10-17T18:51:06Z","grpc.start_time":"2022-10-17T18:51:05Z","grpc.code":"OK","grpc.time_ms":"0.018","time":"2022-10-17T18:51:05Z","message":"started call"}
{"level":"info","grpc.component":"server","grpc.method":"Check","grpc.method_type":"unary","grpc.service":"grpc.health.v1.Health","peer.address":"[::1]:60998","protocol":"grpc","requestID":"4031d98eafda2bb28d303b6b71de1587","grpc.request.deadline":"2022-10-17T18:51:06Z","grpc.start_time":"2022-10-17T18:51:05Z","grpc.code":"OK","grpc.time_ms":"0.077","time":"2022-10-17T18:51:05Z","message":"finished call"}
```
it's unclear but evidently they fail.
e
if you run with debug log level you should see slightly more detail
it looks like it's not trying to use tls though
you can also set
```yaml
- name: GRPC_GO_LOG_SEVERITY_LEVEL
  value: info
- name: GRPC_GO_LOG_VERBOSITY_LEVEL
  value: "2"
```
to get detailed debug info on the grpc connection
r
I fixed it. I had made only the readiness check TLS-aware, not the liveness check
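for anyone hitting the same thing, a sketch of TLS-aware probes, assuming grpc_health_probe is in the image and the service CA from earlier in the thread signed the serving cert (both assumptions, verify against your setup):

```yaml
# both probes must speak TLS once the gRPC endpoint serves TLS;
# -tls-server-name may be needed since the cert won't cover localhost
livenessProbe:
  exec:
    command:
      - grpc_health_probe
      - -addr=localhost:50051
      - -tls
      - -tls-ca-cert=/ca/service-ca.crt
      - -tls-server-name=myspicedb.spicedb-test.svc
readinessProbe:
  exec:
    command:
      - grpc_health_probe
      - -addr=localhost:50051
      - -tls
      - -tls-ca-cert=/ca/service-ca.crt
      - -tls-server-name=myspicedb.spicedb-test.svc
```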