# spicedb
r
I have a deployment of spicedb + postgresql on kubernetes. I'm not using the operator. I noticed that after a while all pods except one become unready. Recently I upgraded to using TLS everywhere; I'm not sure if that is related. This is what I see in the logs of the unready pods:
```json
{"level":"debug","datastoreReady":true,"dispatchReady":true,"time":"2022-10-17T17:14:38Z","message":"completed dispatcher and datastore readiness checks"}
{"level":"info","grpc.component":"server","grpc.method":"Check","grpc.method_type":"unary","grpc.service":"grpc.health.v1.Health","peer.address":"[::1]:56680","protocol":"grpc","requestID":"029dcbed3eb352a21117e9de95e9e401","grpc.request.deadline":"2022-10-17T17:14:39Z","grpc.start_time":"2022-10-17T17:14:38Z","grpc.code":"OK","grpc.time_ms":"0.015","time":"2022-10-17T17:14:38Z","message":"started call"}
```
a bunch of these and then ...
{"level":"info","time":"2022-10-17T17:16:27Z","message":"received interrupt"}
{"level":"info","time":"2022-10-17T17:16:27Z","message":"shutting down"}
{"level":"warn","error":"context canceled","time":"2022-10-17T17:16:27Z","message":"completed shutdown of postgres datastore"}
{"level":"info","addr":":50051","network":"tcp","prefix":"grpc","time":"2022-10-17T17:16:27Z","message":"grpc server stopped listening"}
{"level":"info","addr":":8080","prefix":"dashboard","time":"2022-10-17T17:16:27Z","message":"http server stopped serving"}
{"level":"info","addr":":9090","prefix":"metrics","time":"2022-10-17T17:16:27Z","message":"http server stopped serving"}
{"level":"info","addr":":50053","network":"tcp","prefix":"dispatch-cluster","time":"2022-10-17T17:16:27Z","message":"grpc server stopped listening"}
can someone help me understand what is happening?
e
are there any events for the pods? kubernetes may be killing them / moving them
r
no...
what is a dispatcher? how is that special relative to the other pods? how is it elected?
```json
{
  "level": "debug",
  "datastoreReady": true,
  "dispatchReady": false,
  "time": "2022-10-17T17:21:11Z",
  "message": "completed dispatcher and datastore readiness checks"
}
```
e
"dispatch" is the name we use for the connection between peer spicedb nodes
so that's saying there's a problem with spicedb talking to other pods
they all talk to each other, there's no leader
but this log is different from the first one you shared, which showed `dispatchReady: true`
> {"level":"info","time":"2022-10-17T17:16:27Z","message":"received interrupt"}

I'm not aware of any way this log can happen without an external signal (kubelet killing the pod, maybe oom?)
r
ok, so here are my dispatcher-related properties:
```yaml
- name: SPICEDB_DISPATCH_UPSTREAM_ADDR
  value: 'kubernetes:///myspicedb.spicedb-test:dispatch'
- name: SPICEDB_DISPATCH_CLUSTER_ENABLED
  value: 'true'
```
and for the command line arguments:
```yaml
- '--dispatch-cluster-tls-cert-path'
- /certs/tls.crt
- '--dispatch-cluster-tls-key-path'
- /certs/tls.key
- '--dispatch-upstream-ca-path'
- /ca/service-ca.crt
```
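for context, a minimal sketch of how those fragments might sit together in the Deployment's container spec; only the paths and values above come from this thread, the rest (image, command) is assumed:

```yaml
containers:
  - name: spicedb
    image: authzed/spicedb  # assumed; pin to your actual tag
    command: ['spicedb', 'serve']
    args:
      # serving cert/key for the dispatch-cluster listener
      - '--dispatch-cluster-tls-cert-path'
      - /certs/tls.crt
      - '--dispatch-cluster-tls-key-path'
      - /certs/tls.key
      # CA used to verify peers when dialing upstream
      - '--dispatch-upstream-ca-path'
      - /ca/service-ca.crt
    env:
      - name: SPICEDB_DISPATCH_UPSTREAM_ADDR
        value: 'kubernetes:///myspicedb.spicedb-test:dispatch'
      - name: SPICEDB_DISPATCH_CLUSTER_ENABLED
        value: 'true'
```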
e
what datastore are you using?
memory?
r
postgresql but that seems to be fine
e
oh sorry you said postgres
yeah
r
so when the dispatcher is not ready, the pod is also not ready, correct?
what is this notation: `kubernetes:///myspicedb.spicedb-test:dispatch`
what host is checked during the TLS handshake?
e
correct, but the check is only on startup. once it goes healthy once, it will stay healthy; we don't flip it back to unhealthy if there's a network issue. the exception is on the very first start of a cluster: it's possible for pods to not find any peers to connect to yet and therefore mark dispatch ready even though the connection hasn't been tested. (wrote this up here: https://github.com/authzed/spicedb/issues/814)
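a minimal sketch of how that check usually surfaces as pod readiness, assuming the image ships grpc_health_probe and gRPC is on the default :50051 (not your exact manifest):

```yaml
readinessProbe:
  exec:
    # grpc_health_probe calls the standard grpc.health.v1.Health service,
    # which is what produces the "Check" log lines above
    command: ['grpc_health_probe', '-addr=localhost:50051']
  periodSeconds: 10
```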
r
because I don't think my cert has that name. it will have `myspicedb.spicedb-test.svc`, which is more standard in kube
these are the names accepted by my cert: `myspicedb.spicedb-test.svc`, `myspicedb.spicedb-test.svc.cluster.local`
should I add an `.svc` there?
e
yeah they need to match. I think it will work if you add `.svc` to the url, but haven't tested
https://github.com/sercand/kuberesolver is used for address resolution
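for reference, the kuberesolver README documents target forms like the ones below, so `kubernetes:///myspicedb.spicedb-test:dispatch` should parse as service `myspicedb`, namespace `spicedb-test`, port name `dispatch` (my reading of the README, not tested here):

```
kubernetes:///service-name:8080
kubernetes:///service-name:portname
kubernetes:///service-name.namespace:8080
```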
actually reading it, I don't think that will work
I think you need a cert signed for the `svcName.svcNamespace` without `.svc`
r
You need a certificate with the name `service-name.namespace` in order to connect with TLS to your services. this is bad, I can't generate such a certificate...
sorry back to the thread.
e
what prevents you from making such a cert?
these are just pod-to-pod, no external traffic goes through it
r
I'm using an operator that only generates standard kube name certs
e
what operator? we use cert-manager with spicedb
but `svcName.svcNamespace` is an illegal name and should not be used...
it's not an illegal name; there must be a search rule that makes it work
so disregard my last statement.
so options are:
- manually make certs, or use something else like cert-manager (see the sketch after this list)
- update service-ca-operator to include the name without `.svc`
- update kuberesolver to resolve properly if dns suffixes are included
- or just run without TLS between nodes
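for the cert-manager route, a minimal sketch of a Certificate covering both the name kuberesolver dials and the standard kube names; the issuer here is a placeholder, not something from this thread:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: myspicedb-dispatch
  namespace: spicedb-test
spec:
  secretName: myspicedb-dispatch-tls
  issuerRef:
    name: my-ca-issuer  # placeholder: whatever CA issuer you run
    kind: Issuer
  dnsNames:
    - myspicedb.spicedb-test  # the name kuberesolver dials
    - myspicedb.spicedb-test.svc
    - myspicedb.spicedb-test.svc.cluster.local
```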
r
it looks like it worked with the new certificate.
pods are still dying because of the liveness probe...
these should be the health check calls:
```json
{"level":"info","grpc.component":"server","grpc.method":"Check","grpc.method_type":"unary","grpc.service":"grpc.health.v1.Health","peer.address":"[::1]:60998","protocol":"grpc","requestID":"4031d98eafda2bb28d303b6b71de1587","grpc.request.deadline":"2022-10-17T18:51:06Z","grpc.start_time":"2022-10-17T18:51:05Z","grpc.code":"OK","grpc.time_ms":"0.018","time":"2022-10-17T18:51:05Z","message":"started call"}
{"level":"info","grpc.component":"server","grpc.method":"Check","grpc.method_type":"unary","grpc.service":"grpc.health.v1.Health","peer.address":"[::1]:60998","protocol":"grpc","requestID":"4031d98eafda2bb28d303b6b71de1587","grpc.request.deadline":"2022-10-17T18:51:06Z","grpc.start_time":"2022-10-17T18:51:05Z","grpc.code":"OK","grpc.time_ms":"0.077","time":"2022-10-17T18:51:05Z","message":"finished call"}
```
it's unclear but evidently they fail.
e
if you run with debug log level you should see slightly more detail
it looks like it's not trying to use tls though
you can also set
```yaml
- name: GRPC_GO_LOG_SEVERITY_LEVEL
  value: info
- name: GRPC_GO_LOG_VERBOSITY_LEVEL
  value: "2"
```
to get detailed debug info on the grpc connection
r
I fixed it. I had made only the readiness check TLS-aware, not the liveness check
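for anyone hitting the same thing, a sketch of TLS-aware probes, assuming grpc_health_probe is in the image and the service CA from earlier in the thread signed the serving cert (both assumptions, verify against your setup):

```yaml
# both probes must speak TLS once the gRPC endpoint serves TLS;
# -tls-server-name may be needed since the cert won't cover localhost
livenessProbe:
  exec:
    command:
      - grpc_health_probe
      - -addr=localhost:50051
      - -tls
      - -tls-ca-cert=/ca/service-ca.crt
      - -tls-server-name=myspicedb.spicedb-test.svc
readinessProbe:
  exec:
    command:
      - grpc_health_probe
      - -addr=localhost:50051
      - -tls
      - -tls-ca-cert=/ca/service-ca.crt
      - -tls-server-name=myspicedb.spicedb-test.svc
```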