LookupResources -> service failure
# spicedb
d
Hello again. Our self-hosted cluster is mostly working beautifully with 3 replicas, thank you all for a great piece of software. We're using Postgres (AWS RDS) as a datastore, and the Authzed Python client to make the requests. Sadly we can't get time allocated to rebuild in Rust 😦.

We do manage to blow it up a few times a day though, primarily triggered (we think) by a LookupResources request: sometimes a request will suddenly take a very long time to resolve, i.e. 20-60 seconds, when that same query will normally take ~5 seconds to resolve, and does so again if sent immediately after the very-long (hung?) request resolves. CPU load on the SpiceDB instances stays at 5-10%, and the RDS instance doesn't seem under pressure either, but responses for even simple requests start to take an extremely long time. Once the cluster gets into this state, it seems to receive gRPC requests but be unable to work on them. We have traces from our authorization service (which uses the Python client) running for ~900 seconds; these requests seem to get through the k8s Nginx ingress but eventually time out there. The same requests seemed to be handled more reasonably by the serverless tier, though we do have records of some of them taking a huge amount of time to resolve there too.

Any thoughts/ideas/best practices for getting our LookupResources requests to work better? And is there any variant of the Watch API available for open source use?
Specifically, the request looks like:
from authzed.api.v1 import LookupResourcesRequest, ObjectReference, SubjectReference

request = LookupResourcesRequest(
    consistency=consistency,
    resource_object_type=object_type,
    permission=permission,
    subject=SubjectReference(
        object=ObjectReference(
            object_id=str(user_id),
            object_type="user",
        ),
    ),
)
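(For context, a minimal sketch of how a request like this is sent and consumed with the authzed Python client; the endpoint and token are placeholders, and it reuses the request object built above.)

from authzed.api.v1 import Client
from grpcutil import bearer_token_credentials

# Placeholder endpoint and preshared key, not values from the real deployment.
client = Client(
    "spicedb.example.internal:50051",
    bearer_token_credentials("some-preshared-key"),
)

# LookupResources streams its results, so a slow or hung call shows up while
# iterating the response stream, not when the RPC is first issued.
resource_ids = [resp.resource_object_id for resp in client.LookupResources(request)]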
hmm, I don't seem to be able to rename this thread, and it has a junk name, sorry.
v
what consistency are you using? how many results do you expect?
Watch API is part of the opensource project
also thank you for the kind words!
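(Re the Watch API: a minimal sketch of calling it from the same Python client, reusing the client from the sketch above; the object type to watch is an illustrative choice.)

from authzed.api.v1 import WatchRequest

# Stream relationship changes for one object type; "document" is illustrative.
watch_request = WatchRequest(optional_object_types=["document"])

for change in client.Watch(watch_request):
    # Each response carries the relationship updates plus a ZedToken
    # (changes_through) that can be persisted and used to resume later.
    print(change.changes_through, change.updates)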
d
sorry, consistency is usually None, but may sometimes be at_least_as_fresh=ZedToken(<token>)
Good point though, I should see if there's a correlation between consistency and these difficult requests
v
I'd also recommend using cursors. If the response is too large the server will have to buffer all elements in memory, which can OOMKill it
if you use cursors, you can restrict how many tuples are loaded at a time
if you do not provide a zedtoken you'll be asking for full consistency, which is going to completely bypass caching
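(A sketch of the cursored form being suggested here, reusing the client and the consistency/object_type/permission variables from earlier; "subject" stands for the SubjectReference built above, and the page size of 1000 and handle() are illustrative placeholders.)

from authzed.api.v1 import LookupResourcesRequest

PAGE_SIZE = 1000  # illustrative; tune to your workload

cursor = None
while True:
    # Only PAGE_SIZE results are materialised per call; the cursor from the
    # last element of each page is fed back in to fetch the next page.
    page = list(
        client.LookupResources(
            LookupResourcesRequest(
                consistency=consistency,
                resource_object_type=object_type,
                permission=permission,
                subject=subject,
                optional_limit=PAGE_SIZE,
                optional_cursor=cursor,
            )
        )
    )
    for resp in page:
        handle(resp.resource_object_id)  # handle() is a placeholder for app logic
    if len(page) < PAGE_SIZE:
        break
    cursor = page[-1].after_result_cursor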
d
testing this atm, good to hear we're doing sensible things though. We're nowhere near OOM, and no pod restarts
oh, really? I thought consistency=None would default to minimise_latency, not fully_consistent
We'll check and change what we pass in to get the desired outcome, ty.
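(For reference, a sketch of how each consistency mode mentioned here is spelled with the Python client; the token value is a placeholder.)

from authzed.api.v1 import Consistency, ZedToken

# minimize_latency: the default when no consistency is supplied; favours cache reuse.
minimize = Consistency(minimize_latency=True)

# at_least_as_fresh: read at least as fresh as a previously returned ZedToken.
at_least = Consistency(at_least_as_fresh=ZedToken(token="<token>"))

# fully_consistent: bypasses caching entirely; typically the most expensive option.
full = Consistency(fully_consistent=True)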
just to complete the answer to "How many results": 200 to ~18k atm, mostly in the 200 to 1-2k region though
v
I may be misremembering, just make sure to set the consistency you want and be aware that fully consistent is typically expensive
d
yeah, we're manually applying minimise_latency to everything where we don't want a token / care about NewEnemy now.
using the cursor seems to have helped a bit, but we still get spikes out to 20+ seconds for the first page. Looking into whether those correlate with us supplying a zedtoken
v
LookupResources has always had the limitation that, with the right number of results, it can put a lot of pressure on your cluster; a handful of things in the implementation did not scale well. With the introduction of cursors we now make sure to only load what's strictly needed to fulfil the optional_limit, and it's truly streaming
I suggest you look into tracing next to see what's up - you want to see if there is any part of your schema that is causing lots of relationships to be loaded
d
sure, what's weird is that the cluster doesn't appear to be under much pressure. Not much CPU usage on the SpiceDB instances or the RDS, and no particularly slow RDS responses shown.
v
SpiceDB is not really that CPU bound, it's mostly memory and I/O work
d
adding tracing to the SpiceDB nodes? Good point too. Anything recommended / easy / out-of-the-box here?
v
well I don't know your setup; there are a handful of options out there, Jaeger being one, and there is also Grafana + OpenTelemetry Collector/Grafana Agent + Tempo
if you are ok throwing money at the problem, I'm a big fan of Lightstep
If you use Datadog it also has support for these things
essentially anything that supports OpenTelemetry should work
d
ooh, Grafana does an OpenTelemetry viewer? We battled with AWS's tools for a bit, use Jaeger locally, Honeycomb in prod. I was asking more about the trace generation side of things: is there a SpiceDB image with instrumentation built in? Is there a known and convenient way to plug auto-instrumentation into the existing image? I'm presuming not, as the modules are compiled, but I have very limited knowledge about this.
v
SpiceDB supports OpenTelemetry tracing out of the box, nothing you need to compile
you just need to set a bunch of env vars
d
Awesome, ty
v
I bet Honeycomb supports it
they do
d
yeah, we go OTLP -> Honeycomb until we get time to get something better working
v
also if you are just looking for something to play around with locally, we also have an observability example that sets up Grafana with Tempo for you
it shows you, with a docker compose file, how to configure SpiceDB. You could spin up all services but SpiceDB, and then run your own instance outside of the docker compose file
also spinning up a local Jaeger is pretty easy
d
yup, Jaeger works well. Ty for the help, no doubt I/we will be back in a bit..
c
minimize_latency is used by default if you don't specify consistency in the request