LookupResources -> service failure
# spicedb
d
Hello again. Our self-hosted cluster is mostly working beautifully with 3 replicas, thank you all for a great piece of software. We're using Postgres (AWS RDS) as a datastore, and the Authzed Python client to make the requests. Sadly we can't get time allocated to rebuild in Rust 😦.

We do manage to blow it up a few times a day though, primarily triggered (we think) by a LookupResources request: sometimes a request will suddenly take a very long time to resolve, i.e. 20-60 seconds, when that same query will normally take ~5 seconds to resolve, and does so again if sent immediately after the very-long (hung?) request resolves. CPU load on the SpiceDB instances stays at 5-10%, and the RDS instance doesn't seem under pressure either, but responses for even simple requests start to take an extremely long time. Once the cluster gets into this state, it seems to receive gRPC requests but be unable to work on them. We have traces from our authorization service (which uses the Python client) running for ~900 seconds; these requests seem to get through the k8s Nginx ingress but eventually time out there. The same requests seemed to be handled more reasonably by the serverless tier, though we do have records of some of them taking a huge amount of time to resolve there too.

Any thoughts/ideas/best practices for getting our LookupResources requests to work better? And is there any variant of the Watch API available for open source use?
Specifically, the request looks like:
from authzed.api.v1 import LookupResourcesRequest, ObjectReference, SubjectReference

request = LookupResourcesRequest(
    consistency=consistency,
    resource_object_type=object_type,
    permission=permission,
    subject=SubjectReference(
        object=ObjectReference(
            object_id=str(user_id),
            object_type="user",
        ),
    ),
)
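(For context, a minimal sketch of how a request like this is sent and consumed with the authzed Python client; the endpoint and token are placeholders, and it reuses the request object built above.)

from authzed.api.v1 import Client
from grpcutil import bearer_token_credentials

# Placeholder endpoint and preshared key, not values from the real deployment.
client = Client(
    "spicedb.example.internal:50051",
    bearer_token_credentials("some-preshared-key"),
)

# LookupResources streams its results, so a slow or hung call shows up while
# iterating the response stream, not when the RPC is first issued.
resource_ids = [resp.resource_object_id for resp in client.LookupResources(request)]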
hmm, I don't seem to be able to rename this thread, and it has a junk name, sorry.
v
what consistency are you using? how many results do you expect?
Watch API is part of the opensource project
also thank you for the kind words!
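(Re the Watch API: a minimal sketch of calling it from the same Python client, reusing the client from the sketch above; the object type to watch is an illustrative choice.)

from authzed.api.v1 import WatchRequest

# Stream relationship changes for one object type; "document" is illustrative.
watch_request = WatchRequest(optional_object_types=["document"])

for change in client.Watch(watch_request):
    # Each response carries the relationship updates plus a ZedToken
    # (changes_through) that can be persisted and used to resume later.
    print(change.changes_through, change.updates)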
d
sorry, consistency is usually None, but may sometimes be at_least_as_fresh=ZedToken(<token>)
Good point though, I should see if there's a correlation between consistency and these difficult requests
v
I'd also recommend using cursors. If the response is too large the server will have to buffer all elements in memory, which can OOMKill it
if you use cursors, you can restrict how many tuples are loaded at a time
if you do not provide a zedtoken you'll be asking for full consistency, which is going to completely bypass caching
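(A sketch of the cursored form being suggested here, reusing the client and the consistency/object_type/permission variables from earlier; "subject" stands for the SubjectReference built above, and the page size of 1000 and handle() are illustrative placeholders.)

from authzed.api.v1 import LookupResourcesRequest

PAGE_SIZE = 1000  # illustrative; tune to your workload

cursor = None
while True:
    # Only PAGE_SIZE results are materialised per call; the cursor from the
    # last element of each page is fed back in to fetch the next page.
    page = list(
        client.LookupResources(
            LookupResourcesRequest(
                consistency=consistency,
                resource_object_type=object_type,
                permission=permission,
                subject=subject,
                optional_limit=PAGE_SIZE,
                optional_cursor=cursor,
            )
        )
    )
    for resp in page:
        handle(resp.resource_object_id)  # handle() is a placeholder for app logic
    if len(page) < PAGE_SIZE:
        break
    cursor = page[-1].after_result_cursor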
d
testing this atm, good to hear we're doing sensible things though. We're nowhere near OOM, and no pod restarts
oh, really? I thought consistency=None would default to minimise_latency, not fully_consistent
We'll check and change what we pass in to get the desired outcome, ty.
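(For reference, a sketch of how each consistency mode mentioned here is spelled with the Python client; the token value is a placeholder.)

from authzed.api.v1 import Consistency, ZedToken

# minimize_latency: the default when no consistency is supplied; favours cache reuse.
minimize = Consistency(minimize_latency=True)

# at_least_as_fresh: read at least as fresh as a previously returned ZedToken.
at_least = Consistency(at_least_as_fresh=ZedToken(token="<token>"))

# fully_consistent: bypasses caching entirely; typically the most expensive option.
full = Consistency(fully_consistent=True)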
just to complete the answer to "How many results": 200 to ~18k atm, mostly in the 200 to 1-2k region though
v
I may be misremembering, just make sure to set the consistency you want and be aware that fully consistent is typically expensive
d
yeah, we're manually applying minimise_latency to everything where we don't want a token / care about NewEnemy now.
using the cursor seems to have helped a bit, but we still get spikes out to 20+ seconds for the first page. Looking into whether those correlate with us supplying a zedtoken
v
LookupResources has always had the limitation that, with the right number of results, it can put a lot of pressure on your cluster; a handful of things in the implementation did not scale well. With the introduction of cursors we now make sure to only load what's strictly needed to fulfil the optional_limit, and it's truly streaming
I suggest you look into tracing next to see what's up - you want to see if there is any part of your schema that is causing lots of relationships to be loaded
d
sure, what's weird is that the cluster doesn't appear to be under much pressure. Not much CPU usage on the SpiceDB instances or the RDS, and no particularly slow RDS responses shown.
v
SpiceDB is not really that CPU bound, it's mostly memory and I/O work
d
adding tracing to the SpiceDB nodes? Good point too. Anything recommended / easy / out-of-the-box here?
v
well I don't know your setup; there are a handful of options out there, Jaeger being one, and there is also Grafana + OpenTelemetry Collector/Grafana Agent + Tempo
if you are ok throwing money at the problem, I'm a big fan of Lightstep
If you use Datadog it also has support for these things
essentially anything that supports OpenTelemetry should work
d
ooh, Grafana does an OpenTelemetry viewer? We battled with AWS's tools for a bit, use Jaeger locally, Honeycomb in prod. I was asking more about the trace generation side of things: is there a SpiceDB image with instrumentation built in? Is there a known and convenient way to plug auto-instrumentation into the existing image? I'm presuming not, as the modules are compiled, but I have very limited knowledge about this.
v
SpiceDB supports OpenTelemetry tracing out of the box, nothing you need to compile
you just need to set a bunch of env vars
d
Awesome, ty
v
I bet Honeycomb supports it
they do
d
yeah, we go OTLP -> Honeycomb until we get time to get something better working
v
also if you are just looking for something to play around with locally, we also have an observability example that sets up Grafana with Tempo for you
it shows you, with a docker compose file, how to configure SpiceDB. You could spin up all services but SpiceDB, and then run your own instance outside of the docker compose file
also spinning up a local Jaeger is pretty easy
d
yup, Jaeger works well. Ty for the help, no doubt I/we will be back in a bit..
c
minimize_latency is used by default if you don't specify consistency in the request