# spicedb
a
Has anyone encountered the issue where the GC at some point starts consuming 70%+ of the CPU? And the CPU usage never decreases after that? Restarting the pod normalizes the CPU usage, but over time, the problem occurs again. https://cdn.discordapp.com/attachments/844600078948630559/1286245915707047968/image.png?ex=66ed355b&is=66ebe3db&hm=9206b33aa6eae8fcf0a4e97e8d864b1a62a2577ad7360b6453eeede3539c8b52&
SpiceDB: v1.34.0-v1.35.3
PostgreSQL: 15.7
command:
    - spicedb
    - serve
    - --http-enabled
    - --log-level=error
    - --datastore-gc-window=3m
    - --datastore-gc-max-operation-time=3m
    - --datastore-gc-interval=7m
    - --datastore-tx-overlap-strategy=insecure
    - --dispatch-cluster-enabled
    - --dispatch-upstream-addr=kubernetes:///permission-db-app.api:50053
    - --grpc-preshared-key=$(PRESHARED_KEY)
    - --datastore-engine=$(DATASTORE_ENGINE)
    - --datastore-conn-uri=$(DATASTORE_URI)
    - --telemetry-endpoint=${TELEM_URI}
v
What do the logs say? It sounds like you are doing lots of writes and there is a ton to GC over the default window of 24 hours. It's true this does not currently have any throttling mechanism, which could explain that behavior.
I'd suggest moving to log level info or debug to see what's up. There are also GC metrics indicating how many GC cycles ran, how many tuples were GC'd, and how long the GC process took.
Restarting helps because the GC only kicks in 5 minutes after startup by default.
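A minimal way to eyeball those GC metrics is to scrape the Prometheus endpoint and filter for GC-related series. The sketch below assumes the default metrics listener on :9090; adjust the address if --metrics-addr is configured differently.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Fetch SpiceDB's Prometheus metrics and print anything GC-related, which
// covers both the datastore GC series and the Go runtime GC series.
func main() {
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(strings.ToLower(line), "gc") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}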
a
The error logs don't provide any information (since there are no errors). I'll try lowering the logging level to info and see what's happening there. Thanks!
v
yeah, those won't show up in the error log. I don't think they are erroring out, just taking very long and doing a lot of work
j
@vroldanbet I think this is Go GC, not rel GC?
@Assada at first glance, it seems like the pod hit its max memory and the Go GC then started thrashing, trying to free enough to keep processing
you might need to lower the cache memory size
to leave room for other memory as needed
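A hedged sketch of what lowering the cache sizes could look like, added to the same command list as above. The dispatch cache flags below exist in recent SpiceDB releases, but the exact names and sensible values should be confirmed against spicedb serve --help for the version in use; the percentages here are only placeholders.

    - --dispatch-cache-max-cost=20%
    - --dispatch-cluster-cache-max-cost=40%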
v
How much memory are you allocating? We recently introduced support for GOMEMLIMIT. There are also Go GC metrics exposed.
You are right, I didn't look at the flame graph. That's definitely Go GC.
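For context, a minimal sketch of giving the pod an explicit memory budget so the Go GC has a clear target. The values are placeholders, and an explicit GOMEMLIMIT may be unnecessary on versions that derive it from the container limit.

resources:
  limits:
    memory: 2Gi            # placeholder; must cover caches plus request headroom
env:
  - name: GOMEMLIMIT       # optional explicit Go soft memory limit
    value: "1800MiB"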
j
pprof also has heap profiles, that should show you where all the allocations are coming from
Can you expand a bit more on the type of workload / API usage?
j
it's likely hitting the configured memory limit
I'd suggest lowering the cache limits
d
Re the pprof heap profile: I think I'll need to wait a couple of days to capture one when the CPU issue is at its peak
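A minimal sketch for grabbing that heap profile when the spike happens, assuming pprof is exposed on the same :9090 listener as the metrics (adjust the address for your deployment); the resulting file can then be inspected with go tool pprof heap.pb.gz.

package main

import (
	"io"
	"net/http"
	"os"
)

// Snapshot a heap profile from the pprof endpoint so it can be analyzed later.
func main() {
	resp, err := http.Get("http://localhost:9090/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}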
v
what kind of API calls are being done in this load test?
d
In our test environment, the main operations are /v1/permissions/check, /v1/relationships/write, and /v1/relationships/delete. We’re performing very few privilege checks over gRPC.
v
it seems like it's all OTel stuff? that's surprising
try disabling it?
--otel-provider=none
d
I'll try it. But we don't set up OTel explicitly. Also I see the message: {"level":"info","v":0,"provider":"none","endpoint":"","service":"spicedb","insecure":false,"sampleRatio":0.01,"time":"2024-09-23T09:09:02Z","message":"configured opentelemetry tracing"}
v
weird
are you using gRPC or the HTTP Gateway?
It seems to be pointing to this function:
mux.Handle("POST", pattern_PermissionsService_CheckPermission_0, func(w http.ResponseWriter, req *http.Request, pathParams map[string]string) {
        ctx, cancel := context.WithCancel(req.Context())
        defer cancel()
        inboundMarshaler, outboundMarshaler := runtime.MarshalerForRequest(mux, req)
        var err error
        var annotatedContext context.Context
        annotatedContext, err = runtime.AnnotateContext(ctx, mux, req, "/authzed.api.v1.PermissionsService/CheckPermission", runtime.WithHTTPPathPattern("/v1/permissions/check"))
        if err != nil {
            runtime.HTTPError(ctx, mux, outboundMarshaler, w, req, err)
            return
        }
        resp, md, err := request_PermissionsService_CheckPermission_0(annotatedContext, inboundMarshaler, client, req, pathParams)
        annotatedContext = runtime.NewServerMetadataContext(annotatedContext, md)
        if err != nil {
            runtime.HTTPError(annotatedContext, mux, outboundMarshaler, w, req, err)
            return
        }

        forward_PermissionsService_CheckPermission_0(annotatedContext, mux, outboundMarshaler, w, req, resp, mux.GetForwardResponseOptions()...)

    })
d
Both. We mainly use the HTTP Gateway for writing and reading, but for a few checks we use gRPC.
v
alright, thank you. This is pointing to a leak only affecting the HTTP Gateway.
can you give me the exact list of RPCs you use the HTTP Gateway for?
are you sending otel context via HTTP Gateway?
d
All the HTTP endpoints we use in these tests: /v1/permissions/check, /v1/relationships/write, /v1/relationships/delete
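For anyone trying to reproduce, a minimal sketch of one such gateway call. The :8443 port, the preshared key, and the document/view/user names are placeholders, and TLS is assumed to be disabled on the gateway in this test setup; the commented-out traceparent header is what sending OTel context through the gateway would look like.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical schema and placeholder values, purely for illustration.
	body := []byte(`{
	  "resource": {"objectType": "document", "objectId": "doc1"},
	  "permission": "view",
	  "subject": {"object": {"objectType": "user", "objectId": "alice"}}
	}`)

	req, err := http.NewRequest("POST", "http://localhost:8443/v1/permissions/check", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer somepresharedkey") // placeholder preshared key
	// Uncomment to propagate W3C trace context ("OTel context") through the gateway:
	// req.Header.Set("traceparent", "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}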
v
so far I don't seem to be able to reproduce, those objects are getting released correctly
d
I'm not sure whether we send telemetry headers. I'll take a look.
v
@Dmytro @Assada I've opened https://github.com/authzed/spicedb/pull/2075 with the fix. Once I updated the OTel component, the leak disappeared. It would be helpful if you could build this branch, run it in your system, and confirm whether the problem is solved for y'all.
the fix should be part of the new release: https://github.com/authzed/spicedb/releases/tag/v1.37.0
d
The load test was launched last night, no problems so far. Let's wait until Monday.
In v1.37.0, this problem is no longer observed in our test environment. Thank you very much.
j
thank you for reporting this issue!