# spicedb
a
Has anyone encountered the issue where the GC at some point starts consuming 70%+ of the CPU? And the CPU usage never decreases after that? Restarting the pod normalizes the CPU usage, but over time, the problem occurs again. https://cdn.discordapp.com/attachments/844600078948630559/1286245915707047968/image.png?ex=66ed355b&is=66ebe3db&hm=9206b33aa6eae8fcf0a4e97e8d864b1a62a2577ad7360b6453eeede3539c8b52&
SpiceDB: v1.34.0-v1.35.3
PostgreSQL: 15.7
command:
    - spicedb
    - serve
    - --http-enabled
    - --log-level=error
    - --datastore-gc-window=3m
    - --datastore-gc-max-operation-time=3m
    - --datastore-gc-interval=7m
    - --datastore-tx-overlap-strategy=insecure
    - --dispatch-cluster-enabled
    - --dispatch-upstream-addr=kubernetes:///permission-db-app.api:50053
    - --grpc-preshared-key=$(PRESHARED_KEY)
    - --datastore-engine=$(DATASTORE_ENGINE)
    - --datastore-conn-uri=$(DATASTORE_URI)
    - --telemetry-endpoint=${TELEM_URI}
v
What do the logs say? It sounds like you are doing lots of writes and there is a ton to GC over the default window of 24 hours. It's true this does not currently have any throttling mechanism, which could explain that behavior.
I'd suggest moving to log level info or debug to see what's up. There are also GC metrics indicating how many GC cycles ran, how many tuples were GC'd, and how long the GC process took.
Restarting helps because the GC only kicks in 5 minutes after startup by default.
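A minimal way to eyeball those GC metrics is to scrape the Prometheus endpoint and filter for GC-related series. The sketch below assumes the default metrics listener on :9090; adjust the address if --metrics-addr is configured differently.

package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// Fetch SpiceDB's Prometheus metrics and print anything GC-related, which
// covers both the datastore GC series and the Go runtime GC series.
func main() {
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(strings.ToLower(line), "gc") && !strings.HasPrefix(line, "#") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}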
a
The error logs don't provide any information (since there are no errors). I'll try lowering the logging level to info and see what's happening there. Thanks!
v
yeah, those won't show up in the error log. I don't think they are erroring out, just taking very long and doing a lot of work
j
@vroldanbet I think this is Go GC, not rel GC?
@Assada at first glance, it seems like the pod hit its max memory and the Go GC then started thrashing, trying to free enough to keep processing
you might need to lower the cache memory size
to leave room for other memory as needed
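A hedged sketch of what lowering the cache sizes could look like, added to the same command list as above. The dispatch cache flags below exist in recent SpiceDB releases, but the exact names and sensible values should be confirmed against spicedb serve --help for the version in use; the percentages here are only placeholders.

    - --dispatch-cache-max-cost=20%
    - --dispatch-cluster-cache-max-cost=40%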
v
How much memory are you allocating? We recently introduced support for GOMEMLIMIT. There are also Go GC metrics exposed.
You are right, I didn't look at the flame graph. That's definitely Go GC.
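For context, a minimal sketch of giving the pod an explicit memory budget so the Go GC has a clear target. The values are placeholders, and an explicit GOMEMLIMIT may be unnecessary on versions that derive it from the container limit.

resources:
  limits:
    memory: 2Gi            # placeholder; must cover caches plus request headroom
env:
  - name: GOMEMLIMIT       # optional explicit Go soft memory limit
    value: "1800MiB"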
j
pprof also has heap profiles, that should show you where all the allocations are coming from
Can you expand a bit more on the type of workload / API usage?
j
it's likely hitting the configured memory limit
I'd suggest lowering the cache limits
d
Re the pprof heap profile: I think I'll need to wait a couple of days to capture one when the CPU issue is at its peak
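A minimal sketch for grabbing that heap profile when the spike happens, assuming pprof is exposed on the same :9090 listener as the metrics (adjust the address for your deployment); the resulting file can then be inspected with go tool pprof heap.pb.gz.

package main

import (
	"io"
	"net/http"
	"os"
)

// Snapshot a heap profile from the pprof endpoint so it can be analyzed later.
func main() {
	resp, err := http.Get("http://localhost:9090/debug/pprof/heap")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pb.gz")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}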
v
what kind of API calls are being done in this load test?
d
In our test environment, the main operations are /v1/permissions/check, /v1/relationships/write, and /v1/relationships/delete. We’re performing very few privilege checks over gRPC.
v
it seems like it's all OTel stuff? that's surprising
try disabling it?
--otel-provider=none
d
I'll try it. But we don't set up OTel explicitly. Also I see the message: {"level":"info","v":0,"provider":"none","endpoint":"","service":"spicedb","insecure":false,"sampleRatio":0.01,"time":"2024-09-23T09:09:02Z","message":"configured opentelemetry tracing"}
v
weird
are you using gRPC or the HTTP Gateway?
It seems to be pointing to this function:
mux.Handle("POST", pattern_PermissionsService_CheckPermission_0, func(w http.ResponseWriter, req *http.Request, pathParams map[string]string) {
        ctx, cancel := context.WithCancel(req.Context())
        defer cancel()
        inboundMarshaler, outboundMarshaler := runtime.MarshalerForRequest(mux, req)
        var err error
        var annotatedContext context.Context
        annotatedContext, err = runtime.AnnotateContext(ctx, mux, req, "/authzed.api.v1.PermissionsService/CheckPermission", runtime.WithHTTPPathPattern("/v1/permissions/check"))
        if err != nil {
            runtime.HTTPError(ctx, mux, outboundMarshaler, w, req, err)
            return
        }
        resp, md, err := request_PermissionsService_CheckPermission_0(annotatedContext, inboundMarshaler, client, req, pathParams)
        annotatedContext = runtime.NewServerMetadataContext(annotatedContext, md)
        if err != nil {
            runtime.HTTPError(annotatedContext, mux, outboundMarshaler, w, req, err)
            return
        }

        forward_PermissionsService_CheckPermission_0(annotatedContext, mux, outboundMarshaler, w, req, resp, mux.GetForwardResponseOptions()...)

    })
d
Both. We mainly use the HTTP Gateway for writing and reading, but for a few checks we use gRPC.
v
alright, thank you. This is pointing to a leak only affecting the HTTP Gateway.
can you give me the exact list of RPCs you use the HTTP Gateway for?
are you sending otel context via HTTP Gateway?
d
All the HTTP endpoints we use in these tests: /v1/permissions/check, /v1/relationships/write, /v1/relationships/delete
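For anyone trying to reproduce, a minimal sketch of one such gateway call. The :8443 port, the preshared key, and the document/view/user names are placeholders, and TLS is assumed to be disabled on the gateway in this test setup; the commented-out traceparent header is what sending OTel context through the gateway would look like.

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical schema and placeholder values, purely for illustration.
	body := []byte(`{
	  "resource": {"objectType": "document", "objectId": "doc1"},
	  "permission": "view",
	  "subject": {"object": {"objectType": "user", "objectId": "alice"}}
	}`)

	req, err := http.NewRequest("POST", "http://localhost:8443/v1/permissions/check", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer somepresharedkey") // placeholder preshared key
	// Uncomment to propagate W3C trace context ("OTel context") through the gateway:
	// req.Header.Set("traceparent", "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}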
v
so far I don't seem to be able to reproduce, those objects are getting released correctly
d
I'm not sure whether we send telemetry headers. I'll take a look.
v
@Dmytro @Assada I've opened https://github.com/authzed/spicedb/pull/2075 with the fix. Once I updated the OTel component, the leak disappeared. It would be helpful if you could build this branch, run it in your system, and confirm whether the problem is solved for y'all.
the fix should be part of the new release: https://github.com/authzed/spicedb/releases/tag/v1.37.0
d
The load test was launched last night, no problems so far. Let's wait until Monday.
In v1.37.0, this problem is no longer observed in our test environment. Thank you very much.
j
thank you for reporting this issue!