It suspect there's a memory leak in SpiceDB. My po...
# spicedb
w
It suspect there's a memory leak in SpiceDB. My pods are regularly restarting (36 times in 5d18h), with memory going very high pretty quickly. That only started happening since I upgraded to v1.13.0
For info, my
resource
configuration:
Copy code
resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
Extended graph. - First yellowish zone is v1.11.0 - Purpleish zone is v1.12.0 - Second yellowish zone is v1.13.0 (same zone as the purpleish zone of the first screenshot above, confusingly)
v
maybe potentially related to the new way to compute cache size using percentages? By default it uses a 75% of available memory to the container. A way to determine if that was the contributing factor would be to change from the default 75% to a static memory value
It could be that perhaps the newly introduced library is not respecting cgroup limits
cc @jzelinskie
(could be totally unrelated to the new percentage-based cache sizes, but it's worth ruling it out with the exercise outlined above)
w
I'm happy to try it: could you suggest a reasonable static memory value here?
v
well I'd say set it to 75% of the resource request, that'd me 750mib?
v
@vroldanbet are you talking about
--dispatch-cluster-cache-max-cost
(which defaults to 70%) or
--dispatch-cache-max-cost
(default=30%) ?
v
actually both.
v
we're experiencing memory issues too, but I don't know if related to latest version, since we just enabled spice in production (for a small feature, we'd like to expand usage too)
v
did you experience memory issues after setting those 2 flags?
any chances you can get a pprof dump?
v
no, those flags are not set, but as I wrote earlier at every restart I get different values. Even with small values like:
it ends up OOMing
Copy code
go_memstats_mallocs_total{job="spicedb-production-metrics"} - go_memstats_frees_total{job="spicedb-production-metrics"}
v
it's expected you get different values, since the percentage of free memory in the machine would vary each time spicedb starts up
v
(memory request is 1GB)
v
👀
v
yes but should always be at least 1GB -> thus 70% ~ 700MB
v
is that running in your machine or in prod? I experienced something similar in my machine. Maybe the library we are using is not reliable
v
prod
v
so in that graph request and limit is 1GB, correct?
v
let me check the limit, I thought it was not set, it could be an error in the chart
but yes, request is 1GB
v
and the logs above, showing relatively low cache allocations, is also from prod right?
I certainly wouldn't expect those numbers. It's similar to what I observed, but in my local machine
the only alternative I can think of right now is to set those manually and make sure they fit within the request
manually as in with actual sizes instead of percentages, the flag should support both
v
yes, they are from production. I confirm limit is 1GB, my bad
v
@jzelinskie can follow up, Im actually OoO but just in front of the computer because I got paged 😅
v
I'll set them rn, but I don't expect it to improve, because even now we're going way above those values
j
the percentage actually checks free memory and does (free mem * 0.70) * $YOUR_PERCENTAGE
it's confusing, but this is so that other things can consume some amount of memory without the cache having eaten up everything
We're looking into to seeing if there could be another cause to the misaligned memory usage though, since that should put you safely below your target and not above it
v
@jzelinskie have you seen the chart about allocs? does it say something? 😄
w
Just tried setting a static
--dispatch-cache-max-cost
to
750MiB
, doesn't seem to have changed anything really
(I don't enable the dispatch cluster, so didn't set
--dispatch-cluster-cache-max-cost
)
a heap dump for your eyes
Interestingly the dump talks about "882.57MB total", not >3GB 🤔
> (I don't enable the dispatch cluster, so didn't set --dispatch-cluster-cache-max-cost) I set it for peace of mind: didn't change anything either 🙂
v
i set them to 200MB/200MB, still growing (slowing, not many requests, as I mentioned before)
j
Ok I'm going to try and reproduce this today
w
It's impacting response time and availability pretty negatively, I'll rollback to v1.12.0 in the meanwhile!
j
I suspect our cost estimates for item sizes going into the cache are substantially off such that the cache in reality is growing beyond what it's intended
I'm experimenting with how we store and estimate items in our cache so that they are no longer estimates.
v
@jzelinskie if you add metrics (estimates vs real, where real could be optional?) we could report what we see
j
Rather I have a PR that just gets rid of any guess-work we were doing.
You might be able to cherrypick it onto v1.13
v
@jzelinskie I'm running 1.13.0 patched with your PR. I'll wait till tomorrow to be sure but apparently it's 😍
j
nice! glad to hear that so far
v
annotation where PR was applied
v
do you have cache metrics enabled? Would be useful to see if evictions are happening based on the cache size selected on startup
v
@vroldanbet can you tell me which (prometheus) metric exactly?
v
sure! you need 2 flags enabled, because these are "ristretto" (the caching library we use) specific metrics, which are not enabled by default: -
dispatch-cache-metrics
-
dispatch-cluster-cache-metrics
once enabled, you shall see the following metrics:
Copy code
spicedb_dispatch_client_cost_added_bytes
spicedb_dispatch_client_cost_evicted_bytes
spicedb_dispatch_cost_added_bytes
spicedb_dispatch_cost_evicted_bytes
v
they're all 0, they're probably not enabled. Let me check
I've just deployed spice with these metrics enabled. Traffic is already going down for today (TZ is EU/Rome), we can have more information tomorrow
v
thanks for looking into this!
v
@vroldanbet thank you and @jzelinskie for the support!
v
no problem! it's jimmy who deserve all the credit here! ✨
j
@vad8615 would it be possible to get a similar graph for your cpu performance?
v
@jzelinskie
v
mind marking when did the cache-fix got deployed? I believe Jimmy wanted to check if extra CPU incurred was relevant, and it appears it is not. How many pods are y'all running?
v
sure. I can post p50 and p90 too, they look relevant
v
what is this snapshot, avg?
when I say relevant, I mean that it does not look like there is an appreciable increase in CPU use. And even if there was, SpiceDB is still fundamentally I/O bound and trading some CPU seemed like a reasonable choice here to better control memory management
v
(ATM only a couple of features with low usage have ACL managed with spice). Yes, I agree that CPU usage has not changed in any relevant way. Here CPU with annotation:
p50 and p90:
it looks like there's a modest increase in p90
v
it seems like your workload always shows that pattern around 8:00am in your TZ. At first glance it would seem so, but perhaps early to drag conclusions - workload could have changed. Maybe try to correlate with traffic?
v
no traffic on weekends, working days are quite similar:
evicted are still 0
max costs are 200MB/200MB
v
nice looking good, presumably there is not enough load to apply pressure to the cache
j
fantastic
v
even if the memory has improved a lot, I think there's still something not working properly. - In the last hours we went over the 200MB limit we set, and we still see no evitions. - Total memory usage is 675MB. It's growing faster than caches. It was ~400MB when caches were ~100MB each, thus ~200MB overhead. With caches ~200MB each it's ~675MB, thus 375MB overhead. Memory usage is basically double the amount of memory reserved for cache.
j
Can you run a heap profile on a SpiceDB instance? There could be something else outside of the cache consuming memory, so I'd like to confirm that.
v
(update: evictions started at 209MB 🙏 )
@jzelinskie can I send it privately somehow?
j
yeah if you DM me a https://pprof.me link that should be private
Looks like there could be a big LookupResources request in flight
going to dig a bit deeper if i can
v
I would't say there's a big one, of this resource type we have at most ~200 resources per user AFAIK. Could it be there's a leak of "LookupResources objects" ?
as you can see in the above chart, it takes days to reach this point
j
yeah, that's quite possible
v
and then suddenly it oomed
j
Can y'all try the latest v1.14 release? I just want to make sure you have all the latest changes to LookUpResources internals
v
I'll try next week ASAP, thank you
j
There have been lots of changes there for performance and to support caveats, so I'd like to make sure things are reproducible
Check out the GitHub release notes for how to upgrade with zero downtime
v
we're now running 1.14.1. Let's wait for memory to increase
j
How's it been going?
v
it's apparently the same, but let me collect data 1 more day
I need to decrease cache size even more in order to be able to make a statement
an update: with 1.14.1 memory seems to be quite stable 🎉 there could still be a small leak but way less important. I'm collecting evidence
v
did you guys reduce the memory request/limits to put some pressure on the memory?
v
we reduced cache size (in MB), do you think I should decrease even the pod req/limits?
v
well ideally you don't need to resource to reducing the cache size and the default establishes sensible values. I was referring to the kube requests/limits in order to accelerate the leak and make it evident waiting less time (it it still exists)
v
the current situation is much better than before. However, I would expect memory usage to go back to the same value every night (when there's no traffic), as soon as evictions starts kicking in. We observed an increase of ~20MB between the last two nights instead.
v
hrm, shouldn't eviction kick-in when there is memory pressure? If I understand correctly, y'all set a lower cache size compared to the actual memory available to the process, right?
so I guess that's why eviction is kicking in, even thought there is still memory left
while that works, it does not fully utilize the memory available to the process
If I may, I'd suggest adding a line denoting the cache size the first panel, so it's easier to map evictions to crossing that threshold
v
cache limit is 20MB BTW. I'm aware it's not an optimal setting, it's just for "debugging"
v
oh ok!
u
Hey everyone, we started to see quite a lot of out-of-memory kills of our spicedb pods recently (as the number of the relationships has grown from ~7K to ~11K, but maybe due to some other reason that we're missing) we tried setting --dispatch-cluster-cache-max-cost and --dispatch-cache-max-cost to specific values (700MiB and 300MiB respectively out of 1280Mi configured as the pods' resource request) as it was suggested here but didn't see any improvement we started to experience these problems with 1.14.1, then upgraded to 1.15.0 which made no difference is there anything else that we could try doing?
v
what version are you currently running?
u
we're running 1.15.0
v
are there any chances you guys can provide us with your schema to see if we can reproduce it? cc @jzelinskie who's been looking into the leak
u
hey @vroldanbet and @jzelinskie , we'll have to obfuscate our schema a little to be able to share but we can do that, I can send it to you in DM it's pretty complex though, so I'm not sure how feasible it would be to make any use of it I can also share some heap profiles and any other metrics if it helps
one observation - when our kubernetes metrics show that a spicedb pod uses ~1 GB of memory, the heap profile shows less than 500MB
j
what does
ps
show for the memory usage of the process?
j
can you share a heap profile?
go tool pprof http://$SPICEDB_URL:$METRICS_PORT/debug/pprof/heap
should download a tarball that you can upload to https://pprof.me to share with us
u
Copy code
/ # ps -o pid,user,vsz,rss,comm
PID   USER     VSZ  RSS  COMMAND
    1 65532    1.7g 1.1g spicedb
j
thanks
u
memory usage in the heap profile (Memory In-Use Bytes, Cumulative), ps (the rss column) and in kubernetes metrics seem to match actually but the svg generated with
go tool pprof
->
web
displays
Showing nodes accounting for 443.62MB 92.88% of 477.64MB total
(perhaps my expectation that this would represent the total memory used is incorrect)
j
[spicedb] github.com/authzed/spicedb/internal/namespace.(*ReachabilityGraph).collectEntrypoints
jumps out on the cumulative graph
I'll investigate tomorrow to make sure we're not somehow keeping that graph around once its done being used
u
hey everybody, I was wondering if there are any updates regarding the suspected memory leak as far as I can see, our heap profile at the link above got overwritten by something else, but I can upload a new one if needed we keep seeing oom kills quite frequently - typically a couple of time a day
j
we found a deadlock in the LookupResources code and just merged a fix for it: https://github.com/authzed/spicedb/pull/1086
however, we aren't 100% certain that is the root cause of the memory issues
if you can get us another profile too, it can't hurt
u
Hi everybody, we're still suffering memory leak issues here's some more observations:
Copy code
ps -o pid,user,vsz,rss,comm
PID   USER     VSZ  RSS  COMMAND
    1 65532    3.0g 1.4g spicedb
https://pprof.me/f56c390/
Copy code
ps -o pid,user,vsz,rss,comm
PID   USER     VSZ  RSS  COMMAND
    1 65532    3.0g 2.2g spicedb
https://pprof.me/567b9c8/
j
@jzelinskie ^
@__alexb__ on v1.16?
v1.16 has a fix for what we believe was causing the reachability graph to not be GCed
that fix linked above ^
j
from your graph -- the fix joey is mentioning removes errgroup and uses a different primitive for scheduling reachableresources dispatches
u
hmm... I checked what's included in v1.16 and I thought the fix linked above wasn't there and based on that we decided not to upgrade just yet
j
ah
might be v1.16.1
j
yeah, sorry
I thought it merged earlier than it did
u
thanks a lot 👍, we'll give it a try
j
thanks for the report
it was incredibly helpful to narrow it down
p
FYI @Joey - we are still seeing the same issues after upgrading. We have a relatively small DB and have been giving our pods more and more memory, but invariably spicedb eventually eats all of it after some time and use. The pattern seems to be that we do some writes to the db, and after we start doing reads on it it looks like spicedb starts populating caches but never stops growing them. In case it helps: we are not using ZedTokens and querying with the fully consistent setting (the response times on this are perfectly fine for our use case when spicedb is operating normally). Based on our observations, the latest release did not fix or mitigate the memory leak issues.
v
Is this with 1.16.1? would you mind getting another heap profile and sharing with us? 🙏🏻
p
Yes, latest version - we'll try to get something to you soon. Since the behaviour does not seem to have changed, I expect it might look similar to the profiles from before
The pod just got killed a little while ago so we might have to wait a little for the issue to re-occur (typically happens a couple of times per day)
u
here's a heap profile https://pprof.me/de52332/
and ps output for the same pod
Copy code
PID   USER     VSZ  RSS  COMMAND
    1 65532    6.3g 4.6g spicedb
p
Note that the ps output shows 4.6g, while the heap memory-in-use bytes only show up as 2.8GB in the profile - we're not sure where the rest of that comes from
j
the ristretto cache holds memory in a different way
but that is expected
it looks like gRPC is still holding references to the streams, for some reaswon
reason*
I'll investigate today
p
Client-side we are using the asyncio python library. Roughly following the README page for that, we have a setup that just creates one client object for the app, which we keep re-using. I am quite sure we consume the streams from e.g. ReadRelationships fully, so there shouldn't be a reason for the client to keep such streams open, for instance. But there's no context management going on anywhere as of now, not sure if that's normal for this client. We're considering trying an experiment where we just destroy and re-create this client object every few hundred/thousand requests or so to see if that changes anything - do you have any reason to believe this could help?
j
I don't believe a client change is necessary, no
it may help in convincing gRPC to clear the streams
but that's not something we'd want to encourage, ideally
it appears others are having similar issues with gRPC streams
@pdow do you happen to have a QA or stage env that you can reproduce the problem in?
if you do, and you want to try a fix: https://github.com/authzed/spicedb/pull/1119
p
We're giving this a try, thanks @Joey
u
we built and deployed a version from main (i.e. without the PR linked above) and our problems are gone when we return to 1.16.1 the memory issues are back is this this PR https://github.com/authzed/spicedb/pull/1110 that solves the problem?
yep, we built from the previous commit in main and can immediately see the problem again
v
it does not necessarily mean there is no leak. What that PR is doing is eagerly evicting data from the cache that is 2x out of the quantization window and therefore unlikely to be reused
what could be happening is that the grow will be slower moving forward. Or it could be also that there was never any leak, but SpiceDB does not properly compute the size of the cache entries and more entries are being added when there is no capacity left
u
are there any cache configuration options that we can experiment with to check the latter hypothesis? the very first thing that we tried was tuning --dispatch-cluster-cache-max-cost and --dispatch-cache-max-cost but to no effect
v
you could disable the cache entirely
it may be worth just testing this PR for a bit but I'm not convinced this solves the problem. It's eagerly evicting data from the cache that is likely to be unused. - if the leak does not happen anymore, then we know it's related to the cache - if it happens but is happening slower than before, then it means the leak is elsewhere, and that PR didn't really fix it
j
We know part of the issue is related to reachable resources, which the quantization PR does not address
@pdow @__alexb__ let me know if you can try the timeouts PR itself, instead of just
main
the overall memory usage should still be higher, but the memory profile should show the leak removed
u
@Joey , yes, we will try it
j
thanks
@__alexb__ any updates?
@__alexb__ we've merged that PR into HEAD, alongside the cache improvements
u
hey @Joey , sorry, I was off last week, our team did some quick tests using that PR but we didn't collect enough data to come to any conclusion and then we had (due to some internal reasons) to switch back to 1.16.1
we'll try to do more tests this week
here's how things look for us in terms of memory usage after switching to a build from the main branch
and here's what we have in prod where we use 1.16.1
v
looks promising! The fix should have been released now as part of 1.16.2
2 Views