s
Hi! I'm trying to debug an intermittent problem, and it has proved very tricky. Once in a while the number of running goroutines spikes by several orders of magnitude and we take a big latency hit. I'm not sure why this is happening, because the number of operations remains roughly the same. It happens regardless of whether cluster dispatch is enabled. Could you give me any pointers on where I should look? We are running self-hosted 1.32.0 with CockroachDB 23.1 as the backing datastore. https://cdn.discordapp.com/attachments/844600078948630559/1311985449106931742/image.png?ex=674ad92a&is=674987aa&hm=a2bdfaa5e8c618a213a67f754332d7909bdbcb2911e14b4aaf4df85c4510fa7d&
v
I'd look into the dispatch metrics. This is likely a very wide relation that requires a lot of dispatches to evaluate. You can also use OpenTelemetry traces to visualize what's going on.
s
I've set up traces, but unfortunately they are not very helpful
as an example, LookupResources took 4.03 seconds for some reason, and the majority of the time was spent in ReverseQueryRelationships, and I'm not sure how many goroutines it spawned https://cdn.discordapp.com/attachments/1311985449366982746/1316750807818965044/image.png?ex=675c2f40&is=675addc0&hm=e94603d607a47c7c8ee5d15cd3c81840c2b5cd5b0a89f3f514bcf188fd9eb639&
consistency is set to "minimize latency" and cluster dispatch is disabled. containers are not starved for resources as far as I can tell
v
yeah, cluster dispatch won't help here; it's putting all the load on the datastore. Have you looked at the CockroachDB console for resource usage and potentially expensive queries?
s
I can take a look, but based on that trace, it seems like the datastore is not the issue, as queries complete in under 20ms. Instead, most of the latency appears to come from ReverseQueryRelationships.
v
it can still be the database: the queries are fast, but how long does it take to acquire a connection from the pool? That's a typical source of contention. I'd suggest looking at the Prometheus metrics for the connection pools; they should reveal how much time SpiceDB spends waiting to get a connection. You can then tune your connection pools accordingly.