Broken Collabora Container causes Nextcloud Timeouts

Today I learned: A broken Collabora Container can cause arbitrary timeouts in Nextcloud.

What seemed to be the problem?

I was working on my Nextcloud server, shifting some files back and forth, doing some basic maintenance, the usual stuff, you know? Then I suddenly noticed that every few minutes (or even seconds!) the page would not load properly. It would just load and load for at least 20 to 30 seconds before actually finishing. Sometimes I even needed to trigger a reload manually. At first, I was unsure whether this was a temporary issue or something persistent, but after an hour I was certain it was persistent, and it was driving me nuts!

How did I find the solution?

I was flabbergasted at first, as I had no idea what could be causing this. I started thinking about all the possible reasons outside my own domain: network issues, my hosting provider having problems, my internet provider having problems or interfering somehow, the almighty spaghetti monster punishing me for something... The list kept growing, until I found a hint in the go-to troubleshooting tool of every seasoned admin: log files! I was looking at my nextcloud.log, chasing another issue, when I saw something along the lines of "Failed to fetch the Collabora capabilities endpoint: cURL error 28: Operation timed out after 45000 milliseconds with 0 bytes received". And that's when it came to me: "Check the darn Collabora container!" So I did. And would you know it? There it was, with a nasty error in its logs that prevented Collabora from running but did not cause the container to crash. Had the container crashed, I would have realized earlier that something was off.
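If you would rather not stumble over such an entry by accident, a small script can search the log for it. What follows is only a sketch, assuming that nextcloud.log holds one JSON object per line with "time", "app" and "message" fields. The function name find_timeout_entries and the default search string are made up for this example.

```python
import json


def find_timeout_entries(lines, needle="cURL error 28"):
    """Return (time, app, message) tuples for log entries containing `needle`.

    Assumes each line is a JSON object with a "message" field, which is
    how nextcloud.log is usually laid out. Lines that do not parse are
    skipped instead of aborting the scan.
    """
    hits = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            entry = json.loads(raw)
        except ValueError:
            # Not valid JSON (truncated write, foreign content): ignore it.
            continue
        if not isinstance(entry, dict):
            continue
        message = str(entry.get("message", ""))
        if needle in message:
            hits.append((entry.get("time"), entry.get("app"), message))
    return hits
```

To use it, open the log file and pass the file object as lines, for example inside "with open(path_to_log) as fh:", where path_to_log is wherever your installation keeps nextcloud.log.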

How did I solve the problem?

First, I improved my monitoring so that I would notice in the future if Collabora misbehaved again. Next, I reverted the container to an older image, as my containers are updated automatically by Watchtower. After I started the container, I double-checked my Nextcloud: the functionality was restored and the timeouts were gone. I still have to figure out what the problem with the current image is, but at least I am up and running again.
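For the monitoring part, the idea is to ask Collabora directly instead of only checking that the container is running. The snippet below is a rough sketch of such a probe and not my actual monitoring setup. It assumes the capabilities endpoint from the log message is reachable under /hosting/capabilities and answers with a JSON object. The names probe_capabilities and _http_get are invented for this example.

```python
import json
import urllib.request


def _http_get(url, timeout):
    """Fetch `url` and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def probe_capabilities(url, timeout=5.0, fetch=None):
    """Return (ok, detail) for a Collabora capabilities URL.

    `fetch` can be swapped out for testing; by default a plain HTTP GET
    is used. A container that is up but not serving requests shows up
    here as a timeout or connection error.
    """
    if fetch is None:
        fetch = _http_get
    try:
        body = fetch(url, timeout)
    except (OSError, ValueError) as exc:
        # URLError and socket timeouts are OSError subclasses;
        # a malformed URL or undecodable body raises ValueError.
        return False, f"request failed: {exc}"
    try:
        data = json.loads(body)
    except ValueError:
        return False, "response was not valid JSON"
    if not isinstance(data, dict):
        return False, "response was not a JSON object"
    return True, "ok"
```

Called as probe_capabilities("https://office.example.com/hosting/capabilities"), with the hostname replaced by your own, it returns (True, "ok") when Collabora answers and (False, reason) otherwise, which is easy to hook into whatever alerting you already have.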

Wrapping up

There are several takeaways here:

  1. Honor your log files. They do contain the answer to your problems more often than you think.
  2. Complex systems can be hard to understand in their entirety, so take your time to recall all the moving parts.
  3. Sometimes, all you need is a bit of luck.