what's important:
* if task code ran to completion, it exits with 0. this exit code is the same regardless of (error, result)
* when it exited cleanly, we get the (error, result) values from the database
* if task timed out, the box code kills it and sets a timedOut flag. we can
ignore the exit code in this case.
* if task code was stopped, box code sends SIGTERM, which the task should ideally handle and exit with 70.
* if task code crashed and it caught the exception, it will return 50
* if task code crashed and node nuked us, it will exit with 1
* if task code was killed with some unhandleable signal, taskworker.sh will return the signal number (9=SIGKILL)
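
a minimal sketch of how the exit codes above could be mapped to a task state. this is Python for illustration only, not the box code. the function name, the constants and the returned strings are all hypothetical; only the numeric codes come from the list above.

```python
from typing import Optional

# exit codes taken from the notes above
EXIT_OK = 0                 # task code ran, result is in the database
EXIT_CRASHED_UNCAUGHT = 1   # node nuked us
EXIT_CRASHED_CAUGHT = 50    # task caught the exception itself
EXIT_STOPPED = 70           # task handled SIGTERM

# only 9=SIGKILL is named in the notes; other numbers are reported generically
SIGNAL_NAMES = {9: "SIGKILL"}


def classify_task_exit(code: Optional[int], timed_out: bool = False) -> str:
    """Map a task's exit code to a coarse state (hypothetical helper)."""
    if timed_out:
        # box code killed the task itself, so the exit code is ignored
        return "timed out"
    if code is None:
        # no exit code was reported at all
        return "unknown"
    if code == EXIT_OK:
        return "finished"
    if code == EXIT_STOPPED:
        return "stopped"
    if code == EXIT_CRASHED_CAUGHT:
        return "crashed (caught)"
    if code == EXIT_CRASHED_UNCAUGHT:
        return "crashed (uncaught)"
    # anything else is assumed to be a signal number passed through by taskworker.sh
    return "killed by " + SIGNAL_NAMES.get(code, f"signal {code}")
```

the timedOut check comes first on purpose: a task killed on timeout can exit with any code, so the flag has to win over the code.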
this is now moved entirely to cloudron-support --enable-remote-access.
this makes it clearer that users need ssh access to the server before
we can do anything about it. it was far too easy for people to just click
the button.
we have now also added clear terms that explain what remote access entails.
(what happens if support personnel make a mistake? who is liable? etc)
https://forum.cloudron.io/topic/13408/update-to-cloudron-8.3-error
The notification shows "Task xx crashed with code null".
The crux of the issue is that we use KillMode=control-group. When the service is
stopped, this ends up sending SIGTERM to the box code and all the sudo processes
in parallel. The box code then sees a sudo process die and records the task as failed.
To fix, we switch to KillMode=mixed. This gives box code a chance to handle SIGTERM
first. It cleans out its task list and kills all the sudo processes.
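
The ordering problem can be illustrated with a small sketch. This is Python for illustration only, not the box code. `TaskRunner`, its methods and the `kill` callback are hypothetical names. The point is that clearing the task list before killing the children means a later child exit is no longer recorded as a failure.

```python
from typing import Callable, Dict, List, Optional


class TaskRunner:
    """Toy model of a process that tracks child tasks (hypothetical)."""

    def __init__(self) -> None:
        self.tasks: Dict[str, str] = {}   # task id -> child process handle
        self.failed: List[str] = []       # task ids recorded as failed

    def start(self, task_id: str, child: str) -> None:
        self.tasks[task_id] = child

    def on_child_exit(self, task_id: str, code: Optional[int]) -> None:
        if task_id not in self.tasks:
            return  # already cleaned up during shutdown, nothing to record
        del self.tasks[task_id]
        if code != 0:
            self.failed.append(task_id)

    def on_sigterm(self, kill: Callable[[str], None]) -> None:
        # clean out the task list first, then kill the children ourselves
        pending = list(self.tasks.values())
        self.tasks.clear()
        for child in pending:
            kill(child)


# KillMode=mixed: we see SIGTERM before any child dies
runner = TaskRunner()
runner.start("t1", "sudo-1")
killed: List[str] = []
runner.on_sigterm(killed.append)
runner.on_child_exit("t1", None)   # child exit arrives afterwards

# KillMode=control-group: the child may die before we handle SIGTERM
racy = TaskRunner()
racy.start("t1", "sudo-1")
racy.on_child_exit("t1", None)     # recorded as a failed task
```

In the first run `runner.failed` stays empty. In the second run `racy.failed` contains `t1`, which corresponds to the spurious crash notification.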
it seems unbound-anchor is not a dep of unbound in ubuntu 24. some
installations are thus missing this package.
in any case, ignore unbound-anchor exit status
One issue was that the mail container was not getting refreshed with
up-to-date certs. The root cause is that it is refreshed only in the renewCerts()
cron job. If the cert renewal was triggered by an app task, the cron job
skips the restart (since the cert is already fresh).
The other issue is that we keep hitting 0 length certs when we run out of disk
space. The root cause is that when out of disk space, a cert renewal
causes the cert file to be written, but since there is no space it ends up 0 length.
Then, when the user restarts the server, the box code does not write the cert again.
This change fixes the two issues above. It includes:
* To simplify, we use the fallback cert only if we failed to get a LE cert. Expired LE certs
will continue to be used. nginx is fine with this.
* restart directory as well on renewal
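
A minimal sketch of the cert selection rule described above. This is Python for illustration only, not the box code. The function names and paths are hypothetical, and treating a 0 length file as "no cert" is an assumption about how the fix behaves.

```python
import os


def cert_is_usable(path: str) -> bool:
    """A cert file counts as usable only if it exists and is non-empty."""
    try:
        return os.path.getsize(path) > 0
    except OSError:
        # missing or unreadable file
        return False


def pick_cert(le_cert_path: str, fallback_cert_path: str) -> str:
    """Prefer the LE cert; use the fallback only if there is no usable LE cert.

    Expiry is deliberately not checked here: an expired LE cert is still
    preferred over the fallback cert.
    """
    if cert_is_usable(le_cert_path):
        return le_cert_path
    return fallback_cert_path
```

With a check like this, a 0 length cert left behind by a full disk is treated the same as a missing cert instead of being handed to nginx.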
Previously, the du plugin was collecting data every 20 seconds but
carbon was configured to keep only one data point every 12 hours, causing much
confusion.
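
To put numbers on the mismatch, here is a quick back-of-the-envelope calculation in Python. It assumes "every 12 hours" means one retained data point per 12 hour window; the constant names are made up for this example.

```python
COLLECT_INTERVAL_SECONDS = 20            # how often the du plugin collected
RETENTION_STEP_SECONDS = 12 * 60 * 60    # one retained point per 12 hours

# number of collected samples that map onto a single retained data point
samples_per_point = RETENTION_STEP_SECONDS // COLLECT_INTERVAL_SECONDS
print(samples_per_point)  # 2160
```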
In the process of reworking this, it was determined:
* No need to collect disk usage info over time. Not sure how that is useful
* Instead, collect CPU/Network/Block info over time. We get this now from docker stats
* We also collect info about the services (addon containers)
* No need to reconfigure collectd on each app change since there is no per-app
collectd configuration anymore.