tasks: fix update failed notification

https://forum.cloudron.io/topic/13408/update-to-cloudron-8.3-error

We get a "Task xx crashed with code null" notification.

The crux of the issue is that we use KillMode=control-group. This ends
up sending SIGTERM to the box code and all the sudo processes in parallel.
The box code then sees a sudo process die and records the task as failed.

To fix, we switch to KillMode=mixed. This gives the box code a chance to
handle SIGTERM first: it cleans out its task list and kills all the sudo
processes itself.
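
The ordering described above can be illustrated with a small shell stand-in. This is only a sketch, not the box code: task_pids, on_term and run_supervisor are hypothetical names, and the sleep children stand in for the sudo task processes. It assumes the supervisor is the only process that receives SIGTERM at first, which is what KillMode=mixed arranges for the main process of a unit.

```bash
#!/bin/bash
# Sketch only. task_pids, on_term and run_supervisor are hypothetical names,
# and the 'sleep' children stand in for the real task processes.

task_pids=()

# SIGTERM handler: forget the tasks first, then stop them ourselves, so a
# dying child is never mistaken for a crashed task.
on_term() {
    local pids=("${task_pids[@]}")
    task_pids=()
    kill -TERM "${pids[@]}" 2>/dev/null
    wait "${pids[@]}" 2>/dev/null
    echo "stopped ${#pids[@]} tasks cleanly"
    exit 0
}

run_supervisor() {
    trap on_term TERM
    sleep 60 & task_pids+=("$!")
    sleep 60 & task_pids+=("$!")
    wait    # a trapped SIGTERM interrupts this wait and runs on_term
}
```

Starting it with `run_supervisor &` and then sending `kill -TERM $!` exercises the handler. Under KillMode=control-group the children would receive SIGTERM at the same moment as the supervisor, so the supervisor could observe them dying before its own handler had cleared the task list.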
Girish Ramakrishnan
2025-06-17 22:30:34 +02:00
parent ca25c6075b
commit fb39aa32bb
7 changed files with 35 additions and 34 deletions

@@ -35,13 +35,16 @@ systemctl reset-failed "${service_name}" 2>/dev/null || true
options="-p TimeoutStopSec=10s -p MemoryMax=${memory_limit_mb}M -p OOMScoreAdjust=${oom_score_adjust} --wait"
# systemd-run is used to create resource limited tasks. the tasks are in separate cgroup and won't get affected by box start/stop
# 1. tasks should stop when box code is stopped. in this state, dashboard is unreachable and we don't want things running in background.
# 2. if tasks continue running, box code needs some reconciliation code to track tasks. systemd has no mechanism to handle both stop and restart.
# when using BindsTo=box.service, the tasks restart with systemctl restart. This defeats any point of tasks running in background if they start afresh.
# this design is because tasks should not run when box code is stopped. if dashboard is down, nothing should run in background.
# besides, if tasks continue running, box code needs complex reconciliation code on start up.
# To achieve above design, we could use BindsTo=box.service. While this stops all tasks on systemctl stop, it restarts tasks on systemctl restart.
# Another approach was to use --scope but this is incompatible with --wait (and then we have to start polling status to get exit code)
# it seems systemd-run does not return the exit status of the process despite --wait
if ! systemd-run --unit "${service_name}" --nice "${nice}" --uid=${id} --gid=${id} ${options} --setenv HOME=${HOME} --setenv USER=${SUDO_USER} \
--setenv DEBUG=box:* --setenv BOX_ENV=${BOX_ENV} --setenv NODE_ENV=production "${task_worker}" "${task_id}" "${logfile}"; then
# it seems systemd-run does not return the exit status of the process despite --wait but at least it waits
if ! systemd-run --unit "${service_name}" --wait --uid=${id} --gid=${id} \
-p TimeoutStopSec=2s -p MemoryMax=${memory_limit_mb}M -p OOMScoreAdjust=${oom_score_adjust} --nice "${nice}" \
--setenv HOME=${HOME} --setenv USER=${SUDO_USER} --setenv DEBUG=box:* --setenv BOX_ENV=${BOX_ENV} --setenv NODE_ENV=production \
"${task_worker}" "${task_id}" "${logfile}"; then
echo "Service ${service_name} failed to run" # this only happens if the path to task worker itself is wrong
fi