How to kill stuck processes that block your deployments

One of the most common causes of stuck environment builds is a runaway or stuck process.

To ensure data consistency, the deployment flow tries to terminate running processes gracefully. Sometimes this does not work, and the deployment ends up waiting forever on a blocking process. Two simple commands, run over SSH, can get things moving again without having to wait for our support team to intervene.

As an example, let’s assume a cron process is stuck.

The demo application contains one blocker.sh script:

#!/bin/sh

sleep 3600

It is configured as a cron that runs every 5 minutes in .platform.app.yaml:

  blocker:
    spec: '*/5 * * * *'
    cmd: '/bin/bash /app/blocker.sh'

Now this process is blocking our new deployment, and the log is stuck at:

    Redeploying environment main
      Preparing deployment
      Closing services router and app

The first thing to do is check whether you can connect to the environment with SSH (this should work most of the time). If the SSH connection is successful, run: ps fuxa.

The output will be a list of processes, similar to this one:

web@app.0:~$ ps fuxa
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         153  0.9  0.0 197656 33176 ?        Sl   15:05   0:00 /usr/bin/python2.7 /etc/platform/commands/notify
web          159  0.0  0.0 271388 26620 ?        Sl   15:05   0:00  \_ /usr/bin/python2.7 /etc/platform/commands/notify
web          162  0.0  0.0   4280   740 ?        S    15:05   0:00      \_ /bin/dash -c /bin/bash /app/blocker.sh
web          165  0.0  0.0  12920  2740 ?        S    15:05   0:00          \_ /bin/bash /app/blocker.sh
web          166  0.0  0.0   7580   668 ?        S    15:05   0:00              \_ sleep 3600
root           1  0.0  0.0  15816  1096 ?        Ss+  15:02   0:00 init [2]
root          74  0.0  0.0   4204  1132 ?        Ss   15:02   0:00 runsvdir -P /etc/service log: ...................................................................................................................
root          80  0.0  0.0   4052   704 ?        Ss   15:02   0:00  \_ runsv tideways
root          81  0.0  0.0   4052   696 ?        Ss   15:02   0:00  \_ runsv ssh
root          90  0.0  0.0  72104  5608 ?        S    15:02   0:00  |   \_ /usr/sbin/sshd -D
root         167  0.0  0.0  94876  6472 ?        Ss   15:06   0:00  |       \_ sshd: web [priv]
web          173  0.0  0.0  94876  3668 ?        S    15:06   0:00  |           \_ sshd: web@pts/0
web          174  0.0  0.0  21768  3852 pts/0    Ss   15:06   0:00  |               \_ -bash
web          190  0.0  0.0  37448  3172 pts/0    R+   15:06   0:00  |                   \_ ps fuxa
root          82  0.0  0.0   4052   700 ?        Ss   15:02   0:00  \_ runsv nginx
root         115  0.0  0.0  36984  6684 ?        S    15:02   0:00  |   \_ nginx: master process /usr/sbin/nginx -g daemon off; error_log /var/log/error.log; -c /etc/nginx/nginx.conf
web          121  0.0  0.0  45460 11560 ?        S    15:02   0:00  |       \_ nginx: worker process
root          83  0.0  0.0   4052   752 ?        Ss   15:02   0:00  \_ runsv newrelic
root          84  0.0  0.0   4052   644 ?        Ss   15:02   0:00  \_ runsv idmapd
root          87  0.0  0.0  23348  2152 ?        S    15:02   0:00  |   \_ /usr/sbin/rpc.idmapd -f -C -p /run/rpc_pipefs
root          85  0.0  0.0   4052   700 ?        Ss   15:02   0:00  \_ runsv app
web          111  0.0  0.0 359464 30288 ?        Ss   15:02   0:00      \_ php-fpm: master process (/etc/php/7.2-zts/fpm/php-fpm.conf)
web          116  0.0  0.0  12932   296 ?        S    15:02   0:00          \_ /bin/bash /etc/platform/start-app
web          117  0.0  0.0   7584   656 ?        S    15:02   0:00              \_ tee -a /var/log/app.log

You can probably already see that our stuck process is here:

web          162  0.0  0.0   4280   740 ?        S    15:05   0:00      \_ /bin/dash -c /bin/bash /app/blocker.sh
web          165  0.0  0.0  12920  2740 ?        S    15:05   0:00          \_ /bin/bash /app/blocker.sh
web          166  0.0  0.0   7580   668 ?        S    15:05   0:00              \_ sleep 3600

If you have trouble locating the stuck process, it’s generally found in the process tree as a child of:

/usr/bin/python2.7 /etc/platform/commands/notify

The notify process is a special process that monitors the container state.

The important thing in the above output is the number listed in the second column, after the web user: this is the process ID (PID), a unique identifier for each running process. To unblock the deployment, we need to forcefully stop the stuck process.

This can be done with the kill -9 command, followed by the list of process IDs you want to stop.

Therefore, to stop the cron in our example, we’d need to run:

kill -9 162 165 166
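
If you would rather give the processes one last chance to exit cleanly before forcing them, the same step can be sketched as a small shell function. This is only an illustration: the function name stop_pids and the 2-second grace period are assumptions, not part of the platform.

```shell
#!/bin/sh
# stop_pids: hypothetical helper (the name and the 2-second grace period
# are illustrative). Sends SIGTERM to every PID, waits briefly, then
# sends SIGKILL (the same signal as kill -9) to any PID still present.
stop_pids() {
    for pid in "$@"; do
        kill -TERM "$pid" 2>/dev/null
    done
    sleep 2
    for pid in "$@"; do
        # kill -0 sends no signal; it only checks that the PID still exists
        if kill -0 "$pid" 2>/dev/null; then
            kill -KILL "$pid" 2>/dev/null
        fi
    done
}

# With the PIDs from the example above:
# stop_pids 162 165 166
```

If the process ignores SIGTERM, the end result is the same as running kill -9 directly.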

Be careful: there might be more processes blocking the deployment! Inspect the process list carefully (all application processes run under the web user) and repeat the previous command for each of them.
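
To make that inspection easier, you can narrow the list to a single user. The sketch below assumes the procps version of ps (the one that produced the output above); the function name list_user_procs is made up for illustration.

```shell
#!/bin/sh
# list_user_procs: hypothetical helper that prints only the processes
# owned by one user. Columns: PID, parent PID, elapsed time, command line.
# A trailing "=" after each column name suppresses the header row.
list_user_procs() {
    ps -u "$1" -o pid=,ppid=,etime=,args=
}

# On the application container you would pass the web user:
# list_user_procs web
```

The elapsed time column helps spot a cron that has been running far longer than expected.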

Once done, the SSH connection will be terminated and you’ll see this friendly message:

Message from bot@platform.sh at 15:09:36:
  This container is being dematerialized. See you on the other side.

Your previously stuck deployment will now continue.

Note: if the SSH connection cannot be established, you will need to open a support ticket.


By the way, you can also use the pkill command. I find it easier to run pkill php or pkill node. The pkill command kills processes by name, and it kills all matching processes at once, so there is no need to look up each individual ID. Since you are trying to get the container redeployed, everything is probably safe to kill. Only processes started by the web user can be killed. If you’re not sure what is keeping the deployment stuck, pkill everything that belongs to the web user. If that doesn’t change anything, it’s time for a support ticket.
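
A slightly more cautious variant is to preview the matches with pgrep before killing them. This is a sketch that assumes the procps-ng versions of pgrep and pkill; the function name kill_by_pattern is made up for illustration.

```shell
#!/bin/sh
# kill_by_pattern: hypothetical helper that lists and then kills the
# processes of one user whose command line matches a pattern.
#   -u  limit matches to processes owned by that user
#   -f  match against the full command line, not just the process name
#   -a  print the command line next to each PID (pgrep only)
kill_by_pattern() {
    user=$1
    pattern=$2
    # Show the matches first; stop here if nothing matches
    pgrep -u "$user" -a -f "$pattern" || return 1
    # Send SIGKILL (the same signal as kill -9) to every match
    pkill -KILL -u "$user" -f "$pattern"
}

# For the example in this thread:
# kill_by_pattern web 'blocker.sh'
```

Because -f matches the whole command line, keep the pattern specific enough that it cannot match unrelated processes.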


Another tip: if you have a cron that regularly causes issues and you can’t find the cause, simply prepend timeout 290s to the command. This will automatically terminate the cron if it runs for longer than 290 seconds (a little less than 5 minutes).

So your cron would become:

  blocker:
    spec: '*/5 * * * *'
    cmd: 'timeout 290s /bin/bash /app/blocker.sh'
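
To see what timeout does before putting it in a cron, you can try it in a shell. The sketch below assumes GNU coreutils timeout and uses a 1-second limit and a sleep command as stand-ins for the real values.

```shell
#!/bin/sh
# Stand-in for a stuck cron command: sleeps far longer than the limit.
# The 1-second limit is shortened for the demo; the cron above uses 290s.
status=0
timeout 1s sleep 30 || status=$?

# GNU timeout exits with status 124 when it had to stop the command
if [ "$status" -eq 124 ]; then
    echo "command timed out"
fi
```

By default timeout sends SIGTERM. If the command might ignore that signal, GNU timeout also accepts -k DURATION to follow up with SIGKILL, for example timeout -k 5s 290s followed by the command.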