MATLAB: How to reset a MATLAB Distributed Computing Engine worker session that appears to be hung

hang, hung, MATLAB Parallel Server, mdce, Parallel Computing Toolbox, worker

There are two situations in which an MDCE worker session can hang:
– At startup: if I start a worker with the startworker command and the system never returns the prompt, the worker may be hung. (In this case, skip step 1 of the solution.)
– While running: if a task (and therefore its job) appears to be stuck in the running state, or if a task times out, the cause may be a hung worker session.

Best Answer

The steps for clearing a hung worker are listed here from safest to most drastic. Try them in this order, testing after each step to see whether the problem has cleared.
1. If a job is stuck in the running state because one of its tasks is stuck running on a hung worker, you can try destroying the job from the client MATLAB session by using the destroy function. Submit another job and see if the worker in question now properly evaluates its tasks.
2. Use the stopworker command on the worker node to end the worker session. Restart the worker session with the command startworker -clean.
3. Shut down all MDCE services on the worker node with the command mdce stop. Note that this will shut down all worker and job manager sessions on the node. Restart all sessions accordingly.
4. If mdce stop does not return a prompt, then as a last resort you can delete the worker's checkpoint directories and reboot the node to restart its MDCE sessions.
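
As a rough illustration, steps 2 and 3 could be scripted on a Unix-like worker node along the following lines. This is only a sketch under several assumptions that are not part of the answer above: the MDCE scripts (stopworker, startworker, mdce) are on the PATH, the GNU coreutils timeout command is available, and MDCE_TIMEOUT, run_bounded, and reset_worker are hypothetical names chosen for this example. The script can only detect a command that fails or never returns; it cannot tell whether the worker evaluates tasks correctly again, so you still need to submit a test job after each step.

```shell
#!/bin/sh
# Sketch of steps 2 and 3; run on the worker node once step 1 has not helped.
# Assumptions (not from the answer above): MDCE scripts are on PATH,
# GNU `timeout` exists, and MDCE_TIMEOUT is a hypothetical setting.

MDCE_TIMEOUT="${MDCE_TIMEOUT:-120}"  # seconds before a command is treated as hung

# Run a command, giving up if it does not return within MDCE_TIMEOUT seconds.
run_bounded() {
    timeout "$MDCE_TIMEOUT" "$@"
}

reset_worker() {
    # Step 2: end the worker session, then restart it with -clean.
    if run_bounded stopworker && run_bounded startworker -clean; then
        echo "step 2: worker restarted; submit a test job to confirm"
        return 0
    fi

    # Step 3: stop all MDCE services on this node (workers and job managers).
    if run_bounded mdce stop; then
        echo "step 3: MDCE stopped; restart the sessions on this node"
        return 0
    fi

    # Step 4 is left manual on purpose: deleting checkpoint directories and
    # rebooting is too drastic to automate in a sketch like this.
    echo "step 4: mdce stop failed or did not return; use the last resort" >&2
    return 1
}
```

To use it, source the file and call reset_worker. The escalation here is triggered only by a command failing or timing out, which is a narrower condition than "the problem is not cleared", so treat a successful step 2 as a prompt to test the worker rather than as proof that it is healthy.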