MATLAB: Problem with parallel configuration. Parallel job test validation failed!!

cluster, job, mdce, mpi, parallel computing

We want to set up a cluster of two PCs (Intel Core i5, 4 cores per machine). We are using MATLAB R2009b and the Admin Center to create a job manager with 4 workers, one core per worker (2 workers per machine). The mdce service is installed on both machines with the default mdce_def. This part works fine.
The problems appear when we try to validate a parallel configuration that uses this job manager with a minimum and maximum of 4 workers: the parallel job test fails.
The failed test writes several error lines to mdce-service.log in the log folder:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:job aborted using terminate/kill:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:process: node: exit code: error message:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPI_Comm_connect(119)…………………: MPI_Comm_connect(port="tag=0 port=28351 description=lp-apd12 ifname=172.22.4.92 ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0000000001023A60) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:0: localhost: 1: Fatal error in MPI_Comm_connect: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPID_Comm_connect(187)………………..:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_Comm_connect(405)……………….:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIC_Sendrecv(126)……………………:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPI_Comm_connect(119)…………………: MPI_Comm_connect(port="tag=0 port=28351 description=lp-apd12 ifname=172.22.4.92 ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0000000001023A60) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIC_Wait(270)……………………….:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPID_Comm_connect(187)………………..:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_Comm_connect(405)……………….:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_CH3i_Progress_wait(215)………….: an error occurred while handling an event returned by MPIDU_Sock_Wait()
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIC_Sendrecv(126)……………………:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_CH3I_Progress_handle_sock_event(420):
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIC_Wait(270)……………………….:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_CH3i_Progress_wait(215)………….: an error occurred while handling an event returned by MPIDU_Sock_Wait()
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_CH3I_Progress_handle_sock_event(420):
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:Fatal error in MPI_Intercomm_merge: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:MPI_Intercomm_merge(284): MPI_Intercomm_merge(comm=0xc4000005, high=1, newintracomm=0000000001023A68) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:job aborted using terminate/kill:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDU_Sock_wait(2603)…………………: The specified network name is no longer available. (errno 64)
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:process: node: exit code: error message:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:0: localhost: 1: Fatal error in MPI_Intercomm_merge: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:MPI_Intercomm_merge(284): MPI_Intercomm_merge(comm=0xc4000005, high=1, newintracomm=0000000001023A68) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:MPI_Intercomm_merge(262): Too many communicators
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDU_Sock_wait(2603)…………………: The specified network name is no longer available. (errno 64)
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:MPI_Intercomm_merge(262): Too many communicators
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "cp".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "nodisplay".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "Djava.security.policy=C:\Program Files\MATLAB\R2009b\toolbox\distcomp\config\jsk-all.policy".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "cp".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "nodisplay".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "Djava.security.policy=C:\Program Files\MATLAB\R2009b\toolbox\distcomp\config\jsk-all.policy".
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-18:out:Warning: Unable to locate a personal folder for $documents\MATLAB
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-18:out:{Warning: Userpath must be an absolute path and must exist on disk.}
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-17:out:Warning: Unable to locate a personal folder for $documents\MATLAB
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-17:out:{Warning: Userpath must be an absolute path and must exist on disk.}
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out: To get started, type one of these: helpwin, helpdesk, or demo.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out: For product information, visit www.mathworks.com.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out: To get started, type one of these: helpwin, helpdesk, or demo.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out: For product information, visit www.mathworks.com.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out:
INFO | jvm 1 | 2011/08/18 16:09:07 | Thu Aug 18 16:09:07 CEST 2011:Group-17:out:» Thu Aug 18 16:09:07 CEST 2011 Worker started: pc-goba_worker02
INFO | jvm 1 | 2011/08/18 16:09:08 | Thu Aug 18 16:09:07 CEST 2011:Group-18:out:» Thu Aug 18 16:09:07 CEST 2011 Worker started: pc-goba_worker01
Thanks

Best Answer

It looks like your hosts can't resolve their IP addresses correctly. Check the networking setup very closely and make sure:
Hosts can ping each other by short name (yourhostname)
Hosts can ping each other by fully qualified name (yourhostname.yourdomain.com)
(you'll need to do this for both hosts in the cluster)
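If you would rather script the name-resolution part of this check than ping by hand, a minimal Python sketch along these lines could be run on each host. It only tests whether the names resolve to an address, not whether the hosts are reachable, and the host names and domain in it (pc-a, pc-b, example.com) are placeholders, not values from this thread:

```python
import socket


def resolve(name):
    """Return the IPv4 address for name, or None if it does not resolve."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_hosts(short_names, domain, resolver=resolve):
    """Look up every host by short name and by fully qualified name."""
    results = {}
    for short in short_names:
        for name in (short, f"{short}.{domain}"):
            results[name] = resolver(name)
    return results


if __name__ == "__main__":
    # Placeholder host names and domain: substitute your own cluster's values.
    for name, addr in check_hosts(["pc-a", "pc-b"], "example.com").items():
        print(f"{name:30s} {addr or 'DOES NOT RESOLVE'}")
```

Any name reported as not resolving on either machine is a candidate cause for the MPI connection failures in the log.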
One of the most common things I've seen is that the DNS search order doesn't include the DNS domain of the host itself. For example, the fully qualified hostname is
myhost.desktops.mycorp.com
and the DNS search order is mycorp.com
So the host can't resolve "myhost" and then you get odd networking problems where things can't connect reliably. You can see what these settings are by running "ipconfig /all" at a command prompt, or by looking at the properties on the network connection.
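To make the search-order problem concrete, here is a small Python sketch of the check described above. It is a simplified model that assumes the resolver just appends each search suffix to the short name; the function names are made up for illustration, and the values would be copied by hand from the "ipconfig /all" output:

```python
def candidate_names(short_name, search_order):
    """Names the resolver would try for a short name under this simplified model."""
    return [f"{short_name}.{suffix}" for suffix in search_order]


def missing_search_suffix(fqdn, search_order):
    """Return the host's own DNS domain if the search order lacks it, else None."""
    _short, _, domain = fqdn.partition(".")
    suffixes = {s.lower().rstrip(".") for s in search_order}
    return None if domain.lower() in suffixes else domain


if __name__ == "__main__":
    fqdn = "myhost.desktops.mycorp.com"
    search_order = ["mycorp.com"]
    print(candidate_names("myhost", search_order))
    print(missing_search_suffix(fqdn, search_order))
```

With the example values from this answer, the only name tried is myhost.mycorp.com, and the function reports desktops.mycorp.com as the suffix missing from the search order.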
I think Java is just reporting the errors here and is itself working OK.