Working around an MPI network interface error

While working with MPI and Slurm, I ran into an error that involves the cluster configuration, the Slurm configuration, and the MPI version.

Warming Up
On the one hand, my cluster has two networks (IB and ETH), and this is an excerpt of its /etc/hosts (IP addresses omitted):

frontend.local frontend
cuda00.local cuda00
cuda01.local cuda01
frontend.ibnet
cuda00i.ibnet
cuda01i.ibnet

On the other hand, the Slurm configuration only took the hostname of each node into account:

NodeName=cuda00 CPUs=8 SocketsPerBoard=2 CoresPerSocket=4 ...
NodeName=cuda01 CPUs=8 SocketsPerBoard=2 CoresPerSocket=4 ...

Finally, it is important to recall that MPICH, by default, uses the IB network for inter-process communication.

The application
I developed a dummy MPI application that spawns itself, with the aim of executing a new instance with more processes than the previous one.

if (parent == MPI_COMM_NULL) {
    /* First instance: build the host list from the Slurm allocation
       and spawn a larger copy of ourselves. */
    getHostlist(slurm_alloc_msg_ptr->node_list, hostlist);
    MPI_Info_set(info, "hosts", hostlist);
    printf("[%d/%d] Spawning...\n", rank, world_size);
    MPI_Comm_spawn("./app", MPI_ARGV_NULL, rank_size * 2, info, 0,
                   MPI_COMM_WORLD, &parent, MPI_ERRCODES_IGNORE);
} else {
    /* Spawned instance: just say hello. */
    printf("[%d/%d] Hello from %d of %d ranks (%s).\n",
           rank, world_size, rank, world_size, processor_name);
}
return 0;

Basically, the first execution of the application spawns more processes, which then print a "hello".

The execution
I queued the job with Slurm. In the command I asked for 4 nodes (because the execution will spawn 4 processes), but initially I created only 1 process:

salloc -N4 mpiexec -n 1 ./jobExpansion

And during the spawning, this is what I got:

Fatal error in MPI_Init: Unknown error class, error stack:
MPID_Init(304)........................: spawned process group was unable to connect back to the parent on port <tag#0$description#cuda00$port#60832$ifname#$>
MPIDU_Complete_posted_with_error(1137): Process failed
MPIDU_Complete_posted_with_error(1137): Process failed

As can be observed in the highlighted areas, the error points to some networking issue, because the hostname and the port are referenced while the interface name (ifname) is empty.

Finally, I recalled that we can force the instance to use a specific network interface by launching with:

salloc -N4 mpiexec -iface eth0 -n 1 ./jobExpansion

The problem was solved.
However, this is not an optimal solution, since we are not leveraging the IB network, and the interface parameter should not have to be given explicitly.
So, I will dig into the problem in order to configure all the tools properly.

The appropriate solution
In order to keep taking advantage of the IB network, we'll configure Slurm better, indicating the IP of the controller's IB interface and the IB address of each node:

NodeName=cuda00 NodeAddr= ...
NodeName=cuda01 NodeAddr= ...
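For illustration, assuming the IB interfaces resolve to addresses in a private range such as 10.0.0.0/24 (these values are hypothetical; the real ones are omitted above), the entries would look like:

```
ControlMachine=frontend ControlAddr=10.0.0.1
NodeName=cuda00 NodeAddr=10.0.0.10 CPUs=8 SocketsPerBoard=2 CoresPerSocket=4 ...
NodeName=cuda01 NodeAddr=10.0.0.11 CPUs=8 SocketsPerBoard=2 CoresPerSocket=4 ...
```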

So now we should be able to launch our application without extra flags, letting Slurm route the transfers:

salloc -N4 mpiexec -n 1 ./jobExpansion

Curiously, the error appeared again, so if you still want to use IB, you'd better select the IB interface explicitly:

salloc -N4 mpiexec -iface ib0 -n 1 ./jobExpansion

The next step will be to check the configuration of the MPICH installation...