When a 6-node cluster reports “connect() to 192.168.122.1 failed: Connection refused (111)” and all processes hang, as if waiting forever, here is the solution, which I simply copied and pasted:

Tell Open MPI not to use the virtual bridge interface virbr0 for sending messages over TCP/IP. Or better, tell it to use only eth0 for that purpose:

$ mpiexec --mca btl_tcp_if_include eth0 ...
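The first option mentioned above, excluding virbr0 rather than whitelisting eth0, might look like the sketch below. It assumes a Linux node whose loopback interface is named lo. Setting btl_tcp_if_exclude replaces the default exclusion list, so the loopback has to be listed again, and the include and exclude parameters should not be combined. The trailing "..." stands for the rest of your usual command line, as in the command above.

```shell
# Sketch: skip the loopback and the libvirt bridge, use everything else.
# "lo" is an assumed interface name; check yours with "ip addr".
mpiexec --mca btl_tcp_if_exclude lo,virbr0 ...
```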

This comes from the greedy behaviour of Open MPI’s tcp BTL component, which transmits messages using TCP/IP. It tries to use all of the available network interfaces that are up on each node in order to maximise the data bandwidth. Both nodes have virbr0 configured with the same address. Open MPI fails to recognise that both addresses are equal, but since the subnets match, it assumes that it should be able to talk over virbr0.

So process A is trying to send a message to process B, which resides on the other node. Process B listens on port P and process A knows this, so it tries to connect to 192.168.122.1:P. But this is actually the address given to the virbr0 interface on the node where process A is, so the node tries to talk to itself on a port where nothing is listening, hence the “connection refused” error.

To avoid having to remember such a long command, I added

alias mympirun='mpirun --mca btl_tcp_if_include eth0'

to ~/.bashrc.
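An alternative to the alias, sketched here on the assumption that eth0 is the right interface on every node, is to set the MCA parameter through the environment. Open MPI reads any variable named OMPI_MCA_<parameter>, so plain mpirun and mpiexec pick it up without extra flags.

```shell
# Sketch: same effect as "--mca btl_tcp_if_include eth0" on every launch.
# eth0 is an assumption; substitute the interface your nodes share.
export OMPI_MCA_btl_tcp_if_include=eth0
```

The same setting can also go in $HOME/.openmpi/mca-params.conf as the line "btl_tcp_if_include = eth0", which keeps it out of the shell configuration.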
