WSL2, NCCL: RuntimeError: NCCL Error 2: unhandled system error

created at 11-09-2021

error

With pytorch 1.7.1 on WSL2, multi-GPU distributed training fails with RuntimeError: NCCL Error 2: unhandled system error. The message on its own does not say what went wrong. After searching online, I found that NCCL can print a debug log with more detailed error information, so I added the following environment variables:

export NCCL_DEBUG=info
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
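
The same variables can also be set from inside the training script, provided this happens before the NCCL process group is initialized. The following is a minimal sketch, not part of the original setup: the helper name `configure_nccl_debug` is hypothetical, and the interface name `eth0` is simply the one used above.

```python
import os


def configure_nccl_debug(ifname="eth0"):
    """Set the NCCL debugging variables for this process and its children."""
    settings = {
        "NCCL_DEBUG": "INFO",          # print NCCL's INFO and WARN log lines
        "NCCL_SOCKET_IFNAME": ifname,  # network interface NCCL should use
        "NCCL_IB_DISABLE": "1",        # do not use the InfiniBand transport
    }
    os.environ.update(settings)
    return settings


# Assumption: this runs before torch.distributed.init_process_group(...),
# so that NCCL reads the values when it initializes.
configure_nccl_debug()
```

Exporting the variables in the shell, as above, has the same effect and is easier to undo.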

Training again, this time it printed more detailed error messages, as follows:

DESKTOP-SVB4DC0:26340:26340 [0] NCCL INFO Bootstrap : Using [0]eth0:172.24.6.154<0>
DESKTOP-SVB4DC0:26340:26340 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
DESKTOP-SVB4DC0:26340:26340 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
DESKTOP-SVB4DC0:26340:26340 [0] NCCL INFO NET/Socket : Using [0]eth0:172.24.6.154<0>
DESKTOP-SVB4DC0:26340:26340 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.0

DESKTOP-SVB4DC0:26340:26714 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:02/../../0000:02:00.0
DESKTOP-SVB4DC0:26340:26714 [0] NCCL INFO graph/xml.cc:469 -> 2

DESKTOP-SVB4DC0:26340:26715 [1] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:02/../../0000:02:00.0
DESKTOP-SVB4DC0:26340:26715 [1] NCCL INFO graph/xml.cc:469 -> 2
DESKTOP-SVB4DC0:26340:26715 [1] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP-SVB4DC0:26340:26715 [1] NCCL INFO graph/topo.cc:523 -> 2
DESKTOP-SVB4DC0:26340:26715 [1] NCCL INFO init.cc:581 -> 2
DESKTOP-SVB4DC0:26340:26715 [1] NCCL INFO init.cc:840 -> 2
DESKTOP-SVB4DC0:26340:26714 [0] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP-SVB4DC0:26340:26714 [0] NCCL INFO graph/topo.cc:523 -> 2
DESKTOP-SVB4DC0:26340:26714 [0] NCCL INFO init.cc:581 -> 2
DESKTOP-SVB4DC0:26340:26714 [0] NCCL INFO init.cc:840 -> 2
DESKTOP-SVB4DC0:26340:26715 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
DESKTOP-SVB4DC0:26340:26714 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
DESKTOP-SVB4DC0:26340:26340 [0] NCCL INFO init.cc:906 -> 2

Judging from the log, the main problem is: NCCL WARN Could not find real path of /sys/class/pci_bus/0000:02/../../0000:02:00.0. I searched online but did not find a relevant solution, so I worked one out myself.
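
In a long log the WARN lines are easy to miss among the INFO lines. As a rough sketch, they can be filtered out with a small script. The regular expression is an assumption based only on the line format seen in the log above, and the function name `find_nccl_warnings` is hypothetical.

```python
import re

# Assumed line format, taken from the log above:
# HOST:PID:TID [rank] file.cc:LINE NCCL WARN message
WARN_RE = re.compile(r"^(\S+):(\d+):(\d+) \[(\d+)\] (\S+):(\d+) NCCL WARN (.*)$")


def find_nccl_warnings(log_text):
    """Return (rank, source_location, message) for every NCCL WARN line."""
    warnings = []
    for line in log_text.splitlines():
        m = WARN_RE.match(line.strip())
        if m:
            location = f"{m.group(5)}:{m.group(6)}"
            warnings.append((int(m.group(4)), location, m.group(7)))
    return warnings


# Three lines copied from the log above: only the second is a warning.
sample = """\
DESKTOP-SVB4DC0:26340:26340 [0] NCCL INFO Using network Socket
DESKTOP-SVB4DC0:26340:26714 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:02/../../0000:02:00.0
DESKTOP-SVB4DC0:26340:26714 [0] NCCL INFO graph/xml.cc:469 -> 2
"""

for rank, where, message in find_nccl_warnings(sample):
    print(f"rank {rank} at {where}: {message}")
```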

solution

The main cause of this problem is that the file structure in WSL2 is slightly different from that of an ordinary Linux system, so NCCL cannot find the information it looks for (here, the PCI path in the warning above). `pytorch1.7.1` uses NCCL version 2.7.8+cuda11.0, which is too old: the problem is fixed in NCCL 2.11.4. However, the pytorch downloaded and installed from the website is precompiled, so even if NCCL 2.11.4 is installed on the system, pytorch still calls the 2.7.8 version it was compiled with. You therefore need to build and install pytorch locally from the downloaded source code. Before compiling, set this environment variable:

export USE_SYSTEM_NCCL=1

Then execute python3 setup.py install.
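
After the build finishes, it is worth checking which NCCL version pytorch actually reports. The sketch below is an assumption-laden helper rather than an official recipe: `torch.cuda.nccl.version()` returns an integer code such as 2708 in older pytorch releases and a tuple in newer ones, so the hypothetical `normalize_nccl_version` handles both, and the hypothetical `has_wsl2_fix` compares the result with 2.11.4.

```python
def normalize_nccl_version(version):
    """Convert an NCCL version (int code or tuple) to (major, minor, patch)."""
    if isinstance(version, tuple):
        return tuple(version[:3])
    # Integer codes: 2708 means 2.7.8 (NCCL up to 2.8),
    # 21104 means 2.11.4 (NCCL 2.9 and later).
    if version >= 10000:
        return (version // 10000, (version % 10000) // 100, version % 100)
    return (version // 1000, (version % 1000) // 100, version % 100)


def has_wsl2_fix(version, minimum=(2, 11, 4)):
    """True if the given NCCL version is at least the one that fixes the problem."""
    return normalize_nccl_version(version) >= minimum


if __name__ == "__main__":
    try:
        import torch
        linked = torch.cuda.nccl.version()
    except Exception as exc:  # pytorch missing, or built without NCCL support
        print(f"Could not query the NCCL version: {exc}")
    else:
        print("NCCL version used by pytorch:", normalize_nccl_version(linked))
        print("New enough for WSL2:", has_wsl2_fix(linked))
```

If this still reports 2.7.8, the build most likely did not pick up the system NCCL.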

edited at: 11-09-2021