Hi,

I am in the unenviable position of reporting a problem without privileges on the system or an understanding of the underlying program architecture, so please bear with me.

We have a user who can complete his ABAQUS job, using MPI to distribute the model across cluster nodes under Condor, when he requests only 8 compute elements. However, larger runs (24 CEs) terminate with the shadow exception included below:

Subject: Condor JobDisconnectedEvent::writeEvent() called without startd_addr

I am running an ABAQUS analysis on a cluster using Condor and MPI. However, after running for several hours, the job shuts down for no reason related to ABAQUS. The Condor log shows the following message:

022 (461459.000.000) 10/10 13:59:48
007 (461459.000.000) 10/10 13:59:48 Shadow exception! JobDisconnectedEvent::writeEvent() called without startd_addr

We have tried several corrective actions, based on our assumption that this is a network/filesystem issue (a specific file not being available when needed):

1- Moved the NFS-based filesystem from NAT through the cluster's head node to a direct Ethernet connection to each node in the cluster.
2- Changed the underlying NFS transport from UDP to TCP.

Does anyone have suggestions on what information I should ask our systems group to capture, such as packet data or the sockets in use at the time the error is thrown, in order to troubleshoot this problem?

Thank you for any suggestions to help me through this.

Thanks,
Brandon

Brandon Leeds
Lehigh University