Hi,
I am in the unenviable position of relaying a problem without having
privileges on the system or an understanding of the underlying program
architecture, so please bear with me.
We have a user who can complete his ABAQUS job, which uses MPI to
distribute the model across cluster nodes under Condor, when he requests
only 8 compute elements. However, larger runs (24 CEs) terminate with
the shadow exception included below:
Subject: Condor JobDisconnectedEvent::writeEvent() called without startd_addr
I am running an ABAQUS analysis on a cluster using Condor and MPI.
However, after running for several hours, the job shuts down for no
reason related to ABAQUS. The Condor log shows the following message:
022 (461459.000.000) 10/10 13:59:48
007 (461459.000.000) 10/10 13:59:48 Shadow exception!
    JobDisconnectedEvent::writeEvent() called without startd_addr
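In case it helps narrow down when the disconnects happen, here is a minimal Python sketch for pulling the 022 and 007 events out of a job log. It is only a sketch under assumptions: the function name find_events is made up for illustration, the header layout is inferred from the excerpt above, and I am assuming that events in the log are separated by lines containing only "...".

```python
import re
import sys

# Event header as it appears in the excerpt above (layout is an assumption):
# "<3-digit code> (<cluster>.<proc>.<subproc>) <MM/DD> <HH:MM:SS> [text]"
HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2})\s*(.*)$"
)


def find_events(lines, codes=("007", "022")):
    """Collect events whose code is in `codes`, with their detail lines.

    find_events is a hypothetical helper, not part of Condor.
    """
    events = []
    current = None  # the matching event we are currently collecting, if any
    for raw in lines:
        line = raw.rstrip("\n")
        match = HEADER.match(line)
        if match:
            code, cluster, proc, _subproc, date, time, text = match.groups()
            current = None
            if code in codes:
                current = {
                    "code": code,
                    "job": f"{cluster}.{proc}",
                    "date": date,
                    "time": time,
                    "details": [text] if text else [],
                }
                events.append(current)
        elif current is not None:
            stripped = line.strip()
            # Skip blank lines and the assumed "..." event separator.
            if stripped and stripped != "...":
                current["details"].append(stripped)
    return events


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_events.py <path to the job's log file>
    with open(sys.argv[1]) as log:
        for event in find_events(log):
            print(event["code"], event["job"], event["date"], event["time"])
            for detail in event["details"]:
                print("    " + detail)
```

The timestamps it prints could then be compared against NFS or network logs on the nodes for the same period.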
We have tried several corrective actions, based on our assumption that
this is a network/filesystem issue (a specific file not being available
when needed), including:
1- moved the NFS-based filesystem from being routed via NAT through the
head node of the cluster to being directly connected via Ethernet ports
to each node in the cluster
2- changed the underlying NFS protocol to use TCP instead of UDP
Does anyone have suggestions on what information I should ask our
systems group to capture, such as packet data or the sockets in use at
the time the error is thrown, in order to troubleshoot this problem?
Thank you for any suggestions to help me through this.
Thanks,
Brandon
Brandon Leeds
Lehigh University