Re: [Condor-users] Condor JobDisconnectedEvent::writeEvent() called without startd_addr error

Date: Tue, 21 Oct 2008 12:30:59 -0400

From: Brandon Leeds <byl405@xxxxxxxxxx>

Subject: Re: [Condor-users] Condor JobDisconnectedEvent::writeEvent() called without startd_addr error

One crucial point of information I was told I left out is what version of Condor we are using:

CondorVersion = "$CondorVersion: 7.0.3 Jun 20 2008 BuildID: 91405 $"

I am asking for an upgrade, but am not sure yet when this could happen. Thanks,
--Brandon

Brandon Leeds wrote:

Hi,
I am in the unenviable position of communicating a problem without having privileges on the system or an understanding of the underlying programmatic architecture, so please bear with me.

We have a user who can complete his ABAQUS job, which uses MPI to distribute the model over cluster nodes through Condor, when he requests only 8 compute elements. However, larger runs (of 24 CEs) terminate with the shadow exception included below:

Subject: Condor JobDisconnectedEvent::writeEvent() called without startd_addr

I am running an ABAQUS analysis on a cluster using Condor and MPI. However, after running for several hours, the job is shutting down without any reason related to ABAQUS. The condor log is showing a message as follows:

022 (461459.000.000) 10/10 13:59:48 007 (461459.000.000) 10/10 13:59:48
Shadow exception!
JobDisconnectedEvent::writeEvent() called without startd_addr
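
To help pin down when the disconnects happen relative to the exception, the event headers in the user log can be extracted with a short script. What follows is only a minimal sketch, assuming the header layout seen in the excerpt above (three-digit event number, job id in parentheses, date, time). The function name `find_events` and the label for event 022 are illustrative assumptions; the label for 007 is taken from the "Shadow exception!" text in the log.

```python
import re

# Header layout assumed from the log excerpt above:
# NNN (cluster.proc.subproc) MM/DD HH:MM:SS
EVENT_HEADER = re.compile(
    r"(?P<event>\d{3}) \((?P<job>\d+\.\d+\.\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2})"
)

# Labels for the two event numbers in the excerpt. The 022 label is an
# assumption; 007 is followed by "Shadow exception!" in the log itself.
EVENT_LABELS = {"007": "shadow exception", "022": "job disconnected"}


def find_events(log_text):
    """Return one dict per event header found anywhere in log_text."""
    events = []
    for match in EVENT_HEADER.finditer(log_text):
        number = match.group("event")
        events.append({
            "event": number,
            "label": EVENT_LABELS.get(number, "other"),
            "job": match.group("job"),
            "timestamp": match.group("date") + " " + match.group("time"),
        })
    return events


if __name__ == "__main__":
    sample = (
        "022 (461459.000.000) 10/10 13:59:48 007 (461459.000.000) 10/10 13:59:48\n"
        "Shadow exception!\n"
        "JobDisconnectedEvent::writeEvent() called without startd_addr\n"
    )
    for event in find_events(sample):
        print(event["timestamp"], event["job"], event["event"], event["label"])
```

Running this over the full user log of a failed 24-CE run would show whether the 022 and 007 events always arrive together or whether earlier disconnects precede the exception.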

We have tried several corrective actions based on our assumption that this is a network/filesystem issue (a specific file not being available when needed), including:
1- moved the NFS-based filesystem from NAT through the head node of the cluster to direct Ethernet connections to each node in the cluster
2- changed the underlying NFS transport protocol from UDP to TCP

Does anyone have suggestions on what information I need to ask our systems group to capture, such as packet data or the sockets in use at the time the error is thrown, in order to troubleshoot this problem? Thank you for any suggestions to help me through this.

Brandon

Brandon Leeds
Lehigh University