Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] Condor JobDisconnectedEvent::writeEvent() called without startd_addr error

Date: Tue, 21 Oct 2008 12:06:05 -0400
From: Brandon Leeds <byl405@xxxxxxxxxx>
Subject: [Condor-users] Condor JobDisconnectedEvent::writeEvent() called without startd_addr error

Hi,
I am in the unenviable position of relaying a problem without having the privileges or an understanding of the underlying architecture, so please bear with me.

We have a user who can complete his ABAQUS job, using MPI to distribute the model across cluster nodes under Condor, when he requests only 8 compute elements. However, larger runs (24 CEs) terminate with the shadow exception included below:

Subject: Condor JobDisconnectedEvent::writeEvent() called without startd_addr

I am running an ABAQUS analysis on a cluster using Condor and MPI. However, after running for several hours, the job is shutting down without any reason related to ABAQUS. The condor log is showing a message as follows:

022 (461459.000.000) 10/10 13:59:48
007 (461459.000.000) 10/10 13:59:48
Shadow exception!
JobDisconnectedEvent::writeEvent() called without startd_addr
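
For anyone scanning a long user log for these entries, here is a minimal sketch that pulls out event headers of the form shown above (three-digit event number, job ID in parentheses, date, time). The function name `find_events` and the regular expression are illustrative and not part of Condor; the labels for events 007 and 022 are inferred from the messages in this excerpt.

```python
import re
from collections import Counter

# Header of a user-log event as it appears in the excerpt:
# "NNN (cluster.proc.subproc) MM/DD HH:MM:SS"
EVENT_HEADER = re.compile(
    r"\b(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2})"
)

# Only the two event numbers seen in the excerpt are labeled (assumed meanings).
EVENT_NAMES = {"007": "shadow exception", "022": "job disconnected"}


def find_events(text):
    """Return (event_number, job_id, date, time) for each header found in text."""
    events = []
    for match in EVENT_HEADER.finditer(text):
        number, cluster, proc, subproc, date, time = match.groups()
        events.append((number, f"{cluster}.{proc}.{subproc}", date, time))
    return events


# Sample text based on the log excerpt above.
sample = """022 (461459.000.000) 10/10 13:59:48
007 (461459.000.000) 10/10 13:59:48
Shadow exception!
JobDisconnectedEvent::writeEvent() called without startd_addr
"""

events = find_events(sample)
counts = Counter(number for number, _, _, _ in events)
for number, job, date, time in events:
    print(number, EVENT_NAMES.get(number, "other"), job, date, time)
```

Running the same function over the full log file would show whether the disconnect and the exception always occur at the same timestamp, which may help correlate them with network or NFS events on the nodes.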

We have tried several corrective actions, based on our assumption that this is a network/filesystem issue (a specific file not being available when needed), including:
1- moved the NFS-based filesystem from NAT through the cluster head node to a direct connection via Ethernet ports on each node in the cluster
2- changed the underlying NFS protocol to use TCP instead of UDP
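
One way to confirm that the second change took effect on every node is to inspect the mount table. The sketch below assumes Linux nodes whose `/proc/mounts` lists NFS options either as `proto=tcp`/`proto=udp` or as a bare `tcp`/`udp`; the function name `nfs_transport`, the server name, and the paths in the sample are hypothetical.

```python
def nfs_transport(mounts_text):
    """Map each NFS mount point to 'tcp', 'udp', or 'unknown'."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        mount_point, options = fields[1], fields[3].split(",")
        proto = "unknown"
        for opt in options:
            if opt in ("tcp", "udp"):
                proto = opt
            elif opt.startswith("proto="):
                proto = opt.split("=", 1)[1]
        result[mount_point] = proto
    return result


# Hypothetical mount table in /proc/mounts format.
sample = """\
/dev/sda1 / ext3 rw 0 0
fileserver:/export/home /home nfs rw,vers=3,proto=tcp,hard 0 0
fileserver:/export/scratch /scratch nfs rw,vers=3,udp,hard 0 0
"""

print(nfs_transport(sample))
```

On a compute node this could be called as `nfs_transport(open("/proc/mounts").read())`; any mount still reporting `udp` would indicate that the change has not reached that node.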

Does anyone have suggestions on what information I should ask our systems group to capture, such as packet data or the sockets in use when the error is thrown, in order to troubleshoot this problem? Thank you for any suggestions to help me through this.

Thanks,
Brandon

Brandon Leeds
Lehigh University
