[Condor-users] Job Failure
- Date: Sat, 20 Jan 2007 23:21:13 -0500
- From: Glen <gafergus@xxxxxxxxxxx>
- Subject: [Condor-users] Job Failure
Hi,
I have a problem with jobs that end prematurely. The jobs stop
without finishing and without reporting any error; they simply stop.
The program I am running operates normally when used without Condor. We
have a local cluster that runs Condor across 24 dual-Opteron nodes, each
running SUSE Linux 9.2 and Condor 6.6.9 with 4 virtual machines per
node. We have turned off preemption and checkpointing.
Below are examples from the Condor StarterLog on the node where the jobs
stop. There is no indication of the jobs stopping in the MasterLog.
While the times listed below are at night, the same thing has happened
at various times throughout the day. Is there a reason jobs would end
prematurely? Do the StarterLog excerpts below indicate a pathology?
Example one
*****************************************************************************************
From condor StarterLog
1/17 01:06:36 entering FileTransfer::Upload
1/17 01:06:36 entering FileTransfer::DoUpload
1/17 01:06:36 DoUpload: send file mono_basis-1_job-100a.out
1/17 01:06:36 ReliSock: put_file: sent 71147 bytes
1/17 01:06:36 DoUpload: send file mono_basis-1_job-100a.rwf
1/17 01:06:36 ReliSock: put_file: sent 5391 bytes
1/17 01:06:36 DoUpload: send file mono_basis-1_job-100a.gbs
1/17 01:06:36 ReliSock: put_file: sent 802 bytes
1/17 01:06:36 DoUpload: send file mono_basis-1_job-100a.pot
1/17 01:06:36 ReliSock: put_file: sent 607 bytes
1/17 01:06:36 DoUpload: exiting at 1413
1/17 01:06:36 Inside OsProc::JobExit()
1/17 01:06:36 In VanillaProc::PublishUpdateAd()
1/17 01:06:36 ProcAPI::buildFamily failed: parent 16052 not found on system.
1/17 01:06:36 Inside OsProc::PublishUpdateAd()
1/17 01:06:36 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:06:36 IO: Incoming packet is too big
1/17 01:06:36 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:06:36 IO: Incoming packet is too big
1/17 01:06:36 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:06:36 Got SIGQUIT. Performing fast shutdown.
1/17 01:06:36 ShutdownFast all jobs.
1/17 01:06:36 Got ShutdownFast when no jobs running.
******************************************************************************************
Example Two
******************************************************************************************
From Condor StarterLog
1/17 01:02:42 FileTransfer::UploadFiles: sent TransKey=1#45ad294b7b3797a06413011b
1/17 01:02:42 entering FileTransfer::Upload
1/17 01:02:42 entering FileTransfer::DoUpload
1/17 01:02:42 DoUpload: send file mono_basis-1_job-100.out
1/17 01:02:42 ReliSock: put_file: sent 61511 bytes
1/17 01:02:42 DoUpload: send file mono_basis-1_job-100.rwf
1/17 01:02:42 ReliSock: put_file: sent 5391 bytes
1/17 01:02:42 DoUpload: send file mono_basis-1_job-100.gbs
1/17 01:02:42 ReliSock: put_file: sent 802 bytes
1/17 01:02:42 DoUpload: send file mono_basis-1_job-100.pot
1/17 01:02:42 ReliSock: put_file: sent 607 bytes
1/17 01:02:42 DoUpload: exiting at 1413
1/17 01:02:42 Inside OsProc::JobExit()
1/17 01:02:42 In VanillaProc::PublishUpdateAd()
1/17 01:02:42 ProcAPI::buildFamily failed: parent 6062 not found on system.
1/17 01:02:42 Inside OsProc::PublishUpdateAd()
1/17 01:02:42 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:02:42 IO: Incoming packet is too big
1/17 01:02:42 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:02:42 IO: Incoming packet is too big
1/17 01:02:42 DaemonCore: Can't receive command request (perhaps a timeout?)
1/17 01:02:42 Got SIGQUIT. Performing fast shutdown.
1/17 01:02:42 ShutdownFast all jobs.
1/17 01:02:42 Got ShutdownFast when no jobs running.
******************************************************************************************
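Both excerpts end with the same sequence: a ProcAPI::buildFamily failure, "Incoming packet is too big" messages, then SIGQUIT and a fast shutdown. To check whether that sequence appears at other times of day, a StarterLog can be scanned for those three messages. The following is only a minimal sketch: the function name find_suspect_lines and the choice of patterns are assumptions taken from the two excerpts above, not anything provided by Condor, and the timestamp format is assumed to match the lines shown here.

```python
import re

# Messages that appear in both excerpts just before the starter shuts down.
# This list is an assumption based on the two examples, not a Condor feature.
PATTERNS = (
    "ProcAPI::buildFamily failed",
    "IO: Incoming packet is too big",
    "Got SIGQUIT",
)

# Matches lines such as "1/17 01:06:36 Inside OsProc::JobExit()".
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) (.*)$")


def find_suspect_lines(lines):
    """Return (timestamp, message) pairs for lines containing any pattern."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip wrapped or non-timestamped lines
        timestamp, message = match.groups()
        if any(pattern in message for pattern in PATTERNS):
            hits.append((timestamp, message))
    return hits
```

Passing an open file works, since the function only iterates over lines, for example find_suspect_lines(open(path)) where path points at the StarterLog on the affected node. The resulting timestamps can then be compared against job start and end times.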
Take Care,
Glen