Re: [Condor-users] Nodes rejecting jobs after a few runs
- Date: Thu, 13 Sep 2012 11:20:46 -0400
- From: Dmitry Rodionov <d.rodionov@xxxxxxxxx>
- Subject: Re: [Condor-users] Nodes rejecting jobs after a few runs
I forgot to include the corresponding entry from the SchedLog:
09/13/12 10:47:51 (pid:44237) Shadow pid 51953 for job 149.0 exited with status 4
09/13/12 10:47:51 (pid:44237) ERROR: Shadow exited with job exception code!
09/13/12 10:47:51 (pid:44237) match (slot4@xxxxxxxxxxx <10.0.0.2:51239> for drod) out of jobs; relinquishing
09/13/12 10:47:51 (pid:44237) Completed RELEASE_CLAIM to startd slot4@xxxxxxxxxxx <10.0.0.2:51239> for drod
09/13/12 10:47:51 (pid:44237) Match record (slot4@xxxxxxxxxxx <10.0.0.2:51239> for drod, 149.-1) deleted
Thanks,
Dmitry
On 2012-09-13, at 11:13 AM, Dmitry Rodionov wrote:
> Good day everyone!
> I have Condor 7.8.2 set up on six Mac workstations running OS X 10.6.
>
> I start a run of 1000 identical simulations, 1 simulation = 1 job, with 22 jobs queued (I have 22 cores).
> The working folder is mounted via NFS across all hosts, exported with all_squash.
> All hosts are on a 1 Gbps LAN, less than 5 m from the switch.
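>
> For concreteness, here is a minimal sketch of the kind of submit description I mean (the executable name, paths, and filenames below are placeholders for illustration, not my actual file):
>
> # Minimal sketch only -- names and paths are placeholders, not my real submit file.
> universe    = vanilla
> executable  = run_simulation
> arguments   = $(Process)
> initialdir  = /path/to/nfs/workdir
> output      = sim.$(Process).out
> error       = sim.$(Process).err
> log         = sim.log
> queue 22
>
> (Vanilla universe, matching the "VANILLA" entries in the logs below.)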
>
> Initial situation: all idle nodes accept jobs and start crunching numbers. So far so good.
> After completing 2-3 jobs or so, the nodes stop accepting new jobs "for unknown reasons".
> The submitting node is the last one to start refusing jobs.
>
> This is all condor_q -global -better-analyze had to say on the subject:
>
> -- Schedd: sioux.local : <10.0.0.15:62904>
> ---
> 149.000: Run analysis summary. Of 22 machines,
> 0 are rejected by your job's requirements
> 4 reject your job because of their own requirements
> 2 match but are serving users with a better priority in the pool
> 16 match but reject the job for unknown reasons
> 0 match but will not currently preempt their existing job
> 0 match but are currently offline
> 0 are available to run your job
> Last successful match: Thu Sep 13 10:41:19 2012
>
> The following attributes are missing from the job ClassAd:
>
> CheckpointPlatform
>
> The SchedLog is filled with variations of this:
>
> 09/13/12 10:43:40 (pid:44237) Shadow pid 51139 for job 149.0 exited with status 4
> 09/13/12 10:43:40 (pid:44237) Match for cluster 149 has had 5 shadow exceptions, relinquishing.
> 09/13/12 10:43:40 (pid:44237) Match record (slot3@xxxxxxxxxxxxx <10.0.0.30:49774> for drod, 149.0) deleted
> 09/13/12 10:43:40 (pid:44237) Shadow pid 51149 for job 133.0 exited with status 4
> 09/13/12 10:44:20 (pid:44237) Starting add_shadow_birthdate(149.0)
> 09/13/12 10:44:20 (pid:44237) Started shadow for job 149.0 on slot2@xxxxxxxxxx <10.0.0.54:51729> for drod, (shadow pid = 51262)
> 09/13/12 10:44:20 (pid:44237) Shadow pid 51262 for job 149.0 exited with status 4
> 09/13/12 10:44:20 (pid:44237) match (slot2@xxxxxxxxxx <10.0.0.54:51729> for drod) switching to job 149.0
> 09/13/12 10:44:20 (pid:44237) Starting add_shadow_birthdate(149.0)
> 09/13/12 10:44:20 (pid:44237) Started shadow for job 149.0 on slot2@xxxxxxxxxx <10.0.0.54:51729> for drod, (shadow pid = 51296)
> 09/13/12 10:44:20 (pid:44237) Shadow pid 51296 for job 149.0 exited with status 4
>
> The ShadowLog has numerous copies of entries like this:
>
> 09/13/12 10:47:51 Initializing a VANILLA shadow for job 149.0
> 09/13/12 10:47:51 (149.0) (51919): Request to run on slot4@xxxxxxxxxxx <10.0.0.2:51239> was ACCEPTED
> --
> 09/13/12 10:47:51 Setting maximum accepts per cycle 8.
> 09/13/12 10:47:51 ******************************************************
> 09/13/12 10:47:51 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 09/13/12 10:47:51 ** /condor/condor-installed/sbin/condor_shadow
> 09/13/12 10:47:51 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
> 09/13/12 10:47:51 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
> 09/13/12 10:47:51 ** $CondorVersion: 7.8.2 Aug 08 2012 $
> 09/13/12 10:47:51 ** $CondorPlatform: x86_64_macos_10.7 $
> 09/13/12 10:47:51 ** PID = 51919
> 09/13/12 10:47:51 ** Log last touched 9/13 10:47:51
> --
> 09/13/12 10:47:51 (149.0) (51919): ERROR "Can no longer talk to condor_starter <10.0.0.2:51239>" at line 219 in file /Volumes/disk1/condor/execute/slot1/dir_49805/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
>
> Where is "/Volumes/disk1/condor/" in the line above coming from? There is no such path on my systems; Condor is installed in "/condor/condor-installed".
>
> From the StarterLog on terra.local:
>
> 09/13/12 10:47:51 slot4: Got activate_claim request from shadow (10.0.0.15)
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 slot4: Remote job ID is 149.0
> 09/13/12 10:47:51 slot4: Got universe "VANILLA" (5) from request classad
> 09/13/12 10:47:51 slot4: State change: claim-activation protocol successful
> 09/13/12 10:47:51 slot4: Changing activity: Idle -> Busy
> 09/13/12 10:47:51 Starter pid 74324 exited with status 1
> 09/13/12 10:47:51 slot4: State change: starter exited
> 09/13/12 10:47:51 slot4: Changing activity: Busy -> Idle
> 09/13/12 10:47:51 slot2: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 09/13/12 10:47:51 slot2: State change: No preempting claim, returning to owner
> 09/13/12 10:47:51 slot2: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot2: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot2: Changing state: Owner -> Unclaimed
> 09/13/12 10:47:51 slot1: Got activate_claim request from shadow (10.0.0.15)
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX
> 09/13/12 10:47:51 slot1: Remote job ID is 153.0
> 09/13/12 10:47:51 slot1: Got universe "VANILLA" (5) from request classad
> 09/13/12 10:47:51 slot1: State change: claim-activation protocol successful
> 09/13/12 10:47:51 slot1: Changing activity: Idle -> Busy
> 09/13/12 10:47:51 slot3: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot3: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 09/13/12 10:47:51 slot3: State change: No preempting claim, returning to owner
> 09/13/12 10:47:51 slot3: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot3: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot3: Changing state: Owner -> Unclaimed
> 09/13/12 10:47:51 slot4: State change: received RELEASE_CLAIM command
> 09/13/12 10:47:51 slot4: Changing state and activity: Claimed/Idle -> Preempting/Vacating
> 09/13/12 10:47:51 slot4: State change: No preempting claim, returning to owner
> 09/13/12 10:47:51 slot4: Changing state and activity: Preempting/Vacating -> Owner/Idle
> 09/13/12 10:47:51 slot4: State change: IS_OWNER is false
> 09/13/12 10:47:51 slot4: Changing state: Owner -> Unclaimed
>
> What does "sysapi_disk_space_raw: Free disk space kbytes overflow, capping to INT_MAX" mean?
>
> Please help me troubleshoot this problem.
> I am new to Condor and not sure where to start.
>
> Thanks!
> Dmitry
>