Subject: [Condor-users] job only checkpoints sometimes
Hi all,

I am seeing a strange situation: sometimes a job (standard universe) is checkpointed when it is evicted, and sometimes it is NOT! Could this be because different signals are being used, i.e. suspend/preempt/owner?

See my Condor logs below, which show what happened on each machine (note: the log files are in reverse time order). The machine that did NOT checkpoint seems to have called DEACTIVATE_CLAIM_FORCIBLY immediately, while the one that DID checkpoint correctly called DEACTIVATE_CLAIM and then DEACTIVATE_CLAIM_FORCIBLY. What could have caused this?

Ashish

LOG ON MACHINE THAT DID CHECKPOINT
1/5 14:35:45 Error: can't find resource with capability (<192.168.1.102:32775>#3511251448)
1/5 14:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 14:35:45 DaemonCore: Command received via UDP from host <128.2.211.9:33316>
1/5 14:35:45 vm2: Changing state: Owner -> Unclaimed
1/5 14:35:45 vm2: State change: IS_OWNER is false
1/5 14:35:45 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/5 14:35:45 vm2: State change: No preempting claim, returning to owner
1/5 14:35:45 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
1/5 14:35:45 vm2: State change: received RELEASE_CLAIM command
1/5 14:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 14:35:45 DaemonCore: Command received via UDP from host <128.2.211.9:33316>
1/5 14:35:43 vm2: Changing activity: Busy -> Idle
1/5 14:35:43 vm2: State change: starter exited
1/5 14:35:43 Starter pid 7870 exited with status 0
1/5 14:35:43 vm2: Called deactivate_claim_forcibly()
1/5 14:35:43 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/5 14:35:43 DaemonCore: Command received via TCP from host <128.2.211.9:37412>
1/5 14:35:42 Assuming the keyboard and mouse to be infinitely idle.
1/5 14:35:42 Failed to obtain keyboard or mouse idle information.
1/5 14:35:40 vm2: Called deactivate_claim()
1/5 14:35:40 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
1/5 14:35:40 DaemonCore: Command received via UDP from host <128.2.211.9:33316>

LOG ON MACHINE THAT DID NOT!
1/5 18:43:06 Error: can't find resource with capability (<192.168.1.104:32774>#1929149340)
1/5 18:43:06 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via UDP from host <128.2.211.9:33346>
1/5 18:43:06 vm2: Changing state: Owner -> Unclaimed
1/5 18:43:06 vm2: State change: IS_OWNER is false
1/5 18:43:06 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/5 18:43:06 vm2: State change: No preempting claim, returning to owner
1/5 18:43:06 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
1/5 18:43:06 vm2: State change: received RELEASE_CLAIM command
1/5 18:43:06 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via UDP from host <128.2.211.9:33346>
1/5 18:43:06 vm2: Changing activity: Busy -> Idle
1/5 18:43:06 vm2: State change: starter exited
1/5 18:43:06 Starter pid 6381 exited with status 0
1/5 18:43:06 vm2: Called deactivate_claim_forcibly()
1/5 18:43:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via TCP from host <128.2.211.9:38919>
1/5 18:43:05 vm2: Performing a periodic checkpoint on vm2@xxxxxxxxxxxxxxxxxxxxx.
1/5 18:43:05 Assuming the keyboard and mouse to be infinitely idle.

001 (214.000.000) 01/05 14:33:24 Job executing on host: <192.168.1.102:32775>
...
006 (214.000.000) 01/05 14:35:40 Image size of job updated: 107981
...
004 (214.000.000) 01/05 14:35:45 Job was evicted.
    (1) Job was checkpointed.
        Usr 0 00:01:52, Sys 0 00:00:01  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:02  -  Run Local Usage
    97712680  -  Run Bytes Sent By Job
    240792208  -  Run Bytes Received By Job
...
001 (214.000.000) 01/05 14:43:02 Job executing on host: <192.168.1.104:32774>
...
004 (214.000.000) 01/05 18:43:06 Job was evicted.
    (0) Job was not checkpointed.
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:12, Sys 0 00:01:12  -  Run Local Usage
    11367188  -  Run Bytes Sent By Job
    8140449280  -  Run Bytes Received By Job
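
In case it helps anyone compare their own excerpts, here is a minimal Python sketch that checks the ordering described above: whether a DEACTIVATE_CLAIM was received before the first DEACTIVATE_CLAIM_FORCIBLY. It only pattern-matches the "DaemonCore: received command" lines as they appear in the logs above; the function names and the two sample excerpts are made up for illustration (they are not part of Condor), and the script says nothing about why a given ordering occurred.

```python
import re

# Matches DaemonCore lines as they appear in the excerpts above, e.g.
# "1/5 14:35:43 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), ..."
COMMAND_RE = re.compile(
    r"(\d+/\d+ \d+:\d+:\d+) DaemonCore: received command \d+ \((\w+)\)"
)


def commands_in_time_order(log_text, newest_first=True):
    """Return (timestamp, command) pairs, oldest first.

    The excerpts in this post are pasted newest-first, so by default the
    matches are reversed to get chronological order.
    """
    found = COMMAND_RE.findall(log_text)
    if newest_first:
        found.reverse()
    return found


def graceful_before_forcible(log_text, newest_first=True):
    """True if DEACTIVATE_CLAIM was seen before the first DEACTIVATE_CLAIM_FORCIBLY."""
    names = [name for _, name in commands_in_time_order(log_text, newest_first)]
    if "DEACTIVATE_CLAIM_FORCIBLY" not in names:
        return False
    first_forcible = names.index("DEACTIVATE_CLAIM_FORCIBLY")
    return "DEACTIVATE_CLAIM" in names[:first_forcible]


# Sample lines copied from the two logs above (newest first).
CHECKPOINTED_EXCERPT = """\
1/5 14:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 14:35:43 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/5 14:35:40 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
"""

NOT_CHECKPOINTED_EXCERPT = """\
1/5 18:43:06 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 18:43:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
"""

if __name__ == "__main__":
    print(graceful_before_forcible(CHECKPOINTED_EXCERPT))
    print(graceful_before_forcible(NOT_CHECKPOINTED_EXCERPT))
```

Because the pattern is applied with findall over the whole text, it should also cope with excerpts where several log entries ended up on one line.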