[Condor-users] job only checkpoints sometimes
- Date: Fri, 5 Jan 2007 19:32:01 -0500
- From: "Ashish Venugopal" <arv@xxxxxxxxxxxxxx>
- Subject: [Condor-users] job only checkpoints sometimes
Hi all,
I see a strange situation: sometimes a job (standard universe) is checkpointed, and sometimes it is NOT! Could this be because different signals are being used, i.e. suspend/preempt/owner?
See my Condor logs below, which show what happened on each machine (note: the log files are in reverse time order). The machine that did NOT checkpoint seems to have called DEACTIVATE_CLAIM_FORCIBLY immediately, while the one that DID checkpoint correctly called DEACTIVATE_CLAIM and then DEACTIVATE_CLAIM_FORCIBLY. What could have caused this?
Ashish
LOG ON MACHINE THAT DID CHECKPOINT
1/5 14:35:45 Error: can't find resource with capability (<192.168.1.102:32775>#3511251448)
1/5 14:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 14:35:45 DaemonCore: Command received via UDP from host <128.2.211.9:33316>
1/5 14:35:45 vm2: Changing state: Owner -> Unclaimed
1/5 14:35:45 vm2: State change: IS_OWNER is false
1/5 14:35:45 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/5 14:35:45 vm2: State change: No preempting claim, returning to owner
1/5 14:35:45 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
1/5 14:35:45 vm2: State change: received RELEASE_CLAIM command
1/5 14:35:45 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 14:35:45 DaemonCore: Command received via UDP from host <128.2.211.9:33316>
1/5 14:35:43 vm2: Changing activity: Busy -> Idle
1/5 14:35:43 vm2: State change: starter exited
1/5 14:35:43 Starter pid 7870 exited with status 0
1/5 14:35:43 vm2: Called deactivate_claim_forcibly()
1/5 14:35:43 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/5 14:35:43 DaemonCore: Command received via TCP from host <128.2.211.9:37412>
1/5 14:35:42 Assuming the keyboard and mouse to be infinitely idle.
1/5 14:35:42 Failed to obtain keyboard or mouse idle information.
1/5 14:35:40 vm2: Called deactivate_claim()
1/5 14:35:40 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
1/5 14:35:40 DaemonCore: Command received via UDP from host <128.2.211.9:33316>
LOG ON MACHINE THAT DID NOT!
1/5 18:43:06 Error: can't find resource with capability (<192.168.1.104:32774>#1929149340)
1/5 18:43:06 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via UDP from host <128.2.211.9:33346>
1/5 18:43:06 vm2: Changing state: Owner -> Unclaimed
1/5 18:43:06 vm2: State change: IS_OWNER is false
1/5 18:43:06 vm2: Changing state and activity: Preempting/Vacating -> Owner/Idle
1/5 18:43:06 vm2: State change: No preempting claim, returning to owner
1/5 18:43:06 vm2: Changing state and activity: Claimed/Idle -> Preempting/Vacating
1/5 18:43:06 vm2: State change: received RELEASE_CLAIM command
1/5 18:43:06 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via UDP from host <128.2.211.9:33346>
1/5 18:43:06 vm2: Changing activity: Busy -> Idle
1/5 18:43:06 vm2: State change: starter exited
1/5 18:43:06 Starter pid 6381 exited with status 0
1/5 18:43:06 vm2: Called deactivate_claim_forcibly()
1/5 18:43:06 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
1/5 18:43:06 DaemonCore: Command received via TCP from host <128.2.211.9:38919>
1/5 18:43:05 vm2: Performing a periodic checkpoint on vm2@xxxxxxxxxxxxxxxxxxxxx.
1/5 18:43:05 Assuming the keyboard and mouse to be infinitely idle.
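The difference between the two startd logs above (a DEACTIVATE_CLAIM followed by DEACTIVATE_CLAIM_FORCIBLY on one machine, only DEACTIVATE_CLAIM_FORCIBLY on the other) can be checked mechanically. The following is a minimal Python sketch, not a Condor tool: the function names are made up for illustration, and it assumes the log text has the "received command N (NAME)" lines in the form shown above, pasted newest-first. It only sees the excerpt it is given, so a command missing from a truncated log will be reported as absent.

```python
import re

# Matches the command name in lines such as:
# 1/5 14:35:40 DaemonCore: received command 403 (DEACTIVATE_CLAIM), calling handler (command_handler)
COMMAND_RE = re.compile(r"received command (\d+) \((\w+)\)")


def commands_in_time_order(log_text, newest_first=True):
    """Return the command names found in the log, oldest first."""
    names = [m.group(2) for m in COMMAND_RE.finditer(log_text)]
    # The logs in this post are in reverse time order, so flip them by default.
    return names[::-1] if newest_first else names


def had_graceful_deactivate(log_text, newest_first=True):
    """True if DEACTIVATE_CLAIM arrived before the first DEACTIVATE_CLAIM_FORCIBLY.

    Returns None when the excerpt contains no DEACTIVATE_CLAIM_FORCIBLY at all.
    """
    cmds = commands_in_time_order(log_text, newest_first)
    if "DEACTIVATE_CLAIM_FORCIBLY" not in cmds:
        return None
    first_forced = cmds.index("DEACTIVATE_CLAIM_FORCIBLY")
    return "DEACTIVATE_CLAIM" in cmds[:first_forced]
```

Calling had_graceful_deactivate() on each machine's log text gives a quick yes/no on whether the soft deactivate was seen before the forced one.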
001 (214.000.000) 01/05 14:33:24 Job executing on host: <192.168.1.102:32775>
...
006 (214.000.000) 01/05 14:35:40 Image size of job updated: 107981
...
004 (214.000.000) 01/05 14:35:45 Job was evicted.
(1) Job was checkpointed.
Usr 0 00:01:52, Sys 0 00:00:01 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:02 - Run Local Usage
97712680 - Run Bytes Sent By Job
240792208 - Run Bytes Received By Job
...
001 (214.000.000) 01/05 14:43:02 Job executing on host: <192.168.1.104:32774>
...
004 (214.000.000) 01/05 18:43:06 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:12, Sys 0 00:01:12 - Run Local Usage
11367188 - Run Bytes Sent By Job
8140449280 - Run Bytes Received By Job
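To compare evictions across many runs, the "004 ... Job was evicted." events in the job log above can be pulled out together with their checkpoint flag. This is a rough Python sketch under stated assumptions: eviction_events is a hypothetical helper name, and the parser assumes each event header sits on a single line, immediately followed by the "(1) Job was checkpointed." or "(0) Job was not checkpointed." line, as in the excerpt above.

```python
import re

# Event header, e.g.: 004 (214.000.000) 01/05 14:35:45 Job was evicted.
EVICT_RE = re.compile(
    r"^004 \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) Job was evicted\."
)
# Flag line, e.g.: (1) Job was checkpointed.  /  (0) Job was not checkpointed.
FLAG_RE = re.compile(r"^\s*\((\d)\) Job was (?:not )?checkpointed\.")


def eviction_events(lines):
    """Return one dict per eviction event with its checkpoint flag."""
    events = []
    pending = None  # eviction header waiting for its flag line
    for line in lines:
        header = EVICT_RE.match(line)
        if header:
            pending = {
                "cluster": int(header.group(1)),
                "proc": int(header.group(2)),
                "time": header.group(4),
            }
            continue
        if pending is not None:
            flag = FLAG_RE.match(line)
            if flag:
                pending["checkpointed"] = flag.group(1) == "1"
                events.append(pending)
            # Only the line directly after the header is considered.
            pending = None
    return events
```

Passing the lines of the job log to eviction_events() yields one record per eviction, which makes it easy to see which runs ended without a checkpoint.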