Re: [Condor-users] How can I check whether my VMware job is really checkpointed?
- Date: Thu, 13 May 2010 20:05:15 -0700 (PDT)
- From: Rob <spamrefuse@xxxxxxxxx>
- Subject: Re: [Condor-users] How can I check whether my VMware job is really checkpointed?
On Thu, 13 May 2010 David Kotz wrote:
> Rob,
>
> If you look at the timestamps on the two checkpoint messages you'll see
> that they're just 8 seconds apart. It looks like a checkpoint was taken
> and then the job was evicted 8 seconds later. When the job was evicted
> no additional checkpoint was made because it was too soon after the last
> one. Job eviction normally triggers an automatic checkpoint, but there
> seems to be some minimum amount of time since the last checkpoint before
> Condor will do another one.
Let me rephrase what you wrote, to check whether I understand it
correctly:
1) When checkpointing is active, Condor takes checkpoints at regular
intervals, even though the job is running fine. Correct?
2) When a job is evicted, Condor creates a new checkpoint, unless the
previous 'routine' checkpoint is too recent. Correct?
Is this behaviour fixed internally, or can I modify it a bit by setting
Condor macros in the config file?
According to your description, it seems that checkpointing is working
in my case.
However, I would like some more evidence that the job is indeed
resumed from the previous checkpoint, on the same or another pool PC.
Where can I find this evidence?
I have searched the log files, but to no avail.
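One way to line up the events in the job's user log (vm.log) is a small script like the Python sketch below. It assumes the event header format seen in the log excerpt quoted further down: a three-digit event code, the job ID in parentheses, a date and a time. The function names and the `year` parameter are hypothetical, and the year is an assumption because the log lines carry none. The script only reports how many seconds separate a "Job was checkpointed" event (003) from the next "Job was evicted" event (004); it does not show that a later run resumed from the checkpoint.

```python
import re
from datetime import datetime

# Event header lines look like:
#   003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
# Codes seen in the quoted excerpt: 001 executing, 003 checkpointed, 004 evicted.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def parse_events(lines, year=2010):
    """Return the event header lines as dicts; detail lines are skipped.

    `year` is an assumption: the log excerpt records only month and day.
    """
    events = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:
            continue  # usage lines, "..." separators, byte counts
        code, cluster, proc, _sub, date, clock, text = m.groups()
        when = datetime.strptime(f"{year}/{date} {clock}", "%Y/%m/%d %H:%M:%S")
        events.append({
            "code": code,
            "job": f"{int(cluster)}.{int(proc)}",
            "when": when,
            "text": text,
        })
    return events


def seconds_from_checkpoint_to_eviction(events):
    """For each eviction (004), seconds since the latest checkpoint (003)."""
    last_ckpt = None
    gaps = []
    for ev in events:
        if ev["code"] == "003":
            last_ckpt = ev["when"]
        elif ev["code"] == "004" and last_ckpt is not None:
            gaps.append((ev["when"] - last_ckpt).total_seconds())
    return gaps


# Lines copied from the vm.log excerpt in this thread.
SAMPLE = """\
001 (007.000.000) 05/13 17:57:05 Job executing on host: <115.145.228.96:1034>
...
003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
Usr 0 00:00:01, Sys 0 02:27:38 - Run Remote Usage
68599016 - Run Bytes Sent By Job For Checkpoint
...
004 (007.000.000) 05/13 20:26:59 Job was evicted.
(0) Job was not checkpointed.
"""

print(seconds_from_checkpoint_to_eviction(parse_events(SAMPLE.splitlines())))  # [8.0]
```

For a real log, the lines would come from the file instead, for example `parse_events(open("vm.log"))`.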
-------
By the way, the local condor config file on my (Windows) pool PCs is
as follows:
CONSOLE_DEVICES = mouse, console
StartIdleTime = 5 * $(MINUTE)
ContinueIdleTime = 5 * $(MINUTE)
MaxSuspendTime = 5 * $(MINUTE)
MaxVacateTime = 5 * $(MINUTE)
I hope there's nothing wrong here!
Thanks!
Rob.
> On Thu, 2010-05-13 at 07:13 -0700, Rob wrote:
>> Hi,
>>
>> I have successfully submitted VMware jobs without checkpointing.
>> Now I want to check the checkpoint feature, as it is described in the
>> manual (no checkpoint server is needed).
>>
>> The master is a linux/Fedora with condor 7.4.2.
>> All pool PCs are Windows XP, with condor 7.2 and VMware 1.0.
>>
>> I have changed the submission file such that it also allows
>> checkpointing, like this:
>>
>> Universe = vm
>> Executable = any_name_you_like
>> Log = vm.log
>> vm_type = vmware
>> vm_networking = false
>> vm_checkpoint = true
>> vm_memory = 64
>> vmware_dir = /home/condor/VM
>> vm_cdrom_files = input.dat
>> vm_should_transfer_cdrom_files = YES
>> vmware_should_transfer_files = YES
>> Requirements = (target.Arch == "INTEL")
>> Queue
>>
>>
>> When I run the job, the vm.log Log file has lines like this:
>>
>> 001 (007.000.000) 05/13 17:57:05 Job executing on host: <115.145.228.96:1034>
>> ...
>> 003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
>> Usr 0 00:00:01, Sys 0 02:27:38 - Run Remote Usage
>> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
>> 68599016 - Run Bytes Sent By Job For Checkpoint
>> ...
>> 004 (007.000.000) 05/13 20:26:59 Job was evicted.
>> (0) Job was not checkpointed.
>> Usr 0 00:00:01, Sys 0 02:27:38 - Run Remote Usage
>> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
>> 68599016 - Run Bytes Sent By Job
>> 79956464 - Run Bytes Received By Job
>>
>>
>> Notice that it says
>> "Job was checkpointed."
>> *and*
>> "Job was not checkpointed."
>>
>> Meanwhile I do find the checkpoint files in the spool:
>>
>> 15MB-000001.vmdk
>> isohrDAAH.iso
>> nvram
>> vmBvHAAB_condor-Snapshot1.vmsn
>> vmbvhaab_condor.vmem
>> vmBvHAAB_condor.vmsd
>> vmbvhaab_condor.vmss
>> vmBvHAAB_condor.vmx
>> vmware-0.log
>> vmware-1.log
>> vmware.log
>>
>>
>> I'm quite confused by all this.
>> Is the VMware condor job checkpointed or not?
>>
>> Also, I don't know where and how I can verify this.
>>
>> And if it's not checkpointed, why is it not?
>> If it is checkpointed, why can't I see more evidence of it?
>>
>> Thanks for your help!
>>
>> Rob.
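
As a rough aid for reading the spool listing quoted above, the Python sketch below picks out the files whose extensions VMware uses for saved state: .vmsn (snapshot state), .vmsd (snapshot metadata), .vmem (guest memory) and .vmss (suspended state). The function name `checkpoint_files` is hypothetical. Finding such files only suggests that the VM state was written to the spool at some point; it does not show that a later run resumed from it.

```python
from pathlib import Path

# VMware file extensions that hold saved VM state.
STATE_SUFFIXES = {
    ".vmsn": "snapshot state",
    ".vmsd": "snapshot metadata",
    ".vmem": "guest memory image",
    ".vmss": "suspended state",
}


def checkpoint_files(names):
    """Map each file name with a saved-state extension to a description."""
    found = {}
    for name in names:
        suffix = Path(name).suffix.lower()
        if suffix in STATE_SUFFIXES:
            found[name] = STATE_SUFFIXES[suffix]
    return found


# File names copied from the spool listing in this thread.
SPOOL_LISTING = [
    "15MB-000001.vmdk",
    "isohrDAAH.iso",
    "nvram",
    "vmBvHAAB_condor-Snapshot1.vmsn",
    "vmbvhaab_condor.vmem",
    "vmBvHAAB_condor.vmsd",
    "vmbvhaab_condor.vmss",
    "vmBvHAAB_condor.vmx",
    "vmware-0.log",
    "vmware-1.log",
    "vmware.log",
]

for name, kind in checkpoint_files(SPOOL_LISTING).items():
    print(f"{name}: {kind}")
```

To inspect a real spool directory, the names could be gathered with `[p.name for p in Path(spool_dir).iterdir()]`, where `spool_dir` is whatever path the local setup uses.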