
Re: [Condor-users] How can I check whether my VMware job is really checkpointed?



On Thu, 13 May 2010 David Kotz wrote:
> Rob,
> 
> If you look at the timestamps on the two checkpoint messages you'll see
> that they're just 8 seconds apart.  It looks like a checkpoint was taken
> and then the job was evicted 8 seconds later.  When the job was evicted
> no additional checkpoint was made because it was too soon after the last
> one.  Job eviction normally triggers an automatic checkpoint, but there
> seems to be some minimum amount of time since the last checkpoint before
> Condor will do another one.

Let me rephrase that to check whether I understand you correctly.
Is it the case that:
1) When checkpointing is active, Condor makes checkpoints at regular
    intervals, even though the job is running fine. Correct?
2) When a job is evicted, Condor creates a new checkpoint, unless the
    previous 'routine' checkpoint is too recent. Correct?

Is this behaviour fixed internally, or can I adjust it by setting
Condor macros in the config file?


According to your description, it seems then that checkpointing is working
in my case.
However, I would like more evidence that the job is indeed restarted
from the previous checkpoint, on the same or another pool PC.
Where can I find this evidence?
I have searched the log files, but to no avail.
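
One way to look for that evidence is to scan the job's user log (vm.log
here) for event records. The sketch below is in Python and assumes only what
the log excerpt in this thread shows: each record starts with a three-digit
event code, and 001, 003 and 004 mark execution, checkpoint and eviction.
The function names and the year argument are illustrative and not part of
Condor; the log timestamps carry no year, so 2010 is supplied by hand.

```python
import re
from datetime import datetime

# Header line of a user-log record, as seen in the excerpt in this thread:
#   003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)


def parse_events(text, year=2010):
    """Return (code, timestamp, message) for each record header in the log text.

    The year is an assumption passed in by the caller, because the log
    timestamps only contain month, day and time.
    """
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group(5)}", "%Y %m/%d %H:%M:%S"
            )
            events.append((match.group(1), stamp, match.group(6)))
    return events


def seconds_from_checkpoint_to_eviction(events):
    """For each eviction (004), seconds elapsed since the last checkpoint (003)."""
    last_checkpoint = None
    gaps = []
    for code, stamp, _message in events:
        if code == "003":
            last_checkpoint = stamp
        elif code == "004" and last_checkpoint is not None:
            gaps.append((stamp - last_checkpoint).total_seconds())
    return gaps


# Record headers and detail lines copied from the vm.log excerpt in this thread.
SAMPLE = """\
001 (007.000.000) 05/13 17:57:05 Job executing on host: <115.145.228.96:1034>
...
003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
    68599016  -  Run Bytes Sent By Job For Checkpoint
...
004 (007.000.000) 05/13 20:26:59 Job was evicted.
    (0) Job was not checkpointed.
"""

if __name__ == "__main__":
    events = parse_events(SAMPLE)
    for code, stamp, message in events:
        print(code, stamp.strftime("%m/%d %H:%M:%S"), message)
    print("checkpoint-to-eviction gaps (s):",
          seconds_from_checkpoint_to_eviction(events))
```

For the excerpt in this thread it reports a gap of 8 seconds between the
checkpoint and the eviction. A later 001 record for the same job, possibly
on a different host, would show that the job was started again; the user log
alone may not say whether it resumed from the checkpoint.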

-------

By the way, the local condor config file on my (Windows) pool PCs is
as follows:

CONSOLE_DEVICES = mouse, console 
StartIdleTime    = 5 * $(MINUTE) 
ContinueIdleTime = 5 * $(MINUTE) 
MaxSuspendTime   = 5 * $(MINUTE) 
MaxVacateTime    = 5 * $(MINUTE) 


I hope there's nothing wrong here!

Thanks!

Rob.


> On Thu, 2010-05-13 at 07:13 -0700, Rob wrote:
>> Hi,
>> 
>> I have successfully submitted VMware jobs without checkpointing.
>> Now I want to check the checkpoint feature, as it is described in the
>> manual (no checkpoint server is needed).
>> 
>> The master is a linux/Fedora with condor 7.4.2.
>> All pool PCs are Windows XP, with condor 7.2 and VMware 1.0.
>> 
>> I have changed the submission file such that it also allows
>> checkpointing, like this:
>> 
>> Universe = vm
>> Executable = any_name_you_like
>> Log = vm.log
>> vm_type = vmware
>> vm_networking = false
>> vm_checkpoint = true
>> vm_memory = 64
>> vmware_dir = /home/condor/VM
>> vm_cdrom_files = input.dat
>> vm_should_transfer_cdrom_files = YES
>> vmware_should_transfer_files = YES
>> Requirements = (target.Arch == "INTEL")
>> Queue
>> 
>> 
>> When I run the job, the vm.log Log file has lines like this:
>> 
>> 001 (007.000.000) 05/13 17:57:05 Job executing on host: <115.145.228.96:1034>
>> ...
>> 003 (007.000.000) 05/13 20:26:51 Job was checkpointed.
>>     Usr 0 00:00:01, Sys 0 02:27:38  -  Run Remote Usage
>>     Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>     68599016  -  Run Bytes Sent By Job For Checkpoint
>> ...
>> 004 (007.000.000) 05/13 20:26:59 Job was evicted.
>>     (0) Job was not checkpointed.
>>         Usr 0 00:00:01, Sys 0 02:27:38  -  Run Remote Usage
>>         Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
>>     68599016  -  Run Bytes Sent By Job
>>     79956464  -  Run Bytes Received By Job
>> 
>> 
>> Notice that it says
>>   "Job was checkpointed."
>> *and*
>>   "Job was not checkpointed."
>> 
>> Meanwhile I do find the checkpoint files in the spool:
>> 
>> 15MB-000001.vmdk
>> isohrDAAH.iso
>> nvram
>> vmBvHAAB_condor-Snapshot1.vmsn
>> vmbvhaab_condor.vmem
>> vmBvHAAB_condor.vmsd
>> vmbvhaab_condor.vmss
>> vmBvHAAB_condor.vmx
>> vmware-0.log
>> vmware-1.log
>> vmware.log
>> 
>> 
>> I'm quite confused by all this.
>> Is the VMware condor job checkpointed or not?
>> 
>> Also, I don't know where and how I can verify this.
>> 
>> And if it's not checkpointed, why not?
>> If it is checkpointed, why can't I see more evidence of it?
>> 
>> Thanks for your help!
>> 
>> Rob.