Hi, I am using HTCondor 7.8.5 on CentOS 6.3. I have GPU jobs to run, and each
takes 25 to 90 minutes. Each machine has 2 GPUs.
All GPU jobs are in one node of a DAG job. I find that some jobs, after
executing for a while on one machine, are silently moved to another machine
and start executing there. This is the event sequence for one such job:
SUBMIT
EXECUTE on 10.1.1.254
IMAGE_SIZE_UPDATE
IMAGE_SIZE_UPDATE
EXECUTE on 10.1.1.251
......
I want to know why the second EXECUTE event occurred. There is nothing
between the last IMAGE_SIZE_UPDATE event and the second EXECUTE event. I also
checked the *.dag.nodes.log and *.dagman.out files and found nothing helpful.
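
To make the gap concrete, here is a minimal sketch that scans an event sequence in the simplified one-line form used in this post and reports each place where the job started executing on a different host, together with the events in between. The function name find_restarts and the list-of-strings input are only my illustration; they are not part of HTCondor, and the raw log file is more verbose than this.

```python
def find_restarts(events):
    """Return (old_host, new_host, events_between) for each host change.

    `events` is a list of strings such as "EXECUTE on 10.1.1.254",
    i.e. the simplified form used in this post, not the raw log format.
    """
    restarts = []
    last_host = None
    between = []
    for event in events:
        if event.startswith("EXECUTE on "):
            host = event[len("EXECUTE on "):].strip()
            # A second EXECUTE on a different host means the job was restarted.
            if last_host is not None and host != last_host:
                restarts.append((last_host, host, list(between)))
            last_host = host
            between = []
        elif last_host is not None:
            # Collect everything logged since the most recent EXECUTE.
            between.append(event)
    return restarts


events = [
    "SUBMIT",
    "EXECUTE on 10.1.1.254",
    "IMAGE_SIZE_UPDATE",
    "IMAGE_SIZE_UPDATE",
    "EXECUTE on 10.1.1.251",
]
for old, new, between in find_restarts(events):
    print(f"{old} -> {new}, events in between: {between}")
```

For my job this lists only the two IMAGE_SIZE_UPDATE events, so nothing in the sequence explains the move.
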
I have not configured a RANK expression for the startd. The rank for the job is:
-SlotId + HasGPU*1000 + GPUCores. I do not think this is the reason, though.
HasGPU is true and GPUCores is 2496 at the moment.
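
As a quick check of what this rank works out to, here is a small sketch in Python rather than ClassAd syntax. It assumes that a true HasGPU counts as 1 in the arithmetic, and the slot IDs 1 and 2 are hypothetical values chosen for illustration.

```python
def job_rank(slot_id, has_gpu, gpu_cores):
    # Assumption: a true HasGPU counts as 1 in the arithmetic, false as 0.
    return -slot_id + int(has_gpu) * 1000 + gpu_cores


# Hypothetical slot IDs; HasGPU and GPUCores are the values quoted above.
for slot_id in (1, 2):
    print(f"SlotId={slot_id}: rank={job_rank(slot_id, True, 2496)}")
```

If both machines advertise the same HasGPU and GPUCores values, the rank differs between slots only by SlotId, which is why I doubt it is the cause.
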
I need to figure out why jobs are being transferred to another machine.
Thanks.