Hi, I am using HTCondor 7.8.5 on CentOS 6.3. I have GPU jobs to run, and each takes 25 to 90 minutes. Each machine has 2 GPUs.
All GPU jobs are in one node of a DAG job. I find that some jobs are silently moved to another machine after executing for a while on the first one. This is the event sequence for one such job:
SUBMIT
EXECUTE on 10.1.1.254
IMAGE_SIZE_UPDATE
IMAGE_SIZE_UPDATE
EXECUTE on 10.1.1.251
......
I want to know why the second EXECUTE event occurred. There is nothing between the last IMAGE_SIZE_UPDATE event and the second EXECUTE event. I also checked the *.dag.nodes.log and *.dagman.out files and found nothing helpful.
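For anyone who wants to check their own logs for the same pattern, here is a minimal sketch that lists jobs with more than one EXECUTE event. It assumes the plain-text user log format, where each event header starts with a three-digit event code followed by the job ID in parentheses, and that code 001 marks an EXECUTE event. The function names are only illustrative.

```python
import re
from collections import defaultdict

# Assumed shape of an event header line in the text user log, e.g.
#   "001 (12.000.000) <date> <time> Job executing on host: <address>"
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)\s+(.*)$")
EXECUTE_CODE = "001"  # assumption: event code used for EXECUTE


def execute_events_by_job(log_text):
    """Map (cluster, proc) to the header text of each EXECUTE event."""
    runs = defaultdict(list)
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match and match.group(1) == EXECUTE_CODE:
            job = (int(match.group(2)), int(match.group(3)))
            runs[job].append(match.group(5))
    return dict(runs)


def jobs_started_more_than_once(log_text):
    """Keep only the jobs whose log shows two or more EXECUTE events."""
    return {
        job: starts
        for job, starts in execute_events_by_job(log_text).items()
        if len(starts) > 1
    }
```

Passing the contents of the job's log file to `jobs_started_more_than_once` would give, for each affected job, the header text of every start, which makes it easy to see which hosts were involved.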
I have not configured a RANK expression for the startd. The rank for the job is: -SlotId + HasGPU*1000 + GPUCores. But I do not think this is the reason. HasGPU is true and GPUCores is 2496 at the moment.
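To make the job rank concrete, the arithmetic can be sketched as below. It assumes a true HasGPU counts as 1 in the expression and that the two slots on a machine have SlotId 1 and 2. Both are assumptions, not something confirmed from the configuration.

```python
def job_rank(slot_id, has_gpu, gpu_cores):
    """Mirror of -SlotId + HasGPU*1000 + GPUCores, with true counted as 1."""
    return -slot_id + int(has_gpu) * 1000 + gpu_cores


# SlotId values 1 and 2 are assumed; HasGPU and GPUCores come from the post.
for slot_id in (1, 2):
    print(f"SlotId={slot_id}: rank={job_rank(slot_id, True, 2496)}")
```

Under these assumptions the rank is 3495 for slot 1 and 3494 for slot 2, so if both machines advertise the same HasGPU and GPUCores values, the rank differs only by slot and not by machine.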
I need to figure out why these jobs are moved to another machine. Thanks.