Mailing List Archives, UW Madison Computer Sciences Department, Computer Systems Lab


[HTCondor-users] Why jobs transferred to another machine silently after executing long time in one machine?

Date: Sat, 25 Jan 2014 08:04:40 +0800
From: 钱晓明 <kyleqian@xxxxxxxxx>
Subject: [HTCondor-users] Why jobs transferred to another machine silently after executing long time in one machine?

Hi, I am using HTCondor 7.8.5 on CentOS 6.3. I have GPU jobs to run, and each takes 25 to 90 minutes. Each machine has 2 GPUs.

All GPU jobs are in one node of a DAG job. I find that some jobs are silently moved to another machine after executing for a while on the first one. This is the event sequence for one such job:
SUBMIT
EXECUTE on 10.1.1.254
IMAGE_SIZE_UPDATE
IMAGE_SIZE_UPDATE
EXECUTE on 10.1.1.251
......
I want to know why the second EXECUTE event occurred. There is nothing between the last IMAGE_SIZE_UPDATE event and the second EXECUTE event. I also checked the *.dag.nodes.log and *.dagman.out files and found nothing helpful.

I have not configured a RANK expression for the startd. The rank for the job is: -SlotId + HasGPU*1000 + GPUCores. I do not think this is the reason, though. HasGPU is true and GPUCores is 2496 now.
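
For reference, here is the arithmetic of that job rank as a small sketch. It assumes that a true HasGPU counts as 1 in the expression, and the slot numbers 1 and 2 are just example values; `job_rank` is a hypothetical helper, not an HTCondor function.

```python
def job_rank(slot_id, has_gpu, gpu_cores):
    """Evaluate -SlotId + HasGPU*1000 + GPUCores, treating a true HasGPU as 1."""
    return -slot_id + int(has_gpu) * 1000 + gpu_cores

# With HasGPU true and GPUCores = 2496, only SlotId changes the value.
rank_slot1 = job_rank(1, True, 2496)  # -1 + 1000 + 2496 = 3495
rank_slot2 = job_rank(2, True, 2496)  # -2 + 1000 + 2496 = 3494
```

Under these assumptions, two slots on identical machines differ in rank by only 1.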

I need to figure out why the jobs are moved to another machine. Thanks.
