From: "Dr. Harinder Singh Bawa" <harinder.singh.bawa@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, April 24, 2013 3:24:24 PM
Subject: Re: [HTCondor-users] Most of the time in Condor jobs gets wasted in I/O

Hi Tim,

I also suspect this NFS share is not optimized. I can try using should_transfer_files = NO. What exactly does this mean? Does it apply to the input files or the output files?
How do I enforce a concurrency limit? Is there a command that does this?
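If I understand the suggestion, a minimal sketch would look like this in the submit description file (NFS_IO is just a made-up limit name):

    universe              = vanilla
    executable            = parallel_90.sh
    # rely on the shared NFS mount instead of HTCondor's file transfer
    should_transfer_files = NO
    # throttle how many of these jobs run at the same time
    concurrency_limits    = NFS_IO
    queue

The limit itself would then be defined in the pool configuration on the central manager, e.g. NFS_IO_LIMIT = 20, followed by condor_reconfig. So, as far as I can tell, it is a configuration knob plus a submit attribute rather than a single command.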
Harinder

PS: Yes, for the longer term, I think an HDFS system can help.

On Wed, Apr 24, 2013 at 10:11 PM, Tim St Clair <tstclair@xxxxxxxxxx> wrote:

I would suspect that your NFS share is not optimized for your deployment or use case, which is likely causing the issue when reading and writing your files.

If the share is common to all machines, make certain 'should_transfer_files = NO' is set in your submission too. Also, if you still experience long wait times, you can always enforce concurrency limits on your jobs so they don't all hit the same shared resource at one time.

Long term, you may want to look into other distributed filesystems to reduce the load on a single source, e.g. Gluster, HDFS, QFS, etc.

Cheers,
Tim

From: "Dr. Harinder Singh Bawa" <harinder.singh.bawa@xxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Sent: Wednesday, April 24, 2013 8:14:31 AM
Subject: [HTCondor-users] Most of the time in Condor jobs gets wasted in I/O
Hello experts,

I am submitting 120 jobs to 120 nodes using Condor. Essentially, I have approx. 20,000 input files in the /rdata2 directory:

/dev/sdd1   39T   19T   21T   48%   /NFSv3exports/rdata2

I have a file (Full2013.list) listing the names and paths of the 20,000 input files, one per line. I split that file into 120 parts, so each job handles approx. 20,000/120 ≈ 166 files. Under Condor, it is taking a day to finish my jobs. I ran one job interactively on a single node; it finishes in 40 min. Under Condor, the same job shows:

15126.0   bawa   4/23 04:56   0+03:50:16 R  0   317.4 parallel_90.sh
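(As for the splitting itself, it is just something like the following, assuming GNU split; the output names are arbitrary:

    # split the 20,000-line list into 120 roughly equal parts
    # without breaking lines; produces x000 ... x119
    split -d -a 3 -n l/120 Full2013.list x

Each job then reads its own part of the list.)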
Statistics for comparison:

Interactively:
==============
real 63m57.321s
user 42m17.957s
sys 1m24.413s
Condor node:
============
condor_q -analyze 15126.0
-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
---
15126.000: Request is being serviced
The jobs have now been running for a day. If I look at the real CPU time of this job:

[bawa@t3nfs Wstar_sin0_NewCalib17]$ condor_q 15126.0 -cputime
-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
ID OWNER SUBMITTED CPU_TIME ST PRI SIZE CMD
15126.0 bawa 4/23 04:56 0+00:06:47 R 0 317.4 parallel_90.sh
If I understand correctly, the CPU time (the time actually spent running on the CPU) is just 6 min 47 sec out of a run time of 3 hr 50 min, i.e. under 3% CPU efficiency. I suspect something serious is going on in the data transfer (I/O).

Are there any suggestions on how to debug this?
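One generic thing I can think of trying, while the jobs are running, is to watch the NFS server itself (plain Linux tools, nothing HTCondor-specific):

    # per-device utilization and wait times on the NFS server
    iostat -x 5
    # server-side NFS operation counters
    nfsstat -s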
Thanks
-Harinder
--
Dr. Harinder Singh Bawa
Experimental High Energy Physics
ATLAS Experiment
@CERN, Geneva
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/