From: "Dr. Harinder Singh Bawa" <harinder.singh.bawa@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, April 24, 2013 3:24:24 PM
Subject: Re: [HTCondor-users] Most of the time in Condor jobs gets wasted in I/O

Hi Tim,

I also suspect this NFS share is not optimized. I can try using should_transfer_files = NO. What exactly does this mean? Does it apply to the input files or the output files?
How do I enforce a concurrency limit? Is there a command that does this?
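If I understand the suggestion, a minimal sketch would look like this in the submit description file (NFS_IO is just a made-up limit name):

    universe              = vanilla
    executable            = parallel_90.sh
    # rely on the shared NFS mount instead of HTCondor's file transfer
    should_transfer_files = NO
    # throttle how many of these jobs run at the same time
    concurrency_limits    = NFS_IO
    queue

The limit itself would then be defined in the pool configuration on the central manager, e.g. NFS_IO_LIMIT = 20, followed by condor_reconfig. So, as far as I can tell, it is a configuration knob plus a submit attribute rather than a single command.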
Harinder

PS: Yes, for the longer term, I think an HDFS system can help.

On Wed, Apr 24, 2013 at 10:11 PM, Tim St Clair <tstclair@xxxxxxxxxx> wrote:

I would suspect that your NFS share is not optimized for your deployment or use case, which is likely causing the issue when reading and writing your files.

If the share is common to all machines, make certain 'should_transfer_files = NO' is set in your submission too. Also, if you still experience long wait times, you can always enforce concurrency limits on your jobs so they don't all hit the same shared resource at one time.

Long term, you may want to look into other distributed filesystems to reduce the load on a single source, e.g. Gluster, HDFS, QFS, etc.

Cheers,
Tim

From: "Dr. Harinder Singh Bawa" <harinder.singh.bawa@xxxxxxxxx>
To: htcondor-users@xxxxxxxxxxx
Sent: Wednesday, April 24, 2013 8:14:31 AM
Subject: [HTCondor-users] Most of the time in Condor jobs gets wasted in I/O
Hello experts,

I am submitting 120 jobs to 120 nodes using Condor. Essentially, I have approx. 20,000 input files in the /rdata2 directory:

/dev/sdd1   39T   19T   21T   48%   /NFSv3exports/rdata2

I have a file (Full2013.list) listing the names and paths of the 20,000 input files, one per line. I split that file into 120 parts, so each job handles approx. 20,000/120 ≈ 166 files. Under Condor, it is taking a day to finish my jobs. I ran one job interactively on a single node; it finishes in 40 min. Under Condor, the same job shows:

15126.0   bawa   4/23 04:56   0+03:50:16 R  0   317.4 parallel_90.sh
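(As for the splitting itself, it is just something like the following, assuming GNU split; the output names are arbitrary:

    # split the 20,000-line list into 120 roughly equal parts
    # without breaking lines; produces x000 ... x119
    split -d -a 3 -n l/120 Full2013.list x

Each job then reads its own part of the list.)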
Statistics for comparison:

Interactively:
==============
real 63m57.321s
user 42m17.957s
sys 1m24.413s
Condor node:
============
condor_q -analyze 15126.0
-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
---
15126.000: Request is being serviced
The jobs have now been running for a day. If I look at the real CPU time of this job:

[bawa@t3nfs Wstar_sin0_NewCalib17]$ condor_q 15126.0 -cputime
-- Submitter: t3nfs.atlas.csufresno.edu : <192.168.100.2:9905> : t3nfs.atlas.csufresno.edu
ID OWNER SUBMITTED CPU_TIME ST PRI SIZE CMD
15126.0 bawa 4/23 04:56 0+00:06:47 R 0 317.4 parallel_90.sh
If I understand correctly, the CPU time (the time actually spent running on the CPU) is just 6 min 47 sec out of a run time of 3 hr 50 min, i.e. under 3% CPU efficiency. I suspect something serious is going on in the data transfer (I/O).

Are there any suggestions on how to debug this?
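One generic thing I can think of trying, while the jobs are running, is to watch the NFS server itself (plain Linux tools, nothing HTCondor-specific):

    # per-device utilization and wait times on the NFS server
    iostat -x 5
    # server-side NFS operation counters
    nfsstat -s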
Thanks
-Harinder
--
Dr. Harinder Singh Bawa
Experimental High Energy Physics
ATLAS Experiment
@CERN, Geneva
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/