Hi Folks,
I'm seeing some unfortunate behavior with the condor_shadow processes
for vanilla universe jobs. This is on Linux x86_64 with Condor v6.8.6.
A user submits a large number (500-1000) of jobs on a cluster with
150 processors and has about 100 jobs running simultaneously. These
jobs all run for about 3 minutes and then complete at nearly the
same time. When they do, the load on the submit machine, which is
also the head node, climbs to a little over N, where N is the number
of this user's running jobs.
Closer inspection shows that all of the condor_shadow processes owned
by this user are in the "D" (uninterruptible sleep) state, apparently
contending for the same resource.
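
For reference, this is roughly how I've been looking at the stuck
shadows (assuming a stock procps ps; the wchan column shows the kernel
routine each process is sleeping in):

  # show the shadow PIDs, their state, and what they are blocked on
  ps -eo pid,user,stat,wchan:32,cmd | grep '[c]ondor_shadow'

The brackets in the grep pattern just keep the grep process itself out
of the listing.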
At first I thought the contention came from the output data being
transferred back from the compute nodes to the submit node, so I asked
the user to add

initialdir = [ the run dir ]
should_transfer_files = NO

to the submit file, but this doesn't help. Also, looking at the
actual output, each job produces less than 20 KB of output data.
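
In case anyone wants to reproduce the check, something like the
following (lsof may need root to see another user's processes, and
"someuser" is just a placeholder) lists the files the blocked shadows
actually hold open:

  # open files held by this user's condor_shadow processes;
  # -a ANDs the -c (command name) and -u (user) selections
  lsof -a -c condor_shadow -u someuser

The idea is to see whether they are all sitting on one shared file or
directory.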
What could be causing such contention among vanilla universe
condor_shadow processes, if not the final file transfer? Has anyone
seen this behavior before in the vanilla universe? Any hints or
guesses for things to look at?
thanks,
rob