009 (4225.000.000) 03/15 11:47:30 Job was aborted by the user. via condor_rm (by user roskar)
...000 (4226.000.000) 03/15 11:48:00 Job submitted from host: <128.95.98.82:36131>
...014 (4226.000.000) 03/15 11:48:55 Node 0 executing on host: <128.95.99.141:32785>
...014 (4226.000.001) 03/15 11:48:55 Node 1 executing on host: <128.95.99.88:32785>
...014 (4226.000.000) 03/15 11:50:05 Node 0 executing on host: <128.95.99.141:32785>
...014 (4226.000.001) 03/15 11:50:05 Node 1 executing on host: <128.95.99.88:32785>
...014 (4226.000.000) 03/15 11:51:15 Node 0 executing on host: <128.95.99.141:32785>
...014 (4226.000.001) 03/15 11:51:15 Node 1 executing on host: <128.95.99.88:32785>
...014 (4226.000.000) 03/15 11:52:25 Node 0 executing on host: <128.95.99.141:32785>
...014 (4226.000.001) 03/15 11:52:25 Node 1 executing on host: <128.95.99.88:32785>
...014 (4226.000.000) 03/15 11:53:35 Node 0 executing on host: <128.95.99.141:32785>
...014 (4226.000.001) 03/15 11:53:36 Node 1 executing on host: <128.95.99.88:32785>
...

The job requests 4 machines, but for some reason only 2 are given to it when execution begins.
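One way to confirm this from the user log is to count the distinct nodes that report event 014 for a given cluster. Below is a minimal Python sketch, assuming the log lines look like the excerpt above. The function name `nodes_per_cluster` and the regular expression are illustrative only and are not part of Condor.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   014 (4226.000.001) 03/15 11:48:55 Node 1 executing on host: <128.95.99.88:32785>
# The pattern is an assumption based on the excerpt above, not an official format spec.
NODE_EVENT = re.compile(
    r"014 \((\d+)\.\d+\.\d+\) \d\d/\d\d \d\d:\d\d:\d\d "
    r"Node (\d+) executing on host: <([\d.]+):\d+>"
)


def nodes_per_cluster(log_text):
    """Map each cluster id to the set of (node, host) pairs seen in 014 events."""
    seen = defaultdict(set)
    for cluster, node, host in NODE_EVENT.findall(log_text):
        seen[int(cluster)].add((int(node), host))
    return dict(seen)


# A few lines copied from the log excerpt above.
SAMPLE = """\
...014 (4226.000.000) 03/15 11:48:55 Node 0 executing on host: <128.95.99.141:32785>
...014 (4226.000.001) 03/15 11:48:55 Node 1 executing on host: <128.95.99.88:32785>
...014 (4226.000.000) 03/15 11:50:05 Node 0 executing on host: <128.95.99.141:32785>
...014 (4226.000.001) 03/15 11:50:05 Node 1 executing on host: <128.95.99.88:32785>
"""

if __name__ == "__main__":
    for cluster, nodes in sorted(nodes_per_cluster(SAMPLE).items()):
        print(f"cluster {cluster}: {len(nodes)} distinct node(s)")
        for node, host in sorted(nodes):
            print(f"  node {node} on {host}")
```

Repeated 014 events for the same node collapse into one entry, so the count reflects how many machines actually started the job rather than how many times it was restarted.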
-----------------------------------------
Rok Roskar
University of Washington
Department of Astronomy

On Mar 9, 2006, at 12:59 PM, Greg Thain wrote:
> Rok Roskar wrote:
>> I'm running MPI under condor 6.6:
>>
>> DedicatedScheduler holds on to resources even after all MPI jobs have been
>> removed from the queue - any way to fix this? Or is it an unfortunate
>> byproduct of mixing parallel and serial jobs on the same set of resources?
>
> It will hold onto claims for UNUSED_CLAIM_TIMEOUT seconds after the job
> leaves the queue, where UNUSED_CLAIM_TIMEOUT is a parameter in the
> condor_config file. The default is 300 seconds, and you can lower this as
> you like.
>
>> Also, my jobs sometimes try to start even when DedicatedScheduler doesn't
>> have enough resources for them. This causes infinite looping of
>> unsuccessful job execution, meaning that all the resource time gets
>> wasted. For example, my job requests 8 machines, but only 7 are available.
>> Somehow, Condor tries to execute the job anyway, but because there aren't
>> enough resources, it doesn't run. Solutions?
>
> Does the job try to start, or do the machines just get claimed?
>
> -greg

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users