I may be exposing my lack of knowledge of the mechanics of Condor
pools, but offhand I am quite surprised that the performance of our
pool is, on the whole, quite poor. The composition of the pool is
complicated: there are machines from different departments and/or
subnets, so this may be a very difficult issue to analyse or for
anyone to advise us on.
According to condor_status, most of the machines are unclaimed. However,
when I submit a batch of 100 simple jobs, only about 50% of them run
simultaneously in the pool. The rest are rejected: condor_q tells me
that machines do match the jobs but reject them for some unknown
reason. The vast majority of the machines are running Windows XP
with SP2.
Can anyone please advise us on this? For example, what might be
wrong in the pool, or what analysis should we consider doing?
1216 match, but reject the job for unknown reasons
The trick to figuring this out would be to track down these "unknown
reasons". Are there certain machines that are consistently able to run
jobs? Are there certain machines that consistently fail to run jobs?
You can find successful machines by looking at the "LastRemoteHost"
attribute that condor_history <cluster.proc> -l reports. Then see if
you can find failures by looking at the ShadowLog on the submitting
machine. You may want to have a look at my Troubleshooting page:
http://docs.optena.com/display/CONDOR/Troubleshooting
My guess is that some of your machines are misconfigured: jobs go
there, die, and get kicked off, only to start somewhere else and
succeed.
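As a rough illustration of the first step, here is a minimal Python
sketch that tallies the "LastRemoteHost" values found in saved
condor_history -l output and lists pool machines that never appear
there. The function names are made up for this example, and it assumes
the usual long-format layout of one `Attribute = value` pair per line.
It also assumes that slot names may carry a prefix such as "vm1@" in
front of the machine name. Treat it as a starting point, not a
definitive tool.

```python
from collections import Counter


def count_last_remote_hosts(classad_text):
    """Count LastRemoteHost values in `condor_history -l` style text.

    Assumes one `Attribute = value` pair per line (long format).
    """
    counts = Counter()
    for line in classad_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == "LastRemoteHost":
            # Values are quoted strings; drop whitespace and the quotes.
            counts[value.strip().strip('"')] += 1
    return counts


def slot_to_machine(host):
    """Drop a leading slot prefix such as 'vm1@', if there is one."""
    return host.split("@", 1)[-1]


def machines_without_successes(pool_machines, counts):
    """Return pool machines that never show up as a LastRemoteHost.

    `pool_machines` is a list of machine names you supply yourself,
    for example collected from condor_status.
    """
    ran = {slot_to_machine(host) for host in counts}
    return sorted(set(pool_machines) - ran)
```

You could feed the first function the text from
condor_history -l <cluster.proc> saved to a file, then pass the result
and your own list of machine names to machines_without_successes. A
machine that never appears is not proof of a problem, but it is a good
candidate to check against the ShadowLog on the submitting machine.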
Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct : +1.408.321.9000
Fax : +1.408.321.9030
Mobile : +1.408.497.7597
yoderm@xxxxxxxxxx
Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com