HTCondor Project List Archives




Re: [Condor-devel] Jobs lying idle when machines are available.




Hi Mahadev -

With D_FULLDEBUG specified for both the schedd and the negotiator, could you please send along more of the NegotiatorLog from when this problem happens, ideally the entire section of the NegotiatorLog for that negotiation cycle? Also, if you could include the section of the ScheddLog covering the time of the negotiation cycle, that'd be great.

Armed with the above info, hopefully we can gain some insight into what is happening at your site.
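
For anyone who wants to pull out just the relevant part of a long NegotiatorLog, here is a minimal Python sketch. It assumes that debug levels are set through the `<SUBSYS>_DEBUG` configuration macros and that each negotiation cycle in the log is bracketed by lines containing "Started Negotiation Cycle" and "Finished Negotiation Cycle"; neither detail is stated in this thread, so check both against your own configuration and logs. The function names are hypothetical helpers, not part of Condor.

```python
# Assumption: full debugging was enabled beforehand in the Condor config,
# e.g.  NEGOTIATOR_DEBUG = D_FULLDEBUG  and  SCHEDD_DEBUG = D_FULLDEBUG

# Assumed marker text for cycle boundaries; adjust to match your log.
START = "Started Negotiation Cycle"
END = "Finished Negotiation Cycle"


def negotiation_cycles(lines):
    """Yield each negotiation cycle as a list of log lines."""
    cycle = None
    for line in lines:
        if START in line:
            if cycle is not None:
                yield cycle  # previous cycle had no end marker
            cycle = [line]
        elif cycle is not None:
            cycle.append(line)
            if END in line:
                yield cycle
                cycle = None
    if cycle is not None:
        yield cycle  # log ended in the middle of a cycle


def cycles_mentioning(lines, needle):
    """Return only the cycles in which some line contains `needle`."""
    return [c for c in negotiation_cycles(lines)
            if any(needle in line for line in c)]
```

Calling `cycles_mentioning(open(path), "no match found")` with `path` set to your NegotiatorLog would return the cycles that contain a rejection like the one quoted below, ready to be written to a smaller file and sent along.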

Maybe sending the log files to condor-admin would be a good idea (both because they could be long and because they could contain IP addresses etc. that you don't want passed around on a public email list) - we could summarize what we find back to this list if folks are interested.

thanks,
Todd



At 04:57 PM 11/8/2006, Mahadev Konar wrote:
Hi all,
  While experimenting with Condor on our cluster, we encountered the
following problem.

After we submit a bunch of jobs, Condor is able to schedule a few of them
but for the others we get:

100 match but reject the job for unknown reasons.

There are machines lying idle on the cluster, not running anything, but Condor
does not schedule jobs on them. Also, all the jobs are similar in nature, and
so are the machines, meaning that if one job matches a machine, so should
the others.
Also, this happens every 2nd or 3rd time we submit a bunch of jobs to the
cluster. Sometimes it does run all the jobs in the queue.

After taking a look at the Negotiator logs, I saw this:

Attempting to use cached MatchList: Failed (MatchList length: 0,
Autocluster: 0, Schedd Name: ****, Schedd Address: *****)
11/8 22:40:43       Rejected jobid schedd_name schedd_ip : no match found

The above is for the jobs that condor_q -analyze says "match but reject the
job for unknown reasons".

Could this be due to some misconfiguration?
Is there any way to debug why this is happening? Are there any tools to find
out why a job is not matching a specific startd? It would be great if someone
could point me to some debugging tools for this.

Thanks
Mahadev



-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
http://www.cs.wisc.edu/~tannenba      Madison, WI 53706-1685
Phone: (608) 263-7132  FAX: (608) 262-9777