[Condor-devel] Another condor_q/quill problem
- Date: Fri, 14 Oct 2005 11:40:40 -0500
- From: Daniel Forrest <forrest@xxxxxxxxxxxxx>
- Subject: [Condor-devel] Another condor_q/quill problem
We have had about a half dozen condor_q processes go into an infinite
CPU loop in "condor_c++_util/jobqueuesnapshot.C". Here is a summary
of attaching gdb to one of these processes:
384: while(st == DONE_PROCADS_CUR_CLUSTERAD) {
385: getNextClusterAd(curClusterId, curClusterAd);
curClusterId = "36201"
*curClusterAd = {<AttrList> = {<AttrListAbstract> = {type = 0, inList = 0x0, next = 0x0, prev = 0x0, _vptr.AttrListAbstract = 0x82f7a28}, chainedAttrs = 0x0, exprList = 0x83d19c0, associatedList = 0x0, tail = 0x832bda8, ptrExpr = 0x0, ptrExprInChain = false, ptrName = 0x0, ptrNameInChain = false, seq = 0, inside_insert = false}, myType = 0x0, targetType = 0x0}
122: if (cur_clusterads_str_index >= clusterads_str_num) {
cur_clusterads_str_index = 789
clusterads_str_num = 789
123: return DONE_CLUSTERADS_CURSOR;
124: }
386: st = getNextProcAd(ad);
213: if ((cur_procads_num_index >= procads_num_num) &&
214: (cur_procads_str_index >= procads_str_num)) {
cur_procads_num_index = 3621
procads_num_num = 6985
cur_procads_str_index = 1787
procads_str_num = 4333
217: }
218: else {
219: ad = new ClassAd();
220: }
225: while(cur_procads_num_index < procads_num_num) {
227: cid = jqDB->getJobQueueProcAds_NumValue(
228: cur_procads_num_index, 0); // cid
cid = "36174"
230: if (strcmp(cid, "0") != 0) {
231: break;
232: }
237: while(cur_procads_str_index < procads_str_num) {
239: cid = jqDB->getJobQueueProcAds_StrValue(
240: cur_procads_str_index, 0); // cid
cid = "36174"
242: if (strcmp(cid, "0") != 0) {
243: break;
244: }
251: if (!cid || strcmp(cid, curClusterId) != 0) {
cid = "36174"
curClusterId = "36201"
252: delete ad;
253: ad = NULL;
254: return DONE_PROCADS_CUR_CLUSTERAD;
255: }
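Control then returns to the top of the loop at line 384 with "st" still
equal to DONE_PROCADS_CUR_CLUSTERAD, so the same sequence repeats forever.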
I don't claim to understand the code, but is it possible the error
here is that the return value from "getNextClusterAd" is not being
checked at line 385? It is returning "DONE_CLUSTERADS_CURSOR".
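
To make the suspicion concrete, here is a minimal standalone sketch of the
loop shape with the return value checked. It is not the Condor code:
"MockSnapshot", "drain", and the simplified status enum are hypothetical
stand-ins modeled on the names in the trace, and whether bailing out is the
right recovery in the real code is an assumption on my part.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical status codes, named after the ones seen in the trace.
enum Status { OK, DONE_PROCADS_CUR_CLUSTERAD, DONE_CLUSTERADS_CURSOR, DONE_PROCADS_CURSOR };

// Hypothetical stand-in for the snapshot: two cursors over plain vectors.
struct MockSnapshot {
    std::vector<std::string> clusterIds;      // one entry per cluster ad
    std::vector<std::string> procClusterIds;  // cluster id of each proc ad
    std::size_t clusterIdx = 0;
    std::size_t procIdx = 0;
    std::string curClusterId;

    Status getNextClusterAd() {
        if (clusterIdx >= clusterIds.size()) return DONE_CLUSTERADS_CURSOR;
        curClusterId = clusterIds[clusterIdx++];
        return OK;
    }

    Status getNextProcAd() {
        if (procIdx >= procClusterIds.size()) return DONE_PROCADS_CURSOR;
        // A proc ad from another cluster is not consumed, as at lines 251-254.
        if (procClusterIds[procIdx] != curClusterId) return DONE_PROCADS_CUR_CLUSTERAD;
        ++procIdx;
        return OK;
    }
};

// Counts proc ads into `consumed`. Returns false if the cluster cursor ran
// out while a proc ad was still waiting for a matching cluster ad.
bool drain(MockSnapshot &s, int &consumed) {
    consumed = 0;
    if (s.getNextClusterAd() != OK) return false;
    assert(!s.curClusterId.empty());  // a successful advance sets the id
    for (;;) {
        Status st = s.getNextProcAd();
        while (st == DONE_PROCADS_CUR_CLUSTERAD) {
            // The check the trace suggests is missing: without it, an
            // exhausted cluster cursor leaves st unchanged and this loop spins.
            if (s.getNextClusterAd() == DONE_CLUSTERADS_CURSOR) return false;
            st = s.getNextProcAd();
        }
        if (st != OK) return true;  // proc cursor exhausted normally
        ++consumed;
    }
}
```

Feeding it one cluster ad ("36201") and a proc ad for "36201" followed by one
for "36174" mirrors the state in the trace: the second proc ad never matches,
the cluster cursor is exhausted, and "drain" returns false instead of
spinning. Removing the check inside the inner while reproduces the endless
loop in the mock.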
--
Daniel K. Forrest Laboratory for Molecular and
forrest@xxxxxxxxxxxxx Computational Genomics
(608) 262 - 9479 University of Wisconsin, Madison