HTCondor Project List Archives



[Condor-devel] Another condor_q/quill problem



We have had about half a dozen condor_q processes go into an infinite
CPU loop in "condor_c++_util/jobqueuesnapshot.C".  Here is a summary
of what gdb showed when attached to one of these processes: the source
lines it stepped through, with the relevant variable values after each:


384:	while(st == DONE_PROCADS_CUR_CLUSTERAD) {
385:		getNextClusterAd(curClusterId, curClusterAd);

curClusterId = "36201"
*curClusterAd = {<AttrList> = {<AttrListAbstract> = {type = 0, inList = 0x0, next = 0x0, prev = 0x0, _vptr.AttrListAbstract = 0x82f7a28}, chainedAttrs = 0x0, exprList = 0x83d19c0, associatedList = 0x0, tail = 0x832bda8, ptrExpr = 0x0, ptrExprInChain = false, ptrName = 0x0, ptrNameInChain = false, seq = 0, inside_insert = false}, myType = 0x0, targetType = 0x0}

122:	if (cur_clusterads_str_index >= clusterads_str_num) {

cur_clusterads_str_index = 789
clusterads_str_num = 789

123:		return DONE_CLUSTERADS_CURSOR;
124:	}

386:		st = getNextProcAd(ad);

213:	if ((cur_procads_num_index >= procads_num_num) &&
214:		(cur_procads_str_index >= procads_str_num)) {

cur_procads_num_index = 3621
procads_num_num = 6985
cur_procads_str_index = 1787
procads_str_num = 4333

217:	}
218:	else {
219:		ad = new ClassAd();
220:	}

225:	while(cur_procads_num_index < procads_num_num) {
227:		cid = jqDB->getJobQueueProcAds_NumValue(
228:			cur_procads_num_index, 0); // cid

cid = "36174"

230:		if (strcmp(cid, "0") != 0) {
231:			break;
232:		}

237:	while(cur_procads_str_index < procads_str_num) {
239:		cid = jqDB->getJobQueueProcAds_StrValue(
240:			cur_procads_str_index, 0); // cid

cid = "36174"

242:		if (strcmp(cid, "0") != 0) {
243:			break;
244:		}

251:	if (!cid || strcmp(cid, curClusterId) != 0) {

cid = "36174"
curClusterId = "36201"

252:		delete ad;
253:		ad = NULL;
254:		return DONE_PROCADS_CUR_CLUSTERAD;
255:	}

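Since st is again DONE_PROCADS_CUR_CLUSTERAD, the while condition at
line 384 still holds and we return to the top.  None of the indexes
shown above has advanced, so the loop never ends.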

I don't claim to understand the code, but is it possible the error
here is that the return value from "getNextClusterAd" is not being
checked at line 385?  It is returning "DONE_CLUSTERADS_CURSOR".
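
To make the suggestion concrete, here is a minimal sketch of a loop
that checks that return value.  It is not the real jobqueuesnapshot.C
code: MockSnapshot, advanceChecked, the OK status and the enum values
are hypothetical stand-ins, and I am assuming getNextClusterAd reports
its status through its return value, as line 123 suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Status names taken from the trace; OK and the numeric values are assumptions.
enum SnapshotStatus {
    OK,
    DONE_PROCADS_CUR_CLUSTERAD,
    DONE_CLUSTERADS_CURSOR
};

// Hypothetical stand-in for the snapshot state seen in gdb.
struct MockSnapshot {
    std::vector<std::string> clusterIds;  // contents of the cluster-ad cursor
    std::vector<std::string> procCids;    // cluster id of each proc ad
    std::size_t clusterIdx = 0;
    std::size_t procIdx = 0;
    std::string curClusterId;

    // Mirrors lines 122-123: report exhaustion, leave curClusterId alone.
    SnapshotStatus getNextClusterAd() {
        if (clusterIdx >= clusterIds.size()) {
            return DONE_CLUSTERADS_CURSOR;
        }
        curClusterId = clusterIds[clusterIdx++];
        return OK;
    }

    // Simplified version of lines 251-254: a proc ad is only consumed
    // when its cluster id matches the current cluster.
    SnapshotStatus getNextProcAd() {
        assert(procIdx <= procCids.size());
        if (procIdx >= procCids.size() || procCids[procIdx] != curClusterId) {
            return DONE_PROCADS_CUR_CLUSTERAD;
        }
        ++procIdx;
        return OK;
    }
};

// The loop from lines 384-386, with the return value of
// getNextClusterAd checked so an exhausted cursor ends the loop.
SnapshotStatus advanceChecked(MockSnapshot& snap) {
    SnapshotStatus st = DONE_PROCADS_CUR_CLUSTERAD;
    while (st == DONE_PROCADS_CUR_CLUSTERAD) {
        SnapshotStatus cst = snap.getNextClusterAd();
        if (cst == DONE_CLUSTERADS_CURSOR) {
            return cst;  // no more clusters: stop instead of spinning
        }
        st = snap.getNextProcAd();
    }
    return st;
}
```

With the mock set up like the trace (cluster cursor exhausted,
curClusterId "36201", next proc ad in cluster "36174"), advanceChecked
returns DONE_CLUSTERADS_CURSOR instead of looping.  Whether the real
caller should stop there or report the leftover proc ads as an
inconsistency is for someone who knows the code to decide.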

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison