Re: [HTCondor-users] Jobs delayed and schedd logging problem

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

I fixed the logging issue by restarting the schedd. The jobs still take a long time to start, though, so I guess the two problems weren't related.

From: jholladay@xxxxxxxxxxx
To: htcondor-users@xxxxxxxxxxx
Date: Thu, 12 Jun 2014 14:41:00 -0700
Subject: [HTCondor-users] Jobs delayed and schedd logging problem

I was trying to figure out why my jobs took so long to start and noticed some issues in the SchedLog. It seems like it's creating a new log file every 2 seconds or so. Here's the entire SchedLog:

06/12/14 13:47:29 (pid:8249) Now in new log file /usr/local/condor/local.workstation1/log/SchedLog
06/12/14 13:47:29 (pid:8249) Number of Active Workers 2
06/12/14 13:47:29 (pid:8249) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.

In SchedLog.old I see similar entries:

...
(pid:79460) Number of Active Workers 1
(pid:20275) Number of Active Workers 0
(pid:20275) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.
(pid:79460) Number of Active Workers 2
(pid:20276) Number of Active Workers 1
(pid:79460) Number of Active Workers 3
(pid:20276) GET_JOB_CONNECT_INFO failed: Job 12.2 is not running.
(pid:20277) Number of Active Workers 2
(pid:20277) GET_JOB_CONNECT_INFO failed: Job 12.0 is not running.
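
One way to see how many distinct processes are writing these entries is to group them by pid. The following is only a minimal sketch, assuming the line format shown in the excerpts above (an optional timestamp, then `(pid:N)`, then the message). `group_by_pid` is a hypothetical helper name, not part of HTCondor.

```python
import re
from collections import defaultdict

# Assumed format: optional "MM/DD/YY HH:MM:SS " timestamp, "(pid:N)", then the message.
LINE_RE = re.compile(r"^(?:\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} )?\(pid:(\d+)\) (.*)$")

def group_by_pid(lines):
    """Return {pid: [messages]} for every line matching the assumed format."""
    by_pid = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            by_pid[int(m.group(1))].append(m.group(2))
    return dict(by_pid)

# Sample lines copied from the log excerpts above.
sample = [
    "(pid:79460) Number of Active Workers 1",
    "(pid:20275) Number of Active Workers 0",
    "(pid:20275) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.",
    "06/12/14 13:47:29 (pid:8249) Number of Active Workers 2",
]
groups = group_by_pid(sample)
for pid, messages in sorted(groups.items()):
    print(pid, len(messages))
```

Run against the full SchedLog.old, this would show whether the short-lived pids (20275, 20276, 20277) each log only a handful of lines while a longer-lived pid keeps writing.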

The jobs eventually run, but they take much longer than they should to start (sometimes over 30 minutes). I don't see any errors in the local logs, and nothing appears on the master. It's only a two-node cluster, with the remote node being the master. There are no other jobs in the global queue.

'condor_q -global' on master reports:

All queues are empty

condor_config.local on both nodes:

START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE

'condor_q -analyze' reports:

012.000: Request has not yet been considered by the matchmaker.

From SchedLog on master every five minutes:

-------- Begin starting jobs --------
-------- Done starting jobs --------
Getting monitoring info for pid 66690
JobsRunning = 0
JobsIdle = 0
JobsHeld = 0
JobsRemoved = 0
LocalUniverseJobsRunning = 0
LocalUniverseJobsIdle = 0
SchedUniverseJobsRunning = 0
SchedUniverseJobsIdle = 0
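
The counters in that excerpt are plain `Name = value` lines, so they can be checked mechanically instead of by eye. This is a minimal sketch under that assumption. `parse_counters` is a hypothetical helper, not an HTCondor tool.

```python
def parse_counters(lines):
    """Parse 'Name = integer' lines into a dict, skipping anything else."""
    counters = {}
    for line in lines:
        name, sep, value = line.partition("=")
        value = value.strip()
        # Keep only lines that have an '=' and an integer on the right-hand side.
        if sep and value.lstrip("-").isdigit():
            counters[name.strip()] = int(value)
    return counters

# Sample lines copied from the master's SchedLog excerpt above.
sample = [
    "-------- Begin starting jobs --------",
    "-------- Done starting jobs --------",
    "Getting monitoring info for pid 66690",
    "JobsRunning = 0",
    "JobsIdle = 0",
    "JobsHeld = 0",
    "JobsRemoved = 0",
]
counters = parse_counters(sample)
all_zero = all(v == 0 for v in counters.values())
print(counters, all_zero)
```

If every counter is zero on the master's schedd, as it is here, that schedd has no jobs queued at all, which matches the `condor_q -global` output.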

It looks like the schedd is taking a while to send the job to the master, but I don't see any reason why. Any help would be greatly appreciated.

Thanks,

Josh

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
