I fixed the logging issue by restarting schedd. It still takes a long time to start the jobs though. Guess they weren't related.
From: jholladay@xxxxxxxxxxx
To: htcondor-users@xxxxxxxxxxx
Date: Thu, 12 Jun 2014 14:41:00 -0700
Subject: [HTCondor-users] Jobs delayed and schedd logging problem

I was trying to figure out why my jobs took so long to start and noticed some issues in the SchedLog. It seems like it's creating a new log file every 2 seconds or so. Here's the entire SchedLog:

    06/12/14 13:47:29 (pid:8249) Now in new log file /usr/local/condor/local.workstation1/log/SchedLog
    06/12/14 13:47:29 (pid:8249) Number of Active Workers 2
    06/12/14 13:47:29 (pid:8249) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.

In SchedLog.old I see similar entries:

    ...
    (pid:79460) Number of Active Workers 1
    (pid:20275) Number of Active Workers 0
    (pid:20275) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.
    (pid:79460) Number of Active Workers 2
    (pid:20276) Number of Active Workers 1
    (pid:79460) Number of Active Workers 3
    (pid:20276) GET_JOB_CONNECT_INFO failed: Job 12.2 is not running.
    (pid:20277) Number of Active Workers 2
    (pid:20277) GET_JOB_CONNECT_INFO failed: Job 12.0 is not running.

The jobs eventually run but they take much longer than they should to start (sometimes over 30 minutes). I checked the logs. I don't notice any errors in the local logs and nothing appears on the master. It's only a two node cluster with the remote one being the master. There are no other jobs in the global queue.

'condor_q -global' on master reports:

    All queues are empty

condor_config.local on both nodes:

    START = TRUE
    SUSPEND = FALSE
    PREEMPT = FALSE
    KILL = FALSE

'condor_q -analyze' reports:

    012.000: Request has not yet been considered by the matchmaker.
From SchedLog on master every five minutes:

    -------- Begin starting jobs --------
    -------- Done starting jobs --------
    Getting monitoring info for pid 66690
    JobsRunning = 0
    JobsIdle = 0
    JobsHeld = 0
    JobsRemoved = 0
    LocalUniverseJobsRunning = 0
    LocalUniverseJobsIdle = 0
    SchedUniverseJobsRunning = 0
    SchedUniverseJobsIdle = 0

It looks like schedd is taking a while to send the job to the master but I don't see any reasons why. Any help would be greatly appreciated.

Thanks,
Josh
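
For anyone wanting to quantify the symptom described in this thread, here is a minimal sketch that counts "Now in new log file" markers and tallies lines per pid in a SchedLog excerpt. It assumes every line follows the `MM/DD/YY HH:MM:SS (pid:N) message` layout seen in the excerpt quoted in the message; the function name `summarize_schedlog` and the `sample` list are hypothetical and not part of HTCondor.

```python
import re
from collections import Counter

# Assumed line layout, copied from the SchedLog excerpt in the message:
#   06/12/14 13:47:29 (pid:8249) message
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")


def summarize_schedlog(lines):
    """Return (rotation_count, lines_per_pid) for an iterable of log lines.

    Hypothetical helper: it only recognises the layout assumed above and
    silently skips anything else.
    """
    rotations = 0
    per_pid = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # line does not follow the assumed layout
        _timestamp, pid, message = match.groups()
        per_pid[pid] += 1
        if message.startswith("Now in new log file"):
            rotations += 1
    return rotations, per_pid


# The three lines quoted in the message as "the entire SchedLog".
sample = [
    "06/12/14 13:47:29 (pid:8249) Now in new log file /usr/local/condor/local.workstation1/log/SchedLog",
    "06/12/14 13:47:29 (pid:8249) Number of Active Workers 2",
    "06/12/14 13:47:29 (pid:8249) GET_JOB_CONNECT_INFO failed: Job 12.1 is not running.",
]

rotations, per_pid = summarize_schedlog(sample)
print(rotations, dict(per_pid))
```

Running it over the real file (for example by passing an open file object instead of `sample`) would show how many rotation markers the log contains and how many distinct pids are writing to it. It does not explain why either happens.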