
Re: [Condor-users] Can't find address of local schedd



Hello,

I looked at the /var/opt/condor/spool directory. Here are its contents:

# ls -all
total 3508704
drwxr-xr-x 3 condor condor      4096 Apr 13 14:35 .
drwxr-xr-x 5 condor condor      4096 Dec 15 11:15 ..
-rw------- 1 condor condor    248004 Apr 13 14:41 Accountantnew.log
-rwxr-xr-x 1 condor condor   2077155 Apr 13 13:45 cluster15.ickpt.subproc0
-rwxr-xr-x 1 condor condor   2077155 Apr 13 08:50 cluster8.ickpt.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:43 cluster8.proc0.subproc0
-rw-r--r-- 1 condor condor   2322432 Apr 13 14:34 cluster8.proc0.subproc0.tmp
-rw-r--r-- 1 condor condor 277414943 Apr 13 12:09 cluster8.proc1.subproc0
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:33 cluster8.proc2.subproc0
-rw-r--r-- 1 condor condor 277419039 Apr 13 12:02 cluster8.proc4.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:43 cluster8.proc5.subproc0
-rwxr-xr-x 1 condor condor   2077155 Apr 13 09:07 cluster9.ickpt.subproc0
-rw-r--r-- 1 condor condor 101482496 Apr 13 12:24 cluster9.proc0.subproc0.tmp
-rw-r--r-- 1 condor condor 277410847 Apr 13 11:48 cluster9.proc10.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:58 cluster9.proc14.subproc0
-rw-r--r-- 1 condor condor 277410847 Apr 13 11:48 cluster9.proc15.subproc0
-rw-r--r-- 1 condor condor   1024000 Apr 13 14:29 cluster9.proc15.subproc0.tmp
-rw-r--r-- 1 condor condor  43974656 Apr 13 12:24 cluster9.proc16.subproc0.tmp
-rw-r--r-- 1 condor condor  16863232 Apr 13 12:33 cluster9.proc17.subproc0.tmp
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:48 cluster9.proc1.subproc0
-rw-r--r-- 1 condor condor  77766656 Apr 13 12:33 cluster9.proc2.subproc0.tmp
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:58 cluster9.proc4.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:48 cluster9.proc6.subproc0
-rw-r--r-- 1 condor condor   9547776 Apr 13 12:33 cluster9.proc7.subproc0.tmp
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:58 cluster9.proc9.subproc0
-rw-r--r-- 1 condor condor    218377 Apr 13 13:45 history
-rw------- 2 condor condor    262144 Apr 13 14:34 job_queue.log
-rw------- 2 condor condor    262144 Apr 13 14:34 job_queue.log.4
-rw------- 1 condor condor         0 Apr 13 14:35 job_queue.log.tmp
drwxrwxrwt 2 condor condor      4096 Dec 15 11:15 local_univ_execute


As can be seen, there are many files named clusterN.procM.subproc0,
each of which is huge (277 MB). The directory's contents amount to about
3.5 GB, while the /var partition is only 3.8 GB (the default Rocks
installation). So the spool directory is consuming all the room in /var.
What is the content of the clusterN.procM.subproc0 files? How can I
prevent them from growing so much? Is it safe to erase them?
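
If these turn out to be checkpoint images left behind by the standard
universe jobs (which is my guess), here is a rough sketch of what I am
thinking of trying; the target path on /state/partition1 is only an
assumption on my part and is untested:

# condor_config_val SPOOL
# du -sh /var/opt/condor/spool
# mkdir -p /state/partition1/condor/spool
# chown condor:condor /state/partition1/condor/spool

and then point SPOOL at /state/partition1/condor/spool in the local
Condor configuration, copying the existing spool contents over before
restarting the daemons.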

Thanks in advance

Marcelo


2009/4/14 Marcelo Chiapparini <marcelo.chiappa@xxxxxxxxx>:
> Hi Rob,
>
> Bingo! You were right:
>
> # df
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sda1             15872604   4889488  10163804  33% /
> /dev/sda5            828959588   2753132 783418536   1% /state/partition1
> /dev/sda2              3968124   3831872         0 100% /var
> tmpfs                  4087108         0   4087108   0% /dev/shm
> tmpfs                  1995656      4992   1990664   1% /var/lib/ganglia/rrds
>
> /var is full!
>
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sda2              3968124   3831872         0 100% /var
>
>
> Now I have to figure out the reason, fix it, and prevent it from
> happening again. The user is compiling his programs with
> condor_compile and submitting them to the standard universe. Maybe
> /var is full with his checkpoint images? If not, any help will be
> very welcome!
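>
> To see exactly what is consuming /var, I will try something like the
> following (a rough sketch; the Condor paths are what I expect on this
> Rocks install):
>
> # du -sk /var/* | sort -n | tail
> # du -sh /var/opt/condor/*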
>
> Regards
>
> Marcelo
>
> PS: I want to thank everyone on this marvelous list for all their support!
>
>
> 2009/4/14 Robert Futrick <rfutrick@xxxxxxxxxxxxxxxxxx>:
>> Hello Marcelo,
>>
>> Based on what you've written, it sounds like you're experiencing case #1 in
>> Jason's email.  Your daemons are configured to run on the correct server,
>> but stopped running suddenly and now will not start again.
>>
>> Considering you didn't make any other changes, and the sudden nature of the
>> stop, you might be out of disk space.  That's a common cause of daemons
>> stopping logging mid-logline. Another option is that permissions or
>> something else changed to prevent Condor from writing to that directory.
>>
>> Try running "df" on the /var/opt/condor/log directory to make sure you
>> have disk space. Being out of disk space is not the only reason Condor
>> could have stopped working, but it is a good initial check.
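>>
>> For example, something along these lines (just a sketch, assuming
>> condor_config_val is in root's PATH so the backticks resolve the LOG
>> directory):
>>
>> $ df -h `condor_config_val LOG`
>> $ ls -ld `condor_config_val LOG`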
>>
>> Regards,
>> Rob
>>
>> Marcelo Chiapparini wrote:
>>
>> Jason,
>>
>> thank you for the help. Below are the results of following your advice:
>>
>> 2009/4/14 Jason Stowe <jstowe@xxxxxxxxxxxxxxxxxx>:
>>
>>
>> Marcelo,
>> The errors you are getting could be caused by a few problems, so below
>> is a more detailed process to help you debug this:
>>
>>
>> $ condor_status
>> CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
>> Error: Couldn't contact the condor_collector on cluster-name.domain
>>
>> Extra Info: the condor_collector is a process that runs on the central
>>
>>
>> ...
>>
>>
>> responding. Also see the Troubleshooting section of the manual.
>>
>>
>> This error indicates that the condor_status command couldn't
>> communicate with the collector. This most likely means:
>> (1) the collector (and the condor_master/other daemons) isn't running
>> on the central manager,
>> (2) the collector is running, but not on the server the command thinks
>> it is, or
>> (3) the collector is running where condor_status thinks it is, but
>> condor_status doesn't have permission to talk with it.
>>
>> To rule out #1, on the central manager of the pool, after you run
>> condor_master on the head node for the cluster, what do you get when
>> you run:
>> $ ps -ef | grep condor
>> Does the condor_master/condor_collector show up here?
>>
>>
>> No. The daemons are not running on the central node:
>>
>> # condor_master
>> # ps -ef | grep condor
>> root     25980 15002  0 09:41 pts/1    00:00:00 grep condor
>>
>>
>>
>> This should tell you the directory the log files are located in:
>> $ condor_config_val -config -verbose LOG
>>
>>
>> I found them! They are in /var/opt/condor/log. Thanks!
>>
>>
>>
>> To check for option #2, determine where the collector should be by running:
>> condor_config_val -verbose COLLECTOR_HOST
>>
>>
>> # condor_config_val -verbose COLLECTOR_HOST
>> COLLECTOR_HOST: lacad-dft.fis.uerj.br
>>
>>
>>
>> Does this match the machine you expect to be the central manager?
>>
>>
>> Yes!
>>
>>
>>
>> For situation #3, do you get permission denied errors in the logfiles?
>> Checking the HOSTALLOW_READ settings on the central manager will be
>> the next step:
>> http://www.cs.wisc.edu/condor/manual/v7.2/3_6Security.html#sec:Host-Security
>>
>>
>> # condor_config_val -verbose HOSTALLOW_READ
>> HOSTALLOW_READ: *
>>   Defined in '/opt/condor/etc/condor_config', line 209.
>>
>>
>> Looking at the CollectorLog file, it is clear that something happened
>> at 14:42:01, because the write to this log was interrupted in the
>> middle of a sentence. See the last lines of the CollectorLog:
>>
>> <snip>
>> 4/13 14:40:22 NegotiatorAd  : Inserting ** "< lacad-dft.fis.uerj.br >"
>> 4/13 14:41:55 (Sending 84 ads in response to query)
>> 4/13 14:41:55 Got QUERY_STARTD_PVT_ADS
>> 4/13 14:41:55 (Sending 64 ads in response to query)
>> 4/13 14:42:01 Got QUERY
>>
>> and nothing more was written after that. This was yesterday, when
>> Condor stopped working.
>> Looking at the MasterLog file, we find the same thing. Again, things were
>> interrupted abruptly at 14:42:14. (Sorry for the long log, but I want
>> to give a good idea of what happened...)
>>
>> <snip>
>> 4/10 10:50:18 Preen pid is 10018
>> 4/10 10:50:18 Child 10018 died, but not a daemon -- Ignored
>> 4/11 10:50:18 Preen pid is 12156
>> 4/11 10:50:18 Child 12156 died, but not a daemon -- Ignored
>> 4/12 10:50:18 Preen pid is 10655
>> 4/12 10:50:18 Child 10655 died, but not a daemon -- Ignored
>> 4/13 10:50:18 Preen pid is 18824
>> 4/13 10:50:18 Child 18824 died, but not a daemon -- Ignored
>> 4/13 14:34:51 The SCHEDD (pid 4063) exited with status 4
>> 4/13 14:34:51 Sending obituary for "/opt/condor/sbin/condor_schedd"
>> 4/13 14:34:51 restarting /opt/condor/sbin/condor_schedd in 10 seconds
>> 4/13 14:35:01 Started DaemonCore process
>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20203
>> 4/13 14:35:01 The SCHEDD (pid 20203) exited with status 4
>> 4/13 14:35:01 Sending obituary for "/opt/condor/sbin/condor_schedd"
>> 4/13 14:35:01 restarting /opt/condor/sbin/condor_schedd in 11 seconds
>> 4/13 14:35:12 Started DaemonCore process
>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20210
>> 4/13 14:35:12 The SCHEDD (pid 20210) exited with status 44
>> 4/13 14:35:12 Sending obituary for "/opt/condor/sbin/condor_schedd"
>> 4/13 14:35:12 restarting /opt/condor/sbin/condor_schedd in 13 seconds
>> 4/13 14:35:25 Started DaemonCore process
>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20214
>> 4/13 14:35:25 The SCHEDD (pid 20214) exited with status 44
>> 4/13 14:35:25 Sending obituary for "/opt/condor/sbin/condor_schedd"
>> 4/13 14:35:25 restarting /opt/condor/sbin/condor_schedd in 17 seconds
>> 4/13 14:35:42 Started DaemonCore process
>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20218
>> 4/13 14:35:42 The SCHEDD (pid 20218) exited with status 44
>> 4/13 14:35:42 restarting /opt/condor/sbin/condor_schedd in 25 seconds
>> 4/13 14:36:07 Started DaemonCore process
>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20219
>> 4/13 14:36:07 The SCHEDD (pid 20219) exited with status 44
>> 4/13 14:36:07 restarting /opt/condor/sbin/condor_schedd in 41 seconds
>> 4/13 14:36:48 Started DaemonCore process
>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20220
>> 4/13 14:36:48 The SCHEDD (pid 20220) exited with status 44
>> 4/13 14:36:48 restarting /opt/condor/sbin/condor_schedd in 73 seconds
>> 4/13 14:38:01 Started DaemonCore process
>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20222
>> 4/13 14:38:01 The SCHEDD (pid 20222) exited with status 44
>> 4/13 14:38:01 restarting /opt/condor/sbin/condor_schedd in 137 seconds
>> 4/13 14:40:18 Started DaemonCore process
>> "/opt/condor/sbin/condor_schedd", pid and pgroup = 20226
>> 4/13 14:40:18 The SCHEDD (pid 20226) exited with status 44
>> 4/13 14:40:18 restarting /opt/condor/sbin/condor_schedd in 265 seconds
>> 4/13 14:42:01 The COLLECTOR (pid 3779) exited with status 44
>> 4/13 14:42:01 Sending obituary for "/opt/condor/sbin/condor_collector"
>> 4/13 14:42:01 restarting /opt/condor/sbin/condor_collector in 10 seconds
>> 4/13 14:42:01 attempt to connect to <152.92.133.74:9618> failed:
>> Connection refused (connect errno = 111).
>> 4/13 14:42:01 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618>
>> failed
>>
>> 4/13 14:42:01 Failed to start non-blocking update to <152.92.133.74:9618>.
>> 4/13 14:42:11 Started DaemonCore process
>> "/opt/condor/sbin/condor_collector", pid and pgroup = 20233
>> 4/13 14:42:14 attempt to connect to <152.92.133.74:9618> failed:
>> Connection refused (connect errno = 111).
>> 4/13 14:42:14 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618>
>> failed
>>
>> 4/13 14:42:14 Failed to start non-blocking update to <152.92.133.74:9618>.
>> 4/13 14:42:14 The COLLECTOR (pid 20233) exited with status 44
>> 4/13 14:42:14 Sending obituary for "/opt/condor/sbin/condor_
>> collector"
>> 4/13 14:42:
>>
>> Is this a physical problem with the hardware? I physically rebooted the
>> cluster today, 4/14, but Condor refuses to run. Nothing has been written
>> to the logs since yesterday, 4/13 at 14:42:14.
>>
>> Any help will be very welcome,
>>
>> Regards
>>
>> Marcelo
>>
>>
>> --
>>
>> ===================================
>> Rob Futrick
>> main: 888.292.5320
>>
>> Cycle Computing, LLC
>> Leader in Condor Grid Solutions
>> Enterprise Condor Support and CycleServer Management Tools
>>
>> http://www.cyclecomputing.com
>> http://www.cyclecloud.com
>>
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/
>>
>>