Over the last week we've had two instances of the Condor daemons on a machine going down for no apparent reason. These were two different machines, but both were "submit only" (condor_master and condor_schedd) machines. I'm hoping someone could take a quick look at my log file and see if there's anything here that would help with diagnosis.

The repeated entries saying

9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.

worry me, but I don't know what they mean. I've confirmed that the pid doesn't exist, but I don't know why Condor is looking for it.

One other potential item (although it might be a red herring) is that the condor_config and local config files are in the ~condor directory on an NFS-mounted partition, and we've had occasional trouble with that mount failing on us. But normally that gives a pretty obvious error message, and we aren't getting anything here. However, if there's a "cd ~condor" command or equivalent in the code somewhere, that could be a problem, since you can't cd to ~condor on our systems. You can 'cd /home/condor' and 'ls ~condor', but 'cd ~condor' is disabled (I don't know why).

Finally, is there a way to ensure that we get notified when the condor_master daemon goes down? I have PUBLISH_OBITUARIES set to True and OBITUARY_LOG_LENGTH set to 20, but I'm not getting any emails at the ADMIN address at all when these issues occur.

Thanks in advance for any help. I'm stumped, so anything at all would be appreciated.

-Colin

System information:
Condor version 6.7.14
Red Hat Enterprise Linux 4 (Linux 2.6.9-5.0.5.ELsmp) on the two machines that went down
Red Hat Enterprise Linux 3 (Linux 2.4.21-32.0.1.ELsmp) on the central_master

9/25 17:13:25 Getting monitoring info for pid 2796
9/25 17:13:38 enter Daemons::UpdateCollector
9/25 17:13:38 Trying to update collector <xx.xx.xx.xx:9618>
9/25 17:13:38 Attempting to send update via UDP to collector mnappmb00.fairisaac.com <xx.xx.xx.xx:9618>
9/25 17:13:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
9/25 17:13:38 exit Daemons::UpdateCollector
9/25 17:13:38 enter Daemons::CheckForNewExecutable
9/25 17:13:38 Time stamp of running /opt/condor/sbin/condor_master: 1139286549
9/25 17:13:38 GetTimeStamp returned: 1139286549
9/25 17:13:38 Time stamp of running /opt/condor/sbin/condor_schedd: 1139286549
9/25 17:13:38 GetTimeStamp returned: 1139286549
9/25 17:13:38 exit Daemons::CheckForNewExecutable
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::getProcInfo() pid 8109 does not exist.
9/25 17:13:58 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:14:58 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:15:59 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
9/25 17:16:59 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:17:25 Getting monitoring info for pid 2796
9/25 17:17:59 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:18:38 enter Daemons::UpdateCollector
9/25 17:18:38 Trying to update collector <xx.xx.xx.xx:9618>
9/25 17:18:38 Attempting to send update via UDP to collector mnappmb00.fairisaac.com <xx.xx.xx.xx:9618>
9/25 17:18:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
9/25 17:18:38 exit Daemons::UpdateCollector
9/25 17:18:38 enter Daemons::CheckForNewExecutable
9/25 17:18:38 Time stamp of running /opt/condor/sbin/condor_master: 1139286549
9/25 17:18:38 GetTimeStamp returned: 1139286549
9/25 17:18:38 Time stamp of running /opt/condor/sbin/condor_schedd: 1139286549
9/25 17:18:38 GetTimeStamp returned: 1139286549
9/25 17:18:38 exit Daemons::CheckForNewExecutable
9/25 17:18:59 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:20:00 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:21:00 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:21:25 Getting monitoring info for pid 2796
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21664 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::getProcInfo() pid 21666 does not exist.
9/25 17:22:00 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:22:24 DaemonCore: Command received via UDP from host <xx.xx.xx.xx:33012>
9/25 17:22:24 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
9/25 17:23:01 ProcAPI::buildFamily() Found daddypid on the system: 2799
9/25 17:23:38 enter Daemons::UpdateCollector
9/25 17:23:38 Trying to update collector <xx.xx.xx.xx:9618>
9/25 17:23:38 Attempting to send update via UDP to collector mnappmb00.fairisaac.com <xx.xx.xx.xx:9618>
9/25 17:23:38 SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
9/25 17:23:38 exit Daemons::UpdateCollector
9/25 17:23:38 enter Daemons::CheckForNewExecutable
9/25 17:23:38 Time stamp of running /opt/condor/sbin/condor_master: 1139286549
9/25 17:23:38 GetTimeStamp returned: 1139286549
9/25 17:23:38 Time stamp of running /opt/condor/sbin/condor_schedd: 1139286549
9/25 17:23:38 GetTimeStamp returned: 1139286549
9/25 17:23:38 exit Daemons::CheckForNewExecutable
9/26 12:28:03 NET_REMAP_ENABLE is undefined, using default value of False
9/26 12:28:03 NET_REMAP_ENABLE is undefined, using default value of False
9/26 12:28:03 PASSWD_CACHE_REFRESH is undefined, using default value of 300
9/26 12:28:03 ******************************************************
9/26 12:28:03 ** condor_master (CONDOR_MASTER) STARTING UP
9/26 12:28:03 ** /opt/condor/sbin/condor_master
9/26 12:28:03 ** $CondorVersion: 6.7.14 Dec 13 2005 $
9/26 12:28:03 ** $CondorPlatform: I386-LINUX_RH9 $
9/26 12:28:03 ** PID = 32758
9/26 12:28:03 ******************************************************
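
In case it helps anyone compare this with their own logs, here is a rough sketch of how the repeated "does not exist" entries above could be tallied per pid and timestamp. It is only an illustration: the function name tally_missing_pids is made up for this example, and the regular expression is based solely on the lines in this excerpt, so it may not match what other Condor versions print.

```python
import re
from collections import Counter

# Pattern taken from the excerpt above, e.g.:
#   9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.
# Assumption: other Condor versions may word this message differently.
MISSING_PID = re.compile(
    r"(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"ProcAPI::getProcInfo\(\) pid (\d+) does not exist\."
)


def tally_missing_pids(log_text):
    """Count 'pid N does not exist' entries per (timestamp, pid) pair.

    Uses findall over the whole text, so it works whether the log has
    one entry per line or has been run together into long lines.
    """
    return Counter(MISSING_PID.findall(log_text))


if __name__ == "__main__":
    # A few entries copied from the log above.
    sample = (
        "9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.\n"
        "9/25 17:16:59 ProcAPI::getProcInfo() pid 13414 does not exist.\n"
        "9/25 17:16:59 ProcAPI::getProcInfo() pid 13416 does not exist.\n"
        "9/25 17:16:59 ProcAPI::buildFamily() Found daddypid on the system: 2799\n"
    )
    for (stamp, pid), count in sorted(tally_missing_pids(sample).items()):
        print(f"{stamp}  pid {pid}: {count} lookups failed")
```

To run it against a real log, read the file into a string and pass it to tally_missing_pids in place of the sample text.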