Hi,

we have an HTCondor cluster for local job submission that relies on a shared filesystem.

***
[root@ettore ~]# condor_schedd -version
$CondorVersion: 8.4.6 Apr 20 2016 BuildID: 364106 $
$CondorPlatform: x86_64_RedHat6 $
[root@ettore ~]#
[root@ettore ~]# condor_config_val -dump FILESYSTEM_DOMAIN
FILESYSTEM_DOMAIN = GPFS
[italiano@ui02 ~]$ condor_config_val -dump FILESYSTEM_DOMAIN
FILESYSTEM_DOMAIN = GPFS
***

Whenever the filesystem experiences high latency while accessing files, for example during a restripe operation, the schedd serving local job submission hangs. In this state a condor_reconfig takes several minutes to be applied. So it seems that slow filesystem performance negatively affects the schedd response time, and sometimes the schedd becomes unresponsive:

***
[root@ettore ~]# condor_q
-- Failed to fetch ads from: <90.147.169.224:38705> : ettore.recas.ba.infn.it
SECMAN:2007:Failed to end classad message.
[root@ettore ~]#
***

Is there a way to preserve schedd functionality in such situations? In the same cluster there are also other schedds serving grid jobs, which are NOT affected by this behaviour.

Thanks in advance for any hint you would like to share.

Ale
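
To correlate the schedd hangs with the filesystem latency, a probe along the following lines could be run periodically on the schedd host (for example from cron). This is only a minimal sketch: the helper name `timed_run` and the 30-second limit are assumptions made for illustration, not part of HTCondor, and the only HTCondor command it calls is a plain `condor_q`.

```python
import subprocess
import time


def timed_run(cmd, timeout_s):
    """Run cmd and return (status, elapsed_seconds).

    status is "ok" (exit code 0), "failed" (non-zero exit code),
    "timeout" (no answer within timeout_s seconds) or "missing"
    (the executable could not be found).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        status = "timeout"
    except FileNotFoundError:
        status = "missing"
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical probe: the 30 s limit is an arbitrary choice.
    status, elapsed = timed_run(["condor_q"], timeout_s=30)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp} condor_q {status} {elapsed:.1f}s")
```

Each run prints one timestamped line, so the output can be appended to a log and compared with the times of the restripe operations.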