Hi,
We're using Condor to run long jobs on 15 Macintosh G5 machines.
Our "vanilla" configuration:
- Central manager: Xserve G4, username = condor
- Submit machine: the same Xserve G4, under a different username = submit
- Execution machines: G5
We run two condor_master processes on the same machine (one to manage, one to submit), under two different usernames. Can this configuration cause problems?
We have two different problems:
1- After a few hours, all the execution machines stop their jobs: a communication error occurs between the condor_starter and the condor_master (on the Macintosh Xserve):
Cluster01 crashdump: Unable to determine CPSProcessSerNum pid: 11913 name: condor_starter
and in the Shadow log, we have:
ERROR "Can no longer talk to condor_starter on execute machine (192.168.1.23)" at line 63 in file NTreceivers.C
The problem occurs with both Condor 6.6.6 and Condor 6.6.7…
2- After a few hours, the central manager and the execution machines stop communicating, but the submit machine keeps tracking the jobs. condor_q shows the jobs in "R" status even though condor_status indicates that communication has stopped.
Then, when we launch condor_master on the central manager, condor_status becomes normal again, that is, the execution machines show "busy" status. Is this normal for a vanilla configuration?
After 2 or 3 days, we hit either problem 1 or problem 2!
Does anyone have an idea?
Thank you for your help.
Damien
Damien AUTRET:
Unité INSERM 601
Département de Recherche en ImmunoCancérologie
Equipe 6 Biophysique-Cancérologie
9 Quai Moncousu
44093 Nantes Cedex
Tél: 02.40.41.28.21
Fax: 02.40.35.66.97
Sec: 02.40.08.47.47