
RE: [Condor-users] Bug



Dear Group

Earlier this morning we successfully built a Master node and 2
slave nodes, all running condor.

We then applied the hardening process to the Master node.

We issued a simple condor submit job to confirm that the cluster still
functioned correctly after performing the hardening of the Master node.

We then progressed to securing the first Slave node.  After applying the
security hardening procedure to the first node we ran some basic tests
(i.e. submitted test condor jobs) to establish that the cluster still
worked.

Following this we took a system image of the secured slave node and
used it to clone the second slave node.  Once this had been completed we
performed the same basic condor submit job tests that had previously
confirmed the cluster to be working.  Unfortunately this time the test
jobs did not run.  It appears that the reason for this is that not all
of the relevant condor processes start on the slave nodes.

So far we have not been able to establish the underlying cause of the
problem, despite working on it for the last half hour.

None of the log files in /home/condor/log on the slave or master nodes
has given us any indication of what the problem might be.  

The main symptom of the problem is that [condor_master] is the only
process that starts, i.e. [condor_schedd] and [condor_startd] do not
start.  If we issue a condor_status command on either slave node or on
the master, the only system that shows as registered is the master node
(occasionally the condor_status command returns nothing after a timeout
period of several minutes).
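
To make the symptom easier to reproduce on each node, a check along the
following lines could be run.  This is only a minimal sketch: the helper
names (missing_daemons, running_process_names) and the EXPECTED list are
our own assumptions and are not part of Condor, and it assumes a POSIX
ps that accepts "-e -o comm=".

```python
import subprocess

# Daemons we expect to see on a node (assumption: taken from the symptom
# described above; adjust the tuple to match each node's role).
EXPECTED = ("condor_master", "condor_schedd", "condor_startd")


def missing_daemons(process_names, expected=EXPECTED):
    """Return the expected daemon names that are absent from process_names."""
    running = {name.strip() for name in process_names if name.strip()}
    return [daemon for daemon in expected if daemon not in running]


def running_process_names():
    """Return the command name of every process, one per list entry."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    missing = missing_daemons(running_process_names())
    print("missing:", ", ".join(missing) if missing else "none")
```

On a node showing the problem described above we would expect this to
list condor_schedd and condor_startd as missing.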

Everything worked before node-02 was built by SystemImager.  That build
took place after the hardening process, and node-01's drive was not in
the system when the image was copied to node-02's drive.  Given that
node-01 has not been changed, how is it that it has stopped functioning
correctly?  We are confused.

I would like to know whether you think or know that any of the
SystemImager processes used when building node-02 could have changed the
Master Node in any way, as this may give us somewhere to investigate
further.