
Re: [Condor-users] Parallel universe/MPI issues when upgrading 7.2->7.4



When I said that the directory had permissions of 0644, I actually meant 0755. Apologies if I misled anyone, but the point stands: the jobs don't run until that spool directory is set to 1777.
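
For anyone comparing the two modes, here is a minimal Python sketch of what 0755 and 1777 allow on a directory. It only decodes the permission bits. It does not explain why the 7.4 shadow needs the wider mode, and the helper names (describe, others_can_create) are made up for illustration.

```python
import stat

def describe(mode):
    """Return the ls-style string for a directory with the given mode bits."""
    return stat.filemode(stat.S_IFDIR | mode)

def others_can_create(mode):
    """Creating an entry in a directory needs both write and search (execute)
    permission; this checks those bits for users who are neither owner nor group."""
    return bool(mode & stat.S_IWOTH) and bool(mode & stat.S_IXOTH)

for mode in (0o755, 0o1777):
    print(
        oct(mode),
        describe(mode),
        "others can create entries:", others_can_create(mode),
        "sticky bit:", bool(mode & stat.S_ISVTX),
    )
```

With 0755 only the directory's owner can create subdirectories, so a mkdir attempted by any other user fails with "Permission denied". With 1777 anyone can create entries, and the sticky bit limits deletion to each entry's owner (as on /tmp).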

m

On 07/10/2010 11:55, Mark Calleja wrote:
OK, I can get it to work as expected under v7.4.3 if I change the permissions on Condor's spool directory on the submit host from 0644 to 1777. However, under v7.2 it worked fine with perms of just 0644, so why do we now need these less secure settings?

m

On 06/10/2010 11:27, Mark Calleja wrote:
Hi,

Our users have come across a problem for MPI jobs running under the parallel universe when upgrading from 7.2.5 to 7.4.3, and though we have found a workaround (mentioned below), it would be great if we can identify a proper fix.

The issue is that jobs using the "usual" MPI wrapper script (e.g. mp1script) now fail with the following:

In stdout:

error 0 chirp putting identity keys back

In stderr:

Can't chirp_client_open /home/condor/spool/cluster55247.proc0.subproc0/0.key:-1

Looking in the ShadowLog, it seems that a new permissions problem rears its head:

09/13 10:48:29 (55247.0) (30445): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <172.24.89.94:9696> was ACCEPTED
09/13 10:48:29 (55247.0) (30445): FileTransfer::Init(): mkdir(/home/condor/spool/cluster55247.proc0.subproc0) failed, Permission denied (errno: 13)
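
To check whether other jobs hit the same failure, something like the following Python sketch could scan ShadowLog lines for the mkdir error. The regular expression is inferred only from the single log line quoted above, so treat the pattern and the function name find_mkdir_failures as assumptions rather than a description of Condor's log format.

```python
import re

# Pattern inferred from the one ShadowLog line quoted above (an assumption,
# not a documented format).
MKDIR_FAILURE = re.compile(
    r"\((?P<job>\d+\.\d+)\) \(\d+\): FileTransfer::Init\(\): "
    r"mkdir\((?P<path>[^)]+)\) failed, (?P<reason>.+?) \(errno: (?P<errno>\d+)\)"
)

def find_mkdir_failures(lines):
    """Return (job id, directory, errno) for each mkdir failure found."""
    failures = []
    for line in lines:
        match = MKDIR_FAILURE.search(line)
        if match:
            failures.append(
                (match.group("job"), match.group("path"), int(match.group("errno")))
            )
    return failures

sample = [
    "09/13 10:48:29 (55247.0) (30445): FileTransfer::Init(): "
    "mkdir(/home/condor/spool/cluster55247.proc0.subproc0) failed, "
    "Permission denied (errno: 13)",
]

for job, path, err in find_mkdir_failures(sample):
    print(job, path, err)
```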

We have found that we can get around the issue by spooling the data on submission, i.e. via "condor_submit -spool", then retrieving the data on completion via condor_transfer_data, and finally removing the job from the queue manually with condor_rm. This new behaviour is perplexing, as no configuration changes were made to the hosts during the upgrade.
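
The workaround sequence can be sketched in Python as below. This only builds and prints the three command lines; it does not run them. The submit file name mpi_job.sub and the helper workaround_commands are hypothetical, and the cluster id is the one condor_submit reports for the job. The sketch also assumes that condor_transfer_data and condor_rm are given that cluster id as their argument.

```python
def workaround_commands(submit_file, cluster_id):
    """Build the three commands of the spooling workaround, in order.

    The second and third commands are meant to be run by hand once the
    job has completed; nothing here waits for completion.
    """
    job = str(cluster_id)
    return [
        ["condor_submit", "-spool", submit_file],  # spool input at submission
        ["condor_transfer_data", job],             # fetch output after completion
        ["condor_rm", job],                        # remove the finished job
    ]

# Hypothetical submit file name; 55247 is the cluster id from the log above.
for command in workaround_commands("mpi_job.sub", 55247):
    print(" ".join(command))
```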

Have we missed something necessary in the upgrade? From the release notes I can't discern any such new requirement, and having to remember to manually retrieve output and remove completed jobs from the queue is a pain in the unmentionables.

Best regards,
Mark 

-- 
The Cavendish Laboratory, University of Cambridge,
J J Thomson Avenue, Cambridge, CB3 0HE, UK
Tel. (+44/0) 1223 746627
http://www.escience.cam.ac.uk/~mcal00