
Re: [Condor-users] All the other machines except central manager don't work!!



Thanks, Matt.

I checked the condor_config files and found that I had already set it.

http://physics.majimak.com/config_files.tgz

The link above contains the condor_config files of the central manager and one of the clients.

Thanks for your help.

2010/1/14 Matthew Farrellee <matt@xxxxxxxxxx>
How about TRUST_UID_DOMAIN = TRUE?
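
For context: UID_DOMAIN, TRUST_UID_DOMAIN and SOFT_UID_DOMAIN are Condor configuration macros, and the value a machine actually uses can be printed with condor_config_val (for example, condor_config_val TRUST_UID_DOMAIN). A rough Python sketch of comparing these settings between machines follows. The function names (parse_config, uid_mismatches) and the sample config text are made up for illustration. It only reads plain NAME = value lines, does not expand macros such as $(FULL_HOSTNAME), and assumes values can be compared case-insensitively.

```python
# Hypothetical helper, not part of Condor: compare UID-related settings
# taken from the config text of several machines.

UID_KEYS = ("UID_DOMAIN", "TRUST_UID_DOMAIN", "SOFT_UID_DOMAIN")


def parse_config(text):
    """Parse simple 'NAME = value' lines; later definitions override earlier ones."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Macro names are treated as case-insensitive, so normalize to upper case.
        settings[name.strip().upper()] = value.strip()
    return settings


def uid_mismatches(configs):
    """configs maps host name -> config text.

    Returns {macro: {host: value}} for every UID-related macro whose value
    differs between hosts (a missing macro is reported as None).
    """
    parsed = {host: parse_config(text) for host, text in configs.items()}
    mismatches = {}
    for key in UID_KEYS:
        values = {host: cfg.get(key) for host, cfg in parsed.items()}
        # Assumption: values are compared case-insensitively.
        distinct = {v.lower() if v is not None else None for v in values.values()}
        if len(distinct) > 1:
            mismatches[key] = values
    return mismatches


if __name__ == "__main__":
    # Invented sample contents; the real files are in the tarball linked above.
    manager = "UID_DOMAIN = 192.168.0.109\nSOFT_UID_DOMAIN = TRUE\nTRUST_UID_DOMAIN = TRUE\n"
    client = "UID_DOMAIN = 192.168.0.109\nSOFT_UID_DOMAIN = TRUE\n"
    for key, values in uid_mismatches({"pheko09": manager, "pheko05": client}).items():
        print(key, values)
```

An empty result only means the three macros agree textually between the files that were compared. It does not show that the running daemons have picked up the change.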

Best,


matt
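
The "errno = 13: Permission denied" failure in the starter log quoted below can also be checked directly on an execute machine. Here is a minimal Python sketch that reports the mode and owner of the execute directory and tries to create a scratch file in it. The helper names (describe_dir, can_create_file) are made up for illustration, and the result only says something about the user the script runs as, which may not be the user the condor_starter writes as.

```python
import os
import stat
import sys
import tempfile


def describe_dir(path):
    """Return the mode string, owner ids and world-writable flag of a directory."""
    st = os.stat(path)
    return {
        "mode": stat.filemode(st.st_mode),
        "uid": st.st_uid,
        "gid": st.st_gid,
        "world_writable": bool(st.st_mode & stat.S_IWOTH),
    }


def can_create_file(path):
    """Try to create (and automatically delete) a scratch file as the current user."""
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        # Covers permission errors (errno 13) as well as a missing directory.
        return False


if __name__ == "__main__":
    # Default path taken from the log in this thread; pass another path as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "/home/condor/execute"
    if os.path.isdir(path):
        print(describe_dir(path))
        print("writable by current user:", can_create_file(path))
    else:
        print(path, "is not a directory on this machine")
```

Running it once as root and once as the condor user on a client would show whether the two accounts see the directory differently, which is what one would expect with root squashing.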

On 01/13/2010 12:49 PM, Genie Jhang wrote:
> Is there anyone who can help me?
>
> Our machines still don't work, except for the central machine.
>
> The central machine submits jobs to the clients, but the clients cannot run them.
>
> Which files do you need to see to find the problem?
>
> Please help me.
>
> 2010/1/13 Genie Jhang <geniejhang@xxxxxxxxxxx>
>
>     Thanks for your reply, Dan.
>
>     As you said, I changed the permissions of the directory
>     /home/condor/execute to 777 on all machines.
>
>     And I don't use NFS.
>
>     Now I'm getting this kind of error:
>
>     --------------------------------------------------
>
>     022 (218.000.000) 01/13 18:03:15 Job disconnected, attempting to reconnect
>         Socket between submit and execute hosts closed unexpectedly
>         Trying to reconnect to slot3@pheko05 <192.168.0.105:33682>
>     ...
>     024 (218.000.000) 01/13 18:03:15 Job reconnection failed
>         Job not found at execution machine
>         Can not reconnect to slot3@pheko05, rescheduling job
>
>     -------------------------------------------------------------
>
>     I set:
>     UID_DOMAIN = 192.168.0.109
>     FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)
>     USE_NFS = False
>     SOFT_UID_DOMAIN = TRUE
>
>
>
>     2010/1/13 Dan Bradley <dan@xxxxxxxxxxxx>
>
>         Genie,
>
>         Is your condor execute directory on NFS with root squashing?
>          The following line is what makes me guess that it might be:
>
>
>             01/13 06:32:30 get_file(): Failed to open file
>             /home/condor/execute/dir_22496/condor_exec.exe, errno = 13:
>             Permission denied.
>
>
>         If EXECUTE is on an NFS mount with root squashing, then it needs
>         to be world-writable.
>
>         --Dan
>
>
>         Genie Jhang wrote:
>
>             Hello, again.
>             Thanks to all of you, I succeeded in getting Condor running and
>             connecting all the machines our lab has.
>             But when I finally tried to submit jobs to the machines, I
>             found that none of the machines except the central manager
>             work!!
>             So I dug through the log files.
>             Here's the log.
>              ----------------------------------------------------------------------------------------------------------------------------------
>             01/13 06:32:30 ******************************************************
>             01/13 06:32:30 ** condor_starter (CONDOR_STARTER) STARTING UP
>             01/13 06:32:30 ** /condor/sbin/condor_starter
>             01/13 06:32:30 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
>             01/13 06:32:30 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
>             01/13 06:32:30 ** $CondorVersion: 7.4.1 Dec 17 2009 BuildID: 204351 $
>             01/13 06:32:30 ** $CondorPlatform: I386-LINUX_RHEL3 $
>             01/13 06:32:30 ** PID = 22496
>             01/13 06:32:30 ** Log last touched time unavailable (No such file or directory)
>             01/13 06:32:30 ******************************************************
>             01/13 06:32:30 Using config source: /condor/etc/condor_config
>             01/13 06:32:30 Using local config sources:
>             01/13 06:32:30    /home/condor/condor_config.local
>             01/13 06:32:30 DaemonCore: Command Socket at <192.168.0.105:33714>
>             01/13 06:32:30 Done setting resource limits
>             01/13 06:32:30 Communicating with shadow <192.168.0.109:55237>
>             01/13 06:32:30 Submitting machine is "pheko09"
>             01/13 06:32:30 setting the orig job name in starter
>             01/13 06:32:30 setting the orig job iwd in starter
>             01/13 06:32:30 get_file(): Failed to open file /home/condor/execute/dir_22496/condor_exec.exe, errno = 13: Permission denied.
>             01/13 06:32:30 get_file(): consumed 28023 bytes of file transmission
>             01/13 06:32:30 DoDownload: consuming rest of transfer and failing after encountering the following error: STARTER at 192.168.0.105 failed to write to file /home/condor/execute/dir_22496/condor_exec.exe: (errno 13) Permission denied
>             01/13 06:32:30 WARNING: File /home/condor/execute/dir_22496/condor_exec.exe can not be accessed by Quill file transfer tracking.
>             01/13 06:32:30 File transfer failed (status=0).
>             01/13 06:32:30 ERROR "Failed to transfer files" at line 1882 in file jic_shadow.cpp
>             01/13 06:32:30 ShutdownFast all jobs.
>              ------------------------------------------------------------------------------------------------------------------------------------
>             What on earth is the problem?
>             I set ALLOW_WRITE = * in the condor_config file on all the
>             machines.
>             ------------------------------------------------------------------------
>
>             _______________________________________________
>             Condor-users mailing list
>             To unsubscribe, send a message to
>             condor-users-request@xxxxxxxxxxx with a
>             subject: Unsubscribe
>             You can also unsubscribe by visiting
>             https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
>             The archives can be found at:
>             https://lists.cs.wisc.edu/archive/condor-users/