Re: [Condor-users] jobs stuck in queue
- Date: Thu, 25 Aug 2011 18:08:35 -0300
- From: Fabricio Cannini <fcannini@xxxxxxxxx>
- Subject: Re: [Condor-users] jobs stuck in queue
On Wednesday, 24 August 2011, at 01:21:37, Koller, Garrett wrote:
> Mr. Cannini,
>
> You're receiving these errors because Condor is trying to be cautious with
> the power you give it. "With great power comes great responsibility."
> Root processes have the power to change their effective user and group IDs
> while they are running. So, even though Condor is being run as root at
> first, Condor only uses that power when it needs it. When Condor is doing
> normal Condor stuff that doesn't need the extra permissions, it changes
> its effective user and group IDs to be 'condor'. That is why when you
> check the Condor processes with ps or top, they almost always are listed
> as being owned by the 'condor' user and group. When Condor needs the
> extra permissions, it changes its effective user ID to be root but then
> changes back to 'condor' when it's done doing the dangerous stuff.
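A minimal sketch (not Condor's actual code) of the effective-UID switching described above, assuming a local 'condor' account and a script started as root:
===============================
#!/usr/bin/env python3
# Illustration only: a daemon started as root drops its effective IDs to
# 'condor' for everyday work and raises them back to root only briefly.
import os
import pwd

condor = pwd.getpwnam("condor")       # the unprivileged service account

# Drop to condor:condor; the real UID stays 0, so we can switch back later.
os.setegid(condor.pw_gid)
os.seteuid(condor.pw_uid)
print("normal work, euid =", os.geteuid())

# Raise privileges only for the step that actually needs them ...
os.seteuid(0)
os.setegid(0)
print("privileged step, euid =", os.geteuid())

# ... and drop them again immediately afterwards.
os.setegid(condor.pw_gid)
os.seteuid(condor.pw_uid)
===============================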
Yes, I understand that, but as I said, I need to make it work first and lock it
down afterwards.
> Because of this, perhaps the '/var/spool/condor/' directory or one of its
> subdirectories needs to be owned by root:root. I have mine owned by
> condor:condor, though, so I don't know why this is a problem. Try
> chowning it to 'root:root' and see if that helps. For a similar reason,
> perhaps '/var/lib/condor/execute/' needs to be owned by root:root.
> (Root-squashed usually refers to not giving special permissions to a local
> 'root' user on a shared filesystem that doesn't care about root, I think.)
> Why does this directory have the sticky bit set, though? (According to the
> "t" in the "drwxrwxrwt" permissions.) Try unsetting the sticky bit in
> '/var/lib/condor/execute/' by running 'chmod -t /var/lib/condor/execute'
> as root. My execute directory doesn't have the sticky bit set, so I think
> it's safe to unset it (I don't think it's set by default, that is).
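A rough sketch of checking the ownership and sticky bit discussed above, and clearing the sticky bit on the execute directory, assuming the Debian-default paths; it would need to run as root:
===============================
#!/usr/bin/env python3
# Illustration only: report owner/group and sticky bit on both directories,
# then clear the sticky bit on the execute directory (the Python equivalent
# of 'chmod -t /var/lib/condor/execute').
import os
import stat

for path in ("/var/spool/condor", "/var/lib/condor/execute"):
    st = os.stat(path)
    sticky = bool(st.st_mode & stat.S_ISVTX)
    print("%s  uid=%d gid=%d sticky=%s" % (path, st.st_uid, st.st_gid, sticky))

execute = "/var/lib/condor/execute"
mode = stat.S_IMODE(os.stat(execute).st_mode)
os.chmod(execute, mode & ~stat.S_ISVTX)   # unset the sticky bit
===============================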
I tried unsetting the sticky bit and changing the ownership, but no dice.
> Hopefully, this will fix your problems or at least get you that much closer
> to figuring it all out once and for all. I don't know why the job stays
> stuck in the queue. Unfortunately, I'm not yet familiar with the parallel
> universe. What I do know is that after you make these changes and correct
> the most recent errors in your log files, restart Condor and try again.
> If they still stay in the queue, run 'condor_q -better-analyze' to see
> if you get more information this time. Before, it mentioned that your job
> didn't match any resource constraints, which tells me that the
> Requirements of the job and the capabilities of the machine don't quite
> match up right. Look through the log files I mentioned again to see if
> you get any new errors. If 'condor_q -better-analyze' and the log files
> don't help, give me the output of 'condor_q -long' for the appropriate
> cluster/job and 'condor_status -long' for the appropriate machines
> (node-01 and node-02?).
>
> Best Regards,
> ~ Garrett K.
> condor.cs.wlu.edu
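In case it helps, a small helper sketch for collecting the three outputs requested above in one go, assuming the condor_* tools are on PATH, cluster id 58 as in the ClassAd below, and the guessed node names node-01 and node-02:
===============================
#!/usr/bin/env python3
# Illustration only: capture the analysis and the long-form ClassAds for
# the stuck cluster and the two execute nodes.
import subprocess

for cmd in (["condor_q", "-better-analyze", "58"],
            ["condor_q", "-long", "58"],
            ["condor_status", "-long", "node-01", "node-02"]):
    print("==>", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
===============================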
Here it is.
example job
===============================
universe = parallel
Error = err-$(node).log
Output = out-$(node).log
Log = log-$(node).log
executable = /usr/bin/mpirun
arguments = /home/user/hw -np 8 -host $NODE
machine_count = 1
WhenToTransferOutput = ON_EXIT
transfer_input_files = /home/user/hw
Queue
===============================
Output of 'condor_q -long 57'
+++++++++++++++++++++++++++++++
-- Submitter: master.internal.domain : <172.17.8.121:9632> : master.internal.domain
PeriodicRemove = false
CommittedSlotTime = 0
Out = "out-#pArAlLeLnOdE#.log"
WantIOProxy = true
ImageSize_RAW = 51
NumCkpts_RAW = 0
JobRequiresSandbox = true
EnteredCurrentStatus = 1314306012
CommittedSuspensionTime = 0
WhenToTransferOutput = "ON_EXIT"
NumSystemHolds = 0
StreamOut = false
NumRestarts = 0
ImageSize = 75
Cmd = "/usr/bin/mpirun"
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx"
CurrentHosts = 0
Iwd = "/home/user"
CumulativeSlotTime = 0
ExecutableSize_RAW = 51
CondorVersion = "$CondorVersion: 7.6.0 Apr 19 2011 BuildID: Debian [7.6.0-1~nd60+1] $"
RemoteUserCpu = 0.0
NumCkpts = 0
JobStatus = 1
RemoteSysCpu = 0.0
OnExitRemove = true
BufferBlockSize = 32768
ClusterId = 58
In = "/dev/null"
LocalUserCpu = 0.0
MinHosts = 1
Environment = ""
JobUniverse = 11
RequestDisk = DiskUsage
RootDir = "/"
NumJobStarts = 0
WantRemoteIO = true
RequestMemory = ceiling(ifThenElse(JobVMMemory =!= undefined,JobVMMemory,ImageSize / 1024.000000))
GlobalJobId = "master.internal.domain#58.0#1314306012"
LocalSysCpu = 0.0
PeriodicRelease = false
DiskUsage = 75
CumulativeSuspensionTime = 0
JobLeaseDuration = 1200
TransferInput = "/home/user/hw"
UserLog = "/home/user/log-#pArAlLeLnOdE#.log"
KillSig = "SIGTERM"
ExecutableSize = 75
MaxHosts = 1
ServerTime = 1314306330
CoreSize = 0
DiskUsage_RAW = 61
ProcId = 0
TransferFiles = "ONEXIT"
ShouldTransferFiles = "YES"
CommittedTime = 0
TotalSuspensions = 0
Err = "err-#pArAlLeLnOdE#.log"
RequestCpus = 1
StreamErr = false
NiceUser = false
RemoteWallClockTime = 0.0
TargetType = "Machine"
PeriodicHold = false
QDate = 1314306012
OnExitHold = false
Rank = 0.0
ExitBySignal = false
CondorPlatform = "$CondorPlatform: X86_64-Debian_6.0 $"
JobPrio = 0
LastSuspensionTime = 0
Args = "/home/user/hw -np 8 -host $NODE"
CurrentTime = time()
JobNotification = 2
User = "user@xxxxxxxxxxxxxxx"
BufferSize = 524288
WantRemoteSyscalls = false
LeaveJobInQueue = false
ExitStatus = 0
CompletionDate = 0
MyType = "Job"
Requirements = ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.HasFileTransfer )
WantCheckpoint = false
Owner = "user"
LastJobStatus = 0
TransferIn = false
+++++++++++++++++++++++++++++++