Hi Greg!
Thanks, here is the log. According to it, the executable file was not copied into the docker image.
gergely.debreczeni@xxxxxxx:~/batchsubmission$ condor_q -anal 57.1
-- Schedd: X.X.X.X <10.1.8.8:51975?...
---
057.001: Request is held.
Hold reason: Error from slot1@scorpio005: STARTER at 10.1.10.5 failed to send file(s) to <10.1.8.8:28343>: error reading from /var/lib/condor/execute/dir_1810221/output.out: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <10.1.10.5:10057>
The reason for this is that the executable never ran, because it was not copied. The job's stderr message says:
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
/usr/local/bin/nvidia_entrypoint.sh: line 88: exec: batch.sh: not found
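Experimenting with it a bit more, the executable only gets copied (with condor 8.4.2) if it is given with a leading ./ in the submission file and is also listed explicitly in the paramlist file.
So like this in the paramlist file: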
a, batch.sh, 1 2
a, batch.sh, 3 4
a, batch.sh, 5 6
a, batch.sh, 7 8
and this is the submission file:
executable = ./batch.sh
universe = docker
docker_image = nv-pytorch-wglobus_v2
## Logs
log = out/batch.$(Process).log
output = out/batch.$(Process).stdout
error = out/batch.$(Process).stderr
## File transfer
should_transfer_files = Yes
when_to_transfer_output = ON_EXIT
line = $(Row)
transfer_output_files = output.out
transfer_output_remaps = "output.out=out/output$INT(line).out"
transfer_input_files = $(input_file1), $(input_file2)
## Resources requested
request_cpus = 1
request_GPUs = 0
Requirements = (ResourceType == "Dedicated") && (regexp(".*nv-pytorch-wglobus_v2.*",LocallyAvailableDockerImages))
## Submit command
queue input_file1, input_file2, arguments from [0:2:1] ./paramlist
With condor 8.8.0 it also works without the ./ and without the explicit listing in the paramlist file.
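To make the queue line easier to follow, here is a rough Python sketch of how I expect it to expand the paramlist rows into per-job variables. It is only an approximation: expand_queue is a hypothetical helper (not part of HTCondor), and it assumes that the slice follows Python-style start:stop:step semantics and that each row is split on commas, with the last variable keeping the rest of the line.

```python
# Sketch only: approximates "queue <names> from [start:stop:step] <file>".
# expand_queue is a hypothetical helper, not an HTCondor API.

PARAMLIST = """\
a, batch.sh, 1 2
a, batch.sh, 3 4
a, batch.sh, 5 6
a, batch.sh, 7 8
"""


def expand_queue(text, names, start=0, stop=None, step=1):
    """Return one dict of submit variables per selected row."""
    rows = [line for line in text.splitlines() if line.strip()]
    jobs = []
    for line in rows[start:stop:step]:
        # Assumption: fields are comma separated and the last
        # variable receives the remainder of the line.
        fields = [f.strip() for f in line.split(",", len(names) - 1)]
        jobs.append(dict(zip(names, fields)))
    return jobs


# Mirrors: queue input_file1, input_file2, arguments from [0:2:1] ./paramlist
jobs = expand_queue(
    PARAMLIST, ["input_file1", "input_file2", "arguments"], 0, 2, 1
)
for job in jobs:
    print(job)
```

Under these assumptions only the first two rows are queued, and batch.sh ends up in input_file2, which is how it gets into transfer_input_files.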
thanks,
Gergely
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Greg Thain <gthain@xxxxxxxxxxx>
Sent: Monday, May 6, 2019 4:07 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] batch submitssion strange problem
On 5/4/19 3:00 PM, Gergely Debreczeni via HTCondor-users wrote:
Can you send us the output of condor_q -hold? When a job is held, condor_q -hold will show the hold reason, which is often the best way to debug what's going on.
-greg