Re: [Condor-users] condor job idle
- Date: Thu, 10 May 2012 22:51:36 -0500
- From: Victor <ruotti@xxxxxxxx>
- Subject: Re: [Condor-users] condor job idle
Hi,
Across multiple submissions, the job always lands on slot2@xxxxxxxxxxxxxxxxxx,
and that compute node then fails to execute even a simple job.
Is this normal?
Sorry in advance if I'm doing something wrong.
Victor
-bash-3.2$ condor_submit process.cmd
Submitting job(s).
1 job(s) submitted to cluster 266909.
-bash-3.2$ tail -f process.log
009 (266908.000.000) 05/10 22:49:02 Job was aborted by the user.
via condor_rm (by user galaxy)
...
000 (266909.000.000) 05/10 22:49:19 Job submitted from host: <128.104.153.183:9618?PrivAddr=%3c10.129.28.28:9618%3fsock%3d5400_b3d1_2%3e&PrivNet=morgridge&noUDP&sock=5400_b3d1_2>
...
001 (266909.000.000) 05/10 22:50:01 Job executing on host: <128.104.55.43:57255>
...
022 (266909.000.000) 05/10 22:50:01 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266909.000.000) 05/10 22:50:01 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job
022 (266905.000.000) 05/10 22:43:58 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266905.000.000) 05/10 22:43:58 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job
...
001 (266906.000.000) 05/10 22:44:46 Job executing on host: <128.104.55.43:57255>
...
022 (266906.000.000) 05/10 22:44:46 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot5@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266906.000.000) 05/10 22:44:46 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot5@xxxxxxxxxxxxxxxxxx, rescheduling job
000 (266908.000.000) 05/10 22:47:33 Job submitted from host: <128.104.153.183:9618?PrivAddr=%3c10.129.28.28:9618%3fsock%3d5400_b3d1_2%3e&PrivNet=morgridge&noUDP&sock=5400_b3d1_2>
...
001 (266908.000.000) 05/10 22:47:43 Job executing on host: <128.104.55.43:57255>
...
022 (266908.000.000) 05/10 22:47:43 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot5@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
...
024 (266908.000.000) 05/10 22:47:43 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot5@xxxxxxxxxxxxxxxxxx, rescheduling job
...
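
To see at a glance whether the reconnect failures all point at the same slot, the job log can be tallied with a short script. This is only a minimal sketch: it assumes the log has the event layout shown above (a three-digit event code, the job ID in parentheses, a timestamp, then indented detail lines), and the name `slot2@example` in the sample is a placeholder, since the real hostname is redacted in this archive. The function name `reconnect_failures` is made up for illustration.

```python
import re
from collections import Counter

# Event header, e.g. "024 (266909.000.000) 05/10 22:50:01 Job reconnection failed"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$"
)
# Detail line of a 024 event, e.g. "Can not reconnect to slot2@host, rescheduling job"
SLOT_RE = re.compile(r"Can not reconnect to (\S+?), rescheduling job")


def reconnect_failures(lines):
    """Count 024 (job reconnection failed) events per slot name."""
    failures = Counter()
    current_code = None
    for raw in lines:
        line = raw.strip()
        header = EVENT_RE.match(line)
        if header:
            # Track by header rather than by the "..." separator,
            # because separators are sometimes missing in tail output.
            current_code = header.group(1)
            continue
        if current_code == "024":
            match = SLOT_RE.search(line)
            if match:
                failures[match.group(1)] += 1
    return failures


# Sample copied from the log layout above; "slot2@example" is a placeholder.
sample = """\
001 (266909.000.000) 05/10 22:50:01 Job executing on host: <128.104.55.43:57255>
...
022 (266909.000.000) 05/10 22:50:01 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to slot2@example <128.104.55.43:57255>
...
024 (266909.000.000) 05/10 22:50:01 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to slot2@example, rescheduling job
""".splitlines()

if __name__ == "__main__":
    for slot, count in reconnect_failures(sample).most_common():
        print(slot, count)
```

To run it against the real log, pass `open("process.log")` instead of `sample`. If every failure maps to slots on one execute host, that host is the likely place to look next, for example in the daemon logs (ShadowLog on the submit side, StarterLog on the execute side).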
On May 10, 2012, at 10:41 PM, Victor wrote:
> Hi Nathan,
> Thanks for the tip.
> I reduced the number of retries to 1.
>
> The weird thing is that when I first launch the dag, both the dag and the job from that node show as R, so I assume the job started. But then it goes back to the I state.
> See below: I keep getting R, but then it goes back to I.
> How do I find out what is causing the job (node) to exit the run state?
>
> I tried going back and submitting the job without dagman, and I keep getting something related to this compute node:
> slot2@xxxxxxxxxxxxxxxxxx
>
> It looks like it is trying to run, but then I get a socket connection error.
> I will try again to see if it lands on another slot this time.
> I might need to make sure my submit file works well first, and then the dag.
>
> Victor
>
>
>
> ...
> 000 (266905.000.000) 05/10 22:36:08 Job submitted from host: <128.104.153.183:9618?PrivAddr=%3c10.129.28.28:9618%3fsock%3d5400_b3d1_2%3e&PrivNet=morgridge&noUDP&sock=5400_b3d1_2>
> ...
> 001 (266905.000.000) 05/10 22:36:32 Job executing on host: <128.104.55.43:57255>
> ...
> 022 (266905.000.000) 05/10 22:36:32 Job disconnected, attempting to reconnect
> Socket between submit and execute hosts closed unexpectedly
> Trying to reconnect to slot2@xxxxxxxxxxxxxxxxxx <128.104.55.43:57255>
> ...
> 024 (266905.000.000) 05/10 22:36:32 Job reconnection failed
> Job not found at execution machine
> Can not reconnect to slot2@xxxxxxxxxxxxxxxxxx, rescheduling job
> ...
>
> Thanks,
> Victor
>
>
>
> -- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
> ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
> 2.0 soaruser 11/9 12:47 30+17:05:34 I 0 73.2 continuous.cron 20
> 3.0 soaruser 11/9 12:47 182+05:44:54 R 0 0.0 checkprogress.cron
> 158265.0 soaruser 1/11 09:27 0+10:27:40 I 0 170.9 scrubber.cron 20
> 158266.0 galaxy 1/11 09:32 0+16:57:14 I 0 244.1 scrubber.cron 20
> 160662.0 galaxy 1/18 15:31 112+20:01:41 R 0 73.2 checkprogress.cron
> 266877.0 galaxy 5/10 22:09 0+00:02:52 R 0 7.3 condor_dagman -f -
>
> 6 jobs; 3 idle, 3 running, 0 held
> -bash-3.2$ condor_q -dag
>
>
> -- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
> ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
> 2.0 soaruser 11/9 12:47 30+17:05:34 I 0 73.2 continuous.cron 20
> 3.0 soaruser 11/9 12:47 182+05:44:55 R 0 0.0 checkprogress.cron
> 158265.0 soaruser 1/11 09:27 0+10:27:40 I 0 170.9 scrubber.cron 20
> 158266.0 galaxy 1/11 09:32 0+16:57:14 I 0 244.1 scrubber.cron 20
> 160662.0 galaxy 1/18 15:31 112+20:01:42 R 0 73.2 checkprogress.cron
> 266877.0 galaxy 5/10 22:09 0+00:02:53 R 0 7.3 condor_dagman -f -
> 266881.0 |-fastq_file1 5/10 22:12 0+00:00:00 I 0 0.0 chtcjobwrapper --t
>
> -- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
> ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
> 2.0 soaruser 11/9 12:47 30+17:05:34 I 0 73.2 continuous.cron 20
> 3.0 soaruser 11/9 12:47 182+05:46:23 R 0 0.0 checkprogress.cron
> 158265.0 soaruser 1/11 09:27 0+10:27:40 I 0 170.9 scrubber.cron 20
> 158266.0 galaxy 1/11 09:32 0+16:57:14 I 0 244.1 scrubber.cron 20
> 160662.0 galaxy 1/18 15:31 112+20:03:10 R 0 73.2 checkprogress.cron
> 266877.0 galaxy 5/10 22:09 0+00:04:21 R 0 7.3 condor_dagman -f -
> 266882.0 |-fastq_file1 5/10 22:13 0+00:00:00 I 0 0.0 chtcjobwrapper --t
>
> 7 jobs; 4 idle, 3 running, 0 held
> -bash-3.2$ condor_q -dag
>
>
> -- Submitter: condor.morgridge.net : <10.129.28.28:9618?sock=5400_b3d1_2> : condor.morgridge.net
> ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
> 2.0 soaruser 11/9 12:47 30+17:05:34 I 0 73.2 continuous.cron 20
> 3.0 soaruser 11/9 12:47 182+05:46:26 R 0 0.0 checkprogress.cron
> 158265.0 soaruser 1/11 09:27 0+10:27:40 I 0 170.9 scrubber.cron 20
> 158266.0 galaxy 1/11 09:32 0+16:57:14 I 0 244.1 scrubber.cron 20
> 160662.0 galaxy 1/18 15:31 112+20:03:13 R 0 73.2 checkprogress.cron
> 266877.0 galaxy 5/10 22:09 0+00:04:24 R 0 7.3 condor_dagman -f -
> 266882.0 |-fastq_file1 5/10 22:13 0+00:00:00 R 0 0.0 chtcjobwrapper --t
>
> On May 10, 2012, at 6:30 PM, Nathan Panike wrote:
>
>> On Thu, May 10, 2012 at 05:32:15PM -0500, Victor wrote:
>>> Hi,
>>> I'm very new at creating dags, so sorry in advance, as this might be a mistake on my part.
>>> I'm hoping someone can show me how to check why the node->job is always idle.
>>> I created a very simple dag and started it via
>>> condor_submit_dag mydag.dag
>>>
>>> CONFIG dagman_config
>>> JOB fastq_file1 process.cmd dir fastq_file1
>>> SCRIPT POST fastq_file1 /opt/galaxy/dagme/ChtcRun/postjob.pl
>>> RETRY fastq_file1 10
>>>
>>> I can see the dagman running, so something within the node is causing this would be my guess.
>>>
>>> 266865.0 galaxy 5/10 17:14 0+00:00:24 R 0 7.3 condor_dagman -f -
>>> 266866.0 |-fastq_file1 5/10 17:14 0+00:00:00 I 0 0.0 chtcjobwrapper --t
>>
>> What universe is 266866 running in? Unless it is a local universe or
>> scheduler, this behavior is expected.
>>
>> From this, I read that DAGman has been running for 24 seconds. That
>> means that your job has been in the queue for about 12 seconds. It is
>> quite likely that it has not yet been considered for a match. I
>> counsel a certain amount of patience.
>>
>> Nathan Panike