
Re: [Condor-users] Job keeps re-executing after reaching ~ 1GB size but never finishes...



Dave,

All three machines are identical: each has 1 GB of RAM, runs Scientific
Linux 4.3, and has at least 40 GB of available disk. I am using Condor
version 6.8.2.

The Shadowlog:
##########################################################
7/29 22:22:22 ******************************************************
7/29 22:22:22 Using config source: /home/condor/condor/etc/condor_config
7/29 22:22:22 Using local config sources:
7/29 22:22:22   /home/condor/condor/local.phys-ugradlab02/condor_config.local
7/29 22:22:22 DaemonCore: Command Socket at <10.0.40.139:32809>
7/29 22:22:22 Initializing a VANILLA shadow for job 552.2
7/29 22:22:22 (552.2) (4805): Request to run on <10.0.40.112:32771> was ACCEPTED
7/30 12:45:25 (552.1) (4719): Got SIGTERM. Performing graceful shutdown.
7/30 12:45:25 (552.0) (4718): Got SIGTERM. Performing graceful shutdown.
7/30 12:45:25 (552.2) (4805): Got SIGTERM. Performing graceful shutdown.
######################################################################

I don't know why the SIGTERMs above occur. This is the point at which the
jobs are re-executed.

...continued Shadowlog...
######################################################################
7/30 12:45:26 (552.1) (4719): attempt to connect to <10.0.40.139:32772> failed: Invalid argument (connect errno = 22).  Will keep trying for 20 total seconds (19 to go).

7/30 12:45:26 (552.0) (4718): attempt to connect to <10.0.40.148:32771> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (19 to go).

7/30 12:45:26 (552.2) (4805): attempt to connect to <10.0.40.112:32771> failed: Network is unreachable (connect errno = 101).  Will keep trying for 20 total seconds (19 to go).
#################################################################

The Schedlog:
#################################################################
7/30 12:35:20 (pid:3407) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:40:20 (pid:3407) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:40:20 (pid:3407) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:45:20 (pid:3407) Sent ad to central manager for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:45:20 (pid:3407) Sent ad to 1 collectors for condor@xxxxxxxxxxxxxxxxxxxxx
7/30 12:45:25 (pid:3407) Got SIGTERM. Performing graceful shutdown.
7/30 12:45:26 (pid:3407) Called preempt( 1 )
7/30 12:45:27 (pid:3407) SafeMsg: sending small msg failed. errno: 101
7/30 12:45:27 (pid:3407) Can't send EOM to <10.0.40.148:32771>
7/30 12:45:27 (pid:3407) Sent vacate command to <10.0.40.148:32771> for job 552.0
7/30 12:45:29 (pid:3407) Called preempt( 1 )
7/30 12:45:29 (pid:3407) SafeMsg: sending small msg failed. errno: 22
7/30 12:45:29 (pid:3407) Can't send EOM to <10.0.40.139:32772>
7/30 12:45:29 (pid:3407) Sent vacate command to <10.0.40.139:32772> for job 552.1
##################################################################
The above shows the errors logged just before my jobs are vacated.
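
As a reading aid for the errno values in the two logs above, here is a minimal Python sketch that pulls the numeric errno out of each log line and maps it to a symbolic name. The two-entry table is an assumption limited to Linux and to the values whose messages the Shadowlog already prints (22, "Invalid argument"; 101, "Network is unreachable"). The names `LINUX_ERRNO`, `ERRNO_RE` and `describe_errnos` are hypothetical and not part of Condor.

```python
import re

# Linux errno values seen in the logs; the messages are the ones the
# Shadowlog itself prints next to each number.
LINUX_ERRNO = {
    22: ("EINVAL", "Invalid argument"),
    101: ("ENETUNREACH", "Network is unreachable"),
}

# Matches both "connect errno = 22" (Shadowlog) and "errno: 101" (Schedlog).
ERRNO_RE = re.compile(r"errno\s*[:=]\s*(\d+)")


def describe_errnos(lines):
    """Return (errno, name, message) for every log line carrying an errno."""
    found = []
    for line in lines:
        match = ERRNO_RE.search(line)
        if match is None:
            continue  # line has no errno, e.g. "Called preempt( 1 )"
        code = int(match.group(1))
        name, message = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
        found.append((code, name, message))
    return found


sample = [
    "7/30 12:45:27 (pid:3407) SafeMsg: sending small msg failed. errno: 101",
    "7/30 12:45:29 (pid:3407) SafeMsg: sending small msg failed. errno: 22",
    "7/30 12:45:26 (pid:3407) Called preempt( 1 )",
]
for code, name, message in describe_errnos(sample):
    print(code, name, message)
```

Both errors are reported while the daemons are already handling SIGTERM, so they describe failed attempts to reach the execute machines during shutdown rather than the cause of the shutdown itself.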

...continued Schedlog...
##################################################################
7/30 12:46:40 (pid:3402) ******************************************************
7/30 12:46:40 (pid:3402) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
7/30 12:46:40 (pid:3402) ** /home/condor/condor/sbin/condor_schedd
7/30 12:46:40 (pid:3402) ** $CondorVersion: 6.8.2 Oct 12 2006 $
7/30 12:46:40 (pid:3402) ** $CondorPlatform: I386-LINUX_RHEL3 $
7/30 12:46:40 (pid:3402) ** PID = 3402
7/30 12:46:40 (pid:3402) ** Log last touched 7/30 12:45:29
7/30 12:46:40 (pid:3402) ******************************************************
7/30 12:46:40 (pid:3402) Using config source: /home/condor/condor/etc/condor_config
7/30 12:46:40 (pid:3402) Using local config sources:
7/30 12:46:40 (pid:3402)   /home/condor/condor/local.phys-ugradlab02/condor_config.loca
##################################################################
(then the job is re-executed as shown above)
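
To check how large each job had grown before it went idle, the "Image size of job updated" events in the job log quoted further down can be summarised per job. This is a minimal Python sketch, assuming the reported sizes are in kilobytes (KiB), which is the unit Condor documents for the ImageSize attribute. The function name and regular expression are hypothetical and are not Condor tooling.

```python
import re

# Event 006 lines in the job log look like:
#   006 (552.000.000) 07/29 20:24:01 Image size of job updated: 1033628
EVENT_RE = re.compile(
    r"006 \((\d+)\.(\d+)\.\d+\) \S+ \S+ Image size of job updated: (\d+)"
)


def peak_image_sizes(lines):
    """Map "cluster.proc" to the largest image size (assumed KiB) seen."""
    peaks = {}
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # skip "..." separators and other event types
        job = f"{int(match.group(1))}.{int(match.group(2))}"
        size_kib = int(match.group(3))
        peaks[job] = max(peaks.get(job, 0), size_kib)
    return peaks


# A few lines copied from the job log in this thread.
sample = [
    "006 (552.000.000) 07/30 12:17:31 Image size of job updated: 1033368",
    "006 (552.001.000) 07/30 12:17:33 Image size of job updated: 1031860",
    "006 (552.002.000) 07/30 11:42:32 Image size of job updated: 851356",
    "006 (552.002.000) 07/30 12:02:33 Image size of job updated: 1034548",
]
for job, size_kib in sorted(peak_image_sizes(sample).items()):
    # 1 GiB = 1024 * 1024 KiB
    print(job, size_kib, f"{size_kib / (1024 * 1024):.2f} GiB")
```

On these sample lines every peak is a little under 1 GiB (1048576 KiB). Note that this figure is the job's memory image, not the size of the result file on disk.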

Remark: I cannot access the mailing list archive at lists.cs.wisc.edu.


Thanks,

Leo

> Leo,
>
> You will need to say what version of Condor, what operating system, and
> maybe some information about the machines in your pool.  How much memory
> and virtual memory do the execute nodes have?  Condor jobs can certainly
> use more than 1GB of memory.
>
> You can also search the log files for both the submit node and the
> execute node for one of those job numbers to see if you can find more
> information about what is happening to your jobs.
>
> - dave
>
>
> Leo Cristobal C. Ambolode II wrote:
>> Hi all,
>>
>> Can anyone explain what happened to my jobs? My batch jobs became idle
>> for some time when one of them reached approximately 1 GB of result/file,
>> and then the jobs were re-executed. This happened twice, so I decided to
>> remove the jobs because they never finish. I expect each job to return
>> approximately 5 GB. Is this some restriction in the Condor environment?
>> If so, how can I fix it?
>>
>> The following is the log file of my job:
>> ############################################
>> 000 (552.000.000) 07/28 21:58:24 Job submitted from host: <10.0.40.139:32771>
>> ...
>> 000 (552.001.000) 07/28 21:58:24 Job submitted from host: <10.0.40.139:32771>
>> ...
>> 000 (552.002.000) 07/28 21:58:24 Job submitted from host: <10.0.40.139:32771>
>> ...
>> 001 (552.000.000) 07/28 21:58:27 Job executing on host: <10.0.40.148:32771>
>> ...
>> 001 (552.001.000) 07/28 21:58:29 Job executing on host: <10.0.40.139:32772>
>> ...
>> 001 (552.002.000) 07/28 21:58:32 Job executing on host: <10.0.40.112:32771>
>> ...
>> 006 (552.000.000) 07/28 21:58:36 Image size of job updated: 232508
>> ...
>> 006 (552.001.000) 07/28 21:58:37 Image size of job updated: 222952
>> ...
>> 006 (552.002.000) 07/28 21:58:40 Image size of job updated: 240012
>> ...
>> 006 (552.000.000) 07/28 22:18:36 Image size of job updated: 299492
>> ...
>> 006 (552.001.000) 07/28 22:18:37 Image size of job updated: 300036
>> ...
>> 006 (552.002.000) 07/28 22:18:40 Image size of job updated: 300564
>> ...
>> 006 (552.000.000) 07/28 22:38:35 Image size of job updated: 299992
>> ...
>> 006 (552.001.000) 07/28 22:38:37 Image size of job updated: 300588
>> ...
>> 006 (552.002.000) 07/28 22:38:40 Image size of job updated: 301064
>> ...
>> 006 (552.002.000) 07/28 22:58:40 Image size of job updated: 301196
>> ...
>> 006 (552.000.000) 07/28 23:18:35 Image size of job updated: 303588
>> ...
>> 006 (552.001.000) 07/28 23:18:37 Image size of job updated: 304188
>> ...
>> 006 (552.002.000) 07/28 23:18:40 Image size of job updated: 304660
>> ...
>> 006 (552.000.000) 07/28 23:38:35 Image size of job updated: 304708
>> ...
>> 006 (552.001.000) 07/28 23:38:37 Image size of job updated: 305304
>> ...
>> 006 (552.002.000) 07/28 23:38:40 Image size of job updated: 305776
>> ...
>> 006 (552.000.000) 07/29 00:38:36 Image size of job updated: 325480
>> ...
>> 006 (552.001.000) 07/29 00:38:37 Image size of job updated: 324032
>> ...
>> 006 (552.002.000) 07/29 00:38:39 Image size of job updated: 324516
>> ...
>> 006 (552.000.000) 07/29 00:58:36 Image size of job updated: 327124
>> ...
>> 006 (552.001.000) 07/29 00:58:37 Image size of job updated: 327980
>> ...
>> 006 (552.002.000) 07/29 01:18:39 Image size of job updated: 327168
>> ...
>> 006 (552.000.000) 07/29 01:38:36 Image size of job updated: 327640
>> ...
>> 006 (552.002.000) 07/29 02:18:40 Image size of job updated: 328708
>> ...
>> 006 (552.000.000) 07/29 03:38:35 Image size of job updated: 338992
>> ...
>> 006 (552.000.000) 07/29 04:38:36 Image size of job updated: 588500
>> ...
>> 006 (552.001.000) 07/29 04:38:37 Image size of job updated: 333832
>> ...
>> 006 (552.002.000) 07/29 04:38:40 Image size of job updated: 339928
>> ...
>> 006 (552.001.000) 07/29 05:38:37 Image size of job updated: 599632
>> ...
>> 006 (552.002.000) 07/29 05:38:40 Image size of job updated: 594704
>> ...
>> 006 (552.002.000) 07/29 05:58:40 Image size of job updated: 601292
>> ...
>> 001 (552.000.000) 07/29 06:23:53 Job executing on host: <10.0.40.139:32772>
>> ...
>> 001 (552.001.000) 07/29 06:34:15 Job executing on host: <10.0.40.148:32771>
>> ...
>> 001 (552.002.000) 07/29 06:34:17 Job executing on host: <10.0.40.112:32771>
>> ...
>> 006 (552.000.000) 07/29 06:44:01 Image size of job updated: 300600
>> ...
>> 006 (552.001.000) 07/29 06:54:23 Image size of job updated: 300148
>> ...
>> 006 (552.002.000) 07/29 06:54:25 Image size of job updated: 299148
>> ...
>> 006 (552.000.000) 07/29 07:04:01 Image size of job updated: 301100
>> ...
>> 006 (552.001.000) 07/29 07:14:23 Image size of job updated: 300616
>> ...
>> 006 (552.002.000) 07/29 07:14:25 Image size of job updated: 299456
>> ...
>> 006 (552.000.000) 07/29 07:24:01 Image size of job updated: 304700
>> ...
>> 006 (552.001.000) 07/29 07:34:23 Image size of job updated: 304212
>> ...
>> 006 (552.002.000) 07/29 07:34:25 Image size of job updated: 303052
>> ...
>> 006 (552.000.000) 07/29 07:44:01 Image size of job updated: 305816
>> ...
>> 006 (552.001.000) 07/29 07:54:23 Image size of job updated: 305332
>> ...
>> 006 (552.002.000) 07/29 07:54:26 Image size of job updated: 304172
>> ...
>> 006 (552.000.000) 07/29 08:44:01 Image size of job updated: 324556
>> ...
>> 006 (552.001.000) 07/29 08:54:23 Image size of job updated: 325088
>> ...
>> 006 (552.002.000) 07/29 08:54:25 Image size of job updated: 323752
>> ...
>> 006 (552.000.000) 07/29 09:04:01 Image size of job updated: 328840
>> ...
>> 006 (552.001.000) 07/29 09:14:23 Image size of job updated: 326720
>> ...
>> 006 (552.002.000) 07/29 09:14:25 Image size of job updated: 326972
>> ...
>> 006 (552.001.000) 07/29 09:54:23 Image size of job updated: 327492
>> ...
>> 006 (552.002.000) 07/29 09:54:25 Image size of job updated: 327228
>> ...
>> 006 (552.000.000) 07/29 11:44:01 Image size of job updated: 334344
>> ...
>> 006 (552.001.000) 07/29 11:54:23 Image size of job updated: 333860
>> ...
>> 006 (552.002.000) 07/29 11:54:25 Image size of job updated: 332700
>> ...
>> 006 (552.002.000) 07/29 12:34:26 Image size of job updated: 439668
>> ...
>> 006 (552.000.000) 07/29 12:44:01 Image size of job updated: 596972
>> ...
>> 006 (552.001.000) 07/29 12:54:23 Image size of job updated: 595696
>> ...
>> 006 (552.002.000) 07/29 12:54:25 Image size of job updated: 598368
>> ...
>> 006 (552.000.000) 07/29 20:04:01 Image size of job updated: 924564
>> ...
>> 006 (552.001.000) 07/29 20:14:23 Image size of job updated: 846144
>> ...
>> 006 (552.000.000) 07/29 20:24:01 Image size of job updated: 1033628
>> ...
>> 001 (552.000.000) 07/29 22:17:22 Job executing on host: <10.0.40.148:32771>
>> ...
>> 001 (552.001.000) 07/29 22:17:24 Job executing on host: <10.0.40.139:32772>
>> ...
>> 001 (552.002.000) 07/29 22:22:24 Job executing on host: <10.0.40.112:32771>
>> ...
>> 006 (552.000.000) 07/29 22:37:30 Image size of job updated: 299556
>> ...
>> 006 (552.001.000) 07/29 22:37:32 Image size of job updated: 298980
>> ...
>> 006 (552.002.000) 07/29 22:42:31 Image size of job updated: 301080
>> ...
>> 006 (552.000.000) 07/29 22:57:31 Image size of job updated: 300056
>> ...
>> 006 (552.001.000) 07/29 22:57:32 Image size of job updated: 299476
>> ...
>> 006 (552.002.000) 07/29 23:02:31 Image size of job updated: 301108
>> ...
>> 006 (552.000.000) 07/29 23:17:30 Image size of job updated: 303652
>> ...
>> 006 (552.002.000) 07/29 23:22:31 Image size of job updated: 304704
>> ...
>> 006 (552.000.000) 07/29 23:37:30 Image size of job updated: 304772
>> ...
>> 006 (552.001.000) 07/29 23:37:32 Image size of job updated: 303076
>> ...
>> 006 (552.002.000) 07/29 23:42:32 Image size of job updated: 305820
>> ...
>> 006 (552.001.000) 07/29 23:57:32 Image size of job updated: 304192
>> ...
>> 006 (552.000.000) 07/30 00:37:30 Image size of job updated: 323696
>> ...
>> 006 (552.002.000) 07/30 00:42:32 Image size of job updated: 326856
>> ...
>> 006 (552.000.000) 07/30 00:57:30 Image size of job updated: 327188
>> ...
>> 006 (552.001.000) 07/30 00:57:32 Image size of job updated: 322928
>> ...
>> 006 (552.002.000) 07/30 01:02:31 Image size of job updated: 328236
>> ...
>> 006 (552.001.000) 07/30 01:17:32 Image size of job updated: 326608
>> ...
>> 006 (552.002.000) 07/30 01:42:32 Image size of job updated: 328752
>> ...
>> 006 (552.001.000) 07/30 02:17:33 Image size of job updated: 327124
>> ...
>> 006 (552.000.000) 07/30 03:37:30 Image size of job updated: 333300
>> ...
>> 006 (552.002.000) 07/30 03:42:31 Image size of job updated: 334352
>> ...
>> 006 (552.001.000) 07/30 03:57:33 Image size of job updated: 372392
>> ...
>> 006 (552.000.000) 07/30 04:17:30 Image size of job updated: 376872
>> ...
>> 006 (552.002.000) 07/30 04:22:31 Image size of job updated: 443972
>> ...
>> 006 (552.000.000) 07/30 04:37:30 Image size of job updated: 597524
>> ...
>> 006 (552.002.000) 07/30 04:42:32 Image size of job updated: 588952
>> ...
>> 006 (552.001.000) 07/30 04:57:33 Image size of job updated: 591804
>> ...
>> 006 (552.002.000) 07/30 06:22:32 Image size of job updated: 596632
>> ...
>> 006 (552.001.000) 07/30 06:37:33 Image size of job updated: 594992
>> ...
>> 006 (552.002.000) 07/30 11:42:32 Image size of job updated: 851356
>> ...
>> 006 (552.000.000) 07/30 11:57:30 Image size of job updated: 977040
>> ...
>> 006 (552.001.000) 07/30 11:57:33 Image size of job updated: 679004
>> ...
>> 006 (552.002.000) 07/30 12:02:33 Image size of job updated: 1034548
>> ...
>> 006 (552.000.000) 07/30 12:17:31 Image size of job updated: 1033368
>> ...
>> 006 (552.001.000) 07/30 12:17:33 Image size of job updated: 1031860
>> ...
>>
>> ...then the job became idle again :(
>> ######################################################################
>>
>>
>> Thanks,
>>
>> Leo
>>
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/