
Re: [HTCondor-users] Fedora 19 (64 bits) with HTCondor 8.1: schedd crashes....



The workaround I recall was:

USE_CLONE_TO_CREATE_PROCESSES = FALSE 

https://bugzilla.redhat.com/show_bug.cgi?id=1000106

There was some behavioral change that hasn't been fully tracked down yet.
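A minimal sketch of applying this workaround as a drop-in file. It assumes the stock Fedora layout visible in the SchedLog below, where files in /etc/condor/config.d/ are read as local config sources. The file name itself is hypothetical:

```
# /etc/condor/config.d/98clone_workaround.config  (hypothetical file name)
# Workaround for the schedd crash discussed in Red Hat bug 1000106.
# With the default (TRUE), HTCondor daemons on Linux create child
# processes with clone(); FALSE makes them use fork() instead.
USE_CLONE_TO_CREATE_PROCESSES = FALSE
```

After restarting HTCondor so the schedd picks up the change, `condor_config_val USE_CLONE_TO_CREATE_PROCESSES` should report the new value. The HTCondor manual describes clone() as a scalability aid for daemons with a large memory footprint, such as a schedd with a big queue, so losing that benefit should be the main cost of the workaround.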

----- Original Message -----
> From: "Stub" <spamrefuse@xxxxxxxxx>
> To: "condor-users" <htcondor-users@xxxxxxxxxxx>
> Sent: Wednesday, October 23, 2013 11:00:48 PM
> Subject: Re: [HTCondor-users] Fedora 19 (64 bits) with HTCondor 8.1: schedd crashes....
> 
> Hi,
> 
> Alas, the apparent 'good news' is false.
> On Fedora 19 with HTCondor 8.1.0, the schedd keeps crashing each time the
> master daemon restarts it (this seems to have nothing to do with the missing
> /var/lock/condor/local/ directory).
> 
> See below for SchedLog.
> 
> Regards,
> Rob.
> 
> 10/24/13 12:54:03 (pid:16146) ******************************************************
> 10/24/13 12:54:03 (pid:16146) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 10/24/13 12:54:03 (pid:16146) ** /usr/sbin/condor_schedd
> 10/24/13 12:54:03 (pid:16146) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
> 10/24/13 12:54:03 (pid:16146) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
> 10/24/13 12:54:03 (pid:16146) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
> 10/24/13 12:54:03 (pid:16146) ** $CondorPlatform: X86_64-Fedora_19 $
> 10/24/13 12:54:03 (pid:16146) ** PID = 16146
> 10/24/13 12:54:03 (pid:16146) ** Log last touched 10/24 12:20:21
> 10/24/13 12:54:03 (pid:16146) ******************************************************
> 10/24/13 12:54:03 (pid:16146) Using config source: /etc/condor/condor_config
> 10/24/13 12:54:03 (pid:16146) Using local config sources:
> 10/24/13 12:54:03 (pid:16146)    /etc/condor/config.d/00personal_condor.config
> 10/24/13 12:54:03 (pid:16146)    /etc/condor/config.d/01personal_condor.config
> 10/24/13 12:54:03 (pid:16146)    /etc/condor/config.d/99flocking.config
> 10/24/13 12:54:03 (pid:16146) DaemonCore: command socket at <xxx.xxx.xxx.xxx:55786>
> 10/24/13 12:54:03 (pid:16146) DaemonCore: private command socket at <xxx.xxx.xxx.xxx:55786>
> 10/24/13 12:54:03 (pid:16146) History file rotation is enabled.
> 10/24/13 12:54:03 (pid:16146)   Maximum history file size is: 20971520 bytes
> 10/24/13 12:54:03 (pid:16146)   Number of rotated history files is: 2
> 10/24/13 12:54:03 (pid:16146) Failed to execute /usr/sbin/condor_shadow.std, ignoring
> 10/24/13 12:54:04 (pid:16146) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
> 10/24/13 12:54:04 (pid:16146) 210.0: JobLeaseDuration remaining: EXPIRED!
> 10/24/13 12:54:08 (pid:16146) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 10/24/13 12:54:08 (pid:16146) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 10/24/13 12:54:08 (pid:16146) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 10/24/13 12:54:08 (pid:16146) Sent ad to central manager for myname@xxxxxxxxxxxxxxx
> 10/24/13 12:54:08 (pid:16146) Sent ad to 1 collectors for myname@xxxxxxxxxxxxxxx
> 10/24/13 12:55:04 (pid:16146) WARNING: forward resolution of condormaster.skku.edu doesn't match xxx.xxx.xxx.xxx!
> 10/24/13 12:55:04 (pid:16146) Using negotiation protocol: NEGOTIATE
> 10/24/13 12:55:04 (pid:16146) Negotiating for owner: myname@xxxxxxxxxxxxxxx
> 10/24/13 12:55:04 (pid:16146) AutoCluster:config() significant attributes changed to
> 10/24/13 12:55:05 (pid:16146) Checking consistency running and runnable jobs
> 10/24/13 12:55:05 (pid:16146) Tables are consistent
> 10/24/13 12:55:05 (pid:16146) Rebuilt prioritized runnable job list in 0.483s.
> 10/24/13 12:55:05 (pid:16146) Finished negotiating for myname in local pool: 0 matched, 1 rejected
> 10/24/13 12:55:05 (pid:16146) Increasing flock level for myname to 1 from 0.
> 10/24/13 12:55:05 (pid:16146) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 10/24/13 12:55:05 (pid:16146) TransferQueueManager upload 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 10/24/13 12:55:05 (pid:16146) TransferQueueManager download 1m I/O load: 0 bytes/s  0.000 disk load  0.000 net load
> 10/24/13 12:55:05 (pid:16146) Sent ad to central manager for myname@xxxxxxxxxxxxxxx
> 10/24/13 12:55:05 (pid:16146) Sent ad to 1 collectors for myname@xxxxxxxxxxxxxxx
> 10/24/13 12:55:07 (pid:16146) Using negotiation protocol: NEGOTIATE
> 10/24/13 12:55:07 (pid:16146) Negotiating for owner: myname@xxxxxxxxxxxxxxx (flock level 1, pool condor.skku.edu)
> 10/24/13 12:55:07 (pid:16146) AutoCluster:config() significant attributes changed to JobUniverse,LastCheckpointPlatform,NumCkpts,RemoteGroup,SubmitterGroup,SubmitterUserPrio
> 10/24/13 12:55:07 (pid:16146) Starting add_shadow_birthdate(210.0)
> Stack dump for process 16146 at timestamp 1382586907 (4 frames)
> /lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x38954e0972]
> /lib64/libcondor_utils_8_1_0.so[0x389557b5f7]
> /lib64/libc.so.6[0x3891435a90]
> [0x7fff76a01aa0]
> 10/24/13 12:55:18 (pid:17005) ******************************************************
> 10/24/13 12:55:18 (pid:17005) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 10/24/13 12:55:18 (pid:17005) ** /usr/sbin/condor_schedd
> 10/24/13 12:55:18 (pid:17005) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
> 10/24/13 12:55:18 (pid:17005) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
> 10/24/13 12:55:18 (pid:17005) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
> 10/24/13 12:55:18 (pid:17005) ** $CondorPlatform: X86_64-Fedora_19 $
> 10/24/13 12:55:18 (pid:17005) ** PID = 17005
> 10/24/13 12:55:18 (pid:17005) ** Log last touched 10/24 12:55:08
> 10/24/13 12:55:18 (pid:17005) ******************************************************
> 10/24/13 12:55:18 (pid:17005) Using config source: /etc/condor/condor_config
> 10/24/13 12:55:18 (pid:17005) Using local config sources:
> 10/24/13 12:55:18 (pid:17005)    /etc/condor/config.d/00personal_condor.config
> 10/24/13 12:55:18 (pid:17005)    /etc/condor/config.d/01personal_condor.config
> 10/24/13 12:55:18 (pid:17005)    /etc/condor/config.d/99flocking.config
> 10/24/13 12:55:18 (pid:17005) DaemonCore: command socket at <xxx.xxx.xxx.xxx:56785>
> 10/24/13 12:55:18 (pid:17005) DaemonCore: private command socket at <xxx.xxx.xxx.xxx:56785>
> 10/24/13 12:55:18 (pid:17005) History file rotation is enabled.
> 10/24/13 12:55:18 (pid:17005)   Maximum history file size is: 20971520 bytes
> 10/24/13 12:55:18 (pid:17005)   Number of rotated history files is: 2
> 10/24/13 12:55:18 (pid:17005) Failed to execute /usr/sbin/condor_shadow.std, ignoring
> 10/24/13 12:55:18 (pid:17005) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
> 10/24/13 12:55:19 (pid:17005) 210.0: JobLeaseDuration remaining: 1188
> 10/24/13 12:55:19 (pid:17005) Starting add_shadow_birthdate(210.0)
> Stack dump for process 17005 at timestamp 1382586919 (4 frames)
> /lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x38954e0972]
> /lib64/libcondor_utils_8_1_0.so[0x389557b5f7]
> /lib64/libc.so.6[0x3891435a90]
> [0x7fff4bcc51f0]
> 10/24/13 12:55:30 (pid:17014) ******************************************************
> 10/24/13 12:55:30 (pid:17014) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 10/24/13 12:55:30 (pid:17014) ** /usr/sbin/condor_schedd
> 10/24/13 12:55:30 (pid:17014) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
> 10/24/13 12:55:30 (pid:17014) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
> 10/24/13 12:55:30 (pid:17014) ** $CondorVersion: 8.1.0 Jul 15 2013 BuildID: RH-8.1.0-0.2.fc19 PRE-RELEASE-UWCS $
> 10/24/13 12:55:30 (pid:17014) ** $CondorPlatform: X86_64-Fedora_19 $
> 10/24/13 12:55:30 (pid:17014) ** PID = 17014
> 10/24/13 12:55:30 (pid:17014) ** Log last touched 10/24 12:55:19
> 10/24/13 12:55:30 (pid:17014) ******************************************************
> 10/24/13 12:55:30 (pid:17014) Using config source: /etc/condor/condor_config
> 10/24/13 12:55:30 (pid:17014) Using local config sources:
> 10/24/13 12:55:30 (pid:17014)    /etc/condor/config.d/00personal_condor.config
> 10/24/13 12:55:30 (pid:17014)    /etc/condor/config.d/01personal_condor.config
> 10/24/13 12:55:30 (pid:17014)    /etc/condor/config.d/99flocking.config
> 10/24/13 12:55:30 (pid:17014) DaemonCore: command socket at <xxx.xxx.xxx.xxx:59733>
> 10/24/13 12:55:30 (pid:17014) DaemonCore: private command socket at <xxx.xxx.xxx.xxx:59733>
> 10/24/13 12:55:30 (pid:17014) History file rotation is enabled.
> 10/24/13 12:55:30 (pid:17014)   Maximum history file size is: 20971520 bytes
> 10/24/13 12:55:30 (pid:17014)   Number of rotated history files is: 2
> 10/24/13 12:55:30 (pid:17014) Failed to execute /usr/sbin/condor_shadow.std, ignoring
> 10/24/13 12:55:31 (pid:17014) About to rotate ClassAd log /var/lib/condor/spool/job_queue.log
> 10/24/13 12:55:31 (pid:17014) 210.0: JobLeaseDuration remaining: 1176
> 10/24/13 12:55:31 (pid:17014) Starting add_shadow_birthdate(210.0)
> Stack dump for process 17014 at timestamp 1382586931 (4 frames)
> /lib64/libcondor_utils_8_1_0.so(dprintf_dump_stack+0x72)[0x38954e0972]
> /lib64/libcondor_utils_8_1_0.so[0x389557b5f7]
> /lib64/libc.so.6[0x3891435a90]
> [0x7ffff37c2d30]

-- 
Cheers,
Tim