Ian,

I'm seeing similar behaviour while evaluating 7.8.4 for deployment on our campus grid. Indeed, occasionally the startd dies and leaves a core (we're using Debian 6.0.6 x86_64), with this sort of message in the StartLog:

10/08/12 09:21:16 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
Stack dump for process 7352 at timestamp 1349684476 (4 frames)
/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(dprintf_dump_stack+0x131)[0x7f4c0f428051]
/Condor/x86_64/condor-7.8.4-x86_64_deb_6.0-stripped/sbin/../lib/libcondor_utils_7_8_4.so(_Z18linux_sig_coredumpi+0x40)[0x7f4c0f596a00]
/lib64/libpthread.so.0(+0xeff0)[0x7f4c0b208ff0]

This particular failure happened while trying to use partitionable slots with ParallelSchedulingGroups under the parallel universe. I know that this particular type of job can be run from within the vanilla universe (which works), but I need the parallel-universe test for backward compatibility, in case users stick to old submit scripts they may have.

Mark
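For anyone trying to reproduce this, a sketch of the submit side Mark describes would look roughly like the following; the executable, node count and resource requests are placeholders rather than his actual test job, and the attribute names assume the usual 7.8 submit syntax:

    # illustrative parallel-universe submit file using scheduling groups
    universe        = parallel
    executable      = /bin/sleep
    arguments       = 300
    machine_count   = 4
    request_cpus    = 1
    request_memory  = 512
    # ask the dedicated scheduler to keep all nodes within one group
    +WantParallelSchedulingGroups = True
    log             = psg_test.log
    queue

On the execute nodes, something along these lines advertises the scheduling group and the dedicated scheduler (the group name and submit host below are again placeholders):

    # illustrative execute-node additions for parallel jobs
    DedicatedScheduler      = "DedicatedScheduler@submit.example.ac.uk"
    ParallelSchedulingGroup = "rack-a"
    STARTD_ATTRS = $(STARTD_ATTRS) DedicatedScheduler, ParallelSchedulingGroup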
On 06/10/12 07:51, Ian Cottam wrote:

We are getting a ton of these messages from our Pool after updating from 7.4 to 7.8.4. Does it mean we are obliged to run the new daemon that clears out partitioned slots? Or is it showing up a bug, which seems likely as the startd should not seg fault?

-Ian
--
Ian Cottam
IT Services - supporting research
Faculty of EPS
The University of Manchester

On 06/10/2012 04:18, "Owner of Condor Daemons" <condor@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

This is an automated email from the Condor system on machine "xxx". Do not reply.

"/usr/sbin/condor_startd" on "e-c07atg105057.it.manchester.ac.uk" died due to signal 11 (Segmentation fault). Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file /var/log/condor/StartLog:
10/05/12 20:51:13 slot1_3: State change: claim-activation protocol successful
10/05/12 20:51:13 slot1_3: Changing activity: Idle -> Busy
10/05/12 20:51:13 slot1_1: match_info called
10/05/12 20:51:13 slot1_4: Got activate_claim request from shadow (130.88.203.22)
10/05/12 20:51:13 slot1_4: Remote job ID is 329729.2744
10/05/12 20:51:13 slot1_4: Got universe "VANILLA" (5) from request classad
10/05/12 20:51:13 slot1_4: State change: claim-activation protocol successful
10/05/12 20:51:13 slot1_4: Changing activity: Idle -> Busy
10/06/12 04:18:21 slot1_1: Called deactivate_claim_forcibly()
10/06/12 04:18:21 slot1_1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
10/06/12 04:18:21 Starter pid 2555 exited with status 0
10/06/12 04:18:21 slot1_1: State change: starter exited
10/06/12 04:18:21 slot1_1: State change: No preempting claim, returning to owner
10/06/12 04:18:21 slot1_1: Changing state and activity: Preempting/Vacating -> Owner/Idle
10/06/12 04:18:21 slot1_1: State change: IS_OWNER is false
10/06/12 04:18:21 slot1_1: Changing state: Owner -> Unclaimed
10/06/12 04:18:21 slot1_1: Changing state: Unclaimed -> Delete
10/06/12 04:18:21 slot1_1: Resource no longer needed, deleting
10/06/12 04:18:27 Job no longer matches partitionable slot after MODIFY_REQUEST_EXPR_ edits, retrying w/o edits
10/06/12 04:18:27 slot1: Partitionable slot can't be split to allocate a dynamic slot large enough for the claim
*** End of file StartLog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: ian.cottam@xxxxxxxxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor
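A note on Ian's question: the "new daemon that clears out partitioned slots" in the 7.8 series is condor_defrag, and running it is optional. Partitionable slots work without it; it only drains fragmented machines so that large (for example whole-machine or parallel) claims can match again, and it does nothing to stop the startd from crashing, so the seg faults themselves still look like a bug worth reporting. A minimal, illustrative sketch of the relevant configuration, assuming the stock 7.8 knob names and with placeholder values rather than recommendations:

    # execute nodes: one partitionable slot owning the whole machine
    NUM_SLOTS                 = 1
    NUM_SLOTS_TYPE_1          = 1
    SLOT_TYPE_1               = cpus=100%, memory=100%, disk=100%, swap=100%
    SLOT_TYPE_1_PARTITIONABLE = TRUE

    # one machine in the pool (typically the central manager): optionally
    # run condor_defrag to drain fragmented machines back to whole slots
    DAEMON_LIST                    = $(DAEMON_LIST) DEFRAG
    DEFRAG_INTERVAL                = 600
    DEFRAG_MAX_CONCURRENT_DRAINING = 1
    DEFRAG_MAX_WHOLE_MACHINES      = 4

The "MODIFY_REQUEST_EXPR_ edits" message in the StartLog refers to the MODIFY_REQUEST_EXPR_REQUESTCPUS / REQUESTMEMORY / REQUESTDISK startd knobs, which round a job's resource requests up before the partitionable slot is split; in the excerpt above, the last messages before the crash come from the retry that happens when the rounded request no longer fits.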