[HTCondor-devel] bugfix in ec2job for unhandled openstack error


Date: Mon, 11 Aug 2014 15:37:26 +0000
From: <frank.polgart@xxxxxxxxxxxxxxx>
Subject: [HTCondor-devel] bugfix in ec2job for unhandled openstack error
Dear HTCondor Developers,

short version

I'm looking for some help contributing a bug fix to the error handling of the
EC2 API in the gridmanager.

---- ---- ---- ---- ---- ----
long version

I couldn't find a way to create a ticket; that's why I'm addressing the
mailing list first.
I have a fix and would appreciate feedback on whether the fixed behavior is
suitable for inclusion in the code base.

Bug description:
There is a general error state in OpenStack that is reached after a requested
virtual machine is registered, but before the VM is instantiated. This happens,
for example, when quotas aren't exceeded but OpenStack's scheduler couldn't
find a matching host.
Since the error isn't returned as a direct response to the runInstance request,
but is only exposed later by describeInstances calls, this error state isn't
handled at the moment.
EC2 jobs that go into that error state are kept in the job queue as IDLE and
never recover.
This behavior was discovered using GlideinWMS but can be reproduced manually.

Proposed fix:
I resolved this issue by introducing a new EC2_VM_STATE, which is tested for
in the GM_PROBE_JOB state. The job is then held, but not cleaned up. Keeping
the job was important for use with GlideinWMS.
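
To illustrate the idea (this is not the patch itself, only a minimal sketch in
Python), the probe step could look roughly like the following. The status
strings, the Action names and the probe_job function are all assumptions made
for illustration; in particular "error" is only a placeholder for whatever
value the new EC2_VM_STATE actually matches.

```python
from enum import Enum


class Action(Enum):
    KEEP_PROBING = "keep probing"  # job stays IDLE and is polled again
    RUNNING = "running"            # VM is up
    HOLD = "hold"                  # job is put on hold, but stays in the queue


# Status strings as this sketch assumes describeInstances reports them.
VM_STATE_PENDING = "pending"
VM_STATE_RUNNING = "running"
VM_STATE_ERROR = "error"  # hypothetical value for the new EC2_VM_STATE


def probe_job(vm_state: str) -> Action:
    """Decide what to do with a job after one describeInstances poll."""
    if vm_state == VM_STATE_RUNNING:
        return Action.RUNNING
    if vm_state == VM_STATE_ERROR:
        # New branch: hold the job so the failure becomes visible,
        # without removing the job from the queue.
        return Action.HOLD
    # Pending and any unrecognized state: poll again later.
    return Action.KEEP_PROBING
```

Without the VM_STATE_ERROR branch, a VM in the error state falls through to
KEEP_PROBING on every poll, which corresponds to the job staying IDLE forever.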


regards, Frank Polgart

