[HTCondor-devel] bugfix in ec2job for unhandled openstack error

Date:	Mon, 11 Aug 2014 15:37:26 +0000
From:	<frank.polgart@xxxxxxxxxxxxxxx>
Subject:	[HTCondor-devel] bugfix in ec2job for unhandled openstack error

Dear HTCondor Developers.

short version

I'm looking for help contributing a bug fix for the error handling of the
EC2 API in the gridmanager.

---- ---- ---- ---- ---- ----
long version

I couldn't find a way to create a ticket, which is why I'm addressing the
mailing list first.
I have a fix and would appreciate feedback on whether the fixed behavior is
suitable for inclusion in the code base.

Bug description:
OpenStack has a general error state that is reached after a requested
virtual machine is registered, but before the VM is instantiated. This happens,
for example, when quotas aren't exceeded but OpenStack's scheduler can't find
a matching host.
Since the error isn't returned in direct response to the runInstance request,
but is only exposed later by describeInstances calls, this error state isn't
handled at the moment.
EC2 jobs that enter that error state are kept in the job queue as IDLE and
never recover.
This behavior was discovered using GlideinWMS but can be reproduced manually.

Proposed fix:
I resolved this issue by introducing a new EC2_VM_STATE, which is tested for
in the GM_PROBE_JOB state. The job is then held, but not cleaned up. Keeping
the job was important for use with GlideinWMS.
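
For illustration only, the intended behavior can be sketched in Python. This
is not the actual patch: the gridmanager is not written in Python, and every
name below (probe_job, Ec2Job, the VM_STATE_* strings) is hypothetical. The
state string that OpenStack really reports is not given here, so "error" is
an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    IDLE = "IDLE"
    RUNNING = "RUNNING"
    HELD = "HELD"


# Hypothetical instance-state strings as seen in a describeInstances-style
# response; the real value for the OpenStack error state is an assumption.
VM_STATE_PENDING = "pending"
VM_STATE_RUNNING = "running"
VM_STATE_ERROR = "error"


@dataclass
class Ec2Job:
    instance_id: str
    status: JobStatus = JobStatus.IDLE
    hold_reason: str = ""
    cleaned_up: bool = False


def probe_job(job: Ec2Job, reported_state: str) -> Ec2Job:
    """Update a job from the instance state reported by a later probe."""
    if reported_state == VM_STATE_RUNNING:
        job.status = JobStatus.RUNNING
    elif reported_state == VM_STATE_ERROR:
        # Hold the job instead of leaving it IDLE forever. The job is
        # deliberately not cleaned up, so it stays visible in the queue.
        job.status = JobStatus.HELD
        job.hold_reason = f"instance {job.instance_id} entered an error state"
    # Any other state (e.g. pending) leaves the job unchanged.
    return job
```

Calling probe_job(Ec2Job("i-0001"), VM_STATE_ERROR) in this sketch returns a
job whose status is HELD with cleaned_up still False, while a "pending"
report leaves the job IDLE as before.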


regards, Frank Polgart
