--- Begin Message ---
- Date: Fri, 24 Jun 2011 08:37:35 -0500
- From: "Timothy St. Clair" <tstclair@xxxxxxxxxx>
- Subject: Re: [Condor-users] Unable to start EC2 instance
Universe = grid
grid_resource = ec2 https://ec2.amazonaws.com/
# Executable in this context is just a label for the job
Executable = my_ec2_test_job
transfer_executable = false
Log=$(cluster).ec2.log
Iwd=/tmp
#input
ec2_ami_id =
ec2_instance_type =
ec2_security_groups=
ec2_access_key_id = <YOUR_LOC>/ec2.aid
ec2_secret_access_key = <YOUR_LOC>/ec2.key
#optional
#ec2_elastic_ip =
# in upstream src only, not yet released
# ec2_ebs_volumes =
# ec2_availability_zone =
#safe-loc-output
ec2_keypair_file = <YOUR_LOC>/test1.pem
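Phil's failing submissions used grid_resource = amazon together with amazon_* commands, while the template above expects grid_resource = ec2 and ec2_* commands. The sketch below is a hypothetical pre-submit check, not part of Condor, that flags that kind of mismatch. It assumes that ec2_ami_id, ec2_instance_type, ec2_access_key_id and ec2_secret_access_key must be non-empty. That list is drawn from the "#input" block of this template (ec2_security_groups is treated as optional), not from the Condor manual.

```python
"""Hypothetical pre-submit check for an ec2 grid-universe submit file.

Assumption: the ec2 grid type wants "grid_resource = ec2 <url>" plus the
non-empty keys in REQUIRED_KEYS (taken from the template in this thread).
This is an illustration only and is not part of Condor.
"""

REQUIRED_KEYS = (
    "ec2_ami_id",
    "ec2_instance_type",
    "ec2_access_key_id",
    "ec2_secret_access_key",
)


def parse_submit(text: str) -> dict[str, str]:
    """Parse simple 'key = value' lines; comments and 'queue' lines are skipped.

    Keys are lower-cased because submit-file commands are case-insensitive.
    """
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        settings[key.strip().lower()] = value.strip()
    return settings


def check_ec2_submit(text: str) -> list[str]:
    """Return human-readable problems; an empty list means no problem found."""
    settings = parse_submit(text)
    problems = []
    resource = settings.get("grid_resource", "")
    if resource.split()[:1] != ["ec2"]:
        problems.append(f"grid_resource should start with 'ec2', got {resource!r}")
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append(f"missing or empty {key}")
    # amazon_* commands belong to the older SOAP-based amazon grid type.
    for key in sorted(settings):
        if key.startswith("amazon_"):
            problems.append(f"{key} is an amazon-grid-type command")
    return problems
```

For example, check_ec2_submit(open("job.sub").read()) returns an empty list for a file that follows the template and lists each problem otherwise.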
Hope this helps,
Tim
On Thu, 2011-06-23 at 19:40 -0700, Philip Papadopoulos wrote:
> Closer, but not quite there.
>
> [root@vizagra ~]# tail -f /var/opt/condor/log/GridmanagerLog.phil
> 06/23/11 19:37:22 [25245] Found job 8.0 --- inserting
> 06/23/11 19:37:22 [25245] gahp server not up yet, delaying ping
> 06/23/11 19:37:22 [25245] (8.0) doEvaluateState called: gmState GM_INIT, condorState 1
> 06/23/11 19:37:22 [25245] GAHP server pid = 25247
> 06/23/11 19:37:28 [25245] resource https://ec2.amazonaws.com/ is now up
> 06/23/11 19:37:28 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
> 06/23/11 19:37:28 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
> 06/23/11 19:37:29 [25245] (8.0) doEvaluateState called: gmState GM_DESTROY_KEYPAIR_SUBMIT, condorState 1
> 06/23/11 19:37:32 [25245] (8.0) doEvaluateState called: gmState GM_CREATE_KEYPAIR, condorState 1
> 06/23/11 19:37:32 [25245] ERROR "Bad EC2_VM_START Request: E" at line 2256 in file /state/partition1/condor/src/condor_gridmanager/gahp-client.cpp
>
>
> If you can tell me where to put debug statements in the ec2_gahp
> files, I can do that.
> -P
>
>
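Before adding debug statements, it can help to reduce a GridmanagerLog to the sequence of gmState values each job passed through and the ERROR that stopped it. The helper below is hypothetical. It assumes one log entry per line in the format quoted above, and its regular expressions are written against those lines, not against any documented log format.

```python
import re

# Hypothetical patterns written against lines such as:
# 06/23/11 19:37:29 [25245] (8.0) doEvaluateState called: gmState GM_CHECK_VM, condorState 1
STATE_RE = re.compile(
    r"\((?P<job>\d+\.\d+)\) doEvaluateState called: gmState (?P<state>\w+)"
)
# ... ERROR "Bad EC2_VM_START Request: E" at line 2256 in file ...
ERROR_RE = re.compile(r'ERROR "(?P<msg>[^"]*)"')


def summarize(log_text: str) -> tuple[dict[str, list[str]], list[str]]:
    """Return (per-job gmState sequence with consecutive repeats collapsed, errors)."""
    states: dict[str, list[str]] = {}
    errors: list[str] = []
    for line in log_text.splitlines():
        m = STATE_RE.search(line)
        if m:
            seq = states.setdefault(m.group("job"), [])
            if not seq or seq[-1] != m.group("state"):
                seq.append(m.group("state"))
            continue
        e = ERROR_RE.search(line)
        if e:
            errors.append(e.group("msg"))
    return states, errors
```

Applied to the unwrapped log above, this yields GM_INIT, GM_CHECK_VM, GM_DESTROY_KEYPAIR_SUBMIT, GM_CREATE_KEYPAIR for job 8.0, followed by the single error "Bad EC2_VM_START Request: E".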
> On Thu, Jun 23, 2011 at 5:36 PM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> I believe with the new ec2_gahp you need "grid_resource = ec2 https://ec2.amazonaws.com/"
>
> Best,
>
> matt
>
>
>
> On 06/23/2011 07:30 PM, Philip Papadopoulos wrote:
>
>
> Still no love....
> I git cloned the head of the condor tree, rebuilt it, and copied
> condor_submit, condor_gridmanager, and ec2_gahp into bin, sbin, and
> sbin respectively.
>
> I changed the condor config to use the new gahp.
> $ condor_config_val -dump | grep AMAZON
> AMAZON_GAHP = $(SBIN)/ec2_gahp
> AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
> GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20
>
> And then submitted with
> universe = grid
> grid_resource = amazon https://ec2.amazonaws.com/
> periodic_release = NumHolds < 3
> +NumHolds = 0
> periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
> executable = RunEC2VM
> amazon_keypair_file = keypair.$(Process)
>
> amazon_ami_id = ami-4ed12d27
> amazon_instance_type = m1.large
> amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
> amazon_private_key = /home/phil/.ec2/pk.pem
> amazon_public_key = /home/phil/.ec2/cert.pem
>
> queue 1
>
> as before.
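The periodic_release and periodic_remove lines in this submit file are ClassAd expressions that Condor re-evaluates periodically. As a rough reading aid, here they are transcribed into Python with the job attributes passed in explicitly. JobStatus == 2 means the job is running, and NumHolds is the custom attribute initialized by +NumHolds = 0. The function names are hypothetical, and the sketch ignores ClassAd UNDEFINED semantics (for example, an unset ShadowBday), so it only approximates how Condor evaluates the expressions.

```python
RUNNING = 2  # JobStatus value Condor uses for a running job
ONE_HOUR = 1 * 60 * 60  # seconds, as written in the submit file


def periodic_release(num_holds: int) -> bool:
    """periodic_release = NumHolds < 3"""
    return num_holds < 3


def periodic_remove(num_holds: int, job_status: int, now: int, shadow_bday: int) -> bool:
    """periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)

    `now` stands in for time(); both times are Unix timestamps in seconds.
    """
    return num_holds >= 3 or (job_status == RUNNING and now - shadow_bday > ONE_HOUR)
```

Note the strict `>`: a job that has been running for exactly 3600 seconds is not yet removed by the time clause.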
>
> GridManager.log shows
> 06/23/11 16:21:27 Setting maximum accepts per cycle 4.
> 06/23/11 16:21:29 [27034] ================================> AmazonJob::AmazonJob 1
> 06/23/11 16:21:29 [27034] Found job 2.0 --- inserting
> 06/23/11 16:21:29 [27034] gahp server not up yet, delaying ping
> 06/23/11 16:21:29 [27034] (2.0) doEvaluateState called: gmState GM_INIT, condorState 1
> 06/23/11 16:21:29 [27034] GAHP server pid = 27038
> 06/23/11 16:21:34 [27034] ERROR "Bad AMAZON_VM_STATUS_ALL Request: E" at line 2256 in file /state/partition1/condor/src/condor_gridmanager/gahp-client.cpp
>
> From this same node, I can use the native EC2 command-line tools to
> start, stop, and query instances, e.g.:
> $ ec2-describe-instances
> RESERVATION r-ef3f0283 126101316194 default
> INSTANCE i-d91433b7 ami-4ed12d27 ec2-50-17-131-129.compute-1.amazonaws.com ip-10-110-235-155.ec2.internal running 0 m1.large 2011-06-23T23:05:41+0000 us-east-1c aki-e5c1218c monitoring-disabled 50.17.131.129 10.110.235.155 instance-store paravirtual xen sg-427ca02b
>
>
> and
>
> ec2-terminate-instances i-d91433b7
> INSTANCE i-d91433b7 running shutting-down
>
>
> -P
>
> On Thu, Jun 23, 2011 at 7:41 AM, Philip Papadopoulos <philip.papadopoulos@xxxxxxxxx> wrote:
>
>
> I will try that when I get in this AM (I'm on the West Coast) and
> report back.
> Thanks,
> Phil
>
> On Thu, Jun 23, 2011 at 7:34 AM, Timothy St. Clair <tstclair@xxxxxxxxxx> wrote:
>
> You could extract just condor_submit, the gridmanager, and ec2_gahp.
>
> Cheers,
> Tim
>
> On Thu, 2011-06-23 at 07:26 -0700, Philip Papadopoulos wrote:
> > Do I need all of condor 7.7 or can I just extract the ec2_gahp
> > executable from it?
> >
> > Thanks,
> > Phil
> >
> >
> >
> > On Thu, Jun 23, 2011 at 4:56 AM, Matthew Farrellee <matt@xxxxxxxxxx> wrote:
> >
> > On 06/22/2011 02:49 PM, Philip Papadopoulos wrote:
> >
> >
> > Trying out Condor 7.6.1 -- installed via the rhap.stripped.tar.gz
> >
> > I get the following in my GAHP log.
> > 06/22/11 09:33:37 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
> > 06/22/11 09:38:38 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
> > "End of file or no input: Operation interrupted or timed out"
> > Detail: [no detail]
> >
> > 06/22/11 09:38:38 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
> > 06/22/11 09:42:08 EOF reached on pipe 0
> > 06/22/11 09:42:08 stdin buffer closed, exiting
> > 06/22/11 09:47:19 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
> > "End of file or no input: Operation interrupted or timed out"
> > Detail: [no detail]
> >
> > 06/22/11 09:47:19 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
> > 06/22/11 09:48:33 EOF reached on pipe 0
> > 06/22/11 09:48:33 stdin buffer closed, exiting
> > 06/22/11 09:49:18 Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
> > "End of file or no input: Operation interrupted or timed out"
> > Detail: [no detail]
> >
> > 06/22/11 09:49:18 Command(AMAZON_VM_STATUS_ALL) got error(code:Client, msg:End of file or no input: Operation interrupted or timed out
> >
> >
> > The submission file is simple:
> > universe = grid
> > grid_resource = amazon https://ec2.amazonaws.com/
> > periodic_release = NumHolds < 3
> > +NumHolds = 0
> > periodic_remove = NumHolds >= 3 || (JobStatus == 2 && time()-ShadowBday > 1*60*60)
> > executable = RunEC2VM
> > amazon_keypair_file = keypair.$(Process)
> >
> > amazon_ami_id = ami-4ed12d27
> > amazon_instance_type = m1.large
> > amazon_user_data = condor:landphil.rocksclusters.org:40000:50000
> > amazon_private_key = /home/phil/.ec2/pk.pem
> > amazon_public_key = /home/phil/.ec2/cert.pem
> >
> > queue 1
> >
> > And the condor_config_val output (the salient settings, I think):
> > $ condor_config_val -dump | grep -i amazon
> > AMAZON_GAHP = $(SBIN)/amazon_gahp
> > AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
> > GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE_AMAZON = 20
> >
> > and
> > $ condor_config_val -dump | grep -i ssl
> > SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
> > SOAP_SSL_SKIP_HOST_CHECK = True
> >
> > I've tried both with and without SOAP_SSL_SKIP_HOST_CHECK.
> > The SOAP_SSL_CA_FILE exists.
> > If I try WITHOUT the SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
> > setting, then I get
> > Call to DescribeInstances failed: SOAP 1.1 fault: SOAP-ENV:Client [no subcode]
> > "SSL_ERROR_SSL
> > error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed"
> > Detail: SSL connect failed in tcp_connect()
> >
> >
> > Right now I'm flummoxed.
> >
> > Thanks,
> > Phil
> >
> > --
> > Philip Papadopoulos, PhD
> > University of California, San Diego
> > 858-822-3628 (Ofc)
> > 619-331-2990 (Fax)
> >
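The "certificate verify failed" error means the certificate presented by ec2.amazonaws.com could not be verified against the CA certificates in use. One way to test a CA bundle independently of Condor is to open a TLS connection with it directly. The sketch below uses Python's standard ssl module, and the function name check_tls is hypothetical. It only shows whether that bundle verifies the host from your machine. It does not show how the gahp's SOAP library uses the bundle.

```python
import socket
import ssl


def check_tls(host: str, cafile: str, port: int = 443, timeout: float = 10.0) -> str:
    """Hypothetical helper: return the server certificate's commonName if it verifies.

    Raises ssl.SSLCertVerificationError if the chain does not verify against
    cafile, and OSError if cafile cannot be read or the host is unreachable.
    """
    # With cafile given, only that bundle is trusted; hostname checking is on.
    ctx = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # cert["subject"] is a tuple of RDNs, each a tuple of (key, value) pairs.
    subject = dict(rdn[0] for rdn in cert["subject"])
    return subject.get("commonName", "")
```

For example, check_tls("ec2.amazonaws.com", "/etc/pki/tls/cert.pem") returns a name when the bundle verifies the host and raises ssl.SSLCertVerificationError (Python 3.7+) when it does not.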
> > Phil,
> >
> > I'm assuming you aren't getting those errors 100% of the time, and
> > that you're actually talking to AWS's EC2 service.
> >
> > I've seen similar intermittent issues in the past. They came and
> > went from day to day. After much investigation, I eventually chalked
> > them up to transient issues with AWS's EC2 SOAP interface. The
> > amazon_gahp was Condor's first means of interacting with EC2 and was
> > written against the (then popular) SOAP interface. Over the years,
> > the EC2 Query interface has apparently taken hold as the interface
> > of choice, and many EC2 clones do not support SOAP. In response, the
> > ec2_gahp, available in 7.7, has been written against the Query
> > interface. You should try it out, especially on a day when the SOAP
> > interface is failing, so that we might get a better handle on
> > whether the issue is truly SOAP vs. Query.
> >
> > Best,
> >
> > matt
> >
> >
> >
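Matt's distinction between the SOAP and Query interfaces shows up in the credentials each gahp needs. The submit files in this thread give the amazon grid type an X.509 key and certificate (amazon_private_key, amazon_public_key), whereas Tim's ec2 template asks for ec2_access_key_id and ec2_secret_access_key, because a Query request is authenticated by an HMAC signature computed with the secret access key. Below is a minimal sketch of AWS Signature Version 2, the HMAC-SHA256 scheme the EC2 Query API documented at the time. It illustrates the protocol only. It is not the ec2_gahp's code, and whether the ec2_gahp used exactly this signature version is an assumption. The function names are hypothetical.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def rfc3986(s: str) -> str:
    """Percent-encode everything except the unreserved set A-Z a-z 0-9 - _ . ~."""
    return quote(s, safe="-_.~")


def canonical_query(params: dict[str, str]) -> str:
    """Sort parameters by name and join them as name=value pairs with '&'."""
    return "&".join(f"{rfc3986(k)}={rfc3986(v)}" for k, v in sorted(params.items()))


def sign_v2(secret_key: str, host: str, path: str, params: dict[str, str],
            verb: str = "GET") -> str:
    """Return the base64 HMAC-SHA256 Signature value for a Version 2 request.

    params should already contain AWSAccessKeyId, Action, SignatureMethod
    ("HmacSHA256"), SignatureVersion ("2"), Timestamp and Version, but not
    Signature itself.
    """
    # StringToSign = verb \n lowercase host \n request path \n canonical query
    string_to_sign = "\n".join([verb, host.lower(), path, canonical_query(params)])
    digest = hmac.new(secret_key.encode("utf-8"), string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")
```

The resulting value is added, itself URL-encoded, as the Signature parameter of a request such as https://ec2.amazonaws.com/?Action=DescribeInstances&..., the same call the amazon_gahp was making over SOAP in the logs quoted above.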
>
> >
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
> > with a subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
--- End Message ---