Hi Todd,
Hummm... it does not work...
Hi Roberto,
Please follow the Quick Start link I gave you in my last post. And/or read sections 2.4 and 2.5 in the HTCondor Manual.
The below won't work because you have "universe=standard" in your submit file. Change to "universe=vanilla" as I already suggested, or simply remove that line (vanilla is the default setting).
Also the submit file below does not tell HTCondor to transfer any files (like the executable) from your submit machine to your worker node, so a shared file system is assumed. /tmp is never shared across machines, so the only node this job could possibly run on is the same node you submitted the job from.
All of this is discussed in the quick start guide previously mentioned (link to it is on the HTCondor.org homepage), and also documented in detail in the Manual. I think you will save yourself a lot of time by following the Quick Start Guide - it gives clear cut-and-paste examples and is not very long. Let us know if you find it helpful.
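A minimal sketch of the two changes described above (vanilla universe plus explicit file transfer), assuming it is run in an empty working directory on the submit machine. The file names teste1.sub and teste1.log are illustrative assumptions, not from this thread. The script is also changed to print to stdout, which is an assumption on my part so that the text comes back in the Output file instead of landing in /tmp on the worker node.

```shell
#!/bin/sh
# Sketch: write the job script and a vanilla-universe submit file.
# teste1.sub and teste1.log are illustrative names, not from the thread.

cat > teste1.sh <<'EOF'
#!/bin/sh
# Print to stdout so HTCondor returns the text in the Output file,
# rather than writing to /tmp on whichever node runs the job.
echo "works"
EOF
chmod 755 teste1.sh

cat > teste1.sub <<'EOF'
# Vanilla universe: runs ordinary executables and scripts.
universe                = vanilla
executable              = teste1.sh
arguments               = 1
output                  = foo.out1
error                   = foo.err1
# Per-job event log written on the submit machine.
log                     = teste1.log
# Do not assume a shared file system; copy files to and from the worker.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
EOF

# Check that the script runs locally before handing it to HTCondor.
./teste1.sh
```

If the local run prints the expected text, the job can then be submitted with condor_submit teste1.sub; the per-job log file is usually the first place to look when a job leaves the queue without producing output.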
Hope this helps
Todd
The script is named teste1.sh, it's chmod'ed 777. The contents are:
#!/bin/sh
echo "works" >> /tmp/itisworking.txt
The submission file is
####################
#
# submit description file
# Example 1: queuing multiple jobs with differing
# command line arguments and output files.
#
####################
Executable = teste1.sh
Universe   = standard
Arguments  = 1
Output     = foo.out1
Error      = foo.err1
Queue
All the logs are empty at the end.
The result of the ShadowLog is
12/02/17 06:56:08 (?.?) (81788):*******************************************
12/02/17 06:56:08 (?.?) (81788):uid=0, euid=122, gid=0, egid=131
12/02/17 06:56:08 (?.?) (81788):Hostname = "<xxx.xxx.xxx.xxx:17345?addrs=xxx.xxx.xxx.xxx-17345>", Job = 39.0
12/02/17 06:56:08 (39.0) (81788):Requesting Primary Starter
12/02/17 06:56:08 (39.0) (81788):Shadow: Request to run a job was ACCEPTED
12/02/17 06:56:08 (39.0) (81788):Shadow: RSC_SOCK connected, fd = 17
12/02/17 06:56:08 (39.0) (81788):Shadow: CLIENT_LOG connected, fd = 18
12/02/17 06:56:08 (39.0) (81788):My_Filesystem_Domain = "my domain"
12/02/17 06:56:08 (39.0) (81788):My_UID_Domain = "my domain"
12/02/17 06:56:08 (39.0) (81788):Can't get address for checkpoint server host (NULL): No such file or directory
12/02/17 06:56:08 (39.0) (81788): Entering pseudo_get_file_stream
12/02/17 06:56:08 (39.0) (81788): file = "/var/lib/condor/spool/39/cluster39.ickpt.subproc0"
12/02/17 06:56:08 (39.0) (81788):Created TCP listen socket <192.168.0.2:23983>
12/02/17 06:56:08 (39.0) (81788):Shadow: Job 39.0 exited, termsig = 0, coredump = 128, retcode = 0
12/02/17 06:56:08 (39.0) (81788):user_time = 0 ticks
12/02/17 06:56:08 (39.0) (81788):sys_time = 0 ticks
12/02/17 06:56:08 (39.0) (81788):Shadow: Cannot notify user( Condor Job 39.0, tavares, w )
12/02/17 06:56:08 (39.0) (81788):Static Policy: removing job because OnExitRemove has become true
12/02/17 06:56:08 (39.0) (81788):********** Shadow Exiting(102) **********
(xxx.xxx.xxx.xxx is my IP for the external network - eth0; 192.168.0.2 is the internal IP - eth1)
Is there any other relevant log or anything else that I should look for?
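When reading a busy ShadowLog, it can help to keep only the lines for one job. The sketch below is a hypothetical helper (the function name shadow_lines_for_job is an assumption, not an HTCondor tool) that filters on the "(cluster.proc)" field seen in the log lines in this thread; the two sample lines are copied from that log.

```shell
#!/bin/sh
# Hypothetical helper: print only the log lines that carry a given job id.
# Reads the log on stdin, e.g.: shadow_lines_for_job 39.0 < ShadowLog
shadow_lines_for_job() {
    # -F matches a fixed string, so the dot and parentheses are literal.
    grep -F "($1)"
}

# Demonstration on two lines taken from the log quoted in this thread.
printf '%s\n' \
    '12/02/17 06:56:08 (?.?) (81788):uid=0, euid=122, gid=0, egid=131' \
    '12/02/17 06:56:08 (39.0) (81788):Requesting Primary Starter' \
    | shadow_lines_for_job 39.0
```

Lines logged before the shadow knows its job id are tagged (?.?) and are dropped by this filter, so the startup banner has to be read separately.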
Thanks!
Roberto
On Fri, Dec 1, 2017 at 7:36 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 12/1/2017 3:07 PM, Roberto Tavares wrote:
Hello,
I think I'm almost there!
I'm trying to run a simple script:
echo "It Works" >> /tmp/thisshouldwork.txt
Are you specifying "thisshouldwork.txt" as your executable? If so, I would not expect it to work. Does it work from the command prompt without involving HTCondor? (my guess is no). Instead of
  echo "It Works"
you probably want
  #!/bin/sh
  echo "It Works"
and then do chmod 700 thisshouldwork.txt (to set the executable bit). This is life on a Linux/Unix environment, nothing specific to HTCondor here.
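The steps above can be sketched as follows, assuming a POSIX shell and a writable current directory. The file name itworks.sh is an illustrative assumption; the point is only that a script needs a shebang line and the executable bit before it can be run directly, with or without HTCondor.

```shell
#!/bin/sh
# Sketch: create a script with a shebang line, set the executable bit, run it.
# itworks.sh is an illustrative name, not one used in the thread.

cat > itworks.sh <<'EOF'
#!/bin/sh
echo "It Works"
EOF

# 700 = read, write and execute for the owner only.
chmod 700 itworks.sh

# Run it from the command prompt, without involving HTCondor.
./itworks.sh
```

If this fails at the command prompt, it will also fail as an HTCondor job, so this is a cheap check to make before submitting.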
Take a look at
http://research.cs.wisc.edu/htcondor/manual/quickstart.html
I think you will find it helpful at getting started, it covers the above issues.
In looking at the log below, it looks like you submitted the job to HTCondor's "standard" universe, which you likely do not want to do (unless you have the C or Fortran source code to your program). Instead, you want the 'vanilla' universe, by placing the following into your submit file:
  universe = vanilla
(This is the default on recent HTCondor installs.)
Hope the above helps,
Todd
What happens:
- it goes to the queue
- it is removed from the queue
- it does not run (log files empty and file in tmp is not created) and it seems to fall into some black hole... :(
The only place I could find that shows any error is the ShadowLog file, which gives me:
12/01/17 18:51:25 (?.?) (74915):******* Standard Shadow starting up *******
12/01/17 18:51:25 (?.?) (74915):** $CondorVersion: 8.4.12 Jul 06 2017 BuildID: 409562 $
12/01/17 18:51:25 (?.?) (74915):** $CondorPlatform: x86_64_Ubuntu14 $
12/01/17 18:51:25 (?.?) (74915):*******************************************
12/01/17 18:51:25 (37.0) (74915):*Can't get address for checkpoint server host (NULL): No such file or directory*
12/01/17 18:51:25 (?.?) (74915):uid=0, euid=122, gid=0, egid=131
12/01/17 18:51:25 (?.?) (74915):Hostname = "<xxx.xxx.xxx.xxx:17345?addrs=xxx.xxx.xxx.xxx-17345>", Job = 37.0
12/01/17 18:51:25 (37.0) (74915):Requesting Primary Starter
12/01/17 18:51:25 (37.0) (74915):Shadow: Request to run a job was ACCEPTED
12/01/17 18:51:25 (37.0) (74915):Shadow: RSC_SOCK connected, fd = 17
12/01/17 18:51:25 (37.0) (74915):Shadow: CLIENT_LOG connected, fd = 18
12/01/17 18:51:25 (37.0) (74915):My_Filesystem_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):My_UID_Domain = "my domain"
12/01/17 18:51:25 (37.0) (74915):    Entering pseudo_get_file_stream
12/01/17 18:51:25 (37.0) (74915):    file = "/var/lib/condor/spool/37/cluster37.ickpt.subproc0"
12/01/17 18:51:25 (37.0) (74915):*Shadow: Cannot notify user( Condor Job 37.0, tavares, w )*
12/01/17 18:51:25 (37.0) (74915):Created TCP listen socket <xxx.xxx.xxx.xxx:41412>
12/01/17 18:51:25 (37.0) (74915):Shadow: Job 37.0 exited, termsig = 0, coredump = 128, retcode = 0
12/01/17 18:51:25 (37.0) (74915):user_time = 1 ticks
12/01/17 18:51:25 (37.0) (74915):sys_time = 0 ticks
12/01/17 18:51:25 (37.0) (74915):Static Policy: removing job because OnExitRemove has become true
12/01/17 18:51:25 (37.0) (74915):********** Shadow Exiting(102) **********
Just to keep it simple, I'd rather avoid using the checkpoint server. Is that possible?
I'm a little clueless now... can you give me any help on that?
Thank you!!!!
Roberto
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685