
Re: [HTCondor-users] Job Realtime output file



On Sat, 9 Mar 2013, Guillermo Marco Puche wrote:

I know those directives are SGE directives. From my point of view, if SGE handles the job it must also be able to handle its own error and output logs.
The trouble here is that SGE is being handed, for a number of hard 
reasons, a Russian doll of scripts to execute. Your job is the smallest 
doll, while the -o and -e directives (and yes, you are overriding the 
directives set by default by 'bosco') apply to the outermost doll. It's 
very likely that stdout and stderr are already being diverted at inner 
layers. If you'd really like to see streaming stdout from your job, your 
best option (until we have some form of out-of-the-box Condor 'standard 
universe' for 'grid' or 'vanilla' universe jobs, which would indeed come 
in handy for many other applications) is probably to set up some form of 
remote I/O yourself.
If you have at least outbound network connectivity from the worker nodes 
to the submit node you could try using 'chirp' (a standalone incarnation 
of the Aitch-Tee-Condor Remote I/O protocol, which may eventually 
be "re-"integrated into the 'grid' universe as the remote I/O method of 
choice).
In its simplest form:

0) Grab and install 'cctools', and make it available on the submit
   and worker nodes.
   http://www.cse.nd.edu/~ccl/software/download.shtml
   (the site seems to be down right now)

1) Start chirp_server on the submit node (will bind on port
   9094 by default, use *no* authentication/authorisation and
   write files in the current directory).

2) Run your payload on the worker nodes with
   ./payload | chirp_put -t -1 -b 4096 - submit_node.domain my_job_output.$$

You should then be getting a streaming update (with 4 kB buffering, which 
is pretty much the minimum you can get by default from fstreams) of the 
stdout of your job(s) as 'my_job_output.script_PID' on 
submit_node.domain, in the directory from which you started chirp_server.
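
As a rough sketch of step 2, the pipe can be wrapped in a small shell function inside the job script. The names stream_stdout, CHIRP_PUT, SUBMIT_NODE and REMOTE_NAME are made up for illustration; the chirp_put options are just the ones quoted in step 2. Falling back to plain 'cat' when chirp_put cannot be found is an assumption on my part, so that the output at least ends up wherever the batch system puts it.

```shell
#!/bin/sh
# Hypothetical settings; override them from the environment if needed.
CHIRP_PUT="${CHIRP_PUT:-chirp_put}"
SUBMIT_NODE="${SUBMIT_NODE:-submit_node.domain}"
REMOTE_NAME="${REMOTE_NAME:-my_job_output.$$}"   # $$ = PID of this script

# Read stdin and stream it to the chirp server on the submit node.
# If chirp_put is not available, pass the data through to local stdout.
stream_stdout() {
    if command -v "$CHIRP_PUT" >/dev/null 2>&1; then
        "$CHIRP_PUT" -t -1 -b 4096 - "$SUBMIT_NODE" "$REMOTE_NAME"
    else
        cat
    fi
}
```

The job script would then end with something like './payload | stream_stdout'. Note that the exit status of that pipeline is the one of stream_stdout, not of the payload.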
There are countless variations of this scheme (add 
authentication/authorisation, send the 'chirp_put' executable along with 
the job if you cannot install it on the worker nodes, use a different 
naming scheme, run the job via 'parrot', etc.) but it should serve your 
basic need in any environment.
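
For the variation where 'chirp_put' travels with the job, a minimal sketch (the function name find_chirp_put is hypothetical, and I am assuming the binary lands in the job's working directory) could look for a shipped copy first and only then search the PATH:

```shell
#!/bin/sh
# Print the path of a usable chirp_put: prefer an executable copy shipped
# into the current (job) directory, otherwise look it up in PATH.
# Returns non-zero if neither is found.
find_chirp_put() {
    if [ -x ./chirp_put ]; then
        printf '%s\n' "$PWD/chirp_put"
    elif command -v chirp_put >/dev/null 2>&1; then
        command -v chirp_put
    else
        return 1
    fi
}
```

It would be used along the lines of 'CHIRP_PUT=$(find_chirp_put) && ./payload | "$CHIRP_PUT" -t -1 -b 4096 - submit_node.domain my_job_output.$$'. How the binary gets shipped (the job's input file transfer list, for instance) depends on your setup.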
Does this still make sense?

Francesco Prelz
INFN-MI