
Re: [HTCondor-users] MPI issues on Windows



I believe the errors you're seeing are due to the policy for when to consider the job complete. The default is to consider the job complete when node 0 exits. At that point, the shadow begins to clean up and tell the EPs to kill the other nodes of the job. If the other nodes have also exited, their starters may be trying to transfer output files while being told to shut down. It looks like it can get chaotic.

You can change the policy so that the job is not considered complete (and cleanup is not triggered) until all nodes of the parallel job have exited. To do so, add this to your submit file:

+ParallelShutdownPolicy = "WAIT_FOR_ALL"
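
For reference, a complete parallel-universe submit file with that policy in place might look like the sketch below; the executable, file names, and machine count are placeholders, not taken from your setup:

    universe                = parallel
    executable              = mpi_wrapper.bat
    machine_count           = 12
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    output                  = out.$(Node)
    error                   = err.$(Node)
    log                     = mpi.log
    # wait for every node to exit before the job is considered complete
    +ParallelShutdownPolicy = "WAIT_FOR_ALL"
    queue

With that policy, the shadow only starts cleanup once every node has exited, so the non-zero nodes get a chance to finish transferring their output files first.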

 - Jaime

On Dec 12, 2023, at 5:48 PM, Sam.Dana@xxxxxxxxxxx wrote:

My MPI jobs were "completing" and giving me output, but was it the expected output? I hadn't a clue, but something seemed off. Fortunately, in "Shadow exit with status 108", Jaime suggested comparing the ShadowLog with the StartLogs - thank you!

I am running 23.0.1 on a small pool: a CM w/credD host, a dedicated AP host, 2 EP hosts.
MPI Job:   machine_count=12
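
For context, the dedicated-scheduler side of a pool like this is typically wired up with condor_config knobs along the lines of the sketch below on the EPs; ap-host.mydomain is a placeholder for the actual AP hostname, not the real value from this pool:

    # condor_config on EP1 and EP2: advertise which dedicated scheduler may claim these slots
    DedicatedScheduler = "DedicatedScheduler@ap-host.mydomain"
    STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
    # prefer jobs coming from that dedicated scheduler
    RANK = Scheduler =?= $(DedicatedScheduler)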

What shall I make of these discrepancies? 

  1. On EP1 I get: New machine resource allocated
     on EP2 I get: New machine resource of type -1 allocated
       --- what is the significance of "type -1"?

  2. The first "Request to run..." includes the slot, but the others don't
    Request to run on slot1_2@ep1<...> was ACCEPTED vs 
    Request to run on <...> <...> was ACCEPTED 
        --- I assume the first is the "head" while the other 11 are "workers"

  3. Each slot shows "Got activate_claim request from shadow (ip-address)", but then, on the AP, ...
    ERROR in ParallelShadow::updateFromStarter: no Node defined in update ad, can't process!
         --- what's the story with update_ad?  I don't see such a file.
    followed by, for each slot: 
    on EP1 and EP2:  Called deactivate_claim_forcibly() 
    on EP1 only:  Failed to open .job.ad file: 3
    on EP1 only:  Failed to write ToE tag to .job.ad file (0): No error
         --- why do EP1 and EP2 behave differently?

  4. on EP1: 
    condor_write(): Socket closed when trying to write 559 bytes to daemon at <EP1-ip-addr:9618>, fd is 1852
    Buf::write(): condor_write() failed
    SECMAN: failed to end classad message
    Send_Signal: Warning: could not send signal 3 (SIGQUIT) to pid 12936 (no longer exists)
    slot1_5[90.0]: Error sending signal to starter, errno = 0 (No error)
    meanwhile, on the AP:
    (90.0) (4744): condor_read(): Socket closed abnormally when trying to read 5 bytes from startd at <EP1-ip-addr:9618>, errno=10054 
    (90.0) (4744): Job 90.0 terminated: exited with status 0
    (90.0) (4744): Reporting job exit reason 100 and attempting to fetch new job.
    (90.0) (4744): ERROR: SharedPortEndpoint: Named pipe does not exist.
    (90.0) (4744): SharedPortEndpoint: StopListener could not stop the listener thread: 0
    (90.0) (4744): **** condor_shadow (condor_SHADOW) pid 4744 EXITING WITH STATUS 100

  5. On both EP1 and EP2 (see the config sketch after this list):
    PERMISSION DENIED to AP-HOST$@mydomain from host AP-ip-addr for command 403 (DEACTIVATE_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request;
    DC_AUTHENTICATE: Command not authorized, done!

  6. Some of the output and error files get copied back to the AP, while others don't, which was my first clue that things weren't quite right.
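
Regarding the DEACTIVATE_CLAIM denial in item 5: the kind of entry the EPs' DAEMON-level policy appears to be missing looks roughly like the sketch below. The identity string is copied from the denial message; whether this is the right fix for this particular pool's security setup is an assumption.

    # condor_config.local on EP1 and EP2 - sketch only, adjust to your authentication setup
    ALLOW_DAEMON = $(ALLOW_DAEMON), AP-HOST$@mydomain
    # then run condor_reconfig on each EP to pick up the change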

Are there additional resources for testing and exploring the MPI "Science" example?
Any ideas how to overcome these obstacles?

Thanks, 
Sam


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/