[Halld-offline] [EXTERNAL] Re: MCwrapper token

Richard T. Jones richard.t.jones at uconn.edu
Mon Oct 21 13:56:15 EDT 2024


This script runs on the submit host in the background, right? So how does
the updated token get relayed to the jobs running on the OSG workers? Here
is what ChatGPT says about this:

> When you update your SciToken using htgettoken on your *GlideinWMS (GWMS)*
> submit host, the updated bearer token is not automatically relayed to jobs
> already running on remote Open Science Grid (OSG) workers. In HTCondor, the
> bearer token (or any credential) is usually distributed to the workers at
> the time the job starts, so any changes made to the token afterward on the
> submit host would not be propagated to jobs that are already running.
> Potential Solutions:
> *Automatic Token Refresh*:
> To manage long-running jobs, you need to ensure that your jobs can refresh
> the token themselves. One approach is to:
>
>    - Include the *refresh token* along with the job submission.
>
>
>    - Have the job periodically refresh the SciToken using the refresh
>    token (perhaps through a script that checks for token expiration).
>
> *Job Restart*:
> If the token must be updated after the job starts and you cannot implement
> an in-job refresh mechanism, you might need to *restart the job* with the
> updated token.
> *Shared Filesystem*:
> If the workers and submit hosts share a filesystem (such as an NFS or
> distributed filesystem), you could save the updated token in a shared
> location that the workers can access. Each job would then check that
> location periodically for an updated token.
> *Re-submit the Job*:
> Another option would be to hold, remove, and resubmit jobs after the new
> token has been updated, ensuring that the jobs receive the new token when
> they start.
> The token lifecycle needs to be handled in a way that takes into account
> the time delay between job submission and execution, particularly on a
> distributed infrastructure like OSG. Integrating token refreshing into your
> job scripts is the most reliable way to handle this issue without needing
> to resubmit jobs.
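
The in-job refresh approach could look something like the wrapper below. This is only a sketch, not anything MCwrapper provides: it assumes htgettoken is available on the worker and that the worker can authenticate to the vault, which in practice means shipping the vault token along with the job. The host, issuer, and role values mirror the cron script quoted later in this thread.

```shell
#!/bin/bash
# Hypothetical in-job wrapper: renew the bearer token periodically
# while the real payload runs, so long jobs keep a valid token.

refresh_loop() {
    # Renew the token while the payload (PID $1) is still alive;
    # REFRESH_INTERVAL defaults to 30 minutes.
    while kill -0 "$1" 2>/dev/null; do
        sleep "${REFRESH_INTERVAL:-1800}"
        htgettoken --nooidc -a htvault.jlab.org -i jlab -r gluex \
            || echo "token refresh failed" >&2
    done
}

run_with_refresh() {
    "$@" &                      # launch the real payload
    local payload_pid=$!
    refresh_loop "$payload_pid" &
    local refresher_pid=$!
    wait "$payload_pid"         # propagate the payload's exit status
    local status=$?
    kill "$refresher_pid" 2>/dev/null
    return "$status"
}
```

Whether this helps at all depends on the worker being able to reach the vault with valid credentials, which is exactly the open question below.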


Here is some follow-up of my own.

   1. When I run the htgettoken command from your example script, it
   requires an interactive authentication cycle with the InCommon identity
   provider the first time I run it.
   2. After that first run, the "refresh token" is saved on
   htvault.jlab.org under my userid, together with a secret passcode that
   is saved under my CUE account on the submit host.
   3. I can request a fresh bearer token any time after that from the
   vault, accompanying the request with the passcode that was saved during
   the initial round.
   4. This only refreshes my local bearer token on the submit host, and is
   of no help to the jobs running on the OSG that want to write directly
   back to the xrootd server.
   5. Of course I could use htcondor file transfer to stage the output
   back to the submit host, and then copy it to the xrootd server with a
   second copy script running on the submit host.
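
Steps 1-3 can be condensed into a small helper along these lines (a sketch only: the --nooidc call succeeds only after the initial interactive round has seeded the vault, and the default token path here is my assumption, not a JLab convention).

```shell
#!/bin/bash
# Hypothetical helper for steps 1-3: after one interactive htgettoken
# run has stored the refresh token on htvault.jlab.org and the vault
# passcode under the local account, later renewals are non-interactive.

get_bearer_token() {
    # Default token path is an assumption; adjust to your setup.
    export BEARER_TOKEN_FILE="${BEARER_TOKEN_FILE:-/tmp/bt_u$(id -u)}"
    # --nooidc skips the browser authentication step; it fails unless
    # the initial interactive round (step 1) has already been done.
    htgettoken --nooidc -a htvault.jlab.org -i jlab -r gluex
}
```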

Option 5 is no good for me, because it creates a bottleneck: thousands of
jobs trying to funnel terabytes of output through one pathetic submit host
that was never built to serve as a data transfer node. My xrootd servers,
on the other hand, are robust and highly parallelized, able to handle
thousands of simultaneous write requests at terabytes per hour.

-Richard Jones

On Mon, Oct 21, 2024 at 12:24 PM Peter Hurck <ppauli at jlab.org> wrote:

> *Message sent from a system outside of UConn.*
>
> Hi Richard,
>
> See the script below that is run every five minutes by the mcwrapper user
> to renew its token on the OSG submit host. I think Thomas (cc’d) worked on
> this with Kurt Strohsal.
>
> I hope that helps.
>
> Cheers,
> Peter
>
>
>
> #!/bin/bash
> #use ssh agent to grab a new token
>
> # User specific aliases and functions
> export XDG_RUNTIME_DIR=/run/user/10967
> export BEARER_TOKEN_FILE=/var/run/user/10967/bt_u10967
>
> eval `ssh-agent`
>
> /usr/bin/ssh-add
>
> /usr/bin/htgettoken --nooidc -v -a htvault.jlab.org -i jlab -r gluex
>
> ssh-agent -k
>