<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Dear FX,<div class="">Your graph shows that, since the RG-K cooking started on 7/24-25, RG-K had about twice the CPU hours that RG-B had. Can you explain how that contradicts what was said earlier?</div><div class="">Best regards,</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>Raffaella<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On 4 Aug 2020, at 21:32, Francois-Xavier Girod <<a href="mailto:fxgirod@jlab.org" class="">fxgirod@jlab.org</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">Was my request inappropriate somehow?</div><div class="">I sincerely do not understand how "surely you are joking" was a proper answer to my question</div><div class=""><br class=""></div><div class="">I took Nathan's script and ran</div>slurm-status.py -u clas12-1 -d 30 <div class="">to get slurm information for CPU time used by RGB during the last 30 days</div><div class=""><br class=""></div><div class="">similarly I assume clas12-2 was dedicated to RGK uniquely</div><div class=""><br class=""></div><div class="">I show results for cumulative CPU hours from these commands in the attached plot</div><div class=""><br class=""></div><div class="">The evidence I can find contradicts the claim that was made that RGK used substantially more CPU than RGB last week</div><div class="">The evidence seems much more compatible with the earlier suggestion that RGK processing time per event is less than RGB processing time per event</div><div class=""><br class=""></div><div class="">Of course it is possible that I did something wrong but I have 
seen no other quantitative analysis personally </div><div class="">As I said earlier, it is also possible that the attached plot can be obtained simply from the SciComp website; that was the original question</div><div class=""><br class=""></div><div class="">RGB and RGK took the process of pass 1 review seriously; we did our best to demonstrate that we were ready to use computing resources</div><div class="">What was the point of the review process if we are so cavalier about monitoring resource usage?</div><div class=""><br class=""></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Aug 5, 2020 at 12:31 AM Nathan Baltzell <<a href="mailto:baltzell@jlab.org" class="">baltzell@jlab.org</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div style="overflow-wrap: break-word;" class="">
<div class="">surely you're joking</div>
<div class="">
<div class=""><br class="">
</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 4, 2020, at 18:14, Francois-Xavier Girod <<a href="mailto:fxgirod@jlab.org" target="_blank" class="">fxgirod@jlab.org</a>> wrote:</div>
<br class="">
<div class="">
<div dir="ltr" class="">Dear Nathan
<div class=""><br class="">
</div>
<div class="">Thank you for the answer</div>
<div class=""><br class="">
</div>
<div class="">> I'm not sure what exactly you're looking for, but there's a few ways to get information on previous and current JLab batch farm usage ...</div>
<br class="">
<div class="">OK, let me reiterate specifically: we were told that last week RGK received more computing hours than RGB, which I am not contesting. I am only asking for the direct evidence.</div>
<div class="">This is relevant as it is the basis for the decision on computing resources allocations.</div>
<div class=""><br class="">
</div>
<div class="">Can we please see the number of CPU hours per day, received by RGB and RGK over the course of the last month?</div>
<div class=""><br class="">
</div>
<div class="">I would like to make this graph myself but I do not know how to retrieve it, and from the email you sent I am not readily able to do this</div>
<div class="">I do not believe that "Usage -> Job History" or "Usage -> Usage Stat" allow me to perform this operation. Note that these links do contain the kind of information I want, but only for Hall B production vs., say, Hall D or theory. </div>
<div class=""><br class="">
</div>
<div class="">Best regards</div>
<div class="">FX</div>
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Aug 4, 2020 at 11:51 PM Nathan Baltzell <<a href="mailto:baltzell@jlab.org" target="_blank" class="">baltzell@jlab.org</a>> wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div class="">Hello FX,
<div class=""><br class="">
</div>
<div class="">I'm not sure what exactly you're looking for, but there's a few ways to get information on previous and current JLab batch farm usage ...</div>
<div class=""><br class="">
</div>
<div class="">For web-based stuff, there's the "Usage" link and its "Job History" and "Usage Stat" subitems at the top left of <a href="http://scicomp.jlab.org/" target="_blank" class="">scicomp.jlab.org</a>. From there, you can select date ranges and get
core count timelines and integrated summary tables and drill down the fairshare hierarchy of SLURM accounts and their users. There's also the "Slurm Info" subitem with more job-specific metrics and also the tree fairshare settings and pie charts on usage.
For info on the fairshare algorithm in use by JLab, see <a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__slurm.schedmd.com_fair-5Ftree.html&d=DwMFaQ&c=CJqEzB1piLOyyvZjb8YUQw&r=t2OnxVoP8YJyzuxpHKwj1tmrMj6s1mZ4pgV16sZJZb8&m=iKPbrNV5VEgoX4Q3IBSBJRbO8oxSALclReBOsLW78Hs&s=-UIhEEIVOYcfuvrw-viCoA4roQlZn_or9egVKpYk6n0&e=" target="_blank" class="">https://slurm.schedmd.com/fair_tree.html</a>, where currently the half-life is 7 days.</div>
<div class=""><br class="">
</div>
<div class="">We also use some command-line tools we wrote, see "source /group/clas12/packages/setup.(c)sh; module load workflow; slurm-status.py -h", which uses the same backend as the <span class="">"Job Query" at </span><a href="http://scicomp.jlab.org/" target="_blank" class="">scicomp.jlab.org</a> and
adds an integrated summary of cpu/wall/mem for the given query.</div>
<div class=""><br class="">
</div>
<div class="">
<div class="">And then you can also use raw SLURM commands (see "man slurm" for a list of them, and then "man $CMD" for a given command), which might be best, but I'm a bit less familiar with those and so far haven't really needed them except for sshare/sinfo/pestat
for live info and sbatch/srun for running jobs.</div>
<div class=""><br class="">
</div>
<div class="">On a related note, the time estimates that have been presented by myself or chefs previously for CLAS12 data processing were based on our measured (in situ) event rates and fairshares and the current batch farm hardware distributions at
<a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__jeffersonlab-2Dmy.sharepoint.com_-3Ax-3A_g_personal_baltzell-5Fjlab-5Forg_EU096WRXcyBLl-5FApLfSCuvoBiy3Sq-5FxtFlY9MvO-5FHWHQUw-3Fe-3DIwv4bC&d=DwMFaQ&c=CJqEzB1piLOyyvZjb8YUQw&r=t2OnxVoP8YJyzuxpHKwj1tmrMj6s1mZ4pgV16sZJZb8&m=iKPbrNV5VEgoX4Q3IBSBJRbO8oxSALclReBOsLW78Hs&s=oKeN2rl0uSTLMDF9EgjLUGQ-kaUM7GRyFSqFhIP213s&e=" target="_blank" class="">
https://jeffersonlab-my.sharepoint.com/:x:/g/personal/baltzell_jlab_org/EU096WRXcyBLl_ApLfSCuvoBiy3Sq_xtFlY9MvO_HWHQUw?e=Iwv4bC</a> , and things have so far always come out pretty close to expectations after accounting for fluctuations in received fairshare
for a given period, although we still keep an eye on it to make sure things are making sense.</div>
<div class=""><br class="">
</div>
<div class="">-Nathan</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
<blockquote type="cite" class="">
<div class="">On Aug 4, 2020, at 09:44, Francois-Xavier Girod <<a href="mailto:fxgirod@jlab.org" target="_blank" class="">fxgirod@jlab.org</a>> wrote:</div>
<br class="">
<div class="">
<div dir="ltr" class="">Dear Raffaella
<div class=""><br class="">
</div>
<div class="">> These are tools accessible to everyone from the SciComp page</div>
<div class=""><br class="">
</div>
<div class="">Could you please point to these tools, as requested 4 days ago?</div>
<div class=""><br class="">
</div>
<div class="">Also, looking more carefully at the status of the RGK cooking, it appears that, contrary to what was claimed earlier, we did not reach the 50% mark until yesterday. </div>
<div class=""><br class="">
</div>
<div class="">Best regards</div>
<div class="">FX</div>
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jul 31, 2020 at 10:49 PM Raffaella De Vita <<a href="mailto:Raffaella.Devita@ge.infn.it" target="_blank" class="">Raffaella.Devita@ge.infn.it</a>> wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div class="">
<div class="">
<div class="">Dear FX,
<div class="">We have the capability to monitor the resources used by individual accounts as part of the SciComp tools. These are tools accessible to everyone from the SciComp page. The monitoring indicates that the resources used by RG-K over the last week
largely exceed what was used by RG-B. I don’t think we are lacking in terms of tools here and I’m not sure I understand the motivation of this concern.
<div class="">We can discuss this further but I’m not sure this aspect is really central to deciding how the RG-K and RG-B cooking should continue. The plan defined by the CCC for the data processing was 1) to run in parallel for about a week, while RGK was
still in a testing phase, 2) continue with RG-K alone until 50% was reached, 3) continue in parallel till the end. It seems to me that we basically skipped 1 because the fairshare algorithm did what it is designed to do, we are in 2, and the natural thing
would be to continue with 3. This would be compliant with the CCC indications and the most efficient solution from the technical point of view.</div>
<div class="">Obviously the CCC can reconsider the situation and provide different indications: I have cc’ed Kyungseon and Marco in case they want to comment.</div>
<div class="">Best regards,</div>
<div class=""><span style="white-space:pre-wrap" class=""></span>Raffaella</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
</div>
</div>
</div>
<div class="">
<blockquote type="cite" class="">
<div class="">On 31 Jul 2020, at 16:12, Francois-Xavier Girod <<a href="mailto:fxgirod@jlab.org" target="_blank" class="">fxgirod@jlab.org</a>> wrote:</div>
<br class="">
<div class="">
<div dir="ltr" class="">Dear Raffaella
<div class=""><br class="">
</div>
<div class="">> During the last week, the RG-B cooking, even if not fully stopped, progressed at a very slow rate because the farm fairshare algorithm favored the RG-K account, which had not been used in a while. Because of this, RG-K has been basically running alone
most of the time, as is happening right now, for example.</div>
<div class=""><br class="">
</div>
<div class="">Do we have the tools to monitor how many CPU hours were used by what account?</div>
<div class="">If we do not have these tools, is the only monitoring available the throughput accomplished by one group vs another?</div>
<div class=""><br class="">
</div>
<div class="">It seems to me that this is not a good situation for the collaboration. We should anticipate that there will be requests in the future to cook again, possibly even RGA, once improvements to the software are released, such as the ongoing work on
the tracking.</div>
<div class=""><br class="">
</div>
<div class="">I had understood that the RGK - RGB parallel cooking was a dry run to improve our preparedness against these issues. If we are not making progress on understanding the details of parallel cooking, it is unclear what the point of refusing
our request is. Whether RGK gets 100% of the resources for 4 days, or 50% of the resources for 8 days, or even 25% of the resources for 16 days should make 0 difference whatsoever to RGB, unless they are done in less than 16 days. </div>
<div class=""><br class="">
</div>
<div class="">What is essential however is that we develop better tools to monitor the shared usage of resources, because these issues will only get worse.</div>
<div class=""><br class="">
</div>
<div class="">Best regards</div>
<div class="">FX </div>
</div>
<br class="">
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jul 31, 2020 at 10:05 PM Raffaella De Vita <<a href="mailto:Raffaella.Devita@ge.infn.it" target="_blank" class="">Raffaella.Devita@ge.infn.it</a>> wrote:<br class="">
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Dear Annalisa,<br class="">
As you mentioned, the data processing has been progressing very well and, in about a week, we are already very close to 50% of the whole RG-K data set, with the 7.5 GeV almost completed and the 6.5 GeV already started. At this pace, it looks like the 50% will be reached
and exceeded over the weekend. <br class="">
<br class="">
During the last week, the RG-B cooking, even if not fully stopped, progressed at a very slow rate because the farm fairshare algorithm favored the RG-K account, which had not been used in a while. Because of this, RG-K has been basically running alone most of the
time, as is happening right now, for example.<br class="">
<br class="">
Given this and considering the overall status, I think the 50% goal will be exceeded shortly, entering what was defined as phase 3, where both RG-K and RG-B can continue data processing in parallel.
<br class="">
Best regards,<br class="">
Raffaella<br class="">
<br class="">
<br class="">
> On 31 Jul 2020, at 13:57, Annalisa D'Angelo <<a href="mailto:annalisa.dangelo@roma2.infn.it" target="_blank" class="">annalisa.dangelo@roma2.infn.it</a>> wrote:<br class="">
> <br class="">
> Dear Nathan and Raffaella,<br class="">
> <br class="">
> as you may see from the monitoring information at:<br class="">
> <br class="">
> <a href="https://clas12mon.jlab.org/files/?RGK7.5GeV" rel="noreferrer" target="_blank" class="">
https://clas12mon.jlab.org/files/?RGK7.5GeV</a><br class="">
> <br class="">
> the pass1 cooking process of the 7.5 GeV data started last Friday night, July 24th, in parallel with RG-B.<br class="">
> <br class="">
> It has been going quite efficiently this week, so we have already produced 88% of the RG-K data at 7.5 GeV, corresponding to 44% of the full RG-K set of collected data.<br class="">
> <br class="">
> The agreement was that RG-K would run in parallel with RG-B for one week, and then alone for about another week to obtain 50% of the processed data.<br class="">
> <br class="">
> How should we interpret the next step? May we run alone?<br class="">
> <br class="">
> Thank you for your help and support, in particular to Nathan, who kindly agreed to launch and monitor the 6.5 GeV work flow.<br class="">
> <br class="">
> All the best<br class="">
> <br class="">
> Annalisa<br class="">
> <br class="">
> -- <br class="">
> <br class="">
> ================================================<br class="">
> Prof. Annalisa D'Angelo<br class="">
> Dip. Fisica, Universita' di Roma "Tor Vergata"<br class="">
> Istituto Nazionale di Fisica Nucleare<br class="">
> Sezione di Roma Tor Vergata, Rome Italy<br class="">
> <a href="mailto:email%3Aannalisa.dangelo@roma2.infn.it" target="_blank" class="">
email:annalisa.dangelo@roma2.infn.it</a><br class="">
> Jefferson Laboratory, Newport News, VA USA<br class="">
> Email: <a href="mailto:annalisa@jlab.org" target="_blank" class="">annalisa@jlab.org</a><br class="">
> Tel: + 39 06 72594562<br class="">
> <br class="">
> <br class="">
> <br class="">
<br class="">
<br class="">
<br class="">
_______________________________________________<br class="">
clas12_rgk mailing list<br class="">
<a href="mailto:clas12_rgk@jlab.org" target="_blank" class="">clas12_rgk@jlab.org</a><br class="">
<a href="https://mailman.jlab.org/mailman/listinfo/clas12_rgk" rel="noreferrer" target="_blank" class="">https://mailman.jlab.org/mailman/listinfo/clas12_rgk</a><br class="">
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote></div>
<span id="cid:f_kdgp2loj0"><slurm_mon.png></span></div></blockquote></div><br class=""></div></div></body></html>