[clas12_rgk] [EXTERNAL] Re: RG-K PASS1 cooking status

Francois-Xavier Girod fxgirod at jlab.org
Tue Aug 4 18:36:58 EDT 2020


Dear Nathan

No, sorry if this is trivial; maybe it is just a matter of clicking three
buttons on the sci comp website, which I do not know how to do.
I would like to see a graph of the number of CPU hours per day used by RGB
and RGK over the last month. I was personally never shown this after having
asked for it.

And yes, I understand that this appears like a joke because this information
should be trivial to retrieve, but that is the kind of information I have
been asking for.

Best regards
FX
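
The graph requested above (CPU hours per day for the RGB and RGK accounts)
could in principle be built from SLURM accounting records. The following is
a minimal sketch under stated assumptions, not the tool used by SciComp or
CLAS12. It assumes the user is permitted to run something like
"sacct -a -X -n -P -S <start> -E <end> -A <accounts> -o Account,Start,CPUTimeRAW"
and has captured its pipe-delimited output as text. The account names "rgk"
and "rgb" and the sample rows are hypothetical, and each job's CPU time is
attributed entirely to its start date, which is a simplification.

```python
from collections import defaultdict


def cpu_hours_per_day(sacct_text):
    """Sum CPU hours per (date, account) from 'Account|Start|CPUTimeRAW' lines.

    CPUTimeRAW is in seconds; Start is expected as YYYY-MM-DDTHH:MM:SS.
    """
    totals = defaultdict(float)
    for line in sacct_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        account, start, cpu_seconds = fields
        if not cpu_seconds.isdigit() or "T" not in start:
            continue  # skip jobs without a usable start time or CPU time
        day = start.split("T")[0]
        totals[(day, account)] += int(cpu_seconds) / 3600.0
    return dict(totals)


# Hypothetical sample rows, standing in for real sacct output.
sample = """\
rgk|2020-07-25T01:00:00|7200
rgk|2020-07-25T09:30:00|3600
rgb|2020-07-25T02:00:00|1800
rgb|2020-07-26T02:00:00|3600
"""

usage = cpu_hours_per_day(sample)
for (day, account), hours in sorted(usage.items()):
    print(day, account, round(hours, 2))
```

The resulting per-day totals can be handed to any plotting tool, with one
curve per account.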

On Wed, Aug 5, 2020 at 12:31 AM Nathan Baltzell <baltzell at jlab.org> wrote:

> surely you're joking
>
>
> On Aug 4, 2020, at 18:14, Francois-Xavier Girod <fxgirod at jlab.org> wrote:
>
> Dear Nathan
>
> Thank you for the answer
>
> > I'm not sure what exactly you're looking for, but there's a few ways to
> get information on previous and current JLab batch farm usage ...
>
> OK, let me reiterate specifically: we were told that last week RGK
> received more computing hours than RGB, which I am not contesting. I am
> only asking for the direct evidence.
> This is relevant as it is the basis for the decision on computing
> resources allocations.
>
> Can we please see the number of CPU hours per day, received by RGB and
> RGK over the course of the last month?
>
> I would like to make this graph myself but I do not know how to retrieve
> it, and from the email you sent I am not readily able to do this.
> I do not believe that "Usage -> Job History" or "Usage -> Usage Stat"
> allow me to perform this operation. Note that these links do contain the
> kind of information I want, but only for Hall B production vs, say, Hall D
> or theory.
>
> Best regards
> FX
>
> On Tue, Aug 4, 2020 at 11:51 PM Nathan Baltzell <baltzell at jlab.org> wrote:
>
>> Hello FX,
>>
>> I'm not sure what exactly you're looking for, but there's a few ways to
>> get information on previous and current JLab batch farm usage ...
>>
>> For web-based stuff, there's the "Usage" link and its "Job History" and
>> "Usage Stat" subitems at the top left of scicomp.jlab.org.  From there,
>> you can select date ranges and get core count timelines and integrated
>> summary tables and drill down the fairshare hierarchy of SLURM accounts and
>> their users.   There's also the "Slurm Info" subitem with more job-specific
>> metrics and also the tree fairshare settings and pie charts on usage.  For
>> info on the fairshare algorithm in use by JLab, see
>> https://slurm.schedmd.com/fair_tree.html, where currently the half-life
>> is 7 days.
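
To illustrate what a 7-day half-life means for the fairshare discussion in
this thread: past usage counts for less the older it is, so an account that
has been idle for a while is charged little and gets favored. The sketch
below is a simplified daily model written for illustration only; the function
name and the numbers are hypothetical, and SLURM itself applies the decay at
finer time steps and combines it with the configured shares.

```python
def decayed_usage(daily_cpu_hours, half_life_days=7.0):
    """Return usage weighted by an exponential half-life decay.

    daily_cpu_hours[0] is today's usage, [1] is yesterday's, and so on.
    Usage that is one half-life old counts for half its raw value.
    """
    return sum(
        hours * 0.5 ** (age / half_life_days)
        for age, hours in enumerate(daily_cpu_hours)
    )


# The same 100 CPU hours, charged today, one week ago, and two weeks ago.
recent = decayed_usage([100.0] + [0.0] * 14)
week_old = decayed_usage([0.0] * 7 + [100.0])
two_weeks_old = decayed_usage([0.0] * 14 + [100.0])
print(recent, week_old, two_weeks_old)  # prints: 100.0 50.0 25.0
```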
>>
>> We also use some command-line tools we wrote, see "source
>> /group/clas12/packages/setup.(c)sh; module load workflow; slurm-status.py
>> -h", which uses the same backend as the "Job Query" at scicomp.jlab.org and
>> adds an integrated summary of cpu/wall/mem for the given query.
>>
>> And then you can also use raw SLURM commands (see "man slurm" for a list
>> of them, and then "man $CMD" for a given command), which might be best, but
>> I'm a bit less familiar with those and so far haven't really needed them
>> except for sshare/sinfo/pestat for live info and sbatch/srun for running
>> jobs.
>>
>> On a related note, the time estimates that have been presented by myself
>> or chefs previously for CLAS12 data processing were based on our measured
>> (in situ) event rates and fairshares and the current batch farm hardware
>> distributions at
>> https://jeffersonlab-my.sharepoint.com/:x:/g/personal/baltzell_jlab_org/EU096WRXcyBLl_ApLfSCuvoBiy3Sq_xtFlY9MvO_HWHQUw?e=Iwv4bC
>> , and things have so far always come out pretty close to expectations after
>> accounting for fluctuations in received fairshare for a given period,
>> although we still keep an eye on it to make sure things are making sense.
>>
>> -Nathan
>>
>>
>> On Aug 4, 2020, at 09:44, Francois-Xavier Girod <fxgirod at jlab.org> wrote:
>>
>> Dear Raffaella
>>
>> >  These are tools accessible to everyone from the SciComp page
>>
>> Could you please point to these tools, as requested 4 days ago?
>>
>> Also, looking more carefully at the status of the RGK cooking, it appears
>> that, contrary to what was claimed earlier, we did not reach the 50% mark
>> until yesterday.
>>
>> Best regards
>> FX
>>
>> On Fri, Jul 31, 2020 at 10:49 PM Raffaella De Vita <
>> Raffaella.Devita at ge.infn.it> wrote:
>>
>>> Dear FX,
>>> We have the capability of monitoring the resources used by individual
>>> accounts as part of the SciComp tools.  These are tools accessible to
>>> everyone from the SciComp page. The monitoring indicates that the resources
>>> used by RG-K over the last week largely exceed what was used by RG-B. I
>>> don’t think we are lacking in terms of tools here and I’m not sure I
>>> understand the motivation of this concern.
>>> We can discuss this further but I’m not sure this aspect is really
>>> central to deciding how the RG-K and RG-B cooking should continue. The plan
>>> defined by the CCC for the data processing was 1) to run in parallel for
>>> about a week, while RGK was still in a testing phase, 2) continue with RG-K
>>> alone until 50% was reached, 3) continue in parallel till the end. It
>>> seems to me that we basically skipped 1 because the fairshare algorithm did
>>> what it is designed to do, we are in 2, and the natural thing would be to
>>> continue with 3. This would be compliant with the CCC indications and the
>>> most efficient solution from the technical point of view.
>>> Obviously the CCC can reconsider the situation and provide different
>>> indications: I have cc’ed Kyungseon and Marco if they want to comment.
>>> Best regards,
>>> Raffaella
>>>
>>>
>>> On 31 Jul 2020, at 16:12, Francois-Xavier Girod <fxgirod at jlab.org>
>>> wrote:
>>>
>>> Dear Raffaella
>>>
>>> > During last week, the RG-B cooking, even if not fully stopped,
>>> progressed at a very slow rate because the farm fairshare algorithm favored
>>> the RG-K account that was not used in a while. Because of this, RG-K has
>>> been basically running alone for most of the time as for example it’s
>>> happening right now.
>>>
>>> Do we have the tools to monitor how many CPU hours were used by what
>>> account?
>>> If we do not have these tools, is the only monitoring available the
>>> throughput accomplished by one group vs another?
>>>
>>> It seems to me that this is not a good situation for the collaboration.
>>> We should anticipate that there will be requests in the future to cook
>>> again, possibly even RGA, once we have improvements to the
>>> software released, such as the ongoing work on the tracking.
>>>
>>> I had understood that the RGK - RGB parallel cooking was a dry run to
>>> improve our preparedness against these issues. If we are not making
>>> progress on understanding the details of parallel cooking, it is unclear
>>> what the point of refusing our request is. Whether RGK gets 100% of
>>> the resources for 4 days, or 50% of the resources for 8 days, or even
>>> 25% of the resources for 16 days, should make no difference whatsoever to
>>> RGB, unless they will be done in less than 16 days.
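
The arithmetic behind this argument can be checked with a short sketch. It
assumes, for illustration only, a farm of constant capacity shared by exactly
two accounts and a fixed 16-day horizon; the variable names are hypothetical.
In "farm-days" (the whole farm for one day), all three scenarios give RGK the
same total and therefore leave RGB the same remainder.

```python
HORIZON_DAYS = 16  # assumed common time window for the comparison

# (RGK share of the farm, number of days RGK runs at that share)
scenarios = [(1.00, 4), (0.50, 8), (0.25, 16)]

# Farm-days consumed by RGK in each scenario: share * days.
rgk_farm_days = [share * days for share, days in scenarios]

# What is left for RGB over the same horizon, assuming the farm is fully used.
rgb_farm_days = [HORIZON_DAYS - used for used in rgk_farm_days]

for (share, days), rgk, rgb in zip(scenarios, rgk_farm_days, rgb_farm_days):
    print(f"RGK at {share:.0%} for {days} days: RGK={rgk}, RGB={rgb}")
```

The totals only differ for RGB if its own processing would finish before the
16 days are over, which is the caveat stated in the message.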
>>>
>>> What is essential however is that we develop better tools to monitor the
>>> shared usage of resources, because these issues will only get worse.
>>>
>>> Best regards
>>> FX
>>>
>>> On Fri, Jul 31, 2020 at 10:05 PM Raffaella De Vita <
>>> Raffaella.Devita at ge.infn.it> wrote:
>>>
>>>> Dear Annalisa,
>>>> As you mentioned the data processing has been progressing very well
>>>> and, in about a week, we are already very close to 50% of the whole RG-K
>>>> data set, with the 7.5 GeV almost completed and the 6.5 GeV already started.
>>>> At this pace, it looks like the 50% will be reached and exceeded over the
>>>> weekend.
>>>>
>>>> During last week, the RG-B cooking, even if not fully stopped,
>>>> progressed at a very slow rate because the farm fairshare algorithm favored
>>>> the RG-K account that was not used in a while. Because of this, RG-K has
>>>> been basically running alone for most of the time as for example it’s
>>>> happening right now.
>>>>
>>>> Given this and considering the overall status, I think the 50% goal
>>>> will be exceeded shortly, entering what was defined as phase 3, where both
>>>> RG-K and RG-B can continue data processing in parallel.
>>>> Best regards,
>>>>         Raffaella
>>>>
>>>>
>>>> > On 31 Jul 2020, at 13:57, Annalisa D'Angelo <
>>>> annalisa.dangelo at roma2.infn.it> wrote:
>>>> >
>>>> > Dear Nathan and Raffaella,
>>>> >
>>>> > as you may see from the monitoring information at:
>>>> >
>>>> > https://clas12mon.jlab.org/files/?RGK7.5GeV
>>>> >
>>>> > the pass1 cooking process of 7.5 GeV data started last Friday
>>>> night, July 24th, in parallel with RG-B.
>>>> >
>>>> > It has been going quite efficiently this week, so that we have already
>>>> produced 88% of RG-K data at 7.5 GeV, corresponding to 44% of the full RG-K
>>>> set of collected data.
>>>> >
>>>> > The agreement was that RG-K would run in parallel with RG-B for one
>>>> week, and then alone for about another week to obtain 50% of the processed
>>>> data.
>>>> >
>>>> > How should we interpret the next step?  May we run alone?
>>>> >
>>>> > Thank you for your help and support, in particular to Nathan, who
>>>> kindly agreed to launch and monitor the 6.5 GeV work flow.
>>>> >
>>>> > All the best
>>>> >
>>>> > Annalisa
>>>> >
>>>> > --
>>>> >
>>>> > ================================================
>>>> > Prof. Annalisa D'Angelo
>>>> > Dip. Fisica, Universita' di Roma "Tor Vergata"
>>>> > Istituto Nazionale di Fisica Nucleare
>>>> > Sezione di Roma Tor Vergata, Rome Italy
>>>> > email:annalisa.dangelo at roma2.infn.it
>>>> > Jefferson Laboratory, Newport News, VA USA
>>>> > Email: annalisa at jlab.org
>>>> > Tel: + 39 06 72594562
>>>> >
>>>> >
>>>> >
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>