[Clas12_software] Fwd: class12-2 jobs performing lots of small i/o
Francois-Xavier Girod
fxgirod at jlab.org
Mon Jul 2 20:52:45 EDT 2018
Dear Andrew
As I said, the issue with the decoding jobs' memory footprint is that the CC
monitors virtual memory as well as physical memory, and I do not think this
issue applies to Geant (C++) jobs.
We have had numerous discussions with the CC about this, and they insist on
monitoring virtual memory. This is (a bit) less crucial for the multithreaded
CLAS12 reconstruction jobs, which take most or all of the individual nodes
they run on and account for the large bulk of CLAS12 resources compared to
decoding, so I do not think this is a serious issue for the other halls.
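For what it is worth, the gap between the two numbers is easy to check
directly on a farm node. Below is a minimal sketch (my own illustration,
not part of any CLAS12 tool, assuming a Linux node and the PID of a
running decoding job) that reads /proc/<pid>/status:

    # Compare the virtual size (VmSize) that the CC monitors against the
    # resident set (VmRSS) actually held in physical RAM.
    import sys

    def memory_kb(pid, field):
        """Return the given Vm* entry (in kB) from /proc/<pid>/status."""
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
        return None

    if __name__ == "__main__":
        pid = sys.argv[1] if len(sys.argv) > 1 else "self"
        vsz = memory_kb(pid, "VmSize")  # address space reserved
        rss = memory_kb(pid, "VmRSS")   # pages actually resident
        print(f"PID {pid}: VmSize = {vsz / 1048576.0:.1f} GB, "
              f"VmRSS = {rss / 1048576.0:.1f} GB")

For a JVM process such as the decoder, VmSize is typically several times
VmRSS, because the JVM reserves its maximum heap and other address space
up front without touching the corresponding physical pages.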
Best regards
FX
On Tue, Jul 3, 2018 at 9:41 AM Andrew Puckett <puckett at jlab.org> wrote:
> Hi FX,
>
> That would seem to be inconsistent with this complaint I got a few years
> ago from Sandy Philpott about “excessive” memory usage of my
> single-threaded GEANT4 simulation jobs:
>
> “Hi Andrew,
>
> We're trying to understand the 8GB memory requirement of your SBS farm
> jobs... The farm nodes are configured at 32 GB RAM for 24 cores, so 1.5 GB
> per core. Since your jobs request so much memory, it blocks other jobs'
> access to the systems. Why are these jobs such an exception to the farm job
> norm -- why do they need so much memory for just a single core job? Can
> your code run on multiple cores and use memory more efficiently?
>
> All insight helpful and appreciated, as other users' jobs are backed up in
> the queue although many cores sit idle.
>
> Regards,
> Sandy”
>
> What makes CLAS12 decoding jobs different in this regard? Or has something
> changed since then? This question should probably be looked into at a
> higher level. If these decoding jobs are in fact forcing 5 of 6 cores per
> job to sit idle, that would be a significant issue for scientific
> computing for all four halls.
>
> Best regards,
> Andrew
>
> puckett.physics.uconn.edu
>
> On Jul 2, 2018, at 8:23 PM, Francois-Xavier Girod <fxgirod at jlab.org>
> wrote:
>
> The decoding is not multithreaded. The decoding of several evio files
> merged into one hipo file currently does require this much memory;
> however, this is largely due to the CC insisting on monitoring virtual
> memory allocation. None of this is a problem; those are all features.
>
> That being said, I do not think those jobs grab 6 cores. They only need
> and only use one.
>
> On Tue, Jul 3, 2018 at 9:18 AM Andrew Puckett <puckett at jlab.org> wrote:
>
>> The much bigger problem, which I would have thought the farm admins would
>> notice first, appears to be that a single-core job is requesting a 9 GB
>> memory allocation, which is wildly inefficient given the batch farm
>> architecture of approximately 1.5 GB per core. Unless I’ve misunderstood something, or
>> unless the job is actually multithreaded even though it appears not to be,
>> each job with those parameters will grab six cores while using only one,
>> causing five cores to sit idle ***per job***. While I am by no means an
>> expert, I thought that the CLAS12 software framework was supposed to be
>> fully multi-thread capable? I only bring it up as someone with a vested
>> interest in the efficient use of the JLab scientific computing facilities...
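>>
>> To spell out the arithmetic (a back-of-the-envelope sketch using only the
>> numbers quoted in this thread, not anything read from the scheduler):
>>
>>     import math
>>
>>     mem_per_core_gb = 1.5  # approximate farm provisioning quoted in this thread
>>     mem_request_gb = 9.0   # MemoryReq of the clas12-2 jobs listed below
>>     cores_used = 1         # the jobs appear to be single-threaded
>>
>>     # a 9 GB request effectively reserves ceil(9 / 1.5) = 6 core slots
>>     slots_blocked = math.ceil(mem_request_gb / mem_per_core_gb)
>>     print(f"cores effectively reserved: {slots_blocked}, "
>>           f"sitting idle: {slots_blocked - cores_used}")
>>     # prints: cores effectively reserved: 6, sitting idle: 5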
>>
>> puckett.physics.uconn.edu
>>
>> On Jul 2, 2018, at 7:03 PM, Francois-Xavier Girod <fxgirod at jlab.org>
>> wrote:
>>
>> The I/O for those jobs is defined as per CC guidelines; there is no
>> “small” I/O: the hipo files are about 5 GB each, merging 10 evio files
>> together. I think the CC needs better diagnostics.
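>>
>> As an example of the kind of diagnostic I have in mind, one could compare,
>> on the node itself, the bytes a job moves against the number of read/write
>> system calls it issues. A rough sketch (my own, assuming a Linux farm node
>> and the PID of one of the clas12-2 jobs), reading /proc/<pid>/io:
>>
>>     # Average bytes per syscall separates genuinely small I/O from
>>     # large sequential reads/writes of multi-GB evio/hipo files.
>>     import sys
>>
>>     def io_stats(pid):
>>         """Parse the counters in /proc/<pid>/io into a dict."""
>>         stats = {}
>>         with open(f"/proc/{pid}/io") as f:
>>             for line in f:
>>                 key, value = line.split(":")
>>                 stats[key.strip()] = int(value)
>>         return stats
>>
>>     if __name__ == "__main__":
>>         s = io_stats(sys.argv[1])
>>         print(f"avg read size:  {s['rchar'] / max(s['syscr'], 1) / 1024.0:.1f} kB")
>>         print(f"avg write size: {s['wchar'] / max(s['syscw'], 1) / 1024.0:.1f} kB")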
>>
>> On Tue, Jul 3, 2018 at 7:56 AM Harout Avakian <avakian at jlab.org> wrote:
>>
>>> FYI
>>>
>>> I understood that this was fixed. FX, could you please check what the
>>> problem is?
>>> Harut
>>>
>>> -------- Forwarded Message --------
>>> Subject: class12-2 jobs performing lots of small i/o
>>> Date: Mon, 2 Jul 2018 08:58:04 -0400 (EDT)
>>> From: Kurt Strosahl <strosahl at jlab.org>
>>> To: Harut Avagyan <avakian at jlab.org>
>>> CC: sciops <sciops at jlab.org>
>>>
>>> Harut,
>>>
>>> There are a large number of clas12 jobs running through the farm under user clas12-2; these jobs are performing lots of small I/O.
>>>
>>> An example of one of these jobs is:
>>>
>>> Job Index: 55141495
>>> User Name: clas12-2
>>> Job Name: R4013_13
>>> Project: clas12
>>> Queue: prod64
>>> Hostname: farm12021
>>> CPU Req: 1 centos7 core requested
>>> MemoryReq: 9 GB
>>> Status: ACTIVE
>>>
>>> You can see the small I/O by looking at: https://scicomp.jlab.org/scicomp/index.html#/lustre/users
>>>
>>> w/r,
>>> Kurt J. Strosahl
>>> System Administrator: Lustre, HPC
>>> Scientific Computing Group, Thomas Jefferson National Accelerator Facility
>>>