[Clas12_software] DB initialization and FTCAL
Francois-Xavier Girod
fxgirod at jlab.org
Tue Mar 6 12:30:19 EST 2018
Dear Vardan,
I am getting a similar memory management issue when processing run 3432:
Shutting down...
-------------------------------------------
Resource usage summary:
cput=72:27:10,mem=17424908kb,vmem=61633456kb,walltime=01:16:39
set job 36237851 done reason to exceed resource limit
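For readability, the figures in the resource usage summary can be converted to larger units. This is only a minimal sketch: it assumes that `kb` means units of 1024 bytes and that `cput` is CPU time summed over all threads of the job. The helper names (`parse_summary`, `kb_to_gib`, `hms_to_seconds`) are made up for illustration.

```python
# Summary line copied from the job output above.
SUMMARY = "cput=72:27:10,mem=17424908kb,vmem=61633456kb,walltime=01:16:39"


def parse_summary(line):
    """Split a 'key=value,key=value' line into a dict of strings."""
    return dict(field.split("=", 1) for field in line.strip().split(","))


def kb_to_gib(value):
    """Convert a string such as '17424908kb' to GiB (assumes 1 kb = 1024 bytes)."""
    if not value.lower().endswith("kb"):
        raise ValueError("expected a value ending in 'kb': " + value)
    return int(value[:-2]) / 1024 ** 2


def hms_to_seconds(value):
    """Convert 'HH:MM:SS' (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    fields = parse_summary(SUMMARY)
    print("mem  [GiB]:", round(kb_to_gib(fields["mem"]), 2))
    print("vmem [GiB]:", round(kb_to_gib(fields["vmem"]), 2))
    ratio = hms_to_seconds(fields["cput"]) / hms_to_seconds(fields["walltime"])
    print("cput / walltime:", round(ratio, 1))
```

Under these assumptions the job used about 16.6 GiB of resident memory and about 58.8 GiB of virtual memory, and accumulated roughly 57 CPU-seconds per second of wall time.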
Is there a way to investigate which service is causing this memory issue?
Thanks
Best regards
FX
On Sun, Mar 4, 2018 at 11:36 AM, Vardan Gyurjyan <gurjyan at jlab.org> wrote:
> Hi All,
>
> Here is the behavior observed when running Clara on clonfarm0 (32-core AMD,
> RH Linux 6).
>
> Clas12 software release: 5a.1.0
> Data file: clas_003191.evio.7.hipo
>
> Running the production service chain without the FTCAL and FTHODO services:
> runs with no problems or exceptions, yet there is long (unexplained) DB
> activity at the configure stage of the reconstruction (see attached docx
> file). This init process can cause DB issues.
>
> Adding FTCAL to the service chain: an "[IndexedTable] ---> error.. entry
> does not exist" console printout for every event (slowing down
> reconstruction) and, after a while, an out-of-memory exception (see below).
>
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:714)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
>     at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>     at org.jlab.coda.xmsg.core.xMsg$1.handle(xMsg.java:568)
>     at org.jlab.coda.xmsg.core.xMsgSubscription$Handler.run(xMsgSubscription.java:108)
>     at java.lang.Thread.run(Thread.java:745)
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:714)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
>     at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>     at org.jlab.clara.sys.Service.execute(Service.java:176)
>     at org.jlab.clara.sys.Service.access$300(Service.java:53)
>     at org.jlab.clara.sys.Service$ServiceCallBack.callback(Service.java:286)
>     at org.jlab.coda.xmsg.core.xMsg$1.lambda$handle$0(xMsg.java:568)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> [
>
> Note that this is not only an FT issue (as previously thought); it may be
> an overall clas12 service engine memory management issue (garbage creation,
> etc.). Let us first understand and address the DB communication issues, and
> get rid of the printouts for every event in any service engine. Printouts
> are useful for debugging engines, but not for production service
> deployments. Clara provides mechanisms to report errors directly to
> the, triggering specific actions. After these we can return to tackle the
> "out of memory" exception.
>
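To illustrate the point about per-event printouts, here is a minimal sketch of a filter that lets a repeated message through only a limited number of times while still counting every occurrence. It uses Python's standard `logging` module purely as an illustration; it is not Clara's error-reporting mechanism, and the class and logger names (`FirstNFilter`, `ListHandler`, `ftcal_demo`) are hypothetical.

```python
import logging


class FirstNFilter(logging.Filter):
    """Pass each distinct message at most `limit` times; count all occurrences."""

    def __init__(self, limit=1):
        super().__init__()
        self.limit = limit
        self.seen = {}  # message text -> number of times it was logged

    def filter(self, record):
        key = record.getMessage()
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] <= self.limit


class ListHandler(logging.Handler):
    """Collect emitted messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def demo(n_events=1000):
    """Log the same warning once per 'event' and return what got through."""
    logger = logging.getLogger("ftcal_demo")
    logger.setLevel(logging.WARNING)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    dedup = FirstNFilter(limit=1)
    handler.addFilter(dedup)
    logger.addHandler(handler)
    for _ in range(n_events):
        logger.warning("[IndexedTable] ---> error.. entry does not exist")
    logger.removeHandler(handler)
    return handler.messages, dedup.seen


if __name__ == "__main__":
    emitted, counts = demo()
    print(len(emitted), "message(s) emitted;", counts)
```

The counts kept in `seen` could be reported once at the end of a job, so the information is not lost even though the console stays quiet.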
> Also, please take a look at the benchmark results below. We spend almost
> 3 s/event/core reconstructing charged-particle tracks in DC. This is using
> 5a.1.0 on online events (Run 3544, currently active). Is this real?
>
>
> 2018-03-04 11:04:17.263: Benchmark results:
> 2018-03-04 11:04:17.264:   READER   3000 events   total time =  140.63 s   average event time =   46.88 ms
> 2018-03-04 11:04:17.264:   FTCAL    3000 events   total time =    2.73 s   average event time =    0.91 ms
> 2018-03-04 11:04:17.265:   FTHODO   3000 events   total time =    2.61 s   average event time =    0.87 ms
> 2018-03-04 11:04:17.266:   FTEB     3000 events   total time =    2.78 s   average event time =    0.93 ms
> 2018-03-04 11:04:17.267:   DCHB     3000 events   total time = 6354.61 s   average event time = 2118.20 ms
> 2018-03-04 11:04:17.267:   DCTB     3000 events   total time = 2198.50 s   average event time =  732.83 ms
> 2018-03-04 11:04:17.268:   FTOF     3000 events   total time =   11.69 s   average event time =    3.90 ms
> 2018-03-04 11:04:17.269:   LTCC     3000 events   total time =    1.04 s   average event time =    0.35 ms
> 2018-03-04 11:04:17.269:   EC       3000 events   total time =    6.25 s   average event time =    2.08 ms
> 2018-03-04 11:04:17.270:   EBHB     3000 events   total time =    6.99 s   average event time =    2.33 ms
> 2018-03-04 11:04:17.271:   EBTB     3000 events   total time =    5.85 s   average event time =    1.95 ms
> 2018-03-04 11:04:17.272:   WRITER   3000 events   total time =   10.63 s   average event time =    3.54 ms
> 2018-03-04 11:04:17.272:   TOTAL    3000 events   total time = 8744.31 s   average event time = 2914.77 ms
>
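To put the benchmark numbers in perspective, the share of the per-event time spent in each service can be computed directly from the averages. This is a small sketch: the values are copied from the printout above, and the function name `time_shares` is made up for illustration.

```python
# Average event time per service in ms, copied from the benchmark printout.
AVG_MS = {
    "READER": 46.88,
    "FTCAL": 0.91,
    "FTHODO": 0.87,
    "FTEB": 0.93,
    "DCHB": 2118.20,
    "DCTB": 732.83,
    "FTOF": 3.90,
    "LTCC": 0.35,
    "EC": 2.08,
    "EBHB": 2.33,
    "EBTB": 1.95,
    "WRITER": 3.54,
}

# TOTAL line of the same printout, used as a consistency check.
REPORTED_TOTAL_MS = 2914.77


def time_shares(avg_ms):
    """Return each service's fraction of the summed average event time."""
    total = sum(avg_ms.values())
    return {name: ms / total for name, ms in avg_ms.items()}


if __name__ == "__main__":
    shares = time_shares(AVG_MS)
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    for name, share in ranked:
        print(f"{name:7s} {AVG_MS[name]:9.2f} ms  {100 * share:6.2f} %")
    print(f"sum of services: {sum(AVG_MS.values()):.2f} ms "
          f"(reported total: {REPORTED_TOTAL_MS:.2f} ms)")
```

The per-service averages add up to the reported total of 2914.77 ms, and DCHB and DCTB together account for about 97.8% of it.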
> Best regards,
> -vardan