<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Hi FX,<div class=""><br class=""></div><div class="">You can exclude one service at a time from the composition (keeping the same data set) and see which one is using the most memory. Do you remember what the resource usage was for the previous release? A vmem of 40G was clearly fine, but I am not sure about the physical memory usage (I remember it being around 11G).</div><div class="">This could also be a data profile issue that pushes a service engine into a processing mode that requires more memory (that is, a poorly designed service engine that is strongly data dependent). To test this hypothesis, you can process the old data set with the current clas12 plugin release and check the memory usage.</div><div class=""><br class=""></div><div class="">Best,</div><div class="">-vardan</div><div class=""><br class=""></div><div class=""><div class="">
<div style="color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;">--------------------------------------------------<br class="">Vardan H. Gyurjyan, Ph.D.<br class="">Staff Scientist<br class="">Thomas Jefferson Accelerator Facility<br class="">Newport News, VA, 23606<br class="">E-mail: <a href="mailto:gurjyan@jlab.org" class="">gurjyan@jlab.org</a><br class="">757-269-5879 (JLAB)</div>
</div>
<br class=""><div><blockquote type="cite" class=""><div class="">On Mar 6, 2018, at 12:30 PM, Francois-Xavier Girod <<a href="mailto:fxgirod@jlab.org" class="">fxgirod@jlab.org</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Dear Vardan<div class=""><br class=""></div><div class="">I am getting a similar memory management issue processing run 3432</div><div class=""><br class=""></div>Shutting down...<br class="">-------------------------------------------<br class="">Resource usage summary:<br class="">cput=72:27:10,mem=17424908kb,vmem=61633456kb,walltime=01:16:39<br class="">set job 36237851 done reason to exceed resource limit<br class=""><br class="">Is there a way to investigate which service is causing this memory issue?<div class=""><br class=""></div><div class="">Thanks<br class=""><div class=""><br class=""><div class="">Best regards</div><div class="">FX</div></div></div></div><div class="gmail_extra"><br class=""><div class="gmail_quote">On Sun, Mar 4, 2018 at 11:36 AM, Vardan Gyurjyan <span dir="ltr" class=""><<a href="mailto:gurjyan@jlab.org" target="_blank" class="">gurjyan@jlab.org</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi All,<br class="">
<br class="">
Here is the behavior observed when running Clara on clonfarm0 (32-core AMD, RH Linux 6).<br class="">
<br class="">
Clas12 software release: 5a.1.0<br class="">
Data file: clas_003191.evio.7.hipo<br class="">
<br class="">
Running the production service chain without the FTCAL and FTHODO services: it runs with no problems or exceptions, yet there is long (unexplained) DB activity at the configure stage of the reconstruction (see the attached docx file). This init process can cause DB issues.<br class="">
<br class="">
<br class=""><br class="">
<br class="">
Adding FTCAL to the service chain: an "[IndexedTable] ---> error.. entry does not exist" console printout for every event (slowing down reconstruction) and, after a while, an out-of-memory exception (see below).<br class="">
<br class="">
[IndexedTable] ---> error.. entry does not exist<br class="">
[IndexedTable] ---> error.. entry does not exist<br class="">
[IndexedTable] ---> error.. entry does not exist<br class="">
[IndexedTable] ---> error.. entry does not exist<br class="">
[IndexedTable] ---> error.. entry does not exist<br class="">
[IndexedTable] ---> error.. entry does not exist<br class="">
java.lang.OutOfMemoryError: unable to create new native thread<br class="">
at java.lang.Thread.start0(Native Method)<br class="">
at java.lang.Thread.start(Thread.<wbr class="">java:714)<br class="">
at java.util.concurrent.<wbr class="">ThreadPoolExecutor.addWorker(<wbr class="">ThreadPoolExecutor.java:950)<br class="">
at java.util.concurrent.<wbr class="">ThreadPoolExecutor.execute(<wbr class="">ThreadPoolExecutor.java:1357)<br class="">
at java.util.concurrent.<wbr class="">AbstractExecutorService.<wbr class="">submit(<wbr class="">AbstractExecutorService.java:<wbr class="">112)<br class="">
at org.jlab.coda.xmsg.core.xMsg$<wbr class="">1.handle(xMsg.java:568)<br class="">
at org.jlab.coda.xmsg.core.<wbr class="">xMsgSubscription$Handler.run(<wbr class="">xMsgSubscription.java:108)<br class="">
at java.lang.Thread.run(Thread.<wbr class="">java:745)<br class="">
java.lang.OutOfMemoryError: unable to create new native thread<br class="">
at java.lang.Thread.start0(Native Method)<br class="">
at java.lang.Thread.start(Thread.<wbr class="">java:714)<br class="">
at java.util.concurrent.<wbr class="">ThreadPoolExecutor.addWorker(<wbr class="">ThreadPoolExecutor.java:950)<br class="">
at java.util.concurrent.<wbr class="">ThreadPoolExecutor.execute(<wbr class="">ThreadPoolExecutor.java:1357)<br class="">
at java.util.concurrent.<wbr class="">AbstractExecutorService.<wbr class="">submit(<wbr class="">AbstractExecutorService.java:<wbr class="">112)<br class="">
at org.jlab.clara.sys.Service.<wbr class="">execute(Service.java:176)<br class="">
at org.jlab.clara.sys.Service.<wbr class="">access$300(Service.java:53)<br class="">
at org.jlab.clara.sys.Service$<wbr class="">ServiceCallBack.callback(<wbr class="">Service.java:286)<br class="">
at org.jlab.coda.xmsg.core.xMsg$<wbr class="">1.lambda$handle$0(xMsg.java:<wbr class="">568)<br class="">
at java.util.concurrent.<wbr class="">Executors$RunnableAdapter.<wbr class="">call(Executors.java:511)<br class="">
at java.util.concurrent.<wbr class="">FutureTask.run(FutureTask.<wbr class="">java:266)<br class="">
at java.util.concurrent.<wbr class="">ThreadPoolExecutor.runWorker(<wbr class="">ThreadPoolExecutor.java:1142)<br class="">
at java.util.concurrent.<wbr class="">ThreadPoolExecutor$Worker.run(<wbr class="">ThreadPoolExecutor.java:617)<br class="">
at java.lang.Thread.run(Thread.<wbr class="">java:745)<br class="">
<br class="">
<br class="">
Note that this is not only an FT issue (as previously thought); it can be an overall clas12 service engine memory management issue (garbage creation, etc.). Let us first understand and address the DB communication issues, and get rid of per-event printouts in every service engine. Printouts are useful for debugging engines, but not for production service deployments. Clara provides mechanisms to report errors directly, triggering specific actions. After that, we can return to tackling the "out of memory" exception.<br class="">
<br class="">
Also, please take a look at the benchmark results below. We spend almost 3 s/event/core reconstructing charged-particle tracks in DC. This is using 5a.1.0 on online events (Run 3544, currently active). Is this real?<br class="">
<br class="">
<br class="">
2018-03-04 11:04:17.263: Benchmark results:<br class="">
2018-03-04 11:04:17.264: READER 3000 events total time = 140.63 s average event time = 46.88 ms<br class="">
2018-03-04 11:04:17.264: FTCAL 3000 events total time = 2.73 s average event time = 0.91 ms<br class="">
2018-03-04 11:04:17.265: FTHODO 3000 events total time = 2.61 s average event time = 0.87 ms<br class="">
2018-03-04 11:04:17.266: FTEB 3000 events total time = 2.78 s average event time = 0.93 ms<br class="">
2018-03-04 11:04:17.267: DCHB 3000 events total time = 6354.61 s average event time = 2118.20 ms<br class="">
2018-03-04 11:04:17.267: DCTB 3000 events total time = 2198.50 s average event time = 732.83 ms<br class="">
2018-03-04 11:04:17.268: FTOF 3000 events total time = 11.69 s average event time = 3.90 ms<br class="">
2018-03-04 11:04:17.269: LTCC 3000 events total time = 1.04 s average event time = 0.35 ms<br class="">
2018-03-04 11:04:17.269: EC 3000 events total time = 6.25 s average event time = 2.08 ms<br class="">
2018-03-04 11:04:17.270: EBHB 3000 events total time = 6.99 s average event time = 2.33 ms<br class="">
2018-03-04 11:04:17.271: EBTB 3000 events total time = 5.85 s average event time = 1.95 ms<br class="">
2018-03-04 11:04:17.272: WRITER 3000 events total time = 10.63 s average event time = 3.54 ms<br class="">
2018-03-04 11:04:17.272: TOTAL 3000 events total time = 8744.31 s average event time = 2914.77 ms<br class="">
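<br class="">
To put the numbers above in proportion, here is a minimal Python sketch that parses the benchmark lines and ranks each service by its share of the total time. The helper names (parse_benchmark, time_shares) and the regular expression are illustrative assumptions, not part of Clara; only the figures are taken from the log. With these figures, DCHB and DCTB together account for roughly 98% of the total, about 2.85 s of the 2.91 s per event.<br class="">
<pre class="">
```python
import re

# Benchmark lines copied from the log above (timestamps dropped).
BENCHMARK_LOG = """\
READER 3000 events total time = 140.63 s average event time = 46.88 ms
FTCAL 3000 events total time = 2.73 s average event time = 0.91 ms
FTHODO 3000 events total time = 2.61 s average event time = 0.87 ms
FTEB 3000 events total time = 2.78 s average event time = 0.93 ms
DCHB 3000 events total time = 6354.61 s average event time = 2118.20 ms
DCTB 3000 events total time = 2198.50 s average event time = 732.83 ms
FTOF 3000 events total time = 11.69 s average event time = 3.90 ms
LTCC 3000 events total time = 1.04 s average event time = 0.35 ms
EC 3000 events total time = 6.25 s average event time = 2.08 ms
EBHB 3000 events total time = 6.99 s average event time = 2.33 ms
EBTB 3000 events total time = 5.85 s average event time = 1.95 ms
WRITER 3000 events total time = 10.63 s average event time = 3.54 ms
TOTAL 3000 events total time = 8744.31 s average event time = 2914.77 ms
"""

# Assumed line layout: NAME, event count, then the total time in seconds.
LINE_RE = re.compile(r"(\w+)\s+(\d+) events total time = ([\d.]+) s")


def parse_benchmark(log):
    """Return {service name: total seconds} for every matching line."""
    totals = {}
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match:
            totals[match.group(1)] = float(match.group(3))
    return totals


def time_shares(totals):
    """Return (service, fraction of TOTAL) pairs, largest share first."""
    grand_total = totals["TOTAL"]
    shares = [
        (name, seconds / grand_total)
        for name, seconds in totals.items()
        if name != "TOTAL"
    ]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, share in time_shares(parse_benchmark(BENCHMARK_LOG)):
        print(f"{name:8s} {100 * share:6.2f} %")
```
</pre>
The same parsing could be pointed at a log from the previous release to compare the per-service shares side by side.<br class="">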
<br class="">
Best regards,<br class="">
-vardan<br class="">
<br class="">______________________________<wbr class="">_________________<br class="">
Clas12_software mailing list<br class="">
<a href="mailto:Clas12_software@jlab.org" class="">Clas12_software@jlab.org</a><br class="">
<a href="https://mailman.jlab.org/mailman/listinfo/clas12_software" rel="noreferrer" target="_blank" class="">https://mailman.jlab.org/<wbr class="">mailman/listinfo/clas12_<wbr class="">software</a><br class=""></blockquote></div><br class=""></div>
</div></blockquote></div><br class=""></div></body></html>