[Clas12_software] DB initialization and FTCAL

Tue Mar 6 12:30:19 EST 2018

Dear Vardan,

I am getting a similar memory management issue when processing run 3432:

Shutting down...
-------------------------------------------
Resource usage summary:
cput=72:27:10,mem=17424908kb,vmem=61633456kb,walltime=01:16:39
set job 36237851 done reason to exceed resource limit

Is there a way to investigate which service is causing this memory issue?
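One way to start narrowing this down (a sketch, not a Clara feature): if the failure is the same "unable to create new native thread" error as in the trace quoted below, the live thread count matters as much as heap size. The hypothetical ThreadCensus class uses the standard java.lang.management beans to print thread and heap totals. It also groups live threads by name with trailing digits stripped, so a pool that keeps growing shows up as one large bucket. This only helps if the relevant pools give their threads recognizable names, which I have not checked for Clara or xMsg. The snapshot must run inside the same JVM as the services (e.g. from a periodic task); running `jstack <pid>` against that JVM gives the same thread list from outside.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: a snapshot of JVM threads and heap, grouped by thread-name prefix. */
public class ThreadCensus {

    /** Strips trailing digits so "pool-3-thread-17" and "pool-3-thread-18" share a bucket. */
    static String prefixOf(String threadName) {
        return threadName.replaceAll("\\d+$", "");
    }

    /** Counts live threads in this JVM per name prefix, sorted by prefix. */
    static Map<String, Integer> countByPrefix() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(prefixOf(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.out.printf("live threads = %d, peak = %d%n",
                threads.getThreadCount(), threads.getPeakThreadCount());
        System.out.printf("heap used = %d MB%n",
                memory.getHeapMemoryUsage().getUsed() / (1024 * 1024));
        // One line per thread-name prefix; a runaway pool stands out as a large count.
        countByPrefix().forEach((prefix, n) -> System.out.printf("%5d  %s%n", n, prefix));
    }
}
```

Printing this every few minutes during a run and comparing the buckets should show whether one pool grows without bound.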

Thanks

Best regards
FX

On Sun, Mar 4, 2018 at 11:36 AM, Vardan Gyurjyan <gurjyan at jlab.org> wrote:

> Hi All,
>
> Here is the behavior observed when running Clara on clonfarm0 (32-core AMD,
> RH Linux 6).
>
> Clas12 software release: 5a.1.0
> Data file: clas_003191.evio.7.hipo
>
> Running the production service chain without the FTCAL and FTHODO services:
> it runs with no problems or exceptions, yet there is long (unexplained) DB
> activity at the configure stage of the reconstruction (see the attached docx
> file). This initialization can cause DB issues.
>
> Adding FTCAL to the service chain: an "[IndexedTable] ---> error.. entry does
> not exist" console printout for every event (slowing down reconstruction),
> and after a while an out-of-memory exception (see below).
>
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> [IndexedTable] ---> error.. entry does not exist
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:714)
>         at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>         at org.jlab.coda.xmsg.core.xMsg$1.handle(xMsg.java:568)
>         at org.jlab.coda.xmsg.core.xMsgSubscription$Handler.run(xMsgSubscription.java:108)
>         at java.lang.Thread.run(Thread.java:745)
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:714)
>         at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)
>         at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
>         at org.jlab.clara.sys.Service.execute(Service.java:176)
>         at org.jlab.clara.sys.Service.access$300(Service.java:53)
>         at org.jlab.clara.sys.Service$ServiceCallBack.callback(Service.java:286)
>         at org.jlab.coda.xmsg.core.xMsg$1.lambda$handle$0(xMsg.java:568)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
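A note on the trace above, offered as an assumption rather than a diagnosis: the failure is inside ThreadPoolExecutor.addWorker, meaning the executor tried to start one more worker thread and the OS refused. A pool created with Executors.newCachedThreadPool(), for example, has no upper bound on threads, so if event callbacks block (on slow DB lookups or console output, say), each new message can add a thread. Whether xMsg or Clara configures its pools this way is not visible from the trace. The sketch below (hypothetical class name and sizes) shows the bounded alternative: a fixed maximum, a bounded queue, and CallerRunsPolicy so that a saturated pool slows the submitter down instead of creating more threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustration only: an executor whose number of native threads is capped. */
public class BoundedPoolDemo {

    static final int MAX_THREADS = 4;  // hypothetical cap on worker threads
    static final int QUEUE_SIZE = 16;  // hypothetical backlog before throttling

    static ThreadPoolExecutor newBoundedPool() {
        return new ThreadPoolExecutor(
                MAX_THREADS, MAX_THREADS,           // core == max: never more than 4 workers
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(QUEUE_SIZE),
                // When workers are busy and the queue is full, the submitting
                // thread runs the task itself instead of a new thread being started.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Submits slow tasks and returns the largest number of worker threads ever created. */
    static int runBurst(int tasks) {
        ThreadPoolExecutor pool = newBoundedPool();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(10); // stands in for a slow or blocked event callback
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pool.getLargestPoolSize();
    }

    public static void main(String[] args) {
        System.out.println("largest pool size = " + runBurst(1000));
    }
}
```

The trade-off is back-pressure: with CallerRunsPolicy the thread that calls execute runs overflow tasks itself (here the main thread; in the first trace above it would be the xMsg subscription handler), so intake slows while the pool is saturated instead of failing.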
>
> Note that this is not only an FT issue (as previously thought); it may be an
> overall memory management issue of the clas12 service engines (garbage
> creation, etc.). Let us first understand and address the DB communication
> issues, and get rid of per-event printouts in all service engines.
> Printouts are useful for debugging engines, but not for production
> service deployments. Clara provides mechanisms to report errors directly,
> triggering specific actions. After that we can return to the "out of
> memory" exception.
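For the per-event printouts, a minimal sketch of the idea: report a given problem once and count the repeats, so a missing table entry costs one console line instead of one per event. WarnOnce and the key string are hypothetical; this is plain Java and does not use Clara's own error-reporting mechanism, whose API is not shown in this thread.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: print a warning only on its first occurrence and count repeats. */
public class WarnOnce {

    private static final ConcurrentMap<String, AtomicLong> COUNTS = new ConcurrentHashMap<>();

    /** Returns true (and prints) only the first time this key is reported, even across threads. */
    static boolean warn(String key, String message) {
        AtomicLong n = COUNTS.computeIfAbsent(key, k -> new AtomicLong());
        boolean first = n.getAndIncrement() == 0; // atomic: exactly one caller sees 0
        if (first) {
            System.err.println(message + " (further occurrences are counted, not printed)");
        }
        return first;
    }

    /** How many times this key was reported, including the first. */
    static long count(String key) {
        AtomicLong n = COUNTS.get(key);
        return n == null ? 0 : n.get();
    }

    public static void main(String[] args) {
        for (int event = 0; event < 3000; event++) {
            // e.g. a lookup of a missing calibration entry, once per event
            warn("ftcal:missing-entry", "[IndexedTable] ---> error.. entry does not exist");
        }
        System.out.println("missing-entry reported " + count("ftcal:missing-entry") + " times");
    }
}
```

The counts can then be summarized once at the end of a run rather than interleaved with every event.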
>
> Also, please take a look at the benchmark results below. We spent almost
> 3 s/event/core reconstructing charged-particle tracks in the DC. This is
> using 5a.1.0 on online events (run 3544, currently active). Is this real?
>
>
> 2018-03-04 11:04:17.263: Benchmark results:
> 2018-03-04 11:04:17.264:   READER           3000 events    total time =  140.63 s    average event time =   46.88 ms
> 2018-03-04 11:04:17.264:   FTCAL            3000 events    total time =    2.73 s    average event time =    0.91 ms
> 2018-03-04 11:04:17.265:   FTHODO           3000 events    total time =    2.61 s    average event time =    0.87 ms
> 2018-03-04 11:04:17.266:   FTEB             3000 events    total time =    2.78 s    average event time =    0.93 ms
> 2018-03-04 11:04:17.267:   DCHB             3000 events    total time = 6354.61 s    average event time = 2118.20 ms
> 2018-03-04 11:04:17.267:   DCTB             3000 events    total time = 2198.50 s    average event time =  732.83 ms
> 2018-03-04 11:04:17.268:   FTOF             3000 events    total time =   11.69 s    average event time =    3.90 ms
> 2018-03-04 11:04:17.269:   LTCC             3000 events    total time =    1.04 s    average event time =    0.35 ms
> 2018-03-04 11:04:17.269:   EC               3000 events    total time =    6.25 s    average event time =    2.08 ms
> 2018-03-04 11:04:17.270:   EBHB             3000 events    total time =    6.99 s    average event time =    2.33 ms
> 2018-03-04 11:04:17.271:   EBTB             3000 events    total time =    5.85 s    average event time =    1.95 ms
> 2018-03-04 11:04:17.272:   WRITER           3000 events    total time =   10.63 s    average event time =    3.54 ms
> 2018-03-04 11:04:17.272:   TOTAL            3000 events    total time = 8744.31 s    average event time = 2914.77 ms
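A quick arithmetic check of the table, with the totals copied from the printout above (the class name is made up): the per-service totals add up exactly to the TOTAL line, 8744.31 s, and DCHB plus DCTB account for about 97.8% of it, i.e. roughly 2.85 s of the 2.91 s spent per event.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Recomputes averages and the DC share from the benchmark totals quoted above (3000 events). */
public class BenchmarkShare {

    static final int EVENTS = 3000;

    /** Per-service total time in seconds, copied from the benchmark printout. */
    static Map<String, Double> totals() {
        Map<String, Double> t = new LinkedHashMap<>();
        t.put("READER", 140.63);
        t.put("FTCAL", 2.73);
        t.put("FTHODO", 2.61);
        t.put("FTEB", 2.78);
        t.put("DCHB", 6354.61);
        t.put("DCTB", 2198.50);
        t.put("FTOF", 11.69);
        t.put("LTCC", 1.04);
        t.put("EC", 6.25);
        t.put("EBHB", 6.99);
        t.put("EBTB", 5.85);
        t.put("WRITER", 10.63);
        return t;
    }

    /** Converts a total over all events (s) into an average per event (ms). */
    static double averageMs(double totalSeconds) {
        return totalSeconds / EVENTS * 1000.0;
    }

    /** Sum of all per-service totals (s). */
    static double sum(Map<String, Double> t) {
        return t.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    /** Percentage of the summed time spent in DC tracking (DCHB + DCTB). */
    static double dcShare(Map<String, Double> t) {
        return 100.0 * (t.get("DCHB") + t.get("DCTB")) / sum(t);
    }

    public static void main(String[] args) {
        Map<String, Double> t = totals();
        double all = sum(t);
        System.out.printf("sum of services = %.2f s (%.2f ms/event)%n", all, averageMs(all));
        System.out.printf("DC share = %.1f%%%n", dcShare(t));
    }
}
```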
>
> Best regards,
> -vardan
>
> _______________________________________________
> Clas12_software mailing list
> Clas12_software at jlab.org
> https://mailman.jlab.org/mailman/listinfo/clas12_software
>