[Moller_intdet] [Moller_daq] [EXTERNAL] Question: ADC board software - Too many files open core dump

Michael Gericke Michael.Gericke at umanitoba.ca
Tue Jan 21 13:13:09 EST 2025


Hi Wouter,

Tried that ... no difference. I think this is related to the way zmq 
handles (or doesn't) sockets.
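
One way to check whether descriptors really accumulate from run to run is to count the entries in /proc/self/fd before and after each five-second run. This is a minimal sketch, assuming Linux with /proc mounted; count_open_fds and fd_delta are hypothetical helper names, not part of the DAQ code or of libzmq:

```cpp
#include <dirent.h>

#include <cstddef>
#include <cstring>

// Count the file descriptors currently open in this process by listing
// /proc/self/fd (Linux only). Returns 0 if the directory cannot be opened.
std::size_t count_open_fds() {
    DIR* dir = opendir("/proc/self/fd");
    if (dir == nullptr) {
        return 0;
    }
    std::size_t n = 0;
    while (const dirent* entry = readdir(dir)) {
        // Skip the "." and ".." entries; every other name is an open fd.
        if (std::strcmp(entry->d_name, ".") != 0 &&
            std::strcmp(entry->d_name, "..") != 0) {
            ++n;
        }
    }
    closedir(dir);
    // opendir() itself held one descriptor while we were listing; exclude it.
    return n - 1;
}

// Run one unit of work and report how many descriptors it left behind
// (positive) or released (negative). Zero means the work cleaned up fully.
template <typename Fn>
long fd_delta(Fn&& one_run) {
    const std::size_t before = count_open_fds();
    one_run();
    const std::size_t after = count_open_fds();
    return static_cast<long>(after) - static_cast<long>(before);
}
```

Wrapping one open/collect/close cycle in fd_delta should give 0 if everything is released. A steady positive value per run would narrow down which step keeps descriptors open. Some cleanup may happen asynchronously, so it is probably safer to compare readings over many runs than to trust a single one.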

Thanks,

Michael

On 2025-01-21 10:37 a.m., Wouter Deconinck wrote:
> Bandaid: I think you are running into the ulimit on maximum file 
> descriptors. If you don't care much about that, and you have root 
> access, you can increase it. A factor 10 doesn't seem impossible for 
> the system to handle. It's in /etc/sysctl.conf for permanent change. 
> sysctl command for temporary change.
>
> Still gotta figure out why the fds aren't released.
>
> Wouter
>
>
> On Tue, Jan 21, 2025, 10:25 AM Michael Gericke via Moller_intdet 
> <moller_intdet at jlab.org> wrote:
>
>     *Caution!* This message was sent from outside the University of
>     Manitoba.
>
>
>     Yes, googling was, of course, my first attempt as well ...
>
>
>     My understanding is that epoll.cxx is mostly used for socket
>     management and if I were to create a bunch of open sockets
>     (without closing them) I would understand the crash.
>     But the only place where I open sockets repeatedly is in the same
>     loop where I close them as well (e.g. zmq_close(data_socket):
>     "The zmq_close() function shall destroy the socket referenced by
>     the socket argument.").
>
>     Either I am not doing something correctly or there is a bug in
>     that function with respect to the OS side.
>
>
>     Thanks,
>
>
>     Michael
>
>
>     On 2025-01-21 10:08 a.m., Robert Michaels wrote:
>>
>>     I haven't encountered this, but I notice that if you Google "Too
>>     many open files (src/epoll.cpp:38)" there are several
>>     suggestions.   My experience with Googling is that there may be
>>     10 ideas and it ends up that 1 of them works.
>>
>>     yours
>>     Bob
>>
>>     Dr. Robert Michaels
>>
>>     Staff Scientist, Jefferson Lab
>>
>>     http://userweb.jlab.org/~rom
>>
>>     ------------------------------------------------------------------------
>>     *From:* Moller_daq <moller_daq-bounces at jlab.org> on behalf of
>>     Michael Gericke <Michael.Gericke at umanitoba.ca>
>>     *Sent:* Tuesday, January 21, 2025 11:01 AM
>>     *To:* Zuhal Seyma Demiroglu <zdemirog at jlab.org>;
>>     Ciprian Gal <ciprian at jlab.org>;
>>     Malte Wilfert <mwilfert at uni-mainz.de>;
>>     moller_daq at jlab.org; moller_analysis at jlab.org;
>>     bryerton at gmail.com; moller_intdet at jlab.org
>>     *Subject:* [Moller_daq] [EXTERNAL] Question: ADC board software -
>>     Too many files open core dump
>>
>>     Hi everyone,
>>
>>     Sorry if you get this multiple times (I am sending to several
>>     email lists with significant overlap).
>>
>>     I have an annoying problem. For the PMT testing, I want to run
>>     through a given set of tests for each PMT in one sitting (DAQ and
>>     analysis program running continuously), which means starting the
>>     ADC DAQ on the computer end once and (currently - ideally) letting
>>     it collect upwards of about 1000 five-second runs at a time for
>>     each PMT. The program runs fine for about 180 five-second runs and
>>     then core dumps with "Too many open files (src/epoll.cpp:38)".
>>
>>     If you have encountered this problem and figured out how to solve
>>     it (beyond the band aid suggestions given in online posts), can
>>     you please let me know.
>>
>>     Thanks,
>>
>>     Michael
>>
>>
>>     Some more details ...
>>
>>
>>     I am writing both ROOT files and binary files for the raw data.
>>     The code uses a separate thread to write the ROOT trees, while it
>>     keeps getting data from the ADC board continuously.
>>
>>     The thread that writes the ROOT trees and stores them to file is
>>     started only once, and within it ROOT files are opened and closed
>>     in the same loop. The main process writes the raw data file and
>>     handles the communication with the ADC board. The raw data files
>>     are written in a function that both opens and closes each file
>>     that is being written.
>>
>>     There are two zmq sockets for communication with the ADC (control
>>     and data) and each is opened and closed for each 5 second data
>>     chunk that is being received from the ADC
>>     (each single run).
>>
>>     I scoured forums and various information sources, but most of the
>>     posts I find suggest that one needs to increase the nofile
>>     parameter in /proc/sys/fs/file-max or similar.
>>     I have done that, but it doesn't really change anything. It just
>>     allows me to take more runs, but doesn't solve the problem.
>>
>>     In principle, I should be able to run this indefinitely (aside
>>     from storage space issues).
>>
>>     Of course I could just stop the process and restart periodically
>>     (collecting fewer runs at a time), but I want to move through a
>>     series of PMT voltages in one run series, with
>>     as little time as possible between changing the HV.
>>
>>     I have encountered this crash every time I start the program and
>>     run for extended periods of time (1 hour or more) and it is
>>     driving me crazy, because I think the program is written
>>     such that there should not be "many" open files at a time (I
>>     think maybe at most 5 at any given time).
>>
>>     The only way around this at the moment is to use a script that
>>     restarts the ADC software end after some number of runs before
>>     this crash occurs. That's okay, but doesn't resolve
>>     the basic problem.
>>
>>     Somehow I feel like there is either a bug in ZMQ or I am doing
>>     something wrong (likely the latter, but some posts I found seem
>>     to suggest the possibility of the former).
>>
>>     I don't want to throw a bunch of code at you. For now, I am just
>>     wondering if anyone here has encountered a similar issue and could
>>     guess what the problem is.
>>
>>     If anyone wants to take a look (if you have time), the code is
>>     located here:
>>
>>     https://github.com/mtgericke/MOLLER-IntElec-PMTGain-Meas
>>
>>     The relevant file is CMData.cxx
>>
>>     The relevant functions are:
>>
>>     void CMData::StartDataCollection() - file lines 712-733, 758-781
>>
>>     void *CMData::GetServerData(void *vargp)  - file line 826
>>
>>     void* CMData::FillRootTreeThread(void *vargp) - file lines 1070, 1293
>>
>>     void* CMData::GetSocket(SockType type)   - file line 458
>>
>>
>>
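
The text "Too many open files" is the standard message for EMFILE, the per-process descriptor limit (RLIMIT_NOFILE, the value that `ulimit -n` reports). The system-wide cap in /proc/sys/fs/file-max produces ENFILE instead, which reads "Too many open files in system". It may therefore be worth checking which limit the DAQ process actually runs with. The following is a minimal sketch using POSIX getrlimit/setrlimit; FdLimits, get_fd_limits and raise_soft_fd_limit_to_hard are hypothetical names used only for illustration:

```cpp
#include <sys/resource.h>

// Soft and hard per-process limits on open file descriptors.
struct FdLimits {
    rlim_t soft;
    rlim_t hard;
};

// Read the current RLIMIT_NOFILE values. Returns false if the call fails.
bool get_fd_limits(FdLimits& out) {
    struct rlimit rl {};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return false;
    }
    out.soft = rl.rlim_cur;
    out.hard = rl.rlim_max;
    return true;
}

// Raise the soft limit up to the hard limit for this process only.
// No root access is needed for this; raising the hard limit would need it.
bool raise_soft_fd_limit_to_hard() {
    struct rlimit rl {};
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return false;
    }
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &rl) == 0;
}
```

Logging the values from get_fd_limits at startup shows how much headroom a run series has. As described in the original message, a higher limit only allows more runs before the crash if descriptors are not being released.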