[Moller_daq] [Moller_intdet] [EXTERNAL] Question: ADC board software - Too many files open core dump
Michael Gericke
mgericke at physics.umanitoba.ca
Wed Jan 22 11:21:15 EST 2025
Hi Malte,
Yes, that solved the problem. I did not think of that at all.
Thanks for your help!
Michael
P.S. Sorry for the spam everyone...
Btw., the page http://api.zeromq.org/4-2:zmq-ctx-new doesn't list any
errors for the function, as far as I can tell.
On 2025-01-22 8:00 a.m., Malte Wilfert wrote:
> Caution! This message was sent from outside the University of Manitoba.
>
>
> Hi,
>
> Looking at your code for CMData, I noticed that you create a
> zmq_context for every socket (data and control) and for every run. From
> my understanding of ZMQ, one zmq_context should be enough for the full
> program, since a context just manages a "list" of all sockets.
> Since you create two of them for every run and do not destroy them
> at the end, they remain open. According to the description of the
> "zmq_ctx_new" function, the only error mentioned there is:
> EMFILE: The limit on the total number of open files has been reached and
> it wasn’t possible to create a new context.
>
> This at least sounds similar to your problem.
> Can you try to use only one zmq_context for the full program and check
> whether this will solve the problem?
>
> Cheers
> Malte
>
> On 21.01.25 at 19:13, Michael Gericke wrote:
>> Hi Wouter,
>>
>> Tried that ... no difference. I think this is related to the way zmq
>> handles (or doesn't) sockets.
>>
>> Thanks,
>>
>> Michael
>>
>> On 2025-01-21 10:37 a.m., Wouter Deconinck wrote:
>>> Bandaid: I think you are running into the ulimit on maximum file
>>> descriptors. If you don't care much about that, and you have root
>>> access, you can increase it. A factor of 10 doesn't seem impossible for
>>> the system to handle. Set it in /etc/sysctl.conf for a permanent change,
>>> or use the sysctl command for a temporary one.
>>>
>>> Still gotta figure out why the fds aren't released.
>>>
>>> Wouter
>>>
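
One way to see whether descriptors are being released is to log the process's open-descriptor count after every run, next to the per-process soft limit that "ulimit -n" reports. The following is a minimal, Linux-only sketch that reads /proc/self/fd; the helper names open_fd_count and fd_soft_limit are made up for illustration and are not part of the DAQ code.

```cpp
#include <dirent.h>
#include <sys/resource.h>

#include <cstdlib>

// Number of file descriptors currently open in this process,
// or -1 if /proc/self/fd cannot be read (e.g. not on Linux).
long open_fd_count() {
    DIR* dir = opendir("/proc/self/fd");
    if (dir == nullptr) return -1;
    const int self_fd = dirfd(dir);  // descriptor used for this listing itself
    long count = 0;
    while (const dirent* entry = readdir(dir)) {
        if (entry->d_name[0] == '.') continue;              // skip "." and ".."
        if (std::atoi(entry->d_name) == self_fd) continue;  // skip our own handle
        ++count;
    }
    closedir(dir);
    return count;
}

// Soft per-process limit on open descriptors (RLIMIT_NOFILE),
// or -1 if the call fails.
long fd_soft_limit() {
    struct rlimit lim {};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return -1;
    return static_cast<long>(lim.rlim_cur);
}
```

Printing open_fd_count() at the end of each run should give a flat number if everything is released; a steady climb of a few descriptors per run points at a leak well before the limit is reached.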
>>>
>>> On Tue, Jan 21, 2025, 10:25 AM Michael Gericke via Moller_intdet
>>> <moller_intdet at jlab.org> wrote:
>>>
>>> Yes, googling was, of course, my first attempt as well ...
>>>
>>>
>>> My understanding is that epoll.cxx is mostly used for socket
>>> management and if I were to create a bunch of open sockets
>>> (without closing them) I would understand the crash.
>>> But the only place where I open sockets repeatedly is in the same
>>> loop where I close them as well (e.g. zmq_close(data_socket) :
>>> "The /zmq_close()/ function shall destroy the socket referenced by
>>> the /socket/ argument.").
>>>
>>> Either I am not doing something correctly or there is a bug in
>>> that function with respect to the OS side.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Michael
>>>
>>>
>>> On 2025-01-21 10:08 a.m., Robert Michaels wrote:
>>>>
>>>> I haven't encountered this, but I notice that if you Google "Too
>>>> many open files (src/epoll.cpp:38)" there are several
>>>> suggestions. My experience with Googling is that there may be
>>>> 10 ideas and it ends up that 1 of them works.
>>>>
>>>> yours
>>>> Bob
>>>>
>>>> Dr. Robert Michaels
>>>>
>>>> Staff Scientist, Jefferson Lab
>>>>
>>>> http://userweb.jlab.org/~rom
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>>
>>>> *From:* Moller_daq <moller_daq-bounces at jlab.org> on behalf of
>>>> Michael Gericke <Michael.Gericke at umanitoba.ca>
>>>> *Sent:* Tuesday, January 21, 2025 11:01 AM
>>>> *To:* Zuhal Seyma Demiroglu <zdemirog at jlab.org>;
>>>> Ciprian Gal <ciprian at jlab.org>;
>>>> Malte Wilfert <mwilfert at uni-mainz.de>; moller_daq at jlab.org;
>>>> moller_analysis at jlab.org; bryerton at gmail.com;
>>>> moller_intdet at jlab.org
>>>> *Subject:* [Moller_daq] [EXTERNAL] Question: ADC board software -
>>>> Too many files open core dump
>>>>
>>>> Hi everyone,
>>>>
>>>> Sorry if you get this multiple times (I am sending to several
>>>> email lists with significant overlap).
>>>>
>>>> I have an annoying problem. For the PMT testing, I want to run
>>>> through a given set of tests for each PMT in one sitting (DAQ and
>>>> analysis program running continuously), which means starting the
>>>> ADC DAQ on the computer end once and (currently - ideally) letting
>>>> it collect upwards of about 1000 runs of 5 seconds each at a time
>>>> for each PMT. The program runs fine for about 180 of these 5-second
>>>> runs and then core dumps with
>>>> "Too many open files (src/epoll.cpp:38)".
>>>>
>>>> If you have encountered this problem and figured out how to solve
>>>> it (beyond the band aid suggestions given in online posts), can
>>>> you please let me know.
>>>>
>>>> Thanks,
>>>>
>>>> Michael
>>>>
>>>>
>>>> Some more details ...
>>>>
>>>>
>>>> I am writing both root files and binary files, for the raw data.
>>>> The code uses a separate thread to write the root trees, while it
>>>> keeps getting data from the ADC board continuously.
>>>>
>>>> The thread that writes the root trees and stores them to file is
>>>> started only once and within it, ROOT files are opened and closed
>>>> in the same loop. The main process writes
>>>> the raw data file and handles the communication with the ADC
>>>> board. The raw data files are written in a function that both
>>>> opens and closes each file that is being written.
>>>>
>>>> There are two zmq sockets for communication with the ADC (control
>>>> and data) and each is opened and closed for each 5 second data
>>>> chunk that is being received from the ADC
>>>> (each single run).
>>>>
>>>> I scoured forums and various information sources, but most of the
>>>> posts I find suggest that one needs to increase the nofile
>>>> parameter in /proc/sys/fs/file-max or similar.
>>>> I have done that, but it doesn't really change anything. It just
>>>> allows me to take more runs, but doesn't solve the problem.
>>>>
>>>> In principle, I should be able to run this indefinitely (aside
>>>> from storage space issues).
>>>>
>>>> Of course I could just stop the process and restart periodically
>>>> (collecting fewer runs at a time), but I want to move through a
>>>> series of PMT voltages in one run series, with
>>>> as little time as possible between changing the HV.
>>>>
>>>> I have encountered this crash every time I start the program and
>>>> run for extended periods of time (1 hour or more) and it is
>>>> driving me crazy, because I think the program is written
>>>> such that there should not be "many" open files at a time (I
>>>> think maybe at most 5 at any given time).
>>>>
>>>> The only way around this at the moment is to use a script that
>>>> restarts the ADC software end after some number of runs before
>>>> this crash occurs. That's okay, but doesn't resolve
>>>> the basic problem.
>>>>
>>>> Somehow I feel like there is either a bug in ZMQ or I am doing
>>>> something wrong (likely the latter, but some posts I found seem
>>>> to suggest the possibility of the former).
>>>>
>>>> I don't want to throw a bunch of code at you. For now, I am just
>>>> wondering if anyone here has encountered a similar issue and could
>>>> guess what the problem is.
>>>>
>>>> If anyone wants to take a look (if you have time), the code is
>>>> located here:
>>>>
>>>> https://github.com/mtgericke/MOLLER-IntElec-PMTGain-Meas
>>>>
>>>>
>>>> The relevant file is CMData.cxx
>>>>
>>>> The relevant functions are:
>>>>
>>>> void CMData::StartDataCollection() - file lines 712-733, 758-781
>>>>
>>>> void *CMData::GetServerData(void *vargp) - file line 826
>>>>
>>>> void* CMData::FillRootTreeThread(void *vargp) - file lines 1070,
>>>> 1293
>>>>
>>>> void* CMData::GetSocket(SockType type) - file line 458
>>>>
>>>>
>>>>
>