[Moller_intdet] [Moller_daq] [EXTERNAL] Question: ADC board software - Too many files open core dump
Wouter Deconinck
wouter.deconinck at umanitoba.ca
Tue Jan 21 11:37:43 EST 2025
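Bandaid: I think you are running into the ulimit on the maximum number of
open file descriptors. If you don't care much about that, you can increase
it. A factor of 10 doesn't seem impossible for the system to handle. The
per-process limit is raised with ulimit -n for a temporary change (up to
the hard limit) or in /etc/security/limits.conf for a permanent one (root
access needed). The system-wide cap, fs.file-max, is set in /etc/sysctl.conf
for a permanent change, or with the sysctl command for a temporary one.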
Still gotta figure out why the fds aren't released.
Wouter
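
The per-process descriptor limit discussed above can also be inspected and raised from inside the program. The following is a minimal C++ sketch for Linux, not code from the DAQ software; the function names fd_soft_limit and raise_fd_soft_limit are hypothetical. It only raises the soft limit up to the existing hard limit, so it needs no root access, and like any limit increase it postpones the crash rather than fixing a descriptor leak.

```cpp
#include <sys/resource.h>
#include <algorithm>

// Current soft limit on open descriptors (the value `ulimit -n` prints).
// Returns 0 if the limit cannot be read.
rlim_t fd_soft_limit() {
    rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return 0;
    return lim.rlim_cur;
}

// Raise the soft limit toward `wanted`, never above the hard limit and
// never lowering it. Returns the soft limit in effect afterwards.
rlim_t raise_fd_soft_limit(rlim_t wanted) {
    rlimit lim{};
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return 0;
    const rlim_t target = std::min(wanted, lim.rlim_max);
    if (target > lim.rlim_cur) {
        lim.rlim_cur = target;
        setrlimit(RLIMIT_NOFILE, &lim);  // on failure the old limit stays in effect
    }
    return fd_soft_limit();
}
```

Calling raise_fd_soft_limit(10 * fd_soft_limit()) at startup would correspond to the factor of 10 suggested above, capped at whatever hard limit the system allows.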
On Tue, Jan 21, 2025, 10:25 AM Michael Gericke via Moller_intdet <
moller_intdet at jlab.org> wrote:
> Yes, googling was, of course, my first attempt as well ...
>
>
> My understanding is that epoll.cpp is mostly used for socket management
> and if I were to create a bunch of open sockets (without closing them) I
> would understand the crash.
> But the only place where I open sockets repeatedly is in the same loop
> where I close them as well (e.g. zmq_close(data_socket) : "The
> *zmq_close()* function shall destroy the socket referenced by the *socket*
> argument.").
>
> Either I am not doing something correctly or there is a bug in that
> function with respect to the OS side.
>
>
> Thanks,
>
>
> Michael
>
> On 2025-01-21 10:08 a.m., Robert Michaels wrote:
>
> I haven't encountered this, but I notice that if you Google "Too many open
> files (src/epoll.cpp:38)" there are several suggestions. My experience
> with Googling is that there may be 10 ideas and it ends up that 1 of them
> works.
>
> yours
> Bob
>
> Dr. Robert Michaels
>
> Staff Scientist, Jefferson Lab
>
> http://userweb.jlab.org/~rom
> ------------------------------
> *From:* Moller_daq <moller_daq-bounces at jlab.org> on behalf of
> Michael Gericke <Michael.Gericke at umanitoba.ca>
> *Sent:* Tuesday, January 21, 2025 11:01 AM
> *To:* Zuhal Seyma Demiroglu <zdemirog at jlab.org>; Ciprian Gal
> <ciprian at jlab.org>; Malte Wilfert <mwilfert at uni-mainz.de>;
> moller_daq at jlab.org; moller_analysis at jlab.org; bryerton at gmail.com;
> moller_intdet at jlab.org
> *Subject:* [Moller_daq] [EXTERNAL] Question: ADC board software - Too
> many files open core dump
>
>
> Hi everyone,
>
> Sorry if you get this multiple times (I am sending to several email lists
> with significant overlap).
>
> I have an annoying problem. For the PMT testing, I want to run through a
> given set of tests for each PMT in one sitting, with the DAQ and analysis
> program running continuously. That means starting the ADC DAQ on the
> computer end once and (currently, ideally) letting it collect upwards of
> about 1000 five-second runs at a time for each PMT. The program runs fine
> for about 180 five-second runs and then core dumps with "Too many open
> files (src/epoll.cpp:38)".
>
> If you have encountered this problem and figured out how to solve it
> (beyond the band aid suggestions given in online posts), can you please let
> me know.
>
> Thanks,
>
> Michael
>
>
> Some more details ...
>
>
> I am writing both ROOT files and binary files for the raw data. The code
> uses a separate thread to write the ROOT trees while it keeps getting data
> from the ADC board continuously.
>
> The thread that writes the ROOT trees and stores them to file is started
> only once, and within it ROOT files are opened and closed in the same loop.
> The main process writes the raw data file and handles the communication
> with the ADC board. The raw data files are written in a function that both
> opens and closes each file that is being written.
>
> There are two zmq sockets for communication with the ADC (control and
> data), and each is opened and closed for each 5 second data chunk that is
> received from the ADC (each single run).
>
> I scoured forums and various information sources, but most of the posts I
> find suggest that one needs to increase the nofile limit, the value in
> /proc/sys/fs/file-max, or similar. I have done that, but it doesn't really
> change anything. It just allows me to take more runs, but doesn't solve
> the problem.
>
> In principle, I should be able to run this indefinitely (aside from
> storage space issues).
>
> Of course I could just stop the process and restart periodically
> (collecting fewer runs at a time), but I want to move through a series of
> PMT voltages in one run series, with as little time as possible between
> changing the HV.
>
> I have encountered this crash every time I start the program and run for
> extended periods of time (1 hour or more), and it is driving me crazy,
> because I think the program is written such that there should not be
> "many" open files at a time (I think maybe at most 5 at any given time).
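
The expectation of about 5 open files can be checked directly on Linux by listing /proc/self/fd from inside the process. The sketch below is a hypothetical helper, not part of CMData.cxx: it tallies the open descriptors by what they point to, so that a count growing from run to run shows whether regular files, sockets, or anonymous inodes such as eventpoll and eventfd are accumulating. The tally includes the descriptor the function itself uses to read the directory.

```cpp
#include <filesystem>
#include <map>
#include <string>
#include <system_error>

// Tally this process's open descriptors by target (Linux only).
// Sockets and pipes are grouped under "socket" and "pipe"; everything else
// is keyed by the link target, e.g. a file path or "anon_inode:[eventpoll]".
std::map<std::string, int> open_fd_summary() {
    namespace fs = std::filesystem;
    std::map<std::string, int> tally;
    for (const auto& entry : fs::directory_iterator("/proc/self/fd")) {
        std::error_code ec;
        const std::string target = fs::read_symlink(entry.path(), ec).string();
        if (ec) continue;  // descriptor was closed while we were looking
        if (target.rfind("socket:[", 0) == 0) {
            ++tally["socket"];
        } else if (target.rfind("pipe:[", 0) == 0) {
            ++tally["pipe"];
        } else {
            ++tally[target];
        }
    }
    return tally;
}
```

Printing this tally once per run (or every few runs) should show which kind of descriptor is piling up well before the limit is reached.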
>
> The only way around this at the moment is to use a script that restarts
> the ADC software end after some number of runs, before this crash occurs.
> That's okay, but doesn't resolve the basic problem.
>
> Somehow I feel like there is either a bug in ZMQ or I am doing something
> wrong (likely the latter, but some posts I found seem to suggest the
> possibility of the former).
>
> I don't want to throw a bunch of code at you. For now, I am just wondering
> if anyone here has encountered a similar issue and could guess what the
> problem is.
>
> If anyone wants to take a look (if you have time), the code is located
> here:
>
> https://github.com/mtgericke/MOLLER-IntElec-PMTGain-Meas
>
> The relevant file is CMData.cxx
>
> The relevant functions are:
>
> void CMData::StartDataCollection() - file lines 712-733, 758-781
>
> void* CMData::GetServerData(void *vargp) - file line 826
>
> void* CMData::FillRootTreeThread(void *vargp) - file lines 1070, 1293
>
> void* CMData::GetSocket(SockType type) - file line 458
>
>
>
>