[Moller_intdet] [Moller_daq] [EXTERNAL] Question: ADC board software - Too many files open core dump

Michael Gericke Michael.Gericke at umanitoba.ca
Tue Jan 21 11:22:43 EST 2025


Yes, googling was, of course, my first attempt as well ...


My understanding is that epoll.cpp is mostly used for socket management, 
and if I were creating a bunch of sockets without closing them, I 
would understand the crash.
But the only place where I open sockets repeatedly is the same loop 
where I also close them (e.g. zmq_close(data_socket): "The 
zmq_close() function shall destroy the socket referenced by the 
socket argument.").

Either I am not doing something correctly or there is a bug in that 
function with respect to the OS side.
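
One way to rule out the "not doing something correctly" case is to make sure that every exit path from the per-run loop body releases what it opened, including error paths that return or continue early. The following is a minimal sketch of that idea, not the code in CMData.cxx: FdGuard and RunOnce are hypothetical names, and a plain POSIX pipe stands in for the two per-run sockets so that the example needs no ZeroMQ headers. The same pattern would apply to a ZeroMQ socket with zmq_close() as the release step and, if a context is also created per run, to zmq_ctx_term().

```cpp
// Sketch only: FdGuard and RunOnce are hypothetical, not taken from CMData.cxx.
#include <unistd.h>

// Owns one POSIX file descriptor and closes it when the guard leaves scope.
class FdGuard {
public:
    explicit FdGuard(int fd) : fd_(fd) {}
    ~FdGuard()
    {
        if (fd_ >= 0) {
            close(fd_);
        }
    }
    FdGuard(const FdGuard&) = delete;
    FdGuard& operator=(const FdGuard&) = delete;
    int get() const { return fd_; }

private:
    int fd_;
};

// Stand-in for one run: the two ends of a pipe play the role of the
// control and data sockets. Returns false on the simulated early exit.
bool RunOnce(bool fail_early)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return false;  // nothing was opened, so nothing needs closing
    }
    FdGuard read_end(fds[0]);
    FdGuard write_end(fds[1]);

    if (fail_early) {
        return false;  // both guards still close their descriptors here
    }
    // ... normal per-run work would go here ...
    return true;
}
```

With this shape, calling RunOnce() any number of times leaves the process with the same set of open descriptors it started with, whether or not a run bails out early.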


Thanks,


Michael


On 2025-01-21 10:08 a.m., Robert Michaels wrote:
>
> I haven't encountered this, but I notice that if you Google "Too many 
> open files (src/epoll.cpp:38)" there are several suggestions.   My 
> experience with Googling is that there may be 10 ideas and it ends up 
> that 1 of them works.
>
> yours
> Bob
>
> Dr. Robert Michaels
>
> Staff Scientist, Jefferson Lab
>
> http://userweb.jlab.org/~rom
>
> ------------------------------------------------------------------------
> *From:* Moller_daq <moller_daq-bounces at jlab.org> on behalf of Michael 
> Gericke <Michael.Gericke at umanitoba.ca>
> *Sent:* Tuesday, January 21, 2025 11:01 AM
> *To:* Zuhal Seyma Demiroglu <zdemirog at jlab.org>; Ciprian Gal 
> <ciprian at jlab.org>; Malte Wilfert <mwilfert at uni-mainz.de>; 
> moller_daq at jlab.org <moller_daq at jlab.org>; moller_analysis at jlab.org 
> <moller_analysis at jlab.org>; bryerton at gmail.com <bryerton at gmail.com>; 
> moller_intdet at jlab.org <moller_intdet at jlab.org>
> *Subject:* [Moller_daq] [EXTERNAL] Question: ADC board software - Too 
> many files open core dump
>
> Hi everyone,
>
> Sorry if you get this multiple times (I am sending to several email 
> lists with significant overlap).
>
> I have an annoying problem. For the PMT testing, I want to run through 
> a given set of tests for each PMT in one sitting (DAQ and analysis 
> program running continuously). That means starting the ADC DAQ on the 
> computer end once and (ideally) letting it collect upwards of about 
> 1000 five-second runs at a time for each PMT. The program runs fine 
> for about 180 five-second runs and then core dumps with "Too many open 
> files (src/epoll.cpp:38)".
>
> If you have encountered this problem and figured out how to solve it 
> (beyond the band aid suggestions given in online posts), can you 
> please let me know.
>
> Thanks,
>
> Michael
>
>
> Some more details ...
>
>
> I am writing both ROOT files and binary files for the raw data. The 
> code uses a separate thread to write the ROOT trees, while it keeps 
> getting data from the ADC board continuously.
>
> The thread that writes the ROOT trees and stores them to file is 
> started only once, and within it ROOT files are opened and closed in 
> the same loop. The main process writes the raw data file and handles 
> the communication with the ADC board. The raw data files are written 
> in a function that both opens and closes each file that is being 
> written.
>
> There are two zmq sockets for communication with the ADC (control and 
> data), and each is opened and closed for each 5 second data chunk 
> received from the ADC (each single run).
>
> I scoured forums and various information sources, but most of the 
> posts I find suggest that one needs to increase the nofile parameter 
> in /proc/sys/fs/file-max or similar.
> I have done that, but it doesn't really change anything. It just 
> allows me to take more runs, but doesn't solve the problem.
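
For reference, two different limits are involved here. /proc/sys/fs/file-max is the system-wide ceiling, while the limit a single process runs into is normally its own RLIMIT_NOFILE (what `ulimit -n` reports). On glibc the message "Too many open files" is the text for EMFILE, the per-process case; the system-wide ENFILE reads "Too many open files in system". Raising either one only postpones the crash if descriptors are leaking, which matches the observation above. The sketch below shows how the process could report its own limits at startup; FdLimits, GetFdLimits and PrintFdLimits are hypothetical names, not part of the ADC software.

```cpp
// Sketch only: FdLimits, GetFdLimits and PrintFdLimits are hypothetical names.
#include <sys/resource.h>

#include <cerrno>
#include <cstring>
#include <iostream>

struct FdLimits {
    bool ok;      // false if getrlimit() failed
    rlim_t soft;  // limit currently enforced on this process
    rlim_t hard;  // ceiling up to which the soft limit may be raised
};

// Query the per-process limit on open file descriptors.
FdLimits GetFdLimits()
{
    FdLimits out = {false, 0, 0};
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) == 0) {
        out.ok = true;
        out.soft = lim.rlim_cur;
        out.hard = lim.rlim_max;
    }
    return out;
}

// Print the limits together with the two similar-looking error texts.
void PrintFdLimits()
{
    const FdLimits lim = GetFdLimits();
    if (!lim.ok) {
        std::cerr << "getrlimit failed: " << std::strerror(errno) << "\n";
        return;
    }
    std::cout << "RLIMIT_NOFILE soft=" << lim.soft
              << " hard=" << lim.hard << "\n";
    std::cout << "EMFILE means: " << std::strerror(EMFILE) << "\n";
    std::cout << "ENFILE means: " << std::strerror(ENFILE) << "\n";
}
```

If the soft limit is 1024 and the crash comes after roughly 180 runs, that would be consistent with a handful of descriptors being left behind per run, although that is only an inference from the numbers quoted in this thread.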
>
> In principle, I should be able to run this indefinitely (aside from 
> storage space issues).
>
> Of course I could just stop the process and restart periodically 
> (collecting fewer runs at a time), but I want to move through a series 
> of PMT voltages in one run series, with
> as little time as possible between changing the HV.
>
> I have encountered this crash every time I start the program and run 
> for extended periods of time (1 hour or more), and it is driving me 
> crazy, because I think the program is written such that there should 
> not be "many" open files at a time (I think at most 5 at any given 
> time).
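
The "at most 5 open files" expectation can be checked directly on Linux by listing /proc/self/fd from inside the process once per run. The listing covers every descriptor, not only regular files: sockets, pipes, and anonymous inodes such as eventfd and epoll instances all count against the same limit. The sketch below groups the descriptors by kind so that a growing category stands out. CountOpenFds, TotalOpenFds and PrintOpenFds are hypothetical helpers, not functions from CMData.cxx.

```cpp
// Linux-only sketch: summarize this process's open descriptors by kind.
// CountOpenFds, TotalOpenFds and PrintOpenFds are hypothetical helpers.
#include <dirent.h>
#include <unistd.h>

#include <cstdlib>
#include <iostream>
#include <map>
#include <string>

// Returns a map from descriptor kind ("socket", "pipe", "path", or the
// full "anon_inode:[...]" name) to the number of open descriptors.
std::map<std::string, int> CountOpenFds()
{
    std::map<std::string, int> counts;
    DIR* dir = opendir("/proc/self/fd");
    if (dir == nullptr) {
        return counts;  // /proc not available: return an empty summary
    }
    const int own_fd = dirfd(dir);  // descriptor used by this listing itself
    while (dirent* entry = readdir(dir)) {
        const std::string name = entry->d_name;
        if (name == "." || name == "..") {
            continue;
        }
        if (std::atoi(name.c_str()) == own_fd) {
            continue;  // do not count the directory handle opened above
        }
        char target[4096];
        const std::string link = "/proc/self/fd/" + name;
        const ssize_t len = readlink(link.c_str(), target, sizeof(target) - 1);
        if (len < 0) {
            continue;  // descriptor went away while we were listing
        }
        target[len] = '\0';

        std::string kind(target);
        if (kind.compare(0, 7, "socket:") == 0) {
            kind = "socket";
        } else if (kind.compare(0, 5, "pipe:") == 0) {
            kind = "pipe";
        } else if (kind.compare(0, 11, "anon_inode:") != 0) {
            kind = "path";  // regular files, devices, directories
        }
        ++counts[kind];
    }
    closedir(dir);
    return counts;
}

// Sum over all kinds.
int TotalOpenFds(const std::map<std::string, int>& counts)
{
    int total = 0;
    for (const auto& kv : counts) {
        total += kv.second;
    }
    return total;
}

// Print one summary line, e.g. at the end of every run.
void PrintOpenFds(const char* label)
{
    const std::map<std::string, int> counts = CountOpenFds();
    std::cout << label << ": total=" << TotalOpenFds(counts);
    for (const auto& kv : counts) {
        std::cout << " " << kv.first << "=" << kv.second;
    }
    std::cout << "\n";
}
```

Calling PrintOpenFds() once after each run would show whether the total stays flat or climbs, and which kind is climbing. A steady increase in "socket" or in the eventfd and eventpoll entries would point toward the messaging side rather than the ROOT or raw data files.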
>
> The only way around this at the moment is to use a script that 
> restarts the ADC software end after some number of runs before this 
> crash occurs. That's okay, but doesn't resolve
> the basic problem.
>
> Somehow I feel like there is either a bug in ZMQ or I am doing 
> something wrong (likely the latter, but some posts I found seem to 
> suggest the possibility of the former).
>
> I don't want to throw a bunch of code at you. For now, I am just 
> wondering if anyone here has encountered a similar issue and could 
> guess what the problem is.
>
> If anyone wants to take a look (if you have time), the code is located 
> here:
>
> https://github.com/mtgericke/MOLLER-IntElec-PMTGain-Meas
>
> The relevant file is CMData.cxx
>
> The relevant functions are:
>
> void CMData::StartDataCollection() - file lines 712-733, 758-781
>
> void *CMData::GetServerData(void *vargp)  - file line 826
>
> void* CMData::FillRootTreeThread(void *vargp) - file lines 1070, 1293
>
> void* CMData::GetSocket(SockType type)   - file line 458
>
>
>