[Halld-online] [New Logentry] Large Scale DAQ testing
davidl at jlab.org
Thu Jul 3 11:50:03 EDT 2014
Logentry Text:
--
Today we continued our regular testing and development of the large-scale DAQ system in Hall-D. We again used the 50-crate system with the softROC setup, in which the mcROL reads simulated data from an EVIO file and accesses neither the VME backplane nor the Trigger system.
Today we started on gluon50, with the IB network interface turned off before starting anything. (Previously, we had used gluon53.)
The configuration runs 4 Data Concentrators, one Secondary Event Builder, and one Event Recorder all on the same host machine. The 50 ROCs are run on their respective hosts.
1st attempt:
- Components couldn’t connect to platform. Not sure why.
2nd attempt:
- Some messages indicating X11 windows couldn’t open, but all windows appear to have opened.
- Message was “Warning: No xauth data”, “error in locking authority file /home/furletov/.xauthority”
- Components are waiting for connections:
“Socket connect error: Connection timed out”
“client thread 2: error recreating rc client’s TCP send socket”
- Components are slowly connecting one at a time.
- Eventually killed all components by hand (actually, via script)
3rd attempt:
- Use configuration with fewer ROCs “bcal_n4_sf” (4 ROCs)
- Everything came up right away.
- Configure failed due to “Failed Communication with the client”
Sergey noted that this configuration was working fine just before 9am.
Vardan said he committed some changes to CODA at 9am. We are going to back out those changes.
(reverted Afecs to revision 12773)
4th attempt:
- Using CODA version prior to Vardan’s commit this morning.
- Configure succeeded
- Download succeeded
- Prestart failed: ROCs reported “roc-prestart: ET initialization failed”
ET ports:
SEB0 23911
DCFCAL 23913
DCBCAL 23915
DCSTPS 23917
DCFDC 23919
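
When ET initialization fails, one quick check is whether the ET ports listed above accept connections from the ROC hosts. The following is a minimal sketch, assuming each ET system listens for TCP connections on its listed port; the function names are hypothetical and this is not part of CODA.

```python
import socket

# ET ports as listed in this log entry.
ET_PORTS = {
    "SEB0": 23911,
    "DCFCAL": 23913,
    "DCBCAL": 23915,
    "DCSTPS": 23917,
    "DCFDC": 23919,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_et_ports(host, ports=ET_PORTS, timeout=2.0):
    """Map each component name to whether its ET port is reachable on host."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}
```

Running `check_et_ports("gluon50")` from a ROC host would show which ports respond on the event-builder host used today. A port that responds on gluon53 but not on gluon50 would point to a stale host setting rather than a problem in ET itself.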
5th attempt:
- Reset followed by configure
- Configure succeeded
- Prestart failed due to ET initialization again (all ROCs)
- Problem was that the direct connection from the ROCs was set to use gluon53 in jcedit
- Also an issue with the file in COOL that stores the connection hosts: if the hosts
change, it must be deleted by hand or all processes restarted.
6th attempt:
- ROCBCAL2 failed because ROC died. This appears to have happened during reset.
- All other components seemed OK
7th attempt:
- Restarted ROCBCAL2 manually
- Failed at download because mcROL.so could not resolve its link to libexpat.so
8th attempt:
- Recompiled mcROL.so on rocdev1 with a fix that uses a local file on the RAM disk rather than the NFS-mounted one.
- Restarted all components from scratch
- Configure succeeded
- Download succeeded
- Prestart succeeded
- Go succeeded
- Data flowed until we tried restarting softROCcontroller with 10k events per chunk
9th attempt:
- Reset
- Configure succeeded
- Download succeeded
- softROCcontroller started with 5000 events per chunk
- Prestart succeeded
- Go succeeded
- DCBCAL failed after several thousand events with “etDeadException”
10th attempt:
- Reset and restarted without changing anything
- Ran for ~31k events, then the event rate fell to zero. All components still reported active
- Most ROCs had exactly 31k events, but all BCAL crates had fewer events
- All BCAL ROCs use the same DC, so there may be a back-pressure issue there.
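
The comparison made by eye here (which ROCs fell behind, and whether they share a DC) can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and the per-ROC event counts and the ROC-to-DC mapping would have to be collected by hand, since no such interface is described in this entry.

```python
from collections import defaultdict


def lagging_by_dc(event_counts, roc_to_dc):
    """Group ROCs whose event count is below the maximum by their DC.

    event_counts: dict of ROC name -> events seen
    roc_to_dc:    dict of ROC name -> Data Concentrator name
    Returns a dict of DC name -> list of (ROC name, events behind).
    """
    top = max(event_counts.values())
    lagging = defaultdict(list)
    for roc, n in sorted(event_counts.items()):
        if n < top:
            lagging[roc_to_dc[roc]].append((roc, top - n))
    return dict(lagging)


# Made-up numbers for illustration only; these were not recorded in the test.
counts = {"ROCFCAL1": 31000, "ROCBCAL1": 30500, "ROCBCAL2": 30400}
dcs = {"ROCFCAL1": "DCFCAL", "ROCBCAL1": "DCBCAL", "ROCBCAL2": "DCBCAL"}
print(lagging_by_dc(counts, dcs))
```

If every lagging ROC maps to a single DC, as seemed to be the case for BCAL, that DC is the natural first place to look.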
We were unable to thoroughly diagnose the problem leading to the data rate stopping
during the last attempts. (We needed to hand the DAQ systems back to the detector groups.)
It may have been due to the BCAL DC getting into a state where it was unable to receive
events from the ROCs, causing them to go into a holding pattern. It was decided that we should
have another look at the softROC code, but plan on doing testing with more traditional ROLs
on Tuesday to see if the same problem occurs. We will try to arrange use of the TS for
that test period.
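
The back-pressure hypothesis above can be illustrated with a toy model: a sender writing into a bounded buffer whose consumer has stopped draining it. This is only a sketch of the general mechanism using Python's standard `queue` module. It does not model ET or the actual DC behavior, and the function name and buffer size are hypothetical.

```python
import queue


def events_accepted_before_stall(buffer_size, events_to_send):
    """Count events a sender can hand off when nothing drains the buffer.

    The bounded queue stands in for the receiving side's input buffer.
    Once it is full, a real sender would block (the "holding pattern");
    here we simply stop and report how many events got through.
    """
    buf = queue.Queue(maxsize=buffer_size)
    sent = 0
    for event in range(events_to_send):
        try:
            buf.put_nowait(event)
        except queue.Full:
            break
        sent += 1
    return sent
```

In this model the sender stops after exactly `buffer_size` events while still appearing healthy, which is at least consistent with components reporting active while the event rate is zero.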
---
This is a plain text email for clients that cannot display HTML. The full logentry can be found online at https://logbooks.jlab.org/entry/3289159