[Halld-offline] proposal for analysis launch and skimming
Shepherd, Matthew
mashephe at indiana.edu
Wed Jul 17 17:03:30 EDT 2019
Hi all,
As a follow-on to some earlier discussions at the collaboration meeting and workshop, I'd like to put forth the idea of including a secondary skim or data-processing step in our analysis launches, and perhaps, in some cases, discarding the standard ROOT analysis trees. The goal is to reduce disk usage and to facilitate easy exploratory interaction with the GlueX data.
First, some comments from recent experience:
In order to complement some work that others in our group were doing, I set out to analyze the reaction gamma p -> pi+ pi- pi0 eta p. Specifically, I was interested in gamma p -> ( eta pi- pi0 ) + ( pi+ p ), where the pi+ p system was a Delta++. The motivation is not so relevant for this discussion, but one wants to look at Dalitz plots of the eta pi- pi0 system as a function of eta pi- pi0 mass, etc. There are a lot of things to potentially plot (1D and 2D), and it is useful to be able to do this on the ROOT command line while making cuts interactively. I opted to use the FSRoot package that Ryan wrote (and that I demonstrated at the workshop in May).
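To give a flavor of that workflow: with the FSRoot libraries loaded in a ROOT session, one can histogram a quantity with cuts in a single call. This is only a sketch; it assumes FSRoot's MASS shorthand, and the file name, tree name, particle indices, and cut variable are placeholders, not the real ones from this analysis:

    // in a ROOT session with the FSRoot libraries loaded;
    // file name, tree name, particle indices, and the cut
    // variable below are placeholders
    TString fn("flat_pi0pippimeta.root");
    TString nt("nt");
    // eta pi- pi0 invariant mass, with a cut applied on the fly
    FSHistogram::getTH1F(fn, nt, "MASS(2,3,4)", "(100,1.0,3.0)",
                         "Chi2DOF<50.0")->Draw();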
Since I wasn't up to speed on the JLab queue and job submission system, I wanted to test the model of doing "analysis at home" using products of the analysis launch produced at JLab. Fortunately, the ver27 analysis launch of the 2017 data had what I was looking for: tree_pi0pippimeta__B3_M17. The ROOT files totaled 3.5 TB. I had to find a place to park these files, and the transfer then took two days at an average speed of 27 MB/s.
While I was waiting for the transfer to complete, I took one file (one run) and began to analyze it for ways to reduce the size. A single run was a 17 GB ROOT file, which I could "flatten" (by throwing away some information I didn't need) into an 8 GB file. I could then interact with this file via the FSRoot package on the ROOT command line to explore useful skim cuts.
As expected, a significant contributor to file size is photon combinatorics. There is a pi0 and an eta in the final state, so there are many combinations of two pairs of photons. Using FSRoot, I could easily plot the eta invariant mass for various cuts and realized that there was little evidence of signal when the number of photons over 100 MeV in an event exceeded 5 (NumNeutralHypos > 5 in analysis-tree speak). Comparing the cases with exactly 4 and exactly 5 photons suggested diminishing returns for including 5. Likewise, I looked at the chi2DOF for various numbers of unused tracks and concluded there was only evidence of signal (a clear peak at very low chi2DOF) if the number of extra tracks was 0 or 1.
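For illustration, a check like this can even be done with plain ROOT on the command line. In this sketch only NumNeutralHypos is an actual branch name (quoted above); the tree name and the mass expression are placeholders:

    // compare the eta peak for exactly 4 vs. exactly 5 photons
    TChain ch("pi0pippimeta__B3_M17_Tree");      // assumed tree name
    ch.Add("tree_pi0pippimeta__B3_M17_*.root");
    ch.Draw("Eta_Mass>>h4(100,0.45,0.65)", "NumNeutralHypos==4");
    ch.Draw("Eta_Mass>>h5(100,0.45,0.65)", "NumNeutralHypos==5", "same");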
I then used the flatten program (hd_utilities/FlattenForFSRoot) with the following requirements (a sketch of the invocation follows the list):
numNeutralHypos = 4
numUnusedTracks <= 1
chi2DOF < 50
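For a single run file, the invocation looks something like the following; I am writing the flag names from memory, so check flatten's usage message for the exact spelling:

    flatten -in  tree_pi0pippimeta__B3_M17_<run>.root \
            -out flat_pi0pippimeta__B3_M17_<run>.root \
            -numNeutralHypos 4 -numUnusedTracks 1 -chi2 50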
To run flatten on the 3.5 TB of files, I had to submit jobs to our local cluster -- it takes about 30 minutes per file to skim and process, so this took an afternoon. When I was done, I had a set of ROOT files totaling 105 GB. I went from 3.5 TB to 105 GB by making cuts that largely throw out junk events or poorly fitted hypotheses (only about a factor of two was due to discarding information in going from the analysis-tree format to the flat-tree format).
So, 105 GB is better than 3.5 TB, but still not quite small enough to copy to my laptop and "play" with interactively using ROOT. I made a second pass with some more specialized cuts (e.g., on the proton pi+ invariant mass to isolate the Delta++, on the beam energy, etc.) and was left with a file that is now less than 10 GB yet retains essentially everything I wanted from the initial 3.5 TB set of files. Using this 10 GB file, I was able to make plots on the fly for this topology showing all the substructure in the eta, pi-, pi0 system that I was interested in. (The second skim step is very easy since FSRoot is distributed with command-line tools that let one make rather complicated skim cuts without ever having to write any code.)
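For those who prefer plain ROOT, the same kind of second-stage skim can also be written with TTree::CopyTree; the file, tree, and branch names in this sketch are placeholders for whatever the flat tree actually provides:

    // second-stage skim in plain ROOT (names are placeholders)
    TFile fin("flat_pi0pippimeta.root");
    TTree* nt = (TTree*) fin.Get("nt");
    TFile fout("skim_delta_pi0pippimeta.root", "recreate");
    // keep Delta++ candidates inside a chosen beam-energy window
    TTree* ntSkim = nt->CopyTree("abs(MassProtonPip-1.232)<0.15"
                                 " && BeamEnergy>8.2 && BeamEnergy<8.8");
    ntSkim->Write();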
This whole job would have been much easier (and I could have made my desired plots in an hour or so) if I could have copied 105 GB instead of 3.5 TB and had the data already in a format that facilitates easy interaction.
So, here's the proposal:
Can we incorporate a standard "flatten" step into the analysis launch framework to beat down the file size a bit and make portable, interactive analysis offsite easier? I'm envisioning that for each reaction we specify values for the three cuts noted above (these are arguments to flatten); a hypothetical sketch follows. And (maybe) we could have an option to purge the analysis trees in order to reduce disk usage at JLab. The purge would be particularly useful for exploratory analyses.
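Purely as a hypothetical illustration of the bookkeeping (none of these fields exist in the launch configuration today), the per-reaction specification might be as simple as:

    # hypothetical per-reaction entries in an analysis launch config
    reaction           pi0pippimeta__B3_M17
    numNeutralHypos    4
    numUnusedTracks    1
    chi2DOF            50
    purgeAnalysisTrees false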
In some cases one may need to go back to the analysis trees for more detailed looks at the data or to run specialized algorithms. But in many cases one is trying to survey the landscape and look for low-hanging fruit. My biggest concern is that starting with a DSelector running over analysis trees presents a significant impediment to playing with the data and looking for interesting things (and this ultimately diminishes the discovery potential of GlueX). While it is important to be able to exercise the capabilities of the DSelector framework and all the information provided by the standard trees, we also need to enable fast and easy browsing of the data, especially offsite.
I'd be interested in hearing thoughts and ideas -- is there a desire for something like this?
Matt