[Halld-offline] proposal for analysis launch and skimming
Shepherd, Matthew
mashephe at indiana.edu
Thu Jul 18 11:25:26 EDT 2019
Hi Justin,
I would be in favor of doing the operation on the merged file as part of the merging step. I suggest some initial tight cuts to get the file size down to something manageable. Based on my experience, at most 1 unused track, exactly as many "neutral hypotheses" as there are photons in the final state, and a loose chi2/dof cut of 50 should work for many things.
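As a sketch only, the suggested initial cuts amount to a simple per-event predicate like the one below. The field names (num_neutral_hypos, num_unused_tracks, chi2_dof) are illustrative stand-ins for the corresponding analysis-tree quantities, not actual branch names:

```python
from collections import namedtuple

# Illustrative event record; these fields are stand-ins for the
# analysis-tree quantities discussed in this thread.
Event = namedtuple("Event", ["num_neutral_hypos", "num_unused_tracks", "chi2_dof"])

def passes_initial_skim(event, n_final_state_photons):
    """Tight initial cuts suggested above: at most 1 unused track,
    exactly as many neutral hypotheses as final-state photons,
    and a loose chi2/dof cut of 50."""
    return (event.num_unused_tracks <= 1
            and event.num_neutral_hypos == n_final_state_photons
            and event.chi2_dof < 50.0)

# Example: a 4-photon final state with one extra track and a good fit
print(passes_initial_skim(Event(4, 1, 12.0), 4))
```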
If it is part of the merge step, then these files are available on tap for people. Hopefully they are small enough that we can keep them on disk most of the time. The goal is to get to a point where one could have an idea like "I wonder if we see a signal for X in some decay mode" and then go grab some tens of GB of files from JLab and look. It is the fast turnaround (no staging from tape, no waiting for queues, etc.) that is nice.
This also facilitates easy exploration and learning by students. You can just give a student one of these ROOT files and they can start making mass distributions without ever writing any code. You focus all the energy and early training for the student on developing an intuition for the experiment rather than on how to make a DSelector and submit jobs to a queue.
As one gets deeper into the analysis, one can always roll back to the analysis tree and check to be sure the initial skim was appropriate. Having the ability to quickly look at something is really nice.
I'm happy to help develop this for deployment, but I know so little about the analysis launch framework that I'm not sure I'm useful.
Matt
> On Jul 18, 2019, at 10:59 AM, Justin Stevens <jrsteven at jlab.org> wrote:
>
> Hi Matt,
>
> Thanks for bringing this up again. I’m certainly in favor of steps like this that could simplify the analysis workflow for people using these tools.
>
> From my perspective, this kind of post-processing could fit in either of these steps of our current workflow:
> 1) the jobs which execute ReactionFilter producing an Analysis TTree file for each reaction and REST file or
> 2) the merging job where Analysis TTree files are combined into a single file for each run
>
> Both would require some effort to allow users to submit requests for this post-processing to be run for their channel (including the cut requirements they need), as well as modifications to the scripts to execute FlattenForFSRoot and copy the output files to the relevant location on the cache disk.
>
> Another possibility, if we foresee users wanting to do this flattening/skimming step multiple times (with different cuts) on the standard Analysis TTree files which already exist on disk, is to factorize this post-processing from the analysis launch and make it a separate step that happens after the TTree files are merged for all runs. This would require a new set of scripts, but it would also make it straightforward for users to do this on their own (on the JLab farm).
>
> -Justin
>
>> On Jul 17, 2019, at 5:03 PM, Shepherd, Matthew <mashephe at indiana.edu> wrote:
>>
>>
>> Hi all,
>>
>> As a follow-on to some earlier discussions at the collaboration meeting and workshop, I'd like to put forth the idea of including a secondary skim or data processing step in our analysis launches, and perhaps, in some cases, discarding the standard ROOT analysis trees. The goal in this is to take up less disk space and facilitate easy exploratory interaction with the GlueX data.
>>
>> First some comments from recent experience:
>>
>> In order to complement some work that others in our group were doing, I set out to try to analyze the reaction gamma p -> pi+ pi- pi0 eta p. Specifically, I was interested in gamma p -> ( eta pi- pi0 ) + ( pi+ p ), where the pi+ p was a Delta++. The motivation is not so relevant for this discussion, but one wants to look at Dalitz plots of the eta pi- pi0 system as a function of eta pi- pi0 mass, etc. There are a lot of things to potentially plot (1D and 2D), and it is useful to be able to do this on the command line while making cuts interactively. I opted to use the FSRoot package that Ryan wrote (and I demonstrated at the workshop in May).
>>
>> Since I wasn't up to speed on the JLab queue and job submission system, I wanted to test the model of doing "analysis at home" using products of the analysis launch produced at JLab. Fortunately, the ver27 analysis launch of the 2017 data had what I was looking for: tree_pi0pippimeta__B3_M17. The ROOT files totaled 3.5 TB. I had to find a place to park these files, and then it took two days to transfer them at an average speed of 27 MB/s.
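For scale, a quick back-of-the-envelope check (using the sizes and rate quoted above) shows why the copy takes on the order of days:

```python
# Rough transfer-time estimate for the ver27 tree files quoted above.
size_bytes = 3.5e12          # ~3.5 TB of ROOT files
rate_bytes_per_s = 27e6      # observed average transfer speed, 27 MB/s

seconds = size_bytes / rate_bytes_per_s
hours = seconds / 3600.0
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")
```

With any interruptions or slowdowns on top of the ~1.5 days of pure transfer time, a two-day copy is unsurprising.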
>>
>> While I was waiting for the transfer to complete, I took one file (one run) and began to analyze it for mechanisms to reduce the size. A single run was a 17 GB ROOT file, which I could "flatten" (by throwing away some information I didn't need) into an 8 GB file. I could then interact with this file with the FSRoot package on the ROOT command line to explore useful skim cuts.
>>
>> As expected, a significant contributor to file size is photon combinatorics. There is a pi0 and an eta in the final state, so there are many combinations of two pairs of photons. Using FSRoot I could easily plot the eta invariant mass for various cuts, and I realized that there was little evidence of a signal when the number of photons over 100 MeV in an event exceeded 5 (NumNeutralHypos > 5 in analysis-tree speak). And looking at the cases with exactly 4 and exactly 5 suggested diminishing returns for including 5. Likewise, I looked at chi2DOF for various numbers of unused tracks and concluded there was only evidence of a signal (a clear peak at very low chi2DOF) if the number of extra tracks was 0 or 1.
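To see why the photon multiplicity drives the combinatorics, here is a naive count of the distinct (pi0 -> gamma gamma, eta -> gamma gamma) assignments as a function of the number of photon candidates (ignoring any kinematic pre-selection the reconstruction may apply):

```python
from math import comb

def n_pairings(n_photons):
    """Ways to build one pi0 -> gg and one eta -> gg from n_photons
    candidates: choose 2 photons for the pi0, then 2 of the rest
    for the eta (the two mesons are distinguishable)."""
    return comb(n_photons, 2) * comb(n_photons - 2, 2)

for n in range(4, 8):
    print(n, n_pairings(n))  # 4 -> 6, 5 -> 30, 6 -> 90, 7 -> 210
```

The jump from 6 combinations at 4 photons to 30 at 5 and 90 at 6 is consistent with the file-size bloat from photon combinatorics described above.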
>>
>> I then used the flatten program (hd_utilities/FlattenForFSRoot), with the following requirements:
>>
>> numNeutralHypos = 4
>> numUnusedTracks <= 1
>> chi2DOF < 50
>>
>> To run flatten on the 3.5 TB of files, I had to submit jobs to our local cluster -- it takes about 30 minutes per file to skim and process, so this took an afternoon. When it was done, I had a set of ROOT files totaling 105 GB. I went from 3.5 TB to 105 GB by making cuts that largely throw out junk events or poorly fitted hypotheses (only about a factor of two was due to discarding information in going from the analysis-tree format to the flat-tree format).
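Breaking that reduction into its two pieces, using only the numbers quoted above:

```python
before_gb = 3500.0   # merged analysis trees, ~3.5 TB
after_gb = 105.0     # flat trees after the three flatten cuts
format_factor = 2.0  # stated share from dropping information when flattening

total = before_gb / after_gb        # overall reduction factor
from_cuts = total / format_factor   # share attributable to the skim cuts
print(f"total ~{total:.0f}x, cuts alone ~{from_cuts:.0f}x")
```

So of the ~33x overall shrinkage, roughly 17x comes from the three cuts alone, i.e. most of the savings is discarded junk, not discarded information.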
>>
>> So, 105 GB is better than 3.5 TB, but still not quite small enough to copy to my laptop and "play" with interactively using ROOT. I made a second pass with some more specialized cuts (e.g., on the proton pi+ invariant mass to isolate the Delta++, on the beam energy, etc.) and was left with a file that is now less than 10 GB and has essentially everything I wanted from the initial 3.5 TB set of files. Using this 10 GB file, I was able to make plots on the fly for this topology showing all the substructure in the eta pi- pi0 system that I was interested in. (The second skim step is very easy, since FSRoot is distributed with command-line tools that let one make rather complicated skim cuts without ever having to write any code.)
>>
>> This whole job would have been much easier (and I could have made my desired plots in an hour or so) if I could copy 105 GB instead of 3.5 TB and have it already in a format that facilitates easy interaction.
>>
>> So, here's the proposal:
>>
>> Can we incorporate a standard "flatten" step into the analysis launch framework to beat down the file size a bit and make portable interactive analysis offsite easier? I'm envisioning that for each reaction we specify the values for the three cuts noted above (these are arguments to flatten). And (maybe) we have an option to purge the analysis trees in order to reduce disk usage at JLab. The purge is particularly useful for exploratory analyses.
>>
>> In some cases one may need to go back to the analysis trees for more detailed looks at the data or to run specialized algorithms. But in many cases one is trying to survey the landscape and look for low-hanging fruit. My biggest concern is that starting with a DSelector running over analysis trees presents a significant impediment to playing with the data and looking for interesting things (and this ultimately diminishes the discovery potential for GlueX). While it is important to be able to exercise the capability of the DSelector framework and all the info provided by the standard trees, we also need to enable fast and easy browsing of data, especially offsite.
>>
>> I'd be interested in hearing thoughts and ideas -- is there a desire for something like this?
>>
>> Matt
>>
>>
>> _______________________________________________
>> Halld-offline mailing list
>> Halld-offline at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/halld-offline