[Halld-offline] Fwd: Long-term Code Maintainability

Matthew Shepherd mashephe at indiana.edu
Wed Dec 11 14:07:35 EST 2013


There was a request at the meeting today to forward this message to the full offline software list.

Matt

Begin forwarded message:

> From: Matthew Shepherd <mashephe at indiana.edu>
> Subject: Re: Long-term Code Maintainability
> Date: October 21, 2013 at 2:06:25 PM EDT
> To: Paul Mattione <pmatt at jlab.org>
> Cc: Mark Ito <gluex at jlab.org>, David Lawrence <davidl at jlab.org>, Simon Taylor <staylor at jlab.org>, "Curtis A. Meyer" <cmeyer at cmu.edu>
> 
> 
> Hi all,
> 
> Paul's strategy of using new simulation with old reconstruction is what we did in CLEO -- it is a challenge, because you can't simply split simulation from reconstruction.  The simulation may change in a way that is not backwards compatible with the reconstruction.
> 
> There were several false starts, but the system we settled on after a few years of trial and error was this: for each data processing cycle we had three tagged releases covering reconstruction and generation, plus separate, regular releases, tagged by date, for analysis.
> 
> For example, assume we had a data set (we called them "federations") called data6.  We would have releases like:
> 
> data6_recon
> data6_mcrecon
> data6_mcgen
> 
> All of the final data processing (think REST production) would be done with data6_recon.  We would generate a large sample of generic MC with data6_mcgen and reconstruct it with data6_mcrecon.
> 
> These recon, mcrecon, and mcgen releases had to be put together by an expert to ensure compatibility and then tested.  The idea was:
> 
> data6_recon was the version of the reconstruction code used to reconstruct the data6 data (that is easy to tag)
> 
> data6_mcgen was our best understanding of how to model the detector at the time we generated MC for data6 (typically the most recent release at the time of generation)
> 
> data6_mcrecon was, in principle, just the reconstruction bits of data6_recon matched to accept what was produced by data6_mcgen -- this is the one that may take some work to produce since the backwards compatibility has to be checked
> 
> The reconstruction using data6_recon was a mass-production activity -- not something users would normally do.  We also mass-produced generic MC, like PYTHIA, using data6_mcgen and data6_mcrecon.  Users could use data6_mcgen and data6_mcrecon themselves to generate small signal MC samples to match the reconstructed data in data6.
> 
> To capture the analysis code we had regular releases tagged by date (as GlueX does now).  This is important because analysis code, mainly global tools like kinematic fitting, continues to improve long after the data are processed.  The other reason not to merge the analysis code with the rest is that an analysis frequently spans multiple data federations.
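> 
> As a concrete illustration of the bookkeeping, here is a minimal sketch in Python (all names are invented for illustration -- this is not an actual CLEO or GlueX tool) of the release manifest this scheme implies:
> 
>     # Hypothetical manifest: one certified release triple per federation,
>     # assembled and validated by an expert as described above.
>     RELEASES = {
>         "data6": {
>             "recon":   "data6_recon",    # mass production of real data (REST)
>             "mcgen":   "data6_mcgen",    # best detector model at MC generation time
>             "mcrecon": "data6_mcrecon",  # recon bits of data6_recon, accepts mcgen output
>         },
>     }
> 
>     def releases_for(federation):
>         """Return the certified (recon, mcgen, mcrecon) triple for a federation."""
>         r = RELEASES[federation]
>         return r["recon"], r["mcgen"], r["mcrecon"]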
> 
> So, if I were to start an analysis for a thesis, I'd need to pick a recent analysis release, for example:
> 
> sim-recon_21Oct2013
> 
> I'd compile any of my user-defined analysis code against this release.
> 
> Then I may have data that were processed using releases:
> 
> data2_recon
> data3_recon
> data4_recon
> 
> I may want to generate some signal MC, so I would generate in proportion to the data (a short sketch of this step follows the lists below) using the releases:
> 
> data2_mcgen
> data3_mcgen
> data4_mcgen
> 
> and reconstruct with:
> 
> data2_mcrecon
> data3_mcrecon
> data4_mcrecon
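> 
> A minimal sketch of that multi-federation MC step (Python; the function and task names are invented, not a real tool):
> 
>     def run_in_release(release, task):
>         # In reality this would switch environments and submit a batch job;
>         # here it only records which release handles which task.
>         print("[%s] %s" % (release, task))
> 
>     for fed in ["data2", "data3", "data4"]:
>         run_in_release(fed + "_mcgen", "generate signal MC in proportion to data")
>         run_in_release(fed + "_mcrecon", "reconstruct the generated MC")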
> 
> From the user standpoint, this system, with releases named in that way, seemed to be the most transparent and led to the least confusion.  We started with a system that used only data releases, but then had users trying to make their own decisions about, e.g., what "most recent" meant when generating MC.  This system also means that the hardest release to prepare (merging old with new) -- say, the data3_mcrecon release -- only ever needs to be tested for compatibility with the output of data3_mcgen.  It shouldn't be used for anything else.
> 
> Again, I think "slicing" needs to be done, and we need a dedicated expert to do it and check its validity.  Simply splitting won't work.  A classic example of where splitting fails is the detector geometry: it is used in simulation (obviously) but also in the material corrections to the Kalman filter, so it can't be split without risking divergence.  If you run the Kalman filter on the data with an incorrect geometry and realize it later, the right thing to do is to simulate with the correct geometry (the real detector has the true geometry) but rerun the Kalman filter on that MC with the same incorrect geometry used for the data, so that the MC reconstruction has the same flaws as the data reconstruction.  I think this suggests that slicing up the repository, with tags applied to the state of the code at different times, is the right move.  Slicing can be aided by lists of the areas of the repository that fall clearly on the recon or simulation side.
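> 
> A sketch of that geometry rule (the version strings here are invented; the point is only which stage gets which geometry):
> 
>     GEOM_USED_FOR_DATA = "geometry_v3"  # what the data reconstruction actually ran with (later found wrong)
>     GEOM_BEST_KNOWN    = "geometry_v4"  # the corrected geometry
> 
>     STAGE_GEOMETRY = {
>         "mcgen":   GEOM_BEST_KNOWN,     # simulate the detector as it really is
>         "mcrecon": GEOM_USED_FOR_DATA,  # reconstruct MC with the same flawed geometry as the data
>     }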
> 
> Some potential pitfalls: an obvious one with having multiple releases in use at the same time is that they all have to be managed.
> 
> I imagine all of the releases noted above, except the analysis release, would be precompiled on disk.  Presumably the user could switch freely between them.  We had a command-line tool that we could use both to list all releases and to switch between them (it simply adjusted environment variables).  It worked well when everyone was working on the same Cornell UNIX cluster, but if we are working in many places we need to be sure that replicating precompiled releases across sites is easy.
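> 
> A minimal sketch of what such a switching tool might look like (Python; the root path and variable names are assumptions, not actual GlueX conventions):
> 
>     import os, sys
> 
>     RELEASE_ROOT = "/group/halld/releases"  # assumed location of precompiled releases
> 
>     def list_releases():
>         return sorted(os.listdir(RELEASE_ROOT))
> 
>     def use_release(name):
>         top = os.path.join(RELEASE_ROOT, name)
>         # Print shell commands for the caller to eval, e.g.
>         #   eval $(python switch_release.py use data6_mcrecon)
>         print("export HALLD_HOME=%s" % top)
>         print("export LD_LIBRARY_PATH=%s/lib:$LD_LIBRARY_PATH" % top)
>         print("export PATH=%s/bin:$PATH" % top)
> 
>     if __name__ == "__main__":
>         if len(sys.argv) > 2 and sys.argv[1] == "use":
>             use_release(sys.argv[2])
>         else:
>             print("\n".join(list_releases()))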
> 
> We need to be sure the BMS has safeguards in it to avoid mixing releases.  We do this now, somewhat, for operating systems and compilers by putting the build products in specific directories, but we should be just as careful about releases.  One hopes that the only release a user ever really has to work in (i.e., compile code against) is the analysis release, but in practice there may be reasons for the average user to use a couple of releases.  We could "install" headers in directories tagged with release names and embed release names in the names of the built libraries, to avoid header/library mismatches or pulling incompatible code into the link phase.  I don't imagine these changes are too hard for someone who understands our BMS.
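> 
> For example, here is a sketch of that safeguard written against SCons (used only as a stand-in build tool here, not necessarily our BMS; the library and path names are invented):
> 
>     import os
> 
>     release = os.environ.get("HALLD_RELEASE", "sim-recon_21Oct2013")
> 
>     env = Environment()
>     env.Append(CPPPATH=["#include/%s" % release])  # headers installed in a per-release directory
> 
>     # The release name is baked into the library name, so linking against a
>     # library from a different release fails loudly instead of mixing code.
>     lib = env.SharedLibrary("TRACKING_%s" % release,
>                             Glob("src/libraries/TRACKING/*.cc"))
>     env.Install("#lib/%s" % release, lib)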
> 
> Sorry for the long email -- I had a lot of experience with this as a grad student.  Bottom line is that there is not a simple (or even complex) solution that will work automatically.  I think expert intervention and testing is needed for each "certified" mcgen and mcrecon pair of releases... and then the system has to be easy for everyone to use.
> 
> Matt
> 
> On Oct 20, 2013, at 4:58 PM, Paul Mattione <pmatt at jlab.org> wrote:
> 
>> How do you guys envision we will be handling the reconstruction vs. analysis stages of our data/software over the long term?  For example, let's say we perform track/shower reconstruction in 2016 with 2016 code, but then afterwards we make some bug fixes (or add/modify features/detectors) to both the track reconstruction and analysis software.  Then in 2017 we want to analyze our 2016 data.  
>> 
>> It seems like redoing reconstruction after every new tagged release of bug fixes would be overkill.  So I think what we would want is to analyze the 2016 reconstruction results, run the 2016 reconstruction code on our simulated data, and use the 2017 analysis/simulation code for everything else.  So it seems like we would want:
>> 
>> 1) A tagged release of ONLY our reconstruction code from 2016.
>> 2) A separate tagged release of our generators, hdgeant, mcsmear, and analysis code from 2017 (excluding the reconstruction code).  
>> 
>> However, making zigzag cuts through sim-recon for each tagged release would be a pain.  This suggests that the reconstruction code should be decoupled from everything else in the svn trunk, so that we can seamlessly combine the two software systems later (unless the REST format/classes change).  As operating systems and compiler versions change, we would need to make sure our 2016 reconstruction tagged release still builds, but we wouldn't need to worry about maintaining older versions of hdgeant, etc.  
>> 
>> Does this seem like a good idea?  It might be a lot of work to split up sim-recon but I think it would save us some major headaches in the long run (CLAS6 anyone?).  Or is there a better solution that I'm missing?  
>> 
>> - Paul
>> 
> 
