[d2n-analysis-talk] 64bit farm replay

Brad Sawatzky brads at jlab.org
Fri May 6 18:24:37 EDT 2011


OK, if 94% reliability counts as working, then the 64-bit replay appears
to be working.

My test replay directory is here:
  /w/halla/e06014/disk1/brads/replay_64/

There is a newly compiled 64-bit build of ROOT here:
  /w/halla/e06014/disk1/ROOT/PRO/
and a rebuild of our analyzer against those libs here:
  /w/halla/e06014/disk1/brads/replay_64/20101207b-DB06Apr2011/

NOTE: The DB files in that tree are current as of Apr. 6, 2011.

It is probably simplest to make a personal copy of the above directory
and use that copy for any farm replays you are in charge of:
  cd /w/halla/e06014/disk1/<your username>/
  cp -a /w/halla/e06014/disk1/brads/replay_64 .
Then edit the root-setup.sh file in your copy so that d2n_root is set
to your path:
  vim replay_64/20101207b-DB06Apr2011/root-setup.sh

Also update any output symlinks in the replay directory.  In
particular, set ROOTfiles to point to the correct location for your
replay files (it needs to be on the work disk though):
 d2n/replay/ROOTfiles --> point to something proper
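
A minimal sketch of repointing such a symlink, assuming a POSIX shell and
an ln that supports -n (GNU and BSD both do).  The function name
repoint_link is made up for illustration, and the target directory is
whatever location on the work disk you choose:

```shell
# Repoint an existing symlink (e.g. ROOTfiles) at a new target directory.
# Usage: repoint_link <link path> <target dir>
# NOTE: repoint_link is a hypothetical helper, not part of the replay tree.
repoint_link() {
    link=$1
    target=$2
    # Create the target directory if it does not exist yet.
    mkdir -p "$target" || return 1
    # -s: symbolic link, -f: replace an existing link,
    # -n: replace the link itself instead of creating a new link
    #     inside the directory the old link pointed to.
    ln -sfn "$target" "$link" || return 1
    # Print where the link points now so it can be checked by eye.
    readlink "$link"
}
```

For example, from the top of your copy you could run
repoint_link d2n/replay/ROOTfiles "/w/halla/e06014/disk1/$USER/ROOTfiles_64"
where the ROOTfiles_64 directory name is only an assumed example.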

Note that the root-setup.sh script must be sourced (not executed) so
that it sets the relevant environment variables in your current shell
to point at the new version of ROOT and the analyzer libs.  Source the
file before attempting to test the analyzer manually.
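
One way to sanity-check a setup script before relying on it, sketched
under some assumptions: check_setup is a made-up helper name, the script
is assumed to be sourceable from a plain POSIX shell, and root-config is
assumed to end up on PATH once the ROOT environment is set (it is the
standard ROOT configuration tool, but this particular build has not been
checked here).  The sourcing happens in a subshell, so your current
environment is left untouched:

```shell
# Source a setup script in a subshell and report what it configured.
# Usage: check_setup <path to root-setup.sh>
# NOTE: check_setup is a hypothetical helper, not part of the replay tree.
check_setup() {
    setup=$1
    (
        # Source (do not execute) so the variables land in this subshell.
        . "$setup" || exit 1
        # d2n_root is the variable root-setup.sh is expected to set.
        echo "d2n_root=${d2n_root:-<unset>}"
        # Report the ROOT version if root-config is reachable.
        if command -v root-config >/dev/null 2>&1; then
            echo "ROOT version: $(root-config --version)"
        else
            echo "root-config not found on PATH"
        fi
    )
}
```

Once the output looks right, source the script for real in the shell you
will run the analyzer from.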

The farm control scripts have been modified slightly.  The job control
generator does away with the 'INPUT' file list and just copies the whole
damn replay directory instead of pruning out just the required files.
This pushes a few more bytes into the scratch space, but it's a lot
simpler than maintaining an explicit list of files.  The memory
allocation for the farm jobs was also bumped up considerably for the
64-bit jobs (this is pretty standard when moving from 32 -> 64 bit).
See:
  /w/halla/e06014/disk1/brads/replay_64/run_e06014.sh

As implied at the top, there is still a race condition that gets hit
periodically.  The changes I made to the module load order in
replay_farm_BB.C seemed to make the biggest difference, but I'll be
damned if I can find a smoking gun.  Variables under investigation
include:
  - machine load, memory load, file locality (i.e. 'local' vs. InfiniBand
    mounts for various combinations of files)
  - disk IO latency (tricky due to limited tools available with RHEL5
    kernel and the rather complicated low-level filesystem setup used
    for the farm...)
    - I strongly suspect the problem lies here.  Network filesystems are
      really hard to get right and race conditions are tricky to avoid
      if you actually want to get reasonable performance...

Oh well, if the runs that do complete are internally consistent and match
the 32-bit results, I guess it doesn't matter too much...
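
Since some jobs still fail, a quick way to spot runs that need to be
resubmitted may be handy.  The sketch below lists run numbers that have
no non-empty output file.  The helper name missing_runs and the output
naming scheme <prefix><run>.root are assumptions for illustration only,
so adjust them to whatever the replay actually writes.  A non-empty file
is also no guarantee that the replay finished cleanly:

```shell
# Read run numbers (one per line) on stdin and print those that have
# no non-empty output file in the given directory.
# Usage: missing_runs <ROOTfiles dir> <file prefix> < runlist
# ASSUMPTION: output files are named <prefix><run>.root.
missing_runs() {
    dir=$1
    prefix=$2
    while read -r run; do
        # Skip blank lines in the run list.
        [ -n "$run" ] || continue
        # -s is true only if the file exists and has size > 0.
        [ -s "$dir/$prefix$run.root" ] || echo "$run"
    done
}
```

For example: missing_runs d2n/replay/ROOTfiles e06014_ < runlist.txt
(both the e06014_ prefix and runlist.txt are hypothetical names).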

-- Brad

-- 
Brad Sawatzky, PhD <brads at jlab.org>  -<>-  Jefferson Lab / Hall C / C111
Ph: 757-269-5947  -<>-  Fax: 757-269-5235  -<>- Pager: brads-page at jlab.org
The most exciting phrase to hear in science, the one that heralds new
  discoveries, is not "Eureka!" but "That's funny..."   -- Isaac Asimov

