[Halld-offline] zero field

Sean Dobbs s-dobbs at northwestern.edu
Wed Oct 19 19:10:27 EDT 2016


Hi Ryan,

I think that you've correctly diagnosed the problem.  The base directory
that the resources are stored in is given by the environment
variable JANA_RESOURCE_DIR.  A couple solutions to the problem come to mind:

1) Create a separate temporary JANA_RESOURCE_DIR for each job in some
temporary area.  Perhaps your batch system makes this easy, but I'd only
suggest this if you have a caching http proxy, or you will make halldweb
rather sad.

2) If you have a shared disk that the jobs can access, set up your
own JANA_RESOURCE_DIR on that disk, and run a short interactive job to
populate it before running your batch jobs.
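
As a sanity check for option 2, a short script along these lines could
confirm that the shared directory was populated correctly before the
batch jobs are submitted.  This is only a sketch in Python, not part of
JANA: the helper names md5_of and check_resource are made up for
illustration, and the path and checksum in the usage note are the ones
from Ryan's error message below.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_resource(resource_dir, rel_path, expected_md5):
    """Return True if the cached copy exists and matches expected_md5.

    Hypothetical helper: resource_dir plays the role of JANA_RESOURCE_DIR
    and rel_path is the resource path relative to it.
    """
    full_path = os.path.join(resource_dir, rel_path)
    return os.path.isfile(full_path) and md5_of(full_path) == expected_md5
```

For example, after the interactive job has finished, calling
check_resource(os.environ["JANA_RESOURCE_DIR"],
"Magnets/Solenoid/solenoid_1200A_poisson_20160222",
"9a70615a1a3e42236cf9c9dabd8cf546") should return True before any batch
jobs are started.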

I haven't encountered this problem myself in a while, but maybe a
longer-term fix would be to add lock files to the download mechanism in JANA.
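
To make the lock-file idea concrete, here is a rough sketch in Python.
It is not JANA code (JANA is C++), and every name in it (lock_file,
md5_of, fetch_resource, the ".lock" suffix, the download callback) is
hypothetical.  It assumes a POSIX filesystem where exclusive file
creation and rename are atomic.  The file is downloaded to a temporary
name, checked against the expected md5, and only then renamed into
place, so a concurrent reader never sees a partial file.

```python
import hashlib
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def lock_file(path, poll_seconds=0.5, timeout_seconds=600.0):
    """Hold a lock by exclusively creating `path`; remove it on exit."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            # O_EXCL makes creation fail if another job holds the lock.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"gave up waiting for {path}")
            time.sleep(poll_seconds)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(path)


def md5_of(path):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_resource(dest, expected_md5, download):
    """Install `dest` once, even if several jobs call this concurrently.

    `download(tmp_path)` is a caller-supplied function that writes the
    resource to tmp_path (an assumption of this sketch).
    """
    if os.path.exists(dest) and md5_of(dest) == expected_md5:
        return dest
    dest_dir = os.path.dirname(dest) or "."
    os.makedirs(dest_dir, exist_ok=True)
    with lock_file(dest + ".lock"):
        # Another job may have finished the download while we waited.
        if os.path.exists(dest) and md5_of(dest) == expected_md5:
            return dest
        fd, tmp_path = tempfile.mkstemp(dir=dest_dir)
        os.close(fd)
        try:
            download(tmp_path)
            if md5_of(tmp_path) != expected_md5:
                raise IOError("md5 mismatch after download")
            os.replace(tmp_path, dest)  # atomic rename on POSIX
        finally:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
    return dest
```

One caveat with this design: a job killed while holding the lock leaves
a stale ".lock" file behind, so other jobs would wait until the timeout
unless the lock is cleaned up by hand.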

Cheers,
Sean


On Wed, Oct 19, 2016 at 4:17 PM Mitchell, Ryan Edward <remitche at indiana.edu>
wrote:

> Hi David and all,
>
> I've run into some strange errors that seem possibly related to the email
> chain I dug up and attached below.
>
> I'm running 25 analysis jobs over ~100 rest hddm files in a single run
> (using the karst machine at IU).
>
> 20 finished fine, but 5 crashed with this error:
>
> JANA ERROR>>-- ERROR: md5 checksum for the following resource file does
> not match expected
> JANA ERROR>>--
> /tmp/remitche/resources/Magnets/Solenoid/solenoid_1200A_poisson_20160222
> JANA ERROR>>--  for the resource:
> JANA ERROR>>--   URL_base = https://halldweb.jlab.org/resources
> JANA ERROR>>--       path =
> Magnets/Solenoid/solenoid_1200A_poisson_20160222
> JANA ERROR>>--        md5 = 9a70615a1a3e42236cf9c9dabd8cf546
> JANA ERROR>>-- The md5sum for the existing file is:
> 902c0923bd13f626fe8cc4dc2e41f222
> JANA ERROR>>--
> JANA ERROR>>-- This can happen if the resource download was previously
> interrupted.
> JANA ERROR>>-- Try removing the existing file and re-running to trigger a
> re-download.
> JANA ERROR>>--
> JANA ERROR>>-- This is a fatal error and the program will stop now. To
> bypass checking
> JANA ERROR>>-- the md5sum, set the JANA:RESOURCE_CHECK_MD5 config.
> parameter to 0.
> JANA ERROR>>--
>
> The message is the same for all 5 crashed jobs, but the "md5sum for the
> existing file" (line 7) is different for each.
>
> Was maybe one job downloading the file while others were trying to read
> it?  Is there a way to avoid this?
>
> Thanks,
> Ryan
>
>
>
>
> On Jul 18, 2014, at 11:52 AM, Matthew Shepherd <mashephe at indiana.edu>
> wrote:
>
>
> Great -- thanks.
>
> It was going to catch someone at some point.  (I'm
> assuming I'm not the only one that hits ctrl-c when I realize
> that I've just executed a long program with
> incorrect arguments.)  The field download takes
> a while and happens right at the time you realize you
> made an execution mistake.
>
> For such a large system with so many users and
> possible permutations of use it is really important
> to make every sanity check possible and abort
> immediately when something doesn't seem right.
>
> Matt
>
> ---------------------------------------------------------------------
> Matthew Shepherd, Associate Professor
> Department of Physics, Indiana University, Swain West 265
> 727 East Third Street, Bloomington, IN 47405
>
> Office Phone:  +1 812 856 5808
>
> On Jul 18, 2014, at 10:23 AM, David Lawrence <davidl at jlab.org> wrote:
>
>
> Hi Matt,
>
> Sorry this has cost you so much time. You’re the first person to
> report this as an issue in the 6 months since it was deployed. I’ll
> go ahead and put in the code to generate and check the md5
> checksum whenever a program is started so that an error can
> be flagged if there is a mismatch.
>
> Regards,
> -David
>
> On Jul 18, 2014, at 10:15 AM, Matthew Shepherd <mashephe at indiana.edu>
> wrote:
>
>
> With bits and pieces from 3 different people,  I think
> I've figured this out...  gggrrrrrr!
>
> Mike Staib suggested my log seemed to indicate my field doesn't have
> enough z points.
>
> It seems like jana is using JANA_RESOURCE_DIR to cache fields.  (I got
> this from Paul.. I've never set this, but grep and reading jana source
> tells me it is set by default to /tmp/username/resources.)
>
> If I go to that directory and delete it and rerun bfield2root, I get
> a full field.
>
> What happened?  Here's my theory:
>
> I ran a job and that job chose to download the field.  This
> is the first time-consuming thing that happens in a job.  If
> you kill the job with ctrl-c during the field download, then
> you are left with a partial field.
> (I tested this out and was able to repeat it.)
>
> I must have been unlucky and hit ctrl-c on the job that
> tried to download the field.  Note this is normal for most users.
> It is easy to execute a command accidentally or realize just
> after you press return that you didn't specify all the arguments
> that you wanted.
>
> This is a really nasty behavior because every subsequent job
> then just reads this partial field and never prints any message
> or error.
>
> Can we put some sort of check in the field?  Write the number
> of points in the file first.  Then, on read back, abort with an
> error when there aren't that many points.
>
> nasty nasty nasty... that cost Paul and me a ton of time
> this week
>
> Matt
>
> ---------------------------------------------------------------------
> Matthew Shepherd, Associate Professor
> Department of Physics, Indiana University, Swain West 265
> 727 East Third Street, Bloomington, IN 47405
>
> Office Phone:  +1 812 856 5808
>
> On Jul 18, 2014, at 7:18 AM, David Lawrence <davidl at jlab.org> wrote:
>
>
> Hi Matt,
>
> The output looks right. What happens if you run the bfield2root utility?
> This should be built as part of the default sim-recon build. (Source is in
> $HALLD_HOME/src/programs/Utilities/bfield2root)
>
> Draw the field map in ROOT using:
>
> bfield2root
> root bfield.root
>
> root [1] Bz_vs_r_vs_z->Draw("colz")
>
> Also, how are you checking that the field map is returning zeros?
>
> Regards,
> -David
>
> On Jul 17, 2014, at 9:49 PM, Matthew Shepherd <mashephe at indiana.edu>
> wrote:
>
>
> Hi all,
>
> Paul and I have been trying to understand why I cannot
> get the example analysis software working and we seem
> to have traced the problem down to the fact that the
> magnetic field map is returning a zero field.
>
> Does anyone know how to debug this?  The startup
> of the job suggests all is OK (I think):
>
> JANA >>URL: sqlite:////home/s4/mashephe/gluex/ccdb.sqlite
> JANA >>context: default
> JANA >>Reading Magnetic field map from Magnets/Solenoid/solenoid_1350_poisson_20130925 ...
> Nx=251 Ny=1 Nz=43 )  at 0x7f46720a0ce0
> Fine-mesh evio file does not exist.
> Constructing the fine-mesh B-field map...
> rmin: 0 rmax: 88.5 dr: 0.1 zmin: 0 zmax: 600 dz: 0.1vg.: 0.0Hz)
> Number of points in z = 6000
> Number of points in r = 885
> JANA >>10599 entries found (Created Magnetic field map of type
> DMagneticFieldMap
>
> I'm using:
>
> sim-recon-2014-06-30
> ccdb_1.02
> hdds-2.1
> jana_0.7.1p3
>
> and a ccdb.sqlite file from July 14 (copied from Mark's web page link that
> day).  I've also tried the ccdb.sqlite file from dc2 conditions.
>
> Matt
>
>
> ---------------------------------------------------------------------
> Matthew Shepherd, Associate Professor
> Department of Physics, Indiana University, Swain West 265
> 727 East Third Street, Bloomington, IN 47405
>
> Office Phone:  +1 812 856 5808
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline
>