[Halld-offline] zero field
Mitchell, Ryan Edward
remitche at indiana.edu
Wed Oct 19 20:42:52 EDT 2016
Thanks, Sean,
I'll go with something like solution 2 for the time being. Having JANA_RESOURCE_DIR point to something more permanent should help.
Ryan
On Oct 19, 2016, at 7:10 PM, Sean Dobbs <s-dobbs at northwestern.edu> wrote:
Hi Ryan,
I think that you've correctly diagnosed the problem. The base directory that the resources are stored in is given by the environment variable JANA_RESOURCE_DIR. A couple of solutions to the problem come to mind:
1) Create a separate temporary JANA_RESOURCE_DIR for each job in some temporary area. Perhaps your batch system makes this easy, but I'd only suggest this if you have a caching http proxy, or you will make halldweb rather sad.
2) If you have a shared disk that the jobs can access, set up your own JANA_RESOURCE_DIR on that disk, and run a short interactive job to populate it before running your batch jobs.
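As a rough illustration of solution 1, a job wrapper could create a fresh directory and export JANA_RESOURCE_DIR before launching the program. This is only a sketch: the function name and the command passed in are hypothetical, and only the environment variable comes from the discussion above. Every job downloads its own copy of each resource, which is why a caching proxy matters.

```python
import os
import subprocess
import tempfile


def run_with_private_resource_dir(cmd, base_tmp=None):
    """Run cmd with JANA_RESOURCE_DIR set to a fresh, job-private directory.

    cmd is the argument list of whatever program the job runs (hypothetical
    here). base_tmp optionally names the scratch area to create the
    directory in; by default the system temporary directory is used.
    """
    # A new directory per job means no two jobs ever share a cached file.
    resource_dir = tempfile.mkdtemp(prefix="jana_resources_", dir=base_tmp)
    # Copy the current environment and override only the one variable.
    env = dict(os.environ, JANA_RESOURCE_DIR=resource_dir)
    result = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return resource_dir, result
```

The caller is left to remove the directory when the job ends, since the batch system may already clean its scratch area.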
I haven't encountered this problem myself in a while, but maybe a longer-term fix would be to add lock files to the download mechanism in JANA.
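To make the lock-file idea concrete, here is a minimal sketch of the general pattern. It is not JANA code, and every name in it is hypothetical. It takes an exclusive lock, downloads to a temporary name, and renames into place only on success, so readers never see a partial file. It assumes a POSIX system and a filesystem where flock works, which is not guaranteed on every shared disk.

```python
import fcntl
import os
import tempfile


def fetch_once(dest, download):
    """Populate dest exactly once, even if several jobs call this at once.

    download is a caller-supplied function that writes the resource to the
    path it is given. Returns True if this call did the download, False if
    the file was already present.
    """
    directory = os.path.dirname(os.path.abspath(dest))
    os.makedirs(directory, exist_ok=True)
    with open(dest + ".lock", "w") as lock:
        # Block until no other process holds the lock.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            if os.path.exists(dest):
                return False  # another job finished the download first
            # Write under a temporary name in the same directory.
            fd, tmp = tempfile.mkstemp(dir=directory)
            os.close(fd)
            try:
                download(tmp)
                # Rename is atomic on POSIX within one filesystem.
                os.replace(tmp, dest)
            except BaseException:
                # Covers ctrl-c too: never leave a partial file behind.
                os.unlink(tmp)
                raise
            return True
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

Because the final name only ever appears through the rename, an interrupted download leaves at worst a stray temporary file, never a truncated resource under the real name.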
Cheers,
Sean
On Wed, Oct 19, 2016 at 4:17 PM Mitchell, Ryan Edward <remitche at indiana.edu> wrote:
Hi David and all,
I've run into some strange errors that seem possibly related to the email chain I dug up and attached below.
I'm running 25 analysis jobs over ~100 rest hddm files in a single run (using the karst machine at IU).
20 finished fine, but 5 crashed with this error:
JANA ERROR>>-- ERROR: md5 checksum for the following resource file does not match expected
JANA ERROR>>-- /tmp/remitche/resources/Magnets/Solenoid/solenoid_1200A_poisson_20160222
JANA ERROR>>-- for the resource:
JANA ERROR>>-- URL_base = https://halldweb.jlab.org/resources
JANA ERROR>>-- path = Magnets/Solenoid/solenoid_1200A_poisson_20160222
JANA ERROR>>-- md5 = 9a70615a1a3e42236cf9c9dabd8cf546
JANA ERROR>>-- The md5sum for the existing file is: 902c0923bd13f626fe8cc4dc2e41f222
JANA ERROR>>--
JANA ERROR>>-- This can happen if the resource download was previously interrupted.
JANA ERROR>>-- Try removing the existing file and re-running to trigger a re-download.
JANA ERROR>>--
JANA ERROR>>-- This is a fatal error and the program will stop now. To bypass checking
JANA ERROR>>-- the md5sum, set the JANA:RESOURCE_CHECK_MD5 config. parameter to 0.
JANA ERROR>>--
The message is the same for all 5 crashed jobs, but the "md5sum for the existing file" (line 7) is different for each.
Was maybe one job downloading the file while others were trying to read it? Is there a way to avoid this?
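For checking a cached file by hand against the md5 value printed in the message, something like the following would do. It is a generic sketch using Python's hashlib, not the check JANA performs internally, and the function names are made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to limit memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resource_is_intact(path, expected_md5):
    """Compare a file's md5 with the expected hex string, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A file that is still being written by another job gives a different digest each time it is sampled, which would be consistent with each crashed job reporting a different md5sum.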
Thanks,
Ryan
On Jul 18, 2014, at 11:52 AM, Matthew Shepherd <mashephe at indiana.edu> wrote:
Great -- thanks.
It was going to catch someone at some point. (I'm
assuming I'm not the only one that hits ctrl-c when I realize
that I've just executed a long program with
incorrect arguments.) The field download takes
a while and happens right at the time you realize you
made an execution mistake.
For such a large system with so many users and
possible permutations of use it is really important
to make every sanity check possible and abort
immediately when something doesn't seem right.
Matt
---------------------------------------------------------------------
Matthew Shepherd, Associate Professor
Department of Physics, Indiana University, Swain West 265
727 East Third Street, Bloomington, IN 47405
Office Phone: +1 812 856 5808
On Jul 18, 2014, at 10:23 AM, David Lawrence <davidl at jlab.org> wrote:
Hi Matt,
Sorry this has cost you so much time. You’re the first person to
report this as an issue in the 6 months since it was deployed. I’ll
go ahead and put in the code to generate and check the md5
checksum whenever a program is started so that an error can
be flagged if there is a mismatch.
Regards,
-David
On Jul 18, 2014, at 10:15 AM, Matthew Shepherd <mashephe at indiana.edu> wrote:
With bits and pieces from 3 different people, I think
I've figured this out... gggrrrrrr!
Mike Staib suggested my log seemed to indicate my field doesn't have
enough z points.
It seems like jana is using JANA_RESOURCE_DIR to cache fields. (I got
this from Paul.. I've never set this, but grep and reading jana source
tells me it is set by default to /tmp/username/resources.)
If I go to that directory and delete it and rerun bfield2root, I get
a full field.
What happened? Here's my theory:
I ran a job and that job chose to download the field. This
is the first time-consuming thing that happens in a job. If,
during the field download, you kill the job with ctrl-c, then
you are left with a partial field.
(I tested this out and was able to repeat it.)
I must have been unlucky and hit ctrl-c on the job that
tried to download the field. Note this is normal for most users.
It is easy to execute a command accidentally or realize just
after you press return that you didn't specify all the arguments
that you wanted.
This is a really nasty behavior because every subsequent job
then just reads this partial field and never prints any message
or error.
Can we put some sort of check in the field? Write the number
of points in the file first, and then on read back, if there
aren't that many points, abort with an error.
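A toy version of that check might look like this. The file layout is invented purely for illustration (an 8-byte point count followed by three doubles per point) and is not the actual field map format.

```python
import struct

HEADER = struct.Struct("<Q")   # number of points, 8-byte little-endian
POINT = struct.Struct("<3d")   # hypothetical record: three doubles per point


def write_points(path, points):
    """Write the point count first, then the points themselves."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(points)))
        for p in points:
            f.write(POINT.pack(*p))


def read_points(path):
    """Read the points back, aborting if the file is short or malformed."""
    with open(path, "rb") as f:
        head = f.read(HEADER.size)
        if len(head) < HEADER.size:
            raise ValueError(f"{path}: file too short to hold a point count")
        (expected,) = HEADER.unpack(head)
        body = f.read()
    found, leftover = divmod(len(body), POINT.size)
    if found != expected or leftover:
        raise ValueError(f"{path}: expected {expected} points, found {found}")
    return [POINT.unpack_from(body, i * POINT.size) for i in range(expected)]
```

With a layout like this, a download cut off by ctrl-c fails loudly on the next read instead of silently yielding a partial field.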
nasty nasty nasty... that cost Paul and me a ton of time
this week
Matt
---------------------------------------------------------------------
Matthew Shepherd, Associate Professor
Department of Physics, Indiana University, Swain West 265
727 East Third Street, Bloomington, IN 47405
Office Phone: +1 812 856 5808
On Jul 18, 2014, at 7:18 AM, David Lawrence <davidl at jlab.org> wrote:
Hi Matt,
The output looks right. What happens if you run the bfield2root utility? This should
be built as part of the default sim-recon build. (Source is in $HALLD_HOME/src/programs/Utilities/bfield2root)
Draw the field map in ROOT using:
bfield2root
root bfield.root
root [1] Bz_vs_r_vs_z->Draw("colz")
Also, how are you checking that the field map is returning zeros?
Regards,
-David
On Jul 17, 2014, at 9:49 PM, Matthew Shepherd <mashephe at indiana.edu> wrote:
Hi all,
Paul and I have been trying to understand why I cannot
get the example analysis software working and we seem
to have traced the problem down to the fact that the
magnetic field map is returning a zero field.
Does anyone know how to debug this? The startup
of the job suggests all is OK (I think):
JANA >>URL: sqlite:////home/s4/mashephe/gluex/ccdb.sqlite
JANA >>context: default
JANA >>Reading Magnetic field map from Magnets/Solenoid/solenoid_1350_poisson_20130925 ...
Nx=251 Ny=1 Nz=43 ) at 0x7f46720a0ce0
Fine-mesh evio file does not exist.
Constructing the fine-mesh B-field map...
rmin: 0 rmax: 88.5 dr: 0.1 zmin: 0 zmax: 600 dz: 0.1vg.: 0.0Hz)
Number of points in z = 6000
Number of points in r = 885
JANA >>10599 entries found (Created Magnetic field map of type DMagneticFieldMap
I'm using:
sim-recon-2014-06-30
ccdb_1.02
hdds-2.1
jana_0.7.1p3
and a ccdb.sqlite file from July 14 (copied from Mark's web page link that day). I've also
tried the ccdb.sqlite file from dc2 conditions.
Matt
---------------------------------------------------------------------
Matthew Shepherd, Associate Professor
Department of Physics, Indiana University, Swain West 265
727 East Third Street, Bloomington, IN 47405
Office Phone: +1 812 856 5808
_______________________________________________
Halld-offline mailing list
Halld-offline at jlab.org
https://mailman.jlab.org/mailman/listinfo/halld-offline