[Halld-offline] zero field

David Lawrence davidl at jlab.org
Wed Oct 19 23:47:27 EDT 2016


I concur that this is the right way to go. Otherwise every one of your jobs could be hitting the
JLab server to download the same file. Better to have a local version.

Regards,
-David


> On Oct 19, 2016, at 8:42 PM, Mitchell, Ryan Edward <remitche at indiana.edu> wrote:
> 
> Thanks, Sean,
> 
> I'll go with something like solution 2 for the time being.  Having JANA_RESOURCE_DIR point to something more permanent should help.
> 
> Ryan
> 
> 
> 
>> On Oct 19, 2016, at 7:10 PM, Sean Dobbs <s-dobbs at northwestern.edu> wrote:
>> 
>> Hi Ryan,
>> 
>> I think that you've correctly diagnosed the problem.  The base directory that the resources are stored in is given by the environment variable JANA_RESOURCE_DIR.  A couple solutions to the problem come to mind:
>> 
>> 1) Create a separate temporary JANA_RESOURCE_DIR for each job in some temporary area.  Perhaps your batch system makes this easy, but I'd only suggest this if you have a caching http proxy, or you will make halldweb rather sad.
>> 
>> 2) If you have a shared disk that the jobs can access, set up your own JANA_RESOURCE_DIR on that disk, and run a short interactive job to populate it before running your batch jobs.
>> 
>> I haven't encountered this problem myself in a while, but maybe a longer-term fix would be to add lock files to the download mechanism in JANA.
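
As a rough illustration of the lock-file idea, here is a minimal Python sketch, not JANA's actual download code. It assumes a POSIX system where fcntl.flock works on the disk in question (this may not hold on some network file systems). The names fetch_once and fetch are hypothetical: fetch stands in for whatever actually downloads the resource. The file is written under a temporary name and renamed into place only once complete, so a reader never sees a partial file, even after ctrl-c.

```python
import fcntl
import os
import tempfile


def fetch_once(dest, fetch):
    """Populate dest at most once, even if many jobs call this at the same time.

    fetch is a hypothetical callable that writes the resource into the
    binary file object it is given (for example, an HTTP download).
    Returns True if this call did the download, False if the file was
    already there.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest))
    os.makedirs(dest_dir, exist_ok=True)
    with open(dest + ".lock", "w") as lock:
        # Blocks here until any other job holding the lock has finished.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            if os.path.exists(dest):
                return False  # another job already completed the download
            fd, tmp = tempfile.mkstemp(dir=dest_dir)
            try:
                with os.fdopen(fd, "wb") as out:
                    fetch(out)
                # Atomic rename: dest appears only when it is complete.
                os.replace(tmp, dest)
            except BaseException:
                # Includes KeyboardInterrupt, so ctrl-c leaves no partial file.
                os.unlink(tmp)
                raise
            return True
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

With something like this, 25 jobs starting together would wait on the lock, one would download, and the other 24 would find the finished file and move on.
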
>> 
>> Cheers,
>> Sean
>> 
>> 
>> On Wed, Oct 19, 2016 at 4:17 PM Mitchell, Ryan Edward <remitche at indiana.edu> wrote:
>> Hi David and all,
>> 
>> I've run into some strange errors that seem possibly related to the email chain I dug up and attached below.
>> 
>> I'm running 25 analysis jobs over ~100 rest hddm files in a single run (using the karst machine at IU).
>> 
>> 20 finished fine, but 5 crashed with this error:
>> 
>> JANA ERROR>>-- ERROR: md5 checksum for the following resource file does not match expected
>> JANA ERROR>>-- /tmp/remitche/resources/Magnets/Solenoid/solenoid_1200A_poisson_20160222
>> JANA ERROR>>--  for the resource: 
>> JANA ERROR>>--   URL_base = https://halldweb.jlab.org/resources
>> JANA ERROR>>--       path = Magnets/Solenoid/solenoid_1200A_poisson_20160222
>> JANA ERROR>>--        md5 = 9a70615a1a3e42236cf9c9dabd8cf546
>> JANA ERROR>>-- The md5sum for the existing file is: 902c0923bd13f626fe8cc4dc2e41f222
>> JANA ERROR>>--
>> JANA ERROR>>-- This can happen if the resource download was previously interrupted.
>> JANA ERROR>>-- Try removing the existing file and re-running to trigger a re-download.
>> JANA ERROR>>--
>> JANA ERROR>>-- This is a fatal error and the program will stop now. To bypass checking
>> JANA ERROR>>-- the md5sum, set the JANA:RESOURCE_CHECK_MD5 config. parameter to 0.
>> JANA ERROR>>--
>> 
>> The message is the same for all 5 crashed jobs, but the "md5sum for the existing file" (line 7) is different for each.
>> 
>> Was maybe one job downloading the file while others were trying to read it?  Is there a way to avoid this?
>> 
>> Thanks,
>> Ryan
>> 
>> 
>> 
>> 
>>> On Jul 18, 2014, at 11:52 AM, Matthew Shepherd <mashephe at indiana.edu> wrote:
>>> 
>>> 
>>> Great -- thanks.
>>> 
>>> It was going to catch someone at some point.  (I'm
>>> assuming I'm not the only one that hits ctrl-c when I realize
>>> that I've just executed a long program with 
>>> incorrect arguments.)  The field download takes
>>> a while and happens right at the time you realize you
>>> made an execution mistake.
>>> 
>>> For such a large system with so many users and
>>> possible permutations of use it is really important
>>> to make every sanity check possible and abort
>>> immediately when something doesn't seem right.
>>> 
>>> Matt
>>> 
>>> ---------------------------------------------------------------------
>>> Matthew Shepherd, Associate Professor
>>> Department of Physics, Indiana University, Swain West 265
>>> 727 East Third Street, Bloomington, IN 47405
>>> 
>>> Office Phone:  +1 812 856 5808
>>> 
>>> On Jul 18, 2014, at 10:23 AM, David Lawrence <davidl at jlab.org> wrote:
>>> 
>>>> 
>>>> Hi Matt,
>>>> 
>>>> Sorry this has cost you so much time. You’re the first person to
>>>> report this as an issue in the 6 months since it was deployed. I’ll
>>>> go ahead and put in the code to generate and check the md5
>>>> checksum whenever a program is started so that an error can
>>>> be flagged if there is a mismatch.
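
For illustration only, the startup check described here could look roughly like the following Python sketch. It is not the code that went into JANA; md5_of_file and check_resource are hypothetical names, and the expected checksum is assumed to come from the resource's metadata.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_resource(path, expected_md5):
    """Raise RuntimeError if the cached file does not match the expected checksum."""
    actual = md5_of_file(path)
    if actual != expected_md5.lower():
        raise RuntimeError(
            f"md5 mismatch for {path}: expected {expected_md5}, got {actual}"
        )
```

A partially downloaded file fails this check at startup instead of being read silently as a truncated field map.
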
>>>> 
>>>> Regards,
>>>> -David
>>>> 
>>>> On Jul 18, 2014, at 10:15 AM, Matthew Shepherd <mashephe at indiana.edu> wrote:
>>>> 
>>>>> 
>>>>> With bits and pieces from 3 different people,  I think 
>>>>> I've figured this out...  gggrrrrrr!
>>>>> 
>>>>> Mike Staib suggested my log seemed to indicate my field doesn't have
>>>>> enough z points.
>>>>> 
>>>>> It seems like jana is using JANA_RESOURCE_DIR to cache fields.  (I got
>>>>> this from Paul.. I've never set this, but grep and reading jana source 
>>>>> tells me it is set by default to /tmp/username/resources.)
>>>>> 
>>>>> If I go to that directory and delete it and rerun bfield2root, I get 
>>>>> a full field.
>>>>> 
>>>>> What happened?  Here's my theory:
>>>>> 
>>>>> I ran a job and that job chose to download the field.  This
>>>>> is the first time-consuming thing that happens in a job.  If, 
>>>>> during the field download, you kill the job with ctrl-c, then 
>>>>> you are left with a partial field.
>>>>> (I tested this out and was able to repeat it.)
>>>>> 
>>>>> I must have been unlucky and hit ctrl-c on the job that
>>>>> tried to download the field.  Note this is normal for most users.
>>>>> It is easy to execute a command accidentally or realize just
>>>>> after you press return that you didn't specify all the arguments
>>>>> that you wanted.
>>>>> 
>>>>> This is a really nasty behavior because every subsequent job
>>>>> then just reads this partial field and never prints any message
>>>>> or error.
>>>>> 
>>>>> Can we put some sort of check in the field?  Write the number
>>>>> of points in the file first.  And then on read back, when there
>>>>> aren't that many points, abort with an error.
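
A minimal Python sketch of the point-count idea follows, under loudly stated assumptions: the binary layout (an 8-byte count followed by six doubles per point) is invented for illustration and is not the actual field map format, and write_field and read_field are hypothetical names.

```python
import struct

COUNT = struct.Struct("<Q")   # hypothetical header: number of points
POINT = struct.Struct("<6d")  # hypothetical record: x, y, z, Bx, By, Bz


def write_field(path, points):
    """Write the point count first, then one fixed-size record per point."""
    with open(path, "wb") as f:
        f.write(COUNT.pack(len(points)))
        for p in points:
            f.write(POINT.pack(*p))


def read_field(path):
    """Read the map back, aborting if it holds fewer points than promised."""
    with open(path, "rb") as f:
        header = f.read(COUNT.size)
        if len(header) != COUNT.size:
            raise ValueError(f"{path}: truncated header")
        (expected,) = COUNT.unpack(header)
        body = f.read()
    actual, leftover = divmod(len(body), POINT.size)
    if actual != expected or leftover:
        raise ValueError(f"{path}: expected {expected} points, found {actual}")
    return [POINT.unpack_from(body, i * POINT.size) for i in range(expected)]
```

An interrupted write leaves a file whose body is shorter than the header says, so read_field raises instead of returning a partial field.
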
>>>>> 
>>>>> nasty nasty nasty... that cost Paul and me a ton of time
>>>>> this week
>>>>> 
>>>>> Matt
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> Matthew Shepherd, Associate Professor
>>>>> Department of Physics, Indiana University, Swain West 265
>>>>> 727 East Third Street, Bloomington, IN 47405
>>>>> 
>>>>> Office Phone:  +1 812 856 5808
>>>>> 
>>>>> On Jul 18, 2014, at 7:18 AM, David Lawrence <davidl at jlab.org> wrote:
>>>>> 
>>>>>> 
>>>>>> Hi Matt,
>>>>>> 
>>>>>> The output looks right. What happens if you run the bfield2root utility? This should
>>>>>> be built as part of the default sim-recon build. (Source is in $HALLD_HOME/src/programs/Utilities/bfield2root)
>>>>>> 
>>>>>> Draw the field map in ROOT using:
>>>>>> 
>>>>>>> bfield2root
>>>>>>> root bfield.root
>>>>>> root [1] Bz_vs_r_vs_z->Draw("colz")
>>>>>> 
>>>>>> Also, how are you checking that the field map is returning zeros?
>>>>>> 
>>>>>> Regards,
>>>>>> -David
>>>>>> 
>>>>>> On Jul 17, 2014, at 9:49 PM, Matthew Shepherd <mashephe at indiana.edu> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> Paul and I have been trying to understand why I cannot
>>>>>>> get the example analysis software working and we seem
>>>>>>> to have traced the problem down to the fact that the 
>>>>>>> magnetic field map is returning a zero field.
>>>>>>> 
>>>>>>> Does anyone know how to debug this?  The startup
>>>>>>> of the job suggests all is OK (I think):
>>>>>>> 
>>>>>>> JANA >>URL: sqlite:////home/s4/mashephe/gluex/ccdb.sqlite
>>>>>>> JANA >>context: default
>>>>>>> JANA >>Reading Magnetic field map from Magnets/Solenoid/solenoid_1350_poisson_20130925 ...
>>>>>>> Nx=251 Ny=1 Nz=43 )  at 0x7f46720a0ce0
>>>>>>> Fine-mesh evio file does not exist.
>>>>>>> Constructing the fine-mesh B-field map...
>>>>>>> rmin: 0 rmax: 88.5 dr: 0.1 zmin: 0 zmax: 600 dz: 0.1vg.: 0.0Hz)     
>>>>>>> Number of points in z = 6000
>>>>>>> Number of points in r = 885
>>>>>>> JANA >>10599 entries found (Created Magnetic field map of type DMagneticFieldMap
>>>>>>> 
>>>>>>> I'm using:
>>>>>>> 
>>>>>>> sim-recon-2014-06-30
>>>>>>> ccdb_1.02
>>>>>>> hdds-2.1
>>>>>>> jana_0.7.1p3
>>>>>>> 
>>>>>>> and a ccdb.sqlite file from July 14 (copied from Mark's web page link that day).  I've also
>>>>>>> tried the ccdb.sqlite file from dc2 conditions.
>>>>>>> 
>>>>>>> Matt
>>>>>>> 
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> Matthew Shepherd, Associate Professor
>>>>>>> Department of Physics, Indiana University, Swain West 265
>>>>>>> 727 East Third Street, Bloomington, IN 47405
>>>>>>> 
>>>>>>> Office Phone:  +1 812 856 5808
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> Halld-offline mailing list
>>>>>>> Halld-offline at jlab.org
>>>>>>> https://mailman.jlab.org/mailman/listinfo/halld-offline
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
