[Halld-offline] zero field
David Lawrence
davidl at jlab.org
Wed Oct 19 23:47:27 EDT 2016
I concur that this is the right way to go. Otherwise every one of your jobs could be hitting the
JLab server to download the same file. Better to have a local version.
Regards,
-David
> On Oct 19, 2016, at 8:42 PM, Mitchell, Ryan Edward <remitche at indiana.edu> wrote:
>
> Thanks, Sean,
>
> I'll go with something like solution 2 for the time being. Having JANA_RESOURCE_DIR point to something more permanent should help.
>
> Ryan
>
>
>
>> On Oct 19, 2016, at 7:10 PM, Sean Dobbs <s-dobbs at northwestern.edu> wrote:
>>
>> Hi Ryan,
>>
>> I think that you've correctly diagnosed the problem. The base directory that the resources are stored in is given by the environment variable JANA_RESOURCE_DIR. A couple solutions to the problem come to mind:
>>
>> 1) Create a separate temporary JANA_RESOURCE_DIR for each job in some temporary area. Perhaps your batch system makes this easy, but I'd only suggest this if you have a caching http proxy, or you will make halldweb rather sad.
>>
>> 2) If you have a shared disk that the jobs can access, set up your own JANA_RESOURCE_DIR on that disk, and run a short interactive job to populate it before running your batch jobs.
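Option 2 amounts to pointing every job at one pre-populated directory. Below is a minimal sketch in Python, assuming a batch wrapper script launches each job. Only the JANA_RESOURCE_DIR variable name comes from the message above; the helper name and the demo directory are hypothetical stand-ins for a real shared-disk path.

```python
import os
import tempfile
from pathlib import Path


def job_environment(shared_dir, base_env=None):
    """Return a copy of the environment with JANA_RESOURCE_DIR set to shared_dir.

    The directory is created if it does not exist yet, so the same helper
    can be used by the interactive job that populates it.
    """
    env = dict(os.environ if base_env is None else base_env)
    resource_dir = Path(shared_dir).expanduser().resolve()
    resource_dir.mkdir(parents=True, exist_ok=True)
    env["JANA_RESOURCE_DIR"] = str(resource_dir)
    return env


# Demo with a throwaway directory standing in for the shared disk (assumption).
demo_dir = Path(tempfile.mkdtemp()) / "jana_resources"
env = job_environment(demo_dir)
print(env["JANA_RESOURCE_DIR"])
# A wrapper could then start the job with: subprocess.run(job_command, env=env)
```

Running one short interactive job with this environment first, and the batch jobs afterwards, means the batch jobs only read files that are already complete.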
>>
>> I haven't encountered this problem myself in a while, but maybe a longer-term fix would be to add lock files to the download mechanism in JANA.
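The lock-file idea can be illustrated outside of JANA. The following is a sketch only, not JANA's download code: the function names, the lock path, and the download callback are all hypothetical. It relies on os.open with O_CREAT | O_EXCL, which fails if the file already exists, so exactly one process acquires the lock; behaviour on a particular network filesystem would need to be checked.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def download_lock(lock_path, timeout=600.0, poll=1.0):
    """Hold an exclusive lock file for the duration of the with-block."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Creation fails with FileExistsError if another process holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        # Record the owner's PID to help diagnose a stale lock by hand.
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        yield
    finally:
        os.remove(lock_path)


def fetch_once(target, lock_path, download):
    """Call download(target) only if target is missing, under the lock."""
    with download_lock(lock_path):
        if not os.path.exists(target):
            download(target)
```

Jobs that lose the race wait for the lock, then find the file already present and skip the download. A job killed while holding the lock leaves a stale lock file behind, which this sketch does not handle.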
>>
>> Cheers,
>> Sean
>>
>>
>> On Wed, Oct 19, 2016 at 4:17 PM Mitchell, Ryan Edward <remitche at indiana.edu> wrote:
>> Hi David and all,
>>
>> I've run into some strange errors that seem possibly related to the email chain I dug up and attached below.
>>
>> I'm running 25 analysis jobs over ~100 rest hddm files in a single run (using the karst machine at IU).
>>
>> 20 finished fine, but 5 crashed with this error:
>>
>> JANA ERROR>>-- ERROR: md5 checksum for the following resource file does not match expected
>> JANA ERROR>>-- /tmp/remitche/resources/Magnets/Solenoid/solenoid_1200A_poisson_20160222
>> JANA ERROR>>-- for the resource:
>> JANA ERROR>>-- URL_base = https://halldweb.jlab.org/resources
>> JANA ERROR>>-- path = Magnets/Solenoid/solenoid_1200A_poisson_20160222
>> JANA ERROR>>-- md5 = 9a70615a1a3e42236cf9c9dabd8cf546
>> JANA ERROR>>-- The md5sum for the existing file is: 902c0923bd13f626fe8cc4dc2e41f222
>> JANA ERROR>>--
>> JANA ERROR>>-- This can happen if the resource download was previously interrupted.
>> JANA ERROR>>-- Try removing the existing file and re-running to trigger a re-download.
>> JANA ERROR>>--
>> JANA ERROR>>-- This is a fatal error and the program will stop now. To bypass checking
>> JANA ERROR>>-- the md5sum, set the JANA:RESOURCE_CHECK_MD5 config. parameter to 0.
>> JANA ERROR>>--
>>
>> The message is the same for all 5 crashed jobs, but the "md5sum for the existing file" (line 7) is different for each.
>>
>> Was maybe one job downloading the file while others were trying to read it? Is there a way to avoid this?
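The error message suggests removing the mismatched file so that it is downloaded again. A small Python sketch of that check, run by hand before submitting jobs, might look like this. The function names are hypothetical; the path and expected md5 in the trailing comment are the ones quoted in the error above.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_if_corrupt(path, expected_md5):
    """Delete path if its md5 differs from expected_md5.

    Returns True if a file was removed, False if it was missing or intact.
    """
    if not os.path.exists(path):
        return False
    if md5_of_file(path) == expected_md5.lower():
        return False
    os.remove(path)
    return True


# Example call with the values from the log above:
# remove_if_corrupt(
#     "/tmp/remitche/resources/Magnets/Solenoid/solenoid_1200A_poisson_20160222",
#     "9a70615a1a3e42236cf9c9dabd8cf546",
# )
```

This only cleans up after the fact. If several jobs are still writing to the same cache, the file can be corrupted again.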
>>
>> Thanks,
>> Ryan
>>
>>
>>
>>
>>> On Jul 18, 2014, at 11:52 AM, Matthew Shepherd <mashephe at indiana.edu> wrote:
>>>
>>>
>>> Great -- thanks.
>>>
>>> It was going to catch someone at some point. (I'm
>>> assuming I'm not the only one that hits ctrl-c when I realize
>>> that I've just executed a long program with
>>> incorrect arguments.) The field download takes
>>> a while and happens right at the time you realize you
>>> made an execution mistake.
>>>
>>> For such a large system with so many users and
>>> possible permutations of use it is really important
>>> to make every sanity check possible and abort
>>> immediately when something doesn't seem right.
>>>
>>> Matt
>>>
>>> ---------------------------------------------------------------------
>>> Matthew Shepherd, Associate Professor
>>> Department of Physics, Indiana University, Swain West 265
>>> 727 East Third Street, Bloomington, IN 47405
>>>
>>> Office Phone: +1 812 856 5808
>>>
>>> On Jul 18, 2014, at 10:23 AM, David Lawrence <davidl at jlab.org> wrote:
>>>
>>>>
>>>> Hi Matt,
>>>>
>>>> Sorry this has cost you so much time. You’re the first person to
>>>> report this as an issue in the 6 months since it was deployed. I’ll
>>>> go ahead and put in the code to generate and check the md5
>>>> checksum whenever a program is started so that an error can
>>>> be flagged if there is a mismatch.
>>>>
>>>> Regards,
>>>> -David
>>>>
>>>> On Jul 18, 2014, at 10:15 AM, Matthew Shepherd <mashephe at indiana.edu> wrote:
>>>>
>>>>>
>>>>> With bits and pieces from 3 different people, I think
>>>>> I've figured this out... gggrrrrrr!
>>>>>
>>>>> Mike Staib suggested my log seemed to indicate my field doesn't have
>>>>> enough z points.
>>>>>
>>>>> It seems like jana is using JANA_RESOURCE_DIR to cache fields. (I got
>>>>> this from Paul.. I've never set this, but grep and reading jana source
>>>>> tells me it is set by default to /tmp/username/resources.)
>>>>>
>>>>> If I go to that directory and delete it and rerun bfield2root, I get
>>>>> a full field.
>>>>>
>>>>> What happened? Here's my theory:
>>>>>
>>>>> I ran a job and that job chose to download the field. This
>>>>> is the first time consuming thing that happens in a job. If,
>>>>> during the field download you kill the job with ctrl-c then
>>>>> you are left with a partial field.
>>>>> (I tested this out and was able to repeat it.)
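One common way to avoid leaving a partial file behind after ctrl-c is to download into a temporary file and rename it into place only once it is complete. The sketch below shows the general technique in Python; it is not how JANA downloads resources, and the function name and the chunk iterable are hypothetical.

```python
import os
import tempfile


def write_atomically(target, chunks):
    """Write an iterable of byte chunks to target so that target is never partial.

    Data goes to a temporary file in the same directory, which is renamed
    over target only after everything has been written and flushed.
    """
    directory = os.path.dirname(os.path.abspath(target))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".partial-")
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and destination share a filesystem.
        os.replace(tmp_path, target)
    except BaseException:
        # BaseException also covers KeyboardInterrupt (ctrl-c).
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern an interrupted download leaves either no file or the previous complete file, so later jobs never silently read a truncated field map.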
>>>>>
>>>>> I must have been unlucky and hit ctrl-c on the job that
>>>>> tried to download the field. Note this is normal for most users.
>>>>> It is easy to execute a command accidentally or realize just
>>>>> after you press return that you didn't specify all the arguments
>>>>> that you wanted.
>>>>>
>>>>> This is a really nasty behavior because every subsequent job
>>>>> then just reads this partial field and never prints any message
>>>>> or error.
>>>>>
>>>>> Can we put some sort of check in the field? Write the number
>>>>> of points in the file first. Then, on read back, if there
>>>>> aren't that many points, abort with an error.
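The point-count check proposed here can be sketched as follows. The record layout (an 8-byte count followed by three doubles per point) is purely an assumption made for illustration and is not the actual field-map file format; all names are hypothetical.

```python
import struct

HEADER = struct.Struct("<Q")   # number of points, written first (assumed layout)
POINT = struct.Struct("<3d")   # one point as three doubles (assumed layout)


def write_points(path, points):
    """Write the point count followed by the points themselves."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(points)))
        for p in points:
            f.write(POINT.pack(*p))


def read_points(path):
    """Read the points back, raising ValueError if the file is truncated."""
    with open(path, "rb") as f:
        header = f.read(HEADER.size)
        if len(header) != HEADER.size:
            raise ValueError(f"{path}: missing point-count header")
        (expected,) = HEADER.unpack(header)
        data = f.read()
    if len(data) != expected * POINT.size:
        found = len(data) // POINT.size
        raise ValueError(f"{path}: expected {expected} points, found {found}")
    return [POINT.unpack_from(data, i * POINT.size) for i in range(expected)]
```

A reader built this way aborts on a partial file instead of silently returning a shortened map.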
>>>>>
>>>>> nasty nasty nasty... that cost Paul and me a ton of time
>>>>> this week
>>>>>
>>>>> Matt
>>>>>
>>>>>
>>>>> On Jul 18, 2014, at 7:18 AM, David Lawrence <davidl at jlab.org> wrote:
>>>>>
>>>>>>
>>>>>> Hi Matt,
>>>>>>
>>>>>> The output looks right. What happens if you run the bfield2root utility? This should
>>>>>> be built as part of the default sim-recon build. (Source is in $HALLD_HOME/src/programs/Utilities/bfield2root)
>>>>>>
>>>>>> Draw the field map in ROOT using:
>>>>>>
>>>>>>> bfield2root
>>>>>>> root bfield.root
>>>>>> root [1] Bz_vs_r_vs_z->Draw("colz")
>>>>>>
>>>>>> Also, how are you checking that the field map is returning zeros?
>>>>>>
>>>>>> Regards,
>>>>>> -David
>>>>>>
>>>>>> On Jul 17, 2014, at 9:49 PM, Matthew Shepherd <mashephe at indiana.edu> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Paul and I have been trying to understand why I cannot
>>>>>>> get the example analysis software working and we seem
>>>>>>> to have traced the problem down to the fact that the
>>>>>>> magnetic field map is returning a zero field.
>>>>>>>
>>>>>>> Does anyone know how to debug this? The startup
>>>>>>> of the job suggests all is OK (I think):
>>>>>>>
>>>>>>> JANA >>URL: sqlite:////home/s4/mashephe/gluex/ccdb.sqlite
>>>>>>> JANA >>context: default
>>>>>>> JANA >>Reading Magnetic field map from Magnets/Solenoid/solenoid_1350_poisson_20130925 ...
>>>>>>> Nx=251 Ny=1 Nz=43 ) at 0x7f46720a0ce0
>>>>>>> Fine-mesh evio file does not exist.
>>>>>>> Constructing the fine-mesh B-field map...
>>>>>>> rmin: 0 rmax: 88.5 dr: 0.1 zmin: 0 zmax: 600 dz: 0.1vg.: 0.0Hz)
>>>>>>> Number of points in z = 6000
>>>>>>> Number of points in r = 885
>>>>>>> JANA >>10599 entries found (Created Magnetic field map of type DMagneticFieldMap
>>>>>>>
>>>>>>> I'm using:
>>>>>>>
>>>>>>> sim-recon-2014-06-30
>>>>>>> ccdb_1.02
>>>>>>> hdds-2.1
>>>>>>> jana_0.7.1p3
>>>>>>>
>>>>>>> and a ccdb.sqlite file from July 14 (copied from Mark's web page link that day). I've also
>>>>>>> tried the ccdb.sqlite file from dc2 conditions.
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Halld-offline mailing list
>>>>>>> Halld-offline at jlab.org
>>>>>>> https://mailman.jlab.org/mailman/listinfo/halld-offline
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>