[Halld-offline] SIMD Troubles
David Lawrence
davidl at jlab.org
Thu Mar 3 08:55:46 EST 2011
Hi All,
I just committed a change to fix this. Sorry for the confusion over
"readcessed".
-Dave
On 3/3/11 8:47 AM, Beni Zihlmann wrote:
> Hi All,
> concerning the "readcessed" word:
> mcsmear prints how many events have been processed and the rate, ending
> the line with a carriage return but no newline, and overwrites it
> continuously throughout the run. At the end of the run it prints the
> total number of events processed, overwriting the previous
> "blablabla .... processed" message once more. That string is longer than
> "nnnnn events read", so you still see the last letters "cessed" from the
> word "processed".
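>
> As a small illustration of the effect (not the actual mcsmear source,
> just a sketch of the same mechanism):
>
>     // progress_sketch.cc : '\r' returns the cursor to the start of the
>     // line without erasing it, so a shorter final message leaves the
>     // tail of the longer progress line visible ("readcessed"-style).
>     #include <cstdio>
>     int main() {
>        for (int n = 1; n <= 300; n++) {
>           printf("%d events processed  (rate ...)\r", n);
>           fflush(stdout);                // show the overwritten progress line
>        }
>        printf("%d events read\n", 300);  // shorter final line, tail survives
>        return 0;
>     }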
>
> cheers,
> Beni
>
>
>> Matt,
>>
>> I just repeated a fresh checkout and build on stanley. You can look
>> for it under ~jonesrt (fresh install of jana, calib, sim-recon, hdds),
>> built following your procedure as closely as I know how. I do not get any
>> crashing when I run mcsmear on the input file
>> sim_p_pip_pim_0099.hddm. Here are a couple of logs, one from stanley
>> and the other from c0-0. I am not claiming that you are not seeing
>> segfaults on stanley, I just don't know how to reproduce it.
>>
>> -Richard Jones
>>
>> PS. Can someone explain the meaning of a mysterious message printed
>> at the end of each mcsmear run, stating "nnnnnn events
>> readcessed". readcessed???
>>
>> [jonesrt at stanley gluex.d]$ mcsmear sim_p_pip_pim_0099.hddm
>> Warning in <TUnixSystem::SetDisplay>: DISPLAY not set, setting it to
>> gryphn.phys.uconn.edu:0.0
>> BCAL values will be smeared
>> BCAL values will be added
>> Read 26 values from FDC/drift_smear_parms in calibDB
>> Columns: h0 h1 h2 m0 m1 m2 s0 s1 s2
>> get TOF/tof_parms parameters from calibDB
>> get BCAL/bcal_parms parameters from calibDB
>> get FCAL/fcal_parms parameters from calibDB
>> get CDC/cdc_parms parameters from calibDB
>> get FDC/fdc_parms parameters from calibDB
>> get START_COUNTER/start_parms parameters from calibDB
>> input file: sim_p_pip_pim_0099.hddm
>> output file: sim_p_pip_pim_0099_smeared.hddm
>> 300 events readcessed
>> [jonesrt at stanley gluex.d]$
>>
>> [ now I ssh to slave node c0-0 ]
>>
>> [jonesrt at compute-0-0 gluex.d]$ mcsmear sim_p_pip_pim_0099.hddm
>> Warning in <TUnixSystem::SetDisplay>: DISPLAY not set, setting it to
>> stanley.local:0.0
>> BCAL values will be smeared
>> BCAL values will be added
>> Read 26 values from FDC/drift_smear_parms in calibDB
>> Columns: h0 h1 h2 m0 m1 m2 s0 s1 s2
>> get TOF/tof_parms parameters from calibDB
>> get BCAL/bcal_parms parameters from calibDB
>> get FCAL/fcal_parms parameters from calibDB
>> get CDC/cdc_parms parameters from calibDB
>> get FDC/fdc_parms parameters from calibDB
>> get START_COUNTER/start_parms parameters from calibDB
>> input file: sim_p_pip_pim_0099.hddm
>> output file: sim_p_pip_pim_0099_smeared.hddm
>> 300 events readcessed
>> [jonesrt at compute-0-0 gluex.d]$
>>
>>
>>
>>
>> On 3/2/2011 10:14 PM, Matthew Shepherd wrote:
>>> Hi Richard,
>>>
>>> Are you implying a mismatch between the capabilities of stanley and
>>> the nodes? I see the failure when running on stanley.
>>>
>>> Matt
>>>
>>> ----
>>> This message was sent from my iPhone.
>>>
>>> On Mar 2, 2011, at 9:30 PM, Richard Jones
>>> <richard.t.jones at uconn.edu> wrote:
>>>
>>> Matt,
>>>
>>> Having seen and worked through a dozen or so SIMD-related issues in
>>> building this software stack on different hardware for the grid, I
>>> have seen no evidence of any "alignment problem", as suspected by
>>> Simon. At any rate, a detailed diagnosis is better than a
>>> suspicion, so here is a detailed diagnosis of the problem you are
>>> seeing, building on
>>> stanley.physics.indiana.edu and
>>> running on the K7 nodes of the stan cluster.
>>>
>>> stanley.physics.indiana.edu is:
>>>
>>> * dual 4-core Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
>>> * x86_64 instruction set, 64-bit architecture
>>> * running a 32-bit kernel (2.6.18-194.32.1.el5PAE)
>>> * supports SIMD extensions: mmx sse sse2 sse3 ssse3 sse4_1
>>>
>>> worker nodes on the stan cluster are:
>>>
>>> * dual single-core AMD K7 Athlon CPUs @ 1667 MHz
>>> * i686 instruction set, 32-bit architecture
>>> * running a 32-bit kernel (2.6.18-194.32.1.el5)
>>> * supports SIMD extensions: mmx sse mmxext
>>>
>>> In case you would like to verify, I have attached a miniature C++
>>> program that queries the processor for all of the common SIMD
>>> extensions it supports. You can compile and run it on any node to see
>>> which SIMD extensions that node can execute.
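>>>
>>> For reference, a minimal sketch of the kind of query it performs (this
>>> one uses gcc's <cpuid.h> and is only illustrative -- it is not the
>>> attached cpuid.cc itself):
>>>
>>>     // cpuid_sketch.cc : report common SIMD extensions from cpuid leaf 1.
>>>     // Build with:  g++ -o cpuid_sketch cpuid_sketch.cc
>>>     #include <cpuid.h>
>>>     #include <cstdio>
>>>
>>>     int main() {
>>>        unsigned int eax, ebx, ecx, edx;
>>>        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
>>>           printf("cpuid leaf 1 not supported\n");
>>>           return 1;
>>>        }
>>>        // feature bits as documented in the Intel and AMD manuals
>>>        printf("mmx    %s\n", (edx & (1u << 23)) ? "yes" : "no");
>>>        printf("sse    %s\n", (edx & (1u << 25)) ? "yes" : "no");
>>>        printf("sse2   %s\n", (edx & (1u << 26)) ? "yes" : "no");
>>>        printf("sse3   %s\n", (ecx & (1u <<  0)) ? "yes" : "no");
>>>        printf("ssse3  %s\n", (ecx & (1u <<  9)) ? "yes" : "no");
>>>        printf("sse4_1 %s\n", (ecx & (1u << 19)) ? "yes" : "no");
>>>        printf("sse4_2 %s\n", (ecx & (1u << 20)) ? "yes" : "no");
>>>        return 0;
>>>     }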
>>>
>>> Stanley is a bit of an odd-ball: a 64-bit processor running a
>>> 32-bit OS. During the build, Makefile.SIMD tries to discover what
>>> SIMD support is present in the hardware. You are building on
>>> Stanley, so it queries the processor on Stanley and finds that it
>>> supports sse, sse2, ssse3, and sse4_1. It then builds an executable
>>> that exploits all of these features, and that is what you want -- if
>>> you run on Stanley. If you look in the build logs, you should see
>>> the gcc/g++ flags "-mfpmath=sse -msse -DUSE_SSE2 -msse2", which
>>> enable both sse and sse2 instructions.
>>> That code runs fine on stanley, but try to run it on c0-0, and bang,
>>> the parts of the code that use the sse2 extensions hit a wall. The K7
>>> does not implement the sse2 instruction set at all, so the first sse2
>>> instruction the program reaches raises a hardware fault, and that is
>>> the segfault you are seeing.
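>>>
>>> To make the failure mode concrete, here is a hypothetical illustration
>>> (not the actual DVector2 source) of the kind of code path that building
>>> with -DUSE_SSE2 -msse2 selects:
>>>
>>>     // sse2_guard_sketch.cc : illustrative only, not the real DVector2.
>>>     // With -DUSE_SSE2 -msse2 the compiler emits sse2 instructions on
>>>     // the 128-bit xmm registers; a K7 faults on the first one of them.
>>>     #include <cstdio>
>>>
>>>     #ifdef USE_SSE2
>>>     #include <emmintrin.h>                       // sse2 intrinsics
>>>     static void add2(const double a[2], const double b[2], double c[2]) {
>>>        __m128d va = _mm_loadu_pd(a);             // packed-double load
>>>        __m128d vb = _mm_loadu_pd(b);
>>>        _mm_storeu_pd(c, _mm_add_pd(va, vb));     // packed-double add
>>>     }
>>>     #else
>>>     static void add2(const double a[2], const double b[2], double c[2]) {
>>>        c[0] = a[0] + b[0];                       // scalar fallback
>>>        c[1] = a[1] + b[1];
>>>     }
>>>     #endif
>>>
>>>     int main() {
>>>        double a[2] = {1, 2}, b[2] = {3, 4}, c[2];
>>>        add2(a, b, c);
>>>        printf("%g %g\n", c[0], c[1]);
>>>        return 0;
>>>     }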
>>>
>>> The immediate solution for you on the stanley cluster is to redo the
>>> build (make clean;make) on one of the worker nodes. Then it will
>>> recognize that sse2 is not supported, and set the flags to
>>> "-mfpmath=sse -msse -mno-sse2", so sse math will be used (consistent
>>> answers) but not the sse2 extensions (no segfaults). This code will
>>> run both on the K7 nodes and on the head node, and will give answers
>>> that are consistent with running full Simon-supercharged code on a
>>> 64-bit node, but without the supercharged performance.
>>>
>>> For the future, we should have a run-time startup check in our
>>> applications that verifies that the options used during the build
>>> are supported by the cpu running the code. This is easy to do, and
>>> for code that uses DANA, it is now a part of the Init() method of
>>> the DApplication class -- I added it last week. It would be trivial
>>> to copy that code into the main() of non-DANA apps like mcsmear and
>>> hdgeant. I would support that, but will hold back on checking in
>>> more SIMD-related changes until some of this dust settles and we
>>> know that things are under control.
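>>>
>>> As a rough sketch of that idea (again using gcc's <cpuid.h>; this is
>>> not the actual code in DApplication::Init(), just the same kind of
>>> check written for a standalone main()):
>>>
>>>     // startup_check_sketch.cc : refuse to run, with a clear message,
>>>     // if the executable was built with sse2 but the cpu lacks it.
>>>     #include <cpuid.h>
>>>     #include <cstdio>
>>>     #include <cstdlib>
>>>
>>>     static bool cpu_has_sse2() {
>>>        unsigned int eax, ebx, ecx, edx;
>>>        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
>>>        return (edx & (1u << 26)) != 0;        // sse2 bit of cpuid leaf 1
>>>     }
>>>
>>>     int main() {
>>>     #ifdef USE_SSE2
>>>        if (!cpu_has_sse2()) {
>>>           fprintf(stderr, "built with sse2, but this cpu does not support"
>>>                           " it -- rebuild on this node or with -mno-sse2\n");
>>>           return EXIT_FAILURE;
>>>        }
>>>     #endif
>>>        // ... the rest of the application would follow here ...
>>>        return 0;
>>>     }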
>>>
>>> -Richard J.
>>>
>>>
>>>
>>>
>>>
>>> On 3/2/2011 6:08 PM, Simon Taylor wrote:
>>>
>>> Hi.
>>>
>>> Our suspicion is that there is an alignment problem on 32-bit systems
>>> with regard to the SIMD instructions; Dave looked into this some time
>>> ago and it is not clear to us how to fix it.
>>>
>>> I've checked in a change to Makefile.SIMD that changes the default from
>>> "SIMD on" to "SIMD off". To get the SIMD instructions, one would
>>> now need to do "make ENABLE_SIMD=yes".
>>>
>>> Simon
>>>
>>> Matthew Shepherd wrote:
>>>
>>>
>>> Hi all,
>>>
>>> It seems that the BMS system doesn't properly understand our SIMD
>>> capabilities on the machines here at Indiana. If we do a default
>>> build, then we get a segmentation fault at the first DVector2
>>> operation. If we build with DISABLE_SIMD=1 then this segfault is
>>> avoided.
>>>
>>> This seems to point to one of two possible causes:
>>>
>>> (1) there is a bug in the SIMD implementation of DVector2
>>> (2) our machines are not capable of handling current SIMD code
>>>
>>> (1) seems unlikely since other people are using the code. Assuming
>>> it is (2), how do we properly diagnose and fix it?
>>>
>>> -Matt
>>>
>>> <cpuid.cc>