[Halld-offline] SIMD Troubles
David Lawrence
davidl at jlab.org
Thu Mar 3 08:55:46 EST 2011
Hi All,
I just committed a change to fix this. Sorry for the confusion over
"readcessed".
-Dave
On 3/3/11 8:47 AM, Beni Zihlmann wrote:
> Hi All,
> concerning the "readcessed" word:
> mcsmear prints how many events have been processed and the rate, ending
> the line with a carriage return but no newline, and overwrites it
> continuously throughout the run. At the end of the run it prints the
> total number of events processed, overwriting the previous
> "blablabla .... processed" message once more. That string is longer than
> "nnnnn events read", so you still see the last letters "cessed" from the
> word "processed".
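>
> As a small illustration of the effect (not the actual mcsmear source,
> just a sketch of the same mechanism):
>
>     // progress_sketch.cc : '\r' returns the cursor to the start of the
>     // line without erasing it, so a shorter final message leaves the
>     // tail of the longer progress line visible ("readcessed"-style).
>     #include <cstdio>
>     int main() {
>        for (int n = 1; n <= 300; n++) {
>           printf("%d events processed  (rate ...)\r", n);
>           fflush(stdout);                // show the overwritten progress line
>        }
>        printf("%d events read\n", 300);  // shorter final line, tail survives
>        return 0;
>     }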
>
> cheers,
> Beni
>
>
>> Matt,
>>
>> I just repeated a fresh checkout and build on stanley. You can look
>> for it under ~jonesrt (fresh install of jana, calib, sim-recon, hdds),
>> built following your procedure as closely as I know how. I do not get any
>> crashing when I run mcsmear on the input file
>> sim_p_pip_pim_0099.hddm. Here are a couple of logs, one from stanley
>> and the other from c0-0. I am not claiming that you are not seeing
>> segfaults on stanley, I just don't know how to reproduce it.
>>
>> -Richard Jones
>>
>> PS. Can someone explain the meaning of a mysterious message printed
>> at the end of each mcsmear run, stating "nnnnnn events
>> readcessed". readcessed???
>>
>> [jonesrt at stanley gluex.d]$ mcsmear sim_p_pip_pim_0099.hddm
>> Warning in <TUnixSystem::SetDisplay>: DISPLAY not set, setting it to
>> gryphn.phys.uconn.edu:0.0
>> BCAL values will be smeared
>> BCAL values will be added
>> Read 26 values from FDC/drift_smear_parms in calibDB
>> Columns: h0 h1 h2 m0 m1 m2 s0 s1 s2
>> get TOF/tof_parms parameters from calibDB
>> get BCAL/bcal_parms parameters from calibDB
>> get FCAL/fcal_parms parameters from calibDB
>> get CDC/cdc_parms parameters from calibDB
>> get FDC/fdc_parms parameters from calibDB
>> get START_COUNTER/start_parms parameters from calibDB
>> input file: sim_p_pip_pim_0099.hddm
>> output file: sim_p_pip_pim_0099_smeared.hddm
>> 300 events readcessed
>> [jonesrt at stanley gluex.d]$
>>
>> [ now I ssh to slave node c0-0 ]
>>
>> [jonesrt at compute-0-0 gluex.d]$ mcsmear sim_p_pip_pim_0099.hddm
>> Warning in <TUnixSystem::SetDisplay>: DISPLAY not set, setting it to
>> stanley.local:0.0
>> BCAL values will be smeared
>> BCAL values will be added
>> Read 26 values from FDC/drift_smear_parms in calibDB
>> Columns: h0 h1 h2 m0 m1 m2 s0 s1 s2
>> get TOF/tof_parms parameters from calibDB
>> get BCAL/bcal_parms parameters from calibDB
>> get FCAL/fcal_parms parameters from calibDB
>> get CDC/cdc_parms parameters from calibDB
>> get FDC/fdc_parms parameters from calibDB
>> get START_COUNTER/start_parms parameters from calibDB
>> input file: sim_p_pip_pim_0099.hddm
>> output file: sim_p_pip_pim_0099_smeared.hddm
>> 300 events readcessed
>> [jonesrt at compute-0-0 gluex.d]$
>>
>>
>>
>>
>> On 3/2/2011 10:14 PM, Matthew Shepherd wrote:
>>> Hi Richard,
>>>
>>> Are you implying a mismatch between the capabilities of stanley and
>>> the nodes? I see the failure when running on stanley.
>>>
>>> Matt
>>>
>>> ----
>>> This message was sent from my iPhone.
>>>
>>> On Mar 2, 2011, at 9:30 PM, Richard Jones
>>> <richard.t.jones at uconn.edu> wrote:
>>>
>>> Matt,
>>>
>>> Having seen and worked through a dozen or so SIMD-related issues in
>>> building this software stack on different hardware for the grid, I
>>> have seen no evidence of any "alignment problem", as suspected by
>>> Simon. At any rate, a detailed diagnosis is better than a
>>> suspicion, so here is a detailed diagnosis of the problem you are
>>> seeing, building on
>>> stanley.physics.indiana.edu and
>>> running on the K7 nodes of the stan cluster.
>>>
>>> stanley.physics.indiana.edu is:
>>>
>>> * dual 4-core Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
>>> * x86_64 instruction set, 64-bit architecture
>>> * running a 32-bit kernel (2.6.18-194.32.1.el5PAE)
>>> * supports SIMD extensions: mmx sse sse2 sse3 ssse3 sse4_1
>>>
>>> worker nodes on the stan cluster are:
>>>
>>> * dual single-core AMD K7 Athlon CPUs @ 1667 MHz
>>> * i686 instruction set, 32-bit architecture
>>> * running a 32-bit kernel (2.6.18-194.32.1.el5)
>>> * supports SIMD extensions: mmx sse mmxext
>>>
>>> In case you would like to verify, I have attached a miniature C++
>>> program that queries the processor for all of the common SIMD
>>> extensions it supports. You can compile and run it on any node to see
>>> which SIMD extensions that node can execute.
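>>>
>>> For reference, a minimal sketch of the kind of query it performs (this
>>> one uses gcc's <cpuid.h> and is only illustrative -- it is not the
>>> attached cpuid.cc itself):
>>>
>>>     // cpuid_sketch.cc : report common SIMD extensions from cpuid leaf 1.
>>>     // Build with:  g++ -o cpuid_sketch cpuid_sketch.cc
>>>     #include <cpuid.h>
>>>     #include <cstdio>
>>>
>>>     int main() {
>>>        unsigned int eax, ebx, ecx, edx;
>>>        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
>>>           printf("cpuid leaf 1 not supported\n");
>>>           return 1;
>>>        }
>>>        // feature bits as documented in the Intel and AMD manuals
>>>        printf("mmx    %s\n", (edx & (1u << 23)) ? "yes" : "no");
>>>        printf("sse    %s\n", (edx & (1u << 25)) ? "yes" : "no");
>>>        printf("sse2   %s\n", (edx & (1u << 26)) ? "yes" : "no");
>>>        printf("sse3   %s\n", (ecx & (1u <<  0)) ? "yes" : "no");
>>>        printf("ssse3  %s\n", (ecx & (1u <<  9)) ? "yes" : "no");
>>>        printf("sse4_1 %s\n", (ecx & (1u << 19)) ? "yes" : "no");
>>>        printf("sse4_2 %s\n", (ecx & (1u << 20)) ? "yes" : "no");
>>>        return 0;
>>>     }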
>>>
>>> Stanley is a bit of an odd-ball: a 64-bit processor running a
>>> 32-bit OS. During the build, Makefile.SIMD tries to discover what
>>> SIMD support is present in the hardware. You are building on
>>> Stanley, so it queries the processor on Stanley and finds that it
>>> supports sse, sse2, ssse3, and sse4_1. It then builds an executable
>>> that exploits all of these features, and that is what you want -- if
>>> you run on Stanley. If you look in the build logs, you should see
>>> the gcc/g++ flags "-mfpmath=sse -msse -DUSE_SSE2 -msse2", which
>>> enable both sse and sse2 instructions.
>>> That code runs fine on stanley, but try to run it on c0-0, and bang,
>>> the parts of the code that use the sse2 extensions hit a wall. The K7
>>> does not implement the sse2 instruction set at all, so the first sse2
>>> instruction the program reaches raises a hardware fault, and that is
>>> the segfault you are seeing.
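>>>
>>> To make the failure mode concrete, here is a hypothetical illustration
>>> (not the actual DVector2 source) of the kind of code path that building
>>> with -DUSE_SSE2 -msse2 selects:
>>>
>>>     // sse2_guard_sketch.cc : illustrative only, not the real DVector2.
>>>     // With -DUSE_SSE2 -msse2 the compiler emits sse2 instructions on
>>>     // the 128-bit xmm registers; a K7 faults on the first one of them.
>>>     #include <cstdio>
>>>
>>>     #ifdef USE_SSE2
>>>     #include <emmintrin.h>                       // sse2 intrinsics
>>>     static void add2(const double a[2], const double b[2], double c[2]) {
>>>        __m128d va = _mm_loadu_pd(a);             // packed-double load
>>>        __m128d vb = _mm_loadu_pd(b);
>>>        _mm_storeu_pd(c, _mm_add_pd(va, vb));     // packed-double add
>>>     }
>>>     #else
>>>     static void add2(const double a[2], const double b[2], double c[2]) {
>>>        c[0] = a[0] + b[0];                       // scalar fallback
>>>        c[1] = a[1] + b[1];
>>>     }
>>>     #endif
>>>
>>>     int main() {
>>>        double a[2] = {1, 2}, b[2] = {3, 4}, c[2];
>>>        add2(a, b, c);
>>>        printf("%g %g\n", c[0], c[1]);
>>>        return 0;
>>>     }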
>>>
>>> The immediate solution for you on the stanley cluster is to redo the
>>> build (make clean;make) on one of the worker nodes. Then it will
>>> recognize that sse2 is not supported, and set the flags to
>>> "-mfpmath=sse -msse -mno-sse2", so sse math will be used (consistent
>>> answers) but not the sse2 extensions (no segfaults). This code will
>>> run both on the K7 nodes and on the head node, and will give answers
>>> that are consistent with running full Simon-supercharged code on a
>>> 64-bit node, but without the supercharged performance.
>>>
>>> For the future, we should have a run-time startup check in our
>>> applications that verifies that the options used during the build
>>> are supported by the cpu running the code. This is easy to do, and
>>> for code that uses DANA, it is now a part of the Init() method of
>>> the DApplication class -- I added it last week. It would be trivial
>>> to copy that code into the main() of non-DANA apps like mcsmear and
>>> hdgeant. I would support that, but will hold back on checking in
>>> more SIMD-related changes until some of this dust settles and we
>>> know that things are under control.
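>>>
>>> As a rough sketch of that idea (again using gcc's <cpuid.h>; this is
>>> not the actual code in DApplication::Init(), just the same kind of
>>> check written for a standalone main()):
>>>
>>>     // startup_check_sketch.cc : refuse to run, with a clear message,
>>>     // if the executable was built with sse2 but the cpu lacks it.
>>>     #include <cpuid.h>
>>>     #include <cstdio>
>>>     #include <cstdlib>
>>>
>>>     static bool cpu_has_sse2() {
>>>        unsigned int eax, ebx, ecx, edx;
>>>        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
>>>        return (edx & (1u << 26)) != 0;        // sse2 bit of cpuid leaf 1
>>>     }
>>>
>>>     int main() {
>>>     #ifdef USE_SSE2
>>>        if (!cpu_has_sse2()) {
>>>           fprintf(stderr, "built with sse2, but this cpu does not support"
>>>                           " it -- rebuild on this node or with -mno-sse2\n");
>>>           return EXIT_FAILURE;
>>>        }
>>>     #endif
>>>        // ... the rest of the application would follow here ...
>>>        return 0;
>>>     }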
>>>
>>> -Richard J.
>>>
>>>
>>>
>>>
>>>
>>> On 3/2/2011 6:08 PM, Simon Taylor wrote:
>>>
>>> Hi.
>>>
>>> Our suspicion is that there is an alignment problem on 32-bit systems
>>> with regard to the SIMD instructions; Dave looked into this some time
>>> ago and it is not clear to us how to fix it.
>>>
>>> I've checked in a change to Makefile.SIMD that changes the default from
>>> "SIMD on" to "SIMD off". To get the SIMD instructions, one would
>>> now need to do "make ENABLE_SIMD=yes".
>>>
>>> Simon
>>>
>>> Matthew Shepherd wrote:
>>>
>>>
>>> Hi all,
>>>
>>> It seems that the BMS system doesn't properly understand our SIMD
>>> capabilities on the machines here at Indiana. If we do a default
>>> build, then we get a segmentation fault at the first DVector2
>>> operation. If we build with DISABLE_SIMD=1 then this segfault is
>>> avoided.
>>>
>>> This seems to point to one of two possible causes:
>>>
>>> (1) there is a bug in the SIMD implementation of DVector2
>>> (2) our machines are not capable of handling current SIMD code
>>>
>>> (1) seems unlikely since other people are using the code. Assuming
>>> it is (2), how do we properly diagnose and fix it?
>>>
>>> -Matt
>>>
>>> <cpuid.cc>