[Halld-offline] SIMD Troubles
Richard Jones
richard.t.jones at uconn.edu
Thu Mar 3 08:39:11 EST 2011
Matt,
I just repeated a fresh checkout and build on stanley. You can find it under ~jonesrt (fresh install of jana, calib, sim-recon, hdds); I followed your procedure as closely as I know how. I do not get any crash when I run mcsmear on the input file sim_p_pip_pim_0099.hddm. Here are a couple of logs, one from stanley and the other from c0-0. I am not claiming that you are not seeing segfaults on stanley; I just don't know how to reproduce them.
-Richard Jones
PS. Can someone explain the meaning of a mysterious message printed at the end of each mcsmear run, stating "nnnnnn events readcessed". readcessed???
[jonesrt at stanley gluex.d]$ mcsmear sim_p_pip_pim_0099.hddm
Warning in <TUnixSystem::SetDisplay>: DISPLAY not set, setting it to gryphn.phys.uconn.edu:0.0
BCAL values will be smeared
BCAL values will be added
Read 26 values from FDC/drift_smear_parms in calibDB
Columns: h0 h1 h2 m0 m1 m2 s0 s1 s2
get TOF/tof_parms parameters from calibDB
get BCAL/bcal_parms parameters from calibDB
get FCAL/fcal_parms parameters from calibDB
get CDC/cdc_parms parameters from calibDB
get FDC/fdc_parms parameters from calibDB
get START_COUNTER/start_parms parameters from calibDB
input file: sim_p_pip_pim_0099.hddm
output file: sim_p_pip_pim_0099_smeared.hddm
300 events readcessed
[jonesrt at stanley gluex.d]$
[ now I ssh to slave node c0-0 ]
[jonesrt at compute-0-0 gluex.d]$ mcsmear sim_p_pip_pim_0099.hddm
Warning in <TUnixSystem::SetDisplay>: DISPLAY not set, setting it to stanley.local:0.0
BCAL values will be smeared
BCAL values will be added
Read 26 values from FDC/drift_smear_parms in calibDB
Columns: h0 h1 h2 m0 m1 m2 s0 s1 s2
get TOF/tof_parms parameters from calibDB
get BCAL/bcal_parms parameters from calibDB
get FCAL/fcal_parms parameters from calibDB
get CDC/cdc_parms parameters from calibDB
get FDC/fdc_parms parameters from calibDB
get START_COUNTER/start_parms parameters from calibDB
input file: sim_p_pip_pim_0099.hddm
output file: sim_p_pip_pim_0099_smeared.hddm
300 events readcessed
[jonesrt at compute-0-0 gluex.d]$
On 3/2/2011 10:14 PM, Matthew Shepherd wrote:
> Hi Richard,
>
> Are you implying a mismatch between the capabilities of stanley and the nodes? I see the failure when running on stanley.
>
> Matt
>
> ----
> This message was sent from my iPhone.
>
> On Mar 2, 2011, at 9:30 PM, Richard Jones <richard.t.jones at uconn.edu> wrote:
>
> Matt,
>
> Having seen and worked through a dozen or so SIMD-related issues in building this software stack on different hardware for the grid, I have seen no evidence of any "alignment problem", as suspected by Simon. At any rate, a detailed diagnosis is better than a suspicion, so here is a detailed diagnosis of the problem you are seeing, building on stanley.physics.indiana.edu and running on the K7 nodes of the stan cluster.
>
> Stanley.physics.indiana.edu is:
>
> * dual 4-core Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
> * x86_64 instruction set, 64-bit architecture
> * running a 32-bit kernel (2.6.18-194.32.1.el5PAE)
> * supports SIMD extensions: mmx sse sse2 sse3 ssse3 sse4_1
>
> worker nodes on the stan cluster are:
>
> * dual single-core AMD K7 athlon CPUs @ 1667MHz
> * i686 instruction set, 32-bit architecture
> * running a 32-bit kernel (2.6.18-194.32.1.el5)
> * supports SIMD extensions: mmx sse mmxext
>
> In case you would like to verify, I have attached a miniature c++ program that queries the processor for all of the common SIMD extensions that it supports. You can compile and run this on any node, and verify the kinds of SIMD extensions that it can execute.
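For readers who do not have the attachment, a query of this kind might look roughly like the sketch below. It is not the attached cpuid.cc: it assumes GCC or Clang on x86 (for the `<cpuid.h>` header and its `__get_cpuid` helper), and the function name `detect_simd` is made up for illustration. The bit positions are the standard CPUID feature flags.

```cpp
#include <string>
#include <vector>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   // GCC/Clang wrapper around the CPUID instruction
#endif

// Hypothetical helper: return the names of the common SIMD extensions that
// the CPU executing this code reports. On non-x86 hosts the list is empty.
std::vector<std::string> detect_simd()
{
    std::vector<std::string> found;
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Standard feature leaf 1: flags come back in EDX and ECX.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        if (edx & (1u << 23)) found.push_back("mmx");
        if (edx & (1u << 25)) found.push_back("sse");
        if (edx & (1u << 26)) found.push_back("sse2");
        if (ecx & (1u << 0))  found.push_back("sse3");
        if (ecx & (1u << 9))  found.push_back("ssse3");
        if (ecx & (1u << 19)) found.push_back("sse4_1");
        if (ecx & (1u << 20)) found.push_back("sse4_2");
    }
    // Extended leaf 0x80000001: AMD reports its MMX extensions in EDX bit 22.
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx)) {
        if (edx & (1u << 22)) found.push_back("mmxext");
    }
#endif
    return found;
}
```

Printing the returned names from a small main() on each node gives a list that can be compared directly with the two hardware summaries above.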
>
> This Stanley is a bit of an odd-ball: a 64-bit processor running a 32-bit OS. What happens during the build is that the Makefile.SIMD tries to discover what SIMD support is present in the hardware. You are building on Stanley, so it queries the processor on Stanley, and finds it supports sse, sse2, sse3, ssse3, and sse4_1. After that, it builds an executable that exploits all of these features, and that is what you want -- if you run on Stanley. If you look in the build logs, you should see the gcc/g++ flags "-mfpmath=sse -msse -DUSE_SSE2 -msse2", which enable both sse and sse2 instructions. That code runs fine on stanley, but try to run it on c0-0, and bang, the parts of the code that use sse2 extensions are going to hit a wall. The K7 implements sse but not sse2, so the first sse2 instruction it meets is one the processor cannot execute, and I believe that is the crash you are seeing.
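One way to see from inside a program what those flags did is to look at the macros the compiler predefines for each enabled instruction set. The sketch below assumes GCC or Clang, which define `__SSE__`, `__SSE2__` and so on when the corresponding -m option is in effect; the function name `compiled_simd` is hypothetical and not part of the code base discussed here.

```cpp
#include <string>

// Hypothetical helper: list the SIMD instruction sets the compiler was
// allowed to use for this translation unit, judged by predefined macros.
std::string compiled_simd()
{
    std::string s;
#ifdef __MMX__
    s += "mmx ";
#endif
#ifdef __SSE__
    s += "sse ";
#endif
#ifdef __SSE2__
    s += "sse2 ";
#endif
#ifdef __SSE3__
    s += "sse3 ";
#endif
#ifdef __SSSE3__
    s += "ssse3 ";
#endif
#ifdef __SSE4_1__
    s += "sse4_1 ";
#endif
    if (s.empty()) s = "none";
    return s;
}
```

A build made with -msse2 should report sse2 here, and one made with -mno-sse2 should not, regardless of which machine later runs the binary.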
>
> The immediate solution for you on the stanley cluster is to redo the build (make clean;make) on one of the worker nodes. Then it will recognize that sse2 is not supported, and set the flags to "-mfpmath=sse -msse -mno-sse2", so sse math will be used (consistent answers) but not the sse2 extensions (no segfaults). This code will run both on the K7 nodes and on the head node, and will give answers that are consistent with running full Simon-supercharged code on a 64-bit node, but without the super-charged performance.
>
> For the future, we should have a run-time-startup check in our applications that verifies that the options used during the build are supported by the cpu running the code. This is easy to do, and for code that uses DANA, it is now a part of the Init() method of the DApplication class -- I added it last week. It would be trivial to copy that code into the main() of non-DANA apps like mcsmear and hdgeant. I would support that, but will hold back on checking in more SIMD-related changes until some of this dust settles and we know that things are under control.
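As an illustration of the kind of startup check described above, here is a minimal sketch. It is not the code in DApplication::Init(); it assumes GCC 4.8 or later (or Clang) on x86 for `__builtin_cpu_init` and `__builtin_cpu_supports`, and the function name `missing_simd` is invented for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: names of SIMD extensions this binary was compiled to
// use (per the compiler's predefined macros) that the CPU running it does
// not report. An empty result means the build and the host agree.
std::vector<std::string> missing_simd()
{
    std::vector<std::string> missing;
#if defined(__i386__) || defined(__x86_64__)
    __builtin_cpu_init();  // harmless if the runtime has already done this
#ifdef __SSE__
    if (!__builtin_cpu_supports("sse"))    missing.push_back("sse");
#endif
#ifdef __SSE2__
    if (!__builtin_cpu_supports("sse2"))   missing.push_back("sse2");
#endif
#ifdef __SSE3__
    if (!__builtin_cpu_supports("sse3"))   missing.push_back("sse3");
#endif
#ifdef __SSSE3__
    if (!__builtin_cpu_supports("ssse3"))  missing.push_back("ssse3");
#endif
#ifdef __SSE4_1__
    if (!__builtin_cpu_supports("sse4.1")) missing.push_back("sse4_1");
#endif
#endif
    return missing;
}
```

An application could call this first thing in main() and, if the list is not empty, print the missing extensions and exit with an error instead of crashing later at the first unsupported instruction. The translation unit holding the check itself has to be safe to run on the older CPU for the message to appear.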
>
> -Richard J.
>
>
>
>
>
> On 3/2/2011 6:08 PM, Simon Taylor wrote:
>
> Hi.
>
> Our suspicion is that there is an alignment problem on 32-bit systems
> with regard to the SIMD instructions; Dave looked into this some time
> ago and it is not clear to us how to fix it.
>
> I've checked in a change to Makefile.SIMD that changes the default from
> "SIMD on" to "SIMD off". To get the SIMD instructions, one would
> now need to do "make ENABLE_SIMD=yes".
>
> Simon
>
> Matthew Shepherd wrote:
>
>
> Hi all,
>
> It seems that the BMS system doesn't properly understand our SIMD capabilities on the machines here at Indiana. If we do a default build, then we get a segmentation fault at the first DVector2 operation. If we build with DISABLE_SIMD=1 then this segfault is avoided.
>
> This seems to point to one of two causes:
>
> (1) there is a bug in the SIMD implementation of DVector2
> (2) our machines are not capable of handling current SIMD code
>
> (1) seems unlikely since other people are using the code. Assuming it is (2), how do we properly diagnose and fix it?
>
> -Matt
>
>
>
> _______________________________________________
> Halld-offline mailing list
> Halld-offline at jlab.org
> https://mailman.jlab.org/mailman/listinfo/halld-offline
>
>
>
> <cpuid.cc>