[Clara] Overlimit Error Farm

Justin Ruger jruger at jlab.org
Wed Jan 15 09:12:43 EST 2014


So one of the reasons our farm jobs for the Clara DPE keep getting canceled
is this error:

=>> PBS: job killed: vmem 29235798016 exceeded limit 28991029248

This happens with or without the -l flag.
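For scale: assuming the numbers in the PBS message are bytes, the limit is exactly 27 GiB and the job went only about 233 MiB over it, so we are just barely past the cap rather than wildly over:

```shell
# vmem figures copied from the PBS kill message, in bytes
used=29235798016
limit=28991029248

# limit expressed in GiB (1 GiB = 1073741824 bytes)
echo $(( limit / 1024 / 1024 / 1024 ))    # 27

# how far over the limit the job went, in MiB
echo $(( (used - limit) / 1024 / 1024 ))  # 233
```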

So how to recreate the error:

I have files of 20k events each, and I wanted to process 50 of them. I 
wrote a script that caches the files with jcache and appends each file 
name to input.list so that all of them get processed.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/bash
# Usage: ./cache_files.sh <number_of_files>
for ((i = 1; i <= $1; i++)); do
        # print the path the file will have once cached
        echo "/cache/mss/clas/clas12/clas12-testing/gemc/sidis_$i.ev"
        if [ "$i" -eq 1 ]; then
                echo "sidis_$i.ev" > input.list
        else
                echo "sidis_$i.ev" >> input.list
        fi
        jcache submit default /mss/clas/clas12/clas12-testing/gemc/sidis_$i.ev
done
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
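Setting jcache aside, the list-building half of that loop can be exercised on its own. A minimal sketch (the sidis_$i.ev naming is the same as in the script above; truncating input.list once up front avoids the special case for i == 1):

```shell
#!/bin/bash
# Build input.list for n files, no caching involved.
n=3
: > input.list                     # truncate once before the loop
for ((i = 1; i <= n; i++)); do
    echo "sidis_$i.ev" >> input.list
done
cat input.list
```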

Then I ran the FarmOrchestrator like this:

java -cp "$CLARA_SERVICES/.:$CLARA_SERVICES/lib/*" \
    std.orchestrators.FarmOrchestrator \
    -i /cache/mss/clas/clas12/clas12-testing/gemc \
    -o /w/hallb/clas12/jruger/fiftyNode/output \
    -s /tmp \
    /w/hallb/clas12/jruger/stress_test/services.yaml input.list

It ran fine for the first 5 files (sidis_1.ev to sidis_5.ev), but the job 
gets canceled while processing the 6th file.
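While we track down why memory keeps growing across files, one possible stopgap is asking PBS for more virtual memory up front. A sketch, assuming the farm runs PBS/Torque and permits larger requests (the 32gb figure is purely illustrative, not a known farm limit):

```shell
# In the job script's preamble (hypothetical value):
#PBS -l vmem=32gb
```

This only buys headroom, of course; if vmem grows with every file processed, a larger request just moves the point where the 6th-file failure becomes an Nth-file failure.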

I think we should figure out how to solve this ASAP while they are 
allowing us to hold a node for development. If I can only do five files 
at a time, this limits the robustness of Clara in my opinion. Let me know 
if you need any more information.

Justin
