[Clara] Overlimit Error Farm

Wed Jan 15 09:22:42 EST 2014

These are the services I am using:

# Uncomment the line below to set a new container name for all services
# container: Ruger_Reconstruction
services:
  - class: org.jlab.clas12.ec.services.ECReconstruction
    name: ECReconstruction
  - class: org.jlab.clas12.ctof.services.CTOFReconstruction
    name: CTOFReconstruction
  - class: trac.services.centraltracker.BSTTrackCandFinder
    name: BSTTrackCandFinder
  - class: trac.services.forwardtracking.DCTrackCandFinder
    name: DCTrackCandFinder


On Wednesday, January 15, 2014 9:19:07 AM, Vardan Gyurjyan wrote:
> This is most likely an error in the service that manages shared-memory file caching, and has nothing to do with the framework. It seems that the service is not removing files from shared memory after the chain is done processing a file.
> -vardan
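
One way to check this diagnosis would be to look at how much data sits in the shared-memory area between input files. The following is only a minimal sketch: the helper name `shm_usage` is made up for illustration, and the default directory /dev/shm is an assumption, since the thread does not say where the caching service keeps its files.

```bash
#!/bin/bash
# Hypothetical helper: report the file count and total size (KiB) of a directory.
# The default /dev/shm is an assumption, not something stated in this thread.
shm_usage() {
    local dir="${1:-/dev/shm}"
    local count size_kb
    # Count regular files under the directory
    count=$(find "$dir" -type f 2>/dev/null | wc -l)
    # Total disk usage of the directory in KiB
    size_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    echo "$(date '+%H:%M:%S') files=$((count)) size_kb=${size_kb:-0}"
}
```

Calling `shm_usage` after each input file finishes would show whether the file count and size keep growing from one file to the next, which is what the diagnosis above would predict.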
>
>
> ----- Original Message -----
> From: "Justin Ruger" <jruger at jlab.org>
> To: clara at jlab.org
> Sent: Wednesday, January 15, 2014 9:12:43 AM
> Subject: [Clara] Overlimit Error Farm
>
> One of the reasons our farm jobs for the Clara DPE keep getting
> canceled is this error:
>
> =>> PBS: job killed: vmem 29235798016 exceeded limit 28991029248
>
> This happens with or without the -l flag.
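
To put the two numbers in the PBS message into more readable units, here is a small arithmetic sketch. The variable names are chosen for illustration; the byte values are copied from the message above.

```bash
#!/bin/bash
# Values copied from the PBS message above (bytes).
vmem=29235798016
limit=28991029248

limit_gib=$(( limit / 1024 / 1024 / 1024 ))    # limit in whole GiB
over_mib=$(( (vmem - limit) / 1024 / 1024 ))   # overshoot in whole MiB, rounded down

echo "limit: ${limit_gib} GiB"      # prints "limit: 27 GiB"
echo "over by: ${over_mib} MiB"     # prints "over by: 233 MiB"
```

So the limit is exactly 27 GiB, and the job was about 233 MiB over it when it was killed. That would be consistent with virtual memory creeping upward until it crossed the limit, although the two numbers alone do not prove it.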
>
> How to reproduce the error:
>
> I have files of 20k events each, and I wanted to process 50 of them. I
> wrote a script that caches each file using jcache and then adds the
> file name to input.list so that it is all mandatory.
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
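> #!/bin/bash
> # $1 = number of files: handles sidis_1.ev .. sidis_$1.ev
> for ((i = 1; i <= $1; i += 1)); do
>     echo "/cache/mss/clas/clas12/clas12-testing/gemc/sidis_$i.ev"
>     if [ "$i" -eq 1 ]; then
>         # First file: create (or truncate) input.list
>         echo "sidis_$i.ev" > input.list
>     else
>         # Remaining files: append to input.list
>         echo "sidis_$i.ev" >> input.list
>     fi
>     # Submit a jcache request for the file under /mss
>     jcache submit default /mss/clas/clas12/clas12-testing/gemc/sidis_$i.ev
> done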
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Then I ran the farm orchestrator like this:
>
> java -cp "$CLARA_SERVICES/.:$CLARA_SERVICES/lib/*" \
>     std.orchestrators.FarmOrchestrator \
>     -i /cache/mss/clas/clas12/clas12-testing/gemc \
>     -o /w/hallb/clas12/jruger/fiftyNode/output \
>     -s /tmp \
>     /w/hallb/clas12/jruger/stress_test/services.yaml input.list
>
> It ran fine for the first 5 files (sidis_1.ev to sidis_5.ev), but the
> job gets canceled while processing the 6th file.
>
> I think this is something we should figure out how to solve ASAP while
> they are allowing us to hold a node for development. If all I can do is
> five files at a time, this limits the robustness of Clara in my opinion.
> Let me know if you need any more information.
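
Until the cause is found, the five-file limit could be worked around by splitting input.list into batches and running the orchestrator once per batch. This is only a sketch: the helper name `make_batches` and the `batch_` prefix are made up for illustration, and it does nothing about the underlying memory growth.

```bash
#!/bin/bash
# Hypothetical helper: split a file list into batches of at most N lines.
# Produces files named <prefix>aa, <prefix>ab, ... in the current directory.
make_batches() {
    local list="$1"
    local size="${2:-5}"
    local prefix="${3:-batch_}"
    split -l "$size" "$list" "$prefix"
}
```

For example, `make_batches input.list 5` on a 50-line input.list would write ten files, batch_aa through batch_aj, each of which could be passed to the orchestrator in place of input.list.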
>
> Justin
> _______________________________________________
> Clara mailing list
> Clara at jlab.org
> https://mailman.jlab.org/mailman/listinfo/clara