[Halld-offline] updated branch again: sim-recon-dc-2

Tue Feb 25 09:42:24 EST 2014

Justin, et al.,

Overnight Sunday to Monday, using the new tag, I saw similar results:
1,000 jobs, 50 k events each, no crashes. This time I did increase the
memory request from 3 GB to 6 GB and the time request from 15 hours to
24 hours. Still no EM background. But about 60 of the jobs had short REST
files (there are some ambiguous cases very close to the nominal file
length). The short REST file thing seems to be the next frontier.
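
A rough sketch of how short REST files might be flagged automatically, assuming one REST file per job and taking "short" to mean noticeably below the median size across jobs. The function name, the input dictionary, and the 0.9 threshold are illustrative assumptions, not part of any existing data challenge script.

```python
from statistics import median


def flag_short_files(sizes, threshold=0.9):
    """Return the names of files smaller than threshold * median size.

    sizes: dict mapping file name -> size in bytes (hypothetical input,
           e.g. collected with os.path.getsize over the job outputs).
    threshold: fraction of the median below which a file counts as short;
               0.9 is an arbitrary illustrative choice.
    """
    if not sizes:
        return []
    nominal = median(sizes.values())
    return sorted(name for name, size in sizes.items()
                  if size < threshold * nominal)
```

Files very close to the nominal length will move in and out of the flagged list as the threshold changes, so those cases would still need to be checked by hand.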

I am not counting the DTrackFitterKalmanSIMD.cc message as a crash 
anymore. Simon tells me that is more like a warning.

One question I have is whether there is some standard way to gracefully
punt on an event. We should keep careful accounting of that in each job,
of course. On the other hand, there is this:
http://grantland.com/features/grantland-channel-coach-never-punts/
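
To make the accounting idea concrete, here is a generic sketch of skipping a failing event while recording it. This is not JANA code and says nothing about what JANA provides; `process_with_punts` and the `reconstruct` callable are hypothetical names used only for illustration.

```python
def process_with_punts(events, reconstruct):
    """Call reconstruct on each event, skipping events that raise.

    Returns a summary with the number of events processed, the number
    punted, and a list of (event index, error text) for the punted ones.
    """
    punted = []
    n_ok = 0
    for index, event in enumerate(events):
        try:
            reconstruct(event)
            n_ok += 1
        except Exception as err:
            # Record which event was skipped and why, so the job can
            # report it at the end.
            punted.append((index, repr(err)))
    return {"processed": n_ok, "punted": len(punted), "punted_events": punted}
```

Writing the summary out at the end of each job would give a per-job count of punted events.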

On the length of a job, I would prefer to go to 24-hour jobs (assuming,
of course, that there is a high probability of them finishing). Fewer
output files that way. My understanding is that on the grid this is too
long when we are in opportunistic mode: the longer you run in that case,
the higher your chance of being pre-empted. Perhaps we should have
different run times at the different sites.
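
The trade-off can be illustrated with a toy model. It assumes pre-emption happens at a constant rate per hour, so the chance of a job finishing falls off exponentially with its length. Both that model and the rate of 0.02 per hour are assumptions made up for illustration, not measured numbers from any site.

```python
import math


def survival_probability(job_hours, preempt_rate_per_hour):
    """Chance a job runs to completion, assuming a constant pre-emption rate."""
    return math.exp(-preempt_rate_per_hour * job_hours)


# Compare a few job lengths at a hypothetical rate of 0.02 pre-emptions/hour.
for hours in (6, 15, 24):
    p = survival_probability(hours, 0.02)
    print(f"{hours:2d} h job: {p:.2f} chance of finishing")
```

In this model a site with a near-zero pre-emption rate could run 24-hour jobs with little loss, while an opportunistic site would do better with shorter ones.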

   -- Mark

On 02/25/2014 08:59 AM, Justin Stevens wrote:
> Hi Data Challengers,
>
> FYI, I used these tagged versions for a test on the MIT OpenStack cluster: 100 jobs completed, with 25K events each, using the "standard" EM background that Kei and Paul have also used (10^7 g/s and +/- 200 ns gate).  With this configuration I don't see any crashes, but 2 jobs finished with a small REST output file that contains only a fraction of the 25K events.  I think Paul also found this earlier: JANA reports that it processed the full number of events, but the REST file is too small.  As a reminder, these jobs are deployed as 8 parallel processes running on an 8-core blade and sharing 14GB of memory, so they're not as sensitive to the memory spikes as some of the other tests we've seen.
>
> One other thought on job length.  Since the EM background results in much more time spent in hdgeant, should we lower the number of events per job to keep the job length from getting too large?  These 25K event jobs take 6 hours here, but this will only go up if we extend the gate or up the rate for the EM background.
>
> -Justin
>
> On Feb 23, 2014, at 6:01 PM, Mark Ito wrote:
>
>> correction, see below
>> On 02/22/2014 09:18 PM, Mark Ito wrote:
>>> Folks,
>>>
>>> I changed the data challenge 2 sim-recon and hdds branches to get the
>>> latest versions of everything, this time including src/libraries/TOF.
>>> This will bring in Paul's recent changes to ameliorate the memory issues
>>> we have been fighting with. The new hdds is necessary for the new TOF
>>> geometry.
>>>
>>> New tagged versions reflect the branch a week ago:
>>>
>>> tags/hdds-dc-2.0
>>> tags/hdds-dc-2.1
>> that last line should have been: tags/sim-recon-dc-2.1
>>> As before, the branch has not been tested yet, but the word should go
>>> out whenever it changes. Let me know if you see weirdness.
>>>
>>>     -- Mark
>>>
>> _______________________________________________
>> Halld-offline mailing list
>> Halld-offline at jlab.org
>> https://mailman.jlab.org/mailman/listinfo/halld-offline

-- 
Mark M. Ito, Jefferson Lab, marki at jlab.org, (757)269-5295