<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Hi all,<br>
<br>
Multiple people were wondering about the status of pass2, which
was submitted just before the Christmas break.<br>
<br>
During the break a significant number of jobs failed.<br>
The main issue is the so-called "No space left on disk" exception.<br>
<br>
I filed a CCPR and was in contact with the computer center before and
after the break.<br>
<br>
Briefly, here is what happened:<br>
The response was that the auger scheduler is assigning more jobs to a
node than its local disk can hold. After about 7-8 hours of running,
the local disk of the node becomes completely full, which crashes
all the jobs on that node.<br>
<br>
The computer center advised that the problem is related to the
so-called "farm18" nodes, so before the break I sent jobs
to "farm16" nodes to keep them from running on any farm18 node.
However, during the Christmas break we saw a similar issue
with farm16 nodes too.<br>
In addition, a number of nodes for some reason went
offline (restarted) during the break, again causing a lot of jobs to
fail.<br>
Given the high failure rate, I didn't submit additional jobs
during the break, and held off until we understand the reason for the failures.<br>
<br>
After the break I asked the computer center about it, and they suggested
using slurm instead of auger. <br>
Now half of the nodes are running on slurm, and in the near future all
jobs will be submitted through slurm instead of auger.<br>
<br>
This past week I ran some tests on different nodes to check whether the
same problem still exists.<br>
<br>
It happened again on farm18 nodes, but on the other nodes it worked well
for a single run.<br>
<br>
I promised Maurik that I would present details about this at the coming
Wednesday meeting.<br>
<br>
Details of the CCPR are at the following link:<br>
<a class="moz-txt-link-freetext" href="https://misportal.jlab.org/mis/apps/ccpr/ccpr_user/ccpr_user_request_history.cfm?ccpr_number=249598">https://misportal.jlab.org/mis/apps/ccpr/ccpr_user/ccpr_user_request_history.cfm?ccpr_number=249598</a><br>
<br>
Rafo<br>
<br>
<br>
<div class="moz-cite-prefix">On 1/7/19 12:26 PM, Graf, Norman A.
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CO2PR07MB2693EEFF390594102D43DED6F8890@CO2PR07MB2693.namprd07.prod.outlook.com">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
<div id="divtagdefaultwrapper"
style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;"
dir="ltr">
<p style="margin-top:0;margin-bottom:0">Hello Rafo,</p>
<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">Can you please bring me
up to speed on where we are with the Pass2 recon?</p>
<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">I just got back and have
not yet gone through all of my email, so apologies if you've
posted a progress report already.</p>
<p style="margin-top:0;margin-bottom:0"><br>
</p>
<p style="margin-top:0;margin-bottom:0">Norman</p>
</div>
</blockquote>
<br>
</body>
</html>