[Lowq] Cooking Issue

Lamiaa El Fassi elfassi at jlab.org
Mon Mar 22 14:34:09 EDT 2010


Hi All,

I resubmitted some of the most recently crashed jobs, but unfortunately they
crashed again.
I don't know what I should do to work around the clasdb issues.
Are there any suggestions?

Best regards,

Lamiaa

************************************************************
*  Lamiaa El Fassi             email: elfassi at jlab.org
*  Research Associate @ Rutgers University
*  Phone: (757) 269-7011 // Fax: (757) 269-5703
*  Jefferson Lab., 12000 Jefferson Ave.
*  Suite# 4, MS 12H3
*  Newport News, VA. 23606
************************************************************


On Sun, Mar 21, 2010 at 7:47 AM, Stepan Stepanyan <stepanya at jlab.org> wrote:

> Hi all,
>
> During Saturday's session of the collaboration meeting, Dennis
> reported that the machine that holds clasdb and the wiki started having
> problems in the last couple of days. They are looking into the problem.
>
> Stepan
>
>
> Hovanes Egiyan wrote:
>
>> Hi Lamiaa,
>>
>> I noticed that the "clasdb" MySQL database server was down for a while.
>> I think the computer center is doing some maintenance work related to clasdb.
>> But I am not absolutely sure that it is the reason for those job crashes.
>>
>> Hovanes.
>>
>>
>> Lamiaa El Fassi wrote:
>>
>>
>>> Hi,
>>>
>>> Here are some cooking statistics:
>>>
>>> Completed jobs: 688
>>> Good jobs     : 249
>>> Crashed jobs  : 439
>>>
>>> In the last hour, things have become worse. Below is an example of the
>>> time stamp summary for file clas_41470.A40, taken from the auger web
>>> page. The log file of this job shows the same segmentation fault that
>>> I mentioned in my last email.
>>>
>>>
>>>      Time stamps
>>>
>>> Submitted: Mar 18, 2010 10:30:17 PM
>>> Cleared Dependencies: Mar 20, 2010 2:15:29 PM
>>> Started Copying Input Files: Mar 20, 2010 2:22:08 PM
>>> Started Executing: Mar 20, 2010 2:25:12 PM
>>> Started Copying Output Files: Mar 20, 2010 2:25:19 PM
>>> Completed: Mar 20, 2010 2:25:19 PM
>>>
>>> Best regards,
>>>
>>> Lamiaa
>>>
>>>
>>>
>>>
>>> On Sat, Mar 20, 2010 at 12:32 PM, Lamiaa El Fassi <elfassi at jlab.org> wrote:
>>>
>>>    Hi,
>>>
>>>    Upon request I am reprocessing the elastic runs for the RTPC
>>>    calibration.
>>>    I have noticed that almost half of the jobs done so far crashed
>>>    during the cooking. All these crashed jobs show
>>>    "status:success & exit code: 0" on the auger web page, but when I
>>>    check their output log files I find
>>>    "Segmentation fault & size of raw data equal 0".
>>>    Could this failure to get the raw data be caused by insufficient
>>>    disk space on the farm machine, or by something else?
>>>    In the submission script of each job I am requesting 8 GB of disk
>>>    space, which covers the size of the input and output files of each
>>>    processed job.
>>>    Is there any problem with the farm machine that may be causing this?
>>>
>>>    Best regards,
>>>
>>>    Lamiaa
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Lowq mailing list
>>> Lowq at jlab.org
>>> https://mailman.jlab.org/mailman/listinfo/lowq
>>>
>>>
>>
>>
>>
>
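
The quoted thread notes that crashed jobs still show "status:success & exit
code: 0" on the auger web page, so the crash is only visible in the output
log files. A minimal sketch of how such logs could be scanned and counted is
given below. It is not the script used for this cooking: the function names,
the "*.log" file pattern, and the log directory are hypothetical, and the
only marker string used is the "Segmentation fault" text quoted in the thread.

```python
from pathlib import Path

# Marker text quoted in the thread; its exact form in real logs is an assumption.
CRASH_MARKERS = ("Segmentation fault",)


def is_crashed(log_text: str) -> bool:
    """Return True if the log text contains any of the crash markers."""
    return any(marker in log_text for marker in CRASH_MARKERS)


def summarize(log_dir: str, pattern: str = "*.log") -> dict:
    """Classify every log file in log_dir as good or crashed.

    The reported job status is ignored on purpose, since a crashed job
    can still be reported as successful with exit code 0.
    """
    good, crashed = [], []
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has stray binary bytes.
        text = path.read_text(errors="replace")
        (crashed if is_crashed(text) else good).append(path.name)
    return {
        "completed": len(good) + len(crashed),
        "good": good,
        "crashed": crashed,
    }
```

Calling summarize("/path/to/logs") (a placeholder path) returns the list of
crashed log names, which could then be used to decide which jobs to resubmit
once the clasdb server is back.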