[Lowq] Cooking Issue
Lamiaa El Fassi
elfassi at jlab.org
Mon Mar 22 14:34:09 EDT 2010
Hi All,
I tried to resubmit some of the most recently crashed jobs, but unfortunately they
crashed again.
I don't know what I should do to work around the clasdb issues.
Does anyone have any suggestions?
Best regards,
Lamiaa
************************************************************
* Lamiaa El Fassi email: elfassi at jlab.org
* Research Associate @ Rutgers University
* Phone: (757) 269-7011 // Fax: (757) 269-5703
* Jefferson Lab., 12000 Jefferson Ave.
* Suite# 4, MS 12H3
* Newport News, VA. 23606
************************************************************
On Sun, Mar 21, 2010 at 7:47 AM, Stepan Stepanyan <stepanya at jlab.org> wrote:
> Hi all,
>
> During Saturday's session of the collaboration meeting, Dennis
> reported that the machine that hosts clasdb and the wiki started having
> problems over the last couple of days. They are looking into the problem.
>
> Stepan
>
>
> Hovanes Egiyan wrote:
>
>> Hi Lamiaa,
>>
>> I noticed that the "clasdb" MySQL database server was down for a while.
>> I think the computer center is doing some maintenance work related to clasdb.
>> But I am not absolutely sure that it is the cause of those job crashes.
>>
>> Hovanes.
>>
>>
>> Lamiaa El Fassi wrote:
>>
>>
>>> Hi,
>>> Here are some cooking statistics:
>>> Completed jobs: 688
>>> Good jobs: 249
>>> Crashed jobs: 439
>>> In the last hour, things have been getting worse. Below is an example of the
>>> time stamp summary for file clas_41470.A40, taken from the auger web
>>> page. The log file of this job shows the same segmentation fault
>>> that I mentioned in my last email.
>>>
>>>
>>> Time stamps
>>>
>>> Submitted: Mar 18, 2010 10:30:17 PM
>>> Cleared Dependencies: Mar 20, 2010 2:15:29 PM
>>> Started Copying Input Files: Mar 20, 2010 2:22:08 PM
>>> Started Executing: Mar 20, 2010 2:25:12 PM
>>> Started Copying Output Files: Mar 20, 2010 2:25:19 PM
>>> Completed: Mar 20, 2010 2:25:19 PM
>>>
>>> Best regards,
>>>
>>> Lamiaa
>>>
>>>
>>>
>>>
>>> On Sat, Mar 20, 2010 at 12:32 PM, Lamiaa El Fassi <elfassi at jlab.org> wrote:
>>>
>>> Hi,
>>> Upon request, I am reprocessing the elastic runs for the RTPC
>>> calibration.
>>> I have noticed that almost half of the jobs finished so far crashed
>>> during the cooking. All of these crashed jobs show
>>> "status:success & exit code: 0" on the auger web page, but when I check
>>> their output log files I find
>>> "Segmentation fault & size of raw data equal 0".
>>> Could this failure to get the raw data be caused by insufficient space
>>> on the farm machine, or by something else? In the submission script of
>>> each job I am requesting 8 GB of disk space, which covers the size
>>> of the input and output files of each processed job.
>>> Is there any problem on the farm machine which may be causing this?
>>> Best regards,
>>> Lamiaa
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Lowq mailing list
>>> Lowq at jlab.org
>>> https://mailman.jlab.org/mailman/listinfo/lowq
>>>
>>>
>>
>>
>>
>