Server 'freezing'

Rob Brandt yellowdog-general@lists.terrasoftsolutions.com
Wed Dec 18 15:01:02 2002


Actually, it's a good point.  The UPS *is* of questionable
quality.  Off brand, several years old.  I have no high
power-drawing equipment being powered up here, but that doesn't
mean that someone else in my grid doesn't.

FWIW, judging by my server monitoring, the outages seem to start
somewhere in the early morning before office hours, like 3-5 am,
although I haven't tracked that closely.
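
One way to track the timing more closely, as a rough sketch rather
than a tested tool: have cron append a timestamp to a log every few
minutes, then look for holes in that log after the next freeze. The
Python below assumes one ISO 8601 timestamp per line; the log format,
the 5 minute interval, and the names parse_heartbeats and find_gaps
are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def parse_heartbeats(lines):
    """Turn log lines like '2002-12-16T03:10:00' into datetime objects."""
    return [datetime.fromisoformat(line.strip()) for line in lines if line.strip()]


def find_gaps(stamps, expected=timedelta(minutes=5), slack=timedelta(minutes=1)):
    """Return (last_seen, next_seen) pairs where heartbeats are missing.

    `expected` is the assumed cron interval; `slack` tolerates a late run.
    """
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > expected + slack:
            gaps.append((prev, cur))
    return gaps


if __name__ == "__main__":
    # Made-up sample data, not taken from the real server.
    sample = [
        "2002-12-16T03:00:00",
        "2002-12-16T03:05:00",
        "2002-12-16T03:10:00",
        "2002-12-16T08:30:00",
    ]
    for last_seen, next_seen in find_gaps(parse_heartbeats(sample)):
        print("no heartbeat between", last_seen, "and", next_seen)
```

The first timestamp of each reported pair is the last moment the
machine was known to be alive, which narrows down when a freeze
began.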

What is the Linux equivalent of scandisk?  Maybe there's some
monitoring I could do to verify the quality of my hard disk?  If
it's fsck, I do that all the time :'(
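
On the scandisk question: fsck checks filesystem consistency, while
a surface scan checks that every block on the disk can actually be
read; badblocks (shipped with e2fsprogs) is the usual tool for the
latter. Purely to illustrate the idea, here is a minimal read-only
sketch in Python. The function name scan_readable and the 1 MB chunk
size are hypothetical choices, it would need root to read a device
such as /dev/hdb2, and it is no substitute for the real tool.

```python
import os


def scan_readable(path, chunk_size=1024 * 1024):
    """Read `path` front to back without writing anything.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the start
    of every chunk whose read raised an I/O error.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # total size of file or device
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
                bytes_read += len(data)
            except OSError:
                # Record the failing chunk and carry on with the next one.
                bad_offsets.append(offset)
            offset += want
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

An empty bad_offsets list only means every chunk could be read once;
it says nothing about the filesystem structures that fsck repairs.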

Rob


> Is the machine running off a UPS? This might be a long shot, but
> if you had a brownout or very brief power cut you might find
> that you get some memory corruption. I have no clue about how
> plausible this is, maybe someone else will know if it's likely,
> but if you don't have a UPS then maybe that's the problem.
>
> The thing that makes me think of this is you saying it seems to
> happen most early on Monday. If there's a large piece of
> equipment that gets powered up Monday morning then that could
> be what's giving you spikes that are doing the damage. Spikes
> are easily solved with a surge protector, but very short cuts
> in power, even just a few milliseconds I believe, can cause
> problems.
>
> Of course if you're running on a UPS then this ain't it (unless
> it's broken ;-) ).
>
> This is a bit of a dead chicken waving suggestion, but since
> you're out of options, I guess anything is worth a try.
>
> Good luck
>
> Pete
>
> On Wednesday, December 18, 2002, at 11:56 AM, Rob Brandt wrote:
>
>> The server was frozen again Monday morning, so the suggested
>> fix of replacing the motherboard battery didn't do the job.
>>
>> I hope someone has a clue about this, or can suggest a
>> strategy to diagnose the problem.  Here's the latest
>> information I can offer; don't know if it's relevant or not:
>>
>> Half of the time that it happens it's been on a Monday morning
>> before work.  The rest of the time except once it's been on
>> another day before work.  Once it was in the evening after
>> work.  It's never happened while I've been here.  The server
>> doesn't get a lot of traffic, but when it does it's usually in
>> bunches.  I suppose inactivity may be a contributing factor.  I
>> frequently check my mail during the day.  But come to think of
>> it, I have a utility checking a special pop account for new
>> messages every 15 minutes all the time.
>>
>> Right now I have a problem with the server that may or may not
>> be related.  While sorting my mail this morning, I noticed it
>> being very slow.  Gnome was running, and I have several "load"
>> panels running in the tool bar - RAM, CPU and Net.  I noticed
>> that CPU was running at 100% and not varying.  I tried to start
>> gtop to see what was sucking cycles, but it was unresponsive.
>> I had a console window and browser window open, and I closed
>> those, and noticed that the icons on the desktop didn't redraw.
>> I attempted to log out, and that was unresponsive too.  On my
>> desktop Mac, I browsed to Webmin on the server and viewed the
>> Running applications to see what was sucking up the cycles; it
>> was Courier-Imap.  I killed that, restarted Courier-Imap, and
>> the CPU load panel on the server went down to normal levels.
>> But Gnome itself is still unresponsive.  I can't start
>> applications or log out, and the icons on the desktop still
>> haven't redrawn.  The toolbar is responsive and the load panels
>> inside of it are active.
>>
>> Like I said, I don't know if this is related to the server
>> freeze or not.  When the server freezes, I have no network
>> services but I do now.  When the server freezes, the CRT won't
>> wake up, but does now.
>>
>> (dramatic pause)
>>
>> OK, new information.  As I was trying to send the above, it
>> was apparent that some of my mail services weren't working
>> because it wouldn't send.  Other network services such as
>> apache were OK.  So I decided to reboot the server; when it
>> rebooted it said that the file systems weren't unmounted
>> cleanly and forced a file system check.  There were unexpected
>> inconsistencies, so I had to run fsck.  There were several
>> inode problems; after they were fixed it rebooted again and
>> started OK.  I'm back up and running.
>>
>> But it appears that some questions have been answered: namely
>> that the "unexpected inconsistencies" were not the result of
>> the power off/on rebooting I had to resort to when the system
>> freezes, since it happened now after I did a normal reboot.
>> Quite possibly the unexpected inconsistencies are the cause of
>> the freezing.
>>
>> Any ideas on further diagnosis?
>>
>> Thanks
>>
>>
>>
>>> I have been having a problem for the last month or so and
>>> don't know what to do about it.  It's happened 4 times in
>>> November, and last night as well.
>>>
>>> The server "freezes".  It is completely unresponsive to the
>>> keyboard, http, mail services, telnet, ssh, and ftp.  I can
>>> successfully ping it.
>>>
>>> When this has happened, I end up having to shut it down at
>>> the power button and reboot.  When rebooting it goes through
>>> the filesystem check and often encounters an Unexpected
>>> Inconsistency, requiring me to run fsck.  After going through
>>> that, it fixes several things (sometimes a lot, sometimes a
>>> little) and then loads properly and all is well.  For a while.
>>> Then it does it again in 2 to 14 days.  I don't know whether
>>> the unexpected inconsistency is the cause or the result of the
>>> "freeze".
>>>
>>> Are there any diagnostics that I can perform to discover the
>>> problem?  If there are file system errors causing this, are
>>> there utilities that can be run that will prevent it or
>>> minimize the risk?
>>>
>>> Any advice appreciated.  Running:
>>>
>>> Beige G3 Tower;
>>> YDL 2.2
>>> 768 MB Ram
>>> YDL boot disk : original 4gig IDE (hda8)
>>> Data disk for /home and /var: New (2 mos) 60 gig IBM (hdb2 & 3)
>>>
>>> All of the above partitions have exhibited the Unexpected
>>> Inconsistency.
>>>
>>> Rob
>>> _______________________________________________
>>> yellowdog-general mailing list
>>> yellowdog-general@lists.terrasoftsolutions.com
>>> http://lists.terrasoftsolutions.com/mailman/listinfo/yellowdog-general