Discussion:
System crashing: I need help
root
2022-01-07 15:38:19 UTC
I have a server that runs Slack64 14.2 and has done so since
before 14.2. A few weeks ago the system started crashing.
For most of the crashes the kernel was still running
and would respond to pings, and there was a display
but the server would not accept keyboard or mouse input.

The system would run for a few days and crash again.

I swapped out the power supply with a brand new 750w unit.
The crashes continued.

I swapped out the motherboard/cpu/memory with one
from a working machine. The crashes continued.

I updated the 10 year old bios on the motherboard.
I tried different kernels.
I updated everything with slackpkg.
I updated Chrome to the latest version. Chrome runs all the time.

Only the computer case and NVidia graphics card remain
from the original system, and still the crashes persist.


When I got up this morning, the system had crashed
during the night. After rebooting I looked at the syslog
and I found a stream of:
rcu_sched self-detected stall on CPU
errors which continued until I rebooted the system

This seems to be related to a kernel overload as if
there were too many tasks for the system to keep up.
The cpu is an Intel Core i7 3.4GHz with 16GB of memory.

Among other Call Traces in the syslog I see something
that must have originated within Chrome, and another
crash from kswapd, when I have no swap partition.

I am pretty much out of ideas and would appreciate
any suggestions.

Thanks.
Henrik Carlqvist
2022-01-07 16:50:32 UTC
Post by root
Only the computer case and NVidia graphics card remain from the original
system, and still the crashes persist.
From those two I would first try to replace the nVidia card. :-)
Post by root
When I got up this morning, the system had crashed during the night.
rcu_sched self-detected stall on CPU
errors which continued until I rebooted the system
Could you in the syslog see anything special just before those messages
started flooding the syslog?
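One way to answer that question without scrolling through a flooded log is
a small script that keeps a rolling window of lines and prints the ones
just before the first stall message. This is only a sketch: the function
name and the window size of 20 lines are my own choices, and the search
string is the message quoted above.

```python
from collections import deque


def context_before_first_match(lines, needle, n=20):
    """Return up to n lines preceding the first line that contains needle,
    followed by that line itself. Returns [] if needle never appears."""
    recent = deque(maxlen=n)  # rolling window of the most recent lines
    for line in lines:
        if needle in line:
            return list(recent) + [line]
        recent.append(line)
    return []
```

To use it on the real log, pass an open file object, for example
context_before_first_match(open("/var/log/syslog", errors="replace"),
"rcu_sched self-detected stall"). The path is an assumption about where
your syslog lives.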
Post by root
This seems to be related to a kernel overload as if there were too many
tasks for the system to keep up.
Do you have any other machines in the network? If so, you might be able
to use one of those to monitor your problematic machine.

Mostly, when a machine crashes it will lose the interesting last part of
system logs. With syslog configured to send logs to a log server those
interesting log messages can sometimes be saved.

Having snmpd running on the problematic machine and something like mrtg
monitoring the machine might give useful graphs to inspect, even though mrtg
only samples the machine every 5 minutes. On those graphs you can see if
any partition fills up, machine load, CPU usage, memory usage. If you
also provide data from lmsensors to snmpd you can make graphs of
temperatures and fan speeds. I wrote the project
https://sourceforge.net/projects/nvgpu-smi-snmp/ to also be able to
monitor nVidia GPU usage with snmp/mrtg.
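If setting up snmpd and mrtg is more than you want to do right away, a
rough stand-in for the memory graph is to sample /proc/meminfo yourself.
The sketch below only parses the text; the function names are mine, and it
assumes the kernel exposes a MemAvailable field (kernels from 3.14 on do).

```python
def parse_meminfo(text):
    """Parse the text of /proc/meminfo into a dict of field name -> integer.

    Most fields are reported in kB; a few (such as HugePages_Total) are
    plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(info):
    """Fraction of total RAM the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]
```

Calling parse_meminfo(open("/proc/meminfo").read()) once a minute from
cron and appending the result to a file with a timestamp would show
whether available memory shrinks steadily before each crash.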

Another thing to try when the system responds to ping but keyboard and
mouse have hung is to see if you can ssh into the system. Once logged in
you might get clues about what is going on by studying the output from
commands like dmesg and top.
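When dmesg is flooded, it can help to count what kinds of messages are in
it before reading them one by one. A minimal sketch follows. The labels
are my own, and apart from the rcu_sched message quoted above, the
patterns are the kernel's usual wording for OOM-killer and hung-task
reports, which may differ between kernel versions.

```python
import re
from collections import Counter

# Label -> pattern. Only the rcu_sched text comes from this thread; the
# others are assumed typical kernel messages.
PATTERNS = {
    "rcu_stall": re.compile(r"rcu_sched self-detected stall"),
    "oom_killer": re.compile(r"invoked oom-killer|Out of memory"),
    "hung_task": re.compile(r"blocked for more than \d+ seconds"),
}


def classify(lines):
    """Count how many lines match each pattern in PATTERNS."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

Over ssh you could feed it the saved output of dmesg, for example
classify(open("dmesg.txt")), where dmesg.txt is a hypothetical file
created with "dmesg > dmesg.txt". Any OOM-killer hits at all would point
toward memory exhaustion.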

regards Henrik
Rich
2022-01-07 18:41:56 UTC
Post by root
Thanks for responding. I was able to ssh in and verify that the
kernel was still running at the last "crash". As I said, syslog
revealed that the kernel was overloaded. I have been focussed
entirely on hardware, now I think I have to look into software. I'm
not entirely sure, but I think the trouble started after a change was
made to my data gathering/cleaning software. Some routines to
process the online data were written by me in C many years ago.
About the time the problems started, I replaced what I had written with software
written by an accomplished javascript programmer. That opened me up
to a vulnerability of the javascript as well as node.js.
Hmm, this would give more weight to an "out of memory -- with no swap"
situation being what you are seeing. A JS variant, running in Node, of
your C "data gathering" routines would likely be much more memory
hungry, and that combined with the memory hog that is chrome, might
have pushed the system past its present memory amount.
Rich
2022-01-07 18:32:37 UTC
Post by root
The system would run for a few days and crash again.
I swapped out the power supply with a brand new 750w unit.
The crashes continued.
I swapped out the motherboard/cpu/memory with one
from a working machine. The crashes continued.
I updated the 10 year old bios on the motherboard.
I tried different kernels.
I updated everything with slackpkg.
I updated Chrome to the latest version. Chrome runs all the time.
Only the computer case and NVidia graphics card remain
from the original system, and still the crashes persist.
You've given us very little info with which to help. But...
Are you running the NVidia closed-source driver or the open source
driver?
If closed-source driver, then try the open source driver.
Post by root
Among other Call Traces in the syslog I see something
that must have originated within Chrome, and another
crash from kswapd, when I have no swap partition.
Hmm... How much RAM?
Chrome is a known memory hog, and if you have no swap, then anytime
chrome tries to grow beyond the free memory left in the system's ram
after everything else that is loaded, things will go bad very fast.
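To see who is actually holding the memory, something along these lines
could be run while the machine is still up. It is a sketch, not a
polished tool: the function names are mine, it reads VmRSS from
/proc/<pid>/status, and because RSS counts shared pages once per process,
the per-name totals overstate programs such as chrome that run many
processes.

```python
import os


def parse_vmrss_kb(status_text):
    """Return (name, rss_kb) from the text of /proc/<pid>/status.

    rss_kb is 0 when there is no VmRSS line (kernel threads have none).
    """
    name, rss = "?", 0
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss = int(line.split()[1])
    return name, rss


def top_rss(n=10, proc="/proc"):
    """Return the n largest processes as (rss_kb, pid, name) tuples."""
    rows = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc, entry, "status")) as fh:
                name, rss = parse_vmrss_kb(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        rows.append((rss, int(entry), name))
    return sorted(rows, reverse=True)[:n]


def total_by_name(rows):
    """Sum rss_kb per process name, largest first."""
    totals = {}
    for rss, _pid, name in rows:
        totals[name] = totals.get(name, 0) + rss
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Running total_by_name(top_rss(50)) every so often and comparing the
chrome and node totals over a day or two would show which one, if either,
keeps growing.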
Post by root
Thanks for responding.
I am running the NVidia driver. Nouveau, the alternative, does not
support driving two different displays.
I am presently typing this on a system running Nouveau with an Nvidia
card driving two displays, so Nouveau does support multiple displays.
Post by root
The system has 16GB of ram. Up until a few weeks ago the system
ran 24/7 and only was taken down to change SATA drives. Uptime
of several months was the norm.
But, a Chrome process consuming 12+GB of ram is not unheard of. You
could, possibly, still have an "out of memory, with no swap" situation.
Post by root
I am beginning to think that this isn't a hardware problem. The
server functions at two levels: to gather, clean-up, and source
online data, and as an A/V server.
You state you've replaced everything except the Nvidia card. So if it
is hardware, the only common hardware is the Nvidia card itself.
Which, of course, because it is in common, /could/ be the culprit
(i.e., you have not ruled it out, neither have you confirmed it to be
the culprit).

One way this /could/ be a hardware problem, and only just now manifest
itself, is:

1) filter capacitors on the on-board voltage generators for the Nvidia
card have been slowly degrading, and have now reached the point
where their filtering is allowing just a bit too much ripple through
- which would cause a "runs fine for years, then starts failing"
situation.

2) cooling for the nvidia card has gotten poor (i.e., dust clogging
card) - which would also cause a "runs for years, then starts
failing" situation.

Both are guesses. And it could be an out of memory with no swap issue
(which would not be the Nvidia card). Which is also a guess.

We depend upon you to test, and tell us which guess is ruled out.
root
2022-01-07 20:03:05 UTC
Post by Rich
But, a Chrome process consuming 12+GB of ram is not unheard of. You
could, possibly, still have an "out of memory, with no swap" situation.
You state you've replaced everything except the Nvidia card. So if it
is hardware, the only common hardware is the Nvidia card itself.
Which, of course, because it is in common, /could/ be the culprit
(i.e., you have not ruled it out, neither have you confirmed it to be
the culprit).
One way this /could/ be a hardware problem, and only just now manifest
itself, is:
1) filter capacitors on the on-board voltage generators for the Nvidia
card have been slowly degrading, and have now reached the point
where their filtering is allowing just a bit too much ripple through
- which would cause a "runs fine for years, then starts failing"
situation.
2) cooling for the nvidia card has gotten poor (i.e., dust clogging
card) - which would also cause a "runs for years, then starts
failing" situation.
The video card is only 2 years old. It is a fanless card with a
large finned heat sink.
Post by Rich
Both are guesses. And it could be an out of memory with no swap issue
(which would not be the Nvidia card). Which is also a guess.
We depend upon you to test, and tell us which guess is ruled out.
Thanks for responding. I am currently working on the hypothesis that
it isn't a hardware problem. I am looking into the software side.
Rich
2022-01-08 00:20:30 UTC
Post by root
Post by Rich
But, a Chrome process consuming 12+GB of ram is not unheard of. You
could, possibly, still have an "out of memory, with no swap" situation.
You state you've replaced everything except the Nvidia card. So if it
is hardware, the only common hardware is the Nvidia card itself.
Which, of course, because it is in common, /could/ be the culprit
(i.e., you have not ruled it out, neither have you confirmed it to be
the culprit).
One way this /could/ be a hardware problem, and only just now manifest
itself, is:
1) filter capacitors on the on-board voltage generators for the Nvidia
card have been slowly degrading, and have now reached the point
where their filtering is allowing just a bit too much ripple through
- which would cause a "runs fine for years, then starts failing"
situation.
2) cooling for the nvidia card has gotten poor (i.e., dust clogging
card) - which would also cause a "runs for years, then starts
failing" situation.
The video card is only 2 years old. It is a fanless card with a
large finned heat sink.
In that case, does it get enough airflow inside the case for its
cooling needs? A clogged PSU fan, reducing airflow through the box in
total, would also reduce airflow for the fanless card.
Post by root
Post by Rich
Both are guesses. And it could be an out of memory with no swap issue
(which would not be the Nvidia card). Which is also a guess.
We depend upon you to test, and tell us which guess is ruled out.
Thanks for responding. I am currently working on the hypothesis that
it isn't a hardware problem. I am looking into the software side.
Your other post about switching to a node.js application from your own
C code implies that this switch might be the cause.
root
2022-01-08 04:20:05 UTC
Post by Rich
Post by root
Post by Rich
But, a Chrome process consuming 12+GB of ram is not unheard of. You
could, possibly, still have an "out of memory, with no swap" situation.
You state you've replaced everything except the Nvidia card. So if it
is hardware, the only common hardware is the Nvidia card itself.
Which, of course, because it is in common, /could/ be the culprit
(i.e., you have not ruled it out, neither have you confirmed it to be
the culprit).
One way this /could/ be a hardware problem, and only just now manifest
itself, is:
1) filter capacitors on the on-board voltage generators for the Nvidia
card have been slowly degrading, and have now reached the point
where their filtering is allowing just a bit too much ripple through
- which would cause a "runs fine for years, then starts failing"
situation.
2) cooling for the nvidia card has gotten poor (i.e., dust clogging
card) - which would also cause a "runs for years, then starts
failing" situation.
The video card is only 2 years old. It is a fanless card with a
large finned heat sink.
In that case, does it get enough airflow inside the case for its
cooling needs? A clogged PSU fan, reducing airflow through the box in
total, would also reduce airflow for the fanless card.
The box has four 4-inch fans and one 3-inch fan, along
with the cpu fan. During the day the cpu temps stay about
37 degrees.
Post by Rich
Post by root
Post by Rich
Both are guesses. And it could be an out of memory with no swap issue
(which would not be the Nvidia card). Which is also a guess.
We depend upon you to test, and tell us which guess is ruled out.
Thanks for responding. I am currently working on the hypothesis that
it isn't a hardware problem. I am looking into the software side.
Your other post about switching to a node.js application from your own
C code implies that this switch might be the cause.
I'm looking into that possibility even as we speak. I have disabled
the .js code and now have to wait for the system to crash. It
may take several days.

Thanks for following.
