High system load on NFS snafu

S.K.R. de Jong

2022-06-06 15:04:46 UTC

I have a Slackware64 15.0 system on which I had several
directories mounted by NFS from a remote system. That remote system was
actually rebooted a few times - for maintenance purposes - but I was
stupid enough not to unmount those directories in my system. In fact, I
had at least one terminal emulator where I was in one of the NFS-mounted
directories. I foolishly tried to list the contents of that directory,
and the shell just froze up on me. I had to kill the terminal emulator.

The system load has shot up to at least 4.00 ever since, even
when, according to top, nothing much is going on in the system. I mean, I
have a few things running, but nothing to justify that load: all the
cores are at least 95% idle at any given time.

I was able to unmount those NFS directories - forcefully, on
occasion - and I was able to stop the RPC and NFSD daemons. However, the
high load issue did not disappear.

Anybody got any suggestions as to how to diagnose and solve this
problem, without rebooting? top is not helping, and I see nothing
relevant in dmesg, or any of the /var/log files. More precisely, there
are relevant entries, but they are all old and not being updated - but
the high load stubbornly remains.

Lew Pitcher

2022-06-06 16:05:44 UTC

[snip]

The system load has shot up to at least 4.00 ever since, even
when, according to top, nothing much is going on in the system. I mean,
I have a few things running, but nothing to justify that load: all the
cores are at least 95% idle at any given time.
I was able to unmount those NFS directories - forcefully, on
occasion - and I was able to stop the RPC and NFSD daemons. However, the
high load issue did not disappear.
Anybody got any suggestions as to how to diagnose and solve this
problem, without rebooting? top is not helping, and I see nothing
relevant in dmesg, or any of the /var/log files. More precisely, there
are relevant entries, but they are all old and not being updated - but
the high load stubbornly remains.

I have no insights into your high loadavg problem. However, I do note a
suspicious co-incidence:
a) you have high loadavg, and
b) your system logs "are not being updated".

Perhaps you should investigate /why/ the system logs are not being
updated; this phenomenon might be related to the cause of your high
loadavg.

BTW, the getloadavg(3) manpage refers to the proc(5) manpage, and
according to proc(5), the loadavg figures represent "the number of
jobs in the run queue (state R) or waiting for disk I/O (state D)
averaged over 1, 5, and 15 minutes".
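To make the proc(5) definition concrete, here is a minimal sketch that splits a line in /proc/loadavg format into its fields. The helper name show_loadavg is made up for illustration; the field layout (three averages, then runnable/total scheduling entities, then the most recent PID) is the one documented in proc(5).

```shell
# Print the load averages and the runnable/total task counts from one
# line in /proc/loadavg format, read on stdin.
# show_loadavg is a hypothetical helper name, not a standard tool.
show_loadavg() {
    awk '{
        split($4, t, "/")   # 4th field looks like "runnable/total"
        printf "load 1m=%s 5m=%s 15m=%s runnable=%s total=%s\n",
               $1, $2, $3, t[1], t[2]
    }'
}

# Usage on a Linux system:
#   show_loadavg < /proc/loadavg
```

A load average near 4 next to a runnable count of only 1 or 2 would suggest that the load comes from tasks in state D rather than from CPU demand.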

I note that you looked at top(1), presumably to see the runnable
processes; did you look to see the processes "waiting for disk I/O"?
If so, was there anything suspicious there? Perhaps a kernel module
or logging daemon that shouldn't have been waiting?

--
Lew Pitcher
"In Skills, We Trust"

S.K.R. de Jong

2022-06-06 20:06:49 UTC

Post by Lew Pitcher
I have no insights into your high loadavg problem. However, I do note a
suspicious co-incidence: a) you have high loadavg, and b) your system
logs "are not being updated".

I'm sorry - I meant they were not being updated with NFS-related
diagnostics. They were updated all along with diagnostics associated with
other events - like e.g. when ssh clients closed a connection.

Henrik Carlqvist

2022-06-06 17:57:15 UTC

Usually unmounting is not needed when an NFS server is rebooted; once it
is back up again, everything is supposed to be fine again.

In fact, I had at least one terminal emulator where I was in one of the
NFS-mounted directories. I foolishly tried to list the contents of that
directory, and the shell just froze up on me. I had to kill the
terminal emulator.

Most likely, somehow, the NFS server has not come back as it should.

Even if you killed the terminal, your ls process is probably still there
in a "D" state (waiting for disk), and your system load is the sum of all
processes waiting for CPU and all processes waiting for disk.
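Since the load average counts both kinds of waiting, one quick check is to tally processes by the first letter of their state. This is only a minimal sketch: it assumes a procps-style ps that accepts "-o stat=" (no header), and count_states is a hypothetical helper name.

```shell
# Tally process states by their first letter (R, S, D, Z, ...).
# Reads one STAT value per line on stdin; count_states is a made-up name.
count_states() {
    awk '{ n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Usage, assuming procps ps:
#   ps -eo stat= | count_states
```

If the "D" count stays close to the load average while the "R" count is near zero, the stuck processes are the likely source of the load.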

I was able to unmount those NFS directories - forcefully, on
occasion - and I was able to stop the RPC and NFSD daemons. However, the
high load issue did not disappear.

To get rid of the high load you will need to kill the processes in "D"
state. This is probably only possible if you mounted the NFS directories
with the "intr" option.

Stopping the rpc and nfsd daemons on the NFS server will, from the NFS
client's point of view, be just as bad as shutting down the NFS server
completely. Any processes hung in "D" state will stay that way until the
NFS service is restored. Instead of stopping the NFS service you should
do something like "/etc/rc.d/rc.nfsd restart".

Anybody got any suggestions as to how to diagnose and solve this
problem, without rebooting? top is not helping, and I see nothing
relevant in dmesg, or any of the /var/log files.

In both dmesg and your log files you should see something like this:

nfs: server foo.example.com not responding, still trying

When this is the latest you see about that NFS server you will get
processes stuck in "D" state. Once the NFS server is rebooted and up
again you should see:

nfs: server foo.example.com OK

and all your processes in "D" state should get back to normal again.
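As a rough way to check this from the kernel log, the sketch below reports every server whose most recent message is still "not responding". It assumes the message wording quoted above (the exact text may differ between kernel versions), and nfs_stuck_servers is a hypothetical helper name.

```shell
# Print NFS servers whose latest kernel message is "not responding".
# Reads dmesg-style lines on stdin; later lines override earlier ones.
nfs_stuck_servers() {
    awk '
        /nfs: server .* not responding/ {
            for (i = 1; i < NF; i++) if ($i == "server") state[$(i + 1)] = "down"
        }
        /nfs: server .* OK$/ {
            for (i = 1; i < NF; i++) if ($i == "server") state[$(i + 1)] = "ok"
        }
        END { for (s in state) if (state[s] == "down") print s }
    ' | sort
}

# Usage:
#   dmesg | nfs_stuck_servers
```

An empty result would mean that every server that went quiet has since logged an OK message.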

More precisely, there are relevant entries, but they are all old and
not being updated - but the high load stubbornly remains.

If you do:

ps aux | grep D

and look for processes with a "D" in the STAT column those processes
might explain your high load. There are other tools like lsof and fuser
to find out which processes are in an NFS mounted directory (or any other
directory), but you should focus on bringing that NFS server back instead
of killing unfinished processes.
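Note that "grep D" also matches any line with a capital D elsewhere, such as in a user name or command line. A slightly narrower filter, sketched here under the assumption of a procps-style ps with STAT as the second output column, keeps the header plus only the rows whose state starts with D. The name d_state_only is made up for this example.

```shell
# Keep the header line and rows whose 2nd column (STAT) starts with "D".
# Expects input such as the output of: ps -eo pid,stat,wchan:32,args
d_state_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage, assuming procps ps:
#   ps -eo pid,stat,wchan:32,args | d_state_only
```

The WCHAN column shows the kernel function each process is sleeping in, which can help confirm that the wait is NFS-related.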

regards Henrik

S.K.R. de Jong

2022-06-06 19:51:50 UTC

Post by Henrik Carlqvist
Even if you killed the terminal, your ls process is probably still there
in a "D" state (waiting for disk), and your system load is the sum of all
processes waiting for CPU and all processes waiting for disk.

Thanks. That did the trick: I had three processes in a "D" state
(actually, D and something else) - one of them being indeed the shell
where I tried to do the ls. After killing them the system load is back to
the levels that I would expect from the ordinary system activity.