Discussion:
NUMA won't enable
(too old to reply)
K Venken
2019-07-16 15:58:06 UTC
Permalink
I am trying to get NUMA enabled on Slackware 14.2
- I checked that I have the proper CPU (Core i7) hardware (it did show NUMA
with CentOS)
- I installed the SlackBuilds numactl package
- I recompiled the kernel/modules with CONFIG_NUMA=y

I still get

***@node-01-02:~$ numactl -s
physcpubind: 0 1 2 3 4 5 6 7
No NUMA support available on this system.

I also noted in the changelog for -current that the kernel config now has
NUMA n -> y, which confirms that NUMA is not available by default in the
stock 14.2 kernel.

Should recompiling the kernel not be sufficient to have NUMA active?
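
For reference, here is how I have been checking whether the rebuilt kernel
actually has NUMA support (this assumes IKCONFIG is enabled so that
/proc/config.gz exists; otherwise grep the .config the kernel was built from):

zgrep NUMA /proc/config.gz       # confirm CONFIG_NUMA=y in the running kernel
ls /sys/devices/system/node/     # node0, node1, ... appear when NUMA is active
dmesg | grep -i -e numa -e srat  # firmware SRAT/NUMA messages from boot
numactl --hardware               # with CONFIG_NUMA=y even one socket shows "available: 1 nodes (0)"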
Aragorn
2019-07-16 17:14:19 UTC
Permalink
Post by K Venken
I am trying to get NUMA enabled on Slackware 14.2
- I checked that I have the proper CPU (Core i7) hardware (it did show NUMA
with CentOS)
- I installed the SlackBuilds numactl package
- I recompiled the kernel/modules with CONFIG_NUMA=y
I still get
physcpubind: 0 1 2 3 4 5 6 7
No NUMA support available on this system.
I also noted in the changelog for -current that the kernel config now has
NUMA n -> y, which confirms that NUMA is not available by default in the
stock 14.2 kernel.
Should recompiling the kernel not be sufficient to have NUMA active?
You only have NUMA when your system has more than one processor
socket. So long as it's a system with a single processor socket —
regardless of how many cores or hyperthreads it supports — you don't
have a NUMA system.

In addition to that, even though it has been a long time since I
configured and built a kernel, I believe that there are several more
options to be selected for a NUMA system, such as the type of NUMA
access. The upstream kernel supports several non-contiguous memory
models, so you need to tell the kernel which model to use.
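
As a rough illustration, the relevant block in an x86_64 .config looks
something like this (symbols from a 4.x-era kernel; exact names can vary
between versions, so treat it as a sketch):

CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y    # the "Memory model" choice; SPARSEMEM is the usual pick on x86_64
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_NUMA=y                # NUMA support sits on top of the chosen memory model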


Note: NUMA is not the same thing as a CPU set. A CPU set is where
you confine a program to only a subset of the available processor
cores or hyperthreads, and it can be used on a single-socket
machine as well as on NUMA architectures.
--
With respect,
= Aragorn =
K. Venken
2019-07-16 18:14:15 UTC
Permalink
Post by Aragorn
Post by K Venken
I am trying to get NUMA enabled on Slackware 14.2
- I checked that I have the proper CPU (Core i7) hardware (it did show NUMA
with CentOS)
- I installed the SlackBuilds numactl package
- I recompiled the kernel/modules with CONFIG_NUMA=y
I still get
physcpubind: 0 1 2 3 4 5 6 7
No NUMA support available on this system.
I also noted in the changelog for -current that the kernel config now has
NUMA n -> y, which confirms that NUMA is not available by default in the
stock 14.2 kernel.
Should recompiling the kernel not be sufficient to have NUMA active?
You only have NUMA when your system has more than one processor
socket. So long as it's a system with a single processor socket —
regardless of how many cores or hyperthreads it supports — you don't
have a NUMA system.
That's curious. I took an existing system with the following specs:

[***@compute-0-20 ~]$ uname -a
Linux compute-0-20.local 2.6.32-504.16.2.el6.x86_64 #1 SMP Wed Apr 22
06:48:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[***@compute-0-20 ~]$ more /etc/centos-release
CentOS release 6.6 (Final)
[***@compute-0-20 ~]$ cpuinfo
compute-0-20 : model name : Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

Along with the following:

[***@compute-0-20 ~]$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 8159 MB
node 0 free: 129 MB
node distances:
node 0
0: 10

Then, to upgrade it, I installed Slackware 14.2 with the changes mentioned
above. I was expecting to see the same result.
Post by Aragorn
In addition to that, even though it has been a long time since I
configured and built a kernel, I believe that there are several more
options to be selected for a NUMA system, such as the type of NUMA
access. The upstream kernel supports several non-contiguous memory
models, so you need to tell the kernel which model to use.
Well, that's a good point. I saw at least 4 different entries mentioning
NUMA. It is not clear which of them I should select.
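
For reference, the entries in question are typically these on x86_64
(taken from a 4.4-era config; the menu locations and comments are my best
understanding, so double-check the help text in menuconfig):

CONFIG_NUMA=y              # Processor type and features -> NUMA Memory Allocation and Scheduler Support
CONFIG_X86_64_ACPI_NUMA=y  # ACPI NUMA detection (reads the firmware SRAT table)
CONFIG_AMD_NUMA=y          # old-style AMD Opteron NUMA detection; harmless on Intel
CONFIG_NUMA_EMU=y          # optional: fake NUMA nodes for testing, via the numa=fake= boot option
CONFIG_NODES_SHIFT=6       # maximum number of nodes as a power of two (2^6 = 64)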
Post by Aragorn
Note: NUMA is not the same thing as a CPU set. A CPU set is where
you confine a program to only a subset of the available processor
cores or hyperthreads, and it can be used on a single-socket
machine as well as on NUMA architectures.
The point of NUMA in this case is (obviously) to increase the total
available memory for parallelized calculations.
Aragorn
2019-07-16 20:07:15 UTC
Permalink
Post by K. Venken
Post by Aragorn
[...]
Note: NUMA is not the same thing as a CPU set. A CPU set is where
you confine a program to only a subset of the available
processor cores or hyperthreads, and it can be used on a
single-socket machine as well as on NUMA architectures.
The point of NUMA in this case is (obviously) to increase the total
available memory for parallelized calculations.
Um, no, that is not what NUMA is for. NUMA ("non-uniform memory
access") is a technology used on machines that have more than one
memory controller.

NUMA systems have already long existed, and even Intel had already
built NUMA systems in the past, but mainstream Intel x86 processors
were all using a common memory controller on the northbridge.

When AMD introduced the first x86-64 processor, they integrated the
memory controller on the processor die, which meant that if you had a
dual- or quad-socket motherboard for AMD64, then you had two or four
memory controllers. Intel on the other hand stuck with the common
memory controller on the northbridge until the Nehalem-based Core i7
was released.

So unless you have a system with more than one (occupied) socket, you
will only have a single memory controller — it's on the processor die,
but there is only one controller per die — for accessing all of the RAM,
and so NUMA support in the kernel is irrelevant.

If you have more than one processor socket in a modern x86-64 system,
then the RAM banks will be divided over the processor sockets, so that
a certain amount of RAM will be local to one processor socket. This
is called a NUMA node. The rest of the RAM is "remote" to all of the
cores on that particular processor die and must be accessed by way of
an inter-socket bus.

What NUMA support does, concretely, is make sure that the processes are
always [*] running on a processor core local to the memory banks of that
same node, so that no slower memory access needs taking place by way of
the inter-socket bus.


[*] Within a certain margin of error, of course. There will always be
communication between the nodes, but the idea is to keep that
communication restricted to the bare minimum, so that the entire
setup can make use of the multiplied throughput from having
multiple memory controllers.
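
As a concrete illustration of that pinning (the program name is made up,
and this assumes a machine that actually reports more than one node):

numactl --cpunodebind=0 --membind=0 ./solver   # run on node 0 cores, allocate only from node 0 RAM
numactl --hardware                             # print the node/CPU/memory layout and distance table
numastat                                       # per-node numa_hit/numa_miss counters to check locality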
--
With respect,
= Aragorn =
K. Venken
2019-07-16 20:45:59 UTC
Permalink
Post by Aragorn
Post by K. Venken
Post by Aragorn
[...]
Note: NUMA is not the same thing as a CPU set. A CPU set is where
you confine a program to only a subset of the available
processor cores or hyperthreads, and it can be used on a
single-socket machine as well as on NUMA architectures.
The point of NUMA in this case is (obviously) to increase the total
available memory for parallellized calculations.
Um, no, that is not what NUMA is for. NUMA ("non-uniform memory
access") is a technology used on machines that have more than one
memory controller.
NUMA systems have already long existed, and even Intel had already
built NUMA systems in the past, but mainstream Intel x86 processors
were all using a common memory controller on the northbridge.
When AMD introduced the first x86-64 processor, they integrated the
memory controller on the processor die, which meant that if you had a
dual- or quad-socket motherboard for AMD64, then you had two or four
memory controllers. Intel on the other hand stuck with the common
memory controller on the northbridge until the Nehalem-based Core i7
was released.
That's what it says: non-uniform memory access. I was basing my
assumptions on Wikipedia. It states
(https://en.wikipedia.org/wiki/Non-uniform_memory_access)

"Non-uniform memory access (NUMA) is a computer memory design used in
multiprocessing, where the memory access time depends on the memory
location relative to the processor. Under NUMA, a processor can access
its own local memory faster than non-local memory (memory local to
another processor or memory shared between processors). The benefits of
NUMA are limited to particular workloads, notably on servers where the
data is often associated strongly with certain tasks or users.[1]"

But it does not mention clusters and multiple nodes. Indeed.
Post by Aragorn
So unless you have a system with more than one (occupied) socket, you
will only have a single memory controller — it's on the processor die,
but there is only one controller per die — for accessing all of the RAM,
and so NUMA support in the kernel is irrelevant.
Actually, there are a few Xeon-based PCs with two processors, but we
have not touched them yet...
Post by Aragorn
If you have more than one processor socket in a modern x86-64 system,
then the RAM banks will be divided over the processor sockets, so that
a certain amount of RAM will be local to one processor socket. This
is called a NUMA node. The rest of the RAM is "remote" to all of the
cores on that particular processor die and must be accessed by way of
an inter-socket bus.
What NUMA support does, concretely, is make sure that the processes are
always [*] running on a processor core local to the memory banks of that
same node, so that no slower memory access needs taking place by way of
the inter-socket bus.
[*] Within a certain margin of error, of course. There will always be
communication between the nodes, but the idea is to keep that
communication restricted to the bare minimum, so that the entire
setup can make use of the multiplied throughput from having
multiple memory controllers.
Thanks a lot for your information. I (actually together with a student)
am trying to redeploy a 10-year-old cluster, now based on Slackware. I
had(ve) no knowledge about all the technology involved, but we learned a
lot in the last weeks. It started as an experiment, but we are now at the
point that we are testing some of the applications which can benefit from
distributed calculations. One of them crashed in (Open)MPI and mentioned
a problem with NUMA. But if NUMA is not essential, this might be a dead
end. It's good to know we are not there yet.
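
If it is only the process binding that trips it up, it looks like that can
be sidestepped from the mpirun command line (these are standard Open MPI
1.8+ options; the program name is made up):

mpirun --bind-to none -np 8 ./mysim                    # disable core/NUMA binding entirely
mpirun --bind-to core --report-bindings -np 8 ./mysim  # or bind per core and print what was chosen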
Aragorn
2019-07-16 22:26:13 UTC
Permalink
Post by K. Venken
Post by Aragorn
[...]
So unless you have a system with more than one (occupied) socket,
you will only have a single memory controller — it's on the
processor die, but there is only one controller per die — for
accessing all of the RAM, and so NUMA support in the kernel is
irrelevant.
Actually, there are a few Xeon-based PCs with two processors, but we
have not touched them yet...
I have an older server sitting here with two NetBurst-based Intel Xeon
processors with hyperthreading, and they still use a single memory
controller on the northbridge.

On the other hand, newer Intel x86-64 processors from Nehalem (the first
Core i7 generation) onward also have their memory controller on the
processor die, and so if you have a multi-socket system with processors
like that, then you'll have a NUMA system.
Post by K. Venken
Thanks a lot for your information. I (actually together with a
student) am trying to redeploy a 10-year-old cluster, now based on
Slackware. I had(ve) no knowledge about all the technology involved,
but we learned a lot in the last weeks. It started as an experiment, but
we are now at the point that we are testing some of the applications
which can benefit from distributed calculations. One of them crashed
in (Open)MPI and mentioned a problem with NUMA. But if NUMA is not
essential, this might be a dead end. It's good to know we are not
there yet.
Well, if you're working with a cluster — i.e. essentially a
supercomputer topology — then you definitely need NUMA enabled in the
kernel — or at least on the controlling node; I don't know about
the slaves.

If you want a truly distributed operating system on the other hand, then
perhaps you should look into Plan 9, which is now available as Open
Source.

Of course, Plan 9 is not GNU/Linux, and even less Slackware.
It's not even UNIX, but it's strongly related to UNIX as it was
intended to be its successor.

For home-made supercomputer setups, I would recommend looking into
Beowulf clusters. ;)
--
With respect,
= Aragorn =
Grant Taylor
2019-07-17 03:34:09 UTC
Permalink
Post by Aragorn
Well, if you're working with a cluster — i.e. essentially a
supercomputer topology — then you definitely need NUMA enabled
in the kernel — or at least on the controlling node; I don't know
about the slaves.
I think it completely depends on what type of cluster it is.

· High Performance Computing /might/ benefit from NUMA.
· High Availability is less likely to benefit from NUMA.
· Load Balancing is even less likely to benefit from NUMA.

The benefit is completely related to how much inter-compute-node
(memory) access there is.

HPC can do a fair amount of inter-node communications, particularly if
the calculations rely on data that is spread out in other nodes.

HA /may/ use NUMA to be able to replicate state between nodes. But even
then, chances are good that one (or a few) nodes in the cluster are
going to be actively using the memory while the other node(s) simply
have a ~current copy.

LB is usually about distributing small discrete jobs across multiple
nodes to get better aggregate throughput. Those jobs are usually fully
contained within any given node.
Post by Aragorn
If you want a truly distributed operating system on the other hand,
then perhaps you should look into Plan 9, which is now available as
Open Source.
I'll posit that (Open)VMS is as distributed as Plan 9 is.
Post by Aragorn
Of course, Plan 9 is not GNU/Linux, and even less Slackware. It's not
even UNIX, but it's strongly related to UNIX as it was intended to
be its successor.
I question that, but don't have data to refute it. If Plan 9 made it
far enough in development, it probably could easily have achieved Unix
certification and the legal right to use the Unix name, including the
capital U.

In fact, I'd be fairly surprised if Plan 9 didn't somehow already have
the legal right, if not certification. Seeing as how—as I understand
it—Plan 9 came from AT&T ~> Bell Laboratories, which owned the Unix
rights at the time. Ergo it would have been a matter of internal
paperwork. But I'd have to go back and confirm things to be able to say
definitively.
Post by Aragorn
For home-made supercomputer setups, I would recommend looking into
Beowulf clusters. ;)
Please elaborate on what Beowulf cluster means to you. I ask because I
remember hearing "Beowulf cluster" being thrown around and touted as the
next big thing around 2000. I quit paying attention for about a decade,
and then I was seeing HPC, HA, and LB clusters. I've not been able to
find anything that differentiated a Beowulf cluster from some other
non-Beowulf cluster. My supposition is that Beowulf cluster was a / the
first commonly known type of cluster that evolved into HPC. Other types
of clusters, namely HA, and LB, have spun out of a full Beowulf / HPC
cluster as they are simpler and have different designs and purposes.
--
Grant. . . .
unix || die
Aragorn
2019-07-17 05:20:57 UTC
Permalink
Post by Grant Taylor
Post by Aragorn
If you want a truly distributed operating system on the other hand,
then perhaps you should look into Plan 9, which is now available as
Open Source.
I'll posit that (Open)VMS is as distributed as Plan 9 is.
VMS doesn't have anything in common with UNIX. Plan 9 on the other
hand does.
Post by Grant Taylor
Post by Aragorn
Of course, Plan 9 is not GNU/Linux, and even less Slackware. It's
not even UNIX, but it's strongly related to UNIX as it was intended
to be its successor.
I question that, but don't have data to refute it. If Plan 9 made it
far enough in development, it probably could easily have achieved
Unix certification and the legal right to use the Unix name,
including the capital U.
There are certain differences. For instance, Plan 9 handles
networking very differently from UNIX.

https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs
Post by Grant Taylor
Post by Aragorn
For home-made supercomputer setups, I would recommend looking into
Beowulf clusters. ;)
Please elaborate on what Beowulf cluster means to you.
I'm not an expert on clustering, but what I know about Beowulf is
that it's entirely made up of GNU/Linux and other Open Source software.
Other technologies that were around at the time I was looking into it
— (Open)Mosix? — were mostly based on proprietary stuff.
--
With respect,
= Aragorn =
Grant Taylor
2019-07-18 02:04:33 UTC
Permalink
Post by Aragorn
VMS doesn't have anything in common with UNIX. Plan 9 on the other
hand does.
Fair.

But VMS /can/ be, and often is, a distributed system.
Post by Aragorn
There are certain differences. For instance, Plan 9 handles networking
very differently from UNIX.
https://en.wikipedia.org/wiki/Plan_9_from_Bell_Labs
*nod*

I don't have any first-hand experience with Plan 9 (yet). But from what
I hear, it does have some ground-breaking new concepts.

I've heard tell that programs running on one host can import another
host's TCP/IP stack. Meaning that you can run a program on a workstation
which is using the TCP/IP stack on the bastion firewall without your
system actually being on the Internet. (I don't know if the application
is on the Internet per se or not.)

I've chalked it up to being something like SOCKS on steroids, without
even needing to program the application differently for SOCKS.
Post by Aragorn
I'm not an expert on clustering, but what I know about Beowulf
is that it's entirely made up of GNU/Linux and other Open Source software.
It may be that the original cluster named Beowulf was based on Linux /
open source software. But I'm fairly certain that some of the other
clusters modeled after the original cluster named Beowulf were running
things other than Linux and / or open source software.
Post by Aragorn
Other technologies at the time — (Open)Mosix? — when I was looking
into it were mostly based on proprietary stuff.
I don't see how open source vs proprietary stuff is germane to how a
cluster operates or what it does. Much like it doesn't matter if you
run Sendmail vs Exchange as an SMTP server.
--
Grant. . . .
unix || die
K. Venken
2019-07-17 18:33:42 UTC
Permalink
Post by Grant Taylor
Post by Aragorn
Well, if you're working with a cluster — i.e. essentially a
supercomputer topology — then you definitely need NUMA enabled in
the kernel — or at least on the controlling node; I don't know about
the slaves.
I think it completely depends on what type of cluster it is.
 · High Performance Computing /might/ benefit from NUMA.
 · High Availability is less likely to benefit from NUMA.
 · Load Balancing is even less likely to benefit from NUMA.
The benefit is completely related to how much inter-compute-node
(memory) access there is.
HPC can do a fair amount of inter-node communications, particularly if
the calculations rely on data that is spread out in other nodes.
HA /may/ use NUMA to be able to replicate state between nodes.  But even
then, chances are good that one (or a few) nodes in the cluster are
going to be actively using the memory while the other node(s) simply
have a ~current copy.
LB is usually about distributing small discrete jobs across multiple
nodes to get better aggregate throughput.  Those jobs are usually fully
contained within any given node.
Most of the work we have can be handled by LB. A lot of our code simply
does not support any parallelization by itself. However, more and more
applications provide support for parallel processing in different forms,
with support for different cluster configurations. It is for these
latter cases where a lot of calculations and a large dataset are needed
that we want to provide NUMA. HA is not a requirement for the moment.
Post by Grant Taylor
Post by Aragorn
If you want a truly distributed operating system on the other hand,
then perhaps you should look into Plan 9, which is now available as
Open Source.
I'll posit that (Open)VMS is as distributed as Plan 9 is.
Post by Aragorn
Of course, Plan 9 is not GNU/Linux, and even less Slackware.  It's not
even UNIX, but it's strongly related to UNIX as it was intended to be
its successor.
I question that, but don't have data to refute it.  If Plan 9 made it
far enough in development, it probably could easily have achieved Unix
certification and the legal right to use the Unix name, including the
capital U.
In fact, I'd be fairly surprised if Plan 9 didn't somehow already have
the legal right, if not certification.  Seeing as how—as I understand
it—Plan 9 came from AT&T ~> Bell Laboratories, whom owned the Unix
rights at the time.  Ergo it would have been a matter of internal
paperwork.  But I'd have to go back and confirm things to be able to say
definitively.
A non-Linux OS does not seem to be an option, so there are no
investigations in this direction. I am however curious how other OSes
(I was thinking more of BSD) would compare to Linux.
Post by Grant Taylor
Post by Aragorn
For home-made supercomputer setups, I would recommend looking into
Beowulf clusters. ;)
Please elaborate on what Beowulf cluster means to you.  I ask because I
remember hearing "Beowulf cluster" being thrown around and touted as the
next big thing around 2000.  I quit paying attention for about a decade,
and then I was seeing HPC, HA, and LB clusters.  I've not been able to
find anything that differentiated a Beowulf cluster from some other
non-Beowulf cluster.  My supposition is that Beowulf cluster was a / the
first commonly known type of cluster that evolved into HPC.  Other types
of clusters, namely HA, and LB, have spun out of a full Beowulf / HPC
cluster as they are simpler and have different designs and purposes.
That's more or less what we observed. Several interesting projects (like
Mosix) seem to be abandoned. Same with predefined cluster distributions.
There are still a few left, but how much support do you get after 5 years?
Grant Taylor
2019-07-18 02:27:54 UTC
Permalink
Post by K. Venken
Most of the work we have can be handled by LB. A lot of our code simply
does not support any parallelization by itself.
Okay. Have you checked out Linux Virtual Server or Netfilter's cluster
match? I'm not saying that they will do what you need. But I do think
that it's worth 5 ~ 15 minutes to brief yourself on them and decide for
yourself if they will help you.

LVS turns Linux into a load balancer sitting in front of multiple real
servers.
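
The userspace side of LVS is ipvsadm; a minimal round-robin NAT setup
looks roughly like this (the addresses are made up):

ipvsadm -A -t 192.0.2.10:80 -s rr                # define the virtual service with a round-robin scheduler
ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.11:80 -m   # add real server 1 in masquerading (NAT) mode
ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.12:80 -m   # add real server 2
ipvsadm -L -n                                    # list the current virtual server table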

Netfilter's cluster match does away with the LB in front of the real
servers by having all servers physically receive the request; Netfilter
on each node then decides if it should pass the traffic in to the daemons.
Post by K. Venken
However, more and more applications provide support for parallel
processing in different forms, with support for different cluster
configurations. It is for these latter cases where a lot of
calculations and a large dataset are needed that we want to provide
NUMA.
Okay. I'm guessing that the data set is too big to live in RAM and you
don't want it to live on comparatively slow disk. As such, you want to
shard the data across multiple nodes and have each node access other
nodes for contents of their RAM as necessary. Hence NUMA between local
RAM and remote RAM.

If I can ask, what networking technology are you using for inter-node
communications? Ethernet? Or something more exotic, e.g. InfiniBand, RDMA?
Post by K. Venken
HA is not a requirement for the moment.
ACK

It really sounds like you're looking for HPC. Most of the HPC
installations that I've seen are on the opposite end of the spectrum
from HA.
Post by K. Venken
A non-Linux OS does not seem to be an option, so there are no
investigations in this direction. I am however curious how other
OSes (I was thinking more of BSD) would compare to Linux.
I found the following article on a FreeBSD HPC cluster from '03. I'm
sure it's dated and things have changed. But I'd be surprised to learn
that FreeBSD wasn't an option.

Link - Building a High-performance Computing Cluster Using FreeBSD
- https://people.freebsd.org/~brooks/papers/bsdcon2003/fbsdcluster/
Post by K. Venken
That's more or less what we observed. Several interesting projects
(like Mosix) seem to be abandoned. Same with predefined cluster
distributions. There are still a few left, but how much support do
you get after 5 years?
¯\_(ツ)_/¯

Good luck. :-)
--
Grant. . . .
unix || die
K Venken
2019-07-18 09:42:31 UTC
Permalink
Post by K. Venken
Most of the work we have can be handled by LB. A lot of our code
simply does not support any parallelization by itself.
Okay.  Have you checked out Linux Virtual Server or Netfilter's cluster
match?  I'm not saying that they will do what you need.  But I do think
that it's worth 5 ~ 15 minutes to brief yourself on them and decide for
yourself if they will help you.
LVS turns Linux into a load balancer sitting in front of multiple real
servers.
Checked it out; it could be useful, but it seems that it stops at
kernel 2.6. Not sure if we can migrate it to a recent kernel. At this
point Slurm is doing well.
Netfilter's cluster does away with the LB in front of the real servers
by having all servers physically receive the request and then Netfilter
on each node decides if it should pass the traffic in to the daemons.
Post by K. Venken
However, more and more applications provide support for parallel
processing in different forms, with support for different cluster
configurations. It is for these latter cases where a lot of
calculations and a large dataset are needed that we want to provide NUMA.
Okay.  I'm guessing that the data set is too big to live in RAM and you
don't want it to live on comparatively slow disk.  As such, you want to
shard the data across multiple nodes and have each node access other
nodes for contents of their RAM as necessary.  Hence NUMA between local
RAM and remote RAM.
Yes, that's it.
If I can ask, what networking technology are you using for inter-node
communications?  Ethernet?  Or something more exotic, e.g. InfiniBand,
RDMA?
The current (10-year-old) cluster uses Gb Ethernet. We noted a drastic
difference in performance when we changed it (by accident) to 100 Mb! So
the network matters a lot with NUMA. We are considering going to InfiniBand,
but then we need to change switches and probably also all cabling, and
that's another story. The current hardware does have two Ethernet ports
on each node, which might help a bit if we do bonding or split the
internode communication from the filesystem.
Post by K. Venken
HA is not a requirement for the moment.
ACK
It really sounds like you're looking for HPC.  Most of the HPC
installations that I've seen are on the opposite end of the spectrum
from HA.
Yes, that it is. HA facilities can be useful, but not at the expense of
performance. At the moment we have decided to stick with good plain old
NFS as the global filesystem rather than any cluster filesystem, as we
have the impression that performance would otherwise degrade, but we have
no conclusion yet. And NFS is too simple not to use.
Post by K. Venken
A non-linux OS seems not to be an option. So there are no
investigations in this direction. I am however curious about how other
OS (and I was more thinking of BSD) would compare to linux.
I found the following article on a FreeBSD HPC cluster from '03.  I'm
sure it's dated and things have changed.  But I'd be surprised to learn
that FreeBSD wasn't an option.
Link - Building a High-performance Computing Cluster Using FreeBSD
 - https://people.freebsd.org/~brooks/papers/bsdcon2003/fbsdcluster/
Contains a lot of useful information. The problem is that some commercial
applications only support Windows and Linux.
Post by K. Venken
That's more or less what we observed. Several interesting projects (like
Mosix) seem to be abandoned. Same with predefined cluster
distributions.  There are still a few left, but how much support do
you get after 5 years?
¯\_(ツ)_/¯
Good luck.  :-)
Thanks ;-)
Grant Taylor
2019-07-18 15:35:15 UTC
Permalink
Post by K Venken
Checked it out, it could be usefull, but it seems that it stops at
kernel 2.6. Not sure if we can migrate it to a recent kernel. At this
point slurm is doing well.
It's in 4.14.127. (I assume it's been in almost all versions since it
came out.) It has moved around a few times, and you might not see it if
the dependencies aren't enabled.

Networking support ---> Networking options ---> Network packet filtering
framework (Netfilter) ---> IP virtual server support

Netfilter's cluster match is located at:

Networking support ---> Networking options ---> Network packet filtering
framework (Netfilter) ---> Core Netfilter Configuration ---> "cluster"
match support
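
Once that match is built, the per-node rules look something like this
(a two-node sketch along the lines of the iptables-extensions man page;
the hash seed and mark value are arbitrary):

iptables -t mangle -A PREROUTING -i eth1 -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
iptables -t mangle -A PREROUTING -i eth1 -m mark ! --mark 0xffff -j DROP
# node 2 uses --cluster-local-node 2; both nodes also have to listen on the same multicast MAC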
Post by K Venken
Yes, that's it.
The current (10-year-old) cluster uses Gb Ethernet. We noted a drastic
difference in performance when we changed it (by accident) to 100 Mb! So
the network matters a lot with NUMA. We are considering going to InfiniBand,
but then we need to change switches and probably also all cabling, and
that's another story.
I'm quite confident that InfiniBand uses something quite different from
copper Ethernet patch cables.  The little InfiniBand that I've seen uses
what I've learned to call QSFP+ connections.  I have re-used InfiniBand
cables for 40 Gbps Ethernet via QSFP+ ports.  So InfiniBand cables must
be quite similar to, if not the same as, QSFP+ Ethernet cables.

I also know that Mellanox ConnectX-3 (Pro) cards can support Ethernet,
and with enhanced firmware also support InfiniBand.

So, ya. Cabling will be quite different for InfiniBand.
Post by K Venken
The current hardware does have two ethernet ports on each node, which
might help a bit if we do bonding or split the internode communication
from the filesystem.
To bond, or not to bond,....

I've heard people advocate for both sides of the spectrum. I personally
like the idea of bonding (read: LACP) if it's an option. I say this
because I'm used to both types of traffic being quite bursty and not
always at the same time. As such, why limit either type to the speed /
bandwidth of a single interface if it's relatively easy to benefit from
unused bandwidth of the other interface(s).
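
As a sketch of the LACP variant with iproute2 (interface names and the
address are assumptions, and the switch ports have to be configured for
802.3ad as well):

modprobe bonding
ip link add bond0 type bond mode 802.3ad miimon 100
ip link set eth0 down; ip link set eth0 master bond0
ip link set eth1 down; ip link set eth1 master bond0
ip link set bond0 up
ip addr add 10.0.0.21/24 dev bond0
cat /proc/net/bonding/bond0   # verify the aggregator ID and slave state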
Post by K Venken
Yes, that it is. HA facilities can be useful, but not at the expense of
performance. At the moment we have decided to stick with good plain old
NFS as the global filesystem rather than any cluster filesystem, as we
have the impression that performance would otherwise degrade, but we have
no conclusion yet. And NFS is too simple not to use.
I've not heard about performance degradation in clustered file systems.
But I'm not looking for it and could have easily missed it. I can see
how the shared / distributed / global lock management could be a
performance problem.

Though I would think that NFS would also have some form of locking
related performance gotchas too.

This brings to mind multiple things:
· sharding NFS mounts across multiple NFS NAS gateways
· clustered file systems between NFS NAS gateways
· sharding NFS NAS gateways across multiple SAN backends
· Other NAS protocols
· Other block / object storage protocols
Post by K Venken
Contains a lot of useful information. The problem is that some commercial
applications only support Windows and Linux.
ACK
Post by K Venken
Thanks ;-)
You're welcome.

Thank you for allowing me to learn vicariously through you. ():-)
--
Grant. . . .
unix || die
K. Venken
2019-07-18 19:29:52 UTC
Permalink
Post by K Venken
Checked it out; it could be useful, but it seems that it stops at
kernel 2.6. Not sure if we can migrate it to a recent kernel. At this
point Slurm is doing well.
It's in 4.14.127.  (I assume it's been in almost all versions since it
came out.)  It has moved around a few times, and you might not see it if
the dependencies aren't enabled.
Networking support ---> Networking options ---> Network packet filtering
framework (Netfilter) ---> IP virtual server support
Networking support ---> Networking options ---> Network packet filtering
framework (Netfilter) ---> Core Netfilter Configuration ---> "cluster"
match support
Post by K Venken
Yes, that's it.
The current (10-year-old) cluster uses Gb Ethernet. We noted a drastic
difference in performance when we changed it (by accident) to 100 Mb! So
the network matters a lot with NUMA. We are considering going to
InfiniBand, but then we need to change switches and probably also all
cabling, and that's another story.
I'm quite confident that InfiniBand uses something quite different from
copper Ethernet patch cables.  The little InfiniBand that I've seen uses
what I've learned to call QSFP+ connections.  I have re-used InfiniBand
cables for 40 Gbps Ethernet via QSFP+ ports.  So InfiniBand cables must
be quite similar to, if not the same as, QSFP+ Ethernet cables.
I also know that Mellanox ConnectX-3 (Pro) cards can support Ethernet,
and with enhanced firmware also support InfiniBand.
So, ya.  Cabling will be quite different for InfiniBand.
Point noted...
Post by K Venken
The current hardware does have two ethernet ports on each node, which
might help a bit if we do bonding or split the internode communication
from the filesystem.
To bond, or not to bond,....
I've heard people advocate for both sides of the spectrum.  I personally
like the idea of bonding (read: LACP) if it's an option.  I say this
because I'm used to both types of traffic being quite bursty and not
always at the same time.  As such, why limit either type to the speed /
bandwidth of a single interface if it's relatively easy to benefit from
unused bandwidth of the other interface(s).
Post by K Venken
Yes, that it is. HA facilities can be useful, but not at the expense
of performance. At the moment we have decided to stick with good plain old
NFS as the global filesystem rather than any cluster filesystem, as we
have the impression that performance would otherwise degrade, but we have no
conclusion yet. And NFS is too simple not to use.
I've not heard about performance degradation in clustered file systems.
But I'm not looking for it and could have easily missed it.  I can see
how the shared / distributed / global lock management could be a
performance problem.
I guess a lot depends on how you set it up... There will probably be other
results, but here is a reference:

https://www.jdieter.net/posts/2017/08/14/benchmarking-small-file-performance-on-distributed-filesystems/

and, admittedly I am probably biased, but NFS is sound and proven
technology, optimized and improved over decades...

My personal experience - but it's with smaller NAS boxes and simple servers...
- Nothing beats a proper FTP client and server; it always achieves the
maximum capacity of the weakest link, regardless of whether it's WAN or
LAN. FTP is designed for that. At least for not-too-small files.
- NFS comes next, way better than Samba or SMB, but that's heavily
contested these days. I'll leave it open whether that is still correct,
but if I can avoid Samba, I will.
- Nothing beats your own Linux (Slackware of course) server over a
commercial NAS, media player,... Not because it's better but because you
can leave out things you don't want and sometimes, just sometimes, it
matters.
Though I would think that NFS would also have some form of locking
related performance gotchas too.
 · sharding NFS mounts across multiple NFS NAS gateways
 · clustered file systems between NFS NAS gateways
 · sharding NFS NAS gateways across multiple SAN backends
 · Other NAS protocols
 · Other block / object storage protocols
NFS has a lot of other tricks up its sleeve. (Did I mention that I am
biased?) To name a few:
- you can export read-only trees (like /export/opt) read-only and use
async. No locking needed; it's safe because you only read. It's fast!
- you can create a truly shared directory where everybody is the same
user with the all_squash option. All permission worries gone, just like that.
- you can use the sticky bit and create a large global tmp filesystem

I haven't investigated other cluster filesystems sufficiently to know if
this is possible or not, so I won't claim NFS is the holy grail, but I
like it.
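
For the record, those three tricks translate into /etc/exports entries
roughly like this (the paths and the client network are just examples):

# read-only, async software tree: no locking worries, and fast
/export/opt     10.0.0.0/24(ro,async,no_subtree_check)
# shared scratch area where every client maps to the same anonymous user
/export/shared  10.0.0.0/24(rw,all_squash,anonuid=1000,anongid=100,no_subtree_check)
# cluster-wide tmp; on the server do: chmod 1777 /export/tmp  (the sticky bit)
/export/tmp     10.0.0.0/24(rw,no_subtree_check)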


Here is another thought experiment. (If you feel up to it...)
If you create a RAID 1, 5 or 6, you need to write more data by a factor
greater than 1. Take a RAID 5 (3 disks, 1 redundant): you write 1.5
times the data. Normally, internal buses (SATA,...) are faster than network
interfaces, but things change with InfiniBand and bonding. So let's go
wild. Create 3 NFS servers, each exporting one bare file. Let's create an
additional NFS server importing these 3 files, creating a loopback
device on each, assembling them into a RAID 5 and exporting it again to
all the nodes of the cluster. If this last NFS server has a dedicated
Ethernet connection to each of the other 3 NFS servers and one to the
internal cluster network, the performance degradation due to the RAID 5
setup can be mitigated. With NFS this is a rather easy setup. But at
this point you might want to consider drbd.

OK, this goes way too far, but this is the flexibility you have with
properly designed mechanisms. And I apologize for being too creative at
times.
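
A rough sketch of that experiment with standard tools, just to make it
concrete (mount points, export paths and file names are invented; drbd or
a real cluster filesystem would be the more serious route):

# on the aggregating server: import one big file from each of the three NFS servers
mount -t nfs srv1:/export/blob /mnt/srv1
mount -t nfs srv2:/export/blob /mnt/srv2
mount -t nfs srv3:/export/blob /mnt/srv3
# wrap each file in a loop device and assemble them into a RAID 5 array
losetup /dev/loop1 /mnt/srv1/disk.img
losetup /dev/loop2 /mnt/srv2/disk.img
losetup /dev/loop3 /mnt/srv3/disk.img
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/loop1 /dev/loop2 /dev/loop3
mkfs.ext4 /dev/md0 && mount /dev/md0 /export/raid   # then re-export /export/raid over NFS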
Post by K Venken
Contains a lot of useful information. The problem is that some commercial
applications only support Windows and Linux.
ACK
Post by K Venken
Thanks ;-)
You're welcome.
Thank you for allowing me to learn vicariously through you.  ():-)
Henrik Carlqvist
2019-07-18 20:15:49 UTC
Permalink
Post by K. Venken
Post by Grant Taylor
The little InfiniBand that I've seen
uses what I've learned to call QSFP+ connections.  I have re-used
InfiniBand cables for 40 Gbps Ethernet via QSFP+ ports.  So InfiniBand
cables must be quite similar to, if not the same as, QSFP+ Ethernet cables.
I have no experience with InfiniBand, but my experience with QSFP+ and
even SFP+ has taught me that unfortunately SFP+ is not a standard
connection. Direct Attach cables must match the switches. Buy the
wrong DA-cable and you will find that even though it fits in the SFP+
port on the switch, the switch might completely refuse to connect to the
cable, accept the cable with some warnings that it might not work, or in
the best case fully accept the cable. Things can get really interesting
when you try to connect two different switches with a DA-cable.
Post by K. Venken
Here is another thought experiment. (If you feel up to it...)
If you create a RAID 1, 5 or 6, you need to write more data by a factor
greater than 1. Take a RAID 5 (3 disks, 1 redundant): you write 1.5
times the data. Normally, internal buses (SATA,...) are faster than network
interfaces, but things change with InfiniBand and bonding.
I also like NFS, but don't think of RAID as an alternative technology.
Instead I prefer to have my NFS servers exporting data from RAID disks.
RAID gives both more bandwidth and protection against crashed disks.
Unfortunately I have been bitten by situations where RAID5 has not been
enough. When a disk broke, the RAID5 system started rebuilding to a hot
spare. The rebuild caused a lot of disk accesses, which revealed that
another disk was also broken; with two failed drives all data was lost.

So nowadays I always use RAID6 instead. In my experience a hardware
RAID6 system with about 20 mechanical disks is able to give about 1.5 GB/s
on big files, which is more than enough to make a 10 Gb/s network the
bottleneck. However, accessing many small files, like when compiling
programs, will not give that bandwidth. Even with 20+ SSDs in RAID, the
latencies for opening files will give you such low bandwidth that a simple
1 Gb/s network will be more than enough.

regards Henrik
K. Venken
2019-07-18 21:22:07 UTC
Permalink
Post by Henrik Carlqvist
Post by K. Venken
Post by Grant Taylor
The little InfiniBand that I've seen
uses what I've learned to call QSFP+ connections.  I have re-used
InfiniBand cables for 40 Gbps Ethernet via QSFP+ ports.  So InfiniBand
cables must be quite similar to, if not the same as, QSFP+ Ethernet cables.
I have no experience with InfiniBand, but my experience with QSFP+ and
even SFP+ has taught me that unfortunately SFP+ is not a standard
connection. Direct Attach cables must match the switches. Buy the
wrong DA-cable and you will find that even though it fits in the SFP+
port on the switch, the switch might completely refuse to connect to the
cable, accept the cable with some warnings that it might not work, or in
the best case fully accept the cable. Things can get really interesting
when you try to connect two different switches with a DA-cable.
Post by K. Venken
Here is another thought experiment. (If you feel up to it...)
If you create a RAID 1, 5 or 6, you need to write more data by a factor
greater than 1. Take a RAID 5 (3 disks, 1 redundant): you write 1.5
times the data. Normally, internal buses (SATA,...) are faster than network
interfaces, but things change with InfiniBand and bonding.
I also like NFS, but don't think of RAID as an alternative technology.
Instead I prefer to have my NFS servers exporting data from RAID disks.
RAID gives both more bandwidth and protection against crashed disks.
Unfortunately I have been bitten by situations where RAID5 has not been
enough. When a disk broke, the RAID5 system started rebuilding to a hot
spare. The rebuild caused a lot of disk accesses, which revealed that
another disk was also broken; with two failed drives all data was lost.
I never wanted to suggest NFS as a replacement for RAID - sorry for
that. But from what I can find:

InfiniBand = 10 Gbps and SATA-III = 6 Gbps.

So, if I understand this correctly, if I have a 10 Gbps InfiniBand
interface and read at maximum speed, the disk wouldn't be able to keep
up. However, if I split the read into two parts, I get 5 Gbps each
and I can send them to two other NFS servers (again over InfiniBand), and
then each can provide the data from a SATA interface. Of course,
splitting requires RAID 0. This is what I thought of: using NFS servers
to mitigate the bandwidth bottleneck of internal interfaces, not to
replace RAID by itself...

If you use RAID disks inside the NFS servers it makes a lot of sense
anyway (think of reliability), regardless of bandwidth calculations. But
as I mentioned, these are just wild thought experiments, so don't
hesitate to correct me.
Post by Henrik Carlqvist
So nowadays I always use RAID6 instead. In my experience a hardware
RAID6 system with about 20 mechanical disks is able to give about 1.5 GB/s
on big files, which is more than enough to make a 10 Gb/s network the
bottleneck. However, accessing many small files, like when compiling
programs, will not give that bandwidth. Even with 20+ SSDs in RAID, the
latencies for opening files will give you such low bandwidth that a simple
1 Gb/s network will be more than enough.
Thanks for the information. I have done a lot of reliability
calculations concerning a lot of RAID configurations and I came to the
conclusion that RAID 6 (RAID 60 under investigation?) would be a minimum
for large storage spaces. With a student typically creating 2 TB of
storage each year and several students graduating each year, we would
require several tens or hundreds of TB of storage space. Apart from
creating a high-performance cluster, this is another concern.

So, please, do not hesitate to comment and correct me. I appreciate all
the information very much you provide.
Post by Henrik Carlqvist
regards Henrik
Grant Taylor
2019-07-19 03:43:37 UTC
Permalink
Post by K. Venken
InfiniBand = 10 Gbps and SATA-III = 6 Gbps.
I'm fairly certain that InfiniBand can be significantly faster than 10 Gbps.

According to Wikipedia, InfiniBand can be up to 250 Gbps. Much like
Ethernet, it depends /which/ /version/ of InfiniBand that you're talking
about.

Apparently, that's per link and you can have 1 / 4 / 8 / 12 links. So
that's 3,000 Gbps or 3 Tbps. }:-)
Post by K. Venken
So, if I understand this correctly, if I have a 10 Gbps InfiniBand
interface and read at maximum speed, the disk wouldn't be able to keep
up. However, if I split the read into two parts, I get 5 Gbps each
and I can send them to two other NFS servers (again over InfiniBand),
and then each can provide the data from a SATA interface. Of course,
splitting requires RAID 0. This is what I thought of: using NFS
servers to mitigate the bandwidth bottleneck of internal interfaces,
not to replace RAID by itself...
Conceptually I think that would work.

I think the numbers are off, see above.

I also suspect that something other than NFS might be a better option.
There are clustered file systems that would allow multiple servers to
access the same disk. Spread that out across multiple disks and things
start to get really parallel and really fast really quick.

Will NFS work? Sure. Is NFS the best thing? I doubt it. At least not
for the back end SAN. NFS may be quite good for the front end NAS access.

Something else InfiniBand can do is allow for MUCH faster communications
(both higher bps and lower latency) between nodes than traditional Ethernet.

InfiniBand can offer RDMA between nodes. I have no idea how it's done,
but I'm quite sure that it can be done.

It's even possible to do some of this communication over Fibre Channel.
Post by K. Venken
If you use RAID disks inside the NFS servers it makes a lot of
sense anyway (think of reliability), regardless of bandwidth
calculations. But as I mentioned, these are just wild thought
experiments, so don't hesitate to correct me.
:-)

I'd suggest doing some research on some other HPC clusters and see what
they did do, what they didn't do, and if possible why.

I'd also suggest that you read and understand the content of Wikipedia's
Clustered File System article.

Aside: I know that people question Wikipedia. But I find that it does
give terms / names / titles / concepts that help a general understanding
which can be further refined through additional research.

No offense intended, but it seems like you need to do some overview of
some other options that are out there. Then, after you get an
understanding, you can decide what is best fit for your use case.
Post by K. Venken
Thanks for the information. I have done a lot of reliability
calculations concerning a lot of RAID configurations and I came to
the conclusion that RAID 6 (RAID 60 under investigation?) would be
a minimum for large storage spaces. With a student typically creating 2
TB of storage each year and several students graduating each year, we
would require several tens or hundreds of TB of storage space. Apart from
creating a high-performance cluster, this is another concern.
Not only do you need fast storage to go with your HPC, but you need the
two connected to each other with a fast interconnect.

This is where some of the distributed file systems come into play. They
leverage a BUNCH of moderately fast connections in parallel for an
overall aggregate of REALLY FAST I/O.
Post by K. Venken
So, please, do not hesitate to comment and correct me. I appreciate
all the information very much you provide.
:-)

I have no guarantee that what I'm saying is correct, much less current.
I do hope that I'm giving you things to research to allow you to make
informed decisions. Even if that decision is to use NFS. You will be
able to say /why/ you're using NFS. ;-)
--
Grant. . . .
unix || die
K. Venken
2019-07-19 18:45:51 UTC
Permalink
Post by Grant Taylor
Post by K. Venken
InfiniBand = 10 Gbps and SATA-III = 6 Gbps.
I'm fairly certain that InfiniBand can be significantly faster than 10 Gbps.
According to Wikipedia, InfiniBand can be up to 250 Gbps.  Much like
Ethernet, it depends /which/ /version/ of InfiniBand that you're talking
about.
Apparently, that's per link and you can have 1 / 4 / 8 / 12 links.  So
that's 3,000 Gbps or 3 Tbps.  }:-)
Post by K. Venken
So, if I understand this correctly, if I have a 10 Gbps InfiniBand
interface and read at maximum speed, the disk wouldn't be able to keep
up. However, if I split the read into two parts, I get 5 Gbps each
and I can send them to two other NFS servers (again over InfiniBand),
and then each can provide the data from a SATA interface. Of course,
splitting requires RAID 0. This is what I thought of: using NFS
servers to mitigate the bandwidth bottleneck of internal interfaces,
not to replace RAID by itself...
Conceptually I think that would work.
I think the numbers are off, see above.
I also suspect that something other than NFS might be a better option.
There are clustered file systems that would allow multiple servers to
access the same disk.  Spread that out across multiple disks and things
start to get really parallel and really fast really quick.
Will NFS work?  Sure.  Is NFS the best thing?  I doubt it.  At least not
for the back end SAN.  NFS may be quite good for the front end NAS access.
Something else InfiniBand can do is allow for MUCH faster communications
(both higher bps and lower latency) between nodes than traditional Ethernet.
InfiniBand can offer RDMA between nodes.  I have no idea how it's done,
but I'm quite sure that it can be done.
It's even possible to do some of this communication over Fibre Channel.
Post by K. Venken
If you use RAID disks inside the NFS servers it makes a lot of sense
anyway (think of reliability), regardless of bandwidth calculations.
But as I mentioned, these are just wild thought experiments, so don't
hesitate to correct me.
:-)
I'd suggest doing some research on some other HPC clusters and see what
they did do, what they didn't do, and if possible why.
I'd also suggest that you read and understand the content of Wikipedia's
Clustered File System article.
Aside:  I know that people question Wikipedia.  But I find that it does
give terms / names / titles / concepts that help a general understanding
which can be further refined through additional research.
No offense intended, but it seems like you need to do some overview of
some other options that are out there.  Then, after you get an
understanding, you can decide what is best fit for your use case.
None taken, on the contrary. I appreciate the background information you
provide. It gives me a lot to read.
Post by Grant Taylor
Post by K. Venken
Thanks for the information. I have done a lot of reliability
calculations concerning a lot of RAID configurations and I came to the
conclusion that RAID 6 (RAID 60 under investigation?) would be a
minimum for large storage spaces. With a student typically creating 2 TB of
storage each year and several students graduating each year, we would
require several tens or hundreds of TB of storage space. Apart from creating
a high-performance cluster, this is another concern.
Not only do you need fast storage to go with your HPC, but you need the
two connected to each other with a fast interconnect.
This is where some of the distributed file systems come into play.  They
leverage a BUNCH of moderately fast connections in parallel for an
overall aggregate of REALLY FAST I/O.
Post by K. Venken
So, please, do not hesitate to comment and correct me. I appreciate
all the information very much you provide.
:-)
I have no guarantee that what I'm saying is correct, much less current.
I do hope that I'm giving you things to research to allow you to make
informed decisions.  Even if that decision is to use NFS.  You will be
able to say /why/ you're using NFS.  ;-)
It is clear to me now that there is a lot more technology available than
I initially thought. My research is just starting.
Grant Taylor
2019-07-19 19:35:48 UTC
Permalink
Post by K. Venken
None taken, on the contrary. I appreciate the background information you
provide. It gives me a lot to read.
:-)
Post by K. Venken
It is clear to me now that there is a lot more technology available than
I initially thought. My research is just starting.
It sounds to me like the fun is just starting. Both the good and bad
aspects. Good as in learning new things. Bad as in crap, we have to
recover the corrupt data.
--
Grant. . . .
unix || die
Henrik Carlqvist
2019-07-22 20:11:00 UTC
Permalink
Post by K. Venken
So, if I understand this correctly, if I have an infiniband 10 Gbps
interface and read at maximum speed, the disk wouldn't be able to
follow. However, If I split the read into two parts, I get 5 Gbps each
and I can send them to two other NFS servers over (again infiniband) and
then, each can provide the data from an Sata interface. Of course,
splitting requires RAID 0.
Sorry for my late reply. Yes, RAID0 with two disks will give you twice
the bandwidth for both reading and writing, but no redundancy. However, a
correctly implemented RAID1 system with two disks will also give you
twice the bandwidth for reading. With RAID1 you can read every other
stripe from every other disk, but you will need to write all stripes to
all disks.
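
A quick way to try this with Linux software RAID, as a sketch (the device
names are placeholders, and mdadm --create destroys whatever is on them):

mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
# with two concurrent streams the md driver can serve each reader from a different mirror half
dd if=/dev/md1 of=/dev/null bs=1M count=4096 iflag=direct &
dd if=/dev/md1 of=/dev/null bs=1M count=4096 skip=4096 iflag=direct &
wait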

regards Henrik
Grant Taylor
2019-07-23 01:27:38 UTC
Permalink
However, a correctly implemented RAID1 system with two disks will
also give you twice the bandwidth for reading. With RAID1 you can
read every other stripe from every other disk
Hum.

That's contrary to the RAID 1 / mirroring implementations I've used.

Yes, I can see how that /could/ be done. But that relies on the hard
drive to report if there is an error.

Conversely, my understanding is, the RAID 1 / mirroring implementations
that I've dealt with actually compare the data between the disks to
detect if they agree or not. As such, both disks need to read the same
data.
--
Grant. . . .
unix || die
Henrik Carlqvist
2019-07-23 09:57:00 UTC
Permalink
Post by Grant Taylor
However, a correctly implemented RAID1 system with two disks will also
give you twice the bandwidth for reading. With RAID1 you can read every
other stripe from every other disk
That's contrary to the RAID 1 / mirroring implementations I've used.
Yes, I am aware that some RAID1 implementations do not give you the
full bandwidth for reading. But if I remember right, Linux software RAID1
will give you full bandwidth, and so will most hardware RAID systems.
Post by Grant Taylor
Yes, I can see how that /could/ be done. But that relies on the hard
drive to report if there is an error.
Yes, RAID1 will rely on the hardware being able to report errors.
Post by Grant Taylor
Conversely, my understanding is, the RAID 1 / mirroring implementations
that I've dealt with actually compare the data between the disks to
detect if they agree or not. As such, both disks need to read the same
data.
But unless you have hardware that is able to report an error, how would
you know which disk is right and which is wrong in a 2-disk RAID1 system?

regards Henrik
Grant Taylor
2019-07-23 14:39:34 UTC
Permalink
Post by Henrik Carlqvist
But unless you have hardware being able to report an error, how
would you know which disk is right and which is wrong in a 2-disk
RAID1 system?
My understanding—which may be wrong—is that the drives need to have the
/same/ data, whatever that is. If the data is different, then checksums
are applied across the data to determine which copy passes and which one
fails. The failing copy is considered to be on the wrong drive.

At least that's what I remember from discussions 15+ years ago.
--
Grant. . . .
unix || die
Grant Taylor
2019-07-18 22:13:33 UTC
Permalink
Post by Henrik Carlqvist
I have no experience with InfiniBand, but my experience with QSFP+
and even SFP+ has taught me that unfortunately SFP+ is not a standard
connection.
I raise a finger at that statement, but will let you finish as it's germane.
Post by Henrik Carlqvist
Direct Attach cables must match the switches. Buy the wrong DA-cable
and you will find that even though it fits in the SFP+ port on the
switch, the switch might completely refuse to connect to the cable,
accept the cable with some warnings that it might not work, or in the
best case fully accept the cable.
Yep. The key phrase is "the switch might … refuse". That is a software
problem, not an electrical compatibility problem.

Yes (Q)SFP(+) transceivers have identifying information in them. Some
switches will refuse to work with a brand that they don't like. (Why
they don't like it is immaterial.)

That's a /software/ restriction, not an /electrical/ restriction.

We use a number of switches at work that by default refuse to use
off-brand optics / cables. But our switches have a hidden command that
you can run to tell them to use off-brand optics / cables.

To me, this is more vendor lock in than anything else.

I'd be willing to bet that if you turn up debugging far enough, you will
see that the switch sees the (Q)SFP(+), communicates with it, finds out
that it's off-brand and then refuses to work with it.

IMHO the simple fact that the switch does this is proof that it could
use the (Q)SFP(+) if it wanted to. Thus the (Q)SFP(+) /is/ compatible. ;-)
Post by Henrik Carlqvist
Things can get really interesting when you try to connect two different
switches with a DA-cable.
Ya. Get two big brand switches that are notorious for this @^*% and you
can be in a tough spot if you want to use a DAC or AOC. But you can put
each vendor's preferred optic in, any fiber patch cord between them, and
things are perfectly fine.

Thankfully my employer is big enough and has enough buying power that we
can go back to the big players in the market and tell them to stop it or
we will take our money elsewhere. We bring enough money to the table
Post by Henrik Carlqvist
I also like NFS but don't think of RAID as an alternative technology.
Agreed on both accounts.
Post by Henrik Carlqvist
Instead I prefer to have my NFS servers exporting data from RAID disks.
Raid gives both more bandwidth and protection against crashed disks.
Yep.
Post by Henrik Carlqvist
Unfortunately I have been bitten by situations when RAID5 has not been
enough. When a disk broke the RAID5 system started rebuilding to a hot
spare. The rebuild caused a lot of disk accesses which turned out that
another disk was also broken, with two failed drives all data was lost.
Been there. Done that. No fun.

I've also found that sometimes you can offline the entire array and
spare disk, re-insert the array, let it come back up in a degraded
state, vacate the data, then blow the array away and rebuild a new one.
Post by Henrik Carlqvist
So nowadays I always use RAID6 instead. In my experience a hardware
RAID6 system with about 20 mechanical disks is able to give about 1.5
GB/s on big files which is more than enough to make a 10 Gb/s network
the bottleneck.
You're getting into the number of disks where I start to get
uncomfortable with even RAID 6.

I've become quite fond of ZFS's RAID ability where I can easily create a
RAID of underlying RAIDs. Any RAID level on top of any other RAID
level.  I can easily do three parity disks (worth of space) in a RAIDZ3
(sub)pool.
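
For instance (a sketch only, disk names invented), a single pool striped
across two triple-parity groups looks like:

zpool create tank \
    raidz3 sda sdb sdc sdd sde sdf sdg \
    raidz3 sdh sdi sdj sdk sdl sdm sdn

ZFS then stripes writes across the two raidz3 vdevs, which is the "RAID
level on top of another RAID level" layering being described.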

ZFS also has the nice feature that it only rebuilds the part of the disk
that's actually in use. No need to rebuild the part of the disk that
hasn't been touched yet.
Post by Henrik Carlqvist
However, accessing many small files, like compiling programs will
not give that bandwidth. Even with 20+ SSDs in RAID the latencies
for opening files will give you so low bandwidth that a simple 1Gb/s
network will be more than enough.
Ya. That's when there is not enough actual disk I/O to compete with the
other overhead that is happening with file handles, NAS / SAN protocols,
etc.

This is why network gear lists (millions of) packets per second of
specific sizes. Smaller packets are harder because of the percentage of
overhead vs actual packet payload.
--
Grant. . . .
unix || die
K. Venken
2019-07-18 23:20:16 UTC
Permalink
Post by Grant Taylor
Post by Henrik Carlqvist
I have no experience from infiniband but my experience from QSFP+ and
even SFP+ has taught me that unfortunately SFP+ is no standard
connection.
I raise a finger at that statement, but will let you finish as it's germane.
Post by Henrik Carlqvist
Direct Attach cables must match the switches. Buying the wrong
DA-cable and you will find that even though it fits in the SFP+ port
on the switch the switch might completely refuse to connect to the
cable, accept the cable with some warnings that it might not work or
in the best case fully accept the cable.
Yep.  The key phrase is "the switch might … refuse".  That is a software
problem, not an electrical compatibility problem.
Yes (Q)SFP(+) transceivers have identifying information in them.  Some
switches will refuse to work with a brand that they don't like.  (Why
they don't like it is immaterial.)
That's a /software/ restriction, not an /electrical/ restriction.
We use a number of switches at work that by default refuse to use
off-brand optics / cables.  But our switches have a hidden command that
you can run to tell them to use off-brand optics / cables.
To me, this is more vendor lock in than anything else.
I'd be willing to bet that if you turn up debugging far enough, you will
see that the switch sees the (Q)SFP(+), communicates with it, finds out
that it's off-brand and then refuse to work with it.
IMHO the simple fact that the switch does this is proof that it could
use the (Q)SFP(+) if it wanted to.  Thus the (Q)SFP(+) /is/ compatible.
;-)
I apologize for the comment, but that sounds "gross". I won't have the
budget (support) for this, so it won't be an option if I can't guarantee
interchangeability. On the other hand, if a vendor can guarantee that
its interface boards work with Slackware 14.2 and that its switches can
be used (of course), that would get me a step further.

Am I mistaken if I expect InfiniBand to be standard and uniformly
supported?
Post by Grant Taylor
Post by Henrik Carlqvist
Things can get really interesting when you try to connect two
different switches with a DA-cable.
can be in a tough spot if you want to use a DAC or AOC.  But you can put
each vendor's preferred optic in, any fiber patch cord between them, and
things are perfectly fine.
Thankfully my employer is big enough and has enough buying power that we
can go back to the big players in the market and tell them to stop it or
we will take our money elsewhere.  We bring enough money to the table
Post by Henrik Carlqvist
I also like NFS but don't think of RAID as an alnternative technology.
Agreed on both accounts.
Post by Henrik Carlqvist
Instead I prefer to have my NFS servers exporting data from RAID
disks. Raid gives both more bandwidth and protection against crashed
disks.
Yep.
Post by Henrik Carlqvist
Unfortunately I have been bitten by situations when RAID5 has not been
enough. When a disk broke the RAID5 system started rebuilding to a hot
spare. The rebuild caused a lot of disk acesses which turned out that
another disk was also broken, with two failed drives all data was lost.
Been there.  Done that.  No fun.
I've also found that sometimes you can offline the entire array and
spare disk, re-insert the array, let it come back up in a degraded
state, vacate the data, then blow the array away and rebuild a new one.
Post by Henrik Carlqvist
So nowadays I allways use RAID6 instead. In my experience a hardware
RAID6 system with about 20 mechanical disks is able to give about 1.5
GB/ s on big files whih is more than enough to make a 10 Gb/s network
the bottleneck.
You're getting into the number of disks where I start to get
uncomfortable of even RAID 6.
I've become quite fond of ZFS's RAID ability where I can easily create a
RAID of underlying RAIDs.  Any RAID level on top of any other RAID
level.  I can easily do three parity disks (worth of space) in a ZRAID3
(sub)pool.
ZFS also has the nice feature that it only rebuilds the part of the disk
that's actually in use.  No need to rebuild the part of the disk that
hasn't been touched yet.
Post by Henrik Carlqvist
However, accessing many small files, like compiling programs will not
give that bandwidth. Even with 20+ SSDs in RAID the latencies for
opening files will give you so low bandwidth that a simple 1Gb/s
network will be more than enough.
Ya.  That's when there is not enough actual disk I/O to compete with the
other overhead that is happening with file handles, NAS / SAN protocols,
etc.
This is why network gear lists (millions of) packets per second of
specific sizes.  Smaller packets are harder because of the percentage of
overhead vs actual packet payload.
Grant Taylor
2019-07-19 03:50:45 UTC
Permalink
Post by K. Venken
I apologize for the comment, but that sounds "gross".
~chuckle~

What does? The idea of telling switches to shut up and use what works
despite being off brand? Or the idea that switches want to enforce
vendor lock in?
Post by K. Venken
I won't have the budget (support) for this, so it won't be an option if
I can't guarantee interchangeability.
Oh. That's a dangerous slope.
Post by K. Venken
On the other hand, if a vendor can guarantee that his interface boards
can work with Slackware 14.2 and his switches can be used (of course),
I would be a step further.
I'd be surprised if you'll find a vendor that will guarantee that their
solution works with /Slackware/. You will likely find multiple vendors
that will guarantee that their solution works with /RHEL/ or /CentOS/ or
maybe even /Ubuntu/. But that's more marketing than it is anything
technical.

Yes, you can absolutely find vendors that will guarantee that their
cards will work with their cables will work with their switches.

But the OS used and drivers to talk to their cards? I'd be surprised if
they specifically list /Slackware/.

I'd like to be wrong. But I won't hold my breath.

(This is why I said dangerous slope.)
Post by K. Venken
Am I mistaken if I expect Infiniband to be standard and uniformally
supported?
"That depends."

/Slackware/? Possibly ~> probably.

/Linux/ in general? I think it's a safe expectation.

Convincing the vendor's help desk that /Slackware/ is just as much Linux
as /RHEL/ or /Ubuntu/ and that they should support it? ¯\_(ツ)_/¯
--
Grant. . . .
unix || die
K. Venken
2019-07-19 19:05:50 UTC
Permalink
Post by Grant Taylor
Post by K. Venken
I apologize for the comment, but that sounds "gross".
~chuckle~
What does?  The idea of telling switches to shut up and use what works
despite being off brand?  Or the idea that switches want to enforce
vendor lock in?
vendor lock in...
Post by Grant Taylor
Post by K. Venken
I won't have the budget (support) for this, so it won't be an option
if I can't guarantee interchangeability.
Oh.  That's a dangerous slope.
Some background. When I started, I initially proposed to *just* update
the software as some applications could not be upgraded anymore. So I
asked for some time and assistance. Updating hardware wasn't initially
in the scope; it has a much more visible cost and needs to be well
motivated. In a cluster (even a very small one) everything is multiplied.
Post by Grant Taylor
Post by K. Venken
On the other hand, if a vendor can guarantee that his interface boards
can work with Slackware 14.2 and his switches can be used (of course),
I would be a step further.
I'd be surprised if you'll find a vendor that will guarantee that their
solution works with /Slackware/.  You will likely find multiple vendors
that will guarantee that their solution works with /RHEL/ or /CentOS/ or
maybe even /Ubuntu/.  But that's more marketing than it is anything
technical.
Yes, you can absolutely find vendors that will guarantee that their
cards will work with their cables will work with their switches.
But the OS used and drivers to talk to their cards?  I'd be surprised if
they specifically list /Slackware/.
I'd like to be wrong.  But I won't hold my breath.
And there is that. It is not the first time that I get the response that
Slackware is not a supported platform, even though Linux is supported.
Choosing Slackware, in retrospect, may have been too eager, but it is the
one I know best. (I have been a user since 13.37.)
Post by Grant Taylor
(This is why I said dangerous slope.)
Post by K. Venken
Am I mistaken if I expect Infiniband to be standard and uniformally
supported?
"That depends."
/Slackware/?  Possibly ~> probably.
/Linux/ in general?  I think it's a safe expectation.
Convincing the vendor's help desk that /Slackware/ is just as much Linux
as /RHEL/ or /Ubuntu/ and that they should support it?  ¯\_(ツ)_/¯
Grant Taylor
2019-07-19 19:42:38 UTC
Permalink
Post by K. Venken
vendor lock in...
Ah. Yes. I /HATE/ vendor lock in.
Post by K. Venken
Some background. When I started, I initially proposed to *just* update
the software as some applications could not be upgraded anymore. So I
asked for some time and assistance. Updating hardware wasn't initially
not in the scope and has a much more visible cost and needs to be well
motivated. In a cluster (even a very small one) everything is multiplied.
Yep.

That likely means that either scope will change and expand, or it will
behoove you and your team to leverage the hardware that you already
have. Fair enough.
Post by K. Venken
And there is that. It is not the first time that I get the response that
Slackware is not a supported platform, even though Linux is supported.
Yep. Trust yourself. If something can work under big name
distributions then there is a 95% chance that it can work in Slackware
too. You just have to take the onion and start peeling away the layers
until you learn the core piece that you need to install / configure /
add on Slackware.

Yes, it may get tricky if you need to bring binary modules across. But
it was possible to do that the last time I needed to. It just took
effort and a good understanding. But it was worth it.
Post by K. Venken
Choosing Slackware, in retrospect, may have been to eager but it is the
one I know best. (I am a user at since 13.37)
Fair enough.

I started with Slackware '96 in '98 and have been wandering all over the
place since then. I tend to prefer things that allow me to get down
into the files and configure things the way that I want to. Understand
each Lego brick, what it does, how it does it, and build the set that
you want.

I did things with Slackware that weren't possible to do with popular
distributions, as their init scripts were limiting to work within. If
I'm going to have to completely create something from scratch, I'll do
it somewhere with fewer limitations.
--
Grant. . . .
unix || die
Grant Taylor
2019-07-18 21:55:53 UTC
Permalink
Post by K. Venken
Point noted...
:-)
Post by K. Venken
I guess a lot depends on how you set it up,... There will probably other
results, but here is a reference
https://www.jdieter.net/posts/2017/08/14/benchmarking-small-file-performance-on-distributed-filesystems/
$ReadingList++
Post by K. Venken
and, admitted I am probably biased, but NFS is sound and proven
technology, optimized and improved over decades...
Fair.
Post by K. Venken
My personal experience - but it's with smaller NAS and simple servers...
- Nothing beats a proper ftp client and server, it always achieves the
maximum capacity of the weakest link regardless if it's WAN, LAN. FTP is
designed for that. At least for 'not to small files'
I can see how FTP can get better throughput than a NAS protocol.

Though FTP has its own hurdles.  Some of them are non-trivial.
Post by K. Venken
- NFS comes next, way better then Samba, or SMB, but that's heavily
contested these days. I'll leave it open if it is still correct, but if
I can avoid Samba, I would.
NFS also has issues, but I think they are fewer than FTP.
Post by K. Venken
- nothing beats your own Linux (Slackware of course) server over
commercial NAS, media player,... Not because it's better but because you
can leave out things you don't want and sometimes, just sometimes, it
matters.
Agreed.
Post by K. Venken
NFS has a lot of other tricks in its sleeves. (Did I mention that I am
biased?)
Yep.
Post by K. Venken
- you can export read only systems (like /export/opt) readonly and use
async. No locking needed, it's safe as you only read. It's fast!
I knew that NFS could have read-only exports.  I hadn't thought about
whether the backing file system is read-only vs read-write.

Multiple NFS filers (NAS gateways) with read-only exports of the same
SAN file system mounted read-only would be entertaining.  }:-)
Post by K. Venken
- you can create a truly shared directory where everybody is the same
using all_squash option. All permissions gone just like that.
I was not aware of the all_squash option. I'll have to read up on that.

I think I'd rather have properly synchronized UIDs & GIDs and / or NFSv4
which accounts for differences. Chances are good that I'm going to want
things to be synchronized anyway for multiple other reasons.
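
For reference, the sort of /etc/exports lines being talked about would
look roughly like this (paths and the client network are made up for
illustration):

/export/opt  192.168.10.0/24(ro,async,no_subtree_check)
/export/tmp  192.168.10.0/24(rw,async,all_squash,anonuid=65534,anongid=65534,no_subtree_check)

The first is the read-only, async export; the second squashes every
client UID/GID to the anonymous user, so the share behaves like one big
shared scratch area.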
Post by K. Venken
- you can use the sticky bit and create a global large tmp file system
Presuming that you have UID & GID synchronization handled, sure.

I like Kerberized NFS. }:-)

I think I saw that NFS supported some form of encryption when I was
messing with Kerberized NFS. Though I didn't look into any detail
beyond the fact that it did.
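
If I remember right, that encryption is the sec= mount option: krb5 is
authentication only, krb5i adds integrity checking, and krb5p adds
privacy (encryption on the wire). A client mount would look roughly like
this (hostname and path invented):

mount -t nfs4 -o sec=krb5p nfsserver.example.com:/export/home /mnt/home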
Post by K. Venken
I haven't investigated other cluster filesystems sufficient to know if
this is possible or not, so I won't claim NFS is the holy grail, but I
like it.
The little that I've done with GFS(2) made me think that, from an end
user point of view, there was no difference in interacting with, or
permissions on, GFS(2) compared to Ext2/3/4, ReiserFS, or ZFS.
Post by K. Venken
Here is another thought experiment. (If you feel up to it...)
Challenge accepted.
Post by K. Venken
If you create a RAID 1, 5, 6, you need to write more data with a factor
more then 1. Let's take a RAID 5 (3 disks, 1 redundant) you write 1.5
times.
Please defend / explain your 1.5 times figure.

To me, You're writing ~⅓ on each drive with the 4th drive being parity
information. (I'm ignoring the data vs parity rotation for now.) I
would think that it would be 1⅓ times data for a four drive RAID 5.
Adding a fifth drive (4 data + 1 parity) would make it 1¼.

Did your 1.5 times number by chance come from a 3 drive (2 data + 1
parity) RAID 5?

My algorithm is 1 + (<number of parity drive(s)> / <number of data
drives>) times data.

You have to write the data across the data drives. That's 1. Then you
write the same number of blocks to the parity drive as was written to
one of the data disks. Thus all disks consume the same number of
blocks. Just that one of them is parity.

At least that's my logic. I'm happy to have someone explain where I'm
wrong.
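
A quick sanity check of that formula at the shell (bc used purely as a
calculator here):

# write amplification = (data + parity) / data
echo "scale=3; (2+1)/2" | bc     # 3-disk RAID 5  -> 1.500
echo "scale=3; (3+1)/3" | bc     # 4-disk RAID 5  -> 1.333
echo "scale=3; (4+1)/4" | bc     # 5-disk RAID 5  -> 1.250
echo "scale=3; (18+2)/18" | bc   # 20-disk RAID 6 -> 1.111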
Post by K. Venken
Normally, internal busses (SATA,...) are faster then network
interfaces, but things change using infiniband and bonding.
Yep.
Post by K. Venken
So let's go wild. Create 3 NFS servers exporting one bare file. Lets
create an additional NFS server importing these 3 files, creating a
loopback device on each, assembling them in a RAID 5 and exporting it
again to all the nodes of the cluster. In this last NFS server has
a dedicated ethernet connection to each of the other 3 NFS servers
and one to the internal cluster network, the performance degradation
due to RAID 5 setup can be mitigated. With NFS this is a rather easy
setup.
Why are you using NFS for this? Why are you using loopback devices? Is
this a case of the things that you're familiar with?

If we are applying RAID mentality to this, then I'm guessing that the
three back end servers are RAID 5 and protecting the file. Thus the
front end server can do a RAID 0 stripe across the other three devices
from back end servers. (If you will forgive the loose analogy for the
sake of discussion.)

This sounds like a RAID 0 stripe across three RAID 5 arrays. So you end
up with the protection of RAID 5 and performance of RAID 0.
Post by K. Venken
But at this point you might want to consider drbd.
I've done exceedingly little with DRBD, so I may be way off base here.
But I thought the point of DRBD was to replicate block devices between
servers. Thus the final central server would need to have as much disk
space as the first three servers combined.

I would think that something like iSCSI, or other SAN protocol, would be
a better choice here. The three back end servers export something which
they protect via RAID 5. Then the front end server accesses them,
aggregates them together in a RAID 0 before further exporting the
aggregate to clients.

There are multiple products on the market that play this game of tiered
storage protection. ;-) IBM's SVC product, Oracle's Exadata, and I'm
confident something from NetApp does this. I think even EMC's
CLARiiON-120 probably qualifies here.
Post by K. Venken
OK, this goes way to far, but this is the flexibility you have with
proper well designed mechanisms. And I apologize to be too creative at
times.
~chuckle~

Yep. Having good Lego bricks and truly understanding them allows you to
build some really creative things. The question of if you should build
them or not is completely different and may have a different answer.
--
Grant. . . .
unix || die
K. Venken
2019-07-18 23:06:17 UTC
Permalink
Post by Grant Taylor
Post by K. Venken
Point noted...
:-)
Post by K. Venken
I guess a lot depends on how you set it up,... There will probably
other results, but here is a reference
https://www.jdieter.net/posts/2017/08/14/benchmarking-small-file-performance-on-distributed-filesystems/
$ReadingList++
Post by K. Venken
and, admitted I am probably biased, but NFS is sound and proven
technology, optimized and improved over decades...
Fair.
Post by K. Venken
My personal experience - but it's with smaller NAS and simple servers...
- Nothing beats a proper ftp client and server, it always achieves the
maximum capacity of the weakest link regardless if it's WAN, LAN. FTP
is designed for that. At least for 'not to small files'
I can see how FTP can get better throughput than a NAS protocol.
Though FTP has it's own hurtles.  Some of them are non-trivial.
Post by K. Venken
- NFS comes next, way better then Samba, or SMB, but that's heavily
contested these days. I'll leave it open if it is still correct, but
if I can avoid Samba, I would.
NFS also has issues, but I think they are fewer than FTP.
Post by K. Venken
- nothing beats your own Linux (Slackware of course) server over
commercial NAS, media player,... Not because it's better but because
you can leave out things you don't want and sometimes, just sometimes,
it matters.
Agreed.
Post by K. Venken
NFS has a lot of other tricks in its sleeves. (Did I mention that I am
biased?)
Yep.
Post by K. Venken
- you can export read only systems (like /export/opt) readonly and use
async. No locking needed, it's safe as you only read. It's fast!
I knew that NFS could have read-only exports.  I hadn't thought about if
the backing file system is read-only vs read-write.
Multiple NFS filters (NAS gateways) with read-only exports of the same
SAN file system mounted as read-only would be entertaining.  }:-)
Post by K. Venken
- you can create a truly shared directory where everybody is the same
using all_squash option. All permissions gone just like that.
I was not aware of the all_squash option.  I'll have to read up on that.
I think I'd rather have properly synchronized UIDs & GIDs and / or NFSv4
which accounts for differences.  Chances are good that I'm going to want
things to be synchronized anyway for multiple other reasons.
That's where I use the "triumvirate" NIS/NFS/automount. NIS is for
synchronizing accounts, groups etc... It's not much appreciated these
days as most people want LDAP, fair enough, but NIS is convenient,
especially if you can confine it to, for instance, a private cluster
network.
Post by Grant Taylor
Post by K. Venken
- you can use the sicky bit and create a global large tmp file system
Presuming that you have UID & GID synchronization handled, sure.
I like Kerberized NFS.  }:-)
I think I saw that NFS supported some form of encryption when I was
messing with Kerberized NFS.  Though I didn't look into any detail
beyond the fact that it did.
Post by K. Venken
I haven't investigated other cluster filesystems sufficient to know if
this is possible or not, so I won't claim NFS is the wholy grail, but
I like it.
The little that I've done with GFS(2) made me think that from an end
user point of view, there was no difference in interacting with or
permissions on GFS(2) than Ext2/3/4 or ReiserFS or ZFS.
Post by K. Venken
Here is another thought experiment. (If you feel up to it...)
Challenge accepted.
Post by K. Venken
If you create a RAID 1, 5, 6, you need to write more data with a
factor more then 1. Let's take a RAID 5 (3 disks, 1 redundant) you
write 1.5 times.
Please defend / explain your 1.5 times figure.
RAID 5 has 3 disks with one redundant. If you write N bytes, N/2 goes to
disk 1 and N/2 to disk 2, and the parity (again N/2) of disks 1 and 2
goes to disk 3. So this is 3 times 1/2, which is 3/2 or 1.5 ;-)

In RAID 0 you would just spread data...
Post by Grant Taylor
To me, You're writing ~⅓ on each drive with the 4th drive being parity
information.  (I'm ignoring the data vs parity rotation for now.)  I
would think that it would be 1⅓ times data for a four drive RAID 5.
Adding a fifth drive (4 data + 1 parity) would make it 1¼.
OK, if you have more disks, the 1.5 wouldn't hold, correct observation!
Had to consider that. So the correct formulation would be N / (N-1)?
And for RAID 6 it would be N / (N-2).

For what it's worth, I had the impression that RAID 6 is more reliable than ....

OK, take 4 disks: you can organize them as RAID 10 or RAID 6 for the
same capacity, and RAID 6 would be more reliable... Still not finished
the reliability calculation.
Post by Grant Taylor
Did your 1.5 times number by chance come from a 3 drive (2 data + 1
parity) RAID 5?
My algorithm is 1 + (<number of parity drive(s)> / <number of data
drives>) times data.
You have to write the data across the data drives.  That's 1.  Then you
write the same number of blocks to the parity drive as was written to
one of the data disks.  Thus all disks consume the same number of
blocks.  Just that one of them is parity.
At least that's my logic.  I'm happy to have someone explain where I'm
wrong.
Post by K. Venken
Normally, internal busses (SATA,...) are faster then network
interfaces, but things change using infiniband and bonding.
Yep.
Post by K. Venken
So let's go wild. Create 3 NFS servers exporting one bare file. Lets
create an additional NFS server importing these 3 files, creating a
loopback device on each, assembling them in a RAID 5 and exporting it
again to all the nodes of the cluster. In this last NFS server has a
dedicated ethernet connection to each of the other 3 NFS servers and
one to the internal cluster network, the performance degradation due
to RAID 5 setup can be mitigated. With NFS this is a rather easy setup.
Why are you using NFS for this?  Why are you using loopback devices?  Is
this a case of the things that you're familiar with?
Hmmm, it seems I need to provide a more detailed example to explain. I
thought I would go into too much detail, but why not, let's give the
commands to do this. Let's have four stations which can see each other
over the network.

Each of the 3 (sub)stations creates a file in /export/dnd, e.g.
/export/dnd/slice, and exports it over NFS. So you have
substation-1:/export/dnd/... exported by NFS.
So the 4th station can mount the 3 NFS exports, e.g. under
/mnt/net/substation-{1,2,3}, each holding a slice file. Now, here is the
trick. You can create a loopback device for each of these 3 files:

for i in 1 2 3; do
    losetup /dev/loop$i /mnt/net/substation-$i/slice
done

So you end up with 3 block devices, each backed by a file on a remote
NFS server. Now it's easy to combine those three devices into one (RAID)
device:

mdadm -Cv -l5 -c64 -n3 -pls /dev/md0 /dev/loop1 /dev/loop2 /dev/loop3

and export this /dev/md0 somehow.

OK, that's over the top. But what you end up with is that when you write
to the server's /dev/md0, you actually write to three remote NFS files,
spreading the throughput.

Here it comes. The 4th NFS server holding /dev/md0 exports this again.
So when one of the nodes writes to the server's export, it writes to
/dev/md0, which writes to 3 files on the remote NFS servers!!!

Weird, absolutely! Fun, for sure. Useful? I don't know yet... But it
seems that the internal bottleneck can be avoided?
Post by Grant Taylor
If we are applying RAID mentality to this, then I'm guessing that the
three back end servers are RAID 5 and protecting the file.  Thus the
front end server can do a RAID 0 stripe across the other three devices
from back end servers.  (If you will forgive the loose analogy for the
sake of discussion.)
Well, I take into account that any of the 3 backend servers can "crash"
so that the other 2 can continue. (Funny side remark, does Slackware
crash???) Repairing/reinstalling the crashed server should
"automatically" fix the problem.

As a side remark: as only the master server ever gets to write to the 3
slave NFS servers, it is safe to use the async option for this specific
case! So the client nodes use sync, but it's fast because the 3 servers
are using async, so they return immediately. That could improve throughput.
Post by Grant Taylor
This sounds like a RAID 0 stripe across three RAID 5 arrays.  So you end
up with the protection of RAID 5 and performance of RAID 0.
Post by K. Venken
But at this point you might want to consider drbd.
I've done exceedingly little with DRBD, so I may be way off base here.
But I thought the point of DRBD was to replicate block devices between
servers.  Thus the final central server would need to have as much disk
space as the first three servers combined.
As did I... I am just starting to learn all the technologies and systems
to create a cluster. So I am expecting that most of my ideas won't make
that much sense. But it's exciting.
Post by Grant Taylor
I would think that something like iSCSI, or other SAN protocol, would be
a better choice here.  The three back end servers export something which
they protect via RAID 5.  Then the front end server accesses them,
aggregates them together in a RAID 0 before further exporting the
aggregate to clients.
Yes, fully agree. My ideas might be completely over the top, and as I
mentioned, using established solutions is a better idea. That's why I am
still looking at other solutions, but understanding the mechanisms can
help in deciding.
Post by Grant Taylor
There are multiple products on the market that play this game of tiered
storage protection.  ;-)  IBM's SVC product, Oracle's Exadata, and I'm
confident something from NetApp does this.  I think even EMC's
CLARiiON-120 probably qualifies here.
Post by K. Venken
OK, this goes way to far, but this is the flexibility you have with
proper well designed mechanisms. And I apologize to be too creative at
times.
~chuckle~
Yep.  Having good Lego bricks and truly understanding them allows you to
build some really creative things.  The question of if you should build
them or not is completely different and may have a different answer.
Thanks. I really like this comment very much. I try to motivate my sons
in this direction (solid building bricks instead of predefined solutions
- Playmobil) but my parental skills might not be perfect either :-)
Grant Taylor
2019-07-19 04:58:18 UTC
Permalink
Post by K. Venken
That's where I use the "triumvirate" NIS/NFS/automount. NIS is for
sunchronizing accounts, groups etc... It's not much appreciated these
days as most people want LDAP, fair enough, but NIS is convenient,
especially if you can confine it to, for instance, a private cluster
network.
I have never really used NIS(+), but I am aware of it. I spent some
time earlier this year labbing various things.  I did end up going with
LDAP for the following reason.

NIS(+) has a serious security concern in that you can't (easily) control
which NIS(+) server responds, including a rogue NIS(+) server responding
faster than the legitimate NIS(+) server(s).  Thus a security vulnerability
is introduced.

Conversely, LDAP is explicit configuration and avoids the aforementioned
security concern.

Such account / directory centralization (vs synchronization) works quite
well.

I would also /strongly/ recommend that you look into Kerberos. I
learned some very interesting things about Kerberos earlier this year
and am seriously considering deploying it at home for as much as possible.
Post by K. Venken
RAID 5 has 3 disks with one redundant. If you write N bytes, N/2 goes to
disk 1, also to disk 2 and parity (agian N/2) of disk 1 and 2 goes to 3.
So this is 3 times 1/2 which is 3/2 or 1.5 ;-)
It sounds like you are talking about a three disk RAID 5 configuration
(2 data + 1 parity). I was talking about a four disk RAID 5 configuration.

RAID 5 can work with three or more disks. (I've not found a top number
of disks. Other problems start to arise.)
Post by K. Venken
In RAID 0 you would just spread data...
Yep. That's why I wouldn't use it with any data that I cared the least
bit about. At least not without striping across block devices that were
themselves some form of redundant.
Post by K. Venken
Ok, If you have more disks, the 1.5 wouldn't hold, correct observation!
Had to ccnsider that. So the correct formulation would be N / (N-1) ?
And for RAID 6 it would be N / (N+2)
for what's worth I had the impression that RAID 6 is more reliable then ....
Yes, RAID 6 is more reliable in that it has multiple disks' worth of
protection.  So if you lose two disks, your data is now unprotected,
but still accessible.
Post by K. Venken
OK Take 4 disks you can organize them RAID 10 or RADI 6 for the same
capacity, RAID 6 would be more reliable... Still not finished the
reliability calculation.
RAID 10 is really two levels of RAID, 1 and 0.  It's not fair to try to
apply a formula for single level RAID to multi-level RAID.

I would rather use RAID 6 than RAID 1 and 0 in combination.  I say this
because it's possible to lose all data if two /specific/ disks die in
RAID 10. RAID 6 will still protect your data if /any/ two disks die.

Say RAID 10 is a RAID 1 mirror applied across two RAID 0 stripes. If I
take out a disk in both RAID 0 stripes, then both RAID 0 stripes fail
and take the RAID 1 mirror out and all data is lost.

Say RAID 10 is a RAID 0 striped applied across two RAID 1 mirrors. If I
take out both disks in either RAID 1 mirror, then that RAID 1 mirror
fails and takes out the RAID 0 stripe at the same time.

RAID 10 is susceptible to data loss if two /specific/ disks are lost.

Conversely, any two disks in RAID 6 can die and the data is still available.

To me, RAID 6 is safer than RAID 10 for four drive RAIDs.
Post by K. Venken
Hmmmm., I seem to provide a more detailed example to explain. I thought
I would go into too much details, but why not, lets give the commands to
do this. Lets have four stations which can see each other over the network.
Each of the 3 (sub)stations creates a file in /export/dnd eg.
/export/dnd/slice and exports it (NFS} So you have
substation-1:/export/dnd/... exported by NFS
So the 4th station can import 3 NFS folders substation.../dnd holding an
slice file. Now, here is the trick. You can create a loopback for each
losetup /dev/loop(i) otherserver:/export/dnd/slice
So you end up with 3 devices pointing to a remote NFS file. Now, its
for all 3, do
losetup /dev/loop{i} /mnt/net/srvr-I-/slive
Now combine them into
mdadm -Cv -l5 -c64 -n3 -pls /dev/md0 /dev/loop{1,2,3}
end export this /dev/md0 somewhay
OK, thats over the top. But what you end up with is that when you write
to seerver:/dev/md0, you actually write to tree remote NFS files
spreading the throughput.
Here it comes. The 4th NFS server holding /dev/md0 exports this again.
So when one of the nodes writes to the servers export, it writes to the
/dev/md0 which writes to 3 files on a remote NFS server!!!
I completely understood the concept of what you are saying.

I think that you are going to run into some performance issues with
multiple aspects.

1) I question if NFS is the most optimal method to access data (in this
scenario).
2) I think the loopback device is going to be sub-optimal.
3) Aggregating the multiple loopbacks will have some overhead.

Will it work? Yes. Will it be optimal? I doubt it.

Here's the scenario as I see it:

A) Export a directory containing a file on the first three machines.
B) Mount the exports on the fourth machine.
C) Create a loopback device for each file.
D) Aggregate the multiple loopbacks into a new device.

Alter the scenario a little bit.

I)  Export the "file(s)" via iSCSI (et al.) on the first three machines.
II) Connect the fourth machine to the iSCSI targets. This inherently
gives you a device and doesn't require the loopback.
III) Aggregate the multiple devices.

I think that both I and II are much more optimized than A and B. You
completely eliminate C.
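
A minimal sketch of what II and III look like from the fourth machine,
assuming the three back ends already expose their slices as iSCSI
targets (the target name and host names are invented for illustration):

# discover and log in to each back end target
for i in 1 2 3; do
    iscsiadm -m discovery -t sendtargets -p substation-$i
    iscsiadm -m node -T iqn.2019-07.local.cluster:slice -p substation-$i --login
done

# each login shows up as an ordinary SCSI disk, e.g. /dev/sdb /dev/sdc
# /dev/sdd, so mdadm can aggregate them directly -- no loopbacks needed
mdadm -Cv -l5 -c64 -n3 /dev/md0 /dev/sdb /dev/sdc /dev/sdd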

We can alter the scenario even more by using newer distributed file systems.

1) Export the back end disks via a DFS on the first three machines.
2) Connect the fourth machine using DFS client.

The DFS file system deals with redundancy across multiple storage
servers, is massively parallel, and can be directly used by many clients
at the same time, thus negating the need for the fourth machine to
aggregate and re-share.
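
To make that concrete, here is roughly what it looks like with
GlusterFS, picked only as one example of a DFS (host, brick and volume
names invented):

# on the storage machines, run once from any one of them:
gluster peer probe substation-2
gluster peer probe substation-3
gluster volume create scratch replica 3 \
    substation-1:/export/brick substation-2:/export/brick substation-3:/export/brick
gluster volume start scratch

# on every compute node:
mount -t glusterfs substation-1:/scratch /mnt/scratch

The clients talk to all of the bricks directly, so there is no single
machine in the middle to bottleneck on or to lose.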
Post by K. Venken
Weard, absolutely! Fun, for sure. Useful? I don't know yet... But it
seams that the internal bottleneck can be avoided ?
Well, I take into account that any of the 3 backend server can "crash"
so the other 2 can continue.
This is a requirement in any solution that you use.

What happens if the fourth machine crashes? All clients completely
lose access to the data.

Note:  DFSs avoid this particular problem due to how they operate.
Post by K. Venken
(Funny side remark, does slackware crash???). Repairing/reinstalling
the crashed server should "automatically" fis the problem
It sure can.
Post by K. Venken
As a side remark. As only the master server ever gets to write to the 3
slave NFS servers, it is safe to use option async for this specific
case! So the client nodes use sync, bit it's fast as the 3 servers are
using async so they immediately return. Could improve throughput,
Maybe.

Check out Distributed File Systems. The fan out method that you've
described can only scale so far. DFS can scale out to hundreds /
thousands of back end storage systems and they can all be directly
accessed by client machines. There is no central layer to bottleneck on.
Post by K. Venken
As did I... I am just starting to learn all the technologies and systems
to create a cluster. So I am expecting that most of my ideas wont make
that much sense. But it's exciting.
I think it's important to talk through things and understand them. That
way you can make the informed decisions. ;-)
Post by K. Venken
Yes, fully agree. My ideas might be completely over the top, and as I
mentioned, using established solutions is a better idea.
Don't conflate your unfamiliarity with a solution with that solution
not being established.  ;-)
Post by K. Venken
That's why I am looking to other solutions still, but understanding
the mechanisms can help in deciding.
Yep.
Post by K. Venken
Thanks. I really like this comment very much. I try to motivate my sons
in this direction (solid building bricke iso predefined solutions -
plymobil) but my parental skills might not be perfect neither :-)
:-)
--
Grant. . . .
unix || die
K. Venken
2019-07-19 19:55:54 UTC
Permalink
Post by K. Venken
That's where I use the "triumvirate" NIS/NFS/automount. NIS is for
sunchronizing accounts, groups etc... It's not much appreciated these
days as most people want LDAP, fair enough, but NIS is convenient,
especially if you can confine it to, for instance, a private cluster
network.
I have never really used NIS(+), but I am aware of it.  I spent some
time earlier this year labing various things.  I did end up going with
LDAP for the following reason.
NIS(+) has a serious security concern in that you can't (easily) control
what NIS(+) server responds.  Including a rogue NIS(+) server responding
faster than legitimate NIS(+) server(s).  Thus a security vulnerability
is introduced.
That's a concern. But NIS would only be used on the private cluster part,
in a physically locked environment. So there is only one access point to
the outside world, where NIS is not used. But it needs to be looked at
whether this is sufficient.
Conversely, LDAP is explicit configuration and avoids the aforementioned
security concern.
Such account / directory centralization (vs synchronization) works quite
well.
I would also /strongly/ recommend that you look into Kerberos.  I
learned some very interesting things about Kerberos earlier this year
and am seriously considering deploying it at home for as much as possible.
Post by K. Venken
RAID 5 has 3 disks with one redundant. If you write N bytes, N/2 goes
to disk 1, also to disk 2 and parity (agian N/2) of disk 1 and 2 goes
to 3. So this is 3 times 1/2 which is 3/2 or 1.5 ;-)
It sounds like you are talking about a three disk RAID 5 configuration
(2 data + 1 parity).  I was talking about a four disk RAID 5 configuration.
RAID 5 can work with three or more disks.  (I've not found a top number
of disks.  Other problems start to arise.)
Post by K. Venken
In RAID 0 you would just spread data...
Yep.  That's why I wouldn't use it with any data that I cared the least
bit about.  At least not without striping across block devices that were
themselves some form of redundant.
Post by K. Venken
Ok, If you have more disks, the 1.5 wouldn't hold, correct
observation! Had to ccnsider that. So the correct formulation would be
N / (N-1) ?
And for RAID 6 it would be N / (N+2)
for what's worth I had the impression that RAID 6 is more reliable then ....
Yes RAID 6 is more reliable in that it has multiple disks worth of
protection.  So if you loose two disks, your data is now unprotected,
but still accessible.
Post by K. Venken
OK Take 4 disks you can organize them RAID 10 or RADI 6 for the same
capacity, RAID 6 would be more reliable... Still not finished the
reliability calculation.
RAID 10, is really two levels of RAID, 1 and 0.  It's not fair to try to
apply a formula for single level RAID to multi-level RAID.
I would rather use RAID 6 than RAID 1 and 0 in combination.  I say this,
because it's possible to loose all data if two /specific/ disks die in
RAID 10.  RAID 6 will still protect your data if /any/ two disks die.
Say RAID 10 is a RAID 1 mirror applied across two RAID 0 stripes.  If I
take out a disk in both RAID 0 stripes, then both RAID 0 stripes fail
and take the RAID 1 mirror out and all data is lost.
Say RAID 10 is a RAID 0 striped applied across two RAID 1 mirrors.  If I
take out both disks in either RAID 1 mirror, then that RAID 1 mirror
fails and takes out the RAID 0 stripe at the same time.
RAID 10 is susceptible to data loss if two /specific/ disks are lost.
Conversely, any two disks in RAID 6 can die and the data is still available.
To me, RAID 6 is safer than RAID 10 for four drive RAIDs.
That's what I found based on my MTBF calculations. It's good to have
some confirmation that my calculations give some useful background.
FWIW, my calculations also suggested that RAID-5 should be limited to a
small number of disks, something like 5 or 8, or the MTBF of the whole is
worse than that of a single disk. That's one of my requirements (next to
redundancy): that any disk configuration as a whole should have a better
MTBF than each of the disks by itself. But I am not sure if it is useful.
Post by K. Venken
Hmmmm., I seem to provide a more detailed example to explain. I
thought I would go into too much details, but why not, lets give the
commands to do this. Lets have four stations which can see each other
over the network.
Each of the 3 (sub)stations creates a file in /export/dnd eg.
/export/dnd/slice and exports it (NFS} So you have
substation-1:/export/dnd/... exported by NFS
So the 4th station can import 3 NFS folders substation.../dnd holding
an slice file. Now, here is the trick. You can create a loopback for
losetup /dev/loop(i) otherserver:/export/dnd/slice
So you end up with 3 devices pointing to a remote NFS file. Now, its
for all 3, do
losetup /dev/loop{i} /mnt/net/srvr-I-/slive
Now combine them into
mdadm -Cv -l5 -c64 -n3 -pls /dev/md0 /dev/loop{1,2,3}
end export this /dev/md0 somewhay
OK, thats over the top. But what you end up with is that when you
write to seerver:/dev/md0, you actually write to tree remote NFS files
spreading the throughput.
Here it comes. The 4th NFS server holding /dev/md0 exports this again.
So when one of the nodes writes to the servers export, it writes to
the /dev/md0 which writes to 3 files on a remote NFS server!!!
I completely understood the concept of what you are saying.
I think that you are going to run into some performance issues with
multiple aspects.
1)  I question if NFS is the most optimal method to access data (in this
scenario).
2)  I think the loopback device is going to be sub-optimal.
3)  Aggregating the multiple loopbacks will have some overhead.
Will it work?  Yes.  Will it be optimal?  I doubt it.
A)  Export a directory containing a file on the first three machines.
B)  Mount the exports on the fourth machine.
C)  Create a loopback device for each file.
D)  Aggregate the multiple loopbacks into a new device.
Alter the scenario a little bit.
I)  Export the ""file(s) via iSCSI (et al.) on the first three machines.
II)  Connect the fourth machine to the iSCSI targets.  This inherently
gives you a device and doesn't require the loopback.
III)  Aggregate the multiple devices.
I think that both I and II are much more optimized than A and B.  You
completely eliminate C.
We can alter the scenario even more by using newer distributed file systems.
1)  Export the back end disks via a DFS on the first three machines.
2)  Connect the fourth machine using DFS client.
The DFS file system deals with redundancy across multiple storage
servers, is massively parallel, and can be directly used by many clients
at the same time, thus negating the need for the fourth machine to
aggregate and re-share.
I was looking into drbd but didn't know how it would compare to my
setup. (I actually tried the 3+1 NFS setup, and it showed improvement
over a single NFS server, but as you noted, it is not scalable...) But I
need to look at iSCSI as well.
Post by K. Venken
Weard, absolutely! Fun, for sure. Useful? I don't know yet... But it
seams that the internal bottleneck can be avoided ?
Well, I take into account that any of the 3 backend server can "crash"
so the other 2 can continue.
This is a requirement in any solution that you use.
What happens if the fourth machine crashes?  All clients completely
loose access to the data.
At this moment it is a single NFS server serving all nodes, so this
problem is there. HPC is more important than HA. What happens is that all
calculations freeze and continue the moment the NFS server is up and
running again. Freezing progress is in this case a better alternative
than aborting with an error. A single NFS server behaves like that (sync
option). For what it's worth, it's more than 7 months ago that the NFS
server (CentOS 6.6) rebooted, and that was an intentional
manual/maintenance operation.

If a DFS gives an error back to the calculation application, that would
be a problem. Just hanging, at least if you can fix the problem, is then
a better solution.
Note:  DFS avoid this particular problem do to how they operate.
Post by K. Venken
(Funny side remark, does slackware crash???). Repairing/reinstalling
the crashed server should "automatically" fis the problem
It sure can.
Post by K. Venken
As a side remark. As only the master server ever gets to write to the
3 slave NFS servers, it is safe to use option async for this specific
case! So the client nodes use sync, bit it's fast as the 3 servers are
using async so they immediately return. Could improve throughput,
Maybe.
Check out Distributed File Systems.  The fan out method that you've
described can only scale so far.  DFS can scale out to hundreds /
thousands of back end storage systems and they can all be directly
accessed by client machines.  There is no central layer to bottleneck on.
I am researching it ;) I agree with all points, but I am not convinced of
the last statement (again, I am not an expert, rather a newbie). At some
point you want a synchronized, consistent view for all the nodes, and
this must involve communication, locks,... It does not come for free. I
thought I read somewhere that this overhead can be significant and degrade
performance compared to a single server, but I have to investigate in
more detail whether this is true.
Post by K. Venken
As did I... I am just starting to learn all the technologies and
systems to create a cluster. So I am expecting that most of my ideas
wont make that much sense. But it's exciting.
I think it's important to talk through things and understand them.  That
way you can make the informed decisions.  ;-)
Post by K. Venken
Yes, fully agree. My ideas might be completely over the top, and as I
mentioned, using established solutions is a better idea.
Don't conflate your unfamiliarity with a solution as that solution not
being established.  ;-)
Post by K. Venken
That's why I am looking to other solutions still, but understanding
the mechanisms can help in deciding.
Yep.
Post by K. Venken
Thanks. I really like this comment very much. I try to motivate my
sons in this direction (solid building bricke iso predefined solutions
- plymobil) but my parental skills might not be perfect neither :-)
:-)
Grant Taylor
2019-07-23 01:22:12 UTC
Permalink
Post by K. Venken
That's a concern. But NIS would only be used at the private cluster part
in a physically locked environment. So there is only one access point to
the outside world where NIS is not used. But it is to be looked at if
this is sufficient.
ACK

If you haven't read Henrik's reply yet, it sounds like there is a
configuration option to likely mitigate this.
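
I don't know which option Henrik has in mind, but the usual client-side
mitigation is to pin ypbind to an explicit server in /etc/yp.conf
instead of letting it broadcast, something like (domain name and address
invented):

domain mycluster server 192.168.10.1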
Post by K. Venken
That's what I found based on my MTBF calculations. It's good to have
some confirmation that my calculations give some usefull background.
FWW, my calculations also suggested that RAID-5 should be limited to a
small number of disks something like 5 or 8 or the MTBF of the whole is
worse then this of a single disk.
I have no reason to object. I've known and have heard others talk about
the upper reasonable limit in the number of drives in a RAID-5 / RAID-6
array and vaguely why.
Post by K. Venken
That's one of my requirements (next to redundancy) that any disk
configuration as a whole should have a better MTBF then each of the
disks by itself. But I am not sure if it is usefull.
I would like to know how /you/ calculate the MTBF of the whole array.
I've not seen a good way, but I've not really looked.
Post by K. Venken
I was looking into drbd but didn't know how it would compare to my
setup. (I actually tried the 3+1 NFS setup, and it showed improvement
over a single NFS server, but as you noted, it is not scalable...) But I
need to look at iSCSI as well.
I spent some time reading / refreshing myself on clustering (it's been a
while) and came across ClusterMonkey.net. Many of their articles that I
skimmed match what I was remembering. They also go into some things
that I've not messed with.
Post by K. Venken
At this moment it is a single NFS serving all nodes, so this problem is
there. HPC is more important then HA. What happens is that all
calculations freeze and continue the moment the NFS server is up and
running again. Freezing progress is in this case a better alternative
then aborting with an error. A single NFS behaves like that (sync
option). For what's worth, it's more then 7 months ago that the NFS
server (CentOS 6.6) rebooted and it was an intentional
manual/maintenance operation.
If a DFS gives an error back to the calculation application, that would
be a problem. Just hanging at least if you can fix the problem is then a
better solution.
I think DFS implementations prefer to offer redundancy such that the
whole system can continue functioning properly when a given machine has
problems.
Post by K. Venken
I am researching it ;) I agree with all points but I am not convinced of
the last statement (again I am not an expert, rather a newbie) At some
point you want a synchronized, consistent view for all the nodes and
this must involve communication, locks,... It does not come for free. I
thought I read somewhere that this overhead can be important and degrade
performance beyond a single server, but I have to investigate this in
more detail if this is true.
I was referring to the lack of a single central NFS server that
everything must flow through.

I was reading about some DFSs (on Cluster Monkey) that have redundancy
across multiple back end data nodes using one or more redundant
meta-data nodes. So any single system failing does not take the DFS
offline.
--
Grant. . . .
unix || die
K Venken
2019-07-23 09:44:08 UTC
Permalink
Post by Grant Taylor
Post by K. Venken
That's a concern. But NIS would only be used at the private cluster
part in a physically locked environment. So there is only one access
point to the outside world where NIS is not used. But it is to be
looked at if this is sufficient.
ACK
If you haven't read Henrik's reply yet, it sounds like there is a
configuration option to likely mitigate this.
Post by K. Venken
That's what I found based on my MTBF calculations. It's good to have
some confirmation that my calculations give some useful background.
FWIW, my calculations also suggested that RAID-5 should be limited to a
small number of disks, something like 5 or 8, or the MTBF of the whole
is worse than that of a single disk.
I have no reason to object.  I've known of, and have heard others talk about,
the reasonable upper limit on the number of drives in a RAID-5 / RAID-6
array, and vaguely why.
Post by K. Venken
That's one of my requirements (next to redundancy) that any disk
configuration as a whole should have a better MTBF than each of the
disks by itself. But I am not sure if it is useful.
I would like to know how /you/ calculate the MTBF of the whole array.
I've not seen a good way, but I've not really looked.
Well, this is where I get suspicious about my own calculations. FWIW, it
goes as follows:

Taking an MTBF you can convert it to an AFR (Annual Failure Rate) with
AFR = 1 - exp(-time/MTBF)
This represents a probability (e.g. that the disk crashes in the first year).
Let's call it p. Now (and this is to be verified) I took p as a
probability as such, which means that I can use it in probability
calculations. What I mean is the following. Take two of the same disks, so
they both have the same p. If you put them in a RAID-0, the chance your
RAID breaks (the first year) is the chance that either or both of them break
(the first year), thus

p-RAID0 = p + p + p^2 = (1-p)^2

And for a RAID1, it would be

p-RAID1 = p^2

With the first formula, you can calculate the MTBF again (p ~ AFR and
use the inverse formula). As a start, here are a few numbers (the single
disk is the chosen reference):

setup        dsks  cap   MTBF      AFR    formula
single disk   1    1       20.00   0.05   p
RAID-0        2    2       10.00   0.10   1-(1-p)^2
RAID-0        3    3        6.67   0.14   1-(1-p)^3
RAID-0        4    4        2.50   0.33   1-(1-p)^4
RAID-1        2    1      419.92   0.00   p^2
RAID-1        3    1     8619.88   0.00   p^3

etc...

When combining multiple RAID levels (e.g. RAID-10), you substitute the
RAID-1 formula into the RAID-0 formula, etc... So you can extend this
table with RAID-10, RAID-50,... etc. Doing this, for 4 disks, you get

setup        dsks  cap   MTBF      AFR
single disk   1    1      20.00    0.05
RAID-10       4    2     209.96    0.00
RAID-01       4    2     109.92    0.01
RAID-6        4    2     481.36    0.00

Which indicates that a plain RAID-6 is better than either RAID-10 or
RAID-01.

But as indicated, this 'assumes' you can combine AFRs as probabilities.

As a final remark: I noted that if a RAID-0 with 3 disks breaks, you
lose 3 times the information, which is 3 times as bad. (This is why I
added capacity in the table.) So I want it to happen 3 times less
frequently. (Let alone that it can take 3 times longer to recover...) So the
MTBF of the RAID configuration should preferably be capacity * single-disk MTBF.
This is where I discovered that a RAID-5 with 5 disks no longer holds up.
It's much better (MTBF=45.90) than a single disk, but that does not take
into account that you have 4 times the capacity to recover. But it's
more a personal choice to avoid this.
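
For what it's worth, the whole conversion and composition can be sketched in a
few lines of throwaway Python. This is only a rough illustration with ad-hoc
helper names, assuming independent disk failures, the 20-year single-disk MTBF
used as the reference above, and the independent-failure form 1-(1-p)^n for an
n-disk stripe:

import math

def afr(mtbf_years, t=1.0):
    # AFR = 1 - exp(-time/MTBF), with time and MTBF both in years
    return 1.0 - math.exp(-t / mtbf_years)

def mtbf_from_afr(p, t=1.0):
    # inverse of the above: MTBF = -time / ln(1 - AFR)
    return -t / math.log(1.0 - p)

def p_stripe(p, n):
    # RAID-0 over n devices: the array fails if any device fails
    return 1.0 - (1.0 - p) ** n

def p_mirror(p, n):
    # RAID-1 over n devices: the array fails only if all devices fail
    return p ** n

p = afr(20.0)                              # single disk, 20-year MTBF
raid10 = p_stripe(p_mirror(p, 2), 2)       # stripe of two 2-way mirrors
raid01 = p_mirror(p_stripe(p, 2), 2)       # mirror of two 2-disk stripes
for name, q in [("single", p), ("RAID-10", raid10), ("RAID-01", raid01)]:
    print(f"{name:8s} AFR={q:.4f} MTBF={mtbf_from_afr(q):7.2f} years")

With the 20-year reference, the RAID-10 and RAID-01 lines come out at roughly
210 and 110 years, which lines up with the 209.96 and 109.92 entries in the
table above.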
Post by Grant Taylor
Post by K. Venken
I was looking into drbd but didn't know how it would compare to my
setup. (I actually tried the 3+1 NFS setup, and it showed improvement
over a single NFS server, but as you noted, it is not scalable...) But
I need to look at iSCSI as well.
I spent some time reading / refreshing myself on clustering (it's been a
while) and came across ClusterMonkey.net.  Many of their articles that I
skimmed match what I was remembering.  They also go into some things
that I've not messed with.
Post by K. Venken
At this moment it is a single NFS server serving all nodes, so this problem
is there. HPC is more important than HA. What happens is that all
calculations freeze and continue the moment the NFS server is up and
running again. Freezing progress is in this case a better alternative
than aborting with an error. A single NFS server behaves like that (sync
option). For what it's worth, it's been more than 7 months since the NFS
server (CentOS 6.6) rebooted, and that was an intentional
manual/maintenance operation.
If a DFS gives an error back to the calculation application, that
would be a problem. Just hanging, at least if you can fix the problem,
is then a better solution.
I think DFS implementations prefer to offer redundancy such that the
whole system can continue functioning properly when a given machine has
problems.
Post by K. Venken
I am researching it ;) I agree with all points but I am not convinced
of the last statement (again, I am not an expert, rather a newbie). At
some point you want a synchronized, consistent view for all the nodes
and this must involve communication, locks,... It does not come for
free. I thought I read somewhere that this overhead can be significant
and degrade performance to below that of a single server, but I have to
investigate in more detail whether this is true.
I was referring to the lack of a single central NFS server that
everything must flow through.
I was reading about some DFSs (on Cluster Monkey) that have redundancy
across multiple back end data nodes using one or more redundant
meta-data nodes.  So any single system failing does not take the DFS
offline.
Grant Taylor
2019-07-23 14:44:59 UTC
Permalink
Thank you for the detailed response. I will give it the attention that
it deserves as time permits in the coming days.
Post by K Venken
Well, this is where I get suspicious about my own calculations.
Yep. Been there. Done that. I view it as a good, but annoying, thing
and a learning opportunity.

Bob: I've done this for years and it's always worked.
Tom: Yes. But /why/ does it work?
Bob: Um....
(Later)
Bob: Can I get back to you on that? I need to go think. :-/
--
Grant. . . .
unix || die
Grant Taylor
2019-08-19 21:11:31 UTC
Permalink
Hi K Venken,

I've spent some time re-reading / analyzing / assimilating what you said
below. I've found it to be very informative, but I've got some
questions and would like to discuss it further.
Post by K Venken
Well, this is where I get suspicious about my own calculations. FWIW, it
goes as follows:
Taking an MTBF you can convert it to an AFR (Annual Failure Rate) with
    AFR = 1 - exp(-time/MTBF)
This represents a probability (e.g. that the disk crashes in the first year).
Let's call it p. Now (and this is to be verified) I took p as a
probability as such, which means that I can use it in probability
calculations. What I mean is the following. Take two of the same disks, so
they both have the same p. If you put them in a RAID-0, the chance your
RAID breaks (the first year) is the chance that either or both of them break
(the first year), thus
This largely makes sense. However, I'd like to plug some numbers into
the following formula.

Aside: The additional research / reading that I did to understand this
supports your equation.
Post by K Venken
p-RAID0 = p + p + p^2 = (1-p)^2
Aside: I question why it's "p + p + p^2". Wouldn't either of the two
drives failing terminate the RAID-0? Or is the p^2 the likelihood that
both drives will fail simultaneously?

p = 1 - e(-8760/100000) = .08387… # 100,000 hours MTBF

p + p + p^2 = .17478…

(1 - p)^2 = .83928…

.17478… ≠ .83928…

Will you please explain what I'm doing wrong?

Is it possible that (1-p)^2 is the likelihood that p-RAID0 will survive
the first year?

.17478… + .83928… ≈ 1.01406…

Looking at the formulas you use below, it looks like perhaps that should
be 1 - (1 - p)^2

1 - (1 - p)^2 = .16071…

.16071… is a lot closer to .17478…

Aside: We're getting to the point where the number of digits after the
decimal point matters.
Post by K Venken
And for a RAID1, it would be
p-RAID1 = p^2
p-RAID1 = p^2 = (1-e(-8760/100000))^2 = .16071…
Post by K Venken
With the first formula, you can calculate the MTBF again (p ~ AFR and
use the inverse formula). As a start, here are a few numbers (the single
I'm not following how you get the MTBF.

Will you please provide starting numbers, like I did above. The closest
that I can get to 0.05 AFR is with 340,000 hours MTBF.

I don't see how that ~340,000 hour MTBF correlates to the 20.00 MTBF you
have below.

I also don't understand how you calculated your AFR.
Post by K Venken
setup        dsks  cap   MTBF      AFR    formula
single disk   1    1     20.00    0.05   p
I think the probability that a single disk will fail could also be
written as 1-(1-p)^1. This is somewhat important to me because it
generalizes the formula for an N disk RAID 0 stripe to be:

RAID-0 1-(1-p)^N

RAID-0 1 1 20.00 0.05 1-(1-p)^1
RAID-0 2 2 10.00 0.10 1-(1-p)^2
RAID-0 3 3 6.67 0.14 1-(1-p)^3
RAID-0 4 4 2.50 0.33 1-(1-p)^4
Post by K Venken
RAID-1        2    1     419.92   0.00   p^2
RAID-1        3    1     8619.88  0.00   p^3
This generalizes the failure of an N drive RAID-1 to be:

RAID-1 p^N
Post by K Venken
etc...
What I don't see is comparable formulas for RAID-5 or RAID-6.
Post by K Venken
When combining multiple RAID levels (e.g. RAID-10), you substitute the
RAID-1 formula into the RAID-0 formula, etc... So you can extend this
table with RAID-10, RAID-50,... etc. Doing this, for 4 disks, you get
setup        dsks  cap   MTBF      AFR
single disk   1    1     20.00    0.05
RAID-10       4    2     209.96   0.00
RAID-01       4    2     109.92   0.01
RAID-6        4    2     481.36   0.00
Which indicates that a plain RAID-6 is better than either RAID-10 or
RAID-01.
But as indicated, this 'assumes' you can combine AFRs as probabilities.
As indicated above, multiple things supported this assumption.
Post by K Venken
As a final remark: I noted that if a RAID-0 with 3 disks breaks, you
lose 3 times the information, which is 3 times as bad. (This is why I
added capacity in the table.) So I want it to happen 3 times less
frequently. (Let alone that it can take 3 times longer to recover...) So the
MTBF of the RAID configuration should preferably be capacity * single-disk
MTBF. This is where I discovered that a RAID-5 with 5 disks no longer holds
up. It's much better (MTBF=45.90) than a single disk, but that does not take
into account that you have 4 times the capacity to recover. But it's
more a personal choice to avoid this.
I agree with your logic, but I think I need a better understanding of
how you derived the MTBF and AFR in your tables above.
--
Grant. . . .
unix || die
SBM
2019-08-21 10:06:35 UTC
Permalink
Post by Grant Taylor
Hi K Venken,
I've spent some time re-reading / analyzing / assimilating what you said
below. I've found it to be very informative, but I've got some
questions and would like to discuss it further.
I appreciate your efforts.
Post by Grant Taylor
Post by K Venken
Well, this is where I get suspicious about my own calculations. FWIW, it
goes as follows:
Taking an MTBF you can convert it to an AFR (Annual Failure Rate) with
AFR = 1 - exp(-time/MTBF)
This represents a probability (e.g. that the disk crashes in the first year).
Let's call it p. Now (and this is to be verified) I took p as a
probability as such, which means that I can use it in probability
calculations. What I mean is the following. Take two of the same disks,
so they both have the same p. If you put them in a RAID-0, the chance
your RAID breaks (the first year) is the chance that either or both of them
break (the first year), thus
This largely makes sense. However, I'd like to plug some numbers into
the following formula.
Aside: The additional research / reading that I did to understand this
supports your equation.
Post by K Venken
p-RAID0 = p + p + p^2 = (1-p)^2
Aside: I question why it's "p + p + p^2". Wouldn't either of the two
drives failing terminate the RAID-0? Or is the p^2 the likelihood that
both drives will fail simultaneously?
Yes, it is, but alas, it's incorrect I fear. Good that you noticed!!! Let's
redo this

A breaks and B does not = p*(1-p)
A does not break and B does = (1-p)*p
A and B both break = p*p

adding it up gives p - p^2 + p - p^2 + p^2 = p + p - p^2
Post by Grant Taylor
p = 1 - e(-8760/100000) = .08387… # 100,000 hours MTBF
p + p + p^2 = .17478…
(1 - p)^2 = .83928…
So the exact numbers are: p + p - p^2 = .16071, which adds up with
(1-p)^2 as .83929 + .16071 = 1.00000, as it should. The RAID-0 either breaks
or does not break!
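
A quick numeric check of the corrected identity, in throwaway Python, using
Grant's 100,000-hour example:

import math

p = 1.0 - math.exp(-8760 / 100000)   # one year against a 100,000 hour MTBF
fail = p + p - p * p                 # either disk (or both) fails
survive = (1.0 - p) ** 2             # neither disk fails
print(round(p, 5), round(fail, 5), round(survive, 5), round(fail + survive, 5))
# prints roughly 0.08387 0.16071 0.83929 1.0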
Post by Grant Taylor
.17478… ≠ .83928…
Will you please explain what I'm doing wrong?
Nothing; I made a mistake in my formulas. This is really unfortunate. I
plugged them into OpenOffice, so I have to redo them and review some of
my conclusions! Great job!
Post by Grant Taylor
Is it possible that (1-p)^2 is the likelihood that p-RAID0 will survive
the first year?
.17478… + .83928… ≈ 1.01406…
Looking at the formulas you use below, it looks like perhaps that should
be 1 - (1 - p)^2
1 - (1 - p)^2 = .16071…
.16071… is a lot closer to .17478…
That's correct, see the explanation above. Sorry for the confusion.
Post by Grant Taylor
Aside: We're getting to the point where the number of digits after the
decimal point matters.
Post by K Venken
And for a RAID1, it would be
p-RAID1 = p^2
p-RAID1 = p^2 = (1-e(-8760/100000))^2 = .16071…
Post by K Venken
With the first formula, you can calculate the MTBF again (p ~ AFR and
use the inverse formula). As a start, here are a few numbers (the
I'm not following how you get the MTBF.
AFR = 1 - exp(-time/MTBF)

then

MTBF/time = -1 / ln(1 - AFR)

taking into account that I took 1 year as the time reference. So, with
time = 1 year, the factor -1/ln(1-AFR) gives the MTBF directly in years.
Post by Grant Taylor
Will you please provide starting numbers, like I did above. The closest
that I can get to 0.05 AFR is with 340,000 hours MTBF.
I don't see how that ~340,000 hour MTBF correlates to the 20.00 MTBF you
have below.
I also don't understand how you calculated your AFR.
20.00 MTBF refers to 20.00 years MTBF. 340000 hours would then be
340000/24/365 = 38.8 years. Apparently, my disks are worse than yours.

1 - exp(-1 year / 20 years MTBF) = 0.05

"There should be no difference if you take hours or years as the reference
scale, but years makes the calculations somewhat more accurate"
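
The same back-and-forth conversion as a tiny Python sketch (units in years,
purely illustrative):

import math

mtbf_single = 20.0                            # years, the table's reference
afr_single = 1 - math.exp(-1 / mtbf_single)   # ~0.0488, shown as 0.05 above
print(-1 / math.log(1 - afr_single))          # back to 20.0 years
print(-1 / math.log(1 - afr_single ** 2))     # RAID-1 pair: ~419.92 years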
Post by Grant Taylor
Post by K Venken
setup dsks cap MTBF AFR formula
single disk 1 1 20.00 0.05 p
I think the probability that a single disk will fail could also be
written as 1-(1-p)^1. This is somewhat important to me because it
RAID-0 1-(1-p)^N
RAID-0 1 1 20.00 0.05 1-(1-p)^1
RAID-0 2 2 10.00 0.10 1-(1-p)^2
RAID-0 3 3 6.67 0.14 1-(1-p)^3
RAID-0 4 4 2.50 0.33 1-(1-p)^4
Post by K Venken
RAID-1 2 1 419.92 0.00 p^2
RAID-1 3 1 8619.88 0.00 p^3
RAID-1 p^N
Yes, this is a much better way of doing things. I only saw the
generalization later on.
Post by Grant Taylor
Post by K Venken
etc...
What I don't see is comparable formulas for RAID-5 or RAID-6.
It should be the same way of reasoning. A RAID-5 fails if two disks fail,
or, put the other way, a RAID-5 does not fail if no two disks fail. As you
indicated before, the formulas may need to be reviewed. Let's try this for 3
disks:

RAID-5 fails if:

A and B fail but not C = p * p * (1-p)
A fails, B does not, but C does = p * (1-p) * p
A does not fail, B and C do = (1-p) * p * p
A, B and C all fail = p * p * p

or p-RAID5 = 3*p^2*(1-p) + p^3

In this case, it's easy to get the complementary observation:

RAID-5 does not fail if

not A, not B and not C fail = (1-p)^3
A fails but not B and not C = p*(1-p)^2
...

or 1 - p-RAID5 = ..., which should give 1 if you sum both.
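
The same counting argument can be generalized with the binomial distribution.
A sketch only, assuming independent failures and ignoring rebuild windows;
n_parity is 1 for RAID-5 and 2 for RAID-6:

from math import comb, exp, log

def p_array_fails(p, n_disks, n_parity):
    # the array survives the year as long as at most n_parity disks fail
    p_survive = sum(comb(n_disks, k) * p**k * (1 - p)**(n_disks - k)
                    for k in range(n_parity + 1))
    return 1.0 - p_survive

p = 1 - exp(-1 / 20)                          # the 20-year reference disk
print(p_array_fails(p, 3, 1))                 # equals 3*p^2*(1-p) + p^3
print(-1 / log(1 - p_array_fails(p, 5, 1)))   # 5-disk RAID-5: ~45.9 years

The 5-disk RAID-5 case lands at about 45.9 years, in line with the MTBF=45.90
figure mentioned earlier.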
Post by Grant Taylor
Post by K Venken
When combining multiple RAID levels (e.g. RAID-10), you substitute the
RAID-1 formula into the RAID-0 formula, etc... So you can extend this
table with RAID-10, RAID-50,... etc. Doing this, for 4 disks, you get
setup dsks cap MTBF AFR
single disk 1 1 20.00 0.05
RAID-10 4 2 209.96 0.00
RAID-01 4 2 109.92 0.01
RAID-6 4 2 481.36 0.00
Which indicates that a plain RAID-6 is better than either RAID-10 or
RAID-01.
But as indicated, this 'assumes' you can combine AFRs as probabilities.
As indicated above, multiple things supported this assumption.
Post by K Venken
As a final remark: I noted that if a RAID-0 with 3 disks breaks, you
lose 3 times the information, which is 3 times as bad. (This is why I
added capacity in the table.) So I want it to happen 3 times less
frequently. (Let alone that it can take 3 times longer to recover...) So the
MTBF of the RAID configuration should preferably be capacity * single-disk
disk MTBF. This is where I discovered that a RAID-5 with 5 disks no longer
holds up. It's much better (MTBF=45.90) than a single disk, but that does not
take into account that you have 4 times the capacity to recover. But
it's more a personal choice to avoid this.
I agree with your logic, but I think I need a better understanding of
how you derived the MTBF and AFR in your tables above.
I hope the previous additions clarify and correct some of my calculations.
Your understanding is, in my opinion, sound. I am not sure if the full
detail of the calculations and formulas makes sense in this newsgroup - it's
at times boring and just a bunch of numbers - but if you like I can
post them after I have verified them with the insights you gave.
Karel Venken
2019-08-21 10:09:57 UTC
Permalink
SBM wrote:

OFFTOPIC

Not sure why I couldn't send it yesterday. Tried it with a different
account/PC, which seemed to work, then changed the account back to the
original settings. If this comes through, well, curious... an intermittent
network problem, I guess.

Karel
Grant Taylor
2019-08-22 03:46:59 UTC
Permalink
Hi Karel,

Thank you for another detailed reply.

I'm going to take some more time to assimilate your detailed reply.
--
Grant. . . .
unix || die
Henrik Carlqvist
2019-07-22 20:07:06 UTC
Permalink
Post by Grant Taylor
NIS(+) has a serious security concern in that you can't (easily) control
what NIS(+) server responds. Including a rogue NIS(+) server responding
faster than legitimate NIS(+) server(s). Thus a security vulnerability
is introduced.
I suppose that could only happen if you have configured the NIS client
in /etc/yp.conf to use broadcast for the NIS domain?

If you list the server(s) in yp.conf by hostname(s) from your local /etc/hosts,
or even something like:

ypserver 10.2.3.147
ypserver 10.2.3.148

You shouldn't risk listening to a rogue NIS(+) server unless it has been
configured with the same IP.

regards Henrik
Grant Taylor
2019-07-23 01:25:03 UTC
Permalink
Post by Henrik Carlqvist
I suppose that could only happen if you have configured the NIS client
in /etc/yp.conf to use broadcast for the NIS domain?
Okay.

The reading that I've done on NIS(+) indicated that it was broadcast based.
Post by Henrik Carlqvist
If you list the server(s) in yp.conf by hostname(s) from your local
ypserver 10.2.3.147
ypserver 10.2.3.148
You shouldn't risk listening to a rogue NIS(+) server unless it has
been configured with the same IP.
Okay.

Thank you for the follow up / clarification / correction Henrik. #TIL :-)
--
Grant. . . .
unix || die
K Venken
2019-07-23 08:22:23 UTC
Permalink
Post by Henrik Carlqvist
Post by Grant Taylor
NIS(+) has a serious security concern in that you can't (easily) control
what NIS(+) server responds. Including a rogue NIS(+) server responding
faster than legitimate NIS(+) server(s). Thus a security vulnerability
is introduced.
I suppose that could only happen if you have configured the NIS client
in /etc/yp.conf to use broadcast for the NIS domain?
If you list the server(s) in yp.conf by hostname(s) from your local /etc/hosts
ypserver 10.2.3.147
ypserver 10.2.3.148
That's what we configured in the nodes, a dedicated ypserver.
Post by Henrik Carlqvist
You shouldn't risk listening to a rogue NIS(+) server unless it has been
configured with the same IP.
regards Henrik
Henrik Carlqvist
2019-07-23 10:03:23 UTC
Permalink
Post by K Venken
Post by Henrik Carlqvist
If you list the server(s) in yp.conf by hostname(s) from your local
ypserver 10.2.3.147
ypserver 10.2.3.148
That's what we configured in the nodes, a dedicated ypserver.
As the NIS server will become a single point of failure for the network I
prefer to have two NIS servers. Only one NIS server can be configured to
be the NIS master but more NIS servers can be configured as NIS slaves.

regards Henrik
Grant Taylor
2019-07-23 14:40:58 UTC
Permalink
Post by Henrik Carlqvist
As the NIS server will become a single point of failure for the
network I prefer to have two NIS servers. Only one NIS server can be
configured to be the NIS master but more NIS servers can be configured
as NIS slaves.
Can NIS slaves be configured to forward updates to the master on behalf
of the NIS client? Or does the NIS client need to be able to
communicate with the NIS master for some things?
--
Grant. . . .
unix || die
Henrik Carlqvist
2019-07-23 17:28:39 UTC
Permalink
Post by Grant Taylor
Post by Henrik Carlqvist
As the NIS server will become a single point of failure for the network
I prefer to have two NIS servers. Only one NIS server can be configured
to be the NIS master but more NIS servers can be configured as NIS
slaves.
Can NIS slaves be configured to forward updates to the master on behalf
of the NIS client? Or does the NIS client need to be able to
communicate with the NIS master for some things?
Good question; fortunately I have not had the experience of such a long
downtime of the NIS master. But reading up in the man pages, I find that
rpc.yppasswdd only runs on the NIS master, and the passwd map (which
lists password, shell and other stuff from /etc/passwd) is the only part
changeable by NIS clients. Other maps should be edited on the NIS master,
and after that the NIS maps are updated on the master and pushed to the
slaves.

So when the NIS master is down it is not possible to update any NIS maps.

regards Henrik
Grant Taylor
2019-07-17 03:19:17 UTC
Permalink
Post by K. Venken
But it does not mention clusters and multiple nodes. Indeed.
It might not mention clusters, but think about it this way.

There is a performance (latency and / or throughput) penalty for non-local
resources compared to local resources.

The same concept can be applied to memory associated with specific
CPUs as well as to memory associated with different computers in a
cluster. You can even take it to larger levels.
Post by K. Venken
Thanks a lot for your information. I (actually together with a student)
am trying to redeploy a 10-year-old cluster, now based on Slackware. I
had (and have) no knowledge about all the technology involved, but we have
learned a lot over the last weeks. It started as an experiment, but we are
now at the point that we are testing some of the applications which can
benefit from distributed calculations. One of them crashed in (open)MPI and
mentioned a problem with NUMA. But if NUMA is not essential, this might be
a dead end. It's good to know we are not there yet.
This conjures up a LOT of questions for me.
--
Grant. . . .
unix || die
K. Venken
2019-07-17 18:47:47 UTC
Permalink
Post by Grant Taylor
Post by K. Venken
But it does not mention clusters and multiple nodes. Indeed.
It might not mention clusters, but think about it this way.
There is a performance (latency and / or throughput) penalty for non-local
resources compared to local resources.
The same concept can be applied to memory associated with specific
CPUs as well as to memory associated with different computers in a
cluster.  You can even take it to larger levels.
Post by K. Venken
Thanks a lot for your information. I (actually together with a
student) am trying to redeploy a 10-year-old cluster, now based on
Slackware. I had (and have) no knowledge about all the technology involved,
but we have learned a lot over the last weeks. It started as an experiment,
but we are now at the point that we are testing some of the applications
which can benefit from distributed calculations. One of them crashed in
(open)MPI and mentioned a problem with NUMA. But if NUMA is not
essential, this might be a dead end. It's good to know we are not
there yet.
This conjures up a LOT of questions for me.
We have found another cause of NUMA being disabled. We noted somewhere that
you need ACPI for NUMA. In short, when we added the line

append = "acpi=on"

in lilo.conf

our calculations could be spread over more than one node. This is what
we wanted. For the interested reader, we found it in the following reference:

https://www.thegeekdiary.com/centos-rhel-how-to-find-if-numa-configuration-is-enabled-or-disabled/

Now the nodes correctly state that NUMA is enabled.
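
As an extra cross-check alongside numactl, something along these lines can be
run on a node (a sketch only, assuming sysfs is mounted in the usual place);
a NUMA-enabled kernel exposes one node<N> directory per memory node:

import os, re

node_dir = "/sys/devices/system/node"
nodes = sorted(d for d in os.listdir(node_dir) if re.fullmatch(r"node\d+", d))
print("NUMA nodes visible to the kernel:", ", ".join(nodes) or "none")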

Thanks to all for all the insights you shared.