Date: Wed, 16 Jul 2008 18:35:42 -0700
From: Andrew Sharp <andy.sharp@onstor.com>
To: "Jonathan Goldick" <jonathan.goldick@onstor.com>
Subject: Re: clustering fencing
Message-ID: <20080716183542.2cc89bce@ripper.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0AF1DD09@onstor-exch02.onstor.net>
References: <BB375AF679D4A34E9CA8DFA650E2B04E0AF1DD09@onstor-exch02.onstor.net>
Organization: Onstor
X-Mailer: Sylpheed-Claws 2.6.0 (GTK+ 2.8.20; x86_64-pc-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

At least the Solaris explanation would be the exact opposite of what
we're doing.  So in that sense I'm completely correct.  Instead of the
new node panicking/rebooting because the old node is still pinging the
volume, it should force takeover of the volume ownership and fence the
"dying" node out.

On Wed, 16 Jul 2008 18:27:33 -0700 "Jonathan Goldick"
<jonathan.goldick@onstor.com> wrote:

> Linux-HA  http://en.wikipedia.org/wiki/STONITH  The other node is
> powered down.  
> 
> Solaris clustering
> http://docs.sun.com/app/docs/doc/819-2969/6n57kl13o?a=view#caccajda     
> About Failure Fencing
> A major issue for clusters is a failure that causes the cluster to
> become partitioned (called split brain). When split brain occurs, not
> all nodes can communicate, so individual nodes or subsets of nodes
> might try to form individual or subset clusters. Each subset or
> partition might "believe" it has sole access and ownership to the
> multihost devices. When multiple nodes attempt to write to the disks,
> data corruption can occur.
> Failure fencing limits node access to multihost devices by physically
> preventing access to the disks. Failure fencing applies only to nodes,
> not to zones. When a node leaves the cluster (it either fails or
> becomes partitioned), failure fencing ensures that the node can no
> longer access the disks. Only current member nodes have access to the
> disks, resulting in data integrity.
> Device services provide failover capability for services that use
> multihost devices. When a cluster member that currently serves as the
> primary (owner) of the device group fails or becomes unreachable, a
> new primary is chosen. The new primary enables access to the device
> group to continue with only minor interruption. During this process,
> the old primary must forfeit access to the devices before the new
> primary can be started. However, when a member drops out of the
> cluster and becomes unreachable, the cluster cannot inform that node
> to release the devices for which it was the primary. Thus, you need a
> means to enable surviving members to take control of and access
> global devices from failed members.
> The Sun Cluster software uses SCSI disk reservations to implement
> failure fencing. Using SCSI reservations, failed nodes are "fenced"
> away from the multihost devices, preventing them from accessing those
> disks. SCSI-2 disk reservations support a simple form of reservation
> that either grants access to all nodes attached to the disk (when no
> reservation is in place) or restricts access to a single node (the
> node that holds the reservation).
> When a cluster member detects that another node is no longer
> communicating over the cluster interconnect, it initiates a failure
> fencing procedure to prevent the other node from accessing shared
> disks. When this failure fencing occurs, the fenced node panics with a
> "reservation conflict" message on its console.
> The discovery that a node is no longer a cluster member triggers a
> SCSI reservation on all the disks that are shared between this node
> and other nodes. The fenced node might not be "aware" that it is
> being fenced and if it tries to access one of the shared disks, it
> detects the reservation and panics.
> 
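The reserve/release and reservation-conflict behavior the Sun doc describes can be sketched as a toy model. This is pure Python with no real SCSI; `SharedDisk` and `ReservationConflict` are illustrative names, not any actual API, and the "panic" is just an exception:

```python
class ReservationConflict(Exception):
    """Raised when a node without the reservation touches the disk.
    On real hardware the fenced node would panic at this point."""


class SharedDisk:
    """Toy model of SCSI-2 reservation semantics on a multihost disk."""

    def __init__(self):
        self.holder = None  # no reservation: all attached nodes may access

    def reserve(self, node):
        # A surviving member takes the reservation to fence out a failed node.
        self.holder = node

    def release(self, node):
        # Only the current holder can drop the reservation.
        if self.holder == node:
            self.holder = None

    def write(self, node, data):
        # Access is open when unreserved, otherwise restricted to the holder.
        if self.holder is not None and self.holder != node:
            raise ReservationConflict(f"{node}: reservation conflict")
        return f"{node} wrote {data!r}"


# Split-brain scenario: nodeB stops heartbeating, so nodeA fences it.
disk = SharedDisk()
disk.write("nodeA", "ok")       # unreserved: either node may write
disk.reserve("nodeA")           # failure fencing: nodeA takes the disk
try:
    disk.write("nodeB", "stale")  # fenced node hits the reservation...
except ReservationConflict as e:
    print(e)                      # ...and would panic on a real system
```

Under this model, the takeover Andy describes is just `reserve()` being issued by the new primary before the old one has released, so the stale owner (rather than the new node) is the one that hits the conflict.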
