Caution
This is an advanced topic. Use at your own risk, and ALWAYS back up your data beforehand.

Useful Commands

View DRBD Status - DRBD 7

View DRBD Status - DRBD 9

Reload all parameters |
---|
Code Block |
---|
drbdadm adjust jtelshared |
|
Disconnect the share (useful for planned maintenance) |
---|
Code Block |
---|
drbdadm disconnect jtelshared |
|
Down the share (useful for planned maintenance) |
---|
Code Block |
---|
drbdadm down jtelshared |
|
Up the share |
---|
Code Block |
---|
drbdadm up jtelshared |
|
Set the node to primary |
---|
Code Block |
---|
drbdadm primary jtelshared |
|
Connect the share |
---|
Code Block |
---|
drbdadm connect jtelshared |
|
Split Brain

Background
See also: https://docs.linbit.com/doc/users-guide-84/s-resolve-split-brain/

Symptoms |
---|
Code Block |
---|
| cat /proc/drbd
-->
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
ns:0 nr:119823323 dw:119823323 dr:2128 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 |
|
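The connection state is the value of the cs: field on each device line. As a minimal sketch, assuming the DRBD 8.x /proc/drbd layout shown above, it could be extracted as follows. The function names drbd_conn_states and drbd_is_standalone are hypothetical helpers, not part of DRBD.

```shell
# Print "<device> <connection state>" for each DRBD device found in a
# /proc/drbd style status file (defaults to /proc/drbd itself).
# Assumption: DRBD 8.x format, one "N: cs:<State> ..." line per device.
drbd_conn_states() {
    status_file="${1:-/proc/drbd}"
    sed -n 's/^ *\([0-9][0-9]*\): cs:\([A-Za-z]*\).*/\1 \2/p' "$status_file"
}

# Succeed (exit 0) if at least one device is in the StandAlone state.
drbd_is_standalone() {
    drbd_conn_states "${1:-/proc/drbd}" | grep -q ' StandAlone$'
}

# usage: drbd_is_standalone && echo "DRBD is not connected to its peer"
```

Running the check on both nodes should give the same answer if the cluster really is in split brain.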
cs:StandAlone means the node is not connected. This should be visible on both sides.

Find out which node is active in the PCS cluster |
---|
Code Block |
---|
| pcs status
-->
Cluster name: portal
Stack: corosync
Current DC: acd-store1 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Sun Mar 18 18:05:32 2018
Last change: Fri Feb 16 00:07:51 2018 by root via cibadmin on acd-store2
2 nodes configured
3 resources configured
Node acd-store1: standby
Online: [ acd-store2 ]
Full list of resources:
Resource Group: haproxy_group
ClusterDataJTELSharedMount (ocf::heartbeat:Filesystem): Started acd-store2
ClusterIP (ocf::heartbeat:IPaddr2): Started acd-store2
samba (systemd:smb): Started acd-store2
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled |
|
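The node that currently runs the resources can also be read from this output mechanically. The following is a minimal sketch, assuming resource lines end in "Started <node>" as in the output above; the function name pcs_started_nodes is hypothetical.

```shell
# Read `pcs status` output on stdin and print, for every node that has at
# least one resource in the Started state, the node name and the number of
# resources started there.
# Assumption: resource lines end in "Started <node>".
pcs_started_nodes() {
    awk 'NF >= 2 && $(NF - 1) == "Started" { count[$NF]++ }
         END { for (node in count) print node, count[node] }' | sort
}

# usage: pcs status | pcs_started_nodes
```

For the output shown above this would list only acd-store2, with all three resources.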
In the example above, the first node is in standby. The most important thing to check is on which server the resources are started. In this case, the resources are started on acd-store2, so acd-store2 is treated as the NON BROKEN node.

Standby the broken node in the PCS cluster
This command can be run on either machine. |
---|
Code Block |
---|
| pcs cluster standby acd-lb-broken
--> Verify this with
pcs status |
|
On broken node
Note: the first command will probably throw an error. Also, the share may not be mounted. This is OK. |
---|
Code Block |
---|
| umount /srv/jtel/shared
drbdadm disconnect jtelshared
drbdadm secondary jtelshared
drbdadm connect --discard-my-data jtelshared |
|
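The steps for the broken node could be wrapped in a small function that tolerates the expected umount error. This is only a sketch: the run helper, the DRY_RUN variable and the function name are hypothetical, and stopping on a failure of the later commands is an assumption rather than something this page prescribes. The commands themselves are the ones listed above.

```shell
# Resource name and mount point as used on this page.
RES="jtelshared"
MNT="/srv/jtel/shared"

# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps for the BROKEN node only. The local DRBD data is discarded.
resolve_on_broken_node() {
    # The umount is allowed to fail, e.g. when the share is not mounted.
    run umount "$MNT" || echo "umount failed (OK if the share was not mounted)"
    run drbdadm disconnect "$RES" || return 1
    run drbdadm secondary "$RES" || return 1
    run drbdadm connect --discard-my-data "$RES" || return 1
}

# usage (prints the commands without executing them):
#   DRY_RUN=1 resolve_on_broken_node
```

Doing a dry run first makes it easier to confirm that the function is being called on the intended node before any data is discarded.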
On the healthy node |
---|
Code Block |
---|
title | drbd on healthy node |
---|
| drbdadm primary jtelshared
drbdadm connect jtelshared |
|
Check re-sync activity
The re-sync might take a long time. Watch its status using: cat /proc/drbd

Example output: |
---|
Code Block |
---|
| [root@storage01 ~]# cat /proc/drbd
version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
ns:0 nr:1411538 dw:121234862 dr:2128 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:184698664
[>....................] sync'ed: 0.8% (180368/181744)M
finish: 26:12:15 speed: 1,940 (2,760) want: 2,120 K/sec |
|
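For scripted monitoring, the progress figures can be pulled out of this output. A minimal sketch, assuming the DRBD 8.x progress lines shown above ("sync'ed: N%" and "finish: HH:MM:SS"); the function name drbd_sync_progress is hypothetical.

```shell
# Print "<percent synced> <estimated finish>" from a /proc/drbd style
# status file (defaults to /proc/drbd). Prints nothing if no resync
# progress line is present.
drbd_sync_progress() {
    status_file="${1:-/proc/drbd}"
    awk '
        # "." stands in for the apostrophe in sync-ed to keep quoting simple
        /sync.ed:/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { pct = $i; sub(/%$/, "", pct) }
        }
        /finish:/ {
            for (i = 1; i < NF; i++)
                if ($i == "finish:") fin = $(i + 1)
        }
        END { if (pct != "") print pct, fin }
    ' "$status_file"
}

# usage: drbd_sync_progress
```

For interactive use, `watch -n 10 cat /proc/drbd` is usually enough.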
Tune the transfer (Second Node)
If the transfer is going to take ages, then tune it on the broken node: |
---|
Code Block |
---|
title | drbd Transfer Tuning (on broken node) |
---|
| drbdadm disk-options --c-plan-ahead=0 --resync-rate=110M jtelshared |
|
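To judge what tuning can gain, the remaining time can be estimated from the oos value (out-of-sync data, in KiB) and a resync rate. The following is a rough sketch of that arithmetic. The function name resync_eta is hypothetical, and the sketch assumes that 110M means 110 MiB/s and that network and disks can actually sustain that rate.

```shell
# Estimate the remaining resync time as HH:MM:SS.
#   $1: out-of-sync data in KiB (the oos: value in /proc/drbd)
#   $2: resync rate in KiB/s
resync_eta() {
    oos_kib="$1"
    rate_kib_s="$2"
    secs=$((oos_kib / rate_kib_s))
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Values from the example output: oos:184698664 at 1,940 K/sec.
resync_eta 184698664 1940             # prints 26:26:45
# The same amount of data at 110 MiB/s (110 * 1024 KiB/s).
resync_eta 184698664 $((110 * 1024))  # prints 00:27:19
```

The first figure is close to the finish: value in the example output, which DRBD derives from its own averaged speed.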
Put broken node back to primary |
---|
Code Block |
---|
title | drbd primary on broken node |
---|
| drbdadm primary jtelshared
--> Verify this with
cat /proc/drbd |
|
Restart PCS node |
---|
Code Block |
---|
title | Unstandby broken node |
---|
| pcs cluster unstandby acd-lb-broken
--> Verify this with
pcs status |
|
Untune the transfer (Second Node)
If the transfer was tuned, then untune it (on the broken node). Note: it won't hurt to run this command anyway. |
---|
Code Block |
---|
title | drbd - Untune Transfer |
---|
| drbdadm adjust jtelshared |
|
Check everything |
---|
Code Block |
---|
| pcs status
cat /proc/drbd
# On some other Linux machine
ls /home/jtel/shared
# Windows
dir \\acd-store\shared |
|
File System Corrupt
Sometimes, when DRBD fails, the file system also becomes corrupt. In this case both nodes might be primary, but neither will have the share mounted, and the command mount /srv/jtel/shared will fail. It may then be necessary to repair the file system.

Symptoms |
---|
Code Block |
---|
[17354513.483526] XFS (drbd1): log mount/recovery failed: error -22
[17354513.483569] XFS (drbd1): log mount failed
[17355040.104433] XFS (drbd1): Mounting V5 Filesystem
[17355040.122234] XFS (drbd1): Corruption warning: Metadata has LSN (56:112832) ahead of current LSN (56:112733). Please unmount and run xfs_repair (>= v4.3) to resolve.
[17355040.122239] XFS (drbd1): log mount/recovery failed: error -22
[17355040.122322] XFS (drbd1): log mount failed |
|
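Messages like these appear in the kernel log. A minimal sketch for spotting them, assuming the message wording shown above; the function name xfs_problem_devices is hypothetical.

```shell
# Read kernel log text on stdin and print each device for which XFS
# reports a failed log mount or a corruption warning.
# Assumption: messages look like "XFS (<device>): <message>".
xfs_problem_devices() {
    grep -E 'XFS \([^)]*\): (log mount failed|Corruption warning)' \
        | sed 's/.*XFS (\([^)]*\)).*/\1/' \
        | sort -u
}

# usage: dmesg | xfs_problem_devices
```

No output means no matching message was found. That alone does not prove that the file system is healthy.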
Repairing
On one of the nodes (choose the one that is to become primary): |
---|
Code Block |
---|
xfs_repair /dev/drbd/by-res/jtelshared/0
pcs resource cleanup |
|
This should then mount the share and start the resources on that node. Then proceed with the other node as the "broken" node, as described for the split brain situation.

Stalled Resync
If the DRBD resync stalls (the output of cat /proc/drbd will show "stalled"), it may be necessary to restart the machine. This has been observed once, and restarting resolved the situation. However, not much more is known about this state or its cause at this time.
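A stalled resync could be detected from a script with a simple text match. This is a minimal sketch that assumes only what is stated here, namely that the word "stalled" appears in the /proc/drbd output; the function name drbd_resync_stalled is hypothetical.

```shell
# Succeed (exit 0) if the status file (defaults to /proc/drbd) reports a
# stalled resync.
# Assumption: the word "stalled" appears literally in the output.
drbd_resync_stalled() {
    grep -q 'stalled' "${1:-/proc/drbd}"
}

# usage: drbd_resync_stalled && echo "resync appears to be stalled"
```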