This is an advanced topic. Use at your own risk, and ALWAYS back up your data before proceeding.
cat /proc/drbd                  # show DRBD status
drbdadm adjust jtelshared       # re-apply the configuration
drbdadm disconnect jtelshared   # disconnect the resource from its peer
drbdadm down jtelshared         # stop the resource
drbdadm up jtelshared           # start the resource
drbdadm primary jtelshared      # promote this node to primary
drbdadm connect jtelshared      # reconnect the resource to its peer
See also:
https://docs.linbit.com/doc/users-guide-84/s-resolve-split-brain/
cat /proc/drbd

GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
 1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown   r-----
    ns:0 nr:119823323 dw:119823323 dr:2128 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
cs:StandAlone means the node is not connected.
This should be visible on both sides.
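The cs: state can also be checked programmatically, e.g. from a monitoring script. A minimal sketch: the drbd_cs helper below is not part of DRBD, it just parses the 8.4-style /proc/drbd output shown above.

```shell
# drbd_cs: print the connection state (cs: field) of each DRBD resource.
# Reads /proc/drbd-style text from stdin so it can also be tried on saved
# samples:   drbd_cs < /proc/drbd
drbd_cs() {
  grep -o 'cs:[A-Za-z]*' | cut -d: -f2
}
```

On a healthy cluster this prints Connected; StandAlone (or WFConnection) means the node needs attention, while SyncSource/SyncTarget are normal during a re-sync.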
pcs status

Cluster name: portal
Stack: corosync
Current DC: acd-store1 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Sun Mar 18 18:05:32 2018
Last change: Fri Feb 16 00:07:51 2018 by root via cibadmin on acd-store2

2 nodes configured
3 resources configured

Node acd-store1: standby
Online: [ acd-store2 ]

Full list of resources:

 Resource Group: haproxy_group
     ClusterDataJTELSharedMount (ocf::heartbeat:Filesystem): Started acd-store2
     ClusterIP (ocf::heartbeat:IPaddr2): Started acd-store2
     samba (systemd:smb): Started acd-store2

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
In the example above, the first node (acd-store1) is in standby. The most important thing to check is which server the resources are started on.
In this case, the resources are started on acd-store2.
acd-store2 is therefore the NON BROKEN node; the other node is treated as the broken one.
This command can be run on either machine.
pcs cluster standby acd-lb-broken

Verify this with pcs status.
Note: the first command below will probably throw an error. Also, the share may not be mounted. This is OK.
umount /srv/jtel/shared
drbdadm disconnect jtelshared
drbdadm secondary jtelshared
drbdadm connect --discard-my-data jtelshared
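The four steps above can be collected into a small helper for the broken node. This is only a sketch: recover_broken_node and its DRY_RUN flag are not standard tools, and the resource name and mount point are the ones used in this document.

```shell
# Reset the BROKEN node and reconnect it as a sync target, discarding its
# local modifications. Set DRY_RUN=1 to print the commands instead of
# executing them, so the sequence can be reviewed first.
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }

recover_broken_node() {
  run umount /srv/jtel/shared || true       # may already be unmounted: OK
  run drbdadm disconnect jtelshared || true # may already be StandAlone: OK
  run drbdadm secondary jtelshared
  run drbdadm connect --discard-my-data jtelshared
}
```

Review the sequence with DRY_RUN=1 recover_broken_node, then run recover_broken_node (as root, on the broken node only).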
On the NON BROKEN node:

drbdadm primary jtelshared
drbdadm connect jtelshared
The re-sync might take a long time.
Watch the status of this using:
cat /proc/drbd
Example output:
[root@storage01 ~]# cat /proc/drbd
version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
 1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
    ns:0 nr:1411538 dw:121234862 dr:2128 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:184698664
	[>....................] sync'ed:  0.8% (180368/181744)M
	finish: 26:12:15 speed: 1,940 (2,760) want: 2,120 K/sec
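If you only want the progress figure, the percentage can be extracted from that output. A sketch: sync_progress is a hypothetical helper, keyed to the sync'ed: field of the 8.4 format above.

```shell
# sync_progress: print the re-sync percentage (e.g. "0.8%") from
# /proc/drbd-style text on stdin:   sync_progress < /proc/drbd
sync_progress() {
  grep -o "sync'ed: *[0-9.]*%" | grep -o '[0-9.]*%'
}
```

Combined with a loop such as while sleep 10; do sync_progress < /proc/drbd; done, this gives a compact progress display.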
If the transfer is going to take ages, then tune it on the broken node:
drbdadm disk-options --c-plan-ahead=0 --resync-rate=110M jtelshared
drbdadm primary jtelshared

Verify this with cat /proc/drbd.
pcs cluster unstandby acd-lb-broken

Verify this with pcs status.
If the transfer was tuned, then untune it (on the broken node).
Note: it won't hurt to run this command anyway.
drbdadm adjust jtelshared
pcs status
cat /proc/drbd

# On some other linux machines
ls /home/jtel/shared

# Windows
dir \\acd-store\shared
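The two DRBD checks can also be rolled into one pass/fail test. A sketch: drbd_healthy is a hypothetical helper, assuming the 8.4 /proc/drbd format used throughout this document.

```shell
# drbd_healthy: exit 0 if the /proc/drbd text on stdin shows a connected
# resource with both disks UpToDate, non-zero otherwise.
drbd_healthy() {
  status=$(cat)
  echo "$status" | grep -q 'cs:Connected' &&
    echo "$status" | grep -q 'ds:UpToDate/UpToDate'
}
```

Usage: drbd_healthy < /proc/drbd && echo OK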
Sometimes, when DRBD fails, the file system will also become corrupt.
In this case both nodes might be primary; however, neither will have the share mounted.
The command mount /srv/jtel/shared will fail.
In this case, it may be necessary to repair the file system.
The kernel log (dmesg) will show errors like these:

[17354513.483526] XFS (drbd1): log mount/recovery failed: error -22
[17354513.483569] XFS (drbd1): log mount failed
[17355040.104433] XFS (drbd1): Mounting V5 Filesystem
[17355040.122234] XFS (drbd1): Corruption warning: Metadata has LSN (56:112832) ahead of current LSN (56:112733). Please unmount and run xfs_repair (>= v4.3) to resolve.
[17355040.122239] XFS (drbd1): log mount/recovery failed: error -22
[17355040.122322] XFS (drbd1): log mount failed
On one of the nodes (you need to choose one to become primary):
xfs_repair /dev/drbd1
pcs resource cleanup

Note: xfs_repair must be run against the unmounted DRBD device (here /dev/drbd1, as shown in the kernel log above), not the mount point.
This should then mount and start the resources on that node.
Then proceed with the other node as the "broken" node in the split-brain procedure above.