Caution

This is an advanced topic. Use at your own risk and ALWAYS back up your data beforehand.

Useful Commands

View DRBD Status - DRBD 8.4 (CentOS 7)

Code Block
cat /proc/drbd

View DRBD Status - DRBD 9

Code Block
drbdadm status

Reload all parameters

Code Block
drbdadm adjust jtelshared

Disconnect the share (useful for planned maintenance)

Code Block
drbdadm disconnect jtelshared

Take the share down (useful for planned maintenance)

Code Block
drbdadm down jtelshared

Bring the share up

Code Block
drbdadm up jtelshared

Set the node to primary

Code Block
drbdadm primary jtelshared

Connect the share

Code Block
drbdadm connect jtelshared

PCS Cluster Commands (CentOS 8)

Code Block
pcs cluster stop acd-store2
pcs cluster start acd-store2
pcs node standby acd-store2
pcs node unstandby acd-store2

Split Brain

Background

See here:

https://docs.linbit.com/doc/users-guide-84/s-resolve-split-brain/

Symptoms - CentOS 7 and earlier

Code Block

title	cat /proc/drbd

cat /proc/drbd

-->

GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
1: cs:StandAlone ro:Primary/Unknown ds:UpToDate/DUnknown r-----
 ns:0 nr:119823323 dw:119823323 dr:2128 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

cs:StandAlone means the node is not connected.

This should be visible on both sides.
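
This check can be scripted. The following is a minimal sketch, assuming the DRBD 8.4 /proc/drbd format shown above. The function name drbd_is_standalone is hypothetical and not part of DRBD.

```shell
# Hypothetical helper: succeeds (exit status 0) if the /proc/drbd text
# on stdin reports the connection state StandAlone for any device.
drbd_is_standalone() {
    grep -q 'cs:StandAlone'
}

# Usage on a CentOS 7 / DRBD 8.4 node:
#   drbd_is_standalone < /proc/drbd && echo "node is StandAlone"
```

Run it on both nodes. A split brain normally shows StandAlone on both sides.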

Symptoms - CentOS 8

Code Block

title	drbdadm status

drbdadm status

-->

jtelshared role:Primary
  disk:UpToDate
  acd-store1 connection:Connecting

drbdadm status

-->

# No currently configured DRBD found.

The first output shows that DRBD is active on the first node. The second output, taken on the second node, shows that DRBD is not active there.

Note: this can be due to the second node being stopped or in standby.
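
To pick out the peer connection state from this output in a script, something like the following may help. It is only a sketch, assuming the drbdadm status layout shown above, where a peer that is not connected appears as "peer connection:State". The function name drbd9_peer_connections is hypothetical.

```shell
# Hypothetical helper: reads `drbdadm status` text on stdin and prints
# "peer state" for every peer line that carries a connection: field.
drbd9_peer_connections() {
    awk '$2 ~ /^connection:/ { sub(/^connection:/, "", $2); print $1, $2 }'
}

# Usage on a CentOS 8 / DRBD 9 node:
#   drbdadm status jtelshared | drbd9_peer_connections
```

If this prints nothing, either no connection problem is reported or DRBD is not configured on the node. Check the raw output in that case.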

Find out which node is active in the PCS cluster - CentOS 7

Code Block

title	pcs status

pcs status

-->

Cluster name: portal

Stack: corosync
Current DC: acd-store1 (version 1.1.16-12.el7_4.7-94ff4df) - partition with quorum
Last updated: Sun Mar 18 18:05:32 2018
Last change: Fri Feb 16 00:07:51 2018 by root via cibadmin on acd-store2
2 nodes configured
3 resources configured
Node acd-store1: standby
Online: [ acd-store2 ]
Full list of resources:
Resource Group: haproxy_group
 ClusterDataJTELSharedMount (ocf::heartbeat:Filesystem): Started acd-store2
 ClusterIP (ocf::heartbeat:IPaddr2): Started acd-store2
 samba (systemd:smb): Started acd-store2
Daemon Status:
 corosync: active/enabled
 pacemaker: active/enabled
 pcsd: active/enabled

In the example above, the first node is in standby. The most important thing to check is on which server the resources are started.

In this case, the resources are started on acd-store2.

This will therefore be defined as the NON BROKEN node.

Find out which node is active in the PCS cluster - CentOS 8

Code Block

title	pcs status

pcs status

-->

Cluster name: jtel_cluster
Cluster Summary:
  * Stack: corosync
  * Current DC: acd-lb1 (version 2.0.3-5.el8_2.1-4b1f869f0f) - partition with quorum
  * Last updated: Sat Oct  3 12:39:22 2020
  * Last change:  Sat Oct  3 12:31:22 2020 by root via cibadmin on acd-lb2
  * 2 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ acd-lb1 ]
  * OFFLINE: [ acd-lb2 ]

Full List of Resources:
  * Clone Set: DRBDClusterMount-clone [DRBDClusterMount] (promotable):
    * Masters: [ acd-lb1 ]
    * Stopped: [ acd-lb2 ]
  * DRBDClusterFilesystem       (ocf::heartbeat:Filesystem):    Started acd-lb1
  * Samba       (systemd:smb):  Started acd-lb1
  * ClusterIP   (ocf::heartbeat:IPaddr2):       Started acd-lb1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

In the example above, the second node is offline. The most important thing to check is on which server the resources are started.

In this case, the resources are started on acd-lb1.

This will therefore be defined as the NON BROKEN node.
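
The node on which the resources are started can also be read from the pcs status output with a small filter. This is a sketch only, assuming the "Started <node>" layout of the two examples above. The function name pcs_started_nodes is hypothetical.

```shell
# Hypothetical helper: reads `pcs status` text on stdin and prints each
# distinct node name that follows the word "Started".
pcs_started_nodes() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "Started") print $(i + 1) }' | sort -u
}

# Usage on either node:
#   pcs status | pcs_started_nodes
```

If exactly one node name is printed, that node is the NON BROKEN node. If more than one is printed, inspect the full pcs status output before continuing.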

Standby the broken node in the PCS cluster (if necessary)

This command can be run on either machine.

CentOS 7

Code Block

title	Standby broken node

pcs cluster standby acd-lb-broken
 
--> Verify this with
 
pcs status

CentOS 8

Code Block

title	Standby broken node

pcs node standby acd-lb-broken
 
--> Verify this with
 
pcs status

On the broken node

Note: the first command will probably throw an error. Also, the share may not be mounted. This is OK.

Code Block

title	drbd on broken node

umount /srv/jtel/shared
drbdadm disconnect jtelshared
drbdadm secondary jtelshared
drbdadm connect --discard-my-data jtelshared

On the healthy node

Code Block

title	drbd on healthy node

drbdadm primary jtelshared
drbdadm connect jtelshared

Check re-sync activity

The re-sync might take a long time.

Watch the status of this using:

cat /proc/drbd

Example output:

Code Block

title	cat /proc/drbd

[root@storage01 ~]# cat /proc/drbd
version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
1: cs:SyncTarget ro:Secondary/Primary ds:Inconsistent/UpToDate C r-----
 ns:0 nr:1411538 dw:121234862 dr:2128 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:184698664
 [>....................] sync'ed: 0.8% (180368/181744)M
 finish: 26:12:15 speed: 1,940 (2,760) want: 2,120 K/sec
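
To log the progress periodically instead of reading the full output each time, the two relevant values can be extracted. The following is a sketch, assuming the DRBD 8.4 /proc/drbd format shown above. The function name drbd_sync_progress is hypothetical.

```shell
# Hypothetical helper: reads /proc/drbd text on stdin and prints the
# value after "sync'ed:" and the time after "finish:" on one line.
drbd_sync_progress() {
    awk '
        /sync.ed:/ { for (i = 1; i < NF; i++) if ($i ~ /^sync.ed:$/) pct = $(i + 1) }
        /finish:/  { for (i = 1; i < NF; i++) if ($i == "finish:") eta = $(i + 1) }
        END { if (pct != "") print "synced=" pct, "finish=" eta }
    '
}

# Usage on the broken node (CentOS 7 / DRBD 8.4):
#   drbd_sync_progress < /proc/drbd
```

Nothing is printed when no re-sync is running.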

Tune the transfer (Second Node) - Only CentOS 7.x

Currently there is no procedure for tuning the transfer on CentOS 8.x

If the transfer is going to take a very long time, then tune it on the broken node:

Code Block

title	drbd Transfer Tuning (on broken node)

drbdadm disk-options --c-plan-ahead=0 --resync-rate=110M jtelshared
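
To judge whether tuning is worthwhile, the remaining time can be estimated from the oos value. The sketch below assumes that oos is reported in KiB and that a resync rate of 110M means 110 MiB/s (112640 KiB/s). The real rate is limited by network and disk, so treat the result as a lower bound. The function name resync_eta_seconds is hypothetical.

```shell
# Hypothetical helper: estimated seconds to re-sync.
#   $1 = out-of-sync amount in KiB (the oos: field of /proc/drbd)
#   $2 = resync rate in KiB per second
resync_eta_seconds() {
    echo $(( $1 / $2 ))
}

# With the example output above (oos:184698664):
#   resync_eta_seconds 184698664 1940     # observed speed of 1,940 K/sec
#   resync_eta_seconds 184698664 112640   # tuned rate of 110 MiB/s
```

For the example above this gives about 95205 seconds (roughly 26 hours) at the observed speed, and about 1639 seconds (roughly 27 minutes) at the tuned rate.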

Put broken node back to primary - CentOS 7.x ONLY

Do not do this on CentOS 8.x installations! Here DRBD is managed by the cluster.

Code Block

title	Set broken node to primary

drbdadm primary jtelshared
 
--> Verify this with
 
cat /proc/drbd

Restart PCS node

CentOS 7.x

Code Block

title	Unstandby broken node

pcs cluster unstandby acd-lb-broken
 
--> Verify this with
 
pcs status

...

CentOS 8.x

Code Block

title	Unstandby broken node

pcs cluster start acd-lb-broken
pcs node unstandby acd-lb-broken
 
--> Verify this with
 
pcs status

Untune the transfer (Second Node) - CentOS 7.x only

If the transfer was tuned, then untune it (on the broken node).

Note: it won't hurt to run this command anyway.

Code Block

title	drbd - Untune Transfer

drbdadm adjust jtelshared

Check everything

Code Block

title	Check everything

pcs status
# CentOS 7.x
cat /proc/drbd
# CentOS 8.x
drbdadm status
# On another Linux machine
ls /home/jtel/shared
# Windows
dir \\acd-store\shared

File System Corrupt

Sometimes, when DRBD fails, the file system will also become corrupt.

In this case, both nodes might be primary, but neither will have the share mounted.

The command mount /srv/jtel/shared will fail.

In this case, it may be necessary to repair the file system.

Symptoms

Code Block

[17354513.483526] XFS (drbd1): log mount/recovery failed: error -22
[17354513.483569] XFS (drbd1): log mount failed
[17355040.104433] XFS (drbd1): Mounting V5 Filesystem
[17355040.122234] XFS (drbd1): Corruption warning: Metadata has LSN (56:112832) ahead of current LSN (56:112733). Please unmount and run xfs_repair (>= v4.3) to resolve.
[17355040.122239] XFS (drbd1): log mount/recovery failed: error -22
[17355040.122322] XFS (drbd1): log mount failed
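
The kernel log can be checked for these messages with a simple filter. This is a sketch, assuming messages worded as in the example above. The function name xfs_mount_errors_present is hypothetical, and a match only indicates that the messages are present, not that a repair is safe.

```shell
# Hypothetical helper: succeeds (exit status 0) if the kernel log text
# on stdin contains an XFS corruption warning or a failed log mount.
xfs_mount_errors_present() {
    grep -Eq 'XFS \([^)]*\): (Corruption warning|log mount failed)'
}

# Usage on the node where the mount fails:
#   dmesg | xfs_mount_errors_present && echo "XFS log mount errors found"
```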

Repairing

On one of the nodes (choose the one that is to become primary):

Code Block
xfs_repair /dev/drbd/by-res/jtelshared/0
pcs resource cleanup

This should then mount and start the resources on that node.

Then proceed with the other node as "broken" in the split brain situation.

Stalled Resync

If the DRBD resync stalls (the output of cat /proc/drbd will contain "stalled"), it may be necessary to restart the machine.

This has been observed once, and restarting resolved the situation. However, not much more is known about this state, or its cause, at this time.

Failed Connect (Unrelated data, aborting)

When the secondary has been told to discard its data, and all of the commands to start the sync have been entered on both the healthy and the broken node, sometimes cat /proc/drbd will not report a connection.

Check /var/log/messages

If you can see output like this:

Code Block
kernel: block drbd0: uuid_compare()=-1000 by rule 100
kernel: block drbd0: Unrelated data, aborting!

Then the metadata has become corrupt.
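
To find the affected device in a large log file, a filter like the following may help. It is a sketch, assuming log lines worded as in the example above. The function name drbd_unrelated_devices is hypothetical.

```shell
# Hypothetical helper: reads log text on stdin and prints each DRBD
# device for which "Unrelated data, aborting!" was logged.
drbd_unrelated_devices() {
    sed -n 's/.*block \(drbd[0-9][0-9]*\): Unrelated data, aborting!.*/\1/p' | sort -u
}

# Usage on either node:
#   drbd_unrelated_devices < /var/log/messages
```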

This requires that the metadata be completely reconstructed on the bad node.

Use the following commands to recreate the metadata on the broken node:

Code Block
drbdadm down jtelshared
drbdadm wipe-md jtelshared
drbdadm create-md jtelshared

Then proceed with the re-sync as above, starting with the part "On the broken node" (the commands that place the node in secondary and discard its data).

Failed Connect (Unknown Connection)

This produces an error like the following:

Code Block
??: Failure: (162) Invalid configuration request
additional info from kernel:
unknown connection

In this case, drbd might not be loaded and enabled.

Execute the following code on the broken node and then proceed as above:

Code Block
modprobe drbd
systemctl enable drbd
systemctl start drbd
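
Whether the kernel module is loaded can be checked first. The following is a sketch, assuming drbd is built as a loadable module and therefore appears in /proc/modules when loaded. The function name drbd_module_loaded is hypothetical.

```shell
# Hypothetical helper: succeeds (exit status 0) if the /proc/modules
# text on stdin lists a module named exactly "drbd" in the first column.
drbd_module_loaded() {
    awk '$1 == "drbd" { found = 1 } END { exit !found }'
}

# Usage on the broken node:
#   drbd_module_loaded < /proc/modules || echo "drbd module is not loaded"
```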
