There are many scenarios in which some of the nodes in a Redis cluster are removed and eventually transition to the fail state. When you try to forget these nodes from a cluster that has many nodes, you might notice that they reappear in the handshake state.
A few things to note here that might not be intuitive:
1. Forget has to be run on all nodes
Just running cluster forget <node-id> on one of the cluster nodes will not remove the failed node from the cluster. It has to be run on all the nodes for the change to take effect.
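To see this for yourself, you can check which nodes still list a given node id after a forget. The sketch below is only an illustration: still_known_on is a hypothetical helper name, and it assumes IPv4 addresses and that redis-cli can reach every node without authentication.

```shell
#!/usr/bin/env bash
# still_known_on: hypothetical helper, not part of redis-cli.
# Prints the address of every healthy node whose `cluster nodes`
# output still lists the given node id.
# Usage: still_known_on <seed-host> <seed-port> <node-id>
still_known_on() {
    local seed_host="$1" seed_port="$2" node_id="$3"
    local addr host port

    # Ask the seed node for the addresses of nodes that are not in
    # handshake or fail state (this also skips "fail?" entries).
    for addr in $(redis-cli -h "$seed_host" -p "$seed_port" cluster nodes \
                  | grep -vE 'handshake|fail' | awk '{print $2}'); do
        addr=${addr%@*}     # drop the "@cport" suffix of newer Redis versions
        host=${addr%:*}
        port=${addr##*:}
        # Field 1 of each `cluster nodes` line is the node id.
        if [ -n "$(redis-cli -h "$host" -p "$port" cluster nodes \
                   | awk -v id="$node_id" '$1 == id')" ]; then
            echo "$host:$port"
        fi
    done
}
```

If this prints any address after you ran forget on a single node, the failed node is still known there and can be gossiped back to the rest of the cluster.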
2. Handshake state
When you forget a failed node from a single, arbitrary node in the cluster, it might come back in handshake state with a different node id. The reason is that the failed node is still known to the other nodes in the cluster, so it is reintroduced through gossip.
3. Run forget on all nodes
So, an intuitive solution to this problem is to run forget on all nodes. But there is a catch: if you run it sequentially on these nodes, using the script below (taken from a GitHub issue), you’ll notice the nodes are still there.
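# $1 and $2 are the host and port of any reachable cluster node.
# List the addresses of all nodes that are not in handshake state.
nodes_addrs=$(redis-cli -h "$1" -p "$2" cluster nodes | grep -v handshake | awk '{print $2}')
echo $nodes_addrs
for addr in $nodes_addrs; do
    addr=${addr%@*}    # drop the "@cport" suffix that newer Redis versions print
    host=${addr%:*}
    port=${addr#*:}
    # On each node, collect the ids of entries in handshake or fail state.
    del_nodeids=$(redis-cli -h "$host" -p "$port" cluster nodes | grep -E 'handshake|fail' | awk '{print $1}')
    for nodeid in $del_nodeids; do
        echo "$host" "$port" "$nodeid"
        redis-cli -h "$host" -p "$port" cluster forget "$nodeid"
    done
done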
Also, if you use the node id that the cluster nodes command shows for a handshake state node, the node will come back yet again with a new random node id.
4. Run forget on all nodes in under a minute
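If you don’t run forget on all nodes in under a minute, the cluster will not forget that node. This is because cluster forget only bans the node id from being re-added for 60 seconds; once the ban expires, any node that still knows about the failed node can reintroduce it through gossip. To avoid this, make sure to run forget on all nodes within this limit. One way to do that is to use an Ansible playbook with the free strategy to execute forget on all nodes at once.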
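If Ansible is not an option, the same idea can be approximated in plain shell by sending the forget commands in the background instead of one after another. This is a minimal sketch under a few assumptions: forget_everywhere is a hypothetical helper name, addresses are IPv4, redis-cli needs no authentication, and every healthy node is reachable from the machine running it.

```shell
#!/usr/bin/env bash
# forget_everywhere: hypothetical helper, not part of redis-cli.
# Sends `cluster forget <node-id>` to every healthy node in parallel so
# that all of them ban the id inside the same 60-second window.
# Usage: forget_everywhere <seed-host> <seed-port> <failed-node-id>
forget_everywhere() {
    local seed_host="$1" seed_port="$2" node_id="$3"
    local addr host port

    # Collect the addresses of nodes that are not in handshake or fail
    # state (this also skips "fail?" entries).
    for addr in $(redis-cli -h "$seed_host" -p "$seed_port" cluster nodes \
                  | grep -vE 'handshake|fail' | awk '{print $2}'); do
        addr=${addr%@*}     # drop the "@cport" suffix of newer Redis versions
        host=${addr%:*}
        port=${addr##*:}
        # Run each forget in the background rather than sequentially.
        redis-cli -h "$host" -p "$port" cluster forget "$node_id" &
    done

    # Block until every background forget has returned.
    wait
}
```

For example, forget_everywhere 10.0.0.1 6379 <node-id> would contact each healthy node known to 10.0.0.1 at roughly the same time; the host and port here are placeholders.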
5. Use only the fail state node id
While doing a cluster forget, don’t worry about the handshake node ids. Instead, on any of the nodes, find the node id listed for the same IP where the state is fail, and forget that one. The handshake state entries will disappear automatically when you do this.
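Picking out the right id can be done by filtering the cluster nodes output on the address and flags columns. The helper below is a sketch, not an official tool: fail_ids_for_ip is a hypothetical name, and it assumes IPv4 addresses.

```shell
#!/usr/bin/env bash
# fail_ids_for_ip: hypothetical helper, not part of redis-cli.
# Reads `cluster nodes` output on stdin and prints the ids of entries
# whose address matches the given IP and whose flags include "fail".
# Handshake entries for the same IP are ignored.
fail_ids_for_ip() {
    local ip="$1"
    awk -v ip="$ip" '
        {
            # Field 2 is the address (ip:port, or ip:port@cport).
            split($2, addr, ":")
            # Field 3 is a comma-separated flag list, e.g. "master,fail".
            n = split($3, flags, ",")
            is_fail = 0
            for (i = 1; i <= n; i++) if (flags[i] == "fail") is_fail = 1
            if (addr[1] == ip && is_fail) print $1
        }'
}

# Usage (host, port and IP are placeholders):
#   redis-cli -h 10.0.0.1 -p 6379 cluster nodes | fail_ids_for_ip 10.0.0.7
```

Matching the flag exactly means entries marked only "fail?" (possibly failing) or "handshake" are skipped, so only the id worth forgetting is printed.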
That’s it! Follow me on Twitter/LinkedIn for more. Thanks!