Dave Vaughan
2016-06-13 12:56:10 UTC
This morning I came in to find that one whole shard (2 nodes) had failed on my
test Redis cluster. The cluster is made up of 6 nodes: 3 masters and 3 slaves.
This meant that a third of the cluster's slots were unavailable. It appears
that the nodes crashed and were rebooted at around 7am, but Redis failed to
start on either of them with:
Jun 13 07:00:40 testb redis[1215]: Unrecoverable error: corrupted cluster config file.
Jun 13 07:00:40 testb systemd: redis.service: main process exited, code=exited, status=1/FAILURE
Jun 13 07:00:40 testb redis-cli: Could not connect to Redis at 127.0.0.1:6379: Connection refused
Jun 13 07:00:40 testb systemd: redis.service: control process exited, code=exited status=1
Jun 13 07:00:40 testb systemd: Unit redis.service entered failed state.
Jun 13 07:00:40 testb systemd: redis.service failed.
Looking at /etc/redis/nodes.conf (the cluster config file), it was indeed corrupt:
5fe341c8812d8871fc57f030a03049a76c3e835f 192.168.10.114:6379 slave,fail 654d6848ff1892ac983234917a643336c2de17ad 1465796156340 1465796154329 53 connected
199a2b70a36d40c8a11a9e48a4c7d07a87f0a540 192.168.10.110:6379 master,fail? - 1465796178468 1465796175956 52 connected 5461-10922
4c5ff5cc4b6b611367af1e1e30ca71d4173e5889 192.168.10.113:6379 myself,slave 199a2b70a36d40c8a11a9e48a4c7d07a87f0a540 0 0 50 connected
92d13c4f7f0276ec7e5ede3e31cacbd8ab3d012a 192.168.10.109:6379 slave 8fc4c824986389815cc8e616c3fd892152a1d371 0 1465796278528 51 connected
654d6848ff1892ac983234917a643336c2de17ad 192.168.10.111:6379 master,fail - 1465796147797 1465796145287 53 connected 10923-16383
8fc4c824986389815cc8e616c3fd892152a1d371 192.168.10.112:6379 master - 1465796277751 14657962
Note that the last row is missing data, and there is no "vars" row. Looking at
http://download.redis.io/redis-stable/src/cluster.c, the "corrupted cluster
config file" error is most likely raised because the last row has fewer than
8 fields.
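For anyone who wants to check their own nodes.conf before restarting, here's a
rough sanity check in Python. It is not the real parser from cluster.c, just my
approximation of the check clusterLoadConfig appears to do: flag any node row
with fewer than 8 whitespace-separated fields (the truncated last row above
fails this).

#!/usr/bin/env python
"""Rough sanity check for a Redis Cluster nodes.conf file.

This only approximates the validation in clusterLoadConfig() in cluster.c:
every node row should have at least 8 space-separated fields
(id, address, flags, master id, ping-sent, pong-recv, config epoch, link state).
It is NOT the real parser, just a quick way to spot a truncated file before
redis-server refuses to start.
"""
import sys

def check_nodes_conf(path):
    ok = True
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            if not fields:
                continue              # skip blank lines
            if fields[0] == "vars":
                continue              # the trailing "vars ..." row is not a node row
            if len(fields) < 8:
                ok = False
                print("line %d: only %d fields (expected >= 8): %r"
                      % (lineno, len(fields), line.strip()))
    return ok

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/etc/redis/nodes.conf"
    sys.exit(0 if check_nodes_conf(path) else 1)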
The only way I managed to resolve it was to evict both nodes from the cluster,
re-add them, and then re-assign the slots (see the sketch below).
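In case it helps anyone, here's roughly what that recovery amounted to,
sketched with redis-py rather than the exact redis-cli commands I ran. The
addresses, node IDs and the 5461-10922 slot range below are taken from my
cluster (I believe the failed shard was the 192.168.10.110 master and its
192.168.10.113 slave), so adjust them for yours. It assumes the two failed
nodes have already been restarted with a fresh, empty nodes.conf, and of
course the data that lived on the dead shard is not recovered.

import time
import redis  # pip install redis

# Assumptions from my cluster -- change these for your own.
SURVIVORS = [("192.168.10.109", 6379), ("192.168.10.111", 6379),
             ("192.168.10.112", 6379), ("192.168.10.114", 6379)]
FAILED_IDS = ["199a2b70a36d40c8a11a9e48a4c7d07a87f0a540",  # old master (.110)
              "4c5ff5cc4b6b611367af1e1e30ca71d4173e5889"]  # old slave (.113)
NEW_MASTER = ("192.168.10.110", 6379)
NEW_SLAVE = ("192.168.10.113", 6379)
ORPHANED_SLOTS = range(5461, 10923)  # slots owned by the dead shard

# 1. Make every surviving node forget the two dead node IDs.
for host, port in SURVIVORS:
    node = redis.StrictRedis(host=host, port=port)
    for node_id in FAILED_IDS:
        node.execute_command("CLUSTER FORGET", node_id)

# 2. Introduce the freshly restarted (empty) nodes to the cluster.
survivor = redis.StrictRedis(host=SURVIVORS[0][0], port=SURVIVORS[0][1])
for host, port in (NEW_MASTER, NEW_SLAVE):
    survivor.execute_command("CLUSTER MEET", host, port)
time.sleep(2)  # give the gossip a moment to propagate

# 3. Assign the orphaned slot range to the new master...
master = redis.StrictRedis(host=NEW_MASTER[0], port=NEW_MASTER[1])
master.execute_command("CLUSTER ADDSLOTS", *ORPHANED_SLOTS)

# 4. ...and make the other node replicate it.
new_master_id = master.execute_command("CLUSTER MYID")
slave = redis.StrictRedis(host=NEW_SLAVE[0], port=NEW_SLAVE[1])
slave.execute_command("CLUSTER REPLICATE", new_master_id)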
To me, it looks like the process crashed while it was writing the cluster
config file - but how did it happen on both the master and the slave? Or was
the corrupt cluster config replicated to the slave?
Just wondering if anyone else has experienced this, or maybe knows what
happened?
Many thanks,
Dave