We have an Amazon ElastiCache for Redis replication group with two nodes: one primary and one replica. Last week the primary node suddenly stopped working: every connection to it eventually timed out.
According to the logs, there was a load burst, after which Redis rebooted itself.
Unfortunately, the node stopped responding after the reboot. The strange part was that replication between the nodes still worked. So I promoted the original replica to primary and logged into it. The `role` and `info` commands reported that replication was healthy and that the replica's IP was 10.0.x.x. That was interesting, because my VPC network is 172.31.x.x, which suggested something was wrong with the instance's 172.31.x.x NIC. I contacted AWS, their service team restarted that NIC, and things went back to normal.
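The mismatch above can be spotted automatically by parsing the `slaveN` lines of `INFO replication` and checking each replica IP against the expected VPC CIDR. Here is a minimal sketch; the sample `INFO replication` text and the `172.31.0.0/16` CIDR are illustrative assumptions, not output from my actual cluster:

```python
import ipaddress

# Illustrative sample of `INFO replication` output from the promoted primary.
# In practice this text would come from `redis-cli info replication`.
INFO_REPLICATION = """\
# Replication
role:master
connected_slaves:1
slave0:ip=10.0.12.34,port=6379,state=online,offset=123456,lag=0
"""

# Assumed VPC range for this example; substitute your own.
VPC_CIDR = ipaddress.ip_network("172.31.0.0/16")

def replica_ips(info_text):
    """Extract replica IP addresses from the slaveN lines of INFO replication."""
    ips = []
    for line in info_text.splitlines():
        if line.startswith("slave") and "ip=" in line:
            # slave0:ip=...,port=...,state=... -> dict of key=value fields
            fields = dict(kv.split("=", 1)
                          for kv in line.split(":", 1)[1].split(","))
            ips.append(ipaddress.ip_address(fields["ip"]))
    return ips

for ip in replica_ips(INFO_REPLICATION):
    status = "OK" if ip in VPC_CIDR else "outside VPC CIDR, investigate"
    print(f"{ip}: {status}")
```

With the sample data above, the replica IP 10.0.12.34 falls outside 172.31.0.0/16, which is exactly the anomaly that pointed at the broken NIC.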
The ironic part is that the AWS console still showed everything as green while the 172.31.x.x NIC was not functional. It looks like AWS only monitors its internal network (in this case, the 10.0.x.x network). I have submitted a feature request suggesting that they improve the monitoring.