RediB
Introduction
RediB (Regression framework for Distributed system Benchmark) is an infrastructure for the deterministic reproduction of distributed system failures. RediB provides both a dataset of known distributed-systems bugs, called RediD, and a toolset, called RediT.
Currently, node failure, network partition, network delay, network packet loss, and clock drift are supported. For a few supported languages, it is also possible to enforce a specific order of events across nodes in order to reproduce a time-sensitive scenario, and to inject failures before or after a specific method is called when a specific stack trace is present. You can find the full documentation here.
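To make the idea of enforcing an order between nodes concrete, here is a minimal conceptual sketch in Python. It is not RediT's API: the `OrderEnforcer` class, the `run_scenario` function, and the event names (`n1_write`, `kill_n2`, `n1_read`) are hypothetical and exist only for illustration. Each "node" is modeled as a thread, and an event is allowed to proceed only after the event listed before it has completed, so the interleaving is the same on every run.

```python
import threading


class OrderEnforcer:
    """Toy scheduler (hypothetical, not part of RediT).

    Each named event blocks until the event listed before it has completed.
    """

    def __init__(self, order):
        self._order = list(order)
        self._done = {name: threading.Event() for name in self._order}
        self._lock = threading.Lock()
        self.log = []  # events in the order they actually ran

    def enforce(self, name, action=None):
        idx = self._order.index(name)
        if idx > 0:
            # Wait for the immediate predecessor; it in turn waited for its own.
            self._done[self._order[idx - 1]].wait()
        if action is not None:
            action()  # e.g. a write, a read, or an injected failure
        with self._lock:
            self.log.append(name)
        self._done[name].set()  # release the next event in the order


def run_scenario():
    # Desired order: node 1 writes, node 2 is killed, node 1 reads.
    enforcer = OrderEnforcer(["n1_write", "kill_n2", "n1_read"])
    # Threads are started in the "wrong" order on purpose.
    threads = [
        threading.Thread(target=enforcer.enforce, args=("n1_read",)),
        threading.Thread(target=enforcer.enforce, args=("kill_n2",)),
        threading.Thread(target=enforcer.enforce, args=("n1_write",)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return enforcer.log
```

Calling `run_scenario()` returns the events in the declared order regardless of how the threads are scheduled, which is the property a time-sensitive bug reproduction relies on. A real tool has to do the same thing across processes and machines, which this sketch does not attempt.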
RediD
We applied RediT to seven widely used cloud systems. The following table lists the bugs reproduced by RediT in these systems.
| Bug ID | Bug Description | Testcase in RediD |
|---|---|---|
| | Pause/resume feature of ActiveMQ not resuming properly | |
| | AMQP SSL Transport “leaking” currentTransportCounts | |
| | DLQ message lost after broker restarts | |
| | noLocal=true in durable subscriptions is ignored after reconnect | |
| | Aborting a STOMP 1.1 transaction after ACK/NACK leads to invalid state | |
| | Immediate poison ACK after move from DLQ leads to message loss | |
| | Consuming problem with topics in ActiveMQ 5.14.1 with AMQP Qpid client | |
| | Acknowledging messages out of order in a STOMP 1.1 transaction raises exception | |
| | No message received with prefetch 0 over http | |
| | Durable subscription messages can be orphaned when using individual ack mode | |
| | When taking snapshot, manifest.json contains incorrect or no files | |
| | order by descending on frozen list not working | |
| | Commit log replay failure for static columns with collections in clustering keys | |
| | Assertion failure in ViewUpdateGenerator | |
| | Secondary idx query on partition key cols not return partitions with only static data | |
| | Indexed static column returns inconsistent results | |
| | Error when starting cassandra: Unable to make UUID from ‘aa’ (SASI index) | |
| | Failed to create Materialized view with a specific token range | |
| | nodetool can not create snapshot with snapshot name that have special character | |
| | Materialized views incorrect quoting of UDF | |
| | CQL writetime and ttl functions should be forbidden for multicell columns | |
| | NameNode refresh doesn’t remove DataNodes that are no longer in the allowed list | |
| | Fsshell mv fails if port usage doesn’t match in src and destination paths | |
| | DFSInputStream may infinite loop requesting block locations | |
| | Rename with Snapshots does not honor quota limit | |
| | EC: EC file blockId location info displaying as “null” with hdfs fsck-blockId command | |
| | Data loss in case of distcp using snapshot diff. | |
| | Misleading REM_QUOTA value with snapshot and trash feature enabled for a directory | |
| | CreateSnapshotOp fails during edit log loading | |
| | EC: hdfs client hangs due to exception during addBlock | |
| | The number of Offline Regions is wrong after restoring a snapshot | |
| | Fix NPE when disable DeadServerMetricRegionChore | |
| | WALSplit recreates region dirs for deleted table with recovered edits data | |
| | TableStateNotFoundException happends when table creation if rsgroup is enable | |
| | Comparator of NOT_EQUAL NULL is invalid for checkAndMutate | |
| | delete with null columnQualifier occurs NPE when NewVersionBehavior is on | |
| | The calling of HTable.batch blocked caused by ArrayStoreException | |
| | Updating Broker configuration dynamically twice reverts log configuration to default | |
| | Producer.send() blocks and generates TimeoutException if topic name has illegal char | |
| | KafkaAdminClient#describeAcls should handle invalid filters gracefully | |
| | Consumer mishandles topics deleted and recreated with the same name | |
| | Kafka requires ZK root access even when using a chroot | |
| | KafkaConsumer cannot jump out of the poll method, and the consumer is blocked | |
| | kafka-configs.sh end with UVE when describing TLS user with quotas | |
| | Producer fails to recover if topic gets deleted (and gets auto-created) | |
| | Producer.send without record key and batch.size=0 goes into infinite loop | |
| | add check for preventing repeat start mq | |
| | Pull result size is always less than given size in PullConsumer | |
| | Offset store is null after consumer clients start() | |
| | Can’t start consumer with a small “consumerThreadMax” number | |
| | rocketmq tools queryMsgByKey may have bug! | |
| | updateAclConfig cause broker fail to start | |
| | cannot delete topic/group perms in acl config | |
| | When broker is down, rocketmq client can not retry under Async send model | |
| | large numbers of watches can cause session re-establishment to fail | |
| | Zookeeper should be tolerant of clock adjustments | |
| | Unable to delete a node when the node has no children | |
| | Support different watch modes on same path | |
| | ZooKeeper client run to endless loop in ClientCnxn.SendThread.run if all server down | |
| | zooInspector create root node fail with path validate | |
| | Data inconsistencies and unexpired ephemeral nodes after cluster restart | |
| | Ephemeral node is never deleted if follower fails while reading the proposal packet | |
| | Client side NullPointerException in case of empty Multi operation | |