RediB

Introduction

RediB (Regression framework for Distributed system Benchmark) is an infrastructure for the deterministic reproduction of distributed system failures. RediB provides both a dataset of known distributed-systems bugs, called RediD, and a toolset, called RediT.

Currently, node failure, network partition, network delay, network packet loss, and clock drift are supported. For a few supported languages, it is also possible to enforce a specific order of events across nodes to reproduce a time-sensitive scenario, and to inject failures before or after a specific method is called when a specific stack trace is present. You can find the full documentation here.
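
To make the idea of enforcing an order between nodes more concrete, here is a minimal, self-contained Python sketch. It does not use RediT's API: `RunSequence`, `enforce`, `run_scenario`, and the event names are hypothetical names chosen for illustration only. Two threads stand in for a node and a failure injector, and the sequence guarantees that the simulated crash lands exactly between the write and the read, whatever the thread scheduler does.

```python
import threading


class RunSequence:
    """Hypothetical helper (not RediT's API): forces named events that
    run on different threads to happen in one fixed order."""

    def __init__(self, order):
        self._order = list(order)
        self._next = 0                      # index of the next allowed event
        self._cond = threading.Condition()
        self.log = []                       # events in the order they ran

    def enforce(self, event, action=None, timeout=5.0):
        """Block until `event` is next in the sequence, then run `action`."""
        with self._cond:
            is_turn = lambda: (self._next < len(self._order)
                               and self._order[self._next] == event)
            if not self._cond.wait_for(is_turn, timeout=timeout):
                raise TimeoutError(f"event {event!r} was never scheduled")
            if action is not None:
                action()
            self.log.append(event)
            self._next += 1
            self._cond.notify_all()         # wake threads waiting for later events


def run_scenario():
    # Assumed scenario: node 1 writes, node 2 crashes, node 1 reads.
    seq = RunSequence(["n1_write", "n2_crash", "n1_read"])
    state = {"value": None, "n2_alive": True, "read": None}

    def node1():
        seq.enforce("n1_write", lambda: state.update(value=42))
        seq.enforce("n1_read", lambda: state.update(
            read=(state["value"], state["n2_alive"])))

    def injector():
        # Simulated node failure, injected at a fixed point in the sequence.
        seq.enforce("n2_crash", lambda: state.update(n2_alive=False))

    # The injector is started first on purpose; the order is still enforced.
    threads = [threading.Thread(target=f) for f in (injector, node1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seq.log, state


if __name__ == "__main__":
    log, state = run_scenario()
    print(log, state["read"])
```

Because every event waits for its turn, repeated runs produce the same interleaving, which is the property a regression test for a timing-sensitive bug needs. A real tool would have to coordinate separate processes or containers rather than threads in one process.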

RediD

We applied RediT to seven widely used cloud systems. The following table lists the bugs reproduced by RediT in these systems.

Bug ID	Bug Description	Testcase in RediD
AMQ-6000	Pause/resume feature of ActiveMQ not resuming properly	Code
AMQ-6010	AMQP SSL Transport “leaking” currentTransportCounts	Code
AMQ-6059	DLQ message lost after broker restarts	Code
AMQ-6430	noLocal=true in durable subscriptions is ignored after reconnect	Code
AMQ-6697	Aborting a STOMP 1.1 transaction after ACK/NACK leads to invalid state	Code
AMQ-6847	Immediate poison ACK after move from DLQ leads to message loss	Code
AMQ-6500	Consuming problem with topics in ActiveMQ 5.14.1 with AMQP Qpid client	Code
AMQ-6796	Acknowledging messages out of order in a STOMP 1.1 transaction raises exception	Code
AMQ-6823	No message received with prefetch 0 over http	Code
AMQ-7129	Durable subscription messages can be orphaned when using individual ack mode	Code
CASSANDRA-10968	When taking snapshot, manifest.json contains incorrect or no files	Code
CASSANDRA-15814	order by descending on frozen list not working	Code
CASSANDRA-14365	Commit log replay failure for static columns with collections in clustering keys	Code
CASSANDRA-12424	Assertion failure in ViewUpdateGenerator	Code
CASSANDRA-13666	Secondary idx query on partition key cols not return partitions with only static data	Code
CASSANDRA-14242	Indexed static column returns inconsistent results	Code
CASSANDRA-13669	Error when starting cassandra: Unable to make UUID from ‘aa’ (SASI index)	Code
CASSANDRA-13464	Failed to create Materialized view with a specific token range	Code
CASSANDRA-15297	nodetool can not create snapshot with snapshot name that have special character	Code
CASSANDRA-16836	Materialized views incorrect quoting of UDF	Code
CASSANDRA-17628	CQL writetime and ttl functions should be forbidden for multicell columns	Code
HDFS-8950	NameNode refresh doesn’t remove DataNodes that are no longer in the allowed list	Code
HDFS-10239	Fsshell mv fails if port usage doesn’t match in src and destination paths	Code
HDFS-11379	DFSInputStream may infinite loop requesting block locations	Code
HDFS-14504	Rename with Snapshots does not honor quota limit	Code
HDFS-14987	EC: EC file blockId location info displaying as “null” with hdfs fsck-blockId command	Code
HDFS-14869	Data loss in case of distcp using snapshot diff.	Code
HDFS-14499	Misleading REM_QUOTA value with snapshot and trash feature enabled for a directory	Code
HDFS-15446	CreateSnapshotOp fails during edit log loading	Code
HDFS-15398	EC: hdfs client hangs due to exception during addBlock	Code
HBASE-19850	The number of Offline Regions is wrong after restoring a snapshot	Code
HBASE-23682	Fix NPE when disable DeadServerMetricRegionChore	Code
HBASE-24189	WALSplit recreates region dirs for deleted table with recovered edits data	Code
HBASE-24135	TableStateNotFoundException happends when table creation if rsgroup is enable	Code
HBASE-26742	Comparator of NOT_EQUAL NULL is invalid for checkAndMutate	Code
HBASE-26901	delete with null columnQualifier occurs NPE when NewVersionBehavior is on	Code
HBASE-26027	The calling of HTable.batch blocked caused by ArrayStoreException	Code
KAFKA-9254	Updating Broker configuration dynamically twice reverts log configuration to default	Code
KAFKA-5098	Producer.send() blocks and generates TimeoutException if topic name has illegal char	Code
KAFKA-7496	KafkaAdminClient#describeAcls should handle invalid filters gracefully	Code
KAFKA-12257	Consumer mishandles topics deleted and recreated with the same name	Code
KAFKA-12866	Kafka requires ZK root access even when using a chroot	Code
KAFKA-13310	KafkaConsumer cannot jump out of the poll method, and the consumer is blocked	Code
KAFKA-13964	kafka-configs.sh end with UVE when describing TLS user with quotas	Code
KAFKA-13488	Producer fails to recover if topic gets deleted (and gets auto-created)	Code
KAFKA-14303	Producer.send without record key and batch.size=0 goes into infinite loop	Code
ROCKETMQ-281	add check for preventing repeat start mq	Code
ROCKETMQ-231	Pull result size is always less than given size in PullConsumer	Code
ROCKETMQ-255	Offset store is null after consumer clients start()	Code
ROCKETMQ-266	Can’t start consumer with a small “consumerThreadMax” number	Code
ROCKETMQ-1409	rocketmq tools queryMsgByKey may have bug!	Code
ROCKETMQ-3175	updateAclConfig cause broker fail to start	Code
ROCKETMQ-3281	cannot delete topic/group perms in acl config	Code
ROCKETMQ-3556	When broker is down, rocketmq client can not retry under Async send model	Code
ZOOKEEPER-706	large numbers of watches can cause session re-establishment to fail	Code
ZOOKEEPER-1366	Zookeeper should be tolerant of clock adjustments	Code
ZOOKEEPER-2052	Unable to delete a node when the node has no children	Code
ZOOKEEPER-4466	Support different watch modes on same path	Code
ZOOKEEPER-4508	ZooKeeper client run to endless loop in ClientCnxn.SendThread.run if all server down	Code
ZOOKEEPER-4473	zooInspector create root node fail with path validate	Code
ZOOKEEPER-1367	Data inconsistencies and unexpired ephemeral nodes after cluster restart	Code
ZOOKEEPER-2355	Ephemeral node is never deleted if follower fails while reading the proposal packet	Code
ZOOKEEPER-3895	Client side NullPointerException in case of empty Multi operation	Code

Table of Contents