RediB

Introduction

RediB (Regression framework for Distributed system Benchmark) is an infrastructure for the deterministic reproduction of distributed system failures. RediB provides both a dataset of known distributed-systems bugs, called RediD, and a toolset, called RediT.

Currently, node failure, network partition, network delay, network packet loss, and clock drift are supported. For a few supported languages, it is also possible to enforce a specific order between nodes to reproduce a time-sensitive scenario, and to inject failures before or after a specific method is called when a specific stack trace is present. The full documentation is available here.
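To make the idea of order enforcement concrete, here is a minimal, self-contained sketch in Python. It is not RediT's API: the `OrderEnforcer` class, the event names, and the `run_scenario` helper are all hypothetical and exist only for illustration. The sketch shows how a test can force events on different nodes to happen in a fixed global order and place a simulated failure at a chosen point in that order, regardless of how the threads are scheduled.

```python
import threading


class OrderEnforcer:
    """Toy model of run-sequence enforcement (hypothetical, not RediT's API)."""

    def __init__(self, sequence):
        self._sequence = list(sequence)   # required global order of event names
        self._next = 0                    # index of the next event allowed to proceed
        self._cond = threading.Condition()
        self.log = []                     # events in the order they actually ran

    def enforce(self, event, action=None, timeout=5.0):
        """Block until `event` is next in the sequence, run `action`, then advance."""
        def is_turn():
            return (self._next < len(self._sequence)
                    and self._sequence[self._next] == event)

        with self._cond:
            if not self._cond.wait_for(is_turn, timeout=timeout):
                raise TimeoutError(f"event {event!r} was never reached in order")
            self.log.append(event)
            if action is not None:
                action()                  # e.g. a simulated failure injection
            self._next += 1
            self._cond.notify_all()       # wake threads waiting for later events


def run_scenario():
    """Three threads race, but the enforced order is always the same."""
    enforcer = OrderEnforcer(
        ["n1.before_write", "inject.partition", "n2.before_read"]
    )
    injected = []                         # stand-in for a real failure injector

    def node1():
        enforcer.enforce("n1.before_write")

    def node2():
        enforcer.enforce("n2.before_read")

    def test_driver():
        enforcer.enforce(
            "inject.partition",
            lambda: injected.append("partition n1 | n2"),
        )

    # Start the threads in the "wrong" order on purpose.
    threads = [threading.Thread(target=f) for f in (node2, test_driver, node1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return enforcer.log, injected


if __name__ == "__main__":
    print(run_scenario())
```

In this sketch the simulated partition always lands after node 1 reaches its write and before node 2 reaches its read, which is the kind of time-sensitive interleaving a reproduction test needs to pin down. A real tool would hook these points into the running system, for example at method entry or exit, rather than into explicit calls.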

RediD

We applied RediT to 7 widely used cloud systems. The following table shows the bugs reproduced by RediT in these systems.

| Bug ID | Bug Description | Testcase in RediD |
|---|---|---|
| AMQ-6000 | Pause/resume feature of ActiveMQ not resuming properly | Code |
| AMQ-6010 | AMQP SSL Transport “leaking” currentTransportCounts | Code |
| AMQ-6059 | DLQ message lost after broker restarts | Code |
| AMQ-6430 | noLocal=true in durable subscriptions is ignored after reconnect | Code |
| AMQ-6697 | Aborting a STOMP 1.1 transaction after ACK/NACK leads to invalid state | Code |
| AMQ-6847 | Immediate poison ACK after move from DLQ leads to message loss | Code |
| AMQ-6500 | Consuming problem with topics in ActiveMQ 5.14.1 with AMQP Qpid client | Code |
| AMQ-6796 | Acknowledging messages out of order in a STOMP 1.1 transaction raises exception | Code |
| AMQ-6823 | No message received with prefetch 0 over http | Code |
| AMQ-7129 | Durable subscription messages can be orphaned when using individual ack mode | Code |
| CASSANDRA-10968 | When taking snapshot, manifest.json contains incorrect or no files | Code |
| CASSANDRA-15814 | order by descending on frozen list not working | Code |
| CASSANDRA-14365 | Commit log replay failure for static columns with collections in clustering keys | Code |
| CASSANDRA-12424 | Assertion failure in ViewUpdateGenerator | Code |
| CASSANDRA-13666 | Secondary idx query on partition key cols not return partitions with only static data | Code |
| CASSANDRA-14242 | Indexed static column returns inconsistent results | Code |
| CASSANDRA-13669 | Error when starting cassandra: Unable to make UUID from ‘aa’ (SASI index) | Code |
| CASSANDRA-13464 | Failed to create Materialized view with a specific token range | Code |
| CASSANDRA-15297 | nodetool can not create snapshot with snapshot name that have special character | Code |
| CASSANDRA-16836 | Materialized views incorrect quoting of UDF | Code |
| CASSANDRA-17628 | CQL writetime and ttl functions should be forbidden for multicell columns | Code |
| HDFS-8950 | NameNode refresh doesn’t remove DataNodes that are no longer in the allowed list | Code |
| HDFS-10239 | Fsshell mv fails if port usage doesn’t match in src and destination paths | Code |
| HDFS-11379 | DFSInputStream may infinite loop requesting block locations | Code |
| HDFS-14504 | Rename with Snapshots does not honor quota limit | Code |
| HDFS-14987 | EC: EC file blockId location info displaying as “null” with hdfs fsck-blockId command | Code |
| HDFS-14869 | Data loss in case of distcp using snapshot diff. | Code |
| HDFS-14499 | Misleading REM_QUOTA value with snapshot and trash feature enabled for a directory | Code |
| HDFS-15446 | CreateSnapshotOp fails during edit log loading | Code |
| HDFS-15398 | EC: hdfs client hangs due to exception during addBlock | Code |
| HBASE-19850 | The number of Offline Regions is wrong after restoring a snapshot | Code |
| HBASE-23682 | Fix NPE when disable DeadServerMetricRegionChore | Code |
| HBASE-24189 | WALSplit recreates region dirs for deleted table with recovered edits data | Code |
| HBASE-24135 | TableStateNotFoundException happends when table creation if rsgroup is enable | Code |
| HBASE-26742 | Comparator of NOT_EQUAL NULL is invalid for checkAndMutate | Code |
| HBASE-26901 | delete with null columnQualifier occurs NPE when NewVersionBehavior is on | Code |
| HBASE-26027 | The calling of HTable.batch blocked caused by ArrayStoreException | Code |
| KAFKA-9254 | Updating Broker configuration dynamically twice reverts log configuration to default | Code |
| KAFKA-5098 | Producer.send() blocks and generates TimeoutException if topic name has illegal char | Code |
| KAFKA-7496 | KafkaAdminClient#describeAcls should handle invalid filters gracefully | Code |
| KAFKA-12257 | Consumer mishandles topics deleted and recreated with the same name | Code |
| KAFKA-12866 | Kafka requires ZK root access even when using a chroot | Code |
| KAFKA-13310 | KafkaConsumer cannot jump out of the poll method, and the consumer is blocked | Code |
| KAFKA-13964 | kafka-configs.sh end with UVE when describing TLS user with quotas | Code |
| KAFKA-13488 | Producer fails to recover if topic gets deleted (and gets auto-created) | Code |
| KAFKA-14303 | Producer.send without record key and batch.size=0 goes into infinite loop | Code |
| ROCKETMQ-281 | add check for preventing repeat start mq | Code |
| ROCKETMQ-231 | Pull result size is always less than given size in PullConsumer | Code |
| ROCKETMQ-255 | Offset store is null after consumer clients start() | Code |
| ROCKETMQ-266 | Can’t start consumer with a small “consumerThreadMax” number | Code |
| ROCKETMQ-1409 | rocketmq tools queryMsgByKey may have bug! | Code |
| ROCKETMQ-3175 | updateAclConfig cause broker fail to start | Code |
| ROCKETMQ-3281 | cannot delete topic/group perms in acl config | Code |
| ROCKETMQ-3556 | When broker is down, rocketmq client can not retry under Async send model | Code |
| ZOOKEEPER-706 | large numbers of watches can cause session re-establishment to fail | Code |
| ZOOKEEPER-1366 | Zookeeper should be tolerant of clock adjustments | Code |
| ZOOKEEPER-2052 | Unable to delete a node when the node has no children | Code |
| ZOOKEEPER-4466 | Support different watch modes on same path | Code |
| ZOOKEEPER-4508 | ZooKeeper client run to endless loop in ClientCnxn.SendThread.run if all server down | Code |
| ZOOKEEPER-4473 | zooInspector create root node fail with path validate | Code |
| ZOOKEEPER-1367 | Data inconsistencies and unexpired ephemeral nodes after cluster restart | Code |
| ZOOKEEPER-2355 | Ephemeral node is never deleted if follower fails while reading the proposal packet | Code |
| ZOOKEEPER-3895 | Client side NullPointerException in case of empty Multi operation | Code |