Distributed storage systems (DSS) are instrumental for providing reliable storage solutions to satisfy ever growing data demands. DSS stores the content over a network of nodes, and achieves resilience against node failures by maintaining data redundancy via various coding techniques. When some DSS nodes fail, this coding allows for retrieving the original data from the remaining nodes, as long as there is sufficient number of nodes left. A DSS is responsible for repairing the failed node and restoring the required level of redundancy, since otherwise, over time, the system will lose the stored data.
In the first part of the talk we discuss the state-of-the-art coding techniques for the node repair problem. We provide an overview of the most important results which explore the storage-repair bandwidth tradeoff.
In the second part of this talk we focus on the scenario where the nodes in a DSS may not only fail but may also be attacked by an adversary. The compromised nodes may corrupt the stored information during the data retrieval/node repair while remaining unnoticed by the rest of the system. In particular, the erroneous data may propagate from a single corrupted node to the other nodes during the node repair process. We present a novel concatenated coding scheme for distributed storage systems containing nodes with adversarial errors. This scheme is based on two types of codes: maximum rank distance (MRD) code as an outer code and optimal repair maximal distance separable (MDS)array code as an inner code. We prove that this coding scheme attains the upper bound on the resilience capacity, i.e., amount of data stored reliably in a system with a limited number of corrupted nodes.