Introduction: Problem Statement
Public blockchains allow insertion of arbitrary data. Even specific-purpose blockchains like Bitcoin already contain a lot of non-financial data. Although this data insertion can be beneficial in some use cases (e.g. proof of existence), it can also cause damage. If a blockchain contained videos with instructions on how to torture someone, there would immediately be broad consensus that this data must be deleted. But since blockchains are supposed to be immutable databases, the question is: what can be done if this happens?
In our previous article (Immutability: Is There a Limit?) we discussed non-financial data insertion into blockchains from a different angle. The example was a published report containing information which is important to society, but upsets a small group of people. The analysis showed realistic methods of censoring data from the blockchain by exploiting known vulnerabilities and limitations of this technology. In one scenario censorship is desirable and in the other it is not.
In the following pages we review two articles which propose ways of erasing data when there is broad consensus, without having to restart the system and without endangering the entire blockchain. The articles under review are Erasing Data from Blockchain Nodes [BF19] and Redactable Blockchain in the Permissionless Setting [DM19].
Another strategy prevents such situations by filtering data before it enters the blockchain. Article [HH18b] surveys which non-financial information has already entered the Bitcoin blockchain. We also review the article Thwarting Unwanted Blockchain Content Insertion [HH18a], which discusses methods of thwarting the insertion of non-financial data.
Redactable Blockchain Designs
Blockchains create a sequence of data blocks where the content of every block is locked to itself by its successor (see Figure 1).
Blockchains have two main ingredients to achieve this kind of lock:
- A network that follows a protocol to store, maintain, and share with new peers the blockchain information.
- A function that is collision-resistant: a function H for which it is (almost) impossible to find two different blocks B,B’ that are mapped to the same output, i.e., H(B)= H(B’).
Hash functions are used to lock the blocks together, which explains the common terminology: BlockHash for the hash H(B) of the block B and PrevBlockHash for the hash of the previous block (see Figure 2). The content of the block Bi-1is locked to itself by including H(Bi-1) to its successor Bi. If someone redacts block Bi-1resulting in a new block B’, then all peers, new and old, can notice this change, since the calculation of BlockHash H(B’) would not coincide with the value PrevBlockHash H(Bi-1) in block Bi.
One approach to blockchain redaction is by means of a function which allows a specific group of users to produce collisions. The article [AM17] takes up this approach, in which a share of a trapdoor key is given to each group member. The entire group must consent to unlock the connections and redact the block. This only works well for special use cases. As mentioned in article [DM19], the trapdoor key method falls short for networks like the Bitcoin network: it cannot scale to large networks and only the trapdoor key holders are aware of the altered data, reducing transparency.
In the following pages we review and compare the articles [BF19] and [DM19], which introduce a redaction mechanism for the Bitcoin network. Their contribution lies in a modification of the first ingredient for the “lock” between the blocks by introducing another layer of communication among the network nodes. Both make use of the fact that full node servers have the entire blockchain and can prune their locally stored copy of the ledger – a practice already used to erase older parts of the blockchain for storage reduction.
Erasing Data From Blockchain Nodes
The authors of [BF19] suggest to change the network protocol in order to make data erasure possible. Network nodes mark data which are supposed to be erased in their local erasure database (similar to chainstate database in Bitcoind client) before they delete the data from their copy of the blockchain. This way, accountability about erased data is guaranteed, which is necessary to avoid requesting these data again from other nodes and to differentiate between non-existent and erased data. The proposed protocol change suggests a transaction validation policy for nodes when they come across a transaction referring to data they erased from their local blockchain copy. The policy is based on the same idea as the simplified payment verification method. For more details we refer to section Protocol Details of Erasing Data From Blockchain Nodes.
Scope of Research
The proposal works in general for any UTXO model-based blockchain which includes the Bitcoin model. The proposed protocol change does not disturb the functionality of the whole blockchain system and can be deployed without intervening in the core blockchain architecture. The research does not include a decentralized mechanism to find consensus about which data should be removed. The authors demonstrate the viability of the proposal by a proof of concept implementation using the Bitcoind software for participating in the Bitcoin network.
Summary of Limitations
The authors of [BF19] present an interesting and very pragmatic way to erase data in blockchains, but the current proposal comes with certain limitations:
The proposal does not cover erasure of all possible data insertion methods discussed in the survey [SSV19]. The focus is only set on one data field, namely the transaction output data scriptPubKey (see Protocol Details for Erasing Data for more details).
The proposal does not include a decentralized consensus mechanism to achieve agreement and accountability about erasure of specific data. Instead, consensus is achieved off-chain and accountability is managed locally. For example, by publishing on a web page transaction identities to be erased, and nodes update their erasure database locally.
Nodes which are not updating their software to this newly proposed changes will occasionally disagree with nodes running the updated software. Therefore, the new validation protocol must be introduced as a soft fork.
The proposed protocol change increases the probability of blockchain forks which impacts the security of the entire system (see [GK16]). Although forks can be created only by a dishonest user, this opens a new attack vector. A formal study of the security implications is missing.
Chain validation refers to the process new or out of sync network nodes perform to ensure that received blockchain information is indeed valid. Data erasure makes it impossible for new nodes to validate the chain since erased data are missing to calculate and verify the BlockHash value.
Limitations of Proof of Concept Implementation
Their current implementation constraints are due to the use of the existing pruning logic of the Bitcoind client software:
- More data than just the unwanted data file are erased, namely all blocks up to a target height. Therefore, nodes which erase data will not be anymore full nodes archiving and indexing the full blockchain.
- Full erasure can be done not sooner than roughly 50 hours after content has been included on the active chain due to Bitcoind pruning logic.
- It is suggested to avoid the use of the protocol in mining mode, since it may cause blockchain forks. Therefore, the network will still have many full nodes, namely mining nodes, which will still make the erased data available through their local copies.
- The proof of concept implementation is further restricted to transaction outputs which are not SegWit transactions.
Redactable Blockchain in the Permissionless Setting
The authors of the second article [DM19] propose a safe method for data erasure. Network nodes can propose to edit data fields stored in the blockchain. For this proposal to be approved, a specified validation policy has to be followed which includes that the majority of the network nodes vote for the proposals within a certain timeframe. Chain and block validation are modified in order to capture redacted blocks.
Scope of Research
The proposal can be applied to blockchains based on proof of work and might be easily adapted to any consensus mechanism. The proposal requires changes of the core blockchain protocol. It introduces a decentralized voting mechanism using the available blockchain infrastructure to decide which data should be erased. As a by-product, all information about blockchain edits are transparently written in the ledger. The research includes a rigorous formal security analysis and provides a proof of concept implementation to demonstrate the viability of the proposal.
Summary of Limitations
The authors of [DM19] present an interesting and proven-to-be-secure method for data erasure in blockchains.
The proposal allows to change only components of transactions which do not affect the consistency of past and future events, e.g. OP_RETURN in Bitcoin. In this form it does not cover all data insertion methods (see [SSV19]), e.g. the transaction output PaytoPublicKeyHash (P2PKH) which is a spendable component, although if used for arbitrary data it cannot be spent. This restriction is not due to technical reasons, but to protect users from eventual future inconsistencies in the blockchain. Anyway, these restrictions can be adjusted. It is rather a mechanism to remove the responsibility of the user on deciding what redactions could cause inconsistencies on the chain in the future.
The proposal requires that network nodes modify their software to run the enhanced protocol. In particular, the Bitcoin block structure must accommodate another field (for more information, see section Protocol Destails for Redactable Blockchains). This would require a hard fork of the Bitcoin blockchain, and hence the support of the entire community.
There is an increased risk to have denial of service attacks (DoS), since malicious miners can include many edit requests in order to slow down the overall performance of the system.
The enhanced protocol produces a new overhead due to additional voting and validation for redacted transactions, but tests of their proof of concept implementation show that the additional overhead is quite tiny.
Limitations of the Proof of Concept Implementation
The proof of concept implementation is in a test environment which mimics all the basic functionalities of Bitcoin. It includes the Bitcoin script language that allows to insert arbitrary data into the chain, which can be redacted afterwards. A test over the current Bitcoin network is not feasible, since this would require all nodes to update their software.
In Table 1 we summarize the limitations and strengths of both proposals for the Bitcoin blockchain measuring them against the following metrics:
- Range of Editable Data: Does the edit mechanism cover the entire spectrum of possible data insertion components?
- Decentralized Voting: Does the protocol provide a decentralized system to propose and vote for edit requests?
- Easy Protocol Adoption: Does the protocol change require support of the community only (Yes) or does it require also a change in the core blockchain architecture (No)?
- Accountability: Does the protocol provide a secure and transparent method to account for all approved edit requests?
- Security Analysis: There is no security analysis (Low), there is a discussion about security (Medium), there is a formal security analysis (High).
- Scalability: Does the protocol change keep the computational overhead stable when compared to the original blockchain version (Yes – Moderate – No)?
- Moderate Blockchain Growth: Does the protocol change keep the growth of the blockchain size similar to the original blockchain version? (Yes – Moderate – No)
Preventing Insertion of Unwanted Data
Thwarting Unwanted Blockchain Content Insertion
Proposal Overview and Scope of Research
The authors of article [HH18a] analyze content filtering methods having Bitcoin as the main example. They first identify the pattern of transactions carrying unwanted data, and then propose countermeasures against unwanted data insertion:
- Content-detectors: Transactions with large-sized data carrying many transaction outputs with text generate positive signals, which results in exclusion of these transactions into blocks during block generation.
- Economic incentives: A transaction fee policy, which increases the fee according to the size of the transaction data. This makes the system less attractive for non-financial data insertion as they are supposed to be large, while standard financial transactions stay “cheap”.
Finally, the authors propose Identifier Commitments – a new tool using cryptography that allows full nodes to check whether the receiver addresses in the transaction output are valid addresses or arbitrary data. This allows to identify and discard transactions with harmful data during block generation. In the current Bitcoin protocol, only when transactions are spent, nodes verify the validity of the receiver address, but by then they already form part of the blockchain.
Summary of Limitations
The content-detectors cannot catch all unwanted data, e.g., preventing insertion of binary files is not feasible via this approach.
Likewise, the introduction of economic incentives does not thwart all cases. It is not clear what should be the price for publishing in Bitcoin in order to avoid any insertion of unwanted data in the blockchain. What is the price for damaging the reputation of someone else or showing any kind of harmful data?
Both methods for filtering are easy to deploy. The introduction of identifier commitments requires that the network nodes modify the software of the blockchain to participate with the new protocol. This requires that the entire community supports the protocol change.
The introduction of identifier commitments implies that validating nodes have more computational work to perform, but the computational overhead is tiny.
Protocol Details and Analysis
Reminder of Bitcoin Transaction Format
A transaction T has input and output data as well as a transaction identifier iT (TxID) – the hash of transaction T. The input data (prevOut) make reference to previous transactions using the TxID (see Figure 3). The sender must provide a proof of authentication to spend this input (ScriptSig) and he must specify in the output data the transaction amount (Value) and the recipient’s address (scriptPubKey).
Protocol Details of Erasing Data From Blockchain Nodes
The proposed protocol [BF19] only allows to erase unwanted data stored in the output data field scriptPubKey. Nodes execute the following steps:
- Erase data X of a transaction T with TxID iTresulting in a transaction T’.
- Store T’ in an erasure database together with identifier iT.
- Physically erase T from the local copy of the blockchain.
Erasing data of the blockchain can cause problems when validating transactions referring to erased data. The proposed policy is the following:
- Unconfirmed transactions, i.e., transactions which are not yet part of the blockchain but just “gossiped” between the network nodes for inclusion in the next block, are always considered invalid and are not communicated to other peers if they reference to erased data.
- Confirmed transactions, i.e., transactions which are already approved by the network and form part of the blockchain, are always considered to be valid, even if they reference to erased data.
Analysis of Erasing Data From Blockchain Nodes
For ease of exposition we use the terms honest nodes for nodes which follow the old validation protocol and erasing nodes for nodes which erase data and follow the new enhanced validation policy. Both honest nodes and erasing nodes are “honest” as opposed to dishonest nodes, which do not follow one particular protocol all the time.
The new validation policy does not affect the security of the blockchain when applied to unconfirmed transactions. Problems arise when the blockchain contains confirmed transactions with a reference to an older output data which was arbitrary data and not a valid receiver address. Note that this can only happen when the transaction was mined by a dishonest node or a non-validating node (a node which does not validate to safe resources and runs the risk to lose the block reward). Such transactions are invalid transactions due to the incorrect input data, and are not included in block proposals neither by honest nor erasing nodes. For this reason, the occurrence of confirmed transactions with erased data is not frequent, but the possibility cannot be excluded since this can be produced by dishonest miners depending on their mining power.
As mentioned in the article, in these cases there will be a disagreement about the validity of such transactions between honest and erasing nodes of the network. Depending on which group outnumbers the other, transactions are finally considered as non-valid or valid. This leads to the following new additional risks compared to the original (Bitcoin) protocol:
Increased number of forks:
Disagreements about the validity of specific transactions between honest and erasing nodes are reflected as “forks” in the blockchain. Usually, forks appear due to delays in the network communication resulting in a lack of synchronization between the nodes. In many blockchain protocols, the longest chain rule helps the miners to reconcile after an unintended fork: the longest chain is the valid one. This incurs also security problems such as double spending attacks which are intensional forks. This topic is well-studied in [GK16]. The enhanced validation policy proposed in [BF19] increases the chances of temporal disagreements which can be exploited by dishonest nodes, although more study about security implications is required.
The risk analysis in [BF19] considers two cases: 1) Erasing nodes form the minority. 2) Erasing nodes form the majority. The analysis does not reflect the stability of the system regarding its dependency on the percentage of dishonest nodes. For example, if we assume that 15% of the network is dishonest and erasing nodes form a minority of 36%, a confirmed transaction with reference to an erased data could be part of the longest chain (i.e. considered as valid), depending on the decision of the dishonest part.
Additionally, in this case the situation is different from double spending attacks where two valid chains are presented, and the longest chain rule decides which is the correct one. Here the honest miners which do not erase the unwanted data will consider the longest chain as an invalid chain when an erasing node mines a block with a reference to an erased data. This leads to a permanent “split” of the network if there is a majority of erasing nodes, not just a temporal fork, since honest nodes do not follow the protocol of the erasing nodes. This implies that the proposal has to be introduced as a soft fork. We see this as a problem as long as the erased database is locally maintained and there is no common view about this database.
Remark: If we suppose that 100% of the community agrees on which data should be removed and erase it from their local copies, then there is no risk of additional forks. Nevertheless, this would necessitate a good accountability system where all users both see and act on the information. The full consensus scenario is hard to imagine given the cultural and geographic diversity within any blockchain community.
Another major risk is the insecure chain validation during the bootstrapping. The Bitcoin blockchain has a very secure method for this which is Sybil-resistant (using the longest chain). Since chain validation requires all data constituting the blockchain, erasure makes the chain validation impossible. This can be resolved by either leaving full nodes in the network or by storing the relevant data in a central database which new nodes can consult in order to validate the chain. In the first case, unwanted data will still be accessible to everybody, while the second solution leads to a very strong centralization with all implied risks, e.g. single point of failure, misuse of power, etc. Therefore, the authors started to define a trustless method for bootstrapping which has not yet been described completely.
Protocol Details for Redactable Blockchain
The protocol of [DM19] has three components: Redaction proposal, proposal validation, and block & chain validation. To account for redactions, the block header (see Figure 1) must accommodate an additional field called old_MerkleRoot: All blocks contain a list of transactions and a commitment to this set of transactions by using their MerkleRoot. When a block is initially created, that is to say, prior to any redaction, this new field takes the same value as Merkle root. If a redaction is approved, this field is updated, while old_MerkleRoot will store the old Merkle root to provide a transparent account on redaction history.
Users who want to propose a redaction construct a special transaction editTx which contains the original transaction T with its transaction identifier iT and the redacted transaction T’. Then, nodes broadcast the special transaction editTx which requires a transaction fee as usual. The proposed transaction T’ is validated by checking its contents with respect to T, and if it is valid it can be considered for voting.
The new validation protocol dictates the requirements and constraints for redaction operations in the blockchain. The authors provide a formal description and an informal description of a (basic) policy for Bitcoin that could be implemented which we cite here:
- The proposed redaction T’ is identical to the transaction T, except that it can remove data.
- Only data that will never be spent, e.g., OP RETURN output scripts, can be removed.
- No redaction of votes can be proposed for other redactions in the chain.
- The proposed redaction receives more than 50% of votes in the 1024 consecutive blocks (voting period) after the corresponding editTx is stable in the chain.
After a successful voting, nodes replace in their local ledger the transaction T by (iT, T’).
Block and Chain Validation
When nodes have to validate a block containing a redacted transaction, the node can verify that the original transaction T was there before and is now replaced by T’. The verification of the first part makes use of the transaction identifier iT and the old_MerkleRoot, while the second part is processed by finding the corresponding editTx in the chain, and verifying that the edit request satisfies the validation policy.
When nodes have to validate a full chain and come across a redacted block, the miner verifies if the redaction was approved. The miner rejects a chain as invalid if any of the following occurs: (1) a block’s redaction was not approved according to the policy, (2) the MerkleRoot value of the redacted block is incorrect with respect to the set of transactions or (3) a previously approved redaction was not performed on the chain.
Analysis of Redactable Blockchain
Denial of Service Attacks:
Spamming the network with many edit requests could lead to a denial of service. However, as mentioned in the article, miners are deterred from doing this because it incurs the cost of a transaction fee for the edit requests. Moreover, a higher transaction fee for edit requests can further help to protect the system against such an attack. Anyway, more refined techniques are in place, as listed here, to protect the network.
Increased Computational Overhead:
Voting, additional proposal validation and the enhanced protocol for block and chain validation increase the computational overhead nodes have to perform. Nevertheless, as estimated by the authors using their proof of concept implementation, the extra computation is not to be considered critical.
Scope of Security Analysis:
The security model from [GKL15] is used to formally prove that the proposed protocol changes are not affecting the blockchain security. This security model applies to public blockchains, including the Bitcoin blockchain. Although, for example, the Ouroboros proof of stake protocol builds upon that article, game-theoretic considerations are done separately. Further investigation has to be done as to whether the introduction of a redaction mechanism (with a policy allowing to redact more transaction components) allows or not other rational strategies with higher reward than the honest one.
Protocol Details for Thwarting Unwanted Content Insertion
Content-Detectors and Filtering
The authors find that almost all of the nearly 255 million transactions in Bitcoin (as of August 2017) have at most 50 transaction outputs (99.73 %). Furthermore, of all transactions, 97.22 % have not more than 5 outputs, and 91.77 % have at most 2 outputs. In contrast, data insertion of arbitrary data uses more than 50 transaction outputs due to the limited data size in the output fields, and hence can be detected through their larger size.
The first filtering method detects whether transactions contain text, ASCII-based files, or any binary files of images or archives. Additionally, the authors introduce a detector which gives positive signals when a transaction contains many printable ASCII characters distributed over many (e.g. more than 50) transaction outputs.
The second filtering method proposes a mandatory minimum fee to penalize large transactions. The formula to calculate the mandatory fee considers the transaction size and the number of outputs, and is designed to discourage non-financial content insertion.
Protocol Change: Identifier Commitments
Identifier Commitments (IC) help to prevent arbitrary data insertion placed in the output data fields of a transaction. The basic idea of this concept consists in not sending the public address of the receiver on the blockchain – which may contain harmful data instead of an address -, but a different data which prevents non-financial data insertion, and works as a receiver address. We refer to [HH18a] for more details about the technical solution.
Analysis of Thwarting Unwanted Content Insertion
The proposed filtering method based on large-size data files can also cause false positives. High quality of filtering can penalize some honest large-sized transactions, e.g. issued by crypto exchanges or a revenue payment to many stakeholders. The risk of false positives exists, although experiments show high probability of correct signals. The limitation of such detectors is that they are unable to sufficiently detect binary data fields which can carry encoded harmful data. This is the authors’ motivation to introduce ICs.
Mandatory minimum fees can affect payments with many outputs, e.g. repartitions of rewards to investors or payments of exchange services. The problem here is that the better the protection against harmful data insertion, the higher the penalization of some financial transactions.
The introduction of ICs requires to make a software change in order to integrate the concept and to allow an enhanced validation process. This change can be achieved easily, but needs support from the blockchain community.
ICs help to reject harmful data insertion in the output fields of transactions, but leave other possible data insertion methods as discussed in [SSV19] still open.
The additional computational overhead caused by IC validation is insignificant. Full nodes need only roughly 4 seconds longer according to their experiments. This delay is not critical, since average block generation takes about 10 minutes. ICs increase the general transaction size, but the saving in blockchain storage through thwarting non-financial data might outperform the increased blockchain size by ICs.
In Table 1 we summarized the general aspects of the protocols [BF19] and [DM19]. According to our analysis, the first article should provide stronger security proofs, since the proposed change increases the number of forks, which can reduce the security of the entire blockchain system. These forks can also result in permanent divisions of the community. If all nodes follow the validation policy proposed in [BF19], these two risks are eliminated. Nevertheless, it is essential to find a safe method of agreeing on which data should be erased. Furthermore, there must be a verifiable record of legitimately removed data for bootstrapping to be possible. If this record is kept by a central authority, as the authors of [BF19] suggest, we contradict the blockchain philosophy of decentralization. It is also questionable to what extent and how quickly agreements can be achieved given the heterogeneous nature of blockchain communities.
The article [DM19] suggests a method to erase data from the blockchain which requires the community to update their software to run the new protocol. The authors present a strong security analysis, and the only minor defects of the proposal are a slight risk of a denial of service attack and a slightly higher computational overhead. We would like to see a game-theory analysis of their proposal to evaluate potential misuse of the new rules. If all nodes follow the enhanced protocol, there is no security risk and no obstacle to bootstrapping. Although the proposal includes a secure voting mechanism, it is again questionable whether agreement about harmful data can be achieved easily among a multicultural community.
The approach of [HH2a] consists of preventing harmful data insertion by making use of text-detectors and economic incentives. Both methods improve prevention, but fail to entirely exclude harmful data. The proposed cryptographic tool helps to detect harmful data which pass the first two filters, but cannot control all possible insertion methods.
[AM17] G. Ateniese, B. Magri et al.: Redactable Blockchain or Rewriting History in Bitcoin and Friends, IEEE European Symposium on Security and Privacy, 2017.
[BF19] S. Beaucamp, M. Florian, et al.: Erasing Data from Blockchain Nodes, arxiv:1904.08901, 2019.
[DM19] D. Deuber, B. Magri et al.: Redactable Blockchain in the Permissionless Setting, IEEE Symposium on Security and Privacy, 2019.
[GK16] A. Gervais, G. K. Karame, et al.: On the Security and Performance of Proof of Work Blockchains, Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016.
[GKL15] J.A. Garay, A. Kiayias, N. Leobardos: The Bitcoin Backbone Protocol: Analysis and Applications, Proc. Eurocrypt, 2015.
[HH18a] M. Henze, J. Hiller, et al.: Thwarting Unwanted Blockchain Content Insertion, IEEE International Conference on Cloud Engineering, 2018.
[HH18b] M. Henze, J. Hiller, et al.: A Quantitative Analysis of the Impact of Arbitrary Blockchain Content on Bitcoin, in Proc. 22nd International Conference on Financial Cryptography and Data Security, 2018
[M13] G. Maxwell: To Prevent Arbitrary Data Storage in Txouts – The Ultimate Solution, 2013.
[SSV19] F. Stonedahl, A. Sward, I. Vecna: Data Insertion in Bitcoin’s Blockchain, Ledger Journal Vol 3, 2019.
Portrait Image by Brian J. Matis. License creative commons.