If you are interested in blockchains and cryptocurrencies, it is likely that you may have stumbled across Merkle trees (also known as hash trees). It is worth noting that almost every blockchain project uses Merkle trees as part of their protocols. In version 5.0.0 of the Lisk SDK, we are introducing lisk-tree, the new library that will handle Merkle trees in the Lisk protocol.
This blogpost can be broken down into 2 sections. The first section is designed to provide a crash course into Merkle trees, covering the most important concepts, such as Merkle roots and proofs-of-inclusion. In the second section, we introduce lisk-tree, describe how Merkle trees are implemented in Lisk, and finally explain the improvements they bring to the Lisk protocol.
Below you find the main technical terminology used throughout this blogpost as a reference point in order to help provide a clear understanding. When not specified, we assume that each data value is a byte array.
- Original dataset - The initial set of data blocks from which we build the Merkle tree, for example an array of byte values.
- Data block - An element of the original dataset.
- (Merkle) Root - The final bytes that can be used to verify the integrity of the data structure.
- Branch node - An internal node of the tree, i.e. a node that does not belong to the base layer.
- Leaf node - An external node of the tree, i.e. a node belonging to the base layer.
- Hashing - Mapping data of arbitrary size to a fixed-sized bytes value. For example, the Lisk protocol uses the SHA-256 hash function.
- Inclusion path - An array of data values that can be used, along with the Merkle root, to prove membership of certain data to the original dataset.
- Append path - An array of data values that can be used, along with new data being added to the tree, to calculate the new Merkle root.
What is a Merkle Tree?
Figure 1: A Merkle tree obtained from the names of the members of the research team. The white boxes represent the original dataset, the green boxes are leaf nodes, and finally the yellow boxes are branch nodes. At the top of the tree sits the Merkle root. Unpaired nodes are pushed to the layer above without hashing them (dashed boxes). Please note that in this blogpost we shortened the output of the hash function for ease of reading.
A Merkle tree is an authenticated data structure organized as a tree. Let's try to understand what exactly each of these words mean:
- Merkle: This name is derived from the actual inventor of hash trees, Ralph Merkle.
- data structure: A data structure is a collection of data with a set of operations that can be performed on it. An example of a data structure is an array, an object, or a Merkle tree.
- authenticated: The integrity of the data structure can be efficiently verified using the (Merkle) root of the tree. The dataset cannot be altered without changing the Merkle root.
- tree: The underlying data structure is organized as a tree, where each parent node is obtained by hashing the data from the child nodes in the layer below. Here we consider only binary trees, where each parent has two children.
To build a Merkle tree, proceed as follows:
- Individually hash each element of the original dataset into a leaf node.
- Hash together pairs of leaf nodes and store the resulting value into a parent branch node. Unpaired nodes are just pushed up the tree to be eventually paired with a node from a higher layer. The hashing function to obtain a leaf or branch node is different. The reason for this is explained in the ''Leaf versus Branch Hash'' section, below.
- Keep hashing pairs of branch nodes until you get to a single top branch node, the root of the tree.
In Fig. 1, an illustrative example of a Merkle tree constructed from the names of the Lisk research team members can be seen.
Figure 2: The Merkle tree shown in Fig. 1 is used to illustrate the proof-of-inclusion. In order to prove that the data value "Iker" (blue-bordered box) is part of the tree, the Prover sends the hashes in the red-bordered boxes, along with their positions in the tree. The Verifier can use this information to recompute the Merkle root (top blue-bordered box) and check that it corresponds to the same one they have.
One could argue why go through all the hassle of building the tree, when it would be easier to just hash the original data directly to get a single root. The point is that once the tree is built, it is possible to create short proofs-of-inclusion for any data present in the original dataset. These proofs-of-inclusion are just the set of the intermediate hash values needed to recalculate the Merkle root.
There are two parties involved in a proof-of-inclusion: the Verifier and the Prover. The Verifier’s responsibility is to check that certain data is part of the original dataset, while the role of the Prover is to prepare the proof and give it to the Verifier. We assume that the Verifier knows the Merkle root and (obviously) the data they wish to verify, but not the whole dataset. Whereas the Prover has access to the entire tree (or just to the original dataset, from which they can reconstruct the tree).
The procedure for a proof-of-inclusion is as follows:
- The Verifier sends the Prover the data block for which they wish to obtain a proof-of-inclusion.
- The Prover finds this block in the Merkle tree and its index, indicating the block position in the tree.
- The Prover collects all the needed intermediary hashes and transmits them to the Verifier, along with the index and the length of the original dataset.
- The Verifier uses the hashes and the data block they already have to recalculate the Merkle root and checks that it equals the one they had originally. The index and the dataset length are used to infer in which order the data blocks have to be hashed together.
An example of a proof-of-inclusion can be seen above in Fig. 2. The Verifier holds the data in the blue-bordered boxes (the data block "Iker" and the Merkle root). They can verify that "Iker" is part of the research team by asking for a proof-of-inclusion, which in this case includes 3 hashes (red-bordered boxes), the index of "Iker" in the tree (2 in this case), and the length of the dataset (5). Using this information, the Verifier knows that they have to hash "Iker" on the left with the first hash in the proof, then the result has to be hashed on the right with the second proof element, and finally on the left again with the third one.
Without the Merkle tree construction, the Verifier would need to request all the other 4 elements of the dataset. It doesn't look like we're saving much here (only 3 hashes versus 4). However, the important point to recognize here is that the size of a proof-of-inclusion only scales logarithmically with the size of the original dataset (rather than linearly).
In the Lisk protocol, a full block can contain at most ~128 transactions. In this case the proof-of-inclusion for a transaction would have length Log 128 = 7 (where Log indicates the logarithm in base 2), rather than 127. This is already a good save! In fact, the Verifier can also check the presence of multiple data blocks at once. This can actually save space in the proof, as some the required hashes can be directly calculated by the Verifier.
An additional point to mention here is that the original dataset in Fig. 2 is ordered alphabetically. This means that if we give a proof-of-inclusion for two adjacent elements in the dataset, it will constitute a proof-of-non-inclusion for any data block that would stay in between.
As an example, we can prove that there is no "Gianluigi" in the research team by proving that both "Andreas" and "Iker" are present and are located next to each other in the original dataset (this can be checked using their index).
This is a general property: If we know for sure that the original dataset will always be ordered (according to a certain rule), it is always possible to give a proof-of-non-inclusion using this strategy.
In this section we go over some use cases for Merkle trees in the Lisk protocol, and describe the specifications of lisk-tree. If you would like to see further information, please take a look at the specifications provided in LIP 0031.
Lisk will benefit from Merkle trees in the following places:
Proofs-of-Inclusion for Transactions in a Block
In version 5.0.0 of the SDK, each block header will store the transactionRoot instead of the payloadHash. The transactionRoot is calculated as the Merkle root of the IDs of the transactions included in the block payload (the details are specified in LIP 0032). Using the transactionRoot and a proof-of-inclusion, it will be possible to check whether a certain transaction is part of the block without downloading the full block.
As part of the "Introduce decentralized re-genesis" roadmap objective, we will create a snapshot of the Lisk blockchain (as specified in LIP 0035). This snapshot will be used to perform a hardfork to implement the new protocol. Reference to the last blockchain state will be stored as the Merkle root of the hashes of all blocks up to the snapshot.
If you made it this far, congratulations! You now know what Merkle trees are and why we are implementing them in the Lisk protocol. This final section gives more insights on some technical choices that we made. In particular, we will cover the concept of branch and leaf hash and why it is important to define different hashing functions. In general, we chose to follow the Certificate Transparency specifications. In the following, hash is the SHA256 hashing function and | indicates bytes concatenation.
Leaf versus Branch Hash
Figure 3: (top) The Merkle tree built from the dataset x=[D1,D2,D3,D4]. (bottom) The Merkle tree built from the dataset x’=[D1,D2,D3'=hash(D3)|hash(D4)]. If we use the same hash function for leaf and branch nodes, L3'=B2, and thus both trees have the same root R.
In our construction, Merkle trees accept input dataset with arbitrary length. Due to this, they could potentially be susceptible to second pre-image attacks. In a second pre-image attack, the goal of the attacker is to find a second input x′ that hashes to the same value of another given input x, i.e. find x’≠x such that hash(x)=hash(x’). In Merkle trees, this corresponds to having two different input datasets x and x’ giving the same Merkle root.
But this is easy to do: for any x=[D1,D2,D3,D4], just choose x’=[D1,D2,D3'=hash(D3)|hash(D4)], as shown above in Fig. 3. Notice that L3' equals B2, resulting in the same Merkle root.
To solve this problem, we prepend a different flag to the data before hashing it, depending on whether the resulting node is a leaf or branch node. In particular, we define leaf and branch hash functions as the following:
- leafHash(data) = hash(00|data)
- branchHash(child1, child2) = hash(01|child1|child2)
If we try to use the same trick, this results in two different roots, as now D3 and D4 would be hashed together differently in the two cases:
- B2 = branchHash(hash(D3)|hash(D4)).
Appending Data to the Tree
Figure 4: (top) When appending new data to the dataset, we can update the tree using only the hashes in the append path (purple-bordered boxes). (bottom) The updated tree. Nazar is now a member of the research team! The new append path is again indicated by purple-bordered boxes.
The original dataset from which the tree is built can later be expanded to append new data. In this case, the Merkle root, and in general the tree, in principle have to be recomputed from scratch. It turns out that it is possible to append new data to an existing tree efficiently. What this means is that the number of operations needed to update the tree is only logarithmic in the size of the dataset. To do this, we store an additional array, called the append path. The append path contains at most Log N values, with N the size of the original dataset.
Using the append path, we can append new data to the tree as follows:
- Add the new data block to the original dataset and create the corresponding new leaf node.
- Go up the tree layer by layer and recompute the relevant branch nodes using the hashes from the append path. Notice that only Log N branch nodes have to be updated.
- Finally update the Merkle root and keep the new append path.
In Fig. 4 (top), the append path is indicated by the purple-bordered boxes. After appending a new data block to the tree, the Merkle root and the append path can be updated just using the old append path, resulting in the new tree in Fig. 4 (bottom). The new append path is indicated by the purple-bordered boxes.
The lisk-tree library introduces Merkle trees into the Lisk protocol as specified in LIP 0031. Merkle trees will be used to store the transactions root in the block header (LIP 0032) and to perform the decentralized regenesis (LIP 0035). However, in the future they may be important for other aspects of the protocol such as interoperability.
For further questions and comments on the topic, we will host an AMA on Lisk.chat with Alessandro Ricottone (Research Scientist) next Thursday, September 24th at 4 pm CEST. We also invite all community members to go to the Lisk Research forum where we are always happy to hear your feedback and to participate in discussions on this and other topics.
Lisk is on a mission to enable you to create decentralized, efficient, and transparent blockchain applications. Join us: