Notice: The following work was deprecated following the deprecation of EIP 1011. The following is published in case it is helpful to anyone working on a similar task in the future.
I’m currently working on a Github issue to port
msg_hash.se.py to Vyper LLL on the
ethereum/casper repo. Having just completed the port of the purity checker to Vyper LLL, I see that in retrospect there was a lot of stuff I had to figure out by either consulting the Vyper, Serpent or even Solidity codebases or learning some low-level details of the EVM execution model. To help myself and others understand what will become a core part of Casper FFG, I want to document the process with this blog post.
- Figure out what
msg_hash.se.pyis supposed to do
- Gain some confidence that
msg_hash.se.pyis doing that
- Plan out a strategy for porting the Serpent code to Vyper LLL.
- Execute that strategy :D
What should it do?
From looking at its usage in the casper contract, we see that the
MSG_HASHER is used to compute the hash of a message excluding the signature. This functionality is required when we wish to validate a signature as coming from some Casper validator. The requirement comes in when we consider the message format assumed by Casper and how the default signature algorithm ECDSA works. Casper votes are arranged as RLP lists that contain the data relevant to that message along with a signature attesting to the fact that this message came from a certain validator as the last element. We can see this format at play in the format of the
vote message from the EIP:
def decode_rlp_list(vote_msg): # [validator_index, target_hash, target_epoch, source_epoch, signature] return RLPList(vote_msg, [int, bytes, int, int, bytes])
To make a Casper vote, a miner includes this message to the Casper contract. The vote data contains a
validator_index corresponding to the validator’s position in the contract’s
validators table, a
target_hash for the block the validator wishes to next justify, the
target_epoch for the block with the
target_hash and a
source_epoch used to orient the validator’s previous justification. In combination these parameters can be used to extend a justification chain as described in the EIP and the Casper FFG paper. We also see the last item in the RLP list which is the signature. This signature is some data that can be given to the validator’s validation contract supplied during the
deposit stage. This validation contract should only positively verify signatures made by the validation entity under question; otherwise anyone could masquerade as the validator and start doing things that lead to malicious activity (until the slashing kicks in).
The default signature algorithm is the one used widely in Ethereum: ECDSA. To create a signature with ECDSA, we require a hash (just to get a relatively small number of unique bytes; the algorithm also supports streaming more bytes in but then performance goes down) and a private key. Here is some pseudocode:
def sign(msg_hash, private_key): return ecdsa.sign(msg_hash, private_key) msg = some_bytes() msg_hash = sha3(msg) private_key = load_private_key() v, r, s = sign(msg_hash, private_key)
The output of the signing algorithm is conventinally called
s. The concatenation of these parameters form the bytestring that is placed as the last element of our RLP list for Casper messages. To verify a signature, we supply these parameters and the original input hash. Together, we can recover a public key from this data which then yields an Ethereum address using the well-known algorithm for the derivation:
def recover(msg_hash, v, r, s): return ecdsa.recover(msg_hash, v, r, s) public_key = recover(msg_hash, v, r, s) address = to_eth_address(public_key) # confirm the address is the one you expect
A default validator verification contract can simply perform this entire operation and check that the recovered Ethereum address matches the one it expects. And in fact this is exactly what we see in the initial validator sketch, repeated here:
def mk_validation_code(address): validation_code_maker_lll = LLLnode.from_list(['seq', ['return', , ['lll', ['seq', ['calldatacopy', 0, 0, 128], ['call', 3000, 1, 0, 0, 128, 0, 32], ['mstore', 0, ['eq', ['mload', 0], utils.bytes_to_int(address)]], ['return', 0, 32] ], ] ] ]) validation_code_maker_lll = optimizer.optimize(validation_code_maker_lll) return compile_lll.assembly_to_evm(compile_lll.compile_to_assembly(validation_code_maker_lll))
To make a
validation_code, we just supply the address corresponding to the private key we expect to be making our Casper signatures with. This snippet of Vyper LLL simply makes a
call to the contract at address
0x1 which happens to be the
ecrecover algorithm just described above and compares the output to the expected
There is a series of “account abstraction” facilities going on with the way in which validator signatures are validated and messages in general are processed; without taking us too far out of scope, it is interesting to note that a validator can supply any validation algorithm of their choosing as long as it can be expressed as a smart contract and within the gas limits afforded by the Casper contract. We can look to its use in the Casper contract to see the required interface:
# cannot be labeled @constant because of external call # even though the call is to a pure contract call @private def validate_signature(msg_hash: bytes32, sig: bytes, validator_index: int128) -> bool: return extract32(raw_call(self.validators[validator_index].addr, concat(msg_hash, sig), gas=self.VALIDATION_GAS_LIMIT, outsize=32), 0) == convert(1, 'bytes32')
The validation contract should simply expect an input string of bytes of size less than or equal to 1056 that take the place of the
msg_hash and the
signature. Pretty cool to see the account abstraction in action!
Now, with all the context in place we can see why we need to compute the hash of the message that was used as input to the signature algorithm. For what I assume are efficiency reasons, the Casper messages do not include the message hash and the message data and the signature; we can compute the message hash given the message data and this is where the
MSG_HASHER comes into play. We want a contract that given an input RLP list will return the hash of the list with the last element removed (the signature).
What does it do?
We can experiment with some sample messages by porting the original Serpent source code to Python and then quickly get an understanding of the various parts of the Serpent code.
For reference, here is the Serpent code:
# Fetches the char from calldata at position $x macro calldatachar($x): div(calldataload($x), 2**248) # Fetches the next $b bytes from calldata starting at position $x # Assumes that there is nothing important in memory at bytes 0..63 macro calldatabytes_as_int($x, $b): ~mstore(32-$b, calldataload($x)) ~mload(0) # Position in calldata with pos = 0: # First char in calldata with c0 = calldatachar(0): # The start of the array must be in 192...255 because it represents # a list length # Length ++ body case if c0 < 248: pos = 1 # Length of length ++ length ++ body case else: pos = (c0 - 246) # Start position of the list (save it) with startpos = pos: # Start position of the previous element with lastpos = 0: # Keep looping until we hit the end of the input while pos < ~calldatasize(): # Next char in calldata with c = calldatachar(pos): lastpos = pos # Single byte 0x00...0x7f body case if c < 128: pos += 1 # Length ++ body case elif c < 184: pos += c - 127 # Length of length ++ length ++ body case elif c < 192: pos += calldatabytes_as_int(pos + 1, c - 183) + (c - 182) # Length of new RLP list with newlen = lastpos - startpos: # Length ++ body case if newlen < 56: # Store length in the first byte ~mstore8(0, 192 + newlen) # Copy calldata right after length ~calldatacopy(1, startpos, newlen) # Return the hash return(~sha3(0, 1 + newlen)) else: # The log256 of the length (ie. length of length) # Can't go higher than 16777216 bytes due to gas limits with _log = if(newlen < 256, 1, if(newlen < 65536, 2, 3)): # Store the length ~mstore(0, newlen) # Store the length of the length right before the length with 31minuslog = 31 - _log: ~mstore8(31minuslog, 247 + _log) # Store the rest of the data ~calldatacopy(32, startpos, newlen) # Return the hash return(~sha3(31minuslog, 1 + _log + newlen))
The only tricky bit of the port is handling the
calldata and recreating a Python embedding of the EVM semantics. We start with some utilities to facilitate this process, importing some supporting libraries and setting up a
from web3 import Web3 import rlp from eth_tester import ( EthereumTester, PyEVMBackend ) from web3.providers.eth_tester import ( EthereumTesterProvider, ) base_tester = EthereumTester(PyEVMBackend()) w3 = Web3(EthereumTesterProvider(base_tester))
Next we build some utilities to make messgages the Casper contract expects. We will need a private key to sign the message. We make a
vote message in
casper_vote_with_hash by adding some dummy data for
target_epoch, and the
source_epoch. To get the message hash, we
SHA3 the RLP encoding and defer to
web3 for the signature. The final output of this function is the
vote message as would be submitted to the Casper contract and the hash of the message we expect to recover via the
validation_key = w3.eth.account.create(str(base_tester.get_accounts())).privateKey # from casper/tests/utils.py def encode_int32(val): return Web3.toBytes(val).rjust(32, b'\x00') def casper_vote_with_hash(): msg = [1, w3.sha3(text='some block hash'), 11, 10] msg_hash = w3.sha3(rlp.encode(msg)) signed = w3.eth.account.signHash(msg_hash, validation_key) sig = encode_int32(signed.v) + encode_int32(signed.r) + encode_int32(signed.s) msg.append(sig) return [rlp.encode(msg), msg_hash] message, expected_msg_hash = casper_vote_with_hash()
We finish the preparation with a memory utility to replicate the temporary memory of the EVM. The
Memory class is just a Python list that extends by zero when attempting to write past its length. We do this to simplify memory operations when executing the port as the default behavior for Python lists is to throw an exception when writing off the end of the backing array.
class Memory(list): # https://stackoverflow.com/a/869901 def __setitem__(self, index, value): size = len(self) if index >= size: self.extend(0 for _ in range(size, index + 1)) list.__setitem__(self, index, value)# input byte array
calldata to our input message we created above and set up a
memory for the ported Serpent code to work against.
calldata = message # memory memory = Memory()
We begin with the two Serpent macros which facilitate grabbing either a single byte from the
calldata or a slice of bytes from the
calldata. Because we are modeling our calldata as a byte-addressed bytestring, we do not have to perform the mathemagic required to work around the EVM’s 32-byte word size.
# Fetches the char from calldata at position index def calldatachar(index): return calldata[index] # Fetches the next b bytes from calldata starting at position x def calldata_bytes_as_int(x, b): return sum(calldata[x:x+b])
The remainder of the contract is essentially a parser for RLP lists. The assumption is that any message to the Casper contract will be an RLP list and importantly that the signature is the final element in this top-level list. No further restrictions are made on the elements of the top-level RLP list, i.e. they can be any valid RLP elements, including other RLP lists.
We begin by scanning the
calldata for the length of the top-level list. There are two options here: either the first byte represents the length of the list, or the first byte represents the length of the length that immediately follows the length data. We determine this value to know where the first RLP list element begins which we store in
# Save some space for the msg_hash later msg_hash = None # Position in calldata pos = 0 # First char in calldata c0 = calldatachar(0) # The start of the array must be in 192...255 because it represents # a list length # Length ++ body case if c0 < 248: pos = 1 else: # Length of length ++ length ++ body case pos = (c0 - 246) # Start position of the list (save it) startpos = pos
Next, we loop over the calldata, moving an element at a time, until we run out of any message bytes to scan. Given the way we set
lastpos to index the current RLP element and the fact that we only loop while
pos < len(calldata), we can be sure that lastpos points to the final element of the RLP list. According to our interface above, this value will index the signature data. Aside from the RLP parsing, this is the core logic of the
MSG_HASHER. We see detection of each possible RLP type (‘immediate’ value, small list and large list) in the
# Start position of the previous element lastpos = 0 # Keep looping until we hit the end of the input while pos < len(calldata): # Next char in calldata c = calldatachar(pos) lastpos = pos # Single byte 0x00...0x7f body case if c < 128: pos += 1 # Length ++ body case elif c < 184: pos += c - 127 # Length of length ++ length ++ body case elif c < 192: pos += calldata_bytes_as_int(pos + 1, c - 183) + (c - 182)
Now that we have scanned each element of the top-level list, we now the beginning of the message data and the ending of the message data, excluding the signature. It is precisely this data we need to find the
SHA3 to recompute the message hash (compare to here). We use the length to determine which kind of RLP list we need to build and then proceed to recreate the message list in memory. We finish by taking the
SHA3 of the proper slice of memory.
# Length of new RLP list newlen = lastpos - startpos # Length ++ body case if newlen < 56: # Store length in the first byte memory = 192 + newlen # Copy calldata right after length for i in range(newlen): memory[1+i] = calldata[startpos+i] msg_hash = w3.sha3(bytes(memory[0:1+newlen])) else: # The log256 of the length (ie. length of length) # Can't go higher than 16777216 bytes due to gas limits _log = None if newlen < 256: _log = 1 elif newlen < 65536: _log = 2 else: _log = 3 # Store the length memory = newlen # Store the length of the length right before the length index_to_list_length = 31 - _log memory[index_to_list_length] = 247 + _log # Store the rest of the data for i in range(newlen): memory[32+i] = calldata[startpos+i] # Return the hash msg_hash = w3.sha3(bytes(memory[index_to_list_length:1+_log+newlen])) assert expected_msg_hash == msg_hash
The complete code is in a gist here. The
vote message is a fairly small flat list so several of the arms of the various conditionals in the
MSG_HASHER contract are not tested. There will be more robust testing in the final PR.
Port to Vyper LLL
Vyper is a new programming language with syntax like Python that targets the EVM. It is seeing a lot of activity in the Ethereum space and is likely to provide a robust alternative to the de facto Solidity that has to date seen a lot of criticism for reasons of complexity and lack of developer ergonomics (particularly around making it straightforward to write secure smart contracts). Many instances of large “hacks” where ETH was stolen or frozen were to do unclear consequenes of the Solidity programming model.
We also have Serpent which is another Python-like language meant to be a higher-level interface to the EVM’s assembly. Serpent is officially deprecated at this point and it is advised to perform all Ethereum smart contract development in another language. For this reason, the maintainers of the Casper repo have decided to port all of the dangling Serpent code to something more modern, i.e. Vyper.
The catch here is that Vyper is designed with security in mind, which means you can’t directly author EVM assembly in Vyper source. This decision is probably the right one in aggregate but it does mean we will have to port the Serpent code directly to Vyper LLL (the intermediate representation) to access the lower-level facilities we need.