porting msg_hash.se.py to vyper lll

jun 6, 2018

Notice: The following work was deprecated following the deprecation of EIP 1011. The following is published in case it is helpful to anyone working on a similar task in the future.

Introduction

I’m currently working on a Github issue to port msg_hash.se.py to Vyper LLL on the ethereum/casper repo. Having just completed the port of the purity checker to Vyper LLL, I see that in retrospect there was a lot of stuff I had to figure out by either consulting the Vyper, Serpent or even Solidity codebases or learning some low-level details of the EVM execution model. To help myself and others understand what will become a core part of Casper FFG, I want to document the process with this blog post.

Approach

  1. Figure out what msg_hash.se.py is supposed to do
  2. Gain some confidence that msg_hash.se.py is doing that
  3. Plan out a strategy for porting the Serpent code to Vyper LLL.
  4. Execute that strategy :D

What should it do?

From looking at its usage in the casper contract, we see that the MSG_HASHER is used to compute the hash of a message excluding the signature. This functionality is required when we wish to validate a signature as coming from some Casper validator. The requirement comes in when we consider the message format assumed by Casper and how the default signature algorithm ECDSA works. Casper votes are arranged as RLP lists that contain the data relevant to that message along with a signature attesting to the fact that this message came from a certain validator as the last element. We can see this format at play in the format of the vote message from the EIP:

def decode_rlp_list(vote_msg):
    # [validator_index, target_hash, target_epoch, source_epoch, signature]
    return RLPList(vote_msg, [int, bytes, int, int, bytes])

To make a Casper vote, a miner includes this message to the Casper contract. The vote data contains a validator_index corresponding to the validator’s position in the contract’s validators table, a target_hash for the block the validator wishes to next justify, the target_epoch for the block with the target_hash and a source_epoch used to orient the validator’s previous justification. In combination these parameters can be used to extend a justification chain as described in the EIP and the Casper FFG paper. We also see the last item in the RLP list which is the signature. This signature is some data that can be given to the validator’s validation contract supplied during the deposit stage. This validation contract should only positively verify signatures made by the validation entity under question; otherwise anyone could masquerade as the validator and start doing things that lead to malicious activity (until the slashing kicks in).

The default signature algorithm is the one used widely in Ethereum: ECDSA. To create a signature with ECDSA, we require a hash (just to get a relatively small number of unique bytes; the algorithm also supports streaming more bytes in but then performance goes down) and a private key. Here is some pseudocode:

def sign(msg_hash, private_key):
    return ecdsa.sign(msg_hash, private_key)
    
msg = some_bytes()
msg_hash = sha3(msg)
private_key = load_private_key()
v, r, s = sign(msg_hash, private_key)

The output of the signing algorithm is conventinally called v, r and s. The concatenation of these parameters form the bytestring that is placed as the last element of our RLP list for Casper messages. To verify a signature, we supply these parameters and the original input hash. Together, we can recover a public key from this data which then yields an Ethereum address using the well-known algorithm for the derivation:

def recover(msg_hash, v, r, s):
    return ecdsa.recover(msg_hash, v, r, s)

public_key = recover(msg_hash, v, r, s)
address = to_eth_address(public_key)
# confirm the address is the one you expect

A default validator verification contract can simply perform this entire operation and check that the recovered Ethereum address matches the one it expects. And in fact this is exactly what we see in the initial validator sketch, repeated here:

def mk_validation_code(address):
    validation_code_maker_lll = LLLnode.from_list(['seq',
                                ['return', [0],
                                    ['lll',
                                        ['seq',
                                            ['calldatacopy', 0, 0, 128],
                                            ['call', 3000, 1, 0, 0, 128, 0, 32],
                                            ['mstore', 0, ['eq', ['mload', 0], utils.bytes_to_int(address)]],
                                            ['return', 0, 32]
                                        ],
                                    [0]]
                                ]
                            ])
    validation_code_maker_lll = optimizer.optimize(validation_code_maker_lll)
    return compile_lll.assembly_to_evm(compile_lll.compile_to_assembly(validation_code_maker_lll))

To make a validation_code, we just supply the address corresponding to the private key we expect to be making our Casper signatures with. This snippet of Vyper LLL simply makes a call to the contract at address 0x1 which happens to be the ecrecover algorithm just described above and compares the output to the expected address.

There is a series of “account abstraction” facilities going on with the way in which validator signatures are validated and messages in general are processed; without taking us too far out of scope, it is interesting to note that a validator can supply any validation algorithm of their choosing as long as it can be expressed as a smart contract and within the gas limits afforded by the Casper contract. We can look to its use in the Casper contract to see the required interface:

# cannot be labeled @constant because of external call
# even though the call is to a pure contract call
@private
def validate_signature(msg_hash: bytes32, sig: bytes[1024], validator_index: int128) -> bool:
    return extract32(raw_call(self.validators[validator_index].addr, concat(msg_hash, sig), gas=self.VALIDATION_GAS_LIMIT, outsize=32), 0) == convert(1, 'bytes32')

The validation contract should simply expect an input string of bytes of size less than or equal to 1056 that take the place of the msg_hash and the signature. Pretty cool to see the account abstraction in action!

Now, with all the context in place we can see why we need to compute the hash of the message that was used as input to the signature algorithm. For what I assume are efficiency reasons, the Casper messages do not include the message hash and the message data and the signature; we can compute the message hash given the message data and this is where the MSG_HASHER comes into play. We want a contract that given an input RLP list will return the hash of the list with the last element removed (the signature).

What does it do?

We can experiment with some sample messages by porting the original Serpent source code to Python and then quickly get an understanding of the various parts of the Serpent code.

For reference, here is the Serpent code:

# Fetches the char from calldata at position $x
macro calldatachar($x):
    div(calldataload($x), 2**248)

# Fetches the next $b bytes from calldata starting at position $x 
# Assumes that there is nothing important in memory at bytes 0..63
macro calldatabytes_as_int($x, $b):
    ~mstore(32-$b, calldataload($x))
    ~mload(0)

# Position in calldata
with pos = 0:
    # First char in calldata
    with c0 = calldatachar(0):
        # The start of the array must be in 192...255 because it represents
        # a list length
        # Length ++ body case
        if c0 < 248:
            pos = 1
        # Length of length ++ length ++ body case
        else:
            pos = (c0 - 246)
    # Start position of the list (save it)
    with startpos = pos:
        # Start position of the previous element
        with lastpos = 0:
            # Keep looping until we hit the end of the input
            while pos < ~calldatasize():
                # Next char in calldata
                with c = calldatachar(pos):
                    lastpos = pos
                    # Single byte 0x00...0x7f body case
                    if c < 128:
                        pos += 1
                    # Length ++ body case
                    elif c < 184:
                        pos += c - 127
                    # Length of length ++ length ++ body case
                    elif c < 192:
                        pos += calldatabytes_as_int(pos + 1, c - 183) + (c - 182)
         
            # Length of new RLP list
            with newlen = lastpos - startpos:
                # Length ++ body case
                if newlen < 56:
                    # Store length in the first byte
                    ~mstore8(0, 192 + newlen)
                    # Copy calldata right after length
                    ~calldatacopy(1, startpos, newlen)
                    # Return the hash
                    return(~sha3(0, 1 + newlen))
                else:
                    # The log256 of the length (ie. length of length)
                    # Can't go higher than 16777216 bytes due to gas limits
                    with _log = if(newlen < 256, 1, if(newlen < 65536, 2, 3)):
                        # Store the length
                        ~mstore(0, newlen)
                        # Store the length of the length right before the length
                        with 31minuslog = 31 - _log:
                            ~mstore8(31minuslog, 247 + _log)
                            # Store the rest of the data
                            ~calldatacopy(32, startpos, newlen)
                            # Return the hash
                            return(~sha3(31minuslog, 1 + _log + newlen))

The only tricky bit of the port is handling the calldata and recreating a Python embedding of the EVM semantics. We start with some utilities to facilitate this process, importing some supporting libraries and setting up a web3.py instance.

from web3 import Web3
import rlp
from eth_tester import (
    EthereumTester,
    PyEVMBackend
)
from web3.providers.eth_tester import (
    EthereumTesterProvider,
)

base_tester = EthereumTester(PyEVMBackend())
w3 = Web3(EthereumTesterProvider(base_tester))

Next we build some utilities to make messgages the Casper contract expects. We will need a private key to sign the message. We make a vote message in casper_vote_with_hash by adding some dummy data for validator_index, target_hash, target_epoch, and the source_epoch. To get the message hash, we SHA3 the RLP encoding and defer to web3 for the signature. The final output of this function is the vote message as would be submitted to the Casper contract and the hash of the message we expect to recover via the MSG_HASHER.

validation_key = w3.eth.account.create(str(base_tester.get_accounts()[0])).privateKey

# from casper/tests/utils.py
def encode_int32(val):
    return Web3.toBytes(val).rjust(32, b'\x00')

def casper_vote_with_hash():
    msg = [1, w3.sha3(text='some block hash'), 11, 10]
    msg_hash = w3.sha3(rlp.encode(msg))
    signed = w3.eth.account.signHash(msg_hash, validation_key)
    sig = encode_int32(signed.v) + encode_int32(signed.r) + encode_int32(signed.s)
    msg.append(sig)
    return [rlp.encode(msg), msg_hash]

message, expected_msg_hash = casper_vote_with_hash()

We finish the preparation with a memory utility to replicate the temporary memory of the EVM. The Memory class is just a Python list that extends by zero when attempting to write past its length. We do this to simplify memory operations when executing the port as the default behavior for Python lists is to throw an exception when writing off the end of the backing array.

class Memory(list):
    # https://stackoverflow.com/a/869901

    def __setitem__(self, index, value):
        size = len(self)
        if index >= size:
            self.extend(0 for _ in range(size, index + 1))

            list.__setitem__(self, index, value)# input byte array

We set calldata to our input message we created above and set up a memory for the ported Serpent code to work against.

calldata = message

# memory
memory = Memory([])

We begin with the two Serpent macros which facilitate grabbing either a single byte from the calldata or a slice of bytes from the calldata. Because we are modeling our calldata as a byte-addressed bytestring, we do not have to perform the mathemagic required to work around the EVM’s 32-byte word size.


# Fetches the char from calldata at position index
def calldatachar(index):
    return calldata[index]

# Fetches the next b bytes from calldata starting at position x
def calldata_bytes_as_int(x, b):
    return sum(calldata[x:x+b])

The remainder of the contract is essentially a parser for RLP lists. The assumption is that any message to the Casper contract will be an RLP list and importantly that the signature is the final element in this top-level list. No further restrictions are made on the elements of the top-level RLP list, i.e. they can be any valid RLP elements, including other RLP lists.

We begin by scanning the calldata for the length of the top-level list. There are two options here: either the first byte represents the length of the list, or the first byte represents the length of the length that immediately follows the length data. We determine this value to know where the first RLP list element begins which we store in startpos.

# Save some space for the msg_hash later
msg_hash = None

# Position in calldata
pos = 0
# First char in calldata
c0 = calldatachar(0)
# The start of the array must be in 192...255 because it represents
# a list length
# Length ++ body case
if c0 < 248:
    pos = 1
else:
# Length of length ++ length ++ body case
    pos = (c0 - 246)

# Start position of the list (save it)
startpos = pos

Next, we loop over the calldata, moving an element at a time, until we run out of any message bytes to scan. Given the way we set lastpos to index the current RLP element and the fact that we only loop while pos < len(calldata), we can be sure that lastpos points to the final element of the RLP list. According to our interface above, this value will index the signature data. Aside from the RLP parsing, this is the core logic of the MSG_HASHER. We see detection of each possible RLP type (‘immediate’ value, small list and large list) in the if..elif..elif chain.

# Start position of the previous element
lastpos = 0
# Keep looping until we hit the end of the input
while pos < len(calldata):
    # Next char in calldata
    c = calldatachar(pos)
    lastpos = pos
    # Single byte 0x00...0x7f body case
    if c < 128:
        pos += 1
    # Length ++ body case
    elif c < 184:
        pos += c - 127
    # Length of length ++ length ++ body case
    elif c < 192:
        pos += calldata_bytes_as_int(pos + 1, c - 183) + (c - 182)

Now that we have scanned each element of the top-level list, we now the beginning of the message data and the ending of the message data, excluding the signature. It is precisely this data we need to find the SHA3 to recompute the message hash (compare to here). We use the length to determine which kind of RLP list we need to build and then proceed to recreate the message list in memory. We finish by taking the SHA3 of the proper slice of memory.

# Length of new RLP list
newlen = lastpos - startpos
# Length ++ body case
if newlen < 56:
    # Store length in the first byte
    memory[0] = 192 + newlen
    # Copy calldata right after length
    for i in range(newlen):
        memory[1+i] = calldata[startpos+i]
    msg_hash = w3.sha3(bytes(memory[0:1+newlen]))
else:
    # The log256 of the length (ie. length of length)
    # Can't go higher than 16777216 bytes due to gas limits
    _log = None
    if newlen < 256:
        _log = 1
    elif newlen < 65536:
        _log = 2
    else:
        _log = 3
    # Store the length
    memory[0] = newlen
    # Store the length of the length right before the length
    index_to_list_length = 31 - _log
    memory[index_to_list_length] = 247 + _log
    # Store the rest of the data
    for i in range(newlen):
        memory[32+i] = calldata[startpos+i]
    # Return the hash
    msg_hash = w3.sha3(bytes(memory[index_to_list_length:1+_log+newlen]))

assert expected_msg_hash == msg_hash

The complete code is in a gist here. The vote message is a fairly small flat list so several of the arms of the various conditionals in the MSG_HASHER contract are not tested. There will be more robust testing in the final PR.

Port to Vyper LLL

Vyper is a new programming language with syntax like Python that targets the EVM. It is seeing a lot of activity in the Ethereum space and is likely to provide a robust alternative to the de facto Solidity that has to date seen a lot of criticism for reasons of complexity and lack of developer ergonomics (particularly around making it straightforward to write secure smart contracts). Many instances of large “hacks” where ETH was stolen or frozen were to do unclear consequenes of the Solidity programming model.

We also have Serpent which is another Python-like language meant to be a higher-level interface to the EVM’s assembly. Serpent is officially deprecated at this point and it is advised to perform all Ethereum smart contract development in another language. For this reason, the maintainers of the Casper repo have decided to port all of the dangling Serpent code to something more modern, i.e. Vyper.

The catch here is that Vyper is designed with security in mind, which means you can’t directly author EVM assembly in Vyper source. This decision is probably the right one in aggregate but it does mean we will have to port the Serpent code directly to Vyper LLL (the intermediate representation) to access the lower-level facilities we need.