Hackweek 2020: Assembly Diff Tool

The annual Hackweek event was held last week at Unity. This is an employee-only event organised by Unity as a source of innovation, collaboration and idea generation. In the months leading up to the event, employees are encouraged to share their ideas and invite others to join them in a week of research and development. Throughout the week we hack prototypes together and evaluate their potential for future innovation.


The Project

This year the idea formed during a series of conversations between a number of members of the Sustained Engineering team who are focused on ensuring the quality and stability of Unity. We found a gap in our workflow and wanted a tool that could assist in the pull request review process. The tool would need to help us evaluate the effect a given pull request would have on the performance of the areas of code it touched. It would also have to be small enough in scope to be prototyped within a week. The tool and project we decided to embark on prototyping was an Assembly Diff Tool.

Does this pull request improve performance, or hinder it?

To start the week, we discussed our goals for the project and concluded that the best course of action would be to develop a command line tool. It could then be integrated wherever it was required, be it on a build machine or remote CI solution - a CLI tool would give us the most flexibility. CLI tools also have the advantage of being self-contained, allowing for fast iteration times and the ability to experiment with new technologies. We decided to seize that opportunity for experimentation and write the tool in Rust!

The following is an overview of how we envisioned the tool would work:

  1. Check out the before changeset
  2. Compile the project, using compiler flags to generate assembly output
  3. Check out the after changeset
  4. Recompile the project
  5. Diff the resulting assembly files
  6. Generate a visual diff presentation
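As a rough illustration, the six steps above could be written down as plain data before any command is run. This is only a sketch: the `Step` enum and the `plan` function are hypothetical names made up for this example, not part of the tool.

```rust
/// One stage of the pipeline described in the list above (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Step {
    /// Check out the given changeset.
    Checkout(String),
    /// Compile with assembly output enabled, labelled "before" or "after".
    CompileToAssembly(String),
    /// Diff the two sets of assembly files.
    DiffAssembly,
    /// Produce the visual presentation of the diff.
    RenderReport,
}

/// Return the steps in the order they would be executed.
pub fn plan(before_changeset: &str, after_changeset: &str) -> Vec<Step> {
    vec![
        Step::Checkout(before_changeset.to_string()),
        Step::CompileToAssembly("before".to_string()),
        Step::Checkout(after_changeset.to_string()),
        Step::CompileToAssembly("after".to_string()),
        Step::DiffAssembly,
        Step::RenderReport,
    ]
}
```

Keeping the plan as data like this makes the order easy to log or test without touching git or the compiler.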

Setup

The Rust package registry crates.io contains a great many libraries we could use for parsing command line arguments. We were able to integrate and evaluate a number of packages very quickly, and the consistent documentation generation allowed us to look past the set dressing and get straight to the information we desired.

We settled on using clap as our command line argument parser. It offers a variety of patterns for defining the arguments your tool expects (a complete listing can be found on the clap documentation website). We chose to define our arguments using a struct tagged with the #[derive(Clap)] attribute. This derive attribute invokes a procedural macro that generates code based on the fields defined in the struct.

use clap::Clap;

/// Command line arguments; clap derives the parser from this struct.
#[derive(Clap)]
struct Args {
    /// Project Directory
    project_directory: std::path::PathBuf,
    /// Output file
    output_file: std::path::PathBuf,
    /// Changeset before making changes
    before_changeset: String,
    /// Changeset after changes have been made
    after_changeset: String,
    /// Unity build flag
    #[clap(short, long)]
    unity: bool,
    /// Verbose flag
    #[clap(short, long)]
    verbose: bool,
}

The above is the complete argument struct at the end of the week. Conveniently, clap parses each argument into the type of its field, here PathBuf, String and bool. The derive attribute above generates a parse function that collects the arguments passed to the program and returns them as an instance of the struct. It even generates the --help output from the documentation comments!

With the parsed arguments in hand, we needed to run a number of commands to complete the steps outlined above. This is very simple in Rust thanks to its extensive standard library; we used std::process::Command.

Here is a look at how we used the API:

/// Runs `git diff <from_dir> <to_dir>` from `directory` and returns the captured output.
pub fn diff_directories(
    directory: &std::path::Path,
    from_dir: &std::path::Path,
    to_dir: &std::path::Path,
) -> std::io::Result<std::process::Output> {
    std::process::Command::new("git")
        .current_dir(directory)
        .arg("diff")
        .arg(from_dir)
        .arg(to_dir)
        .output()
}

Using code similar to the above, we were able to run the various commands required to get the basic process in place: git fetch/checkout, running the compiler (a custom build system in this case) and git diff (displayed above). It was helpful that git diff provides an argument set that allows for directory diffing; we didn't like the idea of individually diffing each pair of output assembly files.
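For illustration, the checkout step could follow the same pattern. This is a minimal sketch rather than the code we actually wrote: `checkout_command` and `run_checked` are hypothetical helper names, and the error handling is an assumption about what would be reasonable.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but do not run) `git checkout <changeset>` for the given repository.
pub fn checkout_command(repository: &Path, changeset: &str) -> Command {
    let mut command = Command::new("git");
    command.current_dir(repository).arg("checkout").arg(changeset);
    command
}

/// Run a prepared command and return its stdout,
/// turning a non-zero exit status into an error carrying stderr.
pub fn run_checked(mut command: Command) -> std::io::Result<Vec<u8>> {
    let output = command.output()?;
    if output.status.success() {
        Ok(output.stdout)
    } else {
        Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            String::from_utf8_lossy(&output.stderr).into_owned(),
        ))
    }
}
```

A helper like `run_checked` suits checkout and fetch, but not the directory diff: when git diff compares paths outside a repository it behaves like `git diff --no-index`, which exits with status 1 whenever differences are found, so that exit code should not be treated as a failure.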


Processing Assembly

We decided to use the temporary directory (std::env::temp_dir()) path and the original changeset hash to create an output directory for the compiled assembly files from each of the builds. Our structure was as follows:

<temp_dir>\<original_sha_1>\<before_or_after>\<asm_files>

Where a complete path (on Windows) could look like this:

C:\Users\<Username>\AppData\Local\Temp\<SHA-1>\before
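A sketch of how such a path could be assembled with the standard library is shown below. The `Side` enum and `assembly_output_dir` function are hypothetical names for this example; only the layout itself comes from the structure described above.

```rust
use std::path::{Path, PathBuf};

/// Which of the two builds the assembly files belong to (hypothetical type).
#[derive(Clone, Copy)]
pub enum Side {
    Before,
    After,
}

/// Build `<root>/<original_sha>/<before|after>` using the platform's path separator.
pub fn assembly_output_dir(root: &Path, original_sha: &str, side: Side) -> PathBuf {
    let leaf = match side {
        Side::Before => "before",
        Side::After => "after",
    };
    root.join(original_sha).join(leaf)
}
```

Passing `&std::env::temp_dir()` as `root` gives the layout shown above; taking the root as a parameter keeps the function easy to test with any directory.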

Once we had the assembly in the right place, we used the diff_directories function to generate the diff. This is when we noticed a few problems. Firstly, the diff output was very noisy, especially when we tested on the whole Unity codebase. A simple change in one area of the engine would cause a mass of assembly line modifications, most of which we were not interested in. Secondly, assembly is large; a build of the Unity codebase generated 5.37 gigabytes of assembly data. These issues combined resulted in a very large output diff consisting largely of superfluous modifications.

We needed solutions for both of these problems for the tool to be useful at all. Deliberation and analysis of both the assembly and the diff led us to our solution: post-process everything. We first post-processed the diff, since we noticed many lines had changed only because of constant offset changes or data segment identifier changes. We went for the quick and naive approach of comparing each line to the one before it and discarding the lines from the diff if they were deemed similar enough.

- mov rax, QWORD PTR $T11[rsp]
+ mov rax, QWORD PTR $T10[rsp]
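One way to sketch this kind of filter is shown below. The similarity test here is an assumption chosen for the example (two lines count as similar if they are identical once every run of digits is replaced by a placeholder); it is not necessarily the comparison the tool used, and `normalize` and `filter_similar_pairs` are hypothetical names.

```rust
/// Replace every run of ASCII digits with a single '#', so that lines which
/// differ only in numeric offsets or identifiers compare equal.
fn normalize(line: &str) -> String {
    let mut out = String::with_capacity(line.len());
    let mut in_digits = false;
    for c in line.chars() {
        if c.is_ascii_digit() {
            if !in_digits {
                out.push('#');
                in_digits = true;
            }
        } else {
            out.push(c);
            in_digits = false;
        }
    }
    out
}

/// Drop each removed line ('-') that is immediately followed by an added
/// line ('+') when the two are equal after normalization.
pub fn filter_similar_pairs(diff_lines: &[&str]) -> Vec<String> {
    let mut kept = Vec::new();
    let mut i = 0;
    while i < diff_lines.len() {
        let current = diff_lines[i];
        if let (Some(removed), Some(next)) =
            (current.strip_prefix('-'), diff_lines.get(i + 1))
        {
            if let Some(added) = next.strip_prefix('+') {
                if normalize(removed.trim()) == normalize(added.trim()) {
                    i += 2; // skip both halves of the noisy pair
                    continue;
                }
            }
        }
        kept.push(current.to_string());
        i += 1;
    }
    kept
}
```

This only handles a removed line directly followed by its added counterpart; hunks where several removals precede several additions would need pairing logic that this sketch leaves out.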

This reduced our diff by a decent amount, but it still wasn't small enough to create a usable output. Post-processing the gigabytes of assembly data was the next step. Analysis of the assembly files revealed that there was a lot that we could discard; we only wanted to diff instructions, not the surrounding data. We could therefore remove all lines outside of TEXT segments. Further processing and prettifying could be performed on the segments that remained to reduce the size of the output even more.

The following is a non-exhaustive list of the post processes performed:

  • Discard empty lines,
  • Discard comments (lines starting with a semi-colon, or sections of lines after the semi-colon),
  • Discard all lines outside a SEGMENT block,
  • Discard all blocks that are not TEXT blocks,
  • Prettify function names.
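The first four items in the list could be sketched roughly as follows. The sketch assumes MASM-style listings in which a segment opens with a line such as `_TEXT SEGMENT` and closes with `_TEXT ENDS`; `strip_assembly` is a hypothetical name, the segment marker lines themselves are dropped, and function-name prettifying is left out.

```rust
/// Keep only instruction lines inside TEXT segments, without comments.
pub fn strip_assembly(source: &str) -> String {
    let mut out = String::new();
    let mut in_text = false;
    for raw in source.lines() {
        // Discard comments: everything from the first ';' onwards.
        // (Naive: a ';' inside a string literal is treated as a comment too.)
        let line = raw.split(';').next().unwrap_or("").trim_end();
        // Discard empty lines (including lines that were only a comment).
        if line.trim().is_empty() {
            continue;
        }
        let mut words = line.split_whitespace();
        match (words.next(), words.next()) {
            // `<name> SEGMENT` opens a block; only TEXT blocks are kept.
            (Some(name), Some("SEGMENT")) => {
                in_text = name.contains("TEXT");
                continue;
            }
            // `<name> ENDS` closes the current block.
            (Some(_), Some("ENDS")) => {
                in_text = false;
                continue;
            }
            _ => {}
        }
        if in_text {
            out.push_str(line.trim());
            out.push('\n');
        }
    }
    out
}
```

For gigabytes of input, the same logic would more sensibly run line by line over a buffered reader and writer instead of building one large `String`; the in-memory version is used here only to keep the example short.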

This is very aggressive post-processing with eager discarding of content. I imagine that in the future it would be configurable, since there are cases where keeping various sections of the assembly files would be desirable, depending on the use case and the result desired. However, this worked for us and reduced the size of the assembly data generated from the previously mentioned 5.37 gigabytes down to 1.4 gigabytes - I call that a success!


Presentation

Our tool was now able to check out two provided changesets, execute build scripts to generate assembly files and build a diff from the changes between the two sets of assembly files. The final step was to present the diff in a readable way. Ideally it would also be quickly searchable and provide other means to inspect the contained information, but we wanted a solution that we could build within a week.

We decided to go for the simple yet effective route of building an HTML file from a template, using the diff2html library to render the content. Initial attempts proved unfruitful because we were using an unmaintained and largely undocumented Rust crate version of diff2html. We tried to rely on the documentation provided by the JavaScript version of the library, but ended up dropping the Rust version entirely. Instead, we wrote an HTML template that we could inject the diff content into, leaving the JavaScript library to do the rendering of the page.
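The injection step itself can be very small. The sketch below is an assumption about how it might look, not our actual template code: the `{{DIFF}}` placeholder, `escape_html` and `render_page` are hypothetical, and the script that hands the text to diff2html in the browser is omitted.

```rust
/// Escape the characters that would otherwise be interpreted as HTML markup.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(c),
        }
    }
    out
}

/// Substitute the escaped diff text for the `{{DIFF}}` placeholder in the template.
pub fn render_page(template: &str, diff: &str) -> String {
    template.replace("{{DIFF}}", &escape_html(diff))
}
```

Escaping matters here because assembly listings and diff headers can contain `<`, `>` and `&`, which would otherwise corrupt the page once embedded.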

There were additional hurdles along the road, however. It seems that diff2html (or browsers) can't handle large amounts of diff data; we tried feeding it 186 megabytes of string data and it wasn't exactly happy with it. We would definitely need a more scalable solution if we were to develop this tool further and use it in a production environment.


Future

Now that Hackweek has concluded, many teams hope to continue working on their projects. In the past this has led to innovations in either the product or our workflows at Unity. The team and I are hoping to continue work on this (where we can find the time) and see what it can grow into. Maybe I'll write an update here in the future detailing its evolution or demise!

Overall, Hackweek this year was a great experience and I thoroughly enjoyed it!

Take a look at a snippet from one of the HTML files generated by the tool: Editor_Src_1.html