
Breaking things one pointer at a time

The changing scope of my research

I started on my Master's research project back in March, and both the progress of the project itself and how I've felt about it since then have been somewhat interesting.

For some context, the codebase that I've been contributing to as a part of my research is approximately 50 thousand lines of code, and the PhD thesis that much of my work is based on is 204 pages long (as I've previously mentioned), so even getting a grasp on what exactly my research project entailed took quite a long time.

Whilst students do pick their own research projects (and I selected the research team that I joined), the projects themselves are selected by the supervisors and then advertised. This unfortunately meant that whilst my supervisor had somewhat of a grasp on the project that I had selected, I myself had basically no clue. I have no background in physics (the most that I've done is high school physics, and definitely nothing to do with gravitational waves), so even understanding how everything fit together to do what it was intended to do was incredibly difficult - and it was made worse by the fact that the codebase has very few comments.

As such, a large part of my time up until now has simply been spent trying to get up to speed. I did a complexity analysis of a part of the codebase, with the sole purpose of attempting to understand the small section which that analysis encompassed. It definitely worked, but as time has gone on and I've gotten a better idea of how everything fits together, it turned out that the section of code I analysed actually has little to do with the main part of my research - so all that work was for nothing.

To me, it always felt like the scope of my project was changing, even though in actuality that scope never changed. More files moved into what felt like the scope of my project at exactly the same time as other files were suddenly discounted. I went from expecting my entire project to be in C and CUDA, to it being only in Python, and then back again. Most of this change was because my understanding of what exactly my project entailed didn't fully mature until I had a full understanding of how everything was put together, and that understanding was vital to figuring out how exactly I was meant to go about solving the problem I was given.

I think this is likely the case for lots of people doing more theoretical research than my implementation-focused work. As your understanding of the thing you are researching increases, it becomes more and more likely that things you previously thought important turn out to be irrelevant, and that ideas which only seemed tangential are suddenly vital - and this seems to be the pattern of any complex long-term project.

I think that this is part of the reason why senior (or at least experienced) software engineers are so coveted. Whilst they may not be burning many story points for their own managers to see, they have enough experience with the codebase and exposure to different ways to solve problems to be able to make the scope for other people much smaller, and their entire team more efficient because of it. It's also why there's always an "on-boarding" period for new employees, so they are able to have a rudimentary understanding of the potential scopes of any problems they encounter.

With this in mind, there are a few things that I think massively help with preventing the sort of scope creep that I've experienced. In many ways these are probably just common-sense general guidelines that can be found in any "programming processes" textbook, but they're still helpful to list here.

Have useful comments and documentation

The codebase I've been working on has very, very few comments, and those that it does have are generally just commented-out old versions of the code. This means that almost all of the sources for how things work have to come from outside the codebase itself, and are usually one of two things - people or papers (as in peer-reviewed papers).

The massive downside to this is that if you want to find out how one very specific part of the codebase works, you either need to have someone who hasn't worked on that part of the code for potentially many years sit down and work through it with you, or you need to trawl through one of 50 research papers in the hope that it mentions the specific function that you're looking for (and spoiler alert, chances are that none of them do). Having no useful comments and no useful documentation means that the process of on-boarding is incredibly time consuming, and it takes valuable work hours away from people who know the codebase well in order to explain the minutiae of the code to newcomers.

This is something that I've attempted to rectify with my research project - every addition I've made has comments on its function and sometimes on how/why it works. My hope is that the next person who needs to work on the same area of code will be able to make use of the work I've already done and not need to retrace my steps, but unfortunately the entire codebase is so massive in comparison to the scope of my work that the chance of any overlap is minute.

Have useful commit messages

This is another area that definitely needed improving. Before I began on this research project, commit messages often looked like this:

postcoh.c: fix a bug for output trigger->ifos when the ifos are LV

or

generic_init.sh: change Virgo quality bits with latest suggestion

For the first one, what was the bug? How does this change fix it? Or for the second one, what's the "latest suggestion"? What problem does it solve?

In many cases, the commit messages in a repository are just as useful a form of documentation as the comments in the code itself. If you can clearly articulate the reasoning behind changes, the problems that they solve and perhaps possible alternatives, then understanding the progression of the codebase becomes significantly easier - and running git blame would actually return useful information.

When it comes to writing commit messages, I try to follow this excellent blog post on commit messages, which suggests that you should use the text body of a commit message whenever possible to explain "what" has changed and "why". There's also been a push from some of the other members of the research team to follow similar guidelines. The project undergoes an external code review every 1.5 years or so, and being able to clearly show the reasoning for changes helps with the efficiency of that review in addition to helping new team members understand what they're looking at.
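As an illustration, a more useful version of the first commit message above might be structured something like the following. The body here is just a generic template - I don't know the details of the original bug, so treat it as a sketch of the format rather than the real fix:

postcoh.c: fix a bug for output trigger->ifos when the ifos are LV

Explain here what the output looked like before the change and why
that was wrong, what this change does to fix it, and any alternative
approaches that were considered and rejected (and why).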

Provide early and timely feedback to direct efforts

One of the reasons why the area of code that I did a complexity analysis on ended up being irrelevant to my project is because I had absolutely no idea what I was being asked to do in my project. I'd written a project proposal, had talked to my supervisors about doing a complexity analysis, did the complexity analysis, and the only feedback that I got the entire time was "no one has done a complexity analysis for this type of project before, I look forward to the results!". Whilst this was nice to hear, and the complexity analysis was fun to do, it felt all for naught when I realised that it wasn't really under the purview of my project. The many hours that I spent poring over that part of the code, trying to understand every part and running some of my own benchmarks and performance analysis, ended up being for something entirely irrelevant.

I don't think that this is entirely the fault of my supervisors - a large part of it also falls on me for not clarifying what exactly was being asked of me, especially as I didn't really understand how the whole project fit together at the time - but some early feedback to properly direct my efforts into something that was actually relevant would have allowed me to finish my project significantly earlier, and possibly even have time to extend it.

Conclusion

I know I've spent the last thousand or so words complaining about things in my research project, but I have genuinely enjoyed my time doing research this year. The value of good comments and documentation, good commit messages and the role of feedback in directing efforts are lessons that I will take into my future projects, and I will work to ensure that the feeling of a massively changing scope in a static project does not happen to me again, nor to anyone else.


Rescheduling when I do blog posts

It's been almost a full week since my last blog post, and I should probably discuss why.

When I originally restarted this blog, I intended to write one blog post per day. Barely three weeks later and it seems like I've given up all hope of trying for that trend.

I found, not long after I started, that when I wrote about things that I'd given some thought to, I wrote a lot more words than I thought I would. Unfortunately, I am limited by my ability to type fast and get my words onto my computer. This often means that even if I have a fully-formed and planned out idea, it can easily take me 45 minutes to an hour to write.

I'm almost the entire way through my Master's, and the semester truly has gotten back into the swing of things. I have mid-semester tests coming up, and assignments are starting to come due. In addition, my thesis will soon need to be submitted, which means that my time is at a premium.

As such, I'm instead going to move to writing one blog post a week, starting next week on Sunday. I think this should give me the time I need for my other assignments and allow me to properly spend time on these posts.

I should probably say that this move to one blog post a week isn't due to a lack of ideas - I have almost more ideas written down than days I've taken off from writing! Unfortunately, I just need to put more time into my university work for the moment.


My ideal computer at the moment

Chris Siebenmann recently wrote about how his ideal machine isn't in an existing category. Interestingly, I find myself in the same sort of situation, however with an almost entirely different conclusion.

As I have previously mentioned, I am of the opinion that as the number of machines you try to do work on increases, the difficulty of keeping those machines synchronized increases exponentially. This means that for me, an ideal machine would be easily portable so that it can be used anywhere with minimal effort to transport. This effectively leaves me with the options of either a laptop (as in an ultrabook) or a tablet.

Tablets are in an interesting position right now where they are continuously increasing in power and ability, but still lack many of the features that I'd require in a daily driver machine, namely the ability to easily compile code locally, easy access to the terminal and a keyboard-driven UI (tiling window managers truly have gotten the better of me). Thus, I am left with the sole choice of a laptop.

Now the nitpicking starts: what sort of specs do I want in a laptop? Well, as you may have picked up from my rant about why esports will never be mainstream, I play CS:GO on a regular basis, and also enjoy casually playing a number of other single-player games such as BioShock and Civilization. As such, I would quite like to be able to continue playing these games, as they are my primary source of leisure.

Say what you will about their power, but gaming laptops are (generally speaking) both incredibly heavy and immobile, and very difficult to upgrade. I have no intention of replacing my laptop every 1-2 years just so I can upgrade my GPU. So for me, the option that makes the most sense is a laptop with an external GPU enclosure, so the GPU can be upgraded independently of the rest of the machine. This would also allow me to dock my machine when I get home and get use out of the high refresh rate monitors that I own - something that would be extremely valuable for gaming. External GPU enclosures do exist on the market (although they are not widely used); however, they are almost all based on Thunderbolt 3. This leads me to my next problem.

I'd much rather use an AMD-based CPU than an Intel-based one, as the current generation of Ryzen-based laptop CPUs run circles around their Intel equivalents. With their extra cores, both compiling and gaming would be significantly better on an AMD-based laptop than an Intel-based one. This comes with the issue of USB 4 (the equivalent of the Thunderbolt 3 standard, which can also be used for external GPU enclosures) only being supported on the next generation of AMD chips.

Thus my ideal machine doesn't yet exist, but I fully expect that in a year or two it might.


Turns out you can get in contact with me

Today's update will be a fair bit shorter than my usual fare (thank goodness!).

I mentioned in this post that there would be no way to contact me easily for feedback and comments. My belief at the time of writing was that if I wanted to receive emails from an email address related to this domain (e.g. something@tommoa.me), I'd have to purchase a domain email service.

It turns out I was wrong about that. I've been able to set up tom@tommoa.me to forward email to me, but unfortunately there's been no way (as far as I can tell) for me to send mail back using that address unless I pay for email. From my hosting provider, it only costs about $2 AUD per month for email (compared to GSuite at $8.50 AUD per month), but I don't want to waste even $2 if it turns out that the email address isn't going to be used.

If I start getting enough email through the address above that it becomes worth it to me to purchase the email service, I will, but as my current audience (as far as I'm aware) is a grand total of zero people, it's definitely not worth my money right now.


Automatically updating git submodules using GitHub Actions

I know that the mono-repository is in style at the moment, but git submodules are a fantastic (and probably overcomplicated) tool for storing components of a git repository that need to be used together, but kept separate. For example, you may need to track a specific upstream version of another repository that isn't controlled by you or your organization for use in your repository, or you might want to mix repository visibility or share some code between different repositories that need to be separated.

I personally use submodules in my repositories for a wide variety of purposes. For example, my research project uses a submodule to track the latest version of the codebase that I am modifying so I can write patches against those versions. Another example is my site, which has a bunch of submodules to track various things which get published there, from my resume and my research project, to the theme that I use for both my site and this blog.

I only recently moved to using the same codebase for the theme of my site and blog, which has both made my life easier and caused me a great deal of heartache - and almost all of the heartache came from git submodules. You see, submodules are tied to a specific commit, and running a command like git submodule update (which you'd think updates a submodule to the latest version) only checks out the submodule to the commit that your local repository already has stored. This makes life a little more difficult - how can I easily update submodules without having to specifically know the remotes, branches or locations of all of my submodules?

First, all of the submodules have their remotes and locations (and optionally branches) stored in the .gitmodules file, which can be used to iterate through the submodules and pull down the latest versions. Another option is to use the git submodule foreach command to update each submodule by fetching its remote and then checking out the latest commit, which has the disadvantage that submodules are not (by default) checked out to branches, and are instead checked out to specific commits. Neither of these options is particularly ergonomic to work with, and so git introduced the --remote flag to git submodule update to tell it to update to the branch tracked on the remote instead of the commit stored in the local repository.
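For reference, doing this by hand from the parent repository might look something like the following minimal sketch (the commit message is just a placeholder):

# Move every submodule to the latest commit on the branch it tracks on its remote
git submodule update --init --recursive --remote

# The parent repository now sees the submodule pointers as modified, so record and push them
git commit -am "Update submodules to latest upstream"
git push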

Great, so now we know how to update our submodules to the latest remote commit (huzzah!), but I want to push to one repository and watch the other repositories automatically update to track that commit (instead of requiring me to run a command). How can we do that?

GitHub Actions to the rescue! (although unfortunately it won't quite get us all the way there)

GitHub Actions allows you to trigger workflows on arbitrary external events using the repository_dispatch event. The first step is to create a GitHub Action in our repository that updates submodules for us.

Of course, the simplest way to do this would be to recursively clone the repository using GitHub's checkout action, run git submodule update --remote and then git push the changes back into the repository. Unfortunately, this doesn't quite work out, for two reasons: we won't have write access to the repository, and we won't be able to clone private repositories. We can get around this by passing a personal access token to GitHub's checkout action and telling it that we need to pull submodules recursively. An example of this action can be found below, and is actually what I use in my site's repo.

name: Update module
on:
  repository_dispatch:
    types: update
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.PAT }}
          submodules: recursive
      - name: Update module
        run: |
          git submodule update --init --recursive --checkout -f --remote -- "${{github.event.client_payload.module}}"
          git config --global user.name "GitHub Action"
          git config --global user.email "noreply@github.com"
          git commit -am "deploy: ${{github.event.client_payload.module}} - ${{github.event.client_payload.sha}}"
          git push
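
To check that the receiving workflow behaves as expected, you can also fire a repository_dispatch event by hand against the GitHub API - something like the following, where the repository name, token and payload values are all placeholders:

curl -X POST \
  -H "Accept: application/vnd.github.v3+json" \
  -H "Authorization: token <personal-access-token>" \
  https://api.github.com/repos/owner/repo/dispatches \
  -d '{"event_type": "update", "client_payload": {"module": "owner/submodule", "sha": "abc1234"}}'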

The second step is to create the trigger for the repository_dispatch in the submodule repositories themselves. This can be done using the excellent repository-dispatch action to send the event. An example of this is shown below.

name: Dispatch to repo
on: [push, workflow_dispatch]
jobs:
  dispatch:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        repo: ["owner/repo"]
    steps:
      - name: Push to repo
        uses: peter-evans/repository-dispatch@v1
        with:
          token: ${{ secrets.PAT }}
          repository: ${{ matrix.repo }}
          event-type: update
          client-payload: '{"ref": "${{ github.ref }}", "sha": "${{ github.sha }}", "module": "owner/submodule", "branch": "master"}'

This of course works with submodules that you control, but what about submodules that you don't control? Unfortunately there's no way to start an action based on a push in a different repository, nor is there a way to create a webhook that you could use to trigger another action from a repository that you don't control. As such, you are left with two options: simply run the update action at a frequent interval, or hack together a way to do it with activity email notifications and a repository_dispatch. I know which of the two I'd rather implement in a hurry!
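For what it's worth, the interval option is only a small change to the update workflow from earlier. A rough sketch might look like the following - note that a scheduled run has no client_payload, so it updates every submodule rather than a specific one (the schedule and commit message are just placeholders):

name: Scheduled submodule update
on:
  schedule:
    # once a day at midnight UTC
    - cron: '0 0 * * *'
  workflow_dispatch:
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.PAT }}
          submodules: recursive
      - name: Update all submodules
        run: |
          git submodule update --init --recursive --remote
          git config --global user.name "GitHub Action"
          git config --global user.email "noreply@github.com"
          # only commit and push if a submodule pointer actually moved
          git diff --quiet || (git commit -am "deploy: scheduled submodule update" && git push)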
