diff --git a/infra/website/content/blog/119_wrong_abstraction.md b/infra/website/content/blog/119_wrong_abstraction.md new file mode 100644 index 00000000000..e89b7adc6d2 --- /dev/null +++ b/infra/website/content/blog/119_wrong_abstraction.md @@ -0,0 +1,152 @@ +Title: That's the wrong abstraction layer +Category: Blog +Date: 2020-11-11 +Tags: /dev/diary + +I'm writing this post mostly to my future self, not any specific +project or piece of code I've seen other people write. That's not to +say that I don't think this is something that probably applies to many +projects. Sometimes it's easy to lose sight of what we're doing, and +it's good to be reminded. + +So to start at the beginning: I've been working on [supergit], a Rust +library to parse git repositories. It's built on top of `libgit2` +(and the `git2` rust bindings), and aims to create a more Rustic +interface and type fascade for git repositories. It also aims to +solve issues such as: rename detection, path-history, and subtree +management. I'm writing this library for [octopus], which will +eventually host my monorepo. + +In `supergit` the main workflow is around iterating things, seeing as +git is an acyclical graph, and iterators are a decent way to view this +datastructure. But git graphs can get pretty big. I wanted the +iterator to be configurable in a way that allows someone to write a +tool that searches a whole repository history, while also making it +possible to step through a history 20 commits at a time (to implement +history pagination on a website, for example). + +Looking at the current API, this is how you would implement the +latter, for a `main` branch: + +```rust +use supergit::Repository; + +fn main() { + let path = ... // get your repository path somehow + let repo = Repository::open(path).unwrap(); + + let main = repo.get_branch("main").unwrap(); + let iter = main.get(20); + + iter.for_each(|c| { + println!("{}: {}", c.commit().id_str(), c.commit().summary()); + }); +} +``` + +That's easy enough, right? But wait, why am I calling `.commit()` on +`c`. Isn't it already a commit? Well...sort of. In `supergit`, this +type is a `BranchCommit`, because this is where things get +complicated. + + +## Sort of like a tree, but not really + +In git, rarely is a branch just a history of single commits. Maybe +this is how some people think about their history, but it certainly +has never been the case for any of the repositories that I work on. +Basically the second you have more than one contributor, it's very +common for a history to have merge-commits in it. + +So how do we deal with that in an iterator? The design I chose was to +wrap a `Commit` object in another type, which can convey this state. +`BranchCommit` is an enum and has three variants: `Commit` (maybe I +should rename that to `Simple` or something?), `Merge`, and `Octopus` +(if you don't know what an octopus merge is, don't worry about it. +Most people don't and they're very rare and weird). + +What `Merge` and `Octopus` contain are new `Branch` handles (the type +returned by `get_branch()`), meaning that for every split it's now up +to the user to decide whether they want to continue first-parent +(i.e. only ever follow the main branch line, ignoring the history of +merged branches), or if they want to enumerate the histories as well. +Most importantly: for every branch merge, you get to re-decide what +your iterator strategy should be: infinate, limited by number, or +limited up to a certain commit-hash. + +So far so good I thought, this is an okay enough interface for me to +work with. But this is where some problems appeared. + + +## File histories (and git internals) + +*(a slight de-tour through git - feel free to skip)* + +The main reason why I'm writing this more Rustic wrapper around +`libgit2` is to make it easier to determine what the history of a file +has been. This is pretty simple to find out via the git CLI (`git -- +`), but not something that `libgit2` exposes, because +that's not how git stores data. + +To git, all data is stored in a key-value store indexed by a SHA1 +(soon to be SHA256 I think?) hash reference. That applies to files, +full file trees, and commits as well. Say we have a file `acab.txt`, +we commit it and it gets the ID +`da39a3ee5e6b4b0d3255bfef95601890afd80709` (the file ID, not the +commit ID!), but then we open it and write `ACAB` in it, and commit +that again. Now the file ID is +`99f069b8a0cbe4c9485a14fe50775d0f71deb4e7`. Both these files are +saved in the git object store, because after all you might want to go +back to the older version. + +But here's the thing: from the actual commits we can get two things: +the file tree at the time of commit, and the commit parents. To +figure out what actually _changed_ in the commit, you have to diff it +against it's parents, which is exactly what `git show` does if you +give it a reference to a commit. + +What this means is that if you want to have a library that grabs the +history of a path, well you'll have to go through all commits, and +check the tree for changes at that specific path. Furthermore, that +won't actually let you know if a file has simply been renamed, only +that it has changed. Further logic is required to figure out if the +file is the same, but just has a different name. + +And all of this is something that `supergit` implements, behind a nice +Rustic API (I hope...). + + +## Bloated abstractions + +So I wrote a function that would, for a branch iterator, step along it +and check the history of a path, by diffing each commit with it's +parents, and tracking a path via the delta information in the diff. +But this is where I ran into problems. Because my iterator design +always chose the first-parent to step through. Other branches were +ignored, and because the function accepted an iterator and stepped it +internally, there was no way for my `file_history()` function to +figure out the exact behaviour the user wanted. + +My first instinct was to implement branching in the `BranchIter` +itself; allowing it to branch off, essentially pushing commits it +would have to get back to onto a stack, and resuming from a previous +position. That turned out to be a really [bad idea][badidea]. + +It took me about an hour of banging my head against this abstraction +before I realised that it wasn't meant to be. Sometimes systems are +self-contained, and adding more functionality takes a considerable +amount of effort, and begs the question, if it's really the right +choice to make. Why add more functionality to an abstraction that +works fine on it's own? + +Instead, embrace composition, and add another layer on top, that can +use the previous. You end up with a much more managable design, and +data can flow from one layer to the next. Make sure that your +interfaces are flexible enough to be re-used, but don't think that +just because a component _could_ technically be responsible for some +work, that it really has to implement this work. + +And that's it basically. Thanks for reading my ramblings about git +and one of my side-projects. I hope I managed to make you think about +the way you build systems a bit, and maybe next time you are in a +situation similar to this one, don't be like me :)