website: adding article about overloaded abstraction layers

4 years ago · 1e9c862f4c
parent b2681a59d5
commit 1e9c862f4c
1 changed files with 152 additions and 0 deletions
--- a/infra/website/content/blog/119_wrong_abstraction.md
+++ b/infra/website/content/blog/119_wrong_abstraction.md
@ -0,0 +1,152 @@
+Title: That's the wrong abstraction layer
+Category: Blog
+Date: 2020-11-11
+Tags: /dev/diary
+
+I'm writing this post mostly to my future self, not any specific
+project or piece of code I've seen other people write.  That's not to
+say that I don't think this is something that probably applies to many
+projects.  Sometimes it's easy to lose sight of what we're doing, and
+it's good to be reminded.
+
+So to start at the beginning: I've been working on [supergit], a Rust
+library to parse git repositories.  It's built on top of `libgit2`
+(and the `git2` rust bindings), and aims to create a more Rustic
+interface and type fascade for git repositories.  It also aims to
+solve issues such as: rename detection, path-history, and subtree
+management.  I'm writing this library for [octopus], which will
+eventually host my monorepo.
+
+In `supergit` the main workflow is around iterating things, seeing as
+git is an acyclical graph, and iterators are a decent way to view this
+datastructure.  But git graphs can get pretty big.  I wanted the
+iterator to be configurable in a way that allows someone to write a
+tool that searches a whole repository history, while also making it
+possible to step through a history 20 commits at a time (to implement
+history pagination on a website, for example).
+
+Looking at the current API, this is how you would implement the
+latter, for a `main` branch:
+
+```rust
+use supergit::Repository;
+
+fn main() {
+    let path = ... // get your repository path somehow
+    let repo = Repository::open(path).unwrap();
+    
+    let main = repo.get_branch("main").unwrap();
+    let iter = main.get(20);
+    
+    iter.for_each(|c| {
+        println!("{}: {}", c.commit().id_str(), c.commit().summary());
+    });
+}
+```
+
+That's easy enough, right?  But wait, why am I calling `.commit()` on
+`c`.  Isn't it already a commit?  Well...sort of.  In `supergit`, this
+type is a `BranchCommit`, because this is where things get
+complicated.
+
+
+## Sort of like a tree, but not really
+
+In git, rarely is a branch just a history of single commits.  Maybe
+this is how some people think about their history, but it certainly
+has never been the case for any of the repositories that I work on.
+Basically the second you have more than one contributor, it's very
+common for a history to have merge-commits in it.
+
+So how do we deal with that in an iterator?  The design I chose was to
+wrap a `Commit` object in another type, which can convey this state.
+`BranchCommit` is an enum and has three variants: `Commit` (maybe I
+should rename that to `Simple` or something?), `Merge`, and `Octopus`
+(if you don't know what an octopus merge is, don't worry about it.
+Most people don't and they're very rare and weird).
+
+What `Merge` and `Octopus` contain are new `Branch` handles (the type
+returned by `get_branch()`), meaning that for every split it's now up
+to the user to decide whether they want to continue first-parent
+(i.e. only ever follow the main branch line, ignoring the history of
+merged branches), or if they want to enumerate the histories as well.
+Most importantly: for every branch merge, you get to re-decide what
+your iterator strategy should be: infinate, limited by number, or
+limited up to a certain commit-hash.
+
+So far so good I thought, this is an okay enough interface for me to
+work with.  But this is where some problems appeared.
+
+
+## File histories (and git internals)
+
+*(a slight de-tour through git - feel free to skip)*
+
+The main reason why I'm writing this more Rustic wrapper around
+`libgit2` is to make it easier to determine what the history of a file
+has been.  This is pretty simple to find out via the git CLI (`git --
+<your file here>`), but not something that `libgit2` exposes, because
+that's not how git stores data.
+
+To git, all data is stored in a key-value store indexed by a SHA1
+(soon to be SHA256 I think?) hash reference.  That applies to files,
+full file trees, and commits as well.  Say we have a file `acab.txt`,
+we commit it and it gets the ID
+`da39a3ee5e6b4b0d3255bfef95601890afd80709` (the file ID, not the
+commit ID!), but then we open it and write `ACAB` in it, and commit
+that again.  Now the file ID is
+`99f069b8a0cbe4c9485a14fe50775d0f71deb4e7`.  Both these files are
+saved in the git object store, because after all you might want to go
+back to the older version.
+
+But here's the thing: from the actual commits we can get two things:
+the file tree at the time of commit, and the commit parents.  To
+figure out what actually _changed_ in the commit, you have to diff it
+against it's parents, which is exactly what `git show` does if you
+give it a reference to a commit.
+
+What this means is that if you want to have a library that grabs the
+history of a path, well you'll have to go through all commits, and
+check the tree for changes at that specific path.  Furthermore, that
+won't actually let you know if a file has simply been renamed, only
+that it has changed.  Further logic is required to figure out if the
+file is the same, but just has a different name.
+
+And all of this is something that `supergit` implements, behind a nice
+Rustic API (I hope...).
+
+
+## Bloated abstractions
+
+So I wrote a function that would, for a branch iterator, step along it
+and check the history of a path, by diffing each commit with it's
+parents, and tracking a path via the delta information in the diff.
+But this is where I ran into problems.  Because my iterator design
+always chose the first-parent to step through.  Other branches were
+ignored, and because the function accepted an iterator and stepped it
+internally, there was no way for my `file_history()` function to
+figure out the exact behaviour the user wanted.
+
+My first instinct was to implement branching in the `BranchIter`
+itself; allowing it to branch off, essentially pushing commits it
+would have to get back to onto a stack, and resuming from a previous
+position.  That turned out to be a really [bad idea][badidea].
+
+It took me about an hour of banging my head against this abstraction
+before I realised that it wasn't meant to be.  Sometimes systems are
+self-contained, and adding more functionality takes a considerable
+amount of effort, and begs the question, if it's really the right
+choice to make.  Why add more functionality to an abstraction that
+works fine on it's own?
+
+Instead, embrace composition, and add another layer on top, that can
+use the previous.  You end up with a much more managable design, and
+data can flow from one layer to the next.  Make sure that your
+interfaces are flexible enough to be re-used, but don't think that
+just because a component _could_ technically be responsible for some
+work, that it really has to implement this work.
+
+And that's it basically.  Thanks for reading my ramblings about git
+and one of my side-projects.  I hope I managed to make you think about
+the way you build systems a bit, and maybe next time you are in a
+situation similar to this one, don't be like me :)