website: adding article about overloaded abstraction layers

wip/yesman
Katharina Fey 4 years ago committed by Mx Kookie
parent b2681a59d5
commit 1e9c862f4c
  1. 152
      infra/website/content/blog/119_wrong_abstraction.md

@ -0,0 +1,152 @@
Title: That's the wrong abstraction layer
Category: Blog
Date: 2020-11-11
Tags: /dev/diary
I'm writing this post mostly to my future self, not any specific
project or piece of code I've seen other people write. That's not to
say that I don't think this is something that probably applies to many
projects. Sometimes it's easy to lose sight of what we're doing, and
it's good to be reminded.
So to start at the beginning: I've been working on [supergit], a Rust
library to parse git repositories. It's built on top of `libgit2`
(and the `git2` rust bindings), and aims to create a more Rustic
interface and type fascade for git repositories. It also aims to
solve issues such as: rename detection, path-history, and subtree
management. I'm writing this library for [octopus], which will
eventually host my monorepo.
In `supergit` the main workflow is around iterating things, seeing as
git is an acyclical graph, and iterators are a decent way to view this
datastructure. But git graphs can get pretty big. I wanted the
iterator to be configurable in a way that allows someone to write a
tool that searches a whole repository history, while also making it
possible to step through a history 20 commits at a time (to implement
history pagination on a website, for example).
Looking at the current API, this is how you would implement the
latter, for a `main` branch:
```rust
use supergit::Repository;
fn main() {
let path = ... // get your repository path somehow
let repo = Repository::open(path).unwrap();
let main = repo.get_branch("main").unwrap();
let iter = main.get(20);
iter.for_each(|c| {
println!("{}: {}", c.commit().id_str(), c.commit().summary());
});
}
```
That's easy enough, right? But wait, why am I calling `.commit()` on
`c`. Isn't it already a commit? Well...sort of. In `supergit`, this
type is a `BranchCommit`, because this is where things get
complicated.
## Sort of like a tree, but not really
In git, rarely is a branch just a history of single commits. Maybe
this is how some people think about their history, but it certainly
has never been the case for any of the repositories that I work on.
Basically the second you have more than one contributor, it's very
common for a history to have merge-commits in it.
So how do we deal with that in an iterator? The design I chose was to
wrap a `Commit` object in another type, which can convey this state.
`BranchCommit` is an enum and has three variants: `Commit` (maybe I
should rename that to `Simple` or something?), `Merge`, and `Octopus`
(if you don't know what an octopus merge is, don't worry about it.
Most people don't and they're very rare and weird).
What `Merge` and `Octopus` contain are new `Branch` handles (the type
returned by `get_branch()`), meaning that for every split it's now up
to the user to decide whether they want to continue first-parent
(i.e. only ever follow the main branch line, ignoring the history of
merged branches), or if they want to enumerate the histories as well.
Most importantly: for every branch merge, you get to re-decide what
your iterator strategy should be: infinate, limited by number, or
limited up to a certain commit-hash.
So far so good I thought, this is an okay enough interface for me to
work with. But this is where some problems appeared.
## File histories (and git internals)
*(a slight de-tour through git - feel free to skip)*
The main reason why I'm writing this more Rustic wrapper around
`libgit2` is to make it easier to determine what the history of a file
has been. This is pretty simple to find out via the git CLI (`git --
<your file here>`), but not something that `libgit2` exposes, because
that's not how git stores data.
To git, all data is stored in a key-value store indexed by a SHA1
(soon to be SHA256 I think?) hash reference. That applies to files,
full file trees, and commits as well. Say we have a file `acab.txt`,
we commit it and it gets the ID
`da39a3ee5e6b4b0d3255bfef95601890afd80709` (the file ID, not the
commit ID!), but then we open it and write `ACAB` in it, and commit
that again. Now the file ID is
`99f069b8a0cbe4c9485a14fe50775d0f71deb4e7`. Both these files are
saved in the git object store, because after all you might want to go
back to the older version.
But here's the thing: from the actual commits we can get two things:
the file tree at the time of commit, and the commit parents. To
figure out what actually _changed_ in the commit, you have to diff it
against it's parents, which is exactly what `git show` does if you
give it a reference to a commit.
What this means is that if you want to have a library that grabs the
history of a path, well you'll have to go through all commits, and
check the tree for changes at that specific path. Furthermore, that
won't actually let you know if a file has simply been renamed, only
that it has changed. Further logic is required to figure out if the
file is the same, but just has a different name.
And all of this is something that `supergit` implements, behind a nice
Rustic API (I hope...).
## Bloated abstractions
So I wrote a function that would, for a branch iterator, step along it
and check the history of a path, by diffing each commit with it's
parents, and tracking a path via the delta information in the diff.
But this is where I ran into problems. Because my iterator design
always chose the first-parent to step through. Other branches were
ignored, and because the function accepted an iterator and stepped it
internally, there was no way for my `file_history()` function to
figure out the exact behaviour the user wanted.
My first instinct was to implement branching in the `BranchIter`
itself; allowing it to branch off, essentially pushing commits it
would have to get back to onto a stack, and resuming from a previous
position. That turned out to be a really [bad idea][badidea].
It took me about an hour of banging my head against this abstraction
before I realised that it wasn't meant to be. Sometimes systems are
self-contained, and adding more functionality takes a considerable
amount of effort, and begs the question, if it's really the right
choice to make. Why add more functionality to an abstraction that
works fine on it's own?
Instead, embrace composition, and add another layer on top, that can
use the previous. You end up with a much more managable design, and
data can flow from one layer to the next. Make sure that your
interfaces are flexible enough to be re-used, but don't think that
just because a component _could_ technically be responsible for some
work, that it really has to implement this work.
And that's it basically. Thanks for reading my ramblings about git
and one of my side-projects. I hope I managed to make you think about
the way you build systems a bit, and maybe next time you are in a
situation similar to this one, don't be like me :)
Loading…
Cancel
Save