parent
b2681a59d5
commit
1e9c862f4c
@ -0,0 +1,152 @@ |
|||||||
|
Title: That's the wrong abstraction layer |
||||||
|
Category: Blog |
||||||
|
Date: 2020-11-11 |
||||||
|
Tags: /dev/diary |
||||||
|
|
||||||
|
I'm writing this post mostly to my future self, not any specific |
||||||
|
project or piece of code I've seen other people write. That's not to |
||||||
|
say that I don't think this is something that probably applies to many |
||||||
|
projects. Sometimes it's easy to lose sight of what we're doing, and |
||||||
|
it's good to be reminded. |
||||||
|
|
||||||
|
So to start at the beginning: I've been working on [supergit], a Rust |
||||||
|
library to parse git repositories. It's built on top of `libgit2` |
||||||
|
(and the `git2` Rust bindings), and aims to create a more Rustic |
||||||
|
interface and type facade for git repositories. It also aims to |
||||||
|
solve issues such as: rename detection, path-history, and subtree |
||||||
|
management. I'm writing this library for [octopus], which will |
||||||
|
eventually host my monorepo. |
||||||
|
|
||||||
|
In `supergit` the main workflow is around iterating things, seeing as |
||||||
|
a git history is a directed acyclic graph, and iterators are a decent way to view this |
||||||
|
data structure. But git graphs can get pretty big. I wanted the |
||||||
|
iterator to be configurable in a way that allows someone to write a |
||||||
|
tool that searches a whole repository history, while also making it |
||||||
|
possible to step through a history 20 commits at a time (to implement |
||||||
|
history pagination on a website, for example). |
||||||
|
|
||||||
|
Looking at the current API, this is how you would implement the |
||||||
|
latter, for a `main` branch: |
||||||
|
|
||||||
|
```rust |
||||||
|
use supergit::Repository; |
||||||
|
|
||||||
|
fn main() { |
||||||
|
let path = ... // get your repository path somehow |
||||||
|
let repo = Repository::open(path).unwrap(); |
||||||
|
|
||||||
|
let main = repo.get_branch("main").unwrap(); |
||||||
|
let iter = main.get(20); |
||||||
|
|
||||||
|
iter.for_each(|c| { |
||||||
|
println!("{}: {}", c.commit().id_str(), c.commit().summary()); |
||||||
|
}); |
||||||
|
} |
||||||
|
``` |
||||||
|
|
||||||
|
That's easy enough, right? But wait, why am I calling `.commit()` on |
||||||
|
`c`? Isn't it already a commit? Well... sort of. In `supergit`, this |
||||||
|
type is a `BranchCommit`, because this is where things get |
||||||
|
complicated. |
||||||
|
|
||||||
|
|
||||||
|
## Sort of like a tree, but not really |
||||||
|
|
||||||
|
In git, rarely is a branch just a history of single commits. Maybe |
||||||
|
this is how some people think about their history, but it certainly |
||||||
|
has never been the case for any of the repositories that I work on. |
||||||
|
Basically the second you have more than one contributor, it's very |
||||||
|
common for a history to have merge-commits in it. |
||||||
|
|
||||||
|
So how do we deal with that in an iterator? The design I chose was to |
||||||
|
wrap a `Commit` object in another type, which can convey this state. |
||||||
|
`BranchCommit` is an enum and has three variants: `Commit` (maybe I |
||||||
|
should rename that to `Simple` or something?), `Merge`, and `Octopus` |
||||||
|
(if you don't know what an octopus merge is, don't worry about it. |
||||||
|
Most people don't and they're very rare and weird). |
||||||
|
|
||||||
|
What `Merge` and `Octopus` contain are new `Branch` handles (the type |
||||||
|
returned by `get_branch()`), meaning that for every split it's now up |
||||||
|
to the user to decide whether they want to continue first-parent |
||||||
|
(i.e. only ever follow the main branch line, ignoring the history of |
||||||
|
merged branches), or if they want to enumerate the histories as well. |
||||||
|
Most importantly: for every branch merge, you get to re-decide what |
||||||
|
your iterator strategy should be: infinite, limited by number, or |
||||||
|
limited up to a certain commit-hash. |
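
A minimal sketch of what such a wrapper could look like. The `Commit`, `Branch`, and `IterMode` definitions here are hypothetical stand-ins made up for illustration, not the actual `supergit` types:

```rust
// Hypothetical stand-ins, not supergit's real definitions.
#[derive(Debug, Clone, PartialEq)]
struct Commit {
    id: String,
    summary: String,
}

/// A handle from which a new iterator could be started.
#[derive(Debug, Clone, PartialEq)]
struct Branch {
    head: String,
}

/// The three iterator strategies mentioned above (names are assumptions).
#[derive(Debug, Clone, PartialEq)]
enum IterMode {
    Infinite,
    Limit(usize),
    Until(String),
}

#[derive(Debug, Clone, PartialEq)]
enum BranchCommit {
    /// A commit with a single parent
    Commit(Commit),
    /// A merge of one other branch
    Merge(Commit, Branch),
    /// A merge of several branches at once
    Octopus(Commit, Vec<Branch>),
}

impl BranchCommit {
    /// The underlying commit, whatever the variant.
    fn commit(&self) -> &Commit {
        match self {
            BranchCommit::Commit(c)
            | BranchCommit::Merge(c, _)
            | BranchCommit::Octopus(c, _) => c,
        }
    }

    /// Handles for the branches merged in by this commit.
    fn merged(&self) -> Vec<&Branch> {
        match self {
            BranchCommit::Commit(_) => Vec::new(),
            BranchCommit::Merge(_, b) => vec![b],
            BranchCommit::Octopus(_, bs) => bs.iter().collect(),
        }
    }
}

fn main() {
    let merge = BranchCommit::Merge(
        Commit { id: "c3".into(), summary: "Merge branch 'feature'".into() },
        Branch { head: "f2".into() },
    );
    println!("{}: {}", merge.commit().id, merge.commit().summary);

    // At every merge the caller picks a strategy for the merged branch.
    for branch in merge.merged() {
        let mode = IterMode::Limit(20);
        println!("would walk {} with {:?}", branch.head, mode);
    }
}
```

The point of the enum is that the decision stays with the caller: a tool that only wants first-parent history can ignore `merged()` entirely.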
||||||
|
|
||||||
|
So far so good, I thought; this is an okay enough interface for me to |
||||||
|
work with. But this is where some problems appeared. |
||||||
|
|
||||||
|
|
||||||
|
## File histories (and git internals) |
||||||
|
|
||||||
|
*(a slight detour through git - feel free to skip)* |
||||||
|
|
||||||
|
The main reason why I'm writing this more Rustic wrapper around |
||||||
|
`libgit2` is to make it easier to determine what the history of a file |
||||||
|
has been. This is pretty simple to find out via the git CLI (`git log -- |
||||||
|
<your file here>`), but not something that `libgit2` exposes, because |
||||||
|
that's not how git stores data. |
||||||
|
|
||||||
|
To git, all data is stored in a key-value store indexed by a SHA-1 |
||||||
|
(soon to be SHA-256 I think?) hash reference. That applies to files, |
||||||
|
full file trees, and commits as well. Say we have a file `acab.txt`, |
||||||
|
we commit it and it gets the ID |
||||||
|
`da39a3ee5e6b4b0d3255bfef95601890afd80709` (the file ID, not the |
||||||
|
commit ID!), but then we open it and write `ACAB` in it, and commit |
||||||
|
that again. Now the file ID is |
||||||
|
`99f069b8a0cbe4c9485a14fe50775d0f71deb4e7`. Both these files are |
||||||
|
saved in the git object store, because after all you might want to go |
||||||
|
back to the older version. |
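
To make the "key-value store indexed by a hash" idea concrete, here is a toy content-addressed store. It is only a sketch: `ObjectStore` is a made-up type, and it uses the standard library's `DefaultHasher` as a stand-in for SHA-1, so the IDs it produces are not git object IDs (git hashes a small header plus the content with SHA-1):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy object store: the key is derived from the content itself.
#[derive(Default)]
struct ObjectStore {
    objects: HashMap<u64, Vec<u8>>,
}

impl ObjectStore {
    /// Store `content` and return its ID. Storing the same bytes twice
    /// yields the same ID and does not create a second entry.
    fn put(&mut self, content: &[u8]) -> u64 {
        let mut hasher = DefaultHasher::new(); // stand-in for SHA-1
        content.hash(&mut hasher);
        let id = hasher.finish();
        self.objects.entry(id).or_insert_with(|| content.to_vec());
        id
    }

    /// Look up content by ID.
    fn get(&self, id: u64) -> Option<&[u8]> {
        self.objects.get(&id).map(|v| v.as_slice())
    }
}

fn main() {
    let mut store = ObjectStore::default();

    // First version of `acab.txt`: empty. Second version: "ACAB".
    let old_id = store.put(b"");
    let new_id = store.put(b"ACAB");

    // Both versions stay in the store under different keys.
    println!("old: {:016x} -> {:?}", old_id, store.get(old_id));
    println!("new: {:016x} -> {:?}", new_id, store.get(new_id));
}
```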
||||||
|
|
||||||
|
But here's the thing: from the actual commits we can get two things: |
||||||
|
the file tree at the time of commit, and the commit parents. To |
||||||
|
figure out what actually _changed_ in the commit, you have to diff it |
||||||
|
against its parents, which is exactly what `git show` does if you |
||||||
|
give it a reference to a commit. |
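
As a rough illustration of "diff it against its parents", the sketch below models a tree as a flat map from path to file ID and compares two of them. The `Tree` alias and the `changed_paths` helper are assumptions for this example only; real git trees are nested and a real diff carries much more information:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified tree: path -> file (blob) ID.
type Tree = BTreeMap<String, String>;

/// Paths that were added, removed, or point at a different file ID.
fn changed_paths(parent: &Tree, child: &Tree) -> Vec<String> {
    // Union of all paths known to either side, in sorted order.
    let all: BTreeSet<&String> = parent.keys().chain(child.keys()).collect();
    all.into_iter()
        .filter(|path| parent.get(*path) != child.get(*path))
        .cloned()
        .collect()
}

/// Small helper to build a tree from string pairs.
fn tree(entries: &[(&str, &str)]) -> Tree {
    entries
        .iter()
        .map(|(path, id)| (path.to_string(), id.to_string()))
        .collect()
}

fn main() {
    let parent = tree(&[
        ("README", "aaaa"),
        ("acab.txt", "da39a3ee5e6b4b0d3255bfef95601890afd80709"),
    ]);
    let child = tree(&[
        ("README", "aaaa"),
        ("acab.txt", "99f069b8a0cbe4c9485a14fe50775d0f71deb4e7"),
    ]);

    // Only `acab.txt` has a different ID between the two trees.
    for path in changed_paths(&parent, &child) {
        println!("changed: {}", path);
    }
}
```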
||||||
|
|
||||||
|
What this means is that if you want to have a library that grabs the |
||||||
|
history of a path, well you'll have to go through all commits, and |
||||||
|
check the tree for changes at that specific path. Furthermore, that |
||||||
|
won't actually let you know if a file has simply been renamed, only |
||||||
|
that it has changed. Further logic is required to figure out if the |
||||||
|
file is the same, but just has a different name. |
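
The walk described above can be sketched on a simplified model. Everything here is an assumption for illustration: history is a plain slice ordered newest first (so each commit has exactly one parent), trees are flat path-to-ID maps, and a rename is only recognised when the file ID is identical under the old name. Git's own rename detection works on content similarity and is more forgiving:

```rust
use std::collections::BTreeMap;

/// Simplified tree: path -> file (blob) ID.
type Tree = BTreeMap<String, String>;

struct Commit {
    id: String,
    tree: Tree,
}

/// Returns (commit ID, path at that commit) for every commit that touched
/// the file, following exact-match renames backwards through history.
/// `history` is ordered newest first; the parent of entry i is entry i + 1.
fn file_history(history: &[Commit], path: &str) -> Vec<(String, String)> {
    let mut tracked = path.to_string();
    let mut touched = Vec::new();

    for (i, commit) in history.iter().enumerate() {
        // If the tracked path does not exist here, there is nothing to follow.
        let blob = match commit.tree.get(&tracked) {
            Some(blob) => blob,
            None => break,
        };
        let parent = match history.get(i + 1) {
            Some(parent) => parent,
            None => {
                // Root commit: the file was introduced here.
                touched.push((commit.id.clone(), tracked.clone()));
                break;
            }
        };
        let old = parent.tree.get(&tracked);
        if old == Some(blob) {
            continue; // unchanged in this commit
        }
        touched.push((commit.id.clone(), tracked.clone()));
        if old.is_none() {
            // Added or renamed: look for the same ID under a path that
            // disappeared in this commit.
            let old_name = parent
                .tree
                .iter()
                .find(|(name, id)| *id == blob && !commit.tree.contains_key(*name))
                .map(|(name, _)| name.clone());
            match old_name {
                Some(name) => tracked = name,
                None => break, // genuinely new file, history ends here
            }
        }
    }
    touched
}

fn commit(id: &str, entries: &[(&str, &str)]) -> Commit {
    let tree = entries.iter().map(|(p, b)| (p.to_string(), b.to_string())).collect();
    Commit { id: id.to_string(), tree }
}

fn main() {
    let history = vec![
        commit("c4", &[("new.txt", "b2"), ("other.txt", "x1")]), // unrelated change
        commit("c3", &[("new.txt", "b2")]),                      // content changed
        commit("c2", &[("new.txt", "b1")]),                      // renamed from old.txt
        commit("c1", &[("old.txt", "b1")]),                      // file introduced
    ];
    for (id, path) in file_history(&history, "new.txt") {
        println!("{} touched {}", id, path);
    }
}
```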
||||||
|
|
||||||
|
And all of this is something that `supergit` implements, behind a nice |
||||||
|
Rustic API (I hope...). |
||||||
|
|
||||||
|
|
||||||
|
## Bloated abstractions |
||||||
|
|
||||||
|
So I wrote a function that would, for a branch iterator, step along it |
||||||
|
and check the history of a path, by diffing each commit with its |
||||||
|
parents, and tracking a path via the delta information in the diff. |
||||||
|
But this is where I ran into problems, because my iterator design |
||||||
|
always chose the first parent to step through. Other branches were |
||||||
|
ignored, and because the function accepted an iterator and stepped it |
||||||
|
internally, there was no way for my `file_history()` function to |
||||||
|
figure out the exact behaviour the user wanted. |
||||||
|
|
||||||
|
My first instinct was to implement branching in the `BranchIter` |
||||||
|
itself; allowing it to branch off, essentially pushing commits it |
||||||
|
would have to get back to onto a stack, and resuming from a previous |
||||||
|
position. That turned out to be a really [bad idea][badidea]. |
||||||
|
|
||||||
|
It took me about an hour of banging my head against this abstraction |
||||||
|
before I realised that it wasn't meant to be. Sometimes systems are |
||||||
|
self-contained, and adding more functionality takes a considerable |
||||||
|
amount of effort, which raises the question of whether it's really the |
||||||
|
right choice to make. Why add more functionality to an abstraction that |
||||||
|
works fine on its own? |
||||||
|
|
||||||
|
Instead, embrace composition, and add another layer on top, that can |
||||||
|
use the previous. You end up with a much more manageable design, and |
||||||
|
data can flow from one layer to the next. Make sure that your |
||||||
|
interfaces are flexible enough to be re-used, but don't think that |
||||||
|
just because a component _could_ technically be responsible for some |
||||||
|
work, that it really has to implement this work. |
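
Here is a small sketch of that layering on a toy commit graph. None of this is `supergit` code: `Graph`, `FirstParent`, and `full_history` are hypothetical names. The lower layer stays a plain first-parent iterator, and the upper layer owns the stack of branches still to be visited:

```rust
use std::collections::{HashMap, HashSet};

/// Toy commit graph: commit ID -> parent IDs (first parent first).
struct Graph {
    parents: HashMap<&'static str, Vec<&'static str>>,
}

/// Lower layer: only ever follows the first parent. It knows nothing
/// about merged branches and does not need to.
struct FirstParent<'g> {
    graph: &'g Graph,
    cursor: Option<&'static str>,
}

impl<'g> Iterator for FirstParent<'g> {
    type Item = &'static str;

    fn next(&mut self) -> Option<Self::Item> {
        let current = self.cursor?;
        self.cursor = self
            .graph
            .parents
            .get(current)
            .and_then(|ps| ps.first().copied());
        Some(current)
    }
}

/// Upper layer: composes first-parent walks to cover merged branches too.
fn full_history(graph: &Graph, head: &'static str) -> Vec<&'static str> {
    let mut pending = vec![head]; // branch heads still to be walked
    let mut seen = HashSet::new();
    let mut out = Vec::new();

    while let Some(start) = pending.pop() {
        for id in (FirstParent { graph, cursor: Some(start) }) {
            if !seen.insert(id) {
                break; // this line of history was already walked
            }
            out.push(id);
            // Remember every non-first parent for a later walk.
            if let Some(ps) = graph.parents.get(id) {
                pending.extend(ps.iter().skip(1).copied());
            }
        }
    }
    out
}

/// a <- b <- m <- d on the main line; a <- f1 <- f2 merged in at m.
fn sample_graph() -> Graph {
    let mut parents = HashMap::new();
    parents.insert("d", vec!["m"]);
    parents.insert("m", vec!["b", "f2"]);
    parents.insert("b", vec!["a"]);
    parents.insert("a", vec![]);
    parents.insert("f2", vec!["f1"]);
    parents.insert("f1", vec!["a"]);
    Graph { parents }
}

fn main() {
    let graph = sample_graph();
    let main_line: Vec<_> = FirstParent { graph: &graph, cursor: Some("d") }.collect();
    println!("first-parent: {:?}", main_line);
    println!("full history: {:?}", full_history(&graph, "d"));
}
```

The iterator never grows a stack of its own; a caller that wants different behaviour at merges writes a different upper layer instead of changing the lower one.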
||||||
|
|
||||||
|
And that's it basically. Thanks for reading my ramblings about git |
||||||
|
and one of my side-projects. I hope I managed to make you think about |
||||||
|
the way you build systems a bit, and maybe next time you are in a |
||||||
|
situation similar to this one, don't be like me :) |