Cleaning Up a Bloated Git Repository
For a number of years, this site has been built with a static site generator, with its content managed in git version control.
One decision I punted on while I was trying to figure out how to get the darned thing working was how to handle static assets (images, mostly). To “get it working,” I decided to commit the assets to the repository.
Fast-forward to the present: I was unable to deploy my site because the image copying was exceeding the processing limits on my web server.
I needed to trim the footprint of my git repository. Ultimately I reduced it by roughly an order of magnitude. Here are the tools I used to handle the problem.
How Bad Is It?
I wasn’t sure how to measure how bloated the repository was.
The command for that [1] is: git count-objects -vH
The relevant output was:
size: 710 MiB
Yikes. But when I looked in the actual content directory I saw:
$ du -h content/posts
7.0M content/posts
Clearly there was some bloat somewhere. I needed to find my biggest files and get them out.
Finding the Culprits
To find the biggest baddies in the repo, I found this script [2], which I repurposed to tell me which files were naughty.
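#!/bin/sh
# List every blob in the repository as "<size> <path>", smallest first.
# The inner pipeline builds one sed expression per blob that swaps the
# blob's hash for its size; the outer rev-list supplies "<hash> <path>".
git rev-list --all --objects | \
    sed -n $(git rev-list --objects --all | \
        cut -f1 -d' ' | \
        git cat-file --batch-check | \
        grep blob | \
        sort -n -k 3 | \
        while read hash type size; do
            # printf is used because "echo -n" is not portable under /bin/sh
            printf '%s ' "-e s/$hash/$size/p"
        done) | \
    sort -n -k1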
I saved this output to a file and then did some vim-fu to extract the second column for every file that matched /static/images. This was my list of offenders.
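The vim-fu step could also be scripted. The sketch below assumes the script above was saved as find-big-blobs.sh (a hypothetical name) and that its output lines look like "<size> <path>". The function name list_offenders is likewise made up for illustration. It keeps only the paths containing static/images and prints each one once, ready to be saved as the fatfiles list used later.

```shell
#!/bin/sh
# Read "<size> <path>" lines on stdin and print each matching path once.
# The pattern defaults to static/images, the directory named in this post.
list_offenders() {
    pattern="${1:-static/images}"
    # Drop the leading size column, keep paths containing the pattern,
    # then sort and de-duplicate (a file appears once per stored version).
    sed -n 's/^[0-9][0-9]* //p' | grep -F -- "$pattern" | sort -u
}

# Usage (find-big-blobs.sh is a hypothetical name for the script above):
#   ./find-big-blobs.sh | list_offenders > fatfiles
```

De-duplicating matters because a file that was committed several times shows up once per stored version.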
Interlude
What I realized I should have done is:
- Statically generate the site, creating text from the content/ directories.
- Store the static assets in the public/images directory. They could then be accessed in my local writing environment. Once the post is done, generate the site and then use rsync to move the completed posts and the static assets across to the host.
- This means the static assets stop being in git revision management, but with two copies of the static assets directory (local and production) and backups by the host, I don’t feel too terribly exposed. Ultimately the text is the me; the images are additions, but only slight ones, for the most part.
I have adopted this architecture and it now powers this site.
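A minimal sketch of that deploy step, assuming the generator writes its output to public/ and using user@example.com:/var/www/site/ as a stand-in for the real host and path (both are placeholders, as is the function name deploy_site). The site-generation command is left out because it depends on the generator.

```shell
#!/bin/sh
# Copy the generated site, including public/images, to the web host.
# Both defaults below are placeholders, not the real host or path.
deploy_site() {
    src="${1:-public/}"
    dest="${2:-user@example.com:/var/www/site/}"
    # -a is archive mode (recursive, preserves permissions and timestamps);
    # -v lists the files as they are transferred.
    # The trailing slash on src copies the directory's contents rather
    # than the directory itself.
    rsync -av "$src" "$dest"
}

# Usage, after generating the site:
#   deploy_site public/ user@example.com:/var/www/site/
```

Because rsync only transfers files that changed, the images are copied once and then skipped on later deploys.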
Shell Cleanup
I needed to remove the files, but I also needed to remove them from the git repository itself. It’s not sufficient to run git rm bad_file; git commit -am "Kill big file" and voilà, done! I needed to remove every object that kept a reference to the big files as well. For this I used git-filter-branch [3]:
I took the given tip, git filter-branch --tree-filter 'rm -f huge_file' HEAD, and adapted it (switching to --index-filter, which does not check out each commit) as the meat of a loop to clear out all my big files:
for i in $(cat fatfiles); do
    echo "$i"
    git filter-branch -f --index-filter "git rm --force --cached --ignore-unmatch $i" -- --all
done
This took a while, so I set it to run in the background while I walked the dog. There was a ton of output too.
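Part of the wait comes from rewriting the entire history once per file. A possible alternative, sketched here on the assumption that fatfiles holds one path per line with no whitespace inside the paths, is to remove every listed path in a single rewrite. The function name purge_paths is hypothetical. The Git documentation now recommends git filter-repo over filter-branch for this kind of job, so treat this as an illustration rather than the recommended tool.

```shell
#!/bin/sh
# Remove every path listed in a file from all of history in one rewrite.
# Assumes one path per line and no whitespace inside the paths.
purge_paths() {
    # filter-branch runs the filter from a temporary directory, so the
    # list has to be referenced by an absolute path.
    list="$(cd "$(dirname "$1")" && pwd)/$(basename "$1")"
    git filter-branch -f --index-filter \
        "xargs git rm --cached --ignore-unmatch --quiet -- < '$list'" \
        -- --all
}

# Usage, from the top of the repository:
#   purge_paths fatfiles
```

xargs hands all the paths to a single git rm call per commit, so each commit is visited once no matter how many files are listed.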
Lock in the Work
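Rewriting history does not free any space by itself: the old objects stay reachable through the reflog until it is expired and the repository is garbage-collected.
$ git reflog expire --expire=now --all
$ git gc --prune=now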
Check Your Results
$ git count-objects -vH
count: 9188
size: 65.35 MiB
in-pack: 106964
packs: 1
size-pack: 211.71 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
Comparing the size lines, that is 710 MiB down to 65.35 MiB: roughly an 11x reduction!
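Beyond the size numbers, it can be worth confirming that the offending paths are really gone. The sketch below uses a hypothetical helper, path_in_history, that asks git log whether any commit on a branch or tag still touches a path. It deliberately looks only at branches and tags, so any backup refs that filter-branch leaves under refs/original are not consulted.

```shell
#!/bin/sh
# Succeed if any commit reachable from a branch or tag touches the path.
path_in_history() {
    [ -n "$(git log --branches --tags --format=%h -- "$1")" ]
}

# Usage: report any listed file that survived the rewrite.
#   while read -r f; do
#       path_in_history "$f" && echo "still present: $f"
#   done < fatfiles
```

No output from the usage loop means none of the listed files appear in any branch or tag history.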