Cleaning Up a Bloated Git Repository

For a number of years, I have run this site with a static site generator, with the content managed in git.

One decision I punted on while I was trying to figure out how to get the darned thing working was how to handle static assets (images, mostly). To “get it working,” I simply committed the assets to the repository.

Fast-forward to the present: I was unable to deploy my site because the image copying was exceeding the processing limits on my web server.

I needed to trim the footprint of my git repository. Ultimately I reduced the footprint by 100x. Here are the tools I used to handle the problem.

How Bad Is It?

I wasn’t sure how to measure how bloated my repository was.

The answer is: git count-objects -vH [1]

The relevant output was:

size: 710 MiB

Yikes. But when I looked in the actual content directory I saw:

$ du -h content/posts
7.0M    content/posts

Clearly there was some bloat somewhere. I needed to find my biggest files and get them out.
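A minimal sketch of that comparison as a single helper, assuming it is run from inside the repository. The function name repo_footprint is made up for illustration. In the output of git count-objects, size covers loose objects and size-pack covers packed ones.

```shell
#!/bin/sh
# Hypothetical helper: show what git is storing versus what is on disk.
repo_footprint() {
    # "size" is loose (unpacked) objects, "size-pack" is packed objects
    git count-objects -vH | grep -E '^(size|size-pack):'
    # On-disk size of the git directory itself, for comparison
    du -sh "$(git rev-parse --git-dir)"
}
```

Running repo_footprint next to du -h content/posts makes the gap between the history and the checked-out content easy to see.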

Finding the Culprits

To find the biggest baddies in the repo, I found this command [2], which I repurposed to tell me which files were naughty.

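#!/bin/sh
# Print "size path" for every blob in history, smallest first.

# The inner pipeline builds one sed expression per blob that swaps the
# blob's hash for its size. The outer pipeline applies those expressions
# to the "hash path" lines that rev-list prints.
git rev-list --all --objects | \
    sed -n $(git rev-list --objects --all | \
    cut -f1 -d' ' | \
    git cat-file --batch-check | \
    grep blob | \
    sort -n -k 3 | \
    while read hash type size; do
         # printf is more portable than echo -n under /bin/sh
         printf '%s ' "-e" "s/$hash/$size/p"
    done) | \
    sort -n -k1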
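The script above passes one sed expression per blob on the command line, which may hit argument-length limits on a very large repository. A sketch of a one-pass alternative, assuming a git whose cat-file --batch-check accepts a format string; the function name list_blob_sizes is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print "size path" for every blob in history,
# smallest first, without building a long sed command line.
list_blob_sizes() {
    git rev-list --objects --all |
        # %(rest) carries the path that rev-list printed after each hash
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |   # keep blobs, drop commits and trees
        cut -d' ' -f2- |       # drop the type column
        sort -n                # biggest files end up last
}
```

The output has the same two-column shape as the original script, so anything downstream that reads the second column keeps working.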

I saved this output to a file and then did some vim-fu to pull out the second column (the path) for every file that matched /static/images. This was my list of offenders.
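For anyone who would rather not do this step in an editor, here is a rough shell equivalent. It assumes the saved output has one "size path" pair per line; the function name list_offenders and the file name sizes.txt are made up for illustration, while fatfiles is the list name used by the cleanup loop.

```shell
#!/bin/sh
# Hypothetical helper: read "size path" lines on stdin and print the
# unique paths that live under static/images.
list_offenders() {
    # Strip only the leading size so paths containing spaces survive
    sed 's/^[0-9][0-9]* //' |
        grep 'static/images/' |
        sort -u
}
```

Usage would look like: list_offenders < sizes.txt > fatfiles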

Interlude

What I realized I should have done is:

  • Statically generate the site, creating text from the content/ directories
  • Store the static assets in the public/images directory, where they can be accessed in my local writing environment. Once a post is done, generate the site and then use rsync to move the completed posts and the static assets across to the host.
  • This means the static assets are no longer under git revision management, but with two copies of the static assets directory (local and production) and backups by the host, I don’t feel too terribly exposed. Ultimately the text is me; the images are additions, but only slight ones, for the most part.

I have adopted this architecture and it now powers this site.
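The publish step in that workflow could be sketched roughly as follows. Everything named here is a placeholder rather than this site's actual setup: the function deploy_site, the build command passed as its first argument, the remote target, and the DRY_RUN switch.

```shell
#!/bin/sh
# Hypothetical deploy helper: build the site, then copy public/ to the host.
#   $1  build command for your generator (placeholder)
#   $2  rsync destination, e.g. user@example.com:/var/www/site/ (placeholder)
deploy_site() {
    build_cmd=$1
    remote=$2
    # With DRY_RUN set, print each command instead of running it
    run=${DRY_RUN:+echo}
    $run $build_cmd || return 1
    # -a keeps permissions and timestamps, -z compresses in transit
    $run rsync -az public/ "$remote"
}
```

Setting DRY_RUN=1 first is a cheap way to check the rsync destination before anything is copied.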

Shell Cleanup

I needed to remove the files, but I also needed to remove them from the git repository itself. It’s not sufficient to git rm bad_file; git commit -am "Kill big file" and voilà, done! I needed to remove all objects that kept references to the big files as well. For this I used git filter-branch [3].

I took the given tip, git filter-branch --tree-filter 'rm -f huge_file' HEAD, adapted it to the --index-filter form, and used that as the meat of a loop to clear out all my big files:

for i in `cat fatfiles`; do echo $i; git filter-branch -f  --index-filter "git rm --force --cached --ignore-unmatch  $i" -- --all; done

This took a while, so I set it to run in the background while I walked the dog. There was a ton of output too.
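The loop rewrites the entire history once per listed file, which is probably a big part of why it took so long. A sketch of a single-pass variant that removes every listed path in one rewrite follows. It assumes one path per line with no blank lines, a list file path without single quotes, and an xargs that supports -0. The function name purge_paths is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: remove every path listed in file $1 from all of
# history in ONE filter-branch pass.
purge_paths() {
    # The filter runs from a temporary directory, so resolve the list
    # to an absolute path first.
    list=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")
    [ -s "$list" ] || { echo "empty or missing list: $1" >&2; return 1; }
    # NUL-separate the paths so names with spaces reach git rm intact.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
        "tr '\n' '\000' < '$list' | xargs -0 git rm -q --cached --ignore-unmatch --" \
        -- --all
}
```

Usage would be purge_paths fatfiles, run from the top level of a clean working tree.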

Lock in the Work

$ git reflog expire --expire=now --all
$ git gc --prune=now
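One thing to watch for: git filter-branch keeps backups of the pre-rewrite refs under refs/original/, and objects still reachable from those backups will not be pruned. A sketch that clears the backups before running the two commands above, with the made-up function name reclaim_space. Only run something like this once you are sure the rewrite is what you want, since it throws away the safety net.

```shell
#!/bin/sh
# Hypothetical helper: drop filter-branch's backup refs, expire the
# reflogs, then prune everything that is no longer reachable.
reclaim_space() {
    # Emit one "delete <ref>" instruction per backup ref
    git for-each-ref --format='delete %(refname)' refs/original |
        git update-ref --stdin &&
    git reflog expire --expire=now --all &&
    git gc --prune=now --quiet
}
```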

Check Your Results

$ git count-objects -vH
count: 9188
size: 65.35 MiB
in-pack: 106964
packs: 1
size-pack: 211.71 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

That appears to be slightly more than a 100x reduction in size!

Other Things I Wish I’d Found Earlier