Extracting Part of a Repository with git filter-branch

I recently wanted to extract a small webapp from this site into its own project. This is nice for modularity and allows me to release it as open source. (Previously I had people sending me emails with recommended CSS changes.)

For no particular reason I thought it would be nice to preserve the commit history of this project when I moved it to a separate repo. I knew that git filter-branch existed, but it is generally intended for removing files from repos. In this case I wanted to remove most of the files and rename the rest, but after a bit of fiddling it turned out to be easy to do.

Code

First the filter script.

#! /bin/zsh
# extract.zsh

set -e

git ls-files -s | \
	sed -nE \
		-e 's#\tsource/a/js/ga.js#\tsrc/ga.js#p' \
		-e 's#\tsource/a/\w+/playerone#\tsrc/playerone#p' \
		-e 's#\tsource/games/playerone/#\tsrc/#p' | \
	GIT_INDEX_FILE="$GIT_INDEX_FILE.new" git update-index --index-info

if [ -e "$GIT_INDEX_FILE.new" ]; then
	mv "$GIT_INDEX_FILE"{.new,}
else
	rm "$GIT_INDEX_FILE"
fi

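Now the command to use it. The filter command is the argument to --index-filter, so --prune-empty has to come before it. The script is given by absolute path because git filter-branch runs its filters from a temporary directory.

git filter-branch --prune-empty --index-filter "$PWD/extract.zsh"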

Explanation

Nothing here is too complicated. I’m using an --index-filter because it is much faster than checking out the files for each revision. Since I don’t need to modify the contents of any files, manipulating the index is enough. --prune-empty drops commits that end up empty after filtering: those that don’t touch the target files, including everything from before I started working on Player One.
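To see the index filter and --prune-empty working together, here is a small sketch that builds a throwaway repository and rewrites it. It is written in POSIX sh rather than zsh, keeps only one of the rewrite rules, and the repository layout, commit messages and the extract.sh name are made up for the demo. FILTER_BRANCH_SQUELCH_WARNING is git's own switch for silencing the deprecation notice that recent versions print.

```shell
#!/bin/sh
# Throwaway demo: three commits, only the second touches the target path.
set -e
work=$(mktemp -d)

# A cut-down POSIX sh variant of the filter, keeping one rewrite rule.
cat > "$work/extract.sh" <<'EOF'
#!/bin/sh
set -e
tab=$(printf '\t')
git ls-files -s |
	sed -n -e "s#${tab}source/games/playerone/#${tab}src/#p" |
	GIT_INDEX_FILE="$GIT_INDEX_FILE.new" git update-index --index-info
if [ -e "$GIT_INDEX_FILE.new" ]; then
	mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"
else
	rm "$GIT_INDEX_FILE"
fi
EOF
chmod +x "$work/extract.sh"

git init -q "$work/repo"
cd "$work/repo"
git config user.name "Example"
git config user.email "example@example.invalid"

echo one > README
git add README && git commit -q -m "unrelated commit"
mkdir -p source/games/playerone
echo app > source/games/playerone/app.js
git add source && git commit -q -m "add player one"
echo two > README
git commit -q -a -m "another unrelated commit"

# --index-filter takes the command as its argument, so other options go first.
FILTER_BRANCH_SQUELCH_WARNING=1 \
	git filter-branch --prune-empty --index-filter "$work/extract.sh" >/dev/null 2>&1

commit_count=$(git rev-list --count HEAD)
files=$(git ls-files)
printf 'commits: %s, files: %s\n' "$commit_count" "$files"
```

If this behaves as expected, the two README-only commits disappear and the remaining commit contains just the renamed file.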

git ls-files -s

This lists the files at the current commit. It includes their mode, blob, stage number and path. This is basically a dump of the index. We will manipulate this dump to generate our output index.

% git ls-files -s
100644 000ad978d081f8bf3ab6f432972075b7b9ac90e7 0	src/a/main.scss
100644 7d9fde7f8751f8f33f9c97d2de790ae2959621c8 0	src/a/main.ts
100644 384fa688174968cfea9e2f90a2288e3ddd49bb46 0	src/a/sw.ts
100644 c6b43c8ef6d3952d17ce03d04ce5dbf104807996 0	src/index.pug
100644 74799dee4222c3f65a4c431c2df878ddfdceeada 0	src/logo.svg
100644 aabc917cd6200de4b485ebfc26b46caef5bd54dd 0	src/manifest.json
100644 30cb0ee2f80f82a38d21fbda83a24dab58bb7d9d 0	src/service-worker.ts
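The detail that matters later is that the metadata and the path are separated by a single tab, while the metadata fields themselves are separated by spaces. A minimal sketch in POSIX sh, using one line from the listing above, shows how the fields fall apart:

```shell
#!/bin/sh
# Split one `git ls-files -s` line into its fields.
# Format: "<mode> <blob> <stage><TAB><path>"
tab=$(printf '\t')
line="100644 74799dee4222c3f65a4c431c2df878ddfdceeada 0${tab}src/logo.svg"

# Everything before the first tab is metadata, everything after is the path.
meta=${line%%"$tab"*}
path=${line#*"$tab"}

# The metadata itself is space separated.
read -r mode blob stage <<EOF
$meta
EOF

printf 'mode=%s stage=%s path=%s\n' "$mode" "$stage" "$path"
```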

sed

This is where the interesting stuff happens. We are extracting the interesting files and moving them to their new locations. Notice that we use the tab character as an anchor to select and change the file names.

In this case I am using -n to avoid printing non-matching lines and the p flag to print the matching lines. If you were just doing a rename without dropping other files you would remove these.

sed -nE \
	-e 's#\tsource/a/js/ga.js#\tsrc/ga.js#p' \
	-e 's#\tsource/a/\w+/playerone#\tsrc/playerone#p' \
	-e 's#\tsource/games/playerone/#\tsrc/#p'
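The filter can be tried without touching a repository by feeding it a hand-written dump. The sketch below is POSIX sh, the blob hashes and the main.ts and readme.txt paths are invented, and it uses a literal tab from printf because \t inside a sed pattern is not guaranteed to work outside GNU sed.

```shell
#!/bin/sh
# Feed a hand-written index dump through two of the rewrite rules.
# Blob hashes and the main.ts / readme.txt paths are made up.
tab=$(printf '\t')
dump=$(printf '%s\n' \
	"100644 1111111111111111111111111111111111111111 0${tab}source/a/js/ga.js" \
	"100644 2222222222222222222222222222222222222222 0${tab}source/games/playerone/main.ts" \
	"100644 3333333333333333333333333333333333333333 0${tab}source/other/readme.txt")

# -n suppresses the default output; the p flag prints only the lines where
# a substitution happened, so unmatched paths are dropped from the result.
filtered=$(printf '%s\n' "$dump" | sed -n \
	-e "s#${tab}source/a/js/ga.js#${tab}src/ga.js#p" \
	-e "s#${tab}source/games/playerone/#${tab}src/#p")

printf '%s\n' "$filtered"
```

Only the ga.js and main.ts entries should come out, now under src/, while readme.txt is dropped.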

git update-index

We then use git update-index to write our modified index to a temporary file. This needs to be a file that doesn’t exist yet, because if we point it at the existing index file git update-index will merge the new entries into it instead of replacing it.

GIT_INDEX_FILE="$GIT_INDEX_FILE.new" git update-index --index-info
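The difference between the two targets is easy to check in a scratch repository. This is a sketch in POSIX sh; the repository and the keep.txt and drop.txt file names are throwaway values chosen for the demo.

```shell
#!/bin/sh
# Compare loading a filtered dump into a fresh index file versus the
# existing one. The repo and the file names are throwaway.
set -e
work=$(mktemp -d)
cd "$work"
git init -q .
echo keep > keep.txt
echo drop > drop.txt
git add keep.txt drop.txt

index="$work/.git/index"

# Fresh file: only the entries we feed in end up in the new index.
git ls-files -s | grep 'keep' |
	GIT_INDEX_FILE="$index.new" git update-index --index-info
fresh=$(GIT_INDEX_FILE="$index.new" git ls-files | paste -s -d ' ' -)

# Existing file: the fed entries are merged in, so drop.txt is still there.
git ls-files -s | grep 'keep' |
	GIT_INDEX_FILE="$index" git update-index --index-info
merged=$(GIT_INDEX_FILE="$index" git ls-files | paste -s -d ' ' -)

printf 'fresh:  %s\n' "$fresh"
printf 'merged: %s\n' "$merged"
```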

Stage the modified index

All that is left is replacing the original index for this commit with our modified index.

Since I only added these files part-way through the history of the repository, we need to handle the case where there are no matching files. In that case $GIT_INDEX_FILE.new will not have been created and we must remove the index completely. Otherwise we just replace the previous index file with the new one.

if [ -e "$GIT_INDEX_FILE.new" ]; then
	mv "$GIT_INDEX_FILE"{.new,}
else
	rm "$GIT_INDEX_FILE"
fi
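The {.new,} brace expansion is a zsh (and bash) convenience that plain POSIX sh lacks. A sketch of the same logic without it, exercised on ordinary files standing in for the index, could look like this. swap_index is a hypothetical helper name, and it uses rm -f rather than rm so a missing file is not an error.

```shell
#!/bin/sh
# swap_index is a hypothetical helper; it mirrors the if/else above
# without relying on brace expansion.
swap_index() {
	index_file=$1
	if [ -e "$index_file.new" ]; then
		mv "$index_file.new" "$index_file"   # files matched: install the new index
	else
		rm -f "$index_file"                  # nothing matched: leave no index
	fi
}

# Exercise both branches with ordinary files standing in for the index.
work=$(mktemp -d)
echo old > "$work/a"; echo new > "$work/a.new"
echo old > "$work/b"
swap_index "$work/a"
swap_index "$work/b"

a_content=$(cat "$work/a")
if [ -e "$work/b" ]; then b_exists=yes; else b_exists=no; fi
printf 'a=%s b_exists=%s\n' "$a_content" "$b_exists"
```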

Conclusion

After that I have a repository with only the desired files, and only the commits that affected those files. It’s worth doing a quick look through the commits to ensure that you don’t have anything (especially commit messages) that shouldn’t be in the history of the new repository.