How to handle big repositories with Git
Git is a fantastic choice for tracking the evolution of your code base and for collaborating efficiently with your peers. But what happens when the repository you want to track is really huge?
In this post I'll try to give you some ideas and techniques to deal properly with the different categories of huge.
Two categories of big repositories
If you think about it, repositories grow massive for two broad reasons:
- They accumulate a very long history (the project grows over a long period of time and the baggage accumulates).
- They include huge binary assets that need to be tracked and paired with the code.
- Both of the above.
So a repository can grow in two orthogonal directions: the size of the working directory (i.e. the latest commit) and the size of the entire accumulated history.
Sometimes a further problem arises: old, obsolete binary artifacts are still stored in the repository. There is a relatively simple, if tedious, way to fix that; see below.
The techniques and workarounds for the two scenarios above differ, and sometimes complement each other, so I'll describe them separately.
Handling repositories with very long history
The bar for calling a repository massive is pretty high: the latest Linux kernel, for example, clocks in at 15+ million lines of code, yet people seem happy to peruse it in full. Even so, very old projects that have to be kept intact for regulatory or legal reasons can become a pain to clone. (To be transparent, the Linux kernel is split into a historical repository and a more recent one, and requires a simple grafting setup to give access to the full unified history.)
Simple solution is a shallow clone
The first solution for a fast clone, and for saving developers' and systems' time and disk space, is to perform a shallow clone using git. A shallow clone allows you to clone a repository keeping only the latest n commits of history.
How do you do it? Just use the --depth option, for example:
git clone --depth <depth> <remote-url>
Imagine you have accumulated ten or more years of project history in your repository (for Jira, for example, we migrated an 11-year-old code base to git): the time savings can add up and be very noticeable.
The full clone of Jira is 677 MB, with the working directory adding another 320+ MB, and holds more than 47,000 commits. In a quick check on the Jira checkout, a shallow clone took 29.5 seconds, compared with 4 minutes 24 seconds for a full clone with all the history. The gap also grows with the number of binary assets your project has swallowed over time. In any case, build systems can profit greatly from this technique too.
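The effect is easy to reproduce on a small scale. The sketch below is only an illustration: it builds a throwaway five-commit repository in a temporary directory (the directory names and the demo identity are made up for the example) and then clones it with --depth 1. A file:// URL is used because git ignores --depth for plain local-path clones.

```shell
set -e
tmp=$(mktemp -d)

# Build a throwaway repository with five commits (illustrative data only).
git init -q "$tmp/origin"
for i in 1 2 3 4 5; do
  echo "change $i" > "$tmp/origin/file.txt"
  git -C "$tmp/origin" add file.txt
  git -C "$tmp/origin" -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "commit $i"
done

# Shallow clone: keep only the most recent commit of history.
git clone -q --depth 1 "file://$tmp/origin" "$tmp/shallow"

# Compare how much history each copy can see.
full_count=$(git -C "$tmp/origin" rev-list --count HEAD)
shallow_count=$(git -C "$tmp/shallow" rev-list --count HEAD)
echo "full history: $full_count commits; shallow clone: $shallow_count"
```

On a real project the interesting number is the wall-clock time of the clone, which you can measure by prefixing the clone command with time.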
Recent git has improved support for shallow clones
Shallow clones used to be somewhat impaired citizens of the git world, as some operations were barely supported. But recent versions (1.9+) have improved the situation greatly, and you can now properly push to repositories even from a shallow clone.
Partial solution is filter-branch
For huge repositories that have big binary cruft committed by mistake, or old assets that are no longer needed, a great solution is to use filter-branch. The command lets you walk through the entire history of the project, filtering out, massaging, modifying, or skipping files according to predefined patterns. It is a very powerful tool in your git arsenal. There are already helper scripts available to identify big objects in your git repository, so that should be easy enough.
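As a rough idea of what such a helper script can look like, here is a minimal sketch built only from stock git plumbing plus awk and sort. The function name largest_blobs and the demo repository are hypothetical; the pipeline lists every blob reachable from any ref and sorts by size.

```shell
set -e

# Hypothetical helper: print the largest blobs reachable from any ref,
# biggest first, as "blob <sha> <size-in-bytes> <path>".
# $1 = repository path, $2 = how many lines to show (default 10).
largest_blobs() {
  git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |
    sort -k3,3nr |
    head -n "${2:-10}"
}

# Throwaway repository with one small and one large file (illustrative data).
repo=$(mktemp -d)
git init -q "$repo"
echo "hello" > "$repo/small.txt"
dd if=/dev/zero of="$repo/big.bin" bs=1024 count=100 2>/dev/null
git -C "$repo" add small.txt big.bin
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "Add files"

largest_blobs "$repo" 5
```

Because the listing walks all of history, it also surfaces files that were deleted long ago but still weigh down every clone.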
Sample usage of filter-branch (the path is relative to the root of the repository):
git filter-branch --tree-filter 'rm -rf path/to/spurious/asset/folder' HEAD
filter-branch has one important drawback: once you use it, you effectively rewrite the entire history of your project, and all commit ids change. This requires every developer to re-clone the updated repository.
So if you're planning a cleanup using filter-branch, you should alert your team, plan a short freeze while the operation is carried out, and then notify everyone that they should clone the repository again.
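To make the "all commit ids change" point concrete, here is a small sketch on a throwaway repository (file names and the demo identity are invented for the example). It records the commit id, strips an assets folder with a tree filter, and records the id again. The FILTER_BRANCH_SQUELCH_WARNING variable only silences the warning and pause that recent git versions print before running filter-branch.

```shell
set -e
repo=$(mktemp -d)

# Throwaway repository: one source file and one unwanted asset folder.
git init -q "$repo"
mkdir "$repo/assets"
echo "int main(void) { return 0; }" > "$repo/main.c"
echo "pretend this is a huge binary" > "$repo/assets/huge.bin"
git -C "$repo" add main.c assets
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "Add code and assets"

before=$(git -C "$repo" rev-parse HEAD)

# Rewrite history, removing the assets folder from every commit on HEAD.
FILTER_BRANCH_SQUELCH_WARNING=1 git -C "$repo" \
  -c user.name=demo -c user.email=demo@example.com \
  filter-branch --tree-filter 'rm -rf assets' HEAD >/dev/null 2>&1

after=$(git -C "$repo" rev-parse HEAD)
echo "commit id before: $before"
echo "commit id after:  $after"
```

The two ids differ even though the commit message and author are unchanged, which is why existing clones can no longer be reconciled with the rewritten repository.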
Alternative to shallow clone: clone only one branch
Since git 1.7.10 (April 2012), you can also limit the amount of history you clone by cloning a single branch, like this:
git clone URL --branch branch_name --single-branch [folder]
This specific hack can be useful when you have long-running, divergent branches, or a lot of branches. If you only have a handful of branches with a limited number of differences, you probably won't see a big difference when using it.
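A small sketch of the single-branch clone, using a throwaway repository whose main and feature branches diverge (branch names, commit messages, and the commit helper are all invented for the example). The clone should end up with only the feature branch and the commits reachable from it.

```shell
set -e
tmp=$(mktemp -d)

# Hypothetical helper: create an empty commit with a demo identity.
commit() {
  git -C "$1" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "$2"
}

# Origin with two diverging branches: main (A, B, C) and feature (A, F).
git init -q "$tmp/origin"
git -C "$tmp/origin" symbolic-ref HEAD refs/heads/main  # fix the default branch name
commit "$tmp/origin" "A (shared base)"
git -C "$tmp/origin" checkout -q -b feature
commit "$tmp/origin" "F (feature only)"
git -C "$tmp/origin" checkout -q main
commit "$tmp/origin" "B (main only)"
commit "$tmp/origin" "C (main only)"

# Clone only the feature branch.
git clone -q --branch feature --single-branch "file://$tmp/origin" "$tmp/clone"

# Only origin/feature is tracked, and only its history was fetched.
git -C "$tmp/clone" branch -r
clone_commits=$(git -C "$tmp/clone" rev-list --count --all)
echo "commits in single-branch clone: $clone_commits"
```

Here the savings are two commits out of four; on a repository with many long-lived branches the difference is proportionally larger.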
Handling repositories with huge binary assets
The second category of big repositories is made up of code bases that have to track huge binary assets. Gaming teams have to juggle huge 3D models, web development teams might need to track raw image assets, and CAD teams might need to manipulate and track the status of binary deliverables. So there are different categories of software team that run into this issue with git.
Git is not especially bad at handling binary assets, but it's not especially good either. By default git will compress and store all subsequent full versions of the binary assets, which is obviously not optimal if you have many.
There are some basic tweaks that improve the situation, like running the garbage collection git gc, or tweaking the usage of delta commits for some binary types in .gitattributes.
But it's important to reflect on the nature of your project's binary assets, as the winning approach may vary. For example, here are a few points to check (thanks to Stefan Saasen for the remarks):
- For binary files that change significantly, and not just in some metadata headers, delta compression is probably going to be useless, so the suggestion is to turn delta off for those files to avoid the unnecessary delta compression work as part of the repack.
- In the scenario above it's likely that those files don't zlib-compress very well either, so you could turn compression off with core.loosecompression 0. That's a global setting that would negatively affect all the non-binary files that actually compress well, so the suggestion makes sense if you split the binary assets into a separate repository.
- It's important to remember that git gc turns the "duplicated" loose objects into a single pack file, but again, unless the files compress in some way, that probably won't make any significant difference to the resulting pack file.
- Explore the tuning of core.bigFileThreshold. Anything larger than 512 MiB won't be delta compressed anyway, without having to set .gitattributes, so maybe that's something worth tweaking.
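The tweaks listed above could be applied along these lines. This is a sketch under stated assumptions, not a recommendation: the *.psd pattern, the 100m threshold, and the idea that this is a dedicated assets repository are all chosen purely for illustration.

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"

# Assumption for illustration: *.psd files change wholesale between
# versions, so skip delta compression for them.
echo '*.psd -delta' >> "$repo/.gitattributes"

# Assumption for illustration: this is a dedicated assets repository, so
# turning off zlib compression of loose objects hurts no source files.
git -C "$repo" config core.looseCompression 0

# Example value only: store anything over 100 MiB without attempting deltas
# (the default threshold is 512 MiB).
git -C "$repo" config core.bigFileThreshold 100m

# Check how git resolves the delta attribute for a sample path.
delta_attr=$(git -C "$repo" check-attr delta -- artwork/hero.psd)
echo "$delta_attr"
git -C "$repo" config --get core.bigFileThreshold
```

git check-attr is a convenient way to confirm that a .gitattributes pattern matches the files you expect before running an expensive repack.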
Technique 1: sparse checkout
A mild help for the binary assets problem is sparse checkout (available since Git 1.7.0). This technique lets you keep the working directory clean by explicitly listing which folders you want to populate. Unfortunately it does not affect the size of the overall local repository, but it can be helpful if you have a huge tree of folders.
What are the involved commands? Here's an example (credit):
- Clone the full repository once:
git clone <repository-address>
- Activate the feature:
git config core.sparsecheckout true
- Add the folders that are explicitly needed, ignoring the assets folders:
echo src/ > .git/info/sparse-checkout
- Read the tree as specified:
git read-tree -m -u HEAD
After the above you can go back to using your normal git commands, but your working directory will only contain the folders you specified above.
Technique 2: Use of submodules
Another way to handle huge binary asset folders is to split them into a separate repository and pull the assets into your main project using submodules. This gives you a way to control when you update the assets. See more on submodules in these posts: core concept and tips and alternatives.
If you go the submodules way, you might want to check out the complexities of handling project dependencies, since some of the possible approaches to the huge binaries problem might be helped by the approaches I mention there.
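As a rough sketch of the submodule layout, the example below creates a throwaway assets repository and attaches it to a throwaway project under an assets folder (all names are invented for the example). The -c protocol.file.allow=always option is only there because recent git versions refuse to clone submodules over the local file transport by default; with a repository hosted on a server you would not need it.

```shell
set -e
tmp=$(mktemp -d)

# Separate repository holding the binary assets (illustrative content).
git init -q "$tmp/assets"
echo "pretend this is a large texture" > "$tmp/assets/texture.bin"
git -C "$tmp/assets" add texture.bin
git -C "$tmp/assets" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "Add texture"

# Main project: pull the assets in as a submodule under ./assets.
git init -q "$tmp/project"
git -C "$tmp/project" -c protocol.file.allow=always \
  submodule --quiet add "file://$tmp/assets" assets
git -C "$tmp/project" -c user.name=demo -c user.email=demo@example.com \
  commit -q -m "Track assets as a submodule"

# The project records a pointer to one specific assets commit.
git -C "$tmp/project" submodule status
cat "$tmp/project/.gitmodules"
```

The main repository stores only the submodule URL and the pinned commit id, so the assets are updated only when someone deliberately moves that pointer.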
Technique 3: Use git-annex or git-bigfiles
A third option for handling binary assets with git is to rely on an apt third-party extension.
The first one I mention is git-annex, which allows managing binary files with git without checking the file contents into the repository.
git-annex saves the files in a special key-value store, and only symbolic links are then checked into git and versioned like regular files. It is straightforward to use and the examples are self-explanatory.
The second one is git-bigfiles, a git fork that hopes to make life bearable for people using Git on projects hosting very large files.
[UPDATE] …or you can skip all that and use Git LFS
If you work with large files on a regular basis, the best solution might be to take advantage of the large file support (LFS) Atlassian co-developed with GitHub in 2015.
Git LFS is an extension that stores pointers (naturally!) to large files in your repository, instead of storing the files themselves in there. The actual files are stored on a remote server. As you can imagine, this dramatically reduces the time it takes to clone your repo.
Bitbucket supports Git LFS, as does GitHub. So chances are, you already have access to this technology. It’s especially helpful for teams that include designers, videographers, musicians, or CAD users.
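Day to day, Git LFS is driven through .gitattributes. The sketch below assumes *.psd as the pattern to track, purely for illustration. With the git-lfs extension installed you would run the two commands shown in the comments; the attribute line is written by hand here only so the example runs without the extension.

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"

# With git-lfs installed, you would normally run inside the repository:
#   git lfs install
#   git lfs track "*.psd"
# The track command records the pattern in .gitattributes. The line below is
# what it writes, added manually so this sketch works without git-lfs.
echo '*.psd filter=lfs diff=lfs merge=lfs -text' >> "$repo/.gitattributes"

# Confirm that matching files would be routed through the LFS filter.
lfs_attr=$(git -C "$repo" check-attr filter -- design/mockup.psd)
echo "$lfs_attr"
```

Remember to commit .gitattributes itself, so that everyone who clones the repository gets the same tracking rules.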
Don't give up the fantastic capabilities of git just because you have a huge repository history or huge assets. There are workable solutions to both problems.
Follow me @durdn for more DVCS rocking.