Make your repository lean and clean!

Sometimes, large files are needlessly committed to your repository, but from a long-term perspective it is best practice to keep your repository small. Let’s dive in to see what can be done to reduce its size.

Repositories grow over time as new code is added. Sometimes, however, large files that should not be part of the project are committed as well. Keeping these files in your code base is not best practice, as it forces everyone else to download and work with an unnecessarily large repository.

Most common cases of unnecessary files:

  • Large media files like videos and photos - Use a CDN, AWS S3, or any media hosting service rather than the repository to store these. Alternatively, you can use Git LFS.
  • Generated files (e.g. files transpiled by Webpack) and binaries - Generating these resources in your Continuous Integration pipeline keeps your process repeatable and reliable without taking up space in your project.
  • 3rd-party libraries (jQuery, Laravel) - Use package managers like npm, pip, Composer and others to avoid storing local copies of popular libraries in your code base. These package managers also help you keep track of all the necessary updates.
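
If generated or vendored directories are already tracked, one possible first step is to stop tracking them and ignore them from now on. The sketch below is only an illustration: the helper name untrack_and_ignore and the example paths dist/ and node_modules/ are assumptions, not something your project necessarily uses.

```shell
# Stop tracking a path but keep it on disk, and ignore it from now on.
# Usage: untrack_and_ignore <path>   (run from the repository root)
untrack_and_ignore() {
  target="$1"
  # Remove the path from the index only; the working copy stays untouched.
  git rm -r --cached --quiet --ignore-unmatch -- "$target"
  # Add the ignore rule once, so the path is not committed again.
  touch .gitignore
  grep -qxF -- "$target" .gitignore || printf '%s\n' "$target" >> .gitignore
}
```

You could call it as `untrack_and_ignore dist/` and `untrack_and_ignore node_modules/`, then commit the result. Note that this only affects future commits; the files stay in the existing history.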

Because Git keeps the whole history of changes, simply deleting these files from the repository is not enough: Git still stores every file that was ever committed. As a result, your repository may still be too big to analyze. Codeac limits repository size to 200 MB to keep its performance as high as possible.
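
To see which files are actually taking up space, including ones deleted long ago, you can ask Git for the largest blobs in its history. This is a rough sketch built from standard Git plumbing commands; the function name largest_blobs is just an illustrative choice.

```shell
# List the N largest blobs reachable from any ref, including files that were
# deleted in later commits. Output columns: size in bytes, object id, path.
# Usage: largest_blobs [N]   (run inside the repository; N defaults to 10)
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob"' |
    cut -d' ' -f2- |
    sort -k1,1 -n -r |
    head -n "${1:-10}"
}
```

Running `git count-objects -vH` next to it gives a quick overview of how much disk space the object database uses in total.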

There are many ways to reduce the size of your repository and clean its history. Let's dive into the two main ones:


BFG Repo-Cleaner

The BFG Repo-Cleaner is a simple and fast tool written in Scala for cleaning bad data out of your Git repository history. Here is how to get started:

Clone a fresh copy of your repository first:

git clone --mirror https://github.com/tony/my-big-project.git

The --mirror flag will clone a bare repository, so your normal files won't be visible. However, it is a full copy of the Git database of your repo, and at this point you should make a backup of it to ensure you don't lose anything.
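
One simple way to make that backup is to archive the whole mirror directory. This is a minimal sketch, assuming tar is available; the helper name backup_mirror and the archive naming scheme are made up for illustration.

```shell
# Archive a bare mirror clone into a timestamped tarball next to it
# and print the name of the archive.
# Usage: backup_mirror <repo-dir>   e.g. backup_mirror my-big-project.git
backup_mirror() {
  repo="${1%/}"
  archive="${repo}-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  tar -czf "$archive" -C "$(dirname "$repo")" "$(basename "$repo")" &&
    printf '%s\n' "$archive"
}
```

If anything goes wrong later, unpacking the archive with `tar -xzf` restores the mirror exactly as it was cloned.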

Run the BFG to clean your repository up:

java -jar bfg.jar --strip-blobs-bigger-than 10M my-big-project.git

The BFG will update your commits and all branches and tags so they are clean, but it doesn't physically delete the unwanted data. Make sure your history has been updated, and then use the standard git gc command to strip out the unwanted dirty data, which Git will now recognise as surplus to requirements:

Strip the unwanted data:

cd my-big-project.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
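
Before pushing, it can be reassuring to check how large the cleaned repository is now. The sketch below sums the sizes reported by git count-objects and compares them with a limit. This post does not say how Codeac measures its 200 MB, so treat the result as an approximation; the function name check_repo_size is hypothetical.

```shell
# Report the on-disk size of the object database in MiB and compare it
# with a limit. Returns non-zero when the repository is over the limit.
# Usage: check_repo_size [limit_mib]   (run inside the repository; default 200)
check_repo_size() {
  limit="${1:-200}"
  # count-objects -v reports "size" (loose objects) and "size-pack" in KiB.
  kib=$(git count-objects -v |
    awk '$1 == "size:" || $1 == "size-pack:" { total += $2 } END { print total + 0 }')
  # Round up to whole MiB.
  mib=$(( (kib + 1023) / 1024 ))
  echo "object database: ${mib} MiB (limit ${limit} MiB)"
  [ "$mib" -le "$limit" ]
}
```

Running it once before the cleanup and once after git gc shows how much space the rewrite actually saved.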

Once you're happy with the updated state of your repo, push it back up. Note that the command below force-pushes every branch (refs/heads/*) and overwrites the branches on your remote server.

  Rewriting repository history is a destructive operation. Make sure to have your repository backed up.

Push the changes back to the origin:

git push origin --force 'refs/heads/*'

At this point, you're ready for everyone to ditch their old copies of the repository and do fresh clones of the nice, new pristine data. It's best to delete all old clones, as they'll have a dirty history that you don't want to risk pushing back into your newly cleaned repo.

Using git-filter-repo

An alternative way to clean large files from your repository is to use git-filter-repo. It is a versatile tool for altering Git history. Install git-filter-repo using a supported package manager or from source.

Install the tool using pip for example:

pip3 install git-filter-repo

Clone a fresh copy of the repository using --bare and --mirror.

Clone your repository locally:

git clone --bare --mirror https://github.com/tony/my-big-project.git

This is a full copy of the Git database of your repo, and at this point you should make a backup of it to ensure you don't lose anything.

Purge any large files from the history of your repository.

cd my-big-project.git
git filter-repo --strip-blobs-bigger-than 10M
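
One thing to watch for, which this post does not cover: git-filter-repo's documentation describes removing the origin remote after a rewrite as a safety measure. If you find that origin is gone, a small helper like the one below can put it back. The name ensure_origin is hypothetical, and the URL you pass should be your own repository's.

```shell
# Re-add the "origin" remote if it is missing; leave it alone if it exists.
# Usage: ensure_origin <url>   (run inside the rewritten repository)
ensure_origin() {
  if git remote get-url origin >/dev/null 2>&1; then
    echo "origin already set to $(git remote get-url origin)"
  else
    git remote add origin "$1"
  fi
}
```

For the example in this post, that would be `ensure_origin https://github.com/tony/my-big-project.git`.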

  Rewriting repository history is a destructive operation. Make sure to have your repository backed up.

Force push your changes to overwrite all branches:

git push origin --force 'refs/heads/*'

Once large files have been removed, it is a best practice for everyone using the repository to make a new clone; otherwise, if someone does a force push, they will push the large files again and you’ll be back to where you started.

For more information, see the GitHub documentation (“Removing files from a repository’s history” and “Removing sensitive data from a repository”), the Bitbucket documentation (“Reduce repository size”), and the GitLab documentation (“Reduce repository size”).
