Shrinking the size of a monorepo

The source code of the Allegro iOS app for buyers used to be divided into separate modules hosted in multiple repositories (a polyrepo). Updating the source code of a module in one repository could affect another module hosted in a separate repository, and versioning the modules and propagating dependency updates made releasing the entire application a long process. A few years back we therefore migrated the source code to a monorepo, along with the history of all the repos that constituted the app, and our main iOS repository became that monorepo. After 9 years of development the repo had grown enormously and the git clone command had become a nightmare, taking far too long. The migration from an on-premise to an external git hosting provider gave us an opportunity to shrink the repo.

Monorepo scale

General repo scale

The general scale of our old repository was as follows:

  • almost 9 years of history with 91k+ commits
  • 440k BLOBs were stored in the repo: multiple .png and .jpeg files, 3rd-party frameworks and toolset binaries. The unpacked size of the BLOBs would add up to 36 GB and the biggest BLOB stored was 100+ MB
  • 680k git tree objects
  • the unpacked repo size on the main development branch was 8+ GB where the .git dir size after a clone was 7+ GB. The .git directory contained compressed .png (4.5+ GB 🤯) and .pbxproj (600+ MB) files
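Figures like these can be collected with plain git commands. The following is a minimal sketch, not the exact procedure we used; the function name repo_scale is hypothetical, and it assumes it is run from inside a clone of the repository being measured.

```shell
# Print rough scale figures for the Git repository in the current directory.
# repo_scale is an illustrative helper name, not part of git.
repo_scale() {
  # Number of commits reachable from any ref.
  echo "commits: $(git rev-list --all --count)"

  # Count reachable objects by type (blob, tree, commit, tag).
  # rev-list prints "<sha> <path>", so keep only the sha for cat-file.
  git rev-list --all --objects \
    | cut -d' ' -f1 \
    | git cat-file --batch-check='%(objecttype)' \
    | sort | uniq -c

  # On-disk size of loose objects and packfiles, human readable.
  git count-objects -vH
}
```

Calling repo_scale before and after a rewrite gives a quick like-for-like comparison of commit, blob and tree counts.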

New repo scale

After the migration and history rewrite we shrank the repo size to:

  • 71k+ commits
  • 230k BLOBs that add up to 1.6 GB unpacked - this number includes the size of source code files, whereas assets and binaries were migrated to an external storage
  • 455k+ git tree objects

How did we do it?

The history-rewrite process required proper planning and a few steps:

  1. Analysis of the old repo contents and its history
  2. Creating a reproducible procedure for the history rewrite
  3. Dry running the procedure to test out the process
  4. Planning and scheduling activities necessary to migrate the repo
  5. Proper communication about the process to stakeholders (a.k.a. developers 👩‍💻👨‍💻)
  6. The actual migration

The repo analysis

Goals:

  • find items that can be removed from the history
  • select items that can be migrated to an external storage

We used tools such as git-sizer and git-filter-repo to get information about the types of files stored in the repository. If you want to do the same, the workshops from GitHub Universe and the scripts introduced there might be a good starting point.
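If those tools are not at hand, a similar picture can be obtained with plain git and awk. The sketch below is an illustration under that assumption rather than the scripts we ran; blobs_by_size and size_by_extension are hypothetical helper names.

```shell
# List every blob reachable in history as "<size-in-bytes> <path>", largest first.
# Must be run inside a git repository.
blobs_by_size() {
  git rev-list --all --objects \
    | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
    | sed -n 's/^blob //p' \
    | sort -k1,1nr
}

# Read "<size> <path>" lines from stdin and print the total size per
# file extension as "<total-bytes> <extension>", largest total first.
size_by_extension() {
  awk '{
    size = $1
    path = substr($0, length($1) + 2)   # everything after "<size> "
    n = split(path, parts, "/")
    file = parts[n]                     # last path component
    if (match(file, /\.[^.]+$/)) ext = substr(file, RSTART)
    else ext = "(none)"
    total[ext] += size
  }
  END { for (e in total) printf "%.0f %s\n", total[e], e }' \
    | sort -k1,1nr
}
```

Piping one into the other (blobs_by_size | size_by_extension | head) shows which file types dominate the history, which is the kind of input needed to decide what to remove and what to move to external storage.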

From the analysis we were able to select the following items for complete removal from the history:

  • deleted dirs and files
  • unwanted paths: e.g. Pods/, invalid symlinks causing deep nesting of paths
  • unwanted history of paths: e.g. Vendor for storing 3rd party dependencies or Toolset with binaries
  • unwanted files: e.g. .pbxproj files, which can be generated by XcodeGen so their history is meaningless (600+ MB savings in our case), and the history of BLOB files such as .jpg, .png, .a, .dylib, .pdf, .zip, .mp4, .json
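As an illustration of the first item, the list of deleted dirs and files can be derived from the history itself. This is a sketch of one possible approach, not necessarily how our own list was built; deleted_paths is a hypothetical helper name, and it reports paths that were deleted at some point and no longer exist on HEAD.

```shell
# Print paths that were deleted somewhere in history and are absent from HEAD.
# Must be run inside a git repository. deleted_paths is an illustrative name.
deleted_paths() {
  current=$(mktemp)

  # Paths that exist on HEAD right now.
  git -c core.quotepath=off ls-tree -r --name-only HEAD | sort -u > "$current"

  # Paths touched by a deletion in any commit, minus the ones that still exist
  # (a file may have been deleted and later re-added).
  git -c core.quotepath=off log --all --diff-filter=D --name-only --format= \
    | sed '/^$/d' | sort -u \
    | comm -23 - "$current"

  rm -f "$current"
}
```

The output could be reviewed by hand and saved to a file for later use with a path-based history filter. Old names of renamed files also show up here, so removing them drops the pre-rename history of those files.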

We decided to track BLOBs using Git LFS (Large File Storage). In our case the following were a good use case for it:

  • large binary files .jpg, .png, .a, .dylib, .pdf, .zip, .mp4, .json
  • framework binaries
  • toolset binaries

Reproducible procedure and dry runs

We created a script containing all the commands that removed redundant items from the history. To remove items we used git filter-repo - it's much more performant than git's built-in git filter-branch (do not use it!). Some examples of usage:

git filter-repo --invert-paths --path Pods/ --force
git filter-repo --invert-paths --paths-from-file remove.txt --force
git filter-repo --invert-paths --path-glob '*.pbxproj' --force

After the removal we restored the most recent versions of binaries, frameworks and BLOBs to the repo and tracked them with Git LFS:

git lfs track "*.png"
git lfs track "Vendor/SomeSDK/SomeSDK.framework/SomeSDK"
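Each git lfs track call records its pattern in .gitattributes, e.g. *.png filter=lfs diff=lfs merge=lfs -text. A quick way to confirm that a given path will go through LFS is to ask git which filter applies to it. The helper below is a minimal sketch; is_lfs_tracked is a hypothetical name and the paths in the usage note are made up for illustration.

```shell
# Succeed (exit status 0) if git would route the given path through the LFS filter.
# Must be run inside a git repository; the path does not need to exist.
is_lfs_tracked() {
  # check-attr prints "<path>: filter: <value>"; keep only the value.
  [ "$(git check-attr filter -- "$1" | sed 's/.*: filter: //')" = "lfs" ]
}
```

For example, is_lfs_tracked Assets/icon.png && echo "tracked by LFS" can be run against each restored asset before committing it, so that no large file slips back into regular git storage.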

We ran the script a few times to verify the output size of the repo. One crucial step after each run was to verify that all plans on the CI (Continuous Integration) passed - we did it to check that the app still compiled, the tests passed and no files other than the ones we had intended were deleted.
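Besides CI, the last check can also be done directly by comparing the files tracked on HEAD in the old clone and in the rewritten one. The following is a sketch under the assumption that both clones are available locally; missing_after_rewrite is a hypothetical helper name.

```shell
# Print paths tracked on HEAD of the old clone that are missing on HEAD
# of the rewritten clone.
# Usage: missing_after_rewrite <old-repo-dir> <new-repo-dir>
missing_after_rewrite() {
  old_list=$(mktemp)
  new_list=$(mktemp)

  git -C "$1" -c core.quotepath=off ls-tree -r --name-only HEAD | sort > "$old_list"
  git -C "$2" -c core.quotepath=off ls-tree -r --name-only HEAD | sort > "$new_list"

  # Lines present only in the first list are paths that disappeared.
  comm -23 "$old_list" "$new_list"

  rm -f "$old_list" "$new_list"
}
```

Paths removed on purpose (for example generated project files) will show up in the output as well, so the result is meant to be compared against the list of intended removals rather than expected to be empty.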

Communication

Communication is a crucial aspect of introducing any change. It's good to prepare it in advance and have team members review it. We used a few channels so that our devs would get important info about the migration and history rewrite through the channel that suited their working habits best (e-mails, instant messaging tool, dev forums).

Some final thoughts

It took a large amount of time to prepare the migration, understand the history of the repository and select proper items and strategies for the migration. The links here might be a good starting point if you want to rewrite the histories of your overweight repos:

When creating a plan for the rewrite, prepare a checklist that you can use to verify outcomes and keep track of all the steps involved in the process. If you also need to migrate the repository from one hosting provider to another, do it together with the history rewrite. Plan the rewrite for a time when folks are unlikely to push code to the repo. We chose a Friday evening, and yes, we had to fix some issues over the weekend - not everything went smoothly.

You can keep a copy of your old repository in read-only mode on the servers - it will serve as a backup and will contain the original, unrewritten history.
