May 28 2018

Static linking vs dyld3

This article has two parts. The first describes how we improved the launch time of the Allegro iOS app by adopting static linking, and sums up the results with a speedup analysis. The second describes how I managed to launch a custom macOS app using the not-yet-fully-released dyld3 dynamic linker, and also ends with a launch speedup analysis.

Improving iOS app launch time #

Launching a mobile app takes time, especially on a device with a low-power mobile CPU. Apple suggests 400 ms as a good launch time. iOS plays a zoom animation during app launch, which gives the app a window in which to perform its CPU-intensive startup tasks. Ideally, the whole launch process on iOS should be complete by the time the app opening animation ends.

Apple engineers described some techniques to improve launch times in WWDC 2016 - Session 406: Optimizing App Startup Time. This wasn’t enough, so the very next year they announced a brand new dynamic linker in WWDC 2017 - Session 413: App Startup Time: Past, Present, and Future. Looking at the history of dyld, one can see that Apple is constantly trying to make their operating systems faster.

At Allegro we also try to make our apps as fast as possible. Aside from using Swift (Swift performs much better than ObjC in terms of launch time and app speed), we build our iOS apps using static linking.

Static linking #

The Allegro iOS app uses a lot of libraries. The app has a modular architecture and each module is a separate library. On top of that, the app uses many 3rd-party libraries, integrated using the CocoaPods package manager. All these libraries used to be integrated as frameworks – the standard way of distributing dylibs (dynamic libraries) in the Apple ecosystem. 57 nested frameworks is a number large enough to impact app launch time. iOS has a 20-second app launch time limit, and any app that hits that limit is instantly killed. The Allegro app was often killed on a good old iPad 2 when the device was freshly started and all caches were empty.

The dynamic linker performs a lot of disk I/O when searching for dependencies. Static linking eliminates all that dylib searching – the dependencies and the executable become one. We decided to give it a try and link at least some of our libraries statically into the main executable, thereby reducing the number of frameworks.

We wanted to do this gradually, framework by framework. We also wanted to be able to turn static linking off in case of any unexpected problems.

We decided to use a two-step approach:

1. compiling framework code into static libraries,
2. converting frameworks (dynamic library packages) to resource bundles (resource packages).

Compiling framework code as a static library #

Xcode 9 provides the MACH_O_TYPE = staticlib build setting: when it is set, the linker produces a static library. For libraries integrated through CocoaPods, we had to write a custom script in the Podfile that sets this flag only for selected external libraries during pod install (that is, during dependency installation, because CocoaPods regenerates the project structure for managed libraries on every reinstallation).

MACH_O_TYPE does a great job, but we were linking statically even before Xcode 9 was released. Although Xcode 8 had no support for static Swift linking, static linking can still be performed using libtool. In those dark times, we simply added custom build phases that ran a buildstatic script for selected libraries. This may seem like a hack, but it is really just heavy use of a well-documented toolset… and it worked flawlessly.
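The libtool step can be sketched as a small shell function. This is not the original buildstatic script. It only assumes that the target's object files have already been compiled into one directory (in an Xcode build phase this could be derived from build settings such as OBJECT_FILE_DIR_normal and CURRENT_ARCH) and archives them with Apple's libtool in -static mode. The function name and its arguments are hypothetical.

```shell
# Hypothetical helper: archive all object files from one directory into a
# static library using Apple's libtool (`-static` selects archive output).
make_static_library() {
  objects_dir="$1"   # directory containing the compiled *.o files
  output="$2"        # e.g. "$BUILT_PRODUCTS_DIR/libMyModule.a"

  # Replace the positional parameters with the list of object files.
  set -- "$objects_dir"/*.o
  if [ ! -e "$1" ]; then
    # The glob did not match anything, so there is nothing to archive.
    echo "no object files in $objects_dir" >&2
    return 1
  fi

  libtool -static -o "$output" "$@"
}
```

The resulting .a file can then be linked into the main executable like any other static library.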

That way we replaced our dynamic libraries with static libraries, but that was the easier part of the job.

Converting framework to resource bundle #

Aside from dynamic libraries, a framework can also contain resources (images, nibs, etc.). We got rid of the dynamic libraries, but we couldn't leave resource-only frameworks behind. A resource bundle is the standard way of wrapping resources in the Apple ecosystem, so we created a framework_to_bundle.sh script, which takes a *.framework and outputs a *.bundle with all the resources.

The resource-handling code was redesigned to automatically use the right resource location. The Allegro iOS app has a Bundle.resourcesBundle(forModuleName:) method, which always finds the right bundle, no matter which linking type was used.
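A conversion of this kind might look roughly like the sketch below. It is not the actual framework_to_bundle.sh. It assumes the flat iOS framework layout, where the binary sits at the framework root and is named after the framework, and it treats Headers, PrivateHeaders, Modules and _CodeSignature as code-related entries to skip. The function name is hypothetical.

```shell
# Hypothetical sketch: copy everything except code-related entries from
# Foo.framework into Foo.bundle.
framework_to_bundle() {
  framework="${1%/}"                       # e.g. build/Foo.framework
  out_dir="$2"                             # directory where Foo.bundle is created
  name="$(basename "$framework" .framework)"
  bundle="$out_dir/$name.bundle"

  mkdir -p "$bundle" || return 1
  for entry in "$framework"/*; do
    [ -e "$entry" ] || continue            # empty framework: nothing to copy
    case "$(basename "$entry")" in
      # Skip the binary itself and the directories that only matter for code.
      "$name"|Headers|PrivateHeaders|Modules|_CodeSignature) continue ;;
    esac
    cp -R "$entry" "$bundle/"
  done
}
```

Note that the copied Info.plist still describes a framework with an executable, so a real script would probably need to adjust it as well.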

Results #

The last time the Allegro iOS app launch time was measured, the app still had 31 dynamic libraries, so only roughly 45% of the libraries were linked statically, and the results were already very promising. Our static linking revolution is not complete yet; the target is 100%.

We measured launch time on different devices for two app versions: one with all libraries linked dynamically and one with 26 libraries linked statically. What measurement method did we use? A stopwatch… yes, a real stopwatch. The DYLD_PRINT_STATISTICS=1 variable can help identify why the dynamic linker is slow, but it does not measure the whole launch time. We used a stopwatch and a slow-motion camera to measure the time between tapping the app icon and the app home screen being fully visible.

Each measurement in the following table is an average of 6 samples.

	iPhone 4s	iPad 2	iPhone 5c	iPhone 5s	iPhone 7+	iPad 2 cold launch
57 dylibs app launch time [s]	7.79	7.33	7.30	3.14	2.31	11.75
31 dylibs app launch time [s]	6.62	6.08	5.39	2.75	1.75	7.27
Launch speedup [%]	15.02	17.05	26.16	12.42	24.24	38.13

The Allegro iOS app launch time decreased by about 2 seconds on the iPhone 5c, which was a significant gain. The launch time improved even more on a freshly started iPad 2: the difference was about 4.5 seconds, or about 38% of the launch time with all libraries dynamically linked.
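The speedup row of the table is the relative reduction in launch time, (before - after) / before. A minimal sketch of that arithmetic, with a hypothetical function name:

```shell
# Relative launch time reduction in percent, rounded to two decimals.
launch_speedup() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.2f\n", (before - after) / before * 100 }'
}

launch_speedup 7.30 5.39     # iPhone 5c
launch_speedup 11.75 7.27    # iPad 2 cold launch
```

For the values from the table this prints 26.16 and 38.13, matching the last row.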

Static linking pitfall #

When a library is linked statically, beware of linking it into more than one dynamic library: the static library's objects get duplicated across those dynamic libraries, which can be a serious problem. We created a check_duplicated_classes.sh script that runs as the final build phase.
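Such a check could be sketched as follows. This is not the actual check_duplicated_classes.sh. It assumes that it is enough to look at Objective-C class symbols (_OBJC_CLASS_$_Name) as reported by nm, so pure Swift types are not covered. All function names are hypothetical.

```shell
# List Objective-C classes defined in one Mach-O binary.
# `sort -u` collapses repeats coming from multiple architecture slices.
objc_classes_in() {
  nm -gU "$1" |
    awk '$NF ~ /^_OBJC_CLASS_\$_/ { sub(/^_OBJC_CLASS_\$_/, "", $NF); print $NF }' |
    sort -u
}

# Read class names (one per line, already unique per binary) on stdin and
# print every name that occurs more than once, i.e. in several binaries.
duplicated_names() {
  sort | uniq -d
}

# Fail when any class is defined in more than one of the given binaries.
check_duplicated_classes() {
  duplicates="$(for binary in "$@"; do objc_classes_in "$binary"; done | duplicated_names)"
  if [ -n "$duplicates" ]; then
    echo "error: classes defined in more than one binary:" >&2
    echo "$duplicates" >&2
    return 1
  fi
}
```

As a build phase, check_duplicated_classes would receive the main executable and the binaries of all embedded frameworks, and a non-zero status would fail the build.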

That was the only major obstacle we’ve come across.

Dyld3 #

Dyld3, the brand new dynamic linker, was announced about a year ago at WWDC 2017. At the time of writing, we are getting close to WWDC 2018 and dyld3 is still not available for 3rd-party apps. Currently only system apps use dyld3. I couldn't wait any longer; I was too curious about its real power. I decided to try launching my own app using dyld3.

Looking for dyld3 #

I wondered: What makes system apps so special that they are launched with dyld3?

First guess: LC_LOAD_DYLINKER load command points to dyld3 executable…

$ otool -l /Applications/Calculator.app/Contents/MacOS/Calculator | grep "cmd LC_LOAD_DYLINKER" -A 2
          cmd LC_LOAD_DYLINKER
      cmdsize 32
         name /usr/lib/dyld (offset 12)

That was a bad guess. Looking through the rest of load commands and all the app sections revealed nothing particular. Do system applications use dyld3 at all? Let’s try checking that using lldb debugger:

$ lldb /Applications/Calculator.app/Contents/MacOS/Calculator
(lldb) rbreak dyld3
Breakpoint 1: 887 locations.
(lldb) r
Process 92309 launched: '/Applications/Calculator.app/Contents/MacOS/Calculator' (x86_64)
Process 92309 stopped
* thread #1, stop reason = breakpoint 1.154
    frame #0: 0x00007fff72bf6296 libdyld.dylib`dyld3::AllImages::applyInterposingToDyldCache(dyld3::launch_cache::binary_format::Closure const*, dyld3::launch_cache::DynArray<dyld3::loader::ImageInfo> const&)
libdyld.dylib`dyld3::AllImages::applyInterposingToDyldCache:
->  0x7fff72bf6296 <+0>: pushq  %rbp
    0x7fff72bf6297 <+1>: movq   %rsp, %rbp
    0x7fff72bf629a <+4>: pushq  %r15
    0x7fff72bf629c <+6>: pushq  %r14
Target 0: (Calculator) stopped.

lldb hit a dyld3 symbol during the system app launch, but never during the launch of a custom app. Inspecting the backtrace and the assembly showed that /usr/lib/dyld contained both the old dyld2 and the brand new dyld3. There had to be some condition that decided which dyldX should be used.

Reading assembly code is often really hard. Fortunately, I remembered that some parts of Apple code are open source, including dyld. My local binary had LC_SOURCE_VERSION = 551.3 and the most recent dyld source available was 519.2.2. Are those versions far apart? I spent a few nights comparing the local dyld assembly with the corresponding dyld sources and didn’t see any significant difference. In fact, I had a strange feeling that the source code exactly matched the assembly – it was a perfect guide for debugging.

What did I end up with? Hidden dyld3 can be activated on macOS High Sierra using one of the following two approaches:

1. setting dyld`sEnableClosures:
   - dyld`sEnableClosures needs to be set, e.g. using lldb memory write (unfortunately, the undocumented DYLD_USE_CLOSURES=1 variable only works on Apple internal systems),
   - /usr/libexec/closured needs to be compiled from dyld sources (it needs a few modifications to compile),
   - the read invocation in callClosureDaemon needs to be fixed (I filed a bug report for this issue); for the sake of tests I fixed it with an lldb breakpoint command and a custom lldb script that invoked read in a loop until it returned 0, or
2. a dyld closure needs to be generated and saved to the dyld cache… but… what is a dyld closure?

Dyld closure #

Louis Gerbarg mentioned the concept of a dyld closure at WWDC 2017. A dyld closure contains all the information needed to launch an app. Dyld closures can be cached, so dyld can save a lot of time by simply restoring them.

Dyld sources contain dyld_closure_util – a tool that can be used to create and dump dyld closures. It looks like Apple open source can rarely be compiled on a non-Apple-internal system, because it has a lot of Apple private dependencies (e.g. Bom/Bom.h and more…). I was lucky – dyld_closure_util could be compiled with just a couple of simple modifications.

I created a macOS app just to check dyld3 in action. TestMacApp.app contained 20 frameworks, each with 1000 ObjC classes and about 1000~10000 methods. I tried creating a dyld closure for the app; its JSON representation (36.5 MB) was pretty long, almost a million lines:

$ dyld_closure_util -create_closure ~/tmp/TestMacApp.app/Contents/MacOS/TestMacApp | wc -l
  832363

The basic JSON representation of a dyld closure looks as follows:

{
  "dyld-cache-uuid": "9B095CC4-22F1-3F88-8821-8DFD979AB7AD",
  "images": [
    {
      "path": "/Users/kamil.borzym/tmp/TestMacApp.app/Contents/MacOS/TestMacApp",
      "uuid": "D5BDC1D3-D09E-36D5-96E9-E7FFA7EE955E",
      "file-inode": "0x201D8F8BC", // used to check if dyld closure is still valid
      "file-mod-time": "0x5B032E9A", // used to check if dyld closure is still valid
      "dependents": [
        {
          "path": "/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm1.framework/Versions/A/Frm1"
        },
        {
          "path": "/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm2.framework/Versions/A/Frm2"
        },
        /* ... */
      ],
      /* ... */
    },
    {
      "path": "/Users/kamil.borzym/tmp/TestMacApp.app/Contents/Frameworks/Frm1.framework/Versions/A/Frm1",
      "dependents": [ /* ... */ ]
    },
    /* ... */
  ],
  /* ... */
}

A dyld closure contains a fully resolved dylib dependency tree. That means: no more expensive dylib searching.
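The file-inode and file-mod-time fields in the JSON above suggest how a cached closure can be invalidated: if a recorded image no longer has the same inode and modification time, the closure is stale. The sketch below only illustrates that idea with stat; it is an assumption about the kind of comparison involved, not a description of dyld's actual code, and both function names are hypothetical.

```shell
# Print "<inode> <mtime>" in decimal for a file. GNU stat is tried first;
# BSD/macOS stat rejects -c, so the second form is used there.
file_identity() {
  stat -c '%i %Y' "$1" 2>/dev/null || stat -f '%i %m' "$1"
}

# Succeeds when the file still has the inode and modification time that
# were recorded as hex strings, like the values in the JSON above.
closure_image_is_current() {
  path="$1"
  recorded="$(printf '%d %d' "$2" "$3")"   # hex -> decimal
  [ "$(file_identity "$path")" = "$recorded" ]
}
```

For the main executable above, the call would be closure_image_is_current with the image path, 0x201D8F8BC and 0x5B032E9A as arguments.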

Dyld3 closure cache #

In order to measure the dyld3 launch speed gain, I had to use dyld3 activation method #2 – providing a valid dyld closure for the app. Although setting dyld`sEnableClosures creates a dyld closure during app launch, the closure is currently not cached.

The dyld sources contain the source code of the update_dyld_shared_cache tool. Unfortunately, this tool uses some Apple-private libraries, so I was not able to compile it on my system. By pure accident I found that the tool ships with every macOS High Sierra installation at /usr/bin/update_dyld_shared_cache. The man page (man update_dyld_shared_cache) was present too, which made the cache rebuild even simpler.

The update_dyld_shared_cache sources showed that it generates the dyld closure cache only for a set of predefined system apps. I could have modified the tool binary to take TestMacApp.app into account, but I ended up renaming the test app to Calculator.app and moving it to /Applications – simple, but effective.

I updated the dyld closure cache:

sudo update_dyld_shared_cache -force

and restarted my system (as instructed by man update_dyld_shared_cache). After that, my test app launched using dyld3! I verified that with lldb. Setting the DYLD_PRINT_WARNINGS=1 variable also showed that the dyld closure was not generated at launch, but taken from the dyld cache:

dyld: found closure 0x7fffef8f278c in dyld shared cache
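This check can be wrapped in a small helper. The sketch assumes that dyld keeps printing the warning in exactly the form quoted above and that the DYLD_PRINT_WARNINGS variable is not stripped for the launched binary. The function name is hypothetical.

```shell
# Succeeds when dyld reports that it found a prebuilt closure for the
# launched binary in the dyld shared cache.
launched_with_cached_closure() {
  DYLD_PRINT_WARNINGS=1 "$@" 2>&1 |
    grep -q 'dyld: found closure .* in dyld shared cache'
}
```

It takes the command to launch as its arguments, for example the path to the test app's executable inside /Applications/Calculator.app.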

Dyld3 performance #

As I wrote earlier, the test app contained 20 frameworks, each with 1000 ObjC classes and 1000~10000 methods. I also created a simple dependency network between those frameworks: the main app depended on all frameworks, the 1st framework depended on 19 frameworks, the 2nd on 18, the 3rd on 17, and so on… After launching, the app just invoked exit(0). I used the time command to measure the time between invoking the launch command and the app exiting. I didn’t use DYLD_PRINT_STATISTICS=1 because, aside from the reasons presented above, dyld3 does not even support this variable yet.

The test platform was a MacBook Pro (Retina, 13-inch, Early 2015, 3.1 GHz Intel Core i7) running macOS High Sierra 10.13.4 (17E202). Unfortunately, I didn’t have access to any significantly slower machine. Each measurement in the following tables is an average of 6 samples. Two types of launches were measured:
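The dependency network can be written down as a short generator. The sketch assumes that each framework depends on all frameworks with a higher index, which is one way to get 19, 18, 17… dependencies; the actual direction used in the test app is not stated. The Frm names follow the paths in the closure dump, and the function name is hypothetical.

```shell
# Print "FrmI -> FrmJ" for every dependency between `count` frameworks:
# framework I depends on all frameworks with a higher index.
framework_dependencies() {
  count="$1"
  i=1
  while [ "$i" -le "$count" ]; do
    j=$((i + 1))
    while [ "$j" -le "$count" ]; do
      echo "Frm$i -> Frm$j"
      j=$((j + 1))
    done
    i=$((i + 1))
  done
}
```

Calling framework_dependencies 20 lists 190 edges in total, on top of the 20 direct dependencies of the main app.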

- warm launch – without system restart,
- cold launch – system restart between each measured time sample.
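The warm launch measurement can be approximated with the sketch below. It is not the exact procedure used for the tables: it assumes bash (for the time keyword and TIMEFORMAT) and simply averages the wall-clock time of repeated runs. Cold launches need a restart between samples, so they cannot be looped like this. All function names are hypothetical.

```shell
# Requires bash: relies on the `time` keyword and the TIMEFORMAT variable.

# Average of numeric samples (one per line on stdin), three decimals.
mean() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.3f\n", sum / n }'
}

# Wall-clock seconds of a single run of the given command.
measure_once() {
  TIMEFORMAT='%R'
  { time "$@" >/dev/null 2>&1; } 2>&1
}

# Run a command `runs` times and print the average wall-clock time.
average_launch_time() {
  runs="$1"; shift
  i=0
  while [ "$i" -lt "$runs" ]; do
    measure_once "$@"
    i=$((i + 1))
  done | mean
}
```

For example, average_launch_time 6 followed by the path to the app executable averages six warm launches.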

The statically linked app always launched very fast, but I could not see any significant difference between dyld2 and dyld3 loading times.

launch type	dyld2	dyld3	static
warm	0.737s	0.726s	0.676s
cold	1.166s	1.094s	0.871s

I tried measuring app launch from a slower drive: an old USB drive (with a terribly low sequential read speed of 17.1 MB/s). Disk I/O was supposed to be the bottleneck of dyld2 loading. I faked the /Applications/Calculator.app path using ln -s /Volumes/USB/Calculator.app and regenerated the dyld cache.

The next measurements looked much better. There was no difference at warm launch, but cold launch was 20% faster with dyld3 than with dyld2. In fact, the dyld3 cold launch time was right in the middle between the dyld2 and the statically linked app launch times.

launch type	dyld2	dyld3	static
warm	0.722s	0.731s	0.679s
cold	3.687s	2.947s	2.276s

dyld3 status #

Mind that dyld3 is still under development; it has not been released for 3rd-party apps yet. I guess it is currently enabled for system apps not to make them faster, but mainly to test dyld3 stability.

Louis Gerbarg said that dyld3 had its own daemon. On macOS High Sierra there is no dyld3 daemon: closured is currently invoked by dyld3 as a command line tool with fork+execve. It does not even cache the dyld closures it creates. We will surely see a lot of changes in the near future.

Are you curious about my opinion? I think a fully working dyld3 with a closured daemon will be shipped with the next major macOS version. I think this new dyld3 version will implement an even faster in-memory closure cache. Everyone will feel a drastic app launch time improvement on all Apple platforms – a launch time much closer to that of a statically linked app than to the current dyld2 launch time. I keep my fingers crossed.
