The iOS bug chase
This article tells a story of chasing an iOS bug – a bug hidden so deep that it required many different skills and debugging on different levels to identify it. I think every native mobile app developer (not only an iOS developer) will find this text interesting. Non-mobile developers may find it an intriguing read as well.
Bugs #
A mobile device is a fully-fledged computer these days and as such it always does what it has been programmed to. If something fails, this is because a computer did things exactly as it was commanded to. We – as humans – make errors, which is part of our nature. Without errors, we would not be able to learn.
I don’t want to insist on it, Dave, but I am incapable of making an error.
HAL 9000
In a novel by Arthur C. Clarke, a potential AB-35 unit crash could be detected before it even occurred. In the real world, we diagnose bug symptoms using various tools:
- crash loggers — crashes and non-fatal errors,
- tracking tools — detecting user flow anomalies,
- remote configuration — possibility of disabling problematic features.
Often, these tools are sufficient to analyze an issue. But occasionally, they can hardly detect if anything is wrong or, in the case of a very complex problem, even the most sophisticated tools are of no help. One should be aware that even a single undetected non-fatal error can affect the experience of thousands of users. That is why here, at Allegro, we have to be very serious about all potential issues.
This article describes an iOS MapKit bug analysis and its resolution, starting with source code, through network stack, down to the assembly, and ending up in San Francisco.
The Bug #
One of our quality assurance engineers reported a strange bug. No one but him was able to reproduce it. The problem could be seen in the process of selecting a parcel machine for shopping delivery. Apparently the problem occurred on a single device only.
The
MKMapView
controller had trouble displaying a map. Internet connection was working fine
on the device. Restarting or reinstalling the app did not fix the problem.
After playing with the bug for some time, the issue suddenly disappeared. The
situation was terrifying. Our iOS app has tens of thousands of daily active
users. Even if the problem persisted for a small percentage of that volume, it
could prevent users from selecting a parcel machine. This would block shopping
for a large number of buyers — the last thing we really wanted. Any attempt to
break MKMapView
again failed (i.e. simulating a poor network, MKMapView
stress testing, etc.). Suddenly, when the bug re-appeared on two other devices,
we had a chance to catch it.
Strangely enough, the iOS Maps app showed the same symptoms, so did all the third party apps installed. This seemed to be a bug in iOS, but we could not ignore it anyway.
Each time you start ignoring a bug, you end up looking just like this owl.
Now seriously… Although we could not fix the bug directly in iOS, we could at least try to bypass it, so that it would no longer occur in our app. The analysis began.
The Code #
I tried the most basic level of debugging — logging an error. MapKit provides a handy delegate method:
-[MKMapViewDelegate mapViewDidFailLoadingMap:withError:]
so I implemented it with some NSLog
logging inside.
I also added a breakpoint there, so I could debug and inspect the error
in
depth using Xcode Variables View. The breakpoint was reached almost immediately.
The delegate method invocation was caused by a GEOErrorDomain
domain error.
Its userInfo
was a singleton dictionary, a single array of underlying errors
under the SimpleTileRequesterUnderlyingErrors
key. Each underlying error had
two values in its userInfo
dictionary: the first one under the HTTPStatus
key and the second one under the NSErrorFailingURLStringKey
key. It was a
clear sign that a network error was a direct cause of failure to display map
tiles.
The Man In The Middle #
When it comes to the analysis of network communication, one of the simplest, yet most powerful tools you will ever need is mitmproxy. Never heard of it? You should really check it out, as it can save you hours of debugging in the future. In this case, I only used it to intercept network requests, but mitmproxy has many more features (e.g. scripting).
I started to intercept network traffic and displayed the map to trigger its network activity. Mitmproxy showed a lot of map tile requests.
There were a lot of requests finished with 410
response code, indeed. But
wait… what? 410
?
410 Gone Indicates that the resource requested is no longer available and will not be available again. This should be used when a resource has been intentionally removed and the resource should be purged.
And guess what… Just as I finished debugging, the bug suddenly disappeared!
The map worked great again and all map tile requests finished successfully with
200
response code.
I lost the bug, but at least I had a network communication dump in mitmproxy. The only thing I could do at that point was to take a closer look at it.
At first glance, the map tile request contained nothing extraordinary. Just everything one could expect:
http://gspe19.ls.apple.com/tile.vf?flags=...&style=...&size=...&scale=...&v=...&z=...&x=...&y=...&sid=...&accessKey=...
I compared two tile requests – each corresponding to the same tile with the
same x
, y
and z
coordinates, but the former finished with the 410
code
and the latter with the 200
code. Filtering the mitmproxy flow list with the
style=1.*&z=14&x=8962&y=5377
limit filter gave rewarding results.
Only one map tile request parameter looked suspicious – that was the v
parameter. I was 99% certain that the v
stood for some kind of version
number. An analysis of a long-time request log confirmed my suspicions about
the v
— the parameter went up every few minutes while I was browsing Apple
maps.
It was great! Imagine a world without the v
parameter and a user browsing a
map region and the region being edited at the same time. That would result in
serious glitches. Map glitches are the last thing the car driver wants.
The question was: what caused v
to increment? A couple of requests happened
in between the 410
and 200
responses, just while the v
was being changed.
One request looked particularly suspicious and that was the request for
/geo_manifest/dynamic/config
. It was also the only request that retrieved some
serious data. Unfortunately, the response inspection revealed binary data with
neither 11040529
nor 0xA87711
(in any endianness). Even though the new v
value was not clearly visible in the geo_manifest
data, it could still be
present there. Anyway, it would be hard, if not impossible, to understand
binary data of an unknown format. But I still had a few more tricks to use.
The Machine Code #
MapKit.framework
was the one that should understand geo_manifest
, so the
obvious option was to look for this understanding in the framework code. The
MapKit source code is obviously not publicly available, but there were two
things that helped overcome that obstacle.
Firstly, all iOS framework dylibs can be easily accessed from either:
- iOS Simulator files — x86 and i386 dylibs:
/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/System/Library/Frameworks
- iOS Device Symbols — ARM dylibs:
~/Library/Developer/Xcode/iOS\ DeviceSupport/*/Symbols/System/Library/Frameworks
Secondly, Hopper makes decompilation nothing but pure pleasure. Hopper is such a powerful, yet simple and intuitive tool that any person, even one without knowledge of assembly or Mach-O, could easily analyze any executable. The basic Hopper usage scenario is as simple as that:
- Use “Read Executable to Disassemble” and wait until Hopper processes the binary.
- Use the symbols panel to find the method of your interest.
- Use “Show Pseudo Code of Procedure” to see selected method logic.
What method to look for in order to find a geo_manifest
trace? The obvious
choice was to filter symbols using the geomanifest
filter at first, and that
was it!
GEOResourceManifestManager
caught my eye. Unfortunately, no method for that
class was visible, only an external symbol reference
_OBJC_CLASS_$_GEOResourceManifestConfiguration
. This meant that MapKit used
another framework underneath. I listed shared libraries of MapKit dylib using
otool
:
$ otool -L /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/System/Library/Frameworks/MapKit.framework/MapKit
...
/System/Library/PrivateFrameworks/GeoServices.framework/GeoServices (compatibility version 1.0.0, current version 1151.49.0)
...
From among many other dylibs used by MapKit, GeoServices.framework
looked
like the obvious owner of GEOResourceManifestConfiguration
.
GeoServices.framework
is a private system framework, no wonder I had never
heard of it before. So I tried to inspect the GeoServices
dynamic library
using Hopper. I used GEOResourceManifestManager
as a symbols filter and
Hopper showed a bunch of GEOResourceManifestManager
methods. One of them was
the method:
-[GEOResourceManifestManager forceUpdate]
Once again, I was very lucky.
The Hack #
By pure coincidence I got another device affected by the maps problem. Having
the knowledge of GeoServices.framework
internals, I could run the debugger
and try to perform some magic.
(lldb) po [GEOResourceManifestManager class]
GEOResourceManifestManager
(lldb)
Thanks to the dynamic nature of Objective-C, lldb had no trouble finding the
GEOResourceManifestManager
class. Invoking -[GEOResourceManifestManager
forceUpdate]
was not a problem either.
(lldb) po [[GEOResourceManifestManager sharedManager] forceUpdate]
nil
(lldb)
Then, a miracle happened — mitmproxy showed a request for
/geo_manifest/dynamic/config
followed by a nice bunch of successful tile
requests.
I was so close, yet so far from the fix. This was a one-time device state fix, a symptomatic treatment, not a fix to the root cause of the problem. But I tried to do it at least to see if the whole investigation was not totally wrong.
Later, using another affected device and just to play around, I ran the test app with the following code:
- (void)viewDidLoad {
Class resourceManifestManagerClass = NSClassFromString(@"GEOResourceManifestManager");
SEL sharedManagerSelector = NSSelectorFromString(@"sharedManager");
SEL forceUpdateSelector = NSSelectorFromString(@"forceUpdate");
if (resourceManifestManagerClass && [resourceManifestManagerClass respondsToSelector:sharedManagerSelector]) {
id sharedManager = [resourceManifestManagerClass performSelector:sharedManagerSelector];
if (sharedManager && [sharedManager respondsToSelector:forceUpdateSelector]) {
[sharedManager performSelector:forceUpdateSelector];
}
}
}
Note that using a private API is a violation of the iOS Developer Program
Agreement. Any app found using a private API is rejected by Apple. Even if such
app passes the Apple review, for example by hiding selectors with simple
ROT13, the app can be unstable.
Checking for respondsToSelector:
? Still unsafe, because any private method
behavior could change anytime or cause a trap after detecting an illegal flow.
Do not ever try to release such code!
The Radar #
The investigation showed one thing — the bug was clearly in iOS, affecting the
whole system and could not be properly fixed in the app. The only thing that
could and should be done, was to file an Apple bug
report (aka radar). An external user (non-Apple
employee) can only see bug reports reported by himself and no one else, so it
may also be a good idea to file an openradar
so that everyone else can find it and see its status. This way, “the 410
MapKit” issue described above was reported as rdar://25267344
. The issue was
also described on the Apple Developer
Forum.
The Engineer #
Some time after filing the bug report, I managed to attend WWDC2016.
WWDC is full of sessions about the latest iOS topics, but the real value lies
in the labs. I visited a MapKit lab to ask what was going on with
rdar://25267344
. I met an Apple engineer and told him about the “410 MapKit
issue”. He opened the Radar on his
iPhone and searched for the bug report. As it turned out, there was a lot of
comments under the bug report – comments seen only by Apple engineers. He told
me that my bug report helped to capture a 4-year old bug regarding incorrect
410
HTTP status handling and the bug was fixed in iOS 10 beta 1.
Shortly after that conversation, I received an update regarding the bug report
— it asked me to test the issue in iOS 10 beta and to let Apple know if it
still occurred. My first thought was: “It will really be hard to test this
non-deterministic bug…”, but was it? The engineer told me the bug was about
incorrect 410
HTTP status handling, so I thought I could reproduce the 410
status codes using mitmproxy. I just had to write a simple mitmproxy script
that would change the response of every tile request by replacing the status
code with 410
.
def response(context, flow):
if flow.request.path.startswith("/tile.vf"):
flow.response.status_code = 410
flow.response.content = b""
Then, by adding the script to mitmproxy, I could test the map behavior in iOS 10 beta 2 (latest beta at that time).
Mitmproxy changed the status code of each tile request to 410
. Once the first
tile request finished with 410
status code, geod
daemon immediately updated
its manifest requesting fresh /geomanifest/dynamic/config
. It worked just as
expected! The bug was resolved!
What could go wrong? #
A bug chase is often a long and hard process. In the one I have described, luck was a big contributor to success, because – as usual – many things could have gone wrong:
- the issue could just not have occurred on our test devices at all,
- maps API could have been secured with SSL-pinning,
- Apple could have ignored the report for such an ephemeral bug,
- the investigation could have gone in a wrong direction,
- the investigation could have required jumping through a decompiled framework call hierarchy — it is often very easy to get lost there.
Summary #
A good conclusion could be these four simple pieces of advice:
- Install mitmproxy now.
- Download Hopper, play with the trial version and add Hopper Personal License to your wish list.
- File Apple bug reports — the Bug Reporter is not
/dev/null
, the whole Apple staff are just waiting for your reports. - “Stay Hungry. Stay Foolish.” + Learn internals… internals of everything — this will make the Force strong with you.
This was a happy-ending story — the bug has been resolved the right way, Apple Maps will once again work seamlessly and the Allegro iOS app will provide the best user experience. Nothing will stop our users from shopping. Or so it seems… Crashes happen and we examine each crash report very carefully. Maintaining the iOS app of the largest e-commerce business in Poland is a challenge, but developers at Allegro do their best to protect users from any obstacle to shop.