Friday, April 5, 2013

Measuring Success - Charting the Android Open Source Project

I decided to take a look at some facts and figures from the Android Open Source Project in terms of its progress over the last 4.5 years.

Of all the things we could measure in a SW project, lines of code is the most obvious (and contentious) metric. I could have simply thrown out some figures from cloc and run away from the results, but that would neither satisfy me nor do justice to the code itself. Instead, this exercise turned out to be more about the meta problem of measuring a large body of code such as the AOSP than about collecting and reporting the statistics.


REPOSITORY AND BUILD SIZES

First up - let’s take a look at the entire Android repository checkout size over time and disk space required for a build. 

For all the Android revisions evaluated here, I downloaded the source directly from the AOSP, using the last patch release made for each revision. After the checkout, I removed the .repo and .git directories recursively from the top, leaving just the contents of the repositories (otherwise, we would be measuring the Git repo size as well, which clearly keeps getting larger over the revisions!).
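For anyone wanting to repeat the measurement, here is a minimal sketch of the idea in Python. It is not the script used for the graphs, and it takes a slightly different route: rather than deleting the .repo and .git directories, it skips them while walking the tree, which leaves the checkout intact. The function name is my own.

```python
import os

# Directory names to skip; these hold repo/Git metadata rather than source.
SKIP_DIRS = {".repo", ".git"}

def tree_size_bytes(root):
    """Sum the sizes of regular files under root, ignoring SKIP_DIRS."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into metadata directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # count real files only, not symlinks
            total += os.path.getsize(path)
    return total
```

Calling `tree_size_bytes("/path/to/checkout")` returns a byte count; divide by `1024 ** 3` for GB.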



Not unexpected – as Android grew over the years, it got bigger. ~6GB for a checkout is getting a little large, but more alarming is the disk space required for a single build: 25GB for one build variant in JB 4.2 (this covers the code, intermediates and final binaries). Superimposing Kryder's Law on this, we are still within bounds, but given the latency that corporate IT departments have in following the laws of Messrs Moore and Kryder, it’s been a hard graph to stomach over the years.

For reference, building an additional variant (either a new product or something as simple as changing the operator logo for regional distribution) adds 11GB each. Putting this in context, building 6 variants for a single Android handset (fairly typical for different regions etc…) tops out at over 80GB of disk space!
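The arithmetic behind that last figure, as a small sketch. It assumes the first variant costs the full 25GB and every further variant adds a flat 11GB, which is how I read the numbers above; the function name is made up for the example.

```python
FIRST_BUILD_GB = 25    # one build variant of JB 4.2, from the text
EXTRA_VARIANT_GB = 11  # each additional variant, from the text

def build_disk_gb(variants):
    """Estimated disk space in GB for a number of build variants (>= 1)."""
    if variants < 1:
        raise ValueError("need at least one variant")
    return FIRST_BUILD_GB + (variants - 1) * EXTRA_VARIANT_GB

print(build_disk_gb(6))  # 25 + 5 * 11 = 80
```

With the rounded inputs, six variants land right at the 80GB mark.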


SYSTEM DISK AND BUILD TIMES

Whilst we are checking out the code, we may as well build the default ROM and measure how long it takes. 

A few details of the build environment:
  • Ubuntu 12.04 64bit with 4GB memory
  • Laptop was ~1.86GHz Core 2 Duo
  • 5400 rpm hard disk
When building, I always used the ARM CPU target from the lead Google Experience Device handset at the time, but only built the generic variant from the tree. This means only the bare minimum for an Android distribution was built. Differences from a normal device build include:
  • No HAL modules built (connectivity, camera, audio etc…)
  • No kernel build time included
  • CCACHE was disabled
I always ran the build with make “-j1” to restrict the make system to building a single item at a time (it can still use the dual cores / hyper threading however).

For the final ROM, I merely looked at the size of the files that were going into the userdata.img and system.img disks (not the .IMG file sizes themselves, which are fixed in size by the config).


Note that these build times are only comparable with each other and not with some theoretical super Linux box. And yes, the wife’s laptop is not a behemoth in the computing world, but (a) it is easy to swap out the drive for an Ubuntu imaged drive and (b) she went to bed at 10.

Build times are definitely ‘a problem’ with the latest JellyBean 4.2 release compared with the good old days of Donut. Much of the why is down to the introduction of Clang + LLVM in the ICS / Android 4.0.4 time frame: it's simply very expensive to build, in addition to being a large contributor to the increase in disk space needed for host side binaries. Webkit is also somewhat to blame, although it has been with us since the first Beta release and hasn't really fattened out much in the intervening years. Combined, however, these contribute ~40% of total build time.

The ROM hasn't grown as fast as the checkout or build times, but it’s getting there. What used to fit into 512MB of unmanaged NAND in 2008 (with room for applications) now takes up the best part of 800MB of eMMC in the latest Nexus 4 JellyBean 4.2 release, with at least a few GB required on top for applications these days, depending on which side of the “external SDCard” line your Android device stands. What is most interesting is that the vanilla system disk doesn't show this trend: the increase in ROM size comes not from the core framework itself, but from areas like the SoC adaptation code and larger assets being included for big screen sizes etc…


LINES OF CODE, LINES OF CODE

Yes yes yes, back to the real topic. Here is the money shot of the total lines of code contained in each Android release:


This isn't all that useful however. Firstly, there are all sorts of bits of SW mixed up in the Android repository – tool chains, the Linux kernel and even a full copy of Quake. Secondly, what types of file should we count? In the above diagram, I already filtered out comments, blank lines etc. and only tracked a subset of files in the tree that might be part of Android (no Objective-C for example!). But still, it feels wildly inaccurate given the rise in lines of code. Google would have needed an army of Android engineers to have put this in place if it was all hand crafted in house.
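To make "filtered out comments, blank lines" concrete, here is a minimal sketch of a line classifier for C-style sources (C, C++, Java). It is not the script behind the graphs, and it cuts corners that a tool like cloc does not: comment markers inside string literals are not detected, and a line holding both code and a comment counts as code. Python, shell and XML would each need their own comment rules.

```python
def count_lines(text):
    """Return (code, comment, blank) line counts for C-style source text."""
    code = comment = blank = 0
    in_block = False  # True while inside an unterminated /* ... */ comment
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            blank += 1
            continue
        has_code = False
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    i = len(line)      # comment runs past the end of the line
                else:
                    in_block = False
                    i = end + 2        # resume scanning after the "*/"
            elif line.startswith("//", i):
                break                  # rest of the line is a comment
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            else:
                if not line[i].isspace():
                    has_code = True
                i += 1
        if has_code:
            code += 1
        else:
            comment += 1
    return code, comment, blank
```

Run over every file in a category and summed, the first element of the tuple is the "lines of code" figure and the second gives a comment count for free.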

For a baseline, +Andy Rubin, father of Android and all round nice chap, was quoted as saying that Ice Cream Sandwich (Android 4.0) had “over 1 million lines of code” – this gives us a starting point at least to try and narrow down what might be considered acceptable to count.

Before we jump down the rabbit hole, there is an interesting tail off at the end to dig into. Total file count trend shows… 



…which is what we'd expect – slow growth. So it’s not that the latest release simply chopped out a bunch of files and reduced its total lines of code that way. Breaking the lines of code delta down by high level directory we see:


So the packages top level directory looks like our man… A quick dig into the results showed that the drop in ‘lines of code’ was actually the XML files in the packages/inputmethods/LatinIME/dictionary directory going away in the latest revision (it looks like they are now downloaded entirely at run time instead of a default cached version being shipped in the system image). This reinforces the point that reporting 25 million lines of “code” in Android 4.2 is clearly hard to stomach at a SW engineering level when over a million of these lines were a canned dictionary!
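The per-directory delta is simple to compute once you have a line count per top level directory for two releases. A small sketch follows; the function names are my own and the numbers are placeholders for illustration only, not measured values.

```python
def loc_delta_by_dir(old, new):
    """Return {directory: new - old} for every directory in either release."""
    dirs = set(old) | set(new)
    return {d: new.get(d, 0) - old.get(d, 0) for d in dirs}

def biggest_drop(delta):
    """Name of the directory with the most negative change."""
    return min(delta, key=delta.get)

# Placeholder counts, NOT real AOSP figures.
old_counts = {"packages": 500, "frameworks": 300, "external": 900}
new_counts = {"packages": 350, "frameworks": 320, "external": 950}

delta = loc_delta_by_dir(old_counts, new_counts)
print(biggest_drop(delta))  # packages
```

A directory that only exists in one of the two releases is treated as having zero lines in the other, so additions and removals show up as well.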

The XML problem does lead us to the question of what we should actually class as a line of code. XML dictionaries are clearly pushing the definitions of traditional SW source files, yet within Android, XML is used in abundance to describe things such as UI assets that were originally “programmatically generated”, so we can't simply ignore this file type.


ANDROID CODE COMPOSITION

To make better sense of what has gone on in the Android tree, we can classify the code into several high level categories:
  • Java (.java etc..)
  • Native code (C, C++, header files, assembly code)
  • Build and test scripts (Make, shell scripts, Python)
  • XML (Just the .xml files)
Simply looking at the code based on these classifications, we see a more interesting breakdown of the src code (although still wrong!):
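A sketch of how such a classification could look, keyed on file extension. The four category names come from the list above; the exact extension sets are my assumption for the example, not the lists used for the graphs.

```python
import os

# Assumed extension sets; the post only names the four categories.
CATEGORIES = {
    "java": {".java"},
    "native": {".c", ".cc", ".cpp", ".h", ".hpp", ".s"},
    "build": {".mk", ".sh", ".py"},
    "xml": {".xml"},
}

def classify(path):
    """Return the category name for a file path, or None if it is not counted."""
    name = os.path.basename(path)
    if name in ("Makefile", "makefile"):
        return "build"
    # Lowercase so that assembly files ending in .S match ".s".
    ext = os.path.splitext(name)[1].lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return None
```

Anything that returns `None` (Objective-C, images, prebuilt binaries and so on) simply drops out of the count.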




A few takeaways here – there is a lot of XML data in the tree, way more than Java src code and over 50% of the native code at one point in the development. It is great practice to store your resources outside of the code itself, but I hadn't quite realized how much these assets could amount to.

Ignoring the XML completely, we are still left around the 18M lines of code mark over the Java, Native and Build src code categories. What we need to focus on from here, to get a better readout of our lines of code, is classifying the Android tree into categories that account for things like toolchains and incorporated third party open source projects.


WHAT IS ANDROID vs WHAT'S IN ANDROID

To narrow down the distinction between what is written for Android itself vs what it utilizes, we can classify the contents of an Android repository into some high level buckets:

  • External projects pulled in from the upstream (webkit, GCC toolchains etc..)
  • Linux support code (“C” library, root file system, startup scripts etc…)
  • Applications (Android APKs, service providers etc...)
  • Android Framework (Dalvik, system services, main Java application framework)
  • Build and tools (device configuration, the amazing make based build system)
  • Platform development code (CTS, SDK, NDK, PDK, GDK, documentation)

The only caveat here is that the device configuration directory (device/moto etc…) and the hardware HAL libraries (hardware/broadcom etc…) are code contributed to Google via vendors for inclusion in the open source project and probably should be counted separately. But that would mean an overhaul to my bit of python that is already 10 lines past its sell by date, so we'll ignore this for now.
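In code, the bucketing boils down to a lookup on the top level directory of each file. The sketch below is hypothetical: the bucket names are from the list above and "packages", "device" and "hardware" are directories mentioned in this post, but the rest of the mapping is my guess at a sensible split, not the table behind the graphs. I have left "hardware" unmapped on purpose.

```python
# Hypothetical mapping from top level directory to bucket (see caveats above).
BUCKETS = {
    "external": "External projects",
    "bionic": "Linux support code",
    "system": "Linux support code",
    "packages": "Applications",
    "frameworks": "Android Framework",
    "dalvik": "Android Framework",
    "build": "Build and tools",
    "device": "Build and tools",
    "cts": "Platform development code",
    "sdk": "Platform development code",
    "ndk": "Platform development code",
    "pdk": "Platform development code",
}

def bucket_for(path):
    """Bucket for a path given relative to the top of the tree."""
    parts = [p for p in path.split("/") if p not in ("", ".")]
    top = parts[0] if parts else ""
    return BUCKETS.get(top, "Unclassified")

def totals_by_bucket(counts_by_path):
    """Sum per-file line counts into per-bucket totals."""
    totals = {}
    for path, lines in counts_by_path.items():
        bucket = bucket_for(path)
        totals[bucket] = totals.get(bucket, 0) + lines
    return totals
```

Keeping an explicit "Unclassified" bucket makes it obvious when a new top level directory appears in a later release and needs a home.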

This gives us another suspicious graph (still includes XML):




The external projects make up the majority of the Android code base in the latest release: ~52%. The framework comes in second, eating up 34% of the remaining lines of code.

For a last attempt to get closer to the magic Rubin/ICS 1 million lines of code milestone, we classify the tree one more time, removing XML:




So we are pretty close now in the framework category, enough that I am calling it a day. For reference, the headline lines of code figures for Android JellyBean 4.2 are:

  • Android framework: 2.78M (up from 1.6M in Donut)
  • External modules used by Android: 12.2M (up from 4.3M in Donut)
  • Build system: 34k
  • Native Linux code: 977k
  • Applications: 911k
  • Platform development code: 814k
Total: 17.8M lines of code

(Again, these do not include comments, blank lines or XML)

My takeaway is that the essence of Android (everything but the external modules) comes to 5.52 million lines of code (up from 2.3 million in Donut).
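As a quick sanity check, the headline figures can be added up directly. The numbers are copied from the list above; the dictionary keys are shortened labels of my own.

```python
# Headline JellyBean 4.2 figures from the list above, in lines of code.
JB_42 = {
    "Android framework": 2_780_000,
    "External modules": 12_200_000,
    "Build system": 34_000,
    "Native Linux code": 977_000,
    "Applications": 911_000,
    "Platform development code": 814_000,
}

total = sum(JB_42.values())
essence = total - JB_42["External modules"]

print(f"total   = {total / 1e6:.2f}M")    # total   = 17.72M
print(f"essence = {essence / 1e6:.2f}M")  # essence = 5.52M
```

The "essence" figure matches the 5.52 million quoted above. The rounded categories sum to roughly 17.7M rather than 17.8M, a gap small enough to be explained by rounding in the individual figures.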

Note that all of this only includes the public side of Android - the GMS (Google Mobile Services - things like Maps, Play Store etc...) could easily add another million lines of code here if measured in src form.


COMMENTS VS LINES OF CODE

Before I get off the train and hunt down a cup of the most excellent Philz Coffee, let’s have a little fun with the dataset. Looking at just the code that can be considered the “Android OS” (the framework, sans XML), what is the split of code vs comments over the years?





On which note, a few words to say GOOD JOB TEAM ANDROID. On a personal level, working with Android has been a never ending fun time ride and even today, having to handle a twice yearly drop of 18M lines of code (integration, bug fixing and production) and push it out to a never ending line of products opens up doors every day to learning something new. Specialization, after all, is for insects.


CODE COUNTING SEGUE

>So, you like counting code then?

Well, not quite. Counting lines of code is said to have been invented before programming itself, a measurement of sweat and tears rumored to have brought life to computing for the sole pursuance of long lunches and afternoons of lost productivity. Why we still talk about such metrics today however is simply because software development is difficult to quantify and letting go of something tangible is hard to do (also, lunch remains popular among manager types).

As someone who has trench foot from software projects past and has also sat around the war room game table of management, my position in the counting lines of code arena has been that it’s interesting, but the metric is so context sensitive that most of the time it causes fear and alarm rather than back patting and drinks in the club. Comparing progress within a project or against similar peer projects however is often interesting for general trends and for summarizing the rough complexity of the finished SW (again, somewhat language dependent). To qualify, I never, never, ever measure engineers' productivity based on lines of code output - if you have to fall back on this, you don't know your team and, even worse in my opinion, you don't trust them.


DISCLAIMER

My career over the last 5 years has brought me into contact with all sorts of proprietary Android releases, including the Beta and Honeycomb drops (HC specifically is obviously missing in the graphs as a data point). However, this blog post was done using open source software only and as such, currently only goes back as far as Android 1.6 (the oldest in the AOSP). If someone has a public mirror of the 1.5 release, I would be happy to add these statistics in.