Wednesday, November 21, 2007

Is AGPL (Affero GPL) the Doom of Google?

This is actually a question I was asked today!

I am referring to the GNU Affero General Public License Version 3 (AGPL), which was just released Monday. The AGPL extends the GPL to give end-users access to the source code of AGPL-licensed software that they access over a network. As the Free Software Foundation (FSF) says in its press release, "The FSF recommends that people consider using the GNU AGPL for any software which will commonly be run over a network".

Well, a very significant portion of (if not most) software developed today will be accessed over a network - so if a large group of developers follows the FSF's advice, the AGPL could become a very widely used license. Normally, a new license is not easily adopted by open source projects, since it limits which other open source code can be reused within the project. The special situation here is that the AGPL is compatible with the GPL, because Version 3 of both the AGPL and the GPL includes a provision for exactly such compatibility.

Now, if we take a look at the distribution of open source licenses among open source projects, we can see how many of these projects are under GPL compatible licenses (see FSF's Licenses Page for details on compatibility). Since the majority of GPLv2 projects are re-licensable under GPLv3, we end up with 90-95% of open source projects compatible with Version 3 of the GPL - and thereby also compatible with the new AGPL. For an illustration of how licenses can be combined, see David Wheeler's The Free-Libre / Open Source Software (FLOSS) License Slide or the chart halfway down the FSF page A Quick Guide to GPLv3.

Most of the individual open source developers that I have talked to do not start an open source project with any particular "political" licensing agenda, but they commonly have a few simple goals:

1. Use as much other open source software as possible,
2. Get other developers to contribute, and
3. Keep somebody from "stealing" the code.

Many developers think that the GPL covers these bases decently - which is why it has become a favorite for new projects among non-corporate developers. However, this could now be changing with the introduction of the AGPL. A project under the AGPL can still include the same 90+% of open source code that a project under the GPL can include - and by using the AGPL, the developer can arguably get closer to goals 2 and 3, especially for web-enabled software applications. Eben Moglen, who co-drafted GPLv3 with Richard Stallman, has already stated that in his opinion, "Google and Yahoo are morally obliged to share their GPL code", but software licensed under the GPL cannot force these companies to do so. With the introduction of the AGPL, open source developers now have a "tool" to force such sharing, and if the approach is adopted, we might see a serious move towards using the AGPL for new projects started by non-corporate open source developers.

As we can see from the uptake of GPLv3, any adoption takes time. Even if the AGPL becomes popular among open source developers, it could be a while before significant portions of software are only available under the AGPL, and we may not see the real effects for another couple of years. However, even a relatively limited adoption would require organizations developing web sites to be more careful in tracking their code bases. They need to do this in order to either avoid AGPL code or know which code they need to make available to their users.

A wide adoption of the AGPL would change a current standard practice for creating a web application, where the developers start with a few pieces of GPL software and then modify the software until it suits their needs. With AGPL software in the mix, a business decision would have to be made on whether to use AGPL software and make source code for modifications and additions available - or to avoid AGPL software and spend more time developing software which can be kept out of the hands of competitors and potential hackers.

Larger companies, e.g. Google and Yahoo, are actually among the best positioned to live in this new world. They can carefully evaluate the trade-offs on a case-by-case basis and can introduce processes to make sure that AGPL code does not sneak into places where it should not be. If we end up in a world where major new inventive software is only available under the AGPL, they might well face new competition, but this should be a manageable issue, and I have faith that Google and Yahoo will adapt.

What is the impact of AGPL?


Friday, November 16, 2007

Massive Reuse Within the Open Source Community

A few people have commented on how I actually account for the widespread code reuse in open source when estimating newly created open source code vs. reused open source code, so it seems like a good idea to fill in some more details.

Not surprisingly, the open source community is excellent at reusing code! Traditional estimates of code reuse, from papers like On Finding Duplication and Near-Duplication in Large Software Systems back in 1996, put code reuse in the 10% to 15% range. This has changed. More recently, in Large-scale code reuse in open source software, Audris Mockus from Avaya Labs examined Linux and BSD distributions and found that more than 50% of the files were used in more than one project. In addition, he writes "The most widely reused components were small and represented templates requiring major and minor modifications and a group of files reused without any change. Some widely reused components involved hundreds of files."

From an analysis of Black Duck's database of open source code, I have actually found that only 39% of the source files are unique -- in other words, 61% are reused from either the same or other open source projects. Sure, this is not exactly comparing apples to apples. The other analyses pick specific sets of applications or operating system distributions, whereas I look across more than 150,000 open source projects. Some of these projects incorporate another project wholesale, some start by cloning a project (effectively creating a branch), whereas others simply use a basic make system and a few other files to get started. Whichever way it starts, the development continues from there.
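To make the unique vs. reused split concrete, here is a minimal sketch of one way such a share could be computed. It is not Black Duck's actual method, which is not described here. It assumes that a file counts as "unique" when its exact byte content appears only once in the corpus, and it only catches verbatim copies, not files that were modified after being reused. The function names and the sample files are made up for illustration.

```python
import hashlib
from collections import Counter
from typing import Mapping, Tuple


def content_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying a file's exact contents."""
    return hashlib.sha256(data).hexdigest()


def unique_and_reused_share(files: Mapping[str, bytes]) -> Tuple[float, float]:
    """Return (unique share, reused share) for a corpus of files.

    Assumption: a file is "unique" if no other file in the corpus has
    byte-identical content; every other file counts as "reused".
    """
    if not files:
        raise ValueError("need at least one file")
    digests = [content_digest(data) for data in files.values()]
    copies = Counter(digests)  # how many files share each content digest
    unique = sum(1 for d in digests if copies[d] == 1)
    total = len(digests)
    return unique / total, (total - unique) / total


# Hypothetical mini-corpus: two projects share one identical file.
sample = {
    "proj_a/util.c": b"int add(int a, int b) { return a + b; }\n",
    "proj_b/util.c": b"int add(int a, int b) { return a + b; }\n",
    "proj_a/main.c": b"int main(void) { return 0; }\n",
    "proj_b/app.c": b"int run(void) { return 1; }\n",
    "proj_c/Makefile": b"all:\n\tcc -o app app.c\n",
}
unique_share, reused_share = unique_and_reused_share(sample)
print(f"unique: {unique_share:.0%}, reused: {reused_share:.0%}")
```

Counting near-duplicates (files reused with small modifications) would need a fuzzier comparison than an exact hash, so a sketch like this gives a lower bound on reuse.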

In addition to the source code reuse, there is also significant reuse of unmodified binary components. Just as with reused source code, these can be complete projects, a complete component within a project, or just a few files.

Both source code reuse and binary reuse are captured in the following graph showing some of the most reused open source projects:

This graph was extracted from the Black Duck whitepaper The Quest for an "Open Source Genome" and shows how many times files from popular open source projects have been reused in other open source projects. It clearly shows that files from some open source projects are reused in thousands of other open source projects. In fact, 46 open source projects are each reused in more than 1,000 other open source projects, which shows that the open source community really is serious about reusing code!

How much Open Source do you (re)use?


Friday, November 9, 2007

The Open Source Community as a Top 100 Country

For months Black Duck Software CEO Doug Levin has been writing a blog. It is interesting and offers useful insights into the open source community, software development and other things. After a series of inquiries, I slowly came to the conclusion that this was a good way to share my point of view as well. So I am writing this blog.

As you may know, Black Duck Software maintains a database of all the open source code that we know of, and this database gets updated continuously day in and day out.

This morning I took a look at the amount of open source code that we receive every day. I decided to only look at new unique source files found in actual project releases. That rules out non-source files (e.g. documentation and binaries), interim code that is not officially released, and duplicates of existing files that developers reuse from the same or other open source projects. Even so, approximately 4.7 million lines of code are added every day - which translates into 1.7 billion lines of code each year. We are probably missing some parts of the open source out there, which would make this an underestimate. Still, we can take a leap of faith and use this as a proxy for the amount of open source code created in the world, and then we can get some idea of the value created by the open source community.

Now let's make a bunch of assumptions and try to see the value of the effort in creating such an amount of code. Assuming an average open source project is 35,000 lines of code and the average cost of a software developer is $30/hour (~$60,000/year), a simple COCOMO II calculator tells us that the average open source project costs $630,000 to develop. This cost translates into $18 per line of code. Extrapolating that to 1.7 billion lines of code gives us an estimated value of $30.6 billion/year. Changing perspective for a second, if the open source community were a country with a GDP of $30.6 billion, it would rank 77th, right between Bulgaria and Lithuania, according to the International Monetary Fund's list of GDP by country, thereby putting the open source community ahead of most countries in the world.
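The arithmetic behind these figures can be retraced in a few lines. This is only a sketch of the extrapolation: it takes the $630,000 COCOMO II result as a given input rather than re-deriving it, and the constant and function names are mine.

```python
# Inputs quoted in the post.
LINES_PER_DAY = 4_700_000        # new unique lines of code received per day
AVG_PROJECT_LINES = 35_000       # assumed size of an average project
AVG_PROJECT_COST_USD = 630_000   # COCOMO II result, taken as given here


def annual_lines(lines_per_day: int, days_per_year: int = 365) -> int:
    """Scale a daily line count up to a full year."""
    return lines_per_day * days_per_year


def cost_per_line(project_cost_usd: float, project_lines: int) -> float:
    """Average development cost of one line of code."""
    return project_cost_usd / project_lines


def annual_value(lines_per_year: float, usd_per_line: float) -> float:
    """Estimated cost to develop a year's worth of new code."""
    return lines_per_year * usd_per_line


raw_lines = annual_lines(LINES_PER_DAY)        # 1,715,500,000 before rounding
rounded_lines = 1_700_000_000                  # the 1.7 billion used in the post
per_line = cost_per_line(AVG_PROJECT_COST_USD, AVG_PROJECT_LINES)
value = annual_value(rounded_lines, per_line)

print(f"lines/year: {raw_lines:,} (rounded to {rounded_lines:,})")
print(f"cost per line: ${per_line:.2f}")
print(f"estimated value: ${value / 1e9:.1f} billion/year")
```

Using the unrounded 1,715,500,000 lines instead of 1.7 billion would nudge the estimate up slightly, to about $30.9 billion/year, which does not change the conclusion.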

You can argue about whether this number is high or low, and you can argue whether the basic COCOMO calculation on a 35,000 line project can be extrapolated. However, the $30 billion/year number seems consistent with previous estimates. David A. Wheeler's More Than a Gigabuck: Estimating GNU/Linux's Size estimates the cost to develop all elements of the Red Hat Linux 7.1 distribution at $1.08 billion. The study Economic impact of open source software on innovation and the competitiveness of the Information and Communication Technologies (ICT) sector in the EU estimates the cost to develop the elements of the Debian 3.1 distribution (until 2005) at €11.9 billion -- increasing to a cumulative €100 billion (~$146 billion) by 2010.

According to these rough calculations, the direct economic impact (ignoring any indirect economic impact) of the open source community appears to be larger than the economic impact of most individual countries in the world. Even if the numbers are somewhat off and not a perfect measurement of impact, they do show that the development cost of open source is in the same order of magnitude as many countries' GDP. Such an economic force should not be underestimated, and this is yet another indication that open source has become a significant part of the technology world.