If a piece of bioinformatics software is not fully open source, my lab and I will generally seek out alternatives to it for research, teaching and training. This holds whether or not the software is free for academic use.

If a piece of bioinformatics software is only available under the GNU General Public License or another 'copyleft' license, then my lab and I will absolutely avoid integrating any part of it into our own source code, which is mostly BSD.

Why avoid non-open-source software?

We avoid non-open-source software because it saves us future headaches.

Contrary to some assumptions, this is not because I'm anti-company or against making money from software, although I have certainly chosen to forgo that in my own life. It's almost entirely because such software is an evolutionary dead end, and hence time spent working with it is ultimately wasted.

More specifically, here are some situations that I want to avoid:

  • we invest a lot of time in building training materials for a piece of software, only to find out that some of our trainees can't make use of the software in their day job.
  • we need to talk to lawyers about whether or not we can use a piece of software or include some of its functionality in a workflow we're building.
  • we find a small but critical bug in a piece of bioinformatics software, and can't reach any of the original authors to OK a new release.

This is the reason why I won't be investing much time or effort in using kallisto in my courses and my research: it's definitely not open source.

Why avoid copyleft?

The typical decision tree for an open source license is between a 'permissive' BSD-style license vs a copyleft license like the GPL; see Jake Vanderplas's excellent post on licensing scientific code for specifics.

There is an asymmetry in these licenses.

Our software, khmer, is available under the BSD license. Any open source project (indeed, any project) is welcome to take any part of our source code and include it in theirs.

However, we cannot use GPL software in our code base at all. We can't call GPL library functions, we can't include GPL code in our codebase, and I'm not 100% sure we can even look closely at GPL code. This is because if we do so, we must license our own software under the GPL.

This is the reason that I will be avoiding bloomtree code, and in fact we will probably be following through on our reimplementation -- bloomtree relies on both Jellyfish and sdsl-lite, which are GPL.

Why did we choose BSD and not GPL for our own code?

Two reasons: first, I'm an academic, funded by government grants; second, I want to maximize the utility of my work, which means choosing a license that encourages the most participation in the project, and encourages the most reuse of my code in other projects.

Jake covers the second line of reasoning really nicely in his blog post, so I will simply extract his summary of John Hunter's reasoning:

To summarize Hunter's reasoning: the most important two predictors of success for a software project are the number of users and the number of contributors. Because of the restrictions and subtle legal issues involved with GPL licenses, many for-profit companies will not touch GPL-licensed code, even if they are happy to contribute their changes back to the community. A BSD license, on the other hand, removes these restrictions: Hunter mentions several specific examples of vital industry partnership in the case of matplotlib. He argues that in general, a good BSD-licensed project will, by virtue of opening itself to the contribution of private companies, greatly grow its two greatest assets: its user-base and its developer-base.

I also think maximizing remixability is a basic scientific goal, and this is something that the GPL fails at.

The first line of reasoning is a little more philosophical, but basically it comes down to a wholesale rejection of the logic in the Bayh-Dole Act, which tries to encourage innovation and commercialization of federally funded research by assigning intellectual property to the university. I think this approach is bollocks. While I am not an economic expert, I think it's clear that most innovation in the university is probably not worth that much and should be made maximally open. From talking to Dr. Bill Janeway, I understand that he agrees that pre-Bayh-Dole was a time of more openness, although I'm not sure of the evidence for more innovation during this period. Regardless, to me it's intuitively obvious that the prospect of commercialization causes more researchers to keep their research closed, and I think this is obviously bad for science. (The Idea Factory talks a lot about how Bell Labs spurred immense amounts of innovation because so much of their research was open for use. Talent Wants to be Free is a pop-sci book that outlines research supporting openness leading to more innovation.)

So, basically, I think my job as an academic is to come up with cool stuff and make it as open as possible, because that encourages innovation and progress. And the BSD fits that bill. If a company wants to make use of my code, that's great! Please don't talk to us - just grab it and go!

I should say that I'm very aware of the many good reasons why GPL promotes a better long-term future, and until I became a grad student I was 100% on board. Once I got more involved in scientific programming, though, I switched to a more selfish rationale, which is that my reputation is going to be enhanced by more people using my code, and the way to do that is with the BSD. If you have good arguments about why I'm wrong and everyone should use the GPL, please do post them (or links to good pieces) in the comments; I'm happy to promote that line of reasoning, but for now I've settled on BSD for my own work.

One important note: universities like releasing things under the GPL, because they know that it virtually guarantees no company will use it in a commercial product without paying the university to relicense it specifically for the company. While this may be in the best short-term interests of the university, I think it says all you need to know about the practical outcome of the GPL on scientific innovation.

Why am I OK with the output of commercial equipment?

Lior Pachter drew a contrast between my refusal to teach non-free software and my presumed teaching on sequencing output from commercial Illumina machines. I think there are at least four arguments to be made in favor of continuing to use Illumina while avoiding the use of kallisto.

  • pragmatically, Illumina is the only game in town for most of my students, while there are plenty of RNA-seq analysis programs. So unless I settled on kallisto being the super-end-all-be-all of RNA-seq analysis, I can indulge my leanings towards freedom by ignoring kallisto and teaching something else that's free-er.
  • Illumina has a clear pricing model and their sequencing is essentially a commodity that needs little to no engagement from me. This is not generally how bioinformatics software works :)
  • There's no danger of Illumina claiming dibs on any of my results or extensions - we're all clear that I pays my money, and I gets my sequence. I'm honestly not sure what would happen if I modified kallisto or built on it to do something cool, and then wanted to let a company use it. (I bet it would involve talking to a lot of lawyers, which I'm not interested in doing.)
  • James Taylor made the excellent points that limited training and development time is best spent on tools that are maximally available, and that don't involve licenses that they can't enforce.

So that's my reasoning. I don't want to pour fuel on any licensing fire, but I wanted to explain my reasoning to people. I also think that people should fight hard to make their bioinformatics software available under a permissive license, because it will benefit everyone :).

I should say that Manoj Samanta has been following this line of thought for much longer than me, and has written several blog posts on this topic (see also this, for example).


