"

This is a response to (parts of) Dr. Lior Pachter's post, 'The myths of bioinformatics software'. (You can also see my post on bioinformatics software licensing for at least some of the background arguments.)

I agree with a lot of what Lior says: most bioinformatics software is not very good quality (#1), most bioinformatics software is not built by a team (#2), licensing is at best a minor component of what makes software widely used (#3), software should have an expiration date (#5), most URLs are unstable (#6), software should not be 'idiot proof' (#7), and it shouldn't matter whether you use a specific programming language (#8).

I strongly disagree with Lior's point #4, in almost every way. I try to make my software free for everyone, including companies, for both philosophical reasons and for simplicity; I explained my reasoning in my blog post. (Anyone who doesn't think linking against GPL software is reasonably complicated and nuanced should read through the tweets and comments on that post!) From my few involvements with working on non-free software, I would also add that selling software is a tough business, and not one that automatically leads to any profits; there's a long tail, just as with everything else, and I long ago decided that my time is worth more to me than the expected income from selling software would be. (I would be thrilled if a student wanted to try to make money off of our work, but my academic work would remain open source.)

Regardless, Lior's opinion isn't obviously wrong, and I appreciate the discussion.


What surprises me most about Lior's post, though, is that he's describing the present situation rather accurately, but he's not angry about it. I'm angry, frustrated, and upset by it, and I really want to see a better future -- I'm in science, and biology, partly because I think it can have a real impact on society and health. Software is a key part of that.

Biology and genomics are changing. Large scale data analysis is becoming more and more important to the biomedical sciences, and software packages like kallisto and khmer are almost certainly going to be used in the clinic at some point. (I believe some of Broad's variant calling software is already used in diagnosis and treatment for cancer, for example, although I don't know the details.) Our software is certainly being used by people doing basic biomedical research, although it may not be directly clinical yet - and I think the quality of computation in basic research matters too.

And this means bioinformatics should grow up a bit. If bioinformatics is a core component of the future of biology (which I think is obvious), then the quality of bioinformatics software matters.

To quote Lior, 'Who wants to read junk software, let alone try to edit it or build on it?' Certainly not me - but then why are we producing it? Are we settling for this kind of software in biomedical research? Are we just giving up on producing decent quality software altogether, because, uh, it's hard? How is this different from doing bad math, or publishing bad biology - topics that Lior and others get really mad about?

Lior also quotes a Computational Biology interview with James Taylor, who says,

A lot of traditional software engineering is about how to build software effectively with large teams, whereas the way most scientific software is developed is (and should be) different. Scientific software is often developed by one or a handful of people.

That was true a decade ago, and it may have been a reasonable reason to avoid using decent software engineering techniques then, but the landscape has changed significantly in the last decade, with a wide variety of rapid prototyping, test-driven development, and lean/agile methodologies being put into practice in startups and large companies. So I think James is mistaken here.

I wager that the reason a lot of scientists do bad software engineering is that they can get away with it, not that there are no techniques they could profitably use. Heck, if they wanted to learn something about it, Software Carpentry will come teach workshops on this very topic, and I'd be happy to offer both Lior and James a workshop to bring them up to speed. (Note: I don't think either of them needs my advice, which is actually kind of my point.)

(As for languages, Lior's point #8, there is a persistent expansion of the Python and R toolchains around bioinformatics and a convergence on them as the daily workhorses of bioinformatics data analysis. So even that's changing.)

Fundamentally the blithe acceptance of badly engineered software in science baffles me. I can understand (and even endorse) not requiring good software engineering for algorithmic proofs of concept, but clearly we want to have good, robust libraries for serious work. To claim otherwise would seem to lead to the conclusion that much of bioinformatics and genomics should seek to be incorrect and irrelevant.


I want there to be a robust community of computational scientists and software developers in biology. I want people to be able to build a new variant caller without having to reimplement a FASTQ or SAM parser. I think we need people to file bug reports, catch weird off-by-one problems, and otherwise spot check all the software they are using. And I don't think it's impossible or even terribly difficult to achieve this.

The open source community has been developing software with distributed teams, with no single employer, and with uncertain funding for decades. It's not easy, but it's not impossible. And in the end I do think that the open source community has a lot of the solutions the computational science community needs, and in fact is simply a much better exemplar for how to work reproducibly and with high technical quality. Why we continue to ignore this mystifies me, although I would guess it has to do with how credit is assigned in academic software development.

If we went back to the 80s and 90s we'd see that many of the same arguments that Lior is making were applied to open source software in contrast to commercial software. We know how that ended - open source software now runs most of the Internet infrastructure. And open source has had other benefits, too; to quote Bill Janeway, 'open source and the cloud have dramatically decreased the friction of innovating', and the scientific community has certainly benefited from the tremendous innovation in software and high tech. I would love to see that same innovation enabled in genomics and bioinformatics. And that's why we try to practice good software development in my lab; that's why we release our software under the BSD license; and that's why we encourage people to do whatever they want with our software.

Ultimately I think we should develop our software (and choose our licenses) for the future we want to see, not the present that we're stuck with.

--titus

"


