"

Just as I was moving to UC Davis, a funding call for a training coordination center came out. I got partway down the path of applying for it before realizing that I was overwhelmed with the move, but I did generate some text that I thought was OK. Here it is!


The increasing velocity, variety, and volume of data generated in biomedical research are challenging the existing data management and analysis skills of most researchers. Access to and application of these forms of data is hindered not only by a lack of training and training opportunities in data-intensive biomedical research, but also by the heterogeneity of biomedical data as well as limited discoverability of training materials relevant to different biomedical fields. Many biomedical fields of study - genomics, informatics, biostatistics, and epidemiology, among others - have invested in data analysis methodologies and training materials, but no comprehensive index of biomedical data analysis methodologies, trainers, or training materials exists to address this training gap. Significant investments in biomedical data science training by the Helmsley Trust and the NIH BD2K program, as well as by foundations not devoted to human health such as the Moore and Sloan Foundations, highlight the need for and opportunities in peer knowledge coordination and training.

We propose to bridge the gap between the availability of biomedical big data and the needs of biomedical researchers to make use of this data by building a coordination center around proven principles of open online collaboration. This coordination center will nucleate a national and international community of expert trainers, together with a catalog of openly available supporting materials developed by this community, to enable the discoverability of resources and the training of data-intensive biomedical researchers using modern, evidence-based teaching practices.

Aim 1: Build an index to enable categorization, discovery, and review of open educational resources.

Subaim 1A: Create and maintain an index of open educational resources.

Subaim 1B: Create and maintain the software tooling behind the index, including categorization, discovery, and review of resources.

Subaim 1C: Support categorization, discovery, and personalization of educational resources through a controlled vocabulary, personalized search, and lesson tracks.

Aim 2: Coordinate with the existing biomedical/data science research and training communities.

Subaim 2A: Build a 'matchmaking service' to help scientists identify potential collaborative partners for lab rotations.

Subaim 2B: Connect and coordinate training components for national and international biomedical data science initiatives, including BD2K Center awardees, BD2K R25 awardees, and Foundation funders.

Subaim 2C: Facilitate connections and communication with the larger Data Science training community.

Aim 3: Build a community of trainers and contributors to reuse, review, remix, and create training materials.

Subaim 3A: Initiate regional training centers (Davis, Harvard, and St. Louis, or Chicago/Florida) for coordinating trainers and for material discovery and curation, needs analysis, and assessment.

Subaim 3B: Build and maintain a catalog of trained instructors that enables discoverability, coordination and collaboration for training purposes.

Subaim 3C: Encourage and develop a diverse community of contributors through partnerships at regional training centers and Data Carpentry initiatives.

Introduction

As the volume, variety, and velocity of biomedical data increase, so too has the variety of training needs in analyzing this data. Data science training specific to biomedical research is still relatively rare, and where it exists it is siloed, reflecting the bottom-up emergence of training and education in response to research-area-specific needs. However, as the research community and funders respond to the increasing need with increased effort and funding, there is an opportunity to coordinate efforts to serve the broader purpose of training so-called 'pi-shaped' researchers - researchers with deep backgrounds in both biomedicine and data science.

We propose to create a virtual training coordination center to organize and coordinate online training materials, facilitate interactions and connections between the many biomedical and data science communities, and nucleate the formation of a more cohesive and more diverse biomedical data science training community.

This training coordination center (TCC) will build and maintain a catalog of open educational resources that can be personalized to researcher-specific career goals, and coordinate software development and data management with the ELIXIR-UK TeSS project (tess.oerc.ox.ac.uk), which serves the European community. Our main effort will be to provide automated and semi-automated gathering and classification of training materials, integrated into a sustainable open system that can be used by others, and served via a personalized curriculum system that can recommend materials based on research interests and prior training.
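To make the idea of recommending materials from research interests and prior training a little more concrete, here is a minimal sketch in Python. It is not a design for the actual system. The `Material` record, the tag names, the sample catalog, and the `recommend` function are all hypothetical, and the ranking rule (count of shared tags, after filtering out completed materials and materials whose prerequisites are not yet met) is just one simple assumption about how such a system might work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Material:
    """One catalog entry. All fields are hypothetical, for illustration only."""
    title: str
    tags: frozenset                      # terms from an assumed controlled vocabulary
    prerequisites: frozenset = frozenset()  # titles that should be completed first


def recommend(catalog, interests, completed, limit=3):
    """Return up to `limit` titles ranked by overlap with the researcher's interests.

    Materials already completed, or whose prerequisites are not all
    completed, are skipped. Ties are broken alphabetically by title.
    """
    interests = set(interests)
    completed = set(completed)
    scored = []
    for material in catalog:
        if material.title in completed:
            continue  # prior training: do not recommend again
        if not material.prerequisites <= completed:
            continue  # prerequisites not yet met
        overlap = len(material.tags & interests)
        if overlap:
            scored.append((overlap, material.title))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:limit]]


# A made-up catalog; a real one would be gathered and classified automatically.
CATALOG = [
    Material("Intro to the shell", frozenset({"automation", "reproducibility"})),
    Material("RNA-seq basics", frozenset({"genomics", "statistics"})),
    Material("Workflow automation", frozenset({"automation", "genomics"}),
             frozenset({"Intro to the shell"})),
    Material("Survival analysis", frozenset({"epidemiology", "statistics"})),
]

if __name__ == "__main__":
    print(recommend(CATALOG, {"genomics", "automation"}, {"Intro to the shell"}))
```

In this sketch, a researcher interested in genomics and automation who has already completed the shell lesson would be pointed first at the workflow lesson, since its prerequisite is met and it shares the most tags with their interests.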

We will also interact with the national and international biomedical data science communities, including the newly funded BD2K centers, R25 workshop and material grants, the EU's ELIXIR, etc., and facilitate connections between these communities and broader data science training initiatives that include the Moore/Sloan Data Science Environments at UW, NYU, and Berkeley, as well as Software Carpentry, Data Carpentry, and the Mozilla Science Lab. A key component of this will be a 'matchmaking' service that seeks to identify and support potential collaboration and 'lab rotation' opportunities for biomedical scientists looking for data science collaborators.

Finally, we will work to nucleate a diverse and expert community of trainers who can use, reuse, remix, and build new materials. This community will be built upon regional training coordination centers and a recurrent Train-the-Trainers (T3) program to introduce trainers to materials, training and assessment approaches, and technology useful for training. We will also emphasize the inclusion of underrepresented minorities in the T3 program.

Background

The growth of data in the natural sciences has been explosive, with a simultaneous and dramatic increase in all 'three Vs' of data - volume, velocity, and variety - over the last two decades. This growth in data has in turn led to an increasing interest in quantitative and computational aspects of data analysis by academic and industry researchers. Computational infrastructure, analysis software, statistical methods for data analysis and integration, and research into the fundamental methods underpinning data-driven discovery have all grown apace.

The growth in data has led to a training gap, in which many researchers suffer from the lack of a solid foundation in quantitative and computational methods. This gap is especially large in basic biology and biomedical research, where traditionally very few researchers have received any training in data analysis beyond basic mathematics and statistics. Moreover, as the volume and importance of data grow and the pace of research into data-driven discovery accelerates, the training gap is widening. Meanwhile, the opportunities for careers in biology-specific data analysis and data-driven discovery are increasing rapidly in both industry and academia, further increasing this gap between the need for trained researchers and the supply.

A number of training programs have stepped up to address this gap. One of the largest and broadest is Software Carpentry, a global non-profit which runs two-day intensive workshops on basic computational practice for academic scientists; as of 2015, Software Carpentry has trained over 10,000 students in 14 years. While not limited to biology, in 2014 half of the Software Carpentry workshops were biology-focused, and approximately 1300 of the 2600 trainees were from biology backgrounds. iPlant Collaborative, an NSF-funded center focused on biological data analysis, has also run many training workshops to address the training gap. Internationally, the EU's ELIXIR program and the Australian Bioinformatics Network are focused on biological information, and have significant training programs.

More biomedically focused workshops and training programs in data science have also begun to be developed. Of particular note, NHGRI has funded a variety of computational training over the last decade that includes T32s, R25s, and K and F mechanisms. In the last year, the BD2K Initiative - formed specifically in recognition of cross-Institute opportunities and challenges in data science - has funded a number of 'Big Data' centers and R25 workshop and resource development grants, with more to come in 2015. Most recently, the Helmsley Trust has invested $1.7M in the Mozilla Science Lab to help increase the capacity of biomedical scientists to integrate computation into their research.

The landscape of data science training is much larger than biology and biomedical science, of course. In the past few years, a tremendous variety of online resources, including written tutorials, videos, Massive Open Online Courses (MOOCs), and webinars have emerged. Many universities and institutes have started data science training programs, with a notable investment by the Moore and Sloan foundations in Data Science Environments at NYU, UW, and UC Berkeley focused on data-driven discovery. Furthermore, an NSF investment in BIO Centers led to the initiation of Data Carpentry, a sister non-profit to Software Carpentry that is focused on linking domain-specific data analysis methodology to the broader contexts of efficiency and reproducibility; Data Carpentry is now funded by the Moore Foundation.

Thus, the training landscape in data science generally, and biomedical data science specifically, is large, complex, and international. Moreover, the number of training programs and initiatives is growing fast.

Over the last few years, several themes in biomedical data science training have emerged:

Many styles of training are needed, across many career levels. The training breakout at the BD2K 'ADDSup' meeting in September 2014 summarized biomedical training opportunities in 10 dimensions, including formal vs informal, beginner to advanced, in-person vs online, short course vs long, centralized vs physical, andragogy vs pedagogy, project-based vs structured, individual vs group learning, 'just in time' vs background training, and basic to clinically focused. Different types of materials, teaching approaches, and assessment approaches are appropriate for each of these.

Training opportunities are increasingly oversubscribed. Both surveys and anecdotes suggest that the perceived need for training in biomedical data science is great. For example, the Australian Bioinformatics Network survey on bioinformatics needs concluded that access to training is by far the most dominant concern for biologists. The summer workshop on sequence analysis run by Dr. Brown routinely has 5-10x more applicants (~200) than can be accommodated (25). Software Carpentry and Data Carpentry workshops with biology focuses typically fill up within two days of their announcement and always have a waiting list. And online courses on data analysis and statistics typically have tens of thousands of participants.

Common concerns of reproducibility, efficiency, automation, and statistical correctness underpin every data science domain. At the most domain-specific level, data science training inevitably must focus on specific data types, data analysis problems, and data analysis software. However, as trainees grow in expertise, the same concerns consistently emerge no matter the domain: how do we make this analysis reproducible? How can we most efficiently use available compute resources? How do we run the current analysis on a new data set? How do we assess significance and correctness of our results? This convergence suggests a role for inter-domain coordination of training, especially because some biomedical domains (such as bioinformatics) have explored this area in more depth than others.

There is a great need for more instructors versed in both advanced biomedical data science and evidence-based educational practice. The gap in biomedical data science is nowhere more evident than in the lack of instructors capable of teaching data science to biomedical researchers! A consistent theme from universities is that faculty working in this area are overwhelmed by existing teaching and research opportunities, which limits the available instructor pool. A major limiting factor in offering more biology-focused Software Carpentry workshops has been a lack of instructors, although this is somewhat ameliorated by the use of graduate students and postdocs.

A more diverse trainee and instructor pool is needed. By definition, underrepresented minorities are underrepresented in faculty lines, but in biomedical data science, this further intersects with the significant underrepresentation of women and minorities in quantitative and computational disciplines. However, we are still at an early stage in biomedical data science where this underrepresentation could be addressed by targeted initiatives.

There is a strong need (and attendant opportunity) for centralized coordination in biomedical data science training. Cataloging of training opportunities and materials could increase the efficiency and reuse of existing training, as well as identify where materials and training are lacking. Instructor training can increase the available pool of trained educators as well as provide opportunities for underrepresented populations to get involved in training. And coordination across domains on more advanced training topics could broaden the scope of these advanced materials.

"


