Center for Library and Informatics (CLI)
The Center for Library and Informatics (CLI) addresses the challenges of organizing and preserving data, promoting data-driven discovery, and developing innovative visualization and analysis tools. These challenges demand a collaborative response that brings together the expertise of computer scientists, librarians and life scientists. The mission of the Center is to provide sustainable, flexible, innovative research and services based on collections of scientific information.
The CLI participates in national and international communities of scientific data providers and managers, building our expertise in Biodiversity Informatics and in the History and Philosophy of Science. The Center includes several cutting-edge research projects, educational courses, and services, including the MBLWHOI Library, the NSF Data Conservancy project, the Encyclopedia of Life, the annual MBL/National Library of Medicine Biomedical Informatics course, and the History and Philosophy of Science Program.
The CLI was disbanded in January 2013.
Browsing Center for Library and Informatics (CLI) by Title
-
Article: Applications of natural language processing in biodiversity science (Hindawi Publishing, 2012)
Thessen, Anne E. ; Cui, Hong ; Mozzherin, Dmitry
Centuries of biological knowledge are contained in the massive body of scientific literature, written for human readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none have been applied life-wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.
-
Article: Automated simultaneous analysis phylogenetics (ASAP) : an enabling tool for phlyogenomics (BioMed Central, 2008-02-19)
Sarkar, Indra Neil ; Egan, Mary G. ; Coruzzi, Gloria ; Lee, Ernest K. ; DeSalle, Rob
The availability of sequences from whole genomes to reconstruct the tree of life has the potential to enable the development of phylogenomic hypotheses in ways that have not previously been possible. A significant bottleneck in the analysis of genomic-scale views of the tree of life is the time required for manual curation of genomic data into multi-gene phylogenetic matrices. To keep pace with the exponentially growing volume of molecular data in the genomic era, we have developed an automated technique, ASAP (Automated Simultaneous Analysis Phylogenetics), to assemble these multigene/multispecies matrices and to evaluate the significance of individual genes within the context of a given phylogenetic hypothesis. Applications of ASAP may enable scientists to re-evaluate species relationships and to develop new phylogenomic hypotheses based on genome-scale data.
-
Preprint: Biodiversity informatics : organizing and linking information across the spectrum of life (2007-04-08)
Sarkar, Indra Neil
Biological knowledge can be inferred from three major levels of information: molecules, organisms, and ecologies. Bioinformatics is an established field that has made significant advances in the development of systems and techniques to organize contemporary molecular data; biodiversity informatics is an emerging discipline that strives to develop methods to organize knowledge at the organismal level extending back to the earliest dates of recorded natural history. Furthermore, while bioinformatics studies generally focus on detailed examinations of key “model” organisms, biodiversity informatics aims to develop over-arching hypotheses that span the entire tree of life. Biodiversity informatics is presented here as a discipline that unifies biological information from a range of contemporary and historical sources across the spectrum of life using organisms as the linking thread. The present review primarily focuses on the use of organism names as a universal meta-data element to link and integrate biodiversity data across a range of data sources.
-
Article: Biological nomenclature terms for facilitating communication in the naming of organisms (Pensoft, 2012-05-08)
David, John ; Garrity, George M. ; Greuter, Werner ; Hawksworth, David L. ; Jahn, Regine ; Kirk, Paul M. ; McNeill, John ; Michel, Ellinor ; Knapp, Sandra ; Patterson, David J. ; Tindall, Brian J. ; Todd, Jonathan A. ; Tol, Jan van ; Turland, Nicholas J.
A set of terms recommended for use in facilitating communication in biological nomenclature is presented as a table showing broadly equivalent terms used in the traditional Codes of nomenclature. These terms are intended to help those engaged in naming across organism groups, and are the result of the work of the International Committee on Bionomenclature, whose aim is to promote harmonisation and communication amongst those naming life on Earth.
-
Preprint: Broadly sampled multigene analyses yield a well-resolved eukaryotic tree of life (2010-06-01)
Parfrey, Laura Wegener ; Grant, Jessica ; Tekle, Yonas I. ; Lasek-Nesselquist, Erica ; Morrison, Hilary G. ; Sogin, Mitchell L. ; Patterson, David J. ; Katz, Laura A.
An accurate reconstruction of the eukaryotic tree of life is essential to identify the innovations underlying the diversity of microbial and macroscopic (e.g. plants and animals) eukaryotes. Previous work has divided eukaryotic diversity into a small number of high-level ‘supergroups’, many of which receive strong support in phylogenomic analyses. However, the abundance of data in phylogenomic analyses can lead to highly supported but incorrect relationships due to systematic phylogenetic error. Further, the paucity of major eukaryotic lineages (19 or fewer) included in these genomic studies may exaggerate systematic error and reduce power to evaluate hypotheses. Here, we use a taxon-rich strategy to assess eukaryotic relationships. We show that analyses emphasizing broad taxonomic sampling (up to 451 taxa representing 72 major lineages) combined with a moderate number of genes yield a well-resolved eukaryotic tree of life. The consistency across analyses with varying numbers of taxa (88-451) and levels of missing data (17-69%) supports the accuracy of the resulting topologies. The resulting stable topology emerges without the removal of rapidly evolving genes or taxa, a practice common to phylogenomic analyses. Several major groups are stable and strongly supported in these analyses (e.g. SAR, Rhizaria, Excavata), while the proposed supergroup ‘Chromalveolata’ is rejected. Further, extensive instability among photosynthetic lineages suggests the presence of systematic biases including endosymbiotic gene transfer from symbiont (nucleus or plastid) to host.
Our analyses demonstrate that stable topologies of ancient evolutionary relationships can be achieved with broad taxonomic sampling and a moderate number of genes. Finally, taxon-rich analyses such as those presented here provide a method for testing the accuracy of relationships that receive high bootstrap support in phylogenomic analyses and enable placement of the multitude of lineages that lack genome-scale data.
-
Presentation: Building research networks to support campus programs [poster] (2012-04-04)
Furfey, John F. ; Devenish, Ann ; Hurter, Colleen ; Stafford, Nancy
This poster focuses on the methods, tools and outcomes involved in creating two targeted research networks to support large, long-running research programs in the Woods Hole scientific community.
-
Preprint: CAOS software for use in character-based DNA barcoding (2008-04)
Sarkar, Indra Neil ; Planet, Paul J. ; DeSalle, Rob
The success of character-based DNA barcoding depends on the efficient identification of diagnostic character states from molecular sequences that have been organized hierarchically (e.g., according to phylogenetic methods). Similarly, the reliability of these identified diagnostic character states must be assessed according to their ability to diagnose new sequences. Here, a set of software tools is presented that implement the previously described Characteristic Attribute Organization System for both diagnostic identification and diagnostic-based classification. The software is publicly available from http://sarkarlab.mbl.edu/CAOS.
-
Article: Character-based DNA barcoding allows discrimination of genera, species and populations in Odonata (Royal Society, 2007-11-08)
Rach, J. ; DeSalle, Rob ; Sarkar, Indra Neil ; Schierwater, B. ; Hadrys, H.
DNA barcoding has become a promising means for identifying organisms of all life stages. Currently, phenetic approaches and tree-building methods have been used to define species boundaries and discover 'cryptic species'. However, a universal threshold of genetic distance values to distinguish taxonomic groups cannot be determined. As an alternative, DNA barcoding approaches can be 'character based', whereby species are identified through the presence or absence of discrete nucleotide substitutions (character states) within a DNA sequence. We demonstrate the potential of character-based DNA barcodes by analysing 833 odonate specimens from 103 localities belonging to 64 species. A total of 54 species and 22 genera could be discriminated reliably through unique combinations of character states within only one mitochondrial gene region (NADH dehydrogenase 1). Character-based DNA barcodes were further successfully established at a population level discriminating seven population-specific entities out of a total of 19 populations belonging to three species. Thus, for the first time, DNA barcodes have been found to identify entities below the species level that may constitute separate conservation units or even species units. Our findings suggest that character-based DNA barcoding can be a rapid and reliable means for (i) the assignment of unknown specimens to a taxonomic group, (ii) the exploration of diagnosability of conservation units, and (iii) complementing taxonomic identification systems.
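To make the idea of a diagnostic character state concrete, the following is a minimal sketch and not the authors' software or the CAOS algorithm. It assumes pre-aligned sequences of equal length grouped by species; the function name `diagnostic_characters` and the toy sequences are hypothetical. A position counts as diagnostic for a group when every member of the group shares one base there and no sequence from any other group carries that base at that position.

```python
from collections import defaultdict


def diagnostic_characters(aligned):
    """Find simple diagnostic character states.

    aligned: dict mapping a group name to a list of aligned,
    equal-length sequences (hypothetical input layout).
    Returns a dict mapping each group to a list of (position, base)
    pairs that are fixed within the group and absent elsewhere.
    """
    length = len(next(iter(aligned.values()))[0])
    result = defaultdict(list)
    for pos in range(length):
        # Bases observed at this position, per group.
        seen = {g: {s[pos] for s in seqs} for g, seqs in aligned.items()}
        for group, bases in seen.items():
            if len(bases) != 1:
                continue  # polymorphic within the group, not diagnostic
            base = next(iter(bases))
            if all(base not in other for g, other in seen.items() if g != group):
                result[group].append((pos, base))
    return dict(result)


# Toy alignment, invented for illustration only.
toy = {
    "sp_A": ["ACGT", "ACGT"],
    "sp_B": ["ATGT", "ATGA"],
    "sp_C": ["AGGC", "AGGC"],
}
found = diagnostic_characters(toy)
print(found)
```

In this toy case each group is separated by position 1 alone, which mirrors the abstract's point that unique combinations of character states, rather than a distance threshold, carry the identification.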
-
Article: Data hosting infrastructure for primary biodiversity data (BioMed Central, 2011-12-15)
Goddard, Anthony ; Wilson, Nathan ; Cryer, Phil ; Yamashita, Grant
Today, an unprecedented volume of primary biodiversity data is being generated worldwide, yet significant amounts of these data have been and will continue to be lost after the conclusion of the projects tasked with collecting them. To get the most value out of these data it is imperative to seek a solution whereby these data are rescued, archived and made available to the biodiversity community. To this end, the biodiversity informatics community requires investment in processes and infrastructure to mitigate data loss and provide solutions for long-term hosting and sharing of biodiversity data. We review the current state of biodiversity data hosting and investigate the technological and sociological barriers to proper data management. We further explore the rescuing and re-hosting of legacy data, the state of existing toolsets and propose a future direction for the development of new discovery tools. We also explore the role of data standards and licensing in the context of data hosting and preservation. We provide five recommendations for the biodiversity community that will foster better data preservation and access: (1) encourage the community's use of data standards, (2) promote the public domain licensing of data, (3) establish a community of those involved in data hosting and archival, (4) establish hosting centers for biodiversity data, and (5) develop tools for data discovery. The community's adoption of standards and development of tools to enable data discovery is essential to sustainable data preservation. Furthermore, the increased adoption of open content licensing, the establishment of data hosting infrastructure and the creation of a data hosting and archiving community are all necessary steps towards the community ensuring that data archival policies become standardized.
-
Article: Data issues in the life sciences (Pensoft Publishers, 2011-11-28)
Thessen, Anne E. ; Patterson, David J.
We review technical and sociological issues facing the Life Sciences as they transform into more data-centric disciplines - the “Big New Biology”. Three major challenges are: 1) lack of comprehensive standards; 2) lack of incentives for individual scientists to share data; 3) lack of appropriate infrastructure and support. Technological advances with standards, bandwidth, distributed computing, exemplar successes, and a strong presence in the emerging world of Linked Open Data are sufficient to conclude that technical issues will be overcome in the foreseeable future. While motivated to have a shared open infrastructure and data pool, and pressured by funding agencies to move in this direction, the sociological issues determine progress. Major sociological issues include our lack of understanding of the heterogeneous data cultures within Life Sciences, and the impediments to progress include a lack of incentives to build appropriate infrastructures into projects and institutions or to encourage scientists to make data openly available.
-
Article: The Encyclopedia of Life, Biodiversity Heritage Library, biodiversity informatics and beyond Web 2.0 (Great Cities Initiative of the University of Illinois at Chicago Library, 2008-08-04)
Norton, Cathy N.
E.O. Wilson, the noted entomologist at Harvard, "wished" for an authoritative encyclopedia of life that would be freely available on the World Wide Web for the entire world. On 9 May 2007, The Encyclopedia of Life (EOL) was launched as a multi-institutional initiative whose mission is to create 1.8 million Web sites detailing all the known attributes, history, and behavior of every known and described species and portraying that information through video, audio, and literature, via the Internet. A major contributor to the Encyclopedia is the Biodiversity Heritage Library that is currently scanning all the core biodiversity literature.
-
Working Paper: Envisioning the future of science libraries at academic research institutions : a discussion (2012-12-20)
Feltes, Carol ; Gibson, Donna S. ; Miller, Holly ; Norton, Cathy N. ; Pollock, Ludmila
A group of librarians, other information professionals, scientists and research administrators met to discuss the challenges that research libraries are currently facing. After the meeting a survey was conducted to obtain additional input from the group on several key challenges that arose from the discussions. The purpose of the meeting and survey was threefold: 1. Examine in detail, from a variety of perspectives, how the world of research is changing and the impact these changes have on the direction of research libraries. 2. Create an informed vision of how research libraries can be a vital partner to researchers. 3. Suggest a strategic approach for realizing this vision. The strategic approach presented in this white paper incorporates feedback from various sized research libraries, each with its own mission. The expectation is that individual libraries will use it as a guide in formulating strategies that are appropriate to their research communities, financial circumstances, and organizational reporting structure.
-
Article: Exploring historical trends using taxonomic name metadata (BioMed Central, 2008-05-13)
Sarkar, Indra Neil ; Schenk, Ryan ; Norton, Cathy N.
Authority and year information have been attached to taxonomic names since Linnaean times. The systematic structure of taxonomic nomenclature facilitates the ability to develop tools that can be used to explore historical trends that may be associated with taxonomy. From the over 10.7 million taxonomic names that are part of the uBio system, approximately 3 million names were identified to have taxonomic authority information from the years 1750 to 2004. A pipe-delimited file was then generated, organized according to a Linnaean hierarchy and by years from 1750 to 2004, and imported into an Excel workbook. A series of macros were developed to create an Excel-based tool and a complementary Web site to explore the taxonomic data. A cursory and speculative analysis of the data reveals observable trends that may be attributable to significant events that are of both taxonomic (e.g., publishing of key monographs) and societal importance (e.g., world wars). The findings also help quantify the number of taxonomic descriptions that may be made available through digitization initiatives. Temporal organization of taxonomic data can be used to identify interesting biological epochs relative to historically significant events and ongoing efforts. We have developed an Excel workbook and complementary Web site that enable one to explore taxonomic trends for Linnaean taxonomic groupings, from Kingdoms to Families.
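As a rough illustration of the tabulation step described in this abstract, the sketch below counts names per kingdom and decade from pipe-delimited text. It is not the authors' Excel macros; the column names (`kingdom`, `family`, `name`, `year`), the function `names_per_decade`, and the sample rows are assumptions made up for the example. Only the pipe delimiter and the 1750 to 2004 year range come from the abstract.

```python
import csv
import io
from collections import Counter


def names_per_decade(pipe_text, first_year=1750, last_year=2004):
    """Count names per (kingdom, decade) in pipe-delimited text.

    Assumes a header row with 'kingdom' and 'year' columns
    (a hypothetical layout). Rows with a missing or non-numeric
    year, or a year outside the range, are skipped.
    """
    counts = Counter()
    reader = csv.DictReader(io.StringIO(pipe_text), delimiter="|")
    for row in reader:
        try:
            year = int(row["year"])
        except (KeyError, TypeError, ValueError):
            continue
        if first_year <= year <= last_year:
            decade = year // 10 * 10
            counts[(row["kingdom"], decade)] += 1
    return counts


# Invented placeholder rows, not real uBio data.
sample = (
    "kingdom|family|name|year\n"
    "Animalia|Fam1|Aus bus|1758\n"
    "Animalia|Fam2|Cus dus|1758\n"
    "Plantae|Fam3|Eus fus|1753\n"
    "Animalia|Fam1|Gus hus|2004\n"
    "Animalia|Fam1|Ius jus|2010\n"
    "Animalia|Fam1|Kus lus|unknown\n"
)
decade_counts = names_per_decade(sample)
for key in sorted(decade_counts):
    print(key, decade_counts[key])
```

Grouping by decade is an arbitrary choice here; the same counter keyed by single years or by family would support the kind of trend exploration the abstract describes.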
-
Article: GenBank and PubMed : how connected are they? (BioMed Central, 2009-06-09)
Miller, Holly ; Norton, Cathy N. ; Sarkar, Indra Neil
GenBank® is a public repository of all publicly available molecular sequence data from a range of sources. In addition to relevant metadata (e.g., sequence description, source organism and taxonomy), publication information is recorded in the GenBank data file. The identification of literature associated with a given molecular sequence may be an essential first step in developing research hypotheses. Although many of the publications associated with GenBank records may not be linked into or part of complementary literature databases (e.g., PubMed), GenBank records associated with literature indexed in Medline are identifiable as they contain PubMed identifiers (PMIDs). Here we show that an analysis of 87,116,501 GenBank sequence files reveals that 42% are associated with a publication or patent. Of these, 71% are associated with PMIDs, and can therefore be linked to a citation record in the PubMed database. The remaining 29% of publication-associated GenBank entries either do not have PMIDs or cite a publication that is not currently indexed by PubMed. We also identify the journal titles that are linked through citations in the GenBank files to the largest number of sequences. Our analysis suggests that GenBank contains molecular sequences from a range of disciplines beyond biomedicine, the initial scope of PubMed. The findings thus suggest opportunities to develop mechanisms for integrating biological knowledge beyond the biomedical field.
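The nested percentages in this abstract are easy to misread, so here is a short worked calculation. It uses only the three figures quoted above; because 42% and 71% are rounded, the derived counts are approximations and not values reported by the authors.

```python
# Figures quoted in the abstract.
TOTAL_RECORDS = 87_116_501
SHARE_WITH_PUBLICATION = 0.42      # of all records
SHARE_OF_THOSE_WITH_PMID = 0.71    # of publication-associated records

# Records citing a publication or patent: about 36.6 million.
with_publication = TOTAL_RECORDS * SHARE_WITH_PUBLICATION

# Of those, records carrying a PMID: about 26.0 million.
with_pmid = with_publication * SHARE_OF_THOSE_WITH_PMID

# Share of ALL records linkable to PubMed: 0.42 * 0.71, about 29.8%.
overall_pmid_share = SHARE_WITH_PUBLICATION * SHARE_OF_THOSE_WITH_PMID

print(f"publication-associated: {with_publication:,.0f}")
print(f"with PMID:              {with_pmid:,.0f}")
print(f"share of all records:   {overall_pmid_share:.1%}")
```

Read this way, roughly three in ten GenBank records in the analysed set could be linked to a PubMed citation, while the 29% figure in the abstract applies only to the publication-associated subset.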
-
Preprint: Grand challenges in biodiversity informatics (2007-01)
Sarkar, Indra Neil
The exponentially growing array of biological data has necessitated the development of a new information management domain, biodiversity informatics. It is one of the newest members of the ‘informatics’ sub-disciplines, which all generally focus on the management of information through the application of advanced technologies. Like other informatics sub-disciplines, biodiversity informatics depends on fundamental computer science and information science principles to facilitate the management of heterogeneous data. Biodiversity informatics distinguishes itself as being the most focused on biological knowledge dating back to the earliest dates of recorded history – while most biological or biomedical informatics studies focus on organizing and studying information spanning less than 100 years, the scope of biodiversity informatics spans the age of the Earth. Biodiversity informatics is also concerned with the widest range of disparate data types – including climatology, epidemiology, geography, and taxonomy. To this end, many informatics principles can readily be incorporated into biodiversity informatics; however, there are just as many challenges that will require creative solutions. Here, several such challenges are presented in an effort to lay a framework for the types of issues that will define the future of biodiversity informatics and, in turn, the future of biology and biomedicine.
-
Preprint: Identity of epibiotic bacteria on symbiontid euglenozoans in O2-depleted marine sediments : evidence for symbiont and host co-evolution (2010-06)
Edgcomb, Virginia P. ; Breglia, S. A. ; Yubuki, Naoji ; Beaudoin, David J. ; Patterson, David J. ; Leander, Brian S. ; Bernhard, Joan M.
A distinct subgroup of euglenozoans, referred to as the “Symbiontida,” has been described from oxygen-depleted and sulfidic marine environments. By definition, all members of this group carry epibionts that are intimately associated with underlying mitochondrion-derived organelles beneath the surface of the hosts. We have used molecular phylogenetic and ultrastructural evidence to identify the rod-shaped epibionts of two members of this group, Calkinsia aureus and Bihospites bacati, hand-picked from sediments from two separate oxygen-depleted, sulfidic environments. We identify their epibionts as closely related sulfur- or sulfide-oxidizing members of the Epsilon proteobacteria. The Epsilon proteobacteria generally play a significant role in deep-sea habitats as primary colonizers, primary producers, and/or in symbiotic associations. The epibionts likely fulfill a role in detoxifying the immediate surrounding environment for these two different hosts. The nearly identical rod-shaped epibionts on these two symbiontid hosts provide evidence for a co-evolutionary history between these two sets of partners. This hypothesis is supported by congruent tree topologies inferred from 18S and 16S rDNA from the hosts and bacterial epibionts, respectively. The eukaryotic hosts likely serve as a motile substrate that delivers the epibionts to the ideal locations with respect to the oxic/anoxic interface whereby their growth rates can be maximized, perhaps also allowing the host to cultivate a food source.
Because symbiontid isolates and additional SSU rDNA gene sequences from this clade have now been recovered from many locations worldwide, the Symbiontida are likely more widespread and diverse than presently known.
-
Preprint: Intra- and interspecies differences in growth and toxicity of Pseudo-nitzschia while using different nitrogen sources (2009-01)
Thessen, Anne E. ; Bowers, Holly A. ; Stoecker, Diane K.
Clonal cultures of plankton are widely used in laboratory experiments and have contributed greatly to knowledge of microbial systems. However, many physiological characteristics vary drastically between strains of the same species, calling into question our ability to make ecologically relevant inferences about populations based on studying one or a few strains. This study included nineteen non-axenic strains of three species of the diatom Pseudo-nitzschia isolated primarily from the mid-Atlantic coastal region of the United States. Toxin (domoic acid) production and growth rates were measured in cultures using different nitrogen sources (NH4+, NO3- and urea) and growth irradiances. The strains exhibited broad differences in growth rate and toxin content even between strains isolated from the same water sample. The influence of bacteria on toxin production was not investigated. Both P. multiseries clones produced toxin, yet preferentially used different nitrogen sources. Only two out of nine P. calliantha and two out of five P. fraudulenta isolates were toxic and domoic acid content varied by orders of magnitude. All three species had variable intraspecies growth rates on each nitrogen source, but P. fraudulenta strains had the broadest range. Light-limited growth rate and maximum growth rate in P. fraudulenta and P. multiseries varied with species. These findings show the importance of defining intra- and interspecies variability in ecophysiology and toxicity. Ecologically relevant functional diversity in the form of ecotypes or cryptic species appears to be present in the genus Pseudo-nitzschia.
-
Article: LigerCat : using “MeSH clouds” from journal, article, or gene citations to facilitate the identification of relevant biomedical literature (American Medical Informatics Association, 2009-11-14)
Sarkar, Indra Neil ; Schenk, Ryan ; Miller, Holly ; Norton, Cathy N.
The identification of relevant literature from within large collections is often a challenging endeavor. In the context of indexed resources, such as MEDLINE, it has been shown that keywords from a controlled vocabulary (e.g., MeSH) can be used in combination to retrieve relevant search results. One effective strategy for identifying potential search terms is to examine a collection of documents for frequently occurring terms. In this way, “tag clouds” are a popular mechanism for ascertaining terms associated with a collection of documents. Here, we present the Literature and Genomic Electronic Resource Catalogue (LigerCat) system for exploring biomedical literature through the selection of terms within a “MeSH cloud” that is generated based on an initial query using journal, article, or gene data. The resultant interface is encapsulated within a Web interface: http://ligercat.ubio.org. The system is also available for installation under an MIT license.
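The general tag-cloud idea mentioned in this abstract, weighting terms by how often they occur across a set of documents, can be sketched as follows. This is not LigerCat's code: the function `mesh_cloud`, the linear font-size scaling, and the sample term lists are all assumptions chosen for illustration.

```python
from collections import Counter


def mesh_cloud(articles, min_size=10, max_size=40):
    """Map each term to a display size based on document frequency.

    articles: list of term lists, one list per article
    (hypothetical input). A term is counted once per article.
    Sizes are scaled linearly between min_size and max_size.
    """
    counts = Counter(term for terms in articles for term in set(terms))
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts match
    return {
        term: min_size + (count - lo) * (max_size - min_size) / span
        for term, count in counts.items()
    }


# Invented term lists standing in for indexed articles.
sample_articles = [
    ["Humans", "Genomics"],
    ["Humans", "Phylogeny"],
    ["Humans", "Genomics"],
]
cloud = mesh_cloud(sample_articles)
for term, size in sorted(cloud.items(), key=lambda item: -item[1]):
    print(f"{term}: {size:.0f}")
```

Counting each term once per article keeps a single heavily indexed record from dominating the cloud; a real system might weight terms differently.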
-
Article: Mapping the biosphere : exploring species to understand the origin, organization and sustainability of biodiversity (Taylor & Francis, 2012-03-27)
Wheeler, Q. D. ; Knapp, Sandra ; Stevenson, D. W. ; Stevenson, J. ; Blum, Stan D. ; Boom, B. M. ; Borisy, Gary G. ; Buizer, James L. ; De Carvalho, M. R. ; Cibrian, A. ; Donoghue, M. J. ; Doyle, V. ; Gerson, E. M. ; Graham, C. H. ; Graves, P. ; Graves, Sara J. ; Guralnick, Robert P. ; Hamilton, A. L. ; Hanken, J. ; Law, W. ; Lipscomb, D. L. ; Lovejoy, Thomas E. ; Miller, Holly ; Miller, J. S. ; Naeem, Shahid ; Novacek, M. J. ; Page, L. M. ; Platnick, N. I. ; Porter-Morgan, H. ; Raven, Peter H. ; Solis, M. A. ; Valdecasas, A. G. ; Van Der Leeuw, S. ; Vasco, A. ; Vermeulen, N. ; Vogel, J. ; Walls, R. L. ; Wilson, E. O. ; Woolley, J. B.
The time is ripe for a comprehensive mission to explore and document Earth's species. This calls for a campaign to educate and inspire the next generation of professional and citizen species explorers, investments in cyber-infrastructure and collections to meet the unique needs of the producers and consumers of taxonomic information, and the formation and coordination of a multi-institutional, international, transdisciplinary community of researchers, scholars and engineers with the shared objective of creating a comprehensive inventory of species and detailed map of the biosphere. We conclude that an ambitious goal to describe 10 million species in less than 50 years is attainable based on the strength of 250 years of progress, worldwide collections, existing experts, technological innovation and collaborative teamwork. Existing digitization projects are overcoming obstacles of the past, facilitating collaboration and mobilizing literature, data, images and specimens through cyber technologies.
Charting the biosphere is enormously complex, yet necessary expertise can be found through partnerships with engineers, information scientists, sociologists, ecologists, climate scientists, conservation biologists, industrial project managers and taxon specialists, from agrostologists to zoophytologists. Benefits to society of the proposed mission would be profound, immediate and enduring, from detection of early responses of flora and fauna to climate change to opening access to evolutionary designs for solutions to countless practical problems. The impacts on the biodiversity, environmental and evolutionary sciences would be transformative, from ecosystem models calibrated in detail to comprehensive understanding of the origin and evolution of life over its 3.8 billion year history. The resultant cyber-enabled taxonomy, or cybertaxonomy, would open access to biodiversity data to developing nations, assure access to reliable data about species, and change how scientists and citizens alike access, use and think about biological diversity information.
-
Preprint: A model for Bioinformatics training : the Marine Biological Laboratory (2010-08-04)
Yamashita, Grant ; Miller, Holly ; Goddard, Anthony ; Norton, Cathy N.
Many areas of science such as biology, medicine, and oceanography are becoming increasingly data-rich, yet most programs that train scientists do not address the informatics techniques or technologies that are necessary for managing and analyzing large amounts of data. Educational resources for scientists in informatics are scarce, yet scientists need the skills and knowledge to work with informaticians and manage graduate students and post-docs in informatics projects. The Marine Biological Laboratory houses a world-renowned library and is involved in a number of informatics projects in the sciences. The MBL has been home to the National Library of Medicine's BioMedical Informatics Course for nearly two decades and is committed to educating scientists and other scholars in informatics. In an innovative, immersive learning experience, Grant Yamashita, a biologist and post-doc at Arizona State University, visited the Science Informatics Group at MBL to learn firsthand how informatics is done and how informatics teams work. Hands-on work with developers, systems administrators, librarians, and other scientists provided an invaluable education in informatics and is a model for future science informatics training.