Center for Library and Informatics (CLI)
The Center for Library and Informatics (CLI) addresses the challenges of organizing and preserving data, promoting data-driven discovery, and developing innovative visualization and analysis tools. These challenges demand a collaborative response that brings together the expertise of computer scientists, librarians and life scientists. The mission of the Center is to provide sustainable, flexible, innovative research and services based on collections of scientific information.
The CLI participates in national and international communities of scientific data providers and managers building our expertise in Biodiversity Informatics, and History and Philosophy of Science. The Center includes several cutting-edge research projects, educational courses, and services, including the MBLWHOI Library, the NSF Data Conservancy project, the Encyclopedia of Life, the annual MBL/National Library of Medicine Biomedical Informatics course, and the History and Philosophy of Science Program.
The CLI was disbanded in January 2013.
Browsing Center for Library and Informatics (CLI) by Issue Date
-
Article
Taxonomic informatics tools for the electronic nomenclator zoologicus (Marine Biological Laboratory, 2006-02)
Remsen, David P. ; Norton, Cathy N. ; Patterson, David J.
Given the current trends, it seems inevitable that all biological documents will eventually exist in a digital format and be distributed across the internet. New network services and tools need to be developed to increase retrieval rates for documents and to refine data recovery. Biological data have traditionally been well managed using taxonomic principles. As part of a larger initiative to build an array of names-based network services that emulate taxonomic principles for managing biological information, we undertook the digitization of a major taxonomic reference text, Nomenclator Zoologicus. The process involved replicating the text to a high level of fidelity, parsing the content for inclusion within a database, developing tools to enable expert input into the product, and integrating the metadata and factual content within taxonomic network services. The result is a high-quality and freely available web application (http://uio.mbl.edu/NomenclatorZoologicus/) capable of being exploited in an array of biological informatics services.
-
Article
Taxonomic indexing—extending the role of taxonomy (Taylor & Francis, 2006-06)
Patterson, David J. ; Remsen, David P. ; Marino, William A. ; Norton, Cathy N.
Taxonomic indexing refers to a new array of taxonomically intelligent network services that use nomenclatural principles and elements of expert taxonomic knowledge to manage information about organisms. Taxonomic indexing was introduced to help manage the increasing amounts of digital information about biology. It has been designed to form a near basal layer in a layered cyberinfrastructure that deals with biological information. Taxonomic indexing accommodates the special problems of using names of organisms to index biological material. It links alternative names for the same entity (reconciliation), distinguishes between uses of the same name for different entities (disambiguation), and places names within an indefinite number of hierarchical schemes. In order to access all information on all organisms, taxonomic indexing must be able to call on a registry of all names in all forms for all organisms. NameBank has been developed to meet that need. Taxonomic indexing is an area of informatics that overlaps with taxonomy, is dependent on the expert input of taxonomists, and reveals the relevance of the discipline to a wide audience.
-
Preprint
Grand challenges in biodiversity informatics (2007-01)
Sarkar, Indra Neil
The exponentially growing array of biological data has necessitated the development of a new information management domain, biodiversity informatics. It is one of the newest members of the ‘informatics’ sub-disciplines, which all generally focus on the management of information through the application of advanced technologies. Like other informatics sub-disciplines, biodiversity informatics depends on fundamental computer science and information science principles to facilitate the management of heterogeneous data. Biodiversity informatics distinguishes itself as being the most focused on biological knowledge dating back to the earliest dates of recorded history – while most biological or biomedical informatics studies focus on organizing and studying information spanning less than 100 years, the scope of biodiversity informatics spans the age of the Earth. Biodiversity informatics is also concerned with the widest range of disparate data types – including climatology, epidemiology, geography, and taxonomy. To this end, many informatics principles can readily be incorporated into biodiversity informatics; however, there are equally as many challenges that will require creative solutions. Here, several such challenges are presented in an effort to lay a framework for the types of issues that will define the future of biodiversity informatics and, in turn, the future of biology and biomedicine.
-
Article
uBioRSS : tracking taxonomic literature using RSS (Oxford University Press, 2007-03-28)
Leary, Patrick R. ; Remsen, David P. ; Norton, Cathy N. ; Patterson, David J. ; Sarkar, Indra Neil
Web content syndication through standard formats such as RSS and ATOM has become an increasingly popular mechanism for publishers, news sources, and blogs to disseminate regularly updated content. These standardized syndication formats deliver content directly to the subscriber, allowing them to locally aggregate content from a variety of sources instead of having to find the information on multiple websites. The uBioRSS application is a "taxonomically intelligent" service customized for the biological sciences. It aggregates syndicated content from academic publishers and science news feeds, then uses a taxonomic name entity recognition algorithm to identify and index taxonomic names within those data streams. The resulting name index is cross-referenced to current global taxonomic datasets to provide context for browsing the publications by taxonomic group. This process, called taxonomic indexing, draws upon services developed specifically for biological sciences, collectively referred to as "taxonomic intelligence." Such value-added enhancements can provide biologists with accelerated and improved access to current biological content.
-
Preprint
Biodiversity informatics : organizing and linking information across the spectrum of life (2007-04-08)
Sarkar, Indra Neil
Biological knowledge can be inferred from three major levels of information: molecules, organisms, and ecologies. Bioinformatics is an established field that has made significant advances in the development of systems and techniques to organize contemporary molecular data; biodiversity informatics is an emerging discipline that strives to develop methods to organize knowledge at the organismal level extending back to the earliest dates of recorded natural history. Furthermore, while bioinformatics studies generally focus on detailed examinations of key “model” organisms, biodiversity informatics aims to develop over-arching hypotheses that span the entire tree of life. Biodiversity informatics is presented here as a discipline that unifies biological information from a range of contemporary and historical sources across the spectrum of life using organisms as the linking thread. The present review primarily focuses on the use of organism names as a universal meta-data element to link and integrate biodiversity data across a range of data sources.
-
Article
Character-based DNA barcoding allows discrimination of genera, species and populations in Odonata (Royal Society, 2007-11-08)
Rach, J. ; DeSalle, Rob ; Sarkar, Indra Neil ; Schierwater, B. ; Hadrys, H.
DNA barcoding has become a promising means for identifying organisms of all life stages. Currently, phenetic approaches and tree-building methods have been used to define species boundaries and discover 'cryptic species'. However, a universal threshold of genetic distance values to distinguish taxonomic groups cannot be determined. As an alternative, DNA barcoding approaches can be 'character based', whereby species are identified through the presence or absence of discrete nucleotide substitutions (character states) within a DNA sequence. We demonstrate the potential of character-based DNA barcodes by analysing 833 odonate specimens from 103 localities belonging to 64 species. A total of 54 species and 22 genera could be discriminated reliably through unique combinations of character states within only one mitochondrial gene region (NADH dehydrogenase 1). Character-based DNA barcodes were further successfully established at a population level discriminating seven population-specific entities out of a total of 19 populations belonging to three species. Thus, for the first time, DNA barcodes have been found to identify entities below the species level that may constitute separate conservation units or even species units. Our findings suggest that character-based DNA barcoding can be a rapid and reliable means for (i) the assignment of unknown specimens to a taxonomic group, (ii) the exploration of diagnosability of conservation units, and (iii) complementing taxonomic identification systems.
-
Article
Automated simultaneous analysis phylogenetics (ASAP) : an enabling tool for phlyogenomics (BioMed Central, 2008-02-19)
Sarkar, Indra Neil ; Egan, Mary G. ; Coruzzi, Gloria ; Lee, Ernest K. ; DeSalle, Rob
The availability of sequences from whole genomes to reconstruct the tree of life has the potential to enable the development of phylogenomic hypotheses in ways that have not been possible before. A significant bottleneck in the analysis of genomic-scale views of the tree of life is the time required for manual curation of genomic data into multi-gene phylogenetic matrices. To keep pace with the exponentially growing volume of molecular data in the genomic era, we have developed an automated technique, ASAP (Automated Simultaneous Analysis Phylogenetics), to assemble these multigene/multi species matrices and to evaluate the significance of individual genes within the context of a given phylogenetic hypothesis. Applications of ASAP may enable scientists to re-evaluate species relationships and to develop new phylogenomic hypotheses based on genome-scale data.
-
Preprint
CAOS software for use in character-based DNA barcoding (2008-04)
Sarkar, Indra Neil ; Planet, Paul J. ; DeSalle, Rob
The success of character-based DNA barcoding depends on the efficient identification of diagnostic character states from molecular sequences that have been organized hierarchically (e.g., according to phylogenetic methods). Similarly, the reliability of these identified diagnostic character states must be assessed according to their ability to diagnose new sequences. Here, a set of software tools is presented that implement the previously described Characteristic Attribute Organization System for both diagnostic identification and diagnostic-based classification. The software is publicly available from http://sarkarlab.mbl.edu/CAOS.
-
Article
Exploring historical trends using taxonomic name metadata (BioMed Central, 2008-05-13)
Sarkar, Indra Neil ; Schenk, Ryan ; Norton, Cathy N.
Authority and year information have been attached to taxonomic names since Linnaean times. The systematic structure of taxonomic nomenclature facilitates the ability to develop tools that can be used to explore historical trends that may be associated with taxonomy. From the over 10.7 million taxonomic names that are part of the uBio system, approximately 3 million names were identified to have taxonomic authority information from the years 1750 to 2004. A pipe-delimited file was then generated, organized according to a Linnaean hierarchy and by years from 1750 to 2004, and imported into an Excel workbook. A series of macros were developed to create an Excel-based tool and a complementary Web site to explore the taxonomic data. A cursory and speculative analysis of the data reveals observable trends that may be attributable to significant events that are of both taxonomic (e.g., publishing of key monographs) and societal importance (e.g., world wars). The findings also help quantify the number of taxonomic descriptions that may be made available through digitization initiatives. Temporal organization of taxonomic data can be used to identify interesting biological epochs relative to historically significant events and ongoing efforts. We have developed an Excel workbook and complementary Web site that enable one to explore taxonomic trends for Linnaean taxonomic groupings, from Kingdoms to Families.
-
Article
The Encyclopedia of Life, Biodiversity Heritage Library, biodiversity informatics and beyond Web 2.0 (Great Cities Initiative of the University of Illinois at Chicago Library, 2008-08-04)
Norton, Cathy N.
E.O. Wilson, the noted entomologist at Harvard, "wished" for an authoritative encyclopedia of life that would be freely available on the worldwide web for the entire world. On 9 May 2007, The Encyclopedia of Life (EOL) was launched as a multi-institutional initiative whose mission is to create 1.8 million Web sites detailing all the known attributes, history, and behavior, about every known and described species and portraying that information through video, audio, and literature, via the Internet. A major contributor to the Encyclopedia is the Biodiversity Heritage Library that is currently scanning all the core biodiversity literature.
-
Preprint
Intra- and interspecies differences in growth and toxicity of Pseudo-nitzschia while using different nitrogen sources (2009-01)
Thessen, Anne E. ; Bowers, Holly A. ; Stoecker, Diane K.
Clonal cultures of plankton are widely used in laboratory experiments and have contributed greatly to knowledge of microbial systems. However, many physiological characteristics vary drastically between strains of the same species, calling into question our ability to make ecologically relevant inferences about populations based on studying one or a few strains. This study included nineteen non-axenic strains of three species of the diatom Pseudo-nitzschia isolated primarily from the mid-Atlantic coastal region of the United States. Toxin (domoic acid) production and growth rates were measured in cultures using different nitrogen sources (NH4+, NO3- and urea) and growth irradiances. The strains exhibited broad differences in growth rate and toxin content even between strains isolated from the same water sample. The influence of bacteria on toxin production was not investigated. Both P. multiseries clones produced toxin, yet preferentially used different nitrogen sources. Only two out of nine P. calliantha and two out of five P. fraudulenta isolates were toxic and domoic acid content varied by orders of magnitude. All three species had variable intraspecies growth rates on each nitrogen source, but P. fraudulenta strains had the broadest range. Light-limited growth rate and maximum growth rate in P. fraudulenta and P. multiseries varied with species. These findings show the importance of defining intra- and interspecies variability in ecophysiology and toxicity. Ecologically relevant functional diversity in the form of ecotypes or cryptic species appears to be present in the genus Pseudo-nitzschia.
-
Article
GenBank and PubMed : how connected are they? (BioMed Central, 2009-06-09)
Miller, Holly ; Norton, Cathy N. ; Sarkar, Indra Neil
GenBank® is a public repository of all publicly available molecular sequence data from a range of sources. In addition to relevant metadata (e.g., sequence description, source organism and taxonomy), publication information is recorded in the GenBank data file. The identification of literature associated with a given molecular sequence may be an essential first step in developing research hypotheses. Although many of the publications associated with GenBank records may not be linked into or part of complementary literature databases (e.g., PubMed), GenBank records associated with literature indexed in Medline are identifiable as they contain PubMed identifiers (PMIDs). Here we show that an analysis of 87,116,501 GenBank sequence files reveals that 42% are associated with a publication or patent. Of these, 71% are associated with PMIDs, and can therefore be linked to a citation record in the PubMed database. The remaining 29% of publication-associated GenBank entries either do not have PMIDs or cite a publication that is not currently indexed by PubMed. We also identify the journal titles that are linked through citations in the GenBank files to the largest number of sequences. Our analysis suggests that GenBank contains molecular sequences from a range of disciplines beyond biomedicine, the initial scope of PubMed. The findings thus suggest opportunities to develop mechanisms for integrating biological knowledge beyond the biomedical field.
-
Article
LigerCat : using “MeSH clouds” from journal, article, or gene citations to facilitate the identification of relevant biomedical literature (American Medical Informatics Association, 2009-11-14)
Sarkar, Indra Neil ; Schenk, Ryan ; Miller, Holly ; Norton, Cathy N.
The identification of relevant literature from within large collections is often a challenging endeavor. In the context of indexed resources, such as MEDLINE, it has been shown that keywords from a controlled vocabulary (e.g., MeSH) can be used in combination to retrieve relevant search results. One effective strategy for identifying potential search terms is to examine a collection of documents for frequently occurring terms. In this way, “tag clouds” are a popular mechanism for ascertaining terms associated with a collection of documents. Here, we present the Literature and Genomic Electronic Resource Catalogue (LigerCat) system for exploring biomedical literature through the selection of terms within a “MeSH cloud” that is generated based on an initial query using journal, article, or gene data. The resultant system is accessible through a Web interface: http://ligercat.ubio.org. The system is also available for installation under an MIT license.
-
Preprint
Identity of epibiotic bacteria on symbiontid euglenozoans in O2-depleted marine sediments : evidence for symbiont and host co-evolution (2010-06)
Edgcomb, Virginia P. ; Breglia, S. A. ; Yubuki, Naoji ; Beaudoin, David J. ; Patterson, David J. ; Leander, Brian S. ; Bernhard, Joan M.
A distinct subgroup of euglenozoans, referred to as the “Symbiontida,” has been described from oxygen-depleted and sulfidic marine environments. By definition, all members of this group carry epibionts that are intimately associated with underlying mitochondrion-derived organelles beneath the surface of the hosts. We have used molecular phylogenetic and ultrastructural evidence to identify the rod-shaped epibionts of two members of this group, Calkinsia aureus and Bihospites bacati, hand-picked from sediments from two separate oxygen-depleted, sulfidic environments. We identify their epibionts as closely related sulfur- or sulfide-oxidizing members of the Epsilon proteobacteria. The Epsilon proteobacteria generally play a significant role in deep-sea habitats as primary colonizers, primary producers, and/or in symbiotic associations. The epibionts likely fulfill a role in detoxifying the immediate surrounding environment for these two different hosts. The nearly identical rod-shaped epibionts on these two symbiontid hosts provide evidence for a co-evolutionary history between these two sets of partners. This hypothesis is supported by congruent tree topologies inferred from 18S and 16S rDNA from the hosts and bacterial epibionts, respectively. The eukaryotic hosts likely serve as a motile substrate that delivers the epibionts to the ideal locations with respect to the oxic/anoxic interface whereby their growth rates can be maximized, perhaps also allowing the host to cultivate a food source.
Because symbiontid isolates and additional SSU rDNA gene sequences from this clade have now been recovered from many locations worldwide, the Symbiontida are likely more widespread and diverse than presently known.
-
Preprint
Broadly sampled multigene analyses yield a well-resolved eukaryotic tree of life (2010-06-01)
Parfrey, Laura Wegener ; Grant, Jessica ; Tekle, Yonas I. ; Lasek-Nesselquist, Erica ; Morrison, Hilary G. ; Sogin, Mitchell L. ; Patterson, David J. ; Katz, Laura A.
An accurate reconstruction of the eukaryotic tree of life is essential to identify the innovations underlying the diversity of microbial and macroscopic (e.g. plants and animals) eukaryotes. Previous work has divided eukaryotic diversity into a small number of high-level ‘supergroups’, many of which receive strong support in phylogenomic analyses. However, the abundance of data in phylogenomic analyses can lead to highly supported but incorrect relationships due to systematic phylogenetic error. Further, the paucity of major eukaryotic lineages (19 or fewer) included in these genomic studies may exaggerate systematic error and reduce power to evaluate hypotheses. Here, we use a taxon-rich strategy to assess eukaryotic relationships. We show that analyses emphasizing broad taxonomic sampling (up to 451 taxa representing 72 major lineages) combined with a moderate number of genes yield a well-resolved eukaryotic tree of life. The consistency across analyses with varying numbers of taxa (88-451) and levels of missing data (17-69%) supports the accuracy of the resulting topologies. The resulting stable topology emerges without the removal of rapidly evolving genes or taxa, a practice common to phylogenomic analyses. Several major groups are stable and strongly supported in these analyses (e.g. SAR, Rhizaria, Excavata), while the proposed supergroup ‘Chromalveolata’ is rejected. Further, extensive instability among photosynthetic lineages suggests the presence of systematic biases including endosymbiotic gene transfer from symbiont (nucleus or plastid) to host.
Our analyses demonstrate that stable topologies of ancient evolutionary relationships can be achieved with broad taxonomic sampling and a moderate number of genes. Finally, taxon-rich analyses such as presented here provide a method for testing the accuracy of relationships that receive high bootstrap support in phylogenomic analyses and enable placement of the multitude of lineages that lack genome scale data.
-
Preprint
A model for Bioinformatics training : the Marine Biological Laboratory (2010-08-04)
Yamashita, Grant ; Miller, Holly ; Goddard, Anthony ; Norton, Cathy N.
Many areas of science such as biology, medicine, and oceanography are becoming increasingly data-rich, and most programs that train scientists do not address informatics techniques or technologies that are necessary for managing and analyzing large amounts of data. Educational resources for scientists in informatics are scarce, yet scientists need the skills and knowledge to work with informaticians and manage graduate students and post-docs in informatics projects. The Marine Biological Laboratory houses a world-renowned library and is involved in a number of informatics projects in the sciences. The MBL has been home to the National Library of Medicine's BioMedical Informatics Course for nearly two decades and is committed to educating scientists and other scholars in informatics. In an innovative, immersive learning experience, Grant Yamashita, a biologist and post-doc at Arizona State University, visited the Science Informatics Group at MBL to learn firsthand how informatics is done and how informatics teams work. Hands-on work with developers, systems administrators, librarians, and other scientists provided an invaluable education in informatics and is a model for future science informatics training.
-
Preprint
Names are key to the big new biology (2010-09-20)
Patterson, David J. ; Cooper, J. ; Kirk, Paul M. ; Pyle, R. L. ; Remsen, David P.
Those who seek answers to big, broad questions about biology, especially questions emphasizing the organism (taxonomy, evolution, ecology), will soon benefit from an emerging names-based infrastructure. It will draw on the almost universal association of organism names with biological information to index and interconnect information distributed across the Internet. The result will be a virtual data commons, expanding as further data are shared, allowing biology to become more of a “big science”. Informatics devices will exploit this ‘big new biology’, revitalizing comparative biology with a broad perspective to reveal previously inaccessible trends and discontinuities, so helping us to reveal unfamiliar biological truths. Here, we review the first components of this freely available, participatory, and semantic Global Names Architecture.
-
Presentation
SCOR/IODE/MBLWHOI Library collaboration on data publication [poster] (2011-05-25)
Raymond, Lisa ; Pikula, Linda ; Lowry, Roy ; Urban, Edward ; Moncoiffe, Gwenaelle ; Pissierssens, Peter ; Norton, Cathy N.
This poster describes the development of international standards to publish oceanographic datasets. Research areas include the assignment of persistent identifiers, tracking provenance, linking datasets to publications, attributing credit to data providers, and best practices for the physical composition and semantic description of the content.
-
Article
Pseudo-nitzschia physiological ecology, phylogeny, toxicity, monitoring and impacts on ecosystem health (Elsevier B.V., 2011-11-03)
Trainer, Vera L. ; Bates, Stephen S. ; Lundholm, Nina ; Thessen, Anne E. ; Cochlan, William P. ; Adams, Nicolaus G. ; Trick, Charles G.
Over the last decade, our understanding of the environmental controls on Pseudo-nitzschia blooms and domoic acid (DA) production has matured. Pseudo-nitzschia have been found along most of the world's coastlines, while the impacts of its toxin, DA, are most persistent and detrimental in upwelling systems. However, Pseudo-nitzschia and DA have recently been detected in the open ocean's high-nitrate, low-chlorophyll regions, in addition to fjords, gulfs and bays, showing their presence in diverse environments. The toxin has been measured in zooplankton, shellfish, crustaceans, echinoderms, worms, marine mammals and birds, as well as in sediments, demonstrating its stable transfer through the marine food web and abiotically to the benthos. The linkage of DA production to nitrogenous nutrient physiology, trace metal acquisition, and even salinity, suggests that the control of toxin production is complex and likely influenced by a suite of environmental factors that may be unique to a particular region. Advances in our knowledge of Pseudo-nitzschia sexual reproduction, also in field populations, illustrate its importance in bloom dynamics and toxicity. The combination of careful taxonomy and powerful new molecular methods now allows for the complete characterization of Pseudo-nitzschia populations and how they respond to environmental changes. Here we summarize research that represents our increased knowledge over the last decade of Pseudo-nitzschia and its production of DA, including changes in worldwide range, phylogeny, physiology, ecology, monitoring and public health impacts.
-
Article
Data issues in the life sciences (Pensoft Publishers, 2011-11-28)
Thessen, Anne E. ; Patterson, David J.
We review technical and sociological issues facing the Life Sciences as they transform into more data-centric disciplines - the “Big New Biology”. Three major challenges are: 1) lack of comprehensive standards; 2) lack of incentives for individual scientists to share data; 3) lack of appropriate infrastructure and support. Technological advances with standards, bandwidth, distributed computing, exemplar successes, and a strong presence in the emerging world of Linked Open Data are sufficient to conclude that technical issues will be overcome in the foreseeable future. While motivated to have a shared open infrastructure and data pool, and pressured by funding agencies to move in this direction, the sociological issues determine progress. Major sociological issues include our lack of understanding of the heterogeneous data cultures within Life Sciences, and the impediments to progress include a lack of incentives to build appropriate infrastructures into projects and institutions or to encourage scientists to make data openly available.