The long term cost of inferior database quality

This month’s editorial is written by Antony J. Williams and Sean Ekins.

“If I have seen further than others, it is by standing upon the shoulders of giants.” Isaac Newton

Without question, science has progressed by building on the theories, past experiments and data of others. Historically, scientists have guarded their data carefully, if not hoarded it. Others have had to look on in envy as some scientists were able to generate data that were inaccessible to most. Times have changed: citizen scientists are now ‘asked’ to review data in crowdsourced approaches to data review and validation (e.g. Galaxy Zoo, ChemSpider [1]), and the public is engaged in data modeling (e.g. Foldit). In parallel, data sets are now made available to the community for download and integration into in-house systems, and the possibility exists that these data could be used by someone sitting in their basement to identify a new drug. Nowadays, immense quantities of scientific information are contained in the thousands of databases that exist on computers worldwide. Progress can, however, be inhibited by errors in these databases, and such errors have long been suggested to have downstream effects when the data are reused. There are few publications in this domain, but we have collected some that should be of general interest to drug discovery scientists.

In the 1990s it was shown that errors in genotyping data could affect high-resolution genetic maps [2]. Some bioinformatics databases designed to perform data curation and error identification have been described [3], although it is unclear how widely these have been embraced. The impact of the correctness of molecule structures on computational models has been discussed to a limited extent [4]. Oprea and colleagues have shown how errors in molecule structures published in scientific journals can propagate in the literature [5] and then into databases such as SciFinder and the Merck Index [6]. Even manually curated databases, such as the MDL Drug Data Report (MDDR), have been proposed to contain errors [7]. It has been suggested that automatic classification of molecules based on SMILES strings might be useful for detecting errors and aiding the construction of biochemical pathway databases [8]. Even new molecule databases appear not to have learnt from some of these earlier studies. For example, an NIH database was recently reported to have errors in 5% to over 10% of its molecules [9]. A recent review of data governance in predictive toxicology, which analyzed several public databases, mentioned a lack of systematic and standard measures of data quality but did not address error rates or molecule structure quality [10].

A multicentre analysis of a common sample by mass spectrometry–based proteomics identified generic problems in databases as the major hurdle to characterizing the proteins in the test sample correctly. Primarily, search engines could not distinguish among different identifiers, and many of these algorithms calculated molecular weight incorrectly [11]. Methods have been developed for detecting labeling errors to improve the analysis of microarray data and to discover the correct relationships between diseases and genes [12]. A simple rule-based method for validating ontology-based instance data can detect curation errors during the development of biological pathways [13]. Many proteins have completely wrong function assignments in databases, with one database having between 2.1% and 13.6% of annotated Pfam hits unjustified [14]. While most scientists think a ligand-protein X-ray structure is the last word, others have highlighted how these structures can also contain errors, with far-reaching consequences for drug design [15].

We think the above examples are just scratching the surface. Detection of errors in databases relevant to drug discovery is rarely encouraged or documented. When scientists raise alerts about quality issues, in our own experience [9], the database “owner” is reluctant to engage and solve the problem, even when sincere efforts are made to collaborate on improving data quality. This is disconcerting, as it is evident to us that at least part of the long-term health of science might rest on ensuring the quality of databases, especially when a significant part of the community implicitly trusts the databases without reason and certainly without validation. Trusting the quality of databases should not suffice; objective, data-driven review and qualification should be undertaken and reported. This work is presently underway for a small collection of the world’s top-selling drugs in some of the most widely used molecule databases.

The impact of database errors on pharmaceutical drug discovery costs has not, to our knowledge, been calculated. Perhaps by improving the quality of database content we can avoid research dead ends, and this might be separate from other important issues, such as the reproducibility of target validation [16]. This is a challenge for everyone, from the bench scientist to the publisher to the research funding organizations. A solution might be better collaboration by all involved, increased alertness of database users to errors, and increased responsiveness of database creators in correcting errors or retracting databases until problems are solved.

New online databases continue to be released, yet these too will continue to distribute data of unknown quality across the internet. This disturbing and continuing trend needs to be halted. Funders issuing grants that will produce a database for the community should identify objective approaches by which the database hosts should validate their data, so that new databases do not contribute further to these issues. Improving the quality of data available online must begin with recognition of the issue and an agreed-upon course of action under which database providers collaborate to resolve data quality issues. We will describe elsewhere how to address some of these challenges, but it is important that the drug discovery community realizes that there are issues with databases across the continuum from molecules to proteins and everything in between.

5 Oprea, T. et al. (2002) On the propagation of errors in the QSAR literature. In Euro QSAR 2002


Antony J. Williams graduated with a Ph.D. in chemistry as an NMR spectroscopist. Dr Williams is currently VP, Strategic Development for ChemSpider at the Royal Society of Chemistry. He has written chapters for many books and authored or coauthored more than 120 peer-reviewed papers and book chapters on NMR, predictive ADME methods, internet-based tools, crowdsourcing and database curation. He is an active blogger and participant in the internet chemistry network.
Sean Ekins graduated from the University of Aberdeen, receiving his M.Sc., Ph.D. and D.Sc. He is Principal Consultant for Collaborations in Chemistry and Collaborations Director at Collaborative Drug Discovery Inc. He has written over 170 papers and book chapters on topics including drug metabolism, drug-drug interaction screening, computational ADME/Tox, collaborative computational technologies and neglected disease research. He has edited or coedited four books.


This article is featured in:
The View From Here

