On 23 August 2016 a paper by three Australian authors drew wide publicity for showing how Excel reinterprets data on entry and how users fail to check their data.
Ziemann M, Eren Y, El-Osta A. Gene name errors are widespread in the scientific literature. Genome Biology. 2016;17:177. DOI: 10.1186/s13059-016-1044-7. Published: 23 August 2016
They wrote scripts to trawl through published papers on genetics which had data files attached, and checked those files for data errors. They screened 35,175 supplementary Excel files and confirmed gene name errors in 987 supplementary files from 704 published articles in 18 journals. Linear-regression estimates show gene name errors in supplementary files have increased at an annual rate of 15 % over the past five years, outpacing the increase in published papers (3.8 % per year). “In conclusion, we show that inadvertent gene name conversion errors persist in the scientific literature, but these should be easy to avoid if researchers, reviewers, editorial staff and database curators remain vigilant.”
The root problem is that the researchers who uploaded those files NEVER checked them. Once you have seen this happen, you’re sensitised to it and can take steps to avoid it in future.
The reason they never check is that Excel is seen as a simple application, not as programming, software development or data management. People don’t learn how to use it any more than they would take courses in Word or PowerPoint. This point-and-click approach gives rise to attitudes like the one expressed in a comment on a Slate article: that Excel should just take the input and not change it. The problem with holding such naïve, unvalidated expectations about any software package is that you run the risk of publishing career-limiting mistakes.
There’s a skills certification called “Spreadsheet Safe” and one of its points is “validate CSV file imports.”
I contacted the authors and Mark Ziemann kindly sent me three example files. The solution is so easy … when importing that CSV/TSV/TXT file from the data capture instrumentation, use the “Get External Data” button, set the data type for the gene columns and other IDs to Text, and that’s it. A few seconds’ work.
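If the file is also handled in a script, the same principle (treat identifiers as text) can be sketched in Python. This is a minimal sketch using only the standard library; the sample data and the column names other than “geneid” are hypothetical stand-ins, not the contents of the example files.

```python
import csv
import io

# Hypothetical sample standing in for an instrument export (not one of the example files).
SAMPLE = "geneid,logFC,pvalue\nMARCH1,1.25,0.003\nSEPT11,-0.80,0.020\nTP53,2.10,1e-320\n"

def read_as_text(handle):
    """Return the rows as dicts of strings; csv never converts cells to dates or numbers."""
    return list(csv.DictReader(handle))

rows = read_as_text(io.StringIO(SAMPLE))
for row in rows:
    # Every cell is still the exact text that was in the file.
    print(row["geneid"], row["logFC"], row["pvalue"])
```

Because every cell stays a string, any conversion to a number has to be requested explicitly, column by column, which is the same decision the Text Import Wizard asks you to make.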
Mark Ziemann says “Most, but not all of these data files are imported as text or csv formats from instrumentation such as DNA sequencers, gene microarrays or proteomics screens. Many of these files appear to be heavily modified from their original format and contain colour coding, modified column headers and additional columns. Some files are comparisons of 2 or more datasets or more in the same worksheet. A smaller number of files was simply a filtered list of gene names that could be a group of candidate genes for future analysis. As most error-containing files I screened also had accession numbers or other identifying information, the risks to altering the conclusions of the study by gene name errors are minimal, but embarrassingly common. Statistical analyses of large datasets is done in R, Matlab, Python, etc and its common to save the data in XLS file so that other researchers in the study can open and inspect the data. ”
Here’s how to work it:
First open the file as normal to eyeball it. I picked the CSV file first:
Firstly, D4:E6 look different from the rest of columns D and E. What’s happening there is that the numeric values fall outside Excel’s numeric range of approximately 2E-308 to 1E308, so they are stored as text. Let’s take a quick look at the other columns. On the Data ribbon tab, click Filter.
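A rough way to find such values before the file ever reaches Excel is sketched below in Python. It assumes the limits given on Microsoft’s “Excel specifications and limits” page (smallest positive number 2.2251E-308, largest 9.99999999999999E+307); the function name is made up for this illustration.

```python
from decimal import Decimal, InvalidOperation

# Limits as listed on Microsoft's "Excel specifications and limits" page.
EXCEL_MIN_POSITIVE = Decimal("2.2251E-308")
EXCEL_MAX_POSITIVE = Decimal("9.99999999999999E+307")

def outside_excel_range(text):
    """True if text is a non-zero number whose magnitude Excel cannot store as a number."""
    try:
        value = abs(Decimal(text.strip()))
    except InvalidOperation:
        return False  # not numeric at all, e.g. a gene name
    if not value.is_finite() or value == 0:
        return False  # NaN, infinity and zero are not range problems
    return value < EXCEL_MIN_POSITIVE or value > EXCEL_MAX_POSITIVE

for cell in ["0.05", "1e-320", "1e309", "0", "MARCH1"]:
    print(cell, outside_excel_range(cell))
```

Decimal is used instead of float so that extremely small p-values are compared exactly rather than being rounded by the parser.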
Click the dropdown for “geneid” first:
Well, that shows something is amiss – what are dates doing there? Pick the first one and apply that filter:
So, one gene name has been changed to a date and displayed as Sep-11; that’s 1-Sep-2011. You can check some more, but you should now realise that this column needs to be imported as Text (and maybe others too, although in this specific case the rest all look like they should be numbers). On the Data tab, in the Get External Data group, click From Text. In fact, if you were opening a .txt or .tsv file, Excel would offer you this Text Import Wizard right away rather than just opening it as it does with .csv files.
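The filter check can also be approximated in a script, which helps when there are many files to screen. The Python sketch below flags values that look like Excel date displays such as Sep-11 or 1-Mar. The pattern is a heuristic written for this illustration; it is not Excel’s own logic, nor the screening script used in the paper.

```python
import re

MONTHS = "jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"

# Heuristic: matches displays such as "Sep-11", "1-Mar" or "2-Sep-16".
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})(?:-\d{{2,4}})?|(?:{MONTHS})-\d{{1,2}})$",
    re.IGNORECASE,
)

def date_like_values(values):
    """Return the values that look like dates rather than gene identifiers."""
    return [v for v in values if DATE_LIKE.match(v.strip())]

# Hypothetical column contents for illustration.
gene_column = ["TP53", "Sep-11", "1-Mar", "BRCA1", "2-Sep-16"]
print(date_like_values(gene_column))
```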
Yes, it’s Delimited by commas, so click Next and check the Comma box:
Click Next again:
This is it. See where Excel says it converts date values to dates? Anything that looks like a date value in your locale, like MARCH1 or SEPT11, will become a date. So override that by selecting the gene column and choosing the Text option:
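To see in advance which identifiers are at risk, a list of names can be screened before it goes anywhere near Excel. The Python sketch below is a rough heuristic, assuming the risky pattern is a month name or abbreviation followed by a one- or two-digit number. Excel’s actual date recognition depends on locale and is broader than this.

```python
import re

# Month names and abbreviations; an illustrative list, not Excel's full rule set.
MONTH_WORDS = (
    "JAN|FEB|MARCH|MAR|APRIL|APR|MAY|JUNE|JUN|JULY|JUL|"
    "AUG|SEPT|SEP|OCT|NOV|DEC"
)
AT_RISK = re.compile(rf"^(?:{MONTH_WORDS})-?\d{{1,2}}$", re.IGNORECASE)

def at_risk_names(names):
    """Return the names that resemble a month followed by a day number."""
    return [n for n in names if AT_RISK.match(n.strip())]

# Hypothetical input list for illustration.
print(at_risk_names(["MARCH1", "SEPT11", "TP53", "DEC1", "MARC1"]))
```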
Click Finish and specify where you want it to go:
I usually like to click Properties and uncheck “Save query definition” as otherwise a data link is created in this workbook so that you can refresh it later from the same source file.
Click the Filter button again and check the “geneid” column: it now looks clean. The text data in D4:E6 is still there, and because Excel can’t handle those values as numbers you will have to decide how to treat them.
To further check on the data, I would add a sheet to calculate for each column the count of all values, of numbers, of blanks, and the minimum, maximum, average, and sum of the numeric data and the lengths of the text data. That would tell me of other oddities that need investigation such as missing or out-of-range data. A check for duplicates reveals that eight of the gene IDs are duplicated in this sample.
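The same per-column checks can be sketched outside Excel. The Python below is a minimal illustration on hypothetical data (only the “geneid” header comes from the walkthrough); it reports counts, blanks, basic statistics for numeric cells, text lengths, and duplicated IDs.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; only the "geneid" header comes from the walkthrough.
SAMPLE = """geneid,logFC,pvalue
TP53,2.10,0.001
MARCH1,1.25,0.003
TP53,1.90,0.040
SEPT11,,0.020
"""

def to_number(text):
    """Return text as a float, or None if it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

def summarise(rows, column):
    """Counts and basic statistics for one column of text cells."""
    cells = [row[column].strip() for row in rows]
    filled = [c for c in cells if c]
    numbers = [n for n in map(to_number, filled) if n is not None]
    lengths = [len(c) for c in filled if to_number(c) is None]
    return {
        "values": len(filled),
        "blanks": len(cells) - len(filled),
        "numbers": len(numbers),
        "min": min(numbers, default=None),
        "max": max(numbers, default=None),
        "sum": sum(numbers),
        "average": sum(numbers) / len(numbers) if numbers else None,
        "shortest_text": min(lengths, default=None),
        "longest_text": max(lengths, default=None),
    }

def duplicates(rows, column):
    """Return the values that occur more than once in the column."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for column in rows[0]:
    print(column, summarise(rows, column))
print("duplicated ids:", duplicates(rows, "geneid"))
```

A column that should be numeric but reports a non-zero text length, or an ID column that reports any numbers, is a prompt to look more closely.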
If I had a reference list of gene names that should be in this sample, I could also do a lookup both ways to report gene IDs in the sample not in the reference, and those in the reference not in the sample. In any domain, there are always specific tests that can be performed. In addition, data can be formatted or highlighted in various ways to convey some significance.
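The two-way lookup amounts to two set differences. A minimal Python sketch, with made-up sample and reference lists:

```python
def compare_ids(sample_ids, reference_ids):
    """Return (IDs only in the sample, IDs only in the reference), each sorted."""
    sample, reference = set(sample_ids), set(reference_ids)
    return sorted(sample - reference), sorted(reference - sample)

# Hypothetical lists: the sample contains a gene name that was mangled into a date.
sample_ids = ["TP53", "1-Mar", "BRCA1"]
reference_ids = ["TP53", "MARCH1", "BRCA1"]

only_in_sample, only_in_reference = compare_ids(sample_ids, reference_ids)
print("in sample, not in reference:", only_in_sample)
print("in reference, not in sample:", only_in_reference)
```

A converted gene name tends to show up on both lists at once: the date appears as an unknown ID in the sample, and the original name appears as missing from it.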
What all that means is up to the expert in that domain, but it illustrates the tests that are easy to perform in Excel, and where there is really no excuse to skip the task of curating the data for presentation.
The paper’s lead author Mark Ziemann has kindly made available three files if you’d like to try it yourself: CSV, TSV and TXT files, compressed as zip archives.
He adds “They were generated after statistical analysis of gene expression data from Illumina DNA sequencing instruments or microarray instruments. It is common to sort these sheets by p-value to help end-users see what the most significant datapoints are. Duplicated gene names can be an issue for downstream work and it is common practise to keep only the most significant one.”
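The practice he describes, keeping only the most significant row for each duplicated gene name and sorting by p-value, can be sketched in Python as follows. The column names “geneid” and “pvalue” and the sample rows are assumptions for illustration.

```python
def keep_most_significant(rows, id_column="geneid", p_column="pvalue"):
    """Keep the row with the smallest p-value per ID, then sort by p-value."""
    best = {}
    for row in rows:
        key = row[id_column]
        if key not in best or float(row[p_column]) < float(best[key][p_column]):
            best[key] = row
    return sorted(best.values(), key=lambda r: float(r[p_column]))

# Hypothetical rows, kept as text the way a text import would leave them.
rows = [
    {"geneid": "TP53", "pvalue": "0.040"},
    {"geneid": "MARCH1", "pvalue": "0.003"},
    {"geneid": "TP53", "pvalue": "0.001"},
]
for row in keep_most_significant(rows):
    print(row["geneid"], row["pvalue"])
```

The p-values are converted to numbers only for the comparison, so the identifiers and the original cell text are left untouched.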
This sample media comment illustrates the contrasting views:
Excel Created Major Typos in 20 Percent of Scientific Papers on Genes
One of the comments reads “Even if I were to pretend this was actually Excel’s fault, I still can’t understand them being so stupid as to not double check their work before publishing. No excuse for that.”
Another says “I read the article and give not two whits for what the last sentence says. I shouldn’t have to “wrestle” with the software to convince it to just let me enter my data already and stop pretending like it knows what I want better than I do.”
The 2004 paper:
Zeeberg BR, Riss J, Kane DW, Bussey KJ, Uchio E, Linehan WM, et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004;5:80. http://dx.doi.org/10.1186/1471-2105-5-80
In 2012 this blog post says it’s still happening:
I guess there are always people yet to learn for the first time. Maybe spend $40 of their research budget on a good general Excel book by Walkenbach, or one of the specific Excel For Data Analysis books?
Even textbooks don’t always cover good practice. Amazon shows a number of books for “Excel Data Analysis”. Those would be more specialised than the general books critically reviewed here:
Journal of Education for Business 90(4):169-174 · March 2015 by Nathan Garrett
Garrett concludes: “Teaching Excel as a point-and-click tool, or only examining individual features in isolation, results in amateur excel programmers who understand features, but do not know how to tie these features together. The universally high error rates found in the field show the need for improving the state of instruction. Curriculum needs to reflect a professional approach. Without this shift, amateur work will continue to be the norm.”
Excel specifications and limits