Excel & Genes: mutation and curation

On 23 Aug a paper by three Australian authors gained much publicity about how Excel interprets data entry and how the users don’t check their data.


Gene name errors are widespread in the scientific literature. By Mark Ziemann, Yotam Eren and Assam El-Osta. Genome Biology 2016 17:177 DOI: 10.1186/s13059-016-1044-7 ©  The Author(s). 2016. Published: 23 August 2016

They wrote scripts to trawl through published papers on genetics which had data files attached, and checked those files for data errors. They screened 35,175 supplementary Excel files and confirmed gene name errors in 987 supplementary files from 704 published articles in 18 journals. Linear-regression estimates show gene name errors in supplementary files have increased at an annual rate of 15 % over the past five years, outpacing the increase in published papers (3.8 % per year). “In conclusion, we show that inadvertent gene name conversion errors persist in the scientific literature, but these should be easy to avoid if researchers, reviewers, editorial staff and database curators remain vigilant.”

The root problem is that the researchers who uploaded those files NEVER checked them. Once you have seen this once, you’re sensitised to it and can take steps to avoid it in future.

The reason they never check is because Excel is seen as a simple application, not programming or software development or data management. People don’t learn how to use it any more than they would take courses in Word or Powerpoint. So this point-and-click approach gives rise to attitudes like that expressed in a comment to a Slate article that Excel should just take the input and not change it. The problem with running with such naïve unvalidated expectations about any software package is that you then run the risk of publishing career-limiting mistakes.

There’s a skills certification called “Spreadsheet Safe” and one of their points is “validate CSV file imports.”


I contacted the authors and Mark Ziemann kindly sent me three example files. The solution is so easy … when importing that CSV/TSV/TXT file from the data capture instrumentation, use the “Get External Data button”, set the data type for the gene columns and other IDs to text, and that’s it. A few seconds’ work.

Mark Ziemann says “Most, but not all of these data files are imported as text or csv formats from instrumentation such as DNA sequencers, gene microarrays or proteomics screens. Many of these files appear to be heavily modified from their original format and contain colour coding, modified column headers and additional columns. Some files are comparisons of 2 or more datasets or more in the same worksheet. A smaller number of files was simply a filtered list of gene names that could be a group of candidate genes for future analysis. As most error-containing files I screened also had accession numbers or other identifying information, the risks to altering the conclusions of the study by gene name errors are minimal, but embarrassingly common. Statistical analyses of large datasets is done in R, Matlab, Python, etc and its common to save the data in XLS file so that other researchers in the study can open and inspect the data. ”

Here’s how to work it:

First open the file as normal to eyeball it. I picked the CSV file first:


Firstly, D4:E6 look different from the rest of columns D&E. What’s happening there is that the numeric values are outside Excel’s range of approximately 2E-308 to 1E308 so they are stored as text. Let’s take a quick look at the other columns. On the Data ribbon tab, click Filter.

Click the dropdown for “geneid” first:


Well, that shows something is amiss – what are dates doing there? Pick the first one and apply that filter:


So, one gene name has been changed to a date and displayed as Sep-11; that’s 1-Sep-2011. You can check some more, but you should now realise that this column (and maybe others, although in this specific case they all look like they should be numbers) needs to be imported as Text. On the Data tab, Get External Data, click From Text. In fact, if you were opening a .txt or .tsv file, Excel would offer you this Text Import Wizard right away rather than just opening it as it does with .csv files.


Yes, it’s Delimited by commas, so click Next and check the Comma box :


Click Next again:


This is it. See where Excel says it converts date values to dates? Anything that looks like a date value in your locale, like MARCH1 or SEPT11, will become a date. So override that by checking the Text option:


Click Finish and specify where you want it to go:


I usually like to click Properties and uncheck “Save query definition” as otherwise a data link is created in this workbook so that you can refresh it later from the same source file.

After using the Filter button and checking the “geneid” column, that now looks clean. There is still the text data in D4:E6 but Excel can’t handle that as numbers so you’ll have to know what to do with it.


To further check on the data, I would add a sheet to calculate for each column the count of all values, of numbers, of blanks, and the minimum, maximum, average, and sum of the numeric data and the lengths of the text data. That would tell me of other oddities that need investigation such as missing or out-of-range data.  A check for duplicates reveals that eight of the gene IDs are duplicated in this sample.

If I had a reference list of gene names that should be in this sample, I could also do a lookup both ways to report gene IDs in the sample not in the reference, and those in the reference not in the sample. In any domain, there are always specific tests that can be performed. In addition, data can be formatted or highlighted in various ways to convey some significance.

What all that means is up to the expert in that domain, but it illustrates the tests that are easy to perform in Excel, and where there is really no excuse to skip the task of curating the data for presentation.

Example files

The paper’s lead author Mark Ziemann has kindly made available three files if you’d like to try it yourself. CSV, TSV and TXT compressed zip files.
example1.csv.zip 547K

example2.txt.zip 801K

example3.tsv.zip 480K

He adds “They were generated after statistical analysis of gene expression data from Illumina DNA sequencing instruments or microarray instruments. It is common to sort these sheets by p-value to help end-users see what the most significant datapoints are. Duplicated gene names can be an issue for downstream work and it is common practise to keep only the most significant one.”

Other references

This sample media comment illustrates the contrasting views:
Excel Created Major Typos in 20 Percent of Scientific Papers on Genes

One of the comments reads “Even if I were to pretend this was actually Excel’s fault, I still can’t understand them being so stupid as to not double check their work before publishing. No excuse for that.”

Another says “I read the article and give not two whits for what the last sentence says. I shouldn’t have to “wrestle” with the software to convince it to just let me enter my data already and stop pretending like it knows what I want better than I do.”




The 2004 paper:

Zeeberg BR, Riss J, Kane DW, Bussey KJ, Uchio E, Linehan WM, et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004;5:80. http://dx.doi.org/10.1186/1471-2105-5-80

In 2012 this blog post says it’s still happening:


I guess there are always people yet to learn for the first time. Maybe spend $40 of their research budget on a good general Excel book by Walkenbach, or one of the specific Excel For Data Analysis books?

Even textbooks don’t always cover good practice. Amazon shows a number of books for “Excel Data Analysis”. Those would be more specialised than the general books critically reviewed here:

Journal of Education for Business 90(4):169-174 · March 2015 by Nathan Garrett


Garrett concludes:   “Teaching Excel as a point-and-click tool, or only examining individual features in isolation, results in amateur excel programmers who understand features, but do not know how to tie these features together.  The universally high error rates found in the field show the need for improving the state of instruction. Curriculum needs to reflect a professional approach. Without this shift, amateur work will continue to be the norm.”

Excel specifications and limits



Posted in Excel/VBA, Research, Risk, Software Testing | 4 Comments

Eusprig 2016 July 7, London

The European Spreadsheet Risks Interest Group 17th Annual Conference
(EuSpRIG 2016) will be held at the Institute of Chartered Accountants in
England and Wales (ICAEW), Moorgate, London, on the 7th July 2016.


The programme is now available. Register now to take advantage of the early bird discount! Here are the titles of the papers:

     Teaching methods are erroneous: approaches which lead to erroneous 
end-user computing.  Mária Csernoch, Piroska Biró

     Characteristics of Spreadsheets Developed with the SSMI 
Methodology.    Paul Mireault

     A Programmatic Approach to the development of Solutions in Excel. 
Peter Bartholomew

     A Conceptual Model for Measuring the Complexity of Spreadsheets. 
     Thomas Reschenhofer, Bernhard Waltl, Klym Shumaiev, Florian Matthes

     A Pilot Study Exploring Spreadsheet Risk in Scientific Research     
Ghada AlTarawneh, Simon Thorne

     The use of the Power Query / Get & Transform tools in Excel     
Simon Hurst

Registration and other information: http://www.eusprig.org

In other news:

At 18:00 the previous evening, Wed July 6th, the ICAEW launch their
Spreadsheet Competency Framework, a new guide for classifying
spreadsheeting ability in finance professionals. The event features
speeches from sector leaders and experts, and a Q&A session where
attendees can put their questions to a panel of our speakers and the
framework’s authors. There will be a drinks reception afterwards.

ICAEW Spreadsheet Capability Framework Launch

Posted in Excel/VBA, Research, Risk, Software Testing | Tagged , , | Leave a comment

Modeless Userform not terminating

One way to display status messages as a long procedure runs is to show a modeless userform and update it with text or a progress bar showing percentage completed.
I was having a problem with Excel crashing, particularly Excel 2013, and I traced it to a userform like that. Here is a sample:


Run the ShowProblem sub.
1. With the button visible, there is no problem. The userform is terminated when the oFrm variable goes out of scope after the form is Hidden.

2. With the button not visible, the form terminate event does not fire. The userform is left in memory, (iow it is not removed by the garbage collector) and its window can be seen by WinSpy++ (class ThunderDFrame) but it is no longer accessible because its object pointer has gone out of scope.
In the addin I was working with, after a number of such windows accumulate, and in Excel 2013 they are owned by a workbook rather than the main Excel window, Excel crashes when the workbook is closed and the windows start to terminate one after the other.

3. In the demo file, run the sub twice in a row, now two ThunderDFrame windows appear in WinSpy++, and an orphaned Excel process is left behind with no App window.

The way to avoid that is to use an explicit Unload command which is better practice anyway than relying on the combination of .Hide and letting oFrm go out of scope.
According to Rory Archibald, “[it] stems from not having any controls on the form that can take focus.”
It’s the kind of thing that wastes hours of debugging.


In previous versions of Excel, forms were always owned by the Excel main window (XLMAIN). Excel 2013 introduced SDI so now forms are owned by the last opened document window so we have to resort to API kludges to change the owner, like this helpful contribution from Jan Karel Pieterse:

MSDN: Window Features, definitions of Owner and Parent

Note: GWL_HWNDPARENT changes the OWNER, not the parent, of a window

Greg Maxey’s MS Word page: Userform – Advanced Notes and Tips

Posted in Excel/VBA, Uncategorized | Tagged , | Leave a comment

VBA Code Tracer / Profiler

Simple VBA code tracing

To track down a problem with Excel crashing I needed to get a debug trace of every procedure so I could find which one was causing the crash and I could then single step through it with F8. The simplest method I found was to insert a line at the top of every procedure which initialises an object from a user-defined class.  There is a Start method to set the proc name as we can’t pass that to the class Initialize event. The inserted line to create an object variable named P_ looks like this, immediately after each Sub or Function declaration:

Sub MyProc(param1 as long,etc)
Dim P_ As New Profiler: P_.Start "MyProc"

When this object goes out of scope when the sub exits, its Terminate event is called to record the exit. The class methods simply OutputDebugString the procedure name and a 1 for a start or -1 for the exit. I use OutputDebugString because I can see what the output is in DebugView whereas with Debug.Print the immediate window is lost when Excel crashes. I could write to a text file but the need to close and reopen it every time is slower than DebugView which itself adds quite an overhead.
I have also developed the idea to add a tracing line to each numbered line of code to be able to track in greater detail, particularly when there are long loops where F8 is a tedious way to find the point of a crash. If there is interest, I may post that too.

Here is the VBA code to inject the tracing calls. Copy and paste the code into a standard module and add a reference in the VBA Editor to Microsoft Visual Basic for Applications Extensibility 5.3.

http://www.sysmod.com/modVBATracer.txt                  (Updated 8-Apr-16)

It outputs lines to DebugView which you can then save to a file. The lines look like:

00005595 17883.11132813 ,Module1.MyProc,1
00005596 17883.23456789 ,Module1.MyProc,-1
The first number is the DebugView sequence number, the next after a tab character is the time. The procedure name is prefixed by the module name, as there can be several procedures (especially event procedures) with the same name. The 1 marks the start of the proc and the -1 is the exit. You can analyse that data to see which procedure is called most often, or takes the longest.


Posted in Excel/VBA, Software Testing, Uncategorized | Leave a comment

Excel Options for updating external links

There are many options in Excel that control this depending on the kind of link it is. I’ll cover the usual external links in formulas first, then finish up with the other kinds.

An external link formula shows the full path to the source if that workbook is closed, otherwise it shows only the workbook name in the path. For more information, see the MS article ‘Create an external reference (link) to a cell range in another workbook


You can link to cell addresses or names; obviously named ranges are safer because addresses can be affected by insertions or deletions in the source worksheet. To find where links are used in a workbook, search in formulas for “.xl*!”; that is, an Excel file name followed by an exclamation mark.

As well as intentionally created links you enter in formulas, links can be created by Excel when you copy sheets from one workbook to another, and those sheets contain references to other sheets either obviously in formulas or less obviously in range names. If you get a prompt to update links when you open a workbook, but you cannot find any cells containing links, the link is probably in a name. Use the Name Manager (either Excel’s own or the free third-party JKP Name Manager ) to check all names for links, particularly phantom links which are not used but clutter up the list.

To control whether values are updated when the workbook containing the links (the ‘target’ workbook) is opened, use the Edit Links button in the Data tab of the Ribbon. Click the ‘Startup Prompt’ button, and select one of these three:

1: Let users choose to display the alert or not

Now, what happens at startup depends on another setting:

Options > Advanced > General: Ask to update automatic links.  (VBA: Application.AskToUpdateLinks). If true, the user is alerted and can choose whether to update or not. If false, the links are updated.

2: Don’t display the alert and don’t update automatic links

Links are not updated at startup, regardless of the setting of Application.AskToUpdateLinks.

3: Don’t display the alert and update links

Links are updated at startup, regardless of the setting of Application.AskToUpdateLinks.

In VBA, ThisWorkbook.UpdateLinks has the values 1,2, or 3 above.


If the source workbooks are open at the same time as the target, the formulas are part of the recalculation chain as normal. The link values from closed files are not refreshed in a recalculation. Excel provides no way to automatically refresh the values of external link formulas to CLOSED workbooks. To manually update values at any time while using a workbook, use Edit Links, and for each link click ‘Update Values’. You can add a macro that calls ActiveWorkbook.UpdateLink. To update all links, pass it the array of link sources as follows:

ActiveWorkbook.UpdateLink _

If you omit the parameter, by default Excel uses xlLinkTypeExcelLinks. The other type of XlLInkType is xlLinkTypeOLELinks, see below for those.

If any links are bad, that raises an error. To update them one by one and trap the error:

Sub UpdateAllLinks()
   UpdateLinks xlLinkTypeExcelLinks
   UpdateLinks xlLinkTypeOLELinks
End Sub

Sub UpdateLinks(LinkType As XlLinkType)
Dim vSource As Variant
Application.DisplayAlerts = False ' comment out to see the prompt for missing link sources
For Each vSource In ActiveWorkbook.LinkSources(LinkType)
   On Error Resume Next
   Debug.Print vSource
   Application.DisplayAlerts = False
   ActiveWorkbook.UpdateLink Name:=vSource, Type:=LinkType
   If Err <> 0 Then
      Debug.Print "^Error "; Err; Err.Description
   End If
Next vSource
Application.DisplayAlerts = True
End Sub

You could call that macro in the Worksheet_Calculate event to update link values every time a recalculation happens. That will slow down performance, particularly for links over a network rather than to local drives.

To automate the calling of this macro at regular intervals, you can use the .Ontime method. Again, be aware when you do this of the length of time it takes to update the values.

In the ThisWorkbook class module:


Sub Workbook_Open()
      Auto_UpdateExcel Links
End Sub

In a standard module:

Sub Auto_UpdateExcelLinks()
      Application.OnTime DateAdd("s",10,Now()), "Auto_UpdateExcelLinks")
End Sub


Other ways to link to external data

OLE: Object Linking and Embedding can be used for other Office applications such as linking Word or PowerPoint to Excel. You could link Excel to Word if you like, odd though it looks. Copy the data in Word then in Excel Paste Special, Link, and select the format you want. ‘HTML’ suits Word tables, ‘Microsoft Word Document Object’ inserts an object with the formatting as in Word.

The Edit Links dialog allows you to choose Automatic or Manual update only for OLE links. Excel links always have the Manual option button greyed out. If it is automatic then it is refreshed live in Excel as soon as the source document is changed, even without saving the document. If it is manual then it can be updated through Edit Links, select the link, click Update Values.

The Excel setting ‘Options > Advanced > When calculating this workbook: Update links to other documents’ controls the updating of OLE links only. (VBA: ThisWorkbook.UpdateRemoteReferences ) When unchecked, OLE links do not update, even if you select ‘Update Values’ in the Edit Links dialog for that link.


Connections: set up by the Get External Data button. These create queries which are refreshed by the Refresh button. You can link to text files (eg CSV), other workbooks, databases, and web pages.  For further details see MS support and many other websites:



RTD / DDE: Real Time Data and Dynamic Data Exchange. These are ways of getting data from COM instances on a server. For more information, see MS support and the Wilmott forums.



Macros: Data can be read and written by VBA code with Workbooks.Open, Querytables, ADO and SQL. Worksheet shapes can contain links to macros in other workbooks, so clicking them may cause Excel to attempt to open those other workbooks.


Posted in Excel/VBA, Uncategorized | Tagged | 5 Comments

A first look at the FUSE spreadsheet corpus

Following on the first paper by Titus Barik et al
and some work by Mark Townsend analysing the last row and column used in each file

I downloaded the 7GB of 249,376 files and did some summary analysis of them and the VBA.

The top domain is .org (29.5%),followed by .gov (27.7%)
That’s because almost half the files are from one web site – triathlon.org. They look like files that were filled in for reporting purposes, and so contain no formulas.
Files which are simple web report downloads or automatically generated (eg quickfacts.census.gov) then they were not user-created spreadsheets at all, and so of no interest to me.
5,600 have “SpreadsheetGear” as a write access user, all from worldbank.org.

So most of the FUSE spreadsheets are of no interest to me in formula error research.
There are no .xlsm files although a simple google finds 106,000.

Of the 5037 web hosts, http://www.triathlon.org accounts for 106328 files, or 43% of the total.
http://www.triathlon.org 106328
quickfacts.census.gov 47025
theahl.com 10350
The top 3 account for 66% of the files, the top 50 (1% of the hosts) have 87% of the files.
So it’s pretty skewed towards a few domains.

The POI analysis can not handle Biff5 files, but they can be processed in Excel if you relax the File Block settings.
The top 80% of files have no formulas or very few, again because they are really data files.
12854 (5.15%) have formulas.

Only 737 had VBA code, and 472 of them had unique VBA content as determined by a MD5 hash.
They have a range of typically 10 to 2000 lines of code.
102 have “Macro recorded by…” and no Dim statements
Only 78 of 472 have Option Explicit

I have prepared a slide deck of the findings, available at:
http://www.sysmod.com/vbainfusecorpus-pobeirne.pdf (185K PDF)

Does this interest anyone?

Posted in Excel/VBA, Research | Tagged , , | 1 Comment

SoftTest Ireland Conference 2015 – Future Proof your Software Testing

SoftTest Ireland in association with the ISA Software Skillnet are hosting a one-day national conference for software test professionals.

Date: Wednesday 23rd September 2015 9am – 5.30pm
Venue: Clyde Court Hotel, Ballsbridge, Dublin 4

Cost: €100.00 per delegate
Book here: https://softtest2015.eventbrite.co.uk/

Software testing is experiencing an industry disrupting revolution, driven by rapidly evolving technologies and software delivery paradigms.
Understand the critical technologies to be learned, key skills to be developed and the professional persona required by testers to further enhance their value into the future.

Explore the future of testing with expert speakers on topics such as;
The Internet of Things
Cloud Test Ops
Test Automation
Personal professional development

9 Speakers: Paul Gerrard, Chris Ambler, Colm Harrington, Declan O’Riordan, Claire Goss, Augusto Evangelisti, Stephen Janaway, Jonathan Wright, Matt Wynn

t: @softtestireland

SoftTest Ireland

Interest group for software testers in Ireland

Posted in Software Testing | Tagged , | Leave a comment

VBA in the spreadsheets from the Enron email corpus

The European Spreadsheet Risk 2015 Conference papers are now available at

My presentation was on investigating the use of VBA in spreadsheets in the Enron email corpus.

Click to access VBA%20in%20spreadsheets%20from%20Enron%2CPO%27Beirne%202015.pdf

Most of the slides deal with the mechanics of how I did it, and statistics on the 538 workbooks found with unique VBA content. The network graph I did for interest was #madewithgephi.

Workbook VBA code similarity

Workbook VBA code similarity

Some conclusions I came to are:

1) The workbooks are probably not typical of the routine mass of everyday spreadsheets.
If people email spreadsheets to others, I infer that they don’t have a shared folder on the network. Therefore these workbooks are for communication. So, they are probably not an insight into the real ‘dark matter’ of EUC that stays in shared folders and never emailed – eg routine accounting workbook.

2) Apart from simple static analysis which gives a general indicator of code quality, it is very difficult to say if the VBA contains errors.
The real test is in execution of the code. But we cannot reproduce the environment in which these workbooks were created. The files we have are probably circulated as reports for reading, so after the execution of the code. Therefore do not have the preconditions to run the code. Static analysis like TM-VBA inspector only goes so far.
Code inspection can raise questions of unsafe practices and assumptions, but is time consuming.

Posted in Excel/VBA, Research | Tagged , , , , | Leave a comment

XLTest 1.55 released

I have updated my spreadsheet auditing addin.
Changes in version 1.55
1.    On startup, it checks and fixes autonumber seed problems in database
2.    Processing limit of 1 million cells made an option
3.    ExportActiveProject procedure runnable by VBA automation
4.    VBA Sort Procedures option
5.    Option to list all sheet statistics separately when scanning files
6.    Demo / 30-day evaluation version made available
7.    Utility option to display current autofilter criteria
8.    Numerous usability tweaks


In order to make a time limited trial available I am now protecting it using the ‘Unviewable VBA’ app from Petros Chatzipantazis:


He also markets a Ribbon Commander addin intended to make Ribbon customisation easier than XML.

Posted in Uncategorized | Tagged | Leave a comment

Excel VBA Copy Range, Paste as Table to Powerpoint 2010/2013

So, I’ve spent the last couple of weeks trying to get that one right.

Piecing together all the hints from stackoverflow and the rest, I find that I have to use the ExecuteMso method available in PP2010 and later, and use Doevents to give PP time to act in order to avoid raising errors.

In this code, mPPAppObject is the Powerpoint Application object accessed from Excel VBA eg Set mPPAppObject = GetObject(, "PowerPoint.Application")

Assuming we start with rng.Copy then if the paste is to be a table then

for i=1 to 500:Doevents :next ' 5 hundred, about 10ms timing

‘Otherwise get a Clipboard Error -2147188160 Shapes (unknown member) : Invalid request.  Clipboard is empty or contains data which may not be pasted here.

mPPAppObject.CommandBars.ExecuteMso "PasteExcelTableSourceFormatting" ' 2010+


for i=1 to 5000:Doevents :next ' 5 thousand, about 100ms
‘Otherwise calling code testing the slide.shapes collection does not see the pasted shape yet, or it does not see that the shape has a table (oShape.HasTable)

Set oShape = mPPSlide.Shapes(mPPSlide.Shapes.Count)

If we want some other paste format, then it’s a bit easier:

Set oShape = mPPSlide.Shapes.PasteSpecial(PasteDataType)

‘Where PasteDataType can be

‘ 0 ppPasteDefault ‘ truncates to correspond to visible on worksheet window
‘ 2 ppPasteEnhancedMetafile pastes correctly with full width
‘ 3 ppPasteMetafilePicture ‘ same as default, *truncated if too wide*

‘ 7 ppPasteText
‘ 8 ppPasteHTML ‘ does not work, use mPPAppObject.ActiveWindow.View.Paste

In PP 2013 I sometimes found that more than one shape would be pasted, so:

' oShape.Count property does not exist for correctly pasted shape. So use errtrapped function
If ObjectCount(oShape) > 1 Then ' handle pp 2013 bug
'Debug.Print ">1 shape pasted: "; TypeName(oShape); oShape.Count ' ShapeRange 2
Set oShape = oShape(1)
End If

Function ObjectCount(obj As Object) As Long
If obj Is Nothing Then
ObjectCount = 0
ObjectCount = 1
On Error Resume Next
ObjectCount = obj.Count
'could do Err.Clear here if you like
End If
End Function

What a kludge to get something done that should be easily accessible from the COM model.

Posted in Uncategorized | Tagged , , , , | 22 Comments