The story broke around October 5th that Public Health England temporarily lost 16,000 Covid test results because they did not notice that the XLS data format they were using was insufficient to process the number of results they were getting. The risk is that lives could be lost because Covid cases were not being traced and followed up. The BBC report is probably the best summary:
“[The Covid testing labs] filed their results in the form of text-based lists – known as CSV files – without issue. PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system. The problem is that PHE’s own developers picked an old file format to do this – known as XLS. As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of. And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases. When that total was reached, further cases were simply left off.”
Most commentary piled on to the use of Excel at all as “inappropriate”. Wired has an interview with spreadsheet warriors on the Eusprig mail list on how they face daily support tasks like this. Maths comedian Matt Parker did a video on spreadsheet errors and on EuSpRIG.
I want to examine this story from the point of view of data and spreadsheet control and wonder about some unstated angles.
TL;DR: if they had applied basic data controls, they could have detected the problem as soon as it happened rather than lose precious days. Whenever data is being transformed from one system into another, there are many opportunities for data going missing, being duplicated, or corrupted. Whatever the skill level of the developer, nobody can know in advance what data might actually arrive in use and every possible failure mode. Any such system, whatever the technology used to implement it, should have at a minimum a simple check that the number of records output reconciles to the number input; and ideally other checks on data quality and hash totals.
First of all, why would they choose XLS? Possibly because it was the easiest choice for upload to their central system. For example, 32-bit SQL Server can import XLS (65536 rows) and XLSX (1M rows) files. For XLSX however they would have had to register the OLE DB provider Microsoft.ACE.OLEDB.12.0 and that may have looked like work for them and they thought that “we’ll never have more than 65K cases per day”. But although that choice of XLS might well be adequate (at the moment!) for upload, the fatal flaw was that the import into XLS would need many more lines. [Edit: I now think they really were using Excel 2003 or earlier.]
I’ve done work for labs and written apps in Excel and VBA to process the data logs from instruments. The article refers to them as CSV files and one might think that implies a text file with a header line with field names, one row per record, fields separated by commas, see RFC 7111.
However, the fact that only about 1400 cases fitted into 65K rows makes it look more likely that these are simple text files with one field and value per line. Or maybe it’s a combination of irregular numbers of fields per line, we don’t know. I would expect that depending on the exact number of lines per record, 46 or 47, the LAST record in the file would therefore be incomplete, some fields would be missing. So there are incomplete records in the database too.
Neither do we know what process they used to import the data in the first place – text file open, Get Data from Text, VBA Querytable, copy & paste, whatever. [Edit: I now think they used Workbooks.Open in a macro].
We don’t know what they did to transform the data from the lab output format into the regular table-like structure that an upload to a database would require. Maybe a mess of INDEX or VLOOKUP formulas; maybe a simple VBA macro that ran down the column of imported text and wrote out the samples in columns. Maybe they had to transform field values from codes into numbers and handle special values like “n/a”, “<0.001”, and such like.
They could have processed it in an XLSM with 1M rows but saved the normalised table as an XLS file, thus satisfying both requirements of processing a large number of lines and also saving as the most convenient format for upload; but they didn’t. Even sticking with an XLS file with macros, they could have read the lab file line by line, assembled each record in-memory, and write it out to the target sheet, without needing to populate a 65K sheet with imported data; but it looks like they didn’t.
But what puzzled me is this: I had thought that whatever method is used to import data in Excel, even in Excel 2003 with XLS files, the application would raise an error if you tried to import more lines; but there was no report of the problem being flagged by an error report, but by people checking the numbers.
If you use the menu to File > Open a text file in Excel 2003 with more than 65536 lines, or try to copy and paste from another application, it raises an error message “File not loaded completely”. If you import in any version of Excel into a worksheet in compatibility mode (ie 65536 rows) it would report ” The text file contains more data than will fit on a single worksheet”. So why did this not happen? (Spoiler: I discovered why later, see the postscript.) This was, in Sherlock Holmes terms, the dog that did not bark in the night.
I tried to imagine scenarios involving XLSWriter or Apache POI trying to create XLS files with more than 65K rows; I worked out an unlikely scenario involving copying to the clipboard from another application and in VBA using Worksheets(1).Paste which does not raise an error, it truncates the data. Other Excel experts suggested that the PHE developers simply suppressed Application.DisplayAlerts or On Error Resume Next, or even deliberately imported line by line with a cut off at 65535. They point to a variation of “Hanlon’s razor”: “Don’t ascribe to obscure complexity what can be explained by incompetence”, and to the collection of reports on thedailywtf.com and theregister.com. And Arthur Conan Doyle wrote “Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth”. [Edit: Another conclusion, which I thought improbable at first, is that this public health service is really so starved of IT resources that they have to use unsupported versions of Excel more than 13 years old.]
However, it was much simpler than that, and only after trying the most obvious thing which I missed the first time I posted this, I found it: VBA Workbooks.Open “file.txt” will load as much data as it can WITHOUT raising any error. No On Error Resume Next needed. So this is in my view an Excel failure, a defect, because the VBA behaviour is not consistent with the UI behaviour. Still, you should not assume your technology will always work perfectly and fail to put in controls around your processes.
More like Occam’s razor really: pick the simplest explanation over the more complex one.
[Update 14-Oct-2020: what I now think really happened]
Update 22-Oct-2020: I found a Sky News article which gave a bit more information:
“The problem is while the Second Generation Surveillance System (SGSS), as it’s called, is plugged into the existing pathology labs that pre-date COVID-19, for some reason no-one has yet managed to connect it to the new labs which have been created during the course of this pandemic. […] Rather than feeding its results securely straight into SGSS, as Pillar 1 pathology labs do, the test results from Pillar 2 labs arrive at PHE in the form of a CSV file.” So Pillar 1 are using secure transfer, maybe in HL7 format, but the new system masterminded by consultants at £7000/day uses CSV, like the USA CDC?