Querying Text/CSV files in VBA, Method #3: Power Query

The module PowerQueryText.bas contains the code to query delimited files and return tables.

The advantages of Power Query are:

  1. It can handle comma- and tab-separated values and does not need a schema.ini as ADO does. You do, however, need to specify the separator character; see below for the details.
  2. Tab-delimited files do not have to have the extension “.tab” as they do for ADO, but can be named with a “.tsv” extension if that’s what you have.
  3. It can handle line breaks within quotes, or not, depending on the option you set. I always include line breaks within quotes.
  4. Its default encoding is 65001 (UTF-8), which has handled every text encoding I have tested it with. The code page for UTF-16 (UCS-2) is 1200.

In the ADO module, I determined the field separator by reading the file contents as bytes and using a function CountColumnsFromBytes to infer the number of columns and their separator. In Power Query, there is a function Binary.InferContentType that offers statistics on possible delimiters, and your code can choose the most likely one.
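To give a flavour of what the module builds once the delimiter is known, here is a minimal sketch (not the module's exact code; the path is made up) of the kind of M expression such a query boils down to for a comma-delimited UTF-8 file. The Delimiter, Encoding and QuoteStyle options correspond to the points above:

Dim mFormula As String
' QuoteStyle.Csv keeps line breaks inside quoted fields; QuoteStyle.None would split them
mFormula = _
    "let" & vbCrLf & _
    "    Source = Csv.Document(File.Contents(""C:\data\sample.csv""), " & _
        "[Delimiter="","", Encoding=65001, QuoteStyle=QuoteStyle.Csv])," & vbCrLf & _
    "    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars=true])" & vbCrLf & _
    "in" & vbCrLf & _
    "    Promoted"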

At the moment, I have to send the output to a worksheet. If anyone finds a way to return the table data as a memory structure in VBA such as a recordset, let me know.

The functions contained in PowerQueryText.bas are:

testPQcsv    – a simple test stub that prompts for a file name and queries it. It calls the following functions.

InferContentType – needed to infer the encoding and field delimiter (separator) character.

TableProfile – returns statistics about a file to help determine the number of rows and columns.

PQMCsvFormula – a generic function to return the M expression for a delimited file.

PQDelimitedText – runs a query and outputs the result.

WBQueryToSheet – sends the data from a query to a sheet table range.

WBQueryToListTable – sends the data from a query to an Excel ListObject (table); see the sketch below.
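For anyone who wants to see the shape of the last two without downloading the module, here is a minimal sketch of the standard pattern for registering a query and loading it to a table (assuming Excel 2016 or later; the names are illustrative, not the module's actual code):

Sub DemoLoadQueryToTable(ByVal queryName As String, ByVal mFormula As String)
    Dim ws As Worksheet, lo As ListObject
    Set ws = ActiveWorkbook.Worksheets.Add
    ' register the M expression as a workbook query
    ActiveWorkbook.Queries.Add Name:=queryName, Formula:=mFormula
    ' load the query result into a ListObject via the Power Query OLE DB provider
    Set lo = ws.ListObjects.Add(SourceType:=xlSrcExternal, _
        Source:="OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;" & _
                "Location=" & queryName, _
        Destination:=ws.Range("A1"))
    With lo.QueryTable
        .CommandType = xlCmdSql
        .CommandText = Array("SELECT * FROM [" & queryName & "]")
        .Refresh BackgroundQuery:=False
    End With
End Sub

Call it with a query name of your choice and the M text from the earlier sketch.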

That is the end of this series on reading delimited text files in Excel VBA. I hope you find it useful; let me know of any bugs that need fixing, or suggest any improvements.


Querying Text/CSV files. Method #2: ADO Recordset

The module ADOQueryText.bas has the functions to read a delimited text file into a worksheet. It uses an ADODB recordset to perform the query. The pros and cons of ADO are:

  1. It can handle embedded newlines in data, unlike a QueryTable. These are common in CSV files from sources (e.g. surveys) where users enter free text over multiple lines, and the text is returned in one field.
  2. It cannot open Unicode file names, so to be safe you must copy the original file to a temporary copy with an ANSI filename.
  3. It requires a schema.ini file to recognise fields delimited by tabs; see the sketch after this list. To avoid this, you could use the ReplaceBytes() function in the QueryTables module from the previous blog post to replace all tab delimiters with commas.
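For illustration, here is a minimal sketch of point 3 (file and folder names are made up; the module automates all of this). A tab-delimited file needs a schema.ini in the same folder, and the query then opens the folder as the data source and the file as the table:

' schema.ini, in the same folder as the data file:
'   [results.tab]
'   Format=TabDelimited
'   ColNameHeader=True
'   CharacterSet=65001

Dim cn As Object, rs As Object
Set cn = CreateObject("ADODB.Connection")
cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\;" & _
        "Extended Properties='text;HDR=Yes;FMT=Delimited';"
Set rs = CreateObject("ADODB.Recordset")
rs.Open "SELECT * FROM [results.tab]", cn, 1, 1   ' adOpenKeyset, adLockReadOnly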

To avoid some of these complications, the final post in this series will be on how to use Power Query in VBA to query a text / csv file.

The functions are:

ADOLoadWSFromDelimitedTextFile(ByVal OriginalFullname As String, ByRef wsData As Worksheet, Optional ByRef Delimiter As String, Optional ByRef CodePage As Long)

              Queries a structured text file into a table in a sheet, returning the delimiter and codepage by reference. It uses the two functions below, writes a schema.ini file if the data source is tab-delimited, and checks whether there are enough rows on the sheet for the records.

ADORecordsetFromTextfile(ByVal FullName As String, ByRef rs As ADODB.Recordset,  ByVal FMT As String, ByVal CodePage As Long, Optional ByVal strSelect As String)

              Queries a structured text file into an ADO recordset, using the given format (FMT) and codepage. strSelect is the list of field names or expressions in SQL syntax; the default is "*", all fields.

CopyRecordsetToSheet(rs As ADODB.Recordset, Optional ByRef wsTarget As Worksheet, Optional ByVal RecordSource As String)

              Writes a recordset to a worksheet, adding the sheet and writing headers if required.
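Putting it together, a typical call might look like this (a sketch assuming the module is imported and a sheet named "Data" exists; the file path is made up):

Dim Delimiter As String, CodePage As Long
ADOLoadWSFromDelimitedTextFile "C:\data\survey results.csv", _
    ThisWorkbook.Worksheets("Data"), Delimiter, CodePage
Debug.Print "Delimiter inferred: " & Delimiter & ", codepage: " & CodePage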


Querying Text/CSV files. Method #1: Querytable

You will know that simply opening a CSV file in Excel is often a poor way to import it. Excel has a habit of eagerly interpreting as a date anything that looks even vaguely like a date. A previous blog post described how frequently such corrupted data appeared in genome research papers, until the gene names were finally changed to avoid the problem.


The legacy Get Data from Text/CSV command in Excel creates a QueryTable. This is pretty good for most purposes as long as you use the undocumented TextFilePlatform value of 65001. There is one glaring limitation: it cannot handle newlines in quoted strings. I get these frequently in SurveyMonkey downloads where people have entered free text comments with multiple lines. So the workaround is to pre-process the file to change any quoted linefeed (LF) or carriage return (CR) characters to spaces.

The module QueryTables.bas has the code to do that.
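For reference, here is a minimal sketch of the basic QueryTable import itself (the path is made up; the module's code does more, including the pre-processing of quoted line breaks):

Sub DemoCsvQueryTable()
    Dim ws As Worksheet
    Set ws = ThisWorkbook.Worksheets.Add
    With ws.QueryTables.Add(Connection:="TEXT;C:\data\survey.csv", _
                            Destination:=ws.Range("A1"))
        .TextFilePlatform = 65001                 ' undocumented but accepted: UTF-8
        .TextFileParseType = xlDelimited
        .TextFileCommaDelimiter = True
        .TextFileTextQualifier = xlTextQualifierDoubleQuote
        .RefreshStyle = xlOverwriteCells
        .Refresh BackgroundQuery:=False
        .Delete                                   ' keep the data, drop the connection
    End With
End Sub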

There is another way, using an ADO Query, which will be the subject of the next blog post. However, that has its own complications. For example, if the file is tab delimited you have to write a schema.ini file. The final way, in the last blog post, will be to use Power Query.


Reading Unicode file contents as text

This is a short update on my post of Sep 28, 2020 on reading Unicode text in VBA.

That post mentioned the limitations of old VB statements which cannot handle Unicode:

   Open FullName For Input As #handle : Line Input #handle, strLine

I am reposting the module as ADOStream.bas with the following functions:

MessageBox(Prompt As String, Optional Buttons As Long = VbMsgBoxStyle.vbOKOnly, Optional Title As String = "Message") As VbMsgBoxResult

MessageBoxW from the Windows API correctly displays international Unicode characters (see the sketch at the end of this list of functions).

ReadFileBOMCharset(ByVal filename As String, ByRef charset As String) As Long

Infer the encoding of a file.

CharsetFromBytes(bytes() As Byte) As String

   Infer charset if there is any Byte Order Marker in the first 3 bytes of content.

ReadADOStreamFile(ByVal FullName As String, ByRef text As String, ByRef charset As String, Optional ByRef numchars As Long = -1) As Long

   Read the contents of a file as text, inferring the encoding.

WriteADOStreamText(ByVal filename As String, ByRef text As String, Optional ByRef charset As String) As Long

    Write text to a file.

ReadADOStreamBytes(ByVal filename As String, ByRef bytes() As Byte, Optional ByRef numchars As Long = -1) As Long

    Read the contents of a file as a byte array, including any Byte Order Marker (BOM).
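As an illustration of the first function above, the core of a Unicode-aware MessageBox wrapper might look like this (a sketch showing the 64-bit VBA7 declaration; the module's own code may differ in detail):

Private Declare PtrSafe Function MessageBoxW Lib "user32" ( _
    ByVal hWnd As LongPtr, ByVal lpText As LongPtr, _
    ByVal lpCaption As LongPtr, ByVal uType As Long) As Long

Public Function MessageBox(Prompt As String, _
        Optional Buttons As Long = VbMsgBoxStyle.vbOKOnly, _
        Optional Title As String = "Message") As VbMsgBoxResult
    ' StrPtr passes a pointer to the string's UTF-16 contents,
    ' so international characters display correctly
    MessageBox = MessageBoxW(Application.hWnd, StrPtr(Prompt), StrPtr(Title), Buttons)
End Function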


Hello (the rest of the) World – reading data in other languages

Have you ever opened a text file in Excel and found that the characters looked weird or corrupted? The Wikipedia entry for mojibake says ‘In German, Buchstabensalat (“letter salad”) is a common term for this phenomenon, and in Spanish, deformación (literally deformation).’

This is one of a series of blog posts on file handling in VBA to cope with Unicode file names and contents. Part 1 was posted on Sep 28 2020, about reading text from Unicode files. The articles should provide all the Excel VBA code you need for working with Unicode folder and file names in the Windows file system, reading unstructured Unicode text, and structured text file data queries.

This post is the code for using the FileSystemObject to replace the old VB functions like Dir(), FileLen(), FileDateTime(), etc., that do not work with Unicode file names. FileSystem.bas is the module which you can import into your project, and the CSV file is a sample downloaded from opengov.gr.

Code module: FileSystem.bas   Sample file: Σχόλιο UTF-8 BOM.csv

Old VB                    FileSystem.bas function
Curdir                    GetCurDir
                          ChPath
ChDir                     SetCurDir
                          NewFSO
                          GetFSOFile
Dir                       FileExists
Dir(file, vbDirectory)    FolderExists
                          GetPathAndName
FileLen                   FileSize
FileDateTime              FileDateModified
FileCopy                  CopyFile
Name..As                  RenameFile
Kill                      DeleteFile
                          GetTempName
                          GetTempFullName
                          GetFSOFolder
MkDir                     CreateFolder
RmDir                     DeleteFolder
Attr                      GetFileAttributes
SetAttr                   SetFileAttributes
                          FileAttrString
                          GetOpenFilenameStr
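Under the hood these are wrappers around the Scripting.FileSystemObject, which copes with the Unicode names that the old statements choke on. A minimal sketch of the underlying calls (paths are made up; the wrapper names are the ones in the table above):

Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
Debug.Print fso.FileExists("C:\DOCS\Greek\Σχόλιο UTF-8 BOM.csv")    ' cf. Dir()
Debug.Print fso.GetFile("C:\DOCS\sample.csv").Size                  ' cf. FileLen()
Debug.Print fso.GetFile("C:\DOCS\sample.csv").DateLastModified      ' cf. FileDateTime()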

Let me know of any improvements you can suggest.


UK Covid-19 Track & Trace Excel snafu: Uncontrolled spreadsheets lead to data loss

The story broke around October 5th that Public Health England temporarily lost 16,000 Covid test results because they did not notice that the XLS data format they were using was insufficient to process the number of results they were getting. The risk is that lives could be lost because Covid cases were not being traced and followed up. The BBC report is probably the best summary:

“[The Covid testing labs] filed their results in the form of text-based lists – known as CSV files – without issue. PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system.  The problem is that PHE’s own developers picked an old file format to do this – known as XLS. As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of. And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases. When that total was reached, further cases were simply left off.”

Most commentary piled on to the use of Excel at all as “inappropriate”. Wired has an interview with spreadsheet warriors on the EuSpRIG mail list about how they face daily support tasks like this. Maths comedian Matt Parker did a video on spreadsheet errors and on EuSpRIG.

I want to examine this story from the point of view of data and spreadsheet control and wonder about some unstated angles.

TL;DR: if they had applied basic data controls, they could have detected the problem as soon as it happened rather than lose precious days. Whenever data is being transformed from one system into another, there are many opportunities for data to go missing, be duplicated, or be corrupted. Whatever the skill level of the developer, nobody can know in advance what data will actually arrive in use, or every possible failure mode. Any such system, whatever the technology used to implement it, should have at a minimum a simple check that the number of records output reconciles to the number input, and ideally other checks on data quality and hash totals.
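To make that concrete, a reconciliation control can be as small as this sketch (names are illustrative; any equivalent count in the receiving system would do):

Sub CheckRecordCount(ByVal sourceFile As String, ByVal ws As Worksheet)
    ' Reconcile the number of lines received with the number of rows loaded.
    ' Assumes every line in the file, including the last, ends with a line feed.
    Dim expected As Long, loaded As Long
    With CreateObject("ADODB.Stream")
        .Type = 2: .charset = "utf-8": .Open
        .LoadFromFile sourceFile
        expected = UBound(Split(.ReadText(-1), vbLf))        ' line terminators in the file
    End With
    loaded = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row        ' used rows, header included
    If loaded <> expected Then
        MsgBox "Loaded " & loaded & " rows but the file has " & expected & " lines", vbCritical
    End If
End Sub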

First of all, why would they choose XLS? Possibly because it was the easiest choice for upload to their central system. For example, 32-bit SQL Server can import XLS (65,536 rows) and XLSX (1M+ rows) files. For XLSX, however, they would have had to register the OLE DB provider Microsoft.ACE.OLEDB.12.0, and that may have looked like extra work; perhaps they thought “we’ll never have more than 65K cases per day”. But although that choice of XLS might well be adequate (at the moment!) for the upload, the fatal flaw was that the import into XLS would need many more lines. [Edit: I now think they really were using Excel 2003 or earlier.]

I’ve done work for labs and written apps in Excel and VBA to process the data logs from instruments. The article refers to them as CSV files, and one might think that implies a text file with a header line of field names, one row per record, and fields separated by commas; see RFC 4180.

However, the fact that only about 1,400 cases fitted into 65K rows makes it look more likely that these are simple text files with one field and value per line. Or maybe it’s a combination of irregular numbers of fields per line; we don’t know. I would expect that, depending on the exact number of lines per record (46 or 47), the last record in the file would therefore be incomplete: some fields would be missing. So there are incomplete records in the database too.

Neither do we know what process they used to import the data in the first place – text file open, Get Data from Text, VBA Querytable, copy & paste, whatever. [Edit: I now think they used Workbooks.Open in a macro].

We don’t know what they did to transform the data from the lab output format into the regular table-like structure that an upload to a database would require. Maybe a mess of INDEX or VLOOKUP formulas; maybe a simple VBA macro that ran down the column of imported text and wrote out the samples in columns. Maybe they had to transform field values from codes into numbers and handle special values like “n/a”, “<0.001”, and such like.

They could have processed it in an XLSM with 1M rows but saved the normalised table as an XLS file, thus satisfying both requirements: processing a large number of lines and saving in the most convenient format for upload; but they didn’t. Even sticking with an XLS file with macros, they could have read the lab file line by line, assembled each record in memory, and written it out to the target sheet, without needing to populate a 65K sheet with imported data; but it looks like they didn’t.

But what puzzled me is this: I had thought that whatever method is used to import data in Excel, even in Excel 2003 with XLS files, the application would raise an error if you tried to import more lines; but there was no report of the problem being flagged by an error message, only of people noticing when they checked the numbers.

If you use the menu to File > Open a text file in Excel 2003 with more than 65,536 lines, or try to copy and paste from another application, it raises an error message “File not loaded completely”. If you import in any version of Excel into a worksheet in compatibility mode (i.e. 65,536 rows), it reports “The text file contains more data than will fit on a single worksheet”. So why did this not happen? (Spoiler: I discovered why later, see the postscript.) This was, in Sherlock Holmes terms, the dog that did not bark in the night.

I tried to imagine scenarios involving XLSWriter or Apache POI trying to create XLS files with more than 65K rows; I worked out an unlikely scenario involving copying to the clipboard from another application and in VBA using Worksheets(1).Paste which does not raise an error, it truncates the data. Other Excel experts suggested that the PHE developers simply suppressed Application.DisplayAlerts or On Error Resume Next, or even deliberately imported line by line with a cut off at 65535. They point to a variation of “Hanlon’s razor”: “Don’t ascribe to obscure complexity what can be explained by incompetence”, and to the collection of reports on thedailywtf.com and theregister.com. And Arthur Conan Doyle wrote “Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth”. [Edit: Another conclusion, which I thought improbable at first, is that this public health service is really so starved of IT resources that they have to use unsupported versions of Excel more than 13 years old. Update: this can happen even in modern Excel, see below for 25-Nov-2020]

However, it was much simpler than that. Only after trying the most obvious thing, which I missed the first time I posted this, did I find it: VBA Workbooks.Open “file.txt” will load as much data as it can WITHOUT raising any error. No On Error Resume Next needed. So this is in my view an Excel failure, a defect, because the VBA behaviour is not consistent with the UI behaviour. Still, you should not assume your technology will always work perfectly and then fail to put controls around your processes.

More like Occam’s razor really: pick the simplest explanation over the more complex one.

[Update 14-Oct-2020: what I now think really happened]

Update 22-Oct-2020: I found a Sky News article which gave a bit more information:

“The problem is while the Second Generation Surveillance System (SGSS), as it’s called, is plugged into the existing pathology labs that pre-date COVID-19, for some reason no-one has yet managed to connect it to the new labs which have been created during the course of this pandemic. […] Rather than feeding its results securely straight into SGSS, as Pillar 1 pathology labs do, the test results from Pillar 2 labs arrive at PHE in the form of a CSV file.” So Pillar 1 are using secure transfer, maybe in HL7 format, but the new system masterminded by consultants at £7000/day uses CSV, like the USA CDC?

Update 25-Nov-2020: I found two more scenarios where using an XLS macro in Excel 2016 silently drops data above the sheet capacity of 65536 rows. If I use in VBA Querytables.Add “TEXT;file.csv” or an ADO Jet.OLEDB.4.0 query “SELECT * from file.csv” in an .xls file in Excel 2016 to populate a worksheet in the xls file from a CSV file with 89000 rows, it stops at 65536 silently, ignoring extra data, no error. It’s quite likely that xls macro files persist in use even after the version of MS Office Excel is updated. So there’s another couple of ways in which it could have happened. There may be more!


Nested JSON to flat data tables

Have you ever had to convert a file of JSON text to Excel tables? Not simple JSON, but one with a collection of dictionaries inside a dictionary inside a collection of dictionary records inside another collection of records inside a dictionary? And you want all this as old-fashioned normalised database tables? And in VBA? Here’s how I did it:

Download: Flatten JSON 20200915.xlsm

I chose as an example a SurveyMonkey survey_details JSON file. I use VBA Collections, ADO Recordsets, Scripting.Dictionary, a free class to parse JSON in VBA, and some code to handle the inevitable exceptions. It’s a two-pass process: one pass to identify all the fields in all the dictionaries and create the linking primary and foreign keys, and a second to populate recordsets with the data for those fields and write them out to Excel sheets. Or of course you could write them to a database.
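The parsing itself is the easy part. A minimal sketch, assuming the widely used VBA-JSON JsonConverter class (the workbook download may use a different parser) and illustrative key names from a survey_details file:

Dim sJsonText As String        ' the raw JSON; in practice, read it from the file
sJsonText = "{""pages"":[{""id"":""1"",""questions"":[{""id"":""10"",""family"":""open_ended""}]}]}"

Dim parsed As Object           ' Scripting.Dictionary at the top level
Set parsed = JsonConverter.ParseJson(sJsonText)

Dim page As Object, question As Object
For Each page In parsed("pages")              ' a Collection of Dictionaries
    For Each question In page("questions")
        Debug.Print page("id"), question("id"), question("family")
    Next question
Next page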

If you like YouTube tutorials, here’s a recording where I walk through it:

What we start with:

What we end with:

(Update 23-Nov-2020: fix misspelled download link)


VBA Read Unicode file contents in various encodings

To read a text file, you need to be able to handle more character sets than just ANSI. Not just in the contents but also in the file and folder names.
The classic article on this is by Joel Spolsky:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Here are folder and file names in various languages:

F:\DOCS\Arabic\التاريخ المالي والنسب.xls
F:\DOCS\Arabic\اسم ملف الاختبار\اسم ملف الاختبار.xlsm
F:\DOCS\Chinese\2016 全 日 制 ⑴ 班 2016 ~ 2017 学 年 度 第 一 学 期 MBA 教 学 计 划(教室:2-111).xls
F:\DOCS\Chinese\測試文件名\測試文件名.xlsx
F:\DOCS\Greek\όνομα αρχείου δοκιμής\όνομα αρχείου δοκιμής.xlsx
F:\DOCS\Hebrew\שם קובץ הבדיקה\שם קובץ הבדיקה.xlsx
F:\DOCS\Hindi\परीक्षण फ़ाइल नाम\परीक्षण फ़ाइल नाम.xlsx
F:\DOCS\Hungarian\kalóriaguru Munkafüzet.xlsm
F:\DOCS\Hungarian\kalóriaguru\kalóriaguru.cls
F:\DOCS\Japanese\テストファイル名\テストファイル名.xlsx
F:\DOCS\Korean\테스트 파일 이름\테스트 파일 이름.xlsx
F:\DOCS\Russian\Лист1.xlsx
F:\DOCS\Russian\имя тестового файла\имя тестового файла.xlsx

Misinterpreting the encoding results in accented characters appearing as two garbage characters, or as a white question mark in a black diamond (the U+FFFD replacement character), which indicates that UTF-8 decoding could not figure out what a byte sequence was.
https://en.wikipedia.org/wiki/Mojibake

In the case of VBA, the classic Open statement and the functions Dir(), FileLen(), and FileDateTime() cannot handle Unicode file names and folders. The FileSystemObject TextStream can open Unicode file names but cannot handle UTF-8 content. So I use the ADO.Stream object in the module ReadFileContents.bas to handle a number of different cases. The variations are not complete, but it handles the file types I get.

Let me know of any improvements you can suggest.

Here is the function:

' Reads the entire contents and tries to GUESS the character encoding
Function ReadFileContentsAndCharset(ByVal FileName As String, ByRef Text As String, _
         ByRef charset As String) As Long
   charset = "UTF-8" ' first guess. If it's not UTF-8 it may still be readable as UTF-8
   ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
   If ReadFileContentsAndCharset = 0 Then ' it succeeded without error
      Debug.Print FileName   ' comment these out in production
      Dim lUTF8Len As Long, sBOM As String
      ' UTF-8 files do not have to have a BOM so no point in testing for that
      sBOM = Left$(Text, 2)
      lUTF8Len = Len(Text)
      If sBOM = Chr$(254) & Chr$(255) Or sBOM = Chr$(255) & Chr$(254) _
         Or InStr(1, Text, Chr(0)) > 0 _
         Then ' UTF-16 or nulls? Might be double-byte character set
         charset = "UTF-16"
         ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
         Debug.Print charset & " len=" & Len(Text) & ", utf-8 len=" & lUTF8Len
         If lUTF8Len = Len(Text) * 2 Then
            'ReadFSOStreamText(FileName, Text) would work here too for UCS2
            Debug.Print "As expected, double byte character set"
         End If
      ElseIf InStr(1, Text, ChrW$(65533), vbBinaryCompare) > 0 Then ' &HFFFD
         ' Check for U+FFFD <question mark in black diamond> replacement character
         ' used to replace an unknown, unrecognized or unrepresentable character
         'https://en.wikipedia.org/wiki/Specials_(Unicode_block)
         charset = "Windows-1252" ' second guess; superset of ISO-8859-1
         ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
         Debug.Print charset & " len=" & Len(Text) & ", utf-8 len=" & lUTF8Len
         If Len(Text) > lUTF8Len Then
            Debug.Print charset & " converted some characters into multiple characters"
            ' so it must be some other encoding. No idea.
         End If
      End If
      Debug.Print Len(Text) & " chars, charset inferred=" & charset & " " & FileName
   Else
      MsgBox FileName & vbLf & "Error " & Err & " " & Err.Description
   End If
End Function

Function ReadADOStreamText(ByVal FileName As String, ByRef Text As String, _
   Optional ByRef numchars As Long, Optional ByRef charset As String) As Long
   ' initialise default parameters
   If numchars = 0 Then numchars = -1
   'numchars Optional. The number of characters to read from the file,
   '  or a StreamReadEnum value. Default is adReadAll=-1
   If Len(charset) = 0 Then charset = "utf-8"
   With CreateObject("ADODB.Stream")
      .Type = 2 'adTypeText = 2 Specify stream type -  text/string data.
      .charset = charset
      .Open
      On Error Resume Next
      .LoadFromFile FileName
      'Err  3002         File could not be opened.
      Text = vbNullString ' in case of err
      Text = .ReadText(numchars)
      ReadADOStreamText = Err
   End With
End Function
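And a typical call (the path is made up):

Dim contents As String, cs As String
If ReadFileContentsAndCharset("C:\DOCS\Greek\δοκιμή.txt", contents, cs) = 0 Then
   Debug.Print "Read " & Len(contents) & " characters, charset inferred: " & cs
End If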


A look back to Eusprig 2019

The 20th conference of the European Spreadsheet Risk Interest Group was held at Browns of Covent Garden, London, on 11-12 July 2019.

Thursday 11 July 2019

Developing Excel Thought Leadership

David Lyford-Smith, ICAEW, Chartered Accountants’ Hall

This described the development of their published booklets on Twenty Principles, the Spreadsheet Competency Framework, and Financial Modelling Good Practices.  David summarised their approach as:

  • We have a model for developing spreadsheet thought leadership:
  1. Identify an area where a high-level guidance publication could be of use by consulting with experts and stakeholders
  2. Convene a suitable and diverse group of experts and synthesise a consensus view from their disparate opinions
  3. Invite commentary and critique via an exposure draft process
  4. Complete the project based on the feedback received
  • With such variation in opinions and practice, building in room for individual preference and differences in approach must be an explicit consideration
  • Many people are willing to follow good practice when exposed to it, but few will seek it out for themselves

Practical aspects of applying an End User Computing policy

Roger Turner, Wesleyan Assurance Society, Birmingham

This was an update on his 2018 paper where he described the work Wesleyan did to set up its EUC policy. There was a continual focus on meeting and working with managers to be sure that actions were completed. His slides give a very frank account of their journey which will be very valuable to those present working on similar projects. Their fine-grained approach to classifying risks made it easier to score some quick wins, on things like Controls, Validation, and Documentation. Some slides showed their KPIs to measure progress on moving spreadsheets from red to green status. His learning points are:

  1. Plan what you’re going to do
    1. If you fail to plan you plan to fail
    2. Work out risk metrics, use best practice to make an EUC policy, do pilot runs to iron out problems.
  2. Get buy-in right from the top
    1. Identify the risks and benefits – suggest a risk based approach tackling where the biggest risks are first.
    2. Success with these will give your project momentum which you will need for harder areas
  3. Use of expensive software can usually be avoided
    1. Look to re-use existing infrastructure and systems
    2. A top down approach can get some results and is cheaper, less intrusive than a bottom-up approach
  4. Publicise the policy with its benefits (intranet, meetings etc to make the approach clear)
  5. Be sympathetic with peoples’ other business priorities
    1. Most will try to help and this will give you momentum
  6. Use KPI reporting – this will show how your EUC project is reaping benefits over time.

A Case Study of Spreadsheet Use within the Finance and Academic Registry units within a Higher Education Institution

Simon Thorne and Jamie Hancock

This was a survey of the Academic Registry and Finance departments of a university. The paper has many tables giving an analysis of the responses to the survey, although bear in mind that this is self-reporting. They identified the common risks:

  1. Lack of training, standards, development methods, testing and documentation in spreadsheet development
  2. Mistakes in communicating with third parties
  3. Potential for fraud and falsification
  4. GDPR risks

The mitigating actions that follow are:

  1. Generate an inventory of active critical spreadsheet applications and take appropriate mitigating actions with the riskiest spreadsheets.
  2. Development of a robust spreadsheet development policy governing approaches to development, deployment, documentation, testing and accountability
  3. As part of the organisational policy, decide on development methodologies, documentation standards and testing policies for spreadsheets
  4. Identify and deliver appropriate training in planning, development and testing of spreadsheet models
  5. Establish peer working, user groups and risk awareness sessions

Abstracting spreadsheet data flow through hypergraph redrawing

David Birch, Nicolai Stawinoga, Jack Binks, Bruno Nicoletti, Paul Kelly

Imperial College London, Filigree Technologies, London

This represents the spreadsheet as a network diagram of its dependencies. They make the link structure explicit by transforming the spreadsheet into a graph structure. They find groups of cells performing near-identical operations by detecting vector operations using isomorphic subtree detection. They illustrate the effectiveness of raising the abstraction level by translating a series of spreadsheets to a high-level data-flow language called ModL. By applying a force-directed algorithm they can rotate the graph in a virtual 3D space to better see the relationships.

More established technologies such as Quantrix are specifically aimed at the financial modelling market and start with such structures as their design surface.

Implementation Strategies for Multidimensional Spreadsheets

Paul Mireault, Founder, SSMI International, Honorary Professor, HEC Montréal

Last year, the author issued a challenge to spreadsheet users to create a spreadsheet model with some initial data and simple formulas to allocate quantities and money over four dimensions. There was no challenge in the formulas, since they were all given; it was simply a request to implement a model. Even so, of the 109 who responded, only 17 completed the challenge. Some implemented it as tables which spread the four dimensions over multiple blocks of formulas in multiple worksheets. Others collated the data in a flat database structure and created the reports as simple summaries of the data. His conclusions are:

  1. Participants using the Database approach usually have the simplest formulas for aggregated variables. We expect that they will be the easiest to maintain.
  2. The analysis of the Multi-dimensional Spreadsheet Challenge submissions has given us insight into the different implementation strategies used by seasoned spreadsheet developers. Given the same problem and its solution, developers produced very different results. This lack of standardization may explain why it is hard to understand and maintain somebody else’s spreadsheet.

EQUS – helping to see formulae

Chris R. Roast, Sheffield Hallam University

EQUS is a C# Excel add-in that presents a graphical breakdown of cell formulas, annotated with an evaluation of each subexpression and the calculation steps. It is like a more colourful, simultaneous presentation of what the Formula Evaluation tool in Excel does, or of other formula reference tree viewers in the Excel add-ins market. This is intended for educational use and is very suited to explaining to students how formulas work.

From webtables to datatables

Mária Csernoch, University of Debrecen, Hungary

The motivation for this work is to create realistic and interesting tables of data for school students to analyse. Web sites are an obvious source, but unfortunately many frustrate the ability to conveniently collect data from their pages. Standard web queries depend on data in HTML tables, but many sites use <div> or other tags which cannot be read by Get Data from Web or Power Query. She presented the steps they must use for copy/paste or saving pages as HTML or plain text and opening in Excel or Word; then reformatting, editing, cleansing, parsing, and reshaping to get a usable table out of the result. It can be a lot of tedious work. While professional developers may use VBA and APIs, this aims to create reproducible manual steps which schoolchildren can use. Be aware that these are Hungarian schoolchildren who have already been taught Excel array formulas.

Will Dynamic Arrays finally change the way Models are built?

Peter Bartholomew , MDAO Technologies Ltd

The author is an engineer, not a financial modeller. His kind of task for the MoD was a low-fidelity Excel model of the likely interaction of warship and helicopter in the North Atlantic, with a graphical readout of the behaviour. His models are largely constructed with names rather than cell references, so have always been innately array-oriented. He points out that Excel 365 dynamic arrays now provide the most direct route to a solution: instead of requiring effort to set up Control-Shift-Enter (CSE) objects, any formula can be an array calculation and any cell can hold an entire result array.

The surprise for anyone versed in traditional methods may well be how few cells contain formulas; consistency errors are simply not possible, and having fewer formulas reduces the opportunity for error.

Cubes: an alternative to spreadsheets to mitigate risk and improve performance?

Marc BRAUN, Xcubes

This was a demonstration of XCubes, a Multidimensional Spreadsheet System software package. He used the model specified in Paul Mireault’s talk as an illustration. An Excel solution required 1,427 formulas; an XCubes solution just 13. Cubes are the most suitable solution when the model can be broken down in blocks to store well-structured data. They make analysis, calculation and visualization even on large data sets much easier. The solution can be downloaded from xcubes.net.

Does the teaching of spreadsheeting skills have the potential to be a significant global education policy intervention?

Morten Siersted, Visiting Research Fellow, Centre for Development Studies, University of Bath

Spreadsheeting skills do not currently feature within global education policy in any significant way. However, within the UN Sustainable Development Goals (SDGs), the proxy of ‘ICT skills’ is of direct relevance to Target 4.4 ‘Skills for work’ and has led UNESCO to propose ‘using basic arithmetic formulas in a spreadsheet’ as a component of an ‘ICT skills acquisition’ element of measuring progress on SDG 4.4.

Although there is a shift from superficial surface level ICT skills to deeper cognitive skills and computational thinking, he sees a risk of policy makers jumping straight from ICT skills to traditional computer programming education, thereby missing out the ‘spreadsheeting as entry-level programming’ stepping stone. ICT for Development is now mainstreamed within key UN agencies (ITU, UNCTAD, UNDP, and UNESCO). He concludes that an initiative should be undertaken to disseminate existing spreadsheeting research to education policy makers working with low-income countries.

 

Friday 12: Training Sessions

Working through Modeloff example problems

David Lyford-Smith, ICAEW

David worked through two games from the Modeloff questions resources as an example of problems that require no domain knowledge but are capable of solution by Excel. Indeed, the drive for more efficient solutions often captures the attention of Excel nerds beyond the point of return on time spent.

Dynamic Arrays in Excel

Charles Williams, Decision Models Ltd

Charles worked through a presentation created by Joe McDaid of Microsoft which explained how the new dynamic arrays work and also, illuminatingly, how the legacy array formulas and their implicit intersection rules really worked! This is the video from Microsoft:

https://community.powerbi.com/t5/MBAS-Gallery/Microsoft-Excel-Advanced-spreadsheet-modeling-using-Dynamic/td-p/726918

Powerpoint slides:

https://medius.studios.ms/video/asset/PPT/MBAS19-BRK4001

Slide 32 (Array Eval) has a typo: the word “didn’t” should not be there. The table on slide 42 (Calculation Modes) is blank in edit mode but fills in by animation in presentation mode. The content can be seen in this tweet from Charles’ presentation: https://twitter.com/ExcelAnalytics/status/1149633726592114688

Patrick O’Beirne  16 July 2019


A look ahead to Eusprig 2019

The 20th annual conference of the European Spreadsheet Risk Special Interest Group (http://www.eusprig.org) will be held on July 11 & 12 at The Barrister’s Court, Browns, 82-84 St Martin’s Ln, Covent Garden, London WC2N 4AG, UK.

David Lyford-Smith of the Institute of Chartered Accountants in England and Wales (ICAEW) will review the history of their three “thought leadership” papers on good practice in spreadsheet use and spreadsheet work environments.  Anyone involved in spreadsheet training, teaching, and syllabus development will be interested to hear how the process of making them has helped ICAEW to develop its position. David also features in the Friday morning tutorials.

Last year Roger Turner of Wesleyan Assurance Society presented a high-quality paper which described their plan to address EUC risks, which the audience found very relevant. This year he describes the challenges which were faced and how these were overcome. I am particularly looking forward to this practical presentation.

During the year, Paul Mireault of SSMI International invited seasoned Excel developers to participate in a challenge to implement a forecast modelling spreadsheet with four variables – product, region, sector, time. He will report on the different implementation strategies employed. Having tried it myself, I can say you will see some radically different approaches to the challenge! Expect arguments from the attendees that “my way is the best because….” !

Simon Thorne and Jamie Hancock of Cardiff Metropolitan University will present the findings of a case study of spreadsheet use in a higher education institution in the UK. The implications of the results are then considered in terms of accurate reporting to external funding bodies such as the funding councils, internal data integrity, and internal data efficiencies.

Mária Csernoch of the University of Debrecen, Hungary, will talk about how school pupils can scrape data from web pages rather than type it in for their Excel exercises. Tables and table-like structures on webpages are excellent sources for teaching spreadsheeting by utilizing and developing knowledge-transfer items, presenting various real world problems and solutions, discussing and debugging, and in general, developing computational thinking skills.

Chris R. Roast of Sheffield Hallam University will describe the design, development and evaluation of an Excel add-in for interactive visualisation of spreadsheet formulas (EQUS). The development process was one of iterative refinement engaging an initial target audience of mid-teen learners, involving re-design and formative evaluation. The resulting visualisation techniques have been found to be broadly relevant to spreadsheet users beyond the initial target audience.

David Birch and others from Imperial College and  Filigree Technologies, London, expose hidden linkage structure by transforming spreadsheets into fine-grained graphs with operators and values as nodes. They propose that researchers should seek techniques to redraw these boundaries to create higher level structures which will more faithfully represent the end-user’s real world/mental model.

Peter Bartholomew of MDAO Technologies Ltd will discuss the thesis that the adoption of the new Dynamic Arrays in Excel within a more professional development environment could replace traditional techniques where solution integrity is important. This will be a useful taster to Charles Williams’ tutorial on Friday morning.

Marc BRAUN of Xcubes argues that a cube-based system offers a structured and scalable approach to modelling, forecasting, reporting and data analysis with little manual intervention once the model is set up. A practical demonstration will be made with XCubes, a Multidimensional Spreadsheet System software package.

Morten Siersted, visiting research fellow at Bath, presents his dissertation on “Does the teaching of spreadsheeting skills have the potential to be a significant global education policy intervention?” Anyone involved in promoting EUC skills will be interested in hearing his views on the relevant ‘ICT for development’ frameworks.

The innovation this year is a pair of tutorials on Friday morning. David Lyford-Smith will be illustrating good spreadsheet structuring by working through Modeloff example problems. Charles Williams of Decision Models Ltd, the Excel MVP, will be explaining Dynamic Arrays in Excel, their advantages and their wrinkles.

I’ll be there!

Register at http://www.eusprig.org

 

 
