I am one of the 48 speakers at a 48-hour Excel extravaganza online on Oct 12 and 13. It’s FREE but you are encouraged to make a donation to charity to reflect the value to you of the content you benefit from. Visit ExcelVirtuallyGlobal.com.
“Microsoft’s Most Valuable Professionals, or MVPs, are technology experts who passionately share their knowledge with the community. They are always on the “bleeding edge” and have an unstoppable urge to get their hands on new, exciting technologies. This virtual conference, now in its fifth year, brings together Excel, Data Platform and PowerPoint MVPs, and other acknowledged experts from Microsoft and around the globe, to present, answer questions and demonstrate the future of Excel and how it will make your life easier, personally and professionally. Topics include: Auditing Spreadsheets, Charts, Dashboards, Data Analysis, Data Types, Dynamic Arrays, Excel Tricks & Tips, Financial Modelling, Forecasting, Maps, Microsoft 365, New Functions and Features, PivotTables, Power BI, Power Pivot, Power Query, Presenting Your Data, Spilled Formulae, Testing, Timelines, and developing solutions with Excel-DNA (.Net), VBA, Python, R, OfficeScript and Office-JS. And note well – it’s not all in English, with sessions in Mandarin, Portuguese, Spanish and Telugu as well.
Most sessions will be recorded so you may watch them later with downloads aplenty – there are no medals for staying up to watch the entire event live!
From your own favourite chair, bring a laptop, an inquisitive mind and your sense of humour. Be prepared to learn heaps. And remember, it’s for charity – all profits will be donated to a global COVID-19 research charity.”
My topic is how to avoid data loss and corruption when importing text files into Excel. It is for the developer who needs to automate ways of importing text files from anywhere in the world with Unicode folder and file names, and content in any character set and encoding. And to correctly import data with any field delimiter and embedded linefeeds in quoted fields. It is based on other articles in this blog, collected together into one demo workbook.
00:00 Intro and the Windows file system object
04:30 Reading file content as text
Querying structured data – CSV and its many variants – in four ways:
11:00 User interface vs VBA open methods
21:05 QueryTables
27:22 ADO Recordset
32:35 Power Query
42:00 End
I use the OutputDebugString API in Windows to send debugging messages without the line limitations of the VBA Immediate Window. This is also useful when tracing the causes of crashes which kill the Excel process.
I use TraceSpy to capture the debug output. It is faster than DbgView which tends to bog down when there are many traces.
Event Tracing for Windows (ETW) is faster than OutputDebugString: an ETW call from VBA takes 26 microseconds, three times the speed of OutputDebugString at 77 µs. And WpfTraceSpy correctly displays Unicode strings such as Cyrillic.
Simon Mourier has provided 32 and 64 bit VBA modules for WpfTraceSpy on GitHub. I have contributed a module ETWModule3264.bas suitable for Excel add-ins that need to run without modification in either 32 or 64 bit version of Office. This has been tested in 32-bit Excel 365 and 64-bit Excel 365 on Windows 10. I work only in Excel, but it should apply to other Office products too.
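For anyone who has not called OutputDebugString from VBA before, here is a minimal sketch (not the full ETWModule3264.bas, which is on GitHub) of a 32/64-bit declaration; passing StrPtr to the W entry point preserves Unicode:

```vba
' Minimal sketch: send a Unicode string to the debugger / TraceSpy.
' The W entry point with StrPtr avoids the ANSI conversion that would
' mangle non-Latin text.
#If VBA7 Then
    Private Declare PtrSafe Sub OutputDebugStringW Lib "kernel32" _
        (ByVal lpOutputString As LongPtr)
#Else
    Private Declare Sub OutputDebugStringW Lib "kernel32" _
        (ByVal lpOutputString As Long)
#End If

Public Sub DebugTrace(ByVal sMsg As String)
    OutputDebugStringW StrPtr(sMsg)
End Sub
```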
I noticed that the speed of the ETW output depends on the folder the xlsm is in. An ETW call can take as little as 1 microsecond, or as much as the 26 µs shown above.
I realised that the fast-performing xlsm files are in Excel Trusted Locations. Is Windows Defender inspecting the API calls from untrusted locations and slowing them down? In Trusted Locations, OutputDebugString takes 23 µs and ETW takes 1 µs. ETW is so fast because all EventWriteString does is push the event onto the trace event queue and return, so there is very little overhead.
If this behaviour applies to other Windows APIs called from VBA, it could be a useful technique for improving API call performance generally.
So… make that “Event Tracing is twenty times faster than OutputDebugString” !
(Updated 22-Jul-2021: Tracing P_.O also sends to OutputDebugString)
I use it to find out which procedures are taking the most time, so that I can target performance improvements to the subs which need it.
Open the file XLTEST_Vba_Profiler.xlam and enable macros, or install it as an add-in.
Open the workbook you want to profile. Save it under a different name to avoid overwriting the original.
On the Developer tab, in the VBA Profiler group, click the button “Profiling”.
This gives a form with two checkboxes and a Help button (which shows the same instructions as on this page). The options are:
Add a call to every line of code to trace its execution for coverage analysis. A tab-delimited text file of all the code after the trace lines have been added is output, so that the sub VBA_Profile_Reports can show which lines of code were executed. The text file is named from your workbook name suffixed with _VBA_COVERAGE.TXT
Export all the VBA to a text file before making the changes. This is just for your records; it is not used in this process. The tab-delimited text file is named from your workbook name suffixed with _VBA_BEFORE.TXT
When you click OK, it adds a line to every procedure in the project, unless the procedure’s declaration line contains the comment ‘@notrace. The inserted line looks like this, where 12345 is the line number in the project:
Dim P_ As Object: Set P_ = VBA_Profiler.New_Profiler(): P_.Start 12345, "modulename", "procname"
It adds only one line of code to each procedure, and that line calls subs in the add-in to do the timing and calculations. That adds a few milliseconds of overhead to each procedure call. For very small, trivial, and frequently called procedures, consider adding the comment ‘@notrace to the Sub or Function declaration to suppress insertion of the profiling line.
It adds a reference to VBA_Profiler to the VBProject so that those inserted lines can call the functions to do the timing calculations.
Finally, insert a procedure in any standard module like this:
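A minimal driver procedure, using the add-in entry points described below (MyTopLevelProcedure is a placeholder for whatever workload you want to profile):

```vba
Public Sub RunProfiledTest()
    VBA_Profiler.VBA_Profile_Initialize    ' optionally pass a delimiter, eg vbTab
    MyTopLevelProcedure                    ' placeholder: the code being profiled
    VBA_Profiler.VBA_Profile_Reports       ' writes the _VBA_PROFILE.TXT report
End Sub
```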
If the optional delimiter parameter is specified after VBA_Profile_Initialize (such as vbTab or “,”), every P_.Start call sends the values passed to it to the debug output, separated by that delimiter. You can watch and capture that output using DebugView from Sysinternals or TraceSpy from Simon Mourier. I use TraceSpy because it is faster than DbgView and missed no output in my tests. To be sure that TraceSpy can keep up with the output, do not use Auto-Scroll.
VBA_Profile_Reports outputs a tab-delimited file named from the workbook name suffixed by _VBA_PROFILE.TXT. If you chose the coverage analysis, a workbook is also created with one sheet containing the contents of the _VBA_COVERAGE.TXT file, with an extra column showing the number of times each line of code was executed. The workbook is unsaved; save it as an .xlsx file if you wish to keep it.
Finally, go into the VBE and click Debug > Compile to check that no syntax errors have been introduced.
When a procedure exits and the P_ variable goes out of scope, the profiler counts up the number of calls and the length of time the procedure took. The resolution of the timer is about a microsecond.
The visible methods in this add-in are:
Function New_Profiler() returns an object that collects timing data for the reports.
Sub VBA_Profile_Initialize() initialises the profiler data collection.
Sub VBA_Profile_Reports(Optional sProfileTSV As String, Optional sCoverageTSV As String) produces the reports. The default file names are: for the Profile output, the workbook name plus “_VBA_PROFILE.TXT”; for the Coverage input, the workbook name plus “_VBA_COVERAGE.TXT”.
The report columns are:
Module.Proc: module name and procedure name
Count: number of times the procedure was called
Inclusive time: execution time of this procedure and all those it calls
Exclusive time: execution time within this procedure only
Average time: the average run time per call, ie the exclusive time divided by the count of calls
Index: a sequence number
If you lose the original workbook, you can remove the profiling code by doing a Search with Wildcards and Replace All with nothing for these Find What strings: “Dim P_ *” and “P_O *&:”
Feedback is welcome!
Limitations: it does not handle Unicode workbook and module names, so use only the Windows-1252 Latin character set for those.
Assurance: It only works locally, and does not ‘phone home’ with any telemetry. It is supplied free and with no guarantee, in the hope that it provides some utility.
Consider a module with three procedures. TopProc calls SubA once. SubA calls SubB four times. TopProc takes 1 second with its own code, SubA takes 3 seconds, and SubB takes 2 seconds. This is what the profile report will look like:
Example profile report (times in seconds; module name illustrative):

Module.Proc       Count   Inclusive time   Exclusive time   Average time
Module1.TopProc   1       12               1                1
Module1.SubA      1       11               3                3
Module1.SubB      4       8                8                2
From this you see that:
TopProc is called once. The total execution time including its called subs is 12 seconds. The code in TopProc alone takes one second.
SubA is called once. The code in SubA takes three seconds. The total execution time including the eight seconds for its called subs is 11 seconds.
SubB is called four times. The total execution time is 8 seconds, so the average time per execution is 2 seconds.
Sort the table by Exclusive time descending to see which subs take the most time overall. Sort by Average time to see which are slowest per execution.
To learn more about profiling, see these Microsoft articles:
The module PowerQueryText.bas contains the code to query delimited files and return tables.
The advantages of PowerQuery are:
It can handle comma- and tab-separated values and does not need a schema.ini as ADO does. You do, however, need to specify the separator character; see below for the details.
Tab-delimited files do not have to have the extension “.tab” as they do for ADO, but can be named with a “.tsv” extension if that’s what you have.
It can handle line breaks within quotes, or not, depending on the option you set. I always include line breaks within quotes.
Its default encoding is 65001 (UTF-8), which has handled every text encoding I have tested it with. The UTF-16 (UCS-2) encoding is code page 1200.
In the ADO module, I determined the field separator by reading the file contents as bytes, using a function CountColumnsFromBytes to infer the number of columns and their separator. In Power Query, there is a function Binary.InferContentType that offers statistics on possible delimiters, and your code can choose the most likely one.
At the moment, I have to send the output to a worksheet. If anyone finds a way to return the table data as a memory structure in VBA such as a recordset, let me know.
The functions contained in PowerQueryText.bas are:
testPQcsv – a simple test stub that prompts for a file name and queries it. It calls the following functions.
InferContentType – needed to infer the encoding and field delimiter (separator) character.
TableProfile – returns statistics about a file to help determine the number of rows and columns.
PQMCsvFormula – a generic function to return the M expression for a delimited file.
PQDelimitedText – runs a query and outputs the result.
WBQueryToSheet – sends the data from a query to a sheet table range.
WBQueryToListTable – sends the data from a query to an Excel ListTable.
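To show the general shape of what these functions build, here is a sketch of a Csv.Document expression registered as a workbook query from VBA. The file path and query name are illustrative, not taken from the module; QuoteStyle.Csv is what keeps line breaks inside quoted fields intact:

```vba
' Sketch: build an M expression for a CSV file and add it as a workbook query.
' Encoding 65001 is UTF-8; QuoteStyle.Csv preserves quoted line breaks.
Public Sub DemoCsvQuery()
    Dim sM As String
    sM = "Csv.Document(File.Contents(""C:\data\sample.csv""), " & _
         "[Delimiter="","", Encoding=65001, QuoteStyle=QuoteStyle.Csv])"
    ThisWorkbook.Queries.Add Name:="qrySample", _
        Formula:="let Source = " & sM & " in Table.PromoteHeaders(Source)"
End Sub
```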
That is the end of this series on reading delimited text files in Excel VBA. I hope you find it useful, and let me know of any bug fixes needed, or suggest any improvements.
The module ADOQueryText.bas has the functions to read a delimited text file into a worksheet. It uses an ADODB recordset to perform the query. The pros and cons of ADO are:
It can handle embedded newlines in data, unlike a Querytable. These are common in CSV files from sources (eg surveys) where users enter free text with multiple lines, and the text is returned in one field.
It cannot open Unicode file names, so to be safe you must copy the original file to a temporary copy with an ANSI filename.
It requires a schema.ini file to recognise fields delimited by tabs. To avoid this, you could use the function ReplaceBytes() in the QueryTables module from the previous blog post to replace all tab delimiters with commas.
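If you do stay with tab delimiters, the schema.ini can be written on the fly; a sketch, with an illustrative file name:

```vba
' Sketch: write a schema.ini beside the data file so the ACE text driver
' reads it as tab-delimited UTF-8. Overwrites any existing schema.ini.
Public Sub WriteSchemaIni(ByVal sFolder As String, ByVal sFileName As String)
    Dim n As Integer
    n = FreeFile
    Open sFolder & "\schema.ini" For Output As #n
    Print #n, "[" & sFileName & "]"
    Print #n, "Format=TabDelimited"
    Print #n, "ColNameHeader=True"
    Print #n, "CharacterSet=65001"
    Close #n
End Sub
```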
To avoid some of these complications, the final post in this series will be on how to use Power Query in VBA to query a text / csv file.
The functions are:
ADOLoadWSFromDelimitedTextFile(ByVal OriginalFullname As String, ByRef wsData As Worksheet, Optional ByRef Delimiter As String, Optional ByRef CodePage As Long)
Queries a structured text file into a table in a sheet, returns delimiter and codepage by reference. It uses the two following functions, writing a schema.ini file if the data source is tab-delimited and checks whether there are enough rows on the sheet for the records.
ADORecordsetFromTextfile(ByVal FullName As String, ByRef rs As ADODB.Recordset, ByVal FMT As String, ByVal CodePage As Long, Optional ByVal strSelect As String)
Queries a structured text file into an ADO recordset, returning the format and codepage by reference. strSelect is the list of fieldnames or expressions in SQL syntax, the default is “*”, all fields.
CopyRecordsetToSheet(rs As ADODB.Recordset, Optional ByRef wsTarget As Worksheet, Optional ByVal RecordSource As String)
Writes a recordset to a worksheet, adding the sheet and writing headers if required.
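For orientation, the kind of connection the module opens looks like this sketch (folder and file name are illustrative; the real functions add the schema.ini and codepage handling described above):

```vba
' Sketch: query a CSV file with the ACE OLE DB text driver.
' The folder is the Data Source; the file name acts as the table name.
Public Sub DemoAdoCsv()
    Dim cn As Object, rs As Object
    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\;" & _
            "Extended Properties=""text;HDR=Yes;FMT=Delimited;CharacterSet=65001"""
    Set rs = CreateObject("ADODB.Recordset")
    rs.Open "SELECT * FROM [sample.csv]", cn, 3, 1   ' adOpenStatic, adLockReadOnly
    Debug.Print rs.RecordCount & " records"
    rs.Close: cn.Close
End Sub
```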
The legacy Get Data from Text/CSV command in Excel creates a QueryTable. This is pretty good for most purposes as long as you use the undocumented TextFilePlatform value of 65001. There is one glaring limitation: it cannot handle newlines in quoted strings. I get these frequently in SurveyMonkey downloads where people have entered free-text comments with multiple lines. So the workaround is to pre-process the file to change any quoted linefeed (LF) or carriage return (CR) characters to spaces.
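A sketch of that import (the path is illustrative):

```vba
' Sketch: import a CSV through a legacy QueryTable, using the undocumented
' TextFilePlatform value 65001 to read the file as UTF-8.
Public Sub DemoQueryTableImport()
    Dim qt As QueryTable
    Set qt = ActiveSheet.QueryTables.Add( _
        Connection:="TEXT;C:\data\sample.csv", _
        Destination:=ActiveSheet.Range("A1"))
    With qt
        .TextFilePlatform = 65001
        .TextFileParseType = xlDelimited
        .TextFileCommaDelimiter = True
        .TextFileTextQualifier = xlTextQualifierDoubleQuote
        .Refresh BackgroundQuery:=False
        .Delete            ' keep the imported data, drop the connection
    End With
End Sub
```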
There is another way, using an ADO Query, which will be the subject of the next blog post. However, that has its own complications. For example, if the file is tab delimited you have to write a schema.ini file. The final way, in the last blog post, will be to use Power Query.
Have you ever opened a text file in Excel and found that the characters looked weird or corrupted? The Wikipedia entry for mojibake says ‘In German, Buchstabensalat (“letter salad”) is a common term for this phenomenon, and in Spanish, deformación (literally deformation).’
This is one of a series of blog posts on file handling in VBA to cope with Unicode file names and contents. Part 1 was posted on Sep 28 2020, about reading text from Unicode files. The articles should provide all the Excel VBA code you need for working with Unicode folder and file names in the Windows file system, reading unstructured Unicode text, and structured text file data queries.
This post is the code for using the Filesystem object to replace the old VB functions like Dir(), Filelen(), FileDateTime(), etc, that do not work with Unicode file names. FileSystem.bas is the module which you can import into your project, and the CSV file is a sample downloaded from opengov.gr.
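A flavour of the replacements, as a sketch rather than the full module:

```vba
' Sketch: FileSystemObject equivalents of Dir(), FileLen() and FileDateTime()
' that accept Unicode paths.
Public Sub DemoFileInfo(ByVal sPath As String)
    Dim fso As Object, f As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    If fso.FileExists(sPath) Then               ' replaces Dir(sPath) <> ""
        Set f = fso.GetFile(sPath)
        Debug.Print f.Name, f.Size, f.DateLastModified  ' FileLen, FileDateTime
    End If
End Sub
```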
The story broke around October 5th that Public Health England temporarily lost 16,000 Covid test results because they did not notice that the XLS data format they were using was insufficient to process the number of results they were getting. The risk is that lives could be lost because Covid cases were not being traced and followed up. The BBC report is probably the best summary:
“[The Covid testing labs] filed their results in the form of text-based lists – known as CSV files – without issue. PHE had set up an automatic process to pull this data together into Excel templates so that it could then be uploaded to a central system. The problem is that PHE’s own developers picked an old file format to do this – known as XLS. As a consequence, each template could handle only about 65,000 rows of data rather than the one million-plus rows that Excel is actually capable of. And since each test result created several rows of data, in practice it meant that each template was limited to about 1,400 cases. When that total was reached, further cases were simply left off.”
I want to examine this story from the point of view of data and spreadsheet control and wonder about some unstated angles.
TL;DR: if they had applied basic data controls, they could have detected the problem as soon as it happened, rather than lose precious days. Whenever data is being transferred from one system into another, there are many opportunities for data to go missing, be duplicated, or be corrupted. Whatever the skill level of the developer, nobody can know in advance what data will actually arrive in production, or anticipate every possible failure mode. Any such system, whatever the technology used to implement it, should have at a minimum a simple check that the number of records output reconciles to the number input, and ideally other checks on data quality and hash totals.
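A sketch of that minimal control, in VBA for concreteness. It assumes one record per line plus a header row; a real lab feed with several lines per record would divide by the per-record line count instead:

```vba
' Sketch: reconcile the number of rows imported against the source file.
Public Function ImportIsComplete(ByVal sFile As String, _
                                 ByVal lRowsImported As Long) As Boolean
    Dim fso As Object, ts As Object, lLines As Long
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(sFile, 1)     ' 1 = ForReading
    Do Until ts.AtEndOfStream
        ts.ReadLine
        lLines = lLines + 1
    Loop
    ts.Close
    ImportIsComplete = (lRowsImported = lLines - 1)  ' minus the header row
End Function
```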
First of all, why would they choose XLS? Possibly because it was the easiest choice for upload to their central system. For example, 32-bit SQL Server can import XLS (65,536 rows) and XLSX (1M+ rows) files. For XLSX, however, they would have had to register the OLE DB provider Microsoft.ACE.OLEDB.12.0; that may have looked like extra work, and they may have thought “we’ll never have more than 65K cases per day”. But although the choice of XLS might well be adequate (at the moment!) for upload, the fatal flaw was that the import into XLS would need many more lines. [Edit: I now think they really were using Excel 2003 or earlier.]
I’ve done work for labs and written apps in Excel and VBA to process the data logs from instruments. The article refers to them as CSV files, which might imply a text file with a header line of field names, one row per record, and fields separated by commas; see RFC 4180.
However, the fact that only about 1,400 cases fitted into 65K rows makes it look more likely that these are simple text files with one field and value per line. Or maybe it’s a combination of irregular numbers of fields per line; we don’t know. I would expect that, depending on the exact number of lines per record (46 or 47), the LAST record in the file would be incomplete, with some fields missing. So there are incomplete records in the database too.
Neither do we know what process they used to import the data in the first place – text file open, Get Data from Text, VBA Querytable, copy & paste, whatever. [Edit: I now think they used Workbooks.Open in a macro].
We don’t know what they did to transform the data from the lab output format into the regular table-like structure that an upload to a database would require. Maybe a mess of INDEX or VLOOKUP formulas; maybe a simple VBA macro that ran down the column of imported text and wrote out the samples in columns. Maybe they had to transform field values from codes into numbers and handle special values like “n/a”, “<0.001”, and such like.
They could have processed it in an XLSM with 1M rows but saved the normalised table as an XLS file, thus satisfying both requirements: processing a large number of lines and saving in the most convenient format for upload; but they didn’t. Even sticking with an XLS file with macros, they could have read the lab file line by line, assembled each record in memory, and written it out to the target sheet, without needing to populate a 65K sheet with imported data; but it looks like they didn’t.
But what puzzled me is this: I had thought that whatever method is used to import data into Excel, even in Excel 2003 with XLS files, the application would raise an error if you tried to import more lines than fit. Yet the problem was not flagged by any error report; it was found by people checking the numbers.
If you use the menu to File > Open a text file with more than 65,536 lines in Excel 2003, or try to copy and paste from another application, it raises the error message “File not loaded completely”. If you import in any version of Excel into a worksheet in compatibility mode (ie 65,536 rows), it reports “The text file contains more data than will fit on a single worksheet”. So why did this not happen? (Spoiler: I discovered why later, see the postscript.) This was, in Sherlock Holmes terms, the dog that did not bark in the night.
I tried to imagine scenarios involving XLSWriter or Apache POI trying to create XLS files with more than 65K rows; I worked out an unlikely scenario involving copying to the clipboard from another application and in VBA using Worksheets(1).Paste which does not raise an error, it truncates the data. Other Excel experts suggested that the PHE developers simply suppressed Application.DisplayAlerts or On Error Resume Next, or even deliberately imported line by line with a cut off at 65535. They point to a variation of “Hanlon’s razor”: “Don’t ascribe to obscure complexity what can be explained by incompetence”, and to the collection of reports on thedailywtf.com and theregister.com. And Arthur Conan Doyle wrote “Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth”. [Edit: Another conclusion, which I thought improbable at first, is that this public health service is really so starved of IT resources that they have to use unsupported versions of Excel more than 13 years old. Update: this can happen even in modern Excel, see below for 25-Nov-2020]
However, it was much simpler than that. Only after trying the most obvious thing, which I missed the first time I posted this, did I find it: VBA Workbooks.Open “file.txt” will load as much data as it can WITHOUT raising any error. No On Error Resume Next needed. So this is in my view an Excel failure, a defect, because the VBA behaviour is not consistent with the UI behaviour. Still, you should not assume your technology will always work perfectly, and you should put controls around your processes.
More like Occam’s razor really: pick the simplest explanation over the more complex one.
[Update 14-Oct-2020: what I now think really happened]
Update 22-Oct-2020: I found a Sky News article which gave a bit more information:
“The problem is while the Second Generation Surveillance System (SGSS), as it’s called, is plugged into the existing pathology labs that pre-date COVID-19, for some reason no-one has yet managed to connect it to the new labs which have been created during the course of this pandemic. […] Rather than feeding its results securely straight into SGSS, as Pillar 1 pathology labs do, the test results from Pillar 2 labs arrive at PHE in the form of a CSV file.” So Pillar 1 are using secure transfer, maybe in HL7 format, but the new system masterminded by consultants at £7000/day uses CSV, like the USA CDC?
Update 25-Nov-2020: I found two more scenarios where using an XLS macro in Excel 2016 silently drops data above the sheet capacity of 65,536 rows. If I use VBA QueryTables.Add “TEXT;file.csv” or an ADO Jet.OLEDB.4.0 query “SELECT * FROM file.csv” in an .xls file in Excel 2016 to populate a worksheet from a CSV file with 89,000 rows, it stops at 65,536 silently, ignoring the extra data, with no error. It’s quite likely that xls macro files persist in use even after the version of MS Office Excel is updated. So there are another couple of ways in which it could have happened. There may be more!
Have you ever had to convert a file of JSON text to Excel tables? Not simple JSON, but one with a collection of dictionaries inside a dictionary inside a collection of dictionary records inside another collection of records inside a dictionary? And you want all this as old-fashioned normalised database tables? And in VBA? Here’s how I did it:
I chose as an example a SurveyMonkey survey_details JSON file. I use VBA Collections, ADO Recordsets, Scripting.Dictionary, a free class to parse JSON in VBA, and some code to handle the inevitable exceptions. It’s a two-pass process: the first pass identifies all the fields in all the dictionaries and creates the linking primary and foreign keys; the second populates recordsets with the data for those fields and writes them out to Excel sheets. Or, of course, you could write them to a database.
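As a sketch of the first pass, assuming Tim Hall’s VBA-JSON (JsonConverter) has been imported, which parses JSON into nested Dictionaries and Collections, a recursive walk to collect every field name might look like:

```vba
' Sketch: walk a parsed JSON structure and record every dictionary key seen,
' so the full field list is known before any recordsets are built.
Public Sub CollectFieldNames(ByVal vNode As Variant, ByRef dFields As Object)
    Dim vKey As Variant, vItem As Variant
    If TypeName(vNode) = "Dictionary" Then
        For Each vKey In vNode.Keys
            dFields(CStr(vKey)) = True                ' record the field name
            CollectFieldNames vNode(vKey), dFields    ' recurse into the value
        Next vKey
    ElseIf TypeName(vNode) = "Collection" Then
        For Each vItem In vNode
            CollectFieldNames vItem, dFields
        Next vItem
    End If
End Sub
```

Usage would be along the lines of: parse the file text with JsonConverter.ParseJson, create a Scripting.Dictionary, and pass both to CollectFieldNames.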
If you like YouTube tutorials, here’s a recording where I walk through it: