Get the real used range of a worksheet

Update 1-Dec-2021: Timing tests show that a binary search method is the fastest.

VBA programmers use many different ways to find the bottom right-hand corner of a sheet’s data: the range that encompasses all the data cells, but not empty formatted cells or cells that were previously used and are now empty.
Paul Kelly’s good introductory video describes well the problems with four common methods:
.CurrentRegion, .UsedRange, .SpecialCells(xlCellTypeLastCell), and .End(xlUp).
The method most experienced people advocate is to use the .Find method to search for any data (“*”), starting at A1 and searching backwards, so that the search begins at the bottom of the sheet.
However, that returns only the first cell if the data found is in a merged range.
And it searches only visible rows, ignoring rows hidden by a filter, which may be the last rows of the sheet.
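
For reference, the usual backwards .Find pattern looks like this (a minimal sketch; ws is assumed to be the worksheet being searched, and as noted it skips rows hidden by a filter):

Dim lastRowCell As Range, lastColCell As Range, lastCell As Range
Set lastRowCell = ws.Cells.Find(What:="*", After:=ws.Range("A1"), LookIn:=xlValues, _
                                SearchOrder:=xlByRows, SearchDirection:=xlPrevious)
Set lastColCell = ws.Cells.Find(What:="*", After:=ws.Range("A1"), LookIn:=xlValues, _
                                SearchOrder:=xlByColumns, SearchDirection:=xlPrevious)
If lastRowCell Is Nothing Then
    Set lastCell = ws.Range("A1")        ' the sheet has no data at all
Else
    Set lastCell = ws.Cells(lastRowCell.Row, lastColCell.Column)
End If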

Another way, posted by Tim Jeffryes on the Excel-L mail list, is to use a formula to find the maximum row and column that contain data. That works well for reasonably sized sheets, but if the used range exceeds about 45 million cells (the limit may depend on your version of Excel), which can happen if entire rows or columns have been formatted, the formula returns #VALUE!.
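
As a rough illustration of the formula idea (not the exact formula from the workbook), an ISBLANK-based pair of array formulas can be evaluated from VBA like this:

Dim ur As Range, lastRow As Variant, lastCol As Variant
Set ur = ws.UsedRange
lastRow = ws.Evaluate("MAX(IF(NOT(ISBLANK(" & ur.Address & ")),ROW(" & ur.Address & ")))")
lastCol = ws.Evaluate("MAX(IF(NOT(ISBLANK(" & ur.Address & ")),COLUMN(" & ur.Address & ")))")
' On a very large used range these return Error 2015 (#VALUE!), so test with IsError before using them.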

I provide a sample workbook to test all these scenarios, with a GetLastCell() function. It tries the formula method first, and if that returns an error value, it falls back to a binary search of the used range to find the cell at the intersection of the last used row and column. In practice, you may as well use BSearchLastCell() all the time; it’s faster than the formula method.
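
BSearchLastCell() itself is in the workbook; the underlying idea is a binary search over the used range. A rough sketch using COUNTA (not the workbook’s code, and assuming the used range starts at row 1 and the sheet has at least one non-empty cell) looks like this:

Function LastDataRow(ws As Worksheet) As Long
    Dim lo As Long, hi As Long, m As Long
    lo = 1
    hi = ws.UsedRange.Rows.Count
    Do While lo < hi
        m = (lo + hi + 1) \ 2             ' round up so each pass narrows the range
        If Application.WorksheetFunction.CountA(ws.Rows(m).Resize(hi - m + 1)) > 0 Then
            lo = m                        ' data exists at or below row m
        Else
            hi = m - 1                    ' rows m to hi are empty
        End If
    Loop
    LastDataRow = lo                      ' repeat the idea with columns for the last used column
End Function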

For a sample data sheet with a data range A1:K1001, the .Find method takes 13ms, the formula methods 30ms, and the binary search only 1.7ms. For a test sheet formatted to XFD1048576 and a data range to K1001, Find takes 103ms, the formulas fail, and the binary search only 3.8ms.

I also provide a sub ShapesLastCell() that reports the bottom-right shape. If you want to delete apparently excessive rows or columns, you may want to avoid deleting shapes, or first move them to a better position.

The main sub testAll() runs all tests on all sheets. It shows:
1) The Range.Find and MAX(SUBTOTAL(… methods omit the rows hidden by filtering.
2) The ISBLANK and MMULT formulas search hidden rows but are limited to a used range of about 45 million cells.
3) Shapes can be placed outside the used range.
4) The best method is to use the BSearchLastCell function.
5) Any data cell found needs to be adjusted for any merged range that it is part of.

Let me know what you think – is this useful for you?

Posted in Excel/VBA

Power Query infer data types with Table.Profile

The Power Query UI automatically detects column data types and performs the transformations for you. Typically these steps look like this:

let
    Source = Csv.Document(File.Contents("online_retail_II.csv"),[Delimiter=",", Columns=8, Encoding=65001, QuoteStyle=QuoteStyle.None]),
    online_retail = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(online_retail,{{"Invoice", type text}, ...etc...})
in
    #"Changed Type"

However, when I am calling Power Query M code from VBA (see my previous posts) to open any given text file, I usually do not know in advance what the column names or data types are. So after PromoteHeaders I have to infer the column data types, and there is no built-in Power Query function to do that. The following DetectListValuesType() function, adapted from Imke Feldmann (The BIccountant) and Steve de Peijper, infers the data type from a list of values, with a given margin of error allowed. If I subsequently transform the column, any values that are not valid for the majority type, such as text mixed into a largely numeric column, are converted to Error values.

DetectListValuesType =
// Return the majority data type in a list with the given proportion of errors allowed.
(values as list, marginforerror as number) as text =>
    let
        maxerrors = List.Count(values) * marginforerror, // 0 <= marginforerror < 1

        distinctvalues   = List.Distinct(values),

        NullValuesOrEmptyStringsOnly =
            List.NonNullCount(distinctvalues) = 0
            or
            (List.NonNullCount(distinctvalues) = 1 and distinctvalues{0} = ""),

        CheckTypes = {"number", "date", "datetime", "text"},

        Env = Record.Combine({[L=values],
                            [NF = #shared[Number.From]],    [DF = #shared[Date.From]],   
                            [DTF = #shared[DateTime.From]], [TF = #shared[Text.From]],
                            [LT = #shared[List.Transform]], [LS = #shared[List.Select]], [LC  = #shared[List.Count]]
                            }),

        NumberOfErrors = List.Transform( {"NF", "DF", "DTF", "TF"},
                                            each Expression.Evaluate("LC(LS(LT(L, each try  " & _ & "(_) otherwise ""Error""), each _ = ""Error""))", Env)
                                        ),

        // Type conversion of null will return null, and not throw an error, so null values do not contribute to the conversion error count.
        // If all list values are equal to null, no errors will be detected for any type. type text will be selected in this case
        // If list values are simple values (not complex typed values like e.g. {"a","b"}), Text.From() will not throw an error,
        // and the number of errors for conversion to text will be 0; i.e. text conversion will meet the zero error criterion
        // If values contain dates only (none of the values is with time),
        // for both date and datetime conversion the number of errors will be 0 and date will be selected (date has a lower index) as conversion type.
        // If at least 1 of the date values include a valid time part, date conversion will throw an error, and datetime conversion will not throw an error;
        // hence datetime conversion will be selected.

        CheckErrorCriterion = List.Transform(NumberOfErrors, each _ <= maxerrors),

        typeindex = List.PositionOf(CheckErrorCriterion, true),  // First type which meets the error criterion;
                                                                 // if none of the types meets the error criterion, typeindex = -1

        typestring = if NullValuesOrEmptyStringsOnly or typeindex = -1 then "text" else CheckTypes{typeindex}
    in typestring

To create a schema table with the column names and data types of all the columns in a source table, I can use the Table.Profile function with the undocumented parameter optional additionalAggregates as nullable list. The details are given in Lars Schreiber’s post on additionalAggregates. In brief, it is a list of lists, like this: {{}, {}}. Each of these inner lists represents a new calculated profile column and consists of three items:
Output column name as text, typecheck function as function, aggregation function as function

My ProfileAggregates() function code then is as follows:

// return a profiling table with the inferred data type of each column
//Column	Min	Max	Average	StandardDeviation	Count	NullCount	DistinctCount	ColumnType
(SourceTable as table, numberofrecords as number, marginforerror as number) => 
    let
        nrecords = if numberofrecords <= 0 then Table.RowCount(SourceTable) else numberofrecords,

        tblProfile = Table.Profile( SourceTable, {{"ColumnType", each true, 
            each DetectListValuesType( List.FirstN( _, nrecords ), marginforerror) }})

    in
        tblProfile

This function can then be called like this:

ProfileAggregates(online_retail, 10000, .01)
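
For completeness, here is a rough sketch of how a query like this can be created and loaded from VBA (Excel 2016+); the query name, M text and destination are illustrative, not the code from my workbook:

Sub LoadProfileQuery()
    Dim mCode As String
    ' Assumes the DetectListValuesType and ProfileAggregates queries already exist in this workbook
    mCode = "let Source = ProfileAggregates(online_retail, 10000, 0.01) in Source"
    ThisWorkbook.Queries.Add Name:="ProfileOutput", Formula:=mCode
    With ActiveSheet.ListObjects.Add(SourceType:=xlSrcExternal, _
            Source:="OLEDB;Provider=Microsoft.Mashup.OleDb.1;Data Source=$Workbook$;Location=ProfileOutput", _
            Destination:=ActiveSheet.Range("A1")).QueryTable
        .CommandType = xlCmdSql
        .CommandText = Array("SELECT * FROM [ProfileOutput]")
        .Refresh BackgroundQuery:=False
    End With
End Sub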

I tried it on the “Online Retail II” data set, which has 1,067,371 rows. That’s too many rows for an Excel sheet, but a good test. Excel takes 25 minutes to run the complete analysis, which suggests some performance tuning is needed. Here is some of the output; I reordered the columns to put ColumnType after the column name.

Output of Table.Profile with additional ColumnType column

If you know how to create the table of column names, types, and other data, directly rather than using Table.Profile, let me know at the PowerBI Community forum, or in the comments here.

Posted in Uncategorized

Excel Virtually Global 2021

I am one of the 48 speakers at a 48-hour Excel extravaganza online on Oct 12 and 13. It’s FREE but you are encouraged to make a donation to charity to reflect the value to you of the content you benefit from. Visit ExcelVirtuallyGlobal.com.

The agenda:

https://excelvirtuallyglobal.com/program/

“Microsoft’s Most Valuable Professionals, or MVPs, are technology experts who passionately share their knowledge with the community.  They are always on the “bleeding edge” and have an unstoppable urge to get their hands on new, exciting technologies. This virtual conference, now in its fifth year, presents Excel, Data Platform and PowerPoint MVPs, together with other acknowledged experts from Microsoft and / or around the globe to present, answer questions and demonstrate the future of Excel and how it will make your life easier, personally and professionally.
Topics include: Auditing Spreadsheets, Charts, Dashboards, Data Analysis, Data Types, Dynamic Arrays, Excel Tricks & Tips, Financial Modelling, Forecasting, Maps, Microsoft 365, New Functions and Features, PivotTables, Power BI, Power Pivot, Power Query, Presenting Your Data, Spilled Formulae, Testing, Timelines and developing solutions with DNA .Net, VBA, Python, R, OfficeScript and Office-JS.  And note well – it’s not all in English, with sessions in Mandarin, Portuguese, Spanish and Telugu, as well.

Most sessions will be recorded so you may watch them later with downloads aplenty – there are no medals for staying up to watch the entire event live!

From your own favourite chair, bring a laptop, an inquisitive mind and your sense of humour.  Be prepared to learn heaps.  And remember, it’s for charity – all profits will be donated to a global COVID-19 research charity.”

My topic is how to avoid data loss and corruption when importing text files into Excel. It is for the developer who needs to automate importing text files from anywhere in the world, with Unicode folder and file names and content in any character set and encoding, and to import data correctly with any field delimiter and with embedded line breaks in quoted fields. It is based on other articles in this blog, collected together into one demo workbook.

Video timings:

00:00 Intro and the Windows file system object;
04:30 Reading file content as text;
Querying structured data – CSV and its many variants; in four ways:
11:00 User interface vs VBA open methods
21:05 Querytables
27:22 ADO Recordset
32:35 Power Query
42:00 End

Promo 1-minute video

While waiting for the organisers to correct the recording, here is a link to the full 42-minute video.

Better and safer ways to read text files in VBA



Posted in Uncategorized

Event Tracing is three times faster than OutputDebugString

I use the OutputDebugString API in Windows to send debugging messages without the line limitations of the VBA Immediate Window. This is also useful when tracing the causes of crashes that kill the Excel process.
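
For anyone who has not used it, the API is one declare away; a minimal sketch (the DebugTrace wrapper name is mine):

#If VBA7 Then
    Private Declare PtrSafe Sub OutputDebugStringW Lib "kernel32" (ByVal lpOutputString As LongPtr)
#Else
    Private Declare Sub OutputDebugStringW Lib "kernel32" (ByVal lpOutputString As Long)
#End If

Public Sub DebugTrace(ByVal msg As String)
    OutputDebugStringW StrPtr(msg)    ' the W version sends the VBA string as Unicode
End Sub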


I use TraceSpy to capture the debug output. It is faster than DbgView, which tends to bog down when there are many traces.

Event Tracing for Windows (ETW) is faster than OutputDebugString. Calling ETW from VBA takes 26 microseconds, three times as fast as OutputDebugString’s 77 µs. And WpfTraceSpy correctly displays Unicode strings such as Cyrillic.

Simon Mourier has provided 32 and 64 bit VBA modules for WpfTraceSpy on GitHub.
I have contributed a module ETWModule3264.bas suitable for Excel add-ins that need to run without modification in either 32 or 64 bit version of Office.
This has been tested in 32-bit Excel 365 and 64-bit Excel 365 on Windows 10.
I work only in Excel, but it should apply to other Office products too.

https://github.com/smourier/TraceSpy/tree/master/vba

I noticed that the speed of the ETW output depends on the folder the xlsm is in. It can take as little as 1 microsecond for an ETW call, or as much as the 26 µs shown above.

I realised that the fast-performing xlsm files are in Excel Trusted Locations. I wonder whether Windows Defender is inspecting the untrusted API calls and slowing them down. In a Trusted Location, OutputDebugString takes 23 µs and ETW takes 1 µs. ETW is so fast because all EventWriteString does is push the event onto the trace event queue and return, so there is very little overhead.

This could be an idea for API performance enhancement, if this applies to other Win APIs called from VBA.

So… make that “Event Tracing is twenty times faster than OutputDebugString” !

Posted in Uncategorized

VBA Performance Profiler : timing procedure execution

This is an update and replacement for the post on 9th March 2017 titled ‘Free add in to profile VBA speed and coverage’. Updated 8-Jul-2021 with an example profile report.

Download XLTEST_VBA_Profiler.xlam.

(Updated 22-Jul-2021: Tracing P_.O also sends to OutputDebugString)

I use it to find out which procedures are taking the most time, so that I can target performance improvements to the subs which need it.

INSTRUCTIONS

Open the file XLTEST_Vba_Profiler.xlam and enable macros, or install it as an add-in.

Open the workbook you want to profile. Save it under a different name to avoid overwriting the original.

On the Developer tab, in the VBA Profiler group, click the button “Profiling”.

This gives a form with two checkboxes and a Help button (which shows the same instructions as on this page). The options are:

  1. Add a call to every line of code to trace its execution for coverage analysis.
    A tab-delimited text file of all the code, after the trace lines have been added, is output
    so that the sub VBA_Profile_Reports can show which lines of code were executed.
    The text file is named from your workbook name, suffixed with _VBA_COVERAGE.TXT.
  2. Export all the VBA to a text file before making the changes.
    This is just for your records; it is not used in this process.
    The tab-delimited text file is named from your workbook name, suffixed with _VBA_BEFORE.TXT.

When you click OK, it adds a line to every procedure in the project, unless the procedure’s declaration line contains the comment ‘@notrace.
The inserted line looks like this, where 12345 is the line number in the project:


Dim P_ As Object: Set P_ = VBA_Profiler.New_Profiler(): P_.Start 12345, "modulename", "procname"

It adds only one line of code to each procedure, and that line calls subs in the add-in to do the timing and calculations. That adds a few milliseconds of overhead to each procedure call. For very small, trivial, and frequently called procedures, consider adding the comment ‘@notrace to the Sub or Function declaration to suppress the insertion of the profiling line.
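
For example, a tiny helper like this hypothetical one would be skipped:

Function CellText(c As Range) As String '@notrace
    CellText = CStr(c.Value2)
End Function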

It adds a reference to VBA_Profiler to the VBProject so that those inserted lines can call the functions to do the timing calculations.

Finally, insert a procedure in any standard module like this:

Sub RunProfiler()
   VBA_Profile_Initialize [delimiter]

   Call your_top_procedure_name

   VBA_Profile_Reports
End Sub

If the optional delimiter is specified after VBA_Profile_Initialize, such as vbTab or “,”,
every P_.Start call sends the values passed to it to the debug output, separated by that delimiter.
You can watch and capture that output using DebugView from Sysinternals or TraceSpy from Simon Mourier. I use TraceSpy because it is faster than DbgView and did not miss any output in fast tests. To be sure that TraceSpy can keep up with the output, do not use Auto-Scroll.

VBA_Profile_Reports outputs a tab-delimited file named from the workbook name suffixed by _VBA_PROFILE.TXT;
and if you chose the coverage analysis, a workbook is created with one sheet containing the contents of the _VBA_COVERAGE.TXT file, with an extra column showing the number of times each line of code was executed.
The workbook is unsaved; you can save it as an .xlsx file if you wish to keep it.

Finally, go into the VBE and click Compile to check that no syntax errors have been introduced.

When a procedure exits and the P_ variable goes out of scope, the profiler counts up the number of calls and the length of time the procedure took. The resolution of the timer is about a microsecond.

The visible methods in this add-in are:

Function New_Profiler() returns an object that collects timing data for the reports.

Sub VBA_Profile_Initialize() Call to initialise the profiler data collection

Sub VBA_Profile_Reports(Optional sProfileTSV As String, Optional sCoverageTSV As String)
Produces reports. The default file names are
For the Profile output: workbook name plus “_VBA_PROFILE.TXT”
For the Coverage input: workbook name plus “_VBA_COVERAGE.TXT”

The report columns are:
Module.Proc: Module name and procedure name
Count: Number of times the procedure was called
Inclusive time: Execution time of this procedure and all those it calls
Exclusive time: Execution time within this procedure only
Average time: The average run time of the procedure, the exclusive time divided by the count of calls
Index: A sequence number

If you lose the original workbook, you can remove the profiling code by doing a Search with Wildcards and Replace All with nothing for these Find What strings:
Dim P_ *
P_O *&:

Feedback is welcome!

Limitations: It does not handle Unicode workbook and module names, so use the Windows-1252 Latin character set only.

Assurance: It only works locally, and does not ‘phone home’ with any telemetry. It is supplied free and with no guarantee, in the hope that it provides some utility.

EXAMPLE

Consider a module with three procedures. TopProc calls SubA once. SubA calls SubB four times. TopProc takes 1 second with its own code, SubA takes 3 seconds, and SubB takes 2 seconds. This is what the profile report will look like:

Module.Proc   Count   Inclusive time   Exclusive time   Average time   Index
TopProc       1       12.0000000       1.0000000        1.0000000      1
SubA          1       11.0000000       3.0000000        3.0000000      2
SubB          4       8.0000000        8.0000000        2.0000000      3
Example profile report

From this you see that:

TopProc is called once. The total execution time including its called subs is 12 seconds. The code in TopProc alone takes one second.

SubA is called once. The code in SubA takes three seconds. The total execution time including the eight seconds for its called subs is 11 seconds.

SubB is called four times. The total execution time is 8 seconds, so the average time per execution is 2 seconds.

Sort the table by Exclusive time descending to see which subs take the most time overall. Sort by Average time to see which are slowest per execution.

To learn more about profiling, see these Microsoft articles:

https://docs.microsoft.com/en-us/visualstudio/profiling/instrumentation-method-data-views?view=vs-2017

https://docs.microsoft.com/en-us/visualstudio/profiling/caller-callee-view-instrumentation-data

Caller/Callee view – instrumentation data

Posted in Uncategorized

Querying Text/CSV files in VBA, method #3: Power Query

The module PowerQueryText.bas contains the code to query delimited files and return tables.

The advantages of Power Query are:

  1. It can handle comma- and tab-separated values and does not need a schema.ini as ADO does. You do, however, need to specify the separator character; see below for the details.
  2. Tab-delimited files do not have to have the extension “.tab” as they do for ADO, but can be named with a “.tsv” extension if that’s what you have.
  3. It can handle line breaks within quotes, or not, depending on the option you set. I always include line breaks within quotes.
  4. Its default encoding is 65001 (UTF-8), which has handled all the text encodings I have tested it with; the UTF-16 (UCS-2) encoding is 1200. A sketch of such a query follows this list.
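
As a rough illustration (a sketch, not the PQDelimitedText code; the file path and query name are placeholders), an M expression for a tab-delimited UTF-8 file that keeps line breaks inside quoted fields can be built and registered from VBA like this:

Dim mExpr As String
mExpr = "let" & vbCrLf & _
        "    Source = Csv.Document(File.Contents(""C:\data\sample.tsv""), " & _
        "[Delimiter=""#(tab)"", Encoding=65001, QuoteStyle=QuoteStyle.Csv])," & vbCrLf & _
        "    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars=true])" & vbCrLf & _
        "in" & vbCrLf & "    Promoted"
ThisWorkbook.Queries.Add Name:="SampleTSV", Formula:=mExpr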

In the ADO module, I determined the field separator by reading the file contents as bytes, using a function CountColumnsFromBytes to infer the number of columns and their separator. In Power Query, there is a function Binary.InferContentType that returns statistics on possible delimiters, and your code can choose the most likely one.

At the moment, I have to send the output to a worksheet. If anyone finds a way to return the table data as a memory structure in VBA such as a recordset, let me know.

The functions contained in PowerQueryText.bas are:

testPQcsv    – a simple test stub that prompts for a file name and queries it. It calls the following functions.

InferContentType – needed to infer the encoding and field delimiter (separator) character.

TableProfile – returns statistics about a file to help determine the number of rows and columns.

PQMCsvFormula – a generic function to return the M expression for a delimited file.

PQDelimitedText – runs a query and outputs the result.

WBQueryToSheet – sends the data from a query to a sheet table range.

WBQueryToListTable – sends the data from a query to an Excel ListTable.

That is the end of this series on reading delimited text files in Excel VBA. I hope you find it useful; let me know of any bug fixes needed, or suggest any improvements.

Posted in Excel/VBA, Power Query

Querying Text/CSV files. Method #2: ADO Recordset

The module ADOQueryText.bas has the functions to read a delimited text file into a worksheet. It uses an ADODB recordset to perform the query. The pros and cons of ADO are:

  1. It can handle embedded newlines in data, unlike a Querytable. These are common in CSV files from sources (e.g. surveys) where users enter free text with multiple lines, and the text is returned in one field.
  2. It cannot open Unicode file names, so to be safe you must copy the original file to a temporary copy with an ANSI filename.
  3. It requires a schema.ini file to recognise fields delimited by tabs. To avoid this, you could use the ReplaceBytes() function in the QueryTables module from the previous blog post to change all tab delimiters to commas. (A minimal sketch of the basic ADO query follows this list.)
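
To give a feel for the approach, a bare-bones version looks like this (a sketch only, not ADOLoadWSFromDelimitedTextFile; the provider, folder, file name and target sheet wsData are placeholders):

Dim cn As Object, rs As Object
Set cn = CreateObject("ADODB.Connection")
cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\;" & _
        "Extended Properties='text;HDR=Yes;FMT=Delimited'"
Set rs = CreateObject("ADODB.Recordset")
rs.Open "SELECT * FROM [sample.csv]", cn, 1, 1   ' adOpenKeyset, adLockReadOnly
wsData.Range("A2").CopyFromRecordset rs          ' write the headers separately from rs.Fields
rs.Close: cn.Close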

To avoid some of these complications, the final post in this series will be on how to use Power Query in VBA to query a text / csv file.

The functions are:

ADOLoadWSFromDelimitedTextFile(ByVal OriginalFullname As String, ByRef wsData As Worksheet, Optional ByRef Delimiter As String, Optional ByRef CodePage As Long)

              Queries a structured text file into a table in a sheet, returning the delimiter and code page by reference. It uses the two following functions, writing a schema.ini file if the data source is tab-delimited, and checks whether there are enough rows on the sheet for the records.

ADORecordsetFromTextfile(ByVal FullName As String, ByRef rs As ADODB.Recordset,  ByVal FMT As String, ByVal CodePage As Long, Optional ByVal strSelect As String)

              Queries a structured text file into an ADO recordset, returning the format and code page by reference. strSelect is the list of field names or expressions in SQL syntax; the default is “*”, all fields.

CopyRecordsetToSheet(rs As ADODB.Recordset, Optional ByRef wsTarget As Worksheet, Optional ByVal RecordSource As String)

              Writes a recordset to a worksheet, adding the sheet and writing headers if required.

Posted in Excel/VBA

Querying Text/CSV files. Method #1: Querytable

You will know that simply opening a CSV file in Excel is often a poor way to do it. Excel has a habit of eagerly interpreting as a date anything that looks even vaguely like a date. A previous blog post describes how frequently such corrupted data appeared in genome research papers, until the gene names were finally changed to avoid the problem.


The legacy Get Data from Text/CSV command in Excel creates a Querytable. This is pretty good for most purposes, as long as you use the undocumented TextFilePlatform value of 65001. There is one glaring limitation: it cannot handle newlines in quoted strings. I get these frequently in SurveyMonkey downloads where people have entered free-text comments with multiple lines. So the workaround is to pre-process the file to change any quoted linefeed (LF) or carriage return (CR) characters to spaces.
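
In outline, the import itself looks like this (a sketch, not the QueryTables.bas code; ws and the file path are placeholders):

With ws.QueryTables.Add(Connection:="TEXT;C:\data\sample.csv", Destination:=ws.Range("A1"))
    .TextFilePlatform = 65001                 ' UTF-8; undocumented but accepted
    .TextFileParseType = xlDelimited
    .TextFileCommaDelimiter = True
    .TextFileTextQualifier = xlTextQualifierDoubleQuote
    .RefreshStyle = xlOverwriteCells
    .Refresh BackgroundQuery:=False
    .Delete                                   ' drop the connection, keep the imported data
End With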

The module QueryTables.bas has the code to do that.

There is another way, using an ADO Query, which will be the subject of the next blog post. However, that has its own complications. For example, if the file is tab delimited you have to write a schema.ini file. The final way, in the last blog post, will be to use Power Query.

Posted in Excel/VBA

Reading Unicode file contents as text

This is a short update on my post of Sep 28, 2020 on reading Unicode text in VBA.

That post mentioned the limitations of old VB statements which cannot handle Unicode:

   Open FullName For Input As #handle : Line Input #handle, strLine

I am reposting the module as ADOStream.bas with the following functions:

MessageBox(Prompt As String, Optional Buttons As Long = VbMsgBoxStyle.vbOKOnly, Optional Title As String = "Message") As VbMsgBoxResult

MessageBoxW from the Windows API correctly displays international Unicode characters.

ReadFileBOMCharset(ByVal filename As String, ByRef charset As String) As Long

Infer the encoding of a file.

CharsetFromBytes(bytes() As Byte) As String

   Infer charset if there is any Byte Order Marker in the first 3 bytes of content.

ReadADOStreamFile(ByVal FullName As String, ByRef text As String, ByRef charset As String, Optional ByRef numchars As Long = -1) As Long

   Read the contents of a file as text, inferring the encoding.

WriteADOStreamText(ByVal filename As String, ByRef text As String, Optional ByRef charset As String) As Long

    Write text to a file.

ReadADOStreamBytes(ByVal filename As String, ByRef bytes() As Byte, Optional ByRef numchars As Long = -1) As Long

    Read the contents of a file as a byte array, including any Byte Order Marker (BOM).
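
For orientation, the core of the ADODB.Stream technique that these functions wrap is roughly this (a sketch, not the module code; the file path and charset are placeholders):

Dim stm As Object, text As String
Set stm = CreateObject("ADODB.Stream")
stm.Type = 2                            ' adTypeText
stm.Charset = "utf-8"                   ' or the charset inferred from the BOM
stm.Open
stm.LoadFromFile "C:\data\sample.csv"   ' ADO streams accept Unicode file names
text = stm.ReadText(-1)                 ' adReadAll
stm.Close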

Posted in Excel/VBA

Hello (the rest of the) World – reading data in other languages

Have you ever opened a text file in Excel and found that the characters looked weird or corrupted? The Wikipedia entry for mojibake says ‘In German, Buchstabensalat (“letter salad”) is a common term for this phenomenon, and in Spanish, deformación (literally deformation).’

This is one of a series of blog posts on file handling in VBA to cope with Unicode file names and contents. Part 1 was posted on Sep 28 2020, about reading text from Unicode files. The articles should provide all the Excel VBA code you need for working with Unicode folder and file names in the Windows file system, reading unstructured Unicode text, and structured text file data queries.

This post provides the code for using the FileSystemObject to replace the old VB functions like Dir(), FileLen(), FileDateTime(), etc., which do not work with Unicode file names. FileSystem.bas is the module you can import into your project, and the CSV file is a sample downloaded from opengov.gr.
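
The underlying idea is simply to route everything through Scripting.FileSystemObject, which copes with Unicode names; a minimal sketch of the idea, not the module code (the file path is a placeholder):

Dim fso As Object, f As Object
Set fso = CreateObject("Scripting.FileSystemObject")
If fso.FileExists("C:\data\Σχόλιο UTF-8 BOM.csv") Then
    Set f = fso.GetFile("C:\data\Σχόλιο UTF-8 BOM.csv")
    Debug.Print f.Size, f.DateLastModified   ' replaces FileLen() and FileDateTime()
End If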

Code module: FileSystem.bas   Sample file: Σχόλιο UTF-8 BOM.csv

Old VB                    Function
CurDir                    GetCurDir
                          ChPath
ChDir                     SetCurDir
                          NewFSO
                          GetFSOFile
Dir                       FileExists
Dir(file, vbDirectory)    FolderExists
                          GetPathAndName
FileLen                   FileSize
FileDateTime              FileDateModified
FileCopy                  CopyFile
Name ... As               RenameFile
Kill                      DeleteFile
                          GetTempName
                          GetTempFullName
                          GetFSOFolder
MkDir                     CreateFolder
RmDir                     DeleteFolder
GetAttr                   GetFileAttributes
SetAttr                   SetFileAttributes
                          FileAttrString
                          GetOpenFilenameStr

Let me know of any improvements you can suggest.

Posted in Excel/VBA