Power Query infer data types with Table.Profile

The Power Query UI automatically detects column data types and performs the transformations for you. Typically these steps look like this:

    let
        Source = Csv.Document(File.Contents("online_retail_II.csv"),[Delimiter=",", Columns=8, Encoding=65001, QuoteStyle=QuoteStyle.None]),
        #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
        #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Invoice", type text}, ...etc...})
    in
        #"Changed Type"

However, when I am calling Power Query M code from VBA (see my previous posts) to open an arbitrary text file, I usually do not know in advance what the column names or data types are. So after PromoteHeaders I have to infer the column data types, and there is no built-in Power Query function to do that. The following code for a DetectListValuesType() function, adapted from Imke Feldmann (the BIccountant) and Steve de Peijper, infers the data type from a list of values, with a given margin of error allowed. If I subsequently transform the column, any values which are not valid for the majority type, such as text mixed into a largely numeric column, are converted to Error values.

DetectListValuesType =
// Return the majority data type in a list, with the given proportion of errors allowed.
(values as list, marginforerror as number) as text =>
    let
        maxerrors = List.Count(values) * marginforerror, // 0 <= marginforerror < 1

        distinctvalues = List.Distinct(values),

        // True if the list contains nothing but nulls and/or empty strings
        NullValuesOrEmptyStringsOnly =
            List.NonNullCount(distinctvalues) = 0
            or (List.NonNullCount(distinctvalues) = 1 and distinctvalues{0} = ""),

        CheckTypes = {"number", "date", "datetime", "text"},

        // Environment record for Expression.Evaluate: the list itself plus
        // short aliases for the conversion and list functions taken from #shared
        Env = Record.Combine({[L = values],
                            [NF = #shared[Number.From]],    [DF = #shared[Date.From]],
                            [DTF = #shared[DateTime.From]], [TF = #shared[Text.From]],
                            [LT = #shared[List.Transform]], [LS = #shared[List.Select]], [LC = #shared[List.Count]]}),

        // For each conversion function, count how many values fail to convert
        NumberOfErrors = List.Transform( {"NF", "DF", "DTF", "TF"},
                            each Expression.Evaluate("LC(LS(LT(L, each try " & _ & "(_) otherwise ""Error""), each _ = ""Error""))", Env)),

        // Type conversion of null will return null, and not throw an error, so null values do not contribute to the conversion error count.
        // If all list values are equal to null, no errors will be detected for any type; type text will be selected in this case.
        // If list values are simple values (not complex typed values like e.g. {"a","b"}), Text.From() will not throw an error,
        // and the number of errors for conversion to text will be 0; i.e. text conversion will meet the zero-error criterion.
        // If the values contain dates only (none of them has a time part),
        // the number of errors will be 0 for both date and datetime conversion, and date will be selected as it has the lower index.
        // If at least one of the date values includes a valid time part, date conversion will throw an error and datetime conversion will not;
        // hence datetime conversion will be selected.

        CheckErrorCriterion = List.Transform(NumberOfErrors, each _ <= maxerrors),

        typeindex = List.PositionOf(CheckErrorCriterion, true),  // First type which meets the error criterion;
                                                                 // if none of the types meets the error criterion, typeindex = -1

        typestring = if NullValuesOrEmptyStringsOnly or typeindex = -1 then "text" else CheckTypes{typeindex}
    in typestring
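As a quick check of the intended behaviour, the function can be called directly on a list. These examples are my own, not from the adapted code:

    // A mostly-numeric list with one stray text value.
    // With marginforerror = 0.25, one error in four values is tolerated,
    // so the majority type "number" should be returned.
    DetectListValuesType({1, 2, 3, "N/A"}, 0.25)

    // With marginforerror = 0, the stray text value disqualifies number,
    // date and datetime, and the function falls back to "text".
    DetectListValuesType({1, 2, 3, "N/A"}, 0)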

To create a schema table with the column names and data types of all the columns in a source table, I can use the Table.Profile function with its undocumented optional parameter, additionalAggregates as nullable list. The details are given in Lars Schreiber’s post on additionalAggregates. In brief, it is a list of lists, like this: {{}, {}}. Each inner list represents a new calculated profile column and consists of three items: the output column name as text, a typecheck function, and an aggregation function.
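For instance, a minimal example of my own showing that three-item shape: adding a single extra profile column that computes the median of every column.

    // {output column name, typecheck function, aggregation function}
    // "each true" means the aggregation is applied to every column.
    Table.Profile(Source, {{"Median", each true, List.Median}})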

My ProfileAggregates() function code then is as follows:

// Return a profiling table with the inferred data type of each column.
// Output columns: Column, Min, Max, Average, StandardDeviation, Count, NullCount, DistinctCount, ColumnType
(SourceTable as table, numberofrecords as number, marginforerror as number) =>
    let
        // A let variable cannot refer to itself, so use a new name rather than reassigning the parameter
        samplesize = if numberofrecords <= 0 then Table.RowCount(SourceTable) else numberofrecords,

        tblProfile = Table.Profile( SourceTable, {{"ColumnType", each true,
            each DetectListValuesType( List.FirstN( _, samplesize ), marginforerror) }})
    in
        tblProfile

This function can then be called like this:

ProfileAggregates(online_retail, 10000, .01)
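The profile can then be used to actually type the columns. The following sketch is my own code, not part of the original functions; the record-based lookup and the variable names are assumptions, but it uses only standard library functions to map each ColumnType string to an M type and feed the result to Table.TransformColumnTypes:

    let
        profile  = ProfileAggregates(online_retail, 10000, 0.01),
        typemap  = [number = type number, date = type date, datetime = type datetime, text = type text],
        // Build {{"ColumnName", type}, ...} pairs from the profile rows
        typelist = Table.ToList(
                       Table.SelectColumns(profile, {"Column", "ColumnType"}),
                       (row) => {row{0}, Record.Field(typemap, row{1})}),
        typed    = Table.TransformColumnTypes(online_retail, typelist)
    in
        typed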

I tried it on the “Online Retail II” data set, which has 1,067,371 rows. That’s too many rows for an Excel worksheet, but it makes a good test. Excel takes 25 minutes to run the complete analysis, which suggests some performance tuning is worth investigating. Here is some of the output; I reordered the columns to put ColumnType after the column name.

Output of Table.Profile with additional ColumnType column

If you know how to create the table of column names, types, and other data, directly rather than using Table.Profile, let me know at the PowerBI Community forum, or in the comments here.


About Patrick O'Beirne, Excel/VBA developer

Patrick provides consultancy and training in spreadsheet automation, data analysis, testing and model review; and the Excel spreadsheet auditing addin XLtest
