VBA Read Unicode file contents in various encodings


To read a text file, you need to be able to handle more character sets than just ANSI. Not just in the contents but also in the file and folder names.
The classic article on this is by Joel Spolsky:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Misinterpreting characters results in accented characters appearing as two garbage characters; or you see a white question mark in a black diamond which indicates that UTF-8 encoding could not figure out what it was.
https://en.wikipedia.org/wiki/Mojibake

In the case of VBA, the classic Open statement, and the functions Dir(), Filelen(), FileDateTime() cannot handle Unicode file names and folders. The Filesystem Stream object can read Unicode names but cannot handle UTF-8 encoding. So, I use the ADO.Stream object in the module ReadFileContents.bas to handle a number of different cases. The variations are not complete but it handles the file types I get.

Let me know of any improvements you can suggest.

Here is the function:

' Reads the entire contents and tries to GUESS the character encoding
Function ReadFileContentsAndCharset(ByVal FileName As String, ByRef Text As String, _
         ByRef charset As String) As Long
   charset = "UTF-8" ' first guess. If it's not UTF-8 it may still be readable as UTF-8
   ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
   If ReadFileContentsAndCharset = 0 Then ' it succeeded without error
      Debug.Print FileName   ' comment these out in production
      Dim lUTF8Len As Long, sBOM As String
      ' UTF-8 files do not have to have a BOM so no point in testing for that
      sBOM = Left$(Text, 2)
      lUTF8Len = Len(Text)
      If sBOM = Chr$(254) & Chr$(255) Or sBOM = Chr$(255) & Chr$(254) _
         Or InStr(1, Text, Chr(0)) > 0 _
         Then ' UTF-16 or nulls? Might be double-byte character set
         charset = "UTF-16"
         ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
         Debug.Print charset & " len=" & Len(Text) & ", utf-8 len=" & lUTF8Len
         If lUTF8Len = Len(Text) * 2 Then
            'ReadFSOStreamText(FileName, Text) would work here too for UCS2
            Debug.Print "As expected, double byte character set"
         End If
      ElseIf InStr(1, Text, ChrW$(65533), vbBinaryCompare) > 0 Then ' &HFFFD
         ' Check for U+FFFD <question mark in black diamond> replacement character
         ' used to replace an unknown, unrecognized or unrepresentable character
         'https://en.wikipedia.org/wiki/Specials_(Unicode_block)
         charset = "Windows-1252" ' second guess; superset of ISO-8859-1
         ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
         Debug.Print charset & " len=" & Len(Text) & ", utf-8 len=" & lUTF8Len
         If Len(Text) > lUTF8Len Then
            Debug.Print charset & " converted some characters into multiple characters"
            ' so it must be some other encoding. No idea.
         End If
      End If
      Debug.Print Len(Text) & " chars, charset inferred=" & charset & " " & FileName
   Else
      MsgBox FileName & vbLf & "Error " & Err & " " & Err.Description
   End If
End Function

Function ReadADOStreamText(ByVal FileName As String, ByRef Text As String, _
   Optional ByRef numchars As Long, Optional ByRef charset As String) As Long
   ' initialise default parameters
   If numchars = 0 Then numchars = -1
   'numchars Optional. The number of characters to read from the file, _
      or a StreamReadEnum value. Default is adReadAll=-1
   If Len(charset) = 0 Then charset = "utf-8"
   With CreateObject("ADODB.Stream")
      .Type = 2 'adTypeText = 2 Specify stream type -  text/string data.
      .charset = charset
      .Open
      On Error Resume Next
      .LoadFromFile FileName
      'Err  3002         File could not be opened.
      Text = vbNullString ' in case of err
      Text = .ReadText(numchars)
      ReadADOStreamText = Err
   End With
End Function

About Patrick O'Beirne, spreadsheet auditor

Patrick provides consultancy and training in spreadsheet development, auditing / testing and model review; and the Excel addin XLtest
This entry was posted in Uncategorized and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s