To read a text file, you need to be able to handle more character sets than just ANSI. Not just in the contents but also in the file and folder names.
The classic article on this is by Joel Spolsky:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Here are folder and file names in various languages:
F:\DOCS\Arabic\التاريخ المالي والنسب.xls
F:\DOCS\Arabic\اسم ملف الاختبار\اسم ملف الاختبار.xlsm
F:\DOCS\Chinese\2016 全 日 制 ⑴ 班 2016 ~ 2017 学 年 度 第 一 学 期 MBA 教 学 计 划(教室:2-111).xls
F:\DOCS\Chinese\測試文件名\測試文件名.xlsx
F:\DOCS\Greek\όνομα αρχείου δοκιμής\όνομα αρχείου δοκιμής.xlsx
F:\DOCS\Hebrew\שם קובץ הבדיקה\שם קובץ הבדיקה.xlsx
F:\DOCS\Hindi\परीक्षण फ़ाइल नाम\परीक्षण फ़ाइल नाम.xlsx
F:\DOCS\Hungarian\kalóriaguru Munkafüzet.xlsm
F:\DOCS\Hungarian\kalóriaguru\kalóriaguru.cls
F:\DOCS\Japanese\テストファイル名\テストファイル名.xlsx
F:\DOCS\Korean\테스트 파일 이름\테스트 파일 이름.xlsx
F:\DOCS\Russian\Лист1.xlsx
F:\DOCS\Russian\имя тестового файла\имя тестового файла.xlsx
Misinterpreting characters results in accented characters appearing as two garbage characters; or you see a white question mark in a black diamond which indicates that UTF-8 encoding could not figure out what it was.
https://en.wikipedia.org/wiki/Mojibake
In the case of VBA, the classic Open statement, and the functions Dir(), Filelen(), FileDateTime() cannot handle Unicode file names and folders. The Filesystem Stream object can read Unicode names but cannot handle UTF-8 encoding. So, I use the ADO.Stream object in the module ReadFileContents.bas to handle a number of different cases. The variations are not complete but it handles the file types I get.
Let me know of any improvements you can suggest.
Here is the function:
' Reads the entire contents and tries to GUESS the character encoding
Function ReadFileContentsAndCharset(ByVal FileName As String, ByRef Text As String, _
ByRef charset As String) As Long
charset = "UTF-8" ' first guess. If it's not UTF-8 it may still be readable as UTF-8
ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
If ReadFileContentsAndCharset = 0 Then ' it succeeded without error
Debug.Print FileName ' comment these out in production
Dim lUTF8Len As Long, sBOM As String
' UTF-8 files do not have to have a BOM so no point in testing for that
sBOM = Left$(Text, 2)
lUTF8Len = Len(Text)
If sBOM = Chr$(254) & Chr$(255) Or sBOM = Chr$(255) & Chr$(254) _
Or InStr(1, Text, Chr(0)) > 0 _
Then ' UTF-16 or nulls? Might be double-byte character set
charset = "UTF-16"
ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
Debug.Print charset & " len=" & Len(Text) & ", utf-8 len=" & lUTF8Len
If lUTF8Len = Len(Text) * 2 Then
'ReadFSOStreamText(FileName, Text) would work here too for UCS2
Debug.Print "As expected, double byte character set"
End If
ElseIf InStr(1, Text, ChrW$(65533), vbBinaryCompare) > 0 Then ' &HFFFD
' Check for U+FFFD <question mark in black diamond> replacement character
' used to replace an unknown, unrecognized or unrepresentable character
'https://en.wikipedia.org/wiki/Specials_(Unicode_block)
charset = "Windows-1252" ' second guess; superset of ISO-8859-1
ReadFileContentsAndCharset = ReadADOStreamText(FileName, Text, -1, charset)
Debug.Print charset & " len=" & Len(Text) & ", utf-8 len=" & lUTF8Len
If Len(Text) > lUTF8Len Then
Debug.Print charset & " converted some characters into multiple characters"
' so it must be some other encoding. No idea.
End If
End If
Debug.Print Len(Text) & " chars, charset inferred=" & charset & " " & FileName
Else
MsgBox FileName & vbLf & "Error " & Err & " " & Err.Description
End If
End Function
Function ReadADOStreamText(ByVal FileName As String, ByRef Text As String, _
Optional ByRef numchars As Long, Optional ByRef charset As String) As Long
' initialise default parameters
If numchars = 0 Then numchars = -1
'numchars Optional. The number of characters to read from the file, _
or a StreamReadEnum value. Default is adReadAll=-1
If Len(charset) = 0 Then charset = "utf-8"
With CreateObject("ADODB.Stream")
.Type = 2 'adTypeText = 2 Specify stream type - text/string data.
.charset = charset
.Open
On Error Resume Next
.LoadFromFile FileName
'Err 3002 File could not be opened.
Text = vbNullString ' in case of err
Text = .ReadText(numchars)
ReadADOStreamText = Err
End With
End Function
Pingback: Hello (the rest of the) World – reading data in other languages | Patrick O'Beirne @ sysmod
Pingback: Reading Unicode file contents as text | Patrick O'Beirne @ sysmod