One of the older applications I support uses ActiveX controls embedded inside a web page. These controls request data from a web server to update the information on the page without requesting the whole page again, much in the same way that AJAX is now commonly used.
This has worked fine for the Latin code pages (ISO8859-1, ISO8859-15) and for the double-byte code page (cp950) that have been tested. However, it did not work when I tried the UTF-8 Unicode code page.
The reason for this is fairly simple:
VB stores strings internally using Unicode, but assumes that the outside world is ANSI.
This means that Visual Basic will convert from ANSI to Unicode (UTF-16) when storing a string, and convert it back again when it is retrieved.
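A small sketch of that behaviour, runnable from the Immediate window. The procedure name is made up for illustration, and the byte counts assume a single-byte ANSI code page and plain ASCII text:

```vb
Public Sub ShowAnsiConversion()
    Dim strText As String
    Dim bytAnsi() As Byte

    strText = "Hello"

    ' Internally each character occupies two bytes (UTF-16)
    Debug.Print Len(strText)     ' 5 characters
    Debug.Print LenB(strText)    ' 10 bytes

    ' Converting for the "outside world" gives one byte per character
    bytAnsi = StrConv(strText, vbFromUnicode)
    Debug.Print UBound(bytAnsi) + 1    ' 5 bytes

    ' Converting back treats every byte as an ANSI character
    Debug.Print StrConv(bytAnsi, vbUnicode)    ' Hello
End Sub
```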
The ActiveX controls use the Microsoft Inet control to request data via HTTP. They call its GetChunk() method in the StateChanged event to read the data into a string. This was the first cause of my problems: when the data is read into a string, Visual Basic treats the incoming bytes as ANSI and converts them to UTF-16, which corrupts the multi-byte UTF-8 characters.
The Inet control GetChunk() method takes two parameters: size and data type. The size parameter tells it how much data to read, and the data type parameter tells it what data type to return the data in. The data was being read into a string (icString); to avoid the automatic conversion I had to change this to a byte array (icByteArray).
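The change can be sketched roughly as follows. This is not the code from the controls themselves: Inet1 is an assumed control name, mbytResponse is a hypothetical module-level buffer, and only a single chunk is read to keep the example short.

```vb
' Hypothetical buffer for the raw response bytes
Private mbytResponse() As Byte

Private Sub Inet1_StateChanged(ByVal State As Integer)
    Dim vtChunk As Variant

    If State = icResponseCompleted Then
        ' icString would hand back a String built by treating the bytes as ANSI.
        ' icByteArray returns the bytes exactly as the server sent them.
        vtChunk = Inet1.GetChunk(1024, icByteArray)
        mbytResponse = vtChunk

        ' A real handler would keep calling GetChunk until no data is left.
    End If
End Sub
```

When reading in several chunks, it is safer to collect all of the bytes first and decode once at the end, because a chunk boundary can fall in the middle of a multi-byte UTF-8 character.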
So far so good. But now I had a UTF-8 byte array that I needed to convert into a string without losing data in the conversion process. This was a bit of a sticking point, as Visual Basic's string conversion function StrConv() can't cope with UTF-8 and none of the API calls I found to convert the string worked for me. You can assign a byte array directly to a string and no automatic conversion happens, but because strings are stored internally as UTF-16 the UTF-8 bytes are simply reinterpreted as UTF-16 code units, so this does not work either.
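To illustrate why the direct assignment fails, here is a minimal sketch (the procedure name is made up for this example). Two UTF-8 bytes that should be read as two characters end up as a single UTF-16 code unit:

```vb
Public Sub ShowRawByteAssignment()
    Dim bytData(0 To 1) As Byte
    Dim strText As String

    bytData(0) = &H61    ' "a" in UTF-8
    bytData(1) = &H62    ' "b" in UTF-8

    ' No conversion takes place: the bytes are copied straight into the string
    strText = bytData

    Debug.Print Len(strText)           ' 1, not 2
    Debug.Print Hex$(AscW(strText))    ' 6261, a single unrelated character
End Sub
```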
I was nearly at the stage where I either needed to write my own conversion process, or re-develop the controls in another language with better UTF-8 support.
Then I found this solution:
    Public Function ConvertUtf8BytesToString(ByRef data() As Byte) As String
        Dim objStream As ADODB.Stream
        Dim strTmp As String

        ' init stream
        Set objStream = New ADODB.Stream
        objStream.Charset = "utf-8"
        objStream.Mode = adModeReadWrite
        objStream.Type = adTypeBinary
        objStream.Open

        ' write bytes into stream
        objStream.Write data
        objStream.Flush

        ' rewind stream and read text
        objStream.Position = 0
        objStream.Type = adTypeText
        strTmp = objStream.ReadText

        ' close up and return
        objStream.Close
        ConvertUtf8BytesToString = strTmp
    End Function
This does not use any Windows API calls, but it does require a project reference to the Microsoft ActiveX Data Objects 2.5 Library or later.
Using this solution I was able to assign the original internal string variable to the result of this function and the rest of the code in the controls worked.
    strWSConnectReturnData = ConvertUtf8BytesToString(bytWSConnectReturnData)
The ActiveX controls also read data values from the webpage and POST them back to the webserver. The values are read via the DOM. These need to be converted in the opposite direction, from a Visual Basic string to UTF-8 bytes, before they can be URL encoded.
    Public Function ConvertStringToUtf8Bytes(ByRef strText As String) As Byte()
        Dim objStream As ADODB.Stream
        Dim data() As Byte

        ' init stream
        Set objStream = New ADODB.Stream
        objStream.Charset = "utf-8"
        objStream.Mode = adModeReadWrite
        objStream.Type = adTypeText
        objStream.Open

        ' write text into stream
        objStream.WriteText strText
        objStream.Flush

        ' rewind stream and read bytes
        objStream.Position = 0
        objStream.Type = adTypeBinary
        objStream.Read 3 ' skip first 3 bytes as this is the utf-8 marker
        data = objStream.Read()

        ' close up and return
        objStream.Close
        ConvertStringToUtf8Bytes = data
    End Function
This returns a byte array, and I pass it directly into a function that URL encodes the byte array, returning a string.
    strEncoded = URLEncodeUTF8ByteArray(ConvertStringToUtf8Bytes(DomValue)) ' strEncoded is an illustrative variable name
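The URL encoding function itself is not shown here. As a rough idea of what encoding a UTF-8 byte array involves, the following is a minimal sketch under my own assumptions, not the function used in the controls: the names UrlEncodeBytesSketch and ShowUrlEncoding are made up, and every byte outside the unreserved set (letters, digits, "-", ".", "_", "~") is percent-encoded.

```vb
' Hypothetical sketch of percent-encoding a UTF-8 byte array
Public Function UrlEncodeBytesSketch(ByRef data() As Byte) As String
    Dim i As Long
    Dim strOut As String

    For i = LBound(data) To UBound(data)
        Select Case data(i)
            Case 48 To 57, 65 To 90, 97 To 122, 45, 46, 95, 126
                ' unreserved characters 0-9 A-Z a-z - . _ ~ pass through
                strOut = strOut & Chr$(data(i))
            Case Else
                ' everything else becomes %XX with two hex digits
                strOut = strOut & "%" & Right$("0" & Hex$(data(i)), 2)
        End Select
    Next i

    UrlEncodeBytesSketch = strOut
End Function

Public Sub ShowUrlEncoding()
    Dim bytData(0 To 2) As Byte

    bytData(0) = &H61    ' "a"
    bytData(1) = &HC3    ' first UTF-8 byte of U+00E9
    bytData(2) = &HA9    ' second UTF-8 byte of U+00E9

    Debug.Print UrlEncodeBytesSketch(bytData)    ' a%C3%A9
End Sub
```

Working on the bytes rather than on the string means each UTF-8 byte of a multi-byte character gets its own %XX escape.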
Many thanks to Tim Hastings for his solution, as this has saved me a lot of pain!