When loading and saving XML data using FromXml() and ToXml() in C++, the data should be in the local C++ code page. If this is not the case then you may get the error 'Error - not well-formed (invalid token) at line x'.
This may be resolved by building the application as Unicode, or by using FromXmlStream() and ToXmlStream() which deal with Binary Data and converting the data yourself.
Xml documents can be encoded using a number of different encodings. The type of encoding is indicated using the encoding tag in the document header (i.e. <?xml version="1.0" encoding="UTF-8"?>).
Writing an XML document to file
When an XML document is persisted as a file, it is safer to consider it in terms as of a stream of bytes as opposed to stream of characters. When an XML document is serialized to a file, an encoding is applied to it. The resulting file will then be correctly encoded given the encoding applied.
- If a Unicode encoding is applied, the resulting file is prefixed with the Unicode header 0xFF 0xFE, and will be encoded with 2 bytes per character.
- If a UTF-8 encoding is applied the resulting file will contain a variable number of bytes per character. If this file is then viewed using a tool incapable of decoding UTF-8, then you may see it contains a number of strange characters. If the file is viewed using an UTF-8 compliant application (e.g. IExplorer, Notepad on Win2000 onwards, Visual Studio .Net) then the XML Document will appear with the correct characters (if characters are corrupted or misrepresented, it should be noted that some fonts do not contain the full UNICODE set)
Turning an XML document a string
When an XML document is created from a generated class using ToXml (ToXml returns a string). The string returned is encoded as Unicode (except in C++ non-debug builds), however the XML document header does not show any encoding (<?xml version="1.0"?>).
The string returned is Unicode, Unicode is the internal character representation for VB6, .Net & Java, as such if it is written to file or passed to another application, it should be passed as Unicode. If it has to be converted to a 1 byte per character representation prior to this, then data will likely be corrupted if complex characters have been used within the document.
If you need to persist an XML document to a file use ToXmlFile, if you need pass an XML document to another (non-Unicode) application, then should use ToXmlStream.
There is also a problem that commonly occurs in C++ UNICODE applications when dealing with UTF-8 encoded data. If you load a UFT-8 encoded file into a UNICODE application, the temptation is to store it in a UNICODE string (WCHAR*), and the conversion to Unicode is often implicit (part of some string/bstr class). However these conversions typically assume the source string is in the local code page, which is rarely UTF-8, and more frequently ANSI. So when the data is converted to UNICODE, the conversion function does not treat the data as UTF-8, and so does not correctly decode it. This results in a UNICODE string which no longer represents the source.
In these circumstances, it is better to either treat the data as binary or to use the appropriate conversion method - utf8 to Unicode.
Passing an XML document to a ASCII or ANSI application
It is common to want to pass the XML document you have created to a non-Unicode application. If you need to do this then you may look first at ToXml, this will provide you with a UNICODE string, however converting this to an ASCII or ANSI string may cause the corruption of complex characters (you lose information going from 2 bytes to 1 byte per character). You could take the string returned from ToXml, and apply your own UTF-8 encoding, however the encoding attribute in the header (<?xml version="1.0" encoding="UTF-8"?>) would not be present, and the XML parser decoding the document may misinterpret it.
The better solution is to use the ToXmlStream method. This allows you to specify an encoding, and returns a stream of bytes (array of bytes in VB). This byte stream is a representation of the XML Document in the given encoding, containing the correct encoding attribute in the header (<?xml version="1.0" encoding="UTF-8"?>).