
A post came up in recent days on the NZ DUG mailing list about a problem with the LoadXMLData() function on Android. The problem was subsequently found to exist on Win32 as well, and the cause goes back at least as far as Delphi 2006. So why did it only come up now ?

The problem came to light when someone called LoadXMLData() with a string of XML whose XML declaration contained an atypical but perfectly valid encoding specification:

  <?xml version='1.0' encoding = 'UTF-8' ?>

Do you see the problem ?

It is the spaces either side of the ‘=’ in the encoding declaration.

This causes problems when the XML loading code path for a string ends up calling CheckEncoding() (in the XMLDoc unit), which contains a shockingly naive attempt to remove the XML encoding declaration for anything other than a WideChar based encoding scheme:

procedure CheckEncoding(var XMLData: DOMString; const ValidEncodings: array of string);
var
  Encoding: string;
  EncodingPos, EncodingLen: Integer;
begin
  { Check if the XML data has an encoding, if so it must match one of the
    valid encodings, or we will remove it. }
  Encoding := ExtractAttrValue(SEncoding, Copy(XMLData, 1, 50), '');
  if (Encoding <> '') and not EncodingMatches(Encoding, ValidEncodings) then
  begin
    EncodingPos := Pos(SEncoding, XMLData);
    EncodingLen := Length(Encoding) + 12;
    Delete(XMLData, EncodingPos - 1, EncodingLen);
  end;
end;

The problem is the assumed length of the "encoding='..'" declaration, which allows a fixed 12 characters on top of the length of the encoding value itself. Those 12 characters are intended to cover:

               1    a leading space
  encoding     8    for the word "encoding"
  =            1    for the "=" symbol
  ''           2    for the quotes around the encoding value

In the specific example (which came back from a call to a web service and so, as far as I know, was not directly under the control of the developer in question), this breaks. With a space either side of the ‘=’, the 12 character allowance is two characters short, so the encoding declaration is not cleanly removed and the result is:

  <?xml version='1.0'8' ?>

Which breaks the subsequent parsing of the XML for obvious reasons.
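
The arithmetic can be replayed with nothing but the System string routines. The following is a minimal sketch, not the actual XMLDoc code: it assumes that SEncoding is the literal 'encoding' and that ExtractAttrValue() returns 'UTF-8' for this declaration, then applies the same Pos/Delete steps as CheckEncoding().

```pascal
program EncodingDeleteDemo;

{$APPTYPE CONSOLE}

const
  // Assumption: stands in for the SEncoding constant used by XMLDoc.
  SEncoding = 'encoding';

var
  XMLData, Encoding: string;
  EncodingPos, EncodingLen: Integer;
begin
  XMLData := '<?xml version=''1.0'' encoding = ''UTF-8'' ?>';

  // Assumption: the value ExtractAttrValue() would hand back here.
  Encoding := 'UTF-8';

  // Same three steps as CheckEncoding().
  EncodingPos := Pos(SEncoding, XMLData);          // 21
  EncodingLen := Length(Encoding) + 12;            // 5 + 12 = 17
  Delete(XMLData, EncodingPos - 1, EncodingLen);   // removes positions 20..36

  WriteLn(XMLData);   // prints: <?xml version='1.0'8' ?>
end.
```

Pos() finds "encoding" at position 21, so the Delete() starts at the leading space (position 20) and removes 17 characters. The closing quote of the encoding value sits at position 38, so the deletion stops two characters short and leaves the stray 8' behind.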

Specification Interpretation

I suspect a mistaken interpretation of the XML specification lies behind this long-standing mistake, since the specification of the XML declaration identifies the encoding declaration as:

[80]  EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )

Which at first blush would suggest that there can be no white-space either side of the “Eq”. Except that if you follow the link to the Eq definition itself you find not a simple ‘=’ symbol but:

[25]   	Eq ::= S? '=' S?

i.e. an ‘=’ symbol with any amount of optional whitespace (not just spaces) either side.

Mitigating Circumstances

As I say, this problem goes back at least as far as Delphi 2006, but up until recently it could be easily avoided.

You see, when you load the XML into a TXMLDocument, there is not – as you might expect – a single, consistent code path followed for loading XML from the various supported sources (string variable, TStrings object and stream – i.e. from a file).

Instead there are different code paths for each of these.

In earlier versions of Delphi even using a simple string variable would invoke one of two different code paths, depending on whether the string was an ANSIString or a WideString (DOMString).

WideString XML is loaded via the simple string path and removing non-WideChar encodings from such strings makes sense in that context.

ANSIString XML is loaded via the stream mechanism, bypassing the gotcha lurking in CheckEncoding().

LoadXMLData() is overloaded to support the two different string types explicitly, so as long as you keep an eye on what your string actually contains and declare the string type appropriately you could ensure that you invoked the ANSIString version of LoadXMLData() and you don’t have a problem.

Alas Poor UTF-8. I Knew Him Well

He hath borne me on his back a thousand times.

In Delphi XE5 however, Embarcadero chose to drop support for UTF-8 Strings on “NEXTGEN” platforms, and it is this decision I think which has created the problem here. The ANSIString version of LoadXMLData() only really existed to support UTF-8 strings and so that overload is now subject to a conditional compilation directive.

It is not supported by NEXTGEN compilers, including that for Android.

The result is that even if you carefully declare an ANSIString to hold your XML, when you pass it to LoadXMLData() it will be converted to a WideString and go barrelling off down the code path that leads to the flawed CheckEncoding().

You cannot even create a TXMLDocument and call LoadFromXML(ANSIString) directly since this too is not available to NEXTGEN compilers.

You will of course get a warning about an implicit string conversion, but I cannot speak for the attitude to such warnings on behalf of the original developer who came up against this problem. As I say, the problem only came to my attention and interest via a public mailing list. For all I know, they simply ignore such warnings and/or don’t understand the consequences. I don’t know.

So what can you do ?

You could create the TXMLDocument and assign your string to the XML property (TStrings), using the Text property of course:

  doc := TXMLDocument.Create(NIL);
  doc.XML.Text := sMyXMLString;

You’ll still get a warning about the string conversion, but the loading of XML from TStrings bypasses CheckEncoding() so you won’t get the problem arising from the corruption of the XML declaration this can cause.

However, TStrings is of course – these days – also WideString based so even if sMyXMLString is formally declared ANSIString (is this even allowed on Android? I don’t know because I don’t use Delphi for FireMonkey for Android myself) it will end up being converted to a UTF-16 string again.

So although this bypasses the CheckEncoding() flaw, it will possibly cause subsequent problems as a result of the UTF-16 encoded XML incorrectly identifying itself as UTF-8 encoded.

The Long Way Around

The developer with the problem found that they could save the XML string to a file and then load that file into a TXMLDocument, and this seemed to work. Whether the mismatch between the declared and the actual encoding was resolved (inadvertently or deliberately), or is simply not an issue in that case, I do not know. Whatever the explanation, having to use an intermediate file to get a string in memory into an XML DOM in memory is far from ideal and almost certainly unacceptable.

Quite possibly the only reliable solution therefore is to implement your own mechanism for removing the XML encoding from the string (where appropriate to do so) and making sure you do that before passing it to LoadXMLData().

I leave that as an exercise for the reader. 🙂
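
For anyone who wants a starting point, here is one possible take on that exercise. It is a rough sketch under stated assumptions rather than a drop-in fix: StripEncodingDecl and IsXMLSpace are hypothetical names of my own, not RTL routines. The function removes the encoding declaration unconditionally, so deciding when that is appropriate is left to the caller. It uses Copy() rather than [] indexing so that it should not depend on whether the compiler treats strings as zero-based.

```pascal
program StripEncodingDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: True if S is a single XML whitespace character
// (space, tab, CR or LF, i.e. the "S" production of the XML spec).
function IsXMLSpace(const S: string): Boolean;
begin
  Result := (S = #32) or (S = #9) or (S = #13) or (S = #10);
end;

// Hypothetical helper: removes "S 'encoding' Eq quoted-value" from a
// leading XML declaration, allowing whitespace either side of the '='.
// Returns the input unchanged if no such declaration is found.
function StripEncodingDecl(const XMLData: string): string;
const
  SEncodingName = 'encoding';
var
  DeclEnd, StartPos, P: Integer;
  Quote: string;
begin
  Result := XMLData;

  // Only look inside a leading <?xml ... ?> declaration.
  if Copy(XMLData, 1, 5) <> '<?xml' then Exit;
  DeclEnd := Pos('?>', XMLData);
  if DeclEnd = 0 then Exit;

  P := Pos(SEncodingName, XMLData);
  if (P = 0) or (P > DeclEnd) then Exit;
  if not IsXMLSpace(Copy(XMLData, P - 1, 1)) then Exit;

  // Include all of the whitespace that precedes "encoding".
  StartPos := P;
  while (StartPos > 1) and IsXMLSpace(Copy(XMLData, StartPos - 1, 1)) do
    Dec(StartPos);

  // Skip "encoding", optional whitespace, '=', optional whitespace.
  Inc(P, Length(SEncodingName));
  while (P < DeclEnd) and IsXMLSpace(Copy(XMLData, P, 1)) do Inc(P);
  if Copy(XMLData, P, 1) <> '=' then Exit;
  Inc(P);
  while (P < DeclEnd) and IsXMLSpace(Copy(XMLData, P, 1)) do Inc(P);

  // The value must be quoted with either ' or ".
  Quote := Copy(XMLData, P, 1);
  if (Quote <> '''') and (Quote <> '"') then Exit;

  // Advance to the matching closing quote.
  Inc(P);
  while (P < DeclEnd) and (Copy(XMLData, P, 1) <> Quote) do Inc(P);
  if P >= DeclEnd then Exit;

  // Delete from the first leading whitespace up to the closing quote.
  Delete(Result, StartPos, P - StartPos + 1);
end;

begin
  WriteLn(StripEncodingDecl(
    '<?xml version=''1.0'' encoding = ''UTF-8'' ?><root/>'));
  // prints: <?xml version='1.0' ?><root/>
end.
```

The cleaned string can then be passed to LoadXMLData() as before. With no encoding declaration left in the string, CheckEncoding() should have nothing to remove.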

6 thoughts on “Old XML Bug in Delphi Causes New Problems”

  1. The whole xml support in Delphi is horror! Since years! The MSDOM is extremely slow, the rest of xml DOMs are buggy like hell. Qt’s xml support is standard and very well engineered.

  2. AnsiString, and (P)AnsiChar, have been removed from NextGen (ie mobile) compilers. So you cannot declare AnsiString or AnsiString(N) (including UTF8String and RawByteString) variables at all.

      1. Supposedly this change is coming sometime to the desktop too… maybe. At some time. Or not. It’s like that Magic 8-ball answer, “Future is unclear. Ask again later.” We’re three months into the new six-month release window and we still have no idea what’s coming for XE6. Or what the release date is. As is, I delayed making a decision about which tool to use for a new project before XE5 came out because Marco’s white paper about the NextGen compiler for desktop led me to believe that it was coming with XE5’s release.

        Delphi’s roadmap communication generally consists of “Surprise!”. As is, there is no 2014 road map. We ran out of road with XE5. 🙂

    1. Oh, I used to do Delphi version-agnostic pointer math via PDestination(PAnsiChar(Ptr) + Offset). Now it looks like I’ll have to use conditional directives anyway.
