
Be careful what you wish for. A lot of people were overjoyed to hear that Unicode support was coming to Delphi. Some were skeptical of the chosen implementation approach, however; it all seemed just a little bit too easy. I was one, and sadly it seems I was right.

I’ve just started updating a whole host of code to Delphi 2009.  Since Unicode is what my code has to speak whether I like it or not (if I want to use the latest/greatest Delphi compilers) then I may as well get with the program and drag my ANSI code kicking and screaming into the UTF-16 world.

“Kicking and Screaming” is certainly involved.  But mostly it’s me doing it.

In utter frustration.

There are currently two head-shaped dents in the wall. Allow me to share my pain.

Dent #1: UTF8ToANSI()

The help says:

Call Utf8ToAnsi to convert a UTF-8 string to Ansi. S is a string encoded in UTF-8. Utf8ToAnsi returns the corresponding string that uses the Ansi character set.

Well if you do call UTF8ToAnsi and expect it to do what that says it will do then you are in for a disappointment.  Because it actually returns a UnicodeString.

Not “what it says on the tin” and sure as eggs is eggs not what the caller wants or expects, no matter how quickly and dirtily they want to hack their old ANSI code into a warning-free ANSI-on-WideAPI “pseudo-Unicode” application.

Dumb.  Dumb.  Dumb.

But it gets even dumber. There is now also a function UTF8ToString()! And if we inspect the source for that we find the following puzzling implementation:

  function UTF8ToString(const S: RawByteString): string;
  begin
  {$IFDEF UNICODE}
    Result := UTF8ToUnicodeString(S);
  {$ELSE}
    Result := UTf8ToAnsi(S);
  {$ENDIF}
  end;

You may be thinking that this isn’t puzzling at all. That it is a perfectly sensible implementation – if we’re compiling with UNICODE defined then defer to the UnicodeString implementation. Otherwise defer to the ANSI implementation.

But hang on, since when was UNICODE an option in Delphi 2009? CodeGear decided that it was to be “Unicode or the Hi-rode”.

But it gets better, because in deferring to the ANSI implementation – even assuming this is ever compiled with UNICODE not defined – they are of course calling the so-called-ANSI implementation that – UNICODE or not – calls the Unicode implementation!

You can’t help but think that CodeGear managed to confuse even themselves with their approach to Unicode.
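For anyone who really does need ANSI bytes out of UTF-8 in Delphi 2009 or later, one workaround is to decode to a UnicodeString first and then cast explicitly to AnsiString. The following is only a minimal sketch: the program name and variable names are made up for illustration, and the length of the ANSI result assumes a single-byte system code page such as Windows-1252.

```pascal
program Utf8ToAnsiDemo;
{$APPTYPE CONSOLE}

var
  Wide: string;      // UnicodeString in Delphi 2009+
  Utf8: UTF8String;
  Ansi: AnsiString;
begin
  // 'café', written with a character code to sidestep source-file encoding issues
  Wide := 'caf'#$00E9;

  // Explicit cast encodes the UnicodeString as UTF-8 (no implicit-conversion warning)
  Utf8 := UTF8String(Wide);

  // Decode UTF-8 to UnicodeString, then cast down to the system ANSI code page
  Ansi := AnsiString(UTF8ToString(Utf8));

  Writeln(Length(Utf8));  // 5: the e-acute takes two bytes in UTF-8
  Writeln(Length(Ansi));  // 4 on a single-byte code page (assumption, see above)
end.
```

The explicit casts matter here: they tell the compiler the narrowing conversion is intentional, which is what UTF8ToAnsi() used to imply by its name.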

OK, But What’s this Wide-ANSI Nonsense?

Ok, so UTF8ToANSI() is not something that I’m guessing many people are actually using (my code is working with the Apple Bonjour SDK which makes extensive use of UTF-8, and the code originated in Delphi 7, hence the ANSI/UTF8 conversions).

So I ran into an edge-case. No big deal, right? Granted. But sadly there are other, wider [sic] issues which people will undoubtedly run into.

Let me give you an example that I’m certain will be encountered by more people than that UTF8/ANSI scenario.

Dent #2: Uppercase()

var ws: String;  // UnicodeString in Delphi 2009/10

ws := 'aà';
ws := Uppercase(ws);

What does ws contain after the call to Uppercase()?  If you said 'AÀ' then congratulations – you are 100% wrong.  It actually contains 'Aà'.

Sadly, that is NOT the uppercase version of the input string according to the Unicode specification.

Now, to be fair, the documentation in this case is quite correct and clear that Uppercase() does not handle Unicode Strings as Unicode. Which makes an absolute mockery of a strongly typed language such as Delphi.

Uppercase() accepts a String parameter. In Delphi 2009 String is a Unicode data type.

In Delphi 2009+ Uppercase() is – by contract – a Unicode function and should jolly well behave as such.

Frankly, this decision by CodeGear speaks to me of a cavalier disregard for what makes Delphi, well, Delphi.

Anyway… Now, the hard part. Can you guess what function you should use to properly convert a UnicodeString to uppercase?

Well, you actually have a choice of two:

    WideUppercase( );
    ANSIUppercase( );

No, wait. That can’t be right surely? ANSIUppercase?

Yes. Of course the alarm bells should be ringing as soon as you realise that ANSIUppercase accepts not an ANSIString parameter but a plain String parameter, which of course made perfect sense pre-Delphi 2009 when String meant ANSIString, but in Delphi 2009+ CodeGear found themselves painted into a corner.

Having decided that they, and only they, would get to choose what String meant to the compiler (although I think that the $ifdef UNICODE in System.pas is evidence that they weren’t – at one time at least – as convinced that this was the right and obvious thing to do as they seemed to suggest), if they properly modified these ANSI RTL functions to accept ANSI strings, then a lot of code currently calling them with String parameters would start throwing up warnings when String transformed from ANSIString to UnicodeString.

But at the same time the vast, and I mean VAST majority of code would simply be calling Uppercase(), and they didn’t want any warnings coming from that either.

What a pickle.

Fight Fire With Fire

It seems they chose to resolve this dilemma by adding a little more confusion into the mix, creating a situation where someone explicitly calling an ANSI routine will obtain a Wide operation and someone calling the “default” implementation, presumably expecting it to yield a Wide operation (because Delphi 2009 is Unicode through and through, right?) is rewarded with an ANSI operation.
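The inversion is easy to see side by side. This is a small console sketch, assuming Delphi 2009 or 2010 and the SysUtils routines discussed above; the program name is made up, and the results are compared against character codes rather than printed, since console output of non-ASCII text depends on the console code page.

```pascal
program UpperCaseDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;  // UnicodeString in Delphi 2009+
begin
  // 'aà', written with a character code to sidestep source-file encoding issues
  S := 'a'#$00E0;

  // The "default" routine only converts the ASCII letter
  Writeln(UpperCase(S) = 'A'#$00E0);      // TRUE

  // The "Ansi" routine performs the Unicode-aware conversion
  Writeln(AnsiUpperCase(S) = 'A'#$00C0);  // TRUE

  // The "Wide" routine does too (S is converted to WideString for the call)
  Writeln(WideUpperCase(S) = 'A'#$00C0);  // TRUE
end.
```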

Perhaps they thought that if they made our heads spin enough we wouldn’t notice what a pig’s ear they’d made of the whole thing?

But what does all this mean for someone with some old apps that they can’t wait to get updated to the latest Unicode compiler so that they can start selling to their customers who have been demanding Unicode support?

Well having compiled your pre-Unicode source with the Unicode compiler you may have been very happy to find that you had only a handful of warnings to deal with and then BINGO, you had a Unicode application.

Sorry.

I hate to break it to you, but you may not have finished yet.

What you have at the moment is an application that in all likelihood is still behaving very much like an ANSI application; it’s just that it’s now sitting atop the Windows Wide APIs, pretending to be a Unicode application.

In many respects it may fool you, perhaps for a long time, because anyone who hadn’t previously tackled Unicode head-on almost certainly doesn’t have any actual need for Unicode and their application will not be stressing those parts of Unicode support that actually separate a “Unicode” application from a “non-Unicode” application.

A Thought Experiment

This is a “Thought Experiment” in the sense of “You know what Thought did?” One answer to which is “He thought he did, but he didn’t.”

Let’s take a simple and I think highly common situation. A text entry field for some user specified code that is required to accept only alpha-numeric characters and enforce uppercase.

Pre-Delphi 2009 such a field may well have had a key filter installed in an event handler:

  procedure TForm1.Edit1KeyPress(Sender: TObject; var Key: Char);
  begin
    if (Key in ['a'..'z']) then
      Key := UpCase(Key)
    else if NOT (Key in ['0'..'9']) then
      Key := #0;
  end;

In Delphi 2009 this throws up the “WideChar reduced to byte char” warning that everyone “doing Unicode” in Delphi has been talking about, and like a good little Delphi developer they do what CodeGear tell them and change this code. But let us imagine that we know a little bit about Unicode and are actually migrating to Delphi 2009 because we want to be able to market our product as a Unicode application.

So rather than simply calling CharInSet, because that still only deals with ASCII characters, we adjust the routine to call the Windows Unicode support routine IsCharAlphaNumeric instead:

  procedure TForm1.Edit1KeyPress(Sender: TObject; var Key: Char);
  begin
    if IsCharAlphaNumeric(Key) then
      Key := UpCase(Key)
    else
      Key := #0;
  end;

With this simple change the code now compiles without any warnings.

YAY! We didn’t fall into the trap laid for us by CharInSet! We dealt with this properly and now we have a Unicode application!

Don’t we?

No.

Can you guess what that WideChar version of UpCase() does? Yep. Behaves exactly the same as the ANSI version.

Dumb. Dumb. Dumb.

There is not even the excuse of needing to maintain backward compatibility in this case – there was no WideChar version of UpCase() prior to Delphi 2009.
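One way to make the key filter above actually uppercase non-ASCII letters is to pair IsCharAlphaNumeric with its natural Windows counterpart, CharUpperBuff, instead of UpCase(). The sketch below assumes Delphi 2009 or later, where both resolve to the wide (W) versions of the API; FilterKey is a hypothetical stand-in for the body of Edit1KeyPress so that the example can run as a console program.

```pascal
program KeyFilterDemo;
{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical stand-in for the body of TForm1.Edit1KeyPress
procedure FilterKey(var Key: Char);
begin
  if IsCharAlphaNumeric(Key) then
    // Uppercases in place using Windows' case mapping;
    // this is CharUpperBuffW when Char = WideChar
    CharUpperBuff(@Key, 1)
  else
    Key := #0;  // reject anything that is not a letter or digit
end;

var
  K: Char;
begin
  K := #$00E0;          // 'à'
  FilterKey(K);
  Writeln(K = #$00C0);  // TRUE: now 'À'

  K := '!';
  FilterKey(K);
  Writeln(K = #0);      // TRUE: rejected
end.
```

Because CharUpperBuff lives in the Windows unit and follows whatever Char means to the compiler, the same FilterKey body should also compile with pre-2009 compilers, although the character literals in the test code above are written for the Unicode compiler.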

Now, anyone who understands Unicode and specifically the properties and characteristics of UTF-16 will be able to tell you why UpCase( WideChar ) does not perform a Unicode case conversion – it simply cannot. A single WideChar may not represent an entire character – it may be part of a surrogate pair, and whilst I don’t think that there are any case-convertible characters that require surrogate pairs currently, that could change (and I may be wrong on that anyhow).

So what else could it do but echo the ANSI implementation?

Well one option would have been to reflect the true nature of Unicode and simply not try to create the illusion of supporting something that cannot be supported.

But more acceptable I think would have been to perform a Unicode conversion on those chars that it could (non-surrogates) and if a surrogate was supplied as input, simply return it unmodified.

As it is, the previous (and current) “ANSI” implementation is incomplete w.r.t ANSI, providing only ASCII case conversion. It would have been easy to argue that a Unicode implementation that only operated on characters in the BMP (Basic Multilingual Plane) was the natural and obvious behaviour for a Wide UpCase implementation.
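The behaviour suggested above (convert what can be converted, pass surrogates through untouched) takes only a few lines. This is a sketch rather than a proposal for the actual RTL source: UnicodeUpCase is a hypothetical name, and it leans on the TCharacter class from the Character unit as shipped with Delphi 2009/2010.

```pascal
program UpCaseSketch;
{$APPTYPE CONSOLE}

uses
  Character;

// Hypothetical helper, not part of the RTL
function UnicodeUpCase(C: Char): Char;
begin
  if TCharacter.IsSurrogate(C) then
    Result := C                       // half of a surrogate pair: return unmodified
  else
    Result := TCharacter.ToUpper(C);  // BMP character: Unicode-aware conversion
end;

var
  C: Char;
begin
  C := #$00E0;  // 'à'
  Writeln(UpCase(C) = #$00E0);         // TRUE: the RTL routine leaves it alone
  Writeln(UnicodeUpCase(C) = #$00C0);  // TRUE: converted to 'À'

  C := #$D801;  // a lone high surrogate
  Writeln(UnicodeUpCase(C) = #$D801);  // TRUE: passed through unchanged
end.
```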

So I’m afraid if you do want and/or need proper Unicode support, you’ve still got some work to do before you can get there, and unfortunately the compiler is not going to help you from this point on. Furthermore, the VCL now seems to go out of its way to make it harder by in some cases completely breaking the type-safety that you can normally expect when working in Delphi and which would have guided you toward the answers to the numerous questions that arise when contemplating proper Unicode support.

The compiler assistance in “helping you find the things you need to change for Unicode support” only really works if you don’t actually need proper Unicode support and just want to get your ANSI code running over the Wide API in Windows.

Once you’ve reached that point – or even before that – if you then decide you want to do Unicode properly, I fear you will find that the design decisions made to facilitate the migration of Wide-ANSI applications will frustrate you and complicate your job no-end.

Really, I have to wonder if it really was worth getting so excited about Unicode support if its main audience is people not actually supporting Unicode properly?

I can only hope that the 64-bit support that it seems people increasingly need is not being delayed in order to make way for a cross-platform implementation that will suffer the same identity crisis as the Unicode implementation is littered with.

Getting it done quick sometimes is not as important as doing it right.

In the case of Unicode I’m afraid I’ve not come across anything to make me think it was “done right” at all.

24 thoughts on “Delphi Unicode = Wide-ANSI”

  1. The Unicode change has been a disaster in our shop. We have stuck with Delphi for a very long time because they have never really burned a bridge behind them like Microsoft (ex: VB). This change has done just that to us. Previous versions of Delphi prided themselves on backwards compatibility and compiling. The new versions while possessing some impressive features are ultimately what is going to drive us away from the language. We use several third party with source units and of course some date back quite a while. Several are a nightmare to untangle in the new compiler and the original authors are nowhere to be found. If we have to spend several weeks patching why not put that time to moving forward on something more industry standard that being C# is the question management has put to us. My last really big holdout which was Delphi has always supported older code is no longer a viable excuse. As much as I love Delphi, in our shop we are beginning to let go. Even harder to fathom is that Unicode does absolutely nothing for us so it is an even more bitter pill to swallow.

  2. “Getting it done quick sometimes is not as important as doing it right.”

    Unicode support was neither quick nor done right. So how in the name of (fill in whatever you prefer here) would you expect 64-bit support to be any different … come on get real.

  3. Sorry, but your complaints are a bit lame IMO…

    1. Presumably that conditional define just dates from when the RTL, VCL and IDE were in the process of being ‘unicodified’.

    2. Yes, updating old UTF8 conversion code is a sore point, but to be honest, the new UTF8String type (and the much simpler conversion code it brings) is worth the pain for me. At least we know that your beef is with the odd break in source code backwards compatibility. No, wait…

    3. That UpperCase and LowerCase in D2009/10 only converts characters in the ASCII range is because they only ever did, a fact that was and is documented in the help (didn’t you ever wonder why parallel AnsiXXX functions were added to the RTL back in D3…?). Basically, what you were assuming here was a *break* in backwards compatibility.

    4. Your yabbering on about the AnsiXXX functions is once again inconsistently demanding a break in backwards compatibility – in D3-2007, people usually used AnsiXXX on variables and properties declared as string, not AnsiString, and so would require the D2009/10 behaviour when upgrading. Moreover, do a bit of research before you vent – check out the Character unit and its ToLower/ToUpper functions, new to D2009.

    5. All those words about UpCase miss a simple fact: the function is merely a holdover from the Turbo Pascal RTL.

    6. ‘if you then decide you want to do Unicode properly, I fear you will find that the design decisions made to facilitate the migration of Wide-ANSI applications will frustrate you and complicate your job no-end’. Design decisions like what? Like adding all those compiler warnings for implicit conversions and suspicious casts, warnings with descriptions I personally found very thorough without being overlong? Like adding the Character unit and TEncoding class? Like imitating the .Net BCL’s unicode support, following well-recognised conventions as a result? Like not going for the tempting-yet-subtle-bug-introducing method of making the default string type UTF-8? Like not wasting resources on supporting parallel ANSI and unicode VCLs?

  4. While I normally avoid your comments as you are in the same ranting angry place I spent so many years in, in this case instead of just ranting angry, you have something of a valid point.

    Many of the functions SHOULD act unicode if the switch to unicode is meant to be an honest one. And to that end, I put to you the old chestnut: Put each case in the QC database.

    Except perhaps the UpCase one – you might want to be more careful there. Part of the problem there is pretending utf-16 solves all our unicode problems. I’ve been starting to move my code forward, and I have seen the pitfalls with utf-8 mbcs. Clearly the solution is meant to be move everything over to a ‘unicodestring’, work with it ‘natively’ and then only use utf-8 as a transport container.

    Problem is that utf-16 is also a transport container because it ALSO is a mbcs. In order to avoid all of it, we would need honest to goodness utf-32 (at least this week – one wonders when they will decide that only utf-64 will do… ). Oh wait, the win32 and linux worlds don’t work with utf-32, so everything would have to go through conversions again all the time. So UTF-16 is what we get, we plug our ears and pretend the multi-byte sequences aren’t there.

    Probably going to bite us in the ass at some point.

    Oh, I do not know why the help even talks about UTF8ToAnsi. It’s absurd. Since a simple

    MyAnsiString := AnsiString(MyUnicodeString);
    or
    MyUTF8String := UTF8String(MyUnicodeString);

    Already do all the correct encoding conversions under the covers.

    If you think that you’ve found problems? Wait until you have to interface unicode strings with older 8bit string libraries.

    You end up with a lot of PAnsiChar(AnsiString(aUnicodeString)) code.

    Why? Because the same magic does not apparently happen with PAnsiChar(UnicodeString) – and since typecasting a string with PAnsiChar or PChar causes compiler magic to call a function in the first place, it pretty much could (I guess that is my bit to toss at QC)

  5. I agree with the confusion of names and type safety. It’s very confusing and is very likely to cause buggy programs. I cannot count the number of websites and other programs that already mess up non US-ASCII characters on a regular basis, showing UTF-8 as Latin-1 or vice versa, losing some data in the conversion, etc. It’s extremely annoying and it wasn’t necessary to increase the amount of problems.

    They should have taken the time to clean up the mixture of ASCII and Ansi-functions. Having functions that have ANSI in the name return or expect Unicode in the params is a real mess.

    To properly handle KeyPress Delphi should monitor WM_UNICHAR and pass a UTF-32 character instead of a UTF-16 character as in
    procedure KeyPress(Sender: TObject; var Key: UCS4Char);

    BTW UTF32Char would be a far better name since UCS encodings are deprecated and misleading. UCS-2 is a subset of UTF-16 and UCS-4 is a subset of UTF-32. As such it is misleading to call the characters of a UTF-X string UCS-Y characters. The term UCS shouldn’t be used at all (this was already the case long before Delphi 2009 was released, i.e. years ago).

    Similarly neglect of details can be seen when looking at the formatting functions. ListSeparator, ThousandSeparator, DateSeparator, etc.
    MSDN states:
    “The maximum number of characters allowed for this string is four, including a terminating null character. ”
    But Delphi uses only one Char. Obviously with Unicode this is even worse if the separator character is not in the BMP.

    The result is, works in most cases, but strange errors occur in special cases. The kind of bugs you “love” and for which your customers will hate you.

    Now, I don’t want to even think about how many library functions properly handle Unicode. Making sure it works would have required thoroughly reviewing all the code. Considering how often it was mentioned that switching to Unicode was very fast and easy, I highly doubt Unicode handling in Delphi is reliable.

    Unicode is a difficult subject, therefore clear concepts and names are mandatory.

  6. @Chris:

    > conditional define just dates from when the RTL, VCL and IDE were
    > in the process of being ‘unicodified’.

    If true then CodeGear are a shining example of how NOT to manage code quality, because once those conditional defines became utterly redundant they should have been removed prior to release.

    > Yes, updating old UTF8 conversion code is a sore point, but to be
    > honest, the new UTF8String type (and the much simpler conversion
    > code it brings)

    That perspective only works if you do not share your code with people who are potentially not using the same shiny new UTF8String type that you are.

    Maybe there won’t be so many such people in the future now that Embarcadero are getting out the bovver boots and demanding money with menaces to try to force people to upgrade, but for the time being at least, there ARE people who aren’t using the Unicode compiler, not least because of the problems that I myself am now running into.

    Sure, they aren’t “problems” in the sense that they can be worked around, but it’s a “problem” when you have to jump through hoops and write nonsensical code simply to preserve the status-quo in what was previously much clearer and – more importantly – working code.

    > That UpperCase and LowerCase in D2009/10 only converts
    > characters in the ASCII range is because they only ever did

    Because “String” was only ever ANSIString. CodeGear decreed that String would now == UnicodeString. So functions that operate on UnicodeStrings should behave like Unicode functions.

    This is a very simple concept. You don’t write a function that accepts a signed 32-bit Integer parameter and then code it to operate on the value as an unsigned 32-bit Cardinal.

    > Moreover, do a bit of research before you vent – check out the
    > Character unit and its ToLower/ToUpper functions, new to D2009.

    I’m fully aware of the Character unit. New code is free to behave however it feels. My “beef” is with the bonkers decisions made w.r.t previously *existing* functions.

    And if I’m writing code to be shared between Unicode and non-Unicode compilers (no matter how much you might wish it weren’t so, this is a simple fact of life for the foreseeable future), I cannot write code that relies on the Character unit.

    Or rather, the RTL should take care of invoking the Character routines for me in compiler/RTL versions that support it.

    i.e. in Delphi 2009+ Upcase(Char) should call TCharacter.ToUpper, and in Delphi 2007- should do what it always did.

    UpCase( ANSIChar ) in Delphi 2009+ should be the Delphi 2007 implementation.

    > All those words about UpCase miss a simple fact: the function
    > is merely a holdover from the Turbo Pascal RTL.

    UpCase( WideChar ) is brand new to Delphi 2009. No hold-over here.

    > Design decisions like what? Like adding all those compiler warnings
    > for implicit conversions and suspicious casts, warnings with descriptions
    > I personally found very thorough without being overlong?

    And which advise you to take steps that preserve ANSI behaviour (use CharInSet) to sidestep the warning rather than properly reviewing your code for Unicode correctness.

    Leading to a sense of accomplishment when the warning has been dealt with that is entirely misleading.

    > Like not wasting resources on supporting parallel ANSI and
    > unicode VCLs?

    Straw man argument. There was never any need to have parallel ANSI and Unicode VCL’s, even with a Unicode “switch”. Anymore than there was a need for parallel LongString and ShortString versions of the VCL.

    It would have been perfectly acceptable to have a VCL compiled with UNICODE (forcibly) defined, but to allow *application* code to be compiled with UNICODE *UN*defined (String = ANSIString).

    This was essentially the situation when String went “Long”.

    In the case of Unicode all those warnings you love so much would then have helped identify any potential problems when passing strings across the application/VCL boundary, but since most such transfers involve entire strings, the automatic conversions that were provided would have taken care of most of them without any additional warnings.

    Maël Hörz summed it up perfectly in his comment….

    Unicode is a difficult subject.

    CodeGear should be given credit for *trying* to make it easy, but in some cases it’s better to confront difficulty honestly rather than to try to pretend it doesn’t exist.

  7. Delphi 2009 delivers all the functionality you need in order to do unicode right. The main problems for conversion arises, if you tried to do unicode in Delphi 2007 or earlier, supporting utf-8 and doing other tricks. Many different methods were used, but fortunately, many of these problems can be solved using search&replace in the units that are not easily upgraded.

    The upgrade process is tricky, but instead of running your head into a wall, try to use the help from other programmers. We are many who succeeded to upgrade large and complex applications without big problems.

    The good news is, that once you have upgraded, you will spend a LOT less time on doing character set handling, and there is so much code that is easier to write and much easier to read.

    Many of your complaints are not really meaningful, IMO. For instance, uppercase() is meant to affect only ASCII, and leave all characters >#128 as they are, and since upcase() does not make sense in UTF-16, you will need to rewrite that code if you want to support UTF-16 (instead of just UCS-2). However, if you did well with ansi before, UCS-2 (16 bit per char) is a huge improvement, and seriously, when do you need upcase() on a character that is >=#65536?

    The main problem with Unicode is actually not Delphi-related. The main problem is, that if the user can specify a unicode character, how compatible is that with your “export to CSV file functionality”? Remember, that Windows still uses ANSI 8-bit characters for a lot of I/O, so you may lose a lot. Also, what if your database app wants to export data to another application, that doesn’t do unicode well, should you restrict your TEdit input to ansi-only?

    However, with the introduction of the euro-symbol €, and the general use of other non-ansi characters in communication, unicode is a step that most need to implement one way or the other. Delphi 2009 makes it possible to stop focusing on character sets, and to start focusing on solving customer problems. And the upgrade process couldn’t have been made much better.

    So, try not to focus on your frustrations, and try to focus on how to solve your problems. Afterwards, your life will be easier.

    1. @Lars: You have missed my point.

      You said it yourself – UpCase() does not make sense in Unicode, so WHY did CodeGear introduce a WideChar version of UpCase in Delphi 2009?

      And yes, I understand that Uppercase is performing exactly as documented. At least, I presume that is what you mean by “is MEANT to affect only ASCII” (my emphasis on “meant”). I suggest however that if you asked a user what they would expect from a function called Uppercase() that accepts a UnicodeString parameter in a compiler and RTL that is loudly advertised as “a Unicode compiler and RTL”, they would NOT expect it to behave like an ASCII function.

      It ONLY does that to protect CodeGear against the consequences of having imposed Unicode on legacy Delphi code.

      And I absolutely guarantee you that they would NOT expect ANSIUppercase() to behave like a Wide/Unicode operation.

      It simply does not make sense to have a default string type that is Unicode and a built in, undecorated function that accepts that Unicode string type but then proceeds to operate on it as if it were an ANSI string.

      Nor does it make sense to have a function that describes itself as “ANSI” which is in fact a Unicode function. Documentation is no substitute for common sense I’m afraid (not least because God knows we’ve had to get used to getting by without decent documentation in recent years). In fact, I suspect that someone with recent Delphi experience reading the docs for Uppercase() would simply assume that they hadn’t been properly updated to reflect the Unicode nature of things in the latest RTL/compiler.

      Or would you argue that – for example – a function that accepts a signed integer but then proceeds to operate on that passed in value as an unsigned integer is also “a good idea”, as long as it is documented as such, and tough luck on anyone who calls it expecting it to operate as a signed integer operation without consulting the documentation first to make sure that what it seems to say it does is what it actually will do?

      And I have to laugh at this:

      > when do you need upcase() on a character that is >=#65536?

      Because this is just the WideString adjusted version of “when do you ever need upcase() on a character that isn’t in the range a..z?”

      I’m sorry, but I simply do not believe that anyone that found a Unicode migration “easy” using Delphi 2009 has actually done it properly at all. They may consequently have a Wide-ANSI app that suits them and their users, but in that case a regular ANSI app without the memory bloat and performance overhead would have sufficed just as well.

      And before you quote the “overhead of conversion involved when using the ANSI Windows API”, the tests that Marco Cantu did in that area showed that there was no such significant overhead. And even if there were, any gains on that score are almost certainly lost in the losses that arise from the additional and more complex processing now required to correctly handle strings for even the most basic operations, whether or not those strings are ever likely to hit the Windows API at all.

      In the meantime, if Unicode is so much easier in Delphi 2009, you should be able to answer this question for me very easily:

      http://stackoverflow.com/questions/1482699/zeroconf-bonjour-code-that-works-in-delphi-7-not-working-in-2009

      There’s nothing particularly sophisticated involved, simply passing UTF8 strings into a DLL. All the test data is in fact ASCII so there’s no MBCS complications.

      It works in Delphi 7.

      It doesn’t work in Delphi 2009.

      Any ideas?

  8. 1. You are right. They (Embt) sacrificed function naming for backward compatibility. But what should they have done? Upset 90% of their user base who want a “Load, fix and recompile” or upset 10% who want to do more migration by hand (and have the skills to do that)? I don’t like all those “Ansi*” calls in my code but if you want to compile the code in Delphi 2007 and 2009/2010 you must use them because new units and functions don’t exist in Delphi 2007.

    2.
    > > That UpperCase and LowerCase in D2009/10 only converts
    > > characters in the ASCII range is because they only ever did

    > Because “String” was only ever ANSIString

    Your code was wrong in the beginning and stays wrong in Delphi 2009+. Let’s assume you have your application in an ANSI Delphi (D7, D2007) and you use UpCase/UpperCase in the KeyDown event. If a user now types an ‘a’ he gets an ‘A’. But what if he types a German umlaut? Does he get the uppercase representation of it? No, because ‘ä’ is not in the US-ASCII range. This UpCase/UpperCase behavior is still valid in Delphi 2009+. And I would throw things at Embt if they would have changed that.
    Those who wrongly used UpperCase where they should have used AnsiUpperCase are still wrong in Delphi 2009+. And those who intentionally used UpperCase in their code still have the intended behavior.

    UTF8ToString implementation:
    > You can’t help but think that CodeGear managed to confuse
    > even themselves with their approach to Unicode.

    Maybe they had an ANSI and Unicode version in their mind when they started to work in Delphi 2009 (and I guess they had). But as we all know they dropped the idea. So this code only shows that they forgot to clean up the source code. And during the time of development the implementations of the ANSI-Unicode functions changed so the “unused” code became broken.

    > No, wait. That can’t be right surely? ANSIUppercase?

    On the one side I think they should have introduced another set of functions like “UniUpperCase”. But on the other side I wouldn’t like the idea of another set of string functions. Ansi*, Wide* and Uni* filling up the global namespace. It would also be fine for me if they replaced the Wide* functions with the Uni* function because I don’t use them. But there are others that might use them. So they can’t remove them.

    1. @Andreas:

      > Your code was wrong in the beginning and stays wrong in Delphi 2009+.
      > Let’s assume you have your application in an ANSI Delphi (D7, D2007) and
      > you use UpCase/UpperCase in the KeyDown event

      No, you’re missing a key point.

      “My code” (had it been my code, which it wasn’t – it was a thought experiment) was perfectly valid in pre-Delphi 2009. Remember that my code was filtering out all non-ASCII alpha characters. The user could ONLY enter a..z. This of course was because “my application” wasn’t Unicode and couldn’t handle such things reliably, so they were specifically not supported.

      Then I get my shiny new Delphi 2009/2010 compiler, which I’m told will enable me to handle Unicode very easily, as long as I deal with a few issues that the compiler will help identify for me.

      I deal with those issues by eliminating the use of char sets for character filtering and using the proper Unicode character category functions instead. That deals with the warning (in this particular case) and my code for transforming and manipulating strings – which are now all Unicode – is now calling functions that specifically and explicitly contract to handle Unicode Strings.

      Is it not entirely natural and logical to expect those functions to handle Unicode? The compiler is no longer warning me that I may have any Unicode related issues in my code, so I have a Unicode application, right?

      *THAT* is the situation that CodeGear have created that I find infuriating. Having ANSI* functions that operate on Unicode strings is an additional confusion that compounds the mistake.

      But we didn’t need another set of Uni* functions – we already had a lot of Wide* functions, and those would have been sufficient. Any additional ones need only to have used that Wide* prefix for consistency.

      But yes, if a consolidation of the RTL string routines resulted in some of those routines disappearing and breaking some code, that actually would have been *better* in my opinion, as long as there were direct and obvious replacements that could easily be introduced in code to replace them where necessary.

  9. Jolyon: I don’t want to engage in a discussion of the names of functions. The prefix “Ansi” didn’t make sense with Windows 3.1, it didn’t make sense with Windows 95, and it still doesn’t make sense, simply because “Ansi” is not the 8-bit character set on most Windows computers on this planet. However, it makes a lot of sense to make upcase() work with widechar, because widechar is the type that is used for characters, and upcase() takes a character as parameter. With Delphi 2009, you would not use ansichar, unless it’s a hack or you are doing binary data which is not upcased. In short, your criticism of these points comes 13 years too late and is not related to Unicode. It is even possible to argue that Delphi 2009 improves on this topic, since Unicode contains Ansi, whereas AnsiUppercase is not about Ansi.

    There are a LOT of cases, where you do not want to treat non-ascii letters as being letters. There can be several reasons, including predictability, reversibility and performance. If you prefer fast software to slow software, it makes a lot of sense to make the normal uppercase operate only on ascii, and to make unicode uppercase a function with a different, more complex name. However, this is a choice where programmer’s taste may differ, of course, but I like when the default method performs best.

    I have posted an answer to your question on stackoverflow.com – your code uses a string type that has been significantly changed for Delphi 2009, but you are using it when interfacing with a DLL, that did not change. I assume that the external DLL was not written using Delphi, and was not recompiled for Delphi 2009, and in that case, your source code simply has a bug: It uses PUtf8String. My best guess is that you should use PAnsiChar() for that DLL.

    1. @Lars: I’m not interested in arguments as to the whys and wherefores of whether ANSI is technically correct or not. However “wrong” the implementation may have been in the past, two wrongs, or a further compounded wrong, won’t make it right.

      And see my reply to your answer on StackOverflow. I was originally using my own UTF8Char/PUTF8Char types (ANSIChar and ^ANSIChar respectively), and that didn’t work either. Knowing that UTF8String was significantly different in Delphi 2009 I changed the code to try to eliminate any possible errors I had introduced in my explicit transcoding of strings, allowing the consistent transcodings in Delphi 2009 to operate without possibly disruptive interference from me.

      As for it not being “compatible” with a C/C++ DLL, I seriously doubt that that is true.

      After all, we’ve been passing PANSIChar to C/C++ DLLs that know nothing of Delphi ANSIStrings for years without issue. The whole point of the String type in Delphi is to make that possible since the base pointer of a string points to the initial element, with the Delphi-specific header (length and reference count) at a negative offset, of which the C/C++ DLL will remain blissfully ignorant.

      If I *can’t* pass a suitably cast UTF8String to a function expecting a null terminated PUTF8Char then I jolly well should be able to, just as I was always able to pass a PANSIChar(String) to a function expecting a null terminated ANSI Char.
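
      The cast being described can be sketched as follows. This is a minimal illustration, assuming Delphi 2009 or later on Windows; the Win32 function lstrlenA is used purely as a stand-in for "a C function expecting a null terminated char pointer" and is not the DLL under discussion.

      ```delphi
      program PassUtf8ToC;

      {$APPTYPE CONSOLE}

      uses
        Windows;

      var
        S: string;
        U: UTF8String;
      begin
        // 'h' followed by a lowercase a-umlaut (U+00E4).
        S := 'h'#$00E4;

        // The explicit cast asks the compiler to transcode UTF-16 to UTF-8.
        U := UTF8String(S);

        // PAnsiChar(U) points at the first byte of the null terminated UTF-8
        // data; the string header sits before it and is never seen by the
        // callee. In UTF-8, 'h' takes one byte and U+00E4 takes two.
        Writeln(lstrlenA(PAnsiChar(U)));
      end.
      ```

      This only demonstrates the cast itself; it says nothing about what is going wrong in the case described in the StackOverflow question.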

      But that’s beside the point. PUTF8String or PUTF8Char – it still doesn’t work in Delphi 2009.

  10. Jolyon, I’m surprised that you dislike the name “AnsiUppercase” and then introduce “type utf8char=ansichar”. To me, it would make much more sense to say “type utf8char=array[0..5] of char”, since utf-8 encoding allows up to 6 bytes per code point 😉

    With regard to your stackoverflow.com question, you indicate that the reason for the problem is related to unicode and a new string type, even though you did not find the actual cause, yet. How do you deduce that the problem is caused by the new string types? I ask that question here, because your stackoverflow.com question does not indicate that unicode is the problem.

  11. Btw, widestrings are MUCH slower than Unicodestrings, but still supported. I would expect a wide* function to operate on Widestrings, and not on unicodestrings. Anyway, it’s always easy to find things that could be improved after the release of a product – the trick is to design things properly before the release.

  12. Unicode in the latest Delphi is a disaster.
    It was introduced the lazy way.

    The natural way was to introduce new API without compiler change:
    – “string” as UTF-8 / ANSI / ASCII / my-hacked-code-page
    – “WideString” – binary unicode
    – AnsiString – ANSI string

    In fact, VCL & Win32 API should be changed (for consistency: in all cases UTF-8), not the compiler.

    It is far easier to prepare a string for proper display (using something like AnsiToUtf8) than to change all the string parsing code your application contains.

    See Lazarus UI implementation.

  13. @Lars: Touché. 🙂

    I shall save my post w.r.t encoding independent codepoint-wise reading of an arbitrary Unicode stream for another time. Suffice to say that the Unicode implementation in Delphi 2009 again abandoned me to my own devices on that score.

    UTF8Char was of course defined as a corollary for WideChar (which of course isn’t always a single character) and ANSIChar (which also isn’t necessarily always a single character but does at least have the saving grace of being far more likely to be so than a WideChar is – in the types of applications for which it is the “lingua franca” of the character element I mean).

    And yes I know that WideString is slower than UnicodeString, but a UnicodeString is also essentially a “WideString with twiddly bits”.

    A UnicodeString *IS* a wide string, it’s just not a WideString.

    Delphi being the strongly typed language that it is, we don’t need those redundant prefixes on such functions anyway:

    Uppercase( ANSIString )
    Uppercase( UnicodeString )
    Uppercase( WideString )

    Would handle all the important cases clearly and unambiguously (assuming that they handled their parameter data type appropriately).

    If CodeGear had given us the choice over what “String” means in our application code there couldn’t have been any problem. If we compiled with UNICODE defined then Uppercase( String ) would call the UnicodeString version, and with UNICODE not defined it would call the ANSIString version (since String at the time of compilation would identify one or the other of those specific types).
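
    As a rough sketch of what such type-driven dispatch might look like: the name Upper below is hypothetical (chosen so as not to clash with the RTL), and the bodies simply forward to routines that already exist in Delphi 2009 (AnsiStrings.AnsiUpperCase, SysUtils.AnsiUpperCase and SysUtils.WideUpperCase). It assumes the Delphi 2009/2010 unit names.

    ```delphi
    program OverloadSketch;

    {$APPTYPE CONSOLE}

    uses
      SysUtils, AnsiStrings;

    // Hypothetical overload set: the compiler picks the version that
    // matches the declared type of the argument.
    function Upper(const S: AnsiString): AnsiString; overload;
    begin
      Result := AnsiStrings.AnsiUpperCase(S);
    end;

    function Upper(const S: UnicodeString): UnicodeString; overload;
    begin
      Result := SysUtils.AnsiUpperCase(S);
    end;

    function Upper(const S: WideString): WideString; overload;
    begin
      Result := SysUtils.WideUpperCase(S);
    end;

    var
      A: AnsiString;
      U: UnicodeString;
      W: WideString;
    begin
      A := 'abc';
      U := 'abc';
      W := 'abc';

      // Each call resolves on the variable's type; no prefix is needed.
      Writeln(Upper(A) = 'ABC');
      Writeln(Upper(U) = 'ABC');
      Writeln(Upper(W) = 'ABC');
    end.
    ```

    With a user-controlled UNICODE switch, a call on a plain String would then bind to whichever of these overloads String happened to alias at compile time.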

    They created the problem by insisting that String would mean Unicode whether we liked it or not, and then had to wrestle with what to do if someone carried on using String but wanted the previous ANSIString type behaviours.

    As you say, these things need to be got right during the design phase. If they listened to people during that design phase they might have avoided these mistakes, but they insisted – wrongly imho – that they could make this easy for everyone.

    The entire implementation smells of a bright idea tossed around in a coffee and donuts meeting rushed into production with a great deal of back slapping but very little proper consideration.

    The guys that work on Delphi are no doubt very bright indeed, but aren’t necessarily best placed to appreciate the sorts of application code that their decisions in this area would impact on and how.

    Productivity tools in the IDE, compiler/language enhancements are a very different consideration compared to the needs and obligations of the fundamental string handling infrastructure provided for application code.

    I get the distinct impression that they simply didn’t appreciate the problems (they no doubt consulted with “partners” but again, partners are I think typically similarly highly technical development outfits, not general purpose applications shops).

  14. @PiotrL: I agree that utf-8 would be the best path towards unicode – it would have unicode-enabled much more source code. I had some huge discussions about that on Borland’s servers once, but the opinion was against it, mostly because the standard way to do strings in Win32, Win64, .net and Java is to use UTF-16.

    One way to make a unit unicode-compatible is to do a search & replace of all strings and replace them with utf8string. However, since the RTL, VCL and others use unicodestring, this can potentially introduce a LOT of character set conversions, making your application extremely slow because each conversion involves the Windows API. A bad design, in my opinion, but I guess no one saw that problem before it was too late. The basic problem is that CodeGear followed the Microsoft/Java crowd instead of following the Linux crowd, which solved this much more elegantly.

    However, widestring is seriously awful but required for COM, and unicodestring is a huge improvement – and once you start programming with it, it solves your character set problems just as easily as .net and Java. You simply stop thinking much about it, and in a Windows world, where applications need to handle many different character sets (utf-8, oem, “ansi”, UCS-2, UTF-16 and sometimes other stuff), Delphi 2009 is a great tool that does not restrict you to utf-16.

    When I thought about how to convert my PChar-intensive XML parser to Delphi 2009, I decided to rename string to rawbytestring, instead of going unicode, and only changed the API so that I/O was unicode-enabled. This was a great choice – it is now faster than before, the source code was virtually unchanged, and it is 100% unicode enabled.

    1. @Lars:

      > One way to make a unit unicode-compatible, is to do a search &
      > replace of all strings and replace them with utf8string

      But this does NOT make your application a “Unicode application”. That was the entire point of my post which I thought you had grasped but this comment suggests not.

      What you describe is a quick and easy way to deal with a large number of warnings you will get about implicit conversions arising from the switch to a UnicodeString, but your code calling the familiar RTL routines is not dealing with those UnicodeStrings as *true* Unicode strings at all, and if you start marketing your application as “supporting Unicode” then you are going to be bombarded by users who believe you and then find that you aren’t selling a “Unicode application” at all.

      That is, unless your code is explicitly calling ANSIxxxx() routines, in which case your Unicode strings will be handled correctly.

      As regards RawByteString…. that’s a great idea if you have the luxury of not writing code that you share with anyone using an older Delphi compiler (and the confidence that you won’t find yourself needing to return to an older compiler yourself).

      I also find your comments a little confusing/confused… you talk about “UTF8, UCS2 and UTF16” as character sets. These are not character sets – these are all THE Unicode character set (in the case of UCS2 a subset of the complete set but still not a distinct character set). Those things you mention are *encodings* not character sets.

      And you missed out UCS4/UTF32, which perhaps isn’t surprising because CodeGear seem to have forgotten about that one too, which is very strange given that a lot of very useful and sensible articles about handling Unicode that I have read (including the O’Reilly “Unicode Explained” tome) strongly suggest that Unicode be normalised to UTF32 for internal processing, especially when that processing involves codepoint-wise handling/manipulation/interpretation of strings.

      There is no UCS4/UTF32 TEncoding, for example, and no easy way that I have found for transcoding an arbitrarily encoded stream into some specific encoding. I couldn’t find one, but I sure had fun writing one. The best part about that being that the resulting code works just as well in “pre-Unicode” Delphi compilers.

  15. @Jolyon: For us, Search & Replace solved many of the problems that you describe. However, we had different approaches for different units – some bit-fuddling algorithms were handled by search&replace to utf8string, but most GUI stuff and business logic was converted to unicodestring (i.e. no rename). I agree that a compiler switch would have helped doing that, but it would have led many developers into the trap that I just described above: lots of character-set conversions that bring your application performance down to rock bottom.

    I don’t think that there is a solution, that could have given us good performance and backwards compatibility without using utf-8 as the default character set, and that would have given us conversions on all interfaces to the exterior, which is a lot, which would also have hurt performance. The Windows API simply forces us to use utf-16, like it or not, and that causes trouble.

    We have made a quite large and complex upgrade to Delphi 2009, and my experience with that has taught me that the upgrade process is complicated, mainly because we now have to separate unicode-text, other character sets and binary stuff, often in I/O contexts. We cannot use the same TStrings object for all three types of data any more, and that’s not CodeGear’s fault.

    In other words, I don’t believe that CodeGear could have made a solution, that would perform well and be easy. I haven’t yet seen a solution that would have been better in all regards – including utf-8 and a compiler switch.

  16. Let’s not get too detailed into the definition of words, but UTF8, UTF16 and UCS2 are encoding systems, that cover more than, all of, and parts of Unicode. It is incorrect to call them character sets, of course. I disregard UCS4/UTF32, because nobody uses them, except in a few very rare cases. Those that favor it because it means “one string position is one character”, are wrong – because with Unicode, you can combine multiple characters. There must have been some extremely important reasons to invent this stuff since it really makes the work of programmers more difficult.
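
    The point about combining characters can be illustrated with a few lines, assuming Delphi 2009 or later where string is a UTF-16 UnicodeString. The variable names are made up for the example.

    ```delphi
    program CombiningDemo;

    {$APPTYPE CONSOLE}

    var
      Precomposed, Combined: string;
    begin
      // Lowercase a-umlaut as a single precomposed code point (U+00E4).
      Precomposed := #$00E4;

      // The same visible character written as 'a' followed by
      // U+0308 COMBINING DIAERESIS: two code points, one glyph on screen.
      Combined := 'a'#$0308;

      Writeln(Length(Precomposed));      // one string element
      Writeln(Length(Combined));         // two string elements
      Writeln(Precomposed = Combined);   // ordinal comparison: not equal
    end.
    ```

    The combined form is still two code points in UTF-32, so a fixed-width encoding does not make "one position, one character" true either.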

    I think I now understand one of the reasons why you are arguing about uppercase(). Since Delphi 1, uppercase() has been completely useless for doing uppercase on human text in my part of the world, because uppercase() does not support the ANSI letters of my language. However, uppercase() supports English… so while I could never write code like s:=uppercase(Edit1.Text), you probably could. Since Delphi 1, I had to use ansiuppercase() for all data that the user had to see or enter. In that regard, nothing has changed.

  17. #1 looks actually promising. If they introduced a UNICODE define there, there is hope that it will surface eventually and be made available to developers 😉

    And yes, this would have broken less and given the developers the choice how to handle the migration.

    // Oliver

  18. My project includes about 450 Delphi units and about 185,000 lines of code. I would categorize the effort required in the upgrade to Delphi 2009 as negligible. Unicode was a big requirement as my system spits out HTML web pages and emails from SQL server data, which needed to support other languages. I pretty much did a search on all instances of “PChar” and “Char” and then decided whether they should a) stay like that (now as Unicode data), b) be converted to PAnsiChar/AnsiChar or c) change the methodology to handle unicode bytes. All the “string” types stayed as “string”. It’s worth noting that my system does a lot of raw byte handling (RS-232&TCP/IP communications) and character indexing string manipulations. All text and HTML output is generated as UTF8 via a simple helper to convert the internal UCS2 strings to UTF-8 bytes.
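
    A helper of that kind might look something like the following. This is a guess at the general shape rather than the actual code; the name ToUtf8Bytes is hypothetical, and it relies on TEncoding.UTF8.GetBytes from SysUtils, which is available from Delphi 2009.

    ```delphi
    program Utf8Output;

    {$APPTYPE CONSOLE}

    uses
      SysUtils;

    // Hypothetical helper: turns an internal (UTF-16) string into the
    // UTF-8 byte sequence to be written to a file, socket or HTTP response.
    function ToUtf8Bytes(const S: string): TBytes;
    begin
      Result := TEncoding.UTF8.GetBytes(S);
    end;

    var
      B: TBytes;
    begin
      // 'h' followed by a lowercase a-umlaut (U+00E4).
      B := ToUtf8Bytes('h'#$00E4);

      // In UTF-8, 'h' is one byte and U+00E4 is two, so the byte count
      // differs from the character count of the source string.
      Writeln(Length(B));
    end.
    ```

    Keeping the conversion at the output boundary like this is what lets the rest of the code stay on plain “string”.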

    So far the only significant inconvenience I have encountered is this non-conforming UpperCase() function, which I have fixed after googling onto this article and illogically replacing it with “AnsiUppercase”, which is described in the help file as supporting MBCS. Thanks.

  19. The fact that UpperCase doesn’t work, but AnsiUpperCase works, proves to me that Embollox messed up Unicode implementation deliberately to cause as much confusion as possible. I will stick with BCB5 which has an UpperCase which works as expected, even with AnsiStrings in a Hebrew code page! Jeez!

Comments are closed.