Proposal: A single unified string type.
Author : Regan Heath
Version: 1.1
Date   : 24 Nov 2005 +1300 (New Zealand DST)

[Preamble/Introduction]

After the recent discussion on Unicode, UTF encodings and the current D situation, it occurred to me that many of the issues D has with strings could be side-stepped if there were a single string type. In the past we have assumed that to obtain this we must choose one of the 3 available types and encodings. This was never an attractive option because each type has different pros/cons and each application may prefer one type over another. Another suggested solution was a string class which hides the details; this suffers from being a class, with the limitations that imposes, and from not being tied directly into the language.

My proposal is a single "string" type built into the language, which:

* can represent its string data in any given UTF encoding.
* allows slicing by "character", as opposed to by what are essentially bytes, shorts and ints.
* has a default encoding selectable at compile time, or specifiable at runtime.
* will implicitly or explicitly transcode where required.

There are some requirements for this to be possible, namely knowledge of the UTF encodings being built into D. These requirements may count against the proposal, as they increase the knowledge required to write a D compiler. However, it occurs to me that DMD, and thus D, already requires a fair bit of UTF knowledge.

[Key]

First, let's start with some terminology. These are the terms I am going to be using and what they mean; if these are incorrect please correct me, but take them to have the stated meanings for this document.

code point      := any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF.
code unit       := the minimal bit combination that can represent a unit of encoded text for processing or interchange, i.e.
8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
code value      := AKA code unit.
transcoding     := the process of converting from one encoding to another.
source          := a file, the keyboard, a tcp socket, a com port, an OS/C function call, a 3rd party library.
sink            := a file, the screen, a tcp socket, a com port, an OS/C function call, a 3rd party library.
native encoding := application specific "preferred" encoding (more on this later).

Anything I am unsure about will be suffixed with (x), where x is a letter of the alphabet, and my thoughts will be detailed in the [Questions] section.

[Assumptions]

These are what I base my argument/suggestion on; if you disagree with any of these you will likely disagree with the proposal. If that is the case, please post your concerns with any given assumption in its own post (I would like to discuss each issue in its own thread and avoid mixing several issues).

#1: Any given string can be represented in any UTF encoding, and it can be transcoded to/from any UTF encoding with no loss of data/meaning.

#2: Transcoding has a performance penalty at runtime. This proposal will mention the possible runtime penalty wherever appropriate.

#3: There are 2 places where transcoding cannot be avoided: input and output. Input is the process of obtaining data from a source; output is the process of sending data to a sink. In either case the source or sink will have a fixed encoding, and if that encoding does not match the native encoding the application will need to transcode. (See the definitions above for what qualifies as a source or sink.)

#4: String literals can be stored in the binary in any encoding (#1); the encoding chosen may have repercussions at runtime (#2 & #3).

[Details]

Many of the details are flexible, i.e. the names of the types etc; the important/inflexible details are how it all fits together and achieves its results.
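Before the bullet points, the terminology and assumptions above can be made concrete with a small sketch. Python is used here purely for illustration (it is not proposed D syntax): the same sequence of code points takes a different number of code units in each UTF encoding, and transcoding between them round-trips losslessly (#1).

```python
# A concrete look at the terminology above, using Python purely for
# illustration (none of this is proposed D syntax).
s = "caf\u00e9"  # 4 code points; "\u00e9" (e-acute) is above U+007F

utf8 = s.encode("utf-8")       # sequence of  8-bit code units
utf16 = s.encode("utf-16-le")  # sequence of 16-bit code units
utf32 = s.encode("utf-32-le")  # sequence of 32-bit code units

# Same 4 code points, but a different number of code units per encoding:
assert len(utf8) == 5          # e-acute needs two 8-bit code units
assert len(utf16) // 2 == 4    # one 16-bit code unit per code point here
assert len(utf32) // 4 == 4    # always one 32-bit code unit per code point

# Assumption #1: transcoding round-trips with no loss of data/meaning.
assert utf8.decode("utf-8") == utf16.decode("utf-16-le") == s
```

The point of the sketch is simply that "number of code units" depends on the encoding while "number of code points" does not, which is exactly the distinction the proposed "string" type hides from the programmer.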
I've chosen a bullet point format and tried to make each change/point as succinct and clear as possible. Feel free to ask for clarification on any point or points, to ask general questions, or to pose general problems. I will do my best to answer all questions.

* remove char[], wchar[] and dchar[].

* add a new type "string". "string" will store code points in the application specific native encoding and be implicitly or explicitly transcoded as required (more below).

* the application specific native encoding will default to UTF-8. An application can choose another with a compile option or pragma. This choice will have no effect on the behaviour of the program (as we have only 1 type and all transcoding is handled where required); it will only affect performance. The performance cost cannot be avoided, presuming transcoding is only done at input and output (which is part of what this proposal aims to achieve). The cost is application specific and will depend on the tasks and data the application is designed to perform and use. Given that, letting the programmer choose a native encoding allows them to test different encodings for speed and/or provide different builds based on the target language; eg. an application destined to be used with the Japanese language would likely benefit from using UTF-32 internally/natively.

* rename char, wchar, and dchar to cp1, cp2, and cp4. These types represent code points only, never code units/values. cp1 will only ever represent code points which fit inside 1 byte, cp2 only code points that fit inside 2 bytes, and cp4 only code points that fit inside 4 bytes. They are essentially byte, short and int with different names.

* use the existing byte, short, int handling rules for cp1, cp2, and cp4; they are essentially the same thing.

* add a new type/alias "cpn". This alias will be cp1, cp2 or cp4 depending on the native encoding chosen.
This allows efficient code, like:

  string s = "test";
  foreach(cpn c; s) {
  }

* slicing a string gives another string.

* indexing a string gives a cp1, cp2 or cp4.

* string literals would be of type "string", encoded in the native encoding, or, if another encoding can be determined at compile time, in that encoding (see the ASCII example below).

* character literals would default to the native encoding, failing that to the smallest of cp1, cp2 or cp4, and are promoted as required.

* there are occasions where you may want to use a specific encoding for a part of your application. Perhaps you're loading a UTF-16 file and parsing it; if all the work is done in a small section of code and it doesn't interact with the bulk of your application data, which is all in UTF-8, then your native encoding is likely to be UTF-8, not UTF-16. In this case, for performance reasons, you want to be able to specify the encoding used by your "string" variables at runtime; they are exceptions to the native encoding. To do this we specify the encoding at construction/declaration time, eg.

  string s(UTF16); //construct a string which uses UTF-16 to store its data
  //somehow assign the data read from the file to the string;
  //no transcoding occurs as they use the same encoding.

(or similar, the exact syntax is not important at this stage) thus...

* the encoding used by a "string" should be selectable at runtime, so some sort of encoding flag must exist for each string at runtime. This is starting to head into "implementation details", which I want to avoid at this point, but it is important to note the requirement.

[Output]

* a new type "char" will exist; it will now _only_ represent a C string. Thus, when a string is passed as a char* it can be implicitly transcoded into ASCII with a null terminator, eg.
  int strcmp(const char *src, const char *dst);

  string test = "this is a test";
  if (strcmp(test, "this is a test") == 0) {
  }

The above will implicitly transcode 'test' into ASCII and ensure there is a null terminator. The literal "this is a test" can be stored in the binary as ASCII with a null terminator.

* Native OS functions requiring "char" will use the rule above, eg.

  CreateFileA(char *filename...

* Native OS functions requiring Unicode will be defined as:

  CreateFileW(cp2 *filename...

and "string" will be implicitly transcoded to UTF-16, with a null terminator added.

* When the required encoding is not apparent, eg.

  void CreateFile(char *data) { }
  void CreateFile(cp2 *data) { }

  string test = "this is a test";
  CreateFile(test);

an explicit property should be used, eg.

  CreateFile(test.char);
  CreateFile(test.cp2);

[Input]

* Old encodings, Latin-1 etc, would be loaded into ubyte[] or byte[] and could be cast (painted) to char*, cp1*, cp2* or cp4*, or converted to "string" using a routine, i.e. string toStringFromXXX(ubyte[] raw).

* A stream class would have a selectable encoding and hide these details from us, handling the data and giving us a natively encoded "string" instead. Meaning: transcoding will naturally occur on input or output where required.

[Example application types and the effect of this change]

* the quick and dirty console app which handles ASCII only. Its native encoding will be UTF-8, and no transcoding will ever need to occur (assuming none of its input or output is in another encoding).

* an app which loads files in different encodings and needs to process them efficiently. In this case the code can select the encoding of "string" at runtime and avoid transcoding the data until such time as it needs to interface with another part of the application in another encoding, or it needs to output to a sink, also in another encoding.

* an international app which will handle many languages.
this app can be custom built with the native string type selected to match each language.

[Advantages]

As I see it, this change would have the following advantages:

* "string" requires no knowledge of UTF encodings (and the associated problems) to use, making it easy for beginners and for a quick and dirty program.

* "string" can be sliced/indexed by character regardless of the encoding used for the data.

* overload resolution has only 1 type to choose from, not 3.

* code written in D would all use the same type, "string". No more "this library uses char[], this one wchar[], and my app dchar[]" problems.

[Disadvantages]

* the requirements listed below.

* libraries built for a different native encoding will likely cause transcoding. This problem already exists; at least with this suggestion the library can be built 3 times, once for each native encoding, and the correct one linked to your app.

* the possibility of implicit and silent transcoding. This can occur between libraries built with different native encodings, and between "string" and char*, cp1*, cp2* and cp4*. The compiler _could_ identify all such locations if desired.

[Requirements]

In order to implement all this, "string" requires knowledge of all code points, how they are encoded in the 3 encodings, and how to compare and convert between them. So D, and thus any D compiler, eg. DMD, requires this knowledge. I am not entirely sure just how big an "ask" this is. I believe DMD, and thus D, already has much of this capability built in.
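For a rough sense of how mechanical that knowledge is, here is a minimal sketch of one piece of it: encoding a single code point as UTF-8 code units. Python is used purely as illustration, not as proposed D; a real implementation would also need to reject surrogate code points (U+D800..U+DFFF) and handle decoding and the other encoding forms.

```python
# Hand-rolled UTF-8 encoder for one code point, to show the kind of
# table-free, mechanical knowledge a D compiler would need to embed.
# Sketch only; surrogates (U+D800..U+DFFF) are not rejected here.
def encode_utf8(cp: int) -> bytes:
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp < 0x80:                          # 1 code unit: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                         # 2 code units: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                       # 3 code units
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),       # 4 code units
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Cross-check against Python's own codec:
assert encode_utf8(ord("A")) == b"A"
assert encode_utf8(0xE9) == "\u00e9".encode("utf-8")
assert encode_utf8(0x20AC) == "\u20ac".encode("utf-8")
```

UTF-16 and UTF-32 are comparably small amounts of bit manipulation, which is why I suspect the "ask" is modest for a compiler that already accepts UTF-encoded source.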