String theory by example

November 25, 2005

Posted by Regan Heath

Attachments:

string.d

Ok, this time I thought I'd see if I could come up with a string struct/class that behaves how I think a built-in string type should behave in D. Here is the best I could do.

Thoughts, comments, etc.

---

module string;

import std.utf;
import std.stdio;

class StringType(T)
{
	private T data;

	this(char[] value)
	{
		assign!(char[])(value);
	}

	this(wchar[] value)
	{
		assign!(wchar[])(value);
	}

	this(dchar[] value)
	{
		assign!(dchar[])(value);
	}

	template assign(S) {
		private void assign(S value)
		{
			static if (is(T:char[])) data = toUTF8(value);
			static if (is(T:wchar[])) data = toUTF16(value);
			static if (is(T:dchar[])) data = toUTF32(value);
		}
	}

	dchar opIndex(uint index)
	{
		//foreach's own index counts code units, so count characters separately
		uint i = 0;
		foreach(dchar c; data)
		{
			if (i == index) return c;
			i++;
		}

		throw new Error("String bounds error");
	}

	dchar opIndexAssign(dchar u, uint index)
	{
		dchar[] res;
		uint i;

		i = 0;
		foreach(dchar c; data)
		{
			if (i == index) res ~= u;
			else res ~= c;
			i++;
		}

		//i now holds the character count; index must be strictly below it
		if (index >= i) throw new Error("String bounds error");

		assign!(dchar[])(res);

		return u;
	}

	//I don't get it, shouldn't this read:
	//StringType!(T) opSlice(uint start, uint end)
	//?

	StringType opSlice(uint start, uint end)
	{
		dchar[] res;
		uint i;

		i = 0;
		foreach(dchar c; data)
		{
			//keep characters in [start, end); compare before incrementing
			if (i >= start && i < end) res ~= c;
			i++;
		}

		if (i < end) throw new Error("String bounds error");

		return new StringType(res);
	}

	StringType opCat(StringType rhs)
	{
		return new StringType(data ~ rhs.data);
	}

	StringType opCatAssign(StringType rhs)
	{
		data ~= rhs.data;
		return this;
	}

	int length()
	{
		uint i = 0;
		foreach(dchar c; data) i++;
		return i;
	}

	int length(int newlength)
	{
		//NOTE: this adjusts by code units, which only matches characters for ASCII data
		int nl = newlength - length();
		data.length = data.length + nl;
		return newlength;
	}

	char[] opCast()
	{
		return toUTF8(data);
	}

	char[] toString()
	{
		return toUTF8(data);
	}
}

//Choose your native encoding
alias StringType!(char[]) String;
//alias StringType!(wchar[]) String;
//alias StringType!(dchar[]) String;

//NOTE: for this to work on the windows console you have to:
//      - left-click top left corner of command prompt window
//      - select "properties"
//      - select "font"
//      - select "Lucida Console"
//      - type "chcp 65001" into command prompt
//and now you can finally run this example.

void main()
{
	//hopefully the suffix becomes redundant
	String test = new String("smörgåsbord"c);

	//sadly this creates a new string.
	String two = test[0..4];

	//as does this, but this time it's expected to
	String three = test[0..4] ~ test[5..test.length];

	//modify original string, note that this replaces the character at position 3
	//(counting from 0), making 'a' the 4th character in this string
	test[3] = 'a';

	//for some odd reason if you change any of these to writefln it stops any
	//more data appearing
	//I suspect a bug specific to Windows, perhaps in phobos?
	writef("%s",test);
	writef(" ");
	writef("%s",two);
	writef(" ");
	writef("%s",three);
}
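
The "sadly this creates a new string" comment above need not apply when the data is stored as UTF-8. What follows is a minimal sketch, assuming a current D2 compiler and Phobos rather than the 2005 toolchain used in this thread. The helper name `sliceCodePoints` is hypothetical. It maps character positions to code-unit offsets with `std.utf.toUTFindex` and returns a slice of the original data instead of a copy.

```d
import std.stdio : writeln;
import std.utf : toUTFindex;

/// Hypothetical helper: slice `s` by character positions [start, end).
/// The result is a view into the original UTF-8 data; nothing is copied.
string sliceCodePoints(string s, size_t start, size_t end)
{
    // toUTFindex converts a code-point position into a code-unit offset.
    immutable lo = toUTFindex(s, start);
    immutable hi = toUTFindex(s, end);
    return s[lo .. hi];
}

void main()
{
    string test = "smörgåsbord";
    writeln(sliceCodePoints(test, 0, 4)); // first four characters
}
```

The price is that every slice still walks the string from the start to find the offsets, so it saves the allocation but not the linear scan.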

November 25, 2005

Re: String theory by example

Posted by Kris
in reply to Regan Heath

"Regan Heath" <regan@netwin.co.nz> wrote
> Ok, this time I thought I'd see if I could come up with a string struct/class that behaved how I think an in-built string type should behave in D, here is the best I could do.
>
> Thoughts, comments, etc.

It seems clear that any unified string notion would be better off as a library suite; not built into the compiler. It's difficult enough to evolve the code within Phobos, let alone something hard-coded into the compiler.

Thus, at this point, you're surely talking about a pre-packaged Phobos String class? Exactly the kind of thing that many have discussed in the past. The reasons it hasn't yet happened are not fully clear, but I would bet it's partly to do with the following:

a) it seems everyone has a different set of requirements for a String class -- tradeoffs regarding performance, flexibility, favourite methods, etc, etc. To wit: there are perfectly good String classes all over the place. Many different implementations to choose from. Some would argue that's a good thing.

b) a String class to support Unicode is hardly a trivial undertaking. You really have to consider very hard what the goals are before putting something in stone (as in getting it added to Phobos). I say that from experience with the ICU project ~ there's code in there to handle the kinds of things that would frighten many people. Unicode ain't trivial and, frankly, I think AJ would have a hard time coming up with a "suitable" set of compromises. The latter is important: there will be many compromises one way or another.


I think a good place to start is to ask yourself and others (particularly those who actually use unicode on a regular basis) why not just use ICU and be done with it ~ after all, ICU can do just about anything vis-a-vis Unicode. The outcome may be able to provide some guidance?

November 25, 2005

Re: String theory by example

Posted by Regan Heath
in reply to Kris

On Thu, 24 Nov 2005 17:41:55 -0800, Kris <fu@bar.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote
>> Ok, this time I thought I'd see if I could come up with a string
>> struct/class that behaved how I think an in-built string type should
>> behave in D, here is the best I could do.
>>
>> Thoughts, comments, etc.
>
> It seems clear that any unified string notion would be better off as a
> library suite; not built into the compiler.

Perhaps, however the syntax can be better if it's built in.

> Thus, at this point, you're surely talking about a pre-packaged Phobos
> String class?

I have used a class here. I'd have preferred to use a struct, but several things didn't work when it was a struct. Most of all I'd prefer it to be built in, like the arrays are.

> Exactly the kind of thing that many have discussed in the
> past. The reasons it hasn't yet happened are not fully clear, but I would bet it's partly to do with the following:
>
> a) it seems everyone has a different set of requirements for a String
> class -- tradeoffs regarding performance, flexibility, favourite methods,
> etc, etc. To wit: there are perfectly good String classes all over the
> place. Many different implementations to choose from. Some would argue
> that's a good thing.

This is true. I've read/heard many of the arguments. However, I reckon it's possible to make everyone happy with a built in type that doesn't try to do too much. That is what the purpose of this thread is.

> b) a String class to support Unicode is hardly a trivial undertaking. You
> really have to consider very hard what the goals are before putting
> something in stone (as in getting it added to Phobos).

Certainly and it appears to me that there already exists in DMD and Phobos the required code to handle the idea I have in mind.

My goal is a built-in type which can store strings in any of the 3 UTF encodings, which when sliced will give characters (as opposed to character fragments), and which will be transcoded either implicitly or explicitly. Further, if the array feature that allows this:

void foo(char[] a) {}
char[] a;
a.foo();

is also implemented for this type, then it becomes extensible and people can add their favourite methods, though I would hope that Phobos came with many already provided.
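
As a point of comparison, here is a minimal sketch assuming a current D2 compiler, where this method-style call syntax is available for any type and not only for arrays. The helper name `codePoints` is made up for illustration; `std.utf.count` is the Phobos function that counts code points.

```d
import std.stdio : writeln;
import std.utf : count;

/// Hypothetical helper: number of characters (code points) in a UTF-8 string.
size_t codePoints(string s)
{
    return count(s);
}

void main()
{
    string a = "smörgåsbord";
    writeln(a.codePoints());  // free function called with method syntax
    writeln(codePoints(a));   // the equivalent ordinary call
}
```

Both calls resolve to the same free function, so a library can add "methods" to the string type without touching the type itself.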

It doesn't need anything else. From this point we provide the ICU features via methods and libraries; very little else needs to be built in. The class I posted does almost everything I see this built-in type doing, and it does it almost exactly how I wanted it done. Where it falls short is that it's not built in, and so does not have the syntax that being built in would give us.

> I say that from
> experience with the ICU project ~ there's code in there to handle the kinds of things that would frighten many people. Unicode ain't trivial and,
> frankly, I think AJ would have a hard time coming up with a "suitable" set of compromises. The latter is important: there will be many compromises one way or another.

I believe you, your experience would be useful in exploring this idea.

> I think a good place to start is to ask yourself and others (particularly
> those who actually use unicode on a regular basis) why not just use ICU and be done with it ~ after all, ICU can do just about anything vis-a-vis
> Unicode. The outcome may be able to provide some guidance?

I think using ICU is a great idea. As I said above, this would be part of a library and would extend the built in type. All the built in type needs to do is store the 3 encodings, transcode between them and slice full characters (as opposed to fragments).

Nothing more.

Regan

November 25, 2005

Re: String theory by example

Posted by kris
in reply to Regan Heath

Regan Heath wrote:
> My goal is a built-in type which can store strings in any of the 3 UTF encodings, when sliced will give characters (as opposed to character fragments)

That would require 32 bits, then. A dchar.

> and will be transcoded either implicitly or explicitly. Further, if the array feature that allows this:
> 
> void foo(char[] a) {}
> char[] a;
> a.foo();
> 
> is also implemented for this type, then it becomes extensible

And, just then, the vehicle swerved off the road and over a cliff. Bon voyage.

November 25, 2005

Re: String theory by example

Posted by Regan Heath
in reply to kris

On Thu, 24 Nov 2005 20:22:53 -0800, kris <fu@bar.org> wrote:
> Regan Heath wrote:
>> My goal is a built-in type which can store strings in any of the 3 UTF encodings, when sliced will give characters (as opposed to character fragments)
>
> That would require 32 bits, then. A dchar.

Yes. Note opIndex in the code I posted.

Have you had a close look at std.format.doFormat and std.stdio.writefx? Have you noticed that UTF-8 characters are all transcoded to individual dchars then transcoded back to UTF-8 to be output?

This doesn't prove anything, but it suggests that using a dchar-sized variable for characters will have little or no real effect on performance. Maybe; a conclusive test should really be made.

My original idea was horribly broken because I tried to fight against the fact that the only type that can store a complete character all the time is the dchar, a 32 bit type. I was trying to make the ASCII app programmers happy, happy because they can store their characters in an 8 bit wide type.
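
To make the size argument concrete, here is a minimal sketch (assuming a current D2 compiler and Phobos) that counts how many code units one character needs in each encoding. The helper name `unitCounts` is hypothetical; `toUTF8`, `toUTF16` and `toUTF32` are the `std.utf` transcoders.

```d
import std.stdio : writefln;
import std.utf : toUTF8, toUTF16, toUTF32;

/// Hypothetical helper: code units needed for `s` in UTF-8, UTF-16 and UTF-32.
size_t[3] unitCounts(dstring s)
{
    return [toUTF8(s).length, toUTF16(s).length, toUTF32(s).length];
}

void main()
{
    // U+00F6 (ö) lies inside the Basic Multilingual Plane,
    // U+1D11E (musical G clef) lies outside it.
    foreach (dstring s; ["\u00F6"d, "\U0001D11E"d])
    {
        auto n = unitCounts(s);
        writefln("UTF-8: %s, UTF-16: %s, UTF-32: %s code units", n[0], n[1], n[2]);
    }
}
```

Only the UTF-32 column stays at one unit per character, which is the point about dchar being the only type that always holds a complete character.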

>> and will be transcoded either implicitly or explicitly. Further, if the array feature that allows this:
>>
>> void foo(char[] a) {}
>> char[] a;
>> a.foo();
>>
>> is also implemented for this type, then it becomes extensible
>
> And, just then, the vehicle swerved off the road and over a cliff. Bon voyage.

I take it you don't like this feature? or..

I don't mind either way:

a)
string foo;
foo.method();

b)
string foo;
method(foo);

but then, I'm a C programmer by trade.

Regan

November 25, 2005

Re: String theory by example

Posted by kris
in reply to Regan Heath

Regan Heath wrote:
> On Thu, 24 Nov 2005 20:22:53 -0800, kris <fu@bar.org> wrote:
> 
>> Regan Heath wrote:
>>
>>> My goal is a built-in type which can store strings in any of the 3 UTF encodings, when sliced will give characters (as opposed to character fragments)
>>
>>
>> That would require 32 bits, then. A dchar.
> 
> 
> Yes. 

So, just use dchar. All it needs are properties to convert it to utf8 and utf16. Wait! You don't need any properties either, since you can use that awful hack below for those purposes <g>

Seriously, the extent of what you appear to propose can be done right now, in multiple different ways. No compiler changes required. I'd like to see true properties for UTF transcoding, but that would just be convenient. There's already sufficient to build upon, assuming one would do the necessary research to construct a great API.
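
For what it's worth, a minimal sketch of what such transcoding "properties" look like, assuming a current D2 compiler where a free function can be called without parentheses in method position. Only existing `std.utf` functions are used; the variable names are illustrative.

```d
import std.stdio : writeln;
import std.utf : toUTF8, toUTF16;

void main()
{
    dstring text = "smörgåsbord"d;

    // Parenthesis-free calls make the Phobos transcoders read like properties.
    string  narrow = text.toUTF8;
    wstring wide   = text.toUTF16;

    // Lengths are in code units of each encoding.
    writeln(narrow.length, " ", wide.length, " ", text.length);
}
```

Each conversion allocates a new array, so this is a convenience and not a performance feature.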

> 
> Have you had a close look at std.format.doFormat and std.stdio.writefx? Have you noticed that UTF-8 characters are all transcoded to individual dchars then transcoded back to UTF-8 to be output?

I'm rather surprised that wasn't already widely known.

> This doesn't prove anything but it suggests that using a dchar sized variable for characters will have little or no real effect on performance..

Pardon me, but this sounds a bit naive. One has to consider the use case involved ~ printf() can hardly be considered a high-performance, uh, anything. The goal is convenience, not speed (though the writef design could certainly be improved upon quite dramatically).

Your above statement is trying to extrapolate an equivalent measure of acceptability in the general case. That doesn't hold up to much scrutiny, IMO. Confusing convenience with acceptable performance is a mistake.

> maybe, a conclusive test should really be made.

A conclusive test of what? This thing about writef is a total red herring. Horses for courses.

> but then, I'm a C programmer by trade.

C makes a great language to write nicely structured OO-style code. Don't knock it <g>

Some would claim it's also more maintainable than C++  :-)

November 25, 2005

Re: String theory by example

Posted by Regan Heath
in reply to kris

On Thu, 24 Nov 2005 21:13:34 -0800, kris <fu@bar.org> wrote:
> Regan Heath wrote:
>> On Thu, 24 Nov 2005 20:22:53 -0800, kris <fu@bar.org> wrote:
>>
>>> Regan Heath wrote:
>>>
>>>> My goal is a built-in type which can store strings in any of the 3 UTF encodings, when sliced will give characters (as opposed to character fragments)
>>>
>>>
>>> That would require 32 bits, then. A dchar.
>> Yes.
>
> So, just use dchar.

The advantage the type I'm imagining would have is the ability to store the data as UTF-8 internally (like my class can). Characters would only rarely exist as dchar-sized units, i.e. when you actually index the string or ask it for them, one at a time (like my class does).
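
A minimal sketch of that idea, assuming a current D2 compiler and Phobos: the data stays in UTF-8 and a dchar is produced only for the character actually requested. The function name `charAt` is hypothetical; `std.utf.decode` is the Phobos routine that reads one code point and advances an offset.

```d
import std.stdio : writeln;
import std.utf : decode;

/// Hypothetical helper: the character at position `n`, decoded on demand
/// from UTF-8 storage. Throws if `n` is past the last character.
dchar charAt(string data, size_t n)
{
    size_t offset = 0; // code-unit offset into `data`
    for (size_t i = 0; offset < data.length; ++i)
    {
        // decode() returns one code point and moves `offset` past it.
        immutable dchar c = decode(data, offset);
        if (i == n) return c;
    }
    throw new Exception("String bounds error");
}

void main()
{
    writeln(charAt("smörgåsbord", 2)); // third character
}
```

Lookup by position is linear in the length of the string, which is the trade-off for the compact storage.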

> Seriously, the extent of what you appear to propose can be done right now, in multiple different ways. No compiler changes required.

Yes, with a class, like I posted. But the syntax could be much nicer if it was built in, and if it came standard (built in or as part of the library) the other 3 array types could fade into obscurity, i.e. only get used when accessing code fragments was desired.

It should mean that everyone writing code in D would use it and not one of the other 3, meaning we get no more "this library uses char[]" but "this library uses wchar[]" problems and no more "I have to write 3 functions one for each char type" problems either.
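
On the "3 functions, one for each char type" problem specifically, here is a minimal sketch assuming a current D2 compiler: one template constrained with `std.traits.isSomeChar` covers all three array types. The function name `characters` is made up for illustration.

```d
import std.stdio : writeln;
import std.traits : isSomeChar;
import std.utf : count;

/// One generic function instead of three overloads: accepts char[],
/// wchar[] and dchar[] data and returns the number of code points.
size_t characters(C)(const(C)[] text) if (isSomeChar!C)
{
    return count(text);
}

void main()
{
    writeln(characters("smörgåsbord"c)); // UTF-8 input
    writeln(characters("smörgåsbord"w)); // UTF-16 input
    writeln(characters("smörgåsbord"d)); // UTF-32 input
}
```

This removes the duplicated source code, though the compiler still generates one instance per encoding, and it does not by itself solve the problem of two libraries preferring different encodings.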

Regan.

November 25, 2005

Re: String theory by example

Posted by kris
in reply to Regan Heath

Regan Heath wrote:
> On Thu, 24 Nov 2005 21:13:34 -0800, kris <fu@bar.org> wrote:
> 
>> Regan Heath wrote:
>>
>>> On Thu, 24 Nov 2005 20:22:53 -0800, kris <fu@bar.org> wrote:
>>>
>>>> Regan Heath wrote:
>>>>
>>>>> My goal is a built-in type which can store strings in any of the 3 UTF encodings, when sliced will give characters (as opposed to character fragments)
>>>>
>>>>
>>>>
>>>> That would require 32 bits, then. A dchar.
>>>
>>>   Yes.
>>
>>
>> So, just use dchar.
> 
> 
> The advantage the type I'm imagining would have is the ability to store the data as UTF-8 internally (like my class can). Characters would only rarely exist as dchar-sized units, i.e. when you actually index the string or ask it for them, one at a time (like my class does).
> 
>> Seriously, the extent of what you appear to propose can be done right now, in multiple different ways. No compiler changes required.
> 
> 
> Yes, with a class, like I posted. But the syntax could be much nicer if it was built in, and if it came standard (built in or as part of the library) the other 3 array types could fade into obscurity, i.e. only get used when accessing code fragments was desired.
> 
> It should mean that everyone writing code in D would use it and not one of the other 3, meaning we get no more "this library uses char[]" but "this library uses wchar[]" problems and no more "I have to write 3 functions one for each char type" problems either.
> 
> Regan.

Yep, it's clear what you're after. And you're not the first to try. But you won't get there by ignoring the problems inherent in building a compromise. This whole subject needs some serious research, rather than chit chat in a NG. Better to look at how it's done everywhere else, and learn how that could be adapted appropriately? This is a wheel that's been invented before, by those with far more expertise than you or I will likely ever have in this field.

It ain't hard to put together a useful String class. Making it extensible is easy too, given tools like interfaces and class inheritance. Designing it with respect to performance and immutability is also not so tough (though D badly needs read-only arrays). What's really hard is getting the initial set of compromises worked out, as I keep repeating. Then comes the hard work of dealing with the edge-conditions, special cases, unexpected gotchas and, in some cases, just plain old grey-matter and hard work.
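
On the read-only arrays point, a minimal sketch assuming a current D2 compiler, where the `const` and `immutable` qualifiers cover this and `string` is an alias for `immutable(char)[]`. The function name `countSpaces` is hypothetical.

```d
import std.stdio : writeln;

// In current D, `string` is an alias for immutable(char)[].
static assert(is(string == immutable(char)[]));

/// Accepts mutable, const and immutable data alike and cannot modify it.
size_t countSpaces(const(char)[] text)
{
    size_t n = 0;
    foreach (char c; text)
        if (c == ' ') ++n;
    return n;
}

void main()
{
    string fixed = "read only text";
    // fixed[0] = 'R';           // rejected at compile time: elements are immutable
    char[] scratch = fixed.dup;  // an explicit copy is needed before modifying
    scratch[0] = 'R';
    writeln(countSpaces(fixed), " ", countSpaces(scratch), " ", scratch);
}
```

A String class can hand out `const(char)[]` views of its storage this way without worrying that callers will write through them.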

You mentioned before that this built-in notion would somehow interface with ICU? Well, that would be a consideration. But first you need to review how ICU, and other packages like it, operate before assuming some binding to a native type (other than a class) could make it an attractive marriage. I strongly suspect, based on experience, that you'd end up with a class-based interface anyway. And why not? What on earth is wrong with classes? Especially when they're native to the language?

November 25, 2005

Re: String theory by example

Posted by Regan Heath
in reply to kris

On Thu, 24 Nov 2005 22:19:33 -0800, kris <fu@bar.org> wrote:

<snip good advice>

> I strongly suspect, based on experience, that you'd end up with a class-based interface anyway. And why not? What on earth is wrong with classes? Especially when they're native to the language?

To answer that question you have to ask "what is the difference between a class and the built in array types?".	

Regan

November 25, 2005

Re: String theory by example

Posted by kris
in reply to Regan Heath

Regan Heath wrote:
> On Thu, 24 Nov 2005 22:19:33 -0800, kris <fu@bar.org> wrote:
> 
> <snip good advice>
> 
>> I strongly suspect, based on experience, that you'd end up with a class-based interface anyway. And why not? What on earth is wrong with classes? Especially when they're native to the language?
> 
> 
> To answer that question you have to ask "what is the difference between a  class and the built in array types?".   
> 
> Regan

You don't know? :-)

If I get your drift, the question should perhaps be thus: at what point of complexity does it become generally acceptable to leave native types behind.

Everyone seems to have a different opinion. What do you expect?

The key to powerful, easy-to-use, practical, and extensible Unicode handling is, IMO, far away on the other side of that divide. I suspect/hope you'd ultimately agree.

Since this thread is called "String theory by example", I'll encourage those interested to take a critical look at the ICU project here: http://icu.sourceforge.net/userguide/ and the D wrappers over here: http://svn.dsource.org/projects/mango/trunk/mango/icu/

No, I'm not saying that ICU is the "way and the truth". But one has to start researching somewhere.
