November 24, 2005 Re: Unified String Theory..

Posted in reply to Regan Heath

I want to thank everyone for reading and posting opinions on my proposal. It appears I have done a bad job explaining some of it, and some of it simply doesn't work. I have a modified idea in mind which I think will work much better and should also be much simpler. Thanks everyone.

Regan
November 24, 2005 Re: Unified String Theory..

Posted in reply to Regan Heath

Regan Heath wrote:
> On Thu, 24 Nov 2005 13:34:51 +0200, Jari-Matti Mäkelä <jmjmak@invalid_utu.fi> wrote:
>> Regan, your proposal is absolutely too complex. I don't get it and I really don't like it. D is supposed to be a _simple_ language. Here's an alternative proposal:

Sorry for being a bit impolite. I just wanted to show that it's completely possible to write Unicode-compliant programs without the need for several string keywords. I believe a careful design and implementation removes most of the performance drawbacks.

> Thanks for your opinion. It appears some parts of my idea were badly thought out. I was trying to end up with something simple; it seems a few of my choices were bad ones and they simply complicated the idea. Thanks for bringing up some conversation.

As you can see, neither of us is perfect => designing a modern programming language isn't as easy as it might have seemed.

> I was trying to avoid picking any 1 type over the others (as you have suggested here).

Actually I have to change my opinion. I think it would be good if the compiler were allowed to choose the correct encoding. I don't think there will be any serious problems, since nowadays most Win32 things use UTF-16 and *nix systems UTF-8.

> It appears now that I should replace all my talk about cp1, cp2, cp4 and cpn with "all characters are stored in a 32 bit type called uchar". If anyone has a problem with that, I'd direct them to take a look at std.format.doFormat and std.stdio.writef which convert all char[] data into individual dchars before converting it back to UTF-8 for output to the screen.

That is one solution, although I might let the compiler decide the encoding.

>> Please stop whining about the slowness of utf-conversions. If it's really so slow, I would certainly want to see some real-world benchmarks.
>
> I mention performance only because people have been concerned with it in the past. I too have no idea how much time it takes and would like to see a benchmark. The fact that D is already doing it with writef and no-one has complained...

I can't say anything about the overall complexity class for programs that do Unicode, but at least my simple experiments [1] show that unoptimized use of writefln is 'only' 50% slower than optimized use of printf in C (both using the same gcc backend). Though I'm not 100% sure this program of mine actually did any transcoding. In addition, I think most 'static' conversions can be precalculated.

[1] http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.learn/1983

Jari-Matti
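The round trip described above (decode each char[] to individual code points, then re-encode to UTF-8 for output) can be sketched outside D as well. The following is a minimal Python illustration of that decode/re-encode cycle, not D's actual doFormat code; the hand-rolled decoder does no validation, which a real implementation must add:

```python
# UTF-8 bytes -> code points -> UTF-8 bytes: the round trip that
# the thread describes doFormat/writef performing per character.
data = "smörgåsbord".encode("utf-8")   # a char[]-like UTF-8 buffer

code_points = []
i = 0
while i < len(data):
    b = data[i]
    if b < 0x80:               # 1-byte sequence (ASCII)
        cp, n = b, 1
    elif b < 0xE0:             # 2-byte sequence (lead byte 110xxxxx)
        cp, n = b & 0x1F, 2
    elif b < 0xF0:             # 3-byte sequence (lead byte 1110xxxx)
        cp, n = b & 0x0F, 3
    else:                      # 4-byte sequence (lead byte 11110xxx)
        cp, n = b & 0x07, 4
    for cont in data[i + 1:i + n]:
        cp = (cp << 6) | (cont & 0x3F)   # fold in continuation bits
    code_points.append(cp)
    i += n

# Re-encode and check that the round trip is lossless.
out = "".join(map(chr, code_points)).encode("utf-8")
assert out == data
```

The cost being debated is exactly this loop: one branch and a few bit operations per character on the way out, and the mirror image on the way back in.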
November 24, 2005 Re: Unified String Theory..

Posted in reply to Regan Heath

Regan Heath wrote:
> But people don't care about code units, they care about characters. When do you want to inspect or modify a single code unit? I would say, just about never. On the other hand you might want to change the 4th character of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[] array. Ick.

True. BTW, is there a bug in std.string.insert? I tried to do:

    char[] a = "blaahblaah";
    std.string.insert(a, 5, "öö");
    std.stdio.writefln(a);

Outputs:

    blaahblaah

>> When would you actually need character based indexing? I believe the answer is less often than you think.
>
> Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never, however D forces you to know; you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.

I agree. You don't need it very often, but when you do, there's currently no possibility to do that. I think char[] slicing and indexing should be a bit better (work at the Unicode character level) since you _never_ want to change code units. (And in case you do, just cast it to void[].)

Jari-Matti
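The character-index versus byte-index mismatch in "smörgåsbord" is easy to make concrete. A small Python sketch, used here only because its str/bytes split mirrors the character/code-unit split being discussed:

```python
s = "smörgåsbord"
utf8 = s.encode("utf-8")      # the char[]-like byte view

# The 4th character (0-based index 3) is 'r' ...
assert s[3] == "r"

# ... but in the UTF-8 byte array it sits at byte index 4,
# because 'ö' (U+00F6) occupies two bytes.
assert len("smö".encode("utf-8")) == 4
assert utf8[4:5] == b"r"

# Indexing the bytes by character position lands inside a sequence:
# utf8[3] is the continuation byte of 'ö', not a character at all.
assert utf8[3] == 0xB6
```

This is exactly why "replace the 4th letter" cannot be done on a char[] without first walking the string to find where the 4th letter starts.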
November 24, 2005 Re: Unified String Theory..

Posted in reply to Jari-Matti Mäkelä

On Fri, 25 Nov 2005 00:31:32 +0200, Jari-Matti Mäkelä wrote:
> True. BTW, is there a bug in std.string.insert? I tried to do:
>
>     char[] a = "blaahblaah";
>     std.string.insert(a, 5, "öö");
>     std.stdio.writefln(a);
>
> Outputs:
>
>     blaahblaah

No bug. The function is not designed to update the same string passed to the function. It returns an updated string.

    char[] a = "blaahblaah";
    a = std.string.insert(a, 5, "öö");
    std.stdio.writefln(a);

--
Derek (skype: derek.j.parnell) Melbourne, Australia 25/11/2005 10:04:41 AM
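The pattern Derek points out, a function that returns a new string instead of mutating its argument, is common to most immutable-string APIs. A Python sketch with a hypothetical `insert` helper (the name mirrors std.string.insert but this is not D's implementation; it also works on character indices, not bytes):

```python
def insert(s: str, index: int, sub: str) -> str:
    """Return a new string with sub inserted at index.
    The original string is left untouched."""
    return s[:index] + sub + s[index:]

a = "blaahblaah"
insert(a, 5, "öö")        # result discarded: 'a' is unchanged
assert a == "blaahblaah"  # the "bug" Jari-Matti observed

a = insert(a, 5, "öö")    # assign the result back, as Derek shows
assert a == "blaahööblaah"
```

Forgetting the assignment silently discards the result, which is why the original test appeared to do nothing.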
November 25, 2005 Re: Unified String Theory..

Posted in reply to Regan Heath

Regan Heath wrote:
> On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu
>>
>> It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:

True!

> Yeah, I'm starting to think that is the only way it works. The 3 types were an attempt to avoid that for programs which do not need an int-sized type for a char, i.e. the quick and dirty ASCII program for example.
>
> Interestingly it seems std.stdio and std.format are already involved in a conspiracy to convert all our char[] output to dchar and back again one character at a time before it eventually makes it to the screen.

Must've been the specters in the night again. :-)

> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.

That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
November 25, 2005 Re: Unified String Theory..

Posted in reply to Georg Wrede

On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
> Regan Heath wrote:
>> On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu
>>> It makes far more sense to have only 1 _character_ type, that holds any UNICODE character. Whether it comes from an utf8, utf16 or utf32 string shouldn't matter:
>
> True!
>
>> Yeah, I'm starting to think that is the only way it works. The 3 types were an attempt to avoid that for programs which do not need an int-sized type for a char, i.e. the quick and dirty ASCII program for example.
>>
>> Interestingly it seems std.stdio and std.format are already involved in a conspiracy to convert all our char[] output to dchar and back again one character at a time before it eventually makes it to the screen.
>
> Must've been the specters in the night again. :-)
>
>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
>
> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!

I think that would interfere with the slice concept.

    char[] a = "some text";
    char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.

--
Derek (skype: derek.j.parnell) Melbourne, Australia 25/11/2005 11:37:08 AM
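Derek's objection can be made concrete: a slice is a view into the middle of its parent buffer, so there is nowhere to put a terminating zero without overwriting the parent's data. A Python sketch using memoryview purely as an analogy for D's reference slices:

```python
# A C-style buffer with a trailing NUL, as a C string would have.
buf = bytearray(b"some text\x00")

# A zero-copy slice into the middle, analogous to D's a[5 .. 9]:
# it is a view into buf, not a copy.
view = memoryview(buf)[5:9]
assert bytes(view) == b"text"
assert 0 not in bytes(view)   # the slice carries no terminator

# An interior slice ends where the parent's live data continues,
# so writing a NUL just past it (buf[8]) would corrupt the parent.
inner = memoryview(buf)[5:8]
assert bytes(inner) == b"tex"
assert buf[8:9] == b"t"       # the byte a terminator would clobber
```

Keeping char[] automatically zero-terminated would therefore force every slice operation to copy, which defeats the point of slices.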
November 25, 2005 Re: Unified String Theory

Posted in reply to Regan Heath

Regan Heath wrote:
> On Thu, 24 Nov 2005 13:29:10 +0200, Georg Wrede <georg.wrede@nospam.org> wrote: ...
>
> "cpn" would need to change size for each character. It would be more than a simple alias.
>
> If it cannot change size, then it would need to be the largest size required.
>
> If that was also too weird/difficult then it would need to be 32 bits in size all the time. I was trying to avoid this but it seems it may be required?

Yes. I see no way to avoid "cpn" being 32-bit only.

> My proposal didn't suggest different encodings based on the system. It was UTF-8 by default (all systems) and application-specific otherwise. There is nothing stopping us making the Windows default UTF-16 if that makes sense. Which it seems to.

Windows, <sigh>. Looks like it. They seem to have a habit of choosing what seems easiest at the outset, without ever learning to dig into issues first. Had they done so, they'd have chosen UTF-8, like everybody else. :-(
November 25, 2005 Re: Unified String Theory [READ THIS FIRST]

Posted in reply to Regan Heath

Regan Heath wrote:
> On Thu, 24 Nov 2005 14:33:53 +0200, Georg Wrede wrote:
>
> I was trying to avoid it being 32 bits large all the time, but it seems to be the only way it works.

I agree. And I share the feeling. :-)

>> If this is true, then we might consider blatantly skipping cp1 and cp2, and only having cp4 (possibly also renaming it utfchar).
>>
>> This would make it possible for us to fully automate the extraction and insertion of single "characters" into our new strings.
>>
>>     string foo = "gagaga";
>>     utfchar bar = '\UFE9D'; // you don't want to know the name :-)
>>     utfchar baf = 'a';
>>     foo ~= bar ~ baf;
>
> It seems this may be the best solution. Oskar had a good name for it: "uchar". It means quick and dirty ASCII apps will have to use a 32-bit-sized char type. I can hear people complain already.. but it's odd that no-one is complaining about writef doing this exact same thing!

Not too many have dissected writef. Or else we'd have heard some complaints already. ;-)

I actually thought about "uchar" for a while, but then I remembered that a lot of this UTF disaster originates from unfortunate names, and C code commonly has a uchar type. So, I'd suggest "utfchar" or "unicode" or something to-the-point and unambiguous that's not in C.

>> For completeness, we could have the painting casts (as opposed to converting casts). They'd be for the (seldom) situations where the programmer _does_ want to do serious tinkering on our strings.
>>
>>     ubyte[] myarr1 = cast(ubyte[]) foo;
>>     ushort[] myarr2 = cast(ushort[]) foo;
>>     uint[] myarr3 = cast(uint[]) foo;
>>
>> These give raw arrays, like exact images of the string. The burden of COW would lie on the programmer.
>
> I was thinking of using properties (Sean's idea) to access the data as a certain type, e.g.
>
>     ubyte[] b = foo.utf8;
>     ushort[] s = foo.utf16;
>     uint[] i = foo.utf32;
>
> these properties would return the string in the specified encoding using those array types.

So it'd be the same thing, except your code looks a lot nicer!
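The property idea, transcoding on demand through foo.utf8 / foo.utf16 / foo.utf32, corresponds directly to encoding methods in languages with a single abstract string type. A Python sketch with a hypothetical UString wrapper (the class and property names come from Regan's example, not any real library; encode returns bytes here, where the D proposal returns typed arrays):

```python
class UString:
    """Hypothetical string wrapper whose properties hand out the
    text transcoded to a given UTF encoding form on demand."""

    def __init__(self, text: str):
        self._text = text

    @property
    def utf8(self) -> bytes:      # the ubyte[]-like view
        return self._text.encode("utf-8")

    @property
    def utf16(self) -> bytes:     # the ushort[]-like view (no BOM)
        return self._text.encode("utf-16-le")

    @property
    def utf32(self) -> bytes:     # the uint[]-like view (no BOM)
        return self._text.encode("utf-32-le")

foo = UString("gagaga")
assert foo.utf8 == b"gagaga"   # ASCII: 1 byte per character
assert len(foo.utf16) == 12    # 2 bytes per UTF-16 code unit
assert len(foo.utf32) == 24    # 4 bytes per code point
```

Unlike Georg's painting casts, each property produces a fresh converted copy, so the copy-on-write burden disappears at the cost of a conversion per access.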
November 25, 2005 Re: Unified String Theory..

Posted in reply to Derek Parnell

Derek Parnell wrote:
> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
>>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
>>
>> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
>
> I think that would interfere with the slice concept.
>
>     char[] a = "some text";
>     char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.

Slicing C's char[] implies byte-wide, and non-UTF.
November 25, 2005 Re: Unified String Theory..

Posted in reply to Georg Wrede

On Fri, 25 Nov 2005 05:16:15 +0200, Georg Wrede wrote:
> Derek Parnell wrote:
>> On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
>>>> I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.
>>>
>>> That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when (under pressure) converting code from C(++)!
>>
>> I think that would interfere with the slice concept.
>>
>>     char[] a = "some text";
>>     char[] b = a[4 .. 7]; // Making 'b' a reference into 'a'.
>
> Slicing C's char[] implies byte-wide, and non-UTF.

Exactly, and that's why I'm worried by the suggestion that char[] be automatically zero-terminated, because slices are usually not zero-terminated.

--
Derek (skype: derek.j.parnell) Melbourne, Australia 25/11/2005 3:12:51 PM
Copyright © 1999-2021 by the D Language Foundation