Ascii matters

Aug 22, 2012

bearophile

I need to manage Unicode text, but in many cases I have a lot of 7-bit or 8-bit ASCII text to process. This led to the discussion linked below, and so for some time now, thanks to Jonathan Davis, we have had an efficient translate() again:

http://d.puremagic.com/issues/show_bug.cgi?id=7515

The s2 array generated by this code is a dchar[] (if array() becomes pure you will probably be able to type s2 as dstring):

    string s = "test string"; // UTF-8, but also 7-bit ASCII
    dchar[] s2 = map!(x => x)(s).array(); // Uses the Id function

To produce a char[] (or a string, using assumeUnique), you are free to use a cast:

    auto s3 = map!(x => cast(char)x)(s).array();

But D casts are unsafe, and one thing I'm learning from Haskell is how important it is to give types to your code to prevent bugs. So maybe an AsciiString wrapper range (a subtype of string) can be invented for Phobos. Its constructor verifies that the input is 7-bit ASCII and its "front" method yields chars, so map.array() gives a char[]:

    astring a1 = "test string"; // enforced 7-bit ASCII
    char[] s4 = map!(x => x)(a1).array();

This makes some algorithms working on ASCII text cleaner and safer, avoiding the need for casts.

Is creating something like this possible and appreciated for Phobos?

Bye,
bearophile
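A minimal sketch of what such a wrapper could look like. This is only an illustration, not a Phobos type: the name `AsciiString`, its members, and the choice to throw via `enforce` are all assumptions made for the example.

```d
import std.algorithm : map;
import std.array : array;
import std.ascii : toUpper;
import std.exception : enforce;
import std.range : ElementType;

// Hypothetical wrapper (not part of Phobos): a forward range over a string
// whose constructor checks that every code unit is 7-bit ASCII.
struct AsciiString
{
    private string data;

    this(string s)
    {
        // Any code unit >= 0x80 is part of a multi-byte UTF-8 sequence.
        foreach (char c; s)
            enforce(c < 0x80, "input is not 7-bit ASCII");
        data = s;
    }

    // Range primitives that yield char instead of dchar.
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return data[0]; }
    void popFront() { data = data[1 .. $]; }
    @property AsciiString save() { return this; }
    @property size_t length() const { return data.length; }
}

void main()
{
    auto a1 = AsciiString("test string");
    static assert(is(ElementType!AsciiString == char));

    // No cast needed: the element type is already char.
    char[] s4 = map!(x => toUpper(x))(a1).array();
    assert(s4 == "TEST STRING");
}
```

Because validation happens once in the constructor, code that receives an `AsciiString` can rely on the one-code-unit-per-character property without re-checking. Note that `AsciiString.init` bypasses the constructor, but it holds an empty string, which is trivially ASCII.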

On Thursday, August 23, 2012 00:11:18 bearophile wrote:
> Is creating something like this possible and appreciated for Phobos?

It could certainly be done. In fact, doing so would be incredibly trivial. But given that you can use ubyte[] just fine and the fact that using ASCII really shouldn't be encouraged, I don't like the idea of adding such a range to Phobos. I don't know what the general consensus on that would be though.

- Jonathan M Davis

Jonathan M Davis:

> But given that you can use ubyte[] just fine

The data I am processing is not generic octets, like 8 bits digitized by some old A/D converter; they are chars, and I expect to see strings when I print them :-)

> and the fact that using ASCII really shouldn't be encouraged,

For generic text I agree with you, using UTF-8 is safer and better. But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

Bye,
bearophile

On Thursday, August 23, 2012 02:07:52 bearophile wrote:
> Jonathan M Davis:
> > But given that you can use ubyte[] just fine
>
> The data I am processing is not generic octets, like 8 bits digitized by some old A/D converter, they are chars, and I expect to see strings when I print them :-)
>
> > and the fact that using ASCII really shouldn't be encouraged,
>
> For generic text I agree with you, using UTF-8 is safer and better. But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

Then just use ubyte[], and if you need char[] for printing out, then cast it. And if you don't like the casting, you can even wrap it in a function.

    char[] fromASCII(ubyte[] str)
    {
        return cast(char[])str;
    }

Creating an ASCII range type will just encourage its use, when you should only be operating on ASCII when you really need it. Operating on ASCII is quite possible as it is and isn't even very hard. So, I really don't see much benefit in adding such a range, and the fact that it arguably would encourage bad behavior then makes it _undesirable_ rather than just not particularly beneficial.

- Jonathan M Davis
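For illustration, the ubyte[] route described above can be sketched with helpers from current Phobos rather than hand-written casts: `std.string.representation` views a string as bytes and `std.string.assumeUTF` goes back to char[]. The helper name `asciiUpper` is hypothetical, and the sketch assumes the caller already knows the input is 7-bit ASCII.

```d
import std.algorithm : map;
import std.array : array;
import std.string : assumeUTF, representation;

// Hypothetical helper: upper-cases text the caller knows to be 7-bit ASCII.
char[] asciiUpper(string s)
{
    // View the string as immutable(ubyte)[] without copying, so that
    // range-based functions see plain bytes instead of decoding to dchar.
    immutable(ubyte)[] bytes = s.representation;

    // The element type stays ubyte, so array() builds a ubyte[].
    ubyte[] upper = bytes
        .map!(b => cast(ubyte)(b >= 'a' && b <= 'z' ? b - 32 : b))
        .array();

    // Reinterpret the bytes as char[] for printing.
    return upper.assumeUTF;
}

void main()
{
    assert(asciiUpper("test string") == "TEST STRING");
}
```

Nothing here verifies that the input really is ASCII; as with the explicit cast, that responsibility stays with the caller.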

On Aug 22, 2012, at 5:07 PM, bearophile <bearophileHUGS@lycos.com> wrote:

> Jonathan M Davis:
>
>> and the fact that using ASCII really shouldn't be encouraged,
>
> For generic text I agree with you, using UTF-8 is safer and better.
> But there is plenty of scientific/technical text-encoded data that is in ASCII, and for both practical and performance reasons in D I want to process it as a sequence of chars (or a sequence of ubytes, as you say). So for some kinds of data that encouragement is a waste of your time.

I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?

On Wednesday, August 22, 2012 19:52:10 Sean Kelly wrote:
> I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?

Range-based functions will treat arrays of char or wchar as forward ranges of dchar. Because of the variable length of their code points, they aren't considered to have length, be random access, or have slicing, and will not generally work with range-based functions which require any of those operations (though some range-based functions do specialize on strings and use those operations where they can, based on proper understanding of Unicode).

On the other hand, if you have a string that specifically holds ASCII and you know that it only holds ASCII, you know that you can safely use length, random access, and slicing as if each code unit were a full code point. But the range-based functions don't know that your string is guaranteed to be ASCII-only, so they continue to treat it as a range of dchar rather than char. The solution is to either create a wrapper range whose element type is char or to cast the char[] to ubyte[]. And Bearophile wants such a wrapper range to be added to Phobos.

- Jonathan M Davis
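The range traits in std.range make this distinction visible at compile time. The following is a small sketch, assuming a Phobos version that still auto-decodes narrow strings:

```d
import std.range : ElementType, hasLength, hasSlicing,
    isRandomAccessRange, walkLength;

void main()
{
    // To range-based code, a string is a forward range of dchar
    // with no length, no indexing and no slicing.
    static assert(is(ElementType!string == dchar));
    static assert(!hasLength!string);
    static assert(!isRandomAccessRange!string);
    static assert(!hasSlicing!string);

    // A ubyte[] has all of those capabilities.
    static assert(is(ElementType!(ubyte[]) == ubyte));
    static assert(hasLength!(ubyte[]));
    static assert(isRandomAccessRange!(ubyte[]));
    static assert(hasSlicing!(ubyte[]));

    // Why length is hidden: code units and code points can differ.
    string s = "h\u00E9llo";      // U+00E9 takes two UTF-8 code units
    assert(s.length == 6);        // array length counts code units
    assert(s.walkLength == 5);    // range traversal counts code points
}
```

For pure ASCII input the two counts always agree, which is exactly the guarantee the range-based functions cannot see from the type alone.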

Sean Kelly:

> I'm clearly missing something. ASCII and UTF-8 are compatible.
> What's stopping you from just processing these as if they were UTF-8 strings?

std.algorithm is not closed (http://en.wikipedia.org/wiki/Closure_%28mathematics%29 ) on UTF-8; its operations lead to UTF-32.

Bye,
bearophile
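A short sketch of what "not closed" means in practice, again assuming a Phobos that auto-decodes narrow strings: an identity map over a UTF-8 string comes back as UTF-32, and returning to UTF-8 needs an explicit re-encoding step (here `std.conv.to`).

```d
import std.algorithm : map;
import std.array : array;
import std.conv : to;

void main()
{
    string s = "test string";           // UTF-8 input

    // The string is decoded element by element, so the result is UTF-32.
    auto s2 = s.map!(x => x).array();
    static assert(is(typeof(s2) == dchar[]));

    // Getting a string back requires re-encoding to UTF-8.
    string back = s2.to!string;
    assert(back == s);
}
```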

Jonathan M Davis:

> And Bearophile wants such a wrapper range to be added to Phobos.

I am just asking if there is interest in it, if people see something wrong in having it in Phobos. Surely I am not demanding it :-)

Bye,
bearophile

August 23, 2012

Re: Ascii matters

Posted by Sean Kelly

On Aug 22, 2012, at 8:03 PM, Jonathan M Davis <jmdavisProg@gmx.com> wrote:

> On Wednesday, August 22, 2012 19:52:10 Sean Kelly wrote:
>> I'm clearly missing something. ASCII and UTF-8 are compatible. What's stopping you from just processing these as if they were UTF-8 strings?
> 
> Range-based functions will treat arrays of char or wchar as forward ranges of dchar. Because of the variable length of their code points, they aren't considered to have length, be random access, or have slicing, and will not generally work with range-based functions which require any of those operations (though some range-based functions do specialize on strings and use those operations where they can, based on proper understanding of Unicode).

Yeah. I understand why the range-based functions use dchar, but for my own use I generally want to work directly with a char string of UTF-8 so I can slice buffers. Typing these as ubyte buffers isn't ideal, but it does work.

> On the other hand, if you have a string that specifically holds ASCII and you know that it only holds ASCII, you know that you can safely use length, random access, and slicing as if each code unit were a full code point. But the range-based functions don't know that your string is guaranteed to be ASCII-only, so they continue to treat it as a range of dchar rather than char. The solution is to either create a wrapper range whose element type is char or to cast the char[] to ubyte[]. And Bearophile wants such a wrapper range to be added to Phobos.

Gotcha.  Despite it being something I'd use regularly, I wouldn't want this in Phobos because it seems like it could cause maintenance problems.  I'd rather explicitly cast to ubyte as a way to flag that I was doing something potentially unsafe.

Sean Kelly:

> Gotcha. Despite it being something I'd use regularly, I wouldn't want this in Phobos because it seems like it could cause maintenance problems. I'd rather explicitly cast to ubyte as a way to flag that I was doing something potentially unsafe.

What's unsafe in what I have presented? The constructor verifies every char to be in 7 bits, and then you use the new type safely. No casts, and no need to flag something as unsafe.

This usage of types to denote capabilities is quite common in functional languages; see articles I've recently linked here, such as:

http://tomasp.net/blog/type-first-development.aspx

Bye,
bearophile
