April 26, 2012
Re: Notice/Warning on narrowStrings .length
"H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message 
news:mailman.2173.1335475413.4860.digitalmars-d@puremagic.com...
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
>> "James Miller" <james@aatch.net> wrote in message
>> news:qdgacdzxkhmhojqcettj@forum.dlang.org...
>> > I'm writing an introduction/tutorial to using strings in D, paying
>> > particular attention to the complexities of UTF-8 and 16. I realised
>> > that when you want the number of characters, you normally actually
>> > want to use walkLength, not length. Is it reasonable for the
>> > compiler to pick this up during semantic analysis and point out this
>> > situation?
>> >
>> > It's just a thought because a lot of the time, using length will get
>> > the right answer, but for the wrong reasons, resulting in lurking
>> > bugs. You can always cast to immutable(ubyte)[] or
>> > immutable(ushort)[] if you want to work with the actual bytes anyway.
>>
>> I find that most of the time I actually *do* want to use length. Don't
>> know if that's common, though, or if it's just a reflection of my
>> particular use-cases.
>>
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (ie, graphemes), but merely the
>> number of code points - which is not the same thing (due to existence
>> of the [confusingly-named] "combining characters").
> [...]
>
> And don't forget that some code points (notably from the CJK block) are
> specified as "double-width", so if you're trying to do text layout,
> you'll want yet a different length (layoutLength?).
>

Interesting. Kinda makes sense that such a thing exists, though: The CJK 
characters (even the relatively simple Japanese *kana*) are detailed enough 
that they need to be larger to achieve the same readability. And that's the 
*non*-double-width ones. So I don't doubt there are ones that need to be 
tagged as "Draw Extra Big!!" :)

For example, I have my font size in Windows Notepad set to a comfortable 
value. But when I want to use hiragana or katakana, I have to go into the 
settings and increase the font size so I can actually read it (Well, to what 
*little* extent I can even read it in the first place ;) ). And those kana 
tend to be among the simplest CJK characters.

(Don't worry - I only use Notepad as a quick-n-dirty scrap space, never for 
real coding/writing).

> So we really need all four lengths. Ain't unicode fun?! :-)
>

No kidding. The *one* thing I really, really hate about Unicode is the fact 
that most (if not all) of its complexity actually *is* necessary.

Unicode *itself* is indisputably necessary, but I do sure miss ASCII.

> Array length is simple.  walkLength is already implemented. Grapheme
> length requires recognition of 'combining characters' (or rather,
> ignoring said characters), and layout length requires recognizing
> widthless, single- and double-width characters.
>

Yup.
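
To make the first three of those concrete in D (caveat: the byGrapheme line 
is hypothetical -- std.uni doesn't actually expose grapheme iteration, so 
take it as a sketch of what such an API might look like):

    import std.range : walkLength;
    import std.stdio : writeln;
    import std.uni : byGrapheme; // hypothetical import, see caveat above

    void main()
    {
        // "noe\u0308l": 'e' followed by U+0308 COMBINING DIAERESIS
        string s = "noe\u0308l";
        writeln(s.length);                // 6: UTF-8 code units
        writeln(s.walkLength);            // 5: code points
        writeln(s.byGrapheme.walkLength); // 4: graphemes ("characters")
    }

(And layout length would be yet a fourth number once full-width characters 
enter the picture.)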

> I've been thinking about unicode processing recently. Traditionally, we
> have to decode narrow strings into UTF-32 (aka dchar) then do table
> lookups and such. But unicode encoding and properties, etc., are static
> information (at least within a single unicode release). So why bother
> with hardcoding tables and stuff at all?
>
> What we *really* should be doing, esp. for commonly-used functions like
> computing various lengths, is to automatically process said tables and
> encode the computation in finite-state machines that can then be
> optimized at the FSM level (there are known algos for generating optimal
> FSMs), codegen'd, and then optimized again at the assembly level by the
> compiler. These FSMs will operate at the native narrow string char type
> level, so that there will be no need for explicit decoding.
>
> The generation algo can then be run just once per unicode release, and
> everything will Just Work.
>

While I find that very interesting...I'm afraid I don't actually understand 
your suggestion :/ (I do understand FSMs and how they work, though) Could 
you give a little example of what you mean?
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Thu, Apr 26, 2012 at 06:13:00PM -0400, Nick Sabalausky wrote:
> "H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message 
> news:mailman.2173.1335475413.4860.digitalmars-d@puremagic.com...
[...]
> > And don't forget that some code points (notably from the CJK block)
> > are specified as "double-width", so if you're trying to do text
> > layout, you'll want yet a different length (layoutLength?).
> >

Correction: the official term for this is "full-width" (as opposed to
the "half-width" of the typical European scripts).


> Interesting. Kinda makes sense that such a thing exists, though: The CJK
> characters (even the relatively simple Japanese *kana*) are detailed
> enough that they need to be larger to achieve the same readability.
> And that's the *non*-double-width ones. So I don't doubt there are ones
> that need to be tagged as "Draw Extra Big!!" :)

Have you seen U+9F98? It's an insanely convoluted glyph composed of
*three copies* of an already extremely complex glyph.

	http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png

(And yes, that huge thing is supposed to fit inside a SINGLE
character... what *were* those ancient Chinese scribes thinking?!)


> For example, I have my font size in Windows Notepad set to a
> comfortable value. But when I want to use hiragana or katakana, I have
> to go into the settings and increase the font size so I can actually
> read it (Well, to what *little* extent I can even read it in the first
> place ;) ). And those kana tend to be among the simplest CJK
> characters.
> 
> (Don't worry - I only use Notepad as a quick-n-dirty scrap space,
> never for real coding/writing).

LOL... love the fact that you felt obligated to justify your use of
notepad. :-P


> > So we really need all four lengths. Ain't unicode fun?! :-)
> >
> 
> No kidding. The *one* thing I really, really hate about Unicode is the
> fact that most (if not all) of its complexity actually *is* necessary.

We're lucky the more imaginative scribes of the world have either been
dead for centuries or have restricted themselves to writing fictional
languages. :-) The inventions of the dead ones have been codified and
simplified by the unfortunate people who inherited their overly complex
systems (*cough*CJK glyphs*cough*), and the inventions of the living ones
are largely ignored by the world due to the fact that, well, their
scripts are only useful for writing fictional languages. :-)

So despite the fact that there is still some crazy convoluted stuff out
there, such as Arabic or Indic scripts with pair-wise substitution rules
in Unicode, overall things are relatively tame. At least the
subcomponents of CJK glyphs are no longer productive (actively being
used to compose new characters by script users) -- can you imagine the
insanity if Unicode had to support composition by those radicals and
subparts? Or if Unicode had to support a script like this one:

	http://www.arthaey.com/conlang/ashaille/writing/sarapin.html

whose components are graphically composed in, shall we say, entirely
non-trivial ways (see the composed samples at the bottom of the page)?


> Unicode *itself* is indisputably necessary, but I do sure miss ASCII.

In an ideal world, where memory is not an issue and bus width is
indefinitely wide, a Unicode string would simply be a sequence of
integers (of arbitrary size). Things like combining diacritics, etc.,
would have dedicated bits/digits for representing them, so there's no
need of the complexity of UTF-8, UTF-16, etc.. Everything fits into a
single character. Every possible combination of diacritics on every
possible character has a unique representation as a single integer.
String length would be equal to glyph count.

In such an ideal world, screens would also be of indefinitely detailed
resolution, so anything can fit inside a single grid cell, so there's no
need of half-width/double-width distinctions.  You could port ancient
ASCII-centric C code just by increasing sizeof(char), and things would
Just Work.

Yeah I know. Totally impossible. But one can dream, right? :-)


[...]
> > I've been thinking about unicode processing recently. Traditionally,
> > we have to decode narrow strings into UTF-32 (aka dchar) then do
> > table lookups and such. But unicode encoding and properties, etc.,
> > are static information (at least within a single unicode release).
> > So why bother with hardcoding tables and stuff at all?
> >
> > What we *really* should be doing, esp. for commonly-used functions
> > like computing various lengths, is to automatically process said
> > tables and encode the computation in finite-state machines that can
> > then be optimized at the FSM level (there are known algos for
> > generating optimal FSMs), codegen'd, and then optimized again at the
> > assembly level by the compiler. These FSMs will operate at the
> > native narrow string char type level, so that there will be no need
> > for explicit decoding.
> >
> > The generation algo can then be run just once per unicode release,
> > and everything will Just Work.
> >
> 
> While I find that very interesting...I'm afraid I don't actually
> understand your suggestion :/ (I do understand FSMs and how they
> work, though) Could you give a little example of what you mean?
[...]

Currently, std.uni code (argh the pun!!) is hand-written with tables of
which character belongs to which class, etc.. These hand-coded tables
are error-prone and unnecessary. For example, think of computing the
layout width of a UTF-8 stream. Why waste time decoding into dchar, and
then doing all sorts of table lookups to compute the width? Instead,
treat the stream as a byte stream, with certain sequences of bytes
evaluating to length 2, others to length 1, and yet others to length 0.

A lexer engine is perfectly suited for recognizing these kinds of
sequences with optimal speed. The only difference from a real lexer is
that instead of spitting out tokens, it keeps a running total (layout)
length, which is output at the end.

So what we should do is to write a tool that processes UnicodeData.txt (the
official table of character properties from the Unicode standard) and
generates lexer engines that compute various Unicode properties
(grapheme count, layout length, etc.) for each of the UTF encodings.

This way, we get optimal speed for these algorithms, plus we don't need
to manually maintain tables and stuff, we just run the tool on
UnicodeData.txt each time there's a new Unicode release, and the correct
code will be generated automatically.
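
To give a rough idea of the output's shape, here's a hand-written miniature 
of what a generated layout-width function might look like. (The byte classes 
below are grossly oversimplified -- it pretends every 3-byte sequence is 
full-width CJK and ignores zero-width combining marks; the real generator 
would derive the exact states from the Unicode data files.)

    // Sketch of generator output: accumulate layout width directly over
    // UTF-8 code units, with no decoding to dchar anywhere.
    size_t layoutWidth(const(char)[] s)
    {
        size_t width = 0, i = 0;
        while (i < s.length)
        {
            ubyte b = cast(ubyte) s[i];
            if (b < 0x80)      { width += 1; i += 1; } // ASCII: half-width
            else if (b < 0xC0) { i += 1; }             // stray continuation byte
            else if (b < 0xE0) { width += 1; i += 2; } // 2-byte seq: half-width
            else if (b < 0xF0) { width += 2; i += 3; } // 3-byte: pretend all CJK
            else               { width += 2; i += 4; } // 4-byte sequence
        }
        return width;
    }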


T

-- 
Public parking: euphemism for paid parking. -- Flora
April 27, 2012
Re: Notice/Warning on narrowStrings .length
"H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message 
news:mailman.2179.1335486409.4860.digitalmars-d@puremagic.com...
>
> Have you seen U+9F98? It's an insanely convoluted glyph composed of
> *three copies* of an already extremely complex glyph.
>
> http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png
>
> (And yes, that huge thing is supposed to fit inside a SINGLE
> character... what *were* those ancient Chinese scribes thinking?!)
>

Yikes!

>
>> For example, I have my font size in Windows Notepad set to a
>> comfortable value. But when I want to use hiragana or katakana, I have
>> to go into the settings and increase the font size so I can actually
>> read it (Well, to what *little* extent I can even read it in the first
>> place ;) ). And those kana tend to be among the simplest CJK
>> characters.
>>
>> (Don't worry - I only use Notepad as a quick-n-dirty scrap space,
>> never for real coding/writing).
>
> LOL... love the fact that you felt obligated to justify your use of
> notepad. :-P
>

Heh, any usage of Notepad *needs* to be justified. For example, it has an 
undo buffer of exactly ONE change. And the stupid thing doesn't even handle 
Unix-style newlines. *Everything* handles Unix-style newlines these days, 
even on Windows. Windows *BATCH* files even accept Unix-style newlines, for 
god's sakes! But not Notepad.

It is nice in its leanness and no-nonsense-ness. But it desperately needs 
some updates.

At least it actually supports Unicode though. (Which actually I find 
somewhat surprising.)

'Course, this is all XP. For all I know maybe they have finally updated it 
in MS OSX, erm, I mean Vista and Win7...

>
>> > So we really need all four lengths. Ain't unicode fun?! :-)
>> >
>>
>> No kidding. The *one* thing I really, really hate about Unicode is the
>> fact that most (if not all) of its complexity actually *is* necessary.
>
> We're lucky the more imaginative scribes of the world have either been
> dead for centuries or have restricted themselves to writing fictional
> languages. :-) The inventions of the dead ones have been codified and
> simplified by the unfortunate people who inherited their overly complex
> systems (*cough*CJK glyphs*cough*), and the inventions of the living ones
> are largely ignored by the world due to the fact that, well, their
> scripts are only useful for writing fictional languages. :-)
>
> So despite the fact that there is still some crazy convoluted stuff out
> there, such as Arabic or Indic scripts with pair-wise substitution rules
> in Unicode, overall things are relatively tame. At least the
> subcomponents of CJK glyphs are no longer productive (actively being
> used to compose new characters by script users) -- can you imagine the
> insanity if Unicode had to support composition by those radicals and
> subparts? Or if Unicode had to support a script like this one:
>
> http://www.arthaey.com/conlang/ashaille/writing/sarapin.html
>
> whose components are graphically composed in, shall we say, entirely
> non-trivial ways (see the composed samples at the bottom of the page)?
>

That's insane!

And yet, very very interesting...

>>
>> While I find that very interesting...I'm afraid I don't actually
>> understand your suggestion :/ (I do understand FSMs and how they
>> work, though) Could you give a little example of what you mean?
> [...]
>
> Currently, std.uni code (argh the pun!!)

Hah! :)

> is hand-written with tables of
> which character belongs to which class, etc.. These hand-coded tables
> are error-prone and unnecessary. For example, think of computing the
> layout width of a UTF-8 stream. Why waste time decoding into dchar, and
> then doing all sorts of table lookups to compute the width? Instead,
> treat the stream as a byte stream, with certain sequences of bytes
> evaluating to length 2, others to length 1, and yet others to length 0.
>
> A lexer engine is perfectly suited for recognizing these kinds of
> sequences with optimal speed. The only difference from a real lexer is
> that instead of spitting out tokens, it keeps a running total (layout)
> length, which is output at the end.
>
> So what we should do is to write a tool that processes UnicodeData.txt (the
> official table of character properties from the Unicode standard) and
> generates lexer engines that compute various Unicode properties
> (grapheme count, layout length, etc.) for each of the UTF encodings.
>
> This way, we get optimal speed for these algorithms, plus we don't need
> to manually maintain tables and stuff, we just run the tool on
> UnicodeData.txt each time there's a new Unicode release, and the correct
> code will be generated automatically.
>

I see. I think that's a very good observation, and a great suggestion. In 
fact, I'd imagine it'd be considerably simpler than a typical lexer 
generator. Much less of the fancy regexy-ness would be needed. Maybe put 
together a pull request if you get the time...?
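
(For the width case specifically, the raw data lives in EastAsianWidth.txt 
from the Unicode Character Database. Just as an untested sketch of how 
little machinery the generator itself would need, a dumb proof-of-concept 
could be about this dumb:)

    import std.array : split;
    import std.conv : to;
    import std.stdio : File, writefln, writeln;
    import std.string : indexOf, strip;

    // Reads EastAsianWidth.txt and prints a D function mapping a code
    // point to its layout width: 2 for Wide/Fullwidth, 1 otherwise.
    void main()
    {
        writeln("uint layoutWidth(dchar c) {");
        foreach (line; File("EastAsianWidth.txt").byLine)
        {
            auto raw = line.idup;             // byLine reuses its buffer
            auto hash = raw.indexOf('#');     // drop the trailing comment
            auto entry = (hash >= 0 ? raw[0 .. hash] : raw).strip;
            if (entry.length == 0) continue;
            auto fields = entry.split(";");   // e.g. "4E00..9FFF;W"
            auto prop = fields[1].strip;
            if (prop != "W" && prop != "F") continue; // keep Wide/Fullwidth
            auto bounds = fields[0].strip.split("..");
            auto lo = bounds[0].to!uint(16);
            auto hi = bounds.length > 1 ? bounds[1].to!uint(16) : lo;
            writefln("    if (c >= 0x%04X && c <= 0x%04X) return 2;", lo, hi);
        }
        writeln("    return 1;");
        writeln("}");
    }

(A real version would emit a compact table or your FSM instead of a 
mile-long if-chain, of course, but the generator itself really wouldn't need 
any regexy-ness at all.)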
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Thursday, April 26, 2012 17:26:40 H. S. Teoh wrote:
> Currently, std.uni code (argh the pun!!) is hand-written with tables of
> which character belongs to which class, etc.. These hand-coded tables
> are error-prone and unnecessary. For example, think of computing the
> layout width of a UTF-8 stream. Why waste time decoding into dchar, and
> then doing all sorts of table lookups to compute the width? Instead,
> treat the stream as a byte stream, with certain sequences of bytes
> evaluating to length 2, others to length 1, and yet others to length 0.
> 
> A lexer engine is perfectly suited for recognizing these kinds of
> sequences with optimal speed. The only difference from a real lexer is
> that instead of spitting out tokens, it keeps a running total (layout)
> length, which is output at the end.
> 
> So what we should do is to write a tool that processes UnicodeData.txt (the
> official table of character properties from the Unicode standard) and
> generates lexer engines that compute various Unicode properties
> (grapheme count, layout length, etc.) for each of the UTF encodings.
> 
> This way, we get optimal speed for these algorithms, plus we don't need
> to manually maintain tables and stuff, we just run the tool on
> UnicodeData.txt each time there's a new Unicode release, and the correct
> code will be generated automatically.

That's a fantastic idea! Of course, that leaves the job of implementing it... 
:)

- Jonathan M Davis
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
[...]
> Heh, any usage of Notepad *needs* to be justified. For example, it has an 
> undo buffer of exactly ONE change.

Don't laugh too hard. The original version of vi also had an undo buffer
of depth 1. In fact, one of the *current* vi's still only has an undo
buffer of depth 1. (Fortunately vim is much much saner.)


> And the stupid thing doesn't even handle Unix-style newlines.
> *Everything* handles Unix-style newlines these days, even on Windows.
> Windows *BATCH* files even accept Unix-style newlines, for 
> god's sakes! But not Notepad.
> 
> It is nice in its leanness and no-nonsense-ness. But it desperately needs 
> some updates.

Back in the day, my favorite editor ever was Norton Editor. It's tiny
(only about 50k or less, IIRC) yet had innovative (for its day)
features... like split pane editing, ^V which flips capitalization to
EOL (so a single function serves for both uppercasing and lowercasing,
and you just apply it twice to do a single word).  Unfortunately it's a
DOS-only program.  I think it works in the command prompt, but I've
never tested it (the modern windows command prompt is subtly different
from the old DOS command prompt, so things may not quite work as they
used to).

It's ironic how useless Notepad is compared to an ancient DOS program
from the dinosaur age.


> At least it actually supports Unicode though. (Which actually I find 
> somewhat surprising.)

Now in that, at least, it surpasses Norton Editor. :-) But had Norton
not been bought over by Symantec, we'd have a modern, much more powerful
version of NE today. But, oh well. Things have moved on. Vim beats the
crap out of NE, Notepad, and just about any GUI editor out there. It
also beats the snot out of emacs, but I don't want to start *that*
flamewar. :-P


[...]
> > http://www.arthaey.com/conlang/ashaille/writing/sarapin.html
> >
> > whose components are graphically composed in, shall we say, entirely
> > non-trivial ways (see the composed samples at the bottom of the
> > page)?
> >
> 
> That's insane!
> 
> And yet, very very interesting...

Here's more:

	http://www.omniglot.com/writing/conscripts2.htm

Imagine if some of the more complicated scripts there were actually used
in a real language, and Unicode had to support it...  Like this one:

	http://www.omniglot.com/writing/talisman.htm

Or, if you *really* wanna go all-out:

	http://www.omniglot.com/writing/ssioweluwur.php

(Check out the sample text near the bottom of the page and gape in
awe at what creative minds let loose can produce... and horror at the
prospect of Unicode being required to support it.)


[...]
> > Currently, std.uni code (argh the pun!!)
> 
> Hah! :)
> 
> > is hand-written with tables of which character belongs to which
> > class, etc.. These hand-coded tables are error-prone and
> > unnecessary. For example, think of computing the layout width of a
> > UTF-8 stream. Why waste time decoding into dchar, and then doing all
> > sorts of table lookups to compute the width? Instead, treat the
> > stream as a byte stream, with certain sequences of bytes evaluating
> > to length 2, others to length 1, and yet others to length 0.
> >
> > A lexer engine is perfectly suited for recognizing these kinds of
> > sequences with optimal speed. The only difference from a real lexer
> > is that instead of spitting out tokens, it keeps a running total
> > (layout) length, which is output at the end.
> >
> > So what we should do is to write a tool that processes UnicodeData.txt
> > (the official table of character properties from the Unicode
> > standard) and generates lexer engines that compute various Unicode
> > properties (grapheme count, layout length, etc.) for each of the UTF
> > encodings.
> >
> > This way, we get optimal speed for these algorithms, plus we don't
> > need to manually maintain tables and stuff, we just run the tool on
> > UnicodeData.txt each time there's a new Unicode release, and the correct
> > code will be generated automatically.
> >
> 
> I see. I think that's a very good observation, and a great suggestion.
> In fact, I'd imagine it'd be considerably simpler than a typical lexer
> generator. Much less of the fancy regexy-ness would be needed. Maybe
> put together a pull request if you get the time...?
[...]

When I get the time? Hah... I really need to get my lazy bum back to
working on the new AA implementation first. I think that would
contribute greater value than optimizing Unicode algorithms. :-) I was
hoping *somebody* would be inspired by my idea and run with it...


T

-- 
What do you mean the Internet isn't filled with subliminal messages? What about all those buttons marked "submit"??
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On 4/27/12, H. S. Teoh <hsteoh@quickfur.ath.cx> wrote:
> It's ironic how useless Notepad is compared to an ancient DOS program
> from the dinosaur age.

If you run "edit" in command prompt or the run dialog (well, assuming
you had a win32 box somewhere), you'd actually get a pretty decent
dos-based editor that is still better than Notepad. It has split
windows, a tab stop setting, and even a whole bunch of color settings.
:P
April 27, 2012
Re: Notice/Warning on narrowStrings .length
"H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message 
news:mailman.2182.1335490591.4860.digitalmars-d@puremagic.com...
>
> Now in that, at least, it surpasses Norton Editor. :-) But had Norton
> not been bought over by Symantec, we'd have a modern, much more powerful
> version of NE today. But, oh well. Things have moved on. Vim beats the
> crap out of NE, Notepad, and just about any GUI editor out there. It
> also beats the snot out of emacs, but I don't want to start *that*
> flamewar. :-P
>

"We didn't start that flamewar,
It was always burning,
Since the world's been turning..."

>
> Here's more:
>
> http://www.omniglot.com/writing/conscripts2.htm
>
> Imagine if some of the more complicated scripts there were actually used
> in a real language, and Unicode had to support it...  Like this one:
>
> http://www.omniglot.com/writing/talisman.htm
>
> Or, if you *really* wanna go all-out:
>
> http://www.omniglot.com/writing/ssioweluwur.php
>
> (Check out the sample text near the bottom of the page and gape in
> awe at what creative minds let loose can produce... and horror at the
> prospect of Unicode being required to support it.)
>

Crazy stuff! Some of them look rather similar to Arabic or Korean's Hangul 
(sp?), at least to my untrained eye. And then others are just *really* 
interesting-looking, like:

http://www.omniglot.com/writing/12480.htm
http://www.omniglot.com/writing/ayeri.htm
http://www.omniglot.com/writing/oxidilogi.htm

You're right though, if I were in charge of Unicode and tasked with handling 
some of those, I think I'd just say "Screw it. Unicode is now deprecated. 
Use ASCII instead. Doesn't have the characters for your language? Tough! Fix 
your language!" :)

>
> When I get the time? Hah... I really need to get my lazy bum back to
> working on the new AA implementation first. I think that would
> contribute greater value than optimizing Unicode algorithms. :-) I was
> hoping *somebody* would be inspired by my idea and run with it...
>

Heh, yea. It is a tempting project, but my plate's overflowing too. (Now if 
only I could make the same happen to my bank account...!)
April 27, 2012
Re: Notice/Warning on narrowStrings .length
"Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message 
news:mailman.2183.1335491333.4860.digitalmars-d@puremagic.com...
> On 4/27/12, H. S. Teoh <hsteoh@quickfur.ath.cx> wrote:
>> It's ironic how useless Notepad is compared to an ancient DOS program
>> from the dinosaur age.
>
> If you run "edit" in command prompt or the run dialog (well, assuming
> you had a win32 box somewhere), you'd actually get a pretty decent
> dos-based editor that is still better than Notepad. It has split
> windows, a tab stop setting, and even a whole bunch of color settings.
> :P

Heh, I remember that :)

Holy crap, even in XP, they updated it to use the Windows standard key 
combos for cut/copy/paste. I had no idea, all this time. Back in DOS, it 
used that old "Shift-Ins" stuff.
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Friday, 27 April 2012 at 01:35:26 UTC, H. S. Teoh wrote:
> When I get the time? Hah... I really need to get my lazy bum back to
> working on the new AA implementation first. I think that would
> contribute greater value than optimizing Unicode algorithms. :-) I was
> hoping *somebody* would be inspired by my idea and run with it...

I actually recently wrote a lexer generator for D that wouldn't 
be that hard to adapt to something like this.
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Friday, 27 April 2012 at 00:25:44 UTC, H. S. Teoh wrote:
> [...]
>
> So what we should do is to write a tool that processes UnicodeData.txt
> (the official table of character properties from the Unicode standard)
> and generates lexer engines that compute various Unicode properties
> (grapheme count, layout length, etc.) for each of the UTF encodings.
>
> This way, we get optimal speed for these algorithms, plus we don't need
> to manually maintain tables and stuff, we just run the tool on
> UnicodeData.txt each time there's a new Unicode release, and the
> correct code will be generated automatically.

I'm not sure if you or others knew (I didn't until just now, as there 
hasn't been an announcement), but one of the accepted GSOC projects, by 
Dmitry Olshansky, is extending Unicode support. Maybe take up this idea 
with him.

https://www.google-melange.com/gsoc/project/google/gsoc2012/dolsh/31002

Regards,
Brad Anderson