Thread overview
How to detect start of Unicode symbol and count amount of graphemes
Oct 05, 2014: Uranuz
Oct 05, 2014: monarch_dodra
Oct 05, 2014: Uranuz
Oct 05, 2014: Jacob Carlborg
Oct 06, 2014: Uranuz
Oct 06, 2014: ketmar
Oct 06, 2014: H. S. Teoh
Oct 07, 2014: Jacob Carlborg
Oct 07, 2014: H. S. Teoh
Oct 06, 2014: anonymous
Oct 06, 2014: Kagamin
Oct 06, 2014: Nicolas F.
October 05, 2014
I have a struct StringStream that I use to go through and parse an input string. The string can be of string, wstring or dstring type. I implement a function popChar that reads a code unit from the stream. I want to have a *debug* mode of the parser (via a CT switch), where I could get information about lineIndex, codeUnitIndex and graphemeIndex. So I don't want to use the *front* primitive, because it autodecodes everywhere, but I do want to get the index of the *user-perceived character* in debug mode (so decoding is needed there).

The question is: how do I detect that I have gone from one Unicode grapheme to another while iterating over a string, wstring or dstring by code unit? Is it simple, or would it amount to reimplementing a big piece of existing std library code?

As a result I should just increment an internal graphemeIndex.

A short version of the implementation I want follows:

struct StringStream(String)
{
   String str;
   size_t index;
   size_t graphemeIndex;

   auto popChar()
   {
      auto unit = str[index];
      index++;
      if( ??? ) // How to detect a new grapheme?
      {
         graphemeIndex++;
      }
      return unit;
   }
}

Sorry for the very simple question. I just have a mess in my head about Unicode and D strings.
October 05, 2014
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote:
> I have a struct StringStream that I use to go through and parse an input string. The string can be of string, wstring or dstring type. I implement a function popChar that reads a code unit from the stream. I want to have a *debug* mode of the parser (via a CT switch), where I could get information about lineIndex, codeUnitIndex and graphemeIndex. So I don't want to use the *front* primitive, because it autodecodes everywhere, but I do want to get the index of the *user-perceived character* in debug mode (so decoding is needed there).
>
> The question is: how do I detect that I have gone from one Unicode grapheme to another while iterating over a string, wstring or dstring by code unit? Is it simple, or would it amount to reimplementing a big piece of existing std library code?

You can use std.uni.byGrapheme to iterate by graphemes:
http://dlang.org/phobos/std_uni.html#.byGrapheme

AFAIK, graphemes are not "self-synchronizing", but code points are. You can pop code units until you reach the beginning of a new code point. From there, you can iterate by graphemes, though your first grapheme might be off.
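To make that concrete, here is a minimal, untested sketch of the resynchronize-then-count idea; the helper name graphemesFrom is made up for illustration, and it assumes std.uni.byGrapheme behaves as documented:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: skip UTF-8 continuation bytes (0b10xx_xxxx) to
// resynchronize at a code point boundary, then count graphemes from there.
size_t graphemesFrom(string s, size_t i)
{
    while (i < s.length && (s[i] & 0b1100_0000) == 0b1000_0000)
        i++;
    return s[i .. $].byGrapheme.walkLength;
}

unittest
{
    string s = "é!"; // 'é' is two UTF-8 code units: 0xC3 0xA9
    assert(graphemesFrom(s, 0) == 2); // "é" and "!"
    assert(graphemesFrom(s, 1) == 1); // resyncs past 0xA9, leaving "!"
}
```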
October 05, 2014
> You can use std.uni.byGrapheme to iterate by graphemes:
> http://dlang.org/phobos/std_uni.html#.byGrapheme
>
> AFAIK, graphemes are not "self synchronizing", but codepoints are. You can pop code units until you reach the beginning of a new codepoint. From there, you can iterate by graphemes, though your first grapheme might be off.

Maybe there is some way to just detect the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How can I check whether a byte is a continuation of the code for a single code point, or whether a new sequence has started?

October 05, 2014
On 2014-10-05 14:09, Uranuz wrote:

> Maybe there is some way to just detect the first code unit of a grapheme
> without the overhead of using the Grapheme struct? I just tried to check
> if ch < 128 (for UTF-8), but this doesn't work. How can I check whether a
> byte is a continuation of the code for a single code point, or whether a new sequence has started?

Have a look here [1]. For example, if you have a code point between U+0080 and U+07FF, you know that you need two bytes to encode it.

[1] http://en.wikipedia.org/wiki/UTF-8#Description
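A small sketch of the lead-byte table from [1]; the function name utf8SequenceLength is made up here, and the patterns are the standard UTF-8 lead-byte masks:

```d
// Expected sequence length encoded in a UTF-8 lead byte;
// continuation or invalid bytes yield 0.
uint utf8SequenceLength(ubyte b)
{
    if ((b & 0b1000_0000) == 0)           return 1; // ASCII, U+0000..U+007F
    if ((b & 0b1110_0000) == 0b1100_0000) return 2; // U+0080..U+07FF
    if ((b & 0b1111_0000) == 0b1110_0000) return 3; // U+0800..U+FFFF
    if ((b & 0b1111_1000) == 0b1111_0000) return 4; // U+10000..U+10FFFF
    return 0; // 0b10xx_xxxx continuation byte (or invalid)
}

unittest
{
    assert(utf8SequenceLength('A') == 1);
    assert(utf8SequenceLength(0xC3) == 2); // lead byte of 'é'
    assert(utf8SequenceLength(0xA9) == 0); // continuation byte
}
```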

-- 
/Jacob Carlborg
October 06, 2014
Unicode is hard to deal with properly, as how you deal with it is very context dependent.

One grapheme is a visible character and consists of one or more code points. One code point is one mapping of a numeric value to a meaning, and is encoded as one or more code units.

You do not want to deal with this yourself, as knowing which code points form graphemes is hard. Thankfully, std.uni exists. Specifically, look at decodeGrapheme: it pops one grapheme from an input range and returns it.

Never write code that deals with Unicode on the byte level. It will always be wrong.
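For example, a driving loop with decodeGrapheme might look like the following sketch; it assumes decodeGrapheme takes its range by ref and consumes one grapheme per call, as the std.uni documentation describes:

```d
import std.uni : decodeGrapheme, Grapheme;

unittest
{
    // decodeGrapheme pops one full grapheme off the front of the range.
    auto input = "a\u0301bc"; // 'a' + COMBINING ACUTE ACCENT, then 'b', 'c'
    size_t graphemeIndex;
    while (input.length)
    {
        Grapheme g = decodeGrapheme(input); // advances input past one grapheme
        graphemeIndex++;
    }
    assert(graphemeIndex == 3); // "á", "b", "c"
}
```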
October 06, 2014
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote:
> Maybe there is some way to just detect the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How can I check whether a byte is a continuation of the code for a single code point, or whether a new sequence has started?

Are you trying to split strings? If you want to optimize usage of graphemes, try checking whether the next 10 code units contain only ASCII symbols; when that check fails, fall back to graphemes.
October 06, 2014
>
> Have a look here [1]. For example, if you have a code point between U+0080 and U+07FF, you know that you need two bytes to encode it.
>
> [1] http://en.wikipedia.org/wiki/UTF-8#Description

Thanks. I already solved it myself for the UTF-8 encoding. I chose an approach using bitmasks. Maybe it is not the best for efficiency, but it works :)

( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000

If this is true, it means that the first byte of a sequence was found and I can count them. Am I right that this count equals the number of graphemes, or are there some exceptions to this rule?

For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?
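For UTF-16 specifically, a code unit begins a new code point unless it is a trailing (low) surrogate. A sketch, with the helper name isUtf16LeadUnit made up for illustration and assuming well-formed input:

```d
// In UTF-16, a code unit starts a new code point unless it is a
// trailing (low) surrogate in 0xDC00 .. 0xDFFF.
bool isUtf16LeadUnit(wchar u)
{
    return u < 0xDC00 || u > 0xDFFF;
}

unittest
{
    wstring s = "a\U0001F600"w; // 'a' plus one astral code point
    assert(s.length == 3);          // three UTF-16 code units
    assert(isUtf16LeadUnit(s[0]));  // 'a'
    assert(isUtf16LeadUnit(s[1]));  // high surrogate opens a pair
    assert(!isUtf16LeadUnit(s[2])); // low surrogate continues it
}
```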
October 06, 2014
On Mon, 06 Oct 2014 17:28:43 +0000
Uranuz via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com>
wrote:

> If this is true, it means that the first byte of a sequence was found and I can count them. Am I right that this count equals the number of graphemes, or are there some exceptions to this rule?
a lot. take for example RIGHT-TO-LEFT MARK, which is not a grapheme at all. and not a "composite" for that matter. ah, those joys of unicode!


October 06, 2014
On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
> >
> >Have a look here [1]. For example, if you have a code point between U+0080 and U+07FF, you know that you need two bytes to encode it.
> >
> >[1] http://en.wikipedia.org/wiki/UTF-8#Description
> 
> Thanks. I already solved it myself for the UTF-8 encoding. I chose an approach using bitmasks. Maybe it is not the best for efficiency, but it works :)
> 
> ( str[index] & 0b10000000 ) == 0 ||
> ( str[index] & 0b11100000 ) == 0b11000000 ||
> ( str[index] & 0b11110000 ) == 0b11100000 ||
> ( str[index] & 0b11111000 ) == 0b11110000
> 
> If this is true, it means that the first byte of a sequence was found and I can count them. Am I right that this count equals the number of graphemes, or are there some exceptions to this rule?
> 
> For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?

This looks wrong to me. Are you sure this finds *all* possible graphemes? Keep in mind that combining diacritic sequences are treated as a single grapheme; for example the sequence 'A' U+0301 U+0302 U+0303. There are several different codepoint ranges that have the combining diacritic property, and they are definitely more complicated than what you have here.

Furthermore, there are more complicated things like the Devanagari sequences (e.g., KA + VIRAMA + TA + VOWEL SIGN U), that your code certainly doesn't look like it would handle correctly.

As somebody else has said, it's generally a bad idea to work with Unicode byte sequences yourself, because Unicode is complicated, and many apparently-simple concepts actually require a lot of care to get right.


T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
October 06, 2014
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
> ( str[index] & 0b10000000 ) == 0 ||
> ( str[index] & 0b11100000 ) == 0b11000000 ||
> ( str[index] & 0b11110000 ) == 0b11100000 ||
> ( str[index] & 0b11111000 ) == 0b11110000
>
> If this is true, it means that the first byte of a sequence was found and I can count them. Am I right that this count equals the number of graphemes, or are there some exceptions to this rule?
>
> For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?

I think your idea of graphemes is off.

A grapheme is made up of one or more code points. This is the same for all UTF encodings. A code point is made of one or more code units: in UTF-8 between 1 and 4, in UTF-16 1 or 2, in UTF-32 always 1. A code unit is made up of a fixed number of bytes: UTF-8: 1, UTF-16: 2, UTF-32: 4.

So, the number of UTF-8 bytes in a sequence has no relation to graphemes. The number of leading ones in a UTF-8 start byte is equal to the total number of bytes in that sequence. I.e. when you see a 0b1110_0000 byte, the following two bytes should be continuation bytes (0b10xx_xxxx), and the three of them together encode a *code point*.

And in UTF-32, the number of code units is equal to the number of *code points*, not graphemes.
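The three levels can be seen directly in D; a small sketch using std.uni.byGrapheme and string autodecoding:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    string s = "A\u0301"; // 'A' + COMBINING ACUTE ACCENT
    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points (via autodecoding)
    assert(s.byGrapheme.walkLength == 1); // one user-perceived character
}
```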