Thread overview
How to detect start of Unicode symbol and count amount of graphemes
Oct 05, 2014: Uranuz
Oct 05, 2014: monarch_dodra
Oct 05, 2014: Uranuz
Oct 05, 2014: Jacob Carlborg
Oct 06, 2014: Uranuz
Oct 06, 2014: ketmar
Oct 06, 2014: H. S. Teoh
Oct 07, 2014: Jacob Carlborg
Oct 07, 2014: H. S. Teoh
Oct 06, 2014: anonymous
Oct 06, 2014: Kagamin
Oct 06, 2014: Nicolas F.
October 05, 2014
I have a struct StringStream that I use to go through and parse an input string. The string can be of string, wstring or dstring type. I implement a function popChar that reads a code unit from the stream. I want to have a *debug* mode of the parser (via a CT switch), where I could get information about lineIndex, codeUnitIndex and graphemeIndex. So I don't want to use the *front* primitive, because it autodecodes everywhere, but I do want to get the index of the *user-perceived character* in debug mode (so decoding is needed there).

The question is: how do I detect that I have gone from one Unicode grapheme to another while iterating over a string, wstring or dstring by code unit? Is it simple, or would it amount to reimplementing a big piece of existing std library code?

As a result I should just increment an internal graphemeIndex.

A short version of the implementation I want follows:

struct StringStream(String)
{
   String str;
   size_t index;
   size_t graphemeIndex;

   auto popChar()
   {
      auto unit = str[index];
      index++;
      if( ??? ) // How to detect a new grapheme?
      {
         graphemeIndex++;
      }
      return unit;
   }
}

Sorry for the very simple question. I just have a mess in my head about Unicode and D strings.
October 05, 2014
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote:
> I have a struct StringStream that I use to go through and parse an input string. The string can be of string, wstring or dstring type. I implement a function popChar that reads a code unit from the stream. I want to have a *debug* mode of the parser (via a CT switch), where I could get information about lineIndex, codeUnitIndex and graphemeIndex. So I don't want to use the *front* primitive, because it autodecodes everywhere, but I do want to get the index of the *user-perceived character* in debug mode (so decoding is needed there).
>
> The question is: how do I detect that I have gone from one Unicode grapheme to another while iterating over a string, wstring or dstring by code unit? Is it simple, or would it amount to reimplementing a big piece of existing std library code?

You can use std.uni.byGrapheme to iterate by graphemes:
http://dlang.org/phobos/std_uni.html#.byGrapheme

AFAIK, graphemes are not "self-synchronizing", but code points are. You can pop code units until you reach the beginning of a new code point. From there, you can iterate by graphemes, though your first grapheme might be off.
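To make that concrete, here is a minimal, untested sketch of the resynchronize-then-count idea; the helper name graphemesFrom is made up for illustration, and it assumes std.uni.byGrapheme behaves as documented:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: skip UTF-8 continuation bytes (0b10xx_xxxx) to
// resynchronize at a code point boundary, then count graphemes from there.
size_t graphemesFrom(string s, size_t i)
{
    while (i < s.length && (s[i] & 0b1100_0000) == 0b1000_0000)
        i++;
    return s[i .. $].byGrapheme.walkLength;
}

unittest
{
    string s = "é!"; // 'é' is two UTF-8 code units: 0xC3 0xA9
    assert(graphemesFrom(s, 0) == 2); // "é" and "!"
    assert(graphemesFrom(s, 1) == 1); // resyncs past 0xA9, leaving "!"
}
```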
October 05, 2014
> You can use std.uni.byGrapheme to iterate by graphemes:
> http://dlang.org/phobos/std_uni.html#.byGrapheme
>
> AFAIK, graphemes are not "self synchronizing", but codepoints are. You can pop code units until you reach the beginning of a new codepoint. From there, you can iterate by graphemes, though your first grapheme might be off.

Maybe there is some way to just detect the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How can I check whether a byte is a continuation of the code for a single code point, or whether a new sequence has started?

October 05, 2014
On 2014-10-05 14:09, Uranuz wrote:

> Maybe there is some way to just detect the first code unit of a grapheme
> without the overhead of using the Grapheme struct? I just tried to check
> if ch < 128 (for UTF-8), but this doesn't work. How can I check whether a
> byte is a continuation of the code for a single code point, or whether a new sequence has started?

Have a look here [1]. For example, if you have a code point between U+0080 and U+07FF, you know that you need two bytes to encode it.

[1] http://en.wikipedia.org/wiki/UTF-8#Description
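A small sketch of the lead-byte table from [1]; the function name utf8SequenceLength is made up here, and the patterns are the standard UTF-8 lead-byte masks:

```d
// Expected sequence length encoded in a UTF-8 lead byte;
// continuation or invalid bytes yield 0.
uint utf8SequenceLength(ubyte b)
{
    if ((b & 0b1000_0000) == 0)           return 1; // ASCII, U+0000..U+007F
    if ((b & 0b1110_0000) == 0b1100_0000) return 2; // U+0080..U+07FF
    if ((b & 0b1111_0000) == 0b1110_0000) return 3; // U+0800..U+FFFF
    if ((b & 0b1111_1000) == 0b1111_0000) return 4; // U+10000..U+10FFFF
    return 0; // 0b10xx_xxxx continuation byte (or invalid)
}

unittest
{
    assert(utf8SequenceLength('A') == 1);
    assert(utf8SequenceLength(0xC3) == 2); // lead byte of 'é'
    assert(utf8SequenceLength(0xA9) == 0); // continuation byte
}
```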

-- 
/Jacob Carlborg
October 06, 2014
Unicode is hard to deal with properly, as how you deal with it is very context dependent.

One grapheme is a visible character and consists of one or more code points. One code point is one mapping of a numeric value to a meaning, and is encoded as one or more code units.

You do not want to deal with this yourself, as knowing which code points form graphemes is hard. Thankfully, std.uni exists. Specifically, look at decodeGrapheme: it pops one grapheme from an input range and returns it.

Never write code that deals with Unicode on the byte level. It will always be wrong.
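For example, a driving loop with decodeGrapheme might look like the following sketch; it assumes decodeGrapheme takes its range by ref and consumes one grapheme per call, as the std.uni documentation describes:

```d
import std.uni : decodeGrapheme, Grapheme;

unittest
{
    // decodeGrapheme pops one full grapheme off the front of the range.
    auto input = "a\u0301bc"; // 'a' + COMBINING ACUTE ACCENT, then 'b', 'c'
    size_t graphemeIndex;
    while (input.length)
    {
        Grapheme g = decodeGrapheme(input); // advances input past one grapheme
        graphemeIndex++;
    }
    assert(graphemeIndex == 3); // "á", "b", "c"
}
```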
October 06, 2014
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote:
> Maybe there is some way to just detect the first code unit of a grapheme without the overhead of using the Grapheme struct? I just tried to check if ch < 128 (for UTF-8), but this doesn't work. How can I check whether a byte is a continuation of the code for a single code point, or whether a new sequence has started?

Are you trying to split strings? If you want to optimize usage of graphemes, try checking whether the next 10 code units contain only ASCII symbols; when that check fails, fall back to graphemes.
October 06, 2014
>
> Have a look here [1]. For example, if you have a code point between U+0080 and U+07FF, you know that you need two bytes to encode it.
>
> [1] http://en.wikipedia.org/wiki/UTF-8#Description

Thanks. I already solved it myself for the UTF-8 encoding. I chose an approach using bitmasks. Maybe it is not the best for efficiency, but it works :)

( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000

If this is true, it means that the first byte of a sequence was found and I can count them. Am I right that this count equals the number of graphemes, or are there some exceptions to this rule?

For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?
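For UTF-16 specifically, a code unit begins a new code point unless it is a trailing (low) surrogate. A sketch, with the helper name isUtf16LeadUnit made up for illustration and assuming well-formed input:

```d
// In UTF-16, a code unit starts a new code point unless it is a
// trailing (low) surrogate in 0xDC00 .. 0xDFFF.
bool isUtf16LeadUnit(wchar u)
{
    return u < 0xDC00 || u > 0xDFFF;
}

unittest
{
    wstring s = "a\U0001F600"w; // 'a' plus one astral code point
    assert(s.length == 3);          // three UTF-16 code units
    assert(isUtf16LeadUnit(s[0]));  // 'a'
    assert(isUtf16LeadUnit(s[1]));  // high surrogate opens a pair
    assert(!isUtf16LeadUnit(s[2])); // low surrogate continues it
}
```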
October 06, 2014
On Mon, 06 Oct 2014 17:28:43 +0000
Uranuz via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com>
wrote:

> If this is true, it means that the first byte of a sequence was found and I can count them. Am I right that this count equals the number of graphemes, or are there some exceptions to this rule?
a lot. take for example RIGHT-TO-LEFT MARK, which is not a grapheme at all. and not a "composite" for that matter. ah, those joys of unicode!


October 06, 2014
On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
> >
> >Have a look here [1]. For example, if you have a code point between U+0080 and U+07FF, you know that you need two bytes to encode it.
> >
> >[1] http://en.wikipedia.org/wiki/UTF-8#Description
> 
> Thanks. I already solved it myself for the UTF-8 encoding. I chose an approach using bitmasks. Maybe it is not the best for efficiency, but it works :)
> 
> ( str[index] & 0b10000000 ) == 0 ||
> ( str[index] & 0b11100000 ) == 0b11000000 ||
> ( str[index] & 0b11110000 ) == 0b11100000 ||
> ( str[index] & 0b11111000 ) == 0b11110000
> 
> If this is true, it means that the first byte of a sequence was found and I can count them. Am I right that this count equals the number of graphemes, or are there some exceptions to this rule?
> 
> For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?

This looks wrong to me. Are you sure this finds *all* possible graphemes? Keep in mind that combining diacritic sequences are treated as a single grapheme; for example the sequence 'A' U+0301 U+0302 U+0303. There are several different codepoint ranges that have the combining diacritic property, and they are definitely more complicated than what you have here.

Furthermore, there are more complicated things like the Devanagari sequences (e.g., KA + VIRAMA + TA + VOWEL SIGN U), that your code certainly doesn't look like it would handle correctly.

As somebody else has said, it's generally a bad idea to work with Unicode byte sequences yourself, because Unicode is complicated, and many apparently-simple concepts actually require a lot of care to get right.


T

-- 
It won't be covered in the book. The source code has to be useful for something, after all. -- Larry Wall
October 06, 2014
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
> ( str[index] & 0b10000000 ) == 0 ||
> ( str[index] & 0b11100000 ) == 0b11000000 ||
> ( str[index] & 0b11110000 ) == 0b11100000 ||
> ( str[index] & 0b11111000 ) == 0b11110000
>
> If this is true, it means that the first byte of a sequence was found and I can count them. Am I right that this count equals the number of graphemes, or are there some exceptions to this rule?
>
> For UTF-32 the number of code units is just equal to the number of graphemes. And what about UTF-16? Is it possible to detect the first code unit of an encoding sequence?

I think your idea of graphemes is off.

A grapheme is made up of one or more code points. This is the same for all UTF encodings. A code point is made of one or more code units: in UTF-8 between 1 and 4, in UTF-16 1 or 2, in UTF-32 always 1. A code unit is made up of a fixed number of bytes: UTF-8: 1, UTF-16: 2, UTF-32: 4.

So, the number of UTF-8 bytes in a sequence has no relation to graphemes. The number of leading ones in a UTF-8 start byte is equal to the total number of bytes in that sequence. I.e. when you see a 0b1110_0000 byte, the following two bytes should be continuation bytes (0b10xx_xxxx), and the three of them together encode a *code point*.

And in UTF-32, the number of code units is equal to the number of *code points*, not graphemes.
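The three levels can be seen directly in D; a small sketch using std.uni.byGrapheme and string autodecoding:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    string s = "A\u0301"; // 'A' + COMBINING ACUTE ACCENT
    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points (via autodecoding)
    assert(s.byGrapheme.walkLength == 1); // one user-perceived character
}
```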