April 06, 2006
Georg Wrede schrieb am 2006-04-06:
> Mike Capp wrote:
>> "Wasteful" is also relative. UTF-32 is certainly wasteful of memory space, but UTF-8 is potentially far more wasteful of CPU cycles and memory bandwidth.
>
> It sure looks like it. Then again, studying the UTF-8 spec, and "why we did it this way" (sorry, no URL here. Anybody?), shows that it actually is _amazingly_ light on CPU cycles! Really.

Have a look at the endcoding of Hangul(Korean) and polytonic Greek <g>

> (( I sure wish there was somebody in this NG who could write a Scientifically Valid test to compare the time needed to find the millionth character in UTF-8 vs. UTF-8 first converted to UTF-32. ))

Challenge:
Provide a D implementation that firsts converts to UTF-32 and has
shorter runtime than the code below:

# size_t codepoint_to_index(size_t codepoint_number, char[] data){
#	char* start = data.ptr;
#	char* end = start + data.length;
#	size_t index;
#
#	if(!data.length){
# insufficent_input:
#		throw new Exception("not enough input");
#	}
#
#	if(!codepoint_number){
#		return 0;
#	}
#
#	asm{
#			mov	EDX,	codepoint_number;
#			mov	ECX,	start;
#			mov	EBX,	end;
#
#		next_codepoint:
#			mov	AL,	[ECX];
#			inc	ECX;
#			sal	AL,	1;
#			jnc	end_of_codepoint;
#			sal	AL,	1;
#		inner_loop:
#			inc	ECX;
#			sal	AL,	1;
#			jc	inner_loop;
#
#		end_of_codepoint:
#			// array bounds
#			cmp	ECX,	EBX;
#			jnb	insufficent_input;
#
#			// the interresting codepoint?
#			dec	EDX;
#			jnz	next_codepoint;
#
#			// calculate index
#			mov	EBX,	start;
#			sub	ECX,	EBX;
#			mov	index,	ECX;
#	}
#
#	return index;
# }

Thomas


April 06, 2006
Jari-Matti wrote:
> That's very true. A "normal" hard drive reads 60 MB/s. So,
> reading a 4 MB file takes at least 66 ms and a 1 MB UTF-8-file (only
> ASCII-characters) is read in 17 ms (well, I'm a bit optimistic here :).
> A modern processor executes 3 000 000 000 operations in a
> second. Going through the UTF-8 stream takes 1 000 000 * 10 (perhaps?)
> operations and thus costs 3 ms. So it's actually faster to read UTF-8.

1) your sample: English (consider Chinese)
2) magic word: seek

Thomas


April 07, 2006
Thomas Kuehne wrote:
> Challenge:
> Provide a D implementation that firsts converts to UTF-32 and has
> shorter runtime than the code below:

I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().
April 07, 2006
Walter Bright wrote:
> Thomas Kuehne wrote:
>> Challenge:
>> Provide a D implementation that firsts converts to UTF-32 and has
>> shorter runtime than the code below:
> 
> I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().

I've been wondering about this.  Will 'stride' be accurate for any arbitrary string position or input data?  I would assume so, but don't know enough about how UTF-8 is structured to be sure.


Sean
April 07, 2006
Sean Kelly wrote:
> Walter Bright wrote:
>> Thomas Kuehne wrote:
>>> Challenge:
>>> Provide a D implementation that firsts converts to UTF-32 and has
>>> shorter runtime than the code below:
>>
>> I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().
> 
> I've been wondering about this.  Will 'stride' be accurate for any arbitrary string position or input data?  I would assume so, but don't know enough about how UTF-8 is structured to be sure.

UTF8stride[] will give 0xFF for values that are not at the beginning of a valid UTF-8 sequence.
April 07, 2006
Walter Bright wrote:
> Thomas Kuehne wrote:
> 
>> Challenge:
>> Provide a D implementation that firsts converts to UTF-32 and has
>> shorter runtime than the code below:
> 
> 
> I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().


It's not as simple as that any more. Lookup tables can sometimes cause more stalls that straight-line code, especially with designs such as the P4. Not to mention the possibility of a bit of cache-thrashing with other programs.

Thus, the lookup may be sub-optimal. Quite possibly less optimal.
April 07, 2006
Walter Bright wrote:
> Sean Kelly wrote:
>> Walter Bright wrote:
>>> Thomas Kuehne wrote:
>>>> Challenge:
>>>> Provide a D implementation that firsts converts to UTF-32 and has
>>>> shorter runtime than the code below:
>>>
>>> I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().
>>
>> I've been wondering about this.  Will 'stride' be accurate for any arbitrary string position or input data?  I would assume so, but don't know enough about how UTF-8 is structured to be sure.
> 
> UTF8stride[] will give 0xFF for values that are not at the beginning of a valid UTF-8 sequence.

Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character, as stride seems the best fit for an "is valid UTF-8 char" type function.  I've been giving the 0xFF choice some thought however, and while it would avoid stalling loops, the alternative is an access violation when evaluating short strings and just weird behavior for large strings.  If I had to track down a program bug I'd almost prefer it be a tight endless loop.


Sean
April 07, 2006
Sean Kelly wrote:
> Walter Bright wrote:
>> Sean Kelly wrote:
>>> Walter Bright wrote:
>>>> Thomas Kuehne wrote:
>>>>> Challenge:
>>>>> Provide a D implementation that firsts converts to UTF-32 and has
>>>>> shorter runtime than the code below:
>>>>
>>>> I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().
>>>
>>> I've been wondering about this.  Will 'stride' be accurate for any arbitrary string position or input data?  I would assume so, but don't know enough about how UTF-8 is structured to be sure.
>>
>> UTF8stride[] will give 0xFF for values that are not at the beginning of a valid UTF-8 sequence.
> 
> Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character, as stride seems the best fit for an "is valid UTF-8 char" type function.  I've been giving the 0xFF choice some thought however, and while it would avoid stalling loops, the alternative is an access violation when evaluating short strings and just weird behavior for large strings.  If I had to track down a program bug I'd almost prefer it be a tight endless loop.

Take a look at std.utf.toUTFindex(), which takes care of the problem (by throwing an exception).
April 07, 2006
Sean Kelly wrote:
> Walter Bright wrote:
>> Sean Kelly wrote:
>>> Walter Bright wrote:
>>>> Thomas Kuehne wrote:
>>>>
>>>>> Challenge:
>>>>> Provide a D implementation that firsts converts to UTF-32 and has
>>>>> shorter runtime than the code below:
>>>>
>>>> I don't know about that, but the code below isn't optimal <g>. Replace the sar's with a lookup of the 'stride' of the UTF-8 character (see std.utf.UTF8stride[]). An implementation is std.utf.toUTFindex().
>>>
>>> I've been wondering about this.  Will 'stride' be accurate for any arbitrary string position or input data?  I would assume so, but don't know enough about how UTF-8 is structured to be sure.
>>
>> UTF8stride[] will give 0xFF for values that are not at the beginning of a valid UTF-8 sequence.
> 
> Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make sure an odd combination of bytes couldn't be mistaken as a valid character, 

No fear. Any UTF-8 byte that belongs to a stride is clearly marked as such in the most significant bits. Thus, you can enter a byte[] at any place, and immediately know if it's (1) a single-byte character, (2) the first in a stride, or (3) within a stride. Without looking at any of the other bytes.

> as stride seems the best fit for an "is valid UTF-8 char" type function.  I've been giving the 0xFF choice some thought however, and while it would avoid stalling loops, the alternative is an access violation when evaluating short strings and just weird behavior for large strings.  If I had to track down a program bug I'd almost prefer it be a tight endless loop.

UTF-8 is precisely designed to be used in very tight ASM loops, that don't need a lookup table.
April 07, 2006
Georg Wrede wrote:

>>> For the general case, UTF-32 is a pretty wasteful Unicode encoding
>>> just to have that priviledge ?
>>
>> I'm not sure there is a "general case", so it's hard to say. Some
>> programmers have to deal with MBCS every day; others can go for years
>> without ever having to worry about anything but vanilla ASCII.
> 
> True!! Folks in Boise, Idaho, vs. folks in the non-British Europe or the Far East.

I don't think so. UTF-8 is good for us in "non-British" Europe, and
UTF-16 is good in the East. UTF-32 is good for... finding codepoints ?

As long as the "exceptions" (high code units) are taken care of, there is really no difference between the three (or five) - it's all Unicode.


I prefer UTF-8 - because it is ASCII-compatible and endian-independent,
but UTF-16 is not a bad choice if you handle a lot of non-ASCII chars.

Just as long as other layers play along with the embedded NULs, and you
have the proper BOM marks when storing it. It seemed to work for Java ?


The argument was just against *UTF-32* as a storage type, nothing more.
(As was rationalized in http://www.unicode.org/faq/utf_bom.html#UTF32)

--anders


PS.
Thought that having std UTF type aliases would have helped, but I dunno:

module std.stdutf;

/* UTF code units */

alias char   utf8_t; // UTF-8
alias wchar utf16_t; // UTF-16
alias dchar utf32_t; // UTF-32

It's a little confusing anyway, many "char*" routines don't accept UTF ?