Need to do some "dirty" UTF-8 handling - D Programming Language Discussion Forum

June 25, 2011

Need to do some "dirty" UTF-8 handling

Posted by Nick Sabalausky

Sometimes I need to bring data into a string, and need to be able to treat it as an actual "string", but don't actually care if the entire thing is technically valid UTF-8 or not, don't care if invalid bytes don't get preserved right, and can't have any utf exceptions being thrown regardless of the input. Yea, I know that's sloppy, but sometimes that's good enough and proper handling may be far more trouble than what's needed. (For example: Processing HTML from arbitrary URLs. It's pretty much guaranteed you'll come across stuff that's wrong or even has the encoding type improperly set. But it's usually more important for the process to succeed than for it to be perfectly accurate.)

Far as I can tell, this seems to currently be impossible with Phobos (unless you're *extremely* meticulous about watching what your entire codebase does with the data), which is a major pain when such a need arises.

Anyone have a good workaround? For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Vladimir Panteleev
in reply to Nick Sabalausky

On Sat, 25 Jun 2011 12:00:43 +0300, Nick Sabalausky <a@a.a> wrote:

> Anyone have a good workaround? For instance, maybe a function that'll take
> in a byte array and convert *all* invalid UTF-8 sequences to a user-selected
> valid character?

I tend to do this a lot, for various reasons. In my experience, a large share of the string-handling functions in Phobos work just fine with strings containing invalid UTF-8; you can generally use your intuition about whether a function needs to look at individual characters inside the string. Note, though, that there's currently a bug in D2/Phobos (6064) which causes std.array.join (and possibly other functions) to treat strings as something that cannot be joined by concatenation, and to do a character-by-character copy instead (which is both needlessly inefficient and will choke on invalid UTF-8).

When I really need to pass arbitrary data through string-handling functions, I use these functions:

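import std.utf : toUTF8;

/// Convert any data to valid UTF-8 by treating each byte as a code point
/// (U+0000..U+00FF), so D's string functions can properly work on it.
string rawToUTF8(string s)
{
	dstring d;
	foreach (char c; s)  // iterates code units, so nothing is decoded here
		d ~= c;
	return toUTF8(d);
}

/// Inverse of rawToUTF8: turn each code point back into a single byte.
string UTF8ToRaw(string r)
{
	string s;
	foreach (dchar c; r)  // decodes, so r must be valid UTF-8 (e.g. rawToUTF8 output)
	{
		assert(c < '\u0100');
		s ~= cast(char) c;  // one raw byte; a bare `s ~= c` would re-encode c as UTF-8
	}
	return s;
}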

( from https://github.com/CyberShadow/Team15/blob/master/Utils.d#L514 )

Of course, it would be nice if it'd be possible to only convert INVALID UTF-8 sequences. According to Wikipedia, the invalid Unicode code points U+DC80..U+DCFF are often used for encoding invalid byte sequences. I'd guess that a proper implementation will need to guarantee that a roundtrip will always return the same data as the input, so it'd have to "escape" the invalid code points used for escaping as well.

-- 
Best regards,
 Vladimir                            mailto:vladimir@thecybershadow.net

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Jonathan M Davis
in reply to Nick Sabalausky

On 2011-06-25 02:00, Nick Sabalausky wrote:
> Sometimes I need to bring data into a string, and need to be able to treat it as an actual "string", but don't actually care if the entire thing is technically valid UTF-8 or not, don't care if invalid bytes don't get preserved right, and can't have any utf exceptions being thrown regardless of the input. Yea, I know that's sloppy, but sometimes that's good enough and proper handling may be far more trouble than what's needed. (For example: Processing HTML from arbitrary URLs. It's pretty much guaranteed you'll come across stuff that's wrong or even has the encoding type improperly set. But it's usually more important for the process to succeed than for it to be perfectly accurate.)
> 
> Far as I can tell, this seems to currently be impossible with Phobos (unless you're *extremely* meticulous about watching what your entire codebase does with the data), which is a major pain when such a need arises.
> 
> Anyone have a good workaround? For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?

Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually treats it as a string instead of an array of bytes _must_ treat it as UTF-8 since it has to decode to determine what the characters are. So, I don't think that there's really any way around that. A string must be valid UTF-8. But if you really don't care about the string's contents, then you can just cast it to an array of ubyte and plenty of functions will work with it - nothing terribly string specific of course, but I don't see how you could possibly expect to do much string-specific with invalid data anyway.

- Jonathan M Davis
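
For what it's worth, here is a minimal sketch of the byte-array approach described above. It assumes a current Phobos in which `std.string.representation` is available (it may not have existed in this form in 2011), and the helper name `containsBytes` is made up for illustration. The idea is to drop down to bytes only at the point of use, so the rest of the code can keep passing `string` around.

```d
import std.algorithm.searching : canFind;
import std.string : representation;

/// Hypothetical helper: byte-wise substring test. Both arguments are viewed
/// as immutable(ubyte)[], so no UTF-8 decoding takes place.
bool containsBytes(string haystack, string needle)
{
    // representation() reinterprets the string's bytes without copying.
    return haystack.representation.canFind(needle.representation);
}
```

This only helps for operations that make sense on raw bytes (searching, comparing, slicing); anything that needs actual characters still has to decode.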

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Andrej Mitrovic

I've had a similar requirement some time ago. I've had to copy and modify the phobos function std.utf.decode for a custom text editor because the function throws when it finds an invalid code point. This is way too slow for my needs. I'm actually displaying invalid code points with special marks (just like Scintilla), so I need decoding to work as fast as possible.

The new function simply replaces throwing exceptions with flagging a boolean.
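
A rough sketch of what such a "flag instead of throw" decoder might look like. This is not the poster's code and not the Phobos implementation; the name `decodeLenient`, the `ok` parameter, and the choice to consume exactly one byte and return U+FFFD on bad input are all assumptions made for illustration.

```d
/// Hypothetical lenient decoder: reads one code point from s starting at i.
/// On malformed input it leaves ok == false, consumes a single byte and
/// returns U+FFFD, so the caller can mark that byte however it likes.
dchar decodeLenient(const(char)[] s, ref size_t i, out bool ok) nothrow
{
    // Smallest code point each sequence length may encode (rejects overlong forms).
    static immutable uint[5] minValue = [0, 0, 0x80, 0x800, 0x1_0000];

    immutable uint b0 = s[i];
    size_t len = 0; // stays 0 for an invalid lead byte
    uint c = 0;
    if (b0 < 0x80)                { len = 1; c = b0; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; c = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; c = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; c = b0 & 0x07; }

    bool good = len != 0 && i + len <= s.length;
    if (good)
    {
        foreach (k; 1 .. len)
        {
            immutable uint b = s[i + k];
            if ((b & 0xC0) != 0x80) { good = false; break; } // not a continuation byte
            c = (c << 6) | (b & 0x3F);
        }
    }
    // Reject overlong encodings, surrogates and values beyond U+10FFFF.
    if (good && (c < minValue[len] || c > 0x10_FFFF || (c >= 0xD800 && c <= 0xDFFF)))
        good = false;

    if (!good)
    {
        i += 1;          // ok was already reset to false by `out`
        return '\uFFFD';
    }
    ok = true;
    i += len;
    return cast(dchar) c;
}
```

Since nothing is thrown, the cost of a bad byte is about the same as the cost of a good one, which is the property the editor use case needs.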

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Nick Sabalausky
in reply to Vladimir Panteleev

"Vladimir Panteleev" <vladimir@thecybershadow.net> wrote in message news:op.vxmuvzqbtuzx1w@cybershadow.mshome.net...
>
> string s;
> foreach (dchar c; r)

That doesn't throw on an invalid sequence?

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Nick Sabalausky
in reply to Jonathan M Davis

"Jonathan M Davis" <jmdavisProg@gmx.com> wrote in message news:mailman.1214.1309008317.14074.digitalmars-d-learn@puremagic.com...
> On 2011-06-25 02:00, Nick Sabalausky wrote:
>> Sometimes I need to bring data into a string, and need to be able to treat it as an actual "string", but don't actually care if the entire thing is technically valid UTF-8 or not, don't care if invalid bytes don't get preserved right, and can't have any utf exceptions being thrown regardless of the input. Yea, I know that's sloppy, but sometimes that's good enough and proper handling may be far more trouble than what's needed. (For example: Processing HTML from arbitrary URLs. It's pretty much guaranteed you'll come across stuff that's wrong or even has the encoding type improperly set. But it's usually more important for the process to succeed than for it to be perfectly accurate.)
>>
>> Far as I can tell, this seems to currently be impossible with Phobos (unless you're *extremely* meticulous about watching what your entire codebase does with the data), which is a major pain when such a need arises.
>>
>> Anyone have a good workaround? For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?
>
> Convert it to a ubyte[] (or immutable(ubyte)[])? Anything that actually treats it as a string instead of an array of bytes _must_ treat it as UTF-8 since it has to decode to determine what the characters are. So, I don't think that there's really any way around that. A string must be valid UTF-8. But if you really don't care about the string's contents, then you can just cast it to an array of ubyte and plenty of functions will work with it - nothing terribly string specific of course, but I don't see how you could possibly expect to do much string-specific with invalid data anyway.
>

Using immutable(ubyte)[] just causes an enormous amount of type-related problems, largely involving the need to throw around a bunch of casts absolutely everywhere, including every single time any of the byte arrays needs to come in contact with an actual string (for instance, a string literal, for comparing, searching or anything else). It might be the "correct" thing, but in many cases (anything that doesn't need to be perfect, or can't realistically be perfect) it's far more trouble than it's actually worth.

Like I said, "For instance, maybe a function that'll take in a byte array and convert *all* invalid UTF-8 sequences to a user-selected valid character?" In such a case, *there would be no invalid data* in the actual string.

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Nick Sabalausky
in reply to Andrej Mitrovic

"Andrej Mitrovic" <andrej.mitrovich@gmail.com> wrote in message news:mailman.1215.1309019944.14074.digitalmars-d-learn@puremagic.com...
> I've had a similar requirement some time ago. I've had to copy and modify the phobos function std.utf.decode for a custom text editor because the function throws when it finds an invalid code point. This is way too slow for my needs. I'm actually displaying invalid code points with special marks (just like Scintilla), so I need decoding to work as fast as possible.
>
> The new function simply replaces throwing exceptions with flagging a boolean.

I think I may end up doing something like that :/

I was hoping to be able to do something vaguely sensible like this:

string newStr;
foreach(dchar dc; str)
{
    if(isValidDchar(dc))
        newStr ~= dc;
    else
        newStr ~= 'X';
}
str = newStr;

But that just blows up in my face.
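
The loop above most likely fails because `foreach (dchar dc; str)` decodes the string implicitly and throws on the first malformed sequence, before `isValidDchar` ever sees anything. Here is a minimal sketch of the same idea that decodes one sequence at a time and catches the failure instead. It assumes a current Phobos, where the exception type is `std.utf.UTFException`; the function name `replaceInvalid` and its default replacement are made up for illustration.

```d
import std.utf : decode, UTFException;

/// Hypothetical helper: copies valid UTF-8 sequences through and substitutes
/// `replacement` (which must itself be a valid code point) for each bad byte.
string replaceInvalid(const(ubyte)[] data, dchar replacement = 'X')
{
    auto s = cast(const(char)[]) data; // reinterpret only, nothing is validated yet
    char[] result;
    size_t i = 0;
    while (i < s.length)
    {
        immutable start = i;
        try
        {
            // decode() advances i past one sequence; appending the resulting
            // dchar to a char[] re-encodes it as UTF-8.
            result ~= decode(s, i);
        }
        catch (UTFException e)
        {
            i = start + 1;         // skip one offending byte and resynchronise
            result ~= replacement;
        }
    }
    return result.idup;
}
```

This relies on exceptions for the error path, so it is probably fine for occasional bad bytes in scraped HTML but a poor fit for input that is mostly garbage.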

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Dmitry Olshansky
in reply to Nick Sabalausky

On 26.06.2011 1:49, Nick Sabalausky wrote:
> "Andrej Mitrovic"<andrej.mitrovich@gmail.com>  wrote in message
> news:mailman.1215.1309019944.14074.digitalmars-d-learn@puremagic.com...
>> I've had a similar requirement some time ago. I've had to copy and
>> modify the phobos function std.utf.decode for a custom text editor
>> because the function throws when it finds an invalid code point. This
>> is way too slow for my needs. I'm actually displaying invalid code
>> points with special marks (just like Scintilla), so I need decoding to
>> work as fast as possible.
>>
>> The new function simply replaces throwing exceptions with flagging a
>> boolean.
> I think I may end up doing something like that :/
>
> I was hoping to be able to do something vaguely sensible like this:
>
> string newStr;
> foreach(dchar dc; str)
> {
>      if(isValidDchar(dc))
>          newStr ~= dc;
>      else
>          newStr ~= 'X';
> }
> str = newStr;
>
> But that just blows up in my face.
>
>
std.encoding to the rescue?
It looks like a well established module that was forgotten for some reason.

And here I'm wondering what a function named sanitize could do :)

-- 
Dmitry Olshansky
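
For reference, a minimal sketch of how `std.encoding.sanitize` might be used for this, assuming the module is still present in the Phobos being used. The wrapper name `cleanBytes` is made up for illustration; `sanitize` and `isValid` are the module's own functions.

```d
import std.encoding : isValid, sanitize;

/// Hypothetical wrapper: reinterpret raw bytes as a string, then let
/// std.encoding.sanitize replace any malformed sequences.
string cleanBytes(immutable(ubyte)[] raw)
{
    string dirty = cast(string) raw; // no validation happens here
    return sanitize(dirty);          // documented to return a valid string
}
```

As far as I can tell, the replacement sequence is chosen by the module (U+FFFD for UTF-8) rather than by the caller, so it does not cover the "user-selected character" part of the original request.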

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Jonathan M Davis
in reply to Dmitry Olshansky

On 2011-06-25 15:17, Dmitry Olshansky wrote:
> On 26.06.2011 1:49, Nick Sabalausky wrote:
> > "Andrej Mitrovic"<andrej.mitrovich@gmail.com>  wrote in message news:mailman.1215.1309019944.14074.digitalmars-d-learn@puremagic.com...
> > 
> >> I've had a similar requirement some time ago. I've had to copy and modify the phobos function std.utf.decode for a custom text editor because the function throws when it finds an invalid code point. This is way too slow for my needs. I'm actually displaying invalid code points with special marks (just like Scintilla), so I need decoding to work as fast as possible.
> >> 
> >> The new function simply replaces throwing exceptions with flagging a boolean.
> > 
> > I think I may end up doing something like that :/
> > 
> > I was hoping to be able to do something vaguely sensible like this:
> > 
> > string newStr;
> > foreach(dchar dc; str)
> > {
> >      if(isValidDchar(dc))
> >          newStr ~= dc;
> >      else
> >          newStr ~= 'X';
> > }
> > str = newStr;
> > 
> > But that just blows up in my face.
> 
> std.encoding to the rescue?
> It looks like a well established module that was forgotten for some reason.

It's also likely going away. It was an experiment of sorts which Andrei considers a failure. We need something to replace it, but as I understand it, it doesn't solve all of the problems that it's supposed to, and those it does solve, it doesn't necessarily solve in the best way. So, an improved replacement is going to need to be devised, but I wouldn't expect std.encoding to stick around in the long run.

- Jonathan M Davis

June 25, 2011

Re: Need to do some "dirty" UTF-8 handling

Posted by Nick Sabalausky
in reply to Dmitry Olshansky

"Dmitry Olshansky" <dmitry.olsh@gmail.com> wrote in message news:iu5n32$2vjd$1@digitalmars.com...
> On 26.06.2011 1:49, Nick Sabalausky wrote:
>> "Andrej Mitrovic"<andrej.mitrovich@gmail.com>  wrote in message news:mailman.1215.1309019944.14074.digitalmars-d-learn@puremagic.com...
>>> I've had a similar requirement some time ago. I've had to copy and modify the phobos function std.utf.decode for a custom text editor because the function throws when it finds an invalid code point. This is way too slow for my needs. I'm actually displaying invalid code points with special marks (just like Scintilla), so I need decoding to work as fast as possible.
>>>
>>> The new function simply replaces throwing exceptions with flagging a boolean.
>> I think I may end up doing something like that :/
>>
>> I was hoping to be able to do something vaguely sensible like this:
>>
>> string newStr;
>> foreach(dchar dc; str)
>> {
>>      if(isValidDchar(dc))
>>          newStr ~= dc;
>>      else
>>          newStr ~= 'X';
>> }
>> str = newStr;
>>
>> But that just blows up in my face.
>>
>>
> std.encoding to the rescue?
> It looks like a well established module that was forgotten for some
> reason.
>
> And here I'm wondering what a function named sanitize could do :)
>

Ahh, I didn't even notice that module.

Even if it's imperfect and goes away, it looks like it'll at least get the job done for me. And the encoding conversions should even give me an easy way to save at least some of the invalid chars (which wasn't really a requirement of mine, but it'll still be nice).

Copyright © 1999-2021 by the D Language Foundation