tolf and detab

I wrote these two trivial utilities for the purpose of canonicalizing source code before checkins and to deal with FreeBSD's inability to deal with CRLF line endings, and because I can never figure out the right settings for git to make it do the canonicalization.

tolf - converts LF, CR, and CRLF line endings to LF.

detab - converts all tabs to the correct number of spaces. Assumes tabs are 8 column tabs. Removes trailing whitespace from lines.

Posted here just in case someone wonders what they are.
---------------------------------------------------------
/* Replace tabs with spaces, and remove trailing whitespace from lines.
 */

import std.file;
import std.path;

int main(string[] args)
{
    foreach (f; args[1 .. $])
    {
        auto input = cast(char[]) std.file.read(f);
        auto output = filter(input);
        if (output != input)
            std.file.write(f, output);
    }
    return 0;
}


char[] filter(char[] input)
{
    char[] output;
    size_t j;

    int column;
    for (size_t i = 0; i < input.length; i++)
    {
        auto c = input[i];

        switch (c)
        {
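            case '\t':
                // Pad with spaces up to the column just before the next
                // 8-column tab stop; the tab itself then becomes the last space.
                while ((column & 7) != 7)
                {   output ~= ' ';
                    j++;
                    column++;
                }
                c = ' ';
                column++;
                break;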

            case '\r':
            case '\n':
                while (j && output[j - 1] == ' ')
                    j--;
                output = output[0 .. j];
                column = 0;
                break;

            default:
                column++;
                break;
        }
        output ~= c;
        j++;
    }
    while (j && output[j - 1] == ' ')
        j--;
    return output[0 .. j];
}
-----------------------------------------------------
/* Replace line endings with LF
 */

import std.file;
import std.path;

int main(string[] args)
{
    foreach (f; args[1 .. $])
    {
        auto input = cast(char[]) std.file.read(f);
        auto output = filter(input);
        if (output != input)
            std.file.write(f, output);
    }
    return 0;
}


char[] filter(char[] input)
{
    char[] output;
    size_t j;

    for (size_t i = 0; i < input.length; i++)
    {
        auto c = input[i];

        switch (c)
        {
            case '\r':
                c = '\n';
                break;

            case '\n':
                if (i && input[i - 1] == '\r')
                    continue;
                break;

            case 0:
                // Drop embedded NUL bytes as well.
                continue;

            default:
                break;
        }
        output ~= c;
        j++;
    }
    return output[0 .. j];
}
------------------------------------------

August 07, 2010

Re: tolf and detab

Posted by Andrei Alexandrescu
in reply to Walter Bright

On 08/06/2010 08:34 PM, Walter Bright wrote:
> I wrote these two trivial utilities for the purpose of canonicalizing
> source code before checkins and to deal with FreeBSD's inability to deal
> with CRLF line endings, and because I can never figure out the right
> settings for git to make it do the canonicalization.
>
> tolf - converts LF, CR, and CRLF line endings to LF.
>
> detab - converts all tabs to the correct number of spaces. Assumes tabs
> are 8 column tabs. Removes trailing whitespace from lines.
>
> Posted here just in case someone wonders what they are.
[snip]

Nice, though they don't account for multiline string literals.
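To make the concern concrete, here is a minimal sketch, assuming a purely line-based cleaner (the `stripTrailing` helper is hypothetical and only stands in for the trailing-whitespace pass in detab). Because it knows nothing about D's lexical structure, it also edits the inside of a string literal that spans lines:

```d
import std.algorithm.iteration : map;
import std.array : join;
import std.string : splitLines, stripRight;

// Hypothetical line-based cleaner: strips trailing whitespace from every
// line, with no knowledge of where string literals begin or end.
string stripTrailing(string source)
{
    return source.splitLines().map!(l => l.stripRight()).join("\n");
}

void main()
{
    // The literal deliberately ends its first line with two spaces.
    string source = "auto s = \"first  \nsecond\";";
    string cleaned = stripTrailing(source);

    // The two spaces inside the literal are gone, so the value of `s`
    // in the cleaned program differs from the original.
    assert(cleaned == "auto s = \"first\nsecond\";");
    assert(cleaned != source);
}
```

Handling this properly would presumably need at least a small lexer that tracks whether the current position is inside a literal.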

A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences.


Andrei

August 07, 2010

Re: tolf and detab

Posted by Andrej Mitrovic
in reply to Andrei Alexandrescu

Or improve your google-fu by finding some existing tools that do the job right. :)

I'm pretty sure Uncrustify is good at most of these issues, not to mention it's a very nice source-code "prettifier/indenter". There's a front-end called UniversalIndentGUI, which integrates about a dozen source-code prettifiers (including Uncrustify) covering many languages. It has various settings on the left and a toggleable *Live* preview on the right.

I invite you guys to try it out sometime:

http://universalindent.sourceforge.net/

(+ you can save different settings which is neat when you're coding for different projects that have different "code design & look" standards)

On Sat, Aug 7, 2010 at 3:50 AM, Andrei Alexandrescu < SeeWebsiteForEmail@erdani.org> wrote:

> On 08/06/2010 08:34 PM, Walter Bright wrote:
>
>> I wrote these two trivial utilities for the purpose of canonicalizing source code before checkins and to deal with FreeBSD's inability to deal with CRLF line endings, and because I can never figure out the right settings for git to make it do the canonicalization.
>>
>> tolf - converts LF, CR, and CRLF line endings to LF.
>>
>> detab - converts all tabs to the correct number of spaces. Assumes tabs are 8 column tabs. Removes trailing whitespace from lines.
>>
>> Posted here just in case someone wonders what they are.
>>
> [snip]
>
> Nice, though they don't account for multiline string literals.
>
> A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences.
>
>
> Andrei
>

August 07, 2010

Re: tolf and detab

Posted by Walter Bright
in reply to Andrej Mitrovic

Andrej Mitrovic wrote:
> Or improve your google-fu by finding some existing tools that do the job right. :)

Sure, but I suspect it's faster to write the utility! After all, they are trivial.

August 07, 2010

Re: tolf and detab

Posted by Walter Bright
in reply to Andrei Alexandrescu

Andrei Alexandrescu wrote:
> A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences.

Some D2-fu would be cool. Any takers?

August 07, 2010

Re: tolf and detab

Posted by Yao G.
in reply to Andrei Alexandrescu

What does idiomatic D mean?

On Fri, 06 Aug 2010 20:50:52 -0500, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:

> On 08/06/2010 08:34 PM, Walter Bright wrote:
>> I wrote these two trivial utilities for the purpose of canonicalizing
>> source code before checkins and to deal with FreeBSD's inability to deal
>> with CRLF line endings, and because I can never figure out the right
>> settings for git to make it do the canonicalization.
>>
>> tolf - converts LF, CR, and CRLF line endings to LF.
>>
>> detab - converts all tabs to the correct number of spaces. Assumes tabs
>> are 8 column tabs. Removes trailing whitespace from lines.
>>
>> Posted here just in case someone wonders what they are.
> [snip]
>
> Nice, though they don't account for multiline string literals.
>
> A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences.
>
>
> Andrei



August 07, 2010

Re: tolf and detab

Posted by Andrei Alexandrescu
in reply to Yao G.

On 08/06/2010 09:33 PM, Yao G. wrote:
> What does idiomatic D mean?

At a quick glance - I'm thinking two elements would be using string and possibly byLine.
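As a rough illustration of those two elements, here is a minimal sketch, assuming the goal is only to strip trailing whitespace (the `stripTrailingByLine` helper and the temporary file name are made up for this example). It returns an immutable `string` and reads the input with `File.byLine`:

```d
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File;
import std.string : stripRight;

// Hypothetical helper: returns the file's contents with trailing
// whitespace removed from every line and LF as the line terminator.
string stripTrailingByLine(string fileName)
{
    string result;
    // byLine reuses its buffer between iterations, so anything that is
    // kept must be copied; idup turns the char[] slice into a string.
    foreach (line; File(fileName).byLine())
        result ~= line.stripRight().idup ~ "\n";
    return result;
}

void main()
{
    // Write a small sample file so the sketch is self-contained.
    auto path = buildPath(tempDir(), "byline_sketch.txt");
    write(path, "int x;  \nint y;\t\n");
    scope (exit) remove(path);

    assert(stripTrailingByLine(path) == "int x;\nint y;\n");
}
```

The `idup` call is the main thing to watch with `byLine`: without it, every stored line would alias the same reused buffer.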

Andrei

August 07, 2010

Re: tolf and detab

Posted by Nick Sabalausky
in reply to Yao G.

"Yao G." <nospamyao@gmail.com> wrote in message news:op.vg1qpcjfxeuu2f@miroslava.gateway.2wire.net...
>
> What does idiomatic D mean?
>

"idiomatic D" -> "In typical D style"

August 08, 2010

Re: tolf and detab

Posted by Jonathan M Davis
in reply to Andrei Alexandrescu

On Friday 06 August 2010 18:50:52 Andrei Alexandrescu wrote:
> On 08/06/2010 08:34 PM, Walter Bright wrote:
> > I wrote these two trivial utilities for the purpose of canonicalizing source code before checkins and to deal with FreeBSD's inability to deal with CRLF line endings, and because I can never figure out the right settings for git to make it do the canonicalization.
> > 
> > tolf - converts LF, CR, and CRLF line endings to LF.
> > 
> > detab - converts all tabs to the correct number of spaces. Assumes tabs are 8 column tabs. Removes trailing whitespace from lines.
> > 
> > Posted here just in case someone wonders what they are.
> 
> [snip]
> 
> Nice, though they don't account for multiline string literals.
> 
> A good exercise would be rewriting these tools in idiomatic D2 and assessing the differences.
> 
> 
> Andrei

I didn't try and worry about multiline string literals, but here are my more idiomatic solutions:

detab:

/* Replace tabs with spaces, and remove trailing whitespace from lines.
  */

import std.array;
import std.conv;
import std.file;
import std.stdio;
import std.string;

void main(string[] args)
{
    const int tabSize = to!int(args[1]);
    foreach(f; args[2 .. $])
        removeTabs(tabSize, f);
}

void removeTabs(int tabSize, string fileName)
{
    auto file = File(fileName);
    string[] output;

    // byLine reuses its buffer, so each finished line is copied with idup.
    foreach(line; file.byLine())
    {
        // Expand one tab at a time so later tabs see the updated columns.
        for(auto tab = line.indexOf('\t'); tab != -1; tab = line.indexOf('\t'))
        {
            // Number of spaces needed to reach the next tab stop.
            const numSpaces = tabSize - tab % tabSize;

            line = line[0 .. tab] ~ replicate(" ", numSpaces) ~ line[tab + 1 .. $];
        }

        // Remove trailing whitespace (this also drops a stray '\r').
        output ~= line.stripRight().idup;
    }

    // Close the input before overwriting the same file.
    file.close();
    std.file.write(fileName, output.join("\n"));
}

-------------------------------------------

The three differences between mine and Walter's are that mine takes the tab size as the first argument, it doesn't put a newline at the end of the file, and it writes the file even if nothing changed (you could test for that, but when using byLine(), it's a bit harder). Interestingly enough, from the few tests that I ran, mine seems to be somewhat faster. I also happen to think that the code is clearer (it's certainly shorter), though that might be up for debate.
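For anyone puzzling over the tab-size arithmetic, here is a small standalone sketch of it (the `expandTabs` name is only for illustration, and it assumes ASCII text so that a byte offset equals a column). A tab at column `tab` needs `tabSize - tab % tabSize` spaces to reach the next tab stop:

```d
import std.array : replicate;
import std.string : indexOf;

// Illustrative helper: expands tabs in a single line.
// Assumes ASCII input, so the byte offset of a tab is also its column.
string expandTabs(string line, size_t tabSize)
{
    // Handle one tab per iteration so later tabs see the updated columns.
    for (auto tab = line.indexOf('\t'); tab != -1; tab = line.indexOf('\t'))
    {
        // Distance from the tab's column to the next tab stop (1 .. tabSize).
        const numSpaces = tabSize - tab % tabSize;
        line = line[0 .. tab] ~ replicate(" ", numSpaces) ~ line[tab + 1 .. $];
    }
    return line;
}

void main()
{
    // Tab at column 1 with 4-column stops: 3 spaces are needed.
    assert(expandTabs("a\tb", 4) == "a   b");
    // The second tab lands at column 5 after the first one is expanded.
    assert(expandTabs("ab\tc\td", 4) == "ab  c   d");
}
```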

-------------------------------------------

tolf:

/* Replace line endings with LF
  */

import std.file;
import std.string;

void main(string[] args)
{
    foreach(f; args[1 .. $])
        fixEndLines(f);
}

void fixEndLines(string fileName)
{
    auto fileStr = std.file.readText(fileName);
    auto result = fileStr.replace("\r\n", "\n").replace("\r", "\n");

    std.file.write(fileName, result);
}

-------------------------------------------

This version is ludicrously simple. And it was also faster than Walter's in the few tests that I ran. In either case, I think that it is definitely clearer code.

I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any case, you wanted some idiomatic D2 solutions, so there you go.
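One rough way to look at both time and memory from inside the program is sketched below. This is only an approximation under stated assumptions: the input is synthetic, `toLf` is a made-up wrapper around the two `replace` calls, and `GC.stats` reports the GC heap's current usage rather than the process's peak memory.

```d
import core.memory : GC;
import std.array : replace;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

// Made-up wrapper around the replace-based conversion shown above.
string toLf(string text)
{
    return text.replace("\r\n", "\n").replace("\r", "\n");
}

void main()
{
    // Synthetic input standing in for a real source file.
    string input;
    foreach (i; 0 .. 10_000)
        input ~= "int x;\r\n";

    const before = GC.stats().usedSize;
    auto sw = StopWatch(AutoStart.yes);
    auto result = toLf(input);
    sw.stop();
    const after = GC.stats().usedSize;

    writefln("time: %s usecs", sw.peek.total!"usecs");
    // Signed difference, since a collection could shrink the heap.
    writefln("GC heap change: %s bytes", cast(long) after - cast(long) before);
    assert(result.length == 10_000 * "int x;\n".length);
}
```

An external tool that reports the process's peak resident size would give a more complete picture than the GC counters alone.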

- Jonathan M Davis

August 08, 2010

Re: tolf and detab

Posted by bearophile
in reply to Jonathan M Davis

Jonathan M Davis:
> I would have thought that being more idiomatic would have resulted in slower code than what Walter did, but interestingly enough, both programs are faster with my code. They might take more memory though. I'm not quite sure how to check that. In any case, you wanted some idiomatic D2 solutions, so there you go.

Your code looks better.

My (probably controversial) opinion on this is that the idiomatic D solution for those text "scripts" is to use a scripting language, such as Python :-)

In this case a Python version is more readable, shorter and probably faster too, because reading the lines of a _normal_ text file is faster in Python than in D (Python is more optimized for such purposes; I can show benchmarks on request).

On the other hand D2 is in its debugging phase, so it's good to use it even for purposes it's not the best language for, to catch bugs or performance bugs. So I think it's positive to write such scripts in D2, even if in a real-world setting I want to use Python to write them.

Bye,
bearophile
