August 04, 2011 Multi-file byte comparison tool. What would you have done differently?

I have a need for detecting incorrect byte sequences in multiple files (>2) at a time (as a part of our porting effort to new platforms). Ideally the files should be identical for all but a handful of byte sequences (in a header section) that I can just skip over. I thought this would be a fun exercise for my D muscles. I found success creating a dynamic array of structs to keep information about each file passed in via command-line parameters. I'll append the code at the end (and I'm sure it'll get mangled in the process...)

(I'm not one for coming up with creative names, so it's SOMETHING.) Then I loop around a read for each file, then manually run a for loop from 0 to BLOCK_SIZE, copy the size_t value into a new dynamic array (one for each of the files opened), and run a function to ensure all values in the size_t array are the same. If not, I compare each ubyte value (via the byte_union) to determine which bytes are not correct by adding each byte to a separate array, comparing each value in that array, and printing the address and values of each bad byte as I encounter them.

This appears to work great. Some justifications:
- I used size_t because I'm under the impression it's a platform-specific size that best fits into a single register, thus making comparisons faster than byte-by-byte.
- I used a union to extract the bytes from the size_t.
- I wanted to create a SOMETHING for each file at run-time, instead of only allowing a certain number of SOMETHINGs (either hard coded, or a limit).
- Originally I wrote my own comparison function, but in my search for something more functional, I tried out std.algorithm's count. Can't say I can tell if it's better or worse.

Features I'll probably add if I have to keep using the tool:
1) Better support for starting points and bytes to read.
2) Threshold for errors encountered, preferably managed by a command-line argument.
3) Coalescing error messages in sequential byte sequences.

When I run the program, it's certainly I/O bound at 30Mb/s to an external USB drive :).

So the question is, how would you make it more D-ish? (Do we have a term analogous to "pythonic" for D? :))

Code:

import std.stdio;
import std.file;
import std.conv;
import std.getopt;
import std.algorithm;

enum BLOCK_SIZE = 1024;

union byte_union
{
    size_t val;
    ubyte[val.sizeof] bytes;
}

struct SOMETHING
{
    string file_name;
    size_t size_bytes;
    File fd;
    byte_union[BLOCK_SIZE] bytes;
}

void main(string[] args)
{
    size_t bytes_read;
    size_t bytes_max;
    size_t size_smallest;
    size_t[] comp_arr;
    SOMETHING[] somethings;

    getopt(args,
        "seek", &bytes_read,
        "bytes", &bytes_max
    );
    if(bytes_max == 0)
        bytes_max = size_t.max; // Limit on the smallest file size
    else
        bytes_max += bytes_read;
    size_smallest = bytes_max;
    somethings.length = args.length - 1;
    comp_arr.length = args.length - 1;
    for(size_t i = 0; i < somethings.length; i++)
    {
        somethings[i].file_name = args[i + 1];
        somethings[i].size_bytes = getSize(somethings[i].file_name);
        stderr.writef("Opening file: %s(%d)\n", somethings[i].file_name, somethings[i].size_bytes);
        somethings[i].fd = File(somethings[i].file_name, "r");
        somethings[i].fd.seek(bytes_read);
        if(somethings[i].fd.tell() != bytes_read)
        {
            stderr.writef("Failed to seek to position %d in %s\n", bytes_read, args[i + 1]);
        }
        // Pick the smallest file, or the limit
        size_smallest = min(size_smallest, somethings[i].size_bytes);
    }

    // Check file sizes
    for(size_t i = 0; i < somethings.length; i++)
        comp_arr[i] = somethings[i].size_bytes;
    writef("count: %s\n", count(comp_arr, comp_arr[0]));
    if(count(comp_arr, comp_arr[0]) != comp_arr.length)
    {
        stderr.writef("Files are not the same size!");
        foreach(s; somethings)
            stderr.writef("[%s:%d]", s.file_name, s.size_bytes);
        stderr.writef("\n");
    }

    // While bytes_read < size of smallest file
    size_t block_counter;
    while(bytes_read < size_smallest)
    {
        // Read a block from each file
        for(size_t i = 0; i < somethings.length; i++)
        {
            somethings[i].fd.rawRead(somethings[i].bytes);
        }
        // Compare all size_t values
        for(size_t i = 0; i < BLOCK_SIZE; i++)
        {
            // If one is different
            for(size_t j = 0; j < somethings.length; j++)
                comp_arr[j] = somethings[j].bytes[i].val;
            if(count(comp_arr, comp_arr[0]) != comp_arr.length)
            {
                // Compare bytes inside to determine which byte(s) are different
                for(size_t k = 0; k < byte_union.sizeof; k++)
                {
                    for(size_t j = 0; j < somethings.length; j++)
                        comp_arr[j] = to!(size_t)(somethings[j].bytes[i].bytes[k]);
                    if(count(comp_arr, comp_arr[0]) != comp_arr.length)
                    {
                        stderr.writef("Byte at 0x%08x (%u) does not match %s\n",
                            bytes_read + i * byte_union.sizeof + k,
                            bytes_read + i * byte_union.sizeof + k, comp_arr);
                    }
                }
            }
        }
        bytes_read += BLOCK_SIZE * SOMETHING.size_bytes.sizeof;
        block_counter++;
        if( (block_counter % (1024 * 25)) == 0)
        {
            stderr.writef("Completed %5.1fGB\n", to!(double)(bytes_read) / 1024 / 1024 / 1024);
        }
    }

    for(size_t i = 0; i < somethings.length; i++)
    {
        somethings[i].fd.close();
    }
}
August 04, 2011 Re: Multi-file byte comparison tool. What would you have done differently?

Posted in reply to Kai Meyer

Kai Meyer wrote:
> So the question is, how would you make it more D-ish? (Do we have a term
> analogous to "pythonic" for D? :))
An easy first step to improve the D-Factor would be to replace all these for loops with foreach loops and ref foreach loops.
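
For example, a minimal before/after sketch using one of the loops from the posted code:

// before: manual index loop
for(size_t i = 0; i < somethings.length; i++)
    somethings[i].fd.rawRead(somethings[i].bytes);

// after: ref foreach, mutating each element directly
foreach(ref s; somethings)
    s.fd.rawRead(s.bytes);

// plain index ranges become:
foreach(i; 0 .. BLOCK_SIZE) { /* ... */ }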
-Timon
August 05, 2011 Re: Multi-file byte comparison tool. What would you have done differently?

Posted in reply to Kai Meyer

On Fri, 05 Aug 2011 00:25:38 +0200, Kai Meyer <kai@unixlords.com> wrote:

> [...]
>
> enum BLOCK_SIZE = 1024;
> union byte_union
> {
>     size_t val;
>     ubyte[val.sizeof] bytes;
> }
> struct SOMETHING
> {
>     string file_name;
>     size_t size_bytes;
>     File fd;
>     byte_union[BLOCK_SIZE] bytes;
> }

I would use TypeNames and nonTypeNames, so, blockSize, ByteUnion, Something, sizeBytes, etc.

> void main(string[] args)
> {
>     size_t bytes_read;
>     size_t bytes_max;
>     size_t size_smallest;
>     size_t[] comp_arr;
>     SOMETHING[] somethings;

Don't declare variables until you need them; just leave bytes_read and bytes_max here.

> [...]
>     for(size_t i = 0; i < somethings.length; i++)
>     {
>         somethings[i].file_name = args[i + 1];
>         [...]
>         size_smallest = min(size_smallest, somethings[i].size_bytes);
>     }

Use foreach (ref something; somethings) and something instead of somethings[i].

> [...]
>         stderr.writef("\n");
>     }

You can use writefln() instead of writef("\n") everywhere.

> [...]
>     // Compare all size_t values
>     for(size_t i = 0; i < BLOCK_SIZE; i++)
>     {

Here you can use foreach (i; 0 .. blockSize).

> [...]
>     for(size_t i = 0; i < somethings.length; i++)
>     {
>         somethings[i].fd.close();
>     }
> }

You don't generally need to close them; they should be closed by the destructors of the File (I think, at least).

I don't understand why you use ByteUnion instead of just a plain array of bytes. I also don't understand why you write so much to stderr instead of stdout.
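
(For illustration, a minimal sketch of what the plain-byte-array version might look like; the names follow the renaming suggestion above and are not code from the thread:)

import std.stdio;

enum blockSize = 1024 * size_t.sizeof; // same number of bytes per block

struct Something
{
    string fileName;
    size_t sizeBytes;
    File fd;
    ubyte[blockSize] bytes; // plain buffer, no union needed
}

// Static arrays compare element-wise with ==, so one expression
// checks a whole block:
bool blocksMatch(const ref Something a, const ref Something b)
{
    return a.bytes == b.bytes;
}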
August 05, 2011 Re: Multi-file byte comparison tool. What would you have done differently?

Posted in reply to Timon Gehr

On 08/04/2011 05:03 PM, Timon Gehr wrote:
> Kai Meyer wrote:
>> So the question is, how would you make it more D-ish? (Do we have a term
>> analogous to "pythonic" for D? :))
>
> An easy first step to improve the D-Factor would be to replace all these for loops
> with foreach loops and ref foreach loops.
>
> -Timon
I like D-Factor :)
I haven't done foreach(ref s; somethings) before. Interesting, thanks :)
August 05, 2011 Re: Multi-file byte comparison tool. What would you have done differently?

Posted in reply to Pelle

On 08/05/2011 03:02 AM, Pelle wrote:
> On Fri, 05 Aug 2011 00:25:38 +0200, Kai Meyer <kai@unixlords.com> wrote:
>> [...]
>
> I would use TypeNames and nonTypeNames, so, blockSize, ByteUnion,
> Something, sizeBytes, etc.

I haven't used TypeNames or nonTypeNames. Do you have some reference material for me? Searching http://www.d-programming-language.org/ didn't give me anything that sounds like what you're talking about.

> Don't declare variables until you need them, just leave bytes_read and
> bytes_max here.

Is there a performance consideration? Or is it purely a style or D-Factor suggestion?

> Use foreach (ref something; somethings) and something instead of
> somethings[i].

I didn't know ref could be used like that :) Thanks!

> You can use writefln() instead of writef("\n") everywhere.

It's hard to fix my printf habits :)

> Here you can use foreach (i; 0 .. blockSize)

Oh, duh :)

> You don't generally need to close them, they should be closed by the
> destructors of the File (I think, at least).

Being a C programmer, I *have* to be explicit. I get a bit nervous when I'm not. In fact, the first two things I wrote in the script were a loop for open and a loop for close :)

> I don't understand why you use ByteUnion instead of just a plain array
> of bytes.

I thought that comparing one byte at a time would be slower than comparing 8 bytes at a time (size_t on 64-bit) and falling back to the byte-by-byte comparison only when the size_t value was different.

> I also don't understand why you write so much to stderr instead of stdout.

Again, just a habit. The tool is in a very terse state. What I typically do with little hackish scripts like this: if I later need to add actual useful info to STDOUT in a format that is generally consumable (usually because somebody else wants to use the tool), I leave the STDERR stuff in there for my own purposes. Then I test by doing something like "./byte_check <file list> 2>/dev/null". When I'm ready to drop my terse output, I either keep it behind a -v flag or just delete the lines. They're easy to find.

Thanks for your feedback :)
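
(A compact two-file sketch of the word-first strategy described above; illustrative only, since the real tool compares N files at once:)

import std.stdio;

union ByteUnion
{
    size_t val;
    ubyte[size_t.sizeof] bytes;
}

// Fast path: one register-sized compare per word. Only on a mismatch
// fall back to byte-by-byte to locate the offending offset(s).
void compareWord(ByteUnion a, ByteUnion b, size_t offset)
{
    if(a.val == b.val)
        return; // 8 bytes (on 64-bit) checked in a single compare
    foreach(k; 0 .. size_t.sizeof)
        if(a.bytes[k] != b.bytes[k])
            stderr.writefln("Byte at 0x%08x does not match: %d vs %d",
                            offset + k, a.bytes[k], b.bytes[k]);
}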
August 05, 2011 Re: Multi-file byte comparison tool. What would you have done differently?

Posted in reply to Kai Meyer

> On 08/05/2011 03:02 AM, Pelle wrote:
> > On Fri, 05 Aug 2011 00:25:38 +0200, Kai Meyer <kai@unixlords.com> wrote:
> >> [...]
> >
> > I would use TypeNames and nonTypeNames, so, blockSize, ByteUnion, Something, sizeBytes, etc.
>
> I haven't used TypeNames or nonTypeNames. Do you have some reference material for me? Searching http://www.d-programming-language.org/ didn't give me anything that sounds like what you're talking about.
I believe that he was talking about how you named your variables and type names. Normally in D, user-defined types use pascal case (e.g. Something and ByteUnion, not SOMETHING and byte_union), and everything else uses camelCase (e.g. blockSize and fileName, not BLOCK_SIZE and file_name). It's an issue of style. You can name your variables however you'd like to, but what you're doing doesn't match what most of the people around here do, which can make it harder to read the code that you post. He was simply suggesting that you follow the more typical D naming conventions. It's completely up to you though.
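
(Applied to the types from the original post, that convention looks like this; a sketch, not code from the thread:)

import std.stdio : File;

enum blockSize = 1024;

union ByteUnion
{
    size_t val;
    ubyte[val.sizeof] bytes;
}

struct Something
{
    string fileName;
    size_t sizeBytes;
    File fd;
    ByteUnion[blockSize] bytes;
}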
- Jonathan M Davis
August 05, 2011 Re: Multi-file byte comparison tool. What would you have done differently?

Posted in reply to Jonathan M Davis

On 08/05/2011 11:13 AM, Jonathan M Davis wrote:
>> On 08/05/2011 03:02 AM, Pelle wrote:
>>> On Fri, 05 Aug 2011 00:25:38 +0200, Kai Meyer <kai@unixlords.com> wrote:
>>>> [...]
>>>
>>> I would use TypeNames and nonTypeNames, so, blockSize, ByteUnion,
>>> Something, sizeBytes, etc.
>>
>> I haven't used TypeNames or nonTypeNames. Do you have some reference
>> material for me? Searching http://www.d-programming-language.org/ didn't
>> give me anything that sounds like what you're talking about.
>
> I believe that he was talking about how you named your variables and type
> names. Normally in D, user-defined types use pascal case (e.g. Something and
> ByteUnion, not SOMETHING and byte_union), and everything else uses camelCase
> (e.g. blockSize and fileName, not BLOCK_SIZE and file_name). It's an issue of
> style. You can name your variables however you'd like to, but what you're
> doing doesn't match what most of the people around here do, which can make it
> harder to read the code that you post. He was simply suggesting that you
> follow the more typical D naming conventions. It's completely up to you
> though.
>
> - Jonathan M Davis
Thanks :) That's what I was asking for. I've got some programming habits that need to be broken if I want to increase my D-Factor!
August 08, 2011 Re: Multi-file byte comparison tool. What would you have done differently?

Posted in reply to Kai Meyer

On Fri, 05 Aug 2011 18:43:27 +0200, Kai Meyer <kai@unixlords.com> wrote:
> On 08/05/2011 03:02 AM, Pelle wrote:
>> Don't declare variables until you need them, just leave bytes_read and
>> bytes_max here.
>
> Is there a performance consideration? Or is it purely a style or
> D-Factor suggestion?

Just style and D-factor :-)

Also, the resulting code is shorter, and you can replace a lot of type names with auto.

>> I don't understand why you use ByteUnion instead of just a plain array
>> of bytes.
>
> I thought that comparing one byte at a time would be slower than
> comparing 8 bytes at a time (size_t on 64bit) and failing over to the
> byte-by-byte comparison only when the size_t value was different.

Maybe, but that's something that should be benchmarked. If a byte array is just as fast, and the code is simpler, that's a better solution :)
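
(A rough sketch of such a benchmark, assuming the StopWatch that std.datetime shipped at the time; the buffer size is an arbitrary choice, not from the thread:)

import std.datetime : StopWatch;
import std.stdio;

void main()
{
    enum n = 64 * 1024 * 1024; // 64 MiB, divisible by size_t.sizeof
    auto a = new ubyte[](n);
    auto b = new ubyte[](n);

    StopWatch sw;
    sw.start();
    bool same = a == b;                  // byte-wise array compare
    sw.stop();
    writefln("ubyte[] compare:  %s ms (same=%s)", sw.peek().msecs, same);

    // Reinterpret the same memory as machine words and compare again.
    auto wa = cast(size_t[]) a;
    auto wb = cast(size_t[]) b;
    sw.reset();
    sw.start();
    same = wa == wb;
    sw.stop();
    writefln("size_t[] compare: %s ms (same=%s)", sw.peek().msecs, same);
}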
August 08, 2011 Re: Multi-file byte comparison tool. What would you have done differently?

Posted in reply to Pelle

On 08/08/2011 12:33 AM, Pelle wrote:
> On Fri, 05 Aug 2011 18:43:27 +0200, Kai Meyer <kai@unixlords.com> wrote:
>
>> On 08/05/2011 03:02 AM, Pelle wrote:
>>>
>>> Don't declare variables until you need them, just leave bytes_read and
>>> bytes_max here.
>>>
>> Is there a performance consideration? Or is it purely a style or
>> D-Factor suggestion?
>
> Just style and D-factor :-)
>
> Also, resulting code is shorter, and you can replace a lot of type names
> with auto.
>
>>> I don't understand why you use ByteUnion instead of just a plain array
>>> of bytes.
>> I thought that comparing one byte at a time would be slower than
>> comparing 8 bytes at a time (size_t on 64bit) and failing over to the
>> byte-by-byte comparison only when the size_t value was different.
>
> Maybe, but that's something that should be benchmarked. If a byte array
> is just as fast, and the code is simpler, that's a better solution :)
I simply can't imagine how 8,000 byte comparisons would be even close to comparable to 1,000 size_t comparisons done one at a time, with the way I'm doing the comparison. I'm certain there are much better ways of comparing bits, but I'm not ready to make this program that complex :)