View mode: basic / threaded / horizontal-split · Log in · Help
November 05, 2011
Re: Review of Jesse Phillips's CSV Parser
On Fri, 28 Oct 2011 09:18:27 -0400, dsimcha wrote:

> Formal review of Jesse Phillips's CSV parser module begins today and
> runs through November .  Following that, a vote will take place for one
> week.  Please post any comments about this library in this thread.
> 
> Docs: http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
> 
> Code:
> https://github.com/he-the-great/phobos/tree/csv

I've done another update this covers:

- Documentation more complete
- A provided heading can be any InputRange of string
- csvText => csvReader
- RecordList => Records
- csvReader no longer takes a Malformed enum, and instead provides 
customization to comma and quote.

Tomorrow I will try and complete one or more of the following with this 
priority:

- Have row/column information for thrown exceptions.
- Support [string][string] Contents.
- Write a CSV formatter. Support for range of struct, range of [string]
[string], and range of string[].
November 07, 2011
Re: Review of Jesse Phillips's CSV Parser
And only one pending request, csvFormatter. I am still considering how I 
want to approach this. I don't want to rush it in so I don't plan on 
having it complete before voting starts.

It has been fun and I look forward to the verdict.

On Sat, 05 Nov 2011 23:37:43 +0000, Jesse Phillips wrote:

> On Fri, 28 Oct 2011 09:18:27 -0400, dsimcha wrote:
> 
>> Formal review of Jesse Phillips's CSV parser module begins today and
>> runs through November .  Following that, a vote will take place for one
>> week.  Please post any comments about this library in this thread.
>> 
>> Docs:
>> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>> 
>> Code:
>> https://github.com/he-the-great/phobos/tree/csv
> 
> I've done another update this covers:
> 
> - Documentation more complete - A provided heading can be any InputRange
> of string - csvText => csvReader - RecordList => Records - csvReader no
> longer takes a Malformed enum, and instead provides customization to
> comma and quote.
> 
> Tomorrow I will try and complete one or more of the following with this
> priority:
> 
> - Have row/column information for thrown exceptions.
> - Support [string][string] Contents.
> - Write a CSV formatter. Support for range of struct, range of [string]
> [string], and range of string[].
November 07, 2011
Re: Review of Jesse Phillips's CSV Parser
On 10/28/2011 9:18 AM, dsimcha wrote:
> Formal review of Jesse Phillips's CSV parser module begins today and
> runs through November . Following that, a vote will take place for one
> week. Please post any comments about this library in this thread.
>
> Docs:
> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>
> Code:
> https://github.com/he-the-great/phobos/tree/csv

I apologize.  I just noticed that I forgot to put the end date in this 
message.  The review of the CSV parser runs the standard two weeks, 
until November 11.
November 08, 2011
Re: Review of Jesse Phillips's CSV Parser
On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips <jessekphillips+d@gmail.com> wrote:
> And only one pending request, csvFormatter. I am still considering how I
> want to approach this. I don't want to rush it in so I don't plan on
> having it complete before voting starts.

My primary use of csv files is as a method of saving data to later graph / view / format in Excel. i.e.

csvWrite("filename",["x","y","z"],x,y,z);

where x, y, z are ranges with possibly different lengths and/or types.

I also have some API design concerns. Primarily, most of the type names are overly broad and unclear. For example: Record, Records, Malformed could just as easily be part of a database API. And because of this, not only does one reading the code not know what the object is, but if namespace conflicts can occur, then code reviewers will probably make the wrong assumption half the time as to what Record is and not bother looking it up.

Furthermore, Records and Record, are essentially internal ranges/structs returned by csvReader. In the tradition of std.algorithms these should be defined inside the csvReader function (preferably) or as CsvReader / CsvReader.Record.
November 10, 2011
Re: Review of Jesse Phillips's CSV Parser
On Friday, October 28, 2011 09:18:27 dsimcha wrote:
> Formal review of Jesse Phillips's CSV parser module begins today and
> runs through November .  Following that, a vote will take place for one
> week.  Please post any comments about this library in this thread.
> 
> Docs:
> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
> 
> Code:
> https://github.com/he-the-great/phobos/tree/csv

- Nitpick: You should probably proofread the module documentation again. It 
has several mispellings and grammar errors.

- The functions and types mentioned in the documentation should use LREF so 
that they're links to their documentation.

- Shouldn't the term "header" be used rather than "heading?" So, you'd have 
HeaderMismatchException instead of HeadingMismatchException. Usually, the term 
heading is only used for stuff like the heading of an article. Usually, the 
term header is used for stuff like files. RFC 4180 uses the term header but 
never heading. So, unless I'm missing something, I really think that all of 
the instances of heading (both in the documentation and in the API) should be 
changed to header.

- I'd argue against Separator in csvReader being char by default. Individual 
characters should generally be dchar, not char. Using char by itself is almost 
always bad practice IMHO.

- I'd say that csvReader should say that it returns a Records struct which is 
an input range rather than saying that it provides std.range.isInputRange. We 
just don't talk that way about ranges normally. We say that something is a 
particular range type, not that it provides the API for a particular range 
type. If you want to make "is an input range" a link to 
std.range.isInputRange, that's fine, but what's currently there is not how 
ranges are typically referred to. I'd also suggest that you on the range 
functions where you say that they're part of the std.range.isInputRange 
interface, you should use the term API instead of interface, since interface 
has a specific meaning in the language which has nothing to do with this 
context.

- Why does csvNextToken take an Appender!(char[])? That seems overly 
restrictive. I'd argue that it should take an output range of dchar (which 
Appender!(char[]) is). Then it's not restricted to Appender, and it's not 
restricted to char[].

- Also, the fact that you have isSomeChar!(ElementType!Range) concerns me 
somewhat. Normally, I'd expect something more like is(ElementType!Range == 
dchar). As it is, you could have a range of char or wchar, which could do 
funny things. Phobos doesn't normally support that, and I'd be concerned about 
what would happen if someone tried. ElementType!Range will be dchar for all 
string types, so they'll work fine either way, but programmers who are overly 
adventurous or ignorant may try and use a char or wchar range, and I'm willing 
to be that there would be problems with that. Just all around, I'd advise 
checking that the ranges are ranges of dchar and using dchar for the 
separators rather than char.

- You should have unit tests testing ranges _other_ than string. The fact that 
neither wstring nor dstring is tested is worrisome, and ideally, you'd do stuff 
like call filter!"true" on a string so that you have an input range which isn't 
a string. Without at least some of that, I'd be very worried that your code 
can't actually handle generic ranges of dchar and really only works with 
string.

- You should have some unit tests with unicode values - preferably some which 
are multiple code units for both char and wchar. For instance, std.string uses 
\U00010143 in some of its tests, because it's 4 code units in UTF-8 and 2 in 
UTF-16. The complete lack of testing with unicode is worrisome in the same way 
that only testing with string is worrisome. It's too easy to screw something 
up such that it happens to work with string and ASCII but not in general. So, 
please add tests which involve unicode - not only in the data but in the 
separators.

- I find it somewhat odd that you would use a pointer to the range in Record 
rather than using a slice, but I'd have to examine it in more detail to see 
whether that's actually necessary or not. But in general, I'd argue that a 
slice should be used rather than a pointer if it can be.

- Jonathan M Davis


P.S.  Just as a note, since you seem to like not using braces if you don't 
need to for if statements and the like, I though that I'd point out that try 
and catch don't require braces in D if htey contain only one statement. You 
have several one line catches which use braces when they don't need to. I'm 
not arguing that you should or shouldn't change it, but I thought that I would 
point out that the braces aren't necessary in case you didn't know (since they 
_are_ necessary in other languages).
November 11, 2011
Re: Review of Jesse Phillips's CSV Parser
On Mon, 07 Nov 2011 23:28:20 -0500, Robert Jacques wrote:

> On Sun, 06 Nov 2011 19:00:42 -0500, Jesse Phillips
> <jessekphillips+d@gmail.com> wrote:
>> And only one pending request, csvFormatter. I am still considering how
>> I want to approach this. I don't want to rush it in so I don't plan on
>> having it complete before voting starts.
> 
> My primary use of csv files is as a method of saving data to later graph
> / view / format in Excel. i.e.
> 
> csvWrite("filename",["x","y","z"],x,y,z);
> 
> where x, y, z are ranges with possibly different lengths and/or types.

I don't want to do anything with files, I'm thinking an output buffer 
makes the most sense (should be able to have such a buffer write to a 
file).

The x,y,z seems interesting but what about the poor sole who has a range 
of

struct {
   auto x,y,z;
}

If I'm going to make a csvFormatter than making sure something like

csvFormatter(csvReader(txt));

works is number one priority. But you have added a style to consider when 
building the interface.

> I also have some API design concerns. Primarily, most of the type names
> are overly broad and unclear. For example: Record, Records, Malformed
> could just as easily be part of a database API. And because of this, not
> only does one reading the code not know what the object is, but if
> namespace conflicts can occur, then code reviewers will probably make
> the wrong assumption half the time as to what Record is and not bother
> looking it up.

The reason it is name like what you would find in a database, is because 
they are the same thing. It is a row of data. In past discussion about 
existing name clashes it was concluded that was what modules, static, and 
named imports wore for.

Malformed, I just don't have any clue what to actually call it. No 
CSVMalformed doesn't help.

But then you go on to the point below, that you shouldn't even see them.

> Furthermore, Records and Record, are essentially internal ranges/structs
> returned by csvReader. In the tradition of std.algorithms these should
> be defined inside the csvReader function (preferably) or as CsvReader /
> CsvReader.Record.

I was going to explain how wrong you are and how I tried having all the 
options part of csvReader but it just muddies up the clean API when 
customizing something. But actually I already found the solution with my 
previous updates and incorrectly chose to move Malformed customization 
out.

Long story short the problems I see:

* The documentation still needs to exist and will be a little harder to 
do without these
* Records contains the heading field, this is abnormal for an InputRange 
and means it won't show as an auto link if it becomes a hidden range.
* Currently you can find exactly which function call result in a specific 
error, if I hide the ranges you don't.

I'm expecting that introducing CSVFormatter will require another review, 
at that time I can try making two forms of documentation to see which is 
preferred.

Thank you! Very useful food for thought.
November 11, 2011
Re: Review of Jesse Phillips's CSV Parser
Thank you, lots of stuff to ponder and comment.

On Thu, 10 Nov 2011 02:10:28 -0800, Jonathan M Davis wrote:

> On Friday, October 28, 2011 09:18:27 dsimcha wrote:
>> Formal review of Jesse Phillips's CSV parser module begins today and
>> runs through November .  Following that, a vote will take place for one
>> week.  Please post any comments about this library in this thread.
>> 
>> Docs:
>> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>> 
>> Code:
>> https://github.com/he-the-great/phobos/tree/csv
> 
> - Nitpick: You should probably proofread the module documentation again.
> It has several mispellings and grammar errors.

Err, I'll try, use some of the great tips you gave me :)

> - The functions and types mentioned in the documentation should use LREF
> so that they're links to their documentation.

I'll see if I can find them all while proofing.

> - Shouldn't the term "header" be used rather than "heading?" So, you'd
> have HeaderMismatchException instead of HeadingMismatchException.
> Usually, the term heading is only used for stuff like the heading of an
> article. Usually, the term header is used for stuff like files. RFC 4180
> uses the term header but never heading. So, unless I'm missing
> something, I really think that all of the instances of heading (both in
> the documentation and in the API) should be changed to header.

Used both in development, then chose one, guess I got it wrong. Changing.

> - I'd argue against Separator in csvReader being char by default.
> Individual characters should generally be dchar, not char. Using char by
> itself is almost always bad practice IMHO.

The only reason I put defaults on it was because it wouldn't choose the 
value for it from my default values. Otherwise it works with what you 
give it, be it char, wchar, dchar.

> - I'd say that csvReader should say that it returns a Records struct
> which is an input range rather than saying that it provides
> std.range.isInputRange. We just don't talk that way about ranges
> normally. We say that something is a particular range type, not that it
> provides the API for a particular range type. If you want to make "is an
> input range" a link to std.range.isInputRange, that's fine, but what's
> currently there is not how ranges are typically referred to. I'd also
> suggest that you on the range functions where you say that they're part
> of the std.range.isInputRange interface, you should use the term API
> instead of interface, since interface has a specific meaning in the
> language which has nothing to do with this context.

Thanks.

> - Why does csvNextToken take an Appender!(char[])? That seems overly
> restrictive. I'd argue that it should take an output range of dchar
> (which Appender!(char[]) is). Then it's not restricted to Appender, and
> it's not restricted to char[].

Appender is used so that the exception thrown can contain the consumed 
data. Thinking again on this, I suppose I can catch the exception add the 
data and rethrow. Sounds pretty good actually.

> - Also, the fact that you have isSomeChar!(ElementType!Range) concerns
> me somewhat. Normally, I'd expect something more like
> is(ElementType!Range == dchar). As it is, you could have a range of char
> or wchar, which could do funny things. Phobos doesn't normally support
> that, and I'd be concerned about what would happen if someone tried.
> ElementType!Range will be dchar for all string types, so they'll work
> fine either way, but programmers who are overly adventurous or ignorant
> may try and use a char or wchar range, and I'm willing to be that there
> would be problems with that. Just all around, I'd advise checking that
> the ranges are ranges of dchar and using dchar for the separators rather
> than char.

I was of a different mindset of the time. You are probably right I should 
prevent someone from eating their leg off.

> - You should have unit tests testing ranges _other_ than string. The
> fact that neither wstring nor dstring is tested is worrisome, and
> ideally, you'd do stuff like call filter!"true" on a string so that you
> have an input range which isn't a string. Without at least some of that,
> I'd be very worried that your code can't actually handle generic ranges
> of dchar and really only works with string.

Did add a wchar input range test, but now it should just be an dchar 
input range test since wchar ranges won't be allowed.

> - You should have some unit tests with unicode values - preferably some
> which are multiple code units for both char and wchar. For instance,
> std.string uses \U00010143 in some of its tests, because it's 4 code
> units in UTF-8 and 2 in UTF-16. The complete lack of testing with
> unicode is worrisome in the same way that only testing with string is
> worrisome. It's too easy to screw something up such that it happens to
> work with string and ASCII but not in general. So, please add tests
> which involve unicode - not only in the data but in the separators.

Didn't create any new tests, just added unicode in to input and as 
separators into old tests.

> - I find it somewhat odd that you would use a pointer to the range in
> Record rather than using a slice, but I'd have to examine it in more
> detail to see whether that's actually necessary or not. But in general,
> I'd argue that a slice should be used rather than a pointer if it can
> be.

Record requires modifying the range, for a true input range this is not a 
problem, but strings and some other struct based ranges are implicit 
forward ranges and save upon passing.

Slicing doesn't exist on input ranges, though I'd almost argue [] should 
take the place of save() in a forward range.

> - Jonathan M Davis
> 
> 
> P.S.  Just as a note, since you seem to like not using braces if you
> don't need to for if statements and the like, I though that I'd point
> out that try and catch don't require braces in D if htey contain only
> one statement. You have several one line catches which use braces when
> they don't need to. I'm not arguing that you should or shouldn't change
> it, but I thought that I would point out that the braces aren't
> necessary in case you didn't know (since they _are_ necessary in other
> languages).

Thanks, hopefully I'm consistent in that regard and I'll just leave it. I 
knew of try, but not catch.
November 11, 2011
Re: Review of Jesse Phillips's CSV Parser
On Friday, November 11, 2011 03:17:10 Jesse Phillips wrote:
> > I also have some API design concerns. Primarily, most of the type names
> > are overly broad and unclear. For example: Record, Records, Malformed
> > could just as easily be part of a database API. And because of this, not
> > only does one reading the code not know what the object is, but if
> > namespace conflicts can occur, then code reviewers will probably make
> > the wrong assumption half the time as to what Record is and not bother
> > looking it up.
> 
> The reason it is name like what you would find in a database, is because
> they are the same thing. It is a row of data. In past discussion about
> existing name clashes it was concluded that was what modules, static, and
> named imports wore for.

Actually, I'd argue that Row would be better than Record, since it _is_ a row 
in a table. Personally, I'd find it to be more immediately clear that way. With 
Record, I have to figure out what the heck it is a record of, whereas Row is 
immediately obvious in this context.

- Jonathan M Davis
November 11, 2011
Re: Review of Jesse Phillips's CSV Parser
On 02.11.2011 09:02, Jacob Carlborg wrote:
> On 2011-11-01 02:28, Jesse Phillips wrote:
>> On Mon, 31 Oct 2011 19:21:02 +0200, Vladimir Panteleev wrote:
>>
>>> On Fri, 28 Oct 2011 16:18:27 +0300, dsimcha<dsimcha@yahoo.com> wrote:
>>>
>>>> Docs:
>>>> http://nascent.freeshell.org/programming/D/doc/phobos/std_csv.html
>>>
>>> Checked the new docs today. If I'm reading them right, the top example
>>> prints:
>>>
>>> "Fred works as a Fly and earns $4 per year"
>>>
>>> Is this a pop culture reference I'm not catching?
>>
>> No, not that I know of. Should it be? Should I go with a more
>> professional like example?
>>
>>> An idea I had the other day was to include convenience presets for the
>>> more common flavours of CSV formats, e.g.: csvText!CSVFormat_Excel(...)
>>
>> Well if Excel actually conforms to what it claims:
>>
>> http://office.microsoft.com/en-us/excel-help/excel-formatting-and-
>> features-that-are-not-transferred-to-other-file-formats-HP010014105.aspx?
>> CTT=5&origin=HP010099725#BM4
>>
>> Which I doubt it does, then my parser already defaults to being able to
>> read such formats. In my experience Excel sucks at CSV and follows no
>> rules.
>>
>> The implementation I choose as default is the most common and any other
>> style is just a butchafication and likely to be unreliable, usually stems
>> from "my data has commas, so I'll use colon." Which is great until the
>> data also has colon. But maybe that is just my experience.
>
> Excel + CSV == Pain in the ass
>
> Excel uses different value delimiters as the default setting depending
> on the locale of the Excel application.

Far more horrible than that -- in an old Excel from around 2000 the user 
interface used US settings for reading short dates, if the day of the 
month was 12 or less. Otherwise it uses the locale. But it always uses 
the locale for displaying them.

So 11/1, 12/1, 13/1 gets changed to 1 Nov, 1 Dec, 13 Jan.

I wrote a macro called "November 9" (9/11) for swapping them back if 
they were in the buggy range.
November 11, 2011
Re: Review of Jesse Phillips's CSV Parser
On 2011-11-11 12:50, Don wrote:
> Far more horrible than that -- in an old Excel from around 2000 the user
> interface used US settings for reading short dates, if the day of the
> month was 12 or less. Otherwise it uses the locale. But it always uses
> the locale for displaying them.
>
> So 11/1, 12/1, 13/1 gets changed to 1 Nov, 1 Dec, 13 Jan.
>
> I wrote a macro called "November 9" (9/11) for swapping them back if
> they were in the buggy range.

Haha that's just insane.

-- 
/Jacob Carlborg
1 2 3 4 5 6
Top | Discussion index | About this forum | D home