GnuCash
Bug 797030 - Import Customer & Vendors: several issues with the matching of data rows
Summary: Import Customer & Vendors: several issues with the matching of data rows
Status: RESOLVED FIXED
Alias: None
Product: GnuCash
Classification: Unclassified
Component: Import - CSV
Version: git-maint
Hardware: PC Mac OS
Importance: Normal normal
Target Milestone: ---
Assignee: Rob Laan
QA Contact: import
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-08 14:03 EST by Rob Laan
Modified: 2019-03-29 18:03 EDT (History)
4 users

See Also:


Attachments
Import file to illustrate the bug (35 bytes, text/plain)
2019-01-08 14:09 EST, Rob Laan

Description Rob Laan 2019-01-08 14:03:27 EST
In the function Import Customers & Vendors, the regular expressions used for matching the import data rows have a number of issues:

In the default regular expression, 14 of the 18 separators are defined as optional, and the regular expression has no end of line pattern ($).
In the regular expressions for comma separated and semi-colon separated, 11 of the 18 separators are defined as optional, and both have no end of line pattern.

This causes all kinds of issues in the matching process. E.g. a data row with just 4 fields could be imported, with values moving to unexpected data fields in the import.
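The field-shifting effect of optional separators can be sketched with a toy 4-field pattern (purely illustrative; the importer's actual regexes have 18 fields and are not reproduced here):

```python
import re

# Toy 4-field version of the importer's pattern: the first two separators
# are optional (",?"), the last one is mandatory, and there is no "$" anchor.
# (Illustrative only -- not GnuCash's actual regex.)
loose = re.compile(r'^([^,]*),?([^,]*),?([^,]*),([^,]*)')

# A row with only three fields still matches: the engine backtracks so that
# the mandatory separator is satisfied, and "c" lands in the *last* field
# while the third field comes out empty.
m = loose.match('a,b,c')
print(m.groups())   # ('a', 'b', '', 'c')
```

With all separators mandatory, the three-field row would simply fail to match instead of importing with shifted values.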
Comment 1 Rob Laan 2019-01-08 14:09:36 EST
Created attachment 373116 [details]
Import file to illustrate the bug

A comma-separated input file with just seven fields, instead of the 18 fields of the import data structure. The comma-separated import populates the first 5 and the last 2 fields of the import data structure.
Comment 2 Rob Laan 2019-01-08 14:10:23 EST
And in the default regular expression, the import data field 'notes' is missing.
Comment 3 Geert Janssens 2019-01-09 07:09:08 EST
(In reply to Rob Laan from comment #0)
> In the function Import Customers & Vendors, the regular expressions used for
> matching the import data rows have a number of issues:
> 
> In the default regular expression, 14 of the 18 separators are defined as
> optional, and the regular expression has no end of line pattern ($).
> In the regular expressions for comma separated and semi-colon separated, 11
> of the 18 separators are defined as optional, and both have no end of line
> pattern.
> 
> This causes all kinds of issues in the matching process. E.g. a data row
> with just 4 fields could be imported, with values moving to unexpected data
> fields in the import.

The default regex having a different number of optional separators from the comma/semi-colon based regex is certainly a consistency bug.

Being able to import rows with fewer than the mandatory number of fields is as well.

Without looking at the code it's not clear to me how a missing end of line pattern is an issue here? Is it because some mandatory fields are at the end of the regex?

As a side note I'll mention here that one of my longer term design goals was/is to port all importers to C++. For CSV files the basis has already been created, and both transaction and price import from CSV files are using this framework. I think the business imports for customers/vendors/invoices/bills could also benefit from this. You may consider looking into doing this port?

IMO the generic framework is more flexible:
- as user you can configure the csv file format (order of columns, separator,...) and save that as a preset for future reuse.
- as developer you can specify which columns are required, offer presets and helpful error messages
Additionally, the shared code has some advantages:
- importing becomes a more consistent experience: regardless of what you import, you get a similar interface
- improvements in parts of that shared code benefit all importer variants.
Comment 4 Rob Laan 2019-01-10 15:48:01 EST
The missing end of line pattern is an issue because, without it, if the user provides more than the required number of fields, the row would still be imported. E.g. if the user inadvertently uses the separator character *within* a field, the matching pattern would recognise that as an extra field, causing all subsequent data import fields to shift. Enclosing the matching pattern in start/end of line patterns ensures that only rows with exactly the required number of separators are imported, which I think is what the user should expect.
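The effect of the anchors can be sketched with a toy fully-anchored 4-field pattern (illustrative only, not the importer's actual regex):

```python
import re

# All three separators mandatory, and the whole pattern anchored ^...$,
# so a row must contain exactly three commas to match.
strict = re.compile(r'^([^,]*),([^,]*),([^,]*),([^,]*)$')

# Exactly three separators: matches as expected.
print(strict.match('id-1,ACME,John,notes').groups())
# ('id-1', 'ACME', 'John', 'notes')

# A stray separator inside a field adds a fourth comma; with the "$" anchor
# the row is rejected outright instead of silently shifting later fields.
print(strict.match('id-1,ACME,John,Doe,notes'))   # None
```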

I will have a look into the CSV import for transactions and prices, and see if I would be able to tackle that port; my C++ experience is limited, but this may be an opportunity to improve that.

However, I'd first want to finish my review, update, bug fix and documentation of the current situation.
Comment 5 John Ralls 2019-01-10 16:02:54 EST
Tests would be good, too, especially if you're contemplating rewriting in C++.
Comment 6 Rob Laan 2019-01-13 14:19:21 EST
In addition, the default matching pattern defines that a number of fields must have content (by using a + instead of a * for the field). But that is not consistent with the other matching patterns (all fields with a *). Validation of mandatory fields is done in the code.
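The `+` versus `*` difference can be shown with two minimal one-field patterns (illustrative only):

```python
import re

# "+" requires at least one character in the field; "*" also accepts an
# empty field. The patterns below are sketches, not the importer's regexes.
required = re.compile(r'^([^,]+),')
optional = re.compile(r'^([^,]*),')

print(required.match(',rest'))            # None: empty field rejected
print(optional.match(',rest').group(1))   # '': empty field accepted
```

If empty-field validation is done later in the code anyway, using `*` everywhere keeps the matching patterns consistent.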

Furthermore, the Byte Order Mark, which many applications place at the start of the file to indicate a UTF-8 file, is matched as part of the customer/vendor ID. See similar bug 796991 "Import Bills & Invoices: Unicode Byte Order Mark included in imported invoice number".
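To the matching pattern a BOM is just another character, so it ends up glued to the first captured field. A minimal sketch of the symptom and of stripping the BOM before matching (an assumption about the fix, not the importer's actual code):

```python
# U+FEFF is the Byte Order Mark some applications prepend to UTF-8 files.
first_line = '\ufeffC0001,ACME Corp'

# Naive field splitting leaves the BOM attached to the customer/vendor ID.
print(repr(first_line.split(',')[0]))   # '\ufeffC0001'

# Stripping the BOM before matching yields the intended ID.
clean = first_line.lstrip('\ufeff')
print(repr(clean.split(',')[0]))        # 'C0001'
```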
Comment 7 John Ralls 2019-01-13 14:46:47 EST
Only Microsoft puts the unnecessary and not-recommended BOM on UTF-8 files. That's not to suggest that we shouldn't be skipping it.

I wonder, though, if these massive REs are really the right way to be doing this. Wouldn't the effort be better spent redoing this as part of csv-imp?
Comment 8 Rob Laan 2019-01-13 15:08:38 EST
I have no opinion on the practicality of using the massive REs. I just see that they are there, and need some improvement. They seem to be pretty fast in the matching process, though.

I'm using the review and update of this functionality to get acquainted with the tooling (git, doxygen, docbook, gettext, Xcode etc.) and the ways of working. A lot of my current effort goes into that, not so much into the coding (although testing does take quite some effort). Having done most of the work on review, coding and documentation, I'd rather not abandon it now.

I am more than willing to dig into the redo as part of csv-imp, but if I first finish the update of current functionality, I hope I will be better equipped to tackle the redo.
Comment 9 Geert Janssens 2019-01-13 15:51:02 EST
I think a rewrite to become part of csv-imp should indeed be the long term goal. However, if it helps you get to grips with the project and you don't mind that your current work will eventually be replaced, by all means go ahead.

We can use a few more people who know how to work on the GnuCash code base.
Comment 10 John Ralls 2019-03-29 18:01:45 EDT
    Bug 797029 - Import Customer & Vendors: blank name and company in import data row crashes GnuCash
Comment 11 John Ralls 2019-03-29 18:03:26 EDT
This is fixed and will be released in GnuCash 3.5. Thanks for the report.
