GnuCash
Contact   Instructions
Bug 797730 - Transaction matching can match multiple imported transactions to the same existing one
Summary: Transaction matching can match multiple imported transactions to the same exi...
Status: RESOLVED FIXED
Alias: None
Product: GnuCash
Classification: Unclassified
Component: Import - Other (show other bugs)
Version: git-maint
Hardware: PC Mac OS
: Normal normal
Target Milestone: ---
Assignee: import
QA Contact: import
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-05-07 00:42 EDT by Jean Laroche
Modified: 2020-08-02 15:02 EDT (History)
5 users (show)

See Also:


Attachments
UI mockup :) (30.25 KB, image/png)
2020-05-07 02:47 EDT, Christopher Lam
no flags Details
benchmark hungarian algorithm (925 bytes, text/x-c++src)
2020-05-09 02:41 EDT, Christopher Lam
no flags Details

Description Jean Laroche 2020-05-07 00:42:02 EDT
When importing transactions, the matcher can match multiple imported transactions to the same existing transaction. This, of course, should not be possible.
To reproduce, create or edit an existing OFX file with several transactions posted a few days apart with the same amount, and the same memos.

Add a transaction in one of your GC accounts with the same amount, memo etc.
Then import the OFX.
You'll see that all the imported transactions are matched to the existing one in the register.
Comment 1 Christopher Lam 2020-05-07 02:47:17 EDT
Created attachment 373672 [details]
UI mockup :)

Here's my vision for a better matching UI. Are you brave to try?
Comment 2 Christopher Lam 2020-05-07 03:04:46 EDT
Instead of drag&drop you can also have buttons for each pane splits:

LHS imported splits:
Clicking [Skip] moves split to ignored pane
Clicking [Match] moves to RHS to choose an existing split
Clicking [NEW] does nothing and moves to next imported split.

RHS existing split (after clicking [Match])
[Up] [Down] to locate desired line
[Match] to trigger matching, moving both LHS imported and RHS existing lines to matched splits

LHS ignored splits
Click [Restore] to restore to imported splits

LHS matched splits
Click [Unmatch] to restore to imported splits

All:
[TAB] [Shift-TAB] moves to the next/prev pane
[Up] [Down] moves to prev/next line
Comment 3 Christopher Lam 2020-05-07 04:51:04 EDT
The problem is in the auto-matcher rather than UI (e.g. how does the automatcher handle 2 OFX splits both with same text/amount <-> 2 account splits with same text/amount?) but I'd wager a better UI is part of the solution here...
Comment 4 Bob 2020-05-07 05:57:14 EDT
It's an interesting idea, so effectively change main-import-matcher to use potentially 6 tree views with drag and drop.

Just wondering on screen real estate, can you change your image to landscape and see how it fits with may be a button row at the bottom, is there enough room?

Maybe for 5.0!!!
Comment 5 Christopher Lam 2020-05-07 09:30:48 EDT
@Bob: I don't know how to use glade properly to test. But in the mockup, the top-panes can be short, the middle vs lower panes can be malleable. Also I made logic mistake: RHS shows splits with online_id and splits without online_id, instead of reconciled/unreconciled

+-----------------------+     +----------------------------------+
| ignored list          |     | splits with online_id            | <- not tall
+-----------------------+     +----------------------------------+

+-----------------------+     +----------------------------------+
| ofx splits            |     | splits without online_id         |
+-----------------------+     +----------------------------------+ <- flexible
| matched splits        |     | matched splits                   |
+-----------------------+     +----------------------------------+

@Jean

I think the OFX splits and the splits_without online_id should be matched with a suitable algorithm first. The current algorithm is very naive, and leads to double matching as you have found. May I suggest https://users.cs.duke.edu/~brd/Teaching/Bio/asmb/current/Handouts/munkres.html or https://launchpad.net/lib-bipartite-match -- it involves calculating a "cost" score for each pair; N ofx splits to M existing splits will need calculating N*M scores. This shouldn't take too much time.

I'm not sure where the C match score calculator is; I know the QIF matcher is at https://github.com/Gnucash/gnucash/blob/maint/gnucash/import-export/qif-imp/qif-merge-groups.scm#L109 and is not perfect either.
Comment 6 Jean Laroche 2020-05-07 11:35:43 EDT
I respectfully think you guys are over-engineering this.
I agree that the current algo is very naive, and has bugs (many-to-one matching).
The NxM matching you suggest is definitely an option, but in my view 99% of the time it's completely overkill because it's rare that conflicts occur (i.e., two ofx transactions wanting to match the same register transaction).
99% of the time, the ofx transaction gets its favorite match, and no other transaction wants it.

Now what to do about the 1% of the time it's not doing the right thing? I don't think we have to sweat it too much. in particular, I don't think it's wise to design a complicated UI to fix corner cases. That's going to take a lot of time and effort and result in a small improvement.
This time and effort would be better spent (again, in my humble opinion) toward far more essential shortcomings, of which GC has many (no undo/redo, very basic operations usually are unavailable like group-delete, group-edit, etc).

Here's what I suggest:
- We update the matching algo so that it no longer offers to match previously cleared or reconciled transactions (that's really easy, but maybe I'm missing something here, perhaps people want to be able to match transactions that have been previously cleared?)
- We update the algo to detect many-to-one situations. I suggest the following (a greedy algo, which is definitely NOT optimal, but will give a good solution, if not the best).
   . Go through each ofx trans, get its best match, find whether it's claimed by other ofx transations (many-to-one situation). Make a list of them
   . If the list has more than 1 element, 
      +select the ofx transaction that has the best match score, and decide that this is the one getting the match (greedy).
      + All other ofx transactions in the list are now assigned their next best match, if any.
      + Start the process again, because we may now have new conflicts.


This algo is also naive, and sub-optimal (it's greedy), but it will solve 99% of the remaining 1% of conflict cases and so I believe it's enough to make us happy!

Then we can move on and fix the more pressing shortcomings of GC!

I can implement this ^^ if you guys agree (in fact, I already started).
Jean
Comment 7 Christopher Lam 2020-05-07 11:45:16 EDT
(In reply to ripngo@gmail.com from comment #6)
> I respectfully think you guys are over-engineering this.

Of course we are...

The Hungarian algorithm doesn't need to be implemented, however I still very much dislike the current import matcher - IMHO having two panes is not enough. To me the red/amber/green coding and the analog score makes not much sense, I do prefer LHS=ofx RHS=register view myself; if we can do it I'm 100% sure it will be very well received.

> Here's what I suggest:
> - We update the matching algo so that it no longer offers to match
> previously cleared or reconciled transactions (that's really easy, but maybe
> I'm missing something here, perhaps people want to be able to match
> transactions that have been previously cleared?)

I think instead of matching cleared/unreconciled transactions, it should attempt to match any split without an online_id.

Good luck with your algorithm, I'm sure will be better than the current one. If not, there are solutions already out there.
Comment 8 John Ralls 2020-05-07 12:30:25 EDT
> (In reply to ripngo@gmail.com from comment #6)
> I respectfully think you guys are over-engineering this.
> I agree that the current algo is very naive, and has bugs (many-to-one
> matching).
> The NxM matching you suggest is definitely an option, but in my view 99% of
> the time it's completely overkill because it's rare that conflicts occur
> (i.e., two ofx transactions wanting to match the same register transaction).
> 99% of the time, the ofx transaction gets its favorite match, and no other
> transaction wants it.

Careful. Just because it's only a 1% problem in your experience doesn't make that so for other users. Since there's been an explicit complaint on gnucash-user about the NxM problem we have to accept that it's a significant pain point as only a handful of users actually participate in the mailing lists.

It's also really important not to hyperfocus on a single use-case, that's where design bugs come from.
Comment 9 Jean Laroche 2020-05-08 12:04:20 EDT
I opened another bug https://bugs.gnucash.org/show_bug.cgi?id=797737 for the part about matching transactions that have previously been matched. (and opened a PR for it, an easy fix).
This bug is about many-to-one matching, which is more difficult to solve.
Comment 10 Jean Laroche 2020-05-08 12:19:05 EDT
@Christopher Lam,
Thanks for the reference to the Hungarian algo! I did not know about it, and mistakenly thought the assignment problem was an exponential one!
https://en.wikipedia.org/wiki/Hungarian_algorithm

Very cool!
I implemented the simple idea I described above and it works as expected (and solves the issue described in this bug report).
I'm wondering if the Hungarian algo would be a lot more complex to implement (it looks like it would be, but who knows).
Comment 11 Jean Laroche 2020-05-08 12:36:22 EDT
BTW there's a C++ implementation here:
https://github.com/mcximing/hungarian-algorithm-cpp 
which we could easily leverage I imagine. But again, this could be very overkill...
Comment 12 Christopher Lam 2020-05-08 14:55:29 EDT
I think it would be a good thing to use. Any home-grown algorithm would be less efficient.
Comment 13 Jean Laroche 2020-05-08 15:00:01 EDT
(In reply to Christopher Lam from comment #12)
> I think it would be a good thing to use. Any home-grown algorithm would be
> less efficient.

Well not necessarily: imagine that in most cases we have no conflict, we might spend an inordinate amount of time running that algo, for no benefit. I don't know whether the algo returns quickly if the solution is "trivial"...
Comment 14 Christopher Lam 2020-05-08 15:11:23 EDT
The only way to find out is to try!
Comment 15 Christopher Lam 2020-05-09 02:41:56 EDT
Created attachment 373688 [details]
benchmark hungarian algorithm

I've taken up the challenge to practise some C++ ;-)

Create a 1000x3000 matrix, indicating 1000 ofx splits * 3000 account splits: runs as follows after logging all cost scores:

real	0m2.021s
user	0m0.750s
sys	0m0.099s

and removing all cout logging:

real	0m0.266s
user	0m0.230s
sys	0m0.036s

i.e. safe to say that the algorithm is acceptable.

The most difficult issue is to decide on the scoring method. It'll be fine to reuse the existing scores.
Comment 16 Jean Laroche 2020-05-09 23:27:23 EDT
I did a pull request for the simple solution I suggested. It think it's adequate (much better than before, that's for sure). I don't know how often the fully optimal solution provided by the hungarian algo would be an improvement over what I wrote, I'd be curious to find out...

The PR is here.
https://github.com/Gnucash/gnucash/pull/718

J
Comment 17 Christopher Lam 2020-05-09 23:52:33 EDT
Would be interesting indeed to try one with a formally proven algorithm. As a matter of principle that gnucash will always strive to do the correct thing rather than a home-grown algorithm.

Moreover for future hackers, adjusting afterwards would involve modifying the scoring rather than trying to decode your code seems to me more "correct".
Comment 18 Jean Laroche 2020-05-10 12:49:12 EDT
(In reply to Christopher Lam from comment #17)
> Would be interesting indeed to try one with a formally proven algorithm. As
> a matter of principle that gnucash will always strive to do the correct
> thing rather than a home-grown algorithm.
> 
> Moreover for future hackers, adjusting afterwards would involve modifying
> the scoring rather than trying to decode your code seems to me more
> "correct".

Here's my thoughts. A fix, even if sub-optimal, is better than no fix at all, especially if it's going to be optimal almost all the time. I wouldn't refer to what I implemented as "my algorithm" as it's a very standard greedy approach to optimization, really not home-grown. It would suggest to adopt the current fix, but add a note in the code mentioning the optimal algorithm and indicating that if optimality problems arise, hackers should go implement the fully optimal Hungarian algo. 
Note that the hungarian algo is definitely best for cases where the matrix is fully populated (i.e., full conflicts, even OFX transaction would like to match every register transaction, with a reasonably good score). Of course, this is far from frequent in GC! In the majority of cases, there's 1 match with a good enough score so the matrix has only 1 non-zero entry in each column and row (because dates and amounts have to match reasonably well).

In any cases, these are my 2 cents.
J.
Comment 19 John Ralls 2020-05-10 13:23:32 EDT
Would it make sense to allow the user to tune matching parameters such as "within n days" and ±x.yz amount? Not necessarily for this bug nor even for 4.0.
Comment 20 Jean Laroche 2020-05-10 13:41:59 EDT
(In reply to John Ralls from comment #19)
> Would it make sense to allow the user to tune matching parameters such as
> "within n days" and ±x.yz amount? Not necessarily for this bug nor even for
> 4.0.

My personal opinion is: I find the threshold options for matching very confusing (in the preferences). I had to re-read them several times (the help balloons) and I'm not even sure I understand them yet. Having something that says +/- days/amount is easier to understand than the thresholds we currently have, I think.
Comment 21 John Ralls 2020-05-10 18:12:40 EDT
I agree, especially given that nowhere is it documented what the "red zone" is or how the scores are arrived at.
Comment 22 Christopher Lam 2020-05-11 11:07:16 EDT
Re: Hungarian - I'm not too fussed really. It seems a good idea. Maybe in the distant future. https://github.com/Gnucash/gnucash/pull/718 may be ready to merge.

Re: import matcher - I'm sure you have seen it is rather inefficient; each ofx split gets matched to potential account splits retrieved via qof_query. The latter is known to be slow. This explains https://github.com/Gnucash/gnucash/blob/maint/gnucash/import-export/import-backend.c#L856 comment.

See https://github.com/Gnucash/gnucash/commit/944e7814 for the equivalent QIF matching being sped up -- previously each qif-split similarly generated a new qof_query for potential matches; now there is only one qof_query to retrieve all potential matches into the "old-splits" list, and compares each qif-split with old-splits. This speed upgrade was uneventful - i.e. no user has complained (or even noticed!). QIF matching would take 0.2s per split, now nearly instantenous.

Would this be of interest to try speeding up the C code? It looks complicated to me and may require some refactoring first.
Comment 23 Jean Laroche 2020-05-11 12:02:36 EDT
I did notice the inefficiency and there's a long comment in the code specifically about that.
I'm up for refactoring it, it's not a huge deal and I can use the QIF example, and yes, the gist is to using only 1 query at the end...
Note that this is completely orthogonal to the issue in this bug. The fix I proposed for the many-to-one happens at the very end of the import of course.
J.
Comment 24 Jean Laroche 2020-05-12 01:26:22 EDT
I created a PR https://github.com/Gnucash/gnucash/pull/721 for the speedup fix.
The code is fairly simple...
Comment 25 John Ralls 2020-08-02 15:02:50 EDT
PR 721 is merged for GnuCash 4.2.

Note You need to log in before you can comment on or make changes to this bug.