GnuCash
Contact   Instructions
Bug 797847 - Best match probability calculation on import is too pessimistic
Summary: Best match probability calculation on import is too pessimistic
Status: NEW
Alias: None
Product: GnuCash
Classification: Unclassified
Component: Import - OFX (show other bugs)
Version: 4.0
Hardware: PC Mac OS
: Normal normal
Target Milestone: ---
Assignee: import
QA Contact: import
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-05 16:25 EDT by David Reiser
Modified: 2020-07-18 18:33 EDT (History)
6 users (show)

See Also:


Attachments
transaction csv export from gnucash that isn't matched by bank's ofx (568 bytes, text/plain)
2020-07-06 21:20 EDT, David Reiser
no flags Details
ofx file that does not successfully match 1-day off transaction in gnucash (804 bytes, text/plain)
2020-07-06 21:24 EDT, David Reiser
no flags Details
gnucash exported transaction that does match an ofx import (444 bytes, text/plain)
2020-07-06 21:25 EDT, David Reiser
no flags Details
ofx file with successful match to 1-day off transaction in gnucash (758 bytes, text/plain)
2020-07-06 21:28 EDT, David Reiser
no flags Details

Description David Reiser 2020-07-05 16:25:06 EDT
Under the transaction matcher behavior in gnucash 4.0, a credit card transaction which posts one day after the transaction date (as entered in a gnucash transaction) gets a best match probability score too low to ever meet the auto-match threshold during an ofx import processing run. For example, I bought some stamps at the post office on July 2 and entered the transaction in gnucash with the July 2 date. The credit card company posted the transaction on July 3. When I imported transaction on July 5 (aqbanking direct connect), the transaction showed up in the matcher window with no probable match acknowledged (no green bars in the match window Info column).

My credit card company will show me the transaction date as the primary information on their web site, but ofx/qfx files appear to contain only the date posted. I never let the import matcher update my transaction data -- only clear them.

From my perspective, transactions posting within 3 days of the transaction date should still get a match score of 6. One day after transaction ought to be 8. One of the grocery stores I go to frequently has never posted the charge to the credit card company the next day. Sometimes there’s only a 2-day gap, but it is frequently 3 days. I suspect the difference between 2 and 3 days is related to how late in the evening I go shopping.

For my checking account, a one day delay transaction->posting date still met the minimum requirements for auto-match. My credit union does not participate in OFX Direct Connect, so the checking account data was imported via File>Import>Import OFX/QFX.
Comment 1 Jean Laroche 2020-07-05 17:15:51 EDT
I'm going to take a look at this issue.
Comment 2 Jean Laroche 2020-07-05 23:08:53 EDT
Here's what's in the matching algo, date wise:
If you start from a matching amount, this gives you a prob of 3.
Then,
- If the dates are less than 1 day off you add +3
- If the dates are 4 days apart or less, you add +2
- If the dates are more than 14 days apart, you remove 5. 
You can get extra points for matching memos etc, but the above is more or less what's happening. 
So transactions whose dates are 4 days apart or less should have a prob of 5. 

I'm a bit hesitant to alter these values as they've been around forever. Could you possibly lower your auto-clear threshold to 5 and see if that works better for you? That's in the preferences (Online Banking)

Let me know.
Jean
Comment 3 David Reiser 2020-07-06 00:03:53 EDT
Auto-clear cannot be set to less than 6.

But a matching amount and one day off is giving me a score of 5 for aqbanking acquired qfx credit card transactions, where the same circumstances give a score of 6 for libofx checking account imports.
Comment 4 Jean Laroche 2020-07-06 01:12:46 EDT
Well on top of what I described above, a matching memo, or description can give you an extra point or two, so perhaps that's what's going on here?
Comment 5 David Reiser 2020-07-06 09:42:03 EDT
Well, the behavior has changed in the direction of a lower score with no change in my recording habits. 

I use notes, but almost never memos. And the description field can't match unless it's with an invisible token from a prior import. But I'm still buying most things from vendors I've used in the past.
Comment 6 Jean Laroche 2020-07-06 12:55:51 EDT
I checked the code, it does not look like anything has changed recently on that front so I'm not sure what's going on here, and I won't be able to check unless I can take a look at both sources of imports. May not be worth the trouble.
How about lowering the threshold to 5 and seeing whether this makes things OK?
Comment 7 Jean Laroche 2020-07-06 12:57:58 EDT
Also, the way the day difference is computed may give different results based on the time of day for the transaction! (silly but that's the way it is). So it's possible that if one import gives you the same date, but a different time of day (for some reason) than the other, you could get one that trigger "same day" -> 3 points and the other that triggers "4 days or less" -> 2 points.

J.
Comment 8 David Reiser 2020-07-06 14:34:59 EDT
(In reply to Jean Laroche from comment #6)
> I checked the code, it does not look like anything has changed recently on
> that front so I'm not sure what's going on here, and I won't be able to
> check unless I can take a look at both sources of imports. May not be worth
> the trouble.
> How about lowering the threshold to 5 and seeing whether this makes things
> OK?

Gnucash does not allow the threshold to be set at 5 or lower
Comment 9 David Reiser 2020-07-06 14:41:59 EDT
Perhaps something that is different now is that the Info column in the matcher window does not show any bar graph for scores of 5 or lower (at least when no transactions have a higher score). My recollection is that the info column used to be populated with the score bar graph regardless of score. And maybe what I am remembering is that any time I saw a score of 5 there was almost certainly a good match if I clicked on the C column (formerly R column).
Comment 10 Jean Laroche 2020-07-06 16:39:37 EDT
(In reply to David Reiser from comment #8)
> (In reply to Jean Laroche from comment #6)
> > I checked the code, it does not look like anything has changed recently on
> > that front so I'm not sure what's going on here, and I won't be able to
> > check unless I can take a look at both sources of imports. May not be worth
> > the trouble.
> > How about lowering the threshold to 5 and seeing whether this makes things
> > OK?
> 
> Gnucash does not allow the threshold to be set at 5 or lower

I crap, indeed! OK then something has to be changed. Either we allow lowering that threshold to 5 or below, or we raise the +3 / +2 bonus for exact date or <= 4 days difference to +4 / +3 ...

I don't have a preference myself, but it seems like if we lower the min threshold this won't impact anyone by default and you'll be able to adjust it to your liking, assuming of course that that would work.
I'm going to make the change, then you can get the build from the repo and try it to see if that works for you...
J.
Comment 11 David Reiser 2020-07-06 21:16:18 EDT
(In reply to Jean Laroche from comment #10)
> (In reply to David Reiser from comment #8)
> > (In reply to Jean Laroche from comment #6)
> > > I checked the code, it does not look like anything has changed recently on
> > > that front so I'm not sure what's going on here, and I won't be able to
> > > check unless I can take a look at both sources of imports. May not be worth
> > > the trouble.
> > > How about lowering the threshold to 5 and seeing whether this makes things
> > > OK?
> > 
> > Gnucash does not allow the threshold to be set at 5 or lower
> 
> I crap, indeed! OK then something has to be changed. Either we allow
> lowering that threshold to 5 or below, or we raise the +3 / +2 bonus for
> exact date or <= 4 days difference to +4 / +3 ...
> 
> I don't have a preference myself, but it seems like if we lower the min
> threshold this won't impact anyone by default and you'll be able to adjust
> it to your liking, assuming of course that that would work.
> I'm going to make the change, then you can get the build from the repo and
> try it to see if that works for you...
> J.

That will probably work. It'll take me a while to get the build done. I haven't built Gnucash for MacOS since I could build an X11 version. jhbuild confused me enough the couple times I tried it that I haven't tried for a while.

In the meantime, I'll post the two pairs of transaction+ofx files that gave different match behavior.
Comment 12 David Reiser 2020-07-06 21:20:53 EDT
Created attachment 373778 [details]
transaction csv export from gnucash that isn't matched by bank's ofx
Comment 13 David Reiser 2020-07-06 21:24:05 EDT
Created attachment 373779 [details]
ofx file that does not successfully match 1-day off transaction in gnucash

ofx <TRNTYPE> is DEBIT
ofx transaction date posted is one day later than date assigned to transaction in gnucash (transaction exported as 'matcherfail.csv')
Import matcher score is 5
Comment 14 David Reiser 2020-07-06 21:25:00 EDT
Created attachment 373780 [details]
gnucash exported transaction that does match an ofx import
Comment 15 David Reiser 2020-07-06 21:28:29 EDT
Created attachment 373781 [details]
ofx file with successful match to 1-day off transaction in gnucash

ofx <TRNTYPE> is XFER
ofx date posted is one day later than transaction in gnucash
Import matcher score is 6 (successfully auto-matches)

The only apparent difference with example failed match is <TRNTYPE> difference
Comment 16 Jean Laroche 2020-07-07 11:51:52 EDT
Thanks David, I'll take a look at the files.

In the meantime, I did a PR to lower the minimum threshold. 
https://github.com/Gnucash/gnucash/pull/750/files
That's something I think you can test yourself, without rebuilding the code. Apparently glade files are loaded at run time, so if you modify dialog-import.glade in your installation folder and change

   <object class="GtkAdjustment" id="auto_clear_adj">
      <property name="lower">6</property>
to
   <object class="GtkAdjustment" id="auto_clear_adj">
      <property name="lower">5</property>

you should be able to lower the threshold in your preferences and see if that helps. You'll have to locate that glade file, I'm not sure where it's saved in the installation folder.

Jean
Comment 17 Jean Laroche 2020-07-07 12:15:55 EDT
Correction, the value to modify is in dialog-preferences.glade in GC 4.0 not in dialog-import.glade.

  <object class="GtkAdjustment" id="auto_clear_adj">
    <property name="lower">6</property>
    <property name="upper">12</property>
    <property name="step_increment">1</property>
    <property name="page_increment">10</property>
  </object>

I just checked that if I manually change this value, I'm able to lower the threshold in the prefs.
Comment 18 David Reiser 2020-07-07 12:53:57 EDT
(In reply to Jean Laroche from comment #17)
> Correction, the value to modify is in dialog-preferences.glade in GC 4.0 not
> in dialog-import.glade.
> 
>   <object class="GtkAdjustment" id="auto_clear_adj">
>     <property name="lower">6</property>
>     <property name="upper">12</property>
>     <property name="step_increment">1</property>
>     <property name="page_increment">10</property>
>   </object>
> 
> I just checked that if I manually change this value, I'm able to lower the
> threshold in the prefs.

I can also change the pref after modifying the glade file.

With the lower limit set to 5, an ofx import will auto-match both 1-day and 3-day offset dates between gnucash transaction date and ofx import posted date. I like that.

I think 1-day offsets ought to get a higher match probability score, though.

The file on my mac was:
/Applications/Gnucash.app/Contents/Resources/share/gnucash/gtkbuilder/dialog-preferences.glade
Comment 19 Benoit Grégoire 2020-07-07 14:28:36 EDT
Definitely the lower level allowed by the dialog should be changed.

However, i'd be weary of raising the 1 day date score (I wrote the original matcher), or even lowering the default. In this case, a score of 5 means that the transaction matched the date within one day, matched the amount exactly and nothing else.  That's not very strong evidence, and I know for a fact false positives are a problem.

The reason the fist file doesn't match and the second does is that the second bank is sending a date with a time zone, which probably pushes the date within a day in gnucash.  

You cannot imagine how horrible date processing is in the financial industry, which requires a bunch of shaky assumptions to make it more or less work.

The intent of the code was:

If the transaction is on the same day, +3
If the transaction is within 4 days, +2
Outside of that: -5

Now in your case, a credit card transaction is systematically posted on the next day (I'm sure it is a frequent case, but by no means systematic).  

What I propose, to avoid debalancing the amount-date score and other assumptions in the code, is to change the code like so:
in import-backend.c, if gnucash account type is ACCT_TYPE_CREDIT, use datediff_day <= 1 for +3, otherwise keep datediff_day == 0

Extra refinement would be to change the diff calculations (move the llabs down into the conditions) so that for this specific case, the code discriminates between no more than one day in the future and within two days.  This would avoid the morning coffee you buy every day having the same score on two subsequent day.

Beyond that, there is not much more we can do without remembering past matches (right now, bayesian matching is only used to find the destination account)
Comment 20 John Ralls 2020-07-18 18:33:57 EDT
I've merged Jean's PR lowering the minimum threshold by 1.

Note You need to log in before you can comment on or make changes to this bug.