QIF importer results in '?' appearing throughout description, memo, and notes fields in place of common diacritic characters like ä, ö, ü, Ä, Ö, Ü, ß, ... This is likely caused by the QIF importer not using the same character set/codepage as my QIF files, which are exported from Quicken 2016. Quicken exports these QIF files as Windows-1252.

===Setup
GnuCash 3.5
Windows 10

===Repro
1. Create transactions in Quicken 2016 that have payees and memos with the characters ä, ö, ü, Ä, Ö, Ü, ß
2. Export the account with those transactions to QIF
3. Import this QIF into GnuCash

===Result
The ä, ö, ü, Ä, Ö, Ü, ß characters are replaced with '?'

===Expected
The characters to remain ä, ö, ü, Ä, Ö, Ü, ß

===Suggestions
I suggest prompting the user for the character set during the import process. Prompting for it will do two things:
1. Enable correct conversion from the QIF character set to GnuCash's UTF-8. Libraries exist to do this conversion.
2. Indirectly notify advanced users, like me, that I can convert the files to UTF-8 before importing.

Auto-detection of the character set/codepage is probably not worth the effort unless it is already provided and well tested by a 3rd party library.

It may be useful to have exception/error handling to catch cases where the import routine does not convert correctly and substitutes a '?'. Some libraries can recognize that something is wrong with the character set. If that is caught, I recommend prompting the user once that the character set does not seem to be correct and asking whether to continue.

I think a good place for this prompt is at the very beginning, when the QIF is quickly scanned for date format, etc. Before that scan, the user can be prompted for the character set, which is then tested in this same scan.
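The conversion and the "catch a wrong character set" idea from the suggestions can be illustrated with a minimal Python sketch. This is not GnuCash code; the payee line and the helper names `decode_qif_bytes` and `looks_like` are made up for illustration. It only shows that a strict decode fails loudly on a wrong charset, while a lossy decode silently substitutes a replacement character.

```python
# Assumed sample payee line, encoded the way Quicken is reported to export it.
raw = "PMüller Bäckerei Straße".encode("cp1252")  # cp1252 == Windows-1252


def decode_qif_bytes(data: bytes, charset: str) -> str:
    """Decode strictly, so a wrong charset raises instead of producing junk."""
    return data.decode(charset, errors="strict")


def looks_like(data: bytes, charset: str) -> bool:
    """Return True when the bytes decode cleanly in the given charset."""
    try:
        decode_qif_bytes(data, charset)
        return True
    except UnicodeDecodeError:
        return False


# Correct charset: the text survives and can be re-encoded as UTF-8.
text = decode_qif_bytes(raw, "cp1252")
utf8_bytes = text.encode("utf-8")

# Wrong charset with lossy handling: umlauts become U+FFFD replacement marks.
lossy = raw.decode("utf-8", errors="replace")

if not looks_like(raw, "utf-8"):
    print("Character set does not seem to be UTF-8; ask the user.")
```

A check like `looks_like` is the kind of test that could run during the initial scan, before the date format is examined.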
Even better than a bug description are attachments of such QIF files and Quicken screenshots.
Hi. In the absence of them, I provided clear repro steps that you can use to create the scenario. You can also create your own QIF files to demonstrate the problem. The missing feature in GnuCash is that it does not manage the codepage/character set of the QIFs being imported. Since there is an unbounded variety of tools and programs that create QIFs, the exact source QIF is not that interesting. Instead, the suggestions above point to the likely resolution. Scanning/detecting/validating the QIF is needed, and potentially UI prompting as I describe. Do you use vscode? Look at the bottom right, where you can see what vscode *thinks* the character set of an open file is. But it can be wrong. And GnuCash is wrong in some cases. So this same idea of scan/detect/validate, and prompt when needed, is the topic at hand. And I see no value in a Quicken screenshot. What is your intention in asking for that?
Chris, the relevant code starts at https://github.com/Gnucash/gnucash/blob/maint/gnucash/import-export/qif-imp/qif-file.scm#L153.
Within a few days, I will create several sample QIFs to demonstrate the problem. I can see value in you having them so you can see how GnuCash works today with a subset of QIF files, and then see how it fails on the typical Quicken exports. I also think I've noticed a pattern. I'm a US citizen living in Germany. I use a lot of US tools/apps/financial things (bank downloads, accounts, currency, character sets, reports, etc.) and at the same time use Germany's things (Euro, German bank downloads, accounts, German-specific characters, reports, etc.). A German always doing German things, or a US/English user always doing English things, doesn't exercise GnuCash like I do. I do both, always, simultaneously. I'm definitely a minority in the GnuCash userbase. So the issues/situations that I find are probably not encountered by the majority. They are more often the edge cases. :-/
Sample files for unit testing are in codepages1.zip. This ZIP has 10 files. Each of these files is a QIF file.

The QIF files have things in common. All of the files...
* report transactions for the account 'My Online Bank:Checking'
* that account is a bank account
* it has a balance of $1000.00 on 10 April 2020
* only one transaction is reported
* transaction is for $100.00
* transaction number is 1234
* the transaction is composed of two splits ($70.00 and $30.00)

The files differ in the following:
* payee will use characters that exercise the encoding/character set
* memo will do the same
* for both of the 2 splits, the split category and split memo do the same

Each of the files is named with the encoding/character set it uses. For example:
'utf8-with-BOM.QIF' is utf-8 encoded with a BOM
'shiftjis.QIF' is encoded with ShiftJIS
'windows-1252.QIF' is encoded with Windows-1252

Examine the utf-8.QIF file using an advanced editor like vscode. Ensure that it is opened by the editor as utf-8. Notice that there are no question marks in the file. Question marks will often appear in incorrectly decoded character data. This utf-8.QIF file has examples of all the characters from the other files.

All of these files should be able to be loaded correctly by the importer. And GnuCash core should be able to store all these rich complex characters as payees, memos, accounts, etc. Therefore, these files exercise both components.

If you robustly solve this issue, then all files should be supported. If instead you can solve only a subset, then I recommend you prioritize windows-1252 and utf-8. Why? Because Quicken exports windows-1252 encoded files. And utf-8 is the de facto standard.

Questions?
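A unit test over files like these could look roughly like the following Python sketch. It does not read the real attachment; it builds a comparable one-transaction record in memory. The codec mapping, the date line, the amount signs, and the helper names `make_record` and `decode_sample` are assumptions for illustration, and only four of the file names come from the description above.

```python
# Hypothetical mapping from sample file names to Python codec names.
FILE_CODECS = {
    "utf-8.QIF": "utf-8",
    "utf8-with-BOM.QIF": "utf-8-sig",   # strips or adds the BOM
    "shiftjis.QIF": "shift_jis",
    "windows-1252.QIF": "cp1252",
}


def make_record(payee: str, memo: str) -> str:
    """Build one QIF bank transaction with two splits (70.00 and 30.00)."""
    lines = [
        "!Type:Bank",
        "D4/10/2020",          # assumed date line; the real files may differ
        "T-100.00",            # assumed sign
        "N1234",
        f"P{payee}",
        f"M{memo}",
        f"S{payee} one", f"E{memo} one", "$-70.00",   # first split
        f"S{payee} two", f"E{memo} two", "$-30.00",   # second split
        "^",
    ]
    return "\n".join(lines) + "\n"


def decode_sample(name: str, data: bytes) -> str:
    """Decode a sample strictly by its file name and reject replacement marks."""
    text = data.decode(FILE_CODECS[name], errors="strict")
    if "\ufffd" in text:
        raise ValueError(f"{name}: replacement character found after decoding")
    return text


for name, payee in [("windows-1252.QIF", "Müller"), ("shiftjis.QIF", "テスト")]:
    data = make_record(payee, "memo").encode(FILE_CODECS[name])
    print(name, "->", decode_sample(name, data).splitlines()[4])
```

The same loop, pointed at the real files from the ZIP, would give a quick pass/fail signal per encoding.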
Created attachment 373665: codepages1.zip (sample QIF files)
Looking at the QIF files, I think the only way to import these unconventional charsets is to ask the user for the charset during import. I don't think it's possible to guess it using any heuristics.
See concept at https://github.com/Gnucash/gnucash/pull/716
Definitely GnuCash should not write its own detection logic. That effort has already been done multiple times in other projects. One such project is uchardet, and it has bindings for multiple languages including C/C++.

I think a freely typeable box for people to enter the encoding is not the right approach. For example, it would require typing a long string 'w-i-n-d-o-w-s---1-2-5-2' on every file load for officially generated Quicken QIF exports. Instead, I think something like the following is better:

1. Same import work up until the step 'select a QIF file to load'
2. Click select
3. User selects the file
4. Click Next
5. This is where the encoding detection should occur. It happens at the same time GnuCash already does its mystery validation/load of the file. (I have no idea what this step does today.)
6. Click Next
7. This is the step to prompt the user for the encoding. In today's GnuCash workflow, this step sometimes shows "The QIF file format does not specify which order the day, month, and year components of a date are printed. In most cases, it is possible to automatically determine which format is in use..." So *before* it validates the date format, it needs to validate the characters themselves. If the autodetect code is not sure, then prompt the user with a UI asking them to accept the autodetect guess or to choose from a reasonable pulldown list of character sets. That means we need to list them in code. Naturally, more can be added/removed later. It is also possible to query APIs to get a list of character sets.
8. Click Next
9. This step lists all QIF files. I think this step should change slightly. It should be a two-column list. The left column is the same as today; the right column is the character set. This allows a user to remove any files that do not have the correct character set. I would like that second column to be pulldown boxes so the user can select the character set there.
But there *might* be some code in GnuCash that needs the correct character set to be selected earlier, so I leave that in step 7. However, if all the GnuCash code that runs before step 9 only needs ASCII for the file's header (e.g. the date format), then the character set display/choice can be postponed until step 9.
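The "detect first, prompt only when unsure" flow from steps 5 and 7 might be sketched like this in Python. This is a stand-in that uses only the standard library; it is not uchardet and does not reflect uchardet's API. The function name `guess_charset`, the pulldown list, and the choice of windows-1252 as the fallback guess are all assumptions.

```python
import codecs

# Hypothetical pulldown contents; a real list would come from the UI toolkit.
CHARSET_CHOICES = ["utf-8", "windows-1252", "iso-8859-1", "shift_jis"]


def guess_charset(data: bytes) -> tuple[str, bool]:
    """Return (guess, needs_prompt) for the raw bytes of a QIF file.

    needs_prompt is True when the guess is only a default and the user
    should confirm it or pick another entry from CHARSET_CHOICES.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", False          # an explicit BOM settles it
    try:
        data.decode("utf-8")           # pure ASCII also passes this check
        return "utf-8", False
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the Quicken default and ask.
        return "windows-1252", True


guess, needs_prompt = guess_charset("PMüller".encode("cp1252"))
if needs_prompt:
    print(f"Preselect {guess!r} in the pulldown and let the user confirm.")
```

A real detector would return a better guess for the non-UTF-8 case, but the surrounding flow (guess, preselect, allow override) would stay the same.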
uchardet https://www.freedesktop.org/wiki/Software/uchardet/
Clearly a free text box is not useful. The CSV importer currently uses a combo box with a wide selection of charsets. Perhaps this can be reused in the QIF importer? That aside, having code to guess the charset would be even better, though there should always be a means for the user to override the guess. So the guess could be used to preselect a value in the above-mentioned combo box.
@gjanssens see https://github.com/Gnucash/gnucash/pull/716 for updates. I've reused the go-charmap-sel. It works fine. The invalid-charmap selection error messaging exposes a current QIF-assistant bug, and I *think* I've done the right fix (see the wind->load_stop change). The QIF-assistant is a different nightmare. There is no charmap autodetect -- are you keen to import an external library?
Personally, if the licenses are compatible, I would look into uchardet (see above). I also see the package libuchardet-dev available in Ubuntu 18.04, so such a library may be widely available across it and the other GnuCash distros, making licensing easier. And it would also be easy to use a CMake state variable to skip the autodetect code if such a library isn't available on a distribution.

I was surprised to see only OpenSSL in LICENSE. I had thought that with GnuCash using so many other open source projects, there would be requirements to list those other licenses. Yet I can find none in that file or in the app's about UI. So, to use a similar approach, using the library from the distros might be a better approach than using the library as an entity from the official website.

I do recommend the autodetect, because humans make mistakes. I chuckled when I saw mistakes in core GnuCash docs at https://github.com/Gnucash/gnucash/tree/maint/doc. Notice all the utf-8 codepage errors throughout that toplevel doc.
(In reply to Dale Phurrough from comment #13)
> I was surprised to see only OpenSSL in LICENSE. I had thought with GnuCash
> using so many other opensource projects, that there would be requirements to
> list those other licenses. Yet, I can find none in that file or in the app's
> about UI. So...to use a similar approach, using the library in the distros
> might be a better approach that the library as a entity from the official
> website.

A bit off-topic, but huh? LICENSE is a renamed FSF COPYING, the GNU General Public License. Nothing to do with OpenSSL at all.

As is normal for all FOSS projects, the LICENSE applies to GnuCash's source code only. There are a few files that we've borrowed from other projects, and those files have the original copyright and license notices in them, which is also standard FOSS practice. For our source tarballs, which Linux distros use to generate their binary package files, that's all that's appropriate.

It's a legitimate bust that our Flatpak, macOS, and Microsoft Windows distributions don't include a list of the included dependencies, their licenses, project home pages, and how to obtain their sources as the GPL requires. If that bothers you enough to make an issue of it, please open a separate bug for it, or better yet, create the file and submit it as a PR.
(In reply to Dale Phurrough from comment #13)
> I do recommend the autodetect, because humans make mistakes. I chuckled when
> I saw at https://github.com/Gnucash/gnucash/tree/maint/doc mistakes in core
> gnucash docs. Notice all the utf-8 codepage errors throughout that toplevel
> doc.

If you're referring to README.français and README.german, they're encoded in ISO-8859-1. They're also untouched since 2006 other than converting the URLs to https, so they're horribly out of date. We should just remove them.
Reply to your off-topic to answer you ;-)

https://github.com/Gnucash/gnucash/blob/maint/LICENSE writes at the top: "Some of the source files have an exception for linking against OpenSSL, as per the following language". It isn't a standard GPL2 license file, as it has that custom OpenSSL header tacked on the top by @jsled back in 2007: https://github.com/Gnucash/gnucash/commit/07e94bda8e30eed3e0ca4f48805883ae1fe12dc1#diff-9879d6db96fd29134fc802214163b95a

I see no particular problem with having that custom OpenSSL header in that file. I mentioned it only in the context of @Christopher's inquiry about 3rd party libraries. So I looked for the typical concatenated list of licenses for 3rd party libraries. Technically, it is possible for GnuCash to have none, however unlikely. It's quite en vogue to have an automated process for generating such a list.

The legality of the current state of GnuCash is not in my purview. I am not a maintainer/owner of this project, nor am I a lawyer. If GnuCash is still a GNU subproject, the maintainers/owners could consult with FSF/GNU lawyers to get advice on this.

I'm referring to the link I provided as an example of how humans make mistakes with charsets. We both would have to closely examine that URL to see if it is ISO-8859-1 or Windows-1252. They are incredibly similar; the differences might not even be exposed in that URL because of the specific set of bytes used in the text. This further supports my point that humans make mistakes, that automated charset detection with libraries might help, and that a UI that shows the guessed charset is needed to confirm it.

Happy weekend :-)
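How similar ISO-8859-1 and Windows-1252 are can be checked with a small Python sketch. The helper name `same_in_both` and the sample strings are made up for illustration; the codec names are Python's standard ones. The two charsets only disagree on bytes 0x80 to 0x9F, so text that never uses those bytes cannot be told apart.

```python
def same_in_both(data: bytes) -> bool:
    """True when the bytes decode identically as ISO-8859-1 and Windows-1252."""
    try:
        return data.decode("latin-1") == data.decode("cp1252")
    except UnicodeDecodeError:
        # Some bytes (e.g. 0x81) are undefined in Python's cp1252 codec.
        return False


# Accented letters such as ç live at 0xA0 or above: identical in both charsets.
french = "README.français".encode("latin-1")

# 0x80 is the euro sign in Windows-1252 but a control code in ISO-8859-1.
euro = b"\x80"

print(same_in_both(french))  # the two charsets agree on this text
print(same_in_both(euro))    # the two charsets disagree on this byte
```

This is why a file containing only letters like ä, ö, ü, or ç gives no evidence either way, and why a displayed guess that the user can confirm is still useful.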
Created attachment 373814: incomplete patch
I have a similar issue. I export a QIF file from Quicken 2017 which includes accounts and categories with Latin characters (like á é í ó ú). It's ISO-8859-1, and it's imported into GnuCash 4.4 with "?". The workaround that is working for me is to convert the QIF file to UTF-8 and then import it into GnuCash. I use the "iconv" command in Linux. Maybe the free software Notepad++ can convert it too.
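For anyone without iconv at hand, the same workaround can be sketched in Python. With iconv the command would be along the lines of `iconv -f ISO-8859-1 -t UTF-8 in.qif > out.qif`. The function name `convert_to_utf8` and the file names are placeholders, and the source charset must be known in advance, since this sketch does no detection.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_charset: str = "iso-8859-1") -> None:
    """Re-encode a QIF file to UTF-8 before importing it into GnuCash.

    Works on raw bytes so that line endings are left exactly as they were.
    """
    text = Path(src).read_bytes().decode(source_charset)
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder file names; pass "cp1252" for Windows-1252 exports.
    if Path("export.qif").exists():
        convert_to_utf8("export.qif", "export-utf8.qif")
```

The converted copy is then the file to select in the QIF import assistant.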