Deduping

There seem to be two needs for deduping during ingest, discussed below:

  • deduping previously imported transactions
  • deduping transactions with counterparts in other accounts (e.g., transfers)

Deduping Previously Imported Transactions

The imported file’s date range may overlap with the existing source ledger’s date range. Attempting to download non-overlapping OFX/CSV files from your institution is error-prone and does not work well in practice. What works better is to always download, say, the last 90 days (assuming you download at least once every 90 days) and dedupe during import.
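
As a rough illustration of the "download an overlapping window, then dedupe" workflow, here is a minimal sketch. It is not bean-extract's actual heuristic: the `Txn` type, the `drop_already_imported` function, and the choice of (date, account, amount) as the match key are all assumptions made for this example.

```python
from collections import Counter
from datetime import date
from decimal import Decimal
from typing import NamedTuple


class Txn(NamedTuple):
    """Hypothetical, simplified transaction record (not a beancount type)."""
    day: date
    account: str
    amount: Decimal
    payee: str


def drop_already_imported(existing, incoming):
    """Return the incoming transactions that have no match in `existing`.

    Two transactions are treated as the same if date, account, and amount
    agree. The payee is ignored so that an edited description still matches.
    """
    seen = Counter((t.day, t.account, t.amount) for t in existing)
    fresh = []
    for t in incoming:
        key = (t.day, t.account, t.amount)
        if seen[key] > 0:
            # Consume one match, so genuine repeats on the same day survive.
            seen[key] -= 1
        else:
            fresh.append(t)
    return fresh


ledger = [
    Txn(date(2021, 3, 12), "Liabilities:Credit-Card",
        Decimal("-100.00"), "Dinner with Neighbors"),
]
download = [
    Txn(date(2021, 3, 12), "Liabilities:Credit-Card",
        Decimal("-100.00"), "ABC Restaurant"),
    Txn(date(2021, 3, 20), "Liabilities:Credit-Card",
        Decimal("-25.00"), "Bookstore"),
]
fresh = drop_already_imported(ledger, download)
```

With this key, only the second downloaded transaction is kept, because the first already exists in the ledger under an edited description.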

bean-extract has a built-in dedupe that catches this in many cases where there is a “reasonable” match. If you find that this is not working well, you can write your own dedupe function and pass it as a hook to bean-extract. See the beancount documentation for how to do this: it’s pretty simple.

This sometimes fails for transactions that were modified after import in a way that the built-in dedupe’s heuristics do not catch. Such transactions will not be deduped on subsequent imports.

In practice, I don’t modify transactions often enough for this to be a problem. For investment and banking, my importers should reflect the final transaction correctly. For credit card statements, I occasionally change the description or categorization, in which case a subsequent import will sneak through bean-extract’s dedupe. Auto-generated balance assertions quickly point me to the occasional duplicated transaction, which I remove manually.

It helps that I have per-account ledger files, auto-annotated with comments by the importer, so when the balance assertion fails, it’s obvious when there is a date overlap, like so:

------- Importer appended: 2021-03-15 16:13 -0700 ----------
...
...
2021-03-12 * "Dinner with Neighbors"
   Liabilities:Credit-Card   -100.00 USD
   Expenses:Restaurant
------- End of append ---------- 

------- Importer appended: 2021-03-27 18:23 -0700 ----------
2021-03-12 * "ABC Restaurant"
   Liabilities:Credit-Card  -100.00 USD
   Expenses:Restaurant
...
...
------- End of append ----------

Custom Deduper Example

I use a deduper that is a slightly modified version of class SimilarityComparator in similarity.py.

The final check in that file is:

# Here, we have found at least one common account with a close  
# amount. Now, we require that the set of accounts are equal or that  
# one be a subset of the other.

return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)

My modification instead checks for a non-empty intersection:

return accounts1.intersection(accounts2)

My common case is an import of a credit card transaction that is modified post-import. On a subsequent import (with an overlapping date range), dedupe does not work with the original heuristic but works well with this one.

You can use an importer-appropriate dedupe function where needed.
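
To make the difference between the two checks concrete, here is a small self-contained sketch that applies both to plain Python sets. The function names `subset_match` and `intersection_match` and the account `Expenses:Dining` are hypothetical, chosen for this example. The scenario assumed is a credit card transaction that was recategorized after the first import.

```python
def subset_match(accounts1, accounts2):
    """The original final check: one account set must contain the other."""
    return accounts1.issubset(accounts2) or accounts2.issubset(accounts1)


def intersection_match(accounts1, accounts2):
    """The modified final check: any shared account is enough."""
    return bool(accounts1.intersection(accounts2))


# Transaction already in the ledger, recategorized by hand after import.
in_ledger = {"Liabilities:Credit-Card", "Expenses:Dining"}

# The same transaction as produced again by a later, overlapping import.
reimported = {"Liabilities:Credit-Card", "Expenses:Restaurant"}

print(subset_match(in_ledger, reimported))        # False
print(intersection_match(in_ledger, reimported))  # True
```

The trade-off is that the intersection check is looser: any two transactions that already passed the earlier date and amount checks and share a single account will be flagged as duplicates.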

Deduping Transactions with Counterparts in Other Accounts

See Deduping with Zerosum. Beancount v3 has a proposal to render all this moot by allowing the two halves to be declared separately.
