Make Importers Easy to Write. And Write Lots of Them
Making importers extremely easy and fast to write lets you write importers to pull data from all your institutions, which in turn leads to the five-minute update. How? Read on.
The Ideal Import Process…Does Not Currently Exist
The ideal import process would work thus:
- It would not involve you. You start up beancount, and within a few seconds, all your latest transactions are downloaded, validated, deduped, error checked, categorized, and inserted for all your accounts
- New accounts you opened up would either automatically be imported (if you already hold an existing account at the given institution), or be imported after a quick, one-time, secure handing over of credentials
- The import process would be able to automatically figure out file formats it has never seen before
The ideal import process does not quite exist, yet. But we can get reasonably close. This set of articles focuses on manually writing dependable importers with minimal effort. Further, we want them to require low to zero maintenance, and to be extensible and customizable. Examples of customizability include:
- Commodity-leaf accounts
- Distinguishing between dividends and capital gains based on specific text that a given banking institution uses
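To make the first customization concrete: a commodity-leaf scheme puts each holding in its own subaccount by appending the commodity's ticker to the account root. A minimal sketch, where the helper name and account names are hypothetical, not from the actual library:

```python
def commodity_leaf(root: str, ticker: str) -> str:
    """Hypothetical helper: append the commodity ticker as the account leaf."""
    return f"{root}:{ticker}"

# With commodity-leaf accounts, each holding gets its own subaccount:
print(commodity_leaf("Assets:Investments:Brokerage", "VTI"))
# Without the scheme, all holdings would share "Assets:Investments:Brokerage".
```

Whether to use such a scheme is a per-user preference, which is exactly why it needs to be a customization point rather than hardcoded behavior.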
The next big leap (not included in these articles) would be to eliminate the need for writing importers at all.
Challenges
Importers are painful to get right for several reasons:
- institutions are inconsistent in their use of OFX, and wholly unrestrained with CSVs
- institutions occasionally change their formats (e.g., columns are added or removed in CSVs; or institutions move from OFX to CSV)
- extensive testing across many users is difficult because data files are personal
All this leads to highly custom code for each importer. In addition:
- constructing postings for investment transactions can be complex due to the special casing required, and the variances between brokers
- expense categorization is not a deterministic process (smart_importer solves this problem)
Separating the task of correctly reading the institution's file format from the task of constructing postings allows us to develop and maintain these layers independently, while enabling massive reuse. More on this below.
Introducing beancount_reds_importers
The principles discussed in the articles below are implemented and have been used by the author in beancount_reds_importers. This makes it handy to use as examples throughout. Suggestions, contributions, well reasoned design modifications, and of course, bug reports and pull requests, are welcome.
Design Philosophies
The overarching philosophy is to make importers easy to develop and maintain by:
DRY: specifically, minimize the per-institution importer code necessary.
Doing so makes importers easy to develop, both for your existing and new accounts. That is what lets you easily write all the importers you need to get to full automation (or close), which is what ultimately enables the 5-minute ledger update.
In addition:
- Use the code-as-config approach in writing importers. The contrasting approach is to write a core importer that can be configured for each banking institution through complex configuration options. That approach does not work: importers are messy as it is, and a configurable generic importer yields situations where one has to look up config options, try them to see if they produce the desired effect, and, if they don’t, end up writing code anyway and having to test it across all other configs. A code-as-config approach means each importer is simply a Python script. It can be just a few declarative lines, or many lines of actual code that do exactly what needs to be done for the specific banking institution you are writing an importer for.
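A sketch of what code-as-config can look like. All class and column names here are hypothetical, not the actual beancount_reds_importers API: a reusable base class does the generic parsing, and the per-institution importer is mostly declarations, with real code only where the institution's quirks demand it:

```python
class CsvReader:
    """Reusable base: parses normalized rows from a CSV-like source."""
    date_column = "Date"
    amount_column = "Amount"

    def parse_row(self, row: dict) -> dict:
        return {
            "date": row[self.date_column],
            "amount": float(row[self.amount_column]),
            "payee": self.clean_payee(row),
        }

    def clean_payee(self, row: dict) -> str:
        return row.get("Description", "").strip()


class MyBankImporter(CsvReader):
    """Institution-specific importer: a few declarative overrides..."""
    date_column = "Posted Date"
    amount_column = "Transaction Amount"

    def clean_payee(self, row: dict) -> str:
        # ...plus actual code where the institution needs it
        return row["Description"].removeprefix("POS PURCHASE ").strip()


row = {"Posted Date": "2023-01-05", "Transaction Amount": "-42.50",
       "Description": "POS PURCHASE COFFEE SHOP"}
print(MyBankImporter().parse_row(row))
```

The key point: there is no config schema to learn or fight. The subclass *is* the config, and when declarations aren't enough, you drop down to ordinary Python in the same file.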
- Do one thing and do it well. This means we rely heavily on well-tested, mature, external tools to complete our workflow, as opposed to a tool with a comprehensive approach like beancount-import, which follows a different philosophy (but is also well developed and worth checking out).
We rely upon:
- beancount for identification, extraction, and filing
- ofxparse for reading OFX files
- ofxtools for connecting to banking institutions via their web APIs and automating downloads
- petl for manipulating tables in CSV files
- pass to securely store passwords
- fava to show us the visual state of how up to date an account is
- Be dependable, not “smart (in a way that fails)”. One way to be “smart” is to try to anticipate all possible data formats and possibilities and write a generic importer that works with almost any institution. This fails because the code gets too complex to maintain well, and breaks because there is always yet another new case.
Making Importers Easy to Write
To achieve this, the design separates importer code into three parts:
- File Format Readers (Reusable)
- Transaction Builders (Reusable)
- [[ Institution-specific Code ]] (Minimal)
This helps move common code into (1) and (2) above, and makes writing new importers easy: simply pick one of each, along with minimal declarations and code in (3).
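The composition above can be sketched as follows. Class names, column names, and the entry format are illustrative assumptions, not the library's actual API; the point is only how the three parts combine via multiple inheritance:

```python
import csv
import io


class CsvReaderMixin:
    """(1) File format reader: turns a CSV file into normalized row dicts."""
    column_map = {}  # each institution maps its headers to standard fields

    def read_rows(self, text: str):
        for row in csv.DictReader(io.StringIO(text)):
            yield {std: row[col] for std, col in self.column_map.items()}


class BankingBuilderMixin:
    """(2) Transaction builder: turns normalized rows into ledger entries."""
    account = "Assets:Unknown"

    def build(self, rows):
        # Simplified one-line entries for illustration only
        return [f'{r["date"]} * "{r["payee"]}"  {self.account}  {r["amount"]} USD'
                for r in rows]


class MyBankImporter(CsvReaderMixin, BankingBuilderMixin):
    """(3) Institution-specific code: minimal declarations."""
    account = "Assets:Banks:MyBank:Checking"
    column_map = {"date": "Post Date", "payee": "Merchant", "amount": "Amt"}


sample = "Post Date,Merchant,Amt\n2023-01-05,Coffee Shop,-4.50\n"
imp = MyBankImporter()
for entry in imp.build(imp.read_rows(sample)):
    print(entry)
```

Because (1) and (2) are independent mixins, a new OFX-based investment importer would swap in a different reader and builder pair while (3) stays just as small.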
Here is the data flow:
Remaining Issues
- [[ Testing framework for when importers break ]]
- Customizability:
- [[ Features - Commodity Leaf Accounts ]]
- [[ Features - Custom Categorization ]]
- [[ How to Install, Configure, and Use beancount-reds-importers ]]
References
- Existing threads on the beancount mailing list
- My multi-part post on importers.
- Also, this thread on ofx importers.
- Importers design doc