How we optimized finding similar assets for tax-loss harvesting

Alex Loukissas
AgentRisk: Superhuman Wealth Management
7 min read · Apr 27, 2020


Automated tax-loss harvesting is one of the core features of both of our AgentRisk Wealth and AgentRisk Lite products for individual investors, and of our AgentRisk Overlay product for financial advisors. Finding similar assets to replace an asset in tax-loss harvesting is the most computationally intensive part of the process. In this post, we share how we optimized this computation, as well as how this can be further improved.


What is tax-loss harvesting and how does it work?

Tax-loss harvesting is the practice of selling an asset that is at a loss and realizing this loss to help minimize your tax bill. It goes without saying that nothing we cover in this post should be viewed as tax advice and you should definitely talk to your tax accountant before applying any of this information to your portfolio.

Selling an asset at a loss is half the battle in tax-loss harvesting. The second half is finding a similar-enough asset to replace the asset you sold. If you don’t do this, your portfolio allocation is skewed. You also can’t just sell and buy the same asset back-to-back — that would be cheating, wouldn’t it? In fact, the tax code categorizes this as a “wash sale”. This is why you need to find a similar enough asset and use the proceeds from selling the original asset to buy it.

Tax-loss harvesting in practice

This may sound a bit confusing, so let’s give a simple example.

Let’s assume that sometime in January 2020 you bought 100 shares of SPY (a very popular ETF that tracks the S&P 500) at $330 per share. This asset is now at $280 per share, meaning that you have a $5,000 total loss. If you want to realize this loss and collect the tax credit, you sell all 100 shares of SPY, with total proceeds of $28,000.

Now, you’ll want to replace SPY with a “similar enough” asset so that your portfolio stays practically the same. We’ll describe the details on how to do this below, but for now, let’s assume that VOO (another S&P 500 ETF) is the asset we’ll replace SPY with. At a price of $265 per share, our proceeds will buy 105 shares of VOO. Easy-peasy.
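To make the arithmetic concrete, here is the same example as a few lines of Python (the prices are just the numbers above; none of this is investment advice):

# Toy example using the numbers above
shares_spy = 100
cost_basis = 330.0    # purchase price per share of SPY
price_spy = 280.0     # current price per share of SPY
price_voo = 265.0     # current price per share of VOO (the replacement asset)

harvested_loss = shares_spy * (cost_basis - price_spy)  # 5000.0 (the realized loss)
proceeds = shares_spy * price_spy                       # 28000.0 (from selling SPY)
shares_voo = int(proceeds // price_voo)                 # 105 whole shares of VOO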

How to find similar assets

At a high level, when looking for an asset to replace SPY with, we want to find an asset whose performance closely tracks that of SPY. The algorithm for this can be summarized in the following python-like pseudocode:

# calculate daily returns for SPY
daily_ret_spy = get_daily_returns("SPY")

similar_assets = []
for asset in asset_universe:
    # calculate daily returns for the candidate asset
    daily_ret_asset = get_daily_returns(asset)
    # calculate the correlation coefficient of the two return series
    correl = calc_correlation(daily_ret_spy, daily_ret_asset)
    similar_assets.append({'asset': asset, 'correl': correl})

# return results in descending order of correlation
return sorted(similar_assets, key=lambda a: a['correl'], reverse=True)

Some initial observations and explanations on the pseudocode above:

  • The calculation is linear in the size of asset_universe. This potentially includes all symbols from NYSE, NASDAQ, and ASE, as well as all mutual funds, which in total is in excess of 32K symbols.
  • Each iteration is independent, which means this loop has high potential for speedup when run in parallel.
  • Each iteration requires us to calculate the daily returns for each symbol we consider. This calculation is based on historical price data for that symbol.
  • The calculation of the correlation coefficient is the heavy hitter of the loop, and its cost depends on the length of the daily-returns arrays (see the sketch below for one way the two helpers could be implemented).
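For concreteness, here is a minimal sketch of how get_daily_returns and calc_correlation could be implemented with NumPy; the price-loading helper load_close_prices is assumed, and our internal versions differ in the details:

import numpy as np

def get_daily_returns(symbol):
    # load_close_prices is assumed: it returns a chronological array of daily closing prices
    prices = np.asarray(load_close_prices(symbol), dtype=float)
    # daily return = (p_t - p_{t-1}) / p_{t-1}
    return np.diff(prices) / prices[:-1]

def calc_correlation(returns_a, returns_b):
    # align the two series on their common (most recent) length
    n = min(len(returns_a), len(returns_b))
    # Pearson correlation coefficient of the two return series
    return np.corrcoef(returns_a[-n:], returns_b[-n:])[0, 1]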

Optimization #1: preprocess all data

In our initial (admittedly naive) implementation, we would calculate the daily returns in each loop iteration. Even though our historical data was static and in memory, when we ran our code through the Python profiler, we discovered that over 45% of the time was spent in this calculation.

More specifically, the breakdown from the profiler (top-3 hitters) was:

Function            % Time
np.corrcoef         51.2%
get_daily_returns   45.3%
np.trim_zeros        1.6%

As expected, the call that computes the correlation coefficient took a lot of the time, but we were spending almost as much time doing unnecessary work calculating daily returns at runtime!

Not only could we improve performance by memoizing this calculation, we could do even better: we could preprocess all of it! Which is exactly what we did. We already had a script that fetches historical data from one of our data providers and stores it in JSON files. We just had to update the script to store the daily returns alongside the raw data.
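As a rough sketch, the extra preprocessing step could look something like the following; the file layout and field names here are illustrative, not our actual schema:

import glob
import json

import numpy as np

# Illustrative layout: one JSON file per symbol with a "closes" array of daily closing prices.
for path in glob.glob("data/*.json"):
    with open(path) as f:
        data = json.load(f)
    closes = np.asarray(data["closes"], dtype=float)
    # precompute the daily returns once, at fetch time, instead of on every request
    data["daily_returns"] = (np.diff(closes) / closes[:-1]).tolist()
    with open(path, "w") as f:
        json.dump(data, f)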

Lesson learned: the profiler is your best friend.
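If you have never used it, Python’s built-in profiler takes only a couple of lines; find_similar_assets below is a stand-in for whatever entry point you want to measure:

import cProfile
import pstats

# profile one similarity search and print the ten most expensive functions
cProfile.run("find_similar_assets('SPY')", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)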

Optimization #2: trim the data

Now that we had all the data ready to use, the next step was to reduce the time spent calculating the correlation coefficient between each pair of arrays in each iteration. Our initial calculation was using arrays with over 1000 data points, which is over 4 years of historical data. As we started scaling down this number (i.e. keeping just the most recent data points), we observed very little difference in the calculated correlation numbers, and an even smaller difference in the relative correlation (i.e. the sorted array of assets most correlated to a specific asset was almost identical).

Correlation coefficient of SPLG with SPY (log scale)

We also excluded assets that had too few data points to produce a useful correlation calculation, for example newly launched ETFs or companies with recent IPOs.
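A sketch of the trimming and filtering; the 250-day window and the minimum-history threshold are illustrative values, not the exact ones we settled on:

LOOKBACK = 250      # roughly one year of trading days (illustrative)
MIN_POINTS = 60     # minimum history we require before trusting the correlation (illustrative)

def trimmed_returns(daily_returns):
    # keep only the most recent LOOKBACK data points
    return daily_returns[-LOOKBACK:]

def has_enough_history(daily_returns):
    # skip brand-new ETFs and recent IPOs with too little price history
    return len(daily_returns) >= MIN_POINTS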

Lesson learned: use as much data as you need, no more.

Optimization #3: keep data close to computation

OK, now we had preprocessed data and trimmed off all the fat. The final step was to ensure that each iteration could read this data as fast as possible. Our initial approach was to have this data stored in a co-located Redis instance.

However, this would mean doing frequent, large network data fetches for each iteration, again and again. And since the per-iteration compute time for calculating the correlation coefficient was already low (the underlying implementation in NumPy is highly optimized), even a couple of milliseconds of network latency per operation would slow things down considerably.

Instead, we decided to have all this data loaded in memory at all times. Since this is read-only data, there would be no locking overhead or anything like that. Just really fast memory accesses.
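In practice this can be as simple as loading everything into a module-level dictionary at process startup; the paths and field names below are illustrative:

import glob
import json
import os

import numpy as np

# Loaded once at process startup; the data is read-only afterwards, so no locking is needed.
RETURNS_BY_SYMBOL = {}

def load_all_returns(data_dir="data"):
    for path in glob.glob(os.path.join(data_dir, "*.json")):
        symbol = os.path.splitext(os.path.basename(path))[0]
        with open(path) as f:
            RETURNS_BY_SYMBOL[symbol] = np.asarray(json.load(f)["daily_returns"])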

Lesson learned: don’t make a CPU-bound job also wait on I/O.

Optimization #4: cache results as long as possible

As we discussed above, correlation numbers don’t change that drastically. What this means is that not only can we cache the computations we do, but we can also do so for quite a while. In our case, we fetch new historical data weekly, which is also the TTL we set for our cached results. Here Redis is a great tool, which is what we ended up using.
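With the redis-py client, caching a result for a week is essentially one call; the key format below is illustrative:

import json

import redis

ONE_WEEK = 7 * 24 * 3600  # matches our weekly data-refresh cadence
r = redis.Redis()

def cache_similar_assets(symbol, results):
    # store the sorted results under a per-symbol key with a one-week TTL
    r.setex(f"similar:{symbol}", ONE_WEEK, json.dumps(results))

def get_cached_similar_assets(symbol):
    cached = r.get(f"similar:{symbol}")
    return json.loads(cached) if cached is not None else None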

Lesson learned: know your data domain and cache aggressively.

Optimization #5: segment your data

I left this for last, since this is a bit of a product decision (although it did have a big impact on performance). What we realized when talking with customers is that it’s very rare that someone would do tax-loss harvesting of an ETF with a mutual fund and vice versa. What this means is that we could significantly cut the number of iterations per calculation, based on the type of asset. More specifically, if the asset is a mutual fund, we only look for similar assets that are also mutual funds (and vice versa).

What this means is that our problem space is now significantly reduced. To give you an idea, our data provider has data for over 26K mutual funds, while the total number of symbols on NYSE, NASDAQ, and ASE is almost 9K. Practically, if a symbol is not a mutual fund, our problem space is reduced from 35K to just 9K (almost a 75% reduction), while for mutual funds it is an almost 25% reduction.
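In code, this is just a filter on the candidate universe; is_mutual_fund stands in for a lookup against our data provider’s asset-type metadata:

def candidate_universe(symbol, asset_universe):
    # mutual funds are only compared against mutual funds, everything else against non-funds
    target_is_fund = is_mutual_fund(symbol)
    return [asset for asset in asset_universe if is_mutual_fund(asset) == target_is_fund]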

Lesson learned: optimizations can come from smart product decisions.

Our implementation

How did we put all this in production? Our initial approach was to package this as a Docker image and run it on Google Cloud Run. This seemed like a good fit for the “serverless-like” model, especially since we expected that for the most common assets we’d quickly have all we need in Redis.

However, since we decided to package the data in the Docker image, we saw that the boot time per invocation was not negligible, which is why in the end we decided to run this as another microservice in our GKE cluster.

The microservice is a Flask API server, which essentially just listens on a single endpoint. We haven’t yet attempted to parallelize the loop (which, as we discussed above, is another potential quick win). One could imagine an implementation using asyncio, but the current approach is good enough for us.
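The shape of the service is roughly the following; the endpoint name and the find_similar_assets helper are illustrative stand-ins, not our actual API:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/similar/<symbol>")
def similar(symbol):
    # serve from the weekly Redis cache when possible, otherwise compute and cache the result
    results = get_cached_similar_assets(symbol)
    if results is None:
        results = find_similar_assets(symbol)
        cache_similar_assets(symbol, results)
    return jsonify(results)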

Give it a shot!

If you want to try this out, we already have a free tool for people to use, which we recently switched over to this new implementation. We’ll also open-source our microservice implementation in the coming weeks, so make sure you follow us on Twitter and GitHub.

At AgentRisk, we build investment products and tools for individuals as well as for financial advisors and wealth managers, leveraging state-of-the-art machine learning algorithms and time-tested investment theories.
