Erez Katz, CEO and Co-founder of Lucena Research.

Investors often look at the Sharpe ratio to determine a portfolio’s strength. The Sharpe ratio measures a portfolio’s risk-adjusted return: its return in excess of the risk-free rate, divided by the volatility of that return. The goal is to measure how consistently a portfolio has been rewarded for the risk it takes.

A portfolio’s total return alone is normally insufficient, since a huge return on a single day can mask otherwise lackluster performance. A few large, incidental returns do not instill confidence that the portfolio will continue to perform well in the future.


Why not look simply at a portfolio’s total return? Isn’t return on investment the ultimate goal?

When a portfolio manager publishes a portfolio’s historical performance, a disclaimer must follow, indicating that past performance is not indicative of future returns. Indeed, as an investor you must be convinced that there is a fundamental thematic reason for a consistent performance that will likely carry forward into the future. More specifically, you want to have enough statistical evidence to reject the hypothesis that such performance is simply due to luck or to a high correlation to a market regime trend.



How to Measure Your Portfolio’s Performance Beyond Sharpe Ratio


The good news is there are a few industry standards for measuring a portfolio’s performance beyond Sharpe ratio. Here are a few examples and one important statistical principle often overlooked in investment, namely, P-Value or P-Score. More on that in a bit…

First, let’s use one of our Model Portfolios, Tiebreaker, as an example. Tiebreaker is an active investment strategy that uses long and short positions to deliver market neutral returns. Tiebreaker employs Lucena’s Price Forecaster on a qualified list of stocks generated from our Event Analyzer in order to rank a handful of stocks most likely to go up over the next week to a month. The strategy then incorporates a mean-variance optimization along with long/short hedge positions using Lucena’s proprietary hedging algorithm. Tiebreaker’s goal is to maximize its Sharpe Ratio.

Below is a performance chart of a live paper-traded simulation from our QuantDesk® platform.


[Figure: long/short investment strategy evaluation]


Tiebreaker is measured against the Vanguard Market Neutral Fund Institutional Fund as a benchmark. Tiebreaker has been continuously tracked since September 2014 with consistently low volatility (6.16% max drawdown) and outperformance against its benchmark as measured by Sharpe, IR and R-Squared. Performance is net of transaction costs, with a max exposure of 1X. Past performance is not indicative of future results.


Three important factors to note:

Sortino Ratio – Sortino Ratio follows the same principle as Sharpe ratio but instead of discounting a portfolio’s return against all its volatility, we only discount its downward volatility. In other words, if a portfolio is volatile on the way up, it shouldn’t be penalized.

Information Ratio (IR) –  Information Ratio measures a portfolio’s outperformance consistency relative to its benchmark. A higher IR means that the portfolio has been consistently outperforming its benchmark, making it less likely that the benchmark relative returns are by chance.

R Squared – R Squared is yet another measurement of a portfolio’s performance against its benchmark. It measures how much of the variation in a portfolio’s returns is explained by the returns of its benchmark. A low R Squared score indicates that the portfolio’s results are largely independent of the benchmark, suggesting that something else accounts for the portfolio’s returns.
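To make the definitions concrete, here is a minimal sketch of how these ratios are commonly computed from a series of periodic returns. It is not how QuantDesk® calculates them. The function names, the 252-day annualization factor, the downside deviation convention, and the sample returns are all assumptions made for illustration.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Mean excess return divided by the volatility of excess returns, annualized."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)

def sortino_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Like Sharpe, but only returns below the risk-free rate count as risk."""
    excess = [r - risk_free for r in returns]
    # One common convention: root mean square of the negative excess returns.
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    return statistics.mean(excess) / downside * math.sqrt(periods_per_year)

def information_ratio(returns, benchmark, periods_per_year=252):
    """Mean active return (portfolio minus benchmark) divided by tracking error."""
    active = [r - b for r, b in zip(returns, benchmark)]
    return statistics.mean(active) / statistics.stdev(active) * math.sqrt(periods_per_year)

def r_squared(returns, benchmark):
    """Squared Pearson correlation between portfolio and benchmark returns."""
    mean_r = statistics.mean(returns)
    mean_b = statistics.mean(benchmark)
    cov = sum((r - mean_r) * (b - mean_b) for r, b in zip(returns, benchmark))
    var_r = sum((r - mean_r) ** 2 for r in returns)
    var_b = sum((b - mean_b) ** 2 for b in benchmark)
    return cov ** 2 / (var_r * var_b)

# Made-up daily returns, for illustration only.
portfolio = [0.010, -0.005, 0.007, 0.002, -0.003, 0.004]
benchmark = [0.004, -0.002, 0.003, 0.001, -0.004, 0.002]

print("Sharpe:", sharpe_ratio(portfolio))
print("Sortino:", sortino_ratio(portfolio))
print("IR:", information_ratio(portfolio, benchmark))
print("R Squared:", r_squared(portfolio, benchmark))
```

Conventions differ between data providers (sample vs. population deviation, the target used for downside risk, the annualization factor), so figures from this sketch will not necessarily match a published fact sheet.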

Why It’s Important to Know Your Portfolio’s P-Value

While all the indicators above are commonly used and provide some information about the robustness of a strategy or a portfolio, they lack a scientific measure of what constitutes luck vs. substance. P-Score, also often called P-Value (the P stands for Probability), is a scientific measure used to determine the statistical significance of an outcome. More precisely, the P-Value is the probability of seeing an outcome at least as extreme as the one observed if luck alone were at work. Put simply, the lower the P-Value, the harder it is to attribute the outcome to luck.

Example of Measuring P-Value

Let’s assume we toss a coin that we think is a fair coin (has heads and tails) but could possibly be a trick coin with both sides being heads. The question that should be asked: “How many sequential flips that land only on heads are enough to assume with a high degree of confidence (statistical significance) that we actually have a trick coin (both sides are heads)?”

To express our experiment scientifically we define our null hypothesis as follows:

  • The coin is a fair coin, and thus if we toss the coin many times the number of heads will be approximately normally distributed (strictly speaking, it follows a binomial distribution). In addition, we expect the most common outcome to be 50% heads and 50% tails.


[Figure: evaluate portfolio with p-value]


We flip the coin 20 times and measure the number of heads occurrences. If we repeat the process 50,000 times the results should be normally distributed as depicted, with most occurrences at 10 heads out of 20 flips (50%).
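The experiment is easy to reproduce. The following is a small simulation sketch, assuming a fair coin and a fixed random seed so that the run is repeatable; the function name and the choice of 15 or more heads as the "unlikely" region are illustrative assumptions, not part of the original experiment.

```python
import random
from collections import Counter

def simulate_heads_counts(trials=50_000, flips=20, seed=0):
    """Repeat a `flips`-toss experiment `trials` times and tally the heads count."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        # getrandbits(flips) draws `flips` fair bits at once; each 1 bit is a head.
        heads = bin(rng.getrandbits(flips)).count("1")
        counts[heads] += 1
    return counts

counts = simulate_heads_counts()

# The most frequent outcome should sit at the center of the distribution.
most_common_heads, _ = counts.most_common(1)[0]

# Share of experiments that landed far out in the right tail (15+ heads of 20).
tail_share = sum(n for heads, n in counts.items() if heads >= 15) / sum(counts.values())

print("Most common number of heads:", most_common_heads)
print("Share of runs with 15 or more heads:", tail_share)
```

Plotting `counts` as a histogram gives the bell-shaped picture described here, with only a small fraction of runs falling in the tails.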


[Figure: distribution graph for p-value]


We delineate the regions of our normal distribution graph that indicate unlikely or very unlikely outcomes.

The long tails on both sides of the center of the distribution graph indicate a statistically improbable scenario given the number of runs in our experiment. To express the rejection of the null hypothesis we will define an alternative hypothesis:

  • The coin is rigged (both sides are heads) as it is statistically improbable that the number of consecutive heads in the observed region of the normal distribution (marked in red above) occurred randomly.

Imagine we flip the coin 1 time and we get heads – probability is 0.5 or 50%

We flip the coin a second time and we get heads – probability of this outcome is: 0.5 * 0.5 = 0.25 or 25%

We flip the coin a third time and we get heads again – probability is: 0.5 * 0.5 * 0.5 = 0.125 or 12.5%

We flip the coin a fourth time and we get heads again – probability is: 0.5 * 0.5 * 0.5 * 0.5 = 0.0625 or 6.25%

We flip the coin a fifth time and we get heads again – probability is now: 0.5^5 = 0.03125 or 3.125%

In a scientific experiment with big sample data, any observed outcome with a likelihood below 5% (0.05) is conventionally considered significant enough to reject the null hypothesis and assume the alternative hypothesis. Four heads in a row (6.25%) does not clear that bar, but five in a row (3.125%) does. Therefore, a run of five or more sequential heads yields a P-Value below 0.05, and we accept the alternative hypothesis that the coin is rigged.
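The arithmetic above can be sketched in a few lines. The function names are hypothetical, and the 0.05 threshold is simply the conventional cutoff used in this example.

```python
def streak_probability(n, p=0.5):
    """Probability of n identical outcomes in a row when each has probability p."""
    return p ** n

def min_streak_for_significance(alpha=0.05, p=0.5):
    """Smallest streak length whose probability under the null is at most alpha."""
    n = 1
    while streak_probability(n, p) > alpha:
        n += 1
    return n

for n in range(1, 6):
    print(n, "heads in a row:", streak_probability(n))

print("Streak needed at the 5% level:", min_streak_for_significance())  # 5
```

Tightening the threshold lengthens the required streak: with `alpha=0.01`, seven heads in a row are needed, since 0.5^6 is still above 1%.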


Can we apply the same method to determine the statistical significance of a trading signal?


Imagine you run a deep neural network (DNN) experiment to train a stock predictor model and TensorBoard shows a validation precision of 59%. This means that whenever your model scored a 1 (a buy), your true positive probability was 59%. In other words, whenever the model determined a buy, it turned out profitable 59 percent of the time. Now that seems pretty good on the surface, but by itself is not enough to determine that the model is robust.

What if we are evaluating a bullish period where all stocks move higher? It’s not hard to predict a successful buy when everything moves higher. We need to define the problem in the context of statistical success above random selection. It turns out that this is not a simple problem to solve because success is not measured only by picking a stock that moved higher, but rather by picking a stock that moves higher relative to its peers or above the average expected move.

So, let’s define our problem better. Imagine we are trying to train a model that every day selects stocks that are most likely to move higher relative to the market within a week. We measure the market return as the average return of all the stocks, equally weighted in the S&P (referred to as RPG). Also, to keep things simple, we will not account for stocks with equal performance to RPG. In addition, we will not distinguish between stocks based on how high they move. All we care about is that the stocks picked will move higher relative to RPG.

Assuming a random normal distribution, our null hypothesis is as follows: The likelihood that a randomly picked stock moves higher relative to RPG is 50%, very much like heads or tails in the coin toss. Now we can determine how many consecutive successful picks would indicate a statistically robust model.

We reject the null hypothesis in favor of the alternative hypothesis if P is less than or equal to 0.05 (5%). Our assumption is that consistently getting 5 or more sequential successful picks is significant enough to reject the null hypothesis and conclude that our model’s results are not simply lucky random picks.


[Figure: p-value for portfolio evaluation]

By defining our experiment and isolating the expected random outcome, we can measure the P-value of our model’s outcome. The lower the P-value the more likely our model is what contributed to the abnormal observation.
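Real signals rarely produce unbroken streaks, so the same idea is often applied to a hit rate over many picks. The sketch below is one possible way to do that with an exact one-sided binomial test under the 50% null hypothesis. It assumes every pick is independent, which correlated stock picks may well violate, and the sample sizes are hypothetical: the 59% precision mentioned earlier does not come with a number of picks.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided P-value: chance of at least `successes` hits in `trials`
    independent picks if each pick succeeds with probability `p_null`."""
    return sum(
        comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Five successful picks out of five reproduces the streak calculation.
print("5 of 5:", binomial_p_value(5, 5))  # 0.03125

# Hypothetical sample sizes: the same hit rate is more convincing with more picks.
print("12 of 20:", binomial_p_value(12, 20))
print("59 of 100:", binomial_p_value(59, 100))
```

The design point is that a hit rate by itself says little; the number of picks behind it determines whether the result can be distinguished from random selection.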


In this article, we wanted to offer a statistical/scientific approach to how Machine Learning models are evaluated. Evaluation can be done on a portfolio’s performance or further isolated to the model(s) that drive buy/sell signals. We simplified the statistical model for demonstration purposes.

In the real world, I highly recommend consulting with an expert who can assist in building the random selection distribution model and subsequently defining the null hypothesis, the alternative hypothesis, and the rejection P-value. At Lucena we have extended the traditional statistical analysis on TensorFlow/TensorBoard so that hyperparameter tuning is set to drive the robustness of a model in addition to the standard Accuracy and Precision measures.

