Erez Katz, CEO and Co-founder of Lucena Research.
At Lucena, our mission is to bridge the gap between validated data and data-driven professionals. Portfolio managers seek reliable ways to efficiently assess and deploy alternative data for investment decision making. Data providers, on the other hand, are looking for unbiased third-party validation and empirical evidence of their signals’ predictive power in order to gain credibility in an increasingly skeptical market.
Data: The Lifeblood of Investment Decisions
There are quite a few pitfalls that data providers and data consumers should be aware of. Whether inadvertently or intentionally, the premise of a data set can be misleading, and subsequently costly to the unsuspecting consumer.
First, it’s important to separate the narrative from the underlying data. Often, we come across a data provider that presents a unique approach to predicting markets, sectors, or individual constituents. But on inspection, the underlying data falls short of the premise.
Take, for example, credit card transactions. Imagine you have daily access, both historically and perpetually, to any purchase any consumer in the United States makes. Aggregating such data by demographic profile or geographic location can be very telling about economic trends at the macro level and in certain industries. We can tell, for example, by inspecting the purchases of a specific product if it is in the midst of a “hot” growing trend. Naturally, such data can provide valuable insight into the future performance of a market, industry, and individual companies.
Let’s look at Nvidia (NVDA), for example. Nvidia is a rising star that has capitalized on growing demand for graphics processing units (GPUs). The GPU market has expanded from a niche segment of electronic gamers to broader adoption by AI researchers, blockchain and cryptocurrency miners, and datacenters, all of which consume GPU chips insatiably.
Nvidia’s stock has been on a tear in the last 3 years, growing exponentially year over year for a total of more than 800%!
Nvidia (NVDA) stock price action, 1/1/2016 to 12/31/2017. Source: Yahoo Finance.
The trend in the number of units sold to datacenters tells a similar story of growing consumption over time.
Nvidia’s growth in quarterly sales for datacenters (2016 to 2017). Source: TheStreet.
On the surface, this sounds like an easy win. If we can ascertain the consumption trends of Nvidia GPU units on a perpetual basis, we could potentially forecast Nvidia’s revenue before its earnings announcement. If Nvidia’s price has yet to reflect this growing demand trend, the signal could help an investor enter a position with high confidence of an impending price move.
In the real world, however, data is rarely available in such a straightforward format. It is also rarely available in real time, and the lag can leave the data stale and no longer actionable. Stale data does not serve investors and portfolio managers, who need to judge how actionable and how meaningful a trend actually is.
At Lucena, we’ve developed a comprehensive process by which we aggregate such data and condition it for Machine Learning research in order to determine how ready it is for consumption and, more importantly, how actionable it is.
The Data Predictability Process: Consume, Validate, Enhance and Deploy
In order to assess data efficiently, we have developed a set of technologies geared to automate many of the mundane, time-consuming tasks of validating and deploying big data. We break our research process into three well-defined phases.
Phase I: Import, Qualify and Feature Engineer Data
In Phase I, we conduct research on the entire data sample offered. We test the data for completeness, map data to assets, extrapolate missing data, adjust data for seasonality, aggregate consumption levels by channel, and ultimately generate our initial report, called DQE (Data Qualification Engine).
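To make two of these conditioning steps concrete, here is a minimal sketch of a completeness check and a gap fill on a daily series. It is an illustration only, not Lucena's DQE: the function names are hypothetical, the series is assumed to be a mapping from calendar date to value, and a simple forward fill stands in for whatever method is actually used to extrapolate missing data.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple


def expected_days(start: date, end: date) -> List[date]:
    """Every calendar day from start to end, inclusive."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def completeness_report(
    observations: Dict[date, float], start: date, end: date
) -> Tuple[float, List[date]]:
    """Return (share of expected days that have a value, list of missing days)."""
    days = expected_days(start, end)
    missing = [d for d in days if d not in observations]
    coverage = (len(days) - len(missing)) / len(days)
    return coverage, missing


def forward_fill(
    observations: Dict[date, float], start: date, end: date
) -> Dict[date, float]:
    """Fill each missing day with the most recent earlier observation.

    Days before the first observation stay absent, since there is nothing
    to carry forward yet. Forward fill is an assumed stand-in here.
    """
    filled: Dict[date, float] = {}
    last_seen = None
    for d in expected_days(start, end):
        if d in observations:
            last_seen = observations[d]
        if last_seen is not None:
            filled[d] = last_seen
    return filled
```

Carrying the last known value forward uses only information that was available at the time, which matters later when the data is tested against future price moves.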
One additional important step in Phase I is feature engineering: creating a set of new features derived from the raw data that increase its suitability for machine learning time series trend analysis. For example, we can add a data point that compares the current sales number to a rolling mean of the last 30 days.
It’s important to note that in Phase I we are mainly conditioning the data for research; no Machine Learning is deployed yet.
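The rolling-mean comparison can be sketched in a few lines. This is a hypothetical helper, not Lucena's implementation, and it makes two assumptions: the comparison is expressed as a ratio, and the mean covers the 30 observations before the current one, so the current value is not compared with an average that already includes it.

```python
from collections import deque
from typing import Deque, List, Optional


def ratio_to_rolling_mean(
    values: List[float], window: int = 30
) -> List[Optional[float]]:
    """Return values[t] / mean(values[t - window : t]) for each t.

    The mean uses only the `window` observations strictly before t.
    Positions without a full window of history, or whose trailing mean
    is zero, are reported as None.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    history: Deque[float] = deque(maxlen=window)
    running_sum = 0.0
    features: List[Optional[float]] = []
    for value in values:
        if len(history) == window:
            trailing_mean = running_sum / window
            features.append(value / trailing_mean if trailing_mean != 0 else None)
            running_sum -= history[0]  # oldest value is about to drop out
        else:
            features.append(None)
        history.append(value)  # deque discards the oldest entry when full
        running_sum += value
    return features
```

A value above 1 means current sales are running ahead of their recent average, and a value below 1 means they are lagging it. Unlike the raw sales number, this ratio is comparable across products and companies of different sizes.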
Underlying tasks of validating data completion and conditioning the data for machine learning research.
Phase II: Training, Validation and Hold-out
In Phase II we break our data into three distinct timeframes: Training, Validation and Hold-out.
During the Training Phase we construct multi-factor models and assess how strongly correlated these models are to future price actions of certain assets.
The Validation Phase assesses the most successful models from the Training Phase. In other words, we test the models that were refined in the Training Phase on new, unseen data.
This is a highly iterative process by which AI is deployed to repeatedly construct models and test them on a different timeframe.
The Hold-Out Period is designed to ultimately evaluate the data on yet another set of unseen data, which we will cover in Phase III.
Machine Learning is deployed for a highly iterative discovery process.
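The three timeframes can be illustrated with a simple chronological split. This is a minimal sketch with hypothetical names and cutoff dates. The point it shows is that the periods are cut by date rather than sampled at random, so each later period is truly unseen relative to the earlier ones.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Tuple[date, float]  # (observation date, value); an assumed layout


def chronological_split(
    rows: List[Row], train_end: date, validation_end: date
) -> Dict[str, List[Row]]:
    """Split rows into training, validation and hold-out periods by date.

    training:   dates up to and including train_end
    validation: dates after train_end, up to and including validation_end
    hold_out:   dates after validation_end
    """
    if train_end >= validation_end:
        raise ValueError("train_end must be earlier than validation_end")
    ordered = sorted(rows, key=lambda row: row[0])
    return {
        "training": [r for r in ordered if r[0] <= train_end],
        "validation": [r for r in ordered if train_end < r[0] <= validation_end],
        "hold_out": [r for r in ordered if r[0] > validation_end],
    }
```

Models are built on the training rows and compared on the validation rows. The hold-out rows are left untouched until the final test described in Phase III.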
Phase III: Backtest and Train the Machine Learning Model for Future Changes
We test the model that was formed and validated in Phase II on yet another set of unseen data (out of sample). The reason we take this extra step (since we’ve already tested on unseen data during the validation period) is that the iterative process of training and validation can lead to a known phenomenon called overfitting or, more specifically, selection bias. The sheer number of attempts to test models across the two timeframes makes it likely that some model fits both by chance, so a third timeframe is required to conduct one final test on the most promising models from Phase II.
What is overfitting?
With a growing number of variables, if you try and test every permutation, you will likely stumble upon something that looks predictive but does not hold up on new, unseen data.
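A toy simulation makes the point. In the sketch below, every candidate "signal" is a coin flip with no predictive value, and so are the market moves. All names and sample sizes are assumptions chosen for illustration. We pick the candidate with the best hit rate on a validation period and then score that same candidate on a hold-out period.

```python
import random
from typing import List, Tuple


def hit_rate(signal: List[bool], moves: List[bool]) -> float:
    """Share of periods in which the signal called the direction correctly."""
    return sum(s == m for s, m in zip(signal, moves)) / len(moves)


def selection_bias_demo(
    n_candidates: int = 500,
    n_validation: int = 100,
    n_hold_out: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (best validation hit rate, hold-out hit rate of that winner).

    Every candidate is pure noise, so any edge seen in validation exists
    only because we kept the luckiest of many attempts.
    """
    rng = random.Random(seed)

    def coin_flips(n: int) -> List[bool]:
        return [rng.random() < 0.5 for _ in range(n)]

    validation_moves = coin_flips(n_validation)
    hold_out_moves = coin_flips(n_hold_out)

    best_validation, winner_hold_out = -1.0, 0.0
    for _ in range(n_candidates):
        # Each candidate produces one call per period across both timeframes.
        validation_signal = coin_flips(n_validation)
        hold_out_signal = coin_flips(n_hold_out)
        score = hit_rate(validation_signal, validation_moves)
        if score > best_validation:
            best_validation = score
            winner_hold_out = hit_rate(hold_out_signal, hold_out_moves)
    return best_validation, winner_hold_out
```

The winner's validation hit rate sits well above 50% simply because it is the maximum of many random tries, while its hold-out hit rate falls back toward a coin flip. This is why a model that survives repeated training and validation still needs one last untouched period.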
Static vs. Dynamic Machine Learning Models:
Machine learning models for investment should rarely be static, and should adjust dynamically to changes in market conditions. Hence, predictive models need to be assessed and retrained periodically. An important side effect of dynamic models is that they are less likely to overfit.
In Phase III, we backtest our model on unseen data and assess whether the signal holds predictive value similar to what was determined during the preceding validation period. You can see how training and predicting repeat periodically, with the machine constantly assessing the model for decay and retraining on more recent data before forecasting again.
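The retrain-then-forecast cycle is often called walk-forward testing, and it can be sketched as follows. The window sizes and function names are hypothetical, and the "model" is deliberately trivial (the mean of the training window), standing in for whatever model would actually be retrained at each step.

```python
from statistics import mean
from typing import Dict, Iterator, List, Tuple


def walk_forward_windows(
    n_obs: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) pairs that roll forward in time."""
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_obs:
        split = start + train_size
        yield range(start, split), range(split, split + test_size)
        start += test_size  # slide forward so the next fit uses fresher data


def walk_forward_forecast(
    series: List[float], train_size: int, test_size: int
) -> Dict[int, float]:
    """Refit a placeholder model on each window and forecast the next block.

    Returns a mapping from observation index to the forecast made for it.
    """
    forecasts: Dict[int, float] = {}
    for train_idx, test_idx in walk_forward_windows(
        len(series), train_size, test_size
    ):
        fitted_level = mean(series[i] for i in train_idx)  # "retrain"
        for i in test_idx:
            forecasts[i] = fitted_level  # "predict" only after the training data
    return forecasts
```

Every forecast is made from data that precedes it, and each refit drops the oldest observations in favor of more recent ones. Comparing forecast quality from one window to the next is one way to notice a model decaying.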
Data Matching Engine Report
Data that has passed the three phases of our development process is ultimately made available through Lucena’s Data Matching Engine (DME). Consumers can enter their constituents list, investment style, time horizon, risk tolerance, etc., and DME identifies through event study technology which data signals are most suitable for their specific needs. The DME report provides more empirical evidence of how a user can combine multiple independent data sources into a single predictive machine learning model that can ultimately guide their investment decision.
Unlock Your Data’s True Potential
The process of conditioning alternative data for investment is not a trivial one. You need an honest independent partner who can produce empirical evidence of signal correlations to investment decisions.
Moreover, investment decisions should rarely rely on a single set of data; they should instead draw on multiple independent models that agree on a particular buy or sell signal. Lucena developed a Data Analytics Suite (DAS) to serve as an honest broker that transforms data into actionable insights for the financial markets.