How To Validate Alternative Data for Stock Forecasting


Erez Katz, CEO and Co-founder of Lucena Research

Many machine learning methods can be used for stock forecasting. A few we recently discussed use neural networks with time series data.


A few points to note on stock forecasting:


  • It’s important to distinguish between forecasting the actual price of a stock vs. its direction (higher or lower).
    • Even when we attempt to forecast the stock’s directional price action, we do it in the context of price action relative to a benchmark.
    • For example, a stock-price move relative to the S&P 500 (market relative price move).
    • To truly forecast what the price of a stock may be a few days or a few months in the future, the solution would most likely have to be regression-based and deterministic.
    • That’s a very tall order (and, in my opinion, impossible for high-latency investments, as opposed to high-frequency ones) given the dynamic and non-stationary nature of the stock market.
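To make the idea of market-relative direction concrete, here is a minimal sketch of how such labels could be built. The function name, the two-day horizon and the prices are all invented for illustration; this is not Lucena's actual labeling code.

```python
def relative_direction_labels(stock, benchmark, horizon):
    """Label day t with 1 if the stock's return over the next `horizon`
    days beats the benchmark's return over the same window, else 0."""
    if len(stock) != len(benchmark):
        raise ValueError("price series must be aligned and equal in length")
    labels = []
    # The last `horizon` days have no future price yet, so they get no label.
    for t in range(len(stock) - horizon):
        stock_ret = stock[t + horizon] / stock[t] - 1.0
        bench_ret = benchmark[t + horizon] / benchmark[t] - 1.0
        labels.append(1 if stock_ret > bench_ret else 0)
    return labels


# Made-up prices for illustration only (hypothetical stock vs. an index).
stock_prices = [100.0, 101.0, 103.0, 102.0, 106.0]
index_prices = [3000.0, 3030.0, 3045.0, 3090.0, 3075.0]
labels = relative_direction_labels(stock_prices, index_prices, horizon=2)
print(labels)
```

The classifier then only has to learn "outperform or not" rather than an exact future price, which is the distinction drawn above.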


  • Our research at Lucena is almost entirely focused on price action classification (whether a stock will be higher relative to its current position). We typically augment our deep learning classifiers with other deterministic machine learning methods (mainly KNN, decision trees, etc.).


  • We base our research on achieving the highest statistical significance. We don’t expect to be right all the time, but rather to be more right than wrong.  
    • This bodes well for deep neural nets built with TensorFlow (an open source deep learning framework developed by Google Brain) and Keras (an open source high-level neural network library that can run on top of TensorFlow or Theano).
    • The networks are trained to minimize loss (the difference between the network outcome and the desired, perfect knowledge, labeled outcome) in order to maximize accuracy and precision.
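As a rough illustration of the quantities mentioned above, the sketch below computes a loss, accuracy and precision for a binary "outperform or not" classifier. Binary cross-entropy is assumed here because it is a common choice for classification; the post does not say which loss function is actually used, and the probabilities are invented.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels (0/1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy_and_precision(y_true, y_prob, threshold=0.5):
    """Accuracy: share of correct calls. Precision: share of 'up' calls that were right."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(1 for y, yp in zip(y_true, y_pred) if y == yp)
    true_pos = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    predicted_pos = sum(y_pred)
    accuracy = correct / len(y_true)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    return accuracy, precision


# Invented outputs of a hypothetical network early and late in training.
y_true = [1, 0, 1, 1, 0]
early = [0.55, 0.60, 0.45, 0.50, 0.40]
late = [0.90, 0.20, 0.80, 0.70, 0.10]

for name, probs in (("early", early), ("late", late)):
    loss = binary_cross_entropy(y_true, probs)
    acc, prec = accuracy_and_precision(y_true, probs)
    print(f"{name}: loss={loss:.3f} accuracy={acc:.2f} precision={prec:.2f}")
```

In this toy data the loss falls while accuracy and precision rise, which is the pattern a healthy training run is expected to show.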


TensorBoard graphical representation of training in action. We are training on 10 futures constituents using 15 or so features across 21 days of history, converted to richer two-dimensional images. Note how, over 2,000 epochs (passes over the training data), the CNN in this example keeps improving: the loss moves lower (left image) while accuracy and precision improve (middle and right images).
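The caption does not say how the feature histories are turned into images, so the following is only one simple possibility: stack each 21-day window into a features-by-days grid and scale every feature to the range 0 to 1 within the window. The function name and the synthetic data are assumptions made for this sketch.

```python
def window_to_image(rows, end, lookback=21):
    """Turn the `lookback` daily rows ending at index `end` (inclusive) into a
    features x days grid, scaling each feature to [0, 1] within the window."""
    start = end - lookback + 1
    if start < 0:
        raise ValueError("not enough history for a full window")
    window = rows[start:end + 1]  # only data up to `end`, nothing from the future
    n_features = len(window[0])
    image = []
    for f in range(n_features):
        series = [day[f] for day in window]
        lo, hi = min(series), max(series)
        span = hi - lo
        # A flat feature carries no shape information; map it to all zeros.
        image.append([(v - lo) / span if span else 0.0 for v in series])
    return image


# Synthetic history: 30 days x 15 features, values invented for illustration.
history = [[(day * (f + 1)) % 7 + 0.1 * f for f in range(15)] for day in range(30)]
image = window_to_image(history, end=29)
print(len(image), "features x", len(image[0]), "days")
```

Each such grid can then be fed to a CNN as a single-channel image, with one training sample per asset per day.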


  • For time series data, we advocate daily features that are captured over time, but not necessarily daily stock prices. Our research has yielded very little actionable insight from pure stock price data or, for that matter, from other simple technical features (which are based on price/volume information).


  • The neural networks are trained to assess how various features, captured as daily sentiment scores, trend over time.
    • More importantly, the networks capture how trends from multiple orthogonal data sources, when overlaid on top of each other (figuratively speaking), can present more compelling correlations to stocks’ price moves than any single feature assessed alone.



How Effective Are Deep Networks In Analyzing An Alternative Data Source?


Deploying alternative data for price action classification is valuable not only for investors looking to deploy strategies. With the rise of alternative data, providers are seeking unbiased researchers to assess and validate their data’s readiness for use in the financial markets.

When you take into account how much data is available for stock forecasting, it can be a little mind-blowing. The buy side can spend countless hours and resources picking and analyzing data only to discover its inability to correlate with their specific needs.

Our mission at Lucena has always been to bridge the gap between data providers and buy side investment professionals. With known data issues such as noise, inaccurate observations and even, at times, intentionally false news, we’ve had to create enhanced classification technology to be more effective at streamlining the validation process of alternative data.

We pride ourselves on advising our data clients on how to make their data more effective and compelling to the financial consumer. To that end, we will never be a “black box” when it comes to our process. So, without further delay, here is an overview of how we help save our clients time and valuable resources.


Demystifying The Data Validation Process

The process of validating an alternative data source can be summed up in the following steps:

  • Ingest raw data
  • Utilize machine learning to:
    • Map
    • Feature engineer
    • Backtest for a thorough performance report
    • View paper trading simulation
    • Provide data feed updates
  • Deliver to data clients and our buy side clients for use in investment strategies

alternative data validation

The process of converting raw data into actionable signals. The final product is delivered back to our data providers which they can then offer to their customers.

We’ve created two platforms that expedite the data validation process, a Data Qualification Engine and Data Matching Engine. The two AI powered platforms automate the following:

  • Quickly assessing the basic maturity of an alternative data source.
  • Testing for data type consistency, universe coverage, sentiment score distribution, survivorship bias, NaNs, anomalies and outliers.
  • Data Mapping – The process of aggregating (normally to daily time series) and mapping the aggregate score to tradable securities.
  • Decile breakdown of signal strength to future returns, backtesting and more.
  • Feature Engineering – Enhancing the raw data with additional derived features more suitable to machine learning research. For example: applying time decay, ranking scores against a universe of peers and more.
  • Event Signal Classification to determine which features work best together for highest future price action consistency.
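A few of the steps listed above can be sketched in plain Python. Everything here is a simplified assumption rather than a description of the Data Qualification Engine or Data Matching Engine: the record layout `(date, ticker, score)`, the function names, the half-life parameter and the sample data are all hypothetical.

```python
import math
from collections import defaultdict


def is_missing(score):
    return score is None or math.isnan(score)


def quality_report(records, universe):
    """Share of missing scores and share of the universe with usable data."""
    covered = {ticker for _, ticker, score in records if not is_missing(score)}
    return {
        "nan_share": sum(is_missing(s) for _, _, s in records) / len(records),
        "universe_coverage": len(covered & set(universe)) / len(universe),
    }


def daily_scores(records):
    """Map raw observations to one mean score per (date, ticker)."""
    buckets = defaultdict(list)
    for date, ticker, score in records:
        if not is_missing(score):
            buckets[(date, ticker)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def time_decayed(series, half_life=3.0):
    """Exponentially weighted average that only looks backward in time."""
    alpha = 1.0 - 0.5 ** (1.0 / half_life)
    smoothed, state = [], None
    for value in series:
        state = value if state is None else alpha * value + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed


def peer_rank(score_by_ticker):
    """Percentile rank (0 = lowest, 1 = highest) within the peer universe."""
    ordered = sorted(score_by_ticker, key=score_by_ticker.get)
    last = len(ordered) - 1
    return {t: (i / last if last else 0.5) for i, t in enumerate(ordered)}


def bucket_returns(pairs, n_buckets=10):
    """Mean forward return per signal bucket; 10 buckets gives deciles."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    n = len(ordered)
    means = []
    for b in range(n_buckets):
        chunk = ordered[b * n // n_buckets:(b + 1) * n // n_buckets]
        means.append(sum(r for _, r in chunk) / len(chunk) if chunk else None)
    return means


# Invented records for illustration: (date, ticker, sentiment score).
records = [
    ("2020-01-02", "AAA", 0.2),
    ("2020-01-02", "AAA", 0.6),
    ("2020-01-02", "BBB", -0.4),
    ("2020-01-02", "CCC", float("nan")),
]
report = quality_report(records, universe=["AAA", "BBB", "CCC", "DDD"])
daily = daily_scores(records)
ranks = peer_rank({ticker: score for (_, ticker), score in daily.items()})
decayed = time_decayed([1.0, 0.0, 0.0, 0.0], half_life=1.0)
deciles = bucket_returns([(i, 0.001 * i) for i in range(20)])
print(report, ranks, decayed, sep="\n")
```

A signal with real predictive content would be expected to show forward returns that rise (or fall) fairly steadily from the lowest decile to the highest; a flat or erratic decile profile is an early hint that the data may not be worth deeper research.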


Event Study Analysis for Investment Strategies

  • Backtesting and Paper Trading Simulation – At the conclusion of the validation and enhancement processes, we are ready to backtest a trading strategy that buys and sells stocks based on the data provider’s signals. Any backtest can be turned into a live paper-traded portfolio so that the correlation between signals and returns is tested perpetually into the future. In addition, Lucena provides comprehensive performance and attribution reporting for a highly visual, intuitive assessment.


Backtest Performance

Backtest simulation of signal strength against a predetermined benchmark. The backtest closely simulates a real-life scenario by taking into account transaction costs and slippage and by not allowing any peeking into the future.
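The two safeguards named in the caption, trading frictions and no look-ahead, can be shown in a toy single-asset backtest. This is a deliberately minimal sketch, not Lucena's simulator: the long-or-flat rule, the `cost_per_turnover` parameter (which lumps transaction costs and slippage into one number) and the data are all assumptions.

```python
def backtest(signals, returns, cost_per_turnover=0.001):
    """Daily long/flat backtest of one asset.

    signals[t] is known at the close of day t; the position it implies earns
    returns[t + 1], so no decision uses information from its own future.
    """
    equity, position, curve = 1.0, 0.0, []
    for t in range(len(returns) - 1):
        target = 1.0 if signals[t] > 0 else 0.0
        # Assumed friction model: cost proportional to the change in position.
        cost = abs(target - position) * cost_per_turnover
        position = target
        equity *= 1.0 + position * returns[t + 1] - cost
        curve.append(equity)
    return curve


# Invented daily returns and signals for illustration only.
returns = [0.00, 0.01, 0.02, -0.03, 0.01]
signals = [1, 1, -1, 1, 0]

curve = backtest(signals, returns)
frictionless = backtest(signals, returns, cost_per_turnover=0.0)
buy_and_hold = backtest([1] * len(returns), returns, cost_per_turnover=0.0)
print(curve[-1], frictionless[-1], buy_and_hold[-1])
```

Comparing `curve` with `frictionless` shows how much of the apparent edge is consumed by trading costs, and comparing either with `buy_and_hold` plays the role of the benchmark comparison.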


Don’t Let Data Research Drain Your Resources


The explosion of new alternative data brings exciting new dimensionality to deep learning research.

Location data, corporate action data, social media sentiment, consumer spending activity and much more can be used to enhance your investment strategies.

With alternative data flooding the market, buy side clients want to be smart about how they deploy their research resources. Lucena bridges that gap by automating the sifting of substance from noise, allowing hedge fund managers to concentrate on what’s most promising.

Even the most sophisticated hedge funds are constantly looking for ways to determine if a data provider is worth pursuing. They want to be able to “fail fast” so that precious quant research isn’t wasted on unqualified data. The very same technology we use to deploy predictive signals for investment strategies can be used to assess if an alternative data source contains validated and predictive information.




