Erez Katz, CEO and Co-founder of Lucena Research.

 

The rapid growth of big data has resulted in a technology and AI arms race. In the past, being an AI player would typically earn you a new level of professional esteem, but big data, data science and machine learning are no longer the privilege of a fortunate few.

Complex concepts in deep learning are now more accessible than ever. For example, image recognition problems can be solved with approximately 97% accuracy with as little as 75 lines of code in Python on TensorFlow.

As many seek to add alternative data and deep learning as an overlay to their research, they are quick to realize that throwing data at a deep learner is not enough to solve real-world problems. The reality is that data science is essential to a commercially sound deep learning solution, but the learner also needs to be given guidelines.

Key Concepts For Quantitative Investment Research

Labeled vs. Unlabeled Data

Unlabeled data consists of observations that can be obtained relatively easily from the world around us. The data is mainly a collection of items without any classification or identification. Social media tweets, news feeds, corporate actions, sales figures, and location data are all considered in their natural form to be unlabeled data. In contrast, labeled data is unlabeled data augmented with some sort of identification (also called classification or tagging).

The additional information overlaid on unlabeled data is mainly human-driven, and therefore more costly and harder to obtain. For example, labeling a photo to distinguish whether it contains a dog or a cat must be informed initially by humans. The process of deep learning is empowering a machine to infer the proper label from new unlabeled data by generalizing the relationship between the data (attributes or features) and its labels.

 

Image recognition deep learning

Machine learning works by “inspecting” lots and lots of labeled data in order to discern the underlying association rules. Subsequently, when the machine is presented with new unlabeled data, it can label it on its own with a high degree of certainty (measured by accuracy and precision).
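To make "accuracy and precision" concrete, here is a minimal sketch that scores a set of machine-assigned labels against the human-assigned ones. The labels and the function name `accuracy_and_precision` are made up for illustration; accuracy is the share of all predictions that are correct, and precision is the share of "dog" predictions that really are dogs.

```python
def accuracy_and_precision(true_labels, predicted_labels, positive="dog"):
    """Return (accuracy, precision) with respect to the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    # Accuracy: fraction of all predictions that match the true label
    correct = sum(1 for truth, guess in pairs if truth == guess)
    accuracy = correct / len(pairs)
    # Precision: of everything predicted as positive, how much truly is positive
    predicted_positive = [(truth, guess) for truth, guess in pairs if guess == positive]
    true_positive = sum(1 for truth, _ in predicted_positive if truth == positive)
    precision = true_positive / len(predicted_positive) if predicted_positive else 0.0
    return accuracy, precision

# Hypothetical human labels versus a model's predictions for eight photos
true_labels = ["dog", "cat", "dog", "cat", "dog", "cat", "cat", "dog"]
predicted_labels = ["dog", "dog", "cat", "cat", "dog", "dog", "cat", "dog"]

accuracy, precision = accuracy_and_precision(true_labels, predicted_labels)
print(f"accuracy={accuracy:.3f} precision={precision:.3f}")
```

The two numbers answer different questions, which is why both are usually reported.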

The process of mapping and labeling raw data so that it is most effective for a deep learner is not simple. It requires deep domain expertise in the underlying business that sources the data. In addition, in most cases the data is not directly related to the financial markets and it needs to be aggregated or arranged in a certain way so that it’s most relevant to tradable securities.

 

Data Science: Cutting Through the Noise for Machine Learning

 

Imagine you have a five-year history of every mobile sales transaction at Starbucks. If it is not analyzed and aggregated properly, even the most obvious information may fail to be predictive.

A natural approach would likely be to correlate total sales to Starbucks’ future stock price since strong sales will most likely drive the next quarterly earnings. However, if the sales figures are not seasonally adjusted, or not measured in the context of a change against a predetermined moving average, the information will be less exploitable by the deep learner.
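As a rough illustration of the two adjustments mentioned above, the sketch below turns raw quarterly sales into a year-over-year change (a simple form of seasonal adjustment) and into a percent deviation from a trailing moving average. The sales figures, the four-quarter window, and the function names are assumptions chosen for the example, not a prescribed method.

```python
def year_over_year_change(quarterly_sales):
    """Percent change versus the same quarter one year (4 quarters) earlier."""
    return [
        (quarterly_sales[i] - quarterly_sales[i - 4]) / quarterly_sales[i - 4]
        for i in range(4, len(quarterly_sales))
    ]

def change_vs_moving_average(values, window):
    """Percent deviation of each value from the mean of the prior `window` values."""
    deviations = []
    for i in range(window, len(values)):
        baseline = sum(values[i - window:i]) / window
        deviations.append((values[i] - baseline) / baseline)
    return deviations

# Hypothetical quarterly sales (in $ millions) with a strong Q4 seasonal peak
sales = [100, 105, 110, 150, 110, 115, 120, 165]

yoy = year_over_year_change(sales)
vs_ma = change_vs_moving_average(sales, window=4)
print([round(x, 3) for x in yoy])
print([round(x, 3) for x in vs_ma])
```

In the raw series the fourth-quarter spike dominates; in the year-over-year series each quarter is compared with its own seasonal counterpart, so the steady underlying growth becomes visible.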

There are additional hidden gems in Starbucks’ sales data that only an astute data scientist can recognize. Take, for example, aggregating Starbucks sales by geographic location or consumer demographic profile. There may be additional valuable information to infer from a positive Starbucks sales trend regarding things like consumer confidence, discretionary spending forecasts, or the sales forecasts of supply chain dependencies.
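Aggregation of this kind is mechanically simple; the judgment lies in choosing the grouping key. A minimal sketch, assuming each transaction has already been tagged with a region (the regions and amounts below are invented):

```python
from collections import defaultdict

# Hypothetical transactions: (region, amount in dollars)
transactions = [
    ("Northeast", 4.50), ("Northeast", 6.25), ("West", 5.00),
    ("South", 3.75), ("West", 7.50), ("South", 4.25),
]

def total_by_key(rows):
    """Sum the amounts for each distinct key (here, the region)."""
    totals = defaultdict(float)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)

sales_by_region = total_by_key(transactions)
print(sales_by_region)
```

Swapping the region for a demographic segment would reuse the same function unchanged.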

Independently and Identically Distributed Data (IID Data)

Labeled data is most effective for a deep learner when its values are independent of each other and are similarly distributed. The reality, however, is that most time-series data (the type of data mostly used for price forecasting) is actually dependent. It is incumbent upon the data scientist to map the training data into an IID-like structure for the most efficient deep learning process.
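One common way to move a price series toward an IID-like structure is to work with returns rather than price levels. This is a general illustration, not necessarily the mapping used by the author. The sketch below simulates a hypothetical random-walk price series and compares the lag-1 autocorrelation of the prices with that of their log returns.

```python
import math
import random

def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(later / earlier) for earlier, later in zip(prices, prices[1:])]

def lag1_autocorrelation(series):
    """Sample correlation between each value and the value just before it."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum((series[i] - mean) * (series[i - 1] - mean) for i in range(1, n))
    denominator = sum((x - mean) ** 2 for x in series)
    return numerator / denominator

# Simulated prices: each day's price is yesterday's price times a random shock
rng = random.Random(0)
prices = [100.0]
for _ in range(500):
    prices.append(prices[-1] * math.exp(rng.gauss(0, 0.01)))

price_ac = lag1_autocorrelation(prices)
return_ac = lag1_autocorrelation(log_returns(prices))
print(f"prices: {price_ac:.2f}, returns: {return_ac:.2f}")
```

Price levels are heavily dependent on the previous day's level, while the returns in this simulation are close to uncorrelated. Real market returns are not perfectly IID either, so this is a starting point rather than a complete fix.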


 

Describing a data problem in IID and non-IID format. Non-IID problems require multiple independent models and are naturally harder to solve.

 

Suppose you are investigating how heart rate can predict whether a person is a smoker. You can measure BPM (beats per minute) as 30 one-minute readings and ask if the person smokes in order to build your labeled dataset. What would contribute to a better predictor?

  • 30 observations from one person who smokes?
  • or one observation from 30 different people who smoke?

Given that one person’s heart rate won’t change much over 30 minutes, it seems clear that you’d rather have one-minute samples from 30 different people than 30 one-minute samples from one person.

To put this in context, 30 observations from different people are independent and identically distributed, while 30 readings from the same person are highly dependent. An astute data scientist should know how to map sample data into an IID format in order to provide the machine with the conditions most conducive to learning.
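A small simulation can make the difference tangible. The sketch below uses a simplified stand-in for the smoker question: it estimates the average smoker heart rate from 30 readings and repeats the experiment many times under each sampling scheme. All of the numbers (population mean, between-person and within-person spread) are hypothetical.

```python
import random
import statistics

rng = random.Random(42)
POPULATION_MEAN_BPM = 80.0  # hypothetical average heart rate among smokers
BETWEEN_PERSON_SD = 10.0    # hypothetical spread of baselines across people
WITHIN_PERSON_SD = 2.0      # hypothetical minute-to-minute noise for one person

def sample_one_person(n_readings):
    """n dependent readings: all share one person's baseline heart rate."""
    baseline = rng.gauss(POPULATION_MEAN_BPM, BETWEEN_PERSON_SD)
    return [rng.gauss(baseline, WITHIN_PERSON_SD) for _ in range(n_readings)]

def sample_many_people(n_people):
    """n independent readings: one reading from each of n different people."""
    return [sample_one_person(1)[0] for _ in range(n_people)]

def spread_of_estimates(sampler, trials=2000):
    """Std dev of the estimated population mean across repeated experiments."""
    estimates = [statistics.mean(sampler()) for _ in range(trials)]
    return statistics.pstdev(estimates)

spread_dependent = spread_of_estimates(lambda: sample_one_person(30))
spread_independent = spread_of_estimates(lambda: sample_many_people(30))
print(f"one person x 30: {spread_dependent:.2f} BPM")
print(f"30 people x 1:   {spread_independent:.2f} BPM")
```

Under these assumptions, the estimate built from 30 different people is far more stable, because 30 readings from one person carry little more information than a single reading.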

 

You Have Data, Now What?  

The examples above provide some context on what it takes to make alternative data and deep learning work together by layering in data science. At Lucena, we have systematized the life cycle of taking raw data and converting it into actionable and consumable products. Our process includes:

  • Consumption of raw data
  • Data validation
  • Data enhancement with feature engineering and proper labeling
  • Training predictive models through deep learning technology
  • Cross validation
  • Out of sample backtesting
  • Delivery of derived signals in multiple consumable formats:
    • Data Feeds
    • Model Portfolios
    • Custom Research

The question remains: how do we incorporate the growth in machine learning into forecasting asset prices? Data must go through a thorough and robust journey of validation, enhancement and testing before it can form the basis for an investment strategy. Once your data has been vetted and enriched with feature engineering, you’re ready to test incorporating a quantitative overlay into your investment strategy research.

 


 


 

 
