Learning Machine Learning: The Infrastructure

In 2016, all I was reading about was big data, deep learning, artificial intelligence, machine learning, etc… soon I realized I needed to do more than just read about it. So for 2017, I decided it was time to take a deep dive into Machine Learning and see what all the buzz was about.

I haven’t programmed in 20 years but figured now would be a great time to restart. From all the reading I did in 2016, it was clear that the programming language of choice for Machine Learning was Python. I didn’t want to take a bunch of disconnected courses on Coursera and Udacity to learn about Machine Learning; instead, I had a project in mind. When I moved to India 12 years ago, it was to launch an algorithm/quant hedge fund, and I was the guy tasked with getting all the technology infrastructure (servers, data feeds, leased lines, datacenter access, etc…) in place. Over time I was supposed to learn to build trading algorithms, but one thing led to another and I never got around to building those models. For years I felt the algo/quant space was overdone and it would be tough to get back into it. However, there has been a resurgence with all of the new technologies involving Artificial Intelligence entering the space. So that was my goal: learn Machine Learning to trade the stock market.

I spent the first couple of weeks of the new year putting together a plan to accomplish the end goal. The first thing was to take an introductory course on Python from Coursera. In parallel, I was researching the algo/quant side and understanding what goes into building models, trading models and risk management. Not only did I want to learn about Machine Learning, but whatever I built, I wanted to build it like it was going to be a billion-dollar asset management company: highly redundant architecture, quality data feeds and top-notch risk management. It soon became clear this was not something that was going to get built over a weekend!

I was able to break down the work into 3 stages:
1. Infrastructure – cloud provider, servers, databases, data feeds, trade execution
2. Research trading models – researching and designing algorithms to produce “alpha”
3. Risk management – once the trade is made, constantly monitoring the position and making sure it fits within the risk model that has been designed, typically quantified by what the industry calls Value at Risk (VaR).
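
To make the VaR idea concrete, here is a minimal sketch of the historical-simulation approach, where VaR is just a percentile of the return distribution. The function name and the simulated return series are illustrative assumptions, not an actual risk model:

```python
import numpy as np

def historical_var(returns, confidence=0.95):
    """Historical-simulation VaR: the daily loss threshold that
    returns are expected to breach (1 - confidence) of the time.
    Reported as a positive loss figure."""
    return -np.percentile(returns, 100 * (1 - confidence))

# Hypothetical example: 1,000 simulated daily returns for one position
rng = np.random.default_rng(42)
daily_returns = rng.normal(loc=0.0005, scale=0.02, size=1000)
var_95 = historical_var(daily_returns, confidence=0.95)
```

In words: with 95% confidence, the position is not expected to lose more than `var_95` (as a fraction of its value) on a given day. Real risk systems extend this to whole portfolios and stress scenarios.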

This blog post will talk about the infrastructure and some of the technology I learned along the way.

It quickly became apparent that many of the Machine Learning experts were using something called Jupyter, an open-source platform to share notebooks and run live Python code. It’s like an online version of an IDE (integrated development environment) that programmers use to build applications.

The next thing was to start getting data, and lots of it, onto the platform I had built. For all the crap I talk about Yahoo, they have a pretty good finance section for downloading historical stock data for Indian stocks. Using pandas, a Python data analysis library, I was able to pull down all the price data I needed.
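
Loading that kind of daily price data with pandas looks roughly like this. The CSV sample below is made up, using the column layout Yahoo’s historical downloads have used; it stands in for a real downloaded file:

```python
import io
import pandas as pd

# Hypothetical sample in the Date/Open/High/Low/Close/Volume layout
# of a Yahoo Finance historical download
csv_data = io.StringIO("""Date,Open,High,Low,Close,Volume
2017-01-02,1050.00,1065.50,1048.25,1062.10,254000
2017-01-03,1062.10,1070.00,1055.00,1058.40,198000
2017-01-04,1058.40,1072.30,1057.80,1069.95,311000
""")

# Parse into a DataFrame indexed by trading date
prices = pd.read_csv(csv_data, parse_dates=["Date"], index_col="Date")

# Daily percentage change in the close, a typical first transformation
prices["Return"] = prices["Close"].pct_change()
```

For a real file you would pass the downloaded CSV’s path to `read_csv` instead of the in-memory buffer.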

Some of the technologies I learned and implemented along the way:

  • Amazon Web Services – the cloud provider
  • EC2/Ubuntu – Linux distribution on an EC2 server
  • Let’s Encrypt – secure the server with a free SSL cert
  • Python – programming language
  • Jupyter – online IDE
  • pandas – data analysis library for Python (developed by an AQR employee)
  • Python scripting – used to get the daily price updates from Yahoo
  • RDS/MySQL – database where the price data resides
  • crontab – run the Python script at 2am daily
  • crontab.guru – a super simple site to understand the syntax for scheduling cron jobs
  • MySQL Workbench – software to interact with the MySQL DB
  • SQL statements – Structured Query Language (SQL) to manage and get data from the DB
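
The database side of the list can be sketched with a few SQL statements. Python’s built-in sqlite3 module stands in for RDS/MySQL here so the snippet runs anywhere, and the table and column names are hypothetical, not the actual schema:

```python
import sqlite3

# sqlite3 used as a stand-in for RDS/MySQL; schema names are hypothetical
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_prices (
        symbol     TEXT NOT NULL,
        trade_date TEXT NOT NULL,
        close      REAL NOT NULL,
        PRIMARY KEY (symbol, trade_date)
    )
""")

rows = [
    ("RELIANCE", "2017-01-02", 1082.50),
    ("RELIANCE", "2017-01-03", 1075.20),
    ("TCS",      "2017-01-02", 2365.00),
]
conn.executemany("INSERT INTO daily_prices VALUES (?, ?, ?)", rows)

# Typical query: the latest close for each symbol
latest = conn.execute("""
    SELECT symbol, MAX(trade_date), close
    FROM daily_prices
    GROUP BY symbol
""").fetchall()
```

The same `CREATE TABLE`, `INSERT` and `SELECT` statements are what a nightly cron-driven Python script would run against MySQL after pulling the day’s prices.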

Below is a SlideShare document showing the process of setting up the server on AWS:

Part 2 will talk about the research aspect of building trading models – the traditional methods and the newer Machine Learning tools like Apache SystemML, Caffe2, Microsoft’s CNTK, TensorFlow and scikit-learn, to name a few.