Evolving a Machine Learning & Analytics Platform in Python @ Full Stack Toronto Meetup

Our stack is a Python back-end with AngularJS for the front-end. We started with a pretty simple Django app using MySQL with scikit-learn for predictive modelling. Along the way we added Celery and Redis, statsmodels for forecasting, and we’re in the process of moving our data analysis into Cassandra with Spark. This talk will focus on technology choices and how we’ve grown the system from just a few stores to thousands in less than a year.

Brandon from Vantage Analytics presented at Shopify’s Toronto office for FSTO’s October 2014 meetup. Here are my notes:

Product info

Vantage Analytics provides analytics for merchants on Shopify and other e-commerce platforms. It takes in records of customer purchases and outputs trends, predictions, and actions merchants can take to improve their business, such as suggested PPC ad campaigns. In short: data science for small to medium-sized merchants.

A timeline of growth & change

Jan 2014

Product is live and has a few customers
Stack is simple (backend has Django, some Celery tasks, MySQL, only a few VMs)
Cron job runs a few big calculation tasks to figure out the numbers merchants want to see
Vantage is growing by a few merchants per week

Feb 2014

Growth spiked for a few days after being featured by Shopify. Get bigger servers!
No major architecture changes

Spring 2014

The team finds the bottlenecks in architecture and infrastructure as features are added and user base grows

The cron job that crunches numbers (the most important part of Vantage) is getting slow and memory usage is growing
Network or disk IO is becoming a bottleneck when importing data from Shopify’s API

Time for changes!

Hire a backend developer
Move some number crunching out of the DB and into Python (so DB can focus on writes?)
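
A rough sketch of what moving number crunching out of the DB and into Python could look like. The talk did not describe the actual calculations, so the orders table, its columns, and revenue_by_merchant are all hypothetical, and sqlite3 stands in for MySQL only so the example runs anywhere. The database serves a plain read and the aggregation happens in application code:

```python
import sqlite3
from collections import defaultdict


def revenue_by_merchant(conn):
    """Sum order amounts per merchant in Python instead of with SQL GROUP BY."""
    totals = defaultdict(float)
    # The database only serves a simple SELECT; the summing happens here.
    for merchant_id, amount in conn.execute(
        "SELECT merchant_id, amount FROM orders"
    ):
        totals[merchant_id] += amount
    return dict(totals)


# Assumption: sqlite3 is a stand-in for MySQL, with a made-up orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (merchant_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 5.5), (2, 20.0)],
)
totals = revenue_by_merchant(conn)
```

The trade-off is that more rows travel over the network to the application, so this probably only pays off when the DB itself is the bottleneck.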

Summer 2014

More changes!

Start caching DB reads with Redis to reduce DB IO further
Break giant cron job into smaller tasks, then use Celery queue to manage them
Scale up Celery from 1 machine to a cluster
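
The caching step above is usually done as a cache-aside read: check the cache first, and only hit the DB on a miss. Here is a minimal sketch, not Vantage's actual code. InMemoryCache is a hypothetical stand-in that mimics the get and setex methods of a redis-py client so the example runs without a Redis server; cached_read, load_summary, and the key name are also made up for illustration:

```python
import json
import time


class InMemoryCache:
    """Hypothetical stand-in exposing get/setex like a redis-py client."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires_at = self._data.get(key, (None, 0.0))
        if value is not None and time.monotonic() >= expires_at:
            # Entry has outlived its TTL; drop it and report a miss.
            del self._data[key]
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._data[key] = (value, time.monotonic() + ttl_seconds)


def cached_read(cache, key, load_from_db, ttl_seconds=300):
    """Cache-aside: return the cached value, or load it once and cache it."""
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    value = load_from_db()
    cache.setex(key, ttl_seconds, json.dumps(value))
    return value


db_calls = []


def load_summary():
    # Hypothetical expensive query; here it only records that it ran.
    db_calls.append(1)
    return {"merchant_id": 42, "orders": 3}


cache = InMemoryCache()
first = cached_read(cache, "summary:42", load_summary)   # miss, hits the "DB"
second = cached_read(cache, "summary:42", load_summary)  # hit, served from cache
```

With a real Redis server, a redis.Redis() client should be able to replace InMemoryCache here, since json.loads also accepts the bytes that client returns.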

The stack as of now

Django for web app framework
MySQL for DB
Redis for DB caching
Celery for long running asynchronous jobs
RabbitMQ for message queue
AngularJS frontend
Infrastructure will be OK with thousands of merchants signed up

Lessons from 2014

Measure everything! Shout out to New Relic Pro; prices are negotiable
Use queues to run as many long tasks asynchronously as possible (this is where Celery comes in)
MySQL datatypes are hard to change later. Get them right before the data grows large. E.g. queries involving BLOB columns require a hit to the hard disk, which is bad for performance
When you have to change the schema of a big MySQL table, use pt-online-schema-change to minimize downtime
Separate infrastructure into multiple servers early for easy scaling (queue manager on 1 box, queue workers on other boxes, web app on other boxes, DBs on other boxes)
Hosting on VM Farms was a good choice for offloading system admin tasks, especially when the product team was tiny
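
To illustrate the queue lesson above, here is a small sketch of splitting one big job into batches that run independently. It is only an assumption about the general shape: a standard-library thread pool stands in for Celery workers, and chunked, crunch_batch, and run_all are hypothetical names. In a Celery setup, each batch would instead be sent to a task with its delay method and picked up by workers on other boxes:

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def crunch_batch(merchant_ids):
    # Hypothetical per-batch calculation; the real work is not in the notes.
    return {merchant_id: merchant_id * 2 for merchant_id in merchant_ids}


def run_all(merchant_ids, batch_size=2, workers=4):
    """Fan small batches out to workers instead of running one giant job."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each batch is processed independently, so one slow or failing
        # batch does not force the whole run to start over.
        for partial in pool.map(crunch_batch, chunked(merchant_ids, batch_size)):
            results.update(partial)
    return results


results = run_all([1, 2, 3, 4, 5])
```

Smaller batches also keep memory usage per worker bounded, which was one of the problems with the original cron job.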

Next steps

Data needs will outgrow what they can do with MySQL at some point. Cassandra? Cassandra + Spark? Not loving Spark yet. Implementing it may be too big a task for a small team, but they aren’t in a rush to move away from MySQL yet.

Remember FSTO Conf is coming up fast! Get your tickets for November 22 – 23, 2014.
