To supercharge data science teams and enable the development of data products that react efficiently to signals from the wild, we need to build stable infrastructure that supports real-time, near-real-time, and batch processing.
A particular early challenge we faced here at Mad Street Den was building scalable, multi-tenant, and cost-efficient real-time data pipelines that serve use cases like real-time sessionization of clickstreams. At the Fifth Elephant conference in Bangalore this week, we will be sharing some of our hard-earned insights from building out asynchronous data workflows based on Celery, our weapon of choice.
Data – There is a lot of it!
Data needs to be aggregated for every stage and made compatible for analysis & consumption
The plumbing story involves three important phases:
Preparation: Ask questions, collect & organize data