How Creating a Cloud-Agnostic Scheduler Helped Reduce Maintenance of Data Pipelines

Written by Zapr Media Labs | Jan 6, 2021 8:44:41 AM

Zapr provides TV viewership data for millions of individuals that conventional measurement systems can't capture. We are sitting on gigabytes of insights for every single brand, and we run the biggest offline media consumption repository in India. With big data comes big responsibility. As a data company operating at this scale, we have numerous data pipelines that execute on a day-to-day basis, owned by many different teams. Examples include production ETL operations, data warehousing jobs, client data sharing jobs, data analytics jobs, sanity jobs and much more.

Existing Setup

Our legacy setup for the majority of our data pipelines was an in-house data lake cluster (more detail about the Zapr Data Lake can be found here). It consists of Apache Hive with MR/Tez/Spark as execution engine options. We leveraged Oozie and its Coordinator service for authoring our workflows. This was the standard for quite a while, and every team was using it. Oozie has fairly rich functionality, with support for a wide range of action types, which was sufficient for the initial phase.

Why did we need a cloud-agnostic scheduler at Zapr?

Down the line, as we wanted stronger SLAs and isolated execution environments for our critical production workloads, we started migrating some of them to transient EMR clusters authored via Airflow. Newly written pipelines are now scheduled in Airflow, but we still have legacy workloads operating in Oozie or on transient EMRs, whether due to tech backlog, team-specific expertise, priorities or preferences. As a result, our workloads are scattered across different systems: Oozie, EMR with Lambda triggers, and Airflow.

All of these data pipelines, owned by different teams and scattered across different platforms, are mostly interdependent. The data lineage of these pipelines forms a DAG (directed acyclic graph). Ideally, we would want a workload to start only after its upstream dependencies have completed, and we would want daily workload execution data to be easily and readily available so that anyone can see the status.

But with these workloads scattered across different systems, maintaining them was becoming a growing issue. There was no visible data lineage. When making changes to one workflow, there was no straightforward way to determine which downstream workflows would be affected; we had to ask around different teams to figure out the dependents. Because of this, deployments were also becoming difficult and often resulted in the failure of some unknown dependent. Ideally, you would want your workload to fail fast, or better, not start at all, when upstream dependencies are not satisfied.

But because different platforms were used across pipelines, we relied on time-based scheduling, and that led to slow failures, for example, a workflow starting only to discover that no input data was available. Frequent human intervention was also required to hold or restart workloads. There was no central place to check the execution status of a workload; you had to know where that workload was running and go to that platform to view its status. There was no centralized alerting, and we were experiencing frequent SLA misses.

What we wanted was a cloud/system-agnostic scheduler (just a scheduler, not a reinvention of a full-blown authoring tool like Airflow). We wanted this tool to be the central trigger/schedule point for any of our workloads, whether they reside in Airflow, EMR, Oozie or any other tool in the future. We wanted both time-based (cron) and truly event-based triggering. By truly event-based, we mean push-based events, unlike sensor tasks in Airflow, where you poll for the completion of an upstream workflow for a certain amount of time and abandon the run if the parent workflow does not complete within the poll window.

We also wanted multi-parent and multi-child dependencies. For example, after one workflow completes, both the client data sharing and report generation pipelines should be triggered. We wanted it to be the central hub for all Zapr workloads, where one could tell at a glance what was supposed to be executed that day and what the current status of any data pipeline was.

Bezel

So, we developed an internal tool called Bezel. It is a lightweight, cloud-agnostic, event-driven job scheduler with central SLA alerting and monitoring, built on top of the embedded Quartz scheduler. It fully leverages the benefits of event-driven architecture: all time-based, dependency-based or ad hoc schedules are published as events to the scheduler's event queue. Then, based on the event type, the event's priority and the fulfilment of required dependencies, the event executor decides whether to submit a job via an available worker. In Bezel, workloads can be assigned to different priority groups, and execution priorities are defined by these groups. See the diagram below for a brief architecture overview:

[Bezel architecture overview diagram]
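To make the event flow concrete, here is a minimal Java sketch of how an event executor loop might drain a priority queue and submit jobs only once upstream dependencies are satisfied. All names here (ScheduleEvent, EventExecutor, and the helper methods) are hypothetical illustrations, not Bezel's actual classes.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical event: every time-based, dependency-based or ad hoc schedule
// lands in the queue as one of these, carrying its priority group's rank.
class ScheduleEvent implements Comparable<ScheduleEvent> {
    final String workflowId;
    final int priority; // lower value = higher priority

    ScheduleEvent(String workflowId, int priority) {
        this.workflowId = workflowId;
        this.priority = priority;
    }

    @Override
    public int compareTo(ScheduleEvent other) {
        return Integer.compare(this.priority, other.priority);
    }
}

class EventExecutor implements Runnable {
    private final PriorityBlockingQueue<ScheduleEvent> queue = new PriorityBlockingQueue<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Cron triggers, upstream-completion events and ad hoc runs all publish here.
    void publish(ScheduleEvent event) {
        queue.put(event);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                ScheduleEvent event = queue.take(); // highest-priority event first
                if (upstreamDependenciesMet(event.workflowId)) {
                    workers.submit(() -> triggerJob(event.workflowId));
                } else {
                    // Fail fast instead of polling: record the skip and alert.
                    recordSkipped(event.workflowId);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Placeholders for metadata lookups and platform-specific job submission.
    private boolean upstreamDependenciesMet(String workflowId) { return true; }
    private void triggerJob(String workflowId) { /* delegate to a platform job */ }
    private void recordSkipped(String workflowId) { /* update metadata, alert */ }
}
```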

Currently, Bezel supports triggering Oozie, Lambda, Airflow, script and Python jobs out of the box. To make Bezel future-proof, we made its job interface generic: if there is ever a requirement to run jobs on another platform, we can simply implement Bezel's custom job interface and provide the platform-specific logic for triggering a workflow and tracking its status. Based on the exit status and/or error stack trace of the job, Bezel does all the other heavy lifting out of the box: updating metadata, generating the necessary success/failure events and sending the necessary SLA alerts.
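As an illustration, such a generic job contract could look something like the following minimal Java sketch. The names (BezelJob, JobStatus, AirflowJob) are our assumptions for illustration, not Bezel's published API, and the Airflow call assumes Airflow 2's stable REST API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;

// Hypothetical status model; Bezel's real states are not documented here.
enum JobStatus { RUNNING, SUCCEEDED, FAILED }

// Generic contract: trigger a workflow on some platform and track its status.
interface BezelJob {
    // Submit the workflow and return a platform-specific run identifier.
    String trigger(Map<String, String> params) throws Exception;

    // Ask the platform for the current status of a previously triggered run.
    JobStatus pollStatus(String runId) throws Exception;
}

// Illustrative platform implementation: triggering an Airflow DAG run via
// Airflow 2's stable REST API (POST /api/v1/dags/{dagId}/dagRuns).
class AirflowJob implements BezelJob {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String baseUrl;
    private final String dagId;

    AirflowJob(String baseUrl, String dagId) {
        this.baseUrl = baseUrl;
        this.dagId = dagId;
    }

    @Override
    public String trigger(Map<String, String> params) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create(baseUrl + "/api/v1/dags/" + dagId + "/dagRuns"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("{\"conf\": {}}"))
            .build();
        HttpResponse<String> response =
            http.send(request, HttpResponse.BodyHandlers.ofString());
        // A real implementation would parse dag_run_id out of the JSON body.
        return response.body();
    }

    @Override
    public JobStatus pollStatus(String runId) throws Exception {
        // GET /api/v1/dags/{dagId}/dagRuns/{runId}, then map its "state" field.
        return JobStatus.RUNNING; // placeholder
    }
}
```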

So, it’s really customizable. It supports both cron based and dependency based scheduling out of the box. With all of the data pipelines being scheduled from this central place, the benefits are manifold - we are able to easily do event based scheduling across the platforms, re-running of all the dependent workflows has become very straightforward and easy, and we have all of the execution metadata at a centralized place. So, all of the SLA monitoring and alerting could be done from the central place.


Apart from central Slack alerts, Bezel provides out-of-the-box support for subscribing to alerts specific to your workloads, so you never miss an alert! With this execution metadata, we also power a quick-glance Grafana dashboard that publishes daily execution status. We store the historical execution metadata as well, which helps us evaluate weekly and monthly consolidated failures for every pipeline, and in turn categorize and fix those issues in our data pipelines.