Life of a software engineer at Zapr revolves around taking care of the data we bring in: trimming down the unnecessary weight, polishing it in a way that brings out the best in the data, and then finding a place for it to stay with us forever. Everyone owns some part of the data and shares in the journey of caring for it here at Zapr. As the core-platform team, our responsibility is to help everyone by providing the platform and tools for a seamless data experience. And that brings us to our biggest product, the Data Lake.
At Zapr, we query data for multiple reasons:
We already had a well-defined and well-maintained data processing pipeline to cater to the first three needs. However, due to the unpredictability of the next three cases, we resorted to ad-hoc reporting with the help of developers at Zapr. The data was stored partly in S3, partly in HDFS, and the rest in MySQL or MongoDB. Our Sales and Marketing teams, and above all our Data Analysts, continuously envision new and innovative (read: more complex) ways to enrich and sculpt the existing data for operational reporting and advanced analytics. With the evolution of their needs, coupled with the developers' lack of holistic knowledge about the data or the intent behind the reporting, assisting them always ended up being a long and arduous exercise. This was a good use of neither their time nor their expertise. The obvious flaws diagnosed were:
The new solution needs to support multiple reporting tools in a self-serve capacity, allow rapid ingestion of new datasets without extensive modeling, and scale with them while delivering performance. It should support advanced analytics, like machine learning and text analytics, allow users to cleanse and process data iteratively, and track the lineage of data for compliance. Users should be able to easily search and explore structured, unstructured, internal, and external data from multiple sources in one secure place.
The solution that fits all of these criteria is the Data Lake.
A Data Lake can be considered a system that provides an unlimited amount of storage and compute to developers for processing, storing, querying, and visualizing data of any size and shape, arriving at varying speed.
A Data Lake enables users to analyze the full variety and volume of data stored in the lake. This necessitates features and functionalities to secure and curate data and run analytics, visualization, and reporting on it. The characteristics of a successful Data Lake include:
Keeping all of these elements in mind is critical to the design of a successful Data Lake. In this blog series, we will take a deep dive into each of these topics.
Our servers are mostly hosted on Amazon EC2, largely as Spot instances. Amazon provides the EMR service, which does tick off most of these use cases in addition to offering painless maintenance. However, after much debate and discussion, we chose to maintain our own setup.
Our system has to be up and running 24×7, and keeping a 24×7 EMR setup is extremely cost-heavy. Therefore, the most pertinent point is being frugal and reducing our infra costs, followed by the flexibility of adding new software and upgrading the existing ones. EMR is known to provide the following advantages, which we have matched and surpassed by leaps and bounds:
Cloudera is comparatively more difficult to learn and configure. But once you have it set up, it's far more flexible than EMR, and there's no extra infrastructure cost. Cloudera Manager has an easy-to-use web GUI that helps manage and monitor Hadoop services, the cluster, and physical host hardware. We have also successfully deployed Sentry authorization on top of LDAP to restrict users' access to the Lake according to their respective roles, which would not have been possible with EMR.
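For the curious, here is a minimal sketch of what such role-based grants look like when issued over HiveServer2. We use the pyhive client purely for illustration; the host, role, group, and database names are hypothetical, not our actual setup:

```python
# A minimal sketch of Sentry role-based grants issued over HiveServer2.
# Host, role, group, and database names are hypothetical placeholders.
from pyhive import hive

conn = hive.connect(host="hiveserver2.internal", port=10000, username="admin")
cursor = conn.cursor()

# Create a role and tie it to an LDAP group, so group membership alone
# determines which privileges a user inherits.
cursor.execute("CREATE ROLE analyst_role")
cursor.execute("GRANT ROLE analyst_role TO GROUP analysts")

# Analysts can read the curated reporting database, but get full
# read/write only inside their own team workspace.
cursor.execute("GRANT SELECT ON DATABASE reports TO ROLE analyst_role")
cursor.execute("GRANT ALL ON DATABASE analyst_workspace TO ROLE analyst_role")

cursor.close()
conn.close()
```

Because the roles map to LDAP groups, onboarding a new team member is just a matter of adding them to the right group; no per-user grants are needed.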
We need to be truthful about why we chose Cloudera and not any of the other open-source platforms. The answer is the GUI. HUE is by far the most exquisite user interface for Hadoop users, in addition to its pluggability. The Cloudera Manager portal provides a one-stop shop for adding, removing, modifying, and deploying any service in a completely hassle-free manner. It allowed us to create meaningful dashboards and alerts for round-the-clock monitoring, for free.
Cloudera Manager automatically raises conflict alerts on poorly configured setups. It has built-in help tips for each and every Hadoop feature and recommends better values for allocating resources. Every day we get to learn something new just by expanding the help tips (and, of course, researching around a bit). Every check-in is saved, and one can roll back to an older setup anytime. This way, all our operational effort boiled down to a few clicks.
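The same health data is also available outside the GUI: Cloudera Manager exposes a REST API with a Python client (cm-api), which makes it easy to wire cluster health into external alerting. A minimal sketch, assuming the cm-api package and placeholder hostname and credentials:

```python
# A minimal cluster health check against the Cloudera Manager REST API
# via the cm-api Python client. Hostname and credentials are placeholders.
from cm_api.api_client import ApiResource

api = ApiResource("cm.internal", server_port=7180,
                  username="admin", password="admin")

# Walk every cluster and service, flagging anything not in good health.
for cluster in api.get_all_clusters():
    for service in cluster.get_all_services():
        if service.healthSummary != "GOOD":
            print("ALERT: %s / %s is %s (state: %s)" % (
                cluster.displayName, service.name,
                service.healthSummary, service.serviceState))
```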
Similarly, HUE has been a life-saving tool for providing non-technical users a query interface at Zapr. Our analysts create workflows and parameterized queries and share them with the Operations and Sales teams in a read-only manner, for them to execute on demand with their own parameter values. Any update to the queries and workflows can be deployed easily through HUE, saving us the time and effort of building a separate web portal for them. With the Sentry setup, everyone's access is restricted according to their team's permission level, and we were able to neatly divide team-based temporary workspaces (read: databases) for storing data. In the background, we periodically clean up these temporary workspaces to minimize S3 cost as well (see the sketch below). Note that, since we deal with huge amounts of data and reporting every day, our S3 cost did hit the roof before, for lack of cleanup of temporary data storage.
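Here is a minimal sketch of such a cleanup job, assuming boto3 and a layout where each team workspace lives under its own S3 prefix; the bucket name, prefixes, and 14-day retention window are hypothetical:

```python
# Periodic cleanup of team workspace prefixes on S3: delete objects
# older than a retention window. Bucket, prefixes, and the 14-day
# window below are hypothetical placeholders.
from datetime import datetime, timedelta, timezone
import boto3

BUCKET = "zapr-datalake-workspaces"
WORKSPACE_PREFIXES = ["analyst_workspace/", "ops_workspace/"]
RETENTION = timedelta(days=14)

s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - RETENTION

for prefix in WORKSPACE_PREFIXES:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        # LastModified is timezone-aware, so compare against a UTC cutoff.
        stale = [{"Key": obj["Key"]}
                 for obj in page.get("Contents", [])
                 if obj["LastModified"] < cutoff]
        if stale:
            # delete_objects accepts up to 1000 keys per call, which
            # matches the default page size of list_objects_v2.
            s3.delete_objects(Bucket=BUCKET, Delete={"Objects": stale})
```

An S3 lifecycle rule on the workspace prefixes can achieve much the same effect without a scheduled job; the script form above is only to make the idea concrete.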
We will share the following stories in this blog series: