How to trust your data: Data observability and 5 pillars to a reliable data platform
Google ranks Amanotes' data as the 3rd largest pool in the Vietnamese market. Our data comes from over 30 different sources and grows by more than 5 terabytes each day. Managing this volume means running over 3,000 data jobs daily, which is no easy feat. In this blog post, we look at how our data team tackles this challenge with data observability.
Massive amount of data in Amanotes
What is Data?
In data analysis, data is a representation of reality that has been recorded and stored. There are many ways to gather data, and each approach yields data with its own characteristics, as depicted in the following illustration:
Four characteristics of Data
What is Data Observability?
The term "Data observability" refers to an organization's capacity to have a comprehensive understanding of the status of its data and data systems. When data observability is high, companies gain complete visibility into their data pipelines. This visibility enables teams to establish procedures and tools to monitor how data moves within the organization, pinpoint any data bottlenecks, and ultimately avoid inconsistencies and downtime in data operations.
5 Pillars of Data Observability
The pillars of data observability provide details that can accurately describe the state of the organization’s data at any given time. There are five pillars of data observability:
Freshness ensures that the data in the system is up to date and synchronized across multiple and complex data sources.
Distribution measures the variance of data in the system and focuses on the quality of data produced and consumed by the data system.
Volume monitors data intake and storage capacity, and detects anomalies such as sudden changes in volume, to ensure that data stays within defined limits.
Schema ensures that the format and structure of the data received are accurate, up to date, and regularly audited.
Lineage traces the flow of data through the data system, providing a full picture of the data ecosystem: how its parts are connected and what external data sources are being used.
When managing data, it may not always be essential to address all five pillars. Depending on the nature of the data and the company's systems, it may be more pertinent to focus on specific aspects within these pillars.
Five pillars of data observability
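To make the pillars more concrete, here is a minimal sketch of what a check for four of them could look like in Python (lineage is a graph problem and is left out here). It is an illustration only, not Amanotes' implementation: the function names, thresholds, and column types are all hypothetical.

```python
from datetime import datetime, timedelta


def is_fresh(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the table was last updated within the allowed age."""
    return now - last_loaded_at <= max_age


def null_rate(values: list) -> float:
    """Distribution: share of missing values in a column (one simple quality signal)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def volume_within_limits(row_count: int, min_rows: int, max_rows: int) -> bool:
    """Volume: the number of rows received falls inside the defined limits."""
    return min_rows <= row_count <= max_rows


def schema_drift(expected: dict, actual: dict) -> dict:
    """Schema: compare column -> type mappings and report what changed."""
    shared = set(expected) & set(actual)
    return {
        "added": sorted(set(actual) - set(expected)),
        "removed": sorted(set(expected) - set(actual)),
        "retyped": sorted(c for c in shared if expected[c] != actual[c]),
    }
```

Each function answers one narrow question, so a team that only cares about, say, freshness and schema can adopt those two checks and skip the rest.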
How does Amanotes build a reliable data system?
At Amanotes, we place a strong emphasis on building a reliable data system that can effectively support our business operations. We achieve this by following a 3-phase approach: Preparation, Collection, and Visualization.
Three-phase approach to build up a reliable data system at Amanotes
The first phase is the Preparation phase, where we focus on architecting our data pipeline with a clear structure and convention. We divide our data storage into different zones and stages, each with its own principle for processing data, and use labeling and tagging to create filters and dimensions for monitoring our data. We also separate our daily jobs from ad hoc jobs to ensure scalability.
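As a rough illustration of how labels become filters and dimensions, the sketch below attaches a few labels to each job and then slices the job list by them. The label names (`zone`, `stage`, `schedule`) and their values are assumptions made for this example, not the actual convention used at Amanotes.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class JobLabels:
    job_id: str
    zone: str      # hypothetical storage zone, e.g. "raw" or "curated"
    stage: str     # hypothetical processing stage, e.g. "ingest" or "transform"
    schedule: str  # "daily" or "adhoc", kept separate as described above


def count_by(jobs: list, dimension: str) -> Counter:
    """Use a label as a monitoring dimension: count jobs per label value."""
    return Counter(getattr(job, dimension) for job in jobs)


def select(jobs: list, **filters: str) -> list:
    """Use labels as filters: keep only jobs whose labels match every filter."""
    return [j for j in jobs if all(getattr(j, k) == v for k, v in filters.items())]
```

With consistent labels in place, a question such as "how many daily jobs write to the raw zone?" becomes `select(jobs, zone="raw", schedule="daily")`.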
During the Collection phase, a significant emphasis is placed on obtaining a comprehensive understanding of the data system. Several critical factors, such as volume, storage, processing, and timing, are taken into account. Additionally, data quality checks and validations are conducted to ensure that the collected data is reliable and accurate. The Collection phase also involves gathering essential metrics and utilizing orchestration techniques to build the lineage of the data. Finally, strategically positioned sensors are employed throughout the system to gather the necessary parameters, enabling us to generate a detailed report on the status of the entire data pipeline.
Finally, in the Visualization phase, we visualize our data and take action based on our observations. Before observing, we may need to profile our data to gain insights into its characteristics and behaviors. We use visualization tools to create dashboards and reports that help us monitor our data system effectively. We set up alerts and warnings depending on the different impacts and issues we observe, allowing us to quickly respond and resolve any problems that arise.
Demo at Amanotes
Let's explore a real-world example of data observability in action by looking at a demo of data observability in our system.
The data pipeline flow in Amanotes
Amanotes has developed its own data system, starting from a standard Proof of Concept (POC), because few tools for building a data system are available on the market. The data pipeline flow in Amanotes, depicted in the upper half of the chart, covers the collection and processing of data until it is uploaded to Metabase. To ensure data observability, Amanotes monitors the BigQuery audit log, the Airflow log, and the BigQuery schema. These techniques help ensure the data being processed is accurate, complete, and consistent.
To further customize the system, we utilize the Kabina ingest pipeline, which allows for the extraction of more insightful and actionable data.
A bug in the system affecting the job success rate
The table indicates a significant decrease in job success rate on the 28th, implying the presence of a bug in the system. This tool lets us drill down to the root cause of the issue, investigate it thoroughly, and identify the specific jobs that are encountering errors.
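The success-rate view and the drill-down described above can be sketched in a few lines of Python. This is a simplified illustration: the record fields (`date`, `job_id`, `state`) are hypothetical, and in practice the records would come from a source such as the Airflow log rather than an in-memory list.

```python
from collections import defaultdict


def daily_success_rate(runs: list) -> dict:
    """Map each date to the fraction of job runs that succeeded on that date."""
    totals, successes = defaultdict(int), defaultdict(int)
    for run in runs:
        totals[run["date"]] += 1
        if run["state"] == "success":
            successes[run["date"]] += 1
    return {day: successes[day] / totals[day] for day in totals}


def failed_jobs(runs: list, day: str) -> list:
    """Drill down: list the jobs that did not succeed on the given day."""
    return sorted({r["job_id"] for r in runs if r["date"] == day and r["state"] != "success"})
```

A drop in `daily_success_rate` flags the day worth investigating, and `failed_jobs` narrows that day down to the jobs to inspect first.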
Amanotes successfully determined the root cause
Additionally, we can investigate the "start time" parameter to analyze errors, as jobs are interdependent. If there is a significant change, such as the one shown in the chart above, we can zoom in to determine the root cause.
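One simple way to spot a significant change in start time is to compare each job's start today with its typical start over recent days. The sketch below does this with the median; the input shape (start times as minutes after midnight) and the 30-minute tolerance are assumptions chosen for illustration.

```python
from statistics import median


def late_starts(history: dict, today: dict, tolerance: float = 30) -> dict:
    """Return job_id -> delay in minutes for jobs that started unusually late.

    history maps job_id to a list of past start times (minutes after midnight);
    today maps job_id to today's start time in the same unit.
    """
    late = {}
    for job_id, start in today.items():
        past = history.get(job_id)
        if not past:
            continue  # no baseline yet, so nothing to compare against
        delay = start - median(past)
        if delay > tolerance:
            late[job_id] = delay
    return late
```

Because jobs are interdependent, the job with the earliest scheduled start among the late ones is usually a good place to begin the investigation.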
Amanotes traced data back to its root source using table lineage
Here is the example we use to illustrate the data lineage in Amanotes, which enables us to trace data back to its root source. The table provides a clear and concise way to visualize data lineage, which is essential for data management. By tracking the lineage of data, we can identify any issues or inconsistencies that may arise and take corrective action to ensure data quality. This is particularly important in complex systems where data may be sourced from multiple locations and transformed in various ways before being consumed by end-users.
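Tracing a table back to its root sources amounts to walking a dependency graph upstream. Here is a minimal sketch of that idea, assuming the lineage is available as a mapping from each table to its direct parent tables. The table names in the usage note are made up for the example.

```python
def root_sources(lineage: dict, table: str) -> list:
    """Walk upstream from a table and return the tables that have no parents.

    lineage maps a table name to the list of tables it is built from.
    """
    roots, seen, stack = set(), set(), [table]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        parents = lineage.get(current, [])
        if not parents:
            roots.add(current)
        stack.extend(parents)
    return sorted(roots)
```

For example, with `{"dashboard_revenue": ["fact_revenue"], "fact_revenue": ["raw_iap", "raw_ads"]}`, calling `root_sources(lineage, "dashboard_revenue")` returns the two raw tables, which are the places to check when the dashboard looks wrong.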
Amanotes' anomaly detection system sends notifications about abnormal data trends
Amanotes has developed an anomaly detection system that sends automated Slack notifications when it detects abnormal data trends. This allows the team to spot such trends quickly and take appropriate action.
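The article does not describe the detection method, so the sketch below uses a plain z-score rule as a stand-in: a value is flagged when it lies more than a chosen number of standard deviations from the recent mean. It also builds the JSON body for a notification, using the `text` field that Slack incoming webhooks accept. The metric name, threshold, and message wording are hypothetical, and the sketch deliberately stops short of sending anything.

```python
import json
from statistics import mean, stdev


def is_anomalous(history: list, latest: float, threshold: float = 3.0) -> bool:
    """Flag the latest value if it is far from the recent mean (z-score rule)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is unusual
    return abs(latest - mu) / sigma > threshold


def slack_message(metric: str, latest: float, history: list) -> str:
    """Build the JSON body for a Slack notification (sending is left out)."""
    text = (
        f"Abnormal trend in {metric}: latest value {latest}, "
        f"recent mean {mean(history):.1f}"
    )
    return json.dumps({"text": text})
```

A z-score is easy to explain but sensitive to seasonality, so a production system would likely compare against the same weekday or use a more robust baseline.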
While system observability has been around for some time, data observability is a newer trend in the industry. Here is a list of big data observability services available in the market that you can consider. These services help organizations monitor, manage, and analyze their data infrastructure to ensure performance, reliability, and security. They offer features such as real-time monitoring, alerting, and visualization of data pipelines and systems, as well as data quality checks and anomaly detection. By using these services, organizations can improve the observability of their data infrastructure and gain insights to make informed decisions.
Big data observability services
Nghi Nguyen - Head of Tech Amanotes
Cuong Tran - Data Engineering Lead