How to track and visualize data lineage

July 29, 2024

Data lineage is about tracking the flow of information through a system or process. It is necessary to guarantee the quality, usability, and security of your data. For large organizations, it is also a key compliance requirement. Unfortunately, many organizations lack the ability to connect their data sources because of regulatory constraints, complex technology and scattered data.

Graph visualization is a powerful asset for data lineage. It provides a clear and intuitive way to understand the relationships between data elements in a complex system. It helps identify and address issues more quickly and make better decisions about how to manage data.

What is data lineage?

Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time. For large organizations, that life cycle can be quite complex, as data flows from files to databases and reports while going through various transformation processes. Tracing the provenance of a specific data point can be very challenging.

The success of an organization depends on the quality, usability and security of its data. Want to provide amazing support to your customer? Create new products and services? Respect legal requirements? The best companies approach these issues in a data-driven way.

But when your management looks at the quarterly sales report, do you know exactly what data they are looking at? Sometimes bad data can be more dangerous than no data. That’s why data lineage is so important.

A visualization of data lineage using a graph

Why is data lineage important?

Data lineage enables effective incident response. It allows organizations to quickly trace the root cause of data-related issues and understand their impact across systems. By providing a deep understanding of data dependencies, data lineage empowers teams to quickly respond to incidents, leading to faster remediation and minimizing the potential damage to business operations.

Data lineage is also one of the best tools to ensure data quality. It’s possible to manually trace the lineage of a limited number of tables with a small number of data sources. But the reality is that most enterprises today use much greater quantities of data. Adopting enterprise data lineage allows organizations to scale their data quality.

Data lineage use cases

Here are some examples of use cases where data lineage is particularly important:

Meeting regulatory requirements. Many industries - like healthcare, finance, or government - are subject to regulatory compliance requirements related to data usage. Data lineage helps organizations comply with regulations by tracking data movement, identifying its sources and destinations, and documenting transformations that take place during processing.
Effectively managing risk. By tracing the movement of data, organizations can identify potential security and privacy risks, such as unauthorized access or data breaches. Data lineage can also help identify potential operational risks, such as errors or inconsistencies that could impact decision making or compliance.
Ensuring data quality. Data lineage helps organizations ensure data quality by tracking data from its sources and identifying potential points of data degradation, such as data transformation or data loss.
Better decision making. Data lineage provides transparency into how data is collected, transformed, and used, enabling better-informed decision-making. By understanding the data flow and processing steps, decision-makers can make more accurate and timely decisions, and have greater confidence in the reliability and accuracy of the data they are using.

But you need the right tools to get this kind of value out of your data lineage. And in many cases traditional tools aren’t up to the job.

Traditional data lineage tools have limitations

Part of the issue is due to the limitations of the tools organizations are using to map and track data lineage. Most of them are backed by Relational Database Management Systems (RDBMS), database systems deployed in the 1980s to power software applications. In RDBMS, the data architecture is tabular, with rows and columns. This is well suited for operations where data is consistent and not highly connected. But for connected data, these relational analysis tools have some drawbacks. For instance:

Querying connected data through SQL is a hard and error-prone process;
Questions that require looking up multiple connections (like getting the full data lineage of a given property) take a long time to process and perform poorly;
It’s hard to accommodate an evolving data model in a relational database.
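
To make the first two points concrete, here is a minimal sketch of a lineage lookup against a relational table, using Python's built-in sqlite3 module. The table name flows_to, its columns, and the sample rows are hypothetical and chosen only for illustration. Because the number of hops is not known in advance, the query needs a recursive common table expression instead of a fixed number of joins.

```python
import sqlite3

# In-memory database with one hypothetical table of "source flows to target" rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flows_to (source TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO flows_to VALUES (?, ?)",
    [
        ("order_total", "aggregate_sales_job"),
        ("aggregate_sales_job", "quarterly_total"),
        ("quarterly_total", "Sales_Report"),
    ],
)

# Walk the table backwards from the report, one hop per recursion step.
# UNION (rather than UNION ALL) removes duplicates, which also stops the
# recursion if the data contains a cycle.
LINEAGE_SQL = """
WITH RECURSIVE upstream(name) AS (
    SELECT source FROM flows_to WHERE target = ?
    UNION
    SELECT f.source
    FROM flows_to AS f
    JOIN upstream AS u ON f.target = u.name
)
SELECT name FROM upstream
"""

rows = conn.execute(LINEAGE_SQL, ("Sales_Report",)).fetchall()
sql_lineage = {name for (name,) in rows}
print(sorted(sql_lineage))
```

This is not a benchmark. It only shows that a multi-hop question that reads naturally as "follow the arrows" has to be phrased as a self-join that repeats until no new rows appear.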

Graph databases for data lineage visualization

Graph databases are a perfect foundation to meet the challenges of data lineage. These new types of databases emerged in the early 2000s to address the shortcomings of relational systems. They feature a different way of storing data: as a graph of connected entities. In a graph data model, the relationships are as important as the individual data points.

There are several advantages to this approach:

It’s easy to model the flow of data in a graph;
You can query relationships with ease and in real time;
A graph schema can evolve to accommodate new data and relationships.

In the next section, we detail how to use a graph visualization and analytics tool like Linkurious Enterprise to build a powerful and easy-to-use data lineage system on top of a graph database system.

Using a graph database to power your metadata management

To build an effective data lineage system, you need to map the various data elements and the processes or algorithms they go through. To be thorough, you’d have to track the files, the tables, views, columns and reports in databases, the ETL jobs, etc.

For the purpose of clarity, we have prepared a small dataset that focuses on four types of entities: the metadata, the systems, the processes, and the reports. We modeled our data as a graph, as depicted below.

Data lineage model.

Metadata (blue nodes) summarizes basic information about data. It can be, for example, the name of a column in a database and its type. Metadata can flow through a process (red node), such as an ETL job, a SQL query or program code, to another piece of metadata. It is stored in a system (yellow node) like a database. Finally, it can be used in a report (green node), a set of data accessible to end users through a visual interface.
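
As a rough illustration of this model, the four entity types and their relationships can be sketched in plain Python. Everything below is an assumption made for the example: the label names METADATA, PROCESS and SYSTEM, the STORED_IN relationship, and the sample entities are not taken from the dataset. Only the REPORT label and the FLOWS_TO relationship appear in the query used in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str  # assumed labels: METADATA, PROCESS, SYSTEM, REPORT
    name: str


@dataclass(frozen=True)
class Edge:
    source: Node
    rel: str  # relationship type, e.g. FLOWS_TO or the assumed STORED_IN
    target: Node


# Hypothetical sample entities, one or two per entity type.
order_total = Node("METADATA", "order_total")
sales_db = Node("SYSTEM", "sales_db")
etl_job = Node("PROCESS", "aggregate_sales_job")
quarterly_total = Node("METADATA", "quarterly_total")
sales_report = Node("REPORT", "Sales_Report")

edges = [
    Edge(order_total, "STORED_IN", sales_db),        # metadata lives in a system
    Edge(order_total, "FLOWS_TO", etl_job),          # metadata enters a process
    Edge(etl_job, "FLOWS_TO", quarterly_total),      # process produces new metadata
    Edge(quarterly_total, "FLOWS_TO", sales_report), # metadata is used in a report
]


def neighbors(node, rel):
    """Return the nodes reachable from `node` over one edge of type `rel`."""
    return [e.target for e in edges if e.source == node and e.rel == rel]


print(neighbors(order_total, "FLOWS_TO"))
```

In a graph database the same information would be stored as labeled nodes and typed relationships, so the flow described above maps directly onto the data model.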

Having the data within a graph database (such as Neo4j, Memgraph, or others) allows you to ask questions through your graph visualization and analytics tool, like “what is the data lineage of report y?”. For that kind of query, we can use Cypher, the Neo4j query language. The query below, for example, helps you understand where the data in our “Employee count” report comes from:

// Data lineage of the "Employee count" report
MATCH (a)-[:FLOWS_TO*]->(b:REPORT {name: 'Employee_Count'})
RETURN a, b

That query returns every entity that feeds into the report in question, through any number of FLOWS_TO relationships.
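
For readers less familiar with variable-length patterns, the following sketch mimics in plain Python what the FLOWS_TO* part of the query does: it starts at the report and repeatedly follows FLOWS_TO relationships backwards until no new entity is found. The edge list and all entity names except Employee_Count are hypothetical, and a graph database would not necessarily execute the query this way internally.

```python
from collections import deque

# Hypothetical FLOWS_TO relationships as (source, target) pairs.
FLOWS_TO = [
    ("employees.id", "count_employees_job"),
    ("count_employees_job", "employee_count"),
    ("employee_count", "Employee_Count"),
    ("orders.total", "Sales_Report"),  # unrelated to the Employee_Count report
]


def upstream(target, edges):
    """Return every node with a FLOWS_TO path of one or more hops ending at `target`."""
    # Index edges by target so each backward step is a dictionary lookup.
    incoming = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for src in incoming.get(node, []):
            if src not in seen:  # visit each entity once, even if paths overlap
                seen.add(src)
                queue.append(src)
    return seen


lineage = upstream("Employee_Count", FLOWS_TO)
print(sorted(lineage))
```

Entities that only feed other reports, such as the hypothetical orders.total here, are left out of the result.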

Data lineage visualization through Neo4j.

Here are a few other questions you can quickly answer using graph analytics:

Is my database still being used in an important company process, or can I remove it?
What systems and reports would be impacted by a change in a particular process?
Which data is used by whom?
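
The first two questions come down to following relationships downstream instead of upstream. Here is a small sketch under stated assumptions: the STORED_IN relationship name, the rule that a system counts as "in use" when something stored in it flows onward, and all sample entities are hypothetical choices for this example.

```python
# Hypothetical relationships as (source, relationship, target) triples.
EDGES = [
    ("order_total", "STORED_IN", "sales_db"),
    ("order_total", "FLOWS_TO", "aggregate_sales_job"),
    ("aggregate_sales_job", "FLOWS_TO", "quarterly_total"),
    ("quarterly_total", "STORED_IN", "warehouse_db"),
    ("quarterly_total", "FLOWS_TO", "Sales_Report"),
    ("legacy_code", "STORED_IN", "legacy_db"),
]


def downstream(start):
    """Return every node reachable from `start` by following FLOWS_TO edges."""
    found, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, rel, dst in EDGES:
            if src == node and rel == "FLOWS_TO" and dst not in found:
                found.add(dst)
                stack.append(dst)
    return found


def systems_storing(nodes):
    """Return the systems that store any of the given nodes."""
    return {dst for src, rel, dst in EDGES if rel == "STORED_IN" and src in nodes}


def is_system_in_use(system):
    """Assumed rule: a system is in use if anything stored in it flows onward."""
    stored = {src for src, rel, dst in EDGES if rel == "STORED_IN" and dst == system}
    return any(downstream(item) for item in stored)


# What would be impacted by a change to the aggregation process?
impacted = downstream("aggregate_sales_job")
print(sorted(impacted), sorted(systems_storing(impacted)))

# Can the legacy database be removed?
print(is_system_in_use("legacy_db"), is_system_in_use("sales_db"))
```

Answering "which data is used by whom" would require user or team entities in the graph, which this sample model does not include.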

Graph visualization: helping business users investigate data lineage

Visualizing your data lineage is a key element to understanding data flows and dependencies. A graph visualization and analytics solution sits on top of a graph database. It gives business users the ability to visualize and analyze data lineage to find answers without the need for programming skills.

With a graph visualization and analytics solution such as Linkurious Enterprise, you get full-text search: a search bar lets you look up any property or data element in the database and quickly drill down into the information most relevant to you.

Within the interactive graph visualization interface, you can explore the graph by expanding the relationships of your choice to find answers. That’s the difference between having a theoretical capability of tracking data lineage and an analyst being able to quickly and confidently answer a question about the provenance of their data.

For example, if you want to understand what data is used for your sales report, you can simply look up the report via the search bar and add it as a node to your visualization. You can then explore its connections. In a few seconds you can find out that the origin of your report is the order_total metadata stored in the sales_db.

In our example, we worked with a sample dataset, but users can visualize graphs with billions of nodes and edges in Linkurious Enterprise. The tool offers advanced filtering options, letting you slice and dice the data to focus on relevant pieces of information and answer crucial data lineage questions.

Webinar: how to track and visualize data lineage

Track and visualize data lineage today with Linkurious Enterprise

Approaching data lineage from the graph perspective is a way of tackling the challenges faced by organizations. By bringing data silos into a holistic view of connected entities, graph technology such as Linkurious Enterprise helps analysts take control of their data.

You can try Linkurious Enterprise now and extract new insights from your data!