How to track and visualize data lineage

December 15, 2021

Data lineage is about tracking the flow of information. It is necessary to guarantee the quality, usability and security of your data. For large organizations, it is also a key compliance requirement. Unfortunately, many organizations lack the ability to connect their data because of regulatory constraints, complex technology and scattered data.

What is data lineage?

The success of an organization depends on the quality, usability and security of its data. Want to provide amazing support to your customers? Create new products and services? Respect legal requirements? The best companies approach these issues in a data-driven way.

But when your management looks at the quarterly sales report, do you know exactly what data they are looking at? Sometimes bad data can be more dangerous than no data. That’s why data lineage is so important.

Data lineage is defined as “a data life cycle that includes the data’s origins and where it moves over time.” For large organizations, that life cycle can be quite complex as data flows from files, to databases or reports while going through various transformation processes. Tracking the data provenance of a specific data point is very challenging.

Example of a real-life data pipeline at Pinterest.

Part of the issue is due to the limitations of the tools organizations are using to map and track data lineage. Most of them are backed by Relational Database Management Systems (RDBMS), database systems deployed in the 1980s to power software applications. In an RDBMS, data is structured in tables made of rows and columns. This is well suited to operations where data is consistent and not highly connected. But when it comes to connected data, these systems have some drawbacks. For instance:

  • querying connected data through SQL is a hard and error-prone process;
  • performance is poor for questions that require traversing multiple connections (like getting the full data lineage of a given property);
  • it’s hard to accommodate an evolving data model in a relational database.

Graph databases are a perfect match for the challenges of data lineage. This new type of database emerged in the early 2000s to address the shortcomings of relational systems. Graph databases introduced a new way of storing data: as a graph of connected entities. There are some advantages to this approach:

  • it’s easy to model the flow of data in a graph;
  • you can query relationships with ease and in real time;
  • a graph schema can evolve to accommodate new data and relationships.

In the next section, we detail how to use Linkurious Enterprise, our Graph Intelligence Platform, to build a powerful and easy-to-use data lineage system on top of Neo4j, the leading graph database system.

Using a Neo4j graph database to power your metadata management

To build an effective data lineage system, it is necessary to map the various data elements and the processes or algorithms they go through. To be thorough, we’d have to track the files, the tables, views, columns and reports in databases, the ETL jobs, etc.

For clarity purposes, we have prepared a small dataset that focuses on four types of entities: the metadata, the systems, the processes, and the reports. We modeled our data as a graph, as depicted below.

Data lineage model.

Metadata (blue nodes) summarizes basic information about data. It can be, for example, the name of a column in a database and its type. Metadata can flow through a process (red node), such as an ETL job, a SQL query or program code, to another piece of metadata. It is stored in a system (yellow node) like a database. Finally, it can be used in a report (green node), a set of data accessible to end users through a visual interface.
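
To make the model concrete, here is a minimal in-memory sketch in plain Python rather than in Neo4j. The four labels come from the model above and FLOWS_TO appears in the Cypher query used in this article; the STORED_IN relationship name and all sample entities (hr_db, employee_id, count_employees_etl, headcount) are assumptions made for illustration.

```python
from dataclasses import dataclass

# The four entity types of the model. Relationship names other than
# FLOWS_TO are assumptions made for this sketch.
LABELS = {"METADATA", "PROCESS", "SYSTEM", "REPORT"}


@dataclass(frozen=True)
class Node:
    label: str  # one of LABELS
    name: str


@dataclass(frozen=True)
class Edge:
    source: Node
    rel_type: str  # "FLOWS_TO" or "STORED_IN" (the latter is assumed)
    target: Node


# Hypothetical sample entities.
hr_db = Node("SYSTEM", "hr_db")
employee_id = Node("METADATA", "employee_id")
count_job = Node("PROCESS", "count_employees_etl")
headcount = Node("METADATA", "headcount")
report = Node("REPORT", "Employee_Count")

edges = [
    Edge(employee_id, "STORED_IN", hr_db),      # metadata lives in a system
    Edge(employee_id, "FLOWS_TO", count_job),   # metadata feeds a process
    Edge(count_job, "FLOWS_TO", headcount),     # process produces new metadata
    Edge(headcount, "FLOWS_TO", report),        # metadata is used in a report
]


def neighbors(node, rel_type):
    """Return the targets reachable from `node` over one edge of `rel_type`."""
    return [e.target for e in edges if e.source == node and e.rel_type == rel_type]


print(neighbors(employee_id, "STORED_IN"))
print(neighbors(headcount, "FLOWS_TO"))
```

In a graph database the same structure is stored natively, so adding a new entity type or relationship does not require changing existing records.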

Having the data in Neo4j allows us to ask questions like “what is the data lineage of report y?”. For that kind of query, we can use Cypher, the Neo4j query language. The query below, for example, helps us understand where the data in our “Employee count” report comes from:

    // Data lineage of the "Employee count" report
    MATCH (a)-[:FLOWS_TO*]->(b:REPORT {name: 'Employee_Count'})
    RETURN a, b

That query returns every entity that feeds into the report, directly or through any number of intermediate steps, along with the report itself.
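
For readers unfamiliar with variable-length patterns, the traversal behind `[:FLOWS_TO*]` can be sketched in plain Python as a breadth-first walk over reversed edges. This is only an illustration of the idea, not how Neo4j executes the query, and the edge list below is hypothetical sample data.

```python
from collections import defaultdict, deque

# Hypothetical FLOWS_TO edges as (source, target) pairs.
FLOWS_TO = [
    ("employee_id", "count_employees_etl"),
    ("count_employees_etl", "headcount"),
    ("headcount", "Employee_Count"),
    ("order_total", "Sales_Report"),  # unrelated branch, must not be returned
]


def upstream(edges, target):
    """Return every node with a directed path of one or more edges to `target`."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:  # visit each node once, even with cycles
                seen.add(parent)
                queue.append(parent)
    return seen


lineage = upstream(FLOWS_TO, "Employee_Count")
print(sorted(lineage))
# ['count_employees_etl', 'employee_id', 'headcount']
```

The one-line Cypher pattern replaces this bookkeeping, which is a large part of why lineage questions are easier to express against a graph database.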

Data lineage visualized through Neo4j.

Here are a few other questions we can use Cypher to answer:

  • is my database still being used in an important company process, or can I remove it?
  • what systems and reports would be impacted by a change in a particular process?
  • which data is used by whom?
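
The first two questions can be sketched in plain Python to show the logic involved. This is an illustrative sketch under stated assumptions: the STORED_IN relationship name, the sample entities and the definition of an "unused" system (none of its stored metadata flows anywhere) are ours, not part of the article's dataset.

```python
from collections import defaultdict

# Hypothetical edges as (source, relationship, target) triples.
EDGES = [
    ("order_total", "STORED_IN", "sales_db"),
    ("order_total", "FLOWS_TO", "aggregate_sales_etl"),
    ("aggregate_sales_etl", "FLOWS_TO", "quarterly_revenue"),
    ("quarterly_revenue", "STORED_IN", "warehouse_db"),
    ("quarterly_revenue", "FLOWS_TO", "Sales_Report"),
    ("legacy_code", "STORED_IN", "legacy_db"),
]


def downstream(edges, start):
    """Return every node reachable from `start` over FLOWS_TO edges."""
    children = defaultdict(set)
    for src, rel, dst in edges:
        if rel == "FLOWS_TO":
            children[src].add(dst)

    seen, stack = set(), [start]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def impacted_systems(edges, start):
    """Return the systems that store anything downstream of `start`."""
    affected = downstream(edges, start)
    return {dst for src, rel, dst in edges if rel == "STORED_IN" and src in affected}


def unused_systems(edges):
    """Return systems where no stored item has an outgoing FLOWS_TO edge."""
    feeds_something = {src for src, rel, _ in edges if rel == "FLOWS_TO"}
    stored = defaultdict(set)
    for src, rel, dst in edges:
        if rel == "STORED_IN":
            stored[dst].add(src)
    return {system for system, items in stored.items() if not items & feeds_something}


print(sorted(downstream(EDGES, "aggregate_sales_etl")))
print(impacted_systems(EDGES, "aggregate_sales_etl"))
print(unused_systems(EDGES))
```

Note that this definition of "unused" would also flag a system that only holds final outputs no one reads, so a real check would need to account for how reports and end users are modeled.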

Graph visualization can help business users investigate data lineage

A graph platform like Linkurious Enterprise sits on top of Neo4j. It gives business users the ability to visualize and analyze data lineage to find answers without the need for programming skills.

Within Linkurious Enterprise, a search bar gives you full-text search across any property or data element in the database.

Within the interactive graph visualization interface, you explore the graph by expanding the relationships of your choice. It’s easy to drill down into the data and find answers. That’s the difference between having a theoretical capability of tracking data lineage and an analyst being able to quickly and confidently answer a question about the provenance of their data.

For example, if I want to know what data is used for my sales report, I simply look up the report via the search bar and add it as a node to my visualization. I can then explore its connections. In a few seconds I can find out that the origin of my report is the order_total metadata stored in the sales_db.
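
The same exploration can be sketched programmatically: walk backwards from the report, remember the route taken, and stop at nodes with nothing flowing into them. The order_total and sales_db names come from the example above; the intermediate nodes, the warehouse_db system and the STORED_IN mapping are assumptions made for this plain-Python illustration, which does not use the Linkurious Enterprise or Neo4j APIs.

```python
from collections import deque

# Hypothetical FLOWS_TO edges as (source, target) pairs.
FLOWS_TO = [
    ("order_total", "aggregate_sales_etl"),
    ("aggregate_sales_etl", "quarterly_revenue"),
    ("quarterly_revenue", "Sales_Report"),
]
# Assumed mapping of metadata to the system that stores it.
STORED_IN = {"order_total": "sales_db", "quarterly_revenue": "warehouse_db"}


def origin_paths(flows, report):
    """Return one path per origin, each running from the origin to `report`.

    An origin is an upstream node with no incoming FLOWS_TO edge.
    """
    parents = {}
    for src, dst in flows:
        parents.setdefault(dst, []).append(src)

    next_hop = {report: None}  # node -> next node on the way to the report
    queue = deque([report])
    origins = []
    while queue:
        node = queue.popleft()
        node_parents = parents.get(node, [])
        if not node_parents and node != report:
            origins.append(node)
        for parent in node_parents:
            if parent not in next_hop:
                next_hop[parent] = node
                queue.append(parent)

    paths = []
    for origin in origins:
        path, node = [], origin
        while node is not None:
            path.append(node)
            node = next_hop[node]
        paths.append(path)
    return paths


for path in origin_paths(FLOWS_TO, "Sales_Report"):
    print(" -> ".join(path), "| stored in", STORED_IN.get(path[0], "unknown"))
# order_total -> aggregate_sales_etl -> quarterly_revenue -> Sales_Report | stored in sales_db
```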

In our example, we worked with a sample dataset but users can visualize graphs with billions of nodes and edges in Linkurious Enterprise. The platform offers advanced filtering options, letting you slice and dice the data to focus on relevant pieces of information and answer crucial data lineage questions.

Track and visualize data lineage today with Linkurious Enterprise

Approaching data lineage from the graph perspective is a way of tackling the challenges faced by organizations. By bringing data silos into a holistic view of connected entities, graph technologies like Neo4j and Linkurious Enterprise are helping analysts take control of their data. You can try Linkurious Enterprise now and extract new insights from your data!
