In my earlier post, I wrote about how Waterline combines data profiling with machine learning in a new process called data fingerprinting, which automatically tags columns of data with the appropriate business term regardless of the technical column name. The same fingerprinting information can also be used to derive data lineage.
Why do we care about lineage in building a data catalog? The key reason is that lineage provides another sense of data quality beyond what is found in the profiling information. While profiling provides basic statistics about a given column of data, the lineage gives perspective on the upstream data source as well as downstream uses of the data. This allows a user of the catalog to decide whether they want to get closer to the source or instead use a version of the same data that has been further refined. Looking at the lineage helps users make those kinds of decisions quickly.
In many cases you may already know the lineage, either from your ETL or HQL code or because you are tracking it in Apache Atlas or Cloudera Navigator. In that case, there is no need to derive much of the lineage; all you have to do is import that information into the data catalog. Very often, however, developers hand-code the movement of data with scripts that don’t integrate well with standard toolsets. When that happens, there needs to be a way to augment the factual lineage information you already have. The question, then, is how does this work?
We want to start with what we already know. Why? Because if we already know that Dataset A-prime is the child of Dataset A, then that is one less combination of datasets we need to look at. Part of the challenge in solving this kind of problem is the sheer number of dataset combinations to examine. Similarly, if we know that we moved an entire database of thousands of tables from Oracle into HDFS, just letting the lineage discovery algorithms know that this happened makes it much easier for them to derive the lineage down at the table and column level. In general, we want to start from the perspective of the child and look backwards upstream to identify the parent or parents.
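To make the pruning idea concrete, here is a minimal Python sketch. It is an illustration only, not Waterline's actual algorithm: the function name candidate_pairs, the dataset names, and the representation of lineage as (child, parent) pairs are all assumptions made for this example. It enumerates every child-to-parent combination and drops the ones already settled by known lineage, assuming that a known parent cannot also be a child of its own child.

```python
def candidate_pairs(datasets, known_lineage):
    """Return the (child, parent) pairs that still need lineage discovery.

    datasets: list of dataset names (hypothetical identifiers).
    known_lineage: iterable of (child, parent) pairs already established,
        for example imported from ETL code or a lineage tracking tool.
    """
    known = set(known_lineage)
    candidates = []
    # Start from each child and look upstream at every possible parent.
    for child in datasets:
        for parent in datasets:
            if parent == child:
                continue  # a dataset is not its own parent
            if (child, parent) in known:
                continue  # this relationship is already known
            if (parent, child) in known:
                continue  # assumed: lineage does not run in both directions
            candidates.append((child, parent))
    return candidates


datasets = ["A", "A_prime", "B"]
known = {("A_prime", "A")}  # A_prime is known to be the child of A

remaining = candidate_pairs(datasets, known)
print(len(remaining), "of", len(datasets) * (len(datasets) - 1), "pairs left")
```

With three datasets there are six ordered pairs to consider, and one known relationship removes two of them. The saving grows quickly when a single known fact, such as a whole database moved from Oracle into HDFS, covers thousands of tables.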