Summary
Original source: https://static.googleusercontent.com/media/research.google.com/de//pubs/archive/45390.pdf, 14.11.2019
Google engineers describe Goods, Google's automated, non-invasive metadata catalog that tracks over 26 billion datasets.
Introduction
We have well-established standards for source code management (version control, reviews, etc.), but lack the same for datasets. This incurs opportunity costs, duplicated effort, and mishandling of data.
Enterprise Data Management (EDM) is a common way to organize datasets, but it requires all stakeholders to follow the rules closely.
The other extreme is to allow complete freedom in creating and consuming datasets, somewhat similar to a data lake (an approach that has fallen out of favor with some people).
The idea behind Google Dataset Search (Goods) is to leave people this complete freedom, but to impose some order on the chaos and achieve discoverability by automatically crawling and indexing metadata.
Goods automatically tracks provenance and can also monitor dataset features automatically. It provides a unified view of datasets spread across multiple systems.
From the consumer side it is a search engine for discovering data. It provides a dashboard for each dataset so that data structure and meaning can be quickly understood, along with boilerplate code to access the data and links to similar datasets.
In addition to the automated metadata inference, dataset owners can annotate their datasets manually.
Challenges
- Knowing what's important and what's not (e.g. don't index marker files with no content)
- Deduplicating sharded datasets
- Scaling to a huge number of datasets (over 26 billion at Google)
- Format and storage system variety
- Transient datasets (limited TTL)
- Uncertainty when inferring metadata, e.g. the associated protobuf schema or primary keys
- Importance ranking
- Semantic inference (e.g. inferring that an integer column actually contains landmark IDs)
The Goods catalog
At the core is the catalog, which contains the metadata and strives to provide a unified view. For that it creates clusters of related datasets; e.g. for versioned datasets it returns a single unified entry covering all versions.
Metadata
Initially, basic crawling is performed to discover which datasets exist.
After that, metadata inference discovers:
- Flat metadata like owners, readers, permissions, file format and timestamp.
- Provenance is inferred from logs, and transitive relations are partly propagated
- Schemas are inferred by matching dataset content against all protobuf message definitions registered in a central code repository.
- Content summary: potential keys are found by running HyperLogLog to estimate the number of distinct values per field and comparing that estimate to the record count (a small sketch of this check follows the list). Fingerprints are used to find similar datasets.
- Manual annotations are processed; they are crucial for ranking, as well as for filtering out experimental datasets.
- Semantics are inferred from manual annotations, by parsing comments in protobuf files, and by matching content against Google's Knowledge Graph.
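To make the key-detection idea concrete, here is a minimal sketch (not the paper's implementation): a dataset is treated as a list of flat records, and a plain set stands in for the HyperLogLog sketch Goods uses to keep the distinct-count estimate cheap at scale. The tolerance threshold is an illustrative assumption.

```python
# Minimal sketch of key-candidate detection. A plain set stands in for the
# HyperLogLog sketch used to estimate distinct counts cheaply at scale;
# the tolerance threshold is an illustrative assumption.

def find_key_candidates(records, tolerance=0.02):
    """Return fields whose distinct-value count is close to the record count."""
    total = len(records)
    candidates = []
    for field in records[0]:
        distinct = len({record[field] for record in records})
        # A field is a potential key if (almost) every record has a unique value.
        if distinct >= (1 - tolerance) * total:
            candidates.append(field)
    return candidates

rows = [{"id": i, "country": "DE" if i % 2 else "US"} for i in range(1000)]
print(find_key_candidates(rows))  # ['id']
```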
Clusters
Inferring metadata is computationally expensive; when datasets are produced hourly over years, clustering them can save a lot of cost. Clustering happens mainly by output path: if output paths are identical except for a date (or version) path component, the datasets are clustered into a single entry. The distinction between propagated and computed metadata is kept, so that if you really need to be sure, you can still compute the metadata for an individual dataset.
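A minimal sketch of path-based clustering, under the assumption that dataset paths embed dates, versions, or shard numbers as individual path components; the concrete path layout and patterns are illustrative, not taken from the paper.

```python
import re

# Components that only distinguish versions of the "same" dataset are replaced
# by placeholders; paths that collapse to the same key end up in one cluster.
# The path layout and patterns below are assumptions for illustration.
ABSTRACTIONS = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), "<date>"),    # e.g. 2016-03-01
    (re.compile(r"^v\d+$"), "<version>"),              # e.g. v7
    (re.compile(r"^part-\d+$"), "<shard>"),            # e.g. part-00001
]

def cluster_key(path):
    parts = []
    for component in path.strip("/").split("/"):
        for pattern, placeholder in ABSTRACTIONS:
            if pattern.match(component):
                component = placeholder
                break
        parts.append(component)
    return "/" + "/".join(parts)

print(cluster_key("/gfs/logs/clicks/2016-03-01/part-00001"))
print(cluster_key("/gfs/logs/clicks/2016-03-02/part-00001"))
# both print /gfs/logs/clicks/<date>/<shard>
```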
Backend implementation
Section 4 of the paper describes implementation details. Some interesting facts:
- They use Bigtable as the storage backend
- Large crawler jobs handle data ingestion, while smaller jobs serve the frontend. Sometimes
the schema analyzer takes weeks to catch up; for these cases some prioritization has been implemented
- Data is aggressively garbage-collected. Simply deleting unused data after a week is not enough;
to keep the storage from becoming bloated they defined more sophisticated predicates for whether a row can be garbage-collected.
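As an illustration, such a garbage-collection predicate could look like the sketch below; the concrete conditions are assumptions, since the paper only says the predicates are more sophisticated than "unused for a week".

```python
from datetime import datetime, timedelta, timezone

# Illustrative GC predicate for a catalog row. The concrete conditions are
# assumptions; the paper only states that a simple "delete after a week of
# inactivity" rule was not sufficient.
def can_garbage_collect(row, now=None):
    now = now or datetime.now(timezone.utc)
    stale = now - row["last_seen_by_crawler"] > timedelta(days=7)
    dataset_gone = not row["dataset_still_exists"]     # underlying file/table was deleted
    unreferenced = row["provenance_in_degree"] == 0    # no other entry points to it
    return stale and dataset_gone and unreferenced

row = {
    "last_seen_by_crawler": datetime.now(timezone.utc) - timedelta(days=30),
    "dataset_still_exists": False,
    "provenance_in_degree": 0,
}
print(can_garbage_collect(row))  # True
```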
Functionality
Dataset profiles
Dataset profile pages show metadata and also allow editing it. Profiles are compressed: old entries get discarded and/or dependency trees are pruned if the profile becomes too large. Where possible, metadata is linked to specialized tools, e.g. jobs are linked to their job definitions and schemas are linked to the protobuf code. A dataset profile page also generates boilerplate code for accessing the data.
Search
A sophisticated search is also provided.
First, datasets are matched by path and metadata against a full-text query. Then a scoring function computes the importance of each matching dataset. The following signals are considered when computing the score (a minimal sketch of such a scoring function follows the list):
- Where the keywords matched, e.g. a match on the path is more important than a match on some other metadata
- The type of dataset, e.g. a raw file is less important than a Dremel table, where the user
had to take more deliberate action to actually register the dataset
- The more consumers a dataset has, the more important it is
- Manually annotated datasets are also favored
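A minimal sketch of such a scoring function; the weights and field names are made up for illustration and are not taken from the paper.

```python
# Heuristic ranking sketch. Weights and field names are illustrative
# assumptions, not the paper's actual scoring function.
TYPE_WEIGHT = {"file": 1.0, "dremel_table": 2.0}  # registered tables rank higher

def score(dataset, match):
    s = 3.0 if match["matched_on_path"] else 1.0    # where the keywords hit
    s += TYPE_WEIGHT.get(dataset["type"], 1.0)      # what kind of dataset it is
    s += 0.5 * dataset["num_consumers"]             # popularity among readers
    if dataset["has_manual_annotations"]:           # owner cared enough to annotate
        s += 2.0
    return s

print(score({"type": "dremel_table", "num_consumers": 8, "has_manual_annotations": True},
            {"matched_on_path": True}))  # 11.0
```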
Dashboards
Dashboards showing metrics of the datasets generated by a team can be created easily, e.g. availability, sharding, and other metadata. These can include monitoring of signal distributions. The dashboard system can even recommend putting certain metrics under an alerting monitor if they stay stable over time. Monitors are easily embedded.
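The alert recommendation can be pictured as in this small sketch; the stability criterion (an unchanged value across enough snapshots) is an assumption for illustration.

```python
# Suggest an alerting monitor for metrics that have stayed constant across
# recent catalog snapshots; the stability criterion is an illustrative assumption.
def recommend_alert(metric_history, min_snapshots=10):
    if len(metric_history) < min_snapshots:
        return False
    return len(set(metric_history)) == 1

print(recommend_alert([42] * 12))          # True: stable value, worth alerting on changes
print(recommend_alert([41, 42, 43] * 4))   # False: value fluctuates
```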
Related work
Goods is characterized as a data lake, with the main difference from traditional approaches being the post-hoc approach to metadata inference.
The paper refers to some other data-lake approaches that do not primarily follow the post-hoc approach. The referenced paper about data provenance seems interesting.
Some systems for provenance tracking of databases and files are referenced, e.g. PASS and Trio.
Conclusion
The paper concludes that we need to build systems that enable a “data culture” in enterprises. Their post-hoc approach worked really well, but challenges remain:
- Much-needed metadata remains unfilled
- The catalog is not real-time, but pushing metadata instead of pulling it would violate the non-invasive approach
- Dataset ranking can be improved
- Understanding of dataset semantics is often lacking.