In our previous blogs, we explored the benefits of Wiki-based data architectures and transaction snapshot models. Today, we'll delve into a powerful technique that can revolutionise your data discovery process: delta bronze.
The challenge of data discovery
When embarking on a data discovery journey, organisations often face the challenge of limited visibility into their data. Often, organisations have a limited understanding of how their vendor-specific systems store and expose data for analytical purposes.It goes without saying that as we learn more from the data we ingest over time, our assumptions about data engineering could change in this process of discovery.
A common approach to address this is to use transaction snapshots, capturing daily snapshots of your data. While this simplifies data reconciliation, it can be resource-intensive due to the storage requirements of storing the entire dataset daily.
Delta Bronze: A smarter solution
Delta bronze offers a more efficient solution by capturing only the changes that occur between snapshots. This significantly reduces storage costs while maintaining data integrity.
Key Benefits of Delta Bronze:
- Reduced Storage Costs: Significantly lower storage requirements compared to daily snapshots.
- Improved Efficiency: Faster data processing and analysis.
- Simplified Data Transformation: Easy-to-understand data modelling approach.
- Enhanced Data Understanding: As you work with delta bronze, your confidence in understanding your data will increase.
The concept
Let us take an example of a data source that is ingested every day. The data source has 10,000 rows of data.
This will accumulate to 10,000x365 = 3,650,000 rows of data by the first year.
We need to see if we can represent this dataset by a lesser number of rows by storing the rows against a date range of validity i.e. only unique versions of the rows will be stored with two special columns called “valid_from_date” and “valid_to_date”.
Delta data
If we analyse the above rows of data, we can conclude that there are some daily repeating rows that can be combined: 1st January to 6th August, then 7th August to the 9th of September and, finally, 10th of September to the 31st of January 2024.
Each common row of data can thus be represented by a single row with the dates of validity as follows:
The latest row version (the last row) however is subject to an amendment every day, as the last seen (_last_seen) will get amended to every day subsequent date that has no change in value.
For data queries inferring the latest version of the row, it is also unclear if the 31st of January 2024 was the last time the row was ingested, or does it represent the latest successful ingest on the 31st of January i.e. was a successful dataset ingested after the 31st that did not contain the customer A123098 anymore. It is also not simple to get the latest version of the customer.
Delta with latest
To solve the problem of the latest row, and keep an easily queryable version of it, the row is split by setting the last seen date to blank (NULL), yet not losing the last ingested date.
Thus, the latest ingested row will always have the “_last_seen = NULL” against the actual date of ingestion. This allows querying the latest row easy by finding:
“WHERE _last_seen IS NULL”
It is important to note that the first unique version of the row is always saved, along with the last time that same version was seen (ingested) in the “_last_seen” field.
In practice, the ingest date should be implemented as ingest timestamp i.e. it should have the date and time the row was ingested, and we should expect multiple ingests in a single day, including no ingests at all.
Delta bronze
The bronze layer (Medallion architecture of data warehousing) of data should not be subject to any business rules. We can apply this delta data strategy to bronze and create the delta bronze layer. Normally, we keep this in the silver layer prefixed with the table name “delta_”.
Not all bronze tables can be subject to the delta strategy and be useful, unless they are structured. JSON data tables should be parsed to a sensible structured schema before being subject to a delta strategy, otherwise the changes become difficult to analyse.
Delta on Delta
One useful use of the delta strategy is that it can be re-applied on a delta bronze layer to focus on specific fields of change. Taking the above customer table, we can have a “delta_customer_name” table. This will give us the following table:
Daily rollup
For many organisations, a practical reporting time unit is normally a day i.e. what was the state of date at the end of the day. This unit may vary, but for simplicity, we will assume that it is a day.
Using the ingest timestamp, we can then find the latest version of a row for a given day, and rollup the table further. This can then be subject to logical assumptions that if a day of data is missing, it can be represented by the previous or perhaps the nearest date row. Daily rollup of data is useful for analytics because the date “BETWEEN start_date and end_date” can be used. This will give rise to a table like:
Sample use case
Current version of all data
The current version of all data includes data that has stopped being ingested, perhaps because it has never changed in the source system, or it may have been deleted.
SELECT * FROM DELTA
WHERE TO_DATE IS NULL
Latest data ingested
For a dataset that is a full ingest, we can get the latest representation of all rows by making use of the fact that the row’s most recent version of data will be the same as other rows in the load.
SELECT * FROM DELTA
WHERE
TO_DATE IS NULL AND
FROM_DATE = (SELECT MAX(FROM_DATE) FROM DELTA)
Deleted rows
For a dataset that is a full ingest, we can get the latest representation of all rows by making use of the fact that if a row is not present (deleted) in the latest load, then the row’s most recent version of data will be earlier than other rows in the load.
SELECT * FROM DELTA
WHERE
TO_DATE IS NULL AND
FROM_DATE < (SELECT MAX(FROM_DATE) FROM DELTA)
Limitations
The delta architecture is similar to CDC (change data capture) and somewhat like a Type2 SCD (slowly changing dimension), but architecturally, it sits in a different layer and can be cleverly used to create a lean storage layer.
However, if a row stops coming during a load, perhaps due to source system bugs, then the delta should be inferred as being deleted. This is a practical assumption; until the point the row starts appearing again. The delta time validity ranges will logically fill the missing gap up and thus the “delete” will be lost.
Example:
Now let us assume that from 12th January till 25th January, this row is missed from the source system, but other rows have been ingested.
Until now (25th Jan 2024) the row “Hello” can be inferred to have been deleted as the latest version of 11 Jan 2024 is less than the latest load of 25 Jan 2024.
Let us now assume that this row starts re-appearing again on the 26th of January. This will lead to the following:
The fact that the row was deleted is now lost. Practically, this may not be an issue, because for data driven decision making, the row was never deleted.
Another way to solve this scenario is to inject a “_state=’Deleted” or “_last_ingested” field and then infer the scenario. However, it's important to note that this approach may not capture all deletions if data is consistently missing from your source system. While this limitation is generally manageable, it's worth considering when implementing delta bronze.
Conclusion
When it comes to businesses looking to streamline their data discovery procedures, Delta Bronze is revolutionary. Delta Bronze delivers a significant decrease in storage costs, enhanced efficiency, and easier data translation by concentrating on collecting only the changes in your data.
Key benefits at a glance:
Reduced Storage Costs: Say goodbye to the burden of storing massive amounts of data.
Improved Efficiency: Process data faster and gain insights more quickly.
Enhanced Data Understanding: Uncover hidden patterns and trends within your data.
Ready to Explore Delta Bronze for Your Organisation?
Contact Ei Square today to learn more about how our data expertise can help you unlock the full potential of your data. Our team of experts can provide guidance, implementation assistance, and ongoing support.