Ei Square® helps organisations unlock the potential and wealth of the data and information they are sitting on. It is therefore vital that the data is accurate and dependable, so that the decisions made from it are trustworthy and sound. Please visit the blog for a quick refresher on data quality.
What is data quality and why should you care?
Data quality measures the overall usability, accuracy, and health of data.
Tracking quality dimensions should be integral to an organisation: it should track completeness, consistency, accuracy, and validity.
Quality data enables effective and accurate decision-making, process optimisation, and effective data management, thereby supporting business growth.
To embed a trustworthy process where data is core to decision-making, high-quality data is not a choice but a necessity.
Poor data quality can lead to inefficiencies, missed opportunities, and even financial losses.
Examples of data quality:
Here are a handful of examples of what high-quality data looks like in various business contexts:
Customer data with accurate contact information and complete purchase history.
Financial data that reconciles across systems and departments without discrepancies.
Product data with full specifications, up-to-date pricing, and correct inventory levels.
Student data that reconciles across systems to provide a 360-degree view of each student.
Six dimensions of data quality
In order to embed a process where your data quality is trustworthy, it is important to understand the dimensions of data quality.
Six key dimensions contribute to overall data quality:
Accuracy
Completeness
Consistency
Timeliness
Validity
Uniqueness
Thinking of embedding a data quality health check?: If you are wondering how to start embedding a data quality process and how to track and monitor your data health, we are here to help!
1. Accuracy
Accuracy looks at both the format and the value of the data. From a data warehouse reporting perspective, it means that the ingested values are correct, reflect reality, and match the source from which the data is fetched. There should be no errors, gaps, discrepancies, or inconsistencies in the data compared to the source.
A couple of reporting metrics that help measure accuracy and precision are listed below:
Percentage of correct data values: This metric calculates the proportion of data values that are accurate and free from errors. To determine this percentage, compare the data values against a trusted source or reference data. Every data entity should have a master data source that acts as the reference data, so comparing the values of the ingested data to that master data source gives the percentage of accurate data used for reporting. To get the precision right, it is sometimes worth understanding the journey a particular data point takes and where it gets changed across the various sources. A SQL sketch of this metric appears after the next metric.
Error rates and data discrepancies: This metric tracks the number and frequency of errors or discrepancies in the data. By monitoring error rates over time, organisations can identify patterns and trends in data quality issues. For example, if a monthly report consistently shows a high number of address errors, it may indicate a systemic issue with data entry or validation processes.
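As a rough illustration of the first metric, the percentage of correct values can be computed by joining the ingested table to its master data source and comparing a field. The table and column names below (stg_customer, mdm_customer, customer_id, email) are assumptions for the sketch and will differ in your environment.
SQL (sketch):
-- Share of ingested customer emails that match the master data source
SELECT 100.0 * SUM(CASE WHEN s.email = m.email THEN 1 ELSE 0 END) / COUNT(*) AS pct_correct_values
FROM stg_customer s
JOIN mdm_customer m ON m.customer_id = s.customer_id;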
Inaccurate data often leads to wasted marketing efforts, flawed processes, poor customer experience, and eventually lost revenue.
2. Completeness
Complete data has all the required fields populated, with no missing values or gaps. Incomplete data can hinder analysis and lead to flawed conclusions.
Two important completeness metrics are:
Percentage of populated data fields: This metric calculates the proportion of data fields that contain a value, as opposed to being null or empty. When tracking this metric, it is important to make sure the correct fields are being tracked: fields that are relevant to your reports or that affect what your audience wants reported. For example, if a customer database has 10 fields and, on average, 8 of those fields are populated for each record, the percentage of populated data fields would be 80%. That may seem a decent percentage, yet the two fields that are empty most of the time may be the most relevant ones. So, beyond tracking the populated data fields overall, you should also track the data fields used downstream. A SQL sketch of both completeness metrics follows below.
Missing value counts and ratios: This metric tracks the number and proportion of missing values in the data. By monitoring missing value counts over time, organisations can identify trends and patterns in data completeness issues. One must thoroughly understand the cause of missing values so that they can be corrected at source; the ultimate goal is for this report to return no output at all. For example, if a monthly sales report consistently shows a high number of missing product codes, it may indicate a problem with data collection or integration processes.
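A minimal sketch of both completeness metrics, assuming a staging table stg_customer with email and phone fields (illustrative names); COUNT(column) ignores nulls, which gives the populated counts directly.
SQL (sketch):
-- Populated-field percentages and missing-value counts for two illustrative columns
SELECT COUNT(*) AS total_records,
       100.0 * COUNT(email) / COUNT(*) AS pct_email_populated,
       100.0 * COUNT(phone) / COUNT(*) AS pct_phone_populated,
       COUNT(*) - COUNT(email) AS missing_email_count,
       COUNT(*) - COUNT(phone) AS missing_phone_count
FROM stg_customer;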
3. Consistency
Consistent data is standardised across systems and free of contradictions or conflicts between sources. Formatting is a crucial element of data consistency. Inconsistent data can cause confusion and hinder collaboration. Two key consistency metrics are:
Percentage of standardised data formats: This metric calculates the proportion of data that adheres to established formatting standards, such as date formats, address structures, or product codes. The source systems should be configured to accept only the agreed format, so that irrespective of who enters the data, the formatting is well known and adhered to. Reporting regularly on non-standard formats helps you refine the formatting rules and reduces errors caused by non-standard formats and values. A SQL sketch of both consistency metrics follows the next metric.
Number of data conflicts and contradictions: This metric tracks the number of instances where data values are inconsistent or contradictory across different systems or sources. This in the data world is also known as data integrity. For example, if a customer's address is listed differently in the billing system and the CRM system, it would be counted as a data conflict.
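Both consistency metrics can be sketched in SQL. The first query checks a text date field against the agreed YYYY-MM-DD format using a Postgres-style regular expression; the second lists customers whose address differs between two systems. Table and column names (stg_orders, order_date_raw, billing_customer, crm_customer) are assumptions.
SQL (sketch):
-- Share of records whose raw order date follows the agreed YYYY-MM-DD format
SELECT 100.0 * SUM(CASE WHEN order_date_raw ~ '^\d{4}-\d{2}-\d{2}$' THEN 1 ELSE 0 END) / COUNT(*) AS pct_standard_format
FROM stg_orders;
-- Customers whose address conflicts between the billing system and the CRM
SELECT b.customer_id, b.address AS billing_address, c.address AS crm_address
FROM billing_customer b
JOIN crm_customer c ON c.customer_id = b.customer_id
WHERE b.address <> c.address;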
Struggling to analyse and create a data quality framework with the above dimensions? Ei Square can help you integrate data quality health checks into your current technology infrastructure without having to do a huge change programme or buy a new tool. Let's talk!
4. Timeliness
Timely data is current and reflects the most recent changes and updates. Outdated data can lead to missed opportunities and inaccurate insights. Timeliness matters most where data crosses system boundaries: data that traverses functional boundaries often gets changed as it moves.
Two important timeliness metrics when it comes to reporting are:
Data freshness and update frequency: This metric measures how recently the data was updated and how often it is refreshed. For example, if a sales dashboard is updated daily with the previous day's sales figures, the data freshness would be 1 day, and the update frequency would be daily.
Data latency and processing times: This metric tracks the time it takes for data to be captured, processed, and made available for use. For example, if it takes 2 hours for a new customer transaction to appear in the reporting system, the data latency would be 2 hours.
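Both timeliness metrics can be derived from load metadata, assuming the fact table carries an event timestamp and a load timestamp (fact_sales, event_timestamp and loaded_at are illustrative names; Postgres-style date arithmetic).
SQL (sketch):
-- Freshness: hours since the most recent load; latency: average gap between event time and load time
SELECT EXTRACT(EPOCH FROM (NOW() - MAX(loaded_at))) / 3600.0 AS freshness_hours,
       AVG(EXTRACT(EPOCH FROM (loaded_at - event_timestamp))) / 3600.0 AS avg_latency_hours
FROM fact_sales;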
5. Validity
Valid data conforms to defined business rules and constraints. Invalid data can cause system failures, skew analytics and, worse, lead to incorrect decisions.
Two key validity metrics are:
Percentage of data conforming to business rules: This metric calculates the proportion of data that meets predefined business rules and constraints. For example, if a business rule states that all customer orders must have a valid product code, and 95% of the orders in the database meet this criterion, then any report built on product codes needs to highlight that the quality assurance applies to only 95% of the data.
Number of data constraint violations: This metric tracks the number of instances where data violates established constraints or domain values. For example, if a database field for customer age is constrained to values between 18 and 120, any records with ages outside this range would be counted as a constraint violation.
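Both validity metrics can be sketched against the product-code rule and the age constraint from the examples above; stg_orders, dim_product, stg_customer and their columns are assumed names.
SQL (sketch):
-- Percentage of orders whose product code exists in the product dimension (the business rule)
SELECT 100.0 * SUM(CASE WHEN p.product_code IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_rule_conformance
FROM stg_orders o
LEFT JOIN dim_product p ON p.product_code = o.product_code;
-- Count of records violating the 18-120 age constraint
SELECT COUNT(*) AS age_constraint_violations
FROM stg_customer
WHERE age < 18 OR age > 120;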
Ensuring data validity is essential for realising the full potential of data-driven initiatives.
6. Uniqueness
Uniqueness in data ensures that each record is distinct and represents a single entity without duplication. Duplicate data can skew analysis, leading to misleading insights and decisions. It is important to understand how each system defines a data entity; this is something our data quality analysts often spend time unpicking, as a row might look like a duplicate (based on name, for example) while other columns make it unique.
Here are two uniqueness metrics:
Percentage of unique records: This metric measures the proportion of records in a dataset that are distinct from one another. For any source that is the master data of an entity, make sure duplication is tracked and processes are put in place to prevent duplicates from being created in the first instance. Having 98% unique records is a good start for this metric.
Duplicate record counts and ratios: This metric identifies the number and proportion of duplicate records within a dataset. Tracking duplicate records helps in understanding the extent of the data duplication issue. For example, if a product inventory list contains two hundred items but ten are duplicates of others, the count of duplicate records would be ten, indicating a need for data cleansing to enhance dataset integrity.
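A sketch of both uniqueness metrics, assuming student records keyed on student_id and a business key of name plus date of birth (stg_student and its columns are illustrative names):
SQL (sketch):
-- Percentage of unique records and the duplicate count against the assumed key
SELECT 100.0 * COUNT(DISTINCT student_id) / COUNT(*) AS pct_unique_records,
       COUNT(*) - COUNT(DISTINCT student_id) AS duplicate_record_count
FROM stg_student;
-- Groups of rows that look like duplicates on the business key; other columns may still distinguish them
SELECT first_name, last_name, date_of_birth, COUNT(*) AS occurrences
FROM stg_student
GROUP BY first_name, last_name, date_of_birth
HAVING COUNT(*) > 1;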
By understanding these key aspects of data quality, organisations can assess their current data health and take steps to improve it.
In the next section, we will explore the processes involved in maintaining and enhancing data quality.
Reporting Data Quality
Before we delve into data quality reporting, we should understand which KPIs to report in order to judge whether data quality is good or bad. It is important to be thorough here and make sure we are tracking the quality of the relevant data points. With increased digitisation, the amount of data an organisation holds is ever increasing, but not all of it is relevant.
In this section we focus only on tracking and reporting the inconsistencies, errors, and incompleteness in the data coming into the organisation's data warehouses from various sources. Organisations often run insights and analytics from the warehouse, and there should be a set of metrics that measure the quality of data on a regular basis so that corrective actions can be taken.
Data quality metrics and data quality checks enable data observability at various levels and provide visibility of the data that flows into the target systems from various sources.
The following are a good starting point to track the quality of the data in your data warehouse and all of them should focus on data or entity points that are relevant:
Data to errors ratio: the proportion of data records that contain errors (a SQL sketch of this metric follows the list).
Number of empty values: the count of data records that have missing or null values.
Data transformation errors: the count of data records that fail to be transformed or loaded into a target system.
Amount of dark data: the volume of data that is collected but not used or analysed.
Mapped vs unmapped records: identifying whether the data ingested into your data lake is tied to the expected contexts in your model.
Volume of data ingested on a timely basis per source per table in the target schema: the timing of data ingestion varies with business needs; it can be anything from hourly to monthly depending on what is being tracked. Tracking and understanding the incoming data volumes is needed to plan the storage capacity of the target storage systems.
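As one example from the list above, the data to errors ratio can be computed directly in the warehouse if the load process stamps each record with a validity flag; stg_attendance and is_valid are assumed names for the sketch.
SQL (sketch):
-- Error records, total records and the data-to-errors ratio for one staging table
SELECT SUM(CASE WHEN is_valid THEN 0 ELSE 1 END) AS error_records,
       COUNT(*) AS total_records,
       1.0 * SUM(CASE WHEN is_valid THEN 0 ELSE 1 END) / COUNT(*) AS error_ratio
FROM stg_attendance;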
These KPIs can be reported through various data visualisation tools in the form of reports and dashboards, and can be tracked on a regular basis to understand what raw data is ingested, how the data is transformed at each stage, and what data is populated to the target systems.
Data Messaging: If you want to learn how to send effective messaging to your audience, read our blog on this here.
Who are the audience for these data quality reports and dashboards?
Data Analysts
Data Product Owners
Business owners and Business Analysts.
Use case:
Educational multi-academy trusts (MATs) gather data from various sources such as an MIS (student data, attendance data, course enrolment data, academic achievement data), OpenApply (admissions data), IMP Software (revenue data), and GL-Assessment reports (assessment data).
The data is ingested into the staging layer from these sources, transformed and modelled using ETL tools, and finally populated into the reporting layer for data and insights analytics.
During this entire data journey from source to target, there are many stages in which the data from these various sources is cleansed, transformed, related, normalised, de-normalised, and aggregated before it reaches the target tables. Different staging layers will be involved in the data journey. The questions below focus on data lake use cases; however, the quality reports can easily be extended to track operational data quality issues and rectify the processes, if needed.
A Data analyst or a Business Analyst wants to understand:
1. How many new records are getting updated in a fact table from source data after a daily run.
We can have a dashboard showing a line chart with date on the x-axis and the record count for each table on the y-axis, with a dashboard filter on table name.
SQL (illustrative; fact_enrolment and updated_date are assumed names, run per fact table): SELECT 'fact_enrolment' AS table_name, updated_date, COUNT(*) AS new_record_count FROM fact_enrolment GROUP BY updated_date;
2. What data volume each initial staging table is ingesting from each of the source systems or APIs.
We can have a dashboard showing a bar chart with table name on the x-axis and the record count for each table on the y-axis, with a dashboard filter on ingestion_date.
SQL (illustrative; stg_admissions and ingestion_date are assumed names, run per staging table): SELECT 'stg_admissions' AS table_name, ingestion_date, COUNT(*) AS record_count FROM stg_admissions GROUP BY ingestion_date;
3. How many unmapped records / orphan records are getting populated to the fact tables for each ETL run.
We can have a dashboard with a tabular report showing all columns of a fact table where the dimensional key is -1 (i.e. not mapping to any dimension).
SQL (illustrative; fact_enrolment and school_key are assumed names): SELECT 'fact_enrolment' AS table_name, f.* FROM fact_enrolment f WHERE f.school_key = -1;
Conclusion
Mastering data quality requires a proactive approach. Start by defining clear data quality standards and regularly assessing your data against those criteria. Automate data cleansing and validation processes to maintain data integrity and reduce manual effort. Foster a data-driven culture that values data quality and encourages accountability at all levels.
Looking to be data centric in driving decisions?: Ei Square delivers end-to-end data transformation and reporting to organisations, enabling the foundation of a strong, data-centric organisation. Click the link below to book a free consultation!