Why Quality Engineering Needs to Evolve into Data Reliability Engineering
Lessons from validating enterprise-scale data platforms beyond traditional test automation
For years, Quality Engineering was primarily associated with UI automation, API testing, regression suites, and release validation.
That model worked well for traditional applications.
But modern enterprise platforms are changing rapidly. Today’s systems are powered by cloud-native architectures, distributed data pipelines, real-time processing, Medallion architectures, and enterprise-scale analytics platforms.
In these environments, traditional QA approaches begin to break down.
While working on large-scale financial data platforms at Lloyds Banking Group, I realised that validating enterprise data systems at scale is fundamentally different from validating traditional applications.
This is where I believe Quality Engineering must evolve into something broader: Data Reliability Engineering.
The Shift in the Problem Space
In traditional automation frameworks, the focus is usually UI behaviour, API response validation, functional correctness, and regression execution.
But enterprise data platforms introduce very different challenges:
• Billion-row datasets
• Distributed pipelines
• Schema drift
• Delayed upstream loads
• Data contract violations
• Pipeline dependencies
• Downstream reporting impact
At this scale, validation is no longer about checking screens or API status codes.
It becomes about ensuring:
• Trust in data – stakeholders can rely on what they see
• Reliability of pipelines – systems behave consistently under load
• Stability of downstream systems – upstream changes don’t silently break reports
• Operational resilience – failures are detected and diagnosed before business impact
Why Traditional QA Approaches Fail at Scale
Traditional row-by-row validation does not scale for enterprise data systems.
In large financial platforms, validations often involve BigQuery datasets, multi-stage transformations, Bronze → Silver → Gold architectures, and incremental loads processing billions of records.
If validation frameworks attempt to pull large datasets into Python, compare rows directly, and process everything synchronously, the framework itself becomes the bottleneck.
This is why many traditional automation frameworks struggle under high-volume data workloads: they were designed for a different problem.
The Evolution: From Test Automation to Reliability Engineering
I started to think about validation frameworks differently.
Instead of treating the framework as a data processor, I began treating it as an orchestration and intelligence layer.
The heavy computation remains inside distributed platforms like BigQuery. The framework focuses on:
• Validation orchestration
• Metadata-driven checks
• Root cause analysis
• Contract validation
• Observability
• Impact analysis
• Reporting intelligence
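For illustration, a minimal orchestration loop might look like the sketch below. It assumes the google-cloud-bigquery client library with default credentials, and the check definition is a hypothetical example, not the production framework:

from google.cloud import bigquery

# Metadata-driven check definitions: the SQL executes inside BigQuery,
# so only a one-row summary ever reaches the framework.
CHECKS = [
    {
        "name": "silver_amount_not_null",
        "sql": """
            SELECT COUNT(*) AS failed_rows
            FROM `silver.chargeable_event_table`
            WHERE amount IS NULL
              AND business_date = CURRENT_DATE()
        """,
    },
]

def run_checks(client: bigquery.Client) -> list[dict]:
    """Execute each check in BigQuery and collect only summarised results."""
    results = []
    for check in CHECKS:
        summary = next(iter(client.query(check["sql"]).result()))
        results.append({
            "check": check["name"],
            "failed_rows": summary.failed_rows,
            "status": "PASS" if summary.failed_rows == 0 else "FAIL",
        })
    return results

for outcome in run_checks(bigquery.Client()):
    print(outcome)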
This architectural shift changes everything.
What Modern Data Quality Engineering Looks Like
A modern enterprise validation framework should include five core capabilities:
1. Contract Validation
Detect issues before downstream consumption – not after reports fail. A data contract declares the expected shape of each critical field, for example:
amount:
  datatype: NUMERIC
  precision: 10
  scale: 2
  nullable: false
Instead of discovering failures in production, bad data is identified at ingestion time – before it reaches Silver or Gold layers.
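As a minimal sketch of what ingestion-time enforcement could look like (assuming the contract above has been parsed from YAML into a dict; the function and field names are illustrative):

from decimal import Decimal, InvalidOperation

# The "amount" contract from the YAML above, parsed into a dict.
AMOUNT_CONTRACT = {"datatype": "NUMERIC", "precision": 10, "scale": 2, "nullable": False}

def contract_violations(value, contract: dict) -> list[str]:
    """Return the contract violations for a single field value."""
    if value is None:
        return [] if contract["nullable"] else ["Unexpected null in non-nullable field"]
    try:
        amount = Decimal(str(value))
    except InvalidOperation:
        return [f"Value {value!r} is not a valid {contract['datatype']}"]
    digits = amount.as_tuple()
    scale = max(0, -digits.exponent)                           # digits right of the point
    int_digits = max(0, len(digits.digits) + digits.exponent)  # digits left of the point
    violations = []
    if scale > contract["scale"]:
        violations.append(f"Scale {scale} exceeds contract scale {contract['scale']}")
    if int_digits > contract["precision"] - contract["scale"]:
        violations.append(f"Value exceeds NUMERIC({contract['precision']},{contract['scale']})")
    return violations

print(contract_violations("100.005", AMOUNT_CONTRACT))  # ['Scale 3 exceeds contract scale 2']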
Contract violations I regularly catch in financial pipelines:
• Invalid special characters in string fields
• Decimal precision mismatches in monetary values
• Schema violations from upstream source system changes
• Unexpected nulls in primary key columns
2. Distributed Validation
Validations execute inside BigQuery – not inside the Python framework. The framework only receives summarised results.
-- Row count reconciliation — Bronze to Silver
SELECT
b.source_count,
s.target_count,
ABS(b.source_count - s.target_count) AS variance
FROM
(SELECT COUNT(*) AS source_count
FROM `bronze.table_raw`
WHERE business_date = CURRENT_DATE()) b,
(SELECT COUNT(*) AS target_count
FROM `silver.chargeable_event_table`
WHERE business_date = CURRENT_DATE()) s
This enables validations to scale to billions of records without performance degradation.
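On the framework side, consuming that query is a one-row read. A sketch, again assuming the google-cloud-bigquery client (the zero-variance default tolerance is an illustrative choice, not a framework setting):

from google.cloud import bigquery

def check_reconciliation(client: bigquery.Client, sql: str, tolerance: int = 0) -> dict:
    """Run the reconciliation query in BigQuery and inspect its one-row summary."""
    summary = next(iter(client.query(sql).result()))
    return {
        "source_count": summary.source_count,
        "target_count": summary.target_count,
        "variance": summary.variance,
        "status": "PASS" if summary.variance <= tolerance else "FAIL",
    }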
3. Root Cause Analysis
Modern frameworks should not stop at failure detection. They should answer: Why did this fail? What should happen next?
def classify_failure(validation_failure: dict) -> str:
    # Simplified stand-in: the full framework derives the failure type
    # from validation metadata and query results.
    return validation_failure.get("type", "UNKNOWN")

def analyse_root_cause(validation_failure: dict) -> dict:
    """Classify a failure and recommend a remediation action."""
    failure_type = classify_failure(validation_failure)
    rca_map = {
        "ROW_COUNT_MISMATCH": {
            "likely_cause": "Upstream load delayed or partial ingestion",
            "action": "Check Cloud Composer DAG execution logs",
            "severity": "HIGH",
        },
        "NULL_IN_CRITICAL_COLUMN": {
            "likely_cause": "Source system schema change",
            "action": "Review Bronze ingestion logs",
            "severity": "HIGH",
        },
        "SCHEMA_DRIFT": {
            "likely_cause": "Upstream column rename or datatype change",
            "action": "Trigger Impact Analyzer",
            "severity": "MEDIUM",
        },
    }
    # Fall back to manual escalation for unrecognised failure types.
    return rca_map.get(failure_type, {
        "likely_cause": "Unknown – manual investigation required",
        "action": "Escalate to data engineering team",
        "severity": "MEDIUM",
    })
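For example, feeding in a row-count failure (the event shape here is hypothetical) yields an actionable result:

failure = {"type": "ROW_COUNT_MISMATCH", "table": "silver.chargeable_event_table"}
print(analyse_root_cause(failure))
# {'likely_cause': 'Upstream load delayed or partial ingestion',
#  'action': 'Check Cloud Composer DAG execution logs', 'severity': 'HIGH'}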
Embedding RCA directly into validation frameworks reduces investigation time from hours to minutes.
4. Impact Analysis
Enterprise systems are deeply interconnected. A small upstream schema change may affect:
• Gold layer reporting views
• Regulatory dashboards
• External APIs
• Downstream reconciliation jobs
• Scheduled Cloud Composer DAGs
Validation frameworks should proactively identify downstream impact before failures occur – not after a stakeholder reports a broken dashboard.
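A minimal sketch of that idea, using a hypothetical lineage map and a breadth-first traversal (standard library only; real frameworks would derive this graph from pipeline metadata):

from collections import deque

# Hypothetical lineage metadata: table -> immediate downstream consumers.
DEPENDENCIES = {
    "bronze.table_raw": ["silver.chargeable_event_table"],
    "silver.chargeable_event_table": ["gold.reporting_view", "gold.reconciliation_job"],
    "gold.reporting_view": ["regulatory_dashboard"],
}

def downstream_impact(changed_table: str) -> set[str]:
    """Walk the lineage graph to find every asset affected by a change."""
    impacted, queue = set(), deque([changed_table])
    while queue:
        for child in DEPENDENCIES.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return impacted

print(downstream_impact("bronze.table_raw"))
# {'silver.chargeable_event_table', 'gold.reporting_view',
#  'gold.reconciliation_job', 'regulatory_dashboard'}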
5. Synthetic Test Data Generation
One of the biggest enterprise bottlenecks is dependency on upstream test data. Modern frameworks should generate:
• Boundary cases – minimum and maximum values
• Invalid scenarios – deliberately bad data to test rejection logic
• Precision mismatches – monetary values with incorrect decimal places
• Special character datasets – names and addresses with non-ASCII characters
• SCD2 scenarios – inserts, updates and deletes for slowly changing dimension validation
This reduces dependency on external teams and improves test reproducibility significantly.
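As an illustrative sketch, a generator covering a few of these categories for a NUMERIC(10, 2) monetary field might look like this (field names and cases are hypothetical):

from decimal import Decimal

def synthetic_amount_cases() -> list[dict]:
    """Edge-case records for a NUMERIC(10, 2) amount field."""
    return [
        {"case": "boundary_min", "amount": Decimal("0.01")},
        {"case": "boundary_max", "amount": Decimal("99999999.99")},
        {"case": "precision_mismatch", "amount": Decimal("100.005")},   # 3 decimal places
        {"case": "invalid_null", "amount": None},                       # tests rejection logic
        {"case": "special_chars", "payee_name": "Müller & Søn £ Ltd"},  # non-ASCII fields
    ]

for record in synthetic_amount_cases():
    print(record)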
The Bigger Realisation
At some point while building these frameworks, I realised something important:
We are no longer simply testing applications. We are engineering trust and reliability into enterprise data ecosystems.
That is a very different responsibility.
The role of Quality Engineering is evolving from execution-focused testing to:
• Platform reliability
• Data observability
• Intelligent validation
• Predictive quality systems
Why This Matters in Financial Systems
In financial services, data failures are not isolated technical issues. They can affect:
• Compliance reporting and regulatory submissions
• Executive dashboards and KPI accuracy
• Customer decisions based on incorrect data
• Regulatory obligations – FCA, Basel III, IFRS
• Operational risk management
When data reliability fails at this scale, the consequences extend far beyond a failed test case.
As data ecosystems scale, reliability becomes a first-class engineering concern – and Quality Engineering has a critical role to play in that transformation.
The Future of QE
I believe the future of enterprise Quality Engineering lies at the intersection of Data Engineering, Reliability Engineering, Observability, Automation Intelligence, and AI-assisted diagnostics.
The next generation of frameworks will not just execute validations. They will:
• Predict failures before they occur using historical patterns
• Explain failures in plain language using LLM-powered RCA
• Assess impact across interconnected pipeline dependencies
• Recommend actions with specific remediation steps
• Continuously learn from historical failure signatures
Conclusion
Quality Engineering is no longer limited to UI automation or regression suites.
Modern enterprise platforms require something more advanced:
Intelligent Data Reliability Engineering.
By combining scalable validation, contract enforcement, RCA, impact analysis, and observability – we can build systems that not only detect failures, but actively improve platform reliability.
This is not the end of Quality Engineering.
It is its evolution.
About Me
I am a Senior Quality Engineer specialising in data-focused automation and validation for large-scale financial platforms. I have built RCA modules, Impact Analyzers, and fully automated validation pipelines on GCP that are used across multiple teams in the Finance Data Lab.
My work sits at the intersection of Quality Engineering, Data Engineering, and platform reliability – building intelligent frameworks that improve pipeline trust, compliance, and operational efficiency at scale.
I am currently pursuing an M.Tech in AI/ML through BITS Pilani, exploring how machine learning can enhance predictive quality systems.
Connect with me on LinkedIn: linkedin.com/in/swethanallapati