engineering lifecycle.
- Victor Peña
- Jun 27, 2024
- 2 min read
Generation & Ingestion
Data is generated by four main applications: Salesforce, Marketo, Adobe Analytics, and a Media Vendor Provider. This data is ingested into a Vertica or AWS Athena database, or delivered to an AWS S3 bucket.
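The source-to-sink mapping might be captured in a small routing table like the sketch below. The specific pairings are assumptions for illustration; the article does not say which source lands in which store.

```python
# Hypothetical routing table: which sink each source lands in.
# The pairings below are assumptions; the article only names the systems.
INGESTION_ROUTES = {
    "Salesforce": "Vertica",
    "Marketo": "Vertica",
    "Adobe Analytics": "AWS Athena",
    "Media Vendor Provider": "AWS S3",
}

def route(source: str) -> str:
    """Return the ingestion destination for a known source."""
    try:
        return INGESTION_ROUTES[source]
    except KeyError:
        raise ValueError(f"Unknown source: {source}")
```

Keeping the routing in data rather than code makes it easy to add a fifth source later without touching the ingestion logic.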
Transformation
Data from Salesforce and Marketo is cleansed, de-duplicated, checked for quality, and validated. Web and media data, pulled from a data lake, are joined mainly for account validation. The data is then aggregated, encrypted, hashed to remove PII, and moved to AWS for further transformation.
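Two of the steps above, de-duplication and PII hashing, can be sketched in a few lines. This is a minimal illustration, assuming records are dicts keyed by email; it is not the pipeline's actual implementation.

```python
import hashlib

def dedupe(records):
    """Keep the first record per normalized email key - simple de-duplication."""
    seen, out = set(), []
    for r in records:
        key = r["email"].strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def hash_pii(record, pii_fields=("email", "name")):
    """Replace PII fields with SHA-256 digests so rows stay joinable
    on the digest without exposing the raw value."""
    out = dict(record)
    for f in pii_fields:
        if f in out:
            out[f] = hashlib.sha256(out[f].encode("utf-8")).hexdigest()
    return out
```

Hashing (rather than dropping) PII preserves the ability to join on those fields downstream, which matters for the account-validation join described above.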
Serving
After all the transformations, the final dataset is created in AWS and ingested into a Dataflow using Microsoft Data Fabric. This dataflow is refreshed weekly and used in both the development and production environments.
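A weekly refresh can be reasoned about as a simple date calculation. A minimal sketch, assuming the refresh runs on Mondays (the article does not specify the day):

```python
from datetime import date, timedelta

def next_refresh(today: date, refresh_weekday: int = 0) -> date:
    """Date of the next weekly Dataflow refresh (Monday=0, assumed day)."""
    days_ahead = (refresh_weekday - today.weekday()) % 7
    if days_ahead == 0:
        days_ahead = 7  # refresh already ran today; schedule next week's
    return today + timedelta(days=days_ahead)
```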
Security
The dataset contains Personally Identifiable Information (PII), so it is encrypted, hashed, and not made public. Contacts flagged for deletion are also purged weekly to comply with GDPR and CCPA mandates.
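The weekly deletion sweep amounts to filtering out flagged contacts and recording how many were removed. A minimal sketch, assuming each contact carries a `flagged_for_deletion` field (a hypothetical name):

```python
def weekly_deletion_sweep(contacts):
    """Drop contacts flagged for deletion (GDPR/CCPA compliance).
    Returns the retained contacts and the number removed, for audit logs."""
    kept = [c for c in contacts if not c.get("flagged_for_deletion")]
    return kept, len(contacts) - len(kept)
```

Returning the removal count gives the compliance process an auditable record of each sweep.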
Data Management
Data lineage and data discoverability are key attributes. The dataset is included in a data catalog, and lineage is used to trace changes upstream. Data quality checks are performed for uniqueness, completeness, volume outliers, format compliance, and schema changes.
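The quality checks named above can be expressed as a small report function. This is an illustrative sketch: the field names and the 50% volume tolerance are assumptions, not the pipeline's actual thresholds.

```python
import re

def quality_report(rows, key="id", email_field="email",
                   expected_volume=None, tolerance=0.5):
    """Run the checks the article names: uniqueness, completeness,
    volume outliers, and format compliance. Thresholds are illustrative."""
    report = {}
    keys = [r.get(key) for r in rows]
    report["unique"] = len(keys) == len(set(keys))
    report["complete"] = all(k is not None for k in keys)
    if expected_volume is not None:
        report["volume_ok"] = (
            abs(len(rows) - expected_volume) <= tolerance * expected_volume
        )
    email_re = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    report["format_ok"] = all(
        email_re.match(r.get(email_field, "")) for r in rows
    )
    return report
```

Schema-change detection would typically compare column names and types against a stored contract; it is omitted here to keep the sketch short.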
Data Operations
Automation is a key element in this process, allowing for hands-off, unattended execution using RapidMiner. An incident response process is in place for when data downtime occurs.
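The pairing of unattended execution with incident response can be sketched as a retry wrapper that raises an alert only after retries are exhausted. The `alert` callable stands in for whatever notification channel the real process uses (an assumption):

```python
def run_with_incident_response(task, retries=2, alert=print):
    """Run a task unattended: retry on failure, then raise an incident
    alert and re-raise. `alert` is a placeholder notification hook."""
    last = None
    for attempt in range(1, retries + 1):
        try:
            return task()
        except Exception as exc:
            last = exc
    alert(f"data downtime: {task.__name__} failed after {retries} attempts: {last}")
    raise last
```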
Architecture
The Operational Architecture was defined at the start of the process, with a focus on understanding system interactions and optimizing them. The Technical Architecture involves planning the execution of the pipeline, including systems, queries, and backups.
Orchestration
Orchestration ensures all queries run and their dependencies are met before execution. The process runs in RapidMiner, is scheduled weekly, and sends email notifications if it fails.
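Meeting dependencies before execution is a topological-ordering problem. A minimal sketch, with hypothetical query names standing in for the pipeline's real steps:

```python
def execution_order(deps):
    """Order queries so every dependency runs before the query that
    needs it (depth-first topological sort; detects cycles)."""
    order, done = [], set()

    def visit(node, path=()):
        if node in path:
            raise ValueError(f"dependency cycle at {node!r}")
        if node in done:
            return
        for dep in deps.get(node, []):
            visit(dep, path + (node,))
        done.add(node)
        order.append(node)

    for node in deps:
        visit(node)
    return order
```

An orchestrator would walk this order, running each query and stopping (and emailing, per the article) on the first failure.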