Aurimas Griciūnas (@Aurimas_Gr)
2025-01-15 | ❤️ 311 | 🔁 71
**Data Pipelines in Machine Learning Systems** can become complex, and for a good reason 👇
It is critical to ensure Data Quality and Integrity upstream of ML Training and Inference Pipelines; trying to do that in downstream systems will cause unavoidable failures when working at scale.
It is a good idea to start thinking about the quality of your data at the point of creation (the producers). This is where you can also start to utilise Data Contracts.
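To make the idea concrete, a minimal data contract might pair a schema with quality expectations. Everything below (the event type, field names, SLA keys) is a hypothetical illustration, not a standard format:

```python
# A minimal, hypothetical data contract: a schema plus quality expectations.
# Field names and SLA keys are illustrative only.
ORDER_EVENT_CONTRACT = {
    "subject": "orders.order_placed",
    "version": 3,
    "schema": {  # JSON Schema describing the event payload
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "user_id": {"type": "string"},
            "amount": {"type": "number", "minimum": 0},
            "placed_at": {"type": "string", "format": "date-time"},
        },
        "required": ["order_id", "user_id", "amount", "placed_at"],
        "additionalProperties": False,
    },
    "slas": {  # checked later, in the scheduled batch validation step
        "max_lateness_minutes": 60,
        "max_null_rate": 0.01,
    },
}
```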
Example architecture for a production-grade end-to-end data flow:
1: Schema changes are implemented in version control; once approved, they are pushed to the Applications generating the Data, the Databases holding the Data, and a central Data Contract Registry.
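As a sketch of how step 1 might land in the registry, here is schema registration using the confluent-kafka Python client against a Confluent-style Schema Registry; the URL, subject name, and schema fields are placeholders:

```python
# Sketch: a CI job registers the approved schema in the central registry
# after the change is merged. URL and subject name are placeholders.
import json
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})

schema_str = json.dumps({
    "type": "object",
    "properties": {"order_id": {"type": "string"}, "amount": {"type": "number"}},
    "required": ["order_id", "amount"],
})
# Subjects commonly follow the "<topic>-value" naming convention.
schema_id = registry.register_schema(
    "orders.order_placed-value", Schema(schema_str, schema_type="JSON")
)
print(f"registered schema id: {schema_id}")
```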
[Important]: Ideally, you should be enforcing the Data Contract at this stage, when producing the Data. Data Validation steps further downstream are detection and prevention mechanisms that stop low-quality data from reaching the systems below them, but there might be a significant delay before those checks run, by which point the corruption or loss of data can be irreversible.
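A minimal sketch of that enforcement at the producer, assuming JSON events validated with jsonschema before they are handed to Kafka; the broker address, topic, and schema are placeholders:

```python
# Sketch: enforce the contract at the point of production, before the
# event ever leaves the application.
import json
from confluent_kafka import Producer
from jsonschema import ValidationError, validate

# Inlined for self-containment; in practice this schema would be fetched
# from the Data Contract Registry.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "amount"],
}

producer = Producer({"bootstrap.servers": "kafka:9092"})  # placeholder address

def emit_order(event: dict) -> None:
    """Validate against the contract before producing the event."""
    try:
        validate(instance=event, schema=ORDER_SCHEMA)
    except ValidationError as err:
        # Reject at the source instead of letting bad data flow downstream.
        raise ValueError(f"contract violation: {err.message}") from err
    producer.produce("orders.order_placed", value=json.dumps(event).encode())
    producer.flush()

emit_order({"order_id": "o-123", "amount": 42.5})
```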
Applications push generated Data to Kafka Topics:
2: Events emitted directly by the Application Services.
👉 This also includes IoT Fleets and Website Activity Tracking.
2.1: Raw Data Topics for CDC streams (a connector sketch follows below).
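For the CDC path (2.1), one common setup is Debezium running on Kafka Connect. A sketch of registering a Postgres connector; hostnames, credentials, and table names are placeholders, and the exact config keys vary by Debezium version:

```python
# Sketch: register a Debezium Postgres connector with Kafka Connect so that
# row-level changes land on the Raw CDC Topics (2.1). All values below are
# placeholders.
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "topic.prefix": "raw.cdc",  # CDC events land on raw.cdc.* topics
        "table.include.list": "public.orders",
    },
}
requests.post("http://kafka-connect:8083/connectors", json=connector).raise_for_status()
```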
3: A Flink Application (or several) consumes Data from the Raw Data streams and validates it against schemas in the Contract Registry (a simplified stand-in for this job is sketched below).
4: Data that does not meet the contract is pushed to a Dead Letter Topic.
5: Data that meets the contract is pushed to a Validated Data Topic.
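The job in steps 3-5 boils down to validate-and-route logic. This is a simplified stand-in using plain Kafka clients rather than actual Flink code, with the schema inlined where a real job would fetch it from the Contract Registry:

```python
# Simplified stand-in for the validation job in steps 3-5: consume raw
# events, validate against the contract schema, and route each record to
# the Validated Data Topic or the Dead Letter Topic.
import json
from confluent_kafka import Consumer, Producer
from jsonschema import ValidationError, validate

SCHEMA = {  # would come from the Contract Registry in practice
    "type": "object",
    "properties": {"order_id": {"type": "string"}},
    "required": ["order_id"],
}

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",  # placeholder address
    "group.id": "contract-validator",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.order_placed"])
producer = Producer({"bootstrap.servers": "kafka:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        event = json.loads(msg.value())
        validate(instance=event, schema=SCHEMA)
        producer.produce("orders.validated", msg.value())    # step 5
    except (ValueError, ValidationError):
        producer.produce("orders.dead_letter", msg.value())  # step 4
    producer.poll(0)  # serve delivery callbacks
```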
6: Data from the Validated Data Topic is pushed to object storage for additional Validation.
7: On a schedule, Data in the Object Storage is validated against additional SLAs in the Data Contracts and is pushed to the Data Warehouse to be Transformed and Modeled for Analytical purposes (an SLA-check sketch follows below).
8: Modeled and Curated data is pushed to the Feature Store System for further Feature Engineering.
8.1: Real-Time Features are ingested into the Feature Store directly from the Validated Data Topic (5).
👉 Ensuring Data Quality here is complicated, since checks against SLAs are hard to perform on a real-time stream.
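By contrast, on the batch path the scheduled SLA validation from step 7 is straightforward to sketch, assuming the validated data lands as Parquet in object storage; the path and thresholds are placeholders tied to the hypothetical contract above:

```python
# Sketch of the scheduled SLA validation in step 7: read a batch from
# object storage and check freshness and completeness SLAs.
import pandas as pd

# Placeholder path; reading from S3 requires s3fs to be installed.
df = pd.read_parquet("s3://validated/orders/dt=2025-01-15/")

# Freshness SLA from the hypothetical contract: max_lateness_minutes = 60.
lateness = pd.Timestamp.now(tz="UTC") - pd.to_datetime(df["placed_at"], utc=True).max()
assert lateness <= pd.Timedelta(minutes=60), f"freshness SLA breached: {lateness}"

# Completeness SLA: max_null_rate = 0.01.
null_rate = df["user_id"].isna().mean()
assert null_rate <= 0.01, f"null-rate SLA breached: {null_rate:.2%}"

# Only a batch that passes its SLAs moves on to the Data Warehouse.
```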
9: High Quality Data is used in Machine Learning Training Pipelines.
10: The same Data is used for Feature Serving in Inference (a feature store sketch follows below).
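A sketch of steps 9-10 using Feast as one example feature store, so that the same feature definitions back both the offline (training) and online (serving) reads; the repo path, feature names, and entity keys are placeholders:

```python
# Sketch of steps 9-10 with Feast: one set of feature definitions serves
# both training (offline, point-in-time correct) and inference (online).
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # placeholder feature repo

# 9: build a point-in-time correct training set from the offline store.
entity_df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "event_timestamp": pd.to_datetime(["2025-01-14", "2025-01-14"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:order_count_7d", "user_stats:avg_amount_7d"],
).to_df()

# 10: serve the same features online at inference time.
online_features = store.get_online_features(
    features=["user_stats:order_count_7d", "user_stats:avg_amount_7d"],
    entity_rows=[{"user_id": "u1"}],
).to_dict()
```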
Note: ML Systems are plagued by other Data-related issues like Data and Concept Drift. These are silent failures; while they can be monitored, we don't include them in the Data Contract.
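Drift is typically monitored statistically rather than contractually. A minimal sketch using the Population Stability Index (PSI), one common drift metric; the 0.2 alert threshold is a widely used rule of thumb, not a standard:

```python
# Sketch: Population Stability Index (PSI) between a training (expected)
# distribution and a live (actual) one, as a simple drift monitor.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) on empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Synthetic example: the live distribution has shifted slightly.
train_amounts = np.random.default_rng(0).lognormal(3.0, 1.0, 10_000)
live_amounts = np.random.default_rng(1).lognormal(3.2, 1.0, 10_000)
if psi(train_amounts, live_amounts) > 0.2:
    print("drift alert: feature distribution has shifted")
```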
Let me know your thoughts!
#MachineLearning #DataEngineering #AI
Want to learn first principles of Agentic systems from scratch? Follow my journey here: https://www.newsletter.swirlai.com/p/building-ai-agents-from-scratch-part