
Forensic Audit: Synthetic Data Reliability

A professional MLOps investigation using Neural Networks and Bayesian Optimization to uncover fundamental flaws in a widely-used synthetic financial dataset.


About the Project

This project is a forensic audit of data integrity in financial machine learning. While many practitioners optimize models blindly, this study investigates whether a widely used synthetic dataset actually contains a valid predictive signal for loan eligibility and credit scores. We engineered a reproducible MLOps pipeline using MLflow for experiment tracking and Optuna for Bayesian hyperparameter tuning. After building a modern Multi-Layer Perceptron (MLP) with Batch Normalization and Swish activation, our analysis revealed a 'Synthetic Trap': the statistical distributions of 'Income' and 'Savings' were effectively identical across both classes, showing that the dataset lacked the patterns needed for reliable classification. The project underscores a critical engineering truth: data quality dictates performance, regardless of architectural complexity.
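The sketch below illustrates the kind of architecture described, assuming PyTorch; the layer widths and dropout rate are illustrative placeholders, not the exact configuration from the study.

```python
import torch
import torch.nn as nn

class ModernMLP(nn.Module):
    """Linear -> BatchNorm -> Swish (SiLU) -> Dropout blocks, then a binary logit head."""

    def __init__(self, in_features: int, hidden=(128, 64), dropout: float = 0.3):
        super().__init__()
        layers = []
        for width in hidden:
            layers += [
                nn.Linear(in_features, width),
                nn.BatchNorm1d(width),
                nn.SiLU(),          # Swish activation
                nn.Dropout(dropout),
            ]
            in_features = width
        layers.append(nn.Linear(in_features, 1))  # logit for loan eligibility
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```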

Engineering Challenges

01. Identifying the 'Synthetic Trap'

The dataset initially showed misleadingly high performance due to 'leaky' features, such as loan amounts that directly encoded the target.

Solution: Conducted a feature separability audit using violin plots (sketched below), which demonstrated that the remaining features had near-zero correlation with the target, preventing futile over-optimization.
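A minimal sketch of such an audit, assuming pandas-style data with seaborn for plotting; 'Income' and 'Savings' are the features discussed above, while the target column name 'Eligible' is an assumption for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns

def separability_audit(df, features=("Income", "Savings"), target="Eligible"):
    """Plot per-class feature distributions; heavily overlapping violins mean no usable signal."""
    fig, axes = plt.subplots(1, len(features), figsize=(5 * len(features), 4), squeeze=False)
    for ax, feat in zip(axes[0], features):
        # One violin per class: near-identical shapes across classes indicate zero separability.
        sns.violinplot(data=df, x=target, y=feat, ax=ax)
        ax.set_title(f"{feat} by class")
    plt.tight_layout()
    plt.show()
```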

02. Architecture vs. Signal

There was a risk that the poor performance was due to a weak model rather than bad data.

Solution: Implemented a Modern MLP with BatchNorm, Swish activation, and Dropout, then ran extensive Optuna sweeps (see the sketch below) to ensure every architectural advantage was explored, confirming that the model had hit an irreducible error floor dictated by the data.
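A minimal sketch of the tuning loop, assuming Optuna's default TPE sampler (a Bayesian method) and MLflow's Python tracking API; `train_and_eval` is a hypothetical stand-in for the actual training routine, and the search ranges are illustrative.

```python
import mlflow
import optuna

def train_and_eval(params: dict) -> float:
    # Hypothetical stand-in for the real training loop: train the MLP with these
    # hyperparameters and return validation AUC. A constant keeps the sketch runnable.
    return 0.5

def objective(trial: optuna.Trial) -> float:
    params = {
        "lr": trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        "hidden_width": trial.suggest_int("hidden_width", 32, 256),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
    }
    with mlflow.start_run(nested=True):  # one MLflow run per Optuna trial
        mlflow.log_params(params)
        auc = train_and_eval(params)
        mlflow.log_metric("val_auc", auc)
    return auc

with mlflow.start_run(run_name="optuna_sweep"):  # parent run for the whole sweep
    study = optuna.create_study(direction="maximize")  # TPE sampler by default
    study.optimize(objective, n_trials=50)
```

If every sweep plateaus at the same metric regardless of architecture choices, that plateau is evidence of an error floor set by the data rather than by the model.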
