
Introduction

Here we explore how to use PySpark and Pandas to handle data spread across multiple files. The first step is to establish the “bronze” layer of a data lake: a raw, minimally processed stage where data from diverse sources (here, IRS data) converges. You will learn how to organize and manage this raw data so that it is ready for subsequent refinement and analysis. The following notebook runs in Google Colab and is based on a Microsoft Fabric lab.
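To make the idea of a bronze layer concrete, here is a minimal Pandas sketch of multifile ingestion. It is not taken from the notebook: the function name `load_bronze`, the `source_file` column, and the sample file names and values are all hypothetical, chosen only for illustration. The sketch assumes the raw files are CSVs that share a header, reads every value as a string so nothing is altered on the way in, and tags each row with the file it came from.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_bronze(raw_dir: Path, pattern: str = "*.csv") -> pd.DataFrame:
    """Stack every matching raw file into one table (hypothetical helper)."""
    frames = []
    for path in sorted(raw_dir.glob(pattern)):
        # dtype=str disables type inference, so values stay exactly as
        # delivered (for example, leading zeros in ZIP codes survive).
        frame = pd.read_csv(path, dtype=str)
        # Record lineage: which file each row came from.
        frame["source_file"] = path.name
        frames.append(frame)
    if not frames:
        raise FileNotFoundError(f"no files matching {pattern!r} in {raw_dir}")
    return pd.concat(frames, ignore_index=True)


# Demo with two made-up files; the names and numbers are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp)
    (raw_dir / "returns_2019.csv").write_text(
        "zipcode,n_returns\n00501,120\n10001,340\n"
    )
    (raw_dir / "returns_2020.csv").write_text(
        "zipcode,n_returns\n00501,125\n"
    )
    bronze = load_bronze(raw_dir)

print(bronze)
```

Keeping everything as strings and deferring type casting, deduplication, and validation to later layers is one common way to keep the bronze stage a faithful copy of the sources. With PySpark, a similar effect can be had by pointing a single CSV read at a directory or glob path.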

Notebook

Medallion_Architecture_Design_using_PySpark_and_Delta_Lake