MLOps – Parallelizing ETL with Dask on Kubeflow

June 8, 2022 Posted by Jeremy Bitzan MLOps

At MLOps 2022, Jacob Tomlinson, a senior software engineer at NVIDIA based in Great Britain, presented on using Dask to parallelize extraction, transformation, and loading (ETL) on June 8, 2022.

A key point of interest is that organizations are using GPUs to rapidly convert CSV (comma-separated values) files into Parquet files, which gives them more efficient data storage and file processing for data science and machine learning going forward.

Dask is a scalable Python engine with 300 contributors and 20 active maintainers. The Dask documentation receives 10,000 weekly visitors.

Dask on Kubernetes is configured through YAML (YAML Ain't Markup Language) resources. The Dask Operator provides three custom resource types that you can create with kubectl:

Dask Cluster – to create whole clusters.
Dask Worker Group – to create additional groups of workers with various configurations (high memory, GPUs, etc.).
Dask Job – to run end-to-end tasks like Kubernetes Jobs.
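To make the first resource type concrete, the sketch below builds a DaskCluster resource as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The helper name `dask_cluster_manifest`, the cluster name, and the image tag are assumptions for illustration. The field layout reflects my understanding of the operator's `kubernetes.dask.org/v1` schema and should be checked against the operator version you have installed.

```python
import json


def dask_cluster_manifest(name: str, workers: int,
                          image: str = "ghcr.io/dask/dask:latest") -> dict:
    """Return a DaskCluster resource as a plain dict (illustrative sketch)."""
    return {
        "apiVersion": "kubernetes.dask.org/v1",
        "kind": "DaskCluster",
        "metadata": {"name": name},
        "spec": {
            # Worker pods: `replicas` sets how many workers to start.
            "worker": {
                "replicas": workers,
                "spec": {
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "args": ["dask-worker", "--name", "$(DASK_WORKER_NAME)"],
                    }]
                },
            },
            # Scheduler pod plus the Service that exposes it.
            "scheduler": {
                "spec": {
                    "containers": [{
                        "name": "scheduler",
                        "image": image,
                        "args": ["dask-scheduler"],
                        "ports": [
                            {"name": "tcp-comm", "containerPort": 8786, "protocol": "TCP"},
                            {"name": "http-dashboard", "containerPort": 8787, "protocol": "TCP"},
                        ],
                    }]
                },
                "service": {
                    "type": "ClusterIP",
                    "selector": {
                        "dask.org/cluster-name": name,
                        "dask.org/component": "scheduler",
                    },
                    "ports": [
                        {"name": "tcp-comm", "protocol": "TCP",
                         "port": 8786, "targetPort": "tcp-comm"},
                        {"name": "http-dashboard", "protocol": "TCP",
                         "port": 8787, "targetPort": "http-dashboard"},
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    # Print the manifest so it can be piped to kubectl.
    print(json.dumps(dask_cluster_manifest("example", workers=2), indent=2))
```

If this were saved as a script (say, a hypothetical `make_cluster.py`), its output could be piped to `kubectl apply -f -` on a cluster where the Dask Operator is already installed.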

Dask clusters may be created using either Python or YAML.

About Jeremy Bitzan

Data, information technology, and architecture leader of analytic & systems architecture project achievements. A strategic and tactical-based decision-maker with data architecture and business intelligence experience. Diversity & inclusion team leadership and management experience.
