At MLOps 2022, Jacob Tomlinson, a senior software engineer at NVIDIA in Great Britain, presented on using Dask to parallelize extract, transform, load (ETL) workloads on June 8th, 2022.
A key point of interest was organizations using GPUs to rapidly convert CSV (comma-separated values) files into Parquet files, which gives more efficient data storage and file processing going forward for data science and machine learning.
Dask is a scalable Python engine for parallel computing, with 300 contributors and 20 active maintainers. The Dask documentation receives 10,000 weekly visitors.
On Kubernetes, Dask is configured through YAML (YAML Ain't Markup Language) resources. The Dask Operator defines three custom resource types that you can create with kubectl:
- DaskCluster – to create whole clusters.
- DaskWorkerGroup – to create additional groups of workers with various configurations (high memory, GPUs, etc.).
- DaskJob – to run end-to-end tasks like Kubernetes Jobs.
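As a rough illustration of the first resource type, the sketch below builds a cluster manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The field layout follows the operator's `kubernetes.dask.org/v1` `DaskCluster` schema as best recalled and should be checked against the installed operator version. The function name, cluster name, and image tag are assumptions made for the example.

```python
import json

# Assumption: the public Dask image; substitute a GPU-enabled image as needed.
IMAGE = "ghcr.io/dask/dask:latest"


def dask_cluster_manifest(name, workers=2, image=IMAGE):
    """Build a DaskCluster custom resource as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "kubernetes.dask.org/v1",
        "kind": "DaskCluster",
        "metadata": {"name": name},
        "spec": {
            # Pod template and replica count for the default worker group.
            "worker": {
                "replicas": workers,
                "spec": {
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "args": ["dask-worker", "--name", "$(DASK_WORKER_NAME)"],
                    }]
                },
            },
            # Scheduler pod plus the Service that exposes it.
            "scheduler": {
                "spec": {
                    "containers": [{
                        "name": "scheduler",
                        "image": image,
                        "args": ["dask-scheduler"],
                        "ports": [
                            {"name": "tcp-comm", "containerPort": 8786, "protocol": "TCP"},
                            {"name": "http-dashboard", "containerPort": 8787, "protocol": "TCP"},
                        ],
                    }]
                },
                "service": {
                    "type": "ClusterIP",
                    "selector": {
                        "dask.org/cluster-name": name,
                        "dask.org/component": "scheduler",
                    },
                    "ports": [
                        {"name": "tcp-comm", "protocol": "TCP",
                         "port": 8786, "targetPort": "tcp-comm"},
                        {"name": "http-dashboard", "protocol": "TCP",
                         "port": 8787, "targetPort": "http-dashboard"},
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(dask_cluster_manifest("etl-demo", workers=3), indent=2))
```

Saved as a script (the file name `make_cluster.py` is hypothetical), the output could be piped straight into the cluster with `python make_cluster.py | kubectl apply -f -`, provided the operator and its custom resource definitions are already installed.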
Dask clusters may be created using either Python or YAML.
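The Python route might look like the sketch below. It assumes a recent `dask-kubernetes` release that exposes `KubeCluster` under `dask_kubernetes.operator`, and a reachable Kubernetes cluster with the Dask Operator installed. The function name `start_cluster` and the cluster name are hypothetical.

```python
def start_cluster(name="etl-demo", n_workers=2):
    """Create a Dask cluster through the operator and connect a client to it."""
    # Imported inside the function so this file can be loaded on a machine
    # without dask-kubernetes; calling it needs a live Kubernetes cluster
    # with the Dask Operator installed.
    from dask.distributed import Client
    from dask_kubernetes.operator import KubeCluster

    # Creates the cluster resources on Kubernetes and waits for the scheduler.
    cluster = KubeCluster(name=name, n_workers=n_workers)
    # Work submitted through this client runs on the cluster's workers.
    client = Client(cluster)
    return cluster, client
```

Once the cluster is up, `cluster.scale(5)` resizes the default worker group and `cluster.close()` tears the resources down again.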