Vladimir Chavkov

Python for Data Engineering Training

Build production data pipelines with this intensive 3-day training. Master pandas and NumPy for data manipulation, design robust ETL processes, and learn to handle real-world data at scale with Python’s powerful data ecosystem.

Duration: 3 days (24 hours)
Level: Intermediate
Delivery: In-person, Live online, Hybrid
Certification: N/A

  • Python developers moving into data engineering
  • Data analysts scaling beyond spreadsheets and SQL
  • Backend engineers building data pipelines
  • DevOps engineers automating data workflows
  • Anyone responsible for data quality and delivery

After completing this training, participants will be able to:

  • Manipulate and transform data efficiently with pandas DataFrames
  • Perform numerical computations with NumPy arrays and vectorized operations
  • Design and implement ETL pipelines for batch and incremental processing
  • Read and write data across formats including CSV, JSON, Parquet, and databases
  • Validate data quality with schema checks and automated tests
  • Orchestrate multi-step pipelines with error handling and retry logic

Day 1: Data Manipulation with pandas and NumPy

Module 1: NumPy Foundations

  • ndarray creation and data types
  • Indexing, slicing, and boolean masking
  • Vectorized operations and broadcasting
  • Hands-on: Process sensor data with NumPy
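
As a rough illustration of the Module 1 topics, the sketch below combines boolean masking, a vectorized aggregation, and broadcasting. The sensor readings and the -999.0 "missing" sentinel are made-up values for this example, not course data.

```python
import numpy as np

# Hypothetical readings: rows are time steps, columns are sensors.
readings = np.array([
    [20.0, 21.5, 19.0],
    [22.0, -999.0, 19.5],
    [24.0, 23.5, 20.0],
])

# Boolean masking: replace the assumed sentinel value with NaN.
cleaned = np.where(readings == -999.0, np.nan, readings)

# Vectorized aggregation: per-sensor mean that skips NaN entries.
sensor_means = np.nanmean(cleaned, axis=0)

# Broadcasting: the length-3 vector of means is subtracted from every row.
centered = cleaned - sensor_means
```

No Python-level loop is needed: each step operates on the whole array at once.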

Module 2: pandas DataFrames

  • Series and DataFrame creation
  • Indexing with loc, iloc, and boolean selection
  • Column operations, dtypes, and missing data
  • Hands-on: Clean and explore a messy dataset
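
A minimal sketch of the Module 2 selection and missing-data topics might look like the following. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical dataset with a labelled index and some missing values.
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", None], "temp_c": [4.5, None, 19.0]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "city"]    # label-based lookup
by_position = df.iloc[0, 1]       # position-based lookup (first row, second column)
warm = df[df["temp_c"] > 10]      # boolean selection; NaN compares as False

# Fill missing values per column: a placeholder for text, the mean for numbers.
filled = df.fillna({"city": "unknown", "temp_c": df["temp_c"].mean()})
```

Filling with the column mean is only one possible policy; dropping rows or flagging them for review may suit a real dataset better.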

Module 3: Data Transformation

  • Filtering, sorting, and grouping
  • Aggregations, pivot tables, and crosstabs
  • Merging, joining, and concatenating DataFrames
  • Hands-on: Combine and summarize multi-source sales data
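
The merge and aggregation topics in Module 3 could be sketched as below. The `orders` and `stores` tables are hypothetical sample data; the `validate` argument asks pandas to raise an error if a store appears more than once in the lookup table.

```python
import pandas as pd

# Hypothetical sales data from two sources.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "store_id": [10, 10, 20, 30],
    "amount": [100.0, 50.0, 75.0, 20.0],
})
stores = pd.DataFrame({
    "store_id": [10, 20],
    "region": ["North", "South"],
})

# Left join keeps every order, even when the store is unknown.
merged = orders.merge(stores, on="store_id", how="left", validate="many_to_one")
merged["region"] = merged["region"].fillna("Unknown")

# Named aggregation: one row per region with total revenue and order count.
summary = (
    merged.groupby("region", as_index=False)
    .agg(total=("amount", "sum"), orders=("order_id", "count"))
    .sort_values("total", ascending=False, ignore_index=True)
)
```

A left join is used here so unmatched orders stay visible in the summary instead of silently disappearing.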

Module 4: Reading and Writing Data

  • CSV, JSON, and Excel I/O with pandas
  • Working with Parquet and columnar formats
  • Database connections with SQLAlchemy
  • Hands-on: Build a multi-format data ingestion layer
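
One way to picture a multi-format ingestion layer is a small dispatch table keyed by format, sketched below. The `READERS` mapping, the `ingest` helper, and the sample text are assumptions made for this example. Parquet is left out because `pandas.read_parquet` needs an extra engine such as pyarrow or fastparquet.

```python
import io
import pandas as pd

# Hypothetical inputs in two formats with the same logical schema.
csv_text = "id,name\n1,Ada\n2,Grace\n"
jsonl_text = '{"id": 3, "name": "Linus"}\n{"id": 4, "name": "Guido"}\n'

# Map each supported format to a function that reads a text buffer.
READERS = {
    "csv": lambda buf: pd.read_csv(buf),
    "jsonl": lambda buf: pd.read_json(buf, lines=True),
}

def ingest(text, fmt):
    """Parse text in the given format and tag rows with their origin."""
    try:
        reader = READERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    frame = reader(io.StringIO(text))
    frame["source_format"] = fmt
    return frame

combined = pd.concat(
    [ingest(csv_text, "csv"), ingest(jsonl_text, "jsonl")],
    ignore_index=True,
)
```

Supporting another format then means adding one entry to the mapping rather than changing the callers.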

Module 5: Data Validation and Quality

  • Schema validation and type enforcement
  • Detecting duplicates, outliers, and anomalies
  • Data quality metrics and reporting
  • Hands-on: Build a data validation framework
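
The schema and quality checks in Module 5 could start from something as small as the sketch below. `EXPECTED_DTYPES` and the `validate` function are hypothetical names, and the rules (unique `order_id`, no null `amount`) are example policies only.

```python
import pandas as pd

# Assumed schema for this example: column name -> expected pandas dtype.
EXPECTED_DTYPES = {"order_id": "int64", "amount": "float64"}

def validate(df):
    """Return a list of human-readable problems; an empty list means the frame passed."""
    problems = []
    for column, dtype in EXPECTED_DTYPES.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif str(df[column].dtype) != dtype:
            problems.append(f"{column}: expected {dtype}, got {df[column].dtype}")
    if "order_id" in df.columns and df["order_id"].duplicated().any():
        problems.append("duplicate order_id values")
    if "amount" in df.columns and df["amount"].isna().any():
        problems.append("null values in amount")
    return problems

good = pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 3.0]})
bad = pd.DataFrame({"order_id": [1, 1], "amount": [9.5, None]})
```

Returning all problems at once, rather than raising on the first, makes the result usable as a data quality report.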

Module 6: ETL Pipeline Patterns

  • Extract-Transform-Load vs ELT approaches
  • Incremental loading and change data capture
  • Idempotent pipeline design
  • Hands-on: Implement a complete ETL pipeline from raw files to database
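
Idempotent pipeline design can be illustrated with an upsert-based load step, sketched here against an in-memory SQLite database. The `customers` table and the `load` helper are assumptions for this example, and the `ON CONFLICT` clause requires SQLite 3.24 or newer.

```python
import sqlite3

def load(conn, rows):
    """Insert rows, updating existing ones, so that reruns do not create duplicates."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            """
            INSERT INTO customers (id, email) VALUES (?, ?)
            ON CONFLICT(id) DO UPDATE SET email = excluded.email
            """,
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

batch = [(1, "a@example.com"), (2, "b@example.com")]
load(conn, batch)
load(conn, batch)                      # rerunning the same batch changes nothing
load(conn, [(2, "b2@example.com")])    # a later batch updates the existing row

rows = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
```

Because the load is keyed on `id`, a failed run can simply be retried from the start.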

Module 7: Pipeline Orchestration

  • Task dependencies and execution order
  • Scheduling with cron and APScheduler
  • Error handling, retries, and dead-letter queues
  • Hands-on: Orchestrate a multi-step pipeline with dependency management
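
The retry and dead-letter ideas from Module 7 might be sketched with the standard library alone, as below. `run_with_retry` and `process_all` are hypothetical helpers; a production pipeline would more likely rely on its scheduler or queue for this.

```python
import time

def run_with_retry(task, item, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call task(item), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(item)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

def process_all(task, items, **retry_options):
    """Process every item; collect items that keep failing instead of stopping the run."""
    results, dead_letters = [], []
    for item in items:
        try:
            results.append(run_with_retry(task, item, **retry_options))
        except Exception as exc:
            dead_letters.append((item, repr(exc)))
    return results, dead_letters
```

Passing `sleep` as a parameter is a design choice that lets tests skip the real delays.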

Module 8: Performance Optimization

  • Chunked processing for large datasets
  • Memory optimization with dtypes and categories
  • Parallel processing with multiprocessing and Dask
  • Hands-on: Optimize a slow pipeline to handle 10x data volume
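
Chunked processing, the first Module 8 topic, can be sketched as follows: the file is read a few rows at a time and partial aggregates are combined. The tiny in-memory CSV and the chunk size of 4 are illustrative assumptions; a real file would use a far larger chunk size.

```python
import io
import pandas as pd

# Hypothetical CSV with alternating categories "a" and "b" and values 0..9.
lines = ["category,value"] + [f"{'ab'[i % 2]},{i}" for i in range(10)]
csv_text = "\n".join(lines) + "\n"

totals = {}
reader = pd.read_csv(
    io.StringIO(csv_text),
    chunksize=4,                      # only 4 rows are held in memory at a time
    dtype={"category": "category"},   # categorical dtype reduces memory for repeated strings
)
for chunk in reader:
    partial = chunk.groupby("category", observed=True)["value"].sum()
    for key, value in partial.items():
        totals[key] = totals.get(key, 0) + int(value)
```

This works because a sum can be combined across chunks; aggregations such as a median need a different strategy.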

Module 9: Testing and Monitoring

  • Unit testing data transformations with pytest
  • Integration testing with test fixtures and sample data
  • Pipeline monitoring, logging, and alerting
  • Hands-on: Add comprehensive tests and monitoring to a production pipeline
  • Python Fundamentals or equivalent programming experience
  • Comfort with functions, classes, and file I/O
  • Basic understanding of SQL and relational databases
  • Familiarity with CSV and JSON data formats
Format       Description
In-Person    On-site at your company’s location, hands-on with direct interaction
Live Online  Interactive virtual sessions with screen sharing and real-time labs
Hybrid       Combination of on-site and remote sessions, flexible scheduling

All formats include hands-on labs, course materials, sample datasets, and post-training support.