Data Resources

Tech

  • Terraform / OpenTofu
  • AWS S3
  • AWS Security Groups
  • AWS Auto Scaling Groups
  • AWS AMI / Packer
  • AWS EC2
  • AWS RDS
  • AWS ELB / ALB
  • AWS Route 53
  • K8s / EKS
  • Helm
  • ArgoCD
  • Jenkins / Groovy DSL
  • Trivy / tfsec / Checkov / Terrascan / TFLint / KubeLinter
  • Java
  • Spring Boot
  • Python 3.x
  • PostgreSQL
  • MySQL
  • Redis
  • Kafka
  • Airflow
  • Prometheus
  • Grafana
  • ELK
  • Splunk
  • Druid
  • Presto
  • Trino
  • AWS Redshift
  • Apache Spark
  • Apache Flink
  • Apache Beam
  • Apache Kafka
  • Kafka Streams
  • ksqlDB
  • Apache Pulsar
  • Hadoop ecosystem (HDFS, MapReduce, Hive)
  • JVM Tuning
  • GC Tuning
  • Linux Tuning
  • K8s Tuning
  • Database Tuning
  • AWS VPC Flow Logs
  • AWS CloudTrail
  • AWS CloudWatch
  • OpenSearch / Elasticsearch
  • Storage formats:
    • Parquet
    • Avro
    • ORC
    • JSON
    • Iceberg
    • Delta Lake
  • Luigi
  • Apache NiFi
  • DDD
  • Data Mesh
  • Snowflake

Data Techniques

  • ETL vs ELT patterns
  • Stream processing and windowing
  • Change data capture (CDC)
  • Data partitioning and sharding
  • Batch vs real-time processing strategies
  • Data federation vs data virtualization
  • API-first data integration
  • Event-driven architecture patterns
  • Master data management (MDM)
  • Data replication strategies (sync vs async)
  • Columnar storage formats (Parquet, ORC)
  • Data compression techniques
  • Indexing strategies for analytics
  • Query optimization and execution planning
  • Caching layers and materialized views
  • Schema-on-read vs schema-on-write
  • Data denormalization for analytics
  • Slowly changing dimensions (SCD) handling
  • Data aggregation and rollup strategies
  • Time-series data processing patterns
  • Event sourcing patterns
  • Complex event processing (CEP)
  • Stream-stream and stream-table joins
  • Watermarking for late-arriving data (see the windowing sketch after this list)
  • Backpressure handling in streaming systems
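
Several of the streaming items above (windowing, watermarking, late-arriving data) are easiest to see together in code. Below is a minimal PySpark Structured Streaming sketch rather than a reference implementation: the Kafka topic sensor-events, the broker address, and the device_id / temperature / event_time fields are assumptions made for illustration, and the job needs the spark-sql-kafka connector package on the classpath.

```python
# Minimal sketch: tumbling-window aggregation with a watermark for late data.
# Topic name, broker address, and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.appName("windowed-sensor-aggregation").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("temperature", DoubleType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")  # requires the spark-sql-kafka connector package
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "sensor-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# 5-minute tumbling windows; events arriving more than 10 minutes late are
# dropped once the watermark has passed their window.
windowed = (events
            .withWatermark("event_time", "10 minutes")
            .groupBy(window(col("event_time"), "5 minutes"), col("device_id"))
            .agg(avg("temperature").alias("avg_temperature")))

# Append mode emits each window only after its watermark closes.
query = (windowed.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
```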

Data Engineering Code Challenges Round 1

  1. Create an Airflow DAG that extracts data from PostgreSQL, transforms it with Spark, and loads it into Redshift. Include data quality checks and error handling (a minimal DAG skeleton follows this list).
  2. Build a real-time streaming pipeline using Kafka, Spark Structured Streaming, and Delta Lake. Process IoT sensor data with windowing and aggregations.
  3. Create a dbt project with dimensional modeling (fact/dimension tables), tests, documentation, and CI/CD pipeline deployment to Snowflake.
  4. Build a Docker Compose setup with Kafka, Spark, PostgreSQL, and Jupyter. Create a complete data pipeline that processes sample e-commerce data.
  5. Create Terraform scripts to deploy: S3 data lake with proper partitioning, Glue catalog, Glue ETL job, Lambda for data validation, and IAM roles.
  6. Build a Python data pipeline using Pandas that reads from multiple CSV sources, performs data cleaning/validation, and outputs to both PostgreSQL and Parquet files.
  7. Create a real-time CDC pipeline using Debezium, Kafka Connect, and Elasticsearch. Capture changes from MySQL and make them searchable in near real-time.
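
For challenge 1, a minimal Airflow DAG skeleton might look like the sketch below. It assumes Airflow 2.4+ with the Postgres provider installed; the connection id postgres_src, the orders table, and the stubbed Spark and Redshift steps are hypothetical placeholders, not a complete solution.

```python
# Minimal skeleton for challenge 1 (assumes Airflow 2.4+ and the Postgres provider).
# Connection ids, table names, and the stubbed transform/load steps are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def extract_orders(**context):
    # Pull one day's rows from the source PostgreSQL database.
    hook = PostgresHook(postgres_conn_id="postgres_src")
    rows = hook.get_records(
        "SELECT id, amount, created_at FROM orders WHERE created_at::date = %s",
        parameters=[context["ds"]],
    )
    return len(rows)  # pushed to XCom for the quality check


def check_row_count(ti, **_):
    # Basic data quality gate: fail the run if the extract returned nothing.
    extracted = ti.xcom_pull(task_ids="extract_orders")
    if not extracted:
        raise ValueError("No rows extracted; failing the DAG run")


with DAG(
    dag_id="orders_postgres_to_redshift",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2},  # simple error handling via automatic retries
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    quality_check = PythonOperator(task_id="check_row_count", python_callable=check_row_count)

    # A SparkSubmitOperator transform and a Redshift COPY step would follow here.
    extract >> quality_check
```

Retries plus an explicit row-count gate cover the error-handling requirement at a basic level; the Spark transform and the Redshift load would slot in as downstream tasks.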

Data Engineering Code Challenges Round 2

  1. Build a serverless data pipeline using AWS Lambda, Step Functions, and S3. Process JSON files, transform them with Pandas, and load the results into DynamoDB (a Lambda sketch follows this list).
  2. Create a Kubernetes operator in Go that automatically provisions Spark clusters and manages data processing jobs based on custom resource definitions.
  3. Build a data quality monitoring system using Great Expectations, deployed on Kubernetes with alerts to Slack when data quality issues are detected.
  4. Create a multi-tenant data platform using Apache Iceberg tables with row-level security, deployed on EKS with Trino for querying.
  5. Build a feature store using Feast, deployed on Kubernetes, with both batch and real-time feature serving for ML pipelines.
  6. Create a data lineage tracking system using Apache Atlas or DataHub, integrated with Airflow and dbt to automatically track data dependencies.
  7. Deploy a complete lakehouse architecture using Delta Lake on S3, with Spark on EKS, Hive Metastore, and Superset for visualization. Include data governance, performance tuning, and cost optimization.
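
For challenge 1, the Lambda transform step could start from something like the sketch below. It assumes JSON Lines input, an event payload carrying bucket and key (whether from Step Functions or an S3 trigger), and a hypothetical DynamoDB table named orders; packaging Pandas as a layer or container image and the Step Functions state machine itself are out of scope here.

```python
# Minimal sketch of the Lambda transform step for challenge 1.
# Assumes JSON Lines input and a hypothetical DynamoDB table named "orders";
# Pandas has to be packaged as a Lambda layer or container image.
from decimal import Decimal

import boto3
import pandas as pd

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("orders")


def to_dynamo(value):
    # DynamoDB rejects Python floats and numpy scalars, so coerce numbers to Decimal.
    if hasattr(value, "item"):       # numpy scalar -> native Python type
        value = value.item()
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return Decimal(str(value))
    return value


def handler(event, context):
    # The payload (from Step Functions or an S3 trigger mapping) names the object.
    bucket, key = event["bucket"], event["key"]

    # Read the raw JSON Lines file from S3 into a DataFrame.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    df = pd.read_json(body, lines=True)

    # Simple transform: drop incomplete rows and derive a per-order total.
    df = df.dropna(subset=["order_id", "quantity", "unit_price"])
    df["total"] = df["quantity"] * df["unit_price"]

    # Batch-write the cleaned records to DynamoDB.
    with table.batch_writer() as batch:
        for record in df.to_dict(orient="records"):
            batch.put_item(Item={k: to_dynamo(v) for k, v in record.items()})

    return {"processed": int(len(df)), "source": f"s3://{bucket}/{key}"}
```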

Data Engineering Code Challenges Round 3

  1. Create a data mesh implementation with domain-specific data products, each with its own CI/CD, data contracts, and SLA monitoring.
  2. Build a real-time fraud detection pipeline using Kafka Streams, feature engineering with time windows, and model serving with MLflow on Kubernetes.
  3. Create a data observability platform using OpenTelemetry, Grafana, and Prometheus to monitor data pipelines, including latency, throughput, and error rates (an instrumentation sketch follows this list).
  4. Build a data archiving solution using Apache Pulsar for event streaming, with automatic tiered storage to S3 Glacier for cold data.
  5. Create a data catalog with automated metadata extraction from various data sources (databases, files, APIs) using Apache NiFi, storing the results in Apache Atlas.
  6. Build a real-time recommendation engine using Apache Flink, integrating with Kafka for user events and serving recommendations via a REST API.
  7. Create a data governance framework with Apache Ranger for access control, Apache Atlas for metadata management, and integration with dbt for data lineage tracking.
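
For challenge 3, the pipeline-side metrics could be exposed with the Prometheus Python client roughly as sketched below. The metric names, the orders_etl job label, and the placeholder process_batch body are assumptions; the Grafana dashboards and the OpenTelemetry tracing layer the challenge also asks for are not shown.

```python
# Minimal sketch of pipeline instrumentation for challenge 3 using prometheus_client.
# Metric names, the job label, and the placeholder batch logic are hypothetical.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

ROWS_PROCESSED = Counter("pipeline_rows_processed_total",
                         "Rows successfully processed", ["job"])
ROW_ERRORS = Counter("pipeline_row_errors_total",
                     "Rows that failed processing", ["job"])
BATCH_LATENCY = Histogram("pipeline_batch_duration_seconds",
                          "End-to-end latency of one batch", ["job"])
LAST_SUCCESS = Gauge("pipeline_last_success_timestamp_seconds",
                     "Unix timestamp of the last successful batch", ["job"])


def process_batch(job, rows):
    """Process one batch while recording latency, throughput, and error counts."""
    with BATCH_LATENCY.labels(job=job).time():
        for row in rows:
            try:
                # Placeholder transform; a real step would write to a sink here.
                _ = {k: str(v).strip() for k, v in row.items()}
                ROWS_PROCESSED.labels(job=job).inc()
            except Exception:
                ROW_ERRORS.labels(job=job).inc()
    LAST_SUCCESS.labels(job=job).set_to_current_time()


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes /metrics on this port
    while True:
        fake_rows = [{"id": i, "value": random.random()} for i in range(100)]
        process_batch("orders_etl", fake_rows)
        time.sleep(5)
```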

Books