Apache Projects and AWS Services
I often see AWS services described as "xxx service for Apache ooo," so I put this table together out of boredom.
Apache project list:
https://projects.apache.org/projects.html
Running on EMR

| Feature | Apache | AWS |
| --- | --- | --- |
| Distributed computing framework for batch processing | Hadoop (Parquet) | Official EMR support |
| Fast in-memory distributed computing engine | Spark (Parquet) | Official EMR support |
| Distributed NoSQL database (read-focused) | HBase | Official EMR support |
| Data warehouse SQL query tool | Hive (ORC) | Official EMR support |
| Big data processing language | Pig | |
| Incremental processing framework | Hudi | Official EMR support |
| Stream processing framework | Flink | |
| Distributed coordination service | ZooKeeper | Built-in ZooKeeper management in EMR |
| Data transfer between RDBMS and Hadoop | Sqoop | |
| Deep learning framework | MXNet | SageMaker |
| Centralized permission management platform | Ranger | EMR |
| …, etc. | | |

AWS Managed (like EKS for Kubernetes: AWS handles the low-level operational work)

| Feature | Apache | AWS |
| --- | --- | --- |
| Distributed streaming platform & message queue | Kafka | MSK (Managed Streaming for Apache Kafka) |
| Workflow orchestration | Airflow | MWAA (Managed Workflows for Apache Airflow) |
| Distributed NoSQL database (write-focused) | Cassandra | Keyspaces |
| Stream processing framework | Flink | Kinesis Data Analytics (Flink) |
| Deep learning framework | MXNet | SageMaker MXNet |

AWS Alternative Solutions/Replacements

| Feature | Apache | AWS |
| --- | --- | --- |
| Fast distributed computing engine | Spark | Glue |
| Distributed streaming platform & message queue | Kafka | Kinesis Data Streams |
| Workflow orchestration | Airflow | Step Functions |
| Distributed NoSQL database | Cassandra | DynamoDB |
| Distributed SQL query engine | Presto | Athena |
| Distributed storage and processing | Hadoop HDFS | S3 |
| Distributed coordination service | ZooKeeper | Cloud Map |
| Full-text search engine / Lucene-based search platform | Lucene/Solr | OpenSearch |
| Real-time stream processing system | Storm | Kinesis Data Analytics, Lambda |
| Data transfer between RDBMS and Hadoop | Sqoop | DMS (Database Migration Service) |
| Centralized permission management platform | Ranger | Lake Formation |
- SageMaker supports other frameworks like TensorFlow and Keras…
- When in doubt, EMR is usually the answer 😅
- Amazon EMR started with Apache Hadoop but now supports many more projects
- TensorFlow is from Google, not an Apache project
- Presto is a distributed SQL query engine from Facebook, open-sourced under the Apache License (though it is not an Apache Software Foundation project)
- Many other Apache projects run on EMR (too many to list!). Check them out here:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html
- EMR discussions mostly contrast Spark vs Hadoop:
- a. Spark: real-time, memory-hungry, more expensive
- b. Hadoop: batch-oriented, storage-based, budget-friendly
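That contrast can be sketched with a toy word count in the map -> shuffle -> reduce shape of a Hadoop batch job (pure Python, purely illustrative; no actual Hadoop involved):

```python
from collections import defaultdict

# Toy MapReduce-style word count: the shape of a Hadoop batch job.
# Real Hadoop distributes these phases across machines and spills
# intermediate data to disk -- hence "batch, storage-based".

def map_phase(lines):
    # map: emit (word, 1) pairs
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # shuffle: group values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # reduce: sum counts per word
    return {word: sum(values) for word, values in grouped.items()}

lines = ["spark is fast", "hadoop is cheap", "spark is memory hungry"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"], counts["is"])  # 2 3
```

Spark keeps the intermediate data in memory between these phases instead of writing it to disk, which is where both its speed and its memory appetite come from.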
Scenarios
- Note: Spark Streaming is just one component of Spark
- Note: While the Elastic Stack isn't under Apache, it focuses on search and visualization. Elasticsearch was built by Elastic on top of Apache Lucene; AWS later forked Elasticsearch to create OpenSearch.
- Elasticsearch and Apache Solr are both based on Apache Lucene, providing powerful search capabilities
Lucene is the underlying search library, and Solr is a platform built on top of Lucene that makes it easy to build Lucene-based applications.
Simple
- Pipeline: Kafka (collection) -> Spark (transformation/cleaning) -> Hadoop (storage) -> Hive (query)
- Stream processing: Kafka -> Flink (transformation) -> S3
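A toy sketch of that pipeline shape (collect -> transform -> store -> query), with each stage as a plain Python function standing in for Kafka/Spark/Hive; all data and names here are made up for illustration:

```python
# Toy pipeline: each stage stands in for a real system
# (collect ~ Kafka, transform ~ Spark, store ~ Hadoop/Hive).
# No real services are involved.

def collect():                      # "Kafka": raw events arrive
    raw = ["  Alice,30 ", "Bob,25", "  ,  ", "Carol,35"]
    yield from raw

def transform(events):              # "Spark": clean and parse
    for event in events:
        name, _, age = event.strip().partition(",")
        if name.strip() and age.strip().isdigit():
            yield {"name": name.strip(), "age": int(age.strip())}

def store(records):                 # "Hive table": materialize
    return list(records)

table = store(transform(collect()))
# "query": an aggregate, like a Hive SELECT AVG(age)
avg_age = sum(r["age"] for r in table) / len(table)
print(avg_age)  # 30.0
```

The point of the shape: each stage only consumes the previous stage's output, so any stage can be swapped for its AWS counterpart without touching the others.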
A Little Complicated
For each scenario, the first line is the Apache solution and the second is the AWS solution.
- stream processing and storage:
- Kafka (collect) -> Spark Streaming (process) -> Hive (save) -> Presto (query)
- Kinesis Data Streams -> Kinesis Analytics -> S3 -> Athena
- batching:
- Sqoop (import from SQL) -> HDFS (save) -> Spark (transform) -> HBase (save) -> Hive (analysis)
- DMS -> S3 -> EMR(Spark) -> DynamoDB -> Athena
- real time analysis:
- Flume (collect logging) -> Kafka (buffer) -> Storm (realtime processing) -> Cassandra (save) -> Spark (fast analysis)
  - Kinesis Firehose or CloudWatch -> Kinesis Data Streams -> Kinesis Analytics -> Keyspaces -> EMR (Spark)
- ML:
- Kafka (collect) -> Spark Streaming -> HDFS (store) -> Spark MLlib (train) -> HBase (store)
- MSK -> EMR(Spark) -> S3 -> SageMaker -> DynamoDB
- logging analysis:
- Flume or Logstash (Elastic) -> Kafka -> Flink -> Solr or Elasticsearch (Elastic) -> Superset or Kibana (Elastic)
- CloudWatch Logs -> Kinesis Firehose -> Kinesis Analytics -> OpenSearch -> OpenSearch Dashboards
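Most of the stream-processing legs above (Flink, Storm, Kinesis Analytics) boil down to windowed aggregation. A minimal pure-Python sketch of a tumbling-window count (no real stream processor involved; the event data is made up):

```python
from collections import Counter

# Toy tumbling-window count -- the kind of continuous aggregation a
# stream processor (Flink / Kinesis Data Analytics) runs. Events are
# (timestamp_seconds, log_level) pairs; windows are 10 seconds wide.

def tumbling_window_counts(events, window_size=10):
    windows = {}
    for ts, level in events:
        # every event falls into exactly one fixed-size window
        window_start = (ts // window_size) * window_size
        windows.setdefault(window_start, Counter())[level] += 1
    return windows

events = [(1, "INFO"), (3, "ERROR"), (9, "INFO"),
          (12, "ERROR"), (15, "ERROR"), (21, "INFO")]
result = tumbling_window_counts(events)
print(result[0]["INFO"], result[10]["ERROR"])  # 2 2
```

A real engine does the same grouping incrementally over an unbounded stream, emitting each window's result as its time range closes.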
Side Note: Parquet vs ORC format
ORC
For write-heavy workloads, frequent per-column calculations, and whole-dataset statistics
- Good if you're just using Hadoop
- Perfect for batch processing
```
# ORC
File
├── Stripes
│   ├── Index Data
│   │   ├── Column 1 index (Student Name)
│   │   ├── Column 2 index (Gender)
│   │   └── Column 3 index (Seat Number)
│   │
│   ├── Row Data
│   │   ├── Column 1: Alice | Bob | Charlie | David | Eve | Frank | Grace | Hank | Ivy | Jack
│   │   ├── Column 2: F | M | M | M | F | M | F | M | F | M
│   │   └── Column 3: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
│   │
│   └── Stripe Footer
│
└── File Footer
```
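The Index Data section is what makes those "overall statistical" queries cheap: ORC keeps per-stripe statistics (such as min/max), so a reader can skip whole stripes that can't match a predicate. A toy pure-Python sketch of that idea (the stripe layout and stats are heavily simplified, not the real ORC format):

```python
# Toy sketch of stripe skipping via per-stripe min/max statistics
# (the idea behind ORC's index data; not real ORC I/O).

stripes = [
    {"min_seat": 1,  "max_seat": 10, "seats": list(range(1, 11))},
    {"min_seat": 11, "max_seat": 20, "seats": list(range(11, 21))},
    {"min_seat": 21, "max_seat": 30, "seats": list(range(21, 31))},
]

def query_seats(stripes, lo, hi):
    matched, stripes_read = [], 0
    for stripe in stripes:
        # predicate pushdown: min/max stats rule the stripe out
        # without reading any of its rows
        if stripe["max_seat"] < lo or stripe["min_seat"] > hi:
            continue
        stripes_read += 1
        matched.extend(s for s in stripe["seats"] if lo <= s <= hi)
    return matched, stripes_read

seats, read = query_seats(stripes, 12, 18)
print(read)  # only 1 of 3 stripes actually read
```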
Parquet
For read-heavy workloads
- Suitable when using Hadoop and Spark together
- Good for scenarios that integrate multiple tools
```
# Parquet
File
├── Row Groups
│   ├── Column Chunks (Columns in chunks)
│   │   ├── Column 1 (Student Name)
│   │   │   ├── Page 1: Alice, Bob, Charlie, David, Eve
│   │   │   └── Page 2: Frank, Grace, Hank, Ivy, Jack
│   │   │
│   │   ├── Column 2 (Gender)
│   │   │   ├── Page 1: F, M, M, M, F
│   │   │   └── Page 2: M, F, M, F, M
│   │   │
│   │   └── Column 3 (Seat Number)
│   │       ├── Page 1: 1, 2, 3, 4, 5
│   │       └── Page 2: 6, 7, 8, 9, 10
│
└── File Metadata (Schema and compression information)
```
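The column-chunk layout is why Parquet suits read-heavy analytics: an aggregate over one column only touches that column's pages. A toy pure-Python contrast of row-oriented vs column-oriented storage (the data and "cells touched" counts are illustrative, not real Parquet I/O):

```python
# Toy contrast: row-oriented vs column-oriented (Parquet-style) layout.
# Computing a single-column aggregate touches far fewer cells when each
# column is stored contiguously.

rows = [("Alice", "F", 1), ("Bob", "M", 2), ("Charlie", "M", 3),
        ("David", "M", 4), ("Eve", "F", 5)]

# Row-oriented: seat numbers are interleaved with the other columns,
# so scanning them means walking every cell of every row.
cells_touched_row = sum(len(r) for r in rows)   # 15 cells

# Column-oriented: the same aggregate reads only the one column it needs.
columns = {
    "name":   [r[0] for r in rows],
    "gender": [r[1] for r in rows],
    "seat":   [r[2] for r in rows],
}
cells_touched_col = len(columns["seat"])        # 5 cells

max_seat = max(columns["seat"])
print(max_seat, cells_touched_row, cells_touched_col)  # 5 15 5
```

Contiguous columns also compress better (similar values sit next to each other), which is the other half of the columnar win.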
References
https://aws.amazon.com/managed-workflows-for-apache-airflow/?did=ap_card&trk=ap_card
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase.html
https://aws.amazon.com/compare/the-difference-between-hadoop-vs-spark/?nc1=h_ls
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ranger-architecture.html
https://aws.amazon.com/blogs/big-data/migrate-from-apache-solr-to-opensearch/
https://lucidworks.com/post/introduction-to-apache-lucenesolr/