@jialin.huang

© 2024 jialin00.com

Original content since 2022


Apache Projects and AWS Services

Since I frequently see ‘xxx service for Apache ooo,’ I made this table out of boredom.

Apache Project

https://projects.apache.org/projects.html

Running on EMR

| Feature | Apache | AWS |
| --- | --- | --- |
| Distributed Computing Framework for Batching | Hadoop (Parquet) | official |
| Fast Distributed Computing Engine for Caching | Spark (Parquet) | official |
| Distributed NoSQL Database (read-focused) | HBase | official |
| Data Warehouse SQL Query Tool | Hive (ORC) | official |
| Big Data Processing Language | Pig | |
| Incremental Processing Framework | Hudi | official |
| Stream Processing Framework | Flink | |
| Distributed Coordination Service | ZooKeeper | Built-in ZooKeeper management in EMR |
| Data Transfer between RDBMS and Hadoop | Sqoop | |
| Deep Learning Framework | MXNet | SageMaker |
| Centralized Permission Management Platform | Ranger | EMR |
| …, etc. | | |

AWS Managed (like EKS for K8s: ignore the low-level work)

| Feature | Apache | AWS |
| --- | --- | --- |
| Distributed Streaming Platform & Message Queue | Kafka | MSK (Managed Streaming for Kafka) |
| Workflow | Airflow | MWAA (Managed Workflows for Apache Airflow) |
| NoSQL DB (focus on write) | Cassandra | Keyspaces |
| Stream Processing Framework | Flink | Kinesis Data Analytics (Flink) |
| Deep Learning Framework | MXNet | SageMaker MXNet |

AWS Alternative Solutions/Replacements

| Feature | Apache | AWS |
| --- | --- | --- |
| Fast Distributed Computing Engine | Spark | Glue |
| Distributed Streaming Platform & Message Queue | Kafka | Kinesis Data Streams |
| Workflow | Airflow | Step Functions |
| NoSQL DB | Cassandra | DynamoDB |
| Distributed SQL Query Engine | Presto | Athena |
| Distributed Storage and Processing | Hadoop HDFS | S3 |
| Distributed Coordination Service | ZooKeeper | Cloud Map |
| Full-text Search Engine/Lucene-based Search Platform | Lucene/Solr | OpenSearch |
| Real-time Data Stream Processing System | Storm | Kinesis Data Analytics, Lambda |
| Data Transfer between RDBMS and Hadoop | Sqoop | DMS (Database Migration Service) |
| Centralized Permission Management Platform | Ranger | AWS Lake Formation |

  1. SageMaker supports other frameworks like TensorFlow and Keras…
  2. When in doubt, EMR is usually the answer 😅
  3. Amazon EMR started with Apache Hadoop but now supports many more:
    1. TensorFlow (Google)
    2. PrestoDB, a distributed SQL query engine
      1. Originally from Facebook; Presto was open-sourced under the Apache License 2.0 (it is not an Apache Software Foundation project)
    3. Many other Apache projects (too many to list under EMR!). Check them out here:

      https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html

  4. EMR mostly focuses on Spark vs Hadoop:
    a. Spark: in-memory and near real-time, but expensive and memory-hungry
    b. Hadoop: batch-oriented, budget-friendly, storage-based
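
If you want the mapping in a form you can query, the rows above fit naturally into a lookup structure. This is only a minimal Python sketch, not an official AWS mapping: the names `MANAGED`, `ALTERNATIVE`, and `aws_options` are made up for illustration, and only some of the rows are copied over.

```python
# Rows copied from the tables above. The dictionary and function names
# are illustrative assumptions, not part of any AWS SDK.
MANAGED = {
    "Kafka": "MSK",
    "Airflow": "MWAA",
    "Cassandra": "Keyspaces",
    "Flink": "Kinesis Data Analytics (Flink)",
}

ALTERNATIVE = {
    "Spark": "Glue",
    "Kafka": "Kinesis Data Streams",
    "Airflow": "Step Functions",
    "Cassandra": "DynamoDB",
    "Presto": "Athena",
    "ZooKeeper": "Cloud Map",
    "Sqoop": "DMS",
    "Ranger": "Lake Formation",
}


def aws_options(project: str) -> dict:
    """Return the managed service and the AWS-native alternative, if listed."""
    return {
        "managed": MANAGED.get(project),
        "alternative": ALTERNATIVE.get(project),
    }


if __name__ == "__main__":
    for name in ("Kafka", "Presto", "Pig"):
        print(name, aws_options(name))
```

A project that appears in neither dictionary (Pig, for example) comes back with two `None` values, which in practice means "run it on EMR".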

Scenarios

  • Note: Spark Streaming is just one component of Spark
  • Note: The Elastic Stack is not an Apache project; it focuses on search and visualization. AWS first offered Elasticsearch as a managed service, then forked Elasticsearch and Kibana into OpenSearch and OpenSearch Dashboards.
    • Elasticsearch and Apache Solr are both built on Apache Lucene, which provides the core search capabilities.
      Lucene is the underlying search library; Solr is a platform on top of Lucene that makes it easier to build Lucene-based applications.

Simple

  1. Pipeline: Kafka (collection) -> Spark (transformation/cleaning) -> Hadoop (storage) -> Hive (query)
  2. Stream processing: Kafka -> Flink (transformation) -> S3
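
To make the first pipeline a bit more concrete, here is a toy sketch in plain Python. Each function only stands in for one stage (collection, transformation/cleaning, storage, query). None of them uses a real Kafka, Spark, Hadoop, or Hive client, and the function names and sample events are invented for illustration.

```python
from typing import Iterable, Iterator


def collect() -> Iterator[str]:
    """Stand-in for Kafka: yields raw, messy events."""
    yield from [" alice,3 ", "bob,5", "", "alice,4", "broken-line"]


def transform(raw: Iterable[str]) -> Iterator[dict]:
    """Stand-in for Spark: cleans and parses, dropping bad records."""
    for line in raw:
        parts = line.strip().split(",")
        if len(parts) == 2 and parts[1].isdigit():
            yield {"user": parts[0], "clicks": int(parts[1])}


def store(records: Iterable[dict]) -> list:
    """Stand-in for Hadoop storage: persists the cleaned records."""
    return list(records)


def query(table: list, user: str) -> int:
    """Stand-in for Hive: roughly SELECT SUM(clicks) WHERE user = ..."""
    return sum(row["clicks"] for row in table if row["user"] == user)


table = store(transform(collect()))
print(query(table, "alice"))  # 7
```

The point is only the shape: each stage consumes the previous stage's output, so any one of them can be swapped for a managed or alternative service without touching the others.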

A Little Complicated

In each pair, a is the Apache solution and b is the AWS solution.

  1. Stream processing and storage:
    a. Kafka (collect) -> Spark Streaming (process) -> Hive (save) -> Presto (query)
    b. Kinesis Data Streams -> Kinesis Data Analytics -> S3 -> Athena
  2. Batch processing:
    a. Sqoop (import from SQL) -> HDFS (save) -> Spark (transform) -> HBase (save) -> Hive (analysis)
    b. DMS -> S3 -> EMR (Spark) -> DynamoDB -> Athena
  3. Real-time analysis:
    a. Flume (collect logs) -> Kafka (buffer) -> Storm (real-time processing) -> Cassandra (save) -> Spark (fast analysis)
    b. Kinesis Firehose or CloudWatch -> Kinesis Data Streams -> Kinesis Data Analytics -> Keyspaces -> EMR (Spark)
  4. ML:
    a. Kafka (collect) -> Spark Streaming -> HDFS (store) -> Spark MLlib (train) -> HBase (store)
    b. MSK -> EMR (Spark) -> S3 -> SageMaker -> DynamoDB
  5. Log analysis:
    a. Flume or Logstash (Elastic) -> Kafka -> Flink -> Solr or Elasticsearch (Elastic) -> Superset or Kibana (Elastic)
    b. CloudWatch Logs -> Kinesis Firehose -> Kinesis Data Analytics -> OpenSearch -> OpenSearch Dashboards
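
The Apache and AWS variants of each scenario line up stage by stage, so they can be compared mechanically. The sketch below parses the arrow notation used above and pairs the stages of the first scenario. The helper names `parse_pipeline` and `align` are hypothetical, and the code assumes both pipelines have the same number of stages.

```python
import re


def parse_pipeline(text: str) -> list:
    """Split 'Tool (role) -> Tool (role)' into (tool, role) tuples."""
    stages = []
    for part in text.split("->"):
        # The role in parentheses is optional; it becomes None when absent.
        match = re.fullmatch(r"\s*(.+?)\s*(?:\((.*)\))?\s*", part)
        if match is None:
            raise ValueError(f"empty stage in pipeline: {text!r}")
        stages.append((match.group(1), match.group(2)))
    return stages


def align(apache: str, aws: str) -> list:
    """Pair each Apache stage with the AWS stage in the same position."""
    a, b = parse_pipeline(apache), parse_pipeline(aws)
    if len(a) != len(b):
        raise ValueError("pipelines must have the same number of stages")
    return [(role, tool_a, tool_b) for (tool_a, role), (tool_b, _) in zip(a, b)]


apache = "Kafka (collect) -> Spark Streaming (process) -> Hive (save) -> Presto (query)"
aws = "Kinesis Data Streams -> Kinesis Data Analytics -> S3 -> Athena"

for role, tool_a, tool_b in align(apache, aws):
    print(f"{role:8} {tool_a:16} {tool_b}")
```

Pairing by position is a simplification: it works here because each b pipeline was written as a one-to-one replacement for its a pipeline.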

Side Note: Parquet vs ORC format

ORC

For heavy write operations, frequent column calculations, and "overall statistical" computations

  • Good if you're just using Hadoop
  • Perfect for batch processing
# ORC
File
├── Stripes
│   ├── Index Data
│   │   ├── Column 1 index (Student Name)
│   │   ├── Column 2 index (Gender)
│   │   └── Column 3 index (Seat Number)
│   │
│   ├── Row Data
│   │   ├── Column 1: Alice | Bob | Charlie | David | Eve | Frank | Grace | Hank | Ivy | Jack
│   │   ├── Column 2: F | M | M | M | F | M | F | M | F | M
│   │   └── Column 3: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
│   │
│   └── Stripe Footer
│
└── File Footer
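
To show why the index data matters, here is a toy model of the layout in plain Python. It is a simplification and not the real ORC format: the student rows are split into two stripes of five (the diagram shows a single stripe), each stripe keeps a min/max per column as its "index data", and a lookup skips any stripe whose range cannot contain the value. All names are illustrative.

```python
ROWS = [
    ("Alice", "F", 1), ("Bob", "M", 2), ("Charlie", "M", 3), ("David", "M", 4),
    ("Eve", "F", 5), ("Frank", "M", 6), ("Grace", "F", 7), ("Hank", "M", 8),
    ("Ivy", "F", 9), ("Jack", "M", 10),
]
COLUMNS = ("name", "gender", "seat")


def build_stripes(rows, stripe_rows=5):
    """Group rows into stripes and store each stripe column by column."""
    stripes = []
    for start in range(0, len(rows), stripe_rows):
        chunk = rows[start:start + stripe_rows]
        row_data = {col: [r[i] for r in chunk] for i, col in enumerate(COLUMNS)}
        # Toy "index data": just the min and max of every column.
        index_data = {col: (min(v), max(v)) for col, v in row_data.items()}
        stripes.append({"index": index_data, "rows": row_data})
    return stripes


def find_by_seat(stripes, seat):
    """Return (name, stripes_scanned), reading only stripes that may match."""
    scanned = 0
    for stripe in stripes:
        low, high = stripe["index"]["seat"]
        if not low <= seat <= high:
            continue  # skipped using the index alone, row data untouched
        scanned += 1
        seats = stripe["rows"]["seat"]
        if seat in seats:
            return stripe["rows"]["name"][seats.index(seat)], scanned
    return None, scanned


stripes = build_stripes(ROWS)
print(find_by_seat(stripes, 8))  # ('Hank', 1)
```

Only the second stripe is scanned for seat 8, which is the kind of skipping that makes the per-stripe index useful for statistical and filtered queries.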

Parquet

For read-heavy workloads

  • Suitable for scenarios using both Hadoop and Spark together
  • Good for tool integration scenarios
File
├── Row Groups
│   ├── Column Chunks (Columns in chunks)
│   │   ├── Column 1 (Student Name)
│   │   │   ├── Page 1: Alice, Bob, Charlie, David, Eve
│   │   │   └── Page 2: Frank, Grace, Hank, Ivy, Jack
│   │   │
│   │   ├── Column 2 (Gender)
│   │   │   ├── Page 1: F, M, M, M, F
│   │   │   └── Page 2: M, F, M, F, M
│   │   │
│   │   └── Column 3 (Seat Number)
│   │       ├── Page 1: 1, 2, 3, 4, 5
│   │       └── Page 2: 6, 7, 8, 9, 10
│
└── File Metadata (Schema and compression information)
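
The same student data can be used to mimic this layout in plain Python. This is a toy sketch under stated assumptions, not the real Parquet encoding: one row group of ten rows, pages of five values (matching the diagram), and hypothetical helper names. It shows the main benefit for read-heavy work: reading one column touches only that column's chunks.

```python
ROWS = [
    ("Alice", "F", 1), ("Bob", "M", 2), ("Charlie", "M", 3), ("David", "M", 4),
    ("Eve", "F", 5), ("Frank", "M", 6), ("Grace", "F", 7), ("Hank", "M", 8),
    ("Ivy", "F", 9), ("Jack", "M", 10),
]
COLUMNS = ("name", "gender", "seat")


def paginate(values, page_size):
    """Split one column's values into fixed-size pages."""
    return [values[i:i + page_size] for i in range(0, len(values), page_size)]


def build_file(rows, rows_per_group=10, page_size=5):
    """Lay rows out as row groups -> column chunks -> pages."""
    row_groups = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        chunks = {
            col: paginate([r[i] for r in group], page_size)
            for i, col in enumerate(COLUMNS)
        }
        row_groups.append(chunks)
    return {"row_groups": row_groups, "metadata": {"schema": COLUMNS}}


def read_column(file, column):
    """Read a single column without touching the other columns' chunks."""
    values = []
    for chunks in file["row_groups"]:
        for page in chunks[column]:
            values.extend(page)
    return values


parquet_like = build_file(ROWS)
print(parquet_like["row_groups"][0]["name"][1])  # ['Frank', 'Grace', 'Hank', 'Ivy', 'Jack']
print(read_column(parquet_like, "seat"))         # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```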

References

https://aws.amazon.com/managed-workflows-for-apache-airflow/?did=ap_card&trk=ap_card

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase.html

https://aws.amazon.com/compare/the-difference-between-hadoop-vs-spark/?nc1=h_ls

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ranger-architecture.html

https://aws.amazon.com/blogs/big-data/migrate-from-apache-solr-to-opensearch/

https://medium.com/@diehardankush/why-parquet-vs-orc-an-in-depth-comparison-of-file-formats-5fc3b5fdac2e

https://lucidworks.com/post/introduction-to-apache-lucenesolr/

https://aws.amazon.com/what-is/presto/?nc1=h_ls
