@jialin.huang

© 2024 jialin00.com

Original content since 2022


Overview of Athena, Redshift, Kinesis Firehose, EMR, etc.

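Services compared by: SQL query, real-time processing, preprocessing, post-processing, more on analysis.

  1. Kinesis Data Streams
  1. Kinesis Firehose
  1. Kinesis Data Analytics
  1. Amazon MSK
  1. EMR: ✅ HiveQL, Spark SQL
  1. Glue: ✅ (Glue Data Catalog)
  1. Athena: for S3 analysis
  1. Redshift: query, analysis (OLAP)
  1. Lake Formation

Notes:

  1. Athena can query data stored in S3 in many common formats (CSV, JSON, Parquet, ORC, Avro), without loading it anywhere first.
  1. Redshift can ingest data in near real-time, but has limited real-time processing capabilities.
  1. Kinesis Firehose is primarily a delivery service: it buffers streaming data, optionally transforms it into the format the destination requires, and loads it into services such as S3 or Redshift. It is not focused on real-time processing.
  1. EMR is the most flexible tool among the ones mentioned, since it runs open-source frameworks such as Spark and Hive on managed clusters.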
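
To make the Athena point concrete, here is a minimal sketch of submitting a SQL query over S3 data with boto3's `start_query_execution`. The helper names, the database name, and the bucket in the usage note are assumptions for illustration, not something from a real account.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for Athena's StartQueryExecution call."""
    if not output_location.startswith("s3://"):
        # Athena writes query results to S3, so anything else is a mistake.
        raise ValueError("output_location must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_athena_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query and return its execution ID.

    Needs boto3 and AWS credentials; imported lazily so the builder
    above can be used and tested without AWS access.
    """
    import boto3

    client = boto3.client("athena")
    response = client.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

Usage would look like `run_athena_query("SELECT * FROM logs LIMIT 10", "logs_db", "s3://example-bucket/athena-results/")`, where `logs_db` and `example-bucket` are hypothetical names. The call is asynchronous: it returns an execution ID, and the results land in the S3 output location.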

Choosing

Real-time Monitoring Use Cases

  1. Needs complex processing -> EMR
  1. Financial transactions, cybersecurity -> MSK
  1. Simple use cases -> Kinesis Data Streams
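
For the simple case, writing an event to Kinesis Data Streams can be sketched roughly as below, using boto3's `put_record`. The stream name, the event shape, and the choice of JSON encoding are assumptions for illustration.

```python
import json
from typing import Any, Dict


def build_kinesis_record(stream_name: str, event: Dict[str, Any], key_field: str) -> Dict[str, Any]:
    """Build keyword arguments for the Kinesis Data Streams PutRecord call."""
    return {
        "StreamName": stream_name,
        # Data must be bytes; JSON is an assumption, any serialization works.
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records with the same partition key are routed to the same shard.
        "PartitionKey": str(event[key_field]),
    }


def send_event(stream_name: str, event: Dict[str, Any], key_field: str) -> None:
    """Send one event (needs boto3 and AWS credentials; imported lazily)."""
    import boto3

    boto3.client("kinesis").put_record(
        **build_kinesis_record(stream_name, event, key_field)
    )
```

Picking the partition key is the main design choice here: keying on something like a user ID keeps one user's events together on a shard, at the cost of uneven load if a few keys dominate.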

Analytics Use Cases

Smaller scale:

  1. Redshift: departmental historical data, business logic analysis

    Like an information center.

  1. Athena: flexible analysis of data in S3
    1. Just grab it whenever you feel like checking something.

Larger scale:

  1. MSK: high throughput, low latency, suitable for transactional workloads
  1. EMR: complex analysis, machine learning, anomaly detection
  1. Kinesis Data Analytics: real-time monitoring dashboards for website traffic, purchasing behavior

    Like when an e-commerce site is about to launch a new product, and they need to see real-time customer behavior immediately.

Others:

  1. Glue: automated ETL, prepares data for use by other tools.
  1. Lake Formation: data governance, sharing and access control

    Like IAM, but for permissions on the data in the lake (databases, tables, columns).
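
The rules of thumb in this section can be written down as a small lookup table. This is only a sketch of the heuristics above: the requirement phrases are my own wording, and real decisions depend on cost, team skills, and existing infrastructure.

```python
# Maps a requirement phrase (lowercase) to the service suggested above.
# The phrases are illustrative labels, not AWS terminology.
RECOMMENDATIONS = {
    "complex real-time processing": "EMR",
    "financial transactions or cybersecurity": "MSK",
    "simple real-time monitoring": "Kinesis Data Streams",
    "departmental historical analysis": "Redshift",
    "ad hoc queries on s3": "Athena",
    "real-time dashboards": "Kinesis Data Analytics",
    "automated etl": "Glue",
    "data governance and access control": "Lake Formation",
}


def recommend(requirement: str) -> str:
    """Return the suggested service for a requirement phrase."""
    key = requirement.strip().lower()
    try:
        return RECOMMENDATIONS[key]
    except KeyError:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"no rule for {requirement!r}; known: {known}") from None
```

For example, `recommend("Automated ETL")` returns `"Glue"`, and an unknown phrase raises a `ValueError` listing the known ones.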

Pipeline Scenarios

Real-time Analytics

Kinesis Data Streams -> Kinesis Data Analytics -> S3 -> Athena

Batch ETL

S3 (origin) -> Glue ETL -> Redshift -> Athena

Streaming ETL

MSK -> EMR(Spark Streaming) -> S3 -> Redshift

Data Lake

Kinesis Firehose -> S3 -> Lake Formation -> Athena

Hybrid Analytics

Real-time: Kinesis Data Streams -> Kinesis Data Analytics -> S3

Batch: S3 -> Glue ETL -> S3

Then analysis: Redshift (integrates the real-time and batch outputs) -> Athena (query)
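
As a quick way to compare the scenarios, the linear pipelines above can be kept as plain data and searched. This is a sketch only: the hybrid scenario is left out because it branches, and the function names are made up for illustration.

```python
from typing import Dict, List

# The four linear scenarios, each as an ordered list of stages.
PIPELINES: Dict[str, List[str]] = {
    "Real-time Analytics": ["Kinesis Data Streams", "Kinesis Data Analytics", "S3", "Athena"],
    "Batch ETL": ["S3", "Glue ETL", "Redshift", "Athena"],
    "Streaming ETL": ["MSK", "EMR (Spark Streaming)", "S3", "Redshift"],
    "Data Lake": ["Kinesis Firehose", "S3", "Lake Formation", "Athena"],
}


def scenarios_using(service: str) -> List[str]:
    """Return the scenarios with a stage that mentions the service (case-insensitive)."""
    needle = service.lower()
    return [
        name
        for name, stages in PIPELINES.items()
        if any(needle in stage.lower() for stage in stages)
    ]


def describe(name: str) -> str:
    """Render a scenario in the arrow notation used above."""
    return " -> ".join(PIPELINES[name])
```

Calling `scenarios_using("S3")` returns all four scenarios, which makes the pattern visible: S3 is the common storage layer in every pipeline here.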
