Overview of Athena, Redshift, Kinesis Firehose, EMR, etc.

Service | SQL Query | Real-time Processing | Preprocessing | Post-processing | More on analysis |
Kinesis Data Streams | ✅ | ||||
Kinesis Firehose | ✅ | ✅ | |||
Kinesis Data Analytics | ✅ | ✅ | |||
Amazon MSK | ✅ | ||||
EMR | ✅ HiveQL, Spark SQL | ✅ | ✅ | ✅ | ✅ |
Glue | ✅ | ✅ | ✅ | ✅ (Glue Data Catalog) | |
Athena | ✅ | for s3 analysis | ✅ | ✅ | |
Redshift | ✅ | query, analysis OLAP | ✅ | ✅ | |
Lake Formation | ✅ | ✅ | ✅ | ✅ |
- Athena can query data stored in S3 in common formats such as CSV, JSON, Parquet, ORC, and Avro.
- Redshift can ingest data in near real-time, but has limited real-time processing capabilities.
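- Kinesis Firehose is primarily a delivery service: it buffers streaming data and loads it into destinations such as S3 and Redshift, optionally transforming it into the format those services require. Because of the buffering, it is near real-time rather than real-time.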
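- EMR is the most flexible of the services listed here: it runs open-source frameworks such as Spark, Hive, and Flink, so it covers SQL, streaming, ETL, and machine learning workloads.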
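The Firehose transformation step mentioned above is usually done by a Lambda function that receives a batch of base64-encoded records and returns them with a status. Below is a minimal sketch of such a handler. The field names `user_id` and `action` are hypothetical and only illustrate trimming a record down to what a destination might need.

```python
import base64
import json


def lambda_handler(event, context):
    """Reshape each Firehose record and return it in the expected response format."""
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers each payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical reshaping: keep only two fields (assumed to exist).
            slim = {"user_id": payload["user_id"], "action": payload["action"]}
            # Newline-delimited JSON is convenient for later querying from S3.
            encoded = base64.b64encode((json.dumps(slim) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Unparseable or incomplete records are returned unchanged and marked failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` must appear in the response, which is why failed records are returned rather than skipped.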
Choosing a Service
Real-time Monitoring Use Cases
- Needs complex processing -> EMR
- Financial transactions, cybersecurity -> MSK
- Simple use cases -> Kinesis Data Streams
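For the simple Kinesis Data Streams case, a producer only needs to serialize an event and pick a partition key. The sketch below separates building the request from sending it so the first part can be checked without AWS access. The stream name `clickstream-events` and the `user_id` key field are assumptions made for illustration, and `boto3` is assumed to be installed and configured when `send_event` is called.

```python
import json


def build_put_record_args(stream_name, event, key_field="user_id"):
    """Build keyword arguments for the Kinesis put_record call."""
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        # Records with the same partition key land on the same shard.
        "PartitionKey": str(event[key_field]),
    }


def send_event(event, stream_name="clickstream-events"):
    """Send one event to the stream (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("kinesis")
    return client.put_record(**build_put_record_args(stream_name, event))
```

Keying by user keeps each user's events in order within a shard, at the cost of uneven load if a few users dominate the traffic.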
Analytics Use Cases
Smaller scale:
- Redshift: departmental historical data and business logic analysis
like an information center
- Athena: flexible analysis of data in S3
- Good for ad hoc queries: run one whenever you want to check something.
Larger scale:
- MSK: high throughput, low latency, suitable for transactional workloads
- EMR: complex analysis, machine learning, anomaly detection
- Kinesis Data Analytics: real-time monitoring dashboards for website traffic and purchasing behavior
For example, an e-commerce site that is about to launch a new product and needs to see customer behavior in real time.
Others:
- Glue: automated ETL, prepare data for use by other tools.
- Lake Formation: data governance, sharing, and access control
similar to IAM, but the permissions apply to data lake resources such as databases, tables, and columns.
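To make the Lake Formation access-control point concrete, here is a rough sketch of a column-level `SELECT` grant using the `grant_permissions` call. The principal ARN, database, table, and column names in the usage line are placeholders, and `boto3` is assumed to be available when `grant_select` is called.

```python
def build_grant_args(principal_arn, database, table, columns):
    """Build keyword arguments for a column-level SELECT grant in Lake Formation."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_select(principal_arn, database, table, columns):
    """Apply the grant (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("lakeformation")
    return client.grant_permissions(
        **build_grant_args(principal_arn, database, table, columns)
    )
```

A call such as `grant_select("arn:aws:iam::123456789012:role/analyst", "sales", "orders", ["order_id", "total"])` would let that hypothetical role read only the two listed columns.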
Pipeline Scenarios
Real-time Analytics
Kinesis Data Streams -> Kinesis Data Analytics -> S3 -> Athena
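The last hop of this pipeline, querying the S3 output with Athena, could look roughly like the sketch below. The `action` column, and the database, table, and results bucket passed in by the caller, are all hypothetical. `boto3` is assumed to be available when `run_query` is called.

```python
def build_athena_query_args(database, table, output_location):
    """Build keyword arguments for Athena's start_query_execution call."""
    # Hypothetical aggregation: count events per action type.
    query = (
        f'SELECT action, COUNT(*) AS events FROM "{table}" '
        "GROUP BY action ORDER BY events DESC"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(database, table, output_location):
    """Start the query and return its execution id (requires boto3)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("athena")
    response = client.start_query_execution(
        **build_athena_query_args(database, table, output_location)
    )
    return response["QueryExecutionId"]
```

The call is asynchronous: the returned id has to be polled with `get_query_execution` before results can be read with `get_query_results`.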
Batch ETL
S3 (origin) -> Glue ETL -> Redshift -> Athena
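The Glue ETL step in a batch pipeline like this is typically triggered as a job run. A minimal sketch, assuming a Glue job already exists: the job name and the `--source_path` and `--target_table` arguments are hypothetical and would have to match whatever the job script actually reads.

```python
def build_glue_job_run_args(job_name, source_path, target_table):
    """Build keyword arguments for Glue's start_job_run call."""
    return {
        "JobName": job_name,
        # Job arguments are passed as strings; keys conventionally start with "--".
        "Arguments": {
            "--source_path": source_path,
            "--target_table": target_table,
        },
    }


def start_batch_etl(job_name, source_path, target_table):
    """Start the job and return its run id (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("glue")
    response = client.start_job_run(
        **build_glue_job_run_args(job_name, source_path, target_table)
    )
    return response["JobRunId"]
```

Inside the job script, these arguments can be read with `getResolvedOptions` from `awsglue.utils`.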
Streaming ETL
MSK -> EMR (Spark Streaming) -> S3 -> Redshift
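The final S3 -> Redshift hop is commonly done with a `COPY` statement. The sketch below assumes the Spark job wrote Parquet files and that an IAM role with read access to the bucket exists. The table name, S3 path, role ARN, and cluster details are placeholders, and the inputs are assumed to be trusted because no escaping is done.

```python
def build_copy_statement(table, s3_path, iam_role_arn):
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


def load_into_redshift(cluster_id, database, db_user, table, s3_path, iam_role_arn):
    """Run the COPY through the Redshift Data API (requires boto3)."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("redshift-data")
    return client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=build_copy_statement(table, s3_path, iam_role_arn),
    )
```

The same statement can also be run from any SQL client connected to the cluster; the Data API is just one option that avoids managing a database connection.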
Data Lake
Kinesis Firehose -> S3 -> Lake Formation -> Athena
Hybrid Analytics
Real-time: Kinesis Data Streams -> Kinesis Data Analytics -> S3
Batch: S3 -> Glue ETL -> S3
Analysis: Redshift (combines the real-time and batch outputs) -> Athena (query)