n" src="https://www.notion.so/icons/document_red.svg"/></div><h1 class="page-title">Apache Projects and AWS Services</h1><p class="page-description"></p><table class="properties"><tbody><tr class="property-row property-row-created_by"><th><span class="icon property-icon"><svg role="graphics-symbol" viewBox="0 0 16 16" style="width:14px;height:14px;display:block;fill:rgba(55, 53, 47, 0.45);flex-shrink:0" class="typesCreatedBy"><path d="M8 15.126C11.8623 15.126 15.0615 11.9336 15.0615 8.06445C15.0615 4.20215 11.8623 1.00293 7.99316 1.00293C4.13086 1.00293 0.938477 4.20215 0.938477 8.06445C0.938477 11.9336 4.1377 15.126 8 15.126ZM8 10.4229C6.05176 10.4229 4.54785 11.1133 3.83008 11.9131C2.90039 10.9082 2.33301 9.55469 2.33301 8.06445C2.33301 4.91992 4.84863 2.39746 7.99316 2.39746C11.1377 2.39746 13.6738 4.91992 13.6738 8.06445C13.6738 9.55469 13.1064 10.9082 12.1699 11.9131C11.4521 11.1133 9.94824 10.4229 8 10.4229ZM8 9.30176C9.32617 9.30859 10.3516 8.18066 10.3516 6.71094C10.3516 5.33008 9.31934 4.18164 8 4.18164C6.6875 4.18164 5.6416 5.33008 5.64844 6.71094C5.65527 8.18066 6.68066 9.28809 8 9.30176Z"></path></svg></span>Created by</th><td><span class="user"><img src="Apache%20Projects%20and%20AWS%20Services%201256cd51990d8068a409f37c4e37f0f5/IMG_2295.jpg" class="icon user-icon"/>JiaLin Huang</span></td></tr><tr class="property-row property-row-last_edited_time"><th><span class="icon property-icon"><svg role="graphics-symbol" viewBox="0 0 16 16" style="width:14px;height:14px;display:block;fill:rgba(55, 53, 47, 0.45);flex-shrink:0" class="typesCreatedAt"><path d="M8 15.126C11.8623 15.126 15.0615 11.9336 15.0615 8.06445C15.0615 4.20215 11.8623 1.00293 7.99316 1.00293C4.13086 1.00293 0.938477 4.20215 0.938477 8.06445C0.938477 11.9336 4.1377 15.126 8 15.126ZM8 13.7383C4.85547 13.7383 2.33301 11.209 2.33301 8.06445C2.33301 4.91992 4.84863 2.39746 7.99316 2.39746C11.1377 2.39746 13.6738 4.91992 13.6738 8.06445C13.6738 11.209 11.1445 13.7383 8 13.7383ZM4.54102 
8.91211H7.99316C8.30078 8.91211 8.54004 8.67285 8.54004 8.37207V3.8877C8.54004 3.58691 8.30078 3.34766 7.99316 3.34766C7.69238 3.34766 7.45312 3.58691 7.45312 3.8877V7.83203H4.54102C4.2334 7.83203 4.00098 8.06445 4.00098 8.37207C4.00098 8.67285 4.2334 8.91211 4.54102 8.91211Z"></path></svg></span>Last edited</th><td><time>@2024年10月30日 16:26</time></td></tr><tr class="property-row property-row-multi_select"><th><span class="icon property-icon"><svg role="graphics-symbol" viewBox="0 0 16 16" style="width:14px;height:14px;display:block;fill:rgba(55, 53, 47, 0.45);flex-shrink:0" class="typesMultipleSelect"><path d="M1.91602 4.83789C2.44238 4.83789 2.87305 4.40723 2.87305 3.87402C2.87305 3.34766 2.44238 2.91699 1.91602 2.91699C1.38281 2.91699 0.952148 3.34766 0.952148 3.87402C0.952148 4.40723 1.38281 4.83789 1.91602 4.83789ZM5.1084 4.52344H14.3984C14.7607 4.52344 15.0479 4.23633 15.0479 3.87402C15.0479 3.51172 14.7607 3.22461 14.3984 3.22461H5.1084C4.74609 3.22461 4.45898 3.51172 4.45898 3.87402C4.45898 4.23633 4.74609 4.52344 5.1084 4.52344ZM1.91602 9.03516C2.44238 9.03516 2.87305 8.60449 2.87305 8.07129C2.87305 7.54492 2.44238 7.11426 1.91602 7.11426C1.38281 7.11426 0.952148 7.54492 0.952148 8.07129C0.952148 8.60449 1.38281 9.03516 1.91602 9.03516ZM5.1084 8.7207H14.3984C14.7607 8.7207 15.0479 8.43359 15.0479 8.07129C15.0479 7.70898 14.7607 7.42188 14.3984 7.42188H5.1084C4.74609 7.42188 4.45898 7.70898 4.45898 8.07129C4.45898 8.43359 4.74609 8.7207 5.1084 8.7207ZM1.91602 13.2324C2.44238 13.2324 2.87305 12.8018 2.87305 12.2686C2.87305 11.7422 2.44238 11.3115 1.91602 11.3115C1.38281 11.3115 0.952148 11.7422 0.952148 12.2686C0.952148 12.8018 1.38281 13.2324 1.91602 13.2324ZM5.1084 12.918H14.3984C14.7607 12.918 15.0479 12.6309 15.0479 12.2686C15.0479 11.9062 14.7607 11.6191 14.3984 11.6191H5.1084C4.74609 11.6191 4.45898 11.9062 4.45898 12.2686C4.45898 12.6309 4.74609 12.918 5.1084 12.918Z"></path></svg></span>Tags</th><td><span class="selected-value 
select-value-color-purple">Post</span><span class="selected-value select-value-color-red">aws</span><span class="selected-value select-value-color-blue">emr</span></td></tr></tbody></table></header><div class="page-body"><p class="">Since I frequently see ‘xxx service for Apache ooo,’ I made this table out of boredom.</p><h1 class="">Apache Project</h1><p class=""><a href="https://projects.apache.org/projects.html">https://projects.apache.org/projects.html</a></p><table class="simple-table"><tbody><tr><td class="" style="width:283px">feature</td><td class="">Apache</td><td class="">AWS</td></tr><tr><td class="block-color-orange_background" style="width:283px">Running on EMR</td><td class="block-color-orange_background"></td><td class="block-color-orange_background"></td></tr><tr><td class="" style="width:283px">Distributed Computing Framework for Batching</td><td class="">Hadoop (Parquet)</td><td class="">official</td></tr><tr><td class="" style="width:283px">Fast Distributed Computing Engine for Caching</td><td class="">Spark (Parquet)</td><td class="">official</td></tr><tr><td class="" style="width:283px">Distributed NoSQL Database (read-focused)</td><td class="">HBase</td><td class="">official</td></tr><tr><td class="" style="width:283px">Data Warehouse SQL Query Tool</td><td class="">Hive (ORC)</td><td class="">official</td></tr><tr><td class="" style="width:283px">Big Data Processing Language</td><td class="">Pig</td><td class=""></td></tr><tr><td class="" style="width:283px">Incremental Processing Framework</td><td class="">Hudi</td><td class="">official</td></tr><tr><td class="" style="width:283px">Stream Processing Framework</td><td class="">Flink</td><td class=""></td></tr><tr><td class="" style="width:283px">Distributed Coordination Service</td><td class="">ZooKeeper</td><td class="">Built-in ZooKeeper management in EMR</td></tr><tr><td class="" style="width:283px">Data Transfer between RDBMS and Hadoop</td><td class="">Sqoop</td><td 
class=""></td></tr><tr><td class="" style="width:283px">Deep Learning Framework</td><td class="">MXNet</td><td class="">SageMaker</td></tr><tr><td class="" style="width:283px">Centralized Permission Management Platform</td><td class="">Ranger</td><td class="">EMR</td></tr><tr><td class="" style="width:283px">…, etc.</td><td class=""></td><td class=""></td></tr><tr><td class="block-color-orange_background" style="width:283px">AWS Managed</td><td class="block-color-orange_background">like EKS for Kubernetes</td><td class="block-color-orange_background">AWS handles the low-level operations</td></tr><tr><td class="" style="width:283px">Distributed Streaming Platform &amp; Message Queue</td><td class="">Kafka</td><td class="">MSK (Managed Streaming for Apache Kafka)</td></tr><tr><td class="" style="width:283px">Workflow Orchestration</td><td class="">Airflow</td><td class="">MWAA (Managed Workflows for Apache Airflow)</td></tr><tr><td class="" style="width:283px">NoSQL DB (write-focused)</td><td class="">Cassandra</td><td class="">Keyspaces</td></tr><tr><td class="" style="width:283px">Stream Processing Framework</td><td class="">Flink</td><td class="">Kinesis Data Analytics (Flink)</td></tr><tr><td class="" style="width:283px">Deep Learning Framework</td><td class="">MXNet</td><td class="">SageMaker MXNet</td></tr><tr><td class="block-color-orange_background" style="width:283px">AWS Alternative Solutions/Replacements</td><td class="block-color-orange_background"></td><td class="block-color-orange_background"></td></tr><tr><td class="" style="width:283px">Fast Distributed Computing Engine</td><td class="">Spark</td><td class="">Glue</td></tr><tr><td class="" style="width:283px">Distributed Streaming Platform &amp; Message Queue</td><td class="">Kafka</td><td class="">Kinesis Data Streams</td></tr><tr><td class="" style="width:283px">Workflow Orchestration</td><td class="">Airflow</td><td class="">Step Functions</td></tr><tr><td class="" style="width:283px">NoSQL DB</td><td class="">Cassandra</td><td 
class="">DynamoDB</td></tr><tr><td class="" style="width:283px">Distributed SQL Query Engine</td><td class="">Presto</td><td class="">Athena</td></tr><tr><td class="" style="width:283px">Distributed Storage and Processing</td><td class="">Hadoop HDFS</td><td class="">S3</td></tr><tr><td class="" style="width:283px">Distributed Coordination Service</td><td class="">ZooKeeper</td><td class="">Cloud Map</td></tr><tr><td class="" style="width:283px">Full-text Search Engine/Lucene-based Search Platform</td><td class="">Lucene/Solr</td><td class="">OpenSearch</td></tr><tr><td class="" style="width:283px">Real-time Data Stream Processing System</td><td class="">Storm</td><td class="">Kinesis Data Analytics, Lambda</td></tr><tr><td class="" style="width:283px">Data Transfer between RDBMS and Hadoop</td><td class="">Sqoop</td><td class="">DMS (Database Migration Service)</td></tr><tr><td class="" style="width:283px">Centralized Permission Management Platform</td><td class="">Ranger</td><td class="">AWS Lake Formation</td></tr></tbody></table><ol type="1" class="numbered-list" start="1"><li>SageMaker supports other frameworks like TensorFlow and Keras…</li></ol><ol type="1" class="numbered-list" start="2"><li>When in doubt, EMR is usually the answer 😅</li></ol><ol type="1" class="numbered-list" start="3"><li><mark class="highlight-red"><strong>Amazon EMR started with Apache Hadoop but now supports many more</strong></mark><ol type="a" class="numbered-list" start="1"><li>TensorFlow (Google)</li></ol><ol type="a" class="numbered-list" start="2"><li>PrestoDB, Distributed SQL Query Engine<ol type="i" class="numbered-list" start="1"><li>From Facebook, then Presto was open-sourced under the Apache Software License</li></ol></li></ol><ol type="a" class="numbered-list" start="3"><li>Many other Apache projects (too many to list under EMR!) 
Check them out here<p class=""><a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html">https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html</a></p></li></ol><p class="">
</p></li></ol><ol type="1" class="numbered-list" start="4"><li>EMR mostly comes down to Spark vs Hadoop (MapReduce):<ol type="a" class="numbered-list" start="1"><li>Spark: in-memory and near-real-time, but memory-hungry and more expensive</li></ol><ol type="a" class="numbered-list" start="2"><li>Hadoop: batch-oriented, disk-based, and budget-friendly</li></ol></li></ol><p class="">
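</p><p class="">The table above can also be read as a small lookup. The sketch below is only an illustration: the pairings are copied from the table, while the dictionary names and the <code>aws_options</code> helper are hypothetical names made up for this example.</p>

```python
# Pairings copied from the table above. The names AWS_MANAGED, AWS_ALTERNATIVE,
# RUNS_ON_EMR and aws_options are hypothetical, chosen only for this sketch.
AWS_MANAGED = {
    "Kafka": "MSK",
    "Airflow": "MWAA",
    "Cassandra": "Keyspaces",
    "Flink": "Kinesis Data Analytics (Flink)",
    "MXNet": "SageMaker MXNet",
}

AWS_ALTERNATIVE = {
    "Spark": "Glue",
    "Kafka": "Kinesis Data Streams",
    "Airflow": "Step Functions",
    "Cassandra": "DynamoDB",
    "Presto": "Athena",
    "Hadoop HDFS": "S3",
    "ZooKeeper": "Cloud Map",
    "Lucene/Solr": "OpenSearch",
    "Storm": "Kinesis Data Analytics, Lambda",
    "Sqoop": "DMS",
    "Ranger": "AWS Lake Formation",
}

# Rows listed under the "Running on EMR" heading of the table.
RUNS_ON_EMR = {
    "Hadoop", "Spark", "HBase", "Hive", "Pig", "Hudi",
    "Flink", "ZooKeeper", "Sqoop", "MXNet", "Ranger",
}


def aws_options(project):
    """Return what the table lists for one Apache project."""
    return {
        "runs_on_emr": project in RUNS_ON_EMR,
        "managed": AWS_MANAGED.get(project),          # None if no managed row
        "alternative": AWS_ALTERNATIVE.get(project),  # None if no replacement row
    }


if __name__ == "__main__":
    for name in ("Kafka", "Spark", "Presto"):
        print(name, aws_options(name))
```

<p class="">Kafka is a handy example because it appears twice: MSK keeps the Kafka API, while Kinesis Data Streams replaces it.</p><p class="">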
</p><p class="">
</p><p class="">
</p><p class="">
</p><h1 class="">Scenarios</h1><ul class="bulleted-list"><li style="list-style-type:disc">Note: <strong>Spark Streaming</strong> is just one component of <strong>Spark</strong></li></ul><ul class="bulleted-list"><li style="list-style-type:disc">Note: The <strong>Elastic Stack</strong> is not an <strong>Apache</strong> project, but it appears below because it covers search and visualization. AWS first offered a managed Elasticsearch service and later forked Elasticsearch and Kibana from the <strong>Elastic Stack</strong> into OpenSearch and OpenSearch Dashboards.<ul class="bulleted-list"><li style="list-style-type:circle">Elasticsearch and Apache Solr are both built on Apache Lucene, which provides the underlying search capabilities<blockquote class=""><strong>Lucene is the underlying search library, and Solr is a platform built on top of Lucene</strong> that makes it easy to build Lucene-based applications.</blockquote></li></ul></li></ul><p class="">
</p><h3 class="">Simple</h3><ol type="1" class="numbered-list" start="1"><li>Pipeline: Kafka (collection) -&gt; Spark (transformation/cleaning) -&gt; Hadoop HDFS (storage) -&gt; Hive (query)</li></ol><ol type="1" class="numbered-list" start="2"><li>Stream processing: Kafka -&gt; Flink (transformation) -&gt; S3</li></ol><h3 class="">A Little Complicated</h3><p class=""><mark class="highlight-red">a</mark> for the Apache solution, <mark class="highlight-red">b</mark> for the AWS solution</p><ol type="1" class="numbered-list" start="1"><li>Stream processing and storage:<ol type="a" class="numbered-list" start="1"><li>Kafka (collect) -&gt; Spark Streaming (process) -&gt; Hive (save) -&gt; Presto (query)</li></ol><ol type="a" class="numbered-list" start="2"><li>Kinesis Data Streams -&gt; Kinesis Data Analytics -&gt; S3 -&gt; Athena</li></ol></li></ol><ol type="1" class="numbered-list" start="2"><li>Batch processing: <ol type="a" class="numbered-list" start="1"><li>Sqoop (import from SQL) -&gt; HDFS (save) -&gt; Spark (transform) -&gt; HBase (save) -&gt; Hive (analysis)</li></ol><ol type="a" class="numbered-list" start="2"><li>DMS -&gt; S3 -&gt; EMR (Spark) -&gt; DynamoDB -&gt; Athena</li></ol></li></ol><ol type="1" class="numbered-list" start="3"><li>Real-time analysis: <ol type="a" class="numbered-list" start="1"><li>Flume (collect logs) -&gt; Kafka (buffer) -&gt; Storm (real-time processing) -&gt; Cassandra (save) -&gt; Spark (fast analysis)</li></ol><ol type="a" class="numbered-list" start="2"><li>Kinesis Firehose or CloudWatch -&gt; Kinesis Data Streams -&gt; Kinesis Data Analytics -&gt; Keyspaces -&gt; EMR (Spark)</li></ol></li></ol><ol type="1" class="numbered-list" start="4"><li>ML: <ol type="a" class="numbered-list" start="1"><li>Kafka (collect) -&gt; Spark Streaming -&gt; HDFS (store) -&gt; Spark MLlib (train) -&gt; HBase (store)</li></ol><ol type="a" class="numbered-list" start="2"><li>MSK -&gt; EMR (Spark) -&gt; S3 -&gt; SageMaker -&gt; DynamoDB</li></ol></li></ol><ol type="1" 
class="numbered-list" start="5"><li>Log analysis:<ol type="a" class="numbered-list" start="1"><li>Flume or <span style="border-bottom:0.05em solid">Logstash (Elastic)</span> -&gt; Kafka -&gt; Flink -&gt; Solr or <span style="border-bottom:0.05em solid">Elasticsearch (Elastic)</span> -&gt; Superset or <span style="border-bottom:0.05em solid">Kibana (Elastic)</span></li></ol><ol type="a" class="numbered-list" start="2"><li>CloudWatch Logs -&gt; Kinesis Firehose -&gt; Kinesis Data Analytics -&gt; OpenSearch -&gt; OpenSearch Dashboards</li></ol></li></ol><p class="">
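</p><p class="">Because each <mark class="highlight-red">a</mark>/<mark class="highlight-red">b</mark> pair above has the same number of stages, the stages can be lined up position by position. Here is a rough sketch for three of the scenarios; <code>SCENARIOS</code> and <code>stage_pairs</code> are hypothetical names, and only the stage lists come from the scenarios above.</p>

```python
# Each entry holds (Apache pipeline, AWS pipeline) as listed in the scenarios.
# SCENARIOS and stage_pairs are hypothetical names used only in this sketch.
SCENARIOS = {
    "stream processing and storage": (
        ["Kafka", "Spark Streaming", "Hive", "Presto"],
        ["Kinesis Data Streams", "Kinesis Data Analytics", "S3", "Athena"],
    ),
    "batch": (
        ["Sqoop", "HDFS", "Spark", "HBase", "Hive"],
        ["DMS", "S3", "EMR (Spark)", "DynamoDB", "Athena"],
    ),
    "ML": (
        ["Kafka", "Spark Streaming", "HDFS", "Spark MLlib", "HBase"],
        ["MSK", "EMR (Spark)", "S3", "SageMaker", "DynamoDB"],
    ),
}


def stage_pairs(scenario):
    """Pair each Apache stage with the AWS stage in the same position."""
    apache, aws = SCENARIOS[scenario]
    if len(apache) != len(aws):
        raise ValueError("pipelines differ in length, cannot pair stages")
    return list(zip(apache, aws))


if __name__ == "__main__":
    for apache_stage, aws_stage in stage_pairs("batch"):
        print(f"{apache_stage} maps to {aws_stage}")
```

<p class="">The pairing is positional only, so it shows which service plays the same role in a pipeline, not that the two are drop-in replacements.</p><p class="">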
</p><p class="">
</p><p class="">
</p><h1 class="">Side Note: Parquet vs ORC format</h1><h3 class="">ORC</h3><p class="">For heavy write operations, frequent column calculations, and &quot;overall statistical&quot; computations</p><ul class="bulleted-list"><li style="list-style-type:disc">Good if you&#x27;re just using Hadoop</li></ul><ul class="bulleted-list"><li style="list-style-type:disc">Perfect for batch processing</li></ul><script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre class="code"><code class="language-Bash"># ORC
File
├── Stripes
│   ├── Index Data
│   │   ├── Column 1 index (Student Name)
│   │   ├── Column 2 index (Gender)
│   │   └── Column 3 index (Seat Number)
│   │
│   ├── Row Data
│   │   ├── Column 1: Alice | Bob | Charlie | David | Eve | Frank | Grace | Hank | Ivy | Jack
│   │   ├── Column 2: F | M | M | M | F | M | F | M | F | M
│   │   └── Column 3: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
│   │
│   └── Stripe Footer
│
└── File Footer</code></pre><h3 class="">Parquet</h3><p class="">For read-heavy workloads</p><ul class="bulleted-list"><li style="list-style-type:disc">Suitable when Hadoop and Spark are used together</li></ul><ul class="bulleted-list"><li style="list-style-type:disc">Good when several tools need to read the same files</li></ul><pre class="code"><code class="language-JavaScript">File
├── Row Groups
│   ├── Column Chunks (Columns in chunks)
│   │   ├── Column 1 (Student Name)
│   │   │   ├── Page 1: Alice, Bob, Charlie, David, Eve
│   │   │   └── Page 2: Frank, Grace, Hank, Ivy, Jack
│   │   │
│   │   ├── Column 2 (Gender)
│   │   │   ├── Page 1: F, M, M, M, F
│   │   │   └── Page 2: M, F, M, F, M
│   │   │
│   │   └── Column 3 (Seat Number)
│   │       ├── Page 1: 1, 2, 3, 4, 5
│   │       └── Page 2: 6, 7, 8, 9, 10
│
└── File Metadata (Schema and compression information)</code></pre><p class="">
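</p><p class="">To make the Parquet diagram concrete, here is a toy sketch that pivots the ten student rows into one row group of column chunks, each split into pages of five values. It only mimics the layout; it is not the real Parquet encoding, and the names <code>to_row_group</code>, <code>read_column</code> and <code>PAGE_SIZE</code> are assumptions made for the example.</p>

```python
# Toy model of the Parquet layout in the diagram above (not the real format).
NAMES = ["Alice", "Bob", "Charlie", "David", "Eve",
         "Frank", "Grace", "Hank", "Ivy", "Jack"]
GENDERS = ["F", "M", "M", "M", "F", "M", "F", "M", "F", "M"]

# One tuple per student: (student name, gender, seat number).
ROWS = list(zip(NAMES, GENDERS, range(1, 11)))

COLUMNS = ("student_name", "gender", "seat_number")
PAGE_SIZE = 5  # values per page, chosen to match the diagram


def to_row_group(rows, page_size=PAGE_SIZE):
    """Pivot rows into column chunks, each split into fixed-size pages."""
    row_group = {}
    for index, column in enumerate(COLUMNS):
        values = [row[index] for row in rows]
        row_group[column] = [
            values[start:start + page_size]
            for start in range(0, len(values), page_size)
        ]
    return row_group


def read_column(row_group, column):
    """Read one column by touching only that column's pages."""
    return [value for page in row_group[column] for value in page]


row_group = to_row_group(ROWS)

if __name__ == "__main__":
    print(row_group["gender"])
    print(read_column(row_group, "seat_number"))
```

<p class="">Reading <code>seat_number</code> never touches the name or gender pages, which is the reason a columnar layout suits read-heavy queries that select only a few columns.</p><p class="">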
</p><p class="">
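</p><p class="">For comparison, the ORC stripe from the earlier diagram can be sketched the same way. ORC index data carries per-column statistics such as min and max, so some questions can be answered without scanning the row data at all. This is a toy model under that assumption; <code>build_stripe</code> and <code>may_contain</code> are hypothetical helpers, not part of any ORC library.</p>

```python
# Toy model of one ORC stripe: index data (per-column stats) plus row data.
STUDENT_COLUMNS = {
    "student_name": ["Alice", "Bob", "Charlie", "David", "Eve",
                     "Frank", "Grace", "Hank", "Ivy", "Jack"],
    "gender": ["F", "M", "M", "M", "F", "M", "F", "M", "F", "M"],
    "seat_number": list(range(1, 11)),
}


def build_stripe(columns):
    """Compute simple statistics per column and keep them next to the data."""
    index_data = {
        name: {"min": min(values), "max": max(values), "count": len(values)}
        for name, values in columns.items()
    }
    return {"index_data": index_data, "row_data": columns}


def may_contain(stripe, column, value):
    """Decide from the index alone whether the stripe could hold the value."""
    stats = stripe["index_data"][column]
    return stats["min"] <= value <= stats["max"]


stripe = build_stripe(STUDENT_COLUMNS)

if __name__ == "__main__":
    # The highest seat number comes straight from the index, no row scan.
    print(stripe["index_data"]["seat_number"]["max"])
    # Seat 42 is outside the min/max range, so this stripe could be skipped.
    print(may_contain(stripe, "seat_number", 42))
```

<p class="">A <code>True</code> from <code>may_contain</code> only means the stripe has to be read; the value may still be absent, since min and max describe a range rather than the exact contents.</p><p class="">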
</p><p class="">
</p><h1 class="">References</h1><p class=""><a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/?did=ap_card&amp;trk=ap_card">https://aws.amazon.com/managed-workflows-for-apache-airflow/?did=ap_card&amp;trk=ap_card</a></p><p class=""><a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase.html">https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase.html</a></p><p class=""><a href="https://aws.amazon.com/compare/the-difference-between-hadoop-vs-spark/?nc1=h_ls">https://aws.amazon.com/compare/the-difference-between-hadoop-vs-spark/?nc1=h_ls</a></p><p class=""><a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ranger-architecture.html">https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ranger-architecture.html</a></p><p class=""><a href="https://aws.amazon.com/blogs/big-data/migrate-from-apache-solr-to-opensearch/">https://aws.amazon.com/blogs/big-data/migrate-from-apache-solr-to-opensearch/</a></p><p class=""><a href="https://medium.com/@diehardankush/why-parquet-vs-orc-an-in-depth-comparison-of-file-formats-5fc3b5fdac2e">https://medium.com/@diehardankush/why-parquet-vs-orc-an-in-depth-comparison-of-file-formats-5fc3b5fdac2e</a></p><p class=""><a href="https://lucidworks.com/post/introduction-to-apache-lucenesolr/">https://lucidworks.com/post/introduction-to-apache-lucenesolr/</a></p><p class=""><a href="https://aws.amazon.com/what-is/presto/?nc1=h_ls">https://aws.amazon.com/what-is/presto/?nc1=h_ls</a></p></div></article><span class="sans" style="font-size:14px;padding-top:2em"></span></body>
© 2024 jialin00.com Original content since 2022