Apache Spark

Developer(s)	Apache Software Foundation, UC Berkeley AMPLab, Databricks
Stable release	3.5.4 (December 17, 2024)
Repository	github.com/apache/spark
Written in	Scala, Java, Python
Operating system	Linux, macOS, Microsoft Windows
Type	Data analytics, machine learning algorithms
License	Apache License 2.0
Website	spark.apache.org

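Apache Spark is an open-source cluster computing framework originally developed at the AMPLab of the University of California, Berkeley. Whereas Hadoop MapReduce writes intermediate data to disk after each step of a job, Spark uses in-memory computing and can analyze data in memory before it is ever written to disk. Spark can run programs up to 100 times faster than Hadoop MapReduce when working in memory, and up to 10 times faster even when running from disk.^[2] Spark lets users load data into a cluster's memory and query it repeatedly, which makes it well suited to machine learning algorithms.^[3]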

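Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone mode (a native Spark cluster), Hadoop YARN, and Apache Mesos.^[4] For distributed storage, Spark can interface with Alluxio, HDFS^[5], Cassandra^[6], OpenStack Swift, and Amazon S3. Spark also supports a pseudo-distributed local mode, usually used only for development or testing, in which the local file system takes the place of distributed storage. In that case Spark runs on a single machine with one executor per CPU core.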
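As a rough illustration of the local mode described above, the sketch below starts Spark inside a single JVM. It assumes the Spark core library is on the classpath; the application name and the file path are hypothetical placeholders, not values taken from the cited sources.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "local[*]" asks Spark to run in-process with one worker thread per CPU core.
// A cluster deployment would pass a different master URL instead, for example
// "yarn" or a "spark://host:port" address of a standalone master.
val conf = new SparkConf()
  .setAppName("local_mode_sketch") // hypothetical application name
  .setMaster("local[*]")
val sc = new SparkContext(conf)

// Read from the local file system instead of distributed storage.
// The path is a placeholder; textFile is lazy, so nothing is read yet.
val lines = sc.textFile("file:///path/to/local/file.txt")

println(s"master = ${sc.master}, default parallelism = ${sc.defaultParallelism}")

sc.stop()
```

Only the master URL and the storage path change when the same program is moved to a cluster, which is why local mode is convenient for development and testing.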

In 2014 more than 465 contributors worked on Spark,^[7] making it the most active project in the Apache Software Foundation and among open-source big data projects.

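History

Spark was started in 2009 by Matei Zaharia at UC Berkeley's AMPLab and open-sourced in 2010 under a BSD license. In 2013 the project was donated to the Apache Software Foundation and switched its license to Apache 2.0.^[8] In February 2014 Spark became a top-level Apache project. In November 2014 the Databricks team used Spark to set a new world record in large-scale data sorting.^[9]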

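Project components

The Spark project consists of the following components:

Spark Core and resilient distributed datasets (RDDs)

Spark Core is the foundation of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionality. Its fundamental programming abstraction is the resilient distributed dataset (RDD), a fault-tolerant collection of data that can be operated on in parallel. RDDs can be created by referencing datasets in external storage systems (for example a shared file system, HDFS, HBase, or any other data source offering a Hadoop input format), or by applying transformations (such as map, filter, reduceByKey, and join) to existing RDDs.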
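The two ways of obtaining an RDD can be sketched as follows. This is a minimal example that assumes Spark is on the classpath and runs in local mode; the application name and the sample numbers are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("rdd_sketch").setMaster("local[*]"))

// Create an RDD from an in-memory collection. Calling sc.textFile(...) would
// reference a dataset in external storage instead.
val numbers = sc.parallelize(1 to 10)

// Transformations are lazy: each one describes a new RDD derived from an
// existing one, and nothing is computed yet.
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Actions such as collect() and reduce() trigger the distributed computation.
val result = squares.collect().toList // List(4, 16, 36, 64, 100)
val total = squares.reduce(_ + _)     // 220
println(s"$result sums to $total")

sc.stop()
```

Because each RDD remembers the transformations that produced it, a lost partition can be recomputed from its parent, which is where the fault tolerance comes from.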

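The RDD abstraction is exposed through a language-integrated API in Scala, Java, and Python. This reduces programming complexity, because applications manipulate RDDs much as they would manipulate local collections.

A typical example of RDD-centric functional programming is the following Scala program, which computes the frequencies of all words occurring in a set of text files and prints the most common ones. Each map, flatMap (a variant of map), and reduceByKey takes an anonymous function that performs a simple operation on a single data item (or a pair of items), and applies its argument to transform an RDD into a new RDD.^[10]^[11]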

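import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("wiki_test")    // Spark configuration object
val sc = new SparkContext(conf)                       // entry point to the cluster
val data = sc.textFile("/path/to/somedir")            // RDD with one element per line of text
val tokens = data.flatMap(_.split(" "))               // split each line into words
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _)  // pair each word with 1, then sum per word
// Swap to (count, word) so that top(10) orders by count, then print the ten most frequent words.
wordFreq.map(_.swap).top(10).foreach(println)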

Spark SQL

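Spark SQL is a component on top of Spark Core that introduced a data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language for manipulating SchemaRDDs in Scala, Java, or Python. It also supports the SQL language through a command-line interface and an ODBC/JDBC server. In Spark 1.3, SchemaRDD was renamed DataFrame.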
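The two query styles mentioned above, the domain-specific language and plain SQL, can be compared in a short sketch. It assumes the spark-sql module is on the classpath and uses the SparkSession entry point, which belongs to Spark 2.0 and later rather than the 1.x releases discussed in the text. The table name, column names, and rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sql_sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // enables toDF and the $"column" syntax

// Hypothetical sample data: (name, age) pairs turned into a DataFrame.
val people = Seq(("Ann", 34), ("Bob", 19), ("Cid", 52)).toDF("name", "age")

// Query written with the DataFrame domain-specific language.
val adultsDsl = people.filter($"age" >= 21).select("name")

// The same query written in SQL against a temporary view.
people.createOrReplaceTempView("people")
val adultsSql = spark.sql("SELECT name FROM people WHERE age >= 21")

adultsDsl.show()
val names = adultsSql.collect().map(_.getString(0)).sorted.toList
println(names)

spark.stop()
```

Both forms describe the same filter and projection over the same schema, so they are expected to return the same rows.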

Spark Streaming

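Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in small batches and performs RDD transformations on each batch. This design lets application code written for batch analytics be reused for streaming analytics on the same engine.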
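The micro-batch model can be sketched with the DStream API. In this example, which assumes the spark-streaming module is on the classpath, a queue of RDDs stands in for a live source, and the one-second batch interval, the three-second timeout, and the sample sentences are arbitrary choices made for illustration.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming_sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1)) // one micro-batch per second

// A queue of RDDs replaces a real source such as a socket; each queued RDD
// becomes the content of one micro-batch.
val batches = mutable.Queue[RDD[String]]()
batches += ssc.sparkContext.parallelize(Seq("spark streaming", "spark core"))
val lines = ssc.queueStream(batches)

// The same flatMap / map / reduceByKey steps as in a batch word count.
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// foreachRDD runs on the driver once per batch; collect the results there.
val seen = mutable.Map[String, Int]()
counts.foreachRDD { rdd =>
  val batch = rdd.collect()
  seen.synchronized { seen ++= batch }
  batch.foreach(println)
}

ssc.start()
ssc.awaitTerminationOrTimeout(3000) // let a few batches run, then shut down
ssc.stop(stopSparkContext = true, stopGracefully = false)
```

The transformation chain is identical to the batch version; only the source and the per-batch output step differ.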

MLlib

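MLlib is a distributed machine learning framework on top of Spark. Thanks to Spark's distributed in-memory architecture, it is about 10 times faster than the disk-based Apache Mahout on Hadoop and scales even better than Vowpal Wabbit.^[12] MLlib implements many common machine learning and statistical algorithms, which simplifies large-scale machine learning. They include:

Summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
Classification and regression: support vector machines, linear regression, logistic regression, decision trees, naive Bayes
Collaborative filtering: alternating least squares (ALS)
Clustering: k-means
Dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)
Feature extraction and transformation: TF-IDF, Word2Vec, StandardScaler
Optimization: stochastic gradient descent (SGD), L-BFGS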
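As one instance of the algorithms listed above, the following sketch clusters a handful of points with k-means. It assumes the spark-mllib module is on the classpath and uses the RDD-based `org.apache.spark.mllib` package; the toy points, the choice of k = 2, and the iteration limit are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(
  new SparkConf().setAppName("kmeans_sketch").setMaster("local[*]"))

// Hypothetical toy data: two well-separated groups of 2-D points.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2), Vectors.dense(9.2, 9.1)
)).cache() // k-means is iterative, so keep the input in memory

// Train with k = 2 clusters and at most 20 iterations.
val model = KMeans.train(points, 2, 20)

model.clusterCenters.foreach(println)
val nearOrigin = model.predict(Vectors.dense(0.0, 0.1))
val farAway = model.predict(Vectors.dense(9.0, 9.1))
println(s"cluster ids: $nearOrigin and $farAway")

sc.stop()
```

The cluster ids themselves are arbitrary; what matters is that points from the two groups receive different ids.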

GraphX

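GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computations that can model the Pregel abstraction, and it provides an optimized runtime for this abstraction.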
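A Pregel-style computation can be sketched with the `pregel` operator. The example below computes single-source shortest paths on a tiny weighted graph; it assumes the spark-graphx module is on the classpath, and the vertices, edge weights, and source vertex are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

val sc = new SparkContext(
  new SparkConf().setAppName("graphx_sketch").setMaster("local[*]"))

// Hypothetical weighted graph: 1 -> 2 (1.0), 2 -> 3 (2.0), 1 -> 3 (5.0).
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0)))
val graph = Graph.fromEdges(edges, defaultValue = 0.0)

// Distance 0 at the source vertex, infinity everywhere else.
val sourceId: VertexId = 1L
val initial = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val shortest = initial.pregel(Double.PositiveInfinity)(
  // Vertex program: keep the smaller of the current and the received distance.
  (id, dist, newDist) => math.min(dist, newDist),
  // Send a message along an edge only if it would shorten the destination's distance.
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  // Merge messages arriving at the same vertex by taking the minimum.
  (a, b) => math.min(a, b)
)

val distances = shortest.vertices.collect().toMap // 1 -> 0.0, 2 -> 1.0, 3 -> 3.0
println(distances)

sc.stop()
```

Vertex 3 ends at distance 3.0 because the two-hop path through vertex 2 is cheaper than the direct edge of weight 5.0.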

GraphX started as a research project at UC Berkeley's AMPLab and Databricks and was later donated to the Spark project.^[13]

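Features

APIs in Java, Scala, Python, and R.
Scales to more than 8,000 nodes.^[14]
Can cache datasets in memory for interactive data analysis.
An interactive command-line shell in Scala or Python reduces response time for scale-out data exploration.
Spark Streaming processes real-time data streams with scalability, high throughput, and fault tolerance.
Spark SQL supports structured and relational query processing (SQL).
High-level libraries of MLlib machine learning algorithms and GraphX graph processing algorithms.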
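The in-memory caching feature listed above can be illustrated with a small sketch in which several queries reuse one cached dataset. It assumes Spark is on the classpath and runs in local mode; the word list is made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(
  new SparkConf().setAppName("cache_sketch").setMaster("local[*]"))

val words = sc.parallelize(
  Seq("spark", "hadoop", "spark", "mesos", "yarn", "spark"))

// Mark the RDD for in-memory storage; it is materialized by the first action.
words.persist(StorageLevel.MEMORY_ONLY)

// Each query below is an action. After the first one, the partitions are
// served from the cache instead of being recomputed.
val total = words.count()                           // 6
val sparkCount = words.filter(_ == "spark").count() // 3
val distinctCount = words.distinct().count()        // 4

println(s"$total words, $sparkCount occurrences of spark, $distinctCount distinct")

words.unpersist()
sc.stop()
```

With a dataset this small the effect is not measurable; the pattern pays off when the cached RDD is expensive to load or compute and is queried many times.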

References

[1] Release 3.5.4. 2024-12-17 [2024-12-21].
[2] Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion. Shark: SQL and Rich Analytics at Scale (PDF). June 2013 [2015-05-30]. (Archived (PDF) from the original on 2017-08-09).
[3] Matei Zaharia. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications. Invited Talk at NIPS 2011 Big Learning Workshop: Algorithms, Systems, and Tools for Learning at Scale. [2015-05-30]. (Archived from the original on 2015-11-13).
[4] Cluster Mode Overview - Spark 1.2.0 Documentation - Cluster Manager Types. apache.org. Apache Foundation. 2014-12-18 [2015-01-18]. (Archived from the original on 2015-01-19).
[5] Figure showing Spark in relation to other open-source Software projects including Hadoop. [2015-05-30]. (Archived from the original on 2015-03-24).
[6] Doan, DuyHai. Re: cassandra + spark / pyspark. Cassandra User (mailing list). 2014-09-10 [2014-11-21]. (Archived from the original on 2015-05-30).
[7] Open HUB Spark development activity. [2015-05-30]. (Archived from the original on 2014-12-07).
[8] The Apache Software Foundation Announces Apache™ Spark™ as a Top-Level Project. apache.org. Apache Software Foundation. 27 February 2014 [4 March 2014]. (Archived from the original on 2015-03-17).
[9] Spark officially sets a new record in large-scale sorting. [2015-05-30]. (Archived from the original on 2015-05-15).
[10] Frank Kane. Taming Big Data with Apache Spark and Python. Packt. 2017 [2021-11-09]. ISBN 978-1787287945. (Archived from the original on 2021-11-09).
[11] dotnet/spark, .NET Platform, 2020-09-14 [2020-09-14]. (Archived from the original on 2022-04-29).
[12] Sparks, Evan; Talwalkar, Ameet. Spark Meetup: MLbase, Distributed Machine Learning with Spark. slideshare.net. Spark User Meetup, San Francisco, California. 2013-08-06 [10 February 2014]. (Archived from the original on 2015-06-26).
[13] Gonzalez, Joseph; Xin, Reynold; Dave, Ankur; Crankshaw, Daniel; Franklin, Michael; Stoica, Ion. GraphX: Graph Processing in a Distributed Dataflow Framework (PDF). Oct 2014 [2015-05-30]. (Archived (PDF) from the original on 2014-12-07).
[14] Apache Spark FAQ. apache.org. Apache Software Foundation. [5 December 2014]. (Archived from the original on 2015-05-20).
