注重体验与质量的电子书资源下载网站
分类于: 其它 编程语言
简介
Learning Spark, 2nd Edition: Lightning-Fast Data Analytics 豆 0.0分
资源最后更新于 2020-10-05 18:44:10
作者:Tathagata Das
出版社:O'Reilly Media
出版日期:2020-01
ISBN:9781492050049
文件格式: pdf
标签: Spark 大数据 计算机科学 数据分析 分布式 软件工程 BigData
简介· · · · · ·
Data is getting bigger, arriving faster, and coming in varied formats—and it all needs to be processed at scale for analytics or machine learning. How can you process such varied data workloads efficiently? Enter Apache Spark.
Updated to emphasize new features in Spark 2.x., this second edition shows data engineers and scientists why structure and unification in Spark matters. ...
目录
1. Introduction to Unified Analytics with Apache Spark
The Genesis of Big Data and Distributed Computing at Google
Hadoop at Yahoo!
Spark’s Early Years at AMPLab
What is Apache Spark?
Speed
Ease of Use
Modularity
Extensibility
Why Unified Analytics?
Apache Spark Components as a Unified Stack
Apache Spark’s Distributed Execution and Concepts
Developer’s Experience
Who Uses Spark, and for What?
Data Science Tasks
Data Engineering Tasks
Machine Learning or Deep Learning Tasks
Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
Step 1: Download Apache Spark
Spark’s Directories and Files
Step 2: Use Scala Shell or PySpark Shell
Using Local Machine
Step 3: Understand Spark Application Concepts
Spark Application and SparkSession
Spark Jobs
Spark Stages
Spark Tasks
Transformations, Actions, and Lazy Evaluation
Spark UI
Databricks Community Edition
First Standalone Application
Using Local Machine
Counting M&Ms for the Cookie Monster
Building Standalone Applications in Scala
Summary
3. Apache Spark’s Structured APIs
A Bit of History…
Unstructured Spark: What’s Underneath an RDD?
Structuring Spark
Key Merits and Benefits
Structured APIs: DataFrames and Datasets APIs
DataFrames API
Common DataFrame Operations
Datasets API
DataFrames vs Datasets
What about RDDs?
Spark SQL and the Underlying Engine
Catalyst Optimizer
Summary
4. Spark SQL and DataFrames — Introduction to Built-in Data Sources
Using Spark SQL in Spark Applications
Basic Query Example
SQL Tables and Views
Data Sources for DataFrames and SQL Tables
DataFrameReader
DataFrameWriter
Parquet
JSON
CSV
Avro
ORC
Image
Summary
5. Spark SQL and Datasets
Single API for Java and Scala
Scala Case Classes and JavaBeans for Datasets
Working with Datasets
Creating Sample Data
Transforming Sample Data
Memory Management for Datasets and DataFrames
Dataset Encoders
Spark’s Internal Format vs Java Object Format
Serialization and Deserialization (SerDe)
Costs of Using Datasets
Strategies to Mitigate Costs
Summary
6. Loading and Saving Your Data
Motivation for Data Sources
File Formats: Revisited
Text Files
Organizing Data for Efficient I/O
Partitioning
Bucketing
Compression Schemes
Saving as Parquet Files
Delta Lake Storage Format
Delta Lake Table
Summary
The Genesis of Big Data and Distributed Computing at Google
Hadoop at Yahoo!
Spark’s Early Years at AMPLab
What is Apache Spark?
Speed
Ease of Use
Modularity
Extensibility
Why Unified Analytics?
Apache Spark Components as a Unified Stack
Apache Spark’s Distributed Execution and Concepts
Developer’s Experience
Who Uses Spark, and for What?
Data Science Tasks
Data Engineering Tasks
Machine Learning or Deep Learning Tasks
Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
Step 1: Download Apache Spark
Spark’s Directories and Files
Step 2: Use Scala Shell or PySpark Shell
Using Local Machine
Step 3: Understand Spark Application Concepts
Spark Application and SparkSession
Spark Jobs
Spark Stages
Spark Tasks
Transformations, Actions, and Lazy Evaluation
Spark UI
Databricks Community Edition
First Standalone Application
Using Local Machine
Counting M&Ms for the Cookie Monster
Building Standalone Applications in Scala
Summary
3. Apache Spark’s Structured APIs
A Bit of History…
Unstructured Spark: What’s Underneath an RDD?
Structuring Spark
Key Merits and Benefits
Structured APIs: DataFrames and Datasets APIs
DataFrames API
Common DataFrame Operations
Datasets API
DataFrames vs Datasets
What about RDDs?
Spark SQL and the Underlying Engine
Catalyst Optimizer
Summary
4. Spark SQL and DataFrames — Introduction to Built-in Data Sources
Using Spark SQL in Spark Applications
Basic Query Example
SQL Tables and Views
Data Sources for DataFrames and SQL Tables
DataFrameReader
DataFrameWriter
Parquet
JSON
CSV
Avro
ORC
Image
Summary
5. Spark SQL and Datasets
Single API for Java and Scala
Scala Case Classes and JavaBeans for Datasets
Working with Datasets
Creating Sample Data
Transforming Sample Data
Memory Management for Datasets and DataFrames
Dataset Encoders
Spark’s Internal Format vs Java Object Format
Serialization and Deserialization (SerDe)
Costs of Using Datasets
Strategies to Mitigate Costs
Summary
6. Loading and Saving Your Data
Motivation for Data Sources
File Formats: Revisited
Text Files
Organizing Data for Efficient I/O
Partitioning
Bucketing
Compression Schemes
Saving as Parquet Files
Delta Lake Storage Format
Delta Lake Table
Summary