Create big data streaming pipelines with Spark using Python. Run analytics on live Tweet data from Twitter. Integrate Spark Streaming with tools like Apache Kafka, used by Fortune 500 companies. Work with new features of the most recent version of Spark: 2.3.
This course covers all the fundamentals of Apache Spark Streaming with Python and teaches you everything you need to know about developing Spark Streaming applications using PySpark, the Python API for Spark. By the end of this course, you will have gained in-depth knowledge of Spark Streaming and general big data manipulation skills to help your company adopt Spark Streaming for building big data processing pipelines and data analytics applications. This course will be absolutely critical to anyone trying to make it in data science today.
What will you learn from this course?
In this course, you'll learn the following:
An overview of the architecture of Apache Spark.
How to develop Apache Spark 2.0 applications with PySpark using RDD transformations and actions and Spark SQL.
How to work with Spark's primary abstraction, resilient distributed datasets (RDDs), to process and analyze large data sets.
Advanced techniques to optimize and tune Apache Spark jobs by partitioning, caching, and persisting RDDs.
How to analyze structured and semi-structured data using Datasets and DataFrames, and develop a thorough understanding of Spark SQL.
How to scale up Spark Streaming applications for both bandwidth and processing speed.
How to integrate Spark Streaming with cluster computing tools like Apache Kafka.
How to connect your Spark Streaming application to a data source like Amazon Web Services (AWS) Kinesis.
Best practices for working with Apache Spark in the field.
An overview of the big data ecosystem.