Cloudera Developer Training for Spark and Hadoop (CDTSH1)

Cloudera Developer Training for Spark and Hadoop is also available in OnDemand e-learning.

$2235.00 USD

Subscription Details

This OnDemand offering provides you with a 180-day subscription that begins on the date of purchase. While the subscription is active, you will have unlimited access to the course training materials, which include recorded course lectures and demonstrations, assessment components, and hands-on exercise instructions. You will also receive 20 runtime hours of access to the online hands-on exercise environment, accessible through a web browser. You can start the exercise environment when you are ready to use it. You can stop or pause it when you are done for the time being, then return anytime to continue where you left off. The exercise environment remains accessible until you have used the runtime hours or the subscription period ends, whichever occurs first.

About this Course

Learn how to import data into your Apache Hadoop cluster and process it with Spark, Hive, Flume, Sqoop, Impala, and other Hadoop ecosystem tools.

This four-day hands-on training course delivers the key concepts and expertise you need to ingest and process data on a Hadoop cluster using the most up-to-date tools and techniques. Employing Hadoop ecosystem projects such as Spark, Hive, Flume, Sqoop, and Impala, this training course is the best preparation for the real-world challenges faced by Hadoop developers. You will learn to identify which tool is the right one to use in a given situation, and will gain hands-on experience in developing using those tools.

Who should attend

  • Programmers
  • Developers
  • Engineers

Class Prerequisites

  • Apache Spark examples and hands-on exercises are presented in Scala and Python, so the ability to program in one of those languages is required.
  • Basic familiarity with the Linux command line is assumed.
  • Basic knowledge of SQL is helpful.
  • Prior knowledge of Hadoop is not required.

What You Will Learn

By the end of this course, you will have learned:

  • How data is distributed, stored, and processed in a Hadoop cluster
  • How to use Sqoop and Flume to ingest data
  • How to process distributed data with Apache Spark
  • How to model structured data as tables in Impala and Hive
  • How to choose the best data storage format for different data usage patterns
  • Best practices for data storage

Outline: Cloudera Developer Training for Spark and Hadoop (CDTSH1)

Module 1: Introduction to Hadoop and the Hadoop Ecosystem

  • Problems with Traditional Large-Scale Systems
  • Hadoop!
  • Data Storage and Ingest
  • Data Processing
  • Data Analysis and Exploration
  • Other Ecosystem Tools
  • Introduction to the Hands-On Exercises

Module 2: Hadoop Architecture and HDFS

  • Distributed Processing on a Cluster
  • Storage: HDFS Architecture
  • Storage: Using HDFS
  • Resource Management: YARN Architecture
  • Resource Management: Working with YARN

Module 3: Importing Relational Data with Apache Sqoop

  • Sqoop Overview
  • Basic Imports and Exports
  • Limiting Results
  • Improving Sqoop’s Performance
  • Sqoop 2

Module 4: Introduction to Impala and Hive

  • Introduction to Impala and Hive
  • Why Use Impala and Hive?
  • Comparing Hive to Traditional Databases
  • Hive Use Cases

Module 5: Modeling and Managing Data with Impala and Hive

  • Data Storage Overview
  • Creating Databases and Tables
  • Loading Data into Tables
  • HCatalog
  • Impala Metadata Caching

Module 6: Data Formats

  • Selecting a File Format
  • Hadoop Tool Support for File Formats
  • Avro Schemas
  • Using Avro with Hive and Sqoop
  • Avro Schema Evolution
  • Compression

Module 7: Data Partitioning

  • Partitioning Overview
  • Partitioning in Impala and Hive

Module 8: Capturing Data with Apache Flume

  • What is Apache Flume?
  • Basic Flume Architecture
  • Flume Sources
  • Flume Sinks
  • Flume Channels
  • Flume Configuration

Module 9: Spark Basics

  • What is Apache Spark?
  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark

Module 10: Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations

Module 11: Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Configuring Spark Properties
  • Logging

Module 12: Parallel Programming with Spark

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks

Module 13: Spark Caching and Persistence

  • RDD Lineage
  • Caching Overview
  • Distributed Persistence

Module 14: Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means

Module 15: Preview: Spark SQL

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala, and Hive-on-Spark

Classroom Training

Duration 4 days

Price
  • Canada: CAD 3,195

Online Training

Duration 4 days

Price
  • Canada: CAD 3,195

E-Learning

Price
  • Canada: US$ 2,235

Schedule

This is an Instructor-Led Classroom course
Fast Lane will conduct all guaranteed training regardless of the number of attendees, except in cases of force majeure or other unexpected events, such as accidents or illness of the trainer, that prevent the course from being conducted.
Dates marked "Online Training" in the schedule are conducted as Instructor-Led Online Training.
  * This class is delivered by a partner.
Canada

Currently, there are no training dates scheduled for this course. For enquiries, please write to info@fastlaneca.com.

United States

  • Jun 6-9, 2017: Online Training, time zone US/Eastern (Guaranteed to Run) *
  • Jun 20-23, 2017: Herndon/Reston, VA *
  • Jun 27-30, 2017: Online Training, time zone US/Eastern *
  • Jul 11-14, 2017: Chicago, IL *
  • Jul 18-21, 2017: Redwood City, CA *
  • Jul 25-28, 2017: Online Training, time zone US/Eastern *