Scalable Machine Learning (SML)


Course Overview

This course is designed to give participants exposure to scalable machine learning. It uses the Cloudera Hadoop and Spark frameworks to implement scalable machine learning algorithms in Scala or Python.

Who should attend

This program is designed for:

  • Software Developers
  • Data Scientists
  • Data Engineers
  • Big Data Engineers


Participants should have a level of knowledge equivalent to that covered in the “Data and Machine Learning Fundamentals” course.

Outline: Scalable Machine Learning (SML)

Introduction to Scalable Machine Learning

  • What is Scalable Machine Learning?
  • Why is it required?
  • Key platforms for performing Scalable Machine Learning
  • Scalable Machine Learning Project End to End Pipeline
  • Spark Introduction
  • Why Spark for Scalable Machine Learning?
  • Databricks Platform Demo
  • Approaches for scaling scikit-learn code
  • Hands-on Exercise(s): Experiencing the first notebook

Why Spark for Scalable Machine Learning (SML)?

  • Problems with Traditional Machine Learning Frameworks
  • Machine Learning at Scale – Various options
  • Iterative Algorithms
  • Why Spark performs well on Iterative Machine Learning Algorithms
  • Hands-on Exercise(s)

Scalable Machine Learning on Enterprise Platform

  • Acquiring Structured content from Relational Databases
  • Acquiring Semi-structured content from Log Files
  • Acquiring Unstructured content from other key sources like Web
  • Tools for Performing Data acquisition at Scale
  • Sqoop, Flume and Kafka Introduction, use cases and architectures
  • Hands-on Exercise(s)

Data Acquisition at Scale

Data Pre-Processing for Modeling

  • Using the Spark Shell
  • Resilient Distributed Datasets (RDDs)
  • Functional Programming with Spark
  • RDD Operations
  • Key-Value Pair RDDs
  • MapReduce and Pair RDD Operations
  • Building and Running a Spark Application
  • Performing Data Validation
  • Data De-Duplication
  • Detecting Outliers
  • Hands-on Exercise(s)
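
As a rough illustration of the MapReduce and pair RDD topics listed above, the sketch below counts words in plain Python, with no Spark dependency. The helper names `map_phase` and `reduce_by_key` and the sample lines are hypothetical and chosen only for this example. In PySpark, the corresponding RDD operations would be `flatMap`, `map`, and `reduceByKey`, which run the same pattern across partitions.

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    # Map step: emit one (word, 1) pair per word, like a key-value pair RDD.
    return [(word.lower(), 1) for line in lines for word in line.split()]

def reduce_by_key(pairs, func):
    # Shuffle step: group values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce step: combine the values of each key with the given function.
    return {key: reduce(func, values) for key, values in grouped.items()}

# Hypothetical input data for illustration only.
lines = ["spark makes big data simple", "big data needs big tools"]
counts = reduce_by_key(map_phase(lines), lambda a, b: a + b)
print(counts)
```

The combining function must be associative and commutative, because a distributed engine may merge partial results in any order.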

Working with Iterative Algorithms

  • Dealing with RDD Infinite Lineages
  • Caching Overview
  • Distributed Persistence
  • Checkpointing of an Iterative Machine Learning Algorithm
  • Hands-on Exercise(s)

Spark SQL

  • Introduction
  • DataFrame API
  • Performing ad-hoc query analysis using Spark SQL
  • Hands-on Exercise(s)

Spark Machine Learning using MLlib

  • Spark ML vs Spark MLlib
  • Data types and key terms
  • Feature Extraction
  • Linear Regression using Spark MLlib
  • Hands-on Exercise(s)

Spark Machine Learning using ML

  • Spark ML Overview
  • Transformers and Estimators
  • Pipelines
  • Implementing Decision Trees
  • K-Means Clustering using Spark ML
  • Hands-on Exercise(s)

Decision Trees and Random Forest

  • Types – Classification and Regression trees
  • Gini Index, Entropy and Information Gain
  • Building Decision Trees
  • Pruning the trees
  • Prediction using Trees
  • Ensemble Models
  • Bagging and Boosting
  • Advantages of using Random Forest
  • Working with Random Forest
  • Ensemble Learning
  • How ensemble learning works
  • Building models using Bagging
  • Random Forest algorithm
  • Random Forest model building
  • Fine tuning hyper-parameters
  • Hands-on Exercise(s)
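
The split criteria named above (Gini Index, Entropy and Information Gain) can be sketched in a few lines of plain Python. This is a minimal illustration, not the course material itself: the function names and the toy labels are assumptions made for this example, and entropy is measured in bits (log base 2).

```python
from collections import Counter
from math import log2

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    # Shannon entropy in bits: -sum(p * log2(p)) over class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    # Parent entropy minus the size-weighted entropy of the child nodes.
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy split: 10 labels divided into two child nodes.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

print(gini(parent))                             # impurity of a 50/50 node
print(entropy(parent))                          # entropy of a 50/50 node
print(information_gain(parent, [left, right]))  # gain from the split
```

A tree builder would evaluate candidate splits this way and keep the one with the highest gain (or the lowest weighted impurity).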

Model Evaluation, Optimization and Deployment

  • Model Evaluation
  • Optimizing a Model
  • Deploying a Model
  • Best Practices

Prices & Delivery methods

Online Training

3 days

  • on request

Classroom Training

3 days

  • on request


Currently there are no training dates scheduled for this course.