**Data Science and Data Analytics – Python / R / SAS Training in Pune**

**Learn Data Science, Deep Learning, and Machine Learning using Python / R / SAS, with Live Machine Learning and Deep Learning Projects**

Duration: 3 months, weekends only (3 hours each on Saturday and Sunday)

Real-time projects, assignments, and scenarios are part of this course.

Data sets, installation support, interview preparation, and the option to repeat sessions for up to 6 months are all included in this course.

2640 Satisfied Learners

**Data Science Training & Certification in Pune**

**Duration of Data Science Training: 80 hrs**

**Batch type: weekdays / weekends**

**Mode of Data Science Training: Classroom / Online / Corporate Training**

Data Science training, real-time projects, assignments, and scenarios are part of this course.

The course prepares you to become a Certified Data Scientist and includes complete placement support to help you get a job.

Trainer: Experienced Data Science Consultant

**Want to Be a Future Data Scientist?**

**Data Science Training Introduction:** This course does not require a prior quantitative or mathematics background. It starts by introducing basic concepts such as the mean, median, and mode, and eventually covers all aspects of an analytics or data science career, from analyzing and preparing raw data to visualizing your findings. If you’re a programmer or a fresh graduate looking to switch into an exciting new career track, or a data analyst looking to make the transition into the tech industry, this course will teach you the basic to advanced techniques used by real-world industry data scientists.
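As a small illustration of the basic concepts mentioned above, the sketch below computes the mean, median, and mode with Python's standard `statistics` module. The `orders` list is made-up sample data, not course material.

```python
import statistics

# Hypothetical sample: daily order counts for one week.
orders = [12, 15, 12, 18, 20, 12, 16]

mean_orders = statistics.mean(orders)      # arithmetic average
median_orders = statistics.median(orders)  # middle value after sorting
mode_orders = statistics.mode(orders)      # most frequent value

print(mean_orders, median_orders, mode_orders)
```

The three measures can differ noticeably on skewed data, which is why they are usually reported together.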

**Data Science, Statistics with Python / R / SAS:** This course is an introduction to Data Science and Statistics using R, Python, or SAS. It covers both the theory behind statistical concepts and their practical implementation in R, Python, or SAS. If you’re new to Python, don’t worry: the course starts with a crash course. Whether you’ve done some programming before or are new to programming, you should pick it up quickly. This course shows you how to get set up on Microsoft Windows-based PCs; the sample code will also run on macOS or Linux desktop systems.

**Data Science Analytics:** Using Spark and Scala, you can analyze and explore your data in an interactive environment with fast feedback. The course will show how to leverage the power of RDDs and DataFrames to manipulate data with ease.

**Machine Learning and Data Science:** Spark’s core functionality and built-in libraries make it easy to implement complex algorithms, such as recommendations, with very few lines of code. We’ll cover a variety of datasets and algorithms, including PageRank, MapReduce, and graph datasets.

**Data Science Real-Life Examples:** Every concept is explained with the help of examples, case studies, and source code in R wherever necessary. The examples cover a wide array of topics and range from A/B testing in an Internet company context to the Capital Asset Pricing Model in a quant finance context.

**Data Science Target Audience**

- Engineering or management graduates, post-graduates, and freshers who want to build a career in the data science industry or become future data scientists.
- Engineers who want to use a distributed computing engine for batch or stream processing, or both.
- Analysts who want to leverage Spark for analyzing interesting datasets.
- Data scientists who want a single engine for analyzing and modelling data as well as productionizing it.
- MBA graduates or business professionals who are looking to move to a heavily quantitative role.
- Engineering graduates and professionals who want to understand basic statistics and lay a foundation for a career in data science.
- Working professionals who have mostly worked in descriptive analytics, or fresh graduates with no work experience, who want to make the shift to being data scientists.
- Professionals who’ve worked mostly with tools like Excel and want to learn how to use R for statistical analysis.

**Data Science Course Content**

**Introduction to Data Science with Python**

- What is analytics & Data Science?
- Common Terms in Analytics
- Analytics vs. Data warehousing, OLAP, MIS Reporting
- Relevance in industry and need of the hour
- Types of problems and business objectives in various industries
- How leading companies are harnessing the power of analytics?
- Critical success drivers
- Overview of analytics tools & their popularity
- Analytics Methodology & problem solving framework
- List of steps in Analytics projects
- Identify the most appropriate solution design for the given problem statement
- Project plan for Analytics project & key milestones based on effort estimates
- Build Resource plan for analytics project

**Python Essentials**

- Why Python for data science?
- Overview of Python – Starting with Python
- Introduction to installation of Python
- Introduction to Python Editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython, etc.)
- Understand Jupyter notebook & Customize Settings
- Concept of Packages/Libraries – Important packages (NumPy, SciPy, scikit-learn, pandas, Matplotlib, etc.)
- Installing & loading Packages & Name Spaces
- Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
- List and Dictionary Comprehensions
- Variable & Value Labels – Date & Time Values
- Basic Operations – Mathematical – string – date
- Reading and writing data
- Simple plotting
- Control flow & conditional statements
- Debugging & Code profiling
- How to create classes and modules, and how to call them
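To give a flavour of the list and dictionary comprehensions listed above, here is a minimal sketch. The `scores` dictionary and the grade cut-offs are hypothetical values chosen only for illustration.

```python
# Hypothetical course scores keyed by student name.
scores = {"asha": 82, "ravi": 47, "meena": 91, "john": 65}

# List comprehension: names of students who scored 60 or more.
passed = [name for name, score in scores.items() if score >= 60]

# Dictionary comprehension: map each student to a letter grade.
grades = {
    name: ("A" if score >= 80 else "B" if score >= 60 else "C")
    for name, score in scores.items()
}

print(passed)
print(grades)
```

Both forms replace an explicit loop plus `append` or key assignment with a single expression.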

**Scientific Distributions Used In Python For Data Science**

- NumPy, pandas, scikit-learn, statsmodels, NLTK

**Accessing/Importing And Exporting Data Using Python Modules**

- Importing Data from various sources (CSV, TXT, Excel, Access, etc.)
- Database Input (Connecting to database)
- Viewing Data objects – subsetting Data, methods
- Exporting Data to various formats
- Important Python modules: pandas, Beautiful Soup

**Data Manipulation – Cleansing – Munging using python modules**

- Cleansing Data with Python
- Data Manipulation steps (Sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, Data type conversions, renaming, formatting, etc.)
- Data manipulation tools (Operators, Functions, Packages, control structures, Loops, arrays, etc.)
- Python Built-in Functions (Text, numeric, date, utility functions)
- Python User Defined Functions
- Stripping out extraneous information
- Normalizing data
- Formatting data
- Important Python modules for data manipulation (pandas, NumPy, re, math, string, datetime, etc.)
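The cleansing steps listed above (stripping extraneous information, normalizing, formatting, removing duplicates) can be sketched with only the standard `re` and `datetime` modules. The records, field names, and the day/month/year date format below are assumptions made for this example.

```python
import re
from datetime import datetime

# Hypothetical raw records with inconsistent spacing, casing and phone formats.
raw_rows = [
    {"name": "  Asha  KULKARNI ", "phone": "+91-98765 43210", "joined": "05/01/2021"},
    {"name": "ravi patil", "phone": "98765-43211", "joined": "17/03/2021"},
    {"name": "Asha Kulkarni", "phone": "9876543210", "joined": "05/01/2021"},
]

def clean_row(row):
    # Collapse repeated whitespace, trim, and use title case for names.
    name = re.sub(r"\s+", " ", row["name"]).strip().title()
    # Keep digits only, then the last 10 digits (drops a country prefix).
    phone = re.sub(r"\D", "", row["phone"])[-10:]
    # Convert day/month/year text into an ISO 8601 date string.
    joined = datetime.strptime(row["joined"], "%d/%m/%Y").date().isoformat()
    return {"name": name, "phone": phone, "joined": joined}

cleaned = []
seen = set()
for row in raw_rows:
    record = clean_row(row)
    key = (record["name"], record["phone"])
    if key not in seen:  # drop duplicates that only differed in formatting
        seen.add(key)
        cleaned.append(record)

print(cleaned)
```

Duplicates are detected only after normalization, since the first and third rows look different in raw form but describe the same person.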

**Data Analysis – Visualization Using Python**

- Introduction to exploratory data analysis
- Descriptive statistics, Frequency Tables and summarization
- Univariate Analysis (Distribution of data & Graphical Analysis)
- Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
- Creating Graphs (Bar/pie/line chart/histogram/boxplot/scatter/density, etc.)
- Important Packages for Exploratory Analysis (NumPy arrays, Matplotlib, seaborn, pandas, scipy.stats, etc.)

**Introduction to Statistics**

- Basic Statistics – Measures of Central Tendencies and Variance
- Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
- Inferential Statistics – Sampling – Concept of Hypothesis Testing
- Statistical Methods – Z/t-tests (one sample, independent, paired), Analysis of Variance, Correlations and Chi-square
- Important modules for statistical methods: NumPy, SciPy, pandas
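As a rough illustration of hypothesis testing, the sketch below runs a two-sided one-sample z-test using only the standard library. It assumes the population standard deviation is known, and all numbers are invented for the example; in practice a t-test from SciPy would be the more common choice when it is not.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Two-sided z-test of H0: population mean == mu0 (sigma assumed known)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical figures: 25 observations averaging 54 against a claimed mean of 50.
z, p_value = one_sample_z_test(sample_mean=54, mu0=50, sigma=10, n=25)
print(round(z, 2), round(p_value, 4))
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting the null hypothesis.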

**Introduction to Predictive Modelling**

- Concept of model in analytics and how it is used?
- Common terminology used in analytics & Modelling process
- Popular modelling algorithms
- Types of Business problems – Mapping of Techniques
- Different Phases of Predictive Modelling

**Data Exploration For Modelling**

- Need for structured exploratory data
- EDA framework for exploring the data and identifying any problems with the data (Data Audit Report)
- Identify missing data
- Identify outliers
- Visualize the data trends and patterns

**Data Preparation**

- Need for data preparation
- Consolidation/Aggregation – Outlier treatment – Flat Liners – Missing values – Dummy creation – Variable Reduction
- Variable Reduction Techniques – Factor Analysis & PCA

**Segmentation: Solving Segmentation Problems**

- Introduction to Segmentation
- Types of Segmentation (Subjective Vs Objective, Heuristic Vs. Statistical)
- Heuristic Segmentation Techniques (Value Based, RFM Segmentation and Life Stage Segmentation)
- Behavioural Segmentation Techniques (K-Means Cluster Analysis)
- Cluster evaluation and profiling – Identify cluster characteristics
- Interpretation of results – Implementation on new data

**Linear Regression: Solving Regression Problems**

- Introduction – Applications
- Assumptions of Linear Regression
- Building Linear Regression Model
- Understanding standard metrics (Variable significance, R-square/Adjusted R-square, Global hypothesis, etc.)
- Assess the overall effectiveness of the model
- Validation of Models (Re running Vs. Scoring)
- Standard Business Outputs (Decile Analysis, Error distribution (histogram), Model equation, drivers etc.)
- Interpretation of Results – Business Validation – Implementation on new data
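To make the model-building and R-square bullets above concrete, here is a minimal sketch of simple (one-variable) linear regression fitted by ordinary least squares in plain Python. The data points are invented, and a real project would more likely use statsmodels or scikit-learn.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: advertising spend (x) versus sales (y).
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 12]

slope, intercept = fit_line(x, y)
r2 = r_squared(x, y, slope, intercept)
print(round(slope, 3), round(intercept, 3), round(r2, 4))
```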

**Logistic Regression : Solving Classification Problems**

- Introduction – Applications
- Linear Regression Vs. Logistic Regression Vs. Generalized Linear Models
- Building Logistic Regression Model (Binary Logistic Model)
- Understanding standard model metrics (Concordance, Variable significance, Hosmer-Lemeshow Test, Gini, KS, Misclassification, ROC Curve, etc.)
- Validation of Logistic Regression Models (Re running Vs. Scoring)
- Standard Business Outputs (Decile Analysis, ROC Curve, Probability Cut-offs, Lift charts, Model equation, Drivers or variable importance, etc)
- Interpretation of Results – Business Validation – Implementation on new data

**Time Series Forecasting : Solving Forecasting Problems**

- Introduction – Applications
- Time Series Components (Trend, Seasonality, Cyclicity and Level) and Decomposition
- Classification of Techniques (Pattern-based – Pattern-less)
- Basic Techniques – Averages, Smoothing, etc.
- Advanced Techniques – AR Models, ARIMA, etc
- Understanding Forecasting Accuracy – MAPE, MAD, MSE, etc
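The forecasting accuracy measures named above can be computed directly from actual and forecast values. The sketch below uses invented numbers; `mad` here means mean absolute deviation of the forecast errors, and MAPE assumes no actual value is zero.

```python
def mad(actual, forecast):
    """Mean absolute deviation of the forecast errors."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mse(actual, forecast):
    """Mean squared error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical monthly demand and the corresponding forecasts.
actual = [100, 120, 130, 110]
forecast = [110, 115, 125, 120]

mad_value = mad(actual, forecast)
mse_value = mse(actual, forecast)
mape_value = mape(actual, forecast)
print(mad_value, mse_value, round(mape_value, 2))
```

MSE penalizes large errors more heavily than MAD, while MAPE expresses the error relative to the size of the actual values.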

**Machine Learning : Predictive Modelling**

- Introduction to Machine Learning & Predictive Modelling
- Types of Business problems – Mapping of Techniques – Regression vs. classification vs. segmentation vs. Forecasting
- Major Classes of Learning Algorithms -Supervised vs Unsupervised Learning
- Different Phases of Predictive Modelling (Data Pre-processing, Sampling, Model Building, Validation)
- Overfitting (Bias-Variance Trade off) & Performance Metrics
- Feature engineering & dimension reduction
- Concept of optimization & cost function
- Overview of gradient descent algorithm
- Overview of Cross-validation (Bootstrapping, K-Fold validation, etc.)
- Model performance metrics (R-square, Adjusted R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)
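Several of the classification metrics listed above follow directly from the four cells of a confusion matrix. The following sketch uses made-up binary labels (1 = positive, 0 = negative); scikit-learn offers equivalent ready-made functions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)            # how many predicted positives were right
recall = tp / (tp + fn)               # also called sensitivity
specificity = tn / (tn + fp)          # true negative rate
accuracy = (tp + tn) / len(y_true)
print(precision, recall, specificity, accuracy)
```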

**Data Science Unsupervised Learning : Segmentation**

- What is segmentation & Role of ML in Segmentation?
- Concept of Distance and related math background
- K-Means Clustering
- Expectation Maximization
- Hierarchical Clustering
- Spectral Clustering and Density-Based Clustering (DBSCAN)
- Principal Component Analysis (PCA)
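As a simplified illustration of K-Means from the list above, this sketch clusters one-dimensional values in plain Python. The data and the starting centroids are assumptions; library implementations such as scikit-learn's handle multiple dimensions and smarter initialization.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Minimal K-Means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customer spend values with two visible groups.
values = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(values, centroids=[1, 12])
print(centroids, clusters)
```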

**Data Science Supervised Learning :- Decision Trees**

- Decision Trees – Introduction – Applications
- Types of Decision Tree Algorithms
- Construction of Decision Trees through Simplified Examples; Choosing the “Best” attribute at each Non-Leaf node; Entropy; Information Gain, Gini Index, Chi Square, Regression Trees
- Generalizing Decision Trees; Information Content and Gain Ratio; Dealing with Numerical Variables; other Measures of Randomness
- Pruning a Decision Tree; Cost as a consideration; Unwrapping Trees as Rules
- Decision Trees – Validation
- Overfitting – Best Practices to avoid
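Choosing the "best" attribute at a node is commonly done by comparing information gain, which is built on entropy. The sketch below computes both for a hypothetical split of ten labelled records; the labels and the split are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 "yes" and 5 "no" records, split into two branches.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4

parent_entropy = entropy(parent)
gain = information_gain(parent, [left, right])
print(round(parent_entropy, 3), round(gain, 3))
```

A tree-building algorithm would evaluate this gain for every candidate attribute and split on the one with the highest value.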

**Supervised Learning :- Ensemble Learning**

- Concept of Ensembling
- Manual Ensembling Vs. Automated Ensembling
- Methods of Ensembling (Stacking, Mixture of Experts)
- Bagging (Logic, Practical Applications)
- Random forest (Logic, Practical Applications)
- Boosting (Logic, Practical Applications)
- AdaBoost
- Gradient Boosting Machines (GBM)
- XGBoost

**Supervised Learning :- Artificial Neural Network – ANN**

- Motivation for Neural Networks and Its Applications
- Perceptron and Single Layer Neural Network, and Hand Calculations
- Learning in a Multi-Layered Neural Net: Backpropagation and Conjugate Gradient Techniques
- Neural Networks for Regression
- Neural Networks for Classification
- Interpretation of outputs and fine-tuning the models with hyperparameters
- Validating ANN models

**Supervised Learning :- Support Vector Machines**

- Motivation for Support Vector Machine & Applications
- Support Vector Regression
- Support vector classifier (Linear & Non-Linear)
- Mathematical Intuition (Kernel Methods Revisited, Quadratic Optimization and Soft Constraints)
- Interpretation of outputs and fine-tuning the models with hyperparameters
- Validating SVM models

**Supervised Learning :-KNN**

- What is KNN & Applications?
- KNN for missing value treatment
- KNN For solving regression problems
- KNN for solving classification problems
- Validating KNN model
- Model fine tuning with hyper parameters
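A minimal sketch of KNN for classification follows, using Euclidean distance and majority voting in plain Python. The training points and labels are invented, and `k` is the hyperparameter one would tune.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data: (features, class label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

label_near_a = knn_predict(train, (2.0, 2.0), k=3)
label_near_b = knn_predict(train, (6.0, 6.5), k=3)
print(label_near_a, label_near_b)
```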

**Supervised Learning :- Naive Bayes**

- Concept of Conditional Probability
- Bayes Theorem and Its Applications
- Naïve Bayes for classification
- Applications of Naïve Bayes in Classifications
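Bayes' theorem, which underlies Naïve Bayes, can be shown with simple arithmetic. The probabilities below are invented for a spam-filter scenario and are not taken from any real dataset.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(H | E) via Bayes' theorem for a binary hypothesis H and evidence E."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical figures: 20% of mail is spam; the word "offer" appears in
# 60% of spam messages and in 5% of legitimate messages.
p_spam_given_word = posterior(prior=0.2, likelihood=0.6, likelihood_given_not=0.05)
print(round(p_spam_given_word, 2))
```

A Naïve Bayes classifier applies the same rule across many words at once, assuming the words are conditionally independent given the class.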

**Text Mining And Analytics**

- Taming big text; Unstructured vs. Semi-structured Data; Fundamentals of information retrieval; Properties of words; Creating Term-Document (TxD) Matrices; Similarity measures; Low-level processes (Sentence Splitting, Tokenization, Part-of-Speech Tagging, Stemming, Chunking)
- Finding patterns in text: text mining, text as a graph
- Natural Language processing (NLP)
- Text Analytics – Sentiment Analysis using Python
- Text Analytics – Word cloud analysis using Python
- Text Analytics – Segmentation using K-Means/Hierarchical Clustering
- Text Analytics – Classification (Spam/Not spam)
- Applications of Social Media Analytics
- Metrics (Measures, Actions) in social media analytics
- Examples & Actionable Insights using Social Media Analytics
- Important Python modules for Machine Learning (scikit-learn, statsmodels, SciPy, NLTK, etc.)
- Fine-tuning the models using hyperparameters, grid search, pipelines, etc.

OR

**DATA SCIENCE WITH R COURSE CONTENT**

- What is analytics & Data Science?
- Common Terms in Analytics
- Analytics vs. Data warehousing, OLAP, MIS Reporting
- Relevance in industry and need of the hour
- Types of problems and business objectives in various industries
- How leading companies are harnessing the power of analytics?
- Critical success drivers
- Overview of analytics tools & their popularity
- Analytics Methodology & problem solving framework
- List of steps in Analytics projects
- Identify the most appropriate solution design for the given problem statement
- Project plan for Analytics project & key milestones based on effort estimates
- Build Resource plan for analytics project
- Why R for data science?

**Data Importing / Exporting**

- Introduction to R/RStudio – GUI
- Concept of Packages – Useful Packages (Base & Other packages)
- Data Structure & Data Types (Vectors, Matrices, factors, Data frames, and Lists)
- Importing Data from various sources (txt, dlm, excel, sas7bdat, db, etc.)
- Database Input (Connecting to database)
- Exporting Data to various formats
- Viewing Data (Viewing partial data and full data)
- Variable & Value Labels – Date Values

**Data Manipulation**

- Data Manipulation steps
- Creating New Variables (calculations & Binning)
- Dummy variable creation
- Applying transformations
- Handling duplicates
- Handling missing values
- Sorting and Filtering
- Subsetting (Rows/Columns)
- Appending (Row appending/column appending)
- Merging/Joining (Left, right, inner, full, outer etc)
- Data type conversions
- Renaming
- Formatting
- Reshaping data
- Sampling
- Data manipulation tools
- Operators
- Functions
- Packages
- Control Structures (if, if else)
- Loops (Conditional, iterative loops, apply functions)
- Arrays
- R Built-in Functions (Text, Numeric, Date, utility)
- Numerical Functions
- Text Functions
- Date Functions
- Utilities Functions
- R User Defined Functions
- R Packages for data manipulation (base, dplyr, plyr, data.table, reshape, car, sqldf, etc)

**Data Analysis – Visualization**

- Introduction to exploratory data analysis
- Descriptive statistics, Frequency Tables and summarization
- Univariate Analysis (Distribution of data & Graphical Analysis)
- Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
- Creating Graphs (Bar/pie/line chart/histogram/boxplot/scatter/density, etc.)
- R Packages for Exploratory Data Analysis (dplyr, plyr, gmodels, car, vcd, Hmisc, psych, doBy, etc.)
- R Packages for Graphical Analysis (base, ggplot2, lattice, etc.)

**Introduction To Statistics**

- Basic Statistics – Measures of Central Tendencies and Variance
- Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
- Inferential Statistics – Sampling – Concept of Hypothesis Testing
- Statistical Methods – Z/t-tests (one sample, independent, paired), ANOVA, Correlations and Chi-square

**Predictive Modelling**

- Concept of model in analytics and how it is used?
- Common terminology used in analytics & modelling process
- Popular modelling algorithms
- Types of Business problems – Mapping of Techniques
- Different Phases of Predictive Modelling

**Data Exploration For Modeling**

**Data Preparation**

- Need for data preparation
- Consolidation/Aggregation – Outlier treatment – Flat Liners – Missing values – Dummy creation – Variable Reduction
- Variable Reduction Techniques – Factor Analysis & PCA

**Segmentation: Solving Segmentation Problems**

- Introduction to Segmentation
- Types of Segmentation (Subjective Vs Objective, Heuristic Vs. Statistical)
- Heuristic Segmentation Techniques (Value Based, RFM Segmentation and Life Stage Segmentation)
- Behavioral Segmentation Techniques (K-Means Cluster Analysis)
- Cluster evaluation and profiling – Identify cluster characteristics
- Interpretation of results – Implementation on new data

**Linear Regression: Solving Regression Problems**

- Introduction – Applications
- Assumptions of Linear Regression
- Building Linear Regression Model
- Understanding standard metrics (Variable significance, R-square/Adjusted R-square, Global hypothesis, etc.)
- Assess the overall effectiveness of the model
- Validation of Models (Re running Vs. Scoring)
- Standard Business Outputs (Decile Analysis, Error distribution (histogram), Model equation, drivers etc.)
- Interpretation of Results – Business Validation – Implementation on new data

**Logistic Regression: Solving Classification Problems**

- Introduction – Applications
- Linear Regression Vs. Logistic Regression Vs. Generalized Linear Models
- Building Logistic Regression Model (Binary Logistic Model)
- Understanding standard model metrics (Concordance, Variable significance, Hosmer-Lemeshow Test, Gini, KS, Misclassification, ROC Curve, etc.)
- Validation of Logistic Regression Models (Re running Vs. Scoring)
- Standard Business Outputs (Decile Analysis, ROC Curve, Probability Cut-offs, Lift charts, Model equation, Drivers or variable importance, etc)
- Interpretation of Results – Business Validation – Implementation on new data

**Time Series Forecasting: Solving Forecasting Problems**

- Introduction – Applications
- Time Series Components (Trend, Seasonality, Cyclicity and Level) and Decomposition
- Classification of Techniques (Pattern-based – Pattern-less)
- Basic Techniques – Averages, Smoothing, etc.
- Advanced Techniques – AR Models, ARIMA, etc
- Understanding Forecasting Accuracy – MAPE, MAD, MSE, etc

**Machine Learning: Predictive Modelling**

- Introduction to Machine Learning & Predictive Modeling
- Types of Business problems – Mapping of Techniques – Regression vs. classification vs. segmentation vs. Forecasting
- Major Classes of Learning Algorithms – Supervised vs Unsupervised Learning
- Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)
- Overfitting (Bias-Variance Trade-off) & Performance Metrics
- Feature engineering & dimension reduction
- Concept of optimization & cost function
- Overview of gradient descent algorithm
- Overview of Cross-validation (Bootstrapping, K-Fold validation, etc.)
- Model performance metrics (R-square, Adjusted R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)

**Unsupervised Learning: Segmentation**

- What is segmentation & Role of ML in Segmentation?
- Concept of Distance and related math background
- K-Means Clustering
- Expectation Maximization
- Hierarchical Clustering
- Spectral Clustering and Density-Based Clustering (DBSCAN)
- Principal Component Analysis (PCA)

**Supervised Learning: Decision Trees**

- Decision Trees – Introduction – Applications
- Types of Decision Tree Algorithms
- Construction of Decision Trees through Simplified Examples; Choosing the “Best” attribute at each Non-Leaf node; Entropy; Information Gain, Gini Index, Chi Square, Regression Trees
- Generalizing Decision Trees; Information Content and Gain Ratio; Dealing with Numerical Variables; other Measures of Randomness
- Pruning a Decision Tree; Cost as a consideration; Unwrapping Trees as Rules
- Decision Trees – Validation
- Overfitting – Best Practices to avoid

**Supervised Learning: Ensemble Learning**

- Concept of Ensembling
- Manual Ensembling Vs. Automated Ensembling
- Methods of Ensembling (Stacking, Mixture of Experts)
- Bagging (Logic, Practical Applications)
- Random forest (Logic, Practical Applications)
- Boosting (Logic, Practical Applications)
- AdaBoost
- Gradient Boosting Machines (GBM)
- XGBoost

**Supervised Learning: Artificial Neural Networks (ANN)**

- Motivation for Neural Networks and Its Applications
- Perceptron and Single Layer Neural Network, and Hand Calculations
- Learning in a Multi-Layered Neural Net: Backpropagation and Conjugate Gradient Techniques
- Neural Networks for Regression
- Neural Networks for Classification
- Interpretation of outputs and fine-tuning the models with hyperparameters
- Validating ANN models

**Supervised Learning: Support Vector Machines**

- Motivation for Support Vector Machine & Applications
- Support Vector Regression
- Support vector classifier (Linear & Non-Linear)
- Mathematical Intuition (Kernel Methods Revisited, Quadratic Optimization and Soft Constraints)
- Interpretation of outputs and fine-tuning the models with hyperparameters
- Validating SVM models

**Supervised Learning: KNN**

- What is KNN & Applications?
- KNN for missing value treatment
- KNN For solving regression problems
- KNN for solving classification problems
- Validating KNN model
- Model fine tuning with hyper parameters

**Supervised Learning: Naive Bayes**

- Concept of Conditional Probability
- Bayes Theorem and Its Applications
- Naïve Bayes for classification
- Applications of Naïve Bayes in Classifications

**Text Mining And Analytics**

- Taming big text; Unstructured vs. Semi-structured Data; Fundamentals of information retrieval; Properties of words; Creating Term-Document (TxD) Matrices; Similarity measures; Low-level processes (Sentence Splitting, Tokenization, Part-of-Speech Tagging, Stemming, Chunking)
- Finding patterns in text: text mining, text as a graph
- Natural Language processing (NLP)
- Text Analytics – Sentiment Analysis using R
- Text Analytics – Word cloud analysis using R
- Text Analytics – Segmentation using K-Means/Hierarchical Clustering
- Text Analytics – Classification (Spam/Not spam)
- Applications of Social Media Analytics
- Metrics (Measures, Actions) in social media analytics
- Examples & Actionable Insights using Social Media Analytics
- Important R packages for Machine Learning (caret, H2O, randomForest, nnet, tm, etc.)
- Fine-tuning the models using hyperparameters, grid search, pipelines, etc.

Case Studies

OR

**DATA SCIENCE TRAINING WITH SAS COURSE CONTENT**

**Introduction To Analytics**

- Analytics World
- Introduction to Analytics
- Concept of ETL
- SAS in advanced analytics

- Global Certification: Induction and walk through
- Getting Started
- Software installation
- Introduction to GUI
- Different components of the language
- All programming windows
- Concept of Libraries and Creating Libraries
- Variable Attributes (Name, Type, Length, Format, Informat, Label)
- Importing Data and Entering data manually

- Understanding Datasets
- Descriptor Portion of a Dataset (Proc Contents)
- Data Portion of a Dataset
- Variable Names and Values
- Data Libraries

**Base SAS – Accessing The Data**

- Understanding Data Step Processing
- Data Step and Proc Step
- Data step execution
- Compilation and execution phase
- Input buffer and concept of PDV

- Importing Raw Data Files
- Column input, list input, and formatted input methods
- Delimiters; reading missing and non-standard values
- Reading one-to-many and many-to-one records
- Reading Hierarchical files
- Creating raw data files and the PUT statement
- Formats / Informats

- Importing and Exporting Data (Fixed Format / Delimited)
- Proc Import / Delimited text files
- Proc Export / Exporting Data
- Datalines / Cards;
- Atypical importing cases (mixing different style of inputs)
- Reading Multiple Records per Observation
- Reading “Mixed Record Types”
- Sub-setting from a Raw Data File
- Multiple Observations per Record
- Reading Hierarchical Files

- Concept of SAS library and SAS Catalog
- Variable Types in SAS
- Reading Data stored external to SAS
- Importing Data by using Proc Import
- Data Step SAS statements
- SAS Functions
- Appending and Merging using SAS
- SAS Procedures like proc means, proc Univariate, proc append, proc freq and proc export.
- SAS SQL
- SAS Macros

**Hypothesis Testing and ANOVA**

- One Sample t-test of comparing means
- Two Sample t-test of comparing means
- One Way ANOVA
- Assumptions of ANOVA Modeling
- n-way ANOVA
- ANOVA Post Hoc Studies

**Measure Model Performance**

- Apply the principles of honest assessment to model performance measurement
- Assess classifier performance using the confusion matrix
- Model selection and validation using training and validation data
- Create and interpret graphs (ROC, lift, and gains charts) for model comparison and selection
- Establish effective decision cut-off values for scoring

**Data Understanding, Managing And Manipulation**

- Understanding and Exploring Data
- Introduction to basic Procedures – Proc Contents, Proc Print

- Understanding and Exploring Data
- Operators and Operands
- Conditional Statements (Where, If, If then Else, If then Do and select when)
- Difference between WHERE and IF statements and limitation of WHERE statements
- Labels, Commenting
- System Options (OBS, FIRSTOBS, NOOBS, etc.)

- Data Manipulation
- Proc Sort – with options / De-Duping
- Accumulator variable and By-Group processing
- Explicit Output Statements
- Nesting Do loops
- Do While and Do Until Statement
- Array elements and Range

- Combining Datasets (Appending and Merging)
- Concatenation
- Interleaving
- Proc Append
- One To One Merging
- Match Merging
- IN = Controlling merge and Indicator

**Data Mining With Proc SQL**

- Introduction to Databases
- Introduction to Proc SQL
- Basics of General SQL language
- Creating table and Inserting Values
- Retrieve & Summarize data
- Group, Sort & Filter
- Using Joins (Full, Inner, Left, Right and Outer)
- Reporting and summary analysis
- Concept of Indexes and creating Indexes (simple and composite)
- Connecting SAS to external Databases
- Implicit and Explicit pass through methods

**Macros For Automation**

- Macro Parameters and Variables
- Different types of Macro Creation
- Defining and calling a macro
- Using call Symput and Symget
- Macro options (MPRINT, SYMBOLGEN, MLOGIC, MERROR, SERROR)

**Fundamentals Of Statistics**

- Basic Statistics – Measures of Central Tendencies and Variance
- Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
- Inferential Statistics – Sampling – Concept of Hypothesis Testing
- Statistical Methods – Z/t-tests (one sample, independent, paired), ANOVA, Correlations and Chi-square
- Levels of Measurement and Variable types
- Descriptive Statistics and Picturing Distributions
- Confidence Interval for the Mean

**Introduction To Predictive Modelling**

- Introduction to Predictive Modeling
- Types of Business problems – Mapping of Techniques
- Different Phases of Predictive Modeling

**Data Preparation**

- Need for data preparation
- Data Audit Report and its importance
- Consolidation/Aggregation – Outlier treatment – Flat Liners – Missing values – Dummy creation – Variable Reduction
- Variable Reduction Techniques – Factor Analysis & PCA

**Segmentation**

- Introduction to Segmentation
- Types of Segmentation (Subjective Vs Objective, Heuristic Vs. Statistical)
- Heuristic Segmentation Techniques (Value Based, RFM Segmentation and Life Stage Segmentation)
- Behavioural Segmentation Techniques (K-Means Cluster Analysis)
- Cluster evaluation and profiling
- Interpretation of results – Implementation on new data

**Linear Regression**

- Introduction – Applications
- Assumptions of Linear Regression
- Building Linear Regression Model
- Understanding standard metrics (Variable significance, R-square/Adjusted R-square, Global hypothesis, etc.)
- Validation of Models (Re running Vs. Scoring)
- Standard Business Outputs (Decile Analysis, Error distribution (histogram), Model equation, drivers etc.)
- Interpretation of Results – Business Validation – Implementation on new data

**Logistic Regression**

- Introduction – Applications
- Linear Regression Vs. Logistic Regression Vs. Generalized Linear Models
- Building Logistic Regression Model
- Understanding standard model metrics (Concordance, Variable significance, Hosmer-Lemeshow Test, Gini, KS, Misclassification, etc.)
- Validation of Logistic Regression Models (Re-running Vs. Scoring)
- Standard Business Outputs (Decile Analysis, ROC Curve, Probability Cut-offs, Lift charts, Model equation, Drivers, etc.)
- Interpretation of Results – Business Validation – Implementation on new data

**Time Series Forecasting**

- Introduction – Applications
- Time Series Components (Trend, Seasonality, Cyclicity and Level) and Decomposition
- Classification of Techniques (Pattern-based – Pattern-less)
- Basic Techniques – Averages, Smoothing, etc.
- Advanced Techniques – AR Models, ARIMA, etc
- Understanding Forecasting Accuracy – MAPE, MAD, MSE, etc

**Introduction To Machine Learning**

- Statistical learning vs. Machine learning
- Major Classes of Learning Algorithms – Supervised vs Unsupervised Learning
- Concept of Overfitting and Underfitting (Bias-Variance Trade-off) & Performance Metrics
- Types of Cross-validation (Train & Test, Bootstrapping, K-Fold validation, etc.)

**Regression & Classification Model Building**

- Recursive Partitioning(Decision Trees)
- Ensemble Models(Random Forest, Bagging & Boosting)
- K-Nearest neighbours

OR

**ADVANCED BIG DATA SCIENCE COURSE CONTENT**

**Introduction To Data Science**

- What is Data Science?
- Why Python for data science?
- Relevance in industry and need of the hour
- How leading companies are harnessing the power of Data Science with Python?
- Different phases of a typical Analytics/Data Science projects and role of python
- Anaconda vs. Python

**Python Essentials (Core)**

- Overview of Python – Starting with Python
- Introduction to installation of Python
- Introduction to Python Editors & IDEs (Canopy, PyCharm, Jupyter, Rodeo, IPython, etc.)
- Understand Jupyter notebook & Customize Settings
- Concept of Packages/Libraries – Important packages (NumPy, SciPy, scikit-learn, pandas, Matplotlib, etc.)
- Installing & loading Packages & Name Spaces
- Data Types & Data objects/structures (strings, Tuples, Lists, Dictionaries)
- List and Dictionary Comprehensions
- Variable & Value Labels – Date & Time Values
- Basic Operations – Mathematical – string – date
- Reading and writing data
- Simple plotting
- Control flow & conditional statements
- Debugging & Code profiling
- How to create classes and modules and how to call them
- Scientific libraries used in Python for Data Science – NumPy, SciPy, pandas, scikit-learn, statsmodels, NLTK, etc.
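
As a small taste of the list and dictionary comprehensions covered in this module, the sketch below filters and relabels a dictionary of scores. The names, scores, and pass mark are made-up sample data.

```python
scores = {"asha": 72, "ravi": 48, "meena": 91}  # hypothetical sample data

# List comprehension: names of learners who scored at least 50.
passed = [name for name, score in scores.items() if score >= 50]

# Dictionary comprehension: map each name to a pass/fail label.
labels = {name: ("pass" if score >= 50 else "fail") for name, score in scores.items()}

print(passed)
print(labels)
```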

**Accessing/Importing And Exporting Data Using Python Modules**

- Importing Data from various sources (CSV, TXT, Excel, Access, etc.)
- Database Input (Connecting to database)
- Viewing Data objects – subsetting, methods
- Exporting Data to various formats
- Important Python modules: pandas, Beautiful Soup

**Data Manipulation – Cleansing – Munging Using Python Modules**

- Cleansing Data with Python
- Data manipulation steps (sorting, filtering, duplicates, merging, appending, subsetting, derived variables, sampling, data type conversions, renaming, formatting, etc.)
- Data manipulation tools (operators, functions, packages, control structures, loops, arrays, etc.)
- Python Built-in Functions (Text, numeric, date, utility functions)
- Python User Defined Functions
- Stripping out extraneous information
- Normalizing data
- Formatting data
- Important Python modules for data manipulation (pandas, NumPy, re, math, string, datetime, etc.)
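
Two of the cleansing steps listed above (stripping extraneous whitespace and duplicates, and normalizing values) can be sketched with the standard library alone. Both helper names are hypothetical, and "normalizing" is assumed here to mean min-max scaling to the 0–1 range.

```python
def clean_names(raw_names):
    """Strip whitespace, normalize case, and drop blanks and duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for name in raw_names:
        value = " ".join(name.split()).title()  # collapse inner spaces, Title Case
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    return cleaned


def min_max_scale(values):
    """Rescale numbers to the 0-1 range (assumed meaning of 'normalizing')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


print(clean_names(["  pune ", "Pune", "", "new   delhi"]))
print(min_max_scale([10, 20, 30]))
```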

**Data Analysis – Visualization Using Python**

- Introduction to exploratory data analysis
- Descriptive statistics, Frequency Tables and summarization
- Univariate Analysis (Distribution of data & Graphical Analysis)
- Bivariate Analysis (Cross Tabs, Distributions & Relationships, Graphical Analysis)
- Creating Graphs (bar/pie/line chart/histogram/boxplot/scatter/density, etc.)
- Important Packages for Exploratory Analysis (NumPy arrays, Matplotlib, seaborn, pandas, scipy.stats, etc.)

**Basic Statistics & Implementation Of Stats Methods In Python**

- Basic Statistics – Measures of Central Tendencies and Variance
- Building blocks – Probability Distributions – Normal distribution – Central Limit Theorem
- Inferential Statistics – Sampling – Concept of Hypothesis Testing
- Statistical Methods – Z/t-tests (One sample, independent, paired), ANOVA, Correlation and Chi-square
- Important modules for statistical methods: NumPy, SciPy, pandas
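
To show what a one-sample t-test computes, the following minimal sketch derives the t statistic and degrees of freedom by hand using only the standard library. The function name and sample values are hypothetical, and looking up the p-value (for example with SciPy) is left out.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("sample has zero variance")
    standard_error = sd / math.sqrt(n)
    t = (statistics.mean(sample) - population_mean) / standard_error
    return t, n - 1


# Hypothetical sample tested against a hypothesized mean of 4
print(one_sample_t([2, 4, 4, 4, 5, 5, 7, 9], 4))
```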

**Python: Machine Learning – Predictive Modeling – Basics**

- Introduction to Machine Learning & Predictive Modeling
- Types of Business problems – Mapping of Techniques – Regression vs. classification vs. segmentation vs. Forecasting
- Major Classes of Learning Algorithms – Supervised vs. Unsupervised Learning
- Different Phases of Predictive Modeling (Data Pre-processing, Sampling, Model Building, Validation)
- Overfitting (Bias-Variance Trade off) & Performance Metrics
- Feature engineering & dimension reduction
- Concept of optimization & cost function
- Concept of gradient descent algorithm
- Concept of Cross-validation (Bootstrapping, K-Fold validation, etc.)
- Model performance metrics (R-square, RMSE, MAPE, AUC, ROC curve, recall, precision, sensitivity, specificity, confusion matrix)
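
Several of the classification metrics listed above follow directly from the confusion matrix. The sketch below is a minimal, hand-rolled version for binary labels (1 = positive); the function name and sample predictions are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Confusion matrix plus precision, recall (sensitivity), and specificity."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual 0/1, columns: predicted 0/1
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,  # also called sensitivity
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical actual and predicted labels
print(classification_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1]))
```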

**Machine Learning Algorithms & Applications – Implementation In Python**

- Linear & Logistic Regression
- Segmentation – Cluster Analysis (K-Means)
- Decision Trees (CART/C5.0)
- Ensemble Learning (Random Forest, Bagging & Boosting)
- Artificial Neural Networks (ANN)
- Support Vector Machines (SVM)
- Other Techniques (KNN, Naïve Bayes, PCA)
- Introduction to Text Mining using NLTK
- Introduction to Time Series Forecasting (Decomposition & ARIMA)
- Important Python modules for Machine Learning (scikit-learn, statsmodels, SciPy, NLTK, etc.)
- Fine-tuning the models using hyperparameters, grid search, pipelines, etc.

**Project – Consolidate Learnings**

- Applying different algorithms to solve business problems and benchmark the results

**Introduction To Big Data**

- Introduction and Relevance
- Uses of Big Data analytics in various industries such as Telecom, E-commerce, Finance and Insurance
- Problems with Traditional Large-Scale Systems

**Hadoop (Big Data) Ecosystem**

- Motivation for Hadoop
- Different types of projects by Apache
- Role of projects in the Hadoop Ecosystem
- Key technology foundations required for Big Data
- Limitations and Solutions of existing Data Analytics Architecture
- Comparison of traditional data management systems with Big Data management systems
- Evaluate key framework requirements for Big Data analytics
- Hadoop Ecosystem & Hadoop 2.x core components
- Explain the relevance of real-time data
- Explain how to use Big Data and real-time data as a Business planning tool

**Hadoop Cluster – Architecture – Configuration Files**

- Hadoop Master-Slave Architecture
- The Hadoop Distributed File System – Concept of data storage
- Explain different types of cluster setups (fully distributed, pseudo-distributed, etc.)
- Hadoop cluster set up – Installation
- Hadoop 2.x Cluster Architecture
- A Typical enterprise cluster – Hadoop Cluster Modes
- Understanding cluster management tools like Cloudera Manager/Apache Ambari

**Hadoop – HDFS & MapReduce (YARN)**

- HDFS Overview & Data storage in HDFS
- Getting data into Hadoop from a local machine (Data Loading Techniques) and vice versa
- MapReduce Overview (Traditional way vs. MapReduce way)
- Concept of Mapper & Reducer
- Understanding MapReduce program Framework
- Develop MapReduce Program using Java (Basic)
- Develop MapReduce program with streaming API (Basic)
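
The Mapper and Reducer concept can be illustrated without a cluster. The sketch below simulates the map, shuffle, and reduce phases of a word count in plain Python; it is a teaching aid under that assumption, not Hadoop code, and all function names are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big deal"]))
```

On a real cluster the framework performs the shuffle step and runs many mappers and reducers in parallel across nodes.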

**Data Integration Using Sqoop & Flume**

- Integrating Hadoop into an Existing Enterprise
- Loading Data from an RDBMS into HDFS by Using Sqoop
- Managing Real-Time Data Using Flume
- Accessing HDFS from Legacy Systems

**Data Analysis Using Pig**

- Introduction to Data Analysis Tools
- Apache PIG – MapReduce Vs Pig, Pig Use Cases
- PIG’s Data Model
- PIG Streaming
- Pig Latin Program & Execution
- Pig Latin : Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators, Pig UDF
- Writing Java UDFs
- Embedding Pig in Java
- PIG Macros
- Parameter Substitution
- Use Pig to automate the design and implementation of MapReduce applications
- Use Pig to apply structure to unstructured Big Data

**Data Analysis Using Hive**

- Apache Hive – Hive Vs. PIG – Hive Use Cases
- Discuss the Hive data storage principle
- Explain the file formats and record formats supported by the Hive environment
- Perform operations with data in Hive
- HiveQL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
- Hive Script, Hive UDF
- Hive Persistence formats
- Loading data in Hive – Methods
- Serialization & Deserialization
- Handling Text data using Hive
- Integrating external BI tools with Hadoop Hive

**Data Analysis Using Impala**

- Impala & Architecture
- How Impala executes Queries and its importance
- Hive vs. PIG vs. Impala
- Extending Impala with User Defined functions

**Introduction To Other Ecosystem Tools**

- NoSQL database – HBase
- Introduction to Oozie

**Spark: Introduction**

- Introduction to Apache Spark
- Streaming Data Vs. In Memory Data
- Map Reduce Vs. Spark
- Modes of Spark
- Spark Installation Demo
- Overview of Spark on a cluster
- Spark Standalone Cluster

**Spark: Spark In Practice**

- Invoking Spark Shell
- Creating the Spark Context
- Loading a File in Shell
- Performing Some Basic Operations on Files in Spark Shell
- Caching Overview
- Distributed Persistence
- Spark Streaming Overview (Example: Streaming Word Count)

**Spark: Spark Meets Hive**

- Analyze Hive and Spark SQL Architecture
- Analyze Spark SQL
- Context in Spark SQL
- Implement a sample example for Spark SQL
- Integrating Hive and Spark SQL
- Support for JSON and Parquet File Formats
- Implement Data Visualization in Spark
- Loading of Data
- Hive Queries through Spark
- Performance Tuning Tips in Spark
- Shared Variables: Broadcast Variables & Accumulators

**Spark Streaming**

- Extract and analyze data from Twitter using Spark Streaming
- Comparison of Spark and Storm – Overview

**Spark GraphX**

- Overview of the GraphX module in Spark
- Creating graphs with GraphX

**Introduction To Machine Learning Using Spark**

- Understand the Machine Learning framework
- Implement some of the ML algorithms using Spark MLlib

**Project**

- Consolidate all the learnings
- Working on Big Data Project by integrating various key components

**Projects :-**

**Python Projects**

| Project | Type |
| --- | --- |
| Random password generator | Mini |
| CLI based scientific calculator | Mini |
| Instagram bot | Mini |
| Expense Tracker | Mini |
| Site connectivity checker | Mini |
| Lawn Tennis Match Highlight (can be extended to any sport) | Major |
| NLP library | Major |

**Deep Learning Projects**

| Project | Type |
| --- | --- |
| Churn Modelling using ANN | Mini |
| Image Classification | Mini |
| Image classification using Transfer learning | Major |
| Sentence Classification using RNN, LSTM, GRU | Mini |
| Sentence Classification using word embeddings | Major |
| Object Detection using YOLO | Major |

**Machine Learning Projects**

| Project | Type |
| --- | --- |
| EDA on movies database | Mini |
| House price prediction using Regression | Mini |
| Predict survival on the Titanic using Classification | Mini |
| Image Clustering | Mini |
| Document Clustering | Mini |
| Twitter US Airline Sentiment | Major |
| Restaurant revenue prediction | Major |
| Disease Prediction | Major |

**Note:** The projects above may vary depending on the trainer.

**DataScience Demo Session : –**

DataQubez University creates meaningful Big Data & Data Science certifications that are recognized in the industry as a confident measure of qualified, capable big data experts. How do we accomplish that mission? DataQubez certifications are exclusively hands-on, performance-based exams that require you to complete a set of tasks. Demonstrate your expertise with the most sought-after technical skills. Big data success requires professionals who can prove their mastery of the tools and techniques of the Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years. At DataQubez, we’re drawing on our industry leadership and early corpus of real-world experience to address the Big Data & Data Science talent gap.

**How To Become Certified Data Science Professional Engineer**

Certification Code – DQCP – 501

Certification Description – DataQubez Certified Professional Data Science Engineer

Define and deploy a rack topology script, Change the configuration of a service using Apache Hadoop, Configure the Capacity Scheduler, Create a home directory for a user and configure permissions, Configure the include and exclude DataNode files

Restart a cluster service, View an application’s log file, Configure and manage alerts, Troubleshoot a failed job

Configure NameNode, Configure ResourceManager, Copy data between two clusters, Create a snapshot of an HDFS directory, Recover a snapshot, Configure HiveServer2

Import data from a table in a relational database into HDFS, Import the results of a query from a relational database into HDFS, Import a table from a relational database into a new or existing Hive table, Insert or update data from HDFS into a table in a relational database, Given a Flume configuration file, start a Flume agent, Given a configured sink and source, configure a Flume memory channel with a specified capacity

Write and execute a Pig script, Load data into a Pig relation without a schema, Load data into a Pig relation with a schema, Load data from a Hive table into a Pig relation, Use Pig to transform data into a specified format, Transform data to match a given Hive schema, Group the data of one or more Pig relations, Use Pig to remove records with null values from a relation, Store the data from a Pig relation into a folder in HDFS, Store the data from a Pig relation into a Hive table, Sort the output of a Pig relation, Remove the duplicate tuples of a Pig relation, Specify the number of reduce tasks for a Pig MapReduce job, Join two datasets using Pig, Perform a replicated join using Pig

Write and execute a Hive query, Define a Hive-managed table, Define a Hive external table, Define a partitioned Hive table, Define a bucketed Hive table, Define a Hive table from a select query, Define a Hive table that uses the ORCFile format, Create a new ORCFile table from the data in an existing non-ORCFile Hive table, Specify the storage format of a Hive table, Specify the delimiter of a Hive table, Load data into a Hive table from a local directory, Load data into a Hive table from an HDFS directory, Load data into a Hive table as the result of a query, Load a compressed data file into a Hive table, Update a row in a Hive table, Delete a row from a Hive table, Insert a new row into a Hive table, Join two Hive tables, Set a Hadoop or Hive configuration property from within a Hive query

Frame big data analysis problems as Apache Spark scripts, Optimize Spark jobs through partitioning, caching, and other techniques, Develop distributed code using the Scala programming language, Build, deploy, and run Spark scripts on Hadoop clusters, Transform structured data using SparkSQL and DataFrames

Use MLlib to produce a recommendation engine, Run the PageRank algorithm, Use DataFrames with MLlib, Machine Learning with Spark

Process streaming data using Spark Streaming

Introduction to Linear Regression, Introduction to Regression Section, Linear Regression Documentation, Alternate Linear Regression Data CSV File, Linear Regression Walkthrough, Linear Regression Project

Classification, Classification Documentation, Spark Classification – Logistic Regression, Logistic Regression Amendments, Classification Project

Clustering with Spark & Python, KMeans, Example of KMeans with Spark & Python, Clustering Project

Model Evaluation, Spark Model Evaluation, Spark – Model Evaluation – Regression

Program in R, Create Data Visualizations, Use R to manipulate data easily, Use R for Data Science, Use R for Data Analysis, Use R to handle CSV, Excel, SQL files or web scraping, Use R for Machine Learning Algorithms, Machine Learning with R – Linear Regression, Machine Learning with R – Logistic Regression

For Exam Registration of DataQubez Certified Professional Data Science Engineer, Click here:

The trainer for the Big Data & Data Science course has 11 years of experience in these technologies and is an industry expert. The trainer is Cloudera certified, is also AWS (Solution Architecture) and GCP (Google Cloud Platform) certified, and is a certified data scientist from The University of Chicago.

- Training By 11+ Years experienced Real Time Trainer
- A pool of 200+ real time Practical Sessions on Data Science and Analytics
- Scenarios and Assignments to make sure you compete with current Industry standards
- World class training methods
- Training until the candidate gets placed
- Certification and Placement Support until you get certified and placed
- All training at a reasonable cost
- 10000+ Satisfied candidates
- 5000+ Placement Records
- Corporate and Online Training at a reasonable cost
- Complete End-to-End Project with Each Course
- World-class lab facility with I3/I5/I7 servers and Cisco UCS servers
- Covers topics beyond the books that are required for the IT industry
- Resume and interview preparation with 100% hands-on practical sessions
- Doubt clearing sessions any time after the course
- Happy to help you any time after the course

In the classroom we solve real-time problems, including Kaggle or data world problems, and we encourage every student to build at least a demo model and push his/her code to Git.

At Radical Technologies, we believe that the best way to learn job skills is from industry professionals. So we are building an alternative higher education system in which you can learn job skills from industry experts and get certified by companies. We deliver the course in a classroom format with 85% practical scenarios and complete hands-on work on every point of the course. If a student faces any issue in the future, he/she can also join the next batch. These courses are delivered through a live interactive classroom platform.

Big Data with Cloud Computing (AWS) – Amazon Web Services

Big Data with Cloud Computing (GCP) – Google Cloud Platform

Big Data & Data Science with Cloud Computing (AWS) – Amazon Web Services

Big Data & Data Science with Cloud Computing (GCP) – Google Cloud Platform

Data Science with R & Spark with Python & Scala

Machine Learning with Google Cloud Platform and TensorFlow