This course is about knowledge discovery in databases: searching through large volumes of raw data to find useful patterns of information that are implicitly embedded in it. Data scientists and decision makers can use this information as a new source of advantage and differentiation, or to develop new business models. In layman’s terms, devoid of predictive analytics terminology, some of the tasks that a student will be able to perform after completing the course are:
Broadly speaking, the program’s objectives are two-fold:
This course is divided into three distinct modules. Module 1 covers Data Science and Data Mining. Module 2 consists of projects on industrial data that students will execute. Module 3 covers analytics based on the Hadoop ecosystem. Modules 1 and 2 progress in parallel, while Module 3 follows after Modules 1 and 2 are complete. At the beginning of the course, each participant receives a Virtual Machine fully equipped with all the software tools, packages and data needed for the work. The modules and the Virtual Machine are described more fully below.
Introduction to Big Data. Data exploration and data cleaning using RStudio and Python. Data visualization and story-telling with RStudio, Python and Tableau/MS Power BI. Data pre-processing, integration and data transformations. Business Forecasting: Exponential smoothing, Holt-Winters smoothing, ARIMA and hybrid approaches. Data Mining: Measures of proximity; Cluster Analysis: K-means and agglomerative clustering; Density-based clustering methods; Kohonen self-organizing maps.
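As a taste of the Business Forecasting topic listed above, here is a minimal sketch of simple exponential smoothing in plain Python. The function name, the smoothing factor and the sales figures are illustrative assumptions and are not taken from the course material; the first smoothed value is simply initialised to the first observation, which is one common convention among several.

```python
def simple_exponential_smoothing(series, alpha):
    """Smooth a series with s_t = alpha * x_t + (1 - alpha) * s_(t-1).

    The first smoothed value is set to the first observation
    (an assumed initialisation; other choices exist).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]
    for x in series[1:]:
        # Blend the new observation with the previous smoothed level.
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Made-up monthly sales figures, for illustration only.
monthly_sales = [10.0, 12.0, 14.0, 13.0]
levels = simple_exponential_smoothing(monthly_sales, alpha=0.5)
print(levels)  # [10.0, 11.0, 12.5, 12.75]
```

The last smoothed level (12.75 here) serves as the one-step-ahead forecast. Holt-Winters smoothing extends the same idea with additional trend and seasonal components.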
Classification Analysis: Decision tree induction; Neural networks; Support Vector Machines; Naïve Bayes method; Ensemble methods (Random Forest, Gradient Boosting Machines, Extreme Gradient Boosting); Auto-encoders and Deep Learning. Model tuning: Cross-validation and grid search; Evaluating classification results: ROC, AUC, precision, specificity and the kappa metric.
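Three of the evaluation measures named above (precision, specificity and Cohen's kappa) can be computed directly from the four cells of a binary confusion matrix. The sketch below is illustrative only: the function name and the counts are assumptions, and it assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, specificity and Cohen's kappa for a binary classifier.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives. Assumes no denominator is zero.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)      # share of predicted positives that are correct
    specificity = tn / (tn + fp)    # share of actual negatives correctly identified
    observed = (tp + tn) / total    # observed agreement (plain accuracy)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return {"precision": precision, "specificity": specificity, "kappa": kappa}


# Made-up confusion-matrix counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Kappa corrects accuracy for the agreement that would be expected by chance, which makes it more informative than raw accuracy when the classes are imbalanced.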
This is the projects module. We have experience with a number of industrial projects, and an e-book illustrating the projects we have executed is available for download. Students work on these projects both to apply the techniques they have learnt and as part of the weekly exercises. Ta Feng Grocery Store: Segmenting customers; Kaggle Project: Airbnb, what would be the next destination of a visitor; Kaggle Project: Otto Group, Product Classification Challenge; Kaggle Project: Rossmann Sales Revenue Prediction; Kaggle Project: Predict TFI Restaurant Sales; Kaggle Project: Determine whether to send a direct mail piece to a customer; Kaggle Project: Predict handwritten digits; Kaggle Project: Santander Customer Satisfaction; Kaggle Project: Predict Biological Response to drug molecules; Kaggle Project: Housing Prices, Advanced Regression Techniques.
Hadoop ecosystem: Developing familiarity with the Hadoop filesystem and Hadoop cluster; using the Hadoop shell and Hue; Data extraction with Hive, Pig, Tez and Impala using SQL; Developing analytical models (such as Recommender Systems and Logistic Regression) with SparkR, H2O and Mahout over Hadoop. Frequent itemset mining, data exploration and development of predictive analytical models on BigML Cloud.
At the commencement of the course, each participant is given a virtual machine that can be installed on a Windows, Mac or Linux laptop or desktop with 4 GB of RAM. The virtual machine contains all the software tools that the participant will work with, along with the data. All installed software is fully licensed. The virtual machine makes it easy for participants to practice the weekly exercises at home or at the workplace. Applications installed on the VM are as follows:
Besides this virtual machine, we have a separate Hadoop cluster of ten machines with Cloudera server installed (with around 120 GB of RAM). This large cluster lets participants work comfortably in the computer lab. The course content is also available as a downloadable file.