Preparing Statistician to be Successful Big Data Scientist

Outline/Description: With the recent big data revolution, enterprises ranging from Fortune 500 companies to startups across the US are hungry for data scientists who can turn the data they collect into valuable business insight. Statisticians are great data scientist candidates, but relatively few data scientists have a statistics education background. In this short course, we will walk through the data science knowledge and skills statisticians need (such as deep learning and big data platforms) using hands-on exercises, to prepare them to be successful data scientists. Data science is a combination of science and art with data as the foundation. We will also cover the art part, guiding participants through the typical data science project flow, common pitfalls in data science projects, and the soft skills needed to communicate effectively with business stakeholders. The Databricks Community Edition cloud platform and RStudio will be used to cover programming and platforms (such as Spark, Hadoop, GPU, SQL, and R) and typical machine learning algorithms (including examples of unsupervised learning and deep learning). The prerequisites are an MS-level education in statistics and entry-level knowledge of R. This is an enhanced version of a similar highly rated, full-day training course (CE 11C) offered at JSM 2017 in Baltimore, with updated material that reflects students' suggestions and new trends in data science.

  1. Introduction to data science. In this section, we will introduce the history of and trends in data science. We will list the typical requirements for a successful data scientist, run a self-evaluation so that participants can find their skill gaps, and give recommendations for bridging those gaps. After this section, participants will have a good understanding of what data scientists do and will know their own skill and knowledge gaps.

  2. Deep Learning Lecture. In this section, we will briefly introduce the history of deep learning and the essential concepts needed for deep learning applications. Then we will introduce the feed forward neural network (FFNN) and the convolutional neural network (CNN) using the MNIST handwritten digits example. The R package keras will be used to show how to build FFNN and CNN models.

  3. Data Preprocessing Using the R Pipe. For R users who are not familiar with writing code in the R pipe (%>%) style, this brief section introduces the pipe, which is used in most of the hands-on sessions.

  4. Databricks account setup. In this section, we will walk through the steps to apply for and set up a free Databricks Community Edition account. All the hands-on sessions will run in this account.

  5. Deep Learning Hands On. In this section, we will walk through all the steps to (1) create a cloud computing node, (2) create a notebook using R, (3) import a notebook of deep learning applications with FFNN and CNN, (4) work through the FFNN model step by step, and (5) work through the CNN model step by step.

  6. Data Preprocessing & Wrangling. In this section, we will walk through the major steps in data preprocessing and wrangling.

  7. Big Data Cloud Platform. In this section, we will introduce the big data cloud platform and the steps to use R to interact directly with Spark dataframes for big data applications.

  8. Data Preprocessing & Wrangling Hands On. In this section, we will use Databricks Community Edition to walk through the data preprocessing and wrangling steps.

  9. Big Data Cloud Platform Hands On. In this section, we will use Databricks Community Edition to walk through the big data cloud platform steps.

  10. Soft Skills and Data Science Project Cycle. In this section, we will introduce the soft skills that are essential for data science projects in enterprise environments. We will talk about basic project management skills with agile concepts and how to communicate effectively with business partners to define and solve data science problems. We will illustrate how statisticians can lead with confidence given their strong technical background.

  11. Build static personal website using SSG+Netlify. In this section, we will introduce how to quickly build a personal professional website with a static site generator (SSG) and Netlify, for future career advancement opportunities.
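As a rough illustration of the pipe style mentioned in item 3, the sketch below compares nested function calls with an equivalent `%>%` chain. It assumes the dplyr package is installed; the `sales` data frame and its column names are made up for this example and are not taken from the course material.

```r
# Minimal sketch of the R pipe (%>%) style, assuming dplyr is installed.
# The data frame below is hypothetical, not course data.
library(dplyr)

sales <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  amount = c(100, NA, 250, 300, 50)
)

# Nested style: the steps must be read from the inside out.
nested <- summarise(
  group_by(filter(sales, !is.na(amount)), region),
  total = sum(amount)
)

# Pipe style: each step passes its result to the next, top to bottom.
piped <- sales %>%
  filter(!is.na(amount)) %>%      # drop rows with a missing amount
  group_by(region) %>%            # one group per region
  summarise(total = sum(amount)) %>%
  arrange(desc(total))            # largest total first

print(piped)
```

Both versions compute the same per-region totals; the piped version simply lists the preprocessing steps in the order they are applied, which tends to be easier to read as the number of steps grows.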

Tentative Schedule

| Topic | Time |
| --- | --- |
| Introduction | 15 min |
| Deep Learning Lecture | 15 min + 45 min + 45 min |
| Break | 20 min |
| R Pipe %>% | 10 min |
| Databricks account setup | 30 min |
| Deep Learning Hands on Session | 60 min |
| Lunch break | |
| Data Preprocessing & Wrangling | 45 min |
| Data Preprocessing & Wrangling Hands on | 30 min |
| Big Data Cloud Platform Lecture | 45 min |
| Break | 20 min |
| Big Data Cloud Platform Hands on | 60 min |
| Soft Skill and Project Cycle | 20 min |
| Build static personal website using SSG+Netlify | 20 min |

Useful links:

  1. Databricks free Community Edition account sign-up:

  2. Notebook that contains all the steps in the deep learning section:

  3. Notebook that contains all the steps in the big data platform section:

  4. Data preprocessing code:

  5. Data wrangling code:
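
The course notebooks are not reproduced here. As a rough idea of what the FFNN part of the deep learning section might look like with the R keras package, here is a minimal sketch. It assumes the keras package and a working TensorFlow backend are installed; the layer sizes, dropout rate, optimizer, and number of epochs are illustrative choices, not the settings used in the course notebook.

```r
# Minimal FFNN sketch on MNIST with the R keras package.
# Assumes keras and a TensorFlow backend are installed.
# All hyperparameters below are illustrative assumptions.
library(keras)

# Load MNIST: 28 x 28 grayscale images of handwritten digits 0-9.
mnist <- dataset_mnist()

# Flatten each image to a vector of 784 pixels and scale to [0, 1].
x_train <- array_reshape(mnist$train$x, c(nrow(mnist$train$x), 784)) / 255
x_test  <- array_reshape(mnist$test$x,  c(nrow(mnist$test$x), 784)) / 255

# One-hot encode the digit labels into 10 classes.
y_train <- to_categorical(mnist$train$y, 10)
y_test  <- to_categorical(mnist$test$y, 10)

# One hidden layer with dropout, then a softmax output layer.
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_rmsprop(),
  metrics = c("accuracy")
)

# Train briefly, holding out 20% of the training data for validation.
history <- model %>% fit(
  x_train, y_train,
  epochs = 2, batch_size = 128,
  validation_split = 0.2
)

# Report loss and accuracy on the held-out test set.
model %>% evaluate(x_test, y_test)
```

A CNN version would keep the images as 28 x 28 x 1 arrays instead of flattening them and would place convolution and pooling layers before the dense layers.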