Data science

Before we can perform statistical analysis, we need to have a thorough understanding of probability and statistics. Primarily this is accomplished through courses previous to this course. But, we did review some of this material.

Probability and statistics

Once you have a solid understand of probability and statistics, you are able to start analyzing data and providing value to your organization. But there is a lot more to data analysis than fitting a model and interpreting it.

Pipeline

R4DS contains a figure demonstrating steps in the data science pipeline.

We have discussed many steps in this pipeline including

R programming

The suggestion in R4DS is to implement this process progammatically. This allows modifications to any step can be quickly implemented and all downstream steps to be updated accordingly. In this course, we have focused on the R programming language to implement scripts that perform all of the tasks above.

Communicate

A final step in the diagram above is “Communicate” which is the step where you translate all of the work you have done into non-jargon language or graphics for presentation to non-statisticians. At this point, you need to know how to provide your analyses in formats that non-statisticians can understand. This typically means you need to limit code and mathematics as much as possible.

The step of data science is moving quickly. We’ll talk about the following

  • Communicate
    • R scripts
    • Rmarkdown
    • LaTeX/knitr
    • Quatro

Feedback loops

The steps indicated in the diagram above are concentrated on the topics discussed in R4DS. The diagram only includes one feedback loop in the (Transform \(\to\) Visualize \(\to\) Model \(\to\) Transform) loop. In reality, there are many more feedback loops as seen in the picture below.

As you visualize data, you will see suspicious data. Best practice is to review the data and, if there is an error in the data, “amend” the raw data and keep a log of the change. As you communicate results to stakeholders, you will realize that you are not providing them with the results they are interested in or in a format that they understand. This may mean going back to the tidying, transforming, or modeling steps to incorporate this feedback.

Reproducibility

Due to all these feedback loops, you really want your process to be reproducible and easy to update and rerun. Even if you ultimately end up implementing your analyses in a different language, e.g. python, SAS, etc., the concept here will be the same.