stats

Accuracy and precision cannot be used interchangeably: the former is true to intention (the degree of closeness of a measured value to the true value), while the latter is true to itself (the degree of closeness of repeated measured values to one another).

Probability and likelihood are different terms: the former finds the chance of outcomes given a data distribution, while the latter finds the most likely distribution given the observed outcomes.
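A minimal sketch of the distinction in Python (using numpy and scipy; the data values are made up for illustration): probability fixes the distribution and asks about outcomes, while likelihood fixes the observed outcomes and compares candidate parameters.

```python
import numpy as np
from scipy.stats import norm

# Probability: fix the distribution (mu=0, sigma=1), ask about an outcome.
p = norm.cdf(1.0) - norm.cdf(-1.0)  # P(-1 < X < 1) under N(0, 1)
print(f"P(-1 < X < 1 | mu=0, sigma=1) = {p:.3f}")

# Likelihood: fix the observed data, vary the parameter (here the mean).
data = np.array([0.8, 1.1, 0.9, 1.3, 1.0])
for mu in (0.0, 0.5, 1.0):
    log_lik = norm.logpdf(data, loc=mu, scale=1.0).sum()
    print(f"log-likelihood of data given mu={mu}: {log_lik:.3f}")
# The mu with the highest likelihood (close to the sample mean ~1.0)
# is the "most likely" parameter for these outcomes.
```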

DESCRIPTIVE STATISTICS: Summarizes and describes the (smaller) sample data at hand.

INFERENTIAL STATISTICS: Draws conclusions about the (larger) population from the sample.

CENTRAL LIMIT THEOREM: The central limit theorem (CLT) states that, given a sufficiently large sample size, the sampling distribution of the mean for a variable approximates a normal distribution regardless of that variable's distribution in the population.

CLT is vital for two reasons — the normality assumption and the precision of the estimates.

The normality assumption is vital for parametric hypothesis tests of the mean. Consequently, you might think that these tests are not valid when the data are non-normally distributed. However, if your sample size is large enough, CLT kicks in and produces sampling distributions that approximate a normal distribution. This fact allows you to use these hypothesis tests even when your data are non-normally distributed as long as your sample size is large enough.

The 'precision of estimates' property of CLT becomes relevant when using a sample to estimate the mean of an entire population. With a larger sample size, your sample mean is more likely to be close to the real population mean. In other words, your estimate is more precise.
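A quick simulation can make both points concrete. The sketch below (a Python illustration, not from the original text; the exponential population is an arbitrary skewed choice) draws many samples from a non-normal distribution and checks that their means behave as the CLT predicts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Population: a heavily right-skewed exponential distribution.
# The CLT says means of sufficiently large samples from it are ~normal.
n, n_samples = 50, 10_000
sample_means = rng.exponential(scale=2.0, size=(n_samples, n)).mean(axis=1)

print(f"population mean: 2.0, mean of sample means: {sample_means.mean():.3f}")
# The CLT also predicts the spread of the sample means: sigma / sqrt(n),
# which is why larger samples give more precise estimates.
print(f"predicted SE: {2.0 / np.sqrt(n):.3f}, "
      f"observed: {sample_means.std(ddof=1):.3f}")
```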

Depending on your goal and the data, you select a test.

If the goal is to quantify an association between two variables, we check the Pearson correlation for parametric data and the Spearman correlation for non-parametric data. If the goal is to predict a target from one or more variables, we perform simple regression (one predictor) or multiple regression (two or more predictors) for parametric data. If we have to compare unpaired (independent) groups, we perform an unpaired t-test for two groups (or one-way ANOVA for three or more groups) for parametric data, and the Mann-Whitney test (two groups) for non-parametric data.
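A minimal sketch of these tests with scipy.stats (the synthetic data here are arbitrary illustrations, not from the original text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(scale=0.5, size=100)      # correlated with x
a, b = rng.normal(0.0, 1.0, 60), rng.normal(0.3, 1.0, 60)  # two groups

# Association between two variables.
print(stats.pearsonr(x, y))    # parametric
print(stats.spearmanr(x, y))   # non-parametric (rank-based)

# Comparing two unpaired (independent) groups.
print(stats.ttest_ind(a, b))      # parametric
print(stats.mannwhitneyu(a, b))   # non-parametric
```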

Parametric test:-

Assumption: The data follow a normal distribution: https://en.wikipedia.org/wiki/Normal_distribution


Non-parametric test:-

No distributional assumption about the data


HYPOTHESIS TESTS: Depending on the data types and the number of samples, an appropriate hypothesis test is chosen.

Traditional testing is called frequentist (non-Bayesian). It asks how often an outcome happens over repeated runs (repeated sampling) of the experiment, giving an objective view of whether an experiment is repeatable. Bayesian hypothesis testing is a subjective view of the same question: it takes into account how much faith you have in your results, including prior knowledge about the data and personal beliefs about the results.
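A minimal sketch contrasting the two views on a made-up coin-flip experiment (assuming scipy; the uniform Beta(1, 1) prior is an arbitrary choice for illustration):

```python
from scipy import stats

heads, flips = 62, 100

# Frequentist: how surprising is this outcome over repeated runs
# if the coin were fair? (two-sided binomial test)
res = stats.binomtest(heads, flips, p=0.5)
print(f"p-value: {res.pvalue:.4f}")

# Bayesian: start from a prior belief (here a uniform Beta(1, 1) prior)
# and update it with the data to get a posterior over the coin's bias.
posterior = stats.beta(1 + heads, 1 + flips - heads)
print(f"P(bias > 0.5 | data) = {1 - posterior.cdf(0.5):.4f}")
```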


There's a data classification based on privacy, security, risk management and regulatory compliance: public, confidential, restricted and internal.

For more: https://en.wikipedia.org/wiki/F-test https://en.wikipedia.org/wiki/Analysis_of_variance

MEASURES OF CENTRAL TENDENCY


Mean: The arithmetic average of the values in a dataset.

Mode: The value that occurs most often in a dataset.

Median: The middle value when a dataset is ordered from least to greatest.
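A minimal sketch computing all three with Python's standard statistics module (the dataset is made up for illustration):

```python
import statistics

data = [2, 3, 3, 5, 7, 8, 8, 8, 10]

print(statistics.mean(data))    # arithmetic average -> 6
print(statistics.median(data))  # middle value of the ordered data -> 7
print(statistics.mode(data))    # most frequent value -> 8
```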


A violin plot shows the shape (density distribution) of the data, which a boxplot does not, and is therefore well suited to exploring skewed data.
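A minimal plotting sketch with matplotlib (the log-normal data are an arbitrary skewed example, not from the original text):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=0.75, size=500)  # right-skewed data

fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.boxplot(skewed)                       # summary statistics only
ax2.violinplot(skewed, showmedians=True)  # full density shape is visible
ax1.set_title("boxplot")
ax2.set_title("violin plot")
plt.show()
```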


There are power transformations that variables need to undergo if they follow right-skewed or left-skewed distributions. Parametric machine learning models like linear regression assume the real-valued variables in the input data have Gaussian distributions. Non-parametric models like kNN do not make this assumption, yet they often are more reliable and perform better when the input variables have Gaussian distributions. As such, variables with skewed distributions, or with non-Gaussian distributions altogether, need transformation. Power transforms refer to a class of techniques utilizing a power function (like a logarithm or exponent) to make the probability distribution of a variable Gaussian.

Gaussian (normal) distribution: https://ranjas.substack.com/p/why-the-gaussian

There are two popular approaches for automatic power transforms:

• Box-Cox Transform
• Yeo-Johnson Transform

They find a parameter (lambda) that best transforms a variable; for example, lambda = -1 is a reciprocal transform, lambda = 0 is a log transform, and lambda = 0.5 is a square root transform.
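A minimal sketch using scikit-learn's PowerTransformer, which supports both methods (the log-normal input is an arbitrary skewed example):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(7)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # right-skewed

# Box-Cox requires strictly positive inputs; Yeo-Johnson also handles
# zero and negative values. Both estimate lambda by maximum likelihood.
for method in ("box-cox", "yeo-johnson"):
    pt = PowerTransformer(method=method)
    transformed = pt.fit_transform(skewed)
    print(method, "estimated lambda:", pt.lambdas_[0])
```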

MEASURES OF DISPERSION: Range, quartile deviation and interquartile range (quartile deviation is half of the interquartile range), variance, standard deviation
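A minimal sketch computing these measures with numpy (the dataset is made up for illustration):

```python
import numpy as np

data = np.array([4, 7, 9, 10, 12, 15, 21], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
print("range:", data.max() - data.min())
print("interquartile range:", q3 - q1)
print("quartile deviation:", (q3 - q1) / 2)   # half of the IQR
print("variance:", data.var(ddof=1))          # sample variance
print("standard deviation:", data.std(ddof=1))
```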


✅ It is worth mentioning that the standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean, whereas the standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean. Hence, standard error and standard deviation are different terms.

For more: https://en.wikipedia.org/wiki/Standard_error
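A minimal sketch of the distinction (assuming numpy and scipy; the sample is synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=10.0, scale=2.0, size=40)

sd = sample.std(ddof=1)         # spread of individuals within the sample
se = sd / np.sqrt(len(sample))  # likely error of the sample mean itself
print(f"SD = {sd:.3f}, SE = {se:.3f}")
print(f"scipy's sem: {stats.sem(sample):.3f}")  # same as SE above
```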


STATISTICAL MODELS

Discriminative models leverage conditional probability distributions P(y|x), while generative models leverage joint (non-conditional) probability distributions P(x, y).
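A minimal sketch of the contrast using scikit-learn, with logistic regression as a discriminative model and Gaussian naive Bayes as a generative one (the synthetic dataset is an arbitrary illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Discriminative: models P(y | x) directly.
disc = LogisticRegression().fit(X, y)

# Generative: models the joint P(x, y) = P(x | y) P(y), then
# applies Bayes' rule to classify.
gen = GaussianNB().fit(X, y)

print("logistic regression P(y|x):", disc.predict_proba(X[:1]))
print("naive Bayes (via joint):  ", gen.predict_proba(X[:1]))
```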

