Statistics for Data Science

Dec 1, 2020

Hi there,
This is the first article in a series that covers the statistics and probability theory we need to learn data science. We will start here and gradually move into data analysis using the statistics we learn. Then we will build and use machine learning algorithms to solve real-world data science problems. Of course, if you want to get started with data science, knowing programming is a necessity, so I invite you to check out my Python tutorial series as well.

From my previous article, where we introduced data science, we know that data science is the field of study that uses mathematics, programming, and domain knowledge to extract meaningful insights from data. Data scientists apply machine learning algorithms to data such as numbers, text, images, videos, and more to produce artificial systems that perform tasks that normally require human intelligence. Using these systems, we extract insights that can help a business grow.
So we know that mathematics and statistics are essential for learning data science. They are the backbone of every machine learning algorithm we will be using in the data science field. Having a good knowledge of maths and statistics will help you understand data as well as apply algorithms to it.

Statistics

Statistics is the science of learning from data. It is used to work through complex real-world problems so that data scientists and analysts can look for meaningful trends and changes in data. In simple words, statistics can be used to derive meaningful insights from data by performing mathematical computations on it.

Data Analysis and Data Analytics

In data science, we will be using math and statistics for both data analysis and data analytics.

Data analysis is the process of inspecting, cleaning, transforming, and modeling data with the aim of finding useful information, suggesting conclusions, and supporting decision-making.

Data analytics, on the other hand, uses data, machine learning, statistical analysis, and computer-based models to gain better insight and make better decisions from the data. Analytics is defined as “a process of transforming data into actions through analysis and insight in the context of organizational decision making and problem-solving.”

So, in a nutshell, data analysis looks at the past using the data we have collected, while data analytics uses that data to predict what could happen in the future.

Data types

Data types are an important concept in statistics. We need to understand them in order to apply the right statistical measures to our data and to draw correct conclusions from it.

There are 2 main types of data we will be dealing with:

Categorical data
Numerical data

What is Categorical Data (Qualitative data)?

Categorical data is a type of data that can be sorted into groups or categories using names or labels. This grouping is usually made according to the characteristics and similarities of the data. For example, the gender recorded on a survey form is categorical data because each response falls under a label such as male or female.

There are 2 main types of categorical data:

Nominal data - Nominal data is named data that can be separated into discrete categories that do not overlap and have no inherent order.
E.g. - Gender: male and female.
Ordinal data - Ordinal data is data that is placed into some kind of order or scale.
E.g. - Rating happiness on a scale of 1–10.
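
The difference between nominal and ordinal data can be sketched in a few lines of Python. This is only an illustration: the blood type and satisfaction values below are made-up example data, not taken from any real survey. It shows that counting labels makes sense for nominal data, while sorting and picking a middle value only make sense once the labels have an order.

```python
from collections import Counter

# Nominal data: labels with no inherent order (hypothetical survey answers).
blood_types = ["A", "O", "B", "O", "AB", "O", "A"]

# Counting labels and finding the most frequent one is meaningful for nominal data.
counts = Counter(blood_types)
most_common_label, most_common_count = counts.most_common(1)[0]

# Ordinal data: labels with a meaningful order (hypothetical satisfaction scale).
scale = ["low", "medium", "high"]
rank = {label: position for position, label in enumerate(scale)}
responses = ["medium", "high", "low", "high", "medium"]

# Sorting by rank and taking the middle response is meaningful for ordinal data.
ordered = sorted(responses, key=rank.get)
median_response = ordered[len(ordered) // 2]

print(most_common_label, most_common_count)  # O 3
print(ordered)          # ['low', 'medium', 'medium', 'high', 'high']
print(median_response)  # medium
```

Note that taking an average of the ordinal labels would not be meaningful, because the gaps between "low", "medium", and "high" are not guaranteed to be equal.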

What is Numerical Data? (Quantitative data)

Numerical data is a type of data that is expressed in terms of numbers rather than natural language descriptions. It can only be collected in number form. This numerical data type can be used as a form of measurement, such as a person’s height, weight, IQ, etc.

It can also be used to carry out arithmetic operations like addition, subtraction, multiplication, and division.

There are 2 types of numerical data:

Discrete data - Within a range, there are certain values this variable can't take. It can only take specific, countable values.
E.g. - Number of people. (There can't be a fractional number of people.)
Continuous data - These variables can take any value within a range.
E.g. - Height of a person. (There can be decimal values in the range.)
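
As a small illustration of discrete versus continuous data, here is a Python sketch. The household sizes and heights are hypothetical values, and `is_discrete` is a helper name invented for this example. It is only a rough heuristic: it checks whether every recorded value is a whole number.

```python
# Discrete data: counts that can only take whole-number values (hypothetical).
people_per_household = [3, 4, 2, 5, 4]

# Continuous data: measurements that can take any value in a range (hypothetical).
heights_cm = [162.5, 170.2, 158.9, 181.0, 175.4]

def is_discrete(values):
    """Return True if every value is a whole number (a rough heuristic)."""
    return all(float(v).is_integer() for v in values)

print(is_discrete(people_per_household))  # True
print(is_discrete(heights_cm))            # False
```

Keep in mind that this check looks only at how the values were recorded. Heights rounded to the nearest centimeter would pass the check even though height itself is a continuous quantity.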

Population data vs sample data

Population

The population is the entire group we want to collect data from and draw conclusions about. Our data is the information collected from the population. The population is always defined first, before starting the data collection process for any statistical study.

In research, a population doesn’t always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, organisms, etc.

Sample

A sample is the part of the population that is selected for the study, ideally at random. The sample should be selected so that it represents the characteristics of the population. The process of selecting the subset from the population is called sampling, and the subset selected is called the sample.
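
Random sampling can be sketched with Python's standard `random` module. In this example the population is just the hypothetical ID numbers 1 to 10,000, and `draw_sample` is a helper name made up for illustration. It draws a simple random sample without replacement, so no member is picked twice.

```python
import random

def draw_sample(population, sample_size, seed=None):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, sample_size)

# Hypothetical population: ID numbers of 10,000 members.
population = list(range(1, 10001))

# Select 100 members at random to form the sample.
sample = draw_sample(population, 100, seed=42)

print(len(sample))       # 100
print(len(set(sample)))  # 100, since every member appears at most once
```

Simple random sampling is only one way to choose a sample. It is used here as an assumption because it is the easiest way to give every member an equal chance of being selected.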

Categories in statistics

The field of statistics is composed of two broad categories: descriptive statistics and inferential statistics.

Each of them gives us different insights into the data. One alone doesn’t give us the complete picture of our data, but using both of them together gives us a powerful tool for description and prediction.

Descriptive statistics (Exploratory data analysis)

It describes the important characteristics or properties of the data using measures of central tendency like the mean, median, and mode, and measures of dispersion like the range, variance, and standard deviation. To summarize and represent the data, we use charts, tables, and graphs.

For example,
Suppose we have the marks of 10,000 students. We may be interested in the overall performance of those students, as well as the distribution and the spread of their marks.

Descriptive statistics give us the tools to describe our data in the most understandable and appropriate way.
Tools:
Visualization, measures of central tendency, and measures of the spread of the data.
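
The measures mentioned above can be computed with Python's built-in `statistics` module. The marks below are a small made-up list, not real student data, and the sketch assumes the list is the whole group of interest, so it uses the population versions of variance and standard deviation (`pvariance` and `pstdev`).

```python
import statistics

# Hypothetical marks for a small group of students.
marks = [55, 62, 62, 70, 71, 78, 85, 90]

# Measures of central tendency.
mean_mark = statistics.mean(marks)
median_mark = statistics.median(marks)
mode_mark = statistics.mode(marks)

# Measures of dispersion (treating the list as the whole population).
mark_range = max(marks) - min(marks)
variance = statistics.pvariance(marks)
std_dev = statistics.pstdev(marks)

print(mean_mark)    # 71.625
print(median_mark)  # 70.5
print(mode_mark)    # 62
print(mark_range)   # 35
print(variance, std_dev)
```

If the marks were only a sample from a larger group, `statistics.variance` and `statistics.stdev` would be the more appropriate choices.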

Inferential Statistics

It is about using data from a sample and then making inferences about the larger population from which the sample is drawn.

The goal of inferential statistics is to draw conclusions from a sample and generalize them to the population. It uses probability theory to measure how confident we can be that what we see in the sample also holds for the population. Inferential statistics are valuable when examining each member of an entire population is not convenient or possible.

The most common methodologies used are hypothesis tests, analysis of variance, etc.

For example,
You might have a list of information on 100 people (your “sample”) out of 10,000 people (the “population”). You can use that list to make inferences about the entire population’s behavior.
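
To make this concrete, here is a minimal Python sketch that estimates a population mean from a sample of 100 out of 10,000. Everything in it is an assumption for illustration: the population is simulated with made-up values, `mean_confidence_interval` is a helper name invented here, and the interval uses a normal approximation, which is reasonable for a sample of this size but is only one of several possible approaches.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean (normal approximation)."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z value for the requested confidence level (about 1.96 for 95%).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Simulated population of 10,000 values (hypothetical data).
rng = random.Random(0)
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Draw a sample of 100 and infer a range for the population mean.
sample = rng.sample(population, 100)
low, high = mean_confidence_interval(sample)

print(f"Sample mean: {statistics.mean(sample):.2f}")
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
print(f"Population mean: {statistics.mean(population):.2f}")
```

In a real study we would not know the population mean. It is printed here only because the population is simulated, which lets us see how close the estimate from the sample gets.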

That pretty much covers the introduction to statistics and some terms you need to know before you start learning statistics. In the next article, we will start talking about descriptive statistics.

So thank you guys for reading this article and I hope you enjoyed it.

See you soon. bye.