Summary Statistics In Python

Summary statistics condense a large dataset into a handful of numbers that are easier to reason about. They describe the central tendency, spread, and shape of the data, and can help identify outliers and potential data-quality issues. In this article, we will discuss how to calculate standard summary statistics for both quantitative and qualitative variables using Python.

Summary Statistics for Quantitative Variables

Quantitative variables are variables that have numerical values and can be measured on a continuous or discrete scale. Examples of quantitative variables include age, height, weight, and income. The most common summary statistics for quantitative variables are:

Mean

The mean is the average value of a dataset and is calculated by summing all the values and dividing by the total number of observations.

import numpy as np

data = [1, 2, 3, 4, 5]
mean = np.mean(data)
print(mean)  # Output: 3.0

Median

The median is the middle value of a dataset and is calculated by sorting the values and selecting the middle value. If there are an even number of values, the median is the average of the two middle values.

import numpy as np

data = [1, 2, 3, 4, 5]
median = np.median(data)
print(median)  # Output: 3.0
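
The example above has an odd number of values. As a minimal sketch of the even case, the snippet below sorts a small made-up list by hand, averages the two middle values, and compares the result with np.median. The data and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical data with an even number of values
data = [7, 1, 5, 3]

# Manual calculation: sort, then average the two middle values
ordered = sorted(data)  # [1, 3, 5, 7]
mid = len(ordered) // 2
manual_median = (ordered[mid - 1] + ordered[mid]) / 2

print(manual_median)    # Output: 4.0
print(np.median(data))  # Output: 4.0
```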

Mode

The mode is the most common value in a dataset and is calculated by counting the frequency of each value and selecting the value with the highest frequency.

from statistics import mode

data = [1, 2, 3, 4, 5, 5, 5]
most_common = mode(data)  # a separate name avoids shadowing the mode() function
print(most_common)  # Output: 5
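
A dataset can have more than one value tied for the highest frequency. In Python 3.8 and later, statistics.mode returns the first such value it encounters, which can hide the tie. One possible way to see every mode is statistics.multimode, sketched here on made-up data.

```python
from statistics import multimode

# Hypothetical data in which 2 and 5 both appear twice
data = [1, 2, 2, 3, 5, 5]

# multimode returns all of the most frequent values, in first-seen order
modes = multimode(data)
print(modes)  # Output: [2, 5]
```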

Range

The range is the difference between the maximum and minimum values in a dataset.

import numpy as np

data = [1, 2, 3, 4, 5]
data_range = np.max(data) - np.min(data)  # avoid naming a variable "range", which shadows the built-in
print(data_range)  # Output: 4

Variance and Standard Deviation

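The variance and standard deviation measure the spread, or dispersion, of a dataset. The variance is the average squared deviation from the mean, and the standard deviation is the square root of the variance. By default, np.var and np.std divide by the number of observations, which gives the population versions of these statistics.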

import numpy as np

data = [1, 2, 3, 4, 5]
variance = np.var(data)
std_dev = np.std(data)
print(variance)  # Output: 2.0
print(std_dev)   # Output: 1.4142135623730951
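
If the data are a sample rather than the whole population, the usual estimate divides by n - 1 instead of n. In NumPy this is selected with the ddof argument. The sketch below contrasts the two on the same data; the variable names are illustrative.

```python
import numpy as np

data = [1, 2, 3, 4, 5]

# Population variance: divide by n (NumPy's default, ddof=0)
pop_var = np.var(data)

# Sample variance and standard deviation: divide by n - 1 (ddof=1)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print(pop_var)     # Output: 2.0
print(sample_var)  # Output: 2.5
print(sample_std)  # square root of 2.5, roughly 1.58
```

Note that pandas takes the opposite default: Series.var and Series.std use ddof=1.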

Summary Statistics for Qualitative Variables

Qualitative variables are variables that have categorical values and cannot be measured on a numerical scale. Examples of qualitative variables include gender, race, and educational level. The most common summary statistics for qualitative variables are:

Frequency Table

A frequency table is a table that shows the frequency or count of each category in a dataset.

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)
freq_table = df['Gender'].value_counts()

print(freq_table)
"""
Output:
Male      3
Female    2
Name: Gender, dtype: int64
"""
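
When the data is a plain Python list rather than a DataFrame, the same counts can be obtained without pandas. The following is a small sketch using collections.Counter from the standard library; the list name is an assumption.

```python
from collections import Counter

genders = ['Male', 'Female', 'Male', 'Male', 'Female']

# Counter tallies how often each category occurs
counts = Counter(genders)

# most_common() lists categories from most to least frequent
for category, count in counts.most_common():
    print(category, count)
# Output:
# Male 3
# Female 2
```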

Proportions and Percentages

Proportions and percentages are measures of the relative frequency of each category in a dataset. The proportion is the frequency of a category divided by the total number of observations, while the percentage is the proportion multiplied by 100.

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)
prop_table = df['Gender'].value_counts(normalize=True)
prop_table *= 100

print(prop_table)
"""
Output:
Male      60.0
Female    40.0
Name: Gender, dtype: float64
"""
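
Counts and percentages are often reported side by side. One way to do this, sketched here with illustrative column names Count and Percent, is to build both Series and combine them into a single table.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']})

counts = df['Gender'].value_counts()
percentages = df['Gender'].value_counts(normalize=True) * 100

# Both Series are indexed by category, so pandas aligns them row by row
summary = pd.DataFrame({'Count': counts, 'Percent': percentages})
print(summary)
```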

Mode

The mode is also defined for qualitative variables: it is the category that occurs most often, found by counting the frequency of each category and selecting the one with the highest count.

from statistics import mode

data = ['Red', 'Blue', 'Green', 'Green', 'Blue', 'Red', 'Red']
most_common = mode(data)  # a separate name avoids shadowing the mode() function
print(most_common)  # Output: Red

Chi-Square Test

The chi-square test is a statistical test used to determine if there is a significant association between two qualitative variables. It compares the observed frequencies of each category to the expected frequencies if there were no association between the variables.

import pandas as pd
from scipy.stats import chi2_contingency

data = {'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'Yes']}
df = pd.DataFrame(data)

# Cross-tabulate the two variables into a table of observed counts
contingency_table = pd.crosstab(df['Gender'], df['Smoker'])
stat, p, dof, expected = chi2_contingency(contingency_table)

print(stat)  # chi-square statistic
print(p)     # p-value; with only five observations it is far from significant
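
To make the idea of "expected frequencies if there were no association" concrete, here is a rough sketch that derives them by hand from the Gender and Smoker counts above: each expected count is the row total times the column total, divided by the grand total. The array layout (rows Female, Male; columns No, Yes) follows the sorted order that pd.crosstab produces. The statistic computed here is the plain Pearson chi-square, so it will not match chi2_contingency on a 2x2 table unless correction=False is passed, because SciPy applies Yates' continuity correction by default in that case.

```python
import numpy as np

# Observed counts from the Gender x Smoker example
# rows: Female, Male; columns: No, Yes
observed = np.array([[1, 1],
                     [1, 2]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()

# Expected counts under independence: row total * column total / n
expected = row_totals * col_totals / n  # 0.8, 1.2 / 1.2, 1.8
print(expected)

# Pearson chi-square statistic, without continuity correction
chi2 = ((observed - expected) ** 2 / expected).sum()
print(round(chi2, 4))  # Output: 0.1389
```

All four expected counts are below 5, so the chi-square approximation is unreliable for a sample this small; the example is only meant to show the mechanics.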
 

Conclusion 

In conclusion, summary statistics are a powerful tool for summarizing and understanding large datasets. Between the standard library's statistics module and third-party packages such as NumPy, pandas, and SciPy, Python covers the common summary statistics for both quantitative and qualitative variables. These statistics can help identify patterns, outliers, and potential issues with the data, and can guide further analysis and decision-making.
