Summary Statistics In Python
Summary statistics are useful tools for summarizing and understanding large datasets. They provide information about the central tendency, spread, and shape of a dataset, and can help identify outliers and potential issues with the data. In this article, we will discuss how to calculate standard summary statistics for both quantitative and qualitative variables using Python.
Summary Statistics for Quantitative Variables
Quantitative variables are variables that have numerical values and can be measured on a continuous or discrete scale. Examples of quantitative variables include age, height, weight, and income. The most common summary statistics for quantitative variables are:
Mean
The mean is the average value of a dataset and is calculated by summing all the values and dividing by the total number of observations.
import numpy as np
data = [1, 2, 3, 4, 5]
mean = np.mean(data)
print(mean) # Output: 3.0
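To make the definition concrete, the same value can be computed by hand as the sum divided by the count. This is only a minimal sketch; the name `manual_mean` is illustrative and not part of any library.

```python
data = [1, 2, 3, 4, 5]

# Sum of all values divided by the number of observations
manual_mean = sum(data) / len(data)
print(manual_mean)  # 3.0
```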
Median
The median is the middle value of a dataset and is calculated by sorting the values and selecting the middle one. If there is an even number of values, the median is the average of the two middle values.
import numpy as np
data = [1, 2, 3, 4, 5]
median = np.median(data)
print(median) # Output: 3.0
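The even-count case described above can be sketched with the standard library's `statistics.median`, which averages the two middle values in the same way `np.median` does. The list `even_data` is a made-up example.

```python
from statistics import median

even_data = [1, 2, 3, 4]  # hypothetical data with an even number of values

# The two middle values are 2 and 3, so the median is their average
even_median = median(even_data)
print(even_median)  # 2.5
```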
Mode
The mode is the most common value in a dataset and is calculated by counting the frequency of each value and selecting the value with the highest frequency.
from statistics import mode
data = [1, 2, 3, 4, 5, 5, 5]
# Store the result under a different name so the mode() function is not overwritten
most_common = mode(data)
print(most_common) # Output: 5
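A dataset can have more than one value tied for the highest frequency. One way to handle this, assuming Python 3.8 or later, is `statistics.multimode`, which returns every tied value in the order first seen. The list `tied_data` below is a made-up example.

```python
from statistics import multimode

tied_data = [1, 1, 2, 2, 3]  # hypothetical data where 1 and 2 both occur twice

# multimode returns a list of all values sharing the highest frequency
modes = multimode(tied_data)
print(modes)  # [1, 2]
```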
Range
The range is the difference between the maximum and minimum values in a dataset.
import numpy as np
data = [1, 2, 3, 4, 5]
# Use a name other than `range` so Python's built-in range() is not shadowed
data_range = np.max(data) - np.min(data)
print(data_range) # Output: 4
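Because the range depends only on the two extreme values, a single outlier can change it substantially. The sketch below illustrates this with a hypothetical helper, `value_range`, and a made-up outlier of 100.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

data = [1, 2, 3, 4, 5]
with_outlier = data + [100]  # hypothetical outlier appended to the data

print(value_range(data))          # 4
print(value_range(with_outlier))  # 99
```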
Variance and Standard Deviation
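The variance and standard deviation are measures of the spread or dispersion of a dataset. The variance is the average squared deviation from the mean, while the standard deviation is the square root of the variance. By default, np.var and np.std compute the population versions, dividing by the number of observations n; passing ddof=1 gives the sample versions, which divide by n - 1.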
import numpy as np
data = [1, 2, 3, 4, 5]
variance = np.var(data)
std_dev = np.std(data)
print(variance) # Output: 2.0
print(std_dev) # Output: 1.4142135623730951
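When the data are a sample rather than the whole population, the sample variance (dividing by n - 1) is usually the quantity of interest. The sketch below uses the standard library's `statistics` module to contrast the two; with NumPy, the sample versions correspond to `np.var(data, ddof=1)` and `np.std(data, ddof=1)`.

```python
from statistics import pvariance, variance, stdev

data = [1, 2, 3, 4, 5]

population_var = pvariance(data)  # divides by n
sample_var = variance(data)       # divides by n - 1
sample_std = stdev(data)          # square root of the sample variance

print(float(population_var))  # 2.0
print(sample_var)             # 2.5
print(sample_std)
```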
Summary Statistics for Qualitative Variables
Qualitative variables are variables that have categorical values and cannot be measured on a numerical scale. Examples of qualitative variables include gender, race, and educational level. The most common summary statistics for qualitative variables are:
Frequency Table
A frequency table is a table that shows the frequency or count of each category in a dataset.
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)
freq_table = df['Gender'].value_counts()
print(freq_table)
"""
Output (pandas 1.x; pandas 2.x names the series "count" and prints "Gender" as the index label):
Male      3
Female    2
Name: Gender, dtype: int64
"""
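For a plain Python list, the same counts can be obtained without pandas. This is a minimal sketch using `collections.Counter` from the standard library; the variable names are illustrative.

```python
from collections import Counter

genders = ['Male', 'Female', 'Male', 'Male', 'Female']

# Counter maps each category to the number of times it occurs
counts = Counter(genders)

# most_common() lists categories from most to least frequent
for category, count in counts.most_common():
    print(category, count)
# Male 3
# Female 2
```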
Proportions and Percentages
Proportions and percentages are measures of the relative frequency of each category in a dataset. The proportion is the frequency of a category divided by the total number of observations, while the percentage is the proportion multiplied by 100.
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)
prop_table = df['Gender'].value_counts(normalize=True)
prop_table *= 100
print(prop_table)
"""
Output (pandas 1.x; pandas 2.x names the series "proportion" and prints "Gender" as the index label):
Male      60.0
Female    40.0
Name: Gender, dtype: float64
"""
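The arithmetic behind these figures can be spelled out by hand: each count is divided by the total number of observations, and the result is scaled by 100 for a percentage. The sketch below assumes the same five observations; the dictionary names are illustrative.

```python
from collections import Counter

genders = ['Male', 'Female', 'Male', 'Male', 'Female']
counts = Counter(genders)
total = sum(counts.values())  # total number of observations

# Proportion = category count / total; percentage = 100 * count / total
proportions = {category: n / total for category, n in counts.items()}
percentages = {category: 100 * n / total for category, n in counts.items()}

print(proportions)  # {'Male': 0.6, 'Female': 0.4}
print(percentages)  # {'Male': 60.0, 'Female': 40.0}
```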
Mode
For qualitative data, the mode is the category with the highest frequency. It is the only measure of central tendency that applies to unordered categories, since a mean or median cannot be computed for them.
from statistics import mode
data = ['Red', 'Blue', 'Green', 'Green', 'Blue', 'Red', 'Red']
# Store the result under a different name so the mode() function is not overwritten
most_common = mode(data)
print(most_common) # Output: Red
Chi-Square Test
The chi-square test is a statistical test used to determine if there is a significant association between two qualitative variables. It compares the observed frequencies of each category to the expected frequencies if there were no association between the variables.
import pandas as pd
from scipy.stats import chi2_contingency
data = {'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
'Smoker': ['Yes', 'No', 'Yes', 'No', 'Yes']}
df = pd.DataFrame(data)
contingency_table = pd.crosstab(df['Gender'], df['Smoker'])
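# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default (correction=True)
stat, p, dof, expected = chi2_contingency(contingency_table)
print(stat) # chi-square test statistic
print(p) # p-value
print(expected) # expected counts under independence
# With only five observations the expected counts are all below 5, so this result is illustrative only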
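To show where the expected frequencies come from, the sketch below computes them by hand for the same Gender by Smoker table: each expected count is the row total times the column total divided by the grand total. It then sums the squared differences to obtain the uncorrected Pearson statistic. This value will generally differ from what `chi2_contingency` reports for a 2x2 table, because SciPy applies a continuity correction there by default. The variable names are illustrative.

```python
# Observed counts from the crosstab (rows: Female, Male; columns: No, Yes)
observed = [[1, 1],
            [1, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence = row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Uncorrected Pearson chi-square statistic: sum of (O - E)^2 / E over all cells
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)             # [[0.8, 1.2], [1.2, 1.8]]
print(round(chi2_stat, 4))  # 0.1389
```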
Conclusion
Summary statistics are a powerful tool for summarizing and understanding large datasets. Python's standard library and third-party packages such as NumPy, pandas, and SciPy provide functions for calculating summary statistics for both quantitative and qualitative variables. These statistics can help identify patterns, outliers, and potential issues with the data, and can guide further analysis and decision-making.