How to Calculate Pearson Correlation and Significance in Python?
I am looking for a function that takes two lists as input and returns both the Pearson correlation and the significance of the correlation.
How can I implement this to calculate Pearson's r in Python?
Hey Everyone!
Using scipy.stats.pearsonr: The scipy.stats module provides the pearsonr function, which calculates both the Pearson correlation coefficient and its significance (p-value):
from scipy.stats import pearsonr

def pearson_r_python(list1, list2):
    correlation, p_value = pearsonr(list1, list2)
    return correlation, p_value

list1 = [1, 2, 3, 4, 5]
list2 = [5, 4, 3, 2, 1]
correlation, p_value = pearson_r_python(list1, list2)
print(f"Pearson correlation: {correlation}, p-value: {p_value}")
This is the simplest way to calculate Pearson's r and its p-value using Python’s scipy library.
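If you also want a yes/no answer on significance, a thin wrapper around pearsonr can do it. This is only a sketch: the name correlation_report, the alpha=0.05 default, and the minimum of three pairs are my own assumptions, not part of scipy.

```python
from scipy.stats import pearsonr

def correlation_report(x, y, alpha=0.05):
    """Return (r, p_value, is_significant) for two equal-length sequences.

    Hypothetical helper: alpha and the minimum sample size are assumptions.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 3:
        # With only two points r is always +1 or -1, so the test says nothing
        raise ValueError("need at least 3 paired observations")
    r, p_value = pearsonr(x, y)
    # Compare the two-sided p-value against the chosen significance level
    return float(r), float(p_value), bool(p_value < alpha)

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r, p_value, significant = correlation_report(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}, significant at 0.05: {significant}")
```

Pick alpha before looking at the data; 0.05 is a common convention, not a rule.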
Hey Everyone!
Great explanation, @macy-davis! Building on that, if you’d like to calculate the significance yourself, you can use numpy for the correlation coefficient and the t-distribution from scipy.stats for the p-value. Here’s an enhanced version of the pearson_r_python function:
import numpy as np
from scipy.stats import t

def pearson_r_python(list1, list2):
    # Correlation is the off-diagonal entry of the 2x2 correlation matrix
    correlation = np.corrcoef(list1, list2)[0, 1]
    n = len(list1)
    # A perfect correlation would divide by zero below; its p-value is 0
    one_minus_r2 = 1 - correlation**2
    if one_minus_r2 <= 0:
        return correlation, 0.0
    # Compute t-statistic with n - 2 degrees of freedom
    t_statistic = correlation * np.sqrt((n - 2) / one_minus_r2)
    # Two-sided p-value; t.sf is 1 - cdf but more accurate in the far tail
    p_value = 2 * t.sf(np.abs(t_statistic), df=n - 2)
    return correlation, p_value

list1 = [1, 2, 3, 4, 5]
list2 = [5, 4, 3, 2, 1]
correlation, p_value = pearson_r_python(list1, list2)
print(f"Pearson correlation: {correlation}, p-value: {p_value}")
This approach gives you a bit more control and insight into the calculation process while still being efficient.
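For anyone who wants to see the whole calculation, here is a sketch that builds r from its definition (sum of products of deviations divided by the square root of the product of the sums of squared deviations) and then checks the result against scipy. The function name pearson_from_definition and the sample data are made up for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, t

def pearson_from_definition(x, y):
    """Pearson's r and two-sided p-value computed step by step (illustrative)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1-D and the same length")
    n = x.size
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    # Deviations from the mean
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt(np.sum(dx**2) * np.sum(dy**2))
    if denom == 0:
        raise ValueError("r is undefined when an input is constant")
    # Clamp to [-1, 1] to absorb floating-point rounding
    r = max(-1.0, min(1.0, float(np.sum(dx * dy) / denom)))
    one_minus_r2 = 1 - r**2
    if one_minus_r2 == 0:
        return r, 0.0  # perfect correlation: t is infinite, p is 0
    # t-statistic with n - 2 degrees of freedom, then the two-sided tail
    t_stat = r * np.sqrt((n - 2) / one_minus_r2)
    return r, float(2 * t.sf(abs(t_stat), df=n - 2))

x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r_manual, p_manual = pearson_from_definition(x, y)
r_scipy, p_scipy = pearsonr(x, y)
print(f"manual: r={r_manual:.6f}, p={p_manual:.6f}")
print(f"scipy:  r={r_scipy:.6f}, p={p_scipy:.6f}")
```

The two lines should agree to within floating-point rounding, which is a handy sanity check if you ever modify the manual version.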
Excellent points, @tim-khorev! If you’re working with data stored in a Pandas DataFrame, you can make the process even more convenient by leveraging pandas for data management while still using scipy.stats.pearsonr for the calculation. Here’s how you can adapt the pearson_r_python implementation for Pandas users:
import pandas as pd
from scipy.stats import pearsonr

def pearson_r_python(list1, list2):
    # Create a DataFrame for easier handling
    df = pd.DataFrame({'list1': list1, 'list2': list2})
    # Use scipy's pearsonr to compute correlation and p-value
    correlation, p_value = pearsonr(df['list1'], df['list2'])
    return correlation, p_value

list1 = [1, 2, 3, 4, 5]
list2 = [5, 4, 3, 2, 1]
correlation, p_value = pearson_r_python(list1, list2)
print(f"Pearson correlation: {correlation}, p-value: {p_value}")
This approach integrates well with data preprocessing pipelines in Pandas, making it ideal for larger datasets or more complex analysis.
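One place the DataFrame route helps in practice is missing data. Below is a rough sketch that drops rows where either column is NaN before calling pearsonr. The helper name pearson_r_from_frame, the column names x and y, and the sample values are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pearson_r_from_frame(df, col_x, col_y):
    """Return (r, p_value, n_used) for two columns, ignoring incomplete rows."""
    # Keep only rows where both columns have a value
    clean = df[[col_x, col_y]].dropna()
    r, p_value = pearsonr(clean[col_x], clean[col_y])
    return float(r), float(p_value), len(clean)

# Made-up data with one missing value in each column
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0],
    "y": [2.0, 1.0, 3.0, 3.0, np.nan, 5.0, 6.0],
})
r, p_value, n_used = pearson_r_from_frame(df, "x", "y")
print(f"r = {r:.3f}, p = {p_value:.4f}, rows used: {n_used}")
```

Returning the number of rows actually used makes it obvious when missing values have shrunk the sample, which directly affects the p-value.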