How to calculate Pearson correlation and p-value in Python?

anjuyadav.1398 · December 10, 2024, 6:30pm

How to Calculate Pearson Correlation and Significance in Python?

I am looking for a function that takes two lists as input and returns both the Pearson correlation and the significance of the correlation.

How can I implement this to calculate Pearson's r in Python?

macy-davis · December 10, 2024, 6:33pm

Hey Everyone!

Use scipy.stats.pearsonr. The scipy.stats module provides the pearsonr function, which returns both the Pearson correlation coefficient and its significance (by default, a two-sided p-value for the null hypothesis of no correlation):

from scipy.stats import pearsonr

def pearson_r_python(list1, list2):
    correlation, p_value = pearsonr(list1, list2)
    return correlation, p_value

list1 = [1, 2, 3, 4, 5]
list2 = [5, 4, 3, 2, 1]
correlation, p_value = pearson_r_python(list1, list2)
print(f"Pearson correlation: {correlation}, p-value: {p_value}")

This is the simplest way to get Pearson's r and its p-value in Python using the SciPy library.
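
If it helps, here is a rough sketch of a more defensive wrapper around pearsonr. The helper name safe_pearson, the three-observation minimum, and the sample data are my own assumptions for illustration, not part of SciPy:

```python
from scipy.stats import pearsonr

def safe_pearson(x, y):
    """Return (r, p_value); raise ValueError on inputs Pearson's r cannot handle.

    Hypothetical helper: the checks below are illustrative assumptions.
    """
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    # Assumption: with fewer than 3 pairs the p-value carries no information
    if len(x) < 3:
        raise ValueError("need at least 3 paired observations")
    # A constant input has zero variance, so r is undefined
    if len(set(x)) == 1 or len(set(y)) == 1:
        raise ValueError("correlation is undefined when an input is constant")
    r, p = pearsonr(x, y)
    return float(r), float(p)

# Made-up sample data
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 60, 68, 71]
r, p = safe_pearson(hours, scores)
print(f"r = {r:.3f}, p = {p:.4f}")
```

Checking the inputs up front turns silent NaN results or warnings into explicit errors, which tends to be easier to debug.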

tim-khorev · December 15, 2024, 8:00am

Hey Everyone!

Great explanation, @macy-davis! Building on that, if you’d like to see how the significance is derived, you can compute the coefficient with numpy's corrcoef and then get the p-value yourself from the t-distribution in scipy.stats. Here’s an enhanced version of the pearson_r_python function:

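import numpy as np
from scipy.stats import t

def pearson_r_python(list1, list2):
    # Correlation coefficient taken from numpy's 2x2 correlation matrix
    correlation = np.corrcoef(list1, list2)[0, 1]
    n = len(list1)
    # A perfect correlation makes 1 - r**2 zero (or slightly negative from
    # rounding), so return p = 0 directly instead of dividing by zero
    if abs(correlation) >= 1.0:
        return correlation, 0.0
    # t-statistic with n - 2 degrees of freedom
    t_statistic = correlation * np.sqrt((n - 2) / (1 - correlation**2))
    # Two-sided p-value; t.sf is the survival function (1 - cdf) and keeps
    # precision for very small p-values
    p_value = 2 * t.sf(np.abs(t_statistic), df=n - 2)
    return correlation, p_value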

list1 = [1, 2, 3, 4, 5]
list2 = [5, 4, 3, 2, 1]
correlation, p_value = pearson_r_python(list1, list2)
print(f"Pearson correlation: {correlation}, p-value: {p_value}")

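This approach exposes the t-statistic behind the p-value, so you can see how the significance is derived while still relying on vectorized numpy and scipy calls. The result should agree with scipy.stats.pearsonr up to floating-point error.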
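
For anyone who wants to go one step further, here is a minimal sketch that computes r directly from its definition (sum of products of deviations divided by the square root of the product of the sums of squared deviations) and checks it against pearsonr. The function name pearson_from_definition and the sample data are made up for illustration, and the sketch assumes at least 3 paired observations:

```python
import numpy as np
from scipy.stats import pearsonr, t

def pearson_from_definition(x, y):
    """Hypothetical helper: Pearson's r and two-sided p-value from first principles.

    Assumes len(x) == len(y) >= 3 and that neither input is constant.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
    # Guard against rounding pushing r slightly outside [-1, 1]
    r = float(np.clip(r, -1.0, 1.0))
    if abs(r) == 1.0:
        return r, 0.0  # perfect correlation: avoid dividing by zero
    # t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom
    t_stat = r * np.sqrt((n - 2) / (1.0 - r**2))
    p = 2.0 * t.sf(abs(t_stat), df=n - 2)
    return r, float(p)

# Made-up sample data
x = [2.1, 3.4, 1.9, 5.0, 4.2, 3.8, 2.7]
y = [1.0, 2.9, 1.5, 4.1, 3.0, 3.9, 2.2]
r_manual, p_manual = pearson_from_definition(x, y)
r_scipy, p_scipy = pearsonr(x, y)
print(f"manual: r = {r_manual:.6f}, p = {p_manual:.6f}")
print(f"scipy:  r = {r_scipy:.6f}, p = {p_scipy:.6f}")
```

The two lines should print the same values to the displayed precision, which is a quick sanity check that the t-based formula matches what SciPy reports.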

dimplesaini.230 · December 18, 2024, 8:03am

Excellent points, @tim-khorev! If you’re working with data stored in a Pandas DataFrame, you can keep using pandas for data management and pass the columns straight to scipy.stats.pearsonr for the calculation. Here’s how you can adapt the pearson_r_python implementation for Pandas users:

import pandas as pd
from scipy.stats import pearsonr

def pearson_r_python(list1, list2):
    # Create a DataFrame for easier handling
    df = pd.DataFrame({'list1': list1, 'list2': list2})
    # Use scipy's pearsonr to compute correlation and p-value
    correlation, p_value = pearsonr(df['list1'], df['list2'])
    return correlation, p_value

list1 = [1, 2, 3, 4, 5]
list2 = [5, 4, 3, 2, 1]
correlation, p_value = pearson_r_python(list1, list2)
print(f"Pearson correlation: {correlation}, p-value: {p_value}")

This approach integrates well with data preprocessing pipelines in Pandas, where the values usually already live in DataFrame columns. The correlation and p-value are the same as when you pass plain lists to pearsonr.
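
One place where pandas does help is missing data. As far as I know, NaN values passed to pearsonr propagate into a NaN result by default, so a common pattern is to drop incomplete rows first. Below is a small sketch; the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Made-up data with a missing value in each column
df = pd.DataFrame({
    "height_cm": [160.0, 172.0, np.nan, 181.0, 168.0, 175.0],
    "weight_kg": [55.0, 70.0, 62.0, np.nan, 60.0, 74.0],
})

# Keep only rows where both columns are present
clean = df[["height_cm", "weight_kg"]].dropna()

# Correlation and p-value on the complete pairs
r, p = pearsonr(clean["height_cm"], clean["weight_kg"])

# pandas can compute r on its own (Pearson by default), but gives no p-value
r_pandas = clean["height_cm"].corr(clean["weight_kg"])

print(f"rows used: {len(clean)}, r = {r:.3f}, p = {p:.3f}")
```

Series.corr is handy for a quick look at r, but you still need scipy when you want the significance.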