How do I calculate r-squared using Python and Numpy?

I’m using Python and Numpy to calculate a best-fit polynomial of arbitrary degree. I pass a list of x values, y values, and the degree of the polynomial I want to fit (linear, quadratic, etc.).

This much works, but I also want to calculate r (coefficient of correlation) and r-squared (coefficient of determination). I am comparing my results with Excel’s best-fit trendline capability, and the r-squared value it calculates. Using this, I know I am calculating r-squared correctly for linear best-fit (degree equals 1). However, my function does not work for polynomials with degree greater than 1.

Excel is able to do this. How do I calculate r-squared for higher-order polynomials using Numpy?

Here’s my function:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)
    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    correlation = numpy.corrcoef(x, y)[0, 1]

    # r
    results['correlation'] = correlation
    # r-squared
    results['determination'] = correlation**2

    return results

How can I modify the function to correctly calculate r-squared for higher-degree polynomials using Python and Numpy?

Hey, I've been using this approach in Python with Numpy to calculate R-squared for polynomial fits. The idea is to compute the fitted values from the polynomial, then measure how much of the variation in y the fit explains relative to the total variation around the mean. Here's a sample implementation:

import numpy

# Polynomial Regression
def polyfit(x, y, degree):
    results = {}

    coeffs = numpy.polyfit(x, y, degree)

    # Polynomial Coefficients
    results['polynomial'] = coeffs.tolist()

    # r-squared calculation
    p = numpy.poly1d(coeffs)
    # fit values, and mean
    yhat = p(x)                         # or [p(z) for z in x]
    ybar = numpy.sum(y) / len(y)         # or sum(y) / len(y)
    ssreg = numpy.sum((yhat - ybar) ** 2) # or sum([ (yihat - ybar)**2 for yihat in yhat])
    sstot = numpy.sum((y - ybar) ** 2)   # or sum([ (yi - ybar)**2 for yi in y])
    results['determination'] = ssreg / sstot

    return results

This follows the standard R-squared formula. To be precise, ssreg is the regression (explained) sum of squares, i.e. the variation of the fitted values around the mean (not the sum of squared residuals), and sstot is the total sum of squares. Their ratio tells you how much of the variation in y the polynomial fit accounts for, and for a least-squares fit with an intercept (which numpy.polyfit gives you) it is equivalent to 1 - ssres/sstot.
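
Here's a quick usage sketch with made-up data (it assumes the polyfit function above is already defined), just to show how you'd read off the value:

import numpy

# Made-up, roughly quadratic sample data
x = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = numpy.array([1.1, 3.9, 9.2, 16.1, 24.8])

print(polyfit(x, y, 2)['determination'])   # close to 1.0 for a good quadratic fit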

Ah, I get where you're coming from, but there's a more streamlined way of doing this if you're working with a library like scikit-learn. It saves a few steps while giving you the same R-squared value. Here's what I recommend:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

def polyfit_r2(x, y, degree):
    # Expand the 1-D input into polynomial features (1, x, x**2, ..., x**degree)
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(np.asarray(x).reshape(-1, 1))
    # Fit an ordinary least-squares model on the expanded features
    model = LinearRegression()
    model.fit(X_poly, y)
    # score() returns the coefficient of determination (r-squared) for the given data
    r_squared = model.score(X_poly, y)
    return r_squared

In this case, PolynomialFeatures transforms the input into polynomial terms, and a linear regression model is then fit to them. LinearRegression's score() method returns the coefficient of determination (R-squared) for whatever data you pass it, so there's no need to compute it by hand. It's an easier route if you just want the R-squared value and don't mind depending on scikit-learn.
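
A quick usage sketch with made-up noisy data (it assumes the polyfit_r2 function above is in scope):

import numpy as np

x = np.linspace(0, 4, 30)
y = 2.0 * x**2 - 3.0 * x + 1.0 + np.random.normal(0, 0.5, x.shape)   # noisy quadratic

print(polyfit_r2(x, y, 2))   # typically very close to 1 for this data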

Nice! Building on what was shared earlier, if you prefer to keep things manual and understand each step, I've found this approach useful. It's more hands-on, but it makes clear how the R-squared value is derived from scratch. The steps are very similar to the first approach; here's how I handle it:

import numpy as np

def manual_r_squared(x, y, degree):
    coeffs = np.polyfit(x, y, degree)
    p = np.poly1d(coeffs)

    yhat = p(x)  # Fitted values
    ybar = np.mean(y)  # Mean of the actual values
    ssreg = np.sum((yhat - ybar) ** 2)  # Regression sum of squares
    sstot = np.sum((y - ybar) ** 2)  # Total sum of squares
    r_squared = ssreg / sstot
    return r_squared

This is essentially a more manual version of the same R-squared calculation, without relying on external libraries like scikit-learn. You follow the standard formula step by step, which is helpful if you want to see exactly how the value is derived and track how the fit is progressing.
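
To tie this back to the original question, here's a small sketch with made-up data showing why the corrcoef-based value only agrees for degree 1 (it assumes manual_r_squared from above is defined):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 4.2, 9.1, 16.3, 24.9, 36.2])   # made-up, roughly quadratic

r2_corr = np.corrcoef(x, y)[0, 1] ** 2   # squared correlation, as in the question

print(r2_corr)                      # matches the degree-1 value below
print(manual_r_squared(x, y, 1))    # same as r2_corr for a linear fit
print(manual_r_squared(x, y, 2))    # higher: the quadratic fit explains more variation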