For a train/test split in R, what should I provide for x and size? Is x the dataset, and size the number of samples?

archna.vv · July 30, 2024, 6:30pm

I’ve just started using R and I’m unsure how to use the sample function to split my dataset into training (75%) and testing (25%) sets.

Here’s the function syntax I’m working with:

sample(x, size, replace = FALSE, prob = NULL)

For a train/test split in R, what should I provide for x and size? Is x the dataset, and size the number of samples? How do I use these parameters to achieve the split?

netra.agarwal · August 18, 2024, 9:54am

Hi,

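First calculate the number of rows for the training set (75% of the total), then pass that number to sample as size.

Use the sample function:

x should be the row indices of your dataset (1 to nrow(data)), not the dataset itself; passing a data frame as x samples its columns rather than its rows. size should be the number of rows you want in the training set.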

# Assuming 'data' is your dataset

set.seed(123) # For reproducibility
total_rows <- nrow(data)
train_size <- floor(0.75 * total_rows)
train_indices <- sample(seq_len(total_rows), size = train_size)


# Create training and testing sets

train_set <- data[train_indices, ]
test_set <- data[-train_indices, ]
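
To see the numbers concretely, here is a small sketch of the same steps run on R's built-in iris data (150 rows). Using iris in place of 'data' is only an assumption for illustration; the checks at the end confirm that the two sets add up to the full dataset and share no rows.

```r
# Worked example: the built-in 'iris' data (150 rows) stands in for 'data'
data <- iris

set.seed(123)  # For reproducibility
total_rows <- nrow(data)
train_size <- floor(0.75 * total_rows)  # floor(112.5) = 112
train_indices <- sample(seq_len(total_rows), size = train_size)

train_set <- data[train_indices, ]
test_set <- data[-train_indices, ]

# Sanity checks: sizes add up and no row appears in both sets
nrow(train_set)  # 112
nrow(test_set)   # 38
length(intersect(rownames(train_set), rownames(test_set)))  # 0
```

Because replace = FALSE is the default, each row index is drawn at most once, which is why the training and testing sets cannot overlap.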

raimavaswani · August 19, 2024, 12:13pm

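You can also use createDataPartition from the caret package. When the target variable is a factor, it samples within each class, so the class proportions stay similar in both sets.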

# Install and load caret if not already installed

install.packages("caret")
library(caret)

# Assuming 'data' is your dataset and 'target_variable' is its outcome column

set.seed(123) # For reproducibility
partition <- createDataPartition(data$target_variable, p = 0.75, list = FALSE)

# Create training and testing sets

train_set <- data[partition, ]
test_set <- data[-partition, ]
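
If you would rather not install caret, the idea of sampling within each class can be sketched in base R. This is a rough illustration only, not caret's implementation, and it assumes the built-in iris data with its Species column standing in for 'data' and 'target_variable'.

```r
# Base R sketch of a stratified 75/25 split; this is NOT caret's implementation.
# 'iris' and 'Species' are stand-ins for 'data' and 'target_variable'.
data <- iris

set.seed(123)  # For reproducibility

# Group the row indices by class, then sample 75% of the indices within each class
rows_by_class <- split(seq_len(nrow(data)), data$Species)
train_indices <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = floor(0.75 * length(rows)))]
}), use.names = FALSE)

train_set <- data[train_indices, ]
test_set <- data[-train_indices, ]

table(train_set$Species)  # 37 rows per species
table(test_set$Species)   # 13 rows per species
```

Sampling positions with sample.int, rather than calling sample on the index vector directly, avoids the surprise that sample(x) treats a single number x as the range 1:x when a class has only one row.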

dimplesaini.230 · August 19, 2024, 5:39pm

You can also use the dplyr package:

Use sample_frac to sample 75% of the rows for the training set, then use anti_join to keep the remaining rows as the testing set.

# Install and load dplyr if not already installed
install.packages("dplyr")
library(dplyr)

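# Assuming 'data' is your dataset and has no existing column named 'id'
set.seed(123) # For reproducibility
# Add a row id so anti_join matches on it alone; joining on all columns
# would also drop any test rows that duplicate a training row
data <- data %>% mutate(id = row_number())
train_set <- data %>% sample_frac(0.75)
test_set <- data %>% anti_join(train_set, by = "id")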

Each method provides a way to split your dataset into training and testing sets, depending on your preference for base R, caret, or dplyr.