I’ve just started using R and I’m unsure how to use the sample function to split my dataset into training (75%) and testing (25%) sets.
Here’s the function syntax I’m working with:
sample(x, size, replace = FALSE, prob = NULL)
For a train/test split in R, what should I provide for x and size? Is x the dataset, and size the number of samples? How do I use these parameters to achieve the split?
Hi,
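For a 75/25 split, sample the row indices rather than the dataset itself, then use those indices to subset the rows:
x should be the indices of the rows in your dataset (1 to the number of rows).
size should be the number of rows for the training set (75% of the total, rounded down).
Because replace = FALSE is the default, each row is picked at most once.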
# Assuming 'data' is your dataset
set.seed(123) # For reproducibility
total_rows <- nrow(data)
train_size <- floor(0.75 * total_rows)
train_indices <- sample(seq_len(total_rows), size = train_size)
# Create training and testing sets
train_set <- data[train_indices, ]
test_set <- data[-train_indices, ]
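To see the split in action, here is a minimal sketch that runs the same steps on R's built-in iris data (used here only as a stand-in for 'data') and checks that the two sets have the expected sizes and share no rows.

```r
# Built-in iris data stands in for 'data' (an assumption for illustration)
data <- iris
set.seed(123)                            # For reproducibility
total_rows <- nrow(data)                 # 150 rows
train_size <- floor(0.75 * total_rows)   # floor(112.5) = 112
train_indices <- sample(seq_len(total_rows), size = train_size)

train_set <- data[train_indices, ]
test_set  <- data[-train_indices, ]

# Sanity checks: sizes add up and no row lands in both sets
nrow(train_set)                          # 112
nrow(test_set)                           # 38
length(intersect(rownames(train_set), rownames(test_set)))  # 0
```

The negative index in data[-train_indices, ] drops the training rows, so the test set is whatever is left over.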
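You can also use the caret package. Its createDataPartition function does a stratified split, so the training and testing sets keep roughly the same distribution of the target variable: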
# Install caret if it is not already installed, then load it
if (!requireNamespace("caret", quietly = TRUE)) install.packages("caret")
library(caret)
# Assuming 'data' is your dataset and 'target_variable' is the outcome column
set.seed(123) # For reproducibility
partition <- createDataPartition(data$target_variable, p = 0.75, list = FALSE)
# Create training and testing sets
train_set <- data[partition, ]
test_set <- data[-partition, ]
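If you want to see what a stratified split means without installing caret, the idea can be roughly sketched in base R. This is not caret's implementation: stratified_indices is a hypothetical helper written for this example, and it assumes the target is categorical (here the Species column of the built-in iris data).

```r
# Hypothetical helper (not part of caret or base R): sample a fraction p
# of the row indices separately within each class of y
stratified_indices <- function(y, p = 0.75) {
  groups <- split(seq_along(y), y)         # row indices grouped by class
  picked <- lapply(groups, function(idx) {
    n_take <- floor(p * length(idx))       # rows to take from this class
    idx[sample.int(length(idx), n_take)]   # sample positions, then map back to row indices
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)                              # For reproducibility
train_idx <- stratified_indices(iris$Species, p = 0.75)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

table(train_set$Species)                   # 37 rows per species
table(test_set$Species)                    # 13 rows per species
```

Since iris has 50 rows per species, each class contributes floor(0.75 * 50) = 37 rows to the training set, so the class balance is the same in both sets.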
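You can also use the dplyr package:
- Use sample_frac to draw 75% of the rows for the training set.
- Use anti_join to keep the rows that are not in the training set as the testing set.
- Note that since dplyr 1.0.0, sample_frac is superseded by slice_sample(prop = 0.75), although it still works.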
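# Install dplyr if it is not already installed, then load it
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
library(dplyr)
# Assuming 'data' is your dataset
set.seed(123) # For reproducibility
# Add a row id so anti_join matches on it; without 'by', it matches on all
# columns, and duplicates of training rows would be dropped from test_set
data_id <- data %>% mutate(row_id = row_number())
train_set <- data_id %>% sample_frac(0.75)
test_set <- data_id %>% anti_join(train_set, by = "row_id")
# Drop the helper column afterwards with select(-row_id) if you don't need it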
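For newer dplyr versions (1.0.0 or later), the same split can be sketched with slice_sample, which supersedes sample_frac. The example below assumes dplyr is installed and uses the built-in iris data as a stand-in for 'data'; the row_id column is a helper name made up for this example so that anti_join matches on row identity rather than on column values.

```r
library(dplyr)

set.seed(123)                                      # For reproducibility

# Tag every row with an id so the anti join is unambiguous
data_id <- iris %>% mutate(row_id = row_number())

# 75% of the rows, sampled without replacement
train_set <- data_id %>% slice_sample(prop = 0.75)

# Everything whose row_id is not in the training set
test_set <- data_id %>% anti_join(train_set, by = "row_id")

# Remove the helper column once the split is done
train_set <- train_set %>% select(-row_id)
test_set  <- test_set %>% select(-row_id)

nrow(train_set) + nrow(test_set) == nrow(iris)     # TRUE
```

The id column matters whenever the data contains identical rows: joining on all columns would remove every copy of a training row from the test set, so the two sets would no longer add up to the full dataset.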
Each method provides a way to split your dataset into training and testing sets, depending on your preference for base R, caret, or dplyr.