sample(1:100,10)
## [1] 29 94 5 83 6 84 28 100 21 19
sample(c("red","orange","yellow","green"),2, replace = TRUE)
## [1] "yellow" "red"
data(iris)
sample(iris$Sepal.Length, 10)
## [1] 5.6 6.7 7.9 5.1 5.2 5.8 5.6 5.0 6.4 6.8
iris$id = 1:150
iris[9,2]
## [1] 2.9
iris[150, ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  id
## 150          5.9           3          5.1         1.8 virginica 150
iris[, 2]
## [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
## [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
## [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
## [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
## [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
## [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
## [109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
## [127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
## [145] 3.3 3.0 2.5 3.0 3.4 3.0
data = iris[sample(1:150, 12), ]
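Because sample() draws at random, the rows picked above change on every run. A minimal sketch of making the draw repeatable, assuming a seed of 42 (an arbitrary value, not part of the original analysis) and the hypothetical names rows_a, rows_b, and iris_subset:

```r
# Fix the random number generator so the same rows are drawn on every run.
# The seed value 42 is an arbitrary assumption.
set.seed(42)
rows_a = sample(nrow(iris), 12)   # 12 row numbers, drawn without replacement

set.seed(42)
rows_b = sample(nrow(iris), 12)   # same seed, so the same 12 row numbers

identical(rows_a, rows_b)         # TRUE
iris_subset = iris[rows_a, ]      # the 12 sampled rows, all columns
nrow(iris_subset)                 # 12
```

Using nrow(iris) instead of a hard-coded 150 keeps the call correct if the data frame ever changes size.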
Sample type. This is a convenience sample. I asked everyone who was in the library during study hours, and then a few more people I ran into around campus. People were picked because they were easy to reach at that moment, not at random, so the results describe students who were in the library studying more than they describe the whole school. Anyone who was not in the library during that window had no chance of being included, which is a coverage limitation.
Response and bias. Almost everyone I asked filled out the form. The only people who did not respond were ones who did not have their phone or computer on them to open the survey. So non-response was low and was tied to device access rather than to anything about the questions themselves. That keeps non-response bias small, though the handful of people without a device on hand might use their phones a little less in general, which could nudge the photo counts down if they had been included. The bigger limitation is still who I sampled from (library study-hours students) rather than who chose not to answer.
survey = read.csv("C:/Users/echow/Downloads/ASP/Data Science/survey.csv",
                  stringsAsFactors = FALSE)
survey$id = 1:nrow(survey)
Most columns came in clean. The three that needed work were
steps, food, and friends.
steps. Some answers were typed with a comma as the thousands
separator, like "15,400", which makes R read the whole column as text.
Others were plain numbers, and 27 were left blank. I strip the commas,
convert the column to numeric, and let the blanks become NA instead of
guessing values for them.
survey$steps = as.numeric(gsub(",", "", survey$steps))
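To see what that line does to each kind of answer, here is a small sketch on made-up values. The vector raw_steps is hypothetical and only stands in for the real column.

```r
# Hypothetical raw answers: comma-formatted text, a plain number, and a blank.
raw_steps = c("15,400", "9800", "", "12,050")

# Remove the commas, then convert to numeric; the empty string becomes NA.
clean_steps = as.numeric(gsub(",", "", raw_steps))

clean_steps              # 15400  9800    NA 12050
sum(is.na(clean_steps))  # 1
```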
food. I lowercased and trimmed the answers so
spelling and spacing match. Only one response listed more than one item,
"cookies and french toast". The rule was to keep the single
least common item, and french toast shows up once while cookies shows up
three times, so french toast is the one I keep. Every other answer was
already a single dish.
survey$food = tolower(trimws(survey$food))
survey$food[survey$food == "cookies and french toast"] = "french toast"
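The "keep the least common item" rule can also be checked with table() instead of counting by eye. This is only a sketch on made-up answers; raw_food, singles, listed, and keep are hypothetical names, and the counts are chosen to mirror the ones described above.

```r
# Hypothetical answers with inconsistent case and stray spaces.
raw_food = c(" Tacos", "tacos ", "Cookies", "cookies", "COOKIES",
             "French Toast", "cookies and french toast")

food = tolower(trimws(raw_food))

# Count the single-item answers, leaving out the combined one.
singles = table(food[food != "cookies and french toast"])
singles

# Of the two items in the combined answer, keep the rarer one.
listed = c("cookies", "french toast")
keep = listed[which.min(singles[listed])]
keep   # "french toast"

food[food == "cookies and french toast"] = keep
table(food)
```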
friends. One value, 67, stood out as not believable
and is almost certainly a “6-7” meme answer rather than a real count, so
I drop it (set it to NA). The next highest values, 40 and
35, are high but reasonable for a social summer program, so I keep
those.
survey$friends[survey$friends > 50] = NA
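If the column already contains missing answers, wrapping the test in which() makes it explicit that only the rows where the test is TRUE get changed. A sketch on made-up values (the friends vector and too_high are hypothetical):

```r
# Hypothetical counts, including a missing answer and one implausible value.
friends = c(5, 12, NA, 67, 40, 35)

# which() returns only the positions where the test is TRUE, so the
# existing NA is left out of the index.
too_high = which(friends > 50)
too_high                     # 4

friends[too_high] = NA
friends                      # 5 12 NA NA 40 35
max(friends, na.rm = TRUE)   # 40
```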
plot(survey$photos, survey$friends, pch = 19, col = "steelblue",
     xlab = "photos taken", ylab = "friends made",
     main = "Photos vs. friends made")
abline(lm(friends ~ photos, data = survey), col = "red", lwd = 2)
cor(survey$photos, survey$friends, use = "complete.obs")
## [1] 0.213247
Answer. There is only a weak positive link. The correlation is about r = 0.21, so photo counts account for under 5 percent of the variation in friends made (r squared is about 0.045), and the fitted line rises by roughly 0.06 friends for each extra photo. That slope is not statistically significant at the 0.05 level (p is about 0.07), so the pattern could easily be noise. The most photo-heavy students do not clearly make more friends than everyone else, and the cloud of points is wide. Photo-taking and friend-making move together a little, but not enough to call it a real relationship.
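The slope and p-value quoted in the answer are the kind of numbers summary() of the fitted model reports. A sketch of pulling them out, run here on simulated data because the survey file is not included; the toy data frame and every number in it are made up for illustration.

```r
# Simulated stand-in for the survey; these values are NOT the real data.
set.seed(1)
toy = data.frame(photos = round(runif(60, 0, 200)))
toy$friends = pmax(0, round(8 + 0.05 * toy$photos + rnorm(60, sd = 8)))

fit = lm(friends ~ photos, data = toy)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|).
coefs = summary(fit)$coefficients

slope   = coefs["photos", "Estimate"]
p_value = coefs["photos", "Pr(>|t|)"]
r       = cor(toy$photos, toy$friends, use = "complete.obs")

c(slope = slope, p_value = p_value, r = r, r_squared = r^2)

# With a single predictor, the slope test and the correlation test
# give the same p-value.
cor.test(toy$photos, toy$friends)$p.value
```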
counts = table(survey$food)
named = counts[counts >= 2]
plot_counts = sort(c(named, "other (one-offs)" = sum(counts == 1)))
par(mar = c(4, 9, 3, 1))
barplot(plot_counts, horiz = TRUE, las = 1, col = "tomato",
        xlab = "number of students", main = "Favorite food (n = 72)")
Answer. Tacos are the clear winner with 16 of 72 students (about 22 percent), nearly as many as the next two combined. Mac and cheese is second at 11, and sweet potatoes third at 6. After that it drops into a long tail: there are 25 different foods in all, and 13 of them were named by only one person. Savory comfort foods dominate the top of the list while desserts and one-off picks fill out the bottom.
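The figures in the answer (top foods, percentages, number of distinct foods, number of one-offs) all come from the same frequency table. A sketch of how they might be computed, using a short made-up food vector in place of the survey column:

```r
# Hypothetical cleaned answers (not the real survey data).
food = c("tacos", "tacos", "tacos", "mac and cheese", "mac and cheese",
         "pizza", "sushi", "pho")

counts = sort(table(food), decreasing = TRUE)

head(counts, 3)                       # the three most common foods
round(100 * prop.table(counts), 1)    # percent of students per food
length(counts)                        # number of different foods
sum(counts == 1)                      # foods named by only one person

# Compare the leader with the next two foods combined.
top3 = as.integer(counts[1:3])
top3[1] > top3[2] + top3[3]
```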
hist(survey$steps, breaks = 8, col = "skyblue",
     xlab = "steps", main = "Daily steps (45 students who answered)")
abline(v = mean(survey$steps, na.rm = TRUE), col = "red", lwd = 2)
abline(v = median(survey$steps, na.rm = TRUE), col = "darkgreen", lwd = 2, lty = 2)
legend("topright", legend = c("mean", "median"),
       col = c("red", "darkgreen"), lwd = 2, lty = c(1, 2), bty = "n")
summary(survey$steps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    4683   11411   14136   13911   15630   22341      27
Answer. The students who reported steps are quite active. Of the 45 who answered (62.5 percent of the 72 surveyed), the median is 14,136 steps and the mean is 13,911, so the mean and median sit almost on top of each other and the distribution is roughly symmetric. Steps run from a low of 4,683 to a high of 22,341, with a standard deviation around 3,600, and 91 percent of these students logged at least 10,000 steps. The one student near 4,700 pulls the left tail out a bit, which is why the mean sits slightly below the median, but most of the group clusters in the 11,000 to 16,000 range. The 27 blanks are worth remembering: this conclusion only covers the people who actually tracked and reported their steps.
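The response rate, standard deviation, and share at or above 10,000 steps are not printed by summary(), so here is a sketch of how they could be computed. The steps vector below is made up and much shorter than the real column.

```r
# Hypothetical step counts with blanks already converted to NA.
steps = c(4700, 11200, 13900, 14100, 15600, 22300, NA, NA)

answered = steps[!is.na(steps)]

length(answered)                          # students who reported steps
round(100 * mean(!is.na(steps)), 1)       # percent of the sample that answered
sd(answered)                              # spread of the reported values
round(100 * mean(answered >= 10000), 1)   # percent at or above 10,000 steps
mean(answered) - median(answered)         # close to zero suggests rough symmetry
```

Taking the mean of a logical vector gives the proportion of TRUE values, which is why mean() appears in both percentage lines.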