sample(1:100,10)
##  [1]  29  94   5  83   6  84  28 100  21  19
sample(c("red","orange","yellow","green"),2, replace = TRUE)
## [1] "yellow" "red"
data(iris)
sample(iris$Sepal.Length, 10)
##  [1] 5.6 6.7 7.9 5.1 5.2 5.8 5.6 5.0 6.4 6.8
iris$id = 1:150
iris[9,2]
## [1] 2.9
iris[150, ]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species  id
## 150          5.9           3          5.1         1.8 virginica 150
iris[, 2]
##   [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5
##  [19] 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2
##  [37] 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3
##  [55] 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8
##  [73] 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5
##  [91] 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0 3.0 2.5 2.9
## [109] 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2
## [127] 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2
## [145] 3.3 3.0 2.5 3.0 3.4 3.0
data = iris[sample(1:150, 12), ]

Survey Analysis

Data Collection

Sample type. This is a convenience sample. I asked everyone who was in the library during study hours, and then a few more people I ran into around campus. People were picked because they were easy to reach at that moment, not at random, so the results describe students who were in the library studying more than they describe the whole school. Anyone who was not in the library during that window had no chance of being included, which is a coverage limitation.

Response and bias. Almost everyone I asked filled out the form. The only people who did not respond were ones who did not have their phone or computer on them to open the survey. So non-response was low and was tied to device access rather than to anything about the questions themselves. That keeps non-response bias small, though the handful of people without a device on hand might use their phones a little less in general, which could nudge the photo counts down if they had been included. The bigger limitation is still who I sampled from (library study-hours students) rather than who chose not to answer.

survey = read.csv("C:/Users/echow/Downloads/ASP/Data Science/survey.csv",
                  stringsAsFactors = FALSE)
survey$id = 1:nrow(survey)

Cleaning the Data

Most columns came in clean. The three that needed work were steps, food, and friends.
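
A quick way to see which columns need work is to check each column's type and count its blanks. This is a minimal sketch on a made-up three-row data frame; `toy` and its values are hypothetical stand-ins, not the real survey.

```r
# Hypothetical mini data frame standing in for the survey (not the real data)
toy = data.frame(
  steps   = c("15,400", "9800", ""),
  food    = c(" Cookies", "pizza", "cookies and french toast"),
  friends = c(12, 67, 8),
  stringsAsFactors = FALSE
)

# Column types: a numeric question stored as character is a warning sign
col_types = sapply(toy, class)

# Count NA values and empty strings in each column
n_missing = sapply(toy, function(x) sum(is.na(x) | trimws(as.character(x)) == ""))

print(col_types)
print(n_missing)
```

Here `steps` shows up as character with one blank, which is the same kind of problem the real steps column had.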

steps. Some answers were typed with comma thousands separators like "15,400", so the column was read in as text. Others were plain numbers, and 27 were left blank. I strip the commas, convert the column to numeric, and let the blanks become NA instead of guessing values for them.

survey$steps = as.numeric(gsub(",", "", survey$steps))
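
To show what that line does to each kind of answer, here is a small sketch on a made-up vector. `raw_steps` is hypothetical and only illustrates the three cases described above.

```r
# Hypothetical raw answers: comma-separated text, a plain number, and a blank
raw_steps = c("15,400", "9800", "", "12,050")

# Remove the commas, then convert; the empty string becomes NA
clean_steps = as.numeric(gsub(",", "", raw_steps, fixed = TRUE))

print(clean_steps)
print(sum(is.na(clean_steps)))  # number of blanks that turned into NA
```

An empty string converts to NA quietly. Text that is not a number at all (for example "about 10k") would also become NA, but with a coercion warning, so a warning here is a sign that something other than commas and blanks is in the column.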

food. I lowercased and trimmed the answers so spelling and spacing match. Only one response listed more than one item, "cookies and french toast". The rule was to keep the single least common item, and french toast shows up once while cookies shows up three times, so french toast is the one I keep. Every other answer was already a single dish.

survey$food = tolower(trimws(survey$food))
survey$food[survey$food == "cookies and french toast"] = "french toast"
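
The "keep the least common item" rule can also be applied by counting instead of by hand. The sketch below assumes multi-item answers are joined with " and "; the `food` vector and the helper `keep_rarest` are hypothetical, not part of the survey code.

```r
# Hypothetical toy answers, not the real survey data
food = c(" Cookies", "cookies", "pizza", "cookies and french toast")
food = tolower(trimws(food))

# Split every answer into its items and count how often each item appears
items  = unlist(strsplit(food, " and ", fixed = TRUE))
counts = table(items)

# For one answer, keep the item with the lowest overall count
keep_rarest = function(answer) {
  parts = strsplit(answer, " and ", fixed = TRUE)[[1]]
  parts[which.min(counts[parts])]
}

cleaned = vapply(food, keep_rarest, character(1), USE.NAMES = FALSE)
print(cleaned)
```

If two items are tied, `which.min` keeps the first one listed, so a tie would still need a judgment call.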

friends. One value, 67, stood out as not believable and is almost certainly a “6-7” meme answer rather than a real count, so I drop it (set it to NA). The next highest values, 40 and 35, are high but reasonable for a social summer program, so I keep those.

survey$friends[survey$friends > 50] = NA
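
Before picking a cutoff it helps to look at the largest values side by side. This is a sketch on a made-up vector (the `friends` values here are hypothetical, chosen to mirror the 67, 40, and 35 mentioned above). It uses `which()` so that any existing NA is skipped in the comparison.

```r
# Hypothetical friend counts, including one NA and the implausible 67
friends = c(12, 40, NA, 67, 35, 8)

# Look at the largest values first to see where the believable ones stop
print(head(sort(friends, decreasing = TRUE), 3))

# which() ignores NA comparisons, so only values truly above the cutoff change
friends[which(friends > 50)] = NA

print(friends)
```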

Research Question 1: Do students who take more photos also make more friends?

plot(survey$photos, survey$friends, pch = 19, col = "steelblue",
     xlab = "photos taken", ylab = "friends made",
     main = "Photos vs. friends made")
abline(lm(friends ~ photos, data = survey), col = "red", lwd = 2)

cor(survey$photos, survey$friends, use = "complete.obs")
## [1] 0.213247

Answer. There is only a weak positive link. The correlation is about r = 0.21, and the fitted line rises by roughly 0.06 friends for each extra photo. That slope is not statistically significant at the usual 0.05 level (p is about 0.07), so the pattern could be noise. The most photo-heavy students do not clearly make more friends than everyone else, and the cloud of points is wide. I would say photo-taking and friend-making move together a little, but not enough to call it a real relationship.
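
The p-value can be checked from the correlation alone, because in a one-predictor regression the test on the slope is the same as the test on r. The sketch below is a rough check, not the original calculation. The helper `p_from_r` is hypothetical, and n = 71 is an assumption (72 respondents minus the one dropped friends value, with no missing photo counts).

```r
# Two-sided p-value for a Pearson correlation r computed from n complete pairs
p_from_r = function(r, n) {
  t_stat = r * sqrt(n - 2) / sqrt(1 - r^2)  # t statistic with n - 2 df
  2 * pt(-abs(t_stat), df = n - 2)
}

# r is the value reported above; n = 71 is an assumption, not a reported number
p_value = p_from_r(0.213247, 71)
print(round(p_value, 2))
```

On the real data, `cor.test(survey$photos, survey$friends)` or `summary(lm(friends ~ photos, data = survey))` would report the same test directly.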

Research Question 3: How physically active is this group based on daily steps?

hist(survey$steps, breaks = 8, col = "skyblue",
     xlab = "steps", main = "Daily steps (45 students who answered)")
abline(v = mean(survey$steps, na.rm = TRUE), col = "red", lwd = 2)
abline(v = median(survey$steps, na.rm = TRUE), col = "darkgreen", lwd = 2, lty = 2)
legend("topright", legend = c("mean", "median"),
       col = c("red", "darkgreen"), lwd = 2, lty = c(1, 2), bty = "n")

summary(survey$steps)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4683   11411   14136   13911   15630   22341      27

Answer. The students who reported steps are quite active. Of the 45 who answered (45 of 72, or 62.5 percent of the sample), the median is 14,136 steps and the mean is 13,911. The mean and median sit almost on top of each other, so the distribution is roughly symmetric. Steps run from a low of 4,683 to a high of 22,341, with a standard deviation around 3,600, and 91 percent of these students logged at least 10,000 steps. The one student near 4,700 pulls the left tail out a bit, but most of the group clusters in the 11,000 to 16,000 range. The 27 blanks are worth remembering: this conclusion only covers the people who actually tracked and reported their steps.
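
The extra numbers quoted in this answer (how many answered, the standard deviation, the share at 10,000 or more) are not part of `summary()`. A minimal sketch of how they can be computed is below, on a made-up vector. The `steps` values are hypothetical and will not reproduce the figures above.

```r
# Hypothetical step counts with blanks already converted to NA
steps = c(4683, 11200, 14136, 15630, 22341, NA, NA)

n_answered     = sum(!is.na(steps))                  # how many reported steps
share_answered = mean(!is.na(steps))                 # fraction of the sample
sd_steps       = sd(steps, na.rm = TRUE)             # spread among those who answered
share_10k      = mean(steps >= 10000, na.rm = TRUE)  # fraction at or above 10,000

print(c(n_answered = n_answered, share_answered = share_answered,
        sd_steps = sd_steps, share_10k = share_10k))
```

Running the same four lines on `survey$steps` gives the values for the real sample.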