The HTML and RMD files can be found on GitHub.
Bellabeat is a women’s wellness brand that offers a comprehensive range of products and services designed to enhance women’s health. The company specializes in creating wearables and complementary products that track biometric and lifestyle data, enabling women to gain a deeper understanding of their bodies and make informed decisions about their health. By gathering data on activity, sleep, stress, and reproductive health, Bellabeat empowers women with the knowledge they need to improve their overall well-being and lifestyle habits.
Bellabeat products:
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Gain insights for Bellabeat smart products from usage trends among
consumers of non-Bellabeat smart devices, in order to reveal growth
opportunities for the company's digital marketing strategy.
In this analysis, we explore approaches to effectively engage
potential consumers of smart devices and identify trends and patterns
related to their behaviors. Our analysis will cover various aspects of
user activity, including sleep patterns, health-related metrics such as
calorie and step counts, as well as intensity levels, heart rates, and
MET (metabolic equivalent of task) measurements.
Key stakeholders in
this analysis include Bellabeat’s co-founder and Chief Creative Officer
Urška Sršen, as well as the Bellabeat executive and marketing analytics
teams.
For our analysis we will use the FitBit Fitness
Tracker Data.
Where is your data stored?
The data can be found online on the Kaggle platform; for this project
it was downloaded and used locally.
How is the data organized?
The data consists of 18 datasets in wide and long formats.
The datasets contain information on activity, sleep, calories,
intensities, steps, MET scores, heart rate, and weight.
Are there issues with bias or credibility in this data?
The dataset was generated by respondents to a survey distributed via
Amazon Mechanical Turk between 03.12.2016 and 05.12.2016 (March 12 to
May 12, 2016), which matches the April and May dates seen in the files.
A total of 33 eligible Fitbit users consented to the submission of
personal tracker data and became part of the dataset.
Addressing licensing, privacy, security, and accessibility
The datasets do not reveal any private details about users and can be
used by the public for analysis.
How can the data help you achieve your goal?
Our goal is to identify meaningful patterns and insights from the
data to inform our marketing strategy.
We plan to explore
relationships between different measures and visualize the data at
various time intervals, such as minute, hour, and week. Additionally, we
will analyze the distribution of different values to gain further
insights. All of these approaches will help us achieve our marketing
objectives.
Are there any problems with the data?
The Kaggle description mentions some variation in the output, which
reflects the use of different types of Fitbit trackers and individual
tracking behaviors and preferences. To address this, we will apply data
cleaning and normalization techniques, and use statistical analysis to
identify and adjust for biases or inconsistencies in the data.
In addition, no information is provided on how the different levels in
the data were determined or defined. We will therefore conduct
exploratory data analysis (EDA) to look for patterns or inconsistencies
that might indicate how the levels were determined. This will involve
examining the distributions of the data, looking for outliers, and
examining relationships between different variables.
Overview of the datasets:
Tools - We are programming in R using RStudio for
our analysis.
Data Cleaning & Integrity Check -
We have checked the integrity of the data, cleaned it, and documented
the process. All data sources were reviewed for errors,
inconsistencies, and missing values, and the necessary corrections were
made. We can confidently state that the data is now properly formatted,
accurate, and suitable for use in achieving our business objectives. For
the purposes of this analysis, we go through each data category and
include only the most important cleaning steps and reliability checks, so
that the analysis stays coherent, logical, and easy to follow, with a
focus on the main points in each category.
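The kind of integrity checks described above can be sketched in a few lines of base R. This is a minimal illustration rather than the exact procedure used in this project: the `activity` table and its values are invented, and only the column names follow the daily activity data.

```r
# Toy stand-in for a daily activity table; the values are invented for illustration.
activity <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016", "4/12/2016", "4/13/2016"),
  TotalSteps = c(5000, NA, 7200, 7200, 0)
)

# Missing values per column
na_counts <- colSums(is.na(activity))

# Fully duplicated rows (row 4 repeats row 3)
n_duplicates <- sum(duplicated(activity))

# Number of distinct users
n_users <- length(unique(activity$Id))

# Drop duplicated rows and rows with a missing step count
activity_clean <- activity[!duplicated(activity) & !is.na(activity$TotalSteps), ]

print(na_counts)
print(n_duplicates)
print(n_users)
print(nrow(activity_clean))
```

Running the same three counts on every imported table gives a quick picture of where cleaning is needed before any joins are made.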
Our initial steps would involve installing necessary packages using
the install.packages() function and loading the required libraries.
We would also import relevant datasets using the read_csv() function
and transform any date columns to the correct datetime format for
consistency and accuracy.
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
library(lubridate)
library(here)
library(snakecase)
library(ggrepel)
library(hms)
dailyActivity <- mutate(dailyActivity,ActivityDate_new = mdy(ActivityDate))
# Changing char date to Date format, we will repeat this code to all of our datasets
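The import step itself is not shown above, so here is a small self-contained sketch of the idea. It uses base R (`read.csv()` and `as.Date()`) in place of `read_csv()` and `mdy()` so that it runs without extra packages. The file contents and user ids are invented, and the minute-level timestamp layout is an assumption about how those files look.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The column layout mimics the daily activity export; the rows are made up.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,ActivityDate,TotalSteps",
  "1001,4/12/2016,9500",
  "1001,4/13/2016,8700"
), csv_path)

daily <- read.csv(csv_path, stringsAsFactors = FALSE)

# Dates arrive as month/day/year text; convert them to the Date class
daily$ActivityDate_new <- as.Date(daily$ActivityDate, format = "%m/%d/%Y")

# Assumed layout of a minute-level timestamp, parsed to POSIXct.
# Note: AM/PM parsing via %p depends on an English-like locale.
stamp <- as.POSIXct("4/12/2016 12:01:00 AM",
                    format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

str(daily)
print(stamp)
```

Fixing the time zone explicitly (here UTC) keeps hour-of-day summaries stable across machines.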
We can now commence our analysis by examining each data category, which may entail analyzing individual datasets or, for some sections, combined datasets.
Datasets in use:
Upon examining the dailyActivity table, it becomes apparent that numerous rows contain 0 values across all variables except for sedentary minutes and calories burned. In order to derive meaningful insights regarding activity patterns throughout the day, it is necessary to clean this data.
head(filter(dailyActivity, TotalSteps == 0)) # Incomplete data
## # A tibble: 6 × 16
## Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.50e9 5/12/2… 0 0 0 0 0 0 0 0
## 2 1.84e9 4/24/2… 0 0 0 0 0 0 0 0
## 3 1.84e9 4/25/2… 0 0 0 0 0 0 0 0
## 4 1.84e9 4/26/2… 0 0 0 0 0 0 0 0
## 5 1.84e9 5/2/20… 0 0 0 0 0 0 0 0
## 6 1.84e9 5/7/20… 0 0 0 0 0 0 0 0
## # … with 6 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## # LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## # ActivityDate_new <date>, and abbreviated variable names ¹ActivityDate,
## # ²TotalSteps, ³TotalDistance, ⁴TrackerDistance, ⁵LoggedActivitiesDistance,
## # ⁶VeryActiveDistance, ⁷ModeratelyActiveDistance, ⁸LightActiveDistance,
## # ⁹SedentaryActiveDistance
dailyActivity <- dailyActivity %>% filter(TotalSteps > 0)
There are four different levels of activity: Very active, Fairly active, Lightly active, and Sedentary active. To evaluate the reliability of these activity levels, we will investigate the relationship between time and distance. A linear relationship should exist, with the greatest slope observed for the very active level and the least for the sedentary level.
distance_long <- dailyActivity %>%
select(id = Id, date = ActivityDate_new, very_active = VeryActiveDistance, fairly_active = ModeratelyActiveDistance,
lightly_active = LightActiveDistance, sedentary_active = SedentaryActiveDistance) %>%
pivot_longer(cols = very_active:sedentary_active,
names_to = "activity_type", values_to = "average_distance")
minutes_long <- dailyActivity %>%
select(id = Id, date = ActivityDate_new, very_active = VeryActiveMinutes, fairly_active=FairlyActiveMinutes,
lightly_active = LightlyActiveMinutes, sedentary_active = SedentaryMinutes) %>%
pivot_longer(cols = very_active:sedentary_active,
names_to = "activity_type", values_to = "average_time")
activityDistanceTime <- inner_join(distance_long,minutes_long, by = c("date","id","activity_type"))
activityDistanceTime %>% ggplot(aes(x= average_time, y = average_distance)) +
geom_point() + geom_smooth(method = "lm") + facet_wrap(~activity_type) +
labs(title = "Activity Distance by Minutes", x = "Minutes", y = "Distance")
As expected, we have observed differences in the slope values, with
the very active level having the greatest slope.
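The slope comparison can also be made numerically instead of being read off the plot. The sketch below fits one linear model per activity level on synthetic data; the distance-per-minute rates are assumptions chosen for illustration, and only the column names follow `activityDistanceTime`.

```r
# Synthetic minutes/distance pairs for two activity levels (values invented):
# the "very_active" group covers more distance per minute than "lightly_active".
set.seed(1)
minutes <- rep(seq(10, 100, by = 10), times = 2)
level <- rep(c("very_active", "lightly_active"), each = 10)
rate <- ifelse(level == "very_active", 0.10, 0.03)  # distance per minute, assumed
toy <- data.frame(
  activity_type = level,
  average_time = minutes,
  average_distance = rate * minutes + rnorm(20, sd = 0.05)
)

# Fit one linear model per activity level and keep the slope
slopes <- sapply(split(toy, toy$activity_type), function(d) {
  coef(lm(average_distance ~ average_time, data = d))[["average_time"]]
})

print(round(slopes, 3))
```

Applied to the real table, an ordering of slopes from very active down to sedentary would support the reliability of the activity levels.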
To further evaluate the dailyActivity dataset, we will verify that the sum of all activity minutes equals the total number of minutes in a day (1440) minus the time spent in bed. To achieve this, we will need to merge the data from the sleepDay dataset to obtain the time spent in bed for each day.
dailyActivity <- dailyActivity %>% mutate(sleep_mins = 1440-VeryActiveMinutes-FairlyActiveMinutes-
LightlyActiveMinutes-SedentaryMinutes)
testTotalMinutes <- inner_join(dailyActivity,sleepDay,by = c("Id","ActivityDate_new" = "SleepDay_new"), multiple = "all")
testTotalMinutes <- testTotalMinutes %>% filter(sleep_mins == TotalTimeInBed)
nrow(testTotalMinutes)
## [1] 126
We observed that only 126 out of 413 rows accurately represent the
total minutes in a day.
The issue is that sedentary time includes
time spent in bed. Despite this issue, after verifying that our data
includes a diverse group of users, we can proceed to examine the
distribution of daily activity. We will select the necessary data and
transform it into a long format to create a new data table for
analysis.
testActivity <- testTotalMinutes %>%
summarise(avg_very_active = mean(VeryActiveMinutes),
avg_fairly_active = mean(FairlyActiveMinutes),
avg_lightly_active = mean(LightlyActiveMinutes),
avg_sedentary = mean(SedentaryMinutes)) %>%
pivot_longer(cols = avg_very_active:avg_sedentary,
names_to = "activity_type", values_to = "average_minutes")
summed_activity <- testActivity %>% group_by(activity_type) %>%
summarise(average_minutes = mean(average_minutes)) %>%
mutate(average_minutes_percent = average_minutes/sum(average_minutes)) %>% # mutate keeps one row per activity type
arrange(activity_type)
my_colors <- c("#1C0C5B", "#3D2C8D", "#916BBF", "#C996CC")
my_labels <- c("Fairly Active","Lightly Active","Sedentary","Very Active")
summed_activity %>%
mutate(avg_hours = format(as.POSIXct(average_minutes_percent*24*3600*(sum(average_minutes)/1440),
origin = "1970-01-01", tz = "UTC"),"%H:%M")) %>%
ggplot() + geom_col(aes(x = "", y = average_minutes_percent, fill = my_colors), width = 1) +
coord_polar("y", start = 0) + theme_void() +
geom_text_repel(aes(x = 1.8, y = average_minutes_percent, color = my_colors,
label = paste0(scales::percent(average_minutes_percent, accuracy = .1),
"\n",avg_hours))
,size = 3.8, fontface = "bold",box.padding = 0.1, position = position_stack(vjust = 0.5)) +
labs(title = "Average Duration of Waking Time Spent by Activity Type in a Day", fill = "Activity Type") +
theme(plot.title = element_text(hjust = 0.3, face = "bold", size = 15,margin = margin(t =10,b = 5)), legend.title = element_text(face = "bold"))+
guides(color = "none") + # removing just the color legend
scale_fill_manual(values = my_colors, label = my_labels) + scale_color_manual(values = my_colors)
Based on the data analysis, it is evident that users spend a
significant portion of their day in sedentary activities, averaging at
13 hours and 10 minutes per day.
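The share of the day and the hours-and-minutes labels used in the chart can be derived directly from the average minutes. The following is a simplified sketch of that conversion; the minute values are illustrative and not the results of this analysis, and `to_hhmm()` is a hypothetical helper.

```r
# Average minutes per activity level; illustrative numbers only.
avg_minutes <- c(very_active = 25, fairly_active = 18,
                 lightly_active = 217, sedentary = 790)

# Share of the tracked time spent at each level
share <- avg_minutes / sum(avg_minutes)

# Hypothetical helper: format a duration in minutes as "HH:MM"
to_hhmm <- function(mins) {
  mins <- round(mins)
  sprintf("%02d:%02d", as.integer(mins %/% 60), as.integer(mins %% 60))
}

print(round(share, 3))
print(to_hhmm(avg_minutes))
```

For example, 790 minutes is formatted as "13:10", that is, 13 hours and 10 minutes.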
Datasets in use:
Our next step is to calculate the mean number of steps per user using the dailyActivity dataset.
dailyActivity %>% group_by(Id) %>%
summarise(user_avg_daily_steps = mean(TotalSteps)) %>%
summarise(average_daily_steps = mean (user_avg_daily_steps))
## # A tibble: 1 × 1
## average_daily_steps
## <dbl>
## 1 7922.
The average daily steps of 7922 falls short of the recommended 10,000 steps per day. It would be interesting to see if the overall trend of daily steps is increasing or not.
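Averaging per user first, as done above, gives every user the same weight regardless of how many days they logged. A small sketch with invented numbers shows how this can differ from a pooled average over all rows:

```r
# Toy data: user 1 logged four days, user 2 only one (values invented).
steps <- data.frame(
  Id = c(1, 1, 1, 1, 2),
  TotalSteps = c(4000, 5000, 6000, 5000, 12000)
)

# Average per user first, then across users, so each user counts once
per_user <- tapply(steps$TotalSteps, steps$Id, mean)
mean_of_user_means <- mean(per_user)

# Pooled average over all rows, which favours users with more logged days
pooled_mean <- mean(steps$TotalSteps)

print(mean_of_user_means)  # 8500
print(pooled_mean)         # 6400
```

With only 33 users and uneven logging, the per-user approach is arguably the safer summary of a "typical" user.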
dailyActivity %>% ggplot(aes(x= as.POSIXct(ActivityDate_new), y = TotalSteps)) +
geom_jitter() + geom_smooth(method = "lm") +
labs(title = "Daily Steps Among All Users", x = "Date", y = "Total Steps") +
scale_x_datetime(date_labels = "%Y-%m-%d", timezone = "America/Los_Angeles")
While a linear relationship may provide interesting insights, it
seems that there is no such trend present in the data. Nonetheless,
analyzing the distribution of steps throughout the day may still be
useful in gaining further insights.
minuteStepsNarrow %>% group_by(hour = format(ActivityMinute_new, format = "%H")) %>%
summarise(avg_step = mean(Steps)*60) %>%
ggplot(aes(x = hour, y = avg_step)) +
geom_bar(stat = "identity", fill = "#66347F") +
labs(title = "Step Count by Hour", x = "Hour", y = "Average Steps")
We can see that the majority of steps are taken between 12:00-20:00,
with a dip between 15:00-16:00.
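The hourly summary behind this chart takes the mean steps per minute within each hour and multiplies it by 60. A base R sketch on invented minute-level data for a single user illustrates the calculation:

```r
# Minute-level step counts for one invented user across two hours
ts <- as.POSIXct("2016-04-12 08:00:00", tz = "UTC") + 60 * (0:119)
minute_steps <- data.frame(
  ActivityMinute_new = ts,
  Steps = c(rep(10, 60), rep(30, 60))  # 10 steps/min in hour 08, 30 in hour 09
)

# Extract the hour label, average steps per minute within each hour,
# then scale by 60 to express the result as steps per hour
hour <- format(minute_steps$ActivityMinute_new, "%H", tz = "UTC")
steps_per_hour <- tapply(minute_steps$Steps, hour, mean) * 60

print(steps_per_hour)
```

Here hour 08 yields 600 steps and hour 09 yields 1800 steps, matching the per-minute rates that were put in.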
The first step in our analysis would be to examine the range of values in each dataset. Once we have assessed the range, we will then proceed to join the necessary tables.
unique(minuteIntensitiesNarrow$Intensity) # Four distinct values of intensity: 0,1,2,3
## [1] 0 1 2 3
range(minuteCaloriesNarrow$Calories) # 0-19.75
## [1] 0.00000 19.74995
range(minuteStepsNarrow$Steps) # Steps ranges 0-220
## [1] 0 220
Our next step is to merge the minuteIntensitiesNarrow,
minuteCaloriesNarrow, and minuteStepsNarrow tables.
To ensure
computational efficiency, we will take a sample of 100,000 rows out of
the 1,325,580 total rows. This will provide us with a 99% confidence
level and a margin of error of 0.4%.
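The stated margin of error can be reproduced with the standard formula for a proportion estimated from a simple random sample. This sketch assumes the worst-case proportion of 0.5 and applies a finite population correction; the text does not say which formula was used, so treat this as one plausible derivation.

```r
# Margin of error for a proportion from a simple random sample,
# using the conservative case p = 0.5 and a finite population correction.
N <- 1325580      # rows in the joined minute-level table (from the text)
n <- 100000       # sample size (from the text)
conf_level <- 0.99
p <- 0.5          # assumption: worst-case proportion

z <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value, about 2.576
fpc <- sqrt((N - n) / (N - 1))         # finite population correction
margin <- z * sqrt(p * (1 - p) / n) * fpc

print(round(100 * margin, 2))  # margin of error in percent
```

The result is just under 0.4%, consistent with the figure quoted above.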
cal_int_steps <- inner_join(minuteCaloriesNarrow, minuteIntensitiesNarrow, by = c("Id","ActivityMinute_new")) %>%
inner_join(minuteStepsNarrow,by = c("Id","ActivityMinute_new"))
cal_int_steps <- select(cal_int_steps,c("Id","Calories","Intensity","Steps","ActivityMinute_new"))
sampled_data <- cal_int_steps %>% sample_n(100000) # Using sample of that data
We begin with the correlation between calories and steps.
sampled_data %>% ggplot(aes(x = Steps, y = Calories)) + geom_jitter() +
geom_smooth(method = "lm") +
annotate("text", x = 90, y = 18, label = paste0("r = ", round(cor(sampled_data$Steps, sampled_data$Calories), 2)), color = "blue", size = 5) + # annotate() draws the label once instead of once per row
labs(title = "Examining the Steps-Calories Relationship")
The linear connection between steps and calories burned makes sense,
as physical activity usually leads to higher energy expenditure.
ggplot(data = sampled_data,aes(x=Intensity ,y=Calories)) + geom_jitter() +
geom_smooth(method = "lm") +
annotate("text", x = 1.5, y = 18, label = paste0("r = ", round(cor(sampled_data$Intensity, sampled_data$Calories), 2)), color = "blue", size = 5) + # annotate() draws the label once instead of once per row
labs(title = "Examining the Intensity-Calories Relationship")
We expected to see a clear separation of calorie burn by intensity
level and step count. However, we observe some discrepancies that
require further investigation. Specifically, a higher step count does
not always come with a proportional increase in calorie burn. Next, we
break the steps-calories relationship down by intensity level to see
why the distinction is not clear.
ggplot(data = sampled_data) +
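geom_smooth(mapping = aes(x = Steps, y = Calories, color = factor(Intensity)), method = "gam", formula = y ~ s(x, bs = "cs", k = 8), se = FALSE) + # k and se are smoother settings, not aesthetics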
labs(color = "Intensity Level", title = "Steps and Intensity: A Breakdown of Caloric Burn" ) +
scale_color_manual(values = c("lightgreen", "blue", "orange", "red"),
labels = c("Low Intensity","Light Intensity","Moderate Intensity","High Intensity"))
From the data, we can observe that for a certain level of steps,
light intensity activities burn more calories than moderate intensity
activities.
To better understand intensity, we can examine the correlation between MET and intensity together with heart rate data. MET stands for metabolic equivalent of task and is a unit for measuring the energy cost of physical activities. One MET corresponds to the energy expended while sitting quietly, so physical activities can be rated in METs to indicate their intensity level.
# heart rate seconds to minutes in order to join with cal_int_steps
heartrate_minutes <- heartrate_seconds %>%
group_by(ActivityMinute_new = floor_date(Time_new, unit = "minute"), Id) %>%
summarize(average_heartrate = mean(Value))
intensityMET <- inner_join(minuteIntensitiesNarrow,minuteMETsNarrow, by = c("Id","ActivityMinute_new"))
intensityMET <- intensityMET %>% select(activity_minute = ActivityMinute_new,id = Id, intensity = Intensity, mets = METs)
intens_met_heartrate <- inner_join(intensityMET,heartrate_minutes, by = c("id" = "Id","activity_minute" = "ActivityMinute_new"))
intens_met_heartrate %>% sample_n(100000) %>% ggplot(aes(x = mets, y=intensity)) + geom_jitter(aes(color = average_heartrate)) +
geom_smooth(color = "black",method = "lm",se = FALSE) +
geom_text(aes(x=42,y=4.5, label = paste0("r = ",round(cor(mets,intensity),2))),color = "black", size = 5) +
labs(title = "Correlating Intensity with METs Score",color = "Heart rate") +
scale_color_gradient(low = "lightgreen", high = "red") +
scale_y_continuous(breaks = seq(0,3,1)) +
scale_x_continuous(breaks = seq(0,150,20)) +
theme_bw() +ylab("Intensity") + xlab("MET")
From the graph, we can observe that there is a minimum amount of MET
required to reach a certain intensity level. For example, level 1 starts
at 21 MET, level 2 starts at 42 MET, and level 3 starts at 60 MET per
minute. However, we can also see that MET is not the only variable that
determines the intensity level, as earlier we saw that the number of
calories burned also plays a role. Additionally, we can observe that
higher intensity levels are associated with higher heart rates.
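The thresholds read off the graph can be confirmed numerically by taking the lowest MET value observed at each intensity level. The sketch below uses invented records arranged to mirror the thresholds reported above; the column names follow `intens_met_heartrate`.

```r
# Invented minute-level records: intensity level and MET value as exported
toy <- data.frame(
  intensity = c(0, 0, 1, 1, 1, 2, 2, 3, 3),
  mets      = c(10, 12, 21, 30, 38, 42, 55, 60, 95)
)

# Lowest and highest MET value observed at each intensity level
min_met <- tapply(toy$mets, toy$intensity, min)
max_met <- tapply(toy$mets, toy$intensity, max)

print(rbind(min = min_met, max = max_met))
```

If the ranges of adjacent levels overlap in the real data, that is a sign that MET alone does not determine the intensity level.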
Datasets in use:
To begin our analysis, let’s start by examining the average amount of sleep time per user.
sleepsSumTable <- sleepDay %>% group_by(Id) %>% summarise(avg_user_daily_sleep = mean(TotalMinutesAsleep))
mean(sleepsSumTable$avg_user_daily_sleep)
## [1] 377.6475
On average, users sleep for 377 minutes, or 6 hours and 17 minutes, which is below the commonly recommended minimum of 7 hours per night for adults. We can further explore the data to see at what times users tend to sleep.
minuteSleep %>% ggplot(aes(x= format(date_new, "%H"))) +
geom_bar(stat = "count", width = 0.9, fill = "#001F3F") +
labs(title = "Sleep Distribution by Hour", x = "", y = "Count") +
theme_bw() + theme(axis.text.y = element_blank()) # theme_bw() goes first, otherwise it resets the axis text
The majority of users are observed to be asleep between 23:00 and
07:00, which matches a typical night's sleep and suggests that the data
is reliable. We can now analyze how sleep patterns vary across the
week.
sleepInWeek <- sleepDay %>% mutate(week_day = wday(SleepDay_new, label = TRUE, locale = "en"))
sleepInWeek %>% group_by(week_day,Id) %>% summarise(avg_sleep = mean(TotalMinutesAsleep)) %>%
arrange(week_day) %>% group_by(week_day) %>% summarise(avg_sleep_weekday = mean(avg_sleep)) %>%
ggplot(aes(x= week_day, y= avg_sleep_weekday)) +
geom_bar(stat = "identity", fill = "#3E54AC") +
geom_text(aes(label = round(avg_sleep_weekday)), color = "white", vjust = 1.5 , size = 6)+
labs(title = "Minutes of Sleep Throughout the Week", x = "", y = "Sleep time (min)")
Based on the sleep data, we can see that users tend to sleep more at
the beginning of the week, and there is a noticeable drop in sleep time
between Wednesday and Thursday. To investigate this further, we can
explore whether there is a correlation between light activity and sleep
time. For this, we can use the long-format dailyActivity data, which
includes information on activity type, distance, and time spent.
sleepInWeek <- sleepDay %>% mutate(week_day = wday(SleepDay_new, label = TRUE, locale = "en"))
activityDistanceTime <- mutate(activityDistanceTime, week_day = wday(date, label = TRUE, locale = "en"))
sleepVsLight <- inner_join(filter(activityDistanceTime,activity_type == "lightly_active"),
sleepInWeek, by = c("id" = "Id", "date" = "SleepDay_new","week_day"))
sleepVsLight <- select(sleepVsLight,id,average_distance,average_time,week_day, TotalMinutesAsleep)
sleepVsLight %>% ggplot(aes(x=average_time,y=TotalMinutesAsleep)) + geom_point() +
labs(title = "Light Activity vs. Sleep Time (in Minutes)", x = "Light Activity Time", y = "Sleep Time")
Based on the scatter plot, we do not see a clear relationship between
light activity and sleep time. Moving on, let’s now examine how sleep
and light activity change throughout the week.
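A visual impression of "no relationship" can be backed up with a correlation test. The sketch below runs `cor.test()` on synthetic data in which the two variables are generated independently; the numbers are invented and only illustrate how the check could be run on `sleepVsLight`.

```r
# Synthetic daily records: light-activity minutes and minutes asleep,
# generated independently so no relationship is expected (values invented).
set.seed(42)
light_minutes <- round(runif(200, min = 60, max = 360))
sleep_minutes <- round(rnorm(200, mean = 420, sd = 60))

# Pearson correlation with a significance test
result <- cor.test(light_minutes, sleep_minutes, method = "pearson")

print(round(unname(result$estimate), 3))  # correlation coefficient r
print(round(result$p.value, 3))           # p-value for H0: r = 0
```

A coefficient near zero together with a large p-value would support the reading of the scatter plot. Note that repeated measurements from the same user are not independent, so the p-value should be treated with caution.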
sleepVsLight %>% group_by(week_day,id) %>%
summarise(avg_time = mean(average_time), sleep_time = mean(TotalMinutesAsleep)) %>%
group_by(week_day) %>% summarise(avg_time = mean(avg_time), sleep_time = mean(sleep_time)) %>%
ggplot(aes(x=week_day, group = 1)) + geom_line(aes(y=sleep_time, color = "Sleep Time"), linewidth = 1.5) + # linewidth replaces size for lines in ggplot2 3.4+
geom_line(aes(y = avg_time, color = "Light Activity Time"), linewidth = 1.5) +
labs(color = "", title = "Weekly Trends in Light Activity and Sleep Durations") +
ylab("Time (min)") + xlab("")
Looking at the graph, we can observe that on Sundays users tend to
sleep the most with the least amount of light activity, while on
Saturdays they have the most amount of light activity but an average
amount of sleep.
After exploring sleep and activity patterns throughout the week, we will now take a deeper look into sleep stages. We will begin by examining the typical sleep pattern of users with an average number of sleep observations.
colnames(minuteSleep) = to_snake_case(colnames(minuteSleep))
averageObs <- minuteSleep %>% group_by(id, log_id) %>% reframe(sleep_obs = length(value)) %>%
arrange(-sleep_obs) %>% slice((n() %/% 2 - 1):(n() %/% 2 + 2)) # %/% is integer division
topMinSleepObserv <- minuteSleep[minuteSleep$id %in% averageObs$id & minuteSleep$log_id %in% averageObs$log_id,]
topMinSleepObserv %>% ggplot() + geom_point(aes(x = as_hms(date_new + hours(12)), y = value)) + # Adding 12h for continuous graph look
facet_wrap(~log_id) + labs(title = "Snapshots of User Sleep Patterns") + theme_bw() +
ylab("Sleep Value") + scale_y_continuous(breaks = seq(1,3,1)) +xlab("") +
theme(axis.text.x = element_blank())
From the initial analysis, it appears that most of the sleep time is
spent in stage 1, with a decreasing amount of time in deeper sleep
stages. Additionally, some users do not have any stage 3 sleep, and
there are variations in the amount of stage 3 sleep cycles among users.
We will now examine the numerical data to gain a more comprehensive
understanding of these observations.
sleep_labels <- c("Rapid Eye Movement (3)","Non Rapid Eye Movement (2)", "Light Sleep (1)")
sleep_values <- c("#0B2447", "#19376D", "#576CBC")
minuteSleep %>% group_by(value) %>% summarise(sleep_percent = n()/ length(minuteSleep$value)) %>%
ggplot() + geom_bar(aes(x = "",y = sleep_percent, fill = rev(factor(value))), stat = "identity",width = 1) +
coord_polar("y", start = 0) +
theme_void() +
labs(title = "Average Sleep Stage Distribution", fill = "Sleep Phases" ) +
geom_text_repel(aes(x = 1.67, y = sleep_percent, label = scales::percent(sleep_percent, accuracy = .1))
,size = 4.5, fontface = "bold", color = rev(sleep_values), position = position_stack(vjust = 0.5),box.padding = 0) +
theme(plot.title = element_text(hjust = 4, size = 18, face = "bold"),
legend.title = element_text(face = "bold"),
legend.margin = margin(t = 0, r = 50, b = 10, l = 0)) +
scale_fill_manual(values = sleep_values, labels = sleep_labels) +
guides(fill = guide_legend(reverse = TRUE))
According to our initial analysis, only one percent of sleep is in
deep sleep, which is lower than the expected 7-8 minutes per sleep
according to studies. To gain a deeper understanding, we will create a
table that shows the chunks of deep sleep for each user, as well as the
level of sleep, which indicates how many cycles of deep sleep occur
during the sleep period.
sleep_three <- minuteSleep %>% filter(value == 3) %>% group_by(id) %>% arrange(id,date_new)
current_sleep <- 1
level <- 1
df_row <- 1
deep_sleep_df <- data.frame(id = numeric(), date = POSIXct(),level = numeric()
, deep_sleep_length = numeric(),stringsAsFactors = FALSE)
for(i in 2:nrow(sleep_three)){
time_diff = as.numeric(difftime(sleep_three[i,]$date_new,sleep_three[i-1,]$date_new),units = "mins")
if(time_diff > 0 && time_diff <=1){
current_sleep <- current_sleep +1
}else{
deep_sleep_df[df_row,1] <- sleep_three[i-1,]$id
deep_sleep_df[df_row,2] <- sleep_three[i-1,]$date_new
deep_sleep_df[df_row,3] <- level
deep_sleep_df[df_row,4] <- current_sleep
if(sleep_three[i,]$id == sleep_three[i-1,]$id &&
as.numeric(difftime(sleep_three[i,]$date_new,sleep_three[i-1,]$date_new),units = "mins") < 180){
level <- level + 1
}else {level <- 1}
current_sleep <- 1
df_row <- df_row + 1
}
}
rm(current_sleep, df_row, time_diff)
deep_sleep_df %>% group_by(id) %>% summarise(avg_deep_per_user = mean(deep_sleep_length)) %>%
summarise(average_deep = mean(avg_deep_per_user))
## # A tibble: 1 × 1
## average_deep
## <dbl>
## 1 3.18
We found that the average length of a stage 3 (deep sleep) episode was only 3.18 minutes, or about 3 minutes and 11 seconds, which is considerably less than what research suggests. Now, let’s examine how the episode lengths vary.
deep_sleep_df %>% ggplot(aes(x = deep_sleep_length)) + geom_bar(stat = "count") +
labs(title = "Deep Sleep Length in One Night: A Close Look") +
xlab("Deep sleep length") +ylab("Count")
Many deep sleep episodes last just 1 minute, which appears to be
unusual. To investigate this further, we will examine the sleep cycles,
or levels, of the users.
deep_sleep_df %>% ggplot(aes(x = level)) + geom_bar(stat = "count") +
labs(title = "Deep Sleep Cycles in One Night") +
xlab("Deep sleep cycles") +ylab("Count")
When we examine the sleep cycles, we see that most users have only one
cycle of deep sleep, which is inconsistent with the typical four to six
cycles per night for adults. Furthermore, some users seem to have more
than 20 cycles of sleep, which is not plausible. Together with the very
short episodes seen earlier, these findings lead us to conclude that the
data on deep sleep is unreliable for further analysis. Therefore, we
will not be including our findings related to sleep stages in the
conclusion of our analysis.
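The episode detection done with the loop earlier in this section can also be expressed without an explicit loop, which makes the logic easier to verify. This is an alternative sketch, not the code used for the results above: it handles a single invented user, and real data would need the same steps applied per user id.

```r
# Timestamps (one row per minute) at which one invented user was in stage 3
stage3 <- as.POSIXct("2016-04-12 01:00:00", tz = "UTC") +
  60 * c(0, 1, 2, 10, 11, 30)

# A new episode starts wherever the gap to the previous row is not one minute
gap_mins <- c(Inf, diff(as.numeric(stage3)) / 60)
episode_id <- cumsum(gap_mins != 1)

# Length of each episode in minutes, including the final one
episode_length <- as.vector(table(episode_id))

print(episode_length)  # 3 2 1
```

One difference worth noting: the loop writes an episode only when the next gap is found, so the last run in the data appears not to be recorded, whereas the run-length approach counts it.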
Moving on, we have daily data on sleep time and time spent in bed. Let’s take a look at the average time it takes for users to fall asleep, both overall and throughout the week.
sleepObserv <- sleepDay %>%
mutate(time_to_sleep = TotalTimeInBed-TotalMinutesAsleep,
week_day = wday(SleepDay_new, label = TRUE, locale = "en"))
sleepObserv %>% group_by(Id) %>%
summarise(avg_tts_id = mean(time_to_sleep)) %>%
summarise(average_time_to_sleep = mean(avg_tts_id))
## # A tibble: 1 × 1
## average_time_to_sleep
## <dbl>
## 1 42.4
On average, the data set shows that people spend 42 minutes in bed before they are considered asleep, but it is important to note that this might include time spent using their phone or engaging in other activities that do not necessarily reflect the time it takes them to fall asleep. According to studies, most adults with healthy sleep patterns typically take 10 to 20 minutes to fall asleep. Therefore, based on the available data, we can assume that people spend approximately 22 to 32 minutes engaging in other activities before falling asleep while in bed. However, without more information on how the data is determined, it is difficult to determine where the line between actively trying to fall asleep and simply spending time in bed while using their phone or engaging in other activities lies.
Now, let’s examine the average time users spend in bed before they are considered asleep throughout the week.
sleepObserv %>% group_by(Id,week_day) %>%
summarise(avg_tts_id = mean(time_to_sleep)) %>%
group_by(week_day) %>%
summarise(average_time_to_sleep = mean(avg_tts_id)) %>%
ggplot(aes(x = week_day,y = average_time_to_sleep)) +geom_col(fill = "#FFD369", width = 0.8) +
geom_text(aes(label = round(average_time_to_sleep), vjust = 2, fontface = "bold"), size = 4.5) +
xlab("Day in Week") + ylab("Average Time in Bed (min)") +
labs(title = "Time in Bed before Falling Asleep Throughout the Week") +theme_classic()
Based on the data, it seems that Sunday is the day when users take
the longest to fall asleep and spend the most time in bed, while the
time gradually decreases throughout the week. By Thursday, users seem to
fall asleep the fastest. This pattern suggests that people might have
more difficulty falling asleep at the beginning of the week, possibly
due to stress or anxiety associated with the upcoming workweek, and then
become more relaxed as the week progresses.
Our analysis has led to several suggestions that could help Bellabeat grow and be implemented in the Bellabeat app and marketing strategy.