HTML and RMD files can be found on GitHub

Bellabeat Introduction

Bellabeat is a women’s wellness brand that offers a comprehensive range of products and services designed to enhance women’s health. The company specializes in creating wearables and complementary products that track biometric and lifestyle data, enabling women to gain a deeper understanding of their bodies and make informed decisions about their health. By gathering data on activity, sleep, stress, and reproductive health, Bellabeat empowers women with the knowledge they need to improve their overall well-being and lifestyle habits.

Bellabeat products:

  • Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

  • Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

  • Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

  • Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

Analysis Goal and Overview

Gain insights for Bellabeat smart products from usage trends among consumers of non-Bellabeat smart devices, in order to reveal growth opportunities for the company’s digital marketing strategy.

In this analysis, we explore approaches to effectively engage potential consumers of smart devices and identify trends and patterns related to their behaviors. Our analysis will cover various aspects of user activity, including sleep patterns, health-related metrics such as calorie and step counts, as well as intensity levels, heart rates, and MET (metabolic equivalent of task) measurements.
Key stakeholders in this analysis include Bellabeat’s co-founder and Chief Creative Officer Urška Sršen, as well as the Bellabeat executive and marketing analytics teams.
For our analysis we will use the FitBit Fitness Tracker Data.

Prepare

Where is your data stored?
The data can be found online on the Kaggle platform; for this project it was downloaded and used locally.
How is the data organized?
The data consists of 18 datasets, in both wide and long formats.
The datasets contain information on activity, sleep, calories, intensities, steps, MET scores, heart rate and weight.
Are there issues with bias or credibility in this data?
The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between 12.02.2016-12.03.2016.
33 eligible Fitbit users consented to the submission of personal tracker data and became part of the dataset.

  • Reliable – The data seems to have been collected from a random sample of the survey respondents, but there is no explanation of this procedure and no details about the survey itself, so we cannot be sure that the dataset is unbiased. Furthermore, there is no documentation of a reliability check by FitBit and no reference to a cleaning process.
  • Original – The data comes from a third-party source, and the original files can’t be traced online. The second-party source is the Zenodo website, where the data was originally arranged into separate files for each day (12.4.16-12.5.16). The data is indexed in OpenAIRE. The datasets on Kaggle are a merged version of all these files. After checking, we can confirm that the merge was done correctly, without loss of information.
  • Comprehensive – The data is only partly comprehensive for our analysis purposes. Some datasets are very small, which makes analysis hard; the weight dataset, for example, contains just 8 users and 67 rows in total.
  • Current – The data is from 2016 and might be less useful for insights about current trends.
  • Cited – The data is cited. The original data is hosted on the Zenodo website.

Addressing licensing, privacy, security, and accessibility
The datasets do not reveal any private details about users and can be used by the public for analysis.

How does the data help you achieve your goal?
Our goal is to identify meaningful patterns and insights from the data to inform our marketing strategy.
We plan to explore relationships between different measures and visualize the data at various time intervals, such as minute, hour, and week. Additionally, we will analyze the distribution of different values to gain further insights. All of these approaches will help us achieve our marketing objectives.
Are there any problems with the data?
The Kaggle description does mention that variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences. To address this, we will clean and normalize the data. We will also use statistical analysis to identify and adjust for any biases or inconsistencies in the data.
Additionally, there is no information on how the different levels in the data were determined or defined. To address this, we will conduct exploratory data analysis (EDA) to see whether any patterns or inconsistencies indicate how the levels were determined. This will involve examining the distributions of the data, looking for outliers, and examining relationships between different variables.

[Figure: overall look at the datasets]

Process & Analyze

Tools - We are programming in R using RStudio for our analysis.
Data Cleaning & Integrity Check - We checked all data sources for errors, inconsistencies, and missing values, made the necessary corrections, and documented the process. The data is now properly formatted and suitable for use in achieving our business objectives. In this write-up we go through each data category and include only the most important cleaning steps and reliability checks, so that the analysis stays coherent and easy to follow, with a focus on the main points in each category.

General Preparations

Our initial steps would involve installing necessary packages using the install.packages() function and loading the required libraries.
We would also import relevant datasets using the read_csv() function and transform any date columns to the correct datetime format for consistency and accuracy.

library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
library(lubridate)
library(here)
library(snakecase)
library(ggrepel)
library(hms)
dailyActivity <- mutate(dailyActivity,ActivityDate_new = mdy(ActivityDate))
# Changing char date to Date format, we will repeat this code to all of our datasets
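
As a minimal sketch of the date conversion, assuming the raw columns hold US-style month/day/year strings: the sample values below are made up for illustration and are not read from the dataset files.

```r
library(lubridate)

# Hypothetical sample values in month/day/year order
daily_raw  <- c("4/12/2016", "5/1/2016")
minute_raw <- c("4/12/2016 12:00:00 AM", "4/12/2016 1:30:00 PM")

daily_parsed  <- mdy(daily_raw)                    # Date vector
minute_parsed <- mdy_hms(minute_raw, tz = "UTC")   # POSIXct vector, AM/PM aware

print(daily_parsed)
print(minute_parsed)
```

Keeping the parsed values in a new column (as with `ActivityDate_new` above) leaves the raw strings available for spot checks.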

We can now begin our analysis by examining each data category, which may entail analyzing individual datasets or, for some sections, combined datasets.

Activity

Datasets in use:

  • dailyActivity
  • sleepDay (in order to better understand how reliable the data is)

Data Cleaning and Reliability Analysis

Upon examining the dailyActivity table, it becomes apparent that numerous rows contain 0 values across all variables except for sedentary minutes and calories burned. In order to derive meaningful insights regarding activity patterns throughout the day, it is necessary to clean this data.

head(filter(dailyActivity, TotalSteps == 0)) # Incomplete data
## # A tibble: 6 × 16
##       Id Activ…¹ Total…² Total…³ Track…⁴ Logge…⁵ VeryA…⁶ Moder…⁷ Light…⁸ Seden…⁹
##    <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 1.50e9 5/12/2…       0       0       0       0       0       0       0       0
## 2 1.84e9 4/24/2…       0       0       0       0       0       0       0       0
## 3 1.84e9 4/25/2…       0       0       0       0       0       0       0       0
## 4 1.84e9 4/26/2…       0       0       0       0       0       0       0       0
## 5 1.84e9 5/2/20…       0       0       0       0       0       0       0       0
## 6 1.84e9 5/7/20…       0       0       0       0       0       0       0       0
## # … with 6 more variables: VeryActiveMinutes <dbl>, FairlyActiveMinutes <dbl>,
## #   LightlyActiveMinutes <dbl>, SedentaryMinutes <dbl>, Calories <dbl>,
## #   ActivityDate_new <date>, and abbreviated variable names ¹​ActivityDate,
## #   ²​TotalSteps, ³​TotalDistance, ⁴​TrackerDistance, ⁵​LoggedActivitiesDistance,
## #   ⁶​VeryActiveDistance, ⁷​ModeratelyActiveDistance, ⁸​LightActiveDistance,
## #   ⁹​SedentaryActiveDistance
dailyActivity <- dailyActivity %>% filter(TotalSteps > 0)

There are four different levels of activity: Very active, Fairly active, Lightly active, and Sedentary active. To evaluate the reliability of these activity levels, we will investigate the relationship between time and distance. A linear relationship should exist, with the greatest slope observed for the very active level and the least for the sedentary level.

distance_long <- dailyActivity %>% 
  select(id = Id, date = ActivityDate_new, very_active = VeryActiveDistance, fairly_active = ModeratelyActiveDistance,
         lightly_active = LightActiveDistance, sedentary_active = SedentaryActiveDistance) %>%
  pivot_longer(cols = very_active:sedentary_active,
               names_to = "activity_type", values_to = "average_distance")

minutes_long <- dailyActivity %>% 
  select(id = Id, date = ActivityDate_new, very_active = VeryActiveMinutes, fairly_active=FairlyActiveMinutes, 
         lightly_active = LightlyActiveMinutes, sedentary_active = SedentaryMinutes) %>%
  pivot_longer(cols = very_active:sedentary_active,
               names_to = "activity_type", values_to = "average_time")

activityDistanceTime <- inner_join(distance_long,minutes_long, by = c("date","id","activity_type"))

activityDistanceTime %>% ggplot(aes(x= average_time, y = average_distance)) + 
  geom_point()  + geom_smooth(method = "lm") + facet_wrap(~activity_type) +
  labs(title = "Activity Distance by Minutes", x = "Minutes", y = "Distance")


As expected, we have observed differences in the slope values, with the very active level having the greatest slope.
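
The slope comparison can also be made numerically rather than read off the plot. The sketch below fits one linear model per activity level; the data frame `toy` and its values are invented for illustration and are not taken from dailyActivity.

```r
# Made-up data: minutes and distance for two activity levels
toy <- data.frame(
  activity_type = rep(c("very_active", "lightly_active"), each = 4),
  minutes  = c(10, 20, 30, 40, 60, 120, 180, 240),
  distance = c(1.0, 2.1, 2.9, 4.0, 1.5, 3.1, 4.4, 6.0)
)

# Fit distance ~ minutes separately for each level and keep the slope
slopes <- sapply(split(toy, toy$activity_type), function(d) {
  coef(lm(distance ~ minutes, data = d))[["minutes"]]
})

print(round(slopes, 4))  # distance gained per minute, by activity level
```

Applied to `activityDistanceTime`, the same pattern would use `average_distance ~ average_time` split by `activity_type`.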

To further evaluate the dailyActivity dataset, we will verify that the sum of all activity minutes equals the total number of minutes in a day (1440) minus the time spent in bed. To achieve this, we will need to merge the data from the sleepDay dataset to obtain the time spent in bed for each day.

dailyActivity <- dailyActivity %>% mutate(sleep_mins = 1440-VeryActiveMinutes-FairlyActiveMinutes-
                                            LightlyActiveMinutes-SedentaryMinutes)

testTotalMinutes <- inner_join(dailyActivity,sleepDay,by = c("Id","ActivityDate_new" = "SleepDay_new"), multiple = "all")
testTotalMinutes <- testTotalMinutes %>% filter(sleep_mins == TotalTimeInBed)
nrow(testTotalMinutes)
## [1] 126
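
To illustrate what this check does, here is a small sketch on two invented rows. The column names mirror the dataset, but the numbers are made up.

```r
# Two hypothetical days: the first is consistent, the second is not
toy <- data.frame(
  VeryActiveMinutes    = c(30, 10),
  FairlyActiveMinutes  = c(20, 15),
  LightlyActiveMinutes = c(200, 180),
  SedentaryMinutes     = c(730, 900),
  TotalTimeInBed       = c(460, 400)
)

# Minutes of the day not covered by any activity level
tracked <- with(toy, VeryActiveMinutes + FairlyActiveMinutes +
                     LightlyActiveMinutes + SedentaryMinutes)
toy$remaining_mins <- 1440 - tracked

# A day passes the check when the remainder equals the time in bed
toy$consistent <- toy$remaining_mins == toy$TotalTimeInBed
print(toy[, c("remaining_mins", "TotalTimeInBed", "consistent")])
```

In the second row the remainder (335) is smaller than the time in bed (400), which is the pattern one would expect if part of the time in bed was also counted as sedentary.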

Activity Analysis

We observed that only 126 out of 413 rows accurately represent the total minutes in a day.
The issue is that sedentary time includes time spent in bed. Despite this issue, after verifying that our data includes a diverse group of users, we can proceed to examine the distribution of daily activity. We will select the necessary data and transform it into a long format to create a new data table for analysis.

testActivity <- testTotalMinutes %>% 
  summarise(avg_very_active = mean(VeryActiveMinutes),
            avg_fairly_active = mean(FairlyActiveMinutes),
            avg_lightly_active = mean(LightlyActiveMinutes),
            avg_sedentary = mean(SedentaryMinutes)) %>% 
  pivot_longer(cols = avg_very_active:avg_sedentary,
               names_to = "activity_type", values_to = "average_minutes")

summed_activity <- testActivity %>%
  # share of each activity type in the total tracked minutes
  mutate(average_minutes_percent = average_minutes / sum(average_minutes)) %>%
  arrange(activity_type)

my_colors <- c("#1C0C5B", "#3D2C8D", "#916BBF", "#C996CC")
my_labels <- c("Fairly Active","Lightly Active","Sedentary","Very Active")

summed_activity %>% 
  mutate(avg_hours = format(as.POSIXct(average_minutes_percent*24*3600*(sum(average_minutes)/1440),
                                       origin = "1970-01-01", tz = "UTC"),"%H:%M")) %>% 
  ggplot() + geom_col(aes(x = "", y = average_minutes_percent, fill = my_colors), width = 1) +
  coord_polar("y", start = 0) + theme_void() +
  geom_text_repel(aes(x = 1.8, y = average_minutes_percent, color = my_colors,
                      label = paste0(scales::percent(average_minutes_percent, accuracy = .1),
                                     "\n",avg_hours))
                  ,size = 3.8, fontface = "bold",box.padding = 0.1, position = position_stack(vjust = 0.5)) +
  labs(title = "Average Duration of Waking Time Spent by Activity Type in a Day", fill = "Activity Type") +
  theme(plot.title = element_text(hjust = 0.3, face = "bold", size = 15,margin = margin(t =10,b = 5)), legend.title = element_text(face = "bold"))+ 
  guides(color = "none") + # removing just the color legend
  scale_fill_manual(values = my_colors, label = my_labels) + scale_color_manual(values = my_colors)


Based on the data analysis, it is evident that users spend a significant portion of their day in sedentary activities, averaging at 13 hours and 10 minutes per day.

Calories, Intensity and Steps Correlation

Datasets in use:

  • dailyActivity
  • minuteStepsNarrow
  • minuteIntensitiesNarrow
  • minuteCaloriesNarrow
  • heartrate_seconds
  • minuteMETsNarrow

Steps

Our next step is to calculate the mean number of steps per user using the dailyActivity dataset.

dailyActivity %>% group_by(Id) %>%
  summarise(user_avg_daily_steps = mean(TotalSteps)) %>% 
  summarise(average_daily_steps = mean (user_avg_daily_steps))
## # A tibble: 1 × 1
##   average_daily_steps
##                 <dbl>
## 1               7922.
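
Note that this figure is a mean of per-user means, so each user counts equally regardless of how many days they logged. The sketch below, on invented numbers, shows how that can differ from a pooled mean over all rows.

```r
# Hypothetical data: user "a" logged four days, user "b" only one
toy <- data.frame(
  Id         = c("a", "a", "a", "a", "b"),
  TotalSteps = c(10000, 10000, 10000, 10000, 2000)
)

# Pooled mean: every row counts equally, so frequent loggers dominate
pooled_mean <- mean(toy$TotalSteps)

# Mean of per-user means: every user counts equally
per_user      <- tapply(toy$TotalSteps, toy$Id, mean)
mean_of_means <- mean(per_user)

print(c(pooled = pooled_mean, mean_of_means = mean_of_means))
```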

The average daily steps of 7922 falls short of the recommended 10,000 steps per day. It would be interesting to see if the overall trend of daily steps is increasing or not.

dailyActivity %>% ggplot(aes(x= as.POSIXct(ActivityDate_new), y = TotalSteps)) + 
  geom_jitter() + geom_smooth(method = "lm") + 
  labs(title = "Daily Steps Among All Users", x = "Date", y = "Total Steps") +
    scale_x_datetime(date_labels = "%Y-%m-%d", timezone  = "America/Los_Angeles")


While a linear relationship may provide interesting insights, it seems that there is no such trend present in the data. Nonetheless, analyzing the distribution of steps throughout the day may still be useful in gaining further insights.

minuteStepsNarrow %>% group_by(hour = format(ActivityMinute_new, format = "%H")) %>%
  summarise(avg_step = mean(Steps)*60) %>% 
  ggplot(aes(x = hour, y = avg_step)) + 
  geom_bar(stat = "identity", fill = "#66347F") +
  labs(title = "Step Count by Hour", x = "Hour", y = "Average Steps")


We can see that the majority of steps are taken between 12:00-20:00, with a dip between 15:00-16:00.

Correlation between Steps, Calories, and Intensity

The first step in our analysis would be to examine the range of values in each dataset. Once we have assessed the range, we will then proceed to join the necessary tables.

unique(minuteIntensitiesNarrow$Intensity) # Four distinct values of intensity: 0,1,2,3
## [1] 0 1 2 3
range(minuteCaloriesNarrow$Calories) # 0-19.75
## [1]  0.00000 19.74995
range(minuteStepsNarrow$Steps) #  Steps ranges 0-220
## [1]   0 220

Our next step is to merge the minuteIntensitiesNarrow, minuteCaloriesNarrow, and minuteStepsNarrow tables.
To ensure computational efficiency, we will take a sample of 100,000 rows out of the 1,325,580 total rows. This will provide us with a 99% confidence level and a margin of error of 0.4%.

cal_int_steps <- inner_join(minuteCaloriesNarrow, minuteIntensitiesNarrow, by = c("Id","ActivityMinute_new")) %>% 
  inner_join(minuteStepsNarrow,by = c("Id","ActivityMinute_new"))
cal_int_steps <- select(cal_int_steps,c("Id","Calories","Intensity","Steps","ActivityMinute_new"))
sampled_data <- cal_int_steps %>% sample_n(100000) # Using sample of that data
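
The quoted margin of error can be reproduced with the usual formula for a proportion. This is a sketch under two assumptions: the worst-case proportion p = 0.5 and a finite population correction. The function name `margin_of_error` is ours, not part of any package.

```r
# Margin of error for a proportion estimated from a simple random sample,
# with the finite population correction (FPC)
margin_of_error <- function(n, N, conf = 0.99, p = 0.5) {
  z   <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  se  <- sqrt(p * (1 - p) / n)       # standard error of a proportion
  fpc <- sqrt((N - n) / (N - 1))     # finite population correction
  z * se * fpc
}

moe <- margin_of_error(n = 100000, N = 1325580)
print(round(100 * moe, 2))  # margin of error in percent
```

With these inputs the result is about 0.39%, consistent with the 0.4% stated above.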

Introducing the correlation analysis between calories and steps.

sampled_data %>% ggplot(aes(x = Steps, y = Calories)) + geom_jitter() +
  geom_smooth(method = "lm") +
  # annotate() draws the label once instead of once per row
  annotate("text", x = 90, y = 18, label = paste0("r = ", round(cor(sampled_data$Steps, sampled_data$Calories), 2)),
           color = "blue", size = 5) +
  labs(title = "Examining the Steps-Calories Relationship")


The linear connection between steps and calories burned makes sense, as physical activity usually leads to higher energy expenditure.

ggplot(data = sampled_data, aes(x = Intensity, y = Calories)) + geom_jitter() +
  geom_smooth(method = "lm") +
  # annotate() draws the label once instead of once per row
  annotate("text", x = 1.5, y = 18, label = paste0("r = ", round(cor(sampled_data$Intensity, sampled_data$Calories), 2)),
           color = "blue", size = 5) +
  labs(title = "Examining the Intensity-Calories Relationship")


We expect to see a clear separation of calorie burn based on intensity level and step count. However, we observe some discrepancies that require further investigation. Specifically, some data points with higher step count do not necessarily result in a proportional increase in calorie burn. We can explore why there are no clear distinctions with steps involved in the next step.

ggplot(data = sampled_data) + 
  geom_smooth(mapping = aes(x = Steps, y = Calories, color = factor(Intensity)), se = FALSE) + # se = FALSE hides the confidence bands
  labs(color = "Intensity Level", title = "Steps and Intensity: A Breakdown of Caloric Burn" ) +
  scale_color_manual(values = c("lightgreen", "blue", "orange", "red"),
                     labels = c("Low Intensity","Light Intensity","Moderate Intensity","High Intensity"))


From the data, we can observe that for a certain level of steps, light intensity activities burn more calories than moderate intensity activities.

Deeper Look on Intensity

To better understand intensity, we can examine the correlation between MET and intensity, together with heart rate data. MET stands for metabolic equivalent of task and is a unit for the energy cost of physical activities. One MET corresponds to the energy expended while sitting quietly, so rating an activity in METs indicates its intensity relative to rest.

# heart rate seconds to minutes in order to join with cal_int_steps
heartrate_minutes <- heartrate_seconds %>% 
  group_by(ActivityMinute_new = floor_date(Time_new, unit = "minute"), Id) %>%
  summarize(average_heartrate = mean(Value))

intensityMET <- inner_join(minuteIntensitiesNarrow,minuteMETsNarrow, by = c("Id","ActivityMinute_new"))
intensityMET <- intensityMET %>% select(activity_minute = ActivityMinute_new,id = Id, intensity = Intensity, mets = METs)

intens_met_heartrate <- inner_join(intensityMET,heartrate_minutes, by = c("id" = "Id","activity_minute" = "ActivityMinute_new"))
met_sample <- intens_met_heartrate %>% sample_n(100000) # keep the sample so the plot and r use the same rows
met_sample %>% ggplot(aes(x = mets, y = intensity)) + geom_jitter(aes(color = average_heartrate)) +
  geom_smooth(color = "black", method = "lm", se = FALSE) +
  # annotate() draws the label once instead of once per row
  annotate("text", x = 42, y = 4.5, label = paste0("r = ", round(cor(met_sample$mets, met_sample$intensity), 2)),
           color = "black", size = 5) +
  labs(title = "Correlating Intensity with METs Score", color = "Heart rate") +
  scale_color_gradient(low = "lightgreen", high = "red") +
  scale_y_continuous(breaks = seq(0,3,1)) +
  scale_x_continuous(breaks = seq(0,150,20)) +
  theme_bw() + ylab("Intensity") + xlab("MET")


From the graph, we can observe that there is a minimum amount of MET required to reach a certain intensity level. For example, level 1 starts at 21 MET, level 2 starts at 42 MET, and level 3 starts at 60 MET per minute. However, we can also see that MET is not the only variable that determines the intensity level, as earlier we saw that the number of calories burned also plays a role. Additionally, we can observe that higher intensity levels are associated with higher heart rates.
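
To make the MET unit more concrete, the sketch below uses a commonly cited textbook approximation, kcal per minute ≈ MET × 3.5 × body weight (kg) / 200. This is an assumption for illustration only: it is not necessarily how Fitbit derives calories, the body weight is hypothetical, and the MET values are on the conventional scale where rest equals 1, which may differ from the scale stored in minuteMETsNarrow.

```r
# Approximate energy expenditure from a MET value (textbook approximation)
kcal_per_minute <- function(met, weight_kg) {
  met * 3.5 * weight_kg / 200
}

weight_kg <- 60                         # hypothetical body weight
rest  <- kcal_per_minute(1, weight_kg)  # sitting quietly (1 MET)
brisk <- kcal_per_minute(4, weight_kg)  # an activity rated at 4 METs

print(c(rest = rest, brisk = brisk))
```

Under this approximation, calories scale with both MET and body weight, which is one reason the same intensity level can correspond to different calorie values across users.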

Sleep

Datasets in use:

  • sleepDay
  • minuteSleep
  • dailyActivity

To begin our analysis, let’s start by examining the average amount of sleep time per user.

sleepsSumTable <- sleepDay %>% group_by(Id) %>% summarise(avg_user_daily_sleep = mean(TotalMinutesAsleep))
mean(sleepsSumTable$avg_user_daily_sleep)
## [1] 377.6475
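
To read a value like this in hours and minutes, a small helper can do the conversion. The function name `to_hours_minutes` is hypothetical, and fractions of a minute are dropped rather than rounded.

```r
# Convert a duration in minutes to an "H h MM min" label
to_hours_minutes <- function(minutes) {
  whole <- floor(minutes)  # drop fractions of a minute
  sprintf("%d h %02d min", as.integer(whole %/% 60), as.integer(whole %% 60))
}

print(to_hours_minutes(377.6475))
```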

On average, users sleep for 377 minutes or 6 hours and 17 minutes, which is below the recommended time. We can further explore the data to see at what times users tend to sleep.

minuteSleep %>% ggplot(aes(x= format(date_new, "%H"))) + 
  geom_bar(stat = "count", width = 0.9, fill = "#001F3F") +
  labs(title = "Sleep Distribution by Hour", x = "", y = "Count") +
  theme_bw() + theme(axis.text.y = element_blank()) # theme_bw() goes first so it does not reset the axis text setting


The majority of users are asleep between 23:00 and 07:00, which is a plausible pattern and supports the reliability of the data. We can now analyze how sleep patterns vary across the week.

sleepInWeek <- sleepDay %>%  mutate(week_day = wday(SleepDay_new, label = TRUE, locale = "en")) 

sleepInWeek %>% group_by(week_day,Id) %>% summarise(avg_sleep = mean(TotalMinutesAsleep)) %>%
  arrange(week_day) %>% group_by(week_day) %>% summarise(avg_sleep_weekday = mean(avg_sleep)) %>% 
  ggplot(aes(x= week_day, y= avg_sleep_weekday)) +
  geom_bar(stat = "identity", fill = "#3E54AC") +
  geom_text(aes(label = round(avg_sleep_weekday)), color = "white", vjust = 1.5 , size = 6)+
  labs(title = "Minutes of Sleep Throughout the Week", x = "", y = "Sleep time (min)")


Based on the sleep data, we can see that users tend to sleep more at the beginning of the week, and there is a noticeable drop in sleep time between Wednesday and Thursday. To investigate this further, we can explore whether there is a correlation between light activity and sleep time. For this, we can use the long-format dailyActivity data, which includes information on activity type, distance, and time spent.

sleepInWeek <- sleepDay %>%  mutate(week_day = wday(SleepDay_new, label = TRUE, locale = "en")) 
activityDistanceTime <- mutate(activityDistanceTime, week_day = wday(date, label = TRUE, locale = "en"))

sleepVsLight <- inner_join(filter(activityDistanceTime,activity_type == "lightly_active"),
                           sleepInWeek, by = c("id" = "Id", "date" = "SleepDay_new","week_day"))
sleepVsLight <- select(sleepVsLight,id,average_distance,average_time,week_day, TotalMinutesAsleep)

sleepVsLight %>%  ggplot(aes(x=average_time,y=TotalMinutesAsleep)) + geom_point() +
  labs(title = "Light Activity vs. Sleep Time (in Minutes)", x = "Light Activity Time", y = "Sleep Time")
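
A scatter plot can be complemented with a formal correlation test. The sketch below uses simulated data in which the two variables are generated independently; the vectors `light_minutes` and `sleep_minutes` are invented and only stand in for `average_time` and `TotalMinutesAsleep`.

```r
set.seed(42)

# Simulated, independent values (not taken from the Fitbit data)
light_minutes <- runif(200, min = 60, max = 360)
sleep_minutes <- rnorm(200, mean = 380, sd = 60)

# Pearson correlation with a test of the null hypothesis r = 0
result <- cor.test(light_minutes, sleep_minutes, method = "pearson")

print(result$estimate)  # correlation coefficient
print(result$p.value)   # a large p-value gives no evidence of a linear relationship
```

On the real data, the equivalent call would be `cor.test(sleepVsLight$average_time, sleepVsLight$TotalMinutesAsleep)`.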


Based on the scatter plot, we did not find any clear relationship between light activity and sleep time. Moving on, let’s now examine how sleep and light activity change throughout the week.

sleepVsLight %>% group_by(week_day,id) %>% 
  summarise(avg_time = mean(average_time), sleep_time = mean(TotalMinutesAsleep)) %>% 
  group_by(week_day) %>% summarise(avg_time = mean(avg_time), sleep_time = mean(sleep_time)) %>% 
  ggplot(aes(x=week_day, group = 1)) + geom_line(aes(y=sleep_time,  color = "Sleep Time"), size = 1.5) +
  geom_line(aes(y = avg_time,  color = "Light Activity Time"), size = 1.5) +
  labs(color = "", title = "Weekly Trends in Light Activity and Sleep Durations") +
  ylab("Time (min)") + xlab("")


Looking at the graph, we can observe that on Sundays users tend to sleep the most with the least amount of light activity, while on Saturdays they have the most amount of light activity but an average amount of sleep.

After exploring sleep and activity patterns throughout the week, we will now take a deeper look into sleep stages. We will begin by examining the typical sleep pattern of users with an average number of sleep observations.

colnames(minuteSleep) = to_snake_case(colnames(minuteSleep))

averageObs <- minuteSleep %>% group_by(id, log_id) %>% reframe(sleep_obs = length(value)) %>% 
  arrange(-sleep_obs) %>% slice((n() %/% 2 - 1):(n() %/% 2 + 2)) # %/% is integer division

topMinSleepObserv <- minuteSleep[minuteSleep$id %in% averageObs$id & minuteSleep$log_id %in% averageObs$log_id,]

topMinSleepObserv %>% ggplot() + geom_point(aes(x = as_hms(date_new + hours(12)), y = value)) + # Adding 12h for continuous graph look
  facet_wrap(~log_id) + labs(title = "Snapshots of User Sleep Patterns") + theme_bw() +
  ylab("Sleep Value") + scale_y_continuous(breaks = seq(1,3,1)) +xlab("") +
  theme(axis.text.x = element_blank())


From the initial analysis, it appears that most of the sleep time is spent in stage 1, with a decreasing amount of time in deeper sleep stages. Additionally, some users do not have any stage 3 sleep, and there are variations in the amount of stage 3 sleep cycles among users. We will now examine the numerical data to gain a more comprehensive understanding of these observations.

sleep_labels <- c("Rapid Eye Movement (3)","Non Rapid Eye Movement (2)", "Light Sleep (1)")
sleep_values <- c("#0B2447", "#19376D", "#576CBC")
  
minuteSleep %>% group_by(value) %>% summarise(sleep_percent = n()/ length(minuteSleep$value)) %>% 
  ggplot() + geom_bar(aes(x = "",y = sleep_percent, fill = rev(factor(value))), stat = "identity",width = 1) + 
  coord_polar("y", start = 0) +
  theme_void() +
  labs(title = "Average Sleep Stage Distribution", fill = "Sleep Phases" ) + 
  geom_text_repel(aes(x = 1.67, y = sleep_percent, label = scales::percent(sleep_percent, accuracy = .1)) 
            ,size = 4.5, fontface = "bold", color = rev(sleep_values), position = position_stack(vjust = 0.5),box.padding = 0) +
  theme(plot.title = element_text(hjust = 4, size = 18, face = "bold"),
        legend.title = element_text(face = "bold"),
        legend.margin = margin(t = 0, r = 50, b = 10, l = 0)) + 
  scale_fill_manual(values = sleep_values, labels = sleep_labels) +
  guides(fill = guide_legend(reverse = TRUE))


According to our initial analysis, only one percent of sleep is in deep sleep, which is lower than the expected 7-8 minutes per sleep according to studies. To gain a deeper understanding, we will create a table that shows the chunks of deep sleep for each user, as well as the level of sleep, which indicates how many cycles of deep sleep occur during the sleep period.

sleep_three <- minuteSleep %>% filter(value == 3) %>% group_by(id) %>% arrange(id,date_new)

current_sleep <- 1
level <- 1
df_row <- 1
deep_sleep_df <- data.frame(id = numeric(), date = POSIXct(),level = numeric()
                    , deep_sleep_length = numeric(),stringsAsFactors = FALSE)

for(i in 2:nrow(sleep_three)){
  time_diff = as.numeric(difftime(sleep_three[i,]$date_new,sleep_three[i-1,]$date_new),units = "mins")
  if(time_diff > 0 && time_diff <=1){
    current_sleep <- current_sleep +1
  }else{
    deep_sleep_df[df_row,1] <- sleep_three[i-1,]$id
    deep_sleep_df[df_row,2] <- sleep_three[i-1,]$date_new
    deep_sleep_df[df_row,3] <- level
    deep_sleep_df[df_row,4] <- current_sleep
    
    if(sleep_three[i,]$id == sleep_three[i-1,]$id &&
       as.numeric(difftime(sleep_three[i,]$date_new,sleep_three[i-1,]$date_new),units = "mins") < 180){
      level <- level + 1
    }else {level <- 1}
    
    current_sleep <- 1
    df_row <- df_row + 1
  }
}
rm(current_sleep, df_row, time_diff)
deep_sleep_df %>% group_by(id) %>% summarise(avg_deep_per_user = mean(deep_sleep_length)) %>% 
  summarise(average_deep = mean(avg_deep_per_user))
## # A tibble: 1 × 1
##   average_deep
##          <dbl>
## 1         3.18
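
The run detection in the loop above can also be expressed without an explicit loop. The following is a minimal sketch for a single user on invented timestamps: a new episode starts whenever the gap to the previous flagged minute exceeds one minute, and every run is counted, including the last one.

```r
# Made-up timestamps of minutes recorded with value 3 for one user
stage3 <- as.POSIXct(c("2016-04-12 01:00:00", "2016-04-12 01:01:00",
                       "2016-04-12 01:02:00", "2016-04-12 03:10:00",
                       "2016-04-12 05:30:00", "2016-04-12 05:31:00"),
                     tz = "UTC")

# Gap in minutes between consecutive flagged minutes
gap_mins <- as.numeric(diff(stage3), units = "mins")

# Start a new episode whenever the gap is longer than one minute
episode_id <- cumsum(c(TRUE, gap_mins > 1))

# Length of each episode in minutes
episode_lengths <- as.vector(table(episode_id))
print(episode_lengths)
print(mean(episode_lengths))
```

For several users, the same steps would be applied within each `id` group so that a run never spans two users.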

We found that the average time spent in stage 3 (deep sleep) was only about 3 minutes and 11 seconds, which is considerably less than what research suggests. Now, let’s examine how the sleep length observations vary.

deep_sleep_df %>% ggplot(aes(x = deep_sleep_length)) + geom_bar(stat = "count")  +
  labs(title = "Deep Sleep Length in One Night: A Close Look") +
  xlab("Deep sleep length") +ylab("Count")


Many of the recorded deep-sleep episodes last just 1 minute, which appears to be unusual. To investigate this further, we will examine the sleep cycles, or levels, of the users.

deep_sleep_df %>% ggplot(aes(x = level)) + geom_bar(stat = "count") +
  labs(title = "Deep Sleep Cycles in One Night") +
  xlab("Deep sleep cycles") +ylab("Count")


As noted above, most recorded deep-sleep episodes last only one minute, which is unusual. When we examine the sleep cycles, we also see that most users have only one cycle of deep sleep, which is inconsistent with the typical four to six cycles per night for adults. Furthermore, some users seem to have more than 20 cycles, which is not plausible. Given these findings, we conclude that the data on deep sleep is unreliable for further analysis, so we will not include our findings on sleep stages in the conclusion.

Moving on, we have daily data on sleep time and time spent in bed. Let’s take a look at the average time it takes for users to fall asleep, both overall and throughout the week.

sleepObserv <- sleepDay %>%
  mutate(time_to_sleep = TotalTimeInBed-TotalMinutesAsleep,
         week_day = wday(SleepDay_new, label = TRUE, locale = "en"))
sleepObserv %>% group_by(Id) %>% 
  summarise(avg_tts_id = mean(time_to_sleep)) %>%
  summarise(average_time_to_sleep = mean(avg_tts_id))
## # A tibble: 1 × 1
##   average_time_to_sleep
##                   <dbl>
## 1                  42.4

On average, the data set shows that people spend 42 minutes in bed before they are considered asleep, but it is important to note that this might include time spent using their phone or engaging in other activities that do not necessarily reflect the time it takes them to fall asleep. According to studies, most adults with healthy sleep patterns typically take 10 to 20 minutes to fall asleep. Therefore, based on the available data, we can assume that people spend approximately 22 to 32 minutes engaging in other activities before falling asleep while in bed. However, without more information on how the data is determined, it is difficult to determine where the line between actively trying to fall asleep and simply spending time in bed while using their phone or engaging in other activities lies.

Now, let’s examine the average time users spend in bed before they are considered asleep throughout the week.

sleepObserv %>% group_by(Id,week_day) %>% 
  summarise(avg_tts_id = mean(time_to_sleep)) %>%
  group_by(week_day) %>% 
  summarise(average_time_to_sleep = mean(avg_tts_id)) %>% 
  ggplot(aes(x = week_day,y = average_time_to_sleep)) +geom_col(fill = "#FFD369", width = 0.8) +
  geom_text(aes(label = round(average_time_to_sleep), vjust = 2, fontface = "bold"), size = 4.5) +
  xlab("Day in Week") + ylab("Average Time in Bed (min)") +
  labs(title = "Time in Bed before Falling Asleep Throughout the Week") +theme_classic()


Based on the data, it seems that Sunday is the day when users take the longest to fall asleep and spend the most time in bed, while the time gradually decreases throughout the week. By Thursday, users seem to fall asleep the fastest. This pattern suggests that people might have more difficulty falling asleep at the beginning of the week, possibly due to stress or anxiety associated with the upcoming workweek, and then become more relaxed as the week progresses.

Conclusion

Our analysis has led to several suggestions that could help Bellabeat grow and that could be implemented in the Bellabeat app and marketing strategy.

Encouraging Movement and Healthy Habits

  • Encourage users to move their bodies by providing insights about their daily activity routines and by flagging unusually long sedentary time, which accounts for 76% of users’ waking time.
  • Give users suggestions on burning calories more efficiently, as light activity can burn more calories than moderate activity with the same amount of steps per minute.
  • Help users understand the correlation between steps and calories, and suggest a number of steps to be taken to reach their calorie goal within the app.
  • Promote more physical activity on Sundays by providing personalized exercise recommendations, hosting fitness challenges, or offering incentives to increase user motivation.

Monitoring Health

  • Monitor heart rate during workouts and warn users if anything seems abnormal.
  • Encourage regular weigh-ins to better understand how activity and habits during the day affect body weight and BMI over time.
  • Implement more health tips to improve overall health.

Improving Sleep

  • Visualize the time it takes for users to fall asleep and provide personalized insights to help improve their sleeping habits, taking into account how they spend their time in bed before falling asleep.
  • Improve the measurement of users’ sleep stages to provide accurate insights and tips for better and deeper sleep.

Emphasize Bellabeat’s Competitive Edge

  • Highlight Bellabeat’s ability to monitor stress, mindfulness, hydration levels, and the menstrual cycle, and underline these advantages in the marketing campaign, as many alternatives such as FitBit do not provide this information.