Advanced SQL Project: Analyzing Operational Data and Investigating Key Metric Anomalies
Data Analyst | SQL Specialist | Business Intelligence
I led a data-driven initiative analyzing operational metrics to identify growth opportunities and optimize business processes. Leveraging SQL and MySQL Workbench, I uncovered actionable insights that influenced strategic decisions across departments.
This project analyzes operational data to identify inefficiencies and areas for improvement. Each task below follows the same structured approach: a business question, the SQL used to answer it, the results, and the key insights drawn from them.
Calculate the number of jobs reviewed per hour for each day in November 2020.
```sql
SELECT
    ds AS Date,
    COUNT(job_id) AS Jobs_Reviewed,
    ROUND((SUM(time_spent) / 3600), 2) AS Total_Time_Hours,
    ROUND((COUNT(job_id) / (SUM(time_spent) / 3600)), 2) AS Jobs_Per_Hour
FROM
    job_data
WHERE
    ds BETWEEN '2020-11-01' AND '2020-11-30'
GROUP BY
    ds
ORDER BY
    ds;
```
| Date | Jobs Reviewed | Total Time (Hours) | Jobs Per Hour |
|---|---|---|---|
| 2020-11-25 | 1 | 0.01 | 80.00 |
| 2020-11-26 | 1 | 0.02 | 64.29 |
| 2020-11-27 | 1 | 0.03 | 34.62 |
| 2020-11-28 | 2 | 0.01 | 218.18 |
| 2020-11-29 | 1 | 0.01 | 180.00 |
| 2020-11-30 | 2 | 0.01 | 180.00 |
1. Total review time was only 0.01 to 0.03 hours per day, so the jobs-per-hour figures rest on very small samples and should be read with caution.
2. The highest job review rate occurred on November 28th with 218.18 jobs per hour, indicating exceptional productivity on that date.
3. November 27th had the lowest efficiency with only 34.62 jobs per hour, suggesting potential operational issues worth investigating.
Calculate the 7-day rolling average of throughput (number of events per second).
```sql
SELECT
    ds AS Dates,
    ROUND(COUNT(event) / SUM(time_spent), 2) AS `Daily Throughput`
FROM
    job_data
GROUP BY
    ds
ORDER BY
    ds;
```

```sql
SELECT
    ROUND(COUNT(event) / SUM(time_spent), 2) AS `Weekly Throughput`
FROM
    job_data;
```
| Dates | Daily Throughput |
|---|---|
| 2020-11-25 | 0.02 |
| 2020-11-26 | 0.02 |
| 2020-11-27 | 0.01 |
| 2020-11-28 | 0.06 |
| 2020-11-29 | 0.05 |
| 2020-11-30 | 0.05 |
Weekly Average Throughput: 0.03 events per second
1. Average throughput over the seven-day window was 0.03 events per second; note that the query above computes a single period average, not a true rolling average.
2. Daily throughput ranged from 0.01 to 0.06 events per second, showing significant variability.
3. A rolling average is preferable to raw daily metrics because it smooths out day-to-day fluctuations and gives a more stable view of performance trends.
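The queries above approximate throughput with one average over the whole period; a true 7-day rolling average uses a window function (available in MySQL 8.0+). A minimal sketch, run with Python's `sqlite3` (which supports the same window syntax) on invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_data (ds TEXT, event TEXT, time_spent INTEGER)")
# Illustrative sample rows, NOT the project's actual data
rows = [
    ("2020-11-25", "transfer", 45), ("2020-11-26", "transfer", 56),
    ("2020-11-27", "decision", 104), ("2020-11-28", "transfer", 16),
    ("2020-11-28", "decision", 17), ("2020-11-29", "decision", 20),
    ("2020-11-30", "decision", 21), ("2020-11-30", "transfer", 19),
]
conn.executemany("INSERT INTO job_data VALUES (?, ?, ?)", rows)

# Daily throughput first, then an average over the current day and the
# six preceding days (ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
query = """
WITH daily AS (
    SELECT ds,
           CAST(COUNT(event) AS REAL) / SUM(time_spent) AS throughput
    FROM job_data
    GROUP BY ds
)
SELECT ds,
       ROUND(AVG(throughput) OVER (
           ORDER BY ds ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ), 2) AS rolling_avg
FROM daily
ORDER BY ds
"""
results = list(conn.execute(query))
for ds, avg in results:
    print(ds, avg)
```

Because the frame is row-based, the first few days average over fewer than seven points; that is the usual convention for a warm-up period.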
Calculate the percentage share of each language in the last 30 days.
```sql
SELECT
    language AS Languages,
    ROUND(100 * COUNT(*) / total, 2) AS Percentage
FROM
    job_data
CROSS JOIN (
    SELECT COUNT(*) AS total
    FROM job_data
) AS sub
GROUP BY
    language,
    sub.total;
```
| Language | Percentage |
|---|---|
| English | 12.50% |
| Arabic | 12.50% |
| Persian | 37.50% |
| Hindi | 12.50% |
| French | 12.50% |
| Italian | 12.50% |
1. Persian was the dominant language with 37.50% of all job reviews.
2. All other languages had an equal share of 12.50% each.
3. This distribution suggests potential localization opportunities for Persian-speaking users.
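On MySQL 8.0+ the CROSS JOIN against a pre-counted total can be replaced by a window function, avoiding the subquery entirely. A sketch using Python's `sqlite3` (which accepts the same syntax), with made-up rows that mirror the reported 3-of-8 Persian share:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_data (language TEXT)")
# Invented rows matching the reported distribution (3 of 8 Persian)
langs = ["English", "Arabic", "Persian", "Persian", "Persian",
         "Hindi", "French", "Italian"]
conn.executemany("INSERT INTO job_data VALUES (?)", [(l,) for l in langs])

# SUM(COUNT(*)) OVER () sums the per-language counts across all groups,
# giving the grand total without a second scan of the table.
query = """
SELECT language,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM job_data
GROUP BY language
ORDER BY percentage DESC
"""
shares = dict(conn.execute(query))
print(shares)
```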
Identify duplicates from the job_data table.
```sql
SELECT
    actor_id,
    COUNT(*) AS Duplicates
FROM
    job_data
GROUP BY
    actor_id
HAVING
    COUNT(*) > 1;
```

```sql
SELECT
    ds,
    job_id,
    actor_id,
    event,
    language,
    time_spent,
    org,
    COUNT(*) AS duplicate_count
FROM
    job_data
GROUP BY
    ds, job_id, actor_id, event, language, time_spent, org
HAVING
    COUNT(*) > 1
ORDER BY
    duplicate_count DESC;
```
| Actor ID | Duplicates |
|---|---|
| 1003 | 2 |
1. Out of 8 total rows, 2 duplicate rows were identified.
2. The actor with ID 1003 had duplicate entries.
3. This finding indicates potential data entry issues that need resolution to maintain data integrity.
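Counting duplicates identifies them; removing all but one copy is commonly done with ROW_NUMBER(). The sketch below relies on SQLite's implicit rowid via Python's `sqlite3`; MySQL has no rowid, so there you would partition over a unique key or stage the de-duplicated rows in a temporary table. Sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_data (ds TEXT, job_id INTEGER, actor_id INTEGER, event TEXT)"
)
rows = [
    ("2020-11-28", 21, 1003, "transfer"),
    ("2020-11-28", 21, 1003, "transfer"),  # exact duplicate of the row above
    ("2020-11-29", 22, 1004, "decision"),
]
conn.executemany("INSERT INTO job_data VALUES (?, ?, ?, ?)", rows)

# ROW_NUMBER() numbers identical rows within each group; every row
# numbered above 1 is a redundant copy and can be deleted.
conn.execute("""
DELETE FROM job_data
WHERE rowid IN (
    SELECT rowid FROM (
        SELECT rowid,
               ROW_NUMBER() OVER (
                   PARTITION BY ds, job_id, actor_id, event
               ) AS rn
        FROM job_data
    )
    WHERE rn > 1
)
""")
remaining = conn.execute("SELECT COUNT(*) FROM job_data").fetchone()[0]
print(remaining)
```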
Measure the activeness of users on a weekly basis.
```sql
SELECT
    EXTRACT(WEEK FROM occurred_at) AS week_number,
    COUNT(DISTINCT user_id) AS active_users
FROM
    events
WHERE
    event_type = 'engagement'
GROUP BY
    week_number
ORDER BY
    week_number;
```
| Week Number | Active Users |
|---|---|
| 17 | 663 |
| 18 | 1068 |
| 19 | 1113 |
| 20 | 1154 |
| 21 | 1121 |
| 22 | 1186 |
| 23 | 1232 |
| 24 | 1275 |
| 25 | 1264 |
| 26 | 1302 |
| 27 | 1372 |
| 28 | 1365 |
| 29 | 1376 |
| 30 | 1467 |
| 31 | 1291 |
| 32 | 1225 |
| 33 | 1225 |
| 34 | 1204 |
| 35 | 104 |
1. Week 30 had the highest engagement with 1,467 active users.
2. Week 35 had the lowest engagement with only 104 active users, roughly a 91% drop from the period average of about 1,158, most likely because week 35 contains only partial data.
3. High engagement weeks often aligned with product updates or promotional campaigns.
Analyze the growth of users over time for the product.
```sql
WITH weekly_active_users AS (
    SELECT
        EXTRACT(YEAR FROM created_at) AS year,
        EXTRACT(WEEK FROM created_at) AS week_number,
        COUNT(DISTINCT user_id) AS num_of_users
    FROM
        users
    GROUP BY
        year, week_number
)
SELECT
    year,
    week_number,
    num_of_users,
    SUM(num_of_users) OVER (ORDER BY year, week_number) AS cumulative_users
FROM
    weekly_active_users
ORDER BY
    year, week_number;
```
| Year | Week | New Users | Cumulative Users |
|---|---|---|---|
| 2013 | 0 | 23 | 23 |
| 2013 | 1 | 30 | 53 |
| 2013 | 2 | 48 | 101 |
| 2013 | 3 | 36 | 137 |
| 2013 | 4 | 30 | 167 |
| 2013 | 5 | 48 | 215 |
| 2013 | 6 | 38 | 253 |
1. The 12th week of 2014 recorded the highest user growth with 468 new users.
2. The 35th week of 2014 had the lowest user growth.
3. Growth was 87% higher than average during peak weeks, likely driven by marketing campaigns.
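The cumulative column comes from SUM(num_of_users) OVER (ORDER BY ...). A small check of the pattern with Python's `sqlite3`, seeding the first three weekly counts from the table above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weekly (year INTEGER, week INTEGER, num_of_users INTEGER)")
# First three weekly counts from the report: 23, 30, 48 new users
conn.executemany("INSERT INTO weekly VALUES (?, ?, ?)",
                 [(2013, 0, 23), (2013, 1, 30), (2013, 2, 48)])

# The running SUM over ORDER BY accumulates each week's new users.
query = """
SELECT year, week, num_of_users,
       SUM(num_of_users) OVER (ORDER BY year, week) AS cumulative_users
FROM weekly
ORDER BY year, week
"""
result = list(conn.execute(query))
print(result)
```

The cumulative values 23, 53, 101 match the first three rows of the report's table.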
Analyze the retention of users on a weekly basis after sign-up.
```sql
WITH signup_week AS (
    SELECT DISTINCT
        user_id,
        EXTRACT(WEEK FROM occurred_at) AS signup_week
    FROM
        events
    WHERE
        event_type = 'signup_flow'
        AND event_name = 'complete_signup'
        AND EXTRACT(WEEK FROM occurred_at) = 18
),
engagement_week AS (
    SELECT DISTINCT
        user_id,
        EXTRACT(WEEK FROM occurred_at) AS engagement_week
    FROM
        events
    WHERE
        event_type = 'engagement'
)
SELECT
    COUNT(user_id) AS total_engaged_users,
    SUM(CASE WHEN retention_week = 0 THEN 1 ELSE 0 END) AS retained_users
FROM (
    SELECT
        a.user_id,
        a.signup_week,
        b.engagement_week,
        (b.engagement_week - a.signup_week) AS retention_week
    FROM
        signup_week a
    LEFT JOIN
        engagement_week b
    ON
        a.user_id = b.user_id
) sub;
```
| Total Engaged Users | Retained Users |
|---|---|
| 317 | 236 |
1. Of the 317 week-18 sign-up records joined to engagement, 236 correspond to users who engaged during their sign-up week itself.
2. Across the full dataset, week 17 recorded the highest number of new sign-ups and the longest retention window (18 weeks).
3. User 11826 remained active for 17 weeks, an instance of high product loyalty worth studying.
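Hard-coding week 18 restricts the analysis to a single cohort; the same join generalizes to a retention curve per sign-up week. A minimal sketch with invented events, run via Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, week INTEGER, event_type TEXT)")
# Invented events: three users sign up in week 18, two of them engage later
rows = [
    (1, 18, "signup"), (1, 18, "engagement"), (1, 19, "engagement"),
    (2, 18, "signup"), (2, 20, "engagement"),
    (3, 18, "signup"),  # never engaged
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# For each cohort, count users still engaging N weeks after sign-up.
query = """
WITH signups AS (
    SELECT user_id, week AS signup_week FROM events WHERE event_type = 'signup'
),
engagement AS (
    SELECT DISTINCT user_id, week FROM events WHERE event_type = 'engagement'
)
SELECT s.signup_week,
       e.week - s.signup_week AS weeks_since_signup,
       COUNT(DISTINCT s.user_id) AS retained_users
FROM signups s
JOIN engagement e ON e.user_id = s.user_id
GROUP BY s.signup_week, weeks_since_signup
ORDER BY weeks_since_signup
"""
retention = list(conn.execute(query))
print(retention)
```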
Measure the activeness of users on a weekly basis per device.
```sql
SELECT
    EXTRACT(WEEK FROM occurred_at) AS week_number,
    COUNT(DISTINCT CASE WHEN device = 'dell inspiron notebook' THEN user_id END) AS dell_inspiron,
    COUNT(DISTINCT CASE WHEN device = 'iphone 5' THEN user_id END) AS iphone_5,
    COUNT(DISTINCT CASE WHEN device = 'iphone 4s' THEN user_id END) AS iphone_4s,
    COUNT(DISTINCT CASE WHEN device = 'macbook pro' THEN user_id END) AS macbook_pro
FROM
    events
WHERE
    event_type = 'engagement'
GROUP BY
    week_number
ORDER BY
    week_number;
```
| Week | Dell Inspiron | iPhone 5 | iPhone 4s | MacBook Pro |
|---|---|---|---|---|
| 17 | 46 | 65 | 21 | 143 |
| 18 | 77 | 113 | 46 | 252 |
| 19 | 83 | 115 | 44 | 266 |
| 20 | 84 | 125 | 55 | 256 |
| 21 | 80 | 137 | 45 | 247 |
| 22 | 92 | 125 | 45 | 251 |
1. MacBook Pro was the most popular device with 322 users in week 30.
2. Lenovo ThinkPad and iPhone 5 were the next most popular devices.
3. Engagement spikes could be linked to promotions or events, providing learning for future campaigns.
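The per-device columns are a classic conditional-aggregation pivot: COUNT(DISTINCT CASE ... END) ignores the NULLs produced for rows belonging to other devices. A small demonstration with invented rows via Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (week INTEGER, device TEXT, user_id INTEGER)")
# Invented engagement rows across two weeks
rows = [(17, "iphone 5", 1), (17, "iphone 5", 1), (17, "macbook pro", 2),
        (18, "macbook pro", 2), (18, "macbook pro", 3)]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Each CASE yields user_id only for its device, NULL otherwise;
# COUNT(DISTINCT ...) skips NULLs, so each column counts one device.
query = """
SELECT week,
       COUNT(DISTINCT CASE WHEN device = 'iphone 5' THEN user_id END) AS iphone_5,
       COUNT(DISTINCT CASE WHEN device = 'macbook pro' THEN user_id END) AS macbook_pro
FROM events
GROUP BY week
ORDER BY week
"""
pivot = list(conn.execute(query))
print(pivot)
```

Note that user 1's repeated iPhone 5 events in week 17 are collapsed to one by DISTINCT.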
Analyze how users are engaging with the email service.
```sql
SELECT
    ROUND(100 * SUM(CASE WHEN email_action = 'email_open' THEN 1 ELSE 0 END) /
          SUM(CASE WHEN email_action = 'email_sent' THEN 1 ELSE 0 END), 2) AS email_open_rate,
    ROUND(100 * SUM(CASE WHEN email_action = 'email_clicked' THEN 1 ELSE 0 END) /
          SUM(CASE WHEN email_action = 'email_sent' THEN 1 ELSE 0 END), 2) AS email_click_rate
FROM (
    SELECT *,
        CASE
            WHEN action IN ('sent_weekly_digest', 'sent_engagement_email') THEN 'email_sent'
            WHEN action = 'email_open' THEN 'email_open'
            WHEN action = 'email_clickthrough' THEN 'email_clicked'
        END AS email_action
    FROM
        email_events
) sub;
```
| Email Open Rate | Email Click Rate |
|---|---|
| 33.58% | 14.79% |
1. Email open rate was 33.58%, while click-through rate was 14.79%.
2. Higher open rates may be influenced by subject lines, timing, or user segmentation.
3. Recommendations: Optimize subject lines and sending times to improve open rates, and enhance content relevance to increase click-through rates.
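The funnel rates above are plain conditional aggregation over a labeled event stream. A self-contained check with invented counts (100 sends, 34 opens, 15 clickthroughs), run via Python's `sqlite3`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_events (action TEXT)")
# Invented volumes: 100 sends total, 34 opens, 15 clickthroughs
rows = ([("sent_weekly_digest",)] * 60 + [("sent_engagement_email",)] * 40 +
        [("email_open",)] * 34 + [("email_clickthrough",)] * 15)
conn.executemany("INSERT INTO email_events VALUES (?)", rows)

# SUM of a boolean expression counts matching rows (works in SQLite
# and MySQL alike, equivalent to the CASE form in the report).
query = """
SELECT
    100.0 * SUM(action = 'email_open') /
            SUM(action IN ('sent_weekly_digest', 'sent_engagement_email')) AS open_rate,
    100.0 * SUM(action = 'email_clickthrough') /
            SUM(action IN ('sent_weekly_digest', 'sent_engagement_email')) AS click_rate
FROM email_events
"""
open_rate, click_rate = conn.execute(query).fetchone()
print(open_rate, click_rate)
```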
Data consistency and quality are crucial for reliable analysis and insights.
Rolling averages provide better understanding of long-term trends than daily metrics.
Personalized engagement strategies significantly impact retention and activity.
- Complex queries, window functions, CTEs, and optimization techniques
- Identifying duplicates, handling missing values, and data validation
- Analyzing trends, seasonality, and metric spikes over time
- Understanding engagement, growth, and retention patterns