sql_dataengineering/2_EDA/01_top_demanded_skills.sql


2026-03-19 09:55:26 +00:00
/*
Data table prep work
*/
SELECT
*
FROM job_postings_fact as jpf
LIMIT 10;
SELECT
*
FROM skills_dim as sd
LIMIT 10;
SELECT
*
FROM skills_job_dim as sjd
LIMIT 10;
SELECT
*
FROM information_schema.columns
WHERE table_catalog = 'data_jobs'
;
SELECT
    *
FROM information_schema.columns
WHERE
    table_catalog = 'data_jobs'
    AND column_name LIKE '%id%'
    AND table_name IN ('skills_dim', 'job_postings_fact', 'skills_job_dim')
;
/*
Question: What are the most in-demand skills for data engineers?
- Inner join job postings to the skills tables, similar to query 2
- Identify the top 10 in-demand skills for data engineers
- Focus on remote job postings
- Why? Retrieves the top 10 skills with the highest demand in the remote job market,
providing insights into the most valuable skills for data engineers seeking remote work
*/
SELECT
*
FROM job_postings_fact
LIMIT 10;
SELECT
*
FROM skills_job_dim
LIMIT 10;
SELECT *
FROM skills_dim
LIMIT 10;
SELECT DISTINCT
    job_work_from_home
FROM job_postings_fact
WHERE
    job_title_short LIKE '%Data%'
LIMIT 10
;
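/*
Sketch: the DISTINCT check above lists the values the flag takes, but not how
often each occurs. Assuming job_work_from_home is a boolean column (as the
filter in the final query suggests), a grouped count shows how many
Data Engineer postings are remote. The alias posting_count is illustrative
and not part of the original analysis.
*/
SELECT
    job_work_from_home,
    COUNT(*) AS posting_count
FROM job_postings_fact
WHERE
    job_title_short = 'Data Engineer'
GROUP BY job_work_from_home
ORDER BY posting_count DESC
;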
SELECT
    sd.skills,
    -- COUNT(*) counts one row per (posting, skill) pair after the inner joins
    COUNT(*) AS demand_skills
FROM job_postings_fact AS jpf
INNER JOIN skills_job_dim AS sjd
    ON jpf.job_id = sjd.job_id
INNER JOIN skills_dim AS sd
    ON sjd.skill_id = sd.skill_id
WHERE
    -- exact match: LIKE without a wildcard behaves like =
    jpf.job_title_short = 'Data Engineer'
    AND jpf.job_work_from_home = TRUE
GROUP BY sd.skills
ORDER BY demand_skills DESC
LIMIT 10
;
/*
Data Engineering Skills Market Summary
Work-From-Home Demand Analysis

Summary
The 10 most requested skills in remote data engineer postings account for 143,293 skill mentions and show a clear hierarchy: foundational languages (SQL, Python) lead demand, followed by cloud platforms and big data tooling. The query covers remote postings only, so it shows what remote roles ask for rather than how they differ from on-site roles. The prominence of cloud-native skills is nonetheless consistent with remote work, since these tools remove any dependency on physical infrastructure.

Key Findings
- SQL (29,221) and Python (28,776) are the top two skills and together make up about 40% of top-10 mentions (57,997 of 143,293); both are essential for any data engineering role.
- Cloud platforms (AWS, Azure, GCP) collectively account for about 27% of top-10 mentions (38,412). Their tooling is browser/API-accessible and needs no on-site infrastructure, which suits work-from-home roles.
- Big data and orchestration tools (Spark, Airflow, Snowflake, Databricks) dominate the mid-tier, suggesting that remote roles expect autonomous pipeline management.
- Java remains relevant at #9 (7,267 mentions), likely for JVM-based systems such as Kafka and legacy Spark environments.
skills      | demand_skills
(varchar)   | (int64)
------------+--------------
sql         |         29221
python      |         28776
aws         |         17823
azure       |         14143
spark       |         12799
airflow     |          9996
snowflake   |          8639
databricks  |          8183
java        |          7267
gcp         |          6446
(10 rows, 2 columns)
*/
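/*
A possible follow-up, sketched under the assumption that the engine supports
CTEs and window functions (DuckDB, PostgreSQL, and MySQL 8+ do): express each
top-10 skill as a share of all top-10 mentions, so the percentages quoted in
the summary can be reproduced by a query instead of by hand. The names
top_skills and pct_of_top_10 are illustrative and not part of the original
analysis.
*/
WITH top_skills AS (
    -- same top-10 query as above, wrapped so its result can be reused
    SELECT
        sd.skills,
        COUNT(*) AS demand_skills
    FROM job_postings_fact AS jpf
    INNER JOIN skills_job_dim AS sjd
        ON jpf.job_id = sjd.job_id
    INNER JOIN skills_dim AS sd
        ON sjd.skill_id = sd.skill_id
    WHERE
        jpf.job_title_short = 'Data Engineer'
        AND jpf.job_work_from_home = TRUE
    GROUP BY sd.skills
    ORDER BY demand_skills DESC
    LIMIT 10
)
SELECT
    skills,
    demand_skills,
    -- SUM(...) OVER () totals the 10 rows without collapsing them
    ROUND(100.0 * demand_skills / SUM(demand_skills) OVER (), 1) AS pct_of_top_10
FROM top_skills
ORDER BY demand_skills DESC
;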