174 lines
11 KiB
SQL
-- Median salary and demand per skill for Data Engineer postings.
-- demand_skills counts every matching posting; corrected_demand_count counts
-- only postings that report salary_year_avg, because COUNT(column) skips NULLs.
SELECT
    sd.skills,
    ROUND(MEDIAN(jpf.salary_year_avg)) as median_salary,
    COUNT(jpf.*) as demand_skills,
    COUNT(jpf.salary_year_avg) as corrected_demand_count
FROM job_postings_fact as jpf
INNER JOIN skills_job_dim as sjd
    ON jpf.job_id = sjd.job_id
INNER JOIN skills_dim as sd
    ON sjd.skill_id = sd.skill_id
WHERE
    jpf.job_title_short = 'Data Engineer'
    AND jpf.job_work_from_home = False

GROUP BY sd.skills
HAVING COUNT(jpf.*) > 100
ORDER BY
    median_salary DESC
LIMIT 25
;

/*

┌────────────┬───────────────┬───────────────┬────────────────────────┐
│   skills   │ median_salary │ demand_skills │ corrected_demand_count │
│  varchar   │    double     │     int64     │         int64          │
├────────────┼───────────────┼───────────────┼────────────────────────┤
│ rust       │      210000.0 │           232 │                     23 │
│ golang     │      184000.0 │           912 │                     39 │
│ terraform  │      184000.0 │          3248 │                    193 │
│ spring     │      175500.0 │           364 │                     33 │
│ neo4j      │      170000.0 │           277 │                     11 │
│ gdpr       │      169616.0 │           582 │                     22 │
│ zoom       │      168438.0 │           127 │                     12 │
│ graphql    │      167500.0 │           445 │                     28 │
│ mongo      │      162250.0 │           265 │                     14 │
│ fastapi    │      157500.0 │           204 │                      3 │
│ bitbucket  │      155000.0 │           478 │                      9 │
│ django     │      155000.0 │           265 │                      5 │
│ crystal    │      154224.0 │           129 │                      3 │
│ c          │      151500.0 │           444 │                     23 │
│ atlassian  │      151500.0 │           249 │                      9 │
│ typescript │      151000.0 │           388 │                     39 │
│ kubernetes │      150500.0 │          4202 │                    147 │
│ node       │      150000.0 │           179 │                     22 │
│ ruby       │      150000.0 │           736 │                     48 │
│ css        │      150000.0 │           262 │                     13 │
│ airflow    │      150000.0 │          9996 │                    386 │
│ redis      │      149000.0 │           605 │                     17 │
│ vmware     │      148798.0 │           136 │                      2 │
│ ansible    │      148798.0 │           475 │                     14 │
│ jupyter    │      147500.0 │           400 │                     15 │
├────────────┴───────────────┴───────────────┴────────────────────────┤
│ 25 rows                                                   4 columns │
└─────────────────────────────────────────────────────────────────────┘

*/

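-- Illustrative sketch, not part of the original analysis: why demand_skills and
-- corrected_demand_count differ in the query above. COUNT(*) counts every row,
-- while COUNT(column) skips rows where that column is NULL. The inline rows and
-- the names t, salary, all_rows and rows_with_salary are made up for this demo.
SELECT
    COUNT(*) as all_rows,
    COUNT(salary) as rows_with_salary
FROM (VALUES (100000), (NULL), (120000)) as t(salary)
;
-- all_rows = 3, rows_with_salary = 2
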
-- Composite "optimal" score per skill for remote Data Engineer postings that
-- report a salary: LN(demand_count) * median salary / 1,000,000.
SELECT
    sd.skills,
    ROUND(MEDIAN(jpf.salary_year_avg)) as median_salary,
    COUNT(jpf.*) as demand_count,
    ROUND(LN(COUNT(jpf.*)), 2) as ln_demand,
    ROUND((LN(COUNT(jpf.*)) * MEDIAN(jpf.salary_year_avg)) / 1_000_000, 2) as optimal_score
FROM job_postings_fact as jpf
INNER JOIN skills_job_dim as sjd
    ON jpf.job_id = sjd.job_id
INNER JOIN skills_dim as sd
    ON sjd.skill_id = sd.skill_id
WHERE
    jpf.job_title_short = 'Data Engineer'
    AND jpf.job_work_from_home = True
    AND jpf.salary_year_avg IS NOT NULL
GROUP BY sd.skills
HAVING COUNT(jpf.*) > 100
ORDER BY
    optimal_score DESC
LIMIT 25
;


/*
┌────────────┬───────────────┬──────────────┬──────────────┬───────────────┐
│   skills   │ median_salary │ demand_count │  ln_demand   │ optimal_score │
│  varchar   │    double     │    int64     │    double    │    double     │
├────────────┼───────────────┼──────────────┼──────────────┼───────────────┤
│ terraform  │      184000.0 │          193 │         5.26 │          0.97 │
│ python     │      135000.0 │         1133 │         7.03 │          0.95 │
│ aws        │      137320.0 │          783 │         6.66 │          0.91 │
│ sql        │      130000.0 │         1128 │         7.03 │          0.91 │
│ airflow    │      150000.0 │          386 │         5.96 │          0.89 │
│ spark      │      140000.0 │          503 │         6.22 │          0.87 │
│ kafka      │      145000.0 │          292 │         5.68 │          0.82 │
│ snowflake  │      135500.0 │          438 │         6.08 │          0.82 │
│ azure      │      128000.0 │          475 │         6.16 │          0.79 │
│ java       │      135000.0 │          303 │         5.71 │          0.77 │
│ scala      │      137290.0 │          247 │         5.51 │          0.76 │
│ git        │      140000.0 │          208 │         5.34 │          0.75 │
│ kubernetes │      150500.0 │          147 │         4.99 │          0.75 │
│ databricks │      132750.0 │          266 │         5.58 │          0.74 │
│ redshift   │      130000.0 │          274 │         5.61 │          0.73 │
│ gcp        │      136000.0 │          196 │         5.28 │          0.72 │
│ nosql      │      134415.0 │          193 │         5.26 │          0.71 │
│ hadoop     │      135000.0 │          198 │         5.29 │          0.71 │
│ pyspark    │      140000.0 │          152 │         5.02 │           0.7 │
│ mongodb    │      135750.0 │          136 │         4.91 │          0.67 │
│ docker     │      135000.0 │          144 │         4.97 │          0.67 │
│ r          │      134775.0 │          133 │         4.89 │          0.66 │
│ go         │      140000.0 │          113 │         4.73 │          0.66 │
│ github     │      135000.0 │          127 │         4.84 │          0.65 │
│ bigquery   │      135000.0 │          123 │         4.81 │          0.65 │
├────────────┴───────────────┴──────────────┴──────────────┴───────────────┤
│ 25 rows                                                        5 columns │
└──────────────────────────────────────────────────────────────────────────┘

Summary
This analysis ranks skills for remote Data Engineer roles by a single composite score that
combines salary and demand: the natural log of the posting count multiplied by the median
salary, divided by 1,000,000. The log dampens raw volume, so the score rewards skills that are
both well paid and widely requested, rather than high pay in niche roles or high volume in
lower-paying ones. The table lists the top 25 skills by score; each appears in more than 100
remote job postings that report a salary.
The results fall into three rough tiers. Terraform, Python, AWS, and SQL form the top cluster
with optimal scores of 0.91 to 0.97. Python, AWS, and SQL get there on demand (783 to 1,133
postings at $130K to $137K medians), while Terraform gets there on salary ($184K median from
193 postings). Below them sits a mid-tier of Airflow, Spark, Kafka, and Snowflake (0.82 to
0.89), where lower demand is offset by medians of $135.5K to $150K, most notably for
orchestration and streaming (Airflow at $150K, Kafka at $145K). The bottom cluster (Docker,
Go, GitHub, BigQuery) still posts medians of $135K to $140K but trails on demand (113 to 144
postings), which makes these better secondary skills than primary targets for career
positioning.

Key Findings

Terraform is the highest-value single skill, with the top optimal score (0.97) and the
highest median salary in the table at $184,000, about $45K above the average of the 25
listed medians (roughly $139K). Despite relatively modest demand (193 postings), its
salary premium is large enough that it outscores even Python and SQL. One plausible
reading is that infrastructure-as-code expertise is scarce relative to demand and suits
remote work, since provisioning is done through CLIs and APIs.

Python and SQL are the volume anchors of the market, each appearing in over 1,100
remote postings, over 40% more than the next most requested skill (AWS, 783), and
scoring 0.95 and 0.91 respectively. Their median salaries ($135K and $130K) are solid
but not exceptional; their high scores come from ubiquity. For anyone entering the
field, these two skills are the lowest-risk investment, since they appear in more
postings than any other skill.

AWS leads the cloud platforms, outscoring Azure and GCP by a notable margin (0.91 vs.
0.79 and 0.72). AWS ranks 3rd in the table, Azure 9th, and GCP 16th. AWS combines the
strongest demand of the three (783 postings, vs. 475 and 196) with the highest cloud
median salary ($137,320, vs. $128,000 and $136,000). This is consistent with AWS's
strong position in enterprise data infrastructure. Azure and GCP remain important but
score better as complementary skills.

Streaming and orchestration tools (Kafka, Airflow, Spark) offer a high salary-to-demand
ratio, with median salaries of $140K to $150K and moderate but healthy demand (292 to
503 postings each). These skills can differentiate a mid-career engineer, signalling
the ability to manage real-time pipelines and complex DAG-based workflows
autonomously, a profile that fits remote-first teams.

Infrastructure and containerisation skills (Kubernetes, Git, Docker) vary in how much
they pay relative to demand. Kubernetes stands out: it has the second-highest median
salary in the table at $150,500, despite appearing in only 147 postings. Git ($140,000,
208 postings) shows a milder version of the same pattern, while Docker ($135,000, 144
postings) sits slightly below the average of the listed medians. This niche-but-lucrative
pattern suggests that DevOps-adjacent data engineers who can manage containerised
workloads command a meaningful premium, even if the absolute number of such roles is
smaller. These are strong specialisation targets for engineers already solid in
Python/SQL/cloud.

*/
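
-- Worked check of the optimal_score formula, as a minimal sketch: it recomputes
-- ROUND(LN(demand_count) * median_salary / 1_000_000, 2) for three rows copied
-- from the result table above. The inline table alias t and the column name
-- recomputed_score are illustrative only. The original query uses the unrounded
-- median, so other rows could differ in the last digit.
SELECT
    skills,
    ROUND(LN(demand_count) * median_salary / 1_000_000, 2) as recomputed_score
FROM (VALUES
    ('terraform', 184000, 193),
    ('python', 135000, 1133),
    ('kubernetes', 150500, 147)
) as t(skills, median_salary, demand_count)
ORDER BY recomputed_score DESC
;
-- terraform 0.97, python 0.95, kubernetes 0.75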