Baby Names - SSA database
Social Security Administration's baby names database
In this cookbook recipe, you will:
- Download the full
babynames_yobdata from the Social Security Administration website. - ingest all the CSV files at once using DuckDB.
- Create and populate rank and popularity for each year/sex combination.
- See how to export the data to popular file formats CSV and parquet
- See a few analyses queries
- See the data in action in a Shiny R app
0. Prerequisites - Install DuckDB
If you don't already have DuckDB installed, see my How to install DuckDB page.
1. Download the Data Set
Download the names.zip file from catalog.data.gov website
I saved mine into D:\opendata\SSA\babynames_yob_2024. Just change line 11 below if you save it elsewhere.
2. Ingest all the files into a single table at once using DuckDB
Since none of the text files have headers, we are explicitly creating the headers. The reserved word filename extracts the file name as a column with that same name. In our case we only want the 4-digit year part of the name, so we use regexp_extract to get just that part. I am using abbreviated names cnt for count and later rnk for rank to avoid the use of keywords in the column names.
-- DROP TABLE IF EXISTS babynames;
-- Ingest SSA baby names, compute rank & popularity index
CREATE TABLE babynames AS
SELECT
name
, sex
, cnt
, regexp_extract(filename, 'yob([0-9]{4})', 1)::INT AS year
, ROW_NUMBER() OVER (PARTITION BY year, sex ORDER BY cnt DESC) AS rnk
, 100.0 * cnt / MAX(cnt) OVER (PARTITION BY year, sex) AS popularity_index
FROM read_csv(
'D:/opendata/SSA/babynames_yob_2024/yob*.txt'
, columns = { 'name': 'VARCHAR', 'sex': 'VARCHAR', 'cnt': 'INT' }
, header = FALSE
, delim = ','
, auto_detect = FALSE
, filename = TRUE
)
;
3. Create and populate rank and popularity for each year/sex combination.
-- ================================================================================
-- Add rnk and popularity_index
ALTER TABLE main.babynames ADD COLUMN rnk INT NULL;
ALTER TABLE main.babynames ADD COLUMN popularity_index DOUBLE NULL;
-- ================================================================================
-- populate rank
UPDATE babynames AS b
SET rnk = x.rnk
FROM (
SELECT name, sex, year,
ROW_NUMBER() OVER (PARTITION BY year, sex ORDER BY cnt DESC) AS rnk
FROM babynames
) AS x
WHERE b.name = x.name
AND b.sex = x.sex
AND b.year = x.year;
-- ================================================================================
-- populate popularity_index
UPDATE babynames AS b
SET popularity_index = sub.popularity_index
FROM (
SELECT name, sex, year,
100.0 * cnt / MAX(cnt) OVER (PARTITION BY sex, year) AS popularity_index
FROM babynames
) AS sub
WHERE b.name = sub.name
AND b.sex = sub.sex
AND b.year = sub.year;
4. Export the data to CSV and parquet
CSV files give you maximum portability and can work with practically everything - although at a cost of file size
Parquet files give you great compression and columnar internal storage which is optimized for analytical applications
-- Export to csv
COPY babynames TO 'C:/temp/babynames.csv' (FORMAT CSV, HEADER TRUE, DELIMITER ',', QUOTE '"');
-- Export to parquet format
COPY babynames TO 'C:/temp/babynames.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
5. Look at Some Analysis Queries
Query 1: Top ranked names for both male and female over the years
-- ================================================================================
-- #1 M & F names over the years
SELECT * FROM babynames WHERE sex = 'F' AND rnk = 1 ORDER BY YEAR DESC;
SELECT * FROM babynames WHERE sex = 'M' AND rnk = 1 ORDER BY YEAR DESC;
Query 2: Number of unique names per year/sex
One row per year with an 'M' and 'F' column showing how many unique names there were for that year and sex
-- ================================================================================
-- Unique name counts by year/sex
SELECT *
FROM (SELECT YEAR, sex, COUNT(*) AS N FROM babynames GROUP BY ALL)
PIVOT (
SUM(N) FOR sex IN ('M', 'F')
)
ORDER BY year;
Query 3: Top 5 M & F names - pivoted to 1 row / year
-- ================================================================================
-- Top 5 M and F names
WITH top AS (
SELECT
year,
sex,
rnk,
name
FROM babynames
WHERE rnk <= 10
AND year >= 2004
)
SELECT
year,
-- Female columns F1–F5
MAX(CASE WHEN sex = 'F' AND rnk = 1 THEN name END) AS F_1,
MAX(CASE WHEN sex = 'F' AND rnk = 2 THEN name END) AS F_2,
MAX(CASE WHEN sex = 'F' AND rnk = 3 THEN name END) AS F_3,
MAX(CASE WHEN sex = 'F' AND rnk = 4 THEN name END) AS F_4,
MAX(CASE WHEN sex = 'F' AND rnk = 5 THEN name END) AS F_5,
-- Male columns M1–M5
MAX(CASE WHEN sex = 'M' AND rnk = 1 THEN name END) AS M_1,
MAX(CASE WHEN sex = 'M' AND rnk = 2 THEN name END) AS M_2,
MAX(CASE WHEN sex = 'M' AND rnk = 3 THEN name END) AS M_3,
MAX(CASE WHEN sex = 'M' AND rnk = 4 THEN name END) AS M_4,
MAX(CASE WHEN sex = 'M' AND rnk = 5 THEN name END) AS M_5,
FROM top
GROUP BY year
ORDER BY YEAR DESC;
Query 4: Top 10 # 1 Streaks for both Male and Female
This multi-level CTE query tackles the classic "gaps and islands" problem. When a name is #1 it is often #1 for multiple years. This identifies how many years a name is at #1 before dropping rank.
-- ================================================================================
-- Top 10 #1 streaks for M and F
WITH ranked AS (
SELECT name,sex,year,rnk
FROM babynames
WHERE rnk = 1
),
islands AS (
SELECT name,sex,year,
/* grouping key for consecutive years */
year - ROW_NUMBER() OVER (
PARTITION BY name, sex
ORDER BY year
) AS grp
FROM ranked
),
streaks AS (
SELECT name, sex,
MIN(year) AS start_year,
MAX(year) AS end_year,
COUNT(*) AS streak_length
FROM islands
GROUP BY name, sex, grp
),
top10_f AS (
SELECT * FROM streaks WHERE sex = 'F' ORDER BY streak_length DESC LIMIT 10
),
top10_m AS (
SELECT * FROM streaks WHERE sex = 'M' ORDER BY streak_length DESC LIMIT 10
)
SELECT 'F' AS sex, * FROM top10_f
UNION ALL
SELECT 'M' AS sex, * FROM top10_m
ORDER BY sex, streak_length DESC;
6. See the data in action
See the baby names data set in action in a Shiny R app