Skip to main content

Baby Names - SSA database

Social Security Administration's baby names database

In this cookbook recipe, you will:

  1. Download the full babynames_yob data from the Social Security Administration website. 
  2. ingest all the CSV files at once using DuckDB.
  3. Create and populate rank and popularity for each year/sex combination.
  4. See how to export the data to popular file formats CSV and parquet
  5. See a few analyses queries
  6. See the data in action in a Shiny R app

0. Prerequisites - Install DuckDB

If you don't already have DuckDB installed, see my How to install DuckDB page.

1. Download the Data Set

Download the names.zip file from catalog.data.gov website

I saved mine into D:\opendata\SSA\babynames_yob_2024.  Just change line 11 below if you save it elsewhere.

2. Ingest all the files into a single table at once using DuckDB

Since none of the text files have headers, we are explicitly creating the headers.  The reserved word filename extracts the file name as a column with that same name.  In our case we only want the 4-digit year part of the name, so we use regexp_extract to get just that part.  I am using abbreviated names cnt for count and later rnk for rank to avoid the use of keywords in the column names.

-- DROP TABLE IF EXISTS babynames;

-- Ingest SSA baby names, compute rank & popularity index
CREATE TABLE babynames AS
SELECT
    name
  , sex
  , cnt
  , regexp_extract(filename, 'yob([0-9]{4})', 1)::INT AS year
  , ROW_NUMBER() OVER (PARTITION BY year, sex ORDER BY cnt DESC) AS rnk
  , 100.0 * cnt / MAX(cnt) OVER (PARTITION BY year, sex) AS popularity_index
FROM read_csv(
    'D:/opendata/SSA/babynames_yob_2024/yob*.txt'
  , columns = { 'name': 'VARCHAR', 'sex': 'VARCHAR', 'cnt': 'INT' }
  , header = FALSE
  , delim = ','
  , auto_detect = FALSE
  , filename = TRUE
)
;

3. Create and populate rank and popularity for each year/sex combination.

-- ================================================================================
-- Add rnk and popularity_index
ALTER TABLE main.babynames ADD COLUMN rnk INT NULL;
ALTER TABLE main.babynames ADD COLUMN popularity_index DOUBLE NULL;


-- ================================================================================
-- populate rank
UPDATE babynames AS b
SET rnk = x.rnk
FROM (
    SELECT name, sex, year,
        ROW_NUMBER() OVER (PARTITION BY year, sex ORDER BY cnt DESC) AS rnk
    FROM babynames
) AS x
WHERE b.name = x.name
  AND b.sex = x.sex
  AND b.year = x.year;

-- ================================================================================
-- populate popularity_index
UPDATE babynames AS b
SET popularity_index = sub.popularity_index
FROM (
    SELECT name, sex, year,
        100.0 * cnt / MAX(cnt) OVER (PARTITION BY sex, year) AS popularity_index
    FROM babynames
) AS sub
WHERE b.name = sub.name
  AND b.sex  = sub.sex
  AND b.year = sub.year;

 

4. Export the data to CSV and parquet

CSV files give you maximum portability and can work with practically everything - although at a cost of file size

Parquet files give you great compression and columnar internal storage which is optimized for analytical applications

-- Export to csv
COPY babynames TO 'C:/temp/babynames.csv' (FORMAT CSV, HEADER TRUE, DELIMITER ',', QUOTE '"');

-- Export to parquet format
COPY babynames TO 'C:/temp/babynames.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');

5. Look at Some Analysis Queries

 

Query 1: Top ranked names for both male and female over the years
-- ================================================================================
-- #1 M & F names over the years
SELECT * FROM babynames WHERE sex = 'F' AND rnk = 1 ORDER BY YEAR DESC;
SELECT * FROM babynames WHERE sex = 'M' AND rnk = 1 ORDER BY YEAR DESC;

Query 2: Number of unique names per year/sex

One row per year with an 'M' and 'F' column showing how many unique names there were for that year and sex

-- ================================================================================
-- Unique name counts by year/sex
SELECT *
FROM (SELECT YEAR, sex, COUNT(*) AS N FROM babynames GROUP BY ALL) 
PIVOT (
    SUM(N) FOR sex IN ('M', 'F')
)
ORDER BY year;

 

Query 3: Top 5 M & F names - pivoted to 1 row / year
-- ================================================================================
-- Top 5 M and F names
WITH top AS (
    SELECT
        year,
        sex,
        rnk,
        name
    FROM babynames
    WHERE rnk <= 10
      AND year >= 2004
)
SELECT
    year,

    -- Female columns F1–F5
    MAX(CASE WHEN sex = 'F' AND rnk = 1  THEN name END) AS F_1,
    MAX(CASE WHEN sex = 'F' AND rnk = 2  THEN name END) AS F_2,
    MAX(CASE WHEN sex = 'F' AND rnk = 3  THEN name END) AS F_3,
    MAX(CASE WHEN sex = 'F' AND rnk = 4  THEN name END) AS F_4,
    MAX(CASE WHEN sex = 'F' AND rnk = 5  THEN name END) AS F_5,

    -- Male columns M1–M5
    MAX(CASE WHEN sex = 'M' AND rnk = 1  THEN name END) AS M_1,
    MAX(CASE WHEN sex = 'M' AND rnk = 2  THEN name END) AS M_2,
    MAX(CASE WHEN sex = 'M' AND rnk = 3  THEN name END) AS M_3,
    MAX(CASE WHEN sex = 'M' AND rnk = 4  THEN name END) AS M_4,
    MAX(CASE WHEN sex = 'M' AND rnk = 5  THEN name END) AS M_5,

FROM top
GROUP BY year
ORDER BY YEAR DESC;

 

Query 4: Top 10 # 1 Streaks for both Male and Female

This multi-level CTE query tackles the classic "gaps and islands" problem.  When a name is #1 it is often #1 for multiple years.  This identifies how many years a name is at #1 before dropping rank.

-- ================================================================================
-- Top 10 #1 streaks for M and F
WITH ranked AS (
    SELECT name,sex,year,rnk
    FROM babynames
    WHERE rnk = 1
),

islands AS (
    SELECT name,sex,year,
        /* grouping key for consecutive years */
        year - ROW_NUMBER() OVER (
            PARTITION BY name, sex
            ORDER BY year
        ) AS grp
    FROM ranked
),

streaks AS (
    SELECT name, sex,
        MIN(year) AS start_year,
        MAX(year) AS end_year,
        COUNT(*) AS streak_length
    FROM islands
    GROUP BY name, sex, grp
),

top10_f AS (
    SELECT * FROM streaks WHERE sex = 'F' ORDER BY streak_length DESC LIMIT 10
),
top10_m AS (
    SELECT * FROM streaks WHERE sex = 'M' ORDER BY streak_length DESC LIMIT 10
)

SELECT 'F' AS sex, * FROM top10_f
UNION ALL
SELECT 'M' AS sex, * FROM top10_m
ORDER BY sex, streak_length DESC;

 

6. See the data in action

See the baby names data set in action in a Shiny R app