# Baby Names - SSA database

### Social Security Administration's baby names database

In this cookbook recipe, you will:

1. Download the full `babynames_yob` data from the Social Security Administration website.
2. ingest all the CSV files at once using DuckDB.
3. Create and populate rank and popularity for each year/sex combination.
4. See how to export the data to popular file formats CSV and parquet
5. See a few analyses queries
6. See the data in action in a [R Shiny app](https://shiny.briandill.net/Babynames/) and a [basic html page using an API](https://www.briandill.net/babynames/?start_year=1900&q=M:John&q=M:Michael&q=M:Robert&q=M:Jacob&q=M:James&q=M:Liam&q=M:Noah&q=M:David)

#### 0. Prerequisites - Install DuckDB

If you don't already have DuckDB installed, see the [How to install DuckDB](https://wiki.briandill.net/books/duckdb/page/how-to-install-duckdb) page.

#### 1. Download the Data Set

<p class="callout info">Download the names.zip file from [catalog.data.gov website](https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data)</p>

I saved mine into `C:\DuckDB\babynames\`. Just change line 12 below if you save it elsewhere.

#### 2. Ingest all the files into a single table at once using DuckDB

Since none of the text files have headers, we are explicitly creating the headers. The reserved word `filename` extracts the file name as a column with that same name. In our case we only want the 4-digit year part of the name, so we use `regexp_extract` to get just that part. I am using abbreviated names `cnt` for count and later `rnk` for rank to avoid the use of keywords in the column names.

```sql
-- DROP TABLE IF EXISTS babynames;

-- Ingest SSA baby names, compute rank & popularity index
CREATE TABLE babynames AS
SELECT
    name
  , sex
  , cnt
  , regexp_extract(filename, 'yob([0-9]{4})', 1)::INT AS year

FROM read_csv(
    'C:/DuckDB/babynames/yob*.txt'
  , columns = { 'name': 'VARCHAR'
              , 'sex': 'VARCHAR'
              , 'cnt': 'INT' 
              }
  , header = FALSE
  , delim = ','
  , auto_detect = FALSE
  , filename = TRUE
)
;

```

#### 3. Create and populate rank and popularity for each year/sex combination.

```sql
-- ================================================================================
-- Add rnk and popularity_index
ALTER TABLE main.babynames ADD COLUMN rnk INT NULL;
ALTER TABLE main.babynames ADD COLUMN popularity_index DOUBLE NULL;


-- ================================================================================
-- populate rank
UPDATE babynames AS b
SET rnk = x.rnk
FROM (
    SELECT name, sex, year,
        ROW_NUMBER() OVER (PARTITION BY year, sex ORDER BY cnt DESC) AS rnk
    FROM babynames
) AS x
WHERE b.name = x.name
  AND b.sex = x.sex
  AND b.year = x.year;

-- ================================================================================
-- populate popularity_index
UPDATE babynames AS b
SET popularity_index = sub.popularity_index
FROM (
    SELECT name, sex, year,
        100.0 * cnt / MAX(cnt) OVER (PARTITION BY sex, year) AS popularity_index
    FROM babynames
) AS sub
WHERE b.name = sub.name
  AND b.sex  = sub.sex
  AND b.year = sub.year;
```

#### 4. Export the data to CSV and parquet

**CSV** files give you maximum portability and can work with practically everything - although at a cost of file size

**Parquet** files give you great compression and columnar internal storage which is optimized for analytical applications

```sql
-- Export to csv
COPY babynames TO 'C:/temp/babynames.csv' (FORMAT CSV, HEADER TRUE, DELIMITER ',', QUOTE '"');

-- Export to parquet format
COPY babynames TO 'C:/temp/babynames.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
```

#### 5. Look at Some Analysis Queries

##### Query 1: Top ranked names for both male and female over the years

```sql
-- ================================================================================
-- #1 M & F names over the years
SELECT * FROM babynames WHERE sex = 'F' AND rnk = 1 ORDER BY YEAR DESC;
```

```sql
SELECT * FROM babynames WHERE sex = 'M' AND rnk = 1 ORDER BY YEAR DESC;
```

##### Query 2: Number of unique names per year/sex

One row per year with an 'M' and 'F' column showing how many unique names there were for that year and sex

```sql
-- ================================================================================
-- Unique name counts by year/sex
SELECT *
FROM (SELECT YEAR, sex, COUNT(*) AS N FROM babynames GROUP BY ALL) 
PIVOT (
    SUM(N) FOR sex IN ('M', 'F')
)
ORDER BY year;
```

##### Query 3: Top 5 M &amp; F names - pivoted to 1 row / year

```sql
-- top 5 names of each sex each year
SELECT
      year

    -- Female Top 5 (names)
    --, MAX(name) FILTER (WHERE sex = 'F' AND rnk = 1) AS F_1    -- FILTER supported by DuckDB
    , MAX(CASE WHEN sex = 'F' AND rnk = 1 THEN name END) AS F_1  -- CASE is more portable
    , MAX(CASE WHEN sex = 'F' AND rnk = 2 THEN name END) AS F_2
    , MAX(CASE WHEN sex = 'F' AND rnk = 3 THEN name END) AS F_3
    , MAX(CASE WHEN sex = 'F' AND rnk = 4 THEN name END) AS F_4
    , MAX(CASE WHEN sex = 'F' AND rnk = 5 THEN name END) AS F_5

    -- Male Top 5 (names)
    , MAX(CASE WHEN sex = 'M' AND rnk = 1 THEN name END) AS M_1
    , MAX(CASE WHEN sex = 'M' AND rnk = 2 THEN name END) AS M_2
    , MAX(CASE WHEN sex = 'M' AND rnk = 3 THEN name END) AS M_3
    , MAX(CASE WHEN sex = 'M' AND rnk = 4 THEN name END) AS M_4
    , MAX(CASE WHEN sex = 'M' AND rnk = 5 THEN name END) AS M_5
    
    -- Female Top 5 (counts)
    , SUM(CASE WHEN sex = 'F' AND rnk = 1 THEN cnt ELSE 0 END) AS F1n
    , SUM(CASE WHEN sex = 'F' AND rnk = 2 THEN cnt ELSE 0 END) AS F2n
    , SUM(CASE WHEN sex = 'F' AND rnk = 3 THEN cnt ELSE 0 END) AS F3n
    , SUM(CASE WHEN sex = 'F' AND rnk = 4 THEN cnt ELSE 0 END) AS F4n
    , SUM(CASE WHEN sex = 'F' AND rnk = 5 THEN cnt ELSE 0 END) AS F5n

    -- Male Top 5 (counts)
    , SUM(CASE WHEN sex = 'M' AND rnk = 1 THEN cnt ELSE 0 END) AS M1n
    , SUM(CASE WHEN sex = 'M' AND rnk = 2 THEN cnt ELSE 0 END) AS M2n
    , SUM(CASE WHEN sex = 'M' AND rnk = 3 THEN cnt ELSE 0 END) AS M3n
    , SUM(CASE WHEN sex = 'M' AND rnk = 4 THEN cnt ELSE 0 END) AS M4n
    , SUM(CASE WHEN sex = 'M' AND rnk = 5 THEN cnt ELSE 0 END) AS M5n

FROM babynames
WHERE rnk <= 5
  AND sex IN ('F', 'M')
GROUP BY year
ORDER BY year DESC;

```

##### Query 4: Top 10 # 1 Streaks for both Male and Female

This multi-level CTE query tackles the classic "gaps and islands" problem. When a name is #1 it is often #1 for multiple years. This identifies how many years a name is at #1 before dropping rank.

```sql
-- ================================================================================
-- Top 10 #1 streaks for M and F
WITH ranked AS (
    SELECT name,sex,year,rnk
    FROM babynames
    WHERE rnk = 1
),

islands AS (
    SELECT name,sex,year,
        /* grouping key for consecutive years */
        year - ROW_NUMBER() OVER (
            PARTITION BY name, sex
            ORDER BY year
        ) AS grp
    FROM ranked
),

streaks AS (
    SELECT name, sex,
        MIN(year) AS start_year,
        MAX(year) AS end_year,
        COUNT(*) AS streak_length
    FROM islands
    GROUP BY name, sex, grp
),

top20_f AS (
    SELECT * FROM streaks WHERE sex = 'F' ORDER BY streak_length DESC LIMIT 20
),
top20_m AS (
    SELECT * FROM streaks WHERE sex = 'M' ORDER BY streak_length DESC LIMIT 20
)

SELECT * FROM top20_f
UNION ALL
SELECT * FROM top20_m
ORDER BY sex, streak_length DESC;
```

##### Query 5: Androgynous Names

Names that are (nearly) equal for both sexes.

```postgresql
-- Androgynous names
-- Base aggregation (name-level totals)
WITH totals AS (
    SELECT
          name
        , SUM(cnt)                                     AS total_cnt
        , SUM(CASE WHEN sex = 'M' THEN cnt ELSE 0 END) AS male_cnt
        , SUM(CASE WHEN sex = 'F' THEN cnt ELSE 0 END) AS female_cnt
    FROM babynames
    GROUP BY name
)
SELECT
      name
    , total_cnt
    , male_cnt
    , female_cnt
    , ROUND(male_cnt * 1.0 / total_cnt, 4)       AS male_share      -- notice reuse of male_share
    , ROUND(ABS(male_share - 0.5), 4)            AS androgyny_score -- Lower androgyny_score = more androgynous
    , ROUND( 1 - (2 * ABS(male_share - 0.5)), 4) AS androgyny_index -- 1 = 100% androgynous
    , CONCAT('https://shiny.briandill.net/Babynames?names=', name, ',', name, ',%20&sexes=F,M,M') AS shiny_url
FROM totals
WHERE male_cnt > 0
  AND female_cnt > 0
  AND total_cnt >= 15000      -- volume threshold
ORDER BY androgyny_index DESC, total_cnt DESC
LIMIT 50;
```

##### Query 6: Top Birth Count

List top birth count by name/sex. Also return the year, rank, popularity index, etc. Using three different methods

1. CTE
2. QUALIFY
3. MAX\_BY

```postgresql
-- CTE ============================================================
WITH foo AS (
    SELECT *, RANK() OVER (PARTITION BY name, sex ORDER BY cnt DESC, year) AS rn
    FROM main.babynames
)
SELECT name
     , sex
     , YEAR
     , rnk
     , popularity_index
	 , YEAR(CURRENT_DATE()) - YEAR AS years_ago
     , cnt AS largest_count
FROM foo
WHERE rn = 1
ORDER BY largest_count DESC
LIMIT 10;

-- QUALIFY ============================================================
SELECT name
     , sex
     , YEAR
     , YEAR(CURRENT_DATE()) - YEAR AS years_ago
     , rnk
     , popularity_index
     , cnt AS largest_count
FROM main.babynames
QUALIFY RANK() OVER (PARTITION BY name, sex ORDER BY cnt DESC, year) = 1
ORDER BY largest_count DESC
LIMIT 10;

-- MAX_BY() ============================================================
SELECT name
     , sex
     , MAX_BY(year, cnt) AS year_of_max
     , MAX_BY(rnk, cnt) AS rnk
     , MAX_BY(popularity_index, cnt) AS popularity_index
     , YEAR(CURRENT_DATE()) - MAX_BY(year, cnt) AS years_ago
     , MAX(cnt) AS largest_count
FROM main.babynames
GROUP BY ALL
ORDER BY largest_count DESC
LIMIT 10;
```

##### Query 7 - Popular names that never reached #1

```sql
-- popular names that never reached #1 
SELECT name
	, sex
	, SUM(cnt) AS tot_cnt
	, MIN(rnk) AS min_rnk
	, COUNT(*) FILTER (WHERE rnk = 1) AS yrs_at_1
	, COUNT(*) FILTER (WHERE rnk = 2) AS yrs_at_2
	, COUNT(*) FILTER (WHERE rnk = 3) AS yrs_at_3
	, COUNT(*) FILTER (WHERE rnk = 4) AS yrs_at_4
	, COUNT(*) FILTER (WHERE rnk = 5) AS yrs_at_5
	, COUNT(*) AS years_active
FROM main.babynames  
GROUP BY name, sex
HAVING MIN(rnk) >= 2 -- never reached #1
AND SUM(cnt) > 10000 -- have at least 10000 people with that name/sex
ORDER BY SUM(cnt) DESC
```

#### 6. See the data in action

See the baby names data set in action

- in a [R Shiny app](https://shiny.briandill.net/Babynames/)
- in [basic html page using an API](https://www.briandill.net/babynames/?start_year=1900&q=M:John&q=M:Michael&q=M:Robert&q=M:Jacob&q=M:James&q=M:Liam&q=M:Noah&q=M:David)