- Municipal Archives Bodies in Transit Records
- Emigrant Savings Bank Test Account Books
- Fales Library Digital Downtown
- NYC NYPL Business Directories
For more see
newyorkscapes.org/project
Wilson Business Directory, 1852-53
New York City Directories: Potentials and Pitfalls
- Provide attestations of addresses, including those not on maps
- Occupations and names for the peopled landscape of the city
- Additional attributes such as widowhood of females, marking of "colored"-ness
- Chronological breadth: ~1786-1801, 1849-1922
NYU + NYPL Space/Time: Goals and Objectives
- Build full digital-image-to-Space/Time-data workflow
- Cleaning, ordering, and ingesting "At a Distance"
- Replicable workflows for other cities' directories
- Maximize accuracy with minimal human correction
Tesseract and OCR
Doggett & Rode 1851/52 Directory
- Tesseract3 (with training) ➔ Tesseract4 (without training): 10-15% improvement
- OCR error rate on a 5% sample with Tesseract4: 11% (std. dev. 0.07 on post-1849 directories)
A Typology of OCR Errors and Uncertainties

| Type | Percent of Sample | Remedy |
| --- | --- | --- |
| Blacklist character(s) | 0.94% | Match string at a distance and swap offending character |
| No whitelist characters | 4.08% | Discard(?) |
| Words not in name/occ/street dictionaries | 2.2% | Crowdsource corrections (if any) |
| Remainders | 7.8% | Sort errors/non-errors and match leftovers at a distance |
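The "match at a distance" remedies above can be sketched with a plain Levenshtein edit distance. This is a minimal illustration, not the project's actual pipeline; the street list and distance threshold are made up:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_at_a_distance(token, dictionary, max_distance=2):
    """Return the closest dictionary word within max_distance, or None."""
    best = min(dictionary, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else None

streets = ['Third', 'Suffolk', 'Ridge', 'Centre']
print(match_at_a_distance('Th1rd', streets))  # prints Third
```

A real pipeline would use an indexed fuzzy matcher rather than a linear scan, but the principle is the same: an OCR token that is close enough to a known string can be swapped for it.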
Goal
From OCR'd text:
Hoffmann Max, physician, 70 Suffolk
Hoffmann Paul, clerk, h 61 Ridge
Hoffmaster Charles A. carver, 220 Centre, h Westchester
Hogan Catharine, wid. John, teacher, 28 James, h 55 James
To a searchable map:
<div class='ocr_page'>
<p class='ocr_par'>
<span class='ocr_line' title="bbox 18 679 641 712;">
Hoffmann Max, physician, 70 Suffolk
</span>
<span class='ocr_line' title="bbox 19 712 569 745;">
Hoffmann Paul, clerk, h 61 Ridge
</span>
<span class='ocr_line' title="bbox 21 1200 828 1234;">
Hoffmaster Charles A. carver, 220 Centre, h West-
</span>
<span class='ocr_line' title="bbox 84 1234 562 1259;">
chester
</span>
<span class='ocr_line' title="bbox 862 619 1675 655;">
Hogan Catharine, wid. John, teacher, 28 James, h
</span>
<span class='ocr_line' title="bbox 927 653 1181 679;">
55 James
</span>
</p>
</div>
Tesseract outputs hOCR: an HTML-based markup format for OCR output
Visualization of bounding boxes of lines from hOCR
- Remove headers, page numbers, etc.;
- Connect indented lines
Turn hOCR lines into city directory entries
Histogram of X position of hOCR lines
const stats = require('simple-statistics')
const xs = page.lines.map((line) => line.properties.bbox[0])
// One indent per column, and four clusters for
// other stuff like page numbers and headings
const columnCount = 2
const clusterCount = columnCount * 2 + 4
// Find clusters of x coordinates, sort by size
const clusters = stats
.ckmeans(xs, clusterCount)
.sort((a, b) => b.length - a.length)
// Compute mode of clusters, and use only
// the clusters we need
const columns = clusters
.slice(0, columnCount)
.map(stats.mode)
console.log(columns)
// [ 34, 840 ]
Finding columns with Simple Statistics
For each line that falls outside the two columns:
- Find the closest line in the upper-left direction ↖
- Connect them if that line is inside a column ⟷
Connecting indented lines
Finding the closest line with a 2D spatial index
const rbush = require('rbush')
// Create an RBush R-tree index of all line bounding boxes
// (the optional constructor argument tunes max entries per node, not item count)
const tree = rbush()
tree.load(page.lines.map((line, index) => {
const x = line.properties.bbox[0], y = line.properties.bbox[1]
return {minX: x, minY: y, maxX: x, maxY: y, index}
}))
Create a 2D spatial index of all lines with RBush
const knn = require('rbush-knn')
page.lines
// Only process lines that are not in one of the columns
.filter((line) => line.columnIndex === undefined)
.forEach((line, lineIndex) => {
const lineX = line.properties.bbox[0]
const lineY = line.properties.bbox[1]
const filter = (item) => {
const knnLine = page.lines[item.index]
const knnLineX = knnLine.properties.bbox[0]
const knnLineY = knnLine.properties.bbox[1]
// Only accept lines that belong to a column and lie to the upper left
return lineIndex !== item.index &&
knnLine.columnIndex !== undefined &&
knnLineX <= lineX && knnLineY <= lineY
}
// Find the single closest line to [lineX, lineY]
const neighbors = knn(tree, lineX, lineY, 1, filter)
// If neighbors.length > 0, we should connect previousLine + line
const previousLine = neighbors[0]
})
rbush-knn: k-nearest neighbors search for RBush
From:
<div class='ocr_page'>
<p class='ocr_par'>
<span class='ocr_line' title="bbox 21 1200 828 1234;">
Hoffmaster Charles A. carver, 220 Centre, h West-
</span>
<span class='ocr_line' title="bbox 84 1234 562 1259;">
chester
</span>
</p>
</div>
To:
{"lines":
[
{"text": "Hoffmaster Charles A. carver, 220 Centre, h Westchester"}
]
}
Input and output of our Node.js module for hOCR files
- How do we extract structured knowledge from a directory entry?
Problem
A sample directory sheet...
Find names of individuals and companies
Find occupations
Find predicates of space
Find addresses and locations
George Binkenburg, pianomkr. h. 924 Third av.
- name
⇒ "George Binkenburg"
- occupation
⇒ "pianomkr."
- location predicate
⇒ "h." (home)
- address
⇒ "924 Third av."
Four main types of labels
Can we do this using regular expressions?
First intuition: Regular expressions
Not really!
- OCR transcriptions are never perfect
- There are mistranscriptions in the parsed text
- This makes it very difficult to consistently tokenize the entries
First intuition: Regular expressions
- For instance, consider the (correctly) transcribed entry:
Binkenburg George, pianomkr. h. 924 Third av.
- From this, it seems reasonable to split the entry just based on "," and "." characters, and we'd be pretty close!
- Binkenburg George, pianomkr. h. 924 Third av.
First intuition: Regular expressions
- Unfortunately, we rarely see such clean parses!
- Something like this is much more likely:
- BInkenbu rg Geor ge pianomkr. h 924 Th1rd av
First intuition: Regular expressions
- And not only can we not trust the transcriptions 100%...
- We also can't predict the number of elements in an entry without inspecting it first!
- Burtnett Daniel, pres. cit. f. ins. co. 67 Wall & 64 Bowery, h. 14 Perry
- Burt Edwin C. boots & shoes, 43 Broadway, h. 327 Henry, Brooklyn
First intuition: Regular expressions
Can we do this using regular expressions, plus a ton of conditional statements?
Second intuition: A lot of conditionals!
... and we tried exactly this!
Sort of!
- The approach is:
- Since we can't assume that any delimiter is correctly parsed, we'll first just tokenize everything we can
- Bush Anna, widow of Henry J. r. 152 W. 13th
- ⇒ ["Bush", "Anna", ",", "widow", "of", "Henry", "J", ".", "r", ".", "152", "W", ".", "13th"]
Second intuition: A lot of conditionals!
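The tokenize-everything step above can be sketched with a single regular expression that separates word characters from delimiter punctuation. A minimal sketch; a real tokenizer has to cope with more punctuation than just "," and ".":

```python
import re

def tokenize(entry):
    # Keep runs of letters/digits/apostrophes as tokens,
    # and emit each "," or "." as its own delimiter token
    return re.findall(r"[A-Za-z0-9']+|[.,]", entry)

print(tokenize('Bush Anna, widow of Henry J. r. 152 W. 13th'))
# ['Bush', 'Anna', ',', 'widow', 'of', 'Henry', 'J', '.',
#  'r', '.', '152', 'W', '.', '13th']
```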
- Then, we go through the tokens one by one, and try to figure out what label they belong to
- We can use heuristic methods for classifying each token
- Does it contain a number? If so, it's probably part of an address.
- Does it contain a known street name? If so, it's again part of an address.
- Is it a known occupation? If so, it must be an occupation name.
- ...
Second intuition: A lot of conditionals!
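The heuristics above can be sketched as a rule-based token classifier. The function name, labels, and tiny dictionaries here are stand-ins for the real name/occupation/street lists, not the project's actual code:

```python
def classify_token(token, occupations, streets):
    """Heuristic guess at which label a token belongs to."""
    if token in {',', '.'}:
        return 'delimiter'
    if any(ch.isdigit() for ch in token):
        return 'address'        # contains a number → probably a house number
    if token in streets:
        return 'address'        # known street name
    if token in occupations:
        return 'occupation'     # known occupation
    return 'unknown'            # falls through: needs more heuristics

occupations = {'pianomkr', 'laborer', 'clerk', 'teacher'}
streets = {'Third', 'Ridge', 'Suffolk'}
print(classify_token('924', occupations, streets))        # address
print(classify_token('pianomkr', occupations, streets))   # occupation
print(classify_token('Binkenburg', occupations, streets)) # unknown
```

The "unknown" bucket is exactly where this approach gets tedious: every residual case needs another hand-written rule, and the rules have to be weighed against each other.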
- This turns out to be a somewhat promising approach, but it is very tedious to engineer...
- Our accuracy is contingent on our ability to correctly:
- Think of all of the relevant heuristic methods
- Interpret and weigh the results of the heuristics correctly
Second intuition: a lot of conditionals!
Can algorithms from machine learning do this better?
Third intuition: Conditional random fields (CRF)
YES! 100% YES!
- We use a supervised-learning algorithm called "conditional random fields" (CRF)
- Instead of trying to anticipate which factors are important for labeling, and then interpreting the results ourselves, we demonstrate by way of labeled examples and let CRF infer what is important
- Excellent blog post from NYTimes Developers on CRF
Third intuition: Conditional random fields (CRF)
- First we create training examples, at the same individual-token level as before:
- Each token is labeled in terms of what tag it is part of
CRF training data
| Label | Abbreviation | Example |
| --- | --- | --- |
| START | START | (beginning of entry) |
| END | END | (end of entry) |
| Delimiter | D | ",", "." |
| Name component | NC | "Alice" |
| Occupation component | OC | "laborer" |
| Position attribute | PA | "h." |
| Address component | AC | "718", "Washington" |
CRF labels
- This sequence of labels accompanying the tokens is used for training, and also is the form that CRF predictions will take
- Given a sequence of labels, we can then easily parse the record into discrete elements
- START NC NC D OC OC D PA AC AC PA AC AC END
- ⇒ name, occupation, address 1, address 2
CRF label decoding
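The decoding step can be sketched by walking the predicted label sequence and grouping consecutive tokens into fields. A minimal illustration, not the project's actual decoder; note the rule that a PA label opens a new address group:

```python
FIELD = {'NC': 'name', 'OC': 'occupation', 'PA': 'address', 'AC': 'address'}

def decode(tokens, labels):
    """Group labeled tokens into parsed fields, skipping START/END/delimiters.
    A PA label starts a new address group (e.g. the 'h' in 'h. 23 Hamilton')."""
    fields = []
    for token, label in zip(tokens, labels):
        if label not in FIELD:          # skip START, END, and delimiters (D)
            continue
        field = FIELD[label]
        if not fields or fields[-1][0] != field or label == 'PA':
            fields.append([field, []])  # start a new field group
        fields[-1][1].append(token)
    return [(field, ' '.join(parts)) for field, parts in fields]

tokens = ['Binkenburg', 'George', ',', 'pianomkr', '.', 'h', '.', '924', 'Third', 'av']
labels = ['NC', 'NC', 'D', 'OC', 'D', 'PA', 'D', 'AC', 'AC', 'AC']
print(decode(tokens, labels))
# [('name', 'Binkenburg George'), ('occupation', 'pianomkr'), ('address', 'h 924 Third av')]
```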
- A conditional random field uses a discriminative model to produce the sequence of labels that maximizes:
P( label sequence | word features )
- Features encode information relevant to a word's label in a way that is sensitive to the word itself, as well as that word's location in a sequence
CRF parameter learning
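For a linear-chain CRF, this conditional probability takes the standard textbook form (generic notation, not specific to our setup):

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\Bigg( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Bigg)
```

Here the $f_k$ are the feature functions, the $\lambda_k$ are the weights learned during training, and $Z(x)$ normalizes over all possible label sequences.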
- Features are usually boolean (true/false) attributes that are computed for each token
- In CRF, features describe not only the "present" token, but can also look at adjacent tokens
- We have to create features similar to the heuristic functions we used earlier
CRF feature engineering
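The per-token features can be sketched as a dictionary-building function in the style that sklearn-crfsuite expects. The feature names here are illustrative, not our exact feature set:

```python
def token_features(tokens, i):
    """Features for token i, looking at the token itself and,
    when they exist, its immediate neighbors."""
    token = tokens[i]
    features = {
        'word.lower': token.lower(),
        'word.is_start_of_sentence': i == 0,
        'word.is_end_of_sentence': i == len(tokens) - 1,
        'word.contains_a_number': any(ch.isdigit() for ch in token),
        'word.is_capitalized': token[:1].isupper(),
    }
    if i > 0:
        features['prev_word.lower'] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features['next_word.lower'] = tokens[i + 1].lower()
    return features

tokens = ['Bush', 'Chauncey', ',', 'commissioner']
print(token_features(tokens, 0)['word.is_start_of_sentence'])  # True
```

A full entry is then featurized as `[token_features(tokens, i) for i in range(len(tokens))]`, one dictionary per token, which is the input shape sklearn-crfsuite consumes.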
- During training time, the algorithm will proceed through the examples, and find the optimal feature weights that maximize the probability of getting the correct label
- (... this particular objective is convex, meaning there is a single global optimum, so we can find it reliably)
- This process is done iteratively, often using a training algorithm called stochastic gradient descent
- (... but luckily for us, we use a library that abstracts this away!)
CRF training
- Assume that the input we are trying to classify is:
- Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
CRF features
- First we'll go through the input, one token at a time, and featurize
CRF features
- Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
- current_word-is_start_of_sentence: TRUE
- current_word-is_end_of_sentence: FALSE
- current_word-contains_a_number: FALSE
- current_word-character_capitalized: TRUE
- ...
- next_word-is_start_of_sentence: FALSE
- ...
- previous_word-is_start_of_sentence: NULL
- ...
CRF feature example, iteration 1
- Bush Chauncey, commissioner, 157 B'way, h. Rathbun's hotel
- current_word-is_start_of_sentence: FALSE
- current_word-is_end_of_sentence: FALSE
- current_word-contains_a_number: FALSE
- current_word-character_capitalized: TRUE
- ...
- next_word-is_start_of_sentence: FALSE
- ...
- previous_word-is_start_of_sentence: TRUE
- ...
CRF feature example, iteration 2
- We have a lot of freedom in designing what features to extract
- Features can represent many aspects of the input
- Word shape
- Suffixes and prefixes
- Types of characters in the input
- Whether or not the input is a known entity
CRF feature engineering
- We used Python's sklearn-crfsuite package
- It conforms to the popular scikit-learn package API
- It's in Python, so it's straightforward to write custom feature extractors
CRF implementation
- What does CRF allow us to do?
- We just have to create features that are sensitive to differences of tokens
- But, we don't actually have to determine if those differences are useful or not!
- CRF infers from training examples (by perceiving in terms of feature vectors) what the most important factors in predicting labels are
CRF implementation
- How good is CRF?
- Super good! Really, really good.
- But we don't really have many stats to back that claim up
- Since the majority of our input is unlabeled, it's hard to evaluate against it directly
CRF accuracy
- You can find our city directory CRF module, as well as our training data, on GitHub
- We're currently using only about 70 hand-labeled training examples, and this seems to work very well
- Additional tweaks are mostly around edge-cases and infrequent types of statements
Implementation details
Assembling training data in a spreadsheet
{
"pageNum": 80,
"bbox": [62, 473, 756, 505],
"text": "Canvin John, pilot, 309 Water, h. 23 Hamilton",
"parsed": {
"subjects": ["Canvin John"],
"occupations": ["pilot"],
"addresses":[
["309 Water"],
["23 Hamilton","h"]
]
}
}
A city directory entry as structured data
Finding 23 Hamilton Street...
Hamilton Street no longer exists...
We need a database of historical addresses!
Let's use NYPL's digitized maps!
Our dataset of historical house numbers
A database of historical streets, traced from maps & atlases
Extract data from different sources, convert to NDJSON
Datasets from the NYC Space/Time Directory visualized with QGIS
- 861 streets
- 74,635 house numbers
New insights from combining historical datasets
- 861 streets
- 74,635 house numbers
- 48,754 addresses
const normalizer = require('@spacetime/nyc-street-normalizer')
const input = [
"13th avenue", "Vernon Blvd", "E. 35th", "Macdougal",
"Shore Rd", "B'way", "E. Twenty-Second St",
]
input.forEach((street) => console.log(street, '⟶', normalizer(street)))
// 13th avenue ⟶ Thirteenth Avenue
// Vernon Blvd ⟶ Vernon Boulevard
// E. 35th ⟶ East 35th Street
// Macdougal ⟶ MacDougal Street
// Shore Rd ⟶ Shore Road
// B'way ⟶ Broadway
// E. Twenty-Second St ⟶ East 22nd Street
Example code, showing street name normalization
Normalized address from city directory:
{
"pageNum": 80,
"bbox": [62, 473, 756, 505],
"text": "Canvin John, pilot, 309 Water, h. 23 Hamilton",
"normalizedAddress": "23 Hamilton Street"
}
Historical address, with coordinates:
{
"name": "23 Hamilton Street",
"data": {
"mapId":7138,
"borough":"Manhattan"
},
"geometry": {
"type": "Point",
"coordinates": [-73.9955835, 40.7110962]
}
}
Match normalized addresses from two datasets
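With both sides normalized, the join comes down to an exact match on the address string. A minimal sketch with the one record from the slides; the function name and record shapes are illustrative:

```python
def geocode_entries(entries, historical_addresses):
    """Look up each directory entry's normalized address in the
    historical-address dataset and attach the matching geometry."""
    by_name = {addr['name']: addr for addr in historical_addresses}
    geocoded = []
    for entry in entries:
        match = by_name.get(entry['normalizedAddress'])
        if match:
            geocoded.append({**entry, 'geometry': match['geometry']})
    return geocoded

entries = [{
    'text': 'Canvin John, pilot, 309 Water, h. 23 Hamilton',
    'normalizedAddress': '23 Hamilton Street'
}]
addresses = [{
    'name': '23 Hamilton Street',
    'geometry': {'type': 'Point', 'coordinates': [-73.9955835, 40.7110962]}
}]
print(len(geocode_entries(entries, addresses)))  # 1
```

Entries whose normalized address has no counterpart in the historical dataset simply stay ungeocoded, which is why only 967,482 of the 3,133,624 entries end up on the map.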
- 861 streets
- 74,635 house numbers
- 48,754 addresses
- 3,133,624 city directory entries (28 volumes)
- 967,482 geocoded entries
967,482 geocoded city directory entries
Occupations, counted
Names, counted
What's next?
Making all this data searchable and visible!
A new UI to search & visualize all NYC Space/Time Directory data
Who wants a beer?